# Pandas Dataframes
## What are they and why do we want them?

Pandas is an open source project that gives Python some very welcome data structures that make data analysis far easier. If you've worked with R before, you know precisely what a dataframe is and have a general sense of how to get around one. If you haven't, don't worry! The goal of this session is to get you up to speed with using Pandas in Python. It takes some getting used to, but after a little time it should start to make sense.

The term dataframe is a little nebulous and off-putting if you're new to this sort of thing, but in reality a dataframe is nothing more than what you're used to seeing in Excel. (Okay, so this isn't entirely true. Pandas offers a great deal more than Excel does, but the general idea stands. When you're getting started, dataframes can be thought of as tables in Excel, more or less.)

Let's jump right in by importing Pandas and getting up and running with one of the provided toy datasets.

In [24]:
import pandas as pd
import numpy as np
import seaborn as sns # We're only bringing in seaborn for its sample data here

In [7]:
print(sns.get_dataset_names())

['anscombe', 'attention', 'brain_networks', 'car_crashes', 'diamonds', 'dots', 'exercise', 'flights', 'fmri', 'gammas', 'iris', 'mpg', 'planets', 'tips', 'titanic']


In [14]:
sns.load_dataset('titanic').head()

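|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |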


### Two things to notice here . . .
- The first is that the returned dataset looks tabular. That is a Pandas dataframe.
- The second is the .head() method. It is available on Pandas dataframes (and series) and shows the first 5 (or n) rows. It's useful if your data is big and you just want to see a small portion of it.
  - Likewise, there is a .tail() method that works from the bottom.
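
As a quick sketch of both methods, the snippet below uses a small made-up dataframe (not the titanic data) so the result is easy to check by eye. The names `df`, `first_three`, and `last_two` are only for illustration.

```python
import pandas as pd

# A small, made-up dataframe with ten rows numbered 0 through 9.
df = pd.DataFrame({"n": range(10)})

first_three = df.head(3)  # first 3 rows instead of the default 5
last_two = df.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```

Calling `df.head()` with no argument falls back to the default of 5 rows.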

### Okay, so we can load seaborn's built-in data, but what about a CSV (or other data formats)?
Pandas can natively read the following formats (some of them need an optional dependency installed):
- CSV
- JSON
- HTML
- Your local clipboard
- Microsoft Excel
- OpenDocument spreadsheets (the open source counterpart to Excel)
- HDF5
- Feather
- Parquet
- Msgpack (deprecated in pandas 0.25 and removed in 1.0)
- Stata
- SAS
- Python Pickles (Sounds strange but is actually really cool. For you R users out there, this is Python's equivalent of an RData file.)
- SQL
- Google BigQuery

See here for the full list and documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
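
Most of these formats follow the same pattern: a `pd.read_*` function to load and a matching `to_*` method to save. The sketch below round-trips a tiny made-up table through CSV and JSON, using in-memory text (`io.StringIO`) in place of real files so it runs anywhere. The data and variable names are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "name,age\nAda,36\nGrace,45\n"

people = pd.read_csv(io.StringIO(csv_text))

# Write the same table out as JSON, one record per row ...
json_text = people.to_json(orient="records")

# ... and read it back in.
round_trip = pd.read_json(io.StringIO(json_text), orient="records")

print(round_trip)
```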

I've downloaded the titanic data from the seaborn data repository (https://github.com/mwaskom/seaborn-data) so we can use the read_csv function ourselves.

In [15]:
# while we're here let's see how to write a csv as well.
data = sns.load_dataset('titanic')
type(data)

pandas.core.frame.DataFrame

In [22]:
data.to_csv('./data/titanic.csv', index=False) # index=False keeps the row index out of the csv; the ./data directory must already exist

In [23]:
# now, let's read it back.
myData = pd.read_csv('./data/titanic.csv')
print(type(myData))
myData.head()

<class 'pandas.core.frame.DataFrame'>


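|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |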
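
To see what `index=False` saves us from, here is a minimal sketch with a made-up two-row dataframe. Calling `to_csv()` without a path returns the CSV text, which we read straight back with `io.StringIO`. When the row index is written, it comes back as an extra column that pandas names `Unnamed: 0`.

```python
import io
import pandas as pd

# Made-up data, just enough to show the effect.
df = pd.DataFrame({"survived": [0, 1], "fare": [7.25, 71.28]})

with_index = df.to_csv()                # row index written as an unnamed first column
without_index = df.to_csv(index=False)  # only the data columns are written

cols_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(cols_with)
print(cols_without)
```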


### I had mentioned data structures (It was supposed to be plural)
Pandas has the notion of a few data structures. These include the following:
- Dataframes (which we've already seen)
- Series (which, technically we've already seen, you just may not realize it yet)

If you want some deeper reading on Pandas data structures, check out https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html

### So what is this Series we just mentioned?
A dataframe consists of rows and columns, and any single column or row can be pulled out as a one-dimensional labelled array (i.e. a Series). Let's take a look at the first column and I'll prove it's a Series.

In [28]:
print(type(myData['survived'])) # notice that the type is a Series. Remember back to when we worked with numpy arrays?
print(myData['survived'].head())

<class 'pandas.core.series.Series'>
0    0
1    1
2    1
3    1
4    0
Name: survived, dtype: int64


The previous example shows two things:
- The first and most obvious is subsetting of columns and rows.
- The second and slightly less obvious is the Series data type. That column should be boolean, but it is being represented as an int64! We can clean this up in two ways, but we'll get to that after we deal with the first matter.
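
We have seen a column come back as a Series. The same holds for a row. The sketch below uses a made-up two-row dataframe and `iloc`, which is covered in the next section, to pull out a row by position. The variable names are only for illustration.

```python
import pandas as pd

# Made-up values, shaped loosely like two columns of the titanic table.
passengers = pd.DataFrame({"age": [22.0, 38.0], "sex": ["male", "female"]})

first_row = passengers.iloc[0]   # one row, selected by position
age_column = passengers["age"]   # one column, selected by name

print(type(first_row))           # a Series whose labels are the column names
print(type(age_column))          # a Series whose labels are the row index
print(first_row.index.tolist())
```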

### Slicing and Dicing in Pandas
Learning how to get the data you want out of a Pandas dataframe can be a little frustrating at first, but let's step through your options.
- loc (standing for location) - intended to be used with labels
- iloc (standing for integer location) - intended to be used with integer positions
- bracket notation with a name (e.g. myData['survived']) - kind of a shortcut if you're after a particular column (or a slice of rows)
- dot notation - similar in a lot of ways to R's $ notation

In [61]:
# First up, loc
print(
    myData.loc[0:5, 'survived']
) # select your rows, and then your columns. Note that loc slices include the end label, so 0:5 gives 6 rows

# this gets to be pretty handy when you want to filter your data. (There is a filter method but we'll ignore that here.)
print(
    myData.loc[0:5, 'survived'] == 0
)
# the above gets us what is typically referred to as a mask. It provides a way to "mask" our data with the boolean series and keep only the rows where it is True.
mask = myData.loc[0:5, 'survived'] == 0
tmp = myData.loc[0:5, :]

tmp.loc[mask.values, :]

0    0
1    1
2    1
3    1
4    0
5    0
Name: survived, dtype: int64
0     True
1    False
2    False
3    False
4     True
5     True
Name: survived, dtype: bool


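|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |
| 5 | 0 | 3 | male | NaN | 0 | 0 | 8.4583 | Q | Third | man | True | NaN | Queenstown | no | True |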
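
In everyday use the mask is usually built from the same dataframe being filtered and passed to `loc` in one step. Here is a small sketch on made-up data (the column names mirror the titanic table, the values do not) that also shows how to combine two conditions.

```python
import pandas as pd

# Made-up data, only loosely shaped like the titanic table.
df = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "fare": [7.25, 71.28, 7.92, 8.05],
})

# Build the mask from the dataframe we want to filter ...
mask = df["survived"] == 0

# ... and pass it to loc. Rows where the mask is True are kept.
did_not_survive = df.loc[mask, :]

# Masks combine with & (and), | (or) and ~ (not); wrap each condition in parentheses.
cheap_and_lost = df.loc[(df["survived"] == 0) & (df["fare"] < 8), :]

print(did_not_survive)
print(cheap_and_lost)
```

The parentheses matter because `&` binds more tightly than `==` and `<` in Python.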


In [67]:
# now for iloc
myData.iloc[0:5, 0] # iloc slices exclude the end position, so this is the first 5 rows (0 through 4) of column 0, survived
# iloc only works with integer positions; it does not accept labels

0    0
1    1
2    1
3    1
4    0
Name: survived, dtype: int64
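
The difference between `loc` and `iloc` is easiest to see when the row labels are not the integers 0, 1, 2, and so on. The sketch below uses a made-up dataframe with string row labels. It also shows that a `loc` slice includes its end label while an `iloc` slice excludes its end position.

```python
import pandas as pd

# Made-up data with string row labels so that labels and positions differ.
df = pd.DataFrame(
    {"survived": [0, 1, 1, 0], "pclass": [3, 1, 3, 1]},
    index=["a", "b", "c", "d"],
)

by_label = df.loc["a":"c", "survived"]  # label slice: includes the end label "c"
by_position = df.iloc[0:2, 0]           # position slice: excludes position 2

print(by_label)
print(by_position)
```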

In [70]:
# with particular column names
print(myData['survived'].head(6))

0    0
1    1
2    1
3    1
4    0
5    0
Name: survived, dtype: int64


In [72]:
# and finally dot notation
myData.survived.head(6)

0    0
1    1
2    1
3    1
4    0
5    0
Name: survived, dtype: int64

Now, let's work with the Series data and change the data type of the survived column.

In [79]:
print(myData['survived'].head(6))
type(myData['survived']) # it is in fact a series

0    0
1    1
2    1
3    1
4    0
5    0
Name: survived, dtype: int64


pandas.core.series.Series

In [84]:
print(myData['survived'].astype('bool').head())
# a full list of pandas data types can be found here https://pbpython.com/pandas_dtypes.html
myData.head()
# It didn't change! What gives? . . . astype returns a new Series, and we never assigned it back to the column.

0    False
1     True
2     True
3     True
4    False
Name: survived, dtype: bool


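|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |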


In [86]:
myData['survived'] = myData['survived'].astype('bool')
myData.head()

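|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | True | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | True | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | True | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | False | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |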


We could have (and probably should have) specified the data type when we first read the data. Let's do that now.

In [90]:
typedData = pd.read_csv('./data/titanic.csv', dtype={'survived':'bool'})
typedData.head()

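|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | True | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | True | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | True | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | False | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |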
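
A quick way to confirm that a requested type took effect is the `.dtypes` attribute, which lists the type of every column. The sketch below reads the same made-up CSV text twice, once letting pandas infer the types and once passing a `dtype` mapping. The CSV text and the choice of `category` and `float32` are assumptions for illustration, not part of the titanic walkthrough.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "survived,pclass,fare\n0,3,7.25\n1,1,71.2833\n"

# Let pandas infer the column types ...
inferred = pd.read_csv(io.StringIO(csv_text))

# ... or request specific types for some columns up front.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"pclass": "category", "fare": "float32"},
)

print(inferred.dtypes)
print(typed.dtypes)
```

Columns left out of the mapping, such as `survived` here, are still inferred.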


### Some notes on using Pandas
In general you really don't want to grow dataframes in a loop. This will be slow and often won't work at all once your datasets get big enough (it doesn't take much). The reason is that growing a dataframe usually means copying it on every pass through the loop, so memory usage gets out of hand quite quickly.

Instead, keep in mind some of the things we learned in numpy. Try to stick to vectorized operations where possible. If you have to loop over a dataframe for any reason, pre-allocate the column or dataframe that you'll be filling first.

In [99]:
pd.DataFrame(index=range(0,10), columns=['column1', 'column2'])

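|   | column1 | column2 |
|---|---|---|
| 0 | NaN | NaN |
| 1 | NaN | NaN |
| 2 | NaN | NaN |
| 3 | NaN | NaN |
| 4 | NaN | NaN |
| 5 | NaN | NaN |
| 6 | NaN | NaN |
| 7 | NaN | NaN |
| 8 | NaN | NaN |
| 9 | NaN | NaN |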
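
Another common way to avoid growing a dataframe in a loop is to collect plain Python rows in a list and build the dataframe once at the end. The sketch below assumes a hypothetical `make_row` function standing in for whatever per-item work you need. It also shows a vectorized column calculation in place of a row-by-row loop.

```python
import pandas as pd

# Hypothetical per-item computation; any function that returns one row would do.
def make_row(i):
    return {"n": i, "square": i * i}

# Collect plain dicts in a list, then build the dataframe a single time,
# instead of appending to a dataframe on every pass through the loop.
rows = [make_row(i) for i in range(5)]
squares = pd.DataFrame(rows)

# Column-wise (vectorized) arithmetic replaces an explicit loop over rows.
squares["cube"] = squares["n"] * squares["square"]

print(squares)
```

Appending to a Python list is cheap, so the only dataframe construction happens in the single `pd.DataFrame(rows)` call.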
