# Pandas intro class 01

Objectives:
- Learn to read and write data using `pandas`
- Learn to select subsets of data

There are many ways to do things in `pandas`. In general, we want our work to be:
- simple
- explicit
- easy to read
- efficient

## Getting started

In [5]:
import pandas as pd # alias 
import numpy as np 

# set row and column display if you need to
# pd.set_option('display.max_columns', 100)
# pd.set_option('display.max_rows', 500)

In [6]:
# read data in
df = pd.read_csv('data/sample_data.csv')
len([m for m in dir(df) if not m.startswith('_')])

218

In [10]:
# dir() lists every attribute and method available on the DataFrame
dir(df)

['T',
 '_AXIS_LEN',
 '_AXIS_ORDERS',
 '_AXIS_TO_AXIS_NUMBER',
 '_HANDLED_TYPES',
 '__abs__',
 '__add__',
 '__and__',
 '__annotations__',
 '__array__',
 '__array_priority__',
 '__array_ufunc__',
 '__array_wrap__',
 '__bool__',
 '__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 

In [8]:
df.columns

Index(['name', 'state', 'color', 'favorite food', 'age', 'height', 'score', 'count'], dtype='object')

In [12]:
df.shape

(7, 8)

In [13]:
df.head()

Unnamed: 0,name,state,color,favorite food,age,height,score,count
0,Jane,NY,blue,Steak,30,165,4.6,10
1,Niko,TX,green,Lamb,2,70,8.3,4
2,Aaron,FL,red,Mango,12,120,9.0,3
3,Penelope,AL,white,Apple,4,80,3.3,12
4,Dean,AK,gray,Cheese,32,180,1.8,8


In [14]:
df.tail()

Unnamed: 0,name,state,color,favorite food,age,height,score,count
2,Aaron,FL,red,Mango,12,120,9.0,3
3,Penelope,AL,white,Apple,4,80,3.3,12
4,Dean,AK,gray,Cheese,32,180,1.8,8
5,Christina,TX,black,Melon,33,172,9.5,99
6,Cornelia,TX,red,Beans,69,150,2.2,44


In [16]:
df.dtypes

name              object
state             object
color             object
favorite food     object
age                int64
height             int64
score            float64
count              int64
dtype: object

In [22]:
df.isnull().mean()

name             0.0
state            0.0
color            0.0
favorite food    0.0
age              0.0
height           0.0
score            0.0
count            0.0
dtype: float64

In [None]:
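# other readers for different sources (each needs a path, or a query plus a connection)
pd.read_sql      # SQL query or table
pd.read_excel    # Excel workbook
pd.read_table    # like read_csv, but the default separator is a tab
pd.read_parquet  # Parquet file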

In [None]:
# convert data in python to dataframe
pd.DataFrame()
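
As a minimal sketch of converting in-memory Python data, `pd.DataFrame` accepts a dictionary that maps column names to lists, or a list of row dictionaries. The values below are copied from the first rows of the sample data; the variable names (`data`, `small_df`, `records_df`) are illustrative only.

```python
import pandas as pd

# dictionary of column name -> list of values (first three rows of the sample data)
data = {
    'name': ['Jane', 'Niko', 'Aaron'],
    'state': ['NY', 'TX', 'FL'],
    'age': [30, 2, 12],
}
small_df = pd.DataFrame(data)

# a list of row dictionaries works too
records = [
    {'name': 'Jane', 'state': 'NY', 'age': 30},
    {'name': 'Niko', 'state': 'TX', 'age': 2},
]
records_df = pd.DataFrame(records)

print(small_df.shape)
print(list(records_df.columns))
```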

In [None]:
# directly read from SQL database - usually requires some setup and credentials
pd.read_sql('select * from table limit 5', con)
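
The cell above assumes a connection object `con` that already exists. As a self-contained sketch, the example below uses an in-memory SQLite database from the standard library instead of a real server; the table name `people` and its contents are made up for illustration.

```python
import sqlite3
import pandas as pd

# in-memory SQLite database, used only so the example needs no setup or credentials
con = sqlite3.connect(':memory:')

# hypothetical table 'people': write a few rows so there is something to query
pd.DataFrame({'name': ['Jane', 'Niko', 'Aaron'], 'age': [30, 2, 12]}).to_sql(
    'people', con, index=False
)

# run a query and get the result back as a DataFrame
result = pd.read_sql('select * from people limit 2', con)
print(result)

con.close()
```

For a production database you would typically build `con` with the driver or engine your team uses; the `read_sql` call itself stays the same.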

## Subsetting data
### Selecting a single column
- brackets `[]`
    
    Square bracket notation is used for accessing members of a collection, whether that's by key in the case of a dictionary or other mapping, or by index in the case of a sequence like a list or string
    
- dot notation `.`
    
    The dot operator is used for accessing attributes of any object
    
**Takeaway**: brackets `[]` always work, while dot notation `.` fails when a column name contains a space or is the same as a DataFrame method or attribute. I'd recommend using the approach that always works.
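
The two failure cases for dot notation can be sketched on a tiny DataFrame. The frame `demo` below is a hypothetical stand-in that borrows two rows from the sample data.

```python
import pandas as pd

demo = pd.DataFrame({
    'name': ['Jane', 'Niko'],
    'favorite food': ['Steak', 'Lamb'],  # column name contains a space
    'count': [10, 4],                    # same name as the DataFrame.count method
})

# brackets always return the column
food = demo['favorite food']
count_col = demo['count']

# dot notation finds the method first, not the 'count' column
print(callable(demo.count))        # this is DataFrame.count, a method
print(type(count_col).__name__)    # the bracket version is a Series

# demo.favorite food is a SyntaxError, so it cannot even be written
```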

In [23]:
df.name # dot method

0         Jane
1         Niko
2        Aaron
3     Penelope
4         Dean
5    Christina
6     Cornelia
Name: name, dtype: object

In [29]:
df['name'] # [] method

0         Jane
1         Niko
2        Aaron
3     Penelope
4         Dean
5    Christina
6     Cornelia
Name: name, dtype: object

In [25]:
type(df)

pandas.core.frame.DataFrame

In [28]:
type(df[['name']]) # a list with one column name returns a one-column DataFrame (n x 1), not a Series

pandas.core.frame.DataFrame

In [30]:
df['favorite food'] # dot notation can't be used here because of the space in the name

0     Steak
1      Lamb
2     Mango
3     Apple
4    Cheese
5     Melon
6     Beans
Name: favorite food, dtype: object

In [38]:
# rename column - look at doc string and example
df.rename(columns={'favorite food':'favorite_food'}, inplace=True)

In [36]:
# dot notation works once the column name has no space
df.favorite_food

0     Steak
1      Lamb
2     Mango
3     Apple
4    Cheese
5     Melon
6     Beans
Name: favorite_food, dtype: object

In [77]:
# dot notation still can't reach the 'count' column: df.count is the DataFrame method
df.count()

name             7
state            7
color            7
favorite_food    7
age              7
height           7
score            7
count            7
dtype: int64

### Selecting multiple columns
You can select multiple columns by passing a list of column names inside the brackets.

In [40]:
cols = ['name','state','age']
df[cols]

Unnamed: 0,name,state,age
0,Jane,NY,30
1,Niko,TX,2
2,Aaron,FL,12
3,Penelope,AL,4
4,Dean,AK,32
5,Christina,TX,33
6,Cornelia,TX,69


In [41]:
type(df[cols])

pandas.core.frame.DataFrame

### Selecting rows
A pandas DataFrame references rows and columns in two ways: by integer location (position) and by label. Hence you can access them in two ways. The following indexers allow you to select rows and columns at the same time.
- `loc` accesses data by label
- `iloc` accesses data by integer location (position)

Don't use the `ix` indexer: it was deprecated and then removed in pandas 1.0.

There is usually no need for `iat` and `at`; if performance is a concern, use `numpy`.
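
The difference between labels and positions is easiest to see when the index labels are not `0, 1, 2, ...`. The sketch below uses a hypothetical frame `demo` with integer labels 10, 20, 30 chosen purely for illustration.

```python
import pandas as pd

demo = pd.DataFrame(
    {'state': ['NY', 'TX', 'FL'], 'age': [30, 2, 12]},
    index=[10, 20, 30],  # integer labels that differ from positions 0, 1, 2
)

by_label = demo.loc[20, 'age']    # row labelled 20, column labelled 'age'
by_position = demo.iloc[1, 1]     # second row, second column

print(by_label, by_position)      # both refer to the same cell

# demo.loc[1, 'age'] would raise KeyError: there is no row labelled 1
```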

In [42]:
df

Unnamed: 0,name,state,color,favorite_food,age,height,score,count
0,Jane,NY,blue,Steak,30,165,4.6,10
1,Niko,TX,green,Lamb,2,70,8.3,4
2,Aaron,FL,red,Mango,12,120,9.0,3
3,Penelope,AL,white,Apple,4,80,3.3,12
4,Dean,AK,gray,Cheese,32,180,1.8,8
5,Christina,TX,black,Melon,33,172,9.5,99
6,Cornelia,TX,red,Beans,69,150,2.2,44


In [43]:
rows = [1,3,5]
columns = [2,4,6]

# rows and columns are both integer positions here, so use iloc (loc would raise a KeyError)
sub_df = df.iloc[rows, columns]

In [44]:
type(sub_df)

pandas.core.frame.DataFrame

In [45]:
sub_df

Unnamed: 0,color,age,score
1,green,2,8.3
3,white,4,3.3
5,black,33,9.5


In [48]:
rows = [1,3,5]
columns = ['name','state','age']

df.loc[rows, columns]
# note: with loc, rows are index labels; with iloc they are positions (they only coincide for the default index)

Unnamed: 0,name,state,age
1,Niko,TX,2
3,Penelope,AL,4
5,Christina,TX,33


In [49]:
df.index

RangeIndex(start=0, stop=7, step=1)

In [50]:
# you can set any column as the index
df.set_index('name', inplace=True)

In [51]:
df

name,state,color,favorite_food,age,height,score,count
Jane,NY,blue,Steak,30,165,4.6,10
Niko,TX,green,Lamb,2,70,8.3,4
Aaron,FL,red,Mango,12,120,9.0,3
Penelope,AL,white,Apple,4,80,3.3,12
Dean,AK,gray,Cheese,32,180,1.8,8
Christina,TX,black,Melon,33,172,9.5,99
Cornelia,TX,red,Beans,69,150,2.2,44


In [52]:
df.index

Index(['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'], dtype='object', name='name')

In [56]:
# rows = [1,3,5]  # integer labels no longer exist after set_index('name')
rows = ['Dean','Penelope']
columns = ['state','age']
df.loc[rows, columns]

name,state,age
Dean,AK,32
Penelope,AL,4


In [25]:
df.index

Index(['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'], dtype='object', name='name')

In [58]:
rows = df.index[1::2]
columns = ['state','age'] # df.columns[0:-2:1]
df.loc[rows, columns]

name,state,age
Niko,TX,2
Penelope,AL,4
Christina,TX,33


In [60]:
df.iloc[[1,3,5]][columns]

name,state,age
Niko,TX,2
Penelope,AL,4
Christina,TX,33


## Assigning values
When assigning values, you are usually assigning to a subset of a DataFrame.

We want to subset the `name` column of the DataFrame and change one of the values.

In [61]:
df.reset_index(inplace=True)
df

Unnamed: 0,name,state,color,favorite_food,age,height,score,count
0,Jane,NY,blue,Steak,30,165,4.6,10
1,Niko,TX,green,Lamb,2,70,8.3,4
2,Aaron,FL,red,Mango,12,120,9.0,3
3,Penelope,AL,white,Apple,4,80,3.3,12
4,Dean,AK,gray,Cheese,32,180,1.8,8
5,Christina,TX,black,Melon,33,172,9.5,99
6,Cornelia,TX,red,Beans,69,150,2.2,44


In [62]:
name = df['name']
name

0         Jane
1         Niko
2        Aaron
3     Penelope
4         Dean
5    Christina
6     Cornelia
Name: name, dtype: object

In [63]:
type(name)

pandas.core.series.Series

In [64]:
name.loc[0] = 'Emma'
name

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  name.loc[0] = 'Emma'


0         Emma
1         Niko
2        Aaron
3     Penelope
4         Dean
5    Christina
6     Cornelia
Name: name, dtype: object

In [65]:
df

Unnamed: 0,name,state,color,favorite_food,age,height,score,count
0,Emma,NY,blue,Steak,30,165,4.6,10
1,Niko,TX,green,Lamb,2,70,8.3,4
2,Aaron,FL,red,Mango,12,120,9.0,3
3,Penelope,AL,white,Apple,4,80,3.3,12
4,Dean,AK,gray,Cheese,32,180,1.8,8
5,Christina,TX,black,Melon,33,172,9.5,99
6,Cornelia,TX,red,Beans,69,150,2.2,44


This is the same side effect we've seen before when modifying elements in mutable collections.

When we selected the name column with `df['name']`, pandas did not make a copy of the underlying data. Both the `name` Series and the `df` DataFrame reference the same underlying data, which is a NumPy array. Thus, when we made the assignment with `name.loc[0] = 'Emma'`, that shared array was updated, and both `name` and `df` report the new name `Emma`. (This is the behavior of the pandas version used here; with Copy-on-Write enabled, `name` behaves like a copy and `df` stays unchanged.)

As we learned before, if this side effect is undesirable, we can avoid it with an explicit `copy`.

In [66]:
df = pd.read_csv('data/sample_data.csv')
name = df['name'].copy()

In [67]:
name.loc[0] = 'Emma'
name

0         Emma
1         Niko
2        Aaron
3     Penelope
4         Dean
5    Christina
6     Cornelia
Name: name, dtype: object

In [68]:
df['name']

0         Jane
1         Niko
2        Aaron
3     Penelope
4         Dean
5    Christina
6     Cornelia
Name: name, dtype: object

### `SettingWithCopyWarning` - what is it and how to deal with it
Sometimes you'll get a `SettingWithCopyWarning` when you try to assign values. It usually means the assignment may not have worked, so you'll need to double-check the result and modify your code.

In [69]:
df['age'] < 18

0    False
1     True
2     True
3     True
4    False
5    False
6    False
Name: age, dtype: bool

In [70]:
# condition filter / condition indexing / mask
df[ df['age'] < 18 ]

Unnamed: 0,name,state,color,favorite food,age,height,score,count
1,Niko,TX,green,Lamb,2,70,8.3,4
2,Aaron,FL,red,Mango,12,120,9.0,3
3,Penelope,AL,white,Apple,4,80,3.3,12


In [71]:
df[ df['age'] < 18 ]['score'] # chained indexing

1    8.3
2    9.0
3    3.3
Name: score, dtype: float64

In [72]:
# let's do some assignment using chained indexing
df[df['age'] < 18]['score'] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df['age'] < 18]['score'] = 0


In [73]:
df

Unnamed: 0,name,state,color,favorite food,age,height,score,count
0,Jane,NY,blue,Steak,30,165,4.6,10
1,Niko,TX,green,Lamb,2,70,8.3,4
2,Aaron,FL,red,Mango,12,120,9.0,3
3,Penelope,AL,white,Apple,4,80,3.3,12
4,Dean,AK,gray,Cheese,32,180,1.8,8
5,Christina,TX,black,Melon,33,172,9.5,99
6,Cornelia,TX,red,Beans,69,150,2.2,44


Chained indexing here didn't result in any value assignment, so we need a different approach: a single `.loc` call.

If you want to know why chained indexing didn't work for assignment, here is the reason:

>Chained indexing does not work because an intermediate DataFrame is created with new, copied data. It is this intermediate DataFrame that has its data changed. But, this intermediate DataFrame is never assigned to a variable, therefore there is no reference to it.
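
To make the intermediate DataFrame visible, the chained assignment can be split into two explicit steps. This is only an illustrative sketch on a hypothetical frame `demo`; the warning is silenced here so the output stays readable.

```python
import warnings
import pandas as pd

demo = pd.DataFrame({'age': [30, 2, 12], 'score': [4.6, 8.3, 9.0]})

with warnings.catch_warnings():
    warnings.simplefilter('ignore')   # silence the warning for this demonstration
    subset = demo[demo['age'] < 18]   # step 1: boolean filtering returns new data
    subset['score'] = 0               # step 2: the assignment lands on that new object

print(subset['score'].tolist())  # the intermediate object was changed
print(demo['score'].tolist())    # the original is unchanged
```

In the chained form the object named `subset` here has no name at all, so the change is lost as soon as the statement finishes.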

In [74]:
df.loc[df['age'] < 18, 'score'] = 0
df

Unnamed: 0,name,state,color,favorite food,age,height,score,count
0,Jane,NY,blue,Steak,30,165,4.6,10
1,Niko,TX,green,Lamb,2,70,0.0,4
2,Aaron,FL,red,Mango,12,120,0.0,3
3,Penelope,AL,white,Apple,4,80,0.0,12
4,Dean,AK,gray,Cheese,32,180,1.8,8
5,Christina,TX,black,Melon,33,172,9.5,99
6,Cornelia,TX,red,Beans,69,150,2.2,44


In [75]:
df_new = df[['state', 'count']]
df_new

Unnamed: 0,state,count
0,NY,10
1,TX,4
2,FL,3
3,AL,12
4,AK,8
5,TX,99
6,TX,44


Sometimes the `SettingWithCopyWarning` is triggered even when nothing is wrong. For example:

In [76]:
df_new['count'] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_new['count'] = 0


In [77]:
df_new

Unnamed: 0,state,count
0,NY,0
1,TX,0
2,FL,0
3,AL,0
4,AK,0
5,TX,0
6,TX,0


In [78]:
df

Unnamed: 0,name,state,color,favorite food,age,height,score,count
0,Jane,NY,blue,Steak,30,165,4.6,10
1,Niko,TX,green,Lamb,2,70,0.0,4
2,Aaron,FL,red,Mango,12,120,0.0,3
3,Penelope,AL,white,Apple,4,80,0.0,12
4,Dean,AK,gray,Cheese,32,180,1.8,8
5,Christina,TX,black,Melon,33,172,9.5,99
6,Cornelia,TX,red,Beans,69,150,2.2,44


In this instance, pandas makes a completely new copy of the data when executing `df_new = df[['state', 'count']]`, so the values of `df_new` can be set without changing the original. The warning showed up only because we didn't create the copy explicitly. To avoid the warning, call `.copy()` explicitly.

In [79]:
df_new = df[['state', 'count']].copy()
df_new['count'] = 0
df_new

Unnamed: 0,state,count
0,NY,0
1,TX,0
2,FL,0
3,AL,0
4,AK,0
5,TX,0
6,TX,0


There is no need to be alarmed by the `SettingWithCopyWarning`. Usually one of the following two cases applies to your work, and the warning is basically saying that pandas doesn't know which one you are trying to do:
- you want to work on original DataFrame and modify its values
- you want to create a new DataFrame for your work and keep the original unchanged

For the first case, make sure you are indexing properly as demonstrated and know the values have been changed. 

For the second case, create an explicit copy when possible (memory or performance can sometimes be a concern), or, if you are familiar with what's happening behind the scenes, you can ignore the warning.
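
The two cases can be sketched side by side. The frame `original` and the filter on `'TX'` are hypothetical and chosen only to keep the example short.

```python
import pandas as pd

original = pd.DataFrame({'state': ['NY', 'TX', 'FL'], 'count': [10, 4, 3]})

# case 1: modify the original in place with a single .loc call
original.loc[original['state'] == 'TX', 'count'] = 0

# case 2: work on an independent copy and leave the original untouched
working = original[['state', 'count']].copy()
working['count'] = -1

print(original['count'].tolist())
print(working['count'].tolist())
```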

### Write data out
We can use built-in methods to write a DataFrame out in the desired format.

In [59]:
df.to_csv('data/modified_sample_data.csv', index=False)  # index=False leaves the row index out of the file

In [64]:
with open('data/modified_sample_data.csv', 'r') as file:
    print(file.read())

name,state,color,favorite food,age,height,score,count
Jane,NY,blue,Steak,30,165,4.6,10
Niko,TX,green,Lamb,2,70,0.0,4
Aaron,FL,red,Mango,12,120,0.0,3
Penelope,AL,white,Apple,4,80,0.0,12
Dean,AK,gray,Cheese,32,180,1.8,8
Christina,TX,black,Melon,33,172,9.5,99
Cornelia,TX,red,Beans,69,150,2.2,44
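
A quick way to check a write is to read the file back. The sketch below does this in a temporary directory so it leaves nothing behind; the file name `demo.csv` and the frame `demo` are hypothetical.

```python
import os
import tempfile
import pandas as pd

demo = pd.DataFrame({'name': ['Jane', 'Niko'], 'score': [4.6, 0.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.csv')  # hypothetical file name
    demo.to_csv(path, index=False)            # index=False: do not write the row index
    round_trip = pd.read_csv(path)

print(list(round_trip.columns))
print(round_trip['score'].tolist())
```

Without `index=False`, the row index is written as an extra unnamed first column, which then shows up as `Unnamed: 0` when the file is read back.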



### Assignment: read about what the following methods and attributes do and get familiar with them. We will use them in future classes.

* T
* abs
* all
* any
* append (removed in pandas 2.0; use `pd.concat` instead)
* asfreq
* astype
* clip
* columns
* copy
* corr
* count
* cov
* cummax
* cummin
* cumprod
* cumsum
* describe
* diff
* drop
* drop_duplicates
* dropna
* dtypes
* equals
* expanding
* fillna
* groupby
* head
* idxmax
* idxmin
* iloc
* index
* interpolate
* isin
* isna
* loc
* max
* mean
* median
* melt
* merge
* min
* mode
* nlargest
* notna
* nsmallest
* nunique
* pct_change
* pivot_table
* plot
* prod
* quantile
* rank
* rename
* replace
* resample
* reset_index
* rolling
* round
* sample
* select_dtypes
* shape
* shift
* sort_index
* sort_values
* std
* sum
* tail
* to_csv
* to_sql
* values
* var