In [1]:
import numpy as np
import pandas as pd

## Series

Pandas has two main data structures: the Series and the DataFrame. A series is like a column in Excel: a list of data points that all share the same type. The basic way to create a series object is shown below:

In [5]:
s = pd.Series(
    np.random.randn(5),
    index=['a', 'b', 'c', 'd', 'e'],
    name='example'
)
s

a   -0.343082
b    0.751050
c    0.259860
d    0.285549
e   -0.379612
Name: example, dtype: float64

There are other ways to make a series (for example, from a dictionary), but in general this is the only one that I ever use. Notice that a series has three important parts:

1. The data
2. The index 
3. The name
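
As a small sketch of the dictionary route mentioned above (the variable name and the values are made up for illustration), the dictionary keys become the index, and the three parts can each be inspected directly:

```python
import pandas as pd

# Hypothetical data: the dictionary keys become the index labels.
prices = pd.Series({'apple': 1.5, 'banana': 0.25, 'cherry': 4.0}, name='price')

print(prices.to_numpy())     # the data, as a NumPy array
print(list(prices.index))    # the index labels
print(prices.name)           # the name
```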

The data can be a list of data, or a single instance that broadcasts, like below:

In [6]:
pd.Series(5, index=['a', 'b', 'c', 'd', 'e'])

a    5
b    5
c    5
d    5
e    5
dtype: int64

(Broadcasting, as we will see later on, is really important.)

The data is what you as a data scientist are interested in. The index is often used in time series, but otherwise I rarely use the index for series (I do use the index for dataframes quite a lot!). Notice, though, that each data point is associated with an index label.

Finally, the name. The name only really matters when you add a series to a dataframe, in which case the name of the series becomes the column name.
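
Here is a minimal sketch of that behavior, assuming two made-up series called `height` and `weight`. Both `pd.concat` and `to_frame` pick up the series name as the column label:

```python
import pandas as pd

# Two named series sharing the same index (illustrative values).
height = pd.Series([1.7, 1.8], index=['ann', 'bob'], name='height')
weight = pd.Series([60, 80], index=['ann', 'bob'], name='weight')

# Each series name is used as the column label.
people = pd.concat([height, weight], axis=1)
print(people.columns.tolist())
print(height.to_frame().columns.tolist())
```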

You have so far not seen why series are all that useful, but now we start to get into it. Series have various ways that you can index into them:

In [7]:
s.iloc[0]  # positional access; plain s[0] on a label index is deprecated in recent pandas

-0.34308151667924974

In [8]:
s.head(3)

a   -0.343082
b    0.751050
c    0.259860
Name: example, dtype: float64

In [10]:
s[:3]

a   -0.343082
b    0.751050
c    0.259860
Name: example, dtype: float64

In [11]:
s.iloc[[4, 2, 1]]  # positional access with a list of positions

e   -0.379612
c    0.259860
b    0.751050
Name: example, dtype: float64

In [14]:
print(s.values)
type(s.values)

[-0.34308152  0.75105006  0.25986049  0.28554885 -0.37961175]


numpy.ndarray

In [15]:
s['e'] = 500
s

a     -0.343082
b      0.751050
c      0.259860
d      0.285549
e    500.000000
Name: example, dtype: float64

Generally speaking, I don't do any of the above operations. If you find yourself using them, give some thought to whether you should be using pandas for those operations or whether NumPy would be a better fit.

Now let me show you some operations that I frequently use:

In [16]:
s[[True, True, False, False, True]]

a     -0.343082
b      0.751050
e    500.000000
Name: example, dtype: float64

In [19]:
print(s > 0)
print(s[s > 0])

a    False
b     True
c     True
d     True
e     True
Name: example, dtype: bool
b      0.751050
c      0.259860
d      0.285549
e    500.000000
Name: example, dtype: float64


In [21]:
s[s < 0] *= -1     # Data manipulation "In place"
s

a      0.343082
b      0.751050
c      0.259860
d      0.285549
e    500.000000
Name: example, dtype: float64

But one thing that is super useful about series is that you can do vectorized operations (fast computations on everything in the entire series) on them. And you have already seen a couple: `s > 0` and `s[s < 0] *= -1`.

In [25]:
s + s, 2*s, s / 10, s - s

(a       0.686163
 b       1.502100
 c       0.519721
 d       0.571098
 e    1000.000000
 Name: example, dtype: float64,
 a       0.686163
 b       1.502100
 c       0.519721
 d       0.571098
 e    1000.000000
 Name: example, dtype: float64,
 a     0.034308
 b     0.075105
 c     0.025986
 d     0.028555
 e    50.000000
 Name: example, dtype: float64,
 a    0.0
 b    0.0
 c    0.0
 d    0.0
 e    0.0
 Name: example, dtype: float64)

In [28]:
print(type(np.exp(s)))
np.exp(s)

<class 'pandas.core.series.Series'>


a     1.409284e+00
b     2.119224e+00
c     1.296749e+00
d     1.330492e+00
e    1.403592e+217
Name: example, dtype: float64

In [29]:
s.mean()

100.32790818292477

In [30]:
s.median()

0.34308151667924974

Operations like these, which run over a whole column, are what pandas is made for. Any time you stray from doing operations over columns, you should ask yourself: is pandas the best tool for me?

Doing operations over one column is useful, but what about operations over multiple columns?

## DataFrames

Series are nice, but the really nice thing about them is that they make up dataframes. Dataframes are like an entire Excel spreadsheet! As you can probably guess, a dataframe is a collection of series, each one with a name and all sharing the same index. Thus an easy way to create a dataframe is from a dictionary of series/lists:

In [32]:
d = {
    'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])
}

df = pd.DataFrame(d)

df

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


In [33]:
d = {
    'one' : 'Hellow',
    'two' : np.array([1., 2., 3., 4.])
}

df = pd.DataFrame(d)

df

Unnamed: 0,one,two
0,Hellow,1.0
1,Hellow,2.0
2,Hellow,3.0
3,Hellow,4.0


There are plenty of ways to create one of these things, but generally speaking just knowing one is enough. You can always change the values of the index or the columns later:

In [35]:
df.columns = ['1', '2']
df.index = ['a', 'b', 'c', 'd']

df

Unnamed: 0,1,2
a,Hellow,1.0
b,Hellow,2.0
c,Hellow,3.0
d,Hellow,4.0


Dataframes are basically just dictionaries of columns/series, so you can use most of the same techniques you used for series on dataframes themselves. 

The general way to reference a column is shown below:

In [36]:
d = {
    'one' : 'Hellow',
    'two' : np.array([1., 2., 3., 4.])
}

df = pd.DataFrame(d)

df

Unnamed: 0,one,two
0,Hellow,1.0
1,Hellow,2.0
2,Hellow,3.0
3,Hellow,4.0


In [38]:
df['two']

0    1.0
1    2.0
2    3.0
3    4.0
Name: two, dtype: float64

You can then do anything with that series that we did above. Nifty.

There is a dot notation shortcut (`df.two`), but it is almost better not to know it because it can lead to errors if not used correctly!
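
One way the shortcut can bite, sketched on a made-up frame called `orders`: when a column name matches an existing DataFrame attribute (here `size`), dot notation returns the attribute rather than the column.

```python
import pandas as pd

# Illustrative frame; 'size' collides with the built-in DataFrame.size attribute.
orders = pd.DataFrame({'tip': [1.0, 2.5], 'size': [2, 4]})

print(orders.tip.tolist())       # works: 'tip' is not an existing attribute
print(orders.size)               # NOT the column: the total number of cells (2 rows x 2 columns)
print(orders['size'].tolist())   # bracket notation always returns the column
```

Bracket notation avoids the ambiguity entirely, which is why it is the safer habit.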

You can of course delete and make new columns, with broadcasting as well:

In [41]:
del df['one']  # remove the 'one' column in place
df

Unnamed: 0,two
0,1.0
1,2.0
2,3.0
3,4.0


In [42]:
df['three'] = df['two'] + df['two']
df['four'] = 'four'
df

Unnamed: 0,two,three,four
0,1.0,2.0,four
1,2.0,4.0,four
2,3.0,6.0,four
3,4.0,8.0,four


In [43]:
df['five'] = df['four'][:2]
df

Unnamed: 0,two,three,four,five
0,1.0,2.0,four,four
1,2.0,4.0,four,four
2,3.0,6.0,four,
3,4.0,8.0,four,


Again, there are other ways of inserting columns (the `insert` and `assign` methods), but I never use them. The benefits of using those methods also seem pretty small.
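
For completeness, here is a rough sketch of those two methods on a throwaway frame (the frame and column names are invented for the example). The main differences are that `assign` returns a new frame, while `insert` works in place and lets you pick the column position:

```python
import pandas as pd

frame = pd.DataFrame({'two': [1.0, 2.0, 3.0]})

# assign returns a new frame and leaves the original untouched.
with_three = frame.assign(three=frame['two'] * 2)

# insert modifies the frame in place and lets you choose the column position.
frame.insert(0, 'zero', 0)

print(with_three.columns.tolist())
print(frame.columns.tolist())
```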

Next let's go over indexing and selecting with dataframes. There are basically four ways to do so: selecting a single column, selecting a list of columns, `loc`, and `iloc`:

In [44]:
df['two']

0    1.0
1    2.0
2    3.0
3    4.0
Name: two, dtype: float64

In [45]:
df[['two', 'four']]

Unnamed: 0,two,four
0,1.0,four
1,2.0,four
2,3.0,four
3,4.0,four


In [50]:
print(df.loc[0:2, 'two'])      # loc selects by label and includes the end label, so rows 0, 1 and 2 are returned
print(df.iloc[0:2, 0])         # iloc selects by position and excludes the end position, so rows 0 and 1 are returned

0    1.0
1    2.0
2    3.0
Name: two, dtype: float64
0    1.0
1    2.0
Name: two, dtype: float64
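
The difference between the two is easier to see when the labels are not integers. A small sketch, assuming a made-up frame `letters` with string labels:

```python
import pandas as pd

# Illustrative frame with string labels so labels and positions cannot be confused.
letters = pd.DataFrame({'two': [1.0, 2.0, 3.0, 4.0]}, index=['a', 'b', 'c', 'd'])

by_label = letters.loc['a':'c', 'two']    # label slice: includes the end label 'c'
by_position = letters.iloc[0:2, 0]        # position slice: excludes position 2

print(by_label.tolist())
print(by_position.tolist())
```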


In [53]:
df.iloc[-2:]

Unnamed: 0,two,three,four,five
2,3.0,6.0,four,
3,4.0,8.0,four,


## DataFrame Functions

In addition to doing these columnwise operations, you can also do some dataframewise operations. 

The most useful of these is the `copy` method: it makes a copy :)

In [54]:
df.copy()

Unnamed: 0,two,three,four,five
0,1.0,2.0,four,four
1,2.0,4.0,four,four
2,3.0,6.0,four,
3,4.0,8.0,four,
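
Why a copy matters, as a minimal sketch with invented names: plain assignment only gives the same dataframe a second name, so changes made through either name show up in both, while `copy` gives you independent data.

```python
import pandas as pd

original = pd.DataFrame({'two': [1.0, 2.0]})

alias = original                # same object, just a second name
independent = original.copy()   # separate data

alias.loc[0, 'two'] = 100.0        # also changes `original`
independent.loc[1, 'two'] = -1.0   # does not touch `original`

print(original['two'].tolist())
print(independent['two'].tolist())
```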


The `astype` method converts the types of columns:

In [56]:
df['two'].astype(np.int64)

0    1
1    2
2    3
3    4
Name: two, dtype: int64

The next thing that I very commonly use is the dataframe transpose ability:

In [57]:
df.T

Unnamed: 0,0,1,2,3
two,1.0,2.0,3.0,4.0
three,2.0,4.0,6.0,8.0
four,four,four,four,four
five,four,four,,


This puts the rows as the columns and the columns as the rows. It can be a good way to do row-wise operations, but mainly I do it to display dataframe values. Below are the common ways to display dataframe values:

In [59]:
df.head(2)

Unnamed: 0,two,three,four,five
0,1.0,2.0,four,four
1,2.0,4.0,four,four


In [60]:
df.tail(2)

Unnamed: 0,two,three,four,five
2,3.0,6.0,four,
3,4.0,8.0,four,


In [61]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   two     4 non-null      float64
 1   three   4 non-null      float64
 2   four    4 non-null      object 
 3   five    2 non-null      object 
dtypes: float64(2), object(2)
memory usage: 260.0+ bytes


In [63]:
df.describe(include='all')

Unnamed: 0,two,three,four,five
count,4.0,4.0,4,2
unique,,,1,1
top,,,four,four
freq,,,4,2
mean,2.5,5.0,,
std,1.290994,2.581989,,
min,1.0,2.0,,
25%,1.75,3.5,,
50%,2.5,5.0,,
75%,3.25,6.5,,


You will notice, however, that when there are too many columns the display gets messy:

In [64]:
for i in range(20):
    df[i] = i
df.head()

Unnamed: 0,two,three,four,five,0,1,2,3,4,5,...,10,11,12,13,14,15,16,17,18,19
0,1.0,2.0,four,four,0,1,2,3,4,5,...,10,11,12,13,14,15,16,17,18,19
1,2.0,4.0,four,four,0,1,2,3,4,5,...,10,11,12,13,14,15,16,17,18,19
2,3.0,6.0,four,,0,1,2,3,4,5,...,10,11,12,13,14,15,16,17,18,19
3,4.0,8.0,four,,0,1,2,3,4,5,...,10,11,12,13,14,15,16,17,18,19


In [65]:
df.T

Unnamed: 0,0,1,2,3
two,1.0,2.0,3.0,4.0
three,2.0,4.0,6.0,8.0
four,four,four,four,four
five,four,four,,
0,0,0,0,0
1,1,1,1,1
2,2,2,2,2
3,3,3,3,3
4,4,4,4,4
5,5,5,5,5


Sometimes this will also truncate. To view more you can always change the display options below (by the way, pandas has many, many options; you can check them all out either [here](https://pandas.pydata.org/pandas-docs/stable/options.html) or with `pd.set_option?`):

In [None]:
pd.set_option?

In [66]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.precision', 7)
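
If you only want a setting changed for one cell rather than for the whole session, `pd.option_context` is one way to do it. A short sketch (the value 5 is arbitrary):

```python
import pandas as pd

default_rows = pd.get_option('display.max_rows')

# Inside the block the option is overridden; it is restored on exit.
with pd.option_context('display.max_rows', 5):
    inside_rows = pd.get_option('display.max_rows')

after_rows = pd.get_option('display.max_rows')
print(inside_rows, after_rows == default_rows)
```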

In [67]:
import seaborn as sns
import numpy as np
import pandas as pd

# Pandas Indexing and Selecting

Let's talk about slicing and dicing pandas data. We are going to go over four topics today:

* Review the basics
* Multi-index
* Getting Single Values
* Pointing out some stuff you don't need to worry about

As always you can check out the full documentation: [basic indexing](http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) and [advanced indexing](http://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html). But be warned that they are very long and tell you way more than you'd need to know :)

## Review the Basics

First let's start with a bit of a recap on traditional indexing and selection. To start off with, here is the data we are going to be working with (good old tips data):

In [68]:
tips = sns.load_dataset('tips', cache=False)
print(tips.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB
None


There are basically four ways to get data from dataframes:

In [70]:
tips[['total_bill', 'tip']].head()

Unnamed: 0,total_bill,tip
0,16.99,1.01
1,10.34,1.66
2,21.01,3.5
3,23.68,3.31
4,24.59,3.61


In [73]:
tips.iloc[3:5, 1:3]

Unnamed: 0,tip,sex
3,3.31,Male
4,3.61,Female


In [74]:
tips.loc[2:4, 'sex': 'smoker']

Unnamed: 0,sex,smoker
2,Male,No
3,Male,No
4,Female,No


In [None]:
# Boolean selection (also called advanced or conditional selection)

In [78]:
print(tips['tip']>3)
len(tips[tips['tip']>3])

0      False
1      False
2       True
3       True
4       True
       ...  
239     True
240    False
241    False
242    False
243    False
Name: tip, Length: 244, dtype: bool


98
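
Boolean masks can also be combined. The sketch below uses a tiny made-up stand-in for the tips data (the frame `bills` and its values are invented) so that it runs without downloading anything:

```python
import pandas as pd

# Small stand-in for the tips data (values are made up).
bills = pd.DataFrame({
    'tip': [1.0, 3.5, 4.0, 2.0],
    'day': ['Sun', 'Sat', 'Sun', 'Fri'],
})

# Combine conditions with & (and), | (or), ~ (not); each condition needs its own parentheses.
big_sunday = bills[(bills['tip'] > 3) & (bills['day'] == 'Sun')]

# isin tests membership in a list of values.
weekend = bills[bills['day'].isin(['Sat', 'Sun'])]

print(big_sunday.index.tolist())
print(len(weekend))
```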

But this is just the tip of the iceberg (well actually it's 90% of the iceberg). 

There are, however, a couple of other important concepts that you will most likely run into when diving into other pandas functionality.

# Multi-index

This is a subject that you might not think you'd need, but it turns out to be a rather frequent use case.

The initial idea behind the multi-index was to provide a framework for working with higher-dimensional data (and thus a replacement for panels).

But because some common operations produce one, it became quite commonplace. In almost all cases a multi-index comes from [groupbys](https://github.com/knathanieltucker/pandas-tutorial/blob/master/notebooks/Group%20Operations.ipynb) (you will almost never construct it or read it in yourself).

Let's do an example below:

In [79]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [101]:
mi_tips = tips.groupby(['sex','smoker']).agg({'tip':'mean', 'size': 'mean'})
mi_tips

Unnamed: 0_level_0,Unnamed: 1_level_0,tip,size
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1
Male,Yes,3.0511667,2.5
Male,No,3.1134021,2.7113402
Female,Yes,2.9315152,2.2424242
Female,No,2.7735185,2.5925926


In [102]:
mi_tips['tip']['Male']['Yes']

3.0511666666666666

In [104]:
mi_tips.loc[('Male', 'Yes'),'tip']

3.0511666666666666

In [113]:
mi_tips.at[('Male', 'Yes'), 'tip']

3.0511666666666666

In [84]:
mi_tips.index

MultiIndex([(  'Male', 'Yes'),
            (  'Male',  'No'),
            ('Female', 'Yes'),
            ('Female',  'No')],
           names=['sex', 'smoker'])
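
Two more selection patterns that work directly on a multi-index, sketched on a hand-built frame (the name `summary` and its numbers are invented, not the real tips averages): a single outer label with `loc`, and `xs` for an inner level.

```python
import pandas as pd

# Hand-built stand-in for a two-level groupby result (numbers are made up).
index = pd.MultiIndex.from_product(
    [['Male', 'Female'], ['Yes', 'No']], names=['sex', 'smoker']
)
summary = pd.DataFrame({'tip': [3.0, 3.1, 2.9, 2.8]}, index=index)

# A single outer label selects every row under it and drops that level.
male = summary.loc['Male']

# xs selects on an inner level by name.
smokers = summary.xs('Yes', level='smoker')

print(male.index.tolist())
print(smokers['tip'].tolist())
```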

Ultimately there are a ton of operations that you can do on top of this type of data. And there are equivalent multi-index operations you can do, like this:

Unnamed: 0,sex,smoker,tip,size
0,Male,Yes,3.0511667,2.5
1,Male,No,3.1134021,2.7113402
2,Female,Yes,2.9315152,2.2424242
3,Female,No,2.7735185,2.5925926


But that way you'd have to learn a lot of details, and there are always exceptions.

So the way that I have always dealt with this is simply by resetting the index.

In [106]:
ri_tips = mi_tips.reset_index()
ri_tips

Unnamed: 0,sex,smoker,tip,size
0,Male,Yes,3.0511667,2.5
1,Male,No,3.1134021,2.7113402
2,Female,Yes,2.9315152,2.2424242
3,Female,No,2.7735185,2.5925926


Notice how we get values spread out over the full column now. So in this way it is easy to select only the male non-smokers:

In [107]:
ri_tips[(ri_tips['smoker']=='No') & (ri_tips['sex']=='Male')]

Unnamed: 0,sex,smoker,tip,size
1,Male,No,3.1134021,2.7113402


Another way you can deal with this is to pull only certain index levels out:

In [108]:
out_tips = mi_tips.reset_index(level=0)
out_tips

Unnamed: 0_level_0,sex,tip,size
smoker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Yes,Male,3.0511667,2.5
No,Male,3.1134021,2.7113402
Yes,Female,2.9315152,2.2424242
No,Female,2.7735185,2.5925926


And finally you can move columns back into the index (basically only useful for certain types of merges).

In [109]:
ri_tips.set_index(['sex', 'smoker'])

Unnamed: 0_level_0,Unnamed: 1_level_0,tip,size
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1
Male,Yes,3.0511667,2.5
Male,No,3.1134021,2.7113402
Female,Yes,2.9315152,2.2424242
Female,No,2.7735185,2.5925926


# Getting Single Values

The next little indexing trick is mostly about speed: getting and setting single values. It is pretty simple:

In [110]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


When getting/setting single values you should use the `at` accessor (or `iat` for integer positions):

In [111]:
tips.at[0, 'total_bill'] = 9000
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,9000.0,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [114]:
tips.iat[0, 0]

9000.0

If you are modifying single values of a dataframe you should always use these accessors. They are faster, and they are a good way to know that you are not messing up (oftentimes modifying the data can result in odd errors).

So just to prove it's faster let's time it!

In [115]:
%%timeit
tips.at[0, 'total_bill'] = 6

11 µs ± 410 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [116]:
%%timeit
tips.loc[0, 'total_bill'] = 6

49.7 µs ± 3.43 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
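
Outside a notebook, where the `%%timeit` magic is not available, roughly the same comparison can be sketched with the standard-library `timeit` module. The frame and the helper function names here are made up, and the absolute numbers will depend on your machine and pandas version:

```python
import timeit

import pandas as pd

frame = pd.DataFrame({'total_bill': [16.99, 10.34, 21.01]})

def set_with_at():
    frame.at[0, 'total_bill'] = 6.0

def set_with_loc():
    frame.loc[0, 'total_bill'] = 6.0

# Time 1000 calls of each setter.
at_seconds = timeit.timeit(set_with_at, number=1000)
loc_seconds = timeit.timeit(set_with_loc, number=1000)
print(f"at:  {at_seconds:.4f} s per 1000 calls")
print(f"loc: {loc_seconds:.4f} s per 1000 calls")
```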


# Where, Masks and Queries

These are features built into pandas that I have personally never used, mostly because they are pretty redundant and don't come up too often.

They are a bit faster, yes. But the mental space is probably not worth it. So if you want to learn them, go for it (docs are [here](http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#the-query-method)). If not, it probably won't matter.
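
If you are curious what `query` looks like, here is a minimal sketch on a made-up frame (`bills` and `threshold` are names invented for the example), next to the equivalent boolean mask:

```python
import pandas as pd

# Small stand-in for the tips data (values are made up).
bills = pd.DataFrame({
    'tip': [1.0, 3.5, 4.0],
    'sex': ['Female', 'Male', 'Female'],
})

threshold = 3  # local variables are referenced with @ inside the expression

with_query = bills.query('tip > @threshold and sex == "Female"')
with_mask = bills[(bills['tip'] > threshold) & (bills['sex'] == 'Female')]

print(with_query.equals(with_mask))
```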

Let me show you how you'd duplicate mask functionality below. 

In [129]:
arr = np.random.randn(25).reshape((5, 5))
arr.ndim

2

In [130]:
df = pd.DataFrame(arr)
df

Unnamed: 0,0,1,2,3,4
0,0.5817368,-0.0846473,-0.4443737,0.1563095,0.0665524
1,-0.4544828,1.582635,-0.4338208,-0.2266673,-0.6480814
2,-0.6859754,-0.1049425,1.8529372,-0.5325923,0.347486
3,-0.5125943,0.4710272,-0.8094906,-0.0329974,0.609457
4,-0.2792712,0.2243155,0.6576722,-2.4911714,0.0480442


In [131]:
df.where(df > 0)

Unnamed: 0,0,1,2,3,4
0,0.5817368,,,0.1563095,0.0665524
1,,1.582635,,,
2,,,1.8529372,,0.347486
3,,0.4710272,,,0.609457
4,,0.2243155,0.6576722,,0.0480442


In [134]:
df[df > 0]

Unnamed: 0,0,1,2,3,4
0,0.5817368,,,0.1563095,0.0665524
1,,1.582635,,,
2,,,1.8529372,,0.347486
3,,0.4710272,,,0.609457
4,,0.2243155,0.6576722,,0.0480442
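
Where `where` goes a little beyond plain boolean indexing is its optional replacement value, and `mask` is simply its inverse. A small sketch on an invented frame called `grid`:

```python
import pandas as pd

grid = pd.DataFrame({'a': [1.0, -2.0], 'b': [-3.0, 4.0]})

kept = grid.where(grid > 0, 0)    # keep values where the condition holds, otherwise use 0
hidden = grid.mask(grid > 0)      # the inverse: blank out values where the condition holds

print(kept.to_numpy().tolist())
print(hidden.isna().to_numpy().tolist())
```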


# Missing Data in Pandas

In [135]:
import pandas as pd
import numpy as np

In [136]:
s = pd.Series(["Sam",np.nan,"Tim","Kim"])
s

0    Sam
1    NaN
2    Tim
3    Kim
dtype: object

In [138]:
print(s.isnull())
s.isnull().sum()

0    False
1     True
2    False
3    False
dtype: bool


1

In [139]:
print(s.notnull())
s.notnull().sum()

0     True
1    False
2     True
3     True
dtype: bool


3

In [140]:
s[3] = None
s.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [141]:
s

0     Sam
1     NaN
2     Tim
3    None
dtype: object

In [142]:
s[0] = ""  # an empty string is not treated as missing
s.isnull()

0    False
1     True
2    False
3     True
dtype: bool
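
Since empty strings are not counted as missing, one option is to convert them to NaN first. This is only a sketch with made-up data; whether an empty string really means "missing" depends on your dataset:

```python
import numpy as np
import pandas as pd

names = pd.Series(["", "Tim", None])

# Turn empty strings into NaN so the missing-data tools can see them.
cleaned = names.replace("", np.nan)

print(names.isnull().tolist())
print(cleaned.isnull().tolist())
```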

In [144]:
s[0] = "Sam"
s.dropna()

0    Sam
2    Tim
dtype: object

In [146]:
from numpy import nan as NA

In [147]:
df=pd.DataFrame([[1,2,3],[4,NA,5],
                 [NA,NA,NA]])
df

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,,5.0
2,,,


In [148]:
df.dropna()

Unnamed: 0,0,1,2
0,1.0,2.0,3.0


In [149]:
df.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,,5.0


In [154]:
df.dropna(thresh=2)  # keep only the rows that have at least 2 non-null values

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,,5.0
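
`dropna` can also work on columns, or look at only some columns when deciding which rows to drop. A short sketch on an invented frame called `table`:

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({
    'x': [1.0, np.nan, 3.0],
    'y': [np.nan, np.nan, np.nan],
    'z': [7.0, 8.0, np.nan],
})

no_empty_columns = table.dropna(axis=1, how='all')  # drop columns that are entirely missing
x_present = table.dropna(subset=['x'])              # only require 'x' to be non-null

print(no_empty_columns.columns.tolist())
print(x_present.index.tolist())
```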


In [156]:
df

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,,5.0
2,,,


In [157]:
df.fillna({0:15, 1:20, 2:25})

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,20.0,5.0
2,15.0,20.0,25.0


In [158]:
df.ffill()  # forward fill; fillna(method="ffill") is deprecated in recent pandas

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,2.0,5.0
2,4.0,2.0,5.0


In [159]:
df.ffill(limit=1)  # fill at most one consecutive missing value per column

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,2.0,5.0
2,4.0,,5.0
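
Two other common fills, sketched on a made-up frame: filling each column with its own mean, and filling backward instead of forward. Which one is appropriate (if any) depends on the data.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 4.0, 6.0]})

# fillna accepts a Series keyed by column name, so each column gets its own mean.
filled_mean = table.fillna(table.mean())

# bfill is the backward counterpart of ffill.
filled_back = table.bfill()

print(filled_mean['x'].tolist())
print(filled_back['y'].tolist())
```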


# Arithmetic Operations in Pandas

In [160]:
s1=pd.Series(np.arange(4),
             index=["a","c","d","e"])
s2=pd.Series(np.arange(5),
             index=["a","c","e","f","g"])

In [165]:
s1

a    0
c    1
d    2
e    3
dtype: int64

In [162]:
s2

a    0
c    1
e    2
f    3
g    4
dtype: int64

In [163]:
s1 + s2

a    0.0
c    2.0
d    NaN
e    5.0
f    NaN
g    NaN
dtype: float64
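
The NaN values appear because arithmetic aligns on index labels, and a label that exists on only one side has nothing to be added to. If you would rather treat the missing side as 0, the method form `add` takes a `fill_value`. A sketch with two invented series:

```python
import pandas as pd

left = pd.Series([0, 1, 2], index=['a', 'c', 'd'])
right = pd.Series([10, 20, 30], index=['a', 'c', 'e'])

plain = left + right                     # unmatched labels become NaN
filled = left.add(right, fill_value=0)   # a label missing on one side counts as 0

print(plain.isna().tolist())
print(filled.tolist())
```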

In [166]:
df1=pd.DataFrame(
    np.arange(6).reshape(2,3),
    columns=list("ABC"),
    index=["Tim","Tom"])
df2=pd.DataFrame(
    np.arange(9).reshape(3,3),
    columns=list("ACD"),
    index=["Tim","Kate","Tom"])

In [172]:
df1.at['Tim', 'A'] = np.nan
df1

Unnamed: 0,A,B,C
Tim,,1,2
Tom,3.0,4,5


In [179]:
df2.at['Tim', 'A'] = np.nan
df2

Unnamed: 0,A,C,D
Tim,,1,2
Kate,3.0,4,5
Tom,6.0,7,8


In [180]:
df1 + df2

Unnamed: 0,A,B,C,D
Kate,,,,
Tim,,,3.0,
Tom,9.0,,12.0,


In [181]:
df1.add(df2, fill_value=0)  # a value missing in only one frame is treated as 0; cells missing in both stay NaN

Unnamed: 0,A,B,C,D
Kate,3.0,,4.0,5.0
Tim,,1.0,3.0,2.0
Tom,9.0,4.0,12.0,8.0


In [171]:
1/df1  # run before ('Tim', 'A') was set to NaN (note the cell number), so the original 0 gives inf

Unnamed: 0,A,B,C
Tim,inf,1.0,0.5
Tom,0.3333333,0.25,0.2


In [182]:
3*df

Unnamed: 0,0,1,2
0,3.0,6.0,9.0
1,12.0,,15.0
2,,,
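
Arithmetic between a dataframe and a series broadcasts as well. A minimal sketch on a made-up frame called `scores`: by default the series is matched against the column labels, and the method form with `axis=0` matches on the row labels instead.

```python
import pandas as pd

scores = pd.DataFrame({'A': [1.0, 3.0], 'B': [2.0, 6.0]}, index=['Tim', 'Tom'])

# A Series is matched against the column labels and applied to every row.
minus_first_row = scores - scores.iloc[0]

# To match on the row labels instead, use the method form with axis=0.
minus_row_mean = scores.sub(scores.mean(axis=1), axis=0)

print(minus_first_row.to_numpy().tolist())
print(minus_row_mean.to_numpy().tolist())
```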


## Applying a Function