# Python for Data Science


### Pandas 

#### Intro Video - Wrangling Data with Pandas - Google Cloud Platform

<a href="http://www.youtube.com/watch?feature=player_embedded&v=XDAnFZqJDvI" target="_blank"><img src="http://img.youtube.com/vi/XDAnFZqJDvI/0.jpg" alt="Intro video: Wrangling Data with Pandas" width="240" height="180" border="10" /></a>

In [5]:
import pandas as pd
pd.__version__

'0.23.0'

In [6]:
import numpy as np
import matplotlib.pyplot as plt

### Object Creation

Creating a `Series` by passing a list of values, letting pandas create a default integer index:

In [7]:
s = pd.Series([1,3,5,np.nan,6,8])

In [8]:
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
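
As a small illustration of the same constructor, the sketch below passes an explicit index instead of relying on the default integer one. The labels `a` to `f` and the name `s_labeled` are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

# Same values as above, with hypothetical string labels instead of 0..5
labels = list("abcdef")
s_labeled = pd.Series([1, 3, 5, np.nan, 6, 8], index=labels)

# The NaN forces a float64 dtype even though the other values are integers
print(s_labeled.dtype)

# isna() flags the missing entry; count() skips it
missing = int(s_labeled.isna().sum())
present = int(s_labeled.count())
print(missing, present)
```

Values can then be looked up by label, for example `s_labeled["c"]`.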

Creating a `DataFrame` by passing a NumPy array, with a datetime index and labeled columns:

In [9]:
dates = pd.date_range('20180101', periods=6)

In [10]:
dates

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06'],
              dtype='datetime64[ns]', freq='D')

In [11]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

In [12]:
df

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2018-01-01 | 0.161521 | -0.176777 | -1.053566 | 0.055093 |
| 2018-01-02 | -0.355685 | -0.750852 | -0.630708 | 0.233918 |
| 2018-01-03 | -0.561194 | 0.850295 | -1.190817 | 0.296357 |
| 2018-01-04 | 0.508058 | 0.170012 | -0.025409 | -0.0566 |
| 2018-01-05 | -1.416954 | -0.677444 | 1.301718 | 1.061319 |
| 2018-01-06 | -0.428887 | 0.05636 | -0.836003 | -0.906351 |


Creating a `DataFrame` by passing a dict of objects that can each be converted to a series-like structure:

In [13]:
df2 = pd.DataFrame({ 'A' : 1.,
            'B' : pd.Timestamp('20180102'),
            'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
            'D' : np.array([3] * 4,dtype='int32'),
            'E' : pd.Categorical(["test","train","test","train"]),
            'F' : 'foo' })

In [14]:
df2

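|  | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 2018-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2018-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2018-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2018-01-02 | 1.0 | 3 | train | foo |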


In [15]:
#The columns of the resulting DataFrame have different dtypes.
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
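
The dict constructor above mixes scalars with array-like values. A minimal sketch of what appears to happen: the array-like entry fixes the number of rows and each scalar is repeated to that length. The column names `scalar`, `values` and `label` are hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "scalar": 1.0,                                     # repeated for every row
    "values": np.array([10, 20, 30], dtype="int32"),   # fixes the length at 3
    "label": "foo",                                    # repeated for every row
})

n_rows = len(frame)
# Each column keeps its own dtype; collect them by column name
dtype_names = {col: str(dt) for col, dt in frame.dtypes.items()}
print(n_rows)
print(dtype_names)
```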

In [16]:
# Tab completion: in IPython, type `df2.` and press TAB to list the frame's
# attributes, which include its columns (A, C and D below). The literal text
# `df2.<TAB>` is not valid Python, so the listing is kept here as comments.
# df2.A                  df2.bool
# df2.abs                df2.boxplot
# df2.add                df2.C
# df2.add_prefix         df2.clip
# df2.add_suffix         df2.clip_lower
# df2.align              df2.clip_upper
# df2.all                df2.columns
# df2.any                df2.combine
# df2.append             df2.combine_first
# df2.apply              df2.compound
# df2.applymap           df2.consolidate
# df2.D

### Viewing Data
Here is how to view the top and bottom rows of the frame:

In [17]:
df2.head()

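|  | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 2018-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2018-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2018-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2018-01-02 | 1.0 | 3 | train | foo |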


In [18]:
df2.tail()

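|  | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 2018-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2018-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2018-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2018-01-02 | 1.0 | 3 | train | foo |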


Display the index, columns, and the underlying NumPy data:

In [19]:
df.index

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06'],
              dtype='datetime64[ns]', freq='D')

In [20]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [21]:
df.values

array([[ 0.1615208 , -0.17677725, -1.05356561,  0.05509325],
       [-0.3556852 , -0.75085224, -0.6307075 ,  0.23391823],
       [-0.56119445,  0.85029505, -1.19081697,  0.29635654],
       [ 0.5080582 ,  0.17001176, -0.02540858, -0.05659981],
       [-1.416954  , -0.67744376,  1.30171821,  1.06131927],
       [-0.42888651,  0.0563596 , -0.83600288, -0.90635132]])

`describe()` shows a quick statistical summary of the numeric columns:

In [22]:
df.describe()

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | -0.348857 | -0.088068 | -0.405797 | 0.113956 |
| std | 0.661386 | 0.593718 | 0.930787 | 0.635412 |
| min | -1.416954 | -0.750852 | -1.190817 | -0.906351 |
| 25% | -0.528117 | -0.552277 | -0.999175 | -0.028677 |
| 50% | -0.392286 | -0.060209 | -0.733355 | 0.144506 |
| 75% | 0.032219 | 0.141599 | -0.176733 | 0.280747 |
| max | 0.508058 | 0.850295 | 1.301718 | 1.061319 |


In [23]:
df2.describe()

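|  | A | C | D |
| --- | --- | --- | --- |
| count | 4.0 | 4.0 | 4.0 |
| mean | 1.0 | 1.0 | 3.0 |
| std | 0.0 | 0.0 | 0.0 |
| min | 1.0 | 1.0 | 3.0 |
| 25% | 1.0 | 1.0 | 3.0 |
| 50% | 1.0 | 1.0 | 3.0 |
| 75% | 1.0 | 1.0 | 3.0 |
| max | 1.0 | 1.0 | 3.0 |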

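Note that `df2.describe()` reports only the numeric columns A, C and D. A minimal sketch of how the non-numeric columns can be included as well, using the `include="all"` option on a small hypothetical frame (`small`, `score` and `split` are made-up names):

```python
import pandas as pd

small = pd.DataFrame({
    "score": [1.0, 2.0, 3.0, 4.0],
    "split": pd.Categorical(["test", "train", "test", "train"]),
})

# Default: only numeric columns are summarised
numeric_only = small.describe()

# include="all" adds rows such as `unique`, `top` and `freq` for the rest;
# cells that do not apply to a column are filled with NaN
everything = small.describe(include="all")

print(list(numeric_only.columns))
print(list(everything.columns))
```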

Transposing your data:

In [24]:
df.T

|  | 2018-01-01 00:00:00 | 2018-01-02 00:00:00 | 2018-01-03 00:00:00 | 2018-01-04 00:00:00 | 2018-01-05 00:00:00 | 2018-01-06 00:00:00 |
| --- | --- | --- | --- | --- | --- | --- |
| A | 0.161521 | -0.355685 | -0.561194 | 0.508058 | -1.416954 | -0.428887 |
| B | -0.176777 | -0.750852 | 0.850295 | 0.170012 | -0.677444 | 0.05636 |
| C | -1.053566 | -0.630708 | -1.190817 | -0.025409 | 1.301718 | -0.836003 |
| D | 0.055093 | 0.233918 | 0.296357 | -0.0566 | 1.061319 | -0.906351 |


Sorting by an axis:

In [46]:
df.sort_index(axis=1, ascending=False)

|  | D | C | B | A |
| --- | --- | --- | --- | --- |
| 2018-01-01 | 0.055093 | -1.053566 | -0.176777 | 0.161521 |
| 2018-01-02 | 0.233918 | -0.630708 | -0.750852 | -0.355685 |
| 2018-01-03 | 0.296357 | -1.190817 | 0.850295 | -0.561194 |
| 2018-01-04 | -0.0566 | -0.025409 | 0.170012 | 0.508058 |
| 2018-01-05 | 1.061319 | 1.301718 | -0.677444 | -1.416954 |
| 2018-01-06 | -0.906351 | -0.836003 | 0.05636 | -0.428887 |


In [47]:
df.sort_values(by='B')

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2018-01-02 | -0.355685 | -0.750852 | -0.630708 | 0.233918 |
| 2018-01-05 | -1.416954 | -0.677444 | 1.301718 | 1.061319 |
| 2018-01-01 | 0.161521 | -0.176777 | -1.053566 | 0.055093 |
| 2018-01-06 | -0.428887 | 0.05636 | -0.836003 | -0.906351 |
| 2018-01-04 | 0.508058 | 0.170012 | -0.025409 | -0.0566 |
| 2018-01-03 | -0.561194 | 0.850295 | -1.190817 | 0.296357 |

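Both sorting calls return a new frame and leave the original untouched. The sketch below illustrates this on a small hypothetical frame (`scores`, with made-up columns `group` and `value`), and also shows a descending sort and a sort on two keys.

```python
import pandas as pd

scores = pd.DataFrame(
    {"group": ["b", "a", "b", "a"], "value": [2, 4, 1, 3]},
    index=["w", "x", "y", "z"],
)

# Largest value first
by_value_desc = scores.sort_values(by="value", ascending=False)

# Sort by group, then by value within each group
by_group_then_value = scores.sort_values(by=["group", "value"])

print(list(by_value_desc.index))
print(list(by_group_then_value.index))
print(list(scores.index))  # the original row order is unchanged
```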

### Selection - Getting
Selecting a single column yields a `Series`; `df['A']` is equivalent to `df.A`:

In [25]:
df['A']

2018-01-01    0.161521
2018-01-02   -0.355685
2018-01-03   -0.561194
2018-01-04    0.508058
2018-01-05   -1.416954
2018-01-06   -0.428887
Freq: D, Name: A, dtype: float64

Selecting via `[]` with a slice, which slices the rows:

In [26]:
df[0:3]

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2018-01-01 | 0.161521 | -0.176777 | -1.053566 | 0.055093 |
| 2018-01-02 | -0.355685 | -0.750852 | -0.630708 | 0.233918 |
| 2018-01-03 | -0.561194 | 0.850295 | -1.190817 | 0.296357 |


In [28]:
df['20180102':'20180104']

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2018-01-02 | -0.355685 | -0.750852 | -0.630708 | 0.233918 |
| 2018-01-03 | -0.561194 | 0.850295 | -1.190817 | 0.296357 |
| 2018-01-04 | 0.508058 | 0.170012 | -0.025409 | -0.0566 |

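The `[]` operator behaves differently depending on what it is given, which can be surprising. A minimal sketch on a hypothetical frame with deterministic values (`frame` is not the `df` used above):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("20180101", periods=4)
frame = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx, columns=["A", "B"])

column = frame["A"]               # a single label selects a column -> Series
first_rows = frame[0:2]           # a slice selects rows -> DataFrame
two_columns = frame[["A", "B"]]   # a list of labels selects columns -> DataFrame

print(type(column).__name__, first_rows.shape, two_columns.shape)
```

Because of this overloading, the explicit `.loc` and `.iloc` indexers are often easier to read in longer code.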

#### Selection by Label

In [29]:
df.loc[dates[0]]

A    0.161521
B   -0.176777
C   -1.053566
D    0.055093
Name: 2018-01-01 00:00:00, dtype: float64

In [30]:
df.loc[:,['A','B']]

|  | A | B |
| --- | --- | --- |
| 2018-01-01 | 0.161521 | -0.176777 |
| 2018-01-02 | -0.355685 | -0.750852 |
| 2018-01-03 | -0.561194 | 0.850295 |
| 2018-01-04 | 0.508058 | 0.170012 |
| 2018-01-05 | -1.416954 | -0.677444 |
| 2018-01-06 | -0.428887 | 0.05636 |


In [33]:
df.loc['20180102':'20180104',['A','B']]  # Label slicing: both endpoints are included

|  | A | B |
| --- | --- | --- |
| 2018-01-02 | -0.355685 | -0.750852 |
| 2018-01-03 | -0.561194 | 0.850295 |
| 2018-01-04 | 0.508058 | 0.170012 |


In [34]:
df.loc['20180102',['A','B']]  # A single row label reduces the result to a Series

A   -0.355685
B   -0.750852
Name: 2018-01-02 00:00:00, dtype: float64

In [36]:
df.loc[dates[0],'A']  # A row label and a column label return a scalar

0.16152079623063323

In [37]:
df.at[dates[0],'A']  # Fast scalar access by label (same result as the .loc call above)

0.16152079623063323

#### Selection by Position

In [38]:
df.iloc[3]

A    0.508058
B    0.170012
C   -0.025409
D   -0.056600
Name: 2018-01-04 00:00:00, dtype: float64

In [39]:
df.iloc[3:5,0:2]

|  | A | B |
| --- | --- | --- |
| 2018-01-04 | 0.508058 | 0.170012 |
| 2018-01-05 | -1.416954 | -0.677444 |


In [40]:
df.iloc[[1,2,4],[0,2]]  # Lists of integer positions, similar to NumPy/Python indexing

|  | A | C |
| --- | --- | --- |
| 2018-01-02 | -0.355685 | -0.630708 |
| 2018-01-03 | -0.561194 | -1.190817 |
| 2018-01-05 | -1.416954 | 1.301718 |


In [45]:
df.iloc[1:3,:]  # Slicing rows explicitly

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2018-01-02 | -0.355685 | -0.750852 | -0.630708 | 0.233918 |
| 2018-01-03 | -0.561194 | 0.850295 | -1.190817 | 0.296357 |


In [46]:
df.iloc[:,1:3]  # Slicing columns explicitly

|  | B | C |
| --- | --- | --- |
| 2018-01-01 | -0.176777 | -1.053566 |
| 2018-01-02 | -0.750852 | -0.630708 |
| 2018-01-03 | 0.850295 | -1.190817 |
| 2018-01-04 | 0.170012 | -0.025409 |
| 2018-01-05 | -0.677444 | 1.301718 |
| 2018-01-06 | 0.05636 | -0.836003 |


In [47]:
df.iloc[1,1]  # A row position and a column position return a scalar

-0.7508522352909704

In [48]:
df.iat[1,1]  # Fast scalar access by position (same result as the .iloc call above)

-0.7508522352909704
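
The label-based and position-based indexers treat slice endpoints differently: `.loc` includes the end label, while `.iloc` excludes the end position, as ordinary Python slices do. The sketch below contrasts the two on a hypothetical frame filled with `np.arange` so the values are predictable.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("20180101", periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=idx, columns=list("ABCD"))

# Label slice: 2018-01-02 through 2018-01-04 inclusive -> 3 rows
by_label = frame.loc["20180102":"20180104", ["A", "B"]]

# Position slice: rows 1 and 2 only, since the stop position 3 is excluded
by_position = frame.iloc[1:3, 0:2]

# Scalar access by label and by position
first_a = frame.at[idx[0], "A"]
middle = frame.iat[1, 1]

print(by_label.shape, by_position.shape, first_a, middle)
```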

### Selection - Setting

In [52]:
# A Series indexed from 2018-01-02; assigning it as a new column would align it with df's index.
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20180102', periods=6))

In [53]:
s1 

2018-01-02    1
2018-01-03    2
2018-01-04    3
2018-01-05    4
2018-01-06    5
2018-01-07    6
Freq: D, dtype: int64
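
The cells above build `s1` but do not assign it to the frame. As a hedged sketch of the alignment described in the comment, the example below assigns a similarly shifted Series to a hypothetical frame under the made-up column name `F`. Rows with no matching label become NaN, and labels that exist only in the Series are dropped.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("20180101", periods=6)
frame = pd.DataFrame({"A": np.zeros(6)}, index=idx)

# Starts one day later than the frame's index, like s1 above
shifted = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20180102", periods=6))

# Assignment matches on index labels, not on position
frame["F"] = shifted

print(frame)
```

Here 2018-01-01 has no counterpart in `shifted`, so its `F` value is NaN, and the value for 2018-01-07 does not appear in the frame.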

In [54]:
df.at[dates[0],'A'] = 0  # Setting a value by label

In [55]:
df.iat[0,1] = 0  # Setting a value by position

In [56]:
df

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2018-01-01 | 0.0 | 0.0 | -1.053566 | 0.055093 |
| 2018-01-02 | -0.355685 | -0.750852 | -0.630708 | 0.233918 |
| 2018-01-03 | -0.561194 | 0.850295 | -1.190817 | 0.296357 |
| 2018-01-04 | 0.508058 | 0.170012 | -0.025409 | -0.0566 |
| 2018-01-05 | -1.416954 | -0.677444 | 1.301718 | 1.061319 |
| 2018-01-06 | -0.428887 | 0.05636 | -0.836003 | -0.906351 |


Quiz

In [57]:
# Setting a whole column by assigning a NumPy array with one value per row:
df.loc[:,'D'] = np.array([5] * len(df))
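
A minimal, self-contained sketch of the same kind of column assignment, run on a hypothetical frame so the result can be checked. It also shows that a single scalar can be assigned in place of an array of repeated values.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("20180101", periods=6)
frame = pd.DataFrame(np.random.randn(6, 4), index=idx, columns=list("ABCD"))

# Array assignment: one value per row, so the array length equals len(frame)
frame.loc[:, "D"] = np.array([5] * len(frame))
all_five = bool((frame["D"] == 5).all())

# Scalar assignment: the single value is repeated down the column
frame.loc[:, "C"] = 0
all_zero = bool((frame["C"] == 0).all())

print(all_five, all_zero)
```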