# Python For Machine Learning Bootcamp
## 3. Pandas
### Beginner Intro
Pandas vs. NumPy:
* As a rough rule of thumb, pandas tends to perform better on large datasets (around 500k rows or more), whilst NumPy tends to be faster on smaller ones (around 50k rows or fewer).
* NumPy arrays are accessed via their built-in integer index, whilst pandas lets you assign custom labels to a Series, which allows custom sorting, accessing etc.
* Pandas is generally great with tables, time series, arbitrary data etc.

Pandas Object Definitions:
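* 1 dimension = Series
* 2 dimensions = DataFrame
* 3 dimensions = Panel (deprecated and since removed from pandas; a DataFrame with a MultiIndex is the usual replacement)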

### Series Object

In [3]:
# load modules
import pandas as pd

# create basic list
data = [1, 2, 3, 4]

# convert into Series
series1 = pd.Series(data) # auto given an index that's external to the Series

# check object type
type(series1)

pandas.core.series.Series

* Changing the index labels is important/useful for customisation.

In [4]:
# check current index
series1

0    1
1    2
2    3
3    4
dtype: int64

In [6]:
# change index labels
series1 = pd.Series(data, index = ['a', 'b', 'c', 'd'])
series1

a    1
b    2
c    3
d    4
dtype: int64
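
* A minimal sketch of what custom labels allow, reusing the same values as above. The replacement labels ('w' to 'z') are made up for illustration.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# look up a value by its label
by_label = series1['b']

# look up the same value by integer position
by_position = series1.iloc[1]

# relabel an existing Series by assigning a new index
series1.index = ['w', 'x', 'y', 'z']
```

* Label lookups and positional lookups (via iloc) return the same value here; only the way of addressing it differs.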

### DataFrame Object

* 2 dimensional objects in pandas are called DataFrames.
* Each column can have a different type, but the data type within each column must be the same.
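
* A small sketch of the one-type-per-column rule, using a hypothetical three-column table (the names and values are made up for illustration).

```python
import pandas as pd

# hypothetical data with one text, one integer and one float column
df_types = pd.DataFrame({'fruit': ['apple', 'banana'],
                         'count': [10, 20],
                         'price': [0.5, 0.25]})

# each column reports its own dtype
print(df_types.dtypes)

# a single column is a Series with exactly one dtype
count_is_int = pd.api.types.is_integer_dtype(df_types['count'])
price_is_float = pd.api.types.is_float_dtype(df_types['price'])
```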

In [11]:
# create data as list
data = [1, 2, 3, 4, 5]

# load into dataframe
df = pd.DataFrame(data)
df

Unnamed: 0,0
0,1
1,2
2,3
3,4
4,5


In [12]:
# create dict
d1 = {'fruits':['apples', 'banana', 'mangoes'],
      'count':[10, 20, 15]}

# load into df
df = pd.DataFrame(d1)
df

Unnamed: 0,fruits,count
0,apples,10
1,banana,20
2,mangoes,15


In [13]:
# create series
s = pd.Series([6, 12], index = ['a', 'b'])

# load into df
df = pd.DataFrame(s)
df

Unnamed: 0,0
a,6
b,12


* You can also load numpy arrays (and other objects) into dataframes.
* The above examples used standard Python objects (list, dict) and pandas objects (series).
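
* A hedged sketch of loading a 2D numpy array, assuming made-up 'id' and 'salary' columns. It also shows a gotcha worth knowing: a numpy array holds a single type, so mixing numbers and strings turns everything into strings.

```python
import numpy as np
import pandas as pd

# hypothetical 2x2 numeric array (rows = records, columns = fields)
arr2d = np.array([[1, 50000], [2, 60000]])

# name the columns when loading the array
df_arr = pd.DataFrame(arr2d, columns=['id', 'salary'])

# mixing numbers and strings makes every element a string
mixed = np.array([[50000, 60000], ['John', 'James']])
mixed_kind = mixed.dtype.kind  # 'U' means unicode string
```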

In [15]:
# load modules
import numpy as np

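# create numpy array (mixing numbers and strings makes every element a string)
arr = np.array([[50000, 60000], ['John', 'James']])

# load into df, converting the salaries back to integers
df = pd.DataFrame({'name':arr[1], 'salary':arr[0].astype(int)})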
df

Unnamed: 0,name,salary
0,John,50000
1,James,60000


### Merge, Join and Concatenate
### 1) Merge
* First, we'll look at a merge.

In [21]:
# create data lists for df1
player = ['P1', 'P2', 'P3']
point = [8, 9, 6]
title = ['Game1', 'Game 2', 'Game 3']

# load into dfs
df1 = pd.DataFrame({'Player':player, 'Points':point, 'Title':title})

# create data lists for df2
player = ['P1', 'P5', 'P6']
power = ['Punch', 'Kick', 'Elbow']
title = ['Game1', 'Game 5', 'Game6']

# load into df
df2 = pd.DataFrame({'Player':player, 'Power':power, 'Title':title})

# show dfs
df1

Unnamed: 0,Player,Points,Title
0,P1,8,Game1
1,P2,9,Game 2
2,P3,6,Game 3


In [22]:
df2

Unnamed: 0,Player,Power,Title
0,P1,Punch,Game1
1,P5,Kick,Game 5
2,P6,Elbow,Game6


* Both dataframes have 'Player' and 'Title' columns, and these share some values (e.g. P1 and Game1), so they are good columns to merge on.
* The below merge is an inner merge, which means it only keeps rows which have matching values in the specified column(s).
* Because the merge is only on 'Player', pandas treats "Title" as two unrelated columns and keeps both, adding the suffixes _x (from df1) and _y (from df2) to tell them apart.

In [24]:
# inner merge
df1.merge(df2, on='Player', how='inner')

Unnamed: 0,Player,Points,Title_x,Power,Title_y
0,P1,8,Game1,Punch,Game1
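
* The Title_x/Title_y split above can be avoided by merging on both shared columns. The sketch below rebuilds the same two dataframes; the suffix names '_df1' and '_df2' are just illustrative choices.

```python
import pandas as pd

df1 = pd.DataFrame({'Player': ['P1', 'P2', 'P3'],
                    'Points': [8, 9, 6],
                    'Title': ['Game1', 'Game 2', 'Game 3']})
df2 = pd.DataFrame({'Player': ['P1', 'P5', 'P6'],
                    'Power': ['Punch', 'Kick', 'Elbow'],
                    'Title': ['Game1', 'Game 5', 'Game6']})

# merging on both shared columns keeps a single 'Title' column
merged_both = df1.merge(df2, on=['Player', 'Title'], how='inner')

# alternatively, merge on 'Player' only but choose clearer suffixes
merged_suffix = df1.merge(df2, on='Player', how='inner',
                          suffixes=('_df1', '_df2'))
```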


* A left merge takes every row from the first dataframe (df1 below) and only pulls in values from the second dataframe (df2 below) where there is a match on the specified column.
* For all non-matching rows, you'll get NaN values.

In [25]:
# left merge
df1.merge(df2, on='Player', how='left')

Unnamed: 0,Player,Points,Title_x,Power,Title_y
0,P1,8,Game1,Punch,Game1
1,P2,9,Game 2,,
2,P3,6,Game 3,,


* Right merge is identical to left except it keeps all rows from the second dataframe and only pulls in matches from the first dataframe.

In [26]:
# right merge
df1.merge(df2, on='Player', how='right')

Unnamed: 0,Player,Points,Title_x,Power,Title_y
0,P1,8.0,Game1,Punch,Game1
1,P5,,,Kick,Game 5
2,P6,,,Elbow,Game6


* Outer merge will take all rows from both dataframes, merging any rows which have matching values in the specified column(s) and filling the gaps with NaN for rows it cannot match.

In [28]:
# outer merge
df1.merge(df2, on='Player', how='outer')

Unnamed: 0,Player,Points,Title_x,Power,Title_y
0,P1,8.0,Game1,Punch,Game1
1,P2,9.0,Game 2,,
2,P3,6.0,Game 3,,
3,P5,,,Kick,Game 5
4,P6,,,Elbow,Game6
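
* When checking an outer merge, it can help to see which side each row came from. A minimal sketch using the 'indicator' parameter of merge, on cut-down versions of the dataframes above (the names 'left' and 'right' are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'Player': ['P1', 'P2'], 'Points': [8, 9]})
right = pd.DataFrame({'Player': ['P1', 'P5'], 'Power': ['Punch', 'Kick']})

# indicator=True adds a '_merge' column: both, left_only or right_only
merged = left.merge(right, on='Player', how='outer', indicator=True)

# map each player to where their row came from
origin = dict(zip(merged['Player'], merged['_merge'].astype(str)))
```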


### 2) Join
* Join is very similar to merge.
* By default, 'merge' matches rows on the values in shared columns (specified with 'on').
* Whereas 'join' matches rows on their index labels (i.e. the values in the index).

In [50]:
# create data lists for df3
player = ['P1', 'P5', 'P6']
power = ['Punch', 'Kick', 'Elbow']
title = ['Game1', 'Game 5', 'Game6']

# load into df3
df3 = pd.DataFrame({'Player':player,
                    'Power':power,
                    'Title':title},
                    index = ['L1', 'L2', 'L3'])

# create data lists for df4
player = ['P1', 'P2', 'P3']
point = [8, 9, 6]
title = ['Game1', 'Game 2', 'Game 3']

# load into df4
df4 = pd.DataFrame({'Players':player,
                    'Point':point,
                    'Titles':title},
                    index = ['L2', 'L3', 'L4'])

# show df
df3

Unnamed: 0,Player,Power,Title
L1,P1,Punch,Game1
L2,P5,Kick,Game 5
L3,P6,Elbow,Game6


In [51]:
df4

Unnamed: 0,Players,Point,Titles
L2,P1,8,Game1
L3,P2,9,Game 2
L4,P3,6,Game 3


* The inner/left/right/outer rules are identical between merges and joins.
* The difference is that the join is working based off the index.
* You can see below that only rows with common indexes between the two dataframes are kept.

In [52]:
# inner join
df3.join(df4, how='inner')

Unnamed: 0,Player,Power,Title,Players,Point,Titles
L2,P5,Kick,Game 5,P1,8,Game1
L3,P6,Elbow,Game6,P2,9,Game 2


* Again, in the below joins, left keeps all rows from first df and fills with nulls for non-matching rows.
* Right keeps all rows from second df and fills with nulls for non-matching rows.
* Outer keeps all rows from both dataframes and joins only where indexes match, using nulls in all other cases.

In [53]:
# left join
df3.join(df4, how='left')

Unnamed: 0,Player,Power,Title,Players,Point,Titles
L1,P1,Punch,Game1,,,
L2,P5,Kick,Game 5,P1,8.0,Game1
L3,P6,Elbow,Game6,P2,9.0,Game 2


In [55]:
# right join
df3.join(df4, how='right')

Unnamed: 0,Player,Power,Title,Players,Point,Titles
L2,P5,Kick,Game 5,P1,8,Game1
L3,P6,Elbow,Game6,P2,9,Game 2
L4,,,,P3,6,Game 3


In [54]:
# outer join
df3.join(df4, how='outer')

Unnamed: 0,Player,Power,Title,Players,Point,Titles
L1,P1,Punch,Game1,,,
L2,P5,Kick,Game 5,P1,8.0,Game1
L3,P6,Elbow,Game6,P2,9.0,Game 2
L4,,,,P3,6.0,Game 3
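
* To make the link between the two explicit: a join on the index should give the same result as a merge told to use both indexes. A small sketch on cut-down, hypothetical versions of the dataframes above:

```python
import pandas as pd

left = pd.DataFrame({'Power': ['Punch', 'Kick', 'Elbow']},
                    index=['L1', 'L2', 'L3'])
right = pd.DataFrame({'Point': [8, 9, 6]},
                     index=['L2', 'L3', 'L4'])

# join matches on index labels
joined = left.join(right, how='inner')

# merge can do the same when pointed at both indexes
merged = left.merge(right, left_index=True, right_index=True, how='inner')

same_result = joined.equals(merged)
```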


### 3) Concatenate
* Concatenation literally appends one dataframe to another, rather than doing any merging.
* If column labels are shared then the values are stacked in line with each other (i.e. you will get 3 columns instead of 6 if the 3 columns in each dataframe have the same labels), but again, no merging will occur.
* NaN values are created where columns are not shared between dfs.
* The original index labels are kept, so they can repeat (e.g. L2 and L3 below).

In [56]:
# show df3
df3

Unnamed: 0,Player,Power,Title
L1,P1,Punch,Game1
L2,P5,Kick,Game 5
L3,P6,Elbow,Game6


In [57]:
# show df4
df4

Unnamed: 0,Players,Point,Titles
L2,P1,8,Game1
L3,P2,9,Game 2
L4,P3,6,Game 3


In [58]:
# concatenate 2 dfs
pd.concat([df3, df4])

Unnamed: 0,Player,Power,Title,Players,Point,Titles
L1,P1,Punch,Game1,,,
L2,P5,Kick,Game 5,,,
L3,P6,Elbow,Game6,,,
L2,,,,P1,8.0,Game1
L3,,,,P2,9.0,Game 2
L4,,,,P3,6.0,Game 3
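
* A short sketch of the shared-column case, which the example above doesn't cover, plus 'ignore_index' to renumber the rows. The 'top' and 'bottom' dataframes are made up for illustration.

```python
import pandas as pd

top = pd.DataFrame({'Player': ['P1', 'P5'], 'Power': ['Punch', 'Kick']})
bottom = pd.DataFrame({'Player': ['P2'], 'Power': ['Elbow']})

# shared column labels line up, so the result still has 2 columns
stacked = pd.concat([top, bottom])

# original index labels are kept, so 0 appears twice in 'stacked';
# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([top, bottom], ignore_index=True)
```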


### Read in CSV Data

In [1]:
# load modules
import pandas as pd

# read csv into df
empl = pd.read_csv('ZEMPMD.csv')

# show top 'n' rows of df
empl.head(3)

Unnamed: 0,Employee ID,Employee Name,Department ID,Role ID,FTE,Start Date,End Date,Custom Role Name,Freelance
0,7100018000.0,SARAH MUSKER,40040110,100236,1.0,03/06/2019,17/08/2019,,Y
1,1000245000.0,Hollie Donaghue,40040105,101091,1.0,02/09/2015,24/10/2019,LIVE Account Executive,N
2,1000188000.0,Thomas Bolton,40040113,100893,1.0,01/01/2001,30/12/2019,UX,N


In [2]:
# check object type (it's a df)
type(empl)

pandas.core.frame.DataFrame

In [4]:
# check last 'n' rows
empl.tail(3)

Unnamed: 0,Employee ID,Employee Name,Department ID,Role ID,FTE,Start Date,End Date,Custom Role Name,Freelance
970,1000178000.0,Alistair Bell,40040110,100236,1.0,01/01/2001,02/05/2020,,N
971,1000296000.0,Anna Jachimczak,40040113,100089,1.0,06/08/2018,,Designer,N
972,1000178000.0,Gina Naylor,40040110,101677,1.0,01/01/2001,,,N


In [6]:
# check rows, cols of df
empl.shape

(973, 9)

In [7]:
# check col details and check nulls
empl.info(show_counts=True) # 'null_counts' was renamed to 'show_counts' in newer pandas

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 973 entries, 0 to 972
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Employee ID       972 non-null    float64
 1   Employee Name     973 non-null    object 
 2   Department ID     973 non-null    int64  
 3   Role ID           973 non-null    object 
 4   FTE               973 non-null    float64
 5   Start Date        973 non-null    object 
 6   End Date          432 non-null    object 
 7   Custom Role Name  37 non-null     object 
 8   Freelance         973 non-null    object 
dtypes: float64(2), int64(1), object(6)
memory usage: 68.5+ KB
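
* The date columns above are read in as plain strings (object). A hedged sketch of converting them to datetimes, assuming the dates are day-first (dd/mm/yyyy) as they appear in the rows shown. The CSV text here is a made-up stand-in for the real file.

```python
import io
import pandas as pd

# hypothetical CSV text mimicking a few columns of the file above
csv_text = """Employee Name,Start Date,End Date
A Person,03/06/2019,17/08/2019
B Person,02/09/2015,
"""

sample = pd.read_csv(io.StringIO(csv_text))

# convert the day-first date strings into datetimes; blanks become NaT
for col in ['Start Date', 'End Date']:
    sample[col] = pd.to_datetime(sample[col], format='%d/%m/%Y')

first_start_month = sample.loc[0, 'Start Date'].month
```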


In [8]:
# check col means
empl.mean(numeric_only=True) # skip text columns (required in newer pandas)

Employee ID      2.166524e+09
Department ID    4.004012e+07
FTE              9.886140e-01
dtype: float64

* Note that columns in a dataframe/dataset are called __"attributes"__.

In [9]:
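# check col standard deviation
empl.std(numeric_only=True)

# check max and min of col
empl.max(numeric_only=True)
empl.min(numeric_only=True)

# check col median (run last, as a cell only displays its final expression)
empl.median(numeric_only=True)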

Employee ID      1.000260e+09
Department ID    4.004011e+07
FTE              1.000000e+00
dtype: float64

In [10]:
# check non-null value counts per col (a count below the row total indicates nulls)
empl.count()

Employee ID         972
Employee Name       973
Department ID       973
Role ID             973
FTE                 973
Start Date          973
End Date            432
Custom Role Name     37
Freelance           973
dtype: int64

In [11]:
# check all of the above per numerical attribute
empl.describe()

Unnamed: 0,Employee ID,Department ID,FTE
count,972.0,973.0,973.0
mean,2166524000.0,40040120.0,0.988614
std,2401353000.0,22.3456,0.068396
min,71008730.0,40040100.0,0.4
25%,1000210000.0,40040110.0,1.0
50%,1000260000.0,40040110.0,1.0
75%,1000311000.0,40040130.0,1.0
max,7100019000.0,40040180.0,1.071429


### Data Cleaning
* You can use dot notation to select columns (e.g. empl.EID), but it has its limitations: it doesn't work when a column name contains spaces, and it can't be used to create new columns.
* Dot notation is convenient, but square brackets and quotes (a.k.a. indexing, e.g. empl['EID']) are more explicit and work in every case, so prefer them.
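
* A quick sketch of those limitations on a made-up 'staff' dataframe (the 'Bonus' column is hypothetical):

```python
import pandas as pd

staff = pd.DataFrame({'EID': [1, 2], 'Role ID': [100, 200]})

# both notations work for a simple column name
via_dot = staff.EID
via_brackets = staff['EID']

# a column name with a space can only be reached with brackets
role_ids = staff['Role ID']

# brackets are also needed to create a new column;
# staff.Bonus = ... would only set a Python attribute, not add a column
staff['Bonus'] = [10, 20]
```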

In [16]:
# rename cols
empl = empl.rename(columns={'Employee ID':'EID'})

# replace nulls with mean
empl['EID'] = empl['EID'].fillna(empl['EID'].mean())

# drop columns
empl = empl.drop(columns=['FTE'])
# OR del empl['FTE']

* A correlation matrix shows the correlation between every pair of numerical columns in the data.
* Each coefficient ranges between -1 and 1 and measures the strength and direction of the linear relationship between two variables: values near 1 mean they rise together, values near -1 mean one falls as the other rises, and values near 0 mean little or no linear relationship.

In [17]:
# build correlation matrix (check correlation of numerical variables)
df = empl.corr(numeric_only=True)
df

Unnamed: 0,EID,Department ID
EID,1.0,-0.071439
Department ID,-0.071439,1.0
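
* To get a feel for the -1 to 1 range, here is a small sketch on made-up data. Note that the coefficient reflects how consistently two columns move together, not how steep the relationship is.

```python
import pandas as pd

# hypothetical data: y rises with x, z falls with x
demo = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
demo['y'] = demo['x'] * 10      # perfectly linear, increasing
demo['z'] = demo['x'] * -0.5    # perfectly linear, decreasing

# pairwise Pearson correlation of all columns
corr = demo.corr()

x_vs_y = corr.loc['x', 'y']  # close to 1 despite the slope of 10
x_vs_z = corr.loc['x', 'z']  # close to -1 despite the slope of -0.5
```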


* Converting a column's type is straightforward, and either method below (including the commented-out line) will do it.
* The advantage of "to_numeric" over "astype" is error handling: if a value cannot be converted (e.g. the string S123 in a column you're converting to numbers), "astype" raises an error, whereas "to_numeric" with "errors='coerce'" converts that value to NaN instead.
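
* A minimal sketch of that difference on a made-up Series containing one unconvertible value:

```python
import pandas as pd

# hypothetical string column with one value that isn't a number
raw = pd.Series(['100236', '101091', 'S123'])

# astype raises an error because 'S123' can't become an int
try:
    raw.astype(int)
    astype_failed = False
except (ValueError, TypeError):
    astype_failed = True

# to_numeric with errors='coerce' turns the bad value into NaN instead
converted = pd.to_numeric(raw, errors='coerce')

# the NaN can then be replaced if needed
cleaned = converted.fillna(0)
```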

In [24]:
# convert type
#empl['Role ID'] = empl['Role ID'].astype(int)
empl['Role ID'] = pd.to_numeric(empl['Role ID'], errors='coerce')

# replace NaN with 0
empl['Role ID'] = empl['Role ID'].fillna(0)

# check the change ('Role ID' is now float64 with no nulls)
empl.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 973 entries, 0 to 972
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   EID               973 non-null    float64
 1   Employee Name     973 non-null    object 
 2   Department ID     973 non-null    int64  
 3   Role ID           973 non-null    float64
 4   Start Date        973 non-null    object 
 5   End Date          432 non-null    object 
 6   Custom Role Name  37 non-null     object 
 7   Freelance         973 non-null    object 
dtypes: float64(2), int64(1), object(5)
memory usage: 60.9+ KB


In [19]:
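# check col types (this cell was run before the 'Role ID' conversion above, so 'Role ID' still shows as object)
empl.dtypes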

EID                 float64
Employee Name        object
Department ID         int64
Role ID              object
Start Date           object
End Date             object
Custom Role Name     object
Freelance            object
dtype: object

### Data Manipulation
* iloc enables row/column selection by integer position (zero-based, with the end of a slice excluded).

In [25]:
# select first 4 rows by position
empl.iloc[:4] # a single slice applies to rows; all cols are returned

Unnamed: 0,EID,Employee Name,Department ID,Role ID,Start Date,End Date,Custom Role Name,Freelance
0,7100018000.0,SARAH MUSKER,40040110,100236.0,03/06/2019,17/08/2019,,Y
1,1000245000.0,Hollie Donaghue,40040105,101091.0,02/09/2015,24/10/2019,LIVE Account Executive,N
2,1000188000.0,Thomas Bolton,40040113,100893.0,01/01/2001,30/12/2019,UX,N
3,1000230000.0,Abbigayle Van Der Westhuizen,40040177,100291.0,01/01/2001,,,N


In [31]:
# all rows and cols
empl.iloc[:,:] # first part is rows, second is cols

# first 5 rows of 5th column (position 4)
empl.iloc[0:5, 4]

# row position 6 onwards, col position 4 onwards
empl.iloc[6:, 4:]

Unnamed: 0,Start Date,End Date,Custom Role Name,Freelance
6,10/10/2016,,,N
7,04/10/2017,,,N
8,27/03/2017,,,N
9,10/10/2018,,,N
10,20/09/2016,,,N
...,...,...,...,...
968,01/01/2001,,,N
969,25/11/2019,,,N
970,01/01/2001,02/05/2020,,N
971,06/08/2018,,Designer,N


* loc lets you select rows/columns via labels. Unlike iloc, the end of a loc slice is included.

In [33]:
# all rows for freelance column
empl.loc[:, 'Freelance']

# first 7 rows for freelance column
empl.loc[:6, 'Freelance']

# first 7 records, multiple columns
empl.loc[:6, 'End Date':'Freelance']

Unnamed: 0,End Date,Custom Role Name,Freelance
0,17/08/2019,,Y
1,24/10/2019,LIVE Account Executive,N
2,30/12/2019,UX,N
3,,,N
4,,,N
5,,Producer,N
6,,,N
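
* The same slice behaves differently with iloc and loc, which is why ':6' returned 7 rows above. A small sketch on a made-up dataframe with the default 0, 1, 2... index:

```python
import pandas as pd

letters = pd.DataFrame({'letter': list('abcdefgh')})

# iloc slices by position and excludes the end: positions 0 to 5
by_position = letters.iloc[:6]

# loc slices by label and includes the end: labels 0 to 6
by_label = letters.loc[:6]
```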


In [35]:
# assign single value to all rows in column
empl['Freelance'] = "Y"
empl.head(5)

Unnamed: 0,EID,Employee Name,Department ID,Role ID,Start Date,End Date,Custom Role Name,Freelance
0,7100018000.0,SARAH MUSKER,40040110,100236.0,03/06/2019,17/08/2019,,Y
1,1000245000.0,Hollie Donaghue,40040105,101091.0,02/09/2015,24/10/2019,LIVE Account Executive,Y
2,1000188000.0,Thomas Bolton,40040113,100893.0,01/01/2001,30/12/2019,UX,Y
3,1000230000.0,Abbigayle Van Der Westhuizen,40040177,100291.0,01/01/2001,,,Y
4,1000314000.0,Lydia Hill,40040119,100137.0,12/08/2019,,,Y


In [36]:
# double values in column
function = lambda x: x*2 # multiply passed value by 2

# apply function to EID column
empl['EID'] = empl['EID'].apply(function)
empl.head(5)

Unnamed: 0,EID,Employee Name,Department ID,Role ID,Start Date,End Date,Custom Role Name,Freelance
0,14200040000.0,SARAH MUSKER,40040110,100236.0,03/06/2019,17/08/2019,,Y
1,2000491000.0,Hollie Donaghue,40040105,101091.0,02/09/2015,24/10/2019,LIVE Account Executive,Y
2,2000375000.0,Thomas Bolton,40040113,100893.0,01/01/2001,30/12/2019,UX,Y
3,2000461000.0,Abbigayle Van Der Westhuizen,40040177,100291.0,01/01/2001,,,Y
4,2000629000.0,Lydia Hill,40040119,100137.0,12/08/2019,,,Y
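
* For simple arithmetic like this, operating on the whole column at once gives the same result as apply and is usually faster. A sketch on a made-up Series:

```python
import pandas as pd

ids = pd.Series([10.0, 20.0, 30.0])

# element-by-element, via apply
doubled_apply = ids.apply(lambda x: x * 2)

# whole column at once (vectorised)
doubled_vector = ids * 2
```

* apply is still the right tool when the function can't be expressed as a column-wide operation.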


In [38]:
# sort column ascending (by Employee ID)
empl.sort_values(by='EID')

# sort column descending (by Employee ID)
empl.sort_values(by='EID', ascending=False)

Unnamed: 0,EID,Employee Name,Department ID,Role ID,Start Date,End Date,Custom Role Name,Freelance
709,1.420004e+10,CALI WARHAM,40040110,100236.0,02/02/2020,,,Y
849,1.420004e+10,DAVE LAMBERT,40040110,100233.0,20/01/2020,,,Y
836,1.420004e+10,STUART CAINE,40040110,100236.0,03/01/2020,,,Y
831,1.420004e+10,NICK HUNSLEY,40040110,101677.0,18/11/2019,,,Y
820,1.420004e+10,LUCINDA CURTOIS,40040138,100984.0,04/11/2019,,,Y
...,...,...,...,...,...,...,...,...
547,2.000345e+09,Allen Davidson,40040136,100866.0,01/01/2001,04/09/2015,,Y
730,2.000287e+09,David Price,40040110,100058.0,01/01/2001,,,Y
923,2.000268e+09,Susan Little - Mcgeehan,40040115,100060.0,01/01/2001,,,Y
775,2.000267e+09,Ian Davidson,40040115,100309.0,01/01/2001,01/11/2019,,Y
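
* Two details worth noting about sort_values: it returns a new dataframe rather than changing the original, and it accepts several columns at once. A sketch on made-up data (the 'dept' and 'eid' columns are hypothetical):

```python
import pandas as pd

people = pd.DataFrame({'dept': [2, 1, 2, 1], 'eid': [30, 10, 20, 40]})

# sort by dept ascending, then by eid descending within each dept
ordered = people.sort_values(by=['dept', 'eid'], ascending=[True, False])

# 'people' itself is unchanged; assign the result to keep the new order
```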


In [42]:
# build filter to show DID > 40040110 and freelance is "Y"
# (each comparison needs its own parentheses, as & binds tighter than >)
filter1 = (empl['Department ID'] > 40040110) & (empl['Freelance'] == "Y")

# apply filter
new_df = empl[filter1]

# show
new_df.head(3)

Unnamed: 0,EID,Employee Name,Department ID,Role ID,Start Date,End Date,Custom Role Name,Freelance
2,2000375000.0,Thomas Bolton,40040113,100893.0,01/01/2001,30/12/2019,UX,Y
3,2000461000.0,Abbigayle Van Der Westhuizen,40040177,100291.0,01/01/2001,,,Y
4,2000629000.0,Lydia Hill,40040119,100137.0,12/08/2019,,,Y
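
* When combining conditions, each comparison should sit in its own parentheses, because Python evaluates & before > and ==. A minimal sketch on a made-up 'staff' dataframe:

```python
import pandas as pd

staff = pd.DataFrame({'Department ID': [40040110, 40040113, 40040177],
                      'Freelance': ['Y', 'Y', 'N']})

# each comparison is wrapped in parentheses before combining with &
mask = (staff['Department ID'] > 40040110) & (staff['Freelance'] == 'Y')

# keep only the rows where both conditions hold
filtered = staff[mask]
```

* Use | for "or" and ~ for "not" in the same way.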
