# DataFrame Data Structure

The DataFrame is the primary object that we'll be working with in data analysis and cleaning tasks.

The DataFrame is conceptually a two-dimensional series object, where there's an index and multiple columns of content, with each column having a label.

We can also think of the DataFrame itself as simply a two-axes labeled array.

In [52]:
# Example
# Let's create three school records for students and their class grades. We'll create each as
# a Series which has a student name, the class name, and the score.

import pandas as pd

record1 = pd.Series({'Name' : 'Alice',
                         'Class' : 'Physics',
                         'Score' : 85})

record2 = pd.Series({'Name' : 'Jack',
                         'Class' : 'Chemistry',
                         'Score' : 82})

record3 = pd.Series({'Name' : 'Helen',
                         'Class' : 'Biology',
                         'Score' : 90})

In [53]:
# Like a Series, the DataFrame object is indexed. Here we'll use a group of Series, where
# each Series represents a row of data.
# Just like the Series function, we can pass in our individual items in an array, and we can
# pass in our index values as a second argument

df = pd.DataFrame([record1, record2, record3], index = ['school', 'school2', 'school3'])

# Just like the Series we can use the head() function to see the first several rows of the
# dataframe, including indices from both axes, and we can use this to verify the columns and
# the rows
df.head()

Unnamed: 0,Name,Class,Score
school,Alice,Physics,85
school2,Jack,Chemistry,82
school3,Helen,Biology,90


In [54]:
# So, we have the index, which is in the leftmost column and is the school name, and then we
# have the rows of data, where each value sits under a column header that was given as a key
# in our initial record dictionaries

In [55]:
# Alternative method
# We could use a list of dictionaries, where each dictionary represents a row of data

students = [{'Name' : 'Alice',
            'Class' : 'Physics',
            'Score' : 85},
            {'Name' : 'Jack',
            'Class' : 'Chemistry',
            'Score' : 82},
            {'Name' : 'Helen',
            'Class' : 'Biology',
            'Score' : 90}
           ]

# Then we pass this list of dictionaries into the DataFrame function
df = pd.DataFrame(students, index = ['school1', 'school2', 'school1'])

df.head()

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school1,Helen,Biology,90
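
The same table can also be built column by column. The following is a minimal sketch, assuming the same three students; the names `columns` and `df_by_column` are made up for this illustration.

```python
import pandas as pd

# Each key is a column label and each list holds that column's values, in row order
columns = {'Name': ['Alice', 'Jack', 'Helen'],
           'Class': ['Physics', 'Chemistry', 'Biology'],
           'Score': [85, 82, 90]}

# The index argument works the same way as with a list of dictionaries
df_by_column = pd.DataFrame(columns, index=['school1', 'school2', 'school1'])
print(df_by_column)
```

Which form to use is mostly a matter of how the data arrives: row-shaped records suit a list of dictionaries, while column-shaped data suits a dictionary of lists.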


In [56]:
# Similar to the Series, we can extract data using the .iloc and .loc attributes. Because the
# DataFrame is two-dimensional, passing a single value to the loc indexing operator returns
# a Series if there's only one row to return

# For instance, if we wanted to select data associated with school2, we would just query the
# .loc attribute with one parameter
df.loc['school2']

Name          Jack
Class    Chemistry
Score           82
Name: school2, dtype: object

In [57]:
# We can see that the name of the Series is the row's index value, while the column names
# become the index of the Series

# We can also check the data type
type(df.loc['school2'])

pandas.core.series.Series

In [58]:
# It's important to remember that the indices and column names along either axis, horizontal
# or vertical, could be non-unique. In this example, we see two records for school1 as
# different rows. If we use a single value with the DataFrame loc attribute, multiple rows
# of the DataFrame will be returned, not as a new Series, but as a new DataFrame

print(df.loc['school1'])
print('\n')
print(type(df.loc['school1']))

          Name    Class  Score
school1  Alice  Physics     85
school1  Helen  Biology     90


<class 'pandas.core.frame.DataFrame'>


In [59]:
# We can quickly select data based on multiple axes

# School1's student names
df.loc['school1', 'Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

In [60]:
# The pandas developers have implemented this using the indexing operator and not as parameters
# to a function

# If we wanted to select a single column, we could transpose the matrix 

df.T

Unnamed: 0,school1,school2,school1
Name,Alice,Jack,Helen
Class,Physics,Chemistry,Biology
Score,85,82,90


In [61]:
# Then we can call .loc on the transpose to get the student names only
df.T.loc['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

In [62]:
# However, since iloc and loc are used for row selection, pandas reserves the indexing operator
# directly on the DataFrame for column selection. In a pandas DataFrame, columns always have a
# name. So this selection is always label based, and is not as confusing as it was when using
# the square bracket operator on the Series object.
df['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

In [63]:
# In practice, this works really well since we're often trying to add or drop new columns.
# However, this also means that we get a key error if we try and use .loc with a column name.
df.loc['Name']

KeyError: 'Name'

In [None]:
# The result of a single column projection is a Series object
type(df['Name'])

In [None]:
# Since the result of using the indexing operator is either a DataFrame or Series, we can
# chain operations together.

# For instance, we can select all of the rows which relate to school1 using .loc, then
# project the Name column from just those rows
df.loc['school1']['Name']

In [None]:
# Type should be a DataFrame
print(type(df.loc['school1']))

# Type should be a Series
print(type(df.loc['school1']['Name']))

In [None]:
# Chaining, by indexing on the return type of another index, can come with some costs and is
# best avoided if we can use another approach.
# In particular, chaining tends to cause pandas to return a copy of the DataFrame instead of a
# view on the DataFrame.
# For selecting data, this is not a big deal, though it might be slower than necessary.
# If we are changing data, though, this is an important distinction and can be a source of error
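
To make the risk concrete, here is a minimal sketch of the safer way to change data, assuming a small frame named `df_demo` that is invented for this example. A chained form such as `df_demo.loc['school1']['Score'] = 100` may act on a temporary copy, depending on the pandas version, so a single `.loc` call with both labels is used instead.

```python
import pandas as pd

# Hypothetical frame used only for this illustration
df_demo = pd.DataFrame({'Name': ['Alice', 'Jack', 'Helen'],
                        'Score': [85, 82, 90]},
                       index=['school1', 'school2', 'school1'])

# One .loc call with the row label and the column label acts on df_demo itself
df_demo.loc['school1', 'Score'] = 100
print(df_demo)
```

Both rows labeled `school1` are updated, and the change is visible in `df_demo` because no intermediate object sits between the selection and the assignment.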

In [None]:
# Another approach

# As we saw, .loc does row selection, and it can take two parameters, the row index and the
# list of column names. The .loc attribute also supports slicing.

# If we wanted to select all rows, we can use a colon to indicate a full slice from beginning
# to end. This is just like slicing a list in Python. Then we can add the column
# name as the second parameter as a string. If we wanted to include multiple columns, we could
# pass them in a list and pandas will bring back only the columns we have asked for

# Example
df.loc[:, ['Name', 'Score']]

In [None]:
# The colon means that we want to get all of the rows and the list in the second argument
# position is the list of columns we want to get back
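
One difference from ordinary Python slicing is worth a quick check: a label slice with `.loc` includes both endpoints, while a positional slice with `.iloc` excludes the stop position. The sketch below assumes a small frame with unique labels, named `df_slice` for this example only.

```python
import pandas as pd

# Hypothetical frame with unique row labels, used only for this illustration
df_slice = pd.DataFrame({'Name': ['Alice', 'Jack', 'Helen'],
                         'Score': [85, 82, 90]},
                        index=['school1', 'school2', 'school3'])

# Label slice: both 'school1' and 'school2' are returned
by_label = df_slice.loc['school1':'school2', ['Name', 'Score']]

# Positional slice: rows 0 and 1, column 0 only (the stop positions are excluded)
by_position = df_slice.iloc[0:2, 0:1]

print(by_label)
print(by_position)
```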

In [None]:
# That's selecting and projecting data from a DataFrame based on row and column labels.

# Key concepts:
# -> Rows and columns are just for our benefit. Underneath this is just a two axes labeled
# array and transposing the columns is easy
# -> We should avoid chaining as it can cause unpredictable results, where we intend to obtain
# a view of the data, but instead Pandas returns to us a copy

## Dropping data

In [None]:
# Dropping data
# The drop() function returns a copy of the DataFrame with the given rows removed

df.drop('school1')

In [None]:
# But if we look at our original DataFrame we see the data is still intact
df

In [None]:
# Drop has two optional parameters worth knowing

# "inplace" -> if set to True, the DataFrame will be updated in place, instead of a copy
# being returned

# "axis" -> by default, the value is 0 (indicating row axis). To drop a column, we change it
# to 1

copy_df = df.copy()

copy_df.drop("Name", inplace = True, axis = 1)
copy_df

In [None]:
# Another way to drop a column:
# Through the use of the indexing operator, using the del keyword
# However, this way takes immediate effect on the DataFrame and does not return a view
del copy_df['Class']
copy_df
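
As an alternative to the `axis` argument, `drop()` also accepts the `index` and `columns` keywords, which some readers find easier to follow. A minimal sketch, assuming a small frame named `df_drop` that exists only for this example:

```python
import pandas as pd

# Hypothetical frame used only for this illustration
df_drop = pd.DataFrame({'Name': ['Alice', 'Jack'],
                        'Class': ['Physics', 'Chemistry'],
                        'Score': [85, 82]},
                       index=['school1', 'school2'])

# Equivalent to df_drop.drop('Class', axis=1)
without_class = df_drop.drop(columns=['Class'])

# Equivalent to df_drop.drop('school2', axis=0)
without_school2 = df_drop.drop(index=['school2'])

# Neither call used inplace=True, so df_drop itself is unchanged
print(df_drop)
```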

## Adding data

In [None]:
# Adding a new column to the DataFrame is as easy as assigning it to some value using the
# indexing operator.

# Example
df['ClassRanking'] = None
df
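
The assigned value does not have to be `None`. As a sketch under the assumption of a small frame named `df_add` (invented here), a scalar is broadcast to every row, a list supplies one value per row, and an expression on existing columns creates a derived column.

```python
import pandas as pd

# Hypothetical frame used only for this illustration
df_add = pd.DataFrame({'Name': ['Alice', 'Jack', 'Helen'],
                       'Score': [85, 82, 90]})

# A scalar is broadcast to every row
df_add['Passed'] = True

# A list must have exactly one value per row
df_add['ClassRanking'] = [2, 3, 1]

# A new column can be computed from an existing one
df_add['ScoreOutOf10'] = df_add['Score'] / 10

print(df_add)
```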

# Indexing and Loading

In [None]:
# Jupyter notebooks use IPython as the kernel underneath, which provides convenient ways
# to integrate lower level shell commands, which are programs run in the underlying operating
# system.
# This is super handy for integration of our data science workflows

# We use a shell command called "cat" for "concatenate", which just outputs the content of a
# file

# In IPython, if we prepend the line with an ! it will execute the remainder of the line as a
# shell command

!cat Admission_Predict.csv

In [None]:
import pandas as pd

# Note that read_csv is a function of the pandas module (pd), not a method of a DataFrame
df = pd.read_csv('Admission_Predict.csv')

df.head()

In [None]:
# By default, index starts with 0 while students' serial number starts from 1.
# We can set the serial number as the index by using "index_col"

df = pd.read_csv('Admission_Predict.csv', index_col = 0)
df.head()
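
The effect of `index_col` can be tried without the file on disk. The sketch below assumes a two-row excerpt of the data held in a string, read through `io.StringIO`; the variable names are made up for this example.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for the CSV file (first two rows, three columns)
csv_text = "Serial No.,GRE Score,TOEFL Score\n1,337,118\n2,324,107\n"

# Without index_col, pandas generates a 0-based index and keeps 'Serial No.' as a column
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the index
serial_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index)
print(serial_index)
```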

In [None]:
# Let's rename some of the columns
# The rename() function takes a parameter called "columns", and we need to pass in a dictionary
# in which the keys are the old column names and the values are the corresponding new names

new_df = df.rename(columns = {
    'GRE Score' : 'GRE Score',
    'TOEFL Score' : 'TOEFL Score',
    'University Rating' : 'University Rating',
    'SOP' : 'Statement of Purpose',
    'LOR' : 'Letter of Recommendation',
    'CGPA' : 'CGPA',
    'Research' : 'Research',
    'Chance of Admit' : 'Chance of Admit'
})

new_df.head()

In [None]:
# Only SOP changed, but not LOR. Why?

# We check the column names
df.columns

In [None]:
# We see that there is a space after 'LOR'

# We update that value (we don't need to pass in all the values again)
new_df = new_df.rename(columns = {'LOR ' : 'Letter of Recommendation'})
new_df.head()

In [None]:
# That worked well, but what if there is a tab or two spaces?
# We can create some function that does the cleaning and then tell rename to apply that
# function across all of the labels. For that we can use the string function strip().
# When we pass this in to rename we pass the function as the mapper parameter, and then
# indicate whether the axis should be columns or index (row labels)
new_df = new_df.rename(mapper = str.strip, axis = 'columns')

print(df.columns)
print('\n')
print(new_df.columns)
print('\n')
print(new_df.head())
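
To see that `str.strip` handles more than a single trailing space, here is a minimal sketch on a one-row frame whose column names are deliberately messy. The frame `messy` and its padded labels are invented for this example.

```python
import pandas as pd

# Hypothetical frame whose labels carry a double space, a tab and a trailing space
messy = pd.DataFrame([[4.5, 4.5, 9.65, 0.92]],
                     columns=['SOP', 'LOR  ', '\tCGPA', 'Chance of Admit '])

# str.strip is called once per column label; rename returns a new DataFrame
clean = messy.rename(mapper=str.strip, axis='columns')

print(list(messy.columns))
print(list(clean.columns))
```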

In [None]:
# We can also use the df.columns attribute by assigning to it a list of column names which
# will directly rename the columns. This will directly modify the original dataframe and is
# very efficient, especially when we have a lot of column names and only want to change a
# few.
# This technique is not affected by subtle errors in the column names, like the trailing
# space we just ran into.

# Example
df2 = pd.read_csv('Admission_Predict.csv', index_col = 0)
df2.columns

In [None]:
# Let's change it

# We store the column names in a list and pass it through a list comprehension
cols = list(df2.columns)
cols = [x.lower().strip() for x in cols]

# We overwrite the column names
df2.columns = cols

df2.head()

There are other data sources we can load directly into dataframes: HTML, web pages, databases and other file formats. CSV is the most important, though.

# Querying a DataFrame

-> Boolean masking

A boolean mask is an array which can be of one dimension (Series) or two (DataFrame), where each value is either True or False. This array is essentially overlaid on top of the data structure that we are querying: any cell aligned with a True value is admitted into our final result, and the rest are not.

In [27]:
# Example

import pandas as pd
df = pd.read_csv('Admission_Predict.csv', index_col = 0)

# We'll clean up the poorly named columns
df.columns = [x.lower().strip() for x in df.columns]

df.head()

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [28]:
# Boolean masks are created by applying operators directly to the pandas series or dataframes

# Students with chance higher than 0.7 -> we have to refer to the specific column
admit_mask = df['chance of admit'] > 0.7
admit_mask

Serial No.
1       True
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398     True
399    False
400     True
Name: chance of admit, Length: 400, dtype: bool

In [None]:
# The result of broadcasting a comparison operator is a Boolean mask - true/false values
# depending upon the results of the comparison. Underneath, pandas is applying the comparison
# operator we specified through vectorization (so efficiently and in parallel) to all of the
# values in the array we specified (chance of admit).
# The result is a Series, since only one column is being operated on, filled with either
# True or False values, which is what the comparison operator returns

In [29]:
# Now what?
# We can lay it on top of the dataframe to hide the data we don't want
df.where(admit_mask).head()

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
5,,,,,,,,


In [30]:
# Only data which met the condition was retained. Falses are replaced by NaN

# If we don't want the NaN data -> dropna()
df.where(admit_mask).dropna().head()

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
6,330.0,115.0,5.0,4.5,3.0,9.34,1.0,0.9


In [31]:
# We can see how it skips the index 5 -> data just dropped

# A shorthand for the previous instruction is the following:
df[df['chance of admit'] > 0.7].head()

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
6,330,115,5,4.5,3.0,9.34,1,0.9


In [None]:
# It returns the same rows. The integer columns also stay integers here, since no NaN
# values are introduced along the way
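
The outputs above show `337.0` after `where()` and `337` after boolean indexing. A minimal sketch of why, assuming a three-row excerpt of the data in a frame named `scores` (a name chosen for this example): `where()` fills the rejected rows with NaN, which forces integer columns to float, and `dropna()` does not convert them back.

```python
import pandas as pd

# Three rows taken from the admissions data shown above
scores = pd.DataFrame({'gre score': [337, 324, 314],
                       'chance of admit': [0.92, 0.76, 0.65]},
                      index=pd.Index([1, 2, 5], name='Serial No.'))

mask = scores['chance of admit'] > 0.7

# where() keeps the shape and inserts NaN, so 'gre score' becomes float
via_where = scores.where(mask).dropna()

# Boolean indexing simply selects rows, so 'gre score' stays integer
via_index = scores[mask]

print(via_where.dtypes)
print(via_index.dtypes)
```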

In [32]:
# Reviewing the indexing operator, it can do two things:

# 1) It can be called with a string parameter to project a single column
print(df['gre score'].head())

# 2) It can be sent a list of column names as strings
print(df[['gre score', 'toefl score']].head())

Serial No.
1    337
2    324
3    316
4    322
5    314
Name: gre score, dtype: int64
            gre score  toefl score
Serial No.                        
1                 337          118
2                 324          107
3                 316          104
4                 322          110
5                 314          103


In [33]:
# We can also send it a boolean mask
df[df['gre score'] > 325].head()

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
6,330,115,5,4.5,3.0,9.34,1,0.9
12,327,111,4,4.0,4.5,9.0,1,0.84
13,328,112,4,4.0,4.5,9.1,1,0.78
23,328,116,5,5.0,5.0,9.5,1,0.94


In [None]:
# Each of these mimics the functionality of either .loc[] or .where().dropna()

In [34]:
# Multiple boolean masks -> multiple criteria for including

# With pandas, this does not work the way it does with ordinary Python booleans.
# We can't simply use 'and' or 'or'
(df['chance of admit'] > 0.7) and (df['chance of admit'] < 0.9)

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [35]:
# The problem is that we have Series objects, and Python underneath doesn't know how to
# compare two Series using and or or. Instead, the pandas authors have overloaded the pipe |
# and ampersand & operators to handle this for us
(df['chance of admit'] > 0.7) & (df['chance of admit'] < 0.9)

Serial No.
1      False
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398    False
399    False
400    False
Name: chance of admit, Length: 400, dtype: bool

In [None]:
# Important:
# & -> instead of "and"
# | -> instead of "or"
# () -> we have to enclose each condition within parentheses -> otherwise, we get an error
df['chance of admit'] > 0.7 & df['chance of admit'] < 0.9

In [None]:
# The problem is that & binds more tightly than the comparison operators, so Python is trying
# to bitwise and 0.7 with a pandas Series, when we really want to bitwise and the two
# broadcasted boolean Series together
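
The three element-wise operators can be sketched on a tiny Series. The values are three chances of admit from the rows shown earlier; the variable names are made up for this example.

```python
import pandas as pd

# Three values from the 'chance of admit' column shown earlier
chance = pd.Series([0.65, 0.76, 0.92])

# Each comparison is wrapped in parentheses before combining
both = (chance > 0.7) & (chance < 0.9)      # element-wise "and"
either = (chance < 0.7) | (chance > 0.9)    # element-wise "or"
neither = ~either                           # element-wise "not"

print(both.tolist())
print(either.tolist())
print(neither.tolist())
```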

In [None]:
# Another way:
# instead of comparison operators, we use built in functions
df['chance of admit'].gt(0.7) & df['chance of admit'].lt(0.9)

In [None]:
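# These functions are built right into the Series and DataFrame objects, so they can be chained.
# Be careful, though: gt(0.7) returns a boolean Series, so chaining lt(0.9) compares those
# True/False values with 0.9 and does not give the same answer as combining the masks with &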
df['chance of admit'].gt(0.7).lt(0.9)

In [None]:
# This only works if our operator, such as <= or >=, is built into the DataFrame as a
# function (here le() and ge())
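
A quick check on a small Series, sketched below with invented variable names, shows what the chained call actually computes: `gt(0.7)` returns booleans, and `lt(0.9)` then compares those booleans (treated as 1 and 0) with 0.9. The chained form should therefore not be relied on to combine two conditions.

```python
import pandas as pd

# Three values from the 'chance of admit' column shown earlier
chance = pd.Series([0.65, 0.76, 0.92])

# Two separate masks combined with & : strictly between 0.7 and 0.9
combined = chance.gt(0.7) & chance.lt(0.9)

# Chained call: lt(0.9) is applied to the True/False output of gt(0.7)
chained = chance.gt(0.7).lt(0.9)

print(combined.tolist())
print(chained.tolist())
```

In this sketch the chained result is simply the negation of `chance.gt(0.7)`, because `True < 0.9` is False and `False < 0.9` is True.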

# Indexing DataFrames

Both Series and DataFrames can have indices applied to them.

The index is essentially a row level label, and in pandas the rows correspond to axis zero. Indices can either be autogenerated, such as when we create a new Series without an index, in which case we get numeric values, or they can be set explicitly, like when we use the dictionary object to create the Series, or when we load data from CSV files and set appropriate parameters.

Another option for setting an index is to use the set_index() function. This function takes a list of columns and promotes those columns to an index.

In [64]:
# The set_index function is a destructive process, and it doesn't keep the current index
# If we want to keep the current index, we need to manually create a new column and copy
# into it values from the index attribute

import pandas as pd
df = pd.read_csv('Admission_Predict.csv', index_col = 0)
df.head()

Unnamed: 0_level_0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [65]:
# Let's say that we don't want to index the DataFrame by serial numbers, but instead by the
# chance of admit. But let's assume we want to keep the serial number for later. So, let's
# preserve the serial number in a new column. We can do this by using the indexing operator
# with a new column label and assigning the index to it. Then we can use set_index() to set
# the index to the chance of admit column

# So we copy the indexed data into its own column
df['Serial Number'] = df.index

# Then we set the index to another column (note the trailing space in the column name)
df = df.set_index('Chance of Admit ')
df.head()

Unnamed: 0_level_0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Serial Number
Chance of Admit,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0.92,337,118,4,4.5,4.5,9.65,1,1
0.76,324,107,4,4.0,4.5,8.87,1,2
0.72,316,104,3,3.0,3.5,8.0,1,3
0.8,322,110,3,3.5,2.5,8.67,1,4
0.65,314,103,2,2.0,3.0,8.21,0,5


In [66]:
# We'll see that when we create a new index from an existing column the index has a name,
# which is the original name of the column.

# We can get rid of the index completely by calling the function reset_index(). This promotes
# the index into a column and creates a default numbered index
df = df.reset_index()
df.head()

Unnamed: 0,Chance of Admit,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Serial Number
0,0.92,337,118,4,4.5,4.5,9.65,1,1
1,0.76,324,107,4,4.0,4.5,8.87,1,2
2,0.72,316,104,3,3.0,3.5,8.0,1,3
3,0.8,322,110,3,3.5,2.5,8.67,1,4
4,0.65,314,103,2,2.0,3.0,8.21,0,5
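
Since `reset_index()` promotes the index into a column, it can also be used before `set_index()` to keep the old index without copying it by hand. This is a sketch of that alternative, assuming a two-row excerpt in a frame named `admissions` (invented here, with the column name already stripped of its trailing space).

```python
import pandas as pd

# Two rows from the admissions data, indexed by serial number
admissions = pd.DataFrame({'GRE Score': [337, 324],
                           'Chance of Admit': [0.92, 0.76]},
                          index=pd.Index([1, 2], name='Serial No.'))

# reset_index() turns 'Serial No.' into a regular column, so set_index() loses nothing
by_chance = admissions.reset_index().set_index('Chance of Admit')

print(by_chance)
```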


In [67]:
# Multi-level indexing
# This is similar to composite keys in relational database systems. To create a multi-level
# index, we simply call set_index() and give it a list of columns that we're interested in
# promoting to an index.

# Pandas will search through these in order, finding the distinct data and forming composite
# indices. A good example of this is often found when dealing with geographical data which is
# sorted by regions or demographics

# Example
df2 = pd.read_csv('census.csv')
df2.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


In [68]:
# In this data set there are two summary levels, one that contains summary data for each
# state, and one that contains data for each county.
# We want to see a list of all the unique values in a given column. We can see the possible
# values for the summary level by using the unique() function on that column.
# This is similar to the SQL DISTINCT operator

# Here we can run unique on the SUMLEV column of our current DataFrame
df2['SUMLEV'].unique()

array([40, 50])

In [69]:
# We see that there are only two different values: 40 and 50

# Let's exclude all of the rows that are summaries at the state level and just keep the
# county data
df2 = df2[df2['SUMLEV'] == 50]
df2.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


In [70]:
# The US Census data breaks down population estimates by state and county. We can load the
# data and set the index to be a combination of the state and county values and see how pandas
# handles it in a DataFrame.
# We do this by creating a list of the column identifiers we want to have indexed, and then
# calling set_index() with this list and assigning the output as appropriate. We see here that
# we have a dual index, first the state name and second the county name

df2 = df2.set_index(['STNAME', 'CTYNAME'])
df2.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,SUMLEV,REGION,DIVISION,STATE,COUNTY,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
STNAME,CTYNAME,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Alabama,Autauga County,50,3,6,1,1,54571,54571,54660,55253,55175,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
Alabama,Baldwin County,50,3,6,1,3,182265,182265,183193,186659,190396,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
Alabama,Barbour County,50,3,6,1,5,27457,27457,27341,27226,27159,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
Alabama,Bibb County,50,3,6,1,7,22915,22919,22861,22733,22642,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
Alabama,Blount County,50,3,6,1,9,57322,57322,57373,57711,57776,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


In [71]:
df2

Unnamed: 0_level_0,Unnamed: 1_level_0,SUMLEV,REGION,DIVISION,STATE,COUNTY,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
STNAME,CTYNAME,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Alabama,Autauga County,50,3,6,1,1,54571,54571,54660,55253,55175,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.592270,-2.187333
Alabama,Baldwin County,50,3,6,1,3,182265,182265,183193,186659,190396,...,14.832960,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
Alabama,Barbour County,50,3,6,1,5,27457,27457,27341,27226,27159,...,-4.728132,-2.500690,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
Alabama,Bibb County,50,3,6,1,7,22915,22919,22861,22733,22642,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
Alabama,Blount County,50,3,6,1,9,57322,57322,57373,57711,57776,...,1.807375,-1.177622,-1.748766,-2.062535,-1.369970,1.859511,-0.848580,-1.402476,-1.577232,-0.884411
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wyoming,Sweetwater County,50,4,8,56,37,43806,43806,43593,44041,45104,...,1.072643,16.243199,-5.339774,-14.252889,-14.248864,1.255221,16.243199,-5.295460,-14.075283,-14.070195
Wyoming,Teton County,50,4,8,56,39,21294,21294,21297,21482,21697,...,-1.589565,0.972695,19.525929,14.143021,-0.564849,0.654527,2.408578,21.160658,16.308671,1.520747
Wyoming,Uinta County,50,4,8,56,41,21118,21118,21102,20912,20989,...,-17.755986,-4.916350,-6.902954,-14.215862,-12.127022,-18.136812,-5.536861,-7.521840,-14.740608,-12.606351
Wyoming,Washakie County,50,4,8,56,43,8533,8533,8545,8469,8443,...,-11.637475,-0.827815,-2.013502,-17.781491,1.682288,-11.990126,-1.182592,-2.250385,-18.020168,1.441961


In [73]:
# How can we query this DataFrame?
# The loc attribute can take multiple arguments and it can query both the rows and the columns.
# When we use a MultiIndex, we must provide the arguments in order by the level we wish to
# query. Inside of the index, each column is called a level and the outermost column is level
# zero

# In our case, we need to enter first the state and then the county (following the index
# pattern)
df2.loc['Michigan', 'Washtenaw County']

SUMLEV          50.000000
REGION           2.000000
DIVISION         3.000000
STATE           26.000000
COUNTY         161.000000
                  ...    
RNETMIG2011      5.191395
RNETMIG2012      1.248106
RNETMIG2013      4.226778
RNETMIG2014      3.801394
RNETMIG2015      0.595048
Name: (Michigan, Washtenaw County), Length: 98, dtype: float64

In [74]:
# If we are interested in comparing two counties, we can pass a list of tuples describing
# the indices we wish to query into loc. Since we have a MultiIndex of two values, the state
# and the county, we need to provide two values as each element of our filtering list.

# Each tuple should have two elements, the first element being the first index and the 
# second element being the second index

# We'll compare Washtenaw County with Wayne County
df2.loc[ [('Michigan', 'Washtenaw County'),
        ('Michigan', 'Wayne County')]]

Unnamed: 0_level_0,Unnamed: 1_level_0,SUMLEV,REGION,DIVISION,STATE,COUNTY,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
STNAME,CTYNAME,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Michigan,Washtenaw County,50,2,3,26,161,344791,345066,345563,349048,351213,...,0.129569,-4.309822,-1.780293,-2.955078,-6.078985,5.191395,1.248106,4.226778,3.801394,0.595048
Michigan,Wayne County,50,2,3,26,163,1820584,1820641,1815199,1801273,1792514,...,-13.340073,-10.271616,-14.119617,-11.903253,-8.762835,-11.344758,-8.098421,-11.732437,-9.161648,-6.010195
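
The same queries can be tried on a frame small enough to read at a glance. The sketch below assumes a three-row excerpt named `counties` (a name chosen for this example) that keeps only the `CENSUS2010POP` column, with the values shown in the outputs above.

```python
import pandas as pd

# Three county rows with their 2010 census population, taken from the outputs above
counties = pd.DataFrame({
    'STNAME': ['Alabama', 'Michigan', 'Michigan'],
    'CTYNAME': ['Autauga County', 'Washtenaw County', 'Wayne County'],
    'CENSUS2010POP': [54571, 344791, 1820584],
}).set_index(['STNAME', 'CTYNAME'])

# A full (state, county) key returns a single row as a Series
one_row = counties.loc[('Michigan', 'Washtenaw County')]

# Only the outer level returns every county of that state, indexed by CTYNAME
one_state = counties.loc['Michigan']

# A list of tuples returns one row per tuple as a DataFrame
two_rows = counties.loc[[('Michigan', 'Washtenaw County'),
                         ('Michigan', 'Wayne County')]]

print(one_row)
print(one_state)
print(two_rows)
```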


In [75]:
# The hierarchical labeling isn't just for rows. We could transpose this matrix and now have
# hierarchical column labels.
# Projecting a single column which has these labels works exactly the same
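
As a minimal sketch of hierarchical column labels, assume the same kind of three-row excerpt (named `counties` for this example only) and transpose it. The indexing operator then accepts either an outer-level label or a full tuple.

```python
import pandas as pd

# Three county rows with their 2010 census population, taken from the outputs above
counties = pd.DataFrame({
    'STNAME': ['Alabama', 'Michigan', 'Michigan'],
    'CTYNAME': ['Autauga County', 'Washtenaw County', 'Wayne County'],
    'CENSUS2010POP': [54571, 344791, 1820584],
}).set_index(['STNAME', 'CTYNAME'])

# After transposing, the two-level index labels the columns instead of the rows
transposed = counties.T

# An outer-level label projects all columns under that state
michigan_cols = transposed['Michigan']

# A full tuple projects one column as a Series
washtenaw = transposed[('Michigan', 'Washtenaw County')]

print(transposed)
print(michigan_cols)
print(washtenaw)
```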

# Missing values

Types of missing values:

For instance, if we are running a survey and a respondent didn't answer a question, the missing value is actually an omission:
-> **Missing at Random**: if there are other variables that might be used to predict the variable which is missing
-> **Missing Completely at Random**: if there is no relationship to other variables

Other examples: data might be missing because it wasn't collected or because it wouldn't make sense if it were collected. An example of the latter: joining a list of people at a university with a list of offices in the university (students generally don't have offices).

In [88]:
# Pandas is pretty good at detecting missing values directly from underlying data formats,
# like CSV files
# Although missing values are often formatted as NaN, NULL, None or N/A, sometimes they are
# not labeled so clearly.
# The pandas read_csv() function has a parameter called na_values to let us specify the form
# of missing values. It accepts a scalar, string, list, or dictionary
import pandas as pd
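
A minimal sketch of the `na_values` parameter, assuming a hypothetical CSV (held in memory here) where `-999` and the word `missing` stand in for absent values. The column names and markers are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text in which '-999' and 'missing' stand for absent values
raw = io.StringIO("student,score,grade\nA,85,B\nB,-999,missing\nC,90,A\n")

# A dictionary maps each column to the markers that count as missing in that column
sample = pd.read_csv(raw, na_values={'score': ['-999'], 'grade': ['missing']})
print(sample)
```

Passing a list instead of a dictionary would apply the same markers to every column.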

In [89]:
df_3 = pd.read_csv('class_grades.csv')
df_3.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,,63.15,48.89
3,7,,,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61


In [90]:
# We can use the .isnull() function to create a boolean mask of the whole dataframe.
# This effectively broadcasts the isnull() function to every cell of data.

mask2 = df_3.isnull()
mask2.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,True,False,False
3,False,True,True,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,True,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False
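
One common use of such a mask is counting or locating the gaps. The sketch below uses a hypothetical three-row frame (`grades`) shaped like the table above, not the real `class_grades.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical grades frame with a few gaps
grades = pd.DataFrame({'Assignment': [57.14, np.nan, 91.32],
                       'Midterm': [64.38, 49.38, np.nan],
                       'Final': [52.5, 80.56, 73.89]})

mask = grades.isnull()

# True counts as 1, so summing the mask counts the missing cells per column
missing_per_column = mask.sum()

# any(axis=1) flags the rows that have at least one missing value
rows_with_gaps = grades[mask.any(axis=1)]

print(missing_per_column)
print(rows_with_gaps)
```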


In [91]:
# This can be useful for processing rows based on certain columns of data. Another useful
# operation is to be able to drop all of those rows which have any missing data, which can
# be done with the dropna() function.
df_3.dropna().head(15)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61
10,7,80.44,90.2,75.0,91.48,39.72
12,8,97.16,103.71,72.5,93.52,63.33
13,7,91.28,83.53,81.25,99.81,92.22
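
By default dropna() removes a row if any cell is missing, which can discard a lot of data. A hedged sketch of two gentler options, `how` and `subset`, on a hypothetical frame (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 3 is entirely missing
grades = pd.DataFrame({'Assignment': [57.14, np.nan, np.nan, np.nan],
                       'Midterm': [64.38, 49.38, np.nan, np.nan],
                       'Final': [52.5, 80.56, 70.0, np.nan]})

# Default: drop a row if ANY value is missing
complete_rows = grades.dropna()

# how='all': drop a row only if EVERY value is missing
not_empty_rows = grades.dropna(how='all')

# subset: only look at the listed columns when deciding what to drop
has_midterm = grades.dropna(subset=['Midterm'])

print(len(complete_rows), len(not_empty_rows), len(has_midterm))
```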


In [92]:
# If we want to fill all missing values with a single value, we could use fillna()
# This function takes a number of parameters

# We use inplace = True so that it makes the change to the original DataFrame, instead of
# returning us a copy
df_3.fillna(0, inplace = True)
df_3.head(15)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,0.0,63.15,48.89
3,7,0.0,0.0,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,0.0,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61
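
Filling everything with 0 is not always appropriate, since a zero grade pulls averages down. As a sketch, fillna() also accepts a dictionary so each column can get its own fill value; the frame and the choice of the column mean below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap in each column
grades = pd.DataFrame({'Midterm': [64.0, np.nan, 96.0],
                       'Final': [52.5, 80.5, np.nan]})

# Keys are column names, values are what to put in that column's gaps
filled = grades.fillna({'Midterm': grades['Midterm'].mean(), 'Final': 0})

print(filled)
```

Because inplace is not used here, `grades` keeps its missing values and `filled` is a new DataFrame.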


In [93]:
# We can also use the na_filter option to turn off missing-value detection (including empty
# strings), if white space is an actual value of interest --> In practice, this is pretty rare
# In data without any NA's, passing na_filter = False can improve the performance of reading
# a large file

# Sometimes it is also useful to consider missing values as actually having information
df_4 = pd.read_csv('log.csv')
df_4.head(15)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


In [85]:
# There are a lot of missing values in the paused and volume columns. It's not efficient to
# send this information across the network if it hasn't changed. So this particular system
# just inserts null values into the database if there are no changes

In [94]:
# Next up is the method parameter of fillna()
# The two common fill methods are:
# -> ffill: is for forward filling and it updates an NA value for a particular cell with
#    the value from the previous row
# -> bfill: is for backward filling and it fills the missing values with the next valid value

# Important: the data needs to be sorted in order for this to work correctly.
# Data coming from traditional management systems usually has no order guarantee

# In Pandas we can sort either by index or by values.
# In this case, we'll promote the time stamp to an index and then sort on the index
df_4 = df_4.set_index('time')
df_4 = df_4.sort_index()
df_4.head(20)

Unnamed: 0_level_0,user,video,playback position,paused,volume
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,
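
For comparison, a short sketch of both sorting routes on a hypothetical three-row log (the frame `log` is made up; the time stamps are borrowed from the table above):

```python
import pandas as pd

# Hypothetical log rows that arrive out of order
log = pd.DataFrame({'time': [1469974454, 1469974424, 1469974484],
                    'position': [6, 5, 7]})

# Sort by values: 'time' stays an ordinary column
by_value = log.sort_values('time')

# Sort by index: promote 'time' to the index first, then sort on it
by_index = log.set_index('time').sort_index()

print(by_value)
print(by_index)
```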


In [96]:
# We realize that the index isn't really unique -> this is very common

# We'll use multi-level indexing on time and user together instead, promoting the user name
# to a second level of the index to deal with that issue

df_4 = df_4.reset_index()
df_4 = df_4.set_index(['time', 'user'])
df_4.head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,video,playback position,paused,volume
time,user,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,
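
Whether an index is unique can be checked directly with the `is_unique` attribute. A sketch on a hypothetical three-row log, using values taken from the table above:

```python
import pandas as pd

# Hypothetical rows: two users share the same time stamp
log = pd.DataFrame({'time': [1469974424, 1469974424, 1469974454],
                    'user': ['cheryl', 'sue', 'cheryl'],
                    'position': [5, 23, 6]})

# Indexing on time alone leaves duplicate labels
time_only_unique = log.set_index('time').index.is_unique

# Indexing on (time, user) gives each row its own label
by_both = log.set_index(['time', 'user'])
both_unique = by_both.index.is_unique

# A full (time, user) tuple now selects exactly one cell
sue_position = by_both.loc[(1469974424, 'sue'), 'position']

print(time_only_unique, both_unique, sue_position)
```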


In [97]:
# Now that the data is properly sorted, we can fill the missing data using ffill. It's good
# to remember that we can deal with individual columns or sets of columns by projecting
# them, so we don't have to fix all missing values in one command.
# ffill() is equivalent to fillna(method = 'ffill'), which newer pandas versions deprecate
df_4 = df_4.ffill()
df_4.head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,video,playback position,paused,volume
time,user,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,False,10.0
1469974454,sue,advanced.html,24,False,10.0
1469974484,cheryl,intro.html,7,False,10.0
1469974514,cheryl,intro.html,8,False,10.0
1469974524,sue,advanced.html,25,False,10.0
1469974544,cheryl,intro.html,9,False,10.0
1469974554,sue,advanced.html,26,False,10.0
1469974574,cheryl,intro.html,10,False,10.0
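
To illustrate fixing one column at a time, here is a sketch that forward fills only a projected column. The frame `log` is a hypothetical cut of the log data; only `volume` is filled and `paused` is deliberately left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the log, already sorted by time
log = pd.DataFrame({'user': ['cheryl', 'cheryl', 'cheryl'],
                    'paused': [False, None, None],
                    'volume': [10.0, np.nan, np.nan]},
                   index=pd.Index([1469974424, 1469974454, 1469974484], name='time'))

# Project the column, forward fill it, and assign it back
log['volume'] = log['volume'].ffill()

print(log)
```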


In [98]:
# We can also do customized fill-in to replace values with the replace() function. It allows
# replacement using several approaches: value-to-value, list, dictionary, regex.

# Example
df5 = pd.DataFrame({
    'A' : [1, 1, 2, 3, 4],
    'B' : [3, 6, 3, 8, 9],
    'C' : ['a', 'b', 'c', 'd', 'e']
})

df5

Unnamed: 0,A,B,C
0,1,3,a
1,1,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [100]:
# We can replace 1's with 100
# Let's try the value-to-value approach
df5.replace(1, 100)

Unnamed: 0,A,B,C
0,100,3,a
1,100,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [101]:
# Change two values: 1's to 100 and 3's to 300
df5.replace([1, 3], [100, 300])

Unnamed: 0,A,B,C
0,100,300,a
1,100,6,b
2,2,300,c
3,300,8,d
4,4,9,e
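
The dictionary approach is not shown above, so here is a small sketch of it on the same kind of frame. The name `sample` is illustrative; the nested form restricts the replacement to a single column.

```python
import pandas as pd

sample = pd.DataFrame({'A': [1, 1, 2, 3, 4],
                       'B': [3, 6, 3, 8, 9],
                       'C': ['a', 'b', 'c', 'd', 'e']})

# Flat dictionary: old value -> new value, applied to the whole frame
everywhere = sample.replace({1: 100, 3: 300})

# Nested dictionary: column name -> {old value -> new value}
only_in_a = sample.replace({'A': {3: 300}})

print(everywhere)
print(only_in_a)
```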


In [103]:
# Replace using regex
# First parameter (to_replace) -> the regex pattern we want to match
# Second parameter (value) -> the value we want to emit upon match
# Third parameter -> 'regex = True'

# Example
# We want to replace every value that ends with ".html" with "webpage"
df6 = pd.read_csv('log.csv')
df6.head(20)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


In [104]:
# Solution
# .* -> Anything before '.html'
# \. -> A literal dot (an unescaped . would match any character)
# $ -> '.html' has to be at the end
df6.replace(to_replace = r'.*\.html$', value = 'webpage', regex = True)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,webpage,5,False,10.0
1,1469974454,cheryl,webpage,6,,
2,1469974544,cheryl,webpage,9,,
3,1469974574,cheryl,webpage,10,,
4,1469977514,bob,webpage,1,,
5,1469977544,bob,webpage,1,,
6,1469977574,bob,webpage,1,,
7,1469977604,bob,webpage,1,,
8,1469974604,cheryl,webpage,11,,
9,1469974694,cheryl,webpage,14,,
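
One detail worth checking with regex replacement is the dot, which matches any character unless it is escaped. The sketch below compares both spellings on a hypothetical Series (`videos`) that includes a value without a real `.html` extension.

```python
import pandas as pd

# Hypothetical values; 'notes_html' has no real '.html' extension
videos = pd.Series(['intro.html', 'advanced.html', 'notes_html', 'intro.pdf'])

# Escaped dot: only values that truly end in '.html' are replaced
escaped = videos.replace(to_replace=r'.*\.html$', value='webpage', regex=True)

# Unescaped dot: the '_' in 'notes_html' also matches, so it is replaced too
unescaped = videos.replace(to_replace='.*.html$', value='webpage', regex=True)

print(escaped.tolist())
print(unescaped.tolist())
```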


When we use statistical functions on DataFrames, these functions typically ignore missing values. For instance, mean() skips them when computing the average.
This is usually what we want, but we should be aware of the values that are being excluded.
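
A quick sketch of that behaviour on a hypothetical Series (`scores`), along with two ways to notice what was excluded: the `skipna` parameter and `count()`.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value
scores = pd.Series([80.0, np.nan, 100.0])

# By default the missing value is skipped: (80 + 100) / 2
default_mean = scores.mean()

# skipna=False makes the gap visible: the result is NaN
strict_mean = scores.mean(skipna=False)

# count() reports how many values were actually used
used = scores.count()

print(default_mean, strict_mean, used)
```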

# Example: Manipulating DataFrame

In [17]:
import pandas as pd

df7 = pd.read_csv('presidents.csv')

df7.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days"
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days"
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days"
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days"
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days"


In [18]:
# We see a bunch of footnotes in the "Born" column which might cause issues.
# Let's start by splitting the name into first name and last name. We're going to tackle
# this with a regex.

# We want to create two new columns and apply a regex to the projection of the "President"
# column

# We make a copy of the President column
df7['First'] = df7['President']

# Then we can call replace() and just have a pattern that matches the last name and set it to
# an empty string
df7['First'] = df7['First'].replace('[ ].*', '', regex = True)
df7.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,First
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James


In [19]:
# This works, but it's kind of gross and slow, since we had to make a full copy of a column
# then go through and update strings.
# There are a few other ways we can deal with this.

# The most general one: apply()

# First we drop (remove) the column we made
del df7['First']

# The apply() function on a dataframe will take some arbitrary function we have written and
# apply it to either a Series (a single column) or DataFrame across all rows or columns.

# Let's write a function which just splits a string into two pieces using a single row of
# data
def splitname(row):
    # The row is a single Series object which is a single row indexed by column values
    row['First'] = row['President'].split(' ')[0]
    row['Last'] = row['President'].split(' ')[-1]
    return row

# Now we apply the function to every row: axis = 'columns' passes one row at a time, indexed
# by the column labels
df7 = df7.apply(splitname, axis = 'columns')

df7.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,First,Last
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George,Washington
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John,Adams
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas,Jefferson
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James,Madison
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James,Monroe
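
As an aside that is not part of the walkthrough above, the same split can be sketched with the vectorized `.str` accessor instead of a row-wise apply(). The frame `people` is hypothetical, and keeping only the first and last word is an assumption carried over from splitname().

```python
import pandas as pd

# Hypothetical names, including one with a middle name
people = pd.DataFrame({'President': ['George Washington', 'John Quincy Adams']})

# str.split() turns each name into a list of words
parts = people['President'].str.split(' ')

# .str[0] and .str[-1] pick the first and last word of each list
people['First'] = parts.str[0]
people['Last'] = parts.str[-1]

print(people)
```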


In [20]:
# It isn't less gross, but it achieves the result
# Another way to do it is with .str.extract()

# First we delete the columns we just created
del df7['First']
del df7['Last']

# extract() takes a regular expression as input and requires us to set capture groups that
# correspond to the output columns we are interested in.

# First we write the pattern, as a raw string so the backslashes reach the regex engine intact
pattern = r'(^[\w]*)(?:.* )([\w]*$)'

df7['President'].str.extract(pattern).head()

Unnamed: 0,0,1
0,George,Washington
1,John,Adams
2,Thomas,Jefferson
3,James,Madison
4,James,Monroe


In [21]:
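# We can name the groups in the regex
pattern = r'(?P<First>^[\w]*)(?:.* )(?P<Last>[\w]*$)'

# Now we call extract on the whole column, and only use head() to preview the result, so
# that every row (not just the first five) gets a name
names = df7['President'].str.extract(pattern)
names.head()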

Unnamed: 0,First,Last
0,George,Washington
1,John,Adams
2,Thomas,Jefferson
3,James,Madison
4,James,Monroe
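
It helps to know what extract() does when a value does not fit the pattern. In this sketch the second entry is a hypothetical single-word name, which the pattern cannot match because it requires a space.

```python
import pandas as pd

# Hypothetical input: the second value has no space, so the pattern cannot match
presidents = pd.Series(['George Washington', 'Madonna'])

pattern = r'(?P<First>^[\w]*)(?:.* )(?P<Last>[\w]*$)'

# Named groups become column names; rows that do not match are filled with NaN
extracted = presidents.str.extract(pattern)

print(extracted)
```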


In [22]:
# Now we can just copy these into our main dataframe
df7['First'] = names['First']
df7['Last'] = names['Last']
df7.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,First,Last
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George,Washington
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John,Adams
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas,Jefferson
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James,Madison
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James,Monroe


In [23]:
# Pandas str module documentation:
# https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

In [24]:
# Now let's clean up the Born column

# We first get rid of anything that isn't in the pattern of Month Day, Year
# (a raw string keeps the \w escapes intact)
df7['Born'] = df7['Born'].str.extract(r'([\w]{3} [\w]{1,2}, [\w]{4})')
df7['Born'].head()

0    Feb 22, 1732
1    Oct 30, 1735
2    Apr 13, 1743
3    Mar 16, 1751
4    Apr 28, 1758
Name: Born, dtype: object

In [25]:
# The type of the column is object, and we can actually update it to datetime type
df7['Born'] = pd.to_datetime(df7['Born'])
df7['Born'].head()

0   1732-02-22
1   1735-10-30
2   1743-04-13
3   1751-03-16
4   1758-04-28
Name: Born, dtype: datetime64[ns]
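
A hedged sketch of two to_datetime() options that can help with messy date columns: an explicit `format`, and `errors='coerce'` to turn unparseable values into NaT instead of raising. The Series `born` is hypothetical, and `%b` assumes English month abbreviations in the current locale.

```python
import pandas as pd

# Hypothetical input with one value that is not a date
born = pd.Series(['Feb 22, 1732', 'Oct 30, 1735', 'not a date'])

# format spells out the layout; errors='coerce' yields NaT for bad values
parsed = pd.to_datetime(born, format='%b %d, %Y', errors='coerce')

# Once the dtype is datetime, the .dt accessor exposes the date parts
years = parsed.dt.year

print(parsed)
print(years)
```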

In [26]:
# This makes subsequent processing on the dataframe around dates much easier

str functions are incredibly useful and, because they are vectorized, they are very efficient.
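
A few of these str functions, sketched on a hypothetical Series that includes a missing value. Each call works on the whole column at once and passes missing values through.

```python
import pandas as pd

# Hypothetical names, with one missing entry
presidents = pd.Series(['George Washington', 'John Adams', None])

# Each method is applied to every element; the missing entry stays missing
upper = presidents.str.upper()
lengths = presidents.str.len()

# na=False treats the missing entry as "no match" so the result is a clean boolean mask
has_john = presidents.str.contains('John', na=False)

print(upper)
print(lengths)
print(has_john)
```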