# DataFrames

A DataFrame is conceptually a two-dimensional Series: it has an index and multiple columns of content, and each column has a label. We can think of the DataFrame as a labeled array with two axes.

In [1]:
import pandas as pd

In [2]:
record1 = pd.Series({'Name' : 'Alice',
                     'Class' : 'Physics',
                     'Score' : 85})
record2 = pd.Series({'Name' : 'Jack',
                     'Class' : 'Chemistry',
                     'Score' : 82})
record3 = pd.Series({'Name' : 'Helen',
                     'Class' : 'Biology',
                     'Score' : 90})

In [3]:
# Like a Series, a DataFrame has an index of row labels.
# We'll use a group of Series, where each Series represents a row of data. Just like with the
# Series constructor, we can pass in our individual items as a list, and we can pass in our
# index values as a second argument

df = pd.DataFrame([record1, record2, record3], index = ['school1', 'school2', 'school3'])
df.head()

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school3,Helen,Biology,90


In [4]:
# The index is the leftmost column (school name)
# Then we have the rows of data, where each row has a column header which was given in our
# initial record dictionaries
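
As a small sketch of the two label sets described above, the snippet below rebuilds the same three records and reads the row labels and column labels back from the `index` and `columns` attributes. The variable names (`demo_df`, `row_labels`, `column_labels`) are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same three records as above, written as a list of dictionaries
students = [
    {"Name": "Alice", "Class": "Physics", "Score": 85},
    {"Name": "Jack", "Class": "Chemistry", "Score": 82},
    {"Name": "Helen", "Class": "Biology", "Score": 90},
]
demo_df = pd.DataFrame(students, index=["school1", "school2", "school3"])

row_labels = list(demo_df.index)       # labels along axis 0 (rows)
column_labels = list(demo_df.columns)  # labels along axis 1 (columns)

print(row_labels)
print(column_labels)
print(demo_df.shape)  # (number of rows, number of columns)
```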

In [5]:
# Alternative method

# We can use a list of dictionaries, where each dictionary represents a row of data
students = [{'Name' : 'Alice',
             'Class' : 'Physics',
             'Score' : 85},
           {'Name' : 'Jack',
             'Class' : 'Chemistry',
             'Score' : 82},
           {'Name' : 'Helen',
             'Class' : 'Biology',
             'Score' : 90}]

# Then we pass this list of dictionaries into the DataFrame function
df = pd.DataFrame(students, index = ['school1', 'school2', 'school1'])

df.head()

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school1,Helen,Biology,90


In [6]:
# Similar to the Series, we can extract data using the .iloc and .loc attributes.
# Because the DataFrame is two-dimensional, passing a single label to the .loc indexing
# operator returns a Series when exactly one row matches that label.

# Example: data associated with school2
df.loc['school2']

Name          Jack
Class    Chemistry
Score           82
Name: school2, dtype: object

In [8]:
# The index label of the row becomes the name of the returned Series, and the column names
# become its index

# We can check the data type
type(df.loc['school2'])

pandas.core.series.Series

In [9]:
# Index labels and column names, along either axis, can be non-unique.
# If we pass a non-unique label to the DataFrame .loc attribute, all matching rows are
# returned, not as a new Series, but as a new DataFrame

# Example
df.loc['school1']

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school1,Helen,Biology,90


In [10]:
# If we check the type
type(df.loc['school1'])

pandas.core.frame.DataFrame
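
Because the return type depends on how many rows match, it can help to ask for a DataFrame explicitly. A minimal sketch, assuming a small stand-in frame (`demo_df` is illustrative): passing a list of labels to `.loc` returns a DataFrame even when only one row matches.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {"Name": ["Alice", "Jack", "Helen"], "Score": [85, 82, 90]},
    index=["school1", "school2", "school1"],
)

as_series = demo_df.loc["school2"]   # scalar label with one match -> Series
as_frame = demo_df.loc[["school2"]]  # list of labels -> DataFrame

print(type(as_series))
print(type(as_frame))
```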

In [11]:
# We can quickly select data based on multiple axes
# We can supply two parameters to .loc, one being the row index label and the other being the
# column name

# Example
df.loc['school1', 'Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

In [12]:
# Just like for the Series, the pandas developers have implemented this using the indexing
# operator and not as parameters to a function

# How can we select a single column?
# There are several options

# 1) We could transpose the matrix with the T attribute
df.T

Unnamed: 0,school1,school2,school1.1
Name,Alice,Jack,Helen
Class,Physics,Chemistry,Biology
Score,85,82,90


In [13]:
# Then we call .loc on the transpose to get the student names only
df.T.loc['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

In [14]:
# 2)
# However, since .iloc and .loc are used for row selection, pandas reserves the indexing
# operator directly on the DataFrame for column selection.
# In a pandas DataFrame, columns always have a name, so this selection is always label based.

df['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

In [16]:
# However, this also means that we get a key error if we try and use .loc with a column name
df.loc['Name']

KeyError: 'Name'

In [17]:
# The result of a single column projection is a Series object
type(df['Name'])

pandas.core.series.Series

In [18]:
# Since the result of using the indexing operator is either a DataFrame or Series, we can
# chain operations together

# Example
df.loc['school1']['Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

In [19]:
# We can check the types to see the difference

print(type(df.loc['school1']))     #DataFrame
print(type(df.loc['school1']['Name']))     #Series

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [20]:
# .loc -> for rows
# indexing -> for columns

In [21]:
# Chaining tends to cause pandas to return a copy of the DataFrame instead of a view on the
# DataFrame, so it's better to avoid it when a single .loc call will do.
# For selecting data this is not a big deal, though it might be slower than necessary.
# The distinction matters when we assign values: a write to a copy does not reach the original
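
A minimal sketch of the non-chained alternative for assignment, assuming a stand-in frame with unique row labels (`demo_df` is illustrative). A single `.loc` call with both the row label and the column name writes into the frame itself.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {"Name": ["Alice", "Jack", "Helen"], "Score": [85, 82, 90]},
    index=["school1", "school2", "school3"],
)

# One indexing operation: row label and column name together
demo_df.loc["school2", "Score"] = 88

# The chained form demo_df.loc["school2"]["Score"] = 88 may write to a
# temporary copy instead, so it is deliberately not executed here.
print(demo_df)
```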

In [22]:
# .loc does row selection and it can take two parameters:
# 1) the row index
# 2) the list of column names
# The .loc attribute also supports slicing

# To select all rows, we can use a colon ':' to indicate a full slice
# Then we can add the column name as the second parameter as a string
# We can include multiple columns in a list

# columns 'Name' and 'Score' for all the rows
df.loc[:, ['Name', 'Score']]

Unnamed: 0,Name,Score
school1,Alice,85
school2,Jack,82
school1,Helen,90
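
The slicing support mentioned above also works with labels, and unlike ordinary Python slices, a label slice in `.loc` includes both endpoints. A short sketch, assuming a stand-in frame with unique row labels (`demo_df` is illustrative):

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        "Name": ["Alice", "Jack", "Helen"],
        "Class": ["Physics", "Chemistry", "Biology"],
        "Score": [85, 82, 90],
    },
    index=["school1", "school2", "school3"],
)

# Row slice by label: both 'school1' and 'school2' are included
subset = demo_df.loc["school1":"school2", ["Name", "Score"]]

# Column slice by label: 'Name' through 'Class', for all rows
name_to_class = demo_df.loc[:, "Name":"Class"]

print(subset)
print(name_to_class)
```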


In [23]:
# That's selecting and projecting data from a DataFrame based on row and column labels.

## Dropping data

### drop() function

In [24]:
# We can use the drop() function
# It takes a single required parameter, which is the index or row label to drop
# It doesn't change the DataFrame by default -> instead it returns a copy of the DataFrame
# with the given rows removed

df.drop('school1')

Unnamed: 0,Name,Class,Score
school2,Jack,Chemistry,82


In [25]:
# If we look at the original DataFrame, the data is still intact
df

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school1,Helen,Biology,90


In [26]:
# Drop has two interesting optional parameters:
# 1) inplace -> if True, the DataFrame will be updated in place, instead of a copy being
# returned
# 2) axis -> which axis to drop from, rows (0) or columns (1) -> the default is 0

# Let's make a copy of our dataframe
copy_df = df.copy()

# Let's drop the name column in this copy
copy_df.drop('Name', inplace = True, axis = 1)
copy_df

Unnamed: 0,Class,Score
school1,Physics,85
school2,Chemistry,82
school1,Biology,90
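
As an alternative to the `axis` argument, `drop()` also accepts the keyword arguments `index=` and `columns=`, which state the target axis by name. A minimal sketch on a stand-in frame (`demo_df` is illustrative); both calls return new frames and leave the original untouched.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        "Name": ["Alice", "Jack", "Helen"],
        "Class": ["Physics", "Chemistry", "Biology"],
        "Score": [85, 82, 90],
    },
    index=["school1", "school2", "school3"],
)

without_class = demo_df.drop(columns=["Class"])  # same effect as axis=1
without_school2 = demo_df.drop(index="school2")  # same effect as axis=0

print(without_class)
print(without_school2)
```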


### del operator

In [27]:
# Another way to drop a column is directly through the use of the indexing operator, using
# the del keyword.
# This way of dropping data takes immediate effect on the DataFrame and does not return a copy

del copy_df['Class']
copy_df

Unnamed: 0,Score
school1,85
school2,82
school1,90


## Adding Data

In [28]:
# Adding a new column to the DataFrame is as easy as assigning it to some value using the
# indexing operator

# Example: add a class ranking column with default value of None
# This broadcasts the default value to the new column immediately

df['ClassRanking'] = None
df

Unnamed: 0,Name,Class,Score,ClassRanking
school1,Alice,Physics,85,
school2,Jack,Chemistry,82,
school1,Helen,Biology,90,
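
Besides broadcasting a single value, the same assignment accepts a Series, in which case pandas aligns the values by index label. A small sketch, assuming a stand-in frame with unique labels; the `Feedback` column and its values are made up for illustration.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {"Name": ["Alice", "Jack", "Helen"], "Score": [85, 82, 90]},
    index=["school1", "school2", "school3"],
)

# Only two of the three row labels are present in this Series
feedback = pd.Series({"school1": "Excellent", "school3": "Good"})

# Values are matched by label; rows without a match receive a missing value
demo_df["Feedback"] = feedback
print(demo_df)
```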


## Indexing and Loading

In [32]:
# Jupyter notebooks use IPython as the kernel underneath, which provides convenient ways
# to integrate lower-level shell commands, which are programs running in the underlying
# operating system.

# In IPython, if we prepend a line with an exclamation mark, it will execute the remainder of
# the line as a shell command

# Here we will use 'cat' for 'concatenate', which just outputs the contents of a file
!cat Admission_Predict.csv

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR ,CGPA,Research,Chance of Admit 
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4,4.5,8.87,1,0.76
3,316,104,3,3,3.5,8,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2,3,8.21,0,0.65
6,330,115,5,4.5,3,9.34,1,0.9
7,321,109,3,3,4,8.2,1,0.75
8,308,101,2,3,4,7.9,0,0.68
9,302,102,1,2,1.5,8,0,0.5
10,323,108,3,3.5,3,8.6,0,0.45
11,325,106,3,3.5,4,8.4,1,0.52
12,327,111,4,4,4.5,9,1,0.84
13,328,112,4,4,4.5,9.1,1,0.78
14,307,109,3,4,3,8,1,0.62
15,311,104,3,3.5,2,8.2,1,0.61
16,314,105,3,3.5,2.5,8.3,0,0.54
17,317,107,3,4,3,8.7,0,0.66
18,319,106,3,4,3,8,1,0.65
19,318,110,3,4,3,8.8,0,0.63
20,303,102,3,3.5,3,8.5,0,0.62
21,312,107,3,3,2,7.9,1,0.64
22,325,114,4,3,2,8.4,0,0.7
23,328,116,5,5,5,9.5,1,0.94
24,334,119,5,5,4.5,9.7,1,0.95
25,336,119,5,4,3.5,9.8,1,0.97
26,340,120,5,4.5,4.5,9.6,1,0.94
27,322,109,5,4.5,3.5,8.8,0,0.76
28,298,98,2,1.5,2.5,7.5,1,0.44
29,295,93,1,2,2,7.2,0,0.46
30,310,99

In [2]:
# The column identifiers are listed as strings on the first line of the file
# Then we have rows of data, with all columns separated by commas

import pandas as pd

# We turn the CSV into a dataframe
df = pd.read_csv('Admission_Predict.csv')

df.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [4]:
# By default, the index starts at 0 while the students' serial numbers start at 1 -> pandas
# created a new index


# Instead, we can set the serial no. as the index by using the index_col parameter
df = pd.read_csv('Admission_Predict.csv', index_col = 0)
df.head()

Unnamed: 0_level_0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65
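
The `index_col` parameter accepts either a column position or a column name. The sketch below reads a three-row excerpt of the file from an in-memory string so that it runs without `Admission_Predict.csv` on disk; `csv_text` and the two result names are illustrative.

```python
import io

import pandas as pd

# A short excerpt of the CSV, held in memory instead of on disk
csv_text = """Serial No.,GRE Score,TOEFL Score
1,337,118
2,324,107
3,316,104
"""

by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)
by_name = pd.read_csv(io.StringIO(csv_text), index_col="Serial No.")

print(by_name)
```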


In [5]:
# Let's change the SOP and LOR column names
# We can use the rename() function. It takes a parameter called columns, to which we pass a
# dictionary where each key is an old column name and each value is the corresponding new
# column name. Columns left out of the dictionary keep their names.

new_df = df.rename(columns = {'SOP' : 'Statement of Purpose',
                              'LOR' : 'Letter of Recommendation'})

new_df.head()

Unnamed: 0_level_0,GRE Score,TOEFL Score,University Rating,Statement of Purpose,LOR,CGPA,Research,Chance of Admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [6]:
# Only 'SOP' changed, but not 'LOR'
# Let's investigate the column names
new_df.columns

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'Statement of Purpose',
       'LOR ', 'CGPA', 'Research', 'Chance of Admit '],
      dtype='object')

In [9]:
# We can see that there is a space after 'LOR' and after 'Chance of Admit'

# There are several ways we can address this:
# 1) Changing the column name including the space

new_df = new_df.rename(columns = {'LOR ' : 'Letter of Recomendation'})
new_df.head()

Unnamed: 0_level_0,GRE Score,TOEFL Score,University Rating,Statement of Purpose,Letter of Recomendation,CGPA,Research,Chance of Admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [10]:
# It worked, but what if that was a tab or two spaces?
# 2) Another way is to use the string function strip(). We pass the function to rename() as
# the mapper parameter, and then indicate whether the axis should be columns or index
# (row labels)
new_df = new_df.rename(mapper = str.strip, axis = 'columns')

new_df.head()

Unnamed: 0_level_0,GRE Score,TOEFL Score,University Rating,Statement of Purpose,Letter of Recomendation,CGPA,Research,Chance of Admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65
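
To see the function-based rename in isolation, here is a minimal sketch on a made-up one-row frame whose column names end in a space and a tab (`messy` and `clean` are illustrative names). It uses the `columns=` keyword, which accepts a function in the same way as `mapper` with `axis='columns'`, and also shows the equivalent `.str.strip()` call on the column index.

```python
import pandas as pd

# Column names with trailing whitespace: one space, one tab
messy = pd.DataFrame([[4.5, 0.92]], columns=["LOR ", "Chance of Admit\t"])

# str.strip is applied to every column name; a new frame is returned
clean = messy.rename(columns=str.strip)

# The same cleanup expressed on the column index itself
stripped_names = list(messy.columns.str.strip())

print(list(clean.columns))
print(list(messy.columns))  # the original names are unchanged
```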


In [11]:
# IMPORTANT: rename() does not modify the original dataframe. It returns a copy with the
# changed names, which we stored in new_df
df.columns

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA',
       'Research', 'Chance of Admit '],
      dtype='object')

In [12]:
# We can also assign a list of column names to the df.columns attribute, which directly
# renames the columns. This modifies the original dataframe and is very efficient,
# especially when we have many columns and just want to change a few.
# This technique is also not affected by subtle errors in the column names.
# With a list, we can use the list index to change a certain value or use a list
# comprehension to change all of the values.

# Example
# 1) We get our list
cols = list(df.columns)

# 2) Then we update the values of the column names list
cols = [x.lower().strip() for x in cols]

# 3) We update the column names in our dataframe with the list
df.columns = cols

df.head()

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


There are other data sources we can load directly into dataframes as well, including HTML web pages, databases, and other file formats. However, CSV is one of the most common data formats.

## Querying DataFrame

First step: understanding boolean masking

A boolean mask is an array, which can be one-dimensional like a Series or two-dimensional like a DataFrame, where each value is either True or False. This array is essentially overlaid on top of the data structure that we're querying. Any cell aligned with a True value is admitted into our final result, and the rest are not.
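
Before turning to the admissions data, the idea can be sketched on a three-element Series. The names and scores below are made up for illustration.

```python
import pandas as pd

scores = pd.Series([85, 82, 90], index=["Alice", "Jack", "Helen"])

# The comparison produces a boolean Series with the same index as the data
mask = scores > 84

# Laying the mask over the data keeps only the entries aligned with True
selected = scores[mask]

print(mask)
print(selected)
```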

In [14]:
import pandas as pd

df = pd.read_csv('Admission_Predict.csv', index_col = 0)

# We clean up the column names
df.columns = [x.lower().strip() for x in df.columns]

df.head()

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [15]:
# Boolean masks are created by applying operators directly to the pandas Series or 
# DataFrame objects.

# Example: students that have a chance higher than 0.7
admit_mask = df['chance of admit'] > 0.7
admit_mask

Serial No.
1       True
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398     True
399    False
400     True
Name: chance of admit, Length: 400, dtype: bool

In [16]:
# The result of broadcasting a comparison operator is a boolean mask: True/False values
# Underneath, pandas applies the comparison operator we specified through vectorization
# (so efficiently, without an explicit Python loop) to all values in the array we specified.
# The result is a Series, since only one column is being operated on, filled with True/False

In [17]:
# Once we have the boolean mask, we can lay it on top of the data to 'hide' the data we don't
# want, which is represented by all of the False values.
# We do this by using the .where() function on the original DataFrame
df.where(admit_mask).head()

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
5,,,,,,,,


In [19]:
# The resulting DataFrame keeps the original indexed values, and only data which met the
# condition was retained. For the rest, they have NaN data instead, but these rows were not
# dropped from our dataset.

# Now, if we want to remove the NaN data, we do the following
df.where(admit_mask).dropna().head(10)

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
6,330.0,115.0,5.0,4.5,3.0,9.34,1.0,0.9
7,321.0,109.0,3.0,3.0,4.0,8.2,1.0,0.75
12,327.0,111.0,4.0,4.0,4.5,9.0,1.0,0.84
13,328.0,112.0,4.0,4.0,4.5,9.1,1.0,0.78
23,328.0,116.0,5.0,5.0,5.0,9.5,1.0,0.94
24,334.0,119.0,5.0,5.0,4.5,9.7,1.0,0.95


In [20]:
# A shorthand for where() and dropna() is the following:
df[df['chance of admit'] > 0.7].head(10)

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
6,330,115,5,4.5,3.0,9.34,1,0.9
7,321,109,3,3.0,4.0,8.2,1,0.75
12,327,111,4,4.0,4.5,9.0,1,0.84
13,328,112,4,4.0,4.5,9.1,1,0.78
23,328,116,5,5.0,5.0,9.5,1,0.94
24,334,119,5,5.0,4.5,9.7,1,0.95
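
The two approaches select the same rows, but they are not identical in every detail, as the outputs above hint: `where()` fills the masked rows with NaN, which turns integer columns into floats, while boolean indexing keeps the original dtypes. A small sketch on a made-up three-row frame (`demo_df` is illustrative). Note also that `dropna()` would remove rows that already contained NaN before masking.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {"gre score": [337, 314, 330], "chance of admit": [0.92, 0.65, 0.90]},
    index=[1, 5, 6],
)
mask = demo_df["chance of admit"] > 0.7

via_where = demo_df.where(mask).dropna()  # NaN-filled, then NaN rows dropped
via_indexing = demo_df[mask]              # rows selected directly

print(via_where.dtypes)
print(via_indexing.dtypes)
```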


In [21]:
# The indexing operator does two things:

# 1) It can be called with a string parameter to project a single column -> we obtain a Series
# 2) We can send it a list of columns as strings -> we obtain a DataFrame

In [22]:
# Example of 1)
df['gre score'].head()

Serial No.
1    337
2    324
3    316
4    322
5    314
Name: gre score, dtype: int64

In [23]:
df[['gre score', 'toefl score']].head()

Unnamed: 0_level_0,gre score,toefl score
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1
1,337,118
2,324,107
3,316,104
4,322,110
5,314,103


In [24]:
# We can also send it a boolean mask
df[df['gre score'] > 320].head()

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
4,322,110,3,3.5,2.5,8.67,1,0.8
6,330,115,5,4.5,3.0,9.34,1,0.9
7,321,109,3,3.0,4.0,8.2,1,0.75


In [25]:
# Each of these mimics functionality from either .loc or .where().dropna()

### Combining multiple boolean masks

In [26]:
# We combine multiple boolean masks to apply multiple criteria for inclusion

# This multi-masking is done differently in Python compared to other languages

# This returns an error
(df['chance of admit'] > 0.7) and (df['chance of admit'] > 0.9)

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [27]:
# The problem is that we have Series objects, and Python underneath doesn't know how to
# compare two Series using 'and' or 'or'. Instead, the pandas authors have overloaded the
# pipe | and ampersand & operators to handle this

(df['chance of admit'] > 0.7) & (df['chance of admit'] > 0.9)

Serial No.
1       True
2      False
3      False
4      False
5      False
       ...  
396    False
397    False
398     True
399    False
400     True
Name: chance of admit, Length: 400, dtype: bool

In [29]:
# It is also important to put the condition within parentheses. Otherwise, it returns an error

df['chance of admit'] > 0.7 & df['chance of admit'] > 0.9

TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]

In [None]:
# The problem is that & binds more tightly than >, so Python tries to bitwise 'and' 0.7 with
# a pandas Series, when we actually want to bitwise 'and' the two boolean Series together
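
A short sketch of the parenthesized forms on a made-up four-value Series (`chance` is illustrative), covering and (`&`), or (`|`) and not (`~`):

```python
import pandas as pd

chance = pd.Series([0.92, 0.76, 0.65, 0.45])

# Each comparison is wrapped in parentheses so it is evaluated before & or |
between_07_and_09 = (chance > 0.7) & (chance < 0.9)
very_high_or_low = (chance > 0.9) | (chance < 0.5)
not_above_07 = ~(chance > 0.7)

print(between_07_and_09.tolist())
print(very_high_or_low.tolist())
print(not_above_07.tolist())
```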

In [30]:
# Another way

# We can get rid of the comparison operator completely and use the built-in functions which
# mimic this approach
df['chance of admit'].gt(0.7) & df['chance of admit'].lt(0.9)

Serial No.
1      False
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398    False
399    False
400    False
Name: chance of admit, Length: 400, dtype: bool

In [31]:
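# These functions are built right into the Series and DataFrame objects, so we can chain them
# too. Note that chaining does not give the same answer: .lt(0.9) is applied to the boolean
# result of .gt(0.7), and since True counts as 1 and False as 0, the mask is inverted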
df['chance of admit'].gt(0.7).lt(0.9)

Serial No.
1      False
2      False
3      False
4      False
5       True
       ...  
396    False
397    False
398    False
399     True
400    False
Name: chance of admit, Length: 400, dtype: bool

In [32]:
# This only works if our operator, such as less than or greater than, is built into the
# DataFrame.
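
The difference between combining and chaining these methods is easy to check on a tiny made-up Series (`chance` is illustrative). Chaining `.gt(0.7).lt(0.9)` compares the boolean result with 0.9, so it flips the first mask rather than testing a range. `Series.between()` is one way to express a range test in a single call; with its default settings both endpoints are included.

```python
import pandas as pd

chance = pd.Series([0.92, 0.76, 0.65])

combined = chance.gt(0.7) & chance.lt(0.9)  # two masks joined with &
chained = chance.gt(0.7).lt(0.9)            # lt() applied to True/False values
in_range = chance.between(0.7, 0.9)         # range test, endpoints included

print(combined.tolist())
print(chained.tolist())
print(in_range.tolist())
```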

Boolean masking is extremely important and often used in the world of data science.

With boolean masking we can select data based on the criteria we desire.

## Indexing DataFrames

Both Series and DataFrames can have indices applied to them. The index is essentially a row-level label, and in pandas the rows correspond to axis zero.

Indices can:
- either be autogenerated, such as when we create a new Series without an index, in which case we get numeric values
- or they can be set explicitly, like when we use a dictionary object to create the Series, or when we load data from a CSV file and set the appropriate parameters.

Another option for setting an index is to use the set_index() function. This function takes a column name or a list of column names and promotes those columns to an index.

In [5]:
# The set_index() function is a destructive process, and it doesn't keep the current index.
# If we want to keep the current index, we need to manually create a new column and copy into
# it values from the index attribute

import pandas as pd
df = pd.read_csv('Admission_Predict.csv', index_col = 0)
df.head()


Unnamed: 0_level_0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [7]:
# Quick pre-step: we clean the column names
df.columns = [x.lower().strip() for x in df.columns]
df.head()

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [8]:
# Example: we want 'chance of admit' to be the new index, but also want to keep serial no.

# So we copy the indexed data into its own column
df['serial number'] = df.index

# Then we set the index to another column
df = df.set_index('chance of admit')
df.head()

Unnamed: 0_level_0,gre score,toefl score,university rating,sop,lor,cgpa,research,serial number
chance of admit,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0.92,337,118,4,4.5,4.5,9.65,1,1
0.76,324,107,4,4.0,4.5,8.87,1,2
0.72,316,104,3,3.0,3.5,8.0,1,3
0.8,322,110,3,3.5,2.5,8.67,1,4
0.65,314,103,2,2.0,3.0,8.21,0,5
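
An alternative to copying the index by hand is to call `reset_index()` first, which moves the current index into a regular column, and then `set_index()` on the new column. A minimal sketch on a made-up two-row frame; `demo_df` and the index name `serial no` are illustrative.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {"gre score": [337, 324], "chance of admit": [0.92, 0.76]},
    index=pd.Index([1, 2], name="serial no"),
)

# reset_index() turns 'serial no' into a column; set_index() promotes another
reindexed = demo_df.reset_index().set_index("chance of admit")

print(reindexed)
```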


In [9]:
# When we create a new index from an existing column, the index has a name, which is the
# original name of the column

# We can get rid of the index completely by calling the function reset_index(). This promotes
# the index into a column and creates a default numbered index
df = df.reset_index()
df.head()

Unnamed: 0,chance of admit,gre score,toefl score,university rating,sop,lor,cgpa,research,serial number
0,0.92,337,118,4,4.5,4.5,9.65,1,1
1,0.76,324,107,4,4.0,4.5,8.87,1,2
2,0.72,316,104,3,3.0,3.5,8.0,1,3
3,0.8,322,110,3,3.5,2.5,8.67,1,4
4,0.65,314,103,2,2.0,3.0,8.21,0,5


### Multi-level indexing

In [11]:
# This is similar to composite keys in relational database systems.
# To create a multi-level index, we simply call set_index and give it a list of columns that
# we're interested in promoting to an index

# Pandas will search through these in order, finding the distinct data and forming composite
# indices
# A good example of this is often found when dealing with geographical data which is sorted
# by regions or demographics

# Example
df = pd.read_csv('/Users/jonathansuarezcaceres/Downloads/1_Data Science/Intro to DS with Python/census.csv')
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


In [12]:
# There are two summary levels: 
# -> for each state and 
# -> for each county

# We can see the possible values of the summary level by calling the unique function on the
# SUMLEV column. This is similar to the SQL distinct operator

df['SUMLEV'].unique()

array([40, 50])

In [13]:
# We see that there are only two different values: 40 and 50

In [14]:
# Let's exclude all of the rows that are summaries at the state level and just keep the
# county data

df = df[df['SUMLEV'] == 50]
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


In [15]:
# Let's reduce the data that we're going to look at to just the total population estimates
# and the total number of births.

# We can do this by creating a list of column names that we want to keep, then project those
# and assign the resulting DataFrame to our df variable

columns_to_keep = ['STNAME','CTYNAME','BIRTHS2010','BIRTHS2011','BIRTHS2012','BIRTHS2013',
                   'BIRTHS2014','BIRTHS2015','POPESTIMATE2010','POPESTIMATE2011',
                   'POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015']

df = df[columns_to_keep]
df.head() 

Unnamed: 0,STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
1,Alabama,Autauga County,151,636,615,574,623,600,54660,55253,55175,55038,55290,55347
2,Alabama,Baldwin County,517,2187,2092,2160,2186,2240,183193,186659,190396,195126,199713,203709
3,Alabama,Barbour County,70,335,300,283,260,269,27341,27226,27159,26973,26815,26489
4,Alabama,Bibb County,44,266,245,259,247,253,22861,22733,22642,22512,22549,22583
5,Alabama,Blount County,183,744,710,646,618,603,57373,57711,57776,57734,57658,57673


In [16]:
# The US Census data breaks down population estimates by state and county

# We can set the index to be a combination of the state and county values
# and see how pandas handles it in a DataFrame.

# We do this by creating a list of the column identifiers we want to have indexed, and then
# calling set_index with this list and assigning the output as appropriate.

# We have a dual index: first the state name and second the county name

df = df.set_index(['STNAME', 'CTYNAME'])
df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
STNAME,CTYNAME,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Alabama,Autauga County,151,636,615,574,623,600,54660,55253,55175,55038,55290,55347
Alabama,Baldwin County,517,2187,2092,2160,2186,2240,183193,186659,190396,195126,199713,203709
Alabama,Barbour County,70,335,300,283,260,269,27341,27226,27159,26973,26815,26489
Alabama,Bibb County,44,266,245,259,247,253,22861,22733,22642,22512,22549,22583
Alabama,Blount County,183,744,710,646,618,603,57373,57711,57776,57734,57658,57673


In [21]:
# How can we query this DataFrame?

# The .loc attribute can take multiple arguments and it can query both rows and columns.
# When we use a MultiIndex, we must provide the arguments in order by the level we wish to
# query. Inside of the index, each column is called a level and the outermost column is level
# zero

# Example: Washtenaw County, Michigan state
df.loc['Michigan', 'Washtenaw County']

BIRTHS2010            977
BIRTHS2011           3826
BIRTHS2012           3780
BIRTHS2013           3662
BIRTHS2014           3683
BIRTHS2015           3709
POPESTIMATE2010    345563
POPESTIMATE2011    349048
POPESTIMATE2012    351213
POPESTIMATE2013    354289
POPESTIMATE2014    357029
POPESTIMATE2015    358880
Name: (Michigan, Washtenaw County), dtype: int64
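
As a rough sketch of how the level order plays out, the toy frame below (rows copied from this section, variable names illustrative) queries the same MultiIndex three ways: with a full key, with only the outermost level, and with a row key plus a column label.

```python
import pandas as pd

# Toy subset of the census table, indexed by state and then county
census = pd.DataFrame({
    'STNAME': ['Alabama', 'Michigan', 'Michigan'],
    'CTYNAME': ['Bibb County', 'Washtenaw County', 'Wayne County'],
    'BIRTHS2010': [44, 977, 5918],
    'POPESTIMATE2010': [22861, 345563, 1815199],
}).set_index(['STNAME', 'CTYNAME']).sort_index()

# Full key (one value per level, outermost first) -> a Series for that single row
row = census.loc[('Michigan', 'Washtenaw County')]

# Partial key (level zero only) -> a DataFrame with every county of that state
michigan = census.loc['Michigan']

# Row key plus a column label -> a single value
pop = census.loc[('Michigan', 'Washtenaw County'), 'POPESTIMATE2010']

print(row)
print(michigan)
print(pop)
```

Sorting the index first is a precaution rather than a requirement; lookups on an unsorted MultiIndex can be slower.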

In [23]:
# Comparing two counties

# We pass a list of tuples describing the indices we wish to query into loc.
# Since we have a MultiIndex of two values, state and county, we need to provide two values
# for each element of our filtering list.
# Each tuple should have two elements, the first element being the first index level and the
# second element being the second index level.

df.loc[[('Michigan', 'Washtenaw County'), ('Michigan', 'Wayne County')]]

Unnamed: 0_level_0,Unnamed: 1_level_0,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
STNAME,CTYNAME,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Michigan,Washtenaw County,977,3826,3780,3662,3683,3709,345563,349048,351213,354289,357029,358880
Michigan,Wayne County,5918,23819,23270,23377,23607,23586,1815199,1801273,1792514,1775713,1766008,1759335


The hierarchical labeling isn't just for rows. We can transpose the matrix and have hierarchical column labels, and projecting a single column which has these labels works exactly the same way.
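
A small sketch of hierarchical column labels, using a toy frame whose rows are copied from the census outputs in this section (the names `census` and `wide` are illustrative):

```python
import pandas as pd

census = pd.DataFrame({
    'STNAME': ['Alabama', 'Michigan', 'Michigan'],
    'CTYNAME': ['Bibb County', 'Washtenaw County', 'Wayne County'],
    'BIRTHS2010': [44, 977, 5918],
    'POPESTIMATE2010': [22861, 345563, 1815199],
}).set_index(['STNAME', 'CTYNAME']).sort_index()

# Transposing moves the two index levels onto the columns
wide = census.T
print(wide.columns.nlevels)

# Projecting one column takes a tuple with one value per level
county = wide[('Michigan', 'Washtenaw County')]
print(county)

# Giving only the outermost level returns all columns below it
state = wide['Michigan']
print(state)
```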

## Missing values

For instance, if we are running a survey and a respondent didn't answer a question, the missing value is actually an omission. Such a value can be:
- **Missing at Random (MAR)**: if there are other variables that might be used to predict the variable which is missing
- **Missing Completely at Random (MCAR)**: if there is no relationship to other variables

Another example: data might be missing because it wasn't collected, either because the researcher did not collect it or because it wouldn't make sense to collect it. This last case is very common: when joining a list of people at a university with a list of offices in the university, students generally don't have offices.

In [24]:
# Pandas is pretty good at detecting missing values directly from underlying data formats, like
# CSV files. Although most missing values are formatted as NaN, NULL, None or N/A,
# sometimes missing values are not labeled so clearly.

# The pandas read_csv() function has a parameter called na_values to let us specify the form
# of missing values. It accepts a scalar, string, list or dictionary

import pandas as pd

df = pd.read_csv('/Users/jonathansuarezcaceres/Downloads/1_Data Science/Intro to DS with Python/class_grades.csv')
df.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,,63.15,48.89
3,7,,,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61
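
The `na_values` parameter described above can be sketched on an in-memory CSV. The file contents and the markers `'not recorded'` and `'-'` are hypothetical, chosen only because pandas does not treat them as missing by default.

```python
import io
import pandas as pd

# Hypothetical CSV text: 'not recorded' and '-' stand in for missing grades
raw = (
    "student,Midterm,Final\n"
    "a,64.38,52.5\n"
    "b,not recorded,48.89\n"
    "c,49.38,-\n"
)

# Without na_values the markers are kept as text, so nothing counts as missing
as_text = pd.read_csv(io.StringIO(raw))

# A list applies the same markers to every column
grades = pd.read_csv(io.StringIO(raw), na_values=['not recorded', '-'])

# A dictionary restricts markers to specific columns
per_column = pd.read_csv(io.StringIO(raw), na_values={'Final': ['-']})

print(as_text.isnull().sum())
print(grades.isnull().sum())
print(per_column.isnull().sum())
```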


In [25]:
# We can actually use the function .isnull() to create a boolean mask of the whole dataframe.
# This effectively broadcasts the isnull() function to every cell of data

mask = df.isnull()
mask.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,True,False,False
3,False,True,True,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,True,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False
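
One way such a mask might be used, sketched on a three-row frame (values copied from rows 0, 2 and 3 of the grades table above; the variable names are illustrative): counting the gaps per column and selecting the rows that contain at least one.

```python
import numpy as np
import pandas as pd

grades = pd.DataFrame({
    'Assignment': [57.14, 83.70, np.nan],
    'Midterm': [64.38, np.nan, 49.38],
    'Final': [52.50, 48.89, 80.56],
})

mask = grades.isnull()

# True counts as 1, so summing the mask gives the number of gaps per column
missing_per_column = mask.sum()
print(missing_per_column)

# any(axis='columns') reduces each row to a single True/False value
rows_with_gaps = grades[mask.any(axis='columns')]
print(rows_with_gaps)
```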


In [26]:
# This can be useful for processing rows based on certain columns of data.
# Another useful operation is to drop all of those rows which have any missing
# data, which can be done with the dropna() function

df.dropna().head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61
10,7,80.44,90.2,75.0,91.48,39.72
12,8,97.16,103.71,72.5,93.52,63.33
13,7,91.28,83.53,81.25,99.81,92.22
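
`dropna()` also accepts parameters that narrow what gets dropped. The sketch below uses a toy frame (the fully empty last row is hypothetical) to compare the default with the `subset` and `how` parameters.

```python
import numpy as np
import pandas as pd

grades = pd.DataFrame({
    'Assignment': [57.14, 83.70, np.nan, np.nan],
    'Midterm': [64.38, np.nan, 49.38, np.nan],
    'Final': [52.50, 48.89, 80.56, np.nan],
})

# Default: drop every row that has at least one missing value
complete = grades.dropna()

# subset: only look at the listed columns when deciding what to drop
has_midterm = grades.dropna(subset=['Midterm'])

# how='all': drop a row only if every value in it is missing
not_empty = grades.dropna(how='all')

print(list(complete.index), list(has_midterm.index), list(not_empty.index))
```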


In [27]:
# Some of the rows are gone

# Another handy function that pandas has for working with missing values is the filling
# function: fillna()
# This function takes a number of parameters. We could pass in a single value, which is called
# a scalar value, to change all of the missing data to one value

# Example: fill all missing values with 0
df.fillna(0, inplace = True)
df.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,0.0,63.15,48.89
3,7,0.0,0.0,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,0.0,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61


In [28]:
# The inplace parameter causes pandas to fill the values in place: it does not return a copy of
# the dataframe, but instead modifies the dataframe we have
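
The difference between the two calling styles can be seen on a toy column (values taken from the grades table; variable names are illustrative):

```python
import numpy as np
import pandas as pd

grades = pd.DataFrame({'Midterm': [64.38, np.nan, 49.38]})

# Without inplace, fillna() returns a new frame and leaves the original alone
filled = grades.fillna(0)
n_missing_before = int(grades['Midterm'].isnull().sum())

# With inplace=True, the original frame itself is modified
grades.fillna(0, inplace=True)
n_missing_after = int(grades['Midterm'].isnull().sum())

print(n_missing_before, n_missing_after)
```

Assigning the result back, as in `grades = grades.fillna(0)`, has the same effect and is often easier to read in a chain of operations.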

In [29]:
# We can also use the na_filter option to turn off missing-value detection, if empty (white
# space) fields are actual values of interest. But in practice, this is pretty rare. In data
# without any NAs, passing na_filter=False can improve the performance of reading a large file

# In addition to rules controlling how missing values might be loaded, it's sometimes useful to
# consider missing values as actually having information.

df = pd.read_csv('/Users/jonathansuarezcaceres/Downloads/1_Data Science/Intro to DS with Python/log.csv')
df.head(20)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


In [30]:
# Sometimes it is not efficient to send information across the network if it hasn't changed.
# A common case is that the system inserts null values into the database if there are no changes.
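
Going back to the `na_filter` option of `read_csv()`, here is a minimal sketch of its effect on a hypothetical two-row CSV (the column names and contents are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical CSV text: one empty field and one literal 'NA'
raw = "user,comment\ncheryl,\nbob,NA\n"

# Default behaviour: both the empty field and 'NA' are read as missing
default = pd.read_csv(io.StringIO(raw))

# na_filter=False keeps the fields exactly as written
unfiltered = pd.read_csv(io.StringIO(raw), na_filter=False)

print(default['comment'].isnull().sum())
print(unfiltered['comment'].tolist())
```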

### The method parameter

In [31]:
# The two common fill values are:
# -> ffill : forward filling : updates an na value for a particular cell with the value from
# the previous row
# -> bfill : backward filling : it fills the missing values with the next valid value

# Important: the data needs to be sorted in order for this to have the expected effect
# Data which comes from traditional database management systems usually has no order guarantee

# In Pandas we can sort either by index or by values
# Here we'll sort by index
df = df.set_index('time')
df = df.sort_index()
df.head(20)

Unnamed: 0_level_0,user,video,playback position,paused,volume
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,


In [32]:
# The index isn't really unique
# Let's reset the index and use some multi-level indexing on time and user together instead

# Promote the user name to a second level of the index
df = df.reset_index()
df = df.set_index(['time', 'user'])
df

Unnamed: 0_level_0,Unnamed: 1_level_0,video,playback position,paused,volume
time,user,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,


In [33]:
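# Now that we have the data indexed and sorted appropriately, we can fill the missing data
# using ffill
# Older pandas spelled this df.fillna(method = 'ffill'); that form is deprecated since
# pandas 2.1, and df.ffill() is the equivalent

df = df.ffill()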
df.head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,video,playback position,paused,volume
time,user,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,False,10.0
1469974454,sue,advanced.html,24,False,10.0
1469974484,cheryl,intro.html,7,False,10.0
1469974514,cheryl,intro.html,8,False,10.0
1469974524,sue,advanced.html,25,False,10.0
1469974544,cheryl,intro.html,9,False,10.0
1469974554,sue,advanced.html,26,False,10.0
1469974574,cheryl,intro.html,10,False,10.0
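
Forward and backward filling can be compared side by side on a tiny log. The timestamps and volumes below are hypothetical and deliberately out of order, to show why sorting comes first.

```python
import numpy as np
import pandas as pd

# Hypothetical log, deliberately out of time order
log = pd.DataFrame({
    'time': [30, 10, 40, 20],
    'volume': [np.nan, 10.0, 5.0, np.nan],
})

# Sort by time first so that "previous" and "next" row mean what we expect
log = log.set_index('time').sort_index()

# ffill carries the last valid value forward, bfill pulls the next one backward
forward = log.ffill()
backward = log.bfill()

print(forward['volume'].tolist())
print(backward['volume'].tolist())
```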


In [34]:
# We can also do customized fill-in to replace values with the replace() function.
# It allows replacement from several approaches:
# - value-to-value
# - list
# - regex

# Example
df = pd.DataFrame({'A' : [1, 1, 2, 3, 4],
                  'B' : [3, 6, 3, 8, 9],
                  'C' : ['a', 'b', 'c', 'd', 'e']})

df

Unnamed: 0,A,B,C
0,1,3,a
1,1,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [35]:
# value-to-value approach

# We can replace 1's with 100
df.replace(1, 100)

Unnamed: 0,A,B,C
0,100,3,a
1,100,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [36]:
# list approach

# We can change two values at once
df.replace([1, 3], [100, 300])

Unnamed: 0,A,B,C
0,100,300,a
1,100,6,b
2,2,300,c
3,300,8,d
4,4,9,e


In [37]:
# regex approach

df = pd.read_csv('/Users/jonathansuarezcaceres/Downloads/1_Data Science/Intro to DS with Python/log.csv')
df.head(20)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


In [39]:
# To replace with regex:
# 1st parameter : the pattern we want to replace
# 2nd parameter : the value we want to emit upon match
# 3rd parameter : 'regex=True'

# We want to detect all html pages in the 'video' column -> this means the value ends with '.html'
# We want to overwrite this with the keyword 'webpage'

# Solution (the dot is escaped so that it matches a literal '.', not any character)
df.replace(to_replace = r'.*\.html$', value = 'webpage', regex = True)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,webpage,5,False,10.0
1,1469974454,cheryl,webpage,6,,
2,1469974544,cheryl,webpage,9,,
3,1469974574,cheryl,webpage,10,,
4,1469977514,bob,webpage,1,,
5,1469977544,bob,webpage,1,,
6,1469977574,bob,webpage,1,,
7,1469977604,bob,webpage,1,,
8,1469974604,cheryl,webpage,11,,
9,1469974694,cheryl,webpage,14,,
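
One detail worth checking in a pattern like this is the dot in front of `html`: unescaped, it matches any character, not just a literal period. The sketch below uses a hypothetical `video` column to compare the two spellings.

```python
import pandas as pd

# Hypothetical values: only the first one really ends in '.html'
pages = pd.DataFrame({'video': ['intro.html', 'notes_xhtml', 'clip.mp4']})

# Unescaped dot: any character may precede 'html', so 'notes_xhtml' also matches
loose = pages.replace(to_replace='.*.html$', value='webpage', regex=True)

# Escaped dot: only values ending in a literal '.html' match
strict = pages.replace(to_replace=r'.*\.html$', value='webpage', regex=True)

print(loose['video'].tolist())
print(strict['video'].tolist())
```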


When we use statistical functions on DataFrames, these functions typically ignore missing values; for instance, the mean of a column is computed over its non-missing values only.

This is usually what we want, but we should be aware of it anyway.
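
A short sketch of what this means in practice, using a hypothetical three-value column. The `skipna` parameter of `mean()` controls whether missing values are skipped.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'Midterm': [60.0, np.nan, 80.0]})

# Default: the missing value is skipped, so the mean is (60 + 80) / 2
mean_default = scores['Midterm'].mean()

# skipna=False: a single missing value makes the result missing too
mean_strict = scores['Midterm'].mean(skipna=False)

# Filling with 0 first changes the answer: (60 + 0 + 80) / 3
mean_zero = scores['Midterm'].fillna(0).mean()

print(mean_default, mean_strict, mean_zero)
```

The three results differ, which is why the choice between dropping, filling and ignoring missing values should be made deliberately.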

# Example: Manipulating DataFrames

In [46]:
import pandas as pd

df = pd.read_csv('/Users/jonathansuarezcaceres/Downloads/1_Data Science/Intro to DS with Python/presidents.csv')
df.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days"
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days"
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days"
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days"
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days"


In [41]:
# The footnotes in the 'Born' column may cause some issues

# Let's start with cleaning up the name into firstname and lastname (using Regex)

# We create two new columns and apply a regex to the projection of the 'President' column

In [47]:
# First possible solution

#We make a copy of the President column
df['First'] = df['President']

# Then we call replace() and just have a pattern that matches the last name and set it to an
# empty string
df['First'] = df['First'].replace('[ ].*', '', regex = True)

df.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,First
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James


In [43]:
# That works, but it's kind of gross and slow, since we had to make a full copy of a column, then
# go through and update strings.

In [48]:
# Another solution (the most general one)
# Using the apply() function

# First we drop the column we just created
del(df['First'])

# The apply() function takes some arbitrary function we have written and applies it to either
# a Series (a single column) or DataFrame across all rows or columns.

# Let's write a function that splits a string into two pieces using a single row of data
def splitname(row):
    # The row is a single Series object which is a single row indexed by column variables
    # Let's extract the firstname and create a new entry in the Series
    row['First'] = row['President'].split(' ')[0]
    # Let's do the same with the last word in the string
    row['Last'] = row['President'].split(' ')[-1]
    
    return row

# Now we apply this to the dataframe, indicating we want to apply it across columns
# (with axis = 'columns' the function receives one row at a time)
df = df.apply(splitname, axis = 'columns')
df.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,First,Last
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George,Washington
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John,Adams
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas,Jefferson
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James,Madison
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James,Monroe
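
The `axis` argument of `apply()` is easy to mix up, so here is a small sketch on the toy frame from the `replace()` examples (the variable names are illustrative). With `axis='columns'` the function is called once per row; with `axis='index'` it is called once per column.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 1, 2, 3, 4],
                      'B': [3, 6, 3, 8, 9]})

# axis='columns': each call receives one row as a Series indexed by column name
row_sums = frame.apply(lambda row: row['A'] + row['B'], axis='columns')

# axis='index': each call receives one whole column as a Series
col_max = frame.apply(lambda col: col.max(), axis='index')

print(row_sums.tolist())
print(col_max.to_dict())
```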


In [49]:
# Another solution

# Applying the extract() function

# First, let's drop the columns we just created
del(df['First'])
del(df['Last'])

# extract() takes a regular expression as input and specifically requires us to set capture
# groups that correspond to the output columns we are interested in.

# First we write the pattern (as a raw string, so the backslashes reach the regex engine intact)
pattern = r'(^[\w]*)(?:.* )([\w]*$)'

# The extract() function is built into the str attribute of the Series object
df['President'].str.extract(pattern).head()

Unnamed: 0,0,1
0,George,Washington
1,John,Adams
2,Thomas,Jefferson
3,James,Madison
4,James,Monroe


In [51]:
# That looks nice, but we are still missing the column names
pattern = r'(?P<First>^[\w]*)(?:.* )(?P<Last>[\w]*$)'

# Now we call extract on the whole column and preview the first rows
names = df['President'].str.extract(pattern)
names.head()

Unnamed: 0,First,Last
0,George,Washington
1,John,Adams
2,Thomas,Jefferson
3,James,Madison
4,James,Monroe


In [52]:
# Now we can just copy this into our dataframe
df['First'] = names['First']
df['Last'] = names['Last']
df.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,First,Last
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George,Washington
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John,Adams
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas,Jefferson
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James,Madison
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James,Monroe
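
As an alternative to copying the columns one by one, the extracted frame could be attached in a single step with `join()`. This is a sketch on a hand-built frame, not the presidents file itself; the variable names are illustrative.

```python
import pandas as pd

presidents = pd.DataFrame({
    'President': ['George Washington', 'John Adams', 'John Quincy Adams'],
})

# Named groups become the column names of the extracted frame
pattern = r'(?P<First>^[\w]*)(?:.* )(?P<Last>[\w]*$)'
names = presidents['President'].str.extract(pattern)

# join() aligns on the index and appends both new columns at once
presidents = presidents.join(names)

print(presidents)
```

The middle name in the last row is consumed by the non-capturing group, so only the first and last words are kept.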


In [53]:
# Pandas str module
# https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

In [54]:
# Now let's clean up the 'Born' column

# First we get rid of anything that isn't in the pattern of Month Day and Year
df['Born'] = df['Born'].str.extract(r'([\w]{3} [\w]{1,2}, [\w]{4})', expand = False)
df['Born'].head()

0    Feb 22, 1732
1    Oct 30, 1735
2    Apr 13, 1743
3    Mar 16, 1751
4    Apr 28, 1758
Name: Born, dtype: object

In [55]:
# This cleans up the 'Born' column, but the type of the data is still object.
# We can update this to date/time type as follows:
df['Born'] = pd.to_datetime(df['Born'])
df['Born'].head()

0   1732-02-22
1   1735-10-30
2   1743-04-13
3   1751-03-16
4   1758-04-28
Name: Born, dtype: datetime64[ns]

In [None]:
# This makes any subsequent processing on the dataframe around dates much easier
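
A brief sketch of what the datetime type makes possible, using three of the birth dates shown above. The explicit `format` string is an assumption about how the cleaned text looks; the `.dt` accessor then exposes the date components.

```python
import pandas as pd

born_text = pd.Series(['Feb 22, 1732', 'Oct 30, 1735', 'Apr 13, 1743'])

# Parse with an explicit format: abbreviated month, day, four-digit year
born = pd.to_datetime(born_text, format='%b %d, %Y')

# Components are available through the .dt accessor
years = born.dt.year.tolist()
months = born.dt.month_name().tolist()

# Dates can be compared directly, for example to count births after a cutoff
after_1735 = int((born > pd.Timestamp('1735-01-01')).sum())

print(years, months, after_1735)
```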