# Data Analysis - Project

---

In this notebook we cover all the previous classes - go through what we learned Pandas functionality and apply them to real world data to analyze the USA presidents dataset and draw helpful insights based on analysis. After the data analysis process we discuss about the way how the results and what results must be presented and the possible improvements. At the very last part, we cover Pandas GUI - Graphical User Interface.



$$
$$


### Lecture outline

---


* Fully fledged data analysis


* Presenting results and insights


* Discussion about improvements


* Graphical User Interface for Pandas


# > The code is not optimized in any direction!!!

### DRY - Don't Repeat Yourself
### KISS - Keep It Stupid Simple

In [1]:
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

import plotly.express as px

# Data Processing


---

Usually, in this stage we clean and process data in order to have it in an appropriate form. Most of the time, data cleaning and processing takes 80% of data scientist's time and is the most tedious process. However, this is the step what we makes TRUE data scientists. Because, you can copy-paste code to build the Machine Learning models bu you cannot copy-paste code for data cleaning. This is where the true art starts.

## Read Data

In [2]:
df = pd.read_csv("data/presidents.csv")

In [3]:
df.shape # We have 44 rows and 8 columns

(44, 8)

In [4]:
df.dtypes # All columns are represented as string

#                                  int64
President                         object
Born                              object
  Age atstart of presidency       object
Age atend of presidency           object
Post-presidencytimespan           object
      Died                        object
Age                               object
dtype: object

In [5]:
df.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days"
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days"
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days"
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days"
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days"


## Process columns

---

Let deal with DataFrame columns. Rename them and remove leading and trailing spaces if any.

In [6]:
df.columns # Columns contain leading and trailing spaces

Index(['#', 'President            ', 'Born', '  Age atstart of presidency   ',
       'Age atend of presidency', 'Post-presidencytimespan', '      Died',
       'Age'],
      dtype='object')

In [8]:
df.columns = df.columns.str.strip() # Remove spaces - same as TRIM function in Excel


df.columns

Index(['#', 'President', 'Born', 'Age atstart of presidency',
       'Age atend of presidency', 'Post-presidencytimespan', 'Died', 'Age'],
      dtype='object')

In [9]:
column_mapping = {"#": "presidency_order",
                  "President": "president",
                  "Born": "birth_date",
                  "Age atstart of presidency": "age_at_start",
                  "Age atend of presidency": "age_at_end",
                  "Post-presidencytimespan": "post_presidency_timespan",
                  "Died": "death_date", "Age": "age_at_death"}



df = df.rename(column_mapping, axis=1) # Rename columns

In [12]:
df.head()

Unnamed: 0,presidency_order,president,birth_date,age_at_start,age_at_end,post_presidency_timespan,death_date,age_at_death
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days"
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days"
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days"
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days"
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days"


### Feature Description

---

* `presidency_order` - The order of presidency


* `president` - First and last name of the president


* `birth_date` - Date of birth


* `age_at_start` - The age at the start of presidency


* `age_at_end` - The age at the end of presidency


* `post_presidency_timespan` - The period between death and presidency end


* `death_date` - Death date


* `age_at_death` - The age at the moment of death

## Remove footnotes

---

Some columns contain footnote such as `[a]` in `birth_date` column or `[e]` in `age_at_end` column. We have to remove them as they do not carry any information and even might cause some issues.

In [13]:
birth_date = (df["birth_date"].str.split("[", expand=True)
                              .drop(1, axis=1)
                              .rename({0: "birth_date"}, axis=1))


age_at_end = (df["age_at_end"].str.split("[", expand=True)
                              .drop(1, axis=1)
                              .rename({0: "age_at_end"}, axis=1))


post_presidency_timespan = (df["post_presidency_timespan"].str.split("[", expand=True)
                                                          .drop(1, axis=1)
                                                          .rename({0: "post_presidency_timespan"}, axis=1))

We removed the footnotes but did not change the columns in the initial DataFrame. Note also, that we save processed columns as separate DataFrame. So we need to drop these columns from initial DataFrame and add processed ones instead.

In [16]:
df = df.drop(["birth_date", "age_at_end", "post_presidency_timespan"], axis=1) # Drop columns

In [17]:
df

Unnamed: 0,presidency_order,president,age_at_start,death_date,age_at_death
0,1,George Washington,"57 years, 67 daysApr 30, 1789","Dec 14, 1799","67 years, 295 days"
1,2,John Adams,"61 years, 125 daysMar 4, 1797","Jul 4, 1826","90 years, 247 days"
2,3,Thomas Jefferson,"57 years, 325 daysMar 4, 1801","Jul 4, 1826","83 years, 82 days"
3,4,James Madison,"57 years, 353 daysMar 4, 1809","Jun 28, 1836","85 years, 104 days"
4,5,James Monroe,"58 years, 310 daysMar 4, 1817","Jul 4, 1831","73 years, 67 days"
5,6,John Quincy Adams,"57 years, 236 daysMar 4, 1825","Feb 23, 1848","80 years, 227 days"
6,7,Andrew Jackson,"61 years, 354 daysMar 4, 1829","Jun 8, 1845","78 years, 85 days"
7,8,Martin Van Buren,"54 years, 89 daysMar 4, 1837","Jul 24, 1862","79 years, 231 days"
8,9,William H. Harrison,"68 years, 23 daysMar 4, 1841","Apr 4, 1841","68 years, 54 days"
9,10,John Tyler,"51 years, 6 daysApr 4, 1841","Jan 18, 1862","71 years, 295 days"


In [18]:
df = pd.concat([df, birth_date, age_at_end, post_presidency_timespan], axis=1) # Concatenate processed column

In [19]:
df

Unnamed: 0,presidency_order,president,age_at_start,death_date,age_at_death,birth_date,age_at_end,post_presidency_timespan
0,1,George Washington,"57 years, 67 daysApr 30, 1789","Dec 14, 1799","67 years, 295 days","Feb 22, 1732","65 years, 10 daysMar 4, 1797","2 years, 285 days"
1,2,John Adams,"61 years, 125 daysMar 4, 1797","Jul 4, 1826","90 years, 247 days","Oct 30, 1735","65 years, 125 daysMar 4, 1801","25 years, 122 days"
2,3,Thomas Jefferson,"57 years, 325 daysMar 4, 1801","Jul 4, 1826","83 years, 82 days","Apr 13, 1743","65 years, 325 daysMar 4, 1809","17 years, 122 days"
3,4,James Madison,"57 years, 353 daysMar 4, 1809","Jun 28, 1836","85 years, 104 days","Mar 16, 1751","65 years, 353 daysMar 4, 1817","19 years, 116 days"
4,5,James Monroe,"58 years, 310 daysMar 4, 1817","Jul 4, 1831","73 years, 67 days","Apr 28, 1758","66 years, 310 daysMar 4, 1825","6 years, 122 days"
5,6,John Quincy Adams,"57 years, 236 daysMar 4, 1825","Feb 23, 1848","80 years, 227 days","Jul 11, 1767","61 years, 236 daysMar 4, 1829","18 years, 356 days"
6,7,Andrew Jackson,"61 years, 354 daysMar 4, 1829","Jun 8, 1845","78 years, 85 days","Mar 15, 1767","69 years, 354 daysMar 4, 1837","8 years, 96 days"
7,8,Martin Van Buren,"54 years, 89 daysMar 4, 1837","Jul 24, 1862","79 years, 231 days","Dec 5, 1782","58 years, 89 daysMar 4, 1841","21 years, 142 days"
8,9,William H. Harrison,"68 years, 23 daysMar 4, 1841","Apr 4, 1841","68 years, 54 days","Feb 9, 1773","68 years, 54 days Apr 4, 1841",
9,10,John Tyler,"51 years, 6 daysApr 4, 1841","Jan 18, 1862","71 years, 295 days","Mar 29, 1790","54 years, 340 daysMar 4, 1845","16 years, 320 days"


## Split Columns


---

The values of columns `age_at_start` and `age_at_end` consists of two parts: the first part is the age of the president and the second part is the date the president hold the office - White House and left the office, respectively. It's better to split these two columns into two parts, actual age and the date.


To split these columns we have to figure out the common symbol or character on which we perform the split operation. If we observe, such a common character is `days` inside each value for each of those columns. Under common I mean the character or symbol which does not change across rows.

In [20]:
df[["age_at_start", "age_at_end"]].head()

Unnamed: 0,age_at_start,age_at_end
0,"57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797"
1,"61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801"
2,"57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809"
3,"57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817"
4,"58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825"


In [21]:
age_start = (df["age_at_start"].str.split("days", expand=True)
                               .rename({0: "age_at_start", 1: "presidency_start_date"}, axis=1))


age_end = (df["age_at_end"].str.split("days", expand=True)
                           .rename({0: "age_at_end", 1: "presidency_end_date"}, axis=1))

Now, drop `age_at_start` and `age_at_end` columns and insert new derived columns instead.

In [23]:
df = df.drop(["age_at_start", "age_at_end"], axis=1) # Drop columns

In [24]:
df = pd.concat([df, age_start, age_end], axis=1) # Add new columns

In [25]:
df.head()

Unnamed: 0,presidency_order,president,death_date,age_at_death,birth_date,post_presidency_timespan,age_at_start,presidency_start_date,age_at_end,presidency_end_date
0,1,George Washington,"Dec 14, 1799","67 years, 295 days","Feb 22, 1732","2 years, 285 days","57 years, 67","Apr 30, 1789","65 years, 10","Mar 4, 1797"
1,2,John Adams,"Jul 4, 1826","90 years, 247 days","Oct 30, 1735","25 years, 122 days","61 years, 125","Mar 4, 1797","65 years, 125","Mar 4, 1801"
2,3,Thomas Jefferson,"Jul 4, 1826","83 years, 82 days","Apr 13, 1743","17 years, 122 days","57 years, 325","Mar 4, 1801","65 years, 325","Mar 4, 1809"
3,4,James Madison,"Jun 28, 1836","85 years, 104 days","Mar 16, 1751","19 years, 116 days","57 years, 353","Mar 4, 1809","65 years, 353","Mar 4, 1817"
4,5,James Monroe,"Jul 4, 1831","73 years, 67 days","Apr 28, 1758","6 years, 122 days","58 years, 310","Mar 4, 1817","66 years, 310","Mar 4, 1825"


$$
$$

Some columns contain `days` component along with year. It's better to split these columns and will have year and days as a separate parts. That will make analysis process more smooth. Such columns are: `age_at_death`, `post_presidency_timespan`, `age_at_start`, `age_at_end`

$$
$$

In [26]:
df[["age_at_death", "post_presidency_timespan", "age_at_start", "age_at_end"]].head()

Unnamed: 0,age_at_death,post_presidency_timespan,age_at_start,age_at_end
0,"67 years, 295 days","2 years, 285 days","57 years, 67","65 years, 10"
1,"90 years, 247 days","25 years, 122 days","61 years, 125","65 years, 125"
2,"83 years, 82 days","17 years, 122 days","57 years, 325","65 years, 325"
3,"85 years, 104 days","19 years, 116 days","57 years, 353","65 years, 353"
4,"73 years, 67 days","6 years, 122 days","58 years, 310","66 years, 310"


In [27]:
age_at_death = (df["age_at_death"].str.rstrip("days")
                                  .str.split("years,", expand=True)
                                  .rename({0: "age_at_death_year",
                                           1: "age_at_death_days"},
                                          axis=1))

In [28]:
age_at_death

Unnamed: 0,age_at_death_year,age_at_death_days
0,67,295
1,90,247
2,83,82
3,85,104
4,73,67
5,80,227
6,78,85
7,79,231
8,68,54
9,71,295


`post_presidency_timespan` column contains some uncommon values such as `1 year, 259 days` and `103 days`. So we could not use the same approach we used above. To deal such a situation we have to use `Regular Expression`.

In [29]:
post_presidency_timespan = (df["post_presidency_timespan"].str.rstrip("days")
                                                          .str.replace("year[s]?", "", regex=True)
                                                          .str.split(",", expand=True)
                                                          .rename({0: "post_presidency_timespan_year",
                                                                   1: "post_presidency_timespan_days"},
                                                                  axis=1))


post_presidency_timespan.loc[10] = [np.nan, 103] # Swap the values for one row

In [30]:
age_at_start = (df["age_at_start"].str.split("years,", expand=True)
                                  .rename({0: "age_at_start_year",
                                           1: "age_at_start_days"},
                                          axis=1))

In [31]:
age_at_end = (df["age_at_end"].str.split("years,", expand=True)
                                  .rename({0: "age_at_end_year",
                                           1: "age_at_end_days"},
                                          axis=1))

**drop old columns and add new ones**

In [33]:
df.head()

Unnamed: 0,presidency_order,president,death_date,age_at_death,birth_date,post_presidency_timespan,age_at_start,presidency_start_date,age_at_end,presidency_end_date
0,1,George Washington,"Dec 14, 1799","67 years, 295 days","Feb 22, 1732","2 years, 285 days","57 years, 67","Apr 30, 1789","65 years, 10","Mar 4, 1797"
1,2,John Adams,"Jul 4, 1826","90 years, 247 days","Oct 30, 1735","25 years, 122 days","61 years, 125","Mar 4, 1797","65 years, 125","Mar 4, 1801"
2,3,Thomas Jefferson,"Jul 4, 1826","83 years, 82 days","Apr 13, 1743","17 years, 122 days","57 years, 325","Mar 4, 1801","65 years, 325","Mar 4, 1809"
3,4,James Madison,"Jun 28, 1836","85 years, 104 days","Mar 16, 1751","19 years, 116 days","57 years, 353","Mar 4, 1809","65 years, 353","Mar 4, 1817"
4,5,James Monroe,"Jul 4, 1831","73 years, 67 days","Apr 28, 1758","6 years, 122 days","58 years, 310","Mar 4, 1817","66 years, 310","Mar 4, 1825"


In [34]:
df = df.drop(["age_at_death", "post_presidency_timespan", "age_at_start", "age_at_end"], axis=1) # Drop columns

In [35]:
df = pd.concat([df, age_at_death, post_presidency_timespan, age_at_start, age_at_end], axis=1) # Add new columns

In [36]:
df.head()

Unnamed: 0,presidency_order,president,death_date,birth_date,presidency_start_date,presidency_end_date,age_at_death_year,age_at_death_days,post_presidency_timespan_year,post_presidency_timespan_days,age_at_start_year,age_at_start_days,age_at_end_year,age_at_end_days
0,1,George Washington,"Dec 14, 1799","Feb 22, 1732","Apr 30, 1789","Mar 4, 1797",67,295,2,285,57,67,65,10
1,2,John Adams,"Jul 4, 1826","Oct 30, 1735","Mar 4, 1797","Mar 4, 1801",90,247,25,122,61,125,65,125
2,3,Thomas Jefferson,"Jul 4, 1826","Apr 13, 1743","Mar 4, 1801","Mar 4, 1809",83,82,17,122,57,325,65,325
3,4,James Madison,"Jun 28, 1836","Mar 16, 1751","Mar 4, 1809","Mar 4, 1817",85,104,19,116,57,353,65,353
4,5,James Monroe,"Jul 4, 1831","Apr 28, 1758","Mar 4, 1817","Mar 4, 1825",73,67,6,122,58,310,66,310


## Type casting

---

The columns are represented as sting objects. We have to convert them into appropriate type.

In [37]:
df.dtypes

presidency_order                  int64
president                        object
death_date                       object
birth_date                       object
presidency_start_date            object
presidency_end_date              object
age_at_death_year                object
age_at_death_days                object
post_presidency_timespan_year    object
post_presidency_timespan_days    object
age_at_start_year                object
age_at_start_days                object
age_at_end_year                  object
age_at_end_days                  object
dtype: object

### DateTime objects

---

Pandas supports `datetime object` - meaning that we can convert string representation of date into appropriate type and then operate on this object by using different methods.


The candidates for this conversion are: `death_date`, `birth_date`, `presidency_start_date`, and `presidency_end_date`

In [38]:
df["death_date"] = pd.to_datetime(df["death_date"].str.strip().str.replace("(living)", "", regex=False))

In [39]:
df["birth_date"] = pd.to_datetime(df["birth_date"].str.strip())

In [40]:
df["presidency_start_date"] = pd.to_datetime(df["presidency_start_date"].str.strip())

In [41]:
df["presidency_end_date"] = pd.to_datetime(df["presidency_end_date"].str.strip())

### Numeric objects

---

We have columns which are clearly numeric. However, they are interpreted as strings by Pandas due to a fact that Pandas cannot type cast automatically.

The candidates for numeric type are all columns except datetime columns and `president` column.

In [42]:
numeric_cols = ["age_at_death_year", "age_at_death_days",
                "post_presidency_timespan_year", "post_presidency_timespan_days",
               "age_at_start_year", "age_at_start_days",
               "age_at_end_year", "age_at_end_days"]

In [43]:
df[numeric_cols] = df[numeric_cols].apply(lambda x: x.str.strip()) # Apply strip function to all columns

In [44]:
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric) # Apply type casting

In [45]:
df.dtypes

presidency_order                          int64
president                                object
death_date                       datetime64[ns]
birth_date                       datetime64[ns]
presidency_start_date            datetime64[ns]
presidency_end_date              datetime64[ns]
age_at_death_year                         int64
age_at_death_days                         int64
post_presidency_timespan_year           float64
post_presidency_timespan_days           float64
age_at_start_year                         int64
age_at_start_days                         int64
age_at_end_year                           int64
age_at_end_days                           int64
dtype: object

## Reorder Columns

---

Let reorder columns to have them in logical order

In [47]:
df.head()

Unnamed: 0,presidency_order,president,death_date,birth_date,presidency_start_date,presidency_end_date,age_at_death_year,age_at_death_days,post_presidency_timespan_year,post_presidency_timespan_days,age_at_start_year,age_at_start_days,age_at_end_year,age_at_end_days
0,1,George Washington,1799-12-14,1732-02-22,1789-04-30,1797-03-04,67,295,2.0,285.0,57,67,65,10
1,2,John Adams,1826-07-04,1735-10-30,1797-03-04,1801-03-04,90,247,25.0,122.0,61,125,65,125
2,3,Thomas Jefferson,1826-07-04,1743-04-13,1801-03-04,1809-03-04,83,82,17.0,122.0,57,325,65,325
3,4,James Madison,1836-06-28,1751-03-16,1809-03-04,1817-03-04,85,104,19.0,116.0,57,353,65,353
4,5,James Monroe,1831-07-04,1758-04-28,1817-03-04,1825-03-04,73,67,6.0,122.0,58,310,66,310


In [48]:
columnsTitles = ["presidency_order",
                 "president",
                 "presidency_start_date", 
                 "presidency_end_date",
                "birth_date",
                 "age_at_start_year",
                 "age_at_start_days",
                "age_at_end_year",
                 "age_at_end_days",
                "post_presidency_timespan_year",
                 "post_presidency_timespan_days",
                "death_date",
                 "age_at_death_year",
                 "age_at_death_days"]

In [49]:
df = df.reindex(columns=columnsTitles)

In [50]:
df.head()

Unnamed: 0,presidency_order,president,presidency_start_date,presidency_end_date,birth_date,age_at_start_year,age_at_start_days,age_at_end_year,age_at_end_days,post_presidency_timespan_year,post_presidency_timespan_days,death_date,age_at_death_year,age_at_death_days
0,1,George Washington,1789-04-30,1797-03-04,1732-02-22,57,67,65,10,2.0,285.0,1799-12-14,67,295
1,2,John Adams,1797-03-04,1801-03-04,1735-10-30,61,125,65,125,25.0,122.0,1826-07-04,90,247
2,3,Thomas Jefferson,1801-03-04,1809-03-04,1743-04-13,57,325,65,325,17.0,122.0,1826-07-04,83,82
3,4,James Madison,1809-03-04,1817-03-04,1751-03-16,57,353,65,353,19.0,116.0,1836-06-28,85,104
4,5,James Monroe,1817-03-04,1825-03-04,1758-04-28,58,310,66,310,6.0,122.0,1831-07-04,73,67


## Add Party Affiliation and Birth Place


---

Pandas can read HTML tables from the website. Here, I use this functionality to enrich our data with the party affiliation and birth place of the USA presidents. However, these data is messy and it needs separate processing.

In [62]:
party = pd.read_html("https://www.britannica.com/topic/Presidents-of-the-United-States-1846696")[0]

In [63]:
party.head()

Unnamed: 0.1,Unnamed: 0,no.,president,birthplace,political party,term
0,,1,George Washington,Va.,Federalist,1789–97
1,,2,John Adams,Mass.,Federalist,1797–1801
2,,3,Thomas Jefferson,Va.,Democratic-Republican,1801–09
3,,4,James Madison,Va.,Democratic-Republican,1809–17
4,,5,James Monroe,Va.,Democratic-Republican,1817–25


In [64]:
birth_place = pd.read_html("https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States_by_home_state")[0]

In [65]:
birth_place.head()

Unnamed: 0,Date of birth,President,Birthplace,State† of birth,In office
0,"February 22, 1732",George Washington,Westmoreland County,Virginia†,"(1st) April 30, 1789 – March 4, 1797"
1,"October 30, 1735",John Adams,Braintree,Massachusetts†,"(2nd) March 4, 1797 – March 4, 1801"
2,"April 13, 1743*",Thomas Jefferson,Shadwell,Virginia†,"(3rd) March 4, 1801 – March 4, 1809"
3,"March 16, 1751",James Madison,Port Conway,Virginia†,"(4th) March 4, 1809 – March 4, 1817"
4,"April 28, 1758",James Monroe,Monroe Hall,Virginia†,"(5th) March 4, 1817 – March 4, 1825"


Write these data in `CSV` file

In [None]:
# party.to_csv("data/party.csv", index=False)

# birth_place.to_csv("data/birth_place.csv", index=False)

### Process Political Party Affiliation

In [66]:
party.head()

Unnamed: 0.1,Unnamed: 0,no.,president,birthplace,political party,term
0,,1,George Washington,Va.,Federalist,1789–97
1,,2,John Adams,Mass.,Federalist,1797–1801
2,,3,Thomas Jefferson,Va.,Democratic-Republican,1801–09
3,,4,James Madison,Va.,Democratic-Republican,1809–17
4,,5,James Monroe,Va.,Democratic-Republican,1817–25


Drop unnecessary columns

In [67]:
party = party.drop(["Unnamed: 0", "no.", "birthplace", "term"], axis=1)

In [68]:
party

Unnamed: 0,president,political party
0,George Washington,Federalist
1,John Adams,Federalist
2,Thomas Jefferson,Democratic-Republican
3,James Madison,Democratic-Republican
4,James Monroe,Democratic-Republican
5,John Quincy Adams,National Republican
6,Andrew Jackson,Democratic
7,Martin Van Buren,Democratic
8,William Henry Harrison,Whig
9,John Tyler,Whig


Remove last three row as they contain extra redundant information.

In [69]:
party = party.drop([44, 45, 46, 47], axis=0)

Remove leading and trailing spaces

In [70]:
party = party.apply(lambda x: x.str.strip())

In [71]:
party

Unnamed: 0,president,political party
0,George Washington,Federalist
1,John Adams,Federalist
2,Thomas Jefferson,Democratic-Republican
3,James Madison,Democratic-Republican
4,James Monroe,Democratic-Republican
5,John Quincy Adams,National Republican
6,Andrew Jackson,Democratic
7,Martin Van Buren,Democratic
8,William Henry Harrison,Whig
9,John Tyler,Whig


#### Merge `party` DataFrame with our initial DataFrame

---

The order is not preserved. Hence we merge these two DataFrames on index.

In [73]:
df = df.merge(party["political party"], left_index=True, right_index=True)

In [74]:
df.head()

Unnamed: 0,presidency_order,president,presidency_start_date,presidency_end_date,birth_date,age_at_start_year,age_at_start_days,age_at_end_year,age_at_end_days,post_presidency_timespan_year,post_presidency_timespan_days,death_date,age_at_death_year,age_at_death_days,political party
0,1,George Washington,1789-04-30,1797-03-04,1732-02-22,57,67,65,10,2.0,285.0,1799-12-14,67,295,Federalist
1,2,John Adams,1797-03-04,1801-03-04,1735-10-30,61,125,65,125,25.0,122.0,1826-07-04,90,247,Federalist
2,3,Thomas Jefferson,1801-03-04,1809-03-04,1743-04-13,57,325,65,325,17.0,122.0,1826-07-04,83,82,Democratic-Republican
3,4,James Madison,1809-03-04,1817-03-04,1751-03-16,57,353,65,353,19.0,116.0,1836-06-28,85,104,Democratic-Republican
4,5,James Monroe,1817-03-04,1825-03-04,1758-04-28,58,310,66,310,6.0,122.0,1831-07-04,73,67,Democratic-Republican


### Birth Place Data

In [79]:
birth_place.head()

Unnamed: 0,birth_date,president,city,state,In office
0,"February 22, 1732",George Washington,Westmoreland County,Virginia†,"(1st) April 30, 1789 – March 4, 1797"
1,"October 30, 1735",John Adams,Braintree,Massachusetts†,"(2nd) March 4, 1797 – March 4, 1801"
2,"April 13, 1743*",Thomas Jefferson,Shadwell,Virginia†,"(3rd) March 4, 1801 – March 4, 1809"
3,"March 16, 1751",James Madison,Port Conway,Virginia†,"(4th) March 4, 1809 – March 4, 1817"
4,"April 28, 1758",James Monroe,Monroe Hall,Virginia†,"(5th) March 4, 1817 – March 4, 1825"


Remove last two rows

In [77]:
birth_place = birth_place.drop([43, 44, 45, 46], axis=0)

Rename columns

In [78]:
column_mapping = {"Date of birth": "birth_date", "President": "president",
                 "Birthplace": "city", "State† of birth": "state"}


birth_place = birth_place.rename(column_mapping, axis=1)

Remove `†` character from the `state` column

In [80]:
birth_place["state"] = birth_place["state"].str.strip("†").str.strip()

Remove leading and trailing spaces

In [81]:
birth_place = birth_place.apply(lambda x: x.str.strip())

The order of rows in `birth_place` DataFrame is not set according to presidency order. Hence, we need to find at least one common column between `birth_place` and our initial DataFrame. That column could be `president` as it is represented in both DataFrame.

In [83]:
birth_place.iloc[7]["president"] = "William H. Harrison" # Change value to have proper merge result

In [84]:
birth_place = birth_place.drop(["birth_date", "In office"], axis=1) # Drop unnecessary columns

#### Merge `birth_place` DataFrame with our initial DataFrame

In [86]:
df = df.merge(birth_place, how="inner", on="president")

In [88]:
df.head()

Unnamed: 0,presidency_order,president,presidency_start_date,presidency_end_date,birth_date,age_at_start_year,age_at_start_days,age_at_end_year,age_at_end_days,post_presidency_timespan_year,post_presidency_timespan_days,death_date,age_at_death_year,age_at_death_days,political party,city,state
0,1,George Washington,1789-04-30,1797-03-04,1732-02-22,57,67,65,10,2.0,285.0,1799-12-14,67,295,Federalist,Westmoreland County,Virginia
1,2,John Adams,1797-03-04,1801-03-04,1735-10-30,61,125,65,125,25.0,122.0,1826-07-04,90,247,Federalist,Braintree,Massachusetts
2,3,Thomas Jefferson,1801-03-04,1809-03-04,1743-04-13,57,325,65,325,17.0,122.0,1826-07-04,83,82,Democratic-Republican,Shadwell,Virginia
3,4,James Madison,1809-03-04,1817-03-04,1751-03-16,57,353,65,353,19.0,116.0,1836-06-28,85,104,Democratic-Republican,Port Conway,Virginia
4,5,James Monroe,1817-03-04,1825-03-04,1758-04-28,58,310,66,310,6.0,122.0,1831-07-04,73,67,Democratic-Republican,Monroe Hall,Virginia


# Data Analysis...

---

Here we try to extract as much information from our data as possible.

In [89]:
df.head()

Unnamed: 0,presidency_order,president,presidency_start_date,presidency_end_date,birth_date,age_at_start_year,age_at_start_days,age_at_end_year,age_at_end_days,post_presidency_timespan_year,post_presidency_timespan_days,death_date,age_at_death_year,age_at_death_days,political party,city,state
0,1,George Washington,1789-04-30,1797-03-04,1732-02-22,57,67,65,10,2.0,285.0,1799-12-14,67,295,Federalist,Westmoreland County,Virginia
1,2,John Adams,1797-03-04,1801-03-04,1735-10-30,61,125,65,125,25.0,122.0,1826-07-04,90,247,Federalist,Braintree,Massachusetts
2,3,Thomas Jefferson,1801-03-04,1809-03-04,1743-04-13,57,325,65,325,17.0,122.0,1826-07-04,83,82,Democratic-Republican,Shadwell,Virginia
3,4,James Madison,1809-03-04,1817-03-04,1751-03-16,57,353,65,353,19.0,116.0,1836-06-28,85,104,Democratic-Republican,Port Conway,Virginia
4,5,James Monroe,1817-03-04,1825-03-04,1758-04-28,58,310,66,310,6.0,122.0,1831-07-04,73,67,Democratic-Republican,Monroe Hall,Virginia


Sort DataFrame by `presidency_order`

In [90]:
df = df.sort_values(by="presidency_order").reset_index(drop=True)

Summary Statistics

In [91]:
df.describe().round(2).T.iloc[1:, 1:]

Unnamed: 0,mean,std,min,25%,50%,75%,max
age_at_start_year,54.66,6.26,42.0,51.0,54.5,57.25,69.0
age_at_start_days,173.45,117.11,6.0,88.5,152.5,310.25,354.0
age_at_end_year,59.77,6.58,46.0,55.0,59.0,65.0,77.0
age_at_end_days,196.07,100.16,10.0,119.25,193.0,280.75,354.0
post_presidency_timespan_year,13.43,9.16,1.0,6.5,11.0,19.0,38.0
post_presidency_timespan_days,180.66,112.62,0.0,114.0,175.0,296.0,356.0
age_at_death_year,71.3,12.55,46.0,63.0,71.0,79.25,94.0
age_at_death_days,162.11,94.06,8.0,79.5,165.0,228.75,344.0


Political party and presidents distribution

In [92]:
pd.DataFrame(df["political party"].value_counts())

Unnamed: 0,political party
Republican,18
Democratic,15
Whig,4
Democratic-Republican,3
Federalist,2
National Republican,1
Democratic (Union),1


State and president distribution

In [93]:
pd.DataFrame(df["state"].value_counts())

Unnamed: 0,state
Virginia,8
Ohio,7
New York,4
Massachusetts,4
New Jersey,2
North Carolina,2
Vermont,2
Texas,2
Kentucky,1
South Carolina,1


What is the longest and shortest period between presidency start and end date? We can calculate it my taking difference between `age_at_start_year` and `age_at_end_year` then find the maximum and minimum value of this column.

In [94]:
(df["age_at_end_year"] - df["age_at_start_year"]).max() # Max period of presidency is 12 years

12

In [95]:
df.iloc[(df["age_at_end_year"] - df["age_at_start_year"]).idxmax()]["president"]

'Franklin D. Roosevelt'

In [96]:
(df["age_at_end_year"] - df["age_at_start_year"]).min() # Min period of presidency is 0 years. Maybe few days!!!

0

In [97]:
df.iloc[(df["age_at_end_year"] - df["age_at_start_year"]).idxmin()]["president"]

'William H. Harrison'

Fact about [William H. Harrison](https://en.wikipedia.org/wiki/William_Henry_Harrison)

Which President lived the longest and shortest after presidency end?

In [98]:
df.iloc[df["post_presidency_timespan_year"].idxmax()]["president"] # The longest living president

'Jimmy Carter'

In [99]:
df.iloc[df["post_presidency_timespan_year"].idxmin()]["president"] # The shortest living president

'Chester A. Arthur'

Which president was the oldest and the youngest at the start of the presidency?

In [100]:
df.iloc[df["age_at_start_year"].idxmax()]["president"] # The oldest president at the start of the presidency

'Ronald Reagan'

# The youngest president at the start of the presidency

In [101]:
df.iloc[df["age_at_start_year"].idxmin()]["president"] 

'Theodore Roosevelt'

Which president died the oldest and the youngest?

In [102]:
df.iloc[df["age_at_death_year"].idxmax()]["president"] # The oldest died president after presidency end

'Jimmy Carter'

In [103]:
df.iloc[df["age_at_death_year"].idxmin()]["president"] # The yougest died president after presidency end

'John F. Kennedy'

## What else can we add?

---

The analysis I did is the least what can be done with this data. This is a homework for you to extend the analysis.

1) Add other univariate and bivariate analysis and clearly state your findings

2) Use `groupby` to group the data by some column

3) Use Pandas `pivot_table` and to see the relationship between variables

4) Use `crosstab` to have frequency tables

5) Try to find hidden relationship between variables if possible

6) Add data visualization

# Summary

---

This lecture aimed to show you how you can utilize Pandas capabilities to process and analyze messy data as well as present you findings and tell a story.