# Data Analysis - Project

---

In this notebook we revisit the previous classes: we take the Pandas functionality we have learned and apply it to real-world data, analyzing the USA presidents dataset and drawing useful insights from it. After the analysis we discuss which results should be presented and how, along with possible improvements. In the final part, we cover Pandas GUI - Graphical User Interface.



$$
$$


### Lecture outline

---


* Fully fledged data analysis


* Presenting results and insights


* Discussion about improvements


* Graphical User Interface for Pandas


# > The code is not optimized in any direction!!!

### DRY - Don't Repeat Yourself
### KISS - Keep It Stupid Simple

In [None]:
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

import plotly.express as px

# Data Processing


---

Usually, in this stage we clean and process the data to bring it into an appropriate form. Data cleaning and processing is often said to take around 80% of a data scientist's time, and it is the most tedious part of the job. However, this is the step that makes TRUE data scientists: you can copy-paste code to build Machine Learning models, but you cannot copy-paste code for data cleaning. This is where the true art starts.

## Read Data

In [None]:
df = pd.read_csv("data/presidents.csv")

In [None]:
df.shape # We have 44 rows and 8 columns

In [None]:
df.dtypes # All columns are represented as string

In [None]:
df.head()

## Process columns

---

Let's deal with the DataFrame columns: remove leading and trailing spaces, if any, and then rename them.

In [None]:
df.columns # Columns contain leading and trailing spaces

In [None]:
df.columns = df.columns.str.strip() # Remove spaces - same as TRIM function in Excel


df.columns

In [None]:
column_mapping = {"#": "presidency_order",
                  "President": "president",
                  "Born": "birth_date",
                  "Age atstart of presidency": "age_at_start",
                  "Age atend of presidency": "age_at_end",
                  "Post-presidencytimespan": "post_presidency_timespan",
                  "Died": "death_date", "Age": "age_at_death"}



df = df.rename(column_mapping, axis=1) # Rename columns

### Feature Description

---

* `presidency_order` - The order of presidency


* `president` - First and last name of the president


* `birth_date` - Date of birth


* `age_at_start` - The age at the start of presidency


* `age_at_end` - The age at the end of presidency


* `post_presidency_timespan` - The period between the end of the presidency and death


* `death_date` - Death date


* `age_at_death` - The age at the moment of death

## Remove footnotes

---

Some columns contain footnotes, such as `[a]` in the `birth_date` column or `[e]` in the `age_at_end` column. We have to remove them, as they carry no information and might even cause issues later.

In [None]:
birth_date = (df["birth_date"].str.split("[", expand=True)
                              .drop(1, axis=1)
                              .rename({0: "birth_date"}, axis=1))


age_at_end = (df["age_at_end"].str.split("[", expand=True)
                              .drop(1, axis=1)
                              .rename({0: "age_at_end"}, axis=1))


post_presidency_timespan = (df["post_presidency_timespan"].str.split("[", expand=True)
                                                          .drop(1, axis=1)
                                                          .rename({0: "post_presidency_timespan"}, axis=1))

We removed the footnotes but did not change the columns in the initial DataFrame. Note also that each processed column is saved as a separate DataFrame. So we need to drop these columns from the initial DataFrame and add the processed ones instead.

In [None]:
df = df.drop(["birth_date", "age_at_end", "post_presidency_timespan"], axis=1) # Drop columns

In [None]:
df = pd.concat([df, birth_date, age_at_end, post_presidency_timespan], axis=1) # Concatenate processed column
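In the spirit of DRY, the split/drop/concat sequence above could also be folded into one small helper. The following is only a sketch on made-up values: the helper name `strip_footnotes` and the toy frame are assumptions, not part of the original notebook. Note that it removes only the bracketed markers, whereas the split-based approach discards everything after the first `[`.

```python
import pandas as pd

# Made-up values that mimic the footnote pattern described above (not the real dataset)
toy = pd.DataFrame({
    "birth_date": ["Feb 22, 1732[a]", "Oct 30, 1735"],
    "age_at_end": ["65 years, 10 days[e]", "65 years, 125 days"],
})

def strip_footnotes(frame, columns):
    """Return a copy of `frame` with bracketed footnote markers removed from `columns`."""
    out = frame.copy()
    for col in columns:
        # Remove every "[...]" marker, then trim leftover whitespace
        out[col] = out[col].str.replace(r"\[[^\]]*\]", "", regex=True).str.strip()
    return out

cleaned = strip_footnotes(toy, ["birth_date", "age_at_end"])
print(cleaned)
```

Because the helper works on a copy and assigns the columns in place, no separate drop and concatenate steps are needed.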

## Split Columns


---

The values of the `age_at_start` and `age_at_end` columns consist of two parts: the first part is the age of the president and the second part is the date on which the president took office or left office, respectively. It's better to split each of these columns into two parts, the actual age and the date.


To split these columns we have to find a common symbol or character on which to perform the split operation. If we look closely, such a common token is `days`, which appears inside each value of both columns. By common I mean a character or symbol that does not change across rows.

In [None]:
df[["age_at_start", "age_at_end"]].head()

In [None]:
age_start = (df["age_at_start"].str.split("days", expand=True)
                               .rename({0: "age_at_start", 1: "presidency_start_date"}, axis=1))


age_end = (df["age_at_end"].str.split("days", expand=True)
                           .rename({0: "age_at_end", 1: "presidency_end_date"}, axis=1))

Now, drop `age_at_start` and `age_at_end` columns and insert new derived columns instead.

In [None]:
df = df.drop(["age_at_start", "age_at_end"], axis=1) # Drop columns

In [None]:
df = pd.concat([df, age_start, age_end], axis=1) # Add new columns

In [None]:
df.head()
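The split on a common token can be tried out in isolation. The sketch below uses made-up strings and assumes that the age and the date are glued together around the word `days`. Passing `n=1` is an optional safeguard, so that at most one split happens per value.

```python
import pandas as pd

# Made-up values in the assumed "age + date" layout (not the real dataset)
toy = pd.Series(["57 years, 67 daysApr 30, 1789", "61 years, 125 daysMar 4, 1797"])

# n=1 splits only on the first "days", so a later occurrence cannot create extra columns
parts = toy.str.split("days", n=1, expand=True)
parts.columns = ["age", "date"]

# Trim the whitespace left around the split point
parts = parts.apply(lambda col: col.str.strip())
print(parts)
```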

$$
$$

Some columns contain a `days` component along with the years. It's better to split these columns so that years and days become separate parts. That will make the analysis smoother. Such columns are: `age_at_death`, `post_presidency_timespan`, `age_at_start`, `age_at_end`

$$
$$

In [None]:
df[["age_at_death", "post_presidency_timespan", "age_at_start", "age_at_end"]].head()

In [None]:
age_at_death = (df["age_at_death"].str.rstrip("days")
                                  .str.split("years,", expand=True)
                                  .rename({0: "age_at_death_year",
                                           1: "age_at_death_days"},
                                          axis=1))

The `post_presidency_timespan` column contains some irregular values such as `1 year, 259 days` and `103 days`, so we cannot use the same approach as above. To deal with such a situation we have to use a `Regular Expression`.

In [None]:
post_presidency_timespan = (df["post_presidency_timespan"].str.rstrip("days")
                                                          .str.replace("year[s]?", "", regex=True)
                                                          .str.split(",", expand=True)
                                                          .rename({0: "post_presidency_timespan_year",
                                                                   1: "post_presidency_timespan_days"},
                                                                  axis=1))


post_presidency_timespan.loc[10] = [np.nan, "103"] # Swap the values for one row; keep "103" as a string so the later .str.strip() does not turn it into NaN
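As an alternative to replacing, splitting and then patching one row by hand, a single regular expression with an optional years part can handle all three shapes at once. This is a sketch on made-up strings, and the pattern is an assumption about how the values look after the footnotes are removed.

```python
import pandas as pd

# Made-up strings covering the three shapes mentioned above
toy = pd.Series(["25 years, 4 days", "1 year, 259 days", "103 days"])

# Optional "<n> year(s)," part followed by a mandatory "<n> days" part
pattern = r"(?:(?P<years>\d+)\s*years?,\s*)?(?P<days>\d+)\s*days?"

# Named groups become column names; unmatched optional groups become NaN
timespan = toy.str.extract(pattern).apply(pd.to_numeric)
print(timespan)
```

With this approach the row that contains only days ends up with a missing `years` value and the correct `days` value, so no manual swap is required.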

In [None]:
age_at_start = (df["age_at_start"].str.split("years,", expand=True)
                                  .rename({0: "age_at_start_year",
                                           1: "age_at_start_days"},
                                          axis=1))

In [None]:
age_at_end = (df["age_at_end"].str.split("years,", expand=True)
                              .rename({0: "age_at_end_year",
                                       1: "age_at_end_days"},
                                      axis=1))

**drop old columns and add new ones**

In [None]:
df = df.drop(["age_at_death", "post_presidency_timespan", "age_at_start", "age_at_end"], axis=1) # Drop columns

In [None]:
df = pd.concat([df, age_at_death, post_presidency_timespan, age_at_start, age_at_end], axis=1) # Add new columns

In [None]:
df.head()

## Type casting

---

The columns are represented as string objects. We have to convert them into appropriate types.

In [None]:
df.dtypes

### DateTime objects

---

Pandas supports `datetime object` - meaning that we can convert string representation of date into appropriate type and then operate on this object by using different methods.


The candidates for this conversion are: `death_date`, `birth_date`, `presidency_start_date`, and `presidency_end_date`

In [None]:
df["death_date"] = pd.to_datetime(df["death_date"].str.strip().str.replace("(living)", "", regex=False))

In [None]:
df["birth_date"] = pd.to_datetime(df["birth_date"].str.strip())

In [None]:
df["presidency_start_date"] = pd.to_datetime(df["presidency_start_date"].str.strip())

In [None]:
df["presidency_end_date"] = pd.to_datetime(df["presidency_end_date"].str.strip())
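To see what the conversion gives us, here is a small sketch on made-up strings. The explicit `format` string is an assumption about how the dates are written; `errors="coerce"` is an optional safeguard that turns anything unparseable into `NaT` instead of raising an error.

```python
import pandas as pd

# Made-up strings: one padded date, one clean date, one "(living)" marker
toy = pd.Series([" Dec 14, 1799 ", "Jul 24, 1862", "(living)"])

cleaned = toy.str.strip().str.replace("(living)", "", regex=False)

# errors="coerce" turns values that cannot be parsed (here the empty string) into NaT
dates = pd.to_datetime(cleaned, format="%b %d, %Y", errors="coerce")

print(dates.dt.year)        # year component, NaN where the date is missing
print(dates.dt.day_name())  # weekday name
```

Once a column has a datetime type, the `.dt` accessor exposes components such as year, month and weekday without any further string handling.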

### Numeric objects

---

We have columns which are clearly numeric. However, Pandas still treats them as strings, because they were derived from text values and Pandas does not cast them automatically.

The candidates for numeric type are all columns except datetime columns and `president` column.

In [None]:
numeric_cols = ["age_at_death_year", "age_at_death_days",
                "post_presidency_timespan_year", "post_presidency_timespan_days",
                "age_at_start_year", "age_at_start_days",
                "age_at_end_year", "age_at_end_days"]

In [None]:
df[numeric_cols] = df[numeric_cols].apply(lambda x: x.str.strip()) # Apply strip function to all columns

In [None]:
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric) # Apply type casting

In [None]:
df.dtypes
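One caveat is worth keeping in mind when stripping and casting: the `.str` accessor returns `NaN` for elements that are not strings, so a raw number sitting in an object column can silently disappear. The sketch below shows the effect on made-up values, together with one possible way around it.

```python
import pandas as pd

# Object column mixing padded strings with a raw integer (made-up values)
mixed = pd.Series([" 259 ", 103, None], dtype=object)

# The .str accessor yields NaN for the integer element
lost = mixed.str.strip()

# Strip only the elements that are strings, then cast; errors="coerce" maps bad values to NaN
stripped = mixed.map(lambda v: v.strip() if isinstance(v, str) else v)
numbers = pd.to_numeric(stripped, errors="coerce")

print(lost.tolist())
print(numbers.tolist())
```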

## Reorder Columns

---

Let's reorder the columns to have them in a logical order.

In [None]:
df.head()

In [None]:
columnsTitles = ["presidency_order", "president", "presidency_start_date", "presidency_end_date",
                 "birth_date", "age_at_start_year", "age_at_start_days",
                 "age_at_end_year", "age_at_end_days",
                 "post_presidency_timespan_year", "post_presidency_timespan_days",
                 "death_date", "age_at_death_year", "age_at_death_days"]

In [None]:
df = df.reindex(columns=columnsTitles)

In [None]:
df.head()

## Add Party Affiliation and Birth Place


---

Pandas can read HTML tables from a website. Here, I use this functionality to enrich our data with the party affiliation and birth place of the USA presidents. However, these data are messy and need separate processing.

In [None]:
party = pd.read_html("https://www.britannica.com/topic/Presidents-of-the-United-States-1846696")[0]

In [None]:
birth_place = pd.read_html("https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States_by_home_state")[0]

Optionally, write these data to `CSV` files (uncomment the lines below)

In [None]:
# party.to_csv("data/party.csv", index=False)

# birth_place.to_csv("data/birth_place.csv", index=False)

### Process Political Party Affiliation

In [None]:
party.head()

Drop unnecessary columns

In [None]:
party = party.drop(["Unnamed: 0", "no.", "birthplace", "term"], axis=1)

Remove the last three rows as they contain redundant information.

In [None]:
party = party.drop([44, 45, 46], axis=0)

Remove leading and trailing spaces

In [None]:
party = party.apply(lambda x: x.str.strip())

#### Merge `party` DataFrame with our initial DataFrame

---

Assuming the rows of `party` follow the same order as the rows of our DataFrame, we merge these two DataFrames on the index.

In [None]:
df = df.merge(party["political party"], left_index=True, right_index=True)

### Birth Place Data

In [None]:
birth_place.head()

Remove last two rows

In [None]:
birth_place = birth_place.drop([45, 46], axis=0)

Rename columns

In [None]:
column_mapping = {"Date of birth": "birth_date", "President": "president",
                  "Birthplace": "city", "State† of birth": "state"}


birth_place = birth_place.rename(column_mapping, axis=1)

Remove `†` character from the `state` column

In [None]:
birth_place["state"] = birth_place["state"].str.strip("†").str.strip()

Remove leading and trailing spaces

In [None]:
birth_place = birth_place.apply(lambda x: x.str.strip())

The rows of the `birth_place` DataFrame are not ordered by presidency order. Hence, we need to find at least one column that `birth_place` and our initial DataFrame have in common. That column could be `president`, as it is present in both DataFrames.

In [None]:
birth_place.loc[birth_place.index[7], "president"] = "William H. Harrison" # Change value to have proper merge result (a single .loc call avoids chained assignment on a copy)

In [None]:
birth_place = birth_place.drop(["birth_date", "In office"], axis=1) # Drop unnecessary columns

#### Merge `birth_place` DataFrame with our initial DataFrame

In [None]:
df = df.merge(birth_place, how="inner", on="president")
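An inner merge silently drops every row whose key has no exact match, which is why the name above had to be adjusted by hand. A quick way to spot such mismatches is a left merge with `indicator=True`. The sketch below uses two made-up miniature tables rather than the real data.

```python
import pandas as pd

# Made-up miniature versions of the two tables
left = pd.DataFrame({"president": ["George Washington", "William H. Harrison"]})
right = pd.DataFrame({
    "president": ["George Washington", "William Henry Harrison"],
    "state": ["Virginia", "Virginia"],
})

# how="left" keeps every row of `left`; indicator=True adds a `_merge` column
check = left.merge(right, how="left", on="president", indicator=True)

# Rows marked "left_only" found no partner in `right`
unmatched = check.loc[check["_merge"] == "left_only", "president"].tolist()
print(unmatched)
```

Running such a check before the real merge shows which names need to be harmonized first.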

# Data Analysis...

---

In this section we try to extract as much information from our data as possible.

In [None]:
df.head()

Sort DataFrame by `presidency_order`

In [None]:
df = df.sort_values(by="presidency_order").reset_index(drop=True)

Summary Statistics

In [None]:
df.describe().round(2).T.iloc[1:, 1:]

Political party and presidents distribution

In [None]:
pd.DataFrame(df["political party"].value_counts())

State and president distribution

In [None]:
pd.DataFrame(df["state"].value_counts())

What are the longest and shortest periods between the presidency start and end dates? We can calculate them by taking the difference between `age_at_end_year` and `age_at_start_year` and then finding the maximum and minimum of that difference. Since only whole years are compared, the result is approximate.

In [None]:
(df["age_at_end_year"] - df["age_at_start_year"]).max() # Max period of presidency is 12 years

In [None]:
df.loc[(df["age_at_end_year"] - df["age_at_start_year"]).idxmax(), "president"]

In [None]:
(df["age_at_end_year"] - df["age_at_start_year"]).min() # Min period of presidency is 0 years, i.e. less than one full year

In [None]:
df.loc[(df["age_at_end_year"] - df["age_at_start_year"]).idxmin(), "president"]

Fact about [William H. Harrison](https://en.wikipedia.org/wiki/William_Henry_Harrison)

Which president lived the longest and the shortest after the end of the presidency?

In [None]:
df.loc[df["post_presidency_timespan_year"].idxmax(), "president"] # The longest post-presidency life

In [None]:
df.loc[df["post_presidency_timespan_year"].idxmin(), "president"] # The shortest post-presidency life (idxmin skips missing values)

Which president was the oldest and the youngest at the start of the presidency?

In [None]:
df.loc[df["age_at_start_year"].idxmax(), "president"] # The oldest president at the start of the presidency

In [None]:
df.loc[df["age_at_start_year"].idxmin(), "president"] # The youngest president at the start of the presidency

Which president died the oldest and the youngest?

In [None]:
df.loc[df["age_at_death_year"].idxmax(), "president"] # The president who died at the oldest age

In [None]:
df.loc[df["age_at_death_year"].idxmin(), "president"] # The president who died at the youngest age
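The lookups above all follow the same pattern, so they could be wrapped in one small function. The sketch below is only an illustration: the helper name `extremes` and the toy frame are assumptions. It relies on the fact that `idxmin` and `idxmax` return index labels, which is why the label-based `.loc` is used; the toy frame has a non-default index on purpose to make that visible.

```python
import pandas as pd

# Made-up miniature frame with a non-default index to show why labels matter
toy = pd.DataFrame(
    {"president": ["A", "B", "C"], "age_at_start_year": [57, 42, 70]},
    index=[10, 20, 30],
)

def extremes(frame, column, name_col="president"):
    """Return the (min, max) entries of `name_col` ranked by `column`."""
    # idxmin/idxmax return index labels, so .loc (label-based) is the matching accessor
    return (frame.loc[frame[column].idxmin(), name_col],
            frame.loc[frame[column].idxmax(), name_col])

youngest, oldest = extremes(toy, "age_at_start_year")
print(youngest, oldest)
```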

## What else can we add?

---

The analysis above is the bare minimum that can be done with this data. Extending it is your homework.

1) Add other univariate and bivariate analysis and clearly state your findings

2) Use `groupby` to group the data by some column

3) Use Pandas `pivot_table` to see the relationship between variables

4) Use `crosstab` to have frequency tables

5) Try to find hidden relationships between variables if possible

6) Add data visualization
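As a starting point for items 2 to 4, the sketch below shows the three tools side by side on a made-up toy frame. The column names mirror the ones used in this notebook, but the values are invented for illustration only.

```python
import pandas as pd

# Made-up toy data; values are for illustration only
toy = pd.DataFrame({
    "political party": ["Whig", "Democratic", "Whig", "Democratic"],
    "state": ["Virginia", "New York", "Virginia", "Virginia"],
    "age_at_start_year": [68, 54, 51, 49],
})

# groupby: mean starting age per party
mean_age = toy.groupby("political party")["age_at_start_year"].mean()

# pivot_table: mean starting age per party and state
pivot = toy.pivot_table(index="political party", columns="state",
                        values="age_at_start_year", aggfunc="mean")

# crosstab: how many presidents fall into each party/state combination
counts = pd.crosstab(toy["political party"], toy["state"])

print(mean_age, pivot, counts, sep="\n\n")
```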

# Summary

---

This lecture aimed to show you how you can utilize Pandas capabilities to process and analyze messy data, as well as present your findings and tell a story.