# EDA USA

## Overview
This notebook contains **Exploratory Data Analysis** and **Visualization** for state-wise cases and vaccinations in the USA.

**Sections:**
1. [Data Ingestion](#Data_Ingestion)
2. [Summary Statistics](#Summary_Statistics)
3. [Data Cleaning](#Data_Cleaning)
4. [Visualization](#Visualization)
5. [Correlation](#Correlation)
6. [Preprocessing](#Preprocessing)
7. [Conclusion](#Conclusion)

***
## Setup

***NOTE***: This notebook requires seaborn version >= `0.11.0`. You can update your seaborn installation by running `pip install -U seaborn`.

In [None]:
import os
from datetime import datetime
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

***
<a id='Data_Ingestion'></a>
## 1. Data Ingestion

### 1.1 Getting daily state-wise cases for USA

In [None]:
cases = pd.read_csv("../raw_datasets/usa_cases_12-10-2021.csv", index_col=0)
cases

### 1.2  Getting daily state-wise vaccinations for USA

In [None]:
vacc = pd.read_csv("../raw_datasets/usa_vaccines_12-10-2021.csv", index_col=0)
vacc

***
<a id='Summary_Statistics'></a>
## 2. Summary Statistics

In [None]:
# Summary statistics for cases in USA
cases.describe()

In [None]:
# Summary statistics for vaccines in USA
vacc.describe()

<a id='missing_outliers'></a>
### 2.2 How much data is missing, and are there outliers?

In [None]:
# Provides the number of missing values per column for cases in the USA
cases.isnull().sum()

In [None]:
cases.isnull().sum().sum()

In [None]:
# Provides the number of missing values per column for vaccines in the USA
vacc.isnull().sum()

In [None]:
vacc.isnull().sum().sum()

In [None]:
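# Absolute z-scores of the confirmed case counts
z_confirm = np.abs(stats.zscore(cases['Confirmed']))
print(z_confirm)

In [None]:
# Absolute z-scores of the administered doses, ignoring missing values
z_administered = np.abs(stats.zscore(vacc['total_vaccinations'], nan_policy='omit'))
print(z_administered)

In [None]:
# Positions whose absolute z-score exceeds the threshold are treated as outliers
threshold = 3
print(np.where(z_confirm > threshold))

In [None]:
np.where(z_administered > threshold)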
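
The z-score check above can be illustrated on a small toy series. This is only a sketch: the values are made up, and the z-score is computed directly with pandas using the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): 19 typical counts and one extreme count.
toy = pd.Series([10.0] * 19 + [1000.0], name="Confirmed")

# Absolute z-score using the population standard deviation (ddof=0).
# pandas skips NaN values in mean() and std() by default.
z = np.abs((toy - toy.mean()) / toy.std(ddof=0))

# Keep the rows themselves, not just their positions.
threshold = 3
outlier_rows = toy[z > threshold]
print(outlier_rows)
```

Indexing the series with the boolean mask returns the outlying rows themselves, which is often easier to inspect than the bare positions returned by `np.where`.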

### 2.3 Inconsistent, incomplete, duplicate, or incorrect data


In [None]:
cases.duplicated().sum()

In [None]:
vacc.duplicated().sum()

In [None]:
incomplete_cases = cases.isnull().any(axis=1)
incomplete_cases

In [None]:
incomplete_vacc = vacc.isnull().any(axis=1)
incomplete_vacc
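
The boolean masks above only flag rows. As a minimal sketch on toy data, the same masks can be used to pull out the affected rows for inspection. The frame and the `Province_State` column name are assumptions made for illustration and may not match the real datasets.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): row 2 duplicates row 1, row 3 has a null.
toy = pd.DataFrame({
    "Province_State": ["Alabama", "Alaska", "Alaska", "Arizona"],
    "Confirmed": [100.0, 50.0, 50.0, np.nan],
})

# Boolean masks, built the same way as in the cells above.
incomplete_mask = toy.isnull().any(axis=1)
duplicate_mask = toy.duplicated()

# Select the flagged rows so they can be inspected directly.
incomplete_rows = toy[incomplete_mask]
duplicate_rows = toy[duplicate_mask]

print(f"{incomplete_mask.sum()} incomplete row(s), {duplicate_mask.sum()} duplicate row(s)")
```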

***
<a id='Data_Cleaning'></a>
## 3. Data Cleaning

### 3.1 cases 

In [None]:
cases.info()

In [None]:
# Provides the fraction of nulls in a particular column
cases.isnull().sum() / len(cases)

Above we can see that `Cases_28_Days` and `Deaths_28_Days` have a null ratio of `0.9963`, which means more than 99% of their rows are null. These columns carry almost no information, so we drop them.

In [None]:
# Dropping the columns with high null ratio
cases.drop(['Cases_28_Days', 'Deaths_28_Days'], inplace = True, axis = 1)
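
Instead of naming the columns by hand, the same decision could be made from the null ratio itself. The sketch below uses a toy frame and an assumed cutoff of `0.5` (the name `max_null_ratio` is hypothetical); the cutoff should be chosen to suit the data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one complete column, one mostly-null column.
toy = pd.DataFrame({
    "Confirmed": [1.0, 2.0, 3.0, 4.0],
    "Cases_28_Days": [np.nan, np.nan, np.nan, 7.0],
})

# Assumed cutoff: drop any column where more than half of the rows are null.
max_null_ratio = 0.5

# isnull().mean() gives the same fraction as isnull().sum() / len(toy).
null_ratio = toy.isnull().mean()
to_drop = null_ratio[null_ratio > max_null_ratio].index.tolist()

cleaned = toy.drop(columns=to_drop)
print(to_drop)
```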

In [None]:
# After dropping
cases.info()

### 3.2 vacc

In [None]:
vacc.info()

In [None]:
# Provides the fraction of nulls in a particular column
vacc.isnull().sum() / len(vacc)

No column has a large majority of nulls. **All columns can be left as is.**

In [None]:
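# Make sure the output directory exists before writing the cleaned datasets
os.makedirs('../cleaned_datasets/usa', exist_ok=True)
cases.to_csv('../cleaned_datasets/usa/statewise_cases_usa.csv')
vacc.to_csv('../cleaned_datasets/usa/statewise_vacc_usa.csv')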

***
<a id='Visualization'></a>
## 4. Visualization

### 4.1 Histograms

In [None]:
histogram_filter_cases = cases[['Confirmed', 'Recovered']]
sns.histplot(data=histogram_filter_cases, bins=30, kde=True)

In [None]:
histogram_filter_cases = cases[['Deaths']]
sns.histplot(data=histogram_filter_cases, bins=30, kde=True)

### 4.2 Bar Charts

In [None]:
barchart_filter_cases = cases[['Confirmed', 'Recovered','Deaths']]
sns.set_theme(style='whitegrid')
sns.barplot(data=barchart_filter_cases)

In [None]:
barchart_filter_vacc = vacc[['people_vaccinated', 'people_fully_vaccinated']]
sns.set_theme(style='whitegrid')
sns.barplot(data=barchart_filter_vacc)

### 4.3 Line Plots

#### Getting Time Series Data
Before drawing line plots, we first extract the time series data. This is done by:
- Grouping by Date
- Aggregating Confirmed, Deaths and Recovered by Sum
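
The two steps above can be sketched on a toy frame. The values and the `Province_State` column are made up for illustration, and the sketch assumes the per-state columns hold cumulative totals, so that differencing the national totals yields daily counts.

```python
import pandas as pd

# Toy state-wise data (hypothetical values): two states over two dates.
toy = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"],
    "Province_State": ["A", "B", "A", "B"],
    "Confirmed": [10, 5, 14, 8],
    "Deaths": [1, 0, 2, 1],
})

# Sum over all states for each date to get national cumulative totals.
cum = toy.groupby("Date").agg(
    Confirmed=("Confirmed", "sum"),
    Deaths=("Deaths", "sum"),
)

# Difference consecutive dates to get per-day counts.
# The first date has no predecessor, so its row is NaN.
delta = cum.diff()
print(delta)
```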

In [None]:
# Cumulative time series (each date's totals include all previous dates)
cum_timeseries = cases.groupby(['Date']).agg(Confirmed = ('Confirmed', 'sum'), Deaths = ('Deaths', 'sum'), Recovered = ('Recovered', 'sum'))

In [None]:
# Delta time series (new counts on each day, obtained by differencing the cumulative totals)
delta_timeseries = cases.groupby(['Date']).agg(Confirmed = ('Confirmed', 'sum'), Deaths = ('Deaths', 'sum'), Recovered = ('Recovered', 'sum')).diff()

#### 4.3.1 Cumulative confirmed cases

In [None]:
cum_timeseries.Confirmed.plot(figsize=(8, 8))

#### 4.3.2 Daily confirmed cases

In [None]:
delta_timeseries.Confirmed.plot(figsize=(8, 8))

#### 4.3.3 Cumulative deaths

In [None]:
sns.lineplot(data = cum_timeseries, x = "Date", y = "Deaths")

#### 4.3.4 Daily deaths

In [None]:
sns.lineplot(data = delta_timeseries, x = "Date", y = "Deaths")

#### 4.3.5 Cumulative recoveries

In [None]:
sns.lineplot(data = cum_timeseries, x = "Date", y = "Recovered")

#### 4.3.6 Daily recoveries

In [None]:
sns.lineplot(data = delta_timeseries, x = "Date", y = "Recovered")

In [None]:
cum_timeseries.to_csv('../cleaned_datasets/usa/cum_cases_usa.csv')
delta_timeseries.to_csv('../cleaned_datasets/usa/daily_cases_usa.csv')

In [None]:
cum_vacc = vacc.groupby(['date']).agg(total_doses = ('total_vaccinations', 'sum'), people_vacc = ('people_vaccinated', 'sum'), people_fully_vacc = ('people_fully_vaccinated', 'sum'), daily_vacc = ('daily_vaccinations', 'sum'))

In [None]:
cum_vacc.to_csv('../cleaned_datasets/usa/vacc_usa.csv')

### 4.4 PCA

***
<a id='Correlation'></a>
## 5. Correlation

- Compute the correlation matrix for `cases` and `vacc` to see whether any attributes are strongly correlated (we use a threshold of 80%).
- Check whether each strong correlation is meaningful or merely indicates redundant attributes.
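
As a rough sketch of how the 80% threshold could be applied programmatically, the snippet below lists the attribute pairs whose absolute correlation exceeds 0.8. The toy frame and its values are hypothetical, and comparing the absolute value (so strong negative correlations also count) is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): Deaths tracks Confirmed closely, Noise does not.
toy = pd.DataFrame({
    "Confirmed": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Deaths": [2.0, 4.1, 6.0, 8.2, 10.0],
    "Noise": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr()

# Keep only the strictly lower triangle so each pair appears once
# and the diagonal (always 1.0) is excluded.
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))

# Flatten to a Series indexed by (row attribute, column attribute).
pairs = lower.stack()

# Assumed rule: a pair is "strongly correlated" if |r| > 0.8.
threshold = 0.8
strong = pairs[pairs.abs() > threshold]
print(strong)
```

Each pair flagged this way still needs the manual check described above, since a high correlation alone does not say whether one attribute is redundant.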

### 5.1 Correlation for `cases`

In [None]:
# Correlation matrix (numeric columns only; text and date columns cannot be correlated)
corr_cases = cases.select_dtypes(include='number').corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr_cases, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 15))

sns.heatmap(corr_cases, mask=mask, center=0, square=True, annot=True)

### 5.2 Correlation for `vacc`

In [None]:
# Correlation matrix (numeric columns only; text and date columns cannot be correlated)
corr_vacc = vacc.select_dtypes(include='number').corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr_vacc, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(10, 10))

sns.heatmap(corr_vacc, mask=mask, center=0, square=True, annot=True)

***
<a id='Preprocessing'></a>
## 6. Preprocessing

***
<a id='Conclusion'></a>
## 7. Conclusion