# EDA USA

## Overview
This notebook contains **Exploratory Data Analysis** and **Visualization** for the cases and vaccinations in the USA.

**Sections:**
1. [Data Ingestion](#Data_Ingestion)
2. [Summary Statistics](#Summary_Statistics)
3. [Data Cleaning](#3.-Data-Cleaning)
4. [Visualization](#Visualization)
5. Correlation
6. Preprocessing
7. [Conclusion](#Conclusion)

***
## Setup

In [None]:
import os
from datetime import datetime
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

***
## 1. Data Ingestion

### 1.1 Getting daily state-wise cases for USA

In [None]:
cases = pd.read_csv("usa_cases_12-10-2021.csv")
cases

### 1.2 Getting daily state-wise vaccinations for USA

In [None]:
vacc = pd.read_csv("usa_vaccines_12-10-2021.csv")
vacc

***
## 2. Summary Statistics

In [None]:
# Summary statistics for cases in USA
cases.describe()

In [None]:
# Summary statistics for vaccines in USA
vacc.describe()

<a id='missing_outliers'></a>
### 2.2 How much data is missing, and how many outliers are there?

In [None]:
# Provides the number of missing values per column for cases in USA
cases.isnull().sum()

In [None]:
cases.isnull().sum().sum()

In [None]:
# Provides the number of missing values per column for vaccines in USA
vacc.isnull().sum()

In [None]:
vacc.isnull().sum().sum()

In [None]:
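# Absolute z-scores for confirmed cases; nan_policy='omit' keeps a missing value from turning every score into NaN
z_confirm = np.abs(stats.zscore(cases['Confirmed'], nan_policy='omit'))
print(z_confirm)

In [None]:
# Absolute z-scores for total vaccinations administered
z_administered = np.abs(stats.zscore(vacc['total_vaccinations'], nan_policy='omit'))
print(z_administered)

In [None]:
# Positions of confirmed-case values more than `threshold` standard deviations from the mean
threshold = 3
print(np.where(z_confirm > threshold))

In [None]:
# Same check for vaccinations (reuses `threshold` from the previous cell)
np.where(z_administered > threshold)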
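
To make the z-score rule above concrete, here is a minimal sketch on made-up numbers (the `values` array is hypothetical and not taken from the dataset). It computes the population z-score by hand, which is what `stats.zscore` returns with its default `ddof=0`, and flags entries whose absolute score exceeds the threshold of 3.

```python
import numpy as np

# Hypothetical sample: 19 ordinary readings and one extreme value.
values = np.array([10.0] * 19 + [1000.0])

# Population z-score: (x - mean) / std with ddof=0,
# matching the default behaviour of scipy.stats.zscore.
z = np.abs((values - values.mean()) / values.std(ddof=0))

threshold = 3
# np.where returns a tuple of index arrays; take the first (only) axis.
outlier_idx = np.where(z > threshold)[0]
print(outlier_idx)
```

The printed indices are positions in the array, so on the real data they can be passed to `cases.iloc[...]` or `vacc.iloc[...]` to inspect the flagged rows.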

### 2.3 Any inconsistent, incomplete, duplicate, or incorrect data?


In [None]:
cases.duplicated().sum()

In [None]:
vacc.duplicated().sum()

In [None]:
incomplete_cases = cases.isnull().any(axis=1)
incomplete_cases

In [None]:
incomplete_vacc = vacc.isnull().any(axis=1)
incomplete_vacc
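
The boolean masks above are easier to interpret once they are used to filter the frame. The following is a minimal sketch on a made-up frame (the `demo` data and its column names are hypothetical) that counts and displays the incomplete and duplicated rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 has a missing value, rows 2 and 3 are identical.
demo = pd.DataFrame({
    "state": ["A", "B", "C", "C"],
    "confirmed": [10.0, np.nan, 30.0, 30.0],
})

incomplete_mask = demo.isnull().any(axis=1)    # True where any column is null
duplicate_mask = demo.duplicated(keep=False)   # True for every copy of a repeated row

n_incomplete = int(incomplete_mask.sum())
n_duplicate_rows = int(duplicate_mask.sum())

print(demo[incomplete_mask])   # rows with at least one missing value
print(demo[duplicate_mask])    # all rows that appear more than once
```

Note that `duplicated()` with its default `keep='first'` marks only the later copies, so the `.sum()` calls in the cells above count extra copies rather than all rows involved in a duplication.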

***
## 3. Data Cleaning

### 3.1 cases 

In [None]:
cases.info()

In [None]:
# Provides the fraction of nulls in a particular column
cases.isnull().sum() / len(cases)

Above we can see that `Cases_28_Days` and `Deaths_28_Days` each have a null ratio of `0.9963`, meaning about 99.6% of their rows are null, so we can drop these columns.

In [None]:
# Drop the columns with a high null ratio
cases.drop(['Cases_28_Days', 'Deaths_28_Days'], inplace=True, axis=1)
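
The same decision can be made programmatically instead of listing column names by hand. This is a minimal sketch, assuming a cutoff of 0.9 (an illustrative choice, not a value from this notebook) and a made-up frame whose column names merely mirror the ones above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one fully populated column, one mostly-empty column.
demo = pd.DataFrame({
    "Confirmed": list(range(20)),
    "Cases_28_Days": [np.nan] * 19 + [5.0],
})

max_null_ratio = 0.9  # assumed cutoff; tune it to the dataset

# isnull().mean() gives the fraction of nulls per column.
null_ratio = demo.isnull().mean()
to_drop = null_ratio[null_ratio > max_null_ratio].index.tolist()

# drop(...) without inplace returns a new frame and leaves `demo` untouched.
cleaned = demo.drop(columns=to_drop)
print(to_drop)
print(cleaned.columns.tolist())
```

Because `drop` returns a new frame here, the cell can be re-run without the `KeyError` that an in-place drop raises once the columns are already gone.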

In [None]:
#After dropping
cases.info()

### 3.2 vacc

In [None]:
vacc.info()

In [None]:
# Provides the fraction of nulls in a particular column
vacc.isnull().sum() / len(vacc)

No column has a majority of nulls, so **all columns can be left as is**.

***
<a id='Visualization'></a>
## 4. Visualization

### 4.1 Histograms

In [None]:
histogram_filter_cases = cases[['Confirmed', 'Recovered']]
sns.histplot(data=histogram_filter_cases, bins=30, kde=True)

In [None]:
histogram_filter_cases = cases[['Deaths']]
sns.histplot(data=histogram_filter_cases, bins=30, kde=True)

### 4.2 Bar Charts

In [None]:
barchart_filter_cases = cases[['Confirmed', 'Recovered','Deaths']]
sns.set_theme(style='whitegrid')
sns.barplot(data=barchart_filter_cases)

In [None]:
barchart_filter_vacc = vacc[['people_vaccinated', 'people_fully_vaccinated']]
sns.set_theme(style='whitegrid')
sns.barplot(data=barchart_filter_vacc)
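
When `sns.barplot` is given a wide-form frame like the ones above, each bar shows the mean of one column by default, with an error bar around it. A minimal sketch, on made-up numbers (the `demo` values are hypothetical), of the values the bars represent:

```python
import pandas as pd

# Hypothetical wide-form frame with the same column names as the cases plot.
demo = pd.DataFrame({
    "Confirmed": [100, 200, 300],
    "Recovered": [50, 150, 250],
    "Deaths": [1, 2, 3],
})

# Column means: the heights barplot draws for wide-form input by default.
bar_heights = demo.mean()
print(bar_heights)
```

Checking these means directly is a quick way to confirm that the chart is summarizing across all rows rather than showing totals.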