# EDA India

## Overview
This notebook contains **Exploratory Data Analysis** and **Visualization** for the cases and vaccinations in India.

**Sections:**
1. [Data Ingestion](#Data_Ingestion)
2. [Summary Statistics](#Summary_Statistics)
3. [Visualization](#Visualization)
4. Correlation
5. Preprocessing
6. [Conclusion](#Conclusion)

***
## Setup 

In [None]:
!pip install seaborn

In [None]:
import os
from datetime import datetime
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

***
<a id='Data_Ingestion'></a>
## 1. Data Ingestion

### 1.1 Getting daily state-wise cases for India

In [None]:
cases = pd.read_csv("india_cases_12-10-2021.csv")
cases

### 1.2 Getting daily state-wise vaccinations for India

In [None]:
vacc = pd.read_csv("india_vaccines_12-10-2021.csv")
vacc

***
<a id='Summary_Statistics'></a>
## 2. Summary Statistics

In [None]:
# Summary statistics for cases in India
cases.describe()

In [None]:
# Summary statistics for vaccines in India
vacc.describe()
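The Conclusion reports the number of rows and attributes of each dataset, which `describe()` does not show directly. A minimal sketch of how those figures could be read off with `DataFrame.shape`, using a small made-up frame (`toy` and its values are illustrative assumptions, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for `cases` or `vacc`; the values are invented.
toy = pd.DataFrame({
    "State": ["A", "B", "C"],
    "Confirmed": [100, 250, 300],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} attributes")

# describe() summarises numeric columns only by default;
# include="all" also covers text columns such as "State".
print(toy.describe(include="all"))
```

Applied to the real frames, the same attribute would be `cases.shape` and `vacc.shape`.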

<a id='missing_outliers'></a>
### 2.2 How much missing data and how many outliers?

In [None]:
# Provides the number of missing values for cases in India
cases.isnull().sum()

In [None]:
cases.isnull().sum().sum()

In [None]:
# Provides the number of missing values for vaccines in India
vacc.isnull().sum()

In [None]:
vacc.isnull().sum().sum()
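Raw counts are easier to interpret next to the share of rows they represent. The sketch below, on a made-up frame (`toy` and its column values are assumptions for illustration), combines the per-column count from `isnull().sum()` with the per-column fraction from `isnull().mean()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from the real CSV files.
toy = pd.DataFrame({
    "State": ["A", "B", "C", "D"],
    "Confirmed": [100, np.nan, 250, 300],
    "Deaths": [np.nan, np.nan, 5, 7],
})

mask = toy.isnull()                      # True where a value is missing
summary = pd.DataFrame({
    "missing": mask.sum(),               # count of missing values per column
    "share": mask.mean(),                # fraction of rows missing per column
}).sort_values("missing", ascending=False)

total_missing = int(mask.sum().sum())    # grand total across the frame
print(summary)
print(total_missing)
```

Sorting by `missing` puts the columns with the most gaps first, which can help when deciding which ones to drop or impute.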

In [None]:
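# Absolute z-scores for confirmed cases; nan_policy='omit' keeps missing values
# from turning every score into NaN
z_confirm = np.abs(stats.zscore(cases['Confirmed'], nan_policy='omit'))
print(z_confirm)

In [None]:
# Absolute z-scores for total doses administered
z_administered = np.abs(stats.zscore(vacc['Total Doses Administered'], nan_policy='omit'))
print(z_administered)

In [None]:
# Positions of confirmed-case values more than `threshold` standard deviations from the mean
threshold = 3
print(np.where(z_confirm > threshold))

In [None]:
# Positions of outlying dose totals, using the same threshold
np.where(z_administered > threshold)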
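To make the z-score rule concrete, here is a minimal sketch on a made-up column (the values and the name `toy` are assumptions). It computes the score by hand with NumPy as `(x - mean) / std`, which is the same quantity `stats.zscore` returns with its default `ddof=0`, and it skips missing values through `np.nanmean` and `np.nanstd`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 20 ordinary values, one missing value, one extreme value.
toy = pd.Series(
    [10, 12, 11, 13, 9, 10, 11, 12, 10, 11,
     12, 9, 10, 11, 13, 12, 10, 11, 9, 10,
     np.nan, 500],
    name="Confirmed",
)

values = toy.to_numpy(dtype=float)

# Population z-score (ddof=0), ignoring NaN when estimating mean and std.
z = np.abs((values - np.nanmean(values)) / np.nanstd(values))

threshold = 3
# NaN compares as False, so missing entries are never flagged as outliers.
outlier_idx = np.where(z > threshold)[0]
print(outlier_idx)
print(toy.iloc[outlier_idx])
```

One caveat of this rule: with `ddof=0`, a single value among `n` observations can reach a z-score of at most `sqrt(n - 1)`, so a threshold of 3 cannot flag anything in a column with 10 or fewer non-missing values.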

### 2.3 Any inconsistent, incomplete, duplicate or incorrect data?

In [None]:
cases.duplicated().sum()

In [None]:
vacc.duplicated().sum()

In [None]:
incomplete_cases = cases.isnull().any(axis=1)
incomplete_cases

In [None]:
incomplete_vacc = vacc.isnull().any(axis=1)
incomplete_vacc
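The boolean Series produced by `isnull().any(axis=1)` marks each row individually; summing it gives the number of incomplete rows, and it can also be used as a mask to inspect them. A small sketch on made-up data (`toy` and its values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one repeated row and one row with a missing value.
toy = pd.DataFrame({
    "State": ["A", "A", "B", "C"],
    "Confirmed": [100, 100, np.nan, 300],
})

# duplicated() marks every repeat of an earlier row; the first occurrence is not counted.
n_duplicates = int(toy.duplicated().sum())

# A row is incomplete if at least one of its columns is missing.
incomplete = toy.isnull().any(axis=1)
n_incomplete = int(incomplete.sum())

# Boolean indexing returns the incomplete rows themselves.
incomplete_rows = toy[incomplete]

print(n_duplicates, n_incomplete)
print(incomplete_rows)
```

For the real frames the corresponding counts would be `incomplete_cases.sum()` and `incomplete_vacc.sum()`.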

***
<a id='Visualization'></a>
## 3. Visualization

### 3.1 Histograms

In [None]:
histogram_filter_cases = cases[['Confirmed', 'Recovered']]
sns.histplot(data=histogram_filter_cases, bins=30, kde=True)

In [None]:
histogram_filter_cases = cases[['Deaths']]
sns.histplot(data=histogram_filter_cases, bins=30, kde=True)

### 3.2 Bar Charts

In [None]:
barchart_filter_cases = cases[['Confirmed', 'Recovered','Deaths']]
sns.set_theme(style='whitegrid')
sns.barplot(data=barchart_filter_cases)

In [None]:
barchart_filter_vacc = vacc[['First Dose Administered', 'Second Dose Administered']]
sns.set_theme(style='whitegrid')
sns.barplot(data=barchart_filter_vacc)

In [None]:
barchart_filter_vacc = vacc[['18-44 Years (Doses Administered)', '45-60 Years (Doses Administered)', '60+ Years (Doses Administered)']]
sns.set_theme(style='whitegrid')
b = sns.barplot(data=barchart_filter_vacc)
b.set(xticklabels=['18-44 Years','45-60 Years','60+ Years'])

### 3.3 Line Plots 

### 3.5 PCA 

***
<a id='Conclusion'></a>
## 6. Conclusion

- How many rows and attributes?
    - The cases dataset has shape `(18214, 22)`.
    - The vaccinations dataset has shape `(9990, 24)`.
- How much missing data and how many outliers?
    - `153446` missing values in the cases dataset.
    - `73305` missing values in the vaccinations dataset.
- Any inconsistent, incomplete, duplicate or incorrect data?
    - Neither dataset contains duplicate rows.
    - The cases dataset contains `18214` incomplete rows, i.e. every row has at least one missing value.
    - The vaccinations dataset contains `9990` incomplete rows, i.e. every row has at least one missing value.
- Are the variables correlated to each other?

- Are any of the preprocessing techniques needed: dimensionality reduction, range transformation, standardization, etc.?

- Does PCA help visualize the data? Do we get any insights from histograms/bar charts/line plots, etc.?
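The correlation question is left open in this notebook. One possible way to approach it, sketched here on made-up numbers (`toy` and its values are assumptions, not results from the real data), is the pairwise Pearson correlation that `DataFrame.corr()` computes over numeric columns:

```python
import pandas as pd

# Hypothetical toy data; "Recovered" is constructed as exactly 0.9 * "Confirmed".
toy = pd.DataFrame({
    "Confirmed": [100.0, 200.0, 300.0, 400.0],
    "Recovered": [90.0, 180.0, 270.0, 360.0],
    "Deaths": [4.0, 1.0, 3.0, 2.0],
})

# Pearson correlation by default; pairs containing NaN are skipped per column pair.
corr = toy.corr()
print(corr.round(2))

# Correlation between two specific columns
r = corr.loc["Confirmed", "Recovered"]
```

Because the toy `Recovered` column is an exact multiple of `Confirmed`, their coefficient is 1 by construction; real data would generally give weaker values. The resulting matrix could also be passed to `sns.heatmap` for a visual summary.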