### **1. Data Loading & Structural Validation**

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("/content/sample_data/India_COVID19_Statewise_TimeSeries_Analytics_2021.csv")

In [6]:
df.head(4)

Date,State_UT,Population,New_Cases,New_Deaths,New_Recoveries,Total_Cases,Total_Deaths,Total_Recoveries,Active_Cases
2021-01-01,Andaman and Nicobar,57755036,477,10,395,477,10,395,72
2021-01-01,Tamil Nadu,145901950,503,15,430,503,15,430,58
2021-01-01,Sikkim,72132436,507,12,438,507,12,438,57
2021-01-01,Andhra Pradesh,46145249,475,9,400,475,9,400,66


In [8]:
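# Date is the index in the preview above; move it back to a regular column
df = df.reset_index()
# Parse dates so sorting is chronological rather than lexicographic
df['Date'] = pd.to_datetime(df['Date'])
# Order rows by state, then by date, so per-state calculations see consecutive days
df = df.sort_values(['State_UT', 'Date'])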

In [9]:
# Focus on one state for time-series validation (example: Andaman and Nicobar)
state_df = df[df['State_UT'] == 'Andaman and Nicobar'].copy()
state_df.set_index('Date', inplace=True)

state_df = state_df[['New_Cases', 'Total_Cases', 'Active_Cases', 'New_Deaths']]
state_df.dropna(inplace=True)

Purpose - Ensure a clean, chronologically ordered time-series structure for each state.
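A few structural checks can back up the sorting step. The sketch below is only illustrative: it uses a tiny made-up frame (`demo`) instead of the real CSV, and the check names are hypothetical.

```python
import pandas as pd

# Tiny made-up frame standing in for the real CSV (values are illustrative only).
demo = pd.DataFrame({
    "Date": ["2021-01-02", "2021-01-01", "2021-01-01", "2021-01-02"],
    "State_UT": ["Sikkim", "Sikkim", "Goa", "Goa"],
    "Total_Cases": [900, 500, 480, 950],
})
demo["Date"] = pd.to_datetime(demo["Date"])
demo = demo.sort_values(["State_UT", "Date"]).reset_index(drop=True)

# One row per (state, date): duplicates would distort cumulative metrics.
has_duplicates = demo.duplicated(subset=["State_UT", "Date"]).any()

# Dates must be increasing within each state after sorting.
dates_increasing = demo.groupby("State_UT")["Date"].apply(
    lambda s: s.is_monotonic_increasing
).all()

# Cumulative totals should never decrease within a state.
totals_non_decreasing = (
    demo.groupby("State_UT")["Total_Cases"].diff().dropna().ge(0).all()
)

print(has_duplicates, dates_increasing, totals_non_decreasing)
```

The same three checks could be run on `df` after the sort step; any failure would point to rows worth inspecting before computing rates.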

### **2. Missing Value Assessment**

In [20]:
df.isnull().sum()


Column,Missing_Count
Date,0
State_UT,0
Population,0
New_Cases,0
New_Deaths,0
New_Recoveries,0
Total_Cases,0
Total_Deaths,0
Total_Recoveries,0
Active_Cases,0


Purpose - Validate data completeness before statistical computation.
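Raw counts are easier to judge next to a percentage. As a minimal sketch, assuming a made-up frame (`demo`) with one missing value, the counts and shares can be reported together:

```python
import numpy as np
import pandas as pd

# Made-up frame with a single gap; the real dataset reports no missing values.
demo = pd.DataFrame({
    "New_Cases": [477, np.nan, 503, 507],
    "New_Deaths": [10, 15, 12, 9],
})

# isnull().sum() counts gaps; isnull().mean() gives the fraction of rows missing.
missing_count = demo.isnull().sum()
missing_pct = demo.isnull().mean() * 100

report = pd.DataFrame({"missing": missing_count, "missing_pct": missing_pct})
print(report)
```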

### **3. Case Fatality Rate (CFR) Calculation**

In [21]:
df['Case_Fatality_Rate'] = (df['Total_Deaths'] / df['Total_Cases']) * 100

df[['State_UT', 'Date', 'Case_Fatality_Rate']].head()


Index,State_UT,Date,Case_Fatality_Rate
0,Andaman and Nicobar,2021-01-01,2.096436
56,Andaman and Nicobar,2021-01-02,1.74538
74,Andaman and Nicobar,2021-01-03,1.349528
110,Andaman and Nicobar,2021-01-04,1.822785
158,Andaman and Nicobar,2021-01-05,2.134515


Purpose - Measure mortality severity relative to total confirmed cases.
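The ratio is undefined when `Total_Cases` is zero. The dataset above has no such rows, but a guarded version may be safer on other data. This is a sketch on a made-up frame; the first row reuses the Andaman and Nicobar figures for 2021-01-01 shown earlier, and the other rows are invented.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Total_Cases": [477, 0, 1000],   # second and third rows are invented
    "Total_Deaths": [10, 0, 25],
})

# Replace zero denominators with NaN so the ratio stays undefined instead of inf.
safe_cases = demo["Total_Cases"].replace(0, np.nan)
demo["Case_Fatality_Rate"] = demo["Total_Deaths"] / safe_cases * 100

print(demo)
```

The first row works out to 10 / 477 × 100 ≈ 2.096, which agrees with the first value in the output above.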

### **4. Infection Rate per Population**

In [22]:
df['Infection_Rate'] = (df['Total_Cases'] / df['Population']) * 100

df[['State_UT', 'Infection_Rate']].describe()


Statistic,Infection_Rate
count,13140.0
mean,0.124523
std,0.178794
min,0.000248
25%,0.037833
50%,0.075494
75%,0.128746
max,1.576285


Purpose - Normalize total cases by population size so severity can be compared fairly across states.
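`Total_Cases` appears to be cumulative, so a summary over all rows mixes early and late dates. For a state-to-state comparison, one option is to keep only the latest record per state. The sketch below assumes a made-up frame with hypothetical states `A` and `B`.

```python
import pandas as pd

# Made-up data: two hypothetical states observed on two days.
demo = pd.DataFrame({
    "State_UT": ["A", "A", "B", "B"],
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02"] * 2),
    "Population": [1_000_000, 1_000_000, 500_000, 500_000],
    "Total_Cases": [400, 900, 450, 1000],
})
demo["Infection_Rate"] = demo["Total_Cases"] / demo["Population"] * 100

# Keep the most recent row per state, then rank states by infection rate.
latest = (
    demo.sort_values("Date")
        .groupby("State_UT")
        .tail(1)
        .set_index("State_UT")["Infection_Rate"]
        .sort_values(ascending=False)
)
print(latest)
```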

### **5. Daily Growth Rate Calculation**

In [23]:
df['Daily_Growth_Rate'] = df.groupby('State_UT')['Total_Cases'].pct_change() * 100

df[['State_UT', 'Date', 'Daily_Growth_Rate']].head()


Index,State_UT,Date,Daily_Growth_Rate
0,Andaman and Nicobar,2021-01-01,
56,Andaman and Nicobar,2021-01-02,104.192872
74,Andaman and Nicobar,2021-01-03,52.156057
110,Andaman and Nicobar,2021-01-04,33.265857
158,Andaman and Nicobar,2021-01-05,25.721519


Purpose - Evaluate how quickly cumulative cases expand from one day to the next within each state.
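Grouping before `pct_change()` keeps one state's last row from feeding into the next state's first row, which is why each state starts with `NaN`. The sketch below uses illustrative totals for two hypothetical states and also checks that the growth rate equals the day's new cases divided by the previous day's total.

```python
import numpy as np
import pandas as pd

# Illustrative cumulative totals for two hypothetical states (three days each).
demo = pd.DataFrame({
    "State_UT": ["A", "A", "A", "B", "B", "B"],
    "Total_Cases": [477, 974, 1482, 503, 1000, 1500],
})

grouped = demo.groupby("State_UT")["Total_Cases"]
demo["Daily_Growth_Rate"] = grouped.pct_change() * 100

# Equivalent manual form: (today - yesterday) / yesterday * 100, per state.
manual = grouped.diff() / grouped.shift(1) * 100
matches = np.allclose(demo["Daily_Growth_Rate"].dropna(), manual.dropna())

print(demo)
print("Manual formula matches pct_change:", matches)
```

On a cumulative series the first days naturally show very large percentages, so early values say more about the small starting base than about spread.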

### **6. Correlation Matrix Analysis**

In [24]:
correlation_matrix = df[['New_Cases', 'New_Deaths',
                          'New_Recoveries', 'Active_Cases',
                          'Case_Fatality_Rate']].corr()

correlation_matrix


Variable,New_Cases,New_Deaths,New_Recoveries,Active_Cases,Case_Fatality_Rate
New_Cases,1.0,0.140669,0.920126,0.000958,0.002458
New_Deaths,0.140669,1.0,0.127002,0.000681,0.131527
New_Recoveries,0.920126,0.127002,1.0,-0.005877,-0.002208
Active_Cases,0.000958,0.000681,-0.005877,1.0,0.069586
Case_Fatality_Rate,0.002458,0.131527,-0.002208,0.069586,1.0


Purpose - Identify relationships between core pandemic variables.
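A matrix of this size is easier to read as a ranked list of pairs. The sketch below re-enters the coefficients from the output above by hand (so it runs without the CSV) and sorts the off-diagonal pairs by absolute strength.

```python
import numpy as np
import pandas as pd

cols = ["New_Cases", "New_Deaths", "New_Recoveries",
        "Active_Cases", "Case_Fatality_Rate"]

# Coefficients copied from the correlation matrix printed above.
corr = pd.DataFrame(
    [
        [1.0, 0.140669, 0.920126, 0.000958, 0.002458],
        [0.140669, 1.0, 0.127002, 0.000681, 0.131527],
        [0.920126, 0.127002, 1.0, -0.005877, -0.002208],
        [0.000958, 0.000681, -0.005877, 1.0, 0.069586],
        [0.002458, 0.131527, -0.002208, 0.069586, 1.0],
    ],
    index=cols,
    columns=cols,
)

# Keep the upper triangle only (k=1 skips the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)

print(pairs.head(3))
```

In this output only `New_Cases` and `New_Recoveries` are strongly related (about 0.92); every other pair is below 0.15 in absolute value.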

### **7. Statistical Significance Testing (Pearson Correlation)**

In [25]:
from scipy.stats import pearsonr

corr, p_value = pearsonr(df['New_Cases'], df['New_Deaths'])

print("Correlation:", corr)
print("P-value:", p_value)


Correlation: 0.14066853130999088
P-value: 4.732456234082602e-59


Purpose - Validate whether observed correlation between new cases and new deaths is statistically significant.
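A very small p-value does not by itself mean a strong relationship. With 13,140 rows even a weak correlation is statistically significant. The sketch below plugs the reported coefficient and the row count from the `describe()` outputs into the standard formulas for shared variance and the t statistic of a Pearson correlation.

```python
import math

r = 0.14066853130999088   # correlation reported above
n = 13140                 # row count from the describe() outputs

# Share of variance in one variable linearly associated with the other.
r_squared = r ** 2

# t statistic for testing r = 0, with n - 2 degrees of freedom.
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

print(f"r^2 = {r_squared:.4f}")
print(f"t   = {t_stat:.2f} on {n - 2} degrees of freedom")
```

The shared variance comes out at roughly 2%, so the link between new cases and new deaths is significant but weak in this dataset.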

### **8. Distribution Analysis (Skewness & Kurtosis)**

In [26]:
df[['New_Cases', 'New_Deaths']].describe()

Statistic,New_Cases,New_Deaths
count,13140.0,13140.0
mean,500.142161,10.011416
std,22.07581,3.185661
min,410.0,1.0
25%,485.0,8.0
50%,500.0,10.0
75%,515.0,12.0
max,580.0,23.0


In [27]:
df[['New_Cases', 'New_Deaths']].skew()

Column,Skewness
New_Cases,0.058712
New_Deaths,0.297348


In [28]:
df[['New_Cases', 'New_Deaths']].kurt()

Column,Kurtosis
New_Cases,0.019054
New_Deaths,0.068412


Purpose - Assess surge behavior and tail heaviness of the new-case and new-death distributions.
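For reading these numbers: pandas `kurt()` reports excess kurtosis (Fisher's definition), so a normal distribution scores about 0, and values near 0 for both skewness and kurtosis suggest a roughly symmetric distribution without heavy tails. The sketch below contrasts simulated normal and exponential samples; the samples are synthetic and only serve as reference points.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic reference samples: symmetric (normal) vs. right-skewed (exponential).
samples = pd.DataFrame({
    "normal": rng.normal(loc=500, scale=22, size=100_000),
    "exponential": rng.exponential(scale=10, size=100_000),
})

skewness = samples.skew()
excess_kurtosis = samples.kurt()   # Fisher's definition: normal is about 0

print(pd.DataFrame({"skew": skewness, "excess_kurtosis": excess_kurtosis}))
```

The values reported above for `New_Cases` and `New_Deaths` sit close to the normal reference rather than the skewed one.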

### **9. Variance Stability Check (Rolling Std Example)**

In [29]:
state_sample = df[df['State_UT'] == df['State_UT'].iloc[0]].copy()
state_sample.set_index('Date', inplace=True)

rolling_std = state_sample['New_Cases'].rolling(window=7).std()

rolling_std.head()

Date,New_Cases
2021-01-01,
2021-01-02,
2021-01-03,
2021-01-04,
2021-01-05,


Purpose - Evaluate volatility behavior across time.
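The first five values are empty because a 7-day window needs seven observations before it can produce a result. To extend the check from one state to all states, the rolling window can be applied per group. This is a sketch on made-up data for two hypothetical states; the column name `Rolling_Std_7` is an assumption.

```python
import pandas as pd

# Made-up data: state A rises by one case per day, state B is constant.
demo = pd.DataFrame({
    "State_UT": ["A"] * 10 + ["B"] * 10,
    "Date": list(pd.date_range("2021-01-01", periods=10)) * 2,
    "New_Cases": list(range(500, 510)) + [500] * 10,
})

# transform keeps the original row order; each state gets its own 7-day window,
# so the first 6 rows of every state are NaN.
demo["Rolling_Std_7"] = (
    demo.groupby("State_UT")["New_Cases"]
        .transform(lambda s: s.rolling(window=7).std())
)

print(demo.groupby("State_UT")["Rolling_Std_7"].agg(["count", "mean"]))
```

Passing `min_periods` to `rolling()` would produce values earlier, at the cost of estimates based on fewer days.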