# Data Cleaning for COVID-19 in India Data Analysis

## Objective
Clean and standardize the COVID-19 case and vaccination datasets so that later analysis runs on consistent state names, dates, and numeric columns.

## Data Quality Issues Identified
- Inconsistent state names
- Duplicate and invalid entries
- Missing values
- Datatype inconsistencies

## Cleaning Steps Performed
- Column name standardization
- State name normalization
- Relabeling of non-state rows (the nationwide `India` rows in the vaccination data)
- Datetime conversion
- Export of cleaned datasets

### Step 1: Import Libraries

In [1]:
import pandas as pd
import numpy as np

### Step 2: Load Raw Data

In [2]:
covid_df = pd.read_csv("../Raw_data/covid_19_india.csv")
vaccine_df = pd.read_csv("../Raw_data/covid_vaccine_statewise.csv")

### Step 3: Initial Inspection & Cleaning of `covid_19_india.csv`

In [3]:
covid_df.head()

Unnamed: 0,Sno,Date,Time,State/UnionTerritory,ConfirmedIndianNational,ConfirmedForeignNational,Cured,Deaths,Confirmed
0,1,2020-01-30,6:00 PM,Kerala,1,0,0,0,1
1,2,2020-01-31,6:00 PM,Kerala,1,0,0,0,1
2,3,2020-02-01,6:00 PM,Kerala,2,0,0,0,2
3,4,2020-02-02,6:00 PM,Kerala,3,0,0,0,3
4,5,2020-02-03,6:00 PM,Kerala,3,0,0,0,3


In [4]:
covid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18110 entries, 0 to 18109
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Sno                       18110 non-null  int64 
 1   Date                      18110 non-null  object
 2   Time                      18110 non-null  object
 3   State/UnionTerritory      18110 non-null  object
 4   ConfirmedIndianNational   18110 non-null  object
 5   ConfirmedForeignNational  18110 non-null  object
 6   Cured                     18110 non-null  int64 
 7   Deaths                    18110 non-null  int64 
 8   Confirmed                 18110 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 1.2+ MB


In [5]:
# Drop columns that are not needed for this specific analysis
covid_df.drop(['Sno','Time','ConfirmedIndianNational','ConfirmedForeignNational'], axis=1, inplace=True)

In [6]:
covid_df.head()

Unnamed: 0,Date,State/UnionTerritory,Cured,Deaths,Confirmed
0,2020-01-30,Kerala,0,0,1
1,2020-01-31,Kerala,0,0,1
2,2020-02-01,Kerala,0,0,2
3,2020-02-02,Kerala,0,0,3
4,2020-02-03,Kerala,0,0,3


In [7]:
covid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18110 entries, 0 to 18109
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Date                  18110 non-null  object
 1   State/UnionTerritory  18110 non-null  object
 2   Cured                 18110 non-null  int64 
 3   Deaths                18110 non-null  int64 
 4   Confirmed             18110 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 707.6+ KB


In [8]:
# Convert the 'Date' column from object format to datetime
covid_df['Date'] = pd.to_datetime(covid_df['Date'])
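The conversion above relies on pandas inferring the date format. As a minimal sketch, assuming the `Date` strings are all in the ISO `YYYY-MM-DD` form seen in the `head()` output, an explicit `format` plus `errors="coerce"` makes any malformed values visible as `NaT`. The toy frame `sample` and its bad value are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame mimicking the ISO-formatted 'Date' strings
sample = pd.DataFrame({"Date": ["2020-01-30", "2020-01-31", "not a date"]})

# An explicit format skips format inference; errors="coerce" turns
# unparseable strings into NaT instead of raising an exception
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d", errors="coerce")

# Count the rows that could not be parsed
bad_rows = int(sample["Date"].isna().sum())
print(sample.dtypes)
print(f"Unparseable dates: {bad_rows}")
```

On the real data, a non-zero count here would be a reason to inspect those rows before continuing.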

In [9]:
# Strip leading and trailing whitespace from column names
covid_df.columns = covid_df.columns.str.strip()

In [10]:
covid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18110 entries, 0 to 18109
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   Date                  18110 non-null  datetime64[ns]
 1   State/UnionTerritory  18110 non-null  object        
 2   Cured                 18110 non-null  int64         
 3   Deaths                18110 non-null  int64         
 4   Confirmed             18110 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 707.6+ KB


In [11]:
# Check for fully duplicated rows
covid_df.duplicated().sum()

np.int64(0)
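`duplicated()` with no arguments only flags rows that match in every column. A possible complementary check, sketched here on a hypothetical toy frame, is to look for repeated (date, state) pairs whose counts differ. Whether the real dataset contains such rows is not established by this notebook.

```python
import pandas as pd

# Hypothetical toy data: two rows share a date and state but differ in 'Confirmed'
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02"]),
    "State/UnionTerritory": ["Kerala", "Kerala", "Kerala"],
    "Confirmed": [10, 12, 15],
})

# Full-row duplicates: none, because 'Confirmed' differs
full_dupes = int(toy.duplicated().sum())

# Key duplicates: keep=False marks every row of a repeated (Date, State) pair
key_dupes = toy.duplicated(subset=["Date", "State/UnionTerritory"], keep=False)

print(full_dupes)
print(toy[key_dupes])
```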

In [12]:
# Check for null values
pd.isnull(covid_df).sum()

Date                    0
State/UnionTerritory    0
Cured                   0
Deaths                  0
Confirmed               0
dtype: int64

In [13]:
# Display all unique names in the 'State/UnionTerritory' column
covid_df['State/UnionTerritory'].unique()

array(['Kerala', 'Telengana', 'Delhi', 'Rajasthan', 'Uttar Pradesh',
       'Haryana', 'Ladakh', 'Tamil Nadu', 'Karnataka', 'Maharashtra',
       'Punjab', 'Jammu and Kashmir', 'Andhra Pradesh', 'Uttarakhand',
       'Odisha', 'Puducherry', 'West Bengal', 'Chhattisgarh',
       'Chandigarh', 'Gujarat', 'Himachal Pradesh', 'Madhya Pradesh',
       'Bihar', 'Manipur', 'Mizoram', 'Andaman and Nicobar Islands',
       'Goa', 'Unassigned', 'Assam', 'Jharkhand', 'Arunachal Pradesh',
       'Tripura', 'Nagaland', 'Meghalaya',
       'Dadra and Nagar Haveli and Daman and Diu',
       'Cases being reassigned to states', 'Sikkim', 'Daman & Diu',
       'Lakshadweep', 'Telangana', 'Dadra and Nagar Haveli', 'Bihar****',
       'Madhya Pradesh***', 'Himanchal Pradesh', 'Karanataka',
       'Maharashtra***'], dtype=object)
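Spelling variants such as `Telengana`/`Telangana` are easy to miss by eye in a list this long. As an illustrative aid (not the method used in this notebook), the standard library's `difflib.get_close_matches` can surface candidate pairs for manual review. The helper `near_duplicates` and the `0.85` cutoff are assumptions chosen for this sketch.

```python
from difflib import get_close_matches

# A handful of labels taken from the unique() output above
names = [
    "Telengana", "Telangana",
    "Karanataka", "Karnataka",
    "Himanchal Pradesh", "Himachal Pradesh",
    "Kerala",
]

def near_duplicates(labels, cutoff=0.85):
    """Return (label, similar_label) pairs whose similarity ratio is >= cutoff."""
    pairs = []
    for i, label in enumerate(labels):
        # Compare only against later labels so each pair is reported once
        for match in get_close_matches(label, labels[i + 1:], n=3, cutoff=cutoff):
            pairs.append((label, match))
    return pairs

pairs = near_duplicates(names)
for left, right in pairs:
    print(f"{left!r} looks similar to {right!r}")
```

The output is only a list of suspects; deciding which spelling is canonical still needs a human or a reference list of states.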

In [14]:
# Group by state and sum 'Confirmed'. 'Confirmed' is a cumulative count, so these sums
# are useful for spotting duplicate state labels rather than as case totals.
covid_df.groupby('State/UnionTerritory')[['Confirmed']].sum()

State/UnionTerritory,Confirmed
Andaman and Nicobar Islands,1938498
Andhra Pradesh,392432753
Arunachal Pradesh,7176907
Assam,99837011
Bihar,132231166
Bihar****,1430909
Cases being reassigned to states,345565
Chandigarh,10858627
Chhattisgarh,163776262
Dadra and Nagar Haveli,20722


In [15]:
# Clean State Names (Remove Symbols & Extra Spaces)
covid_df['State/UnionTerritory'] = (
    covid_df['State/UnionTerritory']
    .str.replace(r'[*]+', '', regex=True)
    .str.strip()
)

In [16]:
# Fix Spelling Variations
state_corrections = {
    'Karanataka': 'Karnataka',
    'Himanchal Pradesh': 'Himachal Pradesh',
    'Telengana': 'Telangana'
}

covid_df['State/UnionTerritory'] = covid_df['State/UnionTerritory'].replace(state_corrections)

In [17]:
# Map the two former union territories to their merged name
covid_df['State/UnionTerritory'] = covid_df['State/UnionTerritory'].replace({
    'Dadra and Nagar Haveli': 'Dadra and Nagar Haveli and Daman and Diu',
    'Daman & Diu': 'Dadra and Nagar Haveli and Daman and Diu'
})
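The three cleaning cells above (strip asterisks, fix spellings, merge union territory names) can be combined into one reusable function. This is a minimal sketch on hypothetical toy data; `normalize_states` and `STATE_FIXES` are names introduced here, while the mappings themselves are the ones used in the cells above.

```python
import pandas as pd

# Spelling fixes and UT merges, copied from the cells above
STATE_FIXES = {
    "Karanataka": "Karnataka",
    "Himanchal Pradesh": "Himachal Pradesh",
    "Telengana": "Telangana",
    "Dadra and Nagar Haveli": "Dadra and Nagar Haveli and Daman and Diu",
    "Daman & Diu": "Dadra and Nagar Haveli and Daman and Diu",
}

def normalize_states(series, fixes):
    """Remove '*' markers, trim whitespace, then apply exact-match renames."""
    cleaned = series.str.replace(r"\*+", "", regex=True).str.strip()
    return cleaned.replace(fixes)

# Hypothetical toy data containing each kind of inconsistency
toy = pd.DataFrame({
    "State/UnionTerritory": ["Bihar", "Bihar****", " Telengana", "Telangana", "Daman & Diu"],
    "Confirmed": [100, 5, 7, 3, 2],
})

toy["State/UnionTerritory"] = normalize_states(toy["State/UnionTerritory"], STATE_FIXES)
totals = toy.groupby("State/UnionTerritory")["Confirmed"].sum()
print(totals)
```

Stripping before replacing matters: the rename dictionary matches whole strings, so a label with stray spaces or asterisks would otherwise slip through.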

In [18]:
# Repeat the per-state sum to confirm that variant labels (e.g. 'Bihar****') were merged.
# 'Confirmed' is cumulative, so these sums are a label check, not case totals.
covid_df.groupby('State/UnionTerritory')[['Confirmed']].sum()

State/UnionTerritory,Confirmed
Andaman and Nicobar Islands,1938498
Andhra Pradesh,392432753
Arunachal Pradesh,7176907
Assam,99837011
Bihar,133662075
Cases being reassigned to states,345565
Chandigarh,10858627
Chhattisgarh,163776262
Dadra and Nagar Haveli and Daman and Diu,1959356
Delhi,287227765
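The `Confirmed` values grow day by day for each state (see the Kerala rows in the first `head()` output), so summing them across dates adds the same cases repeatedly. If a per-state case total is wanted, one option is to take the most recent value per state instead. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical cumulative counts for two states over two days
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-08-10", "2021-08-11", "2021-08-10", "2021-08-11"]),
    "State/UnionTerritory": ["Goa", "Goa", "Sikkim", "Sikkim"],
    "Confirmed": [100, 110, 40, 45],
})

# Summing cumulative values adds earlier days' cases again
summed = toy.groupby("State/UnionTerritory")["Confirmed"].sum()

# Sorting by date and taking the last row per state gives the latest cumulative count
latest = toy.sort_values("Date").groupby("State/UnionTerritory")["Confirmed"].last()

print(summed)
print(latest)
```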


In [19]:
# Add an 'Active_Cases' column: confirmed cases minus recoveries and deaths
covid_df['Active_Cases'] = covid_df['Confirmed'] - (covid_df['Cured'] + covid_df['Deaths'])

In [20]:
covid_df.tail()

Unnamed: 0,Date,State/UnionTerritory,Cured,Deaths,Confirmed,Active_Cases
18105,2021-08-11,Telangana,638410,3831,650353,8112
18106,2021-08-11,Tripura,77811,773,80660,2076
18107,2021-08-11,Uttarakhand,334650,7368,342462,444
18108,2021-08-11,Uttar Pradesh,1685492,22775,1708812,545
18109,2021-08-11,West Bengal,1506532,18252,1534999,10215


In [21]:
covid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18110 entries, 0 to 18109
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   Date                  18110 non-null  datetime64[ns]
 1   State/UnionTerritory  18110 non-null  object        
 2   Cured                 18110 non-null  int64         
 3   Deaths                18110 non-null  int64         
 4   Confirmed             18110 non-null  int64         
 5   Active_Cases          18110 non-null  int64         
dtypes: datetime64[ns](1), int64(4), object(1)
memory usage: 849.0+ KB


In [22]:
# Check for null values
pd.isnull(covid_df).sum()

Date                    0
State/UnionTerritory    0
Cured                   0
Deaths                  0
Confirmed               0
Active_Cases            0
dtype: int64

In [23]:
covid_df.describe()

Unnamed: 0,Date,Cured,Deaths,Confirmed,Active_Cases
count,18110,18110.0,18110.0,18110.0,18110.0
mean,2020-11-30 21:49:50.127001344,278637.5,4052.402264,301031.4,18341.481502
min,2020-01-30 00:00:00,0.0,0.0,0.0,-9368.0
25%,2020-07-26 00:00:00,3360.25,32.0,4376.75,322.0
50%,2020-12-03 00:00:00,33364.0,588.0,39773.5,2305.5
75%,2021-04-08 00:00:00,278869.8,3643.75,300149.8,12454.75
max,2021-08-11 00:00:00,6159676.0,134201.0,6363442.0,701614.0
std,,614890.9,10919.076411,656148.9,52896.528487
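The `min` of `Active_Cases` in the summary above is -9368, which means at least one row reports more recoveries plus deaths than confirmed cases. A small sketch of how such rows could be isolated for review, using hypothetical toy numbers:

```python
import pandas as pd

# Hypothetical toy data; the second row has Cured + Deaths > Confirmed
toy = pd.DataFrame({
    "Confirmed": [100, 50, 10],
    "Cured": [90, 55, 0],
    "Deaths": [5, 1, 0],
})
toy["Active_Cases"] = toy["Confirmed"] - (toy["Cured"] + toy["Deaths"])

# Rows with a negative active count are internally inconsistent
negative = toy[toy["Active_Cases"] < 0]
print(negative)
print(f"{len(negative)} of {len(toy)} rows have negative Active_Cases")
```

How to treat such rows (keep, clip to zero, or drop) is an analysis decision that this notebook does not make.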


### Step 4: Initial Inspection & Cleaning of `covid_vaccine_statewise.csv`

In [24]:
vaccine_df.head()

Unnamed: 0,Updated On,State,Total Doses Administered,Sessions,Sites,First Dose Administered,Second Dose Administered,Male (Doses Administered),Female (Doses Administered),Transgender (Doses Administered),...,18-44 Years (Doses Administered),45-60 Years (Doses Administered),60+ Years (Doses Administered),18-44 Years(Individuals Vaccinated),45-60 Years(Individuals Vaccinated),60+ Years(Individuals Vaccinated),Male(Individuals Vaccinated),Female(Individuals Vaccinated),Transgender(Individuals Vaccinated),Total Individuals Vaccinated
0,16/01/2021,India,48276.0,3455.0,2957.0,48276.0,0.0,,,,...,,,,,,,23757.0,24517.0,2.0,48276.0
1,17/01/2021,India,58604.0,8532.0,4954.0,58604.0,0.0,,,,...,,,,,,,27348.0,31252.0,4.0,58604.0
2,18/01/2021,India,99449.0,13611.0,6583.0,99449.0,0.0,,,,...,,,,,,,41361.0,58083.0,5.0,99449.0
3,19/01/2021,India,195525.0,17855.0,7951.0,195525.0,0.0,,,,...,,,,,,,81901.0,113613.0,11.0,195525.0
4,20/01/2021,India,251280.0,25472.0,10504.0,251280.0,0.0,,,,...,,,,,,,98111.0,153145.0,24.0,251280.0


In [25]:
# Rename 'Updated On' to 'Vaccine_Date'
vaccine_df.rename(columns={'Updated On': 'Vaccine_Date'}, inplace=True)

In [26]:
vaccine_df.head()

Unnamed: 0,Vaccine_Date,State,Total Doses Administered,Sessions,Sites,First Dose Administered,Second Dose Administered,Male (Doses Administered),Female (Doses Administered),Transgender (Doses Administered),...,18-44 Years (Doses Administered),45-60 Years (Doses Administered),60+ Years (Doses Administered),18-44 Years(Individuals Vaccinated),45-60 Years(Individuals Vaccinated),60+ Years(Individuals Vaccinated),Male(Individuals Vaccinated),Female(Individuals Vaccinated),Transgender(Individuals Vaccinated),Total Individuals Vaccinated
0,16/01/2021,India,48276.0,3455.0,2957.0,48276.0,0.0,,,,...,,,,,,,23757.0,24517.0,2.0,48276.0
1,17/01/2021,India,58604.0,8532.0,4954.0,58604.0,0.0,,,,...,,,,,,,27348.0,31252.0,4.0,58604.0
2,18/01/2021,India,99449.0,13611.0,6583.0,99449.0,0.0,,,,...,,,,,,,41361.0,58083.0,5.0,99449.0
3,19/01/2021,India,195525.0,17855.0,7951.0,195525.0,0.0,,,,...,,,,,,,81901.0,113613.0,11.0,195525.0
4,20/01/2021,India,251280.0,25472.0,10504.0,251280.0,0.0,,,,...,,,,,,,98111.0,153145.0,24.0,251280.0


In [27]:
# Convert 'Vaccine_Date' from object to datetime; the strings are DD/MM/YYYY, hence dayfirst=True
vaccine_df['Vaccine_Date'] = pd.to_datetime(vaccine_df['Vaccine_Date'], dayfirst=True)
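The `dayfirst=True` argument matters because strings such as `03/02/2021` are ambiguous. The sketch below shows both readings of one hypothetical value, plus an explicit `format` as a stricter alternative (assuming every value follows `DD/MM/YYYY`, as the `head()` output suggests).

```python
import pandas as pd

ambiguous = "03/02/2021"

# Day-first reading: 3 February 2021
as_day_first = pd.to_datetime(ambiguous, dayfirst=True)

# Month-first reading (the pandas default): 2 March 2021
as_month_first = pd.to_datetime(ambiguous, dayfirst=False)

# An explicit format removes the ambiguity and fails loudly on other layouts
explicit = pd.to_datetime(pd.Series(["16/01/2021", "03/02/2021"]), format="%d/%m/%Y")

print(as_day_first.date(), as_month_first.date())
print(explicit)
```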

In [28]:
# Strip leading and trailing whitespace from column names
vaccine_df.columns = vaccine_df.columns.str.strip()

In [29]:
vaccine_df.columns

Index(['Vaccine_Date', 'State', 'Total Doses Administered', 'Sessions',
       'Sites', 'First Dose Administered', 'Second Dose Administered',
       'Male (Doses Administered)', 'Female (Doses Administered)',
       'Transgender (Doses Administered)', 'Covaxin (Doses Administered)',
       'CoviShield (Doses Administered)', 'Sputnik V (Doses Administered)',
       'AEFI', '18-44 Years (Doses Administered)',
       '45-60 Years (Doses Administered)', '60+ Years (Doses Administered)',
       '18-44 Years(Individuals Vaccinated)',
       '45-60 Years(Individuals Vaccinated)',
       '60+ Years(Individuals Vaccinated)', 'Male(Individuals Vaccinated)',
       'Female(Individuals Vaccinated)', 'Transgender(Individuals Vaccinated)',
       'Total Individuals Vaccinated'],
      dtype='object')

In [30]:
vaccine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7845 entries, 0 to 7844
Data columns (total 24 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   Vaccine_Date                         7845 non-null   datetime64[ns]
 1   State                                7845 non-null   object        
 2   Total Doses Administered             7621 non-null   float64       
 3   Sessions                             7621 non-null   float64       
 4   Sites                                7621 non-null   float64       
 5   First Dose Administered              7621 non-null   float64       
 6   Second Dose Administered             7621 non-null   float64       
 7   Male (Doses Administered)            7461 non-null   float64       
 8   Female (Doses Administered)          7461 non-null   float64       
 9   Transgender (Doses Administered)     7461 non-null   float64       
 10  Covaxin (Dos

In [31]:
# Display all unique names in the 'State' column
vaccine_df['State'].unique()

array(['India', 'Andaman and Nicobar Islands', 'Andhra Pradesh',
       'Arunachal Pradesh', 'Assam', 'Bihar', 'Chandigarh',
       'Chhattisgarh', 'Dadra and Nagar Haveli and Daman and Diu',
       'Delhi', 'Goa', 'Gujarat', 'Haryana', 'Himachal Pradesh',
       'Jammu and Kashmir', 'Jharkhand', 'Karnataka', 'Kerala', 'Ladakh',
       'Lakshadweep', 'Madhya Pradesh', 'Maharashtra', 'Manipur',
       'Meghalaya', 'Mizoram', 'Nagaland', 'Odisha', 'Puducherry',
       'Punjab', 'Rajasthan', 'Sikkim', 'Tamil Nadu', 'Telangana',
       'Tripura', 'Uttar Pradesh', 'Uttarakhand', 'West Bengal'],
      dtype=object)

In [32]:
# Relabel the nationwide 'India' rows with the placeholder label that also appears in covid_df
vaccine_df['State'] = vaccine_df['State'].replace('India', 'Cases being reassigned to states')

In [33]:
# Display all unique names in the 'State' column
vaccine_df['State'].unique()

array(['Cases being reassigned to states', 'Andaman and Nicobar Islands',
       'Andhra Pradesh', 'Arunachal Pradesh', 'Assam', 'Bihar',
       'Chandigarh', 'Chhattisgarh',
       'Dadra and Nagar Haveli and Daman and Diu', 'Delhi', 'Goa',
       'Gujarat', 'Haryana', 'Himachal Pradesh', 'Jammu and Kashmir',
       'Jharkhand', 'Karnataka', 'Kerala', 'Ladakh', 'Lakshadweep',
       'Madhya Pradesh', 'Maharashtra', 'Manipur', 'Meghalaya', 'Mizoram',
       'Nagaland', 'Odisha', 'Puducherry', 'Punjab', 'Rajasthan',
       'Sikkim', 'Tamil Nadu', 'Telangana', 'Tripura', 'Uttar Pradesh',
       'Uttarakhand', 'West Bengal'], dtype=object)

In [34]:
# Check for null values
pd.isnull(vaccine_df).sum()

Vaccine_Date                              0
State                                     0
Total Doses Administered                224
Sessions                                224
Sites                                   224
First Dose Administered                 224
Second Dose Administered                224
Male (Doses Administered)               384
Female (Doses Administered)             384
Transgender (Doses Administered)        384
Covaxin (Doses Administered)            224
CoviShield (Doses Administered)         224
Sputnik V (Doses Administered)         4850
AEFI                                   2407
18-44 Years (Doses Administered)       6143
45-60 Years (Doses Administered)       6143
60+ Years (Doses Administered)         6143
18-44 Years(Individuals Vaccinated)    4112
45-60 Years(Individuals Vaccinated)    4111
60+ Years(Individuals Vaccinated)      4111
Male(Individuals Vaccinated)           7685
Female(Individuals Vaccinated)         7685
Transgender(Individuals Vaccinat

In [35]:
# Remove all rows where the 'Total Doses Administered' value is missing (NaN)
vaccine_df.dropna(subset=['Total Doses Administered'], inplace=True)
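`dropna(subset=...)` removes rows only when the listed column is missing and leaves other gaps alone. It also keeps the original index labels, which is why later `info()` output shows `Index: 7621 entries, 0 to 7838` rather than a contiguous range. A minimal sketch on hypothetical toy data, with an optional `reset_index` that this notebook does not apply:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two different columns
toy = pd.DataFrame({
    "State": ["A", "B", "C", "D"],
    "Total Doses Administered": [10.0, np.nan, 30.0, np.nan],
    "AEFI": [np.nan, 1.0, np.nan, 2.0],
})

# Only rows missing 'Total Doses Administered' are dropped; 'AEFI' gaps remain
kept = toy.dropna(subset=["Total Doses Administered"])
print(kept.index.tolist())

# Optional: renumber the index so positional and label lookups agree
kept_reset = kept.reset_index(drop=True)
print(kept_reset.index.tolist())
```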

In [36]:
# Check for null values
pd.isnull(vaccine_df).sum()

Vaccine_Date                              0
State                                     0
Total Doses Administered                  0
Sessions                                  0
Sites                                     0
First Dose Administered                   0
Second Dose Administered                  0
Male (Doses Administered)               160
Female (Doses Administered)             160
Transgender (Doses Administered)        160
Covaxin (Doses Administered)              0
CoviShield (Doses Administered)           0
Sputnik V (Doses Administered)         4626
AEFI                                   2183
18-44 Years (Doses Administered)       5919
45-60 Years (Doses Administered)       5919
60+ Years (Doses Administered)         5919
18-44 Years(Individuals Vaccinated)    3888
45-60 Years(Individuals Vaccinated)    3887
60+ Years(Individuals Vaccinated)      3887
Male(Individuals Vaccinated)           7461
Female(Individuals Vaccinated)         7461
Transgender(Individuals Vaccinat

In [37]:
# Replace missing (NaN) values with 0 in specific columns
vaccine_df.fillna({'Sputnik V (Doses Administered)': 0, 'AEFI': 0}, inplace=True)

In [38]:
# Check for null values
pd.isnull(vaccine_df).sum()

Vaccine_Date                              0
State                                     0
Total Doses Administered                  0
Sessions                                  0
Sites                                     0
First Dose Administered                   0
Second Dose Administered                  0
Male (Doses Administered)               160
Female (Doses Administered)             160
Transgender (Doses Administered)        160
Covaxin (Doses Administered)              0
CoviShield (Doses Administered)           0
Sputnik V (Doses Administered)            0
AEFI                                      0
18-44 Years (Doses Administered)       5919
45-60 Years (Doses Administered)       5919
60+ Years (Doses Administered)         5919
18-44 Years(Individuals Vaccinated)    3888
45-60 Years(Individuals Vaccinated)    3887
60+ Years(Individuals Vaccinated)      3887
Male(Individuals Vaccinated)           7461
Female(Individuals Vaccinated)         7461
Transgender(Individuals Vaccinat

In [39]:
# Consolidate gender-based vaccination data by filling missing values in 'Doses Administered' 
# with values from 'Individuals Vaccinated'. This merges two columns into one 'Male' column.
vaccine_df['Male'] = vaccine_df['Male (Doses Administered)'].combine_first(vaccine_df['Male(Individuals Vaccinated)'])

# Repeat the consolidation for Female records
vaccine_df['Female'] = vaccine_df['Female (Doses Administered)'].combine_first(vaccine_df['Female(Individuals Vaccinated)'])

# Repeat the consolidation for Transgender records
vaccine_df['Transgender'] = vaccine_df['Transgender (Doses Administered)'].combine_first(vaccine_df['Transgender(Individuals Vaccinated)'])
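`combine_first` keeps the caller's value where it exists and falls back to the other Series only where the caller is `NaN`. The sketch below demonstrates this on hypothetical toy Series. Because the two source columns count different things (doses versus individuals), it also builds an optional provenance label; that extra column is an illustration, not part of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values for the two source columns
doses = pd.Series([5.0, np.nan, np.nan])
individuals = pd.Series([4.0, 7.0, np.nan])

# Prefer 'doses'; use 'individuals' only where 'doses' is missing
merged = doses.combine_first(individuals)

# Optional provenance label recording which column supplied each value
source = np.where(
    doses.notna(), "doses",
    np.where(individuals.notna(), "individuals", "missing"),
)

print(merged.tolist())
print(source.tolist())
```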

In [40]:
# Check for null values
pd.isnull(vaccine_df).sum()

Vaccine_Date                              0
State                                     0
Total Doses Administered                  0
Sessions                                  0
Sites                                     0
First Dose Administered                   0
Second Dose Administered                  0
Male (Doses Administered)               160
Female (Doses Administered)             160
Transgender (Doses Administered)        160
Covaxin (Doses Administered)              0
CoviShield (Doses Administered)           0
Sputnik V (Doses Administered)            0
AEFI                                      0
18-44 Years (Doses Administered)       5919
45-60 Years (Doses Administered)       5919
60+ Years (Doses Administered)         5919
18-44 Years(Individuals Vaccinated)    3888
45-60 Years(Individuals Vaccinated)    3887
60+ Years(Individuals Vaccinated)      3887
Male(Individuals Vaccinated)           7461
Female(Individuals Vaccinated)         7461
Transgender(Individuals Vaccinat

In [41]:
# Consolidate age-group vaccination data by filling missing values in 'Doses Administered'
# with values from 'Individuals Vaccinated'. This merges two columns into one '18-44 Years' column.
vaccine_df['18-44 Years'] = vaccine_df['18-44 Years (Doses Administered)'].combine_first(vaccine_df['18-44 Years(Individuals Vaccinated)'])

# Repeat the consolidation for '45-60 Years' records
vaccine_df['45-60 Years'] = vaccine_df['45-60 Years (Doses Administered)'].combine_first(vaccine_df['45-60 Years(Individuals Vaccinated)'])

# Repeat the consolidation for '60+ Years' records
vaccine_df['60+ Years'] = vaccine_df['60+ Years (Doses Administered)'].combine_first(vaccine_df['60+ Years(Individuals Vaccinated)'])

In [42]:
# Check for null values
pd.isnull(vaccine_df).sum()

Vaccine_Date                              0
State                                     0
Total Doses Administered                  0
Sessions                                  0
Sites                                     0
First Dose Administered                   0
Second Dose Administered                  0
Male (Doses Administered)               160
Female (Doses Administered)             160
Transgender (Doses Administered)        160
Covaxin (Doses Administered)              0
CoviShield (Doses Administered)           0
Sputnik V (Doses Administered)            0
AEFI                                      0
18-44 Years (Doses Administered)       5919
45-60 Years (Doses Administered)       5919
60+ Years (Doses Administered)         5919
18-44 Years(Individuals Vaccinated)    3888
45-60 Years(Individuals Vaccinated)    3887
60+ Years(Individuals Vaccinated)      3887
Male(Individuals Vaccinated)           7461
Female(Individuals Vaccinated)         7461
Transgender(Individuals Vaccinat

In [43]:
vaccine_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7621 entries, 0 to 7838
Data columns (total 30 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   Vaccine_Date                         7621 non-null   datetime64[ns]
 1   State                                7621 non-null   object        
 2   Total Doses Administered             7621 non-null   float64       
 3   Sessions                             7621 non-null   float64       
 4   Sites                                7621 non-null   float64       
 5   First Dose Administered              7621 non-null   float64       
 6   Second Dose Administered             7621 non-null   float64       
 7   Male (Doses Administered)            7461 non-null   float64       
 8   Female (Doses Administered)          7461 non-null   float64       
 9   Transgender (Doses Administered)     7461 non-null   float64       
 10  Covaxin (Doses Ad

In [44]:
# Fill missing values in 'Total Individuals Vaccinated' by summing 'Male', 'Female', and 'Transgender'
vaccine_df['Total Individuals Vaccinated'] = vaccine_df['Total Individuals Vaccinated'].fillna(
    vaccine_df[['Male', 'Female', 'Transgender']].sum(axis=1)
)
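One detail of the fill above: `DataFrame.sum(axis=1)` skips `NaN` values, so a row where all three gender columns are missing sums to `0.0` rather than `NaN`. The later null check reports no gaps in these columns, so this may not occur here, but the sketch below shows the difference on hypothetical toy data using the `min_count` parameter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row has no gender values at all
toy = pd.DataFrame({
    "Male": [3.0, np.nan],
    "Female": [4.0, np.nan],
    "Transgender": [1.0, np.nan],
    "Total Individuals Vaccinated": [np.nan, np.nan],
})
parts = ["Male", "Female", "Transgender"]

# Default behaviour: an all-NaN row sums to 0.0
row_sum = toy[parts].sum(axis=1)

# min_count=1 requires at least one real value, otherwise the result is NaN
row_sum_strict = toy[parts].sum(axis=1, min_count=1)

filled = toy["Total Individuals Vaccinated"].fillna(row_sum)
filled_strict = toy["Total Individuals Vaccinated"].fillna(row_sum_strict)

print(filled.tolist())
print(filled_strict.tolist())
```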

In [45]:
# Check for null values
pd.isnull(vaccine_df).sum()

Vaccine_Date                              0
State                                     0
Total Doses Administered                  0
Sessions                                  0
Sites                                     0
First Dose Administered                   0
Second Dose Administered                  0
Male (Doses Administered)               160
Female (Doses Administered)             160
Transgender (Doses Administered)        160
Covaxin (Doses Administered)              0
CoviShield (Doses Administered)           0
Sputnik V (Doses Administered)            0
AEFI                                      0
18-44 Years (Doses Administered)       5919
45-60 Years (Doses Administered)       5919
60+ Years (Doses Administered)         5919
18-44 Years(Individuals Vaccinated)    3888
45-60 Years(Individuals Vaccinated)    3887
60+ Years(Individuals Vaccinated)      3887
Male(Individuals Vaccinated)           7461
Female(Individuals Vaccinated)         7461
Transgender(Individuals Vaccinat

In [46]:
# Remove old gender-based columns
vaccine_df.drop(['Male (Doses Administered)','Female (Doses Administered)','Transgender (Doses Administered)'], axis=1, inplace=True)
vaccine_df.drop(['Male(Individuals Vaccinated)','Female(Individuals Vaccinated)','Transgender(Individuals Vaccinated)'], axis=1, inplace=True)

# Remove old age-group columns
vaccine_df.drop(['18-44 Years (Doses Administered)','45-60 Years (Doses Administered)','60+ Years (Doses Administered)'], axis=1, inplace=True)
vaccine_df.drop(['18-44 Years(Individuals Vaccinated)','45-60 Years(Individuals Vaccinated)','60+ Years(Individuals Vaccinated)'], axis=1, inplace=True)

In [47]:
# Check for null values
pd.isnull(vaccine_df).sum()

Vaccine_Date                          0
State                                 0
Total Doses Administered              0
Sessions                              0
Sites                                 0
First Dose Administered               0
Second Dose Administered              0
Covaxin (Doses Administered)          0
CoviShield (Doses Administered)       0
Sputnik V (Doses Administered)        0
AEFI                                  0
Total Individuals Vaccinated          0
Male                                  0
Female                                0
Transgender                           0
18-44 Years                        2186
45-60 Years                        2185
60+ Years                          2185
dtype: int64

In [48]:
vaccine_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7621 entries, 0 to 7838
Data columns (total 18 columns):
 #   Column                           Non-Null Count  Dtype         
---  ------                           --------------  -----         
 0   Vaccine_Date                     7621 non-null   datetime64[ns]
 1   State                            7621 non-null   object        
 2   Total Doses Administered         7621 non-null   float64       
 3   Sessions                         7621 non-null   float64       
 4   Sites                            7621 non-null   float64       
 5   First Dose Administered          7621 non-null   float64       
 6   Second Dose Administered         7621 non-null   float64       
 7   Covaxin (Doses Administered)     7621 non-null   float64       
 8   CoviShield (Doses Administered)  7621 non-null   float64       
 9   Sputnik V (Doses Administered)   7621 non-null   float64       
 10  AEFI                             7621 non-null   float64       
 

In [49]:
vaccine_df.columns

Index(['Vaccine_Date', 'State', 'Total Doses Administered', 'Sessions',
       'Sites', 'First Dose Administered', 'Second Dose Administered',
       'Covaxin (Doses Administered)', 'CoviShield (Doses Administered)',
       'Sputnik V (Doses Administered)', 'AEFI',
       'Total Individuals Vaccinated', 'Male', 'Female', 'Transgender',
       '18-44 Years', '45-60 Years', '60+ Years'],
      dtype='object')

In [50]:
# Reorder columns so that 'Total Individuals Vaccinated' comes last
new_order = [
    'Vaccine_Date', 'State', 'Total Doses Administered', 'Sessions', 'Sites',
    'First Dose Administered', 'Second Dose Administered',
    'Covaxin (Doses Administered)', 'CoviShield (Doses Administered)',
    'Sputnik V (Doses Administered)', 'AEFI',
    'Male', 'Female', 'Transgender',
    '18-44 Years', '45-60 Years', '60+ Years',
    'Total Individuals Vaccinated',
]
vaccine_df = vaccine_df.reindex(columns=new_order)
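`reindex(columns=...)` does not raise on a misspelled name: it silently creates an all-`NaN` column and drops any column left out of the list. A minimal sketch, on a hypothetical three-column frame, of checking that the new order is an exact permutation before reordering:

```python
import pandas as pd

# Hypothetical toy frame standing in for the vaccination data
toy = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# A typo ('x') produces a new all-NaN column and silently drops 'b' and 'c'
typo = toy.reindex(columns=["a", "x"])
print(typo)

# Safer: verify that the requested order matches the existing columns exactly
new_order = ["a", "c", "b"]
missing = set(new_order) - set(toy.columns)   # requested but not present
dropped = set(toy.columns) - set(new_order)   # present but not requested
if missing or dropped:
    raise ValueError(f"missing={missing}, dropped={dropped}")

# Plain column selection raises KeyError on unknown names, unlike reindex
reordered = toy[new_order]
print(reordered.columns.tolist())
```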

In [51]:
vaccine_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7621 entries, 0 to 7838
Data columns (total 18 columns):
 #   Column                           Non-Null Count  Dtype         
---  ------                           --------------  -----         
 0   Vaccine_Date                     7621 non-null   datetime64[ns]
 1   State                            7621 non-null   object        
 2   Total Doses Administered         7621 non-null   float64       
 3   Sessions                         7621 non-null   float64       
 4   Sites                            7621 non-null   float64       
 5   First Dose Administered          7621 non-null   float64       
 6   Second Dose Administered         7621 non-null   float64       
 7   Covaxin (Doses Administered)     7621 non-null   float64       
 8   CoviShield (Doses Administered)  7621 non-null   float64       
 9   Sputnik V (Doses Administered)   7621 non-null   float64       
 10  AEFI                             7621 non-null   float64       
 

In [52]:
# Check for null values
pd.isnull(vaccine_df).sum()

Vaccine_Date                          0
State                                 0
Total Doses Administered              0
Sessions                              0
Sites                                 0
First Dose Administered               0
Second Dose Administered              0
Covaxin (Doses Administered)          0
CoviShield (Doses Administered)       0
Sputnik V (Doses Administered)        0
AEFI                                  0
Male                                  0
Female                                0
Transgender                           0
18-44 Years                        2186
45-60 Years                        2185
60+ Years                          2185
Total Individuals Vaccinated          0
dtype: int64

### Step 5: Save Cleaned Data

In [53]:
# Export the cleaned COVID-19 cases dataframe to a CSV file
covid_df.to_csv("../Clean_data/covid_19_india_cleaned.csv", index=False)

# Export the cleaned Vaccination dataframe to a CSV file
vaccine_df.to_csv("../Clean_data/covid_vaccine_statewise_cleaned.csv", index=False)
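`to_csv` fails if the target directory does not exist, and CSV files do not store dtypes, so datetime columns come back as strings unless they are parsed again on load. The sketch below illustrates both points with a hypothetical toy frame written to a temporary directory, so it does not touch the project's `Clean_data` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame with one datetime column
toy = pd.DataFrame({"Date": pd.to_datetime(["2021-08-11"]), "Confirmed": [10]})

# Create the output directory first; exist_ok avoids an error on reruns
out_dir = Path(tempfile.mkdtemp()) / "Clean_data"
out_dir.mkdir(parents=True, exist_ok=True)

out_path = out_dir / "toy_cleaned.csv"
toy.to_csv(out_path, index=False)

# parse_dates restores the datetime dtype when reading the file back
reloaded = pd.read_csv(out_path, parse_dates=["Date"])
print(reloaded.dtypes)
```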