***Preprocessing and Training Data Development***

In this notebook, we prepare the data from the previous Exploratory Data Analysis step for use in machine learning models.

In [9]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [10]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [11]:
df = pd.read_csv('../../Unit11A/EDACapstone/USCovidData.csv')

In [12]:
df.head()

,date,new_cases,new_deaths,new_cases_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_hosp_admissions,...,people_fully_vaccinated,total_boosters,new_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,stringency_index,tests_units_not performed,tests_units_tests performed
0,2020-01-22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0
1,2020-01-23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0
2,2020-01-24,1.0,0.0,0.003,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0
3,2020-01-25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0
4,2020-01-26,3.0,0.0,0.009,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 815 entries, 0 to 814
Data columns (total 29 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   date                                 815 non-null    object 
 1   new_cases                            815 non-null    float64
 2   new_deaths                           815 non-null    float64
 3   new_cases_per_million                815 non-null    float64
 4   reproduction_rate                    815 non-null    float64
 5   icu_patients                         815 non-null    float64
 6   icu_patients_per_million             815 non-null    float64
 7   hosp_patients                        815 non-null    float64
 8   hosp_patients_per_million            815 non-null    float64
 9   weekly_hosp_admissions               815 non-null    float64
 10  weekly_hosp_admissions_per_million   815 non-null    float64
 11  total_tests                     

The dataframe has no missing values, and it is all numeric except the date column, which is stored as a plain object (string) column. It should be converted to datetime and set as the index of the dataframe, so that the index becomes a DatetimeIndex:

In [14]:
df['date'] = pd.to_datetime(df['date'])

In [15]:
df.set_index('date', inplace=True)
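The two steps above can also be folded into the read itself. The following is a minimal sketch on a hypothetical three-row sample (the `csv_text` string and the `sample` name are illustrative and not part of the original notebook), using the `parse_dates` and `index_col` parameters of `pd.read_csv`:

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file
csv_text = """date,new_cases,new_deaths
2020-01-22,0.0,0.0
2020-01-23,0.0,0.0
2020-01-24,1.0,0.0
"""

# parse_dates converts the column to datetime64 while reading;
# index_col then uses that column as the index
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

print(type(sample.index).__name__)
print(sample.dtypes)
```

Either approach should give the same result; converting at read time simply avoids carrying the date around as a string column first.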

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 815 entries, 2020-01-22 to 2022-04-15
Data columns (total 28 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   new_cases                            815 non-null    float64
 1   new_deaths                           815 non-null    float64
 2   new_cases_per_million                815 non-null    float64
 3   reproduction_rate                    815 non-null    float64
 4   icu_patients                         815 non-null    float64
 5   icu_patients_per_million             815 non-null    float64
 6   hosp_patients                        815 non-null    float64
 7   hosp_patients_per_million            815 non-null    float64
 8   weekly_hosp_admissions               815 non-null    float64
 9   weekly_hosp_admissions_per_million   815 non-null    float64
 10  total_tests                          815 non-null    float64
 11  new_tests    

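Now the data is numeric and complete. However, the columns are on very different scales (raw counts alongside per-million and per-hundred rates), so the data still needs to be standardized: each column is rescaled to zero mean and unit variance. We shall use sklearn's StandardScaler for that.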

In [17]:
scaler = StandardScaler()
scale = scaler.fit_transform(df)

In [18]:
scale.mean()

4.857347361024348e-17

In [19]:
scale.std(ddof=0)

1.0
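The two checks above aggregate over the whole array, so they do not by themselves show that each individual column was standardized. A per-column check can be sketched as follows, on a hypothetical toy frame (the `toy`, `scaled` and `manual` names and the values are illustrative only). It also compares the result with the manual formula `(x - mean) / std`, using the population standard deviation (`ddof=0`), which is what StandardScaler uses:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame with two columns on different scales
toy = pd.DataFrame({
    "new_cases": [0.0, 1.0, 3.0, 10.0, 25.0],
    "stringency_index": [0.0, 5.0, 20.0, 40.0, 70.0],
})

scaled = StandardScaler().fit_transform(toy)

# Manual standardization, column by column, with population std (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

# axis=0 computes one statistic per column instead of one for the whole array
print(scaled.mean(axis=0))
print(scaled.std(axis=0, ddof=0))
print(np.allclose(scaled, manual.to_numpy()))
```

On the notebook's data, the equivalent check would be `scale.mean(axis=0)` and `scale.std(axis=0)`.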

In [20]:
scaledf = pd.DataFrame(scale, columns=df.columns, index=df.index)

In [21]:
scaledf

date,new_cases,new_deaths,new_cases_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,...,people_fully_vaccinated,total_boosters,new_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,stringency_index,tests_units_not performed,tests_units_tests performed
2020-01-22,-0.684809,-1.230281,-0.684809,-2.064156,-1.252235,-1.252234,-1.139472,-1.139472,-1.121894,-1.121894,...,-0.940899,-0.500739,-0.758638,-0.936484,-0.983961,-0.940900,-0.497097,-3.781973,4.460654,-4.460654
2020-01-23,-0.684809,-1.230281,-0.684809,-2.064156,-1.252235,-1.252234,-1.139472,-1.139472,-1.121894,-1.121894,...,-0.940899,-0.500739,-0.758638,-0.936484,-0.983961,-0.940900,-0.497097,-3.781973,4.460654,-4.460654
2020-01-24,-0.684802,-1.230281,-0.684802,-2.064156,-1.252235,-1.252234,-1.139472,-1.139472,-1.121894,-1.121894,...,-0.940899,-0.500739,-0.758638,-0.936484,-0.983961,-0.940900,-0.497097,-3.781973,4.460654,-4.460654
2020-01-25,-0.684809,-1.230281,-0.684809,-2.064156,-1.252235,-1.252234,-1.139472,-1.139472,-1.121894,-1.121894,...,-0.940899,-0.500739,-0.758638,-0.936484,-0.983961,-0.940900,-0.497097,-3.781973,4.460654,-4.460654
2020-01-26,-0.684788,-1.230281,-0.684788,-2.064156,-1.252235,-1.252234,-1.139472,-1.139472,-1.121894,-1.121894,...,-0.940899,-0.500739,-0.758638,-0.936484,-0.983961,-0.940900,-0.497097,-3.781973,4.460654,-4.460654
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-04-11,-0.383126,-0.903682,-0.383126,0.797451,-1.072014,-1.072011,-0.881618,-0.881618,-0.835729,-0.835724,...,1.446885,2.554187,-0.247425,1.699904,1.459048,1.446802,2.581803,0.024797,-0.224182,0.224182
2022-04-12,-0.491969,-0.661268,-0.491969,0.797451,-1.076960,-1.076950,-0.881900,-0.881899,-0.832645,-0.832645,...,1.447260,2.556284,-0.413497,1.701446,1.459365,1.447165,2.583866,0.024797,-0.224182,0.224182
2022-04-13,-0.399617,-0.260625,-0.399618,0.797451,-1.077305,-1.077294,-0.881234,-0.881234,-0.830517,-0.830519,...,1.447315,2.556612,-0.726404,1.701446,1.459365,1.447165,2.583866,0.024797,-0.224182,0.224182
2022-04-14,-0.307182,-0.441168,-0.307181,0.797451,-1.074660,-1.074652,-0.879108,-0.879111,-0.828634,-0.828629,...,-0.940899,2.556612,-0.758638,-0.936484,-0.983961,-0.940900,-0.497097,0.024797,-0.224182,0.224182


All that is left is to split the data into training and test sets. This will be done in the next notebook where we shall be fitting a model to the data and making predictions.

In [22]:
scaledf.to_csv('ScaledData.csv')
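The train/test split is left to the next notebook. As a rough sketch of what it could look like for date-indexed data, the example below splits chronologically with `train_test_split(..., shuffle=False)`. It also shows one common variation, which is not what this notebook does: fitting the scaler on the training rows only, so that the test rows do not influence the scaling statistics. The toy frame, the 80/20 ratio and all variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical ten-day toy frame with a DatetimeIndex
dates = pd.date_range("2020-01-22", periods=10, freq="D")
toy = pd.DataFrame(
    {
        "new_cases": np.arange(10, dtype=float),
        "new_deaths": np.arange(10, dtype=float) * 0.1,
    },
    index=dates,
)

# shuffle=False keeps row order, so the test set is the most recent 20% of days
train, test = train_test_split(toy, test_size=0.2, shuffle=False)

# Fit on the training rows only, then apply the same transform to both parts
scaler = StandardScaler().fit(train)
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

print(train.index.max(), test.index.min())
print(train_scaled.shape, test_scaled.shape)
```

Keeping the split chronological means the model is evaluated on dates that come after everything it was trained on.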