# UnitedStates_COVID_19

Coronaviruses are a family of viruses that cause illness ranging from the common cold and cough to more severe disease. Middle East Respiratory Syndrome (MERS-CoV) and Severe Acute Respiratory Syndrome (SARS-CoV) are two severe examples that the world has already faced. SARS-CoV-2 (novel coronavirus) is a new member of the coronavirus family, first discovered in 2019, that had not been identified in humans before. It is a contagious virus that emerged in Wuhan in December 2019, and the WHO later declared the outbreak a pandemic because of how quickly it spread throughout the world. As of 20 May 2020, it had caused 300K+ deaths across the globe, including 90K+ deaths in the USA alone. The dataset is provided to predict the number of deaths and recovered cases.

Field description:

* Province_State - The name of the State within the USA.
* Country_Region - The name of the Country (US).
* Last_Update - The most recent date the file was pushed.
* Lat - Latitude.
* Long_ - Longitude.
* Confirmed - Aggregated confirmed case count for the state.
* Deaths - Aggregated death count for the state.
* Recovered - Aggregated recovered case count for the state.
* Active - Aggregated confirmed cases that have not been resolved (Active = Confirmed - Recovered - Deaths).
* FIPS - Federal Information Processing Standards code that uniquely identifies counties within the USA.
* Incident_Rate - Confirmed cases per 100,000 persons.
* People_Tested - Total number of people who have been tested.
* People_Hospitalized - Total number of people hospitalized.
* Mortality_Rate - Number of recorded deaths * 100 / Number of confirmed cases.
* UID - Unique Identifier for each row entry.
* ISO3 - Officially assigned country code identifiers.
* Testing_Rate - Total number of people tested per 100,000 persons.
* Hospitalization_Rate - Total number of people hospitalized * 100 / Number of confirmed cases.
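
The derived fields in the list above follow directly from the raw counts. As a small worked sketch, using made-up counts for a single hypothetical state (not values from the dataset), the formulas for Active, Mortality_Rate and Hospitalization_Rate can be checked by hand:

```python
# Hypothetical counts for one state (illustrative only, not real data).
confirmed = 2000
deaths = 100
recovered = 500
people_hospitalized = 300

# Active = Confirmed - Recovered - Deaths
active = confirmed - recovered - deaths

# Mortality_Rate = deaths * 100 / confirmed cases (a percentage)
mortality_rate = deaths * 100 / confirmed

# Hospitalization_Rate = hospitalized * 100 / confirmed cases (a percentage)
hospitalization_rate = people_hospitalized * 100 / confirmed

print(active)                # 1400
print(mortality_rate)        # 5.0
print(hospitalization_rate)  # 15.0
```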


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
df=pd.read_csv('covid_19.csv',parse_dates=['Last_Update'])
df.head()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.dtypes

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
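# Correlation is only defined for numeric columns, so the text columns are excluded
df.select_dtypes(include='number').corr()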

In [None]:
plt.figure(figsize=(15,14))
# Heatmap of pairwise correlations between the numeric columns
sns.heatmap(df.select_dtypes(include='number').corr(),annot=True)

In [None]:
df.isnull().sum()

In [None]:
sns.heatmap(df.isnull())

In [None]:
#filling missing values
df['Lat']=df['Lat'].fillna(df['Lat'].mean())
df['Long_']=df['Long_'].fillna(df['Long_'].mean())
df['Recovered']=df['Recovered'].fillna(df['Recovered'].mean())
df['Incident_Rate']=df['Incident_Rate'].fillna(df['Incident_Rate'].mean())
df['People_Tested']=df['People_Tested'].fillna(df['People_Tested'].mean())
df['People_Hospitalized']=df['People_Hospitalized'].fillna(df['People_Hospitalized'].mean())
df['Mortality_Rate']=df['Mortality_Rate'].fillna(df['Mortality_Rate'].mean())
df['Testing_Rate']=df['Testing_Rate'].fillna(df['Testing_Rate'].mean())
df['Hospitalization_Rate']=df['Hospitalization_Rate'].fillna(df['Hospitalization_Rate'].mean())
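
The column-by-column mean imputation above can also be written in one step. The following is a minimal sketch on a toy DataFrame (the name `toy` and its values are made up for illustration); it assumes every numeric column with gaps should receive its own column mean, which is what the cell above does one column at a time:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Recovered": [10.0, np.nan, 30.0],
    "People_Tested": [100.0, 200.0, np.nan],
})

# Fill each numeric column's missing values with that column's mean.
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

print(toy)
```

Mean imputation is simple, but for heavily skewed count columns the median is a common alternative; which one suits this dataset better is not something the notebook establishes.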

In [None]:
df.isnull().sum()

In [None]:
df['Province_State'].nunique()

In [None]:
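# Note: since Active = Confirmed - Recovered - Deaths, this sum equals 2 * Confirmed wherever that identity holds
df['Total_cases']=df['Confirmed']+df['Recovered']+df['Deaths']+df['Active']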

In [None]:
df.drop(['Last_Update','FIPS','UID','ISO3'],axis=1,inplace=True)

In [None]:
df[['Province_State','Confirmed']].groupby(['Province_State']).mean()

In [None]:
df[['Province_State','Recovered']].groupby(['Province_State']).mean()

In [None]:
df[['Province_State','Deaths']].groupby(['Province_State']).mean()

In [None]:
df.skew(numeric_only=True)

In [None]:
df.Confirmed.plot(kind='kde')

In [None]:
df.Deaths.plot(kind='kde')

In [None]:
df.Lat.plot(kind='kde')

In [None]:
#Transform the data
from sklearn.preprocessing import LabelEncoder

for column in df.columns:
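    # np.object was removed from NumPy; encode every non-numeric (text) column instead
    if not pd.api.types.is_numeric_dtype(df[column]):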
        df[column]=LabelEncoder().fit_transform(df[column])
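
To make the encoding step concrete, here is a small sketch of what `LabelEncoder` does to a text column. The state names in the list are just sample inputs; the encoder sorts the distinct labels and replaces each one with its position in that sorted order:

```python
from sklearn.preprocessing import LabelEncoder

# A few sample labels (illustrative input, not the full column).
states = ["Texas", "Alabama", "Ohio", "Texas"]

encoder = LabelEncoder()
codes = encoder.fit_transform(states)

print(encoder.classes_.tolist())                   # ['Alabama', 'Ohio', 'Texas']
print(codes.tolist())                              # [2, 0, 1, 2]
print(encoder.inverse_transform(codes).tolist())   # ['Texas', 'Alabama', 'Ohio', 'Texas']
```

The integer codes carry an alphabetical ordering that has no real meaning for states, which is worth keeping in mind when a linear model uses the encoded column as a feature.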

# Linear Regression for Recovered cases

In [None]:
y=df['Recovered']
x=df.drop(['Recovered','Deaths'],axis=1)

In [None]:
from sklearn.preprocessing import MinMaxScaler
sc= MinMaxScaler()
x= sc.fit_transform(x)
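
`MinMaxScaler` rescales each column to the range [0, 1] using (value - column min) / (column max - column min). The sketch below checks this on a made-up feature matrix (the array `features` is hypothetical) by comparing the scaler's output with the formula applied by hand:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature matrix: two columns on very different scales.
features = np.array([[1.0, 100.0],
                     [2.0, 300.0],
                     [3.0, 500.0]])

scaled = MinMaxScaler().fit_transform(features)

# The same result by hand: (value - column min) / (column max - column min)
col_min = features.min(axis=0)
col_max = features.max(axis=0)
by_hand = (features - col_min) / (col_max - col_min)

print(scaled)
print(np.allclose(scaled, by_hand))
```

Note that the cell above fits the scaler on all rows before the train/test split; fitting it on the training rows only is a common alternative that keeps test-set information out of preprocessing.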

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score

In [None]:
# Search for the train/test split (random_state) that gives the highest test R2
max_r_score=float('-inf')
final_r_state=None
for r_state in range(42,100):
    x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=r_state,test_size=0.20)
    lr = LinearRegression()
    lr.fit(x_train,y_train)
    pred=lr.predict(x_test)
    r2_scr=r2_score(y_test,pred)
    print("r2_score corresponding to random state:",r_state, " is: ",r2_scr)
    if r2_scr>max_r_score:
        max_r_score=r2_scr
        final_r_state=r_state

print("max r2 score corresponding to",final_r_state," is ",max_r_score)
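
Picking the `random_state` with the highest test R2 tends to give an optimistic estimate, because the split is selected using the test scores themselves. As a hedged alternative sketch, the variability across splits can be summarised with shuffled k-fold cross-validation. The data here is synthetic (`X_demo` and `y_demo` come from `make_regression`, not from the COVID dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for x and y (not the COVID data).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# Score the same model on 5 shuffled folds instead of one chosen split.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print("R2 per fold:", np.round(scores, 3))
print("mean:", scores.mean(), "std:", scores.std())
```

Reporting the mean and spread across folds gives a steadier picture of model quality than the single best split.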

In [None]:
#Finalizing the train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=82,test_size=0.20)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

In [None]:
#best parameters for KNN

from sklearn.model_selection import GridSearchCV
knn=KNeighborsRegressor()
grid_param ={'n_neighbors':range(1,30)}
gd = GridSearchCV(knn,grid_param)
gd.fit(x_train,y_train)
gd.best_params_
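
Besides `best_params_`, a fitted `GridSearchCV` also exposes the cross-validated score of the winning setting and an estimator already refit with it. The following sketch uses synthetic data and a smaller grid than the notebook (both are assumptions made to keep the example fast):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for x_train and y_train (not the COVID data).
X_demo, y_demo = make_regression(n_samples=150, n_features=4,
                                 noise=5.0, random_state=1)

search = GridSearchCV(KNeighborsRegressor(),
                      {"n_neighbors": range(1, 11)},
                      cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print(search.best_params_)   # the winning n_neighbors
print(search.best_score_)    # mean cross-validated R2 for that setting
# best_estimator_ has been refit on all of X_demo with the best parameters
print(search.best_estimator_)
```

Using `search.best_estimator_` directly avoids copying the chosen parameter value into later cells by hand.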

In [None]:
#best parameters for DecisionTree

dtr=DecisionTreeRegressor()
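# 'mse' and 'mae' were renamed to 'squared_error' and 'absolute_error' in scikit-learn 1.0
grid_param ={'criterion':['squared_error','friedman_mse','absolute_error']}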
gd=GridSearchCV(dtr,grid_param)
gd.fit(x_train,y_train)
gd.best_params_

In [None]:
# best parameters for SVR

svr = SVR()
grid_param ={'kernel': ('linear','poly','rbf'), 'C':[1,10]}
gd = GridSearchCV(svr,grid_param)
gd.fit(x_train,y_train)
gd.best_params_

In [None]:
#best parameters for RandomForest

rfr=RandomForestRegressor()
grid_param={"n_estimators":[10,100,500,1000]}
gd=GridSearchCV(rfr,grid_param)
gd.fit(x_train,y_train)
gd.best_params_

In [None]:
model=[]
score=[]
cvs=[]

for i in [LinearRegression(),KNeighborsRegressor(n_neighbors=29),DecisionTreeRegressor(criterion='absolute_error'),SVR(kernel='poly',C=10),RandomForestRegressor(n_estimators=500)]:
    model.append(i)
    print('\n')
    i.fit(x_train,y_train)
    i.score(x_train,y_train)
    pred=i.predict(x_test)
    r2_scr=r2_score(y_test,pred)
    print('R2 score of',i,'is:',r2_scr)
    score.append(r2_scr)
    print('\n')
    cv_score=cross_val_score(i,x,y,cv=5,scoring='r2').mean()
    print('The CV Score is', cv_score)
    cvs.append(cv_score)
    print('\n')

In [None]:
result=pd.DataFrame({'Model':['LinearRegression','KNeighborsRegressor','DecisionTreeRegressor','SVR','RandomForestRegressor'],'R2 Score':score,'Cross_val_score':cvs})
result
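
The R2 score compared in the table is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. As a small worked sketch on made-up numbers (the arrays `y_true` and `y_pred` are hypothetical), the hand computation can be checked against `r2_score`, alongside the MAE and MSE metrics imported earlier:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # 1 + 0 + 1 + 0 = 2
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 9 + 1 + 1 + 9 = 20
r2_by_hand = 1 - ss_res / ss_tot                # 1 - 2/20 = 0.9

print(r2_by_hand, r2_score(y_true, y_pred))
print(mean_absolute_error(y_true, y_pred))      # (1 + 0 + 1 + 0) / 4 = 0.5
print(mean_squared_error(y_true, y_pred))       # 2 / 4 = 0.5
```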

Since Linear Regression gives the best results, we finalize it as the model.

Saving the model

In [None]:
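import joblib  # sklearn.externals.joblib was removed from scikit-learn; joblib is a standalone package
# model[0] is the LinearRegression fitted on the finalized split above;
# lr still holds the last fit from the random-state search loop
joblib.dump(model[0],'covid_recovered.lr')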
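
A saved model can be read back with `joblib.load` and used for prediction straight away. The sketch below is self-contained and uses a tiny made-up training set and a temporary file (the name `demo_model.joblib` is hypothetical) rather than the notebook's model:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up training set: y = 2 * x exactly.
X_demo = np.array([[1.0], [2.0], [3.0]])
y_demo = np.array([2.0, 4.0, 6.0])
fitted = LinearRegression().fit(X_demo, y_demo)

# Save to a temporary path and load the model back.
path = os.path.join(tempfile.mkdtemp(), "demo_model.joblib")
joblib.dump(fitted, path)
restored = joblib.load(path)

# The restored model should reproduce the original predictions.
print(np.allclose(restored.predict(X_demo), fitted.predict(X_demo)))
```

New inputs need the same preprocessing (encoding and scaling) that was applied before training, so the fitted scaler is usually saved alongside the model.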

# Linear Regression for Death cases

In [None]:
y=df['Deaths']
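x=df.drop(['Recovered','Deaths'],axis=1)
x=sc.fit_transform(x)  # rescale to [0, 1] with the MinMaxScaler defined earlier, as was done for the Recovered model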

In [None]:
# Search for the train/test split (random_state) that gives the highest test R2
max_r_score=float('-inf')
final_r_state=None
for r_state in range(42,100):
    x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=r_state,test_size=0.20)
    lr = LinearRegression()
    lr.fit(x_train,y_train)
    pred=lr.predict(x_test)
    r2_scr=r2_score(y_test,pred)
    print("r2_score corresponding to random state:",r_state, " is: ",r2_scr)
    if r2_scr>max_r_score:
        max_r_score=r2_scr
        final_r_state=r_state

print("max r2 score corresponding to",final_r_state," is ",max_r_score)

In [None]:
#Finalizing the train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=47,test_size=0.20)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

In [None]:
#best parameters for KNN

from sklearn.model_selection import GridSearchCV
knn=KNeighborsRegressor()
grid_param ={'n_neighbors':range(1,30)}
gd = GridSearchCV(knn,grid_param)
gd.fit(x_train,y_train)
gd.best_params_

In [None]:
#best parameters for DecisionTree

dtr=DecisionTreeRegressor()
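# 'mse' and 'mae' were renamed to 'squared_error' and 'absolute_error' in scikit-learn 1.0
grid_param ={'criterion':['squared_error','friedman_mse','absolute_error']}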
gd=GridSearchCV(dtr,grid_param)
gd.fit(x_train,y_train)
gd.best_params_

In [49]:
# best parameters for SVR

svr = SVR()
grid_param ={'kernel': ('linear','poly','rbf'), 'C':[1,10]}
gd = GridSearchCV(svr,grid_param)
gd.fit(x_train,y_train)
gd.best_params_

In [None]:
#best parameters for RandomForest

rfr=RandomForestRegressor()
grid_param={"n_estimators":[10,100,500,1000]}
gd=GridSearchCV(rfr,grid_param)
gd.fit(x_train,y_train)
gd.best_params_

In [None]:
model=[]
score=[]
cvs=[]

for i in [LinearRegression(),KNeighborsRegressor(n_neighbors=16),DecisionTreeRegressor(criterion='absolute_error'),SVR(kernel='poly',C=10),RandomForestRegressor(n_estimators=1000)]:
    model.append(i)
    print('\n')
    i.fit(x_train,y_train)
    i.score(x_train,y_train)
    pred=i.predict(x_test)
    r2_scr=r2_score(y_test,pred)
    print('R2 score of',i,'is:',r2_scr)
    score.append(r2_scr)
    print('\n')
    cv_score=cross_val_score(i,x,y,cv=5,scoring='r2').mean()
    print('The CV Score is', cv_score)
    cvs.append(cv_score)
    print('\n')

In [None]:
result=pd.DataFrame({'Model':['LinearRegression','KNeighborsRegressor','DecisionTreeRegressor','SVR','RandomForestRegressor'],'R2 Score':score,'Cross_val_score':cvs})
result

Since Linear Regression gives the best results, we finalize it as the model.

Saving the model

In [None]:
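import joblib  # sklearn.externals.joblib was removed from scikit-learn; joblib is a standalone package
# model[0] is the LinearRegression fitted on the finalized split above;
# lr still holds the last fit from the random-state search loop
joblib.dump(model[0],'covid_deaths.lr')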