<a href="https://colab.research.google.com/github/AVJdataminer/COVID19_GC/blob/master/COVID19_GuidedCapstoneStep4andStep5_AnswerKey.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Guided Capstone Step 4. Pre-processing and Training Data Development - Answer Key

**The Data Science Method**

1. Problem Identification
2. Data Wrangling
3. Exploratory Data Analysis
4. **Pre-processing and Training Data Development**
   * Create dummy or indicator features for categorical variables
   * Standardize the magnitude of numeric features
   * Split into testing and training datasets
   * Apply scaler to the testing set
5. Modeling
   * Fit Models with Training Data Set
   * Review Model Outcomes; iterate over additional models as needed
   * Identify the Final Model
6. Documentation
   * Review the Results
   * Present and share your findings - storytelling
   * Finalize Code
   * Finalize Documentation

**<font color='DarkBlue'> Start by loading the necessary packages as we did in step 3 and printing out our current working directory just to confirm we are in the correct project directory. </font>**

In [1]:
import os
import pandas as pd
import datetime
import seaborn as sns
from matplotlib import pyplot as plt
import json
%matplotlib inline
import plotly.graph_objects as go  # requires the plotly package (e.g. pip install plotly)
import numpy as np
print(os.getcwd())  # confirm we are in the correct project directory
#os.listdir()

**<font color='DarkBlue'> If you need to change your path, refer back to the notebook for steps 1 & 2. Then load the csv file you created in step 3 (it should be saved inside your data subfolder) and print the first five rows.</font>**

In [None]:
file='https://raw.githubusercontent.com/AVJdataminer/COVID19_GC/master/%20data/step3_output.csv'
df=pd.read_csv(file)
df.head()

In [None]:
ds = df.groupby(['timestamp.date']).agg({'confirmed':'sum','deaths':'sum', 'recovered':'sum'}).reset_index()
ds.head()

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(
                x=ds['timestamp.date'],
                y=ds['confirmed'],
                name="confirmed",
                line_color='red',
                opacity=0.8))
fig.add_trace(go.Scatter(
                x=ds['timestamp.date'],
                y=ds['deaths'],
                name="deaths",
                line_color='dimgray',
                opacity=0.8))
fig.add_trace(go.Scatter(
                x=ds['timestamp.date'],
                y=ds['recovered'],
                name="recovered",
                line_color='green',
                opacity=0.8))

# Use date string to set xaxis range
fig.update_layout(xaxis_range=['2020-01-22','2020-04-22'],
                  title_text="COVID-19 US Confirmed Cases")
fig.show()

## Create dummy features for categorical variables - when applicable.

**<font color='DarkBlue'> Check the values for `province_state` and determine whether dummies need to be created. If so, add the dummies back to the dataframe and remove the original `province_state` column. </font>**

In [None]:
df.province_state.value_counts()

**<font color='DarkBlue'> Currently there are no states in this dataset so we skip this step. </font>**

In [None]:
#df = pd.concat([df, pd.get_dummies(df['province_state'])], axis=1).drop(['province_state'], axis =1)
#print(df.shape)
#df.head()
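For reference, if `province_state` did contain several states, the commented-out cell above would one-hot encode it. Here is a minimal sketch on a made-up frame; the state names and counts are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "province_state": ["New York", "Ohio", "New York"],
    "confirmed": [10, 3, 12],
})

# One indicator column per state, then drop the original text column.
dummies = pd.get_dummies(toy["province_state"])
toy_encoded = pd.concat([toy, dummies], axis=1).drop(["province_state"], axis=1)
print(toy_encoded.columns.tolist())
```

Each state becomes its own 0/1 (or boolean) column, so the frame gains one column per distinct state.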

## Standardize the magnitude of numeric features

**<font color='DarkBlue'> In the last step, you may remember, we applied a scaler to our data before fitting the Lasso Regression. However, we didn't save that in the output data, so we need to apply that step again before modeling the US data. We also need to use the simple imputer to fill the null values once again. Start by filling the null values, then apply the scaler to the filled numpy array. </font>**

In [None]:
from sklearn.impute import SimpleImputer
response ='confirmed'
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X = df.drop([response], axis=1)._get_numeric_data()
imputer=imp.fit(X)
X_filled=imputer.transform(X)

from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X_filled)
X_scaled=scaler.transform(X_filled)
y = df[[response]].values
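The cell above fits the imputer and the scaler on all rows before the split. A common alternative, in line with the "Apply scaler to the testing set" item in the outline, is to fit both on the training rows only and reuse the fitted objects on the test rows. The sketch below uses synthetic data; the array names and the 80/20 cut are assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix with one missing value (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(10, 2))
X_demo[3, 1] = np.nan

# Chronological 80/20 cut: first 8 rows train, last 2 rows test.
X_tr, X_te = X_demo[:8], X_demo[8:]

# Fit on the training rows only, then apply the same fitted objects to both sets.
imp_demo = SimpleImputer(missing_values=np.nan, strategy="mean").fit(X_tr)
scaler_demo = StandardScaler().fit(imp_demo.transform(X_tr))

X_tr_scaled = scaler_demo.transform(imp_demo.transform(X_tr))
X_te_scaled = scaler_demo.transform(imp_demo.transform(X_te))
print(X_tr_scaled.mean(axis=0).round(6), X_te_scaled.shape)
```

Fitting on the training rows alone keeps the test-period means and variances out of the model inputs.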

## Split into training and testing datasets

**<font color='DarkBlue'> Split the data into training and testing data subset based on date.</font>**

In [None]:
from datetime import timedelta

def dt_splitter(date_col, X, y, train_frac):
    """Split X and y chronologically; train_frac is the share of the date span used for training."""
    date_col = pd.to_datetime(date_col)
    # Attach the date to the features and to the response so both can be filtered by date
    xw_date = pd.DataFrame(X).merge(date_col, left_index=True, right_index=True)
    yw_date = pd.DataFrame(y).merge(date_col, left_index=True, right_index=True)
    # Number of days, counted from the first date, that belong to the training period
    ad = (xw_date['date'].max() - xw_date['date'].min()).days * train_frac
    split_date = xw_date['date'].min() + timedelta(days=ad)
    X_train = xw_date.loc[xw_date['date'] <= split_date].drop(['date'], axis=1).values
    X_test = xw_date.loc[xw_date['date'] > split_date].drop(['date'], axis=1).values
    y_train = yw_date.loc[yw_date['date'] <= split_date].drop(['date'], axis=1).values
    y_test = yw_date.loc[yw_date['date'] > split_date].drop(['date'], axis=1).values
    return X_train, X_test, y_train, y_test

# 0.80 is the training share: the first 80% of the date range trains the model
X_train, X_test, y_train, y_test = dt_splitter(df['date'], X_scaled, y, .80)
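To see what the date-based split does, here is a minimal sketch on ten invented daily rows. As with the last argument passed to `dt_splitter` (0.80), the fraction is the share of the date range kept for training; the frame and variable names below are assumptions for the example.

```python
import pandas as pd

# Ten consecutive daily rows; values are invented for illustration.
demo = pd.DataFrame({
    "date": pd.date_range("2020-03-01", periods=10, freq="D"),
    "confirmed": range(10),
})

train_fraction = 0.8  # share of the date span kept for training (assumed)
span = demo["date"].max() - demo["date"].min()
split_date = demo["date"].min() + span * train_fraction

# Rows on or before the split date train the model; later rows test it.
train = demo[demo["date"] <= split_date]
test = demo[demo["date"] > split_date]
print(split_date, len(train), len(test))
```

Every test row is later than every training row, which is the property a random split would not guarantee.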

In [None]:
print(X.shape)
y.shape

In [None]:
#Fit the model
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error,explained_variance_score,mean_absolute_error
from math import sqrt
# The normalize argument was removed from Lasso in scikit-learn 1.2 (X is already
# standardized above, though results can differ from normalize=True); max_iter must be an integer.
lassoreg = Lasso(alpha=1, max_iter=100000)
lassoreg.fit(X_train,y_train)
y_pred = lassoreg.predict(X_test)
print('Explained variance score for confirmed cases for the testing period =%.3f' % explained_variance_score(y_test, y_pred))
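The score printed above is 1 minus the variance of the residuals divided by the variance of the actual values. The sketch below recomputes it by hand with NumPy on invented numbers; the helper name `explained_variance` is hypothetical and only mirrors the single-output case.

```python
import numpy as np

def explained_variance(y_true, y_pred):
    """1 - Var(y_true - y_pred) / Var(y_true), for 1-D inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)

y_true_demo = [3.0, 5.0, 7.0, 9.0]
perfect = explained_variance(y_true_demo, y_true_demo)
# A constant bias of +1 leaves the residual variance at zero.
shifted = explained_variance(y_true_demo, [4.0, 6.0, 8.0, 10.0])
# One wrong prediction adds residual variance and lowers the score.
one_off = explained_variance(y_true_demo, [3.0, 5.0, 7.0, 11.0])
print(perfect, shifted, one_off)
```

Because a constant offset does not change the score, it is worth reading it alongside an error metric such as the mean squared error or mean absolute error imported above.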

In [None]:
plt.scatter(y_pred,y_test)
plt.plot([x for x in range(0,150000)],[x for x in range(0,150000)], color='red')
plt.title("Model y predicted by actuals")
plt.xlabel("Predicted")
plt.ylabel("Actual")

## Model confirmed cases with time series

In [None]:
#create timeseries data: a data frame with only the date and confirmed cases.
# .copy() avoids pandas' SettingWithCopyWarning when the date column is overwritten
cdf = df[['date', 'confirmed']].copy()
cdf['date'] = pd.to_datetime(cdf['date'])
cdf.set_index('date', inplace = True)

In [None]:
y = cdf['confirmed']
y.plot()

In [None]:
# Import seasonal_decompose 
from statsmodels.tsa.seasonal import seasonal_decompose

# Make a variable called decomposition
# Recent statsmodels versions take period= here; the older freq= argument was removed
decomposition = seasonal_decompose(y, period=30)

# Make three variables for trend, seasonal and residual components respectively. 
# Assign them the relevant features of decomposition 
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

# Plot the original data, the trend, the seasonality, and the residuals 
plt.subplot(411)
plt.plot(y, label = 'Original')
plt.legend(loc = 'best')
plt.subplot(412)
plt.plot(trend, label = 'Trend')
plt.legend(loc = 'best')
plt.subplot(413)
plt.plot(seasonal, label = 'Seasonality')
plt.legend(loc = 'best')
plt.subplot(414)
plt.plot(residual, label = 'Residuals')
plt.legend(loc = 'best')
plt.tight_layout()

In [None]:
#testing for stationarity
from statsmodels.tsa.stattools import kpss
kpss(y)

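The null hypothesis of the KPSS test is that the series is stationary. Since our p-value (the second value returned by `kpss`) is less than 0.05, we reject the null hypothesis and conclude that our data are non-stationary.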

But our data need to be stationary! So we need to do some transforming.
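Two common transformations for a series that grows multiplicatively are taking the log and then taking first differences. The sketch below applies both to a synthetic series with a fixed 10% daily growth rate; the series is invented for illustration and is not the notebook's data.

```python
import numpy as np
import pandas as pd

# Synthetic series growing 10% per day (illustrative, not the real data).
idx = pd.date_range("2020-03-01", periods=30, freq="D")
growth = pd.Series(100.0 * 1.1 ** np.arange(30), index=idx)

# The log turns multiplicative growth into a straight line;
# first differencing then removes that linear trend.
log_growth = np.log(growth)
log_diff = log_growth.diff().dropna()

print(log_diff.min(), log_diff.max())
```

On this toy series the differenced log is constant, so no trend is left. Real case counts will be noisier, so the transformed series should be re-tested for stationarity rather than assumed stationary.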

In [None]:
# Import mean_squared_error and ARIMA
from sklearn.metrics import mean_squared_error
# statsmodels.tsa.arima_model was removed in statsmodels 0.13; this is its replacement
from statsmodels.tsa.arima.model import ARIMA

In [None]:
# Make a function to find the MSE of a single ARIMA model 
def evaluate_arima_model(data, arima_order):
    # Needs to be an integer because it is later used as an index.
    split = int(len(data) * 0.8)
    # Plain lists keep the positional indexing below independent of the date index
    train, test = list(data[:split]), list(data[split:])
    past = list(train)
    # Walk-forward validation: refit on all past values, forecast one step, then reveal the actual value
    predictions = list()
    for i in range(len(test)):
        model = ARIMA(past, order=arima_order)
        model_fit = model.fit()  # the current fit() has no disp argument
        future = model_fit.forecast()[0]
        predictions.append(future)
        past.append(test[i])
    # calculate out of sample error
    error = mean_squared_error(test, predictions)
    return error

In [None]:
# Make a function to evaluate different ARIMA models with several different p, d, and q values.
def evaluate_models(dataset, p_values, d_values, q_values):
    best_score, best_cfg = float("inf"), None
    for p in p_values:
        for d in d_values:
            for q in q_values:
                order = (p,d,q)
                try:
                    mse = evaluate_arima_model(dataset, order)
                    if mse < best_score:
                        best_score, best_cfg = mse, order
                    print('ARIMA%s MSE=%.3f' % (order,mse))
                except Exception:
                    # Skip orders that fail to fit
                    continue
    print('Best ARIMA%s MSE=%.3f' % (best_cfg, best_score))

In [None]:
# Now, we choose a couple of values to try for each parameter.
p_values = [x for x in range(0, 4)]
d_values = [x for x in range(0, 1)]
q_values = [x for x in range(15, 20)]

In [None]:
# Finally, we can find the optimum ARIMA model for our data.
# Nb. this can take a while...!
import warnings
warnings.filterwarnings("ignore")
y_log = np.log(y)
evaluate_models(y_log, p_values, d_values, q_values)

In [None]:
p=0
d=1
q=2
model = ARIMA(y_log, order=(p,d,q))
model_fit = model.fit()
forecast = model_fit.forecast(24)

In [None]:
model_fit.summary()

In [None]:
# ARIMA is already imported in an earlier cell
model = ARIMA(cdf.confirmed, order=(0,1,2))
model_fit = model.fit()
print(model_fit.summary())
print('Residuals Description')
print(model_fit.resid.describe())

In [None]:
plt.figure(figsize=(15,6))
plt.plot(y_log.diff())
plt.plot(model_fit.predict(), color = 'red')

In [None]:
from pandas.plotting import autocorrelation_plot
autocorrelation_plot(y)

In [None]:
from statsmodels.tsa.stattools import acf, pacf
#calling auto correlation function
lag_acf = acf(y, nlags=300)
#Plot ACF:
plt.figure(figsize=(16, 7))
plt.plot(lag_acf,marker='+')
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(y)),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(y)),linestyle='--',color='gray')
plt.title('Autocorrelation Function')
plt.xlabel('number of lags')
plt.ylabel('correlation')
plt.tight_layout()

In [None]:

#calling partial autocorrelation function
lag_pacf = pacf(y, nlags=30, method='ols')
#Plot PACF:
plt.figure(figsize=(16, 7))
plt.plot(lag_pacf,marker='+')
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(y)),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(y)),linestyle='--',color='gray')
plt.title('Partial Autocorrelation Function')
plt.xlabel('Number of lags')
plt.ylabel('correlation')
plt.tight_layout()