# Model Trial #3.5
## Purpose:
The purpose of this trial is to explore the potential of machine learning to predict deaths based on population density and other information related to population size. Additionally, vaccination information from the states was purposely withheld in order to assess its influence on the predictive abilities of the model.

- This trial will also avoid using the StandardScaler in order to remedy the errors of the first trial.
- This trial will provide more data to the model.


In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path
from collections import Counter

In [None]:
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.metrics import mean_squared_error
from imblearn.metrics import classification_report_imbalanced
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot

In [None]:
import os
# Find the latest version of spark 3.0 from http://www.apache.org/dist/spark/ and enter as the spark version
# For example:
# spark_version = 'spark-3.0.3'
spark_version = 'spark-3.1.3'
os.environ['SPARK_VERSION']=spark_version

# Install Spark and Java
!apt-get update
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q http://www.apache.org/dist/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop2.7.tgz
!tar xf $SPARK_VERSION-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{spark_version}-bin-hadoop2.7"

# Start a SparkSession
import findspark
findspark.init()



In [None]:
# Start Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Tokens").getOrCreate()

In [None]:
# Read in data from S3 Buckets
from pyspark import SparkFiles
url = "https://utah-covid-project.s3.us-west-1.amazonaws.com/final_copy_covid19.csv"
spark.sparkContext.addFile(url)
import_df = spark.read.csv(SparkFiles.get("final_copy_covid19.csv"), sep=",", header=True)
df = import_df.toPandas()
df.head(50)

,state,infected,deaths,population,pop_density,icu_beds,gini,unemployment,hospitals,health_spending,pollution,Med-large_airports,total_vaccines_administered,recipients_one_dose,fully_vaccinated
0,Alabama,1511092,20321,4908621,96.9221,1533,0.4847,2.7,101,7281,8.1,1.0,6087103.0,3031464.0,2459363.0
1,Alaska,302448,1321,734002,1.2863,119,0.4081,5.8,21,11064,6.4,1.0,1093808.0,500864.0,445530.0
2,Arizona,2264159,31244,7378494,64.955,1559,0.4713,4.5,83,6452,9.7,1.0,11508661.0,5191587.0,4364444.0
3,Arkansas,943944,11970,3038999,58.403,732,0.4719,3.5,88,7408,7.1,0.0,4081816.0,1981905.0,1612690.0
4,California,11171759,95620,39937489,256.3727,7338,0.4899,3.9,359,7549,12.8,9.0,71306336.0,32284670.0,27773615.0
5,Colorado,1649212,13426,5845526,56.4011,1597,0.4586,2.5,89,6804,6.7,1.0,10299034.0,4507596.0,3980992.0
6,Connecticut,885767,11317,3563077,735.8689,674,0.4945,3.8,32,9859,7.2,1.0,7277732.0,3345245.0,2771081.0
7,Delaware,305284,3080,982895,504.3073,186,0.4522,3.9,7,10254,8.3,0.0,1729273.0,794932.0,656886.0
8,Florida,7082717,80647,21992985,410.1256,5604,0.4852,2.8,217,8076,7.4,7.0,36035360.0,16742736.0,14140484.0
9,Georgia,2809555,38468,10736059,186.6719,2508,0.4813,3.1,145,6587,8.3,1.0,14324212.0,6823694.0,5684843.0


In [None]:
# Setting the state name as the index and inspecting the data types since new data was imported
s_df = df.set_index('state')
s_df.index.name = "State"
s_df.dtypes

infected                       object
deaths                         object
population                     object
pop_density                    object
icu_beds                       object
gini                           object
unemployment                   object
hospitals                      object
health_spending                object
pollution                      object
Med-large_airports             object
total_vaccines_administered    object
recipients_one_dose            object
fully_vaccinated               object
dtype: object

**Note**:
As can be seen above, every column was read in with the `object` dtype. The values themselves are numeric, so there are no categorical strings that need to be encoded. However, there are commas separating the thousands place in the newly imported vaccination data. This must be dealt with before proceeding to the next step, which involves creating the features and target.
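
One way to handle this is sketched below. It is a minimal example on a small sample frame, not the notebook's own code: the `to_numeric_frame` helper and the comma-formatted strings in `raw` are assumptions made for illustration. It strips thousands separators and converts `object` columns to numbers with `pd.to_numeric`.

```python
import pandas as pd

def to_numeric_frame(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every column converted from strings to numbers."""
    cleaned = frame.copy()
    for col in cleaned.columns:
        # Remove thousands separators such as "6,087,103" before converting
        as_text = cleaned[col].astype(str).str.replace(",", "", regex=False)
        # errors="raise" surfaces any value that is not numeric
        cleaned[col] = pd.to_numeric(as_text, errors="raise")
    return cleaned

# Hypothetical sample mimicking string-typed columns with comma separators
raw = pd.DataFrame(
    {
        "deaths": ["20321", "1321"],
        "total_vaccines_administered": ["6,087,103", "1,093,808"],
    },
    index=["Alabama", "Alaska"],
)
numeric = to_numeric_frame(raw)
print(numeric.dtypes)
```

In the notebook, a helper like this could be applied to `s_df` before building `X` and `y`, so that later steps work with numeric dtypes instead of strings.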

In [None]:
# Creating the DataFrame that contains the desired features
X = s_df.drop(columns="deaths")

# Creating the DataFrame that contains the desired target
y = pd.DataFrame(s_df["deaths"])


In [None]:
# Inspecting the dataset's statistical information. 
X.describe()

,infected,population,pop_density,icu_beds,gini,unemployment,hospitals,health_spending,pollution,Med-large_airports,total_vaccines_administered,recipients_one_dose,fully_vaccinated
count,50,50,50.0,50,50.0,50.0,50,50,50.0,50.0,50.0,50.0,50.0
unique,50,50,50.0,50,48.0,25.0,49,50,34.0,7.0,50.0,50.0,50.0
top,1511092,4908621,96.9221,1533,0.4813,3.5,56,7281,8.1,1.0,6087103.0,3031464.0,2459363.0
freq,1,1,1.0,1,3.0,5.0,2,1,3.0,22.0,1.0,1.0,1.0


**Note**: As observed from the cell above, the scale of the values differs widely between the columns/features. This could have an effect on the end results and predictive abilities of the model. However, the features will not be scaled.
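
Because the columns are stored as `object`, `describe()` reports `count`, `unique`, `top`, and `freq` rather than numeric statistics. The sketch below shows what the summary could look like with numeric dtypes. It uses three rows and three columns copied from the table shown earlier; the `sample` frame and the `spread_ratio` measure are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Three rows copied from the table shown earlier
sample = pd.DataFrame(
    {
        "population": [4908621, 734002, 7378494],
        "pop_density": [96.9221, 1.2863, 64.955],
        "gini": [0.4847, 0.4081, 0.4713],
    },
    index=["Alabama", "Alaska", "Arizona"],
)

# With numeric dtypes, describe() reports mean, std, min, and max
summary = sample.describe()
print(summary.loc[["mean", "std", "min", "max"]])

# Ratio between the largest and smallest column standard deviations
spread_ratio = summary.loc["std"].max() / summary.loc["std"].min()
print(f"std ratio: {spread_ratio:.3g}")
```

A large ratio is one rough indication of how differently the features are scaled.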

### Splitting into Train and Test sets

Now that both the features and target have been inspected, it is time to split the data into training and test sets. 

In [None]:
# States held out for testing; every other state is used for training
test_states = ["Utah", "Oregon", "Connecticut"]
train_states = ["Alaska", "Alabama", "Arkansas", "Arizona", "California", "Colorado", "Delaware", "Florida", "Georgia", "Hawaii", "Iowa", "Idaho", "Illinois", "Indiana", "Kansas", "Kentucky", "Louisiana", "Massachusetts", "Maryland", "Maine", "Michigan", "Minnesota", "Missouri", "Mississippi", "Montana", "North Carolina", "North Dakota", "Nebraska", "New Hampshire", "New Jersey", "New Mexico", "Nevada", "New York", "Ohio", "Oklahoma", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Virginia", "Vermont", "Washington", "Wisconsin", "West Virginia", "Wyoming"]

X_train = X.loc[train_states]
X_test = X.loc[test_states]
y_train = y.loc[train_states, ["deaths"]]
y_test = y.loc[test_states, ["deaths"]]

In [None]:
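# Creating StandardScaler instance
# (kept for reference only; this trial fits the model on the unscaled data)
scaler = StandardScaler()

In [None]:
# Fitting StandardScaler
X_scaler = scaler.fit(X_train)

In [None]:
# Scaling data (the scaled arrays are not passed to the model below)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)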

### Fitting the model
By this point, the data has been split into training and test sets. Thus, the model is ready to be fit to the training data.
- This trial avoids using the StandardScaler despite the known risk that features with larger magnitudes may outweigh the other features, ultimately affecting the model and its predictive accuracy.
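
Whether skipping the scaler matters can be checked empirically. The sketch below uses synthetic data and NumPy's least-squares solver as a stand-in for `LinearRegression`; the `fit_predict` helper and all values are assumptions for illustration. For ordinary least squares with an intercept, standardizing the features rescales the coefficients but should leave the predictions unchanged up to floating-point error. This would not hold for regularized models such as ridge or lasso.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Ordinary least squares with an intercept, solved via lstsq."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef

rng = np.random.default_rng(0)
# Synthetic features on very different scales
scales = np.array([1e5, 1e2, 1e-1])
X = rng.normal(size=(50, 3)) * scales + scales * 5
y = X @ np.array([0.004, 2.0, 1000.0]) + rng.normal(scale=10.0, size=50)

# 47 training rows and 3 test rows, mirroring the split sizes above
X_train, X_test = X[:47], X[47:]
y_train = y[:47]

# Standardize using statistics from the training rows only
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
pred_raw = fit_predict(X_train, y_train, X_test)
pred_scaled = fit_predict((X_train - mean) / std, y_train, (X_test - mean) / std)

# Compare predictions from the unscaled and standardized fits
print(np.allclose(pred_raw, pred_scaled, rtol=1e-6))
```

If the same comparison on the real data shows matching predictions, differences between trials would have to come from the data provided rather than from scaling.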

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

### Testing the model
The model has been fitted to the data. It's ready for death predictions. 

In [None]:
# Make a prediction
y_pred = model.predict(X_test)
y_pred

array([[ 3676.62848909],
       [ 5496.58441112],
       [10193.20055879]])

In [None]:
# Showing the predicted outputs
for state, prediction in zip(X_test.index, y_pred[:, 0]):
    print("Predicted deaths for %s= %3d" % (state, prediction))

Predicted deaths for Utah= 3676
Predicted deaths for Oregon= 5496
Predicted deaths for Connecticut= 10193


### Results
The model's results will now be tested for accuracy.

In [None]:
# Setting up the data for analysis
column_names = ['Predicted', 'Actual', 'State']
results = pd.DataFrame(y_pred)
results["test"] = y_test["deaths"].values
results["state"] = ['Utah', 'Oregon', 'Connecticut']
# set_axis(..., inplace=True) was removed in pandas 2.0, so the names are assigned directly
results.columns = column_names

# Setting the State name as the index
y_results = results.set_index('State')
y_results.head()

State,Predicted,Actual
Utah,3676.628489,4989
Oregon,5496.584411,8492
Connecticut,10193.200559,11317


In [None]:
# Accuracy calculations
realVals = y_results["Actual"].astype(int)
predictedVals = y_results["Predicted"]

# Empty lists holding the error data
abs_error_array = []
rel_error_array = []
per_error_array = []

# Loop that calculates the errors for each state
# (iterating over values avoids positional lookups on the state-labelled index)
for actual, predicted in zip(realVals, predictedVals):
    relative = abs((actual - predicted) / actual)
    abs_error_array.append(round(abs(actual - predicted)))
    rel_error_array.append(round(relative, 2))
    per_error_array.append(round(relative * 100))

# Creating Series so that the errors can be added to the main DataFrame
abs_error = pd.Series(abs_error_array)
rel_error = pd.Series(rel_error_array)
per_error = pd.Series(per_error_array)

The metrics that were evaluated to test the model's accuracy are: Absolute Error, Relative Error, and Percent Error.

$\text{Absolute Error} = |V_{A} - V_{P}|$

$\text{Relative Error} = \left|\frac{V_{A} - V_{P}}{V_{A}}\right|$

$\text{Percent Error} = \left|\frac{V_{A} - V_{P}}{V_{A}}\right| \times 100\%$

Nomenclature:
- $V_{A} = \text{Actual/Measured Value}$
- $V_{P} = \text{Predicted/Model Value}$
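
The three formulas can also be written without an explicit loop. The following is a minimal sketch using pandas column arithmetic, with the predicted and actual values copied from the results table shown earlier. The `difference` variable is introduced here for illustration only.

```python
import pandas as pd

# Predicted and actual deaths copied from the results table
results = pd.DataFrame(
    {
        "Predicted": [3676.628489, 5496.584411, 10193.200559],
        "Actual": [4989, 8492, 11317],
    },
    index=pd.Index(["Utah", "Oregon", "Connecticut"], name="State"),
)

# V_A - V_P for every state at once
difference = results["Actual"] - results["Predicted"]

# Vectorized versions of the three formulas
results["Absolute Error"] = difference.abs().round().astype(int)
results["Relative Error"] = (difference / results["Actual"]).abs().round(2)
results["Percent Error"] = (
    (difference / results["Actual"]).abs() * 100
).round().astype(int)
print(results)
```

Operating on whole columns keeps each formula on a single line and avoids indexing the Series by position.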

In [None]:
# Adding the calculations to the results DataFrame
y_results["Absolute Error"] = abs_error.values
y_results["Relative Error"] = rel_error.values
y_results["Percent Error"] = per_error.values
y_results

State,Predicted,Actual,Absolute Error,Relative Error,Percent Error
Utah,3676.628489,4989,1312,0.26,26
Oregon,5496.584411,8492,2995,0.35,35
Connecticut,10193.200559,11317,1124,0.1,10


In [32]:
# Formatting the DataFrame: show the predictions as whole numbers
y_results["Predicted"] = y_results["Predicted"].astype(int).map("{:.0f}".format)
y_results

# Creating the output directory
os.makedirs("/content/drive/MyDrive/Colab Notebooks/Final Challenge/Final Challenge Code/", exist_ok=True)
# Exporting the results into a CSV
checkpoint_path = "/content/drive/MyDrive/Colab Notebooks/Final Challenge/Final Challenge Code/Trail_3_results.csv"
y_results.to_csv(checkpoint_path)

### Conclusion
The model produced average results once again; its predictive accuracy (100% minus the percent error) is roughly 74% for Utah, 65% for Oregon, and 90% for Connecticut. The error metrics show that the error is still quite significant for states like Oregon. The model underestimates the number of deaths for all of the tested states, though not as severely for Utah and Connecticut. Curiosity brings forth the following questions again:
- Will the model always underestimate the actual number of deaths due to COVID-19? If so, can a correction factor be implemented, and will the implementation of such correction factor(s) depend on unique features of each state?
- Can the model's accuracy be improved by increasing the amount of data provided, such as vaccination information?
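
As a first look at the underestimation question, the signed error and the actual-to-predicted ratio can be computed for the three test states. This is only a sketch: the values are copied from the results table, and the `naive_factor` (the mean ratio) is an assumption for illustration, not a validated correction. Three states are too few to draw conclusions, and a factor derived from the test states cannot be fairly evaluated on those same states.

```python
import numpy as np

states = ["Utah", "Oregon", "Connecticut"]
predicted = np.array([3676.628489, 5496.584411, 10193.200559])
actual = np.array([4989, 8492, 11317])

# A negative signed error means the model underestimated
signed_error = predicted - actual
# Factor by which each prediction would need to be multiplied to match
ratio = actual / predicted

for state, err, r in zip(states, signed_error, ratio):
    print(f"{state}: signed error = {err:.0f}, actual/predicted = {r:.2f}")

# A naive single correction factor (assumption: the mean ratio)
naive_factor = ratio.mean()
print(f"naive correction factor = {naive_factor:.2f}")
```

The spread of the ratios across states gives a rough sense of whether a single factor could work or whether state-specific factors would be needed.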

The fate of **Model Trial #4** will be determined in the near future after more analysis. This is because the accuracy for states like Connecticut actually went down, while the accuracy for other states such as Utah and Oregon went up. This happened after adding more data to the model, so the result is somewhat confusing. From here, more analysis of the dataset will be necessary to see what might be swaying the model's predictive abilities. If nothing conclusive is found, it may be possible to explore the use of correction factors if certain states can be grouped together due to similar qualities and accuracies.