# Project 3
## Liam McFall

1) Strategy and Overview

For this project, my strategy is to try model types beyond basic least squares regression. I am intrigued by the potential of tree-based models. I expect a random forest to perform better than the basic regression, which is a low bar to clear since the basic regression I created was essentially meaningless. I also plan on updating the data set using data from the same source that I originally used, so that I have as much data as possible at my disposal. One thing a tree-based model offers that a regular linear model does not provide as readily is a graph of feature importance. I think that this will be an interesting factor to consider.

2) Refining the Model

In [4]:
# Setup
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession\
    .builder\
    .appName("Project_2")\
    .getOrCreate()

display(spark)
print("Success")

Success


In [7]:
# Data loading

from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType

data_schema = [#StructField('_c0', IntegerType(), True),
               StructField('X', IntegerType(), True), 
               StructField('County', StringType(), True), 
               StructField('State', StringType(), True), 
               StructField('Lat', DoubleType(), True), 
               StructField('Long_', DoubleType(), True), 
               StructField('Confirmed', IntegerType(), True), 
               StructField('Deaths', IntegerType(), True), 
               StructField('Active', IntegerType(), True), 
               StructField('Combined_Key', StringType(), True), 
               StructField('State_popDensity', DoubleType(), True)]

final_struc = StructType(fields=data_schema)

covid = spark.read.format("csv")\
    .option("header", "true")\
    .schema(final_struc)\
    .load("/Users/liammcfall/Desktop/usCovid_byCounty.csv")

covid.printSchema()
covid.show(10)

covid.count()

root
 |-- X: integer (nullable = true)
 |-- County: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Lat: double (nullable = true)
 |-- Long_: double (nullable = true)
 |-- Confirmed: integer (nullable = true)
 |-- Deaths: integer (nullable = true)
 |-- Active: integer (nullable = true)
 |-- Combined_Key: string (nullable = true)
 |-- State_popDensity: double (nullable = true)

+---+---------+--------------+-----------+------------+---------+------+------+--------------------+----------------+
|  X|   County|         State|        Lat|       Long_|Confirmed|Deaths|Active|        Combined_Key|State_popDensity|
+---+---------+--------------+-----------+------------+---------+------+------+--------------------+----------------+
|  1|Abbeville|South Carolina|34.22333378|-82.46170658|       33|     0|    33|Abbeville, South ...|        173.3174|
|  2|   Acadia|     Louisiana| 30.2950649|-92.41419698|      140|    10|   130|Acadia, Louisiana...|        107.5175|
|  3| Accom

2885

In [9]:
# Model Set up

from pyspark.ml.feature import VectorAssembler

vectorAssembler = VectorAssembler(inputCols = ['Lat', 'Long_', 'State_popDensity'], 
                                  outputCol = 'features', 
                                  handleInvalid = "skip")
vecCovid = vectorAssembler.transform(covid)
#vecCovid.count()

vecCovid = vecCovid.select(['features', 'Confirmed'])
print("Full Vector Sample")
vecCovid.show(5)

train, test = vecCovid.randomSplit([.75,.25], 7)
print("Training Vector Sample")
train.show(5)
print("Test Vector Sample")
test.show(5)

Full Vector Sample
+--------------------+---------+
|            features|Confirmed|
+--------------------+---------+
|[34.22333378,-82....|       33|
|[30.2950649,-92.4...|      140|
|[37.76707161,-75....|      429|
|[43.4526575,-116....|      717|
|[41.33075609,-94....|        3|
+--------------------+---------+
only showing top 5 rows

Training Vector Sample
+--------------------+---------+
|            features|Confirmed|
+--------------------+---------+
|[20.86399628,-156...|      116|
|[21.45803166,-157...|      405|
|[22.03935037,-159...|       21|
|[25.20904673,-81....|       80|
|[25.6112362,-80.5...|    13371|
+--------------------+---------+
only showing top 5 rows

Test Vector Sample
+--------------------+---------+
|            features|Confirmed|
+--------------------+---------+
|[19.60121157,-155...|       74|
|[26.15184651,-80....|     5553|
|[26.39418217,-98....|      353|
|[26.64676272,-80....|     3480|
|[26.90131002,-81....|      294|
+--------------------+---------+

In [12]:
print("Rerunning project 2 Linear regression with the new data")

from pyspark.ml.regression import LinearRegression

# fitIntercept=False forces the fit through the origin (intercept fixed at 0)
lr = LinearRegression(featuresCol='features', labelCol='Confirmed', fitIntercept=False)
lr_model = lr.fit(train)

print("Evaluation of Training Data")
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))
print("MSE: %f" % lr_model.summary.meanSquaredError)
print("r2: %f" % lr_model.summary.r2)

print("Evaluation of Model using Test Data")
lr_evaluation_summary = lr_model.evaluate(test)
print("MAE: %f" % lr_evaluation_summary.meanAbsoluteError)
print("MSE: %f" % lr_evaluation_summary.meanSquaredError)
print("r2: %f" % lr_evaluation_summary.r2)

Evaluation of Training Data
Coefficients: [48.027931434034855,18.018296559026833,1.6701261863379708]
Intercept: 0.0
MSE: 18908488.878441
r2: 0.030726
Evaluation of Model using Test Data
MAE: 539.824126
MSE: 1036130.131559
r2: 0.061021


In [37]:
print("Random Forest")

from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

rf = RandomForestRegressor(featuresCol = 'features', labelCol = 'Confirmed', seed = 3)
rf_model = rf.fit(train)

# Score the held-out test set (feature importances are displayed at the end of the cell)
predictions = rf_model.transform(test)

evaluator_mae = RegressionEvaluator(
    labelCol="Confirmed", predictionCol="prediction", metricName="mae")
mae = evaluator_mae.evaluate(predictions)

evaluator_mse = RegressionEvaluator(
    labelCol="Confirmed", predictionCol="prediction", metricName="mse")
mse = evaluator_mse.evaluate(predictions)

evaluator_r2 = RegressionEvaluator(
    labelCol="Confirmed", predictionCol="prediction", metricName="r2")
r2 = evaluator_r2.evaluate(predictions)

print("Evaluation of Model using Test Data")
print("MAE: %f" % mae)
print("MSE: %f" % mse)
print("r2: %f" % r2)

print("Feature Importance")
rf_model.featureImportances

Random Forest
Evaluation of Model using Test Data
MAE: 336.517116
MSE: 596402.434308
r2: 0.422709
Feature Importance


SparseVector(3, {0: 0.442, 1: 0.288, 2: 0.27})
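
The indices in this vector follow the order of `inputCols` passed to `VectorAssembler`, so they can be paired with column names to make the output easier to read. Below is a minimal sketch in plain Python; the `importances` dictionary is typed in by hand from the rounded values printed above and is not read from `rf_model`.

```python
# Feature order matches the inputCols passed to VectorAssembler above.
feature_names = ["Lat", "Long_", "State_popDensity"]

# Rounded values copied by hand from the SparseVector output above.
importances = {0: 0.442, 1: 0.288, 2: 0.27}

# Pair each column name with its importance, sorted from most to least important.
ranked = sorted(
    ((feature_names[i], value) for i, value in importances.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, value in ranked:
    print(f"{name:<18}{value:.1%}")
```

The same pairing would feed directly into a bar chart if a plot of feature importance is wanted.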

In [36]:
rf_lm_mae = mae/lr_evaluation_summary.meanAbsoluteError
rf_lm_mse = mse/lr_evaluation_summary.meanSquaredError
rf_lm_r2 = r2/lr_evaluation_summary.r2

print("RF performance vs Linear Model")
print("MAE: %f" % rf_lm_mae)
print("MSE: %f" % rf_lm_mse)
print("r2: %f" % rf_lm_r2)

RF performance vs Linear Model
MAE: 0.623383
MSE: 0.575606
r2: 6.927283
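
The ratios above can also be read as percentage reductions in error. Here is a small sketch in plain Python that redoes the arithmetic; the metric values are copied by hand from the printed test-set outputs, and `percent_reduction` is a helper name introduced only for this illustration.

```python
# Test-set metrics copied by hand from the cell outputs above.
lr_metrics = {"mae": 539.824126, "mse": 1036130.131559, "r2": 0.061021}
rf_metrics = {"mae": 336.517116, "mse": 596402.434308, "r2": 0.422709}


def percent_reduction(baseline, new):
    """Percentage by which `new` is smaller than `baseline`."""
    return (1 - new / baseline) * 100


mae_drop = percent_reduction(lr_metrics["mae"], rf_metrics["mae"])
mse_drop = percent_reduction(lr_metrics["mse"], rf_metrics["mse"])
r2_ratio = rf_metrics["r2"] / lr_metrics["r2"]

print(f"MAE reduction: {mae_drop:.1f}%")
print(f"MSE reduction: {mse_drop:.1f}%")
print(f"r2 ratio: {r2_ratio:.2f}x")
```

A ratio below 1 for MAE and MSE means the random forest has the smaller error, while a ratio above 1 for r2 means it explains more variance.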


3) Verify the Model Metrics

After fitting both a linear regression and a random forest on the same training data and evaluating them on the same test data, it is clear that the random forest is the superior performer. The random forest had a 38% smaller mean absolute error and a 42% smaller mean squared error. Additionally, the random forest's R^2 was 6.9 times that of the linear regression model, explaining more than 42% of the variance in confirmed cases in the test set. Using the random forest, we are also able to look at feature importance. Latitude is shown as the most important feature in this model, with 44% of the importance. Surprisingly, population density is the least important of the 3 features at 27%, but I think this is probably due to the lack of variance in this feature, since we are using state population density rather than county-level density. I am really happy with how much better this model fits the data compared to the basic linear regression model using the same data. I think that makes sense because this data has a time series element, and all of these counties are going to be at different points on their own curves. I wasn't quite sure how to account for this, but the random forest ended up performing reasonably well within that constraint.
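
For reference, the three metrics compared above can be written out by hand. The sketch below uses the textbook definitions (with R^2 taken as 1 minus the residual sum of squares over the total sum of squares) and is not meant to mirror Spark's internals; the function name `regression_metrics` and the four data points are made up for illustration and do not come from the data set.

```python
def regression_metrics(actual, predicted):
    """Return (MAE, MSE, R^2) for two equal-length sequences of numbers."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error

    # R^2 compares the model's squared error with that of always predicting the mean.
    mean_actual = sum(actual) / n
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_residual = sum(e * e for e in errors)
    r2 = 1 - ss_residual / ss_total
    return mae, mse, r2


# Hypothetical case counts for four counties (not taken from the data set).
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

toy_mae, toy_mse, toy_r2 = regression_metrics(actual, predicted)
print(toy_mae, toy_mse, toy_r2)
```

Because MSE squares each error, a few counties with very large case counts can dominate it, which is one reason to report MAE alongside it.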

4) Conclusions

    - A regression uses certain pieces of an observation (i.e., independent variables) to predict a dependent variable. It is a regression when the prediction is a continuous value, whereas a discrete prediction would be a classification. In this case, a random forest regression was used. The algorithm builds a number of individual decision trees, each trained on a sample of the training observations drawn with replacement. This means that, for any given tree, some training observations may not be used at all while others may be used more than once. The prediction for each observation is then the average of the predictions from all of the individual decision trees.
    - An Excel pivot table only aggregates data; it does not predict anything. The purpose of my model is to look at which counties are doing well or poorly relative to how many cases they SHOULD have. The prediction represents the number of cases a county should have given the data, and the difference between the prediction and the confirmed case count is what we are really looking for. Using an Excel pivot table, you would not be able to do this comparison. While a linear model would be quite meaningless in this exercise, as shown by only 6% of the variance being explained by the model, the random forest has an R^2 above .42, which means that this extremely basic model with only 3 variables is able to explain more than 42% of the variance in the data, which is a great starting point.
    - My use case for this analysis did not really change; the model itself just got better when using a random forest compared to a linear regression. The key predictor in the model was latitude, but longitude and state population density were also important variables. I am confident in population density being an important factor in transmission, but I would have preferred to find it at the county level instead of using state population density as a proxy. States that have large urban centers but also large rural areas are going to be thrown off by this lack of specificity. I'm on the fence about how important latitude and longitude are. These are more descriptive factors than anything. I suppose if climate does play a role in transmission then latitude may in fact be important, but the epidemiological research that has come out surrounding COVID has been inconclusive on that aspect. I think a great feature to add would be "days since 1st infection." Adding this feature would incorporate some of the time effects that the model currently lacks, by capturing where each county is along its own curve. In terms of boundary conditions, the model must be constrained to have an intercept of at least 0, because "Confirmed" cannot be negative. It is impossible to have fewer than 0 "Confirmed" cases, so the intercept can't be less than 0 either. Overall, this isn't a completely linear data set, which is probably why a random forest works so much better than a standard linear regression. There is more flexibility afforded by using a random forest as the model, and that is showcased by how much better it performed.
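
The resampling and averaging described in the first conclusion can be sketched in a few lines of plain Python. This is a toy illustration, not the Spark implementation: each "tree" is a depth-1 stump on a single feature, the names `fit_stump`, `fit_bagged_stumps`, and `predict_bagged` are made up for this example, and the data is hypothetical. A real random forest typically also considers only a random subset of features at each split, which is omitted here because the toy data has one feature.

```python
import random


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on one feature: choose the threshold
    with the lowest squared error and predict the mean on each side."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values in the sample are identical
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def fit_bagged_stumps(xs, ys, n_trees=25, seed=3):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # some rows repeat, some are left out
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict_bagged(trees, x):
    """The ensemble prediction is the average of the individual tree predictions."""
    return sum(tree(x) for tree in trees) / len(trees)


# Hypothetical toy data: one feature with a step-shaped target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 12, 11, 13, 50, 52, 49, 51]

trees = fit_bagged_stumps(xs, ys)
low, high = predict_bagged(trees, 2), predict_bagged(trees, 7)
print(low, high)
```

Averaging over many resampled trees is what lets the ensemble follow a step-shaped pattern like this one, which a single straight line cannot fit.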