# Objective
This notebook analyzes data on the spread of COVID-19. Its objective is to demonstrate how one can model the spread of the infection using time-based data on confirmed cases, country, state, population, and air pollution levels (indoor, outdoor, and ozone), to help understand what drives transmission rates.

#Data Lineage

Data Sources:
* Time-based data on Fatalities, Confirmed cases, Country and State: COVID-19 Open Research Dataset (CORD-19), prepared by a coalition of research groups and companies (including Kaggle) pulled together by the White House Office of Science and Technology Policy (OSTP).
* Population data: World Health Organization Population stats
* Pollution Data: World Health Organization Air Pollution stats (total air pollution column is dropped since it is a sum of indoor and outdoor pollution)
######All three data sources are joined on the country value

Feature Engineering:
* Convert Date into a Unix timestamp using the unix_timestamp function
* Fill null values in Province_State with the corresponding Country_Region values using the coalesce function
* Convert Categorical Values in Country and State into numeric values using String Indexer
* Remove any rows with Null Values

Data Modelling:
* Set up Fatalities as the label (dependent variable) and the rest of the columns as features using RFormula
* Set up RFormula as the base pipeline
* Set up the Regression ML Models on the Pipeline
* Select the best model using CrossValidator with RegressionEvaluator and the R2 metric

ML Models Used
* Linear Regression: Hyperparameters (Regularization and Elastic Net)
* Generalized Linear Regression: Families (Gaussian, Poisson)
* Isotonic Regression: Parameters (Isotonic, Antitonic)

Model Evaluation and Result
* Plot and Compare all the models in graph
* Test the best model by applying it on Test Data
* Plot the predictions against the actual values
* Calculate Coefficients and Intercept
* Evaluate Regression Metrics: R2 and RMSE

###Remove any old files

In [10]:
%sh
rm -f /databricks/driver/test*
rm -f /databricks/driver/train*
rm -f /databricks/driver/Population*
rm -f /databricks/driver/death-rates-from-air-pollution*


###Train Data

In [12]:
%sh
wget https://raw.githubusercontent.com/HenryBernreuter/kaggleData/master/train.csv

In [13]:
trainDF = spark.read.csv('file:/databricks/driver/train.csv', inferSchema=True, header=True, mode='DROPMALFORMED')
display(trainDF)

Id,Province_State,Country_Region,Date,ConfirmedCases,Fatalities
1,,Afghanistan,2020-01-22T00:00:00.000+0000,0.0,0.0
2,,Afghanistan,2020-01-23T00:00:00.000+0000,0.0,0.0
3,,Afghanistan,2020-01-24T00:00:00.000+0000,0.0,0.0
4,,Afghanistan,2020-01-25T00:00:00.000+0000,0.0,0.0
5,,Afghanistan,2020-01-26T00:00:00.000+0000,0.0,0.0
6,,Afghanistan,2020-01-27T00:00:00.000+0000,0.0,0.0
7,,Afghanistan,2020-01-28T00:00:00.000+0000,0.0,0.0
8,,Afghanistan,2020-01-29T00:00:00.000+0000,0.0,0.0
9,,Afghanistan,2020-01-30T00:00:00.000+0000,0.0,0.0
10,,Afghanistan,2020-01-31T00:00:00.000+0000,0.0,0.0


###Population Data

In [15]:
%sh
wget https://raw.githubusercontent.com/Mahati-K/DataSets/master/Population.csv


In [16]:
populationDF = spark.read.csv(path='file:///databricks/driver/Population.csv',header='true', inferSchema ='true', sep=',', mode='DROPMALFORMED')
display(populationDF)

Country,Population
Aruba,105845.0
Afghanistan,37172386.0
Angola,30809762.0
Albania,2866376.0
Andorra,77006.0
Arab World,419790588.0
United Arab Emirates,9630959.0
Argentina,44494502.0
Armenia,2951776.0
American Samoa,55465.0


###Air Pollution Data

In [18]:
%sh
wget https://raw.githubusercontent.com/Mahati-K/DataSets/master/death-rates-from-air-pollution.csv

In [19]:
pollutionDF = spark.read.csv(path='file:///databricks/driver/death-rates-from-air-pollution.csv',header='true', inferSchema ='true', sep=',', mode='DROPMALFORMED')
display(pollutionDF)

Entity,Code,Year,"Air pollution (total) (deaths per 100,000)","Indoor air pollution (deaths per 100,000)","Outdoor particulate matter (deaths per 100,000)","Outdoor ozone pollution (deaths per 100,000)"
Afghanistan,AFG,2017,183.9413871,134.9937531,45.73766239,5.810624892
Albania,ALB,2017,40.48112425,18.28075417,20.83739279,1.803883519
Algeria,DZA,2017,43.68351173,0.191766503,41.97623323,2.052063536
American Samoa,ASM,2017,58.93948537,20.31779722,38.62055628,0.001434633
Andean Latin America,,2017,32.77592701,10.68905781,21.47181719,0.802404539
Andorra,AND,2017,15.56947846,0.165664162,13.08853102,2.708944277
Angola,AGO,2017,95.2199092,62.19905799,29.75929262,5.029588394
Antigua and Barbuda,ATG,2017,32.35337088,1.435593983,30.90778066,0.012518973
Argentina,ARG,2017,31.05788284,2.260202746,27.3380539,1.745889964
Armenia,ARM,2017,64.81278003,11.90403664,50.07469349,3.860213646


###Join Population Data to Train Data

In [21]:
populationDF = populationDF.withColumnRenamed('Country', 'Country_Region')

trainDF = trainDF.join(populationDF, ['Country_Region'])
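
The join above is an inner join, so any country whose name is spelled differently in the two tables is dropped without warning. The sketch below illustrates that behaviour in plain Python on a few hand-made rows; the helper names and the "US" row are illustrative assumptions, not part of the notebook's data pipeline.

```python
# Hypothetical miniature tables; the two population figures are copied from the display above.
cases = [
    {"Country_Region": "Afghanistan", "ConfirmedCases": 0.0},
    {"Country_Region": "US", "ConfirmedCases": 1.0},  # assumed spelling with no match below
]
population = {"Afghanistan": 37172386.0, "Albania": 2866376.0}


def inner_join(rows, lookup, key):
    """Keep only rows whose key exists in lookup and attach the looked-up value."""
    return [dict(row, Population=lookup[row[key]]) for row in rows if row[key] in lookup]


def unmatched_keys(rows, lookup, key):
    """Return the key values that an inner join would silently drop."""
    return sorted({row[key] for row in rows} - set(lookup))


joined = inner_join(cases, population, "Country_Region")
missing = unmatched_keys(cases, population, "Country_Region")
print(joined)   # only the Afghanistan row survives
print(missing)  # ['US']
```

In PySpark, an anti join (`how='left_anti'`) run before the inner join should list the same unmatched rows, which is one way to check how much data the join discards.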

In [22]:
display(trainDF)

Country_Region,Id,Province_State,Date,ConfirmedCases,Fatalities,Population
Afghanistan,1,,2020-01-22T00:00:00.000+0000,0.0,0.0,37172386
Afghanistan,2,,2020-01-23T00:00:00.000+0000,0.0,0.0,37172386
Afghanistan,3,,2020-01-24T00:00:00.000+0000,0.0,0.0,37172386
Afghanistan,4,,2020-01-25T00:00:00.000+0000,0.0,0.0,37172386
Afghanistan,5,,2020-01-26T00:00:00.000+0000,0.0,0.0,37172386
Afghanistan,6,,2020-01-27T00:00:00.000+0000,0.0,0.0,37172386
Afghanistan,7,,2020-01-28T00:00:00.000+0000,0.0,0.0,37172386
Afghanistan,8,,2020-01-29T00:00:00.000+0000,0.0,0.0,37172386
Afghanistan,9,,2020-01-30T00:00:00.000+0000,0.0,0.0,37172386
Afghanistan,10,,2020-01-31T00:00:00.000+0000,0.0,0.0,37172386


###Join Pollution Data to Train Data

In [24]:
pollutionDF = pollutionDF.drop('Code', 'Year', 'Air pollution (total) (deaths per 100,000)')
pollutionDF = pollutionDF.withColumnRenamed('Indoor air pollution (deaths per 100,000)', 'IndoorPollution')
pollutionDF = pollutionDF.withColumnRenamed('Outdoor particulate matter (deaths per 100,000)', 'OutdoorPollution')
pollutionDF = pollutionDF.withColumnRenamed('Outdoor ozone pollution (deaths per 100,000)', 'OzonePollution')

pollutionDF = pollutionDF.withColumnRenamed('Entity', 'Country_Region')

trainDF = trainDF.join(pollutionDF, ['Country_Region'])

In [25]:
display(trainDF)

Country_Region,Id,Province_State,Date,ConfirmedCases,Fatalities,Population,IndoorPollution,OutdoorPollution,OzonePollution
Afghanistan,1,,2020-01-22T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,2,,2020-01-23T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,3,,2020-01-24T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,4,,2020-01-25T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,5,,2020-01-26T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,6,,2020-01-27T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,7,,2020-01-28T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,8,,2020-01-29T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,9,,2020-01-30T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,10,,2020-01-31T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892


# Explore, Clean and Visualize data

Explore Confirmed Cases and Fatalities against Time using Melt

The pandas melt function reshapes a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are unpivoted into rows, leaving just two non-identifier columns: variable and value.
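
As a rough illustration of that reshaping, here is a plain-Python sketch that mimics the wide-to-long transformation on two rows. The `melt` helper is a hypothetical stand-in written for this example, not the pandas implementation.

```python
def melt(rows, id_vars, value_vars):
    """Return one output row per (value column, input row) pair."""
    long_rows = []
    for var in value_vars:            # all rows for the first variable, then the next
        for row in rows:
            out = {k: row[k] for k in id_vars}
            out["variable"] = var
            out["value"] = row[var]
            long_rows.append(out)
    return long_rows


wide = [
    {"Date": "2020-01-22", "ConfirmedCases": 0.0, "Fatalities": 0.0},
    {"Date": "2020-01-23", "ConfirmedCases": 0.0, "Fatalities": 0.0},
]
long_rows = melt(wide, id_vars=["Date"], value_vars=["ConfirmedCases", "Fatalities"])
for r in long_rows:
    print(r)
```

Two wide rows with two measured columns become four long rows, each carrying the Date identifier plus a variable name and its value.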

Confirmed Cases

In [30]:
import pandas as pd
trainPD = trainDF.toPandas()
visualization1 = pd.melt(trainPD, id_vars=['Date'], value_vars=['ConfirmedCases'])
display(visualization1)

Date,variable,value
2020-01-22T00:00:00.000+0000,ConfirmedCases,0.0
2020-01-23T00:00:00.000+0000,ConfirmedCases,0.0
2020-01-24T00:00:00.000+0000,ConfirmedCases,0.0
2020-01-25T00:00:00.000+0000,ConfirmedCases,0.0
2020-01-26T00:00:00.000+0000,ConfirmedCases,0.0
2020-01-27T00:00:00.000+0000,ConfirmedCases,0.0
2020-01-28T00:00:00.000+0000,ConfirmedCases,0.0
2020-01-29T00:00:00.000+0000,ConfirmedCases,0.0
2020-01-30T00:00:00.000+0000,ConfirmedCases,0.0
2020-01-31T00:00:00.000+0000,ConfirmedCases,0.0


Fatalities

In [32]:
visualization2 = pd.melt(trainPD, id_vars=['Date'], value_vars=['Fatalities'])
display(visualization2)

Date,variable,value
2020-01-22T00:00:00.000+0000,Fatalities,0.0
2020-01-23T00:00:00.000+0000,Fatalities,0.0
2020-01-24T00:00:00.000+0000,Fatalities,0.0
2020-01-25T00:00:00.000+0000,Fatalities,0.0
2020-01-26T00:00:00.000+0000,Fatalities,0.0
2020-01-27T00:00:00.000+0000,Fatalities,0.0
2020-01-28T00:00:00.000+0000,Fatalities,0.0
2020-01-29T00:00:00.000+0000,Fatalities,0.0
2020-01-30T00:00:00.000+0000,Fatalities,0.0
2020-01-31T00:00:00.000+0000,Fatalities,0.0


###Explore Data Types

Features (Independent Variables)
* Country_Region refers to Country 
* Province_State refers to the State in the country
* Date refers to the date on which the numbers were recorded
* ConfirmedCases refers to the number of people who tested positive for the coronavirus
* Population refers to the country population as a whole
* IndoorPollution, OutdoorPollution and OzonePollution refer to death rates (deaths per 100,000 people) attributed to indoor air pollution, outdoor particulate matter and outdoor ozone, respectively;
refer to https://www.airnow.gov/aqi/aqi-basics/ for background on safe levels of particulate matter

Label (Dependent Variable)
* Fatalities refers to the number of people who died from the virus

In [35]:
display(trainDF)

Country_Region,Id,Province_State,Date,ConfirmedCases,Fatalities,Population,IndoorPollution,OutdoorPollution,OzonePollution
Afghanistan,1,,2020-01-22T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,2,,2020-01-23T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,3,,2020-01-24T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,4,,2020-01-25T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,5,,2020-01-26T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,6,,2020-01-27T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,7,,2020-01-28T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,8,,2020-01-29T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,9,,2020-01-30T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892
Afghanistan,10,,2020-01-31T00:00:00.000+0000,0.0,0.0,37172386,134.9937531,45.73766239,5.810624892


###Explore all the countries in the train data

In [37]:

trainPD = trainDF.toPandas()
countries = trainPD['Country_Region'].unique()
print(f'{len(countries)} countries are in dataset:\n{countries}')

In [38]:
trainDF.summary().show()

###Replace null values of Province_State values with corresponding Country_Region values, using Coalesce function from PySpark SQL functions

In [40]:
from pyspark.sql.functions import coalesce
trainDF = trainDF.withColumn('Province_State', coalesce('Province_State', 'Country_Region'))
trainDF.show()
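
For readers unfamiliar with coalesce, the idea is simply "take the first non-null argument". A minimal plain-Python sketch of that rule follows; the second row is a hypothetical example added for contrast.

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are None."""
    for v in values:
        if v is not None:
            return v
    return None


rows = [
    {"Province_State": None, "Country_Region": "Afghanistan"},
    {"Province_State": "Example State", "Country_Region": "Example Country"},  # hypothetical
]
filled = [
    dict(r, Province_State=coalesce(r["Province_State"], r["Country_Region"]))
    for r in rows
]
print(filled)
```

Rows that already have a state keep it; rows without one fall back to the country name.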

In [41]:
trainDF.summary().show()

###Convert timestamp value to Numeric value, by converting into unix time

In [43]:
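from pyspark.sql.functions import unix_timestamp

# Drop rows that still contain nulls, and the row identifier, which carries no signal
trainDF = trainDF.dropna()
trainDF = trainDF.drop('Id')

# Replace the timestamp with seconds since the Unix epoch
trainDF = trainDF.withColumn('Date', unix_timestamp('Date'))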
trainDF.summary().show()
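
To make the conversion concrete, the sketch below reproduces it for a single date with the Python standard library, assuming the timestamps are in UTC as the `+0000` suffix in the displayed data suggests. The `day_index` helper is a hypothetical alternative encoding, not something the notebook uses.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def to_unix_seconds(iso_date):
    """Seconds since 1970-01-01 00:00:00 UTC for a YYYY-MM-DD date at midnight UTC."""
    dt = datetime.strptime(iso_date, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def day_index(iso_date, first_date="2020-01-22"):
    """Days elapsed since the first date in the data (hypothetical alternative feature)."""
    return (to_unix_seconds(iso_date) - to_unix_seconds(first_date)) // SECONDS_PER_DAY


print(to_unix_seconds("2020-01-22"))  # 1579651200
print(day_index("2020-01-31"))        # 9
```

Unix timestamps are around 1.58e9 while consecutive days differ by only 86,400, so a day index may be an easier scale for a linear model to work with.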

Check the Datatypes of each column using printSchema

In [45]:
trainDF.printSchema()

###Convert Categorical(String) values to Numeric Values using String Indexer

In [47]:
from pyspark.ml.feature import StringIndexer


For Country

In [49]:
indexer = StringIndexer().setInputCol("Country_Region").setOutputCol("Country")
trainDF = indexer.fit(trainDF).transform(trainDF)
trainDF = trainDF.drop("Country_Region")
trainDF.show()


For State

In [51]:
indexer = StringIndexer().setInputCol("Province_State").setOutputCol("State")
trainDF = indexer.fit(trainDF).transform(trainDF)
trainDF = trainDF.drop("Province_State")
trainDF.show()
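
By default StringIndexer assigns indices by descending label frequency, so the most common value becomes 0.0. The plain-Python sketch below mimics that rule on made-up labels; breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


labels = ["B", "A", "B", "C", "B", "A"]  # hypothetical category values
mapping = fit_string_index(labels)
print(mapping)  # {'B': 0.0, 'A': 1.0, 'C': 2.0}
indexed = [mapping[label] for label in labels]
print(indexed)
```

The resulting numbers are identifiers rather than magnitudes. A linear model will still treat them as ordered quantities, which is why one-hot encoding is a common alternative for categorical columns.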

In [52]:
trainDF.printSchema()

####View the relation between each of the columns using Seaborn pairplot

In [54]:
import seaborn as sns
import matplotlib.pyplot as plt

trainPD = trainDF.toPandas()

sns.set(style="white", font_scale=0.5)
# `height` replaces the deprecated `size` argument of pairplot
g = sns.pairplot(trainPD, height=1.5, vars=["Date", "State","Country", "ConfirmedCases", "Fatalities", "Population", "IndoorPollution", "OutdoorPollution", "OzonePollution" ])
g.fig.set_figheight(8)
g.fig.set_figwidth(8)
display(g.fig)

Summary of the Train Data

In [56]:
# Totals on the most recent date in the data
curr_date = trainPD['Date'].max()
latest = trainPD[trainPD['Date'] == curr_date]
world_cum_confirmed = latest['ConfirmedCases'].sum()
world_cum_fatal = latest['Fatalities'].sum()
print('Number of Countries: ', trainPD['Country'].nunique())
# print('End date in train data set: ', curr_date)
print('Number of confirmed cases: ', world_cum_confirmed)
print('Number of fatalities: ', world_cum_fatal)


# Relative frequency of each distinct value, most frequent first
pr_confirm = trainPD['ConfirmedCases'].value_counts(normalize=True)
pr_fatal = trainPD['Fatalities'].value_counts(normalize=True)

# Share of rows whose value differs from the most frequent one (expected to be 0);
# .iloc makes the slice positional regardless of the index type
print(f'Percentage of rows with confirmed cases = {pr_confirm.iloc[1:].sum()*100}%')
print(f'Percentage of rows with fatalities = {pr_fatal.iloc[1:].sum()*100}%')

In [57]:
# trainDF = trainDF.drop('ConfirmedCases')

###Split Train Data and Test Data

In [59]:
train_data,test_data = trainDF.randomSplit([0.7,0.3],24)

#Setting-up Regression models over pipeline

Linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). We consider two hyperparameters for tuning: the regularization parameter (regParam) and the elastic net parameter (elasticNetParam).
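
The two hyperparameters combine into a single penalty term added to the squared-error loss. As documented for Spark ML, with regParam λ and elasticNetParam α the penalty is λ(α‖w‖₁ + (1 − α)/2 · ‖w‖₂²). A small sketch of that formula in plain Python, with a hypothetical weight vector:

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty = reg_param * (alpha * L1 + (1 - alpha) / 2 * squared L2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) * 0.5 * l2_squared)


weights = [3.0, -4.0]  # hypothetical coefficients
print(elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.0))  # pure ridge (L2)
print(elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=1.0))  # pure lasso (L1)
print(elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.8))  # mostly L1
```

Setting elasticNetParam to 0 gives ridge regression and 1 gives lasso; the values 0.1 and 0.8 in the grid sit near each end of that range.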

Generalized linear models (GLMs) are specifications of linear models where the response variable follows some distribution from the exponential family of distributions. Here we consider two distributions in the parameter grid: Gaussian and Poisson.

Isotonic regression or monotonic regression is the technique of fitting a free-form line to a sequence of observations such that the fitted line is non-decreasing (or non-increasing) everywhere, and lies as close to the observations as possible. It has one optional parameter called isotonic which specifies if the isotonic regression is isotonic or antitonic. We consider both for tuning.
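
A standard way to compute such a fit is the pool-adjacent-violators algorithm (PAVA). The sketch below is a minimal equal-weight version in plain Python, assuming the observations are already sorted by the single input feature; it is meant to illustrate the idea, not to reproduce Spark's implementation.

```python
def pava(y, increasing=True):
    """Least-squares monotone fit to y (equal weights) via pool adjacent violators."""
    if not increasing:
        # An antitonic fit is the negated isotonic fit of the negated data
        return [-v for v in pava([-v for v in y], increasing=True)]
    blocks = []  # each block is [sum of values, number of values]
    for value in y:
        blocks.append([value, 1])
        # Merge backwards while the previous block's mean exceeds the last block's mean
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)  # every point in a block gets the block mean
    return fitted


print(pava([1, 3, 2, 4]))                    # [1.0, 2.5, 2.5, 4.0]
print(pava([4, 2, 3, 1], increasing=False))  # [4.0, 2.5, 2.5, 1.0]
```

Note that Spark's IsotonicRegression works on one feature only: when the features column is a vector, it uses the entry selected by `featureIndex` (0 by default).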

In [64]:
from pyspark.ml.regression import LinearRegression, GeneralizedLinearRegression, IsotonicRegression, DecisionTreeRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.feature import RFormula 

import numpy as np

columns = trainDF.columns
columns.remove('Fatalities')
formula = "{} ~ {}".format("Fatalities", " + ".join(columns))
print("Formula : {}".format(formula))
rFormula = RFormula(formula = formula)

pipeline = Pipeline(stages=[])
basePipeline =[rFormula]

lr = LinearRegression(maxIter=10)
pl_lr = basePipeline + [lr]
pg_lr = ParamGridBuilder()\
        .baseOn({pipeline.stages: pl_lr})\
        .addGrid(lr.regParam, [0.1, 0.01])\
        .addGrid(lr.elasticNetParam, [0.1, 0.8])\
        .build()


glr = GeneralizedLinearRegression()
pl_glr = basePipeline + [glr]
pg_glr = ParamGridBuilder()\
          .baseOn({pipeline.stages: pl_glr})\
          .addGrid(glr.family, ['gaussian', 'poisson'])\
          .build()


ir = IsotonicRegression()
pl_ir = basePipeline + [ir]
pg_ir = ParamGridBuilder()\
      .baseOn({pipeline.stages: pl_ir})\
      .addGrid(ir.isotonic, [True, False])\
      .build()


# dt = DecisionTreeRegressor()
# pl_dt = basePipeline + [dt]
# pg_dt = ParamGridBuilder()\
#       .baseOn({pipeline.stages: pl_dt})\
#       .build()

# One grid from the individual grids
paramGrid =  pg_lr + pg_glr + pg_ir


###Run the parameterized pipelines with Crossvalidation

In [66]:
cv = CrossValidator()\
      .setEstimator(pipeline)\
      .setEvaluator(RegressionEvaluator()\
                       .setMetricName("r2"))\
      .setEstimatorParamMaps(paramGrid)\
      .setNumFolds(10)

cvModel = cv.fit(train_data) 
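
This cell can take a while, because every parameter combination is fitted once per fold. A back-of-the-envelope count in plain Python, using the grid values defined above (the dictionary layout is just for this sketch):

```python
from itertools import product


def grid_size(grid):
    """Number of parameter combinations in one model's grid."""
    return len(list(product(*grid.values())))


grids = {
    "LinearRegression": {"regParam": [0.1, 0.01], "elasticNetParam": [0.1, 0.8]},
    "GeneralizedLinearRegression": {"family": ["gaussian", "poisson"]},
    "IsotonicRegression": {"isotonic": [True, False]},
}
num_folds = 10

candidates = sum(grid_size(g) for g in grids.values())
cv_fits = candidates * num_folds
total_fits = cv_fits + 1  # CrossValidator refits the best candidate on the full training data

print(candidates, cv_fits, total_fits)  # 8 80 81
```

Reducing the number of folds or the grid size scales the run time down roughly in proportion.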

####Best Model

In [68]:
cvModel.getEstimatorParamMaps()[ np.argmax(cvModel.avgMetrics) ]

####Worst Model

In [70]:
cvModel.getEstimatorParamMaps()[ np.argmin(cvModel.avgMetrics) ]

#Model Evaluation

Model Measures

In [73]:
import re

def paramGrid_model_name(model):
  # Scalar entries of a param map are the tuned hyperparameter values
  params = [v for v in model.values() if not isinstance(v, list)]
  # The list entry holds the pipeline stages; the last stage is the estimator
  name = [v[-1] for v in model.values() if isinstance(v, list)][0]
  # Keep the leading alphabetic part of the estimator's string form
  name = re.match(r'([a-zA-Z]*)', str(name)).groups()[0]
  return "{}{}".format(name,params)

# Pair each cross-validated metric (cvModel.avgMetrics) with the name and
# parameters of the model that produced it (taken from paramGrid)
model_measures = zip(cvModel.avgMetrics, [paramGrid_model_name(model) for model in paramGrid])
metrics,model_names = zip(*model_measures)

Plot Model Measures

In [75]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.clf() # clear figure
fig = plt.figure( figsize=(5, 5))
plt.style.use('fivethirtyeight')
axis = fig.add_axes([0.1, 0.3, 0.8, 0.6])
# plot the metrics as Y
#plt.plot(range(len(model_names)),metrics)
plt.bar(range(len(model_names)),metrics)
# plot the model name & param as X labels
plt.xticks(range(len(model_names)), model_names, rotation=70, fontsize=6)
plt.yticks(fontsize=6)
#plt.xlabel('model',fontsize=8)
plt.ylabel('R2 (better is greater)',fontsize=8)
plt.title('Model evaluations')
display(fig)

## Run the best model on the test data

In [77]:
predictions = cvModel.transform(test_data)

display(predictions)

Date,ConfirmedCases,Fatalities,Population,IndoorPollution,OutdoorPollution,OzonePollution,Country,State,features,label,prediction
1579651200,0.0,0.0,71625,6.837761311,30.83747323,3.12914e-05,27.0,30.0,"List(1, 8, List(), List(1.5796512E9, 0.0, 71625.0, 6.837761311, 30.83747323, 3.12914E-5, 27.0, 30.0))",0.0,15.687573670849815
1579651200,0.0,0.0,77006,0.165664162,13.08853102,2.708944277,17.0,15.0,"List(1, 8, List(), List(1.5796512E9, 0.0, 77006.0, 0.165664162, 13.08853102, 2.708944277, 17.0, 15.0))",0.0,-4.323598084003606
1579651200,0.0,0.0,96762,3.660678959,38.22685752,0.555509281,76.0,111.0,"List(1, 8, List(), List(1.5796512E9, 0.0, 96762.0, 3.660678959, 38.22685752, 0.555509281, 76.0, 111.0))",0.0,2.038568442629185
1579651200,0.0,0.0,211028,74.22618558,33.31577333,3.860018238,106.0,161.0,"List(1, 8, List(), List(1.5796512E9, 0.0, 211028.0, 74.22618558, 33.31577333, 3.860018238, 106.0, 161.0))",0.0,14.622609728641692
1579651200,0.0,0.0,286641,0.042679749,34.91535429,0.140741811,89.0,130.0,"List(1, 8, List(), List(1.5796512E9, 0.0, 286641.0, 0.042679749, 34.91535429, 0.140741811, 89.0, 130.0))",0.0,-6.952162849580418
1579651200,0.0,0.0,575991,11.35105155,36.92001981,0.582342985,67.0,99.0,"List(1, 8, List(), List(1.5796512E9, 0.0, 575991.0, 11.35105155, 36.92001981, 0.582342985, 67.0, 99.0))",0.0,8.87514140676285
1579651200,0.0,0.0,622227,17.17675752,29.61053759,1.009761408,120.0,180.0,"List(1, 8, List(), List(1.5796512E9, 0.0, 622227.0, 17.17675752, 29.61053759, 1.009761408, 120.0, 180.0))",0.0,-16.12580179224824
1579651200,0.0,0.0,754394,24.59026121,40.63592738,11.32963327,24.0,27.0,"List(1, 8, List(), List(1.5796512E9, 0.0, 754394.0, 24.59026121, 40.63592738, 11.32963327, 24.0, 27.0))",0.0,7.671956147834862
1579651200,0.0,0.0,883483,57.96317382,42.52815292,0.128667006,143.0,216.0,"List(1, 8, List(), List(1.5796512E9, 0.0, 883483.0, 57.96317382, 42.52815292, 0.128667006, 143.0, 216.0))",0.0,7.645240195159204
1579651200,0.0,0.0,1265303,3.426270277,44.60339225,0.792785212,58.0,76.0,"List(1, 8, List(), List(1.5796512E9, 0.0, 1265303.0, 3.426270277, 44.60339225, 0.792785212, 58.0, 76.0))",0.0,10.527436837403002


## Visualize the results of the best model predictions

In [79]:
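labeledPredictions = predictions.select("label", "prediction")
labeledPredictions.show(5)
# Use a separate name so the predictions DataFrame is not overwritten
y_test, y_pred = zip(*labeledPredictions.collect())
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred)
ax.set_xlabel('Actual fatalities')
ax.set_ylabel('Predicted fatalities')
display(fig)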

## Regression Metrics

In [81]:
# Summarize the model over the training set and print out some metrics
print("Best pipeline: ", cvModel.bestModel.stages)

model = cvModel.bestModel.stages[1]
print("Best model: ", model)
print("\nRFormula: ", cvModel.bestModel.stages[0])


print("\nIntercept", model.intercept)
print("Coefficients", model.coefficients)

trainingSummary = model.summary

print("\n r2: %f" % trainingSummary.r2)
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)

# print("\nGeneralizedLinearRegression Metrics:")
# print("\nStandardError", trainingSummary.coefficientStandardErrors)
# print("\nAIC", trainingSummary.aic)
# print("\nDispersion", trainingSummary.dispersion)
# print("\nDeviance", trainingSummary.deviance)
# print("\nNull Deviance", trainingSummary.nullDeviance)
# print("\nresidualDegreeOfFreedomNull", trainingSummary.residualDegreeOfFreedomNull)

# https://www.theanalysisfactor.com/r-glm-model-fit/
# https://spark.apache.org/docs/latest/ml-classification-regression.html#generalized-linear-regression
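
The r2 and RMSE printed above come from the model's training summary. The same two metrics could also be computed on held-out (label, prediction) pairs, such as those collected from the test predictions. A minimal sketch of the formulas in plain Python, using hypothetical toy numbers rather than values from this notebook:

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


labels = [0.0, 2.0, 4.0, 6.0]       # hypothetical actual values
predictions = [1.0, 2.0, 3.0, 6.0]  # hypothetical predicted values
print(rmse(labels, predictions))
print(r2(labels, predictions))
```

Comparing test-set values against the training summary gives a quick check for overfitting.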


* The coefficients of the best linear model are largest for Ozone pollution and Country, followed by Indoor and Outdoor pollution, while Population has the smallest coefficient. Because the features are on very different scales, these magnitudes are only a rough guide to each feature's influence on the predicted fatalities.
* The RMSE and r2 values, along with the visualizations for the best model selected using ParamGrid and CrossValidation, suggest that coronavirus fatality growth was approximately linear over the period in which the data was collected.

#Tableau Visualizations

https://public.tableau.com/profile/dmoti1#!/vizhome/Covid19TermProject/ConfirmedCases_Maps