#                                     Food-disease-outbreaks prediction

## About : 
### Every year, many people in the United States get sick from food eaten at various locations across different states. Most foodborne illnesses are isolated cases and are not considered outbreaks, but outbreaks at a larger scale can be life-threatening. The problem at hand is the set of foodborne outbreaks recorded in the past and the factors behind them. We want to find a way to predict the major factors that contribute to foodborne diseases.

## Features :
### 1.Supervised Learning System - It uses key features such as State, Month, Year, Food Consumed and Location of Food Consumption to predict the likely number of illnesses. This can help the Centers for Disease Control and Prevention take necessary measures ahead of time to mitigate the risks.

### 2.Type of Predictions Made

   #### Prediction of Illnesses Count based on the features mentioned above
   #### Classification of Diseases as High Scaled or Low Scaled
   #### Time Series Analysis to understand the effect of Month and Year on outbreak impact

### 3.Algorithms Used — There are multiple algorithms involved in this system to tackle the outbreaks.

   #### Regression Models - Linear Regression, Linear Regression with Regularization and Cross Validation, Decision Tree Regression, Random Forest Regression
   #### Classification Model - Logistic Regression
   #### Time Series Analysis - Base Time Series Model, Rolling Average Model

### 4.Data Cleansing and Normalization - The data is messy, so we performed data cleaning using pandas to keep data loss minimal and to bring the values we predict onto a more even scale. Examples of operations: removing null values, replacing null values with mean values, removing non-impactful features, and taking the logarithm of the prediction values to bring them onto a normalized scale.
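
#### As a rough illustration of the cleaning operations listed above, the following sketch applies them to a small made-up table. The column names mirror the dataset, but the values and the order of the steps are assumptions, not the project's actual code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the outbreaks data; the values are made up for illustration.
toy = pd.DataFrame({
    "State": ["Utah", "Ohio", None, "Texas"],
    "Illnesses": [4.0, 120.0, 10.0, np.nan],
    "Ingredient": [None, None, None, "Egg"],
})

# Remove a feature that is mostly null and therefore not impactful.
toy = toy.drop(columns=["Ingredient"])
# Replace null numeric values with the column mean.
toy["Illnesses"] = toy["Illnesses"].fillna(toy["Illnesses"].mean())
# Take the logarithm of the prediction target to compress its scale.
toy["Illnesses_log"] = np.log(toy["Illnesses"])
# Remove rows that still contain nulls (here, the row with no State).
toy = toy.dropna()
```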

### 5.Data Visualization to represent findings - In many places, the data is plotted using the Matplotlib, Seaborn and Pandas libraries to help stakeholders understand the findings in the easiest and most user-friendly way.

### 6.String Indexing, One Hot Encoding - Categorical columns are string indexed and then one-hot encoded before modelling

### 7.Evaluation criteria 
   #### Regression models - RMSE
   #### Classification models - AUC ROC
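
#### For readers unfamiliar with the two metrics above, here is a minimal pure-Python sketch of both. The function names are hypothetical; the notebook itself relies on Spark's evaluators rather than on code like this.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def auc_roc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

#### A lower RMSE is better, while an AUC closer to 1 is better and an AUC of 0.5 corresponds to random ranking.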

### Performing Machine Learning on a Food-based illness dataset available from CDC and FDA to predict illness counts based on State, Month, Location and type of Food Consumed

### The dataset used for the project is Foodborne Disease Outbreaks, 1998-2015.

### Link to the dataset - https://www.kaggle.com/cdc/foodborne-diseases

## Prerequisites 
### 1. Databricks community edition (alternatively Jupyter notebook installed on your local machine)
### 2. Python version 3.6.1
### 3. Spark version 2.0
### 4. Libraries - pandas, numpy, pyspark, matplotlib, seaborn
### 5. Python package manager 'PIP'

## Contributors:

### 1. Anmol Handa
### 2. Anuj Jain
### 3. Justin Thierry

## Special Thanks:
### 1. Dr. Daniel Acuna
### 2. Mr. Tong Zeng

## Getting setup:
### 1. Local machine
###  - Installing Python version 3.6.1 on local machine
###  - Installing 'PIP', the package manager for python
###  - Installing the libraries pandas, numpy, matplotlib, pyspark, seaborn ( ex. pip install pandas ) from cmd
###  - Installing jupyter notebook ( cmd - pip install jupyter ) from cmd
###  - Launching jupyter notebook ( cmd - jupyter notebook ) from cmd
###  - Putting the csv/dataset in the working directory

### 2. Databricks
###  - Picking Python version 3.6.1
###  - Creating a spark cluster
###  - Putting the csv/dataset in the HDFS directory of databricks

## For further reference to the code, following is the github link
## https://github.com/handaanmol/Food-disease-outbreaks

## Beginning of the code

#### The following python libraries are imported to load various utility, mathematical, visualization, pipelines and machine learning packages in python.

In [1]:
#linear algebra and mathematical packages
import numpy as np 
import pandas as pd 

#visualization packages
import matplotlib.pyplot as plt
import seaborn as sns

#spark 2 related packages like SQL, machine learning
from pyspark.sql.types import *
import pyspark.sql.functions as F
from pyspark.sql.functions import col, udf

from pyspark.ml.stat import Correlation

from pyspark.ml.classification import LogisticRegression

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel

from pyspark.ml.feature import Bucketizer, StringIndexer, OneHotEncoder, StandardScaler, VectorAssembler

from pyspark.ml import Pipeline, PipelineModel

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.ml.evaluation import RegressionEvaluator
%matplotlib inline



#### The following code imports the dataset from our machine into the Databricks cluster and then reads it on the cluster.

#### To load the dataset on a local machine, change the path to the local path.

In [2]:
outbreaks = pd.read_csv("/dbfs/FileStore/tables/outbreaks.csv")
outbreaks.head(10)
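
#### One way to avoid editing the path by hand is to try the Databricks location first and fall back to a local file. This is only a sketch: `pick_dataset_path` is a hypothetical helper, and the local file name `outbreaks.csv` is an assumption about where the csv was placed.

```python
import os

# Assumption: the local path is a placeholder for wherever the csv was saved.
DATABRICKS_PATH = "/dbfs/FileStore/tables/outbreaks.csv"
LOCAL_PATH = "outbreaks.csv"

def pick_dataset_path(candidates=(DATABRICKS_PATH, LOCAL_PATH)):
    """Return the first candidate path that exists, or None if none does."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return None
```

#### The result can then be passed to `pd.read_csv` once it has been checked for `None`.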


#### Checking the null values for all columns in the comma-separated values dataset.

#### We found that null values are high in the columns named 'Ingredient' and 'Serotype/Genotype'

In [20]:
outbreaks.isnull().sum()

#### There are 19119 records in the dataset, and the columns 'Ingredient' and 'Serotype/Genotype' each have a null count close to that number, so we will not consider these columns in our analysis. The following heatmap shows the distribution of null values in the dataset; null values are shown in yellow.

In [22]:
plt.cla()
sns.heatmap(outbreaks.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.tight_layout()
display()
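
#### The same decision can be expressed numerically by computing the fraction of nulls per column. The sketch below uses a made-up table, and the 0.5 cut-off and the helper name `mostly_null_columns` are assumptions chosen for illustration.

```python
import pandas as pd

def mostly_null_columns(frame, threshold=0.5):
    """Return names of columns whose null fraction exceeds the threshold."""
    null_fraction = frame.isnull().mean()
    return list(null_fraction[null_fraction > threshold].index)

# Toy data: 'Food' is 25% null, 'Ingredient' is 75% null.
toy = pd.DataFrame({
    "Food": ["Salad", None, "Eggs", "Rice"],
    "Ingredient": [None, None, None, "Egg"],
})
sparse_columns = mostly_null_columns(toy)
```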

#### Renaming the column "Serotype/Genotype" to "Serotype"

In [24]:
outbreaks =outbreaks.rename(index=str, columns={"Serotype/Genotype": "Serotype"})

#### Plotting the distribution of the values under the column "Species"

In [26]:
plt.subplots(figsize=(20,15))
sns.countplot(x='Species', data=outbreaks, order= outbreaks.Species.value_counts().iloc[2:8].index)
display()

#### Plotting the number of outbreak records for each state in the U.S.A

In [28]:
plt.cla()
# aggfunc='count' counts the records per state, not the total number of illnesses
df2 = pd.pivot_table(outbreaks, index='State', values='Illnesses', aggfunc='count')
ax = df2.plot(kind='bar', color='steelblue',figsize=(25,10))
plt.title('Foodborne Outbreak Records By State')
plt.ylabel('Number of Records')
display()

#### Showing the distribution of food types in the data. As we see below, these are some of the most popular food types consumed.

In [30]:
outbreaks.Food.value_counts()

#### Dropping columns from the dataset that we will not use for analysis. The columns are "Ingredient","Serotype","Species","Status","Fatalities"

In [32]:
outbreaks.drop(['Ingredient', 'Serotype', 'Species', 'Status', 'Fatalities'], axis=1, inplace=True)
outbreaks.head()

#### Plotting the new reduced dataset to see the frequency of null values

In [34]:
plt.cla()
sns.heatmap(outbreaks.isnull(), yticklabels=False, cbar=False, cmap='viridis')
display()

#### Checking null values in the remaining columns, including Location

In [36]:
outbreaks.isnull().sum()

#### Checking count of values in dataset

In [38]:
outbreaks.count()

#### Plotting how illnesses are distributed over the dataset

In [40]:
plt.cla()
sns.distplot(outbreaks.Illnesses, bins=10, color='red')
plt.title('Distribution of Illnesses in Training Set')
display()

#### Plotting the Distribution of FoodBorne Illnesses by State. We found that California and Illinois are the states with the most illnesses.

In [42]:
plt.cla()
df2 = pd.pivot_table(outbreaks, index='State', values='Illnesses', aggfunc='sum')
ax = df2.plot(kind='bar', color='steelblue',figsize=(25,10))
plt.title('Foodborne Illnesses Cases By State')
plt.ylabel('Illness Cases')
display()

#### Plotting the Distribution of FoodBorne Illnesses by Year. We found that there is a declining trend with some crests and troughs.

In [44]:
plt.cla()
df2 = pd.pivot_table(outbreaks, index='Year', values='Illnesses', aggfunc='sum')
ax = df2.plot(kind='bar', color='steelblue',figsize=(25,10))
plt.title('Foodborne Illnesses Cases By Year')
plt.ylabel('Illness Cases')
display()

#### Plotting the Distribution of FoodBorne Illnesses by Months. There is no interesting trend found.

In [46]:
#Distribution of illness by Months
plt.cla()
df2 = pd.pivot_table(outbreaks, index='Month', values='Illnesses', aggfunc='mean')
ax = df2.plot(kind='bar', color='steelblue',figsize=(25,10))
plt.title('Foodborne Illnesses Cases By Month')
plt.ylabel('Illness Cases')
display()

#### Looking at the number of top food items

In [48]:
outbreaks.Food.value_counts()

#### Replacing null values of food column with "Unspecified", and replacing null values in Location with "Unknown" to prevent data loss

In [50]:
outbreaks.Food.fillna("Unspecified", inplace=True)
outbreaks.Location.fillna("Unknown", inplace=True)
outbreaks.Location.value_counts()


#### Filling Hospitalizations Null values with 0

In [52]:
outbreaks.Hospitalizations.fillna(0, inplace=True)

#### Creating a normalized column: Hospitalizations as a percentage of Illnesses

In [54]:
outbreaks['normalized_hospitalizations'] = outbreaks.apply(lambda row: round((row.Hospitalizations/row.Illnesses)*100), axis=1)

#### Checking the distribution of Normalized Hospitalizations over the dataset

In [56]:
plt.cla()
sns.distplot(outbreaks.normalized_hospitalizations, bins=10, color='red')
plt.title('Distribution of Normalized Hospitalizations')
display()

In [57]:
outbreaks.head(5)

#### Plotting the distribution of illnesses in the dataset to visualize its right skewness (a long tail of large outbreaks)

In [59]:
plt.cla()
sns.distplot(outbreaks.Illnesses, bins=10, color='red')
plt.title('Distribution of Illnesses')
display()

#### Plotting the distribution of illnesses in the dataset after a log10 transform

In [61]:
plt.cla()
sns.distplot(np.log10(outbreaks.Illnesses), bins=10, color='red')
plt.title('Distribution of Illnesses standardized by log scale')
display()

#### Adding a new column, Illnesses_log, holding the natural log of Illnesses (the plot above uses log10, which differs only by a constant factor)

In [63]:
outbreaks['Illnesses_log'] = np.log(outbreaks.Illnesses)

In [64]:
outbreaks.head()
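
#### To make the effect of the log transform concrete, the sketch below measures skewness before and after taking logs. The counts are made up to resemble outbreak sizes, and `skewness` is a hypothetical helper implementing the usual third standardized moment.

```python
import math

def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return third / variance ** 1.5

# Made-up counts with a long right tail, like outbreak sizes.
counts = [2, 3, 3, 4, 5, 6, 8, 12, 40, 300]
raw_skew = skewness(counts)
log_skew = skewness([math.log(c) for c in counts])
log10_skew = skewness([math.log10(c) for c in counts])
```

#### On this toy data the log transform lowers the skewness, and the natural log and log10 give the same skewness because they differ only by a constant scale factor.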

#### Loading libraries to make use of packages for dataframe manipulation, running regression and classification models

In [66]:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql import SQLContext
s1 = SQLContext(sc)

from pyspark.sql import functions as fn
# Functionality for computing features
from pyspark.ml import feature
# Functionality for regression
from pyspark.ml import regression
# Functionality for classification
from pyspark.ml import classification
# Object for creating sequences of transformations
from pyspark.ml import Pipeline

In [67]:
df = spark.createDataFrame(outbreaks)

In [68]:
df.dtypes

#### Deciding the split ratio of the data and splitting it into training, validation and testing sets (60/30/10) using the randomSplit function

In [70]:
training_df, validation_df, testing_df = df.randomSplit([0.6, 0.3, 0.1])
display(training_df)
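
#### The split above is random and unseeded, so the sets (and every RMSE reported later) change from run to run. The pure-Python sketch below illustrates the idea of a weighted random split with a fixed seed; it is an illustration of the concept, not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    total = float(sum(weights))
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating point drift in the last bound
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, upper in enumerate(bounds):
            if draw < upper:
                splits[index].append(row)
                break
    return splits

train, validation, test = random_split(range(1000), [0.6, 0.3, 0.1], seed=42)
```

#### Because each row is assigned independently, the split sizes are only approximately 60/30/10. In PySpark, passing a seed as in `df.randomSplit([0.6, 0.3, 0.1], seed=42)` makes the split repeatable.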

#### Base model using only "Year" as the feature

In [72]:
#Base Model
model1 = Pipeline(stages=[
  feature.VectorAssembler(inputCols=['Year'], outputCol='features'),
  regression.LinearRegression(featuresCol='features', labelCol='Illnesses_log')  
]).fit(training_df)

In [73]:
rmse = fn.sqrt(fn.avg((fn.col('Illnesses_log') - fn.col('prediction'))**2))
model1.transform(validation_df).select(rmse).show()

In [74]:
model1.transform(testing_df).select(rmse).show(100)

In [75]:
model1.transform(testing_df).select((fn.col('Illnesses_log') - fn.col('prediction'))**2, fn.col('Illnesses_log'), fn.col('prediction')).show(5000)

#### We found that the RMSE of the base model is 1.16207846173807
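
#### Since the label is the natural log of Illnesses, this RMSE is in log units. A loose way to read it, sketched below, is to exponentiate it into a multiplicative factor on the illness count; this interpretation is an approximation added here, not a result from the notebook.

```python
import math

# RMSE of the base model on Illnesses_log (natural log), as reported above.
rmse_log = 1.16207846173807

# An error of rmse_log in log space corresponds to a multiplicative error
# of exp(rmse_log) on the original illness count.
typical_factor = math.exp(rmse_log)
```

#### In other words, a typical prediction of the base model is off by a factor of roughly three in either direction.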

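#### Linear Regression Model 2 with features Year, State and Month. StringIndexer converts Month and State to numeric indexes, and VectorAssembler combines them with Year into one feature vector (no one-hot encoding is applied at this stage)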

In [78]:
#Model 2 - with year, State and Month
model2 = Pipeline(stages=[feature.VectorAssembler(inputCols=['Year'],
                                        outputCol='features'),
                          feature.StringIndexer(inputCol='Month', outputCol='encoded_Month'),
                          feature.VectorAssembler(inputCols=['features', 'encoded_Month'], outputCol='semi_final_features'),
                          feature.StringIndexer(inputCol='State', outputCol='encoded_State'),
                          feature.VectorAssembler(inputCols=['semi_final_features', 'encoded_State'], outputCol='final_features'),
                 regression.LinearRegression(featuresCol='final_features', labelCol='Illnesses_log')]).fit(training_df)

In [79]:
model2.transform(validation_df).select(rmse).show(100)

In [80]:
model2.transform(testing_df).select(rmse).show(100)

In [81]:
model2.transform(testing_df).select((fn.col('Illnesses_log') - fn.col('prediction'))**2, fn.col('Illnesses_log'), fn.col('prediction')).show(5000)

#### We found that the RMSE of model 2 is 1.13077887009722

In [83]:
#Testing String Indexer
#indexer_model = StringIndexer(inputCol='Month', outputCol="{0}_indexed".format('Month')).fit(training_df)
#indexed_df = indexer_model.transform(training_df)
#indexed_df.show(5)

#### Linear Regression Model 3 with features State and Month only

In [85]:
#Model 3 
#With state, Month only
model3 = Pipeline(stages=[feature.StringIndexer(inputCol='Month', outputCol='encoded_Month'),
                          feature.VectorAssembler(inputCols=['encoded_Month'], outputCol='semi_final_features'),
                          feature.StringIndexer(inputCol='State', outputCol='encoded_State'),
                          feature.VectorAssembler(inputCols=['semi_final_features', 'encoded_State'], outputCol='final_features'),
                 regression.LinearRegression(labelCol='Illnesses_log', featuresCol='final_features')]).fit(training_df)

In [86]:
model3.transform(validation_df).select(rmse).show()

In [87]:
model3.transform(testing_df).select(rmse).show()

#### We found that the RMSE of model 3 is 1.1307483108756038

#### Model 3 is marginally better than Model 2 (the RMSE differs only in the fifth decimal place), and Model 2 is better than the base model.

In [90]:
#Model 3> Model 2 > Model 1

In [91]:
#Adding String Indexer
indexer_model = StringIndexer(inputCol='Month', outputCol="{0}_indexed".format('Month')).fit(df)
indexed_df = indexer_model.transform(df)
indexer_model2 = StringIndexer(inputCol='State', outputCol="{0}_indexed".format('State')).fit(indexed_df)
indexed_df = indexer_model2.transform(indexed_df)
indexed_df.toPandas()

#### Creating a new dataframe which has states and month values indexed.

In [93]:
#We have a new dataframe indexed with states and month
indexed_df.head(5)

In [94]:
outbreaks_new= outbreaks.copy()
outbreaks_new.head(5)

#### Dropping the column "Hospitalizations"

In [96]:
outbreaks_new.drop(['Hospitalizations'], axis=1, inplace=True)
outbreaks_new.head(1)

In [97]:
outbreaks_new.Location.value_counts()

In [98]:
outbreaks_new.Food.value_counts()

#### Keeping only the first entry of the "Location" column when it lists several locations, and doing the same for "Food".

In [100]:
#Keeping the first listed value for Location and Food
outbreaks_new['Location_modified']=outbreaks_new['Location'].str.split(';').str[0]
outbreaks_new['Food_modified']=outbreaks_new['Food'].str.split(',').str[0]
outbreaks_new['Food_modified_new']=outbreaks_new['Food_modified'].str.split(';').str[0]
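
#### The effect of the splits above on a single value can be sketched in plain Python. `first_listed` is a hypothetical helper and the sample strings are made up; note that trimming whitespace is an addition here that the notebook code does not perform.

```python
def first_listed(value, separators=(";", ",")):
    """Return the text before the first separator, with whitespace trimmed."""
    for separator in separators:
        value = value.split(separator)[0]
    return value.strip()

location = first_listed("Restaurant; Private Home/Residence")
food = first_listed("Chicken, Grilled; Rice")
```

#### Keeping only the first entry reduces the number of distinct categories, at the cost of discarding the other listed locations and foods.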

In [101]:
outbreaks_new.Food_modified_new.value_counts()

In [102]:
list(outbreaks_new.columns)

In [103]:
df = spark.createDataFrame(outbreaks_new)
df.show(50)

#### Performing StringIndexing and One Hot Encoding on the newly modified columns

In [105]:
categorical_columns = ["Year","Month","State", "Location_modified", "Food_modified_new"]
string_indexer_models = []
one_hot_encoders = []
for col_name in categorical_columns:
    # StringIndexer maps each category to a numeric index; OneHotEncoder maps that index column to a column of binary vectors
    string_indexer_model = StringIndexer(inputCol=col_name, outputCol="{0}_indexed".format(col_name)).fit(df)
    df = string_indexer_model.transform(df)
    string_indexer_models.append(string_indexer_model)
    
    one_hot_encoder = OneHotEncoder(inputCol="{0}_indexed".format(col_name), outputCol="{0}_encoded".format(col_name), dropLast=False)
    df = one_hot_encoder.transform(df)
    
    one_hot_encoders.append(one_hot_encoder)
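
#### The difference between an index and a one-hot vector can be sketched in plain Python. This is an illustration of the idea only: the helper names are hypothetical, and breaking frequency ties alphabetically is an assumption rather than documented Spark 2.0 behaviour.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically here; that tie rule is an assumption.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

def one_hot(index, size):
    """Return a binary vector with a single 1 at the given index."""
    return [1 if position == index else 0 for position in range(size)]

months = ["June", "May", "June", "July", "June", "May"]
mapping = fit_string_index(months)
encoded = [one_hot(mapping[m], len(mapping)) for m in months]
```

#### A bare index implies an ordering between categories (here June < May < July) that a linear model would treat as numeric, whereas the one-hot vector gives each category its own coefficient.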

In [106]:
display(df)

#### Displaying the correlation matrix between the indexed features and Illnesses_log. We found that Location has a very high correlation with Illnesses_log

In [108]:
#Correlation between all features and Illnesses_log
corr_columns = ["Year","Month_indexed","State_indexed", "Location_modified_indexed", "Food_modified_new_indexed", "Illnesses_log"]
corr_df=df.select(corr_columns).toPandas()
plt.cla()
sns.heatmap(corr_df.corr(),annot=True)
display()

In [109]:
training_df, validation_df, testing_df = df.randomSplit([0.6, 0.3, 0.1])
#display(testing_df)
testing_df.columns

In [110]:
df.printSchema()

#### Linear Regression Model 4 with features State_encoded and Location_modified_encoded

In [112]:
model4 = Pipeline(stages=[
  feature.VectorAssembler(inputCols=['State_encoded', 'Location_modified_encoded'],outputCol='features'),
  regression.LinearRegression(featuresCol='features', labelCol='Illnesses_log')]).fit(training_df)

In [113]:
model4.transform(validation_df).select(rmse).show()

In [114]:
model4.transform(testing_df).select(rmse).show()

#### We found that the RMSE of model 4 is 0.9771288451129172

#### Linear Regression model 5 with features "State_encoded", "Location_modified_encoded" and "Food_modified_new_encoded"

In [117]:
model5 = Pipeline(stages=[
  feature.VectorAssembler(inputCols=['State_encoded', 'Location_modified_encoded', 'Food_modified_new_encoded'],outputCol='features'),
  regression.LinearRegression(featuresCol='features', labelCol='Illnesses_log',maxIter=5, regParam=0.0, elasticNetParam=0.0)]).fit(training_df)


In [118]:
model5.transform(training_df).select(rmse).show()

In [119]:
model5.transform(validation_df).select(rmse).show()

In [120]:
model5.transform(testing_df).select(rmse).show()

In [121]:
model5.transform(testing_df).select((fn.col('Illnesses_log') - fn.col('prediction'))**2, fn.col('Illnesses_log'), fn.col('prediction')).show(500)

#### We found that the RMSE of model 5 is 0.9561234226307269

#### Linear Regression model 6 with features "Month_encoded", "Year_encoded", "State_encoded", "Location_modified_encoded" and "Food_modified_new_encoded"

In [124]:
#Model 6 with all the features - Year, Month, State, Food and Location
model6 = Pipeline(stages=[
  feature.VectorAssembler(inputCols=['Month_encoded','Year_encoded','State_encoded', 'Location_modified_encoded', 'Food_modified_new_encoded'],outputCol='features'),
  regression.LinearRegression(featuresCol='features', labelCol='Illnesses_log',maxIter=5, regParam=0.00, elasticNetParam=0.0)]).fit(training_df)


In [125]:
model6.transform(training_df).select(rmse).show()

In [126]:
model6.transform(validation_df).select(rmse).show()

In [127]:
model6.transform(testing_df).select(rmse).show()

#### We found that the RMSE of model 6 is 0.9522669914148608

#### Linear Regression model 7 with features "Month_indexed", "Year_indexed", "State_indexed", "Location_modified_indexed" and "Food_modified_new_indexed"

In [130]:
#Model 6 modification with indexed values instead of one-hot encoded values
model7 = Pipeline(stages=[
  feature.VectorAssembler(inputCols=['Month_indexed','Year_indexed','State_indexed', 'Location_modified_indexed', 'Food_modified_new_indexed'],outputCol='features'),
  regression.LinearRegression(featuresCol='features', labelCol='Illnesses_log',maxIter=5, regParam=0.0, elasticNetParam=0.0)]).fit(training_df)


In [131]:
display(testing_df)


In [132]:
model7.transform(training_df).select(rmse).show()

In [133]:
model7.transform(validation_df).select(rmse).show()

In [134]:
model7.transform(testing_df).select(rmse).show()

#### The RMSE value of model 7 is 1.0759126056961936

#### After linear modelling, we found that using indexes instead of one-hot encoding is not a good idea: model 7 has a higher RMSE (1.0759) than model 6 (0.9523)

In [137]:
import pyspark.ml.tuning as tune
grid = tune.ParamGridBuilder()

#### We also found that model 6 is the best so far, so we will pick it for regularization and cross validation

#### Performing cross validation with elastic net regularization on the best Linear model - Model 6

In [140]:
reg = regression.LinearRegression(labelCol = 'Illnesses_log', featuresCol = 'features', maxIter=5)

In [141]:
grid = grid.addGrid(reg.elasticNetParam, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0])

In [142]:
grid = grid.addGrid(reg.regParam, np.arange(0,.1,.01))

In [143]:
np.arange(0,0.1,0.01)
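
#### The two grid parameters control the size and the mix of the regularization penalty. As a sketch of the standard elastic net form, with regParam as the overall strength and elasticNetParam as the L1 share, the penalty added to the loss can be written as below; `elastic_net_penalty` is a hypothetical helper, not a Spark function.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty = regParam * (alpha * L1 + (1 - alpha) / 2 * L2 squared)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1
                        + (1.0 - elastic_net_param) / 2.0 * l2_squared)

weights = [1.0, -2.0]
lasso_penalty = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=1.0)
ridge_penalty = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.0)
```

#### With elasticNetParam = 1.0 the penalty is pure L1 (lasso), with 0.0 it is pure L2 (ridge), and regParam = 0 switches regularization off, which is the setting used in models 5 to 7.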

In [144]:
grid = grid.build()

In [145]:
evaluator = RegressionEvaluator(labelCol=reg.getLabelCol(), predictionCol=reg.getPredictionCol())

In [146]:
va= feature.VectorAssembler(inputCols=['Month_encoded','Year_encoded','State_encoded', 'Location_modified_encoded', 'Food_modified_new_encoded'],outputCol='features')

In [147]:
crossPipe = Pipeline(stages=[va,reg])

In [148]:
cv = tune.CrossValidator(estimator = crossPipe, estimatorParamMaps = grid, evaluator= evaluator, numFolds = 3)
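
#### It helps to know how many models this cross validation will train. The sketch below counts the combinations in the grid defined above; the variable names are illustrative only.

```python
from itertools import product

elastic_net_values = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
reg_values = [round(step * 0.01, 2) for step in range(10)]  # 0.00 to 0.09
num_folds = 3

# Every elasticNetParam value is paired with every regParam value.
combinations = list(product(elastic_net_values, reg_values))
# Each combination is fitted once per fold.
total_fits = len(combinations) * num_folds
```

#### That is 60 parameter combinations and 180 model fits across the 3 folds, plus one final fit of the best combination on the full training data.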

In [149]:
list1=list()

#### Creating a method to iterate on all elastic net parameters and perform cross validation to give the best parameters

In [151]:
class CrossValidatorVerbose(CrossValidator):
  def _fit(self, dataset):
        est = self.getOrDefault(self.estimator)
        epm = self.getOrDefault(self.estimatorParamMaps)
        numModels = len(epm)

        eva = self.getOrDefault(self.evaluator)
        metricName = eva.getMetricName()

        nFolds = self.getOrDefault(self.numFolds)
        seed = self.getOrDefault(self.seed)
        h = 1.0 / nFolds

        randCol = self.uid + "_rand"
        df = dataset.select("*", rand(seed).alias(randCol))
        metrics = [0.0] * numModels

        for i in range(nFolds):
            foldNum = i + 1
            print("Comparing models on fold %d" % foldNum)

            validateLB = i * h
            validateUB = (i + 1) * h
            condition = (df[randCol] >= validateLB) & (df[randCol] < validateUB)
            validation = df.filter(condition)
            train = df.filter(~condition)

            for j in range(numModels):
                paramMap = epm[j]
                model = est.fit(train, paramMap)
                # TODO: duplicate evaluator to take extra params from input
                metric = eva.evaluate(model.transform(validation, paramMap))
                metrics[j] += metric

                avgSoFar = metrics[j] / foldNum
                print("params: %s\t%s: %f\tavg: %f" % (
                    {param.name: val for (param, val) in paramMap.items()},
                    metricName, metric, avgSoFar))
                list1.append([{param.name: val for (param, val) in paramMap.items()},metric,avgSoFar])
              # paramsList.append([[{param.name: val for (param, val) in paramMap.items()},metricName,metric,avgSoFar]])

        if eva.isLargerBetter():
            bestIndex = np.argmax(metrics)
        else:
            bestIndex = np.argmin(metrics)

        bestParams = epm[bestIndex]
        bestModel = est.fit(dataset, bestParams)
        avgMetrics = [m / nFolds for m in metrics]
        bestAvg = avgMetrics[bestIndex]
        print("Best model:\nparams: %s\t%s: %f" % (
            {param.name: val for (param, val) in bestParams.items()},
            metricName, bestAvg))

        return self._copyValues(CrossValidatorModel(bestModel, avgMetrics))
      

In [152]:
cvVer = CrossValidatorVerbose(estimator = crossPipe, estimatorParamMaps = grid, evaluator= evaluator, numFolds = 3)

In [153]:
from pyspark.sql.functions import rand

#### Splitting whole dataset into training and test to perform cross validation

In [155]:
training, test = df.randomSplit([0.7,0.3],0)

#### Running cross validation on the training split and applying the best model to the test split

In [157]:
cvVer.fit(training).transform(test)

#### Collecting the cross validation results and fetching the 10 parameter combinations with the lowest RMSE

In [159]:
#Fetch Parameters automatically
list1
df_cross_val=pd.DataFrame(list1,columns=['Regularization_Parameters','RMSE','AVG'])
df_cross_val.head()
df_cross_val['Regularization_Parameters']=df_cross_val['Regularization_Parameters'].astype('str') 

In [160]:
df_cross_val['Regularization_Parameters'] = df_cross_val['Regularization_Parameters'].replace({'regParam' : 'regP'}, regex=True)
df_cross_val['Regularization_Parameters'] = df_cross_val['Regularization_Parameters'].replace({'elasticNetParam' : 'elNetP'}, regex=True)
df_cross_val['Regularization_Parameters'] = df_cross_val['Regularization_Parameters'].replace({'{' : ''}, regex=True)
df_cross_val['Regularization_Parameters'] = df_cross_val['Regularization_Parameters'].replace({'}' : ''}, regex=True)


In [161]:
df_cross_val.head(10)

In [162]:
df_cross_val_best=df_cross_val.nsmallest(10, 'RMSE')
df_cross_val_best.head()

#### Plotting the 10 regularization settings with the lowest RMSE for comparison

In [164]:
#Plot of regularizations with RMSE values
plt.figure()
fig=plt.figure(figsize=(25, 10), dpi= 60)
sns.pointplot( y = 'RMSE', x = 'Regularization_Parameters', data = df_cross_val_best, palette='Blues',)
plt.xticks(rotation =60)
plt.tight_layout()
display()

In [165]:
a = spark.createDataFrame(df_cross_val_best)
display(a)

#### Best values of regParam and elasticNetParam
#### params: {'regParam': 0.029999999999999999, 'elasticNetParam': 0.2}	rmse: 0.952868 (these values change on every run)

In [167]:
#Best values of regParam and elasticNetParam
#params: {'regParam': 0.029999999999999999, 'elasticNetParam': 0.2}	rmse: 0.952868 (these values change on every run)
regParamBest=0.029999999999999999
elasticNetParamBest=0.2
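
#### Because the best values change between runs, hard-coding them is fragile. As a sketch, they can be derived from the collected results by averaging the metric over folds for each parameter set. `best_params` is a hypothetical helper and the fold results below are made up; the entries have the same shape as the rows appended to list1.

```python
from collections import defaultdict

def best_params(results):
    """Average the metric per parameter set and return the lowest average.

    `results` holds [params_dict, fold_metric, running_average] entries.
    """
    per_setting = defaultdict(list)
    for params, metric, _running_average in results:
        key = tuple(sorted(params.items()))
        per_setting[key].append(metric)
    averages = {key: sum(vals) / len(vals) for key, vals in per_setting.items()}
    best_key = min(averages, key=averages.get)
    return dict(best_key), averages[best_key]

# Made-up fold results for two settings.
toy_results = [
    [{"regParam": 0.03, "elasticNetParam": 0.2}, 0.95, 0.95],
    [{"regParam": 0.00, "elasticNetParam": 0.0}, 0.94, 0.94],
    [{"regParam": 0.03, "elasticNetParam": 0.2}, 0.96, 0.955],
    [{"regParam": 0.00, "elasticNetParam": 0.0}, 1.02, 0.98],
]
chosen, chosen_rmse = best_params(toy_results)
```

#### In this toy data the single lowest fold value (0.94) belongs to the setting with the worse average, which is why ranking by per-fold RMSE can pick a different setting than ranking by the fold average.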

#### Putting these best values in the model 6 taken by us

In [169]:
#Putting these best values in the model 6 taken by us
final_regression_model = Pipeline(stages=[
  feature.VectorAssembler(inputCols=['Month_encoded','Year_encoded','State_encoded','Location_modified_encoded','Food_modified_new_encoded'],outputCol='features'),
  regression.LinearRegression(featuresCol='features', labelCol='Illnesses_log',maxIter=5, regParam=regParamBest, elasticNetParam=elasticNetParamBest)]).fit(training)

#### Using the best parameter values in the model and checking the RMSE on the training split. The observation is that the RMSE goes down further

In [171]:
final_regression_model.transform(training).select(rmse).show()
#older value was 0.8996500501049035

In [172]:

temp_result=final_regression_model.transform(test)
display(temp_result)

#### Plotting Scatter Plot between predicted values and actual values

In [174]:
#Plotting Linear Model
plt.cla()
plotting_df=temp_result.toPandas()
plt.scatter(plotting_df['Illnesses_log'],plotting_df["prediction"])
display()

In [175]:
training_df.printSchema()

In [176]:
finalModelFit =  cv.fit(training)

In [177]:
final_RMSE_value = evaluator.evaluate(finalModelFit.transform(test))

In [178]:
final_RMSE_value

In [179]:
pred =  finalModelFit.transform(test)

In [180]:
pred.select('Illnesses_log', 'prediction').show(500)

#### Fetching the coefficients and the R2 value of the final model to interpret feature importance

In [182]:
final_regression_model.stages[-1].coefficients

In [183]:
final_regression_model.stages[-1].summary.r2

In [184]:
coefficients_list=final_regression_model.stages[-1].coefficients
coefficients_list
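
#### The coefficient vector is laid out in the same order as the VectorAssembler inputs of the final model: Month, Year, State, Location, Food. The sketch below derives the slice boundaries from the block sizes; `coefficient_slices` is a hypothetical helper, and the sizes are assumptions taken from the slice widths used in this notebook.

```python
def coefficient_slices(block_sizes):
    """Return {name: (start, stop)} for blocks laid out one after another."""
    slices = {}
    start = 0
    for name, size in block_sizes:
        slices[name] = (start, start + size)
        start += size
    return slices

# Block order follows the inputCols of the final model's VectorAssembler.
layout = coefficient_slices([
    ("Month", 12), ("Year", 18), ("State", 55),
    ("Location", 21), ("Food", 949),
])
```

#### Within each block, position i belongs to the category that the StringIndexer assigned index i, so category names should be read from the fitted indexer's labels rather than from the row order of the data.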

####  Fetching and plotting coefficients for Year to understand their importance

In [186]:
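# 'features' is assembled as Month (12), Year (18), State, Location, Food,
# so the Year coefficients are at positions 12..29, not 0..17.
# One-hot position i matches StringIndexer label i, so the names come from the
# fitted indexer (string_indexer_models follows categorical_columns; index 0 is Year).
year_list=string_indexer_models[0].labels
len(year_list)
year_coefficients_list= coefficients_list[12:30]
year_coefficients_list
year_coefficients = pd.DataFrame(
    {'year': year_list,
     'coefficients': year_coefficients_list
    })
year_coefficients.head(19)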

In [187]:
#Plotting year coefficients
plt.cla()
fig=plt.figure(figsize=(25, 10), dpi= 80)
sns.barplot( y = 'coefficients', x = 'year', data = year_coefficients, palette='Blues')
plt.xticks(rotation = 60)
plt.tight_layout()
display()

### Year 2000, Year 2013, Year 2014 are most important factors here

####  Fetching and plotting coefficients for Month to understand their importance

In [190]:
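#Month coefficients
# Month_encoded is the first assembler input, so its coefficients are at positions 0..11.
# Names come from the fitted StringIndexer (index 1 is Month) so that position i matches label i.
month_name_list=string_indexer_models[1].labels
month_list=list(range(len(month_name_list)))
len(month_list)
month_coefficients_list= coefficients_list[:12]
month_coefficients_list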
month_coefficients = pd.DataFrame(
    {'month': month_list,
     'coefficients': month_coefficients_list,
     'month_name':month_name_list
    })
month_coefficients.head(13)


In [191]:
#Plotting month coefficients
plt.cla()
fig=plt.figure(figsize=(25, 10), dpi= 80)
sns.barplot( y = 'coefficients', x = 'month_name', data = month_coefficients, palette='GnBu_d')
plt.xticks(rotation = 60)
plt.tight_layout()
display()

### Conclusion for Months
### Jan seems to be an important one in terms of months

####  Fetching and plotting coefficients for States indexed values to understand their importance

In [194]:
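#State coefficients
# One-hot position i matches StringIndexer label i, so take the names in index order
# (string_indexer_models follows categorical_columns; index 2 is State).
state_name_list=string_indexer_models[2].labels
state_list=list(range(len(state_name_list)))
len(state_list)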



In [195]:
state_coefficients_list= coefficients_list[30:85]
state_coefficients_list
state_coefficients = pd.DataFrame(
    {'state': state_list,
     'coefficients': state_coefficients_list,
     'state_name':state_name_list
    })
state_coefficients_temp=state_coefficients.nlargest(10, 'coefficients')

In [196]:
plt.cla()
fig=plt.figure(figsize=(25, 10), dpi= 80)
sns.barplot( y = 'coefficients', x = 'state_name', data = state_coefficients_temp, palette='coolwarm')
plt.xticks(rotation = 60)
plt.tight_layout()
display()

#### Importance of top 10 states in predicting results shown above

####  Fetching and plotting coefficients for Location to understand their importance

In [199]:
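#Location coefficients
# One-hot position i matches StringIndexer label i, so take the names in index order
# (string_indexer_models follows categorical_columns; index 3 is Location_modified).
location_name_list=string_indexer_models[3].labels
location_list=list(range(len(location_name_list)))
len(location_list)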



In [200]:
location_coefficients_list= coefficients_list[85:106]
location_coefficients_list
location_coefficients = pd.DataFrame(
    {'location': location_list,
     'coefficients': location_coefficients_list,
     'location_name':location_name_list
    })
location_coefficients.head(13)

In [201]:
#Plotting location coefficients
plt.cla()
fig=plt.figure(figsize=(25, 10), dpi= 100)
sns.barplot( y = 'coefficients', x = 'location_name', data = location_coefficients, palette='Blues')
plt.xticks(rotation = 90)
plt.tight_layout()
display()

### Important features found here are food consumption at hospitals, nursing homes, assisted living facilities and schools/colleges/universities

#### Fetching and plotting coefficients for Food Indexed to understand their importance

In [204]:
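#Last but not least, food coefficients
# One-hot position i matches StringIndexer label i, so take the names in index order
# (string_indexer_models follows categorical_columns; index 4 is Food_modified_new).
food_name_list=string_indexer_models[4].labels
food_list=list(range(len(food_name_list)))
len(food_list)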

In [205]:
food_coefficients_list= coefficients_list[106:1055]
food_coefficients_list
food_coefficients = pd.DataFrame(
    {'food': food_list,
     'coefficients': food_coefficients_list,
     'food_name':food_name_list
    })
food_coefficients_temp=food_coefficients.nlargest(20, 'coefficients')

In [206]:
#Plotting food coefficients
plt.figure()
fig=plt.figure(figsize=(25, 10), dpi= 80)
sns.barplot( y = 'coefficients', x = 'food_name', data = food_coefficients_temp, palette='husl')
plt.xticks(rotation = 60)
plt.tight_layout()
display()

### Main Food Items involved are Crab, Fries etc

In [208]:
#reg_best = regression.LinearRegression(labelCol = 'Illnesses_log', featuresCol = 'features', maxIter=5, regParam=0.040000000000000001, elasticNetParam=0.2)

#### Giving input value for features and getting predictions

In [210]:
#df_for_prof.dtypes

In [211]:
df_for_prof = spark.createDataFrame(outbreaks_new)
df_for_prof.show(50)
categorical_columns = ["Year","Month","State", "Location_modified", "Food_modified_new"]
string_indexer_models = []
one_hot_encoders = []
training_df_prof, validation_df_prof, testing_df_prof = df_for_prof.randomSplit([0.6, 0.3, 0.1])
#display(testing_df)
display(testing_df_prof)
test_df_prof=testing_df_prof.toPandas()
test_df_prof.loc[-1] = [2003, "August", "Utah", "Restaurant", "Lo Mein", 4, 3, 0.111, "Restaurant", "Lo Mein", "Lo Mein" ]  # adding a row
test_df_prof.index = test_df_prof.index + 1  # shifting index
test_df_prof = test_df_prof.sort_index()

In [212]:
testing_df_prof=spark.createDataFrame(test_df_prof)
display(testing_df_prof)

In [213]:
for col_name in categorical_columns:
    # OneHotEncoders map number indices column to column of binary vectors
    string_indexer_model = StringIndexer(inputCol=col_name, outputCol="{0}_indexed".format(col_name)).fit(testing_df_prof)
    testing_df_prof = string_indexer_model.transform(testing_df_prof)
    string_indexer_models.append(string_indexer_model)
    
    one_hot_encoder = OneHotEncoder(inputCol="{0}_indexed".format(col_name), outputCol="{0}_encoded".format(col_name), dropLast=False)
    testing_df_prof = one_hot_encoder.transform(testing_df_prof)
    
    one_hot_encoders.append(one_hot_encoder)
display(testing_df_prof)
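The loop above indexes each categorical column and then one-hot encodes the index. The sketch below imitates that idea in plain Python, assuming the frequency-descending ordering that `StringIndexer` uses by default. The helper names and the toy state lists are hypothetical, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Assign index 0 to the most frequent label, 1 to the next, and so on.

    Ties are broken alphabetically here; that tie-break is an assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, size):
    """Binary vector with a single 1 at `index` (no category dropped, like dropLast=False)."""
    return [1 if i == index else 0 for i in range(size)]


# Toy data for illustration only.
train_states = ["Utah", "Ohio", "Ohio", "Texas", "Ohio", "Utah"]
test_states = ["Utah", "Utah", "Texas"]

train_index = fit_string_index(train_states)
test_index = fit_string_index(test_states)
print(train_index)  # {'Ohio': 0, 'Utah': 1, 'Texas': 2}
print(test_index)   # {'Utah': 0, 'Texas': 1}
print(one_hot(train_index["Utah"], len(train_index)))  # [0, 1, 0]
```

In this toy case the same label receives a different index depending on which data the index was fitted on. This suggests that encodings line up with a trained model's coefficients only when the indexers fitted on the training data are reused for new rows.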

In [214]:
#model6 = Pipeline(stages=[
#  feature.VectorAssembler(inputCols=['Month_encoded','Year_encoded','State_encoded', 'Location_modified_encoded', 'Food_modified_new_encoded'],outputCol='features'),
#  regression.LinearRegression(featuresCol='features', labelCol='Illnesses_log',maxIter=5, regParam=0.00, elasticNetParam=0.0)]).fit(training_df)#
#model6.transform(testing_df_prof).select(fn.col('prediction')).show(5)
#training_df.dtypes

#### Adding a few more models to compare RMSE; the first one is Generalized Linear Regression - Model 8

In [216]:
model8 = Pipeline(stages=[
  va,
  regression.GeneralizedLinearRegression(family="gaussian", link="identity",featuresCol='features', labelCol='Illnesses_log',  maxIter=5, regParam=0.0 )]).fit(training_df)

In [217]:
model8.transform(validation_df).select(rmse).show()

In [218]:
model8.transform(testing_df).select(rmse).show()

In [219]:
training_df.printSchema()

In [220]:
display(df)

#### Using Decision Tree Regression - Model 9

In [222]:
#Decision Tree Regression
model9 = Pipeline(stages=[
va,
regression.DecisionTreeRegressor(featuresCol='features', labelCol='Illnesses_log')]).fit(training_df)

In [223]:
model9.transform(validation_df).select(rmse).show()

In [224]:
model9.transform(testing_df).select(rmse).show()

#### Using Random Forest Regression - Model 10

In [226]:
#Random Forest regression
model10 = Pipeline(stages=[
va,
regression.RandomForestRegressor(featuresCol='features', labelCol='Illnesses_log')]).fit(training_df)

In [227]:
model10.transform(validation_df).select(rmse).show()

In [228]:
model10.transform(testing_df).select(rmse).show()

#### Using Gradient Boosting Regression - Model 11

In [230]:
#Gradient Boosting Regression
model11 = Pipeline(stages=[
va,
regression.GBTRegressor(featuresCol='features', labelCol='Illnesses_log')]).fit(training_df)

In [231]:
model11.transform(validation_df).select(rmse).show()

In [232]:
model11.transform(testing_df).select(rmse).show()
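Every model in this part of the notebook is scored with the `rmse` column expression, which is defined outside this section. Assuming it has the usual root-mean-squared-error form applied to `Illnesses_log` and `prediction`, the quantity being compared can be sketched in plain Python as follows. The function name and the toy values are illustrative, not results from the outbreak data.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared difference between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))


# Toy values for illustration only; they are not taken from the outbreak data.
actual_log = [1.0, 2.0, 3.0]
predicted_log = [1.0, 2.0, 5.0]
print(round(rmse(actual_log, predicted_log), 4))  # 1.1547
```

Because the label is on a logarithmic scale, a lower value means predictions are closer in log-illness terms, not in raw illness counts.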

### The best model found is linear regression with tuned elastic net and regularization parameters - Model 6

In [234]:
#The best Model has been found with linear regression with set elastic and normal regularization parameters - Model 6

## Time Series Analysis based on Year and Month to understand the trends of illnesses over the years

#### Bringing data into proper shape before performing Time Series Analysis

In [237]:
#Trying Time Series Analysis
outbreaks_time_series = outbreaks_new.copy()

In [238]:
outbreaks_time_series.head()

In [239]:
import calendar
d = {'January':'01', 'February':'02', 'March':'03', 'April':'04','May':'05', 'June':'06', 'July':'07', 'August':'08', 'September':'09','October':'10', 'November':'11', 'December':'12' }

In [240]:
outbreaks_time_series.Month = outbreaks_time_series.Month.map(d)

In [241]:
outbreaks_time_series.head(5)
outbreaks_time_series.Month.value_counts()

In [242]:
outbreaks_time_series.Year= outbreaks_time_series["Year"].map(str)+ "-" + outbreaks_time_series["Month"]

In [243]:
outbreaks_time_series.dtypes

In [244]:
outbreaks_time_series.Year= outbreaks_time_series["Year"]+ "-" + "01"


In [245]:
outbreaks_time_series.Year=outbreaks_time_series['Year']

In [246]:
outbreaks_time_series.head()

In [247]:
outbreaks_time_series['Year']=pd.to_datetime(outbreaks_time_series.Year, format="%Y-%m-%d")
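The cells above turn a year and a month name into a first-of-month date through a hand-written dictionary and string concatenation. As a hedged alternative sketch, the same mapping can be generated from the standard library so that no month name has to be typed by hand. `MONTH_TO_NUMBER` and `first_of_month` are hypothetical names, and the sketch assumes English month names under the default locale.

```python
import calendar
from datetime import datetime

# calendar.month_name[0] is an empty string, so it is skipped.
# Month names depend on the active locale; English is assumed here.
MONTH_TO_NUMBER = {name: "%02d" % number
                   for number, name in enumerate(calendar.month_name) if name}


def first_of_month(year, month_name):
    """Parse a (year, English month name) pair into a datetime for the 1st of that month."""
    text = "%d-%s-01" % (year, MONTH_TO_NUMBER[month_name])
    return datetime.strptime(text, "%Y-%m-%d")


print(MONTH_TO_NUMBER["September"])    # 09
print(first_of_month(2003, "August"))  # 2003-08-01 00:00:00
```

A month name missing from the mapping raises a `KeyError` here, whereas `Series.map` with a dictionary silently produces missing values for unmatched names.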

In [248]:
time_series_model_df=pd.DataFrame(outbreaks_time_series.groupby('Year')['Illnesses'].sum()).copy()

In [249]:
time_series_model_df.index

#### Forming a line plot to understand the trend of time on the Illnesses

In [251]:
plt.cla()
time_series_model_df.plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20)
display()

#### Removing the noise from previous time series plot by forming a rolling average plot of time vs Illnesses

In [253]:
#Rolling Average
plt.cla()
illnesses = time_series_model_df[['Illnesses']]
illnesses.rolling(12).mean().plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20)
display()

#### The rolling average plot shows a clear decreasing trend in illnesses over the years

In [255]:
time_series_analysis_df=illnesses.rolling(12).mean().reset_index()
time_series_analysis_df = time_series_analysis_df[np.isfinite(time_series_analysis_df['Illnesses'])]
time_series_analysis_df.reset_index(inplace=True)
time_series_analysis_df.drop(['index'], axis=1,inplace=True)
time_series_analysis_df.reset_index(inplace=True)
time_series_analysis_df
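The `rolling(12).mean()` call averages each value with the 11 rows before it, so the first 11 rows have no complete window and come out as missing. That is why the non-finite rows are filtered out in the cell above. A minimal plain-Python sketch of the same trailing mean is given here, with `None` standing in for the missing values. The function name and the toy numbers are illustrative.

```python
def rolling_mean(values, window):
    """Trailing mean over `window` points; None where the window is incomplete."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)
        else:
            chunk = values[i + 1 - window:i + 1]
            result.append(sum(chunk) / float(window))
    return result


monthly_totals = [10, 20, 30, 40, 50]  # toy numbers, not outbreak data
smoothed = rolling_mean(monthly_totals, 3)
print(smoothed)  # [None, None, 20.0, 30.0, 40.0]
complete = [v for v in smoothed if v is not None]
print(len(complete))  # 3
```

An integer window counts rows, so a window of 12 spans exactly one year only if every calendar month is present in the grouped data.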

#### Understanding the average number of illnesses for every month since 1999

In [257]:
time_series_analysis_df.count()

### Cleaning data to create a regression plot between Date and Illnesses

In [259]:
df_time_series = spark.createDataFrame(time_series_analysis_df)
display(df_time_series)
df_time_series.describe().show()


In [260]:
import pyspark.sql.functions as fn
from pyspark.sql.types import *
df_time_series= df_time_series.select(fn.unix_timestamp(fn.col('Year'), format='yyyy-MM-dd HH:mm:ss.000').alias('date'),'index', 'Illnesses')


In [261]:
df_time_series.show()

In [262]:
training = df_time_series.where((df_time_series['index'] >= 0) & (df_time_series['index']<150))
display(training)

In [263]:
testing = df_time_series.where((df_time_series['index'] >= 150) & (df_time_series['index']<=187))
display(testing)

In [264]:
#training, testing = df_time_series.randomSplit([0.8, 0.2#], 0.0)

### Creating a linear model based on date (timestamp) to check possible linearity in illnesses with time

In [266]:
model_time = Pipeline(stages=[
  feature.VectorAssembler(inputCols=['date'], outputCol='features'),
  regression.LinearRegression(featuresCol='features', labelCol='Illnesses',maxIter=5, regParam=0.01, elasticNetParam=0.2)  
]).fit(training)

In [267]:
rmse_time = fn.sqrt(fn.avg((fn.col('Illnesses') - fn.col('prediction'))**2))
test_model_time=model_time.transform(testing).select(((fn.col('Illnesses') - fn.col('prediction'))**2),fn.col('Illnesses'), fn.col('prediction'), fn.col('date'))

In [268]:
test_model_time.show(30)

In [269]:
model_time.transform(testing).select(rmse_time).show()
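The pipeline above fits a single-feature linear model of illnesses against the timestamp. For intuition, the unregularized version of that fit has a closed form, sketched here in plain Python. This is ordinary least squares, not the elastic-net-regularized Spark estimator used in the notebook, and the function name and toy series are illustrative.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy series for illustration only.
time_index = [0, 1, 2, 3]
illness_values = [7.0, 5.0, 3.0, 1.0]
slope, intercept = fit_line(time_index, illness_values)
print("slope=%.1f intercept=%.1f" % (slope, intercept))  # slope=-2.0 intercept=7.0
```

A negative slope corresponds to a downward trend over time, which is the pattern the rolling average plot suggests.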

#### Plotting a scatter plot for time vs Illnesses

In [271]:
plt.cla()
a4_dims = (11.7, 8.27)
fig, ax = plt.subplots(figsize=a4_dims)
g=sns.FacetGrid(data=df_time_series.toPandas(),size=8) #mapping maps in the grids using facetgrid
g.map(plt.scatter, 'date', 'Illnesses')
display()

In [272]:
test_model_df=test_model_time.toPandas()

In [273]:
test_model_df.count()

#### Plotting the fitted linear model against the actual illnesses for the test period

In [275]:
plt.cla()
plt.plot(test_model_df.date, test_model_df.Illnesses, color='g')
plt.plot(test_model_df.date, test_model_df.prediction, color='orange')
plt.xlabel('Years 1998-2015')
plt.ylabel('Illnesses Occurred')
plt.title('Illnesses vs Years - Linear Regression Basic')
plt.show()
display()

## Performing Logistic Regression to classify Illnesses as High Scaled or Low Scaled

#### Analyzing the data to understand the threshold value on which we will be dividing the illnesses as high or low

In [278]:
#Logistic Modeling
model_time.transform(testing).select(((fn.col('Illnesses') - fn.col('prediction'))**2)).show(200)

In [279]:
display(df)

In [280]:
df.select(fn.avg("Illnesses_log")).show()

In [281]:
df.select(fn.max("Illnesses_log")).show()

In [282]:
df.select(fn.min("Illnesses_log")).show()

In [283]:
outbreaks_pandas_df=df.toPandas()

In [284]:
outbreaks_pandas_df.head()

#### Checking Distribution of Illnesses_log (logarithmic scale of Illnesses)

In [286]:
#Check Distribution of Illnesses_log
plt.cla()
sns.distplot(outbreaks_pandas_df['Illnesses_log'], kde=False, bins=30)
display()

#### Jointplot between Illnesses count and Illnesses_log

In [288]:
#jointplot between Illnesses count and Illnesses_log
plt.cla()
sns.jointplot(x='Illnesses', y='Illnesses_log', data=outbreaks_pandas_df , kind='kde')
display()

#### From the visualization, the data is split roughly evenly around an Illnesses_log value of 2 (on a scale of 0 - 7.5), so 2 is used as the threshold between Low and High

In [290]:
print(outbreaks_pandas_df[outbreaks_pandas_df.Illnesses_log >=3]['Illnesses'])

In [291]:
print(outbreaks_pandas_df[outbreaks_pandas_df.Illnesses_log >=2].count())

In [292]:
outbreaks_pandas_df['Illnesses_impact'] = np.where(outbreaks_pandas_df['Illnesses_log']>=2, 1, 0)

In [293]:
outbreaks_pandas_df.head()

#### Importing package pipe from Professor Acuna's github repository

In [295]:
from pyspark_pipes import pipe

In [296]:
df_log = spark.createDataFrame(outbreaks_pandas_df)
display(df_log)

#### Using randomsplit to divide the df_log dataframe into training and testing datasets

In [298]:
training_df2,testing_df2 = df_log.randomSplit([0.8, 0.2], 0)
display(training_df2)

#### Logistic Regression model with features "Month_encoded", "Year_encoded", "State_encoded", "Location_modified_encoded" and "Food_modified_new_encoded"

In [300]:
model_class1 = pipe(feature.VectorAssembler(inputCols=['Month_encoded','Year_encoded','State_encoded', 'Location_modified_encoded', 'Food_modified_new_encoded'],outputCol='features'),
     classification.LogisticRegression(labelCol='Illnesses_impact'))

In [301]:
model_class1_fitted = model_class1.fit(training_df2)

In [302]:
model_class1_fitted.transform(testing_df2)

#### Defining a method to perform binary classifier evaluation

In [304]:
def binary_evaluation(model_pipeline, model_fitted, data):
  return BinaryClassificationEvaluator(labelCol=model_pipeline.getStages()[-1].getLabelCol(), 
                                rawPredictionCol=model_pipeline.getStages()[-1].getRawPredictionCol()).\
    evaluate(model_fitted.transform(data))
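`BinaryClassificationEvaluator` reports the area under the ROC curve by default. One way to read that number is as the share of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes that pairwise quantity directly. It illustrates the metric and is not how Spark computes it internally; the function name and the toy scores are illustrative.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # toy scores
print(area_under_roc(labels, scores))  # 0.75
```

Because the metric depends on ranking, it is more informative when it is fed continuous scores such as the raw prediction column instead of hard 0/1 predictions.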

In [305]:
model1ROC_test=binary_evaluation(model_class1,model_class1_fitted,testing_df2)
model1ROC_test
#baseline area under ROC on the test set: 0.7563

In [306]:
model1ROC_train=binary_evaluation(model_class1,model_class1_fitted,training_df2)
model1ROC_train

#### Using feature scaling to check whether the area under ROC improves

In [308]:
model_class2 = pipe(feature.VectorAssembler(inputCols=['Month_encoded','Year_encoded','State_encoded', 'Location_modified_encoded', 'Food_modified_new_encoded'],outputCol='features'),
                    feature.StandardScaler(withMean=True),
     classification.LogisticRegression(labelCol='Illnesses_impact'))


In [309]:
model_class2_fitted = model_class2.fit(training_df2)

In [310]:
model_class2_fitted.transform(testing_df2)

In [311]:
model2ROC_train=binary_evaluation(model_class2,model_class2_fitted,training_df2)
model2ROC_train

#### We found that feature scaling did not help and did not give the results we wanted

In [313]:

model2ROC_test=binary_evaluation(model_class2,model_class2_fitted,testing_df2)
model2ROC_test

### Adding Regularization and Cross Validation

In [315]:
lr = classification.LogisticRegression(labelCol='Illnesses_impact', featuresCol = 'features', maxIter=5)
lr.getPredictionCol()

In [316]:

paramGrid = ParamGridBuilder() \
    .addGrid(lr.elasticNetParam, [0., 0.2, 0.4, 0.6, 0.8, 1.0]) \
    .addGrid(lr.regParam, [ 0. ,  0.01,  0.02,  0.03,  0.04,  0.05,  0.06,  0.07,  0.08,  0.09]) \
    .build()

In [317]:
#Area under ROC is computed from continuous scores, so use the raw prediction column
evaluator2 = BinaryClassificationEvaluator(labelCol=lr.getLabelCol(), rawPredictionCol=lr.getRawPredictionCol())
crossPipe2 = Pipeline(stages=[va,lr])

In [318]:
cv2 = tune.CrossValidator(estimator = crossPipe2, estimatorParamMaps = paramGrid, evaluator= evaluator2, numFolds = 2)
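The parameter grid combines every `elasticNetParam` value with every `regParam` value, so the cost of the search grows with the product of the two lists and the number of folds. A small sketch of that arithmetic, using the values from the grid above, is given here. The variable names are illustrative.

```python
from itertools import product

elastic_net_values = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
reg_values = [0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09]
num_folds = 2

# One candidate per (elasticNetParam, regParam) combination.
grid = [{"elasticNetParam": e, "regParam": r}
        for e, r in product(elastic_net_values, reg_values)]
fits_during_search = len(grid) * num_folds

print(len(grid))           # 60
print(fits_during_search)  # 120
```

With 60 candidates and 2 folds, the search trains 120 models before the best parameter set is refit on the full training data.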

#### Getting the best model from cross validation and fitting it

In [320]:
final_class_model_fitted = cv2.fit(training_df2)

#### Computing the area under ROC of the best fitted model

In [322]:
model3ROC_test=evaluator2.evaluate(final_class_model_fitted.transform(testing_df2))
model3ROC_test

In [323]:
model3ROC_train=evaluator2.evaluate(final_class_model_fitted.transform(training_df2))
model3ROC_train

In [324]:
#Attempt
training_df3, validation_df3, testing_df3 = df_log.randomSplit([0.6, 0.3, 0.1])
display(training_df3)
feature_assembler = VectorAssembler(inputCols=['Month_encoded','Year_encoded','State_encoded', 'Location_modified_encoded', 'Food_modified_new_encoded'], outputCol="features")
assembled_train_df = feature_assembler.transform(training_df3).cache()



In [325]:
assembled_validation_df = feature_assembler.transform(validation_df3).cache()

In [326]:
assembled_test_df = feature_assembler.transform(testing_df3).cache()

In [327]:
assembled_train_df.columns

#### Analysing how we specify the class_weight using the weightCol feature

In [329]:
log_reg = LogisticRegression(featuresCol='features', labelCol='Illnesses_impact', maxIter=20, family='binomial')
evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction', labelCol='Illnesses_impact', metricName='areaUnderROC')

In [330]:
model = log_reg.fit(assembled_train_df)

In [331]:
train_preds = model.transform(assembled_validation_df)

In [332]:
print(train_preds.columns)

In [333]:
train_areaUnderROC = evaluator.evaluate(train_preds)
train_areaUnderROC

In [334]:
trainpredlbls = train_preds.select("prediction", "Illnesses_impact").cache()

In [335]:
trainpredlbls.limit(500).toPandas()

#### Creating a method to calculate the accuracy

In [337]:
def accuracy(predlbls):
    counttotal = predlbls.count()
    correct = predlbls.filter(col('Illnesses_impact') == col("prediction")).count()
    wrong = predlbls.filter(col('Illnesses_impact') != col("prediction")).count()
    ratioCorrect = float(correct)/counttotal
    print("Correct: {0}, Wrong: {1}, Model Accuracy: {2}".format(correct, wrong, np.round(ratioCorrect, 2)))

In [338]:
accuracy(trainpredlbls)

In [339]:
train_summary = model.evaluate(assembled_train_df)
validation_summary = model.evaluate(assembled_validation_df)

In [340]:
print('Training Accuracy   :', train_summary.accuracy)
print('Validation Accuracy :', validation_summary.accuracy)

In [341]:
train_summary.areaUnderROC


#### Calculating the summary of area under ROC

In [343]:
validation_summary.areaUnderROC

In [344]:
validation_summary.fMeasureByLabel(beta=1.0)

In [345]:
validation_summary.precisionByLabel
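The summary attributes used above report precision and F-measure separately for each label. A minimal plain-Python sketch of what those per-label numbers mean is given here, treating one label at a time as the positive class. The function name and the toy labels are illustrative.

```python
def metrics_for_label(actual, predicted, label):
    """Precision, recall and F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == label and p == label)
    fp = sum(1 for a, p in pairs if a != label and p == label)
    fn = sum(1 for a, p in pairs if a == label and p != label)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


actual = [1, 1, 0, 0, 1]     # toy labels
predicted = [1, 0, 0, 1, 1]  # toy predictions
print(metrics_for_label(actual, predicted, 0))  # (0.5, 0.5, 0.5)
```

Looking at both labels matters when the classes are imbalanced, since a model can score well on the majority label while missing most of the minority label.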

#### Our model should at least perform better than the Null Accuracy. Null Accuracy is defined as the accuracy we would get by blindly predicting the majority class of the training set for every row

#### Looking for Base model, Null accuracy

In [348]:
train_total = trainpredlbls.count()
train_label0count = float(trainpredlbls.filter(col("Illnesses_impact") == 0.0).count())
train_label1count = float(trainpredlbls.filter(col("Illnesses_impact") == 1.0).count())

#### If we had predicted everything to be the majority label, then what would be the accuracy

In [350]:
max(train_label0count, train_label1count) / train_total
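The null accuracy computed above is the share of rows that belong to the most common label. The same calculation in plain Python, as a minimal sketch with illustrative names and toy labels:

```python
from collections import Counter


def null_accuracy(labels):
    """Accuracy of always predicting the most common label in `labels`."""
    counts = Counter(labels)
    return max(counts.values()) / float(len(labels))


toy_labels = [1, 1, 1, 0]  # illustrative only
print(null_accuracy(toy_labels))  # 0.75
```

Note that `trainpredlbls` is built from predictions on the validation split, so the baseline in this notebook appears to describe the label balance of that split.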


#### Test Accuracy

In [352]:
test_preds = model.transform(assembled_test_df)

In [353]:
test_areaUnderROC = evaluator.evaluate(test_preds)
test_areaUnderROC

In [354]:
testpredlbls = test_preds.select("prediction", "Illnesses_impact")

In [355]:
accuracy(testpredlbls)

In [356]:
test_summary = model.evaluate(assembled_test_df)

In [357]:
test_summary.accuracy

In [358]:
test_summary.areaUnderROC

In [359]:
test_summary.fMeasureByLabel(beta=1.0)

In [360]:
test_summary.precisionByLabel

In [361]:
test_summary.recallByLabel

In [362]:
test_summary.roc.limit(10).toPandas()

In [363]:
train_roc_pdf = train_summary.roc.toPandas()
validation_roc_pdf = validation_summary.roc.toPandas()
test_roc_pdf = test_summary.roc.toPandas()

#### Plotting the ROC curve for the best logistic regression model with regularization and cross validation implemented. The features are the same as the best linear model.

In [365]:
plt.figure(figsize=(6,4))
plt.plot(train_roc_pdf['FPR'], train_roc_pdf['TPR'], lw=1, label='Train AUC = %0.2f' % (train_summary.areaUnderROC))
plt.plot(validation_roc_pdf['FPR'], validation_roc_pdf['TPR'], lw=1, label='Validation AUC = %0.2f' % (validation_summary.areaUnderROC))
plt.plot(test_roc_pdf['FPR'], test_roc_pdf['TPR'], lw=1, label='Test AUC = %0.2f' % (test_summary.areaUnderROC))
plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='NULL Accuracy')
plt.title('ROC AUC Curve')
plt.tight_layout()
plt.legend(loc="best" )
display()

## Conclusion : 
## Use our platform to input the year, month, state, location and food consumed; in return, the platform forecasts the number of illnesses expected for those inputs with an RMSE of 0.95. The platform also categorizes illnesses as High or Low scale with an AUC of 0.77. Lastly, the platform has future scope for Time Series Forecasting to detect illness trends in advance. Stay alert, stay healthy!