You've been hired by a dog food company to work out why some batches of their dog food are spoiling much more quickly than intended! Unfortunately, the company hasn't upgraded to the latest machinery, so the amounts of the five chemicals in its preservative mix (four preservatives plus a filler) can vary a lot from batch to batch. Which chemical has the strongest effect?

The dog food company first mixes up a batch of preservative that contains 4 different preservative chemicals (A, B, C, D), which is then topped up with a "filler" chemical. The food scientists believe one of the A, B, C, or D preservatives is causing the problem, but need your help to figure out which one!


First, we need to create the Spark Session

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://mirrors.sonic.net/apache/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
!tar xzf spark-3.1.2-bin-hadoop3.2.tgz
!pip install -q findspark


import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"


import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

Afterwards, we can read the file and inspect it

In [2]:
# Please drop the file in the environment's 'Files' panel
df = spark.read.options(header="true", inferSchema="true").csv("/content/dog_food.csv")
df.describe().toPandas()

| summary | A | B | C | D | Spoiled |
|---|---|---|---|---|---|
| count | 490.0 | 490.0 | 490.0 | 490.0 | 490.0 |
| mean | 5.53469387755102 | 5.504081632653061 | 9.126530612244895 | 5.579591836734694 | 0.2857142857142857 |
| stddev | 2.9515204234399057 | 2.8537966089662063 | 2.0555451971054275 | 2.8548369309982857 | 0.4522156316461346 |
| min | 1.0 | 1.0 | 5.0 | 1.0 | 0.0 |
| max | 10.0 | 10.0 | 14.0 | 10.0 | 1.0 |


The idea for this assignment is to use tree methods to find underlying patterns in the data. These methods make few assumptions about the data itself, letting it speak for itself. As with the previous models, we must first assemble a vector column named "features" and a "label" column so that Spark can process the data. For more heavily commented code, please see the other assignments:

In [8]:
from pyspark.ml.feature import VectorAssembler, Imputer
# Pack every column except the target into a single 'features' vector
feature_cols = [c for c in df.columns if c != 'Spoiled']
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features', handleInvalid='skip')
output = assembler.transform(df)
# Copy 'Spoiled' into 'label'; any missing value would be filled with the column mean
imputer = Imputer(inputCols=['Spoiled'], outputCols=['label'], strategy='mean')
imputer_model = imputer.fit(output)
output = imputer_model.transform(output)
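
To make the target shape concrete, here is a minimal plain-Python sketch of what an assembled row looks like conceptually. It does not use Spark: the two sample rows and the `assemble` helper are invented for illustration, and only the column names match the CSV.

```python
# Hypothetical rows; the values are invented, only the column names match the CSV
rows = [
    {"A": 4, "B": 2, "C": 12.0, "D": 3, "Spoiled": 1.0},
    {"A": 5, "B": 6, "C": 7.0, "D": 8, "Spoiled": 0.0},
]

# Every column except the target is a feature
feature_cols = [c for c in rows[0] if c != "Spoiled"]

def assemble(row):
    """Pack the feature columns into one list and copy the target into 'label'."""
    return {
        "features": [float(row[c]) for c in feature_cols],
        "label": row["Spoiled"],
    }

assembled = [assemble(r) for r in rows]
print(assembled[0])
```

Each row ends up with one vector of inputs and one target value, which is the layout the classifiers below expect by default.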

As always, we divide the data into a train and a test set, so that we can evaluate the models on data they have not seen during training.

In [12]:
train, test = output.randomSplit([0.7, 0.3])
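
Note that `randomSplit` also accepts an optional `seed` argument; without it, every run produces a different split and therefore slightly different scores. The idea behind a seeded split can be sketched in plain Python. This illustrates the concept only, not Spark's actual implementation, and `seeded_split` is a hypothetical helper.

```python
import random

def seeded_split(items, train_fraction=0.7, seed=42):
    """Assign each item to train or test independently, using a fixed seed."""
    rng = random.Random(seed)
    train, test = [], []
    for item in items:
        # Each item lands in train with probability train_fraction
        (train if rng.random() < train_fraction else test).append(item)
    return train, test

rows = list(range(490))  # stand-in for the 490 rows of the dataset
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)

# Same seed, same split; the sizes are only approximately 70/30
print(train_a == train_b, len(train_a), len(test_a))
```

Because rows are assigned one by one at random, the split sizes are approximate rather than exact.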

We are using three different tree methods, all of which come bundled with Spark:


* [Decision Tree Classifier](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.classification.DecisionTreeClassifier.html): It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves).
* [Random Forest Classifier](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html):  For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean or average prediction of the individual trees is returned. Random decision forests correct for decision trees' habit of overfitting to their training set.
* [Gradient Boosted Trees](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.classification.GBTClassifier.html): It gives a prediction model in the form of an ensemble of weak prediction models, which are typically decision trees. When a decision tree is the weak learner, the resulting algorithm is called gradient-boosted trees; it often outperforms random forest, though not on every data set.
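
The "class selected by most trees" rule for random forests can be sketched in a few lines of plain Python. This illustrates the voting idea as described above rather than Spark's internals, which may combine the trees' outputs differently; the `majority_vote` helper and the votes are invented for the example.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from five trees for a single batch (1.0 = spoiled)
votes = [1.0, 0.0, 1.0, 1.0, 0.0]
print(majority_vote(votes))  # 1.0, since three of the five trees say "spoiled"
```

Averaging over many trees in this way is what dampens the overfitting of any single tree.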

We will use the three methods to find if their results match, and to pick the most accurate of the three.

In [13]:
# First, we instantiate the three classifiers
from pyspark.ml.classification import (RandomForestClassifier, GBTClassifier, DecisionTreeClassifier)
dtc = DecisionTreeClassifier()
rfc = RandomForestClassifier(numTrees = 100)
gbt = GBTClassifier()

In [14]:
#We fit the three models
dtc_model = dtc.fit(train)
rfc_model = rfc.fit(train)
gbt_model = gbt.fit(train)

In [15]:
#And get their predictions
dtc_preds = dtc_model.transform(test)
rfc_preds = rfc_model.transform(test)
gbt_preds = gbt_model.transform(test)

To evaluate the models, we import `MulticlassClassificationEvaluator`, which will give us an accuracy metric:


In [16]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(metricName='accuracy')
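
Accuracy is simply the fraction of test rows whose prediction equals the label. A minimal plain-Python sketch follows, using invented toy data and a hypothetical `accuracy` helper; it also computes the score of a trivial model that always predicts "not spoiled", which is a useful point of comparison.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Invented toy data (1.0 = spoiled); 3 of the 10 batches are spoiled
labels      = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
predictions = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

model_acc = accuracy(labels, predictions)
baseline_acc = accuracy(labels, [0.0] * len(labels))  # always predict "not spoiled"
print(model_acc, baseline_acc)  # 0.9 0.7
```

In the real data about 28.6% of batches are spoiled (the mean of `Spoiled` in the summary table), so always predicting "not spoiled" would already score roughly 0.71 on the full data set. A model's accuracy is best judged against that baseline.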

And we display it

In [23]:
print(f'DTC: {evaluator.evaluate(dtc_preds)} \t Features Importance: {dtc_model.featureImportances}') 
print(f'RFC: {evaluator.evaluate(rfc_preds)}  Features Importance: {rfc_model.featureImportances}')
print(f'GBT: {evaluator.evaluate(gbt_preds)} \t Features Importance: {gbt_model.featureImportances}')

DTC: 0.959731543624161 	 Features Importance: (4,[0,1,2,3],[0.010142105007278246,0.0016897828403142857,0.96352393472086,0.024644177431547582])
RFC: 0.9664429530201343  Features Importance: (4,[0,1,2,3],[0.025437259862519997,0.024627784289245877,0.9287766018080698,0.021158354040164428])
GBT: 0.959731543624161 	 Features Importance: (4,[0,1,2,3],[0.008071973502739728,0.03292367928105117,0.9086511641861458,0.05035318303006327])


As we can see, all three methods achieve similar (and quite high!) accuracy, and all of them attribute an outsize importance (> 90%) to the third feature (**Preservative C**) in predicting dog food spoilage.
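
The importance vectors are printed in Spark's sparse format, `(size, [indices], [values])`, which is hard to read at a glance. As a small sketch, the values can be paired with the column names and ranked. The numbers below are copied from the printed output above, and the name order assumes the features were assembled in the order A, B, C, D, as in the assembler cell.

```python
feature_names = ["A", "B", "C", "D"]

# Importance values copied from the printed output above
importances = {
    "DTC": [0.010142105007278246, 0.0016897828403142857, 0.96352393472086, 0.024644177431547582],
    "RFC": [0.025437259862519997, 0.024627784289245877, 0.9287766018080698, 0.021158354040164428],
    "GBT": [0.008071973502739728, 0.03292367928105117, 0.9086511641861458, 0.05035318303006327],
}

# Rank the features of each model from most to least important
for model, values in importances.items():
    ranked = sorted(zip(feature_names, values), key=lambda pair: pair[1], reverse=True)
    print(model, [(name, round(value, 3)) for name, value in ranked])

# The single most important feature according to each model
top_features = {
    model: max(zip(feature_names, values), key=lambda pair: pair[1])[0]
    for model, values in importances.items()
}
print(top_features)
```

All three models rank C first by a wide margin, while A, B, and D each account for only a few percent.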

Thus, we can conclude that **Preservative C** is the chemical most responsible for dog food spoilage, and we can recommend that the company stop using it.