# Tree Methods Consulting Project 

You've been hired by a dog food company to figure out why some batches of their dog food are spoiling much quicker than intended! Unfortunately, the company hasn't upgraded to the latest machinery, so the amounts of the five chemicals they are using can vary a lot from batch to batch. But which chemical has the strongest effect? The company first mixes up a batch of preservative that contains 4 different preservative chemicals (A, B, C, D) and then tops it up with a "filler" chemical. The food scientists believe one of the A, B, C, or D preservatives is causing the problem, but need your help to figure out which one!
Use machine learning with a random forest (RF) to find out which parameter has the most predictive power, and thus which chemical causes the early spoiling! Create a model, then work out how to use it to decide which chemical is the problem.

* Pres_A : Percentage of preservative A in the mix
* Pres_B : Percentage of preservative B in the mix
* Pres_C : Percentage of preservative C in the mix
* Pres_D : Percentage of preservative D in the mix
* Spoiled: Label indicating whether or not the dog food batch was spoiled.
___

**Think carefully about what this problem is really asking you to solve. While we will use Machine Learning to solve this, it won't be with your typical train/test split workflow. If this confuses you, skip ahead to the solution code along walk-through!**
____

In [39]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('dog_food').getOrCreate()

In [40]:
df = spark.read.csv('dog_food.csv', header=True, inferSchema=True)
df.show(1)

+---+---+----+---+-------+
|  A|  B|   C|  D|Spoiled|
+---+---+----+---+-------+
|  4|  2|12.0|  3|    1.0|
+---+---+----+---+-------+
only showing top 1 row



In [41]:
df.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)



In [42]:
# All features are numeric, so no StringIndexer or categorical encoding is needed
# Target: Spoiled
from pyspark.sql.functions import countDistinct
# Two distinct labels, so this is binary classification
df.agg(countDistinct(df['Spoiled'])).show()

+-----------------------+
|count(DISTINCT Spoiled)|
+-----------------------+
|                      2|
+-----------------------+



In [47]:
# The classes are imbalanced, but we will ignore that here
# A more thorough model would address the imbalance, e.g. with synthetic data or downsampling
df.groupBy('Spoiled').count().show()

+-------+-----+
|Spoiled|count|
+-------+-----+
|    0.0|  350|
|    1.0|  140|
+-------+-----+
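
To put a number on the imbalance, the counts above can be turned into a majority-class baseline and a set of inverse-frequency class weights. This is only a sketch in plain Python: the weighting formula `total / (n_classes * class_count)` is one common convention, not something this notebook applies, and the variable names are illustrative.

```python
# Class counts copied from the groupBy output above.
counts = {0.0: 350, 1.0: 140}

total = sum(counts.values())
n_classes = len(counts)

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(counts.values()) / total

# Inverse-frequency ("balanced") weights: total / (n_classes * class_count).
# Assumption: this is one common weighting convention, not the notebook's method.
class_weights = {label: total / (n_classes * n) for label, n in counts.items()}

print(f"majority baseline accuracy: {majority_baseline:.3f}")
for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.2f}")
```

Any accuracy reported later should be read against that baseline of roughly 0.71, since a model that never predicts spoilage would already reach it.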



### Data transformation for Spark

In [48]:
df.columns

['A', 'B', 'C', 'D', 'Spoiled']

In [49]:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=['A', 'B', 'C', 'D'], 
                            outputCol='features')
df2 = assembler.transform(df)
df2.show(2)

+---+---+----+---+-------+------------------+
|  A|  B|   C|  D|Spoiled|          features|
+---+---+----+---+-------+------------------+
|  4|  2|12.0|  3|    1.0|[4.0,2.0,12.0,3.0]|
|  5|  6|12.0|  7|    1.0|[5.0,6.0,12.0,7.0]|
+---+---+----+---+-------+------------------+
only showing top 2 rows



In [50]:
# train-test split (no seed is set, so the exact split differs between runs)
train_Data, test_Data = df2.randomSplit([0.7,0.3])

In [51]:
train_Data.describe().show()

+-------+------------------+------------------+-----------------+-----------------+-------------------+
|summary|                 A|                 B|                C|                D|            Spoiled|
+-------+------------------+------------------+-----------------+-----------------+-------------------+
|  count|               337|               337|              337|              337|                337|
|   mean| 5.560830860534125| 5.462908011869437| 9.12166172106825|5.614243323442136| 0.2789317507418398|
| stddev|2.9180284069567755|2.8763125110383694|2.114302112031551|2.816750297459208|0.44914076510131523|
|    min|                 1|                 1|              5.0|                1|                0.0|
|    max|                10|                10|             14.0|               10|                1.0|
+-------+------------------+------------------+-----------------+-----------------+-------------------+



In [52]:
test_Data.describe().show()

+-------+-----------------+------------------+-----------------+-----------------+-------------------+
|summary|                A|                 B|                C|                D|            Spoiled|
+-------+-----------------+------------------+-----------------+-----------------+-------------------+
|  count|              153|               153|              153|              153|                153|
|   mean|5.477124183006536| 5.594771241830065|9.137254901960784|5.503267973856209| 0.3006535947712418|
| stddev| 3.03290065978243|2.8107825949385754|1.926473542407869|2.944942549271046|0.46004815709395613|
|    min|                1|                 1|              6.0|                1|                0.0|
|    max|               10|                10|             13.0|               10|                1.0|
+-------+-----------------+------------------+-----------------+-----------------+-------------------+
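
As a quick sanity check on the split, the row count times the mean of `Spoiled` gives the number of spoiled batches in each part, and the two parts should add back up to the full dataset. The sketch below just redoes that arithmetic in plain Python with the numbers copied from the two `describe()` outputs; the variable names are illustrative.

```python
# (row count, mean of Spoiled) copied from the describe() outputs above.
splits = {
    "train": (337, 0.2789317507418398),
    "test": (153, 0.3006535947712418),
}

# Spoiled is 0.0/1.0, so count * mean is the number of spoiled batches.
spoiled = {name: round(n * mean) for name, (n, mean) in splits.items()}
total_rows = sum(n for n, _ in splits.values())
total_spoiled = sum(spoiled.values())

for name, (n, mean) in splits.items():
    print(f"{name}: {spoiled[name]} spoiled of {n} rows ({mean:.1%})")
print(f"combined: {total_spoiled} spoiled of {total_rows} rows")
```

The combined totals agree with the earlier `groupBy('Spoiled')` counts (350 + 140 rows, 140 of them spoiled), and the spoiled share is similar in both parts.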



In [53]:
# Selecting only the features and label columns before splitting is not necessary;
# it is done after the split in the next cell

In [54]:
train_Data = train_Data.select(train_Data['features'], 
                               train_Data['Spoiled'])
test_Data = test_Data.select(test_Data['features'], 
                             test_Data['Spoiled'])

In [55]:
train_Data.show()

+-------------------+-------+
|           features|Spoiled|
+-------------------+-------+
| [1.0,1.0,10.0,8.0]|    1.0|
| [1.0,1.0,12.0,2.0]|    1.0|
| [1.0,1.0,13.0,3.0]|    1.0|
|  [1.0,2.0,9.0,1.0]|    0.0|
|  [1.0,3.0,8.0,3.0]|    0.0|
|  [1.0,3.0,8.0,5.0]|    0.0|
|[1.0,4.0,13.0,10.0]|    1.0|
| [1.0,5.0,8.0,10.0]|    0.0|
|[1.0,5.0,12.0,10.0]|    1.0|
|[1.0,5.0,13.0,10.0]|    1.0|
|  [1.0,6.0,7.0,8.0]|    0.0|
|  [1.0,6.0,8.0,1.0]|    0.0|
|  [1.0,6.0,8.0,3.0]|    0.0|
|[1.0,6.0,11.0,10.0]|    1.0|
|  [1.0,7.0,7.0,2.0]|    0.0|
|  [1.0,7.0,7.0,6.0]|    0.0|
|  [1.0,7.0,8.0,4.0]|    0.0|
| [1.0,7.0,11.0,9.0]|    1.0|
|  [1.0,8.0,6.0,6.0]|    0.0|
| [1.0,8.0,7.0,10.0]|    0.0|
+-------------------+-------+
only showing top 20 rows



### ML model preparation

In [56]:
from pyspark.ml.classification import (DecisionTreeClassifier,
                                       RandomForestClassifier, 
                                       GBTClassifier)

In [57]:
det = DecisionTreeClassifier(featuresCol='features',
                             labelCol='Spoiled',
                             predictionCol='prediction')

rfc = RandomForestClassifier(featuresCol='features',
                             labelCol='Spoiled',
                             predictionCol='prediction', 
                             numTrees=200)

gbt = GBTClassifier(featuresCol='features',
                    labelCol='Spoiled',
                    predictionCol='prediction', 
                    maxIter=200)

In [58]:
# Fitting models
dt_model = det.fit(train_Data)
rfc_model = rfc.fit(train_Data)
gbt_model = gbt.fit(train_Data)

In [59]:
# Prediction
dt_pred = dt_model.transform(test_Data)
rfc_pred = rfc_model.transform(test_Data)
gbt_pred = gbt_model.transform(test_Data)

In [60]:
dt_pred.show(5)

+------------------+-------+-------------+-----------+----------+
|          features|Spoiled|rawPrediction|probability|prediction|
+------------------+-------+-------------+-----------+----------+
|[1.0,1.0,12.0,4.0]|    1.0|   [0.0,63.0]|  [0.0,1.0]|       1.0|
| [1.0,2.0,9.0,4.0]|    0.0|  [185.0,0.0]|  [1.0,0.0]|       0.0|
| [1.0,3.0,9.0,8.0]|    0.0|  [185.0,0.0]|  [1.0,0.0]|       0.0|
| [1.0,4.0,8.0,1.0]|    0.0|   [27.0,0.0]|  [1.0,0.0]|       0.0|
| [1.0,4.0,8.0,5.0]|    0.0|  [185.0,0.0]|  [1.0,0.0]|       0.0|
+------------------+-------+-------------+-----------+----------+
only showing top 5 rows
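
In the decision tree output above, `rawPrediction` looks like per-class counts of training rows in the leaf a sample lands in, and `probability` is that vector normalized to sum to 1. The plain-Python sketch below reproduces that relationship for the rows shown; the helper name `normalize` is illustrative, and the interpretation of `rawPrediction` is an assumption based on the displayed values.

```python
def normalize(raw):
    """Scale a vector of non-negative class counts so that it sums to 1."""
    total = sum(raw)
    return [x / total for x in raw]

# rawPrediction vectors copied from the dt_pred output above.
raw_predictions = [[0.0, 63.0], [185.0, 0.0], [27.0, 0.0]]

for raw in raw_predictions:
    prob = normalize(raw)
    # The predicted label is the index of the largest probability.
    prediction = float(prob.index(max(prob)))
    print(raw, "->", prob, "-> prediction", prediction)
```

Every leaf shown here is pure (all of its rows belong to one class), which is why the probabilities are exactly 0.0 or 1.0.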



### Hyperparameter tuning

Not covered here; see https://docs.databricks.com/applications/machine-learning/mllib/binary-classification-mllib-pipelines.html

### Evaluate the ML models

In [61]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction',
                                              labelCol='Spoiled',
                                              metricName="accuracy")

In [62]:
acc_dt = evaluator.evaluate(dt_pred)
acc_dt

0.9673202614379085

In [63]:
acc_rfc = evaluator.evaluate(rfc_pred)
acc_rfc

0.9738562091503268

In [64]:
acc_gbt = evaluator.evaluate(gbt_pred)
acc_gbt

0.9673202614379085

### Confusion matrix and classification report

In [65]:
rfc_pred.show(1)

+------------------+-------+--------------------+--------------------+----------+
|          features|Spoiled|       rawPrediction|         probability|prediction|
+------------------+-------+--------------------+--------------------+----------+
|[1.0,1.0,12.0,4.0]|    1.0|[10.8338986077273...|[0.05416949303863...|       1.0|
+------------------+-------+--------------------+--------------------+----------+
only showing top 1 row
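
For the random forest, the displayed numbers are consistent with `rawPrediction` being a sum of per-tree class probabilities, so dividing by the number of trees gives `probability`. The sketch below checks this for the first entry only, since the second is truncated in the output. That each tree contributes a vector summing to 1 is an assumption here, and the variable names are illustrative.

```python
# numTrees passed to RandomForestClassifier earlier in the notebook.
num_trees = 200

# First rawPrediction entry (class 0), as far as the truncated output shows it.
raw_class0 = 10.8338986077273

# Assumption: each tree contributes a probability vector summing to 1,
# so the raw vector sums to num_trees and dividing by it normalizes.
prob_class0 = raw_class0 / num_trees
prob_class1 = 1.0 - prob_class0

print(f"P(Spoiled = 0) ~ {prob_class0:.6f}")
print(f"P(Spoiled = 1) ~ {prob_class1:.6f}")
```

The result agrees with the displayed `probability` prefix `0.05416949303863`, and since class 1 gets the larger share, the prediction is `1.0`.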



**Confusion matrix using PySpark**

In [66]:
from pyspark.mllib.evaluation import MulticlassMetrics
#rfc_pred = rfc_model.transform(test_Data)
# MulticlassMetrics expects an RDD of (prediction, label) pairs
predictionRDD = (rfc_pred.select('prediction', 'Spoiled')
                 .rdd.map(lambda row: (row['prediction'], row['Spoiled'])))

In [67]:
metrics = MulticlassMetrics(predictionAndLabels=predictionRDD)
print(metrics.confusionMatrix().toArray())

[[107.   0.]
 [  4.  42.]]


In [68]:
acc_rfc

0.9738562091503268

In [70]:
# Correct predictions / all predictions, which matches acc_rfc
(107+42)/(107+42+4)

0.9738562091503268

**Classification report and confusion matrix using scikit-learn**

In [71]:
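# collect() returns Row objects, so pull out the scalar values
y_true = [row['Spoiled'] for row in rfc_pred.select('Spoiled').collect()]
y_pred = [row['prediction'] for row in rfc_pred.select('prediction').collect()]

from sklearn.metrics import classification_report, confusion_matrix
# Same confusion matrix as the PySpark one above
print(confusion_matrix(y_true, y_pred))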

[[107   0]
 [  4  42]]


In [72]:
# Classification report
print(classification_report(y_true,y_pred))

              precision    recall  f1-score   support

         0.0       0.96      1.00      0.98       107
         1.0       1.00      0.91      0.95        46

    accuracy                           0.97       153
   macro avg       0.98      0.96      0.97       153
weighted avg       0.97      0.97      0.97       153
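
The per-class figures in the report can be rederived from the confusion matrix alone, which helps when reading it. Below is a minimal plain-Python sketch that assumes rows are true labels and columns are predictions (consistent with the support column, 107 and 46); the helper name `per_class_metrics` is illustrative.

```python
# Confusion matrix from above: rows are true labels, columns are predictions.
cm = [[107, 0],
      [4, 42]]

def per_class_metrics(cm, k):
    """Return (precision, recall, f1) for class index k."""
    tp = cm[k][k]
    fp = sum(row[k] for row in cm) - tp   # predicted k, but true label differs
    fn = sum(cm[k]) - tp                  # true label k, but predicted otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
print(f"accuracy: {accuracy:.4f}")
for k in range(len(cm)):
    p, r, f = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The recall of 0.91 for the spoiled class comes from the 4 spoiled batches the model labelled as fine, which is the error type the company probably cares about most.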



### Feature importance

In [73]:
dt_model.featureImportances

SparseVector(4, {0: 0.0058, 1: 0.0247, 2: 0.9578, 3: 0.0118})

In [74]:
rfc_model.featureImportances

SparseVector(4, {0: 0.0277, 1: 0.023, 2: 0.9223, 3: 0.0269})

In [75]:
df.columns

['A', 'B', 'C', 'D', 'Spoiled']

In [76]:
gbt_model.featureImportances

SparseVector(4, {0: 0.0325, 1: 0.0312, 2: 0.9093, 3: 0.0269})

In [77]:
gbt_model.featureImportances.values

array([0.03254089, 0.0312051 , 0.90931327, 0.02694074])

In [None]:
# Feature C (index 2) has by far the highest importance in all three models (over 0.9),
# so preservative C is the most likely cause of the early spoiling
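
The importance vectors are indexed by position, in the same order as the `inputCols` given to the `VectorAssembler`, so index 2 corresponds to column `C`. The plain-Python sketch below pairs the rounded importances printed above with the column names and picks the top feature per model; the dictionary and variable names are illustrative.

```python
# Feature order as passed to VectorAssembler(inputCols=...).
columns = ['A', 'B', 'C', 'D']

# Rounded importances copied from the featureImportances outputs above.
importances = {
    "decision tree": [0.0058, 0.0247, 0.9578, 0.0118],
    "random forest": [0.0277, 0.023, 0.9223, 0.0269],
    "gradient-boosted trees": [0.0325, 0.0312, 0.9093, 0.0269],
}

top_feature = {}
for model, values in importances.items():
    # Pair each column with its importance and sort from most to least important.
    ranked = sorted(zip(columns, values), key=lambda pair: pair[1], reverse=True)
    top_feature[model] = ranked[0][0]
    print(model, "->", ranked)

print(top_feature)
```

All three models rank `C` first by a wide margin, which supports the conclusion that preservative C is the one to investigate.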