<a href="https://colab.research.google.com/github/Ricardo-Jaramillo/PySpark/blob/main/Project_DecisionTrees_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tree Methods Consulting Project

You've been hired by a dog food company to work out why some batches of their dog food are spoiling much quicker than intended! Unfortunately, this dog food company hasn't upgraded to the latest machinery, meaning that the amounts of the five chemicals they use (four preservatives plus a filler) can vary a lot. But which chemical has the strongest effect? The company first mixes up a batch of preservative that contains 4 different preservative chemicals (A, B, C, D) and then completes it with a "filler" chemical. The food scientists believe one of the A, B, C, or D preservatives is causing the problem, but they need your help to figure out which one!
Use machine learning with a random forest (RF) to find out which parameter has the most predictive power, and therefore which chemical causes the early spoiling! So create a model and then work out how you can decide which chemical is the problem!

* Pres_A : Percentage of preservative A in the mix (column `A` in the data)
* Pres_B : Percentage of preservative B in the mix (column `B` in the data)
* Pres_C : Percentage of preservative C in the mix (column `C` in the data)
* Pres_D : Percentage of preservative D in the mix (column `D` in the data)
* Spoiled: Label indicating whether or not the dog food batch was spoiled.
___

**Think carefully about what this problem is really asking you to solve. While we will use machine learning to solve it, it won't be with your typical train/test split workflow. If this confuses you, skip ahead to the solution code-along walkthrough!**
____

# Good Luck!

## Install pyspark and download the data

In [1]:
# Install pyspark
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.9/316.9 MB 3.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425344 sha256=3edd88764f33997748975b24da2c7cb96ce52d782ba1a9854ffe5cc5a8ad9b34
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


In [2]:
# Download the necessary files
!wget https://raw.githubusercontent.com/Ricardo-Jaramillo/PySpark/main/datasets/DecisionTress/dog_food.csv

--2023-10-04 15:59:25--  https://raw.githubusercontent.com/Ricardo-Jaramillo/PySpark/main/datasets/DecisionTress/dog_food.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7197 (7.0K) [text/plain]
Saving to: ‘dog_food.csv’


2023-10-04 15:59:25 (102 MB/s) - ‘dog_food.csv’ saved [7197/7197]



## Import libraries and read in the data

In [3]:
# import libraries
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier, GBTClassifier, DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer

In [4]:
# Create a spark session
spark = SparkSession.builder.appName('dog_food_project').getOrCreate()

In [5]:
# Read in the data
data = spark.read.csv('dog_food.csv', header=True, inferSchema=True)

In [6]:
# Print Schema
data.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)



In [10]:
# Show the data
data.show()

+---+---+----+---+-------+
|  A|  B|   C|  D|Spoiled|
+---+---+----+---+-------+
|  4|  2|12.0|  3|    1.0|
|  5|  6|12.0|  7|    1.0|
|  6|  2|13.0|  6|    1.0|
|  4|  2|12.0|  1|    1.0|
|  4|  2|12.0|  3|    1.0|
| 10|  3|13.0|  9|    1.0|
|  8|  5|14.0|  5|    1.0|
|  5|  8|12.0|  8|    1.0|
|  6|  5|12.0|  9|    1.0|
|  3|  3|12.0|  1|    1.0|
|  9|  8|11.0|  3|    1.0|
|  1| 10|12.0|  3|    1.0|
|  1|  5|13.0| 10|    1.0|
|  2| 10|12.0|  6|    1.0|
|  1| 10|11.0|  4|    1.0|
|  5|  3|12.0|  2|    1.0|
|  4|  9|11.0|  8|    1.0|
|  5|  1|11.0|  1|    1.0|
|  4|  9|12.0| 10|    1.0|
|  5|  8|10.0|  9|    1.0|
+---+---+----+---+-------+
only showing top 20 rows



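## Assemble the data into a features column
The label `Spoiled` is already numeric (`double` in the schema above), so it does not need to be indexed with `StringIndexer`; we only need to combine columns A to D into a single `features` vector.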

In [11]:
# Print out columns
data.columns

['A', 'B', 'C', 'D', 'Spoiled']

In [13]:
# Create object assembler
assembler = VectorAssembler(inputCols=['A', 'B', 'C', 'D'],
                            outputCol='features')

In [14]:
# Create features column from assembler
output = assembler.transform(data)
output.show()

+---+---+----+---+-------+-------------------+
|  A|  B|   C|  D|Spoiled|           features|
+---+---+----+---+-------+-------------------+
|  4|  2|12.0|  3|    1.0| [4.0,2.0,12.0,3.0]|
|  5|  6|12.0|  7|    1.0| [5.0,6.0,12.0,7.0]|
|  6|  2|13.0|  6|    1.0| [6.0,2.0,13.0,6.0]|
|  4|  2|12.0|  1|    1.0| [4.0,2.0,12.0,1.0]|
|  4|  2|12.0|  3|    1.0| [4.0,2.0,12.0,3.0]|
| 10|  3|13.0|  9|    1.0|[10.0,3.0,13.0,9.0]|
|  8|  5|14.0|  5|    1.0| [8.0,5.0,14.0,5.0]|
|  5|  8|12.0|  8|    1.0| [5.0,8.0,12.0,8.0]|
|  6|  5|12.0|  9|    1.0| [6.0,5.0,12.0,9.0]|
|  3|  3|12.0|  1|    1.0| [3.0,3.0,12.0,1.0]|
|  9|  8|11.0|  3|    1.0| [9.0,8.0,11.0,3.0]|
|  1| 10|12.0|  3|    1.0|[1.0,10.0,12.0,3.0]|
|  1|  5|13.0| 10|    1.0|[1.0,5.0,13.0,10.0]|
|  2| 10|12.0|  6|    1.0|[2.0,10.0,12.0,6.0]|
|  1| 10|11.0|  4|    1.0|[1.0,10.0,11.0,4.0]|
|  5|  3|12.0|  2|    1.0| [5.0,3.0,12.0,2.0]|
|  4|  9|11.0|  8|    1.0| [4.0,9.0,11.0,8.0]|
|  5|  1|11.0|  1|    1.0| [5.0,1.0,11.0,1.0]|
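|  4|  9|12.0| 10|    1.0|[4.0,9.0,12.0,10.0]|
|  5|  8|10.0|  9|    1.0| [5.0,8.0,10.0,9.0]|
+---+---+----+---+-------+-------------------+
only showing top 20 rows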

In [16]:
# Select only the features and labels
final_data = output.select('Spoiled', 'features')
final_data.show()

+-------+-------------------+
|Spoiled|           features|
+-------+-------------------+
|    1.0| [4.0,2.0,12.0,3.0]|
|    1.0| [5.0,6.0,12.0,7.0]|
|    1.0| [6.0,2.0,13.0,6.0]|
|    1.0| [4.0,2.0,12.0,1.0]|
|    1.0| [4.0,2.0,12.0,3.0]|
|    1.0|[10.0,3.0,13.0,9.0]|
|    1.0| [8.0,5.0,14.0,5.0]|
|    1.0| [5.0,8.0,12.0,8.0]|
|    1.0| [6.0,5.0,12.0,9.0]|
|    1.0| [3.0,3.0,12.0,1.0]|
|    1.0| [9.0,8.0,11.0,3.0]|
|    1.0|[1.0,10.0,12.0,3.0]|
|    1.0|[1.0,5.0,13.0,10.0]|
|    1.0|[2.0,10.0,12.0,6.0]|
|    1.0|[1.0,10.0,11.0,4.0]|
|    1.0| [5.0,3.0,12.0,2.0]|
|    1.0| [4.0,9.0,11.0,8.0]|
|    1.0| [5.0,1.0,11.0,1.0]|
|    1.0|[4.0,9.0,12.0,10.0]|
|    1.0| [5.0,8.0,10.0,9.0]|
+-------+-------------------+
only showing top 20 rows



> **NOTE:** The problem only asks which variable carries the most weight in the model, that is, which preservative most influences whether a batch spoils. Since we are not trying to predict whether a new batch is spoiled, we don't need to split the data into train and test sets.
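Tree-based models rank features by how much their splits reduce impurity, which is why fitting on the full dataset is enough to answer the question. The following is a minimal plain-Python sketch of that idea using Gini impurity. The six-row toy dataset, the helper names `gini` and `split_gain`, and the thresholds are all hypothetical and are not taken from `dog_food.csv` or from Spark's internals.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gain(rows, labels, feature, threshold):
    """Impurity decrease from splitting on rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

# Hypothetical toy batches (NOT from dog_food.csv): each row is (A, C).
rows = [(4, 12.0), (9, 11.0), (1, 13.0), (5, 8.0), (9, 7.0), (2, 9.0)]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spoiled, 0 = not spoiled

gain_a = split_gain(rows, labels, feature=0, threshold=4)
gain_c = split_gain(rows, labels, feature=1, threshold=10.0)
print(f'gain from splitting on A: {gain_a:.3f}')
print(f'gain from splitting on C: {gain_c:.3f}')
```

In this toy data a split on C separates the two classes completely while a split on A barely helps, so a tree would credit C with almost all of the importance.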

## Fit tree-based models

In [17]:
# Create classifier objects
dtc = DecisionTreeClassifier(labelCol='Spoiled', featuresCol='features')
rfc = RandomForestClassifier(labelCol='Spoiled', featuresCol='features')
gbt = GBTClassifier(labelCol='Spoiled', featuresCol='features')

In [18]:
# Fit models
dtc_model = dtc.fit(final_data)
rfc_model = rfc.fit(final_data)
gbt_model = gbt.fit(final_data)

In [21]:
# Now check the featureImportances of the trained models
print(f'DTC feature importances: {dtc_model.featureImportances}')
print(f'RFC feature importances: {rfc_model.featureImportances}')
print(f'GBT feature importances: {gbt_model.featureImportances}')

DTC feature importances: (4,[1,2,3],[0.0019107795086908742,0.9831676511855764,0.014921569305732818])
RFC feature importances: (4,[0,1,2,3],[0.015496512702336391,0.025991843685610527,0.9392441723914908,0.019267471220562302])
GBT feature importances: (4,[0,1,2,3],[0.02962567485294246,0.03830179415146122,0.8286277188140007,0.10344481218159562])
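Each printed vector is in sparse form: `(size, [indices], [values])`, where an index that is missing (index 0 for the decision tree) means an importance of 0. The sketch below pairs each index with its column name using plain Python, with the numbers copied from the output above. The helper `to_named` and the variable names are illustrative only.

```python
# Column order passed to the VectorAssembler in this notebook.
feature_names = ['A', 'B', 'C', 'D']

# (indices, values) copied from the printed sparse vectors above.
printed = {
    'DTC': ([1, 2, 3],
            [0.0019107795086908742, 0.9831676511855764, 0.014921569305732818]),
    'RFC': ([0, 1, 2, 3],
            [0.015496512702336391, 0.025991843685610527,
             0.9392441723914908, 0.019267471220562302]),
    'GBT': ([0, 1, 2, 3],
            [0.02962567485294246, 0.03830179415146122,
             0.8286277188140007, 0.10344481218159562]),
}

def to_named(indices, values, names):
    """Expand a sparse (indices, values) pair into {name: importance}.

    Indices absent from the sparse vector keep an importance of 0.0.
    """
    dense = dict.fromkeys(names, 0.0)
    for i, v in zip(indices, values):
        dense[names[i]] = v
    return dense

named = {model: to_named(idx, vals, feature_names)
         for model, (idx, vals) in printed.items()}

# Most important feature according to each model.
top = {model: max(imp, key=imp.get) for model, imp in named.items()}

for model, imp in named.items():
    ranked = sorted(imp.items(), key=lambda kv: kv[1], reverse=True)
    print(model, [(name, round(value, 4)) for name, value in ranked])
print(top)
```

With the live models, the same mapping could probably be built from `model.featureImportances.toArray()` zipped with the assembler's input columns instead of copying the printed numbers.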


## Conclusion
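In all three models, the third feature (index 2 in the vectors above) has by far the largest importance: about 0.98 for the decision tree, 0.94 for the random forest, and 0.83 for the gradient-boosted trees. Index 2 corresponds to column C, so preservative C has the strongest influence on whether a batch of dog food spoils.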

**C is the answer**