<a href="https://colab.research.google.com/github/FZMuri/Data-Analytics-Portfolio/blob/main/DT_Mini_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decision Tree Mini Project

A pet food company wants to know why some batches of its pet food are spoiling much sooner than expected. For each batch, the company first mixes a preservative blend from four different preservative chemicals (A, B, C, D) and then tops it up with a "filler" chemical. The food scientists believe one of the four preservatives is causing the problem, but need your help to figure out which one.
Use the Decision Tree algorithm to find out which feature has the most predictive power, and therefore which chemical is most strongly associated with early spoiling. Create a DT model, then work out how to use it to decide which chemical is the problem.

- Pres_A : Percentage of preservative A in the mix (column `A` in the CSV)
- Pres_B : Percentage of preservative B in the mix (column `B` in the CSV)
- Pres_C : Percentage of preservative C in the mix (column `C` in the CSV)
- Pres_D : Percentage of preservative D in the mix (column `D` in the CSV)
- Spoiled: Label indicating whether or not the pet food batch was spoiled.
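
To make "most predictive power" concrete, here is a minimal pure-Python sketch of the idea behind a decision tree split: for each feature, find the threshold that reduces Gini impurity the most, then compare features by that gain. The rows are made-up toy values, and the helper names `gini` and `best_split_gain` are hypothetical. Spark's own implementation bins the features and is more involved.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split_gain(values, labels):
    """Largest impurity decrease over all threshold splits 'value <= t'."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - weighted)
    return best

# Made-up toy rows (A, B, C, D, Spoiled); NOT taken from pet_food.csv.
rows = [
    (4, 2, 12, 3, 1),
    (5, 6, 12, 7, 1),
    (1, 3, 8, 3, 0),
    (1, 4, 9, 6, 0),
    (6, 2, 13, 6, 1),
    (5, 5, 8, 3, 0),
]
labels = [row[4] for row in rows]

# Best achievable impurity decrease for each feature column.
gains = {
    name: best_split_gain([row[i] for row in rows], labels)
    for i, name in enumerate(["A", "B", "C", "D"])
}
best_feature = max(gains, key=gains.get)
print(gains)
print("Feature with the largest gain on the toy rows:", best_feature)
```

A fitted tree aggregates gains like these over all of its nodes, which is what a feature importance score summarizes.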


In [None]:
!pip install pyspark
import pyspark
from pyspark.sql.functions import *
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('petfood').getOrCreate()

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=ef70f160b3d1f10460692e1f055d1291e9ebbcc11218ed2e40cf78f5978c498a
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [None]:
# Pipeline, classifier, indexers, and evaluator used in the cells below
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [None]:
data = spark.read.csv('pet_food.csv', inferSchema=True, header=True)

In [None]:
data.show()

+---+---+----+---+-------+
|  A|  B|   C|  D|Spoiled|
+---+---+----+---+-------+
|  4|  2|12.0|  3|    1.0|
|  5|  6|12.0|  7|    1.0|
|  6|  2|13.0|  6|    1.0|
|  4|  2|12.0|  1|    1.0|
|  4|  2|12.0|  3|    1.0|
| 10|  3|13.0|  9|    1.0|
|  8|  5|14.0|  5|    1.0|
|  5|  8|12.0|  8|    1.0|
|  6|  5|12.0|  9|    1.0|
|  3|  3|12.0|  1|    1.0|
|  9|  8|11.0|  3|    1.0|
|  1| 10|12.0|  3|    1.0|
|  1|  5|13.0| 10|    1.0|
|  2| 10|12.0|  6|    1.0|
|  1| 10|11.0|  4|    1.0|
|  5|  3|12.0|  2|    1.0|
|  4|  9|11.0|  8|    1.0|
|  5|  1|11.0|  1|    1.0|
|  4|  9|12.0| 10|    1.0|
|  5|  8|10.0|  9|    1.0|
+---+---+----+---+-------+
only showing top 20 rows



In [None]:
data.count()

490

### Create a vector assembler to transform our data

**VectorAssembler** is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models.

In [None]:
vec_assembler = VectorAssembler(inputCols=['A', 'B', 'C', 'D'],
                                outputCol= 'features')
vecdata = vec_assembler.transform(data)
vecdata.show(3)

+---+---+----+---+-------+------------------+
|  A|  B|   C|  D|Spoiled|          features|
+---+---+----+---+-------+------------------+
|  4|  2|12.0|  3|    1.0|[4.0,2.0,12.0,3.0]|
|  5|  6|12.0|  7|    1.0|[5.0,6.0,12.0,7.0]|
|  6|  2|13.0|  6|    1.0|[6.0,2.0,13.0,6.0]|
+---+---+----+---+-------+------------------+
only showing top 3 rows



In [None]:
vecdata.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)
 |-- features: vector (nullable = true)



In [None]:
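# Index the feature vector; features with at most maxCategories distinct
# values (default 20) are treated as categorical by VectorIndexer.
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures").fit(vecdata)

# Map the Spoiled values to label indices.
labelIndexer = StringIndexer(inputCol='Spoiled', outputCol='indexedLabel').fit(vecdata)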

# Split the data into training and test sets (25% held out for testing)
(trainDf, testDf) = vecdata.randomSplit([0.75, 0.25], seed=1)

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel",
                            featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainDf)

# Make predictions.
predictions = model.transform(testDf)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test accuracy
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", 
                                              predictionCol="prediction", 
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print('Decision Tree accuracy is {:.3f}.'.format(accuracy))

# The fitted tree is the last of the three pipeline stages.
treeModel = model.stages[2]
print(treeModel)  # summary only

+----------+------------+-------------------+
|prediction|indexedLabel|           features|
+----------+------------+-------------------+
|       0.0|         0.0|  [1.0,2.0,9.0,1.0]|
|       0.0|         0.0|  [1.0,4.0,9.0,6.0]|
|       1.0|         1.0|[1.0,4.0,13.0,10.0]|
|       0.0|         0.0|  [1.0,5.0,8.0,3.0]|
|       0.0|         0.0| [1.0,5.0,8.0,10.0]|
+----------+------------+-------------------+
only showing top 5 rows

Decision Tree accuracy is 0.950.
DecisionTreeClassificationModel: uid=DecisionTreeClassifier_7a8afe40d3ff, depth=5, numNodes=25, numClasses=2, numFeatures=4
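
The summary above does not yet say which chemical matters most. A fitted PySpark `DecisionTreeClassificationModel` exposes a `featureImportances` vector, and its entries follow the order of the assembler's `inputCols`. The sketch below shows one way to pair names with scores and rank them. `rank_features` is a hypothetical helper, and the names and scores in the demo call are placeholders, not results from this notebook.

```python
def rank_features(names, importances):
    """Pair each feature name with its importance score, highest first."""
    scores = [float(s) for s in importances]
    if len(names) != len(scores):
        raise ValueError("names and importances must have the same length")
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)

# Placeholder names and scores for illustration only (NOT notebook results).
example = rank_features(["f0", "f1", "f2"], [0.2, 0.1, 0.7])
for name, score in example:
    print(f"{name}: {score:.3f}")
```

In this notebook the call would be `rank_features(['A', 'B', 'C', 'D'], treeModel.featureImportances.toArray())`. The top-ranked column is the preservative the tree relied on most. `print(treeModel.toDebugString)` shows the actual split rules if you want to inspect them.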


### ALTERNATE WAY (WITHOUT A PIPELINE)

In [None]:
# Create a vector assembler to transform our data.
vec_assembler = VectorAssembler(inputCols=['A', 'B', 'C', 'D'],
                                outputCol= 'features')
vecDf = vec_assembler.transform(data)
# vecDf.show()
# vecDf.printSchema()
# vecDf.count()

# StringIndexer encodes a string column of labels to a column of label indices.
# The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0.
indexer = StringIndexer(inputCol='Spoiled', outputCol='label')
indexerModel = indexer.fit(vecDf)
indexVecDf = indexerModel.transform(vecDf)
# indexVecDf.show(5)

# Split into train (70%) and test (30%) sets
(train_df, test_df) = indexVecDf.randomSplit([0.7, 0.3], seed=1)

# Decision Tree Classification (DecisionTreeClassifier was imported earlier)
decTree_classifier = DecisionTreeClassifier(labelCol='label', featuresCol='features')

decTree_model = decTree_classifier.fit(train_df)
decTree_pred = decTree_model.transform(test_df)
# decTree_pred.select(['Spoiled','features','label','prediction']).show(5)

decTree_eval = MulticlassClassificationEvaluator(metricName='accuracy')
decTree_accuracy = decTree_eval.evaluate(decTree_pred)
print('Decision Tree accuracy is {:.3f}'.format(decTree_accuracy))

Decision Tree accuracy is 0.979
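
The StringIndexer comment in the cell above says indices are ordered by label frequency. The pure-Python sketch below mimics that ordering on a made-up label column, so it is clear why index 0 need not correspond to the original value 0.0. `frequency_desc_index` is a hypothetical helper. Breaking ties alphabetically is an assumption about Spark's default `frequencyDesc` behaviour.

```python
from collections import Counter

def frequency_desc_index(labels):
    """Map each distinct label (as a string) to an index, most frequent first.

    Ties are broken alphabetically; this is an assumption meant to mirror
    Spark's default 'frequencyDesc' ordering.
    """
    counts = Counter(str(label) for label in labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up label column: "0.0" is the majority class, so it gets index 0.0.
toy_labels = [1.0, 0.0, 0.0, 1.0, 0.0]
mapping = frequency_desc_index(toy_labels)
print(mapping)

# If spoiled batches were the majority, "1.0" would get index 0.0 instead.
flipped = frequency_desc_index([1.0, 1.0, 0.0])
print(flipped)
```

In the outputs shown in this notebook, `Spoiled` and `label` agree row by row. That is consistent with 0.0 being the more frequent value in the data, rather than something StringIndexer guarantees.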


### MLP WAY (MULTILAYER PERCEPTRON)

In [None]:
train_df.show()

+---+---+----+---+-------+-------------------+-----+
|  A|  B|   C|  D|Spoiled|           features|label|
+---+---+----+---+-------+-------------------+-----+
|  1|  1|10.0|  8|    1.0| [1.0,1.0,10.0,8.0]|  1.0|
|  1|  1|12.0|  2|    1.0| [1.0,1.0,12.0,2.0]|  1.0|
|  1|  1|12.0|  4|    1.0| [1.0,1.0,12.0,4.0]|  1.0|
|  1|  1|13.0|  3|    1.0| [1.0,1.0,13.0,3.0]|  1.0|
|  1|  3| 8.0|  3|    0.0|  [1.0,3.0,8.0,3.0]|  0.0|
|  1|  3| 8.0|  5|    0.0|  [1.0,3.0,8.0,5.0]|  0.0|
|  1|  3| 9.0|  8|    0.0|  [1.0,3.0,9.0,8.0]|  0.0|
|  1|  4| 8.0|  1|    0.0|  [1.0,4.0,8.0,1.0]|  0.0|
|  1|  4| 8.0|  5|    0.0|  [1.0,4.0,8.0,5.0]|  0.0|
|  1|  4| 8.0|  7|    0.0|  [1.0,4.0,8.0,7.0]|  0.0|
|  1|  4| 9.0|  3|    0.0|  [1.0,4.0,9.0,3.0]|  0.0|
|  1|  5|12.0| 10|    1.0|[1.0,5.0,12.0,10.0]|  1.0|
|  1|  5|13.0| 10|    1.0|[1.0,5.0,13.0,10.0]|  1.0|
|  1|  6| 7.0|  8|    0.0|  [1.0,6.0,7.0,8.0]|  0.0|
|  1|  6| 8.0|  1|    0.0|  [1.0,6.0,8.0,1.0]|  0.0|
|  1|  6| 8.0|  3|    0.0|  [1.0,6.0,8.0,3.0]|

In [None]:
from pyspark.ml.classification import MultilayerPerceptronClassifier
# Layer sizes: 4 input features, two hidden layers of 6 units each, 2 output classes
layers = [4, 6, 6, 2]
mlp_classifier = MultilayerPerceptronClassifier(layers=layers, seed=1)
mlp_model = mlp_classifier.fit(train_df)
mlp_pred = mlp_model.transform(test_df)
mlp_pred.select(['features','label','prediction']).show(5)

+-------------------+-----+----------+
|           features|label|prediction|
+-------------------+-----+----------+
|  [1.0,2.0,9.0,1.0]|  0.0|       0.0|
|  [1.0,2.0,9.0,4.0]|  0.0|       0.0|
|  [1.0,4.0,9.0,6.0]|  0.0|       0.0|
|[1.0,4.0,13.0,10.0]|  1.0|       1.0|
|  [1.0,5.0,8.0,3.0]|  0.0|       0.0|
+-------------------+-----+----------+
only showing top 5 rows



In [None]:
mlp_eval = MulticlassClassificationEvaluator(metricName = 'accuracy')
mlp_accuracy = mlp_eval.evaluate(mlp_pred)
print('MLP accuracy is {:.3f}'.format(mlp_accuracy))

MLP accuracy is 0.972
