# Tree Methods - Dog Food Company Data

Aim - Identify which preservative in the dog food is causing it to spoil much sooner than intended.

Steps to follow: 

1. Create a Spark Session and load data
2. Check for missing values (if any are found, drop or fill them)
3. Check whether the data is in the format label, features (if not, build the features column with an assembler)
4. Import RandomForestClassifier and create an instance of it
5. Create a model by fitting the instance to the data
6. Use the trained model to check the feature importances (the most important feature is the likely cause of spoilage)

In [1]:
# Create a spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('dogfood').getOrCreate()

In [2]:
# Load data
data = spark.read.csv('dog_food.csv', inferSchema=True, header=True)

In [3]:
data.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)



In [4]:
data.head(1)

[Row(A=4, B=2, C=12.0, D=3, Spoiled=1.0)]

Each row records the percentage of preservatives A, B, C and D in one batch of dog food, and the Spoiled column records whether or not that batch spoiled.

In [5]:
# Check for any missing values
#data.describe().show()

from pyspark.sql.functions import isnan, isnull, when, count

# Count, per column, the values that are NaN or null
data.select([count(when(isnan(c) | isnull(c), c)).alias(c) for c in data.columns]).show()

+---+---+---+---+-------+
|  A|  B|  C|  D|Spoiled|
+---+---+---+---+-------+
|  0|  0|  0|  0|      0|
+---+---+---+---+-------+
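
The check above counts, for each column, the values that are either null or NaN; Spark treats the two as distinct, which is why both `isnan` and `isnull` are used. As a rough illustration of the same counting logic outside Spark, here is a minimal plain-Python sketch. The `rows` list and the `count_missing` helper are hypothetical and not part of the notebook.

```python
import math

def count_missing(rows, columns):
    """Count values per column that are None (null) or float NaN."""
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            value = row.get(c)
            if value is None or (isinstance(value, float) and math.isnan(value)):
                counts[c] += 1
    return counts

# Hypothetical rows: only the first one is taken from the notebook output
rows = [
    {"A": 4, "B": 2, "C": 12.0, "D": 3, "Spoiled": 1.0},
    {"A": None, "B": 5, "C": float("nan"), "D": 1, "Spoiled": 0.0},
]
missing = count_missing(rows, ["A", "B", "C", "D", "Spoiled"])
print(missing)
```

A result of all zeros, as in the Spark output above, means no rows need to be dropped or filled.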



In [6]:
# Assemble the preservative columns into a single features vector column
from pyspark.ml.feature import VectorAssembler

In [7]:
data.columns

['A', 'B', 'C', 'D', 'Spoiled']

In [8]:
assembler = VectorAssembler(inputCols=['A', 'B', 'C', 'D'], outputCol='features')

In [9]:
output = assembler.transform(data)

In [10]:
from pyspark.ml.classification import RandomForestClassifier

In [11]:
rfc = RandomForestClassifier(labelCol='Spoiled', featuresCol='features')

In [12]:
output.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)
 |-- features: vector (nullable = true)



In [13]:
final_data = output.select('features','Spoiled')
final_data.head()

Row(features=DenseVector([4.0, 2.0, 12.0, 3.0]), Spoiled=1.0)
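
`VectorAssembler` concatenates the listed input columns, in order, into a single vector column, which is the layout Spark ML estimators expect. Below is a minimal plain-Python sketch of that idea, using the first row shown above. The `assemble` helper is hypothetical and only mimics the behaviour for illustration.

```python
def assemble(row, input_cols):
    """Return the values of input_cols, in order, as a list of floats."""
    return [float(row[c]) for c in input_cols]

# First row of the dataset, as printed by data.head(1)
first_row = {"A": 4, "B": 2, "C": 12.0, "D": 3, "Spoiled": 1.0}

features = assemble(first_row, ["A", "B", "C", "D"])
label = first_row["Spoiled"]
print(features, label)
```

The position of each column in `inputCols` determines its index in the vector, which matters when reading `featureImportances`.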

In [14]:
rfc_model = rfc.fit(final_data)

In [15]:
rfc_model.featureImportances

SparseVector(4, {0: 0.0248, 1: 0.0316, 2: 0.9097, 3: 0.0339})

The feature at index 2 (preservative C) has by far the highest importance (about 0.91, against roughly 0.02 to 0.03 for each of the others), so preservative C is the most likely cause of the early spoilage.
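
To avoid matching indices to columns by eye, the importances can be paired with the column names passed to the assembler. Here is a minimal sketch, with the scores copied from the output above. The `rank_features` helper is hypothetical. With the fitted model, the scores could instead be taken from `rfc_model.featureImportances.toArray()`.

```python
def rank_features(names, scores):
    """Pair names with scores and sort from most to least important."""
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)

# Same order as the assembler's inputCols
input_cols = ["A", "B", "C", "D"]
# Scores copied from the featureImportances output shown above
importances = [0.0248, 0.0316, 0.9097, 0.0339]

ranked = rank_features(input_cols, importances)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```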

--------------------------------------------------------------------