# Random Forest

Our task is to identify which of four preservatives (A, B, C, D) used in dog food is responsible for the food spoiling. Rather than predicting a label, we'll fit a random forest model and inspect its feature importances to find the most influential preservative.

In [1]:
import findspark
findspark.init("/home/bryan/Documents/Code/spark-2.4.5-bin-hadoop2.7")

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('rf_spoil').getOrCreate()

# EDA

In [3]:
data = spark.read.csv("data/dog_food.csv", inferSchema=True, header=True)

In [4]:
data.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)



In [6]:
data.show(3)

+---+---+----+---+-------+
|  A|  B|   C|  D|Spoiled|
+---+---+----+---+-------+
|  4|  2|12.0|  3|    1.0|
|  5|  6|12.0|  7|    1.0|
|  6|  2|13.0|  6|    1.0|
+---+---+----+---+-------+
only showing top 3 rows



In [8]:
for i in ['A','B','C','D']:
    print(i, data.corr(i, 'Spoiled'))

A 0.05997252035498859
B -0.08647446339982875
C 0.858620384785075
D -0.016066621878644233


> ### As a simple comparison, we've computed the Pearson correlation coefficient between each preservative and the label 'Spoiled'. Preservative C has by far the highest correlation coefficient at 0.86; the other three are all below 0.1 in absolute value.
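
The sign of a Pearson coefficient only gives the direction of the relationship, so strength is easier to compare by absolute value. A minimal sketch in plain Python, using the coefficients printed above (the names `correlations` and `ranked_by_strength` are illustrative and not part of the original notebook):

```python
# Correlation coefficients copied from the output above.
correlations = {
    "A": 0.05997252035498859,
    "B": -0.08647446339982875,
    "C": 0.858620384785075,
    "D": -0.016066621878644233,
}

# Sort by absolute value: the magnitude measures the strength of the linear
# relationship, while the sign only indicates its direction.
ranked_by_strength = sorted(
    correlations.items(), key=lambda kv: abs(kv[1]), reverse=True
)

for name, r in ranked_by_strength:
    print(f"{name}: r = {r:+.3f}")
```

Ranked this way, C is first and D is last, with B and A in between.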

In [10]:
from pyspark.ml.feature import VectorAssembler

In [16]:
assembler = VectorAssembler(inputCols=['A', 'B', 'C', 'D'], outputCol='features')
data = assembler.transform(data)

# Random Forest

In [19]:
from pyspark.ml.classification import RandomForestClassifier

In [47]:
rf = RandomForestClassifier(featuresCol='features', labelCol='Spoiled', numTrees=200, subsamplingRate=0.9, seed=7)

In [48]:
rf_fitted = rf.fit(data)

In [49]:
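# Pair each preservative with its importance.
# For reference, the ordering by (signed) correlation above was C, A, D, B.
zipped_importances = zip(['A', 'B', 'C', 'D'], rf_fitted.featureImportances)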

In [50]:
for i in zipped_importances:
    print(i)

('A', 0.020919018932629495)
('B', 0.018565829162076757)
('C', 0.9358252570207033)
('D', 0.024689894884590505)


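> ### Checking the feature importances after fitting a model on all of the data confirms our intuition and the correlation result: preservative C accounts for about 94% of the total importance. The ordering of the remaining preservatives differs from the correlations (D ranks second by importance, yet had the weakest correlation with 'Spoiled'), but A, B and D each contribute less than 3%, so those differences are negligible. Regardless, it's clear that preservative C is responsible for the spoilage.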
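
To make the comparison explicit, the importances can be sorted and checked against their total. This is a small sketch in plain Python using the values printed above; the variable names are illustrative and not part of the original notebook:

```python
# Feature importances copied from the output above.
importances = {
    "A": 0.020919018932629495,
    "B": 0.018565829162076757,
    "C": 0.9358252570207033,
    "D": 0.024689894884590505,
}

# Rank preservatives from most to least important.
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)

# The reported importances are normalized, so they should sum to about 1.
total = sum(importances.values())

for name, score in ranked:
    print(f"{name}: {score:.3f} ({score / total:.1%} of total)")
```

The gap between C and the runner-up is far larger than the gaps among A, B and D, which is why the ordering of those three carries little weight.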

# END