# Random Forest Project
## 1) Brief
* This data shows how the 4 compounds (A, B, C and D) that make up a dog food recipe are linked to whether or not the dog food spoils too early
* We will fit a tree-based model (a single decision tree in the code below) to determine which of the 4 compounds is driving the early spoilage, and try to infer thresholds (i.e. % amounts) of each compound that are important to the outcome

## 2) Load Data

In [4]:
# load libs
import findspark

# store location of spark files
findspark.init('/home/matt/spark-3.0.2-bin-hadoop3.2')

# load libs
import pyspark
from pyspark.sql import SparkSession

# start new session
spark = SparkSession.builder.appName('tree').getOrCreate()

# read in data
df = spark.read.csv('Data/dog_food.csv', inferSchema=True, header=True)

# show schema
df.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)



In [5]:
# peek at data
df.show(3)

+---+---+----+---+-------+
|  A|  B|   C|  D|Spoiled|
+---+---+----+---+-------+
|  4|  2|12.0|  3|    1.0|
|  5|  6|12.0|  7|    1.0|
|  6|  2|13.0|  6|    1.0|
+---+---+----+---+-------+
only showing top 3 rows



## 3) Results
* Here we build a simple decision tree model which assesses the relationship between the features (the mix of compounds in the dog food) and the outcome (whether or not the dog food spoils early)
* This is a simple problem, so we can just look at the feature importances, i.e. which feature contributes most to splitting the data into groups that are as pure as possible with respect to the outcome (spoiled or not)
* Feature 2 (i.e. index position 2 = compound C) has an importance of about 0.98, far higher than any other feature, so compound C is by far the strongest predictor of spoilage and is the obvious candidate to reduce or remove to prevent early spoilage
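
To make "purity" concrete, here is a minimal sketch of how one tree split can be scored with Gini impurity and how a candidate threshold on a single feature could be picked. The function names (`gini`, `best_threshold`) and the toy values are illustrative assumptions. They are neither the dog food data nor Spark's internal implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) for the best split 'value <= threshold'."""
    n = len(values)
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    distinct = sorted(set(values))
    # candidate thresholds: midpoints between consecutive distinct values
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # impurity of the two children, weighted by their share of the rows
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best

# hypothetical toy values for one compound and its spoilage label
toy_c = [8.0, 9.0, 10.0, 12.0, 13.0, 14.0]
toy_spoiled = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(toy_c, toy_spoiled)
print(threshold, impurity)
```

In this toy case a single threshold separates the two classes perfectly, so the weighted impurity after the split is 0. A feature that allows splits like this ends up with a high importance score.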

In [6]:
# load libs for converting to spark-friendly format (i.e. features vector)
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

# create assembler to convert 4 feature cols into 1
assembler = VectorAssembler(inputCols=['A', 'B', 'C', 'D'], outputCol='features')

# create single features column
output = assembler.transform(df)

# load tree-based classifier libs
from pyspark.ml.classification import RandomForestClassifier, DecisionTreeClassifier

# create classifier (note: this is a single decision tree, not a random forest)
rfc = DecisionTreeClassifier(labelCol='Spoiled', featuresCol='features')

# select final data (features and outcomes only)
final_data = output.select('features','Spoiled')

# fit model to final data
rfc_model = rfc.fit(final_data)

# show feature importances
rfc_model.featureImportances

SparseVector(4, {1: 0.0019, 2: 0.9832, 3: 0.0149})
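
The importance vector is indexed by position in the assembler's `inputCols`, so it can be easier to read once each index is paired with its compound name. This is a minimal sketch using the values printed above and a hypothetical helper, `name_importances`. With a live model, the same entries could be taken from `rfc_model.featureImportances.toArray()` instead of being typed in by hand.

```python
def name_importances(feature_names, sparse_entries):
    """Pair each feature name with its importance.

    Indices that are absent from the sparse vector have an importance of 0.0.
    """
    named = {name: sparse_entries.get(i, 0.0) for i, name in enumerate(feature_names)}
    # sort from most to least important
    return sorted(named.items(), key=lambda kv: kv[1], reverse=True)

# same column order as the VectorAssembler inputCols
feature_names = ['A', 'B', 'C', 'D']
# non-zero entries copied from the SparseVector output above
sparse_entries = {1: 0.0019, 2: 0.9832, 3: 0.0149}

ranked = name_importances(feature_names, sparse_entries)
for name, importance in ranked:
    print(f"{name}: {importance:.4f}")
```

Compound A does not appear in the sparse vector, which means the fitted tree never split on it. To look for the % thresholds mentioned in the brief, `rfc_model.toDebugString` prints the split rules of the fitted tree.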