# Feature Importance Using Random Forest Classification & PySpark

Use a Random Forest classifier to find which chemical preservative has the strongest effect on whether a dog food batch spoils. The dataset contains:

* Pres_A: Percentage of preservative A in the mix (read as column `A` in the code below)
* Pres_B: Percentage of preservative B in the mix (column `B`)
* Pres_C: Percentage of preservative C in the mix (column `C`)
* Pres_D: Percentage of preservative D in the mix (column `D`)
* Spoiled: Label indicating whether or not the dog food batch was spoiled

In [0]:
# Start a Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('dog').getOrCreate()

In [0]:
# Load the data file, inferring column types and reading the header row
data = spark.read.csv('dbfs:/FileStore/shared_uploads/hrishagni95@gmail.com/dog_food-1.csv', inferSchema=True, header=True)

In [0]:
# Preview the first rows of the dataset
data.show()

In [0]:
# Import RandomForestClassifier from pyspark
from pyspark.ml.classification import RandomForestClassifier

In [0]:
# Import VectorAssembler from pyspark
from pyspark.ml.feature import VectorAssembler

In [0]:
# Create a VectorAssembler that combines the four preservative columns into one vector column
assembler = VectorAssembler(inputCols=['A', 'B', 'C', 'D'], outputCol='features')

In [0]:
# Transform the data to add the 'features' column used for classification
output = assembler.transform(data)
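
Conceptually, the assembler concatenates the listed columns, in the order given by `inputCols`, into one vector per row. That order is what later determines which index in the importance vector belongs to which preservative. The sketch below is plain Python, not the PySpark API; the `assemble` helper and the sample row values are hypothetical and only illustrate the idea.

```python
# Plain-Python illustration of what VectorAssembler does per row.
# This is NOT PySpark code; `assemble` is a hypothetical helper.
INPUT_COLS = ['A', 'B', 'C', 'D']

def assemble(row, input_cols=INPUT_COLS):
    """Return the row's feature values as a list, ordered like input_cols."""
    return [float(row[c]) for c in input_cols]

# Hypothetical row; the values are made up for illustration.
sample_row = {'A': 4, 'B': 2, 'C': 12.0, 'D': 3, 'Spoiled': 1.0}
features = assemble(sample_row)
print(features)  # [4.0, 2.0, 12.0, 3.0]
```

Index 0 of the vector holds `A`, index 1 holds `B`, and so on, so index 2 corresponds to `C`.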

In [0]:
# Create a RandomForestClassifier that predicts 'Spoiled' from the 'features' vector
rfc = RandomForestClassifier(labelCol='Spoiled', featuresCol='features')

In [0]:
# Inspect the schema of the assembled DataFrame
output.printSchema()

In [0]:
# Keep only the features vector and the label column
final_data = output.select('features', 'Spoiled')

In [0]:
# Preview the data that will be passed to the classifier
final_data.show()

In [0]:
# Fit the RandomForestClassifier on the selected data
rfc_model = rfc.fit(final_data)

In [0]:
# Display the feature importances (one value per input column, in inputCols order)
rfc_model.featureImportances
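
Spark documents `featureImportances` as each feature's importance averaged across the trees of the ensemble, with the resulting vector normalized to sum to 1. The sketch below mimics that aggregation in plain Python. It assumes the per-tree importance scores are already known; the `per_tree_gains` numbers are hypothetical and are not taken from this model, and the helper functions are illustrative, not part of PySpark.

```python
def normalize(values):
    """Scale values so they sum to 1 (returned unchanged if the total is 0)."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)

def forest_importances(per_tree):
    """Sum the normalized per-tree importances, then renormalize to sum to 1."""
    n_features = len(per_tree[0])
    summed = [0.0] * n_features
    for tree in per_tree:
        for i, v in enumerate(normalize(tree)):
            summed[i] += v
    return normalize(summed)

# Hypothetical per-tree scores for features A, B, C, D (made up for illustration).
per_tree_gains = [
    [1.0, 0.0, 8.0, 1.0],
    [0.0, 2.0, 6.0, 2.0],
]
importances = forest_importances(per_tree_gains)
print(importances)
```

Because the result is normalized, the values are relative shares: they say how much each feature contributed compared with the others in this model, not how large the effect is in absolute terms.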

In [0]:
# The feature at index 2 (zero-based), i.e. preservative C, has the highest importance, so it is the preservative most associated with spoiled batches in this model
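
Reading indices off the importance vector is easy to get wrong, so it can help to pair each value with its column name. The sketch below uses a hypothetical helper, `rank_importances`, and made-up importance values; in the notebook it would presumably be called with `rfc_model.featureImportances.toArray()` and the same column list passed to the assembler.

```python
def rank_importances(names, importances):
    """Pair each feature name with its importance, highest first."""
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)

# In the notebook (assuming rfc_model from the cells above):
#   rank_importances(['A', 'B', 'C', 'D'], rfc_model.featureImportances.toArray())
# Hypothetical values below, NOT the model's actual output:
example = rank_importances(['A', 'B', 'C', 'D'], [0.02, 0.03, 0.93, 0.02])
for name, score in example:
    print(name, score)
```

The names must be listed in the same order as `inputCols`, since the importance vector follows that order.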