## Importing the libraries

In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession

In [2]:
spark = (SparkSession.builder
         .appName('Dog_Food')
         .getOrCreate())

## Preparing the data

### Importing the data

In [3]:
df = spark.read.csv('dog_food.csv', inferSchema=True, header=True)
df.show(5)

+---+---+----+---+-------+
|  A|  B|   C|  D|Spoiled|
+---+---+----+---+-------+
|  4|  2|12.0|  3|    1.0|
|  5|  6|12.0|  7|    1.0|
|  6|  2|13.0|  6|    1.0|
|  4|  2|12.0|  1|    1.0|
|  4|  2|12.0|  3|    1.0|
+---+---+----+---+-------+
only showing top 5 rows



In [4]:
df.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)



### Encoding the label

In [5]:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol='Spoiled', outputCol='label').fit(df)
output = indexer.transform(df)
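
By default `StringIndexer` assigns indices in descending order of label frequency, so the most common value of `Spoiled` becomes `0.0`. It is worth confirming which original value each index stands for. A minimal sketch for reading that mapping from the fitted model's `labels` attribute follows. `label_mapping` is a hypothetical helper name, and the sample list is an assumed illustration, not output from this dataset.

```python
def label_mapping(labels):
    """Map each original value to the index StringIndexer assigned it."""
    return {value: float(index) for index, value in enumerate(labels)}

# With the fitted model from the cell above: label_mapping(indexer.labels)
example = label_mapping(['0.0', '1.0'])  # assumed label order, for illustration only
print(example)
```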

### Preparing the vector

In [6]:
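from pyspark.ml.feature import VectorAssembler
# Caution: output.columns[:-1] drops only 'label', so 'Spoiled' (the target) is still an input column.
assembler = VectorAssembler(inputCols=output.columns[:-1], outputCol='features')
data = assembler.transform(output).select('features', 'label')
data.show(5)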

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[4.0,2.0,12.0,3.0...|  1.0|
|[5.0,6.0,12.0,7.0...|  1.0|
|[6.0,2.0,13.0,6.0...|  1.0|
|[4.0,2.0,12.0,1.0...|  1.0|
|[4.0,2.0,12.0,3.0...|  1.0|
+--------------------+-----+
only showing top 5 rows
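
Slicing `output.columns[:-1]` removes only the last column. A quick check in plain Python shows which names survive the slice, assuming the column order printed by `printSchema()` plus the `label` column appended by the indexer. The `feature_cols` list and the `assemble` helper are hypothetical names used for illustration and are not part of the original notebook.

```python
# Column order after StringIndexer: the original schema plus the new 'label' column.
columns = ['A', 'B', 'C', 'D', 'Spoiled', 'label']

# Dropping only the last entry keeps the target column 'Spoiled' among the inputs.
sliced = columns[:-1]
print(sliced)

# Naming the inputs explicitly excludes both target columns.
feature_cols = [c for c in columns if c not in ('Spoiled', 'label')]
print(feature_cols)


def assemble(df, input_cols=feature_cols):
    """Hypothetical helper: build 'features' from input_cols only."""
    from pyspark.ml.feature import VectorAssembler  # lazy import; requires pyspark
    assembler = VectorAssembler(inputCols=list(input_cols), outputCol='features')
    return assembler.transform(df).select('features', 'label')
```

Calling `assemble(output)` and refitting would yield a four-entry importance vector rather than the five-entry one recorded for the original run.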



## Fitting and predicting with Random Forest Classifier

In [7]:
from pyspark.ml.classification import RandomForestClassifier
model = RandomForestClassifier(numTrees=150).fit(data)
model.featureImportances

SparseVector(5, {0: 0.0013, 1: 0.0023, 2: 0.3516, 3: 0.0014, 4: 0.6435})

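The largest importance, about *64%*, sits at index 4. Because `output.columns[:-1]` drops only `label`, index 4 is the `Spoiled` column itself, so the model is reading the target from its own inputs. Among the four genuine features, *C* (index 2) is the most important at about *35%*; *A*, *B*, and *D* each contribute less than 1%.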
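
Pairing each index of the importance vector with its input column makes the ranking easier to read. The sketch below copies the values from the printed `SparseVector` and assumes the input order is `output.columns[:-1]`, that is `A, B, C, D, Spoiled`. The names `feature_names`, `importances`, and `ranked` are illustrative.

```python
# Input columns in the order they were passed to VectorAssembler (assumed from the notebook).
feature_names = ['A', 'B', 'C', 'D', 'Spoiled']

# Values copied from the printed SparseVector; with a live model,
# model.featureImportances.toArray() gives the same numbers as an array.
importances = {0: 0.0013, 1: 0.0023, 2: 0.3516, 3: 0.0014, 4: 0.6435}

# Sort (name, score) pairs from most to least important.
ranked = sorted(
    ((feature_names[i], score) for i, score in importances.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name}: {score:.4f}")
```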