# PySpark DecisionTrees

- <a href='https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier'>Link</a> to `DecisionTreeClassifier`
- <a href='https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier'>Link</a> to `RandomForestClassifier`
- <a href='https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier'>Link</a> to `GBTClassifier`


## Content
1. [Predict if a college is private or public](#coll)
2. [Feature importance](#importance)

In [1]:
# find spark
import findspark
findspark.init()

In [2]:
# imports
from pyspark.sql import SparkSession
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier, GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler, StringIndexer

In [3]:
# init spark session
spark = SparkSession.builder.appName('trees').getOrCreate()

<a id='coll'></a>
## 1. Predict if a college is private or public

In [4]:
# load data
data = spark.read.csv('data/College.csv', inferSchema=True, header=True)

In [5]:
data.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)



In [6]:
data.show()

+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+
|              School|Private|Apps|Accept|Enroll|Top10perc|Top25perc|F_Undergrad|P_Undergrad|Outstate|Room_Board|Books|Personal|PhD|Terminal|S_F_Ratio|perc_alumni|Expend|Grad_Rate|
+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+--------+---+--------+---------+-----------+------+---------+
|Abilene Christian...|    Yes|1660|  1232|   721|       23|       52|       2885|        537|    7440|      3300|  450|    2200| 70|      78|     18.1|         12|  7041|       60|
|  Adelphi University|    Yes|2186|  1924|   512|       16|       29|       2683|       1227|   12280|      6450|  750|    1500| 29|      30|     12.2|         16| 10527|       56|
|      Adrian College|    Yes|1428|  1097|   336|       22|       50|       1036|         99|  

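The label column `Private` is a string (`Yes`/`No`), so we encode it as a numeric index with `StringIndexer`. The columns used as features are already numeric.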

In [7]:
# StringIndexer on Private
indexer = StringIndexer(inputCol='Private', outputCol='Privateidx')
model = indexer.fit(data)
data = model.transform(data)

In [8]:
# by default StringIndexer gives index 0.0 to the most frequent label: here Yes -> 0.0 and No -> 1.0
data.select('Private', 'Privateidx').show()

+-------+----------+
|Private|Privateidx|
+-------+----------+
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
|    Yes|       0.0|
|     No|       1.0|
+-------+----------+
only showing top 20 rows
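
The mapping in this output follows from `StringIndexer`'s default ordering, `frequencyDesc`: the most frequent label gets index 0.0. A minimal plain-Python sketch of that rule is given below. `index_labels` is a hypothetical helper written for illustration (it is not part of PySpark), the toy column is made up, and the alphabetical tie-break is an assumption.

```python
from collections import Counter

def index_labels(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; break ties alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up toy column with more 'Yes' than 'No', mirroring the mapping seen for Private.
mapping = index_labels(['Yes', 'Yes', 'No', 'Yes', 'No'])
print(mapping)
```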



In [9]:
# create feature vector
assembler = VectorAssembler(inputCols=['Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc', 'F_Undergrad', 'P_Undergrad',
                                       'Outstate', 'Room_Board', 'Books', 'Personal', 'PhD', 'Terminal', 'S_F_Ratio',
                                       'perc_alumni', 'Expend', 'Grad_Rate'],
                            outputCol='features')

data = assembler.transform(data)

# only select the relevant columns
final_data = data.select('features', 'Privateidx')

# split data into train and test (no seed is set, so the split differs between runs)
train, test = final_data.randomSplit([0.7, 0.3])
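
`randomSplit` assigns rows at random, so the split sizes are only approximately 70/30 and, without a seed, change from run to run. The plain-Python sketch below illustrates the idea of a seeded split. `random_split` is a hypothetical helper, not Spark's implementation, and the seed value 42 is arbitrary.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split by drawing a uniform number per row."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, roughly [0.7, 1.0].
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(100))
train_a, test_a = random_split(rows, [0.7, 0.3], seed=42)
train_b, test_b = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_a), len(test_a))
```

In the notebook, the equivalent would be passing a seed to the existing call, for example `final_data.randomSplit([0.7, 0.3], seed=42)`.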

In [10]:
# create models
dt = DecisionTreeClassifier(labelCol='Privateidx')    
rf = RandomForestClassifier(labelCol='Privateidx')
gbc = GBTClassifier(labelCol='Privateidx')

In [11]:
# fit on training data
model_dt = dt.fit(train)
model_rf = rf.fit(train)
model_gbc = gbc.fit(train)

In [12]:
# make predictions
preds_dt = model_dt.transform(test)
preds_rf = model_rf.transform(test)
preds_gbc = model_gbc.transform(test)

In [21]:
model_list = [preds_dt, preds_rf, preds_gbc]

# show() prints the rows itself and returns None, so it is not wrapped in print()
for preds in model_list:
    preds.show(10)

+--------------------+----------+-------------+--------------------+----------+
|            features|Privateidx|rawPrediction|         probability|prediction|
+--------------------+----------+-------------+--------------------+----------+
|[81.0,72.0,51.0,3...|       0.0|  [294.0,1.0]|[0.99661016949152...|       0.0|
|[141.0,118.0,55.0...|       0.0|  [294.0,1.0]|[0.99661016949152...|       0.0|
|[152.0,128.0,75.0...|       0.0|  [294.0,1.0]|[0.99661016949152...|       0.0|
|[212.0,197.0,91.0...|       0.0|  [294.0,1.0]|[0.99661016949152...|       0.0|
|[232.0,182.0,99.0...|       0.0|    [2.0,4.0]|[0.33333333333333...|       1.0|
|[244.0,198.0,82.0...|       0.0|  [294.0,1.0]|[0.99661016949152...|       0.0|
|[268.0,253.0,103....|       0.0|  [294.0,1.0]|[0.99661016949152...|       0.0|
|[279.0,276.0,126....|       0.0|    [2.0,4.0]|[0.33333333333333...|       1.0|
|[291.0,245.0,126....|       0.0|  [294.0,1.0]|[0.99661016949152...|       0.0|
|[292.0,241.0,96.0...|       0.0|  [294.
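
In this first table (the decision tree predictions), `rawPrediction` holds the class counts of the training rows in the leaf that a test row falls into, `probability` is that vector normalized to sum to 1, and `prediction` is the index of the largest entry. The sketch below redoes this arithmetic for the two leaf vectors visible in the output. `leaf_to_prediction` is a hypothetical helper, not a PySpark function.

```python
def leaf_to_prediction(raw_counts):
    """Normalize leaf class counts to probabilities and pick the top class."""
    total = sum(raw_counts)
    probability = [count / total for count in raw_counts]
    # The predicted class is the index with the highest probability.
    prediction = float(probability.index(max(probability)))
    return probability, prediction

# The two rawPrediction vectors that appear in the printed rows.
for raw in ([294.0, 1.0], [2.0, 4.0]):
    probability, prediction = leaf_to_prediction(raw)
    print(raw, probability, prediction)
```

The leaf with counts `[2.0, 4.0]` is the one behind the two rows predicted as 1.0 although their label is 0.0.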

In [30]:
# create evaluator
evaluator = MulticlassClassificationEvaluator(labelCol='Privateidx', predictionCol='prediction',
                                              metricName='accuracy')

# evaluate the test-set predictions of each model
model_names = ['DecisionTree', 'RandomForest', 'Gradient-boosted tree']
for preds, name in zip(model_list, model_names):
    accuracy = evaluator.evaluate(preds)
    print(name, ':')
    print(accuracy)
    print('#'*30)

DecisionTree :
0.9421487603305785
##############################
RandomForest :
0.9297520661157025
##############################
Gradient-boosted tree :
0.9545454545454546
##############################


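On this test split the gradient-boosted tree scored best (accuracy ≈ 0.955), ahead of the decision tree (≈ 0.942) and the random forest (≈ 0.930). The split is random and unseeded, so these numbers will vary between runs.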
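
With `metricName='accuracy'` the evaluator reports the share of test rows whose `prediction` equals the label. A plain-Python sketch of that metric follows. `accuracy` is a hypothetical helper, and the counts are an assumption: the notebook does not show the test-set size, but the three reported values are consistent with 228, 225 and 231 correct predictions out of 242 rows.

```python
def accuracy(labels, predictions):
    """Share of positions where the prediction equals the label."""
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / len(labels)

# Tiny made-up example: 3 of 4 predictions match.
toy_accuracy = accuracy([0.0, 0.0, 1.0, 1.0], [0.0, 1.0, 1.0, 1.0])
print(toy_accuracy)

# Assumed counts (not shown in the notebook) that reproduce the reported values.
assumed_correct = {'DecisionTree': 228, 'RandomForest': 225, 'Gradient-boosted tree': 231}
assumed_total = 242
for name, correct in assumed_correct.items():
    print(name, correct / assumed_total)
```

Under that assumption the models differ by only three to six test rows, so the ranking could easily change with a different split.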

<a id='importance'></a>
## 2. Feature Importance
A dog food company hired us to identify which of the chemicals in its product lead to spoilage.

Task: get the feature importance of each chemical

Data:
- A: share of chemical a
- B: share of chemical b
- C: share of chemical c
- D: share of chemical d
- Spoiled: indicates whether the pack of dog food has gone bad.

In [13]:
data = spark.read.csv('data/dog_food.csv', inferSchema=True, header=True)

In [5]:
data.printSchema()

root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)



In [8]:
data.show(5)

+---+---+----+---+-------+
|  A|  B|   C|  D|Spoiled|
+---+---+----+---+-------+
|  4|  2|12.0|  3|    1.0|
|  5|  6|12.0|  7|    1.0|
|  6|  2|13.0|  6|    1.0|
|  4|  2|12.0|  1|    1.0|
|  4|  2|12.0|  3|    1.0|
+---+---+----+---+-------+
only showing top 5 rows



In [14]:
assembler = VectorAssembler(inputCols=['A', 'B', 'C', 'D'],
                            outputCol='features')
data = assembler.transform(data)

In [17]:
final_data = data.select('features', 'Spoiled')

In [19]:
final_data.show(5)

+------------------+-------+
|          features|Spoiled|
+------------------+-------+
|[4.0,2.0,12.0,3.0]|    1.0|
|[5.0,6.0,12.0,7.0]|    1.0|
|[6.0,2.0,13.0,6.0]|    1.0|
|[4.0,2.0,12.0,1.0]|    1.0|
|[4.0,2.0,12.0,3.0]|    1.0|
+------------------+-------+
only showing top 5 rows
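
`VectorAssembler` concatenates the listed input columns, in order, into a single numeric vector per row, which is why the integer columns appear as floats in `features`. A plain-Python sketch of that step for the first row of the data follows. `assemble` is a hypothetical helper, not the PySpark API.

```python
def assemble(row, input_cols):
    """Collect the given columns of a row, in order, into one list of floats."""
    return [float(row[col]) for col in input_cols]

# First row of the dog food data as printed by data.show(5).
first_row = {'A': 4, 'B': 2, 'C': 12.0, 'D': 3, 'Spoiled': 1.0}
features = assemble(first_row, ['A', 'B', 'C', 'D'])
print(features)
```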



In [20]:
# define random forest and fit it on all rows (no train/test split, since we only need the feature importances)
rf = RandomForestClassifier(labelCol='Spoiled')
model_rf = rf.fit(final_data)

In [23]:
model_rf.featureImportances

SparseVector(4, {0: 0.0185, 1: 0.0175, 2: 0.9449, 3: 0.0191})
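
The indices in the `SparseVector` follow the order of `inputCols` passed to the `VectorAssembler`, so index 2 is chemical C, and the importances are normalized to sum to 1. The sketch below pairs the printed importances with their feature names and ranks them. The dictionary is copied from the output, and `rank_importances` is a hypothetical helper.

```python
def rank_importances(importances, feature_names):
    """Return (name, importance) pairs sorted from most to least important."""
    pairs = [(feature_names[i], value) for i, value in importances.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Values copied from the printed SparseVector.
importances = {0: 0.0185, 1: 0.0175, 2: 0.9449, 3: 0.0191}
ranked = rank_importances(importances, ['A', 'B', 'C', 'D'])
for name, value in ranked:
    print(name, value)
```

With the fitted model at hand, `model_rf.featureImportances.toArray()` should give the same values as a dense array that can be zipped with the column names directly.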

### Answer
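Chemical C has by far the highest feature importance (≈ 0.94, against less than 0.02 for each of A, B and D), so it is the chemical most strongly associated with the spoiled dog food.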

## Resources
- Udemy course 