# IST 718: Big Data Analytics

- Professor: Daniel Acuna <deacuna@syr.edu>

## General instructions:

- You are welcome to discuss the problems with your classmates but __you are not allowed to copy any part of your answers either from your classmates or from the internet__
- You can put the homework files anywhere you want in your http://jupyterhub.ischool.syr.edu/ workspace but _do not change_ the file names. The TAs and the professor use these names to grade your homework.
- Remove or comment out code that contains `raise NotImplementedError`. This is mainly to make the `assert` statement fail if nothing is submitted.
- The tests shown in some cells (i.e., `assert` and `np.testing.` statements) are used to grade your answers. **However, the professor and TAs will use __additional__ tests for your answers. Think about cases your code should handle beyond the tests you see, even if it passes all of them.**
- Before downloading and submitting your work through Blackboard, remember to save and press `Validate` (or go to `Kernel`$\rightarrow$`Restart and Run All`).
- Good luck!

In [43]:
# load these packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pyspark
from pyspark.ml import Pipeline, classification, evaluation, feature, regression
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as fn

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Part 1: Random Forest and gradient boosted trees

In these questions, we will examine the famous [Auto dataset](https://vincentarelbundock.github.io/Rdatasets/doc/ISLR/Auto.html). With this dataset, the goal is to predict the miles per gallon (`mpg`) of a car based on characteristics such as number of cylinders (`cylinders`), engine displacement (`displacement`), horsepower of the engine (`horsepower`), weight of the car (`weight`), time to accelerate from 0 to 60 mph (`acceleration`), model year (`year`), and origin (`origin`).

In [44]:
# read-only
mpg_df = spark.read.csv('Auto.csv', header=True, inferSchema=True).\
    drop('_c0').\
    withColumn('horsepower2', fn.col('horsepower').cast('int')).\
    drop('horsepower').\
    withColumnRenamed('horsepower2', 'horsepower').\
    dropna()
training_df, validation_df, testing_df = mpg_df.randomSplit([0.6, 0.3, 0.1], seed=0)
mpg_df.printSchema()

root
 |-- mpg: double (nullable = true)
 |-- cylinders: integer (nullable = true)
 |-- displacement: double (nullable = true)
 |-- weight: integer (nullable = true)
 |-- acceleration: double (nullable = true)
 |-- year: integer (nullable = true)
 |-- origin: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- horsepower: integer (nullable = true)
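
The three splits produced by `randomSplit` are only approximately 60/30/10 in size. A minimal pure-Python sketch of weighted random assignment shows why. This is an illustration under assumptions, not Spark's implementation: the helper `random_split`, the stand-in row ids, and the use of Python's `random` module are all hypothetical.

```python
import random
from itertools import accumulate

def random_split(rows, weights, seed=0):
    """Assign each row independently to one bucket with probability
    proportional to its weight (hypothetical helper, not Spark's code)."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries in [0, 1], roughly [0.6, 0.9, 1.0] for 0.6/0.3/0.1.
    bounds = [c / total for c in accumulate(weights)]
    buckets = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last bucket also catches any floating-point rounding gap.
            if u < bound or i == len(bounds) - 1:
                buckets[i].append(row)
                break
    return buckets

rows = list(range(1000))  # made-up row ids standing in for DataFrame rows
train, valid, test = random_split(rows, [0.6, 0.3, 0.1], seed=0)
print(len(train), len(valid), len(test))
```

Because every row is assigned independently, the bucket sizes fluctuate around the requested proportions, and fixing the seed makes the assignment repeatable.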



# Question 1: (10 pts)

Create three pipelines that contain three different random forests that take in all features from `mpg_df` (`cylinders`, `displacement`, `horsepower`, `weight`, `acceleration`, `year`, and `origin`) to predict (`mpg`). **Set the `seed` parameter of the random forest to 0.** Fit these pipelines to the training data (`training_df`):

- `pipe_rf1`: Random forest with `maxDepth=1` and `numTrees=60`
- `pipe_rf2`: Random forest with `maxDepth=3` and `numTrees=40`
- `pipe_rf3`: Random forest with `maxDepth=6`, `numTrees=20`

In [45]:
# create the fitted pipelines `pipe_rf1`, `pipe_rf2`, and `pipe_rf3` here
feature_cols = ['cylinders', 'displacement', 'horsepower', 'weight', 'year', 'acceleration', 'origin']

def fit_rf_pipeline(max_depth, num_trees):
    # assemble the feature vector, then fit a random forest with the given depth and tree count
    return Pipeline(stages=[
        feature.VectorAssembler(inputCols=feature_cols, outputCol='features'),
        regression.RandomForestRegressor(labelCol='mpg', featuresCol='features',
                                         maxDepth=max_depth, numTrees=num_trees, seed=0)
    ]).fit(training_df)
pipe_rf1 = fit_rf_pipeline(max_depth=1, num_trees=60)
pipe_rf2 = fit_rf_pipeline(max_depth=3, num_trees=40)
pipe_rf3 = fit_rf_pipeline(max_depth=6, num_trees=20)
#raise NotImplementedError()

In [46]:
# tests for 10 pts
np.testing.assert_equal(type(pipe_rf1.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(pipe_rf2.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(pipe_rf3.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(pipe_rf1.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(pipe_rf2.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(pipe_rf3.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(pipe_rf1.transform(training_df)), pyspark.sql.dataframe.DataFrame)
np.testing.assert_equal(type(pipe_rf2.transform(training_df)), pyspark.sql.dataframe.DataFrame)
np.testing.assert_equal(type(pipe_rf3.transform(training_df)), pyspark.sql.dataframe.DataFrame)

# Question 2 (10 pts)

Use the following evaluator to compute the $R^2$ of the models on the validation data. Assign the $R^2$ of the three models to `R2_1`, `R2_2`, and `R2_3`, respectively, and print each value. Assign the best pipeline based on validation performance to a variable `best_model`.

In [47]:
evaluator = evaluation.RegressionEvaluator(labelCol='mpg', metricName='r2')
# use it as follows:
#   evaluator.evaluate(fitted_pipeline.transform(df)) -> R2
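
As a reminder of what this metric measures, here is a minimal sketch of the usual definition $R^2 = 1 - SS_{res}/SS_{tot}$ in plain Python. The function name `r_squared` and the mpg values are made up for illustration; the evaluator is assumed to follow the same definition.

```python
def r_squared(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

labels = [18.0, 24.0, 30.0, 36.0]        # made-up mpg values
perfect = list(labels)                   # predictions equal to the labels
baseline = [27.0] * 4                    # always predict the mean label
reversed_preds = [36.0, 30.0, 24.0, 18.0]

print(r_squared(labels, perfect))         # 1.0
print(r_squared(labels, baseline))        # 0.0
print(r_squared(labels, reversed_preds))  # -3.0
```

A model that predicts the mean label scores 0, a perfect model scores 1, and a model worse than the mean goes negative, which is why the tests in this notebook check that each $R^2$ lies between 0.5 and 1.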

In [48]:
R2_1 = evaluator.evaluate(pipe_rf1.transform(validation_df))
print(R2_1)
R2_2 = evaluator.evaluate(pipe_rf2.transform(validation_df))
print(R2_2)
R2_3 = evaluator.evaluate(pipe_rf3.transform(validation_df))
print(R2_3)
best_model = pipe_rf3
#raise NotImplementedError()

0.6356640531609501
0.8383814052081987
0.8818386210391845
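
Rather than reading the best model off the printed numbers, the selection can be made programmatically. The sketch below uses the three validation scores printed above; the dictionary `validation_r2` is a hypothetical stand-in, and in the notebook the keys could map to the fitted pipelines themselves.

```python
# Validation R^2 values copied from the printed output above.
validation_r2 = {
    "pipe_rf1": 0.6356640531609501,
    "pipe_rf2": 0.8383814052081987,
    "pipe_rf3": 0.8818386210391845,
}

# Pick the key whose score is largest.
best_name = max(validation_r2, key=validation_r2.get)
print(best_name)  # pipe_rf3
```

Selecting with `max` keeps `best_model` correct if the data, seed, or hyperparameters change and the ranking shifts.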


In [49]:
# tests for 10 pts
np.testing.assert_equal(type(best_model.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(best_model.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(best_model.transform(validation_df)), pyspark.sql.dataframe.DataFrame)
np.testing.assert_array_less(R2_1, 1.)
np.testing.assert_array_less(0.5, R2_1)
np.testing.assert_array_less(R2_2, 1.)
np.testing.assert_array_less(0.5, R2_2)
np.testing.assert_array_less(R2_3, 1.)
np.testing.assert_array_less(0.5, R2_3)

# Question 3: 5 pts

Compute the $R^2$ of the model on testing data, print it, and assign it to variable `R2_best`

In [50]:
# create R2_best below
R2_best = evaluator.evaluate(best_model.transform(testing_df))
R2_best
#raise NotImplementedError()

0.7871659333265015

In [51]:
# tests for 5 pts
np.testing.assert_array_less(R2_best, 1.)
np.testing.assert_array_less(0.5, R2_best)

# Question 4: 5 pts

Using the parameters of the best model, create a new pipeline called `final_model` and fit it to the entire data (`mpg_df`)

In [52]:
# create the fitted pipeline `final_model` here
final_model = Pipeline(stages=[feature.VectorAssembler(inputCols=['cylinders', 'displacement','horsepower','weight', 'year', 'acceleration','origin'],
                           outputCol='features'), regression.RandomForestRegressor(labelCol = 'mpg',featuresCol= 'features',maxDepth=6, numTrees = 20, seed =0)]).fit(mpg_df)
#raise NotImplementedError()

In [53]:
# tests for 5 pts
np.testing.assert_equal(type(final_model.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(final_model.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(final_model.transform(mpg_df)), pyspark.sql.dataframe.DataFrame)

# Question 5: 10 pts

Create a pandas dataframe `feature_importance` with the columns `feature` and `importance` which contains the names of the features (`cylinders`, `displacement`, etc.) and their feature importances as determined by the random forest of the final model. Sort the dataframe by `importance` in descending order.

In [54]:
# create feature_importance below
feature_importance = pd.DataFrame(list(zip(['cylinders', 'displacement','horsepower','weight', 'year', 'acceleration','origin'], final_model.stages[-1].featureImportances.toArray())),
            columns = ['feature', 'importance']).sort_values('importance', ascending=False)
#raise NotImplementedError()

In [55]:
# display it here
feature_importance

| | feature | importance |
|---|---|---|
| 1 | displacement | 0.37223 |
| 2 | horsepower | 0.16574 |
| 3 | weight | 0.149193 |
| 4 | year | 0.140149 |
| 0 | cylinders | 0.135399 |
| 5 | acceleration | 0.030344 |
| 6 | origin | 0.006946 |


In [56]:
# tests for 10 pts
assert type(feature_importance) == pd.core.frame.DataFrame
np.testing.assert_array_equal(list(feature_importance.columns), ['feature', 'importance'])

**(5 pts)** Comment below on the importance that random forest has given to each feature. Are they reasonable? Do they tell you anything valuable about the mpg dataset? Answer in the cell below

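The importances assigned by the random forest are reasonable. Displacement is by far the most important feature (about 0.37), followed by horsepower (0.17), weight (0.15), year (0.14), and cylinders (0.14), while acceleration (0.03) and origin (0.01) contribute very little. This suggests that mpg is driven mainly by engine size, power, and vehicle weight, with the model year also mattering; once these are known, the origin adds almost nothing. Because displacement, horsepower, weight, and cylinders tend to be correlated, their importance is likely shared among them, so the exact ranking within that group should not be over-interpreted.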
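
To put numbers on this comment, the following sketch ranks the importances and adds up the share of the top three features. The values are copied from the table displayed above; the variable names are hypothetical.

```python
# (feature, importance) pairs copied from the table displayed above.
importances = [
    ("origin", 0.006946),
    ("acceleration", 0.030344),
    ("cylinders", 0.135399),
    ("year", 0.140149),
    ("weight", 0.149193),
    ("horsepower", 0.16574),
    ("displacement", 0.37223),
]

# Rank from most to least important.
ranked = sorted(importances, key=lambda pair: pair[1], reverse=True)
top3 = ranked[:3]
top3_share = sum(value for _, value in top3)

print([name for name, _ in top3])
print(round(top3_share, 3))
```

The importances sum to roughly 1, so the top-three total can be read as the fraction of the forest's total importance carried by displacement, horsepower, and weight.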

# Question 6: 5 pts

Pick any of the trees from the final model and assign its `toDebugString` property to a variable `example_tree`. Print this variable and add comments to the cell describing how you think this particular tree is fitting the data

In [57]:
# create a variable example_tree with the toDebugString property of a tree from final_model.
# print this string and comment in this same cell about the branches that this tree fit
example_tree = final_model.stages[-1].trees[0].toDebugString
print(example_tree) 
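# Feature indices follow the VectorAssembler order: 0=cylinders, 1=displacement, 2=horsepower,
# 3=weight, 4=year, 5=acceleration, 6=origin.
# The root splits on displacement (feature 1 <= 159.5); deeper nodes split on year, horsepower,
# acceleration, and weight. Each leaf predicts a constant mpg for the cars that reach it, e.g. cars with
# small displacement, model year 79 or earlier, low horsepower, and low weight get about 30.7 mpg.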
#raise NotImplementedError()

DecisionTreeRegressionModel: uid=dtr_c1b6fce1d827, depth=6, numNodes=101, numFeatures=7
  If (feature 1 <= 159.5)
   If (feature 4 <= 79.5)
    If (feature 2 <= 71.5)
     If (feature 5 <= 19.55)
      If (feature 5 <= 17.05)
       If (feature 3 <= 2188.5)
        Predict: 30.7375
       Else (feature 3 > 2188.5)
        Predict: 24.99999999999997
      Else (feature 5 > 17.05)
       If (feature 5 <= 18.75)
        Predict: 35.971428571428575
       Else (feature 5 > 18.75)
        Predict: 29.933333333333337
     Else (feature 5 > 19.55)
      If (feature 3 <= 1982.5)
       Predict: 26.0
      Else (feature 3 > 1982.5)
       Predict: 24.5
    Else (feature 2 > 71.5)
     If (feature 1 <= 119.5)
      If (feature 5 <= 13.75)
       If (feature 4 <= 73.5)
        Predict: 18.0
       Else (feature 4 > 73.5)
        Predict: 21.5
      Else (feature 5 > 13.75)
       If (feature 5 <= 14.850000000000001)
        Predict: 27.433333333333334
       Else (feature 5 > 14.850000000000001)
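
The first branches of the printed tree can be read as nested conditions. The sketch below transcribes only the part of the tree under `feature 2 <= 71.5` shown above and returns `None` everywhere else. It assumes the feature indices follow the `VectorAssembler` column order; the function name and the example car are hypothetical.

```python
def predict_shown_branch(car):
    """Follow the first root-to-leaf branches of the printed tree.
    Returns None for branches that are not transcribed here."""
    if car["displacement"] <= 159.5:                  # feature 1
        if car["year"] <= 79.5:                       # feature 4
            if car["horsepower"] <= 71.5:             # feature 2
                if car["acceleration"] <= 19.55:      # feature 5
                    if car["acceleration"] <= 17.05:
                        if car["weight"] <= 2188.5:   # feature 3
                            return 30.7375
                        return 24.99999999999997
                    if car["acceleration"] <= 18.75:
                        return 35.971428571428575
                    return 29.933333333333337
                if car["weight"] <= 1982.5:
                    return 26.0
                return 24.5
    return None

# A made-up small, light, low-powered car from the mid 1970s.
car = {"displacement": 97.0, "year": 75, "horsepower": 60,
       "acceleration": 16.0, "weight": 2000}
print(predict_shown_branch(car))  # 30.7375
```

Each leaf value is a constant prediction for every car that satisfies the chain of conditions leading to it, and the forest's final prediction averages such values over all of its trees.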


In [58]:
# tests for 5 points
assert type(example_tree) == str
assert 'DecisionTreeRegressionModel' in example_tree
assert 'feature 0' in example_tree
assert 'If' in example_tree
assert 'Else' in example_tree
assert 'Predict' in example_tree

# Question 7: 5 pts

Gradient boosted trees are becoming increasingly popular for competitions. There is a high-performance implementation, [xgboost](https://en.wikipedia.org/wiki/XGBoost), that is particularly popular. Compare gradient boosted regression (GBR) to the best random forest model, `best_model` from Question 2. Use the validation set. For GBR, use all the default parameters except `seed=0`. Assign the fitted pipeline and its $R^2$ to `gbr_pipe` and `R2_gbr`, respectively. Does it have an amazing or disappointing $R^2$? Comment.

In [59]:
gbr_pipe = Pipeline(stages=[feature.VectorAssembler(inputCols=['cylinders', 'displacement','horsepower','weight', 'year', 'acceleration','origin'],
                           outputCol='features'), regression.GBTRegressor(labelCol = 'mpg',featuresCol= 'features', seed =0)]).fit(training_df)
R2_gbr = evaluator.evaluate(gbr_pipe.transform(validation_df))
#raise NotImplementedError()

In [60]:
# test your models here
print("Performance of best RF: ", evaluator.evaluate(best_model.transform(validation_df)))
print("Performance of GBR: ", R2_gbr)

Performance of best RF:  0.8818386210391845
Performance of GBR:  0.8408850917598597


The validation $R^2$ is higher for the random forest (0.882) than for GBR (0.841). The random forest with `maxDepth=6` and `numTrees=20` therefore outperforms gradient boosted regression with default parameters on this data, so the GBR result is somewhat disappointing. Note that GBR was not tuned here, so this compares a selected random forest against an untuned GBR.
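
For intuition about what gradient boosting does differently from a random forest, here is a minimal sketch of boosting for squared error with one-split trees (stumps) on a single feature. It is a teaching illustration, not Spark's `GBTRegressor`: the helper names, the toy data, and the choice of 20 rounds with step size 0.1 are all assumptions.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Least-squares decision stump on one feature.
    Returns (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = mean(left), mean(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def boost(xs, ys, n_rounds=20, step=0.1):
    """Start from the mean, then repeatedly fit a stump to the current
    residuals and add a shrunken copy of it to the running prediction."""
    preds = [mean(ys)] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        preds = [p + step * (lv if x <= t else rv) for x, p in zip(xs, preds)]
    return preds

def sse(ys, preds):
    """Sum of squared errors."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))

# Made-up toy data: heavier cars (arbitrary units) get lower mpg.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
mpg = [30.0, 28.0, 27.0, 24.0, 20.0, 18.0, 15.0, 14.0]

baseline = [mean(mpg)] * len(mpg)
boosted = boost(weights, mpg)
print(sse(mpg, baseline), sse(mpg, boosted))
```

A random forest averages trees that are grown independently, whereas boosting grows trees sequentially so that each one corrects the errors left by the previous ones. That makes boosting sensitive to the number of rounds, the step size, and the tree depth, which is one reason default settings may trail a selected random forest.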

In [61]:
# tests for 5 pts
np.testing.assert_equal(type(gbr_pipe.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(gbr_pipe.stages[1]), regression.GBTRegressionModel)
np.testing.assert_equal(type(gbr_pipe.transform(validation_df)), pyspark.sql.dataframe.DataFrame)
np.testing.assert_array_less(R2_gbr, 1.)
np.testing.assert_array_less(0.5, R2_gbr)