# IST 718: Big Data Analytics

- Prepared by: Prof Daniel E Acuna <deacuna@syr.edu>
- Modified by: Prof Humayun Khan <hhkhan@syr.edu>
- Faculty Assistant: Rashika Pramod Singh <rsingh37@syr.edu>
- Faculty Assistant: Rohan Nitin Mahajan <rmahaj01@syr.edu>


## General instructions:

- __In general, use Spark, Spark machine learning, Spark data frames, RDDs, and MapReduce to solve all problems unless instructed otherwise.__
- You are welcome to discuss the problems with your classmates but __you are not allowed to copy any part of your answers from your classmates.  Short code snippets from the internet are allowed.  Code from the class text books or class provided code can be copied in its entirety.__
- There could be tests in some cells (i.e., `assert` and `np.testing` statements). These tests (if present) are used to grade your answers. **However, the professor and FAs could use __additional__ tests for your answer. Think about cases where your code should still run correctly, even if it passes all the tests you see.**
- Remove or comment out code that contains `raise NotImplementedError`. These statements are there mainly to make the `assert` statements fail if nothing is submitted.
- Grading feedback cells are there for graders to provide feedback to students.  Do not change or remove grading feedback cells.
- Do not add or remove files from your Git repo. Do not change file names either. This also means **do not change the title** of the ipython notebook.
- You are free to add additional code cells around the cells marked `your code here`. You may use `toPandas()` to print the head of data frames. Plots should include a title and axis labels. Unless otherwise stated, plots may be made using your favorite Python package(s).
- Code is included to read files from Databricks or from the local filesystem, depending on the environment. It is assumed you know how to upload data files in Databricks as needed. If done right, the same code will run in either environment.
- There are usually many coding solutions to a problem. As a rule of thumb, write code that is easy to read. You may assume your readers are programmers. Use comments sparingly and where necessary. Comments reveal intent, so even when code is a little off target, effort will be recognized and rewarded. But runtime errors will be penalized.
- Before downloading and submitting your work through GitHub Classroom, remember to save and press `Validate` (or go to `Kernel`$\rightarrow$`Restart and Run All`).
- **Good luck!!**

In [1]:
# load these packages
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyspark
from pyspark.ml import Pipeline, classification, evaluation, feature, regression
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as fn

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

a_start = time.time()

In [2]:
# Define a function to determine whether we are running on Databricks.
# Return True if running in the Databricks environment, False otherwise.
import os
def is_databricks():
    return os.getenv("DATABRICKS_RUNTIME_VERSION") is not None

# Define a function to read the data. The full path name is constructed by checking
# the runtime environment to determine if it is databricks or a personal computer.
# On the local filesystem the data is assumed to be in the same directory as the code.
# On databricks, the data path is assumed to be at '/FileStore/tables/' location.
# 
# Parameter(s):
#   name: The base name of the file, parquet file or parquet directory
# Return Value:
#   full_path_name: the full path name of the data based on the runtime environment
#
# Correct Usage Example (pass ONLY the full file name):
#   name_to_load = get_datapath("sms_spam.csv") # correct  
#   
# Incorrect Usage Examples:
#   name_to_load = get_datapath("/sms_spam.csv") # incorrect
#   name_to_load = get_datapath("sms_spam.csv/") # incorrect
#   name_to_load = get_datapath("c:/users/will/data/sms_spam.csv") # incorrect
#
def get_datapath(name):    
    if is_databricks():
        full_path_name = "/FileStore/tables/%s" % name
    else:
        full_path_name = name
    return full_path_name
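
To see why only the bare file name should be passed, here is a minimal sketch of the same path logic with the environment passed in as an argument, so both branches can be tried without a Databricks cluster. `resolve_datapath` and the version string `"hypothetical"` are illustrative assumptions, not part of the provided code.

```python
def resolve_datapath(name, environ):
    """Hypothetical variant of get_datapath that reads a dict instead of os.environ."""
    if environ.get("DATABRICKS_RUNTIME_VERSION") is not None:
        return "/FileStore/tables/%s" % name
    return name


local_env = {}
# The value is a placeholder; only the presence of the variable matters here.
databricks_env = {"DATABRICKS_RUNTIME_VERSION": "hypothetical"}

print(resolve_datapath("Auto.csv", local_env))
print(resolve_datapath("Auto.csv", databricks_env))
# A leading slash yields a doubled separator on Databricks:
print(resolve_datapath("/Auto.csv", databricks_env))
```

The last call shows the kind of malformed path that the "incorrect usage" examples in the comments are warning about.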

# Part 1: Random Forest and gradient boosted trees

In these questions, we will examine the famous [Auto dataset](https://vincentarelbundock.github.io/Rdatasets/doc/ISLR/Auto.html). With this dataset, the goal is to predict fuel efficiency in miles per gallon (`mpg`) from characteristics of the car: number of cylinders (`cylinders`), engine displacement (`displacement`), engine horsepower (`horsepower`), vehicle weight (`weight`), time to accelerate from 0 to 60 mph (`acceleration`), model year (`year`), and origin (`origin`).

In [3]:
# data
mpg_df = spark.read.csv(get_datapath('Auto.csv'), header=True, inferSchema=True).\
    drop('_c0').\
    withColumn('horsepower2', fn.col('horsepower').cast('int')).\
    drop('horsepower').\
    withColumnRenamed('horsepower2', 'horsepower').\
    dropna()
training_df, validation_df, testing_df = mpg_df.randomSplit([0.6, 0.3, 0.1], seed=0)
mpg_df.printSchema()

root
 |-- mpg: double (nullable = true)
 |-- cylinders: integer (nullable = true)
 |-- displacement: double (nullable = true)
 |-- weight: integer (nullable = true)
 |-- acceleration: double (nullable = true)
 |-- year: integer (nullable = true)
 |-- origin: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- horsepower: integer (nullable = true)
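
The `randomSplit([0.6, 0.3, 0.1], seed=0)` call above yields training, validation, and testing sets whose sizes are only approximately 60/30/10, because rows are assigned at random. The following plain-Python sketch illustrates that idea; it is not Spark's implementation, and `random_split` and the 1000 stand-in rows are hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to a bucket with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    buckets = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for bucket, weight in zip(buckets, weights):
            cumulative += weight
            if draw < cumulative:
                bucket.append(row)
                break
        else:
            # Guard against floating-point round-off in the cumulative sum.
            buckets[-1].append(row)
    return buckets


rows = list(range(1000))  # hypothetical stand-in for data frame rows
train, validation, test = random_split(rows, [0.6, 0.3, 0.1], seed=0)
print(len(train), len(validation), len(test))
```

Fixing the seed makes the split reproducible, which is why the assignment sets `seed=0`.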



# Question 1 (20 pts)

Create three pipelines, each containing a different random forest regressor that takes all features from `mpg_df` (`cylinders`, `displacement`, `horsepower`, `weight`, `acceleration`, `year`, and `origin`) to predict `mpg`. **Set the `seed` parameter of the random forest to 0.** Fit these pipelines to the training data (`training_df`) and assign the fitted pipelines as follows:

- `pipe_rf1`: Random forest with `maxDepth=1` and `numTrees=60`
- `pipe_rf2`: Random forest with `maxDepth=3` and `numTrees=40`
- `pipe_rf3`: Random forest with `maxDepth=6`, `numTrees=20`

In [None]:
inputs = list(training_df.columns[1:])
inputs.remove('name')
va = feature.VectorAssembler().setInputCols(inputs).setOutputCol('features') 

In [None]:
# tests
np.testing.assert_equal(type(pipe_rf1.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(pipe_rf2.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(pipe_rf3.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(pipe_rf1.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(pipe_rf2.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(pipe_rf3.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(pipe_rf1.transform(training_df)), pyspark.sql.dataframe.DataFrame)
np.testing.assert_equal(type(pipe_rf2.transform(training_df)), pyspark.sql.dataframe.DataFrame)
np.testing.assert_equal(type(pipe_rf3.transform(training_df)), pyspark.sql.dataframe.DataFrame)

# Question 2 (20 pts)

Use the following evaluator to compute the $R^2$ of the three models on the validation data (`validation_df`). Assign the $R^2$ values to `R2_1`, `R2_2`, and `R2_3`, respectively. Assign the pipeline with the best validation performance to a variable `best_model`.

In [None]:
evaluator = evaluation.RegressionEvaluator(labelCol='mpg', metricName='r2')
# use it as follows:
#   evaluator.evaluate(fitted_pipeline.transform(df)) -> R2
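
As a reminder of what the `r2` metric measures, here is a minimal plain-Python sketch of the standard coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. It assumes the evaluator reports this standard definition; `r_squared` and the sample values are hypothetical and are not taken from `Auto.csv`.

```python
def r_squared(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equally long")
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    # Note: ss_tot is zero when all labels are equal; this sketch does not handle that case.
    return 1.0 - ss_res / ss_tot


# Hypothetical mpg values, for illustration only.
actual = [18.0, 24.0, 30.0, 36.0]
perfect = list(actual)
mean_only = [27.0] * 4  # always predicts the mean of `actual`

print(r_squared(actual, perfect))    # 1.0
print(r_squared(actual, mean_only))  # 0.0
```

A model that does worse than always predicting the mean gets a negative $R^2$, so the value is bounded above by 1 but not below by 0.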

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# tests
np.testing.assert_equal(type(best_model.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(best_model.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(best_model.transform(validation_df)), pyspark.sql.dataframe.DataFrame)
np.testing.assert_array_less(R2_1, 1.)
np.testing.assert_array_less(0.5, R2_1)
np.testing.assert_array_less(R2_2, 1.)
np.testing.assert_array_less(0.5, R2_2)
np.testing.assert_array_less(R2_3, 1.)
np.testing.assert_array_less(0.5, R2_3)

# Question 3 (10 pts)

Compute the $R^2$ of the best model on the testing data (`testing_df`), print it, and assign it to the variable `R2_best`.

In [None]:
# create R2_best below
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# tests
np.testing.assert_array_less(R2_best, 1.)
np.testing.assert_array_less(0.5, R2_best)

# Question 4 (10 pts)

Using the parameters of the best model, create a new pipeline called `final_model` and fit it to the entire data (`mpg_df`).

In [None]:
# create the fitted pipeline `final_model` here
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# tests
np.testing.assert_equal(type(final_model.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(final_model.stages[1]), regression.RandomForestRegressionModel)
np.testing.assert_equal(type(final_model.transform(mpg_df)), pyspark.sql.dataframe.DataFrame)

# Question 5 (10 pts)

Create a pandas dataframe `feature_importance` with the columns `feature` and `importance`, which contains the names of the features (`cylinders`, `displacement`, etc.) and their feature importances as determined by the random forest of the final model. Sort the dataframe by `importance` in descending order.
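
The core idea is that importances come back as one value per input position, so they have to be paired with the input names in the same order. A minimal plain-Python sketch of the pairing and sorting step follows. The name order is assumed to match the `inputs` list passed to the `VectorAssembler`, and the importance values are hypothetical placeholders, not results from any fitted model.

```python
# Assumed input order: the schema columns after `mpg`, with `name` removed.
feature_names = ["cylinders", "displacement", "weight", "acceleration",
                 "year", "origin", "horsepower"]

# Hypothetical importances that sum to 1; NOT output from a real random forest.
importances = [0.20, 0.25, 0.30, 0.05, 0.10, 0.02, 0.08]

# Pair each name with its value, then sort from most to least important.
ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)

for name, importance in ranked:
    print("%-14s %.2f" % (name, importance))
```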

In [None]:
# create feature_importance below
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# display it here
feature_importance

In [None]:
# tests
assert type(feature_importance) == pd.core.frame.DataFrame
np.testing.assert_array_equal(list(feature_importance.columns), ['feature', 'importance'])

Comment below on the importance that the random forest has given to each feature. Are the importances reasonable? Do they tell you anything valuable about the mpg dataset? Answer in the cell below.

YOUR ANSWER HERE

In [None]:
# Do not modify
end = time.time()
print("Part1 took %0.2f seconds"%(end - a_start))
print()

# Question 6 (10 pts)

Pick any of the trees from the final model and assign its `toDebugString` property to a variable `example_tree`. Print this variable and add comments to the cell describing how you think this particular tree is fitting the data.

In [None]:
# create a variable example_tree with the toDebugString property of a tree from final_model.
# print this string and comment in this same cell about the branches that this tree fit
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# tests
assert type(example_tree) == str
assert 'DecisionTreeRegressionModel' in example_tree
assert 'feature 0' in example_tree
assert 'If' in example_tree
assert 'Else' in example_tree
assert 'Predict' in example_tree
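
The debug string refers to inputs by position (the test above looks for `feature 0`), which makes the branches hard to read. A small sketch that substitutes column names for positions may help when writing your comments. The sample text is hand-written in the general shape of a tree debug string and is not output from a fitted model; `name_features` is a hypothetical helper, and the name order is assumed to match the `inputs` list from Question 1.

```python
import re

# Assumed input order: the schema columns after `mpg`, with `name` removed.
feature_names = ["cylinders", "displacement", "weight", "acceleration",
                 "year", "origin", "horsepower"]


def name_features(debug_string, names):
    """Replace each 'feature N' token with the N-th entry of `names`."""
    def substitute(match):
        index = int(match.group(1))
        # Leave the token unchanged if the index has no matching name.
        return names[index] if index < len(names) else match.group(0)
    return re.sub(r"feature (\d+)", substitute, debug_string)


# Hypothetical example text; thresholds and predictions are made up.
sample = (
    "If (feature 0 <= 5.5)\n"
    " Predict: 29.0\n"
    "Else (feature 0 > 5.5)\n"
    " If (feature 6 <= 140.0)\n"
    "  Predict: 19.0\n"
    " Else (feature 6 > 140.0)\n"
    "  Predict: 14.0\n"
)
readable = name_features(sample, feature_names)
print(readable)
```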

# Question 7 (20 pts)

Gradient boosted trees are becoming increasingly popular for competitions. There is a high-performance implementation, [xgboost](https://en.wikipedia.org/wiki/XGBoost), that is particularly popular. Compare gradient boosted regression (GBR) to the best random forest model selected in Question 2, using the validation set. For GBR, use all the default parameters except `seed=0`. Assign the fitted pipeline and the $R^2$ of the model to `gbr_pipe` and `R2_gbr`, respectively. Does it have an amazing or disappointing $R^2$? Comment.
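
For intuition about how boosting differs from a random forest, here is a toy sketch of gradient boosting for squared error: each round fits a one-split tree (a stump) to the current residuals and adds a scaled copy of it to the model. This is not Spark's or XGBoost's implementation. The helper names, the learning rate of 0.1, and the data are hypothetical, and the sketch assumes a single feature with at least two distinct values.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature, chosen by squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds, learning_rate=0.1):
    """Return a predictor built from `rounds` stumps, each fitted to residuals."""
    base = sum(ys) / len(ys)  # start from the mean label
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(predict, xs, ys):
    """Mean squared error of `predict` on the given points."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data (weight in 1000 lb vs. mpg), for illustration only.
weights = [1.8, 2.1, 2.5, 3.0, 3.4, 3.9, 4.3, 4.7]
mpgs = [36.0, 33.0, 29.0, 24.0, 21.0, 17.0, 15.0, 13.0]

for rounds in (1, 10, 100):
    model = boost(weights, mpgs, rounds)
    print(rounds, round(mse(model, weights, mpgs), 3))
```

The training error shrinks as rounds are added, because every stump corrects what the previous ones missed. A random forest instead averages trees that are grown independently of one another.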

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# test your models here
print("Performance of best RF: ", evaluator.evaluate(best_model.transform(validation_df)))
print("Performance of GBR: ", R2_gbr)

In [None]:
# tests
np.testing.assert_equal(type(gbr_pipe.stages[0]), feature.VectorAssembler)
np.testing.assert_equal(type(gbr_pipe.stages[1]), regression.GBTRegressionModel)
np.testing.assert_equal(type(gbr_pipe.transform(validation_df)), pyspark.sql.dataframe.DataFrame)
np.testing.assert_array_less(R2_gbr, 1.)
np.testing.assert_array_less(0.5, R2_gbr)

In [None]:
# Do not modify
end = time.time()
print("Part2 took %0.2f seconds"%(end - a_start))
print()

*END OF CODE*