<center><h2>Machine Learning Model Lab and Exercise</h2></center>

<h2>Overview</h2>

<ul>
    <li>Lab</li>
    <li>ML Exercise</li>
</ul>

<h2>Reminder: Process Flow for Big-Data Machine Learning Modeling</h2>

<center><figure><img src="http://stat.cmu.edu/~mfarag/14810/l15/machine_learning_modeling_lifecycle_in_big_data.png"/><figcaption>Process Flow for Big-Data Machine Learning Modeling</figcaption></figure></center>

<h2>Deploy Your Tuned ML Model to the Cloud</h2>

<ul>To run your ML model on the cloud, complete the following steps:
    <li>Enable the Billing</li>
    <li>Start the Cluster</li>
    <li>Upload your data files in one folder to the GCP bucket</li>
    <li>Move the folder containing the data from the GCP bucket to your local cluster</li>
    <li>Move the folder containing the data from your local cluster to HDFS</li>
    <li>Upload your Jupyter notebook(s) to the cluster</li>
    <li>Update the Path for your data files in the Jupyter Notebook</li>
    <li>Run The Model in the Notebook</li>
    <li>Stop the Cluster after you finish</li>
    <li>When the Cluster is stopped, disable your billing</li>
</ul>
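The two data-transfer steps above (bucket to cluster, then cluster to HDFS) can be sketched as the shell commands below, built here in Python so they are easy to adapt. This is only a sketch under assumptions: the bucket name <code>gs://my-course-bucket</code> is hypothetical, the folder name <code>Data</code> is chosen to match the <code>/Data/plays.csv</code> path used later in the notebook, and the commands copy the folder rather than move it.

```python
import shlex

# Hypothetical names: replace them with your own bucket and data folder.
BUCKET = "gs://my-course-bucket"
DATA_FOLDER = "Data"


def build_transfer_commands(bucket, folder):
    """Return the shell commands for the two transfer steps as argument lists."""
    # Copy the folder from the GCP bucket to the current directory on the cluster.
    copy_from_bucket = ["gsutil", "cp", "-r", f"{bucket}/{folder}", "."]
    # Copy the local folder into the root of HDFS, so files land under /<folder>/.
    put_into_hdfs = ["hdfs", "dfs", "-put", folder, "/"]
    return [copy_from_bucket, put_into_hdfs]


commands = build_transfer_commands(BUCKET, DATA_FOLDER)
for command in commands:
    # Print each command so it can be pasted into the cluster's terminal.
    print(shlex.join(command))
```

Run the printed commands in a terminal on the cluster's master node, then check the result with <code>hdfs dfs -ls /Data</code>.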

<h2>Today's Problem</h2>

<b>In this lecture, work in groups of 2 to develop an ML model that predicts the result of a given play (i.e., the PlayResult column).</b>
<ul>
    <li>All the steps are implemented for you up-to the Data Scaling Phase.</li>
    <li>Start by deciding whether this is a regression or a classification problem.</li>
    <li>Depending on the type of problem, train two ML models of that type (two regression models or two classification models) and identify which one performs better.</li>
    <li>When comparing the two Models, you may use the following approach:<ul>
        <li>Use the one with the highest <b>R-squared</b> if they are regression models</li>
        <li>Use the one with the highest AUC if they are classification models</li></ul></li>
</ul>
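As a reminder of what the two comparison metrics measure, here is a minimal plain-Python sketch, independent of Spark. The function names, the two "models", and all the numbers are hypothetical and made up for illustration; in the notebook you would normally get these metrics from Spark ML's evaluators instead.

```python
def r_squared(actual, predicted):
    """R-squared: 1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def auc(labels, scores):
    """AUC: probability that a random positive is scored above a random negative."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Regression case: made-up targets and predictions from two hypothetical models.
actual = [0, 8, 40, -6]
r2_model_a = r_squared(actual, [1, 7, 38, -5])
r2_model_b = r_squared(actual, [10, 10, 10, 10])
better_regressor = "A" if r2_model_a > r2_model_b else "B"

# Classification case: made-up binary labels and scores from two hypothetical models.
labels = [1, 0, 1, 0]
auc_model_a = auc(labels, [0.9, 0.2, 0.7, 0.4])
auc_model_b = auc(labels, [0.6, 0.7, 0.8, 0.1])
better_classifier = "A" if auc_model_a > auc_model_b else "B"

print(better_regressor, better_classifier)
```

In both cases the model with the higher metric is preferred. Note that the AUC sketched here is defined for two classes only.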

Instead of our regular quiz at the end of this week, you will work on this problem in a group of 2 and report your progress to me by the end of the lecture. Today's quiz is worth a total of 2 points.
<ul>
<li>You will get 1 point for today's quiz by trying to solve the problem</li>
<li>You will get 2 points if you develop 2 ML models that correctly solve the problem, tell me which one is better, and run your solution on the cloud</li>
</ul>
I will release a working solution at the end of the lecture for your reference.

<h2>Feature Engineering is Done for you!</h2>

Before we start, let's load and clean the data:
<ul>
    <li>Start Your Spark Session</li>
    <li>Ingest your data into the application</li>
    <li>Perform required data cleaning</li>
</ul>
<br/>
For simplicity, I'll ingest the data right from the CSVs. Ideally, you would ingest your data from PostgreSQL and join the dataframes back into one big dataframe. 

In [4]:
# if you installed Spark on windows, 
# you may need findspark and need to initialize it prior to being able to use pyspark
# Also, you may need to initialize SparkContext yourself.
#import findspark
#findspark.find()
#findspark.init()
import pyspark
from pyspark.sql import SparkSession, SQLContext
from pyspark.ml.feature import Imputer
from pyspark.sql.functions import *


appName = "Machine Learning via SparkML"
master = "local"

# Create Configuration object for Spark.
conf = pyspark.SparkConf()\
    .set('spark.driver.host','127.0.0.1')\
    .setAppName(appName)\
    .setMaster(master)

# Create Spark Context with the new configurations rather than rely on the default one
# SparkContext is referenced through the pyspark module, which is imported above
sc = pyspark.SparkContext.getOrCreate(conf=conf)

# You need to create SQL Context to conduct some database operations like what we will see later.
#sqlContext = SQLContext(sc)

# Create (or reuse) the Spark session; it picks up the Spark Context created above
spark = SparkSession.builder.getOrCreate()

# Ingest data from the plays CSV into a Spark DataFrame. What is a DataFrame?
plays_df = (spark.read
         .format("csv")
         .option("inferSchema", "true")
         .option("header","true")
         .load("/Data/plays.csv")
      )

renamed_columns_plays_df = plays_df.withColumnRenamed("personnelOffense","personnel_offense").withColumnRenamed("personnelDefense","personnel_defense")
casted_types_df = (renamed_columns_plays_df
              .withColumn("yardline_number", renamed_columns_plays_df["yardlineNumber"].cast("integer")).drop("yardlineNumber")
              .withColumn("defenders_in_the_box", renamed_columns_plays_df["defendersInTheBox"].cast("integer")).drop("defendersInTheBox")
              .withColumn("number_of_pass_rushers", renamed_columns_plays_df["numberOfPassRushers"].cast("integer")).drop("numberOfPassRushers")
              .withColumn("kick_return_yardage", renamed_columns_plays_df["KickReturnYardage"].cast("integer")).drop("KickReturnYardage")
              .withColumn("pass_length", renamed_columns_plays_df["PassLength"].cast("integer")).drop("PassLength")     
              .withColumn("yards_after_catch", renamed_columns_plays_df["YardsAfterCatch"].cast("integer")).drop("YardsAfterCatch")     
              .withColumn("play_id", renamed_columns_plays_df["playId"].cast("string")).drop("playId")     
              .withColumn("game_id", renamed_columns_plays_df["gameId"].cast("string")).drop("gameId")     
              .withColumn("quarter_var", renamed_columns_plays_df["quarter"].cast("string")).drop("quarter")     
              .withColumn("down_var", renamed_columns_plays_df["down"].cast("string")).drop("down")     
            .distinct()
           )
# Replace the literal string 'NA' with a true null in each categorical column
plays_df_with_substituted_na = (casted_types_df
    .withColumn('offenseFormation',
                when(casted_types_df.offenseFormation == 'NA', lit(None))
                .otherwise(casted_types_df.offenseFormation))
    .withColumn('personnel_offense',
                when(casted_types_df.personnel_offense == 'NA', lit(None))
                .otherwise(casted_types_df.personnel_offense))
    .withColumn('personnel_defense',
                when(casted_types_df.personnel_defense == 'NA', lit(None))
                .otherwise(casted_types_df.personnel_defense))
    .withColumn('SpecialTeamsPlayType',
                when(casted_types_df.SpecialTeamsPlayType == 'NA', lit(None))
                .otherwise(casted_types_df.SpecialTeamsPlayType))
    .withColumn('PassResult',
                when(casted_types_df.PassResult == 'NA', lit(None))
                .otherwise(casted_types_df.PassResult)))

numeric_features = [feature[0] for feature in plays_df_with_substituted_na.dtypes if feature[1] == 'int']

plays_df_with_substituted_na.printSchema()

root
 |-- GameClock: string (nullable = true)
 |-- yardsToGo: integer (nullable = true)
 |-- possessionTeam: string (nullable = true)
 |-- yardlineSide: string (nullable = true)
 |-- offenseFormation: string (nullable = true)
 |-- personnel_offense: string (nullable = true)
 |-- personnel_defense: string (nullable = true)
 |-- HomeScoreBeforePlay: integer (nullable = true)
 |-- VisitorScoreBeforePlay: integer (nullable = true)
 |-- HomeScoreAfterPlay: integer (nullable = true)
 |-- VisitorScoreAfterPlay: integer (nullable = true)
 |-- isPenalty: boolean (nullable = true)
 |-- isSTPlay: boolean (nullable = true)
 |-- SpecialTeamsPlayType: string (nullable = true)
 |-- PassResult: string (nullable = true)
 |-- PlayResult: integer (nullable = true)
 |-- playDescription: string (nullable = true)
 |-- yardline_number: integer (nullable = true)
 |-- defenders_in_the_box: integer (nullable = true)
 |-- number_of_pass_rushers: integer (nullable = true)
 |-- kick_return_yardage: integer (nullab

In [5]:
plays_df_with_substituted_na.select("PlayResult").distinct().show()



+----------+
|PlayResult|
+----------+
|       -35|
|        31|
|        65|
|        53|
|        78|
|       -13|
|       -20|
|        34|
|        -1|
|       -17|
|        28|
|        26|
|        27|
|       -10|
|        44|
|       -11|
|        12|
|        22|
|       -15|
|        47|
+----------+
only showing top 20 rows



                                                                                

<h2>How would you create the outcome variable?</h2>

In [6]:
plays_df_with_outcome_var = plays_df_with_substituted_na\
                        .withColumn('target', plays_df_with_substituted_na["PlayResult"].cast("string")).drop("PlayResult")

In [7]:
filtered_plays_df = plays_df_with_outcome_var.select(
"yardsToGo",
"possessionTeam",
"yardlineSide",
"offenseFormation",
"personnel_offense",
"personnel_defense",
"isSTPlay",
"yardline_number",
"defenders_in_the_box",
"number_of_pass_rushers",
"down_var",
"play_id",
"game_id",
"target")

Let's continue our feature engineering from last class (without changes).
<h2>Process Continuous Variables</h2>

In [8]:
# Handle Null Values
numeric_columns = [column[0] for column in filtered_plays_df.dtypes if column[1]=='int']
filtered_plays_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in numeric_columns]).show(vertical=True)


-RECORD 0----------------------
 yardsToGo              | 0    
 yardline_number        | 180  
 defenders_in_the_box   | 2637 
 number_of_pass_rushers | 7483 



In [9]:
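# Impute missing numeric values with the median.
# Nulls are first replaced by the sentinel -200, which the Imputer is told to treat as missing.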
list_to_be_imputed = ['yardline_number','defenders_in_the_box','number_of_pass_rushers']
plays_df_filled_na = filtered_plays_df.fillna(-200, list_to_be_imputed)

imputer = Imputer (
            inputCols=list_to_be_imputed,
            outputCols=["{}_imputed".format(c) for c in list_to_be_imputed])\
                .setStrategy("median").setMissingValue(-200)

plays_df_imputed = imputer.fit(plays_df_filled_na).transform(plays_df_filled_na)
plays_df_imputed_enhanced = plays_df_imputed.drop('yardline_number','defenders_in_the_box','number_of_pass_rushers')

renamed_plays_df_imputed = plays_df_imputed_enhanced.withColumnRenamed("yardline_number_imputed","yardline_number")\
                            .withColumnRenamed("defenders_in_the_box_imputed","defenders_in_the_box")\
                            .withColumnRenamed("number_of_pass_rushers_imputed","number_of_pass_rushers")
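To make the sentinel-and-median idea in the cell above concrete, here is a minimal plain-Python sketch, independent of Spark. The function name and the sample values are hypothetical. It also uses <code>statistics.median_low</code> for simplicity, whereas Spark's Imputer computes an approximate median, so the two may differ slightly on real data.

```python
from statistics import median_low

MISSING = -200  # same sentinel value as in the notebook cell above


def impute_with_median(values, missing=MISSING):
    """Replace every sentinel entry with the median of the observed entries."""
    observed = [v for v in values if v != missing]
    fill_value = median_low(observed)
    return [fill_value if v == missing else v for v in values]


# Made-up sample of defenders_in_the_box with two missing entries.
defenders = [6, MISSING, 7, 8, MISSING]
imputed = impute_with_median(defenders)
print(imputed)
```

The sentinel only works if it can never occur as a real value, which is why a number far outside the plausible range of every column is used.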

In [10]:
non_correlated_plays_df = renamed_plays_df_imputed

<h2>Handling Outliers</h2>

In [11]:
from functools import reduce

def column_add(a, b):
    # Add two Spark columns; used with reduce() to sum the outlier flags
    return a + b
    
def find_outliers(df):
    # Identifying the numerical columns in a spark dataframe
    numeric_columns = [column[0] for column in df.dtypes if column[1]=='int']

    # Using the `for` loop to create new columns by identifying the outliers for each feature
    for column in numeric_columns:

        # Q1: first quartile, Q3: third quartile.
        # approxQuantile returns a list, so index [0] extracts the value;
        # relativeError=0 requests the exact quantile.
        Q1 = df.approxQuantile(column, [0.25], relativeError=0)
        Q3 = df.approxQuantile(column, [0.75], relativeError=0)

        # IQR: interquartile range
        IQR = Q3[0] - Q1[0]

        # Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
        less_Q1 = Q1[0] - 1.5*IQR
        more_Q3 = Q3[0] + 1.5*IQR
        
        isOutlierCol = 'is_outlier_{}'.format(column)
        
        df = df.withColumn(isOutlierCol,when((df[column] > more_Q3) | (df[column] < less_Q1), 1).otherwise(0))
    

    # Selecting the specific columns which we have added above, to check if there are any outliers
    selected_columns = [column for column in df.columns if column.startswith("is_outlier")]
    # Adding all the outlier columns into a new column "total_outliers", to see the total number of outliers
    df = df.withColumn('total_outliers', reduce(column_add, (df[name] for name in selected_columns)))

    # Dropping the extra columns created above, just to create nice dataframe., without extra columns
    df = df.drop(*[column for column in df.columns if column.startswith("is_outlier")])

    return df
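The IQR rule used by find_outliers can be checked by hand on a small list. The sketch below is plain Python, independent of Spark; the function names and the sample values are hypothetical, and the nearest-rank quantile is an assumption made for simplicity that may not match Spark's approxQuantile exactly on every input.

```python
import math


def nearest_rank_quantile(sorted_values, probability):
    """Return the element at the nearest-rank position for the given probability."""
    rank = max(1, math.ceil(probability * len(sorted_values)))
    return sorted_values[rank - 1]


def iqr_fences(values, multiplier=1.5):
    """Return the (lower, upper) bounds outside of which a value is an outlier."""
    ordered = sorted(values)
    q1 = nearest_rank_quantile(ordered, 0.25)
    q3 = nearest_rank_quantile(ordered, 0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


# Made-up sample: one value is far away from the rest.
yards_to_go = [1, 2, 3, 4, 5, 6, 7, 8, 100]
lower, upper = iqr_fences(yards_to_go)
outliers = [v for v in yards_to_go if v < lower or v > upper]
print(lower, upper, outliers)
```

Here Q1 is 3 and Q3 is 7, so the IQR is 4 and the bounds are -3 and 13; only the value 100 falls outside them.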

In [12]:
# Confirm that no null values remain in the numeric columns before handling outliers
numeric_columns = [column[0] for column in non_correlated_plays_df.dtypes if column[1]=='int']
non_correlated_plays_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in numeric_columns]).show(vertical=True)


-RECORD 0---------------------
 yardsToGo              | 0   
 yardline_number        | 0   
 defenders_in_the_box   | 0   
 number_of_pass_rushers | 0   



In [13]:
plays_df_with_substituted_na_and_outliers = find_outliers(non_correlated_plays_df)
plays_df_with_handled_outliers = plays_df_with_substituted_na_and_outliers.filter(plays_df_with_substituted_na_and_outliers['total_outliers'] < 1)

<h2>Handle Binary Variables (by casting them)</h2>

In [14]:
plays_df_with_handled_binary = (plays_df_with_handled_outliers
              .withColumn("isSTPlay_encoded", plays_df_with_handled_outliers["isSTPlay"].cast("integer")))
plays_df_with_handled_binary.select("isSTPlay","isSTPlay_encoded").distinct().show()

+--------+----------------+
|isSTPlay|isSTPlay_encoded|
+--------+----------------+
|    true|               1|
|   false|               0|
+--------+----------------+



<h2>Handle Nominal Variables</h2>

In [15]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml import Pipeline


# define stages 1-6: index each string column as a numeric column
stage_1 = StringIndexer(inputCol= 'possessionTeam', outputCol= 'possessionTeam_index', handleInvalid="keep")

stage_2 = StringIndexer(inputCol= 'personnel_offense', outputCol= 'personnel_offense_index', handleInvalid="keep")

stage_3 = StringIndexer(inputCol= 'personnel_defense', outputCol= 'personnel_defense_index', handleInvalid="keep")

stage_4 = StringIndexer(inputCol= 'offenseFormation', outputCol= 'offenseFormation_index', handleInvalid="keep")

stage_5 = StringIndexer(inputCol= 'yardlineSide', outputCol= 'yardlineSide_index', handleInvalid="keep")

stage_6 = StringIndexer(inputCol= 'down_var', outputCol= 'down_index', handleInvalid="keep")


# define stage 7: one-hot encode the indexed columns
stage_7 = OneHotEncoder(inputCols=['possessionTeam_index','personnel_offense_index','personnel_defense_index',
                                  'offenseFormation_index','yardlineSide_index', 'down_index'], 
                        outputCols=['possessionTeam_encoded','personnel_offense_encoded','personnel_defense_encoded',
                                   'offenseFormation_encoded','yardlineSide_encoded', 'down_encoded'])

# setup the pipeline
pipeline = Pipeline(stages=[stage_1, stage_2, stage_3, stage_4, stage_5, stage_6, stage_7])

# fit the pipeline model and transform the data as defined
pipeline_model = pipeline.fit(plays_df_with_handled_binary)
encoded_plays_df = pipeline_model.transform(plays_df_with_handled_binary)
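Conceptually, the pipeline above does two things to each nominal column: it maps every label to an index, then turns that index into a 0/1 vector. The plain-Python sketch below illustrates the idea on a made-up list of team codes. The function names are hypothetical, and this is a simplification: it assumes most-frequent-first indexing and dropping the last category, and it ignores the extra "unknown" bucket that handleInvalid="keep" adds in Spark.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each label to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot_drop_last(index, num_categories):
    """One-hot encode an index; the last category becomes the all-zeros vector."""
    vector = [0] * (num_categories - 1)
    if index < num_categories - 1:
        vector[index] = 1
    return vector


# Made-up sample of possessionTeam values.
teams = ["NE", "KC", "NE", "DAL", "KC", "NE"]
team_index = fit_string_index(teams)
encoded = {team: one_hot_drop_last(i, len(team_index)) for team, i in team_index.items()}
print(team_index)
print(encoded)
```

Dropping one category avoids a redundant column, because the dropped category is already identified by all the other positions being zero.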


                                                                                

<h2>Combining Features into Single Vector</h2>

In [16]:
from pyspark.ml.feature import VectorAssembler

vector_assembler = VectorAssembler(
    inputCols=['possessionTeam_encoded','personnel_offense_encoded','personnel_defense_encoded',
               'offenseFormation_encoded','yardlineSide_encoded','isSTPlay_encoded',
               'yardsToGo', "yardline_number", "defenders_in_the_box","number_of_pass_rushers"], 
    outputCol="vectorized_features")

assembled_plays_df = vector_assembler.transform(encoded_plays_df)

<h2>Data Scaling</h2>

In [17]:
from pyspark.ml.feature import StandardScaler
standard_scaler = StandardScaler(inputCol= 'vectorized_features', outputCol= 'features')
scaled_model = standard_scaler.fit(assembled_plays_df)
scaled_plays_df = scaled_model.transform(assembled_plays_df)

scaled_plays_df.select("target").show(5)

+------+
|target|
+------+
|     0|
|     8|
|    40|
|    -6|
|    40|
+------+
only showing top 5 rows
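As a rough illustration of what the scaling step does to a single feature, the plain-Python sketch below divides each value by the sample standard deviation without centering, which is my understanding of StandardScaler's default settings (treat that as an assumption and check the documentation for your Spark version). The function name and the numbers are hypothetical.

```python
from statistics import stdev


def scale_to_unit_std(values):
    """Divide each value by the sample standard deviation (no mean centering)."""
    sample_std = stdev(values)
    return [v / sample_std for v in values]


# Made-up feature column.
yards = [2.0, 4.0, 6.0]
scaled = scale_to_unit_std(yards)
print(scaled)
```

After scaling, the feature has a standard deviation of 1, so features measured on very different scales contribute comparably to the model.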



<h2>Now, let's Create the ML Model</h2>

In [None]:
### Your work starts here!