This data set is made available by http://www.baseball-reference.com. It contains data about professional baseball (MLB) games played in the 2016 season. There are 2,427 games in the data set. Each row represents a single game. The goal is to predict the attendance at a home team’s game. This is an important task because most franchises want to predict the number of attendees for a variety of reasons including profits.

## Goal

Use the **baseball.csv** data set and build a model to predict **attendance_binary**.

In [0]:
from pyspark.sql import SparkSession

In [0]:
spark = SparkSession.builder.appName('IMMLLogReg').getOrCreate()

In [0]:
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")

file_location = "/FileStore/tables/baseball.csv"
file_type = "csv"

In [0]:
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

attendance_binary,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,game_type,previous_game_type,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,game_day,previous_game_day,temperature,wind_speed,sky,previous_game_duration,previous_homewin
0,43683,2,6,2,Night Game,Day Game,0,6,6,Wednesday,Monday,55,24,Overcast,2.933333333,1
0,45785,0,7,2,Night Game,Day Game,0,10,3,Wednesday,Monday,48,7,Unknown,2.8,1
0,48282,0,8,4,Night Game,Day Game,2,4,3,Wednesday,Monday,65,10,Cloudy,3.383333333,0
0,21830,0,9,6,Day Game,Night Game,0,15,11,Wednesday,Tuesday,77,0,In Dome,3.233333333,1
0,49289,2,4,2,Night Game,Day Game,1,1,3,Tuesday,Monday,81,12,Cloudy,2.633333333,1
0,15116,1,7,5,Night Game,Night Game,0,8,3,Tuesday,Monday,72,0,In Dome,2.966666667,0
0,44317,0,17,15,Night Game,Day Game,2,4,0,Tuesday,Monday,70,6,Unknown,3.166666667,0
0,39500,0,5,1,Night Game,Day Game,1,9,4,Tuesday,Sunday,40,7,Sunny,3.033333333,1
0,35067,1,7,4,Night Game,Night Game,2,7,3,Tuesday,Monday,70,8,Cloudy,2.933333333,0
0,44318,0,15,12,Night Game,Day Game,1,8,3,Tuesday,Monday,64,0,In Dome,3.583333333,0


In [0]:
# Import the required libraries

from pyspark.sql.functions import datediff,date_format,to_date,to_timestamp,isnan

In [0]:
import pyspark.sql.functions as f

In [0]:
df=df.withColumn('previous_attendance',df.previous_attendance.cast('integer')).\
      withColumn('previous_away_team_errors',df.previous_away_team_errors.cast('integer')).\
      withColumn('previous_away_team_runs',df.previous_away_team_runs.cast('integer')).\
      withColumn('previous_away_team_hits',df.previous_away_team_hits.cast('integer')).\
      withColumn('previous_home_team_errors',df.previous_home_team_errors.cast('integer')).\
      withColumn('previous_home_team_runs',df.previous_home_team_runs.cast('integer')).\
      withColumn('previous_home_team_hits',df.previous_home_team_hits.cast('integer')).\
      withColumn('temperature',df.temperature.cast('integer')).\
      withColumn('wind_speed',df.wind_speed.cast('integer')).\
      withColumn('attendance_binary',df.attendance_binary.cast('integer')).\
      withColumn('previous_homewin',df.previous_homewin.cast('integer')).\
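      withColumn('previous_game_duration',df.previous_game_duration.cast('integer'))
# Note: the integer cast truncates the fractional hours in previous_game_duration (e.g. 2.93 becomes 2)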

In [0]:
# Count NaN values in each column (isnan does not flag nulls; dropna() below removes those)
from pyspark.sql.functions import isnan, when, count, col
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()

+-----------------+-------------------+-------------------------+-----------------------+-----------------------+---------+------------------+-------------------------+-----------------------+-----------------------+--------+-----------------+-----------+----------+---+----------------------+----------------+
|attendance_binary|previous_attendance|previous_away_team_errors|previous_away_team_hits|previous_away_team_runs|game_type|previous_game_type|previous_home_team_errors|previous_home_team_hits|previous_home_team_runs|game_day|previous_game_day|temperature|wind_speed|sky|previous_game_duration|previous_homewin|
+-----------------+-------------------+-------------------------+-----------------------+-----------------------+---------+------------------+-------------------------+-----------------------+-----------------------+--------+-----------------+-----------+----------+---+----------------------+----------------+
|                0|                  0|                        0|  

In [0]:
df=df.dropna()

In [0]:
# Create a 70-30 train test split
# The seed is passed to randomSplit directly; Python's random.seed() has no effect on Spark
train_data,test_data=df.randomSplit([0.7,0.3],seed=1234)

Baseline Accuracy

In [0]:
# Find majority class
train_data.groupBy('attendance_binary').count().show()

+-----------------+-----+
|attendance_binary|count|
+-----------------+-----+
|                1|  914|
|                0|  818|
+-----------------+-----+



In [0]:
# Find percentage
tot = train_data.count()
train_data.groupBy("attendance_binary") \
  .count() \
  .withColumnRenamed('count', 'cnt_per_group') \
  .withColumn('perc_of_count_total', (f.col('cnt_per_group') / tot) * 100 ) \
  .show()

+-----------------+-------------+-------------------+
|attendance_binary|cnt_per_group|perc_of_count_total|
+-----------------+-------------+-------------------+
|                1|          914| 52.771362586605086|
|                0|          818|  47.22863741339492|
+-----------------+-------------+-------------------+



Baseline accuracy is 52.8%

The target column is distributed almost evenly: about 53% of the training rows are class '1' and 47% are class '0', so the data set has no class imbalance problem.
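
The baseline can be derived from the class counts alone. A minimal sketch in plain Python, assuming the counts printed for the training split (914 and 818); the variable names are illustrative only.

```python
# Class counts copied from the groupBy output on the training split
class_counts = {1: 914, 0: 818}

total = sum(class_counts.values())
# The majority class is the one with the largest count
majority_class = max(class_counts, key=class_counts.get)
# Baseline accuracy: the share of rows we get right by always predicting the majority class
baseline_accuracy = class_counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.1%}")
```

Any model worth keeping should beat this figure on the test set.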

In [0]:
# Import the required libraries

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler,StringIndexer ,OneHotEncoder
from pyspark.ml import Pipeline

In [0]:
# Use StringIndexer to convert the categorical columns to hold numerical data

game_type_indexer = StringIndexer(inputCol='game_type',outputCol='game_type_index',handleInvalid='keep')
previous_game_type_indexer = StringIndexer(inputCol='previous_game_type',outputCol='previous_game_type_index',handleInvalid='keep')
game_day_indexer = StringIndexer(inputCol='game_day',outputCol='game_day_index',handleInvalid='keep')
previous_game_day_indexer = StringIndexer(inputCol='previous_game_day',outputCol='previous_game_day_index',handleInvalid='keep')
sky_indexer = StringIndexer(inputCol='sky',outputCol='sky_index',handleInvalid='keep')

In [0]:
# OneHotEncoder converts the indexed categories into sparse binary vectors that the Logistic Regression model can use

data_encoder = OneHotEncoder(
    inputCols=['game_type_index','previous_game_type_index',
               'game_day_index','previous_game_day_index','sky_index'],
    outputCols=['game_type_vec','previous_game_type_vec',
                'game_day_vec','previous_game_day_vec','sky_vec'],
    handleInvalid='keep')
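
To make the two encoding steps concrete, here is a rough sketch in plain Python of what indexing followed by one-hot encoding does. It assumes frequency-descending index order and a dropped last category, which are the defaults as we understand them for `StringIndexer` and `OneHotEncoder`. It does not model `handleInvalid='keep'`, and the helper names and sample values are hypothetical.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Return a dense 0/1 list; with drop_last the final category becomes all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if i == index else 0.0 for i in range(size)]

# Hypothetical sample of the 'sky' column
sky = ["Cloudy", "Sunny", "Cloudy", "In Dome", "Sunny", "Cloudy"]
mapping = fit_string_index(sky)
encoded = [one_hot(mapping[v], len(mapping)) for v in sky]

print(mapping)
print(encoded[0], encoded[3])
```

Dropping one category avoids a redundant column, since the dropped level is implied when all other indicators are zero.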

In [0]:
# Vector assembler is used to create a vector of input features

assembler = VectorAssembler(inputCols=['game_type_vec','previous_game_type_vec',
                                      'game_day_vec','previous_game_day_vec','sky_vec'],
                            outputCol="features")

### Logistic Regression Model

In [0]:
# Create an object for the Logistic Regression model

lr_model = LogisticRegression(labelCol='attendance_binary')

In [0]:

# The Pipeline runs the indexers, encoder, assembler and model in sequence. Fitting it on the
# training data also ensures the test data is pre-processed in exactly the same way.

pipe = Pipeline(stages=[game_type_indexer,previous_game_type_indexer,game_day_indexer,previous_game_day_indexer,
                        sky_indexer,data_encoder,assembler,lr_model])

In [0]:
fit_model=pipe.fit(train_data)

In [0]:
# Store the results in a dataframe

results = fit_model.transform(test_data)

In [0]:
results.select(['attendance_binary','prediction']).show()

+-----------------+----------+
|attendance_binary|prediction|
+-----------------+----------+
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       1.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       1.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       1.0|
+-----------------+----------+
only showing top 20 rows



#### Evaluating the model

In [0]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [0]:
ACC_evaluator = MulticlassClassificationEvaluator(
    labelCol="attendance_binary", predictionCol="prediction", metricName="accuracy")

In [0]:
accuracy = ACC_evaluator.evaluate(results)


In [0]:
print("The accuracy of the model is {}".format(accuracy))

The accuracy of the model is 0.6561151079136691


With the predictors used (game type, game day and sky conditions), the logistic regression model correctly classifies about 66% of the test games. That is roughly 13 percentage points above the majority-class baseline of about 53%, a modest but real improvement.

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [0]:
AUC_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='attendance_binary',metricName='areaUnderROC')

In [0]:
AUC = AUC_evaluator.evaluate(results)

In [0]:
print("The area under the curve is {}".format(AUC))

The area under the curve is 0.6564596273291925


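An area under the ROC curve of about 0.66 is above the 0.5 expected from random guessing, so the model has some, though modest, ability to separate the two classes. Note that the evaluator above is given the hard 0/1 `prediction` column rather than `rawPrediction`, so this value describes a single operating point and not the full ROC curve.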
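
The evaluator above receives the hard 0/1 `prediction` column as its score. The following sketch in plain Python shows why that choice matters: an AUC computed from thresholded predictions generally differs from one computed from continuous scores. The labels and scores are hypothetical, and `roc_auc` is an illustrative helper, not Spark's implementation.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: share of (positive, negative) pairs ranked correctly, ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores (not taken from the notebook)
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.10, 0.35, 0.45, 0.60, 0.40, 0.55, 0.70, 0.90]
# Thresholding at 0.5 discards the ranking information within each predicted class
hard = [1.0 if s >= 0.5 else 0.0 for s in scores]

auc_scores = roc_auc(labels, scores)
auc_hard = roc_auc(labels, hard)

print(f"AUC from scores: {auc_scores}")
print(f"AUC from hard predictions: {auc_hard}")
```

With hard predictions the result reduces to the average of the true positive rate and the true negative rate. In Spark, pointing `rawPredictionCol` at the `rawPrediction` column should give the score-based figure instead.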

### Decision Tree Model

In [0]:
# Import the required libraries

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler,StringIndexer
from pyspark.ml import Pipeline

In [0]:


assembler = VectorAssembler(inputCols=['game_type_index','previous_game_type_index','game_day_index',
                                       'previous_game_day_index','sky_index'],
                            outputCol="features")

In [0]:
dt_model = DecisionTreeClassifier(labelCol='attendance_binary',maxBins=5000)

In [0]:

pipe = Pipeline(stages=[game_type_indexer,previous_game_type_indexer,game_day_indexer,previous_game_day_indexer,
                        sky_indexer,data_encoder,assembler,dt_model])

In [0]:
fit_model=pipe.fit(train_data)

In [0]:
results = fit_model.transform(test_data)

In [0]:
results.select(['attendance_binary','prediction']).show()

+-----------------+----------+
|attendance_binary|prediction|
+-----------------+----------+
|                0|       1.0|
|                0|       0.0|
|                0|       0.0|
|                0|       1.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       1.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       1.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       1.0|
|                0|       0.0|
|                0|       0.0|
+-----------------+----------+
only showing top 20 rows



#### Evaluating the model

In [0]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [0]:
ACC_evaluator = MulticlassClassificationEvaluator(
    labelCol="attendance_binary", predictionCol="prediction", metricName="accuracy")

In [0]:
accuracy = ACC_evaluator.evaluate(results)

In [0]:
print("The accuracy of the decision tree classifier is {}".format(accuracy))

The accuracy of the decision tree classifier is 0.6388489208633094


With the predictors used, the decision tree model correctly classifies about 64% of the test games. This is slightly below the logistic regression model but still clearly above the majority-class baseline.

Area under the ROC

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [0]:
AUC_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='attendance_binary',metricName='areaUnderROC')

In [0]:
AUC = AUC_evaluator.evaluate(results)

In [0]:
print("The area under the curve is {}".format(AUC))

The area under the curve is 0.6392753623188406


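An area under the ROC curve of about 0.64 is again above the 0.5 of a random classifier, which indicates modest discriminative ability. As with the logistic regression model, the evaluator is given the hard `prediction` column, so the figure reflects one decision threshold only.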

###  ML Linear Support Vector Classifier Model

In [0]:
spark = SparkSession.builder.appName('IMMLSVC').getOrCreate()

In [0]:
# Importing the required libraries

from pyspark.ml.classification import LinearSVC
from pyspark.ml.feature import VectorAssembler,StringIndexer,StandardScaler
from pyspark.ml import Pipeline

In [0]:
# Vector assembler is used to create a vector of input features

assembler = VectorAssembler(inputCols=['game_type_index','previous_game_type_index','game_day_index',
                                       'previous_game_day_index','sky_index'],
                            outputCol="unscaled_features")

In [0]:
# Standard scaler is used to scale the data for the linear SVC to perform well on the training data

scaler = StandardScaler(inputCol="unscaled_features",outputCol="features")
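
As a rough illustration of what the scaler does to a single feature, the sketch below divides each value by the sample standard deviation without subtracting the mean. This assumes the scaler's default settings (unit standard deviation, no centering), which is consistent with the displayed results where an index of 0.0 stays 0.0 after scaling. The helper name and data are hypothetical.

```python
import statistics

def scale_by_std(column):
    """Divide each value by the sample standard deviation, without centering."""
    std = statistics.stdev(column)  # sample (n - 1) standard deviation
    # Guard against a constant column to avoid division by zero
    return [x / std if std > 0 else 0.0 for x in column]

# Hypothetical index values for one categorical feature
sky_index = [0.0, 1.0, 2.0, 3.0, 4.0]
scaled = scale_by_std(sky_index)

print(scaled)
print(statistics.stdev(scaled))
```

Because nothing is subtracted, zeros remain zeros, which keeps sparse feature vectors sparse.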

In [0]:
# Create an object for the Linear SVC model

svc_model = LinearSVC(labelCol='attendance_binary')

In [0]:
# The Pipeline runs the indexers, assembler, scaler and model in sequence, so the test data
# is pre-processed in the same way as the training data.

pipe = Pipeline(stages=[game_type_indexer,previous_game_type_indexer,game_day_indexer,previous_game_day_indexer,
                        sky_indexer,assembler,scaler,svc_model])

In [0]:
fit_model=pipe.fit(train_data)

In [0]:
# Store the results in a dataframe

results = fit_model.transform(test_data)
display(results)

attendance_binary,previous_attendance,previous_away_team_errors,previous_away_team_hits,previous_away_team_runs,game_type,previous_game_type,previous_home_team_errors,previous_home_team_hits,previous_home_team_runs,game_day,previous_game_day,temperature,wind_speed,sky,previous_game_duration,previous_homewin,game_type_index,previous_game_type_index,game_day_index,previous_game_day_index,sky_index,unscaled_features,features,rawPrediction,prediction
0,10407,0,7,2,Night Game,Night Game,0,7,3,Tuesday,Monday,64,11,Sunny,2,1,0.0,0.0,2.0,5.0,0.0,"Map(vectorType -> sparse, length -> 5, indices -> List(2, 3), values -> List(2.0, 5.0))","Map(vectorType -> sparse, length -> 5, indices -> List(2, 3), values -> List(1.0375222119359775, 2.596517517407552))","Map(vectorType -> dense, length -> 2, values -> List(-0.08195500580582094, 0.08195500580582094))",1.0
0,10535,0,12,4,Night Game,Night Game,1,6,3,Tuesday,Monday,63,9,Overcast,2,0,0.0,0.0,2.0,5.0,4.0,"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.0, 2.0, 5.0, 4.0))","Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.0, 1.0375222119359775, 2.596517517407552, 2.8934538862307746))","Map(vectorType -> dense, length -> 2, values -> List(1.1318330764224531, -1.1318330764224531))",0.0
0,11023,1,7,4,Night Game,Night Game,0,7,3,Wednesday,Tuesday,87,1,Unknown,3,0,0.0,0.0,4.0,4.0,2.0,"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.0, 4.0, 4.0, 2.0))","Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.0, 2.075044423871955, 2.0772140139260413, 1.4467269431153873))","Map(vectorType -> dense, length -> 2, values -> List(0.6965529792731986, -0.6965529792731986))",0.0
0,11142,0,9,2,Night Game,Night Game,1,8,4,Friday,Thursday,50,6,Cloudy,2,1,0.0,0.0,3.0,6.0,1.0,"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.0, 3.0, 6.0, 1.0))","Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.0, 1.5562833179039663, 3.115821020889062, 0.7233634715576936))","Map(vectorType -> dense, length -> 2, values -> List(0.5697685045467271, -0.5697685045467271))",0.0
0,11149,3,2,0,Day Game,Night Game,0,14,12,Thursday,Wednesday,72,0,In Dome,2,1,1.0,0.0,6.0,3.0,3.0,"Map(vectorType -> dense, length -> 5, values -> List(1.0, 0.0, 6.0, 3.0, 3.0))","Map(vectorType -> dense, length -> 5, values -> List(2.14438622902521, 0.0, 3.1125666358079327, 1.557910510444531, 2.170090414673081))","Map(vectorType -> dense, length -> 2, values -> List(0.8692539507756196, -0.8692539507756196))",0.0
0,11327,1,9,0,Night Game,Night Game,0,8,1,Tuesday,Monday,78,7,Cloudy,3,1,0.0,0.0,2.0,5.0,1.0,"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.0, 2.0, 5.0, 1.0))","Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.0, 1.0375222119359775, 2.596517517407552, 0.7233634715576936))","Map(vectorType -> dense, length -> 2, values -> List(0.22149201475124758, -0.22149201475124758))",0.0
0,11399,0,15,9,Night Game,Day Game,1,7,1,Friday,Thursday,72,0,In Dome,2,0,0.0,1.0,3.0,6.0,3.0,"Map(vectorType -> dense, length -> 5, values -> List(0.0, 1.0, 3.0, 6.0, 3.0))","Map(vectorType -> dense, length -> 5, values -> List(0.0, 2.1506495099183027, 1.5562833179039663, 3.115821020889062, 2.170090414673081))","Map(vectorType -> dense, length -> 2, values -> List(1.2255716367930574, -1.2255716367930574))",0.0
0,11937,0,10,4,Night Game,Night Game,1,9,5,Wednesday,Tuesday,75,16,Overcast,3,1,0.0,0.0,4.0,4.0,4.0,"Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.0, 4.0, 4.0, 4.0))","Map(vectorType -> dense, length -> 5, values -> List(0.0, 0.0, 2.075044423871955, 2.0772140139260413, 2.8934538862307746))","Map(vectorType -> dense, length -> 2, values -> List(1.3034470203873356, -1.3034470203873356))",0.0
0,12059,0,7,1,Night Game,Day Game,0,7,2,Friday,Thursday,72,0,In Dome,2,1,0.0,1.0,3.0,6.0,3.0,"Map(vectorType -> dense, length -> 5, values -> List(0.0, 1.0, 3.0, 6.0, 3.0))","Map(vectorType -> dense, length -> 5, values -> List(0.0, 2.1506495099183027, 1.5562833179039663, 3.115821020889062, 2.170090414673081))","Map(vectorType -> dense, length -> 2, values -> List(1.2255716367930574, -1.2255716367930574))",0.0
0,12063,0,12,5,Day Game,Night Game,0,12,6,Thursday,Wednesday,87,15,Sunny,3,1,1.0,0.0,6.0,3.0,0.0,"Map(vectorType -> dense, length -> 5, values -> List(1.0, 0.0, 6.0, 3.0, 0.0))","Map(vectorType -> dense, length -> 5, values -> List(2.14438622902521, 0.0, 3.1125666358079327, 1.557910510444531, 0.0))","Map(vectorType -> dense, length -> 2, values -> List(-0.0410871108955857, 0.0410871108955857))",1.0
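
The `rawPrediction` and `prediction` columns in the rows above are linked by a simple rule. A small sketch in plain Python, assuming the classifier predicts class 1 when the second element of `rawPrediction` exceeds a threshold of 0.0 (taken here to be the default). The values are copied from four of the displayed rows, and the function name is illustrative.

```python
# (rawPrediction, prediction) pairs copied from four of the displayed result rows
rows = [
    ([-0.08195500580582094, 0.08195500580582094], 1.0),
    ([1.1318330764224531, -1.1318330764224531], 0.0),
    ([0.6965529792731986, -0.6965529792731986], 0.0),
    ([-0.0410871108955857, 0.0410871108955857], 1.0),
]

def predict_from_margin(raw_prediction, threshold=0.0):
    """Predict class 1 when the class-1 margin exceeds the threshold, else class 0."""
    return 1.0 if raw_prediction[1] > threshold else 0.0

recomputed = [predict_from_margin(raw) for raw, _ in rows]
reported = [pred for _, pred in rows]

print(recomputed == reported)
```

The size of the margin also carries information: the two rows predicted as class 1 have margins close to zero, so they sit near the decision boundary.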


In [0]:
results.select(['attendance_binary','prediction']).show()

+-----------------+----------+
|attendance_binary|prediction|
+-----------------+----------+
|                0|       1.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       1.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
|                0|       0.0|
+-----------------+----------+
only showing top 20 rows



#### Evaluating the model

1. Area under the ROC

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [0]:
AUC_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='attendance_binary',metricName='areaUnderROC')

In [0]:
AUC = AUC_evaluator.evaluate(results)

In [0]:
print("The area under the curve is {}".format(AUC))

The area under the curve is 0.6068944099378881


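An area under the ROC curve of about 0.61 is above the 0.5 of a random classifier but is the lowest of the three models, so the SVC separates the two classes only weakly. Here too the evaluator is fed the hard `prediction` column rather than `rawPrediction`.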

2. Area under the Precision Recall Curve

In [0]:
PR_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='attendance_binary',metricName='areaUnderPR')

In [0]:
PR = PR_evaluator.evaluate(results)

In [0]:
print("The area under the PR curve is {}".format(PR))

The area under the PR curve is 0.5878152277193831


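An area under the precision-recall curve of about 0.59 should be read against the share of positive games in the test set, which is what a random classifier would achieve. With 345 positives among 695 test games (about 0.50, see the confusion matrix in step 4), the model is only moderately better than chance on this metric.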

3. Accuracy

In [0]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [0]:
ACC_evaluator = MulticlassClassificationEvaluator(
    labelCol="attendance_binary", predictionCol="prediction", metricName="accuracy")

In [0]:
accuracy = ACC_evaluator.evaluate(results)

In [0]:
print("The accuracy of the model is {}".format(accuracy))

The accuracy of the model is 0.6071942446043166


With the predictors used, the SVC model correctly classifies about 61% of the test games. This is the lowest accuracy of the three models, though still above the majority-class baseline.

4. Confusion Matrix

In [0]:
from sklearn.metrics import confusion_matrix

In [0]:
y_true = results.select("attendance_binary")
y_true = y_true.toPandas()

y_pred = results.select("prediction")
y_pred = y_pred.toPandas()

cnf_matrix = confusion_matrix(y_true, y_pred)
print("Below is the confusion matrix: \n {}".format(cnf_matrix))

Below is the confusion matrix: 
 [[227 123]
 [150 195]]


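In the confusion matrix, rows are actual classes and columns are predicted classes, so the SVC model produced 227 true negatives, 123 false positives, 150 false negatives and 195 true positives. This corresponds to an accuracy of about 61% and a misclassification rate of about 39%.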
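
The headline metrics can be recomputed directly from the four cells of the matrix. A minimal sketch in plain Python using the values printed above; the variable names are illustrative.

```python
# Confusion matrix copied from the output above: rows are actual classes, columns are predictions
cnf_matrix = [[227, 123],
              [150, 195]]

tn, fp = cnf_matrix[0]
fn, tp = cnf_matrix[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
misclassification_rate = (fp + fn) / total
recall = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)   # true negative rate
precision = tp / (tp + fp)
balanced_accuracy = (recall + specificity) / 2

print(f"accuracy={accuracy:.4f}, misclassification={misclassification_rate:.4f}")
print(f"precision={precision:.4f}, recall={recall:.4f}, specificity={specificity:.4f}")
print(f"balanced accuracy={balanced_accuracy:.4f}")
```

The balanced accuracy works out to about 0.6069, the same value reported as the area under the ROC curve for this model. That is what one would expect when the ROC evaluator is given hard 0/1 predictions instead of continuous scores.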

#### Conclusion

In conclusion, the logistic regression model performed best at predicting attendance at a baseball game, with an accuracy of around 65.6%. This means it classified about 65.6% of the test games correctly, compared with a majority-class baseline of around 52.8%.

The other models, the decision tree and the linear SVC, also performed better than the baseline, with accuracies of around 63.9% and 60.7% respectively.

Despite its moderate accuracy, the logistic regression model can still be useful for anticipating attendance at a baseball game. Because it predicts the attendance class (`attendance_binary`) of a game rather than an exact head count, businesses can use it as a rough guide to expected turnout when allocating resources such as marketing spend. For example, they can adjust the pricing and supply of tickets, food, snacks and merchandise according to the predicted class in order to maximize profits. They can also use the predictions to decide where promotion is most needed, such as social media campaigns, giveaways or discounted merchandise, and thereby increase attendance and profits. Finally, the model can feed into broader decisions, such as the game schedule, the weather conditions best suited to players and fans, and the quality and size of the stadium. Used this way, it helps businesses make decisions that serve both their own interests and those of sports fans.