## European Soccer Events Analysis: Machine Learning

In this notebook, we use a [Gradient-boosted tree](https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#gradient-boosted-tree-classifier) classifier to fit a model on transformed soccer events data that could help predict whether a combination of on-field conditions leads to a goal.

In [0]:
%sql USE EURO_SOCCER_DB

In [0]:
%sql SELECT * FROM GAME_EVENTS limit 5

id_odsp,id_event,sort_order,time,event_type,event_type_str,event_type2,event_type2_str,side,side_str,event_team,opponent,player,player2,player_in,player_out,shot_place,shot_place_str,shot_outcome,shot_outcome_str,is_goal,location,location_str,bodypart,bodypart_str,assist_method,assist_method_str,situation,situation_str,time_bin,country_code
UFot0hit/,UFot0hit1,1,2,1,Attempt,12,Key Pass,2,Away,Hamburg SV,Borussia Dortmund,mladen petric,gokhan tore,,,6,High and wide,2,Off target,0,9,Left side of the box,2,Left foot,1,Pass,1,Open play,0.0,DEU
UFot0hit/,UFot0hit2,2,4,2,Corner,99,,1,Home,Borussia Dortmund,Hamburg SV,dennis diekmeier,dennis diekmeier,,,99,,99,,0,99,,99,,0,,99,,0.0,DEU
UFot0hit/,UFot0hit3,3,4,2,Corner,99,,1,Home,Borussia Dortmund,Hamburg SV,heiko westermann,heiko westermann,,,99,,99,,0,99,,99,,0,,99,,0.0,DEU
UFot0hit/,UFot0hit4,4,7,3,Foul,99,,1,Home,Borussia Dortmund,Hamburg SV,sven bender,,,,99,,99,,0,99,,99,,0,,99,,0.0,DEU
UFot0hit/,UFot0hit5,5,7,8,Free kick won,99,,2,Away,Hamburg SV,Borussia Dortmund,gokhan tore,,,,99,,99,,0,2,Defensive half,99,,0,,99,,0.0,DEU


In [0]:
gameEventsDf = spark.sql("select event_type_str, event_team, shot_place_str, location_str, assist_method_str, situation_str, country_code, is_goal from game_events")

In [0]:
gameEventsDf = gameEventsDf.withColumnRenamed('is_goal', 'label')

Cool, right? I'm using SQL and Python side by side in the same notebook, which is possible on the Databricks platform.

In [0]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier, LogisticRegression
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [0]:
categFeatures = ["event_type_str", "event_team", "shot_place_str", "location_str", "assist_method_str", "situation_str", "country_code"]

In [0]:
stringIndexers = [StringIndexer().setInputCol(baseFeature).setOutputCol(baseFeature + "_idx") for baseFeature in categFeatures]

In [0]:
encoders = [OneHotEncoder().setInputCol(baseFeature + "_idx").setOutputCol(baseFeature + "_vec") for baseFeature in categFeatures]

In [0]:
featureAssembler = VectorAssembler()
featureAssembler.setInputCols([baseFeature + "_vec" for baseFeature in categFeatures])
featureAssembler.setOutputCol("features")
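
As a rough illustration of what the indexing, encoding, and assembling stages above produce, here is a minimal pure-Python sketch that runs without Spark. It assumes the Spark defaults as I understand them: `StringIndexer` gives index 0 to the most frequent value, and `OneHotEncoder` drops the last category (`dropLast=True`), so a column with k distinct values becomes a vector of length k - 1. The helper names and the toy rows are hypothetical, and dense lists stand in for Spark's sparse vectors.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Dense one-hot vector; a dropped last category maps to all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


def assemble(*vectors):
    """Concatenate per-column vectors into a single feature vector."""
    return [x for vec in vectors for x in vec]


# Hypothetical toy rows: (situation_str, assist_method_str)
rows = [
    ("Open play", "Pass"),
    ("Open play", "Cross"),
    ("Open play", "Pass"),
    ("Corner", "Pass"),
    ("Corner", "Cross"),
    ("Free kick", "Pass"),
]

situation_index = fit_string_index([r[0] for r in rows])
assist_index = fit_string_index([r[1] for r in rows])

features = [
    assemble(
        one_hot(situation_index[s], len(situation_index)),
        one_hot(assist_index[a], len(assist_index)),
    )
    for s, a in rows
]

for row, vec in zip(rows, features):
    print(row, vec)
```

In this sketch the rarest `situation_str` value is encoded as all zeros, which is the effect of dropping the last category.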

In [0]:
# maxDepth=5 and maxIter=20 are set explicitly; they match the GBTClassifier defaults
gbtClassifier = GBTClassifier(featuresCol="features", labelCol="label", maxDepth=5, maxIter=20)
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipelineStages = stringIndexers + encoders + [featureAssembler]
pipeline = Pipeline(stages=pipelineStages)

In [0]:
df_tr = pipeline.fit(gameEventsDf).transform(gameEventsDf)

## Cross Validation

In [0]:
param_grid = ParamGridBuilder().addGrid(gbtClassifier.maxDepth, [5,7]).build()

In [0]:
param_grid_lr = ParamGridBuilder().addGrid(lr.regParam, [0.1,0.01]).addGrid(lr.elasticNetParam, [0,0.5,1]).build()

In [0]:
crossval = CrossValidator(estimator=lr, estimatorParamMaps=param_grid_lr ,evaluator=BinaryClassificationEvaluator(), numFolds=3)
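
To get a feel for how much work this cross-validation setup implies, the following sketch counts the model fits in plain Python. It mirrors the grid values and `numFolds` used above; the variable names are my own, and the final "+ 1" assumes that `CrossValidator` refits the best parameter map on the full training set once the search is done.

```python
from itertools import product

reg_params = [0.1, 0.01]          # grid values for lr.regParam
elastic_net_params = [0, 0.5, 1]  # grid values for lr.elasticNetParam
num_folds = 3

# The grid holds one parameter map per combination of values.
param_maps = [
    {"regParam": reg, "elasticNetParam": enet}
    for reg, enet in product(reg_params, elastic_net_params)
]

# Every parameter map is trained once per fold during the search.
fits_during_search = len(param_maps) * num_folds

# Assumed: one extra fit of the best map on the whole training set.
total_fits = fits_during_search + 1

print(len(param_maps), fits_during_search, total_fits)
```

The count grows multiplicatively with every value added to the grid, which is worth keeping in mind before widening the search on a large events table.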

In [0]:
(trainingData, testData) = df_tr.randomSplit([0.75, 0.25])

In [0]:
cross_model = crossval.fit(trainingData)

In [0]:
lr_model = lr.fit(trainingData)

In [0]:
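# Split the raw (untransformed) events under separate names, so the transformed
# trainingData/testData from the earlier split remain available to the models below.
(rawTrainingData, rawTestData) = gameEventsDf.randomSplit([0.75, 0.25])
model = pipeline.fit(rawTrainingData)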

In [0]:
prediction_lr = lr_model.transform(testData)

In [0]:
predictions = cross_model.transform(testData)


In [0]:
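# Score the ranking with the rawPrediction column; the hard 0/1 "prediction"
# column discards the model's confidence and distorts the AUC.
evaluator = BinaryClassificationEvaluator(
    labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
evaluator.evaluate(predictions)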
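
The evaluator's default metric is the area under the ROC curve (AUC), which measures how well the model ranks goals above non-goals. The sketch below computes AUC from its pairwise-ranking definition in plain Python; it is not Spark's implementation, and the labels and scores are made up for illustration. It also shows why continuous scores tend to be more informative than thresholded 0/1 predictions when computing this metric.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = goal) and model scores, for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.4, 0.2, 0.2, 0.1, 0.1]

# Hard 0/1 predictions obtained by thresholding the scores at 0.5.
hard_predictions = [1.0 if s >= 0.5 else 0.0 for s in scores]

auc_from_scores = roc_auc(labels, scores)
auc_from_hard = roc_auc(labels, hard_predictions)

print(auc_from_scores, auc_from_hard)
```

On this toy data the thresholded predictions yield a lower AUC than the raw scores, because every example on the same side of the threshold becomes a tie.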