# My xG model

I want to create my own expected goal model. I will be trying different model out, test and evaluate them and see which one performs the best.

The models I will build is:
- LogisticRegression
- Random Forest
- Gradient boost

Inspiration from Nick Wan (https://colab.research.google.com/drive/1ZtGuRWRMc1I_V7EJDllLVOtdot_ZOJlx?usp=sharing#scrollTo=cXk8uxmOXR61)

### Data preperation

In [1]:
#import libraries
import requests
import pandas as pd
import numpy as np

We will extract the data from StatsBomb open data

In [2]:
#the urls we will extract the data from
base_url = "https://raw.githubusercontent.com/statsbomb/open-data/master/data/"
comp_url = base_url + "matches/{}/{}.json"
match_url = base_url + "events/{}.json"

In [3]:
def parse_data(competition_id, season_id):
    matches = requests.get(url=comp_url.format(competition_id, season_id)).json() #get all match data
    match_ids = [match["match_id"] for match in matches] #extract all match_ids from the match data
    
    all_events = []
    for match_id in match_ids:
        
        events = requests.get(url=match_url.format(match_id)).json() #extract the events from the matches with the given match_id
        
        shots = [event for event in events if event["type"]["name"] == "Shot"] #extract the shots from the event data
        #we can now collect the features from the shots we find interesting
        for shot in shots:
            features = {
                "play_pattern": shot["play_pattern"]["name"],
                "head": 1 if shot["shot"]["body_part"]["name"] == "Head" else 0,
                "x": shot["location"][0],
                "y": shot["location"][1],
                "phase": shot["shot"]["type"]["name"], #from which phase the shot came from
                "outcome": 1 if shot["shot"]["outcome"]["name"] == "Goal" else 0,
                "statsbomb_xg": shot["shot"]["statsbomb_xg"]
                
            }
        all_events.append(features)
    
    
    return pd.DataFrame(all_events)

In [4]:
competition_id = 37
season_id = 4
df = parse_data(competition_id, season_id)

In [5]:
df.head()

Unnamed: 0,play_pattern,head,x,y,phase,outcome,statsbomb_xg
0,Regular Play,0,104.0,37.0,Open Play,0,0.068793
1,From Corner,1,114.0,44.0,Open Play,0,0.032231
2,From Goal Kick,1,111.0,44.0,Open Play,0,0.095063
3,From Throw In,0,113.0,46.0,Open Play,1,0.635004
4,From Throw In,0,97.0,30.0,Open Play,0,0.016997


We can now do some calculations on the data to calculate `distance_to_goal` and `goal_angle`, which are two very good predictors for if the shot will result in a goal. 

In [6]:
#function that calculates the distance to the center of the goal
def distance_to_goal(origin):
    dest = np.array([120., 40.])
    return np.sqrt(np.sum((origin - dest) ** 2))

In [7]:
#function that calculates the angle within the shot is taken
def goal_angle(origin):
    p0 = np.array((120., 36.))  # Left Post
    p1 = np.array(origin, dtype=np.float)
    p2 = np.array((120., 44.))  # Right Post

    v0 = p0 - p1
    v1 = p2 - p1

    angle = np.abs(np.math.atan2(np.linalg.det([v0, v1]), np.dot(v0, v1)))
    
    return angle

We will now add these columns to our dataframe.

We will use the technique `broadcasting`, which is efficient in large datasets. (https://stackoverflow.com/questions/29954263/what-does-the-term-broadcasting-mean-in-pandas-documentation)

We are also using the `lambda` to avoid loopig through our pandas df.

In [8]:
df['distance_to_goal'] = df.apply(lambda row: distance_to_goal(row[['x', 'y']]), axis=1)
df['goal_angle'] = df.apply(lambda r: goal_angle(r[['x', 'y']]), axis=1)

In [9]:
df.head()

Unnamed: 0,play_pattern,head,x,y,phase,outcome,statsbomb_xg,distance_to_goal,goal_angle
0,Regular Play,0,104.0,37.0,Open Play,0,0.068793,16.278821,0.474829
1,From Corner,1,114.0,44.0,Open Play,0,0.032231,7.211103,0.927295
2,From Goal Kick,1,111.0,44.0,Open Play,0,0.095063,9.848858,0.726642
3,From Throw In,0,113.0,46.0,Open Play,1,0.635004,9.219544,0.681771
4,From Throw In,0,97.0,30.0,Open Play,0,0.016997,25.079872,0.291606


# Training, testing and evaluating our model
Our data is now ready and we will start build our model, test and evaluate it's performance.

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier