# Predicting AFL winners using Machine Learning - Cleaning & Feature Preprocessing

---

<img src='img/sportsbet.png' 
style="height:200px;width:700px;">

---



Using the data provided by Sportsbet and the CIKM 2015 challenge, we are going to try to maximize winnings from tipping bets in any given AFL season. This involves two steps:

* Part 1: Predicting the probability of the home team winning with high enough accuracy.

* Part 2: Calculating the best percentage of our capital to bet.
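
To make Part 2 concrete, here is a minimal sketch of one common way to size a bet, the Kelly criterion. This is an assumption for illustration only, not necessarily the staking rule used later in the project. The function name `kelly_fraction` and the 70% win probability are hypothetical; the odds of 1.64 are the home odds from the first row of the data shown further down.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Fraction of capital to stake under the Kelly criterion.

    p_win: our estimated probability that the bet wins (the Part 1 output).
    decimal_odds: bookmaker decimal odds, e.g. 1.64 returns $1.64 per $1 staked.
    """
    b = decimal_odds - 1.0  # net profit per unit staked
    if b <= 0:
        return 0.0  # odds of 1.0 or less can never be profitable
    f = (p_win * b - (1.0 - p_win)) / b
    return max(f, 0.0)  # stake nothing when our edge is negative


# Hypothetical example: we believe the home team wins 70% of the time at odds of 1.64.
stake = kelly_fraction(0.70, 1.64)
print(round(stake, 3))  # roughly 0.23, i.e. about 23% of capital
```

The better our probability estimate from Part 1, the more this kind of rule can be trusted; with a poorly calibrated model it will overbet.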



The data for this project is sourced from several websites. The meat of the dataset is provided under the **2015 CIKM & SportsBet AFL Challenge**, with other sections, such as historical ladder rankings and historical betting odds, coming from:

> [The Official AFL Site](http://www.afl.com.au/ladder)
<img src='img/ladder.png'>

---

> [AFL Tables](https://afltables.com/afl/stats/stats_idx.html)
<img src='img/afltables.png'>

---

> [aussportsbetting.com](http://www.aussportsbetting.com/data/historical-afl-results-and-odds-data/)
<img src='img/aussportsbetting.png'>

We are unable to use the statistics of the current game to predict its outcome ("Duh! Otherwise we'd all be millionaires."), so we are going to have to do a lot of feature preprocessing to create predictors for our models. Some important ones that come to mind are:

* Previous year's ladder score

* Head-to-head data

* Current form for the team

* Premiership match?
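
As a rough sketch of what a "current form" predictor could look like, the block below computes each team's mean margin over its previous three games. The toy results, the three-game window and the `form` column name are all assumptions for illustration; only `Date`, `h_team`, `a_team` and `margin` match names used later in this notebook.

```python
import pandas as pd

# Toy results (made up) using the column names that appear later in this notebook.
games = pd.DataFrame({
    "Date": pd.to_datetime(["2017-03-23", "2017-03-30", "2017-04-06", "2017-04-13"]),
    "h_team": ["Adelaide", "Richmond", "Adelaide", "Richmond"],
    "a_team": ["Richmond", "Adelaide", "Richmond", "Adelaide"],
    "margin": [10, -20, 30, 5],  # home score minus away score
})

# One row per team per game, with the margin seen from that team's side.
home = games[["Date", "h_team", "margin"]].rename(columns={"h_team": "team"})
away = (
    games[["Date", "a_team", "margin"]]
    .rename(columns={"a_team": "team"})
    .assign(margin=lambda d: -d["margin"])
)
team_games = pd.concat([home, away], ignore_index=True).sort_values(["team", "Date"])

# Mean margin over the previous 3 games. shift(1) excludes the current game,
# so the feature only uses information available before kick-off.
team_games["form"] = team_games.groupby("team")["margin"].transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)

print(team_games)
```

The `shift(1)` is the important part: without it, the current game's result would leak into its own predictor.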

But first, let's import some dependencies.

#### Import Dependencies
---

In [27]:
# Imports dependencies
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')

%matplotlib inline

In [28]:
data = pd.read_csv("data/afl_2017.csv")

In [29]:
data["Date"] = pd.to_datetime(data["Date"])

In [30]:
data["Date"][0].year

2017

In [31]:
data["season"] = data["Date"].dt.year

In [32]:
data.rename(columns={'Home Team': 'h_team', 'Away Team': 'a_team', "Play Off Game?": "play_off"}, inplace=True)

In [33]:
data.h_team.unique()

array(['Adelaide', 'Richmond', 'GWS Giants', 'Geelong', 'Port Adelaide',
       'Sydney', 'West Coast', 'Essendon', 'Brisbane', 'Collingwood',
       'Hawthorn', 'Fremantle', 'St Kilda', 'Melbourne', 'Carlton',
       'Gold Coast', 'Western Bulldogs', 'North Melbourne'], dtype=object)

#### Team Names and Season Done
---

In [34]:
data["play_off"] = data["play_off"].map(lambda x: 1 if x == "Y" else 0)

In [35]:
data.drop(["Home Goals", "Home Behinds", "Away Goals", "Away Behinds", "Venue", "Kick Off (local)"], inplace = True, axis = 1)

In [36]:
data["margin"] = data["Home Score"] - data["Away Score"]

In [37]:
data.head()

Unnamed: 0,Date,h_team,a_team,Home Score,Away Score,play_off,Home Odds,Away Odds,season,margin
0,2017-09-30,Adelaide,Richmond,60,108,1,1.64,2.26,2017,-48
1,2017-09-23,Richmond,GWS Giants,103,67,1,1.54,2.5,2017,36
2,2017-09-22,Adelaide,Geelong,136,75,1,1.43,2.85,2017,61
3,2017-09-16,GWS Giants,West Coast,125,58,1,1.42,2.88,2017,67
4,2017-09-15,Geelong,Sydney,98,39,1,3.1,1.37,2017,59
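
Since Part 1 is about the home team winning, the `margin` column can be turned into a binary label. The sketch below assumes a hypothetical column name `h_win` and treats a draw as "not a home win"; both are choices made here for illustration, not something fixed by the data.

```python
import pandas as pd

# Margins from the five rows shown above, plus one made-up draw (0) at the end.
sample = pd.DataFrame({"margin": [-48, 36, 61, 67, 59, 0]})

# 1 if the home team won, 0 otherwise (a draw counts as 0 under this assumption).
sample["h_win"] = (sample["margin"] > 0).astype(int)

print(sample)
```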


#### Playoff games done
---

In [38]:
# Loads the ladder rankings: ladder_rankings
ladder_rankings = pd.read_csv("data/ladder_rankings.csv", index_col = 1)

In [39]:
ladder_rankings.head()

tname,tid,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001
Adelaide,101.0,62.0,64.0,54.0,44.0,40.0,68.0,28.0,36.0,56.0,52.0,48.0,64.0,68.0,68.0,14.0,60.0,48.0
Brisbane,102.0,20.0,12.0,16.0,28.0,40.0,40.0,16.0,28.0,54.0,40.0,40.0,28.0,40.0,64.0,16.0,68.0,68.0
Carlton,103.0,24.0,28.0,16.0,30.0,44.0,44.0,58.0,44.0,52.0,40.0,16.0,14.0,18.0,64.0,20.0,12.0,56.0
Collingwood,104.0,38.0,36.0,40.0,44.0,56.0,64.0,80.0,70.0,60.0,48.0,52.0,56.0,20.0,60.0,28.0,52.0,44.0
Essendon,105.0,48.0,12.0,24.0,50.0,56.0,44.0,46.0,28.0,42.0,32.0,40.0,14.0,32.0,56.0,30.0,50.0,68.0


In [42]:
def get_prev_score(tname, year):
    """
    Returns the previous year's ladder score for the given team name and season year, or 0 if no score is recorded
    """
    try:
        prev_score = ladder_rankings.loc[tname, str(year-1)]
    except KeyError:
        prev_score = 0     
    
    return prev_score

In [43]:
# Adds the previous year's ladder score for the home and away team: 'h_prevladder_score', 'a_prevladder_score'
data['h_prevladder_score'] = data.apply(lambda row: get_prev_score(row["h_team"], row["season"]), axis=1)
data['a_prevladder_score'] = data.apply(lambda row: get_prev_score(row["a_team"], row["season"]), axis=1)
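
The row-wise `apply` above is fine at this size. As an alternative sketch, the same lookup can be done with a reshape and a merge. The toy `ladder` and `games` frames below are assumptions that mimic the layout of `ladder_rankings` and `data` (the real ladder table also has a `tid` column, which would need to be dropped before melting). Missing team/year pairs fall back to 0, mirroring `get_prev_score`.

```python
import pandas as pd

# Toy ladder in the same wide layout as ladder_rankings (values from the table above).
ladder = pd.DataFrame(
    {"2017": [62.0, 20.0], "2016": [64.0, 12.0]},
    index=pd.Index(["Adelaide", "Brisbane"], name="tname"),
)

# Wide -> long: one row per (team, ladder year) with that year's ladder score.
ladder_long = ladder.reset_index().melt(
    id_vars="tname", var_name="ladder_year", value_name="prev_score"
)
# A score from year Y is the "previous" score for games played in season Y + 1.
ladder_long["season"] = ladder_long["ladder_year"].astype(int) + 1

# Toy games; the Gold Coast row has no matching ladder entry in this toy ladder.
games = pd.DataFrame({
    "h_team": ["Adelaide", "Brisbane", "Gold Coast"],
    "season": [2017, 2018, 2017],
})

# A left merge keeps every game in its original order.
merged = games.merge(
    ladder_long[["tname", "season", "prev_score"]],
    left_on=["h_team", "season"],
    right_on=["tname", "season"],
    how="left",
)
games["h_prevladder_score"] = merged["prev_score"].fillna(0).to_numpy()

print(games)
```

The away-team column would follow the same pattern with `a_team` as the left key.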

#### Previous Ladder score done
---

In [46]:
data.head(5)

Unnamed: 0,Date,h_team,a_team,Home Score,Away Score,play_off,Home Odds,Away Odds,season,margin,h_prevladder_score,a_prevladder_score
0,2017-09-30,Adelaide,Richmond,60,108,1,1.64,2.26,2017,-48,64.0,32.0
1,2017-09-23,Richmond,GWS Giants,103,67,1,1.54,2.5,2017,36,32.0,64.0
2,2017-09-22,Adelaide,Geelong,136,75,1,1.43,2.85,2017,61,64.0,68.0
3,2017-09-16,GWS Giants,West Coast,125,58,1,1.42,2.88,2017,67,64.0,64.0
4,2017-09-15,Geelong,Sydney,98,39,1,3.1,1.37,2017,59,68.0,68.0
