# Chapter 2

## Exercise 1

In this exercise we'll create input parsing functions that parse datasets of [Premier League results](https://github.com/footballcsv/england).

In [270]:
import os
import numpy as np
import pandas as pd

## Problem 1

Create a function `load_matches` that does the following:

- It takes the path to a single CSV file of the form `../data/england-master/XXXXs/XXXX-XX/eng.1.csv`.
- It reads the CSV.
- It converts the `Date` column into proper datetime values.
- It determines the season of the dataset and stores it in the column `Season`. For example, data files in 2019-20 have season 2019.
- It returns the dataset.

Hint: [min](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.min.html) and [to_period](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-span-representation) might be of interest.
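
As a minimal sketch of how the two hinted methods combine, the snippet below takes the earliest date of a small, made-up series and converts it to a yearly period. The dates are illustrative only and are not read from the dataset.

```python
import pandas as pd

# Made-up match dates spanning one season (illustrative, not from the data files)
dates = pd.to_datetime(pd.Series(['2019-08-09', '2020-01-01', '2020-07-26']))

# The earliest match date falls in the year the season started
first_match = dates.min()

# to_period('Y') keeps only the calendar year of that timestamp
season = first_match.to_period('Y')
print(season)  # 2019
```

Because the season is derived from the earliest date, matches played in the spring of the following year still get the starting year as their season.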

In [271]:
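def load_matches(datafile):
    matches = pd.read_csv(datafile)
    # Assign the whole column so that its dtype becomes datetime64
    matches['Date'] = pd.to_datetime(matches['Date'])
    # The season is the calendar year of the earliest match in the file
    matches['Season'] = matches['Date'].min().to_period('Y')
    return matches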

matches_19 = load_matches('../data/england-master/2010s/2019-20/eng.1.csv')
matches_19.head()

Unnamed: 0,Round,Date,Team 1,FT,Team 2,Season
0,1,2019-08-09,Liverpool FC,4-1,Norwich City FC,2019
1,1,2019-08-10,West Ham United FC,0-5,Manchester City FC,2019
2,1,2019-08-10,Burnley FC,3-0,Southampton FC,2019
3,1,2019-08-10,AFC Bournemouth,1-1,Sheffield United FC,2019
4,1,2019-08-10,Crystal Palace FC,0-0,Everton FC,2019


## Problem 2

We want to convert our data into a tidy format with the following columns:
- `Round` (same as initial data)
- `Date` (same as initial data)
- `Season` (same as initial data)
- `Team` (new column that tells which team is in question)
- `Opponent` (new column that tells who was the opponent for the `Team`)

Create a function `format_matches`, which takes a single DataFrame created by `load_matches` and does the following:

- Use [pandas.Series.str.extract](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html) to extract `HomeGoals` and `AwayGoals` from the `FT` column. Convert these columns into integers.
- Create two copies of the initial dataset: one from the away perspective and one from the home team perspective.
- For both datasets set the column `Side` with values `Home` or `Away`.
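- For both datasets, rename the team and goal columns to `Team`, `Opponent`, `ScoredGoals` and `AllowedGoals` from that side's perspective, and add a `Result` column (`Win`, `Draw` or `Loss`).
- Concatenate the two datasets and add a `Points` column (3 for a win, 1 for a draw, 0 for a loss).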
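
The extraction step can be tried on its own first. The sketch below runs `str.extract` with two named groups on a few made-up score strings; the named groups become the column names of the result. The scores are illustrative, and `\d+` is used so that a double-digit score would also be captured.

```python
import pandas as pd

# Made-up full-time scores in the same "home-away" format as the FT column
ft = pd.Series(['4-1', '0-5', '10-0'])

# Each named group becomes one column of the extracted DataFrame
goals = ft.str.extract(r'(?P<HomeGoals>\d+)-(?P<AwayGoals>\d+)').astype(int)
print(goals)
```

The extracted values are strings until `astype(int)` is applied, which is why the conversion is needed before goals can be compared numerically.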


The output of this formatting function should look something like this:

| |Round|Date|Team|Opponent|Season|ScoredGoals|AllowedGoals|Side|Result|Points|
|-|-|-|-|-|-|-|-|-|-|-|
|0|1|2019-08-09|Liverpool FC|Norwich City FC|2019|4|1|Home|Win|3|
|1|1|2019-08-10|West Ham United FC|Manchester City FC|2019|0|5|Home|Loss|0|
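
The `Result` and `Points` columns in this table follow directly from comparing `ScoredGoals` with `AllowedGoals`. One possible way to derive them, shown here on a toy frame, is `np.select` followed by a mapping. This is an alternative sketch, not necessarily how the exercise is meant to be solved, and the goal counts are made up.

```python
import numpy as np
import pandas as pd

# Toy data: one win, one loss and one draw from the team's perspective
toy = pd.DataFrame({'ScoredGoals': [4, 0, 1], 'AllowedGoals': [1, 5, 1]})

# Conditions are checked in order; rows matching neither fall back to 'Loss'
conditions = [
    toy['ScoredGoals'] > toy['AllowedGoals'],
    toy['ScoredGoals'] == toy['AllowedGoals'],
]
toy['Result'] = np.select(conditions, ['Win', 'Draw'], default='Loss')

# Standard league scoring: 3 points for a win, 1 for a draw, 0 for a loss
toy['Points'] = toy['Result'].map({'Win': 3, 'Draw': 1, 'Loss': 0})
print(toy)
```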

In [272]:
def format_matches(matches):
    # Parse "4-1" into two integer columns; \d+ also handles double-digit scores
    goals = matches['FT'].str.extract(r'(?P<HomeGoals>\d+)-(?P<AwayGoals>\d+)').astype(np.int64)
    matches['HomeGoals'] = goals['HomeGoals']
    matches['AwayGoals'] = goals['AwayGoals']
    home_win = matches['HomeGoals'] > matches['AwayGoals']
    draw = matches['HomeGoals'] == matches['AwayGoals']
    # Boolean masks are negated with ~, not with unary minus
    away_win = ~(draw | home_win)

    # Home perspective: Team 1 is the team, Team 2 the opponent
    home_matches = matches.copy()
    home_matches['Side'] = 'Home'
    home_matches = home_matches.rename(columns={'Team 1': 'Team', 'Team 2': 'Opponent', 'HomeGoals': 'ScoredGoals', 'AwayGoals': 'AllowedGoals'})
    home_matches.loc[home_win, 'Result'] = 'Win'
    home_matches.loc[draw, 'Result'] = 'Draw'
    home_matches.loc[away_win, 'Result'] = 'Loss'

    # Away perspective: the same matches with the roles swapped
    away_matches = matches.copy()
    away_matches['Side'] = 'Away'
    away_matches = away_matches.rename(columns={'Team 1': 'Opponent', 'Team 2': 'Team', 'HomeGoals': 'AllowedGoals', 'AwayGoals': 'ScoredGoals'})
    away_matches.loc[home_win, 'Result'] = 'Loss'
    away_matches.loc[draw, 'Result'] = 'Draw'
    away_matches.loc[away_win, 'Result'] = 'Win'

    # Stack both perspectives and award 3 points for a win, 1 for a draw
    all_matches = pd.concat([home_matches, away_matches])
    all_matches['Points'] = 0
    all_matches.loc[all_matches['Result'] == 'Win', 'Points'] = 3
    all_matches.loc[all_matches['Result'] == 'Draw', 'Points'] = 1
    return all_matches

matches_19 = format_matches(matches_19)
matches_19.head()

Unnamed: 0,Round,Date,Team,FT,Opponent,Season,ScoredGoals,AllowedGoals,Side,Result,Points
0,1,2019-08-09,Liverpool FC,4-1,Norwich City FC,2019,4,1,Home,Win,3
1,1,2019-08-10,West Ham United FC,0-5,Manchester City FC,2019,0,5,Home,Loss,0
2,1,2019-08-10,Burnley FC,3-0,Southampton FC,2019,3,0,Home,Win,3
3,1,2019-08-10,AFC Bournemouth,1-1,Sheffield United FC,2019,1,1,Home,Draw,1
4,1,2019-08-10,Crystal Palace FC,0-0,Everton FC,2019,0,0,Home,Draw,1


In [273]:
def clean_matches(matches):
    # astype returns a new DataFrame with the listed columns stored as categoricals
    return matches.astype({'Team': 'category', 'Opponent': 'category', 'Result': 'category'})

clean_matches(matches_19).dtypes

Round                    int64
Date            datetime64[ns]
Team                  category
FT                      object
Opponent              category
Season           period[A-DEC]
ScoredGoals              int64
AllowedGoals             int64
Side                    object
Result                category
Points                   int64
dtype: object
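
The `category` dtype stores each distinct label once and refers to it by an integer code, which suits columns such as team names that repeat many times. The sketch below shows the conversion on a toy frame; the rows are made up for illustration.

```python
import pandas as pd

# Toy frame with a repeated team name (illustrative values)
toy = pd.DataFrame({
    'Team': ['Arsenal FC', 'Chelsea FC', 'Arsenal FC'],
    'Points': [3, 1, 0],
})

# Passing a dict to astype converts only the listed columns
toy = toy.astype({'Team': 'category'})

print(toy.dtypes)
print(toy['Team'].cat.categories.tolist())  # the distinct labels
print(toy['Team'].cat.codes.tolist())       # one integer code per row
```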

In [274]:
def read_matches(matchfolder, seasons):
    datasets = []

    for season in seasons:
        # Season 19 is stored in <matchfolder>/2010s/2019-20/eng.1.csv
        decadestr = '20%d0s' % (season // 10)
        seasonstr = '20%02d-%02d' % (season, season + 1)

        datapath = os.path.join(matchfolder, decadestr, seasonstr, 'eng.1.csv')
        datasets.append(format_matches(load_matches(datapath)))

    return pd.concat(datasets)
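
To see which files `read_matches` will open, the path construction can be pulled out and checked by itself. The helper name `season_path` below is hypothetical and exists only for this sketch; the formatting logic mirrors the function above.

```python
import os

def season_path(matchfolder, season):
    """Build the CSV path for a two-digit season start year (hypothetical helper)."""
    # 9 // 10 == 0 gives '2000s', 19 // 10 == 1 gives '2010s'
    decade = '20%d0s' % (season // 10)
    # Zero-padded start and end years, e.g. '2009-10'
    span = '20%02d-%02d' % (season, season + 1)
    return os.path.join(matchfolder, decade, span, 'eng.1.csv')

for season in (0, 9, 19):
    print(season_path('../data/england-master', season))
```

Note that this scheme only covers seasons starting in 2000 to 2098, because the century is hard-coded in the format strings.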

In [275]:
matches_all = clean_matches(read_matches('../data/england-master/', range(0,20)))

In [276]:
win_loss = matches_all.loc[(matches_all.loc[:,'Side'] == 'Home'),'Result'].value_counts()

In [277]:
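from scipy.stats import binomtest
# Two-sided test of whether home wins and home losses are equally likely (draws excluded)
decisive = win_loss.loc[win_loss.index != 'Draw']
binomtest(int(decisive['Win']), int(decisive.sum())).pvalue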

1.3984301759446833e-74
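
The p-value comes from a two-sided binomial test: among matches that did not end in a draw, it asks how likely the observed split between home wins and home losses would be if both were equally probable. The sketch below computes the same kind of exact p-value by hand for small, made-up counts, using only the standard library. The function name `two_sided_binom_p` is hypothetical.

```python
from math import comb

def two_sided_binom_p(k, n):
    """Exact two-sided p-value for k successes in n fair (p = 0.5) trials."""
    observed = comb(n, k)
    # Add up every outcome that is no more likely than the observed one
    extreme = sum(comb(n, i) for i in range(n + 1) if comb(n, i) <= observed)
    return extreme / 2 ** n

# Made-up example: 8 home wins out of 10 decisive matches
p_value = two_sided_binom_p(8, 10)
print(p_value)  # 0.109375
```

With only ten matches an 8-2 split is not strong evidence, whereas the thousands of matches in the full dataset make even a moderate home advantage show up as an extremely small p-value.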

In [278]:
matches_all.groupby('Team')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f2084eadf10>

In [291]:
team_standings = matches_all.loc[:,['Season','Team','Points']].groupby(['Season','Team']).sum()
team_standings.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Points
Season,Team,Unnamed: 2_level_1
2000,AFC Bournemouth,
2000,Arsenal FC,70.0
2000,Aston Villa FC,54.0
2000,Birmingham City FC,
2000,Blackburn Rovers FC,


In [292]:
team_standings = team_standings.reset_index()
team_standings.head()

Unnamed: 0,Season,Team,Points
0,2000,AFC Bournemouth,
1,2000,Arsenal FC,70.0
2,2000,Aston Villa FC,54.0
3,2000,Birmingham City FC,
4,2000,Blackburn Rovers FC,


In [296]:
team_standings = team_standings.dropna()
team_standings.head()

Unnamed: 0,Season,Team,Points
1,2000,Arsenal FC,70.0
2,2000,Aston Villa FC,54.0
7,2000,Bradford City AFC,26.0
11,2000,Charlton Athletic FC,52.0
12,2000,Chelsea FC,61.0
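
The rows with missing points appear because `Team` is categorical, so the grouping lists every team for every season, including seasons in which a team did not play in the league. As a possible alternative to dropping those rows afterwards, `groupby` accepts `observed=True`, which keeps only the combinations that occur in the data. The sketch uses a toy frame with made-up points.

```python
import pandas as pd

# Toy data: Chelsea FC has no rows for season 2001 (illustrative values)
toy = pd.DataFrame({
    'Season': [2000, 2000, 2001],
    'Team': pd.Categorical(['Arsenal FC', 'Chelsea FC', 'Arsenal FC']),
    'Points': [3, 1, 3],
})

# observed=True skips season/team pairs that never occur in the data
standings = (
    toy.groupby(['Season', 'Team'], observed=True)['Points']
    .sum()
    .reset_index()
)
print(standings)
```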


Let's compare the team with the most points in each season to this [list of Premier League champions](https://en.wikipedia.org/wiki/List_of_English_football_champions#Premier_League_(1992%E2%80%93present)).

In [297]:
team_standings.groupby('Season').apply(lambda x: x.nlargest(1, 'Points'))

Unnamed: 0_level_0,Unnamed: 1_level_0,Season,Team,Points
Season,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000,25,2000,Manchester United FC,80.0
2001,44,2001,Arsenal FC,87.0
2002,111,2002,Manchester United FC,83.0
2003,130,2003,Arsenal FC,90.0
2004,184,2004,Chelsea FC,95.0
2005,227,2005,Chelsea FC,91.0
2006,283,2006,Manchester United FC,89.0
2007,326,2007,Manchester United FC,87.0
2008,369,2008,Manchester United FC,90.0
2009,399,2009,Chelsea FC,86.0
