# Chapter 2

## Exercise 1

In this exercise we'll create input parsing functions that parse datasets of [Premier League results](https://github.com/footballcsv/england).

In [None]:
import os
import numpy as np
import pandas as pd

## Problem 1

- Look at one of the `../data/england-master/XXXXs/XXXX-XX/eng.1.csv` datasets.
- Determine whether the data is in a tidy format.
- If not, how would you modify the data format?

You can use the markdown cell below to keep notes.

Notes:

- Note 1

## Problem 2

Create a function `load_matches` that does the following:

- It takes a single CSV file from the `../data/england-master/XXXXs/XXXX-XX/eng.1.csv` datasets.
- It reads the CSV file.
- It converts the `Date` column into a proper datetime type.
- It determines the season of the dataset and stores it in the column `Season`. A season is identified by its starting year, e.g. the data file in 2019-20 has season 2019.
- It returns the dataset.

Hint: [min](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.min.html) and [to_period](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-span-representation) might be of interest.
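As a rough illustration of the hint, the sketch below applies `min` and `to_period` to a toy `Date` column. The ISO date strings are an assumption made for the example; the real files may store dates in a different format, so the parsing step may need adjusting.

```python
import pandas as pd

# Toy stand-in for one season file; the date strings are made up for illustration.
toy = pd.DataFrame({"Date": ["2019-08-09", "2019-12-26", "2020-07-26"]})
toy["Date"] = pd.to_datetime(toy["Date"])

# The earliest match date falls in the calendar year in which the season starts.
first_match = toy["Date"].min()

# Converting the timestamp to a yearly period exposes that year as an integer.
season = first_match.to_period("Y").year
print(season)
```

Taking the minimum first matters because a season spans two calendar years, so the year of an arbitrary row is not enough.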

In [None]:
def load_matches(datafile):
    # TODO
    return matches

In [None]:
matches_19 = load_matches('../data/england-master/2010s/2019-20/eng.1.csv')
matches_19.head()

## Preamble for problem 3

We want to convert our data into a tidy format with the following columns:

- `Round` - same as initial data
- `Date` - same as initial data
- `Team` - new column that tells which team is in question
- `Opponent` - new column that tells who was the opponent for the `Team`
- `Season` - same as initial data
- `ScoredGoals` - new column that tells how many goals the `Team` scored
- `AllowedGoals` - new column that tells how many goals the `Team` allowed
- `Side` - new column that tells whether the `Team` played on the `Home` side or on the `Away` side
- `Result` - new column that holds the result as one of `Win`, `Loss` or `Draw`
- `Points` - new column that lists the number of league points the team received from the game (3 for `Win`, 1 for `Draw`, 0 for `Loss`)

The output of this formatting function should look something like this:

||Round|Date|Team|Opponent|Season|ScoredGoals|AllowedGoals|Side|Result|Points|
|-|-|-|-|-|-|-|-|-|-|-|
|0|1|2019-08-09|Liverpool FC|Norwich City FC|2019|4|1|Home|Win|3|
|1|1|2019-08-10|West Ham United FC|Manchester City FC|2019|0|5|Home|Loss|0|

Note that this tidy format doubles the number of rows: in the original format each row encodes two results (for example one win and one loss) in the single column `FT`.

In our new format every match appears as two separate rows, one from each team's perspective. This allows us to calculate e.g. `Points`, which are allocated asymmetrically unless the match was a draw.
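To make the doubling concrete, here is a minimal sketch on a single made-up match row. It assumes the goals have already been split into two hypothetical columns, `HomeGoals` and `AwayGoals`, and it leaves out the remaining columns of the target format.

```python
import pandas as pd

# One match in the wide layout; HomeGoals and AwayGoals are assumed helper columns.
wide = pd.DataFrame({
    "Team 1": ["Liverpool FC"],
    "Team 2": ["Norwich City FC"],
    "HomeGoals": [4],
    "AwayGoals": [1],
})

# Home perspective: Team 1 is the team in question.
home = wide.rename(columns={
    "Team 1": "Team", "Team 2": "Opponent",
    "HomeGoals": "ScoredGoals", "AwayGoals": "AllowedGoals",
}).assign(Side="Home")

# Away perspective: the same row with the roles swapped.
away = wide.rename(columns={
    "Team 2": "Team", "Team 1": "Opponent",
    "AwayGoals": "ScoredGoals", "HomeGoals": "AllowedGoals",
}).assign(Side="Away")

# concat aligns columns by name, so the differing column order is harmless.
tidy = pd.concat([home, away], ignore_index=True)
print(tidy)
```

The single input row yields two output rows, one per team, which is exactly the doubling described above.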

## Problem 3

Let's create a function `format_matches`, which takes a single DataFrame created by `load_matches` and converts the data into our desired data format:

- Use [pandas.Series.str.extract](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html) to extract `HomeGoals` and `AwayGoals` from the `FT` column. Convert these columns into integers.
- Drop the now-unneeded `FT` column.
- Create two copies of the initial dataset: one from the away team's perspective and one from the home team's perspective.
- For home side:
    - Set `Side` to `Home`.
    - Rename `Team 1` to `Team`, `Team 2` to `Opponent`, `HomeGoals` to `ScoredGoals` and `AwayGoals` to `AllowedGoals`.
    - Set `Result` to `Win` if the home team won, `Draw` if the match was a draw and `Loss` if the away team won.
- For away side:
    - Set `Side` to `Away`.
    - Rename `Team 1` to `Opponent`, `Team 2` to `Team`, `HomeGoals` to `AllowedGoals` and `AwayGoals` to `ScoredGoals`.
    - Set `Result` to `Loss` if the home team won, `Draw` if the match was a draw and `Win` if the away team won.
- Concatenate the home and away datasets to get all matches.
- Create a `Points` column and fill it based on the `Result` column: 3 points for a `Win`, 1 point for a `Draw` and 0 points for a `Loss`.

Hint: You can use [boolean indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing) to test which team won based on `HomeGoals` and `AwayGoals`.
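The extraction and boolean indexing steps can be tried out on toy data first. The sketch below assumes that `FT` holds scores as `home-away` strings such as `4-1`; check this against the real files before relying on the regular expression.

```python
import pandas as pd

# Toy scores; the "home-away" layout of FT is an assumption about the raw files.
toy = pd.DataFrame({"FT": ["4-1", "0-5", "2-2"]})

# Named groups become the column names of the frame returned by str.extract.
goals = toy["FT"].str.extract(r"(?P<HomeGoals>\d+)-(?P<AwayGoals>\d+)").astype(int)
toy = toy.join(goals)

# Boolean masks select the rows where each outcome applies (home perspective).
toy["Result"] = "Draw"
toy.loc[toy["HomeGoals"] > toy["AwayGoals"], "Result"] = "Win"
toy.loc[toy["HomeGoals"] < toy["AwayGoals"], "Result"] = "Loss"

# A dictionary lookup turns each result into league points.
toy["Points"] = toy["Result"].map({"Win": 3, "Draw": 1, "Loss": 0})
print(toy)
```

Starting from `Draw` and overwriting the decided matches keeps the logic to two masks; for the away perspective the two comparisons simply swap.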

In [None]:
def format_matches(matches):
    # TODO
    return all_matches

In [None]:
matches_19 = format_matches(matches_19)
matches_19.head()

## Problem 4

- Create a function `clean_matches`, which converts the `Team`, `Opponent` and `Result` columns of a dataset produced by our `format_matches` into the categorical datatype.
- Think about why this step needs to be a separate function. Why couldn't it be part of the `format_matches` function?
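One observation that may help with the second question: the small experiment below concatenates two categorical Series whose category sets differ. The team names are just sample values.

```python
import pandas as pd

# Two "seasons" with partly different teams, each already categorical.
season_a = pd.Series(["Arsenal FC", "Chelsea FC"], dtype="category")
season_b = pd.Series(["Arsenal FC", "Leeds United FC"], dtype="category")

# Concatenating categoricals with different categories falls back to object dtype.
combined = pd.concat([season_a, season_b], ignore_index=True)
print(combined.dtype)

# Converting once, after concatenation, yields one shared set of categories.
recast = combined.astype("category")
print(recast.cat.categories.tolist())
```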

In [None]:
def clean_matches(matches):
    # TODO
    return matches

In [None]:
matches_19 = clean_matches(matches_19)
matches_19.dtypes

## Problem 5

Create a function `read_matches` which, given the directory `../data/england-master/` and a list of seasons, does the following:

- Iterates over the given seasons
- Determines the correct data file based on season
- Loads the data using `load_matches` and formats the data using `format_matches`
- Concatenates different datasets together
- Runs `clean_matches` on the combined dataset
- Returns cleaned dataset with multiple seasons

Hint: Python [string formatting](https://docs.python.org/3/library/string.html#format-specification-mini-language) might be of interest.
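As a sketch of the hint, the hypothetical helper `season_path` below builds a file path from a starting year. The folder layout (`XXXXs/XXXX-XX/eng.1.csv`) is inferred from the example path used earlier in this exercise and is assumed to hold for every season.

```python
import os

def season_path(matchfolder, season):
    """Build the path of one season file (hypothetical helper, assumed layout)."""
    # Decade folder, e.g. 2019 -> "2010s".
    decade = "{:d}s".format(season // 10 * 10)
    # Season folder, e.g. 2019 -> "2019-20"; the end year is zero-padded to two digits.
    folder = "{:d}-{:02d}".format(season, (season + 1) % 100)
    return os.path.join(matchfolder, decade, folder, "eng.1.csv")

print(season_path("../data/england-master/", 2019))
print(season_path("../data/england-master/", 1999))
```

The `{:02d}` format specifier is what keeps a season such as 1999 from ending in `-0` instead of `-00`.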

In [None]:
def read_matches(matchfolder, seasons):
    # TODO
    return match_data

In [None]:
matches_all = read_matches('../data/england-master/', range(1992,2020))

## Demonstration of our new data format

### Checking home side advantage

**Requirement**: Problems 2 and 3 need to be completed to produce the dataset

Let's use our new datasets to check whether the home side has an advantage compared to the away side. This phenomenon [has been recognized](https://www.researchgate.net/publication/14465849_Factors_associated_with_home_advantage_in_English_and_Scottish_Soccer_matches) for years. Let's see if our data shows the same phenomenon.

For this let's use a [binomial test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom_test.html) to check whether the wins and losses of the home side come from a fair binomial distribution (wins and losses equally likely). Draws are left out of the test.
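To get a feel for the test before running it on real data, here is a minimal sketch with made-up counts. It uses `scipy.stats.binomtest`, which replaces `binom_test` in current SciPy releases; the numbers 60 and 30 are invented for illustration only.

```python
from scipy.stats import binomtest

# Made-up counts of decided matches from the home side's perspective.
wins, losses = 60, 30

# H0: a decided match is won by the home side with probability 0.5.
result = binomtest(wins, n=wins + losses, p=0.5, alternative="two-sided")
print(result.pvalue)
```

A small p-value means that such a lopsided split would be unlikely if home and away wins were equally probable.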

In [None]:
from scipy.stats import binomtest  # binom_test was removed in SciPy 1.12

def home_advantage(matches):
    # Count results from the home team's perspective
    home_team_stats = matches.loc[matches['Side'] == 'Home', 'Result'].value_counts()
    print(home_team_stats)
    wins = int(home_team_stats.loc['Win'])
    losses = int(home_team_stats.loc['Loss'])
    print('Home side wins %.1f %% of matches that do not end in a draw' % (100.0 * wins / (wins + losses)))
    # Two-sided p-value for H0: decided matches are won by the home side with probability 0.5
    return binomtest(wins, n=wins + losses, p=0.5).pvalue

In [None]:
home_advantage(matches_19)

### Checking Premier League winners

**Requirement**: Problems 2 and 3 need to be completed to produce the dataset

Let's use our new datasets to check which teams won each season by finding the team with the most points in each season.

We can then compare teams with the most points to this [list of Premier League champions](https://en.wikipedia.org/wiki/List_of_English_football_champions#Premier_League_(1992%E2%80%93present)) and see that the lists match.
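The idea of summing points per team and picking the top team per season can be tried on toy data first. The sketch below uses made-up teams `A` and `B` and selects the leader with `idxmax`, which is an alternative to the `nlargest` approach used in the cell that follows.

```python
import pandas as pd

# Made-up tidy rows: one row per team per match.
toy = pd.DataFrame({
    "Season": [2018, 2018, 2018, 2019, 2019, 2019],
    "Team": ["A", "B", "A", "A", "B", "B"],
    "Points": [3, 0, 1, 0, 3, 3],
})

# Total points per team within each season.
standings = toy.groupby(["Season", "Team"], as_index=False)["Points"].sum()

# idxmax returns the row label of the highest total in each season.
winners = standings.loc[standings.groupby("Season")["Points"].idxmax()]
print(winners)
```

Note that `idxmax` keeps only the first row in case of a tie on points, whereas real league tables break ties with further criteria such as goal difference.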

In [None]:
def season_winners(matches):
    team_standings = matches.loc[:,['Season','Team','Points']].groupby(['Season','Team']).sum()
    team_standings.reset_index(inplace=True)
    team_standings.dropna(inplace=True)
    top_teams = team_standings.groupby('Season').apply(lambda x: x.nlargest(1, 'Points'))
    return top_teams

In [None]:
season_winners(matches_19)