# Chapter 2

## Exercise 1

In this exercise we'll create input parsing functions that parse datasets of [Premier League results](https://github.com/footballcsv/england).

In [1]:
library(tidyverse)
library(lubridate)

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
✔ tidyr   0.8.3       ✔ stringr 1.4.0  
✔ readr   1.3.1       ✔ forcats 0.4.0  
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date



## Problem 1

- Look at one of the `../data/england-master/XXXXs/XXXX-XX/eng.1.csv` datasets.
- Determine whether the data is in a tidy format.
- If not, how would you modify the data format?

You can use the markdown cell below to keep notes.

## Solution to problem 1

The solution is given in the preamble to problem 3.

## Problem 2

Create a function `load_matches` that does the following:

- It takes a single csv file from the `../data/england-master/XXXXs/XXXX-XX/eng.1.csv`-datasets.
- It reads the csv.
- It converts Date into a proper POSIXct time object.
- It determines the season of the dataset and stores it into column `Season`. E.g. data files in 2019-20 have season 2019.
- It returns the dataset.

Hint: [year](https://lubridate.tidyverse.org/reference/year.html), [min](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Extremes) and [parse_date_time](https://lubridate.tidyverse.org/reference/parse_date_time.html) might be of interest. If day or month names fail to parse, pass `locale='C'` as an argument.
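As a minimal sketch of the hint, the snippet below parses a single date string and extracts its year. The sample string `"Fri Aug 9 2019"` is an assumption about how dates look in the raw csv files; check your own file before relying on it.

```r
library(lubridate)

# Assumed raw format: abbreviated weekday, abbreviated month, day, year
raw_date <- "Fri Aug 9 2019"

# locale = 'C' makes the English day and month names parse
# regardless of the system language
parsed <- parse_date_time(raw_date, orders = "%a %b %d %Y", locale = "C")

# The season is the earliest year found in the file; for one date this is its year
season <- min(year(parsed))
print(parsed)
print(season)
```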

In [2]:
load_matches <- function(datafile) {
    matches <- read_csv(datafile, col_types=cols()) %>%
        mutate(Date=parse_date_time(Date, orders='%a %b %d %Y', locale='C')) %>%
        mutate(Season=min(year(Date)))
    return(matches)
}

In [3]:
matches_19 <- load_matches('../data/england-master/2010s/2019-20/eng.1.csv')
head(matches_19)

Round,Date,Team 1,FT,Team 2,Season
1,2019-08-09,Liverpool FC,4-1,Norwich City FC,2019
1,2019-08-10,West Ham United FC,0-5,Manchester City FC,2019
1,2019-08-10,Burnley FC,3-0,Southampton FC,2019
1,2019-08-10,AFC Bournemouth,1-1,Sheffield United FC,2019
1,2019-08-10,Crystal Palace FC,0-0,Everton FC,2019
1,2019-08-10,Watford FC,0-3,Brighton & Hove Albion FC,2019


## Preamble for problem 3

We want to convert our data into a tidy format with the following columns:

- `Round` - same as initial data
- `Date` - same as initial data
- `Team` - new column that tells which team is in question
- `Opponent` - new column that tells who was the opponent for the `Team`
- `Season` - same as initial data
- `ScoredGoals` - new column that tells how many goals the `Team` scored
- `AllowedGoals` - new column that tells how many goals the `Team` allowed
- `Side` - new column that tells whether the `Team` played on the `Home`-side or on the `Away`-side.
- `Result` - new column that tells the result from values `Win`, `Loss` or `Draw`
- `Points` - new column that lists the number of league points the `Team` received from the game (3 for `Win`, 1 for `Draw`, 0 for `Loss`)

The output of this formatting function should look something like this:

|Round|Date|Team|Opponent|Season|ScoredGoals|AllowedGoals|Side|Result|Points|
|-|-|-|-|-|-|-|-|-|-|
|1|2019-08-09|Liverpool FC|Norwich City FC|2019|4|1|Home|Win|3|
|1|2019-08-10|West Ham United FC|Manchester City FC|2019|0|5|Home|Loss|0|

Note that this tidy format doubles the number of rows: the original format encoded two results, one per team, in the single column `FT`.

In the new format each match appears as two rows, one from each team's perspective. This lets us calculate e.g. `Points`, which are allocated asymmetrically unless the match was a draw.
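To make the two perspectives concrete, here is a small hand-built sketch of one match written as two rows. The data frame `one_match` and the lookup vector `points_for` are illustrative names, not part of the exercise.

```r
# Points awarded per result, as a named lookup vector
points_for <- c(Win = 3, Draw = 1, Loss = 0)

# One match (Liverpool FC 4-1 Norwich City FC) seen from both sides
one_match <- data.frame(
  Team = c("Liverpool FC", "Norwich City FC"),
  Opponent = c("Norwich City FC", "Liverpool FC"),
  ScoredGoals = c(4L, 1L),
  AllowedGoals = c(1L, 4L),
  Side = c("Home", "Away"),
  stringsAsFactors = FALSE
)

# The result depends on whose perspective the row describes
one_match$Result <- ifelse(
  one_match$ScoredGoals > one_match$AllowedGoals, "Win",
  ifelse(one_match$ScoredGoals < one_match$AllowedGoals, "Loss", "Draw")
)

# Look up the points for each row's result
one_match$Points <- unname(points_for[one_match$Result])
print(one_match)
```

The home row receives 3 points and the away row 0, which a single row with an `FT` column cannot express per team.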

## Problem 3

Let's create a function `format_matches`, which takes a single data frame created by `load_matches` and converts it into our desired data format:

- Use [separate](https://tidyr.tidyverse.org/reference/separate.html) to extract `HomeGoals` and `AwayGoals` from the `FT`-column. Convert these columns into integers.
- Drop unneeded `FT`-column.
- Create two copies of the initial dataset: one from the away perspective and one from the home team perspective.
- For home side:
    - Set `Side` to `Home`.
    - Rename `Team 1` to `Team`, `Team 2` to `Opponent`, `HomeGoals` to `ScoredGoals` and `AwayGoals` to `AllowedGoals`.
    - Set `Result` to `Win` if the home team won, `Draw` if the match was a draw and `Loss` if the away team won.
- For away side:
    - Set `Side` to `Away`.
    - Rename `Team 1` to `Opponent`, `Team 2` to `Team`, `HomeGoals` to `AllowedGoals` and `AwayGoals` to `ScoredGoals`.
    - Set `Result` to `Loss` if the home team won, `Draw` if the match was a draw and `Win` if the away team won.
- Concatenate both home- and away-datasets together to get all matches.
- Create column `Points` and fill it based on the `Result`-column. 3 points for a `Win`, 1 point for a `Draw` and 0 points for a `Loss`.

Hint: You can use [logical indexing](https://bookdown.org/ndphillips/YaRrr/logical-indexing.html) to test which team won based on `HomeGoals` and `AwayGoals`.
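The following sketch shows `separate` and logical indexing on a three-row toy table. The tibble `toy` is made up for illustration, and `convert = TRUE` is one possible way to get integer columns; the exercise may equally be solved with an explicit conversion.

```r
library(tidyr)
library(tibble)

# Toy full-time scores in the same "home-away" shape as the FT column
toy <- tibble(FT = c("4-1", "0-5", "1-1"))

# Split FT at the dash; convert = TRUE turns the pieces into integers
toy <- separate(toy, FT, c("HomeGoals", "AwayGoals"), sep = "-", convert = TRUE)

# Logical vectors, one element per match
home_wins <- toy$HomeGoals > toy$AwayGoals
draws <- toy$HomeGoals == toy$AwayGoals

# Logical indexing: assign only to the rows where the condition holds
toy$Result <- "Loss"
toy$Result[home_wins] <- "Win"
toy$Result[draws] <- "Draw"
print(toy)
```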

In [4]:
format_matches <- function(matches) {
    matches <- matches %>%
        separate('FT', c('HomeGoals','AwayGoals')) %>%
        mutate(
            HomeGoals=as.numeric(HomeGoals),
            AwayGoals=as.numeric(AwayGoals),
        )
    # Logical vectors marking the outcome of each match
    home_wins <- matches[['HomeGoals']] > matches[['AwayGoals']]
    draws <- matches[['HomeGoals']] == matches[['AwayGoals']]
    away_wins <- !(home_wins|draws)
    
    home_matches <- matches %>%
        mutate(Side='Home') %>%
        rename(Team='Team 1', Opponent='Team 2', ScoredGoals=HomeGoals, AllowedGoals=AwayGoals)
    home_matches[home_wins, 'Result'] <- 'Win'
    home_matches[draws, 'Result'] <- 'Draw'
    home_matches[away_wins, 'Result'] <- 'Loss'

    away_matches <- matches %>%
        mutate(Side='Away') %>%
        rename(Team='Team 2', Opponent='Team 1', ScoredGoals=AwayGoals, AllowedGoals=HomeGoals)
    away_matches[home_wins, 'Result'] <- 'Loss'
    away_matches[draws, 'Result'] <- 'Draw'
    away_matches[away_wins, 'Result'] <- 'Win'
    
    all_matches <- bind_rows(home_matches, away_matches)
    
    # Default to 0 points (loss), then overwrite wins and draws
    all_matches['Points'] <- 0
    all_matches[all_matches[['Result']] == 'Win', 'Points'] <- 3
    all_matches[all_matches[['Result']] == 'Draw', 'Points'] <- 1
    return(all_matches)
    
}

In [5]:
matches_19 <- format_matches(matches_19)
head(matches_19)

Round,Date,Team,ScoredGoals,AllowedGoals,Opponent,Season,Side,Result,Points
1,2019-08-09,Liverpool FC,4,1,Norwich City FC,2019,Home,Win,3
1,2019-08-10,West Ham United FC,0,5,Manchester City FC,2019,Home,Loss,0
1,2019-08-10,Burnley FC,3,0,Southampton FC,2019,Home,Win,3
1,2019-08-10,AFC Bournemouth,1,1,Sheffield United FC,2019,Home,Draw,1
1,2019-08-10,Crystal Palace FC,0,0,Everton FC,2019,Home,Draw,1
1,2019-08-10,Watford FC,0,3,Brighton & Hove Albion FC,2019,Home,Loss,0


## Problem 4

- Create a function `clean_matches` that converts the `Team`, `Opponent` and `Result` columns of a dataset produced by `format_matches` into a categorical data type (factor).
- Think about why this step needs to be a separate function. Why couldn't it be part of `format_matches`?
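One way to approach the second question is to look at factor levels directly. The sketch below uses two made-up team lists standing in for two seasons; it only illustrates that levels depend on the data seen at conversion time, and is not a full answer.

```r
# Hypothetical team lists for two different seasons
season_a <- c("Arsenal FC", "Leeds United FC")
season_b <- c("Arsenal FC", "Watford FC")

# Converting each season separately gives each factor its own level set
levels_a <- levels(factor(season_a))
levels_b <- levels(factor(season_b))

# Converting after combining gives one shared level set
levels_all <- levels(factor(c(season_a, season_b)))

print(levels_a)
print(levels_b)
print(levels_all)
```

Because promoted and relegated teams change from season to season, factors built per file would not share a common set of levels.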

In [6]:
clean_matches <- function(matches) {
    matches <- matches %>%
        mutate_at(c('Team','Opponent','Result'), as.factor)
    return(matches)
}

In [7]:
matches_19 <- clean_matches(matches_19)
str(matches_19)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	760 obs. of  10 variables:
 $ Round       : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Date        : POSIXct, format: "2019-08-09" "2019-08-10" ...
 $ Team        : Factor w/ 20 levels "AFC Bournemouth",..: 10 19 5 1 7 18 17 9 13 12 ...
 $ ScoredGoals : num  4 0 3 1 0 0 3 0 0 4 ...
 $ AllowedGoals: num  1 5 0 1 0 3 1 0 1 0 ...
 $ Opponent    : Factor w/ 20 levels "AFC Bournemouth",..: 14 11 16 15 8 4 3 20 2 6 ...
 $ Season      : num  2019 2019 2019 2019 2019 ...
 $ Side        : chr  "Home" "Home" "Home" "Home" ...
 $ Result      : Factor w/ 3 levels "Draw","Loss",..: 3 2 3 1 1 2 3 1 2 3 ...
 $ Points      : num  3 0 3 1 1 0 3 1 0 3 ...


## Problem 5

Create a function `read_matches` which, given the directory `../data/england-master/` and a list of seasons, does the following:

- Iterates over the given seasons
- Determines the correct data file based on season
- Loads the data using `load_matches` and formats the data using `format_matches`
- Concatenates different datasets together
- Runs `clean_matches` on the combined dataset
- Returns cleaned dataset with multiple seasons

Hint: [Lists](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/list), [append](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/append), [String formatting](https://www.gastonsanchez.com/r4strings/c-style-formatting.html) with [sprintf](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sprintf), [paste](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/paste) and [floor](https://stat.ethz.ch/R-manual/R-patched/library/base/html/Round.html) might be of interest.
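A possible sketch of the path construction is shown below. The helper name `season_path` is hypothetical, and the folder layout (`XXXXs/XXXX-XX/eng.1.csv`) is taken from the paths used earlier in this exercise.

```r
# Hypothetical helper: build the csv path for the season starting in `season`
season_path <- function(matchfolder, season) {
  # Decade folder, e.g. 2019 -> '2010s'
  decadestr <- sprintf("%d0s", floor(season / 10))
  # Season folder, e.g. 2019 -> '2019-20'; %02d keeps '00' for 1999-00
  seasonstr <- sprintf("%d-%02d", season, (season + 1) %% 100)
  paste(matchfolder, decadestr, seasonstr, "eng.1.csv", sep = "/")
}

print(season_path("../data/england-master", 2019))
print(season_path("../data/england-master", 1999))
```

Note that `matchfolder` is given here without a trailing slash, since `paste` adds the separator itself.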

In [8]:
read_matches <- function(matchfolder, seasons){
    datasets <- list()
    for (season in seasons) {
        # Folder names look like '2010s/2019-20'
        decadestr <- sprintf('%d0s', floor(season/10))
        seasonstr <- sprintf('%d-%02d', season, (season + 1) %% 100)
        datapath <- paste(matchfolder, decadestr, seasonstr,'eng.1.csv', sep='/')
        dataset <- format_matches(load_matches(datapath))
        datasets <- append(datasets, list(dataset))
    }
    match_data <- clean_matches(bind_rows(datasets))
    return(match_data)
}

In [9]:
matches_all <- read_matches('../data/england-master/', seq(1992,2019))
head(matches_all)

Round,Date,Team,ScoredGoals,AllowedGoals,Opponent,Season,Side,Result,Points
1,1992-08-15,Arsenal FC,2,4,Norwich City FC,1992,Home,Loss,0
1,1992-08-15,Leeds United FC,2,1,Wimbledon FC,1992,Home,Win,3
1,1992-08-15,Coventry City FC,2,1,Middlesbrough FC,1992,Home,Win,3
1,1992-08-15,Ipswich Town FC,1,1,Aston Villa FC,1992,Home,Draw,1
1,1992-08-15,Crystal Palace FC,3,3,Blackburn Rovers FC,1992,Home,Draw,1
1,1992-08-15,Southampton FC,0,0,Tottenham Hotspur FC,1992,Home,Draw,1


## Demonstration of our new data format

### Checking home side advantage

**Requirement**: Problems 2 to 5 need to be completed to produce the datasets used here

Let's use our new datasets to check whether the home side has an advantage compared to the away side. This phenomenon [has been recognized](https://www.researchgate.net/publication/14465849_Factors_associated_with_home_advantage_in_English_and_Scottish_Soccer_matches) for years. Let's see if our data shows the same phenomenon.

For this let's use a [binomial test](https://www.rdocumentation.org/packages/mosaic/versions/1.8.2/topics/binom.test) to check whether the home wins and home losses come from a fair binomial distribution (wins and losses equally likely). Draws are left out of the test.

In [10]:
home_advantage <- function(matches) {
    home_team_stats <- matches %>%
        filter(Side == 'Home') %>%
        group_by(Result) %>%
        summarize(n=n())
    
    # Number of home wins and home losses as plain numbers
    wins <- home_team_stats %>%
        filter(Result == 'Win') %>%
        pull(n)
    losses <- home_team_stats %>%
        filter(Result == 'Loss') %>%
        pull(n)

    return(binom.test(c(wins,losses)))
}

In [11]:
home_advantage(matches_19)
home_advantage(matches_all)


	Exact binomial test

data:  c(wins, losses)
number of successes = 172, number of trials = 288, p-value = 0.001154
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.5380811 0.6543414
sample estimates:
probability of success 
             0.5972222 



	Exact binomial test

data:  c(wins, losses)
number of successes = 5028, number of trials = 8047, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.6141448 0.6354230
sample estimates:
probability of success 
             0.6248291 
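The first test above can be reproduced from its counts alone: 172 home wins out of 288 decided matches leaves 116 home losses. The sketch below passes those two counts straight to `binom.test` from base R's `stats` package; the variable names are illustrative.

```r
# Counts taken from the single-season output above
home_wins <- 172
home_losses <- 288 - 172

# Two-sided exact test of H0: P(home win) = 0.5 among decided matches
res <- binom.test(c(home_wins, home_losses), p = 0.5)

print(res$estimate)   # observed share of home wins
print(res$p.value)
print(res$conf.int)
```

A confidence interval that lies entirely above 0.5 is what indicates a home advantage here.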


### Checking Premier League winners

**Requirement**: Problems 2 to 5 need to be completed to produce the datasets used here

Let's use our new datasets to find each season's champion by checking which team collected the most points in each season.

We can then compare teams with the most points to this [list of Premier League champions](https://en.wikipedia.org/wiki/List_of_English_football_champions#Premier_League_(1992%E2%80%93present)) and see that the lists match.
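The grouping logic can be tried on a toy table first. In this sketch the data and team names are made up, and `filter(Points == max(Points))` is used as one possible way to pick the top team per group; it keeps ties, as `top_n` does.

```r
library(dplyr)

# Made-up per-match points for two seasons
toy <- tibble(
  Season = c(2018, 2018, 2018, 2018, 2019, 2019, 2019),
  Team   = c("A", "A", "B", "B", "B", "C", "C"),
  Points = c(3, 3, 0, 1, 3, 3, 1)
)

# Total points per team and season
standings <- toy %>%
  group_by(Season, Team) %>%
  summarize(Points = sum(Points))

# Within each season, keep the team(s) with the most points
winners <- standings %>%
  group_by(Season) %>%
  filter(Points == max(Points)) %>%
  ungroup()

print(winners)
```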

In [12]:
season_winners <- function(matches) {
    team_standings <- matches %>%
        select(Season, Team, Points) %>%
        group_by(Season, Team) %>%
        summarize(Points=sum(Points))
    top_teams <- team_standings %>%
        top_n(1, Points)
    return(top_teams)
}

In [13]:
season_winners(matches_all)

Season,Team,Points
1992,Manchester United FC,84
1993,Manchester United FC,92
1994,Blackburn Rovers FC,89
1995,Manchester United FC,82
1996,Manchester United FC,75
1997,Arsenal FC,78
1998,Manchester United FC,79
1999,Manchester United FC,91
2000,Manchester United FC,80
2001,Arsenal FC,87
