## Task 1 - Data Collection

**Name:** Stefan Manek

In this assignment, data from the [Football API](https://www.football-data.org/) was used for two separate studies: a brief analysis of the proportion of native players within 8 professional European football leagues, and a crude model that predicts match results using Poisson statistics based on prior results.

This notebook contains the code written to collect the data needed for both studies. Since the majority of the data used in the analysis was historical, there was no need to gather it over a long time period. Filters and specific resources can be specified within the GET request to the API, so several different URLs were used to obtain the required data.

In [1]:
#Importing Libraries
import http.client
# urllib.parse is imported explicitly because fetch() relies on urllib.parse.urlencode
import json, requests, urllib.parse
import matplotlib
import matplotlib.pyplot as plt
from pathlib import Path
import pandas as pd
import time

In [2]:
# API Key
api_key = '____'
# Prefix for API URLs
api_prefix = 'api.football-data.org'

### Analysis 1 - Players from Leagues

The first part of this analysis compared native scorers and players across 8 chosen leagues, listed in `league_names` below. Their corresponding IDs were taken from the API documentation and saved in a list for easy access.

In [3]:
# League IDs for study
league_ids = ['BL1', 'PL', 'ELC', 'PPL', 'SA', 'DED', 'FL1', 'PD']
# The league IDs corresponding to each league name
league_names = {'BL1': 'Bundesliga',
                'PL' : 'Premier League',
                'ELC': 'EFL Championship',
                'PPL': 'Premiera Liga',
                'SA' : 'Seria A',
                'DED': 'Eredivise',
                'FL1': 'Ligue 1',
                'PD' : 'La Liga'}

# League names in the same order as league_ids
leagues = [league_names[id_] for id_ in league_ids]

Creating directories in which to save the raw data and to store the processed data:

In [5]:
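dir_raw = Path("Raw Data")
dir_raw.mkdir(parents=True, exist_ok=True)
# Sub-directories that the cells below write each type of raw data into
for sub_dir in ['Scorer Data', 'Team Data', 'Squads', 'Matches']:
    (dir_raw / sub_dir).mkdir(parents=True, exist_ok=True)

dir_data = Path("Processed Data")
dir_data.mkdir(parents=True, exist_ok=True)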

Convenience function for requesting data from the API:

In [4]:
def check_backslash(url):
    """Simple function that adds a '/' to the end of a url if one is not already present"""
    if not url.endswith("/"):
        url += "/"
    return url

def fetch(endpoint, league_id, resource, params=None):
    """Inputs: URL endpoint, the league's ID, the resource required and any relevant filter parameters.
    
    The default limit of returned items is 10, so a large limit must be specified to retrieve all the scorers.
    
    Output: Dictionary containing resource data"""
    # avoid sharing one mutable default dictionary between calls
    params = params or {}
    # construct the url
    uri = endpoint
    uri = check_backslash(uri) + league_id
    uri = check_backslash(uri) + resource
    
    #Any added filters must be preceded by '?'
    uri += "?" + urllib.parse.urlencode(params)
    url = check_backslash(api_prefix) + uri
    print("Fetching %s" % url)
    
    # fetch the page
    connection = http.client.HTTPConnection(api_prefix)
    headers = { 'X-Auth-Token': api_key}
    connection.request('GET', uri, None, headers)
    response = json.loads(connection.getresponse().read().decode())
    
    
    return response
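
As a quick check of how `fetch` assembles its request path, the sketch below reproduces only the string-building step, with no network access. The helper name `build_uri` is hypothetical and not part of the notebook; it assumes the same endpoint, league ID and resource conventions as above, and joins the segments with exactly one `/` each.

```python
import urllib.parse

def build_uri(endpoint, league_id, resource, params=None):
    """Hypothetical helper: join the path segments and append the query string."""
    # Strip stray slashes so that segments are always joined by exactly one '/'
    parts = [endpoint.strip("/"), league_id, resource]
    uri = "/" + "/".join(parts)
    if params:
        # urlencode turns {'season': '2020'} into 'season=2020'
        uri += "?" + urllib.parse.urlencode(params)
    return uri

print(build_uri("/v2/competitions/", "BL1", "scorers", {"season": "2020", "limit": "500"}))
# /v2/competitions/BL1/scorers?season=2020&limit=500
```

Building the path separately from the request makes it easy to inspect a URL before spending one of the limited API calls on it.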

This was used to acquire data on all scorers in a specified league for a specified season. Unfortunately, only the seasons starting in 2020 and 2021 were available through this API, so any analysis is limited to these. Scorer data was saved in JSON format to the Raw Data directory.

In [20]:
endpoint = "/v2/competitions/"
seasons = ['2020', '2021']
for season_year in seasons:
    params = {'season': season_year, 'limit':str(500)}
    resource = 'scorers'
    for id_ in league_ids:
        time.sleep(6)
        scorer_data = fetch(endpoint, id_, resource, params)
        filename = "%s-%s-%s.json" % (league_names[id_], params['season'], resource)
        out_path = dir_raw / 'Scorer Data' / filename
        print("Writing data to %s" % out_path)
        # the context manager closes the file even if json.dump raises
        with open(out_path, "w") as fout:
            json.dump(scorer_data, fout, indent=4)

Fetching api.football-data.org//v2/competitions/BL1/scorers?season=2020&limit=500
Writing data to Raw Data\Scorer Data\Bundesliga-2020-scorers.json
Fetching api.football-data.org//v2/competitions/PL/scorers?season=2020&limit=500
Writing data to Raw Data\Scorer Data\Premier League-2020-scorers.json
Fetching api.football-data.org//v2/competitions/ELC/scorers?season=2020&limit=500
Writing data to Raw Data\Scorer Data\EFL Championship-2020-scorers.json
Fetching api.football-data.org//v2/competitions/PPL/scorers?season=2020&limit=500
Writing data to Raw Data\Scorer Data\Premiera Liga-2020-scorers.json
Fetching api.football-data.org//v2/competitions/SA/scorers?season=2020&limit=500
Writing data to Raw Data\Scorer Data\Seria A-2020-scorers.json
Fetching api.football-data.org//v2/competitions/DED/scorers?season=2020&limit=500
Writing data to Raw Data\Scorer Data\Eredivise-2020-scorers.json
Fetching api.football-data.org//v2/competitions/FL1/scorers?season=2020&limit=500
Writing data to Raw Dat
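
The loop above pauses for 6 seconds before every request, which keeps it within the limit of 10 requests per minute noted in this notebook's code comments. The sketch below shows one way the same idea could be wrapped in a reusable helper that only sleeps for the time still outstanding. The class name `Throttle` and its `clock` and `sleep` parameters are hypothetical and not part of the notebook.

```python
import time

class Throttle:
    """Hypothetical helper that enforces a minimum gap between calls."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

throttle = Throttle(max_per_minute=10)
print(throttle.min_interval)  # 6.0
```

Calling `throttle.wait()` immediately before each request would replace the fixed `time.sleep(6)`; the first call returns at once, and later calls wait only if the previous request was less than 6 seconds ago.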

Data on the teams taking part in each league in the seasons beginning in 2020 (last season) and 2021 was also requested and stored in Raw Data:

In [10]:
endpoint = "/v2/competitions/"
seasons = ['2020', '2021']
for season_year in seasons:
    params = {'season': season_year}
    resource = 'teams'
    for id_ in league_ids:
        # Only 10 requests are permitted per minute, so pause for 6s between requests
        time.sleep(6)
        teams_data = fetch(endpoint, id_, resource, params)
        filename = "%s-%s-%s.json" % (id_, params['season'], resource)
        out_path = dir_raw / 'Team Data' / filename
        print("Writing data to %s" % out_path)
        with open(out_path, "w") as fout:
            json.dump(teams_data, fout, indent=4)

Fetching api.football-data.org//v2/competitions/BL1/teams?season=2020
Writing data to Raw Data\Team Data\BL1-2020-teams.json
Fetching api.football-data.org//v2/competitions/PL/teams?season=2020
Writing data to Raw Data\Team Data\PL-2020-teams.json
Fetching api.football-data.org//v2/competitions/ELC/teams?season=2020
Writing data to Raw Data\Team Data\ELC-2020-teams.json
Fetching api.football-data.org//v2/competitions/PPL/teams?season=2020
Writing data to Raw Data\Team Data\PPL-2020-teams.json
Fetching api.football-data.org//v2/competitions/SA/teams?season=2020
Writing data to Raw Data\Team Data\SA-2020-teams.json
Fetching api.football-data.org//v2/competitions/DED/teams?season=2020
Writing data to Raw Data\Team Data\DED-2020-teams.json
Fetching api.football-data.org//v2/competitions/FL1/teams?season=2020
Writing data to Raw Data\Team Data\FL1-2020-teams.json
Fetching api.football-data.org//v2/competitions/PD/teams?season=2020
Writing data to Raw Data\Team Data\PD-2020-teams.json
Fetchi
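
The cells above write whatever the API returns straight to disk, so a failed request (for example, one rejected for exceeding the rate limit) would be saved as if it were data. A minimal sketch of a guard against this is given below, assuming that a successful request is signalled by HTTP status 200. The helper name `parse_api_response` is hypothetical, and the payloads are made up rather than real API output.

```python
import json

def parse_api_response(status, body):
    """Hypothetical helper: decode a JSON body, raising if the HTTP status is not 200."""
    if status != 200:
        # Surface the failure instead of saving an error payload as if it were data
        raise RuntimeError("Request failed with HTTP status %d: %s" % (status, body[:200]))
    return json.loads(body)

# Example with made-up payloads rather than real API output
print(parse_api_response(200, '{"count": 0, "teams": []}'))
try:
    parse_api_response(429, '{"message": "made-up error text"}')
except RuntimeError as err:
    print("Skipped:", err)
```

With `http.client`, the two arguments would come from the response object: `resp = connection.getresponse()` followed by `parse_api_response(resp.status, resp.read().decode())`.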

The full roster of players from each league was requested via the API using the following functions, and combined into one CSV file per league to avoid the clutter of a separate JSON file for each team (roughly 140). The CSV files, containing data for all players from each league, were saved to the Raw Data directory.

In [8]:
def parse_team_squads(team_ids):
    response_start = '/v2/teams/'
    rows = []
    for id_ in team_ids:
        #As only 10 requests permitted /min, sleep for 6s so this is not exceeded
        time.sleep(6)
        response_string = response_start + str(id_)
        team_dict = request_function(response_string)
        players = extract_squad_details(team_dict)
        rows += players
    players_df = pd.DataFrame(rows)
    players_df.set_index('Name', inplace=True)
    return players_df

def request_function(req_string):
    connection = http.client.HTTPConnection(api_prefix)
    headers = { 'X-Auth-Token': api_key}
    connection.request('GET', req_string, None, headers )
    response = json.loads(connection.getresponse().read().decode())
    return response

def get_team_ids(teams_response_dict):
    teams_id_dict = {}
    for team in teams_response_dict['teams']:
        name = team['name']
        id_ = team['id']
        teams_id_dict[name] = id_
    return teams_id_dict

def extract_squad_details(squad_dict):
    rows = []
    for player in squad_dict['squad']:
        row = {}
        row['Name'] = player['name']
        row['Team'] = squad_dict['name']
        row['Position'] = player['position']
        row['Nationality'] = player['nationality']
        row['Country of Birth'] = player['countryOfBirth']
        rows.append(row)
    return rows

endpoint = '/v2/competitions'
seasons = ['2020']
for season in seasons:
    for id_ in league_ids:
        teams_dict = fetch(endpoint, id_, 'teams', params={'season' : season})
        team_ids_dict = get_team_ids(teams_dict)
        team_ids = team_ids_dict.values()
        players_df = parse_team_squads(team_ids)

        #Saving to csv file
        league_filename = 'Raw Data/Squads/%s-players-%s.csv' % (league_names[id_], str(season))
        players_df.to_csv(league_filename)

Fetching api.football-data.org//v2/competitions/BL1/teams?season=2020
Fetching api.football-data.org//v2/competitions/PL/teams?season=2020
Fetching api.football-data.org//v2/competitions/ELC/teams?season=2020
Fetching api.football-data.org//v2/competitions/PPL/teams?season=2020
Fetching api.football-data.org//v2/competitions/SA/teams?season=2020
Fetching api.football-data.org//v2/competitions/DED/teams?season=2020
Fetching api.football-data.org//v2/competitions/FL1/teams?season=2020
Fetching api.football-data.org//v2/competitions/PD/teams?season=2020
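
To illustrate the flattening step without calling the API, the sketch below runs a variant of `extract_squad_details` on a made-up team dictionary that has the same keys the notebook reads. The names `squad_to_rows`, `rows_to_csv` and `sample_team` are hypothetical. Two deliberate differences from the notebook are assumptions made for the example: missing fields become `None` instead of raising a `KeyError`, and the standard-library `csv` module stands in for pandas so that the example has no dependencies.

```python
import csv
import io

# Made-up team response with the same keys that extract_squad_details reads
sample_team = {
    "name": "Example FC",
    "squad": [
        {"name": "Player One", "position": "Goalkeeper",
         "nationality": "Germany", "countryOfBirth": "Germany"},
        {"name": "Player Two", "position": "Attacker",
         "nationality": "France", "countryOfBirth": "Senegal"},
    ],
}

FIELDS = ["Name", "Team", "Position", "Nationality", "Country of Birth"]

def squad_to_rows(team_dict):
    """Hypothetical variant of extract_squad_details that tolerates missing fields."""
    rows = []
    for player in team_dict.get("squad", []):
        rows.append({
            "Name": player.get("name"),
            "Team": team_dict.get("name"),
            "Position": player.get("position"),
            "Nationality": player.get("nationality"),
            "Country of Birth": player.get("countryOfBirth"),
        })
    return rows

def rows_to_csv(rows):
    """Serialise the rows to CSV text, one line per player."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

print(rows_to_csv(squad_to_rows(sample_team)))
```

Keeping one row per player, with the team name repeated on each row, is what allows all squads in a league to be stacked into a single table.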


### Analysis 2 - Prediction Model

For the prediction model analysis, only the Premier League was considered, and so the fixtures and results from the 2020/21 season were requested from the API as the historical data upon which the model is based.

A convenience function was written to request all the matches for each team in the Premier League in the 2020/21 season:

In [10]:
def read_json_file(file_path):
    """Simple function to read in JSON data"""
    # the context manager closes the file even if parsing fails
    with open(file_path, "r") as fin:
        data = json.load(fin)
    return data

def get_matches_dict(team_id):
    url = '/v2/teams/%s/matches?' % str(team_id)
    params = {'dateFrom': '2020-08-13',
             'dateTo' : '2021-06-01'}
    url += urllib.parse.urlencode(params)
    connection = http.client.HTTPConnection('api.football-data.org')
    headers = { 'X-Auth-Token': api_key }
    connection.request('GET', url, None, headers )
    response = json.loads(connection.getresponse().read().decode())
    print('Requesting: '+ url)
    return response


PL_teams = read_json_file('Raw Data/Team Data/PL-2020-teams.json')
PL_team_ids_dict = get_team_ids(PL_teams)

for team in PL_team_ids_dict:
    id_ = PL_team_ids_dict[team]
    matches_dict = get_matches_dict(id_)
    file_path = 'Raw Data/Matches/%s-matches.json' % team
    with open(file_path, "w") as fout:
        json.dump(matches_dict, fout, indent=4)
    time.sleep(6)

Requesting: /v2/teams/57/matches?dateFrom=2020-08-13&dateTo=2021-06-01
Requesting: /v2/teams/58/matches?dateFrom=2020-08-13&dateTo=2021-06-01
Requesting: /v2/teams/61/matches?dateFrom=2020-08-13&dateTo=2021-06-01
Requesting: /v2/teams/62/matches?dateFrom=2020-08-13&dateTo=2021-06-01
Requesting: /v2/teams/63/matches?dateFrom=2020-08-13&dateTo=2021-06-01
Requesting: /v2/teams/64/matches?dateFrom=2020-08-13&dateTo=2021-06-01
Requesting: /v2/teams/65/matches?dateFrom=2020-08-13&dateTo=2021-06-01
Requesting: /v2/teams/66/matches?dateFrom=2020-08-13&dateTo=2021-06-01
Requesting: /v2/teams/67/matches?dateFrom=2020-08-13&dateTo=2021-06-01
Requesting: /v2/teams/73/matches?dateFrom=2020-08-13&dateTo=2021-06-01
Requesting: /v2/teams/74/matches?dateFrom=2020-08-13&dateTo=2021-06-01
Requesting: /v2/teams/76/matches?dateFrom=2020-08-13&dateTo=2021-06-01
Requesting: /v2/teams/328/matches?dateFrom=2020-08-13&dateTo=2021-06-01
Requesting: /v2/teams/338/matches?dateFrom=2020-08-13&dateTo=2021-06-01
Requ
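
Because the matches are requested once per team, any fixture between two Premier League sides will appear in two of the saved files. A minimal sketch of how the per-team lists could later be merged without double counting is shown below. It assumes that each match record carries a unique `id` field; the helper name `unique_matches` and the sample records are made up for illustration.

```python
def unique_matches(match_lists):
    """Hypothetical helper: merge per-team match lists, keeping each match id once."""
    seen = {}
    for matches in match_lists:
        for match in matches:
            # setdefault keeps the first copy and ignores later duplicates
            seen.setdefault(match["id"], match)
    return list(seen.values())

# Made-up data: match 1 appears in both teams' lists
team_a = [{"id": 1, "home": "A", "away": "B"}, {"id": 2, "home": "A", "away": "C"}]
team_b = [{"id": 1, "home": "A", "away": "B"}, {"id": 3, "home": "B", "away": "C"}]
print(len(unique_matches([team_a, team_b])))  # 3
```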

The performance of the model was also investigated using the first round of Premier League fixtures (matchday 1), so the data for these matches was also requested and saved:

In [13]:
connection = http.client.HTTPConnection(api_prefix)
headers = { 'X-Auth-Token': api_key}
connection.request('GET', '/v2/competitions/PL/matches?matchday=1', None, headers )
response = json.loads(connection.getresponse().read().decode())

with open(dir_raw/'PL_Matchday1_fixtures.json', "w") as fout:
    json.dump(response, fout, indent=4)