In [1]:
import pandas as pd 
import numpy as np 
import os 
import requests
import matplotlib.pyplot as plt 

## Loading data  

Because of a problem with the API requests in the first two notebooks, anyone replicating this project must start running the code from the beginning of this notebook.

The file loaded here replaces the work done in the first two notebooks. It contains the processed data obtained from the API requests in notebook 01 and the merge performed in notebook 02.
To run the code in this notebook, first download the csv file from the following link:

https://drive.google.com/file/d/1lZmDYPA-9lzFHU6DSI8XTVaISMUZullL/view?usp=sharing

Once downloaded, save the file in the same local folder as this notebook so that the rest of the code runs without changes.
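
A quick check before loading can make a missing download easier to diagnose. The sketch below is only an illustration: the helper name `dataset_path` is hypothetical, and the default file name `merged_data` is an assumption taken from the `read_csv` call used later in this notebook (the downloaded file may carry a different name or extension).

```python
from pathlib import Path

def dataset_path(name="merged_data", folder="."):
    """Return the path to the downloaded file, or raise a readable error.

    `name` is an assumed file name; adjust it to match the downloaded file.
    """
    path = Path(folder) / name
    if not path.is_file():
        raise FileNotFoundError(
            f"'{name}' was not found in {path.parent.resolve()}; "
            "download it and place it next to this notebook."
        )
    return path
```

The returned path can then be passed straight to `pd.read_csv`.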

## 0.- Starting to run the project code

Although execution of the project code starts in this notebook, it is advisable to review what was executed in the first two notebooks. Doing so makes it easier to understand the content of the data processed here.

In this notebook the dataset containing both requests (merged_data) will be prepared and cleaned. Unnecessary columns will be removed, as well as any duplicated information. Columns will be renamed to make the dataframe clearer.

Once the dataset is prepared, a column corresponding to the target categorical variable will be created.

In [2]:
merged_data = pd.read_csv('merged_data', index_col = [0])


## 1.- Preparing & cleaning

In [3]:
merged_data.columns

Index(['match_id', 'year_x', 'round_x', 'local', 'visitor', 'league_id',
       'team1_id_season', 'team2_id_season', 'team1_id', 'team2_id',
       'local_abbr', 'visitor_abbr', 'division_x', 'local_goals',
       'visitor_goals', 'result', 'winner', 'key_local', 'key_visitor', 'id_x',
       'team_x', 'points_x', 'wins_x', 'draws_x', 'losses_x', 'gf_x', 'ga_x',
       'avg_x', 'round_y', 'pos_x', 'form_x', 'year_y', 'division_y', 'key_x',
       'id_y', 'team_y', 'points_y', 'wins_y', 'draws_y', 'losses_y', 'gf_y',
       'ga_y', 'avg_y', 'round', 'pos_y', 'form_y', 'year', 'division',
       'key_y'],
      dtype='object')

Now that all the data is in one DataFrame, we first delete the duplicated columns and then the ones that are no longer useful (keys, ids, ...).

In [4]:
merged_data['id_x'] == merged_data['team1_id']

10      True
11      True
12      True
13      True
14      True
        ... 
4783    True
4784    True
4785    True
4786    True
4787    True
Length: 4662, dtype: bool

In [5]:
merged_data['team_x'] == merged_data['local']

10      True
11      True
12      True
13      True
14      True
        ... 
4783    True
4784    True
4785    True
4786    True
4787    True
Length: 4662, dtype: bool
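
The two element-wise comparisons above return one boolean per row, so confirming that a column is fully redundant means scanning the output. A minimal sketch of the same check reduced to a single value per column pair is shown here on a small made-up frame (`toy` and its values are illustrative, not taken from the real dataset):

```python
import pandas as pd

# Toy stand-in for merged_data with two duplicated column pairs
toy = pd.DataFrame({
    "team1_id": [2716, 2120, 429],
    "id_x": [2716, 2120, 429],
    "local": ["Villarreal", "R. Sociedad", "Barcelona"],
    "team_x": ["Villarreal", "R. Sociedad", "Barcelona"],
})

# Each pair is (candidate duplicate, column to keep)
pairs = [("id_x", "team1_id"), ("team_x", "local")]

# .all() collapses the row-wise comparison into one True/False per pair
redundant = {dup: bool((toy[dup] == toy[keep]).all()) for dup, keep in pairs}
print(redundant)
```

Note that `==` treats missing values as unequal, so columns containing NaN would need separate handling before being declared duplicates.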

Columns cannot be deleted all at once because some of them need to be renamed first. They will therefore be deleted in two batches.

In [6]:
merged_data = merged_data.rename(columns={'points_x':'points_local', 'wins_x':'wins_local',
        'draws_x':'draws_local', 'losses_x':'losses_local', 'gf_x':'gf_local',
        'ga_x':'ga_local','avg_x':'avg_local','pos_x':'pos_local', 'form_x':'form_local', 'points_y':'points_visitor', 'wins_y':'wins_visitor',
        'draws_y':'draws_visitor', 'losses_y':'losses_visitor', 'gf_y':'gf_visitor',
        'ga_y':'ga_visitor','avg_y':'avg_visitor','pos_y':'pos_visitor', 'form_y':'form_visitor' })
merged_data.columns

Index(['match_id', 'year_x', 'round_x', 'local', 'visitor', 'league_id',
       'team1_id_season', 'team2_id_season', 'team1_id', 'team2_id',
       'local_abbr', 'visitor_abbr', 'division_x', 'local_goals',
       'visitor_goals', 'result', 'winner', 'key_local', 'key_visitor', 'id_x',
       'team_x', 'points_local', 'wins_local', 'draws_local', 'losses_local',
       'gf_local', 'ga_local', 'avg_local', 'round_y', 'pos_local',
       'form_local', 'year_y', 'division_y', 'key_x', 'id_y', 'team_y',
       'points_visitor', 'wins_visitor', 'draws_visitor', 'losses_visitor',
       'gf_visitor', 'ga_visitor', 'avg_visitor', 'round', 'pos_visitor',
       'form_visitor', 'year', 'division', 'key_y'],
      dtype='object')
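
The rename mapping above follows a regular pattern: every statistic with an `_x` suffix becomes `_local`, and every `_y` suffix becomes `_visitor`. As a hedged alternative to typing each pair by hand, the mapping could be generated; the names `stats`, `suffixes` and `rename_map` below are illustrative only.

```python
import pandas as pd

# Statistics that exist once per team in the merged frame
stats = ["points", "wins", "draws", "losses", "gf", "ga", "avg", "pos", "form"]
suffixes = {"_x": "_local", "_y": "_visitor"}

# Build {'points_x': 'points_local', ..., 'form_y': 'form_visitor'}
rename_map = {
    f"{stat}{old}": f"{stat}{new}"
    for old, new in suffixes.items()
    for stat in stats
}

# Columns not present in the mapping (year_x, key_y, ...) are left untouched
toy = pd.DataFrame(columns=["points_x", "points_y", "year_x", "key_y"])
renamed = toy.rename(columns=rename_map)
print(list(renamed.columns))
```

Because `rename` ignores columns that are not keys of the mapping, `year_x`, `round_x` and `division_x` stay as they are until the second rename, which matches the two-batch approach used in this notebook.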

In [7]:
merged_data = merged_data.drop(['year','year_y','team_x','team_y','division_y','key_x','key_local','key_visitor','id_y','key_y','id_x','round','round_y', 'division'], axis=1)

In [8]:
merged_data = merged_data.rename(columns={'round_x':'round','division_x':'division', 'year_x':'year'}) 

In [9]:
merged_data.columns

Index(['match_id', 'year', 'round', 'local', 'visitor', 'league_id',
       'team1_id_season', 'team2_id_season', 'team1_id', 'team2_id',
       'local_abbr', 'visitor_abbr', 'division', 'local_goals',
       'visitor_goals', 'result', 'winner', 'points_local', 'wins_local',
       'draws_local', 'losses_local', 'gf_local', 'ga_local', 'avg_local',
       'pos_local', 'form_local', 'points_visitor', 'wins_visitor',
       'draws_visitor', 'losses_visitor', 'gf_visitor', 'ga_visitor',
       'avg_visitor', 'pos_visitor', 'form_visitor'],
      dtype='object')

In [10]:
teams_id_df = merged_data.drop(['match_id', 'year', 'round', 'visitor', 'league_id',
        'team2_id_season', 'team2_id',
        'visitor_abbr', 'division', 'local_goals',
       'visitor_goals', 'result', 'winner', 'points_local',
       'wins_local', 'draws_local', 'losses_local', 'gf_local', 'ga_local',
       'avg_local', 'pos_local', 'form_local',
       'points_visitor', 'wins_visitor', 'draws_visitor', 'losses_visitor',
       'gf_visitor', 'ga_visitor', 'avg_visitor', 'pos_visitor',
       'form_visitor'] , axis = 1 )


At this point the identifiers of each team are stored together with their abbreviations in a dictionary. These columns will probably be deleted later, and the information may still be needed to refer to the results of specific teams.
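
Since only four columns are kept, selecting them by name may be easier to read than listing every column to drop. The sketch below uses a small made-up frame (`toy`, `teams_id_toy` and `id_columns` are illustrative names) and is meant to give the same result as the `drop` call above, under the assumption that those four columns are the only ones needed.

```python
import pandas as pd

# Toy stand-in for merged_data with a few extra columns
toy = pd.DataFrame({
    "match_id": [1, 2],
    "local": ["Villarreal", "Barcelona"],
    "team1_id_season": [214625, 214620],
    "team1_id": [2716, 429],
    "local_abbr": ["VIL", "FCB"],
    "visitor": ["Celta", "Eibar"],
})

# Keep only the identifier columns, then renumber the rows from 0
id_columns = ["local", "team1_id_season", "team1_id", "local_abbr"]
teams_id_toy = toy[id_columns].reset_index(drop=True)
print(teams_id_toy)
```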

In [11]:
teams_id_df = teams_id_df.reset_index(drop=True)
teams_id_df

,local,team1_id_season,team1_id,local_abbr
0,Villarreal,214625,2716,VIL
1,R. Sociedad,214631,2120,RSO
2,Barcelona,214620,429,FCB
3,Celta,214627,712,CEL
4,Real Madrid,214621,2107,RMA
...,...,...,...,...
4657,Real Oviedo,6382799,2115,ROV
4658,FC Cartagena,6382787,643,CAR
4659,UD Logroñés,6382792,1578,UDL
4660,Rayo Vallecano,6382798,2080,RAY


In [12]:
teams_id_dict = teams_id_df.set_index('local').T.to_dict()
teams_id_dict

{'Villarreal': {'team1_id_season': 6380753,
  'team1_id': 2716,
  'local_abbr': 'VIL'},
 'R. Sociedad': {'team1_id_season': 6380747,
  'team1_id': 2120,
  'local_abbr': 'RSO'},
 'Barcelona': {'team1_id_season': 6380740,
  'team1_id': 429,
  'local_abbr': 'FCB'},
 'Celta': {'team1_id_season': 6380741, 'team1_id': 712, 'local_abbr': 'CEL'},
 'Real Madrid': {'team1_id_season': 6380749,
  'team1_id': 2107,
  'local_abbr': 'RMA'},
 'Eibar': {'team1_id_season': 6380742, 'team1_id': 957, 'local_abbr': 'EIB'},
 'Valencia': {'team1_id_season': 6380752,
  'team1_id': 2647,
  'local_abbr': 'VCF'},
 'Sevilla': {'team1_id_season': 6380751,
  'team1_id': 1102,
  'local_abbr': 'SEV'},
 'Las Palmas': {'team1_id_season': 6382790,
  'team1_id': 2563,
  'local_abbr': 'UDL'},
 'Getafe': {'team1_id_season': 711285, 'team1_id': 1217, 'local_abbr': 'GET'},
 'Levante': {'team1_id_season': 711287, 'team1_id': 1547, 'local_abbr': 'LEV'},
 'Espanyol': {'team1_id_season': 6382789,
  'team1_id': 998,
  'local_abbr

In [13]:
len(teams_id_dict)

57

In [14]:
teams_id_dict['Real Madrid']

{'team1_id_season': 6380749, 'team1_id': 2107, 'local_abbr': 'RMA'}
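
Each team appears once per home match, so `local` is not a unique index when the dictionary is built, and the result appears to keep the values from the last row seen for each team. A minimal sketch that makes this choice explicit is given below on made-up rows; `toy`, `latest` and `teams_toy` are illustrative names, and keeping the last occurrence is an assumption about the intended behaviour.

```python
import pandas as pd

# Toy frame where Villarreal appears in two different seasons
toy = pd.DataFrame({
    "local": ["Villarreal", "Barcelona", "Villarreal"],
    "team1_id_season": [214625, 214620, 6380753],
    "team1_id": [2716, 429, 2716],
    "local_abbr": ["VIL", "FCB", "VIL"],
})

# Keep one row per team (the most recent one), then build a nested dict
latest = toy.drop_duplicates(subset="local", keep="last")
teams_toy = latest.set_index("local").to_dict(orient="index")
print(teams_toy["Villarreal"])
```

With `orient="index"` the transpose is no longer needed, and deduplicating first means no rows are silently discarded while the dictionary is built.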

## 2.- Creating target variable 

With the dataset prepared and cleaned, the target categorical variable can be created. It has three categories, one for each possible result of a match: local win, draw and visitor win.
Since the variable has only three categories, the simplest solution is to assign a numerical value to each of them.
From this point on, the following numbers identify the winner of the match:

#### LOCAL WIN: 0 || DRAW: 1 || VISITOR WIN: 2

Keep this encoding in mind, as it is used throughout the project.

In [15]:
def match_winner(row):
    # With axis=1, apply passes one row (one match) at a time
    if row['team1_id_season'] == row['winner']:
        return 0  # local win
    if row['team2_id_season'] == row['winner']:
        return 2  # visitor win
    if row['winner'] == 0:
        return 1  # draw

merged_data['match_winner'] = merged_data.apply(match_winner, axis=1)
merged_data['match_winner']

10      0
11      1
12      0
13      0
14      0
       ..
4783    0
4784    0
4785    2
4786    1
4787    0
Name: match_winner, Length: 4662, dtype: int64
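
The row-wise `apply` above works well at this size. For reference, the same encoding can also be written in vectorized form with `np.select`. The sketch below runs on made-up ids (`toy`, `conditions`, `codes` and `LABELS` are illustrative names) and checks the conditions in the same order as `match_winner`; the `-1` default is an assumption added here to flag rows that match none of them.

```python
import numpy as np
import pandas as pd

# Toy matches: local win, draw, visitor win
toy = pd.DataFrame({
    "team1_id_season": [10, 10, 30],
    "team2_id_season": [20, 40, 10],
    "winner": [10, 0, 10],
})

# Conditions are evaluated in order; the first one that holds sets the code
conditions = [
    toy["winner"] == toy["team1_id_season"],  # local win
    toy["winner"] == toy["team2_id_season"],  # visitor win
    toy["winner"] == 0,                       # draw
]
codes = [0, 2, 1]

# -1 marks rows where none of the conditions hold
toy["match_winner"] = np.select(conditions, codes, default=-1)

# Readable names for the numeric codes
LABELS = {0: "Local win", 1: "Draw", 2: "Visitor win"}
print(toy["match_winner"].map(LABELS).tolist())
```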

### Saving data 

Once this process is finished, the dataset is saved in two different files. One is intended for plotting, so the variables that cannot be plotted, or that are not wanted in the plots, are removed from it. The other keeps the dataset intact so that work can continue on it during the feature engineering phase.

In [16]:
merged_data.to_csv('Cleaned Data')

In [17]:
merged_data.columns

Index(['match_id', 'year', 'round', 'local', 'visitor', 'league_id',
       'team1_id_season', 'team2_id_season', 'team1_id', 'team2_id',
       'local_abbr', 'visitor_abbr', 'division', 'local_goals',
       'visitor_goals', 'result', 'winner', 'points_local', 'wins_local',
       'draws_local', 'losses_local', 'gf_local', 'ga_local', 'avg_local',
       'pos_local', 'form_local', 'points_visitor', 'wins_visitor',
       'draws_visitor', 'losses_visitor', 'gf_visitor', 'ga_visitor',
       'avg_visitor', 'pos_visitor', 'form_visitor', 'match_winner'],
      dtype='object')

In [18]:
plotting_data = merged_data.drop(['match_id','local', 'visitor','league_id',
       'team1_id_season', 'team2_id_season', 'team1_id', 'team2_id',
       'local_abbr', 'visitor_abbr','result', 'form_local', 'form_visitor', 'winner'], axis=1)

In [19]:
plotting_data.to_csv('plotting_data')
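
Both files are written with the DataFrame index as their first column, so the next notebooks need `index_col=0` when reading them back, as was done for `merged_data` at the top of this notebook. The round trip can be sketched as follows; an in-memory buffer stands in for the file on disk, and `toy` and `restored` are illustrative names.

```python
import io
import pandas as pd

# Toy frame with a non-default index, like the cleaned dataset
toy = pd.DataFrame({"year": [2016, 2016], "match_winner": [0, 2]}, index=[10, 11])

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the original index instead of creating an extra column
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Without `index_col=0`, the saved index would come back as an additional unnamed column.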