## Introduction to the project

The project begins by requesting all the necessary data, that is, every match result from La Liga Santander and La Liga Smartbank in the last 5 years. The request is made through the BeSoccer API.

This is the final project of the Master in Data Science at KSCHOOL, by Pablo Fernández Matus. The project is titled "Spanish La Liga Predictions" and was carried out during the first half of 2021.

Throughout this project we work with data on soccer match results, from which we try to extract relevant information in order to build a prediction model. The type of model and what it will consist of are decisions to be made as the content of the data becomes better understood.

The project is divided into several notebooks that follow the steps taken for data processing, algorithm execution and data analysis. This final step is performed via Tableau dashboards, which act as the front-end of the project.

All the files needed to run this project (Tableau front-end included) are collected in the following GitHub repository:

https://github.com/PabloMatus6/Spanish-LaLiga-Round-Prediction

In terms of software and licences, access to Jupyter Notebook is sufficient to run the code. To view the front-end in Tableau, you will need a Tableau account or the free trial offered by Tableau Desktop.

# Imports

In [14]:
import pandas as pd 
import numpy as np 
import os 
import requests
import json
import matplotlib.pyplot as plt
import seaborn as sns

# 1.- Requests

### 1.1.- Matches Request

The project starts by requesting the matches played in La Liga Santander and La Liga Smartbank in the last 5 years. The request is made via the BeSoccer API. No personal key is needed, as one is already configured in the notebook.

If this key stops working in the future, it will be necessary to request a temporary free account for the BeSoccer API and replace the old key in the request URLs with the one received.
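
Replacing the key is easier if it lives in a single variable instead of inside every URL. The following is a minimal sketch of that idea, not the notebook's own code: `build_url` and `API_KEY` are hypothetical names, and the query parameters are simply the ones that appear in the request URLs of this notebook.

```python
from urllib.parse import urlencode

# Base endpoint copied from the request cells in this notebook.
BASE_URL = "https://apiclient.besoccerapps.com/scripts/api/api.php"


def build_url(api_key, req, **params):
    """Build a request URL (hypothetical helper, not part of the BeSoccer API)."""
    query = {"key": api_key, "tz": "Europe/Madrid", "format": "json", "req": req}
    query.update(params)
    # safe="/" keeps the slash in "Europe/Madrid" unescaped, as in the original URLs
    return f"{BASE_URL}?{urlencode(query, safe='/')}"


API_KEY = "YOUR_KEY_HERE"  # placeholder: replace with the key received from BeSoccer
example_url = build_url(API_KEY, "matchs", league=1, round=1, year=2016)
print(example_url)
```

With this layout, a new key only has to be changed in one place.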

In [7]:
match_frames = []  # one DataFrame per (division, season, round) request
for league_division in range(1, 3):
    for league_season in range(2016, 2022):
        for league_round in range(1, 39):
            matches_url = f"https://apiclient.besoccerapps.com/scripts/api/api.php?key=023afbc77c5610fefc3fc8976e451752&tz=Europe/Madrid&format=json&req=matchs&league={league_division}&round={league_round}&order=twin&twolegged=1&year={league_season}"
            response = requests.get(matches_url)
            result = json.loads(response.content)
            match_frames.append(pd.DataFrame(result['match']))

# DataFrame.append was removed in pandas 2.0, so the pieces are concatenated once at the end
df1 = pd.concat(match_frames)
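
The loop above assumes that every request succeeds and that the response always contains the expected section. A slightly more defensive variant could look like the sketch below. `fetch_section` is a hypothetical helper, and returning an empty list when the section is missing is an assumption about how one might want to handle it, not documented API behaviour.

```python
import requests


def fetch_section(url, section, get=requests.get, timeout=10):
    """Return the list stored under `section` in the JSON response, or [] if absent.

    Hypothetical helper: `get` is injectable so the function can be tried
    without network access.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    payload = response.json()
    return payload.get(section) or []


# Usage inside the request loop (not executed here):
# rows = fetch_section(matches_url, "match")
# if rows:
#     match_frames.append(pd.DataFrame(rows))
```

The timeout avoids a single stalled request blocking the whole download.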

In [3]:
df1.head(5)

Unnamed: 0,id,year,group,total_group,round,local,visitor,league_id,stadium,team1,...,visitor_goals,result,live_minute,status,channels,winner,penaltis1,penaltis2,prorroga,stadium2
0,37429,2016,1,1,1,Málaga,Sevilla,15373,La Rosaleda,214628,...,0,0-0,,1,"[{'id': '41', 'name': 'C+ Liga', 'image': 'htt...",0,0,0,False,
1,37433,2016,1,1,1,Deportivo,R. Sociedad,15373,Municipal Riazor,214634,...,0,0-0,,1,"[{'id': '41', 'name': 'C+ Liga', 'image': 'htt...",0,0,0,False,
2,37432,2016,1,1,1,Espanyol,Getafe,15373,RCDE Stadium,214629,...,0,1-0,,1,"[{'id': '128', 'name': 'C+ Liga Multi', 'image...",214629,0,0,False,
3,37437,2016,1,1,1,Atlético,Las Palmas,15373,Wanda Metropolitano,214622,...,0,1-0,,1,"[{'id': '41', 'name': 'C+ Liga', 'image': 'htt...",214622,0,0,False,
4,37430,2016,1,1,1,Rayo Vallecano,Valencia,15373,Vallecas,214630,...,0,0-0,,1,"[{'id': '41', 'name': 'C+ Liga', 'image': 'htt...",0,0,0,False,


La Liga Smartbank has two more teams than La Liga Santander, so its season has 42 rounds instead of 38, and four more rounds would need to be requested for it. However, after several tests, adding them proved problematic and made the information confusing, since those rounds do not exist in the first division, so it was finally decided not to include them.
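
For reference, a per-division round count could be expressed as in the sketch below. This is not what the notebook does (it requests 38 rounds for both divisions); `ROUNDS_PER_DIVISION` and `request_plan` are hypothetical names used only to illustrate the alternative.

```python
# Hypothetical mapping: rounds per division, as described in the text above
ROUNDS_PER_DIVISION = {1: 38, 2: 42}
SEASONS = range(2016, 2022)  # same season range as the request loops


def request_plan():
    """Yield every (division, season, round) combination that would be requested."""
    for division, total_rounds in ROUNDS_PER_DIVISION.items():
        for season in SEASONS:
            for round_number in range(1, total_rounds + 1):
                yield division, season, round_number


plan = list(request_plan())
print(len(plan))  # 6 seasons * 38 rounds + 6 seasons * 42 rounds
```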

In [4]:
df1.tail()

Unnamed: 0,id,year,group,total_group,round,local,visitor,league_id,stadium,team1,...,visitor_goals,result,live_minute,status,channels,winner,penaltis1,penaltis2,prorroga,stadium2
6,91110,2021,1,1,38,Real Oviedo,Sabadell,57314,Carlos Tartiere,6382799,...,1,2-1,,1,"[{'id': '325', 'name': 'M. LaLiga', 'image': '...",6382799,0,0,False,
7,91104,2021,1,1,38,FC Cartagena,CD Castellón,57314,Municipal Cartagonova,6382787,...,0,1-0,,1,"[{'id': '325', 'name': 'M. LaLiga', 'image': '...",6382787,0,0,False,
8,91112,2021,1,1,38,UD Logroñés,Girona,57314,Las Gaunas,6382792,...,4,1-4,,1,"[{'id': '325', 'name': 'M. LaLiga', 'image': '...",6391868,0,0,False,
9,91109,2021,1,1,38,Rayo Vallecano,Leganés,57314,Vallecas,6382798,...,1,1-1,,1,"[{'id': '303', 'name': '#Vamos', 'image': 'htt...",0,0,0,False,
10,91105,2021,1,1,38,Real Sporting,Lugo,57314,El Molinón-Enrique Castro Quini,6382800,...,0,1-0,,1,"[{'id': '325', 'name': 'M. LaLiga', 'image': '...",6382800,0,0,False,


At this point, the entire dataframe is saved before any modifications are made to it.

In [5]:
df1.to_csv('matches_request')
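
When a file saved this way is loaded again, the index is read back as an extra column named `Unnamed: 0`, similar to the one heading the tables in this notebook. The sketch below shows the effect on two rows taken from the `df1.head()` output, using an in-memory buffer instead of a file. Whether to keep the index is a design choice, not something the notebook prescribes.

```python
import io

import pandas as pd

# Two rows copied from the df1.head() output above
sample = pd.DataFrame({"id": ["37429", "37433"], "local": ["Málaga", "Deportivo"]})

# Saved with the index (the default) and reloaded
with_index = pd.read_csv(io.StringIO(sample.to_csv()))
# Saved without the index and reloaded
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```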

The columns that make up the dataframe are noted and those that are not needed are removed.

In [15]:
for i in df1.columns:
    print(i)

id
year
group
total_group
round
local
visitor
league_id
stadium
team1
team2
conference
dteam1
dteam2
numc
no_hour
local_abbr
visitor_abbr
isVideo
numVideos
competition_name
competition_id
split_league
type
type_id
playoffs
group_code
total_rounds
coef
cflag_local
cflag_visitor
local_shield
visitor_shield
extraTxt
schedule
date
hour
minute
local_goals
visitor_goals
result
live_minute
status
channels
winner
penaltis1
penaltis2
prorroga
stadium2


We drop the columns that we evidently don't need, and will re-evaluate the remaining ones later.

In [7]:
df2 = df1.drop(['stadium', 'group', 'total_group', 'no_hour', 'status', 'isVideo', 'playoffs', 'penaltis1', 'penaltis2', 'prorroga', 'stadium2', 'channels', 'hour', 'minute', 'extraTxt', 'cflag_local', 'cflag_visitor', 'conference', 'local_shield', 'visitor_shield', 'schedule', 'total_rounds', 'type_id', 'type', 'split_league', 'live_minute', 'numVideos', 'numc'], axis=1)

Next, the content of some of the less obvious columns is inspected.

In [8]:
df2['dteam1']

0     1617
1      901
2      998
3      369
4     2080
      ... 
6     2115
7      643
8     1578
9     2080
10    2125
Name: dteam1, Length: 4788, dtype: object

In [9]:
dteam_meaning = df2['dteam1'] == '1535'
dteam_meaning.head()

0    False
1    False
2    False
3    False
4    False
Name: dteam1, dtype: bool

In [10]:
dteam_1535 = df2[dteam_meaning]
dteam_1535.tail()

Unnamed: 0,id,year,round,local,visitor,league_id,team1,team2,dteam1,dteam2,...,visitor_abbr,competition_name,competition_id,group_code,coef,date,local_goals,visitor_goals,result,winner
2,90646,2021,28,Leganés,CD Castellón,57314,6382791,6382788,1535,673,...,CAS,Segunda División,2,1,58.462,2021/03/06,0,0,0-0,0
3,90667,2021,30,Leganés,Fuenlabrada,57314,6382791,6387869,1535,1179,...,FUE,Segunda División,2,1,58.462,2021/03/20,0,2,0-2,6387869
7,91056,2021,33,Leganés,Sabadell,57314,6382791,6382802,1535,2198,...,SAB,Segunda División,2,1,58.462,2021/04/04,2,1,2-1,6382791
9,91078,2021,35,Leganés,Ponferradina,57314,6382791,6382797,1535,3287,...,PON,Segunda División,2,1,58.462,2021/04/19,1,1,1-1,0
6,91099,2021,37,Leganés,Real Sporting,57314,6382791,6382800,1535,2125,...,SPO,Segunda División,2,1,58.462,2021/05/02,0,0,0-0,0


Both the 'dteam' columns and the 'team1' and 'team2' columns turn out to be team identifiers. The API documentation explains the difference: 'team' identifies the team only for that season, whereas 'dteam' is the unique team identifier across all seasons and competitions. For the moment we keep these columns.
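
The difference can be checked directly on the data. The sketch below uses the two Rayo Vallecano home matches that appear in the `df1.head()` and `df1.tail()` outputs above, with the `dteam1` values shown for those rows, and counts how many distinct identifiers each column holds per team.

```python
import pandas as pd

# Rayo Vallecano rows taken from the outputs above (seasons 2016 and 2021)
sample = pd.DataFrame({
    "year": [2016, 2021],
    "local": ["Rayo Vallecano", "Rayo Vallecano"],
    "team1": ["214630", "6382798"],   # season-specific identifier
    "dteam1": ["2080", "2080"],       # identifier shared by all seasons
})

# Number of distinct identifiers per team in each column
ids_per_team = sample.groupby("local")[["team1", "dteam1"]].nunique()
print(ids_per_team)
```

A team with more than one `team1` value but a single `dteam1` value matches the behaviour described in the documentation.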


#### For any other questions, the documentation describing the meaning of each column is available here: https://company.besoccer.com/api/documentacion/matchs

### 1.2.- Standings Request

A second request is then made to complete the information already available. It retrieves the league table for each round, division and season.

In [11]:
standings_frames = []  # one DataFrame per (division, season, round) request
for league_division in range(1, 3):
    for league_season in range(2016, 2022):
        for league_round in range(1, 39):
            standings_url = f"https://apiclient.besoccerapps.com/scripts/api/api.php?key=023afbc77c5610fefc3fc8976e451752&tz=Europe/Madrid&format=json&req=tables&league={league_division}&round={league_round}&year={league_season}"
            response_standings = requests.get(standings_url)
            result_standings = json.loads(response_standings.content)
            standings_frames.append(pd.DataFrame(result_standings['table']))

# DataFrame.append was removed in pandas 2.0, so the pieces are concatenated once at the end
df3 = pd.concat(standings_frames)

In [12]:
df3.head(5)

Unnamed: 0,id,group,group_name,conference,team,points,wins,draws,losses,shield,...,coef,coefficients,mark,class_mark,round,pos,countrycode,abbr,form,direction
0,957,1,,0,Eibar,3,1,0,0,https://thumb.resfu.com/img_data/escudos/mediu...,...,,,1,cha,1,1,ES,EIB,w,
1,712,1,,0,Celta,3,1,0,0,https://thumb.resfu.com/img_data/escudos/mediu...,...,,,1,cha,1,2,ES,CEL,w,
2,369,1,,0,Atlético,3,1,0,0,https://thumb.resfu.com/img_data/escudos/mediu...,...,,,1,cha,1,3,ES,ATM,w,
3,429,1,,0,Barcelona,3,1,0,0,https://thumb.resfu.com/img_data/escudos/mediu...,...,,,2,prev,1,4,ES,FCB,w,
4,998,1,,0,Espanyol,3,1,0,0,https://thumb.resfu.com/img_data/escudos/mediu...,...,,,3,uefa,1,5,ES,ESP,w,


In [13]:
df3.columns

Index(['id', 'group', 'group_name', 'conference', 'team', 'points', 'wins',
       'draws', 'losses', 'shield', 'cflag', 'basealias', 'gf', 'ga', 'avg',
       'matchs_coef', 'points_coef', 'coef', 'coefficients', 'mark',
       'class_mark', 'round', 'pos', 'countrycode', 'abbr', 'form',
       'direction'],
      dtype='object')

In [14]:
df3 = df3.drop(['group', 'group_name', 'conference', 'shield', 'cflag', 'basealias',
       'matchs_coef', 'points_coef', 'coef', 'coefficients', 'mark',
       'class_mark', 'countrycode', 'abbr',
       'direction'] , axis = 1 )

In [15]:
df3

Unnamed: 0,id,team,points,wins,draws,losses,gf,ga,avg,round,pos,form
0,957,Eibar,3,1,0,0,3,1,2,1,1,w
1,712,Celta,3,1,0,0,2,1,1,1,2,w
2,369,Atlético,3,1,0,0,1,0,1,1,3,w
3,429,Barcelona,3,1,0,0,1,0,1,1,4,w
4,998,Espanyol,3,1,0,0,1,0,1,1,5,w
...,...,...,...,...,...,...,...,...,...,...,...,...
17,673,CD Castellón,41,11,8,19,35,44,-9,38,18,lwddl
18,1578,UD Logroñés,41,10,11,17,26,47,-21,38,19,wdldl
19,2198,Sabadell,40,9,13,16,36,44,-8,38,20,ddwwl
20,1598,Lugo,37,8,13,17,32,50,-18,38,21,llldl


In [16]:
df3.tail()

Unnamed: 0,id,team,points,wins,draws,losses,gf,ga,avg,round,pos,form
17,673,CD Castellón,41,11,8,19,35,44,-9,38,18,lwddl
18,1578,UD Logroñés,41,10,11,17,26,47,-21,38,19,wdldl
19,2198,Sabadell,40,9,13,16,36,44,-8,38,20,ddwwl
20,1598,Lugo,37,8,13,17,32,50,-18,38,21,llldl
21,140,Albacete,36,9,9,20,25,46,-21,38,22,lwdlw


This request still lacks some information needed to merge it with the DataFrame created by the matches request, so the season ('year') and the division are added to each response.

In [17]:
for league_division in range(1, 3):
    for league_season in range(2016, 2022):
        for round_num in range(1, 39):
            standings_url_bis = f"https://apiclient.besoccerapps.com/scripts/api/api.php?key=023afbc77c5610fefc3fc8976e451752&tz=Europe/Madrid&format=json&req=tables&league={league_division}&round={round_num}&year={league_season}"
            response_standings_bis = requests.get(standings_url_bis)
            result_standings_bis = json.loads(response_standings_bis.content)
            if league_division == 1 and league_season == 2016 and round_num == 1:
                df4 = pd.DataFrame(result_standings_bis['table'])
                df4['year'] = league_season
                df4['division'] = league_division
            else:
                df4_bis = pd.DataFrame(result_standings_bis['table'])
                df4_bis['year'] = league_season
                df4_bis['division'] = league_division
                df4 = pd.concat([df4, df4_bis])
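
The same tagging step can be written more compactly with `DataFrame.assign`. The sketch below is an illustration on hand-written stand-ins for two API responses (rows taken from the outputs above), not a replacement for the cell. Unlike the notebook, it resets the index with `ignore_index=True`.

```python
import pandas as pd

# Hand-written stand-ins for two API responses: (division, season, table rows)
responses = [
    (1, 2016, [{"id": "957", "team": "Eibar", "points": "3"}]),
    (2, 2021, [{"id": "140", "team": "Albacete", "points": "36"}]),
]

# Tag every table with the season and division it was requested for
frames = [
    pd.DataFrame(table).assign(year=season, division=division)
    for division, season, table in responses
]
tagged = pd.concat(frames, ignore_index=True)
print(tagged)
```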

In [18]:
df4.tail()

Unnamed: 0,id,group,group_name,conference,team,points,wins,draws,losses,shield,...,mark,class_mark,round,pos,countrycode,abbr,form,direction,year,division
17,673,1,,0,CD Castellón,41,11,8,19,https://thumb.resfu.com/img_data/escudos/mediu...,...,,,38,18,ES,CAS,lwddl,,2021,2
18,1578,1,,0,UD Logroñés,41,10,11,17,https://thumb.resfu.com/img_data/escudos/mediu...,...,3.0,desc,38,19,ES,UDL,wdldl,d,2021,2
19,2198,1,,0,Sabadell,40,9,13,16,https://thumb.resfu.com/img_data/escudos/mediu...,...,3.0,desc,38,20,ES,SAB,ddwwl,d,2021,2
20,1598,1,,0,Lugo,37,8,13,17,https://thumb.resfu.com/img_data/escudos/mediu...,...,3.0,desc,38,21,ES,LUG,llldl,,2021,2
21,140,1,,0,Albacete,36,9,9,20,https://thumb.resfu.com/img_data/escudos/mediu...,...,3.0,desc,38,22,ES,ALB,lwdlw,,2021,2


The dataframe extracted from this request is saved in the same way. Two dataframes are now available, and they must be joined into one using the information they have in common.

In [19]:
df4.to_csv('standings_request')
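
As a rough idea of what that join could look like, the sketch below merges one match row with the standings row of its home team, using values shown in the outputs above. The choice of key columns (`dteam1` against the standings `id`, plus `year` and `round`) is an assumption for illustration, as is the idea that the keys arrive with different dtypes (the matches `year` as text, the standings `year` as an integer added in the loop). The actual merge may differ.

```python
import pandas as pd

# One match and one standings row taken from the outputs above
matches = pd.DataFrame({
    "year": ["2021"], "round": ["38"],
    "local": ["UD Logroñés"], "visitor": ["Girona"],
    "dteam1": ["1578"], "result": ["1-4"],
})
standings = pd.DataFrame({
    "id": ["1578"], "team": ["UD Logroñés"],
    "points": ["41"], "pos": ["19"],
    "round": ["38"], "year": [2021],
})

# Bring the shared keys to a common dtype before merging (assumed mismatch)
keys = ["year", "round"]
for frame in (matches, standings):
    frame[keys] = frame[keys].astype(int)

# Attach the home team's standings to each match
merged = matches.merge(
    standings.rename(columns={"id": "dteam1"}),
    on=["dteam1", "year", "round"],
    how="left",
)
print(merged[["local", "visitor", "result", "points", "pos"]])
```

A left join keeps every match even when no standings row is found, which makes missing keys easy to spot afterwards.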