## La Quiniela - Machine Learning Model:
In this project, we create a model that predicts the results of La Liga football matches.  

We train the model on historical data from previous La Liga seasons, use it to predict the results of the 2020-2021 matches, and then compare our predictions with the real results to calculate the accuracy of the model.

Since we are trying to predict results, the output is one of three values: 1 if the home team wins, X for a tie, and 2 if the away team wins. The output is therefore discrete, which makes this a classification problem.

Knowing this, we chose a Random Forest to build our model, using the home team, the away team, and their rankings to predict the result.

Below we go step by step through the work we did to calculate the accuracy.

**1- Importing the needed Libraries:**
We import only the libraries that the code needs.

We use the machine learning library scikit-learn for the model. Note that the estimator imported below is `RandomForestRegressor`; its numeric predictions are converted to integer class labels later on.

In [54]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
import sqlite3
import math
import warnings
warnings.filterwarnings("ignore")

**Next, we create a pandas data frame from the database provided**

In [55]:
con = sqlite3.connect("laliga.sqlite")
df = pd.read_sql_query("SELECT * FROM Matches", con)
con.close()
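
The read pattern above can be tried without the real file. The following is a minimal sketch, assuming a made-up in-memory table (the rows and the reduced column set are hypothetical, not taken from `laliga.sqlite`). It uses `contextlib.closing` so that the connection is closed even if the query raises.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Hypothetical in-memory database standing in for laliga.sqlite.
with closing(sqlite3.connect(":memory:")) as con:
    con.execute(
        "CREATE TABLE Matches (season TEXT, home_team TEXT, away_team TEXT, score TEXT)"
    )
    con.executemany(
        "INSERT INTO Matches VALUES (?, ?, ?, ?)",
        [
            ("2019-2020", "Team A", "Team B", "2:1"),
            ("2019-2020", "Team B", "Team A", None),  # match without a score yet
        ],
    )
    demo_df = pd.read_sql_query("SELECT * FROM Matches", con)

print(demo_df.shape)           # (2, 4)
print(demo_df.dropna().shape)  # (1, 4)
```

Rows with a missing score are exactly the ones that `dropna` removes in the next step.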

**2- Data Cleaning and Manipulation:**
In this step, we clean the data by dropping NA values and create helper columns: the winning and the losing team of each match, and the points each team has accumulated. These columns let us derive the features and the target, which we need to build the model and to calculate its accuracy.

First, we remove the rows with NA values and work out the winner and the loser of each match. For a tie, both columns hold the string "NaN".

In [56]:
df.dropna(inplace = True)
df["home_score"] = df.apply(lambda x: int(x["score"].split(":")[0]), axis = 1)
df["away_score"] = df.apply(lambda x: int(x["score"].split(":")[1]), axis = 1)
df["winner_team"] = df.apply(lambda x : x["home_team"] if(x["home_score"] > x["away_score"]) else (x["away_team"] if(x["home_score"] < x["away_score"]) else "NaN"), axis = 1)
df["loser_team"] = df.apply(lambda x : x["home_team"] if(x["home_score"] < x["away_score"]) else (x["away_team"] if(x["home_score"] > x["away_score"]) else "NaN"), axis = 1)
df_home_team = df.copy()
df_home_team['team'] = df['home_team']
df_away_team = df.copy()
df_away_team['team'] = df['away_team']
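
The row-wise `apply` calls above can also be written with vectorized string and comparison operations. This is a minimal sketch on a made-up three-match frame (the team names are placeholders), not a replacement for the cell above:

```python
import pandas as pd

demo = pd.DataFrame({
    "home_team": ["Team A", "Team B", "Team C"],
    "away_team": ["Team B", "Team C", "Team A"],
    "score": ["2:1", "0:0", "1:3"],
})

# Split "home:away" into two integer columns in one vectorized step.
goals = demo["score"].str.split(":", expand=True).astype(int)
demo["home_score"] = goals[0]
demo["away_score"] = goals[1]

home_wins = demo["home_score"] > demo["away_score"]
away_wins = demo["home_score"] < demo["away_score"]

# where() keeps the value when the condition holds and uses the fallback otherwise.
# "NaN" is the same tie placeholder string used in the cell above.
demo["winner_team"] = demo["home_team"].where(
    home_wins, demo["away_team"].where(away_wins, "NaN")
)

print(demo["winner_team"].tolist())  # ['Team A', 'NaN', 'Team A']
```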

After this, we calculate the points and the rank of each team on each matchday, then drop the helper columns and keep only the columns we will use in our model.

In [57]:
df_total = pd.concat([df_home_team,df_away_team])
df_total = df_total.sort_values(by = ['season', 'division', 'matchday', 'score'])
df_total['W'] = df_total.apply(lambda x : 1 if x['winner_team'] == x['team'] else 0, axis = 1)
df_total['T'] = df_total.apply(lambda x : 1 if x['loser_team'] == 'NaN' else 0, axis = 1)
df_total['W'] = df_total.groupby(['season', 'division', 'team'])['W'].cumsum()
df_total['T'] = df_total.groupby(['season', 'division', 'team'])['T'].cumsum()
df_total['Pts'] = 3 * df_total['W'] + df_total['T']
df_total['rank'] = df_total.groupby(['division','season','matchday'])['Pts'].rank(method = 'min', ascending=False)
df_total = df_total.sort_index()
df_total_home_team = df_total.loc[df_total['home_team'] == df_total['team']]
df_total_away_team = df_total.loc[df_total['away_team'] == df_total['team']]
df['home_team_rank'] = df_total_home_team['rank']
df['away_team_rank'] = df_total_away_team['rank']
df.drop(columns = ['winner_team', 'loser_team'], inplace = True)
df.drop(columns = ['home_score', 'away_score'], inplace = True) 
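
To see how the cumulative points and the rank behave, here is a small sketch on a hypothetical four-team table (the `pts_won` and `Pts_before` columns are illustrative names, not part of the notebook). Note that a running total built with `cumsum` includes the result of the current matchday, so a rank computed from it already reflects the match being played. `Pts_before` shows one possible way to get the pre-match total instead.

```python
import pandas as pd

# Hypothetical long-format table: one row per team per matchday.
standings = pd.DataFrame({
    "matchday": [1, 1, 1, 1, 2, 2, 2, 2],
    "team":     ["A", "B", "C", "D", "A", "C", "B", "D"],
    "pts_won":  [3, 0, 1, 1, 0, 3, 1, 1],
})

# Running total per team, including the current matchday.
standings["Pts"] = standings.groupby("team")["pts_won"].cumsum()
# Points a team had before kick-off: running total minus today's points.
standings["Pts_before"] = standings["Pts"] - standings["pts_won"]
# method="min" gives tied teams the same (best) rank.
standings["rank"] = standings.groupby("matchday")["Pts"].rank(
    method="min", ascending=False
)

print(standings[standings["matchday"] == 2][["team", "Pts", "Pts_before", "rank"]])
```

On matchday 1, teams C and D are tied on one point and both get rank 2, while team B gets rank 4.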

After organizing and reordering our data, we check that it looks as expected: each row should contain the two teams, their rankings, the match details, and the score, as shown below:

In [58]:
df

| | season | division | matchday | date | time | home_team | away_team | score | home_team_rank | away_team_rank |
|---|---|---|---|---|---|---|---|---|---|---|
| 16024 | 1995-1996 | 1 | 1 | 9/3/95 | 7:30 PM | CP Mérida | Real Betis | 1:1 | 1.0 | 1.0 |
| 17318 | 1998-1999 | 1 | 1 | 8/29/98 | 9:00 PM | Alavés | Real Betis | 0:0 | 5.0 | 5.0 |
| 17319 | 1998-1999 | 1 | 1 | 8/29/98 | 10:00 PM | Valencia | Atlético Madrid | 1:0 | 1.0 | 17.0 |
| 17320 | 1998-1999 | 1 | 1 | 8/30/98 | 7:00 PM | Celta de Vigo | Dep. La Coruña | 0:0 | 5.0 | 5.0 |
| 17321 | 1998-1999 | 1 | 1 | 8/30/98 | 7:00 PM | Real Sociedad | Real Oviedo | 3:3 | 5.0 | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48585 | 2021-2022 | 2 | 3 | 8/28/21 | 10:00 PM | Ponferradina | Girona | 2:1 | 1.0 | 11.0 |
| 48586 | 2021-2022 | 2 | 3 | 8/29/21 | 5:00 PM | SD Amorebieta | UD Almería | 2:1 | 14.0 | 4.0 |
| 48587 | 2021-2022 | 2 | 3 | 8/29/21 | 7:30 PM | CD Lugo | Real Valladolid | 0:2 | 16.0 | 2.0 |
| 48588 | 2021-2022 | 2 | 3 | 8/29/21 | 7:30 PM | Real Sociedad B | CF Fuenlabrada | 0:0 | 6.0 | 11.0 |


To make sure the data is clean, we call `dropna` once more so that no null values remain. Then, using NumPy, we create a new `winner` column that encodes the result of each match: 0 for a home win, 1 for a tie, and 2 for an away win.

In [59]:
df = df.dropna()
df['winner'] = df['score'].str.split(':').str[0].astype(int) - df['score'].str.split(':').str[1].astype(int)
df['winner'] = np.where(df['winner'] > 0, 0, np.where(df['winner'] < 0, 2, 1))
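
The encoding can be checked on a few made-up scores. The sketch below repeats the nested `np.where` from the cell above and compares it with a shorter equivalent based on `np.sign`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

scores = pd.Series(["2:1", "0:0", "1:3"])
goals = scores.str.split(":", expand=True).astype(int)
diff = goals[0] - goals[1]  # home goals minus away goals

# Nested np.where, as in the cell above: 0 = home win, 2 = away win, 1 = tie.
nested = np.where(diff > 0, 0, np.where(diff < 0, 2, 1))
# Equivalent shortcut: sign(diff) is +1, 0 or -1, so 1 - sign(diff) is 0, 1 or 2.
compact = (1 - np.sign(diff)).to_numpy()

print(nested.tolist())   # [0, 1, 2]
print(compact.tolist())  # [0, 1, 2]
```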

Numerical data is easier for the model to work with than team names, so we refer to each team in La Liga by a number. First, we collect the unique team names:

In [60]:
teams = [df['home_team'].unique()]

Next, we build a dictionary that maps each team name to the number assigned to it:

In [61]:
teams = teams[0].tolist()
teams_dict = {}
for i in range(len(teams)):
    teams_dict[teams[i]] = i
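
The same mapping can be built in one line with `enumerate`. This is a minimal sketch on a made-up list of home teams. It also shows that `map` returns NaN for a name that is missing from the dictionary ("Unknown FC" is a placeholder).

```python
import pandas as pd

home_teams = pd.Series(["Valencia", "Alavés", "Valencia", "Real Betis"])

# unique() keeps first-appearance order, so each team gets a stable index.
teams_dict_demo = {team: i for i, team in enumerate(home_teams.unique())}
print(teams_dict_demo)  # {'Valencia': 0, 'Alavés': 1, 'Real Betis': 2}

# map() returns NaN for names that are not keys of the dictionary.
codes = pd.Series(["Alavés", "Unknown FC"]).map(teams_dict_demo)
print(codes.isna().tolist())  # [False, True]
```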

Now each team has a number (an index) assigned to it, which we store in two new columns:

In [62]:
df['home_team_number'] = df['home_team'].map(teams_dict)
df['away_team_number'] = df['away_team'].map(teams_dict)

**2- Choosing the training Data for our Random Forest Model**:
In this step, we take all Division 1 matches from the seasons before 2020-2021 as training data and create our model with `RandomForestRegressor` from scikit-learn.

In [63]:
train_data_D1 = df[(df['season'] < '2020-2021') & (df['division'] == 1)] 
random_forest_model_D1= RandomForestRegressor(random_state=100, n_estimators=100)
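
The season filter relies on plain string comparison. A small sketch on made-up rows shows why that works for labels in the "YYYY-YYYY" format:

```python
import pandas as pd

matches = pd.DataFrame({
    "season":   ["1998-1999", "2019-2020", "2020-2021", "2021-2022", "2020-2021"],
    "division": [1, 1, 1, 1, 2],
})

# "YYYY-YYYY" strings sort lexicographically in chronological order,
# so a string comparison selects the earlier seasons.
train = matches[(matches["season"] < "2020-2021") & (matches["division"] == 1)]
test = matches[(matches["season"] == "2020-2021") & (matches["division"] == 1)]

print(train["season"].tolist())  # ['1998-1999', '2019-2020']
print(test["season"].tolist())   # ['2020-2021']
```

Splitting by season keeps every test match later in time than every training match.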

**3- Choosing Target and Features for our model**:
Using numerical data, our features are the home team number, the away team number, the home team ranking, and the away team ranking. Our target is the winner of the match.

In [64]:
X_train_D1= train_data_D1[['home_team_number', 'away_team_number', 'home_team_rank', 'away_team_rank']]
Y_train_D1= train_data_D1['winner']

We fit the model with the `fit()` method:

In [65]:
random_forest_model_D1.fit(X_train_D1, Y_train_D1)

RandomForestRegressor(random_state=100)
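
Since the target has three discrete values, one alternative worth trying is scikit-learn's `RandomForestClassifier`. The sketch below is not what this notebook runs: it swaps in the classifier and fits it on synthetic random data (not the real matches) only to show that the predictions are always valid labels and that class probabilities are available.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins for the four features; NOT the real match data.
X = rng.integers(1, 21, size=(200, 4))
y = rng.integers(0, 3, size=200)  # 0 = home win, 1 = tie, 2 = away win

clf = RandomForestClassifier(n_estimators=100, random_state=100)
clf.fit(X, y)

pred = clf.predict(X[:5])         # each prediction is one of the training labels
proba = clf.predict_proba(X[:5])  # one probability per class, rows sum to 1
print(pred, proba.shape)
```

With a classifier, no conversion from float to integer is needed after `predict`.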

**4-Testing our model**:

In the previous step, we created a Random Forest model and trained it on historical data from the seasons before 2020-2021. Now we test the model on the 2020-2021 matches: we predict the result of each match and compare the predictions with the real results to calculate the accuracy of the model.

In [66]:
# Take a copy so that adding a column later does not write to a slice of df
df_20_21_D1 = df[(df['season'] == '2020-2021') & (df['division'] == 1)].copy()

In [67]:
test_20_21_D1 = df_20_21_D1[['home_team_number', 'away_team_number', 'home_team_rank', 'away_team_rank']]

Here we create a new column that holds the prediction for each match. The regressor returns a float, and `.astype(int)` truncates it to an integer label.

In [68]:
df_20_21_D1['winning_prediction'] = random_forest_model_D1.predict(test_20_21_D1).astype(int)
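
A random forest regressor averages the outputs of its trees, so with targets 0, 1 and 2 every prediction lies between 0 and 2. The sketch below uses made-up prediction values to compare truncation with `.astype(int)` against rounding to the nearest label. It may help explain how often each label can appear.

```python
import numpy as np

raw = np.array([0.40, 0.99, 1.60, 1.99, 2.00])  # example regressor outputs

truncated = raw.astype(int)          # drops the fractional part
rounded = np.rint(raw).astype(int)   # rounds to the nearest label

print(truncated.tolist())  # [0, 0, 1, 1, 2]
print(rounded.tolist())    # [0, 1, 2, 2, 2]
```

With truncation, the label 2 (away win) is produced only when the prediction is exactly 2.0.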

To calculate the accuracy, we compare the `winner` column, which holds the real results of the 2020-2021 matches, with the `winning_prediction` column, which holds the predictions of the model.

Accuracy is the number of rows in which the two columns match, divided by the number of rows (the number of matches in this division).

In [69]:
accuracy_20_21_D1 = (df_20_21_D1['winner'] == df_20_21_D1['winning_prediction']).sum() / len(df_20_21_D1)

print(f'The calculated accuracy of this model  to predict the winner in first division games for the 2020-2021 season is: {round(accuracy_20_21_D1,2)*100}%')

The calculated accuracy of this model  to predict the winner in first division games for the 2020-2021 season is: 42.0%
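
The accuracy formula can be verified on a handful of made-up labels. This sketch computes it by hand and with scikit-learn's `accuracy_score`, which should agree:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

actual = pd.Series([0, 1, 2, 0, 0])
predicted = pd.Series([0, 0, 2, 1, 0])

# Share of positions where prediction and real result agree.
manual = (actual == predicted).sum() / len(actual)
with_sklearn = accuracy_score(actual, predicted)

print(manual, with_sklearn)  # 0.6 0.6
```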


**Training the rest of our dataset**
At the beginning, we trained a model on Division 1 matches from before the 2020-2021 season and tested it on the Division 1 matches of 2020-2021. Now we repeat the process for the second division, so that we have predictions and an accuracy figure for the whole dataset.

We use the same method as for Division 1: we take all Division 2 matches from before 2020-2021 and create the model with `RandomForestRegressor` from scikit-learn.

In [70]:
train_data_D2 = df[(df['season'] < '2020-2021') & (df['division'] == 2)] 
random_forest_model_D2 = RandomForestRegressor(random_state=100, n_estimators=100)

We create the training data with the same columns as before: the home and away team numbers, which stand for the team names, and their rankings. The target is again the `winner` column.

In [71]:
X_train_D2 = train_data_D2[['home_team_number', 'away_team_number', 'home_team_rank', 'away_team_rank']]
Y_train_D2 = train_data_D2['winner']

We fit the model on the training data:

In [72]:
random_forest_model_D2.fit(X_train_D2, Y_train_D2)

RandomForestRegressor(random_state=100)

As we did for the first division, we test the model on the Division 2 matches of 2020-2021, compare the real results with the predictions, and calculate the accuracy as the number of correct predictions divided by the number of Division 2 matches.

In [73]:
# Take a copy so that adding a column later does not write to a slice of df
df_20_21_D2 = df[(df['season'] == '2020-2021') & (df['division'] == 2)].copy()
test_20_21_D2 = df_20_21_D2[['home_team_number', 'away_team_number', 'home_team_rank', 'away_team_rank']]
df_20_21_D2['winning_prediction'] = random_forest_model_D2.predict(test_20_21_D2).astype(int)
accuracy_20_21_D2 = (df_20_21_D2['winner'] == df_20_21_D2['winning_prediction']).sum() / len(df_20_21_D2)

print(f'The calculated accuracy of this model  to predict the winner in second division games for the 2020-2021 season is: {round(accuracy_20_21_D2,2)*100}%')

The calculated accuracy of this model  to predict the winner in second division games for the 2020-2021 season is: 44.0%


**5- Calculating the accuracy of the model for the whole dataset:**

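Since we have calculated the accuracy for each division, we take the overall accuracy as the mean of the two values. This is an unweighted mean, so it equals the accuracy over all matches only if both divisions have the same number of matches.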

In [74]:
accuracy_20_21 = (accuracy_20_21_D1 + accuracy_20_21_D2)/2
print(f"Total model accuracy: {round(accuracy_20_21,2)*100}")

Total model accuracy: 43.0
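
To see how much the weighting matters, the sketch below compares the unweighted mean with a mean weighted by the number of matches. The match counts are assumptions (380 and 462, the sizes of full double round-robin seasons with 20 and 22 teams), not values read from the data frame, and the accuracies are the rounded figures printed above.

```python
acc_d1, acc_d2 = 0.42, 0.44  # rounded accuracies printed above
n_d1, n_d2 = 380, 462        # ASSUMED match counts, not taken from the data

unweighted = (acc_d1 + acc_d2) / 2
weighted = (acc_d1 * n_d1 + acc_d2 * n_d2) / (n_d1 + n_d2)

print(round(unweighted, 4))  # 0.43
print(round(weighted, 4))    # 0.431
```

Under these assumptions the two figures differ only in the third decimal place.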


**6- Good practice:**

One way to check that the model works is to call its `predict` method on a single matchday. To do so, we created a function called `Match_result_prediction`. It takes a division and a matchday, selects the corresponding 2020-2021 matches, and uses the home and away team numbers and their rankings as features.

The next step shows an example of the predictions this function prints.

In [75]:
def Match_result_prediction(division, matchday):  
    season = '2020-2021' 
    
    # Take a copy so that the new column is not written to a slice of df
    data = df[(df['season'] == season) & (df['division'] == division) & (df['matchday'] == matchday)].copy()
    features = data[['home_team_number', 'away_team_number', 'home_team_rank', 'away_team_rank']]
    
    if(division == 1): 
        data['winning_prediction'] = random_forest_model_D1.predict(features).astype(int)
    
    elif(division == 2): 
        data['winning_prediction'] = random_forest_model_D2.predict(features).astype(int)

    data['winning_prediction'] = data.apply(lambda x: 'X' if x['winning_prediction'] == 1 else 1 if x['winning_prediction'] == 0 else x['winning_prediction'], axis = 1)
    
    for index, row in data.iterrows():
        print(f"{row['home_team']}  vs  {row['away_team']} --> {row['winning_prediction']} ")

**7- Example of the prediction function**:

Here the function shows the predicted results for the first division on matchday 38:

In [76]:
Match_result_prediction(1, 38)

Levante  vs  Cádiz CF --> 1 
Celta de Vigo  vs  Real Betis --> 1 
SD Eibar  vs  Barcelona --> X 
Elche CF  vs  Athletic --> 1 
SD Huesca  vs  Valencia --> X 
CA Osasuna  vs  Real Sociedad --> X 
Real Madrid  vs  Villarreal --> 1 
Real Valladolid  vs  Atlético Madrid --> X 
Granada CF  vs  Getafe --> 1 
Sevilla FC  vs  Alavés --> 1 
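
The conversion from numeric codes to the symbols printed above can also be written with `Series.map`. This is a minimal sketch on made-up codes. Unlike the function above, which mixes the integers 1 and 2 with the string 'X', it returns strings for all three outcomes.

```python
import pandas as pd

codes = pd.Series([0, 1, 2, 0])
# 0 = home win, 1 = tie, 2 = away win, as encoded in the `winner` column.
symbols = codes.map({0: "1", 1: "X", 2: "2"})

print(symbols.tolist())  # ['1', 'X', '2', '1']
```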


Having tested our predictions on each division separately, we now generate the predictions for all 2020-2021 matches in both divisions. Since we have the actual results, we can compare them with the predictions and look at the accuracy of the model over the whole season.

In [77]:
# Take a copy so that the new column is not written to a slice of df
df_20_21 = df[df['season'] == '2020-2021'].copy()
feature_cols = ['home_team_number', 'away_team_number', 'home_team_rank', 'away_team_rank']
is_D1 = df_20_21['division'] == 1
# Predict each division in one call with its own model instead of row by row
df_20_21.loc[is_D1, 'winning_prediction'] = random_forest_model_D1.predict(df_20_21.loc[is_D1, feature_cols])
df_20_21.loc[~is_D1, 'winning_prediction'] = random_forest_model_D2.predict(df_20_21.loc[~is_D1, feature_cols])
df_20_21['winning_prediction'] = df_20_21['winning_prediction'].astype(int)

**8- Feature Selection Organised:**

Since we do not need all the columns of the data frame, we create a new data frame that contains only the following columns:
 
"season, division, matchday, home_team, away_team, winner, winning_prediction".

We then add a new column called `accuracy`, which is True when the `winner` and `winning_prediction` columns agree for that match and False otherwise.

In [78]:
df_20_21 = df_20_21[['season', 'division', 'matchday', 'home_team', 'away_team', 'winner', 'winning_prediction']]
df_20_21['accuracy'] = df_20_21['winner'] == df_20_21['winning_prediction']
df_20_21

| | season | division | matchday | home_team | away_team | winner | winning_prediction | accuracy |
|---|---|---|---|---|---|---|---|---|
| 25678 | 2020-2021 | 1 | 1 | SD Eibar | Celta de Vigo | 1 | 0 | False |
| 25679 | 2020-2021 | 1 | 1 | Granada CF | Athletic | 0 | 0 | True |
| 25680 | 2020-2021 | 1 | 1 | Cádiz CF | CA Osasuna | 2 | 1 | False |
| 25681 | 2020-2021 | 1 | 1 | Alavés | Real Betis | 2 | 1 | False |
| 25682 | 2020-2021 | 1 | 1 | Real Valladolid | Real Sociedad | 1 | 0 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48173 | 2020-2021 | 2 | 42 | CD Mirandés | CE Sabadell | 2 | 0 | False |
| 48174 | 2020-2021 | 2 | 42 | Ponferradina | RCD Mallorca | 1 | 0 | False |
| 48175 | 2020-2021 | 2 | 42 | Rayo Vallecano | CD Lugo | 2 | 0 | False |
| 48176 | 2020-2021 | 2 | 42 | Real Zaragoza | CD Leganés | 2 | 1 | False |
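
A boolean column like `accuracy` can be summarized directly, because the mean of True/False values is the share of True. The sketch below uses a made-up five-row frame with the same column names to get the overall and the per-division accuracy:

```python
import pandas as pd

results = pd.DataFrame({
    "division": [1, 1, 1, 2, 2],
    "winner": [0, 1, 2, 0, 2],
    "winning_prediction": [0, 0, 2, 0, 1],
})
results["accuracy"] = results["winner"] == results["winning_prediction"]

# The mean of a boolean column is the share of True values.
overall = results["accuracy"].mean()
per_division = results.groupby("division")["accuracy"].mean()

print(overall)  # 0.6
print(per_division)
```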


**9- Difference and accuracy between real data and our model:**

There are three possible results: the home team wins, the away team wins, or there is a tie. For each of these outcomes, we calculate how closely our predictions match the real data, to judge whether the model fits well.

**home_team winning accuracy prediction:**

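First, we calculate how well the model predicts home wins: we divide the number of correctly predicted home wins by the number of actual home wins. This ratio is the recall for the home-win class. The cell also computes `prediction_accuracy_home_win`, which is the number of predicted home wins relative to the number of actual home wins.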

In [79]:
prediction_win_home= df_20_21['winning_prediction'].value_counts()[0]
real_win_home = df_20_21['winner'].value_counts()[0]
accurate_win_home = df_20_21[(df_20_21['winner'] == 0) & (df_20_21['winning_prediction'] == 0)]['winning_prediction'].count()
prediction_accuracy_home_win = (prediction_win_home / real_win_home) * 100
real_accuracy_home_win = (accurate_win_home/ real_win_home) * 100

**Accuracy prediction of having a Tie in the game**:
We count the total number of ties predicted and the number of ties that actually happened during the season, and then calculate the percentage of actual ties that the model predicted correctly.

In [80]:
predcition_tie = df_20_21['winning_prediction'].value_counts()[1]
real_tie = df_20_21['winner'].value_counts()[1]
accurate_tie = df_20_21[(df_20_21['winner'] == 1) & (df_20_21['winning_prediction'] == 1)]['winning_prediction'].count()
prediction_accuracy_tie = (predcition_tie / real_tie) * 100
real_accuracy_tie = (accurate_tie / real_tie) * 100

**Accuracy prediction if away_team wins**:
Here we count the rows in which the away team actually won and the rows in which the model predicted an away win, taken from the `winning_prediction` column. We then calculate the percentage of actual away wins that the model predicted correctly.


In [81]:
prediction_win_away = df_20_21['winning_prediction'].value_counts()[2]
real_win_away = df_20_21['winner'].value_counts()[2]
accurate_win_away = df_20_21[(df_20_21['winner'] == 2) & (df_20_21['winning_prediction'] == 2)]['winning_prediction'].count()
prediction_accuracy_away_win = (prediction_win_away / real_win_away) * 100
real_accuracy_away_win = (accurate_win_away / real_win_away) * 100
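
The per-outcome figures computed above correspond to the recall of each class. As a hedged cross-check, the sketch below computes recall, precision, and the confusion matrix with scikit-learn on made-up labels (not the notebook's data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

actual = np.array([0, 0, 0, 1, 1, 2, 2, 2])
predicted = np.array([0, 0, 1, 0, 1, 0, 1, 2])
labels = [0, 1, 2]  # 0 = home win, 1 = tie, 2 = away win

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(actual, predicted, labels=labels)
# Recall per class: correct predictions / actual occurrences of the class.
recall = recall_score(actual, predicted, average=None, labels=labels)
# Precision per class: correct predictions / times the class was predicted.
precision = precision_score(actual, predicted, average=None, labels=labels)

print(cm)
print(recall.round(2), precision.round(2))
```

Reading recall and precision together shows whether a class scores well because the model recognizes it or simply because the model predicts it very often.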

**10- Accuracy of our model in terms of predictions**:

After comparing the real results with the predictions of our Random Forest model, we can report the overall accuracy together with the accuracy for each outcome (home win, away win, and tie), which we consider an acceptable result.

In [82]:
print(f"Accuracy prediction in case Home team wins : {round(real_accuracy_home_win, 2)}%")
print(f"Accuracy prediction in case away team wins : {round(real_accuracy_away_win, 2)}%")
print(f"Accuracy prediction in case there is a tie : {round(real_accuracy_tie, 2)}%")

print(f"Final Accuracy of the model in total: {round(accuracy_20_21,2)*100}%")

Accuracy prediction in case Home team wins : 80.11%
Accuracy prediction in case away team wins : 0.41%
Accuracy prediction in case there is a tie : 31.54%
Final Accuracy of the model in total: 43.0%


**Results and Conclusions:**

Looking back, treating match prediction as a three-outcome problem and modeling it with a Random Forest gave us an overall accuracy of about 43%. Broken down by outcome, the model correctly predicted about 80% of home wins and about 32% of ties, but only 0.41% of away wins. We consider this a good result for a topic as challenging as football matches, although the away-win figure leaves clear room for improvement.