# Project Title: La Liga Match Prediction

# Project Description:

The project predicts the outcome of football matches in the Spanish La Liga. The dataset contains La Liga match results from the 1995-96 season into the 2023-24 season, with the following columns:

- Season
- Date
- HomeTeam
- AwayTeam
- FTHG [full-time home goals]
- FTAG [full-time away goals]
- FTR [full-time result]
- HTHG [half-time home goals]
- HTAG [half-time away goals]
- HTR [half-time result]

The project uses the following algorithms to predict the outcome of a match:

- K-Nearest Neighbors (KNN)
- Support Vector Machine (SVM)
- Naive Bayes
- Decision Tree
    



The project uses the following metrics to evaluate the performance of the algorithms:

- Accuracy
- Precision
- Confusion Matrix
- F1 Score




The project uses the following techniques:

- Data Cleaning
- Data Preprocessing
- Data Visualization
- Data Analysis
- Data Modeling
- Data Evaluation
- Data Interpretation
 



## Team Members:

- M A Moontasir Abtahee, 19301150
- Maisha Jarin, 20101125
- Bless Peter Biswas, 20101429
- Tanzina Binte Azad, 20201217



Importing Libraries

In [89]:
import pandas as pd

import IPython
import chardet

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.metrics import accuracy_score

# Support Vector Machine (SVM)
from sklearn.svm import SVC

# Naive Bayes
from sklearn.naive_bayes import GaussianNB

# Decision Tree
from sklearn.tree import DecisionTreeClassifier


Importing Data and Converting it to a DataFrame

In [90]:

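path = 'LaLiga_Matches.csv'

# Sniff the encoding from the first 100,000 bytes; chardet returns a dict
# with 'encoding' and 'confidence' keys
with open(path, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))

# The encoding is passed explicitly; `result` is kept for inspection only
matches = pd.read_csv(path, encoding='Windows-1252')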


Data Attributes

- 'Season'
- 'Date'
- 'HomeTeam'
- 'AwayTeam'
- 'FTHG' [full-time home goals]
- 'FTAG' [full-time away goals]
- 'FTR' [full-time result]
- 'HTHG' [half-time home goals]
- 'HTAG' [half-time away goals]
- 'HTR' [half-time result]


In [91]:
list(matches.columns)

['Season',
 'Date',
 'HomeTeam',
 'AwayTeam',
 'FTHG',
 'FTAG',
 'FTR',
 'HTHG',
 'HTAG',
 'HTR']

Data Cleaning and Preprocessing

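Points are assigned per match: the winning side (home or away) gets 3 in its points column, a tie is marked with 1 in tie_points, and all remaining cells are set to 0.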

In [92]:
matches.loc[matches.FTHG > matches.FTAG, "home_team_points"] = 3
matches.loc[matches.FTHG < matches.FTAG, "away_team_points"] = 3
matches.loc[matches.FTHG == matches.FTAG, "tie_points"] = 1

# Filling the NaN values with 0 from home_team_points, away_team_points and tie_points
matches[['home_team_points', 'away_team_points', 'tie_points']] = matches[['home_team_points', 'away_team_points', 'tie_points']].fillna(0)
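The point assignment can be checked on a small hand-made frame. The sketch below is only an illustration: the scores are toy values, not rows from the dataset, and `numpy.where` is swapped in as an alternative to the `.loc` assignments followed by `fillna`. It yields integer columns rather than the float columns produced above.

```python
import numpy as np
import pandas as pd

# Toy scores (hypothetical, not taken from LaLiga_Matches.csv)
toy = pd.DataFrame({"FTHG": [3, 0, 1], "FTAG": [0, 1, 1]})

# 3 points for the winning side, 1 marks a tie, 0 everywhere else
toy["home_team_points"] = np.where(toy.FTHG > toy.FTAG, 3, 0)
toy["away_team_points"] = np.where(toy.FTHG < toy.FTAG, 3, 0)
toy["tie_points"] = np.where(toy.FTHG == toy.FTAG, 1, 0)

print(toy)
```

Because every cell is written in one pass, no `NaN` values appear and no fill step is needed.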


In [93]:
# Building one per-team points table for each result type, all with the same column layout

Home_win = matches.loc[matches.FTHG > matches.FTAG][['Date','Season','HomeTeam','home_team_points']]
Away_win = matches.loc[matches.FTHG < matches.FTAG][['Date','Season','AwayTeam','away_team_points']]
Home_tie = matches.loc[matches.FTHG == matches.FTAG][['Date','Season','HomeTeam','tie_points']]
Away_tie = matches.loc[matches.FTHG == matches.FTAG][['Date','Season','AwayTeam','tie_points']]

Home_win.columns = ['date','Season', 'team', 'points']
Away_win.columns = ['date','Season', 'team', 'points']
Home_tie.columns = ['date','Season', 'team', 'points']
Away_tie.columns = ['date','Season', 'team', 'points']
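The notebook does not show how these four frames are used afterwards. One plausible use, and this is an assumption, is to stack them into a single long table and total the points per team and season. The sketch below uses toy rows and hypothetical names (`points_long`, `season_totals`).

```python
import pandas as pd

# Toy frames with the same four columns as Home_win, Away_win, Home_tie, Away_tie
cols = ["date", "Season", "team", "points"]
home_win = pd.DataFrame([["02-09-1995", "1995-96", "La Coruna", 3.0]], columns=cols)
away_win = pd.DataFrame([["03-09-1995", "1995-96", "Compostela", 3.0]], columns=cols)
home_tie = pd.DataFrame([["10-09-1995", "1995-96", "La Coruna", 1.0]], columns=cols)
away_tie = pd.DataFrame([["10-09-1995", "1995-96", "Compostela", 1.0]], columns=cols)

# Stack into one long table: one row per team per match in which it earned points
points_long = pd.concat([home_win, away_win, home_tie, away_tie], ignore_index=True)

# Total points per team and season
season_totals = points_long.groupby(["Season", "team"], as_index=False)["points"].sum()
print(season_totals)
```

Matches a team lost contribute no row in this layout, which is fine for totals but would need explicit zero rows if per-match averages were wanted.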

In [94]:
matches.shape

(10883, 13)

In [95]:
iframe = "<iframe src='https://public.flourish.studio/visualisation/6430207/' style='width:100%;height:600px;'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/visualisation/6430207/' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'> </a></div>"
IPython.display.HTML(iframe)

In [96]:
matches=matches.drop(['HTHG','HTAG'],axis=1)


In [97]:
matches['Season'].unique()


array(['1995-96', '1996-97', '1997-98', '1998-99', '1999-2000', '2000-01',
       '2001-02', '2002-03', '2003-04', '2004-05', '2005-06', '2006-07',
       '2007-08', '2008-09', '2009-10', '2010-11', '2011-12', '2012-13',
       '2013-14', '2014-15', '2015-16', '2016-17', '2017-18', '2018-19',
       '2019-20', '2020-21', '2021-22', '2022-23', '2023-24'],
      dtype=object)

In [98]:
matches.head()

Unnamed: 0,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTR,home_team_points,away_team_points,tie_points
0,1995-96,02-09-1995,La Coruna,Valencia,3,0,H,H,3.0,0.0,0.0
1,1995-96,02-09-1995,Sp Gijon,Albacete,3,0,H,H,3.0,0.0,0.0
2,1995-96,03-09-1995,Ath Bilbao,Santander,4,0,H,H,3.0,0.0,0.0
3,1995-96,03-09-1995,Ath Madrid,Sociedad,4,1,H,D,3.0,0.0,0.0
4,1995-96,03-09-1995,Celta,Compostela,0,1,A,D,0.0,3.0,0.0


In [99]:
matches[matches.isnull().any(axis=1)]

Unnamed: 0,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTR,home_team_points,away_team_points,tie_points
136,1995-96,19-11-1995,Ath Bilbao,La Coruna,1,0,H,,3.0,0.0,0.0
1472,1998-99,10-01-1999,Valladolid,Betis,0,3,A,,0.0,3.0,0.0


In [100]:

matches.loc[136, 'HTR'] = 'D'
matches.loc[1472, 'HTR'] = 'D'

In [101]:
#finding the null values
matches.isnull().sum()

Season              0
Date                0
HomeTeam            0
AwayTeam            0
FTHG                0
FTAG                0
FTR                 0
HTR                 0
home_team_points    0
away_team_points    0
tie_points          0
dtype: int64

In [102]:
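# Points earned by each side: win = 3, draw = 1, loss = 0
matches['resultHome'] = matches['FTR'].map({'H':3,'A':0,'D':1})
matches['resultAway'] = matches['FTR'].map({'H':0,'A':3,'D':1})
# Sum of both sides' points: 3 for a decisive match (home or away win), 2 for a draw
matches['result'] = matches['resultHome'] + matches['resultAway']
matches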

Unnamed: 0,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTR,home_team_points,away_team_points,tie_points,resultHome,resultAway,result
0,1995-96,02-09-1995,La Coruna,Valencia,3,0,H,H,3.0,0.0,0.0,3,0,3
1,1995-96,02-09-1995,Sp Gijon,Albacete,3,0,H,H,3.0,0.0,0.0,3,0,3
2,1995-96,03-09-1995,Ath Bilbao,Santander,4,0,H,H,3.0,0.0,0.0,3,0,3
3,1995-96,03-09-1995,Ath Madrid,Sociedad,4,1,H,D,3.0,0.0,0.0,3,0,3
4,1995-96,03-09-1995,Celta,Compostela,0,1,A,D,0.0,3.0,0.0,0,3,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10878,2023-24,01-10-2023,Almeria,Granada,3,3,D,H,0.0,0.0,1.0,1,1,2
10879,2023-24,01-10-2023,Alaves,Osasuna,0,2,A,A,0.0,3.0,0.0,0,3,3
10880,2023-24,01-10-2023,Ath Madrid,Cadiz,3,2,H,A,3.0,0.0,0.0,3,0,3
10881,2023-24,01-10-2023,Betis,Valencia,3,0,H,H,3.0,0.0,0.0,3,0,3
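To see which values `result` can take, the same mapping can be applied to one row per outcome. This is a minimal sketch on a three-row toy frame. It shows that both kinds of win collapse to 3 and a draw becomes 2, so the target separates draws from decisive matches rather than home wins from away wins.

```python
import pandas as pd

# One toy row per possible full-time result
toy = pd.DataFrame({"FTR": ["H", "A", "D"]})

toy["resultHome"] = toy["FTR"].map({"H": 3, "A": 0, "D": 1})
toy["resultAway"] = toy["FTR"].map({"H": 0, "A": 3, "D": 1})
toy["result"] = toy["resultHome"] + toy["resultAway"]

print(toy)
```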


In [103]:

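# Target: 3 = decisive match (home or away win), 2 = draw
y = matches['result']

# Features are the full-time goal counts, which by construction determine `result`;
# stratify=y keeps the draw/win ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(matches[['FTHG', 'FTAG']], y,
                                                    test_size=0.2,
                                                    random_state=0,
                                                    stratify=y)
print(X_train)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)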

      FTHG  FTAG
1209     2     1
2543     0     2
581      1     2
613      4     2
9635     0     1
...    ...   ...
5403     2     3
4038     5     2
872      0     2
2702     0     1
4356     3     0

[8706 rows x 2 columns]
(8706, 2) (8706,)
(2177, 2) (2177,)


In [104]:
X_train.shape

(8706, 2)

In [105]:
X_train

Unnamed: 0,FTHG,FTAG
1209,2,1
2543,0,2
581,1,2
613,4,2
9635,0,1
...,...,...
5403,2,3
4038,5,2
872,0,2
2702,0,1


In [106]:
X_test.shape

(2177, 2)

In [107]:
X_test

Unnamed: 0,FTHG,FTAG
8245,1,1
4511,1,3
10331,0,0
1194,0,0
9070,1,1
...,...,...
5326,1,1
632,1,1
1932,1,1
1867,0,2


In [108]:
y_train.shape

(8706,)

In [109]:
y_test.shape

(2177,)

In [110]:
# Rows a 25% hold-out would contain; the split above uses test_size=0.2 (2177 test rows)
total_data_count = matches.shape[0]
round(total_data_count * 0.25)

2721

In [111]:
matches.dtypes

Season               object
Date                 object
HomeTeam             object
AwayTeam             object
FTHG                  int64
FTAG                  int64
FTR                  object
HTR                  object
home_team_points    float64
away_team_points    float64
tie_points          float64
resultHome            int64
resultAway            int64
result                int64
dtype: object

In [112]:
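# range(3) checks labels 0, 1 and 2 only; `result` takes the values 2 (draw)
# and 3 (win), so label 3 is not counted by this loop
for i in range(3):
    print("Class -",i,":",list(y_train).count(i))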

Class - 0 : 0
Class - 1 : 0
Class - 2 : 2225


In [113]:
# As above, only labels 0, 1 and 2 are checked; label 3 is not counted
for i in range(3):
    print("Class -",i,":",list(y_test).count(i))

Class - 0 : 0
Class - 1 : 0
Class - 2 : 556


In [114]:
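# `result` has two unique values (2 and 3), so range(uni) checks labels 0 and 1,
# neither of which occurs in the data
uni=len(matches['result'].unique())
for i in range(uni):
    print("Class -",i,":",list(matches['result']).count(i))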

Class - 0 : 0
Class - 1 : 0
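Counting with `range(...)` assumes the labels start at 0, which is not the case for `result`. A sketch of an alternative is `value_counts`, which counts whatever labels are actually present. The series below is a toy stand-in, not the real target column.

```python
import pandas as pd

# Toy stand-in for y_train / y_test / matches['result']
y_toy = pd.Series([3, 2, 3, 3, 2, 3], name="result")

# Count each label that actually occurs, in ascending label order
counts = y_toy.value_counts().sort_index()
for label, n in counts.items():
    print("Class -", label, ":", n)
```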


In [115]:
print("per-feature minimum before scaling:\n {}".format(X_train.min(axis=0)))
print("per-feature maximum before scaling:\n {}".format(X_train.max(axis=0)))

per-feature minimum before scaling:
 FTHG    0
FTAG    0
dtype: int64
per-feature maximum before scaling:
 FTHG    8
FTAG    8
dtype: int64


In [116]:
knn=KNeighborsClassifier()

knn.fit(X_train, y_train)

print("Test set accuracy: {:.2f}".format(knn.score(X_test, y_test)))

Test set accuracy: 1.00


In [117]:
preds=knn.predict(X_test)
# acc=accuracy_score(y_test,preds)
combined=pd.DataFrame(dict(actual=y_test,predictions=preds),index=X_test.index)
# pd.crosstab(index=combined["actual"],columns=combined["predictions"])

In [118]:
combined


Unnamed: 0,actual,predictions
8245,2,2
4511,3,3
10331,2,2
1194,2,2
9070,2,2
...,...,...
5326,2,2
632,2,2
1932,2,2
1867,3,3


In [119]:
from sklearn.metrics import precision_score
precision_score(y_test,preds,average='weighted')


0.9995414769555061
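F1 score is listed among the project metrics and is computed the same way as precision. The sketch below uses hand-made labels, not the notebook's predictions, and the variable names are illustrative.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels: one draw (2) is misclassified as a win (3)
y_true = [2, 2, 3, 3, 3, 3]
y_pred = [2, 3, 3, 3, 3, 3]

# 'weighted' averages the per-class scores by the number of true samples per class
p = precision_score(y_true, y_pred, average="weighted")
r = recall_score(y_true, y_pred, average="weighted")
f1 = f1_score(y_true, y_pred, average="weighted")

print("precision={:.3f} recall={:.3f} f1={:.3f}".format(p, r, f1))
```

In the notebook the same call would take `y_test` and `preds` in place of the toy lists.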

In [120]:
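def rolling_averages(group, cols, new_cols):
    """Add rolling means of `cols` over the previous 3 rows as `new_cols`."""
    # Expects a lowercase "date" column, as in the Home_win/Away_win frames
    group = group.sort_values("date")
    # closed='left' excludes the current row, so only earlier matches are averaged
    rolling_stat = group[cols].rolling(3, closed='left').mean()
    group[new_cols] = rolling_stat
    # The first three rows have no full window of earlier matches and are dropped
    group = group.dropna(subset=new_cols)
    return group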
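`rolling_averages` is defined but not called in the cells shown. A minimal usage sketch follows, assuming a long-format table with one row per team per match. The team rows, the `points_rolling` column name, and the per-team loop are illustrative assumptions. Dates are parsed with `pd.to_datetime` so that sorting is chronological rather than alphabetical.

```python
import pandas as pd

def rolling_averages(group, cols, new_cols):
    # Same logic as the function above, repeated so the example runs on its own
    group = group.sort_values("date")
    rolling_stat = group[cols].rolling(3, closed="left").mean()
    group[new_cols] = rolling_stat
    return group.dropna(subset=new_cols)

# Toy long-format table (hypothetical values)
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-08-12", "2023-08-19", "2023-08-26", "2023-09-02", "2023-09-16"]
    ),
    "team": ["Betis"] * 5,
    "points": [3, 0, 1, 3, 3],
})

# Apply per team so that windows never mix matches of different teams
rolled = pd.concat(
    [rolling_averages(g, ["points"], ["points_rolling"]) for _, g in toy.groupby("team")]
)
print(rolled)
```

Only the fourth and fifth matches survive, since each needs three earlier matches to average over.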

In [121]:

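scaler1 = MinMaxScaler()

# Fit on the training data only; the second argument of fit() is y and is ignored
scaler1.fit(X_train)
# transform training data
X_train_scaled = scaler1.transform(X_train)
# transform test data with the training minimum and maximum
X_test_scaled = scaler1.transform(X_test)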


In [122]:
print("per-feature minimum after scaling:\n {}".format(
    X_train_scaled.min(axis=0)))
print("per-feature maximum after scaling:\n {}".format(
    X_train_scaled.max(axis=0)))

per-feature minimum after scaling:
 [0. 0.]
per-feature maximum after scaling:
 [1. 1.]
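`MinMaxScaler` maps each feature to (x - min) / (max - min), using the minimum and maximum seen during `fit`. A small sketch with made-up numbers shows the consequence of fitting on training data only: a test value above the training maximum scales to more than 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy goal counts: training range is 0..8 for both features
train = np.array([[0.0, 0.0], [8.0, 8.0], [2.0, 1.0]])
test = np.array([[4.0, 10.0]])  # second value exceeds the training maximum

scaler = MinMaxScaler()
scaler.fit(train)  # learns per-feature min and max from the training rows only

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)
print(test_scaled)
```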


In [123]:
#train
knn.fit(X_train_scaled, y_train)

# scoring on the scaled test set
print("Scaled test set accuracy: {:.2f}".format(
    knn.score(X_test_scaled, y_test)))

#another approach
preds2=knn.predict(X_test_scaled)
acc2=accuracy_score(y_test,preds2)
# combined=pd.DataFrame(dict(actual=y_test,predictions=preds2))
# pd.crosstab(index=combined["actual"],columns=combined["predictions"])


Scaled test set accuracy: 1.00


In [124]:
acc2

0.9995406522737712

In [125]:
preds2

array([2, 3, 2, ..., 2, 3, 3], dtype=int64)

Visualizing the KNN

In [126]:
from sklearn.metrics import confusion_matrix
pred = knn.predict(X_test_scaled)
print(confusion_matrix(y_test, pred))


[[ 556    0]
 [   1 1620]]
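The accuracy and weighted precision reported earlier can be recovered from this matrix by hand. In scikit-learn's layout, rows are actual classes and columns are predicted classes, here in the order 2 (draw), 3 (win). The sketch below only redoes the arithmetic with NumPy. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows = actual, columns = predicted
cm = np.array([[556, 0],
               [1, 1620]])

# Accuracy: correct predictions (diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()

support = cm.sum(axis=1)    # actual samples per class
predicted = cm.sum(axis=0)  # predicted samples per class

# Per-class precision, then averaged with the class supports as weights
precision_per_class = np.diag(cm) / predicted
weighted_precision = (precision_per_class * support).sum() / support.sum()

print(accuracy, weighted_precision)
```

The single off-diagonal entry is one decisive match predicted as a draw, which accounts for both figures falling just short of 1.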


Support Vector Machine (SVM)

In [127]:

# Linear kernel
svm = SVC(kernel='linear', C=1, random_state=0)
svm.fit(X_train_scaled, y_train)
print('Accuracy of SVM classifier on training set: {:.2f}'
      .format(svm.score(X_train_scaled, y_train)))

print('Accuracy of SVM classifier on test set: {:.2f}'
      .format(svm.score(X_test_scaled, y_test)))

# RBF kernel
svm2 = SVC(kernel='rbf', C=1, random_state=0)
svm2.fit(X_train_scaled, y_train)
print('Accuracy of SVM classifier on training set: {:.2f}'
      .format(svm2.score(X_train_scaled, y_train)))

print('Accuracy of SVM classifier on test set: {:.2f}'
      .format(svm2.score(X_test_scaled, y_test)))

# Polynomial kernel
svm3 = SVC(kernel='poly', C=1, random_state=0)
svm3.fit(X_train_scaled, y_train)
print('Accuracy of SVM classifier on training set: {:.2f}'
      .format(svm3.score(X_train_scaled, y_train)))

print('Accuracy of SVM classifier on test set: {:.2f}'
      .format(svm3.score(X_test_scaled, y_test)))


Accuracy of SVM classifier on training set: 0.74
Accuracy of SVM classifier on test set: 0.74
Accuracy of SVM classifier on training set: 1.00
Accuracy of SVM classifier on test set: 1.00
Accuracy of SVM classifier on training set: 1.00
Accuracy of SVM classifier on test set: 1.00


Visualizing the SVM

In [128]:
from sklearn.metrics import confusion_matrix
pred = svm.predict(X_test_scaled)
print(confusion_matrix(y_test, pred))


[[   0  556]
 [   0 1621]]



Naive Bayes

In [129]:
nb = GaussianNB()
nb.fit(X_train_scaled, y_train)
print('Accuracy of Naive Bayes classifier on training set: {:.2f}'
     .format(nb.score(X_train_scaled, y_train)))
print('Accuracy of Naive Bayes classifier on test set: {:.2f}'
        .format(nb.score(X_test_scaled, y_test)))


Accuracy of Naive Bayes classifier on training set: 0.74
Accuracy of Naive Bayes classifier on test set: 0.74


Visualizing the Naive Bayes

In [130]:
from sklearn.metrics import confusion_matrix
pred = nb.predict(X_test_scaled)
print(confusion_matrix(y_test, pred))

[[   0  556]
 [   0 1621]]
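A matrix with an all-zero first column means every test sample was predicted as class 3, so the 0.74 accuracy equals the share of the majority class. The sketch below reproduces that figure with scikit-learn's `DummyClassifier`, using only the class counts from the matrix rows (556 and 1621). The feature array is a placeholder, since the dummy model ignores its inputs.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the confusion-matrix row totals: 556 draws, 1621 wins
y_counts = np.array([2] * 556 + [3] * 1621)
X_dummy = np.zeros((len(y_counts), 1))  # placeholder features, ignored by the model

# Always predicts the most frequent label seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_counts)

baseline_acc = baseline.score(X_dummy, y_counts)
print("Majority-class accuracy: {:.2f}".format(baseline_acc))
```

A model scoring at this level has learned nothing beyond the class balance, which is a useful reference point when comparing classifiers.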


Decision Tree

In [131]:
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train_scaled, y_train)
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(tree.score(X_train_scaled, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
        .format(tree.score(X_test_scaled, y_test)))

Accuracy of Decision Tree classifier on training set: 1.00
Accuracy of Decision Tree classifier on test set: 1.00
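The per-model cells all follow the same fit-then-score pattern, which can be condensed into a loop. The sketch below runs on synthetic goal counts rather than the La Liga data. The label rule (2 when the scores are equal, 3 otherwise) mirrors `result`, and the `models` and `scores` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic goal counts (0..5 per side); label 2 = draw, 3 = decisive match
rng = np.random.default_rng(0)
X = rng.integers(0, 6, size=(2000, 2))
y = np.where(X[:, 0] == X[:, 1], 2, 3)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = {
    "KNN": KNeighborsClassifier(),
    "SVM (rbf)": SVC(kernel="rbf", C=1, random_state=0),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Fit each model on the same split and record its test accuracy
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)
    print("{}: {:.2f}".format(name, scores[name]))
```

Keeping the split fixed across models makes the accuracies directly comparable.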
