# Introduction

**Author:** Jemima Nhyira Antwi

**Assignment 2**

## Description
In sports prediction, many factors, including the historical performance of the teams, match results, and player data, have to be accounted for to help different stakeholders understand the odds of winning or losing.


The specific tasks given are:
1. Demonstrate the data preparation and feature extraction process.
2. Create feature subsets that show maximum correlation with the dependent variable.
3. Create and train a suitable machine learning model with cross-validation that can predict a player's rating.
4. Measure the model's performance and fine-tune it as a process of optimization.
5. Use the data from another season (players_22), which was not used during training, to test how good the model is.
6. Deploy the model on a simple web page using either Heroku, Streamlit, or Flask, and upload a video that shows how the model performs on the web page/site.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Imports and Data Loading

This section of the notebook installs and imports the required libraries and loads the datasets.

In [None]:
!pip install pandas numpy matplotlib seaborn xgboost scikit-learn



In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
import pickle

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

We now have the libraries we need, so let's load our datasets.

In [None]:
#loading datasets
males_legacy_df = pd.read_csv("/content/drive/MyDrive/Intro to ai/male_players (legacy).csv") #for training
player22_df =  pd.read_csv("/content/drive/MyDrive/Intro to ai/players_22.csv") # for testing



# Data Preprocessing

## EDA, Imputation and Encoding

Now that our data is loaded, we will perform an exploratory data analysis, identify the features that are important to us, impute missing values, and encode all the necessary columns.

This step is necessary for transforming our data, since not all columns and rows are needed for the analysis.

In [None]:
#view first few rows and nature of data
males_legacy_df.head()

Unnamed: 0,player_id,player_url,fifa_version,fifa_update,fifa_update_date,short_name,long_name,player_positions,overall,potential,...,cdm,rdm,rwb,lb,lcb,cb,rcb,rb,gk,player_face_url
0,158023,/player/158023/lionel-messi/150002,15,2,2014-09-18,L. Messi,Lionel Andrés Messi Cuccittini,CF,93,95,...,62+3,62+3,62+3,54+3,45+3,45+3,45+3,54+3,15+3,https://cdn.sofifa.net/players/158/023/15_120.png
1,20801,/player/20801/c-ronaldo-dos-santos-aveiro/150002,15,2,2014-09-18,Cristiano Ronaldo,Cristiano Ronaldo dos Santos Aveiro,"LW, LM",92,92,...,63+3,63+3,63+3,57+3,52+3,52+3,52+3,57+3,16+3,https://cdn.sofifa.net/players/020/801/15_120.png
2,9014,/player/9014/arjen-robben/150002,15,2,2014-09-18,A. Robben,Arjen Robben,"RM, LM, RW",90,90,...,64+3,64+3,64+3,55+3,46+3,46+3,46+3,55+3,14+3,https://cdn.sofifa.net/players/009/014/15_120.png
3,41236,/player/41236/zlatan-ibrahimovic/150002,15,2,2014-09-18,Z. Ibrahimović,Zlatan Ibrahimović,ST,90,90,...,65+3,65+3,61+3,56+3,55+3,55+3,55+3,56+3,17+3,https://cdn.sofifa.net/players/041/236/15_120.png
4,167495,/player/167495/manuel-neuer/150002,15,2,2014-09-18,M. Neuer,Manuel Peter Neuer,GK,90,90,...,40+3,40+3,36+3,36+3,38+3,38+3,38+3,36+3,87+3,https://cdn.sofifa.net/players/167/495/15_120.png


In [None]:
player22_df.head()

Unnamed: 0,sofifa_id,player_url,short_name,long_name,player_positions,overall,potential,value_eur,wage_eur,age,...,lcb,cb,rcb,rb,gk,player_face_url,club_logo_url,club_flag_url,nation_logo_url,nation_flag_url
0,158023,https://sofifa.com/player/158023/lionel-messi/...,L. Messi,Lionel Andrés Messi Cuccittini,"RW, ST, CF",93,93,78000000.0,320000.0,34,...,50+3,50+3,50+3,61+3,19+3,https://cdn.sofifa.net/players/158/023/22_120.png,https://cdn.sofifa.net/teams/73/60.png,https://cdn.sofifa.net/flags/fr.png,https://cdn.sofifa.net/teams/1369/60.png,https://cdn.sofifa.net/flags/ar.png
1,188545,https://sofifa.com/player/188545/robert-lewand...,R. Lewandowski,Robert Lewandowski,ST,92,92,119500000.0,270000.0,32,...,60+3,60+3,60+3,61+3,19+3,https://cdn.sofifa.net/players/188/545/22_120.png,https://cdn.sofifa.net/teams/21/60.png,https://cdn.sofifa.net/flags/de.png,https://cdn.sofifa.net/teams/1353/60.png,https://cdn.sofifa.net/flags/pl.png
2,20801,https://sofifa.com/player/20801/c-ronaldo-dos-...,Cristiano Ronaldo,Cristiano Ronaldo dos Santos Aveiro,"ST, LW",91,91,45000000.0,270000.0,36,...,53+3,53+3,53+3,60+3,20+3,https://cdn.sofifa.net/players/020/801/22_120.png,https://cdn.sofifa.net/teams/11/60.png,https://cdn.sofifa.net/flags/gb-eng.png,https://cdn.sofifa.net/teams/1354/60.png,https://cdn.sofifa.net/flags/pt.png
3,190871,https://sofifa.com/player/190871/neymar-da-sil...,Neymar Jr,Neymar da Silva Santos Júnior,"LW, CAM",91,91,129000000.0,270000.0,29,...,50+3,50+3,50+3,62+3,20+3,https://cdn.sofifa.net/players/190/871/22_120.png,https://cdn.sofifa.net/teams/73/60.png,https://cdn.sofifa.net/flags/fr.png,,https://cdn.sofifa.net/flags/br.png
4,192985,https://sofifa.com/player/192985/kevin-de-bruy...,K. De Bruyne,Kevin De Bruyne,"CM, CAM",91,91,125500000.0,350000.0,30,...,69+3,69+3,69+3,75+3,21+3,https://cdn.sofifa.net/players/192/985/22_120.png,https://cdn.sofifa.net/teams/10/60.png,https://cdn.sofifa.net/flags/gb-eng.png,https://cdn.sofifa.net/teams/1325/60.png,https://cdn.sofifa.net/flags/be.png


In [None]:
#understand nature of data
males_legacy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161583 entries, 0 to 161582
Columns: 110 entries, player_id to player_face_url
dtypes: float64(18), int64(45), object(47)
memory usage: 135.6+ MB


In [None]:
player22_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19239 entries, 0 to 19238
Columns: 110 entries, sofifa_id to nation_flag_url
dtypes: float64(16), int64(44), object(50)
memory usage: 16.1+ MB


From the info described, we can see that there are a lot of columns to consider for the analysis, so we are going to need to drop some. Before finding the features we can work with, let's get the number of missing values for each of our dataframes.

In [None]:
# Checking for missing data
print("Checking sum of missing values for Males Legacy (Train Data):")
males_legacy_df.isnull().sum()

Checking sum of missing values for Males Legacy (Train Data):


player_id           0
player_url          0
fifa_version        0
fifa_update         0
fifa_update_date    0
                   ..
cb                  0
rcb                 0
rb                  0
gk                  0
player_face_url     0
Length: 110, dtype: int64

In [None]:
print("Checking sum of missing values for Players 22 (Test Data):")
player22_df.isnull().sum()

Checking sum of missing values for Players 22 (Test Data):


sofifa_id               0
player_url              0
short_name              0
long_name               0
player_positions        0
                    ...  
player_face_url         0
club_logo_url          61
club_flag_url          61
nation_logo_url     18480
nation_flag_url         0
Length: 110, dtype: int64

## Dropping Missing Values

Now we are going to drop the columns in which more than 30% of the values are missing.

In [None]:
total_rows_male = males_legacy_df.shape[0] #shape for train data
total_rows_player22 = player22_df.shape[0] #shape for test data

Calculate the 30% threshold for the two sets:

In [None]:
threshold_male = int(0.3 * total_rows_male)
threshold_player22 = int(0.3 * total_rows_player22)

print("The threshold for Males Legacy is", threshold_male)

The threshold for Males Legacy is 48474


In [None]:
print("The threshold for Players 22 is", threshold_player22)

The threshold for Players 22 is 5771


Get a list of all columns with a sum of missing values greater than the threshold:

In [None]:
columns_to_drop = []
for column in males_legacy_df.columns:
    if males_legacy_df[column].isna().sum() > threshold_male:
        columns_to_drop.append(column)

In [None]:
columns_to_drop_22 = []
for column in player22_df.columns:
    if  player22_df[column].isna().sum() > threshold_player22:
        columns_to_drop_22.append(column)

Drop the columns:

In [None]:

# `columns=` already selects the column axis, so no `axis` argument is needed
males_legacy_df = males_legacy_df.drop(columns=columns_to_drop)
player22_df = player22_df.drop(columns=columns_to_drop_22)
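
The threshold, loop and drop steps above can also be written as a single vectorised filter. The sketch below is only an illustration on a made-up frame: `toy_df` and the helper `drop_sparse_columns` are hypothetical names, not part of this notebook. It keeps a column only when its share of missing values is at most the threshold, which should select the same columns as comparing the missing count against 30% of the row count.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.3) -> pd.DataFrame:
    """Return df without the columns whose fraction of missing values exceeds max_missing."""
    missing_share = df.isna().mean()  # fraction of NaN per column
    return df.loc[:, missing_share <= max_missing]


# Toy data (hypothetical): 'b' is 50% missing, 'c' is 25% missing
toy_df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, 3.0, 4.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

cleaned = drop_sparse_columns(toy_df)
print(list(cleaned.columns))
```

One possible design choice, not used in this notebook, is to compute the list of dropped columns on the training frame only and reuse it for the test frame, so that both frames keep the same columns.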

Let's check info again:

In [None]:
#understand nature of data
males_legacy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161583 entries, 0 to 161582
Columns: 102 entries, player_id to player_face_url
dtypes: float64(14), int64(45), object(43)
memory usage: 125.7+ MB


In [None]:
player22_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19239 entries, 0 to 19238
Columns: 102 entries, sofifa_id to nation_flag_url
dtypes: float64(13), int64(44), object(45)
memory usage: 15.0+ MB


After reviewing Kaggle, reading the data description, and browsing the data explorer, I came to understand that some columns don't contribute to the overall rating of a player, so we are going to drop those columns too.

Here is a link to explore the columns in the data: [Data Explorer on Kaggle for Players 22](https://www.kaggle.com/datasets/stefanoleone992/fifa-22-complete-player-dataset/?select=players_22.csv)

and [Data Explorer on Kaggle for Males Legacy](https://www.kaggle.com/datasets/stefanoleone992/fifa-23-complete-player-dataset?select=male_players+%28legacy%29.csv)

For the males legacy CSV, let's drop those columns:

In [None]:
drop_columns = ['club_contract_valid_until_year','club_name','club_position','club_joined_date','international_reputation','nationality_id','club_team_id','player_id','player_url','fifa_version','long_name','dob','body_type','real_face','player_face_url', 'fifa_update', 'fifa_update_date']

males_legacy_df = males_legacy_df.drop(drop_columns, axis=1)

In [None]:
drop_columns = ['club_contract_valid_until','club_name','club_position','international_reputation','nationality_id','club_team_id','sofifa_id','player_url','long_name','dob','body_type','real_face','player_face_url','club_logo_url','club_flag_url','nation_flag_url']

player22_df = player22_df.drop(drop_columns, axis=1)

After a further review, some more columns were identified that could be dropped, with the following justification. If we look at the `ls` column, it is described as the `player attribute playing as LW`.

Such columns are only useful if we wanted to predict a player's effectiveness in a particular position, so we drop all columns with that kind of description.


Players are normally played in a specific position at their clubs, which contributes more to their overall rating, so columns like `player_positions`, which holds the `player preferred positions`, can be dropped as well.


Other columns reviewed that can be dropped are:
*   `short_name`
*   `club_joined`
*   `nationality_name`

In [None]:
#drop newly identified columns
drop_r_cols = ['short_name', 'player_positions', 'league_name', 'nationality_name', 'ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam', 'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb', 'ldm', 'cdm', 'rdm', 'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb', 'gk']

males_legacy_df = males_legacy_df.drop(drop_r_cols, axis=1)
player22_df = player22_df.drop(drop_r_cols, axis=1)

## Imputation x Encoding

Having dropped the columns with many missing values, we now do imputation. Imputation is where we fill missing data with substitute values.

In [None]:
## Filling missing numeric data with the mean value
num_imputer = SimpleImputer(strategy='mean')

## Filling missing categorical data with the most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')

In [None]:
# Selecting numerical and categorical features

num_features_males = males_legacy_df.select_dtypes(include=[np.number]).columns.tolist()
cat_features_males = males_legacy_df.select_dtypes(include=[object]).columns.tolist()

In [None]:
num_features_22 = player22_df.select_dtypes(include=[np.number]).columns.tolist()
cat_features_22 = player22_df.select_dtypes(include=[object]).columns.tolist()

In [None]:
cat_features_males, cat_features_22

(['preferred_foot', 'work_rate'],
 ['club_joined', 'preferred_foot', 'work_rate'])

From Kaggle, the column `overall` is described as the *player current overall attribute*, which translates to **the player rating**, i.e. the crux of this whole project. Thus we remove `overall` from the feature lists, since it is our target variable.

In [None]:
# Removing the target variable from the features
num_features_males.remove('overall')  # 'overall' is the target variable
num_features_22.remove('overall')  # 'overall' is the target variable

Now we do the imputation:

In [None]:
#males legacy imputation
males_legacy_df[num_features_males] = num_imputer.fit_transform(males_legacy_df[num_features_males])

#categorical imputation
males_legacy_df[cat_features_males] = cat_imputer.fit_transform(males_legacy_df[cat_features_males])

In [None]:
#player 22 imputation
player22_df[num_features_22] = num_imputer.fit_transform(player22_df[num_features_22])

#categorical imputation
player22_df[cat_features_22] = cat_imputer.fit_transform(player22_df[cat_features_22])
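
To make the imputation step concrete, here is a minimal sketch on invented data (the frames `train_toy` and `test_toy` are hypothetical). It also shows one common alternative to the cells above: fitting the imputer on the training frame and calling only `transform` on the test frame, so the test set is filled with training statistics. This assumes both frames share the same columns, which does not hold for every column in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames (hypothetical) with a few missing values
train_toy = pd.DataFrame({"age": [20.0, 30.0, np.nan], "pace": [60.0, np.nan, 80.0]})
test_toy = pd.DataFrame({"age": [np.nan, 40.0], "pace": [70.0, np.nan]})

imputer = SimpleImputer(strategy="mean")

# Learn the column means from the training frame only
train_filled = pd.DataFrame(imputer.fit_transform(train_toy), columns=train_toy.columns)

# Reuse those training means to fill the test frame
test_filled = pd.DataFrame(imputer.transform(test_toy), columns=test_toy.columns)

print(train_filled)
print(test_filled)
```

Here the missing `age` in the test frame is filled with 25.0, the training mean, rather than with a value computed from the test frame itself.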

In [None]:
males_legacy_df.shape

(161583, 54)

In [None]:
player22_df.shape

(19239, 55)

The next task is encoding, which we do only for the categorical columns. We first tried the OneHot Encoding technique but ran out of memory, so we pivoted to encoding with `pd.get_dummies`.

In [None]:
# Using `get_dummies` for one-hot encoding and dropping the first category
males_legacy_encoded_df = pd.get_dummies(males_legacy_df, columns=cat_features_males, drop_first=True)
player22_encoded_df = pd.get_dummies(player22_df, columns=cat_features_22, drop_first=True)

males_legacy_encoded_df.head()  # display the first few rows to verify the changes

Unnamed: 0,overall,potential,value_eur,wage_eur,age,height_cm,weight_kg,league_id,league_level,club_jersey_number,...,goalkeeping_reflexes,preferred_foot_Right,work_rate_High/Low,work_rate_High/Medium,work_rate_Low/High,work_rate_Low/Low,work_rate_Low/Medium,work_rate_Medium/High,work_rate_Medium/Low,work_rate_Medium/Medium
0,93,95.0,100500000.0,550000.0,27.0,169.0,67.0,53.0,1.0,10.0,...,8.0,False,False,False,False,False,False,False,True,False
1,92,92.0,79000000.0,375000.0,29.0,185.0,80.0,53.0,1.0,7.0,...,11.0,True,True,False,False,False,False,False,False,False
2,90,90.0,54500000.0,275000.0,30.0,180.0,80.0,19.0,1.0,10.0,...,15.0,False,True,False,False,False,False,False,False,False
3,90,90.0,52500000.0,275000.0,32.0,195.0,95.0,16.0,1.0,10.0,...,12.0,True,False,False,False,False,False,False,True,False
4,90,90.0,63500000.0,300000.0,28.0,193.0,92.0,19.0,1.0,1.0,...,86.0,True,False,False,False,False,False,False,False,True


In [None]:
males_legacy_encoded_df.shape

(161583, 61)
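
Because `pd.get_dummies` is applied to each frame separately, the two encoded frames can end up with different columns when a category appears in only one of them. A hedged sketch of one way to align them, using invented toy frames (`train_toy` and `test_toy` are hypothetical), is to reindex the test frame to the training columns:

```python
import pandas as pd

# Toy frames (hypothetical): the test frame never contains "Low/Low"
train_toy = pd.DataFrame({
    "overall": [80, 70, 65],
    "work_rate": ["High/Low", "Low/Low", "Medium/Medium"],
})
test_toy = pd.DataFrame({
    "overall": [75, 60],
    "work_rate": ["High/Low", "Medium/Medium"],
})

train_enc = pd.get_dummies(train_toy, columns=["work_rate"], drop_first=True)
test_enc = pd.get_dummies(test_toy, columns=["work_rate"], drop_first=True)

# Give the test frame exactly the training columns; dummies it lacks are filled with 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_aligned.columns))
```

This matters mainly when a model is trained on the full encoded frame. The models trained later in this notebook use only numeric top features, so the alignment is shown here as a precaution rather than a required step.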

Finally, let's describe our datasets before feature analysis:

In [None]:
males_legacy_encoded_df.describe()

Unnamed: 0,overall,potential,value_eur,wage_eur,age,height_cm,weight_kg,league_id,league_level,club_jersey_number,...,mentality_penalties,mentality_composure,defending_marking_awareness,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes
count,161583.0,161583.0,161583.0,161583.0,161583.0,161583.0,161583.0,161583.0,161583.0,161583.0,...,161583.0,161583.0,161583.0,161583.0,161583.0,161583.0,161583.0,161583.0,161583.0,161583.0
mean,65.699071,70.744008,2326770.0,10855.409768,25.123181,181.240205,75.235031,210.409017,1.380283,20.161323,...,48.668492,57.816892,45.757957,47.669996,45.698588,16.52961,16.274918,16.140374,16.288861,16.636973
std,7.040855,6.259121,5967471.0,21821.763253,4.670207,6.750148,7.000456,442.238584,0.744308,16.777537,...,15.652208,11.004799,20.453699,21.336404,20.935273,17.67047,16.834294,16.476466,16.998697,17.980143
min,40.0,40.0,1000.0,500.0,16.0,154.0,49.0,1.0,1.0,1.0,...,5.0,3.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0
25%,61.0,66.0,325000.0,2000.0,21.0,176.0,70.0,19.0,1.0,9.0,...,39.0,53.0,26.0,27.0,25.0,8.0,8.0,8.0,8.0,8.0
50%,66.0,70.0,750000.0,4000.0,25.0,181.0,75.0,56.0,1.0,18.0,...,50.0,57.816892,50.0,54.0,52.0,11.0,11.0,11.0,11.0,11.0
75%,70.0,75.0,1900000.0,10000.0,28.0,186.0,80.0,308.0,1.380283,27.0,...,60.0,65.0,63.0,66.0,64.0,14.0,14.0,14.0,14.0,14.0
max,94.0,95.0,194000000.0,575000.0,54.0,208.0,110.0,2149.0,5.0,99.0,...,96.0,96.0,94.0,94.0,95.0,91.0,92.0,95.0,92.0,94.0


In [None]:
player22_encoded_df.describe()

Unnamed: 0,overall,potential,value_eur,wage_eur,age,height_cm,weight_kg,league_level,club_jersey_number,weak_foot,...,mentality_penalties,mentality_composure,defending_marking_awareness,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes
count,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,...,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0
mean,65.772182,71.07937,2850452.0,9017.989363,25.210822,181.299704,74.943032,1.354364,20.94525,2.946151,...,47.858724,57.92983,46.601746,48.045584,45.9067,16.406102,16.192474,16.055356,16.229274,16.491814
std,6.880232,6.086213,7599043.0,19439.284122,4.748235,6.863179,7.069434,0.746679,17.880953,0.67156,...,15.768583,12.159326,20.200807,21.232718,20.755683,17.574028,16.839528,16.564554,17.059779,17.884833
min,47.0,49.0,9000.0,500.0,16.0,155.0,49.0,1.0,1.0,1.0,...,7.0,12.0,4.0,5.0,5.0,2.0,2.0,2.0,2.0,2.0
25%,61.0,67.0,475000.0,1000.0,21.0,176.0,70.0,1.0,9.0,3.0,...,38.0,50.0,29.0,28.0,25.0,8.0,8.0,8.0,8.0,8.0
50%,66.0,71.0,975000.0,3000.0,25.0,181.0,75.0,1.0,18.0,3.0,...,49.0,59.0,52.0,56.0,53.0,11.0,11.0,11.0,11.0,11.0
75%,70.0,75.0,2100000.0,8000.0,29.0,186.0,80.0,1.0,27.0,3.0,...,60.0,66.0,63.0,65.0,63.0,14.0,14.0,14.0,14.0,14.0
max,93.0,95.0,194000000.0,350000.0,54.0,206.0,110.0,5.0,99.0,5.0,...,93.0,96.0,93.0,93.0,92.0,91.0,92.0,93.0,92.0,90.0


# Feature Engineering

## Feature Extraction
Now we are going to analyze the dataset to understand which features are important for determining a player's overall rating. We are using feature importance to identify the necessary features.


Here we are fitting a RandomForestRegressor to obtain feature importances.

In [None]:
# the target variable and features; drop non-numeric columns if necessary
X = males_legacy_encoded_df.drop(columns=['overall'])
y = males_legacy_encoded_df['overall']

In [None]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=69)

In [None]:
# Create a Random Forest Regressor model
model = RandomForestRegressor(random_state=69)
model.fit(X_train, y_train)

In [None]:

# Get feature importances
importances = model.feature_importances_

# Sort them in descending order
indices = np.argsort(importances)[::-1]

# Let's print out the feature importance ranking
print("Top 20 Feature ranking:")

for i in range(20):  # top 20 features, matching the heading printed above
    print(f"{i + 1}. Feature {X.columns[indices[i]]} ({importances[indices[i]]})")



Top 20 Feature ranking:
1. Feature value_eur (0.8065332959425644)
2. Feature age (0.08062255967101842)
3. Feature potential (0.04557715087721492)
4. Feature movement_reactions (0.02080080413408383)
5. Feature wage_eur (0.014259375005116369)
6. Feature mentality_composure (0.006396275740171184)
7. Feature defending (0.004508892564411371)
8. Feature dribbling (0.0026127347611774937)
9. Feature skill_ball_control (0.0012526488480551143)
10. Feature physic (0.0009574911442534491)
11. Feature attacking_crossing (0.0008622794809781083)
12. Feature power_stamina (0.0008228267629681519)
13. Feature shooting (0.0006882887980096972)
14. Feature mentality_positioning (0.0006503894634027378)
15. Feature goalkeeping_positioning (0.0005986013742359419)
16. Feature power_shot_power (0.0005732124986454873)
17. Feature goalkeeping_diving (0.0005595093003058266)
18. Feature attacking_heading_accuracy (0.0005566025428797454)
19. Feature league_id (0.0005496121814647509)
20. Feature passing (0.00053297694

In [None]:
# Now, let's get the top 10 features
top_features = [X.columns[indices[i]] for i in range(10)]
print("\nTop 10 features with % Contribution:")

for i in range(10):
    print(f"{i + 1}.  {top_features[i]} ({round(importances[indices[i]]*100,2)}%)")



Top 10 features with % Contribution:
1.  value_eur (80.65%)
2.  age (8.06%)
3.  potential (4.56%)
4.  movement_reactions (2.08%)
5.  wage_eur (1.43%)
6.  mentality_composure (0.64%)
7.  defending (0.45%)
8.  dribbling (0.26%)
9.  skill_ball_control (0.13%)
10.  physic (0.1%)


From the feature importance results, I observe that the top 5 features together account for about *97%* of the total importance.
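
The 97% figure can be checked by summing the rounded percentages printed above. The short sketch below does this with `np.cumsum`; the values are copied from the output of the previous cell, so the result carries the same rounding.

```python
import numpy as np

# Percentage contributions of the top 5 features, copied from the output above
top5 = {
    "value_eur": 80.65,
    "age": 8.06,
    "potential": 4.56,
    "movement_reactions": 2.08,
    "wage_eur": 1.43,
}

# Running total of importance as each feature is added
cumulative = np.cumsum(list(top5.values()))
for name, total in zip(top5, cumulative):
    print(f"{name:<20} {total:6.2f}%")
```

The running total reaches 96.78% after the fifth feature, which rounds to the 97% quoted here.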

Thus my strategy is to train on these top 5 features only, since each of the remaining features contributes less than 1%. The same 5 features are then used for testing and, when the model is deployed in the future, for prediction.

Let's see how it goes. On to feature subsetting.

In [None]:
top_features = top_features[:5]

print('Features being used for model development are:\n')
top_features

Features being used for model development are:



['value_eur', 'age', 'potential', 'movement_reactions', 'wage_eur']

## Feature Subset

At this stage our goal is to use the top features we have identified at our feature extraction stage to create subsetted data that we will use to train models.

In [None]:
#Now we subset our X feature set
X_top_f = X[top_features]
X_top_f

#no need to do for y

Unnamed: 0,value_eur,age,potential,movement_reactions,wage_eur
0,100500000.0,27.0,95.0,94.0,550000.0
1,79000000.0,29.0,92.0,90.0,375000.0
2,54500000.0,30.0,90.0,89.0,275000.0
3,52500000.0,32.0,90.0,85.0,275000.0
4,63500000.0,28.0,90.0,89.0,300000.0
...,...,...,...,...,...
161578,110000.0,18.0,61.0,39.0,700.0
161579,110000.0,19.0,58.0,42.0,750.0
161580,110000.0,19.0,58.0,50.0,500.0
161581,150000.0,17.0,70.0,45.0,500.0


Now let's scale our features, which are our independent variables:

In [None]:
# Initialize the scaler
scaler = StandardScaler()

# Scale the features
X_scaled = scaler.fit_transform(X_top_f)

# The features are now scaled and ready for training the model.
X_scaled_df = pd.DataFrame(X_scaled, columns=X_top_f.columns)

X_scaled_df.head()

Unnamed: 0,value_eur,age,potential,movement_reactions,wage_eur
0,16.451449,0.401872,3.875315,3.535775,24.706815
1,12.848571,0.83012,3.396013,3.099571,16.687273
2,8.742966,1.044244,3.076478,2.99052,12.104678
3,8.407815,1.472491,3.076478,2.554317,12.104678
4,10.251147,0.615996,3.076478,2.99052,13.250326
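
`StandardScaler` replaces each value x with z = (x - mean) / std, computed per column. That is why the very high `value_eur` and `wage_eur` of the first rows show up as large positive values above. A minimal sketch on made-up numbers (the array `toy` is hypothetical) that reproduces the transform by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two columns on very different scales
toy = np.array([[1000.0, 20.0], [2000.0, 25.0], [9000.0, 30.0]])

scaler_toy = StandardScaler().fit(toy)
by_sklearn = scaler_toy.transform(toy)

# Same computation by hand; StandardScaler uses the population std (ddof=0)
by_hand = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(by_sklearn, by_hand))
```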


In [None]:
#Saving scaler to use in deployment

with open('/content/drive/MyDrive/Intro to ai/scaler.pkl', 'wb') as file:
    pickle.dump(scaler, file)


# Training Models

We are now ready to train some models. Here we are going to train 3 models:
1. XGBoost
2. Gradient Boost
3. Random Forest


Let's split the data for training:

In [None]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled_df, y, test_size=0.3, random_state=84)

Now we define a function for training our various models

In [None]:
def train_model(model, param_grid, X, y):
    '''
        Trains a model using grid search with cross-validation and returns the best model.
        Parameters:
            model: scikit-learn model
            param_grid: dictionary with parameters to try
            X: features(independent variables)
            y: target(dependent variable)
    '''
    cv = KFold(n_splits=7 , random_state=69, shuffle=True)

    # Grid search with cross-validation
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv, scoring='neg_mean_absolute_error', n_jobs=-1)
    grid_search.fit(X, y)

    # Results of the grid search
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best score (MAE): {-grid_search.best_score_}")  # The scorer returns negative MAE (higher is better), so we flip the sign

    return grid_search.best_estimator_  # Returns the best model
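
As a small illustration of why the score is negated, the sketch below runs the same kind of grid search on synthetic data with a deliberately tiny model and grid. The data, the `DecisionTreeRegressor` and the grid are stand-ins chosen for speed, not the models used in this notebook. scikit-learn scorers follow a higher-is-better convention, so MAE is reported as a negative number and flipping the sign recovers the usual error.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 3 * X_toy[:, 0] + X_toy[:, 1] + rng.normal(0, 0.5, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 6]},
    cv=cv,
    scoring="neg_mean_absolute_error",
)
search.fit(X_toy, y_toy)

print("raw best_score_:", search.best_score_)  # negative MAE
print("MAE:", -search.best_score_)             # sign flipped for reporting
print("best params:", search.best_params_)
```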

## Model 1: XGBoost

In [None]:
print("\nTraining XGBoost...")
xgb_model = xgb.XGBRegressor(random_state=42)
xgb_params = {
    'n_estimators': [100, 200, 500],
    'learning_rate': [0.1, 0.01],
    'max_depth': [3, 5, 15],
    'colsample_bytree': [0.5,1]
}
best_xgb = train_model(xgb_model, xgb_params, X_train, y_train)


Training XGBoost...




Best parameters: {'colsample_bytree': 1, 'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 100}
Best score (MAE): 0.6224623308863931


## Model 2: Gradient Boosting Regressor

In [None]:
print("\nTraining Gradient Boosting...")
gbr_model = GradientBoostingRegressor(random_state=63)
gbr_params = {
    'n_estimators': [10, 50, 100],
    'learning_rate': [0.1, 0.01],
    'max_depth': [9, 15]
}
best_gbr = train_model(gbr_model, gbr_params, X_train, y_train)


Training Gradient Boosting...
Best parameters: {'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 50}
Best score (MAE): 0.6389476703651458


## Model 3: Random Forest Regressor

In [None]:
print("\nTraining Random Forest...")
rf_model = RandomForestRegressor(random_state=39)
rf_params = {
    'n_estimators': [10, 50, 100],
    'max_depth': [12, 15],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}
best_rf = train_model(rf_model, rf_params, X_train, y_train)


Training Random Forest...
Best parameters: {'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Best score (MAE): 0.620120615885473


## Model 4: Ensembled Model

From discussion in class, I have come to understand that an ensemble model can improve predictive performance. Here I will combine the best versions of my 3 models into a single ensemble model.

In [None]:
# Create an ensemble model
ensemble = VotingRegressor(
    estimators=[
        ('xgb', best_xgb),
        ('gbr', best_gbr),
        ('rf', best_rf)
    ]
)

In [None]:
# Fit model on the training data
print("\nTraining Ensemble Model...")
ensemble.fit(X_train, y_train)

# Predict and evaluate on the training set
train_pred = ensemble.predict(X_train)
train_mae = mean_absolute_error(y_train, train_pred)
print(f"Ensemble model MAE on training set: {train_mae}")


Training Ensemble Model...
Ensemble model MAE on training set: 0.30503444733299373
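
With default settings, `VotingRegressor` predicts the unweighted average of its base estimators' predictions. The sketch below checks this on synthetic data with two small stand-in models, not the tuned models from this notebook. `VotingRegressor.fit` fits clones of the estimators it is given, and the fitted copies are available in `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=100)

voter = VotingRegressor([
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=10, random_state=0)),
])
voter.fit(X_toy, y_toy)

# Average the fitted base models' predictions by hand
manual_average = np.mean([est.predict(X_toy) for est in voter.estimators_], axis=0)

print(np.allclose(voter.predict(X_toy), manual_average))
```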


Now we have our trained models. Next we evaluate them on the test set to see how they perform. Before that, let's save them so we don't have to incur the cost of retraining if the runtime fails.

## Saving Models

In [None]:
%cd '/content/drive/MyDrive/Intro to ai'

/content/drive/MyDrive/Intro to ai


In [None]:
with open('best_xgb_model.pkl', 'wb') as file:
    pickle.dump(best_xgb, file)

with open('best_gbr_model.pkl', 'wb') as file:
    pickle.dump(best_gbr, file)

with open('best_rf_model.pkl', 'wb') as file:
    pickle.dump(best_rf, file)

with open('ensemble_model.pkl', 'wb') as file:
    pickle.dump(ensemble, file)

Test whether the model was saved correctly:

In [None]:
with open('ensemble_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

predictions = loaded_model.predict(X_test)

en_mae = mean_absolute_error(y_test, predictions)

print(f"Ensemble model MAE on test set: {en_mae}")

Ensemble model MAE on test set: 0.6036230979910489


The model was saved and reloaded correctly.
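
A slightly stricter check than looking at the MAE is to confirm that a reloaded model returns exactly the same predictions as the in-memory one. A minimal sketch with a small stand-in model and an in-memory pickle (no files are written, and all names are illustrative):

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data and a small stand-in model (hypothetical)
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(50, 4))
y_toy = X_toy.sum(axis=1)

model_toy = RandomForestRegressor(n_estimators=5, random_state=0).fit(X_toy, y_toy)

# Serialise and restore in memory, mirroring pickle.dump / pickle.load on a file
restored = pickle.loads(pickle.dumps(model_toy))

same = np.array_equal(model_toy.predict(X_toy), restored.predict(X_toy))
print("Predictions identical after reload:", same)
```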

# Evaluation

We are going to do two evaluations: one on the test set separated from the training data, and the other on `Players 22`, an unseen dataset similar to the data used to train the models.

## Test Set Evaluations

In [None]:
print("\nEvaluating XGBoost...")

#predict on test set
pred_xgb = best_xgb.predict(X_test)
xgb_mae = mean_absolute_error(y_test, pred_xgb)

print(f"XGBoost model MAE on test set: {xgb_mae:.2f}")


Evaluating XGBoost...
XGBoost model MAE on test set: 0.62


In [None]:
print("\nEvaluating Gradient Boost...")

#predict on test set
pred_gbr = best_gbr.predict(X_test)
gbr_mae = mean_absolute_error(y_test, pred_gbr)

print(f"Gradient Boost Regressor model MAE on test set: {gbr_mae:.2f}")


Evaluating Gradient Boost...
Gradient Boost Regressor model MAE on test set: 0.63


In [None]:
print("\nEvaluating Ensemble...")

#predict on test set
pred_en = ensemble.predict(X_test)
en_mae = mean_absolute_error(y_test, pred_en)

print(f"Ensemble model MAE on test set: {en_mae:.2f}")


Evaluating Ensemble...
Ensemble model MAE on test set: 0.60


## Player 22 Evaluations


Here we will test our trained models further on the `player22` data. The data has already been preprocessed, so we only have to extract the top features needed.

In [None]:
player22_encoded_df['overall']

0        93
1        92
2        91
3        91
4        91
         ..
19234    47
19235    47
19236    47
19237    47
19238    47
Name: overall, Length: 19239, dtype: int64

In [None]:
top_features

['value_eur', 'age', 'potential', 'movement_reactions', 'wage_eur']

In [None]:
player22_encoded_df[top_features]

Unnamed: 0,value_eur,age,potential,movement_reactions,wage_eur
0,78000000.0,34.0,93.0,94.0,320000.0
1,119500000.0,32.0,92.0,93.0,270000.0
2,45000000.0,36.0,91.0,94.0,270000.0
3,129000000.0,29.0,91.0,89.0,270000.0
4,125500000.0,30.0,91.0,91.0,350000.0
...,...,...,...,...,...
19234,70000.0,22.0,52.0,53.0,1000.0
19235,110000.0,19.0,59.0,49.0,500.0
19236,100000.0,21.0,55.0,46.0,500.0
19237,110000.0,19.0,60.0,48.0,500.0


In [None]:
#Get player 22 info
y_22 = player22_encoded_df['overall']
X_22 = player22_encoded_df[top_features]


In [None]:
#Scale input

X_scaled_22 = scaler.fit_transform(X_22)

# The features are now scaled and ready for evaluating the models.
X22_scaled_df = pd.DataFrame(X_scaled_22, columns=X_22.columns)

In [None]:
#reassign
X_22 = X22_scaled_df

X_22.head()

Unnamed: 0,value_eur,age,potential,movement_reactions,wage_eur
0,9.889601,1.851089,3.60178,3.599846,15.998022
1,15.350958,1.429869,3.43747,3.489252,13.425844
2,5.546836,2.272309,3.27316,3.599846,13.425844
3,16.601147,0.798039,3.27316,3.046874,13.425844
4,16.140551,1.008649,3.27316,3.268063,17.541329
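
Calling `fit_transform` on the Players 22 features recomputes the mean and standard deviation from that data, so the same raw value maps to a different scaled value than it did during training. A common alternative is to reuse the training scaler and call only `transform`. The sketch below contrasts the two on invented numbers (`train_toy` and `new_toy` are hypothetical). It illustrates the difference only and is not a rerun of the evaluation in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_toy = np.array([[10.0], [20.0], [30.0]])  # data the scaler was fitted on
new_toy = np.array([[20.0], [40.0], [60.0]])    # unseen data on a different range

scaler_toy = StandardScaler().fit(train_toy)

reused = scaler_toy.transform(new_toy)           # uses the training mean/std
refit = StandardScaler().fit_transform(new_toy)  # recomputes mean/std on new data

print("reused scaler:", reused.ravel())
print("refit scaler: ", refit.ravel())
```

With the reused scaler the raw value 20 maps to 0, exactly as it did in training. After refitting, the same raw value maps to a negative number, so a model trained on the original scale would see it as a different input.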


Using saved models here.

### Loading Saved Models

In [None]:
#move to directory where models are saved
%cd "/content/drive/MyDrive/Intro to ai"

/content/drive/MyDrive/Intro to ai


In [None]:
with open('best_xgb_model.pkl', 'rb') as file:
    lbest_xgb = pickle.load(file)

with open('best_gbr_model.pkl', 'rb') as file:
    lbest_gbr = pickle.load(file)

with open('best_rf_model.pkl', 'rb') as file:
    lbest_rf = pickle.load(file)

with open('ensemble_model.pkl', 'rb') as file:
    lensemble = pickle.load(file)

### Testing

In [None]:
print("\nEvaluating XGBoost...")

#predict on Players 22 set
pred_xgb = lbest_xgb.predict(X_22)
xgb_mae = mean_absolute_error(y_22, pred_xgb)

print(f"XGBoost model MAE on Players 22 set: {xgb_mae:.2f}")


Evaluating XGBoost...
XGBoost model MAE on Players 22 set: 1.19


In [None]:
print("\nEvaluating Random Forest...")

#predict on Players 22 set
pred_rf = lbest_rf.predict(X_22)
rf_mae = mean_absolute_error(y_22, pred_rf)

print(f"Random Forest Regressor model MAE on Players 22 set: {rf_mae:.2f}")


Evaluating Random Forest...
Random Forest Regressor model MAE on Players 22 set: 1.03


In [None]:
print("\nEvaluating Gradient Boost...")

#predict on Players 22 set
pred_gbr = lbest_gbr.predict(X_22)
gbr_mae = mean_absolute_error(y_22, pred_gbr)

print(f"Gradient Boost Regressor model MAE on Players 22 set: {gbr_mae:.2f}")


Evaluating Gradient Boost...
Gradient Boost Regressor model MAE on Players 22 set: 1.11


In [None]:
print("\nEvaluating Ensemble...")

#predict on Players 22 set
pred_en = lensemble.predict(X_22)
en_mae = mean_absolute_error(y_22, pred_en)

print(f"Ensemble model MAE on Players 22 set: {en_mae:.2f}")



Evaluating Ensemble...
Ensemble model MAE on Players 22 set: 1.04
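
To compare the models side by side, the MAE values reported in the cells above can be collected into one table. The numbers below are copied from the notebook output, rounded to two decimals. The Random Forest entry for the held-out test split is left empty because no such evaluation cell appears above.

```python
import numpy as np
import pandas as pd

# MAE values copied from the printed outputs in this notebook
results = pd.DataFrame(
    {
        "test_split_mae": [0.62, 0.63, np.nan, 0.60],
        "players_22_mae": [1.19, 1.11, 1.03, 1.04],
    },
    index=["XGBoost", "Gradient Boosting", "Random Forest", "Ensemble"],
)

print(results)
print("Lowest MAE on test split:", results["test_split_mae"].idxmin())
print("Lowest MAE on Players 22:", results["players_22_mae"].idxmin())
```

On these figures the ensemble has the lowest error on the held-out split, while the Random Forest is marginally ahead on Players 22.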


In [None]:
!pip freeze > requirements.txt