# **EDA and Visualization**

**FIFA 22 Sport Predictions**

When building the best team for a FIFA club, the capabilities of the players you add matter most. Do those players fill the gaps in your formation? Are they the right fit for the team? Do they have the skills to justify their cost? This project predicts a player's overall rating from the player's attributes and positions, to help anyone decide whether that player would be a strong addition to their club.

## **Data Preprocessing and Cleaning**

In [3]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
import pickle
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


The whole process is first tested on the players_21 file and then run on the players_22 file.

In [4]:
df=pd.read_csv('/content/drive/My Drive/ColabAssign/players_21.csv')
unseen_df = pd.read_csv('/content/drive/My Drive/ColabAssign/players_22.csv')




In [5]:
df.columns

Index(['sofifa_id', 'player_url', 'short_name', 'long_name',
       'player_positions', 'overall', 'potential', 'value_eur', 'wage_eur',
       'age',
       ...
       'lcb', 'cb', 'rcb', 'rb', 'gk', 'player_face_url', 'club_logo_url',
       'club_flag_url', 'nation_logo_url', 'nation_flag_url'],
      dtype='object', length=110)

Removing all useless columns

In [6]:
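# Removing sparsely populated columns

# dropna's thresh is the minimum number of non-null values a column needs to be kept.
# With len(df) * 0.3, a column survives if at least 30% of its values are present,
# so only columns with more than 70% null values are dropped.
threshold = len(df) * 0.3

# Apply the same threshold (based on the players_21 row count) to both datasets
df.dropna(thresh=threshold, axis=1, inplace=True)
unseen_df.dropna(thresh=threshold, axis=1, inplace=True)

# Now df contains only columns with at least 30% non-null values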
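
In pandas, `dropna(thresh=...)` keeps a column when it has at least `thresh` non-null values, so the cutoff is expressed in terms of present values rather than missing ones. A minimal sketch on a toy frame (the data and column names are hypothetical) shows how the two readings of "30%" differ:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): 10 rows with 0%, 40% and 80% nulls per column.
toy = pd.DataFrame({
    "full": range(10),
    "nulls_40pct": [1.0] * 6 + [np.nan] * 4,
    "nulls_80pct": [1.0] * 2 + [np.nan] * 8,
})

# thresh is the minimum number of NON-null values a column needs to survive.
at_least_30pct_present = toy.dropna(thresh=round(len(toy) * 0.3), axis=1)
at_least_70pct_present = toy.dropna(thresh=round(len(toy) * 0.7), axis=1)

print(list(at_least_30pct_present.columns))  # columns with >= 3 non-null values
print(list(at_least_70pct_present.columns))  # columns with >= 7 non-null values
```

Under these assumptions, dropping every column with more than 30% nulls corresponds to the second call, where `thresh` is 70% of the row count.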


In [7]:
# Removing useless columns
useless_columns = [
    'sofifa_id', 'player_url', 'long_name',
    'body_type', 'real_face',
    'player_face_url', 'club_logo_url', 'nation_flag_url',
    'wage_eur', 'nationality_id', 'nationality_name',
    'club_jersey_number', 'club_joined', 'club_contract_valid_until', 'club_flag_url',
    # Per-position columns
    'ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam', 'ram',
    'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb', 'ldm', 'cdm', 'rdm', 'rwb', 'lb',
    'lcb', 'cb', 'rcb', 'rb', 'gk',
]


In [8]:
new_df = df.drop(useless_columns, axis = 1)
unseen = unseen_df.drop(useless_columns, axis=1)

In [9]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18944 entries, 0 to 18943
Data columns (total 61 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   short_name                   18944 non-null  object 
 1   player_positions             18944 non-null  object 
 2   overall                      18944 non-null  int64  
 3   potential                    18944 non-null  int64  
 4   value_eur                    18707 non-null  float64
 5   age                          18944 non-null  int64  
 6   dob                          18944 non-null  object 
 7   height_cm                    18944 non-null  int64  
 8   weight_kg                    18944 non-null  int64  
 9   club_team_id                 18719 non-null  float64
 10  club_name                    18719 non-null  object 
 11  league_name                  18719 non-null  object 
 12  league_level                 18719 non-null  float64
 13  club_position   

## **Feature Engineering**

Encoding and Imputing Both Datasets

In [10]:
# Impute missing values for numerical columns with the column mean
numerical_imputer = SimpleImputer(strategy='mean')
# For players_21
df_imputed_numerical = pd.DataFrame(numerical_imputer.fit_transform(new_df.select_dtypes(include='number')), columns=new_df.select_dtypes(include='number').columns)
# For players_22 (the imputer is refit here, so players_22 is filled with its own column means)
unseen_imputed_numerical = pd.DataFrame(numerical_imputer.fit_transform(unseen.select_dtypes(include='number')), columns=unseen.select_dtypes(include='number').columns)

# Impute missing values for categorical columns with each dataset's most frequent value
for column in new_df.select_dtypes(include='object').columns:
    new_df[column] = new_df[column].fillna(new_df[column].mode()[0])
    unseen[column] = unseen[column].fillna(unseen[column].mode()[0])

# The imputed numerical columns are held in df_imputed_numerical and unseen_imputed_numerical;
# new_df and unseen have had their categorical columns filled in place
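
The cell above calls `fit_transform` on both files, so players_22 is filled with its own column means. A common alternative is to learn the fill values once on the training data and reuse them on unseen data. The sketch below illustrates that pattern on toy arrays (hypothetical values); it is not the notebook's method, and it assumes both datasets share the same numerical columns in the same order:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy stand-ins for the numerical parts of players_21 and players_22.
train = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
unseen = np.array([[np.nan, 100.0], [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train)  # learns the training column means
unseen_filled = imputer.transform(unseen)    # reuses them, no refitting

print(imputer.statistics_)  # the learned per-column means
print(unseen_filled)
```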

In [11]:
# Label encode categorical variables
label_encoder = LabelEncoder()

for column in new_df.select_dtypes(include='object').columns:
    # players_21
    new_df[column] = label_encoder.fit_transform(new_df[column])
    # players_22 (the encoder is refit, so integer codes are not shared between the two datasets)
    unseen[column] = label_encoder.fit_transform(unseen[column])


# Concatenate the imputed numerical columns with any remaining object-typed columns.
# The encoded columns are integer-typed by this point, so select_dtypes(include='object')
# no longer selects them and they are not carried into df_final or unseen_final.
df_final = pd.concat([df_imputed_numerical, new_df.select_dtypes(include='object')], axis=1)
unseen_final = pd.concat([unseen_imputed_numerical, unseen.select_dtypes(include='object')], axis=1)

# Now df_final and unseen_final contain the imputed numerical columns
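
Once `LabelEncoder` has run, the encoded columns hold integers, so `select_dtypes(include='object')` no longer selects them. A minimal sketch on a toy frame (hypothetical data) records the categorical column names before encoding so that the encoded columns can still be picked out afterwards:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (hypothetical data) with one object column and one numerical column.
toy = pd.DataFrame({
    "club_name": pd.Series(["A", "B", "A"], dtype=object),
    "age": [20, 25, 30],
})

# Record the categorical column names BEFORE encoding changes their dtype.
categorical_columns = list(toy.select_dtypes(include="object").columns)

for column in categorical_columns:
    toy[column] = LabelEncoder().fit_transform(toy[column])

# After encoding, no object columns are left to select by dtype...
remaining_object_columns = list(toy.select_dtypes(include="object").columns)
# ...but the recorded names still reach the encoded columns.
encoded = toy[categorical_columns]

print(remaining_object_columns)
print(encoded)
```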


In [12]:
df_final = df_final.dropna()
unseen_final = unseen_final.dropna()

Training models with players_21

In [13]:
y = df_final["overall"]

In [14]:
X = df_final.drop(columns=['overall'])

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Creating and training models that can predict a player's rating, using the RandomForest, XGBoost, and Gradient Boosting regressors.

In [17]:
# RandomForestRegressor
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)
importances = rf_model.feature_importances_
indices = np.argsort(importances)[::-1]
print("Feature Ranking:")
for i in range(X.shape[1]):
  print(f"{i + 1}.Feature {X.columns[indices[i]]}({importances[indices[i]]})")

Feature Ranking:
1.Feature value_eur(0.6767824853426374)
2.Feature release_clause_eur(0.1449016631853957)
3.Feature age(0.10669107323491853)
4.Feature potential(0.047968766844168995)
5.Feature movement_reactions(0.017236409228201896)
6.Feature defending(0.0002531033809081008)
7.Feature defending_marking_awareness(0.00020200687211827751)
8.Feature attacking_crossing(0.0001832159162218995)
9.Feature goalkeeping_reflexes(0.00018188450362503485)
10.Feature dribbling(0.000179515838055657)
11.Feature power_stamina(0.00017547513606515624)
12.Feature club_team_id(0.00017499670524155094)
13.Feature goalkeeping_positioning(0.00017376003216366967)
14.Feature mentality_composure(0.00017199378212519716)
15.Feature goalkeeping_diving(0.00017162834893177115)
16.Feature attacking_heading_accuracy(0.00017065526194936766)
17.Feature power_shot_power(0.00016879782956756733)
18.Feature mentality_interceptions(0.00016586594980476588)
19.Feature mentality_penalties(0.00016509635731620664)
20.Feature skill_b

In [18]:
top_features = [X.columns[indices[i]] for i in range(30)]
print("\nTop 30 features:")
print(top_features)


Top 30 features:
['value_eur', 'release_clause_eur', 'age', 'potential', 'movement_reactions', 'defending', 'defending_marking_awareness', 'attacking_crossing', 'goalkeeping_reflexes', 'dribbling', 'power_stamina', 'club_team_id', 'goalkeeping_positioning', 'mentality_composure', 'goalkeeping_diving', 'attacking_heading_accuracy', 'power_shot_power', 'mentality_interceptions', 'mentality_penalties', 'skill_ball_control', 'mentality_aggression', 'power_jumping', 'power_long_shots', 'physic', 'movement_sprint_speed', 'attacking_short_passing', 'defending_standing_tackle', 'attacking_volleys', 'goalkeeping_handling', 'goalkeeping_kicking']


After ranking the features by how much each contributes to the predicted player rating, we decided to use only the top 5: value, release clause, age, potential, and movement reactions. Their importances add up to about 0.99, meaning the contribution of every feature below them is negligible.
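
The sum can be checked directly from the importances printed in the feature ranking above. A short sketch, with the five values copied from that output:

```python
# Importances of the top five features, copied from the ranking output above.
importances = {
    "value_eur": 0.6767824853426374,
    "release_clause_eur": 0.1449016631853957,
    "age": 0.10669107323491853,
    "potential": 0.047968766844168995,
    "movement_reactions": 0.017236409228201896,
}

running_total = 0.0
for name, importance in importances.items():
    running_total += importance
    print(f"{name:<20} {importance:.4f}  cumulative {running_total:.4f}")
```

The cumulative total comes to roughly 0.9936, which leaves under 1% of the total importance for all remaining features combined.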

In [19]:
# Changing our datasets to include only the top 5 features
features = ['value_eur', 'release_clause_eur', 'age', 'potential', 'movement_reactions']
new_df = df_final[features]
unseen_y = unseen_final["overall"]
new_unseen = unseen_final[features]

We now have the features we will use to train the other two models: GradientBoostingRegressor and XGBRegressor.

In [20]:
# Scaling the dataset
from sklearn.preprocessing import StandardScaler
X = new_df
X = StandardScaler().fit_transform(X.copy())
# players_22 is standardized with its own scaler, fitted separately from players_21
new_unseen = StandardScaler().fit_transform(new_unseen.copy())
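
The cell above fits a separate `StandardScaler` on each file. When unseen data is scaled on its own, any shift between the two datasets is removed before the model sees it. The sketch below contrasts that with reusing the training scaler; the synthetic arrays are assumptions for illustration and do not come from the FIFA files:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins: the "unseen" data is shifted upward relative to training.
rng = np.random.default_rng(42)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
unseen = rng.normal(loc=60.0, scale=10.0, size=(50, 2))

# Option 1: fit once on the training data, then reuse that scaler.
scaler = StandardScaler().fit(train)
unseen_shared = scaler.transform(unseen)

# Option 2: fit a separate scaler on the unseen data (as in the cell above).
unseen_separate = StandardScaler().fit_transform(unseen)

print(unseen_shared.mean(axis=0))    # the upward shift is still visible
print(unseen_separate.mean(axis=0))  # close to zero: the shift is scaled away
```

Tree-based models such as the ones used here are generally insensitive to monotonic rescaling of a feature, but they do depend on the actual split values, so the two options can lead to different predictions.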

## **Training Models**

Splitting our modified Dataset and training with GradientBoostingRegressor and XGBRegressor.

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [22]:
# GradientBoostingRegressor
gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(X_train, y_train)
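gb_predictions = np.round(gb_model.predict(X_test)).ravel()
# Note: this is the square root of the MAE, not the MAE or the RMSE;
# squaring the printed value gives the MAE.
gb_sqrt_mae = np.sqrt(mean_absolute_error(y_test, gb_predictions))
print(f'Square root of Mean Absolute Error: {gb_sqrt_mae}')


Square root of Mean Absolute Error: 0.7138859492860312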


In [23]:
# XGBRegressor
xgb_model = XGBRegressor(random_state=42)
xgb_model.fit(X_train,y_train)
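xgb_predictions = xgb_model.predict(X_test)
# Note: this is the square root of the MAE, not the MAE or the RMSE;
# squaring the printed value gives the MAE.
xgb_sqrt_mae = np.sqrt(mean_absolute_error(y_test, xgb_predictions))
print(f'Square root of Mean Absolute Error: {xgb_sqrt_mae}')

Square root of Mean Absolute Error: 0.5915629437867324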
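
The scores above are computed as `np.sqrt(mean_absolute_error(...))`, which is neither the mean absolute error (MAE) nor the root mean squared error (RMSE). A small sketch with hypothetical ratings shows how the three quantities differ:

```python
import numpy as np

# Hypothetical true and predicted ratings.
y_true = np.array([70.0, 65.0, 80.0, 75.0])
y_pred = np.array([72.0, 65.0, 74.0, 75.0])

errors = y_true - y_pred               # [-2, 0, 6, 0]

mae = np.mean(np.abs(errors))          # mean of |error|
rmse = np.sqrt(np.mean(errors ** 2))   # square root of the mean squared error
sqrt_mae = np.sqrt(mae)                # the quantity computed in the cells above

print(f"MAE:       {mae:.4f}")
print(f"RMSE:      {rmse:.4f}")
print(f"sqrt(MAE): {sqrt_mae:.4f}")
```

RMSE penalizes large individual errors more heavily than MAE does, whereas the square root of the MAE has no standard interpretation in rating points.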


XGBoost Regressor gave the lower error, so we proceed with it as the model of choice and tune it with a cross-validated grid search.

In [24]:
from sklearn.model_selection import GridSearchCV

# Defining the parameter grid for hyperparameter tuning
# (min_samples_split is not an XGBoost parameter and was ignored, so it is left out)
param_grid = {
    'n_estimators': [50, 250, 500],
    'max_depth': [10, 20, 40]
}

# Create XGBoost Regressor
xgb_mol = XGBRegressor()

# Create GridSearchCV object
grid_search = GridSearchCV(xgb_mol, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)

# Fit the model
grid_search.fit(X, y)
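
After fitting, a `GridSearchCV` object exposes the winning parameter combination and its cross-validated score. The sketch below shows how to read them; it swaps in scikit-learn's `GradientBoostingRegressor` and synthetic data so that it runs without XGBoost, and the small grid is an assumption chosen only to keep it fast:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the five selected features.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=42)

# Deliberately small grid (illustrative values only).
param_grid_demo = {"n_estimators": [20, 50], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid_demo,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

print(search.best_params_)   # best parameter combination found
print(-search.best_score_)   # mean cross-validated MSE of that combination
best_model = search.best_estimator_  # refit on all of X_demo, y_demo
```

Because the scoring is `neg_mean_squared_error`, `best_score_` is negative and its sign is flipped to recover the MSE.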



## **Evaluation**

Testing the model's competence with unseen data

Unseen data = the players_22 file

In [25]:
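unseen_test = grid_search.predict(new_unseen)
# Note: this is the square root of the MAE, not the MAE or the RMSE;
# squaring the printed value gives the MAE.
unseen_sqrt_mae = np.sqrt(mean_absolute_error(unseen_y, unseen_test))
print(f'Square root of Mean Absolute Error: {unseen_sqrt_mae}')

Square root of Mean Absolute Error: 0.8461065699215934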


Saving the Model

In [26]:
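# Use a context manager so the file is closed once the model is written
with open('model.pkl', 'wb') as model_file:
    pickle.dump(grid_search, model_file)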
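
A saved model can be restored with `pickle.load` and used for prediction straight away. The sketch below shows the round trip with a small `LinearRegression` stand-in and a temporary directory, both of which are assumptions for illustration rather than the notebook's tuned model or path:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model (the notebook pickles the fitted GridSearchCV object).
model = LinearRegression().fit(np.array([[0.0], [1.0], [2.0]]), np.array([1.0, 3.0, 5.0]))

# Hypothetical location: a temporary directory, to avoid overwriting real files.
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")

with open(model_path, "wb") as model_file:
    pickle.dump(model, model_file)

with open(model_path, "rb") as model_file:
    restored = pickle.load(model_file)

sample = np.array([[3.0]])
print(restored.predict(sample))
```

Pickle files should only be loaded from trusted sources, and loading generally expects the same library versions that were used when saving.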