# BUSINESS CASE: USING THE FIFA 20 DATASET, CLUSTER PLAYERS INTO GROUPS BASED ON THEIR SKILLS

## INTRODUCTION OF PROJECT:

FIFA 20 is a football simulation video game published by Electronic Arts as part of the FIFA series. It is the 27th installment in the FIFA series, and was released on 27 September 2019 for Microsoft Windows, PlayStation 4, Xbox One, and Nintendo Switch.

# Dataset Overview:

* The dataset contains 104 features and 18,278 records.
* There is a clear distinction between goalkeeping skills and outfielding skills.
* Among all players, 2,036 are goalkeepers, and the rest are outfielders.
* For all goalkeepers, the outfielding skill attributes (pace, shooting, passing, dribbling, defending, physic) and the positional features (ls, st, rs, lw, lf, cf, rf, rw, lam, cam, ram, lm, lcm, cm, rcm, rm, lwb, ldm, cdm, rdm, rwb, lb, lcb, cb, rcb, rb) are NULL.
* Likewise, for all outfielders, the goalkeeping skill attributes (gk_diving, gk_handling, gk_kicking, gk_reflexes, gk_speed, gk_positioning) are NULL. Since the majority of players are outfielders, these goalkeeping features contain a large number of NULL values.
* For some players, wage_eur and value_eur are 0. In the context of football player data, these represent a player’s earnings and market value. Although a wage or value of 0 is technically possible, it is unusual and likely indicates an issue in the data.
* Only 3.1% of players have an overall rating above 80, with just 10 players exceeding 90. Consequently, these players may appear as outliers for many features, even though they are genuine data points.

# Project Pipeline:
1. Importing Libraries
2. Exploratory Data Analysis
3. Data Preprocessing / Feature Engineering
4. Feature Scaling AND Feature Selection
5. Model Building
6. Conclusion based on the models' performances

## 1. Importing Libraries
All the required libraries are imported.

## 2. Exploratory Data Analysis
1. The dataset consists of 104 features and 18,278 records.

2. sofifa_id and player_url are unique features, and there are no constant features in the dataset. There is a clear distinction between goalkeeping skills and outfielding skills.

3. The dataset contains 2,036 goalkeepers, with the remaining players being outfielders. For goalkeepers, all outfielding attributes (pace, shooting, passing, dribbling, defending, physic) and positional attributes (ls, st, rs, lw, lf, cf, rf, rw, lam, cam, ram, lm, lcm, cm, rcm, rm, lwb, ldm, cdm, rdm, rwb, lb, lcb, cb, rcb, rb) are NULL.

4. Conversely, for outfield players, all goalkeeping attributes (gk_diving, gk_handling, gk_kicking, gk_reflexes, gk_speed, gk_positioning) are NULL. As the majority of players are outfielders, these goalkeeping features have a large proportion of missing values.

5. For some players, wage_eur and value_eur are 0. While it is technically possible for a player’s wage or market value to be zero, it is an uncommon and likely erroneous scenario in real-world football data.

6. Only 3.1% of players have an overall rating above 80, and only 10 players exceed 90. Consequently, these players may appear as outliers in many features, although they represent genuine data points.

7. A Sweetviz report was generated for univariate analysis.

8. Bivariate and multivariate analyses were also performed to study feature relationships.

## 3. Data Preprocessing / Feature Engineering
### Imputing Missing Values (NULL Handling)
* release_clause_eur had a skewed distribution, so its missing values were imputed using the median.
* For goalkeepers, the outfielding skill features (pace, shooting, passing, dribbling, defending, physic) and the positional features (ls, st, rs, lw, lf, cf, rf, rw, lam, cam, ram, lm, lcm, cm, rcm, rm, lwb, ldm, cdm, rdm, rwb, lb, lcb, cb, rcb, rb) were NULL and were imputed with 0.
* For outfield players, the goalkeeping skill features (gk_diving, gk_handling, gk_kicking, gk_reflexes, gk_speed, gk_positioning) were NULL and were also imputed with 0.
* The categorical feature team_position had missing values, which were imputed using the mode.

### Encoding Categorical Features
* player_positions was split, and only the first position was selected and manually encoded.
* preferred_foot and team_position were manually encoded.
* work_rate was split into two new features, AttackWorkRate and DefenseWorkRate, both manually encoded.
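The imputation and encoding steps above can be sketched on a small toy frame. This is a minimal illustration, not the notebook's actual code: the rows are made up, the integer codes in the mappings are arbitrary, and `primary_position` is a hypothetical column name introduced only for this example.

```python
import numpy as np
import pandas as pd

# Toy rows using the dataset's column names; the values are made up.
toy = pd.DataFrame({
    "player_positions": ["RW, CF, ST", "GK", "CB", "ST, LW"],
    "preferred_foot": ["Left", "Right", "Right", "Right"],
    "work_rate": ["Medium/Low", "Medium/Medium", "High/Medium", "High/Low"],
    "team_position": ["SUB", "GK", None, "SUB"],
    "release_clause_eur": [195.8e6, np.nan, 20.0e6, 96.5e6],
    "pace": [87.0, np.nan, 60.0, 90.0],
    "gk_diving": [np.nan, 87.0, np.nan, np.nan],
})

# Skewed numeric column: fill missing values with the median.
clause_median = toy["release_clause_eur"].median()
toy["release_clause_eur"] = toy["release_clause_eur"].fillna(clause_median)

# Skills that do not apply to the player's role: fill with 0.
skill_cols = ["pace", "gk_diving"]
toy[skill_cols] = toy[skill_cols].fillna(0)

# Categorical column: fill with the most frequent value (the mode).
toy["team_position"] = toy["team_position"].fillna(toy["team_position"].mode()[0])

# Keep only the first listed position (hypothetical column name).
toy["primary_position"] = toy["player_positions"].str.split(",").str[0].str.strip()

# Manual encoding through explicit mappings (the codes are illustrative).
toy["preferred_foot"] = toy["preferred_foot"].map({"Left": 0, "Right": 1})
rate_codes = {"Low": 0, "Medium": 1, "High": 2}
rates = toy["work_rate"].str.split("/", expand=True)
toy["AttackWorkRate"] = rates[0].map(rate_codes)
toy["DefenseWorkRate"] = rates[1].map(rate_codes)
```

Explicit dictionaries keep the encoding readable and reproducible, at the cost of having to list every category by hand.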
### Handling Outliers
* Univariate analysis showed that a small percentage of players have exceptionally high values in certain features. These values correspond directly to top-performing players with high market value.
* If these high values are legitimate, imputing them would distort the representation of elite players and could bias the data distribution.
* Outliers were evaluated feature by feature: if the proportion of outliers was around 4% or more, they were retained, since they likely represent genuine data rather than errors.

Decisions made:
* Outlying age values were imputed using the mean.
* wage_eur and value_eur values that were 0 were replaced with the median of their respective columns.
* All other numerical features were left as-is to preserve the original data distribution.
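The outlier decisions above can be illustrated with a short sketch. The source does not say how outliers were detected, so the 1.5 * IQR rule used here is an assumption, as are the helper names `iqr_outlier_share` and `zeros_to_median` and the toy wage values.

```python
import pandas as pd

def iqr_outlier_share(series: pd.Series) -> float:
    """Fraction of values outside the 1.5 * IQR whiskers (assumed rule)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return float(outside.mean())

def zeros_to_median(series: pd.Series) -> pd.Series:
    """Replace zeros with the median of the column."""
    return series.mask(series == 0, series.median())

# Made-up wages: two zero entries and one elite player.
wage = pd.Series([0, 2000, 3000, 4000, 5000, 0, 565000], name="wage_eur")

share = iqr_outlier_share(wage)
# Rule from the text: keep outliers when they make up about 4% or more.
retain_outliers = share >= 0.04
cleaned = zeros_to_median(wage)
```

Here only the elite wage falls outside the whiskers, and the two zeros are replaced by the column median while the elite value is left untouched.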

## 4. Feature Scaling AND Feature Selection
1. Min-max scaling was applied to the dataset because many features contain outlier-like values.
2. Duplicate features were identified and removed.
3. Highly correlated features (correlation above 0.92) were identified and removed.
4. Constant features, unique features, and features that do not describe player skills (and so do not help identify clusters) were identified and removed.
5. After feature removal, 34 features remained.
6. After applying PCA, 10 components were chosen because they explained 92% of the variance in the dataset.
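The scaling, correlation filtering, and PCA steps can be sketched as follows. This is an illustration on synthetic data rather than the notebook's code: the feature names are placeholders, and passing a fraction to `n_components` is one way to let scikit-learn pick the number of components for a target variance.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: four independent features plus a near-copy of "a".
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
X["a_copy"] = X["a"] * 2 + 0.01 * rng.normal(size=200)

# Scale every feature to the [0, 1] range.
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

# Drop one feature from each pair whose absolute correlation exceeds 0.92.
# Only the upper triangle is inspected so each pair is considered once.
corr = X_scaled.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.92).any()]
X_reduced = X_scaled.drop(columns=to_drop)

# Keep the smallest number of components explaining at least 92% of variance.
pca = PCA(n_components=0.92, svd_solver="full")
X_pca = pca.fit_transform(X_reduced)
explained = pca.explained_variance_ratio_.sum()
```

After fitting, `pca.n_components_` reports how many components were kept for the requested variance.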


## 5. Model Building
Three models were built.
### (1) K-Means Clustering
* The elbow method was plotted, and the silhouette score was used to settle on 3 clusters.
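A minimal sketch of this selection procedure, assuming synthetic blobs in place of the PCA-transformed player data. The inertia values are what an elbow plot would show, and the silhouette score picks the final number of clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the PCA-transformed features (three clear groups).
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.6,
    random_state=42,
)

inertias, scores = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_                     # y-axis of the elbow plot
    scores[k] = silhouette_score(X, model.labels_)   # higher is better

# Choose the k with the highest silhouette score.
best_k = max(scores, key=scores.get)
```

Plotting `inertias` against k gives the elbow curve; the silhouette score offers a numeric criterion when the elbow is ambiguous.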
### (2) DBSCAN
* min_samples was set to the square root of the number of records.
* The epsilon value was chosen using NearestNeighbors from scikit-learn.
* It produced 4 clusters because the goalkeepers were further divided by left and right preferred foot.
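The parameter choices above can be sketched like this, again on synthetic blobs. The source does not state how epsilon was read off the neighbour distances; a common approach is to look for the knee of the sorted k-distance curve, and the 95th percentile used here is only a simple stand-in for that visual step.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the player features (three well-separated groups).
X, _ = make_blobs(
    n_samples=400,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.5,
    random_state=0,
)

# min_samples as the square root of the number of records.
min_samples = int(np.sqrt(len(X)))

# Distance from every point to its min_samples-th nearest neighbour
# (each query point is returned as its own first neighbour).
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Assumption: a high percentile replaces reading the knee from a plot.
eps = float(np.percentile(k_distances, 95))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
n_clusters = len(set(labels) - {-1})  # label -1 marks noise points
```

Plotting `k_distances` and choosing epsilon where the curve bends sharply is the usual manual alternative to the percentile shortcut.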
### (3) Hierarchical Clustering
* AgglomerativeClustering was used.
* The silhouette score was used to settle on 3 clusters.
* A dendrogram was plotted for the first 100 players of the dataset.
* The dendrogram showed 4 clusters:
  1. Cluster 1: orange line (goalkeepers)
  2. Cluster 2: K. Manolas to F. de Jong
  3. Cluster 3: H. Son to M. Icardi
  4. Cluster 4: R. Lukaku to Bernardo Silva
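The hierarchical workflow can be sketched as below. Ward linkage is an assumption (it is scikit-learn's default, but the source does not name the linkage), the data are synthetic, and the dendrogram is computed with `no_plot=True` so the example runs without a display.

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the player features.
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.6,
    random_state=1,
)

# Silhouette score for several cluster counts.
scores = {}
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)

# Dendrogram on the first 100 rows only, to keep the tree readable.
Z = linkage(X[:100], method="ward")
tree = dendrogram(Z, no_plot=True)  # drop no_plot=True to draw with matplotlib
```

Restricting the dendrogram to a subset keeps the plot legible, but the resulting tree describes only those rows, which may explain a cluster count that differs from the one chosen on the full data.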
  

## 6. Conclusion
* K-Means clustering performed best: its silhouette score was higher than those of the other two models.
* DBSCAN further split the goalkeepers into left-footed and right-footed players, hence its 4 clusters.
* Hierarchical clustering was the most time-consuming. It is not recommended for very large datasets.

# PROJECT IMPLEMENTATION:

# IMPORT REQUIRED LIBRARIES

In [1]:
# Install sweetviz (used later for the univariate report)
!conda install -c conda-forge sweetviz -y

import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.metrics import silhouette_score

%matplotlib inline
warnings.filterwarnings('ignore')

Jupyter detected...
3 channel Terms of Service accepted
Channels:
 - conda-forge
 - defaults
Platform: win-64
Collecting package metadata (repodata.json): failed



NoSpaceLeftError: No space left on devices.



# EXPLORATORY DATA ANALYSIS


In [2]:
pd.set_option('display.max_columns',None)

In [19]:
# Load the data
fifa=pd.read_csv('players_20.csv')


In [20]:
fifa.head(5)

Unnamed: 0,sofifa_id,player_url,short_name,long_name,age,dob,height_cm,weight_kg,nationality,club,overall,potential,value_eur,wage_eur,player_positions,preferred_foot,international_reputation,weak_foot,skill_moves,work_rate,body_type,real_face,release_clause_eur,player_tags,team_position,team_jersey_number,loaned_from,joined,contract_valid_until,nation_position,nation_jersey_number,pace,shooting,passing,dribbling,defending,physic,gk_diving,gk_handling,gk_kicking,gk_reflexes,gk_speed,gk_positioning,player_traits,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,skill_ball_control,movement_acceleration,movement_sprint_speed,movement_agility,movement_reactions,movement_balance,power_shot_power,power_jumping,power_stamina,power_strength,power_long_shots,mentality_aggression,mentality_interceptions,mentality_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_marking,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes,ls,st,rs,lw,lf,cf,rf,rw,lam,cam,ram,lm,lcm,cm,rcm,rm,lwb,ldm,cdm,rdm,rwb,lb,lcb,cb,rcb,rb
0,158023,https://sofifa.com/player/158023/lionel-messi/...,L. Messi,Lionel Andrés Messi Cuccittini,32,1987-06-24,170,72,Argentina,FC Barcelona,94,94,95500000,565000,"RW, CF, ST",Left,5,4,4.0,Medium/Low,Messi,Yes,195800000.0,"#Dribbler, #Distance Shooter, #Crosser, #FK Sp...",RW,10.0,,2004-07-01,2021.0,,,87.0,92.0,92.0,96.0,39.0,66.0,,,,,,,"Beat Offside Trap, Argues with Officials, Earl...",88.0,95.0,70.0,92.0,88.0,97.0,93.0,94.0,92.0,96.0,91.0,84.0,93.0,95.0,95.0,86.0,68.0,75.0,68.0,94.0,48.0,40.0,94.0,94.0,75.0,96.0,33.0,37.0,26.0,6.0,11.0,15.0,14.0,8.0,89+2,89+2,89+2,93+2,93+2,93+2,93+2,93+2,93+2,93+2,93+2,92+2,87+2,87+2,87+2,92+2,68+2,66+2,66+2,66+2,68+2,63+2,52+2,52+2,52+2,63+2
1,20801,https://sofifa.com/player/20801/c-ronaldo-dos-...,Cristiano Ronaldo,Cristiano Ronaldo dos Santos Aveiro,34,1985-02-05,187,83,Portugal,Juventus,93,93,58500000,405000,"ST, LW",Right,5,4,5.0,High/Low,C. Ronaldo,Yes,96500000.0,"#Speedster, #Dribbler, #Distance Shooter, #Acr...",LW,7.0,,2018-07-10,2022.0,LS,7.0,90.0,93.0,82.0,89.0,35.0,78.0,,,,,,,"Long Throw-in, Selfish, Argues with Officials,...",84.0,94.0,89.0,83.0,87.0,89.0,81.0,76.0,77.0,92.0,89.0,91.0,87.0,96.0,71.0,95.0,95.0,85.0,78.0,93.0,63.0,29.0,95.0,82.0,85.0,95.0,28.0,32.0,24.0,7.0,11.0,15.0,14.0,11.0,91+3,91+3,91+3,89+3,90+3,90+3,90+3,89+3,88+3,88+3,88+3,88+3,81+3,81+3,81+3,88+3,65+3,61+3,61+3,61+3,65+3,61+3,53+3,53+3,53+3,61+3
2,190871,https://sofifa.com/player/190871/neymar-da-sil...,Neymar Jr,Neymar da Silva Santos Junior,27,1992-02-05,175,68,Brazil,Paris Saint-Germain,92,92,105500000,290000,"LW, CAM",Right,5,5,5.0,High/Medium,Neymar,Yes,195200000.0,"#Speedster, #Dribbler, #Playmaker , #Crosser,...",CAM,10.0,,2017-08-03,2022.0,LW,10.0,91.0,85.0,87.0,95.0,32.0,58.0,,,,,,,"Power Free-Kick, Injury Free, Selfish, Early C...",87.0,87.0,62.0,87.0,87.0,96.0,88.0,87.0,81.0,95.0,94.0,89.0,96.0,92.0,84.0,80.0,61.0,81.0,49.0,84.0,51.0,36.0,87.0,90.0,90.0,94.0,27.0,26.0,29.0,9.0,9.0,15.0,15.0,11.0,84+3,84+3,84+3,90+3,89+3,89+3,89+3,90+3,90+3,90+3,90+3,89+3,82+3,82+3,82+3,89+3,66+3,61+3,61+3,61+3,66+3,61+3,46+3,46+3,46+3,61+3
3,200389,https://sofifa.com/player/200389/jan-oblak/20/...,J. Oblak,Jan Oblak,26,1993-01-07,188,87,Slovenia,Atlético Madrid,91,93,77500000,125000,GK,Right,3,3,1.0,Medium/Medium,Normal,Yes,164700000.0,,GK,13.0,,2014-07-16,2023.0,GK,1.0,,,,,,,87.0,92.0,78.0,89.0,52.0,90.0,"Flair, Acrobatic Clearance",13.0,11.0,15.0,43.0,13.0,12.0,13.0,14.0,40.0,30.0,43.0,60.0,67.0,88.0,49.0,59.0,78.0,41.0,78.0,12.0,34.0,19.0,11.0,65.0,11.0,68.0,27.0,12.0,18.0,87.0,92.0,78.0,90.0,89.0,,,,,,,,,,,,,,,,,,,,,,,,,,
4,183277,https://sofifa.com/player/183277/eden-hazard/2...,E. Hazard,Eden Hazard,28,1991-01-07,175,74,Belgium,Real Madrid,91,91,90000000,470000,"LW, CF",Right,4,4,4.0,High/Medium,Normal,Yes,184500000.0,"#Speedster, #Dribbler, #Acrobat",LW,7.0,,2019-07-01,2024.0,LF,10.0,91.0,83.0,86.0,94.0,35.0,66.0,,,,,,,"Beat Offside Trap, Selfish, Finesse Shot, Spee...",81.0,84.0,61.0,89.0,83.0,95.0,83.0,79.0,83.0,94.0,94.0,88.0,95.0,90.0,94.0,82.0,56.0,84.0,63.0,80.0,54.0,41.0,87.0,89.0,88.0,91.0,34.0,27.0,22.0,11.0,12.0,6.0,8.0,8.0,83+3,83+3,83+3,89+3,88+3,88+3,88+3,89+3,89+3,89+3,89+3,89+3,83+3,83+3,83+3,89+3,66+3,63+3,63+3,63+3,66+3,61+3,49+3,49+3,49+3,61+3


In [5]:
# Make copy of the original data
original_data=fifa.copy()

In [6]:
original_data.head()



The following features are abbreviations of field positions:
* LS: Left Striker
* ST: Striker
* RS: Right Striker
* LW: Left Winger
* LF: Left Forward
* CF: Center Forward
* RF: Right Forward
* RW: Right Winger
* LAM: Left Attacking Midfield
* CAM: Center Attacking Midfield
* RAM: Right Attacking Midfield
* LM: Left Midfield
* LCM: Left Center Midfield
* CM: Center Midfield
* RCM: Right Center Midfield
* RM: Right Midfield
* LWB: Left Wing Back
* LDM: Left Defensive Midfield
* CDM: Center Defensive Midfield
* RDM: Right Defensive Midfield
* RWB: Right Wing Back
* LB: Left Back
* LCB: Left Center Back
* CB: Center Back
* RCB: Right Center Back
* RB: Right Back

![soccer_positions.jpeg](attachment:ee86f67a-e81b-4179-8db2-277b0d6944a8.jpeg)

In [7]:
fifa.shape

(2019, 104)

In [8]:
# Show the first 5 goalkeepers only
original_data.loc[original_data['player_positions'].str.contains('GK')].head(5)

Unnamed: 0,sofifa_id,player_url,short_name,long_name,age,dob,height_cm,weight_kg,nationality,club,overall,potential,value_eur,wage_eur,player_positions,preferred_foot,international_reputation,weak_foot,skill_moves,work_rate,body_type,real_face,release_clause_eur,player_tags,team_position,team_jersey_number,loaned_from,joined,contract_valid_until,nation_position,nation_jersey_number,pace,shooting,passing,dribbling,defending,physic,gk_diving,gk_handling,gk_kicking,gk_reflexes,gk_speed,gk_positioning,player_traits,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,skill_ball_control,movement_acceleration,movement_sprint_speed,movement_agility,movement_reactions,movement_balance,power_shot_power,power_jumping,power_stamina,power_strength,power_long_shots,mentality_aggression,mentality_interceptions,mentality_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_marking,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes,ls,st,rs,lw,lf,cf,rf,rw,lam,cam,ram,lm,lcm,cm,rcm,rm,lwb,ldm,cdm,rdm,rwb,lb,lcb,cb,rcb,rb
3,200389,https://sofifa.com/player/200389/jan-oblak/20/...,J. Oblak,Jan Oblak,26,1993-01-07,188,87,Slovenia,Atlético Madrid,91,93,77500000,125000,GK,Right,3,3,1.0,Medium/Medium,Normal,Yes,164700000.0,,GK,13.0,,2014-07-16,2023.0,GK,1.0,,,,,,,87.0,92.0,78.0,89.0,52.0,90.0,"Flair, Acrobatic Clearance",13.0,11.0,15.0,43.0,13.0,12.0,13.0,14.0,40.0,30.0,43.0,60.0,67.0,88.0,49.0,59.0,78.0,41.0,78.0,12.0,34.0,19.0,11.0,65.0,11.0,68.0,27.0,12.0,18.0,87.0,92.0,78.0,90.0,89.0,,,,,,,,,,,,,,,,,,,,,,,,,,
6,192448,https://sofifa.com/player/192448/marc-andre-te...,M. ter Stegen,Marc-André ter Stegen,27,1992-04-30,187,85,Germany,FC Barcelona,90,93,67500000,250000,GK,Right,3,4,1.0,Medium/Medium,Normal,Yes,143400000.0,,GK,1.0,,2014-07-01,2022.0,SUB,22.0,,,,,,,88.0,85.0,88.0,90.0,45.0,88.0,"Swerve Pass, Acrobatic Clearance, Flair Passes",18.0,14.0,11.0,61.0,14.0,21.0,18.0,12.0,63.0,30.0,38.0,50.0,37.0,86.0,43.0,66.0,79.0,35.0,78.0,10.0,43.0,22.0,11.0,70.0,25.0,70.0,25.0,13.0,10.0,88.0,85.0,88.0,88.0,90.0,,,,,,,,,,,,,,,,,,,,,,,,,,
13,212831,https://sofifa.com/player/212831/alisson-ramse...,Alisson,Alisson Ramses Becker,26,1992-10-02,191,91,Brazil,Liverpool,89,91,58000000,155000,GK,Right,3,3,1.0,Medium/Medium,Normal,Yes,111700000.0,,GK,1.0,,2018-07-19,2024.0,,,,,,,,,85.0,84.0,85.0,89.0,51.0,90.0,"Flair, Swerve Pass",17.0,13.0,19.0,45.0,20.0,27.0,19.0,18.0,44.0,30.0,56.0,47.0,40.0,88.0,37.0,64.0,52.0,32.0,78.0,14.0,27.0,11.0,13.0,66.0,23.0,65.0,15.0,19.0,16.0,85.0,84.0,85.0,90.0,89.0,,,,,,,,,,,,,,,,,,,,,,,,,,
14,193080,https://sofifa.com/player/193080/david-de-gea-...,De Gea,David De Gea Quintana,28,1990-11-07,192,82,Spain,Manchester United,89,90,56000000,205000,GK,Right,4,3,1.0,Medium/Medium,Lean,Yes,110600000.0,,GK,1.0,,2011-07-01,2020.0,GK,1.0,,,,,,,90.0,84.0,81.0,92.0,58.0,85.0,"Flair, Second Wind, Flair Passes",17.0,13.0,21.0,50.0,13.0,18.0,21.0,19.0,47.0,38.0,57.0,58.0,63.0,87.0,43.0,61.0,67.0,43.0,60.0,12.0,38.0,30.0,12.0,65.0,29.0,68.0,25.0,21.0,13.0,90.0,84.0,81.0,85.0,92.0,,,,,,,,,,,,,,,,,,,,,,,,,,
25,210257,https://sofifa.com/player/210257/ederson-santa...,Ederson,Ederson Santana de Moraes,25,1993-08-17,188,86,Brazil,Manchester City,88,91,54500000,185000,GK,Left,2,3,1.0,Medium/Medium,Normal,Yes,104900000.0,,GK,31.0,,2017-07-01,2024.0,,,,,,,,,86.0,82.0,93.0,88.0,63.0,86.0,"Leadership, Swerve Pass, Acrobatic Clearance",20.0,14.0,14.0,56.0,18.0,23.0,15.0,20.0,58.0,40.0,64.0,63.0,60.0,87.0,48.0,70.0,66.0,41.0,68.0,18.0,38.0,27.0,20.0,70.0,17.0,70.0,29.0,15.0,8.0,86.0,82.0,93.0,86.0,88.0,,,,,,,,,,,,,,,,,,,,,,,,,,


In [9]:
original_data.loc[original_data['player_positions'].str.contains('GK')].shape

(207, 104)

### Observations
* There is a clear separation between goalkeeping skills and outfielding skills.

* The full dataset has 2,036 goalkeepers. The outputs shown here come from a run on 2,019 records, of which 207 are goalkeepers.

## 1. Basic Checks

In [10]:
fifa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2019 entries, 0 to 2018
Columns: 104 entries, sofifa_id to rb
dtypes: float64(51), int64(10), object(43)
memory usage: 1.6+ MB


In [11]:
import pandas as pd

# Create a DataFrame with column information
column_info = pd.DataFrame({
    'Column': original_data.columns,
    'Dtype': original_data.dtypes,
    'Non-Null Count': original_data.count(),
    'Null Count': original_data.isnull().sum(),
    'Total': original_data.isnull().count()
})

# Display the DataFrame without any index
column_info.style.hide(axis="index")


Column,Dtype,Non-Null Count,Null Count,Total
sofifa_id,int64,2019,0,2019
player_url,object,2019,0,2019
short_name,object,2019,0,2019
long_name,object,2019,0,2019
age,int64,2019,0,2019
dob,object,2019,0,2019
height_cm,int64,2019,0,2019
weight_kg,int64,2019,0,2019
nationality,object,2019,0,2019
club,object,2019,0,2019


### Observations

* For outfielding skills such as pace, shooting, passing, dribbling, defending, and physic, there are 2,036 NULL values. This means these features are NULL for all goalkeepers. They should have been 0.

* For goalkeeping skills such as gk_diving, gk_handling, gk_kicking, gk_reflexes, gk_speed, and gk_positioning, there are only 2,036 non-NULL values. This means these features are NULL for all outfielders. They should have been 0.

* For all the position features ('ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam', 'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb', 'ldm', 'cdm', 'rdm', 'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb'), there are 2,036 NULL values. This means these features are NULL for all goalkeepers. They should have been 0.
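The NULL pattern described in these observations can be checked by grouping the missing-value flags by role. The sketch below uses a made-up four-row frame with the dataset's column names; the `role` labels are introduced only for this example.

```python
import numpy as np
import pandas as pd

# Made-up rows: two outfielders and two goalkeepers.
toy = pd.DataFrame({
    "player_positions": ["RW, CF, ST", "GK", "CB", "GK"],
    "pace": [87.0, np.nan, 60.0, np.nan],
    "gk_diving": [np.nan, 87.0, np.nan, 70.0],
})

# Label each row by role (hypothetical labels for illustration).
is_gk = toy["player_positions"].str.contains("GK")
role = np.where(is_gk, "goalkeeper", "outfield")

# Share of NULL values per feature within each role.
null_share = toy[["pace", "gk_diving"]].isna().groupby(role).mean()
```

A share of 1.0 for `pace` among goalkeepers and for `gk_diving` among outfielders confirms that the missing values follow the role split rather than occurring at random.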

## 2. Statistical Analysis

In [12]:
original_data.sofifa_id.nunique()

2019

In [13]:
original_data.describe()

Unnamed: 0,sofifa_id,age,height_cm,weight_kg,overall,potential,value_eur,wage_eur,international_reputation,weak_foot,skill_moves,release_clause_eur,team_jersey_number,contract_valid_until,nation_jersey_number,pace,shooting,passing,dribbling,defending,physic,gk_diving,gk_handling,gk_kicking,gk_reflexes,gk_speed,gk_positioning,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,skill_ball_control,movement_acceleration,movement_sprint_speed,movement_agility,movement_reactions,movement_balance,power_shot_power,power_jumping,power_stamina,power_strength,power_long_shots,mentality_aggression,mentality_interceptions,mentality_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_marking,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes
count,2019.0,2019.0,2019.0,2019.0,2019.0,2019.0,2019.0,2019.0,2019.0,2019.0,2018.0,1885.0,1975.0,1975.0,535.0,1811.0,1811.0,1811.0,1811.0,1811.0,1811.0,207.0,207.0,207.0,207.0,207.0,207.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0
mean,201440.149084,27.476474,182.069341,76.942546,78.287271,80.565131,13301980.0,45352.154532,1.708767,3.216444,2.909812,26107320.0,16.255696,2021.681519,11.671028,70.785202,64.510768,69.625069,73.559912,60.967421,71.692435,79.009662,75.835749,72.695652,80.681159,46.599034,77.676329,60.514371,56.453915,60.875124,70.727948,54.822101,66.994549,59.845391,53.523786,64.882061,70.506938,68.20218,68.37116,68.681368,75.757681,66.460852,70.711596,69.43558,70.292369,70.664519,59.66551,66.199703,56.506938,61.403865,66.539148,57.202676,73.181863,56.188305,56.145689,52.837958,17.665015,17.346878,17.126363,17.518335,17.770565
std,28173.701008,3.974125,6.795261,7.254585,3.231824,3.989467,11578450.0,48229.293241,0.807164,0.727628,0.973008,22799970.0,15.043602,1.322909,6.577698,12.163105,13.640056,8.546668,8.78114,18.195851,8.129466,3.864187,4.45605,6.878215,3.993293,7.6899,3.965454,20.179537,21.604024,19.573572,14.591541,20.632405,20.101465,20.767282,20.236203,14.968815,17.570488,14.389295,13.989177,13.971785,4.920615,14.771024,11.512407,11.731632,15.110689,11.357792,20.966569,17.495697,23.086507,21.973525,13.405306,17.548602,8.294417,22.447608,24.428228,24.829473,21.103536,20.131289,19.246113,20.717136,21.575808
min,1179.0,18.0,158.0,56.0,75.0,75.0,0.0,0.0,1.0,1.0,1.0,1100000.0,1.0,2019.0,1.0,29.0,20.0,38.0,38.0,18.0,39.0,70.0,65.0,43.0,71.0,28.0,65.0,8.0,5.0,7.0,11.0,5.0,8.0,9.0,8.0,11.0,9.0,28.0,26.0,19.0,58.0,20.0,25.0,29.0,19.0,29.0,6.0,11.0,7.0,4.0,11.0,9.0,25.0,7.0,7.0,8.0,1.0,1.0,1.0,1.0,1.0
25%,187584.0,24.0,177.0,72.0,76.0,77.0,7500000.0,20000.0,1.0,3.0,2.0,13300000.0,7.0,2021.0,6.0,64.0,57.0,65.0,70.0,43.0,67.0,76.0,73.0,68.0,78.0,41.0,75.0,52.0,42.0,53.0,69.0,42.0,64.0,49.0,40.0,60.0,70.0,60.0,60.0,61.0,73.0,57.25,64.0,63.0,66.0,64.0,50.25,57.0,34.0,52.0,59.0,46.0,69.0,36.0,33.0,28.0,8.0,8.0,8.0,8.0,8.0
50%,204077.0,27.0,183.0,77.0,77.0,80.0,10000000.0,32000.0,2.0,3.0,3.0,18900000.0,13.0,2022.0,11.0,72.0,69.0,71.0,75.0,70.0,73.0,79.0,75.0,73.0,80.0,47.0,77.0,68.0,63.0,66.0,75.0,60.0,74.0,67.0,57.0,69.0,76.0,70.0,70.0,71.0,75.0,69.0,74.0,71.0,74.0,72.0,68.0,72.0,68.0,70.0,70.0,60.5,74.0,65.0,68.0,63.0,11.0,11.0,11.0,11.0,11.0
75%,221636.5,30.0,187.0,82.0,80.0,83.0,15000000.0,53000.0,2.0,4.0,4.0,30200000.0,22.0,2023.0,17.0,79.0,74.0,75.5,79.0,76.0,78.0,81.0,79.0,77.0,83.0,52.0,80.0,75.0,74.0,75.0,79.0,71.0,79.0,76.0,70.0,75.0,80.0,78.0,78.0,78.0,79.0,77.0,79.0,77.0,79.0,78.0,74.0,79.0,77.0,76.0,76.0,70.0,78.0,75.0,77.0,75.0,14.0,14.0,14.0,14.0,14.0
max,251700.0,41.0,201.0,103.0,94.0,95.0,105500000.0,565000.0,5.0,5.0,5.0,195800000.0,99.0,2026.0,24.0,96.0,93.0,92.0,96.0,90.0,90.0,90.0,92.0,93.0,92.0,65.0,91.0,93.0,95.0,93.0,92.0,90.0,97.0,94.0,94.0,92.0,96.0,96.0,96.0,96.0,96.0,96.0,95.0,95.0,97.0,95.0,94.0,95.0,92.0,95.0,94.0,92.0,96.0,94.0,92.0,90.0,90.0,92.0,93.0,91.0,92.0


### Observation
* sofifa_id is a unique feature.
  
* There are no constant features among the continuous and discrete features.

* wage_eur and value_eur are monetary figures representing a player's earnings and market value, respectively. A value of 0 in either column is technically possible, but it is unusual and most likely indicates an error in the data.


In [14]:
original_data.describe(include='O')

Unnamed: 0,player_url,short_name,long_name,dob,nationality,club,player_positions,preferred_foot,work_rate,body_type,real_face,player_tags,team_position,loaned_from,joined,nation_position,player_traits,ls,st,rs,lw,lf,cf,rf,rw,lam,cam,ram,lm,lcm,cm,rcm,rm,lwb,ldm,cdm,rdm,rwb,lb,lcb,cb,rcb,rb
count,2019,2019,2019,2019,2019,2019,2019,2019,2018,2018,2018,538,1975,88,1887,535,1649,1811,1811,1811,1811,1811,1811,1811,1811,1811,1811,1811,1811,1811,1811,1811,1811,1811,1811,1811,1811,1811,1811,1811,1811,1811,1811
unique,2019,1985,2018,1638,97,282,265,2,8,9,2,82,29,47,580,26,550,80,80,80,92,89,89,89,92,88,88,88,84,71,71,71,84,82,81,81,81,82,85,99,99,99,85
top,https://sofifa.com/player/193683/xavier-chaval...,Danilo,Lisandro López,1988-02-29,Spain,Real Madrid,CB,Right,Medium/Medium,Normal,Yes,#Strength,SUB,Real Madrid,2018-07-01,SUB,Early Crosser,73+2,73+2,73+2,74+2,74+2,74+2,74+2,74+2,74+2,74+2,74+2,74+2,71+2,71+2,71+2,74+2,72+2,74+2,74+2,74+2,72+2,74+2,74+2,74+2,74+2,74+2
freq,1,3,2,31,229,25,277,1552,745,1188,1046,113,602,7,102,223,64,107,107,107,159,171,171,171,159,157,157,157,160,120,120,120,160,98,145,145,145,98,101,141,141,141,101


### Observation
* There are no constant features among the categorical features.
* player_url is a unique feature.

In [15]:
column_list = fifa.columns.tolist()
print(column_list)

['sofifa_id', 'player_url', 'short_name', 'long_name', 'age', 'dob', 'height_cm', 'weight_kg', 'nationality', 'club', 'overall', 'potential', 'value_eur', 'wage_eur', 'player_positions', 'preferred_foot', 'international_reputation', 'weak_foot', 'skill_moves', 'work_rate', 'body_type', 'real_face', 'release_clause_eur', 'player_tags', 'team_position', 'team_jersey_number', 'loaned_from', 'joined', 'contract_valid_until', 'nation_position', 'nation_jersey_number', 'pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic', 'gk_diving', 'gk_handling', 'gk_kicking', 'gk_reflexes', 'gk_speed', 'gk_positioning', 'player_traits', 'attacking_crossing', 'attacking_finishing', 'attacking_heading_accuracy', 'attacking_short_passing', 'attacking_volleys', 'skill_dribbling', 'skill_curve', 'skill_fk_accuracy', 'skill_long_passing', 'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed', 'movement_agility', 'movement_reactions', 'movement_balance', 'power_shot_power', 'power_

## 3. Univariate Analysis

In [16]:
univariate = fifa[[ 'age', 'height_cm', 'weight_kg', 'nationality', 'club', 'overall', 'potential', 'value_eur',
                   'wage_eur', 'player_positions', 'preferred_foot', 'international_reputation', 'weak_foot',
                   'skill_moves', 'work_rate', 'body_type', 'real_face', 'release_clause_eur', 'player_tags',
                   'team_position', 'joined', 'contract_valid_until', 'nation_position', 'pace', 'shooting',
                   'passing', 'dribbling', 'defending', 'physic', 'gk_diving', 'gk_handling', 'gk_kicking',
                   'gk_reflexes', 'gk_speed', 'gk_positioning', 'player_traits', 'attacking_crossing',
                   'attacking_finishing', 'attacking_heading_accuracy', 'attacking_short_passing',
                   'attacking_volleys', 'skill_dribbling', 'skill_curve', 'skill_fk_accuracy', 'skill_long_passing',
                   'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed', 'movement_agility',
                   'movement_reactions', 'movement_balance', 'power_shot_power', 'power_jumping', 'power_stamina',
                   'power_strength', 'power_long_shots', 'mentality_aggression', 'mentality_interceptions',
                   'mentality_positioning', 'mentality_vision', 'mentality_penalties', 'mentality_composure',
                   'defending_marking', 'defending_standing_tackle', 'defending_sliding_tackle', 'goalkeeping_diving',
                   'goalkeeping_handling', 'goalkeeping_kicking', 'goalkeeping_positioning', 'goalkeeping_reflexes',
                   'ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam', 'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm',
                   'lwb', 'ldm', 'cdm', 'rdm', 'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb']]
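import numpy as np
import sweetviz as sv

# Workaround: newer NumPy releases no longer expose VisibleDeprecationWarning at the
# top level, which the sweetviz call below still looks up; restore the alias if missing.
if not hasattr(np, 'VisibleDeprecationWarning'):
    np.VisibleDeprecationWarning = np.exceptions.VisibleDeprecationWarning

my_report = sv.analyze(univariate)
my_report.show_html()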

### Key Observations
* Players' ages range from 16 to 42 years, with an average of 25. Only 5% of the players are above 33, so most players are young.
* Height ranges from 156 to 205 cm. More than 50% of players are between 175 and 188 cm.
* Weight ranges from 50 to 110 kg. More than 50% of players are between 70 and 80 kg.
* England accounts for the most players (8%), followed by Germany (7%).
* More than 45 clubs have 33 players each.
* A player's overall rating ranges from 48 to 94. Only 3.1% of players are rated above 80, and only 10 players are rated above 90.
* Player value (in euro) ranges from 0 to 105.5M; only 5% of players are valued above 10.5M.
* Wage (in euro) ranges from 0 to 565K, with a median of 3K. Only 5% of players earn more than 38K.
* Multiple positions are listed for each player, according to their ratings in each position.
* Most of the players are right-footed.
* 92% of players have an international reputation of 1. Only 6 players have an international reputation of 5.
* 62% of players have a weak foot rating of 3.
* More than 50% of players have a medium attacking and defensive work rate.
* 89% of players do not have their real face in the game.
* 95% of players have a release clause below 20M. The top player has a release clause of 195.8M.
* 43% of the players play in a substitute position.
* For 89% of players, contract validity ends in 2023.
* Only 5% of the players have all skill ratings above 70.
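
Percentages like the ones above can be recomputed directly with pandas rather than read off a report. The sketch below is illustrative only: `toy` is a small made-up frame standing in for `fifa`, and `share_above` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the `fifa` DataFrame (values are made up for illustration).
toy = pd.DataFrame({
    "age": [17, 21, 24, 25, 27, 29, 31, 34, 36, 40],
    "overall": [55, 62, 66, 70, 74, 78, 81, 83, 88, 91],
})

def share_above(series, threshold):
    """Return the percentage of values strictly greater than `threshold`."""
    return (series > threshold).mean() * 100

age_p95 = toy["age"].quantile(0.95)            # age below which 95% of players fall
pct_over_80 = share_above(toy["overall"], 80)  # % of players rated above 80
n_over_90 = int((toy["overall"] > 90).sum())   # number of players rated above 90

print(age_p95, pct_over_80, n_over_90)
```

Applied to the real data, the same calls would take `fifa["age"]` and `fifa["overall"]` in place of the toy columns.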


## 4.Bivariate and Multivariate Analysis

### Team Position vs Shooting

In [None]:
plt.figure(figsize=(12,10))
sns.barplot(x='team_position',y='shooting',data=fifa)
plt.xticks(rotation=45)
plt.show()

* Center Forward has the highest shooting capacity followed by Striker

### Overall Vs International reputation

In [None]:
plt.figure(figsize=(12,10))
sns.barplot(x=fifa.overall,y=fifa.international_reputation)
plt.show()

* As player's overall rating increases international reputation also increases

### wage_eur, value_eur, release_clause_eur Vs Overall rating

In [None]:
plt.figure(figsize=(15,5))
plt_num = 1

for column in ["wage_eur", "value_eur", "release_clause_eur"]:
    if plt_num <= 3:
        plt.subplot(1,3,plt_num)
        sns.scatterplot(x="overall", y=column, data=fifa)
    plt_num += 1
plt.show()

* As the overall rating increases, a player's value, wage and release clause tend to increase.
* Very few players fall in the topmost band. They look like outliers, but they are genuine data points.

### wage_eur, value_eur, release_clause_eur Vs international reputation

In [None]:
plt.figure(figsize=(15,5))
plt_num = 1

for column in [ "wage_eur","value_eur", "release_clause_eur"]:
    if plt_num <= 3:
        plt.subplot(1,3,plt_num)
        sns.barplot(x="international_reputation", y=column, data=fifa)
    plt_num += 1
plt.show()

* As international reputation increases, a player's value, wage and release clause tend to increase.

### Weak foot vs some skills, differentiating on the basis of preferred foot

In [None]:
plt.figure(figsize=(15,20))
plt_num = 1

for column in ["overall",'skill_fk_accuracy','skill_curve','shooting','passing','dribbling','attacking_crossing','skill_ball_control',
               'power_shot_power','power_long_shots','movement_acceleration','movement_sprint_speed','defending']:
    if plt_num <= 15:
        plt.subplot(5,3,plt_num)
        sns.barplot(x="weak_foot", y=column, data=fifa,hue='preferred_foot')
    plt_num += 1
plt.tight_layout()  # must run before show(); it has no effect on an already rendered figure
plt.show()

* With a good weak_foot rating, a player's overall rating and skills such as freekick accuracy, skill_curve, shooting, passing, dribbling, attacking_crossing, skill_ball_control, power_shot_power, power_long_shots, movement_acceleration and movement_sprint_speed tend to increase.


* A good weak_foot rating does not, however, go with better defending skills.


* With a good weak_foot rating, left-footed players do well with their right foot in freekick accuracy, curve, passing, attacking_crossing, ball_control, power_long_shots, acceleration and sprint speed.

### Age Vs some outfield skills

In [None]:
plt.figure(figsize=(15,15))
plt_num = 1

for column in ["pace", "shooting", "passing", "dribbling", "defending"]:
    if plt_num <= 6:
        plt.subplot(2,3,plt_num)
        plt.grid(True)
        sns.scatterplot(x="age", y=column, data=fifa)
    plt_num += 1
plt.show()

* Between the ages of 32 and 33, players' ratings in "pace", "shooting", "passing", "dribbling" and "defending" start to fall.

#  Prepare a rank ordered list of top 10 countries with most players. Which countries are producing the most footballers that play at this level?


In [None]:
df_countries=fifa.groupby('nationality').size().sort_values(ascending=False).reset_index(name='Count')
df_countries['Rank']=df_countries['Count'].rank(ascending=False)
country_rank=df_countries.head(10)
country_rank

In [None]:
plt.figure(figsize=(7,7))
plt.pie(country_rank.Count,
        labels=country_rank.nationality,
        autopct='%.1f',
        explode=[.3,0,0,0,0,0,0,0,0,0])
plt.show()

# Plot the distribution of overall rating vs. age of players. Interpret what is the age after which a player stops improving?

In [None]:
plt.figure(figsize=(12, 6))
sns.scatterplot(x='age', y='overall', data=fifa)
plt.title('Distribution of Overall Rating vs. Age of Players')
plt.xlabel('Age')
plt.ylabel('Overall Rating')
plt.grid(True)
plt.show()

### **At the age of 34, a player stops improving.**
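
A scatter plot with thousands of points can make the turning point hard to read. One way to check the conclusion numerically is to average the overall rating per age and look for the peak. This is a minimal sketch on made-up data; `toy` stands in for `fifa[['age', 'overall']]`, and the numbers are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for `fifa[['age', 'overall']]` (made-up values).
toy = pd.DataFrame({
    "age":     [20, 20, 25, 25, 30, 30, 34, 34, 38, 38],
    "overall": [60, 62, 68, 70, 72, 74, 73, 75, 66, 68],
})

# Average rating per age, then the age at which that average peaks.
mean_by_age = toy.groupby("age")["overall"].mean()
peak_age = mean_by_age.idxmax()

print(mean_by_age)
print("Age with the highest mean overall rating:", peak_age)
```

On the real data, `mean_by_age.plot()` would give a line that is easier to interpret than the raw scatter.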

# Which type of offensive players tends to get paid the most: the striker, the right-winger, or the left-winger?


In [None]:
original_data[['player_positions','ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam', 'ram', 'lm', 'lcm', 'cm',
               'rcm', 'rm', 'lwb', 'ldm', 'cdm', 'rdm', 'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb']].head(10)

* Let's assume that the player positions are listed in the 'player_positions' column according to their best ability in those positions and the player's preference.
* The first position in the list is considered their best performing position when cross checked with their position columns' points.
* So we can split and choose only the first position from the list.

In [None]:
offensive_positions = ['ST', 'RW', 'LW']
fifa['player_positions']=fifa['player_positions'].str.split(',').str[0]


In [None]:
# top20 top-waged player positions
fifa[['player_positions','wage_eur']].sort_values(by=['wage_eur'],ascending=False,ignore_index=True).head(20)


In [None]:
offensive_players=fifa.loc[fifa['player_positions'].isin(offensive_positions)]

In [None]:
# mean() is distorted by the extreme 'wage_eur' values of the top players, so we use median() instead.
median_wages=offensive_players.groupby('player_positions')['wage_eur'].median().sort_values(ascending=False).reset_index(name='Median_wage')
median_wages

In [None]:

sns.barplot(x='player_positions',y='Median_wage',data=median_wages)
plt.show()


### **Strikers are paid the most.**
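
The comparison above uses the median rather than the mean because a handful of very high wages would otherwise dominate the result. The following sketch shows the effect on made-up numbers (the `wages` values are illustrative, not from the dataset).

```python
import pandas as pd

# Made-up weekly wages: most are modest, one star player earns far more.
wages = pd.Series([2_000, 3_000, 3_000, 4_000, 5_000, 565_000])

mean_wage = wages.mean()      # pulled upward by the single extreme value
median_wage = wages.median()  # middle of the sorted values, barely affected

print(mean_wage, median_wage)
```

A single extreme value moves the mean far away from what a typical player earns, while the median stays within the bulk of the data.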

# DATA PREPROCESSING / FEATURE ENGINEERING

## 1.CHECKING AND IMPUTING MISSING VALUE

In [None]:
# Create a DataFrame with column information
column_info_pct = pd.DataFrame({
    'Column': fifa.columns,
    'Non-Null Count': fifa.count(),
    'Null Count': fifa.isnull().sum(),
    'Total': fifa.isnull().count(),
    'Missing Percentage':fifa.isnull().sum()/fifa.isnull().count()*100 ,
    'Dtype': fifa.dtypes
})

# Display the DataFrame without any index
column_info_pct.style.hide(axis="index")

### 1.NUMERICAL


### release_clause_eur

In [None]:
# check the distribution before imputing the NULLs, to decide between mean() and median()
plt.figure(figsize=(5,5)) # defining canvas size
sns.histplot(x=fifa['release_clause_eur'], kde=True, stat='density') # histplot replaces the deprecated distplot
plt.xlabel('release_clause_eur',fontsize=20)
plt.show()

* release_clause_eur has skewed distribution. So we replace NULL value with median.
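
Besides inspecting the plot, skewness can be checked numerically. The sketch below assumes a right-skewed column and uses made-up values in place of `release_clause_eur`; a clearly positive `skew()` together with a mean well above the median supports choosing the median for imputation.

```python
import pandas as pd

# Made-up right-skewed values standing in for `release_clause_eur` (in millions).
values = pd.Series([0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 3.0, 5.0, 40.0, 195.8])

skewness = values.skew()   # > 0 indicates a longer right tail
print("skew:", skewness)
print("mean:", values.mean(), "median:", values.median())
```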

In [None]:
fifa.loc[fifa['release_clause_eur'].isnull()==True,'release_clause_eur']=fifa['release_clause_eur'].median()

### Outfielder skill features- 'pace','shooting','passing','dribbling','defending','physic'


* As observed earlier, the outfield skills pace, shooting, passing, dribbling, defending and physic each have 2036 NULL values, which means they are NULL for all 2036 goalkeepers. We set these features to 0 for the goalkeepers.

In [None]:
# Goalkeepers have no outfield ratings; fill the NULLs with 0
outfield_cols = ['pace','shooting','passing','dribbling','defending','physic']
fifa[outfield_cols] = fifa[outfield_cols].fillna(0)

In [None]:
# Check missing value after imputation
print('release_clause_eur:',fifa['release_clause_eur'].isnull().sum())
print('pace:',fifa['pace'].isnull().sum())
print('shooting:',fifa['shooting'].isnull().sum())
print('passing:',fifa['passing'].isnull().sum())
print('dribbling:',fifa['dribbling'].isnull().sum())
print('defending:',fifa['defending'].isnull().sum())
print('physic:',fifa['physic'].isnull().sum())

### Goalkeeper skill features - 'gk_diving', 'gk_handling', 'gk_kicking', 'gk_reflexes', 'gk_speed', 'gk_positioning'

* As observed earlier, for goalkeeping skills like gk_diving, gk_handling, gk_kicking, gk_reflexes, gk_speed, gk_positioning, there are only 2036 NON-NULL values. It means, for all the 16242 outfielders, these features are kept NULL. These features should be made 0 for all the 16242 outfielders.

In [None]:
# checking non null counts
print('gk_diving:',fifa['gk_diving'].count())
print('gk_handling:',fifa['gk_handling'].count())
print('gk_kicking:',fifa['gk_kicking'].count())
print('gk_reflexes:',fifa['gk_reflexes'].count())
print('gk_speed:',fifa['gk_speed'].count())
print('gk_positioning:',fifa['gk_positioning'].count())

In [None]:
# Outfielders have no goalkeeping ratings; fill the NULLs with 0
gk_cols = ['gk_diving','gk_handling','gk_kicking','gk_reflexes','gk_speed','gk_positioning']
fifa[gk_cols] = fifa[gk_cols].fillna(0)

### 2.CATEGORICAL

### 'Team Position'

In [None]:
# Getting the value counts of team position
fifa.team_position.value_counts().head(2)

In [None]:
# Impute categorical data using the mode (most frequent value), here 'SUB'
fifa['team_position'] = fifa['team_position'].fillna('SUB')

In [None]:
# Check missing value after imputation
fifa['team_position'].isnull().sum()

### Position features - 'ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam', 'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb', 'ldm', 'cdm', 'rdm', 'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb'

In [None]:
# Split the column values and remove the '+' and the values after them.
pos=['ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam', 'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb',
     'ldm', 'cdm', 'rdm', 'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb']
for i in pos:
    fifa[i]=fifa[i].str.split('+',expand=True)[0]

    # Changing the datatype from obj to float
    fifa[i]=fifa[i].astype(float)
fifa.head(1)

* As observed earlier, for all the position features like 'ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam', 'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb', 'ldm', 'cdm', 'rdm', 'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb', there are 2036 NULL values. It means for all the 2036 goalkeepers these features are kept NULL. These features should be made 0 for all the 2036 goalkeepers.

In [None]:
# Impute missing values: goalkeepers have no ratings in the outfield position columns listed in `pos`
fifa[pos] = fifa[pos].fillna(0)


In [None]:
# Create a DataFrame with column information to check the null values
column_info_pct = pd.DataFrame({
    'Column': fifa.columns,
    'Non-Null Count': fifa.count(),
    'Null Count': fifa.isnull().sum(),
    'Total': fifa.isnull().count(),
    'Missing Percentage':fifa.isnull().sum()/fifa.isnull().count()*100 ,
    'Dtype': fifa.dtypes
})

# Display the DataFrame without any index
column_info_pct.style.hide(axis="index")

## 2.ENCODING CATEGORICAL DATA

### 1. 'Player_positions'

In [None]:
# Getting value counts of player positions
fifa.player_positions.value_counts()

In [None]:
# Use manual encoding because there are many labels
fifa.player_positions = fifa.player_positions.map({'CB':14,'ST':13,'CM':12,'GK':11,'CDM':10,'RB':9,'LB':8,
                                                   'CAM':7,'RM':6,'LM':5,'LW':4,'RW':3,'CF':2,'LWB':1,'RWB':0})
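
The hand-written dictionary above appears to assign codes in order of label frequency (most frequent label gets the highest code); that reading is an assumption, since the `value_counts()` output is not shown. Under that assumption, the same kind of mapping can be derived from the data instead of being typed by hand. The sketch uses a made-up `positions` series, and `freq_rank` is a hypothetical name.

```python
import pandas as pd

# Made-up primary positions standing in for `fifa['player_positions']`.
positions = pd.Series(["CB", "CB", "CB", "ST", "ST", "GK"])

counts = positions.value_counts()   # sorted from most to least frequent
n = len(counts)
# Most frequent label gets the highest code, least frequent gets 0.
freq_rank = {label: n - 1 - i for i, label in enumerate(counts.index)}

encoded = positions.map(freq_rank)
print(freq_rank)
```

Deriving the mapping this way avoids silent `NaN` values when a label is missing from a hand-typed dictionary. Note that integer codes impose an order on the positions, which distance-based clustering will treat as meaningful.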

### 2. 'preferred_foot'

In [None]:
# Getting value counts of preferred foot
fifa.preferred_foot.value_counts()

In [None]:
fifa.preferred_foot = fifa.preferred_foot.map({'Right':1,'Left':0})

### 3. team_position

In [None]:
# Getting the value counts of team position
fifa.team_position.value_counts()

In [None]:
# Use manual encoding because there are many labels
fifa.team_position = fifa.team_position.map({'SUB':28,'RES':27,'GK':26,'RCB':25,'LCB':24,'RB':23,'LB':22,'ST':21,
                                            'RCM':20,'LCM':19,'RM':18,'LM':17,'CAM':16,'RDM':15,'LDM':14,'RS':13,
                                            'LS':12,'CDM':11,'LW':10,'RW':9,'CB':8,'CM':7,'RWB':6,'LWB':5,'RAM':4,
                                            'LAM':3,'RF':2,'LF':1,'CF':0})

### 4. work_rate
In the project document, it is expected to do the following:
- "This feature is divided into two new features as AttackWorkRate and DefenseWorkRate. Besides, label encoder is applied as 0 for low, 0.5 for medium and 1 for high."

In [None]:
fifa['work_rate'].value_counts()

In [None]:
fifa['AttackWorkRate']=fifa['work_rate'].str.split('/',expand=True)[0]
fifa['DefenseWorkRate']=fifa['work_rate'].str.split('/',expand=True)[1]
fifa[['work_rate','AttackWorkRate','DefenseWorkRate']].tail()

In [None]:
fifa.AttackWorkRate = fifa.AttackWorkRate.map({'Low':0,'Medium':0.5,'High':1})
fifa.DefenseWorkRate = fifa.DefenseWorkRate.map({'Low':0,'Medium':0.5,'High':1})

In [None]:
fifa[['work_rate','AttackWorkRate','DefenseWorkRate']].head(3)

## 3. HANDLING OUTLIERS

* The univariate analysis shows that a very small percentage of players have exceptional values in most features, which is why they have high market values. If the high values in certain features are legitimate and represent exceptional cases (e.g., top-performing players), it might not be appropriate to treat them as outliers.

* If they are valid and essential to understanding the full range of player performance, imputing them to fit with the other values might distort the representation of top-performing players.

* One option is to perform separate analyses for the overall population and for the subset of top performers. This captures the distinct characteristics of elite players without affecting the general trends in the dataset. Here, we do not take any subset.

* Imputation should be done cautiously to preserve the integrity of the data. Replacing extreme values so that they fit with the rest might lead to misrepresentation.

In [None]:
fifa.head(3)

In [None]:
out1= fifa[['age','height_cm','weight_kg','value_eur','wage_eur','release_clause_eur','pace',
           'shooting','dribbling','defending','physic']]


In [None]:
plt.figure(figsize=(20,25)) # defining canvas size
plotno = 1 # counter

for column in out1: # iterate over the selected columns
    if plotno<=11:    # out1 has 11 columns
        plt.subplot(3,4,plotno) # 3x4 grid; plotno is the position in the grid
        sns.boxplot(x=fifa[column]) # box plot of the column
        plt.xlabel(column,fontsize=20)  # x-axis label with font size 20
    plotno+=1 # counter increment
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(20,25)) # defining canvas size
plotno = 1 # counter

for column in out1: # iterate over the selected columns
    if plotno<=11:    # out1 has 11 columns
        plt.subplot(3,4,plotno) # 3x4 grid; plotno is the position in the grid
        sns.histplot(x=fifa[column], kde=True, stat='density') # distribution plot (histplot replaces the deprecated distplot)
        plt.xlabel(column,fontsize=20)  # x-axis label with font size 20
    plotno+=1 # counter increment
plt.tight_layout()
plt.show()

* Defending and shooting do not have outliers.


* 'age','height_cm','weight_kg','value_eur','wage_eur','release_clause_eur','pace','dribbling', and 'physic' have skewed data and outliers are present.


* We will decide whether to impute outliers based on careful consideration of the number of outliers. If the proportion of outliers is around 4% or more, we will lean towards retaining them as they may not be indicative of entry errors. Imputing such outliers carries the risk of influencing the distribution of the data, and we aim to preserve the original characteristics of the dataset.

### Height

In [None]:
# Use the IQR rule because the data is somewhat right-skewed

# Step:1
from scipy import stats
iqr = stats.iqr(fifa['height_cm'],interpolation='midpoint')
print("IQR",iqr)

# step:2
Q1 = fifa['height_cm'].quantile(0.25)  # first quantile
Q3 = fifa['height_cm'].quantile(0.75)  #third quantile
# getting max & min limit
min_limit = Q1 - 1.5*iqr
print('minimum limit',min_limit)
max_limit = Q3 + 1.5*iqr
print('maximum limit',max_limit)


In [None]:
# Identify outliers
outliers = fifa[(fifa['height_cm'] < min_limit) | (fifa['height_cm'] > max_limit)]

print('% of outliers present in height_cm: ',len(outliers)/fifa.shape[0]*100)

* Outliers are around 5%. As a general rule of thumb, we do not want to replace more than approximately 4% of a dataset with imputed values; otherwise we risk distorting the distribution of the data. The values also do not look like data entry errors, as each height is consistent with the corresponding weight.
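
The IQR check used in this section is repeated for several columns, so it can be wrapped in a small helper. The sketch below is one possible formulation, not the notebook's own code: `iqr_outlier_share` is a hypothetical name, and it computes the IQR as Q3 - Q1 with pandas' default linear interpolation rather than `stats.iqr(..., interpolation='midpoint')`, so limits may differ slightly from the cells in this section.

```python
import pandas as pd

def iqr_outlier_share(series, k=1.5):
    """Return (lower limit, upper limit, % of values outside the limits)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outside = (series < lower) | (series > upper)
    return lower, upper, outside.mean() * 100

# Made-up heights in cm; 205 is the only value outside the limits.
heights = pd.Series([170, 172, 175, 178, 180, 181, 183, 185, 188, 205])
lower, upper, pct = iqr_outlier_share(heights)
print(lower, upper, pct)
```

With the real data, a call such as `iqr_outlier_share(fifa['height_cm'])` would return the limits and the outlier percentage in one step, and the same helper could be reused for the other columns.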

### Weight

In [None]:
# Use the IQR rule because the data is somewhat right-skewed

# Step:1
from scipy import stats
iqr = stats.iqr(fifa['weight_kg'],interpolation='midpoint')
print("IQR",iqr)

# step:2
Q1 = fifa['weight_kg'].quantile(0.25)  # first quantile
Q3 = fifa['weight_kg'].quantile(0.75)  #third quantile
# getting max & min limit
min_limit = Q1 - 1.5*iqr
print('minimum limit',min_limit)
max_limit = Q3 + 1.5*iqr
print('maximum limit',max_limit)



In [None]:
# Identify outliers
outliers = fifa[(fifa['weight_kg'] < min_limit) | (fifa['weight_kg'] > max_limit)]

print('% of outliers present in weight_kg: ',len(outliers)/fifa.shape[0]*100)

* Outliers are around 4%. As a general rule of thumb, we do not want to replace more than approximately 4% of a dataset with imputed values; otherwise we risk distorting the distribution of the data. The values also do not look like data entry errors, as each weight is consistent with the corresponding height.

### Dribbling

In [None]:
# Calculate IQR
iqr = stats.iqr(fifa['dribbling'], interpolation='midpoint')

# Calculate lower and upper bounds
Q1 = fifa['dribbling'].quantile(0.25)
Q3 = fifa['dribbling'].quantile(0.75)
min_limit = Q1 - 1.5 * iqr
max_limit = Q3 + 1.5 * iqr

print("IQR",iqr)

print('minimum limit',min_limit)

print('maximum limit',max_limit)





In [None]:
# Identify outliers
outliers = fifa[(fifa['dribbling'] < min_limit) | (fifa['dribbling'] > max_limit)]


In [None]:
print('% of outliers present in dribbling: ',len(outliers)/fifa.shape[0]*100)

* Outliers are more than 5%. As a general rule of thumb, we do not want to replace more than 5% of a dataset with imputed values; otherwise we risk distorting the distribution of the data.

### Age

In [None]:
# Age is almost normally distributed, so we use the 3-sigma rule to find the upper and lower limits
max_limit = fifa.age.mean() + 3*fifa.age.std()
print("Upper limit:",max_limit)
min_limit = fifa.age.mean() - 3*fifa.age.std()
print("Lower limit:",min_limit)


In [None]:
# Identify outliers
outliers = fifa[(fifa['age'] < min_limit) | (fifa['age'] > max_limit)]

print('% of outliers present in age: ',len(outliers)/fifa.shape[0]*100)

In [None]:
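# Impute outliers on both the lower and upper side with the mean, as the distribution looks almost normal.
# The mean is computed once so that both sides receive the same value.
age_mean = fifa['age'].mean()
fifa['age'] = fifa['age'].astype(float)  # the mean is generally not an integer
fifa.loc[(fifa['age'] < min_limit) | (fifa['age'] > max_limit), 'age'] = age_mean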

In [None]:
sns.boxplot(x=fifa['age'])
plt.show()

### pace

In [None]:
# Use the IQR rule because the data is somewhat right-skewed

# Step:1
from scipy import stats
iqr = stats.iqr(fifa['pace'],interpolation='midpoint')
print("IQR",iqr)

# step:2
Q1 = fifa['pace'].quantile(0.25)  # first quantile
Q3 = fifa['pace'].quantile(0.75)  #third quantile
# getting max & min limit
min_limit = Q1 - 1.5*iqr
print('minimum limit',min_limit)
max_limit = Q3 + 1.5*iqr
print('maximum limit',max_limit)


In [None]:
# Identify outliers
outliers = fifa[(fifa['pace'] < min_limit) | (fifa['pace'] > max_limit)]

print('% of outliers present in pace: ',len(outliers)/fifa.shape[0]*100)

* As a general rule of thumb, we do not want to replace more than 5% of a data set with imputed values otherwise we risk influencing the distribution of the data.

In [None]:

# impute outlier both lower and upper side
#fifa.loc[fifa['shooting'] <  min_limit,'shooting'] = fifa['shooting'].median()
#fifa.loc[fifa['shooting'] > max_limit,'shooting'] = fifa['shooting'].median()

### Physic

In [None]:
# Step:1
from scipy import stats
iqr = stats.iqr(fifa['physic'],interpolation='midpoint')
print("IQR",iqr)

# step:2
Q1 = fifa['physic'].quantile(0.25)  # first quantile
Q3 = fifa['physic'].quantile(0.75)  #third quantile
# getting max & min limit
min_limit = Q1 - 1.5*iqr
print('minimum limit',min_limit)
max_limit = Q3 + 1.5*iqr
print('maximum limit',max_limit)



In [None]:
# Identify outliers
outliers = fifa[(fifa['physic'] < min_limit) | (fifa['physic'] > max_limit)]

print('% of outliers present in physic: ',len(outliers)/fifa.shape[0]*100)

* As a general rule of thumb, we do not want to replace more than 5% of a data set with imputed values otherwise we risk influencing the distribution of the data.

### value_eur

In [None]:
# Calculate IQR
iqr = stats.iqr(fifa['value_eur'], interpolation='midpoint')

# Calculate lower and upper bounds
Q1 = fifa['value_eur'].quantile(0.25)
Q3 = fifa['value_eur'].quantile(0.75)
min_limit = Q1 - 1.5 * iqr
max_limit = Q3 + 1.5 * iqr

print("IQR",iqr)

print('minimum limit',min_limit)

print('maximum limit',max_limit)


In [None]:
# Identify outliers
outliers = fifa[(fifa['value_eur'] < min_limit) | (fifa['value_eur'] > max_limit)]

print('% of outliers present in value_eur: ',len(outliers)/fifa.shape[0]*100)

* As a general rule of thumb, we do not want to replace more than 5% of a data set with imputed values otherwise we risk influencing the distribution of the data.

### wage_eur

In [None]:
# Calculate IQR
iqr = stats.iqr(fifa['wage_eur'], interpolation='midpoint')

# Calculate lower and upper bounds
Q1 = fifa['wage_eur'].quantile(0.25)
Q3 = fifa['wage_eur'].quantile(0.75)
min_limit = Q1 - 1.5 * iqr
max_limit = Q3 + 1.5 * iqr

print("IQR",iqr)

print('minimum limit',min_limit)

print('maximum limit',max_limit)


In [None]:

# Identify outliers
outliers = fifa[(fifa['wage_eur'] < min_limit) | (fifa['wage_eur'] > max_limit)]

print('% of outliers present in wage_eur: ',len(outliers)/fifa.shape[0]*100)

* As a general rule of thumb, we do not want to replace more than 5% of a data set with imputed values otherwise we risk influencing the distribution of the data.

### Impute 0 values of 'wage_eur' and 'value_eur'

In [None]:
fifa.loc[(fifa['value_eur']==0)|(fifa['wage_eur']==0)].head(3)

In [None]:
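# Impute the 0s in 'value_eur' and 'wage_eur' with the median, as a player cannot have a price of 0.
# Assigning the result back avoids relying on inplace=True on a column accessed as an attribute.
fifa['value_eur'] = fifa['value_eur'].replace(0, fifa['value_eur'].median())
fifa['wage_eur'] = fifa['wage_eur'].replace(0, fifa['wage_eur'].median())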

In [None]:
# check the imputed values of the player at row position 327
fifa.iloc[[327,]]

In [None]:
# check if the imputation has happened
fifa.loc[(fifa['value_eur']==0)|(fifa['wage_eur']==0)].head(3)

### release_clause_eur

In [None]:
# Calculate IQR
iqr = stats.iqr(fifa['release_clause_eur'], interpolation='midpoint')

# Calculate lower and upper bounds
Q1 = fifa['release_clause_eur'].quantile(0.25)
Q3 = fifa['release_clause_eur'].quantile(0.75)
min_limit = Q1 - 1.5 * iqr
max_limit = Q3 + 1.5 * iqr

print("IQR",iqr)

print('minimum limit',min_limit)

print('maximum limit',max_limit)

In [None]:
# Identify outliers
outliers = fifa[(fifa['release_clause_eur'] < min_limit) | (fifa['release_clause_eur'] > max_limit)]

print('% of outliers present in release_clause_eur: ',len(outliers)/fifa.shape[0]*100)

* As a general rule of thumb, we do not want to replace more than 5% of a data set with imputed values otherwise we risk influencing the distribution of the data.

In [None]:

out2=fifa[['gk_diving', 'gk_handling', 'gk_kicking', 'gk_reflexes', 'gk_speed', 'gk_positioning', 'attacking_crossing','attacking_finishing',
           'attacking_heading_accuracy', 'attacking_short_passing', 'attacking_volleys','skill_dribbling', 'skill_curve','skill_fk_accuracy',
           'skill_long_passing', 'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed', 'movement_agility',
           'movement_reactions', 'movement_balance', 'power_shot_power', 'power_jumping','power_stamina','power_strength','power_long_shots',
           'mentality_aggression', 'mentality_interceptions', 'mentality_positioning', 'mentality_vision', 'mentality_penalties',
           'mentality_composure', 'defending_marking', 'defending_standing_tackle', 'defending_sliding_tackle', 'goalkeeping_diving',
           'goalkeeping_handling', 'goalkeeping_kicking', 'goalkeeping_positioning', 'goalkeeping_reflexes']]


In [None]:
plt.figure(figsize=(20,25)) # defining canvas size
plotno = 1 # counter

for column in out2: # iterate over the selected columns
    if plotno<=40:    # out2 has 40 columns
        plt.subplot(10,4,plotno) # 10x4 grid; plotno is the position in the grid
        sns.boxplot(x=fifa[column]) # box plot of the column
        plt.xlabel(column,fontsize=20)  # x-axis label with font size 20
    plotno+=1 # counter increment
plt.tight_layout()
plt.show()

*  #### We take only 'attacking_heading_accuracy', 'skill_fk_accuracy', 'skill_long_passing', 'power_shot_power', 'mentality_vision',  'mentality_penalties' out of out2 to check the distribution because the rest of them either have no outliers or too many outliers.

In [None]:
out2_1=fifa[['attacking_heading_accuracy','skill_fk_accuracy', 'skill_long_passing', 'power_shot_power',
             'mentality_vision', 'mentality_penalties']]

plt.figure(figsize=(15,10)) # defining canvas size
plotno = 1 # counter

for column in out2_1:
    if plotno<=7:
        plt.subplot(2,4,plotno)
        sns.histplot(x=fifa[column], kde=True, stat='density') # histplot replaces the deprecated distplot
        plt.xlabel(column,fontsize=20)
    plotno+=1
plt.tight_layout()
plt.show()

* Since all the plots are somewhat skewed, we use iqr to calculate min limit and max limit

In [None]:
for column in out2_1:
   # max_limit = fifa[column].mean() + 3*fifa[column].std()

    #min_limit = fifa[column].mean() - 3*fifa[column].std()

    # Calculate IQR
    iqr = stats.iqr(fifa[column], interpolation='midpoint')

    # Calculate lower and upper bounds
    Q1 = fifa[column].quantile(0.25)
    Q3 = fifa[column].quantile(0.75)
    min_limit = Q1 - 1.5 * iqr
    max_limit = Q3 + 1.5 * iqr

    # Identify outliers
    outliers = fifa[(fifa[column] < min_limit) | (fifa[column] > max_limit)]
    display(outliers.head(2))

    print(f'% of outliers present in {column}: {len(outliers)/fifa.shape[0]*100}')

* #### attacking_heading_accuracy has more than 5% outliers.
  
* #### In all other features, the outliers beyond the lower or upper limit look genuine when we check the players' international reputation and overall ratings, so we do not impute them.

In [None]:

out3=fifa[['ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam', 'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb',
     'ldm', 'cdm', 'rdm', 'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb']]

In [None]:
plt.figure(figsize=(20,25)) # defining canvas size
plotno = 1 # counter

for column in out3: # iterate over the selected columns
    if plotno<=28:    # out3 has 26 columns; the 7x4 grid has room for 28
        plt.subplot(7,4,plotno) # 7x4 grid; plotno is the position in the grid
        sns.boxplot(x=fifa[column]) # box plot of the column
        plt.xlabel(column,fontsize=20)  # x-axis label with font size 20
    plotno+=1 # counter increment
plt.tight_layout()
plt.show()

* #### The 0 values in the plots are not outliers; they are the values imputed for goalkeepers.

In [None]:
# To check the other outliers in the columns, we plot their distribution plot
out3_1=fifa[['ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam', 'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm']]
plt.figure(figsize=(15,10)) # defining canvas size
plotno = 1 # counter

for column in out3_1: # iterate over the selected columns
    if plotno<=16:    # out3_1 has 16 columns
        plt.subplot(4,4,plotno) # 4x4 grid so that all 16 columns are plotted
        sns.histplot(x=fifa[column], kde=True, stat='density') # distribution plot (histplot replaces the deprecated distplot)
        plt.xlabel(column,fontsize=20)  # x-axis label with font size 20
    plotno+=1 # counter increment
plt.tight_layout()
plt.show()

In [None]:
from IPython.display import display

# since all the plots are normally distributed we use 3 sigma to calculate min limit and max limit
for column in out3_1:
    max_limit = fifa[column].mean() + 3*fifa[column].std()

    min_limit = fifa[column].mean() - 3*fifa[column].std()

    # Identify outliers
    outliers = fifa[(fifa[column] < min_limit) | (fifa[column] > max_limit)]
    display(outliers.head(1))

    print(f'% of outliers present in {column}: {len(outliers)/fifa.shape[0]*100}')

* #### The outliers in the boxplots are actually not outliers.
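
The 3-sigma check used in the loop above can be written as a small reusable function. This is a sketch on made-up data: `three_sigma_share` is a hypothetical helper name, and `ratings` stands in for one of the position columns.

```python
import pandas as pd

def three_sigma_share(series):
    """Return (lower limit, upper limit, % of values beyond mean +/- 3*std)."""
    mean, std = series.mean(), series.std()   # pandas std uses ddof=1
    lower, upper = mean - 3 * std, mean + 3 * std
    outside = (series < lower) | (series > upper)
    return lower, upper, outside.mean() * 100

# Made-up position ratings: 20 values near 60 plus one far away.
ratings = pd.Series([58, 59, 60, 61, 62] * 4 + [100])
lower, upper, pct = three_sigma_share(ratings)
print(lower, upper, pct)
```

Because the mean and standard deviation are themselves pulled by extreme values, this rule flags fewer points than the IQR rule on skewed columns, which may explain why points drawn as outliers in a boxplot are not flagged here.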

In [None]:
names = fifa.short_name.tolist()

# FEATURE SCALING AND FEATURE SELECTION

### MIN-MAX SCALING

* Scale each feature to the range 0 to 1.


* We use min-max scaling. Note that the dataset retains a large number of outliers, so for the affected features the scaled values are dominated by those extremes.
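
Min-max scaling maps each value to (x - min) / (max - min), which is what `MinMaxScaler` computes per feature with its default range of 0 to 1. The sketch below applies the formula by hand to made-up wages to show how a single extreme value compresses the rest; `min_max_scale` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def min_max_scale(x):
    """Scale a 1-D array to [0, 1] via (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Made-up wages (in thousands of EUR) with one extreme value.
wages = np.array([1.0, 2.0, 3.0, 4.0, 565.0])
scaled = min_max_scale(wages)
print(scaled)
```

Every value except the maximum ends up very close to 0, so differences among typical players become small on such a feature after scaling.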

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
column_list = fifa.columns.tolist()
print(column_list)

In [None]:
# We remove following columns as they do not contribute much to skills
d1=['age','sofifa_id', 'player_url', 'short_name', 'long_name', 'dob','nationality', 'club', 'overall',
    'potential','international_reputation','work_rate','body_type', 'real_face','player_tags',
    'team_jersey_number', 'loaned_from', 'joined', 'contract_valid_until', 'nation_position',
    'nation_jersey_number','player_traits']
minmaxscaled_data=scaler.fit_transform(fifa.drop(d1,axis=1))

In [None]:
data1=pd.DataFrame(minmaxscaled_data,columns=['height_cm', 'weight_kg','value_eur', 'wage_eur', 'player_positions',
                                  'preferred_foot','weak_foot', 'skill_moves','release_clause_eur','team_position',
                                 'pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic', 'gk_diving',
                                  'gk_handling', 'gk_kicking', 'gk_reflexes', 'gk_speed', 'gk_positioning',
                                 'attacking_crossing', 'attacking_finishing', 'attacking_heading_accuracy',
                                  'attacking_short_passing', 'attacking_volleys', 'skill_dribbling', 'skill_curve',
                                  'skill_fk_accuracy', 'skill_long_passing', 'skill_ball_control',
                                  'movement_acceleration', 'movement_sprint_speed', 'movement_agility',
                                  'movement_reactions', 'movement_balance', 'power_shot_power', 'power_jumping',
                                  'power_stamina', 'power_strength', 'power_long_shots', 'mentality_aggression',
                                  'mentality_interceptions', 'mentality_positioning', 'mentality_vision',
                                  'mentality_penalties', 'mentality_composure', 'defending_marking',
                                  'defending_standing_tackle', 'defending_sliding_tackle', 'goalkeeping_diving',
                                  'goalkeeping_handling', 'goalkeeping_kicking', 'goalkeeping_positioning',
                                  'goalkeeping_reflexes', 'ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam',
                                  'cam', 'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb', 'ldm', 'cdm', 'rdm',
                                  'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb', 'AttackWorkRate', 'DefenseWorkRate'])

In [None]:
data1.head(1)

In [None]:
fifa.head(1)

In [None]:
data1.shape

In [None]:
# pip install fast_ml

### Check duplicate features

In [None]:

from fast_ml.utilities import display_all
from fast_ml.feature_selection import get_duplicate_features

get_duplicate_features(data1)

* #### From each group of duplicate features, the one listed in feature1 (ls, lw, lf, lam, lm, lcm, lwb, ldm, lb, lcb) is kept.

*   #### The features listed in feature2 are dropped.
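For readers without `fast_ml` installed, the same idea (find columns whose values repeat an earlier column exactly) can be sketched with plain pandas. This is an alternative technique, not what `get_duplicate_features` does internally, and the toy frame below is hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical values): 'st' and 'rs' repeat 'ls' exactly.
toy = pd.DataFrame({
    "ls": [0.9, 0.5, 0.1],
    "st": [0.9, 0.5, 0.1],
    "rs": [0.9, 0.5, 0.1],
    "lw": [0.8, 0.6, 0.2],
})

# Transpose so columns become rows; duplicated() then marks every
# column that repeats an earlier one (the first occurrence is kept).
is_duplicate = toy.T.duplicated(keep="first")
to_drop = is_duplicate[is_duplicate].index.tolist()

deduplicated = toy.drop(columns=to_drop)
print(to_drop)
```

Transposing a wide frame copies the data, so this approach is best suited to datasets of moderate size such as this one.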

In [None]:
d2 = ['st', 'rs', 'rw', 'cf', 'rf', 'cam', 'ram', 'rm','cm', 'rcm', 'rwb', 'cdm', 'rdm', 'rb', 'cb', 'rcb']

In [None]:

data2 = data1.drop(d2, axis=1)

In [None]:
data2.shape

In [None]:
X=data2.loc[:,:]

In [None]:
# Plot heatmap with features more than 0.92 correlated
plt.figure(figsize=(40,40))
sns.heatmap(X.corr()[X.corr()>0.92], annot=True)
plt.show()

* ####  With this many features the heatmap is hard to read, so the highly correlated pairs are listed in a table instead.

In [None]:
# checking highly correlated features that are above 0.92
pd.set_option('display.max_rows',None)
corrmat = X.corr()
corrmat = corrmat.abs().unstack() # absolute value of corr coef
corrmat = corrmat.sort_values(ascending=False)
corrmat = corrmat[corrmat >= 0.92]
corrmat = corrmat[corrmat < 1]
corrmat = pd.DataFrame(corrmat).reset_index()
corrmat.columns = ['feature1', 'feature2', 'corr']
corrmat
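The listing above contains every pair twice, once as (a, b) and once as (b, a). A possible variant keeps only the upper triangle of the correlation matrix so each pair appears once. The helper name `correlated_pairs` and the toy data are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.92) -> pd.DataFrame:
    """List each feature pair once if its absolute correlation >= threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle: this removes the diagonal
    # (self-correlation) and the mirrored (b, a) copy of every pair.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().reset_index()
    pairs.columns = ["feature1", "feature2", "corr"]
    pairs = pairs[pairs["corr"] >= threshold]  # also discards any NaN rows
    return pairs.sort_values("corr", ascending=False).reset_index(drop=True)

# Toy data (hypothetical): 'b' is almost a copy of 'a', 'c' is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})

pairs = correlated_pairs(toy)
print(pairs)
```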

In [None]:
pd.reset_option('display.max_rows')

In [None]:
# The following features are highly correlated with other features in the dataset, so they are removed.
d3=['gk_diving', 'gk_reflexes', 'gk_positioning', 'gk_handling',
'gk_speed','lf', 'lw','lm','lb','gk_kicking','lam','lwb','ls','lcm','lcb', 'ldm','defending_standing_tackle',
'attacking_finishing','defending_marking','release_clause_eur','defending_sliding_tackle', 'goalkeeping_kicking',
'power_long_shots','attacking_short_passing','goalkeeping_reflexes',
'goalkeeping_positioning','goalkeeping_handling',
'mentality_interceptions','mentality_positioning','movement_acceleration','team_position','skill_moves',
'dribbling','skill_ball_control','value_eur','wage_eur']

data3 = data2.drop(d3, axis=1)

In [None]:
# After removing the correlated features checking whether there are any highly correlated features.
X=data3.loc[:,:]
pd.set_option('display.max_rows',None)
corrmat = X.corr()
corrmat = corrmat.abs().unstack() # absolute value of corr coef
corrmat = corrmat.sort_values(ascending=False)
corrmat = corrmat[corrmat >= 0.92]
corrmat = corrmat[corrmat < 1]
corrmat = pd.DataFrame(corrmat).reset_index()
corrmat.columns = ['feature1', 'feature2', 'corr']
corrmat

In [None]:
data3.head(1)

In [None]:
data3.shape

### Applying PCA for dimensionality reduction

In [None]:
from sklearn.decomposition import PCA
pca = PCA()
principalComponents = pca.fit_transform(data3)

In [None]:
plt.figure()
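n_comp = np.arange(1, len(pca.explained_variance_ratio_) + 1) # component counts start at 1
plt.plot(n_comp, np.cumsum(pca.explained_variance_ratio_)*100,marker='*',color='k')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance (%)')
plt.title('Explained Variance')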
plt.grid(True)
plt.show()

* #### The first 10 components explain about 92% of the variance in the data.
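Instead of reading the component count off the plot, it can also be computed. The sketch below shows two ways on synthetic data: taking the first point where the cumulative curve reaches the target, and passing the target as a float to `PCA`, which scikit-learn supports. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data (hypothetical): 500 samples, 20 features driven by 5 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
X_toy = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(500, 20))

target = 0.92  # the variance share quoted above

# Option 1: first component count whose cumulative share reaches the target.
cumulative = np.cumsum(PCA().fit(X_toy).explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= target)) + 1

# Option 2: let scikit-learn pick the count from a float in (0, 1).
pca_auto = PCA(n_components=target, svd_solver="full").fit(X_toy)

print(n_components, pca_auto.n_components_)
```

Both options should agree. The second is shorter, while the first keeps the full curve available for plotting.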

In [None]:
pca = PCA(n_components=10)
new_data = pca.fit_transform(data3)

In [None]:
PC_df = pd.DataFrame(data = new_data,
                     columns = ['PC1', 'PC2','PC3','PC4','PC5','PC6', 'PC7','PC8','PC9','PC10'])

# MODEL BUILDING

## 1. K-Means Clustering

In [None]:
from sklearn.cluster import KMeans
wcss_pca = []
for cluster in range(2,11):
    kme_clu_pca = KMeans(n_clusters=cluster, random_state=9)
    kme_clu_pca.fit(PC_df)
    wcss_pca.append(kme_clu_pca.inertia_)

plt.plot(range(2,11), wcss_pca)
plt.title("Elbow method")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS(for PCA)")
plt.grid(True)
plt.show()

In [None]:
from sklearn.metrics import silhouette_score

for n in range(3,6):
    model = KMeans(n_clusters=n, random_state=10)
    model.fit(PC_df)
    label=model.labels_
    score=silhouette_score(PC_df,label)
    print(f'The silhouette score for {n} clusters is {score}')
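The loop above prints the scores and the cluster count is then chosen by eye. As a small sketch of making that choice explicit, the scores can be stored in a dictionary and the best `k` taken with `max`. The blob data is synthetic and only stands in for `PC_df`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data (hypothetical): three well-separated blobs.
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# The k with the highest silhouette score is the candidate to keep.
best_k = max(scores, key=scores.get)
print(best_k)
```

On real data the best silhouette score is one input among several; the elbow plot and the interpretability of the clusters also matter.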


In [None]:
model = KMeans(n_clusters=3, random_state=10)
model.fit(PC_df)

In [None]:
label=model.labels_
centroid = model.cluster_centers_
clusters=label.tolist()

In [None]:
reduced=PC_df.copy()
reduced['cluster'] = clusters
reduced['name'] = names
reduced.head(20)

In [None]:
# To plot in 2D,
pca = PCA(n_components = 2) # 2D PCA for the plot
new_data_for_plot = pd.DataFrame(pca.fit_transform(data3))

In [None]:

new_data_for_plot['cluster'] = clusters
new_data_for_plot['name'] = names
new_data_for_plot.columns = ['x', 'y', 'cluster', 'name']
new_data_for_plot.head(10)

#### **2D plot of the top 100 players in the dataset.**

In [None]:
plot_data=new_data_for_plot.head(100)
sns.set(style="white")

ax = sns.lmplot(x="x", y="y", hue='cluster', data = plot_data, legend=False,
                   fit_reg=False, height = 15, scatter_kws={"s": 250})

texts = []
for x, y, s in zip(plot_data.x, plot_data.y, plot_data.name):
    texts.append(plt.text(x, y, s))

ax.set(ylim=(-2, 2))
plt.tick_params(labelsize=15)
plt.xlabel("PC 1", fontsize = 20)
plt.ylabel("PC 2", fontsize = 20)

plt.show()

## 2. DBSCAN

## Plot to choose min_samples and epsilon value for DBSCAN Model

In [None]:
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

# Heuristic used here: set n_neighbors to the square root of the number of samples in the dataset.
n_samples = PC_df.shape[0]
n_neighbors_value = int(np.sqrt(n_samples))
print(f'No.of neighbors:{n_neighbors_value}')

# Initialize NearestNeighbors with the chosen n_neighbors value
neighbors = NearestNeighbors(n_neighbors=n_neighbors_value)
neighbors.fit(PC_df)
distances, indices = neighbors.kneighbors(PC_df)


# Sort the distances and plot the k-distance graph
distances = np.sort(distances[:,n_neighbors_value-1], axis=0)
plt.plot(distances)
plt.xlabel('Data Points')
plt.ylabel(f'Distance to {n_neighbors_value}th Nearest Neighbor')
plt.grid(True)
plt.show()

* #### The curve changes behavior at a distance of about 0.8, so 0.8 is used as the epsilon value.

In [None]:
from sklearn.cluster import DBSCAN
dbscan=DBSCAN(eps=0.8,min_samples=135)

In [None]:
model=dbscan.fit(PC_df)

labels=model.labels_

In [None]:
n_clusters=len(set(labels))- (1 if -1 in labels else 0)
n_clusters

In [None]:
clusters = labels.tolist()

In [None]:
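from sklearn import metrics

# Note: points labelled -1 (noise) are counted as one more cluster in this score
print(metrics.silhouette_score(PC_df,labels))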
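DBSCAN labels noise points as -1, and a silhouette score computed on all labels treats that noise as if it were a cluster. The sketch below, on synthetic data, reports the noise share and compares the score with and without noise points. All names and values here are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data (hypothetical): two dense blobs plus a few scattered points.
X_blobs, _ = make_blobs(
    n_samples=200, centers=[[0, 0], [8, 8]], cluster_std=0.5, random_state=0
)
rng = np.random.default_rng(0)
X_toy = np.vstack([X_blobs, rng.uniform(-20, 30, size=(10, 2))])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X_toy)

noise_share = np.mean(labels == -1)
clustered = labels != -1  # -1 is DBSCAN's label for noise, not a cluster

score_all = silhouette_score(X_toy, labels)  # noise counted as a "cluster"
score_clustered = silhouette_score(X_toy[clustered], labels[clustered])

print(f"noise share: {noise_share:.1%}")
print(f"silhouette (all points): {score_all:.3f}")
print(f"silhouette (clustered points only): {score_clustered:.3f}")
```

Reporting both numbers, together with the noise share, makes a DBSCAN score easier to compare with K-Means and hierarchical clustering, which assign every point to a cluster.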

In [None]:
reduced=PC_df.copy()
reduced['cluster'] = clusters
reduced['name'] = names
reduced.head(10)

In [None]:
# To plot in 2D
pca = PCA(n_components = 2) # 2D PCA for the plot
new_data_for_plot = pd.DataFrame(pca.fit_transform(data3))

In [None]:
new_data_for_plot['cluster'] = clusters
new_data_for_plot['name'] = names
new_data_for_plot.columns = ['x', 'y', 'cluster', 'name']
new_data_for_plot.head(10)

#### **2D plot of the top 100 players in the dataset.**

In [None]:
plot_data=new_data_for_plot.head(100)
sns.set(style="white")

ax = sns.lmplot(x="x", y="y", hue='cluster', data = plot_data, legend=False,
                   fit_reg=False, height = 15, scatter_kws={"s": 250})

texts = []
for x, y, s in zip(plot_data.x, plot_data.y, plot_data.name):
    texts.append(plt.text(x, y, s))

ax.set(ylim=(-2, 2))
plt.tick_params(labelsize=15)
plt.xlabel("PC 1", fontsize = 20)
plt.ylabel("PC 2", fontsize = 20)

plt.show()

In [None]:
new_data_for_plot.loc[new_data_for_plot['cluster']==0].head(10)

In [None]:
# Checking the data at the indices - 3,6,13,14,30
original_data.iloc[[3,6,13,14,30]]

In [None]:
new_data_for_plot.loc[new_data_for_plot['cluster']==3].head(10)

In [None]:
# Checking the data in the indices - 25,28,32,86
original_data.iloc[[25,28,32,86]]

### The goalkeepers are further divided according to left and right preferred foot. Hence, the 4 clusters.

## 3. Hierarchical Clustering

In [None]:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

In [None]:
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
clusters = model.fit_predict(PC_df)

In [None]:
silhouette_scores = []
for n_clusters in range(2, 6):
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
    labels = model.fit_predict(PC_df)
    silhouette_scores.append(silhouette_score(PC_df, labels))


In [None]:
plt.plot(range(2, 6), silhouette_scores)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.grid(True)
plt.show()

### Visualize the dendrogram for first 100 data

In [18]:

plt.figure(figsize=(20,14))
linkage_matrix = linkage(PC_df.head(100), method='ward')
dendrogram(linkage_matrix,labels= names[0:100], leaf_font_size = 10)
plt.grid(True)
plt.show()

* For the first 100 players, the dendrogram shows 4 clusters. The number of clusters is read off by finding the longest vertical line that no horizontal line crosses and cutting the tree there.
1. Cluster 1: orange line (goalkeepers)
2. Cluster 2: K. Manolas to F. de Jong
3. Cluster 3: H. Son to M. Icardi
4. Cluster 4: R. Lukaku to Bernardo Silva
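The "longest vertical line" rule corresponds to the largest gap between consecutive merge heights in the linkage matrix. A minimal sketch of automating it with SciPy's `fcluster` follows; the four-blob data is synthetic and the variable names are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Toy data (hypothetical): four compact, well-separated groups.
X_toy, _ = make_blobs(
    n_samples=100,
    centers=[[0, 0], [0, 20], [20, 0], [20, 20]],
    cluster_std=1.0,
    random_state=0,
)

Z = linkage(X_toy, method="ward")

# Column 2 of the linkage matrix holds the merge heights
# (non-decreasing for Ward linkage).
heights = Z[:, 2]
gaps = np.diff(heights)
i = int(np.argmax(gaps))  # largest jump between consecutive merges
cut_height = (heights[i] + heights[i + 1]) / 2

# Cut the tree inside that gap and count the resulting clusters.
labels = fcluster(Z, t=cut_height, criterion="distance")
n_clusters = len(np.unique(labels))
print(n_clusters)
```

On real data the largest gap is often the final merge, which yields 2 clusters, so the result is best checked against the dendrogram rather than trusted blindly.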

In [None]:
#pip install tabulate

In [17]:
from tabulate import tabulate

headers = ["Models => ", "K-Means", "DBSCAN", "Hierarchical"]
data = [["Silhouette Score", 0.326, 0.298, 0.3],
        ["Number of Clusters", 3, 4, 3]
]

# Generate the table
table = tabulate(data, headers=headers, tablefmt="grid")

# Print the table
print(table)

+--------------------+-----------+----------+----------------+
| Models =>          |   K-Means |   DBSCAN |   Hierarchical |
+====================+===========+==========+================+
| Silhouette Score   |     0.326 |    0.298 |            0.3 |
+--------------------+-----------+----------+----------------+
| Number of Clusters |     3     |    4     |            3   |
+--------------------+-----------+----------+----------------+


# CONCLUSION

* K-Means clustering performs best here: its silhouette score (0.326) is higher than those of DBSCAN (0.298) and hierarchical clustering (0.3).
* DBSCAN further split the goalkeepers into left-footed and right-footed groups, hence its 4 clusters.
* Hierarchical clustering is the most time-consuming. For large datasets, it is not recommended.