# **FIFA Player Segmentation: Identifying Distinct Player Profiles**

>[FIFA Player Segmentation: Identifying Distinct Player Profiles](#scrollTo=Qmo19XtcJKUH)

>[1. Project Objectives](#scrollTo=Zy2NluWLPWjq)

>>[1.1. PO1](#scrollTo=Zy2NluWLPWjq)

>>[1.2. PO2](#scrollTo=Zy2NluWLPWjq)

>>[1.3. PO3](#scrollTo=Zy2NluWLPWjq)

>[2. Description of Data](#scrollTo=hapPQob6YnpQ)

>>[2.1. Index Variables](#scrollTo=jQto7_sYlvLP)

>>[2.2. Categorical Variables (CV)](#scrollTo=jQto7_sYlvLP)

>>>[2.2.1. Categorical Variables - Nominal Type](#scrollTo=jQto7_sYlvLP)

>>>[2.2.2. Categorical Variables - Ordinal Type](#scrollTo=jQto7_sYlvLP)

>>[2.3. Non-Categorical Variables (NCV)](#scrollTo=jQto7_sYlvLP)

>[3. Analysis of Data](#scrollTo=GhvkMLgGrXZx)

>>[3.1. Data Pre-Processing](#scrollTo=kP-SQ5JSDI3i)

>>>[3.1.1. Missing Data Statistics and Treatment](#scrollTo=kP-SQ5JSDI3i)

>>>>[3.1.1.1. Missing Data Statistics: Records](#scrollTo=h9Uxil6WDnd0)

>>>>[3.1.1.2. Missing Data Treatment: Records](#scrollTo=SqHif7DND9dY)

>>>>>[3.1.1.2.1. Imputation of Missing Data](#scrollTo=6eFnVW9dFuB6)

>>>>>[3.1.1.2.2. Removal of Records with More Than 50% Missing Data](#scrollTo=dgJfWYk3EM2M)

>>>[3.1.2. Numerical Encoding of Categorical Variables](#scrollTo=G9KOGIfIGRqz)

>>>[3.1.3. Outlier Statistics and Treatment](#scrollTo=uLsyaF5hGq7y)

>>>>[3.1.3.1. Outlier Treatment: Non-Categorical Variables](#scrollTo=uLsyaF5hGq7y)

>>[3.2. Data Analysis](#scrollTo=iZ33pASsLXYJ)

>>>[3.2.1. Assessment Criteria](#scrollTo=iZ33pASsLXYJ)

>>>[3.2.2. K-Means Clustering | Metrics Used - Euclidean Distance](#scrollTo=_k0HrMbcMBH1)

>>>>[3.2.2.1. Determining Value of 'k' | Elbow Curve & K-means Inertia](#scrollTo=L6lFTJhgN3-E)

>>>>[3.2.2.2. K-means 4 Clustering Analysis](#scrollTo=JTv0xS6Mt7td)

>>>>>[3.2.2.2.1. Model Performance Evaluation](#scrollTo=JTv0xS6Mt7td)

>>>>>[3.2.2.2.2. Cluster 4 Profile Analysis](#scrollTo=-_hTw59WcU9q)

>[4. Results | Observations](#scrollTo=duwbwKZIcdWQ)

>[5. Managerial Insights](#scrollTo=AxrVN-i7Gkix)



# **1. Project Objectives**

## **1.1. PO1**

Segmentation of FIFA Player Data using K-means Clustering to identify distinct groups based on player attributes.


## **1.2. PO2**
Identification of the optimal number of clusters (k) in the FIFA Player data to ensure meaningful and interpretable groupings.

## **1.3. PO3**
Determination of the characteristics of each cluster to understand the unique features and trends within each identified player cluster by analyzing how various player positions, attributes and skills differ between clusters.



# **2. Description of Data**




## **2.1. Index Variables**

The sampled dataset was not indexed by a natural identifier; it carried only a default positional index, which is of little use for retrieving specific players.

Hence, the existing `player_id` column was set as the index. This does not modify the data itself and improves the organization and interpretability of the dataset for further analysis and exploration.

> Refer Code Block ([Cell ](#scrollTo=bVALc9YXYSeM&uniqifier=1))
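
As a minimal sketch of this step, the example below sets an identifier column as the index of a toy table. The column names mirror the dataset, but the values and the `players` and `indexed` names are made up for illustration.

```python
import pandas as pd

# Toy data: column names follow the dataset, values are invented.
players = pd.DataFrame({
    "player_id": [101, 102, 103],
    "short_name": ["A. Example", "B. Example", "C. Example"],
    "overall": [78, 84, 69],
})

# Use the existing identifier as the index; the other columns are untouched.
indexed = players.set_index("player_id")
indexed.index.name = "index"

# Label-based lookup by player identifier.
row = indexed.loc[102]
print(row["short_name"], row["overall"])
```

Note that label-based lookup returns a single row only when the identifier is unique; if the same `player_id` appears more than once, `.loc` returns all matching rows.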

## **2.2. Categorical Variables (CV)**

### **2.2.1. Categorical Variables - Nominal Type**

* `fifa_version`: The version of the FIFA game the data is associated with.
* `league_id`: A numerical identifier for the league in which the player is currently active.
* `club_team_id`: A numerical identifier for the club the player is currently playing for.
* `club_position`: The primary position the player occupies within their club.
* `nationality_id`: A numerical identifier for the player's nationality.
* `nation_team_id`: A numerical identifier for the national team the player represents (if applicable).
* `preferred_foot`: The player's preferred foot for kicking the ball (Left or Right).
* `body_type`: The general body shape or build of the player (e.g., Normal, Lean, Stocky).




### **2.2.2. Categorical Variables - Ordinal Type**

* `league_level`: The hierarchical level of the league the player is in, implying a ranking or order of leagues.
* `weak_foot`: A rating (from 1 to 5) indicating the skill level of the player's weaker foot.
* `skill_moves`: A rating (from 1 to 5) representing the player's ability to perform technical moves or tricks.
* `international_reputation`: A rating (from 1 to 5) reflecting the player's renown and recognition on the international stage.
* `work_rate`: A categorical variable describing the player's work ethic and stamina both in attack and defense (e.g., High/Low, Medium/Medium).

## **2.3. Non-Categorical Variables (NCV)**

* `fifa_update`: An integer likely representing an update or patch number within a FIFA version.
* `overall`, `potential`: Overall and potential ratings of the player, indicating their current and future ability.
* `value_eur`, `wage_eur`: The player's estimated market value and weekly wage in Euros.
* `age`: The player's age in years.
* `height_cm`: The player's height in centimeters.
* `weight_kg`: The player's weight in kilograms.
* `club_jersey_number`: The jersey number the player wears for their club.
* `pace`, `shooting`, `passing`, `dribbling`, `defending`, `physic`: Core attributes representing the player's abilities in different aspects of the game.
* `attacking_crossing`, `attacking_finishing`, `attacking_heading_accuracy`, `attacking_short_passing`, `attacking_volleys`: Specific attacking attributes.
* `skill_dribbling`, `skill_curve`, `skill_fk_accuracy`, `skill_long_passing`, `skill_ball_control`: Skill-related attributes.
* `movement_acceleration`, `movement_sprint_speed`, `movement_agility`, `movement_reactions`, `movement_balance`: Movement-related attributes.
* `power_shot_power`, `power_jumping`, `power_stamina`, `power_strength`, `power_long_shots`: Power-related attributes.
* `mentality_aggression`, `mentality_interceptions`, `mentality_positioning`, `mentality_vision`, `mentality_penalties`, `mentality_composure`: Mentality or psychological attributes.
* `defending_marking_awareness`, `defending_standing_tackle`, `defending_sliding_tackle`: Defensive attributes.
* `goalkeeping_diving`, `goalkeeping_handling`, `goalkeeping_kicking`, `goalkeeping_positioning`, `goalkeeping_reflexes`, `goalkeeping_speed`: Goalkeeping-specific attributes.

> Refer Code Block ([Cell ](#scrollTo=1FgYlqrBbQVe&uniqifier=1))



NOTE: Columns that are not needed for the analysis are dropped from the sampled data to create the working DataFrame `df`.
> Refer Code Block ([Cell ](#scrollTo=mP847tUYbQh3&line=2&uniqifier=1))

# **3. Analysis of Data**


## **3.1. Data Pre-Processing**

### 3.1.1. Missing Data Statistics and Treatment

#### 3.1.1.1. Missing Data Statistics: Records

The dataset is divided into two subsets based on categorical and non-categorical variables. The code calculates and prints the number of missing values for each variable and record.

> Refer Code Block ([Cell ](#scrollTo=9PrnNh8RbQSO))
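
The two counts described above can be sketched on a toy table as follows. The column names follow the dataset, but the values and the `toy` name are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing entries (values are invented).
toy = pd.DataFrame({
    "pace": [70.0, np.nan, 65.0, np.nan],
    "body_type": ["Lean", "Normal", None, None],
    "age": [21, 30, 27, 25],
})

# Variable-wise: number of missing values in each column.
missing_per_variable = toy.isna().sum()

# Record-wise: number of missing values in each row.
missing_per_record = toy.isna().sum(axis=1)

print(missing_per_variable)
print(missing_per_record.sort_values(ascending=False))
```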


#### 3.1.1.2. Missing Data Treatment: Records

> Refer Code Block ([Cell ](#scrollTo=LG3Zd5gzbQPV))




##### 3.1.1.2.1. Imputation of Missing Data

Missing values in both the categorical and the non-categorical variables are imputed with the most frequent value (mode) of the respective variable, using scikit-learn's `SimpleImputer`.

> Refer Code Block ([Cell ](#scrollTo=pARLyWKVbQL2))
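
A minimal sketch of mode imputation with `SimpleImputer`, assuming one string-valued and one numeric toy column (the data and the `cat`, `num` names are invented):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy subsets: one categorical column, one non-categorical column.
cat = pd.DataFrame({"preferred_foot": ["Right", "Right", np.nan, "Left"]}, dtype=object)
num = pd.DataFrame({"pace": [70.0, np.nan, 70.0, 65.0]})

# strategy='most_frequent' replaces each missing value with the column mode.
imputer_cat = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imputer_num = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# fit_transform returns a NumPy array, so column names and index are restored here.
cat_filled = pd.DataFrame(imputer_cat.fit_transform(cat), columns=cat.columns, index=cat.index)
num_filled = pd.DataFrame(imputer_num.fit_transform(num), columns=num.columns, index=num.index)

print(cat_filled)
print(num_filled)
```

For continuous variables, `strategy='mean'` or `strategy='median'` are common alternatives to the mode; which one suits this dataset is a judgment call.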


##### 3.1.1.2.2. Removal of Records with More Than 50% Missing Data

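Records and variables that are entirely empty are excluded from the dataset. As implemented, the code drops only rows and columns in which every value is missing (`dropna(how='all')`); it does not apply a 50% threshold, and because it runs after imputation it is unlikely to remove anything.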

> Refer Code Block ([Cell ](#scrollTo=fOJiNgulbQBz))
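
The heading of this step refers to a 50% threshold. The sketch below contrasts dropping fully empty records with one possible way of applying such a threshold. The threshold rule, the toy data and the variable names are assumptions for illustration, not the notebook's implementation.

```python
import numpy as np
import pandas as pd

# Toy data: row 2 is 75% missing, row 3 is entirely missing.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan, np.nan],
    "c": [3.0, 6.0, np.nan, np.nan],
    "d": [4.0, 7.0, 8.0, np.nan],
})

# how='all' removes only records where every value is missing.
no_empty_rows = toy.dropna(axis=0, how="all")

# Hypothetical 50% rule: keep records whose missing fraction is at most 0.5.
missing_fraction = toy.isna().mean(axis=1)
at_most_half_missing = toy[missing_fraction <= 0.5]

print(len(no_empty_rows), len(at_most_half_missing))
```

A threshold rule like this only has an effect if it is applied before imputation, since imputation fills the gaps it would otherwise count.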

### 3.1.2. Numerical Encoding of Categorical Variables

Categorical data is encoded numerically using scikit-learn's `OrdinalEncoder`, which maps each category of a variable to an integer code.

> Refer Code Block ([Cell ](#scrollTo=LtGZX-zmbP_m))
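
A minimal sketch of ordinal encoding on two string-valued toy columns (the data is invented; the `_oe` suffix follows the naming used in the notebook):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({
    "preferred_foot": ["Right", "Left", "Right"],
    "body_type": ["Lean", "Stocky", "Normal"],
})

# By default, categories are sorted and numbered 0, 1, 2, ... per column.
encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy)

encoded = pd.DataFrame(codes, columns=[f"{c}_oe" for c in toy.columns], index=toy.index)
print(encoder.categories_)
print(encoded)
```

The integer codes follow the sorted category order, so for nominal variables the implied ordering is arbitrary. Distance-based methods such as K-means will still treat the codes as numeric magnitudes.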

### 3.1.3. Outlier Statistics and Treatment

#### 3.1.3.1. Outlier Treatment: Non-Categorical Variables

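Non-categorical variables are rescaled to the [0, 1] range using scikit-learn's `MinMaxScaler` (`fifa_update`, `age`, and `club_jersey_number` are left unscaled in the code). Min-max scaling puts variables on a common scale but does not remove or dampen outliers: extreme values still define the range and compress the remaining observations.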

> Refer Code Block ([Cell ](#scrollTo=hg1DCj81bP6i))
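
A minimal sketch of min-max scaling on invented wage values, showing how one extreme observation sets the upper bound and squeezes the rest toward 0:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy wages with one extreme value (numbers are invented).
wage = np.array([[1_000.0], [2_000.0], [3_000.0], [100_000.0]])

# Each value is mapped to (x - min) / (max - min).
scaled = MinMaxScaler().fit_transform(wage)
print(scaled.ravel())
```

If outliers are a concern, clipping or scikit-learn's `RobustScaler` (already imported in the notebook) are possible alternatives; whether they suit this dataset would need to be checked.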

## **3.2. Data Analysis**

### 3.2.1. Assessment Criteria

1. **Silhouette Score (SS)**:
   - The Silhouette Score is a measure of how similar an object is to its own cluster compared to other clusters.
   - It quantifies the separation between clusters. A high Silhouette Score indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
   - The Silhouette Score ranges from -1 to 1, where a high value indicates that the object is well-clustered, a value near 0 indicates overlapping clusters, and negative values suggest that the object may have been assigned to the wrong cluster.
   - In the context of K-means clustering, the average Silhouette Score across all data points can be used to evaluate the quality of clustering. Higher average Silhouette Scores indicate better-defined clusters.

2. **Davies-Bouldin Index (DBI)**:
   - The Davies-Bouldin Index is a measure of cluster compactness and separation.
   - It evaluates the average similarity between each cluster and its most similar cluster, where similarity is the ratio of within-cluster scatter to the distance between cluster centroids.
   - A lower DBI indicates better clustering, with clusters that are well-separated from each other and internally compact.
   - The DBI considers both intra-cluster and inter-cluster distances, aiming to minimize intra-cluster distance while maximizing inter-cluster distance.
   - Like the Silhouette Score, the Davies-Bouldin Index is used to assess the quality of clustering algorithms, with lower values indicating better-defined clusters.

**NOTE:**

* DBI: Lower values are better. A DBI close to 0 indicates well-separated clusters.
* Silhouette Score: Higher values are better. A score close to 1 indicates dense, well-separated clusters, a score near 0 indicates overlapping clusters, and negative values suggest points assigned to the wrong cluster.

These metrics provide insights into the quality and interpretability of the clustering results, helping to guide the selection of the optimal number of clusters for a given dataset.
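
To make the two criteria concrete, the sketch below computes both on synthetic data, once with tight and once with heavily overlapping clusters. The data, the cluster centers and the `score` helper are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

centers = [[0, 0], [10, 0], [0, 10]]

def score(cluster_std):
    """Cluster synthetic blobs with K-means and return (silhouette, DBI)."""
    X, _ = make_blobs(n_samples=300, centers=centers,
                      cluster_std=cluster_std, random_state=0)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels), davies_bouldin_score(X, labels)

sil_tight, dbi_tight = score(cluster_std=0.5)   # compact, well-separated blobs
sil_loose, dbi_loose = score(cluster_std=4.0)   # heavily overlapping blobs

print(f"tight: silhouette={sil_tight:.3f}, DBI={dbi_tight:.3f}")
print(f"loose: silhouette={sil_loose:.3f}, DBI={dbi_loose:.3f}")
```

The tight configuration should score a higher silhouette and a lower DBI than the overlapping one, matching the interpretation given above.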


### 3.2.2. K-Means Clustering | Metrics Used - Euclidean Distance

K-means was chosen for this segmentation for the following reasons:

1. **Simple and Fast**: K-means is computationally efficient and relatively easy to understand and implement. It works well with large datasets, making it suitable for analysis even when dealing with a significant amount of data.

2. **Scalability**: K-means clustering is scalable to a large number of samples and has been used in many large-scale data processing scenarios.

3. **Interpretability**: K-means produces clusters that are easy to interpret. Each cluster is represented by its centroid, which is the mean of all the data points assigned to that cluster. This centroid can provide insight into the characteristics of the cluster.

4. **Versatility**: K-means can be applied to various types of data and can handle both numerical and categorical variables (after appropriate preprocessing). This versatility makes it applicable to a wide range of datasets.

5. **Well-suited for Convex Clusters**: K-means performs well when clusters are spherical or close to spherical in shape. It tries to minimize the within-cluster variance, which makes it suitable for convex clusters.

6. **Initial Centroid Selection**: While the performance of K-means can be sensitive to the initial choice of centroids, there are strategies to mitigate this issue, such as multiple initializations with different seeds and more advanced methods like k-means++.

However, it's essential to consider potential limitations as well:

1. **Sensitive to Initial Centroid Selection**: The results of K-means clustering can be sensitive to the initial placement of centroids. Different initializations may lead to different results.

2. **Assumes Spherical Clusters**: K-means assumes that clusters are spherical and isotropic, which may not always hold true for complex datasets with irregularly shaped clusters.

3. **Number of Clusters (K) Selection**: Determining the appropriate number of clusters (K) can be challenging and may require domain knowledge or additional validation techniques, such as the elbow method or silhouette analysis.

4. **Sensitive to Outliers**: K-means is sensitive to outliers, as it tries to minimize the within-cluster variance. Outliers can significantly impact the positions of cluster centroids.

5. **Equal Variance Among Clusters**: K-means assumes that clusters have equal variance, which may not always be the case in practice.

Overall, while K-means clustering has its limitations, it can still be a valuable tool for exploratory analysis and pattern discovery in this dataset, especially if the assumptions of the algorithm are met and appropriate preprocessing steps are taken.
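
The initialization strategies mentioned above can be sketched as follows: a single random start is compared with k-means++ seeding combined with several restarts. The synthetic data and variable names are assumptions; the notebook itself uses `init='random'`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Four well-separated synthetic blobs.
X, _ = make_blobs(n_samples=300,
                  centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=1.0, random_state=0)

# One random initialization: the result depends on where the centroids start.
single_random = KMeans(n_clusters=4, init="random", n_init=1, random_state=0).fit(X)

# k-means++ seeding with 10 restarts: the run with the lowest inertia is kept.
multi_kpp = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)

print(single_random.inertia_, multi_kpp.inertia_)
```

Keeping the best of several restarts cannot guarantee the global optimum, but it reduces the chance of reporting a poor local solution.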


#### 3.2.2.1. Determining Value of 'k' | Elbow Curve & K-means Inertia

The elbow curve is used to determine the optimal number of clusters (k) for the K-means clustering algorithm. It plots the Within Cluster Sum of Squared Distances (WCSS) on the y-axis and the number of clusters (k) on the x-axis.

The elbow curve shows a distinct bend around k=3, after which the WCSS decreases more gradually. This suggests that the optimal number of clusters for this dataset lies in the range of 2 to 4.

> Refer Code Block ([Cell ](#scrollTo=N6-oTfNasSy0))

#### 3.2.2.2. K-means 4 Clustering Analysis

> Refer Code Block ([Cell ](#scrollTo=yJY7BBDnXL_V))

##### 3.2.2.2.1. Model Performance Evaluation


- **Davies-Bouldin Index (DBI): 0.058**

  A very low DBI value indicates excellent clustering performance. It suggests that the clusters are well-separated and compact, with minimal overlap between them.

- **Silhouette Score: 0.976**

  A Silhouette Score close to 1 signifies outstanding clustering quality. It implies that the data points within each cluster are very similar to each other and dissimilar to points in other clusters.

- **Overall Assessment:**

  Based on both the DBI and Silhouette Score, the K-means clustering model with 4 clusters exhibits exceptional performance on the given dataset. The clusters are highly distinct and internally cohesive, suggesting that the model has effectively captured the underlying structure of the data. This strong performance increases confidence in the meaningfulness and interpretability of the identified clusters, providing a solid foundation for further analysis and insights.










> Refer Code Block ([Cell ](#scrollTo=IJJqGlXbXdW8))











##### 3.2.2.2.2. Cluster 4 Profile Analysis


**ANOVA Results**

* **Significant Differences:**
    * Most non-categorical variables, including `overall_mmnorm`, `potential_mmnorm`, `value_eur_mmnorm`, `wage_eur_mmnorm`, and various skill attributes, showed **extremely small p-values (close to 0)**, indicating highly significant differences between clusters. This suggests that these attributes play a crucial role in distinguishing player clusters.
    * `age`, `height_cm_mmnorm`, and `weight_kg_mmnorm` also exhibited statistically significant differences between clusters.
    * Interestingly, `movement_balance_mmnorm` did not show a significant difference (p-value = 0.077), suggesting that balance might not be a key factor in differentiating these player groups.
    * Goalkeeping attributes, while showing some significant differences, had relatively higher p-values, likely due to the smaller sample size of goalkeepers in the dataset.
    * The warning about constant input arrays for `fifa_update` implies that this variable has the same value across all clusters and thus doesn't contribute to cluster differentiation.

**Chi-Square Test Results**

* **Strong Associations:**
    * All categorical variables showed statistically significant associations with the cluster assignments. Except for `preferred_foot_oe`, their p-values were **very small (close to 0)**, which highlights their importance in defining the player clusters.
    * `league_id`, `club_team_id`, `nationality_id`, and `nation_team_id` showed particularly strong associations, suggesting that these factors heavily influence player grouping.
    * `preferred_foot_oe` showed a p-value of 0.0199, which is still statistically significant at the 0.05 level, although the evidence of association with the cluster labels is weaker than for the other categorical variables.
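
The two tests can be sketched on synthetic data as follows: a one-way ANOVA compares a numeric attribute across clusters, and a chi-square test checks the association between a categorical attribute and the cluster label. The data is randomly generated for illustration; only the column names are borrowed from the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway

rng = np.random.default_rng(0)

# Synthetic data: three clusters whose mean rating differs, plus a random binary variable.
toy = pd.DataFrame({
    "cluster": np.repeat([0, 1, 2], 50),
    "overall_mmnorm": np.concatenate([
        rng.normal(0.3, 0.05, 50),
        rng.normal(0.5, 0.05, 50),
        rng.normal(0.7, 0.05, 50),
    ]),
    "preferred_foot_oe": rng.integers(0, 2, 150),
})

# One-way ANOVA: one array of values per cluster.
groups = [g["overall_mmnorm"].to_numpy() for _, g in toy.groupby("cluster")]
f_stat, anova_p = f_oneway(*groups)

# Chi-square test on the category-by-cluster contingency table.
contingency = pd.crosstab(toy["preferred_foot_oe"], toy["cluster"])
chi2, chi2_p, dof, expected = chi2_contingency(contingency)

print(f"ANOVA p-value: {anova_p:.3g}")
print(f"Chi-square p-value: {chi2_p:.3g} (dof={dof})")
```

With large samples, even small differences yield very small p-values, so effect sizes are worth inspecting alongside significance.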


**Cluster Profile Analysis**

**1. Centricity Analysis**

Centricity analysis involves examining the cluster centers (centroids) to understand the typical characteristics of players within each cluster. Below are key observations from the provided centroids:

* **Cluster 0 (Goalkeepers):**
    * Characterized by significantly higher values in goalkeeping attributes (diving, handling, kicking, positioning, reflexes, speed).
    * Lower values in outfield attributes like pace, shooting, passing, dribbling, defending, and physic.
    * Club position is predominantly 'GK.'

* **Cluster 1 (High-Rated Players):**
    * Possesses the highest overall and potential ratings among all clusters.
    * High values in most key skill attributes (shooting, passing, dribbling, etc.), indicating well-rounded players.
    * Higher market value (`value_eur`) and wage (`wage_eur`).
    * Includes a mix of positions, but likely skewed towards attacking roles given the higher emphasis on offensive skills.

* **Cluster 2 (Mid-Tier Players):**
    * Exhibits mid-range overall and potential ratings, falling between the high-rated and lower-rated clusters.
    * Skill attributes are generally balanced, suggesting a mix of players with diverse skill sets.
    * Market value and wages are lower than Cluster 1 but higher than Cluster 3.
    * Likely encompasses a wider range of positions, including midfielders and defenders.

* **Cluster 3 (Lower-Rated Players):**
    * Displays the lowest overall and potential ratings.
    * Skill attributes are generally lower across the board, implying less developed or specialized players.
    * Market value and wages are the lowest among all clusters.
    * May include a mix of young players with high potential and older players in the twilight of their careers.

**2. Cluster Sizes**

The distribution of players across clusters provides additional context for understanding the relative prevalence of each player archetype:

* **Cluster 0 (Goalkeepers):** 132 players (relatively small cluster, as expected for goalkeepers)
* **Cluster 1 (High-Rated Players):** 10,122 players (a substantial group representing the elite players)
* **Cluster 2 (Mid-Tier Players):** 14,612 players (the largest cluster, indicating a majority of players fall in this category)
* **Cluster 3 (Lower-Rated Players):** 987 players (a smaller group, likely consisting of young talents and less prominent players)

**3. Conclusion**

The centricity and cluster size analysis reveals meaningful distinctions between the identified player groups. Cluster 0 represents specialized goalkeepers, Cluster 1 comprises the top-tier, well-rounded players, Cluster 2 encompasses a large group of balanced, mid-tier players, and Cluster 3 includes less developed or specialized players. These insights offer a valuable framework for understanding the diversity of player skills and potential within the FIFA dataset.
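
A minimal sketch of how per-cluster profiles and cluster sizes can be derived from labelled data. The toy values are invented; the per-cluster feature mean is used here because it corresponds to the K-means centroid of the assigned points.

```python
import pandas as pd

# Toy labelled data (values are invented; column names follow the notebook).
toy = pd.DataFrame({
    "cluster": [0, 1, 1, 2, 2, 2],
    "overall_mmnorm": [0.40, 0.80, 0.90, 0.50, 0.55, 0.60],
    "goalkeeping_diving_mmnorm": [0.90, 0.10, 0.12, 0.08, 0.10, 0.12],
})

# Centricity: mean of every attribute within each cluster.
cluster_profile = toy.groupby("cluster").mean()

# Cluster sizes: number of players assigned to each cluster.
cluster_sizes = toy["cluster"].value_counts().sort_index()

print(cluster_profile)
print(cluster_sizes)
```

Reading the profile row by row shows which attributes set a cluster apart, for example the high goalkeeping value of cluster 0 in this toy table.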






> Refer Code Block ([Cell ](#scrollTo=jLfOGgQ5Xh0v))
  
> Refer Code Block ([Cell ](#scrollTo=u-tNiBkVXpSw))

> Refer Code Block ([Cell ](#scrollTo=lj7FQqmEX5FX))
  


# **4. Results | Observations**

**Clustering Performance Summary**

| Clusters | Silhouette Score | Davies-Bouldin Index | Memory Usage (MiB) |
|---|---|---|---|
| k=2 | 0.908 | 0.165 | 1048.59 |
| k=3 | 0.968 | 0.201 | 1078.74 |
| k=4 | 0.976 | 0.058 | 1094.34 |
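
A comparison table of this kind can be produced with a loop over candidate values of k. The sketch below runs on synthetic data with four blobs; the data and the `summary` and `best_k` names are assumptions for illustration and do not reproduce the figures above.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic data with four well-separated groups.
X, _ = make_blobs(n_samples=400,
                  centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=1.0, random_state=0)

rows = []
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    rows.append({
        "k": k,
        "silhouette": silhouette_score(X, labels),
        "dbi": davies_bouldin_score(X, labels),
    })

summary = pd.DataFrame(rows).set_index("k")
best_k = summary["silhouette"].idxmax()  # highest silhouette wins

print(summary.round(3))
print("best k by silhouette:", best_k)
```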


# **5. Managerial Insights**






**Player Archetype Identification**

The segmentation provides valuable insights into distinct player archetypes, allowing managers to identify players with specific skill sets and potential.

**Team Building Strategies**

Managers can utilize these insights to build balanced teams by strategically selecting players from different clusters based on their roles and attributes.

**Player Valuation and Transfer Decisions**

The clustering analysis can inform player valuation and transfer decisions by providing a framework for understanding the relative value of players within different clusters.

**Player Development and Training**

The clustering analysis can support player development and training by identifying each player's strengths and weaknesses relative to their cluster, so that training can be targeted accordingly.


________________________________________________________
________________________________________________________

**Setup**

Installation of Pre-Requisite Libraries

In [None]:
pip install memory_profiler



In [None]:

import pandas as pd, numpy as np # For Data Manipulation
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split
from itertools import cycle, islice
from sklearn.cluster import KMeans as kmclus # For K-Means Clustering
from sklearn.metrics import silhouette_score as sscore, davies_bouldin_score as dbscore # For Clustering Model Evaluation
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt, seaborn as sns # For Data Visualization
from pandas.plotting import parallel_coordinates
import scipy.cluster.hierarchy as sch # For Hierarchical Clustering
from scipy.stats import f_oneway

!pip install ipython-autotime
%reload_ext autotime

# Load IPython extension for memory profiling
!pip install memory-profiler
%reload_ext memory_profiler




In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

ValueError: mount failed

time: 2min 1s (started: 2024-08-18 03:18:25 +00:00)


In [None]:
#Display All Columns of Dataset
pd.set_option('display.max_columns', None)

In [None]:
from memory_profiler import memory_usage
import time

# Track start time
start_time = time.time()

# Track initial memory usage
initial_memory = memory_usage()[0]

In [None]:
import pandas as pd

# Provide the path to your CSV file
file_path = '/content/drive/MyDrive/male_players_football.csv'

# Read the CSV file into a DataFrame
data = pd.read_csv(file_path)

In [None]:
data

In [None]:
data.info()

In [None]:
data.shape

Data Sampling

In [None]:
#Data Sampling
sdata = data.sample(
    frac=0.16,
    replace=False,
    random_state=1234,
    )
sdata

Adding an index variable to the dataset

In [None]:
# Setting default numerical index automatically
sdata.reset_index(drop=True, inplace=True)

# Setting the new index
sdata.set_index('player_id', inplace=True)

# Renaming the index
sdata.index.name = 'index'

sdata


In [None]:
sdata.columns.tolist()

In [None]:
df = sdata.drop([
 'player_url',
 'fifa_update_date',
 'short_name',
 'player_positions',
 'long_name',
 'dob',
 'league_name',
 'club_name',
 'club_loaned_from',
 'club_joined_date',
 'club_contract_valid_until_year',
 'nationality_name',
 'nation_position',
 'nation_jersey_number',
 'real_face',
 'release_clause_eur',
 'player_tags',
 'player_traits',
 'ls',
 'st',
 'rs',
 'lw',
 'lf',
 'cf',
 'rf',
 'rw',
 'lam',
 'cam',
 'ram',
 'lm',
 'lcm',
 'cm',
 'rcm',
 'rm',
 'lwb',
 'ldm',
 'cdm',
 'rdm',
 'rwb',
 'lb',
 'lcb',
 'cb',
 'rcb',
 'rb',
 'gk',
 'player_face_url'], axis=1)

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df

In [None]:
df.columns

In [None]:
df['pace'].nunique()

In [None]:
# Categorical Variables:
df_cat = df[[
       'fifa_version', 'league_id', 'league_level',
       'club_team_id', 'club_position', 'nationality_id',
       'nation_team_id', 'preferred_foot', 'weak_foot',
       'skill_moves', 'international_reputation',
       'work_rate', 'body_type'
]]

# Non-Categorical Variables:
df_noncat = df[[ 'fifa_update',
       'overall', 'potential', 'value_eur', 'wage_eur', 'age', 'height_cm',
       'weight_kg', 'club_jersey_number', 'pace',
       'shooting', 'passing', 'dribbling', 'defending', 'physic',
       'attacking_crossing', 'attacking_finishing',
       'attacking_heading_accuracy', 'attacking_short_passing',
       'attacking_volleys', 'skill_dribbling', 'skill_curve',
       'skill_fk_accuracy', 'skill_long_passing', 'skill_ball_control',
       'movement_acceleration', 'movement_sprint_speed', 'movement_agility',
       'movement_reactions', 'movement_balance', 'power_shot_power',
       'power_jumping', 'power_stamina', 'power_strength', 'power_long_shots',
       'mentality_aggression', 'mentality_interceptions',
       'mentality_positioning', 'mentality_vision', 'mentality_penalties',
       'mentality_composure', 'defending_marking_awareness',
       'defending_standing_tackle', 'defending_sliding_tackle',
       'goalkeeping_diving', 'goalkeeping_handling', 'goalkeeping_kicking',
       'goalkeeping_positioning', 'goalkeeping_reflexes', 'goalkeeping_speed'
       ]]


In [None]:
df.describe()

In [None]:
# Dataframe used: df_cat

# Collecting the per-variable statistics in a list
# (avoids concatenating onto an empty DataFrame, which newer pandas versions warn about)
cat_stats = []

# Iterating through each categorical variable
for var in df_cat.columns:
    # Counting occurrences of each category
    cat_count = df_cat[var].value_counts().reset_index()
    cat_count.columns = ['Category', 'Count']

    # Calculating frequency of each category
    cat_count['Frequency'] = cat_count['Count'] / cat_count['Count'].sum()

    # Adding variable name to the DataFrame
    cat_count['Variable'] = var

    cat_stats.append(cat_count)

# Combining the results into one DataFrame with a fixed column order
cat_stats_df = pd.concat(cat_stats, ignore_index=True)[['Variable', 'Category', 'Count', 'Frequency']]

# Displaying the DataFrame
print(cat_stats_df)

DATA PRE-PROCESSING

In [None]:
# Missing Data Information

variable_missing_data = df.isna().sum(); variable_missing_data # Variable-wise Missing Data Information

In [None]:
# Record-wise Missing Data Information (Top 5) (row-wise)

record_missing_data = df.isna().sum(axis=1).sort_values(ascending=False).head(5); record_missing_data


In [None]:

# Impute Missing Categorical Data [Nominal | Ordinal] using Descriptive Statistics : Central Tendency (Mode)

# Dataset Used : df_cat

si_cat = SimpleImputer(missing_values=np.nan, strategy='most_frequent') # Mode imputation; applicable to both string and numeric categories
si_cat_fit = si_cat.fit_transform(df_cat)
df_cat_mdi = pd.DataFrame(si_cat_fit, columns=df_cat.columns); df_cat_mdi # Missing Categorical Data Imputed Subset
df_cat_mdi.info()


In [None]:
# Impute Missing Non-Categorical Data using Descriptive Statistics : Central Tendency (Mode)

# Dataset Used : df_noncat

si_noncat = SimpleImputer(missing_values=np.nan, strategy='most_frequent') # Other Strategy : mean | median | most_frequent | constant
si_noncat_fit = si_noncat.fit_transform(df_noncat)
df_noncat_mdi_si = pd.DataFrame(si_noncat_fit, columns=df_noncat.columns); df_noncat_mdi_si # Missing Non-Categorical Data Imputed Subset using Simple Imputer
df_noncat_mdi_si.info()


In [None]:
print(df_noncat_mdi_si.isnull().sum())

In [None]:
# Missing Data Exclusion [MCAR | MAR (> 50%)]

# Dataset Used : df_cat_mdi | df_noncat_mdi_si

# Excluding Empty Records (If Any)
df_cat_mdi.dropna(axis=0, how='all', inplace=True) # Categorical Data Subset
df_noncat_mdi_si.dropna(axis=0, how='all', inplace=True) # Non-Categorical Data Subset

# Excluding Empty Variables (If Any)
df_cat_mdi.dropna(axis=1, how='all', inplace=True) # Categorical Data Subset
df_noncat_mdi_si.dropna(axis=1, how='all', inplace=True) # Non-Categorical Data Subset

df_cat_mdt = df_cat_mdi.copy() # Missing Categorical Treated Dataset
df_noncat_mdt = df_noncat_mdi_si.copy() # Missing Non-Categorical Treated Dataset


In [None]:
df_cat_mdt

In [None]:
df_cat_mdt.columns

In [None]:
# Numeric Encoding of Categorical Data

# Dataset Used : df_cat_mdt
df_cat_mdt_code = df_cat_mdt.copy()

# Using Scikit Learn : Ordinal Encoder (Superior)
oe = OrdinalEncoder()
oe_fit = oe.fit_transform(df_cat_mdt_code)
column_names = [
       'fifa_version_oe', 'league_id_oe', 'league_level_oe', 'club_team_id_oe',
       'club_position_oe', 'nationality_id_oe', 'nation_team_id_oe', 'preferred_foot_oe',
       'weak_foot_oe', 'skill_moves_oe', 'international_reputation_oe', 'work_rate_oe',
       'body_type_oe'
       ]
df_cat_code_oe = pd.DataFrame(oe_fit, columns=column_names);

df_cat_mdt_code_oe = pd.concat([df_cat_mdt_code, df_cat_code_oe], axis=1)
df_cat_mdt_code_oe # (Missing Data Treated) Numeric Coded Categorical Dataset using Scikit Learn Ordinal Encoder


In [None]:
df_cat_mdt_code_oe.columns

In [None]:
# Keep the original values of variables that are already numeric and the
# ordinal-encoded (_oe) versions of the string-valued variables
df_cat_mdt_code_oe1 = df_cat_mdt_code_oe.drop([
       'fifa_version_oe', 'league_id_oe', 'league_level_oe', 'club_team_id_oe',
       'preferred_foot', 'club_position', 'nationality_id_oe', 'nation_team_id_oe',
       'weak_foot_oe', 'skill_moves_oe', 'international_reputation_oe', 'work_rate', 'body_type',
       ], axis=1)

In [None]:
df_cat_mdt_code_oe1

In [None]:
df_noncat_mdt

In [None]:
df_noncat_mdt.columns

In [None]:
from sklearn.preprocessing import MinMaxScaler

# List of columns to normalize
columns_to_normalize = [
       'overall', 'potential', 'value_eur',
       'wage_eur', 'height_cm', 'weight_kg',
       'pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic',
       'attacking_crossing', 'attacking_finishing',
       'attacking_heading_accuracy', 'attacking_short_passing',
       'attacking_volleys', 'skill_dribbling', 'skill_curve',
       'skill_fk_accuracy', 'skill_long_passing', 'skill_ball_control',
       'movement_acceleration', 'movement_sprint_speed', 'movement_agility',
       'movement_reactions', 'movement_balance', 'power_shot_power',
       'power_jumping', 'power_stamina', 'power_strength', 'power_long_shots',
       'mentality_aggression', 'mentality_interceptions',
       'mentality_positioning', 'mentality_vision', 'mentality_penalties',
       'mentality_composure', 'defending_marking_awareness',
       'defending_standing_tackle', 'defending_sliding_tackle',
       'goalkeeping_diving', 'goalkeeping_handling', 'goalkeeping_kicking',
       'goalkeeping_positioning', 'goalkeeping_reflexes', 'goalkeeping_speed'
       ]

# Normalization: Min-Max Scaling (each column is scaled independently to [0, 1])
mms = MinMaxScaler()
df_noncat_minmax_norm = pd.DataFrame(
    mms.fit_transform(df_noncat_mdt[columns_to_normalize]),
    columns=[f'{column_name}_mmnorm' for column_name in columns_to_normalize],
    index=df_noncat_mdt.index,
)

# Append the normalized columns to the original (unscaled) columns
df_noncat_mdt_mmn = df_noncat_mdt.join(df_noncat_minmax_norm)

# Display the DataFrame with Min-Max normalized values
df_noncat_mdt_mmn


In [None]:
df_noncat_mdt_mmn.columns

In [None]:
# Drop the original (unscaled) columns, keeping only their *_mmnorm versions
df_noncat_mdt_mmn1 = df_noncat_mdt_mmn.drop(columns=columns_to_normalize)

In [None]:
df_noncat_mdt_mmn1

In [None]:
# Pre-Processed Categorical Data Subset
df_cat_ppd = df_cat_mdt_code_oe1.copy()  # Preferred Data Subset
df_cat_ppd

In [None]:
# Pre-Processed Non-Categorical Data Subset
df_noncat_ppd = df_noncat_mdt_mmn1.copy()  # Preferred Data Subset
df_noncat_ppd

In [None]:
# Pre-Processed Dataset
df_ppd = pd.merge(df_cat_ppd, df_noncat_ppd, left_index=True, right_index=True)
df_ppd

In [None]:
df_ppd.info()

In [None]:
df_ppd.columns

In [None]:
df1 = df_ppd.copy()

In [None]:
df1.columns

In [None]:

# Track end time and final memory usage
end_time = time.time()
final_memory = memory_usage()[0]

# Calculate elapsed time and memory used
elapsed_time = end_time - start_time
memory_used = final_memory - initial_memory

# Print total time taken and total memory used
print(f"Elapsed time: {elapsed_time} seconds")
print(f"Memory used: {memory_used} MiB")


In [None]:
from memory_profiler import memory_usage
import time

# Track start time
start_time = time.time()

# Track initial memory usage
initial_memory = memory_usage()[0]

In [None]:
wcssd = [] # Within-Cluster-Sum-Squared-Distance
nr_clus = range(1,11) # Number of Clusters
for k in nr_clus:
    kmeans = kmclus(n_clusters=k, init='random', random_state=111)
    kmeans.fit(df1)
    wcssd.append(kmeans.inertia_)
plt.plot(nr_clus, wcssd, marker='x')
plt.xlabel('Values of K')
plt.ylabel('Within Cluster Sum Squared Distance')
plt.title('Elbow Curve for Optimal K')
plt.show()
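
The elbow is normally read off the curve by eye. As a rough cross-check, the sketch below picks the k where the inertia curve bends most sharply, measured by the largest discrete second difference. This is one common heuristic and not something the notebook itself does; the synthetic `make_blobs` data and the helper name `elbow_by_second_difference` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def elbow_by_second_difference(ks, inertias):
    """Return the k at which the inertia curve bends most sharply.

    Heuristic only: uses the largest discrete second difference,
    so the first and last k can never be selected.
    """
    inertias = np.asarray(inertias, dtype=float)
    second_diff = inertias[:-2] - 2 * inertias[1:-1] + inertias[2:]
    return list(ks)[int(np.argmax(second_diff)) + 1]


# Synthetic stand-in for df1 (assumption): three compact groups
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

ks_demo = range(1, 8)
inertias_demo = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).inertia_
    for k in ks_demo
]
k_elbow = elbow_by_second_difference(ks_demo, inertias_demo)
print(f"Suggested k from the second-difference heuristic: {k_elbow}")
```

The suggestion is best treated as a starting point to compare against the silhouette and Davies-Bouldin scores, not as a replacement for inspecting the plot.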

In [None]:
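# Create K-Means Clusters [K=2]
# Fit on the feature columns only, so a 'Cluster_Label' left over from an earlier run is never used as a feature
X_km2 = df1.drop(columns='Cluster_Label', errors='ignore')
km_2cluster = kmclus(n_clusters=2, init='random', random_state=333)
df1['Cluster_Label'] = km_2cluster.fit_predict(X_km2)
km_2cluster_model = df1['Cluster_Label'].values
km_2cluster_model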

In [None]:
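# K-Means Clustering Model Evaluation [K=2]
# ------------------------------------------------------

# Score on the features only: including 'Cluster_Label' would leak the labels into the distance computation
df1_features = df1.drop(columns='Cluster_Label', errors='ignore')
sscore_km_2cluster = sscore(df1_features, km_2cluster_model)
dbscore_km_2cluster = dbscore(df1_features, km_2cluster_model)
%memit
print(f"Davies-Bouldin Index for 2 clusters: {dbscore_km_2cluster}")
print(f"Silhouette Score for 2 clusters: {sscore_km_2cluster}")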
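
Silhouette and Davies-Bouldin scores are only comparable across values of k when each one is computed on the same feature matrix, with no cluster-label column mixed in. The following is a minimal sketch of such a comparison on synthetic data; the `make_blobs` data and the names `X_demo` and `scores` are assumptions and do not come from the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in for the pre-processed feature matrix (assumption)
X_arr, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=1)
X_demo = pd.DataFrame(X_arr, columns=['f1', 'f2', 'f3', 'f4'])

rows = []
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    rows.append({
        'k': k,
        # Silhouette: higher is better, bounded by [-1, 1]
        'silhouette': silhouette_score(X_demo, labels),
        # Davies-Bouldin: lower is better, never negative
        'davies_bouldin': davies_bouldin_score(X_demo, labels),
    })

scores = pd.DataFrame(rows).set_index('k')
print(scores)
```

Collecting the scores in one table makes it easier to see whether both metrics point to the same k.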

ANOVA

In [None]:
df_noncat_ppd.columns

In [None]:
import pandas as pd
from scipy.stats import f_oneway

# Assuming 'km_2cluster_model' is the cluster labels obtained from KMeans clustering

# Joining cluster labels with the original dataset
df_with_clusters = df1.copy()
df_with_clusters['Cluster_Label'] = km_2cluster_model

# Extracting non-categorical variables
non_cat_variables = [
       'fifa_update', 'age', 'club_jersey_number', 'overall_mmnorm',
       'potential_mmnorm', 'value_eur_mmnorm', 'wage_eur_mmnorm',
       'height_cm_mmnorm', 'weight_kg_mmnorm', 'pace_mmnorm',
       'shooting_mmnorm', 'passing_mmnorm', 'dribbling_mmnorm',
       'defending_mmnorm', 'physic_mmnorm', 'attacking_crossing_mmnorm',
       'attacking_finishing_mmnorm', 'attacking_heading_accuracy_mmnorm',
       'attacking_short_passing_mmnorm', 'attacking_volleys_mmnorm',
       'skill_dribbling_mmnorm', 'skill_curve_mmnorm',
       'skill_fk_accuracy_mmnorm', 'skill_long_passing_mmnorm',
       'skill_ball_control_mmnorm', 'movement_acceleration_mmnorm',
       'movement_sprint_speed_mmnorm', 'movement_agility_mmnorm',
       'movement_reactions_mmnorm', 'movement_balance_mmnorm',
       'power_shot_power_mmnorm', 'power_jumping_mmnorm',
       'power_stamina_mmnorm', 'power_strength_mmnorm',
       'power_long_shots_mmnorm', 'mentality_aggression_mmnorm',
       'mentality_interceptions_mmnorm', 'mentality_positioning_mmnorm',
       'mentality_vision_mmnorm', 'mentality_penalties_mmnorm',
       'mentality_composure_mmnorm', 'defending_marking_awareness_mmnorm',
       'defending_standing_tackle_mmnorm', 'defending_sliding_tackle_mmnorm',
       'goalkeeping_diving_mmnorm', 'goalkeeping_handling_mmnorm',
       'goalkeeping_kicking_mmnorm', 'goalkeeping_positioning_mmnorm',
       'goalkeeping_reflexes_mmnorm', 'goalkeeping_speed_mmnorm'
]

# Grouping variables by cluster label
cluster_groups = df_with_clusters.groupby('Cluster_Label')

# Perform ANOVA for each non-categorical variable
anova_results_non_cat = {}
for column in non_cat_variables:
    try:
        anova_results_non_cat[column] = f_oneway(*[group[column] for name, group in cluster_groups])
    except KeyError:
        print(f"Error: '{column}' not found in the DataFrame.")

# Print ANOVA results for non-categorical variables
for column, result in anova_results_non_cat.items():
    print(f"Variable: {column}")
    print(f"F-value: {result.statistic}")
    print(f"P-value: {result.pvalue}")
    print()
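
Because the clusters were formed from these same variables, very small ANOVA p-values are to be expected for most columns, especially with a large sample. An effect size such as eta squared (between-cluster sum of squares divided by total sum of squares) can help rank variables by how strongly they separate the clusters. The helper `eta_squared` and the toy DataFrame below are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def eta_squared(values, labels):
    """Share of a variable's variance explained by the cluster labels (0 to 1)."""
    frame = pd.DataFrame({'value': values, 'label': labels})
    grand_mean = frame['value'].mean()
    ss_total = ((frame['value'] - grand_mean) ** 2).sum()
    group = frame.groupby('label')['value']
    ss_between = (group.count() * (group.mean() - grand_mean) ** 2).sum()
    return ss_between / ss_total if ss_total > 0 else 0.0


# Toy data (assumption): one variable that splits the clusters, one that does not
toy = pd.DataFrame({
    'separates': [1.0, 1.0, 3.0, 3.0],
    'does_not': [1.0, 3.0, 1.0, 3.0],
    'Cluster_Label': [0, 0, 1, 1],
})

effect_sizes = {
    col: eta_squared(toy[col], toy['Cluster_Label'])
    for col in ['separates', 'does_not']
}
print(effect_sizes)
```

Applied to the real data, the same function could be called once per entry in `non_cat_variables` and the results sorted to find the most discriminating attributes.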


Chi-square Test

In [None]:
df_cat_ppd.columns

In [None]:
from scipy.stats import chi2_contingency

# Extracting categorical variables
cat_variables = ['fifa_version', 'league_id', 'league_level', 'club_team_id',
       'nationality_id', 'nation_team_id', 'weak_foot', 'skill_moves',
       'international_reputation', 'club_position_oe', 'preferred_foot_oe',
       'work_rate_oe', 'body_type_oe']

# Perform Chi-square test for each categorical variable
chi2_results_cat = {}
for column in cat_variables:
    contingency_table = pd.crosstab(df_with_clusters[column], km_2cluster_model)
    chi2, p_value, _, _ = chi2_contingency(contingency_table)
    chi2_results_cat[column] = {'Chi-square': chi2, 'P-value': p_value}

# Print Chi-square test results for categorical variables
for column, result in chi2_results_cat.items():
    print(f"Variable: {column}")
    print(f"Chi-square: {result['Chi-square']}")
    print(f"P-value: {result['P-value']}")
    print()
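
With many thousands of player records, a chi-square test tends to flag even weak associations as significant, so the p-value alone says little about strength. Cramér's V rescales the chi-square statistic to the range 0 to 1. The sketch below assumes a helper named `cramers_v` and uses toy data; neither appears in the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(categories, labels):
    """Cramér's V between a categorical variable and the cluster labels."""
    table = pd.crosstab(pd.Series(categories, name='category'),
                        pd.Series(labels, name='cluster'))
    # No continuity correction, so that a perfect 2x2 association gives exactly 1
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim))) if min_dim > 0 else 0.0


# Toy data (assumption): preferred foot against two hypothetical labelings
foot = ['L'] * 10 + ['R'] * 10
labels_aligned = [0] * 10 + [1] * 10   # cluster fully determined by foot
labels_unrelated = [0, 1] * 10         # cluster independent of foot

v_aligned = cramers_v(foot, labels_aligned)
v_unrelated = cramers_v(foot, labels_unrelated)
print(v_aligned, v_unrelated)
```

Values near 0 indicate that a categorical variable barely differs between clusters, even when its chi-square p-value is very small.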


Centricity

In [None]:
# Get the cluster centers
cluster_centers = km_2cluster.cluster_centers_

# Convert cluster_centers to a DataFrame
centroids_df = pd.DataFrame(cluster_centers, columns=df1.columns[:-1])

# Display the centroids of clusters
centroids_df


Descriptive Statistics for 2 cluster model

In [None]:
df1['Cluster_Label'] = km_2cluster_model

# Calculate cluster-wise descriptive statistics
cluster_stats = df1.groupby('Cluster_Label').describe()

# Variables of interest
variables_of_interest = [
       'age', 'club_jersey_number', 'overall_mmnorm',
       'value_eur_mmnorm', 'wage_eur_mmnorm',
       'height_cm_mmnorm', 'weight_kg_mmnorm', 'pace_mmnorm',
       'shooting_mmnorm', 'passing_mmnorm', 'dribbling_mmnorm',
       'defending_mmnorm', 'physic_mmnorm', 'attacking_crossing_mmnorm',
       'attacking_finishing_mmnorm', 'attacking_heading_accuracy_mmnorm',
       'attacking_short_passing_mmnorm', 'attacking_volleys_mmnorm',
       'skill_dribbling_mmnorm', 'skill_curve_mmnorm',
       'skill_fk_accuracy_mmnorm', 'skill_long_passing_mmnorm',
       'skill_ball_control_mmnorm', 'movement_acceleration_mmnorm',
       'movement_sprint_speed_mmnorm', 'movement_agility_mmnorm',
       'movement_reactions_mmnorm',
       'power_shot_power_mmnorm', 'power_jumping_mmnorm',
       'power_stamina_mmnorm', 'power_strength_mmnorm',
       'power_long_shots_mmnorm', 'mentality_aggression_mmnorm',
       'mentality_interceptions_mmnorm', 'mentality_positioning_mmnorm',
       'mentality_vision_mmnorm', 'mentality_penalties_mmnorm',
       'mentality_composure_mmnorm', 'defending_marking_awareness_mmnorm',
       'defending_standing_tackle_mmnorm', 'defending_sliding_tackle_mmnorm'

]

# Print descriptive statistics for each variable
for variable in variables_of_interest:
    print(f"Descriptive statistics for variable: {variable}")
    print(cluster_stats[variable])
    print()

In [None]:
# Assuming km_2cluster_model contains the cluster labels

# Grouping the data by cluster labels
cluster_groups = df_with_clusters.groupby(km_2cluster_model)

# Counting the number of records (players) in each cluster
num_variables_in_clusters = cluster_groups.size()

# Displaying the results
print("Number of Records in Each Cluster:")
print(num_variables_in_clusters)

ANOVA p-values Heatmap for 2 Clusters

In [None]:
df_noncat_ppd.columns

In [None]:
# Joining cluster labels with the original dataset
df_with_clusters = df1.copy()
df_with_clusters['Cluster_Label'] = km_2cluster_model

# Extracting non-categorical variables
non_cat_variables = [
       'fifa_update', 'age', 'club_jersey_number', 'overall_mmnorm',
       'potential_mmnorm', 'value_eur_mmnorm', 'wage_eur_mmnorm',
       'height_cm_mmnorm', 'weight_kg_mmnorm', 'pace_mmnorm',
       'shooting_mmnorm', 'passing_mmnorm', 'dribbling_mmnorm',
       'defending_mmnorm', 'physic_mmnorm', 'attacking_crossing_mmnorm',
       'attacking_finishing_mmnorm', 'attacking_heading_accuracy_mmnorm',
       'attacking_short_passing_mmnorm', 'attacking_volleys_mmnorm',
       'skill_dribbling_mmnorm', 'skill_curve_mmnorm',
       'skill_fk_accuracy_mmnorm', 'skill_long_passing_mmnorm',
       'skill_ball_control_mmnorm', 'movement_acceleration_mmnorm',
       'movement_sprint_speed_mmnorm', 'movement_agility_mmnorm',
       'movement_reactions_mmnorm', 'movement_balance_mmnorm',
       'power_shot_power_mmnorm', 'power_jumping_mmnorm',
       'power_stamina_mmnorm', 'power_strength_mmnorm',
       'power_long_shots_mmnorm', 'mentality_aggression_mmnorm',
       'mentality_interceptions_mmnorm', 'mentality_positioning_mmnorm',
       'mentality_vision_mmnorm', 'mentality_penalties_mmnorm',
       'mentality_composure_mmnorm', 'defending_marking_awareness_mmnorm',
       'defending_standing_tackle_mmnorm', 'defending_sliding_tackle_mmnorm',
       'goalkeeping_diving_mmnorm', 'goalkeeping_handling_mmnorm',
       'goalkeeping_kicking_mmnorm', 'goalkeeping_positioning_mmnorm',
       'goalkeeping_reflexes_mmnorm', 'goalkeeping_speed_mmnorm']

# Grouping variables by cluster label
cluster_groups = df_with_clusters.groupby('Cluster_Label')

# Perform ANOVA for each non-categorical variable
anova_results_non_cat = {}
for column in non_cat_variables:
    anova_results_non_cat[column] = f_oneway(*[group[column] for name, group in cluster_groups])

# Extract p-values for non-categorical variables
non_cat_p_values = [result.pvalue for result in anova_results_non_cat.values()]

# Plotting heatmap for non-categorical variables
plt.figure(figsize=(10, 6))
sns.heatmap([non_cat_p_values], cmap='coolwarm', annot=True, fmt='.2f', xticklabels=non_cat_variables, yticklabels=['Clusters'], cbar=False)
plt.xlabel('Variable')
plt.title('ANOVA p-values Heatmap for Non-Categorical Variables (2 Clusters)')
plt.show()


Chi-square P-values Heatmap for 2 clusters

In [None]:
df_cat_ppd.columns

In [None]:
cat_variables = ['fifa_version', 'league_id', 'league_level', 'club_team_id',
       'nationality_id', 'nation_team_id', 'weak_foot', 'skill_moves',
       'international_reputation', 'club_position_oe', 'preferred_foot_oe',
       'work_rate_oe', 'body_type_oe']

# Initialize dictionary to store chi-square results
chi2_results_cat = {}

# Calculate chi-square for each categorical variable
for column in cat_variables:
    contingency_table = pd.crosstab(df_with_clusters[column], km_2cluster_model)
    chi2, p_value, _, _ = chi2_contingency(contingency_table)
    chi2_results_cat[column] = {'P-value': p_value}

# Extract p-values
p_values = [[result['P-value'] for result in chi2_results_cat.values()]]

# Get variable names
variables = list(chi2_results_cat.keys())

# Plotting heatmap for p-values
plt.figure(figsize=(10, 6))
sns.heatmap(p_values, cmap='coolwarm', annot=True, fmt='.2f', xticklabels=variables, yticklabels=False)
plt.xlabel('Variable')
plt.ylabel('Cluster Label')
plt.title('Chi-square P-values Heatmap for Categorical Variables (2 Clusters)')
plt.show()


In [None]:
# Assign cluster labels to the DataFrame
df1['Cluster_Label'] = km_2cluster_model

# Plot the scatter plot of the clusters (centroids are not drawn)
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df1, x='age', y='potential_mmnorm', hue='Cluster_Label', legend='full')

# Set plot title and labels
plt.title('K-Means Clusters (k=2): age vs. potential_mmnorm')
plt.xlabel('age')
plt.ylabel('potential_mmnorm')
plt.legend()
plt.grid(True)
%memit
plt.show()


Interactive 3-D Scatter plot for 2-cluster model

In [None]:
import plotly.graph_objs as go

# Create 3D scatter plot
fig = go.Figure()

# Add traces for each cluster
for cluster_label in df1['Cluster_Label'].unique():
    cluster_data = df1[df1['Cluster_Label'] == cluster_label]
    fig.add_trace(go.Scatter3d(
        x=cluster_data['age'],
        y=cluster_data['potential_mmnorm'],
        z=cluster_data['overall_mmnorm'],  # matches the z-axis title set in the layout
        mode='markers',
        marker=dict(
            size=5,
            color=cluster_label,
            colorscale='Viridis',  # Adjust colorscale if needed
            opacity=0.8
        ),
        name=f'Cluster {cluster_label}'
    ))

# Set layout
fig.update_layout(
    title='K-Means Clusters (k=2, 3D)',
    scene=dict(
        xaxis=dict(title='age'),
        yaxis=dict(title='potential_mmnorm'),
        zaxis=dict(title='overall_mmnorm'),
    ),
    margin=dict(l=0, r=0, b=0, t=40)
)

# Show interactive 3D scatter plot
fig.show()


In [None]:
# Create subplots for each cluster
fig, axes = plt.subplots(1, 2, figsize=(20, 4), sharex=True, sharey=True)
fig.suptitle('K-Means Clusters (k=2): overall_mmnorm vs. potential_mmnorm')

# Iterate through each cluster
for i in range(2):
    # Filter data points belonging to the current cluster
    cluster_data = df1[df1['Cluster_Label'] == i]

    # Scatter plot of data points
    sns.scatterplot(data=cluster_data, x='overall_mmnorm', y='potential_mmnorm', ax=axes[i], label=f'Cluster {i}')

    # Set title and labels for each subplot
    axes[i].set_title(f'Cluster {i}')
    axes[i].set_xlabel('overall_mmnorm')
    axes[i].set_ylabel('potential_mmnorm')
    axes[i].legend()

plt.show()
%memit

In [None]:
# Create K-Means Clusters [K=3]
# Fit on the feature columns only: df1 still carries the 'Cluster_Label' column
# from the previous model, which must not be used as a clustering feature.
X_km3 = df1.drop(columns='Cluster_Label', errors='ignore')
km_3cluster = kmclus(n_clusters=3, init='random', random_state=333)
km_3cluster_model = km_3cluster.fit_predict(X_km3)
df1['Cluster_Label'] = km_3cluster_model
km_3cluster_model

In [None]:
# K-Means Clustering Model Evaluation [K=3]
# ------------------------------------------------------

# Score on the same feature matrix the model was fitted on (no label column)
sscore_km_3cluster = sscore(X_km3, km_3cluster_model)
dbscore_km_3cluster = dbscore(X_km3, km_3cluster_model)
%memit
print(f"Davies-Bouldin Index for 3 clusters: {dbscore_km_3cluster}")
print(f"Silhouette Score for 3 clusters: {sscore_km_3cluster}")

ANOVA

In [None]:
# Joining cluster labels with the pre-processed dataset
df_with_clusters = df1.copy()
df_with_clusters['Cluster_Label'] = km_3cluster_model

# Extracting non-categorical variables
non_cat_variables = [
       'fifa_update', 'age', 'club_jersey_number', 'overall_mmnorm',
       'potential_mmnorm', 'value_eur_mmnorm', 'wage_eur_mmnorm',
       'height_cm_mmnorm', 'weight_kg_mmnorm', 'pace_mmnorm',
       'shooting_mmnorm', 'passing_mmnorm', 'dribbling_mmnorm',
       'defending_mmnorm', 'physic_mmnorm', 'attacking_crossing_mmnorm',
       'attacking_finishing_mmnorm', 'attacking_heading_accuracy_mmnorm',
       'attacking_short_passing_mmnorm', 'attacking_volleys_mmnorm',
       'skill_dribbling_mmnorm', 'skill_curve_mmnorm',
       'skill_fk_accuracy_mmnorm', 'skill_long_passing_mmnorm',
       'skill_ball_control_mmnorm', 'movement_acceleration_mmnorm',
       'movement_sprint_speed_mmnorm', 'movement_agility_mmnorm',
       'movement_reactions_mmnorm', 'movement_balance_mmnorm',
       'power_shot_power_mmnorm', 'power_jumping_mmnorm',
       'power_stamina_mmnorm', 'power_strength_mmnorm',
       'power_long_shots_mmnorm', 'mentality_aggression_mmnorm',
       'mentality_interceptions_mmnorm', 'mentality_positioning_mmnorm',
       'mentality_vision_mmnorm', 'mentality_penalties_mmnorm',
       'mentality_composure_mmnorm', 'defending_marking_awareness_mmnorm',
       'defending_standing_tackle_mmnorm', 'defending_sliding_tackle_mmnorm',
       'goalkeeping_diving_mmnorm', 'goalkeeping_handling_mmnorm',
       'goalkeeping_kicking_mmnorm', 'goalkeeping_positioning_mmnorm',
       'goalkeeping_reflexes_mmnorm', 'goalkeeping_speed_mmnorm'
       ]


# Grouping variables by cluster label
cluster_groups = df_with_clusters.groupby('Cluster_Label')

# Perform ANOVA for each non-categorical variable
anova_results_non_cat = {}
for column in non_cat_variables:
    anova_results_non_cat[column] = f_oneway(*[group[column] for name, group in cluster_groups])

# Print ANOVA results for non-categorical variables
for column, result in anova_results_non_cat.items():
    print(f"Variable: {column}")
    print(f"F-value: {result.statistic}")
    print(f"P-value: {result.pvalue}")
    print()


Chi-square Test

In [None]:
from scipy.stats import chi2_contingency

# Extracting categorical variables
cat_variables = [
       'fifa_version', 'league_id', 'league_level', 'club_team_id',
       'nationality_id', 'nation_team_id', 'weak_foot', 'skill_moves',
       'international_reputation', 'club_position_oe', 'preferred_foot_oe',
       'work_rate_oe', 'body_type_oe'
       ]

# Perform Chi-square test for each categorical variable
chi2_results_cat = {}
for column in cat_variables:
    contingency_table = pd.crosstab(df_with_clusters[column], km_3cluster_model)
    chi2, p_value, _, _ = chi2_contingency(contingency_table)
    chi2_results_cat[column] = {'Chi-square': chi2, 'P-value': p_value}

# Print Chi-square test results for categorical variables
for column, result in chi2_results_cat.items():
    print(f"Variable: {column}")
    print(f"Chi-square: {result['Chi-square']}")
    print(f"P-value: {result['P-value']}")
    print()


Centricity

In [None]:
# Get the cluster centers
cluster_centers = km_3cluster.cluster_centers_

# Convert cluster_centers to a DataFrame (one column per fitted feature, excluding 'Cluster_Label')
centroids_df = pd.DataFrame(cluster_centers, columns=X_km3.columns)

# Display the centroids of clusters
centroids_df

In [None]:
# Calculate cluster-wise descriptive statistics
cluster_stats = df1.groupby('Cluster_Label').describe()

# Print the descriptive statistics for each cluster
print("Cluster-wise Descriptive Statistics:")
cluster_stats


In [None]:
# Grouping the data by cluster labels
cluster_groups = df_with_clusters.groupby(km_3cluster_model)

# Counting the number of records (players) in each cluster
num_variables_in_clusters = cluster_groups.size()

# Displaying the results
print("Number of Records in Each Cluster:")
print(num_variables_in_clusters)


ANOVA p-values Heatmap for 3 Clusters

In [None]:
# Joining cluster labels with the original dataset
df_with_clusters = df1.copy()
df_with_clusters['Cluster_Label'] = km_3cluster_model

# Extracting non-categorical variables
non_cat_variables = [
       'fifa_update', 'age', 'club_jersey_number', 'overall_mmnorm',
       'potential_mmnorm', 'value_eur_mmnorm', 'wage_eur_mmnorm',
       'height_cm_mmnorm', 'weight_kg_mmnorm', 'pace_mmnorm',
       'shooting_mmnorm', 'passing_mmnorm', 'dribbling_mmnorm',
       'defending_mmnorm', 'physic_mmnorm', 'attacking_crossing_mmnorm',
       'attacking_finishing_mmnorm', 'attacking_heading_accuracy_mmnorm',
       'attacking_short_passing_mmnorm', 'attacking_volleys_mmnorm',
       'skill_dribbling_mmnorm', 'skill_curve_mmnorm',
       'skill_fk_accuracy_mmnorm', 'skill_long_passing_mmnorm',
       'skill_ball_control_mmnorm', 'movement_acceleration_mmnorm',
       'movement_sprint_speed_mmnorm', 'movement_agility_mmnorm',
       'movement_reactions_mmnorm', 'movement_balance_mmnorm',
       'power_shot_power_mmnorm', 'power_jumping_mmnorm',
       'power_stamina_mmnorm', 'power_strength_mmnorm',
       'power_long_shots_mmnorm', 'mentality_aggression_mmnorm',
       'mentality_interceptions_mmnorm', 'mentality_positioning_mmnorm',
       'mentality_vision_mmnorm', 'mentality_penalties_mmnorm',
       'mentality_composure_mmnorm', 'defending_marking_awareness_mmnorm',
       'defending_standing_tackle_mmnorm', 'defending_sliding_tackle_mmnorm',
       'goalkeeping_diving_mmnorm', 'goalkeeping_handling_mmnorm',
       'goalkeeping_kicking_mmnorm', 'goalkeeping_positioning_mmnorm',
       'goalkeeping_reflexes_mmnorm', 'goalkeeping_speed_mmnorm'
]

# Grouping variables by cluster label
cluster_groups = df_with_clusters.groupby('Cluster_Label')

# Perform ANOVA for each non-categorical variable
anova_results_non_cat = {}
for column in non_cat_variables:
    anova_results_non_cat[column] = f_oneway(*[group[column] for name, group in cluster_groups])

# Extract p-values for non-categorical variables
non_cat_p_values = [result.pvalue for result in anova_results_non_cat.values()]

# Plotting heatmap for non-categorical variables
plt.figure(figsize=(10, 6))
sns.heatmap([non_cat_p_values], cmap='coolwarm', annot=True, fmt='.2f', xticklabels=non_cat_variables, yticklabels=['Clusters'], cbar=False)
plt.xlabel('Variable')
plt.title('ANOVA p-values Heatmap for Non-Categorical Variables (3 Clusters)')
plt.show()


Chi-square P-values Heatmap for 3 Clusters

In [None]:
cat_variables = [
       'fifa_version', 'league_id', 'league_level', 'club_team_id',
       'nationality_id', 'nation_team_id', 'weak_foot', 'skill_moves',
       'international_reputation', 'club_position_oe', 'preferred_foot_oe',
       'work_rate_oe', 'body_type_oe'
]

# Initialize dictionary to store chi-square results
chi2_results_cat = {}

# Calculate chi-square for each categorical variable
for column in cat_variables:
    contingency_table = pd.crosstab(df_with_clusters[column], km_3cluster_model)
    chi2, p_value, _, _ = chi2_contingency(contingency_table)
    chi2_results_cat[column] = {'P-value': p_value}

# Extract p-values
p_values = [[result['P-value'] for result in chi2_results_cat.values()]]

# Get variable names
variables = list(chi2_results_cat.keys())

# Plotting heatmap for p-values
plt.figure(figsize=(10, 6))
sns.heatmap(p_values, cmap='coolwarm', annot=True, fmt='.2f', xticklabels=variables, yticklabels=False)
plt.xlabel('Variable')
plt.ylabel('Cluster Label')
plt.title('Chi-square P-values Heatmap for Categorical Variables (3 Clusters)')
plt.show()


In [None]:
# Assign cluster labels to the DataFrame
df1['Cluster_Label'] = km_3cluster_model

# Plot the scatter plot of the clusters (centroids are not drawn)
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df1, x='age', y='international_reputation', hue='Cluster_Label', legend='full')

# Set plot title and labels (the y label now matches the plotted column)
plt.title('K-Means Clusters (k=3): age vs. international_reputation')
plt.xlabel('age')
plt.ylabel('international_reputation')
plt.legend()
plt.grid(True)
%memit
plt.show()


Interactive 3D scatter plot for 3-cluster model

In [None]:
import plotly.graph_objs as go
import numpy as np

# Create 3D scatter plot
fig = go.Figure()

# Get unique cluster labels
unique_clusters = np.unique(km_3cluster_model)

# Add traces for each cluster
for cluster_label in unique_clusters:
    cluster_data = df1[km_3cluster_model == cluster_label]
    fig.add_trace(go.Scatter3d(
        x=cluster_data['age'],
        y=cluster_data['overall_mmnorm'],
        z=cluster_data['potential_mmnorm'],
        mode='markers',
        marker=dict(
            size=5,
            color=cluster_label,
            colorscale='Viridis',  # Adjust colorscale if needed
            opacity=0.8
        ),
        name=f'Cluster {cluster_label}'
    ))

# Set layout
fig.update_layout(
    title='K-Means Clusters (k=3, 3D)',
    scene=dict(
        xaxis=dict(title='age'),
        yaxis=dict(title='overall_mmnorm'),
        zaxis=dict(title='potential_mmnorm'),
    ),
    margin=dict(l=0, r=0, b=0, t=40)
)

# Show interactive 3D scatter plot
fig.show()



In [None]:
# Create subplots for each cluster
fig, axes = plt.subplots(1, 3, figsize=(20, 4), sharex=True, sharey=True)
fig.suptitle('K-Means Clusters (k=3): potential_mmnorm vs. overall_mmnorm')

# Iterate through each cluster
for i in range(3):
    # Filter data points belonging to the current cluster
    cluster_data = df1[df1['Cluster_Label'] == i]

    # Scatter plot of data points
    sns.scatterplot(data=cluster_data, x='potential_mmnorm', y='overall_mmnorm', ax=axes[i], label=f'Cluster_Label {i}')

    # Set title and labels for each subplot
    axes[i].set_title(f'Cluster {i}')
    axes[i].set_xlabel('potential_mmnorm')
    axes[i].set_ylabel('overall_mmnorm')
    axes[i].legend()

plt.show()
%memit


In [None]:
# Create K-Means Clusters [K=4]
# Fit on the feature columns only: df1 still carries the 'Cluster_Label' column
# from the previous model, which must not be used as a clustering feature.
X_km4 = df1.drop(columns='Cluster_Label', errors='ignore')
km_4cluster = kmclus(n_clusters=4, init='random', random_state=444)
km_4cluster_model = km_4cluster.fit_predict(X_km4)
df1['Cluster_Label'] = km_4cluster_model
km_4cluster_model


In [None]:
# K-Means Clustering Model Evaluation [K=4]
# ------------------------------------------------------

# Score on the same feature matrix the model was fitted on (no label column)
sscore_km_4cluster = sscore(X_km4, km_4cluster_model)
dbscore_km_4cluster = dbscore(X_km4, km_4cluster_model)
%memit
print(f"Davies-Bouldin Index for 4 clusters: {dbscore_km_4cluster}")
print(f"Silhouette Score for 4 clusters: {sscore_km_4cluster}")

ANOVA

In [None]:
# Joining cluster labels with the pre-processed dataset
df_with_clusters = df1.copy()
df_with_clusters['Cluster_Label'] = km_4cluster_model

# Extracting non-categorical variables
non_cat_variables = [
       'fifa_update', 'age', 'club_jersey_number', 'overall_mmnorm',
       'potential_mmnorm', 'value_eur_mmnorm', 'wage_eur_mmnorm',
       'height_cm_mmnorm', 'weight_kg_mmnorm', 'pace_mmnorm',
       'shooting_mmnorm', 'passing_mmnorm', 'dribbling_mmnorm',
       'defending_mmnorm', 'physic_mmnorm', 'attacking_crossing_mmnorm',
       'attacking_finishing_mmnorm', 'attacking_heading_accuracy_mmnorm',
       'attacking_short_passing_mmnorm', 'attacking_volleys_mmnorm',
       'skill_dribbling_mmnorm', 'skill_curve_mmnorm',
       'skill_fk_accuracy_mmnorm', 'skill_long_passing_mmnorm',
       'skill_ball_control_mmnorm', 'movement_acceleration_mmnorm',
       'movement_sprint_speed_mmnorm', 'movement_agility_mmnorm',
       'movement_reactions_mmnorm', 'movement_balance_mmnorm',
       'power_shot_power_mmnorm', 'power_jumping_mmnorm',
       'power_stamina_mmnorm', 'power_strength_mmnorm',
       'power_long_shots_mmnorm', 'mentality_aggression_mmnorm',
       'mentality_interceptions_mmnorm', 'mentality_positioning_mmnorm',
       'mentality_vision_mmnorm', 'mentality_penalties_mmnorm',
       'mentality_composure_mmnorm', 'defending_marking_awareness_mmnorm',
       'defending_standing_tackle_mmnorm', 'defending_sliding_tackle_mmnorm',
       'goalkeeping_diving_mmnorm', 'goalkeeping_handling_mmnorm',
       'goalkeeping_kicking_mmnorm', 'goalkeeping_positioning_mmnorm',
       'goalkeeping_reflexes_mmnorm', 'goalkeeping_speed_mmnorm'
]

# Grouping variables by cluster label
cluster_groups = df_with_clusters.groupby('Cluster_Label')

# Perform ANOVA for each non-categorical variable
anova_results_non_cat = {}
for column in non_cat_variables:
    anova_results_non_cat[column] = f_oneway(*[group[column] for name, group in cluster_groups])

# Print ANOVA results for non-categorical variables
for column, result in anova_results_non_cat.items():
    print(f"Variable: {column}")
    print(f"F-value: {result.statistic}")
    print(f"P-value: {result.pvalue}")
    print()


Chi-square Test

In [None]:
from scipy.stats import chi2_contingency

# Extracting categorical variables
cat_variables = [
       'fifa_version', 'league_id', 'league_level', 'club_team_id',
       'nationality_id', 'nation_team_id', 'weak_foot', 'skill_moves',
       'international_reputation', 'club_position_oe', 'preferred_foot_oe',
       'work_rate_oe', 'body_type_oe'
       ]

# Perform Chi-square test for each categorical variable
chi2_results_cat = {}
for column in cat_variables:
    contingency_table = pd.crosstab(df_with_clusters[column], km_4cluster_model)
    chi2, p_value, _, _ = chi2_contingency(contingency_table)
    chi2_results_cat[column] = {'Chi-square': chi2, 'P-value': p_value}

# Print Chi-square test results for categorical variables
for column, result in chi2_results_cat.items():
    print(f"Variable: {column}")
    print(f"Chi-square: {result['Chi-square']}")
    print(f"P-value: {result['P-value']}")
    print()


Centricity

In [None]:
cluster_centers = km_4cluster.cluster_centers_

# Convert centroids to DataFrame for better visualization (one column per fitted feature, excluding 'Cluster_Label')
centroids_df = pd.DataFrame(cluster_centers, columns=X_km4.columns)
%memit
print("Centroids of Clusters:")
centroids_df

In [None]:
# Calculate cluster-wise descriptive statistics
cluster_stats = df1.groupby('Cluster_Label').describe()

# Print the descriptive statistics for each cluster
print("Cluster-wise Descriptive Statistics:")
cluster_stats

In [None]:
# Grouping the data by cluster labels
cluster_groups = df_with_clusters.groupby(km_4cluster_model)

# Counting the number of records (players) in each cluster
num_variables_in_clusters = cluster_groups.size()

# Displaying the results
print("Number of Records in Each Cluster:")
print(num_variables_in_clusters)


ANOVA p-values Heatmap for 4 Clusters

In [None]:
# Joining cluster labels with the original dataset
df_with_clusters = df1.copy()
df_with_clusters['Cluster_Label'] = km_4cluster_model

# Extracting non-categorical variables
non_cat_variables = [
       'fifa_update', 'age', 'club_jersey_number', 'overall_mmnorm',
       'potential_mmnorm', 'value_eur_mmnorm', 'wage_eur_mmnorm',
       'height_cm_mmnorm', 'weight_kg_mmnorm', 'pace_mmnorm',
       'shooting_mmnorm', 'passing_mmnorm', 'dribbling_mmnorm',
       'defending_mmnorm', 'physic_mmnorm', 'attacking_crossing_mmnorm',
       'attacking_finishing_mmnorm', 'attacking_heading_accuracy_mmnorm',
       'attacking_short_passing_mmnorm', 'attacking_volleys_mmnorm',
       'skill_dribbling_mmnorm', 'skill_curve_mmnorm',
       'skill_fk_accuracy_mmnorm', 'skill_long_passing_mmnorm',
       'skill_ball_control_mmnorm', 'movement_acceleration_mmnorm',
       'movement_sprint_speed_mmnorm', 'movement_agility_mmnorm',
       'movement_reactions_mmnorm', 'movement_balance_mmnorm',
       'power_shot_power_mmnorm', 'power_jumping_mmnorm',
       'power_stamina_mmnorm', 'power_strength_mmnorm',
       'power_long_shots_mmnorm', 'mentality_aggression_mmnorm',
       'mentality_interceptions_mmnorm', 'mentality_positioning_mmnorm',
       'mentality_vision_mmnorm', 'mentality_penalties_mmnorm',
       'mentality_composure_mmnorm', 'defending_marking_awareness_mmnorm',
       'defending_standing_tackle_mmnorm', 'defending_sliding_tackle_mmnorm',
       'goalkeeping_diving_mmnorm', 'goalkeeping_handling_mmnorm',
       'goalkeeping_kicking_mmnorm', 'goalkeeping_positioning_mmnorm',
       'goalkeeping_reflexes_mmnorm', 'goalkeeping_speed_mmnorm'
]

# Grouping variables by cluster label
cluster_groups = df_with_clusters.groupby('Cluster_Label')

# Perform ANOVA for each non-categorical variable
anova_results_non_cat = {}
for column in non_cat_variables:
    anova_results_non_cat[column] = f_oneway(*[group[column] for name, group in cluster_groups])

# Extract p-values for non-categorical variables
non_cat_p_values = [result.pvalue for result in anova_results_non_cat.values()]

# Plotting heatmap for non-categorical variables
plt.figure(figsize=(10, 6))
sns.heatmap([non_cat_p_values], cmap='coolwarm', annot=True, fmt='.2f', xticklabels=non_cat_variables, yticklabels=['Clusters'], cbar=False)
plt.xlabel('Variable')
plt.title('ANOVA p-values Heatmap for Non-Categorical Variables (4 Clusters)')
plt.show()


Chi-square P-values Heatmap for 4 Clusters

In [None]:
cat_variables = [
       'fifa_version', 'league_id', 'league_level', 'club_team_id',
       'nationality_id', 'nation_team_id', 'weak_foot', 'skill_moves',
       'international_reputation', 'club_position_oe', 'preferred_foot_oe',
       'work_rate_oe', 'body_type_oe'
]

# Initialize dictionary to store chi-square results
chi2_results_cat = {}

# Calculate chi-square for each categorical variable against the cluster label
# (the label column of the same DataFrame is used so rows stay aligned by index)
for column in cat_variables:
    contingency_table = pd.crosstab(df_with_clusters[column], df_with_clusters['Cluster_Label'])
    _, p_value, _, _ = chi2_contingency(contingency_table)
    chi2_results_cat[column] = {'P-value': p_value}

# Extract p-values
p_values = [[result['P-value'] for result in chi2_results_cat.values()]]

# Get variable names
variables = list(chi2_results_cat.keys())

# Plotting heatmap for p-values
# ('.1e' keeps very small p-values readable; '.2f' would round them to 0.00)
plt.figure(figsize=(10, 6))
sns.heatmap(p_values, cmap='coolwarm', annot=True, fmt='.1e', xticklabels=variables, yticklabels=False)
plt.xlabel('Variable')
plt.ylabel('Cluster Label')
plt.title('Chi-square p-values Heatmap for Categorical Variables (4 Clusters)')
plt.show()
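
On a dataset of this size, chi-square p-values tend to be close to zero for almost every variable, so they say little about how strong each association is. A common complement is Cramér's V, an effect size between 0 and 1 derived from the chi-square statistic. The sketch below is an illustration under assumptions: the helper `cramers_v` and the toy series are hypothetical, and the continuity correction is switched off so that a perfect association yields exactly 1.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x, y):
    """Cramér's V for two categorical series (0 = no association, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1  # assumes both variables have at least two levels
    return float(np.sqrt(chi2 / (n * min_dim)))


# Toy data: one variable mirrors the cluster label, the other is spread evenly across clusters
clusters = pd.Series([0, 1, 2] * 10, name="Cluster_Label")
mirrors = clusters.map({0: "a", 1: "b", 2: "c"})
balanced = pd.Series((["x"] * 3 + ["y"] * 3) * 5)

print(cramers_v(mirrors, clusters))   # perfectly associated with the clusters
print(cramers_v(balanced, clusters))  # independent of the clusters
```

In the notebook, the same function could be applied to each entry of `cat_variables` against `Cluster_Label` and the values plotted in place of the p-values.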


In [None]:
# Assign cluster labels to the DataFrame
df1['Cluster_Label'] = km_4cluster_model

# Plot the scatter plot of the clusters (centroids are not drawn here)
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df1, x='skill_moves', y='potential_mmnorm', hue='Cluster_Label', legend='full')

# Set plot title and labels
plt.title('K-Means Clusters (k = 4): skill_moves vs potential_mmnorm')
plt.xlabel('skill_moves')
plt.ylabel('potential_mmnorm')
plt.legend()
plt.grid(True)
%memit
plt.show()
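
The scatter plot above colours points by cluster membership but does not draw any centroids. If centroid markers are wanted, one option is to take the per-cluster mean of the plotted features. This is a sketch under assumptions: the helper `cluster_centroids` and the small `demo` frame are hypothetical, and the per-cluster means match the fitted K-means centres only when they are computed on the same features the model was fitted on.

```python
import pandas as pd


def cluster_centroids(df, label_col, feature_cols):
    """Mean of each feature within each cluster, one row per cluster label."""
    return df.groupby(label_col)[feature_cols].mean()


# Tiny illustrative frame standing in for df1
demo = pd.DataFrame({
    "Cluster_Label": [0, 0, 1, 1],
    "skill_moves": [1, 3, 4, 4],
    "potential_mmnorm": [0.2, 0.4, 0.7, 0.9],
})

centroids = cluster_centroids(demo, "Cluster_Label", ["skill_moves", "potential_mmnorm"])
print(centroids)
```

The resulting frame can then be overlaid on the existing axes, for example with `plt.scatter(centroids['skill_moves'], centroids['potential_mmnorm'], marker='X', s=200, c='black')` before `plt.show()`.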


Interactive 3D scatter plot for 4-cluster model

In [None]:
import plotly.graph_objs as go
import numpy as np

# Create 3D scatter plot
fig = go.Figure()

# Get unique cluster labels
unique_clusters = np.unique(km_4cluster_model)

# Add traces for each cluster
for cluster_label in unique_clusters:
    cluster_data = df1[km_4cluster_model == cluster_label]
    fig.add_trace(go.Scatter3d(
        x=cluster_data['age'],
        y=cluster_data['potential_mmnorm'],
        z=cluster_data['skill_moves'],
        mode='markers',
        marker=dict(
            # No explicit colour is set: Plotly gives each trace its own default
            # colour, so every cluster is drawn in a distinct colour.
            size=5,
            opacity=0.8
        ),
        name=f'Cluster {cluster_label}'
    ))

# Set layout
fig.update_layout(
    title='K-Means Clusters (k = 4) in 3D',
    scene=dict(
        xaxis=dict(title='age'),
        yaxis=dict(title='potential_mmnorm'),
        zaxis=dict(title='skill_moves'),
    ),
    margin=dict(l=0, r=0, b=0, t=40)
)

# Show interactive 3D scatter plot
fig.show()


In [None]:
# Create subplots for each cluster
fig, axes = plt.subplots(1, 4, figsize=(20, 4), sharex=True, sharey=True)
fig.suptitle('K-Means Clusters (k = 4), One Panel per Cluster')

# Iterate through each cluster
for i in range(4):
    # Filter data points belonging to the current cluster
    cluster_data = df1[df1['Cluster_Label'] == i]

    # Scatter plot of data points
    sns.scatterplot(data=cluster_data, x='skill_moves', y='potential_mmnorm', ax=axes[i], label=f'Cluster {i}')

    # Set title and labels for each subplot
    axes[i].set_title(f'Cluster_Label {i}')
    axes[i].set_xlabel('skill_moves')
    axes[i].set_ylabel('potential_mmnorm')
    axes[i].legend()

plt.show()
%memit

In [None]:
# Comparison of Clusters (formed using K-means) on the basis of scores


# Silhouette Scores
ss_scores = [sscore_km_2cluster, sscore_km_3cluster, sscore_km_4cluster]

# Davies-Bouldin Index Scores
dbi_scores = [dbscore_km_2cluster, dbscore_km_3cluster, dbscore_km_4cluster]

# Number of clusters
clusters = [2, 3, 4]

# Create a figure and axis object
fig, ax = plt.subplots(figsize=(10, 6))

# Plot Silhouette Scores
ax.plot(clusters, ss_scores, marker='o', linestyle='-', label='Silhouette Score')

# Plot Davies-Bouldin Index Scores
ax.plot(clusters, dbi_scores, marker='o', linestyle='-', label='Davies-Bouldin Index')

# Set plot title and labels
ax.set_title('Cluster Evaluation Metrics')
ax.set_xlabel('Number of Clusters')
ax.set_ylabel('Score')

# Add legend
ax.legend()

# Show plot
plt.grid(True)
plt.show()
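
When reading the comparison plot, note that the two metrics point in opposite directions: a higher silhouette score indicates better separation, while a lower Davies-Bouldin index does. The sketch below shows how both scores can be collected for k = 2, 3, 4 in one loop. It runs on synthetic blobs rather than the player data, and the variable names are illustrative assumptions rather than the notebook's own.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic data standing in for the normalised player features
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = {
        "silhouette": silhouette_score(X, labels),          # higher is better, range [-1, 1]
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better, >= 0
    }

for k, s in scores.items():
    print(k, round(s["silhouette"], 3), round(s["davies_bouldin"], 3))
```

Because the scales and directions differ, plotting the two metrics on separate axes (or side-by-side panels) may be easier to interpret than a shared y-axis.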

In [None]:
# Track final memory usage (memory_usage() reports MiB)
final_memory = memory_usage()[0]

# Calculate elapsed time
elapsed_time = time.time() - start_time
%memit
# Print total time taken and final memory usage
print(f"Total time taken: {elapsed_time:.2f} seconds")
print(f"Final memory usage: {final_memory:.2f} MiB")

In [None]:
# Machine Learning Models and Evaluation Metrics

from sklearn.utils.validation import column_or_1d
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    f1_score, make_scorer, precision_recall_fscore_support,
)
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression, Lasso, Ridge

Logistic Regression

In [None]:
# Track start time and initial memory usage
start_time = time.time()
initial_memory = memory_usage()[0]

In [None]:
df1.columns

In [None]:
# Subset df1 based on Inputs & Output
# (this cell must run before the split below, which uses df1_inputs and df1_output)
df1_inputs = df1[[
       'fifa_version', 'league_id', 'league_level', 'club_team_id',
       'nationality_id', 'nation_team_id', 'weak_foot', 'skill_moves',
       'international_reputation', 'club_position_oe', 'preferred_foot_oe',
       'work_rate_oe', 'body_type_oe', 'fifa_update', 'age',
       'club_jersey_number', 'overall_mmnorm', 'potential_mmnorm',
       'value_eur_mmnorm', 'wage_eur_mmnorm', 'height_cm_mmnorm',
       'weight_kg_mmnorm', 'pace_mmnorm', 'shooting_mmnorm', 'passing_mmnorm',
       'dribbling_mmnorm', 'defending_mmnorm', 'physic_mmnorm',
       'attacking_crossing_mmnorm', 'attacking_finishing_mmnorm',
       'attacking_heading_accuracy_mmnorm', 'attacking_short_passing_mmnorm',
       'attacking_volleys_mmnorm', 'skill_dribbling_mmnorm',
       'skill_curve_mmnorm', 'skill_fk_accuracy_mmnorm',
       'skill_long_passing_mmnorm', 'skill_ball_control_mmnorm',
       'movement_acceleration_mmnorm', 'movement_sprint_speed_mmnorm',
       'movement_agility_mmnorm', 'movement_reactions_mmnorm',
       'movement_balance_mmnorm', 'power_shot_power_mmnorm',
       'power_jumping_mmnorm', 'power_stamina_mmnorm', 'power_strength_mmnorm',
       'power_long_shots_mmnorm', 'mentality_aggression_mmnorm',
       'mentality_interceptions_mmnorm', 'mentality_positioning_mmnorm',
       'mentality_vision_mmnorm', 'mentality_penalties_mmnorm',
       'mentality_composure_mmnorm', 'defending_marking_awareness_mmnorm',
       'defending_standing_tackle_mmnorm', 'defending_sliding_tackle_mmnorm',
       'goalkeeping_diving_mmnorm', 'goalkeeping_handling_mmnorm',
       'goalkeeping_kicking_mmnorm', 'goalkeeping_positioning_mmnorm',
       'goalkeeping_reflexes_mmnorm', 'goalkeeping_speed_mmnorm'
       ]]
df1_output = df1[['Cluster_Label']]

df1_inputs_names = df1_inputs.columns
df1_output_labels = df1_output['Cluster_Label'].unique().astype(str)
df1_output_labels

In [None]:
# Initialize StratifiedShuffleSplit with desired test size and random state
stratified_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=12345)

# Perform the stratified split to get training and testing indices
for train_index, test_index in stratified_split.split(df1_inputs, df1_output):
    df1_inputs_train, df1_inputs_test = df1_inputs.iloc[train_index], df1_inputs.iloc[test_index]
    df1_output_train, df1_output_test = df1_output.iloc[train_index], df1_output.iloc[test_index]

In [None]:
# Create and fit a Logistic Regression model
# (ravel() passes the labels as a 1-D array, which scikit-learn expects for y)
logreg = LogisticRegression(random_state=12345, solver='liblinear')
logreg.fit(df1_inputs_train, df1_output_train.values.ravel())

In [None]:
# Make predictions using the trained model
y_pred = logreg.predict(df1_inputs_test)

# Calculate accuracy
accuracy = accuracy_score(df1_output_test, y_pred)
print(f'Accuracy: {accuracy}')

# Generate classification report
classification_rep = classification_report(df1_output_test, y_pred)
print(f'Classification Report:\n{classification_rep}')

# Generate confusion matrix
conf_mat = confusion_matrix(df1_output_test, y_pred)
print(f'Confusion Matrix:\n{conf_mat}')
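
A single 80/20 split gives one estimate of performance. Since `cross_val_score` is already imported, a stratified k-fold run with a macro-averaged F1 score is a natural complement. The sketch below uses synthetic four-class data and scikit-learn's default solver rather than the notebook's `liblinear` setting; all names in it are illustrative assumptions. Keep in mind that the target labels here come from K-means on the same features, so a high score mainly indicates that the clusters are close to linearly separable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 4-class data standing in for df1_inputs / Cluster_Label
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=12345,
)

model = LogisticRegression(max_iter=1000, random_state=12345)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=12345)

# Macro F1 weights every class equally, regardless of cluster size
f1_macro = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
print(f"Macro F1: {f1_macro.mean():.3f} +/- {f1_macro.std():.3f}")
```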

In [None]:
# Plot confusion matrix
def plot_confusion_matrix(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    plt.show()

# Assuming df1_output_test and y_pred are the true and predicted labels, respectively
plot_confusion_matrix(df1_output_test, y_pred)

In [None]:
# Track end time and final memory usage
end_time = time.time()
final_memory = memory_usage()[0]

# Calculate elapsed time and memory used
elapsed_time = end_time - start_time
memory_used = final_memory - initial_memory

print(f"Elapsed time: {elapsed_time} seconds")
print(f"Memory used: {memory_used} MiB")
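
The start/stop bookkeeping for time and memory is repeated around each model in this notebook. One way to avoid the repetition is a small context manager. The sketch below is an assumption-laden alternative, not the notebook's method: `track_resources` is a hypothetical helper, and it uses the standard-library `tracemalloc`, which reports peak Python-level allocations rather than the process memory that `memory_usage()` measures, so the numbers are not directly comparable.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def track_resources(label):
    """Report wall-clock time and peak traced allocations for the enclosed block."""
    stats = {}
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield stats
    finally:
        stats["seconds"] = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        stats["peak_mib"] = peak / 2**20  # bytes to MiB
        print(f"{label}: {stats['seconds']:.3f} s, peak {stats['peak_mib']:.2f} MiB")


# Example: wrap any block whose cost should be reported
with track_resources("demo") as usage:
    data = [i * i for i in range(100_000)]
```

The `usage` dictionary is filled in when the block exits, so the measurements can also be stored for a later comparison across models.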