# FIFA 19 Player Undervaluation Analysis

* **Student:** Shola Lajuwomi
* **Class:** AI

## Project Description

Develop a Jupyter Notebook (for Kaggle/GitHub) that performs unsupervised learning (DBSCAN) on the FIFA 19 player dataset to identify potentially undervalued player archetypes within specific positions (e.g., Strikers). The analysis focuses on clustering players based on position-specific composite skill metrics versus their market value ('Value'). The notebook will follow the structure required for the AI class assignment, including introduction, EDA, preprocessing, model training/tuning, and conclusion.

## Target Audience

-   AI Course Instructor (Mr. Dole) for assignment evaluation.
-   CS students interested in a practical unsupervised learning example.
-   Football analysts/enthusiasts interested in player valuation methods.

## Data Source

The analysis uses the "FIFA 19 Complete Player Dataset" available on Kaggle:
[https://www.kaggle.com/datasets/javagarm/fifa-19-complete-player-dataset](https://www.kaggle.com/datasets/javagarm/fifa-19-complete-player-dataset)

## 1. Introduction

**Problem:** Identifying potentially undervalued player archetypes in the FIFA 19 dataset using unsupervised learning. Standard valuation methods might overlook players who offer high skill relative to their market price.

**Goal:** To cluster players within selected positions (starting with Strikers 'ST') based on a composite skill score versus their market value ('Value_EUR'). The aim is to use DBSCAN to identify distinct groups, particularly focusing on clusters representing high-skill, low-value players ("undervalued archetypes").

**Data Source:** The analysis utilizes the "FIFA 19 Complete Player Dataset" sourced from Kaggle.
* Dataset Link: [https://www.kaggle.com/datasets/stefanoleone992/fifa-19-complete-player-dataset](https://www.kaggle.com/datasets/stefanoleone992/fifa-19-complete-player-dataset)
*(Note: This URL comes from the technical specification and differs from the Kaggle link listed under "Data Source" above. The analysis uses the specification URL.)*

**Methodology Outline:**
1.  **Data Loading & Initial Exploration:** Load the dataset and perform preliminary checks.
2.  **Data Preprocessing:** Clean the data, handle missing values, convert data types (e.g., height, weight, currency).
3.  **Feature Engineering:** Define and calculate a composite skill score for the target position(s).
4.  **Exploratory Data Analysis (EDA):** Visualize distributions and relationships in the cleaned data and engineered features.
5.  **Feature Scaling:** Scale the selected features (skill score, value) for clustering.
6.  **Unsupervised Learning (DBSCAN):** Apply DBSCAN to the scaled data, including hyperparameter tuning.
7.  **Results Analysis:** Visualize clusters, identify undervalued groups, and examine sample players.
8.  **Conclusion:** Summarize findings, evaluate the model, discuss limitations, and suggest future work.
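
Steps 3, 5, and 6 of the outline can be sketched end to end on synthetic data. This is a minimal illustration, not the notebook's final method: the attribute names (`Finishing`, `Positioning`, `ShotPower`), the weights, and the `eps`/`min_samples` values are assumptions chosen only for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for striker rows (assumption: the real analysis uses fifa_df).
rng = np.random.default_rng(42)
n_players = 200
toy_df = pd.DataFrame({
    "Finishing": rng.integers(40, 95, n_players),
    "Positioning": rng.integers(40, 95, n_players),
    "ShotPower": rng.integers(40, 95, n_players),
    "Value_EUR": rng.lognormal(mean=14, sigma=1.0, size=n_players),
})

# Step 3: composite skill score as a weighted sum (hypothetical weights summing to 1).
skill_weights = {"Finishing": 0.5, "Positioning": 0.3, "ShotPower": 0.2}
toy_df["SkillScore"] = sum(toy_df[col] * w for col, w in skill_weights.items())

# Step 5: scale both features so neither dominates the distance metric.
scaled = StandardScaler().fit_transform(toy_df[["SkillScore", "Value_EUR"]])

# Step 6: DBSCAN labels dense groups 0, 1, ... and marks noise points as -1.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(scaled)
toy_df["Cluster"] = labels

# Median skill and value per cluster, to spot high-skill, low-value groups.
cluster_summary = toy_df.groupby("Cluster")[["SkillScore", "Value_EUR"]].median()
print(cluster_summary)
```

Scaling matters here because DBSCAN uses a single distance threshold (`eps`) across all features, and raw euro values are orders of magnitude larger than skill scores.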

In [7]:
# This cell imports the required libraries and configures the plotting environment.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
# Ensure plots are displayed inline in the notebook
%matplotlib inline
# Set a visually appealing style for the plots
sns.set_style('whitegrid')

print("Libraries imported successfully.")

Libraries imported successfully.


## 2. Data Loading and Initial Exploration

This section loads the dataset from the specified CSV file into a pandas DataFrame and performs initial checks to understand its structure, data types, and basic statistics.

In [8]:

# Define the path to the dataset file
# Assumes the CSV file is in the same directory as the notebook, or in a './data/' subdirectory.
# Update this path if your file is located elsewhere.
data_path = 'FIFA_19_COMPLETE_PLAYER_DATASET.csv'
# data_path = './data/FIFA_19_COMPLETE_PLAYER_DATASET.csv' # Alternative if in a 'data' subfolder

# Attempt to load the dataset
try:
    fifa_df = pd.read_csv(data_path, encoding='Windows-1252')
    print(f"Dataset loaded successfully from '{data_path}'.")
except FileNotFoundError:
    print(f"Error: The file '{data_path}' was not found.")
    print("Please ensure the 'FIFA_19_COMPLETE_PLAYER_DATASET.csv' file is in the correct directory.")
    # Depending on the environment, you might want to stop execution here
    # For example, in a script: import sys; sys.exit()
    # In a notebook, this message serves as a clear warning.
    fifa_df = None # Set to None if loading failed

# Display the first few rows if the DataFrame loaded successfully
if fifa_df is not None:
    print("Displaying the first 5 rows of the dataset:")
    display(fifa_df.head())  # display() renders a formatted table in Jupyter/Colab
else:
    print("Cannot display head because the dataset failed to load.")

Dataset loaded successfully from 'FIFA_19_COMPLETE_PLAYER_DATASET.csv'.
Displaying the first 5 rows of the dataset:


| | Unnamed: 0 | ID | Name | Age | Photo | Nationality | Flag | Overall | Potential | Club | ... | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes | Release Clause |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 158023 | L. Messi | 31.0 | https://cdn.sofifa.org/players/4/19/158023.png | Argentina | https://cdn.sofifa.org/flags/52.png | 94.0 | 94 | FC Barcelona | ... | 96.0 | 33.0 | 28.0 | 26.0 | 6.0 | 11.0 | 15.0 | 14.0 | 8.0 | €226.5M |
| 1 | 1 | 20801 | Cristiano Ronaldo | 33.0 | https://cdn.sofifa.org/players/4/19/20801.png | Portugal | https://cdn.sofifa.org/flags/38.png | 94.0 | 94 | Juventus | ... | 95.0 | 28.0 | 31.0 | 23.0 | 7.0 | 11.0 | 15.0 | 14.0 | 11.0 | €127.1M |
| 2 | 2 | 190871 | Neymar Jr | 26.0 | https://cdn.sofifa.org/players/4/19/190871.png | Brazil | https://cdn.sofifa.org/flags/54.png | 92.0 | 93 | Paris Saint-Germain | ... | 94.0 | 27.0 | 24.0 | 33.0 | 9.0 | 9.0 | 15.0 | 15.0 | 11.0 | €228.1M |
| 3 | 3 | 193080 | De Gea | 27.0 | https://cdn.sofifa.org/players/4/19/193080.png | Spain | https://cdn.sofifa.org/flags/45.png | 91.0 | 93 | Manchester United | ... | 68.0 | 15.0 | 21.0 | 13.0 | 90.0 | 85.0 | 87.0 | 88.0 | 94.0 | €138.6M |
| 4 | 4 | 192985 | K. De Bruyne | 27.0 | https://cdn.sofifa.org/players/4/19/192985.png | Belgium | https://cdn.sofifa.org/flags/7.png | 91.0 | 92 | Manchester City | ... | 88.0 | 68.0 | 58.0 | 51.0 | 15.0 | 13.0 | 5.0 | 10.0 | 13.0 | €196.4M |


### 2.1 DataFrame Info

Let's start by getting a concise summary of the DataFrame. The `.info()` method provides information about the column data types, the number of non-null values, and memory usage. This helps identify columns that might need type conversion or missing value handling.

In [9]:
fifa_df.info()  # info() prints its summary directly and returns None, so display() is not needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 89 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                18207 non-null  int64  
 1   ID                        18207 non-null  int64  
 2   Name                      18207 non-null  object 
 3   Age                       18206 non-null  float64
 4   Photo                     18207 non-null  object 
 5   Nationality               18207 non-null  object 
 6   Flag                      18207 non-null  object 
 7   Overall                   18206 non-null  float64
 8   Potential                 18207 non-null  int64  
 9   Club                      17966 non-null  object 
 10  Club Logo                 18207 non-null  object 
 11  Value                     18207 non-null  object 
 12  Wage                      18207 non-null  object 
 13  Special                   18207 non-null  int64  
 14  Prefer
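
The same information that `.info()` prints can also be collected into a small table, which makes it easier to sort or filter columns that need type conversion or missing-value handling. The sketch below uses a tiny made-up frame (`sample_df` and its rows are hypothetical); in the notebook the same expressions would apply to `fifa_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with one missing age, one missing club,
# and currency stored as text (mirroring the 'Value' column format).
sample_df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Age": [31.0, 33.0, np.nan],
    "Club": ["Club X", "Club Y", None],
    "Value": ["€110.5M", "€77M", "€565K"],
})

# One row per column: its dtype, missing count, and whether it is numeric.
column_audit = pd.DataFrame({
    "dtype": sample_df.dtypes.astype(str),
    "missing": sample_df.isna().sum(),
    "is_numeric": [pd.api.types.is_numeric_dtype(sample_df[c]) for c in sample_df.columns],
})

# Columns with gaps come first; text columns such as 'Value' are conversion candidates.
print(column_audit.sort_values("missing", ascending=False))
```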

### 2.2 First Few Rows

Displaying the first few rows using `.head()` gives us a glimpse of the actual data values and helps verify that the data loaded correctly.

In [10]:
display(fifa_df.head())


| | Unnamed: 0 | ID | Name | Age | Photo | Nationality | Flag | Overall | Potential | Club | ... | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes | Release Clause |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 158023 | L. Messi | 31.0 | https://cdn.sofifa.org/players/4/19/158023.png | Argentina | https://cdn.sofifa.org/flags/52.png | 94.0 | 94 | FC Barcelona | ... | 96.0 | 33.0 | 28.0 | 26.0 | 6.0 | 11.0 | 15.0 | 14.0 | 8.0 | €226.5M |
| 1 | 1 | 20801 | Cristiano Ronaldo | 33.0 | https://cdn.sofifa.org/players/4/19/20801.png | Portugal | https://cdn.sofifa.org/flags/38.png | 94.0 | 94 | Juventus | ... | 95.0 | 28.0 | 31.0 | 23.0 | 7.0 | 11.0 | 15.0 | 14.0 | 11.0 | €127.1M |
| 2 | 2 | 190871 | Neymar Jr | 26.0 | https://cdn.sofifa.org/players/4/19/190871.png | Brazil | https://cdn.sofifa.org/flags/54.png | 92.0 | 93 | Paris Saint-Germain | ... | 94.0 | 27.0 | 24.0 | 33.0 | 9.0 | 9.0 | 15.0 | 15.0 | 11.0 | €228.1M |
| 3 | 3 | 193080 | De Gea | 27.0 | https://cdn.sofifa.org/players/4/19/193080.png | Spain | https://cdn.sofifa.org/flags/45.png | 91.0 | 93 | Manchester United | ... | 68.0 | 15.0 | 21.0 | 13.0 | 90.0 | 85.0 | 87.0 | 88.0 | 94.0 | €138.6M |
| 4 | 4 | 192985 | K. De Bruyne | 27.0 | https://cdn.sofifa.org/players/4/19/192985.png | Belgium | https://cdn.sofifa.org/flags/7.png | 91.0 | 92 | Manchester City | ... | 88.0 | 68.0 | 58.0 | 51.0 | 15.0 | 13.0 | 5.0 | 10.0 | 13.0 | €196.4M |


### 2.3 Dataset Shape

Checking the `.shape` attribute tells us the total number of rows (players) and columns (features) in the dataset.

In [11]:
print(f"Dataset shape: {fifa_df.shape}")

Dataset shape: (18207, 89)


### 2.4 Summary Statistics

The `.describe(include='all')` method provides summary statistics for all columns. For numerical columns, it includes count, mean, standard deviation, min, max, and quartiles. For object/categorical columns, it includes count, unique values, top value, and frequency of the top value. This gives a broad overview of the data distribution and potential issues like outliers or skewed distributions.

In [12]:
display(fifa_df.describe(include='all'))

| | Unnamed: 0 | ID | Name | Age | Photo | Nationality | Flag | Overall | Potential | Club | ... | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes | Release Clause |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 18207.0 | 18207.0 | 18207 | 18206.0 | 18207 | 18207 | 18207 | 18206.0 | 18207.0 | 17966 | ... | 18159.0 | 18159.0 | 18159.0 | 18159.0 | 18159.0 | 18159.0 | 18159.0 | 18159.0 | 18159.0 | 16643 |
| unique | | | 17194 | | 18207 | 164 | 164 | | | 651 | ... | | | | | | | | | | 1244 |
| top | | | J. Rodríguez | | https://cdn.sofifa.org/players/4/19/246269.png | England | https://cdn.sofifa.org/flags/14.png | | | RC Celta | ... | | | | | | | | | | €1.1M |
| freq | | | 11 | | 1 | 1662 | 1662 | | | 33 | ... | | | | | | | | | | 557 |
| mean | 9103.0 | 214298.338606 | | 25.122048 | | | | 66.237449 | 71.307299 | | ... | 58.648274 | 47.281623 | 47.697836 | 45.661435 | 16.616223 | 16.391596 | 16.232061 | 16.388898 | 16.710887 | |
| std | 5256.052511 | 29965.244204 | | 4.670022 | | | | 6.907059 | 6.136496 | | ... | 11.436133 | 19.904397 | 21.664004 | 21.289135 | 17.695349 | 16.9069 | 16.502864 | 17.034669 | 17.955119 | |
| min | 0.0 | 16.0 | | 16.0 | | | | 46.0 | 48.0 | | ... | 3.0 | 3.0 | 2.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | |
| 25% | 4551.5 | 200315.5 | | 21.0 | | | | 62.0 | 67.0 | | ... | 51.0 | 30.0 | 27.0 | 24.0 | 8.0 | 8.0 | 8.0 | 8.0 | 8.0 | |
| 50% | 9103.0 | 221759.0 | | 25.0 | | | | 66.0 | 71.0 | | ... | 60.0 | 53.0 | 55.0 | 52.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | |
| 75% | 13654.5 | 236529.5 | | 28.0 | | | | 71.0 | 75.0 | | ... | 67.0 | 64.0 | 66.0 | 64.0 | 14.0 | 14.0 | 14.0 | 14.0 | 14.0 | |


### 2.5 Check for Duplicate Rows

Duplicate rows can unnecessarily inflate the dataset size and potentially bias analysis results. We check for and count any fully duplicate rows.

In [13]:
duplicate_count = fifa_df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

Number of duplicate rows: 0
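
A count of zero fully duplicated rows does not rule out repeats on a key column. The summary statistics above show fewer unique names than rows, so repeated names are expected, while a repeated `ID` would point to a real data problem. A minimal sketch on made-up rows (`toy_df` is hypothetical) shows how the `subset` argument of `duplicated()` separates these cases:

```python
import pandas as pd

# Hypothetical rows: two entries share an ID, three share a name,
# but no two rows are identical in every column.
toy_df = pd.DataFrame({
    "ID": [101, 102, 102, 103],
    "Name": ["J. Rodríguez", "J. Rodríguez", "J. Rodríguez", "A. Player"],
    "Overall": [70, 65, 66, 80],
})

full_duplicates = toy_df.duplicated().sum()                 # identical in all columns
id_duplicates = toy_df.duplicated(subset=["ID"]).sum()      # repeated player ID
name_duplicates = toy_df.duplicated(subset=["Name"]).sum()  # repeated display name

print(f"Full: {full_duplicates}, by ID: {id_duplicates}, by Name: {name_duplicates}")
```

By default `duplicated()` flags every occurrence after the first, so three rows sharing a name count as two duplicates.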


## 3. Data Preprocessing

This section focuses on cleaning and transforming the raw data into a format suitable for analysis and modeling. This involves handling missing values, converting data types, and removing irrelevant information.
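
As one illustration of the type conversions involved, currency strings such as `€110.5M` and `€565K` must become numbers before they can be scaled or clustered. The sketch below assumes `K` and `M` are the only suffixes present; `parse_euro_amount` is a hypothetical helper written for this example, not a function from the notebook or from pandas.

```python
import re

import numpy as np
import pandas as pd

# Assumption: amounts use only 'K' (thousands) and 'M' (millions) suffixes.
_MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_euro_amount(text):
    """Convert strings such as '€110.5M' or '€565K' to a float number of euros."""
    if not isinstance(text, str):
        return np.nan  # missing or non-text input
    match = re.fullmatch(r"€?\s*(\d+(?:\.\d+)?)([KM]?)", text.strip())
    if match is None:
        return np.nan  # unexpected format: flag it rather than guess
    number, suffix = match.groups()
    return float(number) * _MULTIPLIERS.get(suffix, 1)


# Example strings in the dataset's format; the last two are edge cases for illustration.
values = pd.Series(["€110.5M", "€565K", "€0", None])
value_eur = values.map(parse_euro_amount)
print(value_eur)
```

Returning `NaN` for unrecognized input keeps bad rows visible, so they can be counted and handled explicitly instead of silently becoming zeros.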

### 3.1 Drop Unnecessary Columns

Several columns identified in the project specification are not relevant for this analysis (e.g., the leftover row index, image URLs, and contract or administrative details) and can be removed. The player `ID` column is kept.

In [14]:
# List of columns to drop
columns_to_drop = [
    'Unnamed: 0', 'Photo', 'Flag', 'Club Logo', 'Real Face',
    'Joined', 'Loaned From', 'Contract Valid Until', 'Special'
]

# Record shape before dropping
shape_before_drop = fifa_df.shape
print(f"Shape before dropping columns: {shape_before_drop}")

# Drop the columns, using errors='ignore' in case some are already missing
fifa_df = fifa_df.drop(columns=columns_to_drop, errors='ignore')

# Print the new shape
print(f"Shape after dropping columns: {fifa_df.shape}")

# Display the first few rows with remaining columns
display(fifa_df.head())

Shape before dropping columns: (18207, 89)
Shape after dropping columns: (18207, 80)


| | ID | Name | Age | Nationality | Overall | Potential | Club | Value | Wage | Preferred Foot | ... | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes | Release Clause |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 158023 | L. Messi | 31.0 | Argentina | 94.0 | 94 | FC Barcelona | €110.5M | €565K | Left | ... | 96.0 | 33.0 | 28.0 | 26.0 | 6.0 | 11.0 | 15.0 | 14.0 | 8.0 | €226.5M |
| 1 | 20801 | Cristiano Ronaldo | 33.0 | Portugal | 94.0 | 94 | Juventus | €77M | €405K | Right | ... | 95.0 | 28.0 | 31.0 | 23.0 | 7.0 | 11.0 | 15.0 | 14.0 | 11.0 | €127.1M |
| 2 | 190871 | Neymar Jr | 26.0 | Brazil | 92.0 | 93 | Paris Saint-Germain | €118.5M | €290K | Right | ... | 94.0 | 27.0 | 24.0 | 33.0 | 9.0 | 9.0 | 15.0 | 15.0 | 11.0 | €228.1M |
| 3 | 193080 | De Gea | 27.0 | Spain | 91.0 | 93 | Manchester United | €72M | €260K | Right | ... | 68.0 | 15.0 | 21.0 | 13.0 | 90.0 | 85.0 | 87.0 | 88.0 | 94.0 | €138.6M |
| 4 | 192985 | K. De Bruyne | 27.0 | Belgium | 91.0 | 92 | Manchester City | €102M | €355K | Right | ... | 88.0 | 68.0 | 58.0 | 51.0 | 15.0 | 13.0 | 5.0 | 10.0 | 13.0 | €196.4M |


### 3.2 Handle Missing Position Values

The 'Position' column is critical for our analysis, especially for filtering players and potentially for calculating position-specific skill scores later. Rows where this information is missing cannot be effectively used in the core part of our analysis, so we will drop them.

In [15]:
# Store the row count before dropping
shape_before = fifa_df.shape[0]

# Drop rows where 'Position' is NaN (reassigning avoids in-place mutation)
fifa_df = fifa_df.dropna(subset=['Position'])

# Calculate and print the number of rows dropped
rows_dropped = shape_before - fifa_df.shape[0]
print(f"Dropped {rows_dropped} rows with missing 'Position'.")
print(f"New shape: {fifa_df.shape}")

Dropped 60 rows with missing 'Position'.
New shape: (18147, 80)
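
Once rows without a position are gone, the position filter mentioned above becomes a simple boolean mask. The sketch below uses a made-up frame (`toy_df` is hypothetical) and assumes strikers are labeled `'ST'`, as in the project goal:

```python
import pandas as pd

# Hypothetical players: one has no recorded position.
toy_df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Position": ["ST", "GK", None, "ST"],
})

# Drop rows with a missing position, then keep only strikers.
positioned = toy_df.dropna(subset=["Position"])
rows_dropped = len(toy_df) - len(positioned)

# .copy() gives an independent frame, so later column assignments
# do not trigger pandas' chained-assignment warnings.
strikers = positioned[positioned["Position"] == "ST"].copy()

print(f"Dropped {rows_dropped} row(s); kept {len(strikers)} striker(s).")
```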
