# Unlocking Future Stars: Predicting High-Potential Football Players with Machine Learning

## PROJECT OVERVIEW

Smart scouting matters most when a club has to outthink its rivals rather than outspend them. Sifting through thousands of young players worldwide on a tight budget is a real challenge. You need to know which of these youngsters will grow into world-class players. Miss a future star, and you lose a chance to build a legacy. Sign the wrong player, and you waste millions. It’s a high-stakes gamble, and the pressure is on.

That’s where my project comes in. I’ve used data science to build a kind of scouting assistant. Using a large dataset from the FIFA 21 game, with details on over 18,000 players (skills, market value, and more), I’ve built a tool to predict which young players, aged 23 or under, have what it takes to reach the elite level, in the mould of Lionel Messi or Kylian Mbappé. The model doesn’t just guess; it learns patterns from the data to spot players with that special potential, even if they’re flying under the radar.



## OBJECTIVES

The key objectives of this study are:

1. **Find Young Players Who Can Become Stars**: Build a tool to predict which players aged 23 or younger are likely to reach an elite level (Overall Rating ≥ 80), helping the club spot future top performers early.

2. **Save Money and Time in Scouting**: Create a model that shortlists the most promising players, so the scouting team can focus on the best targets without wasting resources on less likely prospects.

3. **Give the Club a Competitive Edge**: Use data to identify undervalued players before bigger clubs notice them, allowing the club to sign talent at a lower cost and build a stronger team.
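
To make the first objective concrete, the sketch below shows one way the "future star" label could be framed on a toy table. The column names `Age`, `OVA`, and `POT` mirror the FIFA 21 fields, but the rows are invented, and treating a potential (`POT`) of at least 80 as the target for players aged 23 or under is an assumption for illustration, not necessarily the labeling used later in this project.

```python
import pandas as pd

# Toy rows, invented for illustration only
players = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [19, 22, 27, 23],
    "OVA": [68, 81, 84, 72],
    "POT": [86, 83, 84, 76],
})

ELITE_THRESHOLD = 80  # elite rating level named in the objectives
MAX_AGE = 23          # upper age limit for "young" players

# Keep only the young players
young = players[players["Age"] <= MAX_AGE].copy()

# Assumption: a young player counts as a future star if POT reaches the threshold
young["future_star"] = (young["POT"] >= ELITE_THRESHOLD).astype(int)

print(young[["Name", "Age", "OVA", "POT", "future_star"]])
```

The binary `future_star` column is the kind of target a classifier such as logistic regression or a decision tree can be trained on.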

## DATA UNDERSTANDING



In [12]:
#Import necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

In [11]:
# Load the dataset
fifa = pd.read_csv('/home/user/Documents/Flatiron/fifa data/fifa21_raw_data.csv')
fifa.head()

Unnamed: 0,photoUrl,LongName,playerUrl,Nationality,Positions,Name,Age,↓OVA,POT,Team & Contract,...,A/W,D/W,IR,PAC,SHO,PAS,DRI,DEF,PHY,Hits
0,https://cdn.sofifa.com/players/158/023/21_60.png,Lionel Messi,http://sofifa.com/player/158023/lionel-messi/2...,Argentina,RW ST CF,L. Messi,33,93,93,\n\n\n\nFC Barcelona\n2004 ~ 2021\n\n,...,Medium,Low,5 ★,85,92,91,95,38,65,\n372
1,https://cdn.sofifa.com/players/020/801/21_60.png,C. Ronaldo dos Santos Aveiro,http://sofifa.com/player/20801/c-ronaldo-dos-s...,Portugal,ST LW,Cristiano Ronaldo,35,92,92,\n\n\n\nJuventus\n2018 ~ 2022\n\n,...,High,Low,5 ★,89,93,81,89,35,77,\n344
2,https://cdn.sofifa.com/players/200/389/21_60.png,Jan Oblak,http://sofifa.com/player/200389/jan-oblak/210005/,Slovenia,GK,J. Oblak,27,91,93,\n\n\n\nAtlético Madrid\n2014 ~ 2023\n\n,...,Medium,Medium,3 ★,87,92,78,90,52,90,\n86
3,https://cdn.sofifa.com/players/192/985/21_60.png,Kevin De Bruyne,http://sofifa.com/player/192985/kevin-de-bruyn...,Belgium,CAM CM,K. De Bruyne,29,91,91,\n\n\n\nManchester City\n2015 ~ 2023\n\n,...,High,High,4 ★,76,86,93,88,64,78,\n163
4,https://cdn.sofifa.com/players/190/871/21_60.png,Neymar da Silva Santos Jr.,http://sofifa.com/player/190871/neymar-da-silv...,Brazil,LW CAM,Neymar Jr,28,91,91,\n\n\n\nParis Saint-Germain\n2017 ~ 2022\n\n,...,High,Medium,5 ★,91,85,86,94,36,59,\n273


In [10]:
#Checking the columns in the DataFrame
fifa.columns

Index(['photoUrl', 'LongName', 'playerUrl', 'Nationality', 'Positions', 'Name',
       'Age', '↓OVA', 'POT', 'Team & Contract', 'ID', 'Height', 'Weight',
       'foot', 'BOV', 'BP', 'Growth', 'Joined', 'Loan Date End', 'Value',
       'Wage', 'Release Clause', 'Attacking', 'Crossing', 'Finishing',
       'Heading Accuracy', 'Short Passing', 'Volleys', 'Skill', 'Dribbling',
       'Curve', 'FK Accuracy', 'Long Passing', 'Ball Control', 'Movement',
       'Acceleration', 'Sprint Speed', 'Agility', 'Reactions', 'Balance',
       'Power', 'Shot Power', 'Jumping', 'Stamina', 'Strength', 'Long Shots',
       'Mentality', 'Aggression', 'Interceptions', 'Positioning', 'Vision',
       'Penalties', 'Composure', 'Defending', 'Marking', 'Standing Tackle',
       'Sliding Tackle', 'Goalkeeping', 'GK Diving', 'GK Handling',
       'GK Kicking', 'GK Positioning', 'GK Reflexes', 'Total Stats',
       'Base Stats', 'W/F', 'SM', 'A/W', 'D/W', 'IR', 'PAC', 'SHO', 'PAS',
       'DRI', 'DEF', 'PHY', 'Hits'],
      dtype='object')

In [13]:
#Checking the shape of the DataFrame
fifa.shape

(18979, 77)

In [16]:
#checking the summary of the DataFrame
fifa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18979 entries, 0 to 18978
Data columns (total 77 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   photoUrl          18979 non-null  object
 1   LongName          18979 non-null  object
 2   playerUrl         18979 non-null  object
 3   Nationality       18979 non-null  object
 4   Positions         18979 non-null  object
 5   Name              18979 non-null  object
 6   Age               18979 non-null  int64 
 7   ↓OVA              18979 non-null  int64 
 8   POT               18979 non-null  int64 
 9   Team & Contract   18979 non-null  object
 10  ID                18979 non-null  int64 
 11  Height            18979 non-null  object
 12  Weight            18979 non-null  object
 13  foot              18979 non-null  object
 14  BOV               18979 non-null  int64 
 15  BP                18979 non-null  object
 16  Growth            18979 non-null  int64 
 17  Joined      

In [55]:
# Drop URL, identifier, and descriptive columns that are not used in the analysis
fifa_df = fifa.drop(columns=['photoUrl', 'playerUrl', 'Loan Date End', 'BP',
                             'LongName', 'Nationality', 'Positions', 'Team & Contract', 'ID', 'Joined'])

In [56]:
#checking the new fifa_df DataFrame
fifa_df.head(10)

Unnamed: 0,Name,Age,↓OVA,POT,Height,Weight,foot,BOV,Growth,Value,...,A/W,D/W,IR,PAC,SHO,PAS,DRI,DEF,PHY,Hits
0,L. Messi,33,93,93,"5'7""",159lbs,Left,93,0,€67.5M,...,Medium,Low,5 ★,85,92,91,95,38,65,\n372
1,Cristiano Ronaldo,35,92,92,"6'2""",183lbs,Right,92,0,€46M,...,High,Low,5 ★,89,93,81,89,35,77,\n344
2,J. Oblak,27,91,93,"6'2""",192lbs,Right,91,2,€75M,...,Medium,Medium,3 ★,87,92,78,90,52,90,\n86
3,K. De Bruyne,29,91,91,"5'11""",154lbs,Right,91,0,€87M,...,High,High,4 ★,76,86,93,88,64,78,\n163
4,Neymar Jr,28,91,91,"5'9""",150lbs,Right,91,0,€90M,...,High,Medium,5 ★,91,85,86,94,36,59,\n273
5,R. Lewandowski,31,91,91,"6'0""",176lbs,Right,91,0,€80M,...,High,Medium,4 ★,78,91,78,85,43,82,\n182
6,K. Mbappé,21,90,95,"5'10""",161lbs,Right,91,5,€105.5M,...,High,Low,3 ★,96,86,78,91,39,76,\n646
7,Alisson,27,90,91,"6'3""",201lbs,Right,90,1,€62.5M,...,Medium,Medium,3 ★,86,88,85,89,51,91,\n79
8,M. Salah,28,90,90,"5'9""",157lbs,Left,90,0,€78M,...,High,Medium,3 ★,93,86,81,90,45,75,\n164
9,S. Mané,28,90,90,"5'9""",152lbs,Right,90,0,€78M,...,High,Medium,3 ★,94,85,80,90,44,76,\n170


In [57]:
print(fifa_df.shape)
# info() prints its summary directly and returns None, so it is not wrapped in print()
fifa_df.info()

(18979, 67)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18979 entries, 0 to 18978
Data columns (total 67 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Name              18979 non-null  object
 1   Age               18979 non-null  int64 
 2   ↓OVA              18979 non-null  int64 
 3   POT               18979 non-null  int64 
 4   Height            18979 non-null  object
 5   Weight            18979 non-null  object
 6   foot              18979 non-null  object
 7   BOV               18979 non-null  int64 
 8   Growth            18979 non-null  int64 
 9   Value             18979 non-null  object
 10  Wage              18979 non-null  object
 11  Release Clause    18979 non-null  object
 12  Attacking         18979 non-null  int64 
 13  Crossing          18979 non-null  int64 
 14  Finishing         18979 non-null  int64 
 15  Heading Accuracy  18979 non-null  int64 
 16  Short Passing     18979 non-null  int64 
 17  

In [58]:
#checking for null values
fifa_df.isnull().sum()

Name      0
Age       0
↓OVA      0
POT       0
Height    0
         ..
PAS       0
DRI       0
DEF       0
PHY       0
Hits      0
Length: 67, dtype: int64
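
A zero count from `isnull()` only means that no cell is `NaN`; it does not catch text columns that carry stray whitespace, such as the `\n372` values visible in the `Hits` column of the preview above. Below is a minimal sketch of one possible cleanup on an invented sample. The sample values and the choice of `errors="coerce"` are assumptions, not steps taken in this notebook.

```python
import pandas as pd

# Invented sample mimicking the leading newline seen in the 'Hits' preview
hits = pd.Series(["\n372", "\n344", "\n86", " 163 "])

# No NaN is reported, yet the strings are not clean numbers
print(hits.isnull().sum())

# Strip surrounding whitespace, then convert; unparseable entries would become NaN
hits_clean = pd.to_numeric(hits.str.strip(), errors="coerce")
print(hits_clean.tolist())
```

Re-running `isnull().sum()` after such a conversion would reveal any entries that could not be parsed as numbers.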

In [61]:
# Remove the star symbol from the rating columns (IR = international reputation,
# SM = skill moves, W/F = weak foot) when they are stored as strings
for col in ['IR', 'SM', 'W/F']:
    if fifa_df[col].dtype == 'object':
        fifa_df[col] = fifa_df[col].str.replace('★', '', regex=False).str.strip().astype(int)

# Checking the unique values in the IR column
print(fifa_df['IR'].unique())


[5 3 4 2 1]
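
The same star removal can also be written with a digit-extracting pattern, which does not depend on the exact star character or on the spacing around it. This is an alternative sketch on invented values, not the cell used above.

```python
import pandas as pd

# Invented sample in the "5 ★" format shown in the preview
ratings = pd.Series(["5 ★", "3 ★", "4★", "1 ★"])

# Pull out the first run of digits in each string and convert it to an integer
stars = ratings.str.extract(r"(\d+)", expand=False).astype(int)

print(stars.tolist())
```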


In [62]:
fifa_df.head(3)

Unnamed: 0,Name,Age,↓OVA,POT,Height,Weight,foot,BOV,Growth,Value,...,A/W,D/W,IR,PAC,SHO,PAS,DRI,DEF,PHY,Hits
0,L. Messi,33,93,93,"5'7""",159lbs,Left,93,0,€67.5M,...,Medium,Low,5,85,92,91,95,38,65,\n372
1,Cristiano Ronaldo,35,92,92,"6'2""",183lbs,Right,92,0,€46M,...,High,Low,5,89,93,81,89,35,77,\n344
2,J. Oblak,27,91,93,"6'2""",192lbs,Right,91,2,€75M,...,Medium,Medium,3,87,92,78,90,52,90,\n86


In [64]:
# Convert monetary columns such as '€67.5M' to floats in euros
def convert_money(value):
    if isinstance(value, str):
        value = value.replace('€', '')
        if 'M' in value:
            return float(value.replace('M', '')) * 1_000_000
        elif 'K' in value:
            return float(value.replace('K', '')) * 1_000
        return float(value)
    return value

for col in ['Value', 'Wage', 'Release Clause']:
    fifa_df[col] = fifa_df[col].apply(convert_money)
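
To illustrate what this conversion is expected to produce, here is a self-contained check of the same logic on a few invented strings. `parse_euro` is a hypothetical stand-alone variant of `convert_money`, renamed so that it does not shadow the function above; the sample values are assumptions.

```python
def parse_euro(value):
    """Convert strings such as '€67.5M' or '€560K' to a float amount in euros."""
    if not isinstance(value, str):
        return value  # already numeric, leave untouched
    value = value.replace("€", "").strip()
    multipliers = {"M": 1_000_000, "K": 1_000}
    suffix = value[-1:]  # last character, or '' for an empty string
    if suffix in multipliers:
        return float(value[:-1]) * multipliers[suffix]
    return float(value)

# Invented sample values
for raw in ["€67.5M", "€560K", "€0"]:
    print(raw, "->", parse_euro(raw))
```

Checking only the last character for the suffix, rather than testing whether `M` or `K` appears anywhere in the string, is a small design choice that keeps the parsing strict.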

In [65]:
# Convert height from feet and inches (e.g. 5'7") to total inches
def convert_height_to_inches(height):
    if isinstance(height, str):
        feet, inches = height.split("'")
        inches = inches.replace('"', '')
        return int(feet) * 12 + int(inches)
    return height

fifa_df['Height'] = fifa_df['Height'].apply(convert_height_to_inches).astype(int)
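
If centimetres are preferred over inches, the same parsing can feed a unit conversion. The sketch below repeats the feet-and-inches logic on invented values; `height_to_cm` is a hypothetical helper and is not part of the original notebook.

```python
def height_to_inches(height):
    """Parse a feet-and-inches string into total inches."""
    feet, inches = height.replace('"', "").split("'")
    return int(feet) * 12 + int(inches)

def height_to_cm(height):
    """Hypothetical helper: same parsing, result in centimetres (1 in = 2.54 cm)."""
    return round(height_to_inches(height) * 2.54, 1)

# Invented sample values in the format shown in the preview
for h in ["5'7\"", "6'2\"", "5'11\""]:
    print(h, height_to_inches(h), height_to_cm(h))
```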

In [66]:

# Convert weight from strings like '159lbs' to a float number of pounds
fifa_df['Weight'] = fifa_df['Weight'].str.replace('lbs', '').astype(float)

In [67]:
fifa_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18979 entries, 0 to 18978
Data columns (total 67 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Name              18979 non-null  object 
 1   Age               18979 non-null  int64  
 2   ↓OVA              18979 non-null  int64  
 3   POT               18979 non-null  int64  
 4   Height            18979 non-null  int64  
 5   Weight            18979 non-null  float64
 6   foot              18979 non-null  object 
 7   BOV               18979 non-null  int64  
 8   Growth            18979 non-null  int64  
 9   Value             18979 non-null  float64
 10  Wage              18979 non-null  float64
 11  Release Clause    18979 non-null  float64
 12  Attacking         18979 non-null  int64  
 13  Crossing          18979 non-null  int64  
 14  Finishing         18979 non-null  int64  
 15  Heading Accuracy  18979 non-null  int64  
 16  Short Passing     18979 non-null  int64 