# Ask meaningful questions and answer using collected data

## Import necessary packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import StandardScaler

## Read raw data from csv file

In [2]:
df = pd.read_csv('players_transformed.csv')

# drop the unnecessary first column
df = df.drop(df.columns[0], axis=1)

# test output
display(df.head())

# size of the data
print("Size of data: ", df.shape)

| | name | age | nationality | club | height | weight | foot | total_matches | total_goals | total_assists | total_yellow | total_red | pass_completion_rate | dribble_success_rate | tackles | interception | market_value | titles | injuries | general_position |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ivan Balliu | 32 | Albania | Rayo Vallecano | 172 | 63 | Right | 421 | 3 | 14 | 91 | 2 | 76.73 | 66.67 | 10 | 6 | 2.0 | 2 | 19 | Defender |
| 1 | Marash Kumbulla | 24 | Albania | RCD Espanyol | 191 | 78 | Right | 132 | 6 | 14 | 32 | 2 | 86.24 | 33.33 | 12 | 11 | 4.5 | 1 | 2 | Defender |
| 2 | Abderrahman Rebbach | 26 | Algeria | Deportivo Alavés | 176 | 75 | Right | 154 | 34 | 4 | 32 | 1 | 72.09 | 27.27 | 0 | 0 | 0.8 | 0 | 0 | Forward |
| 3 | Farid El Melali | 27 | Algeria | Angers SCO | 168 | 65 | Right | 157 | 19 | 8 | 11 | 0 | 82.0 | 54.29 | 10 | 3 | 1.5 | 1 | 12 | Forward |
| 4 | Haris Belkebla | 30 | Algeria | Angers SCO | 177 | 68 | Right | 323 | 9 | 2 | 57 | 3 | 87.0 | 25.0 | 9 | 5 | 1.5 | 0 | 6 | Midfielder |


Size of data:  (1057, 20)
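The positional drop in the loading cell can also be handled at read time. The following is a minimal sketch, assuming the first CSV column is a row index written out by an earlier `to_csv` call (the source does not confirm this). The inline CSV text and the name `demo_df` are hypothetical stand-ins for `players_transformed.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for players_transformed.csv, including
# an unnamed leading index column.
csv_text = """,name,age,market_value
0,Ivan Balliu,32,2.0
1,Marash Kumbulla,24,4.5
"""

# index_col=0 uses the first column as the row index instead of loading
# it as a data column, so no separate drop step is needed.
demo_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(demo_df.columns.tolist())
print("Size of data: ", demo_df.shape)
```

Dropping by position, as the notebook does, works equally well; `index_col=0` only makes the intent visible in the read call.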


## Question 1
> Can we identify young players (e.g., under 23) who have high efficiency and are undervalued in the market compared to their peers?

### Purpose
To identify promising young players who deliver exceptional performance relative to their market value, making them attractive targets for clubs seeking high-value talent on a budget.

### Relevant attributes
- `age`
- `market_value`
- `total_goals`
- `total_assists`
- `dribble_success_rate`

### Filter Players Under 23 and Define Efficiency Metrics

Efficiency is the weighted sum of the relevant statistics: each statistic is multiplied by its weight and the products are added. The weights sum to 1 for ease of comparison.
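Written out with the weights used in the code cell that follows, the metric looks like this. The symbols $\tilde{g}$, $\tilde{a}$, and $\tilde{d}$ are notation introduced here for the Yeo-Johnson-transformed `total_goals`, `total_assists`, and `dribble_success_rate`:

$$
\text{efficiency} = 0.5\,\tilde{g} + 0.3\,\tilde{a} + 0.2\,\tilde{d}, \qquad 0.5 + 0.3 + 0.2 = 1
$$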

In [3]:
# copy the filtered rows so the assignments below do not write into a slice of df
young_players = df[df['age'] < 23].copy()

metrics = ['total_goals', 'total_assists', 'dribble_success_rate']

# apply a Yeo-Johnson power transform to reduce skew in each metric
for column_name in metrics:
    col_transformed, col_lambda = stats.yeojohnson(young_players[column_name])
    young_players[column_name] = col_transformed

# weighted sum of the transformed metrics (weights sum to 1)
young_players['efficiency'] = (
    young_players['total_goals'] * 0.5 +
    young_players['total_assists'] * 0.3 +
    young_players['dribble_success_rate'] * 0.2
)
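To show what the Yeo-Johnson step does to a skewed column, here is a small sketch on made-up numbers. The `goals` array is hypothetical and only stands in for a column such as `total_goals`; it is not taken from the dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed counts standing in for a column such as total_goals.
goals = np.array([0, 0, 1, 1, 2, 3, 5, 8, 13, 34], dtype=float)

# With no lambda supplied, yeojohnson fits lambda by maximum likelihood and
# returns the transformed data together with the fitted value.
transformed, fitted_lambda = stats.yeojohnson(goals)

skew_before = stats.skew(goals)
skew_after = stats.skew(transformed)
print(f"lambda={fitted_lambda:.3f}, skew before={skew_before:.2f}, after={skew_after:.2f}")
```

The transform reduces skew and preserves the ordering of values, but it does not put different metrics on a common scale, so the weights in the efficiency sum act on columns whose ranges may still differ.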


### Compare Market Value

Identify players who are "undervalued" by comparing their efficiency to their market value. A player is "undervalued" when their market value is lower than the average market value of peers with a similar efficiency (within ±5% of the player's own efficiency).

In [4]:
undervalued_players = []

for index, player in young_players.iterrows():
    efficiency = player['efficiency']
    
    # Define the efficiency range (±5%, i.e. a 10% window)
    lower_bound = efficiency * 0.95
    upper_bound = efficiency * 1.05
    
    # Find peers within this range
    peers = young_players[(young_players['efficiency'] >= lower_bound) & 
                          (young_players['efficiency'] <= upper_bound) &
                          (young_players.index != index)]  # Exclude the player themselves
    
    # Calculate the average market value of peers
    if not peers.empty:
        avg_peer_value = peers['market_value'].mean()
        
        # Check if the player is undervalued
        if player['market_value'] < avg_peer_value:
            undervalued_players.append({
                'name': player['name'],
                'age': player['age'],
                'efficiency': efficiency,
                'market_value': player['market_value'],
                'avg_peer_value': avg_peer_value
            })

# Convert the results into a DataFrame
undervalued_df = pd.DataFrame(undervalued_players)

# Rank Undervalued Players
undervalued_df['value_gap'] = undervalued_df['avg_peer_value'] - undervalued_df['market_value']
undervalued_df = undervalued_df.sort_values(by='value_gap', ascending=False)

display(undervalued_df)

| | name | age | efficiency | market_value | avg_peer_value | value_gap |
|---|---|---|---|---|---|---|
| 71 | Othmane Maamma | 19 | 2.714655 | 0.3 | 15.000000 | 14.700000 |
| 31 | Omari Forson | 20 | 9.966614 | 2.0 | 13.000000 | 11.000000 |
| 61 | Louis Mouton | 22 | 7.161689 | 0.6 | 11.160000 | 10.560000 |
| 87 | Yllan Okou | 21 | 5.812407 | 0.9 | 10.783333 | 9.883333 |
| 170 | Cesar Tarrega | 22 | 9.107556 | 4.0 | 13.850000 | 9.850000 |
| ... | ... | ... | ... | ... | ... | ... |
| 209 | Fabian Rieder | 22 | 7.415049 | 8.0 | 8.161111 | 0.161111 |
| 38 | Andy Diouf | 21 | 6.636553 | 9.0 | 9.068333 | 0.068333 |
| 76 | Sael Kumbedi | 19 | 3.746791 | 5.0 | 5.039189 | 0.039189 |
| 131 | Kassoum Ouattara | 20 | 5.344896 | 8.0 | 8.026364 | 0.026364 |
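The peer comparison can also be written without a Python-level loop. The sketch below is one possible vectorized variant, not the notebook's own method: the function name `peer_value_gap`, its `tolerance` parameter, and the toy data are all hypothetical. It assumes a unique row index and builds an n×n mask, which is fine for a few hundred players.

```python
import numpy as np
import pandas as pd


def peer_value_gap(players: pd.DataFrame, tolerance: float = 0.05) -> pd.DataFrame:
    """Return players priced below the mean market value of their efficiency peers."""
    eff = players["efficiency"].to_numpy(dtype=float)
    value = players["market_value"].to_numpy(dtype=float)

    # is_peer[i, j] is True when player j's efficiency lies within
    # +/- tolerance of player i's efficiency.
    lower = eff * (1 - tolerance)
    upper = eff * (1 + tolerance)
    is_peer = (eff[None, :] >= lower[:, None]) & (eff[None, :] <= upper[:, None])
    np.fill_diagonal(is_peer, False)  # a player is not their own peer

    peer_count = is_peer.sum(axis=1)
    peer_total = is_peer.astype(float) @ value
    # Players without peers get NaN and are dropped by the filter below.
    avg_peer_value = np.where(peer_count > 0,
                              peer_total / np.maximum(peer_count, 1),
                              np.nan)

    result = players.assign(avg_peer_value=avg_peer_value,
                            value_gap=avg_peer_value - value)
    return result[result["value_gap"] > 0].sort_values("value_gap", ascending=False)


# Hypothetical toy data: a, b, c have similar efficiency; d has no peers.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "efficiency": [10.0, 10.2, 10.4, 20.0],
    "market_value": [1.0, 5.0, 9.0, 3.0],
})
undervalued = peer_value_gap(toy)
print(undervalued)
```

In the toy data only player `a` is returned: their peers `b` and `c` average 7.0 against a market value of 1.0. Player `b` sits exactly at the peer average and `d` has no peers, so neither counts as undervalued.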


## Question 2
> Do offense or defense statistics have a higher impact on a player's market value? 

### Purpose
To help teams, scouts, and analysts understand how the market perceives different player roles and skill sets, which can influence decisions on recruitment, training focus, and contract negotiations.

#### 1. Define Relevant Attributes
We focus on attributes that represent offensive and defensive capabilities, as well as the player's market value:
- Offensive stats: `total_goals`, `total_assists`, `pass_completion_rate`, `dribble_success_rate` - represent a player's contribution to scoring, creating opportunities, and maintaining possession.
- Defensive stats: `tackles`, `interception`, `total_yellow`, `total_red` - measure a player's ability to disrupt opponents' plays and their discipline (via cards).

In [5]:
offensive_stats = ['total_goals', 'total_assists', 'pass_completion_rate', 'dribble_success_rate']
defensive_stats = ['tackles', 'interception', 'total_yellow', 'total_red']
market_value = 'market_value'

# Select relevant data
selected_columns = offensive_stats + defensive_stats + [market_value]
q2_data = df[selected_columns]
q2_data

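| | total_goals | total_assists | pass_completion_rate | dribble_success_rate | tackles | interception | total_yellow | total_red | market_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 14 | 76.73 | 66.67 | 10 | 6 | 91 | 2 | 2.0 |
| 1 | 6 | 14 | 86.24 | 33.33 | 12 | 11 | 32 | 2 | 4.5 |
| 2 | 34 | 4 | 72.09 | 27.27 | 0 | 0 | 32 | 1 | 0.8 |
| 3 | 19 | 8 | 82.00 | 54.29 | 10 | 3 | 11 | 0 | 1.5 |
| 4 | 9 | 2 | 87.00 | 25.00 | 9 | 5 | 57 | 3 | 1.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1052 | 10 | 1 | 88.03 | 0.00 | 5 | 4 | 30 | 3 | 2.2 |
| 1053 | 11 | 2 | 81.94 | 41.18 | 6 | 2 | 22 | 1 | 4.5 |
| 1054 | 12 | 14 | 77.39 | 75.00 | 13 | 4 | 33 | 0 | 17.0 |
| 1055 | 5 | 14 | 81.94 | 41.18 | 6 | 2 | 11 | 0 | 5.0 |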


#### 2. Correlation Analysis
Why correlation?
- Correlation quantifies the linear relationship between two variables.
- A correlation of larger magnitude indicates stronger alignment between a stat and `market_value`, helping us identify which stats matter more.

Steps:
1. Compute the correlation matrix for the selected attributes.
2. Extract the correlations of the offensive and defensive stats with `market_value`.
3. Compute the average correlation for each group.
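For reference, `DataFrame.corr()` computes the Pearson coefficient by default. For a stat $x$ and market value $y$ observed over $n$ players (the symbols are notation introduced here), it is:

$$
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The value lies between $-1$ and $1$, and a value near $0$ indicates little linear relationship.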

In [11]:
correlation_matrix = q2_data.corr()

# Extract the correlation values with `market_value`
offensive_columns = correlation_matrix.loc[offensive_stats, market_value]
defensive_columns = correlation_matrix.loc[defensive_stats, market_value]

# Calculate average correlations
average_offensive_corr = offensive_columns.mean()
average_defensive_corr = defensive_columns.mean()
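Because only the `market_value` column of the matrix is used, the same numbers can be obtained directly with `DataFrame.corrwith`. This is a small sketch on a toy frame built from the first five rows shown earlier; the names `toy`, `via_matrix`, and `via_corrwith` are illustrative.

```python
import pandas as pd

# First five rows of three columns from the dataset preview.
toy = pd.DataFrame({
    "total_goals": [3, 6, 34, 19, 9],
    "tackles": [10, 12, 0, 10, 9],
    "market_value": [2.0, 4.5, 0.8, 1.5, 1.5],
})
stats_columns = ["total_goals", "tackles"]

# Route 1: full correlation matrix, then select the market_value column.
via_matrix = toy.corr().loc[stats_columns, "market_value"]

# Route 2: correlate each stat column with market_value directly.
via_corrwith = toy[stats_columns].corrwith(toy["market_value"])

print(pd.DataFrame({"via_matrix": via_matrix, "via_corrwith": via_corrwith}))
```

Both routes use Pearson correlation by default, so they agree. `corrwith` simply skips the stat-to-stat entries that this analysis does not use.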

#### 3. Summary statistics
Why summary statistics?
- To understand the overall trends and variability of offensive and defensive stats.
- Measures such as the mean, standard deviation, and range reveal:
  - How consistent the stats are.
  - Whether certain stats (e.g., goals or tackles) dominate the dataset.

Steps:
1. Compute the mean, standard deviation, min, max, and quartiles for the offensive and defensive attributes.
2. Compare the offensive and defensive metrics to see which group shows greater variability or consistency.

In [14]:
offensive_summary = q2_data[offensive_stats].describe()
defensive_summary = q2_data[defensive_stats].describe()

display(offensive_summary)
display(defensive_summary)

| | total_goals | total_assists | pass_completion_rate | dribble_success_rate |
|---|---|---|---|---|
| count | 1057.0 | 1057.0 | 1057.0 | 1057.0 |
| mean | 14.262062 | 10.477767 | 81.408969 | 37.886537 |
| std | 16.385635 | 6.253009 | 7.180711 | 29.206466 |
| min | 0.0 | 0.0 | 62.75 | 0.0 |
| 25% | 2.0 | 5.0 | 77.78 | 0.0 |
| 50% | 8.0 | 14.0 | 81.94 | 41.18 |
| 75% | 20.0 | 14.0 | 85.71 | 50.0 |
| max | 82.0 | 25.0 | 100.0 | 100.0 |


| | tackles | interception | total_yellow | total_red |
|---|---|---|---|---|
| count | 1057.0 | 1057.0 | 1057.0 | 1057.0 |
| mean | 6.429518 | 2.859981 | 22.38316 | 1.034059 |
| std | 5.171949 | 2.750908 | 19.384952 | 1.26535 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 1.0 | 7.0 | 0.0 |
| 50% | 6.0 | 2.0 | 17.0 | 1.0 |
| 75% | 9.0 | 4.0 | 33.0 | 2.0 |
| max | 23.0 | 11.0 | 95.0 | 5.0 |
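The stats are measured on very different scales (counts versus percentages), so raw standard deviations are hard to compare. One option, which the notebook itself does not use, is the coefficient of variation (std divided by mean). The sketch below applies it to the `mean` and `std` rows copied from the two summary tables above; the `summary` frame and the `cv` column are names introduced here.

```python
import pandas as pd

# mean and std copied from the describe() output above.
summary = pd.DataFrame(
    {
        "mean": [14.262062, 10.477767, 81.408969, 37.886537,
                 6.429518, 2.859981, 22.38316, 1.034059],
        "std": [16.385635, 6.253009, 7.180711, 29.206466,
                5.171949, 2.750908, 19.384952, 1.26535],
    },
    index=["total_goals", "total_assists", "pass_completion_rate",
           "dribble_success_rate", "tackles", "interception",
           "total_yellow", "total_red"],
)

# Coefficient of variation: spread relative to the mean (unitless).
summary["cv"] = summary["std"] / summary["mean"]
print(summary.sort_values("cv", ascending=False))
```

By this measure `total_red` and `total_goals` vary the most relative to their means, and `pass_completion_rate` the least.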


#### 4. Interpretation of Results
After computing correlations and summary statistics:
1. Correlations reveal which group has higher average alignment with market value.
2. Summary statistics help contextualize the influence (e.g., whether high goals scored are rarer, boosting value disproportionately).

In [17]:
print("=== Correlations with Market Value ===")
print("Offensive Stats Correlations:")
display(offensive_columns)
print("\nDefensive Stats Correlations:")
display(defensive_columns)
print("\nAverage Offensive Correlation:", average_offensive_corr)
print("Average Defensive Correlation:", average_defensive_corr)

=== Correlations with Market Value ===
Offensive Stats Correlations:

total_goals             0.089750
total_assists           0.122949
pass_completion_rate    0.137155
dribble_success_rate    0.072687
Name: market_value, dtype: float64

Defensive Stats Correlations:

tackles         0.106684
interception    0.085059
total_yellow    0.002718
total_red      -0.001585
Name: market_value, dtype: float64

Average Offensive Correlation: 0.10563520257418137
Average Defensive Correlation: 0.04821902669499542


#### 5. Key observations
Offensive stats correlate more strongly:
- The average offensive correlation (0.1056) is more than twice the average defensive correlation (0.0482), suggesting that offensive stats have a stronger relationship with market value.
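The averages and their ratio can be checked by hand from the eight reported correlations. This short sketch uses only the printed values, so its results match the notebook's up to rounding:

```python
# Correlations with market_value, copied from the printed output.
offensive = [0.089750, 0.122949, 0.137155, 0.072687]
defensive = [0.106684, 0.085059, 0.002718, -0.001585]

avg_off = sum(offensive) / len(offensive)
avg_def = sum(defensive) / len(defensive)
ratio = avg_off / avg_def

print(round(avg_off, 4), round(avg_def, 4), round(ratio, 2))
```

The ratio comes out at roughly 2.2, which supports the "more than twice" statement. Note that the defensive average is pulled down mainly by the two card columns, which are close to zero.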

Weak correlations overall:
- Even the strongest correlations, such as `pass_completion_rate` (0.137) and `total_assists` (0.123), are in the weak range, indicating that these stats alone are not dominant predictors of market value.
- Defensive stats, particularly `total_yellow` and `total_red`, show almost no relationship with market value.

#### 6. Implications
- Offensive performance (e.g., creating and scoring goals) is slightly more strongly associated with market value than defensive capability, which is consistent with a market bias toward players who contribute directly to goal-scoring.
- The weak correlations overall suggest that market value depends on additional factors beyond offensive and defensive statistics, such as:
  - Age, physical attributes, or popularity.
  - Positional roles (e.g., forwards may inherently have higher values).
  - Team success or league competitiveness.

#### 7. Conclusion
While offensive stats show a stronger association with market value than defensive stats, neither group is strongly related to it. Teams and analysts should consider a broader range of metrics, including off-field factors, when evaluating player market value.