# Player Market Value Prediction: Applied ML 2025 Final Project

In this notebook, we predict football players' market values using player stats, demographics, and performance data.  
We use the helper functions from `utils.py` to keep our workflow modular and clean.


In [None]:
# Import libraries and utils
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from src import (
    load_all, aggregate_player_stats, get_latest_valuation, merge_player_data,
    plot_distribution, fillna_and_scale, encode_categorical, scatter_actual_vs_pred,
)


# Set pandas options for nicer display
pd.set_option('display.max_columns', 100)


## 1. Data Loading

First, we load all the relevant tables from the dataset using our utility functions.
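
The implementation of `load_all` is not shown in this notebook. As a rough sketch only, assuming the dataset is a folder of CSV files and that each table is keyed by its file name, a loader of this kind might look like the following. The function name `load_all_sketch`, the folder layout, and the toy columns are all assumptions made for illustration.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_all_sketch(data_dir):
    """Read every CSV in data_dir into a dict keyed by file stem (hypothetical helper)."""
    return {
        path.stem: pd.read_csv(path)
        for path in sorted(Path(data_dir).glob("*.csv"))
    }


# Build a tiny throwaway dataset so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"player_id": [1, 2]}).to_csv(Path(tmp) / "players.csv", index=False)
    pd.DataFrame(
        {"player_id": [1, 1, 2], "yellow_cards": [0, 1, 0]}
    ).to_csv(Path(tmp) / "appearances.csv", index=False)
    demo_tables = load_all_sketch(tmp)

for name, table in demo_tables.items():
    print(f"{name}: {table.shape}")
```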


In [None]:
# Load all data as a dictionary of DataFrames
data = load_all()
for name, df in data.items():
    print(f"{name}: {df.shape}")


## 2. Data Preprocessing & Feature Table Creation

We aggregate player stats, get the latest player valuation, and merge everything into a single table.
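
The three helpers are defined outside this notebook, so the following is only a sketch of the kind of pandas logic they might contain. The output names `n_games`, `total_yellow`, `total_red`, and `market_value_in_eur` come from the cells in this notebook; the input columns `player_id`, `yellow_cards`, `red_cards`, and `date`, as well as the toy values, are assumptions.

```python
import pandas as pd

# Toy tables standing in for data['appearances'], data['valuations'], data['players'].
appearances_demo = pd.DataFrame({
    "player_id": [1, 1, 2],
    "yellow_cards": [1, 0, 2],
    "red_cards": [0, 0, 1],
})
valuations_demo = pd.DataFrame({
    "player_id": [1, 1, 2],
    "date": pd.to_datetime(["2023-01-01", "2024-06-01", "2024-01-15"]),
    "market_value_in_eur": [1_000_000, 1_500_000, 300_000],
})
players_demo = pd.DataFrame({
    "player_id": [1, 2, 3],
    "position": ["Attack", "Defender", "Goalkeeper"],
})

# One row per player: number of appearances and card totals.
stats_demo = (
    appearances_demo.groupby("player_id")
    .agg(
        n_games=("yellow_cards", "size"),
        total_yellow=("yellow_cards", "sum"),
        total_red=("red_cards", "sum"),
    )
    .reset_index()
)

# Keep only the most recent valuation per player.
latest_demo = (
    valuations_demo.sort_values("date")
    .drop_duplicates("player_id", keep="last")[["player_id", "market_value_in_eur"]]
)

# Left joins keep players that have no appearances or valuations (as NaN).
merged_demo = (
    players_demo.merge(stats_demo, on="player_id", how="left")
    .merge(latest_demo, on="player_id", how="left")
)
print(merged_demo)
```

Left joins are used here so that players without appearances or valuations stay in the table with missing values, which is why the later cells handle missing features and targets.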


In [None]:
# Aggregate stats and get latest valuations
stats = aggregate_player_stats(data['appearances'])
latest_val = get_latest_valuation(data['valuations'])

# Merge into main DataFrame
main_df = merge_player_data(data['players'], stats, latest_val)
print(main_df.head())


## 3. Exploratory Data Analysis (EDA)

Let’s examine the distribution of market values and how they relate to each player’s last recorded season and age.


In [None]:
# Plot distribution of target variable
plot_distribution(main_df['market_value_in_eur'].dropna(), title="Market Value (EUR) Distribution")

# Example: Plot Market Value vs Last Season
plt.figure(figsize=(8,4))
plt.scatter(main_df['last_season'], main_df['market_value_in_eur'], alpha=0.5)
plt.xlabel("Last Season")
plt.ylabel("Market Value (EUR)")
plt.title("Market Value vs Last Season")
plt.show()

# Age vs Value
if 'date_of_birth' in main_df.columns:
    main_df['age'] = 2025 - pd.to_datetime(main_df['date_of_birth']).dt.year
    plt.scatter(main_df['age'], main_df['market_value_in_eur'], alpha=0.4)
    plt.xlabel("Age")
    plt.ylabel("Market Value (EUR)")
    plt.title("Market Value vs Age")
    plt.show()


## 4. Feature Engineering

We select and prepare the features we want to use for modeling.  
Categorical variables are encoded, numeric features are scaled, and missing values are handled.
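
Since `fillna_and_scale` and `encode_categorical` are not shown here, the sketch below illustrates the three steps on a toy table using plain pandas and NumPy. It assumes median imputation, z-score scaling, and one-hot encoding, which may differ from what the helpers actually do.

```python
import numpy as np
import pandas as pd

# Toy feature table; the values are made up for illustration.
toy_df = pd.DataFrame({
    "n_games": [10.0, np.nan, 30.0],
    "age": [21.0, 25.0, 29.0],
    "position": ["Attack", "Midfield", "Attack"],
})
num_cols = ["n_games", "age"]

# 1. Fill missing numeric values with each column's median.
filled = toy_df[num_cols].fillna(toy_df[num_cols].median())

# 2. Standardize to zero mean and unit variance (population std, ddof=0).
scaled = (filled - filled.mean()) / filled.std(ddof=0)

# 3. One-hot encode the categorical column.
dummies = pd.get_dummies(toy_df["position"], prefix="position", dtype=float)

# 4. Stack numeric and encoded columns into one feature matrix.
X_toy = np.hstack([scaled.to_numpy(), dummies.to_numpy()])
print(X_toy.shape)
```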


In [None]:
# Select features for the model (extend as needed!)
features = [
    'n_games', 'total_yellow', 'total_red',  # From stats
    'age',                                   # Derived
    # Add more features as engineered!
]
cat_features = ['position']  # Example: add position if available

# Handle missing numeric features and scale
X_num = main_df[features].copy()
X_scaled, scaler = fillna_and_scale(X_num, features)

# Encode categorical features if available
if all(col in main_df.columns for col in cat_features):
    X_cat, encoder = encode_categorical(main_df, cat_features)
    # Combine numeric + categorical features
    X_full = np.concatenate([X_scaled, X_cat.values], axis=1)
else:
    X_full = X_scaled

# Prepare target variable (log-transform recommended for skew)
y = main_df['market_value_in_eur'].copy()
y = np.log1p(y)  # Use log1p to avoid log(0)


## 5. Train/Test Split

We split our data into training and testing sets to fairly evaluate the model's performance.
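
One design point worth noting: if the scaler and encoder are fitted on the full table before splitting, statistics from the test rows leak into the training features. A possible alternative, sketched here with scikit-learn's `ColumnTransformer` on synthetic data rather than this notebook's helpers, is to split first and fit the preprocessing on the training rows only. The column values and the `demo_` names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for main_df; values are random and purely illustrative.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "n_games": rng.integers(0, 200, size=50).astype(float),
    "age": rng.integers(17, 40, size=50).astype(float),
    "position": rng.choice(["Attack", "Midfield", "Defender"], size=50),
})
demo_df.loc[::10, "n_games"] = np.nan  # inject a few missing values

numeric_cols = ["n_games", "age"]
categorical_cols = ["position"]

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Split first, then fit the preprocessing on the training rows only.
train_df, test_df = train_test_split(demo_df, test_size=0.2, random_state=42)
X_train_demo = preprocess.fit_transform(train_df)
X_test_demo = preprocess.transform(test_df)
print(X_train_demo.shape, X_test_demo.shape)
```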


In [None]:
from sklearn.model_selection import train_test_split

# Remove rows with missing targets (a NumPy mask works for both arrays and Series)
valid_mask = y.notna().to_numpy()
X_full = np.asarray(X_full)[valid_mask]
y = y[valid_mask]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_full, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)


## 6. Neural Network Regression

We train a simple neural network to predict player market value.


In [None]:
import tensorflow as tf
from tensorflow import keras

# Define a simple MLP model
model = keras.Sequential([
    keras.layers.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1)
])

model.compile(optimizer='adam', loss='mse', metrics=['mae'])

# Train the model (early stopping to avoid overfitting)
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=32,
    callbacks=[early_stop],
    verbose=1
)


## 7. Model Evaluation

We evaluate the model’s predictions using mean absolute error (MAE) and R², both computed in euros after reversing the log transform, and plot actual vs. predicted values.
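
Because the model is trained on `log1p` targets, it helps to see how an error in log space translates back into euros. The toy numbers below are made up for illustration; they show that the same log-space error corresponds to a roughly constant relative error, so the euro MAE is dominated by the most valuable players.

```python
import numpy as np

# Made-up market values spanning two orders of magnitude.
y_true_eur = np.array([100_000.0, 1_000_000.0, 10_000_000.0])
y_true_log = np.log1p(y_true_eur)

# Pretend every prediction is off by +0.1 in log space.
y_pred_log = y_true_log + 0.1
y_pred_eur = np.expm1(y_pred_log)  # expm1 is the inverse of log1p

abs_err_eur = np.abs(y_true_eur - y_pred_eur)
rel_err = abs_err_eur / y_true_eur
mae_eur = abs_err_eur.mean()

# R^2 computed by hand: 1 - residual sum of squares / total sum of squares.
ss_res = np.sum((y_true_eur - y_pred_eur) ** 2)
ss_tot = np.sum((y_true_eur - y_true_eur.mean()) ** 2)
r2_eur = 1.0 - ss_res / ss_tot

print("absolute errors (EUR):", abs_err_eur)
print("relative errors:", rel_err)
print(f"MAE: {mae_eur:,.2f}  R^2: {r2_eur:.3f}")
```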


In [None]:
from sklearn.metrics import mean_absolute_error, r2_score

# Predict and reverse log-transform
y_pred = model.predict(X_test).flatten()
y_test_exp = np.expm1(y_test)
y_pred_exp = np.expm1(y_pred)

mae = mean_absolute_error(y_test_exp, y_pred_exp)
r2 = r2_score(y_test_exp, y_pred_exp)

print(f"Test MAE: €{mae:,.2f}")
print(f"Test R^2: {r2:.3f}")

# Plot actual vs predicted
scatter_actual_vs_pred(y_test_exp, y_pred_exp, title="Actual vs Predicted Market Value")


## 8. Hyperparameter Tuning (Quick Example)

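As a quick example, we compare several hidden-layer sizes. Because this loop scores each candidate on the test set, treat the resulting MAE values as a rough comparison only; a proper search would select the size on a separate validation split.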
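
A minimal sketch of validation-based selection follows. To keep it lightweight it uses scikit-learn's `MLPRegressor` as a stand-in for the Keras model and synthetic data instead of the player features, so the numbers say nothing about this project; only the train/validation/test pattern carries over.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=300)

# Hold out a test set, then carve a validation set out of the remainder.
X_trainval, X_hold, y_trainval, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

# Score every candidate size on the validation set only.
val_scores = {}
for size in [16, 32, 64, 128]:
    candidate = MLPRegressor(hidden_layer_sizes=(size, size), max_iter=500, random_state=0)
    candidate.fit(X_tr, y_tr)
    val_scores[size] = mean_absolute_error(y_val, candidate.predict(X_val))

best_size = min(val_scores, key=val_scores.get)

# Refit the chosen size on train + validation, then touch the test set once.
final_model = MLPRegressor(
    hidden_layer_sizes=(best_size, best_size), max_iter=500, random_state=0
)
final_model.fit(X_trainval, y_trainval)
test_mae = mean_absolute_error(y_hold, final_model.predict(X_hold))
print(f"best size: {best_size}, test MAE: {test_mae:.3f}")
```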


In [None]:
# Example: try different layer sizes
# (separate variable names keep `model`, `y_pred`, and `mae` from Section 7 intact)
results = []
for size in [16, 32, 64, 128]:
    trial_model = keras.Sequential([
        keras.layers.Input(shape=(X_train.shape[1],)),
        keras.layers.Dense(size, activation='relu'),
        keras.layers.Dense(size, activation='relu'),
        keras.layers.Dense(1)
    ])
    trial_model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    trial_model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=0)
    trial_pred_exp = np.expm1(trial_model.predict(X_test).flatten())
    trial_mae = mean_absolute_error(y_test_exp, trial_pred_exp)
    results.append((size, trial_mae))

print("Layer size vs MAE:")
for size, trial_mae in results:
    print(f"  {size} units: MAE = €{trial_mae:,.2f}")


## 9. Feature Importance (Optional)

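Permutation importance estimates how much each feature matters by shuffling its values and measuring how much the model's score drops. For illustration, the cell below applies it to a random forest baseline rather than to the neural network.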
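
To apply the same idea to the Keras network, one option is to compute permutation importance by hand around any prediction function. The sketch below is one way to do that, using the increase in MAE as the importance measure; the function name `permutation_importance_mae` and the toy data are hypothetical.

```python
import numpy as np


def permutation_importance_mae(predict_fn, X, y, n_repeats=5, seed=0):
    """Mean increase in MAE when each column of X is shuffled (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    baseline = np.mean(np.abs(y - predict_fn(X)))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column to break its link with the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean(np.abs(y - predict_fn(X_perm))) - baseline)
        importances[j] = np.mean(increases)
    return importances


# Toy check: the target depends on column 0 only, so column 1 should score zero.
toy_rng = np.random.default_rng(1)
X_toy_imp = toy_rng.normal(size=(200, 2))
y_toy_imp = 3.0 * X_toy_imp[:, 0]


def toy_predict(X):
    return 3.0 * X[:, 0]


toy_importances = permutation_importance_mae(toy_predict, X_toy_imp, y_toy_imp)
print(toy_importances)
```

With the trained network, the prediction function could be something like `lambda X: model.predict(X, verbose=0).flatten()`, evaluated on `X_test` and `y_test`; the importances would then be in log-value units because the target is log-transformed.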


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# permutation_importance expects a fitted scikit-learn-style estimator,
# so we use a simple random forest baseline for illustration.
rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(X_train, y_train)
result = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=42)

# Name every column of X_full: numeric features first, then the encoded categorical
# columns (generic names, since the encoder's column names are not available here).
n_extra = X_train.shape[1] - len(features)
feature_names = list(features) + [f"encoded_cat_{i}" for i in range(n_extra)]

# Show top features
importances = result.importances_mean
indices = np.argsort(importances)[::-1]
for idx in indices[:10]:
    print(f"{feature_names[idx]}: {importances[idx]:.4f}")


## 10. Conclusions

We built a neural network model to predict football player market values using stats and personal data.  
Feel free to extend this with more advanced features, more complex models, or deeper analysis!
