# Training a Machine Learning Model
Training machine learning models to predict FPL player points using engineered features.

## Models to train
- Random Forest, ensemble of decision trees 
- Linear Regression, baseline for comparison

## Features
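- rolling_avg_points, player form over the last 5 games
- opponent_difficulty, fixture difficulty (1-10 scale)
- minutes, playing time a player gets (0-90 minutes)
- is_home, whether the player is playing at home (1) or away (0)
- price, player price in millions
- pos_GK, pos_DEF, pos_MID, pos_FWD, one-hot encoding of player position
- clean_sheets_rolling_avg, defensive form over the last 5 games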

## Steps
- Load the features from the 'features' table in 'fpl_data.db'
- Split the data into two parts, 80% training and 20% testing
- Train both models on the training data
- Assess both models on the test data & compare their performance

## Performance Metrics 
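- MAE (Mean Absolute Error), average size of the prediction error in points (lower is better)
- RMSE (Root Mean Squared Error), like MAE but penalizes large errors more (lower is better)
- R^2 Score, share of the variance in points the model explains (1.0 is perfect, 0 is no better than predicting the average, negative is worse than that)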
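
As a small worked illustration of the three metrics, the sketch below computes them by hand with NumPy and compares them with the scikit-learn functions used later in this notebook. The numbers are made up for illustration and are not FPL data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted points, for illustration only
y_true = np.array([2.0, 0.0, 6.0, 1.0, 10.0])
y_pred = np.array([3.0, 0.5, 4.0, 1.0, 6.0])

errors = y_true - y_pred

# MAE: mean of the absolute errors
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error, so large misses weigh more
rmse = np.sqrt(np.mean(errors ** 2))

# R^2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.2f}  (sklearn: {mean_absolute_error(y_true, y_pred):.2f})")
print(f"RMSE: {rmse:.2f}  (sklearn: {np.sqrt(mean_squared_error(y_true, y_pred)):.2f})")
print(f"R^2:  {r2:.3f} (sklearn: {r2_score(y_true, y_pred):.3f})")
```

The single 4-point miss dominates the RMSE far more than the MAE, which is why the two are reported side by side.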

## Output
Random Forest normally achieves:
- MAE, 2.0-2.5 points
- R^2 Score, 0.45-0.55

In [2]:
# Importing libraries for ML training
# pandas for data manipulation, numpy for numerical operations
# sklearn for machine learning models and evaluation
# matplotlib for visualizations
# pickle to save the machine learning models
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import pickle
import sqlite3

In [3]:
# Loading the features created in featureengineering.ipynb
# rolling averages, opponent difficulty and minutes
# is_home, price, position encoding (GK, DEF, MID, FWD)
# clean_sheets_rolling_avg (defensive form)
# these will be used to train the model

# Loading the features from the database
# connecting to the database
conn = sqlite3.connect('fpl_data.db')

# reading from the features table
df = pd.read_sql_query('SELECT * FROM features', conn)

# closing the connection
conn.close()

# Showing how many records, players & gameweeks were loaded
print(f"{len(df)} records")
print(f"{df['player_id'].nunique()} players")
print(f"{df['gameweek'].min()} to {df['gameweek'].max()} GW")

# showing the data loaded properly 
print(df[['name', 'gameweek', 'rolling_avg_points', 'opponent_difficulty', 
          'minutes', 'price', 'is_home', 'pos_GK','pos_DEF','pos_MID','pos_FWD','clean_sheets_rolling_avg','total_points']].head())

16559 records
799 players
1 to 22 GW
   name  gameweek  rolling_avg_points  opponent_difficulty  minutes  price  \
0  Raya         1                 0.0                  6.9       90    5.9   
1  Raya         2                 0.0                  5.8       90    5.9   
2  Raya         3                 0.0                  5.2       90    5.9   
3  Raya         4                 0.0                  8.0       90    5.9   
4  Raya         5                 0.0                  1.0       90    5.9   

   is_home  pos_GK  pos_DEF  pos_MID  pos_FWD  clean_sheets_rolling_avg  \
0        0       1        0        0        0                       0.0   
1        1       1        0        0        0                       0.0   
2        0       1        0        0        0                       0.0   
3        1       1        0        0        0                       0.0   
4        1       1        0        0        0                       0.0   

   total_points  
0            10  
1      

In [11]:
# preparing data for machine learning
# splitting it into features X and target y
# note: rows for players who didn't play (minutes = 0) are not removed in this cell

# features X and target y
# X - input features (what the model learns from)
# y - target variable (what the model is trying to predict)
# putting the features into X for the model to learn from
features = [
    'rolling_avg_points',
    'opponent_difficulty',
    'minutes',
    'is_home',
    'price',
    'pos_GK',
    'pos_DEF',
    'pos_MID',
    'pos_FWD',
    'clean_sheets_rolling_avg',
    ]
X = df[features]
y = df['total_points']

# showing the features we are using for the model 
# and the total points
print(f"features (x) {features}")
print(f"target (y) total points")

# splitting data into two parts training 80% and testing 20%
# training set, model learns patterns from this data
# testing set, model evaluates on this to test generalization
# making the random_state - 42, makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"\ntraining set: {len(X_train)} records (80%)")
print(f"test set: {len(X_test)} records (20%)")

features (x) ['rolling_avg_points', 'opponent_difficulty', 'minutes', 'is_home', 'price', 'pos_GK', 'pos_DEF', 'pos_MID', 'pos_FWD', 'clean_sheets_rolling_avg']
target (y) total points

training set: 13247 records (80%)
test set: 3312 records (20%)
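
The random split above mixes rows from every gameweek into both sets. One alternative, sketched here as an assumption and not something this notebook does, is to hold out the most recent gameweeks so the model is only tested on games later than the ones it trained on. The toy DataFrame and the `cutoff_gw` value are hypothetical.

```python
import pandas as pd

# Toy stand-in for the features table (values are made up)
toy = pd.DataFrame({
    'gameweek': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    'rolling_avg_points': [0.0, 0.0, 2.0, 1.0, 3.0, 1.5, 2.5, 2.0, 4.0, 1.0],
    'minutes': [90, 0, 90, 45, 90, 0, 60, 90, 90, 10],
    'total_points': [2, 0, 6, 1, 3, 0, 2, 5, 8, 1],
})

feature_cols = ['rolling_avg_points', 'minutes']

# Hypothetical cutoff: train on gameweeks up to 4, test on everything after
cutoff_gw = 4
train_mask = toy['gameweek'] <= cutoff_gw

X_train_t = toy.loc[train_mask, feature_cols]
y_train_t = toy.loc[train_mask, 'total_points']
X_test_t = toy.loc[~train_mask, feature_cols]
y_test_t = toy.loc[~train_mask, 'total_points']

print(f"training rows: {len(X_train_t)}, test rows: {len(X_test_t)}")
```

A split like this is closer to how the model would be used in practice, where predictions are always made for gameweeks that have not happened yet.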


In [None]:
# implementing random forest, an ensemble learning method
# builds multiple decision trees & averages their predictions
# it is good at capturing non-linear relationships
# n_estimators=50 means 50 decision trees
# max_depth=15 limits how deep each tree can go to prevent overfitting
# min_samples_split=50 means a node must have at least 50 samples to split
# min_samples_leaf=20 means a leaf node must have at least 20 samples
# max_features='sqrt' means each split considers sqrt(total features) candidate features
# random_state=42 for reproducibility
# n_jobs=-1 uses all cpu cores for faster training
rf_model = RandomForestRegressor(
    n_estimators=50,
    max_depth=15,
    min_samples_split=50,
    min_samples_leaf=20,
    max_features='sqrt',
    random_state=42,
    n_jobs=-1,
)
# fit() trains the model on the training data
rf_model.fit(X_train, y_train)

# predictions on test data
# showing how well the model works on unseen data
rf_predictions = rf_model.predict(X_test)

# showing model performance using standard metrics
# MAE (mean absolute error) average prediction error in points
# the lower the better, shows the average points it is off by
rf_mae = mean_absolute_error(y_test, rf_predictions)
print(f"MAE (mean absolute error):  {rf_mae:.2f} points")
print(f"On average, predictions are off by {rf_mae:.2f} points")

# RMSE (root mean squared error) penalizes large errors more than mae
# the lower the better
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_predictions))
print(f"RMSE (root mean squared error): {rf_rmse:.2f} points")

# R^2 Score, how much of the variance in points the model explains
# 1.0 means perfect predictions, 0.0 means the model is no better than always predicting the average
# the higher the better
rf_r2 = r2_score(y_test, rf_predictions)
print(f"R² Score: {rf_r2:.3f}")
print(f"Model shows how well the model predicts points {rf_r2*100:.1f}%")

# Interpreting the R² score
if rf_r2 > 0.5:
    print(f"Strong performance, Model explains {rf_r2*100:.1f}% of variance")
elif rf_r2 > 0.3:
    print(f"Moderate performance, Model explains {rf_r2*100:.1f}% of variance")
elif rf_r2 > 0:
    print(f"Weak performance, Model explains {rf_r2*100:.1f}% of variance")
else:
    print(f"Poor performance, Model didn't learn (negative R²)")


# showing the predictions to see how close they are 
print("Sample Predictions (Random Forest):")
comparison = pd.DataFrame({
    'Actual': y_test[:10].values,
    'Predicted': rf_predictions[:10].round(1),
    'Error': abs(y_test[:10].values - rf_predictions[:10]).round(1)
})
print(comparison)

MAE (mean absolute error):  0.73 points
On average, predictions are off by 0.73 points
RMSE (root mean squared error): 1.68 points
R² Score: 0.508
Model shows how well the model predicts points 50.8%
Strong performance, Model explains 50.8% of variance
Sample Predictions (Random Forest):
   Actual  Predicted  Error
0       0        2.9    2.9
1       0        0.0    0.0
2       0        0.0    0.0
3       1        3.7    2.7
4       0        0.0    0.0
5       2        3.6    1.6
6      10        3.7    6.3
7       0        0.0    0.0
8       8        4.4    3.6
9       0        0.0    0.0
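
Many of the sample rows above have an actual score of 0, and the feature statistics later in this notebook show `minutes` is non-zero in only about 40% of rows, so the overall MAE is partly driven by players who did not play. A hedged sketch of how the error could be reported separately for the two groups follows. The arrays are made up, and `played` is a hypothetical mask (in this notebook it could be derived from `X_test['minutes'] > 0`).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up test-set values, for illustration only
minutes = np.array([0, 0, 90, 90, 0, 75, 90, 0])
actual = np.array([0, 0, 2, 10, 0, 1, 8, 0])
predicted = np.array([0.0, 0.1, 3.6, 3.7, 0.0, 3.0, 4.4, 0.0])

# Hypothetical mask: True where the player got on the pitch
played = minutes > 0

mae_all = mean_absolute_error(actual, predicted)
mae_played = mean_absolute_error(actual[played], predicted[played])
mae_not_played = mean_absolute_error(actual[~played], predicted[~played])

print(f"MAE, all rows:     {mae_all:.2f}")
print(f"MAE, played:       {mae_played:.2f}")
print(f"MAE, did not play: {mae_not_played:.2f}")
```

Reporting the "played" figure alongside the overall one gives a fairer picture of how useful the predictions are for players who are actually selectable.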


In [41]:
# Making a baseline (Linear Regression) to compare the ML models to
# This model assumes a straight line relationship between 
# inputs and outputs

# training linear regression model
# no hyperparameters need to be set
lr_model = LinearRegression()
# fit() finds the best linear equation to fit the data
lr_model.fit(X_train, y_train)

# showing how well the model works
lr_predictions = lr_model.predict(X_test)

# showing model performance using standard regression metrics
# MAE (mean absolute error) average prediction error in points
# the lower the better, shows the average points it is off by
lr_mae = mean_absolute_error(y_test, lr_predictions)
print(f"MAE (mean absolute error):  {lr_mae:.2f} points")

# RMSE (root mean squared error) penalizes large errors more than mae
# the lower the better
lr_rmse = np.sqrt(mean_squared_error(y_test, lr_predictions))
print(f"RMSE (root mean squared error): {lr_rmse:.2f} points")

# R^2 Score, how much of the variance in points the model explains
# 1.0 means perfect predictions, 0.0 means the model is no better than always predicting the average
# the higher the better
lr_r2 = r2_score(y_test, lr_predictions)
print(f"R² Score: {lr_r2:.3f}")
print(f"Model shows how well the model predicts points {lr_r2*100:.1f}%")

# showing the predictions to see how close they are 
print("\nExample Predictions (Linear Regression):")
comparison = pd.DataFrame({
    'Actual': y_test[:10].values,
    'Predicted': lr_predictions[:10].round(1)
})
print(comparison)

MAE (mean absolute error):  0.86 points
RMSE (root mean squared error): 1.73 points
R² Score: 0.483
Model shows how well the model predicts points 48.3%

Example Predictions (Linear Regression):
   Actual  Predicted
0       0        2.6
1       0       -0.1
2       0       -0.1
3       1        3.6
4       0       -0.0
5       2        3.8
6      10        3.6
7       0       -0.4
8       8        4.1
9       0        0.1
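
Because a linear regression is a weighted sum of the inputs plus an intercept, its fitted weights can be read directly. The sketch below fits a model on synthetic data (not the FPL features) and lists the coefficients. The column names are reused from this notebook purely as labels, and the rule used to generate the target is an assumption made for the demo.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic inputs, for illustration only
demo = pd.DataFrame({
    'minutes': rng.integers(0, 91, size=200),
    'is_home': rng.integers(0, 2, size=200),
})
# Synthetic target: a known linear rule plus a little noise
target = 0.04 * demo['minutes'] + 0.5 * demo['is_home'] + rng.normal(0, 0.1, size=200)

demo_model = LinearRegression()
demo_model.fit(demo, target)

# One coefficient per feature: the change in prediction per unit of that feature
coefs = pd.Series(demo_model.coef_, index=demo.columns)
print(coefs)
print(f"intercept: {demo_model.intercept_:.3f}")
```

The same `pd.Series(lr_model.coef_, index=features)` pattern should work on the model trained above, and a negative intercept or weight is one way a linear model can end up with slightly negative predictions.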


In [46]:
# Comparing both models to see which is best
# The model that has the lowest MAE score is the best
# in FPL the average error matters more than a perfect prediction

# creating a comparison dataframe/table with all the metrics
comparison_df = pd.DataFrame({
    'Model': ['Random Forest', 'Linear Regression'],
    'MAE': [rf_mae, lr_mae],
    'RMSE': [rf_rmse, lr_rmse],
    'R² Score': [rf_r2, lr_r2]
})

# Printing out the comparison dataframe/table
print("\nComparing Random Forest & Linear Regression")
print(comparison_df.to_string(index=False))

# Seeing what the best model is, the lowest MAE score
# The lower the MAE , the smaller average prediction error
bmn = comparison_df.loc[comparison_df['MAE'].idxmin(), 'Model']
print(f"\nBest Model (Lowest MAE): {bmn}")

# This shows which features matter the most
print("\nFeature Importance (Random Forest):")
# Getting the importance score of each feature
# The higher the score the more important the feature is
feature_importance = pd.DataFrame({
    'Feature': features,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print(feature_importance.to_string(index=False))

# explaining the importance of the features
print("\nExplanation:")
# for each feature in the table, show its importance as a %
for _, row in feature_importance.iterrows():
    print(f"{row['Feature']}: {row['Importance']*100:.1f}% importance")



Comparing Random Forest & Linear Regression
            Model      MAE     RMSE  R² Score
    Random Forest 0.728185 1.682202  0.508213
Linear Regression 0.859767 1.725005  0.482868

Best Model (Lowest MAE): Random Forest

Feature Importance (Random Forest):
                 Feature  Importance
                 minutes    0.706403
      rolling_avg_points    0.119804
                   price    0.076982
clean_sheets_rolling_avg    0.044315
     opponent_difficulty    0.030676
                 is_home    0.007966
                 pos_DEF    0.005382
                 pos_MID    0.003464
                  pos_GK    0.003078
                 pos_FWD    0.001931

Explanation:
minutes: 70.6% importance
rolling_avg_points: 12.0% importance
price: 7.7% importance
clean_sheets_rolling_avg: 4.4% importance
opponent_difficulty: 3.1% importance
is_home: 0.8% importance
pos_DEF: 0.5% importance
pos_MID: 0.3% importance
pos_GK: 0.3% importance
pos_FWD: 0.2% importance
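
The importances above are scikit-learn's impurity-based `feature_importances_`. A complementary check, not used in this notebook, is permutation importance, which measures how much the score drops on held-out data when one feature's values are shuffled. A minimal sketch on synthetic data, where the names `informative` and `noise` are labels invented for the demo:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data: the target depends on column 0 only; column 1 is pure noise
X_demo = rng.normal(size=(400, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

forest = RandomForestRegressor(n_estimators=20, random_state=42)
forest.fit(X_tr, y_tr)

# Shuffle each feature in turn on the held-out set and record the drop in R^2
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)

for name, mean_drop in zip(['informative', 'noise'], result.importances_mean):
    print(f"{name}: {mean_drop:.3f}")
```

Applied to this notebook, the call would take `rf_model`, `X_test` and `y_test` instead of the synthetic objects.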


In [50]:
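# Testing random forest on actual players
# Getting the latest gameweek data for realistic testing
# note: because the split above was random, many of these rows were likely part of
# the training set, so this is a sanity check rather than a true out-of-sample test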
latest_gw = df['gameweek'].max()
print(f"\nUsing gameweek {latest_gw} data for predictions")

# getting the top 10 players by their form
players = df[df['gameweek'] == latest_gw].nlargest(10, 'rolling_avg_points')

# selecting the feature columns for these players
X_players = players[features]
# generating predictions using random forest
predictions_rf = rf_model.predict(X_players)

# making a predictions dataframe/table to show the actual points v the predicted points
predictions = pd.DataFrame({
    'Player': players['name'].values,
    'Form': players['rolling_avg_points'].values,
    'Opponent_Diff': players['opponent_difficulty'].values,
    'Minutes': players['minutes'].values,
    'Price': players['price'].values,
    'is_Home': players['is_home'].values,
    'position': players['position'].values,
    'Actual_Points': players['total_points'].values,
    'Predicted_Points': predictions_rf.round(1)
})

# printing out the predictions table of actual points v predicted points
print("\nTop 10 players, Actual v Predicted")
print(predictions.to_string(index=False))

# calculating the average error of these predictions
# showing how accurate random forest is on these players
predictions['Error'] = abs(predictions['Actual_Points'] - predictions['Predicted_Points'])
print(f"\nAverage prediction error. {predictions['Error'].mean():.2f} points")


Using gameweek 22 data for predictions

Top 10 players, Actual v Predicted
      Player  Form  Opponent_Diff  Minutes  Price  is_Home position  Actual_Points  Predicted_Points
     Collins   8.8            4.1       90    5.0        0      DEF              1               2.7
    Kelleher   7.6            4.1       90    4.6        0       GK              1               2.7
      Schade   7.6            4.1       81    7.2        0      MID              1               3.7
      Thiago   7.6            4.1       90    7.2        0      FWD              2               4.0
        Rice   7.2            8.0       90    7.4        0      MID              3               5.3
      Janelt   7.2            4.1       90    4.9        0      MID              1               3.1
Lewis-Potter   7.0            4.1        8    4.9        0      DEF              1               1.1
      Garner   7.0            8.0       90    5.2        0      MID              4               4.0
    Aaronson   

In [51]:
# Saving Random Forest using pickle
# saving with pickle so the file can be loaded later to make predictions without retraining
model_filename = 'fpl_predictor_model.pkl'
# wb to write the file in binary mode
with open(model_filename, 'wb') as file:
    pickle.dump(rf_model, file)

print(f"Random Forest saved to {model_filename}")

# creating a summary of the ml training
summary = f"""
Random Forest
Training Date - {pd.Timestamp.now()}

Features Used -
rolling_avg_points - Player form from the last 5 games
opponent_difficulty - Fixture difficulty from a scale of 1-10
minutes - How many minutes a player gets, from 0 to 90
is_home - Whether the player is playing at home (1) or away (0)
price - Player price in millions
pos_GK, pos_DEF, pos_MID, pos_FWD - One-hot encoding for player positions
clean_sheets_rolling_avg - Defensive form from last 5 games

Performance Metrics -
MAE - {rf_mae:.2f} points
RMSE - {rf_rmse:.2f} points
R² Score - {rf_r2:.3f}

Training Data -
- Records: {len(X_train)}
- Players: {df['player_id'].nunique()}
- Gameweeks: {df['gameweek'].min()} to {df['gameweek'].max()}
"""

# saving the summary to a txt file
with open('model_summary.txt', 'w') as f:
    f.write(summary)

print("Summary of Random Forest saved to model_summary.txt")


Random Forest saved to fpl_predictor_model.pkl
Summary of Random Forest saved to model_summary.txt
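
To use the saved model later, the file can be loaded back with `pickle.load`. The sketch below is self-contained: it trains a tiny model on made-up data, saves it to a temporary file (a hypothetical path, not `fpl_predictor_model.pkl`), and checks that the reloaded model gives the same predictions. Pickle files should only be loaded from sources you trust.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny made-up training set, for illustration only
X_small = np.array([[0, 0.0], [90, 2.0], [45, 1.0], [90, 5.0], [10, 0.5], [80, 3.0]])
y_small = np.array([0, 2, 1, 8, 1, 4])

model = RandomForestRegressor(n_estimators=5, random_state=42)
model.fit(X_small, y_small)

# Save to a temporary file, then load it back
path = os.path.join(tempfile.mkdtemp(), 'demo_model.pkl')
with open(path, 'wb') as file:
    pickle.dump(model, file)

with open(path, 'rb') as file:
    loaded_model = pickle.load(file)

# The reloaded model should reproduce the original predictions exactly
same = np.array_equal(model.predict(X_small), loaded_model.predict(X_small))
print(f"predictions match after reload: {same}")
```

When loading the real model, the input columns need to be in the same order as the `features` list used for training.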


In [53]:
print("\n" + "="*80)
print("MODEL DIAGNOSTIC")
print("="*80)

# 1. Check feature correlations
print("\n1. FEATURE CORRELATIONS:")
print("-"*80)
for feature in ['rolling_avg_points', 'opponent_difficulty', 'minutes', 
                'is_home', 'price', 'clean_sheets_rolling_avg']:
    if feature in df.columns:
        corr = df[feature].corr(df['total_points'])
        print(f"   {feature:30s}: r = {corr:+.3f}")

# 2. Check missing values
print("\n2. MISSING VALUES:")
print("-"*80)
missing_found = False
for feature in features:
    missing = df[feature].isna().sum()
    if missing > 0:
        print(f"   {feature}: {missing} ({missing/len(df)*100:.1f}%)")
        missing_found = True
if not missing_found:
    print("   ✓ No missing values")

# 3. Check feature statistics
print("\n3. FEATURE STATISTICS:")
print("-"*80)
for feature in features:
    print(f"{feature}:")
    print(f"   Mean: {df[feature].mean():.2f}")
    print(f"   Std:  {df[feature].std():.2f}")
    print(f"   Min:  {df[feature].min():.2f}")
    print(f"   Max:  {df[feature].max():.2f}")
    print(f"   Non-zero: {(df[feature] != 0).sum()} ({(df[feature] != 0).sum()/len(df)*100:.1f}%)")
    print()

# 4. Check training vs test performance
print("\n4. OVERFITTING CHECK:")
print("-"*80)
train_predictions = rf_model.predict(X_train)
train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, rf_predictions)

print(f"Training R²: {train_r2:.3f}")
print(f"Test R²:     {test_r2:.3f}")
print(f"Gap:         {train_r2 - test_r2:.3f}")

if train_r2 - test_r2 > 0.1:
    print("   ⚠ Model is overfitting!")
elif train_r2 - test_r2 > 0.05:
    print("   + Slight overfitting")
else:
    print("   ✓ Good generalization")

# 5. Feature importance
print("\n5. FEATURE IMPORTANCE:")
print("-"*80)
importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)
print(importance_df.to_string(index=False))

print("="*80)


MODEL DIAGNOSTIC

1. FEATURE CORRELATIONS:
--------------------------------------------------------------------------------
   rolling_avg_points            : r = +0.415
   opponent_difficulty           : r = +0.051
   minutes                       : r = +0.685
   is_home                       : r = +0.036
   price                         : r = +0.304
   clean_sheets_rolling_avg      : r = +0.322

2. MISSING VALUES:
--------------------------------------------------------------------------------
   ✓ No missing values

3. FEATURE STATISTICS:
--------------------------------------------------------------------------------
rolling_avg_points:
   Mean: 0.93
   Std:  1.59
   Min:  -0.60
   Max:  11.80
   Non-zero: 6672 (40.3%)

opponent_difficulty:
   Mean: 6.41
   Std:  2.38
   Min:  1.00
   Max:  10.00
   Non-zero: 16559 (100.0%)

minutes:
   Mean: 26.17
   Std:  37.69
   Min:  0.00
   Max:  90.00
   Non-zero: 6666 (40.3%)

is_home:
   Mean: 0.50
   Std:  0.50
   Min:  0.00
   Max:  1.0