# Training a Machine Learning Model
Training machine learning models to predict FPL player points using engineered features.

## Models to train
- Random Forest, ensemble of decision trees 
- Linear Regression, baseline for comparison

## Features
- rolling_avg_points, player form last 5 games 
- opponent_difficulty, fixture difficulty (1-10 scale)
- minutes, playing time a player gets (0-90 minutes)
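
As a rough illustration of how a "form over the last 5 games" feature can be built, the sketch below computes a rolling average with pandas on made-up data. The toy values, the `shift(1)` step that excludes the current gameweek, and filling the first gameweek with 0.0 are all assumptions for this example; the actual logic lives in the feature engineering notebook and may differ.

```python
import pandas as pd

# Hypothetical toy data: one player's points over seven gameweeks
toy = pd.DataFrame({
    "player_id": [1] * 7,
    "gameweek": list(range(1, 8)),
    "total_points": [10, 6, 2, 6, 2, 8, 1],
})

# Rolling windows depend on row order, so sort by player and gameweek first
toy = toy.sort_values(["player_id", "gameweek"])

# shift(1) drops the current gameweek so only past games are used (assumption);
# rolling(5, min_periods=1) averages up to the five previous games;
# the first gameweek has no history, so it is filled with 0.0 (assumption)
toy["rolling_avg_points"] = (
    toy.groupby("player_id")["total_points"]
       .transform(lambda s: s.shift(1).rolling(5, min_periods=1).mean())
       .fillna(0.0)
)

print(toy)
```

Grouping by `player_id` keeps one player's history from leaking into another's window.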

## Steps
- Load the features from 'fpl_features.csv'.
- Split the data into two parts, 80% training and 20% testing.
- Train both models on the training data.
- Assess both models on the test data and compare their performance.
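
The steps above can be sketched end to end on synthetic data, so the example runs without 'fpl_features.csv'. The `demo` DataFrame, the formula used to generate its `total_points`, and the `demo_`-prefixed variable names are invented for illustration only; the scores it prints say nothing about real FPL data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature file (all values are made up)
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "rolling_avg_points": rng.uniform(0, 8, n),
    "opponent_difficulty": rng.uniform(1, 10, n),
    "minutes": rng.integers(0, 91, n),
})
# Invented target: points rise with form and minutes, plus random noise
demo["total_points"] = (
    0.5 * demo["rolling_avg_points"] + demo["minutes"] / 45 + rng.normal(0, 1, n)
)

demo_features = ["rolling_avg_points", "opponent_difficulty", "minutes"]
demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo[demo_features], demo["total_points"], test_size=0.2, random_state=42
)

# Train both models on the training split, then score each on the test split
demo_models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}
demo_results = {}
for name, model in demo_models.items():
    model.fit(demo_x_train, demo_y_train)
    demo_results[name] = mean_absolute_error(demo_y_test, model.predict(demo_x_test))
    print(f"{name}: MAE {demo_results[name]:.2f}")
```

Scoring both models on the same held-out split is what makes the comparison with the baseline meaningful.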

## Performance Metrics 
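- MAE (Mean Absolute Error), the average number of points the predictions are off by (lower is better)
- RMSE (Root Mean Squared Error), like MAE but penalizes large errors more heavily (lower is better)
- R^2 Score, the share of variance in points that the model explains (1 is perfect, 0 is no better than always predicting the mean, and it can go negative for a model that does worse than that)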
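
To make the three metrics concrete, here is a small worked example that computes them by hand with NumPy. The `actual` and `predicted` arrays are made-up numbers, not model output.

```python
import numpy as np

# Made-up actual and predicted points for five players
actual = np.array([2.0, 6.0, 1.0, 0.0, 8.0])
predicted = np.array([3.0, 4.0, 1.0, 1.0, 5.0])
errors = actual - predicted

# MAE: mean of the absolute errors
mae = np.mean(np.abs(errors))

# RMSE: square the errors, average them, then take the square root
rmse = np.sqrt(np.mean(errors ** 2))

# R^2: 1 minus (squared error of the model / squared error of predicting the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.2f}")   # 1.40
print(f"RMSE: {rmse:.2f}")  # 1.73
print(f"R^2:  {r2:.3f}")    # 0.682
```

RMSE (1.73) comes out higher than MAE (1.40) here because the single 3-point miss is squared before averaging.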

## Output
Random Forest normally achieves:
- MAE, 2.0-2.5 points
- R^2 Score, 0.45-0.55

In [67]:
# Importing libraries for ML training
# pandas for data manipulation, numpy for numerical operations
# sklearn for machine learning models and evaluation
# matplotlib for visualizations
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt

In [68]:
# Loading the features created in featureengineering.ipynb:
# rolling averages, opponent difficulty and minutes
# these will be used to train the model

# reading the csv file that has all the features
df = pd.read_csv('fpl_features.csv')

# Showing how many records, players & gameweeks were loaded
print(f"{len(df)} records")
print(f"{df['player_id'].nunique()} players")
print(f"{df['gameweek'].min()} to {df['gameweek'].max()} GW")

# showing the data loaded properly 
print(df[['name', 'gameweek', 'rolling_avg_points', 'opponent_difficulty', 
          'minutes', 'total_points']].head())

10329 records
756 players
1 to 14 GW
   name  gameweek  rolling_avg_points  opponent_difficulty  minutes  \
0  Raya         1                 0.0                  6.9       90   
1  Raya         2                 0.0                  6.5       90   
2  Raya         3                 0.0                  4.9       90   
3  Raya         4                 0.0                  7.9       90   
4  Raya         5                 0.0                  1.2       90   

   total_points  
0            10  
1             6  
2             2  
3             6  
4             2  


In [69]:
# preparing data for machine learning
# separating the input features (x) from the target (y)
# note: rows for players who didn't play are not filtered out in this cell

# x - input features (what the model learns from)
# y - target variable (what the model is trying to predict)
# putting the features into x for the model to learn from
features = ['rolling_avg_points', 'opponent_difficulty', 'minutes']
x = df[features]
y = df['total_points']

# showing the features we are using for the model 
# and the total points
print(f"features (x) {features}")
print(f"target (y) total points")

# splitting data into two parts training 80% and testing 20%
# training set, model learns patterns from this data
# testing set, model evaluates on this to test generalization
# setting random_state=42 makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print(f"\ntraining set: {len(x_train)} records (80%)")
print(f"test set: {len(x_test)} records (20%)")

features (x) ['rolling_avg_points', 'opponent_difficulty', 'minutes']
target (y) total points

training set: 8263 records (80%)
test set: 2066 records (20%)


In [70]:
# implementing random forest, an ensemble learning method
# it builds multiple decision trees & averages their predictions
# it is good at capturing non-linear relationships
# n_estimators=100 means 100 decision trees
# random_state=42 for reproducibility
# n_jobs=-1 uses all cpu cores for faster training
rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
# fit() trains the model on the training data
rf_model.fit(x_train, y_train)

# predictions on test data
# showing how well the model works on unseen data
rf_predictions = rf_model.predict(x_test)

# showing model performance using standard metrics
# MAE (mean absolute error) average prediction error in points
# the lower the better, shows the average points it is off by
rf_mae = mean_absolute_error(y_test, rf_predictions)
print(f"MAE (mean absolute error):  {rf_mae:.2f} points")
print(f"On average, predictions are off by {rf_mae:.2f} points")

# RMSE (root mean squared error) penalizes large errors more than mae
# the lower the better
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_predictions))
print(f"RMSE (root mean squared error): {rf_rmse:.2f} points")

# R^2 Score, the share of variance in points the model explains
# 1.0 means perfect predictions, 0.0 means no better than always predicting the average
# the higher the better
rf_r2 = r2_score(y_test, rf_predictions)
print(f"R² Score: {rf_r2:.3f}")
print(f"Model shows how well the model predicts points {rf_r2*100:.1f}%")

# Interpreting the R² score
if rf_r2 > 0.5:
    print(f"Strong performance, Model explains {rf_r2*100:.1f}% of variance")
elif rf_r2 > 0.3:
    print(f"Moderate performance, Model explains {rf_r2*100:.1f}% of variance")
elif rf_r2 > 0:
    print(f"Weak performance, Model explains {rf_r2*100:.1f}% of variance")
else:
    print(f"Poor performance, Model didn't learn (negative R²)")


# showing the predictions to see how close they are 
print("Sample Predictions (Random Forest):")
comparison = pd.DataFrame({
    'Actual': y_test[:10].values,
    'Predicted': rf_predictions[:10].round(1),
    'Error': abs(y_test[:10].values - rf_predictions[:10]).round(1)
})
print(comparison)

MAE (mean absolute error):  0.83 points
On average, predictions are off by 0.83 points
RMSE (root mean squared error): 2.01 points
R² Score: 0.315
Model shows how well the model predicts points 31.5%
Moderate performance, Model explains 31.5% of variance
Sample Predictions (Random Forest):
   Actual  Predicted  Error
0       0        0.0    0.0
1       0        0.0    0.0
2       1        1.3    0.3
3       1        7.1    6.1
4       1        1.5    0.5
5       2        1.7    0.3
6       0        0.0    0.0
7       0        0.0    0.0
8       0        0.0    0.0
9       0        0.0    0.0
