# Training a Machine Learning Model
This notebook trains machine learning models to predict FPL (Fantasy Premier League) player points from engineered features.

## Models to train
- Random Forest, ensemble of decision trees 
- Linear Regression, baseline for comparison

## Features
- rolling_avg_points, player form over the last 5 games
- opponent_difficulty, fixture difficulty (1-10 scale)
- minutes, playing time a player gets (0-90 minutes)
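
To illustrate what a "last 5 games" form feature can look like, here is a minimal sketch on made-up data. It assumes the feature is the mean of a player's previous five scores, with the current gameweek excluded. The feature code in featureengineering.ipynb may compute it differently, and the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Toy data: one hypothetical player over six gameweeks (values are made up)
toy = pd.DataFrame({
    "player_id": [1] * 6,
    "gameweek": [1, 2, 3, 4, 5, 6],
    "total_points": [10, 6, 2, 6, 2, 8],
})

# rolling windows only make sense when each player's rows are in gameweek order
toy = toy.sort_values(["player_id", "gameweek"])

# shift(1) drops the current gameweek so the feature only uses past games;
# rolling(5, min_periods=1) averages up to the 5 previous scores;
# the first gameweek has no history, so it is filled with 0.0
toy["rolling_avg_points"] = (
    toy.groupby("player_id")["total_points"]
    .transform(lambda s: s.shift(1).rolling(window=5, min_periods=1).mean())
    .fillna(0.0)
)

print(toy)
```

Excluding the current gameweek matters here: if the score being predicted were part of its own rolling average, the model would see a piece of the answer in its inputs.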

## Steps
- Load the features from 'fpl_features.csv'.
- Split the data into two parts: 80% training and 20% testing.
- Train both models on the training data.
- Assess both models on the test data and compare their performance.
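
The steps above can be sketched end to end on synthetic data. This is only an illustration: the `demo` frame stands in for 'fpl_features.csv', its values are randomly generated, and the `n_estimators=100` setting is an assumption rather than the notebook's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# synthetic stand-in for fpl_features.csv (all values are made up)
rng = np.random.default_rng(42)
n = 500
demo = pd.DataFrame({
    "rolling_avg_points": rng.uniform(0, 10, n),
    "opponent_difficulty": rng.uniform(1, 10, n),
    "minutes": rng.integers(1, 91, n),
})
demo["total_points"] = (
    0.5 * demo["rolling_avg_points"]
    - 0.2 * demo["opponent_difficulty"]
    + 0.03 * demo["minutes"]
    + rng.normal(0, 1, n)
)

features = ["rolling_avg_points", "opponent_difficulty", "minutes"]
x_train, x_test, y_train, y_test = train_test_split(
    demo[features], demo["total_points"], test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# fit each model on the training set only, then score it on the held-out test set
results = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    preds = model.predict(x_test)
    results[name] = mean_absolute_error(y_test, preds)
    print(f"{name}: test MAE {results[name]:.2f}")
```

Both models see exactly the same split, so any difference in test error comes from the models and not from the data they were given.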

## Performance Metrics 
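- MAE (Mean Absolute Error), the average number of points the predictions are off by
- RMSE (Root Mean Squared Error), like MAE but penalises large errors more heavily
- R^2 Score, the share of the variance in points the model explains (at most 1, higher is better; it can drop below 0 if the model does worse than always predicting the mean)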
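
As a small worked example of the three metrics, the sketch below computes them by hand on four made-up predictions and checks the results against scikit-learn. The numbers are illustrative and are not taken from the FPL data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# four made-up actual scores and predictions
y_true = np.array([2.0, 6.0, 1.0, 8.0])
y_pred = np.array([3.0, 5.0, 2.0, 6.0])

errors = y_true - y_pred                      # [-1, 1, -1, 2]

# MAE: mean of the absolute errors
mae = np.mean(np.abs(errors))                 # (1 + 1 + 1 + 2) / 4 = 1.25

# RMSE: square the errors, average them, take the square root
rmse = np.sqrt(np.mean(errors ** 2))          # sqrt(7 / 4)

# R^2: 1 - (squared error of the model) / (squared error of predicting the mean)
ss_res = np.sum(errors ** 2)                  # 7.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 32.75
r2 = 1 - ss_res / ss_tot

# the hand-computed values should agree with scikit-learn
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(r2, r2_score(y_true, y_pred))

print(f"MAE {mae:.2f}, RMSE {rmse:.2f}, R^2 {r2:.2f}")
```

RMSE comes out higher than MAE here because the single 2-point miss is squared before averaging, which is why RMSE is the more sensitive of the two to occasional large errors.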

## Output
Random Forest normally achieves:
- MAE, 2.0-2.5 points
- R^2 Score, 0.45-0.55

In [2]:
# Importing libraries for ML training
# pandas for data manipulation, numpy for numerical operations
# sklearn for machine learning models and evaluation
# matplotlib for visualizations
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt

In [3]:
# Loading the features created in featureengineering.ipynb:
# rolling averages, opponent difficulty and minutes
# these will be used to train the model

# reading the csv file that has all the features
df = pd.read_csv('fpl_features.csv')

# Showing how many records, players & gameweeks were loaded
print(f"{len(df)} records")
print(f"{df['player_id'].nunique()} players")
print(f"{df['gameweek'].min()} to {df['gameweek'].max()} GW")

# showing the data loaded properly 
print(df[['name', 'gameweek', 'rolling_avg_points', 'opponent_difficulty', 
          'minutes', 'total_points']].head())

10329 records
756 players
1 to 14 GW
   name  gameweek  rolling_avg_points  opponent_difficulty  minutes  \
0  Raya         1                 0.0                  6.9       90   
1  Raya         2                 0.0                  6.5       90   
2  Raya         3                 0.0                  4.9       90   
3  Raya         4                 0.0                  7.9       90   
4  Raya         5                 0.0                  1.2       90   

   total_points  
0            10  
1             6  
2             2  
3             6  
4             2  


In [None]:
# preparing data for machine learning
# splitting it into features (x) and target (y)
# removing players who didn't play

# features x and target y
# x - input features (what the model learns from)
# y - target variable (what the model is trying to predict)
# putting the features into x for the model to learn from
features = ['rolling_avg_points', 'opponent_difficulty', 'minutes']
x = df[features]
y = df['total_points']

# showing the features we are using for the model 
# and the total points
print(f"features (x) {features}")
print("target (y) total points")

# removing any players from the dataframe/table of players who didn't play,
# as a player with 0 minutes scores 0 points
print(f"records including players who dont play {len(x)}")
# pwp is players who play
pwp = df['minutes'] > 0
x = x[pwp]
y = y[pwp]
print(f"records of players who play {len(x)}")

# splitting data into two parts: training 80% and testing 20%
# training set, the model learns patterns from this data
# testing set, held back to check how well the model generalizes
# setting random_state=42 makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print(f"\nTraining set: {len(x_train)} records (80%)")
print(f"Test set: {len(x_test)} records (20%)")

features (x) ['rolling_avg_points', 'opponent_difficulty', 'minutes']
target (y) total points
records including players who dont play 10329
records of players who play 4067

Training set: 3253 records (80%)
Test set: 814 records (20%)
