# Random Forest Implementation 3

Here the goal was to use more precise statistical methods to eliminate or include features.
### Steps taken 
- From the initial list of 140 features per team and 300 features per player, with each team having around 20 players, I began by averaging the player statistics within each position. This means that any given team has those 300 statistics as an average for each of the four positions, e.g. midfeilder_PLAYER_TACKLES_season_sum.
- On this set I ran a correlation test and excluded any column that is correlated with an already-included column with a coefficient of 0.7 or greater. This reduced the number of features per team from 140 + 300*4 = 1340 to 459. This was carried out on all available data.
- Then, with the resulting roughly 900 features (459 each for the home and away teams, 918 in total), I ran recursive feature elimination (RFE) to reduce the number of features included to 300. The RFE algorithm removes the features with the least predictive power at each iteration. This was carried out on training data.
- With this set of 300 features, I then ran a grid search on a validation set to optimise hyperparameters.
- The random forest was then trained on the training split of the data.
- This approach resulted in:
    - Accuracy on training data: 0.8226609157266092
    - Accuracy on validation data: 0.4694272445820433
    - Accuracy on testing data: 0.479273909509618 

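and hence it was actually the least successful! The large gap between training accuracy (about 0.82) and validation/testing accuracy (about 0.47 to 0.48) suggests the forest is overfitting. With more experience in selecting hyperparameters, I don't doubt this could end up being the most effective method.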
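
The correlation filter and RFE steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions, not the code that produced the feature lists loaded below: the helper `drop_correlated`, the synthetic columns `f0` to `f4`, the use of the absolute Pearson correlation and the greedy left-to-right column order are all hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE


def drop_correlated(df: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Keep a column only if its |correlation| with every kept column is below threshold."""
    corr = df.corr().abs()
    kept = []
    for col in df.columns:
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return df[kept]


# Synthetic data (assumption): four independent features plus a near-copy of f0.
rng = np.random.default_rng(0)
n_rows = 300
demo = pd.DataFrame(rng.normal(size=(n_rows, 4)), columns=["f0", "f1", "f2", "f3"])
demo["f4"] = demo["f0"] + 0.01 * rng.normal(size=n_rows)

# Toy 3-class target (-1, 0, 1) driven by f0 and f1.
target = np.digitize((demo["f0"] + demo["f1"]).to_numpy(), bins=[-0.5, 0.5]) - 1

# Step 1: correlation filter. f4 is dropped because it duplicates f0.
uncorrelated = drop_correlated(demo, threshold=0.7)

# Step 2: RFE refits the estimator and removes the weakest feature each round
# until n_features_to_select remain.
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=2,
    step=1,
)
selector.fit(uncorrelated, target)
selected = list(uncorrelated.columns[selector.support_])

print("After correlation filter:", list(uncorrelated.columns))
print("After RFE:", selected)
```

On the real data the same pattern would use a threshold of 0.7 and `n_features_to_select=300`, with a larger `step` being one way to keep the run time manageable.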

In [2]:
import numpy as np
import pandas as pd
import os 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

In [6]:
data_folder = os.path.join(os.getcwd(), '..', 'Train_Data')
train_home_team_statistics_df = pd.read_csv(os.path.join(data_folder, 'train_uncorrelated_home_stats.csv'), index_col=0)
train_away_team_statistics_df = pd.read_csv(os.path.join(data_folder, 'train_uncorrelated_away_stats.csv'), index_col=0)
final_features = pd.read_csv(os.path.join(data_folder, 'Final_features_list_v_set.csv'), index_col=0)
train_scores   = pd.read_csv(os.path.join(data_folder, 'Y_train.csv'), index_col=0)

In [7]:
# Here I turn the results into a single column so that the forest classifies each game into one of 3 classes.
# It is defined so 1 = win, 0 = draw, -1 = loss.
results = []
for index, row in train_scores.iterrows():
    if row.iloc[0] == 1:
        results.append(1)
    elif row.iloc[1] == 1:
        results.append(0)
    elif row.iloc[2] == 1:
        results.append(-1)
results_df = pd.DataFrame(results, columns=['Score'])
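
The same encoding can also be written without an explicit loop. The sketch below is a hedged alternative, not the method used above: the column names `HOME_WINS`, `DRAW` and `AWAY_WINS` are hypothetical stand-ins for whatever `Y_train.csv` contains, and it assumes every row has exactly one 1.

```python
import pandas as pd

# Hypothetical one-hot outcome table; the real column names may differ.
scores = pd.DataFrame(
    {"HOME_WINS": [1, 0, 0, 1], "DRAW": [0, 1, 0, 0], "AWAY_WINS": [0, 0, 1, 0]},
    index=[10, 11, 12, 13],
)

# Position of the 1 in each row: 0 = first column, 1 = second, 2 = third.
position = scores.to_numpy().argmax(axis=1)

# Map positions to the labels used above: 1 = win, 0 = draw, -1 = loss.
labels = pd.Series(1 - position, index=scores.index, name="Score")

print(labels.tolist())  # [1, 0, -1, 1]
```

Unlike the loop, this version keeps the original game index on the labels, which makes it easier to check that labels and features still refer to the same games.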

Next, I build one large DataFrame so that all the information about each game is in a single row.

In [8]:
# join='inner' keeps only the games (index values) present in both DataFrames and places their columns side by side
train_home_team_statistics_df.columns = 'HOME_' + train_home_team_statistics_df.columns
train_away_team_statistics_df.columns = 'AWAY_' + train_away_team_statistics_df.columns
files = [train_home_team_statistics_df,train_away_team_statistics_df]
train_data =  pd.concat(files,join='inner',axis=1)
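# This last line is not strictly necessary; it just restricts the scores to the games that we have.
# Note: results_df was built before this filter, so it only lines up with train_data if no games were dropped.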
train_scores = train_scores.loc[train_data.index] 
train_data.shape 

(12303, 918)
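
To make the effect of `join='inner'` concrete, here is a small sketch with made-up data. The `SHOTS` column and the index values are assumptions for illustration only.

```python
import pandas as pd

# Toy home/away tables that share only games 1 and 2.
home = pd.DataFrame({"SHOTS": [10, 12, 8]}, index=[0, 1, 2])
away = pd.DataFrame({"SHOTS": [7, 9, 11]}, index=[1, 2, 3])

# Prefix the columns so the two sides stay distinguishable after joining.
home.columns = "HOME_" + home.columns
away.columns = "AWAY_" + away.columns

# axis=1 places the columns side by side; join='inner' keeps only shared index values.
combined = pd.concat([home, away], join="inner", axis=1)
print(combined)
```

Games 0 and 3 appear in only one table, so they are dropped and `combined` has two rows and two columns.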

In [9]:
features = final_features['0'].to_list()
essential_training_data = train_data[features]
essential_training_data.shape

(12303, 300)

In [10]:
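# We need training, validation and testing sets.
# Note: as written, this splits train_data (918 columns) rather than essential_training_data (300 columns);
# pass essential_training_data here to train on the selected features only.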
X_train, X_test, y_train, y_test = train_test_split(train_data, results_df, test_size=0.3, random_state=42)
X_train, X_validate, y_train, y_validate = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
y_train = np.ravel(y_train)
y_validate = np.ravel(y_validate)
# This may look odd, but we now have 3 sets of data. The testing set is 0.3 of the original, the training set is 0.49 of
# the original and the validation set is 0.21 of the original.
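
The fractions in the comment above follow from applying `test_size=0.3` twice. A quick check of the arithmetic, assuming (as I understand scikit-learn's behaviour) that the test portion is rounded up to a whole number of rows; `split_sizes` is a hypothetical helper:

```python
import math


def split_sizes(n_rows, test_frac=0.3, validate_frac=0.3):
    """Row counts after two successive splits, rounding each held-out part up."""
    n_test = math.ceil(test_frac * n_rows)
    remaining = n_rows - n_test
    n_validate = math.ceil(validate_frac * remaining)
    n_train = remaining - n_validate
    return n_train, n_validate, n_test


n_train, n_validate, n_test = split_sizes(12303)
print(n_train, n_validate, n_test)  # 6028 2584 3691

# Fractions of the original 12303 rows: roughly 0.49, 0.21 and 0.30.
print([round(n / 12303, 2) for n in (n_train, n_validate, n_test)])
```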

In [11]:
# This uses the results of the Hyperparameter Optimisation section below. Basically, we are just telling the random forest how many
# trees we want it to have, how deep we want the trees to be, how many features to consider at each split, etc. I sent you a video
# which I found very helpful for getting to grips with this.

# These were chosen by running a parameter optimisation grid search on the validation section of the data.
best_params = {
    'max_depth': 10,
    'max_features': 'sqrt',
    'min_samples_leaf': 2,
    'min_samples_split': 5,
    'n_estimators': 600
}
# Initialize the Random Forest classifier with the best parameters
rf = RandomForestClassifier(
    n_estimators=best_params['n_estimators'],
    max_depth=best_params['max_depth'],
    max_features=best_params['max_features'],
    min_samples_leaf=best_params['min_samples_leaf'],
    min_samples_split=best_params['min_samples_split'],
    random_state=42
)
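
Because the keys of `best_params` match the constructor's argument names, the same classifier can be built more compactly by unpacking the dictionary. A minimal sketch; `rf_compact` is a name introduced here for illustration:

```python
from sklearn.ensemble import RandomForestClassifier

best_params = {
    'max_depth': 10,
    'max_features': 'sqrt',
    'min_samples_leaf': 2,
    'min_samples_split': 5,
    'n_estimators': 600
}

# ** passes each dictionary entry as a keyword argument.
rf_compact = RandomForestClassifier(**best_params, random_state=42)

# Confirm that every hyperparameter was picked up.
params = rf_compact.get_params()
print({name: params[name] for name in best_params})
```

This also means `grid_search.best_params_` can be passed straight in with `**`, which avoids copying values by hand.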

In [13]:
rf.fit(X_train, y_train)
# Make predictions on the sets of data
y_pred_train = rf.predict(X_train)
y_pred_test = rf.predict(X_test)
y_pred_validate = rf.predict(X_validate)

In [14]:
print('Accuracy on training data:',   accuracy_score(y_train, y_pred_train))
print('Accuracy on validation data:', accuracy_score(y_validate, y_pred_validate))
print('Accuracy on testing data:',    accuracy_score(y_test, y_pred_test))

Accuracy on training data: 0.8226609157266092
Accuracy on validation data: 0.4694272445820433
Accuracy on testing data: 0.479273909509618


# Hyperparameter Optimisation

In [37]:
from sklearn.model_selection import GridSearchCV
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

# Fit the grid search to the validation split
grid_search.fit(X_validate, y_validate)

# Get the best parameters
print("Best Parameters:", grid_search.best_params_)

# Use the best estimator to make predictions
best_rf = grid_search.best_estimator_
y_pred_best = best_rf.predict(X_test)

print("Tuned Accuracy:", accuracy_score(y_test, y_pred_best))
print("Tuned Confusion Matrix:\n", confusion_matrix(y_test, y_pred_best))
print("Tuned Classification Report:\n", classification_report(y_test, y_pred_best))
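
As a sanity check on the size of this search, the number of candidates is the product of the number of options for each hyperparameter, and the number of fits is that count times the number of folds. A short sketch using only the standard library:

```python
import math

param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

# One candidate per combination of values: 5 * 4 * 3 * 3 * 2.
n_candidates = math.prod(len(values) for values in param_grid.values())

# Each candidate is fitted once per cross-validation fold (cv=3 above).
n_folds = 3
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 360 1080
```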

Fitting 3 folds for each of 360 candidates, totalling 1080 fits
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.1s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.2s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.3s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.3s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.2s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.5s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=300; total time=   6.4s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=