# Notebook Overview and Summary

**Objective:**  
This notebook explores player rating predictions using various machine learning models, including **Linear Regression**, **Random Forest**, and **XGBoost**. The goal is to evaluate each model's performance and identify the most accurate predictor for future optimization.

**Key Steps:**
1. **Data Preprocessing:** Cleaned the dataset, handled missing values, and selected relevant features.
2. **Model Training:** Trained and evaluated Linear Regression, Random Forest, and XGBoost models.
3. **Model Evaluation:** Compared the models using metrics like Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared.
4. **Predictions Analysis:** Reviewed the top 10 most accurate and least accurate predictions to understand model behavior.

**Findings:**
- **Random Forest** performed the best with an R-squared of **0.9003** and the lowest MSE and MAE, making it the strongest candidate for further optimization.
- **XGBoost** and **Linear Regression** also performed well, with R-squared values of **0.8804** and **0.8853** respectively, but lagged slightly behind Random Forest in prediction accuracy.
- The most accurate test-set predictions had an absolute error as small as **0.000196**, showing that the best model can match actual ratings very closely for some players.

Next steps involve saving the models and predictions for further optimization, with a focus on **Random Forest** and **XGBoost** for hyperparameter tuning and feature engineering.

### Step 1: Importing Libraries
In this step, we import the essential libraries required for data processing, model training, and evaluation. 
We use:
- `pandas` for data manipulation.
- `sklearn` libraries for model creation, preprocessing, and evaluation.
- `category_encoders` for encoding categorical features.
- `xgboost` for training the XGBoost model.


In [1]:
# Imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from category_encoders import TargetEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

### Step 2: Loading and Preprocessing the Data
In this step, we load the dataset from a `.pkl` file using `pandas`. We then check for missing values to ensure data integrity.
- **File Path:** The data is loaded from `all_players_df.pkl`.
- **Action:** Print the missing values in each column to determine data quality.

The 'age' column contains errors, so we drop it to avoid model issues.
- **Action:** Drop the 'age' column as a quick fix to proceed with model building. (Plan to revisit later for further analysis).

In [2]:
# Load the data and check for missing values

data_path = '../CleanData/all_players_df.pkl'
df = pd.read_pickle(data_path)

print(df.isnull().sum())

player_name                                0
real_name                                  0
team                                       0
age                                        0
rating                                     0
                                          ..
utility_damage_per_round                   0
utility_kills_per_100_rounds               0
utility_flashes_thrown_per_round           0
utility_flash_assists_per_round            0
utility_time_opponent_flashed_per_round    0
Length: 69, dtype: int64


In [3]:
# Drop the 'age' column: its values currently contain errors (to be fixed and revisited later)
df = df.drop('age', axis=1, errors='ignore')

### Step 4: Feature Selection
After initial preprocessing, we select the important features for model training.
- **Columns Kept:** We retain the columns whose relevance was established in our ExploreData notebook.
- **Action:** Print the shape of the DataFrame after selecting the required features to verify the subset size.


In [4]:
# List of columns selected via feature selection
columns_to_keep = [
    # Numeric features
    'kd_ratio', 'firepower_damage_per_round_win', 'kills_per_round',
    'firepower_score', 'impact', 'trading_damage_per_kill', 'kast',
    'entrying_support_rounds', 'utility_time_opponent_flashed_per_round',
    
    # Categorical features
    'team',
    
    # Target variable
    'rating'
]

# Keep selected columns plus player_name and real_name (for reference)
df = df[columns_to_keep + ['player_name', 'real_name']]
print(f"DataFrame shape after selecting columns: {df.shape}")

DataFrame shape after selecting columns: (968, 13)


### Step 5: Splitting the Data
We split the dataset into training and testing sets for model evaluation using `train_test_split` from `sklearn`.
- **Train-Test Split Ratio:** 80% training, 20% testing.
- **Action:** Ensure that the data is correctly split for both model training and evaluation.
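
As a quick sanity check on the 80/20 split, the expected row counts can be worked out by hand. The sketch below assumes that scikit-learn rounds the test count up and assigns the remaining rows to training; the row count of 968 comes from the shape printed after feature selection.

```python
import math

n_samples = 968   # rows reported after feature selection
test_size = 0.2   # value passed to train_test_split

# Assumption: the test count is rounded up and the remainder goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
```

If the shapes printed after preprocessing differ from these counts, rows were probably lost or duplicated somewhere between the split and the encoding steps.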


In [5]:
# Split features and target
X = df.drop(['rating', 'player_name', 'real_name'], axis=1)
y = df['rating']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Save player names for later reference
train_names = df.loc[X_train.index, ['player_name', 'real_name']]
test_names = df.loc[X_test.index, ['player_name', 'real_name']]

In [6]:
# Target Encoding for 'team'
te_team = TargetEncoder(cols=['team'])
X_train_encoded = te_team.fit_transform(X_train, y_train)
X_test_encoded = te_team.transform(X_test)
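
To make the idea behind target encoding concrete, here is a minimal sketch in plain Python. It replaces each team with a smoothed mean of the training ratings for that team and falls back to the overall training mean for teams it has not seen. The smoothing rule, the function names, and the toy ratings are all assumptions for illustration; this is not necessarily the formula that `category_encoders.TargetEncoder` applies internally.

```python
from collections import defaultdict


def fit_target_means(categories, targets, smoothing=10.0):
    """Learn one smoothed target mean per category from training rows only."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Shrink each category mean toward the global mean; small groups shrink more.
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return encoding, global_mean


def apply_target_means(categories, encoding, global_mean):
    """Encode categories; unseen ones fall back to the training global mean."""
    return [encoding.get(cat, global_mean) for cat in categories]


# Hypothetical rows: team names are reused from the notebook, ratings are made up.
train_teams = ["cloud9", "cloud9", "no team", "no team", "no team", "rebels"]
train_ratings = [0.60, 0.50, 0.30, 0.40, 0.35, 0.45]
test_teams = ["cloud9", "entropiq"]  # 'entropiq' is unseen in this toy training set

encoding, global_mean = fit_target_means(train_teams, train_ratings, smoothing=2.0)
encoded_test = apply_target_means(test_teams, encoding, global_mean)
print(encoded_test)
```

The point the sketch illustrates is the same one the cell above relies on: the encoding is learned from the training rows and their ratings only, then applied unchanged to the test rows, so no test-set ratings leak into the features.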

In [7]:
# List of numeric features
numeric_features = [col for col in X_train.columns if col != 'team']

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the numeric features in the training data
X_train_encoded[numeric_features] = scaler.fit_transform(X_train_encoded[numeric_features])

# Transform the numeric features in the test data
X_test_encoded[numeric_features] = scaler.transform(X_test_encoded[numeric_features])

print("Shape of encoded training data:", X_train_encoded.shape)
print("Shape of encoded test data:", X_test_encoded.shape)
print("Feature names:", X_train_encoded.columns.tolist())

Shape of encoded training data: (774, 10)
Shape of encoded test data: (194, 10)
Feature names: ['kd_ratio', 'firepower_damage_per_round_win', 'kills_per_round', 'firepower_score', 'impact', 'trading_damage_per_kill', 'kast', 'entrying_support_rounds', 'utility_time_opponent_flashed_per_round', 'team']
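
The scaling step can be illustrated the same way. The sketch below standardizes one feature using the mean and population standard deviation of the training values, then reuses those two numbers for the test values, which mirrors the fit-on-train, transform-on-test pattern used above. The `kd_ratio` values and helper names are hypothetical.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Return the mean and population standard deviation of the training values."""
    return fmean(values), pstdev(values)


def scale(values, mean, std):
    """Standardize values with statistics learned elsewhere (here: on training data)."""
    return [(v - mean) / std for v in values]


# Hypothetical kd_ratio values, chosen only for illustration.
train_kd = [0.9, 1.0, 1.1, 1.2]
test_kd = [1.05, 1.3]

mean, std = fit_scaler(train_kd)
train_scaled = scale(train_kd, mean, std)
test_scaled = scale(test_kd, mean, std)  # reuses the training mean and std

print("train:", [round(v, 3) for v in train_scaled])
print("test: ", [round(v, 3) for v in test_scaled])
```

After scaling, the training values have mean 0 and standard deviation 1, while the test values generally do not, because they are measured against the training distribution.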


### Step 7: Model Training
We train multiple models to predict the target variable, including:
- **Linear Regression**
- **RandomForestRegressor**
- **XGBRegressor**
- **Action:** Train these models on the training data using the preprocessed features.

In [8]:
# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=42, n_estimators=100),
    'XGBoost': XGBRegressor(random_state=42, n_estimators=100)
}

In [9]:
# Train and evaluate models
results = {}
for name, model in models.items():
    # Train the model
    model.fit(X_train_encoded, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test_encoded)
    
    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    # Store results
    results[name] = {'model': model, 'mse': mse, 'mae': mae, 'r2': r2}
    
    print(f"{name}:")
    print(f"  MSE: {mse:.4f}")
    print(f"  MAE: {mae:.4f}")
    print(f"  R-squared: {r2:.4f}\n")

Linear Regression:
  MSE: 0.0020
  MAE: 0.0374
  R-squared: 0.8853

Random Forest:
  MSE: 0.0018
  MAE: 0.0314
  R-squared: 0.9003

XGBoost:
  MSE: 0.0021
  MAE: 0.0344
  R-squared: 0.8804



In [10]:
# Identify the best model
best_model = max(results, key=lambda x: results[x]['r2'])
print(f"Best model: {best_model}")

Best model: Random Forest


### Model Evaluation

We trained three different models — **Linear Regression**, **Random Forest**, and **XGBoost** — to predict our target variable. The evaluation metrics used to compare model performance are:

- **Mean Squared Error (MSE):** Measures the average of the squared differences between actual and predicted values. A lower MSE is better.
- **Mean Absolute Error (MAE):** Measures the average of the absolute differences between actual and predicted values. A lower MAE indicates more accurate predictions.
- **R-squared (R²):** Represents the proportion of the variance in the dependent variable that is predictable from the independent variables. An R² closer to 1 means better model performance.
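
The three metrics can be computed by hand on a small example to see how they relate. This is a sketch in plain Python with made-up ratings; the notebook itself uses the `sklearn.metrics` functions.

```python
def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 minus the ratio of residual variation to total variation around the mean."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical ratings, chosen only to keep the arithmetic easy to follow.
actual = [0.2, 0.4, 0.6, 0.8]
predicted = [0.25, 0.35, 0.65, 0.75]

print(f"MSE: {mse(actual, predicted):.4f}")
print(f"MAE: {mae(actual, predicted):.4f}")
print(f"R-squared: {r_squared(actual, predicted):.4f}")
```

Every prediction here is off by 0.05, so the MAE is 0.05 and the MSE is 0.05 squared, or 0.0025. Because MSE squares each error, a few large misses raise it much more than they raise MAE, which is why two models can rank differently on the two metrics.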

#### Model Evaluation Results:

1. **Linear Regression**  
   - **MSE:** 0.0020  
   - **MAE:** 0.0374  
   - **R-squared:** 0.8853  

   **Interpretation:**  
   The Linear Regression model performs reasonably well with an R-squared of 0.8853, indicating that it explains about 88.53% of the variance in the target variable. Its MAE (0.0374) is the highest of the three models, while its MSE (0.0020) sits between Random Forest (0.0018) and XGBoost (0.0021).

2. **Random Forest**  
   - **MSE:** 0.0018  
   - **MAE:** 0.0314  
   - **R-squared:** 0.9003  

   **Interpretation:**  
   The Random Forest model performs the best out of the three models, with the lowest MSE (0.0018) and MAE (0.0314), indicating more accurate predictions. The R-squared value of 0.9003 shows that it captures about 90.03% of the variance in the target variable, making it the strongest performer.

3. **XGBoost**  
   - **MSE:** 0.0021  
   - **MAE:** 0.0344  
   - **R-squared:** 0.8804  

   **Interpretation:**  
   The XGBoost model performs similarly to Linear Regression, with an R-squared of 0.8804, the lowest of the three models. Its MAE (0.0344) is second only to Random Forest, but its MSE (0.0021) is the highest. XGBoost still remains a strong model, with low error values overall.

#### Conclusion:
Among the three models, **Random Forest** gives the best performance, having the lowest errors and the highest R-squared value. This indicates that Random Forest is better at capturing the complexity of the data, making it the most accurate predictor in this case. Second place depends on the metric: **Linear Regression** is second on MSE and R-squared, while **XGBoost** is second on MAE.
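
The ordering of the models can be checked metric by metric from the printed values. The sketch below copies the reported numbers into a plain dictionary (it does not retrain anything) and ranks the models on each metric; the variable names are illustrative.

```python
# Metric values copied from the printed results above.
results = {
    "Linear Regression": {"mse": 0.0020, "mae": 0.0374, "r2": 0.8853},
    "Random Forest": {"mse": 0.0018, "mae": 0.0314, "r2": 0.9003},
    "XGBoost": {"mse": 0.0021, "mae": 0.0344, "r2": 0.8804},
}

# Lower is better for MSE and MAE; higher is better for R-squared.
higher_is_better = {"mse": False, "mae": False, "r2": True}

rankings = {
    metric: sorted(results, key=lambda name: results[name][metric], reverse=descending)
    for metric, descending in higher_is_better.items()
}

for metric, order in rankings.items():
    print(f"{metric}: {' > '.join(order)}")
```

Random Forest leads on all three metrics, but the other two models swap places depending on whether MAE or the squared-error metrics are used, so the choice of metric matters when picking a runner-up for tuning.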

### Step 9: Making Predictions
Using the best-performing model, make predictions on the test set.
- **Predictions Output:** Output the predicted values and compare them with the actual values.


In [11]:
# Make predictions using the best model
best_model_name = best_model
best_model_object = results[best_model_name]['model']
y_pred = best_model_object.predict(X_test_encoded)

In [12]:
# Create a DataFrame with predictions and actual values
predictions_df = pd.DataFrame({
    'player_name': test_names['player_name'],
    'real_name': test_names['real_name'],
    'team': X_test['team'],
    'actual_rating': y_test,
    'predicted_rating': y_pred
})

In [13]:
# Calculate the difference between predicted and actual ratings
predictions_df['rating_difference'] = predictions_df['predicted_rating'] - predictions_df['actual_rating']

# Sort by the absolute difference to see the best and worst predictions
predictions_df['abs_difference'] = abs(predictions_df['rating_difference'])
predictions_df_sorted = predictions_df.sort_values('abs_difference')

print("Top 10 most accurate predictions:")
print(predictions_df_sorted.head(10))

print("\nTop 10 least accurate predictions:")
print(predictions_df_sorted.tail(10))

Top 10 most accurate predictions:
    player_name           real_name        team  actual_rating  \
876       AdreN  Dauren Kystaubayev     no team       0.333333   
923     SEMINTE      Valentin Bodea     no team       0.274510   
626      f0rest     Patrik Lindberg     no team       0.568627   
76       regali       Iulian Harjău    entropiq       0.666667   
450        eraa       Sean Knutsson  cph wolves       0.509804   
260      interz    Timofey Yakushin      cloud9       0.352941   
892    oskarish  Oskar Stenborowski     no team       0.313725   
231        asap      Tyson Paterson     rooster       0.647059   
755       tarik         Tarik Celik     no team       0.431373   
535    innocent         Paweł Mocek      rebels       0.352941   

     predicted_rating  rating_difference  abs_difference  
876          0.333137          -0.000196        0.000196  
923          0.274314          -0.000196        0.000196  
626          0.568039          -0.000588        0.000588  
76 

### Prediction Outcomes

After training and evaluating our models, we generated predictions for the test set. Below is an analysis of the top 10 most accurate predictions based on the absolute difference between the actual and predicted ratings.

#### Top 10 Most Accurate Predictions:

| player_name | real_name            | team       | actual_rating | predicted_rating | rating_difference | abs_difference |
|-------------|----------------------|------------|---------------|------------------|-------------------|----------------|
| AdreN       | Dauren Kystaubayev    | no team    | 0.333333      | 0.333137         | -0.000196         | 0.000196       |
| SEMINTE     | Valentin Bodea        | no team    | 0.274510      | 0.274314         | -0.000196         | 0.000196       |
| f0rest      | Patrik Lindberg       | no team    | 0.568627      | 0.568039         | -0.000588         | 0.000588       |
| regali      | Iulian Harjău         | entropiq   | 0.666667      | 0.667255         |  0.000588         | 0.000588       |
| eraa        | Sean Knutsson         | cph wolves | 0.509804      | 0.508824         | -0.000980         | 0.000980       |
| interz      | Timofey Yakushin      | cloud9     | 0.352941      | 0.351569         | -0.001373         | 0.001373       |
| oskarish    | Oskar Stenborowski    | no team    | 0.313725      | 0.312157         | -0.001569         | 0.001569       |
| asap        | Tyson Paterson        | rooster    | 0.647059      | 0.649216         |  0.002157         | 0.002157       |
| tarik       | Tarik Celik           | no team    | 0.431373      | 0.429216         | -0.002157         | 0.002157       |
| innocent    | Paweł Mocek           | rebels     | 0.352941      | 0.355490         |  0.002549         | 0.002549       |

#### Key Insights:

- **High Accuracy in Predictions:**
  The ten best predictions have very small absolute differences between the actual and predicted ratings, the smallest being 0.000196. Because these rows are selected for having the lowest error, they show how close the model can get for individual players rather than how well it generalizes; the test-set MAE of 0.0314 is the better guide to typical error.

- **Notable Players and Teams:**
  Some well-known players such as **AdreN**, **f0rest**, and **tarik** feature in the top predictions. This may reflect more consistent data for experienced or prominent players, though ten rows are too few to confirm it.

- **Consistency Across Teams:**
  The accurate predictions span various teams, including established teams like **cloud9** and less formal groups like **rebels**, as well as our solo players denoted as **no team**. This suggests that the model's best predictions are not limited to specific teams, although confirming that it is unbiased would require comparing errors across teams on the full test set.

#### Conclusion:

The results demonstrate the model's ability to accurately predict player ratings for a variety of players. The low rating differences show that the model has captured important patterns in the data, even before any hyperparameter tuning. While some predictions are almost perfect, further tuning and optimization may help reduce the error in less accurate predictions.
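
One way to look beyond individual rows is to group absolute errors by team. The sketch below does this in plain Python using only the ten rows from the table above, so its output says nothing about the full test set; in the notebook the same grouping would be applied to every row of `predictions_df`. The helper name is hypothetical.

```python
from collections import defaultdict

# (team, abs_difference) pairs copied from the top-10 table above.
rows = [
    ("no team", 0.000196), ("no team", 0.000196), ("no team", 0.000588),
    ("entropiq", 0.000588), ("cph wolves", 0.000980), ("cloud9", 0.001373),
    ("no team", 0.001569), ("rooster", 0.002157), ("no team", 0.002157),
    ("rebels", 0.002549),
]


def mean_abs_error_by_group(pairs):
    """Return {group: (mean absolute error, row count)} for (group, error) pairs."""
    totals, counts = defaultdict(float), defaultdict(int)
    for group, err in pairs:
        totals[group] += err
        counts[group] += 1
    return {group: (totals[group] / counts[group], counts[group]) for group in totals}


summary = mean_abs_error_by_group(rows)
for team, (mean_err, n) in sorted(summary.items(), key=lambda item: item[1][0]):
    print(f"{team:<12} n={n}  mean_abs_error={mean_err:.6f}")
```

Reporting the row count next to each mean matters here: half of the ten rows belong to **no team**, and a per-team mean built from a single row is not a reliable estimate.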


### Step 10: Saving Models and Preprocessed Data

In this step, we will save the preprocessed data, trained models, and other important components necessary for future use. By saving these objects, we ensure that the models and data can be easily reloaded and reused for predictions or further analysis without having to retrain the models or preprocess the data again. This will make the deployment or evaluation of the models more efficient.

The following items will be saved:

1. **Preprocessed Data**: 
   - We will save the preprocessed training and test data (`X_train_encoded`, `X_test_encoded`) as well as the target variables (`y_train`, `y_test`) using `joblib`. These can be reloaded to evaluate model performance or for use in additional models.

2. **Trained Models**:
   - The trained **Random Forest** and **XGBoost** models will be saved. These models can be reloaded later for making predictions or further fine-tuning.

3. **Feature Names**:
   - We will save the names of the features used in the models, which will help in future interpretation of the model predictions and understanding the input structure.

4. **TargetEncoder for 'team'**:
   - The `TargetEncoder` used to transform the `team` categorical variable will be saved. This ensures consistency in encoding when new data is used.

5. **StandardScaler**:
   - The `StandardScaler` used to standardize the data will be saved, ensuring that any new data can be scaled in the same way as the training data.

6. **Predictions**:
   - The predictions made by the best model (Random Forest) on the test data will be saved as a CSV file for further analysis or evaluation.

By saving these components, we preserve the entire workflow and ensure that the models, data, and transformations can be easily restored and applied in the future.

In [1]:
# Imports for saving are placed here so the whole notebook does not have to be re-run just to save.
import joblib
import pickle

In [None]:
#Save the preprocessed data
joblib.dump(X_train_encoded, 'X_train_encoded.joblib')
joblib.dump(X_test_encoded, 'X_test_encoded.joblib')
joblib.dump(y_train, 'y_train.joblib')
joblib.dump(y_test, 'y_test.joblib')

In [None]:
#Save the Random Forest model
joblib.dump(results['Random Forest']['model'], 'random_forest_model.joblib')

In [None]:
# Save the XGBoost model
joblib.dump(results['XGBoost']['model'], 'xgboost_model.joblib')

In [None]:
# Save the feature names
with open('feature_names.pkl', 'wb') as f:
    pickle.dump(X_train_encoded.columns.tolist(), f)

In [None]:
# Save the TargetEncoder for 'team'
joblib.dump(te_team, 'team_target_encoder.joblib')

In [None]:
# Save the StandardScaler
joblib.dump(scaler, 'standard_scaler.joblib')

In [None]:
# Save the predictions DataFrame
predictions_df.to_csv('predictions.csv', index=False)

In [None]:
print("All necessary data and models have been saved.")
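
When the saved artifacts are reloaded later, it is worth confirming that incoming data has the same columns in the same order as the training data. The sketch below shows a save-and-reload round trip for the feature names using `pickle` and a temporary directory, followed by a simple column check. The `check_columns` helper is hypothetical and not part of the notebook; the feature names are the ones printed after preprocessing.

```python
import pickle
import tempfile
from pathlib import Path

# Feature names as printed after encoding and scaling.
feature_names = [
    "kd_ratio", "firepower_damage_per_round_win", "kills_per_round",
    "firepower_score", "impact", "trading_damage_per_kill", "kast",
    "entrying_support_rounds", "utility_time_opponent_flashed_per_round", "team",
]

# Round trip through a temporary file so the sketch leaves nothing on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "feature_names.pkl"
    with open(path, "wb") as f:
        pickle.dump(feature_names, f)
    with open(path, "rb") as f:
        loaded_names = pickle.load(f)


def check_columns(new_columns, expected):
    """Raise if incoming data does not match the training column order."""
    if list(new_columns) != list(expected):
        raise ValueError(f"Column mismatch: expected {expected}, got {list(new_columns)}")


check_columns(feature_names, loaded_names)
print(f"Reloaded {len(loaded_names)} feature names; column order matches.")
```

The same order applies to the saved transformers: new data would be passed through the reloaded `TargetEncoder` and `StandardScaler` before being given to a reloaded model.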