## Introduction

This Jupyter Notebook was created for the EML2025S project task.

**Author:** Subair Kirimow  
**Matriculation number:** 12321260

The notebook was executed using Visual Studio Code and Python version 3.13.2.

## Import important libraries

If the packages are not installed yet, uncomment the code in the cell below and run it; otherwise skip this step and import the libraries directly.

In [1]:
# %pip install numpy
# %pip install pandas
# %pip install scikit-learn

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

## Load and Preprocess Data

In [3]:
# Open the CSV files
test_df = pd.read_csv("rocketskillshots_test.csv")
train_df = pd.read_csv("rocketskillshots_train.csv")

# Only keep the summary rows (where 'window_id' is NaN) as the plan is to predict one label per trick shot (id), not per row.
test_summary_df = test_df[test_df['window_id'].isna()].copy()
train_summary_df = train_df[train_df['window_id'].isna()].copy()

# Drop the 'window_id' column, which is NaN in every summary row.
# inplace=True is safe here because both frames are explicit copies (see .copy() above).
test_summary_df.drop(columns=['window_id'], inplace=True)
train_summary_df.drop(columns=['window_id'], inplace=True)

metric_cols = ['BallAcceleration', 'DistanceWall', 'DistanceCeil', 'DistanceBall',
               'PlayerSpeed', 'BallSpeed',
               'BallAcceleration_skew', 'DistanceWall_skew', 'DistanceCeil_skew',
               'DistanceBall_skew', 'PlayerSpeed_skew', 'BallSpeed_skew']

input_cols = ['up', 'accelerate', 'slow', 'goal', 'left', 'boost', 'camera',
              'down', 'right', 'slide', 'jump',
              'up_skew', 'accelerate_skew', 'slow_skew', 'goal_skew',
              'left_skew', 'boost_skew', 'camera_skew', 'down_skew',
              'right_skew', 'slide_skew', 'jump_skew']

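# Fill missing values: column means for the continuous metric features,
# column medians for the input-control features.
# Each dataset is filled with its own statistics.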
test_summary_df[metric_cols] = test_summary_df[metric_cols].fillna(test_summary_df[metric_cols].mean())
test_summary_df[input_cols] = test_summary_df[input_cols].fillna(test_summary_df[input_cols].median())
train_summary_df[metric_cols] = train_summary_df[metric_cols].fillna(train_summary_df[metric_cols].mean())
train_summary_df[input_cols] = train_summary_df[input_cols].fillna(train_summary_df[input_cols].median())
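
The cell above fills each dataset with its own means and medians. A common alternative is to compute the fill values on the training data only and reuse them for the test data, so both sets are transformed identically. The following is a minimal sketch of that idea, not the method used in this notebook: the column names come from the notebook, while the toy values and the helper `fill_with_train_stats` are invented for illustration.

```python
import numpy as np
import pandas as pd

def fill_with_train_stats(train, test, mean_cols, median_cols):
    """Fill NaNs in both frames using statistics computed on `train` only."""
    # One Series of fill values, indexed by column name
    fill_values = pd.concat([train[mean_cols].mean(), train[median_cols].median()])
    return train.fillna(fill_values), test.fillna(fill_values)

# Toy stand-ins for the summary DataFrames (values are invented)
toy_train = pd.DataFrame({"BallSpeed": [100.0, 200.0, np.nan], "jump": [0.0, 1.0, 1.0]})
toy_test = pd.DataFrame({"BallSpeed": [np.nan, 50.0], "jump": [np.nan, 0.0]})

toy_train_filled, toy_test_filled = fill_with_train_stats(
    toy_train, toy_test, mean_cols=["BallSpeed"], median_cols=["jump"]
)
print(toy_test_filled)
```

Here the missing test values are replaced by the training mean of `BallSpeed` and the training median of `jump`.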

In [4]:
print(test_summary_df)
print("Missing: ", test_summary_df.isna().sum())

       id  BallAcceleration      Time   DistanceWall  DistanceCeil  \
0       1       -440.381900  2.639636    1615.840000       1038.69   
40      3          0.000000  1.631764     879.270000        327.42   
64      5        983.352618  0.868096    3216.740000       2010.87   
79      6      -1233.006564  2.553048     572.855000       1518.29   
118     7       -264.000000  0.869751    2983.800000       1956.78   
...   ...               ...       ...            ...           ...   
2914  286          0.000000  4.318480    2977.030000       1989.57   
2948  288          0.000000  2.017950    3713.810000       2008.64   
2970  289      -5135.257386  3.079645  111513.797867       1147.40   
3007  291        -97.513353  4.072630    3428.400000       2012.99   
3041  293          0.000000  1.008900    2499.890000       2012.99   

      DistanceBall    PlayerSpeed      BallSpeed   up  accelerate  ...  \
0       687.146769  139614.040931  109317.500731  0.0         0.0  ...   
40      885

In [5]:
print(train_summary_df)
print("Missing: ", train_summary_df.isna().sum())

       id  BallAcceleration      Time   DistanceWall  DistanceCeil  \
0       0          0.000000  2.205022    3817.380000      2013.000   
33      2       -198.638061  2.326575    1461.490000       974.430   
68      4        406.177646  1.232506    3812.555000      2012.375   
91     11      -1394.811364  1.548218     234.460000       940.685   
114    13          0.000000  2.087500    3375.140000      1997.190   
...   ...               ...       ...            ...           ...   
4043  292       -187.744121  1.827082   42939.720947      3793.790   
4070  294      -2624.775521  1.652519  168174.413428      1360.935   
4091  295          0.000000  0.765359    2547.940000      2012.990   
4103  296          0.000000  1.322000       0.000000      1814.880   
4121  297          0.000000  0.943604    2479.440000      1963.470   

      DistanceBall    PlayerSpeed      BallSpeed   up  accelerate  ...  \
0       861.194407  150959.239888  145648.061660  0.0         0.0  ...   
33      930

## Train-Validation Split

In [6]:
X, y = train_summary_df.drop(columns=['id', 'label']), train_summary_df['label']

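# Hold out 20% of the training data as a validation set (named X_test / y_test here).
# The competition test set (test_summary_df) is kept separate and only used for the final prediction.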
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=29)
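
With only a few samples per class, a purely random split can leave some classes with very few validation samples. One option is the `stratify` argument of `train_test_split`, which keeps the class proportions roughly equal in both parts. A minimal sketch on invented toy data (the arrays `X_toy` and `y_toy` are not part of the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 40 samples with imbalanced labels (invented for illustration)
rng = np.random.default_rng(29)
X_toy = rng.normal(size=(40, 3))
y_toy = np.array([0] * 20 + [1] * 10 + [2] * 10)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=29, stratify=y_toy
)

# The validation set keeps the 50/25/25 class shares of the full data
print(np.bincount(y_val))
```

Stratification requires at least two samples per class, so it may not be applicable if a class is extremely rare.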

## Model 1: Basic Decision Tree

In [7]:
tree = DecisionTreeClassifier(random_state=29)
tree.fit(X_train, y_train)

y_pred_tree = tree.predict(X_test)

In [8]:
# Evaluate the model
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_tree))
print("Classification Report:\n", classification_report(y_test, y_pred_tree))
print("Decision Tree Confusion Matrix:\n", confusion_matrix(y_test, y_pred_tree))

Decision Tree Accuracy: 0.6111111111111112
Classification Report:
               precision    recall  f1-score   support

          -1       0.40      0.25      0.31         8
           1       0.50      0.50      0.50         2
           2       0.90      0.90      0.90        10
           3       0.80      0.67      0.73         6
           5       0.50      0.67      0.57         3
           6       0.57      1.00      0.73         4
           7       0.00      0.00      0.00         3

    accuracy                           0.61        36
   macro avg       0.52      0.57      0.53        36
weighted avg       0.61      0.61      0.60        36

Decision Tree Confusion Matrix:
 [[2 1 1 1 2 0 1]
 [1 1 0 0 0 0 0]
 [0 0 9 0 0 0 1]
 [2 0 0 4 0 0 0]
 [0 0 0 0 2 0 1]
 [0 0 0 0 0 4 0]
 [0 0 0 0 0 3 0]]
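
The raw confusion matrix has no row or column labels, which makes it hard to see which class is confused with which. One way to make it readable is to wrap it in a labelled `DataFrame`. A minimal sketch with invented toy labels; in the notebook, `y_test` and `y_pred_tree` would take the place of `y_true_toy` and `y_pred_toy`:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy labels (invented for illustration)
y_true_toy = np.array([-1, -1, 2, 2, 6, 7, 7])
y_pred_toy = np.array([-1, 2, 2, 2, 6, 6, 6])

# Fix the label order explicitly so rows and columns are unambiguous
labels = np.unique(np.concatenate([y_true_toy, y_pred_toy]))
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)

cm_df = pd.DataFrame(
    cm,
    index=[f"true {l}" for l in labels],
    columns=[f"pred {l}" for l in labels],
)
print(cm_df)
```

Rows are the true classes and columns the predicted classes, matching the convention of `confusion_matrix`.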


## Model 2: Random Forest with Hyperparameter Tuning

In [9]:
rfc = RandomForestClassifier(random_state=29)

param_grid = {
    'max_depth': [5, 10, 15, 20],
    'n_estimators': [50, 100, 150, 200, 300],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

grid_search = GridSearchCV(rfc, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

In [10]:
print("Best parameters:", grid_search.best_params_)

Best parameters: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}
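
`best_params_` only shows the winning combination. To see how close the other candidates were, the `cv_results_` attribute of a fitted `GridSearchCV` can be loaded into a `DataFrame`. A minimal sketch on synthetic data with a deliberately small grid so that it runs quickly; `X_toy`, `y_toy`, and `small_grid` are illustrative and not the notebook's data or grid:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for X_train / y_train
X_toy, y_toy = make_classification(
    n_samples=120, n_features=8, n_informative=4, random_state=29
)

# Small illustrative grid (the notebook's grid is larger)
small_grid = {"max_depth": [3, 5], "n_estimators": [20, 40]}
search = GridSearchCV(
    RandomForestClassifier(random_state=29), small_grid, cv=5, scoring="accuracy"
)
search.fit(X_toy, y_toy)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
]
print(top.head())
```

If several combinations score within one standard deviation of each other, the simpler one (fewer or shallower trees) is often a reasonable choice.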


In [11]:
y_pred_rf = grid_search.best_estimator_.predict(X_test)

# Evaluate the model
print("Accuracy: ", accuracy_score(y_test, y_pred_rf))
# zero_division=0 reports precision as 0 (and suppresses the warning) for classes
# the model never predicts, which is the case for class 7 here.
print("Classification Report:\n", classification_report(y_test, y_pred_rf, zero_division=0))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))

Accuracy:  0.6666666666666666
Classification Report:
               precision    recall  f1-score   support

          -1       1.00      0.12      0.22         8
           1       1.00      1.00      1.00         2
           2       0.82      0.90      0.86        10
           3       0.86      1.00      0.92         6
           5       0.50      0.67      0.57         3
           6       0.36      1.00      0.53         4
           7       0.00      0.00      0.00         3

    accuracy                           0.67        36
   macro avg       0.65      0.67      0.59        36
weighted avg       0.73      0.67      0.60        36

Confusion Matrix:
 [[1 0 1 1 2 3 0]
 [0 2 0 0 0 0 0]
 [0 0 9 0 0 1 0]
 [0 0 0 6 0 0 0]
 [0 0 1 0 2 0 0]
 [0 0 0 0 0 4 0]
 [0 0 0 0 0 3 0]]
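
The numbers in the classification report can be recomputed directly from the confusion matrix: recall is the diagonal divided by the row sums, precision is the diagonal divided by the column sums, and accuracy is the diagonal sum divided by the total. The sketch below does this for the random forest matrix printed above, assuming the rows and columns follow the sorted label order shown in the report:

```python
import numpy as np

# Confusion matrix copied from the random forest output above
# (rows = true class, columns = predicted class; label order assumed sorted)
labels = [-1, 1, 2, 3, 5, 6, 7]
cm = np.array([
    [1, 0, 1, 1, 2, 3, 0],
    [0, 2, 0, 0, 0, 0, 0],
    [0, 0, 9, 0, 0, 1, 0],
    [0, 0, 0, 6, 0, 0, 0],
    [0, 0, 1, 0, 2, 0, 0],
    [0, 0, 0, 0, 0, 4, 0],
    [0, 0, 0, 0, 0, 3, 0],
])

true_pos = np.diag(cm)
recall = true_pos / cm.sum(axis=1)

# Class 7 is never predicted, so its column sum is 0; use 0 instead of dividing by zero
col_sums = cm.sum(axis=0)
precision = np.divide(
    true_pos, col_sums, out=np.zeros(len(labels)), where=col_sums > 0
)
accuracy = true_pos.sum() / cm.sum()

for label, p, r in zip(labels, precision, recall):
    print(f"{label:>3}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The matrix also shows where the errors concentrate: 11 samples are predicted as class 6, but only 4 of them belong to it, and all three class 7 samples are among the rest.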


## Cross-Validation of Best Model

In [12]:
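# Evaluate the tuned Random Forest with 5-fold cross-validation on the training split.
# Note: the grid search tuned on these same samples, so the scores are likely optimistic.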
cv_scores = cross_val_score(grid_search.best_estimator_, X_train, y_train, cv=5, scoring='accuracy')

print("Random Forest Cross-Validation Scores:", cv_scores)
print("Mean CV Accuracy:", cv_scores.mean())

Random Forest Cross-Validation Scores: [0.75862069 0.72413793 0.89285714 0.82142857 0.78571429]
Mean CV Accuracy: 0.796551724137931
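
Only the random forest is cross-validated above. For a like-for-like comparison, both models can be scored on the same folds. A minimal sketch on synthetic data, assuming a shared `StratifiedKFold` splitter; the data, the `models` dictionary, and the forest settings are illustrative rather than the notebook's tuned configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for X_train / y_train
X_toy, y_toy = make_classification(
    n_samples=150, n_features=10, n_informative=5, random_state=29
)

# One shared splitter so both models are scored on identical folds
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=29)

models = {
    "decision tree": DecisionTreeClassifier(random_state=29),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=29),
}

cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=folds, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Reporting the standard deviation alongside the mean helps judge whether a difference between the two models is larger than the fold-to-fold variation.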


## Retrain on All Training Data & Predict Test Set

In [13]:
best_rfc = grid_search.best_estimator_
best_rfc.fit(X, y)

In [14]:
predictions = best_rfc.predict(test_summary_df.drop(columns=['id']))

In [15]:
sample_submission = pd.read_csv("sample_submission.csv")
submission = sample_submission.copy()
submission['label'] = predictions

# Save final submission
submission.to_csv("final_submission.csv", index=False)
print("Submission saved to final_submission.csv")

Submission saved to final_submission.csv
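
The cell above assigns the predictions to the sample submission by row position, which is only correct if both files list the trick shots in the same order. A more defensive variant maps predictions by `id`. The sketch below uses invented toy data and assumes the sample submission has an `id` column, which the notebook does not show:

```python
import numpy as np
import pandas as pd

# Toy stand-ins (invented): test ids in file order and one prediction per id
toy_test_ids = pd.Series([1, 3, 5, 6], name="id")
toy_predictions = np.array([2, -1, 6, 3])

# Hypothetical sample submission listing the same ids in a different order
toy_sample = pd.DataFrame({"id": [3, 1, 6, 5], "label": [0, 0, 0, 0]})

# Map predictions by id rather than by row position
pred_by_id = pd.Series(toy_predictions, index=toy_test_ids.to_numpy())
toy_submission = toy_sample.copy()
toy_submission["label"] = toy_submission["id"].map(pred_by_id)

# Fail loudly if any id in the submission has no prediction
assert toy_submission["label"].notna().all(), "some ids have no prediction"
print(toy_submission)
```

If the two files are already in the same order, this produces the same result as the positional assignment.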


## Model Selection and Data Processing Summary

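In this project, two models were implemented: a basic decision tree and a random forest with hyperparameter tuning. On the held-out validation split (36 samples), the random forest reached an accuracy of 0.67 versus 0.61 for the decision tree, and a macro-averaged F1 score of 0.59 versus 0.53. Only the random forest was additionally cross-validated, with a mean 5-fold accuracy of 0.80 on the training split.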

Data was preprocessed by:
- Filtering summary rows (since the prediction is per trick shot, not per row).
- Dropping irrelevant columns (e.g., `window_id`).
- Filling missing values: column means for the continuous metric features and column medians for the input-control features, computed separately for the training and test data.
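
The first two steps can be illustrated on a small example. This is a sketch with an invented toy frame that mimics the layout of the CSV files, where per-window rows carry a `window_id` and the summary row of each trick shot has `NaN` there:

```python
import numpy as np
import pandas as pd

# Toy frame (invented): one summary row per id, followed by its window rows
toy = pd.DataFrame({
    "id": [0, 0, 0, 2, 2],
    "window_id": [np.nan, 0.0, 1.0, np.nan, 0.0],
    "BallSpeed": [120.0, 110.0, 130.0, 90.0, 90.0],
})

# Keep only the summary rows, then drop the now-empty window_id column
toy_summary = toy[toy["window_id"].isna()].drop(columns=["window_id"])

print(toy_summary)
```

After filtering, each `id` appears exactly once, so the model predicts one label per trick shot.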

The final model selected for submission was a **Random Forest** with the parameters found by the grid search:
- `max_depth=10`
- `n_estimators=200`
- `min_samples_split=5`
- `min_samples_leaf=1`
- `max_features` left at the scikit-learn default (it was not part of the grid)

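This model was chosen because it achieved higher accuracy than the decision tree on the validation split. Its mean cross-validation accuracy (0.80) is noticeably higher than its validation accuracy (0.67), and neither model classified any class 7 sample correctly, so the reported scores should be read with caution.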