### FIRST SIMULATION DATASET ANALYSIS

In this Jupyter Notebook we will analyze the first dataset provided by Dr. Langner.
The columns in the CSV file are:
 - n -- number of the particle
 - size -- diameter (in meters); currently equal to 1.0 for all objects
 - x,y,z -- initial position (meters) in Dimorphos rotating reference frame (IAU_Dimorphos)
 - vx,vy,vz -- initial velocities (meters/second) in Dimorphos rotating reference frame (IAU_Dimorphos)
 - t0 -- initial time; in this case it is a dummy column with 0.0 for all objects
 - st -- status: 0 -- survived; 1 -- escaped; 2 -- collision with Didymos; 3 -- collision with Dimorphos
 - time_end -- time (days) when the particle is removed from the simulation (impact or escape time, or 3000 for survivors). Note: for performance reasons the current version of the code checks for escapes only at the end of each integration period, in this case 1 day, so for escaped objects time_end is rounded up to a full day, but if you need better precision I can change it.
 - xfinal,yfinal,zfinal,vxfinal,vyfinal,vzfinal -- final position and velocity of the object; the contents of these columns depend on the status. If st=0 (survived) it is the final position and velocity in the barycentric (non-rotating) reference frame. If st=1 (escaped) the values are 0.0. If st=2 it is the impact position and velocity in the Didymos rotating reference frame. If st=3 it is the impact position and velocity in the Dimorphos rotating reference frame.
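
As a small illustration of the status column described above, the sketch below maps the numeric `st` codes to readable labels and counts them. The names `STATUS_LABELS` and `summarize_status` and the five-row frame are hypothetical and only serve the example; the real data comes from the CSV file.

```python
import pandas as pd

# Status codes as listed in the column description above.
STATUS_LABELS = {
    0: "survived",
    1: "escaped",
    2: "collision with Didymos",
    3: "collision with Dimorphos",
}


def summarize_status(frame: pd.DataFrame) -> pd.Series:
    """Count particles per outcome, using readable labels instead of codes."""
    return frame["st"].map(STATUS_LABELS).value_counts()


# Hypothetical five-row frame, only to show the call.
demo = pd.DataFrame({"st": [1, 1, 3, 1, 3]})
status_counts = summarize_status(demo)
print(status_counts)
```

The same function can be applied to the full DataFrame once it is loaded, which gives a quick view of how unbalanced the four classes are.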

An object is considered to have escaped when it reaches a distance of 5*r_hill, which is ~300 km. This distance is chosen to be simple and to give a very large margin for objects that are temporarily on an escape trajectory to return to the system. If I choose a different definition of what we consider an escaped object, the times of escape will be different.
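
The escape rule and the once-per-day check can be sketched as follows. This is not the simulation code: the value of `R_HILL_M` is only back-computed from the ~300 km figure, the distance is assumed to be measured from the origin of the coordinates passed in, and both function names are hypothetical.

```python
import math

R_HILL_M = 60_000.0            # assumed: back-computed from 5 * r_hill ~ 300 km
ESCAPE_DISTANCE_M = 5 * R_HILL_M
CHECK_INTERVAL_DAYS = 1.0      # escapes are checked once per integration period


def has_escaped(x: float, y: float, z: float) -> bool:
    """True when the distance from the origin reaches the escape threshold."""
    return math.sqrt(x * x + y * y + z * z) >= ESCAPE_DISTANCE_M


def recorded_escape_time(crossing_time_days: float) -> float:
    """Round the crossing time up to the next check, as done for time_end."""
    n_checks = math.ceil(crossing_time_days / CHECK_INTERVAL_DAYS)
    return n_checks * CHECK_INTERVAL_DAYS


inside = has_escaped(250_000.0, 0.0, 0.0)    # 250 km: still inside the threshold
outside = has_escaped(0.0, 310_000.0, 0.0)   # 310 km: beyond the threshold
t_end = recorded_escape_time(81.3)           # rounded up to a full day
print(inside, outside, t_end)
```

Changing `ESCAPE_DISTANCE_M` or `CHECK_INTERVAL_DAYS` in this sketch shifts the recorded escape times, which is the dependence on the definition mentioned above.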

In this notebook we will concentrate on creating a random forest and testing it against the linear models.

In [1]:
import pandas as pd
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Ignore all FutureWarning warnings
warnings.filterwarnings(action='ignore', category=FutureWarning)

dataset_path = '../data/raw/1st_simulation_Langner.csv'

try:
    df = pd.read_csv(dataset_path)
    print("Dataset uploaded succesfully")
except FileNotFoundError:
    print(f"Error: file not found for path: {dataset_path}")
    raise  # stop here: the cells below need df
except Exception as e:
    print(f"An error occurred: {e}")
    raise  # stop here: the cells below need df

# Check that the DataFrame has been loaded successfully
print(df.head())

# Define features (X) and labels/target (y)
features = ['x', 'y', 'z', 'vx', 'vy', 'vz']
target = 'st'

X = df[features]
y = df[target]

# Split the data into training and test sets
"""
test_size: fraction of the data placed in the test set; the rest forms the training set.
stratify = y: keeps the class proportions the same in both sets.
random_state = n: fixed seed so that the experiment can be repeated.
"""

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                    random_state = 42, stratify = y)

print(f'Training Set dimension: {X_train.shape[0]}')
print(f'Test Set Dimension: {X_test.shape[0]}')

# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler on the training set only (per-feature mean and standard deviation) and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Apply the parameters learned on the training set to the test set as well
X_test_scaled = scaler.transform(X_test)

Dataset uploaded succesfully
   n  size          x          y          z        vx        vy        vz  \
0  1   1.0  -4.263844 -83.024745 -10.340867  0.033040 -0.088055 -0.033979   
1  2   1.0   0.933023 -83.051149 -10.585260  0.053433 -0.079447 -0.028861   
2  3   1.0  -9.297767 -81.321275 -14.435677 -0.012998 -0.065771 -0.074197   
3  4   1.0 -10.952091 -81.127219 -14.462322 -0.012467 -0.094318 -0.030801   
4  5   1.0 -15.406269 -82.785294  -5.903712 -0.033877 -0.093496 -0.010530   

    t0  st    time_end     xfinal     yfinal     zfinal   vxfinal   vyfinal  \
0  0.0   1   82.000000   0.000000   0.000000   0.000000  0.000000  0.000000   
1  0.0   1   80.000000   0.000000   0.000000   0.000000  0.000000  0.000000   
2  0.0   3    4.442850 -16.523756  78.020506  19.320183  0.014316 -0.059958   
3  0.0   1  237.000000   0.000000   0.000000   0.000000  0.000000  0.000000   
4  0.0   3  159.380358 -21.288058  65.137333 -33.949623  0.174760  0.003633   

    vzfinal  
0  0.000000  
1  0.

## Moving to a Non-Linear Model: Random Forest
After proving that linear models are insufficient, we now move to a Random Forest. This is an ensemble learning method that operates by constructing a multitude of decision trees at training time.

A Random Forest works in two key steps:

- Training (Bagging): It builds hundreds of individual decision trees. Each tree is trained on a random sample of the training data drawn with replacement (a "bootstrap" sample). Furthermore, at each split in a tree, it only considers a random subset of the features. This process keeps the trees different from one another and only weakly correlated.

- Prediction (Voting): To classify a new particle, every single tree in the forest makes its own prediction (a "vote"). The Random Forest model then outputs the class that received the majority of the votes.
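
The two steps above can be sketched by hand with plain decision trees. This is a minimal illustration on synthetic data, not the notebook's model: the helper names `fit_bagged_trees` and `predict_by_vote` are hypothetical, and scikit-learn's own `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Synthetic stand-in data: 6 features, 3 classes (not the simulation dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, n_classes=3,
                                     random_state=42)


def fit_bagged_trees(X, y, n_trees=25):
    """Bagging: each tree sees a bootstrap sample and random feature subsets."""
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(0, 2**31 - 1)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_by_vote(trees, X):
    """Voting: every tree predicts, the most frequent class wins."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)  # (n_trees, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])


trees = fit_bagged_trees(X_demo, y_demo)
y_vote = predict_by_vote(trees, X_demo)
train_accuracy = float((y_vote == y_demo).mean())
print(f"Training accuracy of the hand-made forest: {train_accuracy:.2f}")
```

The accuracy printed here is measured on the training data and only checks that the mechanics work; it says nothing about generalization.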

This approach makes the model robust. By averaging the predictions of many diverse trees, it reduces the "overfitting" that a single decision tree might suffer from. Most importantly for our problem, a Random Forest is highly non-linear; it does not try to find a straight line but can create complex, high-dimensional "decision boundaries" to separate the classes.
We will try the RF model on the standard training set and all the possible resampled ones, starting with the standard one.

#### Standard Random Forest


In [9]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from collections import Counter

# Initialize Random Forest
n_estimators = 2000  # Number of decision trees

rf_model_standard = RandomForestClassifier(n_estimators=n_estimators,
                                           random_state=42,
                                           n_jobs=-1)


# Train the model
rf_model_standard.fit(X_train_scaled, y_train)
print("Training Completed.")

#Use the model to predict the test set
y_pred_rf_standard = rf_model_standard.predict(X_test_scaled)

#Print the classification report
print(f"Report Standard Random Forest with {n_estimators} trees")
print(classification_report(y_test, y_pred_rf_standard, zero_division=0))


Training Completed.
Report Standard Random Forest with 2000 trees
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.30      0.19      0.23        42
           2       0.38      0.32      0.35        57
           3       0.56      0.71      0.62        99

    accuracy                           0.48       200
   macro avg       0.31      0.30      0.30       200
weighted avg       0.45      0.48      0.46       200



Report Standard Logistic Regression

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.24      0.10      0.14        42
           2       0.67      0.04      0.07        57
           3       0.52      0.94      0.67        99

    accuracy                           0.49       200
   macro avg       0.35      0.27      0.22       200
weighted avg       0.50      0.49      0.38       200
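
Both reports list a macro and a weighted average. As a rough check of how they are built, the sketch below recomputes the two F1 averages from the rounded per-class values of the Random Forest report above; the variable names are only illustrative.

```python
# Per-class F1 scores and supports copied from the Random Forest report above.
f1_per_class = {0: 0.00, 1: 0.23, 2: 0.35, 3: 0.62}
support = {0: 2, 1: 42, 2: 57, 3: 99}

total = sum(support.values())

# Macro average: every class counts the same, regardless of its size.
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f1_per_class[c] * support[c] for c in support) / total

# Share of the test set that belongs to the rare class st=0.
st0_share = support[0] / total

print(f"macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.3f}, "
      f"st=0 share = {st0_share:.1%}")
```

The weighted value comes out at about 0.455 instead of the reported 0.46 only because the inputs are already rounded to two decimals. Since `st=0` makes up 1% of the test set, its F1 score has almost no effect on the weighted average, so the macro average and the per-class rows are the better place to look for the rare class.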


If we compare these results with those obtained by the `Logistic Regression` on the same training set, the F1 score improves markedly on classes `st=1` (0.14 to 0.23) and `st=2` (0.07 to 0.35), drops slightly on `st=3` (0.67 to 0.62), and stays at 0.0 for `st=0`. The weighted average F1 score rises from 0.38 to 0.46, which suggests that the RF is the better model for this task. Now we test it on the undersampled training set:

#### Random Undersampling


In [None]:
#Perform Undersampling on the dataset

#Check the original distribution of the dataset
print(f'Original training set distribution: {Counter(y_train)}')

#Initialize RandomUndersampler
undersampler = RandomUnderSampler(random_state=42)

#Create new training set with undersampling
X_train_under, y_train_under = undersampler.fit_resample(X_train_scaled, y_train)

#Check the new distribution
print(f'Current distribution of the dataset, after applying the random undersampling: {Counter(y_train_under)}')

# Initialize Random Forest
n_estimators = 2000  # Number of decision trees

rf_model_under = RandomForestClassifier(n_estimators=n_estimators,
                                        random_state=42,
                                        n_jobs=-1)


# Train the model on the undersampled training set
rf_model_under.fit(X_train_under, y_train_under)
print("Training Completed.")

#Use the model to predict the test set
y_pred_rf_under = rf_model_under.predict(X_test_scaled)

#Print the classification report
print(f"Report Undersampled Random Forest with {n_estimators} trees")
print(classification_report(y_test, y_pred_rf_under, zero_division=0))

Original training set distribution: Counter({3: 397, 2: 227, 1: 168, 0: 8})
Current distribution of the dataset, after applying the random undersampling: Counter({0: 8, 1: 8, 2: 8, 3: 8})
Training Completed.
Report Undersampled Random Forest with 2000 trees
              precision    recall  f1-score   support

           0       0.02      0.50      0.04         2
           1       0.32      0.29      0.30        42
           2       0.30      0.30      0.30        57
           3       0.54      0.32      0.41        99

    accuracy                           0.31       200
   macro avg       0.30      0.35      0.26       200
weighted avg       0.42      0.31      0.35       200
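
The class counts printed above show how much data random undersampling throws away. The sketch below reproduces the idea by hand with NumPy on synthetic data with the same class counts; `random_undersample` is a hypothetical helper, not the `imblearn` implementation.

```python
from collections import Counter

import numpy as np


def random_undersample(X, y, seed=42):
    """Keep, for every class, as many random rows as the rarest class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]


# Synthetic stand-in with the same class counts as the training set above.
y_demo = np.repeat([0, 1, 2, 3], [8, 168, 227, 397])
X_demo = np.random.default_rng(0).normal(size=(len(y_demo), 6))

X_small, y_small = random_undersample(X_demo, y_demo)
small_counts = Counter(y_small.tolist())
print(small_counts, f"kept {len(y_small)} of {len(y_demo)} rows")
```

With 8 rows per class the forest is trained on 32 samples in total, which is worth keeping in mind when reading the scores of this experiment.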



If we compare the results with those obtained by the Logistic Regression with undersampling:

              precision    recall  f1-score   support

           0       0.06      1.00      0.12         2
           1       0.33      0.38      0.35        42
           2       0.28      0.35      0.31        57
           3       0.57      0.27      0.37        99

    accuracy                           0.33       200
   macro avg       0.31      0.50      0.29       200
weighted avg       0.43      0.33      0.35       200

we can see that the RF also predicts the `st=0` class, but with a lower F1 score than the `Logistic Regression` (0.04 vs 0.12). The weighted average F1 score is the same (0.35): the `Random Forest` does better on `st=3` (0.41 vs 0.37) and slightly worse on `st=1` and `st=2`, and both models fail to deliver good results. Next we try `Random Oversampling`.

#### Random Oversampling

In [12]:
#Perform Oversampling on the dataset

#Check the original distribution of the dataset
print(f'Original training set distribution: {Counter(y_train)}')

#Initialize RandomOverSampler
oversampler = RandomOverSampler(random_state=42)

#Create new training set with oversampling
X_train_over, y_train_over = oversampler.fit_resample(X_train_scaled, y_train)

#Check the new distribution
print(f'Current distribution of the dataset, after applying the random oversampling: {Counter(y_train_over)}')

# Initialize Random Forest
n_estimators = 2000  # Number of decision trees

rf_model_over = RandomForestClassifier(n_estimators=n_estimators,
                                       random_state=42,
                                       n_jobs=-1)


# Train the model on the oversampled training set
rf_model_over.fit(X_train_over, y_train_over)
print("Training Completed.")

#Use the model to predict the test set
y_pred_rf_over = rf_model_over.predict(X_test_scaled)

#Print the classification report
print(f"Report Oversampled Random Forest with {n_estimators} trees")
print(classification_report(y_test, y_pred_rf_over, zero_division=0))

Original training set distribution: Counter({3: 397, 2: 227, 1: 168, 0: 8})
Current distribution of the dataset, after applying the random oversampling: Counter({3: 397, 1: 397, 2: 397, 0: 397})
Training Completed.
Report Oversampled Random Forest with 2000 trees
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.27      0.21      0.24        42
           2       0.34      0.35      0.34        57
           3       0.58      0.63      0.60        99

    accuracy                           0.46       200
   macro avg       0.30      0.30      0.30       200
weighted avg       0.44      0.46      0.45       200
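
Random oversampling balances the counts by repeating existing rows, so it adds no new information about the rare class. The sketch below shows this by hand on synthetic data; `random_oversample_class` is a hypothetical helper and not the `imblearn` implementation.

```python
import numpy as np


def random_oversample_class(X_class, n_target, seed=42):
    """Return n_target rows: the originals plus random repeats of them."""
    rng = np.random.default_rng(seed)
    extra = rng.integers(0, len(X_class), size=n_target - len(X_class))
    return np.vstack([X_class, X_class[extra]])


# Synthetic stand-in for the 8 training points of class st=0 (6 features).
X_st0 = np.random.default_rng(0).normal(size=(8, 6))

X_st0_over = random_oversample_class(X_st0, n_target=397)
n_unique = len(np.unique(X_st0_over, axis=0))
print(X_st0_over.shape, n_unique)
```

The class grows from 8 to 397 rows, but the number of distinct points stays at 8, which is consistent with the forest still missing `st=0` on the test set.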



If we compare the results with those obtained by the Logistic Regression with RandomOversampling:

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.31      0.52      0.39        42
           2       0.35      0.11      0.16        57
           3       0.62      0.39      0.48        99

    accuracy                           0.34       200
   macro avg       0.32      0.26      0.26       200
weighted avg       0.47      0.34      0.37       200

We can see that the RF performed better globally, with a weighted average F1 score of 0.45 against the 0.37 of the LR. The gain is clearest for `st=3` (0.60 vs 0.48) and `st=2` (0.34 vs 0.16), whereas the LR is better on `st=1` (0.39 vs 0.24). The F1 score of the class `st=0` is still 0 for both models.
Now we perform a SMOTE oversampling.

#### SMOTE Oversampling

In [15]:
#Check the original distribution of the dataset
print(f'Original training set distribution: {Counter(y_train)}')

# Initialize SMOTE to oversample the minority classes
# Instead of the default value 5, we set k_neighbors to 7. SMOTE needs at least k_neighbors + 1 samples
# in a class, so 7 is the largest value that still works for class 0 with its 8 data points
smote_sampler = SMOTE(random_state=42, k_neighbors=7)

#Create the synthetic training data set
X_train_smote, y_train_smote = smote_sampler.fit_resample(X_train_scaled, y_train)

#Check the new distribution
print(f'Current distribution of the dataset, after applying the SMOTE oversampling: {Counter(y_train_smote)}')

# Initialize Random Forest
"""
n_estimators = 2000: number of decision trees, the same as in the previous experiments.
n_jobs = -1: use all the processors to parallelize the training.
"""

n_estimators = 2000

rf_model = RandomForestClassifier(n_estimators=n_estimators,
                                  random_state=42,
                                  n_jobs=-1)


# Train the model
print("Started Random Forest training...")
rf_model.fit(X_train_smote, y_train_smote)
print("Training Completed.")

#Use the model to predict the test set
y_pred_rf = rf_model.predict(X_test_scaled)

#Print the classification report
print(f"Report Random Forest (w/ SMOTE) and {n_estimators} trees")
print(classification_report(y_test, y_pred_rf, zero_division=0))


Original training set distribution: Counter({3: 397, 2: 227, 1: 168, 0: 8})
Current distribution of the dataset, after applying the SMOTE oversampling: Counter({3: 397, 1: 397, 2: 397, 0: 397})
Started Random Forest training...
Training Completed.
Report Random Forest (w/ SMOTE) and 2000 trees
              precision    recall  f1-score   support

           0       0.12      0.50      0.20         2
           1       0.28      0.29      0.28        42
           2       0.35      0.42      0.38        57
           3       0.61      0.49      0.55        99

    accuracy                           0.43       200
   macro avg       0.34      0.43      0.35       200
weighted avg       0.46      0.43      0.44       200
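
The SMOTE step used in the cell above creates synthetic points by interpolating between a minority sample and one of its nearest minority neighbors. The sketch below is a simplified, hand-written version of that idea on synthetic data; `smote_like_samples` is a hypothetical helper and does not reproduce every detail of the `imblearn` implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_minority, n_new, k_neighbors=7, seed=42):
    """Create n_new synthetic points on segments between minority neighbors."""
    rng = np.random.default_rng(seed)
    # k_neighbors + 1 because each fitted point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(X_minority)
    neighbor_idx = nn.kneighbors(X_minority, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_minority), size=n_new)            # starting points
    picked = neighbor_idx[base, rng.integers(0, k_neighbors, size=n_new)]
    lam = rng.random((n_new, 1))                                   # position on the segment
    return X_minority[base] + lam * (X_minority[picked] - X_minority[base])


# Synthetic stand-in for the 8 training points of class st=0 (6 features).
X_st0 = np.random.default_rng(0).normal(size=(8, 6))

# Going from 8 to 397 samples requires 389 synthetic points.
X_synth = smote_like_samples(X_st0, n_new=397 - 8)

tol = 1e-9  # tolerance for floating-point rounding
inside_box = bool(((X_synth >= X_st0.min(axis=0) - tol)
                   & (X_synth <= X_st0.max(axis=0) + tol)).all())
print(X_synth.shape, inside_box)
```

Every synthetic point lies on a segment between two of the 8 original points, so all 389 of them stay inside the region spanned by those 8 points. If that region is also populated by other classes, the new points land among them.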



We trained the `RandomForestClassifier` on the exact same `SMOTE` data that failed to produce results with Logistic Regression.

At first glance, this appears to be a significant breakthrough. The weighted avg `F1-score` is 0.44 (against 0.37 obtained with `RandomOversampling` in the LR), and crucially, the F1-score for the rare class `st=0` rose from 0.00 to 0.20: a recall of 0.50 means that the model identified one of the two `st=0` samples in the test set.

However, this result is highly misleading. A precision of 0.12 means that this single hit came out of about eight `st=0` predictions, so roughly seven of them were false alarms. The weighted avg `F1-score` is also lower than the 0.46 of the standard Random Forest.

As we've established, the original `st=0` samples are highly interspersed and likely indistinguishable from their neighbors (i.e., they are "noise"). The issue with `SMOTE` is that its interpolation in such a mixed space generates statistical artifacts, not a valid signal.

Therefore, the F1-score of 0.20 is not a sign of success. The Random Forest most likely did not learn the true, underlying non-linear pattern of the `st=0` class; it more plausibly learned to recognize the artificial signature of the `SMOTE` algorithm itself.

This conclusion is supported by our "honest" test using `class_weight='balanced'` below, which (like the Logistic Regression) failed to find the `st=0` class. This suggests that no generalizable signal exists for this class in the 6-feature space, although two test samples are too few to settle the question.


In [16]:
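# Weighted Random Forest

# Initialize the Weighted Random Forest: class_weight='balanced' gives rare classes a larger weight.
# n_jobs=-1 uses all the processors, as in the previous experiments.
rf_weighted = RandomForestClassifier(n_estimators=2000,
                                     random_state=42,
                                     n_jobs=-1,
                                     class_weight='balanced')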

#Train the Weighted Random Forest.
rf_weighted.fit(X_train_scaled, y_train)
print("Training Complete.")

#Evaluate the Model on the Test Set.
y_pred_weighted = rf_weighted.predict(X_test_scaled)

print(classification_report(y_test, y_pred_weighted, zero_division=0))

Training Complete.
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.30      0.19      0.23        42
           2       0.40      0.30      0.34        57
           3       0.56      0.74      0.63        99

    accuracy                           0.49       200
   macro avg       0.31      0.31      0.30       200
weighted avg       0.45      0.49      0.46       200
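
To see what `class_weight='balanced'` does in the cell above, the sketch below computes the balanced weights for the class counts of our training set. It uses scikit-learn's `compute_class_weight`, which applies the formula `n_samples / (n_classes * np.bincount(y))`; the label array is a synthetic stand-in built from the counts printed earlier.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with the same class counts as the training set.
y_demo = np.repeat([0, 1, 2, 3], [8, 168, 227, 397])
classes = np.unique(y_demo)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
class_weights = dict(zip(classes.tolist(), weights.round(3).tolist()))
print(class_weights)
```

Each `st=0` sample receives a weight of 25, about 50 times the weight of an `st=3` sample. Unlike resampling, this changes only how much each real sample counts during training and creates no new points.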

