# Introduction

In this notebook we load and analyze two velocity datasets derived from DART impact simulations on the Didymos–Dimorphos binary system. Each dataset contains 1000 velocity samples for a single simulated particle: one dataset corresponds to a particle that escapes the binary system after the impact, while the other corresponds to a particle that remains inside the system (due to impact on Didymos or Dimorphos, or by being placed on an unstable orbit). The raw files are located in the `data/raw/` folder.

Objective
- Identify whether there are clear turning points or threshold regions in the velocity distributions that separate escape from retention.
- Build machine learning models that map the probability of escape as a function of velocity features and estimate the critical velocity ranges where the transition occurs.
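
To make the notion of a "critical velocity" concrete, here is a minimal worked example. It assumes a single scalar speed feature $v$ and a logistic model. The symbols $\beta_0$, $\beta_1$, and $v_c$ are notation introduced for this illustration, not fitted values from the datasets.

$$
P(\text{escape} \mid v) = \sigma(\beta_0 + \beta_1 v), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

Differentiating with respect to $v$ gives

$$
\frac{dP}{dv} = \beta_1 \, P \, (1 - P),
$$

which, for $\beta_1 > 0$, is largest where $P = 1/2$. Under these assumptions the transition is centered at

$$
v_c = -\frac{\beta_0}{\beta_1},
$$

and the probability rises from 0.1 to 0.9 over a speed interval of width $2\ln 9 / \beta_1 \approx 4.39 / \beta_1$. For non-linear models no such closed form exists, so the threshold region would have to be read off the predicted probability curve numerically.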

Analysis workflow
- Data loading and validation: read CSV files, check for missing or inconsistent values, and verify units.
- Exploratory data analysis (EDA): histograms, kernel density estimates, scatter plots, and comparison of distributions between the two groups.
- Feature engineering: compute scalar speed, velocity components, and any transformations that improve separability for modeling.
- Machine learning modeling: train interpretable classifiers (e.g., logistic regression) and non-linear models (e.g., Random Forest) to predict escape vs retention.
- Turning point detection: analyze model decision functions and probability curves, and compute derivatives of the escape probability with respect to velocity to estimate threshold regions.
- Validation and interpretation: use cross-validation and metrics (accuracy, ROC-AUC, precision/recall), and apply feature-importance methods (e.g., permutation importance or SHAP) to interpret results.
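
The modeling, validation, and turning point steps above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual pipeline: the speeds are synthetic placeholders drawn from two normal distributions, and the variable names (`speed_retained`, `speed_escaped`, `critical_speed`) and all numeric parameters are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data (assumption): scalar speeds in arbitrary
# units, with escaping particles faster on average than retained ones.
rng = np.random.default_rng(0)
speed_retained = rng.normal(loc=0.8, scale=0.3, size=1000)
speed_escaped = rng.normal(loc=1.6, scale=0.3, size=1000)

X = np.concatenate([speed_retained, speed_escaped]).reshape(-1, 1)
y = np.concatenate([np.zeros(1000, dtype=int), np.ones(1000, dtype=int)])  # 1 = escaped

models = {
    "logistic_regression": LogisticRegression(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated ROC-AUC for each model.
cv_auc = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}

# Fit the logistic model on all samples and evaluate P(escape) on a speed grid.
clf = LogisticRegression().fit(X, y)
grid = np.linspace(X.min(), X.max(), 500)
p_escape = clf.predict_proba(grid.reshape(-1, 1))[:, 1]

# The transition is steepest where dP/dv peaks; take that as the threshold estimate.
dp_dv = np.gradient(p_escape, grid)
critical_speed = grid[np.argmax(dp_dv)]

print(cv_auc)
print(f"estimated critical speed: {critical_speed:.3f}")
```

The numerical derivative is used instead of a closed-form threshold so that the same grid-based approach can be reused with any model exposing `predict_proba`. For tree ensembles the probability curve is piecewise constant, so some smoothing would likely be needed before differentiating.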

Practical notes
- Subsequent cells provide code to load the files from `data/raw/`, produce EDA figures, and train models with cross-validation.
- Important figures, models, and tables are saved to the `results/` folder for later reference and inclusion in the thesis.
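
One possible way to keep the `results/` folder organized is a pair of small helpers like the ones below. This is a sketch only: the helper names (`save_figure`, `save_table`), the `figures/` and `tables/` subfolders, and the relative `results` path are assumptions for illustration, and the demo data is random placeholder input rather than simulation output.

```python
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

RESULTS_DIR = Path("results")  # assumption: relative to the working directory


def save_figure(fig, name, results_dir=RESULTS_DIR):
    """Save a Matplotlib figure as PNG and return the written path."""
    out_dir = Path(results_dir) / "figures"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{name}.png"
    fig.savefig(out_path, dpi=300, bbox_inches="tight")
    return out_path


def save_table(df, name, results_dir=RESULTS_DIR):
    """Save a DataFrame as CSV (without the index) and return the written path."""
    out_dir = Path(results_dir) / "tables"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{name}.csv"
    df.to_csv(out_path, index=False)
    return out_path


# Demo with placeholder data (not simulation output).
demo_speeds = np.random.default_rng(0).normal(size=200)
fig, ax = plt.subplots()
ax.hist(demo_speeds, bins=30)
ax.set_xlabel("speed (placeholder units)")
figure_path = save_figure(fig, "demo_speed_histogram")
plt.close(fig)

table_path = save_table(pd.DataFrame({"speed": demo_speeds}), "demo_speeds")
print(figure_path, table_path)
```

Returning the written path makes it easy to log where each artifact went. Fitted scikit-learn models could be persisted in a similar way, for example with `joblib.dump`.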

In [2]:
import numpy as np
import pandas as pd

# Load data from the raw folder (relative to the working directory)
path = "../data/raw"

def load_datasets():
    """Return the escaped and survived velocity datasets as DataFrames."""
    df_escaped = pd.read_csv(f"{path}/2nd_simulation_escaped.csv")
    df_survived = pd.read_csv(f"{path}/2nd_simulation_survived.csv")
    return df_escaped, df_survived

# Check for missing or inconsistent data
def check_data_quality(df):
    """Print missing-value counts and column dtypes for a DataFrame."""
    print("Missing values per column:")
    print(df.isnull().sum())
    print("\nData types:")
    print(df.dtypes)

# Use the function to load datasets
df_escaped, df_survived = load_datasets()
# Uncomment to inspect each dataset:
# check_data_quality(df_escaped)
# check_data_quality(df_survived)


#### Exploratory Data Analysis

In [None]:
#Histogram for exploratory data analysis
import matplotlib.pyplot as plt
import seaborn as sns

# Plot histograms for key features in both datasets
