# Data Preprocessing for Machine Learning Models

## 1. Introduction
### Dataset Description
- **Dataset Source**: REMS (Rover Environmental Monitoring Station) data collected by the Curiosity rover. It contains **3,197 records** spanning multiple Martian years, with variables covering **temperature, pressure, UV radiation, and day length**.
- **Key Columns**:
  - `Ls`: Solar longitude in degrees, representing Mars’ position in its orbit.
  - `sunrise`, `sunset`: Times of sunrise and sunset.
  - `max_ground_temp`, `min_ground_temp`: Ground temperature extremes.
  - `max_air_temp`, `min_air_temp`: Air temperature extremes.
  - `avg_air_temp`, `avg_ground_temp`: Average air and ground temperatures.
  - `mars_month`: Martian month derived from solar longitude (30 degrees per month).
  - `mars_year`: Martian year counted from the start of the mission (first year = 1).
  - `mars_season`: Martian season derived from solar longitude. Curiosity is located in the southern hemisphere, so the seasons are inverted relative to the northern hemisphere.
  - `day_length`: Length of the Martian day in minutes.
  - `mean_pressure`: Average atmospheric pressure for a given day.
  - `UV_Radiation`: UV index category, stored as an integer from 1 to 4.
- **Purpose**: This dataset supports the study of Martian weather patterns and seasonal variations.

### Objectives for Preprocessing
 - Encode solar longitude as sine/cosine components to capture its wrap-around nature.
 - Create separate, synchronized scaled and unscaled datasets for different ML models.
 - Create dummy variables for the categorical features: `mars_season`, `UV_Radiation`.
 - Apply `MinMaxScaler()` to the numerical features: `avg_air_temp`, `avg_ground_temp`, `mean_pressure`, `day_length`, and the sine/cosine components of `Ls`.
 - Create train/test splits for each dataset.
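
The wrap-around objective can be illustrated with a small sketch. The helper name `encode_cyclic` is hypothetical and not part of the notebook; it only shows why a sine/cosine pair keeps `Ls = 359` and `Ls = 1` close together, whereas the raw values are 358 apart.

```python
import numpy as np

def encode_cyclic(deg):
    """Map an angle in degrees to a (sin, cos) pair on the unit circle."""
    rad = np.deg2rad(deg)
    return np.array([np.sin(rad), np.cos(rad)])

# Ls = 359 and Ls = 1 are two degrees apart on the orbit,
# but 358 apart when compared as raw numbers.
raw_gap = abs(359 - 1)
encoded_gap = np.linalg.norm(encode_cyclic(359) - encode_cyclic(1))

# Opposite points on the orbit stay far apart after encoding.
opposite_gap = np.linalg.norm(encode_cyclic(0) - encode_cyclic(180))

print(raw_gap, round(float(encoded_gap), 4), round(float(opposite_gap), 4))
```

Both components are needed: sine alone cannot distinguish, for example, 30 degrees from 150 degrees.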

---

## 2. Data Overview

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import joblib
import os

data_path = "../data/"

### 2.1 Load and Inspect Data

In [2]:
# Import data from CSV
df = pd.read_csv(os.path.join(data_path, "cleaned/mars_weather_cleaned.csv"), index_col='sol_number')

# Check that data loaded properly
df.head()

sol_number,earth_date_time,Ls,mars_month,mars_year,mars_season,sunrise,sunset,day_length,avg_ground_temp,max_ground_temp,min_ground_temp,avg_air_temp,max_air_temp,min_air_temp,mean_pressure,UV_Radiation
1,2012-08-07,150.0,6,1,winter,05:30:00,17:22:00,712.0,-45.5,-16.0,-75.0,-37.5,8.0,-83.0,739.0,4
9,2012-08-15,155.0,6,1,winter,05:28:00,17:22:00,714.0,-45.5,-16.0,-75.0,-37.5,8.0,-83.0,739.0,4
10,2012-08-16,155.0,6,1,winter,05:28:00,17:22:00,714.0,-45.5,-16.0,-75.0,-37.5,8.0,-83.0,739.0,4
11,2012-08-17,156.0,6,1,winter,05:28:00,17:21:00,713.0,-43.5,-11.0,-76.0,-37.0,9.0,-83.0,740.0,4
12,2012-08-18,156.0,6,1,winter,05:28:00,17:21:00,713.0,-47.0,-18.0,-76.0,-37.0,8.0,-82.0,741.0,4


### 2.2 Summary Statistics

In [3]:
df.describe()

statistic,Ls,mars_month,mars_year,day_length,avg_ground_temp,max_ground_temp,min_ground_temp,avg_air_temp,max_air_temp,min_air_temp,mean_pressure,UV_Radiation
count,3197.0,3197.0,3197.0,3197.0,3197.0,3197.0,3197.0,3197.0,3197.0,3197.0,3197.0,3197.0
mean,166.959962,6.083203,3.507038,718.169221,-44.097279,-13.182828,-75.01173,-39.147357,2.01173,-80.306537,828.997028,2.583359
std,104.356771,3.466172,1.500218,12.036363,7.401663,10.489177,5.529929,7.262277,9.398862,8.824723,57.224328,0.692132
min,0.0,1.0,1.0,702.0,-72.5,-67.0,-100.0,-75.5,-61.0,-136.0,702.0,1.0
25%,78.0,3.0,2.0,707.0,-50.5,-23.0,-79.0,-45.5,-6.0,-86.0,785.0,2.0
50%,156.0,6.0,3.0,716.0,-43.0,-12.0,-75.0,-38.0,3.0,-80.0,844.0,3.0
75%,254.0,9.0,5.0,730.0,-37.5,-4.0,-71.0,-33.0,10.0,-75.0,873.0,3.0
max,359.0,12.0,6.0,738.0,-26.5,11.0,-52.0,-1.5,24.0,-8.0,925.0,4.0


---

## 3. Data Processing

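Many of the columns carry overlapping information (for example, the daily temperature extremes alongside their averages, or `sunrise` and `sunset` alongside `day_length`), so only the features selected below are kept and the rest are dropped at this stage.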

In [4]:
# Splitting features
cat_feats = ['mars_season', 'UV_Radiation']
num_feats = ['avg_air_temp', 'avg_ground_temp', 'mean_pressure', 'day_length', 'Ls']

# Remapping UV Radiation from 1–4 → string labels
uv_map = {1: "low", 2: "moderate", 3: "high", 4: "very_high"}

df["UV_Radiation"] = (
    df["UV_Radiation"]
    .map(uv_map)          # replace integers with strings
    .astype("category")   # optional: store as pandas categorical
)

# Encoding Cyclical nature in Solar Longitude
df['sin_lon'] = np.sin(2 * np.pi * df['Ls'] / 360)
df['cos_lon'] = np.cos(2 * np.pi * df['Ls'] / 360)
num_feats.remove('Ls')
num_feats.extend(['sin_lon', 'cos_lon'])

# Removing unnecessary features from dataframe
df = df[num_feats + cat_feats]

In [5]:
# Encoding categorical features
df_dummies = pd.get_dummies(df[cat_feats], drop_first=True).astype(int)
df_unscaled = pd.concat([df[num_feats], df_dummies], axis=1)

df_unscaled.head()

sol_number,avg_air_temp,avg_ground_temp,mean_pressure,day_length,sin_lon,cos_lon,mars_season_spring,mars_season_summer,mars_season_winter,UV_Radiation_low,UV_Radiation_moderate,UV_Radiation_very_high
1,-37.5,-45.5,739.0,712.0,0.5,-0.866025,0,0,1,0,0,1
9,-37.5,-45.5,739.0,714.0,0.422618,-0.906308,0,0,1,0,0,1
10,-37.5,-45.5,739.0,714.0,0.422618,-0.906308,0,0,1,0,0,1
11,-37.0,-43.5,740.0,713.0,0.406737,-0.913545,0,0,1,0,0,1
12,-37.0,-47.0,741.0,713.0,0.406737,-0.913545,0,0,1,0,0,1
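
Because `drop_first=True` is used, one level of each categorical feature has no column of its own and is represented by all zeros, which is why the output above has no `UV_Radiation_high` column. The sketch below reproduces that behaviour on a hypothetical four-row frame; only the UV labels are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy frame; only the UV labels mirror the mapping used above.
toy = pd.DataFrame({
    "UV_Radiation": pd.Series(
        ["low", "moderate", "high", "very_high"]
    ).astype("category")
})

# Categories are ordered alphabetically, so "high" is the first level
# and is the one dropped.
dummies = pd.get_dummies(toy, drop_first=True).astype(int)
print(list(dummies.columns))

# The row whose label is "high" is encoded as all zeros.
high_row = dummies[toy["UV_Radiation"] == "high"]
print(high_row.to_dict("records"))
```

Dropping one level avoids perfectly collinear dummy columns, which matters mostly for linear models; the dropped level becomes the implicit baseline.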


In [6]:
# Scaling numeric features
scaler = MinMaxScaler()
scaled_values = scaler.fit_transform(df[num_feats])
df_numeric = pd.DataFrame(scaled_values, columns=num_feats, index=df.index)

df_scaled = pd.concat([df_numeric, df_dummies], axis=1)

df_scaled.head()

joblib.dump(scaler, os.path.join(data_path,"processed/scaler.pkl"))

['../data/processed/scaler.pkl']
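
Note that the scaler above is fit on every row, before the chronological split in Section 4, so the minimum and maximum of the later test rows influence the scaled training values. If that matters for a downstream model, one possible alternative is to fit on the training rows only. The following is a minimal sketch on synthetic data, not the notebook's pipeline; the frame `toy` and its values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic, steadily rising series standing in for one numeric feature.
toy = pd.DataFrame({"mean_pressure": np.linspace(700.0, 900.0, 11)})

split_idx = int(len(toy) * 0.8)           # same 80/20 rule as Section 4
train, test = toy.iloc[:split_idx], toy.iloc[split_idx:]

scaler = MinMaxScaler().fit(train)        # statistics come from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)      # may fall outside [0, 1]

print(train_scaled.min(), train_scaled.max(), test_scaled.max())
```

With this approach, test values can land outside the [0, 1] range whenever they exceed the range seen during training, which is expected rather than an error.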

In [7]:
# Final shapes
print("Unscaled:", df_unscaled.shape)
print("Scaled:", df_scaled.shape)

# save scaled and unscaled dataframes as CSV
df_unscaled.to_csv(os.path.join(data_path,"processed/processed_weather_unscaled.csv"), index=True) # For later use in SARIMAX model
df_scaled.to_csv(os.path.join(data_path,"processed/processed_weather_scaled.csv"), index=True) # For later use in LSTM and Isolation Forest model

Unscaled: (3197, 12)
Scaled: (3197, 12)


---

## 4. Splitting into Training and Test Sets

In [8]:
# Set split ratio and index
split_ratio = 0.8
split_idx = int(len(df_scaled) * split_ratio)

# --- Scaled Data ---
X_scaled_train = df_scaled.iloc[:split_idx]
X_scaled_test = df_scaled.iloc[split_idx:]

# --- Unscaled Data ---
X_unscaled_train = df_unscaled.iloc[:split_idx]
X_unscaled_test = df_unscaled.iloc[split_idx:]
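
The split above keeps rows in their original order rather than shuffling them, so every test row comes after every training row. A minimal sketch of the same idea follows; the helper name `chronological_split` and the placeholder frame are hypothetical, and only the row count (3,197) comes from the dataset.

```python
import pandas as pd

def chronological_split(frame, ratio=0.8):
    """Split rows in order: the first `ratio` share for training, the rest for testing."""
    idx = int(len(frame) * ratio)
    return frame.iloc[:idx], frame.iloc[idx:]

# Placeholder frame with the same number of rows as the dataset.
toy = pd.DataFrame({"x": range(3197)})
train, test = chronological_split(toy)

print(train.shape, test.shape)

# No test row precedes a training row.
assert train.index.max() < test.index.min()
```

Because `int()` truncates, 3,197 × 0.8 = 2,557.6 becomes 2,557 training rows, leaving 640 for testing.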

In [9]:
# Save scaled splits
X_scaled_train.to_csv(os.path.join(data_path, "processed/scaled_train.csv"), index=True)
X_scaled_test.to_csv(os.path.join(data_path, "processed/scaled_test.csv"), index=True)

# Save unscaled splits
X_unscaled_train.to_csv(os.path.join(data_path, "processed/unscaled_train.csv"), index=True)
X_unscaled_test.to_csv(os.path.join(data_path, "processed/unscaled_test.csv"), index=True)

---

## 5. Preprocessing Summary

- Cleaned and structured dataset prepared for time-series forecasting and anomaly detection.
- Categorical features (`mars_season`, `UV_Radiation`) one-hot encoded with `pd.get_dummies(drop_first=True)`, so one level per feature serves as the implicit baseline.
- Numerical features scaled to the [0, 1] range with `MinMaxScaler` to support the LSTM and Isolation Forest models.
- Created two datasets:
  - `df_unscaled`: original numeric values + dummy features (for SARIMAX)
  - `df_scaled`: scaled numeric values + dummy features (for LSTM, Isolation Forest)
- Chronological 80/20 split applied (first 2,557 rows for training, remaining 640 for testing):
  - `scaled_train.csv`, `scaled_test.csv`
  - `unscaled_train.csv`, `unscaled_test.csv`
- All files saved to: `../data/processed`
- Scaler object saved as `scaler.pkl` for inverse transforms during inference
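
Using the saved scaler for an inverse transform might look like the sketch below. It is self-contained and therefore builds a stand-in scaler on illustrative values and writes it to a temporary path; in the project, the object would instead be loaded from `scaler.pkl`. The real scaler was fit on six columns (`avg_air_temp`, `avg_ground_temp`, `mean_pressure`, `day_length`, `sin_lon`, `cos_lon`), so `inverse_transform` would expect an array with those six columns in that order.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the fitted scaler; the values are illustrative only.
original = np.array([[700.0], [800.0], [900.0]])
scaler = MinMaxScaler().fit(original)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler.pkl")   # temporary path, not the project file
    joblib.dump(scaler, path)
    restored = joblib.load(path)

# A hypothetical model output in scaled units, mapped back to original units.
scaled_prediction = np.array([[0.25]])
recovered = restored.inverse_transform(scaled_prediction)
print(recovered)
```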

In [10]:
# Final info summary
print("Preprocessing Complete")
print(f"Scaled train shape   : {X_scaled_train.shape}")
print(f"Scaled test shape    : {X_scaled_test.shape}")
print(f"Unscaled train shape : {X_unscaled_train.shape}")
print(f"Unscaled test shape  : {X_unscaled_test.shape}")
print(f"All files saved to   : {data_path}processed/")

Preprocessing Complete
Scaled train shape   : (2557, 12)
Scaled test shape    : (640, 12)
Unscaled train shape : (2557, 12)
Unscaled test shape  : (640, 12)
All files saved to   : ../data/processed/
