# Q6: Modeling Preparation

**Phase 7:** Modeling Preparation  
**Points: 3 points**

**Focus:** Perform temporal train/test split, select features, handle categorical variables.

**Lecture Reference:** Lecture 11, Notebook 3 ([`11/demo/03_pattern_analysis_modeling_prep.ipynb`](https://github.com/christopherseaman/datasci_217/blob/main/11/demo/03_pattern_analysis_modeling_prep.ipynb)), Phase 7. This notebook demonstrates temporal train/test splitting (see "Your Approach" section below for the key code pattern).

---

## Setup

In [143]:
# Import libraries
import pandas as pd
import numpy as np
import os

# Load feature-engineered data from Q4
#df = pd.read_csv('output/q4_features.csv', parse_dates=['Measurement Timestamp'], index_col='Measurement Timestamp')
#print(df.dtypes) # Check data types
# Or if you saved without index:
df = pd.read_csv('output/q4_features.csv')
df['Measurement Timestamp'] = pd.to_datetime(df['Measurement Timestamp'])
#df_model = df.set_index('Measurement Timestamp')
print(f"Loaded {len(df):,} records with features")
from IPython.display import display, Markdown

Loaded 78,177 records with features


---

## Objective

Prepare data for modeling by performing temporal train/test split, selecting features, and handling categorical variables.

**CRITICAL - Temporal Split:** For time series data, you **MUST** use temporal splitting (earlier data for training, later data for testing). **DO NOT** use random split. Why? Time series data has temporal dependencies - using future data to predict the past would be data leakage.

---

## Required Artifacts

You must create exactly these 5 files in the `output/` directory:

### 1. `output/q6_X_train.csv`
**Format:** CSV file
**Content:** Training features (X)
**Requirements:**
- All feature columns (no target variable)
- Only training data (earlier time periods)
- **No index column** (save with `index=False`)
- **No datetime column** (unless it's a feature, not the index)

### 2. `output/q6_X_test.csv`
**Format:** CSV file
**Content:** Test features (X)
**Requirements:**
- All feature columns (same as X_train)
- Only test data (later time periods)
- **No index column** (save with `index=False`)
- **No datetime column** (unless it's a feature, not the index)

### 3. `output/q6_y_train.csv`
**Format:** CSV file
**Content:** Training target variable (y)
**Requirements:**
- Single column with target variable name as header
- Only training data (corresponding to X_train)
- **No index column** (save with `index=False`)

**Example:**
```csv
Water Temperature
15.2
15.3
15.1
...
```

### 4. `output/q6_y_test.csv`
**Format:** CSV file
**Content:** Test target variable (y)
**Requirements:**
- Single column with target variable name as header
- Only test data (corresponding to X_test)
- **No index column** (save with `index=False`)

### 5. `output/q6_train_test_info.txt`
**Format:** Plain text file
**Content:** Train/test split information
**Required information:**
- Split method: Temporal (80/20 or similar)
- Training set size: [number] samples
- Test set size: [number] samples
- Training date range: [start] to [end]
- Test date range: [start] to [end]
- Number of features: [number]
- Target variable: [name]

**Example format:**
```
TRAIN/TEST SPLIT INFORMATION
==========================

Split Method: Temporal (80/20 split by time)

Training Set Size: 40000 samples
Test Set Size: 10000 samples

Training Date Range: 2022-01-01 00:00:00 to 2026-09-15 07:00:00
Test Date Range: 2026-09-15 08:00:00 to 2027-09-15 07:00:00

Number of Features: 22
Target Variable: Water Temperature
```

---

## Requirements Checklist

- [ ] Target variable selected
- [ ] Temporal train/test split performed (train on earlier data, test on later data - **NOT random split**)
- [ ] Features selected and prepared
- [ ] Categorical variables handled (encoding if needed)
- [ ] No data leakage (future data not in training set)
- [ ] All 5 required artifacts saved with exact filenames

---

## Your Approach

1. **Select target variable** - Choose a meaningful numeric variable to predict
2. **Select features** - Exclude target, non-numeric columns, and any features derived from the target (to avoid data leakage)
3. **Handle categorical variables** - One-hot encode if needed
4. **Perform temporal train/test split** - Sort by datetime, then split by index position (earlier data for training, later for testing)
5. **Save artifacts** - Save X_train, X_test, y_train, y_test as separate CSVs
6. **Document split** - Record split sizes, date ranges, and feature count
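
As a minimal sketch of step 4, assuming a DataFrame with a datetime column: sort by time, then cut at a row position so that every training row precedes every test row. The helper name `temporal_split` and the toy values are hypothetical and used only for illustration.

```python
import pandas as pd


def temporal_split(df, time_col, train_ratio=0.8):
    """Sort by time, then split by row position (earlier rows -> train)."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    split_idx = int(len(ordered) * train_ratio)
    train = ordered.iloc[:split_idx].copy()
    test = ordered.iloc[split_idx:].copy()
    return train, test


# Toy hourly data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Measurement Timestamp": pd.date_range(
        "2024-01-01", periods=10, freq=pd.Timedelta(hours=1)
    ),
    "Air Temperature": [10.0, 10.5, 11.0, 11.2, 11.8, 12.1, 12.0, 11.5, 11.0, 10.4],
})
# Shuffle the rows to show that the helper does not rely on input order
toy = toy.sample(frac=1, random_state=0)

train, test = temporal_split(toy, "Measurement Timestamp", train_ratio=0.8)
print(len(train), len(test))
```

Splitting by position after sorting guarantees that no test timestamp is earlier than any training timestamp. When several rows share the boundary timestamp, they may land on both sides of the cut.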

---

## Feature Selection Guidelines

When selecting features for modeling, think critically about each feature:

**Red Flags to Watch For:**
- **Circular logic**: Does this feature use the target variable to predict the target?
  - Example: Rolling mean of target, lag of target (if not handled carefully)
  - Example: If predicting `Air Temperature`, using `air_temp_rolling_7h` is circular - you're predicting temperature from smoothed temperature
- **Data leakage**: Does this feature contain information that wouldn't be available at prediction time?
  - Example: Future values, aggregated statistics that include the current value
- **Near-duplicates**: Is this feature nearly identical to the target?
  - Check correlations - if correlation > 0.95, investigate whether it's legitimate
  - Example: A feature with 99%+ correlation with the target is likely problematic
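
The correlation check above can be sketched as follows. This is an illustrative helper, not part of the assignment's required code: the function name `flag_near_duplicates`, the toy values, and the 0.95 default are assumptions taken from the guideline text.

```python
import pandas as pd


def flag_near_duplicates(df, target, threshold=0.95):
    """Return numeric features whose absolute correlation with the target exceeds the threshold."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr[corr.abs() > threshold].sort_values(ascending=False)


# Toy data (hypothetical values): one smoothed copy of the target, one unrelated predictor
toy = pd.DataFrame({
    "Air Temperature": [10.0, 12.0, 15.0, 11.0, 9.0, 14.0],
    "air_temp_rolling_7h": [10.1, 11.9, 14.8, 11.2, 9.1, 13.9],
    "Humidity": [80.0, 60.0, 75.0, 50.0, 65.0, 70.0],
})

suspicious = flag_near_duplicates(toy, "Air Temperature")
print(suspicious)
```

A flagged feature is not automatically invalid. It is a prompt to check how the feature was built.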

**Good Practices:**
- Use external predictors (other weather variables, temporal features)
- Create rolling windows of **predictors**, not the target
  - Good: `wind_speed_rolling_7h`, `humidity_rolling_24h`
  - Bad: `air_temp_rolling_7h` when predicting Air Temperature
- Use derived features that combine multiple predictors
- Think: "Would I have this information when making a real prediction?"

**Remember:** The goal is to predict the target from **other** information, not from the target itself.
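
One possible way to build a rolling feature from a predictor without including the current row's value is to shift the series by one step before taking the rolling mean. The sketch below assumes rows are already sorted by time and evenly spaced at one hour. The helper name and the toy values are hypothetical.

```python
import pandas as pd


def add_lagged_rolling_mean(df, col, window):
    """Add a rolling mean of `col` computed from the previous `window` rows only."""
    out = df.copy()
    out[f"{col}_rolling_{window}h"] = (
        out[col].shift(1).rolling(window=window, min_periods=1).mean()
    )
    return out


# Toy hourly predictor (hypothetical values)
toy = pd.DataFrame(
    {"Wind Speed": [2.0, 4.0, 6.0, 8.0]},
    index=pd.date_range("2024-01-01", periods=4, freq=pd.Timedelta(hours=1)),
)

result = add_lagged_rolling_mean(toy, "Wind Speed", window=2)
print(result)
```

The first row has no earlier observations, so its rolling value is missing and needs to be dropped or imputed before modeling.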

---

## Decision Points

- **Target variable:** What do you want to predict? Temperature? Water conditions? Choose something meaningful and measurable.
- **Temporal split:** **CRITICAL** - Use temporal split (earlier data for training, later data for testing), NOT random split. Why? Time series data has temporal dependencies. Typical split: 80/20 or 70/30.
- **Feature selection:** Which features are most relevant? Consider correlations, domain knowledge, and feature importance from previous analysis.
- **Categorical encoding:** If you have categorical variables, encode them (one-hot encoding, label encoding, etc.) before modeling.

---

## Checkpoint

After Q6, you should have:
- [ ] Temporal train/test split completed (earlier → train, later → test)
- [ ] Features prepared (no target, no datetime index)
- [ ] Categorical variables encoded
- [ ] No data leakage verified
- [ ] All 5 artifacts saved: `q6_X_train.csv`, `q6_X_test.csv`, `q6_y_train.csv`, `q6_y_test.csv`, `q6_train_test_info.txt`
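
A quick self-check of the saved artifacts might look like the sketch below. The helper `check_artifacts` is hypothetical, and the demo writes toy files to a temporary directory so it runs on its own. In practice you would point it at `output/`.

```python
import os
import tempfile

import pandas as pd


def check_artifacts(output_dir):
    """Return a list of problems found in the Q6 artifacts (empty list means all checks passed)."""
    X_train = pd.read_csv(os.path.join(output_dir, "q6_X_train.csv"))
    X_test = pd.read_csv(os.path.join(output_dir, "q6_X_test.csv"))
    y_train = pd.read_csv(os.path.join(output_dir, "q6_y_train.csv"))
    y_test = pd.read_csv(os.path.join(output_dir, "q6_y_test.csv"))

    problems = []
    if list(X_train.columns) != list(X_test.columns):
        problems.append("X_train and X_test have different columns")
    if len(X_train) != len(y_train) or len(X_test) != len(y_test):
        problems.append("X and y row counts differ")
    if y_train.shape[1] != 1 or y_test.shape[1] != 1:
        problems.append("y files must have exactly one column")
    if any(str(c).startswith("Unnamed") for c in X_train.columns):
        problems.append("X_train looks like it was saved with an index column")
    if not os.path.exists(os.path.join(output_dir, "q6_train_test_info.txt")):
        problems.append("q6_train_test_info.txt is missing")
    return problems


# Demo with toy files (hypothetical values) in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"hour": [0, 1, 2], "Humidity": [60.0, 62.0, 61.0]}).to_csv(
        os.path.join(tmp, "q6_X_train.csv"), index=False)
    pd.DataFrame({"hour": [3], "Humidity": [63.0]}).to_csv(
        os.path.join(tmp, "q6_X_test.csv"), index=False)
    pd.Series([15.2, 15.3, 15.1], name="Air Temperature").to_csv(
        os.path.join(tmp, "q6_y_train.csv"), index=False)
    pd.Series([15.0], name="Air Temperature").to_csv(
        os.path.join(tmp, "q6_y_test.csv"), index=False)
    with open(os.path.join(tmp, "q6_train_test_info.txt"), "w") as f:
        f.write("TRAIN/TEST SPLIT INFORMATION\n")
    problems = check_artifacts(tmp)

print(problems)
```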

---

**Next:** Continue to `q7_modeling.md` for Modeling.


In [144]:
#CREATE ADDITIONAL VARIABLES SUCH AS ROLLING AND TEMPORAL BEFORE SPLITTING THE DATASET

# Reset datetime
df_model = df.set_index('Measurement Timestamp')

# CREATE ROLLING FEATURES FOR HIGHLY CORRELATED VARIABLES

hourly_data = df_model.resample('h').agg({
    'Wet Bulb Temperature': 'mean',
    'wet_bulb_humidity_ratio':'mean',
    'wet_bulb_humidity_interaction':'mean',
    'rain_difference':'mean',
    'Humidity': 'mean',
    'Heading':'mean',
    'Total Rain': 'mean',
    'solar_totalrain_interaction':'mean',
    'pressure_humidity_interaction':'mean'
})

# Calculate 24-hour rolling mean
ROLLING_WINDOW_HOURS = 24  # 24-hour window 
for col in hourly_data.columns:
    hourly_data[f'{col}_24h_mean'] = hourly_data[col].rolling(window=ROLLING_WINDOW_HOURS, min_periods=1).mean()

display(Markdown("### 📈 24-Hour Rolling Mean"))
rollings = [f'{col}_24h_mean' for col in hourly_data.columns if not col.endswith('_24h_mean')]
display(hourly_data[rollings].head(20).round(2))

# Reset index back before merging to df_model
hourly_data = hourly_data.reset_index()

# Collect the rolling-mean column names before merging
rolling_feature_cols = [col for col in hourly_data.columns if col.endswith('_24h_mean')]

# Merge rolling features into df_model
df_model = df_model.reset_index().merge(
    hourly_data[rolling_feature_cols + ['Measurement Timestamp']],
    how='left',
    on='Measurement Timestamp'
)

# Create temporal variables
df_model = df_model.set_index('Measurement Timestamp') # set the index back after merging so temporal features can be created from it
# Hour (0-23)
df_model['hour'] = df_model.index.hour
print(f"✓ hour: {df_model['hour'].min()}-{df_model['hour'].max()}")

# Day of week (0=Monday, 6=Sunday)
df_model['day_of_week'] = df_model.index.dayofweek
print(f"✓ day_of_week: {df_model['day_of_week'].min()}-{df_model['day_of_week'].max()} (0=Mon, 6=Sun)")

# Month (1-12)
df_model['month'] = df_model.index.month
print(f"✓ month: {df_model['month'].min()}-{df_model['month'].max()}")

print(df_model.columns.tolist())

### 📈 24-Hour Rolling Mean

Measurement Timestamp,Wet Bulb Temperature_24h_mean,wet_bulb_humidity_ratio_24h_mean,wet_bulb_humidity_interaction_24h_mean,rain_difference_24h_mean,Humidity_24h_mean,Heading_24h_mean,Total Rain_24h_mean,solar_totalrain_interaction_24h_mean,pressure_humidity_interaction_24h_mean
2015-05-22 19:00:00,14.8,0.26,858.4,7.3,58.0,352.0,7.3,576.7,57443.2
2015-05-22 20:00:00,14.8,0.25,865.8,7.3,58.5,352.0,7.3,306.6,57938.4
2015-05-22 21:00:00,14.8,0.25,883.07,7.3,59.67,352.0,7.3,204.4,59093.87
2015-05-22 22:00:00,14.8,0.24,906.5,7.3,61.25,352.0,7.3,153.3,60662.0
2015-05-22 23:00:00,14.8,0.24,911.68,7.3,61.6,352.0,7.3,122.64,61008.64
2015-05-23 00:00:00,14.8,0.24,911.68,7.3,61.6,352.0,7.3,122.64,61008.64
2015-05-23 01:00:00,14.8,0.24,911.68,7.3,61.6,352.0,7.3,122.64,61008.64
2015-05-23 02:00:00,14.8,0.24,911.68,7.3,61.6,352.0,7.3,122.64,61008.64
2015-05-23 03:00:00,14.8,0.24,911.68,7.3,61.6,352.0,7.3,122.64,61008.64
2015-05-23 04:00:00,14.8,0.24,911.68,7.3,61.6,352.0,7.3,122.64,61008.64


✓ hour: 0-23
✓ day_of_week: 0-6 (0=Mon, 6=Sun)
✓ month: 1-12
['Station Name', 'Air Temperature', 'Wet Bulb Temperature', 'Humidity', 'Rain Intensity', 'Interval Rain', 'Total Rain', 'Precipitation Type', 'Wind Direction', 'Wind Speed', 'Maximum Wind Speed', 'Barometric Pressure', 'Solar Radiation', 'Heading', 'Battery Life', 'Measurement Timestamp Label', 'Measurement ID', 'wet_bulb_difference', 'wet_bulb_humidity_ratio', 'wet_bulb_humidity_interaction', 'rain_difference', 'rain_intensity_ratio', 'rain_humidity_interaction', 'Intervrain_humidity_interaction', 'rain_pressure_interaction', 'rain_wind_interaction', 'wind_range', 'wind_speed_ratio', 'wind_speed_interaction', 'wind_humidity_ratio', 'wind_pressure_interaction', 'pressure_humidity_ratio', 'pressure_humidity_interaction', 'solar_totalrain_interaction', 'solar_humidity_interaction', 'solar_pressure_ratio', 'solar_wind_interaction', 'Wet_bulb_temp_category', 'humidity_category', 'wind_speed_category', 'solar_category', 'We

In [145]:
#1
# For data with temporal structure, we must split by time (not randomly)
# Train on earlier data, test on later data
df_model = df_model.reset_index().sort_values('Measurement Timestamp').copy()

# Train/test split configuration
TRAIN_RATIO = 0.80  # 80% for training, 20% for testing

# Define temporal split point
# IMPORTANT: For time series, we split by time (not randomly) to prevent data leakage
split_date = df_model['Measurement Timestamp'].quantile(TRAIN_RATIO)

# Create train/test split
train = df_model[df_model['Measurement Timestamp'] < split_date].copy()
test = df_model[df_model['Measurement Timestamp'] >= split_date].copy()

display(Markdown("### ✂️ Temporal Train/Test Split"))
display(pd.DataFrame({
    'Dataset': ['Train', 'Test'],
    'Obser': [f"{len(train):,}", f"{len(test):,}"],
    'Date Range': [
        f"{train['Measurement Timestamp'].min()} to {train['Measurement Timestamp'].max()}",
        f"{test['Measurement Timestamp'].min()} to {test['Measurement Timestamp'].max()}"
    ]
}))
display(Markdown(f"**Split date:** {split_date}"))

### ✂️ Temporal Train/Test Split

Unnamed: 0,Dataset,Obser,Date Range
0,Train,62540,2015-05-22 19:00:00 to 2024-03-11 22:00:00
1,Test,15637,2024-03-11 23:00:00 to 2025-12-04 13:00:00


**Split date:** 2024-03-11 23:00:00

In [146]:
#2. FEATURE IDENTIFICATION

# Quick exploration: Which features are correlated?
# First, define our target and potential features
target = 'Air Temperature'

# Select numeric features for correlation
numeric_features = df_model.select_dtypes(include='number').columns

# Exclude the target from numeric columns
available_numeric = [col for col in numeric_features if col != target]

# Calculate correlation with target
if available_numeric:
    correlation_with_target = df_model[available_numeric + [target]].corr()[target].sort_values(ascending=False)
    print("Features correlated with Air Temperature:")
    print(correlation_with_target)
    print()


Features correlated with Air Temperature:
Air Temperature                           1.000000
Wet Bulb Temperature_24h_mean             0.810171
Wet Bulb Temperature                      0.794515
wet_bulb_humidity_ratio_24h_mean          0.779297
wet_bulb_humidity_interaction_24h_mean    0.778139
wet_bulb_humidity_ratio                   0.760656
wet_bulb_humidity_interaction             0.744859
rain_difference_24h_mean                  0.394054
Total Rain_24h_mean                       0.394054
rain_difference                           0.360054
Total Rain                                0.360054
solar_totalrain_interaction_24h_mean      0.275084
month                                     0.218208
solar_totalrain_interaction               0.108761
Humidity_24h_mean                         0.097896
Heading_24h_mean                          0.093974
pressure_humidity_interaction_24h_mean    0.092252
Heading                                   0.087785
Humidity                                

In [147]:
#3. FEATURES SELECTION
# Define target variable
target = 'Air Temperature'
       
# Select features for modeling
# Include temporal, rolling characteristics
feature_cols = [
    # Temporal features
    'hour', 'day_of_week', 'month',
    # strong predictors
    'Wet Bulb Temperature','wet_bulb_humidity_ratio','wet_bulb_humidity_interaction',
    # Moderate predictors
    'rain_difference','Total Rain',
    # Weak predictors
    'solar_totalrain_interaction','Heading','Humidity','pressure_humidity_interaction',
    # Derived rolling features
    'Wet Bulb Temperature_24h_mean','Total Rain_24h_mean','Humidity_24h_mean',
    # Categorical (will need encoding)
    'wind_speed_category', 'solar_category', 'Station Name','Wet_bulb_temp_category','humidity_category'
]

# Check feature availability
available_features = [f for f in feature_cols if f in df_model.columns]
missing_features = [f for f in feature_cols if f not in df_model.columns]

display(Markdown("### 📋 Feature Availability"))
display(Markdown(f"**Available features:** `{available_features}`"))
if missing_features:
    display(Markdown(f"⚠️ **Missing features** (will skip): `{missing_features}`"))

# Select available features
X_train = train[available_features].copy()
X_test = test[available_features].copy()
y_train = train[target].copy()
y_test = test[target].copy()


### 📋 Feature Availability

**Available features:** `['hour', 'day_of_week', 'month', 'Wet Bulb Temperature', 'wet_bulb_humidity_ratio', 'wet_bulb_humidity_interaction', 'rain_difference', 'Total Rain', 'solar_totalrain_interaction', 'Heading', 'Humidity', 'pressure_humidity_interaction', 'Wet Bulb Temperature_24h_mean', 'Total Rain_24h_mean', 'Humidity_24h_mean', 'wind_speed_category', 'solar_category', 'Station Name', 'Wet_bulb_temp_category', 'humidity_category']`

In [148]:
#HANDLING CATEGORICAL VARIABLES

# Identify categorical variables
categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
numeric_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()

display(Markdown("### 🏷️ Feature Types"))
display(Markdown(f"**Categorical features:** `{categorical_cols}`"))
display(Markdown(f"**Numeric features:** `{numeric_cols}`"))

# One Hot Encoding

X_train_encoded = pd.get_dummies(X_train, columns=categorical_cols, prefix=categorical_cols, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, columns=categorical_cols, prefix=categorical_cols, drop_first=True)

# Ensure test set has same columns as training set
# Add missing columns (with 0s) and remove extra columns
for col in X_train_encoded.columns:
    if col not in X_test_encoded.columns:
        X_test_encoded[col] = 0

X_test_encoded = X_test_encoded[X_train_encoded.columns]

display(Markdown("### ✅ After One-Hot Encoding"))
display(pd.DataFrame({
    'Dataset': ['Training features', 'Test features'],
    'Shape': [
        f"{X_train_encoded.shape[0]:,} × {X_train_encoded.shape[1]}",
        f"{X_test_encoded.shape[0]:,} × {X_test_encoded.shape[1]}"
    ]
}))
display(Markdown(f"**Feature names:** `{list(X_train_encoded.columns)[:10]}...` ({len(X_train_encoded.columns)} total)"))

### 🏷️ Feature Types

**Categorical features:** `['wind_speed_category', 'solar_category', 'Station Name', 'Wet_bulb_temp_category', 'humidity_category']`

**Numeric features:** `['hour', 'day_of_week', 'month', 'Wet Bulb Temperature', 'wet_bulb_humidity_ratio', 'wet_bulb_humidity_interaction', 'rain_difference', 'Total Rain', 'solar_totalrain_interaction', 'Heading', 'Humidity', 'pressure_humidity_interaction', 'Wet Bulb Temperature_24h_mean', 'Total Rain_24h_mean', 'Humidity_24h_mean']`

### ✅ After One-Hot Encoding

Unnamed: 0,Dataset,Shape
0,Training features,"62,540 × 21"
1,Test features,"15,637 × 21"


**Feature names:** `['hour', 'day_of_week', 'month', 'Wet Bulb Temperature', 'wet_bulb_humidity_ratio', 'wet_bulb_humidity_interaction', 'rain_difference', 'Total Rain', 'solar_totalrain_interaction', 'Heading']...` (21 total)
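
Encoding the training and test sets separately with `drop_first=True` can drop a different baseline category in each set when a category is absent from the test period. One alternative, sketched below under the assumption that the category levels should come from the training set only, is to fix the levels with `pd.Categorical` before calling `pd.get_dummies`. The helper name and the toy values are hypothetical.

```python
import pandas as pd


def encode_with_train_categories(X_train, X_test, categorical_cols):
    """One-hot encode both sets using category levels learned from the training set."""
    X_train = X_train.copy()
    X_test = X_test.copy()
    for col in categorical_cols:
        categories = sorted(X_train[col].dropna().unique())
        X_train[col] = pd.Categorical(X_train[col], categories=categories)
        # Test values never seen in training become missing and encode as all zeros
        X_test[col] = pd.Categorical(X_test[col], categories=categories)
    train_enc = pd.get_dummies(X_train, columns=categorical_cols, drop_first=True)
    test_enc = pd.get_dummies(X_test, columns=categorical_cols, drop_first=True)
    return train_enc, test_enc


# Toy data (hypothetical values): the test period only contains "medium"
X_train_toy = pd.DataFrame({
    "hour": [0, 1, 2],
    "wind_speed_category": ["low", "high", "medium"],
})
X_test_toy = pd.DataFrame({
    "hour": [3, 4],
    "wind_speed_category": ["medium", "medium"],
})

train_enc, test_enc = encode_with_train_categories(
    X_train_toy, X_test_toy, ["wind_speed_category"]
)
print(test_enc)
```

Because both sets share the same category levels, the encoded columns match without a separate alignment step, and the `medium` rows in the test set keep their indicator.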

In [149]:
#4. HANDLING MISSING VALUES
# Check for missing values
display(Markdown("### 🔍 Missing Values in Training Set"))
missing_in_train = X_train_encoded.isnull().sum()[X_train_encoded.isnull().sum() > 0]
if len(missing_in_train) == 0:
    display(Markdown("✅ **No missing values!**"))
else:
    missing_df = pd.DataFrame({'Column': missing_in_train.index, 'Missing Count': missing_in_train.values})
    display(missing_df)

# Fill missing values (using training set statistics)
# For numeric columns, use median
for col in numeric_cols:
    if col in X_train_encoded.columns:
        median_val = X_train_encoded[col].median()
        X_train_encoded[col] = X_train_encoded[col].fillna(median_val)
        X_test_encoded[col] = X_test_encoded[col].fillna(median_val)

display(Markdown("### ✅ After Imputation"))
display(pd.DataFrame({
    'Dataset': ['Train', 'Test'],
    'Missing Values': [
        X_train_encoded.isnull().sum().sum(),
        X_test_encoded.isnull().sum().sum()
    ]
}))

### 🔍 Missing Values in Training Set

✅ **No missing values!**

### ✅ After Imputation

Unnamed: 0,Dataset,Missing Values
0,Train,0
1,Test,0


In [152]:
#5. SAVING FILES 

# Save prepared datasets for modeling
X_train_encoded.to_csv("output/q6_X_train.csv", index=False)
X_test_encoded.to_csv("output/q6_X_test.csv", index=False)
y_train.to_csv("output/q6_y_train.csv", index=False)
y_test.to_csv("output/q6_y_test.csv", index=False)

display(Markdown("### 💾 Prepared Datasets Saved"))
display(pd.DataFrame({
    'File': ['X_train', 'X_test', 'y_train', 'y_test'],
    'Shape': [
        f"{X_train_encoded.shape[0]:,} × {X_train_encoded.shape[1]}",
        f"{X_test_encoded.shape[0]:,} × {X_test_encoded.shape[1]}",
        f"{len(y_train):,}",
        f"{len(y_test):,}"
    ]
}))
display(Markdown("✅ **Ready for next phase: Modeling & Results!**"))

### 💾 Prepared Datasets Saved

Unnamed: 0,File,Shape
0,X_train,"62,540 × 21"
1,X_test,"15,637 × 21"
2,y_train,62540
3,y_test,15637


✅ **Ready for next phase: Modeling & Results!**

In [154]:
#6. Saving a report 
# Calculate split info
total_samples = len(train) + len(test)
# Use round() rather than int(): int() truncates, so 79.998% would be reported as 79
train_pct = round(len(train) / total_samples * 100)
test_pct = round(len(test) / total_samples * 100)

# Create formatted output
split_info = f"""TRAIN/TEST SPLIT INFORMATION
============================

Split Method: Temporal ({train_pct}/{test_pct} split by time)

Training Set Size: {len(train)} samples
Test Set Size: {len(test)} samples

Training Date Range: {train['Measurement Timestamp'].min()} to {train['Measurement Timestamp'].max()}
Test Date Range: {test['Measurement Timestamp'].min()} to {test['Measurement Timestamp'].max()}

Number of Features: {len(available_features)}
Target Variable: {target}
"""

display(Markdown(f"**Split date:** {split_date}"))
# Save to file
with open('output/q6_train_test_info.txt', 'w') as f:
    f.write(split_info)

print(split_info)
print("✓ Split information saved to output/q6_train_test_info.txt")

**Split date:** 2024-03-11 23:00:00

TRAIN/TEST SPLIT INFORMATION
============================

Split Method: Temporal (80/20 split by time)

Training Set Size: 62540 samples
Test Set Size: 15637 samples

Training Date Range: 2015-05-22 19:00:00 to 2024-03-11 22:00:00
Test Date Range: 2024-03-11 23:00:00 to 2025-12-04 13:00:00

Number of Features: 20
Target Variable: Air Temperature

✓ Split information saved to output/q6_train_test_info.txt


In [None]:
#7. DECISION POINTS
# Target: I chose to predict Air Temperature.
# Split: I used a temporal split because randomly splitting time series data would let the model train on future observations and be tested on past ones. That leaks future information into training and inflates the performance metrics.
# Feature selection: I used each feature's correlation with the target to determine which features were important. I included features with strong, moderate, and weak correlations.
# I also included temporal features in the selection, as well as rolling features for the important variables and interactions.
# Preparation: Categorical variables were encoded, and I checked for missing values. There were no missing values, so the dataset is ready for modeling.