# Machine Learning Lab: Radio Signal Strength - INSTRUCTOR SOLUTIONS
## CSC 2053 - 90-Minute Introduction to Regression

**Note to Instructors:** This notebook contains complete solutions to Exercises 1 and 2 from `radio_ml_lab.ipynb`.

---

## Setup: Load Data and Run Base Models

First, let's run all the code from the lab to get Models 1-3 trained, so we have a baseline for comparison.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

plt.style.use('seaborn-v0_8-darkgrid')
pd.set_option('display.max_columns', 20)

print("Libraries imported successfully!")

In [None]:
# Load the data
url = "https://raw.githubusercontent.com/CSC-2053-100-Fall25/python-ml-template/main/fm_stations_25_locations_ml_dataset.csv"
df = pd.read_csv(url)

print(f"Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns")

In [None]:
# Clean the data
ml_cols = ['distance', 'field_strength', 'erp', 'frequency', 'haat']
df_clean = df.dropna(subset=ml_cols).copy()

df_clean = df_clean[
    (df_clean['distance'] > 0) & 
    (df_clean['erp'] > 0) & 
    (df_clean['field_strength'] > -100)
]

print(f"Clean dataset: {df_clean.shape[0]} rows")

In [None]:
# Quick-train Models 1, 2, and 3 for baseline comparison

# Model 1: Distance only
X1 = df_clean[['distance']]
y1 = df_clean['field_strength']
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, random_state=42)
model_1 = LinearRegression()
model_1.fit(X1_train, y1_train)
y1_pred = model_1.predict(X1_test)
r2_1 = r2_score(y1_test, y1_pred)
rmse_1 = np.sqrt(mean_squared_error(y1_test, y1_pred))

# Model 2: Multiple features
feature_cols = ['distance', 'erp', 'frequency', 'haat']
X2 = df_clean[feature_cols]
y2 = df_clean['field_strength']
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=42)
model_2 = LinearRegression()
model_2.fit(X2_train, y2_train)
y2_pred = model_2.predict(X2_test)
r2_2 = r2_score(y2_test, y2_pred)
rmse_2 = np.sqrt(mean_squared_error(y2_test, y2_pred))

# Model 3: With engineered features
df_engineered = df_clean.copy()
df_engineered['distance_squared'] = df_engineered['distance'] ** 2
df_engineered['log_distance'] = np.log(df_engineered['distance'] + 1)
df_engineered['log_erp'] = np.log(df_engineered['erp'] + 1)
df_engineered['power_per_mile'] = df_engineered['erp'] / (df_engineered['distance'] + 1)

feature_cols_eng = ['distance', 'erp', 'frequency', 'haat', 
                    'distance_squared', 'log_distance', 'log_erp', 'power_per_mile']
X3 = df_engineered[feature_cols_eng]
y3 = df_engineered['field_strength']
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size=0.2, random_state=42)
model_3 = LinearRegression()
model_3.fit(X3_train, y3_train)
y3_pred = model_3.predict(X3_test)
r2_3 = r2_score(y3_test, y3_pred)
rmse_3 = np.sqrt(mean_squared_error(y3_test, y3_pred))

print("Baseline Models Trained:")
print(f"Model 1 (Distance): R² = {r2_1:.4f}, RMSE = {rmse_1:.2f}")
print(f"Model 2 (Multiple):  R² = {r2_2:.4f}, RMSE = {rmse_2:.2f}")
print(f"Model 3 (Engineered): R² = {r2_3:.4f}, RMSE = {rmse_3:.2f}")

---
# Exercise 1 Solution: Add Geographic Features

**Task:** Create a new model that includes bearing (direction) information. Starting from the `bearing` column, derive `sin(bearing)` and `cos(bearing)` and add both as features.

**Why sin/cos for bearings?** Bearings are circular (0° and 360° are the same), so we convert them to sin/cos to capture this cyclical nature mathematically.

In [None]:
# SOLUTION

# Step 1: Examine the bearing column
print("Bearing statistics:")
print(df_clean['bearing'].describe())
print(f"\nBearing data type: {df_clean['bearing'].dtype}")
print(f"Missing bearings: {df_clean['bearing'].isna().sum()}")

In [None]:
# Step 2: Create bearing features (sin and cos for cyclic nature of angles)
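df_with_bearing = df_engineered.copy()

# 'bearing' was not included in the earlier dropna, so coerce it to numeric and
# drop rows without a usable value (LinearRegression cannot handle NaN)
df_with_bearing['bearing'] = pd.to_numeric(df_with_bearing['bearing'], errors='coerce')
df_with_bearing = df_with_bearing.dropna(subset=['bearing'])

# Convert bearing to radians for sin/cos calculations
df_with_bearing['bearing_rad'] = np.deg2rad(df_with_bearing['bearing'])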

# Create sin and cos features
df_with_bearing['bearing_sin'] = np.sin(df_with_bearing['bearing_rad'])
df_with_bearing['bearing_cos'] = np.cos(df_with_bearing['bearing_rad'])

print("Created bearing features:")
print(f"  bearing_sin range: [{df_with_bearing['bearing_sin'].min():.3f}, {df_with_bearing['bearing_sin'].max():.3f}]")
print(f"  bearing_cos range: [{df_with_bearing['bearing_cos'].min():.3f}, {df_with_bearing['bearing_cos'].max():.3f}]")

In [None]:
# Step 3: Combine with engineered features from Model 3
feature_cols_with_bearing = [
    'distance', 'erp', 'frequency', 'haat',
    'distance_squared', 'log_distance', 'log_erp', 'power_per_mile',
    'bearing_sin', 'bearing_cos'  # Add geographic features
]

X_bearing = df_with_bearing[feature_cols_with_bearing]
y_bearing = df_with_bearing['field_strength']

print(f"\nModel 4 feature set: {len(feature_cols_with_bearing)} features")
print(f"Features: {feature_cols_with_bearing}")

In [None]:
# Step 4: Train Model 4 with bearing features
X_train_4, X_test_4, y_train_4, y_test_4 = train_test_split(
    X_bearing, y_bearing, test_size=0.2, random_state=42
)

model_4 = LinearRegression()
model_4.fit(X_train_4, y_train_4)

# Make predictions
y_pred_4 = model_4.predict(X_test_4)

# Evaluate
r2_4 = r2_score(y_test_4, y_pred_4)
rmse_4 = np.sqrt(mean_squared_error(y_test_4, y_pred_4))
mae_4 = mean_absolute_error(y_test_4, y_pred_4)

print("=== Model 4 Performance (with bearing features) ===")
print(f"R² Score: {r2_4:.4f}")
print(f"RMSE: {rmse_4:.2f} dBu")
print(f"MAE: {mae_4:.2f} dBu")

In [None]:
# Step 5: Compare to Model 3
comparison_df = pd.DataFrame({
    'Model': ['Model 3: Engineered Features', 'Model 4: + Bearing Features'],
    'Features': [len(feature_cols_eng), len(feature_cols_with_bearing)],
    'R²': [r2_3, r2_4],
    'RMSE (dBu)': [rmse_3, rmse_4],
    'R² Improvement': [0, r2_4 - r2_3],
    'RMSE Improvement': [0, rmse_3 - rmse_4]
})

print("\n" + "="*80)
print("MODEL COMPARISON: Does Bearing Help?")
print("="*80)
print(comparison_df.to_string(index=False))
print("="*80)

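# Report the absolute change in R²; a percentage of r2_3 is misleading when r2_3 is small or negative
r2_gain = r2_4 - r2_3
if r2_gain > 0:
    print(f"\n✓ Bearing features raised test R² by {r2_gain:.4f} (absolute)")
    print("   (Small gains on a single train/test split may not be meaningful)")
else:
    print("\n✗ Bearing features did not improve test R²")
    print("   (This is okay - not all features help!)")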

In [None]:
# Visualize the comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: Actual vs Predicted for Model 4
axes[0].scatter(y_test_4, y_pred_4, alpha=0.5, s=20)
axes[0].plot([y_test_4.min(), y_test_4.max()], 
             [y_test_4.min(), y_test_4.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual Field Strength (dBu)', fontsize=12)
axes[0].set_ylabel('Predicted Field Strength (dBu)', fontsize=12)
axes[0].set_title(f'Model 4 with Bearing Features (R² = {r2_4:.3f})', fontsize=13)
axes[0].grid(True, alpha=0.3)

# Right plot: Model coefficients
# (Features are unscaled, so bar length reflects each feature's units, not its importance)
feature_importance = pd.DataFrame({
    'Feature': feature_cols_with_bearing,
    'Coefficient': model_4.coef_,
    'Abs_Coefficient': np.abs(model_4.coef_)
}).sort_values('Abs_Coefficient', ascending=True)

axes[1].barh(feature_importance['Feature'], feature_importance['Coefficient'])
axes[1].set_xlabel('Coefficient Value (per unit of each feature)', fontsize=12)
axes[1].set_title('Model 4 Coefficients (unscaled features)', fontsize=13)
axes[1].axvline(x=0, color='black', linestyle='-', linewidth=0.8)
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

### Analysis: Why Sin/Cos for Bearings?

Bearings are **circular** measurements:
- 0° (North) and 360° are the same direction
- But numerically, 0 and 360 seem far apart

By converting to sin/cos:
- `sin(0°) = 0, cos(0°) = 1`
- `sin(360°) = 0, cos(360°) = 1`
- Both angles map to the same (sin, cos) pair, so the model treats them as the same direction
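
The effect can be checked numerically. The following is a minimal sketch, not part of the original lab: `encode_bearing` is a hypothetical helper name, and the three sample bearings are made up for illustration.

```python
import numpy as np

def encode_bearing(degrees):
    """Map bearings in degrees to (sin, cos) pairs on the unit circle."""
    radians = np.deg2rad(np.asarray(degrees, dtype=float))
    return np.column_stack([np.sin(radians), np.cos(radians)])

# Illustrative bearings: 1° and 359° are nearly the same direction, 180° is opposite
bearings = np.array([1.0, 359.0, 180.0])
encoded = encode_bearing(bearings)

raw_gap = abs(bearings[0] - bearings[1])             # gap in raw degrees
near_gap = np.linalg.norm(encoded[0] - encoded[1])   # 1° vs 359° after encoding
far_gap = np.linalg.norm(encoded[0] - encoded[2])    # 1° vs 180° after encoding

print(f"Raw gap between 1° and 359°:  {raw_gap:.0f}")
print(f"Encoded distance, 1° vs 359°: {near_gap:.3f}")
print(f"Encoded distance, 1° vs 180°: {far_gap:.3f}")
```

In raw degrees the two near-identical directions look maximally different, while in the encoded space they sit close together and the opposite direction sits far away.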

**Key Insight:** Whether bearing helps depends on:
- Are there directional patterns in signal propagation?
- Do mountains/terrain affect certain directions more?
- Is there systematic bias in antenna orientations?
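
One way to look for such a pattern is to read it off the fitted bearing coefficients. The bearing contribution of a linear model, `a*sin(θ) + b*cos(θ)`, can be rewritten as `A*cos(θ - φ)` with `A = sqrt(a² + b²)` and `φ = atan2(a, b)`, so the two coefficients together describe one preferred direction φ and its strength A. The sketch below uses hypothetical coefficient values; in this notebook they would be the entries of `model_4.coef_` at the positions of `bearing_sin` and `bearing_cos`. `bearing_effect` is an illustrative helper name.

```python
import numpy as np

def bearing_effect(coef_sin, coef_cos):
    """Return (amplitude, peak bearing in degrees) of a*sin(θ) + b*cos(θ)."""
    amplitude = np.hypot(coef_sin, coef_cos)
    peak_deg = np.rad2deg(np.arctan2(coef_sin, coef_cos)) % 360
    return amplitude, peak_deg

# Hypothetical coefficients, NOT values from the lab dataset
a, b = 2.0, -2.0
amplitude, peak_deg = bearing_effect(a, b)

# Sanity check: both forms give the same contribution at every bearing
theta = np.deg2rad(np.arange(0, 360, 15))
direct = a * np.sin(theta) + b * np.cos(theta)
rewritten = amplitude * np.cos(theta - np.deg2rad(peak_deg))
forms_agree = np.allclose(direct, rewritten)

print(f"Directional amplitude: {amplitude:.2f} dBu")
print(f"Strongest at bearing:  {peak_deg:.0f}°")
print(f"Forms agree: {forms_agree}")
```

An amplitude close to zero suggests the model found little directional effect. Because a single sin/cos pair can only represent one smooth lobe, more complex directional patterns would not show up this way.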

---
# Exercise 2 Solution: Predict for a Specific Location

**Task:** Filter for New York stations and:
1. Use Model 3 (or Model 4) to predict field strength
2. Find top 10 stations with strongest predicted signals
3. Compare predictions to actual values
4. **Bonus:** Visualize predicted vs actual

In [None]:
# SOLUTION

# Step 1: Check which search_location names might contain New York
print("Available search locations:")
unique_locations = df_clean['search_location'].dropna().unique()
print(unique_locations)

# Look for New York (str() guards against non-string labels)
ny_locations = [loc for loc in unique_locations if 'New York' in str(loc)]
print(f"\nNew York-related locations: {ny_locations}")

In [None]:
# Step 2: Filter for New York (adjust location name as needed)
# If "New York, NY" doesn't exist, use the closest match from your dataset
ny_search = "New York, NY"  # Adjust based on your data

# Try to find the exact match or closest
if ny_search in unique_locations:
    ny_stations = df_engineered[df_engineered['search_location'] == ny_search].copy()
else:
    # Fallback: filter by approximate coordinates (40.7128°N, 74.0060°W)
    # Get stations searched from near New York coordinates
    ny_stations = df_engineered[
        (df_engineered['search_lat'].between(40.2, 41.2)) &
        (df_engineered['search_lon'].between(-74.5, -73.5))
    ].copy()
    print(f"Using coordinate-based filter near New York")

print(f"\nFound {len(ny_stations)} stations from New York search")

In [None]:
# Step 3: Create engineered features for New York subset
# (These should already exist if using df_engineered)
print("New York stations dataset:")
print(ny_stations[['callsign', 'frequency', 'distance', 'field_strength', 'city']].head())

In [None]:
# Step 4: Make predictions using Model 3
X_ny = ny_stations[feature_cols_eng]
y_ny_actual = ny_stations['field_strength']

# Predict using Model 3
y_ny_pred = model_3.predict(X_ny)

# Add predictions to dataframe
ny_stations['predicted_field_strength'] = y_ny_pred
ny_stations['prediction_error'] = y_ny_actual - y_ny_pred

print("Predictions made for all New York stations!")

In [None]:
# Step 5: Find top 10 stations with strongest PREDICTED signals
top_10_predicted = ny_stations.nlargest(10, 'predicted_field_strength')

print("\n" + "="*90)
print("TOP 10 STATIONS BY PREDICTED FIELD STRENGTH (New York Area)")
print("="*90)

results = top_10_predicted[[
    'callsign', 'frequency', 'city', 'distance',
    'field_strength', 'predicted_field_strength', 'prediction_error'
]].copy()

results.columns = ['Callsign', 'Freq (MHz)', 'City', 'Dist (mi)', 
                   'Actual (dBu)', 'Predicted (dBu)', 'Error (dBu)']

print(results.to_string(index=False))
print("="*90)

In [None]:
# Step 6: Compare predictions to actual values (statistics)
mae_ny = mean_absolute_error(y_ny_actual, y_ny_pred)
rmse_ny = np.sqrt(mean_squared_error(y_ny_actual, y_ny_pred))
r2_ny = r2_score(y_ny_actual, y_ny_pred)

print("\nPrediction Accuracy for New York Stations:")
print(f"  R² Score: {r2_ny:.4f}")
print(f"  MAE: {mae_ny:.2f} dBu")
print(f"  RMSE: {rmse_ny:.2f} dBu")
print(f"\nOn average, predictions are off by {mae_ny:.1f} dBu")

In [None]:
# BONUS: Visualize predicted vs actual for New York stations
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left plot: Predicted vs Actual
axes[0].scatter(y_ny_actual, y_ny_pred, alpha=0.6, s=40)
axes[0].plot([y_ny_actual.min(), y_ny_actual.max()],
             [y_ny_actual.min(), y_ny_actual.max()], 
             'r--', lw=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Field Strength (dBu)', fontsize=12)
axes[0].set_ylabel('Predicted Field Strength (dBu)', fontsize=12)
axes[0].set_title(f'New York Stations: Predicted vs Actual\n(R² = {r2_ny:.3f}, MAE = {mae_ny:.1f} dBu)', 
                  fontsize=13)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Right plot: Top 10 comparison bar chart
x_pos = np.arange(len(top_10_predicted))
width = 0.35

axes[1].bar(x_pos - width/2, top_10_predicted['field_strength'], 
            width, label='Actual', alpha=0.8, color='blue')
axes[1].bar(x_pos + width/2, top_10_predicted['predicted_field_strength'], 
            width, label='Predicted', alpha=0.8, color='orange')

axes[1].set_xlabel('Station', fontsize=12)
axes[1].set_ylabel('Field Strength (dBu)', fontsize=12)
axes[1].set_title('Top 10 Strongest Stations: Actual vs Predicted', fontsize=13)
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(top_10_predicted['callsign'], rotation=45, ha='right')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

In [None]:
# Additional analysis: Which stations had the worst predictions?
worst_predictions = ny_stations.nlargest(5, 'prediction_error', keep='first')

print("\nStations with LARGEST POSITIVE ERRORS (model under-predicted):")
print(worst_predictions[['callsign', 'city', 'field_strength', 
                         'predicted_field_strength', 'prediction_error']].to_string(index=False))

worst_predictions_neg = ny_stations.nsmallest(5, 'prediction_error', keep='first')
print("\nStations with LARGEST NEGATIVE ERRORS (model over-predicted):")
print(worst_predictions_neg[['callsign', 'city', 'field_strength', 
                              'predicted_field_strength', 'prediction_error']].to_string(index=False))

### Key Takeaways from New York Predictions:

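1. **Model Performance:** Compare `r2_ny` and `mae_ny` with Model 3's overall test metrics to judge how well the model, trained on all 25 locations, fits New York specifically. Most New York rows were part of the 80% training split, so these numbers are optimistic rather than a true held-out test.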

2. **Prediction Errors:** 
   - Positive errors = model under-predicted (actual was stronger)
   - Negative errors = model over-predicted (actual was weaker)
   - Large errors suggest factors the model doesn't capture (terrain, directional antennas, etc.)

3. **Practical Use:** You could use these predictions to:
   - Plan which stations to try tuning in
   - Estimate reception quality before visiting a location
   - Compare different locations for radio reception

4. **Model Limitations:** 
   - Doesn't account for local terrain (buildings, hills)
   - Doesn't know about directional antennas
   - Weather and atmospheric conditions not included
   - But still provides useful estimates!
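
To measure how well the model generalizes to a location it has never seen, one option is to hold out every row from that location before training. The sketch below is an illustration on synthetic data, not the lab's method: the location names, coefficients, and noise level are all made up, and only the column names `search_location`, `log_distance`, `log_erp`, and `field_strength` mirror the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

rng = np.random.default_rng(42)

# Synthetic stand-in for df_engineered with three made-up locations
n = 300
demo = pd.DataFrame({
    'search_location': rng.choice(['Location A', 'Location B', 'Location C'], size=n),
    'log_distance': rng.uniform(0.5, 5.0, size=n),
    'log_erp': rng.uniform(0.0, 4.0, size=n),
})
demo['field_strength'] = (
    60 - 12 * demo['log_distance'] + 5 * demo['log_erp'] + rng.normal(0, 2, size=n)
)

features = ['log_distance', 'log_erp']
held_out = 'Location C'                       # plays the role of New York
is_test = demo['search_location'] == held_out

# Train only on rows from the other locations
model = LinearRegression()
model.fit(demo.loc[~is_test, features], demo.loc[~is_test, 'field_strength'])

# Evaluate only on the held-out location
pred = model.predict(demo.loc[is_test, features])
r2_held = r2_score(demo.loc[is_test, 'field_strength'], pred)
mae_held = mean_absolute_error(demo.loc[is_test, 'field_strength'], pred)

print(f"Held-out rows: {is_test.sum()}")
print(f"Held-out R²:   {r2_held:.3f}")
print(f"Held-out MAE:  {mae_held:.2f} dBu")
```

With the real data, the same mask on `df_engineered['search_location']` would give a New York score that does not benefit from New York rows seen during training.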

---
## Additional Notes for Instructors

### Grading Rubric Suggestions:

**Exercise 1 (50 points):**
- Creates sin/cos bearing features (15 pts)
- Combines with Model 3 features (10 pts)
- Trains Model 4 correctly (10 pts)
- Evaluates and compares to Model 3 (10 pts)
- Interprets results (5 pts)

**Exercise 2 (50 points):**
- Filters for New York correctly (10 pts)
- Makes predictions using trained model (10 pts)
- Identifies top 10 stations (10 pts)
- Compares actual vs predicted (10 pts)
- Visualization (10 pts)

### Common Student Issues:

**Exercise 1:**
- Forgetting to convert degrees to radians
- Not understanding why sin/cos is needed
- Using bearing as a single feature instead of sin/cos
- Not properly combining with existing features
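
The radians mistake is easy to miss because it fails silently. A short demonstration, using an illustrative bearing of 90° (due East):

```python
import numpy as np

bearing_deg = 90.0  # due East, illustrative value

wrong = np.sin(bearing_deg)               # NumPy interprets 90 as radians
right = np.sin(np.deg2rad(bearing_deg))   # convert to radians first

print(f"np.sin(90)             = {wrong:.3f}")
print(f"np.sin(np.deg2rad(90)) = {right:.3f}")
```

Both results lie in [-1, 1], so printing the min/max range of `bearing_sin` will not reveal the bug. Spot-checking a known angle such as 90° is a more reliable test.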

**Exercise 2:**
- Location name doesn't match exactly
- Forgetting to create engineered features for New York subset
- Trying to predict without trained model
- Confusion between predicted and actual values
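
When the location name does not match exactly, a tolerant comparison often resolves it. The sketch below assumes the mismatch comes from case or stray whitespace; `match_location` is a hypothetical helper and the labels are invented, not taken from the dataset.

```python
import pandas as pd

# Hypothetical labels showing typical inconsistencies (case, spaces, missing values)
locations = pd.Series(['New York, NY ', 'new york, ny', 'Boston, MA', None])

def match_location(series, query):
    """Case-insensitive, whitespace-tolerant substring match; missing values never match."""
    normalized = series.fillna('').str.strip().str.lower()
    return normalized.str.contains(query.strip().lower(), regex=False)

mask = match_location(locations, 'New York')
print(locations[mask].tolist())
```

In the notebook, the same mask could be built from `df_engineered['search_location']` and used in place of the exact `==` comparison.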

### Discussion Points:

1. **When does bearing matter?**
   - If terrain is directional (mountains to the west)
   - If urban areas block certain directions
   - If there are systematic antenna patterns

2. **Why might predictions be off?**
   - Local terrain we don't account for
   - Directional antennas (not in our features)
   - Weather/atmospheric conditions
   - Interference from other stations

3. **Real-world applications:**
   - Radio station planning (where to build)
   - Coverage estimation for emergency broadcasts
   - Cell tower placement (similar physics)
   - WiFi network design
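
The shared physics also suggests why `log_distance` and `log_erp` are natural features. Under idealized free-space propagation (an assumption that ignores terrain, ground reflections, and antenna patterns), field strength falls as 1/d and grows with the square root of radiated power, so on a decibel scale it is linear in log10(d). The sketch below checks this on made-up values; `free_space_field_db` is a hypothetical helper, the units are assumed, and the reference level is arbitrary.

```python
import numpy as np

def free_space_field_db(power_kw, distance_km):
    """Idealized free-space field strength in dB relative to an arbitrary reference."""
    field = np.sqrt(power_kw) / distance_km    # E is proportional to sqrt(P) / d
    return 20 * np.log10(field)

# Illustrative distances and transmitter power (not from the lab dataset)
distances = np.array([1.0, 2.0, 5.0, 10.0, 20.0, 50.0])
field_db = free_space_field_db(power_kw=10.0, distance_km=distances)

# Fit a straight line against log10(distance)
slope, intercept = np.polyfit(np.log10(distances), field_db, deg=1)

print(f"Slope: {slope:.1f} dB per decade of distance")
print(f"Intercept: {intercept:.1f} dB")
```

The lab's features use the natural log and add 1 before taking it, so the fitted coefficients will be scaled differently, but the same near-linear relationship is what a linear model can exploit.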

### Extension Ideas:
- Try other ML algorithms (Random Forest, XGBoost)
- Add more locations to the training set
- Create a web app for signal prediction
- Analyze which features matter most
- Compare day vs night signal propagation
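
For the "which features matter most" extension, raw coefficients can mislead because each one is expressed per unit of its own feature. One common remedy is to compare standardized coefficients; for ordinary least squares with an intercept, multiplying each coefficient by its feature's standard deviation gives the same value as refitting on z-scored features. The sketch below uses synthetic data with made-up coefficients, and plain NumPy least squares stands in for `LinearRegression`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic features on very different scales (illustrative only)
distance = rng.uniform(1, 100, n)    # wide numeric range
log_erp = rng.uniform(0, 5, n)       # narrow numeric range
y = 70 - 0.3 * distance + 4.0 * log_erp + rng.normal(0, 1, n)

X = np.column_stack([distance, log_erp])
design = np.column_stack([np.ones(n), X])            # add intercept column
coef = np.linalg.lstsq(design, y, rcond=None)[0][1:]  # drop the intercept

# Change in y per one standard deviation of each feature
standardized = coef * X.std(axis=0)

for name, raw, std in zip(['distance', 'log_erp'], coef, standardized):
    print(f"{name:>10}: raw = {raw:+.3f}, standardized = {std:+.3f}")
```

Here the raw coefficient of `log_erp` is larger in magnitude, yet `distance` has the larger standardized effect, which is the kind of reversal students should watch for when reading a coefficient bar chart.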