# Final Portfolio Project - Regression Task

# Weather Temperature Prediction (Regression)

## Task 1: Exploratory Data Analysis and Data Understanding [20 Marks]

### 1.1 Choosing a Dataset

#### (a) When and by whom the dataset was created
Provide the dataset provenance here (creator/owner and year, if known).

#### (b) How and from where the dataset was accessed
Provide the dataset source here (website / repository / provider) and the access date.

#### (c) Alignment with United Nations Sustainable Development Goal (UNSDG)
This dataset relates to **SDG 13: Climate Action** and **SDG 11: Sustainable Cities and Communities** because weather conditions and temperature patterns inform planning, resilience, and climate-informed decision-making.

#### (d) List all attributes (columns) with brief descriptions
Use the column list table (produced by the "Column list" code cell below) to write brief descriptions of each attribute in your report.

#### Potential Questions the Dataset Can Answer
1. How do humidity, wind speed, and pressure relate to temperature?
2. Are there seasonal patterns in temperature across months?
3. How accurately can temperature be predicted from the available meteorological variables?

#### Dataset Quality Assessment
The cells below import the libraries, load the data, and then check missing values, duplicates, and basic statistics.

In [None]:
# Core
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing / Model selection
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Metrics (regression only)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

# Feature selection
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

In [None]:
# Load the dataset
try:
    df = pd.read_csv("weatherHistory.csv")
except FileNotFoundError:
    df = pd.read_csv("/mnt/data/weatherHistory.csv")

print("Shape:", df.shape)
df.head()

In [None]:
# Column list
pd.DataFrame({"column": df.columns, "dtype": [str(df[c].dtype) for c in df.columns]})

In [None]:
# Missing values and duplicates
missing = df.isna().sum().sort_values(ascending=False)
print("Missing values (top):")
display(missing[missing>0].head(20))

print("\nDuplicate rows:", df.duplicated().sum())

In [None]:
# Summary statistics for numeric columns
df.describe(include=[np.number]).T

### 1.2 Exploratory Data Analysis (EDA)

#### (a) Data Cleaning and Preprocessing

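We predict **Temperature (C)** as the target. The date column is parsed to extract time-based features (year, month, day, hour). In the modeling pipeline, categorical columns are imputed with the most frequent value and one-hot encoded; numeric columns are imputed with the median and standardized.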

In [None]:
# Parse datetime and create time features
df = df.copy()

df["Formatted Date"] = pd.to_datetime(df["Formatted Date"], errors="coerce", utc=True)

df["year"] = df["Formatted Date"].dt.year
df["month"] = df["Formatted Date"].dt.month
df["day"] = df["Formatted Date"].dt.day
df["hour"] = df["Formatted Date"].dt.hour

# Drop original datetime after feature extraction
df = df.drop(columns=["Formatted Date"])

# Ensure numeric columns are numeric
for col in ["Temperature (C)", "Apparent Temperature (C)", "Humidity", "Wind Speed (km/h)",
            "Wind Bearing (degrees)", "Visibility (km)", "Loud Cover", "Pressure (millibars)"]:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")

df.head()

#### (b) Visualizations to Summarize, Explore, and Understand the Data

In [None]:
# Distribution of target temperature
plt.figure(figsize=(7,4))
sns.histplot(df["Temperature (C)"].dropna(), kde=True)
plt.title("Distribution of Temperature (C)")
plt.xlabel("Temperature (C)")
plt.ylabel("Count")
plt.show()

In [None]:
# Temperature vs humidity
plt.figure(figsize=(7,5))
# Cap the sample size so the cell also runs on datasets with fewer than 5000 rows
sns.scatterplot(data=df.sample(min(5000, len(df)), random_state=RANDOM_STATE), x="Humidity", y="Temperature (C)", alpha=0.4)
plt.title("Temperature vs Humidity (sample)")
plt.show()

In [None]:
# Correlation heatmap for numeric features
# select_dtypes keeps only numeric columns, regardless of how text columns are stored
num_cols = df.select_dtypes(include="number").columns
corr = df[num_cols].corr()

plt.figure(figsize=(10,7))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation Heatmap (Numeric Features)")
plt.show()

In [None]:
# Average temperature by month ("month" was created in the datetime cell above)
month_avg = df.groupby("month")["Temperature (C)"].mean()
month_avg.plot(kind="line", marker="o")
plt.title("Average Temperature by Month")
plt.xlabel("Month")
plt.ylabel("Avg Temperature (C)")
plt.show()

# Show the monthly means for all 12 months
month_avg

#### (c) Summary of EDA Insights

Summarize key patterns observed in the plots (e.g., seasonal trend by month, correlation with apparent temperature, etc.).
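
One way to back the summary with numbers is to rank the numeric features by their correlation with the target. The sketch below is illustrative only: it runs on a small synthetic frame with made-up values, and `rank_target_correlations` is a hypothetical helper, not part of this notebook. On the real data one would presumably pass `df` and `"Temperature (C)"` instead.

```python
import numpy as np
import pandas as pd

def rank_target_correlations(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with the target, sorted by absolute value."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic stand-in for the weather data (assumed values, for illustration only).
rng = np.random.default_rng(42)
temp = rng.normal(12, 9, size=500)
demo = pd.DataFrame({
    "Temperature (C)": temp,
    "Apparent Temperature (C)": temp - 1 + rng.normal(0, 0.5, size=500),
    "Humidity": np.clip(0.9 - 0.02 * temp + rng.normal(0, 0.1, size=500), 0, 1),
    "Wind Bearing (degrees)": rng.uniform(0, 360, size=500),
})

ranked = rank_target_correlations(demo, "Temperature (C)")
print(ranked)
```

A feature that correlates almost perfectly with the target (as apparent temperature does in this synthetic example) is worth discussing in the report, since it can dominate every model.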

## Task 2: Build a Neural Network Model for Regression [15 Marks]

We use an MLPRegressor with a preprocessing pipeline (imputation + one-hot encoding + scaling).

In [None]:
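# Target and features
target = "Temperature (C)"
# Rows with a missing target cannot be used for training or evaluation
data = df.dropna(subset=[target])
X = data.drop(columns=[target])
y = data[target]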

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

# Column types
numeric_features = X_train.select_dtypes(include="number").columns.tolist()
categorical_features = X_train.select_dtypes(exclude="number").columns.tolist()

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

mlp_reg = MLPRegressor(
    hidden_layer_sizes=(64, 32),
    activation="relu",
    solver="adam",
    alpha=1e-4,
    max_iter=300,
    random_state=RANDOM_STATE
)

mlp_pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", mlp_reg)
])

mlp_pipe

In [None]:
# Train neural network regressor
mlp_pipe.fit(X_train, y_train)

pred_train = mlp_pipe.predict(X_train)
pred_test = mlp_pipe.predict(X_test)

def reg_metrics(y_true, y_pred, label=""):
    """Return MAE, MSE, RMSE and R2 as a labeled Series."""
    mse = mean_squared_error(y_true, y_pred)
    return pd.Series({
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        # Square root of MSE; works across scikit-learn versions
        "RMSE": np.sqrt(mse),
        "R2": r2_score(y_true, y_pred)
    }, name=label)

metrics_train = reg_metrics(y_train, pred_train, "Train")
metrics_test = reg_metrics(y_test, pred_test, "Test")

pd.concat([metrics_train, metrics_test], axis=1)

In [None]:
# Predicted vs actual (sample)
# Cap the sample size so the cell also runs on small test sets
sample_idx = np.random.RandomState(RANDOM_STATE).choice(len(y_test), size=min(2000, len(y_test)), replace=False)
plt.figure(figsize=(6,6))
plt.scatter(y_test.iloc[sample_idx], pred_test[sample_idx], alpha=0.3)
plt.xlabel("Actual Temperature (C)")
plt.ylabel("Predicted Temperature (C)")
plt.title("MLPRegressor: Actual vs Predicted (sample)")
plt.show()

## Task 3: Build Primary Machine Learning Models [20 Marks] (Two Classical ML Models)

### 3.1 Split Dataset into Training and Testing Sets

The same 80/20 train/test split from Task 2 is reused, so all models are evaluated on identical test data.

### 3.2 Model A: Linear Regression
### 3.3 Model B: Random Forest Regressor
### 3.4 Initial Comparison and Discussion

In [None]:
# Model A: Linear Regression
lin_reg = LinearRegression()

linreg_pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", lin_reg)
])

linreg_pipe.fit(X_train, y_train)
pred_lr = linreg_pipe.predict(X_test)

lr_metrics = reg_metrics(y_test, pred_lr, "Linear Regression")
lr_metrics

In [None]:
# Model B: Random Forest Regressor
rf_reg = RandomForestRegressor(
    n_estimators=300,
    random_state=RANDOM_STATE,
    n_jobs=-1
)

rf_pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", rf_reg)
])

rf_pipe.fit(X_train, y_train)
pred_rf = rf_pipe.predict(X_test)

rf_metrics = reg_metrics(y_test, pred_rf, "Random Forest")
rf_metrics

In [None]:
# Initial comparison (test set)
initial_comp = pd.DataFrame([lr_metrics, rf_metrics]).reset_index().rename(columns={"index":"Model"})
initial_comp

## Task 4: Hyperparameter Optimization with Cross-Validation [15 Marks]

We tune the two classical regression models with cross-validation.

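- Linear Regression has few hyperparameters; the grid tunes only fit_intercept. Feature selection (SelectKBest) is part of the pipeline, with k fixed at k_best = 30.
- Random Forest is tuned over n_estimators, max_depth, min_samples_split, and min_samples_leaf.

CV scoring uses **negative RMSE** (scikit-learn maximizes scores, so values closer to zero are better).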
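
If the number of selected features should be tuned as well, the pipeline step name can be used to address it in the grid. The following is a minimal sketch under stated assumptions, not the notebook's method: it uses synthetic data from `make_regression`, swaps in `f_regression` as the score function to keep the run fast and deterministic, and the names `demo_pipe`, `demo_grid`, and `demo_gs` are hypothetical. It also shows how to turn the negative RMSE score back into an RMSE.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data standing in for the preprocessed weather features.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=12, n_informative=4, noise=5.0, random_state=42
)

demo_pipe = Pipeline(steps=[
    ("select", SelectKBest(score_func=f_regression)),
    ("model", LinearRegression()),
])

# "select__k" addresses the k parameter of the "select" step.
demo_grid = {
    "select__k": [2, 4, 8, "all"],
    "model__fit_intercept": [True, False],
}

demo_gs = GridSearchCV(
    demo_pipe,
    param_grid=demo_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
demo_gs.fit(X_demo, y_demo)

# best_score_ is a negative RMSE, so flip the sign to report RMSE.
best_rmse = -demo_gs.best_score_
print(demo_gs.best_params_, round(best_rmse, 3))
```

Keeping the selector inside the pipeline means it is refit on each training fold, so the held-out fold never influences which features are chosen.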

In [None]:
# 4.1 Linear Regression tuning (including SelectKBest as part of pipeline)
neg_rmse = "neg_root_mean_squared_error"

k_best = 30

linreg_tune_pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("select", SelectKBest(score_func=mutual_info_regression, k=k_best)),
    ("model", LinearRegression())
])

linreg_param_grid = {
    "model__fit_intercept": [True, False],
}

linreg_gs = GridSearchCV(
    linreg_tune_pipe,
    param_grid=linreg_param_grid,
    scoring=neg_rmse,
    cv=5,
    n_jobs=-1
)
linreg_gs.fit(X_train, y_train)

linreg_gs.best_params_, linreg_gs.best_score_

In [None]:
# 4.2 Random Forest tuning
rf_tune_pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("select", SelectKBest(score_func=mutual_info_regression, k=k_best)),
    ("model", RandomForestRegressor(random_state=RANDOM_STATE, n_jobs=-1))
])

rf_param_grid = {
    "model__n_estimators": [200, 400],
    "model__max_depth": [None, 10, 20],
    "model__min_samples_split": [2, 5],
    "model__min_samples_leaf": [1, 2]
}

rf_gs = GridSearchCV(
    rf_tune_pipe,
    param_grid=rf_param_grid,
    scoring=neg_rmse,
    cv=5,
    n_jobs=-1
)
rf_gs.fit(X_train, y_train)

rf_gs.best_params_, rf_gs.best_score_

In [None]:
# 4.3 Summary of Best Hyperparameters and CV scores
pd.DataFrame([
    {"Model": "Linear Regression", "Best CV Score (neg RMSE)": linreg_gs.best_score_, "Best Params": linreg_gs.best_params_},
    {"Model": "Random Forest", "Best CV Score (neg RMSE)": rf_gs.best_score_, "Best Params": rf_gs.best_params_},
])

## Task 5: Feature Selection [10 Marks]

A filter method is applied: after preprocessing, SelectKBest keeps the k features with the highest mutual information with the target. The same selection step is used in both classical regression pipelines.

In [None]:
# CV score (neg RMSE) of the untuned pipelines that include SelectKBest
fs_linreg_cv = cross_val_score(linreg_tune_pipe, X_train, y_train, cv=5, scoring=neg_rmse, n_jobs=-1).mean()
fs_rf_cv = cross_val_score(rf_tune_pipe, X_train, y_train, cv=5, scoring=neg_rmse, n_jobs=-1).mean()

fs_linreg_cv, fs_rf_cv

## Task 6: Final Models and Comparative Analysis [10 Marks]

Rebuild both models using:
- Best hyperparameters from Task 4
- Selected features from Task 5

Evaluate on the test set and compare.

In [None]:
# Final models (already include SelectKBest).
# GridSearchCV refits best_estimator_ on the full training set by default
# (refit=True), so no additional fit call is needed.
final_linreg = linreg_gs.best_estimator_
final_rf = rf_gs.best_estimator_

pred_linreg = final_linreg.predict(X_test)
pred_rf = final_rf.predict(X_test)

final_lr_metrics = reg_metrics(y_test, pred_linreg, "Final Linear Regression")
final_rf_metrics = reg_metrics(y_test, pred_rf, "Final Random Forest")

final_lr_metrics, final_rf_metrics

In [None]:
# Comparison table (similar to Table 2 / Table 5 in the assignment)
comparison_table = pd.DataFrame([
    {
        "Model": "Linear Regression (Final)",
        "Features Used": f"SelectKBest(k={k_best})",
        "CV Score (neg RMSE)": linreg_gs.best_score_,
        "Test MAE": final_lr_metrics["MAE"],
        "Test RMSE": final_lr_metrics["RMSE"],
        "Test R2": final_lr_metrics["R2"],
    },
    {
        "Model": "Random Forest (Final)",
        "Features Used": f"SelectKBest(k={k_best})",
        "CV Score (neg RMSE)": rf_gs.best_score_,
        "Test MAE": final_rf_metrics["MAE"],
        "Test RMSE": final_rf_metrics["RMSE"],
        "Test R2": final_rf_metrics["R2"],
    },
])

comparison_table

## Task 7: Report Quality and Presentation [5 Marks]

- Code is organized into tasks and uses pipelines for reproducibility.
- Visualizations are labeled clearly.
- Tables summarize results for comparison.

## Task 8: Conclusion and Reflection [5 Marks]

1. **Model Performance:** Discuss which final model performed best and why (use MAE/RMSE/R2).
2. **Impact of Methods:** Explain how cross-validation tuning and feature selection affected performance.
3. **Insights and Future Directions:** State key insights from EDA/modeling and suggest improvements (e.g., feature engineering, trying other models).
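
As one possible feature-engineering direction for point 3, periodic time features such as hour or month could be encoded on a circle so that hour 23 sits next to hour 0. The sketch below is only an illustration under that assumption; `add_cyclical` and `encoded_distance` are hypothetical helpers and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def add_cyclical(frame: pd.DataFrame, column: str, period: int) -> pd.DataFrame:
    """Append sine/cosine encodings of a periodic integer column."""
    out = frame.copy()
    angle = 2 * np.pi * out[column] / period
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out

hours = pd.DataFrame({"hour": [0, 12, 23]})
encoded = add_cyclical(hours, "hour", period=24)

def encoded_distance(row_a: int, row_b: int) -> float:
    """Euclidean distance between two rows in the (sin, cos) plane."""
    cols = ["hour_sin", "hour_cos"]
    diff = encoded.loc[row_a, cols] - encoded.loc[row_b, cols]
    return float(np.sqrt((diff ** 2).sum()))

print(encoded)
# Hour 23 vs hour 0, then hour 12 vs hour 0.
print(encoded_distance(2, 0), encoded_distance(1, 0))
```

In the encoded space, hour 23 ends up much closer to hour 0 than hour 12 does, which the raw integer column cannot express.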