**Data Cleaning & Preprocessing**

This notebook:
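- Loads `weatherAUS.csv`
- Normalizes column names, drops rows with a missing `RainTomorrow` target, and parses `Date`
- Creates lag-1 features per `Location` (humidity, pressure, temperature, rainfall)
- Fills missing values (median for numeric columns, mode for categorical columns)
- Encodes `RainTomorrow` as 1/0
- Saves the cleaned dataset to `clean_weather.csv`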

In [24]:
import pandas as pd
import numpy as np
from pathlib import Path

In [25]:
csv_path = Path("/content/weatherAUS.csv")
if not csv_path.exists():
    raise FileNotFoundError(f"Input file not found: {csv_path}")

df = pd.read_csv(csv_path)
df.columns = [c.strip().replace(" ", "_") for c in df.columns]
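
To illustrate what the column clean-up does, here is a minimal sketch on a toy frame. The untidy header names are made up for the example and are not taken from `weatherAUS.csv`.

```python
import pandas as pd

# Toy frame with untidy headers (hypothetical names, for illustration only)
toy = pd.DataFrame({" Min Temp ": [8.8], "Rain Today": ["Yes"], "Location": ["Adelaide"]})

# Same rule as the cell above: trim both ends, then replace inner spaces
toy.columns = [c.strip().replace(" ", "_") for c in toy.columns]
cleaned_names = list(toy.columns)
print(cleaned_names)
```

Headers that are already clean, such as `Location`, pass through unchanged.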

In [26]:
# Ensure target present and drop rows with missing target
if "RainTomorrow" not in df.columns:
    raise ValueError("Expected column 'RainTomorrow' not found.")
df = df[~df["RainTomorrow"].isna()].copy()

In [27]:
# Parse date
if "Date" in df.columns:
    df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
else:
    df["Date"] = pd.NaT
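
With `errors="coerce"`, unparseable dates turn into `NaT` without any error, so it may be worth counting them after parsing. A small sketch on made-up strings (not rows from the dataset):

```python
import pandas as pd

# Toy input: the middle entry is deliberately invalid
raw_dates = pd.Series(["2008-07-01", "not a date", "2008-07-03"])

# Invalid entries become NaT instead of raising
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Count how many values failed to parse
n_bad = int(parsed.isna().sum())
print("Unparseable dates:", n_bad)
```

The same `isna().sum()` check could be run on `df["Date"]` right after the cell above.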

In [28]:
# Create lag features by Location
lag_cols = ["Rainfall", "Humidity3pm", "Humidity9am", "Temp3pm", "Temp9am", "Pressure3pm", "Pressure9am"]
existing_lags = [c for c in lag_cols if c in df.columns]
if "Location" in df.columns and "Date" in df.columns:
    df = df.sort_values(["Location", "Date"])
    for c in existing_lags:
        df[f"{c}_lag1"] = df.groupby("Location")[c].shift(1)
else:
    for c in existing_lags:
        df[f"{c}_lag1"] = df[c].shift(1)
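
The grouped shift above can be checked on a toy frame. This sketch uses two made-up locations, `A` and `B`, and shows that the first row of each location gets `NaN` instead of borrowing a value from the other location.

```python
import pandas as pd

# Toy data, deliberately out of order (hypothetical values)
toy = pd.DataFrame({
    "Location": ["A", "B", "A", "B"],
    "Date": pd.to_datetime(["2020-01-02", "2020-01-01", "2020-01-01", "2020-01-02"]),
    "Rainfall": [2.0, 10.0, 1.0, 20.0],
})

# Sort so that shift(1) means "previous observation at the same location"
toy = toy.sort_values(["Location", "Date"])
toy["Rainfall_lag1"] = toy.groupby("Location")["Rainfall"].shift(1)

lagged = toy["Rainfall_lag1"].tolist()
print(toy)
```

Note that `shift(1)` takes the previous row, not the previous calendar day. If a location has gaps in its dates, the lag spans the gap.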

In [29]:
# Simple imputation using global statistics over the whole dataset:
# - Numeric columns (including the *_lag1 features): median
# - Categorical columns: mode, or "Unknown" if the column is entirely missing
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

for col in num_cols:
    df[col] = df[col].fillna(df[col].median())

for col in cat_cols:
    modes = df[col].mode()  # compute once; empty if the column has no values
    mode_val = modes.iloc[0] if not modes.empty else "Unknown"
    df[col] = df[col].fillna(mode_val)
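
The sketch below shows the effect of the global median and mode fill on toy values. It also shows a per-`Location` median for comparison. That variant is an assumption made for illustration and is not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Location": ["A", "A", "A", "B", "B", "B"],
    "Temp3pm": [10.0, np.nan, 14.0, 30.0, 34.0, np.nan],
    "WindGustDir": ["N", None, "N", "S", "N", "W"],
})

# Global median, as in the cell above
global_fill = toy["Temp3pm"].fillna(toy["Temp3pm"].median())

# Alternative (not used in this notebook): median within each Location
group_median = toy.groupby("Location")["Temp3pm"].transform("median")
group_fill = toy["Temp3pm"].fillna(group_median)

# Mode fill for a categorical column, as in the cell above
mode_fill = toy["WindGustDir"].fillna(toy["WindGustDir"].mode().iloc[0])

print(global_fill.tolist())
print(group_fill.tolist())
print(mode_fill.tolist())
```

The global median fills both gaps with the same value, while the grouped version keeps the two locations apart. Which is preferable depends on how much the locations differ.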

In [30]:
# Encode target to 1/0 (Yes/No)
df["RainTomorrow"] = df["RainTomorrow"].map({"Yes": 1, "No": 0}).astype(int)
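
`map` returns `NaN` for any label that is not exactly `"Yes"` or `"No"`, and `astype(int)` then raises. A hedged sketch of a pre-check, using a made-up series with one badly formatted label:

```python
import pandas as pd

# Toy target with one label in unexpected casing and trailing whitespace
raw = pd.Series(["Yes", "No", "yes ", "No"], name="RainTomorrow")
mapping = {"Yes": 1, "No": 0}

# Labels missing from the mapping come back as NaN
encoded = raw.map(mapping)
unmapped = raw[encoded.isna()].unique().tolist()
print("Unmapped labels:", unmapped)

# One possible fix (an assumption, not part of the notebook): normalize first
fixed = raw.str.strip().str.capitalize().map(mapping)
if fixed.isna().any():
    raise ValueError("Target still contains labels other than Yes/No.")
fixed = fixed.astype(int)
print(fixed.tolist())
```

If the target column holds only `Yes` and `No`, as the cell above assumes, the check reports no unmapped labels.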

In [31]:
# Save cleaned dataset
out_path = Path("clean_weather.csv")
df.to_csv(out_path, index=False)
print("Saved:", out_path.resolve())
print("Shape:", df.shape)
df.head(3)

Saved: /content/clean_weather.csv
Shape: (142193, 30)


|  | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Temp3pm | RainToday | RainTomorrow | Rainfall_lag1 | Humidity3pm_lag1 | Humidity9am_lag1 | Temp3pm_lag1 | Temp9am_lag1 | Pressure3pm_lag1 | Pressure9am_lag1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 96320 | 2008-07-01 | Adelaide | 8.8 | 15.7 | 5.0 | 1.6 | 2.6 | NW | 48.0 | SW | ... | 14.9 | Yes | 0 | 0.0 | 52.0 | 70.0 | 21.1 | 16.7 | 1015.2 | 1017.6 |
| 96321 | 2008-07-02 | Adelaide | 12.7 | 15.8 | 0.8 | 1.4 | 7.8 | SW | 35.0 | SSW | ... | 15.5 | No | 0 | 5.0 | 67.0 | 92.0 | 14.9 | 13.5 | 1017.7 | 1017.4 |
| 96322 | 2008-07-03 | Adelaide | 6.2 | 15.1 | 0.0 | 1.8 | 2.1 | W | 20.0 | NNE | ... | 13.9 | No | 0 | 0.8 | 52.0 | 75.0 | 15.5 | 13.7 | 1022.6 | 1022.4 |
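
CSV stores dates as text, so the `Date` column does not come back as a datetime unless `read_csv` is asked to parse it. A minimal round-trip sketch, using a toy frame and a temporary file, not the notebook's `clean_weather.csv`:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame mirroring a few cleaned columns (hypothetical subset)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2008-07-01", "2008-07-02"]),
    "Location": ["Adelaide", "Adelaide"],
    "RainTomorrow": [0, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_clean.csv"
    toy.to_csv(path, index=False)

    # Without parse_dates the column is read back as plain strings
    plain = pd.read_csv(path)
    # With parse_dates it is restored to a datetime dtype
    parsed = pd.read_csv(path, parse_dates=["Date"])

date_is_datetime_plain = pd.api.types.is_datetime64_any_dtype(plain["Date"])
date_is_datetime_parsed = pd.api.types.is_datetime64_any_dtype(parsed["Date"])
print(date_is_datetime_plain, date_is_datetime_parsed)
```

A downstream notebook that loads `clean_weather.csv` would likely want `parse_dates=["Date"]` for the same reason.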
