<a href="https://colab.research.google.com/github/LiamDuero03/DS-Society-Project/blob/main/4-Data-preprocessing/Data-Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 4. Data Preprocessing & Feature Engineering: Urban Weather Analysis

This session builds directly on the two datasets pulled during data sourcing: the **Location** dataset (city coordinates and population) and the **Weather** dataset (atmospheric conditions).

Our objective is to transform these raw inputs into a refined, model-ready dataset designed to predict the `feels_like` temperature. Throughout this notebook, we will demonstrate professional data handling techniques including:

* **Relational Merging:** Unifying geographic and atmospheric data points.
* **Leak-Proof Feature Engineering:** Creating a "Wind-Moisture Interaction" index while strictly avoiding data leakage from our target variable.
* **Categorical Encoding:** Utilizing One-Hot Encoding to transform qualitative weather conditions into machine-readable formats.
* **Feature Scaling:** Applying Standardization ($Z$-score scaling) to the population data so that its large magnitude does not dominate the other features. Population is included as a rough proxy for the Urban Heat Island effect.


In [1]:
# --- LFS SETUP & REPO CLONING ---
import os
import pandas as pd

# 1. Install Git LFS in the Colab environment
!git lfs install

# 2. Clone the repository (This pulls the LFS pointers)
REPO_NAME = "DS-Society-Project"
REPO_URL = f"https://github.com/LiamDuero03/{REPO_NAME}.git"

if not os.path.exists(REPO_NAME):
    !git clone {REPO_URL}
else:
    print("Repository already cloned.")

# 3. Explicitly pull the LFS data (This replaces pointers with actual CSV data)
%cd {REPO_NAME}
!git lfs pull
%cd ..

# --- 4. READ THE DATA ---
# Now we point to the LOCAL folder inside Colab, not the web URL
cities_path = f"/content/{REPO_NAME}/1-Data-Sourcing/all_cities_raw.csv"
weather_path = f"/content/{REPO_NAME}/1-Data-Sourcing/live_weather_data.csv"

raw_data = pd.read_csv(cities_path, low_memory=False)
live_weather_df = pd.read_csv(weather_path)

print(f"Success! Cities Shape: {raw_data.shape}")
print(f"Success! Weather Shape: {live_weather_df.shape}")

Git LFS initialized.
Cloning into 'DS-Society-Project'...
remote: Enumerating objects: 177, done.
remote: Counting objects: 100% (177/177), done.
remote: Compressing objects: 100% (134/134), done.
remote: Total 177 (delta 72), reused 77 (delta 24), pack-reused 0 (from 0)
Receiving objects: 100% (177/177), 2.40 MiB | 6.80 MiB/s, done.
Resolving deltas: 100% (72/72), done.
/content/DS-Society-Project
/content
Success! Cities Shape: (5000, 4)
Success! Weather Shape: (492, 7)
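
If the LFS pull does not complete, the CSV paths still exist but hold small Git LFS pointer stubs instead of data. The following is a minimal sketch of a guard that could run before `pd.read_csv`, assuming the standard pointer format whose first line starts with `version https://git-lfs.github.com/spec/v1`. The helper name `is_lfs_pointer` and the demo files are hypothetical and not part of the notebook.

```python
import os
import tempfile

# First bytes of every Git LFS pointer file
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path):
    """Return True if the file at `path` looks like a Git LFS pointer stub."""
    with open(path, "rb") as fh:
        head = fh.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX

# Demo on two throwaway files (made-up contents, not the project data)
tmp_dir = tempfile.mkdtemp()
pointer_file = os.path.join(tmp_dir, "pointer.csv")
real_file = os.path.join(tmp_dir, "real.csv")

with open(pointer_file, "wb") as fh:
    fh.write(b"version https://git-lfs.github.com/spec/v1\noid sha256:0000\nsize 123\n")
with open(real_file, "wb") as fh:
    fh.write(b"City,Population\ntokyo,31480498\n")

pointer_check = is_lfs_pointer(pointer_file)
real_check = is_lfs_pointer(real_file)
print(pointer_check, real_check)
```

In the notebook, the same check could be applied to `cities_path` and `weather_path` to fail early with a clear message rather than loading a malformed frame.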


### 4.1 Merging the Datasets
The city name column is called **City** in the location DataFrame and **city_name** in the weather DataFrame, so the merge has to map one onto the other explicitly. A **left join** keeps every row of the primary geographic data; cities with no matching weather record simply get `NaN` in the weather columns.

In [4]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Left-join the location data (raw_data) with the weather data (live_weather_df)
df = pd.merge(raw_data, live_weather_df, left_on='City', right_on='city_name', how='left')

# Drop the redundant city_name column
df.drop('city_name', axis=1, inplace=True)
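
Because the location data has far more rows than the weather data, it helps to see how many cities actually found a match. The sketch below uses two toy frames (illustrative values, not the project files) and two optional `pd.merge` parameters that the notebook itself does not use: `indicator=True` to label each row's origin and `validate="many_to_one"` to raise an error if a city name is duplicated on the weather side.

```python
import pandas as pd

# Toy stand-ins for the two datasets (illustrative values only)
cities = pd.DataFrame({
    "City": ["tokyo", "shanghai", "bombay"],
    "Population": [31480498, 14608512, 12692717],
})
weather = pd.DataFrame({
    "city_name": ["tokyo", "shanghai"],
    "temp": [4.31, 7.92],
})

merged = pd.merge(
    cities, weather,
    left_on="City", right_on="city_name",
    how="left",
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
    validate="many_to_one",  # raises if a city appears twice in the weather data
)

match_counts = merged["_merge"].value_counts()
unmatched = merged.loc[merged["_merge"] == "left_only", "City"].tolist()
print(match_counts)
print("Cities without weather:", unmatched)
```

Rows labelled `left_only` are the ones that carry `NaN` weather values into every later step.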

### 4.2 Categorical Encoding
The `condition` column (e.g., `broken clouds`, `moderate rain`) is categorical. Weather conditions have no natural numeric order, so **One-Hot Encoding** is the standard way to turn these strings into columns the model can use without implying a ranking.

In [5]:
# One-Hot Encoding the 'condition' column
# This creates new binary columns for each unique weather condition
df = pd.get_dummies(df, columns=['condition'], prefix='weather')
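
One detail worth checking after a left join is what happens to rows whose `condition` is missing. The toy example below (made-up rows, not the project data) suggests the default behaviour: a missing value yields no flag in any `weather_*` column, while the optional `dummy_na=True` argument adds an explicit indicator column for it.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "City": ["a", "b", "c"],
    "condition": ["smoke", "snow", np.nan],  # city "c" had no weather match
})

# Default: one column per observed category, nothing set for the NaN row
encoded = pd.get_dummies(toy, columns=["condition"], prefix="weather")

# Optional: add an explicit "weather_nan" indicator column
with_na = pd.get_dummies(toy, columns=["condition"], prefix="weather", dummy_na=True)

print(encoded)
print(list(with_na.columns))
```

Whether an explicit missing-value column is useful depends on the model; the sketch only shows what each option produces.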

### 4.3 Feature Engineering: Interaction Features
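To improve the model's ability to predict `feels_like`, we create an interaction feature. High humidity combined with low wind speed often results in a "stuffy" or higher perceived temperature. Dividing humidity by wind speed captures that combination in a single column, giving the model a non-linear signal that neither input provides on its own. The feature is built only from `humidity` and `wind`, never from `feels_like`, so it does not leak the target.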


In [6]:
# Interaction feature: High humidity + Low wind often feels "stuffy"
# This provides a non-linear signal that helps the model predict 'feels_like'
df['moisture_wind_ratio'] = df['humidity'] / (df['wind'] + 1) # +1 to avoid division by zero
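
As a quick worked check of the formula, the sketch below applies it to three rows. The first two use the humidity and wind values shown for Tokyo and Shanghai in the final preview; the third is a hypothetical perfectly calm reading added to show the zero-wind case.

```python
import pandas as pd

toy = pd.DataFrame({
    "humidity": [47.0, 66.0, 80.0],
    "wind": [0.45, 7.0, 0.0],  # the last row is a hypothetical calm reading
})

# Same formula as the notebook: humidity divided by (wind + 1)
toy["moisture_wind_ratio"] = toy["humidity"] / (toy["wind"] + 1)
print(toy)
```

The `+ 1` offset does more than prevent division by zero: since wind speed is never negative, it also caps the ratio at the humidity value itself, which is reached only when there is no wind.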

### 4.4 Standardization vs. Normalization
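For variables like `Population`, which span a wide range and contain large outliers, **Standardization (Z-score scaling)** is usually preferred over min-max normalization, which would squeeze most cities into a narrow band near 0. Standardization centers the data at a mean of 0 with a standard deviation of 1, so the raw magnitude of the population figures does not dominate features measured on smaller scales. It does not remove outliers: the largest cities can still receive very large z-scores.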

In [7]:
scaler = StandardScaler()

# Standardizing the 'Population' column
# Formula: z = (x - u) / s
df['Population_scaled'] = scaler.fit_transform(df[['Population']])
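
The formula in the comment can be reproduced by hand. The following sketch uses NumPy only and a five-value sample, so the resulting z-scores differ from the notebook's, which are computed over the full dataset. It uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import numpy as np

# Five population figures used purely as a small sample
population = np.array([31480498.0, 14608512.0, 12692717.0, 11627378.0, 10928270.0])

mean = population.mean()
std = population.std(ddof=0)  # population standard deviation (divide by n)

# z = (x - u) / s
z_scores = (population - mean) / std

print(z_scores)
print("mean:", z_scores.mean(), "std:", z_scores.std(ddof=0))
```

After scaling, the sample has mean 0 and standard deviation 1 up to floating-point error, and the ordering of the cities is unchanged.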

### 4.5 Final Data Transformation Review
With the preprocessing pipeline complete, we verify the results. This step ensures that the merge was successful, categorical variables are correctly expanded, and our engineered features and scaled columns are present.

In [9]:
# --- PREVIEW THE FINAL DATAFRAME ---
print("Final Dataframe Shape:", df.shape)
print("\nFirst 5 rows of processed data:")
display(df.head())

Final Dataframe Shape: (5000, 29)

First 5 rows of processed data:


| | City | Population | Latitude | Longitude | temp | feels_like | humidity | pressure | wind | weather_broken clouds | ... | weather_moderate rain | weather_overcast clouds | weather_sand | weather_scattered clouds | weather_smoke | weather_snow | weather_thunderstorm | weather_thunderstorm with light rain | moisture_wind_ratio | Population_scaled |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tokyo | 31480498.0 | 35.685 | 139.751389 | 4.31 | 4.31 | 47.0 | 1014.0 | 0.45 | False | ... | False | False | False | False | False | False | False | False | 32.413793 | 35.093621 |
| 1 | shanghai | 14608512.0 | 31.045556 | 121.399722 | 7.92 | 4.27 | 66.0 | 1030.0 | 7.0 | False | ... | False | False | False | True | False | False | False | False | 8.25 | 16.084495 |
| 2 | bombay | 12692717.0 | 18.975 | 72.825833 | 27.99 | 28.53 | 51.0 | 1013.0 | 3.6 | False | ... | False | False | False | False | True | False | False | False | 11.086957 | 13.92603 |
| 3 | karachi | 11627378.0 | 24.9056 | 67.0822 | 22.9 | 21.85 | 23.0 | 1017.0 | 4.12 | False | ... | False | True | False | False | False | False | False | False | 4.492188 | 12.725747 |
| 4 | new delhi | 10928270.0 | 28.6 | 77.2 | 15.09 | 14.67 | 77.0 | 1018.0 | 2.06 | False | ... | False | False | False | False | False | False | False | False | 25.163399 | 11.938084 |
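
A visual preview can be complemented by a few programmatic checks. The sketch below is one possible way to do that, run here on a small made-up frame. The helper `summarize_processed` is hypothetical and not part of the notebook, and it assumes the column names used in this session (`temp`, `moisture_wind_ratio`, `Population_scaled`, and the `weather_` prefix).

```python
import numpy as np
import pandas as pd

def summarize_processed(df, original_rows):
    """Collect a few basic checks on a processed frame (hypothetical helper)."""
    weather_cols = [c for c in df.columns if c.startswith("weather_")]
    return {
        "row_count_preserved": len(df) == original_rows,
        "n_weather_columns": len(weather_cols),
        "rows_missing_weather": int(df["temp"].isna().sum()),
        "has_engineered_feature": "moisture_wind_ratio" in df.columns,
        "has_scaled_population": "Population_scaled" in df.columns,
    }

# Made-up frame with the same kinds of columns as the processed data
toy = pd.DataFrame({
    "City": ["a", "b", "c"],
    "temp": [4.31, 7.92, np.nan],  # city "c" has no weather record
    "weather_smoke": [True, False, False],
    "weather_snow": [False, True, False],
    "moisture_wind_ratio": [32.4, 8.25, np.nan],
    "Population_scaled": [1.2, -0.1, -1.1],
})

report = summarize_processed(toy, original_rows=3)
print(report)
```

On the real data, `rows_missing_weather` would show how many cities from the left join carry no weather values into modelling.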
