<a href="https://colab.research.google.com/github/LiamDuero03/DS-Society-Project/blob/main/4-Data-preprocessing/Data-Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# --- LFS SETUP & REPO CLONING ---
import os
import pandas as pd

# 1. Install Git LFS in the Colab environment
!git lfs install

# 2. Clone the repository (This pulls the LFS pointers)
REPO_NAME = "DS-Society-Project"
REPO_URL = f"https://github.com/LiamDuero03/{REPO_NAME}.git"

if not os.path.exists(REPO_NAME):
    !git clone {REPO_URL}
else:
    print("Repository already cloned.")

# 3. Explicitly pull the LFS data (This replaces pointers with actual CSV data)
%cd {REPO_NAME}
!git lfs pull
%cd ..

# --- 4. READ THE DATA ---
# Now we point to the LOCAL folder inside Colab, not the web URL
cities_path = f"/content/{REPO_NAME}/1-Data-Sourcing/all_cities_raw.csv"
weather_path = f"/content/{REPO_NAME}/1-Data-Sourcing/live_weather_data.csv"

raw_data = pd.read_csv(cities_path, low_memory=False)
live_weather_df = pd.read_csv(weather_path)

print(f"Success! Cities Shape: {raw_data.shape}")
print(f"Success! Weather Shape: {live_weather_df.shape}")

Git LFS initialized.
Cloning into 'DS-Society-Project'...
remote: Enumerating objects: 177, done.
remote: Counting objects: 100% (177/177), done.
remote: Compressing objects: 100% (134/134), done.
remote: Total 177 (delta 72), reused 77 (delta 24), pack-reused 0 (from 0)
Receiving objects: 100% (177/177), 2.40 MiB | 6.80 MiB/s, done.
Resolving deltas: 100% (72/72), done.
/content/DS-Society-Project
/content
Success! Cities Shape: (5000, 4)
Success! Weather Shape: (492, 7)
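
The loading cell relies on `git lfs pull` having replaced the LFS pointer files with the real CSVs. If that step fails quietly, `pd.read_csv` would read the small text pointer instead of the data. Below is a minimal sketch of a guard, assuming the standard Git LFS pointer format (a text file whose first line begins with `version https://git-lfs.github.com/spec/v1`). The helper name `is_lfs_pointer` and the throwaway demo files are hypothetical and not part of the notebook.

```python
import os
import tempfile

# First bytes of a Git LFS pointer file (standard pointer format)
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path):
    """Return True if the file at `path` starts with the Git LFS pointer header."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX

# Demo on two throwaway files (hypothetical content, not the project data)
tmp_dir = tempfile.mkdtemp()
pointer_path = os.path.join(tmp_dir, "pointer.csv")
csv_path = os.path.join(tmp_dir, "real.csv")

with open(pointer_path, "wb") as fh:
    fh.write(LFS_POINTER_PREFIX + b"\noid sha256:0000\nsize 123\n")
with open(csv_path, "wb") as fh:
    fh.write(b"City,Population\ntokyo,31480498.0\n")

print(is_lfs_pointer(pointer_path))  # pointer file
print(is_lfs_pointer(csv_path))      # real CSV
```

In the notebook, the same check could be run on `cities_path` and `weather_path` before calling `pd.read_csv`.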


In [3]:
raw_data.info()
live_weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   City        5000 non-null   object 
 1   Population  5000 non-null   float64
 2   Latitude    5000 non-null   float64
 3   Longitude   5000 non-null   float64
dtypes: float64(3), object(1)
memory usage: 156.4+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 492 entries, 0 to 491
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   city_name   492 non-null    object 
 1   temp        492 non-null    float64
 2   feels_like  492 non-null    float64
 3   humidity    492 non-null    int64  
 4   pressure    492 non-null    int64  
 5   condition   492 non-null    object 
 6   wind        492 non-null    float64
dtypes: float64(3), int64(2), object(2)
memory usage: 27.0+ KB


1. Merging the Datasets
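The city name is stored as City in the first DataFrame and city_name in the second, so the merge has to map one key onto the other. A left join keeps every row of the primary geographic data. The weather table has only 492 rows against 5000 cities, so cities without a weather record get NaN in the weather columns.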

In [4]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Left-join the weather readings (live_weather_df) onto the city table (raw_data)
df = pd.merge(raw_data, live_weather_df, left_on='City', right_on='city_name', how='left')

# Drop the redundant city_name column
df.drop('city_name', axis=1, inplace=True)
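
Because a left join fills unmatched rows with NaN rather than reporting them, it can help to count how many cities found a weather record. This is a minimal sketch on toy data (three cities and two weather rows, with values borrowed from the preview table). It uses the `indicator` and `validate` parameters of `pd.merge`. The frames `cities` and `weather` are illustrative stand-ins, not the notebook's DataFrames.

```python
import pandas as pd

# Toy stand-ins for raw_data and live_weather_df (illustrative only)
cities = pd.DataFrame({
    "City": ["tokyo", "shanghai", "bombay"],
    "Population": [31480498.0, 14608512.0, 12692717.0],
})
weather = pd.DataFrame({
    "city_name": ["tokyo", "shanghai"],
    "temp": [4.31, 7.92],
})

merged = pd.merge(
    cities,
    weather,
    left_on="City",
    right_on="city_name",
    how="left",
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
    validate="many_to_one",  # raises if a city_name appears twice in weather
)

matched = int((merged["_merge"] == "both").sum())
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"matched: {matched}, unmatched: {unmatched}")

# Drop the helper columns once the check is done
merged = merged.drop(columns=["city_name", "_merge"])
```

The `validate` argument also guards against duplicate weather rows silently multiplying city rows.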

2. Categorical Encoding
The condition column (e.g., broken clouds, smoke, snow) is categorical. Weather conditions have no natural mathematical order, so one-hot encoding is the standard approach.

In [5]:
# One-Hot Encoding the 'condition' column
# This creates new binary columns for each unique weather condition
df = pd.get_dummies(df, columns=['condition'], prefix='weather')

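
One detail of the one-hot encoding step is worth checking: rows whose condition is missing (here, cities with no weather record after the left join) get False in every dummy column by default. The sketch below uses a toy frame to show that behaviour, and the `dummy_na=True` option of `pd.get_dummies` that makes missing values explicit. The frame `toy` is illustrative only.

```python
import pandas as pd

# Toy frame: the third city has no weather condition (illustrative only)
toy = pd.DataFrame({
    "City": ["a", "b", "c"],
    "condition": ["smoke", "snow", None],
})

# Default: the missing condition becomes a row with no dummy column set
encoded = pd.get_dummies(toy, columns=["condition"], prefix="weather")

# dummy_na=True adds an explicit indicator column for missing values
encoded_na = pd.get_dummies(
    toy, columns=["condition"], prefix="weather", dummy_na=True
)

print(encoded)
print(encoded_na)
```

Whether an explicit missing-value column is wanted depends on how the downstream model should treat cities without weather data.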

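3. Feature Engineering
High humidity combined with low wind tends to feel "stuffy", so the ratio of humidity to wind speed is added as an interaction feature.

In [6]: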
# Interaction feature: High humidity + Low wind often feels "stuffy"
# This provides a non-linear signal that helps the model predict 'feels_like'
df['moisture_wind_ratio'] = df['humidity'] / (df['wind'] + 1) # +1 to avoid division by zero
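
The ratio can be checked by hand on a couple of rows. This sketch recomputes it for the humidity and wind values of the first two cities in the preview table, plus one row with missing weather to show that NaN propagates into the feature. The frame `toy` is illustrative only.

```python
import numpy as np
import pandas as pd

# Humidity and wind for two preview rows, plus one row with no weather data
toy = pd.DataFrame({
    "humidity": [47.0, 66.0, np.nan],
    "wind": [0.45, 7.0, np.nan],
})

# Same formula as the notebook: +1 in the denominator avoids division by zero
toy["moisture_wind_ratio"] = toy["humidity"] / (toy["wind"] + 1)

print(toy)
```

The first two results agree with the moisture_wind_ratio values shown in the final preview (32.413793 and 8.25).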

4. Standardization vs. Normalization
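Either method could be applied here. Population spans a wide range with a few very large cities, so standardization (z-score scaling) is used: it rescales the column to a mean of 0 and unit variance. Min-max normalization would instead squeeze most cities into a narrow band near 0. Standardization does not remove the skew, though: in the preview below, Tokyo still sits about 35 standard deviations above the mean.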

In [7]:
scaler = StandardScaler()

# Standardizing the 'Population' column
# Formula: z = (x - u) / s
df['Population_scaled'] = scaler.fit_transform(df[['Population']])
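
The formula in the comment can be reproduced by hand. This is a minimal sketch using only pandas and NumPy on the five populations shown in the preview table. It uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows. The helper name `zscore` is hypothetical. With only five rows, the values will not match the Population_scaled column computed over all 5000 cities.

```python
import numpy as np
import pandas as pd

# Five populations taken from the preview table (toy subset)
pop = pd.Series(
    [31480498.0, 14608512.0, 12692717.0, 11627378.0, 10928270.0],
    name="Population",
)

def zscore(s):
    """z = (x - u) / s, with s the population standard deviation (ddof=0)."""
    return (s - s.mean()) / s.std(ddof=0)

pop_scaled = zscore(pop)

print(pop_scaled)
print("mean:", pop_scaled.mean())
print("std (ddof=0):", pop_scaled.std(ddof=0))
```

After scaling, the mean is 0 and the standard deviation is 1 up to floating-point error, while the ordering of the cities is unchanged.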

In [9]:
## --- 1. PREVIEW THE FINAL DATAFRAME ---
print("Final Dataframe Shape:", df.shape)
print("\nFirst 5 rows of processed data:")
display(df.head())

Final Dataframe Shape: (5000, 29)

First 5 rows of processed data:


Unnamed: 0,City,Population,Latitude,Longitude,temp,feels_like,humidity,pressure,wind,weather_broken clouds,...,weather_moderate rain,weather_overcast clouds,weather_sand,weather_scattered clouds,weather_smoke,weather_snow,weather_thunderstorm,weather_thunderstorm with light rain,moisture_wind_ratio,Population_scaled
0,tokyo,31480498.0,35.685,139.751389,4.31,4.31,47.0,1014.0,0.45,False,...,False,False,False,False,False,False,False,False,32.413793,35.093621
1,shanghai,14608512.0,31.045556,121.399722,7.92,4.27,66.0,1030.0,7.0,False,...,False,False,False,True,False,False,False,False,8.25,16.084495
2,bombay,12692717.0,18.975,72.825833,27.99,28.53,51.0,1013.0,3.6,False,...,False,False,False,False,True,False,False,False,11.086957,13.92603
3,karachi,11627378.0,24.9056,67.0822,22.9,21.85,23.0,1017.0,4.12,False,...,False,True,False,False,False,False,False,False,4.492188,12.725747
4,new delhi,10928270.0,28.6,77.2,15.09,14.67,77.0,1018.0,2.06,False,...,False,False,False,False,False,False,False,False,25.163399,11.938084


In [10]:
# 4. Save locally to CSV
file_name = "processed_city_weather.csv"
df.to_csv(file_name, index=False)

# 5. Trigger a direct download from Colab to your computer
from google.colab import files
files.download(file_name)

print(f"\nSuccess! '{file_name}' has been created and the download should start shortly.")


Success! 'processed_city_weather.csv' has been created and the download should start shortly.
