<a href="https://colab.research.google.com/github/LiamDuero03/DS-Society-Project/blob/Liams-Branch/Full-Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 Project Lab: Quick Start Guide

Welcome to our first society project!

Our problem statement: Can we predict a city's 'Feels Like' temperature (Apparent Temperature) based solely on its size, location, and humidity?

In [1]:
import pandas as pd
import requests
from google.colab import userdata

# 1. Data Sourcing & Integration
In this section, we demonstrate how a Data Scientist pulls data from two distinct sources:
1. **Local Data Sourcing:** Internal "proprietary" data (e.g., City Population).
2. **Live API:** Real-time data enrichment (Live Weather).

### 1.1 Local Data Sourcing (Automated)
Instead of manual uploads, we pull our "internal" population data directly from our GitHub repository. This ensures all members are working with the same version of the data.

* **Primary Data Source:** The underlying dataset is the **World Cities Database**, originally sourced from [Kaggle](https://www.kaggle.com/datasets/max-mind/world-cities-database).
* **Storage Method:** Because the file is large (3.1+ million rows), the CSV is hosted using **Git LFS (Large File Storage)**, which lets us exceed GitHub's standard file-size limit. We load it through GitHub's LFS media URL (`media.githubusercontent.com`), which serves the actual file content rather than the small LFS pointer file.
* **Automation:** By using `pd.read_csv()` on the hosted URL, we eliminate the need for users to download local copies or manage large files manually.

| Column | Description |
| :--- | :--- |
| **city** | Standardized city name (lowercased for merging) |
| **pop** | Total population count |
| **lat / lng** | Geographic coordinates (Latitude and Longitude) |

In [10]:
GITHUB_CSV_URL = "https://media.githubusercontent.com/media/LiamDuero03/DS-Society-Project/refs/heads/main/worldcitiespop.csv"

try:
    # We use low_memory=False because large CSVs often have mixed data types in columns
    internal_metadata_df = pd.read_csv(GITHUB_CSV_URL, low_memory=False)

    # The worldcitiespop dataset has these columns:
    # Country, City, AccentCity, Region, Population, Latitude, Longitude.
    # Keep the four we need and rename them to the short names used in this notebook.
    internal_metadata_df = internal_metadata_df[['City', 'Population', 'Latitude', 'Longitude']]
    internal_metadata_df.columns = ['city', 'pop', 'lat', 'lng']

    print(f"✅ GitHub-hosted data loaded successfully! ({len(internal_metadata_df)} rows)")
except Exception as e:
    print(f"❌ Error loading CSV from GitHub: {e}")
    print("Tip: Ensure the file has been pushed to GitHub and the URL is not returning a Git LFS 'pointer' file.")

#internal_metadata_df.head()

✅ GitHub-hosted data loaded successfully! (3173958 rows)
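
The tip in the error message refers to Git LFS pointer files: small text stubs that stand in for the real file when the LFS content is not served. A minimal sketch of how one might detect that case before parsing, assuming the pointer follows the standard Git LFS layout (a first line beginning with `version https://git-lfs.github.com/spec/v1`). The helper name `looks_like_lfs_pointer` and both sample strings are hypothetical and not part of the notebook:

```python
# Standard first line of a Git LFS pointer file
LFS_POINTER_PREFIX = "version https://git-lfs.github.com/spec/v1"

def looks_like_lfs_pointer(text: str) -> bool:
    """Return True if the text starts like a Git LFS pointer file."""
    return text.lstrip().startswith(LFS_POINTER_PREFIX)

# Illustrative inputs only: a pointer-shaped stub and a CSV header row
sample_pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:<hash goes here>\n"
    "size 12345\n"
)
sample_csv = "Country,City,AccentCity,Region,Population,Latitude,Longitude\n"

print(looks_like_lfs_pointer(sample_pointer))  # True
print(looks_like_lfs_pointer(sample_csv))      # False
```

In practice this check would run on the first few hundred characters of the downloaded text, before handing the URL to `pd.read_csv()`.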


### 1.2 Live API
#### 1.2.1 Get your OpenWeatherMap API Key

We are combining static city data with live weather data. You need a personal key to "talk" to the weather server:
1. Go to [OpenWeatherMap.org](https://openweathermap.org/api) and create a free account.
2. Navigate to your **API Keys** tab and copy your default key.
3. *Note:* A new key can take 30 to 60 minutes to "activate."

#### 1.2.2 🛡️ Set up Colab Secrets
To keep our project secure, we **never** type our API keys directly into the code.
* Look at the left-hand sidebar in this Colab window.
* Click the **Key icon (Secrets)** 🔑.
* Click "Add new secret".
* Name: `OPENWEATHER_API_KEY`
* Value: Paste your key here.
* **Toggle the "Notebook access" switch to ON.**
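
Once the secret is saved, it can be read from code. The sketch below shows one possible way to do that with a fallback for running the notebook outside Colab. The helper name `get_api_key` and the environment-variable fallback are assumptions for illustration, not part of the notebook:

```python
import os

def get_api_key(name: str = "OPENWEATHER_API_KEY"):
    """Read a key from Colab Secrets, falling back to an environment variable."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        return userdata.get(name)
    except Exception:
        # Not in Colab, secret missing, or "Notebook access" is switched off
        return os.environ.get(name)

api_key = get_api_key()
if api_key is None:
    print("No API key found. Add it to Colab Secrets or set the environment variable.")
```

Returning `None` instead of raising lets the caller decide how to report a missing key.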



In [13]:
import os

# Files to remove. Deleting the weather cache forces a fresh API fetch in the next cell.
files_to_delete = ["live_weather_cache.csv"]

for file in files_to_delete:
    if os.path.exists(file):
        os.remove(file)
        print(f"🗑️ Deleted: {file}")
    else:
        print(f"⚠️ {file} not found (already deleted).")

🗑️ Deleted: live_weather_cache.csv


In [14]:
import os
import pandas as pd
import requests
from google.colab import userdata

# 1. API Setup
try:
    API_KEY = userdata.get('OPENWEATHER_API_KEY')
    print("✅ API Key ready.")
except Exception as e:
    # userdata.get raises if the secret is missing or notebook access is off
    API_KEY = None
    print(f"❌ API Key missing in Colab Secrets: {e}")

# 2. Define the "Smart" Fetch Function
def fetch_weather_data(city_list):
    """Return current weather per city, reusing a local CSV cache if one exists."""
    CACHE_FILE = "live_weather_cache.csv"

    if os.path.exists(CACHE_FILE):
        print(f"📦 Loading weather from local cache: {CACHE_FILE}")
        return pd.read_csv(CACHE_FILE)

    print(f"🌐 Cache not found. Fetching data for {len(city_list)} cities...")
    results = []
    base_url = "https://api.openweathermap.org/data/2.5/weather"

    for city in city_list:
        # units='metric' returns temperatures in °C and wind speed in m/s
        params = {'q': city, 'appid': API_KEY, 'units': 'metric'}
        try:
            response = requests.get(base_url, params=params, timeout=10)
            if response.status_code != 200:
                print(f"Skipping {city}: HTTP {response.status_code}")
                continue
            d = response.json()
            results.append({
                'city_name': city.lower().strip(),
                'temp': d['main']['temp'],
                'feels_like': d['main']['feels_like'],  # our prediction target
                'humidity': d['main']['humidity'],
                'pressure': d['main']['pressure'],      # extra data for ML
                'condition': d['weather'][0]['description'],
                'wind': d['wind']['speed']
            })
        except Exception as e:
            print(f"Skipping {city} due to error: {e}")

    weather_df = pd.DataFrame(results)
    weather_df.to_csv(CACHE_FILE, index=False)
    print("✅ API calls complete and cached.")
    return weather_df

# --- Selecting Cities for ML ---
# Instead of a manual list, we take the 50 most populous cities in the 3.1M-row dataset.
# This gives the Machine Learning model a reasonable number of data points.
target_cities = internal_metadata_df.sort_values(by='pop', ascending=False).head(50)['city'].tolist()

live_weather_df = fetch_weather_data(target_cities)
live_weather_df.head()

✅ API Key ready.
🌐 Cache not found. Fetching data for 50 cities...
✅ API calls complete and cached.


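| | city_name | temp | feels_like | humidity | pressure | condition | wind |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | tokyo | 5.36 | 4.01 | 78 | 1011 | scattered clouds | 1.79 |
| 1 | shanghai | 5.92 | 2.06 | 61 | 1027 | clear sky | 6.0 |
| 2 | bombay | 24.99 | 25.35 | 69 | 1013 | haze | 2.06 |
| 3 | karachi | 17.9 | 17.18 | 55 | 1012 | clear sky | 4.12 |
| 4 | new delhi | 11.09 | 10.68 | 93 | 1013 | fog | 2.57 |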
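
The fetch function indexes straight into the JSON response, so a single missing field raises a `KeyError` and the city is skipped. A minimal sketch of a more forgiving parser is shown below. It assumes only the response fields the notebook already reads (`main`, `weather`, `wind`). The helper name `parse_weather_payload` is hypothetical, and the sample payload reuses the Tokyo values from the table above:

```python
def parse_weather_payload(city: str, payload: dict) -> dict:
    """Flatten one weather response into a row, using None for missing fields."""
    main = payload.get("main", {})
    weather = payload.get("weather") or [{}]  # guard against a missing or empty list
    return {
        "city_name": city.lower().strip(),
        "temp": main.get("temp"),
        "feels_like": main.get("feels_like"),
        "humidity": main.get("humidity"),
        "pressure": main.get("pressure"),
        "condition": weather[0].get("description"),
        "wind": payload.get("wind", {}).get("speed"),
    }

# Sample payload containing only the fields used in this notebook
sample = {
    "main": {"temp": 5.36, "feels_like": 4.01, "humidity": 78, "pressure": 1011},
    "weather": [{"description": "scattered clouds"}],
    "wind": {"speed": 1.79},
}
row = parse_weather_payload(" Tokyo ", sample)
print(row["city_name"], row["feels_like"])  # tokyo 4.01
```

With this shape, an incomplete response still yields a row, and the gaps surface later as missing values rather than silently dropped cities.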


### 1.3 The Master Merger
We now join our **Local Metadata** (population and coordinates) with the live weather data fetched from the **OpenWeather API**, matching rows on a standardized city name.

In [15]:
# 1. Prepare the GitHub Data for Merging
# Build a standardized 'city_match' key (lowercase, trimmed) so names join reliably
internal_metadata_df['city_match'] = internal_metadata_df['city'].astype(str).str.lower().str.strip()

# 2. Perform the Merge
# Join the metadata (pop, lat, lng) with the live API data (temp, humidity, etc.).
# how='right' keeps every city we fetched weather for, even without a metadata match.
master_df = pd.merge(
    internal_metadata_df,
    live_weather_df,
    left_on='city_match',
    right_on='city_name',
    how='right'
)

# 3. Final Polish
# City names are not unique in the metadata, so the merge yields several rows per city.
# drop_duplicates keeps the first match in file order, which is not necessarily the
# most populous city of that name (hence some missing 'pop' values and odd coordinates).
master_df = master_df.drop_duplicates(subset=['city_name']).drop(columns=['city_match', 'city_name'])

# 4. Final Column Selection
# Keep only the essential columns for exploration
master_df = master_df[['city', 'pop', 'temp', 'feels_like', 'humidity', 'wind', 'condition', 'lat', 'lng']]

print(f"🚀 Master Dataset Created! We now have {len(master_df)} cities ready for analysis.")
master_df.head()

🚀 Master Dataset Created! We now have 50 cities ready for analysis.


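| | city | pop | temp | feels_like | humidity | wind | condition | lat | lng |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | tokyo | 31480498.0 | 5.36 | 4.01 | 78 | 1.79 | scattered clouds | 35.685 | 139.751389 |
| 2 | shanghai | NaN | 5.92 | 2.06 | 61 | 6.0 | clear sky | 29.329547 | 121.058037 |
| 11 | bombay | NaN | 24.99 | 25.35 | 69 | 2.06 | haze | 34.123889 | 68.628333 |
| 20 | karachi | NaN | 17.9 | 17.18 | 55 | 4.12 | clear sky | -20.4 | -64.766667 |
| 25 | new delhi | 10928270.0 | 11.09 | 10.68 | 93 | 2.57 | fog | 28.6 | 77.2 |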
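
City names are not unique in a worldwide dataset, which likely explains why several rows above show a missing `pop` or coordinates far from the expected city: the merge matched a smaller place with the same name. One possible remedy, sketched here on made-up data, is to keep only the most populous entry per city name before merging. The frames, city names, and values below are invented for illustration:

```python
import pandas as pd

# Toy metadata: three places share the name "springfield" (values are invented)
cities = pd.DataFrame({
    "city": ["springfield", "springfield", "springfield", "rivertown"],
    "pop": [None, 1200.0, 150000.0, 8000.0],
    "lat": [10.0, 20.0, 30.0, 40.0],
    "lng": [11.0, 21.0, 31.0, 41.0],
})
weather = pd.DataFrame({
    "city_name": ["springfield", "rivertown"],
    "feels_like": [12.5, 3.0],
})

# Sort so the largest known population comes first (NaN last),
# then keep one row per city name
lookup = (
    cities.sort_values("pop", ascending=False, na_position="last")
          .drop_duplicates(subset="city", keep="first")
)

# A left join from the weather rows now finds exactly one match per city
merged = weather.merge(lookup, left_on="city_name", right_on="city", how="left")
print(merged[["city_name", "pop", "lat", "lng", "feels_like"]])
```

Deduplicating before the merge also keeps the join small, since each weather row matches at most one metadata row.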


# 2. Preliminary EDA & Visualization
Goal: Understand the "shape" and "health" of the data.

In [None]:
# --- SECTION 2: PRELIMINARY EDA ---
import seaborn as sns
import matplotlib.pyplot as plt

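# Work on a copy so master_df stays untouched
df = master_df.copy()

# Check for missing values and data types (info() prints directly and returns None)
df.info()

# Visualizing the distribution of the target variable
sns.histplot(df['feels_like'], kde=True)
plt.title("Distribution of 'Feels Like' Temperature (°C)")
plt.show()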
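
To judge the "health" of the data, a per-column count of missing values is often more direct than reading the `info()` printout. A minimal sketch on a made-up frame follows. The column names mirror the master dataset, but the values are invented:

```python
import pandas as pd

# Toy frame with two missing populations (values are invented)
toy = pd.DataFrame({
    "city": ["a", "b", "c", "d"],
    "pop": [100.0, None, None, 400.0],
    "humidity": [78, 61, 69, 55],
})

# Count and percentage of missing values per column
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "missing_pct": toy.isna().mean().mul(100).round(1),
})
print(summary)
```

Columns with a high missing percentage are the first candidates for the imputation step in the wrangling section.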

# 3. Data Wrangling
Goal: Clean the data for the machine. This is where your team will spend most of their time.

In [None]:
# --- SECTION 3: DATA WRANGLING ---
# Work on a copy so master_df stays untouched
df = master_df.copy()

# 1. Handling Missing Values ('pop' is missing for some cities)
df['pop'] = df['pop'].fillna(df['pop'].median())

# 2. Encoding Categorical Variables
df = pd.get_dummies(df, columns=['condition'], drop_first=True)

# 3. Feature Scaling (Week 4 topic)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# num_cols = ['pop', 'lat', 'lng', 'humidity', 'wind']
# df[num_cols] = scaler.fit_transform(df[num_cols])
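
Standard scaling rescales each column to zero mean and unit variance: z = (x - mean) / std. The sketch below computes this by hand with pandas on made-up numbers to show what the transformation does. It uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows:

```python
import pandas as pd

# Toy numeric features (values are invented)
toy = pd.DataFrame({
    "humidity": [50.0, 60.0, 70.0, 80.0],
    "wind": [1.0, 2.0, 3.0, 6.0],
})

# z-score each column: subtract the column mean, divide by the population std
scaled = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.round(3))
print(scaled.mean().round(6))       # each column mean is ~0
print(scaled.std(ddof=0).round(6))  # each column std is ~1
```

After scaling, features measured in very different units (people, degrees, percent) sit on a comparable scale.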

# 4. Advanced Visualization
Goal: Storytelling. Move beyond simple bars to multi-dimensional charts.

In [None]:
# --- SECTION 4: ADVANCED VISUALS ---
import plotly.express as px

# Create an interactive Correlation Heatmap
# numeric_only=True skips text columns such as 'city', which would otherwise raise an error
corr = df.corr(numeric_only=True)
fig = px.imshow(corr, text_auto=True, title="Feature Correlation Heatmap")
fig.show()

# 5. Machine Learning (ML)
Goal: The predictive engine.

In [None]:
# --- SECTION 5: MACHINE LEARNING ---
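from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor  # use RandomForestClassifier for a categorical target

# Features from the problem statement: size, location, and humidity
features = ['pop', 'lat', 'lng', 'humidity']
X = df[features]
y = df['feels_like']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
# For regressors, score() returns R^2 on the held-out test set
print(f"Model Score (R^2): {model.score(X_test, y_test):.3f}")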
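
For a scikit-learn regressor, `score()` reports the coefficient of determination, R^2 = 1 - SS_res / SS_tot. A small hand-rolled version, run on made-up numbers, can help interpret the printed value. The function name `r2_score_manual` and the sample values are assumptions for illustration:

```python
def r2_score_manual(y_true, y_pred):
    """Compute R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot

# Invented "feels like" temperatures in °C
y_true = [4.0, 2.0, 25.0, 17.0, 11.0]
mean_prediction = [sum(y_true) / len(y_true)] * len(y_true)

print(r2_score_manual(y_true, y_true))           # perfect predictions give 1.0
print(r2_score_manual(y_true, mean_prediction))  # always predicting the mean gives 0.0
```

A score near 1 means the model explains most of the variation in the target. A score at or below 0 means it does no better than always predicting the mean, which is worth keeping in mind with a test set of only about ten cities.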