# 🚀 Project Lab: Quick Start Guide

Welcome to our first society project!

Our problem statement: Can we predict a city's 'Feels Like' temperature (Apparent Temperature) based solely on its size, location, and humidity?

In [1]:
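import pandas as pd
import requests

# google.colab only exists inside Google Colab. Outside Colab the import
# raises ModuleNotFoundError, so fall back to None instead of crashing.
try:
    from google.colab import userdata
except ModuleNotFoundError:
    userdata = None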

# 1. Data Sourcing & Integration
In this section, we demonstrate how a data scientist pulls data from two distinct sources:
1. **Local Data Sourcing:** Internal "proprietary" data (e.g., City Population).
2. **Live API:** Real-time data enrichment (Live Weather).

## 1.1 Local Data Sourcing (Automated)
Instead of manual uploads, we pull our "internal" population data directly from our GitHub repository. This ensures all members are working with the same version of the data.

* **Primary Data Source:** The underlying dataset is the **World Cities Database**, originally sourced from [Kaggle](https://www.kaggle.com/datasets/max-mind/world-cities-database).
* **Storage Method:** Because the file is large (3.1+ million rows), the CSV is stored with **Git LFS (Large File Storage)**, which gets around GitHub's 100 MB per-file limit. We read it through the `media.githubusercontent.com` URL, which serves the actual file contents; the plain raw URL for an LFS-tracked file returns only a small pointer file.
* **Automation:** By using `pd.read_csv()` on the hosted URL, we eliminate the need for users to download local copies or manage large files manually.
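
If the load ever returns a handful of odd-looking lines instead of city data, the URL is probably serving the Git LFS pointer rather than the CSV. The sketch below shows one way to build the media URL and to recognize a pointer file. The helper names `lfs_media_url` and `is_lfs_pointer` are hypothetical (they are not part of the project code), and the URL pattern is simply inferred from the `GITHUB_CSV_URL` used in this notebook.

```python
# First line of every Git LFS pointer file
LFS_POINTER_PREFIX = "version https://git-lfs.github.com/spec/v1"


def lfs_media_url(owner, repo, ref, path):
    """Build a media URL following the pattern of this notebook's GITHUB_CSV_URL."""
    return f"https://media.githubusercontent.com/media/{owner}/{repo}/{ref}/{path}"


def is_lfs_pointer(text):
    """Return True if the text looks like a Git LFS pointer, not real CSV data."""
    return text.lstrip().startswith(LFS_POINTER_PREFIX)


url = lfs_media_url("LiamDuero03", "DS-Society-Project", "refs/heads/main", "worldcitiespop.csv")
print(url)

# Placeholder pointer content: the hash and size are deliberately left generic
sample_pointer = "version https://git-lfs.github.com/spec/v1\noid sha256:<hash>\nsize <bytes>\n"
sample_csv = "City,Population,Latitude,Longitude\n"

print(is_lfs_pointer(sample_pointer))
print(is_lfs_pointer(sample_csv))
```

A check like this could run on the first line of a downloaded file before handing it to `pd.read_csv()`.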

| Column | Description |
| :--- | :--- |
| **city** | Standardized city name (lowercased for merging) |
| **pop** | Total population count |
| **lat / lng** | Geographic coordinates (Latitude and Longitude) |

In [None]:
import pandas as pd

# --- CONFIGURATION ---
GITHUB_CSV_URL = "https://media.githubusercontent.com/media/LiamDuero03/DS-Society-Project/refs/heads/main/worldcitiespop.csv"

# --- 1. DATA LOADING ---
try:
    # Load only the columns we need (saves RAM) and rename them in the same step.
    # Renaming by name avoids relying on the column order inside the CSV.
    internal_metadata_df = pd.read_csv(
        GITHUB_CSV_URL,
        low_memory=False,
        usecols=['City', 'Population', 'Latitude', 'Longitude']
    ).rename(columns={
        'City': 'city',
        'Population': 'pop',
        'Latitude': 'lat',
        'Longitude': 'lng'
    })

    print(f"✅ Successfully loaded {len(internal_metadata_df):,} rows from GitHub.")

except Exception as e:
    print(f"❌ Error: {e}")
    print("Tip: Check your GitHub LFS settings or the file URL.")
    raise  # Stop here: the steps below need internal_metadata_df

# --- 2. DATA CLEANING & SELECTION ---
# We focus on unique cities and prioritize the records with the highest population
valid_unique_cities = (
    internal_metadata_df
    .dropna(subset=['pop'])               # Remove rows without population data
    .sort_values(by='pop', ascending=False) # Put largest cities at the top
    .drop_duplicates(subset=['city'])     # Keep only the largest version of each city
)

# Extract the top 50 cities for our API target list
target_cities = valid_unique_cities.head(50)['city'].tolist()

# --- 3. SUMMARY OUTPUT ---
print(f"✅ Selected the top {len(target_cities)} most populous unique cities for analysis.")
print("-" * 30)
print(f"Top 5 Cities: {', '.join(target_cities[:5])}...")
print("-" * 30)

# Display the first few rows of our cleaned dataset
valid_unique_cities.head()
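
To see what the cleaning chain above does, here is a toy version on a hand-made table. The rows are made-up illustration values, not taken from the dataset. One thing it makes visible: deduplicating on `city` alone also collapses different cities that share a name, keeping only the most populous one.

```python
import pandas as pd

# Made-up rows: two places share the name "springfield", one row has no population
toy = pd.DataFrame({
    "city": ["springfield", "springfield", "riverton", "lakeside"],
    "pop": [150_000, 60_000, None, 20_000],
})

cleaned = (
    toy
    .dropna(subset=["pop"])                  # "riverton" is dropped (no population)
    .sort_values(by="pop", ascending=False)  # largest rows first
    .drop_duplicates(subset=["city"])        # first (largest) row per name survives
)

print(cleaned["city"].tolist())
```

If same-named cities in different countries matter for the analysis, one option would be to load a country column as well and deduplicate on both columns.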

## 1.2 Live API
#### 1.2.1 Get your OpenWeatherMap API Key

We are combining static city data with live weather data. You need a personal key to "talk" to the weather server:
1. Go to [OpenWeatherMap.org](https://openweathermap.org/api) and create a free account.
2. Navigate to your **API Keys** tab and copy your default key.
3. *Note:* A new key can take 30-60 minutes to activate.
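
While a key is still activating, requests are rejected rather than answered. The sketch below maps the HTTP status codes you are most likely to meet to plain-language hints. The function name `explain_status` is hypothetical, and the hints reflect commonly documented OpenWeatherMap behavior, so treat them as a guide rather than a specification.

```python
def explain_status(status_code):
    """Translate an HTTP status code from the weather API into a short hint."""
    hints = {
        200: "OK: the response contains weather data.",
        401: "Unauthorized: the API key is invalid or not active yet.",
        404: "Not found: the API did not recognize the city name.",
        429: "Too many requests: the rate limit for the plan was exceeded.",
    }
    return hints.get(status_code, f"Unexpected status {status_code}: check the response body.")


for code in (200, 401, 404, 429, 500):
    print(code, "->", explain_status(code))
```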

#### 1.2.2 🛡️ Set up Colab Secrets
To keep our project secure, we **never** type our API keys directly into the code.
* Look at the left-hand sidebar in this Colab window.
* Click the **Key icon (Secrets)** 🔑.
* Click "Add new secret".
* Name: `OPENWEATHER_API_KEY`
* Value: Paste your key here.
* **Toggle the "Notebook access" switch to ON.**
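
Once the secret is stored, the notebook can read it at runtime instead of hard-coding it. Below is a minimal sketch of that lookup. The function name `load_api_key` is hypothetical, and the fallback to an environment variable of the same name is an assumption added for people running the notebook outside Colab; it is not part of the original setup.

```python
import os


def load_api_key(name="OPENWEATHER_API_KEY"):
    """Return the secret from Colab Secrets, or from an environment variable."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            return userdata.get(name)
        except Exception:
            # Secret missing, or "Notebook access" is switched off
            pass

    # Assumed fallback for local runs: an environment variable with the same name
    return os.environ.get(name)


key = load_api_key()
print("Key found." if key else "No key found.")
```

Either way, the key never appears in the notebook itself, so the file can be shared or committed safely.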



In [None]:
import os

# List of files to remove
files_to_delete = ['live_weather_cache.csv']

for file in files_to_delete:
    if os.path.exists(file):
        os.remove(file)
        print(f"🗑️ Deleted: {file}")
    else:
        print(f"⚠️ {file} not found (already deleted).")

In [None]:
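import os
import pandas as pd
import requests

try:
    from google.colab import userdata  # Only available inside Google Colab
except ModuleNotFoundError:
    userdata = None

# --- 1. API SETUP & CACHING LOGIC ---
API_KEY = None
try:
    API_KEY = userdata.get('OPENWEATHER_API_KEY')
    print("✅ OpenWeather API Key successfully retrieved.")
except Exception as e:
    # Covers a missing secret, disabled notebook access, or running outside Colab
    print(f"❌ API Key missing ({type(e).__name__}). Please add 'OPENWEATHER_API_KEY' to Colab Secrets.")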

def fetch_live_weather(city_list):
    """
    Fetches real-time weather data for a list of cities.
    Saves results to a local CSV to prevent unnecessary API calls.
    """
    CACHE_FILE = "live_weather_cache.csv"

    # Check if we already have data from a previous run
    if os.path.exists(CACHE_FILE):
        print(f"📦 Loading weather data from local cache: {CACHE_FILE}")
        return pd.read_csv(CACHE_FILE)

    print(f"🌐 Cache not found. Fetching live data for {len(city_list)} cities...")
    results = []
    # Use HTTPS so the API key is not sent in plain text
    base_url = "https://api.openweathermap.org/data/2.5/weather"

    # Loop through cities and collect data
    for i, city in enumerate(city_list):
        params = {'q': city, 'appid': API_KEY, 'units': 'metric'}

        # Simple progress indicator for beginners
        if (i + 1) % 10 == 0:
            print(f"   Progress: {i + 1}/{len(city_list)} cities processed...")

        try:
            # A timeout keeps one slow request from hanging the whole loop
            response = requests.get(base_url, params=params, timeout=10)
            if response.status_code == 200:
                d = response.json()
                results.append({
                    'city_name': city.lower().strip(),
                    'temp': d['main']['temp'],
                    'feels_like': d['main']['feels_like'],
                    'humidity': d['main']['humidity'],
                    'pressure': d['main']['pressure'],
                    'condition': d['weather'][0]['description'],
                    'wind': d['wind']['speed']
                })
            else:
                # Report skipped cities instead of dropping them silently
                print(f"⚠️ Skipping {city}: HTTP {response.status_code}")
        except Exception as e:
            print(f"⚠️ Error fetching {city}: {e}")

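    # Create DataFrame; only cache it if at least one call succeeded,
    # otherwise an empty cache file would block every later retry
    weather_df = pd.DataFrame(results)
    if weather_df.empty:
        print("⚠️ No weather data retrieved. Nothing was cached.")
        return weather_df
    weather_df.to_csv(CACHE_FILE, index=False)
    print("✅ API calls complete. Results cached locally.")
    return weather_df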

# --- 2. TARGET SELECTION ---
# We prioritize the top 50 unique cities with the highest verified populations
print("🧹 Cleaning population data to identify the top 50 target cities...")

# This logic ensures we get unique names and the highest recorded pop for each
valid_unique_cities = (
    internal_metadata_df.dropna(subset=['pop'])
    .sort_values(by='pop', ascending=False)
    .drop_duplicates(subset=['city'])
)

target_cities = valid_unique_cities.head(50)['city'].tolist()

# --- 3. EXECUTION ---
live_weather_df = fetch_live_weather(target_cities)

print("-" * 30)
print(f"📊 Weather Data Preview ({len(live_weather_df)} cities):")
live_weather_df.head()
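
With both sources loaded, the two tables can be joined on the lowercased city name to get one row per city. The sketch below shows the idea on made-up rows (the city names and numbers are illustration values, not real measurements). The column names follow the ones used in this notebook; the `features` / `target` split is an assumption about how the problem statement might be set up, not the project's final design.

```python
import pandas as pd

# Made-up stand-in for the population table
cities = pd.DataFrame({
    "city": ["alpha", "beta", "gamma"],
    "pop": [1_000_000, 500_000, 250_000],
    "lat": [10.0, 20.0, 30.0],
    "lng": [-5.0, 15.0, 25.0],
})

# Made-up stand-in for the weather table ("gamma" has no weather row)
weather = pd.DataFrame({
    "city_name": ["alpha", "beta"],
    "feels_like": [31.2, 18.4],
    "humidity": [70, 55],
})

merged = cities.merge(
    weather,
    left_on="city",
    right_on="city_name",
    how="inner",  # keep only cities that have weather data
)

# Size, location, and humidity as inputs; 'Feels Like' temperature as the target
features = merged[["pop", "lat", "lng", "humidity"]]
target = merged["feels_like"]

print(merged[["city", "pop", "humidity", "feels_like"]])
```

An inner join quietly drops cities whose API call failed, so it is worth comparing `len(merged)` with the number of target cities after merging.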