<a href="https://colab.research.google.com/github/LiamDuero03/DS-Society-Project/blob/main/1-Data-Sourcing/Data-Sourcing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 Project Lab: Quick Start Guide

Welcome to our first society project!

Our problem statement: Can we predict a city's 'Feels Like' temperature (Apparent Temperature) based solely on its size, location, and humidity?

In [2]:
import pandas as pd
import requests
from google.colab import userdata

# 1. Data Sourcing & Integration
In this section, we demonstrate how a Data Scientist pulls data from two distinct sources:
1. **Local Data Sourcing:** Internal "proprietary" data (e.g., City Population).
2. **Live API:** Real-time data enrichment (Live Weather).

### 1.1 Local Data Sourcing (Automated)
Instead of manual uploads, we pull our "internal" population data directly from our GitHub repository. This ensures all members are working with the same version of the data.

* **Primary Data Source:** The underlying dataset is the **World Cities Database**, originally sourced from [Kaggle](https://www.kaggle.com/datasets/max-mind/world-cities-database).
* **Storage Method:** Because the file is large (3.1+ million rows), the CSV is stored with **Git LFS (Large File Storage)**. This lets the repository hold a file above GitHub's standard per-file size limit, and the notebook reads it over HTTP from the repository's LFS media URL.
* **Automation:** By using `pd.read_csv()` on the hosted URL, we eliminate the need for users to download local copies or manage large files manually.

| Column | Raw CSV header | Description |
| :--- | :--- | :--- |
| **city** | `City` | Standardized city name (lowercased for merging) |
| **pop** | `Population` | Total population count |
| **lat / lng** | `Latitude` / `Longitude` | Geographic coordinates (latitude and longitude) |
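
The raw CSV uses the headers `City`, `Population`, `Latitude` and `Longitude`, while the table above uses shorter names. Below is a minimal sketch of that standardization step, assuming the short names are the target schema. The `COLUMN_MAP` dictionary, the `standardize_columns` helper and the sample rows are illustrative and not part of the project code.

```python
import pandas as pd

# Hypothetical mapping from the raw CSV headers to the short names in the table
COLUMN_MAP = {
    "City": "city",
    "Population": "pop",
    "Latitude": "lat",
    "Longitude": "lng",
}

def standardize_columns(df):
    """Rename raw columns and normalize city names so they can be merged later."""
    out = df.rename(columns=COLUMN_MAP)
    out["city"] = out["city"].astype(str).str.lower().str.strip()
    return out

# Tiny made-up sample standing in for the full dataset
sample = pd.DataFrame({
    "City": [" Aixas", "AIXOVALL "],
    "Population": [None, None],
    "Latitude": [42.483333, 42.466667],
    "Longitude": [1.466667, 1.483333],
})
clean = standardize_columns(sample)
print(clean.columns.tolist())
```

Keeping the mapping in one dictionary means the rest of the notebook only ever refers to one set of names.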

In [4]:
import pandas as pd

# Direct link to the CSV stored with Git LFS in our GitHub repository
DATA_URL = "https://media.githubusercontent.com/media/LiamDuero03/DS-Society-Project/refs/heads/main/worldcitiespop.csv"

print("Pandas loaded and URL defined. Ready to read!")

Pandas loaded and URL defined. Ready to read!


In [5]:
# Loading only the specific columns to keep the notebook fast
raw_data = pd.read_csv(
    DATA_URL,
    low_memory=False,
    usecols=['City', 'Population', 'Latitude', 'Longitude']
)

# Show the first 5 rows to confirm it worked
print(f"Success! Loaded {len(raw_data):,} rows.")
raw_data.head()

Success! Loaded 3,173,958 rows.


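| | City | Population | Latitude | Longitude |
| :--- | :--- | :--- | :--- | :--- |
| 0 | aixas | NaN | 42.483333 | 1.466667 |
| 1 | aixirivali | NaN | 42.466667 | 1.5 |
| 2 | aixirivall | NaN | 42.466667 | 1.5 |
| 3 | aixirvall | NaN | 42.466667 | 1.5 |
| 4 | aixovall | NaN | 42.466667 | 1.483333 |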
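
The preview shows rows with no population value. Since city size is one of our predictors and every city we query costs one API call, one option is to keep only rows that have a population and query the largest cities. This is a minimal sketch, assuming a cutoff of the top `n` cities. The `select_top_cities` helper and the sample values are illustrative.

```python
import pandas as pd

def select_top_cities(df, n=100):
    """Drop rows without a population and return the n most populous cities."""
    populated = df.dropna(subset=["Population"])
    # The same city name may appear more than once; keep the largest entry per name
    populated = populated.sort_values("Population", ascending=False)
    populated = populated.drop_duplicates(subset="City", keep="first")
    return populated.head(n).reset_index(drop=True)

# Made-up rows for illustration only
sample = pd.DataFrame({
    "City": ["aixas", "alpha", "beta", "alpha"],
    "Population": [None, 500000.0, 1200000.0, 3000.0],
    "Latitude": [42.48, 10.0, 20.0, 30.0],
    "Longitude": [1.47, 11.0, 21.0, 31.0],
})
top = select_top_cities(sample, n=2)
print(top["City"].tolist())
```

The resulting `City` column could then be passed as the list of cities to look up.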


### 1.2 Live API
#### 1.2.1 Get your OpenWeatherMap API Key

We are combining static city data with live weather data. You need a personal key to "talk" to the weather server:
1. Go to [OpenWeatherMap.org](https://openweathermap.org/api) and create a free account.
2. Navigate to your **API Keys** tab and copy your default key.
3. *Note:* A new key can take 30 to 60 minutes to activate, so requests made right after sign-up may be rejected.
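
While a key is still activating, requests fail with an error status instead of returning weather data. The sketch below turns a status code into a readable hint, assuming the API follows standard HTTP semantics (401 unauthorized, 404 not found, 429 too many requests). The `describe_status` helper is hypothetical, not part of any library.

```python
def describe_status(status_code):
    """Translate common HTTP status codes into a beginner-friendly hint."""
    hints = {
        200: "OK: the key works and the city was found.",
        401: "Unauthorized: the key is wrong or not activated yet.",
        404: "Not found: the city name was not recognized.",
        429: "Too many requests: the rate limit was exceeded.",
    }
    return hints.get(status_code, f"Unexpected status code: {status_code}")

print(describe_status(401))
```

In practice you would pass it `response.status_code` from a `requests.get(...)` call.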

#### 1.2.2 🛡️ Set up Colab Secrets
To keep our project secure, we **never** type our API keys directly into the code.
* Look at the left-hand sidebar in this Colab window.
* Click the **Key icon (Secrets)** 🔑.
* Click "Add new secret".
* Name: `OPENWEATHER_API_KEY`
* Value: Paste your key here.
* **Toggle the "Notebook access" switch to ON.**



In [7]:
import os
import pandas as pd
import requests
from google.colab import userdata

# --- API SETUP ---
# Note: This part is more advanced! We use 'userdata' to keep API keys private.
try:
    API_KEY = userdata.get('OPENWEATHER_API_KEY')
    print("✅ API Key found in Colab Secrets.")
except Exception:
    API_KEY = None
    print("⚠️ API Key missing. You can still run this if you have a 'live_weather_cache.csv' file!")

✅ API Key found in Colab Secrets.
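
The cell above only works inside Colab, because `google.colab` cannot be imported elsewhere. For members running the notebook locally, one possible fallback is an environment variable with the same name. This is a sketch under that assumption; the `load_api_key` helper is hypothetical.

```python
import os

def load_api_key(name="OPENWEATHER_API_KEY"):
    """Return the key from Colab Secrets, else from the environment, else None."""
    try:
        from google.colab import userdata  # only importable inside Colab
        return userdata.get(name)
    except Exception:
        # Not in Colab, or the secret is missing or notebook access is switched off
        return os.environ.get(name)

key = load_api_key()
print("Key found." if key else "No key available.")
```

Either way, the key stays out of the notebook's source.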


In [8]:
def fetch_live_weather(city_list):
    """
    Fetches real-time weather data for a list of cities.
    Saves results to a local CSV to prevent unnecessary API calls.
    """
    CACHE_FILE = "live_weather_cache.csv"

    # Check if we already have data from a previous run
    if os.path.exists(CACHE_FILE):
        print(f"📦 Loading weather data from local cache: {CACHE_FILE}")
        return pd.read_csv(CACHE_FILE)

    # Without a cache we need a key to call the API
    if not API_KEY:
        raise ValueError("No cache file and no API key. Add OPENWEATHER_API_KEY to Colab Secrets.")

    print(f"🌐 Cache not found. Fetching live data for {len(city_list)} cities...")
    results = []
    # Use HTTPS so the API key is not sent in plain text
    base_url = "https://api.openweathermap.org/data/2.5/weather"

    # Loop through cities and collect data
    for i, city in enumerate(city_list):
        params = {'q': city, 'appid': API_KEY, 'units': 'metric'}

        try:
            # The timeout stops one slow request from hanging the whole loop
            response = requests.get(base_url, params=params, timeout=10)
            if response.status_code == 200:
                d = response.json()
                results.append({
                    'city_name': city.lower().strip(),
                    'temp': d['main']['temp'],
                    'feels_like': d['main']['feels_like'],
                    'humidity': d['main']['humidity'],
                    'pressure': d['main']['pressure'],
                    'condition': d['weather'][0]['description'],
                    'wind': d['wind']['speed']
                })
            else:
                # Report failed lookups instead of skipping them silently
                print(f"⚠️ Skipped {city}: HTTP {response.status_code}")
        except Exception as e:
            print(f"⚠️ Error fetching {city}: {e}")

        # Simple progress indicator for beginners
        if (i + 1) % 10 == 0:
            print(f"   Progress: {i + 1}/{len(city_list)} cities processed...")

    # Create DataFrame and cache it only if at least one city succeeded
    weather_df = pd.DataFrame(results)
    if not weather_df.empty:
        weather_df.to_csv(CACHE_FILE, index=False)
        print("✅ API calls complete. Results cached locally.")
    return weather_df
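
The function stores each city under a lowercased, stripped `city_name`, which lines up with the lowercase `City` values in the population data. Below is a minimal sketch of how the two tables could be joined, assuming an inner join on the city name is what we want. All rows are made up for illustration; real values come from the CSV and the API.

```python
import pandas as pd

# Stand-in for the population data (made-up values)
cities = pd.DataFrame({
    "City": ["alpha", "beta", "gamma"],
    "Population": [1000.0, 2000.0, 3000.0],
    "Latitude": [10.0, 20.0, 30.0],
    "Longitude": [11.0, 21.0, 31.0],
})

# Stand-in for the weather results (made-up values)
weather = pd.DataFrame({
    "city_name": ["alpha", "gamma"],
    "temp": [15.0, 25.0],
    "feels_like": [14.0, 27.0],
    "humidity": [60, 80],
})

# Inner join keeps only cities that appear in both tables
merged = cities.merge(weather, left_on="City", right_on="city_name", how="inner")
print(merged[["City", "Population", "humidity", "feels_like"]])
```

Cities without a weather record ("beta" here) drop out of the result, so it is worth comparing `len(merged)` with the number of cities requested.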

In [10]:
# --- DATA PERSISTENCE ---
# This ensures that even if the runtime resets, your processed data is saved.

# Save the full city list we loaded earlier
if 'raw_data' in locals():
    raw_data.to_csv('all_cities_raw.csv', index=False)
    print("💾 Saved: all_cities_raw.csv")


# Save the results of the weather fetch
if 'live_weather_df' in locals():
    live_weather_df.to_csv('final_weather_analysis.csv', index=False)
    print("💾 Saved: final_weather_analysis.csv")


💾 Saved: all_cities_raw.csv
