<a href="https://colab.research.google.com/github/LiamDuero03/DS-Society-Project/blob/main/1-Data-Sourcing/Data-Sourcing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Lab: Quick Start Guide

Welcome to our first society project!

Our problem statement: Can we predict a city's 'Feels Like' temperature (Apparent Temperature) based solely on its size, location, and humidity?

In [1]:
import pandas as pd
import requests
from google.colab import userdata

# 1. Data Sourcing & Integration
In this section, we demonstrate how a data scientist pulls data from two distinct sources:
1. **Local Data Sourcing:** Internal "proprietary" data (e.g., City Population).
2. **Live API:** Real-time data enrichment (Live Weather).

### 1.1 Local Data Sourcing (Automated)
Instead of manual uploads, we pull our "internal" population data directly from our GitHub repository. This ensures all members are working with the same version of the data.

* **Primary Data Source:** The underlying dataset is the **World Cities Database**, originally sourced from [Kaggle](https://www.kaggle.com/datasets/max-mind/world-cities-database).
* **Storage Method:** Because the file is large (about 3.17 million rows), the CSV is stored with **Git LFS (Large File Storage)**. This sidesteps GitHub's standard file-size limits, and the file can still be read straight into our environment through its `media.githubusercontent.com` URL.
* **Automation:** By using `pd.read_csv()` on the hosted URL, we eliminate the need for users to download local copies or manage large files manually.

| Column | Description |
| :--- | :--- |
| **City** | City name (already lowercase in the source file, which simplifies merging) |
| **Population** | Total population count (blank for some rows) |
| **Latitude / Longitude** | Geographic coordinates |

In [7]:
import pandas as pd

# Raw download link for the LFS-hosted CSV in our GitHub repository
DATA_URL = "https://media.githubusercontent.com/media/LiamDuero03/DS-Society-Project/refs/heads/main/worldcitiespop.csv"

print("Pandas loaded and URL defined. Ready to read!")

Pandas loaded and URL defined. Ready to read!


In [8]:
# Loading only the specific columns to keep the notebook fast
raw_data = pd.read_csv(
    DATA_URL,
    low_memory=False,
    usecols=['City', 'Population', 'Latitude', 'Longitude']
)

# Show the first 5 rows to confirm it worked
print(f"Success! Loaded {len(raw_data):,} rows.")
raw_data.head()

Success! Loaded 3,173,958 rows.


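| | City | Population | Latitude | Longitude |
| :--- | :--- | :--- | :--- | :--- |
| 0 | aixas | NaN | 42.483333 | 1.466667 |
| 1 | aixirivali | NaN | 42.466667 | 1.5 |
| 2 | aixirivall | NaN | 42.466667 | 1.5 |
| 3 | aixirvall | NaN | 42.466667 | 1.5 |
| 4 | aixovall | NaN | 42.466667 | 1.483333 |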
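
The preview shows blank `Population` values in the first rows, so it may be worth checking how much of the file actually carries a population before relying on it. A minimal sketch, assuming the column names loaded above; the helper name `summarize_missing` and the tiny sample frame are hypothetical and only stand in for `raw_data`:

```python
import pandas as pd

def summarize_missing(df, column):
    """Return (rows_with_value, rows_missing) for one column."""
    missing = int(df[column].isna().sum())
    return len(df) - missing, missing

# Hypothetical sample with made-up population values, standing in for raw_data
sample = pd.DataFrame({
    "City": ["aixas", "tokyo", "shanghai"],
    "Population": [None, 100.0, 200.0],
})

present, missing = summarize_missing(sample, "Population")
print(f"{present} rows with a population, {missing} without")
```

Running the same helper on `raw_data` would show how many rows survive a later `dropna(subset=['Population'])`.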


### 1.2 Live API
#### 1.2.1 Get your OpenWeatherMap API Key

We are combining static city data with live weather data. You need a personal key to "talk" to the weather server:
1. Go to [OpenWeatherMap.org](https://openweathermap.org/api) and create a free account.
2. Navigate to your **API Keys** tab and copy your default key.
3. *Note:* A new key can take 30 to 60 minutes to activate.

#### 1.2.2 Set up Colab Secrets
To keep our project secure, we **never** type our API keys directly into the code.
* Look at the left-hand sidebar in this Colab window.
* Click the **Key icon (Secrets)**.
* Click "Add new secret".
* Name: `OPENWEATHER_API_KEY`
* Value: Paste your key here.
* **Toggle the "Notebook access" switch to ON.**
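
For members who also run the code outside Colab, the secret lookup can be wrapped in a small helper. This is only a sketch: the function name `load_api_key` and the fallback to an environment variable of the same name are assumptions, not part of the original notebook.

```python
import os

def load_api_key(name="OPENWEATHER_API_KEY"):
    """Return the key from Colab Secrets, else from an environment variable, else None."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        return userdata.get(name)
    except Exception:
        # Not in Colab, secret missing, or "Notebook access" is switched off
        return os.environ.get(name)

api_key = load_api_key()
print("API key found." if api_key else "API key missing.")
```

Either way, the key itself never appears in the notebook's code or output.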



In [9]:
import os
import pandas as pd
import requests
from google.colab import userdata

# --- API SETUP ---
# Note: This part is more advanced! We use 'userdata' to keep API keys private.
try:
    API_KEY = userdata.get('OPENWEATHER_API_KEY')
    print("API Key found in Colab Secrets.")
except Exception:
    API_KEY = None
    print("API Key missing. You can still run this notebook: the weather data will load from the cached CSV on GitHub.")

API Key found in Colab Secrets.


In [16]:
import io

import pandas as pd
import requests

def fetch_live_weather(city_list):
    """
    Fetches weather data.
    Priority: 1. Cached CSV on GitHub | 2. Live API calls
    """
    # Use the RAW URL so Pandas sees the actual CSV text
    GITHUB_URL = "https://media.githubusercontent.com/media/LiamDuero03/DS-Society-Project/main/1-Data-Sourcing/live_weather_data.csv"
    try:
        print("Checking GitHub for existing weather data...")
        # We use a timeout so the script doesn't hang if GitHub is down
        response = requests.get(GITHUB_URL, timeout=5)

        if response.status_code == 200:
            print("Success! Loading data directly from GitHub.")
            # Parse the text we already downloaded instead of fetching the URL again
            return pd.read_csv(io.StringIO(response.text))
        else:
            print(f"File not found on GitHub (Status: {response.status_code}).")

    except Exception as e:
        print(f"Could not load data from GitHub: {e}")

    # --- FALLBACK TO API CALLS ---
    if not API_KEY:
        print("No API key available, so the API fallback is skipped.")
        return pd.DataFrame()

    print(f"Fetching live data for {len(city_list)} cities via API...")
    results = []
    base_url = "https://api.openweathermap.org/data/2.5/weather"

    for city in city_list:
        params = {'q': city, 'appid': API_KEY, 'units': 'metric'}
        try:
            res = requests.get(base_url, params=params, timeout=10)
            if res.status_code == 200:
                d = res.json()
                # Keep the same columns as the cached CSV
                results.append({
                    'city_name': city.lower().strip(),
                    'temp': d['main']['temp'],
                    'feels_like': d['main']['feels_like'],
                    'humidity': d['main']['humidity'],
                    'pressure': d['main']['pressure'],
                    'condition': d['weather'][0]['description'],
                    # Wind speed; assumed to match the cached 'wind' column
                    'wind': d['wind']['speed']
                })
        except (requests.RequestException, KeyError, ValueError):
            # Skip cities whose request fails or whose response is incomplete
            continue

    return pd.DataFrame(results)
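
Because the problem statement depends on `feels_like` and `humidity`, it can help to test the JSON-to-row step separately, without calling the API. The sketch below is illustrative only: `parse_weather` is a hypothetical helper, the payload is made up, and the field layout (`main.feels_like`, `main.humidity`, `wind.speed`) is assumed to follow the same response shape the function above reads. Mapping `wind` to wind speed is likewise an assumption.

```python
def parse_weather(payload, city):
    """Flatten one current-weather JSON payload into a single row (dict)."""
    main = payload.get("main", {})
    weather = payload.get("weather") or [{}]
    return {
        "city_name": city.lower().strip(),
        "temp": main.get("temp"),
        "feels_like": main.get("feels_like"),
        "humidity": main.get("humidity"),
        "pressure": main.get("pressure"),
        "condition": weather[0].get("description"),
        "wind": payload.get("wind", {}).get("speed"),
    }

# Hypothetical payload with made-up values, shaped like an API response
sample_payload = {
    "main": {"temp": 10.0, "feels_like": 9.5, "humidity": 50, "pressure": 1000},
    "weather": [{"description": "clear sky"}],
    "wind": {"speed": 2.0},
}

row = parse_weather(sample_payload, " Tokyo ")
print(row)
```

Missing fields come back as `None` rather than raising, so one incomplete response does not stop the whole loop.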

In [17]:
# --- EXECUTION: Identification & Fetching ---

# 1. Identify our target cities from the raw data we loaded earlier
# We filter for unique cities and take the top 500 by population
valid_unique_cities = (
    raw_data.dropna(subset=['Population'])
    .sort_values(by='Population', ascending=False)
    .drop_duplicates(subset=['City'])
)

target_cities = valid_unique_cities.head(500)['City'].tolist()

# 2. CALL THE FUNCTION
# This will either load from the CSV cache or call the API
live_weather_df = fetch_live_weather(target_cities)

# 3. Quick Preview
if not live_weather_df.empty:
    print(f"Previewing weather for {len(live_weather_df)} cities:")
    display(live_weather_df.head())

Checking GitHub for existing weather data...
Success! Loading data directly from GitHub.


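| | city_name | temp | feels_like | humidity | pressure | condition | wind |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | tokyo | 4.31 | 4.31 | 47 | 1014 | light rain | 0.45 |
| 1 | shanghai | 7.92 | 4.27 | 66 | 1030 | scattered clouds | 7.0 |
| 2 | bombay | 27.99 | 28.53 | 51 | 1013 | smoke | 3.6 |
| 3 | karachi | 22.9 | 21.85 | 23 | 1017 | overcast clouds | 4.12 |
| 4 | new delhi | 15.09 | 14.67 | 77 | 1018 | mist | 2.06 |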
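
To bring the two sources together, the weather rows can be joined to the city rows on the city name. This is a rough sketch rather than the project's final integration step: `merge_city_weather` is a hypothetical helper, the sample values are made up, and it assumes the `City` and `city_name` columns shown in the previews above.

```python
import pandas as pd

def merge_city_weather(cities, weather):
    """Inner-join city rows to weather rows on a normalized city name."""
    cities = cities.assign(city_key=cities["City"].str.lower().str.strip())
    weather = weather.assign(city_key=weather["city_name"].str.lower().str.strip())
    merged = cities.merge(weather, on="city_key", how="inner")
    return merged.drop(columns="city_key")

# Hypothetical rows with made-up values
cities = pd.DataFrame({
    "City": ["tokyo", "shanghai", "nowhere"],
    "Population": [3.0, 2.0, 1.0],
})
weather = pd.DataFrame({
    "city_name": ["tokyo", "shanghai"],
    "feels_like": [4.0, 5.0],
    "humidity": [40, 60],
})

merged = merge_city_weather(cities, weather)
print(merged)
```

An inner join keeps only cities present in both tables, so a city without weather data (here `nowhere`) drops out instead of producing empty weather columns.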


In [18]:
# --- STEP 4: EXPORT YOUR RESULTS ---
# If you made any changes to the analysis above, run this cell
# to save your own copy of the data!

if 'live_weather_df' in locals():
    live_weather_df.to_csv('my_weather_analysis.csv', index=False)
    print("Success! Your results are saved as 'my_weather_analysis.csv'.")
    print("Click the folder icon on the left to download it to your computer.")

Success! Your results are saved as 'my_weather_analysis.csv'.
Click the folder icon on the left to download it to your computer.
