<a href="https://colab.research.google.com/github/LiamDuero03/DS-Society-Project/blob/Liams-Branch/Full-Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 Project Lab: Quick Start Guide

Welcome to our first society project!

# 1: Data Sourcing & Integration
In this section, we demonstrate how a Data Scientist pulls data from three distinct sources:
1. **External Web CSV:** Baseline geographic data.
2. **"Local" CSV:** Internal "proprietary" data (e.g., City Population), hosted in our GitHub repository.
3. **Live API:** Real-time data enrichment (Live Weather).

### 1.1: Setup
#### 1. 🔑 Get your OpenWeatherMap API Key
We are combining static city data with live weather data. You need a personal key to "talk" to the weather server:
1. Go to [OpenWeatherMap.org](https://openweathermap.org/api) and create a free account.
2. Navigate to your **API Keys** tab and copy your default key.
3. *Note:* A new key can take 30-60 minutes to "activate." Until then, requests are rejected as unauthorized.

#### 2. 🛡️ Set up Colab Secrets
To keep our project secure, we **never** type our API keys directly into the code.
* Look at the left-hand sidebar in this Colab window.
* Click the **Key icon (Secrets)** 🔑.
* Click "Add new secret".
* Name: `OPENWEATHER_API_KEY`
* Value: Paste your key here.
* **Toggle the "Notebook access" switch to ON.**
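
The secret lookup can be wrapped in a small helper so the same notebook also runs outside Colab. This is a minimal sketch, not part of the project code: `load_api_key` is a hypothetical helper, and falling back to an environment variable with the same name is an assumption on our part.

```python
import os


def load_api_key(name="OPENWEATHER_API_KEY"):
    """Return the key from Colab Secrets if possible, else from the environment.

    Hypothetical helper for illustration only.
    """
    try:
        # This module is only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            key = userdata.get(name)
            if key:
                return key
        except Exception:
            # Secret missing or "Notebook access" switched off: fall through.
            pass

    # Assumption: outside Colab the key is exported as an environment variable.
    return os.environ.get(name)
```

Calling `load_api_key()` returns `None` when no key is found, so the caller can stop early with a clear message instead of sending requests without a key.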

#### 3. 🧪 Run the "Sourcing" Cell
Once your secret is set, scroll down to **Section 1: Data Sourcing** and run the cell to verify your connection!

In [2]:
# --- SECTION 1: DATA SOURCING (API & CSV) ---
import pandas as pd
import requests
from google.colab import userdata

# 1. Accessing the Secret Key securely
try:
    API_KEY = userdata.get('OPENWEATHER_API_KEY')
    print("✅ API Key successfully retrieved from Secrets.")
except Exception as e:
    # Raised when the secret is missing or notebook access is switched off
    API_KEY = None
    print(f"❌ API Key not available ({type(e).__name__}). Please add 'OPENWEATHER_API_KEY' to your Colab Secrets (Key icon 🔑).")

# 2. Loading the CSV (Static Data)
CSV_URL = "https://raw.githubusercontent.com/datasets/world-cities/master/data/world-cities.csv"
cities_df = pd.read_csv(CSV_URL)

# 3. Defining the API Call (Live Data)
def fetch_weather(city):
    """Return {'temp', 'humidity'} for a city, or None if the lookup fails."""
    base_url = "https://api.openweathermap.org/data/2.5/weather"
    params = {'q': city, 'appid': API_KEY, 'units': 'metric'}
    try:
        # A timeout keeps one slow request from hanging the whole notebook
        response = requests.get(base_url, params=params, timeout=10)
    except requests.RequestException:
        return None
    if response.status_code == 200:
        data = response.json()
        # Keep only the two fields we need from the nested 'main' block
        return {'temp': data['main']['temp'], 'humidity': data['main']['humidity']}
    return None

# Preview the data structure
#cities_df.head()

✅ API Key successfully retrieved from Secrets.
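
The weather lookup above reads two values out of a nested JSON payload. As a sketch of how that extraction could be made tolerant of unexpected responses, the hypothetical helper `parse_weather` below returns `None` whenever the `main` block or either field is absent. The sample payloads are hand-written for illustration and are not real API output.

```python
def parse_weather(payload):
    """Pull temperature and humidity out of a weather payload, or return None.

    Hypothetical helper; it only assumes the 'main' -> 'temp'/'humidity'
    nesting that the notebook cell already relies on.
    """
    main = payload.get("main") if isinstance(payload, dict) else None
    if not isinstance(main, dict):
        return None
    temp = main.get("temp")
    humidity = main.get("humidity")
    if temp is None or humidity is None:
        return None
    return {"temp": temp, "humidity": humidity}


# Made-up payloads: one with the expected fields, one without them.
sample_ok = {"main": {"temp": 12.3, "humidity": 81}, "name": "Example"}
sample_bad = {"message": "something went wrong"}

print(parse_weather(sample_ok))
print(parse_weather(sample_bad))
```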


### 1.2: Local Data Sourcing (Automated)
Instead of manual uploads, we pull our "internal" population data directly from our GitHub repository. This ensures all members are working with the same version of the data.

In [3]:
# 3. SOURCE B: GitHub-Hosted CSV
# Note: if the file is tracked with Git LFS, the 'raw' URL may return only a
# small pointer file instead of the data itself.
GITHUB_CSV_URL = "https://raw.githubusercontent.com/liamDuero03/DS-Society-Project/main/worldcitiespop.csv"

internal_metadata_df = None
try:
    # We use low_memory=False because large CSVs often have mixed data types in columns
    raw_df = pd.read_csv(GITHUB_CSV_URL, low_memory=False)

    # Standard columns in the worldcitiespop dataset are:
    # Country, City, AccentCity, Region, Population, Latitude, Longitude
    # Keep the four we need and give them short names
    internal_metadata_df = raw_df[['City', 'Population', 'Latitude', 'Longitude']].rename(
        columns={'City': 'city', 'Population': 'pop', 'Latitude': 'lat', 'Longitude': 'lng'}
    )

    print(f"✅ GitHub-hosted data loaded successfully! ({len(internal_metadata_df)} rows)")
except Exception as e:
    print(f"❌ Error loading CSV from GitHub: {e}")
    print("Tip: Ensure you have pushed the file to GitHub and it isn't still just a 'pointer' file.")

# Only preview the data if the load succeeded
if internal_metadata_df is not None:
    display(internal_metadata_df.head())
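
The tip about "pointer" files can be turned into a quick check. Git LFS pointer files are short text files whose first line starts with `version https://git-lfs.github.com/spec/v1`. The sketch below tests for that prefix. `looks_like_lfs_pointer` is a hypothetical helper, and the example strings are made up for illustration.

```python
LFS_POINTER_PREFIX = "version https://git-lfs.github.com/spec/v1"


def looks_like_lfs_pointer(text):
    """Return True if the text begins like a Git LFS pointer file."""
    return text.lstrip().startswith(LFS_POINTER_PREFIX)


# Illustrative inputs only: the oid below is a placeholder, not a real object id.
pointer_example = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:<placeholder>\n"
    "size 12345\n"
)
csv_example = "Country,City,AccentCity,Region,Population,Latitude,Longitude\n"

print(looks_like_lfs_pointer(pointer_example))
print(looks_like_lfs_pointer(csv_example))
```

In the notebook, one option would be to download the first few hundred characters of `GITHUB_CSV_URL` with `requests.get(...).text` and pass them to this check before calling `pd.read_csv`.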


### 1.3: The Master Merger
We will now join our **Baseline Data** with our **Local Metadata** and then ping the **OpenWeather API** to get live temperature data for a subset of these cities.

In [None]:
# 1. Take a sample of 10 cities
master_df = cities_df.head(10).copy()

# 2. Add Live API Data
print("Fetching live weather...")
weather_results = [fetch_weather(city) for city in master_df['name']]

# Extract values safely (None where the lookup failed)
master_df['current_temp'] = [w['temp'] if w else None for w in weather_results]
master_df['humidity'] = [w['humidity'] if w else None for w in weather_results]

# 3. THE MERGER
# internal_metadata_df was renamed in 1.2, so its columns are 'city', 'pop', 'lat', 'lng'.
# We match 'name' (from the public CSV) with 'city' (from the large CSV).
# Note: the match is exact and case-sensitive; names that differ in capitalization will not join.
master_df = pd.merge(
    master_df,
    internal_metadata_df[['city', 'pop', 'lat', 'lng']],
    left_on='name',
    right_on='city',
    how='left'
)

# Clean up: remove the duplicate 'city' column created by the merge
master_df = master_df.drop(columns=['city'])

print("🚀 Master Dataset Created with Live Weather and LFS Metadata!")
master_df.head()
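
A left join on city names can misbehave in two ways: names that differ only in capitalization will not match, and a name that appears several times in the lookup table will duplicate rows. The sketch below shows one possible way to handle both on toy data. The frames, their values, and the "keep the most populous entry" rule are assumptions for illustration, not the project's method.

```python
import pandas as pd

# Toy stand-ins for the public and internal tables; all values are made up.
public = pd.DataFrame({"name": ["Alpha", "Beta", "Gamma"]})
internal = pd.DataFrame({
    "city": ["alpha", "beta", "beta"],
    "pop": [1000.0, 250.0, 4000.0],
})

# Keep one row per city name: here, the most populous entry (an assumption).
internal_unique = (
    internal.sort_values("pop", ascending=False)
            .drop_duplicates(subset="city", keep="first")
)

# Join on a lowercased key so capitalization differences do not block matches.
merged = (
    public.assign(key=public["name"].str.lower())
          .merge(
              internal_unique.rename(columns={"city": "key"}),
              on="key",
              how="left",
              indicator=True,  # adds '_merge': 'both' or 'left_only'
          )
          .drop(columns="key")
)

print(merged)
```

The `_merge` column makes unmatched cities easy to spot: any row marked `left_only` found no metadata.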

# 2: Preliminary EDA & Visualization
Goal: Understand the "shape" and "health" of the data.

In [None]:
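# --- SECTION 2: PRELIMINARY EDA ---
import seaborn as sns
import matplotlib.pyplot as plt

# Start from the dataset built in Section 1
df = master_df.copy()

# Check for missing values and data types (info() prints its own report)
df.info()

# Visualizing the target variable distribution
# 'target_column' is a placeholder: replace it with a real column name, e.g. 'current_temp'
sns.histplot(df['target_column'], kde=True)
plt.title("Distribution of Target Variable")
plt.show()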
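
To make the "health" of the data concrete, it can help to tabulate missing values per column. The sketch below does this on a toy frame with made-up values. The column names mirror those used earlier, but the numbers are purely illustrative.

```python
import pandas as pd

# Toy frame with made-up values standing in for the master dataset.
toy = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma", "Delta"],
    "current_temp": [12.5, None, 30.1, 8.0],
    "humidity": [80, 65, None, None],
})

# One row per column: its dtype, how many values are missing, and the share missing.
health = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(1),
})

print(health)
```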

# 3: Data Wrangling
Goal: Clean the data for the machine. This is where your team will spend most of their time.

In [None]:
# --- SECTION 3: DATA WRANGLING ---
# 1. Handling Missing Values
df['column'] = df['column'].fillna(df['column'].median())

# 2. Encoding Categorical Variables
df = pd.get_dummies(df, columns=['category_column'], drop_first=True)

# 3. Feature Scaling (Week 4 topic)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# df[['num_cols']] = scaler.fit_transform(df[['num_cols']])
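
The three wrangling steps can be traced end to end on a tiny example. This is a sketch on made-up data. The scaling step is written out by hand with pandas (subtract the mean, divide by the population standard deviation), which is the same formula `StandardScaler` applies.

```python
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "temp": [10.0, None, 30.0, 20.0],
    "region": ["north", "south", "south", "east"],
})

# 1. Fill the gap with the column median (the median of 10, 20, 30 is 20).
toy["temp"] = toy["temp"].fillna(toy["temp"].median())

# 2. One-hot encode; drop_first=True drops the first category ('east' here).
toy = pd.get_dummies(toy, columns=["region"], drop_first=True)

# 3. Standardize by hand: zero mean, unit variance (ddof=0 matches StandardScaler).
toy["temp_scaled"] = (toy["temp"] - toy["temp"].mean()) / toy["temp"].std(ddof=0)

print(toy)
```

Dropping the first category avoids a redundant column: a row with zeros in both `region_north` and `region_south` is implicitly `east`.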

# 4: Advanced Visualization
Goal: Storytelling. Move beyond simple bars to multi-dimensional charts.

In [None]:
# --- SECTION 4: ADVANCED VISUALS ---
import plotly.express as px

# Create an interactive 3D scatter or a Correlation Heatmap
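# numeric_only=True skips text columns, which would otherwise raise an error in recent pandas
corr = df.corr(numeric_only=True)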
fig = px.imshow(corr, text_auto=True, title="Feature Correlation Heatmap")
fig.show()
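
Before plotting a heatmap, it may be worth checking what the correlation matrix actually contains. The sketch below builds one from a toy frame with made-up values: text columns are excluded first, and a column that is a linear rescaling of another correlates perfectly with it.

```python
import pandas as pd

# Toy frame with made-up values; 'temp_f' is 'temp' converted to Fahrenheit.
toy = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma", "Delta"],
    "temp": [10.0, 20.0, 30.0, 40.0],
    "temp_f": [50.0, 68.0, 86.0, 104.0],
    "humidity": [90.0, 70.0, 60.0, 30.0],
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()

print(corr.round(2))
```

A pair such as `temp` and `temp_f` carries the same information twice, which is the kind of redundancy a heatmap makes visible.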

# 5: Machine Learning (ML)
Goal: The predictive engine.

In [None]:
# --- SECTION 5: MACHINE LEARNING ---
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor # or Classifier

X = df.drop('target_column', axis=1)
y = df['target_column']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

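# Fixing random_state makes the forest reproducible between runs
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
# For regressors, score() returns R^2 on the given data
print(f"Model R^2 Score: {model.score(X_test, y_test):.3f}")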
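
For a regressor, the score printed above is the coefficient of determination, R². As a sketch of what that number means, the hypothetical helper `r2_manual` below computes it from scratch as one minus the ratio of residual to total sum of squares. The values are made up.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (illustrative helper)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up targets and two sets of predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
perfect = [3.0, 5.0, 7.0, 9.0]    # matches every target
mean_only = [6.0, 6.0, 6.0, 6.0]  # always predicts the mean of y_true

print(r2_manual(y_true, perfect))
print(r2_manual(y_true, mean_only))
```

A score of 1 means the predictions match exactly, 0 means the model does no better than always predicting the mean, and negative values mean it does worse.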