# COM7064 CW1 — Final Project Template: Impact of Weather on Air Pollution (London)

**Purpose:** This comprehensive notebook contains step-by-step instructions, runnable code, data-fetching examples, EDA, modelling, a 1000-word report template, and a marking checklist to help you complete your CW1 assignment and reuse it for future work.

**Files included (keep them in the same folder as this notebook):**
- `london_air_quality.csv` — sample synthetic air quality data (2023–2024)
- `london_weather.csv` — sample synthetic weather data (2023–2024)

**How to use this file:**
1. Follow sections in order.
2. Replace sample CSVs with real data from OpenAQ, LondonAir, Met Office, Kaggle, or Open‑Meteo as shown below.
3. Run each cell and write your observations in the Markdown cells.
4. Export as `.ipynb` or convert to PDF for Turnitin.


## 1 — Data sources & how to download them

Detailed guidance and code examples follow for OpenAQ and Open‑Meteo.

### Code: Fetch PM2.5 from OpenAQ (example)

In [None]:
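# Example: fetch PM2.5 measurements for London from the OpenAQ API.
# The loop below pages through the results until every matching record is read.
# Note: OpenAQ has announced the retirement of the v2 endpoint in favour of v3 (which needs an API key);
# if this request fails, check the current OpenAQ API documentation.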
import requests
import pandas as pd

def fetch_openaq_pm25(city='London', date_from='2023-01-01', date_to='2023-12-31', limit=10000):
    url = "https://api.openaq.org/v2/measurements"
    params = {
        "city": city,
        "parameter": "pm25",
        "date_from": date_from,
        "date_to": date_to,
        "limit": limit,
        "page": 1,
        "sort": "desc"
    }
    records = []
    while True:
        resp = requests.get(url, params=params, timeout=30)  # timeout avoids hanging on a stalled connection
        if resp.status_code != 200:
            print('Request failed:', resp.status_code, resp.text)
            break
        data = resp.json()
        results = data.get('results', [])
        if not results:
            break
        records.extend(results)
        meta = data.get('meta', {})
        if params['page'] * params['limit'] >= meta.get('found', 0):
            break
        params['page'] += 1
    if not records:
        print("No records fetched.")
        return pd.DataFrame()
    df = pd.json_normalize(records)
    df = df[['date.utc', 'location', 'parameter', 'value', 'unit', 'coordinates.latitude', 'coordinates.longitude']]
    df = df.rename(columns={'date.utc':'date','coordinates.latitude':'lat','coordinates.longitude':'lon','value':'pm25'})
    df['date'] = pd.to_datetime(df['date']).dt.date
    daily = df.groupby('date')['pm25'].mean().reset_index()
    daily['date'] = pd.to_datetime(daily['date'])
    return daily

# Try a fetch (may be limited by API rate-limits)
try:
    sample_pm25 = fetch_openaq_pm25(date_from='2023-01-01', date_to='2023-12-31', limit=1000)
    display(sample_pm25.head())
except Exception as e:
    print('OpenAQ fetch error:', e)
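
The fetch function groups measurements by the UTC calendar date. If you prefer London local days (so that late-evening readings in summer time land on the right day), a minimal sketch is shown below. The helper name `to_daily_local`, the column name `date_utc`, and the three readings are made up for illustration and are not part of the OpenAQ response.

```python
import pandas as pd

def to_daily_local(df, ts_col="date_utc", value_col="pm25", tz="Europe/London"):
    """Average measurements per local calendar day (hypothetical helper)."""
    # Parse timestamps as UTC, then convert to the local time zone
    ts = pd.to_datetime(df[ts_col], utc=True).dt.tz_convert(tz)
    # Drop the tz info after conversion so the dates merge cleanly with the weather table
    local_day = ts.dt.tz_localize(None).dt.normalize()
    return (df.assign(date=local_day)
              .groupby("date", as_index=False)[value_col].mean())

# Illustrative readings only (not real data)
raw = pd.DataFrame({
    "date_utc": ["2023-07-01T22:30:00Z", "2023-07-01T23:30:00Z", "2023-07-02T10:00:00Z"],
    "pm25": [8.0, 10.0, 14.0],
})
daily_demo = to_daily_local(raw)
print(daily_demo)
```

In this toy input the 23:30 UTC reading falls on 2 July in London (UTC+1 in summer), so it is averaged with the following morning's value rather than with the previous evening's.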


### Code: Fetch daily weather from Open-Meteo (London)

In [None]:
# Example: Fetch daily weather summary for London coordinates from Open-Meteo
import requests
import pandas as pd

def fetch_open_meteo(start_date='2023-01-01', end_date='2023-12-31'):
    url = "https://archive-api.open-meteo.com/v1/archive"
    params = {
        "latitude": 51.5072,
        "longitude": -0.1276,
        "start_date": start_date,
        "end_date": end_date,
        "daily": ["temperature_2m_max","temperature_2m_min","windspeed_10m_max","precipitation_sum","temperature_2m_mean"],
        "timezone": "Europe/London"
    }
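    resp = requests.get(url, params=params, timeout=30)  # timeout avoids hanging on a stalled connection
    if resp.status_code != 200:
        print('Open-Meteo request failed:', resp.status_code)
        return pd.DataFrame()
    data = resp.json().get('daily', {})
    if 'time' not in data:  # guard against an empty or unexpected payload
        print('Open-Meteo response contains no daily data.')
        return pd.DataFrame()
    df = pd.DataFrame(data)
    df['time'] = pd.to_datetime(df['time'])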
    df = df.rename(columns={'time':'date','temperature_2m_mean':'temp_mean','temperature_2m_max':'temp_max','temperature_2m_min':'temp_min','windspeed_10m_max':'wind_max','precipitation_sum':'precip'})
    return df[['date','temp_mean','temp_max','temp_min','wind_max','precip']]

# Try a fetch (may be limited by remote availability)
try:
    sample_weather = fetch_open_meteo('2023-01-01','2023-12-31')
    display(sample_weather.head())
except Exception as e:
    print('Open-Meteo fetch error:', e)
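
If the remote service is unavailable, the parsing step can still be checked offline. The sketch below assumes the response has a `daily` object holding parallel lists keyed by variable name, which is the shape the fetch function above relies on. The helper `daily_payload_to_frame` and the numbers in `sample_payload` are hypothetical.

```python
import pandas as pd

# Same renames as in fetch_open_meteo
RENAMES = {
    "temperature_2m_mean": "temp_mean",
    "temperature_2m_max": "temp_max",
    "temperature_2m_min": "temp_min",
    "windspeed_10m_max": "wind_max",
    "precipitation_sum": "precip",
}

def daily_payload_to_frame(payload):
    """Turn a decoded JSON payload into a tidy daily DataFrame (hypothetical helper)."""
    daily = payload.get("daily") or {}
    if "time" not in daily:
        # Return an empty frame with the expected columns instead of raising
        return pd.DataFrame(columns=["date", *RENAMES.values()])
    df = pd.DataFrame(daily).rename(columns={"time": "date", **RENAMES})
    df["date"] = pd.to_datetime(df["date"])
    return df

# Made-up values that only mimic the assumed payload shape
sample_payload = {
    "daily": {
        "time": ["2023-01-01", "2023-01-02"],
        "temperature_2m_max": [12.1, 9.8],
        "precipitation_sum": [0.0, 3.2],
    }
}
weather_demo = daily_payload_to_frame(sample_payload)
empty_demo = daily_payload_to_frame({})
print(weather_demo)
```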


### Kaggle datasets guidance (manual download)

If you use a Kaggle dataset:
1. Sign in to Kaggle.
2. Visit the dataset page (e.g., 'London Weather Data 1979–2023').
3. Click **Download** and extract files.
4. Place CSVs in the same folder as this notebook and update file names in load cells.

To automate: install the Kaggle CLI, place your API token at `~/.kaggle/kaggle.json`, then run `kaggle datasets download -d <dataset>`.
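
Downloaded CSVs often need their date column converted before they can be merged. The sketch below assumes a file whose dates are stored as `YYYYMMDD` integers and whose temperature column is called `mean_temp`. Both are assumptions for illustration, so check the actual header of the file you download. An in-memory string stands in for the file.

```python
import io
import pandas as pd

# Stand-in for a downloaded CSV; column names and values are hypothetical
csv_text = """date,mean_temp,precipitation
20230101,9.5,1.2
20230102,7.0,0.0
"""

kaggle_df = pd.read_csv(io.StringIO(csv_text))
# Without an explicit format, integers such as 20230101 are read as nanoseconds since 1970
kaggle_df["date"] = pd.to_datetime(kaggle_df["date"].astype(str), format="%Y%m%d")
# Align column names with the rest of this notebook
kaggle_df = kaggle_df.rename(columns={"mean_temp": "temperature"})
print(kaggle_df.dtypes)
```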


## 2 — Load sample data (created for you)

In [None]:
import pandas as pd
air = pd.read_csv('london_air_quality.csv')
weather = pd.read_csv('london_weather.csv')
air['date'] = pd.to_datetime(air['date'])
weather['date'] = pd.to_datetime(weather['date'])
print('Air shape:', air.shape)
print('Weather shape:', weather.shape)
display(air.head())
display(weather.head())
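
Before merging, it can help to confirm that each table has one row per day with no gaps. The following is a minimal sketch of such a check. The helper `date_coverage` is hypothetical, and the small `demo` frame is made up so the example runs without the sample CSVs.

```python
import pandas as pd

def date_coverage(frame, date_col="date"):
    """Summarise date range, duplicate dates, and missing days (hypothetical helper)."""
    dates = pd.to_datetime(frame[date_col])
    expected = pd.date_range(dates.min(), dates.max(), freq="D")
    missing = expected.difference(pd.DatetimeIndex(dates.unique()))
    return {
        "start": dates.min(),
        "end": dates.max(),
        "n_duplicates": int(dates.duplicated().sum()),
        "missing_days": list(missing),
    }

# Toy frame with one duplicated date and one missing day
demo = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02", "2023-01-02", "2023-01-04"],
    "PM2.5": [10, 12, 12, 9],
})
report = date_coverage(demo)
print(report)
```

The same call could be applied to `air` and `weather` once they are loaded, for example `date_coverage(air)`.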


## 3 — Data cleaning & merging (step-by-step)

In [None]:
# Example cleaning and merging
import pandas as pd
df_air = air.copy()
df_weather = weather.copy()
print('Air missing:', df_air.isna().sum().to_dict())
print('Weather missing:', df_weather.isna().sum().to_dict())
# Merge on date; the inner join keeps only days present in both tables
df = pd.merge(df_air, df_weather, on='date', how='inner')
print('Merged shape:', df.shape)
display(df.head())
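
An inner join silently drops days that appear in only one table. One way to audit this, and to fill short gaps in the pollutant series, is sketched below. The frames `air_demo` and `weather_demo` are made up, and the choice of time-based interpolation limited to two consecutive days is an assumption you should justify (or replace) in your report.

```python
import pandas as pd

# Toy tables: PM2.5 has a one-day gap, and the date ranges do not fully overlap
air_demo = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"]),
    "PM2.5": [10.0, None, 14.0, 11.0],
})
weather_demo = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-05"]),
    "temperature": [5.0, 6.0, 4.0, 7.0],
})

# Outer join with indicator=True shows which rows an inner join would discard
audit = pd.merge(air_demo, weather_demo, on="date", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["date", "_merge"]])

# Inner join, then fill gaps of at most two days by linear interpolation in time
merged_demo = (pd.merge(air_demo, weather_demo, on="date", how="inner")
                 .sort_values("date")
                 .set_index("date"))
merged_demo["PM2.5"] = merged_demo["PM2.5"].interpolate(method="time", limit=2)
merged_demo = merged_demo.dropna().reset_index()
print(merged_demo)
```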


## 4 — Exploratory Data Analysis (EDA) examples

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')

# Descriptive stats
print(df[['PM2.5','PM10','NO2','temperature','humidity','wind_speed']].describe())

# Time series plot for PM2.5
plt.figure(figsize=(12,4))
sns.lineplot(x='date', y='PM2.5', data=df)
plt.title('Daily PM2.5 (sample data)')
plt.xlabel('Date')
plt.ylabel('PM2.5 (µg/m³)')
plt.show()

# Scatterplot relationships
plt.figure(figsize=(10,4))
sns.scatterplot(x='temperature', y='PM2.5', data=df, alpha=0.5)
plt.title('PM2.5 vs Temperature')
plt.show()

# Correlation heatmap
corr = df[['PM2.5','temperature','humidity','wind_speed']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation matrix')
plt.show()
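
Daily series are noisy, so seasonality is often easier to see in a rolling mean or a monthly summary. The sketch below builds a synthetic series (a winter-peaking cosine plus random noise, invented purely for illustration) and computes both. With the real data, the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: higher in winter, lower in summer, plus noise (illustrative only)
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
rng = np.random.default_rng(0)
doy = dates.dayofyear.to_numpy()
demo = pd.DataFrame({
    "date": dates,
    "PM2.5": 12 + 4 * np.cos(2 * np.pi * doy / 365) + rng.normal(0, 2, len(dates)),
})

# 7-day rolling mean; require at least 4 observations in each window
demo["pm25_7d"] = demo["PM2.5"].rolling(window=7, min_periods=4).mean()

# Monthly mean and maximum, indexed by month number (1 to 12)
monthly = demo.groupby(demo["date"].dt.month)["PM2.5"].agg(["mean", "max"])
print(monthly.round(1))
```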


## 5 — Modelling (Regression & Evaluation)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
X = df[['temperature','humidity','wind_speed']]
y = df['PM2.5']
model = LinearRegression()
model.fit(X,y)
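y_pred = model.predict(X)
# Note: these are in-sample scores (computed on the training data), so they are optimistic
print('R2:', r2_score(y,y_pred))
# RMSE as the square root of MSE; the `squared` argument was deprecated and later removed in scikit-learn
print('RMSE:', mean_squared_error(y,y_pred) ** 0.5)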
coef = pd.DataFrame({'feature': X.columns, 'coef': model.coef_})
display(coef)
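
The scores above are measured on the same rows the model was fitted to. For a fairer estimate of predictive power, one option is a chronological hold-out: fit on the earlier 80% of days and score on the later 20%. The sketch below uses synthetic data with an invented relationship (PM2.5 falling with wind speed), so the numbers it prints say nothing about London. The 80/20 ratio is also an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic daily data (illustrative only)
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=n, freq="D"),
    "temperature": rng.normal(12, 6, n),
    "humidity": rng.uniform(40, 95, n),
    "wind_speed": rng.uniform(0, 12, n),
})
demo["PM2.5"] = 20 - 0.8 * demo["wind_speed"] + 0.05 * demo["humidity"] + rng.normal(0, 2, n)

# Chronological split: no shuffling, so the test period lies strictly after the training period
demo = demo.sort_values("date")
split = int(len(demo) * 0.8)
train, test = demo.iloc[:split], demo.iloc[split:]

features = ["temperature", "humidity", "wind_speed"]
model_demo = LinearRegression().fit(train[features], train["PM2.5"])
pred = model_demo.predict(test[features])

test_r2 = r2_score(test["PM2.5"], pred)
test_rmse = mean_squared_error(test["PM2.5"], pred) ** 0.5
print(f"Hold-out R2: {test_r2:.3f}, RMSE: {test_rmse:.3f}")
```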


### Statistical tests (significance)

In [None]:
from scipy.stats import pearsonr
for col in ['temperature','humidity','wind_speed']:
    pair = df[['PM2.5', col]].dropna()  # pearsonr does not handle missing values
    r, p = pearsonr(pair['PM2.5'], pair[col])
    print(f'{col}: r={r:.3f}, p={p:.4f}')
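
Pearson tests look at one weather variable at a time, whereas the report asks for p-values on the regression coefficients. scikit-learn's `LinearRegression` does not report these, so the sketch below computes them with the standard ordinary least squares t-test using NumPy and SciPy. The function name `ols_pvalues` and the synthetic inputs are hypothetical. A statistics package such as statsmodels reports the same quantities if you have it installed.

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Return OLS coefficients, standard errors, and two-sided p-values.

    Index 0 of each returned array is the intercept (hypothetical helper).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])        # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)     # least-squares fit
    resid = y - A @ beta
    dof = A.shape[0] - A.shape[1]                    # n minus number of parameters
    sigma2 = resid @ resid / dof                     # residual variance estimate
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    t_stat = beta / se
    p = 2 * stats.t.sf(np.abs(t_stat), dof)          # two-sided t-test
    return beta, se, p

# Synthetic check: y depends on x1 but not on x2 (illustrative only)
rng = np.random.default_rng(1)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y_demo = 1 + 2 * x1 + rng.normal(size=n)
beta, se, p = ols_pvalues(np.column_stack([x1, x2]), y_demo)
for name, b, s, pv in zip(["intercept", "x1", "x2"], beta, se, p):
    print(f"{name}: coef={b:.3f}, se={s:.3f}, p={pv:.4g}")
```

With the notebook's data, the call would be `ols_pvalues(df[['temperature','humidity','wind_speed']], df['PM2.5'])` after dropping rows with missing values.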


## 7 — 1000-word Report Template (paste into Markdown cell or Word)

### Title: Impact of Weather Conditions on PM2.5 Levels in London (2023–2024)

**Introduction (≈150 words)**  
Air pollution affects public health and is influenced by meteorological conditions. This study explores the relationships between temperature, humidity, and wind speed and PM2.5 concentrations in London. Using daily data for 2023–2024, we examine correlations, perform regression analysis, and evaluate the predictive power of weather variables...

**Data (≈150 words)**  
Describe sources: OpenAQ for PM2.5 (or LondonAir / DEFRA if used), Open‑Meteo / Met Office for weather. Explain date range, variables, preprocessing (aggregation to daily means), and any missing data handling.

**EDA (≈250 words)**  
Summarize main patterns observed: seasonality, peaks, correlations (e.g., inverse relation with wind). Refer to 2–3 visualizations and key statistics.

**Modelling & Results (≈250 words)**  
Explain regression model, report R², RMSE, coefficients and p‑values. Interpret coefficients (e.g., a 1°C increase is associated with X µg/m³ change in PM2.5). Discuss statistical significance.

**Evaluation & Limitations (≈150 words)**  
Discuss limitations: station coverage, missing pollutant sources (traffic), temporal coverage, potential confounders. Suggest improvements: include traffic, land-use, and emissions inventory data.

**Conclusion & Policy Implications (≈100 words)**  
Summarize findings and suggest how authorities could use results for forecasting poor air-quality days or public advisories.

**References**  
- OpenAQ (2024). Open Air Quality Data Portal.  
- Open‑Meteo (2024). Historical Weather API.  
- London Air (Imperial College London, formerly at King's College London) or DEFRA datasets, if used.


## 8 — Marking checklist & tips (aim for 80%+)

Use this checklist before submission:
- [ ] Clear research question and justification (Introduction)
- [ ] Data sources cited and described (Data section)
- [ ] Data cleaning steps documented and reproducible
- [ ] At least 4 visualizations with captions (time series, scatter, heatmap, regression)
- [ ] Descriptive stats + correlation + regression results
- [ ] Interpretations for coefficients and p-values
- [ ] Discussion of limitations and future work
- [ ] 1000-word report included as a Markdown cell or Word doc
- [ ] Code runs end-to-end and notebook exported as PDF/.ipynb

Tips:
- Comment your code and add Markdown explanations after each code block.
- Keep plots readable (labels, titles, captions).
- Reference your data and any external methods.
