# {Project Title} 📝

![Banner](./assets/banner.jpeg)

## Topic
*What problem are you (or your stakeholder) trying to address?*
📝 <!-- Answer Below -->

How does air quality affect human health? 

The goal is to raise general awareness of the dangers of pollution. In addition, if patterns between specific air quality metrics and health impacts can be found, a combination of public policy and medical interventions could target those impacts.

## Project Question
*What specific question are you seeking to answer with this project?*
*This is not the same as the questions you ask to limit the scope of the project.*
📝 <!-- Answer Below -->

How do metrics of air quality correlate with metrics of human health, for example AQI (air quality index) and life expectancy?

Does coal usage (for energy generation) correlate with air quality, and by extension, does coal usage correlate with human health impacts?

## What would an answer look like?
*What is your hypothesized answer to your question?*
📝 <!-- Answer Below -->

The answer will take the form of scatter plots, correlation coefficients, and regression models relating air quality metrics to human health metrics.
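
As a rough illustration of that kind of answer, the sketch below computes a Pearson correlation coefficient and a least-squares line for two metrics. The numbers are made-up placeholders rather than project data, and the names `demo_df` and `life_expectancy` are assumptions introduced only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical placeholder values, NOT real measurements.
demo_df = pd.DataFrame({
    "aqi_value": [20.0, 40.0, 60.0, 80.0, 100.0],
    "life_expectancy": [82.0, 79.5, 77.0, 75.5, 72.0],
})

# Pearson correlation coefficient between the two metrics.
pearson_r = demo_df["aqi_value"].corr(demo_df["life_expectancy"])

# Degree-1 least-squares fit: life_expectancy ~ slope * aqi_value + intercept.
slope, intercept = np.polyfit(demo_df["aqi_value"], demo_df["life_expectancy"], deg=1)

print(f"r = {pearson_r:.3f}, slope = {slope:.3f}, intercept = {intercept:.2f}")
```

A coefficient near -1 or +1 would indicate a strong linear relationship, while a value near 0 would suggest none. Neither case establishes causation on its own.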


In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl

import requests
from bs4 import BeautifulSoup
from io import StringIO

## Data Sources
*What 3 data sources have you identified for this project?*
*How are you going to relate these datasets?*
📝 <!-- Answer Below -->

Data:

https://www.kaggle.com/datasets/sazidthe1/global-air-pollution-data

https://gco.iarc.fr/today/en/dataviz/maps-prevalence?mode=population&age_end=17&age_start=0&options_indicator=%5Bobject%20Object%5D_%5Bobject%20Object%5D&types=2&cancers=40

https://vizhub.healthdata.org/gbd-results/

https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy

https://ourworldindata.org/grapher/coal-consumption-by-country-terawatt-hours-twh
Data sources: Energy Institute - Statistical Review of World Energy (2025) – with major processing by Our World in Data


Relating the data sets:
Plot each data set against air pollution, matched by country, to look for patterns and correlations (linear or polynomial regression).
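
One possible way to relate two of the tables is sketched below: join them on a shared `region` column, then fit a polynomial. The rows are made-up placeholders, and the names `aqi_demo` and `health_demo` are hypothetical. The `val` column mirrors the value column of the IHME export, but its contents here are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical placeholder rows, NOT real data.
aqi_demo = pd.DataFrame({
    "region": ["A", "B", "C", "D", "E"],
    "aqi_value": [30.0, 55.0, 80.0, 120.0, 150.0],
})
health_demo = pd.DataFrame({
    "region": ["A", "B", "C", "D", "F"],
    "val": [10.0, 14.0, 21.0, 30.0, 9.0],
})

# Inner join: keeps only regions that appear in both tables.
merged = aqi_demo.merge(health_demo, on="region", how="inner")

# Degree-2 polynomial fit; coefficients are ordered from highest power to lowest.
coeffs = np.polyfit(merged["aqi_value"], merged["val"], deg=2)
fitted = np.polyval(coeffs, merged["aqi_value"])

print(merged.assign(fitted=fitted))
```

With an inner join, any region whose name is spelled differently in the two sources drops out silently, so it may be worth comparing the row counts before and after the merge.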


## Approach and Analysis
*What is your approach to answering your project question?*
*How will you use the identified data to answer your project question?*
📝 <!-- Start Discussing the project here; you can add as many code cells as you need -->

In [None]:
air_pollution_df = pd.read_csv("assets/data/AirPollution/global_air_pollution_data.csv")

In [3]:
air_pollution_df = air_pollution_df[['country_name', 'city_name', 'aqi_value', 'ozone_aqi_value', 'pm2.5_aqi_value']]

In [4]:
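# Mean AQI per country; as_index=False keeps country_name as a regular column.
air_poll_mean_df = air_pollution_df.groupby('country_name', as_index=False)['aqi_value'].mean()
# Use a flat list here: a nested list would create MultiIndex columns.
air_poll_mean_df.columns = ['region', 'aqi_value']
air_poll_mean_df.to_csv("assets/data/AirPollution/air_poll_mean.csv")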

In [5]:
# Fetch the data.
coal_df = pd.read_csv(
    "https://ourworldindata.org/grapher/coal-consumption-by-country-terawatt-hours-twh.csv?v=1&csvType=full&useColumnShortNames=true",
    storage_options={'User-Agent': 'Our World In Data data fetch/1.0'},
)
# Fetch the metadata
#metadata = requests.get("https://ourworldindata.org/grapher/coal-consumption-by-country-terawatt-hours-twh.metadata.json?v=1&csvType=full&useColumnShortNames=true").json()

In [6]:
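# Mean coal consumption per entity, averaged over all available years.
coal_mean_df = coal_df.groupby('Entity', as_index=False)['coal_consumption_twh'].mean()
# Use a flat list here: a nested list would create MultiIndex columns.
coal_mean_df.columns = ['region', 'coal_consumption_twh']
coal_mean_df.to_csv("assets/data/AirPollution/coal_mean_per_year.csv")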

In [7]:
cancer_df = pd.read_csv("assets/data/AirPollution/cancers-excl-non-melanoma-skin-cancer.csv")
lung_cancer_df = pd.read_csv("assets/data/AirPollution/trachea-bronchus-and-lung.csv")

In [8]:
# The hyphen in 'Alpha‑3 code' is a non-breaking hyphen (U+2011), not an ASCII '-'.
cancer_df = cancer_df[['Population', 'Alpha‑3 code', 'Cancer id', 'Prevalence (Prop. (W)) per 100 000']]
# Rename 'Population' to 'region' so it matches the key used in the other tables.
cancer_df = cancer_df.rename(columns={'Population': 'region'})
cancer_df.to_csv("assets/data/AirPollution/cancer.csv")

In [9]:
# The hyphen in 'Alpha‑3 code' is a non-breaking hyphen (U+2011), not an ASCII '-'.
lung_cancer_df = lung_cancer_df[['Population', 'Alpha‑3 code', 'Cancer id', 'Prevalence (Prop. (W)) per 100 000']]
# Rename 'Population' to 'region' so it matches the key used in the other tables.
lung_cancer_df = lung_cancer_df.rename(columns={'Population': 'region'})
lung_cancer_df.to_csv("assets/data/AirPollution/lung_cancer.csv")

In [10]:
chronic_lung_df = pd.read_csv("assets/data/AirPollution/IHME-GBD_2023_DATA-cd7bf834-1.csv")

In [11]:
chronic_lung_death_rate_df = chronic_lung_df[chronic_lung_df['metric_name']=='Rate']
chronic_lung_death_rate_df = chronic_lung_death_rate_df[['location_name', 'measure_name', 'cause_name', 'metric_name', 'year', 'val']]

In [12]:
# Use a flat list here: a nested list would create MultiIndex columns.
chronic_lung_death_rate_df.columns = ['region', 'measure_name', 'cause_name', 'metric_name', 'year', 'val']

In [13]:
chronic_lung_death_rate_df.to_csv("assets/data/AirPollution/chronic_lung_death_rate.csv")

In [14]:
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy'
headers = {'User-Agent': 'PyRequests/2.14'}
page = requests.get(url, headers=headers)
#print(page.content)

In [15]:
soup = BeautifulSoup(page.content,'html.parser')
tables = soup.find_all('table')
table_IO = StringIO(str(tables[1]))
life_expect_df = pd.read_html(table_IO)[0]

In [16]:
life_expect_df = life_expect_df.loc[:, (['Locations', 'Life expectancy overall'], ['Locations', 'at birth'])]

In [17]:
life_expect_df.to_csv("assets/data/AirPollution/life_expect.csv")

## Resources and References
*What resources and references have you used for this project?*
📝 <!-- Answer Below -->

In [2]:
# ⚠️ Make sure you run this cell at the end of your notebook before every submission!
!jupyter nbconvert --to python source.ipynb

[NbConvertApp] Converting notebook source.ipynb to python
[NbConvertApp] Writing 1271 bytes to source.py
