# {Project Title}üìù

![Banner](./assets/banner.jpeg)

## Topic
*What problem are you (or your stakeholder) trying to address?*
üìù <!-- Answer Below -->

How does air quality affect human health? 

This is for general awareness about the dangers of pollution. Also, if patterns between specific air quality metrics and health impacts can be found, then a combination of public policy and medical interventions could address those health impacts.

## Project Question
*What specific question are you seeking to answer with this project?*
*This is not the same as the questions you ask to limit the scope of the project.*
üìù <!-- Answer Below -->

How do metrics of air quality correlate with metrics of human health? For example, AQI (air quality index) and life expectancy. 

Does coal usage (for energy generation) correlate with air quality, and by extension, does coal usage correlate with human health impacts?

## What would an answer look like?
*What is your hypothesized answer to your question?*
üìù <!-- Answer Below -->

The answer will look like scatter plots, correlation coefficients and regression models between air quality metrics and human health metrics.


In [2]:
import pandas as pd
import numpy as np
import matplotlib as mpl

import requests
from bs4 import BeautifulSoup
from io import StringIO

## Data Sources
*What 3 data sources have you identified for this project?*
*How are you going to relate these datasets?*
üìù <!-- Answer Below -->

Data:

https://www.kaggle.com/datasets/sazidthe1/global-air-pollution-data

https://gco.iarc.fr/today/en/dataviz/maps-prevalence?mode=population&age_end=17&age_start=0&options_indicator=%5Bobject%20Object%5D_%5Bobject%20Object%5D&types=2&cancers=40

https://vizhub.healthdata.org/gbd-results/

https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy

https://ourworldindata.org/grapher/coal-consumption-by-country-terawatt-hours-twh
Data sources: Energy Institute - Statistical Review of World Energy (2025) ‚Äì with major processing by Our World in Data


Relating the data sets:
Plot each data set against air pollution to look for patterns and correlations (linear or polynomial regression)


## Approach and Analysis
*What is your approach to answering your project question?*
*How will you use the identified data to answer your project question?*
üìù <!-- Start Discussing the project here; you can add as many code cells as you need -->

In [3]:
air_pollution_df = pd.read_csv("assets/data/AirPollution/global_air_pollution_data.csv")
air_pollution_df.head(5)

In [58]:
air_pollution_df = air_pollution_df[['country_name', 'city_name', 'aqi_value', 'ozone_aqi_value', 'pm2.5_aqi_value']]
air_pollution_df.columns = [['region', 'city', 'aqi_value', 'ozone_aqi_value', 'pm2.5_aqi_value']]
air_pollution_df.head(5)

Unnamed: 0,region,city,aqi_value,ozone_aqi_value,pm2.5_aqi_value
0,Russian Federation,Praskoveya,51,36,51
1,Brazil,Presidente Dutra,41,5,41
2,Italy,Priolo Gargallo,66,39,66
3,Poland,Przasnysz,34,34,20
4,France,Punaauia,22,22,6


In [60]:
# Fetch the data.
coal_df = pd.read_csv("https://ourworldindata.org/grapher/coal-consumption-by-country-terawatt-hours-twh.csv?v=1&csvType=full&useColumnShortNames=true", 
                 storage_options = {'User-Agent': 'Our World In Data data fetch/1.0'})
# Fetch the metadata
#metadata = requests.get("https://ourworldindata.org/grapher/coal-consumption-by-country-terawatt-hours-twh.metadata.json?v=1&csvType=full&useColumnShortNames=true").json()
coal_df.head(5)

Unnamed: 0,Entity,Code,Year,coal_consumption_twh
0,Africa,,1965,323.49615
1,Africa,,1966,323.12222
2,Africa,,1967,330.2916
3,Africa,,1968,343.51288
4,Africa,,1969,346.64288


In [63]:
coal_df_mean = pd.DataFrame(data=coal_df.groupby('Entity', as_index=False)['coal_consumption_twh'].mean())
coal_df_mean.columns = [['region', 'coal_consumption_twh']]
coal_df_mean.head(5)

Unnamed: 0,region,coal_consumption_twh
0,Africa,845.844225
1,Africa (EI),845.844225
2,Algeria,4.6228
3,Argentina,10.814003
4,Asia,15964.662777


In [64]:
air_poll_coal_df = pd.merge(air_pollution_df, coal_df_mean, on='region', how='inner')
air_poll_coal_df.head(5)

ValueError: The column label 'region' is not unique.
For a multi-index, the label must be a tuple with elements corresponding to each level.

In [46]:
cancer_df = pd.read_csv("assets/data/AirPollution/cancers-excl-non-melanoma-skin-cancer.csv")
lung_cancer_df = pd.read_csv("assets/data/AirPollution/trachea-bronchus-and-lung.csv")

In [44]:
cancer_df = cancer_df[['Population', 'Alpha‚Äë3 code', 'Cancer id', 'Prevalence (Prop. (W)) per 100 000']]
cancer_df.head(5)

Unnamed: 0,Population,Alpha‚Äë3 code,Cancer id,Prevalence (Prop. (W)) per 100 000
0,Afghanistan,AFG,40,52.53
1,Albania,ALB,40,109.75
2,Algeria,DZA,40,96.31
3,Angola,AGO,40,73.06
4,Azerbaijan,AZE,40,99.93


In [45]:
lung_cancer_df = lung_cancer_df[['Population', 'Alpha‚Äë3 code', 'Cancer id', 'Prevalence (Prop. (W)) per 100 000']]
lung_cancer_df.head(5)

Unnamed: 0,Population,Alpha‚Äë3 code,Cancer id,Prevalence (Prop. (W)) per 100 000
0,Afghanistan,AFG,15,3.9
1,Albania,ALB,15,13.19
2,Algeria,DZA,15,5.97
3,Angola,AGO,15,1.31
4,Azerbaijan,AZE,15,10.08


In [47]:
chronic_lung_df = pd.read_csv("assets/data/AirPollution/IHME-GBD_2023_DATA-cd7bf834-1.csv")
chronic_lung_df.head(5)

Unnamed: 0,measure_id,measure_name,location_id,location_name,sex_id,sex_name,age_id,age_name,cause_id,cause_name,metric_id,metric_name,year,val,upper,lower
0,1,Deaths,148,Morocco,3,Both,22,All ages,508,Chronic respiratory diseases,1,Number,2023,7340.195618,10431.263297,4659.016336
1,1,Deaths,148,Morocco,3,Both,22,All ages,508,Chronic respiratory diseases,2,Percent,2023,0.02625,0.036947,0.016484
2,1,Deaths,148,Morocco,3,Both,22,All ages,508,Chronic respiratory diseases,3,Rate,2023,19.889601,28.265413,12.624456
3,1,Deaths,27,Samoa,3,Both,22,All ages,508,Chronic respiratory diseases,1,Number,2023,108.162232,151.46214,76.715183
4,1,Deaths,27,Samoa,3,Both,22,All ages,508,Chronic respiratory diseases,2,Percent,2023,0.077654,0.108629,0.056364


In [48]:
chronic_lung_df = chronic_lung_df[['location_name', 'measure_name', 'cause_name', 'metric_name', 'year', 'val']]
chronic_lung_df.head(5)

Unnamed: 0,location_name,measure_name,cause_name,metric_name,year,val
0,Morocco,Deaths,Chronic respiratory diseases,Number,2023,7340.195618
1,Morocco,Deaths,Chronic respiratory diseases,Percent,2023,0.02625
2,Morocco,Deaths,Chronic respiratory diseases,Rate,2023,19.889601
3,Samoa,Deaths,Chronic respiratory diseases,Number,2023,108.162232
4,Samoa,Deaths,Chronic respiratory diseases,Percent,2023,0.077654


In [12]:
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy'
headers = {'User-Agent': 'PyRequests/2.14'}
page = requests.get(url, headers=headers)
#print(page.content)

In [36]:
soup = BeautifulSoup(page.content,'html.parser')
tables = soup.find_all('table')
table_IO = StringIO(str(tables[1]))
life_expect_df = pd.read_html(table_IO)[0]

In [35]:
life_expect_df = life_expect_df.loc[:, (['Locations', 'Life expectancy overall'], ['Locations', 'at birth'])]
life_expect_df.columns = [['Locations', 'Life expectancy at birth']]
life_expect_df.head(5)

Unnamed: 0,Locations,Life expectancy at birth
0,Hong Kong,85.51
1,Japan,84.71
2,South Korea,84.33
3,French Polynesia,84.07
4,Andorra,84.04


## Resources and References
*What resources and references have you used for this project?*
üìù <!-- Answer Below -->

In [2]:
# ‚ö†Ô∏è Make sure you run this cell at the end of your notebook before every submission!
!jupyter nbconvert --to python source.ipynb

[NbConvertApp] Converting notebook source.ipynb to python
[NbConvertApp] Writing 1271 bytes to source.py
