# Environmental Health Analysis: PM2.5 Levels and Respiratory Diseases in Major Nigerian Cities (2018–2023)
This notebook analyzes the relationship between air quality (PM2.5 levels) and respiratory health outcomes across Lagos, Abuja, Port Harcourt, and Kano.


In [24]:
# import necessary libraries
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

# Set plot style
plt.style.use('default')
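# Note: the second colour, '#FFFFFF', is white and will not be visible on the default white background
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=['#636EFB', '#FFFFFF'])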

In [17]:
df = pd.read_csv('dataset/Airpollution-and-Publichealth_dataset.csv')

## Data Overview & Preprocessing
This section provides an overview of the dataset and performs basic cleaning to ensure consistency and readiness for analysis.

In [19]:
df.head()

|   | Unnamed: 0 | city | year | pm25_annual | population | respiratory_cases | rate_per_100k |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Lagos | 2018 | 89.56 | 14029002 | 36086 | 257.22 |
| 1 | 1 | Lagos | 2019 | 85.1 | 14327530 | 50221 | 350.52 |
| 2 | 2 | Lagos | 2020 | 79.86 | 14428199 | 48446 | 335.78 |
| 3 | 3 | Lagos | 2021 | 69.84 | 14800018 | 35757 | 241.6 |
| 4 | 4 | Lagos | 2022 | 80.13 | 15195849 | 43928 | 289.08 |


In [55]:
df.shape  # display the dataset dimensions (rows, columns)

(24, 6)

In [22]:
df.columns # check dataset columns 

Index(['Unnamed: 0', 'city', 'year', 'pm25_annual', 'population',
       'respiratory_cases', 'rate_per_100k'],
      dtype='object')

In [42]:
df.drop(columns=['Unnamed: 0'], inplace=True) # drop the unnecessary index column 
df.columns

Index(['city', 'year', 'pm25_annual', 'population', 'respiratory_cases',
       'rate_per_100k'],
      dtype='object')
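
The `Unnamed: 0` column dropped above usually appears when a CSV was saved together with its DataFrame index. As a minimal sketch, assuming the file's first column is such an unnamed saved index, the column can be absorbed at load time with `index_col=0`, or dropped with `errors='ignore'` so that re-running the cell does not raise a `KeyError`. The `csv_text` string is a hypothetical stand-in for the real file, built from two rows of the `df.head()` output.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: two rows copied from the df.head()
# output, with an unnamed leading index column.
csv_text = (
    ",city,year,pm25_annual,population,respiratory_cases,rate_per_100k\n"
    "0,Lagos,2018,89.56,14029002,36086,257.22\n"
    "1,Lagos,2019,85.1,14327530,50221,350.52\n"
)

# Default read: the unnamed column comes back as 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))

# Option 1: treat the first column as the index while loading.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: drop the column in a way that is safe to re-run.
cleaned = raw.drop(columns=['Unnamed: 0'], errors='ignore')
cleaned = cleaned.drop(columns=['Unnamed: 0'], errors='ignore')  # no-op the second time

print(list(raw.columns))
print(list(indexed.columns))
```

Either option leaves the same six analysis columns; `index_col=0` simply avoids the separate drop step.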

In [44]:
df.info()  # check data types and missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   city               24 non-null     object 
 1   year               24 non-null     int64  
 2   pm25_annual        24 non-null     float64
 3   population         24 non-null     int64  
 4   respiratory_cases  24 non-null     int64  
 5   rate_per_100k      24 non-null     float64
dtypes: float64(2), int64(3), object(1)
memory usage: 1.3+ KB


In [46]:
df.describe() # statistical summary of dataset 

|   | year | pm25_annual | population | respiratory_cases | rate_per_100k |
|---|---|---|---|---|---|
| count | 24.0 | 24.0 | 24.0 | 24.0 | 24.0 |
| mean | 2020.5 | 70.303333 | 8553365.0 | 21055.791667 | 269.841667 |
| std | 1.744557 | 20.267865 | 5808256.0 | 14780.828461 | 90.416358 |
| min | 2018.0 | 34.58 | 2497038.0 | 4831.0 | 141.99 |
| 25% | 2019.0 | 49.8575 | 2938447.0 | 8481.75 | 203.5325 |
| 50% | 2020.5 | 70.37 | 8172270.0 | 16367.0 | 258.25 |
| 75% | 2022.0 | 88.105 | 14128310.0 | 32511.75 | 338.095 |
| max | 2023.0 | 102.45 | 15564590.0 | 50221.0 | 489.26 |


In [60]:
df.isnull().sum()  # Check for missing values

city                 0
year                 0
pm25_annual          0
population           0
respiratory_cases    0
rate_per_100k        0
dtype: int64

In [62]:
df.duplicated().sum() # Check for duplicates

0

In [48]:
df.nunique(axis=0).sort_values().to_frame(name= 'unique values')  # check number of unique values in each column

|   | unique values |
|---|---|
| city | 4 |
| year | 6 |
| pm25_annual | 24 |
| population | 24 |
| respiratory_cases | 24 |
| rate_per_100k | 24 |


**Insights from Data Overview**

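- The dataset contains 24 rows and 6 columns after cleanup, covering *4 cities* over *6 years* (2018–2023).
- The redundant `Unnamed: 0` index column was dropped.
- No missing values: all columns have 24 non-null entries.
- No duplicate rows.
- Data types are appropriate: `city` is `object`; `year`, `population`, and `respiratory_cases` are `int64`; `pm25_annual` and `rate_per_100k` are `float64`.

The dataset is clean and analysis-ready, so it requires minimal preprocessing.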
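
Beyond missing values and duplicates, two further consistency checks may be worth running before analysis. This is a sketch, not part of the original notebook. It assumes that `rate_per_100k` was derived as `respiratory_cases / population * 100,000`, which the rows shown by `df.head()` appear to follow up to rounding. The helper names `check_rate` and `has_unique_city_year` and the tolerance `tol` are illustrative, and `sample` contains only the five Lagos rows printed earlier.

```python
import pandas as pd

# Sample rows copied from the df.head() output (Lagos only); in the notebook
# these checks would be run on the full df instead.
sample = pd.DataFrame({
    'city': ['Lagos'] * 5,
    'year': [2018, 2019, 2020, 2021, 2022],
    'pm25_annual': [89.56, 85.10, 79.86, 69.84, 80.13],
    'population': [14029002, 14327530, 14428199, 14800018, 15195849],
    'respiratory_cases': [36086, 50221, 48446, 35757, 43928],
    'rate_per_100k': [257.22, 350.52, 335.78, 241.60, 289.08],
})


def check_rate(frame, tol=0.05):
    """Return rows whose stored rate differs from the recomputed rate by more than tol."""
    recomputed = frame['respiratory_cases'] / frame['population'] * 100_000
    mismatch = (recomputed - frame['rate_per_100k']).abs() > tol
    return frame[mismatch]


def has_unique_city_year(frame):
    """True if every (city, year) pair appears at most once."""
    return not frame.duplicated(subset=['city', 'year']).any()


print(check_rate(sample))            # rows that fail the rate check, if any
print(has_unique_city_year(sample))
```

The tolerance allows for rounding in the stored rates and case counts; a tighter value would flag rows that differ only in the second decimal place.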

## Exploratory Data Analysis
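
One possible starting point for this section is a per-city summary of pollution and disease burden. The sketch below assumes that city-level averages are of interest. The function name `summarize_by_city` and the output column names are illustrative, and `sample` holds only the five Lagos rows shown by `df.head()`, so the printed summary is not a result for the full dataset.

```python
import pandas as pd

# Sample rows copied from the df.head() output (Lagos only).
sample = pd.DataFrame({
    'city': ['Lagos'] * 5,
    'year': [2018, 2019, 2020, 2021, 2022],
    'pm25_annual': [89.56, 85.10, 79.86, 69.84, 80.13],
    'rate_per_100k': [257.22, 350.52, 335.78, 241.60, 289.08],
})


def summarize_by_city(frame):
    """Average PM2.5 and respiratory rate per city, plus the number of years covered."""
    return frame.groupby('city').agg(
        mean_pm25=('pm25_annual', 'mean'),
        mean_rate=('rate_per_100k', 'mean'),
        n_years=('year', 'nunique'),
    )


summary = summarize_by_city(sample)
print(summary)
```

In the notebook, calling `summarize_by_city(df)` would produce one row per city, which could then be sorted or plotted.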


## Correlation Analysis
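
A minimal sketch of how the PM2.5 and respiratory-rate relationship could be quantified, assuming the Pearson and Spearman coefficients between `pm25_annual` and `rate_per_100k` are the measures of interest. The function name `pm25_rate_correlation` is illustrative. Spearman is computed here as Pearson on ranks, which avoids an extra dependency. `sample` holds only the five Lagos rows shown by `df.head()`, so its output says nothing about the full dataset.

```python
import pandas as pd

# Sample rows copied from the df.head() output (Lagos only).
sample = pd.DataFrame({
    'pm25_annual': [89.56, 85.10, 79.86, 69.84, 80.13],
    'rate_per_100k': [257.22, 350.52, 335.78, 241.60, 289.08],
})


def pm25_rate_correlation(frame):
    """Return (pearson, spearman) correlation between PM2.5 and respiratory rate."""
    x = frame['pm25_annual']
    y = frame['rate_per_100k']
    pearson = x.corr(y)                 # linear association
    spearman = x.rank().corr(y.rank())  # monotonic association (Pearson on ranks)
    return pearson, spearman


pearson, spearman = pm25_rate_correlation(sample)
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

With only 24 city-year observations pooled across four cities, any coefficient from the full data would be a rough estimate; computing it per city, for example inside `df.groupby('city')`, may help separate within-city trends from differences between cities.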

## Results and Insights

## Policy Implications