#### REPORTING AND ANALYTICS: AIR QUALITY DATA

## INTRODUCTION
The Air Quality Index (AQI) is used for reporting daily air quality. It tells you how clean or polluted your air is, and what associated health effects might be a concern for you. The AQI focuses on health effects you may experience within a few hours or days after breathing polluted air.

**Questions**
- How does East Africa compare to the rest of Africa in terms of AQI?
- How are countries distributed by air quality status?

##### 1. OBTAIN
In this step we:
- get the dataset
- establish the data sources

Data source
- https://www.kaggle.com/datasets/azminetoushikwasi/aqi-air-quality-index-scheduled-daily-update?resource=download

In [None]:
# import the libraries
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import statistics as st
# read the dataset
df = pd.read_csv("data/air_quality.csv")

In [None]:
# inspect the header
df.head()

In [None]:
# inspect the tail
df.tail()

##### 2. SCRUB
Check whether the data:
- Is complete (assess gaps in the data)
- Has missing values
- Is consistent
- Has integrity and uniformity
- Contains repeating values

In [None]:
# get the information from the dataset
df.info()

- There are no missing values.
- The column data types are appropriate, except that the Date column should be converted to a datetime type.
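
The conversion noted above can be tried on a small example first. This is a minimal sketch on hypothetical date strings (not rows from the dataset); `errors="coerce"` is an optional setting that the notebook itself does not use, shown here because it makes malformed dates easy to spot.

```python
import pandas as pd

# Hypothetical date strings; the last one is deliberately malformed
raw = pd.Series(["2022-07-21", "2022-07-22", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, errors="coerce")

# Count the entries that could not be parsed
n_bad = int(parsed.isna().sum())
print(parsed)
print(n_bad)
```

If `n_bad` were greater than zero on the real Date column, those rows would need inspection before any time-based analysis.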

In [None]:
# transform the date column
df["Date"] = pd.to_datetime(df["Date"])
df.info()

In [None]:
# check for duplicates
duplicates = df[df.duplicated(keep = False)]
len(duplicates)

- *duplicated()* returns a boolean Series indicating whether each row is a duplicate.
- Setting *keep*=False marks every occurrence of a duplicated row as True, including the first.
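
The effect of *keep* is easiest to see on a small example. The sketch below uses a hypothetical four-row frame (not rows from the dataset) to contrast `keep=False` with the default `keep="first"`.

```python
import pandas as pd

# Toy data (hypothetical rows, not from the air quality dataset)
toy = pd.DataFrame({
    "Country": ["Kenya", "Kenya", "Uganda", "Rwanda"],
    "Status": ["Good", "Good", "Moderate", "Good"],
})

# keep=False flags every member of a duplicate group
mask_all = toy.duplicated(keep=False)

# keep="first" (the default) leaves the first occurrence unflagged
mask_first = toy.duplicated(keep="first")

print(mask_all.tolist())    # [True, True, False, False]
print(mask_first.tolist())  # [False, True, False, False]
```

Filtering with `~mask_all` therefore removes both Kenya rows, whereas `~mask_first` would keep one of them. Which behaviour is appropriate depends on whether a repeated row is treated as suspect or simply redundant.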

In [None]:
# original dataset
df.shape

In [None]:
# duplicated values
bad_df = df[df.duplicated(keep = False)]
bad_df

In [None]:
# create a dataframe that stores only the rows with no duplicates
good_df = df[~df.duplicated(keep = False)]
good_df

In [None]:
# # dataset without duplicates
# good_df = df.drop_duplicates(keep = False)
# good_df.shape

In [None]:
# generate a clean csv of air quality data
good_df.to_csv('data/good_air_quality.csv', index = False, header = True)

In [None]:
# generate duplicated csv
bad_df.to_csv('data/bad_air_quality.csv', index = False, header = True)

In [None]:
# earliest date in the cleaned data
good_df["Date"].min()

##### 3. EXPLORE
Exploration is supported by visualization, tabulation, and summaries.

What to explore:

- **Tabulation**:
Show totals of broad quantities

- **Summary Statistics**
Describe the characteristics of data (mean, median, std, quartiles)

- **Spread of data**
How is the data dispersed (scatter plots - visual inspection of outliers)

- **Distribution of major variables**
Histogram of singular variables

- **Heatmaps of major variables**
- **Proportions of major variables**
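
Two of the items above, summary statistics and proportions, can be sketched in a few lines. The frame below is hypothetical, and the column name "AQI Value" is an assumption about the dataset rather than something confirmed in this notebook.

```python
import pandas as pd

# Hypothetical sample; "AQI Value" is an assumed column name
sample = pd.DataFrame({
    "Country": ["Kenya", "Uganda", "Chad", "Egypt"],
    "Status": ["Good", "Moderate", "Unhealthy", "Moderate"],
    "AQI Value": [40, 80, 160, 100],
})

# Summary statistics: count, mean, std, min, quartiles, max
summary = sample["AQI Value"].describe()

# Proportions: share of rows in each status category
proportions = sample["Status"].value_counts(normalize=True)

print(summary)
print(proportions)
```

On the real data, the same two calls could be applied to `good_df` once the actual column names are confirmed with `good_df.info()`.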


In [None]:
good_df.info()

In [None]:
# bar chart of the number of records in each air quality status
status_dist = good_df["Status"].value_counts()
status_dist
fig = px.bar(status_dist, title = "Distribution of Countries by Status", 
             labels = {"value":"Number of Countries", "Status":"Air Quality Status"}, 
             color = status_dist.index)
fig.show()

In [None]:
# count the rows that belong to East African countries (True) versus all other countries (False)
east_african_counts = good_df["Country"].isin(["Kenya", "Uganda", "Tanzania", "Rwanda", "Burundi", "South Sudan", "Ethiopia", "Somalia", "Djibouti"]).value_counts()
east_african_counts

In [None]:
# keep only the rows for East African countries
East_Countries = ["Kenya", "Uganda", "Tanzania", "Rwanda", "Burundi", "South Sudan", "Ethiopia", "Somalia", "Djibouti"]
East_african_df = good_df[good_df["Country"].isin(East_Countries)]
East_african_df

In [None]:
# # bar chart that shows AQI value in east african countries 
# fig = px.bar(East_Countries, title = "East African Countries by Status", 
#              labels = {"value":"Number of Countries", "Status":"Air Quality Status"}, 
#              color = East_Countries.index)
# fig.show()

**Auxiliary datasets to explain:**
- How AQI affects a country's population
- How a country's AQI affects its GDP

In [None]:
# additional associations
Africa_df = pd.read_csv('data/Data_Africa.csv')
Africa_df.head()

In [None]:
# check for missing values and data types
Africa_df.info()

In [None]:
# keep only the rows for 2022 (copy so that later edits do not act on a view)
Africa_df = Africa_df[Africa_df["Year"] == 2022].copy()
Africa_df.info()

In [None]:
# replace the missing GDP values with 0 (assign back instead of using inplace on a column)
Africa_df["GDP (USD)"] = Africa_df["GDP (USD)"].fillna(0)
Africa_df.info()

In [None]:
# drop the unnecessary columns
Africa_df = Africa_df.drop(columns=["ID", "Year"])
Africa_df.head()

In [None]:
# reset the index (reset_index returns a new dataframe, so assign it back)
Africa_df = Africa_df.reset_index(drop=True)
Africa_df

In [None]:
# merge the AQI data with the Africa data on Country (inner join by default)
Population_df = pd.merge(good_df,Africa_df, on = "Country")
Population_df
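
Because the default inner join silently drops countries whose names differ between the two tables, it can help to check which rows fail to match. The sketch below uses hypothetical stand-ins for `good_df` and `Africa_df`; the "AQI Value" and "Population" column names and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for good_df and Africa_df
aqi = pd.DataFrame({"Country": ["Kenya", "Uganda", "France"],
                    "AQI Value": [40, 80, 55]})
africa = pd.DataFrame({"Country": ["Kenya", "Uganda", "Chad"],
                       "Population": [100, 200, 300]})

# how="left" keeps every AQI row; indicator=True adds a "_merge" column
checked = pd.merge(aqi, africa, on="Country", how="left", indicator=True)

# Rows marked "left_only" had no match in the auxiliary table
unmatched = checked.loc[checked["_merge"] == "left_only", "Country"].tolist()
print(unmatched)  # ['France']
```

Non-African countries are expected to appear in `unmatched`; an African country appearing there would suggest a spelling difference between the two sources.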

In [None]:
# names of African countries
african_countries = [
    "Algeria",
    "Angola",
    "Benin",
    "Botswana",
    "Burkina Faso",
    "Burundi",
    "Cabo Verde",
    "Cameroon",
    "Central African Republic",
    "Chad",
    "Comoros",
    "Congo, Republic of the",
    "Congo, Democratic Republic of the",
    "Djibouti",
    "Egypt",
    "Equatorial Guinea",
    "Eritrea",
    "Eswatini",
    "Ethiopia",
    "Gabon",
    "Gambia",
    "Ghana",
    "Guinea",
    "Guinea-Bissau",
    "Ivory Coast",
    "Kenya",
    "Lesotho",
    "Liberia",
    "Libya",
    "Madagascar",
    "Malawi",
    "Mali",
    "Mauritania",
    "Mauritius",
    "Morocco",
    "Mozambique",
    "Namibia",
    "Niger",
    "Nigeria",
    "Rwanda",
    "Sao Tome and Principe",
    "Senegal",
    "Seychelles",
    "Sierra Leone",
    "Somalia",
    "South Africa",
    "South Sudan",
    "Sudan",
    "Tanzania",
    "Togo",
    "Tunisia",
    "Uganda",
    "Zambia",
    "Zimbabwe"
]
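
One possible way to approach the first question (East Africa versus the rest of Africa) is to label each African row by region and compare the mean AQI. This is a sketch on hypothetical values; "AQI Value" is an assumed column name, and on the real data the frame would first be restricted to rows whose Country is in `african_countries`.

```python
import pandas as pd

east_countries = ["Kenya", "Uganda", "Tanzania", "Rwanda", "Burundi",
                  "South Sudan", "Ethiopia", "Somalia", "Djibouti"]

# Hypothetical African rows; "AQI Value" is an assumed column name
africa_aqi = pd.DataFrame({
    "Country": ["Kenya", "Uganda", "Egypt", "Ghana"],
    "AQI Value": [40, 60, 120, 80],
})

# Label each row by region, then compare the mean AQI per region
is_east = africa_aqi["Country"].isin(east_countries)
africa_aqi["Region"] = is_east.map({True: "East Africa", False: "Rest of Africa"})
region_mean = africa_aqi.groupby("Region")["AQI Value"].mean()

print(region_mean)
```

The means here reflect only the made-up numbers above and say nothing about the actual regions.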


##### 4. MODELLING
Are there any statistical associations that can be used to summarize the behavior of the dataset?
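
One simple measure of association that could be tried here is the Pearson correlation between AQI and GDP. The sketch below is only illustrative: the rows are hypothetical, "AQI Value" is an assumed column name, and the toy values are perfectly inversely related by construction.

```python
import pandas as pd

# Hypothetical merged rows; "AQI Value" is an assumed column name
merged = pd.DataFrame({
    "AQI Value": [40, 60, 80, 100],
    "GDP (USD)": [400, 300, 200, 100],
})

# Pearson correlation coefficient between the two columns
r = merged["AQI Value"].corr(merged["GDP (USD)"])
print(round(r, 2))
```

On the real merged data, rows where GDP was filled with 0 would pull the coefficient and may be worth excluding before computing it.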

##### 5. INTERPRETATION
Here you provide your understanding informed by the prior analysis.