## Libraries

In [60]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

import missingno as msn
import plotly.express as px

## Importing Data

In [8]:
gap = pd.read_csv("global air pollution dataset.csv")

> Let's take a look at the first few rows of the dataset

In [11]:
gap.head()

Unnamed: 0,Country,City,AQI Value,AQI Category,CO AQI Value,CO AQI Category,Ozone AQI Value,Ozone AQI Category,NO2 AQI Value,NO2 AQI Category,PM2.5 AQI Value,PM2.5 AQI Category
0,Russian Federation,Praskoveya,51,Moderate,1,Good,36,Good,0,Good,51,Moderate
1,Brazil,Presidente Dutra,41,Good,1,Good,5,Good,1,Good,41,Good
2,Italy,Priolo Gargallo,66,Moderate,1,Good,39,Good,2,Good,66,Moderate
3,Poland,Przasnysz,34,Good,1,Good,34,Good,0,Good,20,Good
4,France,Punaauia,22,Good,0,Good,22,Good,0,Good,6,Good


In [67]:
gap.shape

(23035, 12)

In [76]:
gap.dtypes.value_counts()

object    7
int64     5
dtype: int64

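> The raw dataset consists of 23,463 records and 12 features. The shape shown above, (23035, 12), was captured after the rows with missing values were dropped later in the notebook.

> Of those 12 features, 5 are numerical and 7 are categorical.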
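> The numerical/categorical split can also be listed by column name rather than only counted. The following is a minimal sketch on a small stand-in frame (`sample` and its values are illustrative, not taken from the CSV); the same two calls would apply to `gap`.

```python
import pandas as pd

# Tiny stand-in frame with the same kinds of columns as the dataset (illustrative values).
sample = pd.DataFrame({
    "Country": ["Italy", "Poland"],
    "AQI Value": [66, 34],
    "AQI Category": ["Moderate", "Good"],
})

# Numeric columns are selected explicitly; everything else is treated as categorical.
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```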

## About

> The features of the dataset need to be fully understood in order to decide which data to take forward for further analysis.

> Understanding the features is also essential for preprocessing, in case the data contains incorrect entries, missing values, duplicated values, incorrect string formatting, etc.

> Hence, let's take a look at what each feature in the dataset represents

**Country** - Name of the country

**City** - Name of the city

**Air quality index (AQI)** - An air quality index (AQI) indicates how polluted the air currently is or how polluted it is forecast to become. The AQI is calculated based on the levels of several air pollutants, including particulate matter (PM2.5 and PM10), ozone (O3), nitrogen dioxide (NO2), sulfur dioxide (SO2), and carbon monoxide (CO).

**AQI Category** -  Overall AQI category of the city

**CO AQI** - CO AQI is the Air Quality Index value calculated from the level of carbon monoxide in the air. Carbon monoxide is a colorless, odorless, poisonous gas. Outdoors, it is emitted mainly by cars, trucks, generators, and other vehicles or machinery that burn fossil fuels. Indoors, appliances such as kerosene and gas space heaters and gas stoves also release CO and affect indoor air quality.

**CO AQI Value** : AQI value of Carbon Monoxide of the city

**Ozone AQI** - Ozone AQI refers to the Air Quality Index value calculated based on the levels of ozone (O3) in the air. Ozone is a highly reactive gas that is created by the interaction of sunlight with pollutants emitted by vehicles, industrial processes, and other sources. It can reduce lung function and worsen bronchitis, emphysema, and asthma. Ozone also affects vegetation and ecosystems; in particular, it damages sensitive vegetation during the growing season.

**Ozone AQI Value** : AQI value of Ozone of the city

**NO2 AQI** - NO2 AQI refers to the Air Quality Index value calculated based on the levels of nitrogen dioxide (NO2) in the air. Nitrogen dioxide is a highly reactive gas that is emitted by vehicles, power plants, and other combustion sources. Exposure over short periods can aggravate respiratory diseases such as asthma. Longer exposures may contribute to the development of asthma and respiratory infections. People with asthma, children, and the elderly are at greater risk from the health effects of NO2.

**NO2 AQI Value** : AQI value of Nitrogen Dioxide of the city

**PM2.5** - Atmospheric particulate matter, also known as atmospheric aerosol particles, is a complex mixture of small solid and liquid particles suspended in the air. PM2.5 refers to those particles with a diameter of 2.5 micrometers or less. If inhaled, they can cause serious heart and lung problems. They have been classified as a **group 1 carcinogen by the International Agency for Research on Cancer (IARC)**. They can come from a variety of sources, including vehicle exhaust, **power plants**, wildfires, and dust.

**PM2.5 AQI Value** : AQI value of Particulate Matter with a diameter of 2.5 micrometers or less of the city

https://acp.copernicus.org/preprints/acp-2020-672/acp-2020-672-manuscript-version4.pdf

https://www.google.com/search?q=pm+value+of+plastic+burn

> Now let's take a look at the classes of AQI Category along with their corresponding ranges

## AQI Value Indications

AQI values fall into the following ranges:

> 0 to 50 - Good 

> 51 to 100 - Moderate

> 101 to 150 - Unhealthy for sensitive groups

> 151 to 200 - Unhealthy

> 201 to 300 - Very Unhealthy

> 301 to 500 - Hazardous
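
> The ranges above can be expressed as a lookup from AQI value to category. The following is a minimal sketch using `pd.cut`; the helper name `aqi_category` is hypothetical, and the label spelling follows the list above, which may differ from the exact strings stored in the CSV.

```python
import pandas as pd

# Bin edges: each band is (lower, upper], so (-1, 50] covers 0 to 50, (50, 100] covers 51 to 100, etc.
AQI_BINS = [-1, 50, 100, 150, 200, 300, 500]
AQI_LABELS = [
    "Good",
    "Moderate",
    "Unhealthy for sensitive groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]

def aqi_category(values):
    """Map AQI values to their category; values outside 0 to 500 become NaN."""
    return pd.cut(pd.Series(values), bins=AQI_BINS, labels=AQI_LABELS)

example = aqi_category([0, 50, 51, 151, 500])
print(example.tolist())
```

> Comparing the result of such a lookup with the "AQI Category" column is one way to spot rows where value and category disagree.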

## Data Preprocessing

The aim is to ensure the following, so that the dataset is suitable for analysis and modeling:

> No duplicates 

> No missing values

> Dataset with proper datatypes

The first step is to check for duplicate records in the dataset

In [18]:
# check for duplicates
dup = gap[gap.duplicated()]
dup

Unnamed: 0,Country,City,AQI Value,AQI Category,CO AQI Value,CO AQI Category,Ozone AQI Value,Ozone AQI Category,NO2 AQI Value,NO2 AQI Category,PM2.5 AQI Value,PM2.5 AQI Category


No duplicate records exist.

> Now let's check for missing values in the dataset

In [19]:
# check for missing values
gap.isnull().sum()

Country               427
City                    1
AQI Value               0
AQI Category            0
CO AQI Value            0
CO AQI Category         0
Ozone AQI Value         0
Ozone AQI Category      0
NO2 AQI Value           0
NO2 AQI Category        0
PM2.5 AQI Value         0
PM2.5 AQI Category      0
dtype: int64

> The dataset contains missing values, which can lead to wrong inferences, so they must be handled (by imputation or removal) before proceeding with the analysis

In [75]:
print("Missing value count:")
print("--------------------")
print("Country: - 427")
print("City: - 1")

Missing value count:
--------------------
Country: - 427
City: - 1


> The feature "Country" has 427 missing values and "City" has 1.

> Many countries have cities with similar names, so even if the missing countries were imputed manually from the city names, the result might not be accurate and could skew the analysis.

> In the case of City, the city cannot be determined from the name of the country alone, so it cannot be imputed.

> The alternative is to drop the records with missing values. This can distort the analysis when the dataset is small, but since the current dataset is large enough, the rows are dropped.
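
> The concern about city names being shared across countries can be checked directly. The following is a minimal sketch on a stand-in frame with hypothetical country and city names; on the real data the same `groupby` would run on `gap`.

```python
import pandas as pd

# Hypothetical names, chosen only to show one city name appearing in two countries.
sample = pd.DataFrame({
    "Country": ["A", "B", "A", "C"],
    "City": ["Springfield", "Springfield", "Rivertown", "Lakeside"],
})

# Count how many distinct countries each city name appears in.
countries_per_city = sample.groupby("City")["Country"].nunique()

# City names that map to more than one country cannot be used to infer the country.
ambiguous_cities = countries_per_city[countries_per_city > 1].index.tolist()
share_ambiguous = sample["City"].isin(ambiguous_cities).mean()

print("Ambiguous city names:", ambiguous_cities)
print("Share of rows with an ambiguous city name:", share_ambiguous)
```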


In [29]:
# dropping the records
gap.dropna(axis=0, how='any',inplace=True)

> Let's take a look at the new dimensions of the dataset after dropping the records

In [49]:
# new dimension
new_dim = gap.shape
print("New dimensions of the Dataset: ", new_dim)
print("The Records got reduced from 23,463 to 23,035")

New dimensions of the Dataset:  (23035, 12)
The Records got reduced from 23,463 to 23,035
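
> The row counts in the message above are typed in by hand. The following is a minimal sketch of computing them instead, assuming the count is taken before and after the drop; the frame `sample` and the city name "Somewhere" are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in frame: one row lacks a country, one lacks a city (illustrative values).
sample = pd.DataFrame({
    "Country": ["Italy", np.nan, "Poland"],
    "City": ["Priolo Gargallo", "Somewhere", None],
    "AQI Value": [66, 40, 34],
})

rows_before = len(sample)
# Same arguments as the notebook's dropna call, but returning a new frame.
cleaned = sample.dropna(axis=0, how="any")
rows_after = len(cleaned)

print(f"The records got reduced from {rows_before:,} to {rows_after:,}")
```

> Returning a new frame instead of using `inplace=True` keeps the original available, so the before/after counts stay correct even if the cell is re-run.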


> Now let's check whether all the features have been assigned proper datatypes

In [38]:
# check for proper datatypes
gap.dtypes

Country               object
City                  object
AQI Value              int64
AQI Category          object
CO AQI Value           int64
CO AQI Category       object
Ozone AQI Value        int64
Ozone AQI Category    object
NO2 AQI Value          int64
NO2 AQI Category      object
PM2.5 AQI Value        int64
PM2.5 AQI Category    object
dtype: object
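
> The same check can be automated rather than read off the listing. The following is a minimal sketch, assuming that every column whose name ends in "Value" is expected to hold integers; `sample` is an illustrative stand-in for `gap`.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Stand-in frame with one value column and two text columns (illustrative values).
sample = pd.DataFrame({
    "City": ["Przasnysz", "Punaauia"],
    "AQI Value": [34, 22],
    "AQI Category": ["Good", "Good"],
})

# Columns that should be integer-typed, by naming convention.
value_cols = [col for col in sample.columns if col.endswith("Value")]

# Any value column that is not integer-typed would need conversion before analysis.
non_integer = [col for col in value_cols if not is_integer_dtype(sample[col])]

print("Value columns:", value_cols)
print("Value columns with a non-integer dtype:", non_integer)
```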

### Statistical Analysis

In [39]:
gap.describe()

Unnamed: 0,AQI Value,CO AQI Value,Ozone AQI Value,NO2 AQI Value,PM2.5 AQI Value
count,23035.0,23035.0,23035.0,23035.0,23035.0
mean,72.344693,1.376254,35.233905,3.084741,68.883482
std,56.360992,1.844926,28.236613,5.281708,55.057396
min,6.0,0.0,0.0,0.0,0.0
25%,39.0,1.0,21.0,0.0,35.0
50%,55.0,1.0,31.0,1.0,54.0
75%,80.0,1.0,40.0,4.0,79.0
max,500.0,133.0,235.0,91.0,500.0


In [22]:
pd.set_option("display.max_rows", None)

In [27]:
gap['Country'].value_counts(ascending=False)

United States of America                                2872
India                                                   2488
Brazil                                                  1562
Germany                                                 1345
Russian Federation                                      1241
Italy                                                    979
France                                                   802
China                                                    795
Japan                                                    702
Mexico                                                   588
Spain                                                    425
United Kingdom of Great Britain and Northern Ireland     400
Poland                                                   389
Indonesia                                                379
Philippines                                              337
Pakistan                                                 307
Netherlands             

## Objectives

> **Ensure Proper Country Names**

> **Top 10 Countries having high AQI Value**

> **Top 10 Countries having low AQI Values**

> **High AQI Levels in India and Why?**

> **Low AQI Levels in India and Why?**
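
> One possible way to approach the two "Top 10" objectives is sketched below. It assumes that countries are ranked by their mean "AQI Value", which is only one of several reasonable choices (median or maximum would also work); the country names and values in `sample` are placeholders.

```python
import pandas as pd

# Placeholder countries and AQI values, for illustration only.
sample = pd.DataFrame({
    "Country": ["Country A", "Country A", "Country B", "Country C", "Country C"],
    "AQI Value": [150, 170, 40, 60, 70],
})

# Average AQI per country (assumption: the mean is the ranking statistic).
mean_aqi = sample.groupby("Country")["AQI Value"].mean()

highest = mean_aqi.nlargest(10)   # up to 10 countries with the highest mean AQI
lowest = mean_aqi.nsmallest(10)   # up to 10 countries with the lowest mean AQI

print(highest)
print(lowest)
```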

## Ideas

> **Create a new feature called "Continent" to analyse the AQI on each continent**
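
> A "Continent" feature could be derived by mapping country names through a lookup table. The following is a minimal sketch, assuming a hand-written and deliberately incomplete dictionary `country_to_continent` (hypothetical); "Atlantis" is a placeholder for a country name that the lookup does not cover.

```python
import pandas as pd

# Hypothetical, incomplete lookup; a real one would need an entry for every country name in the data.
country_to_continent = {
    "India": "Asia",
    "Brazil": "South America",
    "Italy": "Europe",
}

sample = pd.DataFrame({"Country": ["India", "Italy", "Brazil", "Atlantis"]})

# Countries missing from the lookup become NaN instead of raising an error.
sample["Continent"] = sample["Country"].map(country_to_continent)

# Listing the unmapped names also helps with the "Ensure Proper Country Names" objective.
unmapped = sample.loc[sample["Continent"].isna(), "Country"].unique().tolist()

print(sample)
print("Unmapped countries:", unmapped)
```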