# Exploratory Data Analysis (EDA) of Air Pollution in Skopje, Macedonia
Completed 1/19/2020  
Data provided by Bojan

## Project Overview:
This notebook explores air pollution data for Skopje, Macedonia, the city where Bojan currently lives.
The measurements go back roughly 10 years and cover 6 measuring stations (A, B, C, D, E, G). Readings are taken every hour, but a lot of data is missing and some stations have only been active for the last few years.
Higher ratings mean worse air quality.
Anything over 50 is considered not good. Anything over 200 is considered hazardous.
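
As a small illustration of the two thresholds, the sketch below buckets a few made-up readings with `pd.cut`. The label names (`acceptable`, `not good`, `hazardous`) and the sample values are assumptions for demonstration only; the cut-offs of 50 and 200 come from the text above.

```python
import numpy as np
import pandas as pd

# Thresholds taken from the text above; the label names are illustrative.
bins = [-np.inf, 50, 200, np.inf]
labels = ["acceptable", "not good", "hazardous"]

# Made-up readings, including a missing value
readings = pd.Series([30.0, 50.0, 50.1, 200.0, 250.0, np.nan])

# right=True (the default) makes each bin include its upper edge,
# so exactly 50 is still "acceptable" and exactly 200 is still "not good".
categories = pd.cut(readings, bins=bins, labels=labels)
print(categories.tolist())
```

Missing readings stay missing after the cut, so they are not counted in any category.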

Let's try to answer the following questions for the last 5 winters, where a winter is the period from November through February. We are looking at the winters of 2013/14 through 2017/18.
- Which have been the top 3 worst months overall?
- Which measuring station has the highest ratings on average?
- Make a pie chart with the average rating for each station
- Which is the worst month per measuring station on average? Is it the same for them all?
- Make a horizontal bar chart showing how many days in total the measurements have been over 50 for each station.
- Make the same chart for measurements over 200.
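
One possible way to isolate the five winters is to label each timestamp with its winter season and keep only the labelled rows. The sketch below is a minimal version under stated assumptions: the helper `winter_label`, the `winter` column, and the sample values are all hypothetical and are not part of the original notebook.

```python
import pandas as pd

def winter_label(ts):
    """Return a label like '2013/14' for November-February timestamps, else None."""
    if ts.month in (11, 12):
        return f"{ts.year}/{str(ts.year + 1)[-2:]}"
    if ts.month in (1, 2):
        return f"{ts.year - 1}/{str(ts.year)[-2:]}"
    return None

# Made-up readings around the edges of the 2013/14 winter
demo = pd.DataFrame(
    {"Combined": [40.0, 90.0, 150.0, 210.0, 80.0, 35.0]},
    index=pd.to_datetime([
        "2013-10-31 23:00", "2013-11-01 00:00", "2013-12-15 12:00",
        "2014-01-10 08:00", "2014-02-28 23:00", "2014-03-01 00:00",
    ]),
)

# Label every row, then keep only rows that fall in one of the five winters
demo["winter"] = [winter_label(ts) for ts in demo.index]
wanted = ["2013/14", "2014/15", "2015/16", "2016/17", "2017/18"]
winters = demo[demo["winter"].isin(wanted)]
print(winters)
```

Labelling by season rather than by calendar year keeps December and the following January in the same group, which matters when comparing winters against each other.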

In [21]:
import pandas as pd
from matplotlib import pyplot as plt
from datetime import datetime

## 1. Data Cleaning:

### Import the data

In [22]:
# parse_dates=True only tries to parse the index; name the column to parse it as datetime
data = pd.read_csv("pm10_data.csv", parse_dates=["time"])

### Inspect the data

In [23]:
data.head()

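| | A | B | C | D | E | G | time |
|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | 120.26 | 2008-01-01 00:00:00 |
| 1 | NaN | NaN | 124.84 | 99.12 | NaN | 130.95 | 2008-01-01 01:00:00 |
| 2 | NaN | NaN | 107.64 | 98.37 | NaN | 130.19 | 2008-01-01 02:00:00 |
| 3 | NaN | NaN | 107.80 | 89.33 | NaN | 121.46 | 2008-01-01 03:00:00 |
| 4 | NaN | NaN | 100.65 | 94.35 | NaN | 103.99 | 2008-01-01 04:00:00 |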


In [34]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92246 entries, 0 to 92245
Data columns (total 7 columns):
A       55670 non-null float64
B       50826 non-null float64
C       64273 non-null float64
D       82896 non-null float64
E       65715 non-null float64
G       76992 non-null float64
time    92246 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(6)
memory usage: 4.9 MB
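
The non-null counts above already show how uneven the coverage is: roughly 40% of station A's readings and about 45% of station B's are missing. A quick way to get that share per station is to average the boolean mask returned by `isna()`. The sketch below rebuilds the five rows shown by `data.head()` by hand, so the percentages it prints describe only that tiny sample, not the full dataset.

```python
import numpy as np
import pandas as pd

# The five rows shown by data.head() above, rebuilt by hand for illustration
sample = pd.DataFrame({
    "A": [np.nan] * 5,
    "B": [np.nan] * 5,
    "C": [np.nan, 124.84, 107.64, 107.80, 100.65],
    "D": [np.nan, 99.12, 98.37, 89.33, 94.35],
    "E": [np.nan] * 5,
    "G": [120.26, 130.95, 130.19, 121.46, 103.99],
})

# isna() gives True/False per cell; the column mean of booleans is the missing share
missing_pct = sample.isna().mean() * 100
print(missing_pct.round(1))
```

On the real data, the same expression applied to `data` would give the missing percentage for each station in one call.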


#### Convert date to datetime format and make it the row index

In [35]:
# Rename the column from time to date
data = data.rename(columns={'time': 'date'})

In [36]:
# Convert it to datetime format; pd.to_datetime returns a new Series, so assign it back
data['date'] = pd.to_datetime(data['date'])
data['date']

0       2008-01-01 00:00:00
1       2008-01-01 01:00:00
2       2008-01-01 02:00:00
3       2008-01-01 03:00:00
4       2008-01-01 04:00:00
                ...        
92241   2018-03-09 19:00:00
92242   2018-03-09 20:00:00
92243   2018-03-09 21:00:00
92244   2018-03-09 22:00:00
92245   2018-03-09 23:00:00
Name: date, Length: 92246, dtype: datetime64[ns]

In [39]:
data.date.dtype

dtype('<M8[ns]')

In [40]:
data.set_index('date', inplace=True)
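
The parse, rename, and set-index steps above can also be folded into the `read_csv` call. The sketch below shows that alternative on two rows copied from the earlier `head()` output. The CSV layout used here (no separate index column, `time` as the last column) is an assumption about `pm10_data.csv`, and `demo` is a hypothetical name.

```python
import io
import pandas as pd

# Two rows copied from the head() output above, as they might appear in the CSV.
# The exact layout of pm10_data.csv is an assumption.
csv_text = """A,B,C,D,E,G,time
,,,,,120.26,2008-01-01 00:00:00
,,124.84,99.12,,130.95,2008-01-01 01:00:00
"""

demo = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["time"],   # parse this column as datetimes while reading
    index_col="time",       # and use it as the row index
).rename_axis("date")       # rename the index from time to date

print(demo.index.dtype)
print(demo)
```

Doing it in one call avoids an intermediate state where the column exists but has not yet been converted.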

In [41]:
data.head()

| date | A | B | C | D | E | G |
|---|---|---|---|---|---|---|
| 2008-01-01 00:00:00 | NaN | NaN | NaN | NaN | NaN | 120.26 |
| 2008-01-01 01:00:00 | NaN | NaN | 124.84 | 99.12 | NaN | 130.95 |
| 2008-01-01 02:00:00 | NaN | NaN | 107.64 | 98.37 | NaN | 130.19 |
| 2008-01-01 03:00:00 | NaN | NaN | 107.80 | 89.33 | NaN | 121.46 |
| 2008-01-01 04:00:00 | NaN | NaN | 100.65 | 94.35 | NaN | 103.99 |


#### Create a new column that gets the combined average for all 6 stations

In [50]:
data['Combined'] = data.mean(axis=1)
data.head()

| date | A | B | C | D | E | G | Combined |
|---|---|---|---|---|---|---|---|
| 2008-01-01 00:00:00 | NaN | NaN | NaN | NaN | NaN | 120.26 | 120.260000 |
| 2008-01-01 01:00:00 | NaN | NaN | 124.84 | 99.12 | NaN | 130.95 | 118.303333 |
| 2008-01-01 02:00:00 | NaN | NaN | 107.64 | 98.37 | NaN | 130.19 | 112.066667 |
| 2008-01-01 03:00:00 | NaN | NaN | 107.80 | 89.33 | NaN | 121.46 | 106.196667 |
| 2008-01-01 04:00:00 | NaN | NaN | 100.65 | 94.35 | NaN | 103.99 | 99.663333 |
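
The combined value is an average over whichever stations reported in that hour, because `DataFrame.mean` skips NaN by default. The sketch below checks this on the 01:00 row from the table above and shows two contrasting cases; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# The 01:00 row from the table above: only stations C, D and G reported a value
row = pd.DataFrame(
    {"A": [np.nan], "B": [np.nan], "C": [124.84],
     "D": [99.12], "E": [np.nan], "G": [130.95]}
)

combined = row.mean(axis=1)                 # skipna=True by default: mean of C, D, G
strict = row.mean(axis=1, skipna=False)     # any NaN in the row makes the result NaN

# A row with no readings at all has nothing to average, so its mean is NaN
all_missing = pd.DataFrame({"A": [np.nan], "B": [np.nan]}).mean(axis=1)

print(round(combined.iloc[0], 6))
print(strict.iloc[0], all_missing.iloc[0])
```

This is why `Combined` ends up NaN only for hours in which every station is missing, and why an hour covered by a single station still gets a combined value.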


#### Get rid of any rows with no data at all (Combined is NaN)

In [51]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 92246 entries, 2008-01-01 00:00:00 to 2018-03-09 23:00:00
Data columns (total 7 columns):
A           55670 non-null float64
B           50826 non-null float64
C           64273 non-null float64
D           82896 non-null float64
E           65715 non-null float64
G           76992 non-null float64
Combined    91739 non-null float64
dtypes: float64(7)
memory usage: 8.1 MB


In [62]:
# count of null rows in Combined
data['Combined'].isna().sum()

507

In [68]:
# percent of total
data['Combined'].isna().sum() / len(data.index) * 100

0.5496173275805997

In [72]:
# drop the rows where no station reported a value
data = data.dropna(subset=['Combined'])

In [73]:
# confirm that no null rows remain in Combined
data['Combined'].isna().sum()

0
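
Dropping on `Combined` works because that column is NaN exactly when every station is NaN. An equivalent formulation that does not rely on the helper column is `dropna(how="all")` restricted to the station columns. The sketch below compares the two on a small frame; the first row (all stations missing) is made up, the other two come from the tables above.

```python
import numpy as np
import pandas as pd

stations = ["A", "B", "C", "D", "E", "G"]

demo = pd.DataFrame(
    [
        [np.nan] * 6,                                        # hypothetical hour with no readings
        [np.nan, np.nan, 124.84, 99.12, np.nan, 130.95],     # stations C, D, G reported
        [np.nan] * 5 + [120.26],                             # only station G reported
    ],
    columns=stations,
)
demo["Combined"] = demo[stations].mean(axis=1)

# Approach used in the notebook: drop rows where the combined average is missing
via_combined = demo.dropna(subset=["Combined"])

# Alternative: drop rows where all station columns are missing
via_how_all = demo.dropna(subset=stations, how="all")

print(via_combined.equals(via_how_all))
```

Both approaches keep hours with partial coverage, so an hour reported by a single station still counts toward the averages that follow.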

## 2. Exploratory Data Analysis

#### Which have been the top 3 worst months overall?