<a href="https://colab.research.google.com/github/DARKESTX/Analyze/blob/main/northern_taiwan_air_quality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd
import os
import warnings
import matplotlib.pyplot as plt
import math
%matplotlib inline
warnings.filterwarnings('ignore')

# Import Data

In [None]:
dtype = {
    "station": str,
    "AMB_TEMP": int,
    "CH4": int,
    "CO": float,
    "NMHC": float,
    "NO": float,
    "NO2": int,
    "NOx": int,
    "O3": int,
    "PH_RAIN": str,
    "PM10": int,
    "PM2.5": int,
    "RAINFALL": str,
    "RAIN_COND": str,
    "RH": str,
    "SO2": int,
    "THC": float,
    "UVB": int,
    "WD_HR": int,
    "WIND_DIREC": int,
    "WIND_SPEED": float,
    "WS_HR": float
}

In [None]:
#| # indicates invalid value by equipment inspection
#| * indicates invalid value by program inspection
#| x indicates invalid value by human inspection
#| NR indicates no rainfall
#| blank indicates no data

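def clean_data(elem):
    # Converters receive each field as a string; blank means no data.
    elem = str(elem).strip()
    if elem == '':
        return np.nan
    # "NR" means no rainfall; it is encoded as -1.
    if elem == 'NR':
        return -1
    # "#", "*" and "x" flag values rejected at inspection.
    if any(x in elem for x in ["#", "*", "x"]):
        return np.nan
    if any(x in elem for x in [".", "e"]):
        return float(elem)
    return int(elem)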

converters = {
    "AMB_TEMP": clean_data,
    "CH4": clean_data,
    "CO": clean_data,
    "NMHC": clean_data,
    "NO": clean_data,
    "NO2": clean_data,
    "NOx": clean_data,
    "O3": clean_data,
    "PH_RAIN": clean_data,
    "PM10": clean_data,
    "PM2.5": clean_data,
    "RAINFALL": clean_data,
    "RAIN_COND": clean_data,
    "RH": clean_data,
    "SO2": clean_data,
    "THC": clean_data,
    "UVB": clean_data,
    "WD_HR": clean_data,
    "WIND_DIREC": clean_data,
    "WIND_SPEED": clean_data,
    "WS_HR": clean_data
}
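
The flag conventions listed above can be checked on a handful of hand-written readings. The sketch below is a standalone illustration only: `parse_reading` and the sample values are hypothetical and not taken from the dataset, and treating unparseable text as missing is an assumption.

```python
import math

def parse_reading(raw):
    """Convert one raw CSV field to a number, NaN, or the -1 'no rainfall' marker."""
    text = str(raw).strip()
    if text == "":
        return math.nan          # blank: no data
    if text == "NR":
        return -1                # NR: no rainfall (the sentinel used in this notebook)
    if any(flag in text for flag in ("#", "*", "x")):
        return math.nan          # flagged invalid by equipment, program or human inspection
    try:
        return float(text) if any(c in text for c in ".e") else int(text)
    except ValueError:
        return math.nan          # assumption: anything unparseable counts as missing

# Hypothetical raw fields covering each convention.
samples = ["12", "3.5", "7#", "4.1*", "9x", "NR", ""]
parsed = [parse_reading(s) for s in samples]
print(parsed)
```

Comparing with `==` rather than `is` or `in` keeps a blank field from being mistaken for the `NR` marker.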

In [None]:
df = pd.read_csv('/kaggle/input/air-quality-in-northern-taiwan/2015_Air_quality_in_northern_Taiwan.csv', dtype = dtype, converters = converters, parse_dates = ['time'])
df.dtypes

# Clean Data

In [None]:
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
                                 'percent_missing': percent_missing})
missing_value_df.sort_values('percent_missing', inplace=True)
missing_value_df.plot.bar(x='column_name', y='percent_missing', rot=45, figsize=(20, 5))

Analysis

- Rows with NaNs in the columns used to calculate the AQI can be dropped, as only around 5% of the values in those columns are missing
- Any analysis involving wind speed would require dropping about 20% of the data, which has to be taken into account
- UVB has many null values, which makes any analysis of it difficult

Conclusion

- Drop NaN rows for relevant columns (columns used in AQI)
- Drop columns where not enough data is present
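
The two steps in this conclusion can be tried on a toy frame first. This is a minimal sketch with made-up values; the column names are borrowed from the dataset, but the numbers are not.

```python
import numpy as np
import pandas as pd

# Made-up readings: PM10 and O3 are mostly complete, UVB is mostly empty.
toy = pd.DataFrame({
    "PM10": [30, np.nan, 42, 55],
    "O3": [20, 25, np.nan, 31],
    "UVB": [np.nan, np.nan, np.nan, 1.2],
})

# Share of missing values per column, in percent.
percent_missing = toy.isnull().mean() * 100

# Keep only rows that are complete in the columns needed later ...
kept = toy.dropna(subset=["PM10", "O3"])
# ... and discard the column that is mostly empty.
kept = kept.drop(columns=["UVB"])

print(percent_missing.to_dict())
print(len(kept), list(kept.columns))
```

Dropping rows only on the columns that are actually used avoids losing rows because of gaps in columns that are discarded anyway.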

In [None]:
relevant_cols = ["time", "station", "CO", "NOx", "NO2", "SO2", "PM10", "PM2.5", "RAINFALL", "AMB_TEMP", "RH", "O3"]
drop_cols = ["CH4", "THC", "NMHC", "PH_RAIN", "RAIN_COND", "UVB"]

df.dropna(subset=relevant_cols, inplace=True)
df.drop(drop_cols, axis=1, inplace=True)

# Analysis

## Univariate Analysis

In [None]:
df.describe()

In [None]:
cols = list(df.columns)[2:]

block = len(cols) // 3

plt.figure(figsize=(20,10))
plt.suptitle("Frequency Histograms")
for idx, col in enumerate(cols):
    plt.subplot(math.ceil(len(cols) / block), block, idx + 1)
    plt.xlabel(col)
    plt.ylabel("Frequency")
    plt.hist(df[col], bins=30)

## Bi/Multivariate Analysis

In [None]:
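# Break up the timestamp into coarser periods
df['year'] = df['time'].dt.to_period('Y')
df['month'] = df['time'].dt.to_period('M')
df['day'] = df['time'].dt.to_period('D')
# to_period('8H') keeps a separate period for each starting hour, so floor to
# fixed 8-hour blocks (00:00, 08:00, 16:00) instead
df['8hour'] = df['time'].dt.floor('8h')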

In [None]:
# Seasons
seasons_dic = {
    '2015-01': "winter",
    '2015-02': "winter",
    '2015-03': "spring",
    '2015-04': "spring",
    '2015-05': "spring",
    '2015-06': "summer",
    '2015-07': "summer",
    '2015-08': "summer",
    '2015-09': "autumn",
    '2015-10': "autumn",
    '2015-11': "autumn",
    '2015-12': "winter",
}

df['season'] = df['month'].map(str).map(seasons_dic)

In [None]:
# Month name
month_dic = {
    '2015-01': "jan",
    '2015-02': "feb",
    '2015-03': "mar",
    '2015-04': "apr",
    '2015-05': "may",
    '2015-06': "jun",
    '2015-07': "jul",
    '2015-08': "aug",
    '2015-09': "sep",
    '2015-10': "oct",
    '2015-11': "nov",
    '2015-12': "dec",
}

df['month_label'] = df['month'].map(str).map(month_dic)

### Levels of Pollution

AQI (air quality index):
https://app.cpcbccr.com/ccr_docs/FINAL-REPORT_AQI_.pdf


WHO guideline levels for PM2.5, PM10, O3, NO2 and SO2:
https://www.who.int/news-room/fact-sheets/detail/ambient-(outdoor)-air-quality-and-health

    Annual Mean Recommendations
    - PM2.5 - 5 μg/m3
    - PM10 - 15 μg/m3
    - NO2 - 10 μg/m3

    24 Hour Mean Recommendations
    - PM2.5 - 15 μg/m3
    - PM10 - 45 μg/m3
    - NO2 - 25 μg/m3
    - SO2 - 40 μg/m3

    8 Hour Mean Recommendations
    - O3 - 100 μg/m3
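
To make the 24-hour recommendation concrete, the sketch below averages hourly readings per calendar day and compares each daily mean with the limit. The readings and the `LIMIT_24H` name are made up for illustration.

```python
import pandas as pd

# Hypothetical hourly PM2.5 readings for two days.
toy = pd.DataFrame({
    "time": pd.date_range("2015-01-01", periods=48, freq="h"),
    "PM2.5": [10.0] * 24 + [20.0] * 24,
})

LIMIT_24H = {"PM2.5": 15}   # 24-hour mean recommendation quoted above

# Mean per calendar day, then a True/False flag per day.
daily_mean = toy.groupby(toy["time"].dt.floor("D"))["PM2.5"].mean()
exceeds = daily_mean > LIMIT_24H["PM2.5"]
print(exceeds.tolist())
```

Only the second day is flagged, because its mean of 20 is above the limit of 15.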

In [None]:
# 24 Hour Mean Recommended

cols = [['PM2.5_polluted', 'PM2.5', 15], ['PM10_polluted', 'PM10', 45], ['NO2_polluted', 'NO2', 25], ['SO2_polluted', 'SO2', 40]]

for elem in cols:
    pol_col, col, thresh = elem

    averages = df[[col, 'day']].groupby("day").mean()
    averages[pol_col] = averages[col] > thresh
    dic = averages[pol_col].to_dict()

    df[pol_col] = df['day'].map(dic)

In [None]:
# 8 Hour Mean Recommended

cols = [['O3_polluted', 'O3', 100]]

for elem in cols:
    pol_col, col, thresh = elem

    averages = df[[col, '8hour']].groupby("8hour").mean()
    averages[pol_col] = averages[col] > thresh
    dic = averages[pol_col].to_dict()

    df[pol_col] = df['8hour'].map(dic)
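
One way to form fixed 8-hour windows is to floor each timestamp to an 8-hour boundary. The sketch below uses made-up readings and assumes that windows starting at 00:00, 08:00 and 16:00 are acceptable; a rolling 8-hour mean would be a different design.

```python
import pandas as pd

# One day of hourly timestamps with made-up O3 readings 0, 1, ..., 23.
times = pd.Series(pd.date_range("2015-01-01 00:00", periods=24, freq="h"))
o3 = pd.Series(range(24), dtype=float)

# Each timestamp is mapped to 00:00, 08:00 or 16:00 of the same day.
bucket = times.dt.floor("8h")
means = o3.groupby(bucket).mean()
print(means.tolist())
```

Every bucket here holds exactly eight hourly rows, which can be confirmed with `bucket.value_counts()`.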

In [None]:
pollutants = ['PM2.5_polluted', 'PM10_polluted', 'NO2_polluted', 'SO2_polluted', 'O3_polluted']
df['polluted'] = df[pollutants].sum(axis = 1) / len(pollutants)
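
The `polluted` score is the fraction of the five limits that are exceeded for a row. A small sketch with made-up flags shows the arithmetic:

```python
import pandas as pd

# Made-up exceedance flags for three rows.
flags = pd.DataFrame({
    "PM2.5_polluted": [True, True, False],
    "PM10_polluted": [True, False, False],
    "NO2_polluted": [False, False, False],
    "SO2_polluted": [True, False, False],
    "O3_polluted": [False, False, False],
})

# Mean of booleans across columns = share of limits exceeded.
score = flags.mean(axis=1)
print(score.tolist())
```

A score of 0 means no limit was exceeded and a score of 1 means all five were.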

### Temporal Trends

Which seasons have low air quality?

In [None]:
[*pollutants, 'month']

In [None]:
df.groupby('month')[pollutants].mean().sort_values(by="month")

In [None]:
plt.figure(figsize=(30,8))
plt.suptitle("Temporal Trends")
plt.subplot(1, 2, 1)
seasons = df.groupby('season')[pollutants].mean()
plt.ylabel("level of air pollution")
plt.xlabel("seasons")
plt.plot(seasons.index, seasons.values, label = pollutants)
plt.legend()
plt.subplot(1, 2, 2)
seasons = df.groupby('season')['polluted'].mean()
plt.ylabel("level of air pollution")
plt.xlabel("seasons")
plt.plot(seasons.index, seasons.values)

In [None]:
plt.figure(figsize=(30,8))
plt.suptitle("Temporal Trends")
plt.subplot(1, 2, 1)
months = df.groupby('month')[pollutants].mean()
plt.ylabel("level of air pollution")
plt.xlabel("months")
plt.plot(months.index.map(str), months.values, label = pollutants)
plt.legend()
plt.subplot(1, 2, 2)
months = df.groupby('month')['polluted'].mean()
plt.ylabel("level of air pollution")
plt.xlabel("months")
plt.plot(months.index.map(str), months.values)

Conclusion
- June, July and August show the lowest levels of pollution. This could be due to fewer people using transport in hotter weather, or to different atmospheric conditions.

### Spatial Trends

In [None]:
plt.figure(figsize=(30,7))
plt.suptitle("Spatial Trends")
plt.subplot(1, 2, 1)
stations = df.groupby('station')[pollutants].mean()
plt.ylabel("level of air pollution")
plt.xlabel("station")
plt.xticks(rotation=45)
plt.plot(stations.index, stations.values, label = pollutants)
plt.legend()
plt.subplot(1, 2, 2)
stations = df.groupby('station')['polluted'].mean()
plt.ylabel("level of air pollution")
plt.xlabel("station")
plt.xticks(rotation=45)
plt.plot(stations.index, stations.values)
plt.show()

Conclusions
- The pollution level is very similar at every station, so the standard deviation across stations is low. This is probably because the stations are in similar conditions.
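
The spread between stations can be put into a single number by taking the standard deviation of the per-station means. The sketch below uses hypothetical station names and scores, not the real data.

```python
import pandas as pd

# Hypothetical stations A, B, C with made-up pollution scores.
toy = pd.DataFrame({
    "station": ["A", "A", "B", "B", "C", "C"],
    "polluted": [0.2, 0.4, 0.3, 0.3, 0.2, 0.6],
})

# Mean score per station, then the sample standard deviation of those means.
station_means = toy.groupby("station")["polluted"].mean()
spread = station_means.std()
print(station_means.to_dict(), spread)
```

A spread that is small compared with the means themselves supports the reading that the stations behave alike.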

### Relationship between Pollutants

In [None]:
col = ["CO", "NOx", "NO2", "SO2", "PM10", "PM2.5", "O3"]

In [None]:
data = df[col].corr()

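fig, ax = plt.subplots(figsize=(10, 10))
ax.matshow(data)
# Pin tick positions to the matrix cells before labelling them
ax.set_xticks(range(len(col)))
ax.set_yticks(range(len(col)))
ax.set_xticklabels(col)
ax.set_yticklabels(col)

for (i, j), z in np.ndenumerate(data):
    ax.text(j, i, '{:0.1f}'.format(z), ha='center', va='center')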

Conclusion
- PM10 is correlated with PM2.5; this is expected, as both measure particulate matter
- CO, NOx and NO2 are correlated with one another
- When CO, NOx and NO2 are high, O3 is low, showing a negative correlation
- Both PM measures show little correlation with the other variables

# Hypothesis

In [None]:
from scipy.stats import pearsonr
from scipy import stats

**Hypothesis 1**

h0 : PM2.5 is not correlated to PM10

h1 : PM2.5 is correlated to PM10

In [None]:
pearsonr(df['PM2.5'], df['PM10'])

Correlation coefficient: 0.81

P-value: 0

Therefore, there is a strong positive correlation, and it is statistically significant.

Reject h0 and accept h1.

**Hypothesis 2**

h0 : PM10 is not correlated to NOx

h1 : PM10 is correlated to NOx

In [None]:
pearsonr(df['PM10'], df['NOx'])

Correlation coefficient: 0.29

P-value: 0

The coefficient is close to zero, so the correlation is weak. The p-value shows that it is statistically distinguishable from zero, but it is too weak to treat PM10 and NOx as strongly related.

Reject h0, noting that the correlation is weak.
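
A coefficient of 0.29 next to a p-value of 0 can look contradictory. The p-value tests whether the coefficient differs from zero, and it depends on the sample size as well as on the coefficient. The sketch below illustrates this with the usual t statistic for a Pearson coefficient and a normal approximation, which is rough for the small sample. Both sample sizes are hypothetical, and `approx_two_sided_p` is an illustrative helper, not a SciPy function.

```python
import math

def approx_two_sided_p(r, n):
    """Approximate two-sided p-value for a Pearson coefficient r from n pairs.

    Uses t = r * sqrt((n - 2) / (1 - r**2)) and a normal approximation to the
    t distribution, so it is only a rough guide when n is small.
    """
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return math.erfc(abs(t) / math.sqrt(2))

r = 0.29                                   # coefficient reported above
p_small = approx_two_sided_p(r, 30)        # hypothetical small sample
p_large = approx_two_sided_p(r, 100_000)   # hypothetical large sample
print(p_small, p_large)
```

The same coefficient is not significant at the 0.05 level for the small sample, while the p-value for the large one is too small to represent and prints as zero.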

**Hypothesis 3**

h0 : CO is not correlated to O3

h1 : CO is negatively correlated to O3

In [None]:
pearsonr(df['CO'], df['O3'])

Correlation coefficient: -0.36

P-value: 0

The coefficient is negative and the p-value is effectively zero, so there is a statistically significant negative correlation, although a weak one.

Reject h0 and accept h1.

**Hypothesis 4**

h0 : Air pollution in the Winter is not higher than the Summer

h1 : Air pollution in the Winter is higher than the Summer

In [None]:
summer_pollution = df[df['season'] == 'summer']['polluted']
winter_pollution = df[df['season'] == 'winter']['polluted']

print(f"Absolute mean difference (summer vs winter): {np.abs(summer_pollution.mean() - winter_pollution.mean())}")

In [None]:
alpha = 0.05
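# One-sided test matching h1 (winter higher than summer); needs SciPy 1.6 or later
statistic, pvalue = stats.ttest_ind(winter_pollution, summer_pollution, alternative='greater')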
print(f"Alpha Value: {alpha}, P-Value: {pvalue}")

The p-value returned is below the alpha of 0.05, so a difference this large would be very unlikely if h0 were true. The result is therefore statistically significant, and h1 can be accepted.

The p-value is reported as zero because the dataset used for testing is very large, which makes the standard error very small.
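
The effect of sample size on the test can be shown by hand. The sketch below computes a Welch t statistic (a variant that does not assume equal variances) on made-up scores, then repeats the same values many times to imitate a larger dataset. The helper names and the data are hypothetical, and the p-value uses a normal approximation.

```python
import math
import statistics

def welch_t(a, b):
    """Welch t statistic for mean(a) - mean(b), without assuming equal variances."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def one_sided_p(t):
    """Normal approximation to P(T >= t); adequate for large samples only."""
    return 0.5 * math.erfc(t / math.sqrt(2))

# Made-up scores: winter averages 0.5, summer averages 0.3.
winter_base = [0.4, 0.6, 0.4, 0.6]
summer_base = [0.2, 0.4, 0.2, 0.4]

t_small = welch_t(winter_base * 5, summer_base * 5)       # 20 rows per season
t_large = welch_t(winter_base * 500, summer_base * 500)   # 2000 rows per season
print(t_small, one_sided_p(t_small))
print(t_large, one_sided_p(t_large))
```

The mean difference is the same in both runs; only the number of rows changes, and with it the standard error and the p-value.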

# Evaluate

Temporal Correlation

- June, July and August show the lowest levels of pollution. This could be due to fewer people using transport in hotter weather, or to different atmospheric conditions.

Spatial Correlation

- The pollution level is very similar at every station, so the standard deviation across stations is low. This is probably because the stations are in similar conditions.

Pollutant Correlations

- PM10 is correlated with PM2.5; this is expected, as both measure particulate matter
- CO, NOx and NO2 are correlated with one another
- When CO, NOx and NO2 are high, O3 is low, showing a negative correlation
- Both PM measures show little correlation with the other variables