### Data Analysis Project: Air Quality Dataset

* Name: Muhamad Raka Pratama
* Email: raka824pratama@gmail.com
* Dicoding ID: rakap824

### Defining the Business Questions

#### Question 1
How do SO2, NO2, CO, and O3 affect air pollution?
#### Question 2
Does temperature affect the concentration of particulate matter in the air?

### Import library

In [None]:
import pandas as pd
import streamlit as st
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# from babel.numbers import format_currency
# sns.set(style="dark")

### Data Wrangling

### Gathering Data

In [None]:
df_Aotizhongxin = pd.read_csv("PRSA_Data_Aotizhongxin_20130301-20170228.csv")
df_Changping = pd.read_csv("PRSA_Data_Changping_20130301-20170228.csv")
df_Dingling = pd.read_csv("PRSA_Data_Dingling_20130301-20170228.csv")
df_Dongsi = pd.read_csv("PRSA_Data_Dongsi_20130301-20170228.csv")
df_Guanyuan = pd.read_csv("PRSA_Data_Guanyuan_20130301-20170228.csv")
df_Gucheng = pd.read_csv("PRSA_Data_Gucheng_20130301-20170228.csv")
df_Wanliu = pd.read_csv("PRSA_Data_Wanliu_20130301-20170228.csv")

In [None]:
dfs = [
    df_Aotizhongxin, df_Changping, df_Dingling, df_Dongsi, df_Guanyuan, df_Gucheng, df_Wanliu
]
all_df = pd.concat(dfs, ignore_index=True, sort=False)
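The seven `read_csv` calls above follow one filename pattern, so the loading step could also be written as a loop. The following is a minimal sketch, not the notebook's own code: the helper name `load_station_files` and the default pattern are assumptions, and it reads every matching file in the directory, so it only reproduces the cell above if the directory holds exactly those seven files.

```python
from pathlib import Path

import pandas as pd


def load_station_files(data_dir, pattern="PRSA_Data_*.csv"):
    """Read every CSV in data_dir that matches pattern and stack the rows."""
    paths = sorted(Path(data_dir).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {data_dir}")
    frames = [pd.read_csv(path) for path in paths]
    # Same concat options as the notebook cell above.
    return pd.concat(frames, ignore_index=True, sort=False)
```

A call such as `all_df = load_station_files(".")` would then replace the two cells above.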

In [None]:
all_df["station"].value_counts()

In [None]:
all_df.info()

In [None]:
all_df.isna().sum()
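Raw missing-value counts are easier to judge next to the share of rows they represent. A small sketch on a made-up frame (the values below are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for all_df; the values are invented for illustration.
toy = pd.DataFrame({
    "CO": [1.0, np.nan, 3.0, np.nan],
    "wd": ["N", None, "E", "S"],
})

# Count and fraction of missing values per column.
missing = pd.DataFrame({
    "count": toy.isna().sum(),
    "share": toy.isna().mean(),
})
print(missing)
```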

In [None]:
print(f"Duplicate rows: {all_df.duplicated().sum()}")
all_df.describe()

In [None]:
all_df.drop_duplicates(inplace=True)

In [None]:
all_df['datetime'] = pd.to_datetime(
    all_df['year'].apply(str) + '-' + all_df['month'].apply(str) + '-' + all_df['day'].apply(str),
    format='%Y-%m-%d'
)
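`pd.to_datetime` can also assemble timestamps directly from a frame of component columns, which avoids the string concatenation. A minimal sketch on invented rows; the `hour` column is an assumption about the raw files and can be dropped from the selection if it is absent:

```python
import pandas as pd

# Invented rows; "hour" is assumed to exist alongside year/month/day.
sample = pd.DataFrame({
    "year": [2013, 2013, 2017],
    "month": [3, 3, 2],
    "day": [1, 1, 28],
    "hour": [0, 1, 23],
})

# Date only, equivalent to the string-based construction above.
sample["date"] = pd.to_datetime(sample[["year", "month", "day"]])
# Full hourly timestamp, if the hour component is available.
sample["timestamp"] = pd.to_datetime(sample[["year", "month", "day", "hour"]])
```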

In [None]:
all_df.sample(10)

### Cleaning Data

#### CO

In [None]:
all_df["CO"].isna().sum()

In [None]:
all_df["CO"].describe()

In [None]:
plt.plot(all_df["CO"])
plt.show()

In [None]:
all_df["CO"].value_counts()

In [None]:
q1, q3 = all_df["CO"].quantile(0.25), all_df["CO"].quantile(0.75)
iqr = q3 - q1

maximum = q3 + (1.5*iqr)
minimum = q1 - (1.5*iqr)

lower = all_df["CO"] < minimum
higher = all_df["CO"] > maximum

print(maximum)
print(minimum)

# all_df["CO"].mask(higher, maximum, inplace=True)

In [None]:
# Fill gaps by linear interpolation; assign the result back rather than
# relying on inplace=True on a column selection, which may not update all_df.
all_df["CO"] = all_df["CO"].interpolate(method="linear", limit_direction="forward")
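The IQR bounds are recomputed the same way for every column in this section, so the calculation could live in one helper. A minimal sketch; the function name `iqr_bounds` and the toy series are assumptions for illustration:

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (minimum, maximum) fences at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy values; 100.0 is far outside the bulk of the data.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
minimum, maximum = iqr_bounds(values)
outliers = values[(values < minimum) | (values > maximum)]
print(minimum, maximum, outliers.tolist())
```

With this helper, each column's cell reduces to `minimum, maximum = iqr_bounds(all_df["CO"])`.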

#### PM2.5

In [None]:
all_df["PM2.5"].isna().sum()

In [None]:
all_df["PM2.5"].describe()

In [None]:
all_df["PM2.5"].value_counts()

In [None]:
q1, q3 = all_df["PM2.5"].quantile(0.25), all_df["PM2.5"].quantile(0.75)
iqr = q3 - q1

maximum = q3 + (1.5*iqr)
minimum = q1 - (1.5*iqr)

lower = all_df["PM2.5"] < minimum
higher = all_df["PM2.5"] > maximum

print(maximum)
print(minimum)
# all_df["PM2.5"].mask(higher, maximum, inplace=True)

In [None]:
# Fill gaps by linear interpolation and assign the result back to the column.
all_df["PM2.5"] = all_df["PM2.5"].interpolate(method="linear", limit_direction="forward")
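Because `all_df` stacks several stations, interpolating the whole column can draw a line from the last reading of one station to the first reading of the next. One possible alternative, shown here as a sketch on invented values rather than as the notebook's method, is to interpolate within each station. Note that it uses `limit_direction="both"`, which also fills leading gaps, unlike the `"forward"` setting used above.

```python
import numpy as np
import pandas as pd

# Invented readings for two stations stacked in one frame.
toy = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B", "B"],
    "PM10": [10.0, np.nan, 30.0, np.nan, 200.0, np.nan],
})

# Interpolate inside each station so values never leak across stations.
toy["PM10_filled"] = toy.groupby("station")["PM10"].transform(
    lambda s: s.interpolate(method="linear", limit_direction="both")
)
print(toy)
```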

#### PM10

In [None]:
all_df["PM10"].isna().sum()

In [None]:
all_df["PM10"].describe()

In [None]:
all_df["PM10"].value_counts(ascending=False)

In [None]:
plt.plot(all_df["PM10"])
plt.show()

In [None]:
q1, q3 = all_df["PM10"].quantile(0.25), all_df["PM10"].quantile(0.75)
iqr = q3 - q1

maximum = q3 + (1.5*iqr)
minimum = q1 - (1.5*iqr)

lower = all_df["PM10"] < minimum
higher = all_df["PM10"] > maximum

print(maximum)
print(minimum)
# all_df["PM10"].mask(higher, maximum, inplace=True)

In [None]:
# Fill gaps by linear interpolation and assign the result back to the column.
all_df["PM10"] = all_df["PM10"].interpolate(method="linear", limit_direction="forward")

In [None]:
all_df.PM10[all_df['PM10'] > 800].value_counts().sort_index(ascending=False)

#### SO2

In [None]:
all_df["SO2"].isna().sum()

In [None]:
all_df["SO2"].describe()

In [None]:
plt.plot(all_df["SO2"])
plt.show()

In [None]:
q1, q3 = all_df["SO2"].quantile(0.25), all_df["SO2"].quantile(0.75)
iqr = q3 - q1

maximum = q3 + (1.5*iqr)
minimum = q1 - (1.5*iqr)

lower = all_df["SO2"] < minimum
higher = all_df["SO2"] > maximum

# Cap values above the upper IQR fence; assign back instead of using inplace=True.
all_df["SO2"] = all_df["SO2"].mask(higher, maximum)

In [None]:
# Fill gaps by linear interpolation and assign the result back to the column.
all_df["SO2"] = all_df["SO2"].interpolate(method="linear", limit_direction="forward")
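SO2 is the one column where values above the upper fence are actually capped. The same capping can be expressed with `Series.clip`, which leaves missing values untouched. A small sketch on made-up numbers:

```python
import pandas as pd

# Made-up SO2 readings with one extreme value.
so2 = pd.Series([2.0, 5.0, 8.0, 11.0, 300.0])

q1, q3 = so2.quantile(0.25), so2.quantile(0.75)
maximum = q3 + 1.5 * (q3 - q1)

# Values above the upper fence are replaced by the fence itself.
capped = so2.clip(upper=maximum)
print(maximum, capped.tolist())
```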

#### NO2

In [None]:
all_df["NO2"].isna().sum()

In [None]:
all_df["NO2"].describe()

In [None]:
all_df["NO2"].value_counts()

In [None]:
plt.plot(all_df["NO2"])
plt.show()

In [None]:
q1, q3 = all_df["NO2"].quantile(0.25), all_df["NO2"].quantile(0.75)
iqr = q3 - q1

maximum = q3 + (1.5*iqr)
minimum = q1 - (1.5*iqr)

lower = all_df["NO2"] < minimum
higher = all_df["NO2"] > maximum

print(maximum)
print(minimum)

# all_df["NO2"].mask(higher, maximum, inplace=True)

In [None]:
# Fill gaps by linear interpolation and assign the result back to the column.
all_df["NO2"] = all_df["NO2"].interpolate(method="linear", limit_direction="forward")

#### O3

In [None]:
all_df["O3"].isna().sum()

In [None]:
all_df.tail(25).sort_values(by="O3", ascending=False)

In [None]:
all_df["O3"].describe()

In [None]:
all_df["O3"].value_counts().index.sort_values(ascending=False)

In [None]:
plt.plot(all_df["O3"])
plt.show()

In [None]:
# O3 still contains NaN at this point; drop them so the box is actually drawn.
plt.boxplot(all_df["O3"].dropna())
plt.show()

In [None]:
q1, q3 = all_df["O3"].quantile(0.25), all_df["O3"].quantile(0.75)
iqr = q3 - q1

maximum = q3 + (1.5*iqr)
minimum = q1 - (1.5*iqr)

lower = all_df["O3"] < minimum
higher = all_df["O3"] > maximum

print(maximum)
print(minimum)

In [None]:
# Replace extreme readings (> 600) with the upper IQR fence.
# A single threshold suffices because every value above 700 is also above 600.
all_df["O3"] = all_df["O3"].mask(all_df["O3"] > 600, maximum)

In [None]:
all_df["O3"] = all_df["O3"].interpolate(method="linear", limit_direction="forward")

#### TEMP

In [None]:
all_df["TEMP"].isna().sum()

In [None]:
all_df["TEMP"].describe()

In [None]:
plt.plot(all_df["TEMP"])
plt.show()

In [None]:
all_df["TEMP"].value_counts()

In [None]:
q1, q3 = all_df["TEMP"].quantile(0.25), all_df["TEMP"].quantile(0.75)
iqr = q3 - q1

maximum = q3 + (1.5*iqr)
minimum = q1 - (1.5*iqr)

lower = all_df["TEMP"] < minimum
higher = all_df["TEMP"] > maximum

print(maximum)
print(minimum)

# all_df["TEMP"].mask(higher, maximum, inplace=True)
# all_df["TEMP"].mask(lower, minimum, inplace=True)

In [None]:
# Fill gaps by linear interpolation and assign the result back to the column.
all_df["TEMP"] = all_df["TEMP"].interpolate(method="linear", limit_direction="forward")

#### PRES

In [None]:
all_df["PRES"].isna().sum()

In [None]:
all_df["PRES"].describe()

In [None]:
plt.plot(all_df["PRES"])
plt.show()

In [None]:
all_df["PRES"].value_counts()

In [None]:
sns.boxplot(x=all_df["PRES"])
plt.show()

In [None]:
q1, q3 = all_df["PRES"].quantile(0.25), all_df["PRES"].quantile(0.75)
iqr = q3 - q1

maximum = q3 + (1.5*iqr)
minimum = q1 - (1.5*iqr)

lower = all_df["PRES"] < minimum
higher = all_df["PRES"] > maximum

print(maximum)
print(minimum)

# all_df["PRES"].mask(higher, maximum, inplace=True)
# all_df["PRES"].mask(lower, minimum, inplace=True)

In [None]:
# Fill gaps by linear interpolation and assign the result back to the column.
all_df["PRES"] = all_df["PRES"].interpolate(method="linear", limit_direction="forward")

#### DEWP

In [None]:
all_df["DEWP"].isna().sum()

In [None]:
all_df["DEWP"].describe()

In [None]:
plt.plot(all_df["DEWP"])
plt.show()

In [None]:
all_df["DEWP"].value_counts()

In [None]:
q1, q3 = all_df["DEWP"].quantile(0.25), all_df["DEWP"].quantile(0.75)
iqr = q3 - q1

maximum = q3 + (1.5*iqr)
minimum = q1 - (1.5*iqr)

lower = all_df["DEWP"] < minimum
higher = all_df["DEWP"] > maximum

print(maximum)
print(minimum)

# all_df["DEWP"].mask(higher, maximum, inplace=True)
# all_df["DEWP"].mask(lower, minimum, inplace=True)

In [None]:
# Fill gaps by linear interpolation and assign the result back to the column.
all_df["DEWP"] = all_df["DEWP"].interpolate(method="linear", limit_direction="forward")

#### RAIN

In [None]:
all_df["RAIN"].isna().sum()

In [None]:
all_df["RAIN"].describe()

In [None]:
station_rain_mean = all_df.groupby("station", as_index=False).RAIN.mean()

In [None]:
station_rain_mean.head(100)

In [None]:
for i in station_rain_mean["station"]:
    print(i)

In [None]:
station_rain_mean["station"]

In [None]:
all_df["RAIN"].value_counts().sort_values(ascending=False)

In [None]:
plt.plot(all_df["RAIN"])
plt.show()

In [None]:
all_df["RAIN"].value_counts()

In [None]:
sns.boxplot(x=all_df["RAIN"])
plt.show()

In [None]:
q1, q3 = all_df["RAIN"].quantile(0.25), all_df["RAIN"].quantile(0.75)
iqr = q3 - q1

maximum = q3 + (1.5*iqr)
minimum = q1 - (1.5*iqr)

lower = all_df["RAIN"] < minimum
higher = all_df["RAIN"] > maximum

print(maximum)
print(minimum)

# all_df["RAIN"].mask(higher, maximum, inplace=True)
# all_df["RAIN"].mask(lower, minimum, inplace=True)

In [None]:
# Fill gaps by linear interpolation and assign the result back to the column.
all_df["RAIN"] = all_df["RAIN"].interpolate(method="linear", limit_direction="forward")

#### wd

In [None]:
all_df["wd"].isna().sum()

In [None]:
all_df["wd"].value_counts()

In [None]:
new_wd = {
    "ENE" : "NE",
    "NNE" : "NE",
    "NNW" : "NW",
    "WNW" : "NW",
    "ESE" : "SE",
    "SSW" : "SW",
    "WSW" : "SW",
    "SSE" : "SE",
}

# Collapse the 16-point compass labels to 8 points.
# Labels not in new_wd (and missing values) are left unchanged.
all_df["wd"] = all_df["wd"].replace(new_wd)

In [None]:
all_df["wd"].describe(include="all")

In [None]:
# Label missing wind directions explicitly; assign back instead of using inplace=True.
all_df["wd"] = all_df["wd"].fillna("undefined")

#### WSPM

In [None]:
all_df["WSPM"].isna().sum()

In [None]:
all_df["WSPM"].describe()

In [None]:
plt.plot(all_df["WSPM"])
plt.show()

In [None]:
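# Count rows with zero wind speed. DataFrame.value_counts() would silently
# skip rows that contain NaN in any column, so sum the boolean mask instead.
(all_df["WSPM"] == 0.0).sum()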

In [None]:
q1, q3 = all_df["WSPM"].quantile(0.25), all_df["WSPM"].quantile(0.75)
iqr = q3 - q1

maximum = q3 + (1.5*iqr)
minimum = q1 - (1.5*iqr)

lower = all_df["WSPM"] < minimum
higher = all_df["WSPM"] > maximum

print(maximum)
print(minimum)

# all_df["WSPM"].mask(higher, maximum, inplace=True)
# all_df["WSPM"].mask(lower, minimum, inplace=True)

In [None]:
all_df["WSPM"] = all_df["WSPM"].interpolate(method="linear", limit_direction="forward")

In [None]:
all_df.to_csv("all_df.csv", index=False)
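A CSV file stores the `datetime` column as plain text, so whatever reads `all_df.csv` later has to parse it again. The following sketch shows the round trip on a toy frame, using an in-memory buffer in place of a file:

```python
import io

import pandas as pd

# Toy frame standing in for all_df.
toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2013-03-01", "2013-03-02"]),
    "PM2.5": [4.0, 7.0],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that the CSV text format loses.
reloaded = pd.read_csv(buffer, parse_dates=["datetime"])
print(reloaded.dtypes)
```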

### Exploratory Data Analysis (EDA)


In [None]:
all_df.groupby(by='station').agg({
    'PM2.5' : 'mean',
    'PM10' : 'mean',
    'SO2' : 'mean',
    'NO2' : 'mean',
    'CO' : 'mean',
    'O3' : 'mean',
}).sort_values(by='PM2.5')

In [None]:
all_df.groupby(by='station').agg({
    'PM2.5' : 'max',
    'PM10' : 'max',
}).sort_values(by='PM2.5', ascending=False)

In [None]:
all_df.groupby(by=['station', 'year', 'month']).agg({
    'PM2.5' : 'mean',
    'PM10' : 'mean',
}).head(20)

In [None]:
all_df.groupby(by=['station']).agg({
    'PM2.5' : ['mean', 'max', 'min'],
})

In [None]:
all_df.groupby(by=['station', 'year', 'month']).agg({
    'PM2.5' : ['mean', 'max', 'min'],
    'PM10' : ['mean', 'max', 'min']
})

In [None]:
all_df.groupby(by=['day']).agg({
    'PM2.5' : 'mean'
})

In [None]:
all_df.groupby(by=['station', 'datetime']).agg({
    'TEMP' : 'mean',
    'PM2.5' : 'mean',
    'PM10' : 'mean',
})

In [None]:
all_df.groupby(by='wd').agg({
    'PM2.5' : ['mean', 'max', 'min'],
    'PM10' : ['mean', 'max', 'min']
})

In [None]:
all_df.groupby(by=['station', 'wd']).agg({
    'wd' : 'count'
}).head(15)

In [None]:
all_df.groupby(by=['station', 'datetime']).agg({
    'TEMP' : 'mean',
    'PM2.5' : 'mean',
    'PM10' : 'mean',
}).reset_index()

In [None]:
answer_1 = all_df.groupby(by='PM2.5').agg({
    'SO2' : 'mean',
    'NO2' : 'mean',
    'CO' : 'mean',
    'O3' : 'mean'
}).reset_index()

In [None]:
answer_2 = all_df.groupby(by="TEMP").agg({
    'PM2.5' : 'mean',
    'PM10' : 'mean',
}).reset_index()
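The grouped means prepare the data for plotting, but a correlation coefficient gives a single number for the strength and direction of each relationship. A minimal sketch using `DataFrame.corr`; the values are invented for illustration and are not results from the dataset:

```python
import pandas as pd

# Invented values, chosen only to show the mechanics.
toy = pd.DataFrame({
    "PM2.5": [10.0, 20.0, 30.0, 40.0, 50.0],
    "CO": [300.0, 500.0, 700.0, 900.0, 1100.0],
    "O3": [90.0, 70.0, 60.0, 40.0, 20.0],
})

# Pearson correlation of every other column with PM2.5.
corr_with_pm25 = toy.corr(method="pearson")["PM2.5"].drop("PM2.5")
print(corr_with_pm25)
```

On the real data, the same call could be applied to `all_df[["PM2.5", "SO2", "NO2", "CO", "O3"]]`.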

### Visualization & Explanatory Analysis

#### Question 1
How do SO2, NO2, CO, and O3 affect air pollution?
#### Question 2
Does temperature affect the concentration of particulate matter in the air?


#### Question 1

In [None]:
fig, ax = plt.subplots(figsize=(35, 15))

ax.plot(
    answer_1['PM2.5'],
    answer_1['SO2'],
    linewidth=5,
    marker='.',
)

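# Label the axes so the plot can be read on its own.
ax.set_xlabel("PM2.5", fontsize=30)
ax.set_ylabel("Mean SO2", fontsize=30)
plt.show()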

In [None]:
fig, ax = plt.subplots(figsize=(35, 15))

ax.plot(
    answer_1['PM2.5'],
    answer_1['NO2'],
    linewidth=5,
    marker='.',
)

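# Label the axes so the plot can be read on its own.
ax.set_xlabel("PM2.5", fontsize=30)
ax.set_ylabel("Mean NO2", fontsize=30)
plt.show()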

In [None]:
fig, ax = plt.subplots(figsize=(35, 15))

ax.plot(
    answer_1['PM2.5'],
    answer_1['CO'],
    linewidth=5,
    marker='.',
)

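# Label the axes so the plot can be read on its own.
ax.set_xlabel("PM2.5", fontsize=30)
ax.set_ylabel("Mean CO", fontsize=30)
plt.show()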

In [None]:
fig, ax = plt.subplots(figsize=(35, 15))

ax.plot(
    answer_1['PM2.5'],
    answer_1['O3'],
    linewidth=5,
    marker='.',
)

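# Label the axes so the plot can be read on its own.
ax.set_xlabel("PM2.5", fontsize=30)
ax.set_ylabel("Mean O3", fontsize=30)
plt.show()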

#### Question 2

In [None]:
fig, ax = plt.subplots(figsize=(35, 15))

ax.plot(
    answer_2['TEMP'],
    answer_2['PM2.5'],
    linewidth=5,
    marker='.',
    markerfacecolor='blue',
    label='PM2.5',
)
ax.plot(
    answer_2['TEMP'],
    answer_2['PM10'],
    linewidth=5,
    marker='.',
    markerfacecolor='red',
    label='PM10',
)

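# Label the axes and attach the legend to this Axes object.
ax.set_xlabel("TEMP", fontsize=30)
ax.set_ylabel("Mean concentration", fontsize=30)
ax.legend(fontsize=35)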

plt.show()
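Grouping by the exact `TEMP` value yields one point per distinct reading, which can make the line noisy where few observations share a temperature. One option is to average within temperature bins instead. This is a sketch on arbitrary values, with bin edges chosen only for illustration:

```python
import pandas as pd

# Arbitrary values; they do not reflect the dataset.
toy = pd.DataFrame({
    "TEMP": [-5.0, -1.0, 3.0, 8.0, 12.0, 18.0, 22.0, 29.0],
    "PM2.5": [80.0, 60.0, 90.0, 70.0, 50.0, 70.0, 100.0, 80.0],
})

# Assign each reading to a 10-degree interval, then average per interval.
bins = [-10, 0, 10, 20, 30]
toy["temp_bin"] = pd.cut(toy["TEMP"], bins=bins)
binned = toy.groupby("temp_bin", observed=True)["PM2.5"].mean()
print(binned)
```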

### Conclusion

* Conclusion for question 1:
    - The plots of SO2, NO2, and CO against PM2.5 show that as PM2.5 increases, the mean SO2, NO2, and CO concentrations in the air increase as well.
    - The plot of O3 against PM2.5 shows a downward trend, so higher PM2.5 values are associated with lower O3 values.
* Conclusion for question 2:
    - The plot of temperature against PM2.5 and PM10 suggests that as temperature rises, PM2.5 and PM10 concentrations also rise gradually.