# Air Quality Classification for DKI Jakarta (Decision Tree Model)

Members of Group 2 SWIFT: <br>
1. Ali Tiflen <br>
2. Brilian Herda <br>
3. Dimas Fauzan Nurhidayat <br>
4. Maya Astriyani <br>

## Modules and Packages

In [1]:
pip show scikit-learn

Name: scikit-learn
Version: 1.2.2
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-learn.org
Author: 
Author-email: 
License: new BSD
Location: /usr/local/lib/python3.10/dist-packages
Requires: joblib, numpy, scipy, threadpoolctl
Required-by: fastai, imbalanced-learn, librosa, lightgbm, mlxtend, qudida, sklearn-pandas, yellowbrick


In [2]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import joblib

## Import Data

In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/MayaAstriyani/klasifikasi-kualitas-udara/main/ispu.csv")

## Exploratory Data Analysis (EDA)

In [4]:
df.head()

Unnamed: 0,tanggal,stasiun,pm10,pm25,so2,co,o3,no2,max,critical,categori
0,2021-01-01,DKI1 (Bunderan HI),38.0,53.0,29.0,6.0,31.0,13.0,53,PM25,SEDANG
1,2021-01-02,DKI1 (Bunderan HI),27.0,46.0,27.0,7.0,47.0,7.0,47,O3,BAIK
2,2021-01-03,DKI1 (Bunderan HI),44.0,58.0,25.0,7.0,40.0,13.0,58,PM25,SEDANG
3,2021-01-04,DKI1 (Bunderan HI),30.0,48.0,24.0,4.0,32.0,7.0,48,PM25,BAIK
4,2021-01-05,DKI1 (Bunderan HI),38.0,53.0,24.0,6.0,31.0,9.0,53,PM25,SEDANG


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1825 entries, 0 to 1824
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tanggal   1825 non-null   object 
 1   stasiun   1825 non-null   object 
 2   pm10      1757 non-null   float64
 3   pm25      1725 non-null   float64
 4   so2       1711 non-null   float64
 5   co        1789 non-null   float64
 6   o3        1757 non-null   float64
 7   no2       1790 non-null   float64
 8   max       1825 non-null   object 
 9   critical  1809 non-null   object 
 10  categori  1808 non-null   object 
dtypes: float64(6), object(5)
memory usage: 157.0+ KB


In [6]:
print(df.isnull().sum())

tanggal       0
stasiun       0
pm10         68
pm25        100
so2         114
co           36
o3           68
no2          35
max           0
critical     16
categori     17
dtype: int64


In [7]:
df.dropna(inplace=True)

In [8]:
print(df.isnull().sum())

tanggal     0
stasiun     0
pm10        0
pm25        0
so2         0
co          0
o3          0
no2         0
max         0
critical    0
categori    0
dtype: int64
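
With default arguments, `dropna` removes every row that contains at least one missing value, which is why the row count falls from 1825 to 1517 in the `df.info()` outputs. The per-column null counts add up to more than the 308 dropped rows because a single row can be missing several readings. A minimal sketch of that behaviour follows; the toy frame is a hypothetical stand-in, not the ISPU data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the ISPU data): three rows, two of them incomplete.
toy = pd.DataFrame({
    "pm10": [38.0, np.nan, 44.0],
    "pm25": [53.0, 46.0, 58.0],
    "categori": ["SEDANG", "BAIK", None],
})

rows_before = len(toy)
cleaned = toy.dropna()   # drops any row with at least one NaN/None
rows_after = len(cleaned)

print(f"rows before: {rows_before}, rows after: {rows_after}")
print(cleaned)
```

Assigning the result to a new name, rather than using `inplace=True`, keeps the original frame available for comparing counts.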


In [9]:
df.drop(['tanggal','stasiun','max','critical'], axis=1, inplace=True)

In [10]:
df.head()

Unnamed: 0,pm10,pm25,so2,co,o3,no2,categori
0,38.0,53.0,29.0,6.0,31.0,13.0,SEDANG
1,27.0,46.0,27.0,7.0,47.0,7.0,BAIK
2,44.0,58.0,25.0,7.0,40.0,13.0,SEDANG
3,30.0,48.0,24.0,4.0,32.0,7.0,BAIK
4,38.0,53.0,24.0,6.0,31.0,9.0,SEDANG


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1517 entries, 0 to 1824
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pm10      1517 non-null   float64
 1   pm25      1517 non-null   float64
 2   so2       1517 non-null   float64
 3   co        1517 non-null   float64
 4   o3        1517 non-null   float64
 5   no2       1517 non-null   float64
 6   categori  1517 non-null   object 
dtypes: float64(6), object(1)
memory usage: 94.8+ KB


In [12]:
df['categori'].value_counts()

SEDANG         1147
TIDAK SEHAT     245
BAIK            125
Name: categori, dtype: int64
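
The counts above show a clear class imbalance, so it helps to know what accuracy a trivial model would reach by always predicting the most frequent class. The sketch below recomputes the class shares from the three counts printed above; `majority_baseline` is an illustrative name, not part of the original notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"SEDANG": 1147, "TIDAK SEHAT": 245, "BAIK": 125})

shares = counts / counts.sum()      # fraction of rows per class
majority_baseline = shares.max()    # accuracy of always predicting the largest class

print((shares * 100).round(2))
print(f"majority-class baseline: {majority_baseline * 100:.2f} %")
```

Any reported accuracy is best read against this baseline rather than against 0 %.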

In [13]:
df.describe()

Unnamed: 0,pm10,pm25,so2,co,o3,no2
count,1517.0,1517.0,1517.0,1517.0,1517.0,1517.0
mean,52.874753,78.238629,35.720501,11.914305,31.338167,20.423204
std,14.70502,23.17835,12.627751,4.779081,14.843322,9.458014
min,15.0,13.0,2.0,2.0,8.0,3.0
25%,45.0,63.0,26.0,9.0,21.0,14.0
50%,54.0,78.0,36.0,11.0,28.0,19.0
75%,62.0,92.0,45.0,14.0,38.0,26.0
max,179.0,174.0,82.0,43.0,151.0,65.0


## Model Training

In [14]:
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
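
The call above uses the defaults, so the split changes on every run and the class proportions in the two parts are left to chance. A possible variant, sketched here on synthetic data with the same class counts, fixes `random_state` for reproducibility and passes `stratify` so each part keeps the overall class mix. The feature values, the seed, and the `*_demo` names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the cleaned data (features are random).
rng = np.random.default_rng(0)
labels = np.array(["SEDANG"] * 1147 + ["TIDAK SEHAT"] * 245 + ["BAIK"] * 125)
X_demo = pd.DataFrame(
    rng.uniform(0, 100, size=(len(labels), 6)),
    columns=["pm10", "pm25", "so2", "co", "o3", "no2"],
)
y_demo = pd.Series(labels, name="categori")

# test_size=0.25 is scikit-learn's default, written out here for clarity.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print(len(X_tr), len(X_te))
print(y_te.value_counts(normalize=True).round(3))
```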

In [16]:
model = DecisionTreeClassifier()
model = model.fit(X_train, y_train)
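
With no depth limit, `DecisionTreeClassifier` keeps splitting until the leaves are pure (or too small to split), so a perfect score on the training data is expected. One way to see what a fitted tree has learned is `export_text`, sketched below on synthetic data. The labelling rule (binning `pm25` at 50 and 100) is a hypothetical choice for the demo, not the official ISPU calculation.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic demo data (assumption): only pm25 drives the label, pm10 is noise.
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "pm10": rng.uniform(10, 120, size=300),
    "pm25": rng.uniform(10, 170, size=300),
})
y_demo = pd.cut(
    X_demo["pm25"],
    bins=[-np.inf, 50, 100, np.inf],
    labels=["BAIK", "SEDANG", "TIDAK SEHAT"],
).astype(str)

clf = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Text rendering of the learned split rules.
rules = export_text(clf, feature_names=list(X_demo.columns))
print(rules)
print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
```

If the printed tree for the real data turns out to be deep, `max_depth` or `min_samples_leaf` are the usual parameters to try for constraining it.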

## Model Evaluation

In [17]:
accuracy_train = model.score(X_train, y_train)
accuracy_test  = model.score(X_test, y_test)

In [18]:
print(f"Model Accuracy (Train) : {np.round(accuracy_train * 100, 2)} %")
print(f"Model Accuracy (Test)  : {np.round(accuracy_test * 100, 2)} %")

Model Accuracy (Train) : 100.0 %
Model Accuracy (Test)  : 99.47 %
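
With scikit-learn's default `test_size` of 0.25, 1517 rows give 380 test rows, so 99.47 % is consistent with 378 correct predictions. Because the classes are imbalanced, a single accuracy figure hides which classes the errors fall in. The sketch below shows how a confusion matrix and per-class report could be produced. It runs on synthetic data with a hypothetical labelling rule (binning the largest reading per row at 50 and 100), so its numbers say nothing about the real model.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the ISPU table (assumption: not the real data).
rng = np.random.default_rng(0)
cols = ["pm10", "pm25", "so2", "co", "o3", "no2"]
peak = rng.uniform(10, 180, size=600)
X_demo = pd.DataFrame(peak[:, None] * rng.uniform(0.2, 1.0, size=(600, 6)), columns=cols)

# Hypothetical labelling rule for the demo: bin the largest reading in each row.
class_names = ["BAIK", "SEDANG", "TIDAK SEHAT"]
y_demo = pd.cut(
    X_demo.max(axis=1), bins=[-np.inf, 50, 100, np.inf], labels=class_names
).astype(str)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)
clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, pred, labels=class_names)
print(pd.DataFrame(cm, index=class_names, columns=class_names))
print(classification_report(y_te, pred))
```

The diagonal of the matrix holds the correct predictions; per-class recall shows whether the smaller classes are classified as reliably as the largest one.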


## Saving the Model

In [19]:
joblib.dump(model, "model_kualitas_dt.model")

['model_kualitas_dt.model']
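
A saved model is only useful if it can be read back. The sketch below shows the round trip with `joblib.load` on a small toy classifier; it writes to a temporary directory, and the toy data is an assumption made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy classifier (assumption: three hand-made rows, not the ISPU data).
toy_X = pd.DataFrame({"pm10": [20, 60, 120], "pm25": [30, 80, 150]})
toy_y = ["BAIK", "SEDANG", "TIDAK SEHAT"]
clf = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "model_kualitas_dt.model")
    joblib.dump(clf, path)         # serialize the fitted estimator
    restored = joblib.load(path)   # read it back as a new object

# The restored model should reproduce the original predictions.
same = bool((restored.predict(toy_X) == clf.predict(toy_X)).all())
print("round trip OK:", same)
```

Model files of this kind are generally tied to the library version that wrote them, so it is worth recording the scikit-learn version (1.2.2 here) next to the file.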

## Prediction

In [20]:
df_test = pd.DataFrame(data={
    "pm10" : [38],
    "pm25"  : [53],
    "so2" : [29],
    "co"  : [6],
    "o3"  : [31],
    "no2"  : [13]
})

df_test[0:1]

Unnamed: 0,pm10,pm25,so2,co,o3,no2
0,38,53,29,6,31,13


In [21]:
pred_test = model.predict(df_test[0:1])
pred_test[0]

'SEDANG'