# **Preprocessing**

This stage prepares the data before analysis. The data is fetched from two different sources (MySQL and PostgreSQL) and then combined into a single dataframe. The end result is the Iris dataset with five main columns: sepal_length, sepal_width, petal_length, petal_width, and species.
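
The combining step can be sketched as follows. The project's `combineData` helper is not shown in this notebook, so `combine_by_position` is a hypothetical stand-in built on `pandas.concat`, and it assumes both sources return their rows in the same order. The two small frames only mimic the shape of the query results.

```python
import pandas as pd

def combine_by_position(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for combineData: align rows by position, join columns."""
    if len(left) != len(right):
        raise ValueError("both sources must return the same number of rows")
    # Reset the indexes so concat aligns on row position, not on stale labels
    return pd.concat(
        [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
    )

# Toy stand-ins for the MySQL (sepal) and PostgreSQL (petal + species) results
sepal = pd.DataFrame({"sepal_length": [5.1, 4.9], "sepal_width": [3.5, 3.0]})
petal = pd.DataFrame(
    {
        "petal_length": [1.4, 1.4],
        "petal_width": [0.2, 0.2],
        "species": ["Iris-setosa", "Iris-setosa"],
    }
)

combined = combine_by_position(sepal, petal)
print(combined.columns.tolist())
```

If the two tables share a key column, joining on that key would be safer than relying on row order.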

## **Outlier Cleaning**

Put simply, outlier cleaning is the step of detecting and removing data points whose values deviate far from the general pattern or main distribution of a dataset.

**Why Outliers Need to Be Removed**

**They distort the mean**

Outliers can bias the mean because it is pulled toward the extreme values, so the average no longer represents the data as a whole.

**They reduce model accuracy**

In machine learning, outliers often degrade model performance. In linear regression, for example, a single extreme data point can shift the regression line and make the model's predictions less accurate.
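
Both effects can be seen on a tiny example. The sketch below uses made-up values (not taken from the Iris table) and a hand-written least-squares slope, `ols_slope`, which is an illustrative helper rather than part of this project.

```python
from statistics import mean, median

def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line fitted to (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# Illustrative values only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]           # perfectly linear: y = 2x
ys_outlier = [2.0, 4.0, 6.0, 8.0, 40.0]   # last point replaced by an extreme value

print(mean(ys), median(ys))                   # mean and median agree
print(mean(ys_outlier), median(ys_outlier))   # mean is pulled up, median is not
print(ols_slope(xs, ys), ols_slope(xs, ys_outlier))
```

With the single extreme value, the mean moves from 6 to 12 while the median stays at 6, and the fitted slope goes from 2 to 8.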

**Iris Data**

In [None]:
import pandas as pd
from module.dataTransformer import combineData
from module.fetcher import fetchDataMysql, fetchDataPg

# Petal measurements and the class label come from PostgreSQL
data_pg = fetchDataPg("SELECT petal_length, petal_width, species FROM iris_table")
# Sepal measurements come from MySQL
data_my = fetchDataMysql("SELECT sepal_length, sepal_width FROM iris_table")

# Combine both sources into a single dataframe
iris_df = combineData(data1=data_my, data2=data_pg)
iris_df

|     | sepal_length | sepal_width | petal_length | petal_width | species        |
|-----|--------------|-------------|--------------|-------------|----------------|
| 0   | 5.10         | 3.50        | 1.40         | 0.20        | Iris-setosa    |
| 1   | 4.90         | 3.00        | 1.40         | 0.20        | Iris-setosa    |
| 2   | 4.70         | 3.20        | 1.30         | 0.20        | Iris-setosa    |
| 3   | 4.60         | 3.10        | 1.50         | 0.20        | Iris-setosa    |
| 4   | 5.00         | 3.60        | 1.40         | 0.20        | Iris-setosa    |
| ... | ...          | ...         | ...          | ...         | ...            |
| 145 | 6.70         | 3.00        | 5.20         | 2.30        | Iris-virginica |
| 146 | 6.30         | 2.50        | 5.00         | 1.90        | Iris-virginica |
| 147 | 6.50         | 3.00        | 5.20         | 2.00        | Iris-virginica |
| 148 | 6.20         | 3.40        | 5.40         | 2.30        | Iris-virginica |
| 149 | 5.90         | 3.00        | 5.10         | 1.80        | Iris-virginica |

150 rows × 5 columns


In [None]:
# Drop the species (class) column and keep only the numeric features
numeric_iris_df = iris_df[["sepal_length", "sepal_width", "petal_length", "petal_width"]].astype(float)
numeric_iris_df

|     | sepal_length | sepal_width | petal_length | petal_width |
|-----|--------------|-------------|--------------|-------------|
| 0   | 5.1          | 3.5         | 1.4          | 0.2         |
| 1   | 4.9          | 3.0         | 1.4          | 0.2         |
| 2   | 4.7          | 3.2         | 1.3          | 0.2         |
| 3   | 4.6          | 3.1         | 1.5          | 0.2         |
| 4   | 5.0          | 3.6         | 1.4          | 0.2         |
| ... | ...          | ...         | ...          | ...         |
| 145 | 6.7          | 3.0         | 5.2          | 2.3         |
| 146 | 6.3          | 2.5         | 5.0          | 1.9         |
| 147 | 6.5          | 3.0         | 5.2          | 2.0         |
| 148 | 6.2          | 3.4         | 5.4          | 2.3         |
| 149 | 5.9          | 3.0         | 5.1          | 1.8         |

150 rows × 4 columns


## **ABOD Method**

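Angle-Based Outlier Detection (ABOD) detects outliers by measuring how much the angles between a point and pairs of other points vary: a point whose angles vary little is likely to lie outside the main cloud of data. The parameter fraction=0.05 means that roughly 5% of the data is treated as outliers. ABOD was originally proposed for high-dimensional data, but it also works on a low-dimensional dataset such as Iris.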
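
The idea behind the score can be sketched in plain NumPy. This is a brute-force illustration on synthetic 2-D data, not the implementation PyCaret runs; `abod_scores` is a hypothetical helper. It returns the raw variance, where a smaller value means more outlying. Libraries may negate this variance so that larger means more anomalous, which would be consistent with negative Anomaly_Score values.

```python
import numpy as np
from itertools import combinations

def abod_scores(X: np.ndarray) -> np.ndarray:
    """Angle-based outlier factor per row: variance of distance-weighted angles.

    A small variance means the point sees all other points in similar
    directions, which is typical for a point lying outside the main cloud.
    """
    n = X.shape[0]
    scores = np.empty(n)
    for i in range(n):
        others = np.delete(X, i, axis=0) - X[i]   # difference vectors from point i
        values = []
        for a, b in combinations(others, 2):
            na, nb = a @ a, b @ b                 # squared lengths
            if na == 0.0 or nb == 0.0:            # skip exact duplicates of point i
                continue
            values.append((a @ b) / (na * nb))
        scores[i] = np.var(values)
    return scores

rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(30, 2))   # synthetic inliers
X = np.vstack([cluster, [[10.0, 10.0]]])                  # one obvious outlier

scores = abod_scores(X)
print(int(np.argmin(scores)))   # index of the most outlying point
```

The loop over all pairs is cubic in the number of rows, so it is only practical for small datasets like this one.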

### **Preparing the ABOD Model**

In [None]:
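from pycaret.anomaly import *

# Initialise the PyCaret anomaly-detection experiment on the numeric features
s = setup(data=numeric_iris_df)

# Train ABOD, treating about 5% of the rows as outliers
abod_model = create_model("abod", fraction=0.05)

# Attach the Anomaly label and Anomaly_Score to every row
df_abod = assign_model(abod_model)

df_abod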

<table>
  <thead>
    <tr>
      <th>No</th>
      <th>Description</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>0</td><td>Session id</td><td>7108</td></tr>
    <tr><td>1</td><td>Original data shape</td><td>(150, 4)</td></tr>
    <tr><td>2</td><td>Transformed data shape</td><td>(150, 4)</td></tr>
    <tr><td>3</td><td>Numeric features</td><td>4</td></tr>
    <tr style="background-color:#90EE90;">
      <td>4</td><td>Preprocess</td><td><b>True</b></td>
    </tr>
    <tr><td>5</td><td>Imputation type</td><td>simple</td></tr>
    <tr><td>6</td><td>Numeric imputation</td><td>mean</td></tr>
    <tr><td>7</td><td>Categorical imputation</td><td>mode</td></tr>
    <tr><td>8</td><td>CPU Jobs</td><td>-1</td></tr>
    <tr><td>9</td><td>Use GPU</td><td>False</td></tr>
    <tr><td>10</td><td>Log Experiment</td><td>False</td></tr>
    <tr><td>11</td><td>Experiment Name</td><td>anomaly-default-name</td></tr>
    <tr><td>12</td><td>USI</td><td>03dd</td></tr>
  </tbody>
</table>

|       | sepal_length | sepal_width | petal_length | petal_width | Anomaly | Anomaly_Score  |
|-------|--------------|-------------|--------------|-------------|---------|----------------|
| 0     | 5.1          | 3.5         | 1.4          | 0.2         | 0       | -556.251421    |
| 1     | 4.9          | 3.0         | 1.4          | 0.2         | 0       | -400.000928    |
| 2     | 4.7          | 3.2         | 1.3          | 0.2         | 0       | -93.421993     |
| 3     | 4.6          | 3.1         | 1.5          | 0.2         | 0       | -99.229221     |
| 4     | 5.0          | 3.6         | 1.4          | 0.2         | 0       | -82.176201     |
| ...   | ...          | ...         | ...          | ...         | ...     | ...            |
| 145   | 6.7          | 3.0         | 5.2          | 2.3         | 0       | -13.831461     |
| 146   | 6.3          | 2.5         | 5.0          | 1.9         | 0       | -9.110201      |
| 147   | 6.5          | 3.0         | 5.2          | 2.0         | 0       | -20.571707     |
| 148   | 6.2          | 3.4         | 5.4          | 2.3         | 0       | -6.345400      |
| 149   | 5.9          | 3.0         | 5.1          | 1.8         | 0       | -28.205616     |

150 rows × 6 columns

### **Clean Iris Data**

**Data without outliers**

After running ABOD, the rows detected as inliers (Anomaly = 0) are kept. The result is an Iris dataset without the flagged outliers, which is more stable for further analysis.

In [None]:
# Keep only inliers (Anomaly == 0) and re-attach the species label by index

clean_iris_abod = df_abod[df_abod["Anomaly"] == 0].merge(iris_df["species"], left_index=True, right_index=True)[["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]]
clean_iris_abod

| Index | sepal_length | sepal_width | petal_length | petal_width | species        |
|-------|--------------|-------------|--------------|-------------|----------------|
| 0     | 5.1          | 3.5         | 1.4          | 0.2         | Iris-setosa    |
| 1     | 4.9          | 3.0         | 1.4          | 0.2         | Iris-setosa    |
| 2     | 4.7          | 3.2         | 1.3          | 0.2         | Iris-setosa    |
| 3     | 4.6          | 3.1         | 1.5          | 0.2         | Iris-setosa    |
| 4     | 5.0          | 3.6         | 1.4          | 0.2         | Iris-setosa    |
| ...   | ...          | ...         | ...          | ...         | ...            |
| 145   | 6.7          | 3.0         | 5.2          | 2.3         | Iris-virginica |
| 146   | 6.3          | 2.5         | 5.0          | 1.9         | Iris-virginica |
| 147   | 6.5          | 3.0         | 5.2          | 2.0         | Iris-virginica |
| 148   | 6.2          | 3.4         | 5.4          | 2.3         | Iris-virginica |
| 149   | 5.9          | 3.0         | 5.1          | 1.8         | Iris-virginica |

142 rows × 5 columns
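
The 142 remaining rows are consistent with a percentile-based threshold. The sketch below assumes the detector places its threshold at the (1 - fraction) percentile of the scores and flags everything strictly above it, which is how PyOD-based detectors commonly assign labels. The helper `label_by_fraction` and the synthetic scores are illustrative only.

```python
import numpy as np

def label_by_fraction(scores: np.ndarray, fraction: float) -> np.ndarray:
    """Flag scores strictly above the (1 - fraction) percentile as anomalies."""
    threshold = np.percentile(scores, 100 * (1 - fraction))
    return (scores > threshold).astype(int)

# 150 distinct synthetic scores; the real Anomaly_Score values would go here
scores = np.arange(150, dtype=float)
labels = label_by_fraction(scores, fraction=0.05)

n_outliers = int(labels.sum())
n_inliers = int((labels == 0).sum())
print(n_outliers, n_inliers)
```

With 150 distinct scores, the 95th percentile falls between the 142nd and 143rd smallest values, so 8 points are flagged and 142 are kept, even though 5% of 150 is 7.5.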

## **KNN Method**

The K-Nearest Neighbors (KNN) method for anomaly detection identifies data points that lie far from their nearest neighbors. As with ABOD, fraction=0.05 is used so that about 5% of the data is treated as outliers.
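
A minimal sketch of one common variant of this score follows: the distance from each point to its k-th nearest neighbor. The helper `knn_scores`, the choice k=5, and the synthetic data are assumptions made for illustration, not settings confirmed by this notebook.

```python
import numpy as np

def knn_scores(X: np.ndarray, k: int = 5) -> np.ndarray:
    """Distance from each row to its k-th nearest neighbour (excluding itself)."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))   # full pairwise distance matrix
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    return np.sort(dists, axis=1)[:, k - 1]

rng = np.random.default_rng(1)
cluster = rng.normal(loc=5.0, scale=0.3, size=(40, 4))   # synthetic inliers
X = np.vstack([cluster, [[9.0, 9.0, 9.0, 9.0]]])          # one far-away point

scores = knn_scores(X, k=5)
print(int(np.argmax(scores)))   # index of the point farthest from its neighbours
```

Points inside the cluster get small scores because their fifth neighbor is close, while the isolated point gets a score roughly equal to its distance from the cluster.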

### **Preparing the KNN Model**

In [None]:
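from pycaret.anomaly import *

# Initialise the PyCaret anomaly-detection experiment on the numeric features
s = setup(data=numeric_iris_df)

# Train KNN, treating about 5% of the rows as outliers
knn_model = create_model("knn", fraction=0.05)

# Attach the Anomaly label and Anomaly_Score to every row
df_knn = assign_model(knn_model)

df_knn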

<table border="1" style="border-collapse: collapse; width: 70%; text-align: left;">
  <thead style="background-color: #4CAF50; color: white;">
    <tr>
      <th>Description</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>Session id</td><td>5315</td></tr>
    <tr><td>Original data shape</td><td>(150, 4)</td></tr>
    <tr><td>Transformed data shape</td><td>(150, 4)</td></tr>
    <tr><td>Numeric features</td><td>4</td></tr>
    <tr><td>Preprocess</td><td style="color: green; font-weight: bold;">True</td></tr>
    <tr><td>Imputation type</td><td>simple</td></tr>
    <tr><td>Numeric imputation</td><td>mean</td></tr>
    <tr><td>Categorical imputation</td><td>mode</td></tr>
    <tr><td>CPU Jobs</td><td>-1</td></tr>
    <tr><td>Use GPU</td><td style="color: red; font-weight: bold;">False</td></tr>
    <tr><td>Log Experiment</td><td>False</td></tr>
    <tr><td>Experiment Name</td><td>anomaly-default-name</td></tr>
    <tr><td>USI</td><td>7446</td></tr>
  </tbody>
</table>


|       | sepal_length | sepal_width | petal_length | petal_width | Anomaly | Anomaly_Score |
|-------|--------------|-------------|--------------|-------------|---------|---------------|
| 0     | 5.1          | 3.5         | 1.4          | 0.2         | 0       | 0.141421      |
| 1     | 4.9          | 3.0         | 1.4          | 0.2         | 0       | 0.173205      |
| 2     | 4.7          | 3.2         | 1.3          | 0.2         | 0       | 0.264575      |
| 3     | 4.6          | 3.1         | 1.5          | 0.2         | 0       | 0.264575      |
| 4     | 5.0          | 3.6         | 1.4          | 0.2         | 0       | 0.244949      |
| ...   | ...          | ...         | ...          | ...         | ...     | ...           |
| 145   | 6.7          | 3.0         | 5.2          | 2.3         | 0       | 0.374166      |
| 146   | 6.3          | 2.5         | 5.0          | 1.9         | 0       | 0.479583      |
| 147   | 6.5          | 3.0         | 5.2          | 2.0         | 0       | 0.387298      |
| 148   | 6.2          | 3.4         | 5.4          | 2.3         | 0       | 0.624500      |
| 149   | 5.9          | 3.0         | 5.1          | 1.8         | 0       | 0.360555      |

150 rows × 6 columns

### **Clean Iris Data**

**Data without outliers**

In [None]:
clean_iris_knn = df_knn[df_knn["Anomaly"] == 0].merge(iris_df["species"], left_index=True, right_index=True)[["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]]
clean_iris_knn

|      | sepal_length | sepal_width | petal_length | petal_width | species        |
|-------|--------------|-------------|--------------|-------------|----------------|
| 0     | 5.1          | 3.5         | 1.4          | 0.2         | Iris-setosa    |
| 1     | 4.9          | 3.0         | 1.4          | 0.2         | Iris-setosa    |
| 2     | 4.7          | 3.2         | 1.3          | 0.2         | Iris-setosa    |
| 3     | 4.6          | 3.1         | 1.5          | 0.2         | Iris-setosa    |
| 4     | 5.0          | 3.6         | 1.4          | 0.2         | Iris-setosa    |
| ...   | ...          | ...         | ...          | ...         | ...            |
| 145   | 6.7          | 3.0         | 5.2          | 2.3         | Iris-virginica |
| 146   | 6.3          | 2.5         | 5.0          | 1.9         | Iris-virginica |
| 147   | 6.5          | 3.0         | 5.2          | 2.0         | Iris-virginica |
| 148   | 6.2          | 3.4         | 5.4          | 2.3         | Iris-virginica |
| 149   | 5.9          | 3.0         | 5.1          | 1.8         | Iris-virginica |

142 rows × 5 columns

## **LOF Method**

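Local Outlier Factor (LOF) compares the local density around a point with the density around its neighbors, and flags points that sit in much sparser regions than their neighbors do. The LOF model is built with fraction (the proportion of data treated as outliers) set to 0.05, i.e. 5%.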
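
The density comparison can be sketched in NumPy as follows. This is a simplified illustration that ignores ties and assumes no exact duplicate rows; `lof_scores`, k=5, and the synthetic data are assumptions rather than this notebook's settings.

```python
import numpy as np

def lof_scores(X: np.ndarray, k: int = 5) -> np.ndarray:
    """Local Outlier Factor per row; values well above 1 suggest an outlier."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)                 # exclude each point from its own neighbours

    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest neighbours
    d_to_neighbours = np.take_along_axis(dists, neighbours, axis=1)
    k_distance = d_to_neighbours[:, -1]             # distance to the k-th neighbour

    # Reachability distance from p to neighbour o: max(k_distance(o), d(p, o))
    reach_dist = np.maximum(k_distance[neighbours], d_to_neighbours)

    lrd = 1.0 / reach_dist.mean(axis=1)             # local reachability density
    return lrd[neighbours].mean(axis=1) / lrd       # neighbours' density relative to own

rng = np.random.default_rng(2)
cluster = rng.normal(loc=5.0, scale=0.3, size=(40, 4))   # synthetic inliers
X = np.vstack([cluster, [[9.0, 9.0, 9.0, 9.0]]])          # one far-away point

scores = lof_scores(X, k=5)
print(int(np.argmax(scores)))   # index of the point with the sparsest neighbourhood
```

Scores close to 1 mean a point is about as dense as its neighbors, which matches the range of Anomaly_Score values seen for typical inliers.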

### **Preparing the LOF Model**

In [None]:
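from pycaret.anomaly import *

# Initialise the PyCaret anomaly-detection experiment on the numeric features
s = setup(data=numeric_iris_df)

# Train LOF, treating about 5% of the rows as outliers
lof_model = create_model("lof", fraction=0.05)

# Attach the Anomaly label and Anomaly_Score to every row
df_lof = assign_model(lof_model)

df_lof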

| No  | Description             | Value                 |
|-----|-------------------------|-----------------------|
| 0   | Session id              | 5909                  |
| 1   | Original data shape     | (150, 4)              |
| 2   | Transformed data shape  | (150, 4)              |
| 3   | Numeric features        | 4                     |
| 4   | Preprocess              | True                  |
| 5   | Imputation type         | simple                |
| 6   | Numeric imputation      | mean                  |
| 7   | Categorical imputation  | mode                  |
| 8   | CPU Jobs                | -1                    |
| 9   | Use GPU                 | False                 |
| 10  | Log Experiment          | False                 |
| 11  | Experiment Name         | anomaly-default-name  |
| 12  | USI                     | 79eb                  |

|     | sepal_length | sepal_width | petal_length | petal_width | Anomaly | Anomaly_Score |
|-----|--------------|-------------|--------------|-------------|---------|---------------|
| 0   | 5.1          | 3.5         | 1.4          | 0.2         | 0       | 0.976302      |
| 1   | 4.9          | 3.0         | 1.4          | 0.2         | 0       | 1.008758      |
| 2   | 4.7          | 3.2         | 1.3          | 0.2         | 0       | 1.019841      |
| 3   | 4.6          | 3.1         | 1.5          | 0.2         | 0       | 1.049882      |
| 4   | 5.0          | 3.6         | 1.4          | 0.2         | 0       | 0.958473      |
| ... | ...          | ...         | ...          | ...         | ...     | ...           |
| 145 | 6.7          | 3.0         | 5.2          | 2.3         | 0       | 0.978474      |
| 146 | 6.3          | 2.5         | 5.0          | 1.9         | 0       | 1.004232      |
| 147 | 6.5          | 3.0         | 5.2          | 2.0         | 0       | 0.980847      |
| 148 | 6.2          | 3.4         | 5.4          | 2.3         | 0       | 1.021819      |
| 149 | 5.9          | 3.0         | 5.1          | 1.8         | 0       | 1.011326      |

150 rows × 6 columns


### **Clean Iris Data**

**Data without outliers**

In [None]:
clean_iris_lof = df_lof[df_lof["Anomaly"] == 0].merge(iris_df["species"], left_index=True, right_index=True)[["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]]
clean_iris_lof

| index | sepal_length | sepal_width | petal_length | petal_width | species        |
|-------|--------------|-------------|--------------|-------------|----------------|
| 0     | 5.1          | 3.5         | 1.4          | 0.2         | Iris-setosa    |
| 1     | 4.9          | 3.0         | 1.4          | 0.2         | Iris-setosa    |
| 2     | 4.7          | 3.2         | 1.3          | 0.2         | Iris-setosa    |
| 3     | 4.6          | 3.1         | 1.5          | 0.2         | Iris-setosa    |
| 4     | 5.0          | 3.6         | 1.4          | 0.2         | Iris-setosa    |
| ...   | ...          | ...         | ...          | ...         | ...            |
| 145   | 6.7          | 3.0         | 5.2          | 2.3         | Iris-virginica |
| 146   | 6.3          | 2.5         | 5.0          | 1.9         | Iris-virginica |
| 147   | 6.5          | 3.0         | 5.2          | 2.0         | Iris-virginica |
| 148   | 6.2          | 3.4         | 5.4          | 2.3         | Iris-virginica |
| 149   | 5.9          | 3.0         | 5.1          | 1.8         | Iris-virginica |

142 rows × 5 columns
