# IF3170 Artificial Intelligence | Tugas Besar 2

This notebook serves as a template for the assignment. Please create a copy of this notebook to complete your work. You can add more code blocks, markdown blocks, or new sections if needed.


Group Number: 20

Group Members:
- Ahmad Naufal Ramadan (13522005)
- Kristo Anugrah (13522024)
- Tazkia Nizami (13522032)
- Farhan Nafis Rayhan (13522037)

## Import Libraries

In [51]:
import pandas as pd
import numpy as np
from IPython.display import display, Markdown
import matplotlib.pyplot as plt
import pickle

## Import Dataset

In [None]:
additional_features_df = pd.read_csv('../dataset/train/additional_features_train.csv')
basic_features_df = pd.read_csv('../dataset/train/basic_features_train.csv')
content_features_df = pd.read_csv('../dataset/train/content_features_train.csv')
flow_features_df = pd.read_csv('../dataset/train/flow_features_train.csv')
labels_df = pd.read_csv('../dataset/train/labels_train.csv')
time_features_df = pd.read_csv('../dataset/train/time_features_train.csv')

data_with_label = pd.merge(basic_features_df, additional_features_df, on="id")
data_with_label = pd.merge(data_with_label, content_features_df, on="id")
data_with_label = pd.merge(data_with_label, flow_features_df, on="id")
data_with_label = pd.merge(data_with_label, time_features_df, on="id")
data_with_label = pd.merge(data_with_label, labels_df, on="id")
data = data_with_label.drop('label', axis=1)

data.head()

# Exploratory Data Analysis (Optional)

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves examining and visualizing data sets to uncover patterns, trends, anomalies, and insights. It is the first step before applying more advanced statistical and machine learning techniques. EDA helps you to gain a deep understanding of the data you are working with, allowing you to make informed decisions and formulate hypotheses for further analysis.

In [None]:
categorical_features = data.select_dtypes(include=['object']).columns.tolist()
binary_features = ['is_sm_ips_ports', 'is_ftp_login']

numerical_features = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
for feature in binary_features:
    numerical_features.remove(feature)
numerical_features.remove('id')

markdown_content = "### Total Features\n"
markdown_content += f"* Total features: {len(categorical_features) + len(numerical_features) + len(binary_features)}\n"
markdown_content += f"* Total categorical features: {len(categorical_features)}\n"
markdown_content += f"* Total numerical features: {len(numerical_features)}\n"
markdown_content += f"* Total binary features: {len(binary_features)}\n"
markdown_content += "\n### Categorical Features\n"
markdown_content += "\n".join([f"* {feature}" for feature in categorical_features])
markdown_content += "\n\n### Numerical Features\n"
markdown_content += "\n".join([f"* {feature}" for feature in numerical_features])
markdown_content += "\n\n### Binary Features\n"
markdown_content += "\n".join([f"* {feature}" for feature in binary_features])

display(Markdown(markdown_content))

In [None]:
markdown_content = "### Categorical Features Values\n"
for feature in categorical_features:
    markdown_content += f"\n#### {feature}\n"
    markdown_content += f"* Unique values: {data[feature].nunique()}\n"
    markdown_content += f"* Values: {data[feature].unique()}\n"

display(Markdown(markdown_content))

markdown_content = "### Numerical Features Values\n"
for feature in numerical_features:
    markdown_content += f"\n#### {feature}\n"
    unique_values = data[feature].nunique()
    if unique_values > 150:
        markdown_content += f"* Min: {data[feature].min()}\n"
        markdown_content += f"* Mean: {data[feature].mean()}\n"
        markdown_content += f"* Max: {data[feature].max()}\n"
        markdown_content += f"* STD: {data[feature].std()}\n"
        markdown_content += f"* 25%: {data[feature].quantile(0.25)}\n"
        markdown_content += f"* 50%: {data[feature].quantile(0.50)}\n"
        markdown_content += f"* 75%: {data[feature].quantile(0.75)}\n"
    else:
        markdown_content += f"* Unique values: {unique_values}\n"
        markdown_content += f"* Values: {data[feature].unique()}\n"
    

display(Markdown(markdown_content))

In [None]:
rows, columns = data.shape
display(Markdown(f"### Data shape:"))
display(Markdown(f"#### Number of Instances is: {rows}"))
display(Markdown(f"#### Number of Features is: {columns}"))

In [None]:
categorical_values = data.select_dtypes(include=['object'])
binary_values = data.loc[:, binary_features]
numerical_values = data.select_dtypes(include=['int64', 'float64']) \
                            .drop(binary_features, axis=1).drop('id', axis=1)

display(Markdown("### Numerical values:"))
display(numerical_values)
display(Markdown("### Categorical values:"))
display(categorical_values)
display(Markdown("### Binary values:"))
display(binary_values)

In [None]:
display(Markdown("### Numerical value statistics:"))
display(numerical_values.describe())

In [None]:
missing_values = data.isnull().sum().to_frame("missing_values_count")
display(Markdown("### Number of missing values in every column:"))
display(missing_values)

In [None]:
def histogram_numerical(numerical_table: pd.DataFrame):
  fig, axes = plt.subplots(9, 4, figsize=(30, 30))
  axes = axes.flatten()
  for index, col in enumerate(numerical_table.columns):
    axes[index].hist(numerical_table[col].dropna(), color='skyblue', edgecolor='black')
    axes[index].set_title(f'Distribution of {col}')
  fig.tight_layout()
    
display(Markdown("### Histogram of every numerical feature:"))
histogram_numerical(numerical_values)

In [None]:
def histogram_categorical(categorical_table: pd.DataFrame):
  fig, axes = plt.subplots(4, 1, figsize=(100, 100))
  axes = axes.flatten()
  for index, col in enumerate(categorical_table.columns):
    categorical_table[col].value_counts().plot(kind="bar", ax=axes[index]).set_title(col)
  fig.tight_layout()
    
display(Markdown("### Bar chart of every categorical feature:"))
histogram_categorical(categorical_values)

In [None]:
# boxplot with outliers

def boxplot_all_column(numerical_table: pd.core.frame.DataFrame):
  fig, axes = plt.subplots(9, 4, figsize = (20, 20))
  axes = axes.flatten() 
  for index, col in enumerate(numerical_table.columns):
    axes[index].boxplot(numerical_table[col].dropna(), False, sym="rs", vert=False,  widths=0.5, positions=[0])
    axes[index].set_title(f"Boxplot of {col}")
  fig.tight_layout()

display(Markdown("### Boxplot of every numerical feature:"))
boxplot_all_column(numerical_values)

In [None]:
def count_of_outliers(numerical_table: pd.DataFrame):
  Q1 = numerical_table.quantile(0.25)
  Q3 = numerical_table.quantile(0.75)
  IQR = Q3 - Q1
  return (((numerical_table < (Q1 - 1.5 * IQR)) | (numerical_table > (Q3 + 1.5 * IQR))).sum().to_frame("Number of outliers"))

count_of_outliers(numerical_values)

In [None]:
import seaborn as sn

corr_numerical_matrix = numerical_values.corr()

plt.figure(figsize=(180, 180))

sn.heatmap(corr_numerical_matrix, annot=True, cmap='coolwarm', fmt='.2f', annot_kws={"size": 42})
plt.xticks(fontsize=36)
plt.yticks(fontsize=36)

display(Markdown("### Correlation heatmap of every pair of numerical features"))
plt.show()

In [None]:
corr_categorical_matrix = categorical_values.apply(lambda x : pd.factorize(x)[0]).corr(method='pearson', min_periods=1)
display(Markdown("### Correlation heatmap of every pair of categorical features"))
display(sn.heatmap(corr_categorical_matrix, annot=True, cmap='coolwarm', fmt='.2f', annot_kws={"size": 10}))

In [None]:
def display_contingency_table(categorical_table: pd.DataFrame):
  columns = categorical_table.columns
  for index1, col1 in enumerate(columns):
    for index2, col2 in enumerate(columns):
      if index2 >= index1:
        break
      first_feature = categorical_table.loc[:, col1]
      second_feature = categorical_table.loc[:, col2]
      contingency_table = pd.crosstab(first_feature, second_feature)
      display(Markdown(f"#### Contingency table of {col1} with {col2}:"))
      display(contingency_table)

display_contingency_table(categorical_values)

# 1. Split Training Set and Validation Set

Splitting the data into a training set and a validation set gives an early diagnostic of how well the trained model performs. The split is done before the preprocessing steps to **avoid data leakage between the sets**. If you want to use k-fold cross-validation, split the data later and do the cleaning and preprocessing separately for each split.

Note: For training, you should use the data contained in the `train` folder given by the TA. The `test` data is only used for the Kaggle submission.
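
To make the leakage point concrete, here is a minimal sketch using synthetic data (the array `X` and the 80/20 split below are illustrative assumptions, not the assignment dataset). The scaling statistics are computed from the training rows only and then reused unchanged on the validation rows.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # synthetic stand-in data

# Shuffle the row indices, then hold out 20% for validation.
indices = rng.permutation(len(X))
n_val = int(0.2 * len(X))
val_idx, train_idx = indices[:n_val], indices[n_val:]
X_tr, X_va = X[train_idx], X[val_idx]

# Statistics come from the training rows only...
col_min = X_tr.min(axis=0)
col_max = X_tr.max(axis=0)

# ...and are reused as-is on the validation rows.
X_tr_scaled = (X_tr - col_min) / (col_max - col_min)
X_va_scaled = (X_va - col_min) / (col_max - col_min)

print(X_tr_scaled.shape, X_va_scaled.shape)
```

Validation values may fall slightly outside [0, 1] here. That is expected: the validation set plays the role of unseen data, so it must not influence the fitted statistics.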

In [None]:
import sklearn.model_selection

X = data.drop(['attack_cat'], axis=1)
y = data['attack_cat']

X_train, X_val, y_train, y_val = sklearn.model_selection.train_test_split(X, y, test_size=0.20, random_state=42)

display(Markdown("### X train:"))
display(X_train)
display(Markdown("### X validation:"))
display(X_val)

# 2. Data Cleaning and Preprocessing

This is the first step to take once a data scientist has a general understanding of the data. Raw data is **seldom ready for training**, so it must be cleaned and formatted before a machine learning model can interpret it.

By performing data cleaning and preprocessing, you ensure that your dataset is ready for model training, leading to more accurate and reliable machine learning results. These steps are essential for transforming raw data into a format that machine learning algorithms can effectively learn from and make predictions.

We provide some common methods for you to try, but you only need to **implement at least one method for each process**. For each step you perform, **please explain why you did it. Write the explanation in a markdown cell under the code cell you wrote.**

## A. Data Cleaning

**Data cleaning** is the crucial first step in preparing your dataset for machine learning. Raw data collected from various sources is often messy and may contain errors, missing values, and inconsistencies. Data cleaning involves the following steps:

1. **Handling Missing Data:** Identify and address missing values in the dataset. This can include imputing missing values, removing rows or columns with excessive missing data, or using more advanced techniques like interpolation.

2. **Dealing with Outliers:** Identify and handle outliers, which are data points significantly different from the rest of the dataset. Outliers can be removed or transformed to improve model performance.

3. **Data Validation:** Check for data integrity and consistency. Ensure that data types are correct, categorical variables have consistent labels, and numerical values fall within expected ranges.

4. **Removing Duplicates:** Identify and remove duplicate rows, as they can skew the model's training process and evaluation metrics.

5. **Feature Engineering**: Create new features or modify existing ones to extract relevant information. This step can involve scaling, normalizing, or encoding features for better model interpretability.

### I. Handling Missing Data

In [67]:
import sklearn.impute
from sklearn.base import BaseEstimator, TransformerMixin

class ImputeNumerical(BaseEstimator, TransformerMixin):
  def __init__(self):
    self.numerical_columns = []
    self.imp_mean = {}

  def fit(self, X: pd.DataFrame, y=None):
    # Despite the attribute name, missing values are filled with the column median.
    self.imp_mean = sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='median')
    self.imp_mean.fit(X)
    return self
  
  def transform(self, X: pd.DataFrame):
    # Keep the original index so rows stay aligned with their labels.
    return pd.DataFrame(self.imp_mean.transform(X), columns=X.columns, index=X.index)
  
class ImputeCategorical(BaseEstimator, TransformerMixin):
  def __init__(self):
    self.categorical_columns = []
    self.imp_mode = {}

  def fit(self, X: pd.DataFrame, y=None):
    # Missing categorical values are filled with the most frequent value (mode).
    self.imp_mode = sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    self.imp_mode.fit(X)
    return self
  
  def transform(self, X: pd.DataFrame):
    # Keep the original index so rows stay aligned with their labels.
    return pd.DataFrame(self.imp_mode.transform(X), columns=X.columns, index=X.index)

### II. Dealing with Outliers

In [68]:
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd


class ReplaceOutliersWithMedian(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.medians_ = None
    
    def fit(self, X, y=None):
        X = X.toarray() if hasattr(X, 'toarray') else np.array(X)

        self.medians_ = np.median(X, axis=0)
        
        return self
    
    def transform(self, X):
        if self.medians_ is None:
            raise ValueError("Transformer has not been fitted yet. Call 'fit' first.")
        
        X = X.toarray() if hasattr(X, 'toarray') else np.array(X)
        
        X_transformed = X.copy()
        
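        # The threshold is compared against the raw values, so this assumes X
        # has already been standardized (i.e. the values are z-scores).
        outlier_mask = np.abs(X_transformed) > self.threshold
        X_transformed[outlier_mask] = self.medians_[np.where(outlier_mask)[1]]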
        
        return X_transformed


class DropOutliers(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.medians_ = None
    
    def fit(self, X, y=None):
        X = X.toarray() if hasattr(X, 'toarray') else np.array(X)

        self.medians_ = np.median(X, axis=0)
        
        return self
    
    def transform(self, X):
        if self.medians_ is None:
            raise ValueError("Transformer has not been fitted yet. Call 'fit' first.")
        
        X = X.toarray() if hasattr(X, 'toarray') else np.array(X)
        
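        # The threshold is compared against the raw values, so this assumes X
        # has already been standardized. Rows are dropped from X only, so the
        # caller must filter y to match.
        outlier_mask = np.abs(X) > self.threshold
        X_transformed = X[~np.any(outlier_mask, axis=1)]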
        
        return X_transformed

### ReplaceOutliersWithMedian + DropOutliers
We prepared classes that handle outliers in two ways: replacing them with the column median, or dropping the affected rows entirely. Although neither class is used in our final pipeline, they helped us search for the pipeline configuration with the best performance.
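
The two strategies can be sketched on a toy array as follows. This is an illustration under stated assumptions rather than the classes above: the data and the threshold of 2 are made up, and the z-scores are computed explicitly here, whereas the classes compare the threshold against values that are assumed to be standardized already.

```python
import numpy as np

# Toy data: row 4 holds an obvious outlier in column 0.
X = np.array([
    [1.0, 10.0],
    [2.0, 11.0],
    [3.0, 12.0],
    [2.0, 11.0],
    [100.0, 12.0],
    [1.0, 10.0],
    [3.0, 11.0],
    [2.0, 12.0],
])

threshold = 2.0  # illustrative value
z_scores = (X - X.mean(axis=0)) / X.std(axis=0)
outlier_mask = np.abs(z_scores) > threshold

# Strategy 1: replace each outlying cell with its column median.
medians = np.median(X, axis=0)
replaced = np.where(outlier_mask, medians, X)

# Strategy 2: drop every row that contains at least one outlying cell.
dropped = X[~outlier_mask.any(axis=1)]

print(replaced[4], dropped.shape)
```

Replacing keeps the number of rows unchanged, which makes it easy to use inside a pipeline. Dropping changes the row count, so the labels have to be filtered with the same mask.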

### III. Remove Duplicates

In [69]:
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class RemoveDuplicates(BaseEstimator, TransformerMixin):
    def fit(self, X: pd.DataFrame, y):
        return self

    def transform(self, X: pd.DataFrame):
        return X.drop_duplicates(keep='first').reset_index(drop=True)

### Remove Duplicates
In this code block we prepared a class that removes duplicate rows. Although it is not used in the final pipeline, it helped us try different pipeline combinations in search of the best model performance.

### IV. Feature Engineering

In [70]:
from sklearn.feature_selection import SelectKBest, f_classif

class FilterSelection(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X: pd.DataFrame, y):
        self.select_k_best = SelectKBest(score_func=f_classif, k=10)
        self.select_k_best.fit(X, y)
        return self

    def transform(self, X: pd.DataFrame):
        selected_features = self.select_k_best.get_support()
        return X.loc[:, selected_features]

### Feature Selection
This step selects the K best columns to represent the whole dataset. We do this because it:
1. Improves model performance: keeping only the K best features reduces noise and irrelevant data, which can improve model performance.
2. Reduces overfitting: using only the K best features helps the model generalize beyond the training data.
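
The `f_classif` score function ranks features by the one-way ANOVA F statistic. The sketch below computes that statistic by hand with NumPy on synthetic data, to show why an informative column outranks a noisy one. The helper `anova_f_scores` and the toy columns are illustrative assumptions, not part of the pipeline.

```python
import numpy as np

def anova_f_scores(X, y):
    """One-way ANOVA F statistic per column (between-class vs. within-class variance)."""
    classes = np.unique(y)
    n_samples, n_classes = X.shape[0], len(classes)
    overall_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        X_c = X[y == c]
        class_mean = X_c.mean(axis=0)
        ss_between += len(X_c) * (class_mean - overall_mean) ** 2
        ss_within += ((X_c - class_mean) ** 2).sum(axis=0)
    return (ss_between / (n_classes - 1)) / (ss_within / (n_samples - n_classes))

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 50)
informative = y + rng.normal(scale=0.3, size=y.size)  # depends on the class
noise = rng.normal(size=y.size)                        # unrelated to the class
X = np.column_stack([noise, informative])

scores = anova_f_scores(X, y)
top_k = np.argsort(scores)[::-1][:1]  # indices of the K=1 best columns
print(scores, top_k)
```

A high score means the class means of a feature differ a lot relative to the spread inside each class, which is the property the selection step rewards.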

## B. Data Preprocessing

### I. Feature Scaling

In [72]:
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

class MinMaxFeatureScaling(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.scaler = MinMaxScaler()

    def fit(self, X, y=None):
        self.scaler.fit(X)
        return self

    def transform(self, X):
        X_scaled = X.copy()
        X_scaled = pd.DataFrame(self.scaler.transform(X), columns=X.columns)
        return X_scaled
    
class MaxAbsFeatureScaling(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.scaler = MaxAbsScaler()

    def fit(self, X, y=None):
        self.scaler.fit(X)
        return self

    def transform(self, X):
        X_scaled = X.copy()
        X_scaled = pd.DataFrame(self.scaler.transform(X), columns=X.columns)
        return X_scaled

### Min Max Scaling
At this stage, min-max scaling is applied to the numerical columns. We scale them because:
1. Common value range: min-max scaling puts every column in the same range, here [0, 1].
2. Protects value-sensitive algorithms: distance-based algorithms such as KNN are easily dominated by features with large values. After min-max scaling, all features contribute on a comparable scale.
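
The effect on distance-based models can be seen in a small sketch. The four rows below are hypothetical (a duration in seconds and a byte count), chosen so that the large-valued column decides the nearest neighbour before scaling but not after.

```python
import numpy as np

# Hypothetical rows: [duration in seconds, bytes transferred].
X = np.array([
    [0.10, 1000.0],
    [0.90, 1200.0],
    [0.15, 3000.0],
    [0.50, 9000.0],
])

def min_max_scale(X):
    # x' = (x - min) / (max - min), computed per column.
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

def nearest_to_first_row(X):
    # Euclidean distance from row 0 to every other row.
    distances = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(distances)) + 1

X_scaled = min_max_scale(X)
raw_neighbour = nearest_to_first_row(X)
scaled_neighbour = nearest_to_first_row(X_scaled)
print(raw_neighbour, scaled_neighbour)
```

On the raw values, row 1 is the nearest neighbour of row 0 because only the byte count matters. After scaling, row 2 becomes the nearest, since the duration column now carries comparable weight.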

### Max Absolute Scaling
At this stage, max-absolute scaling is applied to the categorical columns. We do this because:
1. It preserves sparsity: the columns it is applied to are one-hot columns, which tend to be sparse, so a technique that keeps zeros at zero is a good fit.
2. It preserves the relationships between features: each column is only divided by its maximum absolute value, with no shifting or centering, so the relative structure of the values stays intact.
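
The sparsity argument can be checked with a short sketch on a made-up array: max-absolute scaling only divides, so zeros stay zero, while min-max scaling shifts the values and can turn zeros into non-zeros.

```python
import numpy as np

# Made-up data: column 0 looks like a one-hot column, column 1 has mixed signs.
X = np.array([
    [0.0, -2.0],
    [1.0,  0.0],
    [0.0,  0.0],
    [0.0,  4.0],
])

# Max-absolute scaling: divide each column by its largest absolute value.
max_abs_scaled = X / np.abs(X).max(axis=0)

# Min-max scaling, for comparison: shift by the minimum, then divide by the range.
min_max_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

zeros_before = int((X == 0).sum())
zeros_after_max_abs = int((max_abs_scaled == 0).sum())
zeros_after_min_max = int((min_max_scaled == 0).sum())
print(zeros_before, zeros_after_max_abs, zeros_after_min_max)
```

For a pure 0/1 one-hot column the maximum absolute value is 1, so max-absolute scaling leaves the column unchanged.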


### II. Feature Encoding

In [73]:
import sklearn.preprocessing

class OneHotCategorical(BaseEstimator, TransformerMixin):
  def __init__(self):
    self.categorical_columns = []
    self.onehot = None

  def fit(self, X: pd.DataFrame, y):
    self.onehot = sklearn.preprocessing.OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    self.onehot.fit(X)
    return self
  
  def transform(self, X: pd.DataFrame):
    return pd.DataFrame(self.onehot.transform(X), columns=self.onehot.get_feature_names_out(), index=X.index)

### One Hot Encoding
At this stage, one-hot encoding is applied to the categorical columns. We do this because:
1. It converts categorical values to numeric ones: many machine learning algorithms require numerical features. One-hot encoding turns the values of a categorical column into numbers while keeping each value distinct.
2. It avoids a spurious ordinal interpretation: one-hot encoding keeps the categorical values independent of one another, so the model cannot read an ordering into them.
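
A minimal NumPy sketch of the idea follows. The category values are hypothetical. The all-zero row for an unseen value mirrors what `handle_unknown='ignore'` does in scikit-learn's `OneHotEncoder`.

```python
import numpy as np

# Hypothetical categorical column.
proto = np.array(["tcp", "udp", "tcp", "icmp"])

# Categories learned at fit time (np.unique returns them sorted).
categories = np.unique(proto)

# One column per category; a 1.0 marks the category of each row.
one_hot = (proto[:, None] == categories[None, :]).astype(float)

# A value that was not seen at fit time matches no column: all zeros.
unseen = np.array(["gre"])
unseen_one_hot = (unseen[:, None] == categories[None, :]).astype(float)

print(categories)
print(one_hot)
print(unseen_one_hot)
```

Every known row has exactly one active column, so no category is numerically "larger" than another, unlike integer label codes.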


In [None]:
%pip install imbalanced-learn

### III. Handling Imbalanced Dataset

In [75]:
from imblearn.over_sampling import SMOTE

class SMOTEImbalance(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.smote = None

    def fit(self, X: pd.DataFrame, y: pd.Series):
        # SMOTE has no separate fit step, so only the sampler is created here.
        self.smote = SMOTE(sampling_strategy='auto', random_state=42)
        return self

    def transform(self, X: pd.DataFrame, y: pd.Series):
        # Resampling changes the number of rows and needs y, so this method does
        # not follow the usual transformer signature. Call it directly on the
        # training data only, not from a scikit-learn Pipeline.
        X_resampled, y_resampled = self.smote.fit_resample(X, y)
        return pd.DataFrame(X_resampled, columns=X.columns), y_resampled

### SMOTE
In this code block we prepared a SMOTE class to handle the imbalanced target label by oversampling the underrepresented classes. Although we did not include this class in the final pipeline, it helped us tune our way to the best pipeline.
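
For intuition, the core idea behind SMOTE can be sketched in plain NumPy: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the imbalanced-learn implementation. The function name `smote_like_oversample`, its parameters, and the toy points are all assumptions made for the example.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample and one of its k neighbours for every new point.
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Move a random fraction of the way from the base towards the neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_like_oversample(X_minority, n_new=6)
print(synthetic.shape)
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region the minority class already occupies. Oversampling of this kind should be applied to the training split only.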

### IV. Data Normalization

In [76]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class StandardScaleNumerical(BaseEstimator, TransformerMixin):
  def __init__(self, with_mean=True):
    self.with_mean = with_mean
    self.scaler = None

  def fit(self, X: pd.DataFrame, y):
    self.scaler = StandardScaler(with_mean=self.with_mean)
    self.scaler.fit(X)

    return self

  def transform(self, X: pd.DataFrame):
    return pd.DataFrame(self.scaler.transform(X), columns=self.scaler.get_feature_names_out())

### Standard Scaling
At this stage, standard scaling is applied to the categorical and numerical columns. We do this because:
1. It evens out each feature's contribution to the model: standard scaling re-expresses every feature as z-scores (zero mean, unit variance), so all features contribute on a comparable scale.
2. It preserves the relationships between features: standard scaling does not map values into a fixed range, so the relative distances within each feature are kept.
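
As a small worked example with made-up numbers, the z-score of a value is $z = (x - \mu) / \sigma$, computed per column. The sketch uses the population standard deviation (NumPy's default), which is also what `StandardScaler` uses.

```python
import numpy as np

# Two made-up columns with the same shape but very different scales.
X = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
])

mean = X.mean(axis=0)
std = X.std(axis=0)        # population standard deviation (ddof=0)
X_std = (X - mean) / std   # z-score per column

print(X_std)
```

After scaling, both columns hold identical z-scores: the difference in units disappears while the relative spacing of the values within each column is kept.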

### V. Dimensionality Reduction

# 3. Compile Preprocessing Pipeline

In [78]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

numerical_pipeline = Pipeline([
    ('imputer', ImputeNumerical()),
    ('scaler', MinMaxFeatureScaling()),
])

categorical_pipeline = Pipeline([
    ('imputer', ImputeCategorical()),
    ('onehot', OneHotCategorical()),
    ('scaler', MaxAbsFeatureScaling()),
])

preprocessor = ColumnTransformer([
    ('numerical', numerical_pipeline, [*numerical_features, *binary_features]),
    ('categorical', categorical_pipeline, [feature for feature in categorical_features if (feature != 'attack_cat')]),
])

pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('filter', SelectKBest(score_func=f_classif, k=70)),
])

In [None]:
train_set = pipe.fit_transform(X_train, y_train)
val_set = pipe.transform(X_val)
print(train_set.shape)
print(val_set.shape)

# 4. Modeling and Validation


In [80]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

kf = KFold(n_splits=5, shuffle=True, random_state=42)

## A. KNN

In [None]:
# Type your code here
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(train_set, y_train)

scores = cross_val_score(knn, train_set, y_train, cv=kf, scoring='accuracy')
print(f"Cross validation scores: {scores}")
print(f"Mean accuracy: {scores.mean()}")
print(f"Standard deviation of accuracy: {scores.std()}")

y_pred = knn.predict(val_set)
f1 = f1_score(y_val, y_pred, average='macro')
print(f"F1 score: {f1}")

df = pd.DataFrame()
df['attack_cat'] = y_val
df['predicted'] = y_pred
df.head()

### Custom KNN

In [82]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.preprocessing import LabelEncoder

class KNN(BaseEstimator, ClassifierMixin):
    def __init__(self, k=5, distance_metric='euclidean', p=3):
        self.k = k
        self.distance_metric = distance_metric
        self.p = p

        self.label_encoder = LabelEncoder()

        valid_metrics = ['euclidean', 'manhattan', 'minkowski']
        if self.distance_metric.lower() not in valid_metrics:
            raise ValueError(f"Unsupported distance metric: {self.distance_metric}")

        self._X_train = None
        self._y_train = None

    def fit(self, X, y):
        self._X_train = np.asarray(X)
        self._y_train = self.label_encoder.fit_transform(y)
        return self

    def _chunked_pairwise_distances(self, X, chunk_size=500):
        n_samples = X.shape[0]
        n_train_samples = self._X_train.shape[0]
        distances = np.zeros((n_samples, n_train_samples), dtype=np.float32)

        for start in range(0, n_train_samples, chunk_size):
            end = min(start + chunk_size, n_train_samples)
            X_train_chunk = self._X_train[start:end]

            if self.distance_metric.lower() == 'euclidean':
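                # Expanded form ||x||^2 + ||t||^2 - 2 x.t. Floating-point error can
                # make the result slightly negative, so clip at 0 before the sqrt.
                dists = np.sqrt(np.maximum(
                    np.sum(X[:, np.newaxis, :] ** 2, axis=2) +
                    np.sum(X_train_chunk ** 2, axis=1) -
                    2 * np.dot(X, X_train_chunk.T),
                    0
                ))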
            elif self.distance_metric.lower() == 'manhattan':
                dists = np.sum(
                    np.abs(X[:, np.newaxis, :] - X_train_chunk), axis=2
                )
            elif self.distance_metric.lower() == 'minkowski':
                dists = np.power(
                    np.sum(
                        np.abs(X[:, np.newaxis, :] - X_train_chunk) ** self.p, axis=2
                    ),
                    1 / self.p
                )
            else:
                raise ValueError(f"Unsupported distance metric: {self.distance_metric}")

            distances[:, start:end] = dists

        return distances

    def predict(self, X):
        X = np.asarray(X)

        batch_size = 1000
        predictions = []

        for i in range(0, len(X), batch_size):
            batch = X[i:i + batch_size]

            distances = self._chunked_pairwise_distances(batch)

            k_nearest_indices = np.argpartition(distances, self.k, axis=1)[:, :self.k]

            k_nearest_labels = self._y_train[k_nearest_indices]

            batch_predictions = np.apply_along_axis(lambda x: np.bincount(x).argmax(), 1, k_nearest_labels)

            predictions.append(batch_predictions)

        encoded_predictions = np.concatenate(predictions)
        return self.label_encoder.inverse_transform(encoded_predictions)

    def score(self, X, y):
        y_pred = self.predict(X)
        return np.mean(y_pred == y)


In [None]:
knn_model = KNN(k=131, distance_metric="manhattan")
knn_model.fit(train_set, y_train)

scores = cross_val_score(knn_model, train_set, y_train, cv=kf, scoring='accuracy', n_jobs=10, verbose=1)
print(f"Cross validation scores: {scores}")
print(f"Mean accuracy: {scores.mean()}")
print(f"Standard deviation of accuracy: {scores.std()}")

y_pred = knn_model.predict(val_set)
f1 = f1_score(y_val, y_pred, average='macro')
print(f"F1 score: {f1}")

df = pd.DataFrame()
df['attack_cat'] = y_val
df['predicted'] = y_pred
df.head()

## B. Naive Bayes

In [None]:
# Type your code here
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(train_set, y_train)

scores = cross_val_score(nb, train_set, y_train, cv=kf, scoring='accuracy')
print(f"Cross validation scores: {scores}")
print(f"Mean accuracy: {scores.mean()}")
print(f"Standard deviation of accuracy: {scores.std()}")

y_pred = nb.predict(val_set)
f1 = f1_score(y_val, y_pred, average='macro')
print(f"F1 score: {f1}")

df = pd.DataFrame()
df['attack_cat'] = y_val
df['predicted'] = y_pred
df.head()

### Custom Naive Bayes

In [120]:
from sklearn.base import BaseEstimator, ClassifierMixin

class CustomNaiveBayes(BaseEstimator, ClassifierMixin):
    def __init__(self):
        self.classes = None
        self.class_priors = None
        self.class_means = None
        self.class_variances = None
    
    def fit(self, X, y):

        self.classes = np.unique(y)
        n_classes = len(self.classes)
        n_features = X.shape[1]

        self.class_means = np.zeros((n_classes, n_features))
        self.class_variances = np.zeros((n_classes, n_features))
        self.class_priors = np.zeros(n_classes)

        for i, c in enumerate(self.classes):
            X_c = X[y == c]
            self.class_priors[i] = X_c.shape[0] / X.shape[0]
            self.class_means[i, :] = X_c.mean(axis=0)
            self.class_variances[i, :] = X_c.var(axis=0) + 1e-7
        
        return self
    
    def _gaussian_probability(self, x, mean, variance):
        exponent = np.exp(-((x - mean)**2 / (2 * variance)))
        return (1 / (np.sqrt(2 * np.pi * variance))) * exponent
    
    def predict_proba(self, X):
        n_samples = X.shape[0]
        n_classes = len(self.classes)
        
        probabilities = np.zeros((n_samples, n_classes))
        for i in range(n_classes):
            prior = np.log(self.class_priors[i])
            conditional = np.sum(np.log(
                self._gaussian_probability(
                    X, 
                    self.class_means[i, :], 
                    self.class_variances[i, :]
                ) + 1e-10 
            ), axis=1)
            
            
            probabilities[:, i] = prior + conditional
        
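        # Subtract the row-wise maximum before exponentiating (log-sum-exp trick)
        # so that very negative log-probabilities do not all underflow to zero.
        probabilities -= probabilities.max(axis=1, keepdims=True)
        probabilities = np.exp(probabilities)
        probabilities /= probabilities.sum(axis=1, keepdims=True)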
        
        return probabilities
    
    def predict(self, X):
        probabilities = self.predict_proba(X)
        return self.classes[np.argmax(probabilities, axis=1)]
    
    def score(self, X, y):
        y_pred = self.predict(X)
        return np.mean(y_pred == y)


In [None]:
model = CustomNaiveBayes()
model.fit(train_set, y_train)

scores = cross_val_score(model, train_set, y_train, cv=kf, scoring='accuracy')
print(f"Cross validation scores: {scores}")
print(f"Mean accuracy: {scores.mean()}")
print(f"Standard deviation of accuracy: {scores.std()}")

y_pred = model.predict(val_set)

f1 = f1_score(y_val, y_pred, average='macro')
print(f"F1 score: {f1}")

df = pd.DataFrame({'attack_cat': y_val, 'predicted': y_pred})
df.head()

## C. ID3

In [None]:
# Type your code here
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

dt = DecisionTreeClassifier(criterion='entropy', random_state=42)
dt.fit(train_set, y_train)

scores = cross_val_score(dt, train_set, y_train, cv=kf, scoring='accuracy')
print(f"Cross validation scores: {scores}")
print(f"Mean accuracy: {scores.mean()}")
print(f"Standard deviation of accuracy: {scores.std()}")

y_pred = dt.predict(val_set)
f1 = f1_score(y_val, y_pred, average='macro')
print(f"F1 score: {f1}")

df = pd.DataFrame()
df['attack_cat'] = y_val
df['predicted'] = y_pred
df.head()

In [123]:
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.preprocessing import LabelEncoder

class ID3(BaseEstimator, ClassifierMixin):
    def __init__(self, max_depth=5, n_bins=10):
        self.max_depth = max_depth
        self.n_bins = n_bins
    
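    def _discretize(self, X, fit=False):
        # Equal-width binning. The bin edges are learned from the training data
        # and reused at prediction time, so both use the same bins.
        if fit:
            self.bin_edges_ = [
                np.linspace(X[:, i].min(), X[:, i].max(), self.n_bins + 1)[1:-1]
                for i in range(X.shape[1])
            ]
        X_binned = np.zeros_like(X, dtype=int)
        for i in range(X.shape[1]):
            X_binned[:, i] = np.digitize(X[:, i], self.bin_edges_[i])
        return X_binned
    
    def fit(self, X, y):
        X = np.asarray(X, dtype=np.float64)
        le = LabelEncoder()
        y = le.fit_transform(y)
        X, y = check_X_y(X, y, dtype=np.float64)
        
        X_binned = self._discretize(X, fit=True)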
        
        self.tree_ = self._build_tree(X_binned, y)
        
        self.n_features_in_ = X.shape[1]
        self.classes_ = le.classes_
        self.label_encoder_ = le
        
        return self
    
    def _entropy(self, y):
        _, counts = np.unique(y, return_counts=True)
        probabilities = counts / len(y)
        return -np.sum(probabilities * np.log2(probabilities + 1e-10))
    
    def _information_gain(self, X, y, feature_idx):
        total_entropy = self._entropy(y)
        
        unique_values = np.unique(X[:, feature_idx])
        
        weighted_entropies = np.array([
            len(y[X[:, feature_idx] == value]) / len(y) * 
            self._entropy(y[X[:, feature_idx] == value])
            for value in unique_values
        ])
        
        return total_entropy - np.sum(weighted_entropies)
    
    def _build_tree(self, X, y, depth=0):
        unique_classes = np.unique(y)
        
        if (len(unique_classes) == 1 or 
            depth == self.max_depth or 
            X.shape[1] == 0):
            return np.argmax(np.bincount(y)) if len(y) > 0 else None
        
        gains = np.array([
            self._information_gain(X, y, i) 
            for i in range(X.shape[1])
        ])
        
        best_feature = np.argmax(gains)
        max_gain = gains[best_feature]
        
        if max_gain == 0:
            return np.argmax(np.bincount(y))
        
        node = {'feature': best_feature}
        node['children'] = {}
        
        unique_values = np.unique(X[:, best_feature])
        
        for value in unique_values:
            mask = X[:, best_feature] == value
            
            X_subset = np.delete(X[mask], best_feature, axis=1)
            y_subset = y[mask]
            
            if len(y_subset) == 0:
                continue
            
            subtree = self._build_tree(
                X_subset,
                y_subset,
                depth=depth+1
            )
            
            node['children'][value] = subtree
        
        return node
    
    def predict(self, X):
        X = check_array(X, dtype=np.float64)
        check_is_fitted(self, ['tree_', 'classes_'])
        
        X_binned = self._discretize(X)
        
        predictions = np.array([self._predict_single(x) for x in X_binned])
        return self.label_encoder_.inverse_transform(predictions)
    
    def _predict_single(self, x):
        node = self.tree_
        
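        while isinstance(node, dict):
            feature = node['feature']
            value = x[feature]
            
            # Bin values not seen at this node during training fall back to the first child.
            node = node['children'].get(value, list(node['children'].values())[0])
            # Drop the used feature so indices match the column-reduced subtrees.
            x = np.delete(x, feature)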
        
        return node
    
    def predict_proba(self, X):
        X = check_array(X, dtype=np.float64)
        check_is_fitted(self, ['tree_', 'classes_'])
        
        predictions = self.predict(X)
        
        proba = np.zeros((X.shape[0], len(self.classes_)))
        proba[np.arange(len(predictions)), 
              self.label_encoder_.transform(predictions)] = 1
        
        return proba
    
    def score(self, X, y):
        predictions = self.predict(X)
        return np.mean(predictions == y)

In [None]:
id3 = ID3(max_depth=70, n_bins=66)
id3.fit(train_set, y_train)

scores = cross_val_score(id3, train_set, y_train, cv=kf, scoring='accuracy', n_jobs=10, verbose=1)
print(f"Cross validation scores: {scores}")
print(f"Mean accuracy: {scores.mean()}")
print(f"Standard deviation of accuracy: {scores.std()}")

y_pred = id3.predict(val_set)
f1 = f1_score(y_val, y_pred, average='macro')
print(f"F1 score: {f1}")

df = pd.DataFrame()
df['attack_cat'] = y_val
df['predicted'] = y_pred
df.head()
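
The split criterion used by the `ID3` class above can be illustrated on a toy example. The sketch below is a standalone re-statement of entropy and information gain, not the class's own code: the helper names `entropy` and `information_gain` and the toy arrays are made up for illustration, and it omits the `1e-10` smoothing term used in `_entropy`.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    """Entropy of y minus the weighted entropy after splitting on each value of x."""
    total = entropy(y)
    weighted = 0.0
    for value in np.unique(x):
        mask = x == value
        # mask.mean() is the fraction of rows that fall into this bin
        weighted += mask.mean() * entropy(y[mask])
    return total - weighted

# Toy data (made up for illustration): two binned features, one binary label.
y = np.array([0, 0, 1, 1])
x_perfect = np.array([0, 0, 1, 1])  # separates the two classes exactly
x_useless = np.array([0, 1, 0, 1])  # every bin holds a 50/50 class mix

gain_perfect = information_gain(x_perfect, y)
gain_useless = information_gain(x_useless, y)
print(gain_perfect, gain_useless)
```

A feature that separates the classes perfectly recovers the full label entropy as gain, while a feature whose bins all mirror the overall class mix yields zero gain, which is the case where `_build_tree` stops splitting and returns the majority class.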

## D. Improvements (Optional)

In [125]:
# Type your code here

## E. Submission

In [None]:
# Read all test CSV files
additional_features_df = pd.read_csv('../dataset/test/additional_features_test.csv')
basic_features_df = pd.read_csv('../dataset/test/basic_features_test.csv')
content_features_df = pd.read_csv('../dataset/test/content_features_test.csv')
flow_features_df = pd.read_csv('../dataset/test/flow_features_test.csv')
time_features_df = pd.read_csv('../dataset/test/time_features_test.csv')

# Merge all test feature tables into a single DataFrame on "id"
test_data = pd.merge(basic_features_df, additional_features_df, on="id")
test_data = pd.merge(test_data, content_features_df, on="id")
test_data = pd.merge(test_data, flow_features_df, on="id")
test_data = pd.merge(test_data, time_features_df, on="id")

test_data.head()

In [127]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

numerical_pipeline = Pipeline([
    ('imputer', ImputeNumerical()),
    ('scaler', MinMaxFeatureScaling()),
])

categorical_pipeline = Pipeline([
    ('imputer', ImputeCategorical()),
    ('onehot', OneHotCategorical()),
    ('scaler', MaxAbsFeatureScaling()),
])

preprocessor = ColumnTransformer([
    ('numerical', numerical_pipeline, [*numerical_features, *binary_features]),
    ('categorical', categorical_pipeline, [feature for feature in categorical_features if (feature != 'attack_cat')]),
])

pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('filter', SelectKBest(score_func=f_classif, k=70)),
])

In [None]:
train_set = pipe.fit_transform(X, y)
test_set = pipe.transform(test_data)
print(train_set.shape)
print(test_set.shape)

In [None]:
knn_model.fit(train_set, y)
with open('../model/knn.pkl', 'wb') as f:
    pickle.dump(knn_model, f)

y_pred = knn_model.predict(test_set)

df = pd.DataFrame()
df['id'] = test_data['id']
df['attack_cat'] = y_pred
df.to_csv('../submission/knn.csv', index=False)
df.head()

In [None]:
model.fit(train_set, y)
with open('../model/nb.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)

y_pred = model.predict(test_set)

df = pd.DataFrame()
df['id'] = test_data['id']
df['attack_cat'] = y_pred
df.to_csv('../submission/nb.csv', index=False)
df.head()

In [None]:
id3.fit(train_set, y)
with open('../model/id3.pkl', 'wb') as model_file:
    pickle.dump(id3, model_file)

y_pred_id3 = id3.predict(test_set)

df = pd.DataFrame()
df['id'] = test_data['id']
df['attack_cat'] = y_pred_id3
df.to_csv('../submission/id3_scratch_submission.csv', index=False)
df.head()

# 6. Error Analysis

### KNN
The comparison above shows that the from-scratch model performs worse than the library model.
We also found that this model is very sensitive to scaling: applying min-max scaling to the numerical features and max-absolute scaling to the categorical features improved the model significantly, as seen in the higher validation F1 score.
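
One way to see why a distance-based model reacts so strongly to scaling is the small sketch below. It assumes Euclidean distance and uses made-up rows; `min_max_scale` and `nearest_index` are hypothetical helpers written for this illustration, not the notebook's `MinMaxFeatureScaling` or KNN classes.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero
    return (X - col_min) / col_range

def nearest_index(X, query_idx):
    """Index of the row closest (Euclidean) to X[query_idx], excluding itself."""
    dists = np.linalg.norm(X - X[query_idx], axis=1)
    dists[query_idx] = np.inf
    return int(np.argmin(dists))

# Made-up rows: column 0 spans thousands, column 1 spans [0, 1].
X = np.array([
    [1000.0, 0.0],  # query row
    [1100.0, 0.0],  # identical in column 1, moderately different in column 0
    [1010.0, 1.0],  # close in column 0, opposite end of column 1
    [5000.0, 0.5],
])

raw_neighbor = nearest_index(X, 0)
scaled_neighbor = nearest_index(min_max_scale(X), 0)
print(raw_neighbor, scaled_neighbor)
```

On the raw values the wide-range column decides the nearest neighbor almost alone, so row 2 wins even though it sits at the opposite end of column 1. After scaling, both columns contribute comparably and row 1 becomes the nearest neighbor.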

### Naive Bayes
The comparison above shows that the F1 score of the from-scratch implementation is lower than that of the library implementation. This is likely because the library implementation is already optimized with more advanced techniques.
From our experiments, we also found that missing values are better handled by imputation than by dropping rows. A large share of rows, roughly ⅞ of the total data, contain missing values, so dropping them would shrink the training data significantly.
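
The effect of dropping versus imputing can be sketched on a small made-up frame in which seven of eight rows have a missing value. The column names and values are hypothetical, and median/mode imputation is an assumption chosen for illustration; it may differ from what `ImputeNumerical` and `ImputeCategorical` do.

```python
import numpy as np
import pandas as pd

# Made-up frame: every row except the first has exactly one missing value.
df = pd.DataFrame({
    "dur":    [0.1, np.nan, 0.3, 0.4, np.nan, 0.6, 0.7, np.nan],
    "proto":  ["tcp", "udp", None, "tcp", "tcp", None, "udp", "tcp"],
    "sbytes": [100, 200, 300, np.nan, 500, 600, np.nan, 800],
})

# Option 1: drop every row that has any missing value.
dropped = df.dropna()

# Option 2 (assumed strategy): median for numeric columns, mode for the rest.
imputed = df.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        imputed[col] = imputed[col].fillna(imputed[col].median())
    else:
        imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(len(df), len(dropped), len(imputed))
```

Dropping keeps a single row out of eight, while imputation keeps all eight rows available for training.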

### ID3
As with Naive Bayes, the from-scratch ID3 model performs worse than the library model.
We also found that imputation improves the model's performance. By imputing instead of dropping rows, the model receives more training data, so training better reflects the original data.
In addition, feature scaling improves the model's performance, specifically min-max scaling on the numerical columns. With scaling, the features share a more uniform representation, whereas without it each feature keeps its own range.
