# IF3170 Artificial Intelligence | Tugas Besar 2

This notebook serves as a template for the assignment. Please create a copy of this notebook to complete your work. You can add more code blocks, markdown blocks, or new sections if needed.


Group Number: 04

Group Members:
- Muhamad Rafli Rasyiidin (13522088)
- Julian Caleb Simandjuntak (13522099)
- Christopher Brian (13522106)
- Indraswara Galih Jayanegara (13522119)

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.model_selection import train_test_split, KFold
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.metrics import accuracy_score

## Import Dataset

In [None]:
# Example of reading a csv file from a gdrive link

# Take the file id from the gdrive file url
# https://drive.google.com/file/d/1ZUtiaty9RPXhpz5F2Sy3dFPHF4YIt5iU/view?usp=sharing => The file id is 1ZUtiaty9RPXhpz5F2Sy3dFPHF4YIt5iU
# and then put it in this format:
# https://drive.google.com/uc?id={file_id}
# Don't forget to change the access to public

# Always display all columns in the output
pd.set_option('display.max_columns', None)

# Import the training CSV files
additional_features_train = pd.read_csv('https://drive.google.com/uc?id=1nC3zLlKlUdDZCFqQCa5mdtJLc4h-kXK9')
basic_features_train = pd.read_csv('https://drive.google.com/uc?id=1kyqx2WrYUHV0P74SWx_GUTpmmVTG1pw1')
flow_features_train = pd.read_csv('https://drive.google.com/uc?id=1k4ovJ8w8ZHtBV_XOp1WFPVolheWoHSmh')
content_features_train = pd.read_csv('https://drive.google.com/uc?id=1XP-QOpMFnFPjSIsLFA6xtovVcV1Viyft')
time_features_train = pd.read_csv('https://drive.google.com/uc?id=1QnNviNpoKggFeuzFQAWjU-FlfobRDCDm')
labels_train = pd.read_csv('https://drive.google.com/uc?id=14hflzUn7iYPJCwsOwDKjGEp7ZP1HtzY8')

# Combine all dataframes into a single dataframe
df_combined = pd.concat([additional_features_train, basic_features_train, flow_features_train, content_features_train, time_features_train, labels_train], axis = 1)

# Drop the duplicated id columns (id is present in every file)
df_combined = df_combined.loc[:, ~df_combined.columns.duplicated()]

# Exploratory Data Analysis (Optional)

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves examining and visualizing data sets to uncover patterns, trends, anomalies, and insights. It is the first step before applying more advanced statistical and machine learning techniques. EDA helps you to gain a deep understanding of the data you are working with, allowing you to make informed decisions and formulate hypotheses for further analysis.

## Data Understanding

In [None]:
# Data size and data types
df_combined.info()

In [None]:
# Basic statistics for each non-categorical feature
continuous_features = df_combined.select_dtypes(include=['number']).drop(columns=['id'])
continuous_features.describe()

In [None]:
# Number of unique values for each categorical feature
categorical_features = df_combined.select_dtypes(include=['object'])
unique_values_counts = categorical_features.nunique()
print(unique_values_counts)

In [None]:
# Missing values
null_values = df_combined.isna().sum()
print(null_values)

In [None]:
# Outliers in each non-categorical feature
# Visualized with boxplots

plt.figure(figsize=(15, 40))

for i, column in enumerate(continuous_features.columns, 1):
    plt.subplot(10, 4, i) 
    sns.boxplot(y=continuous_features[column])
    plt.title(column)

plt.tight_layout() 
plt.show()

In [None]:
# Correlation between non-categorical features
corr_matrix = continuous_features.corr()
plt.figure(figsize=(24, 16))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation of Non-Categorical Features')
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
# Association between categorical features using Cramér's V
def cramers_v(x, y):
    contingency = pd.crosstab(x, y)
    chi2, p, dof, expected = stats.chi2_contingency(contingency)
    n = contingency.sum().sum()
    return np.sqrt(chi2 / (n * (min(contingency.shape) - 1)))

cramers_matrix = pd.DataFrame(index=categorical_features.columns, columns=categorical_features.columns)

for col1 in categorical_features.columns:
    for col2 in categorical_features.columns:
        cramers_matrix.loc[col1, col2] = cramers_v(df_combined[col1], df_combined[col2])

cramers_matrix = cramers_matrix.astype(float)

plt.figure(figsize=(12, 8))
sns.heatmap(cramers_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title("Cramér's V for Categorical Features")
plt.show()

In [None]:
# Visualize the distribution of non-categorical features
fig, ax = plt.subplots(13, 3, figsize=(15, 15))
fig.suptitle('Distribution of Non-Categorical Features', fontsize=19)

for idx in range(39):
    i, j = divmod(idx, 3)
    ax[i, j].hist(continuous_features[continuous_features.columns[idx]])
    ax[i, j].set_title(continuous_features.columns[idx])

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

In [None]:
# Visualize the distribution of categorical features (except proto)
fig, ax = plt.subplots(1, 3, figsize=(18, 6))
fig.suptitle('Distribution of Categorical Features (excluding proto)', fontsize=19)

for idx in range(3):
    feature = categorical_features.drop('proto', axis=1).columns[idx]
    sns.countplot(data=categorical_features, x=feature, ax=ax[idx], order=categorical_features[feature].value_counts().index)
    ax[idx].set_title(feature)
    ax[idx].tick_params(axis='x', rotation=45)

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

In [None]:
# Visualize the distribution of the categorical feature proto
plt.figure(figsize=(30, 6))
sns.countplot(data=categorical_features, x='proto', order=categorical_features['proto'].value_counts().index)
plt.title('Distribution of Protocols (proto)', fontsize=19)
plt.xticks(rotation=45, ha='right')
plt.xlabel('Protocol')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

# 1. Split Training Set and Validation Set

Splitting off a validation set gives an early diagnostic of the performance of the model we train. This is done before the preprocessing steps to **avoid data leakage between the sets**. If you want to use k-fold cross-validation, split the data later and do the cleaning and preprocessing separately for each split.

Note: For training, you should use the data contained in the `train` folder given by the TA. The `test` data is only used for kaggle submission.

In [None]:
# Split training set and validation set here, store into variables train_set and val_set.
# Remember to also keep the original training set before splitting. This will come important later.
# train_set, val_set = ...

# Hold-out (train-test) split, kept for reference only;
# the K-fold split in the next cell is the one actually used

# train_set, val_set = train_test_split(df_combined, test_size=0.2, random_state=42) 

# with pd.option_context('display.max_columns', None):
#     print(train_set.info())


In [None]:
# Split using K-fold cross-validation
# X: predictor features, y: target feature (attack_cat)
X = df_combined.drop(columns=['attack_cat'])
y = df_combined['attack_cat']

X_train_array = []
X_val_array = []
y_train_array = []
y_val_array = []

kf = KFold(n_splits=5, shuffle=True, random_state=135)
for train_index, val_index in kf.split(X):
    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]
    
    X_train_array.append(X_train)
    X_val_array.append(X_val)
    y_train_array.append(y_train)
    y_val_array.append(y_val)
    
# X_train_array[0].head()
len(X_train_array)

# 2. Data Cleaning and Preprocessing

This step is the first thing to be done once a Data Scientist has grasped a general knowledge of the data. Raw data is **seldom ready for training**, so steps need to be taken to clean and format the data for the Machine Learning model to interpret.

By performing data cleaning and preprocessing, you ensure that your dataset is ready for model training, leading to more accurate and reliable machine learning results. These steps are essential for transforming raw data into a format that machine learning algorithms can effectively learn from and make predictions.

We will give some common methods for you to try, but you only have to **implement at least one method for each process**. For each step that you do, **please explain why you performed that process. Write it in a markdown cell under the code cell you wrote.**

## A. Data Cleaning

**Data cleaning** is the crucial first step in preparing your dataset for machine learning. Raw data collected from various sources is often messy and may contain errors, missing values, and inconsistencies. Data cleaning involves the following steps:

1. **Handling Missing Data:** Identify and address missing values in the dataset. This can include imputing missing values, removing rows or columns with excessive missing data, or using more advanced techniques like interpolation.

2. **Dealing with Outliers:** Identify and handle outliers, which are data points significantly different from the rest of the dataset. Outliers can be removed or transformed to improve model performance.

3. **Data Validation:** Check for data integrity and consistency. Ensure that data types are correct, categorical variables have consistent labels, and numerical values fall within expected ranges.

4. **Removing Duplicates:** Identify and remove duplicate rows, as they can skew the model's training process and evaluation metrics.

5. **Feature Engineering**: Create new features or modify existing ones to extract relevant information. This step can involve scaling, normalizing, or encoding features for better model interpretability.

### I. Handling Missing Data

Missing data can adversely affect the performance and accuracy of machine learning models. There are several strategies to handle missing data in machine learning:

1. **Data Imputation:**

    a. **Mean, Median, or Mode Imputation:** For numerical features, you can replace missing values with the mean, median, or mode of the non-missing values in the same feature. This method is simple and often effective when data is missing at random.

    b. **Constant Value Imputation:** You can replace missing values with a predefined constant value (e.g., 0) if it makes sense for your dataset and problem.

    c. **Imputation Using Predictive Models:** More advanced techniques involve using predictive models to estimate missing values. For example, you can train a regression model to predict missing numerical values or a classification model to predict missing categorical values.

2. **Deletion of Missing Data:**

    a. **Listwise Deletion:** In cases where the amount of missing data is relatively small, you can simply remove rows with missing values from your dataset. However, this approach can lead to a loss of valuable information.

    b. **Column (Feature) Deletion:** If a feature has a large number of missing values and is not critical for your analysis, you can consider removing that feature altogether.

3. **Domain-Specific Strategies:**

    a. **Domain Knowledge:** In some cases, domain knowledge can guide the imputation process. For example, if you know that missing values are related to a specific condition, you can impute them accordingly.

4. **Imputation Libraries:**

    a. **Scikit-Learn:** Scikit-Learn provides a `SimpleImputer` class that can handle basic imputation strategies like mean, median, and mode imputation.

    b. **Fancyimpute:** Fancyimpute is a Python library that offers more advanced imputation techniques, including matrix factorization, k-nearest neighbors, and deep learning-based methods.

The choice of imputation method should be guided by the nature of your data, the amount of missing data, the problem you are trying to solve, and the assumptions you are willing to make.
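
As a rough illustration of median imputation without leakage, the sketch below uses a small made-up array (the values and the names `train`, `val`, and `impute` are illustrative assumptions, not part of the assignment data). The median is computed from the training rows only and then reused for the validation rows.

```python
import numpy as np

# Hypothetical toy data: rows are samples, columns are features, np.nan marks a missing value
train = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [np.nan, 30.0],
                  [100.0, 20.0]])
val = np.array([[np.nan, 5.0],
                [3.0, np.nan]])

# Learn the per-column median from the training rows only (NaNs are ignored)
train_medians = np.nanmedian(train, axis=0)

def impute(data, fill_values):
    """Return a copy of data with NaNs replaced by the per-column fill values."""
    filled = data.copy()
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = fill_values[cols]
    return filled

train_imputed = impute(train, train_medians)
val_imputed = impute(val, train_medians)  # validation reuses the training medians
print(train_medians)
print(val_imputed)
```

The median is used here rather than the mean because a single extreme value (100.0 in the first column) would pull the mean far away from the typical values.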

In [None]:
# Number of missing values
# print(train_set.isna().sum())

In [None]:
# Cleaning is done with SimpleImputer: median for numerical data and most frequent for categorical data

# # Train-test split version
# # Select everything except object columns; these are the numerical features
# train_set_num = train_set.select_dtypes(include=['number']).drop(columns=['id'])
# imputer = SimpleImputer(strategy='median')
# imputer.fit(train_set_num)
# train_set_num_imputed = imputer.transform(train_set_num)
# train_set_num_imputed = pd.DataFrame(train_set_num_imputed, columns=train_set_num.columns)


# # Select the categorical (object) columns
# train_set_cat = train_set.select_dtypes(include=['object'])
# print(train_set_cat.isna().sum())
# imputer = SimpleImputer(strategy='most_frequent')
# imputer.fit(train_set_cat)
# train_set_cat_imputed = imputer.transform(train_set_cat)
# train_set_cat_imputed = pd.DataFrame(train_set_cat_imputed, columns=train_set_cat.columns)
# # print("After update: ")
# # print(train_set_cat_imputed.isna().sum())

# K-fold version
def NumericalImputer(df, fit_df=None):
    # Median-impute the numerical columns of df.
    # The medians are learned from fit_df (the training fold); defaults to df itself.
    fit_df = df if fit_df is None else fit_df
    numerical_cols = df.select_dtypes(include=['number']).columns
    categorical_cols = df.select_dtypes(exclude=['number']).columns
    df_categorical = df[categorical_cols]

    imputer = SimpleImputer(strategy='median')
    imputer.fit(fit_df[numerical_cols])
    df_numerical = pd.DataFrame(
        imputer.transform(df[numerical_cols]),
        columns=numerical_cols,
        index=df.index
    )

    return pd.concat([df_numerical, df_categorical], axis=1)

def CategoricalImputer(df, fit_df=None):
    # Most-frequent-impute the categorical columns of df.
    # The modes are learned from fit_df (the training fold); defaults to df itself.
    fit_df = df if fit_df is None else fit_df
    numerical_cols = df.select_dtypes(include=['number']).columns
    categorical_cols = df.select_dtypes(exclude=['number']).columns
    df_numerical = df[numerical_cols]

    imputer = SimpleImputer(strategy='most_frequent')
    imputer.fit(fit_df[categorical_cols])
    df_categorical = pd.DataFrame(
        imputer.transform(df[categorical_cols]),
        columns=categorical_cols,
        index=df.index
    )

    return pd.concat([df_numerical, df_categorical], axis=1)


for i in range(len(X_train_array)):
    raw_train = X_train_array[i]

    # Validation folds are imputed with statistics learned from the training fold only
    X_val_array[i] = NumericalImputer(X_val_array[i], fit_df=raw_train)
    X_val_array[i] = CategoricalImputer(X_val_array[i], fit_df=raw_train)

    X_train_array[i] = NumericalImputer(raw_train)
    X_train_array[i] = CategoricalImputer(X_train_array[i])

# X_train_array[0].head()
print(X_val_array[0].isna().sum())

### II. Dealing with Outliers

Outliers are data points that significantly differ from the majority of the data. They can be unusually high or low values that do not fit the pattern of the rest of the dataset. Outliers can significantly impact model performance, so it is important to handle them properly.

Some methods to handle outliers:
1. **Imputation**: Replace with mean, median, or a boundary value.
2. **Clipping**: Cap values to upper and lower limits.
3. **Transformation**: Use log, square root, or power transformations to reduce their influence.
4. **Model-Based**: Use algorithms robust to outliers (e.g., tree-based models, Huber regression).
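
The clipping option can be sketched as follows. This is only an illustration on made-up values; the 1.5 × IQR rule (Tukey's fences) is one common choice of bounds, assumed here rather than prescribed by the assignment, and the bounds are learned from the training values only.

```python
import numpy as np

# Hypothetical feature values; 500.0 and -300.0 are obvious outliers
train_values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 500.0])
val_values = np.array([9.0, 14.0, -300.0])

# Learn the clipping bounds from the training values: 1.5 * IQR beyond the quartiles
q1, q3 = np.percentile(train_values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap both sets with the same bounds (nothing is refitted on the validation values)
train_clipped = np.clip(train_values, lower, upper)
val_clipped = np.clip(val_values, lower, upper)
print(lower, upper)
print(train_clipped, val_clipped)
```

Clipping keeps every row, which matters when the rows carrying extreme values are also the rare classes the model needs to learn.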

In [None]:
# Write your code here

### III. Remove Duplicates
Handling duplicate values is crucial because they can compromise data integrity, leading to inaccurate analysis and insights. Duplicate entries can bias machine learning models, causing overfitting and reducing their ability to generalize to new data. They also inflate the dataset size unnecessarily, increasing computational costs and processing times. Additionally, duplicates can distort statistical measures and lead to inconsistencies, ultimately affecting the reliability of data-driven decisions and reporting. Ensuring data quality by removing duplicates is essential for accurate, efficient, and consistent analysis.

In [None]:
# Duplicate rows are dropped to avoid biasing the model.
# Only the training folds are deduplicated.

def RemoveDuplicate(df):
    return df.drop_duplicates()

for i in range(len(X_train_array)):
    X_train_array[i] = RemoveDuplicate(X_train_array[i])
    # Keep the labels aligned with the rows that survived deduplication
    y_train_array[i] = y_train_array[i].loc[X_train_array[i].index]

print(X_train_array[0].isna().sum())
X_train_array[0].head()


In [None]:
X_train_array[0]

### IV. Feature Engineering

**Feature engineering** involves creating new features (input variables) or transforming existing ones to improve the performance of machine learning models. Feature engineering aims to enhance the model's ability to learn patterns and make accurate predictions from the data. It's often said that "good features make good models."

1. **Feature Selection:** Feature engineering can involve selecting the most relevant and informative features from the dataset. Removing irrelevant or redundant features not only simplifies the model but also reduces the risk of overfitting.

2. **Creating New Features:** Sometimes, the existing features may not capture the underlying patterns effectively. In such cases, engineers create new features that provide additional information. For example:
   
   - **Polynomial Features:** Engineers may create new features by taking the square, cube, or other higher-order terms of existing numerical features. This can help capture nonlinear relationships.
   
   - **Interaction Features:** Interaction features are created by combining two or more existing features. For example, if you have features "length" and "width," you can create an "area" feature by multiplying them.

3. **Binning or Discretization:** Continuous numerical features can be divided into bins or categories. For instance, age values can be grouped into bins like "child," "adult," and "senior."

4. **Domain-Specific Feature Engineering:** Depending on the domain and problem, engineers may create domain-specific features. For example, in fraud detection, features related to transaction history and user behavior may be engineered to identify anomalies.

Feature engineering is both a creative and iterative process. It requires a deep understanding of the data, domain knowledge, and experimentation to determine which features will enhance the model's predictive power.
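
The interaction, polynomial, and binning ideas above can be sketched on made-up columns. The names `length`, `width`, and `age` follow the examples in the text; the cut-offs 18 and 65 are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical columns
length = np.array([2.0, 3.0, 4.0])
width = np.array([1.0, 5.0, 2.0])
age = np.array([8, 35, 70])

# Interaction feature: combine two existing features
area = length * width

# Polynomial feature: a higher-order term of an existing feature
length_squared = length ** 2

# Binning: map a continuous feature to ordered categories
bin_edges = [18, 65]  # assumed cut-offs: below 18, 18 to 64, 65 and above
bin_labels = np.array(["child", "adult", "senior"])
age_group = bin_labels[np.digitize(age, bin_edges)]

print(area, length_squared, age_group)
```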

In [None]:
# Write your code here

## B. Data Preprocessing

**Data preprocessing** is a broader step that encompasses both data cleaning and additional transformations to make the data suitable for machine learning algorithms. Its primary goals are:

1. **Feature Scaling:** Ensure that numerical features have similar scales. Common techniques include Min-Max scaling (scaling to a specific range) or standardization (mean-centered, unit variance).

2. **Encoding Categorical Variables:** Machine learning models typically work with numerical data, so categorical variables need to be encoded. This can be done using one-hot encoding, label encoding, or more advanced methods like target encoding.

3. **Handling Imbalanced Classes:** If dealing with imbalanced classes in a binary classification task, apply techniques such as oversampling, undersampling, or using different evaluation metrics to address class imbalance.

4. **Dimensionality Reduction:** Reduce the number of features using techniques like Principal Component Analysis (PCA) or feature selection to simplify the model and potentially improve its performance.

5. **Normalization:** Normalize data to achieve a standard distribution. This is particularly important for algorithms that assume normally distributed data.

### Notes on Preprocessing processes

It is advised to create functions or classes that have the same or similar types of inputs and outputs, so you can add, remove, or reorder the processes easily. You can implement the functions or classes yourself or use the `sklearn` library. To create a new preprocessing component in `sklearn`, implement a class that includes:
1. Inheritance to `BaseEstimator` and `TransformerMixin`
2. The method `fit`
3. The method `transform`

In [None]:
# Example

# from sklearn.base import BaseEstimator, TransformerMixin

# class FeatureEncoder(BaseEstimator, TransformerMixin):

#     def fit(self, X, y=None):

#         # Fit the encoder here

#         return self

#     def transform(self, X):
#         X_encoded = X.copy()

#         # Encode the categorical variables here

#         return X_encoded

### I. Feature Scaling

**Feature scaling** is a preprocessing technique used in machine learning to standardize the range of independent variables or features of data. The primary goal of feature scaling is to ensure that all features contribute equally to the training process and that machine learning algorithms can work effectively with the data.

Here are the main reasons why feature scaling is important:

1. **Algorithm Sensitivity:** Many machine learning algorithms are sensitive to the scale of input features. If the scales of features are significantly different, some algorithms may perform poorly or take much longer to converge.

2. **Distance-Based Algorithms:** Algorithms that rely on distances or similarities between data points, such as k-nearest neighbors (KNN) and support vector machines (SVM), can be influenced by feature scales. Features with larger scales may dominate the distance calculations.

3. **Regularization:** Regularization techniques, like L1 (Lasso) and L2 (Ridge) regularization, add penalty terms based on feature coefficients. Scaling ensures that all features are treated equally in the regularization process.

Common methods for feature scaling include:

1. **Min-Max Scaling (Normalization):** This method scales features to a specific range, typically [0, 1]. It's done using the following formula:

   $$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

   - Here, $X$ is the original feature value, $X_{min}$ is the minimum value of the feature, and $X_{max}$ is the maximum value of the feature.  
<br />
<br />
2. **Standardization (Z-score Scaling):** This method scales features to have a mean (average) of 0 and a standard deviation of 1. It's done using the following formula:

   $$X' = \frac{X - \mu}{\sigma}$$

   - $X$ is the original feature value, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation of the feature.  
<br />
<br />
3. **Robust Scaling:** Robust scaling centers each feature on its median and scales it by the interquartile range (IQR), which makes it less affected by outliers. It's calculated as:

   $$X' = \frac{X - \mathrm{median}(X)}{Q3 - Q1}$$

   - $X$ is the original feature value, $\mathrm{median}(X)$ is the median of the feature, $Q1$ is the first quartile (25th percentile), and $Q3$ is the third quartile (75th percentile) of the feature.  
<br />
<br />
4. **Log Transformation:** In cases where data is highly skewed or has a heavy-tailed distribution, taking the logarithm of the feature values can help stabilize the variance and improve scaling.

The choice of scaling method depends on the characteristics of your data and the requirements of your machine learning algorithm. **Min-max scaling and standardization are the most commonly used techniques and work well for many datasets.**

The scaler should be fitted on the training set only, and the fitted parameters then applied to both the training set and the validation or test set. Fitting on validation or test data leaks information from it into training. Additionally, **some algorithms may not require feature scaling, particularly tree-based models.**
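
A minimal sketch of the first three scaling methods, written out with NumPy on a made-up feature. The data and the function names are illustrative assumptions. The robust variant here centers on the median, as scikit-learn's `RobustScaler` does. All parameters are computed from `train` and reused on `val`.

```python
import numpy as np

# Hypothetical single feature
train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
val = np.array([0.0, 6.0])

# Parameters are learned from the training values only
x_min, x_max = train.min(), train.max()
mu, sigma = train.mean(), train.std()
median = np.median(train)
q1, q3 = np.percentile(train, [25, 75])

def min_max(x):
    """Min-max scaling using the training minimum and maximum."""
    return (x - x_min) / (x_max - x_min)

def standardize(x):
    """Z-score scaling using the training mean and standard deviation."""
    return (x - mu) / sigma

def robust(x):
    """Robust scaling using the training median and interquartile range."""
    return (x - median) / (q3 - q1)

print(min_max(val), standardize(val), robust(val))
```

Because `val` contains values outside the training range, its min-max output falls outside [0, 1]. That is expected when the parameters come from the training data alone.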

### II. Feature Encoding

**Feature encoding**, also known as **categorical encoding**, is the process of converting categorical data (non-numeric data) into a numerical format so that it can be used as input for machine learning algorithms. Most machine learning models require numerical data for training and prediction, so feature encoding is a critical step in data preprocessing.

Categorical data can take various forms, including:

1. **Nominal Data:** Categories with no intrinsic order, like colors or country names.  

2. **Ordinal Data:** Categories with a meaningful order but not necessarily equidistant, like education levels (e.g., "high school," "bachelor's," "master's").

There are several common methods for encoding categorical data:

1. **Label Encoding:**

   - Label encoding assigns a unique integer to each category in a feature.
   - It's suitable for ordinal data where there's a clear order among categories.
   - For example, if you have an "education" feature with values "high school," "bachelor's," and "master's," you can encode them as 0, 1, and 2, respectively.
<br />
<br />
2. **One-Hot Encoding:**

   - One-hot encoding creates a binary (0 or 1) column for each category in a nominal feature.
   - It's suitable for nominal data where there's no inherent order among categories.
   - Each category becomes a new feature, and the presence (1) or absence (0) of a category is indicated for each row.
<br />
<br />
3. **Target Encoding (Mean Encoding):**

   - Target encoding replaces each category with the mean of the target variable for that category.
   - It's often used for classification problems.
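
The three encodings can be compared on a tiny made-up column. The `colors` and `target` values below are illustrative assumptions. In practice the target means should be computed on the training data only, since they use the label.

```python
# Hypothetical categorical column and binary target
colors = ["red", "blue", "red", "green", "blue", "red"]
target = [1, 0, 1, 0, 1, 0]

# Label encoding: one integer per category (sorted so the mapping is deterministic)
categories = sorted(set(colors))
label_map = {category: index for index, category in enumerate(categories)}
label_encoded = [label_map[c] for c in colors]

# One-hot encoding: one 0/1 column per category, in the order of `categories`
one_hot = [[int(c == category) for category in categories] for c in colors]

# Target encoding: replace each category with the mean target value of that category
sums, counts = {}, {}
for c, t in zip(colors, target):
    sums[c] = sums.get(c, 0) + t
    counts[c] = counts.get(c, 0) + 1
target_map = {category: sums[category] / counts[category] for category in categories}
target_encoded = [target_map[c] for c in colors]

print(label_encoded)
print(one_hot[0])
print(target_map)
```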

In [None]:
class FeatureEncoder(BaseEstimator, TransformerMixin):
    """Label-encode every object/category column; categories unseen during fit map to -1."""
    def __init__(self):
        self.encoders = {}

    def fit(self, X, y=None):
        self.encoders = {}  # reset so that refitting on a new fold starts clean
        for col in X.select_dtypes(include=['object', 'category']).columns:
            le = LabelEncoder()
            le.fit(X[col].astype(str))
            self.encoders[col] = le
        return self

    def transform(self, X):
        X_encoded = X.copy()
        for col, le in self.encoders.items():
            # Vectorized lookup: each class maps to its position in le.classes_
            mapping = {cls: idx for idx, cls in enumerate(le.classes_)}
            X_encoded[col] = X[col].astype(str).map(mapping).fillna(-1).astype(int)
        return X_encoded

### III. Handling Imbalanced Dataset

**Handling imbalanced datasets** is important because imbalanced data can lead to several issues that negatively impact the performance and reliability of machine learning models. Here are some key reasons:

1. **Biased Model Performance**:

 - Models trained on imbalanced data tend to be biased towards the majority class, leading to poor performance on the minority class. This can result in misleading accuracy metrics.

2. **Misleading Accuracy**:

 - High overall accuracy can be misleading in imbalanced datasets. For example, if 95% of the data belongs to one class, a model that always predicts the majority class will have 95% accuracy but will fail to identify the minority class.

3. **Poor Generalization**:

 - Models trained on imbalanced data may not generalize well to new, unseen data, especially if the minority class is underrepresented.


Some methods to handle imbalanced datasets:
1. **Resampling Methods**:

 - Oversampling: Increase the number of instances in the minority class by duplicating or generating synthetic samples (e.g., SMOTE).
 - Undersampling: Reduce the number of instances in the majority class to balance the dataset.

2. **Evaluation Metrics**:

 - Use appropriate evaluation metrics such as precision, recall, F1-score, ROC-AUC, and confusion matrix instead of accuracy to better assess model performance on imbalanced data.

3. **Algorithmic Approaches**:

 - Use algorithms that are designed to handle imbalanced data, such as decision trees, random forests, or ensemble methods.
 - Adjust class weights in algorithms to give more importance to the minority class.
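
The 95% example above can be checked numerically. The sketch below uses made-up labels (`normal` and `attack` are illustrative names) and compares plain accuracy with per-class recall and balanced accuracy, which is the mean of the per-class recalls.

```python
# Hypothetical labels: 95 majority samples and 5 minority samples
y_true = ["normal"] * 95 + ["attack"] * 5
y_pred = ["normal"] * 100  # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(label):
    """Fraction of the samples whose true class is `label` that were predicted as `label`."""
    predictions = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predictions) / len(predictions)

per_class_recall = {label: recall(label) for label in sorted(set(y_true))}
balanced_accuracy = sum(per_class_recall.values()) / len(per_class_recall)

print(accuracy)
print(per_class_recall)
print(balanced_accuracy)
```

Accuracy looks strong while the recall for the minority class is zero, which is why per-class metrics are preferred on imbalanced data.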

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline  # imblearn's Pipeline (shadows sklearn's) also accepts samplers
from sklearn.base import BaseEstimator

class ResampleDataset(BaseEstimator):
    """Bring every class to `desired_count` samples: SMOTE for smaller classes,
    random undersampling for larger ones. Expects numeric X without missing values."""
    def __init__(self, desired_count=15000, random_state=42):
        self.desired_count = desired_count
        self.random_state = random_state

    def fit_resample(self, X, y):
        value_counts = y.value_counts()

        # Classes below the desired count are oversampled up to it
        oversample_strategy = {
            label: self.desired_count
            for label in value_counts.index
            if value_counts[label] < self.desired_count
        }

        # Classes above the desired count are undersampled down to it
        undersample_strategy = {
            label: self.desired_count
            for label in value_counts.index
            if value_counts[label] > self.desired_count
        }

        X_resampled, y_resampled = X, y
        if oversample_strategy:
            smote = SMOTE(sampling_strategy=oversample_strategy, random_state=self.random_state)
            X_resampled, y_resampled = smote.fit_resample(X_resampled, y_resampled)
        if undersample_strategy:
            undersample = RandomUnderSampler(sampling_strategy=undersample_strategy, random_state=self.random_state)
            X_resampled, y_resampled = undersample.fit_resample(X_resampled, y_resampled)

        # Return the labels as well, otherwise X and y no longer line up
        return X_resampled, y_resampled

print(y.value_counts())

### IV. Data Normalization

Data normalization is used to achieve a standard distribution. Without normalization, models or processes that rely on the assumption of normality may not work correctly. Normalization helps reduce the magnitude effect and ensures numerical stability during optimization.

In [None]:
# Write your code here

### V. Dimensionality Reduction

Dimensionality reduction is a technique used in data preprocessing to reduce the number of input features (dimensions) in a dataset while retaining as much important information as possible. It is essential when dealing with high-dimensional data, where too many features can cause problems like increased computational costs, overfitting, and difficulty in visualization. Reducing dimensions simplifies the data, making it easier to analyze and improving the performance of machine learning models.

One of the main approaches to dimensionality reduction is feature extraction. Feature extraction creates new, smaller sets of features that capture the essence of the original data. Common techniques include:

1. **Principal Component Analysis (PCA)**: Converts correlated features into a smaller number of uncorrelated "principal components."
2. **t-SNE (t-Distributed Stochastic Neighbor Embedding)**: A visualization-focused method to project high-dimensional data into 2D or 3D spaces.
3. **Autoencoders**: Neural networks that learn compressed representations of the data.
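
To make the PCA idea concrete, here is a minimal sketch that computes principal components with a singular value decomposition in NumPy. The synthetic data, where the third feature is almost a copy of the first, is an assumption chosen so that two components capture nearly all of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 3 features, the third nearly duplicating the first
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
X = np.column_stack([x1, x2, x1 + 0.01 * rng.normal(size=200)])

# PCA: center the data, then take the SVD of the centered matrix
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# The squared singular values are proportional to the variance along each component
explained_variance_ratio = S ** 2 / np.sum(S ** 2)

# Keep the first two principal components and project the data onto them
n_components = 2
components = Vt[:n_components]
X_reduced = X_centered @ components.T

print(explained_variance_ratio)
print(X_reduced.shape)
```

As with the other preprocessing steps, the centering mean and the components would be learned from the training fold and then applied unchanged to the validation fold.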

In [None]:
# Write your code here


# 3. Compile Preprocessing Pipeline

All of the preprocessing classes or functions defined earlier will be compiled in this step.

If you use sklearn to create preprocessing classes, you can list your preprocessing classes in the Pipeline object sequentially, and then fit and transform your data.

In [None]:

# from sklearn.pipeline import Pipeline

# # Note: You can add or delete preprocessing components from this pipeline

# pipe = Pipeline([("imputer", FeatureImputer()),
#                  ("featurecreator", FeatureCreator()),
#                  ("scaler", FeatureScaler()),
#                  ("encoder", FeatureEncoder())])

# train_set = pipe.fit_transform(train_set)
# val_set = pipe.transform(val_set)

pipe = Pipeline([
    ("encoder", FeatureEncoder()),
])

processed_X_train_array = []
processed_X_val_array = []
processed_y_train_array = []
processed_y_val_array = []

for i in range(len(X_train_array)):
    processed_X_train = pipe.fit_transform(X_train_array[i])
    processed_X_train_array.append(processed_X_train)
    processed_X_val = pipe.transform(X_val_array[i])
    processed_X_val_array.append(processed_X_val)

    # processed_y_train_array.append(y_train_array[i])
    # processed_y_val_array.append(y_val_array[i])

X_train_array = processed_X_train_array
X_val_array = processed_X_val_array
# y_train_array = processed_y_train_array
# y_val_array = processed_y_val_array

# print(X_val_array[0].head())


In [None]:
# # Your code should work up until this point
# train_set = pipe.fit_transform(train_set)
# val_set = pipe.transform(val_set)

# The validation folds were already transformed in the previous cell, each with the
# pipeline fitted on its own training fold, so they are only inspected here.
print(X_val_array[0].head())

    

or create your own here

In [None]:
# Write your code here

# 4. Modeling and Validation

Modeling is the process of building your own machine learning models to solve a specific problem; in the context of this assignment, that means predicting the target feature `attack_cat`. Validation is the process of evaluating your trained model on the validation set, or with cross-validation, and reporting metrics that help you decide what to do in the next iteration of development.

## A. KNN

In [None]:
# Scikit-learn
knn_model = KNeighborsClassifier(n_neighbors=5)

for i in range(len(X_train_array)):
    # Fit on the training fold, then score on the matching validation fold
    knn_model.fit(X_train_array[i], y_train_array[i])
    y_pred = knn_model.predict(X_val_array[i])
    accuracy = accuracy_score(y_val_array[i], y_pred)
    print(f"Fold {i+1}: Accuracy = {accuracy:.2f}")

# knn_model.fit(X_train_array[0], y_train_array[0])
# y_pred = knn_model.predict(X_val_array[0])
# accuracy = accuracy_score(y_val_array[0], y_pred)
# print(f"Fold {1}: Accuracy = {accuracy:.2f}")

len(X_train_array)



## B. Naive Bayes

In [None]:
# Type your code here

## C. ID3

In [None]:
# Write your code here 

# Multiclass ID3 decision tree implementation
import math
import numpy as np
from typing import List, Dict, Any, Optional


class ID3MultiClass:
    def __init__(self, max_depth: Optional[int] = None):
        """
        Initialize the ID3 decision tree
        Args:
            max_depth (int, optional): Maximum depth of the tree to prevent overfitting
        """
        self.root = None # Root of the decision tree
        self.max_depth = max_depth # Depth limit, if one is set
        self.feature_names = None # Names of the features
        self.class_names = None # Classes seen during training
    
    class Node:
        def __init__(self, 
                     feature_name: Optional[str] = None, 
                     decision: Optional[str] = None, 
                     value: Optional[str] = None):
            """
            Initialize a node of the decision tree
            Args:
                feature_name (str): name of the feature tested at this node
                decision (str): feature value on the branch leading into this node
                value (str): predicted class (set on leaf nodes)
            """
            self.feature_name = feature_name
            self.decision = decision
            self.value = value
            self.childs: List[ID3MultiClass.Node] = []
    
    def calculate_multiclass_entropy(self, labels: List[str]) -> float:
        """
        Compute the entropy of a multiclass label set, following the
        definition from the lecture slides:

        Entropy(S) = -p_1 * log2(p_1) - p_2 * log2(p_2) - ... - p_c * log2(p_c)

        - S: set of training examples
        - c: number of classes
        - p_i: proportion of examples in S that belong to class i

        Args:
            labels (List[str]): List of class labels

        Returns:
            float: Entropy value
        """
        # Count how many examples fall into each class
        label_counts = {}
        for label in labels:
            label_counts[label] = label_counts.get(label, 0) + 1

        # Compute the entropy
        total = len(labels)  # |S|, the number of examples
        entropy = 0
        for count in label_counts.values():
            prob = count / total  # p_i
            # Guard against log2(0), which would raise a ValueError
            if prob > 0:
                entropy -= prob * math.log2(prob)
        
        return entropy
    
    def information_gain(self, features: List[List[Any]], labels: List[str], feature_idx: int) -> float:
        """
        Information gain, as defined in the lecture slides:
        Gain(S, A) = expected reduction in entropy from splitting S on attribute A.
        The tree picks the attribute A with the maximum Gain(S, A).
        - S: set of training examples
        - A: attribute

        Gain(S, A) = Entropy(S) - sum(|S_v| / |S| * Entropy(S_v))
        Parameter:
            features (List[List[Any]]): Feature matrix
            labels (List[str]): Target labels
            feature_idx (int): Index of the feature to calculate gain for

        Returns:
            float: Information gain value
        """
        # Entropy of the whole set -> Entropy(S)
        total_entropy = self.calculate_multiclass_entropy(labels)

        # Group the examples by their value of the chosen feature;
        # each group is one S_v in sum(|S_v| / |S| * Entropy(S_v))
        feature_groups = {}
        for idx, row in enumerate(features):
            feature_val = row[feature_idx]
            if feature_val not in feature_groups:
                feature_groups[feature_val] = {
                    'indices': [],  # row indices in this group
                    'labels': []  # labels of those rows
                }
            feature_groups[feature_val]['indices'].append(idx)
            feature_groups[feature_val]['labels'].append(labels[idx])

        # Compute sum(|S_v| / |S| * Entropy(S_v))
        weighted_entropy = 0
        for group in feature_groups.values():  # one iteration per group S_v
            group_entropy = self.calculate_multiclass_entropy(group['labels'])
            weight = len(group['indices']) / len(labels)
            weighted_entropy += weight * group_entropy

        # Information gain of splitting on this feature
        return total_entropy - weighted_entropy
    
    def most_common_label(self, labels: List[str]) -> str:
        """ Return the label that occurs most often
        Parameter:
            labels (List[str]): List of labels
        Returns:
            str: Most common label
        """

        if len(labels) == 1:
            return labels[0]

        label_counts = {}
        for label in labels:
            label_counts[label] = label_counts.get(label, 0) + 1

        return max(label_counts, key=label_counts.get)

    
    def build_tree(
            self, 
            X: List[List[Any]], 
            y: List[str], 
            feature_indices: List[int], 
            depth: int = 0
            ) -> 'ID3MultiClass.Node':
        """
        Build the decision tree recursively

        Args:
            X (List[List[Any]]): Feature matrix
            y (List[str]): Target labels
            feature_indices (List[int]): Indices of features to consider
            depth (int): Current depth of the tree

        Returns:
            Node: Root of the decision tree (or subtree)
        """
        if len(set(y)) == 1:
            node = self.Node(value=y[0])
            return node

        if len(feature_indices) == 0 or (self.max_depth is not None and depth >= self.max_depth):
            node = self.Node(value=self.most_common_label(y))
            return node

        best_gain = 0
        best_feature_idx = None
        for idx in feature_indices:
            gain = self.information_gain(X, y, idx)
            if gain > best_gain:
                best_gain = gain
                best_feature_idx = idx

        if best_feature_idx is None:
            node = self.Node(value=self.most_common_label(y))
            return node

        node = self.Node(feature_name=self.feature_names[best_feature_idx])

        feature_values = set(row[best_feature_idx] for row in X)

        new_feature_indices = [idx for idx in feature_indices if idx != best_feature_idx]

        for value in feature_values:
            child_indices = [i for i, row in enumerate(X) if row[best_feature_idx] == value]

            if not child_indices:
                child_node = self.Node(value=self.most_common_label(y))
            else:
                child_X = [X[i] for i in child_indices]
                child_y = [y[i] for i in child_indices]

                child_node = self.build_tree(child_X, child_y, new_feature_indices, depth + 1)

            child_node.decision = value
            node.childs.append(child_node)

        return node

    def fit(
            self, 
            data_train_feature: Any, 
            data_train_label: Any, 
            ):
        """
        Train the model
        Parameter:
            data_train_feature (DataFrame): Feature matrix
            data_train_label (Series): Target labels, one per row of the feature matrix
        """
        train_feature = data_train_feature.values.tolist()
        train_label = data_train_label.values.tolist() 
        feature_names = data_train_feature.columns.tolist()

        if len(train_feature) != len(train_label):
            raise ValueError("train_feature and train_label must have the same number of samples")
        
        self.feature_names = feature_names or [f'Feature_{i}' for i in range(len(train_feature[0]))]
        
        self.class_names = list(set(train_label))
        
        feature_indices = list(range(len(feature_names)))
        
        self.root = self.build_tree(train_feature, train_label, feature_indices)
    
    def predict(self, test_feature_dataframe: Any) -> List[str]:
        """
        Predict labels for a set of samples
        Parameter:
            test_feature_dataframe (DataFrame): Feature matrix to predict, with the same column order as in training
        Returns:
            List[str]: Predicted labels
        """

        test_feature = test_feature_dataframe.values.tolist()
        if self.root is None:
            raise ValueError("Model not trained. Call fit() first.")
        
        predictions = []
        for sample in test_feature:
            predictions.append(self.predict_sample(sample))
        return predictions
    
    def predict_sample(self, sample: List[Any]) -> str:
        """
        Predict the label for a single sample
        Args:
            sample (List[Any]): Single sample to predict
        Returns:
            str: Predicted label
        """
        node = self.root
        while node.childs:
            feature_idx = self.feature_names.index(node.feature_name)
            
            matching_child = None
            for child in node.childs:
                if child.decision == sample[feature_idx]:
                    matching_child = child
                    break
            
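            if matching_child is None:
                # Feature value not seen during training: fall back to the most common
                # label among this node's leaf children (internal children have no value)
                child_labels = [child.value for child in node.childs if child.value is not None]
                
                if child_labels:
                    return self.most_common_label(child_labels)
                
                return node.value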
            
            node = matching_child
        
        return node.value
    
    def print_tree(self, node: Optional['ID3MultiClass.Node'] = None, indent: str = ""):
        """
        Print the tree structure.
        Read it from top to bottom (as in the lecture slides).

        - Feature: the attribute tested at a node
        - Decision: the attribute value on the branch leading into a node
        - Value: the predicted class at a leaf

        Parameter:
            node (Node, optional): Current node (root by default)
            indent (str): Indentation for visualization
        """
        if node is None:
            node = self.root
        
        # Print current node; compare against None so falsy values such as 0 still print
        if node.decision is not None:
            print(f"{indent}Decision: {node.decision}")
        
        if node.feature_name:
            print(f"{indent}Feature: {node.feature_name}")
        
        if node.value is not None:
            print(f"{indent}Value Classes: {node.value}")
        
        # Recursively print children
        for child in node.childs:
            self.print_tree(child, indent + "  ")
    
    
    def evaluate(self, y_pred: List[str], y_test: List[str]) -> Dict[str, Any]:
        """
        Evaluate the model's performance
        
        Parameter:
            y_pred (List[str]): Predicted labels, e.g. the output of predict()
            y_test (List[str]): True labels for test data
        
        Returns:
            Dict[str, Any]: Overall accuracy and per-class precision and recall
        """
        accuracy = sum(1 for true, pred in zip(y_test, y_pred) if true == pred) / len(y_test)
        
        class_metrics = {}
        for class_name in self.class_names:
            # True positives: predicted as this class and actually this class
            true_positives = sum(1 for pred, true in zip(y_pred, y_test)
                                 if pred == class_name and true == class_name)
            predicted_count = sum(1 for pred in y_pred if pred == class_name)
            actual_count = sum(1 for true in y_test if true == class_name)
            
            precision = true_positives / predicted_count if predicted_count else 0
            
            recall = true_positives / actual_count if actual_count else 0
            
            class_metrics[class_name] = {
                'precision': precision,
                'recall': recall
            }
        
        return {
            'overall_accuracy': accuracy,
            'class_metrics': class_metrics
        }
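
To sanity-check the entropy and information gain formulas used in the class above, here is a small standalone sketch. The toy rows, the feature values (`tcp`, `udp`, `low`, `high`), and the two labels are made up for illustration, and `entropy` and `information_gain` are simplified free functions written for this example, not the class methods themselves.

```python
import math
from collections import Counter
from typing import Any, Dict, List


def entropy(labels: List[str]) -> float:
    """Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows: List[List[Any]], labels: List[str], feature_idx: int) -> float:
    """Gain(S, A) = Entropy(S) - sum(|S_v| / |S| * Entropy(S_v))."""
    groups: Dict[Any, List[str]] = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_idx], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: column 0 separates the classes perfectly, column 1 not at all
rows = [
    ["tcp", "low"],
    ["tcp", "high"],
    ["udp", "low"],
    ["udp", "high"],
]
labels = ["Normal", "Normal", "DoS", "DoS"]

print("Entropy(S):", entropy(labels))
print("Gain(S, column 0):", information_gain(rows, labels, 0))
print("Gain(S, column 1):", information_gain(rows, labels, 1))
```

With a 50/50 class split the entropy is 1 bit. Splitting on column 0 leaves two pure groups, so the gain equals the full entropy, while splitting on column 1 leaves both groups as mixed as before, so the gain is 0. ID3 would therefore choose column 0 at the root.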




In [None]:
print(X_val_array[0].head(5))

In [None]:
def changed_to_categorical(df, min_unique_values=5):
    """
    Coarsen numeric columns by replacing each value with the mean of its bin.
    Columns with fewer than min_unique_values distinct values are left as-is,
    and values outside [0, 100) are not binned.
    """
    df_categorical = df.copy()

    # Half-open bins: start <= x < end
    ranges = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 100)]

    for column in df_categorical.select_dtypes(include=['int64', 'float64']).columns:
        if df_categorical[column].nunique() < min_unique_values:
            continue

        df_categorical[column] = df_categorical[column].astype(float)

        # Loop over defined ranges
        for start, end in ranges:
            condition = (df_categorical[column] >= start) & (df_categorical[column] < end)

            if condition.any():  # If there are any values in this range
                bin_mean = df_categorical[column][condition].mean()
                df_categorical[column] = np.where(condition, bin_mean, df_categorical[column])

    return df_categorical



df_categorical = changed_to_categorical(X_train_array[0], min_unique_values=2)

# print(df_categorical.isna().sum())
print(df_categorical)
# Assuming X_train_array is a list of DataFrames
# for column in X_train_array[0].columns:
#     unique_count = X_train_array[0][column].nunique()  # Count unique values
#     print(f"Unique values count in column '{column}': {unique_count}")



In [None]:
# scikit-learn

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

print("SKLEARN")
def main_sklearn():
    dt = DecisionTreeClassifier(max_depth=5)
    for i in range(len(X_train_array)):
        dt.fit(changed_to_categorical(X_train_array[i].head(1000)), y_train_array[i].head(1000))
        y_pred = dt.predict(changed_to_categorical(X_val_array[i].head(100)))
        accuracy = accuracy_score(y_val_array[i].head(100), y_pred)
        print(f"Fold {i+1}: Accuracy = {accuracy:.2f}")
main_sklearn()


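def filter_columns_and_values(df, exclude_items=None):
    # Avoid a mutable default argument; None means "exclude nothing"
    exclude_items = exclude_items or []
    filtered_df = df.drop(columns=[col for col in df.columns if col in exclude_items], errors="ignore")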
    return filtered_df

def keep_columns_and_values(df, columns_to_keep):
    filtered_df = df[columns_to_keep]
    return filtered_df

    
# # To filter columns:
# # for the training data
# filtered_df_train = keep_columns_and_values(X_train_array[0], ['proto', 'dbytes', 'sttl', 'dttl', 'smean'])
# filtered_exclude_df_train = filter_columns_and_values(X_train_array[0], ['id'])

# # for the validation data
# filtered_df_test = keep_columns_and_values(X_val_array[0], ['proto', 'dbytes', 'sttl', 'dttl', 'smean'])
# filtered_exclude_df_test = filter_columns_and_values(X_val_array[0], ['id'])



print()
print("OUR IMPLEMENTATION")
print("Model Evaluation:")
def main_dataset(): 
    dt = ID3MultiClass(max_depth=len(X_train_array[0].columns.tolist()))
    for i in range(len(X_train_array)): 
        dt.fit(changed_to_categorical(X_train_array[i].head(10000)), y_train_array[i].head(10000))

        # Print tree structure (debug)
        # print("Decision Tree Structure:")
        # dt.print_tree()

        # Predict on the validation fold so the predictions line up with y_val_array
        test_samples = changed_to_categorical(X_val_array[i].head(100))
        predictions = dt.predict(test_samples)
        # print("\nPredictions:", predictions)

        y_test = y_val_array[i].head(100).values.tolist()

        evaluation = dt.evaluate(predictions, y_test)

        
        # print("\nActual data: ", y_test)
        print(f"Fold {i+1} Overall Accuracy: {evaluation['overall_accuracy']:.2f}")
    
    
main_dataset()



## D. Improvements (Optional)

- **Visualize the model evaluation result**

This will help you understand your model's performance in more detail. From the visualization, you can see whether your model is biased toward one class over the others. (Hint: confusion matrix, ROC-AUC curve, etc.)
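
As one possible starting point, the sketch below builds confusion-matrix counts and per-class precision and recall with the standard library only. The label names and the two lists are placeholders made up for this example, and `confusion_counts` and `per_class_report` are hypothetical helper names; in practice `y_true` and `y_pred` would come from one of your validation folds.

```python
from collections import Counter
from typing import Dict, List, Tuple


def confusion_counts(y_true: List[str], y_pred: List[str]) -> Counter:
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def per_class_report(y_true: List[str], y_pred: List[str]) -> Dict[str, Dict[str, float]]:
    """Precision and recall for every class seen in either list."""
    counts = confusion_counts(y_true, y_pred)
    report = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = counts[(cls, cls)]
        predicted = sum(n for (_, p), n in counts.items() if p == cls)
        actual = sum(n for (t, _), n in counts.items() if t == cls)
        report[cls] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return report


# Placeholder data for illustration only
y_true = ["Normal", "Normal", "DoS", "DoS", "Exploits"]
y_pred = ["Normal", "DoS", "DoS", "DoS", "Normal"]

counts = confusion_counts(y_true, y_pred)
classes = sorted(set(y_true) | set(y_pred))
for t in classes:
    # One row per true class, one column per predicted class
    print(t.ljust(10), [counts[(t, p)] for p in classes])

report = per_class_report(y_true, y_pred)
for cls, metrics in report.items():
    print(cls, metrics)
```

A row with large off-diagonal counts points to a class the model tends to confuse with another, which accuracy alone would hide.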

- **Explore the hyperparameters of your models**

Each model has its own hyperparameters, and each hyperparameter affects the model's behaviour differently. You can optimize model performance by finding a good set of hyperparameters through a process called **hyperparameter tuning**. (Hint: grid search, random search, Bayesian optimization)
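
A minimal sketch of grid search, assuming you already have some function that trains a model with given hyperparameters and returns a validation score. Here `toy_score` is only a made-up stand-in for that function, and `grid_search` is a hypothetical helper, not a library API.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Tuple


def grid_search(param_grid: Dict[str, List[Any]],
                score_fn: Callable[..., float]) -> Tuple[Dict[str, Any], float]:
    """Try every combination in param_grid and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(n_neighbors: int, weights: str) -> float:
    # Stand-in for "fit on the training fold, return validation accuracy".
    # The formula is arbitrary and only makes the example deterministic.
    base = 1.0 - abs(n_neighbors - 5) * 0.05
    return base + (0.01 if weights == "distance" else 0.0)


best_params, best_score = grid_search(
    {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]},
    toy_score,
)
print(best_params, best_score)
```

The number of combinations grows multiplicatively with each added hyperparameter, which is why random search or Bayesian optimization is often preferred for larger grids.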

- **Cross-validation**

Cross-validation is a critical technique in machine learning and data science for evaluating and validating the performance of predictive models. It provides a more **robust** and **reliable** evaluation than hold-out validation (a single train-test split). However, it requires more time and computing power, because the model is trained and evaluated once per fold. (Hint: k-fold cross-validation, stratified k-fold cross-validation, etc.)
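
To make the mechanics concrete, here is a small sketch of plain (non-stratified) k-fold splitting over row indices. `k_fold_indices` is a hypothetical helper written for this example; it only produces index lists, and training and scoring a model on each split is left to you.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int, seed: int = 42) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, val_idx in enumerate(folds):
        # Every fold except the i-th one is used for training
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


splits = list(k_fold_indices(n_samples=10, k=3))
for fold_no, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"Fold {fold_no}: {len(train_idx)} train rows, {len(val_idx)} validation rows")
```

Each row lands in exactly one validation fold, so averaging the k validation scores uses every row for evaluation once. A stratified variant would additionally keep the class proportions similar across folds.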

In [None]:
# Type your code here

## E. Submission
To predict the test set target feature and submit the results to the Kaggle competition platform, do the following:
1. Create a new pipeline instance identical to the first in Data Preprocessing
2. With the pipeline, apply `fit_transform` to the original training set before splitting, then only apply `transform` to the test set.
3. Retrain the model on the preprocessed training set
4. Predict the test set
5. Make sure the submission contains the `id` and `attack_cat` columns.
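
Steps 1 to 4 depend on your own pipeline and model, but the last step can be sketched on its own. The example below writes a two-column file with the standard `csv` module; `write_submission` is a hypothetical helper, and the ids and predicted labels are placeholders, not real results.

```python
import csv
import io
from typing import Iterable, TextIO


def write_submission(ids: Iterable, predictions: Iterable, out: TextIO) -> None:
    """Write one row per test sample with the columns id and attack_cat."""
    ids, predictions = list(ids), list(predictions)
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "attack_cat"])
    writer.writerows(zip(ids, predictions))


# Placeholder values; in the notebook these would come from the test set and the model
buffer = io.StringIO()
write_submission([101, 102, 103], ["Normal", "DoS", "Normal"], buffer)
print(buffer.getvalue())
```

To write an actual file, open it with `open(path, "w", newline="")` and pass the file object as `out`. The length check catches the common mistake of predicting on a differently filtered copy of the test set.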

In [None]:
# Create the submission pipeline

submission_pipe = Pipeline([
    ("encoder", FeatureEncoder())
])

# Cleaning
# Cleaning training set


# Preprocessing
