# **IF3170 Artificial Intelligence | Tugas Besar 2**

Group Number: 9

Group Members:
- Ariel Herfrison (13522002)
- Irfan Sidiq Permana (13522007)
- Akbar Al Fattah (13522036)
- Diero Arga Purnama (13522056)

## Import Libraries

In [1753]:
import pandas as pd
import numpy as np
import matplotlib as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import QuantileTransformer, LabelEncoder
from sklearn.decomposition import PCA

# Import models
import sys
import os

sys.path.append(os.path.join(os.getcwd(), "..", "models"))

from knn import KNN

## Import Dataset

In [1754]:
# Import dataset from the training folder
df_additional = pd.read_csv('../dataset/train/additional_features_train.csv')
df_basic = pd.read_csv('../dataset/train/basic_features_train.csv')
df_content = pd.read_csv('../dataset/train/content_features_train.csv')
df_flow = pd.read_csv('../dataset/train/flow_features_train.csv')
df_labels = pd.read_csv('../dataset/train/labels_train.csv')
df_time = pd.read_csv('../dataset/train/time_features_train.csv')

In [1755]:
# Join the datasets based on the 'id' attribute
df_merged = (
    df_time
    .merge(df_labels, on = "id", how = "left")
    .merge(df_flow, on = "id", how = "left")
    .merge(df_content, on = "id", how = "left")
    .merge(df_basic, on = "id", how = "left")
    .merge(df_additional, on = "id", how = "left")
)

df_merged.head()

Unnamed: 0,sjit,djit,sinpkt,dinpkt,tcprtt,synack,ackdat,id,attack_cat,label,...,ct_flw_http_mthd,is_ftp_login,ct_ftp_cmd,ct_srv_src,ct_srv_dst,ct_dst_ltm,ct_src_ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm
0,4449.110313,3234.831566,11.845558,6.261361,,0.000444,0.000114,0,Normal,0,...,0.0,0.0,0.0,11.0,,5.0,4.0,2.0,1.0,5.0
1,0.0,0.0,0.009,0.0,0.0,0.0,,1,Generic,1,...,0.0,0.0,0.0,10.0,10.0,10.0,10.0,,10.0,10.0
2,8561.040438,249.950547,165.386453,172.34575,0.158826,0.057902,0.100924,2,Exploits,1,...,0.0,0.0,0.0,4.0,4.0,2.0,2.0,1.0,1.0,4.0
3,4053.08602,2918.730804,8.669644,4.496707,0.000558,0.000448,,3,Normal,0,...,0.0,0.0,0.0,9.0,9.0,3.0,2.0,2.0,1.0,6.0
4,0.0,0.0,0.008,0.007,0.0,0.0,0.0,4,Normal,0,...,0.0,0.0,0.0,3.0,3.0,4.0,3.0,1.0,,1.0


# Exploratory Data Analysis (Optional)

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves examining and visualizing data sets to uncover patterns, trends, anomalies, and insights. It is the first step before applying more advanced statistical and machine learning techniques. EDA helps you to gain a deep understanding of the data you are working with, allowing you to make informed decisions and formulate hypotheses for further analysis.

In [1756]:
# Write your code here

# 1. Split Training Set and Validation Set

Splitting the training and validation set works as an early diagnostic towards the performance of the model we train. This is done before the preprocessing steps to **avoid data leakage inbetween the sets**. If you want to use k-fold cross-validation, split the data later and do the cleaning and preprocessing separately for each split.

Note: For training, you should use the data contained in the `train` folder given by the TA. The `test` data is only used for kaggle submission.

In [1757]:
train_set, val_set = train_test_split(df_merged, test_size=0.2, random_state=42)

print(f"Training set size: {train_set.shape[0]}")
print(f"Validation set size: {val_set.shape[0]}")

Training set size: 140272
Validation set size: 35069


# 2. Data Cleaning and Preprocessing

This step is the first thing to be done once a Data Scientist have grasped a general knowledge of the data. Raw data is **seldom ready for training**, therefore steps need to be taken to clean and format the data for the Machine Learning model to interpret.

By performing data cleaning and preprocessing, you ensure that your dataset is ready for model training, leading to more accurate and reliable machine learning results. These steps are essential for transforming raw data into a format that machine learning algorithms can effectively learn from and make predictions.

We will give some common methods for you to try, but you only have to **at least implement one method for each process**. For each step that you will do, **please explain the reason why did you do that process. Write it in a markdown cell under the code cell you wrote.**

## A. Data Cleaning

**Data cleaning** is the crucial first step in preparing your dataset for machine learning. Raw data collected from various sources is often messy and may contain errors, missing values, and inconsistencies. Data cleaning involves the following steps:

1. **Handling Missing Data:** Identify and address missing values in the dataset. This can include imputing missing values, removing rows or columns with excessive missing data, or using more advanced techniques like interpolation.

2. **Dealing with Outliers:** Identify and handle outliers, which are data points significantly different from the rest of the dataset. Outliers can be removed or transformed to improve model performance.

3. **Data Validation:** Check for data integrity and consistency. Ensure that data types are correct, categorical variables have consistent labels, and numerical values fall within expected ranges.

4. **Removing Duplicates:** Identify and remove duplicate rows, as they can skew the model's training process and evaluation metrics.

5. **Feature Engineering**: Create new features or modify existing ones to extract relevant information. This step can involve scaling, normalizing, or encoding features for better model interpretability.

### I. Handling Missing Data

Missing data can adversely affect the performance and accuracy of machine learning models. There are several strategies to handle missing data in machine learning:

1. **Data Imputation:**

    a. **Mean, Median, or Mode Imputation:** For numerical features, you can replace missing values with the mean, median, or mode of the non-missing values in the same feature. This method is simple and often effective when data is missing at random.

    b. **Constant Value Imputation:** You can replace missing values with a predefined constant value (e.g., 0) if it makes sense for your dataset and problem.

    c. **Imputation Using Predictive Models:** More advanced techniques involve using predictive models to estimate missing values. For example, you can train a regression model to predict missing numerical values or a classification model to predict missing categorical values.

2. **Deletion of Missing Data:**

    a. **Listwise Deletion:** In cases where the amount of missing data is relatively small, you can simply remove rows with missing values from your dataset. However, this approach can lead to a loss of valuable information.

    b. **Column (Feature) Deletion:** If a feature has a large number of missing values and is not critical for your analysis, you can consider removing that feature altogether.

3. **Domain-Specific Strategies:**

    a. **Domain Knowledge:** In some cases, domain knowledge can guide the imputation process. For example, if you know that missing values are related to a specific condition, you can impute them accordingly.

4. **Imputation Libraries:**

    a. **Scikit-Learn:** Scikit-Learn provides a `SimpleImputer` class that can handle basic imputation strategies like mean, median, and mode imputation.

    b. **Fancyimpute:** Fancyimpute is a Python library that offers more advanced imputation techniques, including matrix factorization, k-nearest neighbors, and deep learning-based methods.

The choice of imputation method should be guided by the nature of your data, the amount of missing data, the problem you are trying to solve, and the assumptions you are willing to make.

To check if we could use the methods in the Deletion of Missing Data strategy, we first need to find how many of the data entries has missing values:

In [1758]:
# Check the number of data entries with missing values
missing_data = train_set[train_set.isnull().any(axis=1)]
print(f"Number of entries in training set: {train_set.shape[0]}")
print(f"Number of entries in training set that has missing values: {missing_data.shape[0]}")

missing_data = val_set[val_set.isnull().any(axis=1)]
print(f"Number of entries in validation set: {val_set.shape[0]}")
print(f"Number of entries in validation set that has missing values: {val_set.shape[0]}")

Number of entries in training set: 140272
Number of entries in training set that has missing values: 123176
Number of entries in validation set: 35069
Number of entries in validation set that has missing values: 35069


As can be seen above, the number of entries that has missing values largely represents the original training dataset (that is, 87,8% of the training dataset has missing values), therefore we clearly cannot use the strategy Deletion of Missing Data with the method Listwise Deletion.

Next, to check if we can use the method Column (Feature) Deletion, we check if there are some features that particularly have very many null values compared to the other features:

In [1759]:
missing_values_count = train_set.isnull().sum()
print("Count for missing values in each feature in training set:")
print(missing_values_count)

missing_values_count = val_set.isnull().sum()
print("\nCount for missing values in each feature in validation set:")
print(missing_values_count)

Count for missing values in each feature in training set:
sjit                 7015
djit                 7109
sinpkt               6953
dinpkt               6999
tcprtt               7093
synack               6943
ackdat               6886
id                      0
attack_cat              0
label                   0
proto                7033
swin                 6971
dwin                 6999
stcpb                6941
dtcpb                7072
smean                7008
dmean                7078
trans_depth          7001
response_body_len    6999
state                7034
dur                  6957
sbytes               6836
dbytes               7069
sttl                 7022
dttl                 6937
sloss                7004
dloss                7209
service              7038
sload                7006
dload                7096
spkts                6959
dpkts                6918
is_sm_ips_ports      7028
ct_state_ttl         7000
ct_flw_http_mthd     6946
is_ftp_login         6910
ct_ftp

As can be seen above, the number of entries that has missing values is distributed pretty much evenly across the features except for the ID and target features, so we definitely cannot use the method Column (Feature) Deletion either.

Our option left is to do imputation on the training dataset.
- For the **categorical** features, we fill the missing values with the **most frequent value** of the attribute.
- For the **numerical** features, we fill the missing values with the **median** value of the attribute. This is to prevent outliers from massively damaging our results, because if we use mean for filling the missing values there could be outliers that have very large or very small values compared to the rest of the data, and this can massively skew the mean value.

First, we define the method for imputing our dataframe:

In [1760]:
# Method for imputing dataset
def impute(df: pd.DataFrame) -> pd.DataFrame:
    categorical_features = ['proto', 'state', 'service', 'is_sm_ips_ports', 'is_ftp_login', 'id']  # id is only used for join later
    target_feature = 'attack_cat'

    categorical_set = df[categorical_features]
    if 'label' in df.columns:
        categorical_set = categorical_set.join(df['label'])
    
    categorical_features.remove('id')
    numerical_set = df.drop(categorical_features, axis=1)
    if 'label' in df.columns:
        numerical_set = numerical_set.drop('label', axis=1)
    
    has_target_feature = False
    if target_feature in df.columns:
        target_set = df[[target_feature, 'id']] # id is used for join later
        numerical_set = numerical_set.drop(target_feature, axis=1)
        has_target_feature = True
    else:
        target_set = None

    imputer_mode = SimpleImputer(strategy='most_frequent')
    imputer_median = SimpleImputer(strategy='median')

    categorical_set = pd.DataFrame(
        imputer_mode.fit_transform(categorical_set),
        columns=categorical_set.columns,
        index=categorical_set.index
    )

    numerical_set = pd.DataFrame(
        imputer_median.fit_transform(numerical_set),
        columns=numerical_set.columns,
        index=numerical_set.index
    )

    result = categorical_set.merge(numerical_set, on='id')
    if has_target_feature:
        result = result.merge(target_set, on='id')
        
    return result

Then we can use it to impute both the train set and validation set:

In [1761]:
# Imputing train set and validation set
train_set = impute(train_set)
val_set = impute(val_set)

# Check if there's missing data after imputing
missing_data_train = train_set[train_set.isnull().any(axis=1)]
missing_data_val = val_set[val_set.isnull().any(axis=1)]

print(f"Number of data entries in training set with missing values: {missing_data_train.shape[0]}")
print(f"Number of data entries in validation set with missing values: {missing_data_val.shape[0]}")

Number of data entries in training set with missing values: 0
Number of data entries in validation set with missing values: 0


### II. Dealing with Outliers

Outliers are data points that significantly differ from the majority of the data. They can be unusually high or low values that do not fit the pattern of the rest of the dataset. Outliers can significantly impact model performance, so it is important to handle them properly.

Some methods to handle outliers:
1. **Imputation**: Replace with mean, median, or a boundary value.
2. **Clipping**: Cap values to upper and lower limits.
3. **Transformation**: Use log, square root, or power transformations to reduce their influence.
4. **Model-Based**: Use algorithms robust to outliers (e.g., tree-based models, Huber regression).

First, we check how many outliers are there in our dataset. This is done to help us decide what to do with the outliers. For example, if the number of outliers is significantly smaller than our entire population, then dropping the outliers is safe. But if the number of outliers is not insignificant, dropping them can hugely impact our results as we wouldn't have enough data to train our model.

Let us define the function to select the outliers:

In [1762]:
# Function to find the outliers
def select_outliers(df: pd.DataFrame) -> pd.DataFrame:
    outliers_mask = pd.DataFrame(False, index=df.index, columns=df.columns)

    for column in df.select_dtypes(include=['number']).columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1

        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        outliers_mask[column] = (df[column] < lower_bound) | (df[column] > upper_bound)

    outlier_rows = outliers_mask.any(axis=1)
    return outlier_rows

Then we can use it to find the number of outliers in our dataset:

In [1763]:
# Display the number of outliers in train set and validation set
outliers_train = select_outliers(train_set)
outliers_val = select_outliers(val_set)

print(f"Number of outliers in the training set: {outliers_train.sum()}")
print(f"Number of outliers in the validation set: {outliers_val.sum()}")

Number of outliers in the training set: 115228
Number of outliers in the validation set: 28855


As can be seen above, the number of outliers hugely represents our entire dataset (82,15% of the entire train set!), and thus we clearly cannot drop the outliers. Our option is then to replace the outlier values with the median (as we have done so before for missing values).

Like before, we define the function for replacing the outliers with median first:

In [1764]:
# Function for replacing outlier values with the median
def replace_outliers_with_median(df):
    for column in df.select_dtypes(include=['number']).columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1

        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        median = df[column].median()
        df[column] = df[column].apply(
            lambda x: median if x < lower_bound or x > upper_bound else x
        )
    return df

Then we can use it to replace the outliers in the train set and validation set:

In [1765]:
# Replace the outlier values in the train set and validation set with the median
train_set = replace_outliers_with_median(train_set)
val_set = replace_outliers_with_median(val_set)

# Check the number of outliers after replacing them with median
outliers_train = select_outliers(train_set)
outliers_val = select_outliers(val_set)

print(f"Number of outliers in the training set: {outliers_train.sum()}")
print(f"Number of outliers in the validation set: {outliers_val.sum()}")

Number of outliers in the training set: 115677
Number of outliers in the validation set: 28851


### III. Remove Duplicates
Handling duplicate values is crucial because they can compromise data integrity, leading to inaccurate analysis and insights. Duplicate entries can bias machine learning models, causing overfitting and reducing their ability to generalize to new data. They also inflate the dataset size unnecessarily, increasing computational costs and processing times. Additionally, duplicates can distort statistical measures and lead to inconsistencies, ultimately affecting the reliability of data-driven decisions and reporting. Ensuring data quality by removing duplicates is essential for accurate, efficient, and consistent analysis.

First, we check how many duplicate data are there in our dataset:

In [1766]:
# Display the number of duplicates in train set and validation set
duplicates_count_train = train_set.duplicated(subset=[col for col in train_set.columns if col != 'id']).sum()
duplicates_count_val = val_set.duplicated(subset=[col for col in train_set.columns if col != 'id']).sum()

print(f"Number of duplicates in training dataset: {duplicates_count_train}")
print(f"Number of duplicates in validation dataset: {duplicates_count_val}")

Number of duplicates in training dataset: 28681
Number of duplicates in validation dataset: 5144


It turns out that the number of duplicates is a lot more than we've expected (20% of the train set, and 14,67% of the validation set). If unhandled, our model would overfit on these duplicate data in the training phase, and this would negatively impact their ability to predict the class correctly. Yet unlike with handling missing values or handling outliers, we cannot "manipulate" the duplicate data itself because they do not represent new information (and thus imputing them would only make the quality of our dataset worse by adding artificial, unmeaningful data). Thus, we need to remove the duplicate:

In [1767]:
# Drop all duplicate data in train set and validation set
train_set = train_set.drop_duplicates(subset=[col for col in train_set.columns if col != 'id'], keep='first')
val_set = val_set.drop_duplicates(subset=[col for col in train_set.columns if col != 'id'], keep='first')

# Check the number of duplicate data after dropping them
duplicates_count_train = train_set.duplicated(subset=[col for col in train_set.columns if col != 'id']).sum()
duplicates_count_val = val_set.duplicated(subset=[col for col in train_set.columns if col != 'id']).sum()

print(f"Number of duplicates in training dataset: {duplicates_count_train}")
print(f"Number of duplicates in validation dataset: {duplicates_count_val}")

Number of duplicates in training dataset: 0
Number of duplicates in validation dataset: 0


### IV. Feature Engineering

**Feature engineering** involves creating new features (input variables) or transforming existing ones to improve the performance of machine learning models. Feature engineering aims to enhance the model's ability to learn patterns and make accurate predictions from the data. It's often said that "good features make good models."

1. **Feature Selection:** Feature engineering can involve selecting the most relevant and informative features from the dataset. Removing irrelevant or redundant features not only simplifies the model but also reduces the risk of overfitting.

2. **Creating New Features:** Sometimes, the existing features may not capture the underlying patterns effectively. In such cases, engineers create new features that provide additional information. For example:
   
   - **Polynomial Features:** Engineers may create new features by taking the square, cube, or other higher-order terms of existing numerical features. This can help capture nonlinear relationships.
   
   - **Interaction Features:** Interaction features are created by combining two or more existing features. For example, if you have features "length" and "width," you can create an "area" feature by multiplying them.

3. **Binning or Discretization:** Continuous numerical features can be divided into bins or categories. For instance, age values can be grouped into bins like "child," "adult," and "senior."

4. **Domain-Specific Feature Engineering:** Depending on the domain and problem, engineers may create domain-specific features. For example, in fraud detection, features related to transaction history and user behavior may be engineered to identify anomalies.

Feature engineering is both a creative and iterative process. It requires a deep understanding of the data, domain knowledge, and experimentation to determine which features will enhance the model's predictive power.

Note: Our group decided to do feature selection **after one-hot encoding**.

One of the most common method for feature engineering is feature selection, which is selecting only the most impactful features from the entire dataset (that is, features that has a huge impact on improving the model's performance).

However, to do feature selection we usually need to use a model for deciding which features to keep, and these models cannot receive non numerical attributes. Therefore, we decided to push feature selection further down until one-hot encoding is done to the dataset so that we can use the models for feature selection. For now, we will only drop the 'id' attribute which we can be sure it will not help our model to predict the target class.

In [1768]:
train_set = train_set.drop('id', axis=1)
val_set = val_set.drop('id', axis=1)

## B. Data Preprocessing

**Data preprocessing** is a broader step that encompasses both data cleaning and additional transformations to make the data suitable for machine learning algorithms. Its primary goals are:

1. **Feature Scaling:** Ensure that numerical features have similar scales. Common techniques include Min-Max scaling (scaling to a specific range) or standardization (mean-centered, unit variance).

2. **Encoding Categorical Variables:** Machine learning models typically work with numerical data, so categorical variables need to be encoded. This can be done using one-hot encoding, label encoding, or more advanced methods like target encoding.

3. **Handling Imbalanced Classes:** If dealing with imbalanced classes in a binary classification task, apply techniques such as oversampling, undersampling, or using different evaluation metrics to address class imbalance.

4. **Dimensionality Reduction:** Reduce the number of features using techniques like Principal Component Analysis (PCA) or feature selection to simplify the model and potentially improve its performance.

5. **Normalization:** Normalize data to achieve a standard distribution. This is particularly important for algorithms that assume normally distributed data.

### Notes on Preprocessing processes

It is advised to create functions or classes that have the same/similar type of inputs and outputs, so you can add, remove, or swap the order of the processes easily. You can implement the functions or classes by yourself

or

use `sklearn` library. To create a new preprocessing component in `sklearn`, implement a corresponding class that includes:
1. Inheritance to `BaseEstimator` and `TransformerMixin`
2. The method `fit`
3. The method `transform`

### I. Feature Scaling

**Feature scaling** is a preprocessing technique used in machine learning to standardize the range of independent variables or features of data. The primary goal of feature scaling is to ensure that all features contribute equally to the training process and that machine learning algorithms can work effectively with the data.

Here are the main reasons why feature scaling is important:

1. **Algorithm Sensitivity:** Many machine learning algorithms are sensitive to the scale of input features. If the scales of features are significantly different, some algorithms may perform poorly or take much longer to converge.

2. **Distance-Based Algorithms:** Algorithms that rely on distances or similarities between data points, such as k-nearest neighbors (KNN) and support vector machines (SVM), can be influenced by feature scales. Features with larger scales may dominate the distance calculations.

3. **Regularization:** Regularization techniques, like L1 (Lasso) and L2 (Ridge) regularization, add penalty terms based on feature coefficients. Scaling ensures that all features are treated equally in the regularization process.

Common methods for feature scaling include:

1. **Min-Max Scaling (Normalization):** This method scales features to a specific range, typically [0, 1]. It's done using the following formula:

   $$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

   - Here, $X$ is the original feature value, $X_{min}$ is the minimum value of the feature, and $X_{max}$ is the maximum value of the feature.  
<br />
<br />
2. **Standardization (Z-score Scaling):** This method scales features to have a mean (average) of 0 and a standard deviation of 1. It's done using the following formula:

   $$X' = \frac{X - \mu}{\sigma}$$

   - $X$ is the original feature value, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation of the feature.  
<br />
<br />
3. **Robust Scaling:** Robust scaling is a method that scales features to the interquartile range (IQR) and is less affected by outliers. It's calculated as:

   $$X' = \frac{X - Q1}{Q3 - Q1}$$

   - $X$ is the original feature value, $Q1$ is the first quartile (25th percentile), and $Q3$ is the third quartile (75th percentile) of the feature.  
<br />
<br />
4. **Log Transformation:** In cases where data is highly skewed or has a heavy-tailed distribution, taking the logarithm of the feature values can help stabilize the variance and improve scaling.

The choice of scaling method depends on the characteristics of your data and the requirements of your machine learning algorithm. **Min-max scaling and standardization are the most commonly used techniques and work well for many datasets.**

Scaling should be applied separately to each training and test set to prevent data leakage from the test set into the training set. Additionally, **some algorithms may not require feature scaling, particularly tree-based models.**

In [1769]:
class MinMaxScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.min_values = None
        self.max_values = None

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        numeric_columns = X.select_dtypes(include=['number'])
        self.numeric_features = numeric_columns.columns[numeric_columns.nunique() > 2]
        self.min_values = X[self.numeric_features].min()
        self.max_values = X[self.numeric_features].max()
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        X[self.numeric_features] = (
            (X[self.numeric_features] - self.min_values) /
            (self.max_values - self.min_values).replace(0, 1)
        )
        return X
    
    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.transform(X)

### II. Feature Encoding

**Feature encoding**, also known as **categorical encoding**, is the process of converting categorical data (non-numeric data) into a numerical format so that it can be used as input for machine learning algorithms. Most machine learning models require numerical data for training and prediction, so feature encoding is a critical step in data preprocessing.

Categorical data can take various forms, including:

1. **Nominal Data:** Categories with no intrinsic order, like colors or country names.  

2. **Ordinal Data:** Categories with a meaningful order but not necessarily equidistant, like education levels (e.g., "high school," "bachelor's," "master's").

There are several common methods for encoding categorical data:

1. **Label Encoding:**

   - Label encoding assigns a unique integer to each category in a feature.
   - It's suitable for ordinal data where there's a clear order among categories.
   - For example, if you have an "education" feature with values "high school," "bachelor's," and "master's," you can encode them as 0, 1, and 2, respectively.
<br />
<br />
2. **One-Hot Encoding:**

   - One-hot encoding creates a binary (0 or 1) column for each category in a nominal feature.
   - It's suitable for nominal data where there's no inherent order among categories.
   - Each category becomes a new feature, and the presence (1) or absence (0) of a category is indicated for each row.
<br />
<br />
3. **Target Encoding (Mean Encoding):**

   - Target encoding replaces each category with the mean of the target variable for that category.
   - It's often used for classification problems.

In [1770]:
class OneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.non_numeric_columns = {}

    def fit(self, X, y=None):
        df = pd.DataFrame(X)
        df = df.apply(pd.to_numeric, errors='ignore')
        non_numeric_columns = df.select_dtypes(exclude='number')
        self.non_numeric_columns = non_numeric_columns.columns
        
        return self

    def transform(self, X):
        df = pd.DataFrame(X)
        df_encoded = pd.get_dummies(df, columns=self.non_numeric_columns)

        return df_encoded

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)

### Feature Selection

In [1771]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
import pandas as pd

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, estimator=None):
        self.estimator = estimator if estimator else RandomForestClassifier(n_estimators=100)
        self.selector = SelectFromModel(self.estimator)
        self.label_encoder = LabelEncoder()

    def fit(self, X, y):
        if y.dtype == 'object':
            y = self.label_encoder.fit_transform(y)
        
        self.selector.fit(X, y)
        return self

    def transform(self, X):
        X_selected_array = self.selector.transform(X)
        selected_features = X.columns[self.selector.get_support()]
        X_selected = pd.DataFrame(X_selected_array, columns=selected_features, index=X.index)
        return X_selected

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)

### III. Handling Imbalanced Dataset

**Handling imbalanced datasets** is important because imbalanced data can lead to several issues that negatively impact the performance and reliability of machine learning models. Here are some key reasons:

1. **Biased Model Performance**:

 - Models trained on imbalanced data tend to be biased towards the majority class, leading to poor performance on the minority class. This can result in misleading accuracy metrics.

2. **Misleading Accuracy**:

 - High overall accuracy can be misleading in imbalanced datasets. For example, if 95% of the data belongs to one class, a model that always predicts the majority class will have 95% accuracy but will fail to identify the minority class.

3. **Poor Generalization**:

 - Models trained on imbalanced data may not generalize well to new, unseen data, especially if the minority class is underrepresented.


Some methods to handle imbalanced datasets:
1. **Resampling Methods**:

 - Oversampling: Increase the number of instances in the minority class by duplicating or generating synthetic samples (e.g., SMOTE).
 - Undersampling: Reduce the number of instances in the majority class to balance the dataset.

2. **Evaluation Metrics**:

 - Use appropriate evaluation metrics such as precision, recall, F1-score, ROC-AUC, and confusion matrix instead of accuracy to better assess model performance on imbalanced data.

3. **Algorithmic Approaches**:

 - Use algorithms that are designed to handle imbalanced data, such as decision trees, random forests, or ensemble methods.
 - Adjust class weights in algorithms to give more importance to the minority class.

In [1772]:
class Undersampler(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.undersampler = RandomUnderSampler(random_state=42)
        self.y_resampled = None

    def fit(self, X, y):
        _, self.y_resampled = self.undersampler.fit_resample(X, y)
        return self

    def transform(self, X):
        X_resampled, y_resampled = self.undersampler.fit_resample(X, self.y_resampled)
        return X_resampled, y_resampled

### IV. Data Normalization

Data normalization is used to achieve a standard distribution. Without normalization, models or processes that rely on the assumption of normality may not work correctly. Normalization helps reduce the magnitude effect and ensures numerical stability during optimization.

In [1773]:
class DataNormalizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.transformer = QuantileTransformer(output_distribution='normal')
        self.selected_columns = None

    def fit(self, X, y=None):
        numeric_columns = X.select_dtypes(include='number')
        binary_columns = numeric_columns.columns[(numeric_columns.nunique() == 2)]
        self.selected_columns = numeric_columns.drop(columns=binary_columns).columns
        X[self.selected_columns] = X[self.selected_columns] + 1e-6
        self.transformer.fit(X[self.selected_columns])
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        X[self.selected_columns] = self.transformer.transform(X[self.selected_columns])
        
        for col in X.select_dtypes(include=['object']).columns:
            X[col] = pd.to_numeric(X[col], errors='coerce')
        
        for col in X.select_dtypes(include=['int64', 'float64']).columns:
            if X[col].isin([0, 1]).all():
                X[col] = X[col].astype(bool)
        
        return X
    
    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)

### V. Dimensionality Reduction

Dimensionality reduction is a technique used in data preprocessing to reduce the number of input features (dimensions) in a dataset while retaining as much important information as possible. It is essential when dealing with high-dimensional data, where too many features can cause problems like increased computational costs, overfitting, and difficulty in visualization. Reducing dimensions simplifies the data, making it easier to analyze and improving the performance of machine learning models.

One of the main approaches to dimensionality reduction is feature extraction. Feature extraction creates new, smaller sets of features that capture the essence of the original data. Common techniques include:

1. **Principal Component Analysis (PCA)**: Converts correlated features into a smaller number of uncorrelated "principal components."
2. **t-SNE (t-Distributed Stochastic Neighbor Embedding)**: A visualization-focused method to project high-dimensional data into 2D or 3D spaces.
3. **Autoencoders**: Neural networks that learn compressed representations of the data.

In [1774]:
class DimensionalityReducer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.pca = PCA(n_components=2)

    def fit(self, X, y=None):
        numeric_columns = X.select_dtypes(include='number')
        binary_columns = numeric_columns.columns[(numeric_columns.nunique() == 2)]
        self.selected_columns = numeric_columns.drop(columns=binary_columns)
        
        return self

    def transform(self, X):
        df = pd.DataFrame(X)
        df = df.dropna()
        result = self.pca.fit_transform(df[self.selected_columns.columns])
        result_df = pd.DataFrame(result, columns=[f'PC{i+1}' for i in range(result.shape[1])], index=df.index)
        non_numeric_columns = df.drop(columns=self.selected_columns)
        result_df = pd.concat([non_numeric_columns, result_df], axis=1)

        return result_df
    
    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)

# 3. Compile Preprocessing Pipeline

All of the preprocessing classes or functions defined earlier will be compiled in this step.

If you use sklearn to create preprocessing classes, you can list your preprocessing classes in the Pipeline object sequentially, and then fit and transform your data.

In [1775]:
from sklearn.pipeline import Pipeline

# Pipeline for data preprocessing
pipe = Pipeline([("scaler", MinMaxScaler()),
                 ("encoder", OneHotEncoder()),
                 ("normalizer", DataNormalizer()),
                 ("reductor", DimensionalityReducer())])

In [1776]:
x_train = train_set.drop('attack_cat', axis=1)
y_train = train_set['attack_cat']

x_val = val_set.drop('attack_cat', axis=1)
y_val = val_set['attack_cat']

sampler = Undersampler()
x_train, y_train = sampler.fit_transform(x_train, y_train)
x_val, y_val = sampler.fit_transform(x_val, y_val)

x_train_processed = pipe.fit(x_train, y_train).transform(x_train)
x_val_processed = pipe.fit(x_val, y_val).transform(x_val)

y_train_processed = y_train
y_val_processed = y_val

# Hasil akhir preprocessing

train_set = pd.concat([x_train_processed, y_train_processed], axis=1)
val_set = pd.concat([x_val_processed, y_val_processed], axis=1)

missing_data_train = train_set[train_set.isnull().any(axis=1)]
missing_data_val = val_set[val_set.isnull().any(axis=1)]

train_set.head()

  df = df.apply(pd.to_numeric, errors='ignore')
  df = df.apply(pd.to_numeric, errors='ignore')


Unnamed: 0,is_sm_ips_ports,is_ftp_login,label,swin,dwin,proto_aes-sp3-d,proto_any,proto_arp,proto_br-sat-mon,proto_cftp,...,service_dns,service_ftp,service_ftp-data,service_http,service_pop3,service_smtp,service_ssh,PC1,PC2,attack_cat
0,0.0,0.0,0,255.0,255.0,False,False,False,False,False,...,False,False,False,False,False,False,False,12.84171,-0.401048,Analysis
1,0.0,0.0,1,0.0,0.0,False,False,False,False,False,...,False,False,False,False,False,False,False,-8.714159,-3.815737,Analysis
2,0.0,0.0,1,0.0,0.0,False,False,False,False,False,...,True,False,False,False,False,False,False,-10.82807,-0.871104,Analysis
3,0.0,0.0,1,0.0,0.0,False,False,False,False,False,...,True,False,False,False,False,False,False,-12.069625,2.948029,Analysis
4,0.0,0.0,1,255.0,255.0,False,False,False,False,False,...,False,False,False,True,False,False,False,5.271891,7.377483,Analysis


In [1777]:
x_train = train_set.drop('attack_cat', axis=1)
y_train = train_set['attack_cat']

x_val = val_set.drop('attack_cat', axis=1)
y_val = val_set['attack_cat']

selector = FeatureSelector()

x_train_processed = selector.fit(x_train, y_train).transform(x_train)
x_val_processed = selector.fit(x_val, y_val).transform(x_val)

# Hasil akhir Feature Selection
train_set = pd.concat([x_train_processed, y_train_processed], axis=1)
val_set = pd.concat([x_val_processed, y_val_processed], axis=1)

print(f"Hasil kolom train_set yang terpilih: {train_set.columns}")
print(f"Hasil kolom val_set yang terpilih: {train_set.columns}")

Hasil kolom train_set yang terpilih: Index(['label', 'PC1', 'PC2', 'attack_cat'], dtype='object')
Hasil kolom val_set yang terpilih: Index(['label', 'PC1', 'PC2', 'attack_cat'], dtype='object')


In [1778]:
print(x_train_processed)

      label        PC1       PC2
0       0.0  12.841710 -0.401048
1       1.0  -8.714159 -3.815737
2       1.0 -10.828070 -0.871104
3       1.0 -12.069625  2.948029
4       1.0   5.271891  7.377483
...     ...        ...       ...
1010    1.0   9.053380 -7.250072
1020    1.0  10.856322 -3.951538
1009    1.0   5.614910 -8.451905
991     0.0   6.890463  8.751084
1053    1.0   7.506988 -7.437839

[1020 rows x 3 columns]


# 4. Modeling and Validation

Modelling is the process of building your own machine learning models to solve specific problems, or in this assignment context, predicting the target feature `attack_cat`. Validation is the process of evaluating your trained model using the validation set or cross-validation method and providing some metrics that can help you decide what to do in the next iteration of development.

## A. KNN

Ubah kolom dengan dtype _object_ menjadi float.

In [1779]:
train_set_numeric = train_set.copy()

train_set_X = train_set_numeric.loc[:, train_set_numeric.columns != "attack_cat"]
train_set_y = train_set_numeric["attack_cat"]

test_input = train_set_numeric.iloc[0:10].copy()
print(f"Test input: {test_input}")

test_input = test_input.drop(columns = ["attack_cat"])

print(test_input)

Test input:    label        PC1       PC2 attack_cat
0    0.0  12.841710 -0.401048   Analysis
1    1.0  -8.714159 -3.815737   Analysis
2    1.0 -10.828070 -0.871104   Analysis
3    1.0 -12.069625  2.948029   Analysis
4    1.0   5.271891  7.377483   Analysis
5    1.0 -12.253926  4.214388   Analysis
6    1.0  12.655175 -6.011861   Analysis
7    1.0   9.200599 -6.433403   Analysis
8    0.0  12.525325  3.793069   Analysis
9    0.0  10.988831 -1.209004   Analysis
   label        PC1       PC2
0    0.0  12.841710 -0.401048
1    1.0  -8.714159 -3.815737
2    1.0 -10.828070 -0.871104
3    1.0 -12.069625  2.948029
4    1.0   5.271891  7.377483
5    1.0 -12.253926  4.214388
6    1.0  12.655175 -6.011861
7    1.0   9.200599 -6.433403
8    0.0  12.525325  3.793069
9    0.0  10.988831 -1.209004


KNN from scratch

In [1780]:
scratchKNN = KNN(10)
scratchKNN.fit(x_train_processed, y_train_processed)

print(scratchKNN.predict(test_input))

['Analysis', 'Analysis', 'Analysis', 'Analysis', 'Shellcode', 'Generic', 'Normal', 'Analysis', 'Analysis', 'Normal']


KNN scikit-learn

In [1781]:
scikitKNN = KNeighborsClassifier(n_neighbors = 10)
scikitKNN.fit(x_train_processed, y_train_processed)

print(scikitKNN.predict(test_input))

['Analysis' 'Analysis' 'Analysis' 'Analysis' 'Analysis' 'Generic'
 'Backdoor' 'Analysis' 'Analysis' 'Normal']


Validasi menggunakan k-fold cross validation

In [1782]:
kf = KFold(n_splits = 5, shuffle = True, random_state = 42)

scores_scratch_accuracy = cross_val_score(scratchKNN, x_val_processed, y_val_processed, cv = kf, scoring = "accuracy")
scores_scikit_accuracy = cross_val_score(scikitKNN, x_val_processed, y_val_processed, cv = kf, scoring = "accuracy")

scores_scratch_f1 = cross_val_score(scratchKNN, x_val_processed, y_val_processed, cv = kf, scoring = "f1_weighted")
scores_scikit_f1 = cross_val_score(scikitKNN, x_val_processed, y_val_processed, cv = kf, scoring = "f1_weighted")

print("KNN scratch:")
print(f"  Mean Accuracy: {scores_scratch_accuracy.mean():.4f}")
print(f"  Mean F1 Score: {scores_scratch_f1.mean():.4f}")
print("--------------------------------------------------------")
print("KNN scikit:")
print(f"  Mean Accuracy: {scores_scikit_accuracy.mean():.4f}")
print(f"  Mean F1 Score: {scores_scikit_f1.mean():.4f}")

KNN scratch:
  Mean Accuracy: 0.0929
  Mean F1 Score: 0.0891
--------------------------------------------------------
KNN scikit:
  Mean Accuracy: 0.0714
  Mean F1 Score: 0.0661


## B. Naive Bayes

### Naive Bayes From Scratch

In [1783]:
# Train Model
from gaussian_naive_bayes import NaiveBayes

train_X = x_train_processed.copy().astype({'PC1': float, 'PC2': float})
train_set_y = y_train_processed

scratchNB = NaiveBayes()
scratchNB.fit(train_X, train_set_y)

scratchNB.printNumericModel()
scratchNB.printCategoricalModel()

MEAN
label
    Analysis: 0.6568627450980392
    Backdoor: 0.5588235294117647
    DoS: 0.6862745098039216
    Exploits: 0.5882352941176471
    Fuzzers: 0.6568627450980392
    Generic: 0.696078431372549
    Normal: 0.6764705882352942
    Reconnaissance: 0.6176470588235294
    Shellcode: 0.696078431372549
    Worms: 0.6470588235294118
PC1
    Analysis: -0.2186873084152944
    Backdoor: 0.2769072814144496
    DoS: -0.020543468578057084
    Exploits: -0.20528388927193914
    Fuzzers: -0.06375170966234027
    Generic: -2.01681879548762
    Normal: 0.27269462250397225
    Reconnaissance: 0.4545192941013385
    Shellcode: -0.7664640270478664
    Worms: 2.2874280004433545
PC2
    Analysis: 0.10574727023241677
    Backdoor: 0.5388841785601965
    DoS: -0.2698839363324706
    Exploits: 0.41078698206431025
    Fuzzers: -0.029622702898913163
    Generic: -0.04010400221455159
    Normal: -0.12468068653527542
    Reconnaissance: -0.4295773651301254
    Shellcode: 0.04333992321082872
    Worms: -0.204

In [1784]:
# Predict
test_input = train_set.copy()
print(f"Test input: {test_input.iloc[1]}")

test_input = test_input.drop(columns = ["attack_cat"])

result = scratchNB.predict(test_input)
print([val for val in result])

Test input: label              1.0
PC1          -8.714159
PC2          -3.815737
attack_cat    Analysis
Name: 1, dtype: object
['Backdoor', 'Generic', 'Generic', 'Generic', 'Worms', 'Generic', 'Worms', 'Worms', 'Backdoor', 'Backdoor', 'Reconnaissance', 'Worms', 'Backdoor', 'Backdoor', 'Backdoor', 'DoS', 'Reconnaissance', 'Worms', 'Generic', 'Generic', 'Generic', 'Worms', 'Normal', 'Backdoor', 'Backdoor', 'Reconnaissance', 'DoS', 'Worms', 'Generic', 'Generic', 'Reconnaissance', 'Generic', 'Generic', 'Generic', 'DoS', 'Reconnaissance', 'Backdoor', 'Generic', 'DoS', 'Generic', 'Generic', 'Generic', 'Backdoor', 'Backdoor', 'Generic', 'Reconnaissance', 'DoS', 'DoS', 'Backdoor', 'Backdoor', 'Worms', 'Backdoor', 'Generic', 'Backdoor', 'DoS', 'Generic', 'Worms', 'Generic', 'Generic', 'Worms', 'Backdoor', 'Generic', 'Worms', 'Generic', 'Generic', 'Backdoor', 'Worms', 'Backdoor', 'Generic', 'Backdoor', 'Worms', 'Generic', 'Backdoor', 'Backdoor', 'Backdoor', 'Reconnaissance', 'DoS', 'Generic', 'E

### Naive Bayes scikit-learn

In [1785]:
from sklearn.naive_bayes import GaussianNB

train_X = x_train_processed.copy().astype({'PC1': float, 'PC2': float})
train_set_y = y_train_processed

scikitNB = GaussianNB()
scikitNB.fit(train_X, train_set_y)

print(scikitNB.predict(test_input))

['Backdoor' 'Generic' 'Generic' ... 'Worms' 'Exploits' 'Worms']


### Validasi menggunakan k-fold cross validation

In [1786]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

kf = KFold(n_splits = 5, shuffle = True, random_state = 42)

scores_scratch = cross_val_score(scratchNB, train_set_X, train_set_y, cv = kf, scoring = "accuracy")
scores_scikit = cross_val_score(scikitNB, train_set_X, train_set_y, cv = kf, scoring = "accuracy")

print("NB scratch:")
print(f"  Mean Accuracy: {scores_scratch.mean():.4f}")
print(f"  Standard Deviation: {scores_scratch.std():.4f}")
print("--------------------------------------------------------")
print("NB scikit:")
print(f"  Mean Accuracy: {scores_scikit.mean():.4f}")
print(f"  Standard Deviation: {scores_scikit.std():.4f}")

NB scratch:
  Mean Accuracy: 0.0882
  Standard Deviation: 0.0155
--------------------------------------------------------
NB scikit:
  Mean Accuracy: 0.0882
  Standard Deviation: 0.0155


## C. ID3

In [1787]:
# Type your code here

## D. Improvements (Optional)

- **Visualize the model evaluation result**

This will help you to understand the details more clearly about your model's performance. From the visualization, you can see clearly if your model is leaning towards a class than the others. (Hint: confusion matrix, ROC-AUC curve, etc.)

- **Explore the hyperparameters of your models**

Each models have their own hyperparameters. And each of the hyperparameter have different effects on the model behaviour. You can optimize the model performance by finding the good set of hyperparameters through a process called **hyperparameter tuning**. (Hint: Grid search, random search, bayesian optimization)

- **Cross-validation**

Cross-validation is a critical technique in machine learning and data science for evaluating and validating the performance of predictive models. It provides a more **robust** and **reliable** evaluation method compared to a hold-out (single train-test set) validation. Though, it requires more time and computing power because of how cross-validation works. (Hint: k-fold cross-validation, stratified k-fold cross-validation, etc.)

In [1788]:
# Type your code here

## E. Submission
To predict the test set target feature and submit the results to the kaggle competition platform, do the following:
1. Create a new pipeline instance identical to the first in Data Preprocessing
2. With the pipeline, apply `fit_transform` to the original training set before splitting, then only apply `transform` to the test set.
3. Retrain the model on the preprocessed training set
4. Predict the test set
5. Make sure the submission contains the `id` and `attack_cat` column.

In [1789]:
pipe = Pipeline([("scaler", MinMaxScaler()),
                 ("encoder", OneHotEncoder()),
                 ("normalizer", DataNormalizer()),
                 ("reducer", DimensionalityReducer())])

In [1790]:
df_additional = pd.read_csv('../dataset/test/additional_features_test.csv')
df_basic = pd.read_csv('../dataset/test/basic_features_test.csv')
df_content = pd.read_csv('../dataset/test/content_features_test.csv')
df_flow = pd.read_csv('../dataset/test/flow_features_test.csv')
df_time = pd.read_csv('../dataset/test/time_features_test.csv')

df_merged_test = (
    df_time
    .merge(df_flow, on = "id", how = "left")
    .merge(df_content, on = "id", how = "left")
    .merge(df_basic, on = "id", how = "left")
    .merge(df_additional, on = "id", how = "left")
)

df_merged_test.head()

Unnamed: 0,sjit,djit,sinpkt,dinpkt,tcprtt,synack,ackdat,id,proto,swin,...,ct_flw_http_mthd,is_ftp_login,ct_ftp_cmd,ct_srv_src,ct_srv_dst,ct_dst_ltm,ct_src_ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm
0,2737.954123,118.833969,48.756556,76.593602,0.165117,0.072001,0.093116,0,tcp,255.0,...,0.0,0.0,0.0,5.0,5.0,2.0,2.0,2.0,1.0,2.0
1,2938.299144,165.780563,49.812539,109.557602,0.223604,0.100248,0.123356,1,tcp,255.0,...,0.0,,0.0,6.0,6.0,1.0,1.0,1.0,1.0,5.0
2,4287.453629,129.471406,69.76553,94.395906,0.113189,0.082498,0.030691,2,tcp,255.0,...,0.0,0.0,0.0,4.0,4.0,1.0,2.0,1.0,1.0,4.0
3,0.0,0.0,0.001,0.0,0.0,0.0,0.0,3,udp,0.0,...,0.0,0.0,0.0,10.0,4.0,2.0,4.0,2.0,1.0,4.0
4,1119.063538,26.748141,17.628799,15.543294,0.000655,0.000526,0.000129,4,tcp,255.0,...,,0.0,0.0,13.0,11.0,10.0,7.0,6.0,1.0,7.0


In [1791]:
df_merged = impute(df_merged)
df_merged = replace_outliers_with_median(df_merged)
df_merged = df_merged.drop_duplicates(subset=[col for col in df_merged.columns if col != 'id'], keep='first')

df_merged_test = impute(df_merged_test)

In [1792]:
train_x = df_merged.drop(['attack_cat','label'], axis=1)
train_y = df_merged['attack_cat']

test_x = df_merged_test

In [1793]:
# Lakukan fit dan transform untuk data training
sampler = Undersampler()
train_x, train_y = sampler.fit_transform(train_x, train_y)
train_x = pipe.fit(train_x, train_y).transform(train_x)

# Hanya lakukan transform untuk data test
test_x = pipe.transform(test_x)

  df = df.apply(pd.to_numeric, errors='ignore')


In [1794]:
# Prediksi menggunakan KNN from scratch
scratchKNN = KNN(10)
scratchKNN.fit(train_x, train_y)

test_input = test_x[:200]

print(scratchKNN.predict(test_input))

['Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'Analysis', 'An

# 6. Error Analysis

Based on all the process you have done until the modeling and evaluation step, write an analysis to support each steps you have taken to solve this problem. Write the analysis using the markdown block. Some questions that may help you in writing the analysis:

- Does my model perform better in predicting one class than the other? If so, why is that?
- To each models I have tried, which performs the best and what could be the reason?
- Is it better for me to impute or drop the missing data? Why?
- Does feature scaling help improve my model performance?
- etc...

`Provide your analysis here`