# IF3070 Foundations of Artificial Intelligence | Tugas Besar 2

This notebook serves as a template for the assignment. Please create a copy of this notebook to complete your work. You can add more code blocks, markdown blocks, or new sections if needed.


Group Number: 13

Group Members:
- Jonathan Wiguna (18222019)
- Naomi Pricilla Agustine (18222065)
- Harry Truman Suhalim (18222081)
- Micky Valentino (18222093)

## Import Libraries

In [6]:
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.utils.validation import check_is_fitted, check_array
from imblearn.combine import SMOTETomek
from datetime import datetime
import re
from urllib.parse import urlparse  # needed by the URL imputation cells below

## Import Dataset

In [7]:
train_df = pd.read_csv('https://drive.google.com/uc?id=1629Y3A-tpqB9Oi69heIUk6zPX3K4e1I5')

In [8]:
test_df = pd.read_csv('https://gist.githubusercontent.com/RunningPie/62a453218bd691c00d695f4d1fa05f38/raw/b9214382e6007dd6e267d7693e400e36c0fe4d77/test.csv')

In [None]:
train_df.describe()

In [None]:
test_df.describe()

In [11]:
train_df = train_df.drop(columns=['FILENAME'])
test_df = test_df.drop(columns=['FILENAME'])


In [12]:
categorical = ['TLD',  'Title', 'Domain','URL',  'IsDomainIP', 'HasObfuscation', 'IsHTTPS', 'HasTitle', 'HasFavicon', 'Robots', 'IsResponsive', 'HasDescription', 'HasExternalFormSubmit', 'HasSocialNet', 'HasSubmitButton', 'HasHiddenFields', 'HasPasswordField', 'Bank', 'Pay', 'Crypto', 'HasCopyrightInfo']

non_categorical = [col for col in train_df.columns if col not in categorical]

# 1. Split Training Set and Validation Set

Splitting off a validation set gives an early diagnostic of how well the trained model performs. The split is done before the preprocessing steps to **avoid data leakage between the sets**. If you want to use k-fold cross-validation, split the data later and do the cleaning and preprocessing separately for each split.

Note: For training, you should use the data contained in the `train` folder given by the TA. The `test` data is only used for kaggle submission.
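
As a rough illustration of the k-fold remark above, the sketch below keeps the scaler inside a `Pipeline` so that it is re-fitted on the training part of every fold. The data (`X_demo`, `y_demo`) is synthetic and purely hypothetical, and the choice of scaler, classifier, and `f1` scoring is an assumption for the example, not the setup used in this notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the assignment dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# The scaler is a pipeline step, so each fold fits it on that fold's
# training rows only and never sees the held-out rows.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="f1")
print("F1 per fold:", scores)
print("Mean F1:", scores.mean())
```

Fitting the scaler once on all rows before cross-validating would let statistics from the held-out rows influence training, which is the leakage the paragraph above warns about.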

In [13]:
train_set, val_set = train_test_split(train_df, test_size=0.3,stratify=train_df['label'], random_state=42)

In [None]:
train_set

In [None]:
val_set

# 2. Data Cleaning and Preprocessing

This is the first step to take once a data scientist has a general understanding of the data. Raw data is **seldom ready for training**, so it must be cleaned and formatted before a machine learning model can interpret it.

By performing data cleaning and preprocessing, you ensure that your dataset is ready for model training, leading to more accurate and reliable machine learning results. These steps are essential for transforming raw data into a format that machine learning algorithms can effectively learn from and make predictions.

We will suggest some common methods for you to try, but you only have to **implement at least one method for each process**. For each step you perform, **please explain why you did it. Write the explanation in a markdown cell under the code cell you wrote.**

## A. Data Cleaning

**Data cleaning** is the crucial first step in preparing your dataset for machine learning. Raw data collected from various sources is often messy and may contain errors, missing values, and inconsistencies. Data cleaning involves the following steps:

1. **Handling Missing Data:** Identify and address missing values in the dataset. This can include imputing missing values, removing rows or columns with excessive missing data, or using more advanced techniques like interpolation.

2. **Dealing with Outliers:** Identify and handle outliers, which are data points significantly different from the rest of the dataset. Outliers can be removed or transformed to improve model performance.

3. **Data Validation:** Check for data integrity and consistency. Ensure that data types are correct, categorical variables have consistent labels, and numerical values fall within expected ranges.

4. **Removing Duplicates:** Identify and remove duplicate rows, as they can skew the model's training process and evaluation metrics.

5. **Feature Engineering**: Create new features or modify existing ones to extract relevant information. This step can involve scaling, normalizing, or encoding features for better model interpretability.

### I. Handling Missing Data

Missing data can adversely affect the performance and accuracy of machine learning models. There are several strategies to handle missing data in machine learning:

1. **Data Imputation:**

    a. **Mean, Median, or Mode Imputation:** For numerical features, you can replace missing values with the mean, median, or mode of the non-missing values in the same feature. This method is simple and often effective when data is missing at random.

    b. **Constant Value Imputation:** You can replace missing values with a predefined constant value (e.g., 0) if it makes sense for your dataset and problem.

    c. **Imputation Using Predictive Models:** More advanced techniques involve using predictive models to estimate missing values. For example, you can train a regression model to predict missing numerical values or a classification model to predict missing categorical values.

2. **Deletion of Missing Data:**

    a. **Listwise Deletion:** In cases where the amount of missing data is relatively small, you can simply remove rows with missing values from your dataset. However, this approach can lead to a loss of valuable information.

    b. **Column (Feature) Deletion:** If a feature has a large number of missing values and is not critical for your analysis, you can consider removing that feature altogether.

3. **Domain-Specific Strategies:**

    a. **Domain Knowledge:** In some cases, domain knowledge can guide the imputation process. For example, if you know that missing values are related to a specific condition, you can impute them accordingly.

4. **Imputation Libraries:**

    a. **Scikit-Learn:** Scikit-Learn provides a `SimpleImputer` class that can handle basic imputation strategies like mean, median, and mode imputation.

    b. **Fancyimpute:** Fancyimpute is a Python library that offers more advanced imputation techniques, including matrix factorization, k-nearest neighbors, and deep learning-based methods.

The choice of imputation method should be guided by the nature of your data, the amount of missing data, the problem you are trying to solve, and the assumptions you are willing to make.
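
A minimal sketch of median and mode imputation with `SimpleImputer`, assuming two small hypothetical frames (`train_demo`, `val_demo`) with made-up columns `count` and `flag`. The imputers are fitted on the training rows only and then reused on the validation rows, which is one way to keep validation statistics out of the fitted values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; column names are illustrative only.
train_demo = pd.DataFrame({
    "count": [1.0, 2.0, np.nan, 100.0],
    "flag": [1.0, np.nan, 1.0, 0.0],
})
val_demo = pd.DataFrame({
    "count": [np.nan, 5.0],
    "flag": [np.nan, 0.0],
})

median_imputer = SimpleImputer(strategy="median")        # for counts
mode_imputer = SimpleImputer(strategy="most_frequent")   # for binary flags

# Learn the fill values from the training rows ...
train_demo["count"] = median_imputer.fit_transform(train_demo[["count"]]).ravel()
train_demo["flag"] = mode_imputer.fit_transform(train_demo[["flag"]]).ravel()

# ... and apply the same fill values to the validation rows.
val_demo["count"] = median_imputer.transform(val_demo[["count"]]).ravel()
val_demo["flag"] = mode_imputer.transform(val_demo[["flag"]]).ravel()

print(train_demo)
print(val_demo)
```

The median is used for the count column because a single extreme value (100 here) would pull a mean-based fill value far from the typical rows.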

Identify Missing Values

In [None]:
missing_percent = (train_set.isnull().sum() / len(train_set)) * 100
print("Percentage of missing values:")
print(missing_percent)

Data Imputation

In [None]:
if 'URL' in train_set.columns:
    most_frequent_url = train_set['URL'].mode()[0]

    mask_both_missing = pd.isna(train_set['IsHTTPS']) & pd.isna(train_set['Domain'])
    train_set.loc[mask_both_missing & pd.isna(train_set['URL']), 'URL'] = most_frequent_url

    mask_one_exists = ((pd.notna(train_set['IsHTTPS']) & pd.isna(train_set['Domain'])) | 
            (pd.isna(train_set['IsHTTPS']) & pd.notna(train_set['Domain'])))
    train_set.loc[mask_one_exists & pd.isna(train_set['URL']), 'URL'] = most_frequent_url

for index, row in train_set.iterrows():
    if pd.notna(row['URL']):
        try:
            train_set.at[index, 'IsHTTPS'] = 1 if row['URL'].startswith('https://') else 0
            
            parsed = urlparse(row['URL'])
            if not parsed.netloc:
                parsed = urlparse("http://" + row['URL'])
            
            domain = parsed.netloc or parsed.path

            if '.' in domain:
                tld = domain.split('.')[-1]
                
                if pd.isna(train_set.at[index, 'Domain']):
                    train_set.at[index, 'Domain'] = domain
                if pd.isna(train_set.at[index, 'TLD']):
                    train_set.at[index, 'TLD'] = tld
        except Exception:  # skip rows whose URL cannot be parsed
            continue
    elif pd.notna(row['Domain']) and pd.isna(row['TLD']):
        domain = row['Domain']
        if '.' in domain:
            train_set.at[index, 'TLD'] = domain.split('.')[-1]
    
    if pd.isna(row['URL']) and pd.notna(row['Domain']) and pd.notna(row['IsHTTPS']):
        protocol = 'https://' if row['IsHTTPS'] == 1 else 'http://'
        train_set.at[index, 'URL'] = f"{protocol}www.{row['Domain']}"

if 'URL' in train_set.columns:
    mask = train_set['URL'].notna()
    train_set.loc[mask, 'URLLength'] = train_set.loc[mask, 'URL'].str.len()
    
    train_set.loc[mask, 'NoOfLettersInURL'] = train_set.loc[mask, 'URL'].str.count('[a-zA-Z]')
    train_set.loc[mask, 'NoOfDegitsInURL'] = train_set.loc[mask, 'URL'].str.count('[0-9]')
    train_set.loc[mask, 'NoOfEqualsInURL'] = train_set.loc[mask, 'URL'].str.count('=')
    train_set.loc[mask, 'NoOfQMarkInURL'] = train_set.loc[mask, 'URL'].str.count(r'\?')
    train_set.loc[mask, 'NoOfAmpersandInURL'] = train_set.loc[mask, 'URL'].str.count('&')
    
    url_lengths = train_set.loc[mask, 'URL'].str.len()
    train_set.loc[mask, 'LetterRatioInURL'] = train_set.loc[mask, 'NoOfLettersInURL'] / url_lengths
    train_set.loc[mask, 'DegitRatioInURL'] = train_set.loc[mask, 'NoOfDegitsInURL'] / url_lengths
    
    special_chars_pattern = '[^a-zA-Z0-9=?&]'
    train_set.loc[mask, 'NoOfOtherSpecialCharsInURL'] = train_set.loc[mask, 'URL'].str.count(special_chars_pattern)
    train_set.loc[mask, 'SpacialCharRatioInURL'] = train_set.loc[mask, 'NoOfOtherSpecialCharsInURL'] / url_lengths
    
    char_series = train_set.loc[mask, 'URL']
    continuation_rates = []
    
    for url in char_series:
        if len(url) < 2:
            continuation_rates.append(0)
        else:
            continuous = sum(1 for i in range(len(url)-1) if url[i] == url[i+1])
            continuation_rates.append(continuous / (len(url)-1))
    
    train_set.loc[mask, 'CharContinuationRate'] = continuation_rates

if 'Domain' in train_set.columns:
    mask = train_set['Domain'].notna()
    train_set.loc[mask, 'DomainLength'] = train_set.loc[mask, 'Domain'].str.len()
    
    train_set.loc[mask, 'IsDomainIP'] = train_set.loc[mask, 'Domain'].apply(
        lambda x: float(bool(re.match(r'^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$', str(x)))))
    
    train_set.loc[mask, 'NoOfSubDomain'] = train_set.loc[mask, 'Domain'].str.count(r'\.')

if 'TLD' in train_set.columns:
    mask = train_set['TLD'].notna()
    train_set.loc[mask, 'TLDLength'] = train_set.loc[mask, 'TLD'].str.len()

mask = train_set['NoOfObfuscatedChar'] == 0
train_set.loc[mask & train_set['HasObfuscation'].isna(), 'HasObfuscation'] = 0

mask = train_set['NoOfObfuscatedChar'] > 0
train_set.loc[mask & train_set['HasObfuscation'].isna(), 'HasObfuscation'] = 1

mask = train_set['NoOfURLRedirect'].notna() & train_set['NoOfSelfRedirect'].notna()
invalid_mask = mask & (train_set['NoOfSelfRedirect'] > train_set['NoOfURLRedirect'])
train_set.loc[invalid_mask, 'NoOfSelfRedirect'] = train_set.loc[invalid_mask, 'NoOfURLRedirect']

mask = train_set['NoOfExternalRef'] > 0
train_set.loc[mask & train_set['HasExternalFormSubmit'].isna(), 'HasExternalFormSubmit'] = 1

mask = train_set['NoOfExternalRef'] == 0
train_set.loc[mask & train_set['HasExternalFormSubmit'].isna(), 'HasExternalFormSubmit'] = 0

mask = train_set['Title'].notna()
train_set.loc[mask & train_set['HasTitle'].isna(), 'HasTitle'] = 1
train_set.loc[~mask & train_set['HasTitle'].isna(), 'HasTitle'] = 0

boolean_cols = ['HasObfuscation', 'IsHTTPS', 'HasTitle',
                'HasFavicon', 'IsResponsive', 'HasDescription',
                'HasExternalFormSubmit', 'HasSocialNet', 'HasSubmitButton',
                'HasHiddenFields', 'HasPasswordField', 'HasCopyrightInfo']

count_cols = ['NoOfObfuscatedChar', 'NoOfURLRedirect', 'NoOfSelfRedirect', 'NoOfPopup',
              'NoOfiFrame', 'NoOfImage', 'NoOfCSS', 'NoOfJS',
              'NoOfSelfRef', 'NoOfEmptyRef', 'NoOfExternalRef']

for col in boolean_cols:
    if col in train_set.columns and train_set[col].isnull().any():
        train_set[col] = train_set[col].fillna(train_set[col].mode()[0])

for col in count_cols:
    if col in train_set.columns and train_set[col].isnull().any():
        train_set[col] = train_set[col].fillna(train_set[col].median())

numerical_cols = train_set.select_dtypes(include=['float64', 'int64']).columns
remaining_cols = [col for col in numerical_cols if col not in boolean_cols + count_cols]
for col in remaining_cols:
    if train_set[col].isnull().any():
        train_set[col] = train_set[col].fillna(train_set[col].median())

print("\nNumber of missing values after imputation:")
print(train_set.isnull().sum())

In [None]:
if 'URL' in val_set.columns:
    most_frequent_url = val_set['URL'].mode()[0]

    mask_both_missing = pd.isna(val_set['IsHTTPS']) & pd.isna(val_set['Domain'])
    val_set.loc[mask_both_missing & pd.isna(val_set['URL']), 'URL'] = most_frequent_url

    mask_one_exists = ((pd.notna(val_set['IsHTTPS']) & pd.isna(val_set['Domain'])) | 
            (pd.isna(val_set['IsHTTPS']) & pd.notna(val_set['Domain'])))
    val_set.loc[mask_one_exists & pd.isna(val_set['URL']), 'URL'] = most_frequent_url

for index, row in val_set.iterrows():
    if pd.notna(row['URL']):
        try:
            val_set.at[index, 'IsHTTPS'] = 1 if row['URL'].startswith('https://') else 0
            parsed = urlparse(row['URL'])
            if not parsed.netloc:
                parsed = urlparse("http://" + row['URL'])
            domain = parsed.netloc or parsed.path
            if '.' in domain:
                tld = domain.split('.')[-1]
                if pd.isna(val_set.at[index, 'Domain']):
                    val_set.at[index, 'Domain'] = domain
                if pd.isna(val_set.at[index, 'TLD']):
                    val_set.at[index, 'TLD'] = tld
        except Exception:  # skip rows whose URL cannot be parsed
            continue
    elif pd.notna(row['Domain']) and pd.isna(row['TLD']):
        domain = row['Domain']
        if '.' in domain:
            val_set.at[index, 'TLD'] = domain.split('.')[-1]
    if pd.isna(row['URL']) and pd.notna(row['Domain']) and pd.notna(row['IsHTTPS']):
        protocol = 'https://' if row['IsHTTPS'] == 1 else 'http://'
        val_set.at[index, 'URL'] = f"{protocol}www.{row['Domain']}"

if 'URL' in val_set.columns:
    mask = val_set['URL'].notna()
    val_set.loc[mask, 'URLLength'] = val_set.loc[mask, 'URL'].str.len()
    val_set.loc[mask, 'NoOfLettersInURL'] = val_set.loc[mask, 'URL'].str.count('[a-zA-Z]')
    val_set.loc[mask, 'NoOfDegitsInURL'] = val_set.loc[mask, 'URL'].str.count('[0-9]')
    val_set.loc[mask, 'NoOfEqualsInURL'] = val_set.loc[mask, 'URL'].str.count('=')
    val_set.loc[mask, 'NoOfQMarkInURL'] = val_set.loc[mask, 'URL'].str.count(r'\?')
    val_set.loc[mask, 'NoOfAmpersandInURL'] = val_set.loc[mask, 'URL'].str.count('&')
    url_lengths = val_set.loc[mask, 'URL'].str.len()
    val_set.loc[mask, 'LetterRatioInURL'] = val_set.loc[mask, 'NoOfLettersInURL'] / url_lengths
    val_set.loc[mask, 'DegitRatioInURL'] = val_set.loc[mask, 'NoOfDegitsInURL'] / url_lengths
    special_chars_pattern = '[^a-zA-Z0-9=?&]'
    val_set.loc[mask, 'NoOfOtherSpecialCharsInURL'] = val_set.loc[mask, 'URL'].str.count(special_chars_pattern)
    val_set.loc[mask, 'SpacialCharRatioInURL'] = val_set.loc[mask, 'NoOfOtherSpecialCharsInURL'] / url_lengths
    char_series = val_set.loc[mask, 'URL']
    continuation_rates = []
    for url in char_series:
        if len(url) < 2:
            continuation_rates.append(0)
        else:
            continuous = sum(1 for i in range(len(url)-1) if url[i] == url[i+1])
            continuation_rates.append(continuous / (len(url)-1))
    val_set.loc[mask, 'CharContinuationRate'] = continuation_rates

if 'Domain' in val_set.columns:
    mask = val_set['Domain'].notna()
    val_set.loc[mask, 'DomainLength'] = val_set.loc[mask, 'Domain'].str.len()
    val_set.loc[mask, 'IsDomainIP'] = val_set.loc[mask, 'Domain'].apply(
        lambda x: float(bool(re.match(r'^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$', str(x)))))
    val_set.loc[mask, 'NoOfSubDomain'] = val_set.loc[mask, 'Domain'].str.count(r'\.')

if 'TLD' in val_set.columns:
    mask = val_set['TLD'].notna()
    val_set.loc[mask, 'TLDLength'] = val_set.loc[mask, 'TLD'].str.len()

mask = val_set['NoOfObfuscatedChar'] == 0
val_set.loc[mask & val_set['HasObfuscation'].isna(), 'HasObfuscation'] = 0
mask = val_set['NoOfObfuscatedChar'] > 0
val_set.loc[mask & val_set['HasObfuscation'].isna(), 'HasObfuscation'] = 1

mask = val_set['NoOfURLRedirect'].notna() & val_set['NoOfSelfRedirect'].notna()
invalid_mask = mask & (val_set['NoOfSelfRedirect'] > val_set['NoOfURLRedirect'])
val_set.loc[invalid_mask, 'NoOfSelfRedirect'] = val_set.loc[invalid_mask, 'NoOfURLRedirect']

mask = val_set['NoOfExternalRef'] > 0
val_set.loc[mask & val_set['HasExternalFormSubmit'].isna(), 'HasExternalFormSubmit'] = 1
mask = val_set['NoOfExternalRef'] == 0
val_set.loc[mask & val_set['HasExternalFormSubmit'].isna(), 'HasExternalFormSubmit'] = 0

mask = val_set['Title'].notna()
val_set.loc[mask & val_set['HasTitle'].isna(), 'HasTitle'] = 1
val_set.loc[~mask & val_set['HasTitle'].isna(), 'HasTitle'] = 0

boolean_cols = ['HasObfuscation', 'IsHTTPS', 'HasTitle',
                'HasFavicon', 'IsResponsive', 'HasDescription',
                'HasExternalFormSubmit', 'HasSocialNet', 'HasSubmitButton',
                'HasHiddenFields', 'HasPasswordField', 'HasCopyrightInfo']

count_cols = ['NoOfObfuscatedChar', 'NoOfURLRedirect', 'NoOfSelfRedirect', 'NoOfPopup',
              'NoOfiFrame', 'NoOfImage', 'NoOfCSS', 'NoOfJS',
              'NoOfSelfRef', 'NoOfEmptyRef', 'NoOfExternalRef']

for col in boolean_cols:
    if col in val_set.columns and val_set[col].isnull().any():
        val_set[col] = val_set[col].fillna(val_set[col].mode()[0])

for col in count_cols:
    if col in val_set.columns and val_set[col].isnull().any():
        val_set[col] = val_set[col].fillna(val_set[col].median())

numerical_cols = val_set.select_dtypes(include=['float64', 'int64']).columns
remaining_cols = [col for col in numerical_cols if col not in boolean_cols + count_cols]
for col in remaining_cols:
    if val_set[col].isnull().any():
        val_set[col] = val_set[col].fillna(val_set[col].median())

print("\nNumber of missing values after imputation:")
print(val_set.isnull().sum())


### II. Dealing with Outliers

Outliers are data points that significantly differ from the majority of the data. They can be unusually high or low values that do not fit the pattern of the rest of the dataset. Outliers can significantly impact model performance, so it is important to handle them properly.

Some methods to handle outliers:
1. **Imputation**: Replace with mean, median, or a boundary value.
2. **Clipping**: Cap values to upper and lower limits.
3. **Transformation**: Use log, square root, or power transformations to reduce their influence.
4. **Model-Based**: Use algorithms robust to outliers (e.g., tree-based models, Huber regression).
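
The clipping and transformation options above can be sketched on a small hypothetical series (`values` is made up for illustration). The 1.5 × IQR fences are one common convention, not the only possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical feature values with one extreme point.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

# Clipping: cap values at the 1.5 * IQR fences.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
clipped = values.clip(lower=lower, upper=upper)

# Transformation: log1p compresses large values and keeps the ordering.
logged = np.log1p(values)

print("fences:", lower, upper)
print(clipped.tolist())
print(logged.round(3).tolist())
```

Clipping keeps every row but flattens the extreme value to the fence, while the log transform keeps the relative ordering of all values and only shrinks the gap.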

In [19]:
import numpy as np

def handle_outliers(df, non_categorical, label_value, threshold_percentage=0.03):
    subset = df[df['label'] == label_value].copy()
    outlier = True
    while outlier:
        change = False
        for col in non_categorical:
            if subset[col].skew() > 1 or subset[col].skew() < -1:
                Q1 = subset[col].quantile(0.25)
                Q3 = subset[col].quantile(0.75)
                IQR = Q3 - Q1
                outlier_indices = subset[(subset[col] < Q1 - 1.5 * IQR) | (subset[col] > Q3 + 1.5 * IQR)].index
            else:
                z_scores = np.abs((subset[col] - subset[col].mean()) / subset[col].std())
                outlier_indices = subset[z_scores > 3].index
            
            outlier_percentage = len(outlier_indices) / len(subset)
            if 0 < outlier_percentage < threshold_percentage:
                subset = subset.drop(outlier_indices)
                change = True
            elif outlier_percentage >= threshold_percentage:
                Q1 = subset[col].quantile(0.25)
                Q3 = subset[col].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR

                subset[col] = np.where(subset[col] < lower_bound, lower_bound, subset[col])
                subset[col] = np.where(subset[col] > upper_bound, upper_bound, subset[col])
        
        print(f"change for label {label_value}: {change}")
        if not change:
            outlier = False
        
        subset.reset_index(drop=True, inplace=True)
    
    return subset


Outliers are handled separately for each label so that label-0 rows are not deleted merely because they look like outliers relative to the other class.

In [None]:
train_label_1 = handle_outliers(train_set, non_categorical, label_value=1)
train_label_0 = handle_outliers(train_set, non_categorical, label_value=0)

train_set_cleaned = pd.concat([train_label_1, train_label_0], ignore_index=True)

val_label_1 = handle_outliers(val_set, non_categorical, label_value=1)

val_label_0 = handle_outliers(val_set, non_categorical, label_value=0)
val_set_cleaned = pd.concat([val_label_1, val_label_0], ignore_index=True)

print("\ntraining set label distribution after outlier handling:")
print(train_set_cleaned['label'].value_counts())

print("\nval set label distribution after outlier handling:")
print(val_set_cleaned['label'].value_counts())

train_set = train_set_cleaned
val_set = val_set_cleaned


### III. Remove Duplicates
Handling duplicate values is crucial because they can compromise data integrity, leading to inaccurate analysis and insights. Duplicate entries can bias machine learning models, causing overfitting and reducing their ability to generalize to new data. They also inflate the dataset size unnecessarily, increasing computational costs and processing times. Additionally, duplicates can distort statistical measures and lead to inconsistencies, ultimately affecting the reliability of data-driven decisions and reporting. Ensuring data quality by removing duplicates is essential for accurate, efficient, and consistent analysis.

In [None]:
jumlah_duplikat = train_set.duplicated().sum()
print(f"Number of duplicate rows: {jumlah_duplikat}")

In [None]:
jumlah_duplikat = val_set.duplicated().sum()
print(f"Number of duplicate rows: {jumlah_duplikat}")

In [23]:
train_set = train_set.drop_duplicates()

In [24]:
val_set = val_set.drop_duplicates()

In [None]:
jumlah_duplikat = train_set.duplicated().sum()
print(f"Number of duplicate rows: {jumlah_duplikat}")

In [None]:
jumlah_duplikat = val_set.duplicated().sum()
print(f"Number of duplicate rows: {jumlah_duplikat}")

### IV. Feature Engineering

**Feature engineering** involves creating new features (input variables) or transforming existing ones to improve the performance of machine learning models. Feature engineering aims to enhance the model's ability to learn patterns and make accurate predictions from the data. It's often said that "good features make good models."

1. **Feature Selection:** Feature engineering can involve selecting the most relevant and informative features from the dataset. Removing irrelevant or redundant features not only simplifies the model but also reduces the risk of overfitting.

2. **Creating New Features:** Sometimes, the existing features may not capture the underlying patterns effectively. In such cases, engineers create new features that provide additional information. For example:
   
   - **Polynomial Features:** Engineers may create new features by taking the square, cube, or other higher-order terms of existing numerical features. This can help capture nonlinear relationships.
   
   - **Interaction Features:** Interaction features are created by combining two or more existing features. For example, if you have features "length" and "width," you can create an "area" feature by multiplying them.

3. **Binning or Discretization:** Continuous numerical features can be divided into bins or categories. For instance, age values can be grouped into bins like "child," "adult," and "senior."

4. **Domain-Specific Feature Engineering:** Depending on the domain and problem, engineers may create domain-specific features. For example, in fraud detection, features related to transaction history and user behavior may be engineered to identify anomalies.

Feature engineering is both a creative and iterative process. It requires a deep understanding of the data, domain knowledge, and experimentation to determine which features will enhance the model's predictive power.
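
The polynomial, interaction, and binning ideas listed above can be sketched with plain pandas. The frame `demo`, its columns, and the age bin edges are hypothetical and chosen only to mirror the examples in the text.

```python
import pandas as pd

# Hypothetical toy data mirroring the "length", "width", and "age" examples.
demo = pd.DataFrame({
    "length": [2.0, 3.0, 4.0],
    "width": [1.0, 2.0, 5.0],
    "age": [8, 35, 70],
})

# Polynomial feature: a higher-order term of an existing column.
demo["length_squared"] = demo["length"] ** 2

# Interaction feature: the product of two existing columns.
demo["area"] = demo["length"] * demo["width"]

# Binning: intervals are (0, 17], (17, 64], (64, 120] (assumed edges).
demo["age_group"] = pd.cut(
    demo["age"],
    bins=[0, 17, 64, 120],
    labels=["child", "adult", "senior"],
)

print(demo)
```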

In [None]:
print(len(train_df.columns))

In [None]:
# Create new features from URL characteristics
if 'URL' in val_set.columns:
   # Count special characters
   val_set['special_char_count'] = val_set['URL'].str.count(r'[^a-zA-Z0-9]').fillna(0)
   # Count digits
   val_set['digit_count'] = val_set['URL'].str.count(r'[0-9]').fillna(0)

# Create an interaction feature
if 'URLLength' in val_set.columns and 'DomainLength' in val_set.columns:
   val_set['url_domain_ratio'] = val_set['URLLength'] / val_set['DomainLength'].replace(0, 1)

# Create a binary feature from the TLD
if 'TLD' in val_set.columns:
   common_tlds = ['com', 'org', 'net', 'edu']
   val_set['is_common_tld'] = val_set['TLD'].isin(common_tlds).astype(int)

# Create aggregate features
security_cols = ['HasObfuscation', 'IsHTTPS', 'HasHiddenFields']
val_set['security_score'] = val_set[security_cols].sum(axis=1)

# Check the new features
print("\nNumber of columns after feature engineering:", len(val_set.columns))
print("\nSample rows with the new features:")
print(val_set[['special_char_count', 'digit_count', 'url_domain_ratio',
          'is_common_tld', 'security_score']].head())

In [None]:
# Define the features to drop with reasons
features_to_drop = {
    # high correlation
    'NoOfLettersInURL': 'almost perfect correlation with URLLength (0.998)',
    'NoOfOtherSpecialCharsInURL': 'strong correlation with URLLength (0.876)',
    'DomainLength': 'strong correlation with URLLength (0.845)',
    
    # zero variance, does not contribute to modeling
    'NoOfObfuscatedChar': 'constant, std 0',
    'URLTitleMatchScore': 'constant, std 0',
    'NoOfURLRedirect': 'constant, std 0',
    'ObfuscationRatio': 'constant, std 0',
    'DomainTitleMatchScore': 'constant, std 0',
    'DegitRatioInURL': 'constant, std 0',
    'NoOfEqualsInURL': 'constant, std 0',
    'HasObfuscation': 'constant, std 0',
    'NoOfSelfRedirect': 'constant, std 0',
    'NoOfPopup': 'constant, std 0',
    'NoOfQMarkInURL': 'constant, std 0',
    'NoOfAmpersandInURL': 'constant, std 0',
    'NoOfDegitsInURL': 'constant, std 0',
    'NoOfSubDomain': 'constant, std 0',
    
    # already represented by other features, or raw text
    'Domain': 'raw text',
    'URL': 'raw text',
    'Title': 'raw text',
    'LetterRatioInURL': 'derived',
    'SpacialCharRatioInURL': 'derived',
    
    # too many missing values, low mutual information
    'NoOfCSS': 'High missing values, low importance',
    'HasCopyrightInfo': 'High missing values, low importance',
    'HasPasswordField': 'High missing values, low importance',
    'HasSocialNet': 'High missing values, low importance',
    'NoOfExternalRef': 'High missing, info captured in HasExternalFormSubmit'
}

# drop the listed features
train_set = train_set.drop(columns=list(features_to_drop.keys()), errors='ignore')
val_set = val_set.drop(columns=list(features_to_drop.keys()), errors='ignore')

# keep only the columns present in both sets
common_columns = train_set.columns.intersection(val_set.columns)
train_set = train_set[common_columns]
val_set = val_set[common_columns]

# verify
print("\nnumber of remaining features:", len(train_set.columns))
print("training set:", train_set.columns.tolist())
print("validation set:", val_set.columns.tolist())


# shapes
print("\nNew shapes:")
print(f"training set: {train_set.shape}")
print(f"validation set: {val_set.shape}")


In [None]:
def engineer_features(dataset):
    features = dataset.copy()
    
    # UI content (also taken from the Tucil, the earlier small assignment)
    ui_features = ['HasFavicon',  'HasSubmitButton', 'HasTitle', 'HasDescription', 'IsResponsive']
    features['UIQualityScore'] = features[ui_features].sum(axis=1)
    
    # money and payment related
    financial_features = ['Bank', 'Pay', 'Crypto']
    features['FinancialRiskScore'] = (
        features[financial_features].sum(axis=1) * 
        (1 - features['IsHTTPS'])  # riskier when the site is not HTTPS
    )
    
    # domain
    features['DomainTrustScore'] = (
        features['TLDLegitimateProb'] * 
        (1 + features['IsHTTPS']) *  # bonus when the site uses HTTPS
        (1 - features['IsDomainIP'])  # an IP-address domain is very suspicious, so the score drops to 0
    )
    
    # code complexity
    features['ContentComplexity'] = (
        np.log1p(features['LineOfCode']) * 
        np.log1p(features['LargestLineLength'])
    )
    
    # reference pattern
    features['RefPatternScore'] = (
        features['NoOfSelfRef'] / 
        (features['NoOfSelfRef'] + features['NoOfEmptyRef']).clip(lower=1)
    )
    
    return features

train_set_engineered = engineer_features(train_set)
val_set_engineered = engineer_features(val_set)

train_set = train_set_engineered
val_set = val_set_engineered

new_features = [
    'UIQualityScore',
    'FinancialRiskScore',
    'DomainTrustScore',
    'ContentComplexity',
    'RefPatternScore'
]

print("\ncorrelation with label:")
correlations = train_set_engineered[new_features + ['label']].corr()['label'].drop('label')
print(correlations.sort_values(ascending=False))

print("\nnew features stats:")
print(train_set_engineered[new_features].describe())

# recheck label distribution
print("label information:")
print("Type:", train_set['label'].dtype)
print("\nvalue counts:")
print(train_set['label'].value_counts())


In [None]:
selected_features = [
    'RefPatternScore',
    'UIQualityScore',
    'ContentComplexity'
]

print("\ncorrelations with label:")
correlations = train_set_engineered[selected_features + ['label']].corr()['label'].drop('label')
print(correlations.sort_values(ascending=False))

# drop the engineered features with poor correlation to the label
train_set = train_set.drop(columns = ['DomainTrustScore', 'FinancialRiskScore'])
val_set = val_set.drop(columns = ['DomainTrustScore', 'FinancialRiskScore'])


## B. Data Preprocessing

**Data preprocessing** is a broader step that encompasses both data cleaning and additional transformations to make the data suitable for machine learning algorithms. Its primary goals are:

1. **Feature Scaling:** Ensure that numerical features have similar scales. Common techniques include Min-Max scaling (scaling to a specific range) or standardization (mean-centered, unit variance).

2. **Encoding Categorical Variables:** Machine learning models typically work with numerical data, so categorical variables need to be encoded. This can be done using one-hot encoding, label encoding, or more advanced methods like target encoding.

3. **Handling Imbalanced Classes:** If dealing with imbalanced classes in a binary classification task, apply techniques such as oversampling, undersampling, or using different evaluation metrics to address class imbalance.

4. **Dimensionality Reduction:** Reduce the number of features using techniques like Principal Component Analysis (PCA) or feature selection to simplify the model and potentially improve its performance.

5. **Normalization:** Normalize data to achieve a standard distribution. This is particularly important for algorithms that assume normally distributed data.

Drop Useless Columns

### Notes on Preprocessing processes

It is advised to create functions or classes that have the same or similar types of inputs and outputs, so you can add, remove, or reorder the processes easily. You can implement the functions or classes yourself, or use the `sklearn` library. To create a new preprocessing component in `sklearn`, implement a corresponding class that includes:
1. Inheritance to `BaseEstimator` and `TransformerMixin`
2. The method `fit`
3. The method `transform`
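
As a minimal sketch of the three requirements listed above, the hypothetical `ColumnDropper` below removes a fixed list of columns. The class name, the `demo` data, and the step name `drop_b` are made up for illustration and are not part of this notebook's pipeline.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ColumnDropper(BaseEstimator, TransformerMixin):
    """Hypothetical example transformer: drops the given columns."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; remember which requested columns actually exist
        self.columns_ = [c for c in (self.columns or []) if c in X.columns]
        return self

    def transform(self, X):
        # Return a new DataFrame without the remembered columns
        return X.drop(columns=self.columns_)


demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "label": [0, 1]})
demo_pipe = Pipeline([("drop_b", ColumnDropper(columns=["b"]))])
demo_out = demo_pipe.fit_transform(demo)
print(demo_out.columns.tolist())
```

Because every step takes a DataFrame and returns a DataFrame, steps like this can be reordered or removed in the `Pipeline` without touching the others.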

### I. Feature Scaling

**Feature scaling** is a preprocessing technique used in machine learning to standardize the range of independent variables or features of data. The primary goal of feature scaling is to ensure that all features contribute equally to the training process and that machine learning algorithms can work effectively with the data.

Here are the main reasons why feature scaling is important:

1. **Algorithm Sensitivity:** Many machine learning algorithms are sensitive to the scale of input features. If the scales of features are significantly different, some algorithms may perform poorly or take much longer to converge.

2. **Distance-Based Algorithms:** Algorithms that rely on distances or similarities between data points, such as k-nearest neighbors (KNN) and support vector machines (SVM), can be influenced by feature scales. Features with larger scales may dominate the distance calculations.

3. **Regularization:** Regularization techniques, like L1 (Lasso) and L2 (Ridge) regularization, add penalty terms based on feature coefficients. Scaling ensures that all features are treated equally in the regularization process.

Common methods for feature scaling include:

1. **Min-Max Scaling (Normalization):** This method scales features to a specific range, typically [0, 1]. It's done using the following formula:

   $$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

   - Here, $X$ is the original feature value, $X_{min}$ is the minimum value of the feature, and $X_{max}$ is the maximum value of the feature.  
<br />
<br />
2. **Standardization (Z-score Scaling):** This method scales features to have a mean (average) of 0 and a standard deviation of 1. It's done using the following formula:

   $$X' = \frac{X - \mu}{\sigma}$$

   - $X$ is the original feature value, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation of the feature.  
<br />
<br />
3. **Robust Scaling:** Robust scaling centers each feature on its median and scales it by the interquartile range (IQR), so it is less affected by outliers. It's calculated as:

   $$X' = \frac{X - \text{median}(X)}{Q3 - Q1}$$

   - $X$ is the original feature value, $\text{median}(X)$ is the median of the feature, $Q1$ is the first quartile (25th percentile), and $Q3$ is the third quartile (75th percentile) of the feature.  
<br />
<br />
4. **Log Transformation:** In cases where data is highly skewed or has a heavy-tailed distribution, taking the logarithm of the feature values can help stabilize the variance and improve scaling.

The choice of scaling method depends on the characteristics of your data and the requirements of your machine learning algorithm. **Min-max scaling and standardization are the most commonly used techniques and work well for many datasets.**

To prevent data leakage, fit the scaler on the training set only, then apply the fitted scaler to the validation and test sets. Additionally, **some algorithms may not require feature scaling, particularly tree-based models.**
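
The scaling methods above can be checked by hand on a toy column. The sketch below uses NumPy only; the arrays `train` and `val` are made-up values. All statistics are computed from `train` and then reused on `val`, which is the pattern that keeps validation data out of the fitted parameters. The robust variant here centers on the median, as scikit-learn's `RobustScaler` does.

```python
import numpy as np

train = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up values with one outlier
val = np.array([2.0, 5.0])

# Min-max scaling: minimum and maximum come from the training data only
x_min, x_max = train.min(), train.max()
val_minmax = (val - x_min) / (x_max - x_min)

# Standardization (z-score): mean and standard deviation from the training data
mu, sigma = train.mean(), train.std()
val_zscore = (val - mu) / sigma

# Robust scaling: center on the median, divide by the interquartile range
q1, median, q3 = np.percentile(train, [25, 50, 75])
val_robust = (val - median) / (q3 - q1)

print("min-max:", val_minmax)
print("z-score:", val_zscore)
print("robust :", val_robust)
```

The single outlier (100) stretches the min-max range and inflates the standard deviation, while the median and IQR are unaffected by it.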

In [32]:
class NumericStandardScaler(BaseEstimator, TransformerMixin):
    def __init__(self, exclude_label=True, label_column='label'):
        self.exclude_label = exclude_label
        self.label_column = label_column
        self.scaler = StandardScaler()

    def fit(self, X, y=None):
        # Numeric columns to scale; the label is skipped only when exclude_label is set
        self.numeric_cols = [
            col for col in X.select_dtypes(include=['float64', 'int64']).columns
            if not (self.exclude_label and col == self.label_column)
        ]

        # check_array rejects NaN/inf, so bad input is caught at fit time
        check_array(X[self.numeric_cols], dtype=np.float64)

        # Fit on exactly the columns that transform() will scale
        self.scaler.fit(X[self.numeric_cols])
        return self

    def transform(self, X):
        X_transformed = X.copy()
        X_transformed[self.numeric_cols] = self.scaler.transform(
            X_transformed[self.numeric_cols]
        )
        return X_transformed

In [33]:
class NumericMinMaxScaler(BaseEstimator, TransformerMixin):
    def __init__(self, exclude_label=True, label_column='label', exclude_id=True, id_column='id'):
        self.exclude_label = exclude_label
        self.label_column = label_column
        self.exclude_id = exclude_id
        self.id_column = id_column
        self.scaler = MinMaxScaler()

    def fit(self, X, y=None):
        self.numeric_cols_ = X.select_dtypes(include=['float64', 'int64']).columns

        # The label and the id are not features, so they are left unscaled
        if self.exclude_label and self.label_column in self.numeric_cols_:
            self.numeric_cols_ = self.numeric_cols_.drop(self.label_column)

        if self.exclude_id and self.id_column in self.numeric_cols_:
            self.numeric_cols_ = self.numeric_cols_.drop(self.id_column)

        if len(self.numeric_cols_) > 0:
            self.scaler.fit(X[self.numeric_cols_])
        # Mark as fitted even when there is nothing to scale, so transform() does not raise
        self.is_fitted_ = True
        return self

    def transform(self, X):
        check_is_fitted(self, 'is_fitted_')

        X_transformed = X.copy()
        if len(self.numeric_cols_) > 0:
            scaled_data = self.scaler.transform(X_transformed[self.numeric_cols_])
            X_transformed[self.numeric_cols_] = scaled_data
        return X_transformed

### II. Feature Encoding

**Feature encoding**, also known as **categorical encoding**, is the process of converting categorical data (non-numeric data) into a numerical format so that it can be used as input for machine learning algorithms. Most machine learning models require numerical data for training and prediction, so feature encoding is a critical step in data preprocessing.

Categorical data can take various forms, including:

1. **Nominal Data:** Categories with no intrinsic order, like colors or country names.  

2. **Ordinal Data:** Categories with a meaningful order but not necessarily equidistant, like education levels (e.g., "high school," "bachelor's," "master's").

There are several common methods for encoding categorical data:

1. **Label Encoding:**

   - Label encoding assigns a unique integer to each category in a feature.
   - It's suitable for ordinal data where there's a clear order among categories.
   - For example, if you have an "education" feature with values "high school," "bachelor's," and "master's," you can encode them as 0, 1, and 2, respectively.
<br />
<br />
2. **One-Hot Encoding:**

   - One-hot encoding creates a binary (0 or 1) column for each category in a nominal feature.
   - It's suitable for nominal data where there's no inherent order among categories.
   - Each category becomes a new feature, and the presence (1) or absence (0) of a category is indicated for each row.
<br />
<br />
3. **Target Encoding (Mean Encoding):**

   - Target encoding replaces each category with the mean of the target variable for that category.
   - It's often used for classification problems.
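
The three encodings above can be sketched with pandas alone. The `toy` DataFrame, its column names, and the `education_order` mapping below are made up for illustration. In a real workflow the target means would be computed on the training split only, since they are derived from the label.

```python
import pandas as pd

toy = pd.DataFrame({
    "education": ["high school", "bachelor's", "master's", "bachelor's"],
    "color": ["red", "blue", "red", "green"],
    "label": [0, 1, 1, 1],
})

# Label (ordinal) encoding: an explicit mapping preserves the intended order
education_order = {"high school": 0, "bachelor's": 1, "master's": 2}
toy["education_encoded"] = toy["education"].map(education_order)

# One-hot encoding: one 0/1 column per category of a nominal feature
one_hot = pd.get_dummies(toy["color"], prefix="color", dtype=int)

# Target (mean) encoding: replace each category with the mean label of that category
color_means = toy.groupby("color")["label"].mean()
toy["color_target"] = toy["color"].map(color_means)

print(toy.join(one_hot))
```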

In [34]:
class FrequencyEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None, drop_original=True):
        self.columns = columns
        self.drop_original = drop_original

    def fit(self, X, y=None):
        if self.columns is None:
            self.columns_ = X.select_dtypes(include=['object', 'category']).columns.tolist()
        else:
            self.columns_ = self.columns
        
        missing_cols = [col for col in self.columns_ if col not in X.columns]
        if missing_cols:
            raise ValueError(f"Columns not found in input data: {missing_cols}")
        
        self.frequencies_ = {}
        for col in self.columns_:
            self.frequencies_[col] = X[col].value_counts(normalize=True)
        
        self.is_fitted_ = True
        return self

    def transform(self, X):
        check_is_fitted(self, 'is_fitted_')
        
        X_transformed = X.copy()
        
        missing_cols = [col for col in self.columns_ if col not in X_transformed.columns]
        if missing_cols:
            raise ValueError(f"Columns not found in input data during transform: {missing_cols}")
            
        for col in self.columns_:
            encoded_col_name = f"{col}_encoded"
            
            X_transformed[encoded_col_name] = X_transformed[col].map(self.frequencies_[col])
            X_transformed[encoded_col_name] = X_transformed[encoded_col_name].fillna(0)
            
            if self.drop_original:
                X_transformed = X_transformed.drop(columns=[col])
        
        return X_transformed

### III. Handling Imbalanced Dataset

**Handling imbalanced datasets** is important because imbalanced data can lead to several issues that negatively impact the performance and reliability of machine learning models. Here are some key reasons:

1. **Biased Model Performance**:

 - Models trained on imbalanced data tend to be biased towards the majority class, leading to poor performance on the minority class. This can result in misleading accuracy metrics.

2. **Misleading Accuracy**:

 - High overall accuracy can be misleading in imbalanced datasets. For example, if 95% of the data belongs to one class, a model that always predicts the majority class will have 95% accuracy but will fail to identify the minority class.

3. **Poor Generalization**:

 - Models trained on imbalanced data may not generalize well to new, unseen data, especially if the minority class is underrepresented.


Some methods to handle imbalanced datasets:
1. **Resampling Methods**:

 - Oversampling: Increase the number of instances in the minority class by duplicating or generating synthetic samples (e.g., SMOTE).
 - Undersampling: Reduce the number of instances in the majority class to balance the dataset.

2. **Evaluation Metrics**:

 - Use appropriate evaluation metrics such as precision, recall, F1-score, ROC-AUC, and confusion matrix instead of accuracy to better assess model performance on imbalanced data.

3. **Algorithmic Approaches**:

 - Use algorithms that are designed to handle imbalanced data, such as decision trees, random forests, or ensemble methods.
 - Adjust class weights in algorithms to give more importance to the minority class.
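
The "misleading accuracy" point can be reproduced in a few lines. The sketch below uses made-up labels with a 95/5 split and a stand-in "model" that always predicts the majority class; no real data from this notebook is involved.

```python
import numpy as np

# Made-up labels: 95 majority-class samples (1) and 5 minority-class samples (0)
y_true = np.array([1] * 95 + [0] * 5)
y_pred = np.ones_like(y_true)  # always predicts the majority class

# Accuracy: fraction of predictions that match the true label
accuracy = np.mean(y_pred == y_true)

# Recall for the minority class: correctly predicted 0s divided by actual 0s
minority_recall = np.sum((y_pred == 0) & (y_true == 0)) / np.sum(y_true == 0)

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
```

Accuracy looks high even though not a single minority sample is detected, which is why per-class precision and recall are reported later with `classification_report`.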

In [None]:
print("distribution:")
print(train_df['label'].value_counts())
print("\npercentage distribution:")
print(train_df['label'].value_counts(normalize=True) * 100)

In [36]:
class SMOTETomekBalancer(BaseEstimator, TransformerMixin):
    def __init__(self, label_column='label', random_state=42, sampling_strategy=0.2):
        self.label_column = label_column
        self.random_state = random_state
        self.sampling_strategy = sampling_strategy
        self.smotetomek = SMOTETomek(
            random_state=self.random_state, 
            sampling_strategy=self.sampling_strategy
        )

    def fit(self, X, y=None):
        if self.label_column not in X.columns:
            raise ValueError(f"Label column '{self.label_column}' not found in input data")
            
        self.feature_names_ = [col for col in X.columns if col != self.label_column]
        self.is_fitted_ = True
        return self

    def fit_transform(self, X, y=None, **fit_params):
        # Resampling belongs to the training data only, so it happens here
        # (Pipeline.fit_transform calls this method) and not in transform()
        self.fit(X, y)

        X_features = X.drop(columns=[self.label_column])
        y_label = X[self.label_column]
        
        X_resampled, y_resampled = self.smotetomek.fit_resample(X_features, y_label)
        X_resampled_df = pd.DataFrame(X_resampled, columns=self.feature_names_)
        y_resampled_df = pd.Series(y_resampled, name=self.label_column)
        
        return pd.concat([X_resampled_df, y_resampled_df], axis=1)

    def transform(self, X):
        # Validation and test data pass through unchanged so that they keep
        # their real class distribution during evaluation
        check_is_fitted(self, 'is_fitted_')
        return X

# 3. Compile Preprocessing Pipeline

All of the preprocessing classes or functions defined earlier will be compiled in this step.

If you use sklearn to create preprocessing classes, you can list your preprocessing classes in the Pipeline object sequentially, and then fit and transform your data.

In [None]:
pipe = Pipeline([
    ("frequency_encoder", FrequencyEncoder(columns=["TLD"], drop_original=True)),
    ("scaler_minmax", NumericMinMaxScaler()),
    ("balancer", SMOTETomekBalancer(label_column='label'))
])

train_set_preprocessed = pipe.fit_transform(train_set)
val_set_preprocessed = pipe.transform(val_set)
    


In [None]:
train_set_preprocessed.describe()

In [None]:
val_set_preprocessed.describe()


# Cleaning & Preprocessing test_df

## Cleaning

### Handling Missing Value

In [None]:
from urllib.parse import urlparse

if 'URL' in test_df.columns:
    most_frequent_url = test_df['URL'].mode()[0]

    # Fill a missing URL with the most frequent one when Domain and IsHTTPS are both missing
    mask_both_missing = pd.isna(test_df['IsHTTPS']) & pd.isna(test_df['Domain'])
    test_df.loc[mask_both_missing & pd.isna(test_df['URL']), 'URL'] = most_frequent_url

    # Do the same when exactly one of Domain and IsHTTPS is missing
    mask_one_exists = ((pd.notna(test_df['IsHTTPS']) & pd.isna(test_df['Domain'])) | 
            (pd.isna(test_df['IsHTTPS']) & pd.notna(test_df['Domain'])))
    test_df.loc[mask_one_exists & pd.isna(test_df['URL']), 'URL'] = most_frequent_url

for index, row in test_df.iterrows():
    if pd.notna(row['URL']):
        try:
            # Derive IsHTTPS, Domain, and TLD from the URL itself
            test_df.at[index, 'IsHTTPS'] = 1 if row['URL'].startswith('https://') else 0
            parsed = urlparse(row['URL'])
            if not parsed.netloc:
                parsed = urlparse("http://" + row['URL'])
            domain = parsed.netloc or parsed.path
            if '.' in domain:
                tld = domain.split('.')[-1]
                if pd.isna(test_df.at[index, 'Domain']):
                    test_df.at[index, 'Domain'] = domain
                if pd.isna(test_df.at[index, 'TLD']):
                    test_df.at[index, 'TLD'] = tld
        except (AttributeError, ValueError):
            # Skip rows whose URL is not a string or cannot be parsed
            continue
    elif pd.notna(row['Domain']) and pd.isna(row['TLD']):
        domain = row['Domain']
        if '.' in domain:
            test_df.at[index, 'TLD'] = domain.split('.')[-1]
    if pd.isna(row['URL']) and pd.notna(row['Domain']) and pd.notna(row['IsHTTPS']):
        protocol = 'https://' if row['IsHTTPS'] == 1 else 'http://'
        test_df.at[index, 'URL'] = f"{protocol}www.{row['Domain']}"

if 'URL' in test_df.columns:
    mask = test_df['URL'].notna()
    test_df.loc[mask, 'URLLength'] = test_df.loc[mask, 'URL'].str.len()
    test_df.loc[mask, 'NoOfLettersInURL'] = test_df.loc[mask, 'URL'].str.count('[a-zA-Z]')
    test_df.loc[mask, 'NoOfDegitsInURL'] = test_df.loc[mask, 'URL'].str.count('[0-9]')
    test_df.loc[mask, 'NoOfEqualsInURL'] = test_df.loc[mask, 'URL'].str.count('=')
    test_df.loc[mask, 'NoOfQMarkInURL'] = test_df.loc[mask, 'URL'].str.count(r'\?')
    test_df.loc[mask, 'NoOfAmpersandInURL'] = test_df.loc[mask, 'URL'].str.count('&')
    url_lengths = test_df.loc[mask, 'URL'].str.len()
    test_df.loc[mask, 'LetterRatioInURL'] = test_df.loc[mask, 'NoOfLettersInURL'] / url_lengths
    test_df.loc[mask, 'DegitRatioInURL'] = test_df.loc[mask, 'NoOfDegitsInURL'] / url_lengths
    special_chars_pattern = '[^a-zA-Z0-9=?&]'
    test_df.loc[mask, 'NoOfOtherSpecialCharsInURL'] = test_df.loc[mask, 'URL'].str.count(special_chars_pattern)
    test_df.loc[mask, 'SpacialCharRatioInURL'] = test_df.loc[mask, 'NoOfOtherSpecialCharsInURL'] / url_lengths
    char_series = test_df.loc[mask, 'URL']
    continuation_rates = []
    for url in char_series:
        if len(url) < 2:
            continuation_rates.append(0)
        else:
            continuous = sum(1 for i in range(len(url)-1) if url[i] == url[i+1])
            continuation_rates.append(continuous / (len(url)-1))
    test_df.loc[mask, 'CharContinuationRate'] = continuation_rates

if 'Domain' in test_df.columns:
    mask = test_df['Domain'].notna()
    test_df.loc[mask, 'DomainLength'] = test_df.loc[mask, 'Domain'].str.len()
    test_df.loc[mask, 'IsDomainIP'] = test_df.loc[mask, 'Domain'].apply(
        lambda x: float(bool(re.match(r'^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$', str(x)))))
    test_df.loc[mask, 'NoOfSubDomain'] = test_df.loc[mask, 'Domain'].str.count(r'\.')

if 'TLD' in test_df.columns:
    mask = test_df['TLD'].notna()
    test_df.loc[mask, 'TLDLength'] = test_df.loc[mask, 'TLD'].str.len()

mask = test_df['NoOfObfuscatedChar'] == 0
test_df.loc[mask & test_df['HasObfuscation'].isna(), 'HasObfuscation'] = 0
mask = test_df['NoOfObfuscatedChar'] > 0
test_df.loc[mask & test_df['HasObfuscation'].isna(), 'HasObfuscation'] = 1

mask = test_df['NoOfURLRedirect'].notna() & test_df['NoOfSelfRedirect'].notna()
invalid_mask = mask & (test_df['NoOfSelfRedirect'] > test_df['NoOfURLRedirect'])
test_df.loc[invalid_mask, 'NoOfSelfRedirect'] = test_df.loc[invalid_mask, 'NoOfURLRedirect']

mask = test_df['NoOfExternalRef'] > 0
test_df.loc[mask & test_df['HasExternalFormSubmit'].isna(), 'HasExternalFormSubmit'] = 1
mask = test_df['NoOfExternalRef'] == 0
test_df.loc[mask & test_df['HasExternalFormSubmit'].isna(), 'HasExternalFormSubmit'] = 0

mask = test_df['Title'].notna()
test_df.loc[mask & test_df['HasTitle'].isna(), 'HasTitle'] = 1
test_df.loc[~mask & test_df['HasTitle'].isna(), 'HasTitle'] = 0

boolean_cols = ['IsDomainIP', 'HasObfuscation', 'IsHTTPS', 'HasTitle',
                'HasFavicon', 'IsResponsive', 'HasDescription',
                'HasExternalFormSubmit', 'HasSocialNet', 'HasSubmitButton',
                'HasHiddenFields', 'HasPasswordField', 'HasCopyrightInfo']

count_cols = ['NoOfSubDomain', 'NoOfObfuscatedChar', 'NoOfLettersInURL',
              'NoOfDegitsInURL', 'NoOfEqualsInURL', 'NoOfQMarkInURL', 
              'NoOfAmpersandInURL', 'NoOfOtherSpecialCharsInURL',
              'NoOfURLRedirect', 'NoOfSelfRedirect', 'NoOfPopup',
              'NoOfiFrame', 'NoOfImage', 'NoOfCSS', 'NoOfJS',
              'NoOfSelfRef', 'NoOfEmptyRef', 'NoOfExternalRef']

for col in boolean_cols:
    if col in test_df.columns and test_df[col].isnull().any():
        test_df[col] = test_df[col].fillna(test_df[col].mode()[0])

for col in count_cols:
    if col in test_df.columns and test_df[col].isnull().any():
        test_df[col] = test_df[col].fillna(test_df[col].median())

numerical_cols = test_df.select_dtypes(include=['float64', 'int64']).columns
remaining_cols = [col for col in numerical_cols if col not in boolean_cols + count_cols]
for col in remaining_cols:
    if test_df[col].isnull().any():
        test_df[col] = test_df[col].fillna(test_df[col].median())

print("\nNumber of missing values after imputation:")
print(test_df.isnull().sum())


### Feature Engineering

In [None]:
# Create new features from URL characteristics
if 'URL' in test_df.columns:
    # Count special characters
    test_df['special_char_count'] = test_df['URL'].str.count(r'[^a-zA-Z0-9]').fillna(0)
    # Count digits
    test_df['digit_count'] = test_df['URL'].str.count(r'[0-9]').fillna(0)

# Create interaction features
if 'URLLength' in test_df.columns and 'DomainLength' in test_df.columns:
    test_df['url_domain_ratio'] = test_df['URLLength'] / test_df['DomainLength'].replace(0, 1)

# Create a binary feature from TLD
if 'TLD' in test_df.columns:
    common_tlds = ['com', 'org', 'net', 'edu']
    test_df['is_common_tld'] = test_df['TLD'].isin(common_tlds).astype(int)

# Create aggregate features
security_cols = ['HasObfuscation', 'IsHTTPS', 'HasHiddenFields']
test_df['security_score'] = test_df[security_cols].sum(axis=1)

# Check the new features
print("\nNumber of columns after feature engineering:", len(test_df.columns))
print("\nA few rows with the new features:")
print(test_df[['special_char_count', 'digit_count', 'url_domain_ratio',
               'is_common_tld', 'security_score']].head())

In [None]:
# Define the features to drop with reasons
features_to_drop = {
    # high correlation
    'NoOfLettersInURL': 'almost perfect correlation with URLLength (0.998)',
    'NoOfOtherSpecialCharsInURL': 'strong correlation with URLLength (0.876)',
    'DomainLength': 'strong correlation with URLLength (0.845)',
    
    # zero variance, does not contribute to modeling
    'NoOfObfuscatedChar': 'constant, std 0',
    'URLTitleMatchScore': 'constant, std 0',
    'NoOfURLRedirect': 'constant, std 0',
    'ObfuscationRatio': 'constant, std 0',
    'DomainTitleMatchScore': 'constant, std 0',
    'DegitRatioInURL': 'constant, std 0',
    'NoOfEqualsInURL': 'constant, std 0',
    'HasObfuscation': 'constant, std 0',
    'NoOfSelfRedirect': 'constant, std 0',
    'NoOfPopup': 'constant, std 0',
    'NoOfQMarkInURL': 'constant, std 0',
    'NoOfAmpersandInURL': 'constant, std 0',
    'NoOfDegitsInURL': 'constant, std 0',
    'NoOfSubDomain': 'constant, std 0',
    
    # already captured by other features, or raw text
    'Domain': 'raw text',
    'URL': 'raw text',
    'Title': 'raw text',
    'LetterRatioInURL': 'derived',
    'SpacialCharRatioInURL': 'derived',
    
    # too many missing values, low MI
    'NoOfCSS': 'High missing values, low importance',
    'HasCopyrightInfo': 'High missing values, low importance',
    'HasPasswordField': 'High missing values, low importance',
    'HasSocialNet': 'High missing values, low importance',
    'NoOfExternalRef': 'High missing, info captured in HasExternalFormSubmit'
}

# drop
test_df = test_df.drop(columns=list(features_to_drop.keys()), errors='ignore')

# keep only the columns shared with the validation set
common_columns = test_df.columns.intersection(val_set.columns)
test_df = test_df[common_columns]

# verify
print("\nnumber of remaining features:", len(test_df.columns))
print("test set:", test_df.columns.tolist())


# shapes
print("\nNew shapes:")
print(f"test set: {test_df.shape}")


In [43]:
def engineer_features(dataset):
    features = dataset.copy()
    
    # UI content (also reused from the earlier small assignment, "tucil")
    ui_features = ['HasFavicon',  'HasSubmitButton', 'HasTitle', 'HasDescription', 'IsResponsive']
    features['UIQualityScore'] = features[ui_features].sum(axis=1)
    
    # money- and payment-related
    financial_features = ['Bank', 'Pay', 'Crypto']
    features['FinancialRiskScore'] = (
        features[financial_features].sum(axis=1) * 
        (1 - features['IsHTTPS'])  # riskier when the site is not HTTPS
    )
    
    # domain
    features['DomainTrustScore'] = (
        features['TLDLegitimateProb'] * 
        (1 + features['IsHTTPS']) *  # bonus when the site uses HTTPS
        (1 - features['IsDomainIP'])  # an IP-address domain is very suspicious, so the score drops to 0
    )
    
    # code complexity
    features['ContentComplexity'] = (
        np.log1p(features['LineOfCode']) * 
        np.log1p(features['LargestLineLength'])
    )
    
    # ref pattern
    features['RefPatternScore'] = (
        features['NoOfSelfRef'] / 
        (features['NoOfSelfRef'] + features['NoOfEmptyRef']).clip(lower=1)
    )
    
    return features

test_df_engineered = engineer_features(test_df)

test_df = test_df_engineered

new_features = [
    'UIQualityScore',
    'FinancialRiskScore',
    'DomainTrustScore',
    'ContentComplexity',
    'RefPatternScore'
]


In [None]:
test_df = test_df.drop(columns = ['DomainTrustScore', 'FinancialRiskScore'])
test_df.info()


# 4. Modeling and Validation

Modelling is the process of building your own machine learning models to solve specific problems, or in this assignment context, predicting the target feature `label`. Validation is the process of evaluating your trained model using the validation set or cross-validation method and providing some metrics that can help you decide what to do in the next iteration of development.

## A. KNN

### From Scratch

In [45]:
class KNNClassifier(BaseEstimator):
    def __init__(self, k=3, metric='euclidean', p=2):
        self.k = k
        self.metric = metric
        self.p = p

    def fit(self, X, y):
        # KNN is a lazy learner: fitting only stores the training data
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)
        return self

    def _calculate_distances(self, x_test):
        # Vectorized distance from one query point to every training row
        diff = np.abs(self.X_train - x_test)
        if self.metric == 'euclidean':
            return np.sqrt(np.sum(diff ** 2, axis=1))
        if self.metric == 'manhattan':
            return np.sum(diff, axis=1)
        if self.metric == 'minkowski':
            return np.power(np.sum(np.power(diff, self.p), axis=1), 1 / self.p)
        raise ValueError("Invalid metric. Use 'euclidean', 'manhattan', or 'minkowski'.")

    def _get_neighbors(self, x_test):
        # Labels of the k training rows closest to the query point
        distances = self._calculate_distances(x_test)
        k_indices = np.argsort(distances)[:self.k]
        return self.y_train[k_indices]

    def _get_majority_vote(self, labels):
        # np.unique returns sorted labels, so ties resolve to the smallest label
        unique_labels, counts = np.unique(labels, return_counts=True)
        return unique_labels[np.argmax(counts)]

    def predict(self, X):
        X_test = np.asarray(X, dtype=float)
        return np.array([self._get_majority_vote(self._get_neighbors(x)) for x in X_test])

### From Scikit-Learn

In [46]:
knn_scikit = KNeighborsClassifier(n_neighbors=5)

## B. Naive Bayes

### From Scratch

In [47]:
class GaussianNaiveBayes(BaseEstimator):
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mean_std = {}
        self.prior = {}

        for c in self.classes:
            X_c = X[y == c]
            self.mean_std[c] = {
                'mean': np.mean(X_c, axis=0),
                'std': np.where(np.std(X_c, axis=0) == 0, 1e-6, np.std(X_c, axis=0))
            }
            self.prior[c] = len(X_c) / len(y)
        return self

    def _gaussian_probability(self, x, mean, std):
        epsilon = 1e-9
        exponent = np.exp(-((x - mean) ** 2) / (2 * std ** 2))
        prob = exponent / (np.sqrt(2 * np.pi) * std)
        return np.maximum(prob, epsilon)

    def _predict_sample(self, x):
        probabilities = {}
        for c in self.classes:
            prior = np.log(self.prior[c])
            mean = self.mean_std[c]['mean']
            std = self.mean_std[c]['std']
            likelihood = np.sum(np.log(self._gaussian_probability(x, mean, std)))
            probabilities[c] = prior + likelihood
        return max(probabilities, key=probabilities.get)

    def predict(self, X):
        return np.array([self._predict_sample(x) for x in X])


### From Scikit-Learn

In [48]:
nb_scikit = GaussianNB()

## Train and Evaluate

In [None]:
train_preprocessed = pipe.fit_transform(train_set)
val_preprocessed = pipe.transform(val_set)

X_train = train_preprocessed.drop(columns=['label', 'id'])
y_train = train_preprocessed['label']
X_val = val_preprocessed.drop(columns=['label', 'id'])
y_val = val_preprocessed['label']

### KNN

In [None]:
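# Subsample the training data because the from-scratch KNN is slow on the full set.
# test_size=0.99 keeps 1% for fitting; an integer 99 would set aside only 99 rows.
X_train_small, X_unused, y_train_small, y_unused = train_test_split(
    X_train, y_train, test_size=0.99, random_state=42
)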

knn = KNNClassifier(k=5, metric='euclidean')
knn.fit(X_train_small, y_train_small)
knn_scratch_predictions = knn.predict(X_val)

In [366]:
knn_scikit.fit(X_train, y_train)
knn_scikit_predictions = knn_scikit.predict(X_val)

In [None]:
# Save model
with open('knn_scikit_model.pkl', 'wb') as f:
    pickle.dump(knn_scikit, f)
print("KNN Scikit-Learn model saved!")

# Load model
with open('knn_scikit_model.pkl', 'rb') as f:
    loaded_knn_scikit = pickle.load(f)
print("KNN Scikit-Learn model loaded!")

In [None]:
print(classification_report(y_val, knn_scikit_predictions))

loaded_knn_scikit_predictions = loaded_knn_scikit.predict(X_val)
print(classification_report(y_val, loaded_knn_scikit_predictions))

### Naive Bayes

In [51]:
nb = GaussianNaiveBayes()
nb.fit(X_train.values, y_train.values)
nb_scratch_predictions = nb.predict(X_val.values)



In [None]:
# Save model
with open('nb_from_scratch_model.pkl', 'wb') as f:
    pickle.dump(nb, f)
print("Naive Bayes From Scratch model saved!")

# Load model
with open('nb_from_scratch_model.pkl', 'rb') as f:
    loaded_nb = pickle.load(f)
print("Naive Bayes From Scratch model loaded!")

In [None]:
print(classification_report(y_val, nb_scratch_predictions))

loaded_nb_from_scratch_predictions = loaded_nb.predict(X_val.values)
print(classification_report(y_val, loaded_nb_from_scratch_predictions))

In [54]:
nb_scikit.fit(X_train, y_train)
nb_scikit_predictions = nb_scikit.predict(X_val)



In [None]:
# Save model
with open('nb_scikit_model.pkl', 'wb') as f:
    pickle.dump(nb_scikit, f)
print("Naive Bayes Scikit-Learn model saved!")

# Load model
with open('nb_scikit_model.pkl', 'rb') as f:
    loaded_nb_scikit = pickle.load(f)
print("Naive Bayes Scikit-Learn model loaded!")

In [None]:
print(classification_report(y_val, nb_scikit_predictions))

loaded_nb_scikit_predictions = loaded_nb_scikit.predict(X_val)
print(classification_report(y_val, loaded_nb_scikit_predictions))

## C. Improvements (Optional)

- **Visualize the model evaluation result**

This will help you understand your model's performance in more detail. From the visualization, you can see clearly whether your model leans towards one class over the others. (Hint: confusion matrix, ROC-AUC curve, etc.)

- **Explore the hyperparameters of your models**

Each model has its own hyperparameters, and each hyperparameter affects the model's behaviour differently. You can optimize model performance by finding a good set of hyperparameters through a process called **hyperparameter tuning**. (Hint: grid search, random search, Bayesian optimization)

- **Cross-validation**

Cross-validation is a critical technique in machine learning and data science for evaluating and validating the performance of predictive models. It provides a more **robust** and **reliable** evaluation than hold-out validation (a single train-test split), although it requires more time and computing power because the model is trained once per fold. (Hint: k-fold cross-validation, stratified k-fold cross-validation, etc.)
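
As a rough sketch of the cross-validation and hyperparameter-tuning hints above, the snippet below runs stratified 5-fold cross-validation and a small grid search with scikit-learn. The data (`X_demo`, `y_demo`) is synthetic and the grid of `n_neighbors` values is an arbitrary choice for illustration; neither comes from this notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
# Synthetic, well-separated two-class data (made up for illustration)
X_demo = np.vstack([rng.normal(0, 1, (60, 3)), rng.normal(3, 1, (40, 3))])
y_demo = np.array([0] * 60 + [1] * 40)

# Stratified folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validated macro F1 for Gaussian Naive Bayes: one score per fold
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=cv, scoring="f1_macro")
print("NB macro F1 per fold:", scores)

# Grid search over k for KNN, evaluated with the same folds
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7]},
    cv=cv,
    scoring="f1_macro",
)
grid.fit(X_demo, y_demo)
print("best k:", grid.best_params_["n_neighbors"])
```

On the real data, scaling and resampling would need to be fitted inside each fold rather than once beforehand, otherwise the validation folds influence the preprocessing.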

## D. Submission
To predict the test set target feature and submit the results to the kaggle competition platform, do the following:
1. Create a new pipeline instance identical to the first in Data Preprocessing
2. With the pipeline, apply `fit_transform` to the original training set before splitting, then only apply `transform` to the test set.
3. Retrain the model on the preprocessed training set
4. Predict the test set
5. Make sure the submission contains the `id` and `label` column.

Note: Adjust step 1 and 2 to your implementation of the preprocessing step if you don't use pipeline API from `sklearn`.

In [376]:
pipe_test = Pipeline([
    ("frequency_encoder", FrequencyEncoder(columns=["TLD"], drop_original=True)),
    ("scaler_minmax", NumericMinMaxScaler())
])

train_set_preprocessed = pipe_test.fit_transform(train_set)

test_df = pipe_test.transform(test_df)


In [384]:
X_test = test_df.drop(columns=['id'])

nb_scratch_predictions = nb.predict(X_test.values)

In [385]:
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

filename = f'submission_{timestamp}.csv'

submission = pd.DataFrame({
    'id': test_df['id'],
    'label': nb_scratch_predictions
})
submission.to_csv(filename, index=False)

# 6. Error Analysis

Based on all the processes you have carried out up to the modeling and evaluation step, write an analysis to support each step you have taken to solve this problem. Write the analysis in a markdown block. Some questions that may help you in writing the analysis:

- Does my model perform better in predicting one class than the other? If so, why is that?
- Of the models I have tried, which performs best, and what could be the reason?
- Is it better for me to impute or drop the missing data? Why?
- Does feature scaling help improve my model performance?
- etc...

Does my model perform better in predicting one class than the other? If so, why is that?

The evaluation results show that the model performs differently on the two classes. One likely cause is class imbalance: the dataset is dominated by class 1 (legitimate websites), so the model has more examples from which to learn what a legitimate website looks like. SMOTE-Tomek was applied to address this, but the effect is still visible. The characteristics of the features also play a part: a precision of 1.00 for class 1 suggests that the features of legitimate websites are more consistent and easier to identify than those of class 0, which has lower precision.
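
To make the per-class comparison concrete, here is a small sketch of how precision and recall can be read off separately for each class. The labels below are toy values invented for illustration (class 1 standing for legitimate websites, as in our data). They are not our model's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (assumption): class 1 is the majority, as in the training data
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# average=None returns one score per class instead of a single aggregate
per_class_precision = precision_score(y_true, y_pred, average=None, labels=[0, 1])
per_class_recall = recall_score(y_true, y_pred, average=None, labels=[0, 1])

print(cm)
for cls in (0, 1):
    print(f"class {cls}: precision={per_class_precision[cls]:.2f}, "
          f"recall={per_class_recall[cls]:.2f}")
```

In this toy case each class has exactly one error, yet the minority class scores lower because that single error is a larger share of its few samples. This is one way class imbalance can show up in per-class metrics.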

Of all the models I have tried, which performs the best and what could be the reason?

Based on the best Kaggle score we obtained (0.92), which came from the model without outlier handling, the best model is Naive Bayes (without outlier handling). Its performance is consistent, with accuracy, precision, recall, and F1 score all above 0.94. It maintains good performance without showing the signs of overfitting seen in the version with outlier handling. It also needs no complex outlier preprocessing, which makes it faster and simpler to implement.
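
One simple way to look for the signs of overfitting mentioned above is to compare training accuracy with validation accuracy. The sketch below does this on synthetic data. The helper `generalization_gap` and the variables `X_demo`, `y_demo`, and `gaps` are hypothetical names introduced here, and the `sklearn` models stand in for our own implementations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier


def generalization_gap(model, X_tr, y_tr, X_va, y_va):
    """Hypothetical helper: training accuracy minus validation accuracy."""
    model.fit(X_tr, y_tr)
    return model.score(X_tr, y_tr) - model.score(X_va, y_va)


# Synthetic stand-in for the real data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# 1-NN memorizes the training set, so it serves as a deliberately overfit reference
gaps = {
    "GaussianNB": generalization_gap(GaussianNB(), X_tr, y_tr, X_va, y_va),
    "1-NN": generalization_gap(KNeighborsClassifier(n_neighbors=1), X_tr, y_tr, X_va, y_va),
}
for name, gap in gaps.items():
    print(f"{name}: train - validation accuracy = {gap:.3f}")
```

A gap near zero suggests the model generalizes about as well as it fits. A large positive gap is a common symptom of overfitting.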

Is it better for me to impute or drop the missing data? Why?

Whether to impute or drop missing data depends on several factors. When there are very few missing values and the sample is still large enough after dropping, removing the affected rows is not a problem. Dropping is also reasonable when the values are missing completely at random (MCAR) and the missing data is not very important. Imputation is the better choice when missing values are numerous and dropping them would shrink the sample significantly, when the missingness follows a pattern, or when the affected variable is important. Because the percentage of missing values in each column is fairly large, we decided to impute them.
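
The trade-off can be checked quickly by measuring the share of missing values per column and counting how many rows would survive a drop. The sketch below uses a toy frame with hypothetical column names (`url_length`, `is_https`). The median and mode strategies are one simple option, not necessarily the imputer used in our pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical column names (not the assignment data)
demo_df = pd.DataFrame({
    "url_length": [20.0, np.nan, 35.0, 50.0, np.nan],
    "is_https": [1.0, 1.0, np.nan, 0.0, 1.0],
})

# Share of missing values per column, used to decide between drop and impute
missing_share = demo_df.isna().mean()

# Dropping keeps only the fully observed rows
dropped = demo_df.dropna()

# Imputing keeps every row: median for the numeric column, mode for the binary one
imputed = demo_df.copy()
imputed["url_length"] = imputed["url_length"].fillna(imputed["url_length"].median())
imputed["is_https"] = imputed["is_https"].fillna(imputed["is_https"].mode()[0])

print(missing_share)
print(f"rows kept after dropna: {len(dropped)} of {len(demo_df)}")
```

Here dropping would discard three of the five rows, which is the kind of loss that makes imputation preferable. In a real workflow the fill values should be computed on the training set only and then reused for the validation and test sets.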

Does feature scaling help improve my model performance?

Feature scaling helps some models more than others. For KNN it helps a great deal: KNN is distance-based, so scaling puts every feature on a comparable range and lets each one contribute comparably to the distance. Our tests reflect this: the model reached 98% accuracy with feature scaling but only 79% without it. For Naive Bayes, feature scaling matters much less, because the model does not rely on distances or on the magnitude of the features. In our tests, accuracy was 96% with feature scaling and 94% without.
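
The effect on KNN can be reproduced on a small synthetic example in which one informative feature lives on a small scale and one pure-noise feature lives on a large scale. Everything below (the data, `X_demo`, `acc_raw`, `acc_scaled`) is an assumption made for illustration. It uses `sklearn`'s `KNeighborsClassifier` and `MinMaxScaler` rather than our own implementations.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n = 400
y_demo = rng.integers(0, 2, size=n)

# Informative feature on a small scale, pure-noise feature on a large scale
informative = y_demo + rng.normal(0, 0.3, size=n)
noise = rng.normal(0, 1000, size=n)
X_demo = np.column_stack([informative, noise])

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Without scaling, the large-scale noise feature dominates the Euclidean distance
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# With scaling, both features fall in [0, 1] before distances are computed
knn_scaled = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)

acc_raw = knn_raw.score(X_va, y_va)
acc_scaled = knn_scaled.score(X_va, y_va)
print(f"KNN accuracy without scaling: {acc_raw:.2f}")
print(f"KNN accuracy with scaling:    {acc_scaled:.2f}")
```

Because the scaler sits inside the pipeline, it is fitted on the training split only and then applied unchanged to the validation split.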