# Assignment: Decision Trees and Random Forests

## 0. Setting Up the Data

We first fetched the [**Phishing Websites**](https://archive.ics.uci.edu/dataset/327/phishing+websites) data set from the UCI Machine Learning Repository.

In [76]:
import matplotlib.pyplot as plt
from ucimlrepo import fetch_ucirepo
  
# fetch dataset 
phishing_websites = fetch_ucirepo(id=327) 
  
# data (as pandas dataframes) 
X = phishing_websites.data.features 
y = phishing_websites.data.targets 
  
# metadata 
print(phishing_websites.metadata) 
  
# variable information 
print(phishing_websites.variables) 

{'uci_id': 327, 'name': 'Phishing Websites', 'repository_url': 'https://archive.ics.uci.edu/dataset/327/phishing+websites', 'data_url': 'https://archive.ics.uci.edu/static/public/327/data.csv', 'abstract': 'This dataset collected mainly from: PhishTank archive, MillerSmiles archive, Google’s searching operators.', 'area': 'Computer Science', 'tasks': ['Classification'], 'characteristics': ['Tabular'], 'num_instances': 11055, 'num_features': 30, 'feature_types': ['Integer'], 'demographics': [], 'target_col': ['result'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2012, 'last_updated': 'Tue Mar 05 2024', 'dataset_doi': '10.24432/C51W2X', 'creators': ['Rami Mohammad', 'Lee McCluskey'], 'intro_paper': {'ID': 396, 'type': 'NATIVE', 'title': 'An assessment of features related to phishing websites using an automated technique', 'authors': 'R. Mohammad, F. Thabtah, L. Mccluskey', 'venue': 'International Conference for Internet Tec

## 1. Business Understanding

The goal is to assess whether a website can be classified as phishing or legitimate from its features, using **Decision Tree** and **Random Forest** models.

Model performance is evaluated to determine if a reliable phishing prediction is feasible with the given data and models.

## 2. Data Understanding

In [77]:
# Inspect the data
X.info()
X.describe()

# Display the targets
print("\nTarget value counts:\n", y.value_counts())

# Display unique values for each feature
print("\nUnique values for each feature:")
for num, col in enumerate(X.columns):
    unique_vals = sorted(int(x) for x in X[col].unique())
    print(f"{num} {col}: {unique_vals}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11055 entries, 0 to 11054
Data columns (total 30 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   having_ip_address           11055 non-null  int64
 1   url_length                  11055 non-null  int64
 2   shortining_service          11055 non-null  int64
 3   having_at_symbol            11055 non-null  int64
 4   double_slash_redirecting    11055 non-null  int64
 5   prefix_suffix               11055 non-null  int64
 6   having_sub_domain           11055 non-null  int64
 7   sslfinal_state              11055 non-null  int64
 8   domain_registration_length  11055 non-null  int64
 9   favicon                     11055 non-null  int64
 10  port                        11055 non-null  int64
 11  https_token                 11055 non-null  int64
 12  request_url                 11055 non-null  int64
 13  url_of_anchor               11055 non-null  int64
 14  links_

The dataset contains 11,055 entries and 30 feature columns. Each row represents a website, and each column represents a website-related feature. Feature values are encoded as -1, 0, or 1, indicating whether the characteristic looks bad (phishing-like), suspicious, or normal (legitimate).

## 3. Data Preparation

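The dataset is complete and consistent: the metadata reports no missing values, and every feature column is a non-null `int64`. Features and targets are already separated, so no additional preparation is required.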
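The completeness claim above can be checked programmatically. The following is a minimal sketch, assuming the features live in a pandas DataFrame; the helper name `check_encoding` and the toy frame (including its `broken_feature` column) are hypothetical and only illustrate the check.

```python
import pandas as pd


def check_encoding(features: pd.DataFrame, allowed=(-1, 0, 1)) -> dict:
    """Count missing cells and list columns holding values outside `allowed`."""
    n_missing = int(features.isna().sum().sum())
    bad_columns = [
        col
        for col in features.columns
        if not features[col].dropna().isin(list(allowed)).all()
    ]
    return {"n_missing": n_missing, "bad_columns": bad_columns}


# Toy frame standing in for X (hypothetical values, not the real data)
toy = pd.DataFrame(
    {
        "having_ip_address": [-1, 1, 1],
        "url_length": [1, 0, -1],
        "broken_feature": [2, 1, None],  # deliberately invalid
    }
)
report = check_encoding(toy)
print(report)
```

Calling `check_encoding(X)` on the real feature frame should report zero missing values and an empty `bad_columns` list if the data is as clean as described.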

## 4. Modeling

### Data splitting
The data is split into two sets, one for training and one for testing:
* Training: 70%
* Testing: 30%

In [78]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.3, 
    random_state=42
)

### Base Random Forest
First, we create a base Random Forest model to use as a reference point for later comparisons. It keeps scikit-learn's defaults except for `max_samples=0.7` and `max_features=0.75`.

In [79]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,
    max_samples=0.7,
    max_features=0.75,
    random_state=42,
    n_jobs=-1
)

y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

rf.fit(X_train, y_train)

| Parameter | Value |
|---|---|
| n_estimators | 100 |
| criterion | 'gini' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 0.75 |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |
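To make the fractional settings above concrete, the short calculation below converts them into counts. It is a rough sketch: the test-set size uses the ceiling rule that yields the 3,317 test rows seen in the evaluation, the features-per-split formula `max(1, int(max_features * n_features))` is the one given in the scikit-learn parameter documentation, and the rounding used for `max_samples` is an assumption that may differ slightly between library versions.

```python
import math

n_samples, n_features = 11055, 30  # from the dataset metadata
test_size, max_samples, max_features = 0.3, 0.7, 0.75

# Rows in each split (test size rounded up, remainder goes to training)
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Candidate features examined at each split, per the documented formula
features_per_split = max(1, int(max_features * n_features))

# Bootstrap rows drawn per tree (assumption: rounded to the nearest integer)
rows_per_tree = round(max_samples * n_train)

print(f"train rows: {n_train}, test rows: {n_test}")
print(f"features per split: {features_per_split}")
print(f"approx. bootstrap rows per tree: {rows_per_tree}")
```

Each of the 100 trees therefore trains on roughly 5,400 bootstrapped rows and considers 22 of the 30 features at every split.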


### Random Forest Hyperparameter Tuning
Next we perform hyperparameter tuning via grid search with cross-validation to find the best combination of hyperparameters for the model.

This setup uses 5-fold cross-validation over a grid of 6 hyperparameters with 2 candidate values each (2^6 = 64 combinations), resulting in 320 model fits. The combination with the highest mean cross-validation accuracy is selected as the final tuned model.
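The fit count can be verified without training anything. The sketch below uses scikit-learn's `ParameterGrid` to enumerate the combinations; the grid is repeated here only so the snippet stands alone.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [200, 300],
    "max_depth": [20, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": [0.7, 0.75],
    "max_samples": [0.7, 0.75],
}

n_combinations = len(ParameterGrid(param_grid))  # 2 values ** 6 parameters
n_folds = 5
n_cv_fits = n_combinations * n_folds

print(f"{n_combinations} combinations x {n_folds} folds = {n_cv_fits} fits")
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best combination on the full training set after the search.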

In [80]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Try different hyperparameter combinations
param_grid = {
    "n_estimators": [200, 300],
    "max_depth": [20, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": [0.7, 0.75],
    "max_samples": [0.7, 0.75]
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

# Select the best model
best_rf = grid_search.best_estimator_

print("Best hyperparameters:", grid_search.best_params_)



Best hyperparameters: {'max_depth': 20, 'max_features': 0.75, 'max_samples': 0.75, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}
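Beyond the single best combination, it can help to see how close the other candidates scored. The sketch below shows one way to rank them using the `cv_results_` attribute of a fitted `GridSearchCV`. It runs on small synthetic data with a deliberately tiny grid so it finishes quickly; the names `X_demo`, `y_demo`, and `demo_search` are illustrative and are not part of this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (NOT the phishing data) so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

demo_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 20], "max_depth": [3, None]},
    cv=3,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# One row per hyperparameter combination, best-ranked first
results = (
    pd.DataFrame(demo_search.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)
```

Applying the same DataFrame construction to `grid_search.cv_results_` from the cell above would show whether the top-ranked combinations differ by more than their fold-to-fold standard deviation.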


## 5. Evaluation

We evaluate both the base and the tuned Random Forest models on the held-out test set.

In [81]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Evaluate the base Random Forest
y_pred_base = rf.predict(X_test)
print("Base Accuracy:", accuracy_score(y_test, y_pred_base))
print("\nBase Classification Report:\n", classification_report(y_test, y_pred_base))
print("\nBase Confusion Matrix:\n", confusion_matrix(y_test, y_pred_base))

# Evaluate the tuned Random Forest
y_pred_tuned = best_rf.predict(X_test)
print("Tuned Accuracy:", accuracy_score(y_test, y_pred_tuned))
print("\nTuned Classification Report:\n", classification_report(y_test, y_pred_tuned))
print("\nTuned Confusion Matrix:\n", confusion_matrix(y_test, y_pred_tuned))

Base Accuracy: 0.9674404582454025

Base Classification Report:
               precision    recall  f1-score   support

          -1       0.97      0.95      0.96      1428
           1       0.96      0.98      0.97      1889

    accuracy                           0.97      3317
   macro avg       0.97      0.97      0.97      3317
weighted avg       0.97      0.97      0.97      3317


Base Confusion Matrix:
 [[1357   71]
 [  37 1852]]
Tuned Accuracy: 0.9686463671992764

Tuned Classification Report:
               precision    recall  f1-score   support

          -1       0.98      0.95      0.96      1428
           1       0.96      0.98      0.97      1889

    accuracy                           0.97      3317
   macro avg       0.97      0.97      0.97      3317
weighted avg       0.97      0.97      0.97      3317


Tuned Confusion Matrix:
 [[1358   70]
 [  34 1855]]


## 6. Deployment