# Assignment: Decision Trees and Random Forests

## 0. Setting Up the Data

We first fetch the [**Phishing Websites**](https://archive.ics.uci.edu/dataset/327/phishing+websites) data set from the UCI Machine Learning Repository.

In [None]:
from ucimlrepo import fetch_ucirepo
  
# fetch dataset 
phishing_websites = fetch_ucirepo(id=327) 
  
# data (as pandas dataframes) 
X = phishing_websites.data.features 
y = phishing_websites.data.targets 
  
# metadata 
print(phishing_websites.metadata) 
  
# variable information 
print(phishing_websites.variables) 

## 1. Business Understanding

The goal is to assess whether a website can be classified as phishing or legitimate based on website features using **Decision Tree** and **Random Forest**.

Model performance is evaluated to determine if a reliable phishing prediction is feasible with the given data and models.

## 2. Data Understanding

In [None]:
# Inspect the data
X.info()
# Only the last expression of a cell is displayed automatically, so print explicitly
print(X.describe())

# Display the targets
print("\nTarget value counts:\n", y.value_counts())

# Display unique values for each feature
print("\nUnique values for each feature:")
for num, col in enumerate(X.columns):
    unique_vals = sorted(int(x) for x in X[col].unique())
    print(f"{num} {col}: {unique_vals}")

The dataset contains 11,055 entries and 30 feature columns. Each row represents a website, and each column represents a website-related feature. The feature values are encoded as -1, 0, and 1, describing whether a website characteristic looks bad, suspicious, or normal.

## 3. Data Preparation

The dataset is complete and consistent, and the features and targets are already separated, so no additional preparation is required.
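
The completeness claim above can be checked programmatically. The following is a minimal sketch, assuming the features are a pandas DataFrame encoded only with -1, 0, and 1; the helper name `check_features` and the toy column names are hypothetical and not part of the assignment code.

```python
import pandas as pd

def check_features(X: pd.DataFrame, allowed=(-1, 0, 1)) -> dict:
    """Summarise missing values and unexpected codes in a feature frame."""
    # Total number of missing cells across all columns
    missing = int(X.isna().sum().sum())
    # Columns that contain at least one value outside the allowed encoding
    bad_cols = [
        col for col in X.columns
        if not X[col].dropna().isin(list(allowed)).all()
    ]
    return {"missing": missing, "unexpected_columns": bad_cols}

# Toy frame standing in for the real features (hypothetical column names)
demo = pd.DataFrame({"f1": [-1, 0, 1], "f2": [1, 1, 2]})
report = check_features(demo)
print(report)
```

Calling `check_features(X)` on the real features should report zero missing values and an empty column list if the data is as clean as described.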

## 4. Modeling

### Data splitting
The data is split into two sets, one for training and one for testing:
* Training: 70%
* Testing: 30%

In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.3, 
    random_state=42
)
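
The split above is purely random. One possible variation, not used in this assignment, is a stratified split, which keeps the class proportions the same in both sets. The sketch below uses synthetic stand-in data with arbitrarily chosen class proportions; `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 56% labelled 1 and 44% labelled -1
# (proportions chosen arbitrarily for illustration)
rng = np.random.default_rng(42)
X_demo = rng.integers(-1, 2, size=(1000, 5))
y_demo = np.array([1] * 560 + [-1] * 440)

# stratify=y_demo asks for the same class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

train_share = (y_tr == 1).mean()
test_share = (y_te == 1).mean()
print(f"share of class 1 -> train: {train_share:.3f}, test: {test_share:.3f}")
```

With a fairly balanced target the difference to a plain random split is usually small, so this is an optional refinement rather than a required change.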

### Base Random Forest
First we create a base Random Forest model with fixed, untuned hyperparameters (100 trees, `max_samples=0.7`, `max_features=0.75`; everything else is left at its default). We will use this as a reference point for later comparisons.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,
    max_samples=0.7,
    max_features=0.75,
    random_state=42,
    n_jobs=-1
)

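# Flatten the single-column target DataFrames into 1-D arrays, as scikit-learn expects.
# np.ravel also accepts an already flattened array, so this cell is safe to re-run.
import numpy as np

y_train = np.ravel(y_train)
y_test = np.ravel(y_test)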

rf.fit(X_train, y_train)

### Random Forest Hyperparameter Tuning
Next we perform hyperparameter tuning via grid search with cross-validation to find the best combination of hyperparameters for the model.

This setup uses 5-fold cross-validation and 6 different hyperparameters (64 combinations), resulting in 320 model fits. The combination resulting in the highest mean cross-validation accuracy is selected as the final tuned model.
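
The counts quoted above can be verified with scikit-learn's `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` iterates over. This is a small sanity-check sketch; the variable names `n_combinations`, `n_folds`, and `n_cv_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as used for the search: six hyperparameters, two values each
param_grid = {
    "n_estimators": [200, 300],
    "max_depth": [20, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": [0.7, 0.75],
    "max_samples": [0.7, 0.75],
}

n_combinations = len(ParameterGrid(param_grid))  # 2 ** 6
n_folds = 5
n_cv_fits = n_combinations * n_folds

print(f"{n_combinations} combinations x {n_folds} folds = {n_cv_fits} fits")
```

Note that `GridSearchCV` with its default `refit=True` trains one additional model on the full training set using the best combination, on top of the cross-validation fits.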

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Try different hyperparameter combinations
param_grid = {
    "n_estimators": [200, 300],
    "max_depth": [20, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": [0.7, 0.75],
    "max_samples": [0.7, 0.75]
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

# Select the best model
best_rf = grid_search.best_estimator_

print("Best hyperparameters:", grid_search.best_params_)
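
Besides the single best combination, it can be useful to see how close the runners-up are. The sketch below shows one way to rank all candidates via the `cv_results_` attribute. To stay self-contained and fast it uses synthetic data from `make_classification` and a deliberately tiny grid, both of which are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a small grid, just to illustrate the pattern
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid={"max_depth": [3, None], "min_samples_leaf": [1, 2]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# One row per hyperparameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
]
print(top.head())
```

Applied to the notebook's own `grid_search` object, the same three lines after the fit would show whether the tuned model is clearly better than the alternatives or only marginally so.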

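### Random Forest with Max Depth 3
Finally we train a shallow Random Forest with the maximum tree depth limited to 3, for comparison with the other models.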

In [None]:
rf_3 = RandomForestClassifier(
    max_depth=3,
    n_estimators=200,
    random_state=42,
    min_samples_leaf=1,
    min_samples_split=2,
    n_jobs=-1
)

rf_3.fit(X_train, y_train)

## 5. Evaluation

Evaluating the performance of the base, tuned, and max depth 3 Random Forest models on the test set.

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Evaluate the base Random Forest
y_pred_base = rf.predict(X_test)
print("Base Accuracy:", accuracy_score(y_test, y_pred_base))
print("\nBase Classification Report:\n", classification_report(y_test, y_pred_base))
print("\nBase Confusion Matrix:\n", confusion_matrix(y_test, y_pred_base))

# Evaluate the tuned Random Forest
y_pred_tuned = best_rf.predict(X_test)
print("Tuned Accuracy:", accuracy_score(y_test, y_pred_tuned))
print("\nTuned Classification Report:\n", classification_report(y_test, y_pred_tuned))
print("\nTuned Confusion Matrix:\n", confusion_matrix(y_test, y_pred_tuned))

# Evaluate max depth 3 Random Forest
y_pred_3 = rf_3.predict(X_test)
print("Max depth 3 Accuracy:", accuracy_score(y_test, y_pred_3))
print("\nMax depth 3 Report:\n", classification_report(y_test, y_pred_3))
print("\nMax depth 3 Confusion Matrix:\n", confusion_matrix(y_test, y_pred_3))
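
The three reports above are easier to compare side by side. The following is a minimal sketch of a summary table built from scikit-learn's metric functions. It assumes that phishing sites carry the label -1 in the target; the helper name `summarize` and the toy labels are hypothetical.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

def summarize(name, y_true, y_pred, phishing_label=-1):
    """Collect accuracy plus precision/recall for the phishing class.

    phishing_label=-1 is an assumption about the target encoding.
    """
    return {
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "phishing_precision": precision_score(y_true, y_pred, pos_label=phishing_label),
        "phishing_recall": recall_score(y_true, y_pred, pos_label=phishing_label),
    }

# Toy labels standing in for y_test and the model predictions
y_true = [-1, -1, -1, 1, 1, 1]
pred_a = [-1, -1, 1, 1, 1, 1]
pred_b = [-1, 1, 1, 1, 1, -1]

summary = pd.DataFrame(
    [summarize("model_a", y_true, pred_a), summarize("model_b", y_true, pred_b)]
).set_index("model")
print(summary.round(3))
```

In the notebook the same helper could be called with `y_test` and each of `y_pred_base`, `y_pred_tuned`, and `y_pred_3`.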

## 6. Deployment

Compared to a decision tree with similar hyperparameters, the random forest shows only a minor improvement in accuracy.
However, it has much higher recall for phishing sites, without adversely affecting its recall for legitimate sites.
It also recognises legitimate sites much more reliably, with 4 points higher precision, at the cost of only a slight dip in precision for phishing sites.
The model does show
