# Assignment: Decision Trees and Random Forests

## 0. Setting Up the Data

We first fetched the [**Phishing Websites**](https://archive.ics.uci.edu/dataset/327/phishing+websites) data set from the UCI Machine Learning Repository.

In [1]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
phishing_websites = fetch_ucirepo(id=327) 
  
# data (as pandas dataframes) 
X = phishing_websites.data.features 
y = phishing_websites.data.targets 
  
# metadata 
print(phishing_websites.metadata) 
  
# variable information 
print(phishing_websites.variables) 

{'uci_id': 327, 'name': 'Phishing Websites', 'repository_url': 'https://archive.ics.uci.edu/dataset/327/phishing+websites', 'data_url': 'https://archive.ics.uci.edu/static/public/327/data.csv', 'abstract': 'This dataset collected mainly from: PhishTank archive, MillerSmiles archive, Google’s searching operators.', 'area': 'Computer Science', 'tasks': ['Classification'], 'characteristics': ['Tabular'], 'num_instances': 11055, 'num_features': 30, 'feature_types': ['Integer'], 'demographics': [], 'target_col': ['result'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2012, 'last_updated': 'Tue Mar 05 2024', 'dataset_doi': '10.24432/C51W2X', 'creators': ['Rami Mohammad', 'Lee McCluskey'], 'intro_paper': {'ID': 396, 'type': 'NATIVE', 'title': 'An assessment of features related to phishing websites using an automated technique', 'authors': 'R. Mohammad, F. Thabtah, L. Mccluskey', 'venue': 'International Conference for Internet Tec

## 1. Business Understanding

The goal is to assess whether a website can be classified as phishing or legitimate based on its features, using a **Decision Tree** and a **Random Forest**.

Model performance is evaluated to determine if a reliable phishing prediction is feasible with the given data and models.

## 2. Data Understanding

In [2]:
# Inspect the data
X.info()
print(X.describe())  # a bare expression mid-cell is not displayed, so print it

# Display the targets
print("\nTarget value counts:\n", y.value_counts())

# Display unique values for each feature
print("\nUnique values for each feature:")
for num, col in enumerate(X.columns):
    unique_vals = sorted(int(x) for x in X[col].unique())
    print(f"{num} {col}: {unique_vals}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11055 entries, 0 to 11054
Data columns (total 30 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   having_ip_address           11055 non-null  int64
 1   url_length                  11055 non-null  int64
 2   shortining_service          11055 non-null  int64
 3   having_at_symbol            11055 non-null  int64
 4   double_slash_redirecting    11055 non-null  int64
 5   prefix_suffix               11055 non-null  int64
 6   having_sub_domain           11055 non-null  int64
 7   sslfinal_state              11055 non-null  int64
 8   domain_registration_length  11055 non-null  int64
 9   favicon                     11055 non-null  int64
 10  port                        11055 non-null  int64
 11  https_token                 11055 non-null  int64
 12  request_url                 11055 non-null  int64
 13  url_of_anchor               11055 non-null  int64
 14  links_

The dataset contains 11,055 entries and 30 feature columns. Each row represents a website, and each column represents a website-related feature. The feature values are encoded as -1, 0, and 1, indicating whether a website characteristic looks bad, suspicious, or normal, respectively.

## 3. Data Preparation

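The dataset is complete and consistent: the metadata reports no missing values, and all features are integer-encoded. Features and targets are already separated into `X` and `y`. No additional preparation is required.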
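
The completeness and consistency claims above can also be checked programmatically. The following is a minimal sketch, assuming the features are a pandas DataFrame whose values should all lie in {-1, 0, 1}. The helper name `check_encoding` and the toy frame are hypothetical stand-ins for `X`, not part of the original notebook.

```python
import pandas as pd

ALLOWED_VALUES = [-1, 0, 1]  # encoding described in Data Understanding


def check_encoding(features: pd.DataFrame) -> dict:
    """Count missing values and list columns with codes outside ALLOWED_VALUES."""
    missing = int(features.isna().sum().sum())
    bad_columns = [
        col for col in features.columns
        if not features[col].dropna().isin(ALLOWED_VALUES).all()
    ]
    return {"missing": missing, "bad_columns": bad_columns}


# Toy stand-in for X (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "having_ip_address": [-1, 1, 1],
    "url_length": [1, 0, -1],
})
report = check_encoding(toy)
print(report)
```

Calling `check_encoding(X)` on the real features should report zero missing values and an empty list of offending columns if the description above holds.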

## 4. Modeling

### Data splitting
The data is split into two sets: one for training and one for testing.
* Training: 70%
* Testing: 30%

In [3]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.3, 
    random_state=42
)
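
The split above is purely random. One option worth considering is a stratified split, which keeps the class ratio the same in the training and test sets. The sketch below illustrates this on synthetic labels; the toy arrays and their 60/40 class ratio are assumptions for illustration, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 60% class 1, 40% class -1
rng = np.random.default_rng(42)
X_toy = rng.integers(-1, 2, size=(1000, 5))
y_toy = np.array([1] * 600 + [-1] * 400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.3,
    random_state=42,
    stratify=y_toy,  # preserve the class ratio in both subsets
)

train_share = (y_tr == 1).mean()
test_share = (y_te == 1).mean()
print(f"Share of class 1 in train: {train_share:.2f}, test: {test_share:.2f}")
```

In the notebook itself this would amount to passing `stratify=y` to the existing `train_test_split` call.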

### Base Random Forest

In [4]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)

# Flatten the single-column target DataFrames to 1-D arrays for scikit-learn
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

rf.fit(X_train, y_train)

| Parameter | Value |
|---|---|
| n_estimators | 100 |
| criterion | 'gini' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |
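
The two `.values.ravel()` calls in the cell above convert the single-column target DataFrames into 1-D arrays, which is the shape scikit-learn estimators expect for `y`. A small sketch of that shape change, using a hypothetical toy target frame:

```python
import pandas as pd

# Hypothetical single-column target frame, shaped like the dataset's targets
y_frame = pd.DataFrame({"result": [-1, 1, 1, -1]})

# (n_samples, 1) -> (n_samples,)
y_flat = y_frame.values.ravel()

print(y_frame.shape, "->", y_flat.shape)
```

Because the cell overwrites `y_train` and `y_test` in place, running it a second time would fail: NumPy arrays have no `.values` attribute.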


### Hyperparameter Tuning

In [5]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Try different hyperparameter combinations
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt", "log2"]
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

# Select the best model
best_rf = grid_search.best_estimator_

print("Best hyperparameters:", grid_search.best_params_)

Best hyperparameters: {'max_depth': 20, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
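
To put the cost of this search in perspective, the grid can be enumerated with scikit-learn's `ParameterGrid`. The sketch below repeats the grid from the cell above and counts the candidate settings and the resulting cross-validation fits; the final refit of the best model on the full training set comes on top of that.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt", "log2"],
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 3 * 2 * 2 * 2
n_folds = 5                                    # cv=5 in GridSearchCV
n_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

With 48 candidates and 5 folds, the search trains 240 forests before the refit.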


### Random Forest with Max Depth 3

In [10]:
rf_3 = RandomForestClassifier(
    max_depth=3,
    n_estimators=200,
    random_state=42,
    min_samples_leaf=1,
    min_samples_split=2,
    n_jobs=-1
)

rf_3.fit(X_train, y_train)

| Parameter | Value |
|---|---|
| n_estimators | 200 |
| criterion | 'gini' |
| max_depth | 3 |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |


## 5. Evaluation

In [12]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Evaluate the base Random Forest
y_pred_base = rf.predict(X_test)
print("Base Accuracy:", accuracy_score(y_test, y_pred_base))
print("\nBase Classification Report:\n", classification_report(y_test, y_pred_base))
print("\nBase Confusion Matrix:\n", confusion_matrix(y_test, y_pred_base))

# Evaluate the tuned Random Forest
y_pred_tuned = best_rf.predict(X_test)
print("Tuned Accuracy:", accuracy_score(y_test, y_pred_tuned))
print("\nTuned Classification Report:\n", classification_report(y_test, y_pred_tuned))
print("\nTuned Confusion Matrix:\n", confusion_matrix(y_test, y_pred_tuned))

# Evaluate max depth 3 Random Forest
y_pred_3 = rf_3.predict(X_test)
print("Max depth 3 Accuracy:", accuracy_score(y_test, y_pred_3))
print("\nMax depth 3 Report:\n", classification_report(y_test, y_pred_3))
print("\nMax depth 3 Confusion Matrix:\n", confusion_matrix(y_test, y_pred_3))

Base Accuracy: 0.9668375037684654

Base Classification Report:
               precision    recall  f1-score   support

          -1       0.98      0.95      0.96      1428
           1       0.96      0.98      0.97      1889

    accuracy                           0.97      3317
   macro avg       0.97      0.96      0.97      3317
weighted avg       0.97      0.97      0.97      3317


Base Confusion Matrix:
 [[1352   76]
 [  34 1855]]
Tuned Accuracy: 0.967138981006934

Tuned Classification Report:
               precision    recall  f1-score   support

          -1       0.97      0.95      0.96      1428
           1       0.96      0.98      0.97      1889

    accuracy                           0.97      3317
   macro avg       0.97      0.96      0.97      3317
weighted avg       0.97      0.97      0.97      3317


Tuned Confusion Matrix:
 [[1354   74]
 [  35 1854]]
Max depth 3 Accuracy: 0.9255351220982816

Max depth 3 Report:
               precision    recall  f1-score   sup
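
As a cross-check, the figures in the base report can be recomputed from its confusion matrix. The sketch below hard-codes the base matrix printed above and assumes scikit-learn's convention: rows are true labels, columns are predicted labels, ordered -1 then 1.

```python
import numpy as np

# Base confusion matrix from the output above; rows = true, columns = predicted
cm = np.array([[1352, 76],
               [34, 1855]])
labels = [-1, 1]

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all truly in that class

print(f"accuracy = {accuracy:.4f}")
for label, p, r in zip(labels, precision, recall):
    print(f"class {label:>2}: precision = {p:.2f}, recall = {r:.2f}")
```

The recomputed values agree with the base report: accuracy of about 0.9668, precision and recall of 0.98 and 0.95 for class -1, and 0.96 and 0.98 for class 1.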

## 6. Deployment

Compared to the decision tree with similar hyperparameters, the random forest shows only a minor improvement in accuracy.  
However, it has much higher recall for phishing sites, without adversely affecting its recall for legitimate sites.  
It also recognises legitimate sites much more reliably, with precision 4 percentage points higher and only a slight dip in precision for phishing sites.  
The model does show some signs of favouring a phishing verdict, with a decrease in recall for legitimate sites.  
