# Assignment: Decision Trees and Random Forests

## 0. Setting Up the Data

We first fetched the [**Phishing Websites**](https://archive.ics.uci.edu/dataset/327/phishing+websites) data set from the UCI Machine Learning Repository.

In [60]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
phishing_websites = fetch_ucirepo(id=327) 
  
# data (as pandas dataframes) 
X = phishing_websites.data.features 
y = phishing_websites.data.targets 
  
# metadata 
print(phishing_websites.metadata) 
  
# variable information 
print(phishing_websites.variables) 

{'uci_id': 327, 'name': 'Phishing Websites', 'repository_url': 'https://archive.ics.uci.edu/dataset/327/phishing+websites', 'data_url': 'https://archive.ics.uci.edu/static/public/327/data.csv', 'abstract': 'This dataset collected mainly from: PhishTank archive, MillerSmiles archive, Google’s searching operators.', 'area': 'Computer Science', 'tasks': ['Classification'], 'characteristics': ['Tabular'], 'num_instances': 11055, 'num_features': 30, 'feature_types': ['Integer'], 'demographics': [], 'target_col': ['result'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2012, 'last_updated': 'Tue Mar 05 2024', 'dataset_doi': '10.24432/C51W2X', 'creators': ['Rami Mohammad', 'Lee McCluskey'], 'intro_paper': {'ID': 396, 'type': 'NATIVE', 'title': 'An assessment of features related to phishing websites using an automated technique', 'authors': 'R. Mohammad, F. Thabtah, L. Mccluskey', 'venue': 'International Conference for Internet Tec

## 1. Business Understanding

The goal is to assess whether a website can be classified as phishing or legitimate based on its features, using **Decision Tree** and **Random Forest** models.

Model performance is evaluated to determine if a reliable phishing prediction is feasible with the given data and models.

## 2. Data Understanding

In [61]:
# Inspect the data
X.info()
print(X.describe())  # print explicitly; a bare expression in the middle of a cell is not displayed

# Display the targets
print("\nTarget value counts:\n", y.value_counts())

# Display unique values for each feature
print("\nUnique values for each feature:")
for num, col in enumerate(X.columns):
    unique_vals = sorted(int(x) for x in X[col].unique())
    print(f"{num} {col}: {unique_vals}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11055 entries, 0 to 11054
Data columns (total 30 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   having_ip_address           11055 non-null  int64
 1   url_length                  11055 non-null  int64
 2   shortining_service          11055 non-null  int64
 3   having_at_symbol            11055 non-null  int64
 4   double_slash_redirecting    11055 non-null  int64
 5   prefix_suffix               11055 non-null  int64
 6   having_sub_domain           11055 non-null  int64
 7   sslfinal_state              11055 non-null  int64
 8   domain_registration_length  11055 non-null  int64
 9   favicon                     11055 non-null  int64
 10  port                        11055 non-null  int64
 11  https_token                 11055 non-null  int64
 12  request_url                 11055 non-null  int64
 13  url_of_anchor               11055 non-null  int64
 14  links_

The dataset contains 11,055 entries and 30 feature columns. Each row represents a website, and each column describes one website-related feature. The feature values are encoded as -1, 0, and 1, describing whether a website characteristic looks bad (phishing-like), suspicious, or normal.
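The encoding described above can be checked programmatically. The following is a minimal sketch on a made-up toy frame: the column names are taken from the `X.info()` listing, but the values and the helper names `encoding_summary` and `columns_outside_encoding` are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for X; the values are made up.
toy = pd.DataFrame({
    "having_ip_address": [-1, 1, 1, -1],
    "url_length": [1, 0, -1, 1],
    "sslfinal_state": [-1, 0, 1, 1],
})

ALLOWED = {-1, 0, 1}  # the encoding described in the text


def encoding_summary(df):
    """Map each column to its sorted list of distinct values."""
    return {col: sorted(int(v) for v in df[col].unique()) for col in df.columns}


def columns_outside_encoding(df, allowed=ALLOWED):
    """Return the columns that contain at least one value outside `allowed`."""
    return [
        col for col in df.columns
        if not {int(v) for v in df[col].unique()} <= allowed
    ]


summary = encoding_summary(toy)
bad_columns = columns_outside_encoding(toy)
print(summary)
print("Columns with unexpected values:", bad_columns)
```

Running the same two helpers on the real `X` would show which features use all three values and which use only two of them.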

## 3. Data Preparation

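The dataset is complete and consistent: the metadata reports no missing values, and every feature is stored as an integer. Features and targets are already separated into `X` and `y`, so no additional preparation is required.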
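A quick way to back up this claim is to test for missing values and integer dtypes directly. The sketch below uses small hypothetical stand-ins (`X_toy`, `y_toy`) shaped like the frames returned by `fetch_ucirepo`. It also shows how the one-column target frame could be flattened to a 1-D Series, since some scikit-learn estimators expect a 1-D target and may warn otherwise. This flattening step is an optional addition, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-ins for X and y (features as a DataFrame,
# target as a one-column DataFrame named "result").
X_toy = pd.DataFrame({"favicon": [1, -1, 1], "port": [1, 1, -1]})
y_toy = pd.DataFrame({"result": [1, -1, 1]})

# Count missing cells across features and target.
n_missing = int(X_toy.isna().sum().sum() + y_toy.isna().sum().sum())

# Check that every feature column has an integer dtype.
all_integer = all(pd.api.types.is_integer_dtype(dtype) for dtype in X_toy.dtypes)

# Flatten the one-column target frame to a 1-D Series.
y_flat = y_toy["result"]

print("Missing cells:", n_missing)
print("All features integer:", all_integer)
print("Target dimensions:", y_flat.ndim)
```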

## 4. Modeling

### Data splitting
The data is split into two sets, one for training and one for testing:
* Training: 70%
* Testing: 30%

In [62]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.3, 
    random_state=42
)
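To see what the 70/30 split means in absolute numbers, the sketch below repeats the same call on placeholder arrays with 11,055 rows (the instance count from the metadata). The arrays `X_fake` and `y_fake` are synthetic and only reproduce the shape of the data, not its content.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 11055  # number of instances reported in the dataset metadata

# Synthetic placeholders with the same number of rows as the real data.
X_fake = np.zeros((n_rows, 30), dtype=int)
y_fake = np.where(np.arange(n_rows) % 2 == 0, 1, -1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_fake, y_fake, test_size=0.3, random_state=42
)

# The test share is rounded up, the remainder goes to training.
print(len(X_tr), len(X_te))  # 7738 3317
```

The test size of 3,317 matches the total support shown in the classification report. If preserving the class ratio in both sets matters, `train_test_split` also accepts a `stratify` argument; the original split does not use it.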

### Training the Decision Tree

### Finding the best hyperparameters

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Initialize and fit the Decision Tree classifier
# (this fitted model, clf, is the one evaluated in Section 5)
clf = DecisionTreeClassifier(
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    criterion="entropy",
    random_state=42
)
clf.fit(X_train, y_train)

# Hyperparameter tuning using GridSearchCV
params = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'criterion': ['gini', 'entropy']
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid=params, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)



Best parameters: {'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
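The grid above covers 4 × 3 × 3 × 2 = 72 parameter combinations, each fitted on 5 folds. As a minimal sketch of how the search result could be used afterwards, the example below runs a smaller, hypothetical grid on synthetic data from `make_classification` (a stand-in, not the phishing data) and reads the refitted model from `best_estimator_`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data as a stand-in for the phishing features (assumption).
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

# A smaller, hypothetical grid to keep the example fast.
small_grid = {"max_depth": [3, None], "criterion": ["gini", "entropy"]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid=small_grid, cv=5
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on all of X_tr
# using the best parameter combination.
best_tree = search.best_estimator_
print("Best parameters:", search.best_params_)
print("Mean CV accuracy of best setting:", search.best_score_)
print("Held-out accuracy:", best_tree.score(X_te, y_te))
```

In the notebook itself, the best parameters found by the search happen to equal the settings already used for `clf`, so evaluating `clf` or `grid.best_estimator_` should amount to the same model configuration.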


## 5. Evaluation

In [64]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Evaluate the fitted decision tree (clf) on the held-out test set
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


Accuracy: 0.9614109134760326

Classification Report:
               precision    recall  f1-score   support

          -1       0.96      0.95      0.95      1428
           1       0.96      0.97      0.97      1889

    accuracy                           0.96      3317
   macro avg       0.96      0.96      0.96      3317
weighted avg       0.96      0.96      0.96      3317


Confusion Matrix:
 [[1357   71]
 [  57 1832]]
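The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy, precision, recall, and F1 from the four counts, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, in the sorted label order [-1, 1].

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are true labels, columns are predicted labels, order [-1, 1].
cm = np.array([[1357, 71],
               [57, 1832]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)  # correct / predicted as this class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / truly in this class
f1 = 2 * precision * recall / (precision + recall)

for label, p, r, f in zip([-1, 1], precision, recall, f1):
    print(f"class {label:>2}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Of the 3,317 test websites, 71 with true label -1 were predicted as 1, and 57 with true label 1 were predicted as -1, which gives 128 misclassifications in total.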


## 6. Deployment