# **PhiUSIIL Phishing URL (Website)**

## **Part 2:**

- In this section, we predict whether a URL is legitimate or phishing using a Random Forest model.

#### **Step 1: Loading the dataset**

In [1]:
import pandas as pd

# Reading the processed dataset

train_data = pd.read_csv("/workspaces/proyecto-final-Phishing-URL/data/processed/X_train_sel.csv")
test_data = pd.read_csv("/workspaces/proyecto-final-Phishing-URL/data/processed/X_test_sel.csv")

train_data.head()

|   | URLSimilarityIndex | CharContinuationRate | URLCharProb | LetterRatioInURL | DegitRatioInURL | NoOfOtherSpecialCharsInURL | SpacialCharRatioInURL | IsHTTPS | HasTitle | DomainTitleMatchScore | ... | Robots | IsResponsive | HasDescription | HasSocialNet | HasSubmitButton | HasHiddenFields | Pay | HasCopyrightInfo | NoOfJS | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100.0 | 1.0 | 0.066724 | 0.5 | 0.0 | 1.0 | 0.038 | 1.0 | 1.0 | 100.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 17.0 | 1 |
| 1 | 100.0 | 0.692308 | 0.048056 | 0.481 | 0.0 | 2.0 | 0.074 | 1.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 7.0 | 1 |
| 2 | 89.623464 | 0.72 | 0.045218 | 0.568 | 0.162 | 2.0 | 0.054 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 3 | 26.11451 | 1.0 | 0.064391 | 0.695 | 0.0 | 4.0 | 0.102 | 1.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 4 | 100.0 | 1.0 | 0.054861 | 0.552 | 0.0 | 1.0 | 0.034 | 1.0 | 1.0 | 100.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 23.0 | 1 |


#### **Step 2: Build a random forest:**

In [2]:
# Separate predictors and target variable in training and test data:

X_train = train_data.drop(['label'], axis = 1)
y_train = train_data['label']
X_test = test_data.drop(['label'], axis = 1)
y_test = test_data['label']

In [3]:
# Creating and training the Random Forest model:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 60, random_state = 42)
model.fit(X_train, y_train)

In [4]:
# Make predictions on test data:

y_pred = model.predict(X_test)
y_pred

array([0, 1, 1, ..., 1, 0, 1])

In [5]:
# Calculating model accuracy on test data:

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.9999150274036623
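
Accuracy alone does not show which class the few remaining errors fall into. The following is a minimal sketch of how a confusion matrix and a per-class report could complement the accuracy score. It runs on synthetic stand-in data from `make_classification`, and the names `X_demo`, `y_demo`, `demo_model` and `demo_pred` are assumptions for illustration; in this notebook the same calls would take `y_test` and `y_pred`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): replace with the notebook's own train/test split.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Same model configuration as in Step 2.
demo_model = RandomForestClassifier(n_estimators=60, random_state=42)
demo_model.fit(X_tr, y_tr)
demo_pred = demo_model.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, demo_pred)
print(cm)

# Precision, recall and F1 for each class.
print(classification_report(y_te, demo_pred))
```

The diagonal of the matrix counts correct predictions, so its sum divided by the total equals the accuracy; the off-diagonal cells separate the two kinds of misclassification.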

#### **Step 3: Optimize the previous model:**

In [6]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the grid of hyperparameters for search:

hyperparams = {
    "n_estimators": [50, 150],
    "max_depth": [None, 10],
    "min_samples_split": [5, 10],
    "min_samples_leaf": [1, 4],
    "max_features": ["sqrt", "log2"]
}

# Creating the base Random Forest model (GridSearchCV fits clones of it):

model = RandomForestClassifier(random_state=42)

# Performing hyperparameter search using GridSearchCV:

grid = GridSearchCV(model, hyperparams, scoring="accuracy", cv=5)
grid.fit(X_train, y_train)

# Print the best hyperparameters from search:

print(f"The best parameters are: {grid.best_params_}")

The best parameters are: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 50}
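
To get a feel for the cost of this search, the number of candidate combinations can be counted before fitting anything. This is a small sketch using scikit-learn's `ParameterGrid` on the same `hyperparams` dictionary; the variable names `n_candidates`, `n_folds` and `n_fits` are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
hyperparams = {
    "n_estimators": [50, 150],
    "max_depth": [None, 10],
    "min_samples_split": [5, 10],
    "min_samples_leaf": [1, 4],
    "max_features": ["sqrt", "log2"],
}

# Each of the five hyperparameters has two values, so there are 2**5 combinations.
n_candidates = len(ParameterGrid(hyperparams))

# With cv=5, every combination is fitted once per fold.
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 32 160
```

With the default `refit=True`, `GridSearchCV` also performs one additional fit of the best combination on the full training set.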


In [7]:
# Retrain model with best hyperparameters:

model_grid = RandomForestClassifier(
    max_depth = None,
    max_features = 'sqrt',
    min_samples_leaf = 1,
    min_samples_split = 5,
    n_estimators = 50,
    random_state = 42,
)
model_grid.fit(X_train, y_train)
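
As an alternative to typing the best hyperparameters by hand, a fitted `GridSearchCV` with the default `refit=True` exposes an already retrained model as `best_estimator_`. The sketch below illustrates this on synthetic data with a deliberately small grid; `X_demo`, `y_demo`, `small_grid`, `search`, `best_model` and `manual_model` are hypothetical names, not part of this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): the notebook would use X_train and y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# A reduced grid so the example runs quickly.
small_grid = {"n_estimators": [10, 20], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), small_grid, scoring="accuracy", cv=3
)
search.fit(X_demo, y_demo)

# Model refit on all of the training data with the best hyperparameters.
best_model = search.best_estimator_

# Manual retraining with the same hyperparameters and seed, as done in the cell above.
manual_model = RandomForestClassifier(**search.best_params_, random_state=42)
manual_model.fit(X_demo, y_demo)

# Both models should produce the same predictions.
same = np.array_equal(best_model.predict(X_demo), manual_model.predict(X_demo))
print(same)
```

Unpacking `best_params_` with `**` keeps the retraining step in sync with the search if the grid changes later.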

In [8]:
# Make predictions on test data with the retrained model:

y_pred = model_grid.predict(X_test)
y_pred

array([0, 1, 1, ..., 1, 0, 1])

In [9]:
# Calculating the accuracy of the retrained model on test data:

accuracy_score(y_test, y_pred)

0.9999150274036623

**Conclusions:**
- The optimization step did not improve the results: the tuned model reaches the same test accuracy as the initial model (**0.9999150274036623**), which was already close to 1 and left almost no room for improvement.

#### **Step 4: Save the model:**

In [10]:
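from pickle import dump

# Save the fitted model (model_grid). The variable `model` from Step 3 is only the
# unfitted base estimator passed to GridSearchCV, so it cannot make predictions.
with open("/workspaces/proyecto-final-Phishing-URL/models/ranfor_classifier_42.sav", "wb") as f:
    dump(model_grid, f)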
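
A saved model is only useful if it can be loaded back and still predicts. The following sketch shows a pickle round trip on a small synthetic model written to a temporary directory; the file name `demo_classifier.sav` and the variables `fitted` and `restored` are assumptions for illustration and do not refer to the project's real model file.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small fitted model (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)
fitted = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_classifier.sav")

    # Write the fitted model to disk.
    with open(model_path, "wb") as f:
        pickle.dump(fitted, f)

    # Read it back into a new object.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give the same predictions as the original.
round_trip_ok = np.array_equal(fitted.predict(X_demo), restored.predict(X_demo))
print(round_trip_ok)
```

Pickle files should only be loaded from trusted sources, and preferably with the same scikit-learn version that produced them.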