In [6]:
!pip install pandas scikit-learn transformers tensorflow matplotlib






### Phishing URL Detection with RoBERTa and TensorFlow

This notebook builds a classifier that labels URLs as phishing or not phishing, using the RoBERTa transformer model and TensorFlow. It covers data loading, model training, saving, and evaluation.

#### Data Preparation

1. **Import Necessary Libraries**: We begin by importing required libraries including `pandas` for data manipulation, `sklearn.model_selection` for splitting the dataset, and components from `transformers` and `tensorflow` for model building.

2. **Load Dataset**: 
    - `combined_df.csv` is loaded into a pandas DataFrame. This CSV file should contain at least two columns: one with URLs (`url`) and another with labels indicating whether each URL is phishing or not (`label`).

3. **Split Dataset**:
    - The dataset is shuffled and then split twice with `train_test_split`: 20% of the data is held out for testing, and a quarter of the remainder is held out for validation, giving a 60/20/20 train/validation/test split.
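
The 60/20/20 proportions follow from applying `train_test_split` twice. A minimal sketch on a made-up 100-row frame shows the arithmetic; the toy URLs and the `stratify` argument are assumptions added for illustration (the notebook itself splits without stratification).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, half labelled 1 and half labelled 0.
toy_df = pd.DataFrame({
    "url": [f"example{i}.test/path" for i in range(100)],
    "label": [i % 2 for i in range(100)],
})

# First split: hold out 20% of all rows for testing.
train_val, test = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["label"]
)

# Second split: 25% of the remaining 80% equals 20% of the original data.
train, val = train_test_split(
    train_val, test_size=0.25, random_state=42, stratify=train_val["label"]
)

print(len(train), len(val), len(test))
```

Stratifying keeps the phishing/legitimate ratio roughly equal across the three sets, which can matter when the classes are imbalanced.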



In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import RobertaTokenizer, TFRobertaModel
from tensorflow.keras.utils import to_categorical

# Specify the path to your CSV file
csv_file_path = 'combined_df.csv'

# Load the CSV file into a DataFrame
combined_df = pd.read_csv(csv_file_path)


# Shuffle the rows reproducibly before splitting
combined_df = combined_df.sample(frac=1, random_state=42).reset_index(drop=True)

print(combined_df.head(20))

# Split dataset into training, validation, and testing sets
train_val_df, test_df = train_test_split(combined_df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=42)  # Results in 60% train, 20% validation, 20% test

                                                  url  label
0                      www.usaozsazps.com/information      1
1    awrs.cl/wp-content/themes/form/bill.charged.html      1
2   ipfs.eth.aragon.network/ipfs/bafybeifg3yzh6ekg...      1
3              wwxhajudjgwjklckvzgs7.firebaseapp.com/      1
4                                         roomclip.jp      0
5                                           macys.com      0
6                           pages-confirm.start.page/      1
7   cloudflare-ipfs.com/ipfs/bafybeihd2ekf4bv6pvrt...      1
8                                          yenicag.az      0
9                                      desjardins.com      0
10                                         orange.com      0
11  statybosabc.lt/wp-content/plugins/dir/wp-insta...      1
12                              coolandevencooler.com      0
13                                    usp.usspzp.top/      1
14  storage.cloud.google.com/q90qqqar22r229r292eus...      1
15                      


#### Model Preparation

1. **Tokenizer Initialization**:
    - A RoBERTa tokenizer is initialized to convert each URL into token IDs and an attention mask, the inputs the model expects.

2. **Tokenization Function**:
    - `tokenize_urls` pads every sequence to the longest one in the call and truncates anything beyond `max_len` (512 tokens by default), so all inputs in a set share one length.

3. **Label Preparation**:
    - The `label` column is converted to one-hot vectors with `to_categorical`, matching the two-unit softmax output of the classifier.
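
To make the padding and truncation step concrete, the sketch below mimics on plain Python lists what `padding=True, truncation=True` produce: each sequence is cut to at most `max_len` tokens, padded to the longest sequence in the batch, and paired with an attention mask. The function `pad_and_truncate`, the token IDs, and the pad ID of 1 are illustrative assumptions; this is a simplification, not the tokenizer's actual implementation (which, for example, preserves special tokens when truncating).

```python
def pad_and_truncate(batch, max_len, pad_id=1):
    """Truncate each ID list to max_len, then pad to the longest one."""
    truncated = [ids[:max_len] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (width - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs standing in for three URLs of different lengths.
batch = [[0, 11, 12, 2], [0, 21, 22, 23, 24, 25, 2], [0, 31, 2]]
encoded = pad_and_truncate(batch, max_len=5)

print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Because padding goes to the longest sequence in the call, tokenizing a whole set at once makes every row as wide as its longest URL (capped at `max_len`), which affects memory use and training time.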

In [11]:

# Initialize tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
# Helper that tokenizes a list of URL strings into TensorFlow tensors
def tokenize_urls(texts, tokenizer, max_len=512):
    return tokenizer(texts, padding=True, truncation=True, max_length=max_len, return_tensors="tf")



# Tokenize URLs for train, validation, and test sets
train_encodings = tokenize_urls(train_df['url'].tolist(), tokenizer)
val_encodings = tokenize_urls(val_df['url'].tolist(), tokenizer)
test_encodings = tokenize_urls(test_df['url'].tolist(), tokenizer)

# Prepare labels for train, validation, and test sets
train_labels = to_categorical(train_df['label'])
val_labels = to_categorical(val_df['label'])
test_labels = to_categorical(test_df['label'])


#### RoBERTa Model Setup

1. **Load Pre-Trained RoBERTa**:
    - The pre-trained RoBERTa model is loaded through the TensorFlow class `TFRobertaModel` and frozen (`trainable = False`), so its pre-trained weights are reused unchanged and only the classification head is trained.

2. **Model Architecture**:
    - Two input layers are defined, one for input IDs and one for attention masks.
    - RoBERTa's pooled output is extracted and passed through a dropout layer for regularization.
    - A dense layer with two units and softmax activation outputs the probability of each class.

3. **Model Compilation**:
    - The model is compiled with the Adam optimizer and categorical cross-entropy loss, which matches the one-hot labels and the two-unit softmax output.
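
The pairing of one-hot labels, a two-unit softmax, and categorical cross-entropy can be checked by hand. The NumPy sketch below uses made-up logits and labels (all names and values are assumptions for illustration) and shows that, for one-hot targets, the loss per example is simply the negative log of the probability assigned to the true class.

```python
import numpy as np


def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def categorical_crossentropy(one_hot, probs):
    # Per-example loss: -sum_k y_k * log(p_k)
    return -(one_hot * np.log(probs)).sum(axis=1)


labels = np.array([1, 0, 1])
one_hot = np.eye(2)[labels]  # same layout to_categorical gives for 0/1 labels
logits = np.array([[0.2, 1.5], [2.0, -1.0], [0.3, 0.1]])  # made-up head outputs

probs = softmax(logits)
loss = categorical_crossentropy(one_hot, probs)

# With one-hot targets only the true-class term survives the sum.
true_class_prob = probs[np.arange(len(labels)), labels]
print(np.allclose(loss, -np.log(true_class_prob)))
```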


In [16]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dropout, Dense

# Load pre-trained RoBERTa model
roberta = TFRobertaModel.from_pretrained('roberta-base')

# Freeze the RoBERTa model to reuse the pre-trained features without modifying them
roberta.trainable = False

# Input layer
input_ids = Input(shape=(None,), dtype='int32', name="input_ids")
attention_masks = Input(shape=(None,), dtype='int32', name="attention_mask")

# RoBERTa embeddings
embeddings = roberta(input_ids, attention_mask=attention_masks)[1]  # We use the pooled output

# Additional layers
x = Dropout(0.1)(embeddings)
output = Dense(2, activation='softmax')(x)  # Assuming binary classification (phishing or not)

# Construct the model
model = Model(inputs=[input_ids, attention_masks], outputs=output)

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaModel: ['lm_head.dense.weight', 'roberta.embeddings.position_ids', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing TFRobertaModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFRobertaModel were not initialized from the PyTorch model and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and infe


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_ids (InputLayer)      [(None, None)]               0         []                            
                                                                                                  
 attention_mask (InputLayer  [(None, None)]               0         []                            
 )                                                                                                
                                                                                                  
 tf_roberta_model_3 (TFRobe  TFBaseModelOutputWithPooli   1246456   ['input_ids[0][0]',           
 rtaModel)                   ngAndCrossAttentions(last_   32         'attention_mask[0][0]']      
                             hidden_state=(None, None,                                       
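
The loading message above lists the pooler weights (`roberta.pooler.dense.weight`, `roberta.pooler.dense.bias`) as newly initialized, so with RoBERTa frozen the pooled output passes through an untrained layer. Two commonly used alternatives are the hidden state of the first token and a mask-aware mean over all tokens. The NumPy sketch below illustrates both on random data; the shapes and variable names are assumptions for illustration, and this is not the notebook's model code.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 4, 8

# Stand-in for the encoder's last hidden state: (batch, seq_len, hidden).
last_hidden_state = rng.normal(size=(batch, seq_len, hidden))
# 1 = real token, 0 = padding.
attention_mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])

# Option 1: representation of the first token in each sequence.
first_token = last_hidden_state[:, 0, :]

# Option 2: average over real tokens only, ignoring padded positions.
mask = attention_mask[:, :, None].astype(float)
mean_pooled = (last_hidden_state * mask).sum(axis=1) / mask.sum(axis=1)

print(first_token.shape, mean_pooled.shape)
```

In the Keras graph, the analogous change would be to take the first element of the RoBERTa output (the last hidden state) and slice it, rather than taking the element at index 1.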

#### Training

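1. **Model Training**:
    - The model is trained on the tokenized URLs and one-hot labels for three epochs. The separate validation set is passed as `validation_data` to monitor performance on data that is not used for weight updates.

2. **Model Saving**:
    - The trained model is saved to the `my_roberta_model` directory in TensorFlow SavedModel format for later use in prediction.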

In [17]:
# Model training to include validation data
history = model.fit(
    {'input_ids': train_encodings['input_ids'], 'attention_mask': train_encodings['attention_mask']},
    train_labels,
    validation_data=(
        {'input_ids': val_encodings['input_ids'], 'attention_mask': val_encodings['attention_mask']},
        val_labels
    ),
    epochs=3  # Adjust epochs based on your dataset size and desired performance
)

# Save the model
model_save_directory = 'my_roberta_model'
model.save(model_save_directory, save_format='tf')

Epoch 1/3


   9/1792 [..............................] - ETA: 6:21:10 - loss: 0.7062 - accuracy: 0.4444

KeyboardInterrupt: 
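
The progress bar shows 1792 steps per epoch and an estimated six hours for the first epoch, after which the run was interrupted. `model.fit` was called without `batch_size`, and Keras defaults to 32 for in-memory tensors, so the step count is the training-set size divided by 32, rounded up. The sketch below works through that arithmetic; the row count of 57,344 is a hypothetical value chosen only because it yields 1792 steps, not the actual dataset size.

```python
import math


def steps_per_epoch(n_samples, batch_size=32):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


n_train = 57_344  # hypothetical training-set size

print(steps_per_epoch(n_train))        # 1792
print(steps_per_epoch(n_train, 128))   # 448

# Any training-set size in this range gives 1792 steps at batch size 32.
smallest = 1791 * 32 + 1
largest = 1792 * 32
print(smallest, largest)               # 57313 57344
```

A larger batch size reduces the number of steps but not necessarily the total compute. Lowering `max_len` in `tokenize_urls` is one way to reduce the cost of each step, at the price of truncating long URLs.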

#### Evaluation

1. **Model Loading**:
    - Demonstrates how to load the saved model for further evaluation or prediction.

2. **Prediction**:
    - The script shows how to make predictions on new data, specifically on the test set URLs.

3. **Evaluation Metrics**:
    - Calculates and prints the accuracy and detailed classification report, providing insights into the model's performance on classifying URLs as phishing or not.
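
As a rough illustration of what these metrics measure, the sketch below computes accuracy, precision, and recall by hand with NumPy. The probabilities and true labels are made up for the example, and treating label 1 as the phishing class is an assumption based on the sample rows printed earlier.

```python
import numpy as np

# Made-up softmax outputs for five URLs: columns are class 0 and class 1.
probs = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.60, 0.40],
    [0.30, 0.70],
    [0.55, 0.45],
])
y_true = np.array([0, 1, 1, 1, 0])

# The predicted class is the column with the highest probability.
y_pred = probs.argmax(axis=1)

accuracy = (y_pred == y_true).mean()

# Counts for the positive class (label 1).
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(y_pred, accuracy, precision, recall)
```

Here one phishing URL is missed, so recall drops below precision; `classification_report` prints the same quantities per class.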


In [None]:
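import numpy as np
import tensorflow as tf
from sklearn.metrics import accuracy_score, classification_report

# Load the model for prediction
loaded_model = tf.keras.models.load_model(model_save_directory)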

# Make predictions
predictions = loaded_model.predict(
    {'input_ids': test_encodings['input_ids'], 'attention_mask': test_encodings['attention_mask']}
)

# Convert predictions to label indices
predicted_labels = np.argmax(predictions, axis=1)

# Calculate accuracy
accuracy = accuracy_score(test_df['label'].values, predicted_labels)
print(f"Accuracy: {accuracy}")

# Detailed classification report
print(classification_report(test_df['label'].values, predicted_labels, target_names=['Class 0', 'Class 1']))

#### Model Training Visualization

To see how the model learns over time, we plot accuracy and loss for the training and validation sets across epochs. Training and validation curves that diverge suggest overfitting, while curves that both stay poor suggest underfitting. The plots use the `history` object returned by `model.fit`, so training must run to completion first.
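
The same reading of the curves can be done numerically. The sketch below uses a hypothetical dictionary shaped like `history.history` (the loss values are invented) to find the epoch with the lowest validation loss and the gap between validation and training loss at each epoch; the helper names are assumptions for illustration.

```python
# Hypothetical values shaped like history.history after four epochs.
history_dict = {
    "loss": [0.60, 0.40, 0.30, 0.22],
    "val_loss": [0.58, 0.45, 0.47, 0.52],
}


def best_epoch(history_dict):
    """Return the 1-based epoch with the lowest validation loss."""
    val_loss = history_dict["val_loss"]
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    return best_index + 1


def generalization_gap(history_dict):
    """Validation loss minus training loss, per epoch."""
    pairs = zip(history_dict["loss"], history_dict["val_loss"])
    return [round(val - train, 4) for train, val in pairs]


print(best_epoch(history_dict))
print(generalization_gap(history_dict))
```

In this invented run the validation loss bottoms out at the second epoch while the gap keeps widening, the pattern usually read as overfitting.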


In [None]:
import matplotlib.pyplot as plt

# Plot training & validation accuracy values
plt.figure(figsize=(10, 5))
plt.plot(history.history['accuracy'], label='Train')
plt.plot(history.history['val_accuracy'], label='Validation')
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(loc='upper left')
plt.show()

# Plot training & validation loss values
plt.figure(figsize=(10, 5))
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Validation')
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(loc='upper left')
plt.show()
