# INF8215: Artificial Intelligence: Methods and Algorithms

# TP3 - Phishing Detection

Winter 2023

Polytechnique Montréal

Benjamin-Ousmane M'Bengue

In [1]:
# pip install torch
# pip install transformers
# pip install optuna
# pip install datasets
# pip install evaluate
# pip install scikit-learn
# pip install pandas
# pip install numpy


# Free cached memory and check that a GPU is available
import gc
import torch

gc.collect()
torch.cuda.empty_cache()
torch.cuda.memory_summary(device=None, abbreviated=False)
torch.cuda.is_available()

True

## 1. Loading the training data

In [2]:
import pandas as pd

df = pd.read_csv('train.csv')
df

Unnamed: 0.1,Unnamed: 0,url,length_url,length_hostname,ip,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,...,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
0,0,http://www.wikihow.com/Clear-Google-Search-His...,50,15,0,2,3,0,0,0,...,0,1,0,1417,5887,184,0,0,6,legitimate
1,1,https://www.cmanextstep.com/wp-includes/css/fl...,66,19,0,3,1,0,0,0,...,1,0,0,118,2074,4484551,0,1,2,phishing
2,2,https://fasteratexcel.com/excel-sort-shortcuts/,47,17,0,1,2,0,0,0,...,1,1,0,775,2146,1529081,0,0,2,legitimate
3,3,https://share.hsforms.com/1HWBy9E8zQ5CPqHe5EI6...,54,17,0,2,0,0,0,0,...,1,0,0,56,2501,8469,0,0,2,phishing
4,4,http://abouthappybirthday.blogspot.com,38,31,0,2,0,0,0,0,...,1,0,0,372,7297,0,0,0,5,legitimate
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8167,8167,http://www.biocreative.org/media/store/files/2...,63,19,0,3,0,0,0,0,...,1,1,0,58,-1,0,0,0,4,legitimate
8168,8168,https://www.joserobles.com/,27,18,0,2,0,0,0,0,...,1,1,0,329,5879,0,0,0,1,legitimate
8169,8169,http://www.ciclic.fr/,21,13,0,2,0,0,0,0,...,0,1,0,116,3902,544866,0,0,3,legitimate
8170,8170,http://secureupdate.appleld.com.duilawyeryork....,68,42,1,4,0,0,0,0,...,1,1,0,25,3992,5697976,0,1,0,phishing


## 2. Data preprocessing

### 2.1 Variable selection

We take an NLP approach using the HuggingFace library (https://huggingface.co/), which gives us access to Transformer models.<br>
We therefore focus solely on the 'url' variable and keep the following columns:<br>
  - 'url': feature
  - 'status': target variable


In [3]:
df = df[['url', 'status']]
df

Unnamed: 0,url,status
0,http://www.wikihow.com/Clear-Google-Search-His...,legitimate
1,https://www.cmanextstep.com/wp-includes/css/fl...,phishing
2,https://fasteratexcel.com/excel-sort-shortcuts/,legitimate
3,https://share.hsforms.com/1HWBy9E8zQ5CPqHe5EI6...,phishing
4,http://abouthappybirthday.blogspot.com,legitimate
...,...,...
8167,http://www.biocreative.org/media/store/files/2...,legitimate
8168,https://www.joserobles.com/,legitimate
8169,http://www.ciclic.fr/,legitimate
8170,http://secureupdate.appleld.com.duilawyeryork....,phishing


### 2.2 Class analysis

Let us examine the target variable 'status' to see whether the 'legitimate' and 'phishing' classes are imbalanced.

In [4]:
df['status'].value_counts()

legitimate    4629
phishing      3543
Name: status, dtype: int64

We observe an imbalance between the classes: 4629 legitimate URLs versus 3543 phishing URLs (about 57% / 43%).<br>
Since the dataset is large, we can rebalance the classes by undersampling the majority class.

In [5]:
from sklearn.utils import resample

# Split the observations into two separate sets
df_phishing = df[df.status == 'phishing']
df_legitimate = df[df.status == 'legitimate']

# Undersample the majority class (without replacement) down to the minority class size
df_legitimate_downsampled = resample(df_legitimate,
                                     replace=False,
                                     n_samples=len(df_phishing),
                                     random_state=42)

# Concatenate the two sets
df = pd.concat([df_legitimate_downsampled, df_phishing])

# Check the size of each class in the undersampled dataset
print(df['status'].value_counts())

legitimate    3543
phishing      3543
Name: status, dtype: int64


### 2.3 Formatting and splitting the data for HuggingFace

The columns must be renamed to match what the HuggingFace models expect. <br>
We also need to encode the label values as integers.
- legitimate -> 0
- phishing -> 1

In [6]:
from sklearn.preprocessing import LabelEncoder

df = df.rename(columns={"url": "text", "status": "labels"})

le = LabelEncoder()
df['labels'] = le.fit_transform(df['labels'])
df

Unnamed: 0,text,labels
7229,http://itlaw.wikia.com/wiki/Command_and_contro...,0
1895,https://www.youtube.com/user/NBCNightlyNews,0
4243,https://en.wikipedia.org/wiki/Router_(woodwork...,0
8143,http://lauvaylaparra.blogspot.com/,0
50,https://www.slideshare.net/Haddies/food-proces...,0
...,...,...
8162,http://infantbaptism.net/libraries/fof/utils/i...,1
8164,http://vps-6e0e78eb.vps.ovh.net/,1
8166,http://storage.googleapis.com/cftu6r8bigigi8.a...,1
8170,http://secureupdate.appleld.com.duilawyeryork....,1
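The mapping `legitimate -> 0`, `phishing -> 1` follows from the fact that `LabelEncoder` numbers the classes in sorted order. A minimal sketch of that rule in plain Python, on toy data (the variable names are illustrative and not part of the notebook):

```python
# Sketch only: mimics the sorted class ordering used by LabelEncoder.
statuses = ["phishing", "legitimate", "legitimate", "phishing"]  # toy data

classes = sorted(set(statuses))                        # alphabetical order
to_id = {name: i for i, name in enumerate(classes)}    # class name -> integer
encoded = [to_id[s] for s in statuses]                 # forward mapping
decoded = [classes[i] for i in encoded]                # inverse mapping

print(to_id)     # {'legitimate': 0, 'phishing': 1}
print(encoded)   # [1, 0, 0, 1]
```

Keeping the fitted encoder around matters later: the same object is needed to turn integer predictions back into 'legitimate' / 'phishing'.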


We split our data into training, test and validation sets with a 90/5/5 ratio. <br>
The training set is fairly large because we will later use only a subset of it, so that the hyperparameter search does not take too long.

In [7]:
# 90% for training; the remaining 10% is split evenly between test and validation
df_train = df.sample(frac=0.9, random_state=200)
df_others = df.drop(df_train.index)
df_test = df_others.sample(frac=0.5, random_state=200)
df_valid = df_others.drop(df_test.index)

print(df_train.shape)
print(df_test.shape)
print(df_valid.shape)

(6377, 2)
(354, 2)
(355, 2)
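The sample-then-drop pattern above guarantees that the three sets are disjoint and together cover every row. A minimal sketch of the same logic on plain row indices, assuming 7086 rows as in the balanced DataFrame; it uses Python's own random generator, so the selected rows differ from the ones pandas picks, only the sizes match:

```python
import random

# Sketch only: reproduces the sample/drop logic on row indices, not on the real data.
rng = random.Random(200)
indices = list(range(7086))

train = set(rng.sample(indices, round(0.9 * len(indices))))
others = [i for i in indices if i not in train]          # rows left after "drop"
test = set(rng.sample(others, round(0.5 * len(others))))
valid = set(others) - test                               # whatever remains

print(len(train), len(test), len(valid))   # 6377 354 355
```

Note that this kind of split is not stratified, so the class proportions in each set are only approximately equal.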


HuggingFace models work with `DatasetDict` objects from the `datasets` library. <br>
We therefore need to convert the pandas DataFrames.


In [8]:
from datasets import Dataset, DatasetDict

dataset_train = Dataset.from_pandas(df_train)
dataset_test = Dataset.from_pandas(df_test)
dataset_valid = Dataset.from_pandas(df_valid)
dataset = DatasetDict(
    {
        "train": dataset_train,
        "test": dataset_test,
        "validation": dataset_valid
    }
)

print(dataset)
print(dataset["train"][0])
print(dataset["test"][0])
print(dataset["validation"][0])


DatasetDict({
    train: Dataset({
        features: ['text', 'labels', '__index_level_0__'],
        num_rows: 6377
    })
    test: Dataset({
        features: ['text', 'labels', '__index_level_0__'],
        num_rows: 354
    })
    validation: Dataset({
        features: ['text', 'labels', '__index_level_0__'],
        num_rows: 355
    })
})
{'text': 'http://prevtek.com/blog/profile/', 'labels': 1, '__index_level_0__': 8030}
{'text': 'https://www.emploi-petrole.com/', 'labels': 0, '__index_level_0__': 1094}
{'text': 'https://www.supersamastore.it/', 'labels': 0, '__index_level_0__': 6136}


### 2.4 Tokenizing the URLs

We use the tokenizer of a pre-trained model from the HuggingFace hub to tokenize the URLs. <br>
https://huggingface.co/priyabrat/new_bert_url_clasification <br>

Tokenization turns raw text (here, our URLs) into a sequence of "tokens" that represent the units of meaning of the text.<br>
These tokens are then fed to our classification model to make predictions.

We first need to load the tokenizer.

In [9]:
from transformers import AutoTokenizer

model_checkpoint = 'priyabrat/new_bert_url_clasification'

# `use_fast=True` loads the fast (Rust-based) implementation of the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We can look at how the tokenizer splits a URL into tokens:

In [10]:
tokenizer.tokenize(dataset["train"][0]['text'])

['http',
 ':',
 '/',
 '/',
 'pre',
 '##v',
 '##tek',
 '.',
 'com',
 '/',
 'blog',
 '/',
 'profile',
 '/']
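In the output above, the `##` prefix marks a piece that continues the previous token within the same word (`pre` + `##v` + `##tek`). This is the WordPiece scheme used by BERT tokenizers. A minimal sketch of its greedy longest-match-first rule, using a tiny hypothetical vocabulary (the real vocabulary has tens of thousands of entries, and the function name is illustrative):

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into word-pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                match = candidate
                break
            end -= 1                          # try a shorter piece
        if match is None:
            return [unk]                      # no piece fits: the word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"pre", "##v", "##tek", "blog", "profile", "com", "http"}  # hypothetical
print(wordpiece("prevtek", toy_vocab))  # ['pre', '##v', '##tek']
print(wordpiece("blog", toy_vocab))     # ['blog']
```

Punctuation such as `:`, `/` and `.` is split off before this step, which is why each of those characters appears as its own token.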

Each token has its own ID. <br>
In addition, an attention mask is computed.

In [11]:
tokenizer(dataset["train"][0]['text'])

{'input_ids': [101, 8299, 1024, 1013, 1013, 3653, 2615, 23125, 1012, 4012, 1013, 9927, 1013, 6337, 1013, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Let us examine the number of tokens in our URLs.

In [12]:
df['lengths'] = df['text'].apply(lambda x: len(tokenizer.tokenize(x)))

# compute descriptive statistics
df['lengths'].describe()

count    7086.000000
mean       27.929579
std        26.523485
min         7.000000
25%        15.000000
50%        20.000000
75%        31.000000
max       631.000000
Name: lengths, dtype: float64

We need to set a maximum length at which the token sequences are truncated (the model requires a bounded input length). <br>
A high maximum length increases the training time of the models. <br>
Given these statistics, a maximum length of 50 seems reasonable: it is well above the 75th percentile (31 tokens), so only the longest URLs are truncated.
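To make the effect of the maximum length concrete, here is a minimal sketch of truncation, padding and the attention mask in plain Python. The special-token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for padding) are the ones visible in the tokenizer output above; the function name is illustrative and this is not the tokenizer's actual implementation:

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs seen in the output above


def encode_fixed_length(token_ids, max_length=50):
    """Truncate, add [CLS]/[SEP], then pad to exactly max_length."""
    body = token_ids[: max_length - 2]         # leave room for the 2 special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)      # 1 = real token
    padding = max_length - len(input_ids)
    input_ids += [PAD_ID] * padding
    attention_mask += [0] * padding            # 0 = padding, ignored by attention
    return input_ids, attention_mask


# The 14 token IDs of 'http://prevtek.com/blog/profile/' (without special tokens)
short_ids = [8299, 1024, 1013, 1013, 3653, 2615, 23125, 1012, 4012, 1013, 9927, 1013, 6337, 1013]
ids, mask = encode_fixed_length(short_ids)
print(len(ids), sum(mask))   # 50 16

long_ids = list(range(1000, 1631))  # stand-in for the longest URL (631 tokens)
long_encoded, long_mask = encode_fixed_length(long_ids)
print(len(long_encoded), sum(long_mask))   # 50 50
```

The sketch always pads to `max_length`, which corresponds to `padding="max_length"` in the tokenizer call. With `padding=True`, sequences are padded to the longest one in each batch instead, so the result is the same only when a batch contains at least one URL of 50 tokens or more.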

We now need to apply the tokenization to the whole dataset.

In [13]:
def preprocess_function(urls):
    return tokenizer(urls['text'], truncation=True, padding=True,  max_length=50)


encoded_dataset = dataset.map(preprocess_function, batched=True)

print(encoded_dataset)
# Show the first 2 rows of the dataset after tokenization
print(encoded_dataset['train'][:2])

                                                                  

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 6377
    })
    test: Dataset({
        features: ['text', 'labels', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 354
    })
    validation: Dataset({
        features: ['text', 'labels', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 355
    })
})
{'text': ['http://prevtek.com/blog/profile/', 'https://www.tumblr.com/safe-mode?url=https%3A%2F%2Fscatparadise.tumblr.com%2F'], 'labels': [1, 0], '__index_level_0__': [8030, 7821], 'input_ids': [[101, 8299, 1024, 1013, 1013, 3653, 2615, 23125, 1012, 4012, 1013, 9927, 1013, 6337, 1013, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 16770, 1024, 1013, 1013, 7479, 1012, 10722, 14905, 20974, 1012, 4012, 1013, 3647, 1011, 5549, 1029, 2447



Our data is now ready to be fed to a classification model. <br>

## 3. Modeling

### 3.1 Loading the model and preparing the training

We will fine-tune the same model, 'priyabrat/new_bert_url_clasification', on our dataset.<br>


We first need to load the model with the number of labels we want to use for classification.<br>
Since we will train the model several times during the hyperparameter search, it must be re-initialized before each new training run.<br>
This is why we define a function that loads the model.

In [14]:
from transformers import AutoModelForSequenceClassification

num_labels = 2 # legitimate and phishing

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Next, we instantiate a Trainer. This is a HuggingFace class that handles the training loop of Transformer models. <br>
Before that, we must define the training parameters with the TrainingArguments class.<br>

For now, the choice of hyperparameters is not very important, because we will search for the best ones for our data later.

In [15]:
from transformers import TrainingArguments

args = TrainingArguments(
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    output_dir="test_trainer"
)

We also need a function that computes the accuracy of our predictions.

In [16]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
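For reference, what this metric function computes can be sketched in plain Python: take the index of the largest logit in each row as the predicted class, then count the fraction of rows where it matches the label. The logits and labels below are hypothetical toy values, and the function name is illustrative:

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]  # argmax per row
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return correct / len(labels)


toy_logits = [[-2.4, 2.7], [1.7, -1.3], [0.2, 0.1], [-1.1, 1.5]]  # hypothetical values
toy_labels = [1, 0, 1, 1]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```

Accuracy is a reasonable choice here because the classes were balanced by undersampling; on an imbalanced set it would favor the majority class.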

### 3.2 Searching for the best hyperparameters

Since the hyperparameter search is long and resource-intensive, we run it on a small part of the dataset (one tenth of the training set).

In [17]:
train_dataset = encoded_dataset["train"].shard(index=1, num_shards=10)
train_dataset

Dataset({
    features: ['text', 'labels', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 638
})

Here is the Trainer instance.

In [18]:
from transformers import Trainer

trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

We then define the bounds of the hyperparameter search space.

In [19]:
import optuna  # search backend used here; the `trial` argument below is passed in by Optuna

def hyperparameter_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 2e-5, 2e-2, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 5),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
    }
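The `log=True` option means the learning rate is sampled uniformly on a logarithmic scale, so each order of magnitude between 2e-5 and 2e-2 is equally likely to be explored. A minimal sketch of that sampling rule in plain Python (illustrative helper, not Optuna's implementation; Optuna's default sampler also adapts its proposals to past trials):

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(2e-5, 2e-2, rng) for _ in range(10_000)]

# The range spans three decades, so roughly a third of the draws fall in the lowest one.
below_2e4 = sum(s < 2e-4 for s in samples) / len(samples)
print(round(below_2e4, 2))
```

With these bounds, a large share of the proposals are learning rates above 1e-3. In the search log, the two completed trials in that region (trials 1 and 2) stay at about 0.52 validation accuracy, so a narrower upper bound would likely spend the 10 trials more usefully.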

We now run 10 trials of the hyperparameter search. <br>

In [20]:
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize", hp_space=hyperparameter_space)

[I 2023-04-15 22:45:05,258] A new study created in memory with name: no-name-9c5c9e70-6c6f-4a58-9ac0-f4e28dacc827
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

{'eval_loss': 0.27732351422309875, 'eval_accuracy': 0.895774647887324, 'eval_runtime': 1.754, 'eval_samples_per_second': 202.394, 'eval_steps_per_second': 6.841, 'epoch': 1.0}
{'eval_loss': 0.37041714787483215, 'eval_accuracy': 0.8788732394366198, 'eval_runtime': 2.028, 'eval_samples_per_second': 175.053, 'eval_steps_per_second': 5.917, 'epoch': 2.0}
{'eval_loss': 0.41847044229507446, 'eval_accuracy': 0.895774647887324, 'eval_runtime': 2.0331, 'eval_samples_per_second': 174.612, 'eval_steps_per_second': 5.902, 'epoch': 3.0}
{'eval_loss': 0.46645328402519226, 'eval_accuracy': 0.8873239436619719, 'eval_runtime': 2.279, 'eval_samples_per_second': 155.767, 'eval_steps_per_second': 5.265, 'epoch': 4.0}
{'train_runtime': 104.2781, 'train_samples_per_second': 24.473, 'train_steps_per_second': 1.534, 'train_loss': 0.20957043170928955, 'epoch': 4.0}
[I 2023-04-15 22:46:53,950] Trial 0 finished with value: 0.8873239436619719 and parameters: {'learning_rate': 7.242936983983539e-05, 'num_train_epochs': 4, 'per_device_train_batch_size': 16}. Best is trial 0 with value: 0.8873239436619719.

{'eval_loss': 0.7099958658218384, 'eval_accuracy': 0.523943661971831, 'eval_runtime': 2.133, 'eval_samples_per_second': 166.432, 'eval_steps_per_second': 5.626, 'epoch': 1.0}
{'eval_loss': 0.695316731929779, 'eval_accuracy': 0.523943661971831, 'eval_runtime': 2.1038, 'eval_samples_per_second': 168.745, 'eval_steps_per_second': 5.704, 'epoch': 2.0}
{'eval_loss': 0.6998288631439209, 'eval_accuracy': 0.476056338028169, 'eval_runtime': 2.127, 'eval_samples_per_second': 166.902, 'eval_steps_per_second': 5.642, 'epoch': 3.0}
{'eval_loss': 0.6925010681152344, 'eval_accuracy': 0.523943661971831, 'eval_runtime': 2.175, 'eval_samples_per_second': 163.218, 'eval_steps_per_second': 5.517, 'epoch': 4.0}
{'eval_loss': 0.6924797892570496, 'eval_accuracy': 0.523943661971831, 'eval_runtime': 2.527, 'eval_samples_per_second': 140.483, 'eval_steps_per_second': 4.749, 'epoch': 5.0}
{'train_runtime': 140.9386, 'train_samples_per_second': 22.634, 'train_steps_per_second': 0.71, 'train_loss': 0.7835408782958985, 'epoch': 5.0}
[I 2023-04-15 22:49:17,051] Trial 1 finished with value: 0.523943661971831 and parameters: {'learning_rate': 0.0015956138904359856, 'num_train_epochs': 5, 'per_device_train_batch_size': 32}. Best is trial 0 with value: 0.8873239436619719.

{'eval_loss': 0.753790557384491, 'eval_accuracy': 0.523943661971831, 'eval_runtime': 2.355, 'eval_samples_per_second': 150.743, 'eval_steps_per_second': 5.096, 'epoch': 1.0}
{'eval_loss': 0.69716876745224, 'eval_accuracy': 0.523943661971831, 'eval_runtime': 2.215, 'eval_samples_per_second': 160.273, 'eval_steps_per_second': 5.418, 'epoch': 2.0}
{'eval_loss': 0.692206859588623, 'eval_accuracy': 0.523943661971831, 'eval_runtime': 2.8619, 'eval_samples_per_second': 124.042, 'eval_steps_per_second': 4.193, 'epoch': 3.0}
{'eval_loss': 0.6921840906143188, 'eval_accuracy': 0.523943661971831, 'eval_runtime': 2.2473, 'eval_samples_per_second': 157.966, 'eval_steps_per_second': 5.34, 'epoch': 4.0}
{'train_runtime': 108.8559, 'train_samples_per_second': 23.444, 'train_steps_per_second': 0.735, 'train_loss': 0.8247064590454102, 'epoch': 4.0}
[I 2023-04-15 22:51:09,243] Trial 2 finished with value: 0.523943661971831 and parameters: {'learning_rate': 0.0036525823135063838, 'num_train_epochs': 4, 'per_device_train_batch_size': 32}. Best is trial 0 with value: 0.8873239436619719.

{'eval_loss': 0.31226441264152527, 'eval_accuracy': 0.8760563380281691, 'eval_runtime': 2.6639, 'eval_samples_per_second': 133.262, 'eval_steps_per_second': 4.505, 'epoch': 1.0}
{'eval_loss': 0.31699198484420776, 'eval_accuracy': 0.895774647887324, 'eval_runtime': 2.8455, 'eval_samples_per_second': 124.757, 'eval_steps_per_second': 4.217, 'epoch': 2.0}
{'eval_loss': 0.3001350164413452, 'eval_accuracy': 0.9183098591549296, 'eval_runtime': 2.3801, 'eval_samples_per_second': 149.155, 'eval_steps_per_second': 5.042, 'epoch': 3.0}
{'train_runtime': 86.6639, 'train_samples_per_second': 22.085, 'train_steps_per_second': 1.385, 'train_loss': 0.32428051630655924, 'epoch': 3.0}
[I 2023-04-15 22:52:38,267] Trial 3 finished with value: 0.9183098591549296 and parameters: {'learning_rate': 0.0001090622037256774, 'num_train_epochs': 3, 'per_device_train_batch_size': 16}. Best is trial 3 with value: 0.9183098591549296.

{'eval_loss': 0.3317747414112091, 'eval_accuracy': 0.8760563380281691, 'eval_runtime': 2.645, 'eval_samples_per_second': 134.216, 'eval_steps_per_second': 4.537, 'epoch': 1.0}
{'eval_loss': 0.3828962445259094, 'eval_accuracy': 0.8901408450704226, 'eval_runtime': 2.481, 'eval_samples_per_second': 143.09, 'eval_steps_per_second': 4.837, 'epoch': 2.0}
{'eval_loss': 0.4256536066532135, 'eval_accuracy': 0.8929577464788733, 'eval_runtime': 2.955, 'eval_samples_per_second': 120.136, 'eval_steps_per_second': 4.061, 'epoch': 3.0}
{'eval_loss': 0.45226699113845825, 'eval_accuracy': 0.895774647887324, 'eval_runtime': 2.843, 'eval_samples_per_second': 124.868, 'eval_steps_per_second': 4.221, 'epoch': 4.0}
{'eval_loss': 0.4899830222129822, 'eval_accuracy': 0.8929577464788733, 'eval_runtime': 2.63, 'eval_samples_per_second': 134.981, 'eval_steps_per_second': 4.563, 'epoch': 5.0}
{'train_runtime': 154.6423, 'train_samples_per_second': 20.628, 'train_steps_per_second': 1.293, 'train_loss': 0.21303890228271485, 'epoch': 5.0}
[I 2023-04-15 22:55:15,409] Trial 4 finished with value: 0.8929577464788733 and parameters: {'learning_rate': 6.330844311556136e-05, 'num_train_epochs': 5, 'per_device_train_batch_size': 16}. Best is trial 3 with value: 0.9183098591549296.

[I 2023-04-15 22:55:41,150] Trial 5 pruned.
{'eval_loss': 0.6946999430656433, 'eval_accuracy': 0.476056338028169, 'eval_runtime': 3.431, 'eval_samples_per_second': 103.468, 'eval_steps_per_second': 3.498, 'epoch': 1.0}

{'eval_loss': 0.7880554795265198, 'eval_accuracy': 0.523943661971831, 'eval_runtime': 4.109, 'eval_samples_per_second': 86.396, 'eval_steps_per_second': 2.92, 'epoch': 1.0}
[I 2023-04-15 22:56:35,878] Trial 6 pruned.
{'eval_loss': 0.6952977180480957, 'eval_accuracy': 0.523943661971831, 'eval_runtime': 2.9804, 'eval_samples_per_second': 119.11, 'eval_steps_per_second': 4.026, 'epoch': 2.0}

[I 2023-04-15 22:57:00,333] Trial 7 pruned.
{'eval_loss': 0.6950702667236328, 'eval_accuracy': 0.476056338028169, 'eval_runtime': 3.875, 'eval_samples_per_second': 91.612, 'eval_steps_per_second': 3.097, 'epoch': 1.0}

[I 2023-04-15 22:57:28,832] Trial 8 pruned.
{'eval_loss': 0.6934789419174194, 'eval_accuracy': 0.523943661971831, 'eval_runtime': 2.695, 'eval_samples_per_second': 131.725, 'eval_steps_per_second': 4.453, 'epoch': 1.0}

[I 2023-04-15 22:57:50,912] Trial 9 pruned.
{'eval_loss': 1.0509955883026123, 'eval_accuracy': 0.523943661971831, 'eval_runtime': 2.991, 'eval_samples_per_second': 118.69, 'eval_steps_per_second': 4.012, 'epoch': 1.0}


Here are the best hyperparameters:

In [21]:
print(best_run)

BestRun(run_id='3', objective=0.9183098591549296, hyperparameters={'learning_rate': 0.0001090622037256774, 'num_train_epochs': 3, 'per_device_train_batch_size': 16}, run_summary=None)
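Several trials in the search log end with `pruned` instead of a final value. The notebook does not configure a pruner, so which rule was applied is an assumption on our part; the sketch below illustrates a median rule of the kind implemented by Optuna's `MedianPruner`: once enough trials have completed, a running trial is stopped when its intermediate score falls below the median of the scores that completed trials reported at the same point. The function name and the startup threshold of 5 are illustrative (the threshold is chosen to match the fact that the first five trials ran to completion).

```python
from statistics import median


def should_prune(value, completed_values, n_startup_trials=5):
    """Median rule for a metric to maximize (illustrative, not Optuna's code)."""
    if len(completed_values) < n_startup_trials:
        return False                       # too few finished trials to compare against
    return value < median(completed_values)


# Epoch-1 validation accuracies of the five completed trials (0 to 4), rounded from the log.
epoch_1 = [0.8958, 0.5239, 0.5239, 0.8761, 0.8761]

print(should_prune(0.48, epoch_1))       # True: hypothetical trial below the median (0.8761)
print(should_prune(0.90, epoch_1))       # False: hypothetical trial above the median
print(should_prune(0.48, epoch_1[:3]))   # False: fewer than 5 completed trials
```

Pruning saves compute on unpromising configurations, at the cost that a pruned trial never gets the chance to recover in later epochs.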


### 3.3 Fitting the model to our data (fine-tuning)

We can now train our model on the full training set with the best hyperparameters.

In [22]:
args = TrainingArguments(
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate = best_run.hyperparameters['learning_rate'],
    per_device_train_batch_size = best_run.hyperparameters['per_device_train_batch_size'],
    per_device_eval_batch_size = best_run.hyperparameters['per_device_train_batch_size'],
    num_train_epochs= best_run.hyperparameters['num_train_epochs'],
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    output_dir="test_trainer"
)

trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

{'eval_loss': 0.3361622989177704, 'eval_accuracy': 0.9011299435028248, 'eval_runtime': 3.6118, 'eval_samples_per_second': 98.012, 'eval_steps_per_second': 6.368, 'epoch': 1.0}
{'loss': 0.3418, 'learning_rate': 6.350572764978877e-05, 'epoch': 1.25}
{'eval_loss': 0.2935185730457306, 'eval_accuracy': 0.9180790960451978, 'eval_runtime': 3.0165, 'eval_samples_per_second': 117.354, 'eval_steps_per_second': 7.625, 'epoch': 2.0}
{'loss': 0.195, 'learning_rate': 1.7949251573900123e-05, 'epoch': 2.51}
{'eval_loss': 0.31337815523147583, 'eval_accuracy': 0.9209039548022598, 'eval_runtime': 3.2894, 'eval_samples_per_second': 107.617, 'eval_steps_per_second': 6.992, 'epoch': 3.0}
{'train_runtime': 609.1159, 'train_samples_per_second': 31.408, 'train_steps_per_second': 1.965, 'train_loss': 0.24814418324253015, 'epoch': 3.0}





TrainOutput(global_step=1197, training_loss=0.24814418324253015, metrics={'train_runtime': 609.1159, 'train_samples_per_second': 31.408, 'train_steps_per_second': 1.965, 'train_loss': 0.24814418324253015, 'epoch': 3.0})

We obtain an accuracy of about 0.92 on our test split (the 5% held out from train.csv).

## 4. Predictions

Finally, we can run our model on the test dataset.

### 4.1 Loading the test set

In [23]:
df_submission = pd.read_csv('test.csv')
df_submission

Unnamed: 0.1,Unnamed: 0,url,length_url,length_hostname,ip,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,...,empty_title,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank
0,0,http://kinstationery.com/4tr/?email=nobody@myc...,58,17,0,2,0,1,1,0,...,0,1,0,0,222,-1,0,0,1,1
1,1,https://www.granhongo.com/wp-includes/Login/cu...,114,17,0,2,2,0,1,0,...,0,1,1,0,221,3065,0,0,1,2
2,2,http://j.gs/Cpjh,16,4,0,1,0,0,0,0,...,0,1,0,0,207,-1,1590047,0,1,6
3,3,http://www.learnpunjabi.org/pdf/gslehal-pap18.pdf,49,20,0,3,1,0,0,0,...,1,1,1,0,406,5073,287661,0,1,5
4,4,http://42vital.pl/cache/cache/navy/navyfederal/,47,10,0,1,0,0,0,0,...,1,1,0,0,93,2098,0,0,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3253,3253,http://unitus.mk.ua/sites/default/files/ctools...,110,12,0,3,1,0,0,0,...,0,1,1,0,208,3080,0,0,1,2
3254,3254,http://vps-a155c406.vps.ovh.net/,32,24,1,3,1,0,0,0,...,0,1,1,0,1767,8094,3612,0,1,4
3255,3255,http://mail01.tinyletterapp.com/Support--2/imp...,188,24,1,4,9,0,1,0,...,0,1,0,0,194,2362,173942,0,1,1
3256,3256,http://amazon-vtech-safe-and-sound-audio.buy.m...,76,47,0,4,6,0,0,0,...,0,1,1,0,568,-1,0,0,1,2


### 4.2 Preprocessing the test set

In [24]:
# Keep only the URL column and rename it to 'text', as expected by preprocess_function
df_submission = df_submission['url'].rename("text").to_frame()
dataset_submission = Dataset.from_pandas(df_submission)
print(dataset_submission)
print(dataset_submission[0])

Dataset({
    features: ['text'],
    num_rows: 3258
})
{'text': 'http://kinstationery.com/4tr/?email=nobody@mycraftmail.com'}


In [25]:
encoded_dataset_submission = dataset_submission.map(preprocess_function, batched=True)

# Show the first 2 rows of the dataset after tokenization
print(encoded_dataset_submission[:2])

                                                                  

{'text': ['http://kinstationery.com/4tr/?email=nobody@mycraftmail.com', 'https://www.granhongo.com/wp-includes/Login/customer_center/customer-IDPP00C824/myaccount/identity?cmd=_session=US'], 'input_ids': [[101, 8299, 1024, 1013, 1013, 12631, 20100, 7301, 1012, 4012, 1013, 1018, 16344, 1013, 1029, 10373, 1027, 6343, 1030, 2026, 10419, 21397, 1012, 4012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 16770, 1024, 1013, 1013, 7479, 1012, 12604, 19991, 2080, 1012, 4012, 1013, 1059, 2361, 1011, 2950, 1013, 8833, 2378, 1013, 8013, 1035, 2415, 1013, 8013, 1011, 8909, 9397, 8889, 2278, 2620, 18827, 1013, 2026, 6305, 3597, 16671, 1013, 4767, 1029, 4642, 2094, 1027, 1035, 5219, 1027, 2149, 102, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,



### 4.3 Predictions for the Kaggle submission

In [26]:
y_submission = trainer.predict(encoded_dataset_submission)
y_submission


PredictionOutput(predictions=array([[-2.3728497,  2.6690106],
       [-2.3808033,  2.6742992],
       [-1.1377612,  1.4587404],
       ...,
       [-2.3737447,  2.669765 ],
       [ 1.6936927, -1.3166003],
       [ 1.7669984, -1.7183537]], dtype=float32), label_ids=None, metrics={'test_runtime': 26.9828, 'test_samples_per_second': 120.743, 'test_steps_per_second': 7.56})

In [27]:
# Pick the class with the larger logit (0 = legitimate, 1 = phishing)
predictions = [0 if p[0] > p[1] else 1 for p in y_submission.predictions]

# Show the first 10 predictions
predictions[:10]

[1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
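The values returned by `trainer.predict` are raw logits, not probabilities. If a confidence score is wanted in addition to the 0/1 label, a softmax can be applied to each row. A minimal sketch in plain Python, using the first row of the prediction output above (the helper is illustrative and not part of the notebook):

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


first_row = [-2.3728497, 2.6690106]   # first prediction in the output above
probs = softmax(first_row)
label = probs.index(max(probs))       # same class as comparing the two logits directly

print([round(p, 3) for p in probs], label)
```

Softmax preserves the ordering of the logits, so the predicted labels are unchanged; it only makes the margin between the two classes easier to interpret.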

### 4.4 Formatting the results for the Kaggle submission

In [28]:
# Convert the status back to "legitimate" or "phishing"
predictions_decoded = le.inverse_transform(predictions)
df_submission = pd.DataFrame(data={'url': df_submission['text'].values, 'status': predictions_decoded})

df_submission.to_csv('kaggle_submission_4.csv', index=False)

## 5. Discussion

We obtain a score of about 0.94 on Kaggle with this approach, which is a respectable result. <br>
We tested other Transformer models from the library (DistilBERT, BERT, ...), but their results were not as good. <br>

Unfortunately, it is difficult to compare all these models in a single notebook because we ran into many errors caused by memory limits. <br>
In addition, some models are built on PyTorch and others on TensorFlow, and compilation times are sometimes very long. <br>

From our point of view, the model we used appears effective for several reasons:
- a pre-trained model brings prior knowledge of the domain and therefore better accuracy
- the tokenization used is suited to URLs (unlike common tokenizations, which are designed for natural-language text)
- in terms of performance, the model trains fairly quickly (with a GPU)

The main drawback of our approach is that we did not take into account the other variables in the dataset, which could have been relevant.
