### RDAI-ANLP Project
In this project, I perform multi-label classification of text collected from dark-web forums. There are 24 labels, a training set of about 1.9k texts and a test set of 252 texts. All the data was custom scraped and labelled, because there are few or no labelled datasets of dark-web forum posts sorted into the respective categories. The data and labels lean towards the cyber domain. Labels include data leaks, personal information, company or organisation information, etc.

I will be using spaCy (Word2vec + RoBERTa) to train a model that classifies these texts into the relevant categories.

### Install relevant packages

In [None]:
!pip install -U spacy
!pip install spacy-transformers
!python -m spacy download en_core_web_trf

In [36]:
import os
import pandas as pd
import re
import spacy
from spacy.tokens import DocBin
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import multilabel_confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score

### Labels

In [13]:
cwd = os.getcwd()
label_path = os.path.join(cwd, "labels.txt")
# Read one label per line, skipping blank lines (e.g. a trailing newline)
with open(label_path, "r") as f:
    labels = [line.strip() for line in f if line.strip()]
mlb = MultiLabelBinarizer(classes=labels)
mlb.classes

['REQUEST FOR SERVICE OR PRODUCT',
 'OFFERING OF SERVICE OR PRODUCT',
 'MONEY INVOLVED',
 'ADVICE',
 'NETWORK OR PANEL ACCESS',
 'CREDENTIALS OR ACCOUNTS',
 'CARDING',
 'INFRASTRUCTURE AND HOSTING',
 'DATA LEAKS',
 'PERSONAL INFORMATION',
 'COMPANY OR ORG INFORMATION',
 'ADULT',
 'MALWARE TOOLS AND EXPLOITS',
 'VULNERABILITY',
 'RECRUITMENT',
 'DEFACEMENT',
 'PHISHING',
 'SPAMMING',
 'HACKING',
 'SCAM PAGE',
 'LOGS',
 'SMS OR EMAIL MAILER',
 'GOOD REVIEW',
 'BAD REVIEW']
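
For reference, a binarizer with a fixed `classes` list turns each list of labels into a fixed-width 0/1 vector whose columns follow the order above. The following is a minimal pure-Python sketch of that encoding, not the scikit-learn implementation; `multi_hot` and the three-label subset are illustrative names that do not appear elsewhere in this notebook. The sketch raises on unknown labels so that typos in the annotations surface early.

```python
def multi_hot(label_lists, classes):
    """Encode each list of labels as a 0/1 vector ordered like `classes`."""
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            if label not in index:
                raise ValueError(f"Unknown label: {label!r}")
            row[index[label]] = 1
        rows.append(row)
    return rows


# Hypothetical three-label subset, for illustration only.
demo_classes = ["ADVICE", "DATA LEAKS", "PERSONAL INFORMATION"]
demo_rows = multi_hot(
    [["DATA LEAKS", "PERSONAL INFORMATION"], [], ["ADVICE"]],
    demo_classes,
)
print(demo_rows)  # [[0, 1, 1], [0, 0, 0], [1, 0, 0]]
```

A post with no labels becomes an all-zero row, which is how texts that fit no category are represented.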

In [2]:
nlp = spacy.load("en_core_web_trf")

### Importing relevant training files
I previously trained models on pre-processed text (for example, with stopwords removed and everything converted to lowercase). After several iterations and comparisons of results, I decided to feed the raw text directly into the model, because dark-web forum texts are messy and unstructured, and I believe that this lack of structure itself carries valuable information for the model.

In [8]:
file1 = "forum_breached_20221115_20221201_155_annotations.jsonl"
file2 = "forum_exploit_20220101_20220201_300_posts_set1_226_annotations.jsonl"
file3 = "forum_exploit_20220301_20220401_251_annotations.jsonl"
file4 =  "forum_exploit_20220801_20220815_163_annotations.jsonl"
file5 = "forum_nulled_20220801_20220815_147_annotations.jsonl"
file6 = "forum_xss_posts_20220801_20220815_157_annotations.jsonl"
file7 = "popular_forums_20221101_20221104_500_posts_set1_486_annotations.jsonl"
phishing = "phishing.jsonl"
company_orginfo = "company_orginfo.jsonl"
vuln = "vulnerability.jsonl"

files = [file1,file2,file3,file4,file5,file6,file7,phishing,company_orginfo,vuln]

#Importing relevant files

def merge_jsonl_files(files):
    curr_path = os.getcwd()
    df_list = []

    for file in files:
        file_path = os.path.join(curr_path,"prodigy","annotation_output", file)
        print(file_path)
        df = pd.read_json(file_path,lines= True)
        df_list.append(df)

    merged_df = pd.concat(df_list)

    return merged_df

#Removing special characters
def sp_char_remove(review):
    review = re.sub(r'\[[^]]*\]', ' ', review)
    review = re.sub(r'[^a-zA-Z]', ' ', review)
    return review

#Removing stopwords (requires nltk and its "stopwords" corpus)
def stopword_remover(text):
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))    #build the set once, not per word
    return [word for word in text.split() if word not in stop_words]

#Removing URLs
def url_remover(text):
    remove = r"http\S+"
    text = re.sub(remove, " ", text)
    return text

#Total dataframe
df_dummy = merge_jsonl_files(files)
df_dummy_dummy = df_dummy[df_dummy.answer == "accept"]
df = df_dummy_dummy.drop(columns=["_input_hash","_session_id","_task_hash","_view_id","options","config", "answer"])
#df["accept"] = df["accept"].apply(lambda x: x if x else ["EMPTY"])
#df["text"] = df["text"].apply(url_remover)
#df["text"] = df["text"].apply(lambda x: x.lower())
del df["meta"]
df

/home/seb/Desktop/Seb/RDAI/ANLP/RDAI-ANLP/prodigy/annotation_output/forum_breached_20221115_20221201_155_annotations.jsonl
/home/seb/Desktop/Seb/RDAI/ANLP/RDAI-ANLP/prodigy/annotation_output/forum_exploit_20220101_20220201_300_posts_set1_226_annotations.jsonl
/home/seb/Desktop/Seb/RDAI/ANLP/RDAI-ANLP/prodigy/annotation_output/forum_exploit_20220301_20220401_251_annotations.jsonl
/home/seb/Desktop/Seb/RDAI/ANLP/RDAI-ANLP/prodigy/annotation_output/forum_exploit_20220801_20220815_163_annotations.jsonl
/home/seb/Desktop/Seb/RDAI/ANLP/RDAI-ANLP/prodigy/annotation_output/forum_nulled_20220801_20220815_147_annotations.jsonl
/home/seb/Desktop/Seb/RDAI/ANLP/RDAI-ANLP/prodigy/annotation_output/forum_xss_posts_20220801_20220815_157_annotations.jsonl
/home/seb/Desktop/Seb/RDAI/ANLP/RDAI-ANLP/prodigy/annotation_output/popular_forums_20221101_20221104_500_posts_set1_486_annotations.jsonl
/home/seb/Desktop/Seb/RDAI/ANLP/RDAI-ANLP/prodigy/annotation_output/phishing.jsonl
/home/seb/Desktop/Seb/RDAI/ANL

Unnamed: 0,text,accept,_timestamp
0,"Government of San Pedro Garza Garcia, NL, Mexi...","[DATA LEAKS, PERSONAL INFORMATION]",
1,ECUADOR CELL PHONE WHATSAPP DATABASE ECUADOR C...,"[DATA LEAKS, PERSONAL INFORMATION]",
2,i need to buy combo virgilio.it with high mail...,"[REQUEST FOR SERVICE OR PRODUCT, MONEY INVOLVE...",
3,"x2100 Fresh Logs [9.11.2022] World(USA, EU inc...","[LOGS, OFFERING OF SERVICE OR PRODUCT, DATA LE...",
4,Nirvana - Smells Like Teen Spirit ( \n\nRemast...,[],
...,...,...,...
250,Acunetix Version 1.7.1.955 (Vulnerability Data...,[VULNERABILITY],
251,Acunetix Version 1.7.1.955 (Vulnerability Data...,[VULNERABILITY],
252,"Twitter (Partial) Database - Leaked, Download!...","[OFFERING OF SERVICE OR PRODUCT, DATA LEAKS, C...",
253,Обсуждение Спутниковый интернет Раз тема каса...,"[VULNERABILITY, MALWARE TOOLS AND EXPLOITS, AD...",


In [10]:
df.head()

Unnamed: 0,text,accept,_timestamp
0,"Government of San Pedro Garza Garcia, NL, Mexi...","[DATA LEAKS, PERSONAL INFORMATION]",
1,ECUADOR CELL PHONE WHATSAPP DATABASE ECUADOR C...,"[DATA LEAKS, PERSONAL INFORMATION]",
2,i need to buy combo virgilio.it with high mail...,"[REQUEST FOR SERVICE OR PRODUCT, MONEY INVOLVE...",
3,"x2100 Fresh Logs [9.11.2022] World(USA, EU inc...","[LOGS, OFFERING OF SERVICE OR PRODUCT, DATA LE...",
4,Nirvana - Smells Like Teen Spirit ( \n\nRemast...,[],


### Sample of dataset and labels

In [21]:
print("Sample text")
print(df["text"].tolist()[2])
print("==================")
print("Sample labels")
print(df["accept"].tolist()[2])

Sample text
i need to buy combo virgilio.it with high mail access telegram contact: @Nabuto1
Sample labels
['REQUEST FOR SERVICE OR PRODUCT', 'MONEY INVOLVED', 'NETWORK OR PANEL ACCESS']


In [24]:
print("Sample text")
print(df["text"].tolist()[10])
print("==================")
print("Sample labels")
print(df["accept"].tolist()[10])

Sample text
Will @IntelBroker become bigger than @KelvinSecurity Looking on the breaches @ IntelBroker (   is doing in the past month and the rate he is doing them. Do you think he will soon be one of the big names. 

Like 

@ LeakBase (  

@ kelvinsecurity ( 
Sample labels
['ADVICE']


In [23]:
# Example of a text that does not fit into any category
print("Sample text")
print(df["text"].tolist()[4])
print("==================")
print("Sample labels")
print(df["accept"].tolist()[4])

Sample text
Nirvana - Smells Like Teen Spirit ( 

Remastered in HD, Enjoy

  

 Thanks  @ SafeSig (   for the credits and VIP rank
Sample labels
[]


### Splitting df to training and validation


In [None]:
#Splitting df to training and validation
train, validation = train_test_split(df, test_size=0.2)

print("size of training data:",len(train))
print("size of validation data:", len(validation))

### Creating DocBin
This is necessary because DocBin is one of the input formats that spaCy accepts for training.

In [None]:
def convert_text_to_bin_format(nlp, row, categories):
    doc = nlp.make_doc(row["text"])
    #print(categories)
    doc.cats = {cat: 0 for cat in categories}

    for label in row["accept"]:
        doc.cats[label] = 1
    #print(doc.cats)

    return doc

In [None]:
#Creating a DocBin - train
num_of_rows = len(train)
docs = []
categories = mlb.classes

for i in range(num_of_rows):
    row = train.iloc[i]
    doc = convert_text_to_bin_format(nlp, row, categories)
    docs.append(doc)

train_doc_bin = DocBin(docs=docs)
curr_path = os.getcwd()
path = os.path.join(curr_path,"..","data","training.spacy")

train_doc_bin.to_disk(path)

In [None]:
#Creating a DocBin - validation
num_of_rows = len(validation)
docs = []
categories = mlb.classes
for i in range(num_of_rows):
    row = validation.iloc[i]
    doc = convert_text_to_bin_format(nlp, row, categories)
    docs.append(doc)

test_doc_bin = DocBin(docs=docs)
curr_path = os.getcwd()
path = os.path.join(curr_path,"..","data","validation.spacy")

test_doc_bin.to_disk(path)

### Creating specific spacy config files for training


!python -m spacy init fill-config <path/to/input/base/config/file>  <output/config/path>

In [None]:
!python -m spacy init fill-config configs/base_config_textcat.cfg configs/txt_classification_config_batch128_raw.cfg

### Training the model using spaCy

!python -m spacy train <path/of/config/file> --output <path/to/save/model> --paths.train <training/data/path> --paths.dev <validation/data/path>

In [None]:
!python -m spacy train ../configs/txt_classification_config.cfg --output ../models/v3 --paths.train ./../data/training.spacy --paths.dev ./../data/validation.spacy

### Inference

In [35]:
def standardize_tags(doc, threshold):
    # Build a new dict so that the raw scores in doc.cats are left untouched
    tags = {}
    for k, score in doc.cats.items():
        if score >= threshold:
            tags[k] = 1
        else:
            tags[k] = 0
    return tags

# Using model to predict each text.
def nlp_predict(text, nlp, threshold):
    doc = nlp(text)
    tags = standardize_tags(doc, threshold)
    tags_list = []
    for k,v in tags.items():
        if v == 1:
            tags_list.append(k)
    return tags_list
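
Each label gets its own independent score, so the threshold directly controls how many labels a post receives. The sketch below shows that effect on hand-written scores shaped like `doc.cats`; `labels_above` and `demo_scores` are hypothetical and the numbers are not real model output.

```python
def labels_above(scores, threshold):
    """Return the labels whose score is at least `threshold`, without modifying `scores`."""
    return [label for label, score in scores.items() if score >= threshold]


# Hypothetical scores in the shape of `doc.cats`; not real model output.
demo_scores = {"ADVICE": 0.93, "HACKING": 0.85, "CARDING": 0.10}

print(labels_above(demo_scores, 0.9))  # ['ADVICE']
print(labels_above(demo_scores, 0.8))  # ['ADVICE', 'HACKING']
```

A lower threshold trades precision for recall, which is worth keeping in mind when comparing runs that use 0.8 and 0.9.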

In [28]:
trained_nlp = spacy.load("v3/model-best")
text = " 685K HQ Private Combolist Email:Pass [Netflix,Minecraft,Uplay,Steam,Paypal,Hulu,Vpn,Spotify,Etc....]  PLZ REPLY THIS THREAD FOR MOR COMBO  Download Link......   https://rosefile.net/2m9km2ui7g/685K.zip.html 685K HQ Private Combolist Email:Pass [Netflix,Minecraft,Uplay,Steam,Paypal,Hulu,Vpn,Spotify,Etc....]\n\nPLZ REPLY THIS THREAD FOR MOR COMBO\n\nDownload Link......\n\nHidden Content\n\nReply  to this topic to view hidden content or  Update your account. (https://crackingall.com/index.php?/subscriptions/)"
doc = trained_nlp(text)
doc.cats
print(standardize_tags(doc, 0.9))
# I am expecting the following labels: Data Leaks, offering service or product, credentials or accounts

{'REQUEST FOR SERVICE OR PRODUCT': 0, 'OFFERING OF SERVICE OR PRODUCT': 1, 'MONEY INVOLVED': 0, 'ADVICE': 0, 'NETWORK OR PANEL ACCESS': 0, 'CREDENTIALS OR ACCOUNTS': 1, 'CARDING': 0, 'INFRASTRUCTURE AND HOSTING': 0, 'DATA LEAKS': 1, 'PERSONAL INFORMATION': 0, 'COMPANY OR ORG INFORMATION': 0, 'ADULT': 0, 'MALWARE TOOLS AND EXPLOITS': 0, 'VULNERABILITY': 0, 'RECRUITMENT': 0, 'DEFACEMENT': 0, 'PHISHING': 0, 'SPAMMING': 0, 'HACKING': 0, 'SCAM PAGE': 0, 'LOGS': 0, 'SMS OR EMAIL MAILER': 0, 'GOOD REVIEW': 0, 'BAD REVIEW': 0}


In [33]:
text = "Hi guys!!!!! I am planning to exploit this particular CVE to gain remote access to the company's infrastrcuture! do u think this will help??"
doc = trained_nlp(text)
doc.cats
print(standardize_tags(doc, 0.8))

# I am expecting the following labels: Advice, network or panel access

{'REQUEST FOR SERVICE OR PRODUCT': 0, 'OFFERING OF SERVICE OR PRODUCT': 0, 'MONEY INVOLVED': 0, 'ADVICE': 1, 'NETWORK OR PANEL ACCESS': 1, 'CREDENTIALS OR ACCOUNTS': 0, 'CARDING': 0, 'INFRASTRUCTURE AND HOSTING': 0, 'DATA LEAKS': 0, 'PERSONAL INFORMATION': 0, 'COMPANY OR ORG INFORMATION': 0, 'ADULT': 0, 'MALWARE TOOLS AND EXPLOITS': 0, 'VULNERABILITY': 0, 'RECRUITMENT': 0, 'DEFACEMENT': 0, 'PHISHING': 0, 'SPAMMING': 0, 'HACKING': 1, 'SCAM PAGE': 0, 'LOGS': 0, 'SMS OR EMAIL MAILER': 0, 'GOOD REVIEW': 0, 'BAD REVIEW': 0}


### Creation of test set

In [34]:
#Input relevant files to create test set
file1 = "forum_nulled_20230123_20230207_61_annotations_(test_set).jsonl"
file2 = "forum_breached_20230123_20230207_102_annotations_(test_set).jsonl"
file3 = "forum_xss_20230123_20230207_105_annotations_(test_set).jsonl"

test_set = [file1,file2,file3]

df_dummy = merge_jsonl_files(test_set)
df_dummy_dummy = df_dummy[df_dummy.answer == "accept"]
test_df = df_dummy_dummy.drop(columns=["_input_hash","_session_id","_task_hash","_view_id","options","config", "answer"])
test_df["text"] = test_df["text"].apply(url_remover)

del test_df["meta"]
test_df

/home/seb/Desktop/Seb/RDAI/ANLP/RDAI-ANLP/prodigy/annotation_output/forum_nulled_20230123_20230207_61_annotations_(test_set).jsonl
/home/seb/Desktop/Seb/RDAI/ANLP/RDAI-ANLP/prodigy/annotation_output/forum_breached_20230123_20230207_102_annotations_(test_set).jsonl
/home/seb/Desktop/Seb/RDAI/ANLP/RDAI-ANLP/prodigy/annotation_output/forum_xss_20230123_20230207_105_annotations_(test_set).jsonl


Unnamed: 0,text,accept
0,x2000 Steam Accounts with Games #4 \n\nThis l...,"[DATA LEAKS, CREDENTIALS OR ACCOUNTS, OFFERING..."
1,332K Combolist EDU OFFICE 332K Combolist EDU O...,"[DATA LEAKS, COMPANY OR ORG INFORMATION, OFFER..."
2,Mycanal ACCOUNTS PREMIUM diariatouaidara1999@g...,"[CREDENTIALS OR ACCOUNTS, DATA LEAKS, OFFERING..."
4,Connecting to shoutbox Anyone have solution to...,[ADVICE]
5,BWW - Free Food - Accounts with Over 1000 Pts ...,"[OFFERING OF SERVICE OR PRODUCT, CREDENTIALS O..."
...,...,...
101,Looking for stealer Hello please what’s the la...,"[REQUEST FOR SERVICE OR PRODUCT, ADVICE, MALWA..."
102,Android vulnerability to install a silent payl...,"[VULNERABILITY, MALWARE TOOLS AND EXPLOITS, AD..."
103,Email/Phone leads - USA banks I have variously...,"[OFFERING OF SERVICE OR PRODUCT, MONEY INVOLVE..."
104,110K Malaysian Online Casino Customers [ Depos...,"[OFFERING OF SERVICE OR PRODUCT, DATA LEAKS, P..."


### Test Set Inference

In [41]:
test_df["predicted_output"] = test_df["text"].apply(lambda text: nlp_predict(text, trained_nlp,0.8))

Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors
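
The warning above means that at least one post tokenizes to 720 word pieces, more than the 512 the transformer accepts at once. Depending on the pipeline config, spaCy's transformer component may already split documents into overlapping spans, in which case the warning can be harmless. If it is not, one option is to score overlapping windows of the text and keep the highest score per label. This is a rough sketch of that idea only: `sliding_windows`, `max_scores_over_windows` and `fake_score_fn` are hypothetical helpers, and the window size is counted in whitespace-separated words, which is only a proxy for word pieces and should stay well below 512.

```python
def sliding_windows(words, size, stride):
    """Yield overlapping windows of `size` words, advancing `stride` words each step."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    start = 0
    while True:
        yield words[start:start + size]
        if start + size >= len(words):
            break
        start += stride


def max_scores_over_windows(text, score_fn, size=200, stride=150):
    """Score each window with `score_fn` and keep the highest score per label."""
    best = {}
    for window in sliding_windows(text.split(), size, stride):
        for label, score in score_fn(" ".join(window)).items():
            best[label] = max(score, best.get(label, 0.0))
    return best


# Hypothetical stand-in for something like `lambda t: trained_nlp(t).cats`.
def fake_score_fn(chunk):
    return {"LOGS": 1.0 if "logs" in chunk else 0.0}


# 301 words: the keyword only appears in the second window.
demo_text = " ".join(["filler"] * 300 + ["logs"])
print(max_scores_over_windows(demo_text, fake_score_fn))  # {'LOGS': 1.0}
```

Taking the maximum suits multi-label data, where a label applies if any part of the post supports it.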


In [44]:
# mlb was created with a fixed `classes` list, so both matrices share the same column order
y_pred_set = mlb.fit_transform(test_df["predicted_output"])
y_test_set = mlb.fit_transform(test_df["accept"])

assert y_pred_set.shape == y_test_set.shape

confusion_matrix_= multilabel_confusion_matrix(y_test_set, y_pred_set)
cls_report = classification_report(y_test_set, y_pred_set)
f1 = f1_score(y_test_set, y_pred_set, average = "micro")

  _warn_prf(average, modifier, msg_start, len(result))


In [45]:
test_df.head()

Unnamed: 0,text,accept,predicted_output
0,x2000 Steam Accounts with Games #4 \n\nThis l...,"[DATA LEAKS, CREDENTIALS OR ACCOUNTS, OFFERING...","[CREDENTIALS OR ACCOUNTS, DATA LEAKS, VULNERAB..."
1,332K Combolist EDU OFFICE 332K Combolist EDU O...,"[DATA LEAKS, COMPANY OR ORG INFORMATION, OFFER...","[OFFERING OF SERVICE OR PRODUCT, CREDENTIALS O..."
2,Mycanal ACCOUNTS PREMIUM diariatouaidara1999@g...,"[CREDENTIALS OR ACCOUNTS, DATA LEAKS, OFFERING...","[CREDENTIALS OR ACCOUNTS, DATA LEAKS]"
4,Connecting to shoutbox Anyone have solution to...,[ADVICE],"[ADVICE, NETWORK OR PANEL ACCESS]"
5,BWW - Free Food - Accounts with Over 1000 Pts ...,"[OFFERING OF SERVICE OR PRODUCT, CREDENTIALS O...","[CREDENTIALS OR ACCOUNTS, DATA LEAKS]"


In [46]:
print(f1)

0.7374784110535406
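
The score above is micro-averaged: true positives, false positives and false negatives are pooled over all labels before F1 is computed, so frequent labels weigh more than rare ones. The following pure-Python sketch illustrates that calculation on a toy example; `micro_f1` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over multi-hot rows: pool TP, FP and FN across all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy multi-hot matrices (2 posts, 3 labels); not project data.
demo_true = [[1, 0, 1], [0, 1, 0]]
demo_pred = [[1, 1, 0], [0, 1, 0]]

# TP = 2, FP = 1, FN = 1, so F1 = 2*2 / (2*2 + 1 + 1) = 4/6
print(round(micro_f1(demo_true, demo_pred), 3))  # 0.667
```

Because of the pooling, a label with very few test examples can score 0 without moving the micro-averaged number much, so the per-label report is worth reading alongside it.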


In [47]:
print(cls_report)

              precision    recall  f1-score   support

           0       0.70      0.95      0.81        40
           1       0.77      0.85      0.81        93
           2       0.90      0.86      0.88        91
           3       0.98      0.62      0.76        71
           4       0.52      0.65      0.58        17
           5       0.77      0.78      0.77        46
           6       1.00      0.17      0.29         6
           7       0.00      0.00      0.00         2
           8       0.86      0.69      0.77        83
           9       0.90      0.70      0.79        27
          10       0.36      0.16      0.22        31
          11       0.00      0.00      0.00         5
          12       0.87      0.53      0.66        51
          13       0.60      0.50      0.55         6
          14       0.56      0.56      0.56         9
          15       0.00      0.00      0.00         1
          16       0.50      0.67      0.57         3
          17       0.75    