# NPS Analysis
This notebook demonstrates how we used data crawled from the dark web to make predictions for unknown NPS names.

The notebook consists of the following sections:

## Table of Contents
- [0. Libraries](#0.-Libraries)
- [1. Preprocessing](#1.-Preprocessing)
- [2. Training Dataset](#2.-Training-Dataset)
- [3. Candidates](#3.-Candidates)
- [4. Model Training](#4.-Model-Training)
- [5. Predictions](#5.-Predictions)
- [6. Dashboard Preparation](#6.-Dashboard-Preparation)
- [Appendix: Vendor Predictions](#Appendix:-Vendor-Predictions)

# 0. Libraries

Here all libraries needed to run this notebook and the dashboard are imported. Run the cell below first to install the required packages.

In [None]:
!pip install -r requirements.txt

In [6]:
import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from collections import defaultdict
import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import random
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, classification_report, confusion_matrix
import ast
import time
from collections import Counter
import pickle
import streamlit as st
from st_aggrid import GridOptionsBuilder, AgGrid, GridUpdateMode
import altair as alt
import streamlit.components.v1 as components
from lime.lime_text import LimeTextExplainer



# 1. Preprocessing

In this section, the crawled data is loaded and cleaned for further analysis.

In [1]:
def clean_text(text):
    '''Function to clean the raw messages in the dataset'''
    text = re.sub(r'<.*?>', ' ', text)               # Remove HTML
    text = re.sub(r'\n', ' ', text)                  # Replace newlines with spaces
    text = re.sub(r'http\S+', ' ', text)             # Remove URLs
    text = re.sub(r'@\w+', ' ', text)                # Remove usernames
    text = re.sub(r'[^a-zA-Z0-9\s\-]', ' ', text)    # Remove everything except letters, digits, whitespace and hyphens
    text = re.sub(r'\s-\s?|\s?-\s', ' ', text)       # Remove hyphens that have whitespace on at least one side
    text = re.sub(r'\s+', ' ', text)                 # Collapse runs of whitespace into a single space
    return text.lower().strip()                      # Convert to lower case and trim the ends
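
As a quick check of what these substitutions do, the sketch below applies the same cleaning steps to a made-up message. The sample string is hypothetical and not taken from the dataset; the function body is repeated from the cell above only so that the snippet runs on its own.

```python
import re

def clean_text(text):
    '''Same cleaning steps as in the cell above, repeated so this snippet is standalone'''
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'http\S+', ' ', text)
    text = re.sub(r'@\w+', ' ', text)
    text = re.sub(r'[^a-zA-Z0-9\s\-]', ' ', text)
    text = re.sub(r'\s-\s?|\s?-\s', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.lower().strip()

# Hypothetical raw message with HTML, a newline, a URL, a mention and a hyphenated name
sample = "Check <b>this</b> out:\nhttp://example.onion @bob 4-MMC - top!"
cleaned = clean_text(sample)
print(cleaned)  # check this out 4-mmc top
```

The hyphen inside `4-MMC` survives while the free-standing dash is dropped, which appears to be what lets hyphenated names such as `4-mmc` show up as single terms later in the notebook.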

In [4]:
# Load the dataset
df = pd.read_csv('drugs_data.csv')
# Clean the messages in the dataset
df['cleaned_message'] = df['Message'].astype(str).apply(clean_text)
# Drop duplicate messages posted by the same user (mostly spammed promotions), because they would bias the results
df = df.drop_duplicates(subset=['Message', 'User']).reset_index(drop=True)
df.head()

Unnamed: 0,Message,MessageId,MessageTitle,User,Timestamp,Subdread,NumberOfComments,CommentToF,cleaned_message
0,Just giving a shout out to Cheshirecat82. Just...,62dc896a6d0c1b88cf90,Cheshirecat82,/u/ecto,2023-11-05 20:35:00,/d/Psychedelics,0,False,just giving a shout out to cheshirecat82 just ...
1,How come there's nobody whose shared the REAL ...,55020c8fbfe69b0f722b,The Birch method with pseudoephedrine precurso...,/u/Ghwbushsr,2025-03-12 00:56:00,/d/DrugManufacture,17,False,how come there s nobody whose shared the real ...
2,I 100% agree OP. Birtch reeduction all the way...,55020c8fbfe69b0f722b,The Birch method with pseudoephedrine precurso...,/u/Thehighlow,2025-03-12 01:19:00,/d/DrugManufacture,0,True,i 100 agree op birtch reeduction all the way i...
3,I understand no one wants to tell someone who ...,55020c8fbfe69b0f722b,The Birch method with pseudoephedrine precurso...,/u/UnpaintedSinner,2025-03-12 02:03:00,/d/DrugManufacture,0,True,i understand no one wants to tell someone who ...
4,The thing is fucks like Uncle Fester and miste...,55020c8fbfe69b0f722b,The Birch method with pseudoephedrine precurso...,/u/Ghwbushsr,2025-03-15 01:50:00,/d/DrugManufacture,0,True,the thing is fucks like uncle fester and miste...


# 2. Training Dataset

In this section:
* The <b>seed lists</b> will be obtained and created (for drugs, vendors and the negative class)
* The <b>contexts</b> will be scraped from the messages using these seeds

As described in the scientific deliverable, we use a drug seed list, a vendor seed list, and a negative seed list. This gives us three classes of contexts, on which we then train a classifier.

## 2.1 Seed lists

The 3 different seed lists will be obtained in the following way:
* <b>Drugs</b>: Scrape the up-to-date list of known drug and NPS names and their aliases from [Talk To Frank](https://www.talktofrank.com/drugs-a-z).
* <b>Vendors</b>: A manually created list of known dark web marketplaces, compiled through Google searches and LLM prompts. We extended it with vendor names taken directly from the messages: we used RegEx to find Dread user names in messages and then checked whether they were mentioned in relation to selling drugs (e.g. "The shrooms from /u/lilxan are delicious").
* <b>Negative</b>: To train a meaningful classifier, we needed a Negative class so that the model can learn when a context is drug-related and when it is not. [Del Vigna et al. (2016)](https://arxiv.org/abs/1609.06577) did not use a Negative class in their method, so we devised one ourselves. We extracted the nouns in the dataset, counted how often each one occurs across all messages, and sorted them from most to least frequent. The most frequent nouns are the most useful ones, because they yield the most training data for the negative class. For each of these nouns, we then computed the share of its contexts in which a <b>known drug</b> or <b>known vendor</b> from the seed lists occurs. From roughly the 30 most frequent nouns, we selected every noun for which this share was 18% or less as our negative seeds. We chose 18% because at this threshold we obtained roughly as many negative contexts as drug contexts, which helps against class imbalance. Finally, we extracted the contexts of the selected nouns, left out every context in which a known drug or known vendor occurs, and added the rest to the negative class.

<b>NOTE</b>: We understand that this method for finding negative samples is not perfect. Its most prominent disadvantage is that <b>not yet known NPS names</b> can occur in the contexts of the negative class, which degrades the performance of the classifier. We tried to mitigate this by filtering out the negative contexts that contain known drugs or vendors, so the negative class' contexts should in general contain fewer words related to drugs and vendors, and our method may still be able to find unknown NPS names (SPOILER: it does ;) )
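
The selection rule for negative seeds described above can be sketched as follows. This is a simplified illustration, not the procedure we actually ran: the function name `drug_context_ratio` is hypothetical, a plain whitespace split stands in for `word_tokenize`, the denominator counts every occurrence of the noun rather than only those tagged as a noun, and the toy messages are made up.

```python
def drug_context_ratio(messages, noun, known_terms, window=10):
    """Share of the noun's occurrences whose context window contains a known drug or vendor."""
    known = set(known_terms)
    hits = total = 0
    for message in messages:
        tokens = message.split()  # assumption: whitespace split instead of word_tokenize
        for i, token in enumerate(tokens):
            if token != noun:
                continue
            total += 1
            window_tokens = tokens[max(i - window, 0):i + window + 1]
            if any(t in known for t in window_tokens):
                hits += 1
    return hits / total if total else 0.0

# Made-up messages, for illustration only
toy_messages = [
    "thanks for the fast reply",
    "thanks the mdma arrived today",
    "many thanks to everyone here",
    "thanks again",
]
ratio = drug_context_ratio(toy_messages, "thanks", ["mdma", "ketamine"])
is_negative_seed = ratio <= 0.18  # the 18% threshold from the text
print(ratio, is_negative_seed)
```

In this toy corpus one of the four contexts of `thanks` contains a known drug, so the ratio is above the threshold and the noun would be rejected as a negative seed.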

In [5]:
# Scrape the known drug/NPS names from Talk to Frank
content = requests.get('https://www.talktofrank.com/drugs-a-z')
soup = BeautifulSoup(content.text, 'html.parser')  # Name the parser explicitly to avoid a parser-guessing warning
# Only scrape the drug names; no other information
unfiltered_drug_list = [tag.get_text().lower() for tag in soup.find_all('span', {'class':'inverted'})]
# Filter out single-character entries (e.g. 'H', 'C', '1'), because they would match far too many contexts, most of them wrong
drug_list = [drug for drug in unfiltered_drug_list if len(drug) > 1]

In [6]:
# Manually selected lists of marketplaces and vendors
vendor_list = ['lilxan', 't4ps', 'justdefyreality', 'redbook', 'bbmc', 'bjohndoe70', 'adderallz', 'bluesconnect', 'mexicanconnection', \
              'peterchein', 'candyland', 'drugslocal', 'clandestiny', 'usapillpress', 'druggiebears', 'xanaxking']
marketplace_list = ['archetyp', 'shell', 'aliexpress', 'dark0de', 'darkode', 'torrez', 'empire', 'hydra', 'torzon', \
               'dream', 'mgm', 'alibaba', 'abacus', 'tmg', 'gofish', 'aplhabay', 'dreammarket', 'silkroad', 'wethenorth', 'wtn']
vendor_list.extend(marketplace_list)

### Negative seeds
<b>NOTE</b>: We have commented out the cells that derive the negative seeds, because they take a very long time to run (hours). The negative seeds that resulted from this procedure are listed below. If you still want to run the procedure yourself, you can uncomment those cells.

In [7]:
# The negative seeds have been obtained by the procedure described above
neg_seeds = ['thanks', 'thank', 'money', 'account', 'days', 'post', 'orders', 'everything', 'things', 'man', 'everyone', 'customers', 'time', 
             'review', 'something', 'guy', 'anything', 'week', 'way', 'someone', 'order']

In [None]:
# def extract_nouns(text, drug_list, vendor_list):
#     '''Function to find potential negative seeds'''
#     words = word_tokenize(text, preserve_line=True)
#     tagged = pos_tag(words, lang='eng')  # Explicitly set lang
#     return [word for word, tag in tagged if tag.startswith('NN') and word not in drug_list and word not in vendor_list and len(word) > 2]

In [None]:
# # Extract all potential candidates from all messages and count how often they occur
# noun_dict = dict()
# for text in df['cleaned_message']:
#     nouns = extract_nouns(text, drug_list, vendor_list) 
#     for noun in nouns:
#         if noun in noun_dict:
#             noun_dict[noun] += 1
#         else:
#             noun_dict[noun] = 1

In [None]:
# # Sort the list from most frequent potential seeds to least
# sorted_words = sorted(noun_dict.items(), key=lambda x:x[1], reverse=True)

In [None]:
# # Loop over all these words from most frequent to least, and calculate the ratio of drugs and vendors in their contexts
# # This can take ages to run; we advise stopping it manually after e.g. the top 30 seeds (which will still take a long time)
# for neg_seed, neg_count in sorted_words:
#     count = 0
#     for text in df['cleaned_message']:
#         tokens = word_tokenize(text, preserve_line=True)
#         for i, word in enumerate(tokens):
#             if word == neg_seed:
#                 start = max(i - 10, 0)
#                 end = min(i + 11, len(tokens))
#                 context = tokens[start:end]
#                 for pot_drug in context:
#                     if pot_drug in drug_list or pot_drug in vendor_list:
#                         # print(f"found {pot_drug} near {word}")
#                         count += 1
#                         break
#     print(f"Word: {neg_seed} has {count} drugs in its context out of {neg_count}: {count/neg_count}")

## 2.2 Obtaining the contexts
The next step was to use the seeds to find the contexts. That worked as follows:
1. Loop through all cleaned messages in the dataset.
2. Tokenize each message and loop through all of its words (<b>word_X</b>).
3. If <b>word_X</b> occurs in one of the seed lists:
    * Grab up to 10 words on either side of <b>word_X</b> (20 surrounding words in total).
    * Leave <b>word_X</b> itself out of this string.
    * Add the surrounding words as a <b>context</b> for the class of the corresponding seed list.
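
The windowing in these steps can be illustrated with a minimal sketch. The helper name `window_context` is hypothetical, the message is tokenized with a plain whitespace split, and a window of 2 is used only to keep the example readable; the notebook itself uses a window of 10.

```python
def window_context(tokens, index, window=10):
    """Join up to `window` tokens on each side of tokens[index], leaving the seed itself out."""
    left = tokens[max(index - window, 0):index]
    right = tokens[index + 1:index + window + 1]
    return ' '.join(left + right)

tokens = "the shrooms from lilxan are delicious".split()
seed_index = tokens.index("shrooms")
context = window_context(tokens, seed_index, window=2)
print(context)  # the from lilxan
```

The slices are clipped at the start and end of the message, so a seed near a message boundary simply gets a shorter context. Unlike the cell below, this sketch does not leave a stray space when one side of the window is empty.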

In [8]:
def extract_contexts(text, seed_words, window=10):
    '''Extract the contexts surrounding seeds'''
    # Tokenize the message
    words = word_tokenize(text, preserve_line=True)
    word_list = []
    contexts = []

    # Obtain the contexts for the seeds
    for i, word in enumerate(words):
        if word in seed_words:
            start = max(i - window, 0)
            end = min(i + window + 1, len(words))
            context = ' '.join(words[start:i]) + " " + ' '.join(words[i+1:end])
            contexts.append(context)
            word_list.append(word)
    return contexts, word_list

def extract_neg_contexts(text, seed_words, drug_list, vendor_list, window=10):
    '''Extract the contexts surrounding negative seeds'''
    # Tokenize the message
    words = word_tokenize(text, preserve_line=True)
    contexts = []

    # Obtain the contexts for the seeds
    for i, word in enumerate(words):
        drug_present = False
        if word in seed_words:
            start = max(i - window, 0)
            end = min(i + window + 1, len(words))
            temp_context = words[start:end]
            # Only add contexts for negative class if there is no drug or vendor in the context
            for potential_drug_vendor in temp_context:
                if potential_drug_vendor in drug_list or potential_drug_vendor in vendor_list:
                    drug_present = True
            if not drug_present:
                context = ' '.join(words[start:i]) + " " + ' '.join(words[i+1:end])
                contexts.append(context)
    return contexts

In [9]:
# For the drugs and vendors, we store both the contexts and the terms for which the contexts were found (for the negative class this is not interesting)
drugs_contexts = []
drugs_named = []
vendors_contexts = []
vendors_named = []
negative_contexts = []

# Loop over all messages and find the contexts for all seeds
for message in df['cleaned_message']:
    # Obtain the contexts for all classes for this message (if any)
    context_drugs, drug_context_list = extract_contexts(message, drug_list)
    context_vendors, vendor_context_list = extract_contexts(message, vendor_list)
    context_negatives = extract_neg_contexts(message, neg_seeds, drug_list, vendor_list)

    # Add the contexts to the lists for this message
    if len(context_drugs) > 0:
        drugs_contexts.extend(context_drugs)
        drugs_named.extend(drug_context_list)
    if len(context_vendors) > 0:
        vendors_contexts.extend(context_vendors)
        vendors_named.extend(vendor_context_list)
    if len(context_negatives) > 0:
        negative_contexts.extend(context_negatives)

# Create a dataframe for each class and then add them together
drug_df = pd.DataFrame(drugs_contexts, columns=['context'])
drug_df['term'] = drugs_named
drug_df['label'] = 'substance'
vendor_df = pd.DataFrame(vendors_contexts, columns=['context'])
vendor_df['label'] = 'vendor'
vendor_df['term'] = vendors_named
neg_df = pd.DataFrame(negative_contexts, columns=['context'])
neg_df['label'] = 'other'
neg_df['term'] = 'other'

# Create the training dataset
df_train = pd.concat([drug_df, vendor_df, neg_df], ignore_index=True)
df_train

Unnamed: 0,context,term,label
0,reeduction all the way i used liquid my own am...,gas,substance
1,ephedrine intermittently until it s ready to c...,juice,substance
2,someone who is unexperienced how to begin lear...,meth,substance
3,book would be worth some money a full guide co...,meth,substance
4,i know a really good process to go from pseudo...,pills,substance
...,...,...,...
284987,this is the strangest but i like it,other,other
284988,testers off amazon and i will say cozy boy and...,other,other
284989,t bullshitting ended up failing a drug test ar...,other,other
284990,this is what i heard experienced and i really ...,other,other


# 3. Candidates

In this section, we obtain the candidate NPS names and their contexts as follows:
* Extract all unique nouns from all messages (these are our candidate NPS names)
* Loop over all candidate NPS names and scrape their contexts

<b>NOTE</b>: This method for finding candidate NPS names works because any NPS name should be a noun, so the set of actual unknown NPS names should be a subset of the set of nouns in all messages. However, it finds A LOT of candidate NPS names (almost 75,000), which means that obtaining them and then scraping all their contexts takes a while: roughly 15 + 30 = 45 minutes. Because of this, the datasets can also simply be loaded from a pre-saved version. Its content is exactly the same as what you would get by running the code yourself, so it only saves you time.
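
The "compute once, then load from disk" pattern mentioned in this note could be wrapped in a small helper. This is only a sketch: `load_or_compute`, the demo file name and the placeholder terms are all hypothetical, and the cells below load the pre-saved files directly instead.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled object at `path` if it exists; otherwise compute, save and return it."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result

# Demo in a temporary directory with placeholder candidate names
cache_path = os.path.join(tempfile.mkdtemp(), 'candidate_seeds_demo.pkl')
first = load_or_compute(cache_path, lambda: ['term_a', 'term_b'])  # computed and saved
second = load_or_compute(cache_path, lambda: [])                   # loaded from disk, compute is skipped
print(first == second)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.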

In [11]:
def extract_all_nouns(df, drug_list, vendor_list, neg_list):
    '''Function to find all words occurring at least once as a noun in the messages; these will be our candidate NPS names'''
    noun_set = set()
    # Loop through all messages
    for text in df['cleaned_message']:
        
        # Use nltk's pos_tag to find nouns
        words = word_tokenize(text, preserve_line=True)
        tagged = pos_tag(words, lang='eng')  # Explicitly set lang
        for word, tag in tagged:
            # If it's already in the set with nouns, then we can continue
            if word in noun_set:
                continue
            # Otherwise we will add it to the set of nouns if it is a noun (and it does not occur in the seed lists already)
            else:
                if tag.startswith('NN') and word not in drug_list and word not in vendor_list and word not in neg_list and len(word) > 2:
                    noun_set.add(word)
    return list(noun_set)

In [12]:
# Extract candidate NPS names list beforehand (can take 15 minutes or so to run)
noun_list = extract_all_nouns(df, drug_list, vendor_list, neg_seeds)
len(noun_list)

74909

In [24]:
# IF YOU DO NOT WANT TO RUN THE CODE ABOVE
with open("candidate_seeds.pkl", 'rb') as f:
    noun_list = pickle.load(f)
len(noun_list)

74909

In [19]:
# THIS CELL CAN TAKE 30 MINUTES TO RUN...
candidate_contexts = defaultdict(list)
# Build the lookup set once, instead of once per message
noun_set = set(noun_list)

# Loop over all messages
for text in df['cleaned_message']:
    # Convert the message to tokens
    tokens = word_tokenize(text, preserve_line=True)

    # Loop through all words in the message and extract the context of every candidate NPS name
    for i, word in enumerate(tokens):
        if word in noun_set:
            start = max(0, i - 10)
            end = min(len(tokens), i + 11)
            context = ' '.join(tokens[start:i]) + " " + ' '.join(tokens[i+1:end])
            candidate_contexts[word].append(context)

# Add the candidates with all their scraped contexts to a dataframe
candidate_data = [(term, context) for term, contexts in candidate_contexts.items() for context in contexts]
candidate_df = pd.DataFrame(candidate_data, columns=['term', 'context'])

In [None]:
# IF YOU DO NOT WANT TO RUN THE CODE ABOVE
candidate_df = pd.read_csv('candidate_predictions_final.csv')[['term', 'context']]
candidate_df

# 4. Model Training

In this section, we train the model:
* First, we split the training dataset into a train set and a test set
* Then, the contexts of the train set are TF-IDF vectorized and a logistic regression model is trained on them
* Finally, we assess the model's performance on the test set
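
To give an intuition for the vectorization step, the sketch below computes TF-IDF weights by hand for a few made-up contexts. It is not the code used in this notebook: it assumes smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which to our understanding matches the defaults of scikit-learn's `TfidfVectorizer`, and it ignores that vectorizer's own tokenization and stop-word removal. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, using smoothed idf and L2 normalisation."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # Scale to unit length; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Made-up contexts, for illustration only
toy_contexts = ["pure ketamine arrived", "ketamine price", "ketamine review"]
vectors = tfidf_vectors(toy_contexts)
first = vectors[0]
print(first)
```

A word that occurs in every context (`ketamine` here) receives a lower weight than words that are specific to one context, which is what lets the classifier focus on the distinguishing words around a term.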

In [28]:
# We split the data randomly into an 80% train set and a 20% test set
X_train, X_test, y_train, y_test = train_test_split(df_train['context'], df_train['label'], test_size=0.2, random_state=42)

# We create a pipeline that first performs tf-idf vectorizing on the contexts and then trains a logistic regression model
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000))
])

# We fit the model on the train set
pipeline.fit(X_train, y_train)

33.0830500125885


In [30]:
# Here we calculate several performance evaluation metrics on our trained model

# Obtain the predictions
y_pred = pipeline.predict(X_test)

# Overall Accuracy
accuracy = accuracy_score(y_test, y_pred)

# Precision, Recall, and F1 Score (macro)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

# Full classification report on test and predictions
report = classification_report(y_test, y_pred)

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Print results
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision (macro): {precision:.4f}")
print(f"Recall (macro): {recall:.4f}")
print(f"F1 Score (macro): {f1:.4f}")
print("\nClassification Report:\n", report)
print("Confusion Matrix:\n", conf_matrix)

Accuracy: 0.7697
Precision (macro): 0.7191
Recall (macro): 0.5969
F1 Score (macro): 0.6244

Classification Report:
               precision    recall  f1-score   support

       other       0.76      0.87      0.81     30439
   substance       0.79      0.71      0.75     23629
      vendor       0.60      0.21      0.31      2931

    accuracy                           0.77     56999
   macro avg       0.72      0.60      0.62     56999
weighted avg       0.77      0.77      0.76     56999

Confusion Matrix:
 [[26587  3668   184]
 [ 6742 16665   222]
 [ 1572   738   621]]
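
The macro-averaged scores reported above can be reproduced from the confusion matrix alone, which makes it easier to see why macro recall is so much lower than accuracy. The sketch below assumes the scikit-learn convention that rows are true classes and columns are predicted classes, with the classes in alphabetical order; the row sums match the support column of the report, which is consistent with that assumption.

```python
labels = ['other', 'substance', 'vendor']
# Rows: true class, columns: predicted class (values copied from the output above)
conf_matrix = [
    [26587, 3668, 184],
    [6742, 16665, 222],
    [1572, 738, 621],
]

total = sum(sum(row) for row in conf_matrix)
accuracy = sum(conf_matrix[i][i] for i in range(len(labels))) / total

precisions, recalls, f1_scores = [], [], []
for i in range(len(labels)):
    true_positives = conf_matrix[i][i]
    predicted_as_i = sum(row[i] for row in conf_matrix)  # column sum
    actually_i = sum(conf_matrix[i])                     # row sum
    precision = true_positives / predicted_as_i
    recall = true_positives / actually_i
    precisions.append(precision)
    recalls.append(recall)
    f1_scores.append(2 * precision * recall / (precision + recall))

# Macro averaging gives every class the same weight, regardless of its size
macro_precision = sum(precisions) / len(labels)
macro_recall = sum(recalls) / len(labels)
macro_f1 = sum(f1_scores) / len(labels)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision (macro): {macro_precision:.4f}")
print(f"Recall (macro): {macro_recall:.4f}")
print(f"F1 Score (macro): {macro_f1:.4f}")
```

Because every class counts equally in the macro average, the low recall of the small `vendor` class (about 0.21) pulls macro recall down to roughly 0.60 even though overall accuracy is 0.77.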


# 5. Predictions

Finally, in this section, we will make the predictions on what might be unknown NPS names. That works as follows:
* First, we classify every context of every candidate seed using the trained Logistic Regression model
* This gives us, for each context, the probability that it belongs to a substance (drug), a vendor, or other (negative)
* For every candidate seed, we extract the known drugs in its contexts (these are used in the dashboard later on)
* Then, we <b>aggregate</b> the predicted drug-class probabilities per candidate (together with the number of contexts per candidate and the 10 most frequent drugs in the candidate's contexts)
* This list is sorted in descending order, which puts the candidate with the <b>highest probability of being a drug</b> at the top (note that such a candidate may occur in only one or very few contexts, so it is not necessarily our best prediction)

In this way, we have obtained a list of almost <b>75,000</b> candidate NPS names with their associated probabilities of being an actual NPS/drug! In the dashboard, filters are applied to this list to obtain a more accurate list of unknown NPS names (e.g. only show candidates that occur in more than 20 messages, are used by more than 5 different users, and have a probability of being an NPS above 0.75).
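
The aggregation step can be illustrated without pandas. The sketch below is a plain-Python stand-in for the `groupby` used further down, with a hypothetical helper name and made-up terms and probabilities; it also shows why a minimum number of contexts is a useful filter.

```python
from collections import defaultdict

def aggregate_terms(rows):
    """rows: (term, prob_substance) pairs. Returns (term, mean_prob, num_contexts), best first."""
    grouped = defaultdict(list)
    for term, prob in rows:
        grouped[term].append(prob)
    scores = [(term, sum(probs) / len(probs), len(probs)) for term, probs in grouped.items()]
    return sorted(scores, key=lambda score: score[1], reverse=True)

# Made-up per-context probabilities, for illustration only
toy_rows = [("term_a", 0.9), ("term_a", 0.7), ("term_b", 1.0), ("term_c", 0.2)]
scores = aggregate_terms(toy_rows)

# Keep only candidates that were seen in at least two contexts
frequent = [score for score in scores if score[2] >= 2]
print(scores)
print(frequent)
```

Here `term_b` ranks first on the strength of a single context, while `term_a` is the only candidate left once a minimum context count is required.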

In [31]:
# Obtain the predictions for the contexts of the candidate seeds (can take 5 minutes to run)
candidate_df['prediction'] = pipeline.predict(candidate_df['context'])
# Compute the class probabilities once and pick out the column of each class
probabilities = pipeline.predict_proba(candidate_df['context'])
classes = list(pipeline.classes_)
candidate_df['prob_substance'] = probabilities[:, classes.index('substance')]
candidate_df['prob_vendor'] = probabilities[:, classes.index('vendor')]
candidate_df['prob_other'] = probabilities[:, classes.index('other')]
candidate_df.head()

Unnamed: 0,term,context,prediction,prob_substance,prob_vendor,prob_other
0,shout,just giving a out to cheshirecat82 just got a ...,substance,0.525836,0.020927,0.453237
1,shout,get into here mate but i ll give you a through pm,other,0.049611,0.022997,0.927392
2,shout,out to the mod that posted it 2 years ago,other,0.109568,0.019599,0.870833
3,shout,my china packs land 12 days or less every time...,other,0.153924,0.035803,0.810273
4,shout,nah you can smoke and shake and bake you can e...,substance,0.924407,0.004375,0.071219


### Find similar drugs

In [32]:
def extract_drugs(context, drugs_list):
    '''Function to extract the known drugs occurring in a candidate's context'''
    present_drugs = [word for word in context.split() if word in drugs_list]
    return present_drugs

In [33]:
# This adds the known drugs to every candidate (can take some time to run too...)
candidate_df['known_drugs'] = candidate_df['context'].apply(lambda x: extract_drugs(x, drug_list))
candidate_df.head()

Unnamed: 0,term,context,prediction,prob_substance,prob_vendor,prob_other,known_drugs
0,shout,just giving a out to cheshirecat82 just got a ...,substance,0.525836,0.020927,0.453237,[]
1,shout,get into here mate but i ll give you a through pm,other,0.049611,0.022997,0.927392,[]
2,shout,out to the mod that posted it 2 years ago,other,0.109568,0.019599,0.870833,[]
3,shout,my china packs land 12 days or less every time...,other,0.153924,0.035803,0.810273,[]
4,shout,nah you can smoke and shake and bake you can e...,substance,0.924407,0.004375,0.071219,[wash]


### Aggregate probabilities to interpret candidates' likelihoods of being an NPS

In [34]:
def aggregate_top_known_drugs(drug_lists):
    '''Function to obtain the top 10 most frequently occurring drugs in a candidate's context '''
    # Flatten the list of lists and count frequencies
    counter = Counter([drug for sublist in drug_lists for drug in sublist])
    # Get the top 10 most common drugs
    top_10 = [drug for drug, _ in counter.most_common(10)]
    return top_10

In [35]:
# Aggregate the drug probabilities per candidate with the mean and sort them from most likely to be a drug to least likely
term_scores = (
    candidate_df.groupby('term')
    .agg(
        mean_prob=('prob_substance', 'mean'),
        num_contexts=('prob_substance', 'count'),
        similar_drugs=('known_drugs', aggregate_top_known_drugs),
    )
    .reset_index()
    .sort_values(by='mean_prob', ascending=False)
)
term_scores.head()

Unnamed: 0,term,mean_prob,num_contexts,similar_drugs
60895,squidgame,0.999961,1,"[xtc, mdma, ketamine]"
62,-ketamine,0.999928,2,[ketamine]
13159,colombiana,0.999919,3,"[4-mmc, speed, paste, amphetamine, cocaine, ic..."
33998,isoproyl,0.999875,1,"[bubble, alcohol, freebase, mixture]"
18292,discrepency,0.999864,2,"[mdma, cocaine, lsd]"


# 6. Dashboard Preparation
We want to display our results in an interactive dashboard that can be used by Law Enforcement Agencies to investigate our predicted unknown NPS names in more detail. The dashboard will contain a few features that will be created here in this section:
* <b>Similar known drugs</b>: This was already obtained in the previous section
* <b>Number of messages</b>: The number of messages in which the candidate NPS occurred
* <b>Number of users</b>: The number of unique users that used the candidate NPS in their messages

A few files will be exported that will be used in the dashboard:
* <b>Cleaned dataset</b>: The drugs dataset with the messages already cleaned for faster processing in the dashboard
* <b>Term scores</b>: This is the main dataset with the predicted likelihood of each candidate being an NPS
* <b>Candidate dataset</b>: This dataset will be used to explain why a given context of a candidate was predicted to be drug-related
* <b>Pipeline as pickle</b>: This will also be used for these explanations, which rely on the model's weights
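
The two usage counts listed above (number of messages and number of users per term) can be sketched in plain Python. The cells below compute them with pandas `explode` and `groupby`; this stand-in uses a hypothetical helper name, placeholder user names and made-up messages.

```python
from collections import defaultdict

def usage_counts(messages):
    """messages: list of (user, cleaned_text) pairs. Returns {term: (num_messages, num_users)}."""
    messages_per_term = defaultdict(set)
    users_per_term = defaultdict(set)
    for message_id, (user, text) in enumerate(messages):
        # A term counts once per message, no matter how often it is repeated in it
        for token in set(text.split(' ')):
            messages_per_term[token].add(message_id)
            users_per_term[token].add(user)
    return {term: (len(ids), len(users_per_term[term])) for term, ids in messages_per_term.items()}

# Made-up messages, for illustration only
toy_messages = [
    ('/u/user_a', 'term_x is strong'),
    ('/u/user_a', 'more term_x term_x'),
    ('/u/user_b', 'term_x again'),
]
counts = usage_counts(toy_messages)
print(counts['term_x'])  # (3, 2)
```

Counting distinct users in addition to messages helps separate terms that many people use from terms that a single account repeats often.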

In [36]:
# Split each cleaned message into a list of words, which is used to count the number of messages and users per candidate NPS
df_split_message = pd.read_csv("drugs_data.csv", parse_dates=["Timestamp"])
df_split_message['Message'] = df_split_message['Message'].astype(str).apply(clean_text)
# Drop duplicates while the messages are still strings (lists are not hashable), then split on single spaces
df_split_message = df_split_message.drop_duplicates(subset=['Message', 'User']).reset_index(drop=True)
df_split_message['Message'] = df_split_message['Message'].apply(lambda x: x.split(" "))
df_split_message.head()

Unnamed: 0,Message,MessageId,MessageTitle,User,Timestamp,Subdread,NumberOfComments,CommentToF
0,"[just, giving, a, shout, out, to, cheshirecat8...",62dc896a6d0c1b88cf90,Cheshirecat82,/u/ecto,2023-11-05 20:35:00,/d/Psychedelics,0,False
1,"[how, come, there, s, nobody, whose, shared, t...",55020c8fbfe69b0f722b,The Birch method with pseudoephedrine precurso...,/u/Ghwbushsr,2025-03-12 00:56:00,/d/DrugManufacture,17,False
2,"[i, 100, agree, op, birtch, reeduction, all, t...",55020c8fbfe69b0f722b,The Birch method with pseudoephedrine precurso...,/u/Thehighlow,2025-03-12 01:19:00,/d/DrugManufacture,0,True
3,"[i, understand, no, one, wants, to, tell, some...",55020c8fbfe69b0f722b,The Birch method with pseudoephedrine precurso...,/u/UnpaintedSinner,2025-03-12 02:03:00,/d/DrugManufacture,0,True
4,"[the, thing, is, fucks, like, uncle, fester, a...",55020c8fbfe69b0f722b,The Birch method with pseudoephedrine precurso...,/u/Ghwbushsr,2025-03-15 01:50:00,/d/DrugManufacture,0,True


In [37]:
# Explode the dataframe of user and messages on the message
df_exploded = df_split_message[['User', 'Message']].explode('Message')

# Group by words to then perform the counts of candidates
word_groups = df_exploded.groupby('Message')

# Now quickly get counts for terms
results = []
for term in term_scores['term']:
    try:
        group = word_groups.get_group(term.lower())
        num_messages = group.index.nunique()
        num_users = group['User'].nunique()
    except KeyError:
        num_messages = 0
        num_users = 0
    results.append((num_messages, num_users))

# Add the results to the list with predicted NPS
term_scores['num_messages'] = [r[0] for r in results]
term_scores['num_users'] = [r[1] for r in results]

In [51]:
# Example filter that can be used now to get more accurate candidate predictions
term_scores[(term_scores['num_messages'] > 20) & (term_scores['num_users'] > 5) & (term_scores['mean_prob'] > 0.75)].head(10)

Unnamed: 0,term,mean_prob,num_contexts,similar_drugs,num_messages,num_users
41230,metaphedrone,0.989311,33,"[white, weed, hash, meth, crystal, ice, ketami...",29,8
31700,hydrochloric,0.973559,145,"[acid, gas, base, heroin, mixture, mdma, cocai...",109,69
17880,diethylamide,0.961595,50,"[acid, lsd, white, mixture, cocaine, drop, cry...",25,24
27323,glacial,0.957358,34,"[acid, alcohol, amphetamine, heroin, mixture, ...",28,22
62160,sulphuric,0.952639,40,"[acid, amphetamine, gas, base, white, glass, c...",30,19
63293,tartaric,0.952593,41,"[acid, meth, lsd, ketamine, methamphetamine, m...",27,16
11353,champagne,0.952089,259,"[mdma, xtc, pills, ketamine, rocks, cocaine, p...",148,62
56416,s-isomer,0.951635,143,"[ketamine, mdma, crystal, rocks, meth, ice, wh...",105,36
62150,sulfuric,0.951433,157,"[acid, base, mixture, gas, cocaine, drop, alco...",105,65
32687,impurity,0.942276,267,"[acid, cocaine, wash, alcohol, mxe, ketamine, ...",259,176


In [44]:
# Export the list of candidates for the dashboard
term_scores.to_csv('term_scores_final.csv', index=False)

In [45]:
# Export the candidates with their original contexts dataset
candidate_df.to_csv('candidate_predictions_final.csv', index=False)

In [46]:
# Save the pipeline to pickle file so it can be used in dashboard
with open('pipeline_final.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

In [47]:
# Save a cleaned version of the dataset for the dashboard
df_drugs_cleaned = pd.read_csv("drugs_data.csv", parse_dates=["Timestamp"])
df_drugs_cleaned['Message'] = df_drugs_cleaned['Message'].astype(str).apply(clean_text)
df_drugs_cleaned.to_csv("drugs_data_cleaned_final.csv", index=False)

# Appendix: Vendor Predictions
A small section of our report discusses the identification of unknown vendors as potential future work. This section shows how the short list that was included in the report was created.

In [48]:
# Obtain the predicted probabilities for vendors
vendors = (
    candidate_df.groupby('term')
    .agg(mean_prob=('prob_vendor', 'mean'), num_contexts=('prob_vendor', 'count'))
    .reset_index()
    .sort_values(by='mean_prob', ascending=False)
)
vendors.head()

Unnamed: 0,term,mean_prob,num_contexts
66340,treesntreats-empire,0.996334,1
58895,silk3,0.993303,1
33501,interociter,0.99217,1
6277,bersculoni,0.989706,1
19836,drugsinkuk,0.982691,1


In [49]:
# Small inspection for report of predicted vendors
vendors[(vendors['num_contexts'] > 5) & (vendors['mean_prob'] > 0.7)]

Unnamed: 0,term,mean_prob,num_contexts
15932,d9f66c249805799a90f8c57b,0.914405,16
65847,tormarket,0.903145,7
64450,themiddleman,0.894063,60
22700,exoticpacks,0.88863,12
16176,dankservices3,0.870804,6
73658,yoururl,0.795792,6
1194,addyinc0,0.795397,6
19737,drpremiumexotic,0.783042,40
62356,superwavey,0.762502,6
55129,revews,0.762275,9
