# Find All Drug Synonyms in Tweets
<br>
James Chapman<br>
CIS 830 Advanced Topics in AI – Term Project<br>
Kansas State University<br><br>

This notebook identifies EVERY synonym MENTION.<br>

- Every tweet in Main data (~68,000)

- 5 Columns of Synonyms (from data_processing/get_known_synonym_lists.ipynb)
    - all_synonyms
    - GPT_synonyms
    - pubchem_synonyms
    - redmed_synonyms
    - DEA_synonyms

Saves a CSV file with the matches to 'data/tweets_with_synonyms_matching.csv'.

In [None]:
import os
import json
import csv
import ast
import re
import ftfy
import pandas as pd
import numpy as np
import concurrent.futures

from multiprocessing import Pool, cpu_count
from datetime import datetime
from collections import Counter, OrderedDict
from tqdm import tqdm
tqdm.pandas()

# get_tweets_dataset is defined locally in the next cell, so it is not imported here
from utils import (
    match_terms,
    get_confusion_matrix_and_metrics,
    load_synonym_dict,
)

In [2]:
def get_tweets_dataset():
    """
    Load the tweets CSV and repair text-encoding problems.
    Returns a DataFrame with 'text', 'label', and 'tweet_num' columns.
    """
    tweets = pd.read_csv('data/tweets.csv', encoding="utf-8-sig")

    # Special characters were displayed incorrectly; ftfy is a library that fixes text encoding issues
    def fix_text_cell(cell):
        if isinstance(cell, str):
            return ftfy.fix_text(cell)
        return cell

    # Parses stringified lists back into lists (currently unused; see the commented-out line below)
    def safe_convert(x):
        if isinstance(x, str):
            try:
                return ast.literal_eval(x)
            except Exception:
                return []
        return x

    # Apply the fix to all text (object) columns in tweets
    for col in tweets.select_dtypes(include=["object"]).columns:
        tweets[col] = tweets[col].apply(fix_text_cell)
        #tweets[col] = tweets[col].apply(safe_convert)
    return tweets

In [3]:
tweets = get_tweets_dataset()
tweets.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68378 entries, 0 to 68377
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       68378 non-null  object
 1   label      68378 non-null  object
 2   tweet_num  68378 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.6+ MB


In [4]:
# Run match_terms, which uses regex to match exact terms in the text.
# 5 synonym columns: all_synonyms plus 4 individual sources (GPT, PubChem, RedMed, DEA)
tweets = match_terms('all_synonyms', tweets, 'text', 'found_terms', 'found_index_terms')
tweets = match_terms('GPT_synonyms', tweets, 'text', 'GPT_found_terms', 'GPT_found_index_terms')
tweets = match_terms('pubchem_synonyms', tweets, 'text', 'pubchem_found_terms', 'pubchem_found_index_terms')
tweets = match_terms('redmed_synonyms', tweets, 'text', 'redmed_found_terms', 'redmed_found_index_terms')
tweets = match_terms('DEA_synonyms', tweets, 'text', 'DEA_found_terms', 'DEA_found_index_terms')

Drugs with ≥1 synonym (71 total):
Total number of synonyms 17227


100%|██████████| 68378/68378 [06:30<00:00, 174.97it/s]


Drugs with ≥1 synonym (22 total):
Total number of synonyms 5395


100%|██████████| 68378/68378 [01:37<00:00, 701.40it/s] 


Drugs with ≥1 synonym (71 total):
Total number of synonyms 5839


100%|██████████| 68378/68378 [01:59<00:00, 573.70it/s] 


Drugs with ≥1 synonym (29 total):
Total number of synonyms 1604


100%|██████████| 68378/68378 [00:33<00:00, 2067.40it/s]


Drugs with ≥1 synonym (12 total):
Total number of synonyms 1006


100%|██████████| 68378/68378 [00:22<00:00, 3105.70it/s]
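
The implementation of `match_terms` lives in `utils` and is not shown in this notebook. As a rough illustration of exact-term regex matching, the sketch below builds one case-insensitive pattern from a small lexicon and returns both the matched synonyms and their index terms. The lexicon values, `build_pattern`, and `match_terms_sketch` are hypothetical names made up for this example; the real function may tokenize, normalize, or deduplicate differently.

```python
import re

# Hypothetical lexicon: index term -> list of synonyms (illustrative values only).
lexicon = {
    "codeine": ["codeine", "lean"],
    "pregabalin": ["pregabalin", "lyrica"],
}

def build_pattern(synonyms):
    """Compile one case-insensitive regex that matches any synonym as a whole term."""
    # Longest synonyms first, so multi-word terms win over their shorter prefixes.
    ordered = sorted(set(synonyms), key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    # Lookarounds reject matches embedded inside a longer word.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)

def match_terms_sketch(text, lexicon):
    """Return (found_terms, found_index_terms) for a single text."""
    synonym_to_index = {
        syn.lower(): index_term
        for index_term, syns in lexicon.items()
        for syn in syns
    }
    pattern = build_pattern(synonym_to_index)
    # Every mention is kept, so a repeated synonym appears more than once.
    found_terms = [m.group(0).lower() for m in pattern.finditer(text)]
    # dict.fromkeys drops duplicate index terms while keeping first-seen order.
    found_index_terms = list(dict.fromkeys(synonym_to_index[t] for t in found_terms))
    return found_terms, found_index_terms

terms, index_terms = match_terms_sketch("Lyrica and codeine, then more CODEINE.", lexicon)
print(terms)        # ['lyrica', 'codeine', 'codeine']
print(index_terms)  # ['pregabalin', 'codeine']
```

Keeping every mention in the term list while deduplicating the index terms is an assumption chosen to resemble the columns shown later in this notebook, where a synonym can repeat within one tweet but each index term is listed once.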


In [5]:
tweets.to_csv('data/tweets_with_synonyms_matching.csv', index=False, encoding="utf-8-sig")
tweets.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68378 entries, 0 to 68377
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   text                       68378 non-null  object
 1   label                      68378 non-null  object
 2   tweet_num                  68378 non-null  int64 
 3   found_terms                68378 non-null  object
 4   found_index_terms          68378 non-null  object
 5   GPT_found_terms            68378 non-null  object
 6   GPT_found_index_terms      68378 non-null  object
 7   pubchem_found_terms        68378 non-null  object
 8   pubchem_found_index_terms  68378 non-null  object
 9   redmed_found_terms         68378 non-null  object
 10  redmed_found_index_terms   68378 non-null  object
 11  DEA_found_terms            68378 non-null  object
 12  DEA_found_index_terms      68378 non-null  object
dtypes: int64(1), object(12)
memory usage: 6.8+ MB


In [6]:
tweets = pd.read_csv('data/tweets_with_synonyms_matching.csv', encoding="utf-8-sig")
tweets.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68378 entries, 0 to 68377
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   text                       68378 non-null  object
 1   label                      68378 non-null  object
 2   tweet_num                  68378 non-null  int64 
 3   found_terms                68378 non-null  object
 4   found_index_terms          68378 non-null  object
 5   GPT_found_terms            68378 non-null  object
 6   GPT_found_index_terms      68378 non-null  object
 7   pubchem_found_terms        68378 non-null  object
 8   pubchem_found_index_terms  68378 non-null  object
 9   redmed_found_terms         68378 non-null  object
 10  redmed_found_index_terms   68378 non-null  object
 11  DEA_found_terms            68378 non-null  object
 12  DEA_found_index_terms      68378 non-null  object
dtypes: int64(1), object(12)
memory usage: 6.8+ MB


In [7]:
tweets['found_terms']

0                                         []
1                                         []
2                                         []
3                                 ['killer']
4                                 ['doctor']
                        ...                 
68373    ['pain pill', ' codeine', 'doctor']
68374                           [' codeine']
68375          ['codeine', 'lean', 'nyquil']
68376                   ['wolf', 'morphine']
68377                           [' codeine']
Name: found_terms, Length: 68378, dtype: object

In [8]:
def safe_convert(x):
    """Parse a stringified list back into a list; return [] if parsing fails."""
    if isinstance(x, str):
        try:
            return ast.literal_eval(x)
        except Exception:
            return []
    return x

# Convert string representations to actual lists
list_columns = [
    f'{prefix}found_{suffix}'
    for prefix in ('', 'GPT_', 'pubchem_', 'redmed_', 'DEA_')
    for suffix in ('terms', 'index_terms')
]
for col in list_columns:
    tweets[col] = tweets[col].apply(safe_convert)

# Now, count the number of elements in each list
tweets['count_found_terms'] = tweets['found_terms'].apply(len)
tweets['count_found_terms'] 


0        0
1        0
2        0
3        1
4        1
        ..
68373    3
68374    1
68375    3
68376    2
68377    1
Name: count_found_terms, Length: 68378, dtype: int64
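
The conversion step is needed because a CSV file stores each list as its string representation, so the column comes back as text after reloading. The standard-library sketch below reproduces that round trip on two made-up rows (the row values are illustrative, not taken from the dataset) and shows that `len` on the unconverted string counts characters rather than terms.

```python
import ast
import csv
import io

# Illustrative rows; the values are made up for this example.
rows = [
    {"tweet_num": 0, "found_terms": []},
    {"tweet_num": 1, "found_terms": ["pain pill", "codeine"]},
]

# Write to an in-memory CSV; each list is stored as its str() representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["tweet_num", "found_terms"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every field is now a plain string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[1]["found_terms"]
print(type(raw).__name__, len(raw))            # length counts characters, not terms

# ast.literal_eval safely parses the string back into a Python list.
restored = ast.literal_eval(raw)
print(type(restored).__name__, len(restored))  # length now counts terms
```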

In [9]:
rows_with_terms = tweets[tweets['found_terms'].apply(lambda terms: len(terms) > 0)]
print(len(rows_with_terms))
tweets['found_terms']

32727


0                                   []
1                                   []
2                                   []
3                             [killer]
4                             [doctor]
                     ...              
68373    [pain pill,  codeine, doctor]
68374                       [ codeine]
68375          [codeine, lean, nyquil]
68376                 [wolf, morphine]
68377                       [ codeine]
Name: found_terms, Length: 68378, dtype: object

In [10]:
tweets['found_index_terms']

0                                  []
1                                  []
2                                  []
3                     [phencyclidine]
4                              [mdma]
                     ...             
68373    [oxymorphone, codeine, mdma]
68374                       [codeine]
68375                       [codeine]
68376       [phencyclidine, morphine]
68377                       [codeine]
Name: found_index_terms, Length: 68378, dtype: object

In [11]:
# Explode the lists into rows and count each index term
index_term_counts = (
    tweets
      .explode('found_index_terms')['found_index_terms']
      .value_counts(dropna=True)
)

print("Index Term Counts Found in tweets:")
print(len(index_term_counts))
print(index_term_counts)


Index Term Counts Found in tweets:
36
found_index_terms
codeine             12634
morphine             9023
fentanyl             5361
delta-9-thc-cooh     4263
methamphetamine      3310
oxycodone            1769
hydrocodone          1444
mdma                 1135
phencyclidine         750
lsd                   731
ketamine              472
oxymorphone           457
methadone             389
hydromorphone         369
cbd                   321
psilocybin            266
amphetamine           153
zolpidem              148
buprenorphine         104
pregabalin             89
lorazepam              82
diphenhydramine        53
benzoylecgonine        36
cyclobenzaprine        31
meperidine             30
gabapentin             30
temazepam              28
zopiclone              25
carfentanil            19
norfentanyl            14
xylazine               12
naltrexone             11
meprobamate             5
4-anpp                  2
oxazepam                2
nordiazepam             2
Name: co

In [12]:
counts = (
    tweets
      .explode('found_terms')['found_terms']
      .value_counts(dropna=True)
)
print("Term Counts Found in tweets:")
print(counts)

Term Counts Found in tweets:
found_terms
 codeine            10188
morphine             8528
fentanyl             2679
codeine              2538
actiq                1348
                    ...  
multiple pills          1
vikes                   1
therapeutic uses        1
street pills            1
k hole                  1
Name: count, Length: 1218, dtype: int64
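
In the output above, ' codeine' (with a leading space) and 'codeine' are counted as separate terms. Whether that leading space is intentional in the synonym list is not stated here. If it is not, stripping whitespace before counting would merge the two entries, as in the following sketch on made-up lists (the variable names are hypothetical):

```python
from collections import Counter

# Illustrative per-tweet term lists; values are made up for this example.
found_terms_per_tweet = [
    [" codeine", "doctor"],
    ["codeine", "lean"],
    [" codeine"],
]

# Raw counts treat ' codeine' and 'codeine' as different keys.
raw_counts = Counter(
    term for terms in found_terms_per_tweet for term in terms
)

# Stripping surrounding whitespace merges the variants into one key.
normalized_counts = Counter(
    term.strip() for terms in found_terms_per_tweet for term in terms
)

print(raw_counts[" codeine"], raw_counts["codeine"])  # 2 1
print(normalized_counts["codeine"])                   # 3
```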


In [13]:
# 1) All possible index terms (lowercased keys of the lexicon)
csv_path = 'data/synonym_lists.csv'
synonym_dict = load_synonym_dict(csv_path, 'all_synonyms')
all_index_terms = set(synonym_dict.keys())

# 2) All actually found index terms
found_set = set(
    term 
    for sublist in tweets['found_index_terms'] 
    for term in sublist
)

# 3) Compute the difference
missing = all_index_terms - found_set

print("Index terms not found anywhere:")
print(len(missing))
for term in sorted(missing):
    print(term)


Drugs with ≥1 synonym (71 total):
Index terms not found anywhere:
41
2,6-xylidine
2-amino-5-chloropyridine
2-fluoro-2-oxo pce
2-oxo-3-hydroxy-lsd
3-hydroxy flubromazepam
3-hydroxy flubromazepam glucuronide
4-hiaa
6-acetylmorphine
7-aminoclonazepam
7-hydroxymitragynine
7-oh-cbd glucuronide
8-aminoclonazolam
8r-oh-r-hhc
8s-oh-r-hhc
alpha-hydroxyalprazolam
alpha-hydroxybromazolam
bromazolam
delta-8-thc-cooh
eddp
flubromazepam
mda
mdmb-4en-pinaca butanoic acid
metonitazene
mitragynine
n,n-dimethylpentylone
n-desethylmetonitazene
n-pyrrolidinoetonitazene
norbuprenorphine
norcarfentanil
norketamine
normeperidine
noroxycodone
o-desmethyltramadol
ortho-methylfentanyl
para-fluorofentanyl
para-fluoronorfentanyl
pentylone
psilocin
r-hhc-cooh
s-hhc-cooh
speciociliatine


In [14]:
# Filter rows where 'pregabalin' appears in found_index_terms
pregabalin_data = tweets[tweets['found_index_terms'].apply(lambda x: 'pregabalin' in x)]
pregabalin_data.info(verbose=True)
pregabalin_data.head(10)


<class 'pandas.core.frame.DataFrame'>
Index: 89 entries, 740 to 67932
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   text                       89 non-null     object
 1   label                      89 non-null     object
 2   tweet_num                  89 non-null     int64 
 3   found_terms                89 non-null     object
 4   found_index_terms          89 non-null     object
 5   GPT_found_terms            89 non-null     object
 6   GPT_found_index_terms      89 non-null     object
 7   pubchem_found_terms        89 non-null     object
 8   pubchem_found_index_terms  89 non-null     object
 9   redmed_found_terms         89 non-null     object
 10  redmed_found_index_terms   89 non-null     object
 11  DEA_found_terms            89 non-null     object
 12  DEA_found_index_terms      89 non-null     object
 13  count_found_terms          89 non-null     int64 
dtypes: int64(2),

Unnamed: 0,text,label,tweet_num,found_terms,found_index_terms,GPT_found_terms,GPT_found_index_terms,pubchem_found_terms,pubchem_found_index_terms,redmed_found_terms,redmed_found_index_terms,DEA_found_terms,DEA_found_index_terms,count_found_terms
740,Sensitisation of nervous system from codeine w...,T,740,"[ codeine, fibromyalgia]","[pregabalin, codeine]","[ codeine, fibromyalgia]","[pregabalin, codeine]",[codeine],[codeine],[codeine],[codeine],[codeine],[codeine],2
1428,Ya GABA analogue withdrawals can get pretty cr...,T,1428,"[gaba analogue, crazy]","[pregabalin, fentanyl]",[gaba analogue],[pregabalin],[],[],[],[],[crazy],[fentanyl],2
1562,Pregabalin- Lyrica \nSolpadol\nOramorph- Liqui...,T,1563,"[pregabalin, lyrica, solpadol, oramorph, liqui...","[pregabalin, codeine, morphine]","[pregabalin, lyrica, solpadol, oramorph, liqui...","[pregabalin, codeine, morphine]","[pregabalin, lyrica, morphine]","[pregabalin, morphine]","[pregabalin, lyrica, oramorph, morphine]","[pregabalin, morphine]","[pregabalin, morphine]","[pregabalin, morphine]",5
2815,(Passion-flower)\nAn efficient anti-spasmodic....,T,2816,"[flower, morphine, neuralgia]","[pregabalin, morphine, delta-9-thc-cooh]","[morphine, neuralgia]","[pregabalin, morphine]",[morphine],[morphine],[morphine],[morphine],"[flower, morphine]","[morphine, delta-9-thc-cooh]",3
3748,"Physical dependence occurs, but addiction is e...",T,3749,"[methadone, methadone, nerve pain, duragesic, ...","[pregabalin, fentanyl, oxymorphone, methadone]","[methadone, methadone, nerve pain, duragesic, ...","[pregabalin, fentanyl, oxymorphone, methadone]","[methadone, methadone, duragesic, opana]","[oxymorphone, fentanyl, methadone]","[methadone, methadone, duragesic, actiq, opana]","[oxymorphone, fentanyl, methadone]","[methadone, methadone]",[methadone],6
4310,Same here and Thank God it works for CFS/ME an...,T,4311,"[fibromyalgia, morphine]","[pregabalin, morphine]","[fibromyalgia, morphine]","[pregabalin, morphine]",[morphine],[morphine],[morphine],[morphine],[morphine],[morphine],2
6302,Oh fuck I don't use the despencery for weed my...,T,6303,"[weed, narcotics, morphine, lyrica, weed]","[pregabalin, morphine, delta-9-thc-cooh, oxyco...","[narcotics, morphine, lyrica]","[pregabalin, morphine]","[morphine, lyrica]","[pregabalin, morphine]","[morphine, lyrica]","[pregabalin, morphine]","[weed, morphine, weed]","[morphine, delta-9-thc-cooh]",5
6833,Drug Science\nClaire Bywalec - Project Twenty2...,T,6834,[fibromyalgia],[pregabalin],[fibromyalgia],[pregabalin],[],[],[],[],[],[],1
7464,"Paracetamol does nothing for me, I need someth...",T,7465,"[ codeine, morphine, lyrica, endone, oxycodone...","[pregabalin, codeine, morphine, oxycodone]","[ codeine, morphine, lyrica, endone, oxycodone...","[pregabalin, codeine, morphine, oxycodone]","[codeine, morphine, lyrica, endone, oxycodone,...","[pregabalin, codeine, morphine, oxycodone]","[codeine, morphine, lyrica, oxycodone, codeine]","[pregabalin, codeine, morphine, oxycodone]","[codeine, morphine, oxycodone, codeine]","[codeine, morphine, oxycodone]",6
8176,He was given a shot of morphine for several br...,T,8177,"[morphine, capsules, pills]","[pregabalin, hydrocodone, morphine]","[morphine, capsules]","[pregabalin, morphine]",[morphine],[morphine],[morphine],[morphine],[morphine],[morphine],3


In [15]:
for element in pregabalin_data['text']:
    # Remove any leading or trailing whitespace
    element = element.strip()
    # Collapse any run of whitespace (including newlines) into a single space
    element = re.sub(r'\s+', ' ', element)
    # Print the cleaned string
    print(element)


Sensitisation of nervous system from codeine withdrawal plus fibromyalgia makes it hard to know which is which. Flip a coin?
Ya GABA analogue withdrawals can get pretty crazy too
Pregabalin- Lyrica Solpadol Oramorph- Liquid Morphine And a cup of coffee/ Milk
(Passion-flower) An efficient anti-spasmodic. Whooping-cough. Morphine habit. Delirium tremens. Convulsions in children; neuralgia. Has a quieting effect on the nervous system. Insomnia, produces normal sleep, cerebral functions, neuroses of children,
Physical dependence occurs, but addiction is extremely rare because methadone doesn't release dopamine. Don't let them guilt trip you. was on methadone for years and it helped with nerve pain a lot. He still needed Duragesic, Actiq, & Opana. Also limited to one drink
Same here and Thank God it works for CFS/ME and fibromyalgia pain too. I had severe back pain in the 90's and the only thing I had was morphine and is "sorta" worked but cannabis is a freaking godsend and I wish I had it 

In [16]:
index_term_counts = (
    tweets
      .explode('found_index_terms')['found_index_terms']
      .value_counts(dropna=True)
)

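# Index terms matched fewer than 100 times across all tweets
rare_terms = index_term_counts[index_term_counts < 100].index.tolist()
print(f"Rare index terms (<100 hits): {rare_terms}")

# Use a set for constant-time membership checks
rare_set = set(rare_terms)
mask = tweets['found_index_terms'].apply(
    lambda terms: any(term in rare_set for term in terms)
)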
rare_tweets = tweets[mask]
print(f"Rows containing rare terms: {len(rare_tweets)} of {len(tweets)}")

rare_tweets.to_csv("data/rare_tweets.csv", index=False, encoding="utf-8-sig")


Rare index terms (<100 hits): ['pregabalin', 'lorazepam', 'diphenhydramine', 'benzoylecgonine', 'cyclobenzaprine', 'meperidine', 'gabapentin', 'temazepam', 'zopiclone', 'carfentanil', 'norfentanyl', 'xylazine', 'naltrexone', 'meprobamate', '4-anpp', 'oxazepam', 'nordiazepam']
Rows containing rare terms: 453 of 68378
