# Entity Linking
<br>
James Chapman<br>
CIS 830 Advanced Topics in AI – Term Project<br>
Kansas State University<br><br>

This notebook identifies **every drug mention** (illicit or abused) in each post and links it to a list of 77 substances of interest.<br>

- Uses `entity_linking_prompt`
- Runs 4 LLMs (2 paid services and 2 local Hugging Face models)
    - o4-mini
    - GPT-4o-mini
    - Meta-Llama-3.1-8B-Instruct
    - Qwen-4B

**Note:** The response from every LLM call is saved to a separate text file.

In [None]:
import pandas as pd
import json
import os
import re
from tqdm import tqdm
tqdm.pandas()

from utils import (
    get_tweets_dataset, 
    extract_responses,
    extract_JSON_labels_and_explanations,
    extract_drug_labels,
    extract_T_F_labels, 
    match_terms,
    get_confusion_matrix_and_metrics,
)
from llm_manager import (
    run_prompt_on_llm,
### paid #################
    # get_claude_response, 
    # get_openai_response, 
    # get_perplexity_response,
    get_openai_gpt4omini_response,
    get_openai_o4mini_response,
### local #################
    get_llama_response, 
    get_qwen_4b_response,                        
    #get_deepseek_qwen_response,
    # get_qwen_response,
    # get_mistral_response,
    cleanup_llama,
    cleanup_qwen,
    cleanup_qwen_4b,
    cleanup_deepseek_qwen,
    cleanup_mistral,
)

In [None]:
entity_linking_prompt = """
### Role
You moderate social‑media content for drug‑related references.

### Tasks
Given a post:
1. Detect every word or phrase (slang, euphemism, synonym, cultural references) that refers to an illicit drug or substance.
   • Include illegal, RX, OTC, or recreational drugs (using, buying, promoting, encouraging or glorifying drug use).
2. Link each detected phrase to its matching **Index Term** (see list below). 
   • If the drug is *not* in the list, use "other".
3. Return **only** a valid JSON array, one object per match, using the schema  
   `{"index_term": "<Index Term or other>", "matched_phrase": "<exact text>"}`.

### Output: exact format
[
  {"index_term": "Codeine", "matched_phrase": "lean"},
  {"index_term": "Codeine", "matched_phrase": "codeine"},
  ...
]    

### Index Terms
2,6-Xylidine (2,6-Dimethylaniline, 2‑amino‑1,3‑dimethylbenzene) – A toxic aniline derivative used as an intermediate in manufacturing local anesthetics and dyes.
2‑Amino‑5‑chloropyridine (5‑Chloro‑2‑aminopyridine) – A breakdown product of the sleep drug zopiclone that serves as a urinary marker in drug tests.
2‑Fluoro‑2‑oxo PCE (CanKet, 2F‑NENDCK, 2‑FXE) – A novel ketamine‑like dissociative anesthetic sold illicitly for its hallucinogenic effects.
2‑Oxo‑3‑hydroxy‑LSD (2‑oxo‑3‑hydroxy lysergic acid diethylamide, oxo‑HO‑LSD) – The principal urinary metabolite of LSD produced after psychedelic use.
3‑Hydroxy flubromazepam () – A primary metabolite of the designer benzodiazepine flubromazepam detectable in urine for an extended period.
3‑Hydroxy flubromazepam glucuronide () – The glucuronidated form of 3‑hydroxy flubromazepam that aids bodily elimination of the parent drug.
4‑ANPP (Despropionyl fentanyl, 4‑anilino‑N‑phenethylpiperidine) – A key precursor and metabolite of fentanyl often found as an impurity in illicit opioid production.
4‑HIAA (4‑Hydroxyindole‑3‑acetic acid) – The main inactive metabolite excreted after psilocybin ingestion (“magic mushrooms”).
6‑Acetylmorphine (6‑Monoacetylmorphine, 6‑MAM) – An active intermediate metabolite formed almost immediately after heroin use.
7‑Aminoclonazepam () – The chief urinary metabolite of the prescription benzodiazepine clonazepam.
7‑Hydroxymitragynine (7‑OH‑Mitragynine) – A potent kratom alkaloid and human metabolite that produces strong opioid‑like effects.
7‑OH‑CBD glucuronide () – A conjugated metabolite formed during cannabidiol (CBD) metabolism and excreted in urine.
8‑Aminoclonazolam () – A metabolite indicating use of the designer benzodiazepine clonazolam in toxicology screens.
8R‑OH‑R‑HHC (8α‑hydroxy‑9β‑HHC) – A hydroxylated metabolite of the synthetic cannabinoid HHC produced during metabolism.
8S‑OH‑R‑HHC (8β‑hydroxy‑9β‑HHC) – The epimeric hydroxylated metabolite of R‑HHC likewise formed in users of HHC products.
α‑Hydroxyalprazolam (α‑OH‑alprazolam) – The major active metabolite of alprazolam (Xanax) contributing to its sedative effects.
α‑Hydroxybromazolam (α‑OH‑bromazolam) – An oxidative metabolite used to confirm bromazolam ingestion.
Amphetamine (Speed, Bennies, Adderall) – A central nervous system stimulant prescribed for ADHD that is frequently abused for euphoria and wakefulness.
Benzoylecgonine (BZE, Ecgonine benzoate) – The primary metabolite of cocaine measured in urine drug testing.
Bromazolam (XLI‑268, Fake Xanax) – A designer benzodiazepine sold illicitly that mimics the sedative effects of alprazolam.
Buprenorphine (Suboxone, Subutex, Bupe) – A partial opioid agonist used to treat opioid dependence and moderate pain with a ceiling on respiratory depression.
Carfentanil (Wildnil, Elephant tranquilizer) – An ultra‑potent fentanyl analog used for large‑animal sedation and deadly in microgram doses to humans.
CBD (Cannabidiol, Epidiolex) – A non‑intoxicating cannabinoid with antiepileptic and anxiolytic properties approved for certain seizure disorders.
Codeine (Lean, Purple Drank, Schoolboy) – A mild opioid painkiller and cough suppressant that is widely abused in syrup form for euphoric “lean.”
Cyclobenzaprine (Flexeril, Amrix, Fexmid) – A muscle relaxant related to tricyclic antidepressants used for short‑term relief of muscle spasms.
delta‑8‑THC‑COOH (11‑nor‑9‑carboxy‑delta‑8‑THC) – The main inactive urinary metabolite indicating use of delta‑8‑THC products.
delta‑9‑THC‑COOH (11‑nor‑9‑carboxy‑THC, THC‑COOH) – The principal inactive metabolite of cannabis’ delta‑9‑THC detected in urine screens.
Diphenhydramine (Benadryl, Nytol) – A first‑generation antihistamine with sedative and anticholinergic effects that is sometimes misused at high doses.
EDDP () – The inactive metabolite of methadone used to verify methadone compliance in treatment programs.
Fentanyl (China White, Apache, Duragesic) – A synthetic opioid 50–100 times stronger than morphine that drives many overdose deaths.
Flubromazepam () – A long‑acting “research chemical” benzodiazepine producing prolonged sedation and anxiolysis.
Gabapentin (Neurontin, Gralise, Gabbies) – An anticonvulsant prescribed for neuropathic pain and seizures that carries some abuse potential.
Hydrocodone (Vicodin, Norco, Vike) – A commonly prescribed semi‑synthetic opioid painkiller frequently diverted for nonmedical use.
Hydromorphone (Dilaudid, Dillies) – A potent morphine derivative used for severe pain with rapid onset and high addiction risk.
Ketamine (Special K, Keta, Ketalar) – A dissociative anesthetic causing trance‑like analgesia and hallucinations prized both medically and recreationally.
Lorazepam (Ativan) – A short‑acting benzodiazepine anxiolytic used for anxiety, insomnia, and seizure emergencies.
LSD (Acid, Lucy, Delysid) – A potent psychedelic that dramatically alters perception and cognition at microgram doses.
Marijuana (Cannabis, Weed, Pot) – The most widely used illicit drug, whose primary psychoactive cannabinoid is THC.
MDA (Sass, Love Drug, 3,4‑methylenedioxyamphetamine) – An entactogenic and psychedelic stimulant similar to MDMA but with stronger visuals.
MDMA (Ecstasy, Molly, 3,4‑methylenedioxymethamphetamine) – A popular club drug producing euphoria, empathy, and heightened sensory perception.
MDMB‑4en‑PINACA butanoic acid () – A primary metabolite confirming use of the potent synthetic cannabinoid MDMB‑4en‑PINACA.
Meperidine (Demerol, Pethidine) – A synthetic opioid analgesic now rarely used due to its toxic metabolite normeperidine.
Meprobamate (Miltown, Equanil) – A historical anxiolytic “tranquilizer” that produces barbiturate‑like sedation and dependency risk.
Methadone (Dolophine, Methadose, Done) – A long‑acting opioid agonist employed in maintenance therapy and chronic pain management.
Methamphetamine (Crystal Meth, Ice, Desoxyn) – A powerful stimulant drug with high addiction potential and neurotoxic effects when abused.
Metonitazene () – A nitazene‑class synthetic opioid comparable in potency to fentanyl and implicated in recent overdose clusters.
Mitragynine () – The main kratom alkaloid acting as a partial opioid receptor agonist with stimulant properties at low doses.
Morphine (MS Contin, Miss Emma, Morph) – The prototypical opioid analgesic used for severe pain but prone to tolerance and dependence.
N,N‑Dimethylpentylone () – A novel synthetic cathinone stimulant (“bath salt”) producing amphetamine‑like sympathomimetic effects.
Naltrexone (Vivitrol, Revia) – An opioid antagonist medication that blocks opioid and alcohol effects to support abstinence.
N‑Desethylmetonitazene () – A potent nitazene opioid analog/metabolite contributing to toxicity in illicit opioid samples.
Norbuprenorphine () – The primary active metabolite of buprenorphine with limited brain penetration and peripheral opioid effects.
Norcarfentanil () – The main metabolite of carfentanil detected in biological samples after exposure to the parent ultra‑potent opioid.
Nordiazepam (Desmethyldiazepam, Nordazepam) – A long‑acting active metabolite of diazepam that extends benzodiazepine sedation.
Norfentanyl () – The principal inactive metabolite of fentanyl used as a biomarker of fentanyl intake.
Norketamine () – The chief metabolite of ketamine that retains mild anesthetic and analgesic activity.
Normeperidine (Norpethidine) – A neurotoxic excitatory metabolite of meperidine that can cause seizures at high concentrations.
Noroxycodone () – An inactive N‑demethylated metabolite of oxycodone excreted renally.
N‑Pyrrolidinoetonitazene () – An emerging nitazene synthetic opioid of extreme potency implicated in fatal overdoses.
O‑Desmethyltramadol (O‑DSMT) – The active metabolite of tramadol with significantly stronger μ‑opioid receptor activity than the parent drug.
ortho‑Methylfentanyl (2‑Methylfentanyl) – A highly potent clandestine fentanyl analog historically known as “China White.”
Oxazepam (Serax) – A short‑to‑intermediate‑acting benzodiazepine used for anxiety and alcohol withdrawal.
Oxycodone (OxyContin, Percocet, Hillbilly Heroin) – A semi‑synthetic opioid widely prescribed for pain and heavily misused for its heroin‑like high.
Oxymorphone (Opana) – A very potent opioid analgesic withdrawn from U.S. markets due to abuse of its extended‑release form.
para‑Fluorofentanyl (p‑Fluorofentanyl) – A fentanyl analog with comparable potency that has surfaced in illicit opioid mixtures.
para‑Fluoronorfentanyl () – The N‑dealkylated metabolite of para‑fluorofentanyl used to confirm exposure to the parent analog.
Pentylone (bk‑MBDP, βk‑Methylbenzodioxylpentanamine) – A synthetic cathinone stimulant producing euphoria and increased energy similar to methylone.
Phencyclidine (PCP, Angel Dust, Sherm) – A dissociative anesthetic that causes hallucinations, detachment, and unpredictable behavior.
Pregabalin (Lyrica) – An anticonvulsant and anxiolytic agent used for neuropathic pain and prone to misuse for its calming effects.
Psilocin (4‑HO‑DMT) – The psychoactive metabolite of psilocybin responsible for the hallucinations of “magic mushrooms.”
Psilocybin (4‑PO‑DMT) – A natural prodrug compound in psychedelic mushrooms that converts to psilocin to produce its effects.
R‑HHC‑COOH () – The carboxylic acid metabolite of the R‑enantiomer of hexahydrocannabinol excreted after HHC use.
S‑HHC‑COOH () – The carboxylic acid metabolite of the S‑enantiomer of hexahydrocannabinol excreted after HHC use.
Speciociliatine () – A minor kratom alkaloid with weak opioid activity contributing modestly to the plant’s overall effects.
Temazepam (Restoril, Jellies) – A benzodiazepine hypnotic taken for short‑term insomnia and sometimes abused for its strong sedation.
Xylazine (Tranq, Rompun) – A veterinary sedative increasingly found as a street‑drug adulterant causing deep sedation and severe skin ulcers.
Zolpidem (Ambien, Stilnox) – A short‑acting non‑benzodiazepine hypnotic prescribed for insomnia that can cause complex sleep behaviors.
Zopiclone (Imovane) – A non‑benzodiazepine sleep aid used to treat insomnia, known for metallic after‑taste and next‑day drowsiness.

### Example #1 
Post: I got her hooked on that lean everyday she say she want codeine
[{"index_term": "Codeine", "matched_phrase": "lean"}, {"index_term": "Codeine", "matched_phrase": "codeine"}]

### Example #2 
Post: Pulled an all‑nighter on addy again.
[{"index_term": "Amphetamine", "matched_phrase": "addy"}]

### Example #3 (Alprazolam not on the list)
Post: Popped two xans before class.
[{"index_term": "other", "matched_phrase": "xans"}]

### Example #4 (Cocaine not on the list)
Post: Picked up fresh snow to ride those white lines tonight ❄️
[{"index_term": "other", "matched_phrase": "snow"}, {"index_term": "other", "matched_phrase": "white lines"}]

Post: {{tweet_text}}
"""
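
The prompt above asks for a JSON array of `{"index_term", "matched_phrase"}` objects, but models often wrap that array in extra prose or code fences. The following is a minimal sketch of how such a reply could be parsed and checked against the index terms. The function name `parse_entity_links` and its behavior are assumptions made for illustration; the actual parsing in this notebook is done by `extract_drug_labels` from `utils`, whose implementation is not shown here.

```python
import json
import re


def parse_entity_links(response_text, index_terms):
    """Return a list of (index_term, matched_phrase) pairs from a raw model reply.

    Hypothetical helper for illustration; not the notebook's own parser.
    """
    # Models sometimes wrap the array in prose or code fences, so grab the
    # outermost [...] span before attempting to decode it.
    match = re.search(r"\[.*\]", response_text, flags=re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    # Case-insensitive lookup back to the canonical spelling of each index term.
    allowed = {term.lower(): term for term in index_terms}
    links = []
    for item in items:
        if not isinstance(item, dict):
            continue
        term = str(item.get("index_term", "")).strip()
        phrase = str(item.get("matched_phrase", "")).strip()
        if not phrase:
            continue
        # Anything outside the supplied list collapses to "other", as the prompt asks.
        links.append((allowed.get(term.lower(), "other"), phrase))
    return links


reply = (
    'Sure!\n```json\n'
    '[{"index_term": "codeine", "matched_phrase": "lean"}, '
    '{"index_term": "Cocaine", "matched_phrase": "snow"}]\n```'
)
links = parse_entity_links(reply, ["Codeine", "Amphetamine"])
print(links)
```

Returning an empty list on malformed output keeps one bad reply from stopping a 1,000-post run, at the cost of hiding parse failures; logging those cases separately would make them easier to audit.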


In [None]:
tweets = get_tweets_dataset()
SEED = 777
tweets = ( tweets.sample(n=1_000, random_state=SEED, replace=False)
                 .sort_index()
                 .reset_index(drop=True)
)
tweets.info(verbose=True)

# RUN 4 MODELS: entity_linking_prompt

In [None]:
# o4-mini (OpenAI reasoning model)
responses = run_prompt_on_llm(get_openai_o4mini_response, "o4mini_entity_linking", entity_linking_prompt, tweets)

# GPT-4o-mini
responses = run_prompt_on_llm(get_openai_gpt4omini_response, "gpt4omini_entity_linking", entity_linking_prompt, tweets)

# Meta-Llama-3.1-8B-Instruct
responses = run_prompt_on_llm(get_llama_response, "llama_entity_linking", entity_linking_prompt, tweets)
cleanup_llama()

# Qwen-4B
responses = run_prompt_on_llm(get_qwen_4b_response, "qwen_4b_entity_linking", entity_linking_prompt, tweets)
cleanup_qwen_4b()
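
Each call above fills the prompt with one post, queries a model, and saves the reply to its own text file. The sketch below illustrates that pattern under stated assumptions: `run_prompt_sketch`, the `<run_name>/<row>.txt` file layout, and the use of `str.replace` on the `{{tweet_text}}` placeholder are all hypothetical, and `run_prompt_on_llm` in `llm_manager` may work differently.

```python
import tempfile
from pathlib import Path


def run_prompt_sketch(get_response, run_name, prompt_template, texts, out_dir):
    """Fill the template for each post, call the model, and save one file per call."""
    run_dir = Path(out_dir) / run_name
    run_dir.mkdir(parents=True, exist_ok=True)
    responses = []
    for i, text in enumerate(texts):
        out_file = run_dir / f"{i}.txt"
        if out_file.exists():
            # Reuse a saved reply so an interrupted run can resume
            # without repeating calls that already succeeded.
            response = out_file.read_text(encoding="utf-8")
        else:
            prompt = prompt_template.replace("{{tweet_text}}", text)
            response = get_response(prompt)
            out_file.write_text(response, encoding="utf-8")
        responses.append(response)
    return responses


def fake_model(prompt):
    # Stand-in for a real LLM call; always reports no matches.
    return "[]"


with tempfile.TemporaryDirectory() as tmp:
    out = run_prompt_sketch(
        fake_model,
        "demo_entity_linking",
        "Post: {{tweet_text}}",
        ["on that lean", "hello"],
        tmp,
    )
    saved = sorted(p.name for p in (Path(tmp) / "demo_entity_linking").iterdir())
    print(out, saved)
```

Writing one file per call means a crash or rate limit partway through a paid run loses at most one reply, and the later cells can rebuild the DataFrame columns from disk without calling any model again.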

In [None]:
# Collect each model's saved responses and parse the linked drug labels.
# Maps each run name (used for the saved files) to its column prefix in `tweets`.
runs = {
    "gpt4omini_entity_linking": "4o_mini",
    "o4mini_entity_linking": "o4mini",
    "qwen_4b_entity_linking": "qwen_4b",
    "llama_entity_linking": "llama",
}
for run_name, prefix in runs.items():
    tweets[f"{prefix}_response_entity_linking"] = extract_responses(tweets, run_name)
    tweets[f"{prefix}_drug_labels"] = extract_drug_labels(tweets, run_name)

tweets.info(verbose=True)


In [None]:
# Print value counts for every label column in tweets
label_cols = [col for col in tweets.columns if "label" in col]
for col in label_cols:
    print(f"\nValue counts for '{col}':")
    print(tweets[col].value_counts(dropna=False))

In [None]:
# Find rows where the label columns do not all agree (all True or all False)
label_cols = [col for col in tweets.columns if "label" in col and col != "label"]

def not_all_agree(row):
    # Normalize each label to a lowercase string before comparing
    vals = [str(row[col]).strip().lower() for col in label_cols]
    # Rows agree only when every value is 'true' or every value is 'false';
    # anything else (mixed or non-boolean values) counts as a disagreement
    return not (all(v == "true" for v in vals) or all(v == "false" for v in vals))

disagreeing_tweets = tweets[tweets.apply(not_all_agree, axis=1)].copy()
print(f"Number of rows where label columns do not all agree: {len(disagreeing_tweets)}")
disagreeing_tweets[label_cols + ["text"]].head()
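
The check above compares the string form of each label against 'true' and 'false'. If the `*_drug_labels` columns instead hold lists of index terms (an assumption; the return type of `extract_drug_labels` is not shown in this notebook), agreement could be tested on the sets of terms. The sketch below uses plain Python rows and a hypothetical `labels_agree` helper to show the idea.

```python
def labels_agree(label_lists):
    """True when every model produced the same set of index terms."""
    # Ignore case, surrounding whitespace, order, and duplicates.
    normalized = [
        frozenset(term.strip().lower() for term in labels)
        for labels in label_lists
    ]
    return len(set(normalized)) <= 1


# Toy rows: one list of index terms per model (illustrative data only).
rows = [
    {"o4mini": ["Codeine"], "llama": ["codeine", "Codeine"]},
    {"o4mini": ["Fentanyl"], "llama": ["other"]},
    {"o4mini": [], "llama": []},
]
disagreeing = [i for i, row in enumerate(rows) if not labels_agree(row.values())]
print(disagreeing)  # [1]
```

Comparing sets treats two models as agreeing when they link a post to the same substances, even if they matched different phrases or repeated a term.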
