This notebook is used to generate named entity recognition (NER) labels from the Micromed dataset. Tweet content is appended via the Twitter API, and is processed to approximately match MedRed's `NER_labels_from_AMT.csv`.

API access is obtained via https://developer.twitter.com. This implementation requires a bearer token.

Documentation for client configuration, which is straightforward: https://docs.tweepy.org/en/latest/client.html#

Doc for tweet lookup by IDs (max 900 IDs/15 minutes as of 05/05/2022):
- https://docs.tweepy.org/en/latest/client.html#tweepy.Client.get_tweets
- https://developer.twitter.com/en/docs/twitter-api/tweets/lookup/api-reference/get-tweets

---

Environment config.

In [232]:
import numpy as np
import pandas as pd
import tweepy # 4.9.0
import json
from itertools import chain

In [233]:
# # included for convenience to help find correct paths
# import os
# os.getcwd()
# os.listdir()

Constants.

In [238]:
# paths
MEDRED_REPRODUCIBLE_DIR = "../../MedRed_Reproducible/"
MICROMED_IN = MEDRED_REPRODUCIBLE_DIR + "data/Micromed/medinfo2015.linejson"
NER_OUT = MEDRED_REPRODUCIBLE_DIR + "data/Micromed/NER_labels_from_Micromed.csv"

# first execution
#   Twitter API bearer token string (required on first execution; obtain at https://developer.twitter.com)
BEARER_TOKEN = None
#   Where to store post text
POSTS_OUT = MEDRED_REPRODUCIBLE_DIR + "data/Micromed/posts.csv"

# second+ execution
POSTS_IN = POSTS_OUT

# subset of tweet IDs to test with
# ids = [466247498093187073, 469917371466272768, 471178284143632384, 471046746551119873, 466146469901504513, 470385120277725184, 466638818276556800, 469789512336285696, 466674226569957376,466943899353628672]

Load Micromed. The data is at entity level: each record contains a Twitter post ID, the type of term, the term's location in the post (post text not included), and an attribute field that is unused here.

In [243]:
# load non-post data
encoded = []
with open(MICROMED_IN, 'r') as fl:
    for line in fl:
        encoded.append(json.loads(line))
# adjust JSON to DF
df = pd.json_normalize(encoded, record_path="entities", meta=["tid"])
# clean up formatting for merge with text API output
df.tid = pd.to_numeric(df.tid)
df["id"] = df["tid"]   # instead of renaming, retain tid column for convenient access
df = df.set_index("id")

df

id,attributes,type,locations,tid
469917371466272768,[{'HeadPOS': 'A'}],Symptom,"[{'start': 36, 'end': 42}]",469917371466272768
469917371466272768,[{'HeadPOS': 'A'}],Symptom,"[{'start': 46, 'end': 51}]",469917371466272768
466638818276556800,[],Disease,"[{'start': 116, 'end': 125}]",466638818276556800
469789512336285696,[{'HeadPOS': 'V'}],Symptom,"[{'start': 57, 'end': 65}]",469789512336285696
471137960096567297,[],Pharmacological_Substance,"[{'start': 44, 'end': 52}]",471137960096567297
...,...,...,...,...
466674226569957376,"[{'Figurative': None}, {'NonMedical': None}]",Symptom,"[{'start': 44, 'end': 48}]",466674226569957376
466674226569957376,"[{'Figurative': None}, {'NonMedical': None}]",Symptom,"[{'start': 89, 'end': 93}]",466674226569957376
466674226569957376,"[{'Figurative': None}, {'NonMedical': None}]",Symptom,"[{'start': 64, 'end': 68}]",466674226569957376
466943899353628672,"[{'HeadPOS': 'A'}, {'Figurative': None}, {'Non...",Symptom,"[{'start': 41, 'end': 45}]",466943899353628672
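
As a rough illustration of the flattening step above, the sketch below runs `pd.json_normalize` on two invented records. The record shape (a string `tid` plus an `entities` list) is an assumption inferred from the loading code and its output, and the IDs and offsets are made up.

```python
import json
import pandas as pd

# Hypothetical line-delimited JSON shaped like the Micromed entries above.
# The tweet IDs and offsets are invented for illustration only.
lines = [
    '{"tid": "1001", "entities": [{"type": "Symptom", "attributes": [], '
    '"locations": [{"start": 4, "end": 10}]}]}',
    '{"tid": "1002", "entities": ['
    '{"type": "Disease", "attributes": [], "locations": [{"start": 0, "end": 8}]}, '
    '{"type": "Symptom", "attributes": [], "locations": [{"start": 12, "end": 16}]}]}',
]
records = [json.loads(line) for line in lines]

# One output row per entity; the parent tweet ID is repeated via `meta`.
toy = pd.json_normalize(records, record_path="entities", meta=["tid"])
toy["tid"] = pd.to_numeric(toy["tid"])
print(toy[["tid", "type", "locations"]])
```

A tweet with two tagged entities becomes two rows sharing one `tid`, which is why the later steps group by post ID and type.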


Load tweet text. Run **ONE** of the next two cells.

The first cell pulls tweets using the Twitter API and requires a bearer token for authentication. Use it on the first run, as tweet text is not distributed here. This respects the Micromed authors' decision not to include tweet text in their dataset, preserves users' right to be forgotten, and avoids potential conflicts with Twitter's terms. The bearer token can be obtained via https://developer.twitter.com.

The second cell loads tweet text from the file saved by a previous run of the first cell.

In [239]:
# Pull tweets

# client setup
if BEARER_TOKEN is None:
    raise TypeError("Missing BEARER_TOKEN constant. Check and re-run Constants cell.")
clt = tweepy.Client(bearer_token=BEARER_TOKEN, wait_on_rate_limit=True)
# tweet IDs; comment out to use the subset defined earlier among constants
ids = list(df.tid.unique())
# query tweets in batches manageable by the API (up to 100 IDs per request at time of writing)
BATCH_SIZE = 100
tweets_list = []
for start in range(0, len(ids), BATCH_SIZE):  # stepping by batch size keeps the final partial batch
    # IDs for this batch
    id_batch = ids[start:start + BATCH_SIZE]
    # query tweets
    tweets = clt.get_tweets(ids=id_batch)
    # extract id and text; data is None when no tweet in the batch is available
    tweets_list.extend((tweet.id, tweet.text) for tweet in (tweets.data or []))

# to DF
dfText = pd.DataFrame(tweets_list, columns=["id", "post"])
dfText = dfText.set_index("id")

dfText.to_csv(POSTS_OUT)

dfText

id,post
469789512336285696,"[Nods to Erisa, a weak smile on her lips.] Tha..."
466511827334340608,No lozenge or paracetamol can soothe my sore t...
466815981479022592,@rachellvukovich I know but for old times sake...
466451475498291200,Been telling my friends this for yrs. First 3 ...
471128027955339264,The medicine i drank was effective! No more he...
...,...
466399461934391297,Lester is going to die of a staph infection be...
470242539077783552,Enjoying my FIX (stress reliever lowering my c...
469780955931369472,@LewisHamilton pretty sure he said the hunger ...
471561236643999744,"im so confused i typed ""buy"" but i meant ""poin..."
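
When splitting IDs into requests of up to 100, the last batch is usually smaller than the rest and is easy to drop by accident. The sketch below shows one way to batch so that it is kept. The helper name `batches` and the toy IDs are hypothetical and not part of the notebook.

```python
def batches(items, size=100):
    """Yield consecutive slices of `items`, including a final partial slice."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 250 made-up IDs should produce two full batches and one partial batch.
toy_ids = list(range(250))
sizes = [len(batch) for batch in batches(toy_ids)]
print(sizes)  # [100, 100, 50]
```

Stepping the range by the batch size avoids the rounding involved in computing a batch count up front.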


In [240]:
# Load tweets from file
dfText = pd.read_csv(POSTS_IN, usecols=["id", "post"], index_col="id")
dfText

id,post
469789512336285696,"[Nods to Erisa, a weak smile on her lips.] Tha..."
466511827334340608,No lozenge or paracetamol can soothe my sore t...
466815981479022592,@rachellvukovich I know but for old times sake...
466451475498291200,Been telling my friends this for yrs. First 3 ...
471128027955339264,The medicine i drank was effective! No more he...
...,...
466399461934391297,Lester is going to die of a staph infection be...
470242539077783552,Enjoying my FIX (stress reliever lowering my c...
469780955931369472,@LewisHamilton pretty sure he said the hunger ...
471561236643999744,"im so confused i typed ""buy"" but i meant ""poin..."


Merge tweet text with Micromed data.

In [244]:
df = df.merge(dfText, on="id", how="right") # only want tweets with available posts; can't text mine the rest anyway
df  # expected cols: id, attributes, type, locations, tid, post

id,attributes,type,locations,tid,post
469789512336285696,[{'HeadPOS': 'V'}],Symptom,"[{'start': 57, 'end': 65}]",469789512336285696,"[Nods to Erisa, a weak smile on her lips.] Tha..."
466511827334340608,[],Pharmacological_Substance,"[{'start': 3, 'end': 10}]",466511827334340608,No lozenge or paracetamol can soothe my sore t...
466511827334340608,[],Pharmacological_Substance,"[{'start': 14, 'end': 25}]",466511827334340608,No lozenge or paracetamol can soothe my sore t...
466511827334340608,[],Symptom,"[{'start': 40, 'end': 51}]",466511827334340608,No lozenge or paracetamol can soothe my sore t...
466815981479022592,[{'NonMedical': None}],Symptom,"[{'start': 60, 'end': 64}]",466815981479022592,@rachellvukovich I know but for old times sake...
...,...,...,...,...,...
470242539077783552,[],Symptom,"[{'start': 17, 'end': 23}]",470242539077783552,Enjoying my FIX (stress reliever lowering my c...
469780955931369472,[{'ToDiscuss': None}],Symptom,"[{'start': 84, 'end': 90}]",469780955931369472,@LewisHamilton pretty sure he said the hunger ...
471561236643999744,[{'HeadPOS': 'A'}],Symptom,"[{'start': 93, 'end': 98}]",471561236643999744,"im so confused i typed ""buy"" but i meant ""poin..."
466602046817566720,[{'HeadPOS': 'A'}],Symptom,"[{'start': 4, 'end': 10}]",466602046817566720,I'm hungry and tired at the same time ..
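
To make the effect of `how="right"` concrete, here is a small sketch with invented frames. The IDs and post strings are placeholders. Entity rows whose tweet text is missing are dropped, and a post without entity rows would appear with missing entity fields.

```python
import pandas as pd

# Toy entity-level frame: post 1 has two entities, post 2 has one.
entities = pd.DataFrame(
    {"type": ["Symptom", "Disease", "Symptom"]},
    index=pd.Index([1, 1, 2], name="id"),
)
# Toy text frame: text is available for posts 1 and 3 only.
posts = pd.DataFrame(
    {"post": ["text of post 1", "text of post 3"]},
    index=pd.Index([1, 3], name="id"),
)

# Keep every post that has text; entity rows without text are discarded.
merged = entities.merge(posts, on="id", how="right")
print(merged)
```

Post 2 disappears because its text is unavailable, while post 3 is kept with a missing `type`.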


Extract Micromed tagged terms from text.

Watch the printed output: each `1` marks a location label that falls outside the retrieved post text, i.e. a bad record. A handful of these is acceptable.

In [245]:
# using post and term indices, extract the actual terms
def extract_terms(row):
    key_words = []
    # no text to slice if the post is missing
    if pd.isna(row.post):
        return key_words
    # in case one row contains multiple locations (of terms), loop across and aggregate
    for idx in row.locations:
        # ignore bad location labels
        if idx["end"] <= len(row.post):
            key_words.append(row.post[idx["start"]:idx["end"]])
        else:
            print(1) # note when encounter bad label; ok if only few
    return key_words

df["terms"] = df.apply(extract_terms, axis=1)

df

1
1
1
1


id,attributes,type,locations,tid,post,terms
469789512336285696,[{'HeadPOS': 'V'}],Symptom,"[{'start': 57, 'end': 65}]",469789512336285696,"[Nods to Erisa, a weak smile on her lips.] Tha...",[fatigued]
466511827334340608,[],Pharmacological_Substance,"[{'start': 3, 'end': 10}]",466511827334340608,No lozenge or paracetamol can soothe my sore t...,[lozenge]
466511827334340608,[],Pharmacological_Substance,"[{'start': 14, 'end': 25}]",466511827334340608,No lozenge or paracetamol can soothe my sore t...,[paracetamol]
466511827334340608,[],Symptom,"[{'start': 40, 'end': 51}]",466511827334340608,No lozenge or paracetamol can soothe my sore t...,[sore throat]
466815981479022592,[{'NonMedical': None}],Symptom,"[{'start': 60, 'end': 64}]",466815981479022592,@rachellvukovich I know but for old times sake...,[pain]
...,...,...,...,...,...,...
470242539077783552,[],Symptom,"[{'start': 17, 'end': 23}]",470242539077783552,Enjoying my FIX (stress reliever lowering my c...,[stress]
469780955931369472,[{'ToDiscuss': None}],Symptom,"[{'start': 84, 'end': 90}]",469780955931369472,@LewisHamilton pretty sure he said the hunger ...,[hungry]
471561236643999744,[{'HeadPOS': 'A'}],Symptom,"[{'start': 93, 'end': 98}]",471561236643999744,"im so confused i typed ""buy"" but i meant ""poin...",[]
466602046817566720,[{'HeadPOS': 'A'}],Symptom,"[{'start': 4, 'end': 10}]",466602046817566720,I'm hungry and tired at the same time ..,[hungry]
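
The offset-based extraction can be tried on a single post. This sketch uses the last post shown in the table above together with one valid and one out-of-range location. The helper `slice_terms` is hypothetical, and it applies a slightly stricter bounds check than the notebook's function, which is an assumption rather than the author's method.

```python
def slice_terms(post, locations):
    """Return substrings of `post` named by start/end offsets, counting out-of-range ones."""
    terms, skipped = [], 0
    for loc in locations:
        # A usable label must lie fully inside the post text.
        if 0 <= loc["start"] < loc["end"] <= len(post):
            terms.append(post[loc["start"]:loc["end"]])
        else:
            skipped += 1
    return terms, skipped

post = "I'm hungry and tired at the same time .."
terms, skipped = slice_terms(
    post, [{"start": 4, "end": 10}, {"start": 93, "end": 98}]
)
print(terms, skipped)  # ['hungry'] 1
```

Returning a count instead of printing makes it easier to check how many labels were rejected.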


Reformat the data.

In [196]:
# collapse lists of values within a post by type into a single cell
#   e.g. below needs to be made into one row, keyed on id and type
# df[(df.index==466073644209156096) & (df.type == "Disease")]

#   common procedure to group and format, abstracting away how to handle values
def agg_col(df, col, aggfunc):
    '''Calls and formats an aggregating function, grouping by post id and term type.'''
    dfAgg = df.groupby(["tid", "type"])[col].aggregate(aggfunc)
    dfAgg = dfAgg.reset_index()
    dfAgg = dfAgg.rename(columns={"tid":"id", "locations":"locations_list", "terms":"terms_list"})
    dfAgg = dfAgg.set_index(["id", "type"])
    return dfAgg

#   aggregation function specific to terms, used with grouping above
def join_terms(col):
    '''Collapses series of lists of terms into single string, with terms separated by semicolons.'''
    x = ';'.join(list(chain(*col)))
    return x

#   aggregation function specific to term locations, used with grouping above
#   some locations are inaccurate, but since they're used to define the terms, may as well use them
def join_locs(col):
    return list(chain(col.values))

#   ...do it
dfTerms = agg_col(df, "terms", join_terms)
dfLoc = agg_col(df, "locations", join_locs)

dfTerms

# test:
# dfTerms[dfTerms.index == (466073644209156096, "Disease")].terms_list.values[0] == 'small-vessel disease;Cerebral small-vessel disease;Alzhe;Alzheimer'

id,type,terms_list
466073644209156096,Disease,small-vessel disease;Cerebral small-vessel dis...
466091939730030592,Symptom,blind;pain
466095676883873792,Disease,besity.;bese ;verweight
466112043049701376,Disease,migraine
466130938393403392,Disease,scurvy;malnourished
...,...,...
471709203300495360,Symptom,stress
471723585577308160,Symptom,tired
471725712093626368,Symptom,tired
471740031447891968,Pharmacological_Substance,painkillers
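
The grouping step can be checked on a toy frame. The sketch below collapses per-entity term lists into one semicolon-separated string per post and type. The IDs are invented and the terms are borrowed from the output above purely as sample values.

```python
from itertools import chain
import pandas as pd

# Toy entity-level rows: post 1 has two Symptom terms and one Disease term.
toy = pd.DataFrame({
    "tid": [1, 1, 1, 2],
    "type": ["Symptom", "Symptom", "Disease", "Symptom"],
    "terms": [["blind"], ["pain"], ["migraine"], ["tired"]],
})

# Flatten the lists within each (tid, type) group, then join with semicolons.
joined = toy.groupby(["tid", "type"])["terms"].agg(
    lambda col: ";".join(chain.from_iterable(col))
)
print(joined)
```

Each (post, type) pair ends up as a single row, which is the shape the later pivot relies on.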


In [215]:
# join terms and locations
#   merging onto posts for readability; post text dropped and re-added later
dfAgg = dfText.join(dfTerms)
dfAgg = dfAgg.join(dfLoc)

dfAgg

id,type,post,terms_list,locations_list
466073644209156096,Disease,"Association between small-vessel disease, Alzh...",small-vessel disease;Cerebral small-vessel dis...,"[[{'start': 20, 'end': 40}], [{'start': 71, 'e..."
466091939730030592,Symptom,I don't wanna drown in the rain. I don't wanna...,blind;pain,"[[{'start': 50, 'end': 55}], [{'start': 121, '..."
466095676883873792,Disease,RT @DhesiBahaRaja: ⚠️Important Message;\nMalay...,besity.;bese ;verweight,"[[{'start': 75, 'end': 82}], [{'start': 88, 'e..."
466112043049701376,Disease,@rachjohnson0 @DailyDose248 @LoraRoule previou...,migraine,"[[{'start': 127, 'end': 135}]]"
466130938393403392,Disease,@el_diabl0_cake and you'd probably be malnouri...,scurvy;malnourished,"[[{'start': 60, 'end': 66}], [{'start': 38, 'e..."
...,...,...,...,...
471709203300495360,Symptom,"Edible greens such as kale, spinach, bok choy ...",stress,"[[{'start': 77, 'end': 83}]]"
471723585577308160,Symptom,Mood: cold and tired,tired,"[[{'start': 15, 'end': 20}]]"
471725712093626368,Symptom,"Day 2 is winding down. We are hot, tired, dirt...",tired,"[[{'start': 35, 'end': 40}]]"
471740031447891968,Pharmacological_Substance,I wonder when is it a good time to take painki...,painkillers,"[[{'start': 40, 'end': 51}]]"


In [224]:
# pivot to match MedRed_AMT_labels.csv formatting
dfOut = dfAgg.reset_index()
dfOut = pd.pivot_table(dfOut, index="id", columns="type", values=["terms_list", "locations_list"], aggfunc=lambda x: x)

# add full post text
dfOut = dfOut.join(dfText)

# set column names
dfOut.columns = list(map("_".join, dfOut.columns)) # collapse multi-level col index into rename-able columns
name_map = {
    "terms_list_Pharmacological_Substance":"Answer.drugs",
    "terms_list_Symptom":"Answer.symptoms",
    "terms_list_Disease":"Answer.diseases",
    "locations_list_Pharmacological_Substance":"Locations.drugs",
    "locations_list_Symptom":"Locations.symptoms",
    "locations_list_Disease":"Locations.diseases",
    "p_o_s_t":"post" # "post" gets messed up in collapse above
}
dfOut = dfOut.rename(columns=name_map)

dfOut

# row with pharmacological substance and symptom as sanity check
# dfOut[dfOut.index == 471740031447891968]



id,Locations.diseases,Locations.drugs,Locations.symptoms,Answer.diseases,Answer.drugs,Answer.symptoms,post
466073644209156096,"[[{'start': 20, 'end': 40}], [{'start': 71, 'e...",,,small-vessel disease;Cerebral small-vessel dis...,,,"Association between small-vessel disease, Alzh..."
466091939730030592,,,"[[{'start': 50, 'end': 55}], [{'start': 121, '...",,,blind;pain,I don't wanna drown in the rain. I don't wanna...
466095676883873792,"[[{'start': 75, 'end': 82}], [{'start': 88, 'e...",,,besity.;bese ;verweight,,,RT @DhesiBahaRaja: ⚠️Important Message;\nMalay...
466112043049701376,"[[{'start': 127, 'end': 135}]]",,,migraine,,,@rachjohnson0 @DailyDose248 @LoraRoule previou...
466130938393403392,"[[{'start': 60, 'end': 66}], [{'start': 38, 'e...",,,scurvy;malnourished,,,@el_diabl0_cake and you'd probably be malnouri...
...,...,...,...,...,...,...,...
471682384954675201,,,"[[{'start': 51, 'end': 57}], [{'start': 35, 'e...",,,stress;pain,"Sleeping is a cure to forget about pain, probl..."
471709203300495360,,,"[[{'start': 77, 'end': 83}]]",,,stress,"Edible greens such as kale, spinach, bok choy ..."
471723585577308160,,,"[[{'start': 15, 'end': 20}]]",,,tired,Mood: cold and tired
471725712093626368,,,"[[{'start': 35, 'end': 40}]]",,,tired,"Day 2 is winding down. We are hot, tired, dirt..."
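
The long-to-wide reshaping can also be sketched with `unstack`, shown here as an alternative illustration on invented data rather than the notebook's `pivot_table` call. It also shows why a plain string column name such as `post` turns into `p_o_s_t` when `"_".join` is mapped over the columns: joining a string iterates over its characters.

```python
import pandas as pd

# Toy long-format frame: one row per (id, type) pair.
long = pd.DataFrame({
    "id": [1, 1, 2],
    "type": ["Symptom", "Pharmacological_Substance", "Symptom"],
    "terms_list": ["pain", "painkillers", "tired"],
})

# Move `type` from the row index into the columns, one column per type.
wide = long.set_index(["id", "type"]).unstack("type")
# Collapse the two-level column index into single names.
wide.columns = ["_".join(col) for col in wide.columns]
print(wide)

# Joining a plain string (not a tuple) inserts the separator between characters.
garbled = "_".join("post")
print(garbled)  # p_o_s_t
```

Posts lacking a given type get a missing value in that column, matching the empty cells in the output above.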


Save for further preprocessing.

In [227]:
dfOut.to_csv(NER_OUT)