The goal of this study was to cross-reference a list of health effects from the Unified Medical Language System® (UMLS) Consumer Health Vocabulary dictionary with a corpus of cannabis-related tweets, in order to identify and describe the commonly perceived health effects of cannabis use.

In [1]:
import pandas as pd
import numpy as np
import tqdm

Reading list of terms for various categories

In [2]:
terms_df = pd.read_excel("Filtered_List_AM.xlsx")

In [3]:
categories = list(terms_df.columns[4:-1])

In [4]:
categories

['Cancer',
 'Cardiovascular',
 'Respiratory',
 'Cognitive',
 'Gastrointestinal',
 'Neurological',
 'Weight',
 'PregnancyInUtero',
 'MentalHealth',
 'Stress',
 'Pain',
 'Injury',
 'Immune System',
 'Dermatological',
 'Death',
 'Poisoning',
 'Other']

In [5]:
categories_dict = {k: [] for k in categories}

In [6]:
terms_df.head()

Unnamed: 0,MedicalTerms,medical_terms_lower,ColloquialsTerms,colloquial_terms_lower,Cancer,Cardiovascular,Respiratory,Cognitive,Gastrointestinal,Neurological,...,MentalHealth,Stress,Pain,Injury,Immune System,Dermatological,Death,Poisoning,Other,SumCheck
0,,,twitch,twitch,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1.0
1,abortion,abortion,,,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1.0
2,abreaction,abreaction,,,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1.0
3,abscence oxygen,abscence oxygen,suffocate,suffocate,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1.0
4,abscess,abscess,pimple,pimple,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1.0


In [7]:
# Collect every medical and colloquial term under each category it is flagged for
for i, row in terms_df.iterrows():
    terms_to_search = []
    # Missing cells are read as NaN (a float), so the isinstance check skips them
    if isinstance(row['medical_terms_lower'], str):
        terms_to_search.append(row['medical_terms_lower'])
    if isinstance(row['colloquial_terms_lower'], str):
        terms_to_search.append(row['colloquial_terms_lower'])

    for t in terms_to_search:
        for c in categories_dict:
            if row[c] == 1:
                categories_dict[c].append(t)
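
The same per-category lists can also be built with boolean masks instead of a row-by-row loop. The following is only a sketch on a toy table: `toy_terms`, `toy_categories`, and `build_category_terms` are hypothetical names, and the toy rows are assumed to mirror the column layout of the spreadsheet shown above.

```python
import pandas as pd

# Toy stand-in for terms_df (assumption: same column layout as the real sheet)
toy_terms = pd.DataFrame({
    "medical_terms_lower": [None, "abreaction", "abscess"],
    "colloquial_terms_lower": ["twitch", None, "pimple"],
    "Neurological": [1, 1, 0],
    "Dermatological": [0, 0, 1],
})
toy_categories = ["Neurological", "Dermatological"]


def build_category_terms(frame, category_names):
    """Return {category: [terms]} using one boolean mask per category."""
    term_cols = ["medical_terms_lower", "colloquial_terms_lower"]
    result = {}
    for c in category_names:
        # Rows flagged with 1 for this category, restricted to the two term columns
        flagged = frame.loc[frame[c] == 1, term_cols]
        # Flatten row by row and keep only real strings (drops None/NaN cells)
        result[c] = [t for t in flagged.to_numpy().ravel() if isinstance(t, str)]
    return result


toy_result = build_category_terms(toy_terms, toy_categories)
print(toy_result)
```

Flattening row by row keeps the medical term ahead of its colloquial counterpart, which is the same order the loop above produces.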
                

We now have all the terms in each category. For example, these are the terms in the "Stress" category:

In [12]:
categories_dict['Stress']

['hypercortisolemia',
 'cortisol',
 'hypercortisolic',
 'hypercortisolism',
 "cushing's syndrome",
 'hyperventilate',
 'stress',
 ' tension ',
 'perspiration',
 'sweating',
 'tremor',
 'shaking',
 'testiness',
 'irritable']

Tweets containing cannabis terms and dated from 1 January 2020 to 31 August 2020 were collected, giving 16703751 tweets. These were then filtered down to the tweets that also contain at least one medical or colloquial term, resulting in 609231 tweets.


In [None]:
CANNABIS_KEYWORDS = [
    'blunt', 'bong', 'budder', 'cannabis',
    'cbd', 'ganja', 'hash', 'hemp',
    'indica', 'kush', 'marijuana', 'marihuana',
    'reefer', 'sativa', 'thc', 'weed'
]

In [None]:
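# Skip malformed rows silently. on_bad_lines replaces error_bad_lines/warn_bad_lines,
# which were deprecated in pandas 1.3 and removed in pandas 2.0.
df = pd.read_csv("data.csv", on_bad_lines="skip")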

In [None]:
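# Drop incomplete rows, then keep only rows whose fields look well-formed
df = df.dropna()
df = df[df['id'].map(len) == 19]          # keep 19-character tweet IDs
df = df[df['createdAt'].map(len) == 19]   # keep 19-character timestamps
df = df[df['userId'].map(str.isnumeric)]  # keep all-digit user IDs
df = df[df['createdAt'].str.contains("-")]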

In [45]:
# All the tweets 
print("total number of tweets containing cannabis terms: " + str(len(df)))

total number of tweets containing cannabis terms: 16703751


In [31]:
def contains_word(s, w):
    return f' {w} ' in f' {s} '
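
Because `contains_word` pads both strings with spaces, it only matches terms delimited by spaces. A term followed by punctuation is missed, and a term stored with its own padding (such as `' tension '` in the Stress list) needs two spaces on each side to match. The sketch below illustrates this. `contains_word_regex` is a hypothetical alternative, not the matcher used in this study, and switching to it would change the tweet counts reported here.

```python
import re


def contains_word(s, w):
    # Matcher from the notebook: space-delimited whole-term match
    return f' {w} ' in f' {s} '


def contains_word_regex(s, w):
    # Hypothetical alternative: strip padding from the term and require that
    # it is not directly preceded or followed by a word character
    pattern = r'(?<!\w)' + re.escape(w.strip()) + r'(?!\w)'
    return re.search(pattern, s) is not None


print(contains_word("weed helps my stress", "stress"))          # True
print(contains_word("weed helps my stress.", "stress"))         # False
print(contains_word("so much tension today", " tension "))      # False
print(contains_word_regex("weed helps my stress.", "stress"))   # True
print(contains_word_regex("so much tension today", " tension "))  # True
print(contains_word_regex("stressed out", "stress"))            # False
```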

In [9]:
ids = []
for i, row in df.iterrows():
    tweet = row['text']
    for c in categories_dict.keys():
        for w in categories_dict[c]:
            if contains_word(tweet.lower(), w):
                ids.append(i)
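
Looping over roughly 16.7 million rows with `iterrows` and testing every term is slow. One possible speed-up, sketched here on toy data, is to compile all terms into a single regular expression and let pandas apply it column-wise. The pattern is written to reproduce the space-delimited behavior of `contains_word`. `toy_df`, `toy_categories_dict`, and `pattern` are hypothetical names, and this is not the code that produced the counts in this notebook.

```python
import re

import pandas as pd

# Toy stand-ins (assumptions) for categories_dict and the tweet DataFrame
toy_categories_dict = {"Stress": ["stress", "sweating"], "Pain": ["headache"]}
toy_df = pd.DataFrame({"text": [
    "CBD for stress relief",
    "Weed is legal here now",
    "Major HEADACHE after that blunt",
]})

# One alternation over every term; a term must be preceded by a space or the
# start of the text and followed by a space or the end of the text
all_terms = sorted({w for terms in toy_categories_dict.values() for w in terms})
pattern = (
    r'(?:\A| )(?:'
    + '|'.join(re.escape(w) for w in all_terms)
    + r')(?: |\Z)'
)

mask = toy_df['text'].str.lower().str.contains(pattern, regex=True)
filtered = toy_df[mask]
print(filtered.index.tolist())  # [0, 2]
```

The groups are non-capturing, so `str.contains` does not warn about match groups.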

In [13]:
ids = set(ids)

In [None]:
# Recent pandas versions reject a set as a .loc indexer, so pass a sorted list
df = df.loc[sorted(ids)]
df = df.dropna()

In [None]:
df.to_csv("terms_final.csv", index = False)

In [10]:
df = pd.read_csv('terms_final.csv')

In [44]:
print("Total tweets having medical and colloquial terms: " + str(len(df)))

Total tweets having medical and colloquial terms: 609227


To remove bot accounts, we used the Botometer V4 API. The filtered tweets came from 388274 unique users, and Botometer returned bot scores for 261134 of them. The remaining 127140 accounts had been deleted, so no scores could be fetched for them. We marked a user as a bot if the overall English bot score was 4 or higher (the check in the code is >= 4). Of the 261134 scored users, 15245 were marked as bots and 245889 as non-bots. Bots posted 53487 tweets and non-bot users posted 353353 tweets.
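
The thresholding rule can be sketched in isolation as follows. `toy_scores` is made-up data that only mimics the nested `display_scores` structure read from the pickle file later in this notebook, and `label_user` and `BOT_THRESHOLD` are hypothetical names.

```python
BOT_THRESHOLD = 4  # matches the >= 4 check applied to the overall English score


def label_user(score_entry, threshold=BOT_THRESHOLD):
    """Label one user from a Botometer-style score entry (None = not fetched)."""
    if score_entry is None:
        return "unavailable"
    overall = score_entry['display_scores']['english']['overall']
    return "bot" if overall >= threshold else "non-bot"


# Made-up scores for illustration only
toy_scores = {
    101: {'display_scores': {'english': {'overall': 4.6}}},
    102: {'display_scores': {'english': {'overall': 0.8}}},
    103: None,
    104: {'display_scores': {'english': {'overall': 4.0}}},
}
labels = {user: label_user(entry) for user, entry in toy_scores.items()}
print(labels)

# Consistency check on the user counts reported in the text
assert 15245 + 245889 == 261134    # bots + non-bots = scored users
assert 261134 + 127140 == 388274   # scored + deleted = unique users
```

A score of exactly 4 is labeled as a bot under this rule.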

In [13]:
ids = list(df['userId'])

In [78]:
print('total unique users:' + str(len(set(ids))))

total unique users:388274


In [15]:
import pickle
#Reading bot scores from pickled file
with open('combined_users.pickle', 'rb') as file:
    user_scores = pickle.load(file)

In [16]:
unique_users = list(set(df['userId']))
unique_users = list(map(int,unique_users))

In [17]:
# Split users by whether a bot score is available (None means no score was fetched)
existing_users = []
missed_users = []
for user in tqdm.tqdm(unique_users):
    if user_scores[user] is not None:
        existing_users.append(user)
    else:
        missed_users.append(user)

100%|██████████| 388274/388274 [00:00<00:00, 1373306.13it/s]


In [46]:
print("Number of existing users: " + str(len(existing_users)))
print("Number of deleted users: " + str(len(missed_users)))

Number of existing users: 261134
Number of deleted users: 127140


In [22]:
# Split tweets into bot and non-bot lists; tweets from unscored users are skipped
bot_df = []
real_df = []
ct = 0  # number of tweets whose author has no bot score
for row in tqdm.tqdm(df.itertuples()):
    user = row.userId
    if user_scores[user] is None:
        ct += 1
        continue
    elif user_scores[user]['display_scores']['english']['overall'] >= 4:
        bot_df.append(row)
    else:
        real_df.append(row)

609227it [00:02, 233384.93it/s]


In [23]:
# Count bot and non-bot accounts among the users that have a score
real_count, bot_count = 0, 0
ct = 0
for user in existing_users:
    if user_scores[user] is None:
        ct += 1
        continue
    elif user_scores[user]['display_scores']['english']['overall'] >= 4:
        bot_count += 1
    else:
        real_count += 1

In [25]:
print("Number of bot accounts: " + str(bot_count))
print("Number of real accounts: " + str(real_count))

Number of bot accounts: 15245
Number of real accounts: 245889


In [27]:
print("Number of bot tweets: " + str(len(bot_df)))
print("Number of real tweets: " + str(len(real_df)))

Number of bot tweets: 53487
Number of real tweets: 353353


In [29]:
real_df = pd.DataFrame(real_df)
bot_df = pd.DataFrame(bot_df)

Now we use the terms from each category to classify the tweets. A tweet is classified as belonging to a category if it contains at least one term from that category, so a tweet can belong to multiple categories. For example, "I am sweating and I am having a major headache" belongs to both the Stress and Pain categories, because it contains "sweating" (a Stress term) and "headache" (a Pain term).
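
The rule can be checked on the example sentence with a minimal sketch. `toy_categories_dict` is a small hypothetical subset of the real term lists (the `'tumor'` entry is invented for illustration), and `categories_for` is a hypothetical helper name.

```python
def contains_word(s, w):
    # Same space-delimited matcher as defined earlier in this notebook
    return f' {w} ' in f' {s} '


# Hypothetical subset of the category term lists, for illustration only
toy_categories_dict = {
    'Stress': ['stress', 'sweating', 'tremor'],
    'Pain': ['headache'],
    'Cancer': ['tumor'],  # invented entry, not taken from the real list
}


def categories_for(tweet, categories):
    """Return every category with at least one term present in the tweet."""
    text = tweet.lower()
    return sorted(
        c for c, words in categories.items()
        if any(contains_word(text, w) for w in words)
    )


example = "I am sweating and I am having a major headache"
print(categories_for(example, toy_categories_dict))  # ['Pain', 'Stress']
```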

In [None]:
# Count the tweets in each category and express each count as a percentage of all tweets
from collections import defaultdict
def calc_categories_count(df):
    count_dict = defaultdict(int)
    text = df['text']
    for i in tqdm.tqdm(range(len(text))):
        t = text.iloc[i].lower()
        for c in categories_dict.keys():
            for w in categories_dict[c]:
                if contains_word(t, w):
                    count_dict[c] += 1
                    break  # count each tweet at most once per category
    # creating table
    table = []
    for k, v in count_dict.items():
        table.append([k, v, v / len(text) * 100])

    table_df = pd.DataFrame(table, columns=['Categories', 'Tweet Count', 'Percentage'])
    table_df = table_df.sort_values(by='Categories')
    table_df.reset_index(drop=True, inplace=True)
    return table_df

In [75]:
non_bot_result = calc_categories_count(real_df)

100%|██████████| 353353/353353 [02:49<00:00, 2087.97it/s]


In [76]:
non_bot_result

Unnamed: 0,Categories,Tweet Count,Percentage
0,Cancer,13834,3.915065
1,Cardiovascular,1810,0.512236
2,Cognitive,8807,2.492408
3,Death,31590,8.940068
4,Dermatological,1557,0.440636
5,Gastrointestinal,10434,2.952855
6,Immune System,12229,3.460845
7,Injury,19490,5.515731
8,MentalHealth,100155,28.344177
9,Neurological,56347,15.946377


In [73]:
bot_result = calc_categories_count(bot_df)

100%|██████████| 53487/53487 [00:25<00:00, 2069.44it/s]


In [74]:
bot_result

Unnamed: 0,Categories,Tweet Count,Percentage
0,Cancer,3438,6.42773
1,Cardiovascular,380,0.710453
2,Cognitive,1138,2.12762
3,Death,3766,7.040963
4,Dermatological,493,0.921719
5,Gastrointestinal,951,1.778002
6,Immune System,1933,3.613962
7,Injury,1903,3.557874
8,MentalHealth,13087,24.467628
9,Neurological,6141,11.481295
