# Keyword expansion 

In this exercise we are going to use the keyword expansion technique proposed in `Computer-Assisted Keyword and Document Set Discovery from Unstructured Text` by King, Lam and Roberts (2017), in order to label a dataset of tweets according to whether or not they are related to covid-19. 

The idea is to use an initial list of keywords to label the data, and then use supervised learning to expand the list of keywords to get a better sense of how people talk about a topic. It is an iterative approach: you start with a list of keywords, expand it, run the procedure again, and repeat until the list is saturated. The approach also emphasises that you should read some of the text that you label, in order to ensure correct labelling. 


This exercise is a python translation of Gregory Eady's R exercise, heavily inspired by the replication material found here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FMJDCD. If interested, you can also see Greg's walk-through of the R version of this code in his video here: https://gregoryeady.com/SocialMediaDataCourse/readings/Keywords/

### Read in required packages 

In [1]:
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
import re
from tqdm import tqdm
from collections import OrderedDict
from collections import defaultdict
from collections import namedtuple
import numpy as np
from nltk.stem import PorterStemmer
from nltk.tokenize import WhitespaceTokenizer
import nltk
from sklearn.feature_extraction.text import CountVectorizer
import random
from math import lgamma
from sklearn import linear_model
import matplotlib.pyplot as plt
import datetime

# 1. Load the data

Read in data as usual. 

In [3]:
df = pd.read_csv("../dataset/MOC_Tweets.csv")

# 1.1. Preprocessing 

Due to time constraints, the preprocessing code is given below, ready to be run. Take a look at the code to understand what is being done. 


Subset the data by removing tweets before 2019 (we are only interested in tweets that may reference COVID-19).

In [4]:
df = df[df.date  >= 20190101] # Subset to 2019 and later because we'll look at COVID-19 over time
df = df.loc[df.tweet_id.drop_duplicates().index] # removing duplicate observations (tweets)

In [5]:
df.reset_index(inplace = True, drop = True) 

Save the original text and lowercase the text column.

In [6]:
df['text_original'] = df['text']
df['text'] = df['text'].str.lower()

Do some (but not all) preprocessing by removing tweet elements that we do not care about. 


In [7]:
# Remove mentions (posts that start with a "@some_user_name ")
df['text'] = df['text'].str.replace("\\B@\\w+|^@\\w+", "", regex = True)
# Change ampersands to "and"
df['text'] = df['text'].str.replace("&amp;", "and")
# Remove the "RT" and "via" (old retweet style)
df['text'] = df['text'].str.replace("(^RT|^via)((?:\\b\\W*@\\w+)+)","", regex=True, case=False)
# Remove URLs             
df['text'] = df['text'].str.replace("(https|http)?:\\/\\/(\\w|\\.|\\/|\\?|\\=|\\&|\\%)*\\b", "", regex = True)
# Keep ASCII only (removes Cyrillic, Japanese characters, etc.)
df['text'] = df['text'].str.replace("[^ -~]", "", regex = True)
# Remove double+ spaces (e.g. "build   the wall" to "build the wall")
df['text'] = df['text'].str.replace("\\s+", " ", regex = True)
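To see what the cleaning steps above do, here is a minimal sketch that applies the same regular expressions, in the same order, to a single made-up tweet using plain `re`. The helper name `clean_tweet` and the example text are assumptions for illustration only.

```python
import re

def clean_tweet(text):
    """Apply the same cleaning steps as the cell above to one string (illustrative helper)."""
    text = text.lower()
    text = re.sub(r"\B@\w+|^@\w+", "", text)                     # mentions
    text = text.replace("&amp;", "and")                          # ampersands
    text = re.sub(r"(^RT|^via)((?:\b\W*@\w+)+)", "", text,
                  flags=re.IGNORECASE)                           # old-style retweets
    text = re.sub(r"(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b", "", text)  # URLs
    text = re.sub(r"[^ -~]", "", text)                           # keep printable ASCII only
    text = re.sub(r"\s+", " ", text)                             # collapse repeated whitespace
    return text

example = "@user Check this &amp; that https://t.co/abc123  now"
print(repr(clean_tweet(example)))
```

Note that a leading space can remain where a mention was removed; the tokenization step later in the exercise splits on whitespace, so this does not affect the tokens.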

With our mostly preprocessed tweets, let us begin building our classifier from chosen keywords. 

# 2. Define inclusion and exclusion keywords

You should now define the initial keywords that you want to include and exclude. Keywords to include should reference COVID-19, e.g. "covid19" and/or "coronavirus". We will use these initial keywords to find more keywords relevant to the topic.

1. Define 4 lists: the **first** should contain a seed reference word to be included, the **second** should contain the expanded list of reference words to include (empty to begin with), the **third** should contain a seed reference word to be excluded (can be left empty), and the **fourth** should contain the expanded list of reference words to exclude (empty to begin with). 

2. Using `.join`, collapse the two inclusion and exclusion lists, respectively, into strings that can be used as regex OR-operations. The result should be in the form \['dog', 'cat'\] --> 'dog|cat'

3. Use this regex string to create a bool column indicating whether the tweet contains one of your keywords.

4. If you have any exclusions, also find the tweets that contain the excluded keywords (the exclusion list can be left empty). 

5. Define a variable that is either 0 or 1, where 1 shows that the tweet contains one or more of your inclusion keywords _and_ does not contain any exclusion keywords. Create a bool column with this. 

6. See how many tweets you have labelled as related to COVID-19 so far (how many 0s and how many 1s). 

7. Sample 10 tweets labelled as COVID-19, and read the text in them (in the text_original column).
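
The steps above can be sketched on a toy dataframe as follows. This is one possible approach rather than the exercise solution: the keyword lists, the example tweets and the helper column names `included` and `excluded` are assumptions made for illustration.

```python
import re
import pandas as pd

# Toy data standing in for the real tweets (made-up examples, already lowercased)
toy = pd.DataFrame({"text": [
    "new covid-19 cases reported today",
    "i tested negative for coronavirus",
    "great game last night",
    "visiting coronado middle school",
]})

include_words = ["covid-19", "coronavirus"]   # seed + expanded inclusion words
exclude_words = ["negative"]                  # seed + expanded exclusion words

# Escape each keyword so special characters are matched literally,
# then join with "|" to build a regex OR-expression: ['dog', 'cat'] -> 'dog|cat'
include_regex = "|".join(re.escape(w) for w in include_words)
exclude_regex = "|".join(re.escape(w) for w in exclude_words)

toy["included"] = toy["text"].str.contains(include_regex, regex=True)

# An empty exclusion list would produce an empty pattern, which matches every
# string, so fall back to "nothing excluded" in that case
if exclude_words:
    toy["excluded"] = toy["text"].str.contains(exclude_regex, regex=True)
else:
    toy["excluded"] = False

# 1 if the tweet matches an inclusion keyword and no exclusion keyword, else 0
toy["covid"] = (toy["included"] & ~toy["excluded"]).astype(int)

print(toy["covid"].value_counts())
```

A short keyword such as "corona" would also have matched "coronado" in the last toy tweet, which is why reading a sample of the labelled tweets (step 7) matters.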

In [11]:
# Define the seed reference words to be included and excluded
included_seed_word = 'Covid-19'
excluded_seed_word = 'negative'

# Define the expanded lists of reference words to be included and excluded
included_expanded_words = ['covidvirus', 'covid-19', 'covid', 'corona']
excluded_expanded_words = []  # empty to begin with

# Collapse the inclusion and exclusion lists into regex OR-operations.
# The text column was lowercased earlier, so the keywords are lowercased as well.
included_regex = '|'.join([included_seed_word] + included_expanded_words).lower()
excluded_regex = '|'.join([excluded_seed_word] + excluded_expanded_words).lower()


# Print the four lists
print('Seed word to be included:', included_seed_word)
print('Expanded list of reference words to include:', included_expanded_words)
#print('Seed word to be excluded:', excluded_seed_word)
#print('Expanded list of reference words to exclude:', excluded_expanded_words)


Seed word to be included: Covid-19
Expanded list of reference words to include: ['covidvirus', 'covid-19', 'covid', 'corona', 'coronavirus']


In [45]:
# Helper function: True if the regex matches anywhere in the text
def true_reg(reg, text):
    return re.search(reg, text) is not None


# Compile the regex (whitespace is not handled for now)
reg = re.compile(included_regex)

# Written out as a for loop
m_liste = []
for x in df["text"]:
    if true_reg(reg, x):
        m_liste.append(1)
    else:
        m_liste.append(0)

df["covid"] = m_liste

# A smarter way to do the same thing:
df["covid"] = [1 if re.search(reg, x) else 0 for x in df["text"]]

print("We have {} tweets about covid".format(len(df[df["covid"]==1])))
print("We have {} tweets not about covid".format(len(df[df["covid"]==0])))

We have 3215 tweets about covid
We have 576605 tweets not about covid


In [49]:
for x in df[df["covid"]==1].sample(10)["text_original"]:
    print(x)

See below for answers to frequently asked questions about the #Coronavirus from the @WHO.  At the federal level, we… https://t.co/VBmeFitGt3
Uhunoma, a 7th grade student at Coronado Middle School in KCK, will launch his experiment on mint, literally, into space. #STEM https://t.co/OCW01LDRdD
"The threat of contracting the Coronavirus remains low according to all of our experts." — Vice President… https://t.co/2qVp2EwPAG
Must-read on why Medicare For All is needed to help protect Americans against global pandemics like Coronavirus.   https://t.co/yKthyoVcRT
From the desk of Dr. Cameron Kaiser: An open letter to the community https://t.co/yOG2MjPcMy #coronavirus #ruhealth… https://t.co/hWlyaGG3cg
Manhattan woman is NY's first confirmed coronavirus case. This is no cause to panic. The @nyccouncil will hold a he… https://t.co/MoWduUGZAj
Good #coronavirus resource page:   #COVID19   https://t.co/6dZgLZkQc1 https://t.co/oPjyf6v4Th
.@realDonaldTrump &amp; @HouseGOP are trying to responsibly p

# 3. Further preprocessing and vectorizing

Next, we need to tokenize the data and preprocess the tokens (as opposed to the preprocessing on the full string as earlier). 

We will also remove all the keywords that demarcate exclusion and inclusion from the covid-19 theme. This is because we want the model to learn to predict the topic using other, new keywords. 

1. Create a new column named "text_preprocessed" - it should equal the text column, but with the keywords removed (Hint: use `.str.replace()` with `regex = True`). 

----- 

To spend less time on lessons you have already been through, code for further preprocessing is provided. This code may take a few minutes to run. The steps are: 

2. Tokenizing. A whitespace tokenizer is used, since we want to keep words with '-'.

3. Removing any tokens that are only numbers (you can remove more types of tokens if you want - up to you).

4. Remove any empty strings.

5. Stemming.

6. Re-joining the stemmed tokens using a whitespace.

7. Creating a column with the preprocessed sentences.

----- 

8. Now you have a column of sentences made out of stemmed and preprocessed tokens. Use a CountVectorizer to make a document term matrix based on this column. Set `min_df = 10` and `max_df = 0.999`, as well as `stop_words = 'english'`, and set an appropriate `ngram_range`. 

NB: Do not try to make this DTM into a dataframe or np array, as you will most likely run out of memory. It is a sparse matrix that you can work with in the same way as an np.array.
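
As a rough illustration of steps 1 and 8, the sketch below removes keywords from a toy text column and builds a sparse document term matrix from it. The toy sentences and the `keyword_regex` string are assumptions; `min_df=1` and `ngram_range=(1, 2)` are chosen here only because the toy corpus is tiny, and in the exercise the vectorizer would be fitted on the stemmed column with the parameters requested above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made-up sentences, already lowercased)
toy = pd.DataFrame({"text": [
    "covid-19 cases rise as hospitals fill up",
    "hospitals prepare for more cases",
    "the game was great last night",
    "great night for a game",
]})
keyword_regex = "covid-19|coronavirus"  # hypothetical inclusion/exclusion keywords

# Step 1: strip the keywords so the model has to rely on other words
toy["text_preprocessed"] = toy["text"].str.replace(keyword_regex, "", regex=True)

# Step 8: build a sparse document term matrix.
# min_df=1 is used only because this corpus has four documents.
vectorizer = CountVectorizer(min_df=1, stop_words="english", ngram_range=(1, 2))
DTM = vectorizer.fit_transform(toy["text_preprocessed"])

print(DTM.shape)   # (number of documents, number of distinct n-grams)
print(type(DTM))   # a scipy sparse matrix, not a dense array
```

Rows of `DTM` line up with rows of the dataframe, so `DTM[i, :]` holds the token counts for the tweet in row `i`.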



In [None]:
#Create a new text column with both inclusion and exclusion keywords removed

df['text_preprocessed'] =  


In [None]:
tokenizer = WhitespaceTokenizer()
ps = PorterStemmer()

preprocessed_sents =[]

for sent in tqdm(df['text_preprocessed']):
    words = tokenizer.tokenize(sent)
    words = [re.sub(r'\d+', '', word) for word in words] # stripping digits from every token (digit-only tokens become empty strings)
    words = [x for x in words if x] #removing empty strings
    sent_stem = [ps.stem(word) for word in words]
    
    sent_done = " ".join(sent_stem)
    preprocessed_sents.append(sent_done)

df['text_stemmed'] = preprocessed_sents

In [None]:
# Create a document term matrix here



# 4. Sample training data and make predictions

Let us sample some tweets we will use to train our classifier. 

1) Define two lists of indices: one list containing the indices of the tweets in the reference set (those labelled as belonging to the covid-19 topic), and another list containing a random sample of N tweets not from the reference set (N should be either 2x the number of tweets in the reference set or 50000, whichever is smaller).

2) You now have 2 lists of indices – use these to subset the Document Term Matrix (where each row represents a tweet, and each column a token) and the reference set column in the dataframe (the labels). Define a train DTM and  a train labels object. 

3) Fit a cross-validated lasso regression, using the DTM subset as input (X) and the reference subset as labels (y). This means that we are trying to predict whether a tweet is in the reference set using the term frequencies. (Hint: use sklearn's `linear_model.LassoCV()`). This may take some time (approx. 5 min, depending on the size of your train data).

4) Use the fitted model to make predictions on the full DTM, and create a column in the dataframe called `predicted_raw` based on this. (Remember that the rows in the DTM correspond to the rows in the dataframe).

5) The model outputs continuous scores (treated here as probabilities) rather than classes, so check the standard deviation of the `predicted_raw` column to confirm that the predictions actually vary. This is just a sanity check.

6) Set a threshold of 0.25, and assign 1 or 0 to a new column called `predicted`, depending on whether the probability in `predicted_raw` is >= the threshold. (Note: Keep the threshold low if you want more tweets to get into the target set).

7) Create a column called `set_var`. This variable should be == "Reference" if the observation is in the reference set (our original covid-19 labels), "Target" if it is _predicted_ to be a covid-19 related tweet (1) and "Not target" if it is _predicted_ not to be (0).

8) Create a crosstab of the prediction and `set_var` to see how your model does (hint: use `pd.crosstab()`). Examine the crosstab - what do the different entries mean? 
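
The whole of section 4 can be sketched end to end on synthetic data. Everything below the imports is a toy stand-in: the random count matrix, the `covid` labels and the sizes are assumptions, and the extra cap on the sample size (so that it never exceeds the number of available non-reference tweets) is an addition for safety on small data.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn import linear_model

rng = np.random.default_rng(0)

# Toy stand-ins: 200 documents, 5 tokens, and a keyword-based label
n_docs = 200
reference = np.zeros(n_docs, dtype=int)
reference[:40] = 1                                           # first 40 docs form the reference set
counts = rng.poisson(0.3, size=(n_docs, 5))
counts[:, 0] += reference * rng.poisson(2.0, size=n_docs)    # token 0 is more common in reference docs
DTM = sparse.csr_matrix(counts)
df = pd.DataFrame({"covid": reference})

# 1) Indices of the reference set and a random sample of non-reference tweets
labels = df["covid"].to_numpy()
ref_idx = np.flatnonzero(labels == 1)
non_ref_idx = np.flatnonzero(labels == 0)
n_sample = min(2 * len(ref_idx), 50000, len(non_ref_idx))
sample_idx = rng.choice(non_ref_idx, size=n_sample, replace=False)

# 2) Training subset of the DTM and the matching labels
train_idx = np.concatenate([ref_idx, sample_idx])
X_train = DTM[train_idx, :]
y_train = labels[train_idx]

# 3) Cross-validated lasso
model = linear_model.LassoCV(cv=5, random_state=0).fit(X_train, y_train)

# 4)-6) Predict on the full DTM, check the spread, then threshold
df["predicted_raw"] = model.predict(DTM)
print("std of raw predictions:", df["predicted_raw"].std())
threshold = 0.25
df["predicted"] = (df["predicted_raw"] >= threshold).astype(int)

# 7) Reference takes priority; otherwise Target / Not target by prediction
df["set_var"] = np.where(df["covid"] == 1, "Reference",
                         np.where(df["predicted"] == 1, "Target", "Not target"))

# 8) Cross-table of predictions against the three sets
print(pd.crosstab(df["predicted"], df["set_var"]))
```

In the crosstab, the "Reference" column shows how many keyword-labelled tweets the model recovers, while the "Target" column counts tweets the model flags that the keywords missed.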

# 5. Calculate the log likelihood as in the paper

1) Create 3 sets of indices based on the `set_var` column: one for "Target", one for "Not target" and one for "Reference". 

2) Create 3 objects each for the target, not_target and reference sets, based on the DTM. For each token, these should be: how often the token occurs in the set, how many documents in the set contain the token, and the proportion of documents in the set containing the token. (Hint: see the sample code for the target set. If you want a list rather than a matrix object, you can use `.tolist()[0]`.)

3) Create a new dataframe, where each row is a token from the DTM (you can use `vectorizer.get_feature_names_out()`, or `vectorizer.get_feature_names()` in older scikit-learn versions), with 9 columns, one for each of the 9 objects you just created. 

4) Subset the dataset by removing any observations where the terms do not appear in either the target or not_target set, thus keeping only tokens that were in the original search set (step (a) on page 979).

5) Keywords go in the target list if their proportion is higher among those documents estimated to be in the reference set than not; e.g. if for the word "pandemic", 15% of documents predicted as target contain the word "pandemic" versus only 2% among those in the not_target set (step (b) on page 979). Therefore: create a new column that should be True if the token has a higher or equal proportion in the target set than in the not_target set. 

6) Examine the provided `llik` function and look in the paper - what does it do? 

7) Calculate the number of documents in the target and the not_target set.

8) Use the provided function to calculate the log likelihood for each token. Assign this to a new column in the dataframe created in step 3. 
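
The steps above can be sketched on a tiny hand-made DTM. The token names, counts and set assignments are made up, the helper `set_stats` is a hypothetical convenience function, and only the target and not_target sets are shown (the reference set would be handled the same way). The `llik` function is copied from the exercise so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from math import lgamma
from scipy import sparse

def llik(target_num_docs, nottarget_num_docs, target_num_docs_total, nottarget_num_docs_total):
    # Same computation as the function provided in the exercise
    x1 = (lgamma(target_num_docs + 1) + lgamma(nottarget_num_docs + 1)
          - lgamma(target_num_docs + nottarget_num_docs + 2))
    x2 = (lgamma(target_num_docs_total - target_num_docs + 1)
          + lgamma(nottarget_num_docs_total - nottarget_num_docs + 1)
          - lgamma(target_num_docs_total - target_num_docs
                   + nottarget_num_docs_total - nottarget_num_docs + 2))
    return x1 + x2

# Toy inputs: 6 documents, 3 tokens
tokens = ["pandemic", "game", "unused"]
DTM = sparse.csr_matrix(np.array([
    [2, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 2, 0],
    [1, 0, 0],
]))
set_var = pd.Series(["Target", "Target", "Target", "Not target", "Not target", "Reference"])

def set_stats(ids):
    """Token frequency, document count and document share for one set of rows."""
    sub = DTM[ids, :]
    freq = np.asarray(sub.sum(axis=0)).ravel()
    num_docs = np.asarray((sub > 0).sum(axis=0)).ravel()
    return freq, num_docs, num_docs / sub.shape[0]

target_ids = np.flatnonzero((set_var == "Target").to_numpy())
not_target_ids = np.flatnonzero((set_var == "Not target").to_numpy())

t_freq, t_docs, t_prop = set_stats(target_ids)
n_freq, n_docs, n_prop = set_stats(not_target_ids)

kw = pd.DataFrame({"token": tokens,
                   "target_num_docs": t_docs, "target_prop": t_prop,
                   "nottarget_num_docs": n_docs, "nottarget_prop": n_prop})

# 5.4: drop tokens that appear in neither the target nor the not_target set
kw = kw[(kw["target_num_docs"] + kw["nottarget_num_docs"]) > 0].copy()
# 5.5: True if the token is at least as common in the target set
kw["in_target_list"] = kw["target_prop"] >= kw["nottarget_prop"]
# 5.7-5.8: set sizes and the log likelihood per token
n_target, n_not_target = len(target_ids), len(not_target_ids)
kw["llik"] = [llik(t, n, n_target, n_not_target)
              for t, n in zip(kw["target_num_docs"], kw["nottarget_num_docs"])]
print(kw)
```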


In [None]:
# Create 3 lists of indices

target_ids = 
not_target_ids = 
ref_ids = 


In [None]:
# Code showing how 5.2 is calculated for the target group 

target_freq = np.sum(DTM[target_ids,:], axis = 0) # how often is a given keyword in the target set 
target_num_docs = np.sum(DTM[target_ids,:] > 0, axis = 0) # how many documents in the target set contain a given keyword
target_num_docs_prop =  target_num_docs / DTM[target_ids,:].shape[0] # the proportion (divide by the number of documents in the target set)

# ... 


In [91]:
# Equation from King et al. (2017) p. 979 (using logs for stability) to calculate the likelihood

def llik(target_num_docs, nottarget_num_docs, target_num_docs_total, nottarget_num_docs_total):
    
    '''No docstring - you need to see what it does :) '''
    
    x1 = ((lgamma(target_num_docs + 1) + lgamma(nottarget_num_docs + 1)) -
           lgamma(target_num_docs + nottarget_num_docs + 1 + 1))
    
    x2 = ((lgamma(target_num_docs_total - target_num_docs + 1) +
           lgamma(nottarget_num_docs_total - nottarget_num_docs + 1)) -
           lgamma(target_num_docs_total - target_num_docs +
           nottarget_num_docs_total - nottarget_num_docs + 1 + 1))
    
    llik = x1 + x2
    
    return llik
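
Reading the code directly (this is a restatement of the function, not a quotation from the paper), the two terms are logs of Beta functions. The symbols are shorthand introduced here: $n_T$ and $n_N$ are `target_num_docs` and `nottarget_num_docs`, $N_T$ and $N_N$ are `target_num_docs_total` and `nottarget_num_docs_total`, and $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a + b)$.

```latex
\begin{aligned}
x_1 &= \log\Gamma(n_T + 1) + \log\Gamma(n_N + 1) - \log\Gamma(n_T + n_N + 2)
     = \log B(n_T + 1,\; n_N + 1) \\
x_2 &= \log\Gamma(N_T - n_T + 1) + \log\Gamma(N_N - n_N + 1)
       - \log\Gamma(N_T - n_T + N_N - n_N + 2)
     = \log B(N_T - n_T + 1,\; N_N - n_N + 1) \\
\mathrm{llik} &= x_1 + x_2
\end{aligned}
```

The first term depends on how the documents containing the token split between the two sets, and the second on how the documents without the token split.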


# 6. Examine new keywords

1) Show the top 25 keywords based on highest log likelihood, where the share of documents in the target set is higher than in the not_target set (see task 5.5). These are the tokens that are most likely to differentiate between the target and not_target sets (meaning that they help the model predict covid-19 related tweets).

2) Do the same with the not_target - what are these terms representative of? 

3) Are there any of these tokens that you want to include in the keywords? Choose 1-3 keywords that you want to include or exclude. 

4) For the 1-3 keywords you have found, find tweets that contain the given keyword in the original tweet text in the original dataframe. Read some tweets where the keyword is used in context - do you still want to include or exclude the keyword? 

5) Optional: add the new keywords to the original list at the beginning of this exercise in 2.1, and rerun the exercises until here, now including the new keywords. This is how the computer-assisted keyword discovery is used iteratively. 
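
A minimal sketch of steps 1, 2 and 4, assuming a token table shaped like the one built in section 5. The tokens, `llik` values, the column name `in_target_list` and the example tweets are all made up for illustration.

```python
import pandas as pd

# Hypothetical token table (values invented for the example)
kw = pd.DataFrame({
    "token": ["pandemic", "quarantin", "game", "vote"],
    "llik": [-120.5, -310.2, -95.1, -400.0],
    "in_target_list": [True, True, False, False],
})

# 6.1: tokens over-represented in the target set, highest log likelihood first
top_target = kw[kw["in_target_list"]].sort_values("llik", ascending=False).head(25)
# 6.2: the same for tokens over-represented in the not_target set
top_not_target = kw[~kw["in_target_list"]].sort_values("llik", ascending=False).head(25)
print(top_target["token"].tolist())
print(top_not_target["token"].tolist())

# 6.4: read a candidate keyword in context (toy tweets, case-insensitive match)
df = pd.DataFrame({"text_original": [
    "The pandemic response needs funding.",
    "Town hall tonight at 7pm.",
    "Pandemic relief bill passed the House.",
]})
hits = df[df["text_original"].str.contains("pandemic", case=False)]
for text in hits["text_original"].head(10):
    print(text)
```

Because the tokens in the table are stemmed, a stem such as "quarantin" would be searched for as-is so that it matches "quarantine" and "quarantined" in the original text.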


# 7. Optional: Use your new classifier for downstream tasks

1) Assign a `final_classification` boolean column in the original dataframe, which should be 1 if the tweet contains any of the keywords in the new, complete list and if it does not contain any of the exclusion keywords. 

2) Examine the value counts of the political affiliation variable. Assign "Democrat" to the tweets labelled with "Independent" (look at the people behind these tweets to see the reason).

3) Plot the share of tweets labelled as covid-19 relevant by your classifier (y), grouped on days (x) for each party - meaning two lines of covid-19 share across time. 

**Hints:** <br>
The pandas `groupby` functionality may be of help to you. <br>
You can also turn the date ints into so-called datetime objects using this:

`dates = [datetime.datetime(year=int(str(d)[0:4]), month=int(str(d)[4:6]), day=int(str(d)[6:8])) for d in x]`

where x is a list of the unique dates as int. 
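
One way the grouping in steps 2 and 3 might look, sketched on toy data. The column name `party` is a placeholder for whatever the political affiliation variable is called in the dataset, and the values are made up. `pd.to_datetime` is used here as an alternative to the list comprehension in the hint.

```python
import pandas as pd

# Toy data: column names mirror the exercise where known; "party" is a placeholder
df = pd.DataFrame({
    "date": [20200301, 20200301, 20200301, 20200302, 20200302, 20200302],
    "party": ["Democrat", "Republican", "Independent", "Democrat", "Republican", "Republican"],
    "final_classification": [1, 0, 1, 0, 1, 0],
})

# 7.2: fold independents into the Democratic group
df["party"] = df["party"].replace({"Independent": "Democrat"})

# Convert integer dates such as 20200301 into datetime objects
df["day"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")

# 7.3: daily share of tweets classified as covid-19 related, one column per party
share = df.groupby(["day", "party"])["final_classification"].mean().unstack("party")
print(share)
```

With matplotlib installed, `share.plot()` would then draw one line per party with the days on the x-axis.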
