# Liberty Dictionary and Scoring

## Installing some required dependencies

In [1]:
!pip install --quiet xlrd==1.2.0



In [2]:
# load libraries
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import json
from nltk.tokenize import word_tokenize
import spacy
import string
from nltk.corpus import stopwords
import re

In [3]:
try:
    nlp = spacy.load("en_core_web_sm")
    nlp_reduced = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])
except OSError as error:
    if "Can't find model 'en_core_web_sm'" in error.args[0]:
        print('Downloading files required by the Spacy language processing library (this is only required once)')
        spacy.cli.download('en_core_web_sm')
    nlp = spacy.load("en_core_web_sm")
    nlp_reduced = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])

In [4]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/brinxu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/brinxu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/brinxu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
# read questionnaire data
data = pd.read_excel("questionnaire.xls")

In [6]:
# let's have a look at the metadata
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10538 entries, 0 to 10537
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Candidate Words      10527 non-null  object 
 1    Liberty/oppression  10524 non-null  float64
dtypes: float64(1), object(1)
memory usage: 164.8+ KB


In [7]:
data.sample(5)

Unnamed: 0,Candidate Words,Liberty/oppression
1004,help,3.0
646,mistreatment,3.0
8641,bring,3.0
6213,flag,7.0
3050,charitable,3.0


## Build Dictionary
### Data Preprocessing
Before building the dictionary of `Liberty/oppression` scores, we prepare the dataset: we first clean up the column name (the raw header has a leading space) and then normalize the annotated values to a scale between 0 and 1.

In [8]:
data.rename(columns={' Liberty/oppression': 'Liberty/oppression'}, inplace=True)

In [9]:
data.sample(5)

Unnamed: 0,Candidate Words,Liberty/oppression
8932,forthrightness,2.0
9308,financed,7.0
6893,disentangle,0.0
2077,all,1.0
10084,appoint,4.0


Our dataframe is now almost ready to be transformed into a dictionary whose keys are the `Candidate Words` and whose values are the `Liberty/oppression` scores. Before that, we average the scores that the annotators gave to each candidate word and then normalize the averaged values.

In [10]:
data["Candidate Words"].value_counts()

restrict      11
choose        11
tops          11
disapprove    11
merit         11
              ..
deduct        10
make          10
5              1
1              1
4              1
Name: Candidate Words, Length: 960, dtype: int64

The output above shows that most words were annotated 11 times and some of them 10 times. The last three values, `1`, `4` and `5`, appear only once; they are not meaningful candidate words, so they should be deleted. Note that the cells below do not actually remove them, so they are still counted among the 960 entries used in the rest of the notebook.
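
The clean-up suggested above could be sketched as follows. This is a minimal, standard-library illustration and not part of the original notebook: the helper name `is_candidate_word` and the toy rows are assumptions made up for the example.

```python
import math

def is_candidate_word(value):
    """Return True for entries that look like words, False for missing or purely numeric ones."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    text = str(value).strip()
    return text != "" and not text.isdigit()

# Toy (word, score) rows, made up for illustration only
rows = [("restrict", 2.0), ("5", 1.0), (4, 3.0), ("choose", 6.0), (float("nan"), 3.0)]

# Keep only the rows whose first field is a real candidate word
kept = [(word, score) for word, score in rows if is_candidate_word(word)]
print(kept)
```

On the notebook's dataframe, the same predicate could presumably be applied with `data[data["Candidate Words"].map(is_candidate_word)]` before averaging. Doing so would change the entry counts and statistics reported in the following cells.
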
## Averaging scores by candidate words

In [11]:
# averaging Liberty/oppression scores by candidate words
data_avg = data.groupby("Candidate Words").mean().reset_index()

In [12]:
data_avg.sample(5)

Unnamed: 0,Candidate Words,Liberty/oppression
805,sharp,2.090909
783,sanction,3.272727
692,provided,3.454545
565,marginalization,4.545455
467,independence,3.454545


Let's explore how the `Liberty/oppression` scores are distributed after averaging.

In [13]:
data_avg.describe()

Unnamed: 0,Liberty/oppression
count,960.0
mean,3.333496
std,0.676836
min,1.0
25%,2.9
50%,3.272727
75%,3.818182
max,5.454545


As we can see, after averaging, the `Liberty/oppression` scores range from `1` to `5.454545`. This scale is not convenient when we later average scores at the document level, which is why normalization is a necessary step. Normalization is a technique often applied as part of data preparation for machine learning. Its goal is to bring the values of numeric columns onto a common scale without distorting differences in the ranges of values.

In [14]:
# original value of the word "truth"
data_avg[data_avg["Candidate Words"] == "truth"] 

Unnamed: 0,Candidate Words,Liberty/oppression
881,truth,3.181818


Each value is transformed with `MinMaxScaler`, which subtracts the minimum value of the feature and then divides by the range. The range is the difference between the original maximum and the original minimum:

$$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$$

Here Max = 5.454545 and Min = 1.000000, so for the word "truth" we get:

In [15]:
x_truth = (3.181818 - 1)/(5.454545 - 1)
print(x_truth)

0.4897959275301966
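
The hand computation above can be generalized to a whole column. The sketch below is a plain-Python illustration of the min-max formula, not the scikit-learn implementation; the function name `min_max_scale` is an assumption introduced for this example, and the three input values are the minimum, the score of "truth" and the maximum reported above.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so the range is zero and the formula is undefined
        raise ValueError("cannot rescale a constant column")
    return [(v - lo) / (hi - lo) for v in values]

# Minimum, score of "truth", and maximum of the averaged scores
scores = [1.0, 3.181818, 5.454545]
scaled = min_max_scale(scores)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and "truth" lands at roughly 0.49, in line with the value printed by cell 15.
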


In [16]:
# apply normalization techniques 
column = 'Liberty/oppression'
data_avg[column] = MinMaxScaler().fit_transform(np.array(data_avg[column]).reshape(-1,1))

# view normalized data  
data_avg.sample(5)

Unnamed: 0,Candidate Words,Liberty/oppression
957,you,0.55102
856,taxes,0.816327
322,equity,0.673469
57,arrest,0.489796
670,prime,0.428571


In [17]:
# View top 20 sample
data_avg.sort_values(by="Liberty/oppression", ascending=False).head(20)

Unnamed: 0,Candidate Words,Liberty/oppression
298,emancipation,1.0
389,freedom,0.979592
541,liberated,0.959184
603,obstruction,0.938776
233,demobilize,0.918367
70,autonomic,0.918367
71,autonomous,0.918367
144,choose,0.918367
895,unenslaved,0.897959
543,liberties,0.897959


In [18]:
# let's get a bit more insights about our normalization output 
data_avg.describe()

Unnamed: 0,Liberty/oppression
count,960.0
mean,0.523846
std,0.151943
min,0.0
25%,0.426531
50%,0.510204
75%,0.632653
max,1.0


As we can see from the table above, the column has been rescaled so that its minimum is 0 and its maximum is 1.
Now that our dataset is normalized, we can start building our dictionary.

We then loop over the dataframe of averaged scores and build a dictionary that we will use later to get the Liberty score of a given word. A Python dictionary is the most suitable format for this task: you provide the key, which is the candidate word, and it returns the Liberty score, without needing to loop over the dataframe.

In [19]:
# initialize an empty dict
liberty_lexicon = {}

# loop over our dataframe
for index, row in data_avg.iterrows():
  liberty_lexicon[row["Candidate Words"]] = row["Liberty/oppression"]

print(f"there is {len(liberty_lexicon)} elements in our dictionary for Liberty/oppression scores")

there is 960 elements in our dictionary for Liberty/oppression scores


In [20]:
# test our dictionary 
liberty_lexicon["power"]

0.6326530612244899
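
As an aside, the same kind of mapping can be built without an explicit loop. The sketch below uses two plain lists as a stand-in for the `Candidate Words` and `Liberty/oppression` columns; the three entries are taken from the outputs shown in this notebook, and the variable names are assumptions made for the example.

```python
# Stand-ins for the two dataframe columns (values copied from the outputs above)
words = ["emancipation", "freedom", "power"]
scores = [1.0, 0.979592, 0.6326530612244899]

# zip pairs each word with its score; dict turns the pairs into a lookup table
toy_lexicon = dict(zip(words, scores))

# Direct indexing raises KeyError for unknown words, while .get returns None
known = toy_lexicon["power"]
unknown = toy_lexicon.get("pandemic")
print(known, unknown)
```

Using `.get` is one way to handle out-of-lexicon words without a `try`/`except` block.
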

In [21]:
# save the dictionary as a json file
with open("liberty_lexicon.json", "w") as f:
  json.dump(liberty_lexicon, f, indent=4)


In the same way, we save a second, lemmatized version of the lexicon, in which the candidate words are lemmatized so that words such as `banning` and `ban` map to the same score, since they share the same meaning.

In [22]:
def lemmatize_word(word):
  return ''.join([token.lemma_ for token in nlp(word)])

In [23]:
# test the function of lemmatization
lemmatize_word("banning")

'ban'

In [24]:
# initialize an empty dict
lemmatized_liberty_lexicon = {}

# loop over our dataframe
for index, row in data_avg.iterrows():
  lemmatized_liberty_lexicon[lemmatize_word(str(row["Candidate Words"]))] = row["Liberty/oppression"]

print(f"there is {len(lemmatized_liberty_lexicon)} elements in our dictionary for Liberty/oppression scores")

there is 874 elements in our dictionary for Liberty/oppression scores


In [25]:
# save the dictionary as a json file
with open("lemmatized_liberty_lexicon.json", "w") as f:
  json.dump(lemmatized_liberty_lexicon, f, indent=4)

We can clearly see that after lemmatization the lexicon shrinks from 960 to 874 entries, which means that many words share the same lemma and differ only in their suffixes.
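
One consequence of this reduction is worth spelling out: when several candidate words share a lemma, the loop in cell 24 keeps only the score of the last word processed, because each assignment overwrites the previous one. The sketch below contrasts that behaviour with averaging the scores of colliding words. It is an illustration only: `toy_lemmatize` is a hypothetical stand-in for the spaCy lemmatizer, and all scores are made up.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical lemma table standing in for spaCy (illustration only)
TOY_LEMMAS = {"banning": "ban", "banned": "ban", "liberties": "liberty"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

# Made-up normalized scores
word_scores = {"ban": 0.30, "banning": 0.50, "banned": 0.70,
               "liberty": 0.90, "liberties": 0.80}

# Behaviour of a plain assignment loop: the last word seen for a lemma wins
last_wins = {}
for word, score in word_scores.items():
    last_wins[toy_lemmatize(word)] = score

# Alternative: collect every score per lemma, then average them
grouped = defaultdict(list)
for word, score in word_scores.items():
    grouped[toy_lemmatize(word)].append(score)
averaged = {lemma: mean(values) for lemma, values in grouped.items()}

print(last_wins)
print(averaged)
```

Whether averaging is preferable to keeping a single score depends on how the lexicon is meant to be used; the notebook itself keeps the overwrite behaviour.
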

## Liberty scoring at a post level
The approach used in this notebook is similar to the one used [here](https://github.com/oaraque/moral-foundations/blob/9d84f014fb257ce5d6cd77b48ed104edc911e31e/moralstrength/moralstrength.py#L46): the score of a sentence is the average of the annotations of its words, and any word that is not found in our lexicon is skipped.

To make the approach concrete, suppose we have the sentence `sentence = "I am feeling exhausted"`.
The function below first splits this sentence into tokens, `["I", "am", "feeling", "exhausted"]`, and takes the lemma of each token, giving `["i", "be", "feel", "exhausted"]`. For each lemma it tries to look up a Liberty score in the `lemmatized_liberty_lexicon` dictionary. If a score exists, it is added to a running sum (initialised to 0) and the counter `recognized_words_no` is incremented; if a word is not found in the Liberty dictionary, it is ignored. At the end, the sum of the scores found is divided by `recognized_words_no`, the number of words that were found.

In [26]:
def document_average_liberty(text):

  # running sum of scores and number of tokens found in our lexicon
  # (named score_sum to avoid shadowing the built-in sum)
  score_sum = 0
  recognized_words_no = 0

  # loop over each token in the text and look up the liberty score of its lemma
  for token in nlp(text):
    try:
      liberty_score = lemmatized_liberty_lexicon[token.lemma_]
    except KeyError:
      # the lemma wasn't found in the dict: skip it silently
      continue
    # note: this truthiness check also skips a score of exactly 0.0
    if liberty_score:
      score_sum += liberty_score
      # increment the number of words that were found in the dictionary
      recognized_words_no += 1

  # no token was recognized: return NaN instead of dividing by zero
  if recognized_words_no == 0:
    return float('NaN')
  else:
    return score_sum/recognized_words_no

In [27]:
# test the function over a sample text
document_average_liberty("limiting freedom is")

0.6938775510204082
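
One edge case of the function above deserves a short illustration. The check `if liberty_score:` treats a score of exactly `0.0` as "no score", and after min-max normalization the lowest-scoring word has exactly that value. The sketch below compares the truthiness check with a membership check on a made-up lexicon. The function names and the toy values are assumptions for this example, and the input is assumed to be already lemmatized.

```python
import math

# Made-up lexicon: "oppress" sits at the very bottom of the normalized scale
toy_lexicon = {"oppress": 0.0, "freedom": 1.0}

def average_truthy(lemmas, lexicon):
    """Mirror of the truthiness check: a score of 0.0 is skipped."""
    scores = [lexicon[lemma] for lemma in lemmas if lexicon.get(lemma)]
    return sum(scores) / len(scores) if scores else math.nan

def average_membership(lemmas, lexicon):
    """Alternative: every lemma present in the lexicon counts, even with score 0.0."""
    scores = [lexicon[lemma] for lemma in lemmas if lemma in lexicon]
    return sum(scores) / len(scores) if scores else math.nan

lemmas = ["oppress", "freedom", "today"]
print(average_truthy(lemmas, toy_lexicon))
print(average_membership(lemmas, toy_lexicon))
```

With the truthiness check the zero-score word is dropped from both the sum and the count, so the document average is pulled upward. Which behaviour is intended is a design decision; the notebook's results were produced with the truthiness check.
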

Even though the token `is` does not exist in our dictionary, lemmatization transforms it into its base verb `be`, which allows us to get a Liberty score for it and lets it contribute to the document score.

Our next step is to load the posts data and preprocess it further to normalize the text; we can then call the `document_average_liberty` function defined above to return a score for each post.

In [28]:
# load posts data 
df_posts = pd.read_csv("covid_data_final.csv")



In [29]:
df_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84385 entries, 0 to 84384
Data columns (total 58 columns):
 #   Column                                                                                                                Non-Null Count  Dtype  
---  ------                                                                                                                --------------  -----  
 0   Page Name                                                                                                             84385 non-null  object 
 1   User Name                                                                                                             84385 non-null  object 
 2   Facebook Id                                                                                                           84385 non-null  int64  
 3   Page Category                                                                                                         84385 non-null  object 
 4   Page

In [30]:
# get stopwords list
stoplist = stopwords.words('english')
# get list of punctuations
punctuations = string.punctuation + "’¶•@°©®™"

In [31]:
# the next step will be normalizing text data to be ready for scoring calculation
def preprocess_text(text):
    """
    @param text string
    @return text string
    
    This function preprocesses a given raw text by lowercasing it, removing stop words
    and punctuation, and lemmatizing the remaining tokens
    """
        
    #string to lowercase
    txt = text.lower()
    
    # keep only letters (a-z plus the accented range À-ÿ); every other character becomes a space
    txt = re.sub(r"[^a-zA-ZÀ-ÿ]", " ", txt)
    
    #punctuation removal and map it to space
    translator = str.maketrans(punctuations, " "*len(punctuations))
    s = txt.translate(translator)
    
    #remove digits 
    no_digits = ''.join([i for i in s if not i.isdigit()])
    cleaner = " ".join(no_digits.split())
    
    # the word_tokenize function will transform the text from a simple string to a list of token "a b c d" ==> ["a", "b", "c", "d"]
    word_tokens = word_tokenize(cleaner)
    # keep only the words that do not appear in the stoplist variable we created above
    filtered_sentence = [w for w in word_tokens if w not in stoplist]
    # the stopwords list isn't exhaustive, so meaningless one-character tokens may remain; it's better to drop them
    filtered_sentence = [w for w in filtered_sentence if len(w)>1 ]
    # the opposite of word_tokenize ["a", "b", "c", "d"] ==> "a b c d"
    filtered_sentence = " ".join(filtered_sentence)
    
    # lemmatization step (spaCy); the commented-out line below is a word-by-word alternative using lemmatize_word
    # filtered_sentence = " ".join([lemmatize_word(word) for word in word_tokenize(filtered_sentence)])
    filtered_sentence = " ".join([token.lemma_ for token in nlp(filtered_sentence)])
    
    return filtered_sentence

In [32]:
example_tweet = df_posts["Message"][9]
print("Post before preprocessing : \n {}\n".format(example_tweet))
clean_tweet = preprocess_text(example_tweet)
print("Post After preprocessing : \n {}".format(clean_tweet))

Post before preprocessing : 
 Millions of jobs and a shortage of applicants. Welcome to the new economy

Post After preprocessing : 
 million job shortage applicant welcome new economy
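
A side observation on the cleaning steps: the regular expression in `preprocess_text` already replaces digits and punctuation with spaces, so the later punctuation and digit removal steps should leave the text unchanged. The sketch below checks this on a made-up sample using only the standard library. It reproduces the first three steps of the function and omits tokenization, stop-word removal and lemmatization; the helper name `keep_letters` and the sample string are assumptions for this example.

```python
import re
import string

punctuations = string.punctuation + "’¶•@°©®™"

def keep_letters(text):
    """Lowercase, then replace everything outside a-z and the À-ÿ range with a space."""
    return re.sub(r"[^a-zA-ZÀ-ÿ]", " ", text.lower())

sample = "COVID-19: 3,000 cases… café re-opens @ 50%!"

# Step 1: regex filter
step1 = keep_letters(sample)
# Step 2: map punctuation to spaces (same translation table as in preprocess_text)
translator = str.maketrans(punctuations, " " * len(punctuations))
step2 = step1.translate(translator)
# Step 3: remove digits
step3 = "".join(ch for ch in step2 if not ch.isdigit())

tokens = step3.split()
print(tokens)
```

Note that the `À-ÿ` range also contains the two symbols `×` and `÷`, so those would survive the regex even though they are not letters.
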


In [33]:
!pip install --quiet mapply



In this step we apply the `preprocess_text` function defined above to all the messages. A plain `apply` would process the rows one by one, which is very slow, so to speed this up we use `mapply`, which runs the function in parallel instead of sequentially.

In [34]:
# to parallelize the processing function over the whole dataframe
import mapply
# the value n_workers=-1 means we use all the cores available in our CPU
mapply.init(n_workers=-1)

In [35]:
# convert the values in messages to string format
df_posts["Message"] = df_posts["Message"].astype(str)
# now let's apply this preprocessing function over all our text data in the dataframe
df_posts["clean_post"] = df_posts["Message"].mapply(preprocess_text)


In [36]:
df_posts.sample(5)

Unnamed: 0,Page Name,User Name,Facebook Id,Page Category,Page Admin Top Country,Page Description,Page Created,Likes at Posting,Followers at Posting,Post Created,...,sanctity_p,care_sent,fairness_sent,loyalty_sent,authority_sent,sanctity_sent,moral_nonmoral_ratio,f_var,sent_var,clean_post
60049,Reuters,Reuters,114050161948682,MEDIA_NEWS_COMPANY,US,Welcome to Reuters news on Facebook. We share ...,2010-04-16 17:16:47,4644112.0,5137451.0,2021-01-30 11:48:18 EST,...,0.10258,-0.040991,-0.127771,0.003844,-0.177025,-0.052653,1.0,0.000468,0.005247,dubai say saturday roll china sinopharm vaccin...
24812,MSNBC,msnbc,273864989376427,MEDIA_NEWS_COMPANY,US,The destination for in-depth analysis of daily...,2012-05-14 16:26:44,2437609.0,2512337.0,2020-11-11 23:41:10 EST,...,0.087434,-0.168339,-0.073885,-0.126453,-0.079438,-0.114194,1.125,0.000223,0.001475,government care people would fight we right ch...
50872,The Wall Street Journal,WSJ,8304333127,MEDIA_NEWS_COMPANY,US,"Breaking news, investigative reporting, busine...",2008-02-11 22:26:53,6575490.0,6702329.0,2020-12-27 19:00:29 EST,...,0.090429,-0.102964,-0.207736,-0.096968,-0.036619,-0.102668,0.571429,0.000612,0.003802,medical center overrun critical covid patient ...
42618,Fox News,FoxNews,15704546335,MEDIA_NEWS_COMPANY,US,Welcome to the official Fox News Facebook page...,2008-04-30 18:26:36,18775416.0,22411835.0,2021-01-28 17:29:34 EST,...,0.068245,-0.019818,-0.01427,0.045541,0.065213,0.106469,1.333333,0.000292,0.002888,president biden thursday sign pair executive o...
51196,The Wall Street Journal,WSJ,8304333127,MEDIA_NEWS_COMPANY,US,"Breaking news, investigative reporting, busine...",2008-02-11 22:26:53,6572863.0,6692209.0,2020-12-07 10:00:15 EST,...,0.098423,-0.088155,-0.065322,-0.117757,-0.040875,-0.107607,3.666667,0.000541,0.000981,think soon quick public anxiety safety first c...


Now that our dataframe has the cleaned text data, we are ready to call our `document_average_liberty` function to score each post.

In [37]:
df_posts["Liberty/oppression"] = df_posts["clean_post"].mapply(document_average_liberty)


In [38]:
# let's explore the results
df_posts.sample(5)

Unnamed: 0,Page Name,User Name,Facebook Id,Page Category,Page Admin Top Country,Page Description,Page Created,Likes at Posting,Followers at Posting,Post Created,...,care_sent,fairness_sent,loyalty_sent,authority_sent,sanctity_sent,moral_nonmoral_ratio,f_var,sent_var,clean_post,Liberty/oppression
38112,The New York Times,nytimes,5281959998,MEDIA_NEWS_COMPANY,US,Welcome to The New York Times on Facebook - a ...,2007-10-29 23:03:34,17296921.0,17295821.0,2020-05-27 15:26:09 EDT,...,0.020178,0.037055,0.068215,0.006878,0.040835,4.333333,0.000304,0.000538,education secretary betsy devos say force publ...,0.618076
61553,Reuters,Reuters,114050161948682,MEDIA_NEWS_COMPANY,US,Welcome to Reuters news on Facebook. We share ...,2010-04-16 17:16:47,4598408.0,5032372.0,2020-12-20 07:36:13 EST,...,-0.049119,0.0077,0.053888,-0.006493,0.047233,1.083333,0.000152,0.001771,new national lockdown inevitable britain stop ...,0.55102
4820,CNN,cnn,5550296508,MEDIA_NEWS_COMPANY,US,Instant breaking news alerts and the most talk...,2007-11-07 22:14:27,33064976.0,35753556.0,2020-05-09 15:00:01 EDT,...,-0.025027,-0.045027,0.065662,-0.094205,-0.050249,0.75,0.000137,0.003483,uc hastings college law along san francisco bu...,0.619048
22965,MSNBC,msnbc,273864989376427,MEDIA_NEWS_COMPANY,US,The destination for in-depth analysis of daily...,2012-05-14 16:26:44,2475801.0,2573254.0,2021-01-27 07:32:08 EST,...,-0.112362,-0.05238,-0.138014,-0.149883,-0.018724,1.5,0.000636,0.003199,georgia state lawmaker remove house chamber re...,0.816327
71358,Reuters,Reuters,114050161948682,MEDIA_NEWS_COMPANY,US,Welcome to Reuters news on Facebook. We share ...,2010-04-16 17:16:47,4341929.0,4445244.0,2020-05-08 02:57:11 EDT,...,0.003731,0.088134,0.061091,0.013507,0.054871,1.75,2.3e-05,0.001227,green hydrogen time come long tout clean alter...,0.501134


In [39]:
# View the top 20 scores
df_posts.sort_values(by="Liberty/oppression", ascending=False).head(20)

Unnamed: 0,Page Name,User Name,Facebook Id,Page Category,Page Admin Top Country,Page Description,Page Created,Likes at Posting,Followers at Posting,Post Created,...,care_sent,fairness_sent,loyalty_sent,authority_sent,sanctity_sent,moral_nonmoral_ratio,f_var,sent_var,clean_post,Liberty/oppression
11054,NPR,NPR,10643211755,BROADCASTING_MEDIA_PRODUCTION,US,"NPR is an independent, nonprofit media organiz...",2008-02-21 00:53:35,6654070.0,7697756.0,2020-07-30 20:36:36 EDT,...,0.172398,0.204521,0.168465,0.14544,0.099991,0.555556,0.000191,0.001501,listen billie eilish new song future write rec...,0.897959
17596,CBS,CBS,47360808996,TV_NETWORK,US,This is CBS. Watch full episodes for free on w...,2009-01-29 02:25:43,1584256.0,1957046.0,2021-06-15 11:26:37 EDT,...,-0.105921,-0.033917,0.003375,0.024413,-0.049749,0.846154,0.000109,0.002554,tony nominee nick cordero lose battle covid la...,0.877551
39930,The New York Times,nytimes,5281959998,MEDIA_NEWS_COMPANY,US,Welcome to The New York Times on Facebook - a ...,2007-10-29 23:03:34,17058831.0,16965143.0,2020-03-24 15:35:07 EDT,...,-0.191699,-0.040353,-0.096786,-0.081401,-0.012155,0.5,0.000553,0.004709,terrence mcnally tony award win playwright dra...,0.877551
41567,TheBlaze,TheBlaze,140738092630206,MEDIA_NEWS_COMPANY,US,News & entertainment for people who love America.,2010-08-18 23:57:09,2128162.0,2128833.0,2020-03-24 22:30:14 EDT,...,-0.279402,-0.169415,-0.234339,-0.120924,-0.158605,0.5,0.000761,0.004027,man florida survive covid credits malaria drug...,0.877551
37472,The New York Times,nytimes,5281959998,MEDIA_NEWS_COMPANY,US,Welcome to The New York Times on Facebook - a ...,2007-10-29 23:03:34,17377422.0,18043762.0,2020-07-09 04:55:05 EDT,...,-0.126714,-0.025842,0.039374,-0.122024,-0.008789,0.888889,9e-05,0.005334,thousand serb demonstrate second consecutive n...,0.877551
54780,The Wall Street Journal,WSJ,8304333127,MEDIA_NEWS_COMPANY,US,"Breaking news, investigative reporting, busine...",2008-02-11 22:26:53,6449115.0,6486277.0,2020-04-12 20:31:24 EDT,...,-0.002236,-0.125102,0.056946,0.075866,0.030497,1.0,7.5e-05,0.006331,life hack weather pandemic read advice submit ...,0.877551
47271,Breitbart,Breitbart,95475020353,MEDIA_NEWS_COMPANY,US,Breitbart News (www.breitbart.com) is a conser...,2008-11-10 23:31:33,4375096.0,5274462.0,2020-08-24 02:46:39 EDT,...,-0.040229,-0.054519,-0.040611,-0.023919,-0.133633,2.6,0.000171,0.001878,official yale university tell student email la...,0.877551
2055,CNN,cnn,5550296508,MEDIA_NEWS_COMPANY,US,Instant breaking news alerts and the most talk...,2007-11-07 22:14:27,34014289.0,37313450.0,2020-11-23 06:30:14 EST,...,-0.004923,-0.115955,-0.084191,-0.030739,0.019418,0.714286,4e-05,0.003129,coronavirus app dog concern privacy particular...,0.877551
38708,The New York Times,nytimes,5281959998,MEDIA_NEWS_COMPANY,US,Welcome to The New York Times on Facebook - a ...,2007-10-29 23:03:34,17230730.0,17205574.0,2020-05-03 18:55:08 EDT,...,-0.225541,-0.141413,0.061886,-0.011966,-0.087397,1.0,0.000303,0.012441,life across look different coronavirus shutdown,0.877551
19132,USA TODAY,usatoday,13652355666,MEDIA_NEWS_COMPANY,US,"We bring clarity to the news of the day, helpi...",2008-04-09 18:29:16,8252548.0,8729284.0,2020-10-16 15:11:10 EDT,...,-0.122385,-0.099188,-0.247581,-0.279762,-0.202927,0.333333,0.00036,0.006089,preview upcoming appearance cbs sunday morning...,0.877551


In [40]:
df_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84385 entries, 0 to 84384
Data columns (total 60 columns):
 #   Column                                                                                                                Non-Null Count  Dtype  
---  ------                                                                                                                --------------  -----  
 0   Page Name                                                                                                             84385 non-null  object 
 1   User Name                                                                                                             84385 non-null  object 
 2   Facebook Id                                                                                                           84385 non-null  int64  
 3   Page Category                                                                                                         84385 non-null  object 
 4   Page

As we can see from the metadata above, 79,786 of the 84,385 posts have a non-null score: about 94.5% of the posts received a score, while the other 5.5% are null (`document_average_liberty` returns NaN when none of a post's tokens is found in the lexicon).
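
The coverage figures quoted above can be re-derived with a few lines of arithmetic. The sketch below first shows the idea on a made-up list of scores, then applies it to the two counts reported in the text. The helper name `scored_share` and the toy list are assumptions for this example; on the real dataframe, something like `df_posts["Liberty/oppression"].notna().mean()` should give the same share.

```python
import math

def scored_share(scores):
    """Fraction of entries that are not NaN."""
    scored = sum(1 for s in scores if not math.isnan(s))
    return scored / len(scores)

# Toy example: three of four posts received a score
toy_scores = [0.62, float("nan"), 0.55, 0.82]
toy_share = scored_share(toy_scores)

# Counts reported in the text above
reported_share = 79_786 / 84_385
scored_pct = round(100 * reported_share, 1)
null_pct = round(100 * (1 - reported_share), 1)
print(toy_share, scored_pct, null_pct)
```
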

In [41]:
df_posts.to_csv('covid_data_final_with_scores.csv', index=False)