# Liberty Dictionary and Scoring

## Installing some required dependencies

In [1]:
!pip install --quiet xlrd==1.2.0



In [2]:
# load libraries
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import json
from nltk.tokenize import word_tokenize
import spacy
import string
from nltk.corpus import stopwords
import re

In [3]:
try:
    nlp = spacy.load("en_core_web_sm")
    nlp_reduced = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])
except OSError as error:
    if "Can't find model 'en_core_web_sm'" in error.args[0]:
        print('Downloading files required by the Spacy language processing library (this is only required once)')
        spacy.cli.download('en_core_web_sm')
    nlp = spacy.load("en_core_web_sm")
    nlp_reduced = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])

In [4]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/brinxu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/brinxu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/brinxu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
# read questionnaire data
data = pd.read_excel("questionnaire.xls")

In [6]:
# let's have a look at the metadata
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10538 entries, 0 to 10537
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Candidate Words      10527 non-null  object 
 1    Liberty/oppression  10524 non-null  float64
dtypes: float64(1), object(1)
memory usage: 164.8+ KB


In [7]:
data.sample(5)

Unnamed: 0,Candidate Words,Liberty/oppression
5329,increase,0.0
2563,nationalism,5.0
119,fine,5.0
4296,express,0.0
9311,funded,5.0


## Build Dictionary
### Data Preprocessing
Before building the dictionary of `Liberty/oppression` scores, we prepare the dataset: we first tidy up the column names (the score column is stored with a leading space), and then we normalize the annotated values to a scale between 0 and 1.

In [8]:
data.rename(columns={' Liberty/oppression': 'Liberty/oppression'}, inplace=True)

In [9]:
data.sample(5)

Unnamed: 0,Candidate Words,Liberty/oppression
9842,unchain,3.0
8949,justice,7.0
1827,going,3.0
4598,enclose,1.0
6575,becoming,2.0


Our dataframe is now ready to be transformed into a dictionary whose keys are the `Candidate Words` and whose values are the `Liberty/oppression` scores. Before doing that, we average the scores given by the annotators to each candidate word, and then we normalize the averaged values.

In [10]:
data["Candidate Words"].value_counts()

torture      11
bans         11
equality     11
oppressed    11
patch        11
             ..
make         10
deduct       10
4             1
5             1
1             1
Name: Candidate Words, Length: 960, dtype: int64

The output above shows that most words were annotated 11 times and some 10 times. The exceptions are the last three values, `1`, `4` and `5`, which appear only once; they are not meaningful candidate words and should be deleted.
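
The notebook does not show the deletion itself. A minimal sketch of one way to do it, assuming the stray entries are the only non-string values in `Candidate Words`, is given below on a small hypothetical frame (`toy` and its values are illustrative, not taken from the questionnaire):

```python
import pandas as pd

# Hypothetical miniature of the questionnaire data (values are made up)
toy = pd.DataFrame({
    "Candidate Words": ["freedom", "ban", 4, "freedom", 1],
    "Liberty/oppression": [5.0, 1.0, 3.0, 4.0, 2.0],
})

# Keep only the rows whose candidate word is an actual string
is_word = toy["Candidate Words"].apply(lambda w: isinstance(w, str))
toy_clean = toy[is_word].reset_index(drop=True)

print(toy_clean["Candidate Words"].tolist())
```

If the stray values were read as strings such as `"4"` rather than as numbers, a check like `str.isdigit()` would be needed instead.
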
## Averaging scores by candidate words

In [11]:
# averaging Liberty/oppression scores by candidate words
data_avg = data.groupby("Candidate Words").mean().reset_index()

In [12]:
data_avg.sample(5)

Unnamed: 0,Candidate Words,Liberty/oppression
172,complicate,2.454545
705,raise,3.272727
760,restraint,4.0
818,sound,3.0
185,conscience,2.636364


Let's explore how the `Liberty/oppression` scores are distributed after averaging.

In [13]:
data_avg.describe()

Unnamed: 0,Liberty/oppression
count,960.0
mean,3.333496
std,0.676836
min,1.0
25%,2.9
50%,3.272727
75%,3.818182
max,5.454545


As we can see, after averaging, the `Liberty/oppression` scores range from `1` to `5.454545`. This scale is not convenient when we later average scores at the document level, which is why normalization is a mandatory step. Normalization is a technique often applied as part of data preparation for machine learning. Its goal is to bring the values of numeric columns to a common scale without distorting the differences in the ranges of values.

In [14]:
# original value of the word "truth"
data_avg[data_avg["Candidate Words"] == "truth"] 

Unnamed: 0,Candidate Words,Liberty/oppression
881,truth,3.181818


The values are transformed with `MinMaxScaler`, which subtracts the minimum value of the feature and then divides by the range. The range is the difference between the original maximum and the original minimum.

Max = 5.454545
Min = 1.000000

So, by applying the `MinMaxScaler` formula

$$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$$

to the score of the word "truth", we get:

In [15]:
x_truth = (3.181818 - 1)/(5.454545 - 1)
print(x_truth)

0.4897959275301966
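
The same computation can be written as a small helper. This is a sketch of the min-max formula only, not of scikit-learn's implementation; `min_max_scale` is a hypothetical name, and the input list reuses the minimum, the score of "truth" and the maximum quoted above.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # assumption of this sketch: refuse to scale a constant column
        raise ValueError("min-max scaling is undefined when all values are equal")
    return [(v - lo) / (hi - lo) for v in values]

scores = [1.0, 3.181818, 5.454545]
scaled = min_max_scale(scores)
print(scaled)
```

The first and last elements come out as 0 and 1, and the middle one matches the value computed by hand for "truth".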


In [16]:
# apply normalization techniques 
column = 'Liberty/oppression'
data_avg[column] = MinMaxScaler().fit_transform(np.array(data_avg[column]).reshape(-1,1))

# view normalized data  
data_avg.sample(5)

Unnamed: 0,Candidate Words,Liberty/oppression
138,charitable,0.367347
481,inside,0.367347
265,discrimination,0.714286
842,suffer,0.612245
483,institution,0.653061


In [17]:
# View top 20 sample
data_avg.sort_values(by="Liberty/oppression", ascending=False).head(20)

Unnamed: 0,Candidate Words,Liberty/oppression
298,emancipation,1.0
389,freedom,0.979592
541,liberated,0.959184
603,obstruction,0.938776
233,demobilize,0.918367
70,autonomic,0.918367
71,autonomous,0.918367
144,choose,0.918367
895,unenslaved,0.897959
543,liberties,0.897959


In [18]:
# let's get a bit more insights about our normalization output 
data_avg.describe()

Unnamed: 0,Liberty/oppression
count,960.0
mean,0.523846
std,0.151943
min,0.0
25%,0.426531
50%,0.510204
75%,0.632653
max,1.0


As we can see from the table above our column has been transformed into a range of values where the min is 0 and the max is 1.
Now that our dataset is well normalized, we can start building our dictionary.

We then loop over the dataframe of averaged scores and build a dictionary that we will use later to get the Liberty score of a given word. A Python dictionary is the most suitable format for this task: given the key, which is the candidate word, it returns the Liberty score directly, without needing to loop over the dataframe.

In [19]:
# initialize an empty dict
liberty_lexicon = {}

# loop over our dataframe
for index, row in data_avg.iterrows():
  liberty_lexicon[row["Candidate Words"]] = row["Liberty/oppression"]

print(f"there are {len(liberty_lexicon)} elements in our dictionary for Liberty/oppression scores")

there are 960 elements in our dictionary for Liberty/oppression scores


In [20]:
# test our dictionary 
liberty_lexicon["power"]

0.6326530612244899

In [21]:
# save the dictionary as a json file
with open("liberty_lexicon.json", "w") as f:
  json.dump(liberty_lexicon, f, indent=4)


In the same way, we save a second version of the lexicon in which the candidate words are lemmatized, so that words such as `banning` and `ban` map to the same score, since they share the same meaning.

In [22]:
def lemmatize_word(word):
  return ''.join([token.lemma_ for token in nlp(word)])

In [23]:
# test the function of lemmatization
lemmatize_word("banning")

'ban'

In [24]:
# initialize an empty dict
lemmatized_liberty_lexicon = {}

# loop over our dataframe
for index, row in data_avg.iterrows():
  lemmatized_liberty_lexicon[lemmatize_word(str(row["Candidate Words"]))] = row["Liberty/oppression"]

print(f"there are {len(lemmatized_liberty_lexicon)} elements in our dictionary for Liberty/oppression scores")

there are 874 elements in our dictionary for Liberty/oppression scores


In [25]:
# save the dictionary as a json file
with open("lemmatized_liberty_lexicon.json", "w") as f:
  json.dump(lemmatized_liberty_lexicon, f, indent=4)

We can see that after lemmatization the lexicon was reduced from 960 to 874 entries, which means that many words share the same lemma and differ only in their suffixes.
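
One consequence of the loop above is that when several candidate words share a lemma, the score stored last overwrites the earlier ones. If that is not the intended behaviour, one alternative is to average the scores per lemma. The sketch below uses a hypothetical `toy_lemmas` mapping in place of spaCy and made-up scores:

```python
from collections import defaultdict

# Made-up normalized scores and a hypothetical word -> lemma mapping
toy_scores = {"ban": 0.20, "bans": 0.30, "banning": 0.10, "freedom": 0.98}
toy_lemmas = {"ban": "ban", "bans": "ban", "banning": "ban", "freedom": "freedom"}

# Collect every score that maps to the same lemma
scores_by_lemma = defaultdict(list)
for word, score in toy_scores.items():
    scores_by_lemma[toy_lemmas[word]].append(score)

# Average the collected scores instead of keeping only the last one
averaged_lexicon = {
    lemma: sum(scores) / len(scores) for lemma, scores in scores_by_lemma.items()
}
print(averaged_lexicon)
```

Here the three inflections of "ban" contribute equally to a single entry, whereas the overwrite behaviour would keep only the score of whichever form was processed last.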

## Liberty scoring at a post level
The approach we use in this notebook is similar to the one used [here](https://github.com/oaraque/moral-foundations/blob/9d84f014fb257ce5d6cd77b48ed104edc911e31e/moralstrength/moralstrength.py#L46), which averages the annotations of the words in the sentence; a word that isn't found in the lexicon is skipped.

To make the approach concrete, suppose we have the sentence `"I am feeling exhausted"`. The function below first splits it into tokens, `["I", "am", "feeling", "exhausted"]`, and takes the lemmatized version of each token, `["i", "be", "feel", "exhausted"]`. For each lemma, it looks up the Liberty score in the `lemmatized_liberty_lexicon` dictionary; if the lemma exists there, its score is added to a running sum that starts at 0, and the counter `recognized_words_no` is incremented. At the end, the sum is divided by `recognized_words_no`, the number of words found. A word that isn't found in the Liberty dictionary is ignored.

In [26]:
def document_average_liberty(text):
  """Average the liberty scores of the lemmas in `text` that appear in the lexicon."""

  # running total of scores and number of tokens found in our lexicon
  total = 0.0
  recognized_words_no = 0

  # loop over each token in the text and get its liberty score from the lexicon
  for token in nlp(text):
    # look up the lemmatized token; None means the lemma is not in the lexicon
    liberty_score = lemmatized_liberty_lexicon.get(token.lemma_)
    if liberty_score is None:
      continue
    # a score of 0.0 is a valid score, so it is counted like any other
    total += liberty_score
    # increment the number of words that were found in the dictionary
    recognized_words_no += 1

  # if no word was recognized, the document score is undefined
  if recognized_words_no == 0:
    return float('NaN')
  return total / recognized_words_no

In [27]:
# test the function over a sample text
document_average_liberty("limiting freedom is")

0.6938775510204082

Even though the token `is` doesn't exist in our dictionary, after lemmatization it is transformed to its base verb `be`, which allows us to get its Liberty score and have it contribute to the document score.
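
The averaging rule can also be checked in isolation. The sketch below is a simplified stand-in that takes already-lemmatized tokens instead of running spaCy; `toy_lexicon`, its scores and `average_score` are hypothetical.

```python
import math

# Made-up lexicon: lemma -> normalized score
toy_lexicon = {"freedom": 0.9, "limit": 0.3, "be": 0.6}

def average_score(lemmas, lexicon):
    """Mean score of the lemmas found in the lexicon; NaN if none is found."""
    found = [lexicon[lemma] for lemma in lemmas if lemma in lexicon]
    if not found:
        return float("nan")
    return sum(found) / len(found)

all_known = average_score(["limit", "freedom", "be"], toy_lexicon)  # all three lemmas are recognized
one_unknown = average_score(["limit", "pizza"], toy_lexicon)        # "pizza" is skipped
none_known = average_score(["pizza"], toy_lexicon)                  # nothing recognized
print(all_known, one_unknown, math.isnan(none_known))
```

Unknown words change neither the sum nor the count, and a text with no recognized word yields NaN rather than 0, so unscored posts stay distinguishable from posts with a low score.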

Our next step is to load the posts data and do some further preprocessing to normalize the text, after which we can call our `document_average_liberty` function to return the score of each post.

In [29]:
# load posts data 
df_posts = pd.read_csv("final.csv")



In [30]:
df_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84385 entries, 0 to 84384
Data columns (total 58 columns):
 #   Column                                                                                                                Non-Null Count  Dtype  
---  ------                                                                                                                --------------  -----  
 0   Page Name                                                                                                             84385 non-null  object 
 1   User Name                                                                                                             84385 non-null  object 
 2   Facebook Id                                                                                                           84385 non-null  int64  
 3   Page Category                                                                                                         84385 non-null  object 
 4   Page

In [31]:
# get stopwords list
stoplist = stopwords.words('english')
# get list of punctuations
punctuations = string.punctuation + "’¶•@°©®™"

In [32]:
# the next step will be normalizing text data to be ready for scoring calculation
def preprocess_text(text):
    """
    @param text string
    @return text string
    
    This function preprocesses a given raw text: it lowercases it, removes stop words
    and punctuation, and lemmatizes the remaining words
    """
        
    #string to lowercase
    txt = text.lower()
    
    # keep only letters (including accented Latin letters); everything else becomes a space
    txt = re.sub(r"[^a-zA-ZÀ-ÿ]", " ", txt)
    
    #punctuation removal and map it to space
    translator = str.maketrans(punctuations, " "*len(punctuations))
    s = txt.translate(translator)
    
    #remove digits 
    no_digits = ''.join([i for i in s if not i.isdigit()])
    cleaner = " ".join(no_digits.split())
    
    # word_tokenize turns the string into a list of tokens: "a b c d" ==> ["a", "b", "c", "d"]
    word_tokens = word_tokenize(cleaner)
    # keep only the words that don't appear in the stoplist created above
    filtered_sentence = [w for w in word_tokens if w not in stoplist]
    # the stopword list isn't exhaustive, and meaningless one-character tokens may remain, so we drop them
    filtered_sentence = [w for w in filtered_sentence if len(w) > 1]
    # the opposite of word_tokenize: ["a", "b", "c", "d"] ==> "a b c d"
    filtered_sentence = " ".join(filtered_sentence)
    
    # a double layer lemmatization word block
    # filtered_sentence = " ".join([lemmatize_word(word) for word in word_tokenize(filtered_sentence)])
    filtered_sentence = " ".join([token.lemma_ for token in nlp(filtered_sentence)])
    
    return filtered_sentence

In [33]:
example_tweet = df_posts["Link Text"][9]
print("Post before preprocessing : \n {}\n".format(example_tweet))
clean_tweet = preprocess_text(example_tweet)
print("Post After preprocessing : \n {}".format(clean_tweet))

Post before preprocessing : 
 Millions of jobs and a shortage of applicants. Welcome to the new economy

Post After preprocessing : 
 million job shortage applicant welcome new economy
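
A side effect of the regular-expression step in `preprocess_text` is that it already replaces digits and punctuation with spaces, so the later punctuation and digit removal steps have little or nothing left to remove. The sketch below isolates that step on a made-up string (`letters_only` is a hypothetical helper, not part of the notebook):

```python
import re

def letters_only(text):
    """Lowercase the text and replace every non-letter character with a space."""
    lowered = text.lower()
    # same pattern as in preprocess_text: keep a-z, A-Z and the À-ÿ range
    kept = re.sub(r"[^a-zA-ZÀ-ÿ]", " ", lowered)
    # collapse runs of whitespace into single spaces
    return " ".join(kept.split())

cleaned = letters_only("COVID-19: 3 new cases, says Dr. O’Neil!")
print(cleaned)
```

Note that the apostrophe is also replaced, which splits the name into two tokens; the one-character token is then dropped by the length filter in `preprocess_text`.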


In [34]:
!pip install --quiet mapply



In this step we apply the `preprocess_text` function created above to all the `Link Text` values. A plain `apply` processes the rows one by one, which is very slow, so to speed things up we use `mapply`, which runs the function in parallel instead of sequentially.

In [35]:
# to parallelize the processing function over the whole dataframe
import mapply
# n_workers=-1 means we use all the cores available on our CPU
mapply.init(n_workers=-1)

In [36]:
# convert the values in Link Text to string format
df_posts["Link Text"] = df_posts["Link Text"].astype(str)
# now let's apply this preprocessing function over all our text data in the dataframe
df_posts["clean_post"] = df_posts["Link Text"].mapply(preprocess_text)


In [37]:
df_posts.sample(5)

Unnamed: 0,Page Name,User Name,Facebook Id,Page Category,Page Admin Top Country,Page Description,Page Created,Likes at Posting,Followers at Posting,Post Created,...,sanctity_p,care_sent,fairness_sent,loyalty_sent,authority_sent,sanctity_sent,moral_nonmoral_ratio,f_var,sent_var,clean_post
305,CNN,cnn,5550296508,MEDIA_NEWS_COMPANY,US,Instant breaking news alerts and the most talk...,2007-11-07 22:14:27,34552092.0,38270526.0,2021-05-12 16:16:09 EDT,...,0.103192,-0.007492,0.02756,0.037903,0.020948,-0.0617,2.0,0.000174,0.00161,cdc adviser vote recommend use pfizer covid va...
14353,Washington Post,washingtonpost,6250307292,BROADCASTING_MEDIA_PRODUCTION,US,Our award-winning journalists have covered Was...,2007-11-07 18:26:05,6621511.0,7002726.0,2021-05-14 13:20:07 EDT,...,0.035711,-0.135998,-0.141972,-0.160462,-0.215775,-0.03943,0.571429,0.000122,0.004072,reasonable discuss end pandemic yes caveat
35445,The New York Times,nytimes,5281959998,MEDIA_NEWS_COMPANY,US,Welcome to The New York Times on Facebook - a ...,2007-10-29 23:03:34,17553738.0,17687321.0,2020-12-06 15:55:07 EST,...,0.062676,-0.097176,-0.000404,-0.084751,-0.050399,0.180276,0.75,0.000756,0.012779,new home test tell whether get flu coronavirus
78041,Reuters,Reuters,114050161948682,MEDIA_NEWS_COMPANY,US,Welcome to Reuters news on Facebook. We share ...,2010-04-16 17:16:47,4230220.0,4272945.0,2020-01-29 04:20:39 EST,...,0.050888,-0.022955,-0.064962,-0.014832,0.004124,0.024462,1.333333,4e-05,0.001121,chinese family diagnose virus uae first know c...
54879,The Wall Street Journal,WSJ,8304333127,MEDIA_NEWS_COMPANY,US,"Breaking news, investigative reporting, busine...",2008-02-11 22:26:53,6441955.0,6475155.0,2020-04-08 19:30:29 EDT,...,0.058086,-0.159142,-0.089731,-0.078473,-0.101883,-0.122882,1.0,0.00045,0.001013,opinion world watch america handle coronavirus


Now that our dataframe has the cleaned text data, we are ready to call our `document_average_liberty` function to score each document in it.

In [38]:
df_posts["Liberty/oppression"] = df_posts["clean_post"].mapply(document_average_liberty)


In [39]:
# let's explore the results
df_posts.sample(5)

Unnamed: 0,Page Name,User Name,Facebook Id,Page Category,Page Admin Top Country,Page Description,Page Created,Likes at Posting,Followers at Posting,Post Created,...,care_sent,fairness_sent,loyalty_sent,authority_sent,sanctity_sent,moral_nonmoral_ratio,f_var,sent_var,clean_post,Liberty/oppression
35500,The New York Times,nytimes,5281959998,MEDIA_NEWS_COMPANY,US,Welcome to The New York Times on Facebook - a ...,2007-10-29 23:03:34,17552271.0,17683748.0,2020-12-03 04:55:05 EST,...,-0.10286,-0.016855,-0.060084,-0.012929,-0.092539,1.0,4e-05,0.001732,college entrance exam hour long covid make hard,0.510204
84330,Scientific American,science chennel,22297920245,MAGAZINE,US,Scientific American is the authoritative sourc...,2008-02-20 16:53:34,3209963.0,3188777.0,2020-03-23 09:18:47 EDT,...,0.048515,0.043009,0.07508,-0.077849,-0.04431,1.285714,0.000295,0.004367,lesson past outbreak could help fight coronavi...,0.561224
25257,MSNBC,msnbc,273864989376427,MEDIA_NEWS_COMPANY,US,The destination for in-depth analysis of daily...,2012-05-14 16:26:44,2402914.0,2459894.0,2020-10-15 11:35:50 EDT,...,-0.148861,-0.043843,-0.011597,0.001109,0.084936,1.5,0.000285,0.007159,harris cancel travel top aide test positive covid,0.285714
80487,New Scientist,science chennel,235877164588,MEDIA_NEWS_COMPANY,GB,The best place to find out what’s new in scien...,2010-01-04 15:10:07,3570398.0,3530349.0,2020-04-14 12:04:35 EDT,...,-0.090999,-0.028762,-0.041914,-0.074891,-0.073537,0.666667,6.2e-05,0.000662,psychology tip maintain social relationship lo...,0.571429
74407,Reuters,Reuters,114050161948682,MEDIA_NEWS_COMPANY,US,Welcome to Reuters news on Facebook. We share ...,2010-04-16 17:16:47,4295144.0,4372295.0,2020-04-03 05:24:41 EDT,...,-0.16858,-0.032044,-0.069213,-0.023514,-0.090103,1.375,8.6e-05,0.003375,australia say true coronavirus infection could...,0.469388


In [40]:
# View the top 20 scores
df_posts.sort_values(by="Liberty/oppression", ascending=False).head(20)

Unnamed: 0,Page Name,User Name,Facebook Id,Page Category,Page Admin Top Country,Page Description,Page Created,Likes at Posting,Followers at Posting,Post Created,...,care_sent,fairness_sent,loyalty_sent,authority_sent,sanctity_sent,moral_nonmoral_ratio,f_var,sent_var,clean_post,Liberty/oppression
2465,CNN,cnn,5550296508,MEDIA_NEWS_COMPANY,US,Instant breaking news alerts and the most talk...,2007-11-07 22:14:27,33564706.0,36743751.0,2020-10-17 05:00:26 EDT,...,-0.236038,-0.149378,-0.139987,-0.158084,-0.147382,0.777778,0.000181,0.001567,biden blame trump liberate michigan tweet plot...,0.959184
23066,MSNBC,msnbc,273864989376427,MEDIA_NEWS_COMPANY,US,The destination for in-depth analysis of daily...,2012-05-14 16:26:44,2472629.0,2568699.0,2021-01-22 08:24:17 EST,...,0.065083,0.063298,0.102093,0.033883,0.05763,1.571429,0.000223,0.0006,biden unveil wartime strategy fauci feels libe...,0.959184
16692,Washington Post,washingtonpost,6250307292,BROADCASTING_MEDIA_PRODUCTION,US,Our award-winning journalists have covered Was...,2007-11-07 18:26:05,6394352.0,6536016.0,2020-04-17 07:20:58 EDT,...,-0.097381,-0.077367,0.021771,-0.081325,-0.037194,0.714286,0.001382,0.002299,holocaust survivor die coronavirus year libera...,0.959184
22903,MSNBC,msnbc,273864989376427,MEDIA_NEWS_COMPANY,US,The destination for in-depth analysis of daily...,2012-05-14 16:26:44,2477066.0,2575050.0,2021-01-29 13:02:00 EST,...,-0.0145,0.107393,0.096114,0.098489,0.096776,1.083333,6.6e-05,0.002629,feel like try white house reach gop despite pa...,0.938776
55018,The Wall Street Journal,WSJ,8304333127,MEDIA_NEWS_COMPANY,US,"Breaking news, investigative reporting, busine...",2008-02-11 22:26:53,6434385.0,6463490.0,2020-04-02 21:30:29 EDT,...,-0.0179,-0.050063,0.09036,0.129566,0.059716,0.454545,0.000135,0.005596,opinion coronavirus medal freedom,0.897959
33576,The New York Times,nytimes,5281959998,MEDIA_NEWS_COMPANY,US,Welcome to The New York Times on Facebook - a ...,2007-10-29 23:03:34,17659280.0,17904137.0,2021-04-12 17:00:49 EDT,...,-0.16691,-0.172288,-0.067778,-0.18089,-0.076599,0.7,0.001308,0.003105,freedom england begin reopen month lockdown,0.897959
57987,Reuters,Reuters,114050161948682,MEDIA_NEWS_COMPANY,US,Welcome to Reuters news on Facebook. We share ...,2010-04-16 17:16:47,4736351.0,5330181.0,2021-04-12 18:30:15 EDT,...,0.014588,0.121625,0.085999,0.096319,0.178817,0.636364,0.000223,0.003546,england pub reopen major step freedom,0.897959
72472,Reuters,Reuters,114050161948682,MEDIA_NEWS_COMPANY,US,Welcome to Reuters news on Facebook. We share ...,2010-04-16 17:16:47,4328973.0,4428290.0,2020-04-25 11:12:59 EDT,...,-0.141015,-0.128001,-0.094028,-0.135654,-0.245253,1.090909,7.5e-05,0.003242,spain kid prepare taste freedom six week lockdown,0.897959
34610,The New York Times,nytimes,5281959998,MEDIA_NEWS_COMPANY,US,Welcome to The New York Times on Facebook - a ...,2007-10-29 23:03:34,17610707.0,17785628.0,2021-01-27 07:40:10 EST,...,-0.085363,0.030992,-0.030484,-0.007293,-0.115098,1.125,6.2e-05,0.003472,gov kristi noem rebrande failure freedom,0.897959
58017,Reuters,Reuters,114050161948682,MEDIA_NEWS_COMPANY,US,Welcome to Reuters news on Facebook. We share ...,2010-04-16 17:16:47,4735589.0,5328674.0,2021-04-11 18:35:04 EDT,...,0.014588,0.121625,0.085999,0.096319,0.178817,0.583333,0.000223,0.003546,english shop pub garden reopen major step freedom,0.897959


In [41]:
df_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84385 entries, 0 to 84384
Data columns (total 60 columns):
 #   Column                                                                                                                Non-Null Count  Dtype  
---  ------                                                                                                                --------------  -----  
 0   Page Name                                                                                                             84385 non-null  object 
 1   User Name                                                                                                             84385 non-null  object 
 2   Facebook Id                                                                                                           84385 non-null  int64  
 3   Page Category                                                                                                         84385 non-null  object 
 4   Page

As we can see from the metadata above, we have 56,088 non-null scores in a dataset of 84,385 posts, which means that about 66% of the posts received a score while the remaining 34% are null.
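
The percentages follow directly from the two counts quoted above. A quick check of the arithmetic is shown below; on the real dataframe, `df_posts["Liberty/oppression"].notna().mean()` would be expected to give the same share, assuming unscored posts hold NaN in that column.

```python
non_null_scores = 56_088   # posts with a Liberty/oppression score (count quoted above)
total_posts = 84_385       # total number of posts

scored_share = non_null_scores / total_posts
print(f"scored: {scored_share:.1%}, unscored: {1 - scored_share:.1%}")
```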

In [None]:
df_posts.to_csv('final_with_scores.csv', index=False)