# Liberty scoring at a post level
The approach used in this notebook is similar to the one used [here](https://github.com/oaraque/moral-foundations/blob/9d84f014fb257ce5d6cd77b48ed104edc911e31e/moralstrength/moralstrength.py#L46): the score of a sentence is the average of the annotations of its words. A word that isn't found in our lexicon is skipped.

To make the approach concrete, suppose we have the sentence below:

sentence = "I am feeling exhausted"

The function defined below first splits this sentence into tokens such as ["I", "am", "feeling", "exhausted"]. It then loops over the tokens and takes the lemmatized version of each one, giving something like ["i", "be", "feel", "exhausted"]. For each lemma it tries to look up a Liberty score in the lemmatized_liberty_lexicon dictionary. If the lemma exists there, its score is added to the sum variable, which is initialised to 0, and the counter recognized_words_no is incremented. At the end, the sum of the scores found is divided by recognized_words_no, the number of words that were found. A word that isn't in the Liberty dictionary is ignored.
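
The averaging rule described above can be sketched without spaCy as follows. This is only an illustration under stated assumptions: the `toy_lexicon` values are made up (they are not taken from `liberty_lexicon.json`), the helper `average_score` is hypothetical, and the lemmas are written out by hand rather than produced by a lemmatizer.

```python
import math

# Hypothetical scores for illustration only; not values from liberty_lexicon.json
toy_lexicon = {"feel": 0.5, "exhausted": 0.25}

def average_score(lemmas, lexicon):
    """Average the lexicon scores of the lemmas that are found; skip the rest."""
    found = [lexicon[lemma] for lemma in lemmas if lemma in lexicon]
    if not found:
        return float("nan")  # no recognised word at all
    return sum(found) / len(found)

# Hand-written lemmas for "I am feeling exhausted"
lemmas = ["i", "be", "feel", "exhausted"]
score = average_score(lemmas, toy_lexicon)
print(score)  # average of the two scores found; "i" and "be" are ignored

# A text with no recognised word yields NaN rather than 0
print(math.isnan(average_score(["unknown"], toy_lexicon)))
```

Only the two lemmas present in the toy lexicon enter the average, so the divisor is 2 and not 4.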

In [1]:
# load libraries
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import json
from nltk.tokenize import word_tokenize
import spacy
import string
from nltk.corpus import stopwords
import re



In [2]:
try:
    nlp = spacy.load("en_core_web_sm")
    nlp_reduced = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])
except OSError as error:
    if "Can't find model 'en_core_web_sm'" in error.args[0]:
        print('Downloading files required by the Spacy language processing library (this is only required once)')
        spacy.cli.download('en_core_web_sm')
    nlp = spacy.load("en_core_web_sm")
    nlp_reduced = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])

In [3]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/brinxu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/brinxu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/brinxu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
import json

with open('liberty_lexicon.json', 'r') as f:
    lemmatized_liberty_lexicon = json.load(f)

In [5]:
lemmatized_liberty_lexicon["freedom"]

0.9795918367346939

In [6]:
def document_average_liberty(text):
    # initialize the sum of scores and the number of tokens found in our lexicon
    sum = 0
    recognized_words_no = 0

    # loop over each token in the text and look up its liberty score in the lexicon
    for token in nlp(text):
        try:
            # get the liberty score of the lemmatized token
            liberty_score = lemmatized_liberty_lexicon[token.lemma_]
        except KeyError:
            # the lemma isn't in the lexicon, so the token is skipped silently
            # print(f"The word {token} was not found in the dictionary")
            continue
        # add the score to the running sum (an empty or zero score adds nothing)
        if liberty_score:
            sum += liberty_score
        # increment the number of words that were found in the dictionary
        recognized_words_no += 1

    if recognized_words_no == 0:
        # no token was found in the lexicon; change this to `return 0` if you want 0 instead of NaN
        return float('NaN')
    else:
        return sum/recognized_words_no

In [7]:
# test the function over a sample text
document_average_liberty("limiting freedom is")

0.5782312925170069

Even though the token is doesn't exist in our dictionary, lemmatization transforms it to its base verb be, which allows us to get its Liberty score so that it contributes to the document score.

Our next step is to load the posts data and preprocess it to normalize the text. We can then call the document_average_liberty function defined above to return the score for each post.

In [8]:
# load posts data 
df_posts = pd.read_csv("covid_data_final.csv")



In [9]:
df_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82389 entries, 0 to 82388
Data columns (total 48 columns):
 #   Column                                                                                                              Non-Null Count  Dtype  
---  ------                                                                                                              --------------  -----  
 0   Page Name                                                                                                           82389 non-null  object 
 1   User Name                                                                                                           82389 non-null  object 
 2   Facebook Id                                                                                                         82389 non-null  int64  
 3   Page Category                                                                                                       82389 non-null  object 
 4   Page Admin Top C

In [10]:
# get stopwords list
stoplist = stopwords.words('english')
# get list of punctuations
punctuations = string.punctuation + "’¶•@°©®™"

In [11]:
# the next step is normalizing the text data so it is ready for the scoring calculation
def preprocess_text(text):
    """
    @param text string
    @return text string

    This function preprocesses a given raw text: it lowercases it, removes stop words
    and punctuation, and lemmatizes the remaining words
    """

    # string to lowercase
    txt = text.lower()

    # keep only the letters a-z and the characters in the À-ÿ range; everything else becomes a space
    txt = re.sub(r"[^a-zA-ZÀ-ÿ]", " ", txt)

    # map any remaining punctuation to spaces
    translator = str.maketrans(punctuations, " "*len(punctuations))
    s = txt.translate(translator)

    # remove digits and collapse repeated whitespace
    no_digits = ''.join([i for i in s if not i.isdigit()])
    cleaner = " ".join(no_digits.split())

    # word_tokenize transforms the text from a simple string into a list of tokens: "a b c d" ==> ["a", "b", "c", "d"]
    word_tokens = word_tokenize(cleaner)
    # keep only the words that don't appear in the stoplist variable we created above
    filtered_sentence = [w for w in word_tokens if w not in stoplist]
    # the stopwords list isn't exhaustive, so meaningless one-character words may remain; it's better to drop them
    filtered_sentence = [w for w in filtered_sentence if len(w) > 1]
    # the opposite of word_tokenize: ["a", "b", "c", "d"] ==> "a b c d"
    filtered_sentence = " ".join(filtered_sentence)

    # lemmatization: the commented-out line is an alternative, the spaCy version below is the one in use
    # filtered_sentence = " ".join([lemmatize_word(word) for word in word_tokenize(filtered_sentence)])
    filtered_sentence = " ".join([token.lemma_ for token in nlp(filtered_sentence)])

    return filtered_sentence

In [12]:
example_tweet = df_posts["Message"][9]
print("Post before preprocessing : \n {}\n".format(example_tweet))
clean_tweet = preprocess_text(example_tweet)
print("Post After preprocessing : \n {}".format(clean_tweet))

Post before preprocessing : 
 It's another sign that airlines see a recovery from the pandemic on the horizon

Post After preprocessing : 
 another sign airline see recovery pandemic horizon
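
To see what the regular expression in `preprocess_text` does on its own, here is a small standard-library sketch. The `tiny_stoplist` is a hypothetical stand-in for NLTK's English stop-word list, `rough_clean` is a hypothetical helper, and lemmatization is left out, so the output is not identical to that of `preprocess_text`.

```python
import re

# Hypothetical, much shorter stand-in for stopwords.words('english')
tiny_stoplist = {"it", "s", "that", "a", "from", "the", "on"}

def rough_clean(text, stoplist):
    """Lowercase, keep letters only, drop stop words and one-character tokens."""
    txt = text.lower()
    # Same pattern as in preprocess_text: any character outside a-z, A-Z and À-ÿ becomes a space
    txt = re.sub(r"[^a-zA-ZÀ-ÿ]", " ", txt)
    tokens = txt.split()
    return " ".join(t for t in tokens if t not in stoplist and len(t) > 1)

cleaned = rough_clean("It's sign #1 that 2 airlines see a recovery!", tiny_stoplist)
print(cleaned)
```

In this sketch the digits, the apostrophe, `#` and `!` are all replaced by the first substitution, which suggests that the later punctuation and digit steps in `preprocess_text` mostly act as a safety net.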


In [13]:
!pip install --quiet mapply

In this step we apply the preprocess_text function created above to all the messages. A normal apply would process the rows one by one, which is very slow. To speed this up we use the mapply package, which runs the function in parallel instead of sequentially.
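
The difference between a sequential and a parallel map can be sketched with the standard library alone. This is not how mapply is implemented: the sketch uses threads from `concurrent.futures` purely to stay short and self-contained, whereas a CPU-bound function such as `preprocess_text` needs separate processes to actually run faster. The function `shout` and the `messages` list are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def shout(text):
    """Toy per-row function standing in for preprocess_text."""
    return text.upper()

messages = ["first post", "second post", "third post"]

# Sequential: one row after another
sequential = [shout(m) for m in messages]

# Concurrent: rows are handed to a pool of workers; map() keeps the input order
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = list(pool.map(shout, messages))

print(parallel == sequential)
```

The point of the sketch is that both variants return the same results in the same order; only the way the work is scheduled changes.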

In [14]:
# to parallelize the processing function over the whole dataframe
import mapply
# the value n_workers=-1 means we use all the cores available on our CPU
mapply.init(n_workers=-1)

In [15]:
# convert the values in messages to string format
df_posts["Message"] = df_posts["Message"].astype(str)
# now let's apply this preprocessing function over all our text data in the dataframe
df_posts["clean_post"] = df_posts["Message"].mapply(preprocess_text)

  0%|                                                    | 0/40 [00:00<?, ?it/s]

In [16]:
df_posts.sample(5)

Unnamed: 0,Page Name,User Name,Facebook Id,Page Category,Page Admin Top Country,Page Description,Page Created,Likes at Posting,Followers at Posting,Post Created,...,sanctity_p,care_sent,fairness_sent,loyalty_sent,authority_sent,sanctity_sent,moral_nonmoral_ratio,f_var,sent_var,clean_post
69232,Reuters,Reuters,114050161948682,MEDIA_NEWS_COMPANY,US,Welcome to Reuters news on Facebook. We share ...,2010-04-16 17:16:47,4453778,4868378.0,2020-08-29 22:12:07 EDT,...,0.052078,-0.050607,-0.093808,-0.029321,-0.104793,-0.075743,1.0,4.7e-05,0.000959,victoria australia post new case coronavirus s...
81214,Reuters,Reuters,114050161948682,MEDIA_NEWS_COMPANY,US,Welcome to Reuters news on Facebook. We share ...,2010-04-16 17:16:47,4239432,4286030.0,2020-02-23 06:48:31 EST,...,0.056791,-0.237856,0.010871,0.010495,-0.107358,-0.14392,2.0,0.000175,0.01132,follow list international sport event affect o...
79315,Reuters,Reuters,114050161948682,MEDIA_NEWS_COMPANY,US,Welcome to Reuters news on Facebook. We share ...,2010-04-16 17:16:47,4269121,4331947.0,2020-03-23 10:49:43 EDT,...,0.095366,-0.191624,-0.073602,-0.125628,-0.099619,-0.198258,1.444444,0.000101,0.00307,india say shut domestic flight halt spread cor...
35310,The New York Times,nytimes,5281959998,MEDIA_NEWS_COMPANY,US,Welcome to The New York Times on Facebook - a ...,2007-10-29 23:03:34,17687192,17964640.0,2021-06-02 09:15:12 EDT,...,0.087273,-0.282908,-0.082825,-0.057915,-0.172957,-0.133163,1.428571,0.000107,0.007846,crime emerge significant issue pandemic contex...
2420,CNN,cnn,5550296508,MEDIA_NEWS_COMPANY,US,Instant breaking news alerts and the most talk...,2007-11-07 22:14:27,34004742,37292022.0,2020-11-19 09:13:01 EST,...,0.085698,-0.075863,-0.049757,0.063734,0.003463,-0.06981,1.0,0.000261,0.003472,year old michigan er doctor go run marathon ba...


Now that our dataframe has the cleaned text data, we are ready to call our document_average_liberty function to score each post.

In [17]:
df_posts["Liberty/oppression"] = df_posts["clean_post"].mapply(document_average_liberty)

  0%|                                                    | 0/40 [00:00<?, ?it/s]

In [18]:
# Explore the results
df_posts.sample(5)

Unnamed: 0,Page Name,User Name,Facebook Id,Page Category,Page Admin Top Country,Page Description,Page Created,Likes at Posting,Followers at Posting,Post Created,...,care_sent,fairness_sent,loyalty_sent,authority_sent,sanctity_sent,moral_nonmoral_ratio,f_var,sent_var,clean_post,Liberty/oppression
31133,MSNBC,msnbc,273864989376427,MEDIA_NEWS_COMPANY,US,The destination for in-depth analysis of daily...,2012-05-14 16:26:44,2355067,2380637.0,2020-05-02 01:01:10 EDT,...,-0.099867,-0.069057,-0.066499,-0.015203,-0.013154,1.142857,8.2e-05,0.001413,dr fauci say state follow federal guideline be...,0.520408
66005,Reuters,Reuters,114050161948682,MEDIA_NEWS_COMPANY,US,Welcome to Reuters news on Facebook. We share ...,2010-04-16 17:16:47,4571126,4984939.0,2020-11-27 15:20:13 EST,...,-0.012237,0.013245,0.043233,0.04075,0.01482,1.333333,0.000189,0.00052,canada next week reveal breadth emergency spen...,0.468878
77552,Reuters,Reuters,114050161948682,MEDIA_NEWS_COMPANY,US,Welcome to Reuters news on Facebook. We share ...,2010-04-16 17:16:47,4306970,4393287.0,2020-04-10 06:47:37 EDT,...,-0.110028,-0.0319,0.011207,-0.098941,0.005719,0.666667,0.000141,0.00326,china wuhan keep test resident coronavirus loc...,0.536892
27585,MSNBC,msnbc,273864989376427,MEDIA_NEWS_COMPANY,US,The destination for in-depth analysis of daily...,2012-05-14 16:26:44,2402454,2458834.0,2020-10-10 18:15:51 EDT,...,-0.090325,-0.004022,0.02703,-0.02175,-0.096039,1.9,7e-05,0.002937,coronavirus case set new single day record sta...,0.510204
14394,NPR,NPR,10643211755,BROADCASTING_MEDIA_PRODUCTION,US,"NPR is an independent, nonprofit media organiz...",2008-02-21 00:53:35,6612554,7181657.0,2020-03-24 07:45:04 EDT,...,-0.054901,-0.035541,-0.023704,-0.027798,-0.063129,0.538462,0.000157,0.000297,president trump hail anti malarial possible ga...,0.506122


In [19]:
# View the top 20 scores
df_posts.sort_values(by="Liberty/oppression", ascending=False).head(20)

Unnamed: 0,Page Name,User Name,Facebook Id,Page Category,Page Admin Top Country,Page Description,Page Created,Likes at Posting,Followers at Posting,Post Created,...,care_sent,fairness_sent,loyalty_sent,authority_sent,sanctity_sent,moral_nonmoral_ratio,f_var,sent_var,clean_post,Liberty/oppression
11990,NPR,NPR,10643211755,BROADCASTING_MEDIA_PRODUCTION,US,"NPR is an independent, nonprofit media organiz...",2008-02-21 00:53:35,6654070,7697756.0,2020-07-30 20:36:36 EDT,...,0.172398,0.204521,0.168465,0.14544,0.099991,0.555556,0.000191,0.001501,listen billie eilish new song future write rec...,0.979592
15346,Washington Post,washingtonpost,6250307292,BROADCASTING_MEDIA_PRODUCTION,US,Our award-winning journalists have covered Was...,2007-11-07 18:26:05,6623823,7005003.0,2021-06-01 09:00:48 EDT,...,-0.095622,-0.013458,0.01857,0.008436,-0.032978,1.25,0.000326,0.002045,anthony fauci pandemic email march april obtai...,0.979592
79901,Reuters,Reuters,114050161948682,MEDIA_NEWS_COMPANY,US,Welcome to Reuters news on Facebook. We share ...,2010-04-16 17:16:47,4258046,4316801.0,2020-03-18 04:40:30 EDT,...,-0.027828,-0.107975,-0.114024,-0.120254,-0.103167,1.0,0.00087,0.001437,china expel american journalist three newspape...,0.979592
8303,Daily Kos,dailykos,43179984254,NEWS_SITE,US,News you can do something about.,2009-01-10 01:42:46,1291981,1230317.0,2021-05-05 23:35:57 EDT,...,-0.193133,-0.177365,0.040152,-0.067462,-0.111529,0.166667,0.0004,0.008869,news roundup vaccine patent waiver mcconnell v...,0.938776
52265,The Wall Street Journal,WSJ,8304333127,MEDIA_NEWS_COMPANY,US,"Breaking news, investigative reporting, busine...",2008-02-11 22:26:53,6565735,6766909.0,2021-06-02 11:30:55 EDT,...,-0.104178,-0.077201,-0.06167,-0.005496,-0.057586,1.571429,0.0007,0.001305,oil rich country venezuela lack vaccine perfor...,0.918367
19516,USA TODAY,usatoday,13652355666,MEDIA_NEWS_COMPANY,US,"We bring clarity to the news of the day, helpi...",2008-04-09 18:29:16,8454194,9041160.0,2021-03-11 11:03:46 EST,...,0.024818,-0.038912,0.113412,0.16602,0.009502,0.4,0.001446,0.006884,heartbreaking heartwarming moment year life pa...,0.877551
33750,MSNBC,msnbc,273864989376427,MEDIA_NEWS_COMPANY,US,The destination for in-depth analysis of daily...,2012-05-14 16:26:44,2325559,2336835.0,2020-03-18 00:33:17 EDT,...,-0.093369,-0.051603,0.000963,-0.042207,-0.16822,1.333333,0.000682,0.004084,fundamental quality health crisis living depen...,0.877551
82332,NBC,nbc,89742860745,TV_NETWORK,US,America’s Most Watched Network. The official F...,2009-04-21 21:27:43,3030100,3255731.0,2020-09-11 17:14:36 EDT,...,0.131403,0.156768,0.2578,0.214991,0.212431,0.272727,0.000785,0.002538,connect brand new comedy tackle love life lock...,0.877551
23054,USA TODAY,usatoday,13652355666,MEDIA_NEWS_COMPANY,US,"We bring clarity to the news of the day, helpi...",2008-04-09 18:29:16,8043243,8245757.0,2020-02-02 13:51:22 EST,...,-0.061467,0.163779,0.226652,0.227953,0.212205,0.5,0.00041,0.015164,life like epicenter coronavirus outbreak http ...,0.877551
14375,NPR,NPR,10643211755,BROADCASTING_MEDIA_PRODUCTION,US,"NPR is an independent, nonprofit media organiz...",2008-02-21 00:53:35,6612554,7181657.0,2020-03-24 19:11:05 EDT,...,-0.02549,0.023466,-0.051592,-0.080446,0.019548,0.636364,0.000822,0.002023,tony award win american playwright terrence mc...,0.877551


In [20]:
df_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82389 entries, 0 to 82388
Data columns (total 50 columns):
 #   Column                                                                                                              Non-Null Count  Dtype  
---  ------                                                                                                              --------------  -----  
 0   Page Name                                                                                                           82389 non-null  object 
 1   User Name                                                                                                           82389 non-null  object 
 2   Facebook Id                                                                                                         82389 non-null  int64  
 3   Page Category                                                                                                       82389 non-null  object 
 4   Page Admin Top C

As we can see from the metadata above, we have 79,786 non-null scores on a dataset of 84,385, which means about 94.5% of the scores are set while the other 5.5% are null (posts in which no word was found in the lexicon).
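
The share of scored posts can also be computed directly instead of being read off the metadata. The sketch below uses a short hypothetical list of scores, not the real column, with NaN marking a post that has no token in the lexicon.

```python
import math

# Hypothetical scores; NaN marks a post with no token in the lexicon
scores = [0.52, float("nan"), 0.47, 0.98, float("nan")]

scored = sum(1 for s in scores if not math.isnan(s))
share_scored = scored / len(scores)
print(f"{scored} of {len(scores)} posts scored ({share_scored:.1%})")
```

On the real DataFrame, `df_posts["Liberty/oppression"].notna().mean()` should give the same share in one call.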

In [21]:
df_posts.to_csv('covid_data_final_with_scores.csv', index=False)