# Liberty Dictionary and Scoring

## Installing some required dependencies

In [1]:
!pip install --quiet xlrd==1.2.0

In [2]:
pip install --upgrade xlrd

Collecting xlrd
  Using cached xlrd-2.0.1-py2.py3-none-any.whl (96 kB)
Installing collected packages: xlrd
  Attempting uninstall: xlrd
    Found existing installation: xlrd 1.2.0
    Uninstalling xlrd-1.2.0:
      Successfully uninstalled xlrd-1.2.0
Successfully installed xlrd-2.0.1
Note: you may need to restart the kernel to use updated packages.


In [3]:
# load libraries
import json
import re
import string

import numpy as np
import pandas as pd
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import MinMaxScaler

In [4]:
try:
    nlp = spacy.load("en_core_web_sm")
    nlp_reduced = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])
except OSError as error:
    if "Can't find model 'en_core_web_sm'" in error.args[0]:
        print('Downloading files required by the Spacy language processing library (this is only required once)')
        spacy.cli.download('en_core_web_sm')
    nlp = spacy.load("en_core_web_sm")
    nlp_reduced = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])

In [5]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/brinxu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/brinxu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/brinxu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
# read questionnaire data
data = pd.read_excel("questionnaire.xls")

In [7]:
# let's have a look at the metadata
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10538 entries, 0 to 10537
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Candidate Words      10527 non-null  object 
 1    Liberty/oppression  10524 non-null  float64
dtypes: float64(1), object(1)
memory usage: 164.8+ KB


In [8]:
data.sample(5)

Unnamed: 0,Candidate Words,Liberty/oppression
9861,unused,5.0
2692,inhibit,7.0
10004,obtrude,4.0
7966,emancipation,7.0
9560,sufficient,5.0


## Build Dictionary
### Data Preprocessing
Before building the `Liberty/oppression` dictionary, we prepare the dataset: we first fix the name of the score column, which carries a leading space in the spreadsheet, and then normalize the annotated values to a scale from 0 to 1.

In [9]:
data.rename(columns={' Liberty/oppression': 'Liberty/oppression'}, inplace=True)

In [10]:
data.sample(5)

Unnamed: 0,Candidate Words,Liberty/oppression
9640,not,4.0
7693,could,1.0
7689,concede,3.0
8399,working,1.0
2988,excellent,5.0


Our dataframe is now almost ready to be turned into a dictionary whose keys are the `Candidate Words` and whose values are the `Liberty/oppression` scores. Before that, we average the scores the annotators gave to each candidate word and then normalize the averages.

In [11]:
data["Candidate Words"].value_counts()

able            11
maltreatment    11
hostility       11
humiliation     11
hypocrisy       11
                ..
make            10
need            10
1                1
4                1
5                1
Name: Candidate Words, Length: 960, dtype: int64

The output above shows that most words were annotated 11 times and some 10 times. The exceptions are the last three values, `1`, `4` and `5`, which appear only once each; they are not meaningful candidate words and should be removed.
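
The notebook does not show this removal step, so the following is only a minimal sketch of one way to do it. The toy dataframe and the `is_word` helper are hypothetical, and the sketch assumes the stray entries are either numeric cells or digit-only strings.

```python
import pandas as pd

# Toy stand-in for the questionnaire data (values are invented for illustration).
toy = pd.DataFrame({
    "Candidate Words": ["able", "able", 1, "4", "need", None],
    "Liberty/oppression": [3.0, 4.0, None, None, 2.0, 5.0],
})

def is_word(value):
    """Hypothetical helper: keep non-empty strings that are not pure digits."""
    return isinstance(value, str) and value.strip() != "" and not value.strip().isdigit()

# Boolean mask over the word column; rows that fail the check are dropped.
mask = toy["Candidate Words"].apply(is_word)
cleaned = toy[mask].reset_index(drop=True)

print(cleaned["Candidate Words"].tolist())
```

Applied to the real data, a filter like this would run before the `groupby`, and the resulting lexicon would then contain fewer entries than the one built without it.
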
## Averaging scores by candidate words

In [12]:
# averaging Liberty/oppression scores by candidate words
data_avg = data.groupby("Candidate Words").mean().reset_index()

In [13]:
data_avg.sample(5)

Unnamed: 0,Candidate Words,Liberty/oppression
469,individual,3.727273
715,reasonable,3.545455
333,excellent,2.909091
483,institution,3.909091
153,civic,3.545455


Let's see how the `Liberty/oppression` scores are distributed after averaging.

In [14]:
data_avg.describe()

Unnamed: 0,Liberty/oppression
count,960.0
mean,3.333496
std,0.676836
min,1.0
25%,2.9
50%,3.272727
75%,3.818182
max,5.454545


As we can see, after averaging, the `Liberty/oppression` scores range from `1` to `5.454545`. This scale is awkward to work with when we average scores at the document level, so we normalize the values first. Normalization is a technique often applied as part of data preparation for machine learning: it maps the values of a numeric column onto a common scale without distorting the relative differences between them.

In [15]:
# original value of the word "truth"
data_avg[data_avg["Candidate Words"] == "truth"] 

Unnamed: 0,Candidate Words,Liberty/oppression
881,truth,3.181818


The values are transformed with scikit-learn's `MinMaxScaler`, which subtracts the feature's minimum from each value and then divides by the range, that is, the difference between the original maximum and the original minimum.

Max = 5.454545
Min = 1.000000

Applying the `MinMaxScaler` formula

$$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$$

to the averaged score of the word "truth", we get:

In [16]:
x_truth = (3.181818 - 1)/(5.454545 - 1)
print(x_truth)

0.4897959275301966
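
The same arithmetic can be wrapped in a small pure-Python function to see how a whole column would be rescaled. This is an illustrative sketch only: `min_max_scale` is a hypothetical helper, not part of scikit-learn, and the three-value list simply reuses the minimum, the score for "truth" and the maximum quoted above.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has zero range, so the formula is undefined.
        raise ValueError("all values are identical; the range is zero")
    return [(v - lo) / (hi - lo) for v in values]

# Minimum, the averaged score for "truth", and maximum from the notebook.
scores = [1.0, 3.181818, 5.454545]
scaled = min_max_scale(scores)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and the middle value matches the hand computation for "truth".
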


In [17]:
# apply normalization techniques 
column = 'Liberty/oppression'
data_avg[column] = MinMaxScaler().fit_transform(np.array(data_avg[column]).reshape(-1,1))

# view normalized data  
data_avg.sample(5)

Unnamed: 0,Candidate Words,Liberty/oppression
628,own,0.612245
457,imposed,0.55102
936,want,0.653061
439,humiliation,0.530612
217,damage,0.469388


In [18]:
# view the 20 highest-scoring words
data_avg.sort_values(by="Liberty/oppression", ascending=False).head(20)

Unnamed: 0,Candidate Words,Liberty/oppression
298,emancipation,1.0
389,freedom,0.979592
541,liberated,0.959184
603,obstruction,0.938776
233,demobilize,0.918367
70,autonomic,0.918367
71,autonomous,0.918367
144,choose,0.918367
895,unenslaved,0.897959
543,liberties,0.897959


In [19]:
# let's get a bit more insights about our normalization output 
data_avg.describe()

Unnamed: 0,Liberty/oppression
count,960.0
mean,0.523846
std,0.151943
min,0.0
25%,0.426531
50%,0.510204
75%,0.632653
max,1.0


As the table above shows, the column has been rescaled so that its minimum is 0 and its maximum is 1.
Now that the dataset is normalized, we can start building the dictionary.

We loop over the dataframe of averaged scores and store each candidate word with its score in a new dictionary, which we will use later to look up the Liberty score of a given word. A Python dictionary is the most suitable format for this task: you provide the key (the candidate word) and it returns the Liberty score, without having to scan the dataframe.

In [20]:
# initialize an empty dict
liberty_lexicon = {}

# loop over our dataframe
for index, row in data_avg.iterrows():
  liberty_lexicon[row["Candidate Words"]] = row["Liberty/oppression"]

print(f"there are {len(liberty_lexicon)} elements in our dictionary for Liberty/oppression scores")

there are 960 elements in our dictionary for Liberty/oppression scores
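
With 960 rows the explicit loop is perfectly adequate. As an alternative sketch, the same mapping can be built without iterating row by row by zipping the two columns together. The toy dataframe below is hypothetical and only reuses a few of the normalized scores shown earlier.

```python
import pandas as pd

# Toy stand-in for data_avg, reusing three normalized scores from the notebook.
toy_avg = pd.DataFrame({
    "Candidate Words": ["emancipation", "freedom", "damage"],
    "Liberty/oppression": [1.0, 0.979592, 0.469388],
})

# Pair each word with its score and build the dictionary in one step.
toy_lexicon = dict(zip(toy_avg["Candidate Words"], toy_avg["Liberty/oppression"]))
print(toy_lexicon)
```
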


In [21]:
# test our dictionary 
liberty_lexicon["power"]

0.6326530612244899
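
Indexing with square brackets raises a `KeyError` for any word that is not in the lexicon. A hedged sketch of a more forgiving lookup uses `dict.get`. The `liberty_score` helper, the lowercasing of the input, and the choice of `None` for unknown words are all assumptions made for illustration; the small dictionary only reuses scores shown above.

```python
# Tiny sample lexicon reusing scores from the notebook output.
sample_lexicon = {
    "emancipation": 1.0,
    "freedom": 0.979592,
    "power": 0.6326530612244899,
}

def liberty_score(word, lexicon, default=None):
    """Return the score for `word`, or `default` if it is not in the lexicon.

    Lowercasing is an assumption: the keys shown in the notebook are lowercase.
    """
    return lexicon.get(word.lower(), default)

print(liberty_score("Freedom", sample_lexicon))
print(liberty_score("banana", sample_lexicon))
```

Returning a sentinel such as `None` lets the caller decide whether to skip unknown words or treat them as neutral.
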

In [22]:
# save the dictionary as a json file
with open("liberty_lexicon.json", "w") as f:
  json.dump(liberty_lexicon, f, indent=4)
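
To use the lexicon in a later session it has to be read back from disk. The following is a minimal round-trip sketch that writes a tiny sample dictionary to a temporary directory and loads it again with `json.load`; in the notebook itself the file would be `liberty_lexicon.json` in the working directory.

```python
import json
import os
import tempfile

# Tiny sample lexicon reusing scores from the notebook output.
sample_lexicon = {"emancipation": 1.0, "power": 0.6326530612244899}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "liberty_lexicon.json")
    # Write the dictionary the same way the notebook does.
    with open(path, "w") as f:
        json.dump(sample_lexicon, f, indent=4)
    # Read it back into a plain Python dict.
    with open(path) as f:
        reloaded = json.load(f)

print(reloaded == sample_lexicon)
```
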