# eMFDscore Tutorial

© Media Neuroscience Lab  
October 2020

***

This notebook provides a tutorial on how to use eMFDscore for extracting various moral information metrics from textual input.  
Specifically, this tutorial shows the reader how to use the eMFDscore tool effectively, either on the command line (macOS and Linux) or in Python (Windows, macOS, and Linux).  
In addition, this tutorial also demonstrates which scoring options are appropriate for particular tasks.  
For more detailed background information on the eMFD, please consult the respective [publication](https://link.springer.com/article/10.3758/s13428-020-01433-0).

Finally, when using eMFDscore, please consider "starring" the GitHub repository and citing the following article: 

Hopp, F. R., Fisher, J. T., Cornell, D., Huskey, R., & Weber, R. (2020). The extended Moral Foundations Dictionary (eMFD):  
Development and applications of a crowd-sourced approach to extracting moral intuitions from text.   
_Behavior Research Methods_, https://doi.org/10.3758/s13428-020-01433-0

***

To run this tutorial interactively, you should clone the eMFDscore GitHub repository and follow the installation instructions below.

## 1. Set-up Your Environment

eMFDscore requires a Python installation (v3.7+). If your machine does not have Python installed,  
we recommend installing Python by downloading and installing either Anaconda or Miniconda for your OS.

As a best practice, we recommend installing eMFDscore into a virtual conda environment.  
Hence, you should first create a virtual environment by executing the following command in your terminal:

`$ conda create -n emfd python=3.7` 

Once the environment has been created, activate it via:

`$ source activate emfd`

Next, you must install spaCy, which is the main natural language processing backend that eMFDscore is built on:

`$ conda install -c conda-forge spacy`  
`$ python -m spacy download en_core_web_sm`

Finally, you can install eMFDscore by copying, pasting, and executing the following command:

`pip install https://github.com/medianeuroscience/emfdscore/archive/master.zip`

In addition, if you plan to run eMFDscore in an interactive Python environment (IPython) or in Jupyter notebooks, we encourage you to install JupyterLab into the `emfd` environment:  
`conda install -c conda-forge jupyterlab`
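
To confirm that the environment is ready before moving on, a quick check along the following lines may help. This is only a sketch: `check_packages` is a hypothetical helper (it is not part of eMFDscore), and it assumes that the packages installed above are importable under the names `spacy`, `en_core_web_sm`, and `emfdscore`.

```python
import importlib.util


def check_packages(names=("spacy", "en_core_web_sm", "emfdscore")):
    """Return a dict mapping each package name to True if Python can locate it."""
    # find_spec returns None for a top-level package that is not installed.
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Report which of the assumed package names are available in this environment.
for name, found in check_packages().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

Any package reported as missing points to an installation step above that should be repeated inside the activated `emfd` environment.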



## 2. Using eMFDScore

eMFDscore is a versatile tool that can be run either from the command line or directly from Python.  
Note that if you are on a **Windows** machine, you must run eMFDscore from a Python environment. 

In this tutorial, we will load a few packages to inspect the output of eMFDScore's computed metrics.  
These packages must be installed/available in your conda environment, but are not necessary for  
eMFDscore to run properly. 

In [1]:
import pandas as pd 
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

### Options for Document Scoring

With eMFDscore, you have several options to extract moral information metrics from textual corpora.   
Below, we go over these options one by one.  

When scoring documents with the extended Moral Foundations Dictionary (eMFD; default in eMFDscore),  
you must decide how you would like to use the eMFD for scoring textual documents.  

As a reminder, each of the 3020 words in the eMFD is assigned the following scores: 
- `Foundation Probabilities`: Each word is assigned 5 probabilities that denote the likelihood  
that this word is associated with each of the five moral foundations identified by Moral  
Foundations Theory. For example, the word "kill" has a care probability of 0.4 and  
a loyalty probability of 0.24, meaning that there is a 40% chance that a coder highlighted a context in which the word "kill" appeared  
with the care-harm foundation and a 24% chance that this context was highlighted with the loyalty-betrayal foundation.

- `Sentiment Scores`: Each word is assigned 5 sentiment scores that denote the average sentiment  
of the foundation context in which this word appeared. For example, the word "kill" has an average  
"care_sent" of -0.69, meaning that all "care-harm" highlights in which "kill" appeared had an average,  
negative sentiment of -0.69.

***

Based on these scores, there are two options for how they can be "mapped" when scoring a  
document (flag `prob_map` below):

1. Use `all` probabilities per word in the eMFD (option `all`):  
=> Using all five foundation probabilities assumes that each word is an indicator for multiple foundations, with the probabilities serving as weights. 
2. Assign a `single` probability to each word in the eMFD according to the foundation with the highest probability (option `single`):  
=> Each word indicates only **one** foundation (the one with the highest foundation probability), and each time this word is found,  
the respective foundation score is increased by that word's foundation probability.
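
The difference between the two mappings can be sketched on a toy lexicon. The code below is an illustration only, not eMFDscore's implementation: `score_tokens` and `TOY_EMFD` are hypothetical names, only the care (0.40) and loyalty (0.24) probabilities for "kill" come from the text above, all other numbers are invented, and averaging over the matched words is an assumption made here for readability.

```python
FOUNDATIONS = ("care", "fairness", "loyalty", "authority", "sanctity")

# Toy lexicon. Only the care and loyalty values for "kill" are taken from the
# tutorial text; every other number is invented for illustration.
TOY_EMFD = {
    "kill": {"care": 0.40, "fairness": 0.10, "loyalty": 0.24,
             "authority": 0.15, "sanctity": 0.11},
    "honor": {"care": 0.12, "fairness": 0.18, "loyalty": 0.30,
              "authority": 0.35, "sanctity": 0.05},
}


def score_tokens(tokens, lexicon, prob_map="all"):
    """Accumulate foundation probabilities for tokens found in the lexicon."""
    if prob_map not in ("all", "single"):
        raise ValueError("prob_map must be 'all' or 'single'")
    totals = dict.fromkeys(FOUNDATIONS, 0.0)
    matched = 0
    for token in tokens:
        probs = lexicon.get(token)
        if probs is None:
            continue  # word is not in the dictionary
        matched += 1
        if prob_map == "all":
            # Every foundation receives the word's probability as a weight.
            for foundation in FOUNDATIONS:
                totals[foundation] += probs[foundation]
        else:
            # Only the foundation with the highest probability is credited.
            top = max(FOUNDATIONS, key=lambda f: probs[f])
            totals[top] += probs[top]
    if matched == 0:
        return totals
    # Assumption: normalize by the number of matched words.
    return {f: totals[f] / matched for f in FOUNDATIONS}


tokens = ["they", "kill", "without", "honor"]
print(score_tokens(tokens, TOY_EMFD, "all"))
print(score_tokens(tokens, TOY_EMFD, "single"))
```

With `single`, "kill" contributes only to care and "honor" only to authority in this toy lexicon, so the remaining foundations stay at zero; with `all`, every foundation receives some weight from both words.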

***

In addition, you can decide whether you want eMFDScore to return the average sentiment for each  
foundation, or whether you would like eMFDScore to split each foundation  
into a `vice` and `virtue`  category (flag `output_metrics` below):

1. Return the average `sentiment` for each foundation (option `sentiment`) 
2. Split foundations into a `vice-virtue` category (option `vice-virtue`). 

The vice-virtue split is accomplished by considering the average sentiment of each foundation of each  
word, and then assigning this word to "virtue" if the foundation sentiment is positive,  
or to "vice" if the sentiment is negative.  
For instance, when using the `all` option for `prob_map` above, a word's foundation probabilities  
will be translated into five `virtue` scores (i.e., care, fairness, loyalty, authority, and sanctity)  
if the word's sentiment for these foundations is positive, whereas a word whose sentiment for each  
foundation is negative will be assigned five `vice` scores (i.e., harm, cheating, betrayal, subversion, and degradation). 
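
The split described above can be sketched for a single word as follows. This is a hedged illustration rather than eMFDscore's code: `split_vice_virtue` and the `<foundation>.virtue` / `<foundation>.vice` key names are hypothetical, only the care probability (0.40), the loyalty probability (0.24), and the care sentiment (-0.69) for "kill" come from the text above, and the remaining values are invented.

```python
FOUNDATIONS = ("care", "fairness", "loyalty", "authority", "sanctity")


def split_vice_virtue(probs, sents):
    """Route each foundation probability to 'virtue' or 'vice' by sentiment sign."""
    scores = {}
    for foundation in FOUNDATIONS:
        prob, sent = probs[foundation], sents[foundation]
        # Positive sentiment counts towards the virtue side, negative towards
        # the vice side; a sentiment of exactly zero counts towards neither.
        scores[f"{foundation}.virtue"] = prob if sent > 0 else 0.0
        scores[f"{foundation}.vice"] = prob if sent < 0 else 0.0
    return scores


# Probabilities and sentiments for "kill". Apart from the three values named
# in the lead-in, these numbers are invented.
kill_probs = {"care": 0.40, "fairness": 0.10, "loyalty": 0.24,
              "authority": 0.15, "sanctity": 0.11}
kill_sents = {"care": -0.69, "fairness": -0.30, "loyalty": -0.25,
              "authority": 0.10, "sanctity": -0.40}

for key, value in split_vice_virtue(kill_probs, kill_sents).items():
    print(f"{key}: {value}")
```

In this toy example the care probability of "kill" lands entirely on the vice side (harm), because its care sentiment is negative.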

***

Based on the above, there are four different ways in which the eMFD can be used.  
The specific usage and use case of each are explained below. 
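
The four combinations follow directly from crossing the two `prob_map` options with the two `output_metrics` options. A minimal sketch (the order in which the combinations are printed is arbitrary and does not imply a numbering):

```python
from itertools import product

PROB_MAPS = ("all", "single")
OUTPUT_METRICS = ("sentiment", "vice-virtue")

# Cross the two flags to list every way the eMFD can be applied.
combinations = list(product(PROB_MAPS, OUTPUT_METRICS))
for prob_map, output_metrics in combinations:
    print(f"prob_map={prob_map!r}, output_metrics={output_metrics!r}")
```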

#### eMFDScore Command-Line Options

A typical command for eMFDScore specifies the following:

`$ emfdscore [INPUT_FILE] [OUTPUT_FILE] [SCORING_METHOD] [DICTIONARY_TYPE] [PROB_MAP] [OUTPUT_METRICS]`

When using eMFDscore, several inputs need to be defined in a specific order: 

- [INPUT_FILE] = The path to a CSV file in which the first column contains the document texts to be scored.  
  Each row should reflect its own document. See the template_input.csv for an example file format.
  
  
- [OUTPUT_FILE] = Specifies the file name of the generated output csv.


- [SCORING_METHOD] = Currently, eMFDscore offers four different scoring algorithms:
    - `bow` is a classical Bag-of-Words approach in which the algorithm simply searches for word matches between document texts and the specified dictionary.
    - `pat` (in development) relies on named entity recognition and syntactic dependency parsing. For each document, the algorithm first extracts all mentioned entities.  
    Next, for each entity, eMFDscore extracts words that pertain to 1) moral verbs for which the entity is an agent argument (Agent verbs), 2) moral verbs for  
    which the entity is the patient, theme, or other argument (Patient verbs), and 3) other moral attributes (e.g., adjectival modifiers and appositives).  
    Only the eMFD is currently supported for PAT extraction. Additionally, this method is more computationally expensive and thus has a longer execution time.
    - `wordlist` is a simple scoring algorithm that lets users examine the moral content of individual words. This scoring method expects a CSV where each row corresponds  
    to a unique word. Note: The wordlist scoring algorithm does not perform any tokenization or preprocessing on the wordlists.   
    For a more fine-grained moral content extraction, users are encouraged to use either the `bow` or `pat` methodology.
    - `gdelt.ngrams` is designed for the Global Database of Events, Language, and Tone Television Ngram dataset.   
    This scoring method expects a unigram (1gram) input text file from GDELT and will score each unprocessed (untokenized) unigram with the eMFD.
    
    
- [DICTIONARY_TYPE] = Declares which dictionary is applied to score documents. In its current version, eMFDscore lets users choose between three dictionaries:
    - `emfd` = extended Moral Foundations Dictionary (eMFD)
    - `mfd2` = Moral Foundations Dictionary 2.0 (Frimer et al., 2017; https://osf.io/xakyw/ )
    - `mfd` = original Moral Foundations Dictionary (https://moralfoundations.org/othermaterials)


- When choosing the eMFD, the following two additional flags need to be defined:
    - [PROB_MAP]: How are the foundation probabilities mapped when scoring a document? 
        - `all` : use all probabilities per word in the eMFD
        - `single`: Assign a single probability to each word in the eMFD according to the foundation with the highest probability
           
    - [OUTPUT_METRICS]: Which metrics are returned? 
        - `sentiment`: Return the average sentiment for each foundation
        - `vice-virtue`: Split foundations into a vice-virtue category
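
To make the argument order concrete, the sketch below assembles a command from the options listed above and rejects values that are not in those lists. `build_emfdscore_command` is a hypothetical helper, not part of eMFDscore; it only builds the argument list and does not run anything. Omitting the last two arguments for `mfd2` and `mfd` is an assumption based on the note that these flags are needed when choosing the eMFD.

```python
SCORING_METHODS = {"bow", "pat", "wordlist", "gdelt.ngrams"}
DICTIONARY_TYPES = {"emfd", "mfd2", "mfd"}
PROB_MAPS = {"all", "single"}
OUTPUT_METRICS = {"sentiment", "vice-virtue"}


def build_emfdscore_command(input_file, output_file, scoring_method,
                            dictionary_type, prob_map=None, output_metrics=None):
    """Return the positional argument list in the order described in this section."""
    if scoring_method not in SCORING_METHODS:
        raise ValueError(f"unknown scoring method: {scoring_method!r}")
    if dictionary_type not in DICTIONARY_TYPES:
        raise ValueError(f"unknown dictionary type: {dictionary_type!r}")
    command = ["emfdscore", input_file, output_file, scoring_method, dictionary_type]
    if dictionary_type == "emfd":
        # The eMFD requires both additional flags.
        if prob_map not in PROB_MAPS:
            raise ValueError(f"unknown prob_map: {prob_map!r}")
        if output_metrics not in OUTPUT_METRICS:
            raise ValueError(f"unknown output_metrics: {output_metrics!r}")
        command += [prob_map, output_metrics]
    return command


command = build_emfdscore_command(
    "template_input.csv", "all-sent.csv", "bow", "emfd", "all", "sentiment"
)
print(" ".join(command))
```

The resulting list could be passed to `subprocess.run` on macOS or Linux; on Windows, the Python interface remains the supported route, as noted earlier.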

#### Scoring Documents with the eMFD

Below, we illustrate the various text scoring options in eMFDscore.  
For this purpose, we will be using a CSV file in which each row corresponds to  
a news article text. 

In [2]:
template_input = pd.read_csv('test.csv', header=None)
template_input.head()

Unnamed: 0,0
0,A pair of US senators are asking the Biden adm...
1,There's little evidence the Delta variant can ...
2,The stark disparity between low and high vacci...
3,US Surgeon General Dr. Vivek Murthy says he is...
4,Kim Jong Un fired several senior officials who...


In [3]:
print(template_input.index.name)

None


In [4]:
#template_input = template_input.rename(columns={"Message": "0"})
template_input.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82389 entries, 0 to 82388
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       82389 non-null  object
dtypes: object(1)
memory usage: 643.8+ KB


In [5]:
template_input = pd.read_csv('covid_data.csv', usecols=["Message"])
#template_input = template_input.rename(columns={"Message": "0"})
template_input.head()

Unnamed: 0,Message
0,A pair of US senators are asking the Biden adm...
1,There's little evidence the Delta variant can ...
2,The stark disparity between low and high vacci...
3,US Surgeon General Dr. Vivek Murthy says he is...
4,Kim Jong Un fired several senior officials who...


In [6]:
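# Write the messages without header or index so that the texts sit in the
# first column, then reload them the same way (the column is then named 0).
template_input.to_csv('new_covid_df.csv', header=None, index=False)
template_input = pd.read_csv('new_covid_df.csv', header=None)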
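
The same preparation step can be wrapped in a small function that also checks that no rows were lost in the round trip. This is a sketch under stated assumptions: `to_headerless_csv` is a hypothetical helper, and the two example posts are invented.

```python
import os
import tempfile

import pandas as pd


def to_headerless_csv(frame, text_column, path):
    """Write one text column to a headerless CSV and return it reloaded."""
    frame[[text_column]].to_csv(path, header=False, index=False)
    reloaded = pd.read_csv(path, header=None)
    # The scores are later joined back by row position, so the number of
    # rows must survive the round trip unchanged.
    if len(reloaded) != len(frame):
        raise ValueError("row count changed while writing the input file")
    return reloaded


# Invented example data with the text in a column called "Message".
posts = pd.DataFrame({
    "Page Name": ["A", "B"],
    "Message": ["First post, with a comma.", "Second post."],
})

with tempfile.TemporaryDirectory() as tmp:
    scoring_input = to_headerless_csv(posts, "Message", os.path.join(tmp, "input.csv"))

print(scoring_input)
```

After reloading with `header=None`, the texts are available under the integer column label `0`.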

#### 1. Use All Probabilities per Word and Return Sentiment Scores 

This option should be used when one wants to extract the overall, holistic moral signal from a document.  
Note that because each word is assigned five foundation probabilities, there exist higher correlations  
across these foundations, making this method less suitable when one wants to
- use the foundation probabilities as predictor variables in statistical models
- discriminate which foundations are more or less represented in a text.  

For these cases, options (2) and (4) below should be preferred. 
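
One way to inspect how strongly the foundation probabilities move together is to compute their pairwise correlations on the scored output. The sketch below assumes the `*_p` column names that appear in the scoring output shown later in this tutorial and reuses the three rows displayed there; `foundation_correlations` is a hypothetical helper, and with only three documents the resulting values are not meaningful in themselves.

```python
import pandas as pd

PROB_COLUMNS = ["care_p", "fairness_p", "loyalty_p", "authority_p", "sanctity_p"]


def foundation_correlations(scored):
    """Return the Pearson correlation matrix of the foundation probabilities."""
    return scored[PROB_COLUMNS].corr()


# The three example rows shown in this tutorial's scoring output.
scored = pd.DataFrame({
    "care_p": [0.090455, 0.134069, 0.128494],
    "fairness_p": [0.119205, 0.113038, 0.126324],
    "loyalty_p": [0.130082, 0.085576, 0.108211],
    "authority_p": [0.139355, 0.090797, 0.083956],
    "sanctity_p": [0.085406, 0.088874, 0.093256],
})

print(foundation_correlations(scored).round(2))
```

On a full corpus, consistently high off-diagonal values would be a reason to prefer the `single` mapping when the scores are used as predictors.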

##### Using Python 

In [7]:
from emfdscore.scoring import score_docs 

num_docs = len(template_input)

DICT_TYPE = 'emfd'
PROB_MAP = 'all'
SCORE_METHOD = 'bow'
OUT_METRICS = 'sentiment'
OUT_CSV_PATH = 'all-sent.csv'

df = score_docs(template_input,DICT_TYPE,PROB_MAP,SCORE_METHOD,OUT_METRICS,num_docs)
df.to_csv(OUT_CSV_PATH, index=False)

Processed: 82389 100% |❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤| Elapsed Time: 0:04:20 Time:  0:04:20
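
Because scoring the full corpus takes several minutes (about 4 minutes for 82,389 documents in the run above), it can be useful to score a small sample first. The sketch below is one possible approach: `dry_run` is a hypothetical helper that forwards its arguments in the same order as the `score_docs` call above, and `stand_in_scorer` exists only so that the example runs without eMFDscore installed.

```python
import pandas as pd


def dry_run(texts, scorer, n=100, dict_type="emfd", prob_map="all",
            score_method="bow", out_metrics="sentiment"):
    """Score only the first n documents, using the argument order of score_docs."""
    sample = texts.head(n).reset_index(drop=True)
    return scorer(sample, dict_type, prob_map, score_method, out_metrics, len(sample))


# Stand-in scorer so this sketch runs on its own; in practice, pass
# `score_docs` imported from `emfdscore.scoring` instead.
def stand_in_scorer(csv, dict_type, prob_map, score_method, out_metrics, num_docs):
    return pd.DataFrame({"n_chars": csv[0].str.len()})


texts = pd.DataFrame({0: [f"document {i}" for i in range(250)]})
preview = dry_run(texts, stand_in_scorer, n=100)
print(len(preview))
```

With eMFDscore installed, the equivalent call would be `dry_run(template_input, score_docs)`.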


In [8]:
!pip install spacy



In [9]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [10]:
import spacy
print(spacy.__version__)

3.0.0


In [11]:
dictionary = {}
print(dictionary.get(0, "Key does not exist"))

Key does not exist


## Data Combination

In [12]:
df = pd.read_csv('all-sent.csv')
df.head(3)

Unnamed: 0,care_p,fairness_p,loyalty_p,authority_p,sanctity_p,care_sent,fairness_sent,loyalty_sent,authority_sent,sanctity_sent,moral_nonmoral_ratio,f_var,sent_var
0,0.090455,0.119205,0.130082,0.139355,0.085406,-0.121268,-0.095777,0.006735,-0.070749,-0.175241,0.777778,0.000574,0.004499
1,0.134069,0.113038,0.085576,0.090797,0.088874,0.042614,0.090464,0.14208,0.086809,0.046238,1.0,0.000429,0.001634
2,0.128494,0.126324,0.108211,0.083956,0.093256,-0.120624,-0.086392,-0.081137,-0.065666,-0.057119,0.777778,0.000388,0.000599
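
In the rows shown above, the `f_var` and `sent_var` columns are numerically consistent with the row-wise sample variance of the five `*_p` and the five `*_sent` columns, respectively. This is an observation about the displayed output, not a statement about how eMFDscore computes these columns internally. The sketch below recomputes both quantities from the three displayed rows so that they can be compared with the table.

```python
import pandas as pd

PROB_COLUMNS = ["care_p", "fairness_p", "loyalty_p", "authority_p", "sanctity_p"]
SENT_COLUMNS = ["care_sent", "fairness_sent", "loyalty_sent",
                "authority_sent", "sanctity_sent"]

# The three rows displayed above, copied from the output.
rows = pd.DataFrame({
    "care_p": [0.090455, 0.134069, 0.128494],
    "fairness_p": [0.119205, 0.113038, 0.126324],
    "loyalty_p": [0.130082, 0.085576, 0.108211],
    "authority_p": [0.139355, 0.090797, 0.083956],
    "sanctity_p": [0.085406, 0.088874, 0.093256],
    "care_sent": [-0.121268, 0.042614, -0.120624],
    "fairness_sent": [-0.095777, 0.090464, -0.086392],
    "loyalty_sent": [0.006735, 0.14208, -0.081137],
    "authority_sent": [-0.070749, 0.086809, -0.065666],
    "sanctity_sent": [-0.175241, 0.046238, -0.057119],
    "f_var": [0.000574, 0.000429, 0.000388],
    "sent_var": [0.004499, 0.001634, 0.000599],
})

# pandas computes the sample variance (ddof=1) by default.
recomputed = pd.DataFrame({
    "f_var_recomputed": rows[PROB_COLUMNS].var(axis=1),
    "sent_var_recomputed": rows[SENT_COLUMNS].var(axis=1),
})

print(pd.concat([rows[["f_var", "sent_var"]], recomputed], axis=1).round(6))
```

Read this way, a low `f_var` indicates that the five foundation probabilities of a document are close to one another.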


In [13]:
original_df = pd.read_csv('covid_data.csv')
original_df.head(3)



Unnamed: 0,Page Name,User Name,Facebook Id,Page Category,Page Admin Top Country,Page Description,Page Created,Likes at Posting,Followers at Posting,Post Created,...,Total Views,Total Views For All Crossposts,URL,Message,Link,Link Text,Description,Overperforming Score (weighted — Likes 1x Shares 1x Comments 1x Love 1x Wow 1x Haha 1x Sad 1x Angry 1x Care 1x ),LNC_category,covid
0,CNN,cnn,5550296508,MEDIA_NEWS_COMPANY,US,Instant breaking news alerts and the most talk...,2007-11-07 22:14:27,34563652,38358192.0,2021-06-30 13:00:28 EDT,...,0,0,https://www.facebook.com/5550296508/posts/1016...,A pair of US senators are asking the Biden adm...,https://cnn.it/3dpOHxJ,Senators ask DOT to remove expiration date for...,,-3.81,liberal,covid
1,CNN,cnn,5550296508,MEDIA_NEWS_COMPANY,US,Instant breaking news alerts and the most talk...,2007-11-07 22:14:27,34563652,38358192.0,2021-06-30 12:00:50 EDT,...,0,0,https://www.facebook.com/5550296508/posts/1016...,There's little evidence the Delta variant can ...,https://cnn.it/3heEWmU,Rise of Delta variant brings mask question bac...,,1.72,liberal,covid
2,CNN,cnn,5550296508,MEDIA_NEWS_COMPANY,US,Instant breaking news alerts and the most talk...,2007-11-07 22:14:27,34563652,38358192.0,2021-06-30 09:27:09 EDT,...,0,0,https://www.facebook.com/5550296508/posts/1016...,The stark disparity between low and high vacci...,https://cnn.it/2UbdFdl,Fauci warns there may soon be 'two Americas' a...,,2.63,liberal,covid


In [14]:
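# Place the original posts and their eMFD scores side by side. Rows are
# matched by index, so both frames must have the same row order.
data_combined = pd.concat([original_df, df], axis=1)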

In [15]:
data_combined.to_csv("covid_data_final.csv", index=False)
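
Because `pd.concat(..., axis=1)` aligns rows by index, a mismatch in length or index between the original data and the scores would silently introduce missing values. A more defensive variant might look like the sketch below; `combine_with_scores` is a hypothetical helper, and the two small frames are invented for illustration.

```python
import pandas as pd


def combine_with_scores(original, scores):
    """Join two frames column-wise by row position after checking their lengths."""
    if len(original) != len(scores):
        raise ValueError(
            f"row count mismatch: {len(original)} documents vs {len(scores)} score rows"
        )
    # Reset both indexes so that rows are matched purely by position.
    return pd.concat(
        [original.reset_index(drop=True), scores.reset_index(drop=True)], axis=1
    )


# Invented example: the original frame carries a non-default index.
original = pd.DataFrame({"Message": ["first", "second"]}, index=[10, 11])
scores = pd.DataFrame({"care_p": [0.1, 0.2]})

combined = combine_with_scores(original, scores)
print(combined)
```

This relies on the scores being returned in the same order as the input documents.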

### Questions or Concerns?

For any questions or concerns, please open an [issue](https://github.com/medianeuroscience/emfdscore/issues) on the GitHub repository.  