# In-class exercise 6a: Corpus Pre-processing and Dictionary Method

## The Moral Foundation Dictionary 

In this exercise, we will practice pre-processing natural language and use a very simple dictionary method to explore to what extent American Democratic and Republican presidential candidates appeal to different **moral values**.


Before that, please install two additional packages into our datamanagement environment: `langdetect` and `fastparquet`. You can do so by running the following command in a code cell:

In [None]:

%pip install langdetect fastparquet

### The moral foundations

A group of psychologists - notably Jonathan Haidt and Jesse Graham - noticed that people from different cultural backgrounds often value similar virtues - such as care, fairness, loyalty and authority - and theorized that these common themes might have been essential for humanity's survival. They developed **moral foundations theory** (https://moralfoundations.org), which establishes five pillars of morality that are shared across cultures:

1. **Care**: sympathy, mutual aid and nurturance. It helps humans care for the small and the weak and survive as a group.
2. **Fairness**: reciprocity and proportionality. It encourages production and enables justice and rights.
3. **Loyalty/In-group**: the feeling of belonging to a group or a tribe, a distinction between us and them. It encourages contributing to the collective and the formation of coalitions.
4. **Authority**: deference to a strong leader, doing as orders command. It makes cooperation more efficient.
5. **Purity/Sanctity**: disgust at filth and contamination - both physical, such as blood or dead bodies, and religious, such as incest. It improves sanitary conditions in human habitats and promotes self-control and spirituality.

We call the words that describe things promoted by the five ideas "virtues" and the opposites - negligence, freeloading, betrayal, disobedience and sacrilege - "vices". 

Social science researchers often note that different political factions appeal to different moral foundations. In American politics, for example, Democrats are said to place more emphasis on care and fairness, while Republicans place more emphasis on loyalty and authority.

We will work on a small exercise today, inspired by but not identical to Hackenburg, Brady and Tsakiris (2023, https://doi.org/10.1093/pnasnexus/pgad189). The data were downloaded from OSF (10.17605/OSF.IO/FZ6KP).

In [29]:
# Import everything we need today 

# The basics
from pathlib import Path
import pandas as pd

# NLP tool kit
import nltk
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('stopwords')  # the stop-word lists imported below need this corpus
from nltk.corpus import wordnet
from string import punctuation
from nltk.corpus import stopwords
from nltk import word_tokenize, pos_tag

# Regular expressions
import re

# Language detection 
import langdetect

import fastparquet



[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/huangyuchen/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/huangyuchen/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/huangyuchen/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/huangyuchen/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


## Exercise 1: Import The Moral Dictionary 

Words describing each virtue and vice are included in `data/moral_foundation_dictionary`.

1. Open `mfd2.0.dic` in this folder with a text editor (VSC works well) and observe its structure.
2. Read this file into Python and remove the header (lines 1 to 12) so that the rest can be converted to a data frame.
3. Convert the contents of `mfd2.0.dic` to a pandas data frame in which one column is the word and another is the code representing its category (e.g. "1" for care.virtue, "2" for care.vice). Name the two columns `word` and `category`.
4. Replace the codes in the `category` column with their labels from the MFD, so that instead of 1 the column now has the value "care.virtue" (see the exercise in the pandas session for reference).
5. Not necessary but good data practice: give each column an appropriate type (str for `word` and categorical for `category`).


Hints: 
- You will meet text materials in all kinds of formats in NLP exercises. Most of them can be opened in VSC as text and read as text. `read().splitlines()` and `read().split()` are your good friends.
- Here we are working in English, so special characters are not a problem. If you are working with French, it is good practice to always read with the correct encoding (e.g. "utf-8") to make sure that accented characters are read properly.
- "\t" means a tab and "\n" means a line break.
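
To make the hints concrete, here is a minimal sketch on a made-up toy string rather than the real file. It assumes a layout in which a header block between two `%` lines maps codes to labels and the remaining lines are tab-separated word/code pairs; check this against what you actually see in `mfd2.0.dic`. The names `toy_dic` and `toy_mfd` are only for illustration.

```python
import pandas as pd

# Toy text imitating the assumed layout: header between two '%' lines, then word<TAB>code
toy_dic = "%\n1\tcare.virtue\n2\tcare.vice\n%\ncompassion\t1\nharm\t2\n"

lines = toy_dic.splitlines()

# Positions of the two '%' delimiter lines
delimiters = [i for i, line in enumerate(lines) if line.strip() == "%"]
header = lines[delimiters[0] + 1 : delimiters[1]]
body = lines[delimiters[1] + 1 :]

# Header lines become a code -> label mapping, e.g. {'1': 'care.virtue', ...}
labels = dict(line.split("\t") for line in header)

# Body lines become rows of [word, code]
rows = [line.split("\t") for line in body if line.strip()]
toy_mfd = pd.DataFrame(rows, columns=["word", "category"])

# Replace codes by labels and set the column types
toy_mfd["category"] = toy_mfd["category"].map(labels).astype("category")
toy_mfd["word"] = toy_mfd["word"].astype("string")
print(toy_mfd)
```

With the real file you would replace `toy_dic` with the text read from `word_list`, and you could slice off the header by line number instead of searching for the `%` lines.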


In [None]:
# imports - here are some useful paths
this_dir = Path(".")
moral_foundation = this_dir / "data" / "moral_foundation_dictionary"
word_list = moral_foundation / "mfd2.0.dic"
word_list.exists()


## Exercise 2: Stemming/Lemmatizing the words

### Try stemming and lemmatizing
We notice that each word comes in different grammatical forms (such as 'caring', 'cared' and 'cares'). We would like to shrink our dictionary so that any program using it runs faster.

For that we have two choices: stemming and lemmatizing. 

1. Produce a vector of stemmed moral keywords - be careful not to overwrite the column in the data frame yet. We will use the `SnowballStemmer` from `nltk` as in the lecture. 
2. Produce a vector of lemmatized moral keywords - be careful not to overwrite the column in the data frame yet. We will use the `WordNetLemmatizer` from `nltk` as in the lecture. 
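
As a small illustration of the first option, the sketch below stems the three example forms with `SnowballStemmer`. The list `toy_words` is made up for the example; in the exercise you would loop over the `word` column instead.

```python
from nltk.stem import SnowballStemmer

# The English Snowball stemmer needs no extra download
stemmer = SnowballStemmer("english")

toy_words = ["caring", "cared", "cares"]

# Stem each word separately and keep the result in a new list
toy_stems = [stemmer.stem(word) for word in toy_words]
print(toy_stems)
```

Keep in mind that a stemmer applies suffix-stripping rules, so its output is not always a dictionary word.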

### (Might be a problem): why is my lemmatizer not working? 
Now, you might notice that the lemmatizer is "barely altering" your words! For example, "caring" was not changed to "care". Why is this?

Try to find the solution with your favorite search engine. Hint: it has something to do with how the lemmatizer understands the word - the part of speech that it assigns to the word.
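
One possible direction, sketched under assumptions: `WordNetLemmatizer.lemmatize` takes a `pos` argument that defaults to noun, while `pos_tag` returns Penn Treebank tags such as `VBG`. The two helper functions below are hypothetical names, not part of `nltk`; they translate a tag into the one-letter code the lemmatizer expects and pass it along.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective, same value as nltk.corpus.wordnet.ADJ
    if tag.startswith("V"):
        return "v"  # verb, same value as nltk.corpus.wordnet.VERB
    if tag.startswith("R"):
        return "r"  # adverb, same value as nltk.corpus.wordnet.ADV
    return "n"      # everything else is treated as a noun (wordnet.NOUN)


def lemmatize_with_pos(word, tag, lemmatizer):
    """Lemmatize a word, telling the lemmatizer which part of speech to assume."""
    return lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))


print(penn_to_wordnet("VBG"), penn_to_wordnet("NN"), penn_to_wordnet("JJ"))
```

In the notebook you could call `lemmatize_with_pos(word, tag, WordNetLemmatizer())` with the tags produced by `pos_tag`. Tagging isolated dictionary words is less reliable than tagging words in a sentence, so inspect the results before trusting them.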


### Pick an option 

Given the results and our goal (counting occurrences of these words in natural language, here tweets), do you prefer stemming or lemmatizing?

There is no correct answer, just explain your logic. 

### We adopt the lemmatizer 

For the rest of the exercise we will keep the lemmatizer. Please replace the `word` column of your data frame `mfd` with the lemmatized version, and then drop the duplicates (`drop_duplicates()` is helpful here).
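
A toy illustration of the deduplication step, with made-up rows standing in for an already lemmatized dictionary: `drop_duplicates()` compares whole rows by default, so a word that belongs to two categories is kept once per category.

```python
import pandas as pd

# Made-up data: pretend 'caring', 'cared' and 'cares' were all lemmatized to 'care'
toy_mfd = pd.DataFrame({
    "word": ["care", "care", "care", "harm", "care"],
    "category": ["care.virtue", "care.virtue", "care.virtue", "care.vice", "care.vice"],
})

# Remove identical (word, category) rows and renumber the index
deduplicated = toy_mfd.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```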

## Exercise 3: Importing Tweets from presidential candidates 

We would now like to pre-process our presidential candidate tweets! We are looking at 4 sets of tweets: Biden, Clinton, Trump 2016 and Trump 2020. As before, we would like to read them into a pandas data frame, this time with the name of the candidate and the tweet.

We would ideally like to have a pandas dataframe with three columns: `candidate`, `year` and `original_text`. We will call the dataframe `all_tweets`.

Here is how I proceeded to read the files. Please complete the code so that it builds the dataframe (hint: look at which variables have not been defined), but feel free to do it differently!

1. Create an empty data frame with the desired columns.
2. Iterate over the files in the directory.
3. For each file, extract "2016" or "2020" from the file name for the year, as well as the name of the candidate. Put them in variables to be inserted later.
4. Read the file line by line and convert it to a dataframe.
5. Create the year and candidate columns using the stored variables.
6. Append (`pd.concat`) the dataframe read from one file to the main dataframe `all_tweets`.
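
Step 3 can be tried out on its own before touching the files. The sketch below is one way to do it, not the only one; `parse_file_name` is a hypothetical helper and the file names in the example are invented, so adapt them to the names you find in the tweet folder.

```python
import re

CANDIDATES = ["biden", "hillary", "trump"]


def parse_file_name(name):
    """Return (candidate, year) found in a file name, with None where nothing matches."""
    lowered = name.lower()

    # Look for one of the two campaign years anywhere in the name
    year_match = re.search(r"2016|2020", lowered)
    year = int(year_match.group()) if year_match else None

    # Take the first candidate whose name appears in the file name
    candidate = next((c for c in CANDIDATES if c in lowered), None)
    return candidate, year


# Invented file names, for illustration only
print(parse_file_name("tweets_trump_2020.txt"))
print(parse_file_name("2016_biden.txt"))
```

The `in` operator also catches a match at the very start of the name, a position for which `str.find` returns 0.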


P.S. Some of Trump's tweets look "unfinished" in the original file. This is how I found them at download - it is ok for the purpose of the exercise.

In [None]:

for path in tweet_data.iterdir():

    if "2016" in path.name:
        # if the file name contains 2016, store it in a variable called `year`
        ...  # to do (and handle 2020 in the same way)

    for candidate in ['biden', 'hillary', 'trump']:
        if candidate in path.name:
            person = candidate

    # read the file, one tweet per line
    with open(path, encoding='utf-8') as file:
        tweets = file.read().splitlines()

    tweets = pd.DataFrame(tweets, columns=['original_text'])
    tweets['candidate'] = ...  # to do: give the value person here
    tweets['year'] = ...  # to do: give the value year here

    all_tweets = pd.concat([all_tweets, tweets])

# Put them into the correct category - to do

# don't forget to reset index! - to do

|  | original_text | year | candidate |
|---|---|---|---|
| 0 | Just landed in Iowa - speaking soon! | 2016 | trump |
| 1 | It is time to | 2016 | trump |
| 2 | Thank you! | 2016 | trump |
| 3 | Will be on at 7:02 A.M. Enjoy. | 2016 | trump |
| 4 |  | 2016 | trump |
| ... | ... | ... | ... |
| 17288 | "Hillary Clinton is the better candidate to ta... | 2016 | hillary |
| 17289 | 72 years after let's not just eulogize the br... | 2016 | hillary |
| 17290 | Donald Trump should come out of the towers he ... | 2016 | hillary |
| 17291 | bigoted comments about a Latino judge are so ... | 2016 | hillary |


(We could insert an exercise on sentiment analysis or part-of-speech tagging here; we would also add purging Spanish.)

## Exercise 4: Pre-processing the tweets

We would like to do the following things: 
- put everything in lower case
- remove all non-alphabetic characters
- remove punctuation
- remove stopwords
- lemmatize

1. Please create a function called `normalize_text` that takes a document (here, a tweet) and returns a list of lemmatized words. Feel free to copy-paste from the lecture and see what happens.
2. Then, `apply` it to the column `original_text` to generate a column called `processed_text`.
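
If you prefer to start from scratch, the steps above can be sketched as follows. This is not the function from the lecture: it tokenizes with a simple regular expression instead of `word_tokenize`, and it takes the stop-word set and the lemmatizing function as arguments (both assumptions of this sketch) so that each piece can be swapped or tested separately.

```python
import re


def normalize_text(document, stop_words=frozenset(), lemmatize=lambda word: word):
    """Lower-case a document and return its alphabetic, non-stop-word tokens, lemmatized."""
    lowered = document.lower()

    # Replace every run of characters other than a-z by a space:
    # this drops digits and punctuation in one go (and accented letters too)
    letters_only = re.sub(r"[^a-z]+", " ", lowered)

    tokens = letters_only.split()
    return [lemmatize(token) for token in tokens if token not in stop_words]


# Tiny made-up stop-word set, for illustration only
toy_stop_words = {"the", "is", "to", "in"}
print(normalize_text("It is time to VOTE in 2020!!!", stop_words=toy_stop_words))
```

In the notebook you could pass `set(stopwords.words("english"))` as `stop_words` and the `lemmatize` method of a `WordNetLemmatizer` as `lemmatize`, then use `all_tweets["original_text"].apply(normalize_text, ...)` with those keyword arguments.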

### Bonus: Can I adapt the off-the-shelf pre-processing function to better fit our use case? 

If you copy-pasted the function made by Malka in the lecture, you might want to browse through the dataset a bit and see whether it has problems - whether it failed to process certain words that should have been processed.

Here are some of the problems that I noticed:
- emojis are not purged
- it failed to separate contractions (we've, haven't, ...)
- it failed to separate words with punctuation in the middle (e-mail, anti-vax, a period after a word with no space, etc.)
- it failed to remove some extremely common words (we've, us, etc.)
- it failed to detect other languages (Spanish, notably)

Can you improve the function so that it takes care of these? 

Hints: 
- Emojis are ordinary Unicode characters in Python strings. Most of them can be purged with a regular expression: try `[\U00010000-\U0010FFFF]`
- For contractions, try to split further on the apostrophe '
- For Spanish, try a language detector (check out what is imported in the intro!)
- Purging some other extremely common stopwords might also be a good idea. Here is my list:

`extra_stopwords = ['us', 'we', 've', 'i', 'you', 'he', 'she', 'it', 'they', 'me', 'him', 'her', 'them', 'my', 'your', 'his', 'its', 'our', 'their', 'a', 'an', 'the', 'and', 'or', 'but', 'if', 'in', 'on', 'at', 'to', 'for', 'with', 'of', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'get', 'let']`


You are welcome (and encouraged) to fix problems that I didn't find!
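
The first two hints can be tried in isolation, as in the sketch below. The helper names `strip_emojis` and `split_contractions` are made up for this example, and the pattern only covers characters above the Basic Multilingual Plane, so a few older symbols such as "❤" (U+2764) would survive it.

```python
import re

# Characters in the supplementary planes, where most emojis live
EMOJI_PATTERN = re.compile(r"[\U00010000-\U0010FFFF]")


def strip_emojis(text):
    """Replace each matched character by a space so neighbouring words stay separated."""
    return EMOJI_PATTERN.sub(" ", text)


def split_contractions(tokens):
    """Split every token on straight or curly apostrophes and drop empty pieces."""
    pieces = []
    for token in tokens:
        pieces.extend(part for part in re.split(r"['\u2019]", token) if part)
    return pieces


tokens = strip_emojis("We've got this \U0001F4AA").lower().split()
split_tokens = split_contractions(tokens)
print(split_tokens)
```

Once "we've" is split into "we" and "ve", both pieces can be removed by the extra stop-word list. For the language hint, `langdetect.detect` returns a language code such as `"es"`, but it raises an exception on text with no usable characters (for example an empty tweet), so wrap the call in a `try`/`except`.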

## Exercise 5: count dictionary 

Now, we would like to count the frequencies of the different virtue and vice keywords and store them as additional variables in the data frame `all_tweets`.

1. Create a function called `count_virtue` that takes a category of virtue or vice (e.g. `"care.virtue"`) and returns the number of occurrences of any keyword of that category in the tweet.
2. Apply it to the tweets for each virtue and vice, and add the results to the data frame as additional columns (e.g. `count_care_virtue`).

Hints: 

- Try to make your function self-contained; the data frame `all_tweets` and the dictionary `mfd` should not be pulled from the global environment but rather given as inputs. 
- If your function counts at the level of a single string, the function `apply` is your friend. 
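
One possible shape for such a function, shown on made-up toy data: it works on a single list of tokens and receives the dictionary as an argument, so `apply` can call it once per tweet. The argument order and the toy frames `toy_mfd` and `toy_tweets` are assumptions of this sketch.

```python
import pandas as pd


def count_virtue(tokens, category, mfd):
    """Count how many tokens of one tweet are keywords of the given MFD category."""
    keywords = set(mfd.loc[mfd["category"] == category, "word"])
    return sum(token in keywords for token in tokens)


# Made-up dictionary and tweets, for illustration only
toy_mfd = pd.DataFrame({
    "word": ["care", "protect", "harm"],
    "category": ["care.virtue", "care.virtue", "care.vice"],
})
toy_tweets = pd.DataFrame({
    "processed_text": [["we", "care", "protect", "family"], ["no", "harm"]],
})

# One new column per category, e.g. 'count_care_virtue'
for category in toy_mfd["category"].unique():
    column = "count_" + category.replace(".", "_")
    toy_tweets[column] = toy_tweets["processed_text"].apply(
        count_virtue, args=(category, toy_mfd)
    )
print(toy_tweets)
```

This version rebuilds the keyword set for every tweet, which is simple but slow on the full data; building the set once per category outside the function would be a natural improvement.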


## Exercise 6: Who says it more? 

Finally, we would like to test the hypothesis that Democrats place more emphasis on the care and fairness virtues, while Republicans (Trump here) place more emphasis on authority and in-group loyalty. Of course, we must normalize by the total number of tweets - Trump simply tweets more than other people!

Use pandas' `.groupby` and `.agg` methods to calculate the mean number of words about each virtue or vice per tweet posted by each candidate/year. What do you think?
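
The aggregation pattern looks roughly like this on a toy frame. The numbers and candidate labels are invented, and the `count_` column names assume the naming used in Exercise 5.

```python
import pandas as pd

# Invented counts, for illustration only
toy = pd.DataFrame({
    "candidate": ["a", "a", "b"],
    "year": [2016, 2016, 2020],
    "count_care_virtue": [2, 0, 1],
    "count_authority_virtue": [0, 1, 3],
})

# Select every count column, then average per candidate and year
count_columns = [column for column in toy.columns if column.startswith("count_")]
summary = toy.groupby(["candidate", "year"])[count_columns].agg("mean")
print(summary)
```

Taking the mean per tweet is what normalizes for the number of tweets: a candidate who tweets twice as much does not get twice the score.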


## Save your pre-processed database as a .parquet 

Please save the columns `candidate`, `year`, `original_text` and `processed_text` in a parquet file called `cooked_data.parquet` in the folder `data`, and the `mfd` dataframe as `cooked_mfd.parquet` in the same place.

We will keep using this data for the next class! 

In [44]:
# Define a path 
path_cooked_data = this_dir / "data" / "cooked_data.parquet"

all_tweets[['candidate','year','original_text', 'processed_text']].to_parquet(path_cooked_data)


In [45]:
# Same for mfd
path_mfd_cooked = this_dir / "data" / "cooked_mfd.parquet"
mfd.to_parquet(path_mfd_cooked)

### If you still have time: Try other things we did in class!

If you still have some time on your hands, please try the other things we mentioned today:
- Sentiment analysis: calculate a positivity score for each tweet and aggregate. Which candidate is the most positive one?
- Part-of-speech: try named entity recognition on some tweets and see which entities (e.g. "Donald Trump", "The Fed", "2016") are highlighted.
