<a href="https://colab.research.google.com/github/CompPsychology/psych290_colab_public/blob/main/notebooks/week-06/W6_HW6_DLATK_1to3grams_(dla_tutorial%2Cblog_authorship%2Csvitlana).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# W6 Homework 6: Tuning 1to3grams, extraction & correlation (DB: dla_tutorial, blog_authorship, svitlana)

(c) Johannes Eichstaedt & the World Well-Being Project, 2023.

✋🏻✋🏻 NOTE - You need to create a copy of this notebook before you work through it. Click the "Save a copy in Drive" option in the File menu, and save it to your Google Drive.

✉️🐞 If you find a bug/something doesn't work, please slack us a screenshot, or email johannes.courses@gmail.com.


This homework will work with 3 different datasets.

* Questions 1-2: `dla_tutorial` (from your Google Drive)
* Questions 3-4: `blog_authorship` (./psych290_data/blog_authorship/*.csv in your file tree)
* Question 5: `svitlana` (./psych290_data/svitlana/*.csv in your file tree)

😎 We'll walk you through the database setup for each!

☝🏻💡 Quick tip: in this homework, many of the DLATK commands can take a while to run (1-2 hours). So if you're working on the HW over a few days, make sure to **save your database file to Google Drive frequently**! (See the bottom of the notebook for code cells to save each database.)

Please set up Colab first, as usual!

## Setup DLATK

### Install DLATK

In [None]:
# installing DLATK and necessary packages
!git clone -b psych290 https://github.com/dlatk/dlatk.git
!pip install -r dlatk/install/requirements.txt
!pip install dlatk/
!pip install wordcloud langid jupysql

### Download the data for this tutorial

In [None]:
# this downloads the csvs we need for this tutorial
!git clone https://github.com/CompPsychology/psych290_data.git

Have a look at the file browser on the left: you should see CSV files in ./psych290_data/blog_authorship and in ./psych290_data/svitlana.

### Mount Google Drive and copy databases

If you don't have `dlatk_lexica.db` in your Google Drive, look at Tutorial 5.

In [None]:
# Mount Google Drive & copy to Colab

# connects & mounts your Google Drive to this colab space
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

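# make sure the target folder exists (otherwise cp would create a file named sqlite_data)
!mkdir -p sqlite_data

# this copies dla_tutorial.db from your Google Drive to Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/dla_tutorial.db" "sqlite_data"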

# this copies dlatk_lexica.db from your Google Drive to Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/dlatk_lexica.db" "sqlite_data"
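
Before moving on, it can be worth confirming that both files actually arrived in `sqlite_data`. The following is only a small sketch; the helper name `missing_files` is our own (hypothetical) and not part of DLATK.

```python
import os

def missing_files(folder, filenames):
    """Return the names in `filenames` that do not exist inside `folder`."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(folder, name))]

# the two database files this homework expects after the copy step above
expected = ["dla_tutorial.db", "dlatk_lexica.db"]
missing = missing_files("sqlite_data", expected)
print("Missing files:", missing if missing else "none")
```

If anything is listed as missing, re-check the paths in your Google Drive before running any DLATK commands.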

BTW: we have 3 different databases we'll be working with in this homework 😎

## 1) (2 pts) How many 2grams are there...

...in the `dla_tutorial` dataset (the `msgs` and `outcomes` tables in your `dla_tutorial.db`) that meet a Pointwise Mutual Information criterion of `PMI > 5` and an occurrence threshold of being present in `7%` of the groups that meet a group_freq_threshold of `500`? What are the 10 most frequent of these 2grams by average group_norm across users?

First things first, let's connect to the `dla_tutorial` SQLite database!



In [None]:
database = 'dla_tutorial'

# loads the %%sql extension
%load_ext sql

# connects the extension to the database - mounts the database as an engine
from sqlalchemy import create_engine
dla_tutorial_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")

# use the dla_tutorial_engine engine (this activates the connection)
%sql dla_tutorial_engine

#set the output limit to 50
%config SqlMagic.displaylimit = 50

**Pro-tip:** Ingest the unfiltered 2gram feature table, and filter it down.
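
For intuition about what the PMI criterion measures, here is a minimal sketch of the textbook definition: the log of how much more often two words occur together than chance would predict. The counts are made up, and the function name, the natural log, and the way probabilities are estimated are our assumptions; DLATK's internal computation may differ in its details.

```python
import math

def pmi(count_xy, count_x, count_y, n_bigrams, n_tokens):
    """Pointwise mutual information of the bigram (x, y), using the natural log."""
    p_xy = count_xy / n_bigrams   # probability of the bigram
    p_x = count_x / n_tokens      # probability of the first word
    p_y = count_y / n_tokens      # probability of the second word
    return math.log(p_xy / (p_x * p_y))

# toy counts (made up for illustration)
# two rare words that almost always appear together: high PMI
print(round(pmi(50, 60, 55, 100_000, 100_000), 2))
# two very frequent words that co-occur about as often as chance predicts: PMI near 0
print(round(pmi(400, 6000, 7000, 100_000, 100_000), 2))
```

A higher PMI threshold therefore keeps only the 2grams that behave like genuine collocations.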

**Answer**

## 2) Make word clouds for the correlated 2grams for different occupations (one-hot encoded) with an occurrence threshold of being present in 10% of the groups that have at least 500 tokens and a PMI of 4.

Look at the word clouds -- what's the bi-gram most associated with student status?
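
As a reminder of what "one-hot encoded" means, here is a minimal sketch in plain Python: a single categorical column becomes one 0/1 column per category, and each of those columns can then be correlated with language features separately. The occupation labels and the helper name `one_hot` are made up for illustration.

```python
def one_hot(values):
    """Turn a list of category labels into one 0/1 column per category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

occupations = ["Student", "Arts", "Student", "Education"]  # made-up labels
encoded = one_hot(occupations)
for category, column in encoded.items():
    print(category, column)
```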

**Pro-tip:** Pipe away the DLATK output (especially while producing wordclouds).

**Answer**

## 3) (3 pts) Full blog-authorship dataset: descriptives

You will now switch databases to a new `blog_authorship` database. It contains a larger corpus of blogs! Have a look around in that database to get yourself oriented.

<!-- Please don't extract 1, 2 or 3 grams in this corpus, that would take 8.5 hours -- per person in the class. Instead, we have extracted feature tables using the usual DLATK commands for you. -->

#### Setup the SQLite database file from CSVs

First things first, we need to create the new database from the CSV files we cloned (downloaded) from the `psych290_data` GitHub repository.

In [None]:
# define the new database name
database='blog_authorship'

In [None]:
import os
from dlatk.tools.importmethods import csvToSQLite

# point to the database file path
database_path = os.path.join("sqlite_data", database)

# insert the messages CSV into the new database
csvToSQLite("./psych290_data/blog_authorship/blog_authorship_msgs_2k.csv", database_path, "blog_authorship_msgs")

# insert the outcomes CSV into the new database
csvToSQLite("./psych290_data/blog_authorship/blog_authorship_outcomes_2k.csv", database_path, "blog_authorship_outcomes")

💡 If you're returning to this homework (e.g., you already extracted 1to3grams and saved the database to Google Drive) and do not want to recreate your blog_authorship database, you can:
* (1) Mount your Google Drive
* (2) copy over your existing blog_authorship.db file.

```
# Mount Google Drive & copy to Colab

# connects & mounts your Google Drive to this colab space
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# this copies blog_authorship.db from your Google Drive to Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/blog_authorship.db" "sqlite_data"
```

Now, the usual SQLite database connection procedure!

In [None]:
# reloads the %%sql extension
%reload_ext sql

# connects the extension to the database - mounts the database as an engine
from sqlalchemy import create_engine
blog_authorship_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")

# use engine (this activates the connection!)
%sql blog_authorship_engine

#set the output limit to 50
%config SqlMagic.displaylimit = 50
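
Once connected, one way to get oriented is to list every table with its row count. The sketch below uses Python's built-in `sqlite3` module on a throwaway in-memory database with toy tables so that it runs on its own; the helper `table_row_counts` and the toy table names are hypothetical. In the notebook you would connect to `sqlite_data/blog_authorship.db` instead.

```python
import sqlite3

def table_row_counts(conn):
    """Return {table_name: row_count} for every table in a SQLite database."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    # table names are quoted because DLATK feature table names contain `$`
    return {name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names}

# self-contained demo on an in-memory database with toy tables
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_msgs (message_id INTEGER, user_id INTEGER, message TEXT)")
conn.execute('CREATE TABLE "feat$1gram$toy_msgs$user_id" (group_id INTEGER, feat TEXT)')
conn.executemany("INSERT INTO toy_msgs VALUES (?, ?, ?)",
                 [(1, 10, "hello world"), (2, 11, "good morning")])
counts = table_row_counts(conn)
print(counts)
conn.close()
```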

#### Extract 1to3gram features

Here's the command to extract 1to3grams with good, high-quality settings for this dataset that err on the side of including more features for discovery: an occurrence threshold of 5%, a pointwise mutual information (PMI) threshold of 3, and a group frequency threshold (GFT) of 1000.

**This will take ~70 minutes!**

In [None]:
database = 'blog_authorship'
msgs_table = 'blog_authorship_msgs'

In [None]:
!dlatkInterface.py \
   --corpdb {database} \
   --corptable {msgs_table} \
   --correl_field user_id \
   --group_freq_thresh 500 \
   --add_ngrams -n 1 2 3 \
   --combine_feat_tables 1to3gram \
   --feat_occ_filter --set_p_occ 0.05 \
   --feat_colloc_filter --set_pmi_threshold 3
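
To make the interplay of GFT and occurrence threshold concrete, here is a toy sketch of the logic: first drop groups with too few tokens, then keep only the features that occur in a large enough share of the remaining groups. This illustrates the idea under our own assumptions (for example, `>=` comparisons); it is not DLATK's actual implementation, and `filter_by_occurrence` is a hypothetical helper.

```python
def filter_by_occurrence(group_counts, gft, p_occ):
    """Keep features used by at least `p_occ` of the groups that have >= `gft` tokens.

    group_counts maps group_id -> {feature: count}.
    """
    # 1) group frequency threshold: drop groups with too few tokens
    kept_groups = {g: feats for g, feats in group_counts.items()
                   if sum(feats.values()) >= gft}
    # 2) count in how many of the remaining groups each feature occurs
    n_groups_with = {}
    for feats in kept_groups.values():
        for feat in feats:
            n_groups_with[feat] = n_groups_with.get(feat, 0) + 1
    # 3) keep features whose share of groups reaches the occurrence threshold
    min_groups = p_occ * len(kept_groups)
    return sorted(f for f, n in n_groups_with.items() if n >= min_groups)

toy = {
    "user_a": {"the": 6, "cat": 2},
    "user_b": {"the": 5, "dog": 3},
    "user_c": {"the": 7, "cat": 1},
    "user_d": {"hi": 1},  # only 1 token, removed by the GFT
}
kept = filter_by_occurrence(toy, gft=5, p_occ=0.5)
print(kept)
```

In this toy example "dog" is dropped because it appears in only one of the three groups that pass the GFT.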

### 3.1. How many users and messages are in this dataset? How many users have more than 1,000 tokens?

**HINT:** use the 1gram meta feature table....

**Answer**

### 3.2. How many distinct features are there in the occ 0.05 and PMI 3 1to3gram feature table?

**Answer**

### 3.3. How does this overall number of 1to3grams relate to the rules of thumb we've mentioned?

**Answer**


## 4) (2 pts + 2 extra) Full blog-authorship dataset: language correlations


### 4.1) Correlate 1to3grams with Bonferroni correction and group_freq_thresh (GFT = 1000) for gender, age, occupation and star sign.

Before you set up your DLATK command, have a look at `gender` in the outcomes table, in case any variable formatting has changed from what you've seen before in a way that you need to consider (describe in SQL, etc).

**Tip:** P-corrections can be adjusted with `--p_correction`  

🚨 **FYI:** This will take ~1.5 hours! Consider starting with just a few outcomes, like gender and age, so you can debug as needed.

**Pro-tip:** Pipe away the DLATK output.
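
As a refresher on what the Bonferroni correction does, here is a minimal sketch with made-up p-values: every p-value is multiplied by the number of tests and capped at 1. This is the textbook procedure, shown for intuition only; it is not DLATK's code, and the helper name `bonferroni` is ours.

```python
def bonferroni(p_values):
    """Bonferroni-adjust p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

raw = [0.0001, 0.01, 0.04, 0.5]   # made-up p-values from 4 tests
adjusted = bonferroni(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted)
print(significant)
```

With thousands of 1to3grams, the number of tests is large, so only strong associations survive the correction.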

**Answer**

### 4.2) Please make LIWC2015 word clouds with Bonferroni correction and GFT = 1000 for age, gender, occupation and star sign.

By LIWC2015 word clouds, we mean word clouds that contain the name of the LIWC dictionaries themselves as features.
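
Conceptually, a dictionary feature is roughly the sum of the relative frequencies of all words that belong to that dictionary. The sketch below shows this with a tiny made-up lexicon; it does not contain real LIWC2015 entries, ignores details such as wildcard entries, and `category_scores` is a hypothetical helper rather than a DLATK function.

```python
def category_scores(word_freqs, lexicon):
    """Sum the relative frequencies of a user's words within each category."""
    return {category: sum(word_freqs.get(w, 0.0) for w in words)
            for category, words in lexicon.items()}

# toy lexicon and toy relative frequencies (NOT the real LIWC2015 contents)
toy_lexicon = {"POSEMO": {"happy", "love"}, "FAMILY": {"mom", "dad"}}
user_freqs = {"happy": 0.02, "love": 0.01, "mom": 0.005, "the": 0.06}
scores = category_scores(user_freqs, toy_lexicon)
print(scores)
```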

Note that you will first need to extract the `LIWC2015` feature table (`feat$cat_LIWC2015$blog_authorship_msgs$user_id$1gra`) and then run the correlations (which will be much quicker, 2-3 minutes).

**Tip:** Check the "setup" section and section 7 of *Tutorial 09* for LIWC extraction and LIWC word clouds.

**Pro-tip:** pipe away the DLATK output.

In [None]:
database = 'blog_authorship'
msgs_table = 'blog_authorship_msgs'

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 0 \
    --add_lex_table -l LIWC2015

**Answer:**

### 4.3) EXTRA CREDIT (2pts) For an occupation of your choice, please create a "language summary figure."

One nice thing about showing LIWC correlations next to 1to3gram correlations is that the LIWC cloud summarizes the 1to3gram word cloud.

Using the [PPT template](https://comppsychology.github.io/psych290/wordcloud_template.pptx), in one figure, please arrange
* The positively correlated 1to3gram word cloud for your chosen occupation.
* Next to it, the positively correlated LIWC2015 dictionaries for the same occupation.
* Add labels stating the name of the occupation shown, and the range of association coefficients (see word cloud file names).
* Please also add the legend that explains size and color (see template).

Please insert a screenshot of your figure.

**Answer**

### 4.4) Please comment on the star sign word clouds in this large blog_authorship corpus in relation to the star sign word clouds derived from the N ~ 1,000 dataset you've seen in this tutorial. What's your best estimate about astrology carrying psychological signal?

**Answer**


## 5) (5 points) Annotated Emotion dataset
For these questions, we will be using a dataset made available by [Svitlana Volkova](https://www.pnnl.gov/people/svitlana-volkova). It contains ~29k Tweets annotated with emotions.

We will make a new database called `svitlana`. Here, the documents (= tweets) are directly annotated with one of a few emotions. To make life easier for you, we will add an `outcomes` table that lists the emotion for a given `message_id`.

📝 Note that in real life, when you get a CSV that has two columns (tweet and emotion), you would have to import that data into R following Tutorial 7 and create a message table with the columns `message` (the tweet) and `message_id` (some number you pick) that connects to an outcome table with the columns `message_id` and `emotion`. For simplicity, we've done that for you.
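
To illustrate that restructuring step, here is a minimal sketch in Python (rather than the R route mentioned above) that splits a two-column CSV into a message table and an outcome table linked by `message_id`. The CSV rows and the table names `toy_msgs` and `toy_outcomes` are made up for illustration.

```python
import csv
import io
import sqlite3

# toy two-column CSV standing in for a real file (made-up rows)
raw_csv = io.StringIO("tweet,emotion\nI love this!,joy\nSo scary...,fear\n")

conn = sqlite3.connect(":memory:")  # a file path would persist the database
conn.execute("CREATE TABLE toy_msgs (message_id INTEGER PRIMARY KEY, message TEXT)")
conn.execute("CREATE TABLE toy_outcomes (message_id INTEGER PRIMARY KEY, emotion TEXT)")

# message_id is simply the row number; any unique number would do
for message_id, row in enumerate(csv.DictReader(raw_csv), start=1):
    conn.execute("INSERT INTO toy_msgs VALUES (?, ?)", (message_id, row["tweet"]))
    conn.execute("INSERT INTO toy_outcomes VALUES (?, ?)", (message_id, row["emotion"]))
conn.commit()

# the shared message_id lets us join each message back to its annotation
joined = conn.execute(
    "SELECT m.message_id, m.message, o.emotion "
    "FROM toy_msgs AS m JOIN toy_outcomes AS o USING (message_id) "
    "ORDER BY m.message_id").fetchall()
print(joined)
conn.close()
```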

Citation for the dataset -
```
Volkova, S., & Bachrach, Y. (2016, August). Inferring perceived demographics from user emotional tone and user-environment emotional contrast. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1567-1578).
```

💡 Let's start by importing the preprocessed CSVs to a SQLite database.

In [None]:
# name of new database
database = 'svitlana'

In [None]:
import os
from dlatk.tools.importmethods import csvToSQLite

# point to the database file path
database_path = os.path.join("sqlite_data", database)

# insert the messages CSV into the new database
csvToSQLite("./psych290_data/svitlana/svitlana_msgs.csv", database_path, "svitlana_msgs")

# insert the outcomes CSV into the new database
csvToSQLite("./psych290_data/svitlana/svitlana_outcomes.csv", database_path, "svitlana_outcomes")

Now, the usual SQLite database connection procedure!

In [None]:
# reloads the %%sql extension
%reload_ext sql

# connects the extension to the database - mounts the database as an engine
from sqlalchemy import create_engine
svitlana_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")

# use engine (activates the connection to new database!!)
%sql svitlana_engine

#set the output limit to 50
%config SqlMagic.displaylimit = 50

### 5.1) How many messages are there? How many messages are there per emotion?

**Answer**

### 5.2) Please extract 1grams. How many 1grams are there unfiltered? Is that too many for 1gram word clouds given our rules of thumb? Also, to orient ourselves: what's the average length of a message (tweet)?

This will take ~13min. Note that the `--correl_field` will have to be `message_id`.

**HINT:** remember that meta tables exist.

**Note:** no GFT setting is needed on raw 1gram extraction, but you could set it anyway so as not to forget what's appropriate here (i.e., setting GFT such that everything is included).

**Answer:**

### 5.3) Tuning message-level occurrence thresholds

We want to create 1gram word clouds for the different emotions.

Previously, for the choice of occurrence threshold, we've used user-level rules of thumb (such as 5% of all groups). Now that we are at the message level, an occurrence threshold of 5% would mean that the word would have to be contained in 5% of all messages -- that turns out to only be true of the 47 most frequent 1grams, given that tweets are ~15 words long.

So please pick an occurrence threshold that makes sense for this situation. In this dataset with ~30,000 documents, let's aim for a feature to be included if it appears in ~30 messages -- so enough for us to correlate over, barely. Can you pick an occurrence threshold that satisfies that condition? How many features are left at that threshold?

⚠️ Be careful to pick a GFT that's appropriate for very short messages/groups, so that all tweets are included, i.e., `0`.
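
The relationship between a minimum number of messages and an occurrence threshold is a single division. Here is a minimal sketch with made-up numbers (swap in this dataset's values yourself); `occurrence_threshold` is a hypothetical helper, and its result is the kind of fraction you would pass to `--set_p_occ`.

```python
def occurrence_threshold(min_docs, n_docs):
    """Fraction of documents a feature must appear in to occur in `min_docs` of them."""
    return min_docs / n_docs

# made-up example: a feature should appear in at least 50 of 10,000 documents
p_occ = occurrence_threshold(50, 10_000)
print(p_occ)
```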


**Answer**



### 5.4) 1gram word clouds for emotions

Please make 1gram word clouds for message level emotion annotations -- i.e., word clouds that show the words most (positively) correlated with the different emotions.

Insert screenshots for two emotions of your choice. (You can insert images via Edit > Insert image.)

**Pro-tip:** Redirect the DLATK output with `> logs.txt 2>&1` to suppress the overwhelming output.

Expected DLATK runtime: 3 minutes.
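
If you prefer to launch commands from Python rather than with `!`, the same redirection can be sketched with the standard `subprocess` module. The command below is only a stand-in that prints to stdout and stderr; in the notebook it would be your own `dlatkInterface.py` call.

```python
import subprocess
import sys

# stand-in command that prints to both stdout and stderr;
# in the notebook this would be your dlatkInterface.py call
command = [sys.executable, "-c",
           "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"]

# equivalent of `> logs.txt 2>&1`: both streams end up in the same file
with open("logs.txt", "w") as log_file:
    result = subprocess.run(command, stdout=log_file, stderr=subprocess.STDOUT)

# read the log back, e.g. to check for errors after the run
with open("logs.txt") as log_file:
    log_text = log_file.read()
print("exit code:", result.returncode)
```

If a run fails, `logs.txt` is the first place to look for the error message.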

**Answer:**

### 5.5) LIWC word clouds for emotions

Please create "LIWC wordclouds" for the emotion annotations, and insert screenshots for the same two emotions you picked above.

This will allow us to look at the emotion annotations through a 1gram lens, and through a LIWC lens.

Commands should run fast.

**Answer**

## ‼️ **Save your database** ‼️

Let's save all this work into database files in your GDrive `sqlite_databases` folder!

**First, `dla_tutorial.db`**

In [None]:
dla_database = 'dla_tutorial'

# mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# copy the database file to your Drive
!cp -f "sqlite_data/{dla_database}.db" "/content/drive/MyDrive/sqlite_databases/"

**Second, `blog_authorship.db`** (this may take ~1 min because of its size)

In [None]:
blog_authorship = 'blog_authorship'

# mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# copy the database file to your Drive
!cp -f "sqlite_data/{blog_authorship}.db" "/content/drive/MyDrive/sqlite_databases/"

**Third, `svitlana.db`**

In [None]:
svitlana_database = 'svitlana'

# mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# copy the database file to your Drive
!cp -f "sqlite_data/{svitlana_database}.db" "/content/drive/MyDrive/sqlite_databases/"

We generated a lot of output in this homework! Here's how you can save it to your Drive if you want to!

In [None]:
OUTPUT_FOLDER = './output_hw6'

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Copy the output folder to your Drive (-r copies the folder and everything inside it)
!cp -f -r {OUTPUT_FOLDER} "/content/drive/MyDrive/"

print(f"✅ '{OUTPUT_FOLDER}' has been copied to your Google Drive.")