<a href="https://colab.research.google.com/github/CompPsychology/psych290_colab_public/blob/main/notebooks/week-04/W4_HW4_DLA_LexCorrelation_(dla_tutorial).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# W4 Homework 4: Weighted dictionaries and correlations

(c) Johannes Eichstaedt & the World Well-Being Project, 2023.

✋🏻✋🏻 NOTE - You need to create a copy of this notebook before you work through it. Click the "Save a copy in Drive" option in the File menu, and save it to your Google Drive.

✉️🐞 If you find a bug/something doesn't work, please slack us a screenshot, or email johannes.courses@gmail.com.

Before you start solving the questions, set up the environment as usual so that you can execute SQL and DLATK commands in the notebook.

## 1) Setting up Colab with DLATK and SQLite

### 1a) Install DLATK


In [None]:
# assigning the database name
database = "dla_tutorial"

In [None]:
# installing DLATK and necessary packages
!git clone -b psych290 https://github.com/dlatk/dlatk.git
!pip install -r dlatk/install/requirements.txt
!pip install dlatk/
!pip install wordcloud langid jupysql

Cloning into 'dlatk'...
remote: Enumerating objects: 6991, done.
remote: Counting objects: 100% (1076/1076), done.
remote: Compressing objects: 100% (149/149), done.
remote: Total 6991 (delta 994), reused 935 (delta 927), pack-reused 5915 (from 1)
Receiving objects: 100% (6991/6991), 62.38 MiB | 7.99 MiB/s, done.
Resolving deltas: 100% (4947/4947), done.
Updating files: 100% (338/338), done.
Collecting image<=1.5.33 (from -r dlatk/install/requirements.txt (line 1))
  Downloading image-1.5.33.tar.gz (15 kB)
  Preparing metadata (setup.py) ... done
Collecting langid<=1.1.6,>=1.1.4 (from -r dlatk/install/requirements.txt (line 2))
  Downloading langid-1.1.6.tar.gz (1.9 MB)
  Preparing metadata (setup.py) ... done
Collecting mysqlclient<=2.1.1 (from -r dlatk/install/requirements.txt (line 4))
  Downloading mysqlclient-2.1.1.tar.gz 

### 1b) Mount Google Drive and copy database

In [None]:
# Mount Google Drive & copy database to Colab

# connects & mounts your Google Drive to this colab space
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# copies {database}.db to the sqlite_data folder in this Colab
# (mkdir -p creates the folder first if it does not exist yet)
!mkdir -p sqlite_data
!cp -f "/content/drive/MyDrive/sqlite_databases/{database}.db" "sqlite_data"

# this copies dlatk_lexica.db from your Google Drive to Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/dlatk_lexica.db" "sqlite_data"

Mounted at /content/drive


### 1c) Setup database connection

In [None]:
# loads the %%sql extension
%load_ext sql

# connects the extension to the database
from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")
dlatk_lexica_engine = create_engine(f"sqlite:///sqlite_data/dlatk_lexica.db?charset=utf8mb4")

# attaches the dlatk_lexica.db so tutorial_db_engine can query both databases
from IPython import get_ipython
from sqlalchemy import event

# auto‑attach the lexica db whenever tutorial_db_engine connects
@event.listens_for(tutorial_db_engine, "connect")
def _attach_lexica(dbapi_conn, connection_record):
    dbapi_conn.execute("ATTACH DATABASE 'sqlite_data/dlatk_lexica.db' AS dlatk_lexica;")

%sql tutorial_db_engine

# set the output limit to 50
%config SqlMagic.displaylimit = 50
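
To see what the `ATTACH DATABASE` step above buys you, here is a minimal sketch using Python's built-in `sqlite3` module and two throwaway in-memory databases. The table `toy_lexicon` and its rows are made up for illustration; only the schema name `dlatk_lexica` mirrors the setup cell. The point is that tables in an attached database are addressed as `schema.table`.

```python
import sqlite3

# Toy stand-in for the tutorial database (in memory, nothing touches your files).
conn = sqlite3.connect(":memory:")

# Attach a second, also in-memory, database under the schema name "dlatk_lexica".
conn.execute("ATTACH DATABASE ':memory:' AS dlatk_lexica;")

# Hypothetical lexicon table that lives in the attached database.
conn.execute(
    "CREATE TABLE dlatk_lexica.toy_lexicon (term TEXT, category TEXT, weight REAL);"
)
conn.executemany(
    "INSERT INTO dlatk_lexica.toy_lexicon VALUES (?, ?, ?);",
    [("happy", "POSEMO", 1.0), ("sad", "NEGEMO", 1.0)],
)

# Tables in the attached database are addressed as schema.table.
rows = conn.execute(
    "SELECT term, category FROM dlatk_lexica.toy_lexicon ORDER BY term;"
).fetchall()
print(rows)
```

In the notebook itself, the same `dlatk_lexica.<table>` prefix should work inside `%%sql` cells once the setup cell has run.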

### 1d) (ONLY IF NEEDED: SOFT RELOAD): **If you have a "database lock" problem**

In [None]:
# If you face a "database locked" issue, restart the session & run this cell to get set back up!

database = "dla_tutorial"

%reload_ext sql

from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")
dlatk_lexica_engine = create_engine(f"sqlite:///sqlite_data/dlatk_lexica.db?charset=utf8mb4")

# set the output limit to 50
%config SqlMagic.displaylimit = 50

from IPython import get_ipython
from sqlalchemy import event

# auto‑attach the lexica db whenever tutorial_db_engine connects
@event.listens_for(tutorial_db_engine, "connect")
def _attach_lexica(dbapi_conn, connection_record):
    dbapi_conn.execute("ATTACH DATABASE 'sqlite_data/dlatk_lexica.db' AS dlatk_lexica;")

%sql tutorial_db_engine

# 🦆👩‍🚀 Questions

**FOR ALL QUESTIONS:  
Please remember to summarize your answers in one sentence using human language.**

⚠️ TIP: for all of our sanity, please remember that you can pipe away DLATK output with `>bla.txt 2>&1` at the end of the command.
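
If the redirect syntax in the tip is unfamiliar, the following sketch shows what it does, using a harmless stand-in command instead of DLATK. It assumes a POSIX shell (as on Colab); the `echo` commands and the temporary folder are illustrative only. `> bla.txt` sends standard output to the file, and `2>&1` sends standard error to the same place, so nothing is printed in the notebook.

```python
import subprocess
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "bla.txt"

    # Stand-in for a chatty command: one line to stdout, one line to stderr.
    noisy_command = '(echo "normal output"; echo "warning output" >&2)'

    # `> file` redirects stdout; `2>&1` then points stderr at the same file.
    subprocess.run(f'{noisy_command} > "{log_path}" 2>&1', shell=True, check=True)

    # Both streams ended up in the file instead of on the screen.
    captured = log_path.read_text()

print(captured)
```

If a command fails silently, open the log file to read the error message that was piped away.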


## 1) Please correlate *LIWC* (`POSEMO`), _labmt_ (`valence`) and *NRC* (`SENT`) against `gender`, and report the correlations (with CIs and p-values). For which one do you see the highest correlation?

**FYI:** Feel free to use `mini_LIWC2015` to extract `POSEMO` features.
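
As a reminder of what a correlation "with CIs" means, here is a small sketch that computes a Pearson correlation and an approximate 95% confidence interval via the Fisher z-transformation. The toy numbers are invented, and this is only one common way to build such an interval; DLATK may compute its intervals differently, so report the values DLATK gives you.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for r using the Fisher z-transformation."""
    z = math.atanh(r)               # transform r to an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)     # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Invented toy data: a relative-frequency feature and a binary outcome.
feature = [0.01, 0.03, 0.02, 0.05, 0.04, 0.06, 0.02, 0.07]
outcome = [0, 0, 0, 1, 0, 1, 1, 1]

r = pearson_r(feature, outcome)
ci_low, ci_high = fisher_ci(r, len(feature))
print(f"r = {r:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

Note how wide the interval is with only eight observations; with thousands of users it shrinks considerably.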

**Answer:**

## 2) What's the correlation for the (mini or full) LIWC `POSEMO` category with gender controlling for age, with confidence intervals? Is it significant?  
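
For intuition about "controlling for" a variable, the sketch below computes a partial correlation on invented toy data: both variables are first regressed on the control, and the leftover residuals are then correlated. This is a conceptual illustration under simple assumptions (one control, linear relationships); DLATK may implement controls differently, for example as a regression with covariates, so its numbers need not match this recipe exactly.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residualize(values, control):
    """Remove the linear effect of `control` from `values` (simple regression)."""
    n = len(values)
    mean_v, mean_c = sum(values) / n, sum(control) / n
    slope = sum((c - mean_c) * (v - mean_v) for c, v in zip(control, values)) / sum(
        (c - mean_c) ** 2 for c in control
    )
    return [v - mean_v - slope * (c - mean_c) for v, c in zip(values, control)]


def partial_r(x, y, control):
    """Correlation between x and y after removing the control from both."""
    return pearson_r(residualize(x, control), residualize(y, control))


# Invented toy data: x and y are both driven mostly by the control variable.
control = [1, 2, 3, 4, 5, 6, 7, 8]
x = [c + e for c, e in zip(control, [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5])]
y = [c + e for c, e in zip(control, [0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5])]

raw = pearson_r(x, y)
controlled = partial_r(x, y, control)
print(f"raw r = {raw:.3f}, r controlling for the third variable = {controlled:.3f}")
```

In this toy case the strong raw association mostly disappears once the shared driver is removed, which is exactly the situation a control variable is meant to detect.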

**Answer:**

## 3)

### 3a) Please extract the data-driven machine-learning based age and gender dictionaries from an [EMNLP'14](https://www.aclweb.org/anthology/D14-1121.pdf) paper.

It's in `dlatk_lexica.dd_emnlp15_ageGender`. Please correlate its dictionaries against the age and gender variables. How well does it predict the age and gender of blog authors? Is the correlation of its gender dictionary with gender higher than the correlations of the sentiment dictionaries with gender that you determined in question 1 above?

**Note:** Both for the Gender dictionary and in our outcomes, gender = 1 means **woman**

**Answer:**

### 3b) Find the 5 messages that are scored to be "the oldest" (using the `--top_messages` flag).

Please restrict to messages with at least 20 tokens (for the messages to make sense).

**Hint:** Use `--group_freq_thresh`.

**Note:** May take a while!

**Answer:**

## 4) Please extract and correlate all of the LIWC 2015 categories against one-hot encoded occupation controlling for gender. What LIWC category is most correlated with working in `BIOTECH`? Or put differently, which LIWC category is most distinguishing of authors working in `BIOTECH`?

**Note:** you will get A LOT of console DLATK output. Please pipe it away by adding `> bla.txt 2>&1` (or any filename) to your command, like we did in the tutorial.
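
In case "one-hot encoded" is unfamiliar: a categorical variable with k levels is turned into k binary columns, one per level, so that each level can be correlated separately. The sketch below shows the idea in plain Python with invented occupation labels; it is not how DLATK stores or computes it, and DLATK does the encoding for you.

```python
# Invented toy labels; the real occupation values come from the outcome table.
occupations = ["Student", "BioTech", "Student", "Arts"]

# One binary column per distinct category.
categories = sorted(set(occupations))

# Each author gets a 1 in the column of their own category and 0 elsewhere.
one_hot = [
    {category: int(occupation == category) for category in categories}
    for occupation in occupations
]

for occupation, row in zip(occupations, one_hot):
    print(occupation, row)
```

Each binary column then acts as its own outcome, which is why one run produces a separate set of correlations per occupation.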

**Answer:**

## 5) Based on the results above, can you write 1-2 sentences stating and minimally interpreting the associations for student authors, drawing on [Tausczik & Pennebaker](https://drive.google.com/file/d/1iAB7trS4puQUjBJVQBxk7cHUm4NA5PoR/view?usp=sharing)? The appendix is a good place to look.

For example, "Across N = NNN authors, we observed that X-occupation is correlated with Y-languageCategory at r = Z \[CI, CI\], p = ZZZ when controlling for gender. This suggests that people in X-occupation may use more YY language, which has been previously associated with increased ZZ-PsychologicalConstruct (someCitation)."

Feel free to write something smoother, but don't write a lot.

**Please do the LIWC readings! A friendly shout out.**

**Answer:**

## 6) Correlate the meta features against age, controlling for gender. What do you observe? Can you interpret the word length finding (minimally) drawing on T&P, as above?

**Note:** The meta feature table has a structure similar to that of a regular feature table. You can simply pass it to DLATK with `--feat_table`, and it will do as you say.

(**Note:** LIWC implements word length as the fraction of words with more than 6 characters, whereas DLATK uses token length. The idea is the same, so you can use the same references.)
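
To make the two word-length definitions concrete, the sketch below computes both on one invented sentence. It assumes plain whitespace tokenization and takes "token length" to mean the average number of characters per token; the actual LIWC and DLATK tokenizers and feature definitions may differ in detail.

```python
# Invented example sentence, split on whitespace (a simplifying assumption).
tokens = "the extraordinary committee met again".split()

# LIWC-style measure: share of words with more than 6 characters.
fraction_long_words = sum(len(token) > 6 for token in tokens) / len(tokens)

# Token-length measure (assumed here to be the mean characters per token).
mean_token_length = sum(len(token) for token in tokens) / len(tokens)

print(f"fraction of words longer than 6 characters: {fraction_long_words:.2f}")
print(f"mean token length: {mean_token_length:.2f}")
```

Both numbers go up when a text contains more long words, which is why the same references apply to either measure.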

**Answer:**

## 7) If you set the group frequency threshold to 2,000 to run the LIWC POSEMO correlation against gender, does the result change from question 1? How many users is the correlation now run over?

**Tip:** look at the DLATK console output to get to that info quickly.

**Answer:**

## 8) For question 7, how could you check in SQL, using the meta_1gram_user table, that the number of users with at least 2,000 words is correct?

**Answer:**

## 9) (2 pts) Please find the LIWC2015 dictionary that correlates most strongly with younger age, when controlling for gender.

* Once you have found which dictionary this is, see how the top 10 most frequent words are associated with younger age (so negatively associated, controlling for gender), and produce a table that could go into the supplement of your paper. It should contain the total frequency of the words (in descending order), their beta coefficients, with 95% Confidence Intervals.

* Try to make it look not terrible. We recommend that you use the **Excel template** for formatting that we have provided with the tutorial in your home folder; it will make the table look nice. But we know you are busy, may hate Excel for some reason, etc.

You can add an image to Colab from your machine by being in a markdown cell, then clicking the Insert Image icon, and then executing that cell. Colab will now render your image and save it!

**Answer:**

## **Save your database and output!**

In [None]:
database = 'dla_tutorial'

In [None]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# copy the database file to your Drive
!cp -f "sqlite_data/{database}.db" "/content/drive/MyDrive/sqlite_databases/"

print(f"✅ Database '{database}.db' has been copied to your Google Drive.")

Now, save your data!

In [None]:
OUTPUT_FOLDER = './outputs_hw4'

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Copy the output folder to your Drive (-r makes it copy the folder and all files/folders inside)
!cp -f -r {OUTPUT_FOLDER} "/content/drive/MyDrive/"

print(f"✅ '{OUTPUT_FOLDER}' has been copied to your Google Drive.")