<a href="https://colab.research.google.com/github/CompPsychology/psych290_colab_public/blob/main/notebooks/week-04/W4_Tutorial_06_DLATK_lex_correlation_(dla_tutorial)_withSolutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# W4 Tutorial 6 -- Weighted dictionaries and correlations (DB: dla_tutorial) (2025-03)

(c) Johannes Eichstaedt & the World Well-Being Project, 2023.

‚úãüèª‚úãüèª NOTE - You need to create a copy of this notebook before you work through it. Click on "Save a copy in Drive" option in the File menu, and safe it to your Google Drive.

‚úâÔ∏èüêû If you find a bug/something doesn't work, please slack us a screenshot, or email johannes.courses@gmail.com.


In this tutorial we will extract 1grams (with their meta table) and LIWC features, as in the last tutorial. In addition, we'll extract weighted dictionaries such as **labMT** and **NRC**.

Let's get to it, starting with setting up DLATK and copying the `dla_tutorial` database and `dlatk_lexica.db` to your Colab.

## 1) Setting up Colab with DLATK and SQLite

### 1a) Install DLATK

In [None]:
# assigning the corpus database name
database = "dla_tutorial"

In [None]:
# installing DLATK and necessary packages
!git clone -b psych290 https://github.com/dlatk/dlatk.git
!pip install -r dlatk/install/requirements.txt
!pip install dlatk/
!pip install wordcloud langid jupysql

Cloning into 'dlatk'...
remote: Enumerating objects: 6991, done.
remote: Counting objects: 100% (1076/1076), done.
remote: Compressing objects: 100% (149/149), done.
remote: Total 6991 (delta 994), reused 935 (delta 927), pack-reused 5915 (from 1)
Receiving objects: 100% (6991/6991), 62.38 MiB | 6.58 MiB/s, done.
Resolving deltas: 100% (4947/4947), done.
Updating files: 100% (338/338), done.
Collecting image<=1.5.33 (from -r dlatk/install/requirements.txt (line 1))
  Downloading image-1.5.33.tar.gz (15 kB)
  Preparing metadata (setup.py) ... done
Collecting langid<=1.1.6,>=1.1.4 (from -r dlatk/install/requirements.txt (line 2))
  Downloading langid-1.1.6.tar.gz (1.9 MB)
  Preparing metadata (setup.py) ... done
Collecting mysqlclient<=2.1.1 (from -r 

### 1b) Mount Google Drive and copy database

In [None]:
# Mount Google Drive & copy database to Colab

# connects & mounts your Google Drive to this colab space
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# copies {database_name}.db to the sqlite_data folder in this Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/{database}.db" "sqlite_data"

# this copies dlatk_lexica.db from your Google Drive to Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/dlatk_lexica.db" "sqlite_data"

Mounted at /content/drive


### 1c) Setup database connection

In [None]:
# loads the %%sql extension
%load_ext sql

# connects the extension to the database
from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")
dlatk_lexica_engine = create_engine(f"sqlite:///sqlite_data/dlatk_lexica.db?charset=utf8mb4")

# attaches the dlatk_lexica.db so tutorial_db_engine can query both databases
from IPython import get_ipython
from sqlalchemy import event

# auto-attach the lexica db whenever tutorial_db_engine connects
@event.listens_for(tutorial_db_engine, "connect")
def _attach_lexica(dbapi_conn, connection_record):
    dbapi_conn.execute("ATTACH DATABASE 'sqlite_data/dlatk_lexica.db' AS dlatk_lexica;")

%sql tutorial_db_engine

#set the output limit to 50
%config SqlMagic.displaylimit = 50

### 1d) (ONLY IF NEEDED: SOFT RELOAD): **If you have a "database lock" problem**

In [None]:
# If you face a "database locked" issue, restart the session & run this cell to get set back up!

database = "dla_tutorial"

%reload_ext sql

from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")
dlatk_lexica_engine = create_engine(f"sqlite:///sqlite_data/dlatk_lexica.db?charset=utf8mb4")

# set the output limit to 50
%config SqlMagic.displaylimit = 50

from IPython import get_ipython
from sqlalchemy import event

# auto-attach the lexica db whenever tutorial_db_engine connects
@event.listens_for(tutorial_db_engine, "connect")
def _attach_lexica(dbapi_conn, connection_record):
    dbapi_conn.execute("ATTACH DATABASE 'sqlite_data/dlatk_lexica.db' AS dlatk_lexica;")

%sql tutorial_db_engine

## **(If needed)** re-extract features and dictionaries

This notebook needs 1grams and LIWC features. For the latter, we'll extract `mini_LIWC2015` (containing only the `POSEMO`, `NEGEMO`, and `SOCIAL` categories), as in the last tutorial. If you have these tables from previous tutorials, you can skip this step. Otherwise, create them with the DLATK commands below. If you run these commands and the tables already exist, the commands will simply replace the existing tables.

First comes the 1grams feature table:

In [None]:
database = "dla_tutorial"
msgs_table = 'msgs'

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --add_ngrams -n 1

The above command produced the table `feat$1gram$msgs$user_id`, containing 1grams as features. Next comes the `mini_LIWC2015` feature table.

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --add_lex_table -l mini_LIWC2015

The above command produced table `feat$cat_mini_LIWC2015$msgs$user_id$1gra` with `mini_LIWC2015` as features.

## 2) Weighted Dictionaries

Alright, time for weighted dictionaries. These weights could originate from word-level annotations (as in the case of labMT), or be inferred through machine learning based on document-level annotations (or some other magic).  

All necessary dictionaries are stored in the `dlatk_lexica` database. Let's look at the dictionaries inside the database.

üí°üí° Remember, to switch to switch between databases, we use:
```
%sql tutorial_db_engine  # for the main database (i.e., dla_tutorial)
```

or

```
%sql dlatk_lexica_engine  # for the dlatk lexica database
```

If you get an error saying "other database already connected", try running `%reload_ext sql`. If that doesn't work, resort to Runtime ==> Restart Session.

In [None]:
%sql dlatk_lexica_engine

In [None]:
res = %sqlcmd tables
print(res)

+----------------------+
|         Name         |
+----------------------+
|       LIWC2015       |
|    dd_PastPreFut     |
| dd_emnlp14_ageGender |
|      dd_permaV3      |
| dd_wassa16_affectInt |
|      fb2000_cp       |
|  fb2000_freq_t50ll   |
|        labmt         |
|    mini_LIWC2015     |
|         nrc          |
|       nrc_emot       |
|       nrc_sent       |
+----------------------+


```
üê¨üê¨üê¨
USE dlatk_lexica;
SHOW TABLES;
üê¨üê¨üê¨
```

Among the tables, we have already seen `LIWC2015`, which looks like this:

In [None]:
%%sql

SELECT *
FROM LIWC2015
ORDER BY RANDOM()
LIMIT 5;

id,term,category,weight
15743,various,ADJ,1
6457,world-class,POWER,1
6296,poorest,POWER,1
5519,miscar*,BIO,1
15016,ya,YOU,1


The column `weight` is 1.0 for all rows in LIWC2015 because it deems all the terms in a category to be equally important (LIWC is unweighted). Let's confirm this with one of the categories.

In [None]:
%%sql

SELECT *
FROM LIWC2015
WHERE category = 'WE';

id,term,category,weight
309,let's,WE,1
310,lets,WE,1
311,our,WE,1
312,ours,WE,1
313,ourselves,WE,1
314,us,WE,1
315,we,WE,1
316,we'd,WE,1
317,we'll,WE,1
318,we're,WE,1


The dictionaries **labMT** and **NRC**, on the other hand, do not consider all terms to be equally important, so they weight terms differently based on their importance within the category.

[**LabMT**](https://www.nature.com/articles/srep02625)  
The labMT word list was created by combining the ~10k words most frequently appearing in four sources (Twitter, the New York Times, Google Books, and music lyrics) and then scoring the words for sentiment on Amazon's Mechanical Turk (annotations range from 1 to 9). The table `dlatk_lexica.labmt` contains this lexicon/dictionary, restricted to the 3.7k words with valences <4 or >6. This follows the original authors' guidelines on how to use the dictionary.

[**NRC**](https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm)  
The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The annotations were manually done by crowdsourcing. The table `dlatk_lexica.nrc` contains just the positive/negative sentiments from NRC.
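The labMT filtering rule described above (keep only words rated below 4 or above 6) can be sketched in a few lines of Python. The ratings in this toy dictionary are hypothetical placeholders, not values read from `dlatk_lexica.labmt`, and `keep_emotional_words` is a made-up helper name:

```python
# Hypothetical word -> valence ratings on the 1-9 scale (illustration only).
ratings = {"laughter": 8.5, "the": 5.0, "terrorist": 1.3, "of": 4.9, "happy": 8.3}

def keep_emotional_words(word_ratings, low=4.0, high=6.0):
    """Keep only words whose valence is below `low` or above `high`."""
    return {w: v for w, v in word_ratings.items() if v < low or v > high}

filtered = keep_emotional_words(ratings)
print(sorted(filtered))
```

Words in the neutral band between 4 and 6 are dropped, which is why no such weights should appear in the `labmt` table.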

Let us look at a few records of `labmt`.

In [None]:
%%sql

SELECT *
FROM labmt
ORDER BY RANDOM()
LIMIT 5;

id,term,category,weight
2590,writes,valence,6.02
2267,proceeded,valence,6.12
10027,beating,valence,2.56
18,smiled,valence,8.08
2175,wed,valence,6.16


There is only one category in `labmt` -- `valence`.

In [None]:
%%sql

SELECT DISTINCT(category)
FROM labmt;

category
valence


#### üë©‚Äçüî¨üíª Exercise

All the terms/words belong to this one category - `valence`. Can you see how many words/terms we have?

In [None]:
%%sql

SELECT COUNT(DISTINCT term) AS num_terms
FROM labmt;

num_terms
3731


That is `3731` terms.

#### üë©‚Äçüî¨üíª Exercise

Now, what are the the minimum and maximum weights in this dictionary?

In [None]:
%%sql

SELECT MIN(weight) AS min_weight, MAX(weight) AS max_weight
FROM labmt;

min_weight,max_weight
1.3,8.5


#### üë©‚Äçüî¨üíª Exercise

Also, confirm if there are any terms with weight >4 and <6?

In [None]:
%%sql

SELECT COUNT(*) AS num_terms
FROM labmt
WHERE (weight > 4 ) AND (weight < 6);

num_terms
0


Now that we have explored `labmt`, let's check `nrc` - which is a sentiment dictionary. Let's start by looking at some random records.

In [None]:
%%sql

SELECT *
FROM nrc
ORDER BY RANDOM()
LIMIT 5;

id,term,category,weight
26139,@wweajlee,SENT,0.19
53269,limbs,SENT,-0.081
97430,ledge,neg_sent,-0.081
86153,cave,pos_sent,0.055
40804,91,SENT,0.008


We see that `nrc` also has negative weights. Words with negative weights subtract from the contribution of words with positive weights in the same category.

If the negative words dominate, the resulting `group_norm`s will be negative once the dictionary is extracted. That works fine in DLATK: correlations etc. will still work (but may be flipped, because there will be negative `group_norm`s).

#### üë©‚Äçüî¨üíª Exercise

Can you check the number of unique categories?

In [None]:
%%sql

SELECT COUNT(DISTINCT category) AS num_categories
FROM nrc;

num_categories
3


#### üë©‚Äçüî¨üíª Exercise

Also, check the number of types (distinct tokens) in the vocabulary.

In [None]:
%%sql

SELECT COUNT(distinct(term)) AS num_terms
FROM nrc;

num_terms
54128


That's a huge vocabulary! Another sign that these weights were learned through machine learning, and not through word annotations.

### 2a) Extracting Weighted Dictionaries

Let's now extract a weighted dictionary. The extraction process is the same, except now DLATK should multiply the occurrence of words in the dictionary by the weight of the words. The command for DLATK to extract a weighted dictionary needs `--weighted_lexicon` to tell DLATK to handle weights properly. Without it, it would ignore the weights, and the `_w` would be missing from the feat table name.
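To make the difference between weighted and unweighted extraction concrete, here is a minimal sketch for a single user. It assumes the dictionary score is the sum, over dictionary words, of the word's relative frequency times its weight; the word counts, weights, and the helper `dictionary_score` are all hypothetical, and DLATK's actual implementation may differ in details.

```python
# Toy illustration of weighted vs. unweighted dictionary scores for ONE user.
# Assumption: score = sum over dictionary words of (relative frequency * weight).

word_counts = {"happy": 2, "sad": 1, "the": 7}   # hypothetical 1gram counts
lexicon = {"happy": 8.0, "sad": 2.0}              # hypothetical valence weights

total_words = sum(word_counts.values())           # all tokens of this user

def dictionary_score(counts, weights, total, use_weights):
    """Sum relative frequencies of dictionary words, optionally times weight."""
    score = 0.0
    for word, count in counts.items():
        if word in weights:
            weight = weights[word] if use_weights else 1.0
            score += (count / total) * weight
    return score

unweighted = dictionary_score(word_counts, lexicon, total_words, use_weights=False)
weighted = dictionary_score(word_counts, lexicon, total_words, use_weights=True)
print(unweighted, weighted)
```

In the unweighted case every dictionary word counts the same; in the weighted case a high-valence word like "happy" contributes far more than a low-valence word like "sad".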

#### (i) Extracting `labmt`

Let's extract `labmt`. Again, the order of DLATK flags does not matter. The `\` just allows for line breaks to make the command easier to read.

It's the same as when we extracted something like `mini_LIWC2015`; it just has `--weighted_lexicon` as an additional flag.

In [None]:
database = "dla_tutorial"
msgs_table = "msgs"

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --weighted_lexicon \
    --add_lex_table -l labmt



TopicExtractor: gensim Mallet wrapper unavailable, using Mallet directly.

-----
DLATK Interface Initiated: 2025-04-17 22:10:00
-----
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
query: PRAGMA table_info(msgs)
SQL Query: DROP TABLE IF EXISTS feat$cat_labmt_w$msgs$user_id$1gra
SQL Query: CREATE TABLE feat$cat_labmt_w$msgs$user_id$1gra ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(10), value INTEGER, group_norm DOUBLE)


Creating index correl_field on table:feat$cat_labmt_w$msgs$user_id$1gra, column:group_id 


SQL Query: CREATE INDEX correl_field$cat_labmt_w$msgs$user_id$1gra ON feat$cat_labmt_w$msgs$user_id$1gra (group_id)


Creating index feature on table:feat$cat_labmt_w$msgs$user_id$1gra, column:feat 


SQL Query: CREATE INDEX feature$cat_labmt_w$msgs$user_id$1gra ON feat$cat_labmt_w$msgs$user_id$1gra (feat)
WORD TABLE feat$1gram$msgs$user_id
10 out of 1000 group Id's processed; 0.01 complete
20 out of 1000 group Id's processed; 0.02 complete
30 out 

The above command produced the feature table `feat$cat_labmt_w$msgs$user_id$1gra`. Note that the feature table name now contains `...cat_labmt_w...`, where the `_w` means that it was created with `--weighted_lexicon`. So if you had forgotten to activate the weighted extraction, `_w` would not be in the name.

Before we analyse the feature table, let's move to our `dla_tutorial` database.

```
üê¨üê¨üê¨
USE {database};
üê¨üê¨üê¨
```

In [None]:
%sql tutorial_db_engine

In [None]:
feat_labmt_user = "feat$cat_labmt_w$msgs$user_id$1gra"

In [None]:
%%sql

SELECT *
FROM {{feat_labmt_user}}
ORDER BY RANDOM()
LIMIT 5;

id,group_id,feat,value,group_norm
689,3478510,valence,241,1.4100778967867575
857,3570813,valence,4097,1.2980608345112832
1209,3808913,valence,239,1.391739526411658
1121,3745004,valence,3924,1.2079569009748576
35,734023,valence,204,1.2232298136645954


Besides `valence`, the `feat` column of this table can contain an `_intercept` feature (it did not happen to be sampled above). This is a DLATK dummy variable. It is there to make sure a row appears for users who did not use any words from the dictionary. Basically, it ensures that if you run `SELECT COUNT(DISTINCT(group_id))` on a feature table, you will always get the number of `group_id`s it was extracted over.
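Here is a small sketch of why the `_intercept` rows matter, using an in-memory SQLite table with made-up rows. The table name `feat_toy` and its contents are hypothetical; only the column layout mirrors the feature tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feat_toy (group_id INTEGER, feat TEXT, value INTEGER, group_norm REAL)"
)
conn.executemany(
    "INSERT INTO feat_toy VALUES (?, ?, ?, ?)",
    [
        (1, "valence", 12, 1.3),
        (1, "_intercept", 1, 1.0),
        (2, "_intercept", 1, 1.0),  # user 2 used no dictionary words at all
    ],
)

# Counting groups with the dummy rows included vs. excluded.
with_intercept = conn.execute(
    "SELECT COUNT(DISTINCT group_id) FROM feat_toy"
).fetchone()[0]
without_intercept = conn.execute(
    "SELECT COUNT(DISTINCT group_id) FROM feat_toy WHERE feat != '_intercept'"
).fetchone()[0]
print(with_intercept, without_intercept)
```

Without the dummy rows, user 2 would silently disappear from the group count.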

#### üë©‚Äçüî¨üíª Exercise

Before we go forward, can you quickly find the top 10 most frequent words contained in the `labmt`? Ignoring weights, just in terms of unweighted mentions -- but do also output the weight (valence) for them, together with their word count.

Please skip this exercise if it's holding you up for too long -- lots to get to!

In [None]:
feat_1gram_user = 'feat$1gram$msgs$user_id'

In [None]:
%%sql

SELECT feat, counts, weight
FROM (SELECT feat, COUNT(value) AS counts
      FROM {{feat_1gram_user}}
      GROUP BY feat) AS word_counts, dlatk_lexica.labmt
WHERE feat = term
ORDER BY counts DESC
LIMIT 10;

feat,counts,weight
my,985,6.16
all,983,6.22
me,965,6.58
not,962,3.86
like,949,7.22
up,933,6.14
you,928,6.24
more,905,6.24
will,898,6.02
we,895,6.38


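As you can see, function words remain in `labmt` with valences (weights) **just** above 6. (Note that `COUNT(value)` counts the number of users who used each word; `SUM(value)` would give the total number of mentions instead.)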

What could possibly go wrong??

#### (ii) Extracting `nrc`

Let's repeat this for dictionary `NRC`. We use the same command but change the dictionary name. It does not matter to DLATK _how_ the weights came about (through word annotation or some machine learning model), only that the dictionary _has_ weights, and that it should use them (`--weighted_lexicon`).

In [None]:
database = "dla_tutorial"
msgs_table = "msgs"

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --weighted_lexicon \
    --add_lex_table -l nrc

The above command produced the feature table `feat$cat_nrc_w$msgs$user_id$1gra`, and again we see the `_w` in the name, telling us the extraction was done with a weighted lexicon/dictionary.

In [None]:
feat_nrc_user = 'feat$cat_nrc_w$msgs$user_id$1gra'

In [None]:
%%sql

SELECT *
FROM {{feat_nrc_user}}
ORDER BY RANDOM()
LIMIT 5;

id,group_id,feat,value,group_norm
93,832265,SENT,40232,-0.015619862794127
1572,3527113,_intercept,1,1.0
1567,3525639,pos_sent,504,0.1157931034482757
2395,3802286,neg_sent,16678,-0.2211548211463031
2609,3858126,SENT,9853,0.0194108956602031


Again, the value `_intercept` in `feat` column is to make sure a row appears for users who did not use any words from the dictionary (see above).

#### üë©‚Äçüî¨üíª Exercise

Can you check for users who did not use any words from the `nrc` set of dictionaries (`SENT`, `POS_SENT`, `NEG_SENT`)?

**Hint:** How would they appear in the NRC feat table?

In [None]:
feat_nrc_user = 'feat$cat_nrc_w$msgs$user_id$1gra'

In [None]:
%%sql

SELECT group_id
FROM {{feat_nrc_user}}
GROUP BY group_id
HAVING COUNT(*) = 1;

group_id


So this means that all users had at least 1 word that appeared in NRC.

### 2b) Words per group_id threshold

### Group Frequency Threshold -- Ensuring minimum number of words per user

Before we get to correlations, we need to talk about something important.

Your 401(k) tax planning.

When we do language analyses like correlations, we want to make sure users have at least (some) minimum number of words across their blog posts. If a user only has 5 words, their (relative frequency = group_norm) dictionary statistics won't be particularly meaningful, and might throw off the correlations. We don't want to lump them in with users who have 1,000 words.

Generally, we default to dropping users (or groups, more generally) who have fewer than a predetermined (threshold) number of words (tokens) in the data set.

In DLATK that's implemented as a cut-off (threshold), `--group_freq_thresh`, the group frequency threshold: the minimum word count per group. With `--correl_field user_id`, that means the minimum number of tokens/words a user needs in order to be included in the correlation. `--group_freq_thresh` is meant to be used along with `--correlate`.

**DLATK defaults?**
- When we don't include it, DLATK will default to a `--group_freq_thresh` of 500, which is a good minimal threshold in our experience.
- When we run message-level correlations, we want to include all messages, generally, even very short ones: If we want to override the threshold, we can always say `--group_freq_thresh 0` (include every group, regardless of how many tokens).

In our dla_tutorial data set, if we limit to users with 500 words or more, 978 out of 1,000 make the cut. We verify this below.
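As a rough sketch of what the threshold does, the snippet below filters a toy set of per-user word counts. The counts and the helper `groups_above_threshold` are made up, and the inclusive comparison (`>=`) is an assumption based on the "500 words or more" wording above.

```python
# Hypothetical per-user word counts (user_id -> total number of 1grams).
word_counts = {101: 5, 102: 499, 103: 500, 104: 12000}

def groups_above_threshold(counts, threshold):
    """Return the ids of groups with at least `threshold` words, sorted."""
    return sorted(g for g, n in counts.items() if n >= threshold)

kept = groups_above_threshold(word_counts, 500)    # like --group_freq_thresh 500
kept_all = groups_above_threshold(word_counts, 0)  # like --group_freq_thresh 0
print(kept, kept_all)
```

With a threshold of 0 every group is kept, which is what we usually want for message-level analyses.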

This is something we have to keep in mind when we work in R -- when we run correlations there between outcomes and features, as we will in the next tutorial, we want to make sure we also shortlist to the same 978 users.

The best way to keep track is **just make word count another group-level outcome in the outcomes table**. So, let's get word counts from all users, and merge it as a column onto the outcome table.  



#### (i) Word Counting Way (1)

Let's remind ourselves, here is the first way -- summing over the values in the 1gram table:

In [None]:
feat_1gram_user = 'feat$1gram$msgs$user_id'

In [None]:
%%sql

SELECT group_id, SUM(value) AS word_count
FROM {{feat_1gram_user}}
GROUP BY group_id
ORDER BY word_count DESC
LIMIT 10;

group_id,word_count
942828,365311
664485,262052
1234212,259708
979795,135594
1807720,130719
518116,111002
2314011,104240
2238828,97572
1488330,95536
1826527,95372


#### (ii) Word Counting Way (2) (easiest)

Here is the second way: just using the meta-table that was also automatically extracted during 1gram extraction.

In [None]:
meta_1gram_user = 'feat$meta_1gram$msgs$user_id'

In [None]:
%%sql

SELECT group_id, group_norm AS word_count
FROM {{meta_1gram_user}}
WHERE feat = '_total1grams'
ORDER BY word_count DESC
LIMIT 10;

group_id,word_count
942828,365311.0
664485,262052.0
1234212,259708.0
979795,135594.0
1807720,130719.0
518116,111002.0
2314011,104240.0
2238828,97572.0
1488330,95536.0
1826527,95372.0


As you can see, these are identical.  

### 2c) Let's add wordcounts to our outcome table

Let's add a column to the outcome table `outcomes` and store these counts there. It will put the column at the end. It will default to NULL entries, which is the `NA` of SQL.

In [None]:
%%sql

ALTER TABLE outcomes ADD COLUMN wordcount INT NULL;

Now we use `UPDATE table SET Var = something` to update.

In [None]:
%%sql

UPDATE outcomes AS a
SET wordcount = (
                SELECT b.group_norm
                FROM {{meta_1gram_user}} AS b
                WHERE b.group_id = a.user_id AND b.feat = '_total1grams'
                )
WHERE a.user_id IN (SELECT group_id FROM {{meta_1gram_user}} WHERE feat = '_total1grams');

In SQLite, you have to join by using a subquery in the `UPDATE table SET` syntax. But in MySQL you can simply:

```
üê¨üê¨üê¨
UPDATE outcomes AS a, {{meta_1gram_user}} AS b
SET wordcount = group_norm
WHERE a.user_id = b.group_id AND feat = "_total1grams";
üê¨üê¨üê¨
```
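If you'd like to try the SQLite correlated-subquery `UPDATE` pattern on throwaway data first, here is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database. The tables `outcomes_toy` and `meta_toy` and their rows are hypothetical stand-ins for `outcomes` and the 1gram meta table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE outcomes_toy (user_id INTEGER PRIMARY KEY, age INTEGER);
    CREATE TABLE meta_toy (group_id INTEGER, feat TEXT, group_norm REAL);
    INSERT INTO outcomes_toy VALUES (1, 25), (2, 31), (3, 47);
    INSERT INTO meta_toy VALUES
        (1, '_total1grams', 1200.0),
        (2, '_total1grams', 80.0);   -- user 3 has no meta row
    ALTER TABLE outcomes_toy ADD COLUMN wordcount INT NULL;
""")

# For every user that has a meta row, look up their total word count.
conn.execute("""
    UPDATE outcomes_toy
    SET wordcount = (
        SELECT b.group_norm
        FROM meta_toy AS b
        WHERE b.group_id = outcomes_toy.user_id AND b.feat = '_total1grams'
    )
    WHERE user_id IN (SELECT group_id FROM meta_toy WHERE feat = '_total1grams')
""")

result = conn.execute(
    "SELECT user_id, wordcount FROM outcomes_toy ORDER BY user_id"
).fetchall()
print(result)
```

Users without a matching meta row keep `NULL` in the new column, which Python shows as `None`.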

Note: you're altering the `outcomes` table here. If you want a fresh copy, you can always get it from `csvToSQLite` (shown in mini tutorial 5B)!

Alrighty! Let's see how many users have at least 500 words in that new column.

**NOTE** 500-1000 words is the rule of thumb threshold for number of words per user to get decent dictionary language variables. Less if you use dictionaries that ingest giant vocabularies (>10,000), like NRC. Those really can make every word... count.

In [None]:
%%sql

SELECT COUNT(*) AS n_users
FROM outcomes
WHERE wordcount >= 500;

n_users
978


Bingo, OK, the world makes sense. It's nice to have all this important user-level stuff in the same table, especially as we think about pulling these tables into R.

## 3) Language Correlations With Feature Tables (Dictionaries)

OK, we've now spent a lot of time staring at and looking into feature tables. The time has come to do something with them -- we'll run our first language correlations against "outcomes" in the table `outcomes` -- such as age and gender for users. The DLATK command to do this will have the argument `--correlate`.

The basic idea is that `--correlate` correlates:
- **Language** in a feature table specified with `--feat_table`
- **Outcomes** in the outcome table specified with `--outcome_table`. The outcome table can have many columns, so we specify the column names/variables we wish to designate as **outcomes** (to correlate against) with `--outcomes`.

We'll give it a few extra flags to give output of a few different kinds. Following are all the arguments we need:
- `--feat_table` -- Feature table with language features. This can be any one of the feature tables we have extracted so far including -- 1grams, LIWC, labmt, NRC
- `--outcome_table` -- Table where our extra-linguistic data lives. For us this is the `outcomes` table.
- `--outcomes` -- Column names within the outcome table (specified in `--outcome_table`) for variables we want to correlate language with. For us these are age, gender, sign, etc., as per our `outcomes` table.
- `--rmatrix` -- Produces a correlation matrix in HTML format with color coding so it is easier to read.
- `--csv` -- Produces a correlation matrix in csv format so we can open it up in excel, R, Python etc.
- `--sort` -- Appends a table to the HTML or csv with correlations sorted by effect size.
- `--output_name` -- Specifies what the output files should be called. Let us start the name of the output with the feature table so we can run multiple times with different feature tables.

This is a good `--output_name` convention: {feature-type}\_{outcomes}\_CONTROLS\_{controls-if-any}

Finally, from now on, let us ALWAYS specify the group frequency threshold explicitly. That reduces the odds that we ever forget what it is set to, and whether that's appropriate for what we are doing.

- `--group_freq_thresh 500` -- limit to groups with 500 words or more.
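Conceptually, for each feature and each outcome, a language correlation relates users' `group_norm`s to their outcome values. The sketch below computes a plain Pearson correlation on made-up numbers; the helper `pearson_r` and the data are hypothetical, and DLATK's own pipeline additionally handles thresholds, p-values, and multiple-comparison correction.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-user values: POSEMO group_norm and age.
posemo_group_norm = [0.010, 0.020, 0.030, 0.040]
age = [40, 30, 35, 20]

r = pearson_r(posemo_group_norm, age)
print(round(r, 3))
```

A negative `r` here would mean that, in this toy sample, older users use relatively less of the feature.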




Let's run the correlations for `mini_LIWC2015` with these arguments. But before that, let's create a folder for all the outputs that we'll generate in this tutorial. (Then you can download the folder or save it in Google Drive!)

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'
outcomes_table = 'outcomes'
feat_miniliwc_user = 'feat$cat_mini_LIWC2015$msgs$user_id$1gra'

OUTPUT_FOLDER = './outputs_tutorial_06'
OUTPUT_NAME = 'mini_liwc_age_gender'
!mkdir -p {OUTPUT_FOLDER} # this makes a folder, the -p does it even if it exists already

**Note:** make sure to put `'` quotes around your feat table name, even if you insert it cleverly with a Python variable.

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --correlate \
    --rmatrix --csv --sort \
    --feat_table '{feat_miniliwc_user}' \
    --outcome_table {outcomes_table} \
    --outcomes age gender \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME}

Please scroll through the output and confirm that you see the row that tells you that `978` groups were retained -- it's the same `978` that we manually checked above.

⚠️ It is wise to look for the **number of groups** on every DLATK run. It will save you countless headaches!

The above command produced files in the output folder. All of them are prefixed by `mini_liwc_age_gender`, which we specified in the command with `--output_name`. Let us look at what we have. The `*` is a wildcard that allows us to list any file or directory that begins with `mini_liwc_age_gender`.

In [None]:
!ls -lh {OUTPUT_FOLDER}/{OUTPUT_NAME}*

-rw-r--r-- 1 root root 7.6K Apr 17 22:54 ./outputs_tutorial_06/mini_liwc_age_gender.csv
-rw-r--r-- 1 root root  16K Apr 17 22:54 ./outputs_tutorial_06/mini_liwc_age_gender.html


There are 2 files -- html & csv.  

You can go back to the Colab file tree (on the left), click the refresh button in the tree, and download the files there.

💡 Some files are small enough and formatted correctly to be opened inside Colab's file preview feature. Try double-clicking on the file and you will either get a little popout window of the csv/image/file or you'll get an error! Otherwise, download it and open it on your laptop :)


The csv file contains the same information as in the html but is readable by Excel / R. When you open the html file, it will look like this:


![Fig 1](https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-06/fig1.png)


When you open the csv file in excel, it will look like this:

![Fig 2](https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-06/fig2.png)


We see that `gender` is correlated 0.248 with `SOCIAL`, and 0.142 with `POSEMO`. As women are coded as 1, this means women use more `POSEMO` and `SOCIAL` language.

The output also contains `p` values (already adjusted for multiple comparisons; in this case, across three features: `NEGEMO`, `POSEMO`, `SOCIAL`). `N` contains the number of groups this was run over (978 again!). `CI_l` is the lower bound of the 95% confidence interval, `CI_u` the upper bound. `freq` is the total sum(value) for the dictionaries -- the total number of dictionary words seen across the 978 users.
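For intuition about where a confidence interval for a correlation can come from, here is a sketch of one standard textbook approach, the Fisher z-transformation. Whether DLATK computes `CI_l` and `CI_u` in exactly this way is an assumption; the helper `fisher_ci` is hypothetical, and because the input `r` is rounded, the bounds need not match DLATK's output to the last digit.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson r via Fisher's z."""
    z = math.atanh(r)                 # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)       # standard error of z
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper

# Rounded correlation and number of groups, used purely for illustration.
ci_low, ci_high = fisher_ci(-0.24, 978)
print(round(ci_low, 2), round(ci_high, 2))
```

Note how a large `N` gives a fairly narrow interval around the point estimate.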

So in APA reporting, LIWC's `NEGEMO` and age are correlated at `r = -.24 [-.29, -.18], p < .001`. That's what you would report, with a note in the methods section that all the p's are corrected for multiple comparison using the Benjamini-Hochberg False Discovery Rate adjustment.
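To make the Benjamini-Hochberg adjustment mentioned above concrete, here is a minimal sketch in plain Python. The function name `benjamini_hochberg` and the three raw p-values are made up for illustration; DLATK applies the correction internally, and its implementation may differ in details.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p to the smallest, keeping adjusted values monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted

raw_p = [0.01, 0.04, 0.03]   # hypothetical raw p-values for three features
adj_p = benjamini_hochberg(raw_p)
print(adj_p)
```

Each raw p-value is scaled by the number of tests divided by its rank, so adjusted values are never smaller than the raw ones.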

### 3a) Weighted Dictionary Correlations

Weighted dictionary correlations are exactly the same as unweighted dictionary correlations. We just supply the feature table that was created with a weighted lexicon (`--weighted_lexicon`) to the correlate command, and that's it.

That's the beauty of the feature tables -- they are so nicely interchangeable.

Let's correlate `labmt` with `age` and `gender`. Note the `labmt` feature table below. Of course, `labmt` feature table needs to be extracted first. (Just a reminder: Before we can extract the `labmt` feature table, we have to extract the 1grams feature table, which we have already done, of course.)

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'
outcomes_table = 'outcomes'
feat_labmt_user = 'feat$cat_labmt_w$msgs$user_id$1gra'

OUTPUT_FOLDER = './outputs_tutorial_06'
OUTPUT_NAME = 'labmt_age_gender'
!mkdir -p {OUTPUT_FOLDER}

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --correlate \
    --rmatrix --csv --sort \
    --feat_table '{feat_labmt_user}' \
    --outcome_table {outcomes_table} \
    --outcomes age gender \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME}

Similar to the previous run, the above command produced files in the output folder. All of them are prefixed by `labmt_age_gender`, which we specified in the command with `--output_name`. Let us look at what we have. The `*` is a wildcard that allows us to list any file or directory that begins with `labmt_age_gender`.

In [None]:
!ls -lh {OUTPUT_FOLDER}/{OUTPUT_NAME}*

-rw-r--r-- 1 root root 6.8K Apr 17 22:55 ./outputs_tutorial_06/labmt_age_gender.csv
-rw-r--r-- 1 root root  13K Apr 17 22:55 ./outputs_tutorial_06/labmt_age_gender.html


Open the files with the Colab file browser (or download them) and take a look!

#### üë©‚Äçüî¨üíª Exercise

Can you repeat this with `nrc` and take a look at the files? Remember to rename the `--output_name` accordingly.

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'
outcomes_table = 'outcomes'
feat_nrc_user = 'feat$cat_nrc_w$msgs$user_id$1gra'

OUTPUT_FOLDER = './outputs_tutorial_06'
OUTPUT_NAME = 'nrc_age_gender'
!mkdir -p {OUTPUT_FOLDER}

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --correlate \
    --rmatrix --csv --sort \
    --feat_table '{feat_nrc_user}' \
    --outcome_table {outcomes_table} \
    --outcomes age gender \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME}

### 3b) Categorical Variables as Outcomes

Categorical data is data that takes only a limited number of values. For example, if people responded to a survey about which brand of car they own, the result would be categorical (because the answers would be things like Honda, Toyota, Ford, None, etc.). Responses fall into a fixed set of categories.  

[One-hot encoding](https://en.wikipedia.org/wiki/One-hot) is a widespread approach to encoding categorical variables when the number of possible values of the categorical variable is small enough.

For example, let us say our categorical variable is _color_ and it takes only 3 values -- red, blue, green. Our one-hot encoding for them can look like:

| color | encoding | isRed | isBlue | isGreen |
| ----- | -------- | --------- | --------- | --------- |
| red   | 1,0,0 | 1 | 0 | 0 |
| blue  | 0,1,0 | 0 | 1 | 0 |
| green | 0,0,1 | 0 | 0 | 1 |


Things to note:
- This is a binary encoding -- there are only 1s and 0s in the encoding.
- There is always just one binary 1 in the encoding and rest are zero.
- Position of the binary 1 in the encoding tells us the category it maps to. This mapping is decided by us.
- The length of our encoding, in characters, is equal to the number of values our variable will take. In this case 3.

**This is the important bit:** DLATK will use one-hot encoding for us if we specify `--categories_to_binary`. The results will automatically be in terms of our categories, which is very convenient.  
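The encoding idea can be sketched in a few lines of Python. The helper `one_hot` and the `is_<category>` column names are hypothetical; DLATK chooses its own names for the binary outcomes it creates.

```python
def one_hot(values):
    """Turn a list of category labels into a list of 0/1 indicator dicts."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

colors = ["red", "blue", "green", "red"]   # hypothetical categorical column
encoded = one_hot(colors)
for color, row in zip(colors, encoded):
    print(color, row)
```

Each resulting 0/1 column can then be correlated with language just like any other numeric outcome.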

Let's say we want to correlate positive emotion against the different occupations -- answering the question of whether one occupation uses more `POSEMO` language than the others.

Let us use the categorical column `occu` (occupation) from the `outcomes` table to correlate with the language we extracted with `mini_LIWC2015`.

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'
outcomes_table = 'outcomes'
feat_miniliwc_user = 'feat$cat_mini_LIWC2015$msgs$user_id$1gra'

OUTPUT_FOLDER = './outputs_tutorial_06'
OUTPUT_NAME = 'mini_liwc_occu'
!mkdir -p {OUTPUT_FOLDER}

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --correlate \
    --rmatrix --csv --sort \
    --feat_table '{feat_miniliwc_user}' \
    --outcome_table {outcomes_table} \
    --categories_to_binary occu \
    --outcomes occu \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME}

The above command produced files in the output directory. All of them are prefixed by `mini_liwc_occu` which we specified in the command with `--output_name`. Let us look at what we have. The `*` is a wild-card that allows us to print any file or directory that begins with `mini_liwc_occu`.
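
If you prefer to stay in Python, the same prefix match can be sketched with the standard-library `glob` module. The temporary folder and the empty files below are stand-ins created only for this example; in the notebook you would point the pattern at your real `OUTPUT_FOLDER` and `OUTPUT_NAME`.

```python
import glob
import os
import tempfile

# Stand-ins for the notebook's OUTPUT_FOLDER and OUTPUT_NAME.
output_folder = tempfile.mkdtemp()
output_name = "mini_liwc_occu"

# Create a few empty files so the pattern has something to match.
for filename in ["mini_liwc_occu.csv", "mini_liwc_occu.html", "other.csv"]:
    open(os.path.join(output_folder, filename), "w").close()

# The trailing * matches anything that follows the prefix.
matches = sorted(glob.glob(os.path.join(output_folder, output_name + "*")))
matched_names = [os.path.basename(path) for path in matches]
print(matched_names)
```

Only the two files that begin with the prefix are listed; `other.csv` is skipped.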

In [None]:
!ls -lh {OUTPUT_FOLDER}/{OUTPUT_NAME}*

-rw-r--r-- 1 root root  33K Apr 17 22:58 ./outputs_tutorial_06/mini_liwc_occu.csv
-rw-r--r-- 1 root root 151K Apr 17 22:58 ./outputs_tutorial_06/mini_liwc_occu.html


#### 👩‍🔬💻 Exercise

Look at the output html file (see it in the `outputs_tutorial_06` folder). Do students express more Negative Emotion? 😶😶😶

## 4) Statistical (Covariate) Controls

As you know from your statsy courses, often we would like to know the correlation for our variables of interest in the presence of other variables -- to isolate the effect of some outcome variable adjusting for some control variable (alternative phrases: controlling for, explaining over and above).

In DLATK, we specify statistical controls with the argument `--controls`. The control columns are also expected in the `--outcome_table` (there is no additional `--control_table` expected or anything like that).

For example, we can run mini_liwc correlations for `age` controlling for `gender`: `age` is the variable we are interested in, and `gender` is the variable we adjust for.

**NOTE:** Given that gender has the strongest language associations of all variables, we almost always want to control for gender.
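
To build intuition for what "controlling for" does, here is a minimal sketch in plain Python. It is not DLATK's implementation (DLATK fits regressions in the background); it computes a partial correlation, which is a closely related idea: remove whatever the control explains from both variables, then correlate what is left. The helper names `pearson` and `residualize` and all numbers are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def residualize(values, control):
    """Return what is left of `values` after a simple linear fit on `control`."""
    mean_c, mean_v = sum(control) / len(control), sum(values) / len(values)
    slope = (sum((c - mean_c) * (v - mean_v) for c, v in zip(control, values))
             / sum((c - mean_c) ** 2 for c in control))
    return [v - mean_v - slope * (c - mean_c) for c, v in zip(control, values)]


# Toy data: the control shifts both the feature and the outcome by 5.
control = [0, 0, 0, 0, 1, 1, 1, 1]
feature = [1, -1, 1, -1, 6, 4, 6, 4]
outcome = [1, 1, -1, -1, 6, 6, 4, 4]

raw_r = pearson(feature, outcome)
partial_r = pearson(residualize(feature, control), residualize(outcome, control))
print(f"raw r = {raw_r:.3f}, partial r = {partial_r:.3f}")
```

In this toy example the raw correlation is large only because the control drives both variables; once the control is removed, nothing is left. Real data will rarely be this clean.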

In [None]:
database = "dla_tutorial"
msgs_table = 'msgs'
outcomes_table = 'outcomes'
feat_miniliwc_user = 'feat$cat_mini_LIWC2015$msgs$user_id$1gra'

OUTPUT_FOLDER = './outputs_tutorial_06'
OUTPUT_NAME = 'mini_liwc_age_CTRL_gender'
!mkdir -p {OUTPUT_FOLDER}

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --correlate \
    --rmatrix --csv --sort \
    --feat_table '{feat_miniliwc_user}' \
    --outcome_table {outcomes_table} \
    --outcomes age \
    --controls gender \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME}



TopicExtractor: gensim Mallet wrapper unavailable, using Mallet directly.

-----
DLATK Interface Initiated: 2025-04-17 23:00:57
-----
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
Loading Outcomes and Getting Groups for: {'age', 'gender'}
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
Yielding data over ['age'], adjusting for: ['gender'].
Yielding norms with zeros (978 groups * 4 feats).
                                 OLS Regression Results                                
Dep. Variable:                    age   R-squared (uncentered):                   0.000
Model:                            OLS   Adj. R-squared (uncentered):             -0.001
Method:                 Least Squares   F-statistic:                             0.4216
Date:                Thu, 17 Apr 2025   Prob (F-statistic):                       0.516
Time:                        23:00:58   Log-Likelihood:          

Because we are using covariate controls the DLATK output got a little fancier, as we're calling on regression packages in the background. You can largely ignore this.

The above command produced files in the output directory. All of them are prefixed by `mini_liwc_age_CTRL_gender`, which we specified in the command with `--output_name`. Let us look at what we have. The `*` wildcard allows us to list any file or directory whose name begins with `mini_liwc_age_CTRL_gender`.

In [None]:
!ls -lh {OUTPUT_FOLDER}/{OUTPUT_NAME}*

-rw-r--r-- 1 root root 6.9K Apr 17 23:00 ./outputs_tutorial_06/mini_liwc_age_CTRL_gender.csv
-rw-r--r-- 1 root root  12K Apr 17 23:00 ./outputs_tutorial_06/mini_liwc_age_CTRL_gender.html


When you open the html file, it will look like this:

![Fig 3](https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-06/fig3.png)


And the csv file looks like this:

![Fig 4](https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-06/fig4.png)


So this tells us that older users' blog posts contain fewer POSEMO and NEGEMO words -- the language gets less emotional.

### 4a) LabMT -- correlate `age`, controlling `gender`

#### 👩‍🔬💻 Exercise

Can you repeat the above process for `labmt` now? Note the `--controls gender` in the command, and also that gender is no longer among the `--outcomes`. Remember to rename the `output_name` appropriately. What's the trend with labMT for age, controlling for gender? Does the valence go up or down with age?

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'
outcomes_table = 'outcomes'
feat_labmt_user = 'feat$cat_labmt_w$msgs$user_id$1gra'

OUTPUT_FOLDER = './outputs_tutorial_06'
OUTPUT_NAME = 'labmt_age_CTRL_gender'
!mkdir -p {OUTPUT_FOLDER}

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --correlate \
    --rmatrix --csv --sort \
    --feat_table '{feat_labmt_user}' \
    --outcome_table {outcomes_table} \
    --outcomes age \
    --controls gender \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME}



TopicExtractor: gensim Mallet wrapper unavailable, using Mallet directly.

-----
DLATK Interface Initiated: 2025-04-17 23:02:26
-----
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
Loading Outcomes and Getting Groups for: {'gender', 'age'}
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
Yielding data over ['age'], adjusting for: ['gender'].
Yielding norms with zeros (978 groups * 2 feats).
                                 OLS Regression Results                                
Dep. Variable:                    age   R-squared (uncentered):                   0.000
Model:                            OLS   Adj. R-squared (uncentered):             -0.001
Method:                 Least Squares   F-statistic:                             0.4216
Date:                Thu, 17 Apr 2025   Prob (F-statistic):                       0.516
Time:                        23:02:26   Log-Likelihood:          

Like the commands earlier, the above command produced files in the output directory, prefixed by `labmt_age_CTRL_gender` which we specified in the command with `--output_name`.

In [None]:
!ls -lh {OUTPUT_FOLDER}/{OUTPUT_NAME}*

-rw-r--r-- 1 root root 6.6K Apr 17 23:02 ./outputs_tutorial_06/labmt_age_CTRL_gender.csv
-rw-r--r-- 1 root root  10K Apr 17 23:02 ./outputs_tutorial_06/labmt_age_CTRL_gender.html


There are 2 files -- html & csv. Look at the html file, or the console output. Valence goes down with age.

### 4b) **FYI: what to do if the DLATK output gets very long**

Let's say you want to correlate mini_liwc against all occupations. This will work nicely with the 1-hot-encoding, but it will create endless output.

At the end of the command, you can redirect ("forward") DLATK's output into a text file. This will make DLATK show you a lot less output in the actual console. You do this by just adding
```
> somefilename 2>&1
```

at the end of the DLATK command. This is a basic Linux command-line trick: `>` sends the normal output to the file, and `2>&1` sends the error messages to the same place. Run it once as it is below, and then remove the redirect -- you can see the difference.
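
For the curious, here is a minimal Python sketch of the same idea using the standard-library `subprocess` module. The child process is a one-line stand-in that writes to both output streams; it is not DLATK, and the log path is a temporary file created only for this example.

```python
import os
import subprocess
import sys
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "logs.txt")

# A stand-in child process that writes one line to stdout and one to stderr.
child_code = "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"

with open(log_path, "w") as log_file:
    # stdout=log_file mirrors `> logs.txt`; stderr=subprocess.STDOUT mirrors `2>&1`.
    subprocess.run(
        [sys.executable, "-c", child_code],
        stdout=log_file,
        stderr=subprocess.STDOUT,
        check=True,
    )

with open(log_path) as log_file:
    log_text = log_file.read()
print(log_text)
```

Nothing from the child process reaches the console; both lines end up in the log file, which you can open whenever something goes wrong.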

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'
outcomes_table = 'outcomes'
feat_miniliwc_user = 'feat$cat_mini_LIWC2015$msgs$user_id$1gra'

OUTPUT_FOLDER = './outputs_tutorial_06'
OUTPUT_NAME = 'mini_liwc_occu_CTRL_gender'
!mkdir -p {OUTPUT_FOLDER}

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --correlate \
    --rmatrix --csv --sort \
    --feat_table '{feat_miniliwc_user}' \
    --outcome_table {outcomes_table} \
    --categories_to_binary occu \
    --outcomes occu \
    --controls gender \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME} > {OUTPUT_FOLDER}/logs.txt 2>&1

## 5) Unpacking dictionary correlations

### How do the words within them correlate? `--whitelist` flag.

We have already seen that it's important to know which words drive a dictionary. Now that we are correlating dictionaries, wouldn't it be nice to know how the words within them correlate with an outcome?

We will talk about how to correlate _ALL_ 1grams in a later tutorial. We correlate 1gram features just like we would correlate dictionaries -- just by giving DLATK a 1gram `--feat_table` for its `--correlate` command.

But for today, all we want to correlate is the words that are contained in a particular `category` in a `DLATK_lexica` table -- LIWC's POSEMO, for example.

We do this by adding the flags `--whitelist --lex_table 'LIWC2015' --categories 'POSEMO'` to the 1gram correlation.
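
Conceptually, a whitelist just keeps the 1grams that belong to the chosen category and drops everything else. Here is a minimal sketch of that filtering step. The tiny lexicon is made up, the function name `whitelist_features` is hypothetical, and treating a trailing `*` as a prefix wildcard is an assumption about how dictionary entries are written; DLATK's own matching may differ in detail.

```python
from fnmatch import fnmatchcase

# Made-up toy lexicon: category -> terms. A trailing * stands for "any ending".
toy_lexicon = {
    "POSEMO": ["love", "happ*", "yay"],
    "NEGEMO": ["sad", "hate*"],
}


def whitelist_features(features, lexicon, category):
    """Keep only the features that match a term in the given category."""
    patterns = lexicon[category]
    return [
        feature for feature in features
        if any(fnmatchcase(feature, pattern) for pattern in patterns)
    ]


one_grams = ["love", "happy", "happiness", "the", "hated", "yay", "lovely"]
kept = whitelist_features(one_grams, toy_lexicon, "POSEMO")
print(kept)
```

Note that `lovely` is dropped in this sketch because the toy entry `love` has no wildcard, while `happy` and `happiness` both match `happ*`.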

Let's see how the top words within POSEMO correlate with gender.

Let's also redirect the output to a file so we don't have to suffer through DLATK telling us about all the 1grams it's correlating.

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'
outcomes_table = 'outcomes'
feat_1gram_user = 'feat$1gram$msgs$user_id'

OUTPUT_FOLDER = './outputs_tutorial_06'
OUTPUT_NAME = '1gram_age_gender_FILTER_POSEMO'
!mkdir -p {OUTPUT_FOLDER}

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --correlate --csv \
    --feat_table '{feat_1gram_user}' \
    --outcome_table {outcomes_table} \
    --outcomes age gender \
    --whitelist --categories POSEMO --lex_table LIWC2015 \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME} > {OUTPUT_FOLDER}/logs.txt 2>&1

The output in the files is long -- let's actually use the CSV this time, and sort by frequency, descending. That way we get the most frequent words with their correlations.

Here is what this looks like in Excel / Google Sheets / Open Office, etc:

![Fig 5](https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-06/fig5.png)

Let's sort by column M descending through a filter, and add a cell formula that gives us the betas with CIs. We have shared an Excel sheet with this tutorial, so you can copy the formatting and the cell formula.

🤓🤓🤓 Using conditional formatting that colors cells based on numerical value (as shown below) is...very pretty and makes it easy to eyeball large sets of values.

![Fig 6](https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-06/fig6.png)
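
If you would rather not leave the notebook, the sort-and-format step can also be sketched in Python with the standard-library `csv` module. The rows and the column names (`feature`, `r`, `ci_low`, `ci_high`, `freq`) below are made up for illustration; check the header of your actual DLATK CSV and adjust the names accordingly.

```python
import csv
import io

# Made-up rows with hypothetical column names; a real DLATK CSV has its own header.
toy_csv = """feature,r,ci_low,ci_high,freq
word_a,0.21,0.15,0.27,1200
word_b,0.05,-0.01,0.11,4800
word_c,-0.08,-0.14,-0.02,300
"""


def format_with_ci(row):
    """Format a coefficient with its confidence interval, e.g. 0.05 [-0.01, 0.11]."""
    r, low, high = float(row["r"]), float(row["ci_low"]), float(row["ci_high"])
    return f"{r:.2f} [{low:.2f}, {high:.2f}]"


rows = list(csv.DictReader(io.StringIO(toy_csv)))
# Most frequent words first, like sorting the frequency column descending.
rows.sort(key=lambda row: int(row["freq"]), reverse=True)

for row in rows:
    print(row["feature"], row["freq"], format_with_ci(row))
```

To use it on a real file, replace `io.StringIO(toy_csv)` with an open file handle for your CSV.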

And here is what this would look like in APA style for your supplement.

Table S1

_The most frequent words in the LIWC positive emotion dictionary, and their association with gender_

![Fig 7](https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-06/fig7.png)

As you can see, the association between LIWC POSEMO and female gender is probably driven entirely by the single word `love`.

What are the odds??

### Quick preview: 1gram word correlation differential wordclouds!

BTW, as a preview: all you need to get the 1gram correlations as wordcloud images is to also add the flags `--tagcloud --make_wordclouds`

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'
outcomes_table = 'outcomes'
feat_1gram_user = 'feat$1gram$msgs$user_id'

OUTPUT_FOLDER = './outputs_tutorial_06'
OUTPUT_NAME = '1gram_age_gender_FILTER_POSEMO'
!mkdir -p {OUTPUT_FOLDER}

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --correlate --csv \
    --feat_table '{feat_1gram_user}' \
    --outcome_table {outcomes_table} \
    --outcomes age gender \
    --whitelist --categories POSEMO --lex_table LIWC2015 \
    --tagcloud --make_wordclouds \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME} > {OUTPUT_FOLDER}/logs.txt 2>&1

You can browse the word clouds through the Colab file tree (it may take a moment to load)!

💡 Double-click to open the file and a side panel on the right side of the Colab notebook will pop up with the image!

![Fig 13](https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-06/fig13.png)

Here are all the LIWC POSEMO tokens significantly positively correlated with gender (controlling for multiple comparisons). The file name tells us that the coefficients of the tokens range from r = .12 (yay) to r = .21 (love). Words are sized by beta coefficient; color gives frequency (from grey through blue to red, with red being the most frequent).

![Fig 9](https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-06/fig9.png)

### Finding the messages that have the highest dictionary scores. `--top_messages` flag.

DLATK also provides a way to check which messages score highest in a dictionary/lexicon. For example, we want to find the top 5 messages for every dictionary in LIWC2015 based on their proportion. This can be done using the `--top_messages n` flag.

To do this, we need to have the 1gram features and then the dictionary features extracted at the **message level.**
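
Before extracting anything, here is a minimal sketch of what "top messages by proportion" means. It assumes a message's score for a dictionary is simply the share of its tokens that appear in that dictionary. The word list, the messages, and the function name `dictionary_proportion` are made up for illustration; DLATK's own tokenization and scoring are more involved.

```python
import heapq

# Made-up toy dictionary and messages (keys play the role of message_id).
toy_posemo = {"love", "happy", "great"}
toy_messages = {
    1: "i love this happy day",
    2: "the bus was late again",
    3: "love love love",
    4: "great news",
}


def dictionary_proportion(text, lexicon_words):
    """Share of whitespace-separated tokens that are in the dictionary."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in lexicon_words for token in tokens) / len(tokens)


scores = {
    message_id: dictionary_proportion(text, toy_posemo)
    for message_id, text in toy_messages.items()
}

# Keep the 2 highest-scoring messages, best first.
top_two = heapq.nlargest(2, scores.items(), key=lambda item: item[1])
print(top_two)
```

Short messages made up entirely of dictionary words score highest, which is worth keeping in mind when you read the real top messages.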

So, let's extract the 1gram and LIWC2015 features at the message level.

Message level 1grams:

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'

🚨🚨🚨 If you've already extracted message-level features (we did in HW3!), you don't need to extract again. This takes around 30 minutes!

In [None]:
# Read above!
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field message_id \
    --add_ngrams -n 1

Message level LIWC:

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field message_id \
    --add_lex_table -l LIWC2015

Now that we have extracted the features, let's check the top-5 messages for every dictionary in LIWC using `--top_messages 5 --lex_table LIWC2015`.

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'
feat_liwc_msg = 'feat$cat_LIWC2015$msgs$message_id$1gra'

OUTPUT_FOLDER = './outputs_tutorial_06'
OUTPUT_NAME = 'liwc_5'
!mkdir -p {OUTPUT_FOLDER}

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field message_id \
    --feat_table '{feat_liwc_msg}' \
    --top_messages 5 --lex_table LIWC2015 \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME}

This produces a CSV file with the top-n messages in the output folder, prefixed with `liwc_5` which we specified with `--output_name`.

In [None]:
!ls -lh {OUTPUT_FOLDER}/{OUTPUT_NAME}*

-rw-r--r-- 1 root root 65K Apr 17 23:21 ./outputs_tutorial_06/liwc_5_topmsgs.csv


Here are the top-5 messages for `POSEMO` from the CSV. And you can see why these messages have scored high on the dictionary--they are pure POSEMO. (The csv file also includes the words contained in every dictionary, as shown below).

![Fig 10](https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-06/fig10.png)

Ok, this was a lot -- but this is important. We always want to know what's happening within our dictionaries.

Top messages per dictionary, and top words that drive a dictionary should both go into the supplement of a language analysis paper.

## FYI: Pure Outcome Cross-Correlation with DLATK

We'll soon be in R, but just so you know, DLATK has a nifty function to make cross-correlation tables that just cross-correlate outcome columns, no language. Helpful for when you first get your data and want to just throw a cross-correlation table into your Jupyter/DLATK workflow.

The argument for DLATK is:
`--outcome_with_outcome_only` -- Says that we are ignoring language and are only looking at the outcomes.

When computing outcome cross-correlations, we don't specify a feature table with `--feat_table`, because we don't need one. Everything else remains the same, and the command looks like the one below. Note that we change the output name so we can keep our previous `mini_LIWC` results and not overwrite them.

It's equivalent to calling `cor(df$age, df$gender)` in R.

Note that this ignores group_freq_thresh -- you can see in the output that it ran over 1,000 groups.
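
For comparison, here is a minimal pure-Python sketch of a cross-correlation table over outcome columns only. The column values are made up, and the `score` column is a hypothetical third outcome added just so the table has more than one pair; this is not DLATK's code.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Made-up toy outcome columns, one value per user.
toy_outcomes = {
    "age": [20, 30, 40, 50],
    "gender": [0, 1, 1, 0],
    "score": [1, 2, 3, 4],
}

# One correlation per unordered pair of outcome columns.
cross_correlations = {
    (first, second): pearson(toy_outcomes[first], toy_outcomes[second])
    for first, second in combinations(toy_outcomes, 2)
}

for pair, r in cross_correlations.items():
    print(pair, round(r, 3))
```

No feature table appears anywhere in this sketch, which is the whole point of `--outcome_with_outcome_only`.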

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'
outcomes_table = 'outcomes'

OUTPUT_FOLDER = './outputs_tutorial_06'
OUTPUT_NAME = 'outcome_correlations_age_gender'
!mkdir -p {OUTPUT_FOLDER}

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --outcome_table {outcomes_table} \
    --outcomes age gender \
    --outcome_with_outcome_only \
    --rmatrix --csv --sort \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME} > {OUTPUT_FOLDER}/logs.txt 2>&1

![Fig 11](https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-06/fig11.png)

When you open the csv file in Excel, it will look like this:

![Fig 12](https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-06/fig12.png)

We see that age and gender are correlated at r = 0.013 (basically, they aren't), and that it was run over all `N = 1000` groups.

Ok, kewl.

## ‼️ **Save your database and/or output files** ‼️

In [None]:
database = 'dla_tutorial'

In [None]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# copy the database file to your Drive
!cp -f "sqlite_data/{database}.db" "/content/drive/MyDrive/sqlite_databases/"

print(f"✅ Database '{database}.db' has been copied to your Google Drive.")

We generated a lot of output in this tutorial! Here's how you can save it to your Drive if you want to!

In [None]:
OUTPUT_FOLDER = './outputs_tutorial_06'

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Copy the output folder to your Drive (-r makes it copy the folder and all files/folders inside)
!cp -f -r {OUTPUT_FOLDER} "/content/drive/MyDrive/"

print(f"✅ '{OUTPUT_FOLDER}' has been copied to your Google Drive.")

Yay! Done 😎