<a href="https://colab.research.google.com/github/CompPsychology/psych290_colab_public/blob/main/notebooks/week-06/W6_Tutorial_09_DLATK_1to3gram_tuneCorrelation_(dla_tutorial).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# W6 Tutorial 9 -- 1to3gram extraction, fine-tuning and correlation word clouds (2025-04)

(c) Johannes Eichstaedt & the World Well-Being Project, 2023.


✋🏻✋🏻 NOTE - You need to create a copy of this notebook before you work through it. Click on the "Save a copy in Drive" option in the File menu, and save it to your Google Drive.

✉️🐞 If you find a bug/something doesn't work, please slack us a screenshot, or email johannes.courses@gmail.com.


In this tutorial we will explore what we need to do to correlate words **and phrases** directly with outcomes. This will involve the same `--correlate` command as before, but importantly, we need to pre-process the 1to3grams with DLATK:
* extract them
* filter rare words
* filter phrases to those that are really informative -- "United States" (yes) vs. "the dog" (no)
* correlate them against outcomes, and create word clouds
* and while we are at it: let's also make wordclouds of sets of dictionaries, like LIWC

Let's set up Colab, as usual.

## 1) Setting up Colab with DLATK and SQLite


In [None]:
database="dla_tutorial"

### 1a) Install DLATK

In [None]:
# installing DLATK and necessary packages
!git clone -b psych290 https://github.com/dlatk/dlatk.git
!pip install -r dlatk/install/requirements.txt
!pip install dlatk/
!pip install wordcloud langid jupysql

Cloning into 'dlatk'...
remote: Enumerating objects: 6996, done.
remote: Counting objects: 100% (1176/1176), done.
remote: Compressing objects: 100% (168/168), done.
remote: Total 6996 (delta 1082), reused 1023 (delta 1008), pack-reused 5820 (from 2)
Receiving objects: 100% (6996/6996), 62.40 MiB | 6.03 MiB/s, done.
Resolving deltas: 100% (4942/4942), done.
Collecting image<=1.5.33 (from -r dlatk/install/requirements.txt (line 1))
  Downloading image-1.5.33.tar.gz (15 kB)
  Preparing metadata (setup.py) ... done
Collecting langid<=1.1.6,>=1.1.4 (from -r dlatk/install/requirements.txt (line 2))
  Downloading langid-1.1.6.tar.gz (1.9 MB)
  Preparing metadata (setup.py) ... done
Collecting mysqlclient<=2.1.1 (from -r dlatk/install/requirements.txt (line 4))
  Downloading mysqlclient-2.1.1.tar.gz (88 kB)

Processing ./dlatk
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: dlatk
  Building wheel for dlatk (setup.py) ... done
  Created wheel for dlatk: filename=dlatk-1.3.1-py3-none-any.whl size=35635918 sha256=daf0bf39d672200675081d12681c6ebddd5226c26da72c85125f61fba9ee9172
  Stored in directory: /tmp/pip-ephem-wheel-cache-lshjrv8b/wheels/cc/c9/65/e1ecc64bac68518c07b286fe86921aa938e11a0c3a87d8ff93
Successfully built dlatk
Installing collected packages: dlatk
Successfully installed dlatk-1.3.1
Collecting jupysql
  Downloading jupysql-0.11.1-py3-none-any.whl.metadata (5.9 kB)
Collecting jupysql-plugin>=0.4.2 (from jupysql)
  Downloading jupysql_plugin-0.4.5-py3-none-any.whl.metadata (7.8 kB)
Collecting ploomber-core>=0.2.7 (from jupysql)
  Downloading ploomber_core-0.2.26-py3-none-any.whl.metadata (527 bytes)
Collecting posthog (from ploomber-core>=0.2.7->jupysql)
  Downloading posthog-4.0.1-py2.py3-none-any.whl.metadata (3.0 kB)
Colle

### 1b) Mount Google Drive and copy databases

In [None]:
# Mount Google Drive & copy to Colab

# connects & mounts your Google Drive to this colab space
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# this copies dlatk_lexica.db from your Google Drive to Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/dlatk_lexica.db" "sqlite_data"

# this copies {database}.db from your Google Drive to Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/{database}.db" "sqlite_data"

Mounted at /content/drive


### 1c) Setup database connection

In [None]:
# loads the %%sql extension
%load_ext sql

# connects the extension to the database - mounts both databases as engines
from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")
dlatk_lexica_engine = create_engine(f"sqlite:///sqlite_data/dlatk_lexica.db?charset=utf8mb4")

# attaches the dlatk_lexica.db so tutorial_db_engine can query both databases
from IPython import get_ipython
from sqlalchemy import event

# auto‑attach the lexica db whenever tutorial_db_engine connects
@event.listens_for(tutorial_db_engine, "connect")
def _attach_lexica(dbapi_conn, connection_record):
    dbapi_conn.execute("ATTACH DATABASE 'sqlite_data/dlatk_lexica.db' AS dlatk_lexica;")

%sql tutorial_db_engine

#set the output limit to 50
%config SqlMagic.displaylimit = 50

### 1d) (ONLY if needed: SOFT RELOAD): If you have a **"database lock"** problem

If you face a "database locked" issue, restart the session (Runtime ==> Restart Session) & run this cell to get set back up!


In [None]:
database = "dla_tutorial"

%reload_ext sql

from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")
dlatk_lexica_engine = create_engine(f"sqlite:///sqlite_data/dlatk_lexica.db?charset=utf8mb4")

# set the output limit to 50
%config SqlMagic.displaylimit = 50

from IPython import get_ipython
from sqlalchemy import event

# auto‑attach the lexica db whenever tutorial_db_engine connects
@event.listens_for(tutorial_db_engine, "connect")
def _attach_lexica(dbapi_conn, connection_record):
    dbapi_conn.execute("ATTACH DATABASE 'sqlite_data/dlatk_lexica.db' AS dlatk_lexica;")

%sql tutorial_db_engine

### If needed, one-gram extraction

This tutorial needs the 1-gram feature table and the `LIWC2015` feature table. *If you haven't already* extracted them (you can check with `%sqlcmd tables`), you can do so using the DLATK commands below.

In [None]:
database = "dla_tutorial"
msgs_table = "msgs"

In [None]:
# !dlatkInterface.py \
#     --corpdb {database} \
#     --corptable {msgs_table} \
#     --correl_field user_id \
#     --add_ngrams -n 1

Similarly, we can produce the `LIWC2015` feature table as below.

### If needed, LIWC feature extraction

In [None]:
database = "dla_tutorial"
msgs_table = "msgs"

In [None]:
# !dlatkInterface.py \
#     --corpdb {database} \
#     --corptable {msgs_table} \
#     --correl_field user_id \
#     --add_lex_table -l LIWC2015

The above command produces the table `feat$cat_LIWC2015$msgs$user_id$1gra`, which holds the LIWC2015 dictionary features.

## 2) One-gram occurrence filtering

Firstly, let's count how many unique features (types) people use.

In [None]:
feat_1gram_user = 'feat$1gram$msgs$user_id'

In [None]:
%%sql

SELECT COUNT(DISTINCT feat)
FROM {{feat_1gram_user}};

COUNT(DISTINCT feat)
137687


That's almost 140,000 distinct tokens!

Let's see how many of them were used 50 times or fewer (putting them in the tail of the Zipf distribution).

#### 🤓💻

Observe the inner query, it's the same as creating the `word_counts` table from the previous homeworks.

In [None]:
feat_1gram_user = 'feat$1gram$msgs$user_id'

In [None]:
%%sql

SELECT COUNT(*)
FROM (
  SELECT SUM(value) AS feat_count
  FROM {{feat_1gram_user}}
  GROUP BY feat) AS a
WHERE feat_count <= 50;

Gahh!! That's clearly a lot -- almost 130,000 tokens appear 50 times or less.

FYI: this ran pretty fast because feat tables always come with indices on all columns -- makes it easy to sum over.

Here is another way to look at this: how many words were used by 50 users or fewer?

#### 🤓💻

Before reading the answer below -- do you remember how to get distinct user counts for which something is true...?

In [None]:
feat_1gram_user = 'feat$1gram$msgs$user_id'

In [None]:
%%sql

SELECT COUNT(*)
FROM (
  SELECT count(distinct(group_id)) AS group_count
  FROM {{feat_1gram_user}}
  GROUP BY feat) AS a
WHERE group_count <= 50;

That's even more. So out of 137,687 types, 133,103 are used by 50 users or less, leaving `137,687 - 133,103 = 4,584` features.
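To see why counting groups removes more features than counting tokens, here is a minimal sketch using Python's built-in `sqlite3` module. The table `feat_toy` and its rows are made up for illustration; only the column names (`group_id`, `feat`, `value`) follow the feature tables above.

```python
import sqlite3

# Toy feature table (made-up rows): one row per (group, feature) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feat_toy (group_id INTEGER, feat TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO feat_toy VALUES (?, ?, ?)",
    [
        (1, "pizza", 60),  # a single user who writes "pizza" 60 times
        (1, "the", 40),
        (2, "the", 35),
        (3, "the", 50),
    ],
)

# Per feature: total token count vs. number of distinct groups using it.
rows = conn.execute(
    """
    SELECT feat,
           SUM(value) AS feat_count,
           COUNT(DISTINCT group_id) AS group_count
    FROM feat_toy
    GROUP BY feat
    ORDER BY feat
    """
).fetchall()

for feat, feat_count, group_count in rows:
    print(feat, feat_count, group_count)
```

In this toy table, "pizza" clears a token-count cutoff of 50 but comes from only one group, so a group-based filter would drop it.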

We could go on like this with SQL, and shortlist our feature table down to words that occur across at least 50 groups, something like that.

**But let's not do that.**

Instead, let's just ask DLATK to do that for us, with the `--feat_occ_filter --set_p_occ 0.05` flags during feature extraction:

* `--feat_occ_filter` "activates" the feature occurrence filtering
* `--set_p_occ 0.05` sets the feature occurrence threshold to 5% of the sample size. That means if we are extracting for 1,000 groups, we only want to retain features that have been used by **at least 50 users (groups)** (so it's not just using a minimal token count.)

It seems a little complicated at first, but it's a nice way to set good thresholds for samples of different sizes, and ensure that you have enough groups (users) who have used that word to run correlations over (rule of thumb: at least 40 to 50).



### 2a) More nuanced explanation:

Occurrence filtering depends a little bit on how many words are typically nested within a given group.

(In the language space, power calculations are in 3D -- groups * num of features * words/group).

Typical occurrence thresholds for different types of groups:

* Occurrence thresholds between 1 and 10% are typical when we are grouping by **people** (such as with `--correl_field user_id`).

* When we aggregate language to larger levels (such as **counties, cities, states**), we sometimes go as high as 30% of the groups (because we just have so much language from a county \[50,000+ words\] -- and if a word doesn't appear among 50k words, it's probably pretty deep in the tail).

* Conversely, if we extract language at the **message level** (say a single blog post), we set this threshold to be much lower -- if we have 30,000 blog posts, for a word to occur in 30, the occurrence threshold would be 1/1000, or 0.001

**The key thing to remember here: you are setting the FRACTION of groups in which a feature should occur. That fraction should work out to more than 40 groups. Based on the total number of groups you have, you can work out what this fraction should be.**
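Working backwards from a target number of groups to a fraction is simple division. Here is a small sketch; `min_p_occ` is a hypothetical helper for illustration, not part of DLATK.

```python
def min_p_occ(n_groups: int, min_groups: int) -> float:
    """Fraction of groups that corresponds to `min_groups` out of `n_groups`."""
    if n_groups <= 0:
        raise ValueError("n_groups must be positive")
    return min_groups / n_groups

# Blog-post example from the list above: 30 out of 30,000 posts.
print(min_p_occ(n_groups=30_000, min_groups=30))

# User-level example: 50 out of 1,000 users.
print(min_p_occ(n_groups=1_000, min_groups=50))
```

The result is the value you would pass to `--set_p_occ`.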

Anyway, here we have ~1,000 users: let's extract features with the occurrence threshold of 5%, which will mean that a feature has to occur in at least `1,000 * 0.05 = 50` groups to be included. `--feat_occ_filter --set_p_occ 0.05`

👆 Please make sure you understand this. ⚠️

(BTW, as always in these dlatk commands, the order of the flags does not matter).

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --add_ngrams -n 1 \
    --feat_occ_filter --set_p_occ 0.05



TopicExtractor: gensim Mallet wrapper unavailable, using Mallet directly.

-----
DLATK Interface Initiated: 2025-05-08 18:04:33
-----
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
query: PRAGMA table_info(msgs)
SQL Query: DROP TABLE IF EXISTS feat$1gram$msgs$user_id
SQL Query: CREATE TABLE feat$1gram$msgs$user_id ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(36), value INTEGER, group_norm DOUBLE)


Creating index correl_field on table:feat$1gram$msgs$user_id, column:group_id 


SQL Query: CREATE INDEX correl_field$1gram$msgs$user_id ON feat$1gram$msgs$user_id (group_id)


Creating index feature on table:feat$1gram$msgs$user_id, column:feat 


SQL Query: CREATE INDEX feature$1gram$msgs$user_id ON feat$1gram$msgs$user_id (feat)
query: PRAGMA table_info(msgs)
SQL Query: DROP TABLE IF EXISTS feat$meta_1gram$msgs$user_id
SQL Query: CREATE TABLE feat$meta_1gram$msgs$user_id ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(16), value INTEGER, group_norm

This took longer (\~2.5 min) than just extracting the 1grams (\~2.2min) -- as it created two tables: the unfiltered one `feat$1gram$msgs$user_id` and the filtered one `feat$1gram$msgs$user_id$0_05`.

The above command produced the table `feat$1gram$msgs$user_id$0_05`.

<h2> ⚠️ PSA </h2>

**The occurrence threshold is based on the number of included groups. DLATK only considers groups that meet a minimum token count, set by `--group_freq_thresh`. If you don't set this flag, the default of 500 is applied automatically, so it is better to set it explicitly.**

This means that the number of groups used for the 5\% calculation is the number of groups that meet the group frequency threshold (GFT). In this case, 978 groups contain 500 words or more, and 5\% of 978 = 48.9. Please scroll through the output and double check: DLATK reports `[threshold: 49]`, so it retained all features that occur in at least 49 groups.

**We should basically always set `--group_freq_thresh` explicitly, in every DLATK command. It never hurts.**

FYI, `--group_freq_thresh 0` is what we use only when we work with very short documents (such as Tweets) as groups (typically: `--correl_field message_id`).
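The interplay of the two settings can be sketched in a few lines of Python. This is not DLATK's code. It assumes that a group is included when it has at least `group_freq_thresh` words, and that the fraction is rounded to the nearest integer, which is consistent with the thresholds DLATK reports for 978 groups. The helper name and the toy word counts are made up.

```python
def occurrence_threshold(group_word_counts, p_occ, group_freq_thresh=500):
    """Return (number of included groups, minimum groups per feature)."""
    # Assumption: a group counts if it has at least `group_freq_thresh` words.
    n_included = sum(1 for n in group_word_counts.values() if n >= group_freq_thresh)
    # Assumption: the fraction of included groups is rounded to the nearest integer.
    return n_included, round(p_occ * n_included)

# Made-up word counts per user: only two users clear the default GFT of 500.
toy_counts = {"user_a": 1200, "user_b": 80, "user_c": 5000, "user_d": 120}
print(occurrence_threshold(toy_counts, p_occ=0.5))

# With the 978 included groups from this tutorial:
print(round(0.05 * 978), round(0.07 * 978))
```

The point of the sketch: the fraction is applied to the groups that survive the GFT, not to all groups in the message table.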

<h2> ⚠️ END of PSA </h2>

Back to the main story here: Note the `0_05` at the end in the feat table name: `feat$1gram$msgs$user_id$0_05`.

* **`0_05` <- THIS IS NEW -- this captures the fact that it was filtered down to those features only used by at least 5\% of the sample** \[which had word_count > group_freq_thresh `[default: 500]`\]

Let's review:

* `feat` <- this is a feature table (the result of `--add_ngrams`)

* `1gram` <- with 1grams in it (single tokens) (the result of `-n 1`)

* `msgs` <- from the message table `msgs` (the result of `--corptable msgs`)

* `user_id` <- extracted and aggregated for the unit of analysis identified by `user_id` (the result of `--correl_field user_id`)

So the `0_05` keeps track of the last filtering step.
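Because the parts are separated by `$`, a table name can be decoded mechanically. The helper below is a hypothetical illustration of the naming convention reviewed above, not a DLATK function.

```python
def describe_feat_table(name: str) -> dict:
    """Split a feature table name into the parts reviewed above."""
    parts = name.split("$")
    if len(parts) < 4 or parts[0] != "feat":
        raise ValueError(f"not a feature table name: {name!r}")
    return {
        "feature_type": parts[1],   # e.g. 1gram
        "message_table": parts[2],  # e.g. msgs
        "group_field": parts[3],    # e.g. user_id
        "suffixes": parts[4:],      # e.g. ['0_05'] after occurrence filtering
    }

print(describe_feat_table("feat$1gram$msgs$user_id"))
print(describe_feat_table("feat$1gram$msgs$user_id$0_05"))
```

An unfiltered table has an empty `suffixes` list; each filtering step appends one more entry.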

Let's confirm the output, by seeing if there are any rare tokens left, and then seeing how many unique tokens have survived the filtering and are in the new feature table.

So how many features are left?

In [None]:
feat_1gram_occ05_user = 'feat$1gram$msgs$user_id$0_05'

In [None]:
%%sql

SELECT COUNT(DISTINCT feat) AS unique_tokens
FROM {{feat_1gram_occ05_user}};

unique_tokens
4751


~5,000 is a good number!

Generally, between say 3,000 and 10,000 unique 1grams/tokens (=types) means we've done reasonable occurrence filtering.  

Let's see what the rarest words are that we have retained.

In [None]:
feat_1gram_occ05_user = 'feat$1gram$msgs$user_id$0_05'

In [None]:
%%sql

SELECT feat, SUM(value) AS feat_count, COUNT(DISTINCT(group_id)) AS group_count
FROM {{feat_1gram_occ05_user}}
GROUP BY feat
ORDER by feat_count ASC
LIMIT 10;

feat,feat_count,group_count
newest,54,49
repeating,59,49
spelled,59,49
entertained,60,51
fist,60,51
phrases,60,52
traveled,60,50
inspiring,61,50
bust,62,52
cancel,62,50


Alrighty. `newest` was used 54 times by 49 people, and just made the cut -- on the bubble.

### 2b) Occurrence filtering using an existing table

In the above DLATK command, we re-extracted the table from scratch using `--add_ngrams -n 1` and then filtered the occurrences of 1-grams.

However, it's often desirable to take in the unfiltered version of the 1-gram feature table when we already have it, and then just filter it down further as needed, by feeding the `--feat_table` flag when we use `--feat_occ_filter`. It's much faster. Run the below command to see!!

This is the most efficient way to tune your occurrence threshold to get 3,000-10,000 1grams:

* **Step 1** - Extract the unfiltered feature table
* **Step 2** - Then try filtering it down with different occurrence thresholds (as below).

For example, below we make the threshold more strict by changing it to 7\% of the groups (978 * 7\% ≈ 68) -- fewer words will have been used by this higher number of groups.
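Before running DLATK, it can help to picture what the tuning loop does. The sketch below sweeps a few thresholds over a toy dictionary of group counts. Only the counts for `newest` (49) and `cancel` (50) come from the output above; the other entries are made up, and the rounding rule is an assumption that matches the thresholds DLATK reports.

```python
# Number of groups (users) in which each feature occurs.
# `newest` and `cancel` are taken from the query output above; the rest are made up.
group_counts = {"the": 978, "birthday": 410, "cancel": 50, "newest": 49, "zyzzyva": 2}
n_groups = 978  # groups that met the group frequency threshold

def surviving_features(group_counts, n_groups, p_occ):
    """Features that occur in at least round(p_occ * n_groups) groups."""
    # Assumption: the threshold is the fraction rounded to the nearest integer.
    threshold = round(p_occ * n_groups)
    return sorted(f for f, g in group_counts.items() if g >= threshold)

for p_occ in (0.01, 0.05, 0.07):
    kept = surviving_features(group_counts, n_groups, p_occ)
    print(p_occ, len(kept), kept)
```

Raising the fraction from 0.05 to 0.07 drops the words that were "on the bubble", which is the same effect the DLATK command below has on the real table.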

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'
feat_1gram_user = 'feat$1gram$msgs$user_id'

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --group_freq_thresh 500 \
    --correl_field user_id \
    --feat_table '{feat_1gram_user}' \
    --feat_occ_filter --set_p_occ 0.07



TopicExtractor: gensim Mallet wrapper unavailable, using Mallet directly.

-----
DLATK Interface Initiated: 2025-05-08 18:07:46
-----
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
 feat$1gram$msgs$user_id [threshold: 68]
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
SQL Query: DROP TABLE IF EXISTS feat$1gram$msgs$user_id$0_07
 feat$1gram$msgs$user_id <new table feat$1gram$msgs$user_id$0_07 will have 3514 distinct features.>
SQL Query: CREATE TABLE feat$1gram$msgs$user_id$0_07 ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(36), value INTEGER, group_norm DOUBLE)
0.1m feature instances written
0.2m feature instances written
0.3m feature instances written
0.4m feature instances written
0.5m feature instances written
0.6m feature instances written
0.7m feature instances written
Done inserting.
Enabling keys.
done.
-------
Settings:

Database - dla_tutorial
Corpus - msgs
Group I

After ~21 seconds, this results in a new table with 3,514 distinct features (see DLATK output: `new table ... will have 3514 distinct features`). Also note the group threshold resulting from the occurrence threshold: `[threshold: 68]`.

#### 👩‍🔬💻 Exercise

Can you get the number of types (distinct words) from the `feat$1gram$msgs$user_id$0_07` table?

**Answer**
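One possible answer is the same `COUNT(DISTINCT feat)` query used earlier, pointed at the new table. The sketch below wraps it in a small Python function using the built-in `sqlite3` module and runs it on a toy in-memory stand-in, so it works anywhere. The helper `count_types` and the toy rows are made up for illustration.

```python
import sqlite3

def count_types(conn, table):
    """Number of distinct features (types) in a feature table."""
    # Quote the identifier, since feature table names contain `$`.
    query = f'SELECT COUNT(DISTINCT feat) FROM "{table}"'
    return conn.execute(query).fetchone()[0]

# Toy stand-in for the real table (made-up rows).
table = "feat$1gram$msgs$user_id$0_07"
conn = sqlite3.connect(":memory:")
conn.execute(f'CREATE TABLE "{table}" (group_id INTEGER, feat TEXT, value INTEGER)')
conn.executemany(
    f'INSERT INTO "{table}" VALUES (?, ?, ?)',
    [(1, "the", 3), (2, "the", 5), (2, "dog", 1)],
)

print(count_types(conn, table))
```

Run against the real database, the count should agree with the number DLATK printed when it created the table (3,514 distinct features).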

## 3) Extracting 2-grams, filter on occurrence and PMI

Now that we have explored 1-gram extraction quite a bit, let's go to the next step. Let's extract 2grams -- sequences of tokens of length two (such as "happy, birthday" and "yay, !")

It's the same command as in 1-gram extraction -- the only change being `--add_ngrams -n 2`.
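If n-grams are new to you, this short sketch shows what "sequences of tokens of length n" means. It uses a naive whitespace split as a stand-in for DLATK's own tokenizer, which is an assumption made purely for illustration.

```python
def ngrams(tokens, n):
    """All contiguous token sequences of length n, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Assumption: whitespace splitting stands in for DLATK's tokenizer.
tokens = "happy birthday to you !".split()

print(ngrams(tokens, 1))
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```

Every adjacent pair becomes a 2-gram, including pairs with punctuation, which is one reason the number of distinct 2-grams grows so quickly.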

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --add_ngrams -n 2



TopicExtractor: gensim Mallet wrapper unavailable, using Mallet directly.

-----
DLATK Interface Initiated: 2025-05-06 21:27:26
-----
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
query: PRAGMA table_info(msgs)
SQL Query: DROP TABLE IF EXISTS feat$2gram$msgs$user_id
SQL Query: CREATE TABLE feat$2gram$msgs$user_id ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(70), value INTEGER, group_norm DOUBLE)


Creating index correl_field on table:feat$2gram$msgs$user_id, column:group_id 


SQL Query: CREATE INDEX correl_field$2gram$msgs$user_id ON feat$2gram$msgs$user_id (group_id)


Creating index feature on table:feat$2gram$msgs$user_id, column:feat 


SQL Query: CREATE INDEX feature$2gram$msgs$user_id ON feat$2gram$msgs$user_id (feat)
query: PRAGMA table_info(msgs)
SQL Query: DROP TABLE IF EXISTS feat$meta_2gram$msgs$user_id
SQL Query: CREATE TABLE feat$meta_2gram$msgs$user_id ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(16), value INTEGER, group_norm

This takes about 6 min. The above command produces `feat$2gram$msgs$user_id` table containing 2-grams.

So, how many of them are there BTW? Let's check with SQL.

In [None]:
feat_2gram_user = 'feat$2gram$msgs$user_id'

In [None]:
%%sql

SELECT COUNT(DISTINCT feat)
FROM {{feat_2gram_user}};

COUNT(DISTINCT feat)
1536988


Take a moment to count the digits -- that's 1.5 million. That's clearly an insane amount!!

Let us filter it down with occurrence threshold of 0.05 (like we did with 1-grams above). Note that we are using the `--feat_table` trick here: we are ingesting the unfiltered table. That runs much faster (as it doesn't create the unfiltered table again).

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'
feat_2gram_user = 'feat$2gram$msgs$user_id'

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --feat_table '{feat_2gram_user}' \
    --feat_occ_filter --set_p_occ 0.05



TopicExtractor: gensim Mallet wrapper unavailable, using Mallet directly.

-----
DLATK Interface Initiated: 2025-05-06 21:33:50
-----
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
 feat$2gram$msgs$user_id [threshold: 49]
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
    checked 1000000 features
SQL Query: DROP TABLE IF EXISTS feat$2gram$msgs$user_id$0_05
 feat$2gram$msgs$user_id <new table feat$2gram$msgs$user_id$0_05 will have 11214 distinct features.>
SQL Query: CREATE TABLE feat$2gram$msgs$user_id$0_05 ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(70), value INTEGER, group_norm DOUBLE)
0.1m feature instances written
0.2m feature instances written
0.3m feature instances written
0.4m feature instances written
0.5m feature instances written
0.6m feature instances written
0.7m feature instances written
0.8m feature instances written
0.9m feature instances written
1.0m feat

Takes 20 seconds. The above command produces the table `feat$2gram$msgs$user_id$0_05`.

How many feats are left then?

In [None]:
feat_2gram_occ05_user = 'feat$2gram$msgs$user_id$0_05'

In [None]:
%%sql

SELECT COUNT(DISTINCT feat)
FROM {{feat_2gram_occ05_user}};

COUNT(DISTINCT feat)
11214


Looks like there are ~11k of those. Let's have a look at a random sample of 10.

In [None]:
feat_2gram_occ05_user = 'feat$2gram$msgs$user_id$0_05'

In [None]:
%%sql

SELECT *
FROM {{feat_2gram_occ05_user}}
ORDER by RANDOM()
LIMIT 10;

id,group_id,feat,value,group_norm
321397,2311511,to lay,1,4.044325810887325e-05
1278070,4184069,. being,1,0.0003330003330003
467553,3022585,will try,1,2.464632523290777e-05
192820,1694057,ones .,2,2.3073373327180436e-05
945510,3745004,the ability,1,5.145885864251531e-05
1046289,3860954,think i,1,0.0006779661016949
783875,3556477,that a,1,0.0001047010784211
52451,671748,to post,2,4.20035703034758e-05
510349,3178789,remember what,1,8.863676653075696e-05
768442,3536867,decided that,1,0.0007183908045977


As you can observe, a lot of those 2 grams are uninformative -- often just random combinations of highly frequent words and punctuation. Let's drop them!

### 3a) Using Pointwise-Mutual Information to filter two-grams

[Pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information) can be used to drop them.

\begin{equation}
\mathrm{pmi}(a,b) = \log \left( \frac{p(a,b)}{p(a)\, p(b)} \right)
\end{equation}

We cover the details in the lecture but basically the idea is we want to see how much more likely the phrase ("happy birthday") is, for example, than the (independent) likelihoods of "happy" and "birthday" would suggest. We calculate a ratio of these, and then filter on it. The typical PMI thresholds (3 to 6) are based on empirical testing across many data sets (they are rules of thumb).
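To make the formula concrete, here is a tiny worked example in Python. The probabilities are made up, and the natural log is an assumption: the log base DLATK uses, and how it extends PMI to 3-grams, are not covered here.

```python
import math

def pmi(p_ab: float, p_a: float, p_b: float) -> float:
    """Pointwise mutual information of a pair, using the natural log (assumption)."""
    return math.log(p_ab / (p_a * p_b))

# Made-up probabilities for illustration.
# A collocation: the pair occurs far more often than independence would predict.
print(round(pmi(p_ab=0.0005, p_a=0.002, p_b=0.001), 2))

# A chance pairing: the pair occurs about as often as independence predicts.
print(round(abs(pmi(p_ab=0.00005, p_a=0.05, p_b=0.001)), 2))
```

A PMI near zero means the two words co-occur about as often as chance predicts; the larger the value, the more phrase-like the pair.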

So let's filter down the extracted 2-grams to only those that meet a PMI threshold of 3 with the `--feat_colloc_filter --set_pmi_threshold 3` flags. Same as occurrence filtering - the first flag "activates" the filter, the second sets it.

Note that we are ingesting **the occurrence-filtered** `--feat_table` here (`feat$2gram$msgs$user_id$0_05`) -- same idea as before, but now we just add the PMI filtering on top of the occurrence filtering.

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'
feat_2gram_occ05_user = 'feat$2gram$msgs$user_id$0_05'

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --feat_table '{feat_2gram_occ05_user}' \
    --feat_colloc_filter --set_pmi_threshold 3



TopicExtractor: gensim Mallet wrapper unavailable, using Mallet directly.

-----
DLATK Interface Initiated: 2025-05-06 21:35:37
-----
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
feat$2gram$msgs$user_id$0_05
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
SQL Query: DROP TABLE IF EXISTS feat$2gram$msgs$user_id$0_05$pmi3_0
 feat$2gram$msgs$user_id$0_05 <new table feat$2gram$msgs$user_id$0_05$pmi3_0 will have 3517 distinct features.>
SQL Query: CREATE TABLE feat$2gram$msgs$user_id$0_05$pmi3_0 ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(70), value INTEGER, group_norm DOUBLE)
0.1m feature instances written
0.2m feature instances written
0.3m feature instances written
0.4m feature instances written
Done inserting.
Enabling keys.
done.
-------
Settings:

Database - dla_tutorial
Corpus - msgs
Group ID - user_id
Feature table(s) - feat$2gram$msgs$user_id$0_05$pmi3_0
-------
Interface Runtime: 12.45 seconds
DLATK exits with success! A good day

The above command produces table `feat$2gram$msgs$user_id$0_05$pmi3_0`. Note the `pmi3_0` at the end.

As you can see, the filtering steps are stacking up in the table name -- it keeps track of what happened to it. It's a simple blockchain, if you will. 💎👐

Let's check how many features are left.

In [None]:
feat_2gram_occ05_pmi3_user = 'feat$2gram$msgs$user_id$0_05$pmi3_0'

In [None]:
%%sql

SELECT COUNT(DISTINCT feat)
FROM {{feat_2gram_occ05_pmi3_user}};

COUNT(DISTINCT feat)
3517


~3500 of them. Let's look at some of them.

In [None]:
%%sql

SELECT *
FROM {{feat_2gram_occ05_pmi3_user}}
ORDER by RANDOM()
LIMIT 10;

id,group_id,feat,value,group_norm
258911,3565927,we went,2,0.0001186661920018
390225,4043736,if you,1,0.0006600660066006
172957,3252533,which was,1,4.671150971599402e-05
72647,1826527,i'm supposed,5,5.260721350111527e-05
106830,2366391,he makes,2,3.238813946332853e-05
410137,4148541,a bit,1,0.0003533568904593
278633,3632734,i forgot,2,0.0037807183364839
396474,4068475,don't know,1,0.0010373443983402
116816,2607577,i'm not,22,0.0003484430929075
77894,1960271,who are,4,0.000136262987566


Better, but it still looks like we are getting phrases that aren't really that phrase-like. Let's filter with a stronger PMI threshold.

Let's do another one with PMI = 7. Same trick! Runs through quickly.

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'
feat_2gram_occ05_user = 'feat$2gram$msgs$user_id$0_05'

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --feat_table '{feat_2gram_occ05_user}' \
    --feat_colloc_filter --set_pmi_threshold 7



TopicExtractor: gensim Mallet wrapper unavailable, using Mallet directly.

-----
DLATK Interface Initiated: 2025-05-06 21:36:15
-----
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
feat$2gram$msgs$user_id$0_05
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
SQL Query: DROP TABLE IF EXISTS feat$2gram$msgs$user_id$0_05$pmi7_0
 feat$2gram$msgs$user_id$0_05 <new table feat$2gram$msgs$user_id$0_05$pmi7_0 will have 218 distinct features.>
SQL Query: CREATE TABLE feat$2gram$msgs$user_id$0_05$pmi7_0 ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(70), value INTEGER, group_norm DOUBLE)
Done inserting.
Enabling keys.
done.
-------
Settings:

Database - dla_tutorial
Corpus - msgs
Group ID - user_id
Feature table(s) - feat$2gram$msgs$user_id$0_05$pmi7_0
-------
Interface Runtime: 6.41 seconds
DLATK exits with success! A good day indeed  ¯\_(ツ)_/¯.


The above command produces table `feat$2gram$msgs$user_id$0_05$pmi7_0`.

So, how many feats are left now?

In [None]:
feat_2gram_occ05_pmi7_user = 'feat$2gram$msgs$user_id$0_05$pmi7_0'

In [None]:
%%sql

SELECT COUNT(DISTINCT feat)
FROM {{feat_2gram_occ05_pmi7_user}};

COUNT(DISTINCT feat)
218


Ok, it's ~200, that's not many.

In [None]:
%%sql

SELECT *
FROM {{feat_2gram_occ05_pmi7_user}}
ORDER BY RANDOM()
LIMIT 10;

id,group_id,feat,value,group_norm
15866,3665752,take care,1,7.31528895391368e-05
572,664485,six months,9,3.4418273808840904e-05
3302,1778579,cell phone,1,6.994474365251452e-05
16061,3678942,last night,1,0.0002568053415511
14990,3594873,staring at,2,0.0001197031362221
4031,1841346,worry about,2,0.0002240896358543
5380,2259900,few weeks,1,0.000390167772142
2206,1223561,difference between,2,0.0001220554131575
11730,3422339,10 minutes,3,0.0001027749229188
11023,3382977,reminds me,1,3.893171377404033e-05


Ahh OK! These things look like standing phrases (hang out, pick up, figure out, woke up, far away, etc.) Those are much more likely than you expect by chance.

### 3b) Occurrence and PMI-filtering a feat table at the same time

We can extract 2-grams and apply the occurrence threshold and the PMI filter all in one command, either by ingesting an unfiltered 2-gram feature table (using `--feat_table`), or by extracting it from scratch (using `--add_ngrams -n 2`), as below:

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --add_ngrams -n 2 \
    --feat_occ_filter --set_p_occ 0.05 \
    --feat_colloc_filter --set_pmi_threshold 3



TopicExtractor: gensim Mallet wrapper unavailable, using Mallet directly.

-----
DLATK Interface Initiated: 2025-05-06 21:36:41
-----
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
query: PRAGMA table_info(msgs)
SQL Query: DROP TABLE IF EXISTS feat$2gram$msgs$user_id
SQL Query: CREATE TABLE feat$2gram$msgs$user_id ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(70), value INTEGER, group_norm DOUBLE)


Creating index correl_field on table:feat$2gram$msgs$user_id, column:group_id 


SQL Query: CREATE INDEX correl_field$2gram$msgs$user_id ON feat$2gram$msgs$user_id (group_id)


Creating index feature on table:feat$2gram$msgs$user_id, column:feat 


SQL Query: CREATE INDEX feature$2gram$msgs$user_id ON feat$2gram$msgs$user_id (feat)
query: PRAGMA table_info(msgs)
SQL Query: DROP TABLE IF EXISTS feat$meta_2gram$msgs$user_id
SQL Query: CREATE TABLE feat$meta_2gram$msgs$user_id ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(16), value INTEGER, group_norm

This will take time, ~8 minutes. This produces the tables
* `feat$2gram$msgs$user_id`
* `feat$2gram$msgs$user_id$0_05`
* `feat$2gram$msgs$user_id$0_05$pmi3_0`

It's like a very nerdy and boring version of the hero's journey (2023 Johannes comment: I don't even know what that means!).

### 3c) 3gram table - extraction and filtering

Let's create such an occurrence- and PMI-filtered table for 3-grams, all in one step using the above command. For this we will again use `--feat_colloc_filter --set_pmi_threshold 3`.

This will take a while (~9min)!!

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --add_ngrams -n 3 \
    --feat_occ_filter --set_p_occ 0.05 \
    --feat_colloc_filter --set_pmi_threshold 3



TopicExtractor: gensim Mallet wrapper unavailable, using Mallet directly.

-----
DLATK Interface Initiated: 2025-05-06 21:50:06
-----
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
query: PRAGMA table_info(msgs)
SQL Query: DROP TABLE IF EXISTS feat$3gram$msgs$user_id
SQL Query: CREATE TABLE feat$3gram$msgs$user_id ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(102), value INTEGER, group_norm DOUBLE)


Creating index correl_field on table:feat$3gram$msgs$user_id, column:group_id 


SQL Query: CREATE INDEX correl_field$3gram$msgs$user_id ON feat$3gram$msgs$user_id (group_id)


Creating index feature on table:feat$3gram$msgs$user_id, column:feat 


SQL Query: CREATE INDEX feature$3gram$msgs$user_id ON feat$3gram$msgs$user_id (feat)
query: PRAGMA table_info(msgs)
SQL Query: DROP TABLE IF EXISTS feat$meta_3gram$msgs$user_id
SQL Query: CREATE TABLE feat$meta_3gram$msgs$user_id ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(16), value INTEGER, group_nor

This produces the following table: `feat$3gram$msgs$user_id$0_05$pmi3_0`

How many features does it have in it though?

In [None]:
feat_3gram_occ05_pmi3_user = 'feat$3gram$msgs$user_id$0_05$pmi3_0'

In [None]:
%%sql

SELECT COUNT(DISTINCT feat)
FROM {{feat_3gram_occ05_pmi3_user}};

COUNT(DISTINCT feat)
2576


Ok, ~2.5k, which is totally reasonable. Notice that those are **fewer** than there were 2-grams with the same occurrence filter and PMI settings. A given 3gram is a lot less frequent, which is the main source of filtering.

## 4) Extracting 1to3grams all together, into a combined table

For our downstream correlations and analyses, we want to combine 1grams, 2grams and 3grams into one `1to3gram` table. We do this with the combination of the following two flags: `--add_ngrams -n 1 2 3` and `--combine_feat_tables 1to3gram`.

We can then combine this extraction and combination with the same occurrence and PMI filtering. So in one command -

```
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --add_ngrams -n 1 2 3 \
    --combine_feat_tables 1to3gram \
    --feat_occ_filter --set_p_occ 0.05 \
    --feat_colloc_filter --set_pmi_threshold 3
```

The above command produces the following tables:
- `feat$1gram$msgs$user_id`
- `feat$2gram$msgs$user_id`
- `feat$3gram$msgs$user_id`
- `feat$1to3gram$msgs$user_id` -- **combined 1 to 3 grams**
- `feat$1to3gram$msgs$user_id$0_05` -- combined 1 to3 grams, **filtered to occurrence threshold 0.05**
- `feat$1to3gram$msgs$user_id$0_05$pmi3_0` -- combined 1 to3 grams, occurrence threshold 0.05, **plus PMI > 3**

Plus 1gram, 2gram and 3gram `meta_` tables, so **9 tables** in total.
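The naming pattern in the list above is regular enough to sketch in a few lines of Python. This is only an illustration of the pattern visible in this tutorial's table names; `threshold_suffix` and `feature_table_name` are hypothetical helpers, not part of DLATK, and very small thresholds (which Python prints in scientific notation) are not handled.

```python
def threshold_suffix(x):
    """Format a threshold as it appears in this tutorial's table names,
    e.g. 0.05 -> '0_05' and 3 -> '3_0'. (Hypothetical helper.)"""
    return str(float(x)).replace(".", "_")


def feature_table_name(feat_name, corptable, correl_field, p_occ=None, pmi=None):
    """Assemble a feature table name from its parts, joined by '$'.
    (Hypothetical helper that mirrors the names listed in this tutorial.)"""
    parts = ["feat", feat_name, corptable, correl_field]
    if p_occ is not None:
        parts.append(threshold_suffix(p_occ))
    if pmi is not None:
        parts.append("pmi" + threshold_suffix(pmi))
    return "$".join(parts)


# unfiltered, occurrence-filtered, and occurrence + PMI filtered names
print(feature_table_name("1to3gram", "msgs", "user_id"))
print(feature_table_name("1to3gram", "msgs", "user_id", p_occ=0.05))
print(feature_table_name("1to3gram", "msgs", "user_id", p_occ=0.05, pmi=3))
```

Reading the names this way makes it easy to tell, from the table name alone, which filters a table has already been through.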

**Heads up, this will take roughly 20mins!**

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'

In [28]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --add_ngrams -n 1 2 3 \
    --combine_feat_tables 1to3gram \
    --feat_occ_filter --set_p_occ 0.05 \
    --feat_colloc_filter --set_pmi_threshold 3



TopicExtractor: gensim Mallet wrapper unavailable, using Mallet directly.

-----
DLATK Interface Initiated: 2025-05-08 18:33:49
-----
Connecting to SQLite database: /content/sqlite_data/dla_tutorial
query: PRAGMA table_info(msgs)
SQL Query: DROP TABLE IF EXISTS feat$1gram$msgs$user_id
SQL Query: CREATE TABLE feat$1gram$msgs$user_id ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(36), value INTEGER, group_norm DOUBLE)


Creating index correl_field on table:feat$1gram$msgs$user_id, column:group_id 


SQL Query: CREATE INDEX correl_field$1gram$msgs$user_id ON feat$1gram$msgs$user_id (group_id)


Creating index feature on table:feat$1gram$msgs$user_id, column:feat 


SQL Query: CREATE INDEX feature$1gram$msgs$user_id ON feat$1gram$msgs$user_id (feat)
query: PRAGMA table_info(msgs)
SQL Query: DROP TABLE IF EXISTS feat$meta_1gram$msgs$user_id
SQL Query: CREATE TABLE feat$meta_1gram$msgs$user_id ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(16), value INTEGER, group_norm

Alright!! How many features are in the final filtered combo table?

In [None]:
feat_1to3gram_occ05_pmi3_user = 'feat$1to3gram$msgs$user_id$0_05$pmi3_0'

In [None]:
%%sql

SELECT COUNT(DISTINCT feat)
FROM {{feat_1to3gram_occ05_pmi3_user}};

COUNT(DISTINCT feat)
10844


That's about 10,000 features, which is good: it's less than our max cap for user-level analyses of 10,000 1grams + 10,000 2to3grams = 20,000 1to3grams. We can work with this for correlations now.

Yay!!
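If you want to know how that total splits into 1grams, 2grams and 3grams, one option is to count the spaces inside each `feat` string. The sketch below runs against a tiny in-memory toy table (`toy_1to3gram`, with made-up rows) rather than the real feature table, and it assumes that the words of a phrase are separated by single spaces.

```python
import sqlite3

# toy in-memory table with the same columns as a DLATK feature table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_1to3gram (id INTEGER PRIMARY KEY, group_id INTEGER, "
    "feat VARCHAR(102), value INTEGER, group_norm DOUBLE)"
)
toy_rows = [
    (1, "happy", 3, 0.003),
    (1, "united states", 1, 0.001),
    (2, "happy", 5, 0.004),
    (2, "new york city", 2, 0.002),
    (2, "united states", 1, 0.001),
    (3, "tired", 4, 0.005),
]
conn.executemany(
    "INSERT INTO toy_1to3gram (group_id, feat, value, group_norm) VALUES (?, ?, ?, ?)",
    toy_rows,
)

# number of words in a phrase = number of spaces + 1
query = """
SELECT LENGTH(feat) - LENGTH(REPLACE(feat, ' ', '')) + 1 AS n,
       COUNT(DISTINCT feat) AS n_features
FROM toy_1to3gram
GROUP BY n
ORDER BY n
"""
counts_by_order = dict(conn.execute(query).fetchall())
print(counts_by_order)
```

The same `SELECT` should work in a `%%sql` cell if you swap `toy_1to3gram` for your real table variable.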

## 👍 **Rules of thumb for thresholds**

### Step 1 -  set occurrence threshold

Set the occurrence threshold so that you have 3k – 10k 1grams.

Good range for **person-level type data**: `--set_p_occ` 0.01 to 0.1, depending on sample size. The larger the sample (e.g., 10k+), the lower (e.g., 0.01) you can set it.

Remember: You can always multiply \[sample size > GFT threshold\] times threshold to get the threshold number of groups you are picking (48, in the example above) -- and ask yourself: is that a good number for what I'm trying to do? 🤔

E.g., 10,000 groups of which 9,000 have 500 words means I can choose an occurrence threshold of 0.01; that's 90 groups. Enough to run correlations over if a feature exists!
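The arithmetic is simple enough to keep as a tiny helper while you try out thresholds. This is a back-of-the-envelope sketch; `min_groups_for_feature` is a hypothetical name, and the rounding is our assumption, not necessarily how DLATK handles fractional group counts.

```python
def min_groups_for_feature(n_groups_passing_gft, p_occ):
    """Roughly how many groups must use a feature for it to survive
    the occurrence filter. (Hypothetical helper, rounding is assumed.)"""
    return round(n_groups_passing_gft * p_occ)


# the example from the text: 9,000 groups pass the GFT
for p_occ in (0.01, 0.05, 0.1):
    print(p_occ, min_groups_for_feature(9000, p_occ))
```

For each candidate threshold, ask whether that many groups is enough to run a correlation over.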

#### Good Default -
`--feat_occ_filter --set_p_occ 0.05`


### Step 2 -  set PMI, using the occurrence threshold from step 1

You want to set the PMI such that you have <10k additional 2-3 grams.  

#### Good defaults -

For **person-level type data**, good PMI thresholds range from 3 to 6, set with e.g. `--feat_colloc_filter --set_pmi_threshold 3`
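To build some intuition for why a PMI threshold keeps "United States" but drops "the dog," here is a minimal sketch of the textbook pointwise mutual information for a 2-gram. All counts are made up for illustration, and we assume the natural logarithm and simple relative frequencies; DLATK's exact computation may differ in details.

```python
import math


def pmi(count_phrase, count_w1, count_w2, total_tokens):
    """Textbook PMI for a 2-gram: log of how much more often the phrase
    occurs than expected if its two words were independent."""
    p_phrase = count_phrase / total_tokens
    p_w1 = count_w1 / total_tokens
    p_w2 = count_w2 / total_tokens
    return math.log(p_phrase / (p_w1 * p_w2))


# made-up counts in a corpus of 100,000 tokens
pmi_united_states = pmi(count_phrase=40, count_w1=50, count_w2=60, total_tokens=100_000)
pmi_the_dog = pmi(count_phrase=10, count_w1=5000, count_w2=100, total_tokens=100_000)

print(f"united states: {pmi_united_states:.2f}")
print(f"the dog:       {pmi_the_dog:.2f}")
```

With these toy counts, "united states" lands well above a threshold of 3, while "the dog" stays far below it, because "the" is so frequent that seeing it next to "dog" is barely more than chance.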

## 5) Making one-gram correlation ("differential") word clouds


Alright, in the following we want to individually correlate the user-level frequencies of our 10,000 1to3grams against outcomes like age and gender, one at a time -- and then p-correct our significance thresholds (done implicitly by default with Benjamini-Hochberg).


* We then take the 70 most correlated 1to3grams, and ask DLATK to put them into a **wordcloud in descending order of the magnitude of the correlation coefficient with a given outcome**. We do this **separately for positively and for negatively correlated language features** for a given outcome. So we will get, for example, all the words most positively associated with age, and the words most negatively associated with age -  two word clouds per outcome.


* So the main work here will be done by `--correlate` and `--tagcloud --make_wordclouds`, writing into an output folder (which we set here with `--output_name {OUTPUT_FOLDER}/{OUTPUT_NAME}`).


* Importantly, the **size of the words** in the wordcloud corresponds to the size of the correlation coefficients. The **coefficient range in the cloud is reported in the filename of the word cloud image files.** These are "differential word clouds," and the method is called "differential language analysis" (as in, using correlation coefficients as the primary dimension for visualization-shortlisting).


* The **color indexes the relative frequency,** from red (most frequent) to blue (moderate) to grey (rarely used).


* Note: these word clouds differ from your "standard" lame-o word clouds where size gives frequency. These other kinds of visualizations aren't great because #languageFrequencyStatistics are generally pretty uninformative on their own, many blog posts in need of an image notwithstanding. You end up just plotting whichever function words you didn't throw out.


* Note that if you include the `--rmatrix --csv` flags, you get the usual output files that give you correlation coefficients for every word in an html and a csv file, respectively. With 10,000 language features, these files can get pretty big, but whatever. You could ingest them into R if you wanted, and create other visualizations or use table technology to show correlations, etc.


* Here is the legend for these plots, a good thing to have.

**A figure template powerpoint with these legends is also linked from the class website, and in the Class GitHub.**

<img src="https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-09/fig1.png" width="300">

[Link to PPTX!](https://github.com/CompPsychology/psych290_colab_public/blob/main/notebooks/FIGURE%20TEMPLATE%20-%201to3grams%20and%20topics.pptx)
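The shortlisting logic described in the bullets above can be sketched in plain Python. This is not DLATK's code: `shortlist_for_clouds` is a hypothetical helper, the words and numbers are made up, and we assume that only features passing the corrected significance threshold are eligible for a cloud.

```python
def shortlist_for_clouds(results, n_top=70, alpha=0.05):
    """results maps feat -> (r, corrected_p).
    Returns two lists of (feat, r): strongest positive and strongest
    negative correlations, each sorted by descending magnitude."""
    significant = [(feat, r) for feat, (r, p) in results.items() if p < alpha]
    positive = sorted((x for x in significant if x[1] > 0), key=lambda x: -x[1])[:n_top]
    negative = sorted((x for x in significant if x[1] < 0), key=lambda x: x[1])[:n_top]
    return positive, negative


# made-up correlations with age: feat -> (r, corrected p)
toy_results = {
    "retired": (0.30, 0.001),
    "grandkids": (0.25, 0.002),
    "homework": (-0.35, 0.001),
    "lol": (-0.20, 0.004),
    "the": (0.02, 0.600),  # not significant, so never shown
}
older, younger = shortlist_for_clouds(toy_results, n_top=2)
print("positive cloud:", older)
print("negative cloud:", younger)
```

In a real cloud, the `r` values would then drive the word size, and the relative frequency of each feature would drive the color.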

Let's start by just looking at the 1grams most correlated with age and gender. Let's use the occurrence-filtered 1gram table. That's always the most sensible thing to do: any extra features you correlate reduce your statistical power, as they are taken into account by the Benjamini-Hochberg correction for multiple comparisons.

Btw, let's also pipe `> [name_you_want].txt 2>&1` the output away into a text file, so we don't have to hear DLATK tell us about every other word it couldn't find enough examples of, which were significant, etc. 🙄

In [None]:
#new
OUTPUT_NAME = '1grams_age_gender'
feat_1gram_occ05_user = 'feat$1gram$msgs$user_id$0_05'

database = 'dla_tutorial'
msgs_table = 'msgs'
outcomes_table = 'outcomes'

OUTPUT_FOLDER = 'output_tutorial_9'
!rm -rf {OUTPUT_FOLDER}/{OUTPUT_NAME}* # this deletes the output if exists
!mkdir -p {OUTPUT_FOLDER} # and this makes the folder!

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --correlate --rmatrix --csv --sort \
    --feat_table '{feat_1gram_occ05_user}' \
    --outcome_table {outcomes_table} \
    --outcomes age gender \
    --tagcloud --make_wordclouds \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME} > {OUTPUT_FOLDER}/logs.txt 2>&1

The above command has written files to the `output_tutorial_9/` folder with the prefix `1grams_age_gender`. You can check this out in the Files pane in the tab to the left and view the word clouds in Colab by simply clicking them 😀

The word clouds are in:

<img src="https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-09/file_tree_t9.png" width="300">

Have a look at them in Colab.

We have already seen the html and csv files when we ran LIWC correlations -- these look the same, just with 1grams as rows, rather than LIWC dictionaries.

**Reminder:** You can always download the results to your local machine.

Let's look at the word clouds that DLATK put in that subfolder. The position of the words is random between runs.

(BTW, if you are wondering how to insert figures into text cells, there's an insert image option on the top of the cell (for example, this cell). We arranged the images using Powerpoint, and took a screenshot)

**Gender:**

<img src="https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-09/fig3_gender.png" width="900">

That looks vaguely like what we've seen before in Eichstaedt et al., 2021. Males orient towards things (the, of), females towards relational and emotional terms. Keep in mind that clouds amplify even small differences, so they can amplify stereotypes.

**Age:**

<img src="https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-09/fig4_age.png" width="900">

Note that we should probably control for gender (`--controls gender`) when we run age language correlations, and vice versa.

### 5a) 1to3gram correlation ("differential") word clouds

Let's repeat the exercise, but use our oh-so-carefully crafted 1to3grams table instead.

Note that the only thing that changes from the 1grams is the `--feat_table` (`feat$1to3gram$msgs$user_id$0_05$pmi3_0`) we include in the command (and the `OUTPUT_NAME`, `1to3grams_age_gender`, so that we keep the results separate and tidy)!

In [None]:
OUTPUT_NAME = '1to3grams_age_gender'
feat_1to3gram_occ05_pmi3_user = 'feat$1to3gram$msgs$user_id$0_05$pmi3_0'

database = 'dla_tutorial'
msgs_table = 'msgs'
outcomes_table = 'outcomes'

OUTPUT_FOLDER = 'output_tutorial_9'
!rm -rf {OUTPUT_FOLDER}/{OUTPUT_NAME}* # this deletes the output if exists
!mkdir -p {OUTPUT_FOLDER} # and this makes the folder!

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --correlate \
    --rmatrix --csv --sort \
    --feat_table '{feat_1to3gram_occ05_pmi3_user}' \
    --outcome_table {outcomes_table} \
    --outcomes age gender \
    --tagcloud --make_wordclouds \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME} > {OUTPUT_FOLDER}/logs.txt 2>&1

The above command has written files to `output_tutorial_9/` with prefix `1to3grams_age_gender`.
We again get html and csv files with all correlations for 10,000 1to3grams, and the word clouds. Let's look at gender.

**Gender**

<img src="https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-09/fig5.png" width="800">

Note the additional phrases contributed by the 2grams and 3grams. As some of the 2grams and 3grams don't seem to add that much (as in, are uninformative: "a look at," "I felt," "when I"), one could go back and increase the PMI threshold (say, to 6), and get a new feature table.

#### 👩‍🔬💻 Exercise

What would be the most efficient way to create such a table?  
**HINT:** what's the last table you could ingest before PMI filtering?

**Answer:**  



### 5b) Differential 1to3gram clouds for occupation

Let's look at the words that correlate with the dummies for occupations, using the `--categories_to_binary` flag.
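Conceptually, turning a categorical outcome into dummies looks like the sketch below. This is a plain-Python illustration of dummy coding, not DLATK's implementation; the function name mirrors the flag only for readability, the toy outcomes are made up, and the `occu__<category>` column naming is a hypothetical choice.

```python
def categories_to_binary(outcomes, column):
    """Turn one categorical outcome into one 0/1 dummy per category.
    outcomes maps group_id -> {column: category}."""
    categories = sorted({row[column] for row in outcomes.values()})
    dummies = {}
    for group_id, row in outcomes.items():
        dummies[group_id] = {
            f"{column}__{cat}": int(row[column] == cat) for cat in categories
        }
    return dummies


# made-up outcomes for three users
toy_outcomes = {
    1: {"occu": "Student"},
    2: {"occu": "Technology"},
    3: {"occu": "Student"},
}
dummies = categories_to_binary(toy_outcomes, "occu")
for group_id, row in dummies.items():
    print(group_id, row)
```

Each dummy is then correlated with the language features on its own, which is why you get a separate set of word clouds per occupation.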

In [None]:
OUTPUT_NAME = '1to3grams_occu'

database = 'dla_tutorial'
msgs_table = 'msgs'
outcomes_table = 'outcomes'
feat_1to3gram_occ05_pmi3_user = 'feat$1to3gram$msgs$user_id$0_05$pmi3_0'

OUTPUT_FOLDER = 'output_tutorial_9'
!rm -rf {OUTPUT_FOLDER}/{OUTPUT_NAME}* # this deletes the output if exists
!mkdir -p {OUTPUT_FOLDER} # and this makes the folder!

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --correlate \
    --rmatrix --csv --sort \
    --feat_table '{feat_1to3gram_occ05_pmi3_user}' \
    --outcome_table {outcomes_table} \
    --outcomes occu \
    --categories_to_binary occu \
    --tagcloud --make_wordclouds \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME} > {OUTPUT_FOLDER}/logs.txt 2>&1

This takes \~6 mins -- it does all the occupations (\~20) one at a time, times 10k-ish language features -- that's roughly 200,000 (200k) language correlations. The above command has written files to `output_tutorial_9/` with prefix `1to3grams_occu`.

There are a lot of images produced, mostly 1 for each occupation -- because negative correlations ("words not used by students") tend to have much weaker signal than "words used by students." Please take a look in your folder. Do you like one in particular?

Here are two positive correlation ones.

**Agriculture**

<img src="https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-09/fig6_ag.png" width="500">

**Technology**

<img src="https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-09/fig7_tech.png" width="500">

### 5c) Differential 1to3gram clouds for star signs

Let's apply this pipeline to see if we can learn something about star signs. We again throw in the `--categories_to_binary sign` conversion for `--outcomes sign`. (We don't expect great results to manifest.)*

\*or do we? 🤔♉

In [None]:
OUTPUT_NAME = '1to3grams_sign'

database = 'dla_tutorial'
msgs_table = 'msgs'
outcomes_table = 'outcomes'
feat_1to3gram_occ05_pmi3_user = 'feat$1to3gram$msgs$user_id$0_05$pmi3_0'

OUTPUT_FOLDER = 'output_tutorial_9'
!rm -rf {OUTPUT_FOLDER}/{OUTPUT_NAME}* # this deletes the output if exists
!mkdir -p {OUTPUT_FOLDER} # and this makes the folder!

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --correlate \
    --rmatrix --csv --sort \
    --feat_table '{feat_1to3gram_occ05_pmi3_user}' \
    --outcome_table {outcomes_table} \
    --outcomes sign \
    --categories_to_binary sign \
    --tagcloud --make_wordclouds \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME} > {OUTPUT_FOLDER}/logs.txt 2>&1

2 minutes for 12 * 10k correlations. The above command has written files to `output_tutorial_9/` with prefix `1to3grams_sign`.

Again, there are a lot of images produced -- 1 for each sign. Have a look! Here is the wordcloud for `Pisces`.

<img src="https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-09/fig8.png" width="500">

Seems like we picked up spurious signal.

## 6) Bonferroni vs Benjamini-Hochberg correction

So far we've used the Benjamini-Hochberg correction for multiple comparisons. That largely works alright. When we get the sense that we may be looking at spurious language correlations (here an extra 20 significant among 10,000 features, so a 0.2% false positive fringe), it may be good to ramp up to the more stringent Bonferroni correction.

We do that with `--p_correction bonferroni` which overrides the implicit `--p_correction BH`.
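To see why Bonferroni is the more stringent of the two, here is a minimal textbook sketch of both procedures on ten made-up p-values. It illustrates the standard definitions and is not taken from DLATK's source; both function names are hypothetical.

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Reject a test only if p <= alpha / (number of tests)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg_reject(pvals, alpha=0.05):
    """Sort p-values, find the largest rank k with p_(k) <= k/m * alpha,
    and reject the k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    largest_k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            largest_k = rank
    reject = [False] * m
    for i in order[:largest_k]:
        reject[i] = True
    return reject


# ten made-up p-values
toy_pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
n_bonf = sum(bonferroni_reject(toy_pvals))
n_bh = sum(benjamini_hochberg_reject(toy_pvals))
print("significant with Bonferroni:", n_bonf)
print("significant with Benjamini-Hochberg:", n_bh)
```

Bonferroni holds every test to the same strict cutoff, while Benjamini-Hochberg relaxes the cutoff for higher-ranked p-values, so it lets more features through.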

Let's use Bonferroni for star signs to see if it makes a difference.

In [None]:
OUTPUT_NAME = '1to3grams_sign_BONF'

database = 'dla_tutorial'
msgs_table = 'msgs'
outcomes_table = 'outcomes'
feat_1to3gram_occ05_pmi3_user = 'feat$1to3gram$msgs$user_id$0_05$pmi3_0'

OUTPUT_FOLDER = 'output_tutorial_9'
!rm -rf {OUTPUT_FOLDER}/{OUTPUT_NAME}* # this deletes the output if exists
!mkdir -p {OUTPUT_FOLDER} # and this makes the folder!

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --correlate \
    --rmatrix --csv --sort \
    --p_correction bonferroni \
    --feat_table '{feat_1to3gram_occ05_pmi3_user}' \
    --outcome_table {outcomes_table} \
    --categories_to_binary sign \
    --outcomes sign \
    --tagcloud --make_wordclouds \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME}

Let's have a look:

<img src="https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-09/fig9.png" width="400">

This cleared it up. We just learned that Bonferroni correction may be a good idea on this dataset. What we really need is a bigger dataset, with higher statistical power where we can also ramp up the GFT to 1,000, which is more conservative (see the homework!).

1to3grams looked a little noisy for both occupations and star signs. We could have ramped up the PMI more. But let's move on.

As you can see, calibrating PMI, occurrence filtering, and p_correction is often an iterative process, guided by rules of thumb and inspection of the output ("are these phrases informative?").

And then of course, people who are born in Pisces-months may just say "worthy" and "my heart" slightly more, for whatever reason.

## 7) Word Clouds with dictionary names as features

In case we don't have it, let's extract LIWC2015 features again.

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --add_lex_table -l LIWC2015

Let's use this word cloud making opportunity to generalize our learning a little bit:

What we want to do now is to create wordclouds that have **LIWC dictionary names** in them. For this, we don't have to change anything in the word cloud syntax, other than to give DLATK a `*cat_LIWC2015*` feature table to correlate, rather than a 1gram or 1to3gram table.

**Make sure you understand why this is** -- feature tables are abstract. It makes no difference to them if the distinct features in them are words, phrases, names of dictionaries, or topic ids (as we see in the next tutorial), or cats, chickens, or whatever.

If we correlate a feature table against outcomes and make wordclouds,

* DLATK will **correlate** all the distinct **feature group_norms** (here: dictionary categories, like "POSEMO" and "NEGEMO") it has for all the **groups** against the **outcome**
* **visualize** the **shortlisted** strongest positive and negative correlations (separately) (using the strings written in the **feat** column: here, it's POSEMO, etc.),
* with the size reflecting correlation magnitude, and color reflecting relative frequency (from the `value` column -- total word count per dictionary).
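The point that feature tables are abstract can be made concrete with a toy example. The sketch below builds a tiny in-memory feature table that mixes a dictionary name and a phrase in the same `feat` column, and correlates each feature's `group_norm` with age using the same code path. Table names, rows and the `pearson_r` helper are all made up for illustration; this is not how DLATK runs its correlations internally, and every group has a row for every feature to keep things simple.

```python
import math
import sqlite3


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_feat (id INTEGER PRIMARY KEY, group_id INTEGER, "
    "feat VARCHAR(36), value INTEGER, group_norm DOUBLE)"
)
conn.execute("CREATE TABLE toy_outcomes (user_id INTEGER PRIMARY KEY, age INTEGER)")
conn.executemany("INSERT INTO toy_outcomes VALUES (?, ?)", [(1, 20), (2, 30), (3, 40), (4, 50)])
conn.executemany(
    "INSERT INTO toy_feat (group_id, feat, value, group_norm) VALUES (?, ?, ?, ?)",
    [
        (1, "POSEMO", 40, 0.04), (2, "POSEMO", 30, 0.03),
        (3, "POSEMO", 20, 0.02), (4, "POSEMO", 10, 0.01),
        (1, "united states", 1, 0.001), (2, "united states", 2, 0.002),
        (3, "united states", 3, 0.003), (4, "united states", 4, 0.004),
    ],
)

# same loop for every distinct feature, whatever the feat string stands for
feats = [row[0] for row in conn.execute("SELECT DISTINCT feat FROM toy_feat")]
correlations = {}
for feat in feats:
    rows = conn.execute(
        "SELECT f.group_norm, o.age FROM toy_feat AS f "
        "JOIN toy_outcomes AS o ON f.group_id = o.user_id WHERE f.feat = ?",
        (feat,),
    ).fetchall()
    norms, ages = zip(*rows)
    correlations[feat] = pearson_r(norms, ages)

print(correlations)
```

Nothing in the loop cares whether `feat` holds "POSEMO" or "united states"; it only sees strings, group ids and group_norms.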

So let's run a correlate for a LIWC feature table.

In [None]:
OUTPUT_NAME = 'LIWC_age_gender_occu'
feat_liwc_user = 'feat$cat_LIWC2015$msgs$user_id$1gra'

# same as before
database = 'dla_tutorial'
msgs_table = 'msgs'
outcomes_table = 'outcomes'

OUTPUT_FOLDER = 'output_tutorial_9'
!rm -rf {OUTPUT_FOLDER}/{OUTPUT_NAME}* # this deletes the output if exists
!mkdir -p {OUTPUT_FOLDER} # and this makes the folder!

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --correlate \
    --rmatrix --csv --sort \
    --feat_table '{feat_liwc_user}' \
    --outcome_table {outcomes_table} \
    --categories_to_binary occu \
    --outcomes age gender occu \
    --tagcloud --make_wordclouds \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME}

Btw, note in the above that we've done all of this at once:

    --categories_to_binary occu \
    --outcomes age gender occu \

The above command has written files to `output_tutorial_9/LIWC_age_gender_occu`.

Images were produced -- 2 for age, 2 for gender, ~1 for each occupation. Please take a look.

We have seen the HTML output for LIWC before -- **convince yourself** that the table snippet below corresponds to the top features in the positive gender and age correlation clouds (female and older, respectively -- the clouds on the right below).

**HTML**

<img src="https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-09/fig10_a.png" width="900">


<img src="https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-09/fig11_b.png" width="800">


<img src="https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-09/fig12_c.png" width="800">

Ok, good, at this point you should have a sense of how different feature tables translate into different word cloud output, and how you get 1to3grams tables ready for these exploratory methods.

### 7a) LIWC wordclouds next to 1to3gram word clouds

For whatever reason, we have found it very helpful to put LIWC wordclouds that give dictionary correlation patterns next to 1to3gram correlation word clouds for the same outcomes. See below for younger, with legend. Particularly if you are more familiar with the LIWC dictionaries, the LIWC clouds give a "shorthand" for the patterns you see in the full 1to3gram wordclouds.

<img src="https://raw.githubusercontent.com/CompPsychology/psych290_images/main/images/tutorial-09/fig13.png" width="1000">

# 💻🤓 FYI

We have put a Powerpoint layouting template for these figures [HERE](https://github.com/CompPsychology/psych290_colab_public/blob/main/notebooks/FIGURE%20TEMPLATE%20-%201to3grams%20and%20topics.pptx) for your use now and forever 💟.

## ‼️ **Save your database and/or output files** ‼️

Let's save all this work as a new database file in your GDrive `sqlite_databases` folder!

In [None]:
database = 'dla_tutorial'

In [None]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# copy the database file to your Drive
!cp -f "sqlite_data/{database}.db" "/content/drive/MyDrive/sqlite_databases/"

print(f"✅ Database '{database}.db' has been copied to your Google Drive.")

We generated a lot of output in this tutorial! Here's how you can save it to your Drive if you want to!

In [None]:
OUTPUT_FOLDER = './output_tutorial_9'  # same folder name we used for all outputs above

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Copy the output folder to your Drive (-r makes it copy the folder and all files/folders inside)
!cp -f -r {OUTPUT_FOLDER} "/content/drive/MyDrive/"

print(f"✅ '{OUTPUT_FOLDER}' has been copied to your Google Drive.")