<a href="https://colab.research.google.com/github/CompPsychology/psych290_colab_public/blob/main/notebooks/week-08/W8_Tutorial_12_The_Everything_Tutorial_new.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# W8 Tutorial 12 -- The Everything Tutorial (DB: `dla_tutorial`) (2025-05)

✋🏻✋🏻 NOTE - You need to create a copy of this notebook before you work through it. Click the "Save a copy in Drive" option in the File menu and save it to your Google Drive.

✉️🐞 If you find a bug/something doesn't work, please slack us a screenshot, or email johannes.courses@gmail.com.

(c) Johannes Eichstaedt & the World Well-Being Project, 2023.


This tutorial is designed to do the critical bits of most previous tutorials, so that you can apply them to your own dataset. It is a summary (or cheatsheet) of the important DLATK commands you need for your journey in text analysis. Thus, it's also a way to summarize the entire course. 🙂


## Overview

The tutorial has three main sections.

* **Feature extraction** - which talks about both n-gram and lexicon feature extraction.
* **Feature correlation** - where we correlate the extracted features.
* **Topic modeling** - where we model the topics ourselves and store them as lexicons.

You can access the individual sections and subsections in the table of contents on the left!


💡 **Most important DLATK flags in each section**

* *1. Feature extraction* (`--corpdb`, `--corptable`, `--correl_field`)

    * *1.1. N-gram extraction* (`--add_ngrams -n`, `--feat_occ_filter --set_p_occ`, `--feat_colloc_filter --set_pmi_threshold`, `--combine_feat_tables`)
    
    * *1.2. Topic extraction* (`--add_lex_table -l`, `--weighted_lexicon`)
    
* *2. Feature correlation*

    * *2.1. Correlation with tokens* (`--correlate`, `--outcomes`, `--controls`, `--categories_to_binary`, `--tagcloud --make_wordclouds`)
    
    * *2.2. Correlation with topic features* (`--topic_tagcloud --make_topic_wordclouds`, `--topic_lexicon`)
    
* *3. Topic modeling* (`--estimate_lda_topics`, `--num_stopwords`, `--num_topics`, `--lda_alpha`, `--lexicondb`, `--lda_lexicon_name`, `--make_all_topic_wordclouds`)

First, let's set up Colab as usual!

## Setting up Colab with DLATK and SQLite

### a) Install DLATK

☝🏻 Don't forget to click `Restart` if it prompts you to!

In [None]:
# installing DLATK and necessary packages
!git clone -b psych290 https://github.com/dlatk/dlatk.git
!pip install -r dlatk/install/requirements.txt
!pip install dlatk/
!pip install wordcloud langid jupysql gensim==4.3

### b) Mount Google Drive and copy databases

💡 **Uploading your own data**: if you're working with new data (CSVs), see how to upload them [with DLATK in Tutorial 5B](https://github.com/CompPsychology/psych290_colab_public/blob/main/notebooks/week-03/W3_Tutorial_05B_mini_tutorial_saving_SQLite_in_GoogleDrive_(dla_tutorial).ipynb) or [with R in Tutorial 7](https://github.com/CompPsychology/psych290_colab_public/blob/main/notebooks/week-05/W5_Tutorial_07_R_dataImport_metaTablePlots_(csv).ipynb)!

For now, `dla_tutorial` should be saved in your Google Drive already 😎

In [None]:
database = "dla_tutorial"

In [None]:
# Mount Google Drive & copy to Colab

# connects & mounts your Google Drive to this colab space
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

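# makes sure the local sqlite_data folder exists (does nothing if it already does)
!mkdir -p sqlite_data

# this copies dlatk_lexica.db from your Google Drive to Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/dlatk_lexica.db" "sqlite_data"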

# this copies {database}.db from your Google Drive to Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/{database}.db" "sqlite_data"

### c) Set up database connection

In [None]:
# loads the %%sql extension
%load_ext sql

# connects the extension to the database - mounts both databases as engines
from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")
dlatk_lexica_engine = create_engine(f"sqlite:///sqlite_data/dlatk_lexica.db?charset=utf8mb4")

# attaches the dlatk_lexica.db so tutorial_db_engine can query both databases
from IPython import get_ipython
from sqlalchemy import event

# auto‑attach the lexica db whenever tutorial_db_engine connects
@event.listens_for(tutorial_db_engine, "connect")
def _attach_lexica(dbapi_conn, connection_record):
    dbapi_conn.execute("ATTACH DATABASE 'sqlite_data/dlatk_lexica.db' AS dlatk_lexica;")

%sql tutorial_db_engine

# set the output limit to 50
%config SqlMagic.displaylimit = 50
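If you're curious what the `ATTACH DATABASE` statement in the cell above does, here is a minimal sketch using Python's built-in `sqlite3` module. The file names, the `toy_lexicon` table, and its contents are made up for illustration; only the `dlatk_lexica` alias mirrors the setup above.

```python
import os
import sqlite3
import tempfile

# Toy stand-ins for dla_tutorial.db and dlatk_lexica.db (hypothetical files and contents).
tmp_dir = tempfile.mkdtemp()
main_path = os.path.join(tmp_dir, "toy_main.db")
lexica_path = os.path.join(tmp_dir, "toy_lexica.db")

# Build the toy lexica database.
lex_conn = sqlite3.connect(lexica_path)
lex_conn.execute("CREATE TABLE toy_lexicon (term TEXT, category TEXT)")
lex_conn.execute("INSERT INTO toy_lexicon VALUES ('happy', 'POSEMO')")
lex_conn.commit()
lex_conn.close()

# Build the toy main database.
conn = sqlite3.connect(main_path)
conn.execute("CREATE TABLE msgs (message_id INTEGER, message TEXT)")
conn.execute("INSERT INTO msgs VALUES (1, 'so happy today')")
conn.commit()  # ATTACH is not allowed inside an open transaction

# ATTACH makes the second file queryable under the alias `dlatk_lexica`.
conn.execute("ATTACH DATABASE ? AS dlatk_lexica", (lexica_path,))

attached = [row[1] for row in conn.execute("PRAGMA database_list")]
lexicon_rows = conn.execute(
    "SELECT term, category FROM dlatk_lexica.toy_lexicon"
).fetchall()
print(attached)
print(lexicon_rows)
conn.close()
```

After attaching, tables in the second file are addressed as `dlatk_lexica.<table>`, while tables in the main database need no prefix.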

### d) If you have a **"database lock"** problem

If you face a "database locked" issue:
  1. **restart the session** (Runtime ==> Restart Session)
  2. run this cell to get set back up!

In [None]:
database = "dla_tutorial" # or whichever database you are working with!! (e.g., 'svitlana', 'dla_tutorial', etc)

%reload_ext sql

from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")
dlatk_lexica_engine = create_engine(f"sqlite:///sqlite_data/dlatk_lexica.db?charset=utf8mb4")

# set the output limit to 50
%config SqlMagic.displaylimit = 50

from IPython import get_ipython
from sqlalchemy import event

# auto‑attach the lexica db whenever tutorial_db_engine connects
@event.listens_for(tutorial_db_engine, "connect")
def _attach_lexica(dbapi_conn, connection_record):
    dbapi_conn.execute("ATTACH DATABASE 'sqlite_data/dlatk_lexica.db' AS dlatk_lexica;")

%sql tutorial_db_engine

Now that we have everything set up, let's jump right into the tutorial.



## 1) Feature extraction

For any text analysis task, DLATK requires three flags:

* **`--corpdb`** - which specifies the database to work with
* **`--corptable`** - which specifies the message table that contains the text corpus
* **`--correl_field`** - which picks our language unit of analysis, say `user_id`, `message_id`, etc.

Broadly, feature extraction represents each unit of analysis either as the relative frequencies of its tokens (n-gram extraction) or as its relative use of lexicon categories (topic extraction, or any other dictionary for that matter).

When no flags are specified, DLATK simply lists the available flags and their usage, as shown below.

In [None]:
!dlatkInterface.py
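Since every command in this tutorial starts with the same three flags, a small helper can assemble them for you. This is only a convenience sketch: `build_dlatk_command` is a hypothetical function and not part of DLATK. It uses only flags that appear in this tutorial and returns a string rather than running anything.

```python
import shlex

def build_dlatk_command(database, table, correl_field, extra_args=()):
    """Assemble a dlatkInterface.py call that starts with the three mandatory flags."""
    args = [
        "dlatkInterface.py",
        "--corpdb", database,
        "--corptable", table,
        "--correl_field", correl_field,
        *extra_args,
    ]
    # shlex.join quotes any argument containing shell-special characters such as `$`.
    return shlex.join(args)

command = build_dlatk_command(
    "dla_tutorial", "msgs", "user_id",
    extra_args=["--feat_table", "feat$1gram$msgs$user_id"],
)
print(command)
```

Note how the feature table name ends up in single quotes: it contains `$`, which the shell would otherwise try to expand. This is why the commands in this tutorial wrap `--feat_table` values in quotes.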

### 1.1. N-gram extraction <a name="sec1.1"></a>

One method of feature extraction is the "bag-of-words" model, in which we represent every unit of analysis by the token types it contains and their relative frequencies. This is done with the **`--add_ngrams -n`** flags, as shown below.

In [None]:
database = "dla_tutorial"
msgs_table = "msgs"
correl_field = "user_id"
gft = 500

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field {correl_field} \
    --group_freq_thresh {gft} \
    --add_ngrams -n 1
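To make the bag-of-words idea concrete, here is a toy sketch of what 1-gram extraction computes for each group: token counts and relative frequencies. The messages are invented, and splitting on whitespace is a simplification; DLATK uses its own tokenizer and writes the results to a feature table instead.

```python
from collections import Counter, defaultdict

# Invented (user_id, message) pairs standing in for the msgs table.
messages = [
    ("u1", "i love love pizza"),
    ("u1", "pizza tonight"),
    ("u2", "i love my dog"),
]

# Count tokens per user (the unit of analysis chosen with --correl_field).
counts = defaultdict(Counter)
for user_id, message in messages:
    counts[user_id].update(message.split())  # whitespace tokenization (simplification)

# Convert counts to relative frequencies: count / total tokens for that user.
rel_freqs = {}
for user_id, counter in counts.items():
    total = sum(counter.values())
    rel_freqs[user_id] = {token: n / total for token, n in counter.items()}

print(rel_freqs["u1"])
print(rel_freqs["u2"])
```

Each user's relative frequencies sum to 1, which is what makes users who wrote very different amounts of text comparable.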

Not all of the extracted tokens (1-grams) are informative. Some are used by only a few groups and don't represent the population. We can remove such rare features using the flags **`--feat_occ_filter --set_p_occ`**.

In [None]:
database = "dla_tutorial"
msgs_table = "msgs"
correl_field = "user_id"

feat_1gram_user = "feat$1gram$msgs$user_id"

gft = 500
occ_threshold = 0.07

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field {correl_field} \
    --group_freq_thresh {gft} \
    --feat_table '{feat_1gram_user}' \
    --feat_occ_filter --set_p_occ {occ_threshold}
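The idea behind the occurrence filter can be sketched in a few lines: count how many groups use each feature and keep only features used by at least a given proportion of groups. The data and the `occurrence_filter` function are illustrative, and whether DLATK applies the threshold inclusively is an assumption here.

```python
def occurrence_filter(group_feats, p_occ):
    """Keep features used by at least `p_occ` (a proportion) of all groups."""
    n_groups = len(group_feats)
    groups_per_feat = {}
    for feats in group_feats.values():
        for feat in set(feats):  # count each group at most once per feature
            groups_per_feat[feat] = groups_per_feat.get(feat, 0) + 1
    return {feat for feat, n in groups_per_feat.items() if n / n_groups >= p_occ}

# Invented groups and their tokens.
toy_groups = {
    "u1": ["i", "love", "pizza"],
    "u2": ["i", "love", "dogs"],
    "u3": ["i", "zzzz"],
    "u4": ["i", "love"],
}

kept = occurrence_filter(toy_groups, p_occ=0.5)
print(sorted(kept))
```

With `p_occ=0.5`, only tokens used by at least half of the four toy users survive; tokens used by a single user are dropped.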

In addition to 1-grams, we can also extract bi-grams and tri-grams using `--add_ngrams -n` and retain the significant ones among them using **`--feat_colloc_filter --set_pmi_threshold`**. We do this and combine all the resulting feature tables into one giant table using the **`--combine_feat_tables`** flag for further analysis.

⏰ This takes about 25 minutes!

In [None]:
database = "dla_tutorial"
msgs_table = "msgs"
correl_field = "user_id"

gft = 500
occ_threshold = 0.05
pmi_threshold = 3

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field {correl_field} \
    --group_freq_thresh {gft} \
    --add_ngrams -n 1 2 3 \
    --combine_feat_tables 1to3gram \
    --feat_occ_filter --set_p_occ {occ_threshold} \
    --feat_colloc_filter --set_pmi_threshold {pmi_threshold}
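The collocation filter is based on pointwise mutual information (PMI), which asks whether two words occur together more often than their individual frequencies would predict. Below is a minimal sketch of textbook PMI for bigrams on an invented token sequence. DLATK's exact formula (log base, scaling for longer n-grams) may differ, and the threshold of 2.0 is chosen only to suit this toy data.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Textbook PMI for each adjacent word pair: log(p(w1, w2) / (p(w1) * p(w2)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (w1, w2), n in bigrams.items():
        p_joint = n / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        pmi[(w1, w2)] = math.log(p_joint / p_independent)
    return pmi

tokens = "new york is big and new york is loud and the dog is big".split()
pmi = bigram_pmi(tokens)

toy_threshold = 2.0  # illustrative value for this toy data only
kept = {bigram for bigram, score in pmi.items() if score >= toy_threshold}

print(round(pmi[("new", "york")], 2), round(pmi[("is", "big")], 2))
print(("new", "york") in kept, ("is", "big") in kept)
```

"new york" scores higher than "is big" because "is" is frequent on its own, so seeing it next to "big" is less surprising. Very rare pairs can also score high, which is one reason to combine the PMI filter with the occurrence filter.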

### 1.2. Topic extraction <a name="sec1.2"></a>

Another method of feature extraction uses lexicons, either weighted ones like topics or unweighted ones like `LIWC2015`. DLATK does this with the **`--add_lex_table -l`** flags. For weighted lexicons, we also add **`--weighted_lexicon`**.

We will demonstrate this using 500 topics modeled on Facebook posts (`fb22_all_500t_cp`).

In [None]:
database = "dla_tutorial"
msgs_table = "msgs"
correl_field = "user_id"

gft = 500

topics_cp_table = "fb22_all_500t_cp"

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field {correl_field} \
    --group_freq_thresh {gft} \
    --add_lex_table -l {topics_cp_table} \
    --weighted_lexicon
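Conceptually, a weighted lexicon score is a weighted sum: each word's relative frequency is multiplied by its weight in a category, and the products are added up. The sketch below shows this for one invented user and two invented topics; the numbers and the `lexicon_scores` helper are hypothetical, and DLATK's implementation details may differ.

```python
# One user's 1-gram relative frequencies (invented).
rel_freq = {"happy": 0.10, "birthday": 0.05, "work": 0.20}

# A tiny weighted lexicon: category -> {term: weight} (invented).
weighted_lexicon = {
    "topic_1": {"happy": 0.6, "birthday": 0.4},
    "topic_2": {"work": 0.9, "happy": 0.1},
}

def lexicon_scores(rel_freq, lexicon):
    """Score each category as the sum of weight * relative frequency over its terms."""
    return {
        category: sum(weight * rel_freq.get(term, 0.0) for term, weight in terms.items())
        for category, terms in lexicon.items()
    }

scores = lexicon_scores(rel_freq, weighted_lexicon)
print(scores)
```

For an unweighted lexicon, every weight is effectively 1, so the score reduces to the summed relative frequency of the category's words.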

## 2) Feature correlation <a name="sec2"></a>

Once we have extracted the features using the methods above, the next step is to correlate them against the outcomes of interest. These are the flags we need:

* **`--correlate`** - tells DLATK to run a correlation.
* **`--outcomes`** - takes a list of outcomes to correlate against.

And some optional flags:

* **`--controls`** - takes a list of variables to be controlled for.
* **`--categories_to_binary`** - tells DLATK to one-hot encode a categorical variable.

We can augment this correlation by producing wordclouds of the associated features using the **`--tagcloud --make_wordclouds`** flags.

### 2.1. Correlation with tokens <a name="sec2.1"></a>

We now correlate the filtered 1-gram tokens against `age`, controlling for `gender`, and generate the corresponding wordclouds.

In [None]:
database = "dla_tutorial"
msgs_table = "msgs"
correl_field = "user_id"
gft = 500

feat_1gram_occ07_user = "feat$1gram$msgs$user_id$0_07"

outcomes_table = "outcomes"

OUTPUT_NAME = "1grams_age_CTRL_gender"
OUTPUT_FOLDER = "./outputs_tutorial_12"

!mkdir -p {OUTPUT_FOLDER}

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field {correl_field} \
    --group_freq_thresh {gft} \
    --correlate \
    --rmatrix --csv \
    --feat_table '{feat_1gram_occ07_user}' \
    --outcome_table {outcomes_table} \
    --outcomes age \
    --controls gender \
    --categories_to_binary gender \
    --tagcloud --make_wordclouds \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME} > {OUTPUT_FOLDER}/{OUTPUT_NAME}_logs.txt

💡 Wordclouds and correlation matrix are located in `./outputs_tutorial_12`!
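To build intuition for "correlating against `age` controlling for `gender`", here is a small sketch with invented data. It one-hot encodes gender (the idea behind `--categories_to_binary`), removes the part of both variables that gender explains, and correlates what is left (a partial correlation). This is a textbook illustration, not DLATK's code; DLATK's reported coefficients may be estimated differently.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def residualize(y, z):
    """Residuals of y after a simple linear regression of y on z."""
    mean_y, mean_z = sum(y) / len(y), sum(z) / len(z)
    slope = (sum((a - mean_z) * (b - mean_y) for a, b in zip(z, y))
             / sum((a - mean_z) ** 2 for a in z))
    return [b - (mean_y + slope * (a - mean_z)) for a, b in zip(z, y)]

# Invented data: six users, their age, gender, and relative frequency of one token.
gender = ["f", "m", "f", "m", "f", "m"]
age = [18, 22, 25, 31, 33, 40]
homework = [0.030, 0.020, 0.022, 0.012, 0.015, 0.004]

# One-hot encode the categorical control (two categories need one binary column).
is_female = [1 if g == "f" else 0 for g in gender]

raw_r = pearson(homework, age)
partial_r = pearson(residualize(homework, is_female), residualize(age, is_female))
print(round(raw_r, 3), round(partial_r, 3))
```

Comparing `raw_r` with `partial_r` shows how much of the association between the token and age remains once gender differences are accounted for.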

FYI: similar to 1gram correlations we can also correlate the lexicon features. For this we just need to pass the right feature table to the **`--feat_table`** flag.

### 2.2. Correlation with topic features <a name="sec2.2"></a>

We can also correlate topic features against outcomes. To generate the wordclouds, however, we need a different set of flags, namely **`--topic_tagcloud --make_topic_wordclouds`**, which pick the words from the topic lexicon (passed using **`--topic_lexicon`**).

**FYI:** Topic lexicon takes in the frequency table, say `fb22_all_500t_freq`.

We will correlate the 500 FB topics features against `age` controlling for `gender`, and also generate the wordclouds.

In [None]:
database = "dla_tutorial"
msgs_table = "msgs"
correl_field = "user_id"
gft = 500

feat_topics_user = "feat$cat_fb22_all_500t_cp_w$msgs$user_id$1gra"

outcomes_table = "outcomes"

topic_lexicon = 'fb22_all_500t_freq'

OUTPUT_FOLDER = "outputs_tutorial_12"
OUTPUT_NAME = "fb500_age_CTRL_gender"

!mkdir -p {OUTPUT_FOLDER}

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field {correl_field} \
    --correlate \
    --rmatrix --csv --sort \
    --group_freq_thresh {gft} \
    --outcome_table {outcomes_table} \
    --outcomes age \
    --controls gender \
    --categories_to_binary gender \
    --feat_table '{feat_topics_user}' \
    --topic_tagcloud --make_topic_wordclouds \
    --topic_lexicon {topic_lexicon} \
    --tagcloud_colorscheme blue \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME} > {OUTPUT_FOLDER}/{OUTPUT_NAME}_logs.txt

💡 Topic wordclouds and correlations are located in `./outputs_tutorial_12`!

Pro tip: you can check for r/beta values in the `wordcloud.png` names or in the `fb500_age_CTRL_gender.csv`!

## 3) Topic modeling

DLATK needs a 1-gram *message*-level feature table to train the topic model. So the first step is to extract 1-gram features grouped by `message_id` (rather than by `user_id`, as we did above), with `--group_freq_thresh` set to 0 so that all documents are included.

In [None]:
database = "dla_tutorial"
msgs_table = "msgs"
correl_field = "message_id"
gft = 0

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field {correl_field} \
    --group_freq_thresh {gft} \
    --add_ngrams -n 1

Then we pick the number of stopwords to ignore.

In [None]:
database = "dla_tutorial"
feat_1gram_msg = "feat$1gram$msgs$message_id"
N = 200

Let's extend the `%sql` row display limit to 200.

In [None]:
# sets the output limit
%config SqlMagic.displaylimit = 200

In [None]:
%%sql

SELECT ROW_NUMBER() OVER (ORDER BY n_occ DESC) AS n, feat, n_occ
FROM (
  SELECT feat, SUM(value) AS n_occ
  FROM {{feat_1gram_msg}}
  GROUP BY feat
  ORDER BY n_occ DESC
  LIMIT {{N}}) AS a;

Based on the list above, we'll remove the top-N words as stop-words when we model the topics below.
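To see on a small scale what the query above computes, here is a sketch that runs the same aggregation with Python's built-in `sqlite3` module on an invented in-memory feature table. The rows are made up, and the numbering column from the query above is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A toy message-level 1-gram table (the name contains `$`, so we quote it in SQL).
conn.execute(
    'CREATE TABLE "feat$1gram$msgs$message_id" (group_id INTEGER, feat TEXT, value INTEGER)'
)
rows = [
    (1, "the", 3), (1, "cat", 1),
    (2, "the", 2), (2, "a", 2),
    (3, "a", 1), (3, "dog", 1),
]
conn.executemany('INSERT INTO "feat$1gram$msgs$message_id" VALUES (?, ?, ?)', rows)

N = 2  # number of most frequent words to treat as stopwords in this toy example
top = conn.execute(
    'SELECT feat, SUM(value) AS n_occ '
    'FROM "feat$1gram$msgs$message_id" '
    'GROUP BY feat ORDER BY n_occ DESC LIMIT ?',
    (N,),
).fetchall()

stopwords = [feat for feat, _ in top]
print(top)
print(stopwords)
conn.close()
```

Summing `value` across all messages gives each word's total count, and the most frequent words are the stopword candidates.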

The topic modeling is done in one DLATK command around the flag **`--estimate_lda_topics`**. Before we continue, let's see what the new parameters mean.

**`--estimate_lda_topics`** - tells DLATK to model new topics using the 1gram feature table.

**`--num_stopwords`** - Specifies the number of stopwords (50-150 are sensible).

**`--num_topics`** - Specifies the number of topics to extract. As a rule of thumb, you need 50-100 documents per topic; `dla_tutorial` has 30k messages, so 300-600 topics is the upper limit.

**`--lda_alpha`** - Sets how many topics you expect per document. 5 is a good default, and values of roughly 2-5 seem sensible: the longer the document, the larger the value. For sentence-length documents such as tweets, 2 works well.

**`--lexicondb`** - Database where the topic tables should be stored.

**`--lda_lexicon_name`** - If we set the lexicon name as NAME we get topic tables as NAME_cp and NAME_freq in the end.

**`--mallet_path`** - This is needed to specify the path to Mallet which implements Topic Modeling inside DLATK.
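The documents-per-topic rule of thumb for `--num_topics` is quick arithmetic that you can redo for your own dataset. The `topic_range` helper below is hypothetical and simply encodes the 50-100 documents per topic guideline from above.

```python
def topic_range(n_documents, min_docs_per_topic=50, max_docs_per_topic=100):
    """Return the (conservative, maximum) number of topics a corpus can support."""
    conservative = n_documents // max_docs_per_topic  # 100 documents per topic
    maximum = n_documents // min_docs_per_topic       # 50 documents per topic
    return conservative, maximum

low, high = topic_range(30_000)  # dla_tutorial has about 30k messages
print(low, high)
```

Treat the result as an upper bound rather than a target: fewer topics are often easier to interpret.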

Now that we know the parameters, let's drop 200 stop words, model 100 topics, and set the alpha (topics per document) to 5.

In [None]:
database = 'dla_tutorial'
msgs_table = 'msgs'
feat_1gram_message = 'feat$1gram$msgs$message_id'

num_stopwords = 200
num_topics = 100
alpha = 5
lexicon_database = 'dla_tutorial'

# underscores must match the topic table name used in the wordcloud step (dlat_100_a5_s200_...)
lexicon_name = f'dlat_{num_topics}_a{alpha}_s{num_stopwords}'

OUTPUT_FOLDER = './outputs_tutorial_12'
!mkdir -p {OUTPUT_FOLDER}

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field message_id \
    --feat_table '{feat_1gram_message}' \
    --estimate_lda_topics \
    --num_stopwords {num_stopwords} \
    --num_topics {num_topics} \
    --lda_alpha {alpha} \
    --lexicondb {lexicon_database} \
    --save_lda_files {OUTPUT_FOLDER} \
    --lda_lexicon_name {lexicon_name} \
    --mallet_path /opt/mallet/bin/mallet

And we print the topic wordclouds using the command below that uses the **`--make_all_topic_wordclouds`** flag.

In [None]:
database = "dla_tutorial"
msgs_table = "msgs"

lexicon_database = "dla_tutorial"
topics_freq_table = 'dlat_100_a5_s200_freq_t50ll'

OUTPUT_FOLDER = './outputs_tutorial_12'
OUTPUT_NAME = 'dlat_topic_wordclouds'

!mkdir -p {OUTPUT_FOLDER}

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --topic_lexicon {topics_freq_table} \
    --lexicondb {lexicon_database} \
    --make_all_topic_wordclouds \
    --tagcloud_colorscheme blue \
    --output_name {OUTPUT_FOLDER}/{OUTPUT_NAME}

## References <a name="ref"></a>

* [DLATK](https://aclanthology.org/D17-2010.pdf) - publicly available at https://github.com/dlatk/dlatk.
* [Closed and Open Vocabulary Approaches to Text Analysis: A Review, Quantitative Comparison, and Recommendations](https://jeichstaedt.com/s/2021psychMethods.pdf)
* [Topic Modeling - Latent Dirichlet Allocation](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)

## ‼️ **Save your database and/or output files** ‼️

Let's save all this work as a database file in your GDrive `sqlite_databases` folder!

In [None]:
database = 'dla_tutorial'

In [None]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# copy the database file to your Drive
!cp -f "sqlite_data/{database}.db" "/content/drive/MyDrive/sqlite_databases/"

print(f"✅ Database '{database}.db' has been copied to your Google Drive.")

Now let's save the output (wordclouds and files) in this tutorial! Here's how you can save it to your Drive (if you want to)!

In [None]:
OUTPUT_FOLDER = './outputs_tutorial_12'

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Copy the output folder to your Drive (-r makes it copy the folder and all files/folders inside)
!cp -f -r {OUTPUT_FOLDER} "/content/drive/MyDrive/"

print(f"✅ '{OUTPUT_FOLDER}' has been copied to your Google Drive.")
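If you prefer a copy step that fails loudly when the folder name is wrong, here is an optional sketch in plain Python. `copy_folder_to` is a hypothetical helper, and the demonstration uses temporary folders as stand-ins for the Colab workspace and Google Drive; on Colab you would pass your real output folder and `/content/drive/MyDrive/` instead.

```python
import os
import shutil
import tempfile

def copy_folder_to(src_folder, dest_parent):
    """Copy src_folder (and everything in it) into dest_parent; fail loudly if it is missing."""
    if not os.path.isdir(src_folder):
        raise FileNotFoundError(f"Nothing to copy: '{src_folder}' does not exist.")
    dest = os.path.join(dest_parent, os.path.basename(os.path.normpath(src_folder)))
    shutil.copytree(src_folder, dest, dirs_exist_ok=True)  # needs Python 3.8+
    return dest

# Toy demonstration with temporary folders standing in for Colab and Drive.
workspace = tempfile.mkdtemp()
fake_drive = tempfile.mkdtemp()
outputs = os.path.join(workspace, "outputs_tutorial_12")
os.makedirs(outputs)
with open(os.path.join(outputs, "example.csv"), "w") as handle:
    handle.write("feat,r\n")

copied_to = copy_folder_to(outputs, fake_drive)
print(os.listdir(copied_to))
```

Checking that the source folder exists first turns a silent typo in the folder name into an explicit error message.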