<a href="https://colab.research.google.com/github/CompPsychology/psych290_colab_public/blob/main/notebooks/week-02/W2_Tutorial_04_DLATK_metaTablesComplexity_(dla_tutorial)_withSolutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# W2 Tutorial 4 - Introduction to Meta Tables (DB: dla_tutorial) (2025-03)

(c) Johannes Eichstaedt & the World Well-Being Project, 2023.

✋🏻✋🏻 NOTE - You need to create a copy of this notebook before you work through it. Click the "Save a copy in Drive" option in the File menu and save it to your Google Drive.

✉️🐞 If you find a bug/something doesn't work, please slack us a screenshot, or email johannes.courses@gmail.com.

## 1) Setting up Colab with DLATK and SQLite

As discussed in the tutorials, we begin by setting up the Colab environment.

This will take ~1.5 to 2 minutes. If Colab warns you that this notebook was not authored by Google, click "Run anyway."

### 1a to 1c) Streamlined: Setting up Colab with DLATK and your data

In [None]:
# assigning the corpus database name
database_name = "tutorial_4"

########### 1a) Install

# installing DLATK and necessary packages
!git clone -b psych290 https://github.com/dlatk/dlatk.git
!pip install -r dlatk/install/requirements.txt
!pip install dlatk/
!pip install wordcloud langid jupysql textstat

########### 1b) Download data and insert into SQLite database

# this downloads the CSVs we need for this tutorial
!git clone https://github.com/CompPsychology/psych290_data.git

# load the required package -- similar to library() function in R
import os
from dlatk.tools.importmethods import csvToSQLite

# store the complete path to the database -- sqlite_data/[database_name].db
database = os.path.join("sqlite_data", database_name)

# import CSVs into tables in this database
msgs = "psych290_data/dla_tutorial/msgs.csv"
csvToSQLite(msgs, database, "msgs")

outcomes = "psych290_data/dla_tutorial/blog_outcomes.csv"
csvToSQLite(outcomes, database, "outcomes")

############# 1c) Setup database connection

# loads the %%sql extension
%load_ext sql

# connects the extension to the database
from sqlalchemy import create_engine
engine = create_engine(f"sqlite:///sqlite_data/{database_name}.db?charset=utf8mb4")
%sql engine

#set the output limit to 50
%config SqlMagic.displaylimit = 50

## PRINT FINISHED
# raw string so the backslash is not treated as an (invalid) escape sequence
print(r" ******* LOAD FINISHED ¯\_(ツ)_/ *******")

Cloning into 'dlatk'...
remote: Enumerating objects: 6991, done.
remote: Counting objects: 100% (1076/1076), done.
remote: Compressing objects: 100% (149/149), done.
remote: Total 6991 (delta 994), reused 935 (delta 927), pack-reused 5915 (from 1)
Receiving objects: 100% (6991/6991), 62.38 MiB | 6.56 MiB/s, done.
Resolving deltas: 100% (4947/4947), done.
Collecting image<=1.5.33 (from -r dlatk/install/requirements.txt (line 1))
  Downloading image-1.5.33.tar.gz (15 kB)
  Preparing metadata (setup.py) ... done
Collecting langid<=1.1.6,>=1.1.4 (from -r dlatk/install/requirements.txt (line 2))
  Downloading langid-1.1.6.tar.gz (1.9 MB)
  Preparing metadata (setup.py) ... done
Collecting mysqlclient<=2.1.1 (from -r dlatk/install/requirements.txt (line 4))
  Downloading mysqlclient-2.1.1.tar.gz (88 kB)

SQL Query: CREATE TABLE msgs (message_id INT, user_id INT, date VARCHAR(31), created_time VARCHAR(31), message LONGTEXT);


Importing data, reading psych290_data/dla_tutorial/msgs.csv file
Reading 10000 rows into the table...
Reading 10000 rows into the table...
Reading 10000 rows into the table...
Reading remaining 1674 rows into the table...


SQL Query: CREATE TABLE outcomes (user_id INT, gender INT, age INT, occu VARCHAR(31), sign VARCHAR(15), is_indunk INT, is_student VARCHAR(7), is_education VARCHAR(7), is_technology VARCHAR(7));


Importing data, reading psych290_data/dla_tutorial/blog_outcomes.csv file
Reading remaining 1000 rows into the table...
 ******* LOAD FINISHED ¯\_(ツ)_/ *******


## 2) Let's extract features

Here we want to extract the `feat$1gram$msgs$user_id` table.

FYI: If you're running this notebook on a server with 🐬🐬🐬 MySQL, you may have already extracted features and don't need to do it again. You can check by listing the tables with the command below to see if they already exist.

In [None]:
database = 'tutorial_4'
msgs_table = 'msgs'

In [None]:
%sqlcmd tables

Name
msgs
outcomes


In [None]:
!dlatkInterface.py \
  --corpdb {database} \
  --corptable {msgs_table} \
  --correl_field user_id \
  --add_ngrams -n 1



TopicExtractor: gensim Mallet wrapper unavailable, using Mallet directly.

-----
DLATK Interface Initiated: 2025-04-09 23:04:14
-----
Connecting to SQLite database: /content/sqlite_data/tutorial_4
query: PRAGMA table_info(msgs)
SQL Query: DROP TABLE IF EXISTS feat$1gram$msgs$user_id
SQL Query: CREATE TABLE feat$1gram$msgs$user_id ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(36), value INTEGER, group_norm DOUBLE)


Creating index correl_field on table:feat$1gram$msgs$user_id, column:group_id 


SQL Query: CREATE INDEX correl_field$1gram$msgs$user_id ON feat$1gram$msgs$user_id (group_id)


Creating index feature on table:feat$1gram$msgs$user_id, column:feat 


SQL Query: CREATE INDEX feature$1gram$msgs$user_id ON feat$1gram$msgs$user_id (feat)
query: PRAGMA table_info(msgs)
SQL Query: DROP TABLE IF EXISTS feat$meta_1gram$msgs$user_id
SQL Query: CREATE TABLE feat$meta_1gram$msgs$user_id ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(16), value INTEGER, group_norm D

## 3) Getting to know the Meta tables

The above command produced 2 tables:
- `feat$1gram$msgs$user_id` - Feature table with feat/word counts grouped by user.

This is the one we were expecting. But it also produced this table with meta-features:

- `feat$meta_1gram$msgs$user_id` - Summary statistics regarding the above table. Note the word 'meta' in table name.

as shown below.

In [None]:
tables = %sqlcmd tables
print(tables)

+------------------------------+
|             Name             |
+------------------------------+
|   feat$1gram$msgs$user_id    |
| feat$meta_1gram$msgs$user_id |
|             msgs             |
|           outcomes           |
+------------------------------+


Now let's check their contents.

In [None]:
meta_1gram_table = "feat$meta_1gram$msgs$user_id"

⚠️ ⚠️ In SQL cells, you need to use double curly braces `{{variable_name}}`. This is because of how `jupysql` interpolates Python variables into your queries.

In [None]:
%%sql

SELECT *
FROM {{meta_1gram_table}}
ORDER BY group_id
LIMIT 10;

id,group_id,feat,value,group_norm
1,28451,_avg1gramLength,3.765490943755958,3.765490943755958
2,28451,_avg1gramsPerMsg,80.6923076923077,80.6923076923077
3,28451,_total1grams,1049.0,1049.0
4,28451,_totalMsgs,13.0,13.0
5,174357,_avg1gramLength,3.7334360554699537,3.7334360554699537
6,174357,_avg1gramsPerMsg,216.33333333333331,216.33333333333331
7,174357,_total1grams,649.0,649.0
8,174357,_totalMsgs,3.0,3.0
9,216833,_avg1gramLength,3.602441072928394,3.602441072928394
10,216833,_avg1gramsPerMsg,252.0675675675676,252.0675675675676


It has 4 rows per `group_id`, which is `user_id` from table `msgs` because we grouped by `user_id` when extracting 1-grams. With 1000 users in our `msgs` table, we get 4000 rows in this meta table.  

For each user we have:
- **_avg1gramLength**  
  `value` -- Average number of characters (i.e. length) in 1-grams extracted for this user. For example the word 'love' has 4 characters.  
  `group_norm` -- Same as `value` (but in `float` data-type) because the group is this user and `value` is for the whole group already.
  
- **_total1grams**  
  `value` -- Total number of 1-grams extracted for this user.  
  `group_norm` -- Same as `value` because the group is this user and `value` is for the whole group already.
  
- **_totalMsgs**  
  `value` -- Total number of messages/blogs posted by this user.  
  `group_norm` -- Same as `value` because the group is the user and `value` is for the whole group already.
  
- **_avg1gramsPerMsg**  
  `value` -- Average number of 1-grams per message/blog posted by this user.  
  `group_norm` -- Same as `value` (but in `float` data-type) because the group is this user and `value` is for the whole group already.
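
To make the four definitions concrete, here is a minimal sketch that computes them in plain Python. It assumes simple whitespace tokenization and a made-up `toy_msgs` list; DLATK uses its own tokenizer, so this illustrates the definitions rather than reproducing DLATK's numbers.

```python
from collections import defaultdict

# Toy data: (user_id, message). Hypothetical, not from the tutorial corpus.
toy_msgs = [
    (1, "i love my dog"),
    (1, "dogs are great"),
    (2, "hello world"),
]

def meta_features(msgs):
    """Return {user_id: {feature_name: value}} using whitespace tokenization."""
    tokens_by_user = defaultdict(list)
    msg_count = defaultdict(int)
    for user_id, text in msgs:
        tokens_by_user[user_id].extend(text.split())
        msg_count[user_id] += 1

    out = {}
    for user_id, tokens in tokens_by_user.items():
        total = len(tokens)
        out[user_id] = {
            "_total1grams": total,
            "_totalMsgs": msg_count[user_id],
            # mean number of characters per token
            "_avg1gramLength": sum(len(t) for t in tokens) / total,
            # mean number of tokens per message
            "_avg1gramsPerMsg": total / msg_count[user_id],
        }
    return out

features = meta_features(toy_msgs)
for user_id, feats in features.items():
    print(user_id, feats)
```

Each user ends up with exactly four entries, which mirrors the four rows per `group_id` in the meta table.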
  
  
Why do we need these features? They can be helpful points of comparison, and good for descriptives -- how many words do users use, how long are they, across how many messages -- etc. We'll ingest them later to create descriptive tables.

The point of this tutorial is to show that these meta tables exist and what's in them. The rest of this section is FYI -- it's so you can track all the details precisely, and reproduce them.

### Let's reproduce what's in the table:

### 3a) average 1 gram length

Now let's try to reproduce the above features, beginning with `_avg1gramLength`, which is the average length of the 1-grams extracted for this user. We can get this by computing the average length of `feat` in the 1gram table `feat$1gram$msgs$user_id`. When computing the average 1-gram length, we need to weight each 1-gram by its frequency (`value`) as well.

This uses the function `LENGTH()` to count the length of a string. `LENGTH('hi')` = 2

FYI: In MySQL, the function 🐬🐬🐬 `char_length()` counts the length of strings.

⚠️ Remember ⚠️ when doing division in SQLite, **always multiply the numerator and denominator by `1.0`** to ensure proper data type casting. `((numerator * 1.0) / (denominator * 1.0))`
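
A quick way to see why this matters is to run both versions against a throwaway in-memory SQLite database. This sketch uses Python's built-in `sqlite3` module rather than the `%%sql` magic, purely so it is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Both operands are integers, so SQLite performs integer division.
int_result = conn.execute("SELECT 7 / 2").fetchone()[0]

# Multiplying by 1.0 turns the operands into floats first.
float_result = conn.execute("SELECT (7 * 1.0) / (2 * 1.0)").fetchone()[0]

print(int_result, float_result)  # 3 3.5
conn.close()
```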


In [None]:
feat_1gram_table = 'feat$1gram$msgs$user_id'

In [None]:
%%sql

DROP TABLE IF EXISTS avg1gramLength;

CREATE TABLE avg1gramLength AS SELECT group_id, ((SUM(LENGTH(feat) * value) * 1.0) / (SUM(value) * 1.0)) AS avg_1gram_len
                                FROM {{feat_1gram_table}}
                                GROUP BY group_id;
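
To see what the frequency weighting does, the sketch below runs the same expression next to a plain `AVG(LENGTH(feat))` on a tiny made-up table (`toy_1gram` is hypothetical). The word 'a' occurs three times and 'love' once, so the weighted mean is pulled toward the shorter word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_1gram (group_id INTEGER, feat TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO toy_1gram VALUES (?, ?, ?)",
    [(1, "a", 3), (1, "love", 1)],
)

weighted, unweighted = conn.execute(
    """
    SELECT (SUM(LENGTH(feat) * value) * 1.0) / (SUM(value) * 1.0),
           AVG(LENGTH(feat))
    FROM toy_1gram
    GROUP BY group_id
    """
).fetchone()

# weighted:   (1*3 + 4*1) / 4 = 1.75
# unweighted: (1 + 4) / 2     = 2.5
print(weighted, unweighted)  # 1.75 2.5
conn.close()
```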

We can merge our result to the meta table and cross-check our result.

In [None]:
meta_1gram_table = 'feat$meta_1gram$msgs$user_id'

In [None]:
%%sql

SELECT a.*, b.value, b.group_norm
FROM avg1gramLength AS a, {{meta_1gram_table}} AS b
WHERE a.group_id = b.group_id AND b.feat = '_avg1gramLength'
ORDER BY group_id
LIMIT 10;

group_id,avg_1gram_len,value,group_norm
28451,3.7283126787416583,3.765490943755958,3.765490943755958
174357,3.7288135593220337,3.7334360554699537,3.7334360554699537
216833,3.5908969066638075,3.602441072928394,3.602441072928394
317581,3.6020974706971005,3.622828285726583,3.622828285726583
446275,4.044959128065395,4.048592188919164,4.048592188919164
450169,3.611111111111111,3.643605870020964,3.643605870020964
468786,3.6770114942528735,3.732183908045977,3.732183908045977
477665,3.598917769228364,3.60382846955053,3.60382846955053
483266,3.6283309957924255,3.64796633941094,3.64796633941094
485307,3.5434782608695654,3.5531400966183573,3.5531400966183573


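The values agree only to about two decimal places. That gap is too large to be floating-point rounding, so DLATK is probably counting characters slightly differently than `LENGTH(feat)` does on the stored features (for example, before tokens are truncated or normalized for storage; this is a guess, not something verified here). Not to worry here.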

### 3b) Total 1grams per group

We've done this before.

Let us reproduce `_total1grams`, which is the total number of 1-grams for each user. We can get this by summing the `value` column in the 1-gram table `feat$1gram$msgs$user_id`, grouped by user.

In [None]:
feat_1gram_table = 'feat$1gram$msgs$user_id'

In [None]:
%%sql

DROP TABLE IF EXISTS total1grams;

CREATE TABLE total1grams AS SELECT group_id, SUM(value) AS n_1grams
                             FROM {{feat_1gram_table}}
                             GROUP BY group_id;

Like last time, we can merge our result to the meta table and cross-check our result.

In [None]:
meta_1gram_table = 'feat$meta_1gram$msgs$user_id'

In [None]:
%%sql

SELECT a.*, b.value, b.group_norm
FROM total1grams AS a, {{meta_1gram_table}} AS b
WHERE a.group_id = b.group_id AND b.feat = '_total1grams'
ORDER BY group_id
LIMIT 10;

group_id,n_1grams,value,group_norm
28451,1049,1049,1049.0
174357,649,649,649.0
216833,55959,55959,55959.0
317581,69703,69703,69703.0
446275,2202,2202,2202.0
450169,954,954,954.0
468786,870,870,870.0
477665,31971,31971,31971.0
483266,713,713,713.0
485307,1656,1656,1656.0


### 3c) Total messages per user

Now, let's reproduce `_totalMsgs` which is the total number of messages/blogs by every user. We can get this by counting the number of entries in table `msgs` grouped by user.

#### 👩‍🔬💻 Exercise

Can you create a new table `totalMsgs` with this information?

In [None]:
msgs_table = 'msgs'

In [None]:
%%sql

DROP TABLE IF EXISTS totalMsgs;

CREATE TABLE totalMsgs AS SELECT user_id, COUNT(*) AS n_msgs
                           FROM {{msgs_table}}
                           GROUP BY user_id;

Then merge the result to the meta table to cross-check our result.

In [None]:
meta_1gram_table = 'feat$meta_1gram$msgs$user_id'

In [None]:
%%sql

SELECT a.*, b.value, b.group_norm
FROM totalMsgs AS a, {{meta_1gram_table}} AS b
WHERE a.user_id = b.group_id AND b.feat = '_totalMsgs'
ORDER BY user_id
LIMIT 10;

user_id,n_msgs,value,group_norm
28451,13,13,13.0
174357,3,3,3.0
216833,222,222,222.0
317581,619,619,619.0
446275,8,8,8.0
450169,14,14,14.0
468786,19,19,19.0
477665,97,97,97.0
483266,6,6,6.0
485307,12,12,12.0


### 3d) Average 1grams / message

Let us reproduce **_avg1gramsPerMsg**, which is the average number of 1-grams per message/blog for each user. We have already created two tables:
- *total1grams* with total number of 1grams grouped by user
- *totalMsgs* with total number of messages grouped by user

The average number of 1-grams per message is simply the ratio:
\begin{equation}
\frac{total1grams.n\_1grams}{totalMsgs.n\_msgs}
\end{equation}

For that we merge the tables _total1grams_ and _totalMsgs_ and get the ratio.

In [None]:
%%sql

DROP TABLE IF EXISTS avg1gramsPerMsg;

CREATE TABLE avg1gramsPerMsg AS SELECT b.user_id, (n_1grams * 1.0)/(n_msgs * 1.0) AS avg_1grams_per_msg
                                FROM total1grams AS a, totalMsgs AS b
                                WHERE a.group_id = b.user_id;
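
The comma-separated `FROM ... WHERE` form above is an implicit inner join. As a sketch, the same ratio can be written with an explicit `INNER JOIN ... ON`, shown here on a throwaway in-memory database seeded with the first two users' counts from the outputs above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE total1grams (group_id INTEGER, n_1grams INTEGER);
    CREATE TABLE totalMsgs (user_id INTEGER, n_msgs INTEGER);
    INSERT INTO total1grams VALUES (28451, 1049), (174357, 649);
    INSERT INTO totalMsgs VALUES (28451, 13), (174357, 3);
    """
)

rows = conn.execute(
    """
    SELECT b.user_id,
           (a.n_1grams * 1.0) / (b.n_msgs * 1.0) AS avg_1grams_per_msg
    FROM total1grams AS a
    INNER JOIN totalMsgs AS b ON a.group_id = b.user_id
    ORDER BY b.user_id
    """
).fetchall()

for user_id, avg in rows:
    print(user_id, round(avg, 2))
conn.close()
```

Both forms return the same rows; the explicit join just keeps the matching condition next to the tables it connects.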

We double-check our result by merging to the meta table.

In [None]:
%%sql

SELECT a.*, b.value, b.group_norm
FROM avg1gramsPerMsg AS a, {{meta_1gram_table}} AS b
WHERE a.user_id = b.group_id AND b.feat = '_avg1gramsPerMsg'
ORDER BY user_id
LIMIT 10;

user_id,avg_1grams_per_msg,value,group_norm
28451,80.6923076923077,80.6923076923077,80.6923076923077
174357,216.33333333333331,216.33333333333331,216.33333333333331
216833,252.0675675675676,252.0675675675676,252.0675675675676
317581,112.60581583198709,112.60581583198709,112.60581583198709
446275,275.25,275.25,275.25
450169,68.14285714285714,68.14285714285714,68.14285714285714
468786,45.78947368421053,45.78947368421053,45.78947368421053
477665,329.5979381443299,329.5979381443299,329.5979381443299
483266,118.83333333333331,118.83333333333331,118.83333333333331
485307,138.0,138.0,138.0


## 4) Computing basic measures of language complexity

With `_avg1gramLength` and `_avg1gramsPerMsg` from the meta table, we already have two markers of language complexity.

Let's compute two other measures of language complexity -

- The Type-Token ratio (in SQL), and the
- Flesch-Kincaid grade level (using DLATK)

### 4a) The way that's too simple: Type-Token ratio

The Type-Token ratio (TTR) is a simple measure of lexical diversity in a corpus: the number of unique words (types) divided by the total number of word occurrences (tokens). A very small Type-Token ratio means that a small number of unique words are repeated many times, whereas a larger ratio means that the corpus contains a large number of unique words, each repeated only a few times.

So, how do we calculate this using SQL? Let's do this step by step. First, let's think about how to get the number of distinct token types in the corpus -- we use the 1gram table `feat$1gram$msgs$user_id` for this, as shown below.

### ⚠️ DISCLAIMER -- longer texts (by orders of magnitude) have lower TTRs -- follows from language statistics. It's a very crude measure.

In [None]:
%%sql

SELECT COUNT(DISTINCT feat) AS types
FROM feat$1gram$msgs$user_id;

types
137687


Now let's get the number of times all these types are instantiated, that is, how many token occurrences we have in the corpus.

We can use either the meta 1gram table `feat$meta_1gram$msgs$user_id` or the 1gram feature table for this. Let's do it both ways.

Using the meta table:

In [None]:
%%sql

SELECT SUM(value)
FROM feat$meta_1gram$msgs$user_id
WHERE feat = '_total1grams';

SUM(value)
7764052


In [None]:
%%sql

SELECT SUM(value)
FROM feat$1gram$msgs$user_id;

SUM(value)
7764052


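They concur! What are the odds! About 7.76 million words.

And finally the type-token ratio with some SQL napkin math. The world's most over-powered calculator! (Remember to multiply the numerator and denominator by `1.0`!) Note that the counts hard-coded in the query below (137592 types and 7772391 tokens) differ slightly from the ones returned above (137687 and 7764052), presumably because they come from a different run; both pairs give a TTR of roughly 0.0177.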

In [None]:
%%sql

SELECT (137592 * 1.0) / (7772391 * 1.0) AS TTR;

TTR
0.0177026606098432
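
Rather than copying the counts by hand, the ratio can also be computed in a single query. The sketch below shows the idea on a small made-up table (`toy_1gram` is hypothetical) with Python's built-in `sqlite3` module; the same `SELECT` should work in a `%%sql` cell against the real 1gram table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_1gram (group_id INTEGER, feat TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO toy_1gram VALUES (?, ?, ?)",
    [(1, "the", 3), (1, "cat", 1), (2, "the", 2), (2, "dog", 2)],
)

# types  = distinct words across all groups = 3 (the, cat, dog)
# tokens = sum of all counts                = 8
ttr = conn.execute(
    "SELECT (COUNT(DISTINCT feat) * 1.0) / (SUM(value) * 1.0) FROM toy_1gram"
).fetchone()[0]

print(ttr)  # 0.375
conn.close()
```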


How do we interpret this? (from https://lexically.net/downloads/version7/HTML/type_token_ratio_proc.html)

#### Problems with TTR

But this type/token ratio (TTR) varies very widely in accordance with the length of the text -- or corpus of texts -- which is being studied. A 1,000 word article might have a TTR of 40%; a shorter one might reach 70%; **4 million words will probably give a type/token ratio of about 2%, and so on**. Such type/token information is rather meaningless in most cases, though it is supplied in a WordList statistics display. The conventional TTR is informative, of course, if you're dealing with a corpus comprising lots of equal-sized text segments (...). But in the real world, especially if your research focus is the text as opposed to the language, you will probably be dealing with texts of different lengths and the conventional TTR will not help you much.
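
The length dependence described above is easy to reproduce. In this minimal sketch (toy sentence, whitespace tokenization, both assumptions for illustration), the vocabulary stays fixed while the text gets longer, so the TTR can only fall.

```python
def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)

sentence = "the cat sat on the mat and the dog sat on the rug".split()

short_text = sentence[:6]    # first six tokens only
long_text = sentence * 100   # same vocabulary, 100 times as many tokens

print(round(type_token_ratio(short_text), 3))
print(round(type_token_ratio(sentence), 3))
print(round(type_token_ratio(long_text), 5))
```

Real text keeps introducing new words as it grows, but more slowly than it adds tokens, which is why very large corpora end up with TTRs of a few percent.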

FYI: We mostly did this as an exercise.

### 4b) The better way: Flesch-Kincaid grade level

The "Flesch–Kincaid Grade Level Formula" presents a score as a U.S. grade level, making it easier for teachers, parents, librarians, and others to judge the readability level of various books and texts. It can also mean the number of years of education generally required to understand this text, relevant when the formula results in a number greater than 10. The grade level is calculated with the following formula:

\begin{equation}
0.39 (\frac{\text{total words}}{\text{total sentences}}) + 11.8 (\frac{\text{total syllables}}{\text{total words}}) - 15.59
\end{equation}
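
As a sketch, the formula translates directly into a small Python function. The counts passed in below are made up for illustration; how words, sentences, and syllables are counted is left open here, and DLATK's own counting may differ.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level from raw counts, following the formula above."""
    return (
        0.39 * (total_words / total_sentences)
        + 11.8 * (total_syllables / total_words)
        - 15.59
    )

# Hypothetical text: 100 words, 5 sentences, 150 syllables.
# 0.39 * 20 + 11.8 * 1.5 - 15.59 = 7.8 + 17.7 - 15.59 = 9.91
grade = flesch_kincaid_grade(total_words=100, total_sentences=5, total_syllables=150)
print(round(grade, 2))  # 9.91
```

Longer sentences and more syllables per word both push the grade level up, which is why the score serves as a rough proxy for reading difficulty.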

DLATK provides a convenient flag to calculate Flesch–Kincaid Grade Level for the text per group (based on the `--correl_field` being either `user_id` or `message_id`).

Let's try to get the score for every user (`user_id`) in the dataset.

In [None]:
database = "tutorial_4"
msgs_table = "msgs"

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --add_fleschkincaid



TopicExtractor: gensim Mallet wrapper unavailable, using Mallet directly.

-----
DLATK Interface Initiated: 2025-04-10 16:26:20
-----
Connecting to SQLite database: /content/sqlite_data/tutorial_4
query: PRAGMA table_info(msgs)
SQL Query: DROP TABLE IF EXISTS feat$flkin$msgs$user_id
SQL Query: CREATE TABLE feat$flkin$msgs$user_id ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(16), value FLOAT, group_norm DOUBLE)


Creating index correl_field on table:feat$flkin$msgs$user_id, column:group_id 


SQL Query: CREATE INDEX correl_field$flkin$msgs$user_id ON feat$flkin$msgs$user_id (group_id)


Creating index feature on table:feat$flkin$msgs$user_id, column:feat 


SQL Query: CREATE INDEX feature$flkin$msgs$user_id ON feat$flkin$msgs$user_id (feat)
finding messages for 1000 'user_id's
         Please check that all messages have a unique message_id, this can significantly impact all downstream analysis
Messages Read: 5k
Messages Read: 10k
Messages Read: 15k
Messages Read: 20k
Mess

This creates a new table named `feat$flkin$msgs$user_id` with a unique `user_id` per row (as `group_id`) and its associated Flesch–Kincaid Grade Level score in the `group_norm` field.

Let's look at the table then.

In [None]:
flkin_table = 'feat$flkin$msgs$user_id'

In [None]:
%%sql

SELECT *
FROM {{flkin_table}}
LIMIT 5;

id,group_id,feat,value,group_norm
1,28451,m_fk_score,4.9,4.9
2,174357,m_fk_score,9.166666666666666,9.166666666666666
3,216833,m_fk_score,4.595945945945946,4.595945945945946
4,317581,m_fk_score,5.358481421647819,5.358481421647819
5,446275,m_fk_score,8.4375,8.4375


This table gives a grade-level score for each user's text (indexed by `group_id`) in the `group_norm` field. The same number also appears in `value`, but when in doubt you should default to reading such output from `group_norm`.

Ok, done with all this meta data stuff. Back to working with data. But first, the homework.