<a href="https://colab.research.google.com/github/CompPsychology/psych290_colab_public/blob/main/notebooks/week-03/W3_Tutorial_05_DLATK_lexiconExtraction_mostFreqWords_(dla_tutorial).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# W3 Tutorial 5 -- Lexicon tables & feature extraction (DB: dla_tutorial) (2025-03)

(c) Johannes Eichstaedt & the World Well-Being Project, 2023.

✋🏻✋🏻 NOTE - You need to create a copy of this notebook before you work through it. Click the "Save a copy in Drive" option in the File menu, and save it to your Google Drive.

✉️🐞 If you find a bug/something doesn't work, please slack us a screenshot, or email johannes.courses@gmail.com.

## 1) Setting up Colab with DLATK and SQLite

If Colab asks you about this not being authored by Google, say "Run anyway."

### 1a to 1c) Streamlined: Setting up Colab with DLATK and your data

In [None]:
# assigning the database name
database = "dla_tutorial"

In [None]:
########### 1a) Install

# installing DLATK and necessary packages
!git clone -b psych290 https://github.com/dlatk/dlatk.git
!pip install -r dlatk/install/requirements.txt
!pip install dlatk/
!pip install wordcloud langid jupysql

########### 1b) Download data and insert into SQLite database

# this downloads the CSVs we need for this tutorial
!git clone https://github.com/CompPsychology/psych290_data.git

# load the required package -- similar to library() function in R
import os
from dlatk.tools.importmethods import csvToSQLite

# store the complete path to the database -- sqlite_data/[database_name].db
database_path = os.path.join("sqlite_data", database)

msgs = "psych290_data/dla_tutorial/msgs.csv"
csvToSQLite(msgs, database_path, "msgs")

outcomes = "psych290_data/dla_tutorial/blog_outcomes.csv"
csvToSQLite(outcomes, database_path, "outcomes")

############# 1c) Setup database connection

# loads the %%sql extension
%load_ext sql

# connects the extension to the database
from sqlalchemy import create_engine
tutorial_db_engine =  create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")
%sql tutorial_db_engine

# set the output limit to 50
%config SqlMagic.displaylimit = 50

## PRINT FINISHED
print(r" ******* LOAD FINISHED ¯\_(ツ)_/¯ *******")

### 1d) Get **`dlatk_lexica.db`** in your Google Drive!


Some dictionaries are not stored in the GitHub download of DLATK because they are not publicly shared.

So, you'll have to upload a database with it to Colab when you want to work with it. We can store it in your Google Drive and access it easily.

**Follow these steps** 📋 (*only need to do this once!*)

1.  In your Google Drive make a folder called `sqlite_databases`
2.  Open [this shared folder](https://drive.google.com/drive/folders/1nxX0Qf6vd1hnNX9ywqwVsvNLoYWo62zA). Please feel free to request access.
3.  Make a copy of `dlatk_lexica.db` and put it in `sqlite_databases` in your Google Drive, so `MyDrive/sqlite_databases`.


⚠️ Make sure the file is `dlatk_lexica.db` and stored in `sqlite_databases` -- that's where the code below expects it.

### 1e) Mount Google Drive & configure the database in SQLite

Now that you have the right copy of dlatk_lexica.db stored in your Google Drive, let's connect your Drive to this Colab!

Google will ask you to allow this notebook to access your Drive--click yes and follow prompts to login and allow.

In [None]:
# mount Google Drive &  copy database to Colab

# this connects & mounts your Google Drive to this colab space
from google.colab import drive
drive.mount('/content/drive')

# this copies dlatk_lexica.db from your Google Drive to Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/dlatk_lexica.db" "sqlite_data"

This code block below enables SQLite to use `dlatk_lexica.db` and `dla_tutorial.db` in the same SQL connection!

In [None]:
# attaches the dlatk_lexica.db so tutorial_db_engine can query both databases

from IPython import get_ipython
from sqlalchemy import event

# auto‑attach the lexica db whenever tutorial_db_engine connects
@event.listens_for(tutorial_db_engine, "connect")
def _attach_lexica(dbapi_conn, connection_record):
    dbapi_conn.execute("ATTACH DATABASE 'sqlite_data/dlatk_lexica.db' AS dlatk_lexica;")

‼️ Note: In mini tutorial 5B, we go in depth on how we use Google Drive to store databases.
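
If you'd like to see what `ATTACH DATABASE` does in isolation, here is a minimal sketch using Python's built-in `sqlite3` module. The two in-memory databases and the `toy_lexicon` table are made up for illustration; they only stand in for `dla_tutorial.db` and `dlatk_lexica.db`.

```python
import sqlite3

# Two throwaway in-memory databases stand in for dla_tutorial.db and dlatk_lexica.db.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS dlatk_lexica")

# One table in the main database and one in the attached database (toy rows).
conn.execute("CREATE TABLE msgs (message_id INTEGER, message TEXT)")
conn.execute("CREATE TABLE dlatk_lexica.toy_lexicon (term TEXT, category TEXT)")
conn.execute("INSERT INTO msgs VALUES (1, 'love this')")
conn.execute("INSERT INTO dlatk_lexica.toy_lexicon VALUES ('love', 'POSEMO')")

# PRAGMA database_list shows every database visible on this connection.
attached = [row[1] for row in conn.execute("PRAGMA database_list")]

# The attached table is reachable through its schema-qualified name.
lexicon_rows = conn.execute(
    "SELECT term, category FROM dlatk_lexica.toy_lexicon"
).fetchall()
print(attached, lexicon_rows)
```

The schema prefix (`dlatk_lexica.`) is what lets one connection tell the two databases apart.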

### (ONLY IF NEEDED: SOFT RELOAD) **If you have a "database lock" problem**

First, go to Runtime => Restart Session. Wait for that to complete. Your colab files will be preserved during this, including the DLATK install you did earlier.

Second, execute this cell:

In [None]:
# If you face a "database locked" issue, restart the session & run this cell to get set back up!

database = "dla_tutorial"

%reload_ext sql

from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")
dlatk_lexica_engine = create_engine(f"sqlite:///sqlite_data/dlatk_lexica.db?charset=utf8mb4")

# set the output limit to 50
%config SqlMagic.displaylimit = 50

from IPython import get_ipython
from sqlalchemy import event

# auto‑attach the lexica db whenever tutorial_db_engine connects
@event.listens_for(tutorial_db_engine, "connect")
def _attach_lexica(dbapi_conn, connection_record):
    dbapi_conn.execute("ATTACH DATABASE 'sqlite_data/dlatk_lexica.db' AS dlatk_lexica;")

%sql tutorial_db_engine

## 2) Re-extract features

Let's quickly re-extract 1grams from the `dla_tutorial` messages table!

This will take 2.5ish minutes. Go make yourself a tea?

In [None]:
#database = "dla_tutorial" #we set that at the top! you can set it again if you like...
msgs_table = "msgs"

In [None]:
!dlatkInterface.py \
  --corpdb {database} \
  --corptable {msgs_table} \
  --correl_field user_id \
  --add_ngrams -n 1

Good!

Let's extract our first dictionaries (which needs the 1gram table in place -- that's why we re-extracted). But before we do that, let's learn about dictionaries (= lexicons).

## 3) Getting to know dictionary (= lexicon) tables

We will use the `LIWC2015` set of dictionaries, which are stored in the database `dlatk_lexica`.

Now we set up a connection between SQLite and `dlatk_lexica.db`.

In [None]:
# connection between sqlite and dlatk_lexica
dlatk_lexica_engine = create_engine(f"sqlite:///sqlite_data/dlatk_lexica.db?charset=utf8mb4")

In [None]:
# activates dlatk_lexica db
%sql dlatk_lexica_engine

💡 The `%sql dlatk_lexica_engine` would be roughly 🐬🐬🐬`USE database` in MySQL. So, run `%sql tutorial_db_engine` to switch back to the `dla_tutorial` database.

To see all the dictionaries in `dlatk_lexica` run this:

In [None]:
%sqlcmd tables

Let's see what's in the LIWC2015 table first!! That's a new table type we should know about.

In [None]:
%sql PRAGMA table_info(LIWC2015)

cid,name,type,notnull,dflt_value,pk
0,id,INT,0,,0
1,term,VARCHAR(31),0,,0
2,category,VARCHAR(15),0,,0
3,weight,INT,0,,0


In 🐬🐬🐬 MySQL it looks more confusing, as the category column is an `enum` data type. It just means that category can be one of the values from the list. It's a glorified `VARCHAR`. Let's look into the table now.

In [None]:
%%sql

SELECT *
FROM LIWC2015
LIMIT 10;

id,term,category,weight
1,he,PPRON,1
2,he'd,PPRON,1
3,he's,PPRON,1
4,her,PPRON,1
5,hers,PPRON,1
6,herself,PPRON,1
7,hes,PPRON,1
8,him,PPRON,1
9,himself,PPRON,1
10,his,PPRON,1


In [None]:
%%sql

SELECT term, COUNT(*) as occ
FROM LIWC2015
GROUP BY term
ORDER BY occ DESC
LIMIT 20;

term,occ
weve,10
we've,10
we'll,10
weakened,9
we're,9
we'd,9
she's,9
she'll,9
perfected,9
lowest,9


Ah, OK, we are getting a sense. So it lists `term`s that are in `category`s, and these terms can have `weight`s (which are all 1 for LIWC -- it's an unweighted dictionary).

#### 👩‍🔬💻 Exercise

So, can you write the command to check the number of words in each category?

⚠️ If you're using the `dla_tutorial` database, make sure to give the location of the LIWC2015 table specifically as residing in `dlatk_lexica` (e.g., `SELECT * FROM dlatk_lexica.LIWC2015`). Otherwise you have to switch into the dlatk_lexica database by running `%sql dlatk_lexica_engine` (🐬🐬🐬 `use dlatk_lexica` in MySQL).

In [None]:
%%sql



Nice!! That's looking like a good summary. Alright! Let's look at what's in the `POSEMO` dictionary. It says below there are 622 words in there.


In [None]:
%%sql

SELECT COUNT(*)
FROM LIWC2015
WHERE category = 'POSEMO';

COUNT(*)
622


Let's get a few random samples by executing the below query a few times.

In [None]:
%%sql

SELECT *
FROM LIWC2015
WHERE category = 'POSEMO'
ORDER BY RANDOM()
LIMIT 10;

id,term,category,weight
909,fondness,POSEMO,1
1301,usefully,POSEMO,1
969,handsomest,POSEMO,1
797,comfortable,POSEMO,1
1147,pleasant*,POSEMO,1
1216,smart,POSEMO,1
1185,respected,POSEMO,1
1272,thankful,POSEMO,1
790,cheers,POSEMO,1
743,beautiful,POSEMO,1


Alrighty! 622 words, some of them stems (marked with a `*` asterisk). DLATK will know what to do with these when we extract the dictionary -- it will match all the words that start with the stem.
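
To make the stem idea concrete, here is a small sketch of prefix matching. This is not DLATK's actual implementation, just an illustration of the behavior described above; the `matches` helper and the token list are made up.

```python
def matches(term: str, token: str) -> bool:
    """Return True if a lexicon term matches a token; a trailing * is a prefix wildcard."""
    if term.endswith("*"):
        return token.startswith(term[:-1])
    return token == term

# Hypothetical tokens from some text.
tokens = ["pleasant", "pleasantly", "pleasantries", "please", "smart", "smarter"]

pleasant_hits = [t for t in tokens if matches("pleasant*", t)]
smart_hits = [t for t in tokens if matches("smart", t)]
print(pleasant_hits)
print(smart_hits)
```

Note how the stem `pleasant*` catches several word forms, while the plain term `smart` only matches itself (so `smarter` is missed).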

#### 👩‍🔬💻 Exercise

Also, let's check whether some terms are mentioned in more than one dictionary category.

⚠️ Make sure to use the right database!

In [None]:
%%sql



Alright, seems like some pronouns (+ stems) are in multiple dictionaries. But let's see what categories the word `perfected` (which is a content word) is in.

#### 👩‍🔬💻 Exercise

Can you write the command to check the categories that `perfected` is in?

In [None]:
%%sql



Interesting...

This tells us something about LIWC -- the same word can show up all over the place.

FYI, some of these categories are nested: every word in `POSEMO` is also in the `AFFECT` dictionary, all `ACHIEVE` words are in `DRIVES`, etc. (see page 4 of the [LIWC manual](https://repositories.lib.utexas.edu/bitstream/handle/2152/31333/LIWC2015_LanguageManual.pdf)).

While we are at it, let's look at the word "love", which does all sorts of damage when measured with dictionaries ([see here for a discussion](https://static1.squarespace.com/static/53d29678e4b04e06965e9423/t/5f2ad5ec2985cb6e8e2e6548/1596642799449/PNAS-2020-subjectiveWellBeing.pdf), third page, under Highly Frequent Words).

In [None]:
%%sql

SELECT term, category
FROM LIWC2015
WHERE term = 'love';

term,category
love,POSEMO
love,AFFECT
love,BIO
love,SOCIAL
love,DRIVES
love,AFFILIATION


Good to know where it shows up!

## 4) Dictionary extraction

Alright, we've gotten a sense of how dictionary tables work. Let's extract dictionaries. Basically, here is the logic:

* For every user (i.e., group_id)
    * For every dictionary
        * Count how many of the words they use are in the dictionary
        * Divide that number by the total number of words they have written

This will be in a feature table which will tell us, for example, that `5.3%` of the words by user `11111` are `POSEMO` words (that is, match `terms` in the POSEMO `category` (dictionary) in the LIWC2015 table (set of dictionaries)).
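
The logic above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not DLATK's code: the users, their 1gram counts, the two-category lexicon, and the `extract_lexicon_features` helper are all made up, and weights are ignored (they are all 1 in LIWC anyway).

```python
from collections import Counter

# Hypothetical toy data: 1gram counts per user and a tiny two-category lexicon.
user_1grams = {
    11111: Counter({"i": 4, "love": 2, "my": 2, "friends": 1, "hate": 1}),
    22222: Counter({"the": 5, "weather": 3, "is": 2}),
}
lexicon = {"POSEMO": {"love"}, "NEGEMO": {"hate"}}

def extract_lexicon_features(user_1grams, lexicon):
    """Return {(group_id, category): (value, group_norm)}, skipping zero counts."""
    features = {}
    for group_id, counts in user_1grams.items():
        total_tokens = sum(counts.values())
        for category, terms in lexicon.items():
            # value: how many of this user's tokens fall in the category
            value = sum(n for word, n in counts.items() if word in terms)
            if value > 0:  # sparse encoding: no row for unused categories
                features[(group_id, category)] = (value, value / total_tokens)
    return features

features = extract_lexicon_features(user_1grams, lexicon)
for key, (value, group_norm) in sorted(features.items()):
    print(key, value, round(group_norm, 3))
```

In this toy data, user `11111` wrote 10 tokens, 2 of which are `POSEMO` words, so their `POSEMO` proportion is 0.2. User `22222` gets no rows at all.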

Ok, let's tell DLATK to do this for all users with a mini LIWC2015 (`mini_LIWC2015`) set of dictionaries that's a little easier to explore. It just contains `POSEMO`, `NEGEMO` and `SOCIAL` words.

In [None]:
%%sql

SELECT category, COUNT(*) AS terms_in_category
FROM mini_LIWC2015
GROUP BY category;

category,terms_in_category
NEGEMO,746
POSEMO,622
SOCIAL,756


### 4a) DLATK command: Dictionary extraction

⚠️ Before extracting dictionary features, we need to make sure that we have the 1gram feature table as dictionary extraction depends on it. Else, dictionary extraction throws an error.

Other than the first three flags, which every DLATK command needs, the dictionary extraction flags are

`--add_lex_table -l mini_LIWC2015`,

where the string after `-l` gives the name of the dictionary table in the `dlatk_lexica` database we want to extract. In this case, it's `mini_LIWC2015`.

⚠️ Note: long flags get two dashes (`--add_lex_table`); short single-letter flags and their parameters get one dash (`-l mini_LIWC2015`, or `-n 1`).

In [None]:
database = "dla_tutorial"
msgs_table = "msgs"

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --add_lex_table -l mini_LIWC2015

That was quick. So first of all, let's look at the new feature table that was created here:

`feat$cat_mini_LIWC2015$msgs$user_id$1gra`

## 5) Understanding dictionary feature tables

Let's look at the columns in the feature table `feat$cat_mini_LIWC2015$msgs$user_id$1gra`.

First we point our handy `%sql` extension to the dla_tutorial database again. (With this database switching, you may run into an error like "other database already being used!"; then you just run `%reload_ext sql`!)

In [None]:
# switch into tutorial_db_engine (= "dla_tutorial.db")
%sql tutorial_db_engine

In [None]:
feat_mini_liwc_usr = 'feat$cat_mini_LIWC2015$msgs$user_id$1gra'

In [None]:
%%sql

PRAGMA table_info( {{feat_mini_liwc_usr}} )

cid,name,type,notnull,dflt_value,pk
0,id,INTEGER,0,,1
1,group_id,INTEGER,0,,0
2,feat,VARCHAR(10),0,,0
3,value,INTEGER,0,,0
4,group_norm,DOUBLE,0,,0


```
🐬🐬🐬
SHOW COLUMNS FROM {feat_mini_liwc_usr};
🐬🐬🐬
```

Feature tables can be confusing. The things to note about them are:
* their columns/fields always have the same names (`id`, `group_id`, `feat`, `value`, `group_norm`).
* the **name** of the feature table tells us how the table came about, and what exactly is in it.

In the name of this table, the `cat_` prefix means that it was created with `--add_lex_table`.
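
Since the table name carries this much information, it can help to take it apart. Below is a rough sketch that splits a feature table name on `$`. The `parse_feature_table_name` helper and its labels are our own naming for illustration (inferred from the table names used in this tutorial), not something DLATK provides.

```python
def parse_feature_table_name(name: str) -> dict:
    """Split a DLATK-style feature table name on '$' into labeled parts."""
    parts = name.split("$")
    return {
        "prefix": parts[0],         # 'feat' for the feature tables in this tutorial
        "feature_type": parts[1],   # e.g. '1gram' or 'cat_<lexicon table>'
        "message_table": parts[2],  # the table given to --corptable
        "group_field": parts[3],    # the field given to --correl_field
        "extra": parts[4:],         # any trailing parts, e.g. '1gra'
    }

lex_info = parse_feature_table_name("feat$cat_mini_LIWC2015$msgs$user_id$1gra")
gram_info = parse_feature_table_name("feat$1gram$msgs$user_id")
print(lex_info)
print(gram_info)
```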

Let's look at the contents now. The offset command in conjunction with the limit skips the first 100 rows, and then shows you 10. It's a way to avoid always looking at the same darn 10 rows at the top of the table (instead you are now looking at rows 101-110).
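
To see the `LIMIT` / `OFFSET` arithmetic on its own, here is a tiny sketch with a made-up 200-row table (`toy_feat`) in an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_feat (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO toy_feat (id) VALUES (?)", [(i,) for i in range(1, 201)])

# Skip the first 100 rows, then return the next 10.
rows = conn.execute(
    "SELECT id FROM toy_feat ORDER BY id LIMIT 10 OFFSET 100"
).fetchall()
ids = [row[0] for row in rows]
print(ids)
```

The sketch adds an `ORDER BY` so that "the first 100 rows" is well defined; without it, SQL makes no promise about row order.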

In [None]:
feat_mini_liwc_usr = 'feat$cat_mini_LIWC2015$msgs$user_id$1gra'

In [None]:
%%sql

SELECT *
FROM {{feat_mini_liwc_usr}}
LIMIT 10
OFFSET 100;

id,group_id,feat,value,group_norm
101,911744,POSEMO,28,0.0191387559808612
102,911744,SOCIAL,86,0.0587833219412167
103,911744,NEGEMO,8,0.0054682159945317
104,911744,_intercept,1,1.0
105,918175,NEGEMO,83,0.0165108414561368
106,918175,SOCIAL,433,0.0861348716928584
107,918175,POSEMO,87,0.0173065446588422
108,918175,_intercept,1,1.0
109,942828,POSEMO,11500,0.0314800266074658
110,942828,SOCIAL,33033,0.0904243233847333


In table `feat$cat_mini_LIWC2015$msgs$user_id$1gra`,
- **group_id** -- `user_id` from `msgs`, represents a user
- **feat** -- One of several categories (dictionaries) from our lexicon table, `mini_LIWC2015`
- **value** -- Number of times a word in the category mentioned in `feat` is used by user in `group_id`
- **group_norm** -- Proportion of words used by user in `group_id` that belong to category in `feat`
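
As a quick check of how `value` and `group_norm` relate, here is a small sketch using the three rows for user `911744` from the output above. It assumes `group_norm = value / total tokens written by the user`, in which case `value / group_norm` should give the same total for every row.

```python
# (feat, value, group_norm) rows for user 911744, copied from the output above.
rows = [
    ("POSEMO", 28, 0.0191387559808612),
    ("SOCIAL", 86, 0.0587833219412167),
    ("NEGEMO", 8, 0.0054682159945317),
]

# If group_norm = value / total_tokens, then value / group_norm recovers
# the user's total token count, and it should agree across rows.
implied_totals = [round(value / group_norm) for _, value, group_norm in rows]
print(implied_totals)
```

All three rows imply the same total (1,463 tokens), which is consistent with `group_norm` being a proportion of all the words this user wrote.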

### 5a) How the feature table name connects to its contents

This is now the second time that we are encountering a feature table -- in the previous tutorial, we had a 1gram table. Now we have a dictionary table.

For the dictionary table `feat$cat_mini_LIWC2015$msgs$user_id$1gra`

| this column in the feature table... | ...contains this (as recorded in table name)|
|------|------|
| feat | **categories in _mini_LIWC2015_** |
| group_id | user_id |

Compare this to the 1gram table `feat$1gram$msgs$user_id`

| this column in the feature table... | ...contains this (as recorded in table name)|
|------|------|
| feat | **1gram** |
| group_id | user_id |

For example, for a particular user_id:

In [None]:
feat_mini_liwc_usr = 'feat$cat_mini_LIWC2015$msgs$user_id$1gra'

In [None]:
%%sql

SELECT *
FROM {{feat_mini_liwc_usr}}
WHERE group_id = 911744;

id,group_id,feat,value,group_norm
101,911744,POSEMO,28,0.0191387559808612
102,911744,SOCIAL,86,0.0587833219412167
103,911744,NEGEMO,8,0.0054682159945317
104,911744,_intercept,1,1.0


User `911744` has used:
- 28 words in LIWC category `POSEMO`
- 86 words in LIWC category `SOCIAL`
- 8 words in LIWC category `NEGEMO`
- `_intercept`: **you can ignore this, if you like**. It's a dummy for every `group_id` that DLATK adds to make sure that every `group_id` shows up in this table, even if a particular `group_id` did not use any words in the `feat` categories (here: POSEMO, SOCIAL, or NEGEMO). This is to take into account the sparse encoding of feature tables-- group_ids would not get feature rows for dictionaries they did not mention, and so would not appear in the table otherwise.

Having this `_intercept` in here makes sure that if you run `count(distinct(group_id))` on a lexicon table,  you'll always get the total number of `group_id`'s that were in the message table.
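
Here is a minimal sketch of that effect on made-up data. The `toy_lex_feat` table and its rows are hypothetical; user 3 stands for someone who used no lexicon words at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_lex_feat (group_id INTEGER, feat TEXT, value INTEGER)")
# Hypothetical rows: user 3 used no lexicon words, so only the _intercept row exists.
conn.executemany(
    "INSERT INTO toy_lex_feat VALUES (?, ?, ?)",
    [
        (1, "POSEMO", 5), (1, "_intercept", 1),
        (2, "NEGEMO", 2), (2, "_intercept", 1),
        (3, "_intercept", 1),
    ],
)

count_sql = "SELECT COUNT(DISTINCT group_id) FROM toy_lex_feat"
with_intercept = conn.execute(count_sql).fetchone()[0]
without_intercept = conn.execute(
    count_sql + " WHERE feat != '_intercept'"
).fetchone()[0]
print(with_intercept, without_intercept)
```

With the `_intercept` rows, all three users are counted; without them, user 3 silently drops out.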

At this point, a LIWC-minded person might say this person expresses more social words than negemo words, for example (but that makes strong assumptions about the recall of the dictionaries -- that they really hit all the social and negemo expressions that are out there in the world).

#### 👩‍🔬💻 Exercise

Do the group_norms (ignoring `_intercept`) sum to 1 in a lex table? Why or why not?

In [None]:
feat_mini_liwc_usr = 'feat$cat_mini_LIWC2015$msgs$user_id$1gra'

In [None]:
%%sql



As shown above, the `group_norm`s (without `_intercept`) don't sum to one, because not every 1gram occurs in a dictionary category.

In [None]:
feat_1gram_usr = 'feat$1gram$msgs$user_id'

In [None]:
%%sql

SELECT group_id, SUM(value) AS tokens
FROM {{feat_1gram_usr}}
GROUP BY group_id
ORDER BY group_id ASC;

In [None]:
feat_mini_liwc_usr = 'feat$cat_mini_LIWC2015$msgs$user_id$1gra'

In [None]:
%%sql

SELECT group_id, SUM(value) AS tokens
FROM {{feat_mini_liwc_usr}}
GROUP BY group_id
ORDER BY group_id ASC;


Let's look at users who have the largest proportions of `POSEMO`. For proportions we have to look at the `group_norm` field.

#### 👩‍🔬💻 Exercise

Can you now guess the SQL spell to show which users have the largest proportion of LIWC `POSEMO` words?

In [None]:
feat_mini_liwc_usr = 'feat$cat_mini_LIWC2015$msgs$user_id$1gra'

In [None]:
%%sql



Now, let's get the average `group_norm` for each LIWC category.

In [None]:
%%sql

SELECT feat, AVG(group_norm)
FROM {{feat_mini_liwc_usr}}
GROUP BY feat
limit 10;

feat,AVG(group_norm)
NEGEMO,0.018002871887595
POSEMO,0.0306978558498915
SOCIAL,0.0769671148592737
_intercept,1.0


`POSEMO` is a bit less than twice as frequent as `NEGEMO` overall. And `SOCIAL` is the largest.

BTW, here is a way to get more statistics, all at once:

In [None]:
%%sql

SELECT feat, AVG(group_norm), MIN(group_norm), MAX(group_norm), count(distinct(group_id)) AS included_users
FROM {{feat_mini_liwc_usr}}
GROUP BY feat;

feat,AVG(group_norm),MIN(group_norm),MAX(group_norm),included_users
NEGEMO,0.018002871887595,0.0007887990534411,0.055045871559633,998
POSEMO,0.0306978558498915,0.0025934730927166,0.0921273031825796,1000
SOCIAL,0.0769671148592737,0.0042812077512392,0.1864951768488747,1000
_intercept,1.0,1.0,1.0,1000


It's good to include the "included_users" column to remind ourselves that we are doing "dumb" group_norm averages -- generally an acceptable strategy when it comes to dictionaries with decent coverage. Apparently there were two users who never used a NEGEMO word in the dataset -- good for theeeeeem.
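
Why "dumb"? Because feature tables are sparse, `AVG(group_norm)` only averages over the users who have a row for that category. The sketch below uses made-up numbers (four hypothetical users, one of whom never used a NEGEMO word) to show the difference between that average and one that counts the missing user as 0.

```python
# Hypothetical NEGEMO group_norms for four users; user 104 never used a NEGEMO
# word, so a sparse feature table holds no row for them.
negemo_rows = {101: 0.02, 102: 0.01, 103: 0.03}
all_users = [101, 102, 103, 104]

# What AVG(group_norm) ... GROUP BY feat computes: mean over existing rows only.
sparse_mean = sum(negemo_rows.values()) / len(negemo_rows)

# Mean over all users, counting missing rows as 0.
dense_mean = sum(negemo_rows.get(u, 0.0) for u in all_users) / len(all_users)

print(round(sparse_mean, 4), round(dense_mean, 4))
```

When nearly every user has a row (like 998 of 1000 here in the tutorial), the two averages are almost identical, which is why the simple average is usually fine for dictionaries with decent coverage.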

## 6) Which words are in a dictionary(s), and drive its frequency?

As discussed in the lectures, we need to know what's going on inside a dictionary; we can't really take it at face value. For that, we have to know which words drive its occurrence.

For that, we have to combine two pieces of information:

* Table `feat$1gram$msgs$user_id` contains the number of occurrences of tokens (words) for users
* Table `mini_LIWC2015` contains which words belong to a LIWC category.

Let us merge the two on words (`feat` in the 1gram table, `term` in the lexicon table) so we can find out for each category, which words occurred the most.

This will show us which words are dominant in a category for our corpus.

### 6a) Step 1: make a word count table -- counts per words

The word counts in `feat$1gram$msgs$user_id` are per user (that's the group_id), and the meta_tables (if you thought of those) also report statistics *per user*. We want total counts per word.


#### 👩‍🔬💻 Exercise

Can you make a new table `word_counts` that will have those?


In [None]:
feat_1gram_usr = 'feat$1gram$msgs$user_id'

In [None]:
%%sql



Great, this gives us a table with overall word counts. Let's just check what are the most frequent ones.

In [None]:
%%sql

SELECT *
FROM word_counts
ORDER BY count DESC
LIMIT 10;

### 6b) Step 2: merge on the dictionary table

Alright! These words make sense. Now, let's filter this table down to only the words in the `mini_LIWC2015` dict.

The `a.*` below is clever -- it tells SQL to show us all columns contained in the first table (and then whatever we want from table b -- `b.*` would also be an option). Make sure you understand the query below. `WHERE a.word = b.term` with two tables is the same as an inner join (it's an **Implicit Inner Join**, to be precise).
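
If you want to convince yourself that the implicit and explicit forms really are the same thing, here is a small sketch on toy data. The rows and the `toy_lexicon` table are made up; only the column names (`word`, `count`, `term`, `category`) follow the tables used in this tutorial.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy stand-ins for word_counts and the lexicon table (hypothetical rows).
conn.execute("CREATE TABLE word_counts (word TEXT, count INTEGER)")
conn.execute("CREATE TABLE toy_lexicon (term TEXT, category TEXT)")
conn.executemany("INSERT INTO word_counts VALUES (?, ?)",
                 [("love", 40), ("the", 900), ("hate", 7)])
conn.executemany("INSERT INTO toy_lexicon VALUES (?, ?)",
                 [("love", "POSEMO"), ("love", "SOCIAL"), ("hate", "NEGEMO")])

# Implicit inner join: two tables in FROM, join condition in WHERE.
implicit = conn.execute("""
    SELECT a.*, b.category
    FROM word_counts AS a, toy_lexicon AS b
    WHERE a.word = b.term
    ORDER BY a.count DESC, b.category
""").fetchall()

# Explicit inner join: same condition, written with INNER JOIN ... ON.
explicit = conn.execute("""
    SELECT a.*, b.category
    FROM word_counts AS a
    INNER JOIN toy_lexicon AS b ON a.word = b.term
    ORDER BY a.count DESC, b.category
""").fetchall()

print(implicit)
print(implicit == explicit)
```

Two things to notice: `the` disappears because it has no match in the lexicon, and `love` shows up twice because it sits in two categories.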

In [None]:
%%sql

SELECT a.*, b.category
FROM word_counts AS a, dlatk_lexica.mini_LIWC2015 AS b
WHERE a.word = b.term
ORDER BY count DESC
LIMIT 10;

🤔 That join on `word_counts` took a wee bit long... I wonder why that is? Do you know what we could have done to make it faster?

In [None]:
%%sql

CREATE INDEX IF NOT EXISTS idx_word_counts_word
ON word_counts(word);

```
🐬🐬🐬
ALTER TABLE word_counts ADD index (word);
🐬🐬🐬
```

Now re-run the cell with the implicit join (the one just above the cell that created the index). Faster?? Seeeee -- indices! Like magic, except less exciting.
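
For the curious: SQLite can tell you whether it plans to use an index via `EXPLAIN QUERY PLAN`. The sketch below is a simplified stand-in for the tutorial's join. It uses a made-up `word_counts` table and a single-word lookup rather than the full join, and the exact wording of the plan differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_counts (word TEXT, count INTEGER)")
conn.executemany("INSERT INTO word_counts VALUES (?, ?)",
                 [(f"word{i}", i) for i in range(1000)])

def plan(sql: str) -> str:
    """Return SQLite's query plan for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the description

lookup = "SELECT * FROM word_counts WHERE word = 'word500'"
plan_before = plan(lookup)  # no index yet: SQLite has to scan the whole table

conn.execute("CREATE INDEX idx_word_counts_word ON word_counts(word)")
plan_after = plan(lookup)   # now the plan mentions the index

print("before:", plan_before)
print("after: ", plan_after)
```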

Let's save the join as a new table by wrapping a `CREATE TABLE` around it.

In [None]:
%%sql

DROP TABLE IF EXISTS miniLiwc_wordcounts;

CREATE TABLE miniLiwc_wordcounts AS SELECT a.*, b.category
                                    FROM word_counts AS a, dlatk_lexica.mini_LIWC2015 AS b
                                    WHERE a.word = b.term
                                    ORDER BY count DESC;

### 6c) Investigating different categories (sub-dictionaries) within the (dlatk_lexica) dictionary

Ok, cool! So now let's only look at the `POSEMO` category -- what words occur the most in it?

In [None]:
%%sql

SELECT *
FROM miniLiwc_wordcounts
WHERE category = 'POSEMO'
ORDER BY count DESC
limit 10;

Turns out, `well`, `good`, and `love` are all bad words in that they are highly ambiguous ([see here](https://static1.squarespace.com/static/53d29678e4b04e06965e9423/t/5f2ad5ec2985cb6e8e2e6548/1596642799449/PNAS-2020-subjectiveWellBeing.pdf)).

Can you mod the query above to show you the POSEMO words only used once?

Let's check in on the other two as well.

In [None]:
%%sql

SELECT *
FROM miniLiwc_wordcounts
WHERE category = 'NEGEMO'
ORDER BY count DESC
LIMIT 10;

In [None]:
%%sql

SELECT *
FROM miniLiwc_wordcounts
WHERE category = 'SOCIAL'
ORDER BY count DESC
LIMIT 10;

Who would have guessed that `SOCIAL` is full of function word pronouns!? It's basically just a pronoun dictionary, the rest won't matter. Good we checked!!

Bahh! Ignorance is bliss, except, of course, it isn't.

## 7) Message-level extraction: Blog posts that have POSEMO words

During the lectures, we saw that to judge the precision of a dictionary, we need to look at (1) how many messages contain dictionary words, and (2) how many of those actually express the concept conveyed by the dictionary. Precision is (2)/(1).
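
As a worked example of that ratio, here is a tiny sketch with made-up annotations: imagine we sampled 10 messages that contain dictionary words and judged each one by hand.

```python
# Hypothetical hand annotations: for each sampled message that contains a
# dictionary word, does it actually express the concept (True) or not (False)?
annotations = [True, True, False, True, False, True, True, False, True, True]

containing_dictionary_words = len(annotations)  # (1)
expressing_concept = sum(annotations)           # (2)
precision = expressing_concept / containing_dictionary_words

print(f"precision = {expressing_concept}/{containing_dictionary_words} = {precision:.2f}")
```

With 7 of the 10 sampled messages judged as really expressing the concept, precision comes out at 0.70.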

Let's find some blog posts with POSEMO words in them -- so that we can look at them, with our eyes, and see if they really are positive emotion-y.

For that, we again want to match the words in the POSEMO category against a feature table that contains words -- so 1grams. But now, we don't want to group blog posts from users, we want to get the words in the specific blog posts, so we can fish out blog posts that have POSEMO words in them.

DLATK will come to the rescue -- through the clever use of the `correl_field` (= the group-by field) -- we will tell it to extract 1grams with `message_id` as our `correl_field`. That will get us a `feat$1gram$msgs$message_id` feature table.

Then we find the `message_id`s whose `feat` in `feat$1gram$msgs$message_id` matches a `term` in `category` POSEMO in `mini_LIWC2015`.

This will take ~25 minutes!

In [None]:
!dlatkInterface.py \
  --corpdb {database} \
  --corptable {msgs_table} \
  --correl_field message_id \
  --add_ngrams -n 1

Just to remind ourselves again --
* all feature tables have the same columns (`id`, `group_id`, `feat`, `value`, `group_norm`).
* the **name** of the feature table tells us how the table came about, and what exactly is in it.

### 7a) A feature table for messages!

This is now the third time that we are encountering a feature table.

Our new table is: `feat$1gram$msgs$message_id`

| this column in the feature table... | ...contains this (as recorded in table name)|
|------|------|
| feat | 1gram |
| group_id | message_id |

**Note that the group_id field no longer holds users, but messages (here: blog posts, since that's what's in the message table).**

Previously we had, just for comparison:

The dictionary table `feat$cat_mini_LIWC2015$msgs$user_id$1gra`

| this column in the feature table... | ...contains this (as recorded in table name)|
|------|------|
| feat | **categories in _mini_LIWC2015_** |
| group_id | user_id |

And the 1gram table `feat$1gram$msgs$user_id`

| this column in the feature table... | ...contains this (as recorded in table name)|
|------|------|
| feat | **1gram** |
| group_id | user_id |

Hopefully you are getting the idea. It's an abstract data structure 🌈

Let's sanity check the new table. How many messages does it contain features for?




In [None]:
feat_1gram_msg = 'feat$1gram$msgs$message_id'

In [None]:
%%sql

SELECT COUNT(DISTINCT(group_id)) AS num_msgs
FROM {{feat_1gram_msg}};

So yeah, it's a big table! Because it has 31,000 messages times all their 1gram counts, for a total of how many rows?

In [None]:
feat_1gram_msg = 'feat$1gram$msgs$message_id'

In [None]:
%%sql

SELECT COUNT(*) AS num_rows
FROM {{feat_1gram_msg}};

3.7m rows -- we are not in Excel Land anymore.

Let's keep swimming though.

Now we want to create a query that pulls out the `message_id`s that have 1grams that match the `term` column for POSEMO words.

In [None]:
feat_1gram_msg = 'feat$1gram$msgs$message_id'

In [None]:
%%sql

SELECT group_id AS message_id, feat
FROM {{feat_1gram_msg}}
WHERE feat IN (SELECT term
               FROM dlatk_lexica.mini_LIWC2015
               WHERE category = 'POSEMO')
LIMIT 10;

So this tells us that message 2523 had the feature "(:" in it, which is in the POSEMO dictionary. Hopefully you get the idea. So let's make ourselves a table that preserves this.

In [None]:
feat_1gram_msg = 'feat$1gram$msgs$message_id'

In [None]:
%%sql

DROP TABLE IF EXISTS posemo_blogs;

CREATE TABLE posemo_blogs AS SELECT group_id AS message_id, feat
                              FROM {{feat_1gram_msg}}
                              WHERE feat IN (SELECT term
                                             FROM dlatk_lexica.mini_LIWC2015
                                             WHERE category = 'POSEMO');

And let's pull some of those messages for a random sample.

In [None]:
%%sql

SELECT a.feat AS posemo_word, b.*
FROM posemo_blogs AS a, msgs AS b
WHERE a.message_id = b.message_id
ORDER BY RANDOM()
LIMIT 3;

Now we could annotate this to see if the `POSEMO` token really designates positive emotion for a random sample.

## 8) ‼️ Save your database ‼️

The homework assumes that you have extracted feature tables. So let's write a copy of your database to Google Drive! 📝

Google will ask you to allow this notebook to access your Drive--click yes and follow prompts to login and allow!

In [None]:
database = "dla_tutorial"

In [None]:
# Save your database in Google Drive

# mount Google Drive (if you haven't already)
from google.colab import drive
drive.mount('/content/drive')

# copy the database file to your Drive (`-f` forces it to write over the old database with any changes)
!cp -f "sqlite_data/{database}.db" "/content/drive/MyDrive/sqlite_databases/"

print("Database has been copied to your Google Drive with success!")

Now your database is saved in your Google Drive! We can double check it's there by running this:

In [None]:
!ls -lh "/content/drive/MyDrive/sqlite_databases"

🧐 Next up, check out the mini tutorial for more practice on saving databases to Google Drive!