<a href="https://colab.research.google.com/github/CompPsychology/psych290_colab_public/blob/main/notebooks/week-03/W3_HW3_DLATK_lexicon_extraction_(dla_tutorial).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# W3 Homework 3: 1gram extraction and feat tables, meta tables

(c) Johannes Eichstaedt & the World Well-Being Project, 2023.

✋🏻✋🏻 NOTE - You need to create a copy of this notebook before you work through it. Click the "Save a copy in Drive" option in the File menu and save it to your Google Drive.

✉️🐞 If you find a bug/something doesn't work, please slack us a screenshot, or email johannes.courses@gmail.com.


Work through Tutorial 5 before you solve this homework. The homework assumes you have an existing `dla_tutorial` database file in your Google Drive (Tutorial 5).

* Please follow up every command cell with a markdown cell where you give the answer in a human sentence (see that "Cell" dropdown field at the top to choose the cell type).
* Please share a link to your notebook in your Google Drive (with sharing set so that anyone with the link can edit).



## 1) Setting up Colab with DLATK and SQLite

If Colab warns you that this notebook was not authored by Google, click "Run anyway."

### 1a) Install DLATK

In [None]:
# assigning the database name
database = "dla_tutorial"

In [None]:
# installing DLATK and necessary packages
!git clone -b psych290 https://github.com/dlatk/dlatk.git
!pip install -r dlatk/install/requirements.txt
!pip install dlatk/
!pip install wordcloud langid jupysql

### 1b) Mount Google Drive and copy database

In [None]:
# 1b) Mount Google Drive & copy database to Colab

# connects & mounts your Google Drive to this colab space
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

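# make sure the target folder exists (no-op if it already does)
!mkdir -p sqlite_data

# copies {database}.db from your Google Drive to the sqlite_data folder in this Colab
!cp "/content/drive/MyDrive/sqlite_databases/{database}.db" "sqlite_data"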

# this copies dlatk_lexica.db from your Google Drive to Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/dlatk_lexica.db" "sqlite_data"

### 1c) Setup database connection

In [None]:
# loads the %%sql extension
%load_ext sql

# connect to the SQLite .db files
from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")
dlatk_lexica_engine = create_engine(f"sqlite:///sqlite_data/dlatk_lexica.db?charset=utf8mb4")

# auto-attach dlatk_lexica.db whenever tutorial_db_engine connects (so tutorial_db_engine can query both databases)
from IPython import get_ipython
from sqlalchemy import event
@event.listens_for(tutorial_db_engine, "connect")
def _attach_lexica(dbapi_conn, connection_record):
    dbapi_conn.execute("ATTACH DATABASE 'sqlite_data/dlatk_lexica.db' AS dlatk_lexica;")

# connecting to {database}.db
%sql tutorial_db_engine

#set the output limit to 50
%config SqlMagic.displaylimit = 50

### 1d) *If you have to restart your session*

In [None]:
# If you face a database locked issue, restart the session & run this cell to get set back up!
database = "dla_tutorial"

%reload_ext sql

from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")
dlatk_lexica_engine = create_engine(f"sqlite:///sqlite_data/dlatk_lexica.db?charset=utf8mb4")

# auto-attach dlatk_lexica.db whenever tutorial_db_engine connects (so tutorial_db_engine can query both databases)
from IPython import get_ipython
from sqlalchemy import event
@event.listens_for(tutorial_db_engine, "connect")
def _attach_lexica(dbapi_conn, connection_record):
    dbapi_conn.execute("ATTACH DATABASE 'sqlite_data/dlatk_lexica.db' AS dlatk_lexica;")

# connecting to {database}.db
%sql tutorial_db_engine

#set the output limit to 50
%config SqlMagic.displaylimit = 50

### 1e) Check for feature table

We expect the 1grams to have been extracted into the `feat$1gram$msgs$user_id` table.

Let's check, shall we?

In [None]:
%sql tutorial_db_engine

In [None]:
%sqlcmd tables

In [None]:
%%sql

SELECT *
FROM feat$1gram$msgs$user_id
ORDER BY RANDOM()
LIMIT 5;

Good! The world makes sense (I hope). If not, re-extract the 1gram features.

```
!dlatkInterface.py --corpdb {database} --corptable msgs --correl_field user_id --add_ngrams -n 1
```

## Questions

### 1) Now please extract all of `LIWC2015` for the users of the blog sample.

When we say **extract**, we mean adding the dictionary feature table (similar to the 1gram feature table).

### 2) Which gender has a higher relative frequency of `MONEY` words? (female = 1).

Please do it the right way -- remember sparse encoding in feature tables (from last week).
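
As a reminder of what sparse encoding implies, here is a minimal sketch on made-up data (it is not the homework solution). The table names `toy_feat` and `toy_outcomes` and all values are hypothetical; the columns `group_id`, `feat`, and `group_norm` are assumed to mirror the feature tables used in this course. Users who never used a dictionary word have no row in the feature table, so a plain `AVG(group_norm)` skips their zeros.

```python
import sqlite3

# Toy in-memory database; every table name and value here is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE toy_outcomes (user_id INTEGER, gender INTEGER);
CREATE TABLE toy_feat (group_id INTEGER, feat TEXT, group_norm REAL);
INSERT INTO toy_outcomes VALUES (1, 0), (2, 0), (3, 1), (4, 1);
-- users 2 and 4 never used a MONEY word, so they have no row here
INSERT INTO toy_feat VALUES (1, 'MONEY', 0.02), (3, 'MONEY', 0.04);
""")

# Naive average: only users who appear in the feature table are counted.
naive = conn.execute("""
SELECT o.gender, AVG(f.group_norm)
FROM toy_feat AS f
JOIN toy_outcomes AS o ON o.user_id = f.group_id
WHERE f.feat = 'MONEY'
GROUP BY o.gender
ORDER BY o.gender
""").fetchall()

# Zero-filled average: start from all users and treat missing rows as 0.
zero_filled = conn.execute("""
SELECT o.gender, AVG(COALESCE(f.group_norm, 0))
FROM toy_outcomes AS o
LEFT JOIN toy_feat AS f
  ON o.user_id = f.group_id AND f.feat = 'MONEY'
GROUP BY o.gender
ORDER BY o.gender
""").fetchall()

print("naive:      ", naive)
print("zero-filled:", zero_filled)
```

In this toy example the zero-filled averages are half the naive ones, because half of the users in each group have no `MONEY` row.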

### 3) (2pts) Get the 10 occupations that have the highest ratio of `POSEMO` to `NEGEMO` language

  \begin{equation}
  \frac{POSEMO}{NEGEMO}
  \end{equation}
  
As you may realize, you can do this in two possible ways, one of which involves new feature extraction in DLATK. Both are OK. You only need to do one.

But do write a sentence -- what would the other way have been?

This question has historically been the most tricky in this homework.

**Hint:** you probably need intermediate tables. POSEMO and NEGEMO features are stored in different rows in a normal feature table and their group_norms cannot simply be divided by one another in one typical SQL command.

Work slowly and step by step, looking at your outputs, etc.
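
To illustrate the hint about intermediate tables, here is a minimal sketch on made-up data. The names `toy_feat` and `toy_wide` and all values are hypothetical; only the `group_id`, `feat`, and `group_norm` columns are assumed to match a normal feature table. The idea is to first reshape the two feature rows per group into one row with two columns, and only then divide.

```python
import sqlite3

# Toy in-memory database; table names and values are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE toy_feat (group_id TEXT, feat TEXT, group_norm REAL);
INSERT INTO toy_feat VALUES
  ('A', 'POSEMO', 0.04), ('A', 'NEGEMO', 0.02),
  ('B', 'POSEMO', 0.03), ('B', 'NEGEMO', 0.03);

-- intermediate table: one row per group, one column per feature
CREATE TABLE toy_wide AS
SELECT p.group_id,
       p.group_norm AS posemo,
       n.group_norm AS negemo
FROM toy_feat AS p
JOIN toy_feat AS n ON p.group_id = n.group_id
WHERE p.feat = 'POSEMO' AND n.feat = 'NEGEMO';
""")

# With both values in the same row, the ratio is a simple division.
ratios = conn.execute("""
SELECT group_id, posemo / negemo AS ratio
FROM toy_wide
ORDER BY ratio DESC
""").fetchall()
print(ratios)
```

Note that the inner join drops any group that is missing either feature, which is one of the choices you should be aware of when you look at your own intermediate tables.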

**Note:** Depending on how exactly you implement either way, the two answers may differ slightly! That's alright, they reflect different analytical choices (ideally you realize what these are).
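
One possible source of such a difference, shown here as a sketch with made-up counts, is whether each user counts equally or each word counts equally. This is our assumption about how two implementations might differ, not a statement about what DLATK does internally; all names and numbers below are hypothetical.

```python
# Hypothetical word counts for two users who share one occupation.
users = [
    {"posemo": 2, "negemo": 1, "total": 10},        # writes very little
    {"posemo": 100, "negemo": 100, "total": 1000},  # writes a lot
]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Choice A: compute relative frequencies per user, then average them.
# Every user gets the same weight, regardless of how much they wrote.
pos_a = mean([u["posemo"] / u["total"] for u in users])
neg_a = mean([u["negemo"] / u["total"] for u in users])
ratio_user_level = pos_a / neg_a

# Choice B: pool all words of the occupation first, then compute
# relative frequencies. Prolific users dominate the result.
total_words = sum(u["total"] for u in users)
pos_b = sum(u["posemo"] for u in users) / total_words
neg_b = sum(u["negemo"] for u in users) / total_words
ratio_pooled = pos_b / neg_b

print(f"user-level ratio: {ratio_user_level:.3f}")
print(f"pooled ratio:     {ratio_pooled:.3f}")
```

Here the user-level ratio is 1.5, while the pooled ratio is 102/101 (about 1.01), because the prolific user contributes almost all of the pooled words.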

### 4) Please pull out 10 random blog posts that contain `MONEY` words.  

### 5) For how many of the 10 random blogs does the first sentence in which the `MONEY` word appears actually convey the concept of money? Just give a count, no explanation needed for your annotations. We trust you.

### 6) This number you just calculated by reading the blog posts, what is it called? Recall, precision, sensitivity, specificity, or something else?? Gah, it's so confusing.

### 7) Using human language in the form of a sentence to respond:

You run into an excited colleague in the lobby who just ran her first language analyses. She tells you that she just deployed LIWC to a 10% sample of Twitter to measure how everyone is doing around COVID. She found some striking results: the **INGESTION** dictionary is spiking, but the **RISK** dictionary is not moving away from baseline. She is writing up some very nuanced psychological interpretation about what this might mean, mentioning "body envelope violations" and "game-theoretical accounts of uncertainty in light of violation of the assumptive world." What do you politely tell her? What follow up analysis do you suggest to her?

### 8) (2pts) Which are the 10 most frequent words within the LIWC BIO dictionary? What fraction of total dictionary occurrences do the sum of the occurrences of these 10 words account for?

Do you think the top 3 are good "biological processes" words?

We mean: which fraction of total dictionary occurrences do the **sum of occurrences of these 10 words** (and not individual words) account for?

Note: you can ignore the fact that the BIO dictionary contains wildcards.
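
To make "fraction of total dictionary occurrences" concrete, here is a minimal sketch on made-up data. The tables `toy_lexicon` and `toy_1gram`, their contents, and the choice of `k = 2` are hypothetical; the columns `term`/`category` and `feat`/`value` are assumed to resemble the lexicon and 1gram feature tables. The exact-match join ignores wildcards.

```python
import sqlite3

# Toy in-memory database; table names and values are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE toy_lexicon (term TEXT, category TEXT);
CREATE TABLE toy_1gram (group_id INTEGER, feat TEXT, value INTEGER);
INSERT INTO toy_lexicon VALUES
  ('eat', 'BIO'), ('sleep', 'BIO'), ('heart', 'BIO'), ('cash', 'MONEY');
INSERT INTO toy_1gram VALUES
  (1, 'eat', 5), (2, 'eat', 3), (1, 'sleep', 1),
  (2, 'heart', 1), (1, 'cash', 9), (2, 'the', 50);
""")

# Total occurrences of each word that belongs to the BIO category.
word_counts = conn.execute("""
SELECT f.feat, SUM(f.value) AS n
FROM toy_1gram AS f
JOIN toy_lexicon AS l ON l.term = f.feat
WHERE l.category = 'BIO'
GROUP BY f.feat
ORDER BY n DESC, f.feat
""").fetchall()

k = 2  # the homework asks for 10; 2 keeps the toy example readable
top_k = word_counts[:k]

# Share of all dictionary occurrences covered by the top-k words together.
share = sum(n for _, n in top_k) / sum(n for _, n in word_counts)

print(top_k)
print(f"top-{k} share of BIO occurrences: {share:.2f}")
```

In this toy data the top two words account for 9 of the 10 `BIO` occurrences.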

### 9) (2pts) During the message-level (`--correl_field message_id`) extraction, a new meta_table was produced. Using this meta-table, can you derive how long the average blog post (message) is, in terms of tokens?

Looking into that table, two types of features in this table have the same values -- why is this? In the user-level meta feat tables, they are different.

It might be helpful to view which tables you currently have in `dla_tutorial`, like this:

In [None]:
%sql tutorial_db_engine

tables = %sqlcmd tables
print(tables)

### 10) Which age group generally uses the most POSEMO?

First, please check how many users have at least 1 `POSEMO` word (i.e., show up in the feature table/merged table with `POSEMO` group_norms).

Second, once you know that, please compute the average `POSEMO` for the age groups. It should be simpler now, somehow 💫✨.

When you get the result -- are you surprised?

**CLARIFICATION**: You can either do this by merging the user level features and the `blog_outcomes` table or you can extract features using DLATK grouped by `age`. Either one works.

## Save your database!

In [None]:
database = "dla_tutorial"

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Copy the database file to your Drive
!cp -f "sqlite_data/{database}.db" "/content/drive/MyDrive/sqlite_databases/"

print(f"✅ Database '{database}.db' has been copied to your Google Drive.")