<a href="https://colab.research.google.com/github/CompPsychology/psych290_colab_public/blob/main/notebooks/week-02/W2_HW2_DLA_FeatExtractionMeta_(dla_tutorial).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# W2 Homework 2: 1gram extraction and feat tables, meta tables (2025-03)

(c) Johannes Eichstaedt & the World Well-Being Project, 2023.

This homework is built on Tutorials 3 and 4. You can experiment below.
- Please follow up every command cell with a markdown cell where you give the answer in a human sentence (see the "Insert" dropdown field at the top to choose the cell type).

<br>

✋🏻✋🏻 NOTE - You need to create a copy of this notebook before you work through it. Click the "Save a copy in Drive" option in the File menu, and save it to your Google Drive.

✉️🐞 If you find a bug/something doesn't work, please slack us a screenshot, or email johannes.courses@gmail.com.


## Setting up Colab with DLATK and SQLite

If Colab warns you that this notebook was not authored by Google, click "Run anyway."

### Streamlined setup

In [None]:
# assigning the corpus database name
database_name = "homework2_dlaTutorial"

########### 1a) Install

# installing DLATK and necessary packages
!git clone -b psych290 https://github.com/dlatk/dlatk.git
!pip install -r dlatk/install/requirements.txt
!pip install dlatk/
!pip install wordcloud langid jupysql textstat

########### 1b) Download data and insert into SQLite database

# this downloads the CSVs we need
!git clone https://github.com/CompPsychology/psych290_data.git

# load the required package -- similar to library() function in R
import os
from dlatk.tools.importmethods import csvToSQLite

# store the complete path to the database -- sqlite_data/[database_name].db
database = os.path.join("sqlite_data", database_name)

msgs = "psych290_data/dla_tutorial/msgs.csv"
csvToSQLite(msgs, database, "msgs")

outcomes = "psych290_data/dla_tutorial/blog_outcomes.csv"
csvToSQLite(outcomes, database, "outcomes")

############# 1c) Setup database connection

# loads the %%sql extension
%load_ext sql

# connects the extension to the database
from sqlalchemy import create_engine
engine = create_engine(f"sqlite:///sqlite_data/{database_name}.db?charset=utf8mb4")
%sql engine

# set the output limit to 100
%config SqlMagic.displaylimit = 100

## PRINT FINISHED
print(" ******* LOAD FINISHED ¯\\_(ツ)_/¯ *******")

## Extract features

For this homework, you will need to work with feature tables and meta tables. Let's re-extract those from our message table!

In [8]:
database = "homework2_dlaTutorial"
msgs_table = "msgs"

In [None]:
!dlatkInterface.py \
  --corpdb {database} \
  --corptable {msgs_table} \
  --correl_field user_id \
  --add_ngrams -n 1

In [None]:
%sqlcmd tables

## Questions

### 0) (Extra credit) Make yourself a new "super table" that has the 1gram feature table merged onto the outcome table. Put an index on `feat`, `gender`, `age`, `group_norm` and `sign`.

How many rows does this new table have, and how does this length relate to the lengths of the tables you joined? (sanity check)

Where helpful in the below queries, use your new table.
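
A toy sketch of the sanity check, run on an in-memory SQLite database rather than on the homework data. The table names `toy_feat`, `toy_outcomes`, and `toy_super` and the index name are made up for illustration, and the column `group_id` is assumed to hold the user id in the feature table. It shows one way the row count of an inner join can relate to the lengths of the joined tables.

```python
import sqlite3

# In-memory database with hypothetical toy tables (not the homework tables)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE toy_outcomes (user_id INTEGER, gender INTEGER, age INTEGER)")
cur.execute("CREATE TABLE toy_feat (group_id INTEGER, feat TEXT, group_norm REAL)")

cur.executemany(
    "INSERT INTO toy_outcomes VALUES (?, ?, ?)",
    [(1, 1, 25), (2, 0, 30), (3, 1, 41)],
)
cur.executemany(
    "INSERT INTO toy_feat VALUES (?, ?, ?)",
    [
        (1, "the", 0.05),
        (1, "of", 0.02),
        (2, "the", 0.04),
        (9, "the", 0.01),  # user 9 has no row in toy_outcomes
    ],
)

# Inner join: one output row per feature row whose user also has an outcome row
cur.execute(
    """
    CREATE TABLE toy_super AS
    SELECT f.group_id, f.feat, f.group_norm, o.gender, o.age
    FROM toy_feat AS f
    JOIN toy_outcomes AS o ON f.group_id = o.user_id
    """
)
cur.execute("CREATE INDEX toy_super_feat_idx ON toy_super (feat)")

n_feat = cur.execute("SELECT COUNT(*) FROM toy_feat").fetchone()[0]
n_outcomes = cur.execute("SELECT COUNT(*) FROM toy_outcomes").fetchone()[0]
n_super = cur.execute("SELECT COUNT(*) FROM toy_super").fetchone()[0]
index_names = [
    row[0]
    for row in cur.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'toy_super'"
    )
]
print(n_feat, n_outcomes, n_super, index_names)
```

In this toy case the joined table has fewer rows than the feature table, because one user is missing from the outcomes. Users with outcomes but no features (user 3 here) contribute no rows at all.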

### 1) How often was the word `the` mentioned, both in absolute numbers and as the average relative frequency across all users (accounting for sparse encoding of the tokens in the feature table)?

### 2) Please compute the average relative frequency of "they"
- the dumb way: across only those users who have said "they" (not accounting for sparsity)
- the right way: across all users (accounting for sparsity)

Which of the numbers is bigger? Why?

Will the difference be smaller or bigger for a word that is more rare, such as "backpack"?
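
A minimal sketch of the two kinds of averaging on made-up data, assuming a sparse feature table that stores a row only when a user actually used the word. The tables `toy_feat` and `toy_users` and their values are hypothetical and only illustrate the mechanics.

```python
import sqlite3

# In-memory database with hypothetical toy tables (not the homework tables)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE toy_users (user_id INTEGER)")
cur.execute("CREATE TABLE toy_feat (group_id INTEGER, feat TEXT, group_norm REAL)")

# Four users in total, but only two of them ever used the word
cur.executemany("INSERT INTO toy_users VALUES (?)", [(1,), (2,), (3,), (4,)])
cur.executemany(
    "INSERT INTO toy_feat VALUES (?, ?, ?)",
    [(1, "they", 0.02), (2, "they", 0.04)],
)

# Averaging over stored rows only: users with zero uses are silently ignored
avg_stored_rows = cur.execute(
    "SELECT AVG(group_norm) FROM toy_feat WHERE feat = 'they'"
).fetchone()[0]

# Averaging over all users: sum the stored values, divide by the full user count
avg_all_users = cur.execute(
    """
    SELECT SUM(group_norm) / (SELECT COUNT(*) FROM toy_users)
    FROM toy_feat
    WHERE feat = 'they'
    """
).fetchone()[0]

print(avg_stored_rows, avg_all_users)
```

Here the missing rows act as implicit zeros, so dividing by the full user count pulls the average down.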

### 3) Averaging the correct way (accounting for sparse encoding), what is the average relative frequency of `of` across women? And across men? (Women = 1)

### 4) What are the top 100 most frequent words across all users, in relative frequency terms?


### 5) Do women use more distinct tokens (token types) than men?

Just nominally; we will do significance tests later. Women are coded as 1.

### 6) What are the min, max, and average number of blog posts written by users of each star sign, and the number of users in each star sign? Which star sign wrote the most, on average?

Yes, you can do this all in one query.

Hint: Facebook is now called...what?

### 7) What is the average Flesch-Kincaid grade level of different ages?

### 8) Averaging the dumb way (across those who use the feature) in terms of relative frequency, which age uses `i` the most? Are you surprised? We can sort of get away with this, because `i` is highly frequent, so likely to be used by most.

For 0.1 points: check how many users use the word `i`.

### 9) Double points: Which star sign has the biggest Type-Token-Ratio, on average?

To do this, first join the star signs and the extracted 1grams. Then find the ratio of each sign's vocabulary size to its respective number of tokens.

**Hint:** you can do all of this in the same query.

**Helpful**: If dividing, don't forget to multiply both the numerator and denominator by `1.0`!
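
A small sketch of why the `1.0` matters, using an in-memory SQLite database and made-up numbers: dividing two integers in SQLite yields an integer result, so the fractional part is lost unless a real value enters the expression.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Integer divided by integer: SQLite returns an integer result
int_div = conn.execute("SELECT 3 / 4").fetchone()[0]

# Multiplying by 1.0 turns the operands into reals, so the fraction survives
real_div = conn.execute("SELECT (3 * 1.0) / (4 * 1.0)").fetchone()[0]

print(int_div, real_div)
```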