<a href="https://colab.research.google.com/github/CompPsychology/psych290_colab_public/blob/main/notebooks/week-05/W5_HW5_R_statsWithDLATKTables_(hw5_csv).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# W5 Homework 5: R stats with DLATK Tables (2025-04)

(c) Johannes Eichstaedt & the World Well-Being Project, 2023.

✋🏻✋🏻 NOTE - You need to create a copy of this notebook before you work through it. Click the "Save a copy in Drive" option in the File menu, and save it to your Google Drive.

✉️🐞 If you find a bug/something doesn't work, please slack us a screenshot, or email johannes.courses@gmail.com.


This homework has 15 pts + 1 extra credit.

## 1) Setting up Colab with DLATK, SQLite, and R

In [6]:
database = "homework_05"

### 1a) Install DLATK

In [None]:
# installing DLATK and necessary packages
!git clone -b psych290 https://github.com/dlatk/dlatk.git
!pip install -r dlatk/install/requirements.txt
!pip install dlatk/
!pip install wordcloud langid jupysql

### 1b) Download the data for this tutorial & custom R script

This github repo contains copies of CSVs for `homework_05` and also `tutorial_07` & `dla_tutorial` CSVs! It also copies over our custom R script `psych290RcodeV1.R`

In [None]:
# this downloads the csvs & script we need for this tutorial
!git clone https://github.com/CompPsychology/psych290_data.git

Have a look at the file browser on the left: you should see CSV files in `./psych290_data/homework_05`.


💡 BTW, if you ever need a copy of `psych290RcodeV1.R` (RStudio at home!), [you can download it here!](https://drive.google.com/drive/folders/1LnEKn7tyBiXLsuNl_SXkqljZFRUs9S4k?usp=sharing)

### 1c) Mount Google Drive and copy databases & custom R functions

If you don't have `dlatk_lexica.db` in your Google Drive yet, have a look at Tutorial 5.

In [None]:
# Mount Google Drive & copy to Colab

# connects & mounts your Google Drive to this colab space
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# this copies dlatk_lexica.db from your Google Drive to Colab
# (the sqlite_data folder is created first in case it does not exist yet)
!mkdir -p sqlite_data
!cp -f "/content/drive/MyDrive/sqlite_databases/dlatk_lexica.db" "sqlite_data/"

### 1d) Setup database connection

In [5]:
# loads the %%sql extension
%load_ext sql

# connects the extension to the database - mounts both databases as engines
from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")
dlatk_lexica_engine = create_engine(f"sqlite:///sqlite_data/dlatk_lexica.db?charset=utf8mb4")

# attaches the dlatk_lexica.db so tutorial_db_engine can query both databases
from IPython import get_ipython
from sqlalchemy import event

# auto‑attach the lexica db whenever tutorial_db_engine connects
@event.listens_for(tutorial_db_engine, "connect")
def _attach_lexica(dbapi_conn, connection_record):
    dbapi_conn.execute("ATTACH DATABASE 'sqlite_data/dlatk_lexica.db' AS dlatk_lexica;")

%sql tutorial_db_engine

#set the output limit to 50
%config SqlMagic.displaylimit = 50
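
The `ATTACH DATABASE` step in the cell above is what lets one connection query two SQLite files. Here is a minimal sketch of the same idea using Python's built-in `sqlite3` module. The file names, the alias `lexica_demo`, and the table `toy_lexicon` are made up for illustration and are not part of the course databases.

```python
import os
import sqlite3
import tempfile

# Hypothetical file names, created in a throwaway directory for illustration.
workdir = tempfile.mkdtemp()
main_path = os.path.join(workdir, "main_demo.db")
lexica_path = os.path.join(workdir, "lexica_demo.db")

# Build a tiny stand-in for a lexica database.
lexica = sqlite3.connect(lexica_path)
lexica.execute("CREATE TABLE toy_lexicon (term TEXT, category TEXT)")
lexica.execute("INSERT INTO toy_lexicon VALUES ('happy', 'POSEMO')")
lexica.commit()
lexica.close()

# Connect to the main database and attach the second file under an alias.
conn = sqlite3.connect(main_path)
conn.execute("ATTACH DATABASE ? AS lexica_demo", (lexica_path,))

# Tables in the attached file are addressed as alias.table_name.
attached_rows = conn.execute(
    "SELECT term, category FROM lexica_demo.toy_lexicon"
).fetchall()
print(attached_rows)
conn.close()
```

The attachment only lasts for the lifetime of a connection, which is presumably why the notebook re-attaches inside a `connect` event listener.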

### 1e) (ONLY if needed: SOFT RELOAD) If you have a **"database lock"** problem

If you face a "database locked" issue, restart the session (Runtime ==> Restart Session) & run this cell to get set back up!

Also, run the R-SQLite setup cell (with db_con = ...) 🤓

In [1]:
database = "homework_05"

%reload_ext sql

from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")
dlatk_lexica_engine = create_engine(f"sqlite:///sqlite_data/dlatk_lexica.db?charset=utf8mb4")

# set the output limit to 50
%config SqlMagic.displaylimit = 50

from IPython import get_ipython
from sqlalchemy import event

# auto‑attach the lexica db whenever tutorial_db_engine connects
@event.listens_for(tutorial_db_engine, "connect")
def _attach_lexica(dbapi_conn, connection_record):
    dbapi_conn.execute("ATTACH DATABASE 'sqlite_data/dlatk_lexica.db' AS dlatk_lexica;")

%sql tutorial_db_engine

## 2) R Setup

Let's load the R extension.

In [2]:
# load %%R extension
%load_ext rpy2.ipython

Download and load necessary packages!

In [None]:
# this is equivalent to install.packages() but much faster!!
!apt-get update

!apt install -y \
    r-cran-ggpubr \
    r-cran-psych \
    r-cran-reshape2 \
    r-cran-lm.beta \
    r-cran-rsqlite \
    r-cran-ggthemes \
    r-cran-caret

Now let's load the packages and connect R to our SQLite db!

In [3]:
# constructs the pathname
database_path = f"sqlite_data/{database}.db"
database_path

'sqlite_data/homework_05.db'

In [None]:
%%R -i database_path

source("./psych290_data/helper_files/psych290RcodeV1.R")

require(tidyverse)
require(RSQLite)
require(ggthemes)
require(ggpubr)
require(grid)
require(reshape2)
require(psych)
require(lm.beta)
library(caret)


# load DBI for generic database functions and RSQLite as the SQLite backend
library(DBI)
library(RSQLite)

# connects to a file-based sqlite DB
db_con <- dbConnect(RSQLite::SQLite(),
                    dbname = database_path)

# enforce UTF-8 encoding
dbExecute(db_con, "PRAGMA encoding = 'UTF-8';")

# this attaches the dlatk_lexica database
dbExecute(db_con, "ATTACH DATABASE 'sqlite_data/dlatk_lexica.db' AS dlatk_lexica;")

## Question 1)

(2 points)

Following sections 3 and 4 of tutorial 07 for how to work with CSVs, please read the HW CSVs into R and create a clean outcome table and a clean message table, with the right indexes, in your SQLite database.

Then, following sections 6, 7, and 8, create versions of the message and outcome tables that are shortlisted to only the users with 500+ words.

Show the output of the SQLite `PRAGMA table_info` command for both tables containing the users with 500+ words (`dbGetQuery(db_con, "PRAGMA table_info(hw5_msgs_n ...)")`), and the `checkDf2()` output for the shortlisted outcome table. This will allow us to check that your tables are good.

**Hint: Make use of tutorial 07!**
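
To make the "shortlist, index, then inspect" idea concrete, here is a small sketch in Python's built-in `sqlite3` on an in-memory toy table. Everything in it is hypothetical: the table names, the columns, and especially the precomputed `n_words` column (tutorial 07 may derive word counts differently). The homework itself should follow the tutorial's R code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical toy table: one row per message with a precomputed word count.
conn.execute(
    "CREATE TABLE toy_msgs (message_id INTEGER, user_id INTEGER, n_words INTEGER)"
)
conn.executemany(
    "INSERT INTO toy_msgs VALUES (?, ?, ?)",
    [(1, 10, 300), (2, 10, 250), (3, 11, 120), (4, 12, 500)],
)

# Keep only messages from users whose total word count is at least 500.
conn.execute("""
    CREATE TABLE toy_msgs_500 AS
    SELECT * FROM toy_msgs
    WHERE user_id IN (
        SELECT user_id FROM toy_msgs
        GROUP BY user_id
        HAVING SUM(n_words) >= 500
    )
""")
conn.execute("CREATE INDEX toy_msgs_500_user_idx ON toy_msgs_500 (user_id)")

# Which users survived the filter?
kept_users = [row[0] for row in conn.execute(
    "SELECT DISTINCT user_id FROM toy_msgs_500 ORDER BY user_id"
)]

# PRAGMA table_info returns one row per column: (cid, name, type, notnull, dflt_value, pk).
column_names = [row[1] for row in conn.execute("PRAGMA table_info(toy_msgs_500)")]
print(kept_users, column_names)
```

User 11 wrote only 120 words in total and drops out; users 10 (550 words) and 12 (500 words) stay.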


## Question 2) (3 points)

Based on the tables shortlisted to users with 500+ words, and using the R code from Tutorial 7, sections 9, 10, and 11 (copy it here please), please produce:

-   a plot of message counts over time, and a sentence that summarizes the count and the range (e.g., "Users wrote 33,136 posts between January 1st, 1999 and the 28th of August, 2005")
-   a histogram of messages per user, plus the summary sentence WITH MEDIAN (e.g., "Users wrote an average of 34 (SD = 96, Median = 10, min = 1, max = 1,308) blog posts.")
-   a histogram of words per user, and a summary sentence WITH MEDIAN (e.g., "Users wrote an average of 8,931 words (Median = 2694, SD = 24,116) for a total of 8,600,995 words.")
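
Why report the median as well? Per-user counts are usually heavily right-skewed, so the mean gets pulled up by a few very active users. A tiny Python sketch with made-up counts (not the homework data) shows the numbers that go into such a summary sentence:

```python
import statistics

# Made-up messages-per-user counts with one very active user.
msgs_per_user = [1, 2, 2, 3, 5, 8, 40]

mean = statistics.mean(msgs_per_user)
sd = statistics.stdev(msgs_per_user)      # sample SD (n - 1), like R's sd()
median = statistics.median(msgs_per_user)

summary = (
    f"Users wrote an average of {mean:.1f} (SD = {sd:.1f}, Median = {median}, "
    f"min = {min(msgs_per_user)}, max = {max(msgs_per_user)}) posts."
)
print(summary)
```

Here the single heavy user drags the mean well above the median, which is the pattern to look for in your histograms.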


# **Important**

For all the following questions, please use the shortlisted message and outcome tables (that only contain the users with 500+ words).

## Question 3) Please make an (author-level) cross-correlations table between the LIWC (POSEMO), labmt (valence), and NRC (SENT) features, and author age and gender.

First, extract the 1gram feat table from the shortlisted message table from Question 1 (i.e., for users with 500+ words).

After that, do make sure any extracted dictionary features use the shortlisted message table *(N.b., group_freq_thresh is not applied during feature extraction, only during correlation.)*

**DLATK flags you will need**: `...--add_ngrams -n 1`, `...--add_lex_table -l mini_LIWC2015` etc.

**R-commands you will need**: dbGetQuery, importFeat (from the R source script), merge, cor() -- or something similar.

LIWC is a theory-based, LabMT an annotation-based, and NRC a machine-learning-based model. Which one of these three doesn't correlate highly with the other two?

**Note:** Remember the sparse encoding in feat tables. Fortunately, dictionaries have the `_intercept` features in them for every user. That means as long as you importFeat() a dictionary table with this `_intercept` feature, you will get all missing zeros!
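
To see what "getting the missing zeros" means, here is a toy Python sketch of turning sparse long-format rows into a wide table. It is only an illustration of the idea and not how `importFeat()` is actually implemented; the user ids and values are made up.

```python
# Long-format rows as stored in a feat table: (group_id, feat, group_norm).
# Rows with a value of zero are simply not stored (sparse encoding).
sparse_rows = [
    ("user_a", "POSEMO", 0.04),
    ("user_a", "_intercept", 1.0),
    ("user_b", "_intercept", 1.0),  # user_b never used a POSEMO word
]

groups = sorted({group for group, _, _ in sparse_rows})
feats = sorted({feat for _, feat, _ in sparse_rows})

# Start every (group, feat) cell at zero, then fill in the stored values.
wide = {group: {feat: 0.0 for feat in feats} for group in groups}
for group, feat, value in sparse_rows:
    wide[group][feat] = value

print(wide["user_b"])
```

Because `user_b` has an `_intercept` row, the user shows up in the wide table at all, and the missing `POSEMO` cell becomes an explicit 0 instead of silently dropping the user.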


💡💡💡 **Hint**: In this question, you need to extract 1gram features, then extract dictionary features from `LIWC2015`, `labmt`, and `nrc`. Finally, you must construct a cross-correlations table between the three language dictionary features of interest (POSEMO, valence, and SENT), `age`, and `gender` (encoded as `is_female`).


## Question 4) Please calculate a Cohen's d effect size (standard deviation difference) in how genders 0 and 1 use LIWC POSEMO and NRC SENT. Which dictionary distinguishes better between the genders?

R commands -- cohen.d(...)
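
As a reminder of what Cohen's d measures, here is a plain-Python sketch of the textbook pooled-SD version on made-up scores. `cohen.d()` in R may differ in details (e.g., corrections or sign conventions), so treat this as an illustration of the arithmetic only.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Mean difference divided by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Made-up dictionary scores for two groups.
scores_gender_1 = [2.0, 3.0, 4.0]
scores_gender_0 = [1.0, 2.0, 3.0]

d = cohens_d(scores_gender_1, scores_gender_0)
print(round(d, 3))
```

Both toy groups have a variance of 1 and their means differ by 1, so the groups sit exactly one standard deviation apart.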


## Question 5) Following from the previous question, please produce 3 plots plotting LIWC (`POSEMO`), labmt (`valence`) and NRC (`SENT`) against author-age, and add a line-of-best-fit (either linear or LOESS).

R-commands you will need: `dbGetQuery`, `importFeat` (from the custom R source script), `merge`, `qplot/ggplot`, `+geom_smooth()` or `+geom_smooth(method = "lm")`, the `+theme_Publication()` to make it look kewl (from the R source script).

## Question 6) Can you regress labmt (valence) and NRC (SENT) against age, controlling for gender? Please provide standardized coefficients with p-values.

R-commands you will need: `dbGetQuery`, `importFeat` (from the R source script), `merge`, `lm`, and the `lm.beta` package to get betas. FYI: `lm.beta` works by taking the model output by `lm` and then standardizing it, like so: `model <- lm(...); model_stand <- lm.beta(model)`
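
If you are curious what a standardized coefficient "controlling for" a second variable is, here is a sketch in plain Python. For exactly two predictors, the standardized betas can be written in terms of the three pairwise Pearson correlations. The data are made up and the helper names are hypothetical; in the homework, `lm` plus `lm.beta` does this for you.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def standardized_betas(y, x1, x2):
    """Standardized coefficients of y ~ x1 + x2 from pairwise correlations."""
    r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    denom = 1 - r_12 ** 2
    beta_1 = (r_y1 - r_y2 * r_12) / denom
    beta_2 = (r_y2 - r_y1 * r_12) / denom
    return beta_1, beta_2


# Made-up data: the score depends only on age, and gender is unrelated to age.
age = [20, 30, 40, 50]
is_female = [0, 1, 1, 0]
score = [2 * a + 1 for a in age]

beta_age, beta_gender = standardized_betas(score, age, is_female)
print(round(beta_age, 3), round(beta_gender, 3))
```

In this toy case the score is a perfect linear function of age, so the age beta is 1 and the gender beta is 0. With real data, both betas will land somewhere in between.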


## Question 7) As you may recall, in a contribution in EMNLP (one of the top-3 NLP conferences), friend-of-the-lab Maarten Sap published dictionaries to predict age and gender from language. Please extract the data-driven dictionary for age and gender (`dlatk_lexica.dd_emnlp14_ageGender`). Please make two histograms: one for the predicted age, and one for the age in the outcomes table.


## Question 8)

(2 points)

(1pt) How well does Sap's age dictionary appear to work? Please plot author age against dictionary-predicted age, and add a line-of-best-fit (either linear or LOESS).

R-commands you will need: `dbGetQuery`, `importFeat` (from the R source script), `merge`, `qplot/ggplot`, `+geom_smooth()` or `+geom_smooth(method = "lm")`, the `+ theme_Publication()` (from the R source script)

(1pt) What's the standardized statistical association between user age and Sap-dictionary-age, controlling for gender (what's the beta)?

## Question 9) (EXTRA CREDIT) Repeat the above, but produce two lines of best fit, one for each author gender. Which one has a (slightly) steeper slope?

Hint: use ggplot's `group`

## Question 10) How many years is the dictionary 'prediction' off, on average? Please compute the mean-absolute error of the dictionary age 'predictions'.

FYI: `MAE = mean(abs(var1 - var2))`. Does this seem like a lot to you?
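
The mean absolute error is the average size of the prediction errors, ignoring their direction. A quick Python sketch with made-up ages (not the homework data):

```python
# Made-up self-reported and dictionary-predicted ages for three users.
actual_age = [20, 30, 40]
predicted_age = [23, 28, 45]

# Absolute error per user, then the mean across users.
abs_errors = [abs(pred - actual) for pred, actual in zip(predicted_age, actual_age)]
mae = sum(abs_errors) / len(abs_errors)

print(abs_errors, round(mae, 2))
```

The errors of 3, 2, and 5 years average out to about 3.33 years, so over- and under-predictions do not cancel each other.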


## Question 11) Let's investigate how well the gender prediction dictionary works.

Please make a histogram of the language-predicted gender values, and group it by the self-reported genders, such that you see overlapping histograms of predicted values, one for each (self-reported) gender.

What would be a good threshold on the predicted gender value to distinguish well between the genders?

Please draw this as a line in your combined histogram.

Hint: `ggplot:: + geom_vline()`


## Question 12) Using the reasonable threshold on the continuous gender prediction that you've picked, compute a confusion matrix.

You do this by first "declaring" the users above or below your threshold to have been classified as 1 or 0. Now that you've turned the continuous gender score into a "gender classification," you can compare these predicted genders against the self-reported genders.

Please report accuracy, precision and recall for your choice of the threshold, and the F1 score, which combines precision and recall conservatively -- giving an average that's closer to the lower number (`F1 = 2 / (1/recall + 1/precision)`).

Does this seem like satisfactory prediction performance to you?

R hints:
```
df$a_bin <- ifelse(df$a > X, 1, 0)  # 1 if the score is above your threshold X
confusionMatrix(... reference = ground_truth)
```

`confusionMatrix()` comes from `caret` package!
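
To check your understanding of what `confusionMatrix()` reports, here is the same arithmetic done by hand in Python on made-up scores. The threshold of 0 and the choice that "above the threshold" means class 1 are assumptions for this toy example only.

```python
# Made-up continuous gender scores and self-reported labels.
predicted_scores = [1.2, 0.4, -0.3, -1.1, 0.8, -0.2]
self_reported = [1, 1, 1, 0, 0, 0]
threshold = 0.0  # hypothetical threshold

# Turn the continuous score into a 0/1 classification.
predicted = [1 if score > threshold else 0 for score in predicted_scores]

# Count the four cells of the confusion matrix (class 1 is the "positive" class).
pairs = list(zip(predicted, self_reported))
tp = sum(1 for p, t in pairs if p == 1 and t == 1)
fp = sum(1 for p, t in pairs if p == 1 and t == 0)
fn = sum(1 for p, t in pairs if p == 0 and t == 1)
tn = sum(1 for p, t in pairs if p == 0 and t == 0)

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 / (1 / recall + 1 / precision)

print(tp, fp, fn, tn, round(accuracy, 3), round(f1, 3))
```

When precision and recall are equal, as in this toy case, F1 equals both. When they differ, F1 sits closer to the lower of the two.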

## ‼️ **Save your database** ‼️

Let's save all this work as a new database file in your GDrive `sqlite_databases` folder!

In [115]:
database = 'homework_05'

In [116]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# copy the database file to your Drive
!cp -f "sqlite_data/{database}.db" "/content/drive/MyDrive/sqlite_databases/"

print(f"✅ Database '{database}.db' has been copied to your Google Drive.")

Mounted at /content/drive
✅ Database 'homework_05.db' has been copied to your Google Drive.
