<a href="https://colab.research.google.com/github/CompPsychology/psych290_colab_public/blob/main/notebooks/week-07/W7_HW7_DLATK_modeling_extracting_correlating_topics_R_(dla_tutorial%2Cblog_authorshop%2Csvitlana).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# W7 Homework 7 --  Extracting, correlating (DLATK) and plotting topics (R)

(c) Johannes Eichstaedt & the World Well-Being Project, 2023.

✋🏻✋🏻 NOTE - You need to create a copy of this notebook before you work through it. Click the "Save a copy in Drive" option in the File menu, and save it to your Google Drive.

✉️🐞 If you find a bug/something doesn't work, please slack us a screenshot, or email johannes.courses@gmail.com.

Please set up Colab first, as usual!

Every question is 1 point unless otherwise specified.

## Setting up Colab with DLATK and SQLite

### Install DLATK

In [None]:
# installing DLATK and necessary packages
!git clone -b psych290 https://github.com/dlatk/dlatk.git
!pip install -r dlatk/install/requirements.txt
!pip install dlatk/
!pip install wordcloud langid jupysql gensim==4.3

### Download the custom R script

This GitHub repo contains our custom R script `psych290RcodeV1.R` (and copies of the CSVs for dla_tutorial and other tutorials)!

In [1]:
# this downloads the csvs & script we need for this tutorial
!git clone https://github.com/CompPsychology/psych290_data.git

Cloning into 'psych290_data'...
remote: Enumerating objects: 44, done.
remote: Counting objects: 100% (2/2), done.
remote: Total 44 (delta 1), reused 1 (delta 1), pack-reused 42 (from 1)
Receiving objects: 100% (44/44), 86.41 MiB | 9.14 MiB/s, done.
Resolving deltas: 100% (8/8), done.
Updating files: 100% (14/14), done.


💡 BTW, if you ever need a copy of psych290RcodeV1.R (RStudio at home!), you can download it here!

### Mount Google Drive and copy databases

In [2]:
database = "dla_tutorial"

In [3]:
# Mount Google Drive & copy to Colab

# connects & mounts your Google Drive to this colab space
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# this copies dlatk_lexica.db from your Google Drive to Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/dlatk_lexica.db" "sqlite_data"

# this copies {database}.db from your Google Drive to Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/{database}.db" "sqlite_data"

Mounted at /content/drive


### Setup database connection

In [4]:
# loads the %%sql extension
%load_ext sql

# connects the extension to the database - mounts both databases as engines
from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")
dlatk_lexica_engine = create_engine(f"sqlite:///sqlite_data/dlatk_lexica.db?charset=utf8mb4")

# attaches the dlatk_lexica.db so tutorial_db_engine can query both databases
from IPython import get_ipython
from sqlalchemy import event

# auto‑attach the lexica db whenever tutorial_db_engine connects
@event.listens_for(tutorial_db_engine, "connect")
def _attach_lexica(dbapi_conn, connection_record):
    dbapi_conn.execute("ATTACH DATABASE 'sqlite_data/dlatk_lexica.db' AS dlatk_lexica;")

%sql tutorial_db_engine

#set the output limit to 50
%config SqlMagic.displaylimit = 50
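
To see what the `ATTACH DATABASE` statement in the cell above does, here is a minimal sketch using Python's built-in `sqlite3` module. The file names, the alias `lexica_demo`, and the table `demo_lexicon` are made up for illustration; they are not part of the course databases.

```python
import os
import sqlite3
import tempfile

# Hypothetical stand-ins for {database}.db and dlatk_lexica.db
tmp_dir = tempfile.mkdtemp()
main_path = os.path.join(tmp_dir, "main_demo.db")
lexica_path = os.path.join(tmp_dir, "lexica_demo.db")

# Create the second database file with one small table in it
lexica_conn = sqlite3.connect(lexica_path)
lexica_conn.execute("CREATE TABLE demo_lexicon (term TEXT, category TEXT)")
lexica_conn.execute("INSERT INTO demo_lexicon VALUES ('happy', 'POS')")
lexica_conn.commit()
lexica_conn.close()

# Connect to the first database and attach the second one under an alias
conn = sqlite3.connect(main_path)
conn.execute("ATTACH DATABASE ? AS lexica_demo", (lexica_path,))

# Tables in the attached file are addressed as alias.table
attached_rows = conn.execute(
    "SELECT term, category FROM lexica_demo.demo_lexicon"
).fetchall()
print(attached_rows)
conn.close()
```

An attachment belongs to a single connection, which is presumably why the notebook re-runs the `ATTACH` inside a `connect` event listener every time the engine opens a new connection.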

### (ONLY if needed: SOFT RELOAD): If you have a **"database lock"** problem

If you face a "database locked" issue:
  1. restart the session (Runtime ==> Restart Session)
  2. run this cell to get set back up!

This block is your friend! ☺️ If you are working with other databases and you get a "database locked" error, (1) restart the session and (2) run the cell below, changing the `database` variable to your database name. For example, if you're working with blog_authorship.db, set `database="blog_authorship"`!

In [1]:
database = "dla_tutorial"

%reload_ext sql

from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")
dlatk_lexica_engine = create_engine(f"sqlite:///sqlite_data/dlatk_lexica.db?charset=utf8mb4")

# set the output limit to 50
%config SqlMagic.displaylimit = 50

from IPython import get_ipython
from sqlalchemy import event

# auto‑attach the lexica db whenever tutorial_db_engine connects
@event.listens_for(tutorial_db_engine, "connect")
def _attach_lexica(dbapi_conn, connection_record):
    dbapi_conn.execute("ATTACH DATABASE 'sqlite_data/dlatk_lexica.db' AS dlatk_lexica;")

%sql tutorial_db_engine

## 1) `dla_tutorial` database

In the tutorials we worked with 2000 Facebook topics.
Here, you will use 500 FB topics that were modeled over a large Facebook dataset (14 million statuses from 2009-2011). The topic tables are:
* `fb22_all_500t_cp`, and
* `fb22_all_500t_freq`.

### 1.1) Extract user level **500 FB topics** using the above tables.

#### Answer

### 1.2) Correlate the topics against `occu`, controlling for `age` and `gender`

As part of the answer to this question, do include the flags `--topic_tagcloud --make_topic_wordclouds --tagcloud_colorscheme blue` (or other colors as you prefer), to produce the wordclouds for the next question.

Please don't forget to set a group_freq_thresh.

#### Answer

### 1.3) Show top 8 topics for one occupation of your choice.

For this question, feel free to use the `print_wordclouds` function below, which can filter by occupation. Very nice.

In [51]:
import glob
import os
import math
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

def print_wordclouds(wordcloud_folder, occupation, num_topics):

    images = glob.glob(os.path.join(wordcloud_folder, 'occu__{}'.format(occupation), '*.png'))
    images = [image for image in images if 'beta' in image]
    num_topics = num_topics if num_topics <= len(images) else len(images)

    def get_coeff(x):
        return float('.'.join(x.split('/')[-1].split('-')[1].split('.')[:2]))

    top = sorted(images, key=lambda x: get_coeff(x))[::-1][:num_topics]

    images_per_row = 4
    fig, axes = plt.subplots(math.ceil(num_topics/images_per_row), images_per_row, figsize=(18, 8))
    axes = axes.flatten()

    for index, image in enumerate(top):
        axes[index].set_axis_off()
        axes[index].set_title(os.path.basename(image))
        axes[index].imshow(mpimg.imread(image))

In [58]:
# WORDCLOUD_FOLDER = ''
# OCCU = ''
# NUM_TOPICS = 8

# print_wordclouds(WORDCLOUD_FOLDER, OCCU, NUM_TOPICS)
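
If you are curious how `print_wordclouds` decides which images are "top", the sketch below isolates its `get_coeff` expression and runs it on a few file names. The paths are hypothetical and only chosen so that the expression parses; check your own output folder for the real naming.

```python
def get_coeff(path):
    # Same expression as in print_wordclouds: take the file name, keep the part
    # after the first '-', and read its first two dot-separated pieces as a float
    return float('.'.join(path.split('/')[-1].split('-')[1].split('.')[:2]))

# Hypothetical file names, made up for illustration only
demo_paths = [
    "out/occu__demo/beta-0.12.a.png",
    "out/occu__demo/beta-0.30.b.png",
    "out/occu__demo/beta-0.05.c.png",
]

# Sort by the parsed coefficient, largest first, as print_wordclouds does
ranked = sorted(demo_paths, key=get_coeff)[::-1]
coeffs = [get_coeff(p) for p in ranked]
print(ranked)
print(coeffs)
```

Because the expression relies on the position of `-` and `.` in the file name, it fails with an error on names that do not follow that layout.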

**Answer:**

## 2) R is your friend! (5 points)

Using R, please plot the use of two topics over `age` in the `dla_tutorial` dataset. The literature suggests that people become more positive and other-oriented with age.

First, based on nothing but your intuition as a human being alive on this earth, please pick two key words that you think indicate a prosocial and an antisocial orientation.

### First some R setup!

In [60]:
%load_ext rpy2.ipython

In [None]:
# this is equivalent to install.packages() but much faster!!
!apt-get update -qq
!apt-get install -y r-cran-rsqlite r-cran-ggthemes r-cran-reshape2 r-cran-psych

Now let's set the database we want R to connect to (dla_tutorial!)

In [62]:
database='dla_tutorial'

# constructs the pathname
database_path = f"sqlite_data/{database}.db"
database_path

'sqlite_data/dla_tutorial.db'

In [None]:
%%R  -i database_path

# packages
require(tidyverse)
require(ggthemes)
require(reshape2)
require(psych)

# the custom R functions we have written to work with DLATK
source('./psych290_data/helper_files/psych290RcodeV1.R')

# load DBI for generic database functions and RSQLite as the SQLite backend
library(DBI)
library(RSQLite)

# connects to a file-based sqlite DB
db_con <- dbConnect(RSQLite::SQLite(),
                    dbname = database_path)

# enforce UTF-8 encoding
dbExecute(db_con, "PRAGMA encoding = 'UTF-8';")

### 2.1) Find two topics (among the 500 FB topics) that have your two key words among the most prevalent.

Using SQL, get the top 10 words from the frequency representations of those topics, and confirm that the topics represent what you had in mind. If your top keywords turn out to not work well, feel free to think of other ones.

Hint: running `%sql dlatk_lexica_engine` may be handy here.
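
As a rough illustration of the kind of query this question asks for, here is a small self-contained sketch on an in-memory SQLite table. It assumes a topic table laid out as (`term`, `category`, `weight`), with `category` holding the topic id. The table name `demo_topics_freq` and all rows are made up, so adapt the names to what you actually find in the lexica database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical miniature of a topic table; the column layout is an assumption
conn.execute("CREATE TABLE demo_topics_freq (term TEXT, category TEXT, weight REAL)")
conn.executemany(
    "INSERT INTO demo_topics_freq VALUES (?, ?, ?)",
    [
        ("thanks", "7", 120.0),
        ("grateful", "7", 80.0),
        ("help", "7", 95.0),
        ("hate", "12", 150.0),
        ("stupid", "12", 60.0),
    ],
)

def topics_containing(connection, keyword):
    # Which topics list this keyword at all?
    rows = connection.execute(
        "SELECT DISTINCT category FROM demo_topics_freq WHERE term = ? ORDER BY category",
        (keyword,),
    ).fetchall()
    return [row[0] for row in rows]

def top_words(connection, topic_id, n):
    # The n highest-weighted words of one topic
    return connection.execute(
        "SELECT term, weight FROM demo_topics_freq "
        "WHERE category = ? ORDER BY weight DESC LIMIT ?",
        (topic_id, n),
    ).fetchall()

print(topics_containing(conn, "thanks"))
print(top_words(conn, "7", 2))
```

The same two-step pattern (first find candidate topics for a keyword, then list the top words of each candidate) can be written directly in a `%%sql` cell.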

**Answer:**

### 2.2) You have extracted 1grams above. Based on that, **shortlist the outcome table** down to users who have more than 500 words

**FYI:** Refer to Tutorial 07 for how to create those tables: merge meta-features onto the outcome table.
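
The logic of "keep only users above a word-count threshold" can be sketched on toy data as follows. This is not the Tutorial 07 recipe. It is an alternative illustration that assumes a 1gram feature table with (`group_id`, `feat`, `value`) columns, and the tables `demo_outcomes` and `demo_1gram` and their rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Invented miniature tables, for illustration only
conn.execute("CREATE TABLE demo_outcomes (user_id INTEGER, age INTEGER)")
conn.execute("CREATE TABLE demo_1gram (group_id INTEGER, feat TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO demo_outcomes VALUES (?, ?)",
    [(1, 25), (2, 40), (3, 33)],
)
conn.executemany(
    "INSERT INTO demo_1gram VALUES (?, ?, ?)",
    [(1, "the", 400), (1, "day", 200), (2, "the", 300), (3, "the", 450), (3, "sun", 51)],
)

# Sum the word counts per user, then keep outcome rows whose total exceeds 500
shortlist_query = """
    SELECT o.user_id, o.age, w.total_words
    FROM demo_outcomes AS o
    JOIN (
        SELECT group_id, SUM(value) AS total_words
        FROM demo_1gram
        GROUP BY group_id
    ) AS w ON w.group_id = o.user_id
    WHERE w.total_words > 500
    ORDER BY o.user_id
"""
shortlisted = conn.execute(shortlist_query).fetchall()
print(shortlisted)
```

Note that the filter is strictly "more than 500", so a user with exactly 500 words would be dropped.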

**Answer**

### 2.3) Using the 500-FB-topic feat table that you extracted in Q1.1, in R, `merge` the shortlisted outcome table with the subset of the topic feature table that contains your topics (from Q2.1), and **plot your topics over age**.

Hint: you'll need the R function `importTopicFeat()`

**Answer**

### 2.4) Please test if linear trends are significant (you can do that with `lm` or with a humble correlation).

**Hint:** `cor.test()`
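
For intuition about what a correlation test checks, here is a small Python sketch (the homework itself should be done in R). It computes the Pearson correlation and the matching t statistic, $t = r\sqrt{(n-2)/(1-r^2)}$ with $n-2$ degrees of freedom. In simple linear regression with one predictor, this is the same statistic as the t-test on the slope, which is why `lm` and a plain correlation lead to the same conclusion. The ages and topic-use values are made up.

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation from centered sums of products
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    # t statistic for H0: no linear association, with n - 2 degrees of freedom
    # (undefined when r is exactly 1 or -1)
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Made-up toy data: age and relative use of one topic
demo_age = [20, 30, 40, 50, 60]
demo_topic_use = [0.010, 0.012, 0.011, 0.015, 0.016]

r = pearson_r(demo_age, demo_topic_use)
t = t_statistic(r, len(demo_age))
print(r, t)
```

The p-value comes from comparing `t` against a t distribution, which R's `cor.test()` does for you.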

**Answer**

### 2.5) Write a **half sentence** about whether you see support for your hypothesis. (If not, you still get full points: science is about doing methods correctly, not about finding what you wanted. #plosOne)

**Answer**

Write here:

## 3) Annotated Emotion dataset

For these questions, we will use the Annotated Emotion dataset by [Svitlana Volkova](https://www.cs.jhu.edu/~svitlana/) from Tutorial 11, which is in the `svitlana` database. It contains ~29k tweets annotated with emotions. Here, the documents (tweets) are directly annotated with one of a few emotions. We have also added the `outcomes` table, which lists the emotion for a given `message_id`.

Citation for the dataset -
```
Volkova, S., & Bachrach, Y. (2016, August). Inferring perceived demographics from user emotional tone and user-environment emotional contrast. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1567-1578).
```

### First, set up the database connection!

In [86]:
# name of database
database = 'svitlana'

In [87]:
# Mount Google Drive & copy to Colab

# connects & mounts your Google Drive to this colab space
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# this copies {database}.db from your Google Drive to Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/{database}.db" "sqlite_data"

Mounted at /content/drive


The usual SQLite database connection procedure!

In [10]:
# Svitlana database setup

database = 'svitlana'

# reloads the %%sql extension
%reload_ext sql

# connects the extension to the database - mounts the database as an engine
from sqlalchemy import create_engine
svitlana_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")

# use engine (activates the connection to new database!!)
%sql svitlana_engine

# set the output limit to 50
%config SqlMagic.displaylimit = 50

☝🏻 **FYI**: if you face the **database locked** error while using this database, restart your session (Runtime ==> Restart session), then run the last cell again!

### 3.1) (2pts) First, using Tutorial 11, please model 50 topics on the Svitlana dataset.

Choose an alpha of 2, and remove enough stop words so that "him" is not included.

Note: you *may* get a **"database locked"** error when you run `--estimate_lda_topics`. If that happens, (1) restart (Runtime ==> Restart session), (2) run the Svitlana database setup cells above, then (3) try your command again! 🙏

Hint: for looking at stopwords, extend your %sql row limit

In [101]:
%config SqlMagic.displaylimit = 200

**Answer:**

### 3.2) (2 pts) Please extract your 50 topics, and correlate them against emotions, **at the message level**. Show the 8 most positively correlated topics for an emotion of your choice.

Don't forget to set your `--lexicondb` to `svitlana`, and the group_freq_thresh appropriately for short documents.

Remember that the `emotion` outcome is not numerical.
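
A label like an emotion name cannot be correlated with topic use directly. One common approach, sketched below in plain Python, is to turn the single categorical column into one 0/1 indicator column per category and correlate each indicator separately. The labels here are made up and are not taken from the dataset.

```python
def to_binary_indicators(labels):
    # One 0/1 list per distinct category, aligned with the input order
    categories = sorted(set(labels))
    return {
        category: [1 if label == category else 0 for label in labels]
        for category in categories
    }

# Made-up labels for four messages, for illustration only
demo_emotions = ["joy", "anger", "joy", "fear"]

indicators = to_binary_indicators(demo_emotions)
print(indicators)
```

Each message gets a 1 in exactly one indicator column, so every emotion can then be treated as its own binary outcome.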

You can use the below function to plot 8 word clouds if you point it to an emotion subfolder in the DLATK output.

In [18]:
import glob
import os
import math
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

def print_wordclouds(wordcloud_folder, prefix, num_topics):

    image_list = glob.glob(os.path.join(wordcloud_folder, '*.png'))
    filtered = [image for image in image_list if prefix in image]

    def transform(x):
        return float('.'.join(x.split('/')[-1].split('-')[1].split('.')[:2]))

    top = sorted(filtered, key=transform)[::-1][:num_topics]

    images_per_row = 4
    fig, axes = plt.subplots(math.ceil(num_topics/images_per_row), images_per_row, figsize=(18, 8))
    axes = axes.flatten()

    for index, image in enumerate(top):
        axes[index].set_axis_off()
        axes[index].set_title(os.path.basename(image))
        axes[index].imshow(mpimg.imread(image))

In [None]:
# WORDCLOUD_FOLDER = ''
# PREFIX = 'pos'
# NUM_TOPICS = 8

# print_wordclouds(WORDCLOUD_FOLDER, PREFIX, NUM_TOPICS)

**Answer**

## 4) (1pt extra credit 🚀) `blog_authorship` database

If you feel like you want extra practice extracting and correlating topics.

### First, set up the database connection!

In [22]:
# define database name
database='blog_authorship'

This may take a minute or two (it's a large database)!

In [23]:
# Mount Google Drive & copy to Colab

# connects & mounts your Google Drive to this colab space
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# this copies blog_authorship.db from your Google Drive to Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/blog_authorship.db" "sqlite_data"

Mounted at /content/drive


In [24]:
database='blog_authorship'

# reloads the %%sql extension
%reload_ext sql

# connects the extension to the database - mounts the database as an engine
from sqlalchemy import create_engine
blog_authorship_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")

# use engine (this activates the connection!)
%sql blog_authorship_engine

#set the output limit to 50
%config SqlMagic.displaylimit = 50

☝🏻 **FYI**: if you face the **database locked** error while using this database, restart your session (Runtime ==> Restart session), then run the last cell again!

### Extract FB topics

In the previous homework, you extracted 1gram feature tables on this database (commented out below in case you need to run it again!).



In [None]:
database = 'blog_authorship'
msgs_table = 'blog_authorship_msgs'

In [None]:
# !dlatkInterface.py \
#   --corpdb {database} \
#   --corptable {msgs_table} \
#   --correl_field user_id \
#   --group_freq_thresh 500 \
#   --add_ngrams -n 1

This time, you need to extract 2,000 Facebook topic features (`fb2000_cp` and `fb2000_freq_t50ll`).

Below is the command to extract the topic features!

In [38]:
database = 'blog_authorship'
msgs_table = 'blog_authorship_msgs'
feat_1gram_table = 'feat$1gram$blog_authorship_msgs$user_id'
topics_cp_table = 'fb2000_cp'

Extract. This takes ~3 min.

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --add_lex_table -l {topics_cp_table} \
    --weighted_lexicon

### 4.1) Correlate the topics against `occu`, controlling for `age` and `gender`

For an occupation of your choice, show the top 8 most correlated topics.

⏰ This will take ~22 minutes.

You can use the Python function below to show the top 8 topic wordclouds:

In [62]:
import glob
import os
import math
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

def print_wordclouds(wordcloud_folder, occupation, num_topics):

    images = glob.glob(os.path.join(wordcloud_folder, 'occu__{}'.format(occupation), '*.png'))
    images = [image for image in images if 'beta' in image]
    num_topics = num_topics if num_topics <= len(images) else len(images)

    def get_coeff(x):
        return float('.'.join(x.split('/')[-1].split('-')[1].split('.')[:2]))

    top = sorted(images, key=lambda x: get_coeff(x))[::-1][:num_topics]

    images_per_row = 4
    fig, axes = plt.subplots(math.ceil(num_topics/images_per_row), images_per_row, figsize=(18, 8))
    axes = axes.flatten()

    for index, image in enumerate(top):
        axes[index].set_axis_off()
        axes[index].set_title(os.path.basename(image))
        axes[index].imshow(mpimg.imread(image))

In [None]:
# WORDCLOUD_FOLDER = ''
# OCCU = ''
# NUM_TOPICS = 8

# print_wordclouds(WORDCLOUD_FOLDER, OCCU, NUM_TOPICS)

**Answer**

## ‼️ **Save your database and/or output files** ‼️

Let's save all this work as a new database file in your GDrive `sqlite_databases` folder!

First **`dla_tutorial`**!

In [65]:
database = 'dla_tutorial'

In [None]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# copy the database file to your Drive
!cp -f "sqlite_data/{database}.db" "/content/drive/MyDrive/sqlite_databases/"

print(f"✅ Database '{database}.db' has been copied to your Google Drive.")

Second **`svitlana`**.

In [67]:
database = 'svitlana'

In [None]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# copy the database file to your Drive
!cp -f "sqlite_data/{database}.db" "/content/drive/MyDrive/sqlite_databases/"

print(f"✅ Database '{database}.db' has been copied to your Google Drive.")

Third **`blog_authorship`** (if you did Q4)

In [69]:
database = 'blog_authorship'

In [70]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# copy the database file to your Drive
!cp -f "sqlite_data/{database}.db" "/content/drive/MyDrive/sqlite_databases/"

print(f"✅ Database '{database}.db' has been copied to your Google Drive.")

Mounted at /content/drive
✅ Database 'blog_authorship.db' has been copied to your Google Drive.


We generated a lot of output in this Homework! Here's how you can save it to your Drive if you want to!

⏰ This will take a minute!

In [71]:
OUTPUT_FOLDER = './output_hw7'

In [72]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Copy the output folder to your Drive (-r makes it copy the folder and all files/folders inside)
!cp -f -r {OUTPUT_FOLDER} "/content/drive/MyDrive/"

print(f"✅ '{OUTPUT_FOLDER}' has been copied to your Google Drive.")

Mounted at /content/drive
✅ './output_hw7' has been copied to your Google Drive.
