<a href="https://colab.research.google.com/github/CompPsychology/psych290_colab_public/blob/main/notebooks/week-08/W8_mini_tutorial_ImportandUseNewData_R_DLATK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# W8 Mini tutorial - Using new data in DLATK (plus some analysis in R)

(c) Samuel Campione, Johannes Eichstaedt, and the World Well-Being Project, 2025.

✋🏻✋🏻 NOTE - You need to create a copy of this notebook before you work through it. Click the "Save a copy in Drive" option in the File menu, and save it to your Google Drive.

✉️🐞 If you find a bug/something doesn't work, please slack us a screenshot, or email johannes.courses@gmail.com.


This mini tutorial will walk you through a standard process for
working with new data in DLATK!

💡 Refer to [Tutorial 07](https://github.com/CompPsychology/psych290_colab_public) if you need more in-depth guidance on using R with SQLite & DLATK!

## Setup

### a) DLATK and SQLite

In [None]:
# installing DLATK and necessary packages
!git clone -b psych290 https://github.com/dlatk/dlatk.git
!pip install dlatk/
!pip install wordcloud langid jupysql

### b) Setup R

In [None]:
# load %%R extension
%load_ext rpy2.ipython

In [None]:
# this is equivalent to install.packages() but much faster!!
!apt-get update

!apt install -y \
    r-cran-rsqlite \
    r-cran-ggthemes \
    r-cran-reshape2 \
    r-cran-psych \
    r-cran-apatables

You'll probably need these custom R functions!

In [None]:
# this downloads the R scripts we need
!git clone https://github.com/CompPsychology/psych290_data.git

In [None]:
%%R

# load packages
library(tidyverse)
library(ggthemes)
library(reshape2)
library(psych)
library(apaTables)

#some options
options(repr.plot.width=20,repr.plot.height=10)

# and the custom R functions we have written to work with DLATK
source('./psych290_data/helper_files/psych290RcodeV1.R')

### c) Already set up `database.db`?

⚠️ If you're returning to this notebook, or you've already made a database file for your data, copy it over from your Google Drive with the code below, then skip Steps 1 and 2 and go straight to Step 3!

```
database='your_db'

# Mount Google Drive & copy to Colab
# connects & mounts your Google Drive to this colab space
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

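# make sure the target folder exists (otherwise cp would create a file named sqlite_data)
!mkdir -p sqlite_data

# this copies dlatk_lexica.db from your Google Drive to Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/dlatk_lexica.db" "sqlite_data"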

# this copies {database}.db from your Google Drive to Colab
!cp -f "/content/drive/MyDrive/sqlite_databases/{database}.db" "sqlite_data"
```

If not, then continue on!

## Step 1) Upload your files!

You can upload files through the file tree in the left sidebar, or by running the cell below:

In [None]:
from google.colab import files
uploaded = files.upload()

## Step 2) Write CSVs to SQLite database

### Method (a): using DLATK `csvToSQLite()`

🤓☝🏻 This is probably the simplest way, but it assumes your tables are formatted correctly for DLATK (e.g., `msgs` has `message_id` and `message` columns).

P.S. You can preprocess your tables in Excel, export them to CSV, and then use this method!
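Before importing, it can help to confirm that your messages CSV really has the columns DLATK expects. Here is a minimal sketch using only Python's standard library. The helper name `missing_columns` is hypothetical (it is not part of DLATK), and the required columns are just the two named above.

```python
import csv
import io

def missing_columns(csv_file, required=("message_id", "message")):
    """Return the required column names that are absent from the CSV header."""
    # hypothetical helper, not a DLATK function
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]

# Demo on an in-memory CSV. For a real file, pass
# open("./msgs.csv", newline="", encoding="utf-8") instead.
sample = io.StringIO(
    "message_id,user_id,created_date,message\n"
    "1,42,2004-05-01,hello world\n"
)
print(missing_columns(sample))  # an empty list means both columns are present
```

If the list is not empty, rename the columns in your CSV before running the import.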

In [None]:
# put your desired database name here!
database = 'your_db'
database_path = f"sqlite_data/{database}" # constructs the pathname

# path to CSVs you just uploaded
msgs_csv_path = "./msgs.csv"
outcomes_csv_path = "./outcomes.csv"
# ... however many you want/need

In [None]:
import os
from dlatk.tools.importmethods import csvToSQLite

csvToSQLite(msgs_csv_path, database_path, "msgs")

csvToSQLite(outcomes_csv_path, database_path, "outcomes")

Importing data, reading ./msgs.csv file
Reading remaining 741 rows into the table...
Importing data, reading ./outcomes.csv file
Reading remaining 100 rows into the table...


SQL Query: CREATE TABLE msgs (message_id INT, user_id INT, created_date VARCHAR(15), message LONGTEXT);
SQL Query: CREATE TABLE outcomes (user_id INT, gender VARCHAR(15), age INT, occu VARCHAR(31), is_indUnk INT);


### Method (b): using R to import data

This method lets you work in R if you're more comfortable there, but it takes more steps.

In [None]:
# put your desired database name here
database = 'your_db'

database_path = f"sqlite_data/{database}.db" # constructs the pathname

In [None]:
%%R -i database_path

# load DBI for generic database functions and RSQLite as the SQLite backend
library(DBI)
library(RSQLite)

# connects to a file-based sqlite DB
db_con <- dbConnect(RSQLite::SQLite(), dbname = database_path)

# enforce UTF-8 encoding
dbExecute(db_con, "PRAGMA encoding = 'UTF-8';")

[1] 0


In [None]:
%%R

msgs <- read.csv("./msgs.csv")
outcomes <- read.csv("./outcomes.csv")


# rename columns if needed
# colnames(msgs) <- c("message_id", "user_id", "created_date", "message", ...)

# save to database
dbWriteTable(db_con, "msgs", msgs, overwrite=TRUE, row.names=FALSE, encoding="UTF-8")

dbWriteTable(db_con, "outcomes", outcomes, overwrite=TRUE, row.names=FALSE, encoding="UTF-8")

## Step 3) Connect to %sql

Now you can check that everything was added successfully by looking into the database using %sql!

☝🏻 We need to mount your new `database.db` file to an engine using `create_engine(...)` and then connect that engine to the %sql extension so you can do queries in a `%%sql` cell.

In [None]:
database = "your_db"

In [None]:
# loads the %%sql extension
%load_ext sql

from sqlalchemy import create_engine
# rename the engine to whatever you want
your_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db?charset=utf8mb4")

# connect it to the %sql extension
%sql your_db_engine

#set the output limit to 50
%config SqlMagic.displaylimit = 50

A quick check if your tables are there.

In [None]:
%sqlcmd tables

Amazing! 🤩
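If you'd also like to confirm that the rows made it in (not just the tables), here is a rough sketch using Python's built-in `sqlite3` module. The helper `table_row_counts` is hypothetical, and the demo runs on a throwaway in-memory database with toy data so it works anywhere.

```python
import sqlite3

def table_row_counts(con):
    """Map each table name in the connected database to its row count."""
    names = [row[0] for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    # table names are double-quoted so names with special characters still work
    return {name: con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names}

# Demo on an in-memory database. For your own data, connect with
# sqlite3.connect("sqlite_data/your_db.db") instead.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE msgs (message_id INT, user_id INT, message TEXT)")
con.execute("CREATE TABLE outcomes (user_id INT, age INT)")
con.executemany("INSERT INTO msgs VALUES (?, ?, ?)",
                [(1, 10, "hi"), (2, 10, "bye"), (3, 11, "ok")])
con.execute("INSERT INTO outcomes VALUES (10, 30)")

counts = table_row_counts(con)
print(counts)  # {'msgs': 3, 'outcomes': 1}
```

Compare the counts against the number of rows in your CSVs to catch a truncated import early.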

⚠️ Let's add *indexes* and then move on to language analysis!

In [None]:
%%sql

-- rename the msgs and outcomes tables to whatever your tables are named

CREATE INDEX idx_msgs_user_id ON msgs (user_id);
CREATE INDEX idx_msgs_message_id ON msgs (message_id);

CREATE INDEX idx_outcomes_user_id ON outcomes (user_id);
-- any other indexes you want to create
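Why bother with indexes? The sketch below (toy data in an in-memory database, not your real tables) asks SQLite for its query plan before and after creating an index on `user_id`. The `plan` helper is hypothetical, and the exact wording of the plan text varies across SQLite versions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE msgs (message_id INT, user_id INT, message TEXT)")
con.executemany("INSERT INTO msgs VALUES (?, ?, ?)",
                [(i, i % 5, f"post {i}") for i in range(100)])

query = "SELECT COUNT(*) FROM msgs WHERE user_id = 3"

def plan(con, sql):
    """Return the detail strings of SQLite's query plan for a statement."""
    # the fourth column of each EXPLAIN QUERY PLAN row holds the description
    return [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

before = plan(con, query)  # without an index: a full scan of msgs
con.execute("CREATE INDEX idx_msgs_user_id ON msgs (user_id)")
after = plan(con, query)   # with the index: a search that uses idx_msgs_user_id

print(before)
print(after)
```

With an index, SQLite can jump straight to the matching rows instead of reading the whole table, which matters once you group thousands of messages by `user_id`.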

## Step 4) Extract 1-grams

Let's extract unigrams at the user level.

🤓 Note: here is where you would put `--add_sent_per_row` to sentence tokenize each message! To sentence tokenize, you also need to `import nltk` and run `nltk.download('punkt_tab')` before running DLATK.

In [None]:
database = "your_db"
msgs_table = "msgs"

In [None]:
!dlatkInterface.py \
    --corpdb {database} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --add_ngrams -n 1
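To build intuition for what "unigrams at the user level" means, here is a simplified sketch: pool each user's messages, split them into tokens, and count. It assumes a plain lowercase whitespace split, which is much cruder than DLATK's own tokenizer, and the function name `user_unigrams` and the toy messages are made up for illustration.

```python
from collections import Counter, defaultdict

def user_unigrams(messages):
    """Count lowercase whitespace-separated tokens per user.

    messages: iterable of (user_id, text) pairs.
    Returns {user_id: Counter({token: count})}.
    """
    counts = defaultdict(Counter)
    for user_id, text in messages:
        counts[user_id].update(text.lower().split())
    return dict(counts)

# toy data standing in for the msgs table
msgs = [(1, "I love coffee"), (1, "coffee is life"), (2, "I prefer tea")]

for user_id, counter in user_unigrams(msgs).items():
    total = sum(counter.values())
    for token, value in sorted(counter.items()):
        # value is the raw count; value / total is the relative frequency for that user
        print(user_id, token, value, round(value / total, 3))
```

Each printed row is one (user, word) pair with its count and its share of that user's words, which is the kind of information the extracted feature table is meant to capture.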

## Step 5) Plotting & analysis in R

Great! Now let's get basic descriptives and visualizations for your data.

In [None]:
database = "your_db"
database_path = f"sqlite_data/{database}.db" # constructs the pathname

In [None]:
%%R -i database_path

# connects to a file-based sqlite DB
db_con <- dbConnect(RSQLite::SQLite(),
                    dbname = database_path)

# enforce UTF-8 encoding
dbExecute(db_con, "PRAGMA encoding = 'UTF-8';")

[1] 0


### Plot messages per user (histogram)

In [None]:
%%R

messages.byGroup <- dbGetQuery(db_con, "SELECT COUNT(*) AS message_count, user_id FROM msgs GROUP BY user_id;")

In [None]:
%R head(messages.byGroup)

| | message_count | user_id |
|---|---|---|
| 1 | 10 | 743739 |
| 2 | 10 | 878542 |
| 3 | 7 | 894444 |
| 4 | 10 | 956611 |
| 5 | 10 | 1117055 |
| 6 | 7 | 1190695 |


Now let's plot it, adjusting the labels and axis limits to fit your data.

In [None]:
%%R

qplot(messages.byGroup$message_count) +
theme_Publication() +
ylab("Number of users") +
xlab("Number of posts") +
xlim(0,150)

**Figure S1.** Histogram of blog posts per user

In [None]:
%R round(describe(messages.byGroup), 2)

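| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| message_count | 1.0 | 100.0 | 7.41 | 3.24 | 9.5 | 7.84 | 0.74 | 1.0 | 10.0 | 9.0 | -0.8 | -0.98 | 0.32 |
| user_id | 2.0 | 100.0 | 3488164.73 | 812335.05 | 3734964.0 | 3659462.38 | 451244.14 | 743739.0 | 4323647.0 | 3579908.0 | -1.98 | 3.47 | 81233.5 |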


✏️  "Users wrote an average of 7.41 (SD = 3.24, min = 1, max = 10) blog posts."

### Plot words per user

In [None]:
%%R

# this pulls the meta table from the DB
feat_meta <- dbGetQuery(db_con, "select * from `feat$meta_1gram$msgs$user_id`")

# this converts it into "wide format"
feat_meta_wide <- importFeat(feat_meta)

# this pulls the outcome table from the DB
outcomes <- dbGetQuery(db_con, "select * from outcomes")

# this merges the outcome table onto the meta-data table
# making sure to keep all rows from the outcome table (all.x)
outcomes_meta_merged <- merge(outcomes, feat_meta_wide, by.x = "user_id", by.y = "group_id", all.x = TRUE)

# save new merged table to DB
dbWriteTable(db_con, "outcomes_meta", outcomes_meta_merged, overwrite=TRUE, row.names=FALSE, encoding='UTF-8')
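If the `all.x = TRUE` part of the merge is unclear, the same idea can be sketched in Python as a SQL `LEFT JOIN`: every row of the outcome table survives, and users without a matching meta row get a missing value. The tables and the column `total_words` below are toy stand-ins, not your real data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE outcomes (user_id INT, age INT)")
con.execute("CREATE TABLE meta (group_id INT, total_words INT)")
con.executemany("INSERT INTO outcomes VALUES (?, ?)", [(1, 25), (2, 31), (3, 40)])
# user 2 has no language data, so there is no meta row for them
con.executemany("INSERT INTO meta VALUES (?, ?)", [(1, 1200), (3, 800)])

rows = con.execute("""
    SELECT o.user_id, o.age, m.total_words
    FROM outcomes AS o
    LEFT JOIN meta AS m ON o.user_id = m.group_id
    ORDER BY o.user_id
""").fetchall()

print(rows)  # [(1, 25, 1200), (2, 31, None), (3, 40, 800)]
```

User 2 is kept with `None` for the word count, just as R would keep them with `NA`. Watch for these missing values when you compute descriptives.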

Now get words per user!

In [None]:
%R words.byGroup <- dbGetQuery(db_con, "SELECT user_id, _total1grams AS wordCount FROM outcomes_meta")
%R describe(words.byGroup$wordCount)

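| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1.0 | 100.0 | 2199.12 | 1863.092023 | 1648.5 | 1892.975 | 1192.7517 | 333.0 | 9618.0 | 9285.0 | 1.606647 | 2.627728 | 186.309202 |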


In [None]:
%%R

sum(words.byGroup$wordCount)

[1] 219912


✏️ "Users wrote an average of 2,199 words (SD = 1,863) for a total of 219,912 words."

In [None]:
%%R

qplot(words.byGroup$wordCount) +
theme_Publication() +
ylab("Number of users") +
xlab("Number of words") +
xlim(500, 25000)

**Figure S2.** Histogram of words per user. Only users with more than 500 words were retained in the study dataset *(if you added a group freq thresh)*

## Additional analyses

At this point, you can extract lexicon features (remember to mount your `dlatk_lexica.db` from Google Drive), extract or model topics, and then correlate these features with your outcomes.

**FYI**: Topic modeling (`--estimate_lda_topics`) requires `gensim`. You'll need to install it before modeling.

```
!pip install gensim==4.3
```

If in doubt, copy the setup from Tutorial 12: the everything tutorial!

## Save to Google Drive

As always, save your database somewhere you can use it again next time!

In [None]:
database = "your_db"

In [None]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# copy the database file to your Drive
!cp -f "sqlite_data/{database}.db" "/content/drive/MyDrive/sqlite_databases/"

print(f"✅ Database '{database}.db' has been copied to your Google Drive.")
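The `cp` command above assumes the `sqlite_databases` folder already exists in your Drive. If you're not sure it does, one possible alternative is a small Python helper that creates the destination folder first. This is only a sketch: `backup_file` is a hypothetical name, and the demo uses temporary folders in place of Colab and Google Drive.

```python
import os
import shutil
import tempfile

def backup_file(src_path, dest_dir):
    """Copy src_path into dest_dir, creating dest_dir first if needed."""
    os.makedirs(dest_dir, exist_ok=True)
    return shutil.copy2(src_path, dest_dir)  # returns the path of the new copy

# Demo with temporary folders standing in for Colab and Google Drive.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "your_db.db")
    with open(src, "wb") as f:
        f.write(b"demo")
    copied = backup_file(src, os.path.join(tmp, "drive", "sqlite_databases"))
    copied_ok = os.path.exists(copied)
    print(os.path.basename(copied), copied_ok)
```

In Colab, the equivalent call would be along the lines of `backup_file(f"sqlite_data/{database}.db", "/content/drive/MyDrive/sqlite_databases")`, run after mounting your Drive.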

If you want to save output to Google Drive:

In [None]:
OUTPUT_FOLDER = '' # assign the name here!

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Copy the output folder to your Drive (-r makes it copy the folder and all files/folders inside)
!cp -f -r {OUTPUT_FOLDER} "/content/drive/MyDrive/"

print(f"✅ '{OUTPUT_FOLDER}' has been copied to your Google Drive.")