<a href="https://colab.research.google.com/github/CompPsychology/psych290_colab_public/blob/main/notebooks/week-01/W1_HW1_SQL_NinjaTraining_(dla_tutorial).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# W1 Homework 1: SQL Ninja Fundamentals (2025-03)

(c) Johannes Eichstaedt & the World Well-Being Project, 2023.

✋🏻✋🏻 NOTE - You need to create a copy of this notebook before you work through it. Click the "Save a copy in Drive" option in the File menu, and save it to your Google Drive.

✉️🐞 If you find a bug/something doesn't work, please slack us a screenshot, or email johannes.courses@gmail.com.

## Setting up Colab with DLATK and SQLite

As discussed in the tutorials, we begin by setting up the Colab environment.

This will take ~1.5 to 2 minutes. If Colab warns you that this notebook was not authored by Google, click "Run anyway."

### Install packages

In [None]:
# installing DLATK and necessary packages
!git clone -b psych290 https://github.com/dlatk/dlatk.git
!pip install dlatk/
!pip install jupysql

Cloning into 'dlatk'...
remote: Enumerating objects: 6975, done.
remote: Counting objects: 100% (1063/1063), done.
remote: Compressing objects: 100% (138/138), done.
remote: Total 6975 (delta 987), reused 930 (delta 925), pack-reused 5912 (from 1)
Receiving objects: 100% (6975/6975), 62.36 MiB | 10.47 MiB/s, done.
Resolving deltas: 100% (4940/4940), done.
Processing ./dlatk
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: dlatk
  Building wheel for dlatk (setup.py) ... done
  Created wheel for dlatk: filename=dlatk-1.3.1-py3-none-any.whl size=35635829 sha256=21a7df4625cdfabac0ec0a623186e995e7106f90d6891ff10bfa2b7eec9dfe32
  Stored in directory: /tmp/pip-ephem-wheel-cache-l5d8xqjw/wheels/cc/c9/65/e1ecc64bac68518c07b286fe86921aa938e11a0c3a87d8ff93
Successfully built dlatk
Installing collected packages: dlatk
Successfully installed dlatk-1.3.1
Collecting jupysql
  Downloading jupysql-0.11.1-py3-none-any.whl.metadata (5.9 

### Download data and insert into SQLite database

In [None]:
# this downloads the csvs you need
!git clone https://github.com/CompPsychology/psych290_data.git

Cloning into 'psych290_data'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 11 (delta 0), reused 8 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (11/11), 13.84 MiB | 11.44 MiB/s, done.


Now, create a `username` variable which we use to name the database.

In [None]:
username = "your_name"

We then load the downloaded data into a database named [username].db in the sqlite_data folder.

In [None]:
# load the required package -- similar to library() function in R
import os
from dlatk.tools.importmethods import csvToSQLite

# store the complete path to the database -- sqlite_data/[username].db
database = os.path.join("sqlite_data", username)

# import CSVs into tables in this database
csvToSQLite(
    "psych290_data/dla_tutorial/msgs.csv",
    database,
    "dla_tutorial_msgs"
)

csvToSQLite(
    "psych290_data/dla_tutorial/blog_outcomes.csv",
    database,
    "dla_tutorial_outcomes"
)

SQL Query: CREATE TABLE dla_tutorial_msgs (message_id INT, user_id INT, date VARCHAR(31), created_time VARCHAR(31), message LONGTEXT);


Importing data, reading psych290_data/dla_tutorial/msgs.csv file
Reading 10000 rows into the table...
Reading 10000 rows into the table...
Reading 10000 rows into the table...
Reading remaining 1674 rows into the table...
Importing data, reading psych290_data/dla_tutorial/blog_outcomes.csv file
Reading remaining 1000 rows into the table...


SQL Query: CREATE TABLE dla_tutorial_outcomes (user_id INT, gender INT, age INT, occu VARCHAR(31), sign VARCHAR(15), is_indunk INT, is_student VARCHAR(7), is_education VARCHAR(7), is_technology VARCHAR(7));
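
To build intuition for what a CSV import into SQLite involves, here is a minimal sketch using only Python's standard library. It is not how `csvToSQLite` is implemented; the table name `toy_msgs` and the two sample rows are made up for illustration, and the database lives in memory so nothing touches your course database.

```python
import csv
import io
import sqlite3

# Hypothetical two-row sample standing in for a CSV file on disk.
sample_csv = io.StringIO(
    "message_id,user_id,message\n"
    "1,101,hello world\n"
    "2,102,second post\n"
)

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_msgs (message_id INT, user_id INT, message TEXT)")

reader = csv.reader(sample_csv)
next(reader)  # skip the header row
# Each remaining CSV row becomes one table row; "?" placeholders keep values safely quoted.
conn.executemany("INSERT INTO toy_msgs VALUES (?, ?, ?)", reader)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM toy_msgs").fetchone()[0]
print(row_count)
```

The placeholders matter once real messages contain quotes or commas, which blog posts usually do.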


### Setup database connection

Finally, we establish a connection to the SQLite database using the `%sql` magic provided by the `jupysql` extension we installed above.

In [None]:
# loads the %%sql extension
%load_ext sql

# connects the extension to the database
from sqlalchemy import create_engine
engine = create_engine(f"sqlite:///sqlite_data/{username}.db?charset=utf8mb4")
%sql engine

# set the output display limit to 50 rows
%config SqlMagic.displaylimit = 50

## Recap

### List all the tables.

After connecting to a specific database, list all the tables inside it using a `%sqlcmd` statement. For a more readable output, you can save the output to a variable and print it.

In [None]:
result = %sqlcmd tables
print(result)

+-----------------------+
|          Name         |
+-----------------------+
|   dla_tutorial_msgs   |
| dla_tutorial_outcomes |
+-----------------------+


This homework will use both `dla_tutorial_msgs` and `dla_tutorial_outcomes` tables.

### Now, let's view the schema of these tables.

Remember, `PRAGMA` and `%sqlcmd` are statements that run meta commands, such as inspecting table details and database properties.

In [None]:
%%sql

PRAGMA table_info(dla_tutorial_msgs);

cid,name,type,notnull,dflt_value,pk
0,message_id,INT,0,,0
1,user_id,INT,0,,0
2,date,VARCHAR(31),0,,0
3,created_time,VARCHAR(31),0,,0
4,message,LONGTEXT,0,,0


In [None]:
%%sql

PRAGMA table_info(dla_tutorial_outcomes);

cid,name,type,notnull,dflt_value,pk
0,user_id,INT,0,,0
1,gender,INT,0,,0
2,age,INT,0,,0
3,occu,VARCHAR(31),0,,0
4,sign,VARCHAR(15),0,,0
5,is_indunk,INT,0,,0
6,is_student,VARCHAR(7),0,,0
7,is_education,VARCHAR(7),0,,0
8,is_technology,VARCHAR(7),0,,0
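
If you ever need the schema inside Python rather than as displayed output, the same `PRAGMA` can be run through the standard `sqlite3` module. The following is a small sketch on a hypothetical `toy_outcomes` table; each returned row has the six fields shown in the output above (`cid`, `name`, `type`, `notnull`, `dflt_value`, `pk`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table mirroring a few columns of the outcomes table.
conn.execute("CREATE TABLE toy_outcomes (user_id INT, age INT, sign VARCHAR(15))")

# Keep only the column name and declared type from each PRAGMA row.
columns = [
    (name, col_type)
    for _cid, name, col_type, _notnull, _default, _pk in conn.execute(
        "PRAGMA table_info(toy_outcomes)"
    )
]
print(columns)  # [('user_id', 'INT'), ('age', 'INT'), ('sign', 'VARCHAR(15)')]
```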


### The tables are missing indices. Let's add indices on the main column that connects the tables.

Since the `user_id` column has the same name in both tables, we'll give the two indices different names to avoid a naming collision.

In [None]:
%%sql

CREATE INDEX idx_dla_tutorial_msgs_user_id ON dla_tutorial_msgs (user_id);
CREATE INDEX idx_dla_tutorial_msgs_message_id ON dla_tutorial_msgs (message_id);

CREATE INDEX idx_dla_tutorial_outcomes_user_id ON dla_tutorial_outcomes (user_id);
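
To see what an index changes, you can ask SQLite for its query plan before and after creating one. The sketch below uses `EXPLAIN QUERY PLAN` on a hypothetical `toy_msgs` table with made-up rows; the exact wording of the plan text differs between SQLite versions, so the code only prints it rather than assuming a fixed string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_msgs (message_id INT, user_id INT)")
# 100 made-up messages spread over 10 made-up users.
conn.executemany(
    "INSERT INTO toy_msgs VALUES (?, ?)", [(i, i % 10) for i in range(100)]
)

query = "SELECT message_id FROM toy_msgs WHERE user_id = 3"


def plan(sql):
    """Return SQLite's human-readable plan for a query (the 4th column of each row)."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


plan_before = plan(query)  # without an index: a full table scan
conn.execute("CREATE INDEX idx_toy_msgs_user_id ON toy_msgs (user_id)")
plan_after = plan(query)   # with the index: a search that names the index

print(plan_before)
print(plan_after)
```

On a table this small the speed difference is invisible, but the plan shows the switch from scanning every row to looking rows up through the index.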

## Questions

### 0) Select 10 random rows from both tables.

Look at the output, familiarize yourself with it a little bit. 👀 The messages in the msgs table are blog posts. Skim one.

**Answer:**

In [None]:
%%sql



### 1) Calculate the minimum, maximum and average age for both genders, in one SQL command.

**Answer:**

In [None]:
%%sql



### 2) Calculate the average age for every occupation, one command.

**Answer:**

In [None]:
%%sql



### 3) What's the average age difference between Libras and Leos?

We need a `GROUP BY` and `AVG`, but we don't need to do that for all groups, just the ones we care about.

**Hint:** You can just get the average age for Libras and Leos, and then subtract one from the other in a separate command or using phone calculator technology.

**Answer:**

In [None]:
%%sql



### 4) How many blog posts do we have from authors of a given age?

i.e., how many from age 17, 18, etc.

**Hint:** First create a table with number of messages per user, then join on the extra info that you need.

**Answer:**

In [None]:
%%sql



### 5) How many blog posts do we have from authors in an industry?

**Answer:**

In [None]:
%%sql



### 6) How many blog posts do we have from a given year?

**Hint:** Remember the `strftime('%Y', sql_timeformat)` function (or, in MySQL 🐬🐬🐬, `YEAR(sql_timeformat)`).

Also perhaps helpful: don't forget you can use `AS` to name the result of a command or function.
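
As a quick illustration of `strftime` together with `AS`, here is a sketch on a made-up table. The table `toy_events`, its column names, and its two rows are hypothetical, and the dates are assumed to be stored as ISO-style text (`YYYY-MM-DD`, optionally with a time), which is a format SQLite's date functions understand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_events (event_id INT, happened_on TEXT)")
conn.executemany(
    "INSERT INTO toy_events VALUES (?, ?)",
    [(1, "2003-05-17"), (2, "2004-01-02 13:45:00")],
)

# strftime('%Y', ...) extracts the four-digit year; AS gives the result a column name.
years = conn.execute(
    "SELECT event_id, strftime('%Y', happened_on) AS event_year "
    "FROM toy_events ORDER BY event_id"
).fetchall()
print(years)  # [(1, '2003'), (2, '2004')]
```

Note that the year comes back as text (`'2003'`), not as a number.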

**Answer:**

In [None]:
%%sql



### 7) How many blog posts do we have from a given year, written only by Tauruses?

**Answer:**

In [None]:
%%sql



### 8) Should you do something to make the last queries for questions 1 and 2 run faster? If so, give the query.

In our queries, we are partitioning on columns with **GROUP BY**. So, it may help to _index_ these columns.

**Answer:**

In [None]:
%%sql



### 9) Can you pull out 10 blog posts that mention the word "boys"?

**Answer:**

In [None]:
%%sql



### 10) (EXTRA CREDIT) How many blog posts do we have from a given month (i.e., from Feb of 2001, and so forth)?

**Hint:** **ORDER BY** (and another function) can use more than one variable...

**Answer:**

In [None]:
%%sql

