<a href="https://colab.research.google.com/github/CompPsychology/psych290_colab_public/blob/main/notebooks/week-01/W1_Tutorial_02_SQL_OPTIONAL_workingWithTweets_(sql_intro)_withSolutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# W1 Tutorial 2 - Intermediate SQL on Tweets (DB: sql_intro) (2025-03)

(c) Johannes Eichstaedt & the World Well-Being Project, 2023.

✋🏻✋🏻 NOTE - You need to create a copy of this notebook before you work through it. Click on the "Save a copy in Drive" option in the File menu, and save it to your Google Drive.

✉️🐞 If you find a bug/something doesn't work, please slack us a screenshot, or email johannes.courses@gmail.com.

In this tutorial we will learn how to remove duplicates and how to work with `DATETIME` fields to get counts across time.

**FYI:** you can execute a cell by hitting `CTRL+Enter` (Win) or `Command+Enter` (Mac).   
`Shift+Enter` will execute the cell and advance to the cell below.

Please execute every cell as you go along.

**FYI:**
* 🤓🤓🤓 comparisons with the tidyverse are flagged with the triple nerd  
* 🐬🐬🐬 when there is code that runs in MySQL but not in SQLite, this is marked with the triple dolphin

## 1) Setting up Colab with DLATK and SQLite

This tutorial begins by setting up DLATK in the Colab environment, similar to the previous tutorials. The next couple of subsections do this for you.

Remember, if Colab asks you about this not being authored by Google, say, "Run anyway."

### 1a) Install packages

In [None]:
#We first install the necessary packages and then download the dataset.

#This cell does it for you.

# installing DLATK and necessary packages
!git clone -b psych290 https://github.com/dlatk/dlatk.git
!pip install dlatk/
!pip install jupysql

Cloning into 'dlatk'...
remote: Enumerating objects: 6975, done.
remote: Counting objects: 100% (1070/1070), done.
remote: Compressing objects: 100% (146/146), done.
remote: Total 6975 (delta 992), reused 929 (delta 924), pack-reused 5905 (from 2)
Receiving objects: 100% (6975/6975), 62.37 MiB | 7.94 MiB/s, done.
Resolving deltas: 100% (4934/4934), done.
Updating files: 100% (338/338), done.
Processing ./dlatk
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: dlatk
  Building wheel for dlatk (setup.py) ... done
  Created wheel for dlatk: filename=dlatk-1.3.1-py3-none-any.whl size=35635829 sha256=4b7f6f37d25cac7f6e0e0e42b87ca451f89de22be1958f2d153588f10eede4f1
  Stored in directory: /tmp/pip-ephem-wheel-cache-5f5ccocr/wheels/cc/c9/65/e1ecc64bac68518c07b286fe86921aa938e11a0c3a87d8ff93
Successfully built dlatk
Installing collected packages: dlatk
Successfully installed dlatk-1.3.1
Collecting jupysql
  Downloading jupysql-0

### 1b) Download data and insert into SQLite database

In [None]:
# this downloads the CSVs we need for this tutorial
!git clone https://github.com/CompPsychology/sql_intro.git

Cloning into 'sql_intro'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 7 (delta 1), reused 7 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (7/7), 1.29 MiB | 3.29 MiB/s, done.
Resolving deltas: 100% (1/1), done.


Now that you have set up Colab, create a `username` variable which we use to name the database for the rest of the tutorial.

In [None]:
username = "your_name"

We then load the downloaded data into a database named [username].db in the sqlite_data folder.

In [None]:
# load the required package -- similar to library() function in R
import os
from dlatk.tools.importmethods import csvToSQLite

# store the complete path to the database -- sqlite_data/[username].db
database = os.path.join("sqlite_data", username)

# import CSVs into tables in this database
csvToSQLite(
    "sql_intro/counties.csv",
    database,
    "counties"
)

csvToSQLite(
    "sql_intro/tweets.csv",
    database,
    "tweets"
)

Table already exists
Table already exists


### 1c) Setup database connection

Finally, we establish a connection to the (SQLite) database using the `%sql` magic, which comes from the `jupysql` package installed above.

In [None]:
# loads the %%sql extension
%load_ext sql

# connects the extension to the database
from sqlalchemy import create_engine
engine = create_engine(f"sqlite:///sqlite_data/{username}.db?charset=utf8mb4")
%sql engine

#set the output limit to 50
%config SqlMagic.displaylimit = 50

## 2) Check tables

Let's check the tables within the database.

In [None]:
%sqlcmd tables

Name
counties
tweets


We will use both the tables `counties` and `tweets` in this tutorial.

#### 👩‍🔬💻 Exercise

Can you check the columns in these tables?

In [None]:
%%sql

PRAGMA table_info(counties);

cid,name,type,notnull,dflt_value,pk
0,fips,INT,0,,0
1,county,VARCHAR(31),0,,0
2,state,VARCHAR(31),0,,0


In [None]:
%%sql

PRAGMA table_info(tweets);

cid,name,type,notnull,dflt_value,pk
0,coordinates,TEXT,0,,0
1,created_at,VARCHAR(31),0,,0
2,fips,INT,0,,0
3,tweet_id,INT,0,,0
4,place_country_code,VARCHAR(7),0,,0
5,place_full_name,VARCHAR(63),0,,0
6,text,LONGTEXT,0,,0
7,user_created_at,VARCHAR(63),0,,0
8,user_id,INT,0,,0
9,user_statuses_count,DOUBLE,0,,0


**Ask Yourself:** What might be the primary key - foreign key relationship that connects the two tables? Basically, what column ties tables together?

**Answer:** `fips` in both the `counties` and `tweets` tables. (Federal Information Processing Standards (FIPS) codes identify counties.)

## 3) View Sample Of Records

Now that we have copied the tables, let's look at a sample of tweets. Instead of looking at consecutive rows, let's look at a random sample. This is where we use **ORDER BY RANDOM( )**. Run it multiple times to see different samples.

In [None]:
%%sql

SELECT *
FROM tweets
ORDER BY RANDOM()
LIMIT 5;

coordinates,created_at,fips,tweet_id,place_country_code,place_full_name,text,user_created_at,user_id,user_statuses_count
,2020-05-17 17:15:31,17195,1262069248635146240,,,"“It is very frustrating, because here we are, spending all of our time trying to take care of the COVID patients. All I can say is that it is real.” https://t.co/9uH69hvZGf",Sat Dec 13 04:35:47 +0000 2008,18093695,33471.0
,2020-05-12 04:17:29,20161,1260061508471422976,,,Send a message to these 10 key senators who can protect nurses in the next #COVID19 relief bill https://t.co/0lByEGseJr,Sun Oct 15 19:13:51 +0000 2017,919642495239065601,31602.0
,2020-05-10 18:29:41,40101,1259551198987902976,,,"Thousands Of Immigrant Kids Are Detained, Far From Their Parents. They Need Protection From COVID-19, Too https://t.co/5cyLXJdOhw via @cogwbur Can you imagine yourself in the place of these innocent victims of DUMP'S hatred!",Sun Oct 25 15:54:33 +0000 2015,4014898513,26829.0
,2020-05-16 20:05:13,5007,1261749565004435459,,,Arkansas Pandemic Unemployment Assistance website breached https://t.co/bf8iJ2mYyQ,Fri Jun 23 16:48:25 +0000 2017,878293675964592133,4039.0
,2020-05-10 13:52:23,17195,1259481411217764352,,,My IVF Treatment Update & A 2nd Wave of Coronavirus in Hong Kong https://t.co/plfzn0YghZ via @YouTube,Tue Jun 30 19:32:33 +0000 2009,52493415,444623.0


## 4) Drop Column

You can observe that the `coordinates` column is empty in this sample. Is that the case for all the rows? Let's check.

In [None]:
%%sql

SELECT COUNT(*) AS n_tweets
FROM tweets
WHERE LENGTH(coordinates) > 0;

n_tweets
0
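Here is a small sketch of why `LENGTH(coordinates) > 0` is a handy emptiness check: it skips both `NULL` values (because `LENGTH(NULL)` is `NULL`, which never passes the comparison) and empty strings (length 0). The table `toy_tweets` and its three rows are made up for illustration; the code runs against an in-memory SQLite database from Python rather than the tutorial database.

```python
import sqlite3

# In-memory database with a made-up toy table (not the tutorial data)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE toy_tweets (tweet_id INT, coordinates TEXT)")
con.executemany(
    "INSERT INTO toy_tweets VALUES (?, ?)",
    [
        (1, None),          # NULL: LENGTH(NULL) is NULL, so the row is skipped
        (2, ""),            # empty string: length 0, so the row is skipped
        (3, "40.7,-74.0"),  # the only row with real content
    ],
)

n_filled = con.execute(
    "SELECT COUNT(*) FROM toy_tweets WHERE LENGTH(coordinates) > 0"
).fetchone()[0]
print(n_filled)  # 1
```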


So the `coordinates` column is empty for all rows in the table. We might as well drop the column altogether.

In [None]:
%%sql

ALTER TABLE tweets
DROP COLUMN coordinates;

#### 👩‍🔬💻 Exercise

Can you check the total number of **unique** users in the table?

In [None]:
%%sql

SELECT COUNT(DISTINCT user_id)
FROM tweets;

COUNT(DISTINCT user_id)
4321


#### 👩‍🔬💻 Exercise

Verify whether there are tweets from all the counties in the `counties` table.

In [None]:
%%sql

SELECT COUNT(fips)
FROM counties;

COUNT(fips)
100


In [None]:
%%sql

SELECT COUNT(DISTINCT fips)
FROM tweets;

COUNT(DISTINCT fips)
94
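The two counts (100 counties, but only 94 distinct `fips` in `tweets`) suggest that some counties have no tweets. One possible way to list them is a left join that keeps only the unmatched counties. The sketch below uses made-up toy tables (`toy_counties`, `toy_tweets`, and the "Toy County" rows are hypothetical) in an in-memory SQLite database; the same `LEFT JOIN ... WHERE ... IS NULL` pattern should carry over to the tutorial tables.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE toy_counties (fips INT, county TEXT, state TEXT);
CREATE TABLE toy_tweets (tweet_id INT, fips INT);
INSERT INTO toy_counties VALUES
    (10003, 'New Castle County', 'Delaware'),
    (99991, 'Toy County A', 'Nowhere'),
    (99992, 'Toy County B', 'Nowhere');
INSERT INTO toy_tweets VALUES (1, 10003), (2, 10003), (3, 99991);
""")

# Keep every county; counties without a matching tweet get NULL for t.fips
counties_without_tweets = con.execute("""
    SELECT c.fips, c.county
    FROM toy_counties AS c
    LEFT JOIN toy_tweets AS t ON t.fips = c.fips
    WHERE t.fips IS NULL
    ORDER BY c.fips
""").fetchall()
print(counties_without_tweets)  # [(99992, 'Toy County B')]
```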


## 5) Remove Duplicates

Let's check if there are any duplicate tweets by comparing the number of distinct tweet ids with the total number of rows.

In [None]:
%%sql

SELECT COUNT(DISTINCT(tweet_id)) AS n_tweets, COUNT(*) AS total
FROM tweets;

n_tweets,total
12145,12629


So there are duplicate tweets in this data (12629 rows but only 12145 distinct ids). Let us make a table `n_duplicates` that holds the number of copies for every `tweet_id`, keeping only the tweets that appear more than once.

In [None]:
%%sql

CREATE TABLE n_duplicates AS
SELECT tweet_id, COUNT(*) AS n_duplicates
FROM tweets
GROUP BY tweet_id HAVING COUNT(*) > 1;

Let's check some of the entries in the `n_duplicates` table.

In [None]:
%%sql

SELECT *
FROM n_duplicates
LIMIT 10
OFFSET 100;

tweet_id,n_duplicates
1258713300671647745,2
1258728358613049346,2
1258728373184008195,2
1258728468495437824,2
1258743455078989825,2
1258743455976734720,2
1258743695588892672,2
1258743696117415937,2
1258758558868365312,2
1258758580087533569,2
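To see the `GROUP BY ... HAVING` pattern in isolation, here is a minimal sketch on a made-up toy table (the ids 101 to 103 are hypothetical). `WHERE` filters rows before grouping, while `HAVING` filters the groups after `COUNT(*)` has been computed, which is why the duplicate check has to go in `HAVING`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE toy_tweets (tweet_id INT, text TEXT)")
con.executemany(
    "INSERT INTO toy_tweets VALUES (?, ?)",
    [(101, "a"), (101, "a"), (102, "b"), (103, "c"), (103, "c"), (103, "c")],
)

# Count rows per tweet_id, then keep only the ids that occur more than once
dupes = con.execute("""
    SELECT tweet_id, COUNT(*) AS n_duplicates
    FROM toy_tweets
    GROUP BY tweet_id
    HAVING COUNT(*) > 1
    ORDER BY tweet_id
""").fetchall()
print(dupes)  # [(101, 2), (103, 3)]
```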


There are indeed duplicate tweets in this data. Print the full tweets for some of these ids. Can you see why **ORDER BY** is needed here?

In [None]:
%%sql

SELECT *
FROM tweets
WHERE tweet_id IN (SELECT tweet_id FROM n_duplicates)
ORDER BY tweet_id
LIMIT 10
OFFSET 150;

created_at,fips,tweet_id,place_country_code,place_full_name,text,user_created_at,user_id,user_statuses_count
2020-05-07 18:00:00,37195,1258456564547158020,,,Lessons from the Pandemic Police https://t.co/BCOzuBkTtS,Wed Jul 11 19:55:27 +0000 2012,633251100,5593.0
2020-05-07 18:00:00,37195,1258456564547158020,,,Lessons from the Pandemic Police https://t.co/BCOzuBkTtS,Wed Jul 11 19:55:27 +0000 2012,633251100,5593.0
2020-05-07 18:00:38,36027,1258456722097811457,,,"The Northeast Face Shield Project is a collection of volunteers who 3D print #faceshield parts and distribute completed face shields to healthcare facilities in the Northeast. Thank you for donating 1,200 face shields to @NuvanceHealth locations, including @Vassar_Brothers. https://t.co/vMOdBnLzE9",Tue Nov 12 20:40:55 +0000 2019,1194354342712750080,171.0
2020-05-07 18:00:38,36027,1258456722097811457,,,"The Northeast Face Shield Project is a collection of volunteers who 3D print #faceshield parts and distribute completed face shields to healthcare facilities in the Northeast. Thank you for donating 1,200 face shields to @NuvanceHealth locations, including @Vassar_Brothers. https://t.co/vMOdBnLzE9",Tue Nov 12 20:40:55 +0000 2019,1194354342712750080,171.0
2020-05-07 18:00:52,48479,1258456782508298240,,,"City of Laredo confirms four added cases of coronavirus, 414 total https://t.co/8J1Uf7TFTY",Thu Jun 12 18:09:14 +0000 2008,15099229,70768.0
2020-05-07 18:00:52,48479,1258456782508298240,,,"City of Laredo confirms four added cases of coronavirus, 414 total https://t.co/8J1Uf7TFTY",Thu Jun 12 18:09:14 +0000 2008,15099229,70768.0
2020-05-07 18:00:57,36027,1258456802540339203,,,Do you know which #HudsonValley parks and trails are closed to prevent the spread of #COVID19? https://t.co/AZPYtraJ7n @scenichudson,Thu May 21 16:26:54 +0000 2009,41619175,16722.0
2020-05-07 18:00:57,36027,1258456802540339203,,,Do you know which #HudsonValley parks and trails are closed to prevent the spread of #COVID19? https://t.co/AZPYtraJ7n @scenichudson,Thu May 21 16:26:54 +0000 2009,41619175,16722.0
2020-05-07 18:00:57,33001,1258456802833895425,,,Statewide coverage of COVID-19 topics and more https://t.co/qLmRpvNxWq,Thu Dec 31 12:54:53 +0000 2009,100761151,3099.0
2020-05-07 18:00:57,33001,1258456802833895425,,,Statewide coverage of COVID-19 topics and more https://t.co/qLmRpvNxWq,Thu Dec 31 12:54:53 +0000 2009,100761151,3099.0


Since duplicate tweets are not necessarily stored one after the other in the `tweets` table, **ORDER BY** helps by placing the duplicates next to each other.

We need to remove these duplicates!!

To remove the duplicates, we first create a new table with all the unique rows from `tweets`. The command below selects all (`*`) `DISTINCT` rows and stores them in a new table called `unique_tweets`.

In [None]:
%%sql

CREATE TABLE unique_tweets AS
SELECT DISTINCT *
FROM tweets;

Let us double-check if duplicates are gone.

In [None]:
%%sql

SELECT COUNT(DISTINCT(tweet_id)) AS n_tweets, COUNT(*) AS total
FROM unique_tweets;

n_tweets,total
12145,12145
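One caveat worth keeping in mind: `SELECT DISTINCT *` only collapses rows that are identical in every column. In the tutorial data the check above shows that this was enough, since both counts are 12145. The sketch below uses a made-up toy table to illustrate why the check matters: tweet 2 appears twice with a different `user_statuses_count`, so it survives `DISTINCT *` as two rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE toy_tweets (tweet_id INT, text TEXT, user_statuses_count REAL)"
)
con.executemany(
    "INSERT INTO toy_tweets VALUES (?, ?, ?)",
    [
        (1, "hello", 10.0),
        (1, "hello", 10.0),  # exact copy: removed by DISTINCT *
        (2, "world", 5.0),
        (2, "world", 6.0),   # same id, one column differs: kept by DISTINCT *
    ],
)
con.execute("CREATE TABLE toy_unique AS SELECT DISTINCT * FROM toy_tweets")

# Same sanity check as in the tutorial: distinct ids vs. total rows
n_ids, n_rows = con.execute(
    "SELECT COUNT(DISTINCT tweet_id), COUNT(*) FROM toy_unique"
).fetchone()
print(n_ids, n_rows)  # 2 3
```

If the two numbers differ, as they do here, the remaining duplicates need a rule for which copy to keep.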


And the total number of tweets.

In [None]:
%%sql

SELECT COUNT(*)
FROM unique_tweets;

COUNT(*)
12145


We can drop the previous `tweets` table which contains duplicates.

In [None]:
%%sql

DROP TABLE tweets;

Rename the table `unique_tweets` to `tweets` now that duplicates are removed.

In [None]:
%%sql

ALTER TABLE unique_tweets RENAME TO tweets;

Print the schema of our table `tweets`.

In [None]:
%%sql

PRAGMA table_info(tweets);

cid,name,type,notnull,dflt_value,pk
0,created_at,TEXT,0,,0
1,fips,INT,0,,0
2,tweet_id,INT,0,,0
3,place_country_code,TEXT,0,,0
4,place_full_name,TEXT,0,,0
5,text,TEXT,0,,0
6,user_created_at,TEXT,0,,0
7,user_id,INT,0,,0
8,user_statuses_count,REAL,0,,0


#### 👩‍🔬💻 Exercise

Now that we have cleaned the table, can you get the relative number of tweets of all the counties?

In [None]:
%%sql

SELECT fips, COUNT(*) AS num_tweets
FROM tweets
GROUP BY fips;

fips,num_tweets
1013,20
1017,17
1059,16
1119,6
5007,895
6043,41
8015,70
8037,189
10003,2594
12027,4


#### 👩‍🔬💻 Exercise

Also, which county tweeted the most? Can you get the name of the county?

In [None]:
%%sql

SELECT fips, COUNT(*) AS num_tweets
FROM tweets
GROUP BY fips
ORDER BY num_tweets DESC;

fips,num_tweets
10003,2594
36027,1281
5007,895
48479,847
20161,448
33005,444
37035,408
54033,381
42019,356
35043,344


In [None]:
%%sql

SELECT *
FROM counties
WHERE fips = 10003;

fips,county,state
10003,New Castle County,Delaware
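The two steps above (count per `fips`, then look up the top `fips` in `counties`) can also be combined into a single query with a join. This is only a sketch on made-up toy tables (`toy_tweets`, `toy_counties`, and the "Toy County" row are hypothetical), run from Python against an in-memory SQLite database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE toy_counties (fips INT, county TEXT, state TEXT);
CREATE TABLE toy_tweets (tweet_id INT, fips INT);
INSERT INTO toy_counties VALUES
    (10003, 'New Castle County', 'Delaware'),
    (99991, 'Toy County A', 'Nowhere');
INSERT INTO toy_tweets VALUES
    (1, 10003), (2, 10003), (3, 10003), (4, 99991), (5, 99991);
""")

# Join first, then count tweets per county and keep the largest group
top_county = con.execute("""
    SELECT c.county, c.state, COUNT(*) AS num_tweets
    FROM toy_tweets AS t
    INNER JOIN toy_counties AS c ON t.fips = c.fips
    GROUP BY c.fips, c.county, c.state
    ORDER BY num_tweets DESC
    LIMIT 1
""").fetchone()
print(top_county)  # ('New Castle County', 'Delaware', 3)
```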


Let's get the number of tweets per user and sort the results in descending order of the number of tweets.

In [None]:
%%sql

SELECT user_id, COUNT(*) AS n_tweets
FROM tweets
GROUP BY user_id
ORDER BY n_tweets DESC
LIMIT 20
OFFSET 100;

user_id,n_tweets
22627877,14
1052628509242339328,13
4193586673,13
373681289,13
352429706,13
284076024,13
38351876,13
34955563,13
18187527,13
1172231621141041159,12


## 6) Let's derive counts across time
Find the date/time range of tweets.

In [None]:
%%sql

SELECT MIN(created_at), MAX(created_at)
FROM tweets;

MIN(created_at),MAX(created_at)
2020-05-05 00:00:08,2020-05-18 23:55:38


This is 2 weeks of data. Let's get the number of tweets per day.

**NOTE:** A good way to work with date/time columns is to treat them as if they contain categorical values (and not real numbers). Then they behave like most of the other variables, such as `user_id`, `fips`, etc.

In [None]:
%%sql

SELECT DATE(created_at) AS date, COUNT(*) AS n_tweets
FROM tweets
GROUP BY date
ORDER BY date;

date,n_tweets
2020-05-05,845
2020-05-06,948
2020-05-07,818
2020-05-08,793
2020-05-09,727
2020-05-10,639
2020-05-11,895
2020-05-12,988
2020-05-13,1091
2020-05-14,1055
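To make the "dates as categories" idea concrete, here is a minimal sketch on three made-up timestamps. `DATE()` cuts each timestamp down to its `YYYY-MM-DD` part, and that text value then works as a grouping label like any other category. The table `toy_tweets` is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE toy_tweets (created_at TEXT)")
con.executemany(
    "INSERT INTO toy_tweets VALUES (?)",
    [
        ("2020-05-05 00:00:08",),
        ("2020-05-05 13:10:00",),
        ("2020-05-06 09:00:00",),
    ],
)

# DATE() keeps only the calendar day, which we then use as the group label
per_day = con.execute("""
    SELECT DATE(created_at) AS day, COUNT(*) AS n_tweets
    FROM toy_tweets
    GROUP BY day
    ORDER BY day
""").fetchall()
print(per_day)  # [('2020-05-05', 2), ('2020-05-06', 1)]
```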


Let's try to get the number of tweets every hour. Before that, we need to convert the `created_at` column to intervals of one hour, which can be done in SQLite using the **strftime** function (🐬🐬🐬 in MySQL you would use **DATE_FORMAT** instead). It simply rewrites the column contents in a particular format (hourly in our case), which we can then use to summarize.

In [None]:
%%sql

SELECT strftime('%Y-%m-%d %H:00', created_at) AS hour, COUNT(*) AS n_tweets
FROM tweets
GROUP BY hour
ORDER BY hour
LIMIT 20
OFFSET 50;

hour,n_tweets
2020-05-07 06:00,12
2020-05-07 07:00,19
2020-05-07 08:00,3
2020-05-07 09:00,6
2020-05-07 10:00,12
2020-05-07 11:00,20
2020-05-07 12:00,32
2020-05-07 13:00,30
2020-05-07 14:00,34
2020-05-07 15:00,4
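Here is a small sketch of what the format string does. `%Y-%m-%d %H` copies the date and the hour from the timestamp, and the literal `:00` replaces the minutes, so every tweet within the same hour gets the same label. The three timestamps and the table `toy_tweets` are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE toy_tweets (created_at TEXT)")
con.executemany(
    "INSERT INTO toy_tweets VALUES (?)",
    [
        ("2020-05-07 06:05:00",),
        ("2020-05-07 06:59:59",),  # still in the 06:00 bucket
        ("2020-05-07 07:00:00",),  # first second of the 07:00 bucket
    ],
)

per_hour = con.execute("""
    SELECT strftime('%Y-%m-%d %H:00', created_at) AS hour, COUNT(*) AS n_tweets
    FROM toy_tweets
    GROUP BY hour
    ORDER BY hour
""").fetchall()
print(per_hour)  # [('2020-05-07 06:00', 2), ('2020-05-07 07:00', 1)]
```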


Let's check the average number of tweets every hour. We first create a table named `hourly` to hold the number of tweets every hour.

In [None]:
%%sql

CREATE TABLE hourly AS
SELECT DATE(created_at) AS date, strftime('%H', created_at) AS hour, COUNT(*) AS n_tweets
FROM tweets
GROUP BY date,hour;

#### 👩‍🔬💻 Exercise

Can you take the average over all days for every hour?

In [None]:
%%sql

SELECT hour, AVG(n_tweets) AS mean
FROM hourly
GROUP BY hour;

hour,mean
0,41.5
1,35.07692307692308
2,31.571428571428573
3,28.071428571428573
4,18.357142857142858
5,13.428571428571429
6,9.857142857142858
7,25.357142857142858
8,9.071428571428571
9,11.928571428571429


So, people tweet more in the afternoons/evenings. That makes sense.

We can do this in one command without creating a temporary table `hourly`.

In [None]:
%%sql

SELECT hour, AVG(n_tweets) AS mean
FROM (SELECT DATE(created_at) AS date, strftime('%H', created_at) AS hour, COUNT(*) as n_tweets
      FROM tweets
      GROUP BY date, hour) AS tmp
GROUP BY hour;

hour,mean
0,41.5
1,35.07692307692308
2,31.571428571428573
3,28.071428571428573
4,18.357142857142858
5,13.428571428571429
6,9.857142857142858
7,25.357142857142858
8,9.071428571428571
9,11.928571428571429
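A subtlety of averaging this way: a day only contributes a row for an hour if at least one tweet was posted in that hour, so `AVG` skips the empty hours instead of counting them as 0. The mean for hour 1 above (35.0769...) equals 456/13, which suggests it was averaged over 13 of the 14 days. The sketch below, on made-up timestamps, contrasts the subquery average with one possible alternative that divides the total by the number of days in the data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE toy_tweets (created_at TEXT)")
con.executemany(
    "INSERT INTO toy_tweets VALUES (?)",
    [
        ("2020-05-05 06:10:00",), ("2020-05-05 06:40:00",),  # day 1, hour 06
        ("2020-05-05 07:15:00",),                            # day 1, hour 07
        ("2020-05-06 07:05:00",), ("2020-05-06 07:30:00",),
        ("2020-05-06 07:55:00",),                            # day 2, hour 07 only
    ],
)

# Same shape as the tutorial query: average over the (day, hour) rows that exist
avg_over_existing = con.execute("""
    SELECT hour, AVG(n_tweets) AS mean
    FROM (SELECT DATE(created_at) AS day, strftime('%H', created_at) AS hour,
                 COUNT(*) AS n_tweets
          FROM toy_tweets
          GROUP BY day, hour) AS tmp
    GROUP BY hour
    ORDER BY hour
""").fetchall()

# Alternative: total tweets per hour divided by the number of days in the data
avg_over_all_days = con.execute("""
    SELECT strftime('%H', created_at) AS hour,
           COUNT(*) * 1.0 / (SELECT COUNT(DISTINCT DATE(created_at))
                             FROM toy_tweets) AS mean
    FROM toy_tweets
    GROUP BY hour
    ORDER BY hour
""").fetchall()

print(avg_over_existing)  # [('06', 2.0), ('07', 2.0)]
print(avg_over_all_days)  # [('06', 1.0), ('07', 2.0)]
```

Which of the two is appropriate depends on whether an hour without tweets should count as a zero.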


In [None]:
%%sql

DROP TABLE hourly;

## 7) Tutorial 1 but with tweets!

### 7a) Searching using wildcards

A very practical use of wildcards on the current dataset is to identify tweets that contain URLs. Let's check the number of such tweets.

In [None]:
%%sql

SELECT COUNT(*) AS n_tweets_with_url
FROM tweets
WHERE text LIKE '%https://%';

n_tweets_with_url
9395
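A quick sketch of how the `%` wildcard behaves, on four made-up tweet texts. `%` matches any run of characters (including none), so `'%https://%'` finds the substring anywhere in the text. Note that this pattern does not match plain `http://` links, and that SQLite's `LIKE` ignores case for ASCII letters by default.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE toy_tweets (text TEXT)")
con.executemany(
    "INSERT INTO toy_tweets VALUES (?)",
    [
        ("Read more here: https://t.co/abc",),
        ("no link in this one",),
        ("HTTPS://EXAMPLE.COM in capitals",),  # matches: LIKE ignores ASCII case
        ("an older link http://example.com",),  # no match for '%https://%'
    ],
)

def count_like(pattern):
    """Count the toy tweets whose text matches the given LIKE pattern."""
    return con.execute(
        "SELECT COUNT(*) FROM toy_tweets WHERE text LIKE ?", (pattern,)
    ).fetchone()[0]

n_https = count_like("%https://%")
n_any_http = count_like("%http%")
print(n_https, n_any_http)  # 2 3
```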


In [None]:
%%sql

SELECT *
FROM tweets
WHERE text LIKE '%https://%'
LIMIT 5
OFFSET 100;

created_at,fips,tweet_id,place_country_code,place_full_name,text,user_created_at,user_id,user_statuses_count
2020-05-05 07:21:06,54033,1257571003301089280,,,"wvstatejournal: CASES TOP 1,200: DHHR reported Monday the total number of COVID-19 cases confirmed in West Virginia is 1,206. Read more here: https://t.co/zHWPCENv20",Wed Jul 20 02:09:21 +0000 2011,338767521,17644.0
2020-05-05 07:21:08,54033,1257571010263625735,,,"wvstatejournal:RT WVNews247:WV LED NATION in having lowest key COVID-19 rate, Gov. Justice announces. Also, several more businesses will be allowed to open next week:https://t.co/VCVd68YCyY https://t.co/WNthm7QroS",Wed Jul 20 02:09:21 +0000 2011,338767521,17647.0
2020-05-05 07:21:10,54033,1257571019549822977,,,"wvstatejournal:DHHR REPORTS West Virginia's positive COVID-19 case total has grown to 1,224, and the cumulative positive test percentage continues to see a decline. Read more here:https://t.co/fYyJ8N2gXx",Wed Jul 20 02:09:21 +0000 2011,338767521,17651.0
2020-05-05 07:49:01,48479,1257578027522969600,,,"Tyson will keep slowing meat production as coronavirus sickens workers, tanks income https://t.co/LBtcNHmw76",Wed Jul 27 23:29:39 +0000 2011,343702041,96391.0
2020-05-05 07:56:27,21071,1257579899562401793,,,"I went to check Flowbee's charming website, since everyone's crying about haircuts... They're completely sold out because of COVID-19, and they're only allowing one purchase to curb profiteering and prioritize ""those in need."" Bless! https://t.co/Fl8uMehTXW",Wed Feb 28 03:15:03 +0000 2018,968685955619151872,342.0


### 7b) Join county, state

In the `tweets` table, the column `fips` holds the FIPS code of the US county from which the tweet likely originated. Every US county has a unique FIPS code. Let us now add the columns `county` and `state` that correspond to each `fips` in the `tweets` table; this information lives in the `counties` table. This means we need to join the two tables on `fips`. In other words, the `fips` column in the two tables forms the primary key - foreign key relationship.

Let's check the schema of the table `counties`.

In [None]:
%%sql

PRAGMA table_info(counties);

cid,name,type,notnull,dflt_value,pk
0,fips,INT,0,,0
1,county,VARCHAR(31),0,,0
2,state,VARCHAR(31),0,,0


The `fips` column is declared as **INT** in both tables (see the schemas above), so any leading zero is dropped: the values have length 4 or 5 in both `tweets` and `counties`, as the next two queries show.

In [None]:
%%sql

SELECT DISTINCT(LENGTH(fips))
FROM tweets;

(LENGTH(fips))
5
4


In [None]:
%%sql

SELECT DISTINCT(LENGTH(fips))
FROM counties;

(LENGTH(fips))
5
4


Before we merge, we want `fips` to have the same zero-padded format in both tables. Here we left-pad the `fips` values in `counties` with 0 so that each has length 5. You can double-check this by running the previous statement again.

To do this we can use the **UPDATE** statement.

In [None]:
%%sql

UPDATE counties
SET fips = printf('%05d', fips);
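One thing to watch for in SQLite: as far as its type-affinity rules go, a column declared as `INT` converts integer-looking text back into an integer when the value is stored, so the zero-padding may not survive the `UPDATE`. The sketch below compares a made-up `INT` column with a `TEXT` column, and also shows padding on the fly in a `SELECT`. The table names `fips_as_int` and `fips_as_text` are hypothetical, and this is an illustration on toy data, not a statement about the tutorial database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fips_as_int (fips INT)")
con.execute("CREATE TABLE fips_as_text (fips TEXT)")
con.execute("INSERT INTO fips_as_int VALUES (1013)")
con.execute("INSERT INTO fips_as_text VALUES (1013)")

# Same zero-padding UPDATE on both toy tables
con.execute("UPDATE fips_as_int SET fips = printf('%05d', fips)")
con.execute("UPDATE fips_as_text SET fips = printf('%05d', fips)")

int_row = con.execute("SELECT fips, typeof(fips) FROM fips_as_int").fetchone()
text_row = con.execute("SELECT fips, typeof(fips) FROM fips_as_text").fetchone()

# Padding on the fly does not depend on how the column is declared
padded = con.execute("SELECT printf('%05d', fips) FROM fips_as_int").fetchone()[0]

print(int_row)   # (1013, 'integer')  -> the padding was converted away
print(text_row)  # ('01013', 'text')  -> the padding is kept
print(padded)    # 01013
```

Since both tables in this tutorial store `fips` as integers, the join on `fips` still compares like with like.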

We intend to bring the `county`, `state` columns from `counties` into `tweets` for every tweet. But are all the `fips` in `tweets` present in table `counties`?

In [None]:
%%sql

SELECT COUNT(DISTINCT(fips)) AS n_fips
FROM tweets
WHERE fips NOT IN (SELECT fips FROM counties);

n_fips
0


So every `fips` in `tweets` has a record in `counties` (the query found 0 unmatched codes). In general, when we join, we have two options:
- **Left join:** For every `fips` in `tweets` that has an entry in `counties`, the result table will have `county` and `state` from `counties`. For `fips` in `tweets` with no entry in `counties`, the result table will fill in NULL for `county` and `state`.
- **Inner join:** For every `fips` in `tweets` that has an entry in `counties`, the result table will have `county` and `state` from `counties`. For `fips` in `tweets` with no entry in `counties`, the result table will drop the record.
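The difference between the two join types is easiest to see on a tiny example. In the made-up tables below, tweet 3 has a `fips` (99999) with no entry in `toy_counties`: the left join keeps it with `NULL` (shown as `None` in Python) for `county`, while the inner join drops it. The table names and the unmatched code are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE toy_counties (fips INT, county TEXT, state TEXT);
CREATE TABLE toy_tweets (tweet_id INT, fips INT);
INSERT INTO toy_counties VALUES (10003, 'New Castle County', 'Delaware');
INSERT INTO toy_tweets VALUES (1, 10003), (2, 10003), (3, 99999);
""")

# LEFT JOIN: all tweets are kept; unmatched ones get NULL for county
left_rows = con.execute("""
    SELECT t.tweet_id, c.county
    FROM toy_tweets AS t
    LEFT JOIN toy_counties AS c ON t.fips = c.fips
    ORDER BY t.tweet_id
""").fetchall()

# INNER JOIN: only tweets with a matching county are kept
inner_rows = con.execute("""
    SELECT t.tweet_id, c.county
    FROM toy_tweets AS t
    INNER JOIN toy_counties AS c ON t.fips = c.fips
    ORDER BY t.tweet_id
""").fetchall()

print(left_rows)   # [(1, 'New Castle County'), (2, 'New Castle County'), (3, None)]
print(inner_rows)  # [(1, 'New Castle County'), (2, 'New Castle County')]
```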

We will perform an inner join for this tutorial, thereby keeping only the records in `tweets` that have an entry in `counties`.

This will take a lot more than a HOT SEC!!

In [None]:
%%sql

CREATE TABLE tweets_counties AS
SELECT tweets.*, counties.county, counties.state
FROM tweets INNER JOIN counties
ON tweets.fips = counties.fips;

In [None]:
%sqlcmd tables

Name
counties
n_duplicates
tweets
tweets_counties


Print the schema of the merged table.

In [None]:
%%sql

PRAGMA table_info(tweets_counties);

cid,name,type,notnull,dflt_value,pk
0,created_at,TEXT,0,,0
1,fips,INT,0,,0
2,tweet_id,INT,0,,0
3,place_country_code,TEXT,0,,0
4,place_full_name,TEXT,0,,0
5,text,TEXT,0,,0
6,user_created_at,TEXT,0,,0
7,user_id,INT,0,,0
8,user_statuses_count,REAL,0,,0
9,county,TEXT,0,,0


We can now drop the original `tweets` and `counties` tables, as everything we need is in the merged table. We also rename the new `tweets_counties` table to `tweets`.

In [None]:
%%sql

DROP TABLE tweets;
DROP TABLE counties;
ALTER TABLE tweets_counties RENAME TO tweets;

Let's find the number of users per state.

In [None]:
%%sql

SELECT state, COUNT(DISTINCT(user_id)) AS n_users
FROM tweets
GROUP BY state;

state,n_users
Alabama,40
Arkansas,296
California,6
Colorado,112
Delaware,878
Florida,52
Georgia,44
Idaho,25
Illinois,49
Indiana,91


#### 👩‍🔬💻 Exercise

Can you list the states and the number of tweets per state in the decreasing number of tweets?

In [None]:
%%sql

SELECT state, COUNT(*) AS num_tweets
FROM tweets
GROUP BY state
ORDER BY num_tweets DESC;

state,num_tweets
Delaware,2594
New York,1281
Texas,1097
Arkansas,895
North Carolina,590
New Hampshire,555
Kansas,455
New Mexico,417
Pennsylvania,408
West Virginia,383


#### 👩‍🔬💻 Exercise

Can you normalize these numbers for the number of distinct users in the state? Where do you see California?

In [None]:
%%sql

SELECT state, COUNT(*) / COUNT(DISTINCT user_id) AS num_tweets_norm
FROM tweets
GROUP BY state
ORDER BY num_tweets_norm DESC;

state,num_tweets_norm
West Virginia,7
California,6
Illinois,5
Missouri,4
Minnesota,4
Iowa,4
Indiana,4
South Dakota,3
Oklahoma,3
New Hampshire,3
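The whole-number results above come from integer division: in SQLite, dividing one integer by another truncates the result. Multiplying by `1.0` first turns the division into a real-valued one and keeps the decimals. A minimal sketch on a made-up state with 7 tweets from 2 users (the table `toy_tweets` and its rows are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE toy_tweets (state TEXT, user_id INT)")
# 5 tweets from user 1 and 2 tweets from user 2, all in the same toy state
con.executemany(
    "INSERT INTO toy_tweets VALUES (?, ?)",
    [("Toyland", 1)] * 5 + [("Toyland", 2)] * 2,
)

int_ratio, real_ratio = con.execute("""
    SELECT COUNT(*) / COUNT(DISTINCT user_id),
           COUNT(*) * 1.0 / COUNT(DISTINCT user_id)
    FROM toy_tweets
    GROUP BY state
""").fetchone()
print(int_ratio, real_ratio)  # 3 3.5
```

With real-valued division, states that tie after truncation may end up in a different order.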
