# Sampling Cannabis tweets
Because a large number of cannabis tweets are available for the period **1st May 2018** to **31st Dec 2018**, we draw a sample of *N* = 100,000 tweets.

## Sampling technique
To preserve the temporal characteristics of the tweets, we first group all tweets by the week in which they were posted. Within each weekly group we use stratified sampling, with each keyword used to mine the tweets (e.g. bong, blunt, cannabis) acting as a stratum. This lets us preserve the keyword statistics when sampling.
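
As a rough illustration of the grouping described above, the sketch below builds week and keyword strata for a handful of made-up tweets using only the standard library. The names `toy_tweets`, `toy_keywords` and `group_by_week_and_keyword` are hypothetical and exist only for this example; the notebook itself works on a pandas DataFrame.

```python
import datetime
from collections import defaultdict

# Hypothetical toy data: (created_at, text) pairs, not taken from the real dataset.
toy_tweets = [
    (datetime.datetime(2018, 5, 1), "New bong day"),
    (datetime.datetime(2018, 5, 2), "CBD oil and hemp seeds"),
    (datetime.datetime(2018, 5, 9), "Rolling a blunt"),
]
toy_keywords = ("blunt", "bong", "cbd", "hemp")


def group_by_week_and_keyword(tweets, keywords):
    """Map week number -> keyword -> list of tweet positions."""
    groups = defaultdict(lambda: defaultdict(list))
    for position, (created_at, text) in enumerate(tweets):
        week = created_at.strftime("%U")  # week of the year, Sunday as first day
        lowered = text.lower()
        for keyword in keywords:
            if keyword in lowered:  # case-insensitive substring match
                groups[week][keyword].append(position)
    return groups


toy_groups = group_by_week_and_keyword(toy_tweets, toy_keywords)
for week, strata in sorted(toy_groups.items()):
    print(week, dict(strata))
```

Note that the second toy tweet lands in two strata (`cbd` and `hemp`) of the same week, so strata can overlap whenever a tweet mentions more than one keyword.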

There are various ways to sample *N* items from *K* items with equal probability. We use [Reservoir Sampling](https://en.wikipedia.org/wiki/Reservoir_sampling). Since we want the final sample to preserve the relative size of each stratum and each temporal group, we sample *n* tweets from each stratum, where *K* is the total number of tweets and *N* is the target sample size, such that
```
n = (temporal_group_size / K) * (strata_size / temporal_group_size) * N = strata_size * N / K
```
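
To make the allocation formula concrete, here is a minimal sketch that applies it to a few invented stratum sizes. The function `proportional_allocation` and all the numbers are assumptions made for illustration. The sketch uses integer floor division, whereas the notebook code truncates a floating-point product with `int()`, so the two could in principle differ by rounding at the boundaries.

```python
def proportional_allocation(stratum_sizes, total, sample_size):
    """Return stratum -> n, with n = stratum_size * sample_size / total, rounded down."""
    return {
        stratum: size * sample_size // total
        for stratum, size in stratum_sizes.items()
    }


# Hypothetical numbers chosen for illustration only.
K = 1000  # total number of tweets
N = 100   # target sample size
toy_stratum_sizes = {
    ("17", "bong"): 250,
    ("17", "cbd"): 125,
    ("18", "blunt"): 40,
}

allocation = proportional_allocation(toy_stratum_sizes, K, N)
print(allocation)
# {('17', 'bong'): 25, ('17', 'cbd'): 12, ('18', 'blunt'): 4}
```

Rounding down means a stratum with fewer than *K* / *N* tweets receives no samples at all, which is worth keeping in mind for rare keywords.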

In [1]:
import datetime
import glob
import os
import pickle
import random
import sys

from types import SimpleNamespace

import numpy as np
import pandas as pd

TWEETS_DIR = '../Data/Tweets/'

## Sampling Code

In [2]:
CANNABIS_KEYWORDS = {
    'blunt', 'bong', 'budder', 'cannabis',
    'cbd', 'ganja', 'hash', 'hemp',
    'indica', 'kush', 'marijuana', 'marihuana',
    'reefer', 'sativa', 'thc', 'weed'
}

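def load_data(directory):
    """Load the tweet CSVs, keep tweets from 1 May 2018 onwards, and flag keywords."""
    dfs = []
    for file in glob.glob(os.path.join(directory, 'Tweets*.csv')):
        df = pd.read_csv(file, usecols=['Id', 'CreatedAt', 'Text', 'UserId'],
                         dtype={'Id': str, 'CreatedAt': str, 'Text': str, 'UserId': str},
                         parse_dates=['CreatedAt'])
        df = df[df.CreatedAt >= datetime.datetime(2018, 5, 1)].copy()
        # Week of the year ('00'-'53'), with Sunday as the first day of the week.
        df['Week'] = df.CreatedAt.dt.strftime('%U')
        # One boolean column per keyword (case-insensitive substring match).
        # Rows with missing text are flagged False instead of raising an error.
        text = df.Text.str.lower()
        for keyword in CANNABIS_KEYWORDS:
            df[keyword] = text.str.contains(keyword, regex=False, na=False)
        dfs.append(df)
        print(f'Processed file {file}')
    # Reset the index so that labels are unique and equal to row positions;
    # the sampled indices are later looked up by position with df.iloc.
    return pd.concat(dfs, ignore_index=True)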

In [3]:
df = load_data(TWEETS_DIR)
print(f'Number of tweets: {df.shape[0]}')
print(df.head())

Processed file ../Data/Tweets\Tweets-1-10-2018.csv
Processed file ../Data/Tweets\Tweets-10-12-2018.csv
Processed file ../Data/Tweets\Tweets-10-9-2018.csv
Processed file ../Data/Tweets\Tweets-11-6-2018.csv
Processed file ../Data/Tweets\Tweets-12-11-2018.csv
Processed file ../Data/Tweets\Tweets-13-8-2018.csv
Processed file ../Data/Tweets\Tweets-14-5-2018.csv
Processed file ../Data/Tweets\Tweets-15-10-2018.csv
Processed file ../Data/Tweets\Tweets-16-7-2018.csv
Processed file ../Data/Tweets\Tweets-17-12-2018.csv
Processed file ../Data/Tweets\Tweets-17-9-2018.csv
Processed file ../Data/Tweets\Tweets-18-6-2018.csv
Processed file ../Data/Tweets\Tweets-19-11-2018.csv
Processed file ../Data/Tweets\Tweets-2-7-2018.csv
Processed file ../Data/Tweets\Tweets-20-8-2018.csv
Processed file ../Data/Tweets\Tweets-21-5-2018.csv
Processed file ../Data/Tweets\Tweets-22-10-2018.csv
Processed file ../Data/Tweets\Tweets-23-7-2018.csv
Processed file ../Data/Tweets\Tweets-24-12-2018.csv
Processed file ../Data/Tw

In [19]:
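def get_temporal_groups(df):
    """
    Map each week to a dict of keyword -> list of indices of the tweets containing that keyword.
    """
    week_groups = {}
    for tweet in df.itertuples():
        if tweet.Week not in week_groups:
            week_groups[tweet.Week] = {keyword: [] for keyword in CANNABIS_KEYWORDS}
        for keyword in CANNABIS_KEYWORDS:
            # A tweet that matches several keywords is added to several strata.
            if getattr(tweet, keyword):
                week_groups[tweet.Week][keyword].append(tweet.Index)
    return week_groups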

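def reservoir_sample(ls, n):
    """
    Select n items from ls so that each item is included with probability n / len(ls).
    If ls has n or fewer items, all of them are returned.
    """
    rand = random.Random()
    # Fill the reservoir with the first n items.
    reservoir = ls[:n]
    for i in range(n, len(ls)):
        # randint is inclusive on both ends, so p is uniform over 0..i.
        p = rand.randint(0, i)
        if p < n:
            # Item i replaces a reservoir slot with probability n / (i + 1).
            reservoir[p] = ls[i]
    return reservoir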

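def temporal_stratified_sample(week_groups, n):
    """
    Chooses about n samples in total while preserving the weekly and per-keyword proportions.

    Relies on the global df for the total number of tweets. A tweet that matches several
    keywords belongs to several strata, so the per-stratum sizes need not sum to exactly n.
    """
    total = df.shape[0]  # total number of tweets (K in the formula above)
    sample = {}
    for week in week_groups:
        sample[week] = {keyword: [] for keyword in CANNABIS_KEYWORDS}
        for keyword, group in week_groups[week].items():
            # Proportional allocation: stratum_size * n / total, rounded down.
            sample_size = int(len(group) / total * n)
            sample[week][keyword] = reservoir_sample(group, sample_size)
        print(f'Sampled for week {week}.')
    return sample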
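
The uniformity claim in the `reservoir_sample` docstring can be checked empirically. The following is a small, self-contained sketch, assuming a seeded copy of the same Algorithm R loop; `reservoir_sample_seeded` and `inclusion_frequencies` are hypothetical helpers written for this check and are not part of the notebook.

```python
import random


def reservoir_sample_seeded(items, n, rng):
    """Algorithm R: keep n of the items, each with probability n / len(items)."""
    reservoir = list(items[:n])
    for i in range(n, len(items)):
        j = rng.randint(0, i)  # inclusive on both ends
        if j < n:
            reservoir[j] = items[i]
    return reservoir


def inclusion_frequencies(population_size, n, trials, seed=0):
    """Estimate how often each item ends up in the sample over many trials."""
    rng = random.Random(seed)
    items = list(range(population_size))
    counts = [0] * population_size
    for _ in range(trials):
        for item in reservoir_sample_seeded(items, n, rng):
            counts[item] += 1
    return [count / trials for count in counts]


frequencies = inclusion_frequencies(population_size=10, n=3, trials=20000)
# Each value should be close to 3 / 10.
print([round(f, 2) for f in frequencies])
```

Passing the random generator in as an argument is only a convenience for reproducibility; the sampling logic is otherwise the same loop as above.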

In [17]:
week_groups = get_temporal_groups(df)

In [21]:
sample = temporal_stratified_sample(week_groups, 100000)

Sampled for week 39.
Sampled for week 40.
Sampled for week 49.
Sampled for week 50.
Sampled for week 36.
Sampled for week 37.
Sampled for week 23.
Sampled for week 24.
Sampled for week 45.
Sampled for week 46.
Sampled for week 32.
Sampled for week 33.
Sampled for week 19.
Sampled for week 20.
Sampled for week 41.
Sampled for week 42.
Sampled for week 28.
Sampled for week 29.
Sampled for week 51.
Sampled for week 38.
Sampled for week 25.
Sampled for week 47.
Sampled for week 26.
Sampled for week 27.
Sampled for week 34.
Sampled for week 21.
Sampled for week 43.
Sampled for week 30.
Sampled for week 52.
Sampled for week 48.
Sampled for week 35.
Sampled for week 22.
Sampled for week 44.
Sampled for week 17.
Sampled for week 18.
Sampled for week 31.


In [22]:
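def save_samples(df, sample, save_file):
    """
    Pickle the sampled tweets. Indices are de-duplicated first, because a tweet
    may have been drawn from more than one keyword stratum.
    """
    sample_ind = sorted({x for week in sample for group in sample[week].values() for x in group})
    # Positional lookup: assumes the index labels of df equal the row positions.
    sampled_tweets = df.iloc[sample_ind][['Id', 'UserId', 'CreatedAt', 'Text']]
    print(f'Saving {sampled_tweets.shape[0]} tweets to {save_file}')
    with open(save_file, 'wb') as file_handle:
        pickle.dump(sampled_tweets, file_handle)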

In [23]:
save_samples(df, sample, os.path.join(TWEETS_DIR, 'sampled.pickle'))

Saving 102701 tweets to ../Data/Tweets/sampled.pickle


In [24]:
with open(os.path.join(TWEETS_DIR, 'sampled.pickle'), 'rb') as file_handle:
    sdf = pickle.load(file_handle)

In [25]:
sdf.UserId.unique().shape

(75751,)

In [31]:
with open(os.path.join(TWEETS_DIR, '../Users/botscores.pickle'), 'rb') as file_handle:
    bs = pickle.load(file_handle)

In [32]:
len(bs)

41