## Reddit and word2vec

This is part of the code for my paper 'The Meanings of Class in Reddit Comments - An Exploratory Study of Word Embeddings'.

This notebook accesses the data, preprocesses it, trains a word2vec model and runs some evaluation tests.

More details on request: jonas.schwenke@uni-konstanz.de

In [143]:
# Imports
import pandas as pd
import numpy as np
import gensim
import json
import time
import ast
import re
import matplotlib.pyplot as plt
import os
import warnings
import google.auth
from google.cloud.bigquery.client import Client
import scipy
import cython
from gensim import models, similarities
from gensim.models.keyedvectors import KeyedVectors
import logging
import sys  
import itertools
import math

## Accessing Google BigQuery

WARNING: Large requests might cost money if an account is set up and the trial period has expired.

To access GBQ, you first need to create a Google account and install the necessary packages. 
<br>See https://cloud.google.com/bigquery/docs/reference/libraries?hl=de for more information.
<br>The query is written in Legacy SQL to make use of the RAND() function for sampling.

In [183]:
# enter the path to the GBQ project keys (JSON file)
keys_path = 'ssh_keys.json'

# load the key file under its own name so the json module is not overwritten
with open(keys_path, 'r') as f:
    key_data = json.load(f)

In [184]:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = keys_path
bq_client = Client()

In [None]:
# all tables to be used for sampling
all_years = [ "[fh-bigquery:reddit_comments.2008],"
            "[fh-bigquery:reddit_comments.2009],"
            "[fh-bigquery:reddit_comments.2010],"
            "[fh-bigquery:reddit_comments.2011],"
            "[fh-bigquery:reddit_comments.2012],"
            "[fh-bigquery:reddit_comments.2013],"
            "[fh-bigquery:reddit_comments.2014],"
            "[fh-bigquery:reddit_comments.2015_01],"
            "[fh-bigquery:reddit_comments.2015_02],"
            "[fh-bigquery:reddit_comments.2015_03],"
            "[fh-bigquery:reddit_comments.2015_04],"
            "[fh-bigquery:reddit_comments.2015_05],"
            "[fh-bigquery:reddit_comments.2015_06],"
            "[fh-bigquery:reddit_comments.2015_07],"
            "[fh-bigquery:reddit_comments.2015_08],"
            "[fh-bigquery:reddit_comments.2015_09],"
            "[fh-bigquery:reddit_comments.2015_10],"
            "[fh-bigquery:reddit_comments.2015_11],"
            "[fh-bigquery:reddit_comments.2015_12],"
            "[fh-bigquery:reddit_comments.2016_01],"
            "[fh-bigquery:reddit_comments.2016_02],"
            "[fh-bigquery:reddit_comments.2016_03],"
            "[fh-bigquery:reddit_comments.2016_04],"
            "[fh-bigquery:reddit_comments.2016_05],"
            "[fh-bigquery:reddit_comments.2016_06],"
            "[fh-bigquery:reddit_comments.2016_07],"
            "[fh-bigquery:reddit_comments.2016_08],"
            "[fh-bigquery:reddit_comments.2016_09],"
            "[fh-bigquery:reddit_comments.2016_10],"
            "[fh-bigquery:reddit_comments.2016_11],"
            "[fh-bigquery:reddit_comments.2016_12],"
            "[fh-bigquery:reddit_comments.2017_01],"
            "[fh-bigquery:reddit_comments.2017_02],"
            "[fh-bigquery:reddit_comments.2017_03],"
            "[fh-bigquery:reddit_comments.2017_04],"
            "[fh-bigquery:reddit_comments.2017_05],"
            "[fh-bigquery:reddit_comments.2017_06],"
            "[fh-bigquery:reddit_comments.2017_07],"
            "[fh-bigquery:reddit_comments.2017_08],"
            "[fh-bigquery:reddit_comments.2017_09],"
            "[fh-bigquery:reddit_comments.2017_10],"
            "[fh-bigquery:reddit_comments.2017_11],"
            "[fh-bigquery:reddit_comments.2017_12],"
            "[fh-bigquery:reddit_comments.2018_01],"
            "[fh-bigquery:reddit_comments.2018_02],"
            "[fh-bigquery:reddit_comments.2018_03],"
            "[fh-bigquery:reddit_comments.2018_04],"
            "[fh-bigquery:reddit_comments.2018_05],"
            "[fh-bigquery:reddit_comments.2018_06],"
            "[fh-bigquery:reddit_comments.2018_07],"
            "[fh-bigquery:reddit_comments.2018_08],"
            "[fh-bigquery:reddit_comments.2018_09],"
            "[fh-bigquery:reddit_comments.2018_10],"
            "[fh-bigquery:reddit_comments.2018_11],"
            "[fh-bigquery:reddit_comments.2018_12])"]

In [308]:
# downloads query object as pandas data frame
# (adjacent string literals are concatenated, so each clause needs its own separating space)
query = ("SELECT body, author, created_utc, parent_id, score, subreddit FROM "
        "[fh-bigquery:reddit_comments.2008],"
        "[fh-bigquery:reddit_comments.2009],"
        "[fh-bigquery:reddit_comments.2010],"
        "[fh-bigquery:reddit_comments.2011],"
        "[fh-bigquery:reddit_comments.2012],"
        "[fh-bigquery:reddit_comments.2013],"
        "[fh-bigquery:reddit_comments.2014],"
        "[fh-bigquery:reddit_comments.2015_01],"
        "[fh-bigquery:reddit_comments.2015_02],"
        "[fh-bigquery:reddit_comments.2015_03],"
        "[fh-bigquery:reddit_comments.2015_04],"
        "[fh-bigquery:reddit_comments.2015_05],"
        "[fh-bigquery:reddit_comments.2015_06],"
        "[fh-bigquery:reddit_comments.2015_07],"
        "[fh-bigquery:reddit_comments.2015_08],"
        "[fh-bigquery:reddit_comments.2015_09],"
        "[fh-bigquery:reddit_comments.2015_10],"
        "[fh-bigquery:reddit_comments.2015_11],"
        "[fh-bigquery:reddit_comments.2015_12],"
        "[fh-bigquery:reddit_comments.2016_01],"
        "[fh-bigquery:reddit_comments.2016_02],"
        "[fh-bigquery:reddit_comments.2016_03],"
        "[fh-bigquery:reddit_comments.2016_04],"
        "[fh-bigquery:reddit_comments.2016_05],"
        "[fh-bigquery:reddit_comments.2016_06],"
        "[fh-bigquery:reddit_comments.2016_07],"
        "[fh-bigquery:reddit_comments.2016_08],"
        "[fh-bigquery:reddit_comments.2016_09],"
        "[fh-bigquery:reddit_comments.2016_10],"
        "[fh-bigquery:reddit_comments.2016_11],"
        "[fh-bigquery:reddit_comments.2016_12],"
        "[fh-bigquery:reddit_comments.2017_01],"
        "[fh-bigquery:reddit_comments.2017_02],"
        "[fh-bigquery:reddit_comments.2017_03],"
        "[fh-bigquery:reddit_comments.2017_04],"
        "[fh-bigquery:reddit_comments.2017_05],"
        "[fh-bigquery:reddit_comments.2017_06],"
        "[fh-bigquery:reddit_comments.2017_07],"
        "[fh-bigquery:reddit_comments.2017_08],"
        "[fh-bigquery:reddit_comments.2017_09],"
        "[fh-bigquery:reddit_comments.2017_10],"
        "[fh-bigquery:reddit_comments.2017_11],"
        "[fh-bigquery:reddit_comments.2017_12],"
        "[fh-bigquery:reddit_comments.2018_01],"
        "[fh-bigquery:reddit_comments.2018_02],"
        "[fh-bigquery:reddit_comments.2018_03],"
        "[fh-bigquery:reddit_comments.2018_04],"
        "[fh-bigquery:reddit_comments.2018_05],"
        "[fh-bigquery:reddit_comments.2018_06],"
        "[fh-bigquery:reddit_comments.2018_07],"
        "[fh-bigquery:reddit_comments.2018_08],"
        "[fh-bigquery:reddit_comments.2018_09],"
        "[fh-bigquery:reddit_comments.2018_10],"
        "[fh-bigquery:reddit_comments.2018_11],"
        "[fh-bigquery:reddit_comments.2018_12] "
        "WHERE subreddit='AskReddit' "
        "LIMIT 10")
credentials, project = google.auth.default()

worldnews = pd.read_gbq(query, 
                        location="US", 
                        credentials=credentials, 
                        dialect='legacy', 
                        project_id='google_big_query_projectname')

Downloading: 100%|██████████| 10000000/10000000 [30:42<00:00, 5428.77rows/s]
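
The query cell above hard-codes the table list and does not show the RAND() sampling step mentioned in the introduction to this section. As a minimal sketch of how both could be generated, assuming the yearly tables end in 2014 and the monthly tables run from 2015_01 to 2018_12 as listed above: the helper names `reddit_tables` and `build_sample_query`, and the `fraction` and `limit` values, are illustrative assumptions and not the settings used for the paper.

```python
def reddit_tables(first_year=2008, last_year=2018, monthly_from=2015):
    """List the bracketed Legacy SQL table names, yearly first, then monthly."""
    names = []
    for year in range(first_year, last_year + 1):
        if year < monthly_from:
            names.append("[fh-bigquery:reddit_comments.{}]".format(year))
        else:
            for month in range(1, 13):
                names.append("[fh-bigquery:reddit_comments.{}_{:02d}]".format(year, month))
    return names


def build_sample_query(tables, subreddit, fraction, limit):
    """Build a Legacy SQL query that keeps roughly `fraction` of the matching rows."""
    columns = "body, author, created_utc, parent_id, score, subreddit"
    # In Legacy SQL a comma-separated table list acts as UNION ALL.
    return (
        "SELECT {} FROM {} WHERE subreddit = '{}' AND RAND() < {} LIMIT {}"
    ).format(columns, ", ".join(tables), subreddit, fraction, limit)


tables = reddit_tables()
# fraction and limit are placeholder values (assumptions for illustration)
sample_query = build_sample_query(tables, "AskReddit", fraction=0.1, limit=1000)
print(len(tables))
print(sample_query[:120])
```

The resulting string can be passed to `pd.read_gbq` with `dialect='legacy'` in the same way as the hand-written query.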


In [309]:
# save compressed pandas data frame
os.chdir('filepath')
worldnews.to_csv('worldnews.gz', 
               sep='|', 
               compression='gzip')

## Preprocessing

This part loads two bot lists, preprocesses the chunked, compressed files and saves the results.

In [206]:
# read botlists 
# 1. https://www.reddit.com/r/autowikibot/wiki/redditbots
# 2. Custom botlist from top 50 authors

botlist = pd.read_csv('redditbots.csv')
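# strip the '/u/' prefix only; lstrip('/u/') would also eat leading 'u' and '/' characters of the name
botlist['bots'] = botlist['bots'].map(lambda x: re.sub(r'^/u/', '', x))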
botlist = botlist['bots'].values.tolist()

top50bots = pd.read_csv('top50bots.csv', header=None)[0].values.tolist()
botlist = botlist + top50bots
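
Stripping the `/u/` prefix needs some care, because `str.lstrip` treats its argument as a set of characters rather than as a prefix. A small sketch of the difference, using a made-up user name (`uncle_bot` is hypothetical and not taken from the bot lists):

```python
def strip_user_prefix(name, prefix="/u/"):
    """Remove a leading '/u/' from a Reddit user name, if present."""
    return name[len(prefix):] if name.startswith(prefix) else name


# lstrip removes every leading '/' and 'u', so it also eats the first letter of the name
stripped_with_lstrip = "/u/uncle_bot".lstrip("/u/")
# the prefix-aware helper leaves the name itself intact
stripped_with_helper = strip_user_prefix("/u/uncle_bot")

print(stripped_with_lstrip)
print(stripped_with_helper)
```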

In [207]:
# Preprocessing - drop NA, bots & duplicates with optional print statements
def drop(data, botlist):
    
    # start timer
    start_time = time.time()
    
    # print('-- Rows before preprocessing: ' + str(len(data)))
    
    # drop NA
    data = data.dropna(subset=['body'])
    # print('-- Rows after dropping NA: ' + str(len(data)))
    
    # remove comments from bots
    data = data[~data['author'].str.lower().isin([x.lower() for x in botlist])]
    # print('-- Rows after dropping from botlist: ' + str(len(data)))
    
    # remove more bots/moderators
    data = data[~data.author.str.contains(pat='bot|moderator|b0t', case=False, na=False)].reset_index(drop=True)
    # print('-- Rows after dropping more bots/moderators: ' + str(len(data)))
    
    # drop deleted authors
    data = data.drop(data[data['author'] == '[deleted]'].index)
    # print('-- Rows after dropping deleted authors: ' + str(len(data)))
       
    # remove duplicates based on body (keep=False drops every copy, not just the repeats)
    data = data.drop_duplicates(subset=['body'], keep=False)
    # print('-- Rows after dropping duplicates: ' + str(len(data)))

    # reset index
    data = data.reset_index(drop=True)   
    
    # print time
    print("--- %s seconds ---" % (time.time() - start_time))

    return data

In [208]:
# Preprocessing - body, tokenize, remove short
def tokenize(data):

    # start timer
    start_time = time.time()
    
    # remove line breaks from body
    data['body'] = data['body'].str.replace('\n',' ')

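    # remove links (regex=True is set explicitly because newer pandas versions default to literal matching)
    pattern = r'http\S+'
    data['body'] = data['body'].str.replace(pat=pattern, repl=' ', regex=True)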

    # lowercase, remove numbers and punctuation, tokenize
    data['body'] = data.apply(lambda row: gensim.utils.simple_preprocess(row.body, deacc=False, min_len=1, max_len=15), axis=1)
    data = data.rename(columns = {'body': 'body_tokens'})
    
    # remove comments with fewer than 5 tokens
    data = data[data['body_tokens'].map(len) >= 5]
    print('-- Rows after dropping short comments: ' + str(len(data)))

    # reset index
    data = data.reset_index(drop=True) 

    # create new column
    data['no_tokens'] = data['body_tokens'].str.len()

    # print time
    print("--- %s seconds ---" % (time.time() - start_time))

    return data
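
To make the effect of these steps easier to see on a single comment, here is a rough standard-library stand-in. It only approximates `gensim.utils.simple_preprocess` (it keeps runs of letters and drops digits, punctuation and underscores) and is not gensim's exact rule set; `rough_tokens` and `keep_comment` are hypothetical helper names.

```python
import re

URL_PATTERN = re.compile(r"http\S+")
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")  # runs of letters only


def rough_tokens(text, min_len=1, max_len=15):
    """Approximate the cleaning steps: line breaks, links, lowercase, tokenize."""
    text = URL_PATTERN.sub(" ", text.replace("\n", " "))
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


def keep_comment(tokens, min_tokens=5):
    """Mirror the length filter: keep comments with at least min_tokens tokens."""
    return len(tokens) >= min_tokens


example = rough_tokens("Check https://example.com\nIt's 2018, right?")
print(example, keep_comment(example))
```

The example comment is reduced to four tokens, so it would be dropped by the length filter.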

In [310]:
# Read chunked data, preprocess and save as compressed files
os.chdir('directory')
chunksize = 10**6
i = 0
for chunk in pd.read_csv('askreddit.gz',
                         sep='|',
                         compression='gzip',
                         index_col=0,
                         chunksize=chunksize,
                         usecols=[0,1,2],
                         lineterminator='\n'):
    
    # start timer
    start_time = time.time()
    print('--- Processing chunk ' + str(i))
    
    # preprocess
    chunk = tokenize(drop(chunk,botlist))
    
    # save as compressed file
    os.chdir('/Users/jonas/Documents/SEDS/CSS/Project/Data/Comments/WorldNews')
    chunk.to_csv('worldnews_pp_'+str(i)+'.gz',
                 sep = '|',
                 compression='gzip')
    
    print("--- Finished after %s seconds ---" % (time.time() - start_time))
    
    i = i + 1

--- Processing chunk 0
-- Rows after dropping short comments: 784690
--- Finished after 253.14268684387207 seconds ---
--- Processing chunk 1
-- Rows after dropping short comments: 785628
--- Finished after 196.0009548664093 seconds ---
--- Processing chunk 2
-- Rows after dropping short comments: 785053
--- Finished after 190.57830786705017 seconds ---
--- Processing chunk 3
-- Rows after dropping short comments: 785125
--- Finished after 194.95415878295898 seconds ---
--- Processing chunk 4
-- Rows after dropping short comments: 785638
--- Finished after 198.19109988212585 seconds ---
--- Processing chunk 5
-- Rows after dropping short comments: 787308
--- Finished after 200.8927869796753 seconds ---
--- Processing chunk 6
-- Rows after dropping short comments: 787261
--- Finished after 202.44296407699585 seconds ---
--- Processing chunk 7
-- Rows after dropping short comments: 788029
--- Finished after 255.3637399673462 seconds ---
--- Processing chunk 8
-- Rows after dropping short

## word2vec

The compressed, preprocessed files are streamed into word2vec.

In [374]:
# Stream compressed files to avoid RAM crash
class MySentences(object):
    def __init__(self, dirname,limit=None):
        self.dirname = dirname
        self.limit = limit

    def __iter__(self):
        # count the number of comments (token lists) yielded
        no = 0 
        # iterate through the file directory
        for fname in os.listdir(self.dirname):
            # for each compressed file open it
            with gensim.utils.smart_open(os.path.join(self.dirname, fname)) as fin:             
                for line in itertools.islice(fin, self.limit):
                    try:
                        tokens = ast.literal_eval(gensim.utils.to_unicode(line).split("|")[1])
                    except (ValueError, SyntaxError, IndexError):
                        # skip the header row and any malformed line
                        continue
                    no += 1
                    yield tokens
        # print the number of comments yielded in this pass
        print(str(no))
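
The parsing step inside the iterator can be tried on single lines without gensim. This sketch assumes the preprocessed files have the column layout index, `body_tokens`, `author`, `no_tokens` (which is what the preprocessing cells appear to write); `parse_tokens` and the sample rows are made up for illustration.

```python
import ast


def parse_tokens(line, column=1):
    """Return the token list stored in one '|'-separated line, or None if unparsable."""
    try:
        tokens = ast.literal_eval(line.rstrip("\n").split("|")[column])
    except (ValueError, SyntaxError, IndexError):
        return None
    return tokens if isinstance(tokens, list) else None


header = "|body_tokens|author|no_tokens\n"
row = "0|['what', 'is', 'your', 'favourite', 'book']|some_user|5\n"

print(parse_tokens(header))  # the header is not a Python literal, so it is skipped
print(parse_tokens(row))
```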

In [375]:
# Read files and run word2vec 
assert gensim.models.word2vec.FAST_VERSION > -1
path = 'directory_of_compressed_files'
sentences = MySentences(path) # a memory-friendly iterator

# Set parameters (gensim 3.x names; gensim 4 renamed size to vector_size and iter to epochs).
# Details here: https://radimrehurek.com/gensim/models/word2vec.html
model = gensim.models.word2vec.Word2Vec(sentences,sg=1, 
                                        size=300, 
                                        window=5, 
                                        min_count=10, 
                                        workers=10, 
                                        hs=0, 
                                        negative=8,
                                        iter=5)

13251251
13251251
13251251
13251251
13251251
13251251


In [376]:
# classic test
model.wv.most_similar(positive=['king', 'woman'], negative=['man'])

[('queen', 0.5738304853439331),
 ('angmar', 0.4578200578689575),
 ('boleyn', 0.44060570001602173),
 ('nosmo', 0.4355197250843048),
 ('latifa', 0.4351215660572052),
 ('kings', 0.4347761869430542),
 ('latifah', 0.4287641644477844),
 ('arthurs', 0.426189124584198),
 ('soopers', 0.40994611382484436),
 ('nefertiti', 0.40902742743492126)]
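
As a rough illustration of what the analogy test computes, the sketch below adds and subtracts unit-length vectors and ranks the remaining words by cosine similarity. This is a simplified version of the additive analogy method as I understand `most_similar` to apply it; the three-dimensional `toy` vectors are invented for the example and have nothing to do with the trained model.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    exclude = set(positive) | set(negative)
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)


# invented toy vectors: axis 0 = royalty, axis 1 = male, axis 2 = female
toy = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [1.0, 1.0, 1.0],
}
best = analogy(toy, positive=["king", "woman"], negative=["man"])
print(best)
```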

In [377]:
# save model
model.save('/Users/jonas/Documents/SEDS/CSS/Project/Data/Models/askreddit500.model')

In [175]:
# Save model vocabulary list and vectors separately for further analysis in R
def save_emb(model,name,directory):
    syn0_object=model.wv.syn0

    ##output vector space##
    np.savetxt(directory+'syn0_ngf_'+name+'.txt',
             syn0_object, delimiter=" ")

    #output vocab list#
    vocab_list = model.wv.index2word
    for i in range(0,len(vocab_list)):
        if vocab_list[i] == '':
            vocab_list[i] = "thisisanemptytoken"+str(i)

    with open(directory+'vocab_list_ngf_'+name+'.txt','wb') as outfile:
        for i in range(0,len(vocab_list)):
            outfile.write(vocab_list[i].encode('utf8')+"\n".encode('ascii'))

## Evaluation

Runs several evaluation tests: 
- Google Semantic-Syntactic: http://download.tensorflow.org/data/questions-words.txt (split)
- SimVerb-3500: https://github.com/JoonyoungYi/datasets/tree/master/simverb3500
- MEN: https://staff.fnwi.uva.nl/e.bruni/MEN

In [378]:
# Evaluate models
warnings.filterwarnings('ignore')

# directory containing Models/ and Evaluation/ (the latter holds the test files)
os.chdir('directory')

# models to be compared (file names in Models/ without the .model extension)
model_names = ['askreddit500']

# load each model and print results
for name in model_names:
    mod = gensim.models.word2vec.Word2Vec.load('Models/'+name+'.model')

    print(name+' performance:\n')
    
    sem = mod.wv.accuracy('Evaluation/semantic.txt')
    print('Google semantic: '+ str(round(test_res(sem), 4)))

    syn = mod.wv.accuracy('Evaluation/syntactic.txt')
    print('Google syntactic: '+ str(round(test_res(syn), 4)))
    
    simverb = mod.wv.evaluate_word_pairs('Evaluation/simverb-3500.txt')[0][0]
    print('Simverb Pearson: '+ str(round(simverb, 4)))

    men = mod.wv.evaluate_word_pairs('Evaluation/men.txt')[0][0]
    print('MEN Pearson: '+ str(round(men, 4)))

    print('\n------------\n')


askreddit500 performance:

Google semantic: 0.634
Google syntactic: 0.6803
Simverb Pearson: 0.3767
MEN Pearson: 0.7513

------------

