## Drilling down on @Dril

> "getting pisse d off imagining my trolls and dissenters crawling around my house in little butler outfits and expecting tips" 

-- @dril, after his recent doxing.  Retweeted more than 3,800 times and liked more than 32,000 times.

What, in the land of weird Twitter, makes @dril @dril?

We can use the Python tool Scattertext to shed light on @dril's unique style and compare his Tweets to others in the weird Twitter space.

We'll first use Tweepy to pull the corpus of @dril's tweets.  To examine his style further, we'll need tweets from other weird Twitter users as a comparison corpus.

Given @dril's propensity to mention other weird Twitter users, we can download the timelines of the users he mentions most often.

## Technique

We'll use the Scattertext Python package to create the visualizations. See an overview on https://github.com/JasonKessler/scattertext

If you use the package, please cite it as:

Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017.

In [1]:
# Preliminary imports

%matplotlib inline
import twitter, tweepy
import scattertext as st
import re, io, itertools
from pprint import pprint
import pandas as pd
import numpy as np
import spacy
import os, pkgutil, json, urllib, datetime, time
from urllib.request import urlopen
from IPython.display import IFrame, display, HTML
import warnings
import functools
import collections
from html.parser import HTMLParser
warnings.filterwarnings("ignore", category=DeprecationWarning) 
display(HTML("<style>.container { width:98% !important; }</style>"))

## Set up Twitter API connection using Tweepy

In [2]:
# Read the Twitter credentials from environment variables
secret_key, consumer_key = os.environ['TWITTER_SECRET_KEY'], os.environ['TWITTER_CONSUMER_KEY']
access_token, token_secret = os.environ['TWITTER_ACCESS_TOKEN'], os.environ['TWITTER_TOKEN_SECRET']
# Authenticate and build the Tweepy client used by the helpers below
auth = tweepy.OAuthHandler(consumer_key, secret_key)
auth.set_access_token(access_token, token_secret)
api = tweepy.API(auth)

In [3]:
nlp = spacy.load('en')  # with spaCy 3+, load a named model instead, e.g. 'en_core_web_sm'

## Some helper functions for pulling statuses, assembling them into data frames, and extracting mentions

In [3]:
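def pull_statuses(screen_name, verbose=False, last_id = None):
    # Page through a user's timeline. On an API error, report it and
    # return whatever pages were fetched so far. (last_id is currently unused.)
    pages = []
    try:
        for page in tweepy.Cursor(api.user_timeline, screen_name=screen_name).pages():
            if verbose: print(len(page))
            pages.append(page)
    except Exception as e:
        print(screen_name, e)
        
    return pages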

In [4]:
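from html import unescape  # HTMLParser.unescape was removed in Python 3.9

def dataframe_from_pages(pages):
    # Flatten the pages of Tweepy statuses into one data frame of raw JSON fields
    df = pd.concat([pd.DataFrame([x._json for x in page]) for page in pages]).reset_index()
    df['created_at'] = pd.to_datetime(df['created_at'])
    # Strip the leading "RT @user:" from retweets, then decode HTML entities
    df['retweet_filtered_text'] = (df
                                   .text
                                   .apply(lambda x: unescape(' '.join(x.split()[2:]) 
                                                             if x.startswith("RT ") 
                                                             else x)))
    # Parse each cleaned tweet with spaCy
    df['parse'] = df['retweet_filtered_text'].apply(nlp)
    return df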
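
To make the retweet handling in `dataframe_from_pages` concrete, here is a minimal standalone sketch of the same cleaning step applied to plain strings. The helper name `strip_retweet_prefix` and the sample tweets are made up for illustration; the only library call is `html.unescape` from the standard library.

```python
from html import unescape

def strip_retweet_prefix(text):
    """Drop a leading 'RT @user:' (the first two whitespace-separated tokens)
    from retweets, then decode HTML entities such as &amp;."""
    if text.startswith("RT "):
        text = " ".join(text.split()[2:])
    return unescape(text)

samples = [
    "RT @example_user: salt &amp; vinegar   forever",  # hypothetical retweet
    "no entities here",                                # hypothetical original tweet
]
cleaned = [strip_retweet_prefix(s) for s in samples]
print(cleaned)
```

Note that splitting and re-joining a retweet also collapses runs of whitespace, which is harmless for the term counts used later.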

In [5]:
def extract_mention_counts(tweets):
    # Count lower-cased @-mentions in a Series of tweet texts,
    # skipping any entries that are not strings
    mentions = collections.Counter(
        t
        for text in tweets[tweets.apply(lambda x: isinstance(x, str))]
        for t in text.lower().split()
        if t.startswith('@') and len(t) > 2
    )
    return mentions
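
As a quick sanity check of the mention-counting logic, the sketch below applies the same whitespace-splitting rule to a handful of invented tweets, using a plain list instead of a pandas Series. The name `count_mentions` and the sample texts are hypothetical.

```python
import collections

def count_mentions(texts):
    """Count lower-cased @-mentions across an iterable of tweet texts."""
    return collections.Counter(
        token
        for text in texts
        if isinstance(text, str)               # skip missing values
        for token in text.lower().split()
        if token.startswith('@') and len(token) > 2
    )

toy_tweets = ["@Alice hello", "thanks @alice and @bob", None, "@ alone"]
toy_counts = count_mentions(toy_tweets)
print(toy_counts.most_common(2))
```

Because tokens are split on whitespace only, trailing punctuation stays attached, so `@bob` and `@bob,` would be counted as different mentions.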

## Pull @dril's timeline into the pandas data frame dril_status_df

In [14]:
dril_status_df = dataframe_from_pages(pull_statuses('dril'))
dril_status_df['screen_name'] = 'dril'

## Let's look at the users @dril mentions frequently to identify other weird Tweeters

In [29]:
mentions = extract_mention_counts(dril_status_df.text)
print(mentions.most_common(4))
print('...')

[('@nataliejmooney', 34), ('@machiavellino', 29), ('@neonwario', 25), ('@mcdonalds', 24)]
...


Many are inaccessible or not right for the comparison set.

- ~~@nataliejmooney~~ (protected)
- ~~@machiavellino~~ (suspended)
- @neonwario 
- ~~@mcdonalds~~ (a corporation)
- @celiapienkosz
- @bronzehammer
- @_hermit_thrush_
- ~~@respected_loner~~ (protected)
- @adultblackmale
- ~~@dril~~ (@dril)
- ~~@sofieok~~ (protected)
- @911victim
- @dinkmagic
- ~~@dwayne274928572~~ (deleted)
- @bakkooonn
- @leyawn
- @hermit_thrush
- ~~@hizzaerd~~ (protected)
- ~~@bashfulcoward~~ (suspended)
- @bevissimpson

## We were able to download statuses for five users before exceeding the API rate limit :(

In [31]:
weird_tweeps = ['neonwario', 'celiapienkosz', 'bronzehammer', '_hermit_thrush_', 
                'adultblackmale', '911victim', 'dinkmagic', 'bakkooonn',
                'leyawn', 'hermit_thrush', 'bevissimpson']
other_statuses = []
for user in weird_tweeps:
    try:
        other_statuses.append(pull_statuses(user))
    except Exception:
        print('Cannot pull statuses from', user)

911victim Twitter error response: status code = 429
dinkmagic Twitter error response: status code = 429
bakkooonn Twitter error response: status code = 429
leyawn Twitter error response: status code = 429
hermit_thrush Twitter error response: status code = 429
bevissimpson Twitter error response: status code = 429
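
Status code 429 means the Twitter rate limit was reached. Tweepy can wait out the limit on its own if the client is created with `tweepy.API(auth, wait_on_rate_limit=True)`. Alternatively, a generic retry-with-backoff wrapper can be used. The following is a minimal sketch of that idea, not what was run for this notebook; `call_with_backoff` and `flaky` are hypothetical names, and the sleep function is injectable so the example runs instantly.

```python
import time

def call_with_backoff(func, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait base_delay * 2**attempt seconds and retry.
    Re-raises the last exception once max_tries attempts have failed."""
    for attempt in range(max_tries):
        try:
            return func()
        except Exception:
            if attempt == max_tries - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Toy demonstration: a function that fails twice before succeeding
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated 429")
    return "ok"

waits = []
result = call_with_backoff(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

Since `pull_statuses` catches exceptions itself, a wrapper like this would only help if the rate-limit error were allowed to propagate out of the function being retried.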


## Assemble the other statuses into a data frame, label statuses with screen names, and concatenate all statuses into one big data frame

In [83]:
# Each element of other_statuses is one user's list of pages; skip users with no pages
other_status_df = pd.concat([dataframe_from_pages(pages) 
                             for pages in other_statuses 
                             if len(pages)]).reset_index(drop=True)
other_status_df['screen_name'] = other_status_df['user'].apply(lambda x: x['screen_name'])
all_status_df = pd.concat([dril_status_df, other_status_df]).reset_index(drop=True)

## Remove retweets and tweets of 20 characters or fewer, and drop duplicate statuses
Save the data frame to disk.

In [84]:
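# Build a boolean mask over the de-duplicated texts, then keep only rows where it is True
keep = (all_status_df
        .text
        .drop_duplicates()
        .apply(lambda x: len(x) > 20 and not x.startswith('RT ')))
all_status_df = all_status_df.loc[keep[keep].index]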
all_status_df.to_csv('all_status_df.csv.gz', index=False, compression='gzip')
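
The intended filtering rule (drop duplicates, drop retweets, keep only texts longer than 20 characters) can be checked on a few invented strings without pandas. This is a minimal sketch; `filter_statuses` and the sample texts are hypothetical.

```python
def filter_statuses(texts, min_length=20):
    """Keep the first occurrence of each text that is longer than
    min_length characters and is not a retweet."""
    seen = set()
    kept = []
    for text in texts:
        if text in seen:          # duplicate status, skip it
            continue
        seen.add(text)
        if len(text) > min_length and not text.startswith('RT '):
            kept.append(text)
    return kept

toy_texts = [
    "short",
    "RT @someone: this is a long retweet text",
    "an original tweet that is long enough",
    "an original tweet that is long enough",   # duplicate
]
kept_texts = filter_statuses(toy_texts)
print(kept_texts)
```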

In [4]:
# reread archived statuses
try:
    all_status_df
except NameError:
    all_status_df = pd.read_csv('all_status_df.csv.gz', compression='gzip')
    all_status_df['parse'] = all_status_df['retweet_filtered_text'].apply(nlp)



## Assemble into Scattertext corpus, with the category being screen name.  
Omit mentions, and words with apostrophes in them.  Dril's lack of apostrophes messes up the analysis.

In [5]:
corpus = (st.CorpusFromParsedDocuments(all_status_df, 
                                       category_col='screen_name', 
                                       parsed_col='parse')
          .build())
corpus = corpus.remove_terms([t 
                              for t in corpus.get_term_freq_df().index 
                              if ('@' in t or "'" in t or "’" in t)])

In [6]:
tdm = corpus.get_term_freq_df()

## Standard Scattertext plot

It's clear that @dril, uniquely, has a potty mouth.  The top scaled-F-score terms are profane, scatological, or related to social media (feed, trolls, followers, content, user, etc.). Terms are colored by scaled F-score.

Please see https://github.com/JasonKessler/scattertext for more information on the metric.
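
For intuition, the following is a simplified sketch of the idea behind scaled F-score: a term's precision (the share of its uses that come from the category) and its frequency within the category are each rescaled through a normal CDF, then combined with a harmonic mean. It is an illustration under that assumption, not Scattertext's exact implementation; the function name `scaled_f_scores` and the toy counts are invented.

```python
from statistics import NormalDist, mean, pstdev

def scaled_f_scores(cat_counts, other_counts, beta=1.0):
    """Simplified scaled F-score for each term in the category corpus.
    cat_counts / other_counts map term -> positive count (toy inputs)."""
    terms = sorted(cat_counts)
    total_cat = sum(cat_counts.values())
    precision = [cat_counts[t] / (cat_counts[t] + other_counts.get(t, 0)) for t in terms]
    frequency = [cat_counts[t] / total_cat for t in terms]

    def normal_cdf_scale(values):
        # Map raw values into (0, 1) using a normal CDF fitted to them
        sigma = pstdev(values) or 1.0  # avoid a zero-width distribution
        dist = NormalDist(mean(values), sigma)
        return [dist.cdf(v) for v in values]

    scaled_p = normal_cdf_scale(precision)
    scaled_f = normal_cdf_scale(frequency)
    # Harmonic mean (F-beta) of the two rescaled quantities
    return {
        t: (1 + beta ** 2) * p * f / (beta ** 2 * p + f)
        for t, p, f in zip(terms, scaled_p, scaled_f)
    }

toy_cat = {'feed': 30, 'the': 100, 'crazy': 1}     # made-up counts for the category
toy_other = {'feed': 2, 'the': 100, 'crazy': 20}   # made-up counts for everyone else
scores = scaled_f_scores(toy_cat, toy_other)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(term, round(score, 3))
```

With a vocabulary this small the rescaling is crude, so a very frequent shared word can still score well; the sketch is only meant to show how precision and frequency are traded off.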

While @dril tends to make gendered references to people using "boys" and "girls", other weird Twitter users prefer the term "dudes".  Unlike them, @dril doesn't use acronyms like "btw", "lol", or "ftw", and he doesn't call things "crazy".


In [7]:
metadata = '@' + all_status_df['screen_name'] + ' ' + all_status_df['created_at'].astype(str)
html = st.produce_scattertext_explorer(corpus,
                                       category='dril',
                                       category_name='@Dril',
                                       not_category_name='Not @Dril',
                                       use_full_doc=True,
                                       minimum_term_frequency=20,
                                       minimum_not_category_term_frequency=20,
                                       pmi_filter_thresold=6,
                                       term_ranker=st.termranking.OncePerDocFrequencyRanker,
                                       width_in_pixels=1000,
                                       metadata=metadata,
                                       sort_by_dist=False)

file_name = 'output/dril_vs_non.html'
# Assumes the output/ directory already exists
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))
IFrame(src=file_name, width = 1500, height=700)

## Alternative visualization
We can instead use the log-odds ratio with an uninformative prior (see Monroe et al. (2009) for more details) to visualize language differences.

Here, the x-axis is the log-frequency of a term, and the y-axis shows how strongly the term is associated with @dril.  This technique can favor function words over content terms.
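
As a rough illustration of the metric, the sketch below computes the z-score of the log-odds ratio with a flat Dirichlet prior, following the formulation in Monroe et al.: each term's log-odds is smoothed by a small constant `alpha`, and the difference between the two corpora is divided by its approximate standard deviation. The function name `log_odds_z_scores`, the value of `alpha`, and the toy counts are assumptions for illustration and are not taken from Scattertext's code.

```python
import math

def log_odds_z_scores(cat_counts, other_counts, alpha=0.01):
    """z-score of the log-odds ratio with an uninformative (constant) prior.
    Positive values mean the term is associated with the category."""
    vocab = sorted(set(cat_counts) | set(other_counts))
    alpha_0 = alpha * len(vocab)            # total prior mass
    n_cat = sum(cat_counts.values())
    n_other = sum(other_counts.values())
    z_scores = {}
    for term in vocab:
        y_cat = cat_counts.get(term, 0)
        y_other = other_counts.get(term, 0)
        # Smoothed log-odds of the term in each corpus
        log_odds_cat = math.log((y_cat + alpha) / (n_cat + alpha_0 - y_cat - alpha))
        log_odds_other = math.log((y_other + alpha) / (n_other + alpha_0 - y_other - alpha))
        delta = log_odds_cat - log_odds_other
        # Approximate variance of the difference
        variance = 1.0 / (y_cat + alpha) + 1.0 / (y_other + alpha)
        z_scores[term] = delta / math.sqrt(variance)
    return z_scores

toy_cat = {'feed': 30, 'the': 100, 'crazy': 1}     # made-up counts for the category
toy_other = {'feed': 2, 'the': 100, 'crazy': 20}   # made-up counts for everyone else
z = log_odds_z_scores(toy_cat, toy_other)
for term, score in sorted(z.items(), key=lambda kv: -kv[1]):
    print(term, round(score, 2))
```

Because the variance term shrinks as counts grow, high-frequency words can reach large z-scores from small relative differences, which is one reason this method can favor function words.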

In [9]:
html = st.produce_fightin_words_explorer(corpus,
                                         category='dril',
                                         category_name='@Dril',
                                         not_category_name='Not @Dril',
                                         use_full_doc=True,
                                         minimum_term_frequency=20,
                                         minimum_not_category_term_frequency=20,
                                         pmi_filter_thresold=6,                                         
                                         term_ranker=st.termranking.OncePerDocFrequencyRanker,
                                         width_in_pixels=1000,
                                         metadata=metadata)
file_name = 'output/dril_vs_non_lorup.html'
# Assumes the output/ directory already exists
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))
IFrame(src=file_name, width = 1500, height=700)