## Using Scattertext to Explore the 2018 State of the Union Speech
### Jason S. Kessler: http://www.jasonkessler.com


If you are academically inclined, you can cite the accompanying technical article as

Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. Vancouver, BC. 2017. https://arxiv.org/abs/1703.00565


In [9]:
%matplotlib inline
import scattertext as st
import re, io, itertools
from pprint import pprint
import pandas as pd
import numpy as np
import spacy
from html import unescape
import os, pkgutil, json, urllib, datetime
from urllib.request import urlopen
from IPython.display import IFrame
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:98% !important; }</style>"))

## Download the database of tweets, parse them, and filter out retweets and tweets from devices that Trump probably wasn't using. Keep only tweets posted after November 1, 2017

In [10]:
nlp = spacy.load('en', parse=False)

### Grab the last three months of Trump's Tweets
* filter them for Tweets written on mobile devices,
* filter out retweets

### Download historic State of the Union addresses
* Keep the ones since 1980
* Leave out Trump's 1st address

In [11]:
tweet_df = pd.concat([pd.read_json('http://www.trumptwitterarchive.com/data/realdonaldtrump/%s.json' % (year))
                      for year in range(2017, 2019)])

In [12]:
tweet_df = tweet_df[tweet_df.source.isin(['Twitter for iPhone', 'Twitter for Android']) 
                    & (~tweet_df.is_retweet) 
                    & (tweet_df.created_at > '2017-11-01')]
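
The three conditions in the filter above can be checked on a small frame. This is a minimal sketch on made-up rows; only the column names (`source`, `is_retweet`, `created_at`) mirror `tweet_df`, and the names `toy_tweets`, `from_phone`, `original` and `recent` are hypothetical.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror tweet_df above.
toy_tweets = pd.DataFrame({
    'source': ['Twitter for iPhone', 'Twitter Web Client', 'Twitter for Android',
               'Twitter for iPhone', 'Twitter for Android'],
    'is_retweet': [False, False, True, False, False],
    'created_at': pd.to_datetime(['2018-01-15', '2018-01-16', '2017-12-01',
                                  '2017-06-01', '2017-11-02']),
})

from_phone = toy_tweets.source.isin(['Twitter for iPhone', 'Twitter for Android'])
original = ~toy_tweets.is_retweet                      # True where the tweet is not a retweet
recent = toy_tweets.created_at > pd.Timestamp('2017-11-01')

# A row survives only if all three masks are True.
kept = toy_tweets[from_phone & original & recent]
print(kept.index.tolist())
```

Each mask is a boolean Series aligned on the index, so `&` combines them row by row; the parentheses in the original cell are needed because `&` binds more tightly than `>`.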

In [13]:
tweet_df['parse'] = tweet_df.text.apply(unescape).apply(nlp)

In [14]:
sou_df = pd.read_csv('https://raw.githubusercontent.com/BrianWeinstein/state-of-the-union/master/transcripts.csv')

In [15]:
sou_df = (sou_df[(sou_df['date'] > '1980-01-01') 
                 & (sou_df['date'] != '2017-02-28')]
          .rename(columns={'transcript': 'text'}))
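
The date filter above compares strings rather than datetimes. Assuming the `date` column holds ISO-style `YYYY-MM-DD` strings (an assumption about the CSV, not something checked here), string comparison agrees with chronological order. A small sketch on illustrative dates, with `toy_sou` as a hypothetical name:

```python
import pandas as pd

# Illustrative dates only; the real transcripts.csv is not loaded here.
toy_sou = pd.DataFrame({'date': ['1979-01-23', '1980-01-23', '2016-01-12',
                                 '2017-02-28', '2018-01-30']})

# Zero-padded YYYY-MM-DD strings sort lexicographically in date order,
# so '>' keeps everything after 1980-01-01 and '!=' drops one exact date.
kept_sou = toy_sou[(toy_sou['date'] > '1980-01-01')
                   & (toy_sou['date'] != '2017-02-28')]
print(kept_sou['date'].tolist())
```

If the dates were stored in another format (for example `1/23/1980`), this comparison would silently misorder rows, and converting with `pd.to_datetime` first would be the safer choice.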

In [16]:
sou_df['parse'] = sou_df.text.apply(nlp)

### Merge the two sets and include metadata for analysis

In [17]:
tweet_df['metadata'] = tweet_df['created_at']
tweet_df['category'] = 'tweet'
sou_df['metadata'] = sou_df['president'] + ': ' + sou_df['date']
sou_df['category'] = sou_df.president.apply(lambda x: 'trumpsou' if x == 'Donald J. Trump' else 'othersou')
df = pd.concat([tweet_df, sou_df])

In [18]:
df.to_csv('trump_state_of_union_2018.csv.gz', index=False, compression='gzip')

In [19]:
df = pd.read_csv('trump_state_of_union_2018.csv.gz')
df['parse'] = df['text'].apply(unescape).apply(nlp)

In [20]:
corpus = st.CorpusFromParsedDocuments(df, 
                                      category_col='category', 
                                      parsed_col='parse').build()

In [32]:
unigram_corpus = corpus.get_unigram_corpus().remove_terms(['&', '#', '/', ':'])
tdf = unigram_corpus.get_term_freq_df()
unigram_corpus = unigram_corpus.remove_terms(
    tdf[(tdf['trumpsou freq'] < 2) & (tdf.sum(axis=1) <= 9)].index
)

priors = (st.PriorFactory(corpus,
                          starting_count=0.01)
          .use_all_categories()
          .get_priors())
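
The pruning rule at the top of the cell above drops a term only when it is rare in Trump's address and rare overall. Here is a minimal sketch on a made-up term-frequency table; the `'<category> freq'` column names follow the ones used above, while the terms and counts are invented for illustration.

```python
import pandas as pd

# Made-up counts; rows are terms, columns are per-category frequencies.
toy_tdf = pd.DataFrame(
    {'trumpsou freq': [5, 0, 1, 3, 0, 0],
     'tweet freq':    [3, 1, 0, 2, 9, 10],
     'othersou freq': [40, 0, 30, 0, 0, 0]},
    index=['term_a', 'term_b', 'term_c', 'term_d', 'term_e', 'term_f'])

rare_in_trump_sou = toy_tdf['trumpsou freq'] < 2   # fewer than 2 uses in the 2018 address
rare_overall = toy_tdf.sum(axis=1) <= 9            # at most 9 uses across all categories

# Both conditions must hold for a term to be removed.
terms_to_drop = toy_tdf[rare_in_trump_sou & rare_overall].index.tolist()
print(terms_to_drop)
```

`term_c` stays despite appearing once in the address because it is frequent elsewhere, and `term_f` stays because its total of 10 clears the threshold.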

### Constructing Semiotic Squares
* Let's build two semiotic squares for Trump's language
* One will focus on unigrams and capture stylistic differences
* One will focus on phrases (Handler et al. 2016).
* The x-axis of the analysis will measure how strongly a word or phrase is associated with Trump's tweets versus his 2018 State of the Union address
* The y-axis will measure how strongly a word or phrase is associated with Trump (his tweets and his 2018 address) versus other presidents' State of the Union addresses
* Associations will be measured through the Log-Odds-Ratio with a Dirichlet Prior (Monroe et al. 2008)
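
To make the two axes concrete, here is a minimal pure-Python sketch of the z-scored log-odds-ratio with an informative Dirichlet prior from Monroe et al. (2008). It is not Scattertext's implementation: the function name `log_odds_z_scores`, the toy counts, and the choice of prior (pooled counts plus 0.01) are all assumptions made for illustration.

```python
import math
from collections import Counter

def log_odds_z_scores(counts_i, counts_j, prior):
    """Return term -> z-score; positive means more associated with group i.

    counts_i, counts_j: term -> count in each group.
    prior: term -> pseudo-count (must be positive for every term).
    """
    n_i = sum(counts_i.get(t, 0) for t in prior)
    n_j = sum(counts_j.get(t, 0) for t in prior)
    a_0 = sum(prior.values())
    scores = {}
    for term, a_w in prior.items():
        y_i, y_j = counts_i.get(term, 0), counts_j.get(term, 0)
        # Difference of prior-smoothed log-odds of the term in each group.
        delta = (math.log((y_i + a_w) / (n_i + a_0 - y_i - a_w))
                 - math.log((y_j + a_w) / (n_j + a_0 - y_j - a_w)))
        # Approximate variance of that difference.
        variance = 1.0 / (y_i + a_w) + 1.0 / (y_j + a_w)
        scores[term] = delta / math.sqrt(variance)
    return scores

# Made-up counts for three terms in the three categories.
trump_sou = Counter({'american': 30, 'fake': 0, 'congress': 10})
tweets = Counter({'american': 5, 'fake': 25, 'congress': 1})
other_sou = Counter({'american': 40, 'fake': 0, 'congress': 60})

# Assumed prior: pooled counts over all categories plus a small constant.
pooled = trump_sou + tweets + other_sou
prior = {t: pooled[t] + 0.01 for t in ['american', 'fake', 'congress']}

x_scores = log_odds_z_scores(trump_sou, tweets, prior)               # speech vs. tweets
y_scores = log_odds_z_scores(trump_sou + tweets, other_sou, prior)   # Trump vs. others
print(sorted(x_scores, key=x_scores.get))
```

In this toy data, `fake` scores negative on the x-axis (tweet-like) and positive on the y-axis (Trump-like), while `congress` scores negative on the y-axis because past presidents use it far more.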

### References
* Abram Handler, Matthew J. Denny, Hanna Wallach, and Brendan T. O’Connor. Bag of What? Simple Noun Phrase Extraction for Text Analysis. 2016.
* Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. Fightin' Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict. Political Analysis. 2008.

In [33]:
state_of_the_union_semiotic_square_labels = {
        'a': 'Trumpish 2018 State of Union Language',
        'not_a': 'Non-Trumpish Language in Tweets',
        'b': 'Un-State of the Union-y Language in Trump Tweets',
        'not_b': 'Untweet-like, State of the Union-y Language',
        'a_and_b': "Trump",
        'not_a_and_not_b': 'Language from Past Presidents in their States of the Union',
        'a_and_not_b': '''2018 State of the Union:<br><span style="font-size: 10pt;">Below: typical State of Union language used by Trump</span>''',
        'b_and_not_a': 'Trump Tweets:<br><span style="font-size: 10pt;">Below: language about standard State-of-the-Union topics used more in Trump Tweets</span>'
    }

In [34]:
semiotic_square = st.SemioticSquare(
    unigram_corpus,
    category_a='trumpsou',
    category_b='tweet',
    neutral_categories=['othersou'],
    scorer=st.LogOddsRatioInformativeDirichletPrior(priors, alpha_w=10),
    labels = state_of_the_union_semiotic_square_labels
)


In [35]:
html = st.produce_semiotic_square_explorer(semiotic_square,
                                           category_name='Trump-SOUs',
                                           not_category_name='Trump-Tweets',
                                           x_label='Trump-Speech v Tweets',
                                           y_label='Trump v Others',
                                           neutral_category_name='Other SOUs',
                                           minimum_term_frequency=0,
                                           num_terms_semiotic_square=20,
                                           axis_scaler=st.Scalers.scale_neg_1_to_1_with_zero_mean,
                                           metadata=unigram_corpus._df['metadata'])
file_name = 'sou_semiotic.html'
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))
IFrame(src=file_name, width = 1500, height=800)

In [37]:
phrasecorpus = st.CorpusFromParsedDocuments(df, 
                                            feats_from_spacy_doc=st.PhraseMachinePhrases(),
                                            category_col='category', 
                                            parsed_col='parse').build()
tdf = phrasecorpus.get_term_freq_df()
phrasecorpus = phrasecorpus.remove_terms(
    tdf[(tdf['trumpsou freq'] < 2) & (tdf.sum(axis=1) <= 6)].index
)

priors = (st.PriorFactory(phrasecorpus,
                          starting_count=0.01)
          .use_all_categories()
          .get_priors())

semiotic_square_phrase = st.SemioticSquare(
    phrasecorpus,
    category_a='trumpsou',
    category_b='tweet',
    neutral_categories=['othersou'],
    scorer=st.LogOddsRatioInformativeDirichletPrior(priors, 10),
    labels = state_of_the_union_semiotic_square_labels
)


In [38]:
html = st.produce_semiotic_square_explorer(semiotic_square_phrase,
                                           pmi_threshold_coefficient=0,
                                           minimum_term_frequency=2,
                                           category_name='Trump-SOUs',
                                           not_category_name='Trump-Tweets',
                                           x_label='Trump-Speech v Tweets',
                                           y_label='Trump v Others',
                                           neutral_category_name='Other SOUs',
                                           num_terms_semiotic_square=10,
                                           axis_scaler=st.Scalers.scale_neg_1_to_1_with_zero_mean,
                                           metadata=phrasecorpus._df['metadata']
                                          )
file_name = 'sou_semiotic_phrase.html'
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))
IFrame(src=file_name, width = 1500, height=800)