# Using Scattertext to Explore the Effectiveness of Headlines
### Jason S. Kessler ([@jasonkessler](http://www.twitter.com/JasonKessler))

The code in this notebook shows how you can use the Python package Scattertext to explore how language used in headlines 
can correlate with social engagement.

For background on the term-class association scores used and semiotic squares, please see https://github.com/JasonKessler/PuPPyTalk and https://github.com/JasonKessler/SemioticSquaresTalk

This notebook makes heavy use of the library Scattertext (https://github.com/JasonKessler/scattertext) for language processing and visualizations.

The data used were scraped from Facebook by Max Woolf.  Please see his original notebook at https://github.com/minimaxir/clickbait-cluster.

In [1]:
import pandas as pd
import numpy as np
import sys
import umap
import spacy
import scattertext as st
from gensim.models import word2vec
import re
from glob import glob
from scipy.stats import rankdata
from IPython.display import IFrame
from IPython.display import display, HTML
display(HTML("<style>.container { width:98% !important; }</style>"))
import matplotlib.pyplot as plt

# Data preprocessing
Let's first parse a dataset of headlines from BuzzFeed and the NYTimes. Each headline is associated with a Facebook reaction count, which we convert to a percentile within its publication and bin into high (`Hi`), medium (`Mid`), and low (`Lo`) thirds.

In [30]:
nlp = spacy.load('en', parser=False)

In [31]:
df = pd.concat([
    pd.read_csv(fn, sep='\t')
    .assign(publication=fn.split('/')[-1].split('_')[0]) 
    for fn in glob('./fb_headlines/*')
]).reset_index().assign(
    status_published = lambda df: pd.to_datetime(df.status_published)
)[
    lambda df: df.status_published.apply(lambda x: x.year >= 2016) & df.page_id.isin(['BuzzFeed', 'NYTimes'])
].loc[
    lambda df: df['link_name'].dropna().index
].assign(
    parse = lambda df: df.link_name.apply(nlp)
).loc[
    lambda df: df.parse.apply(len) > 2
].assign(
    reaction_percentile = lambda df: df.groupby('publication')['num_reactions'].apply(lambda x: pd.Series(rankdata(x)/len(x), index=x.index)),
    reaction_bin = lambda df: df.reaction_percentile.apply(lambda x: 'Hi' if x > 2./3 else 'Lo' if x < 1./3 else 'Mid')
)
df.head()

| | index | page_id | status_id | link_name | status_published | num_reactions | publication | parse | reaction_percentile | reaction_bin |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | BuzzFeed | 21898300328_10154928658355329 | Here's How Much The Kardashians Have Changed I... | 2016-08-12 21:31:00 | 349 | BuzzFeed | (Here, 's, How, Much, The, Kardashians, Have, ... | 0.024744 | Lo |
| 1 | 1 | BuzzFeed | 21898300328_10154928707445329 | 21 Memes That Are Too Pure For This World | 2016-08-12 20:46:00 | 3622 | BuzzFeed | (21, Memes, That, Are, Too, Pure, For, This, W... | 0.444981 | Mid |
| 2 | 2 | BuzzFeed | 21898300328_10154928683205329 | Michael Phelps' Son Looks Like An Old Man And ... | 2016-08-12 20:01:01 | 2667 | BuzzFeed | (Michael, Phelps, ', Son, Looks, Like, An, Old... | 0.344196 | Mid |
| 3 | 3 | BuzzFeed | 21898300328_10154925295010329 | Here's What The Cast Of "Degrassi: The Next Ge... | 2016-08-12 19:16:01 | 2461 | BuzzFeed | (Here, 's, What, The, Cast, Of, ", Degrassi, :... | 0.319553 | Lo |
| 4 | 4 | BuzzFeed | 21898300328_10154928549130329 | 19 Tweets About Michael Phelps That Are Exactl... | 2016-08-12 18:31:00 | 9137 | BuzzFeed | (19, Tweets, About, Michael, Phelps, That, Are... | 0.770871 | Hi |
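
The percentile binning used in the preprocessing cell can be seen in isolation on a toy frame. This is a minimal sketch, assuming only the `publication` and `num_reactions` columns; the helper name `bin_reactions` and the toy counts are hypothetical, and it uses `groupby(...).transform` rather than the `apply` call in the cell above.

```python
import pandas as pd
from scipy.stats import rankdata


def bin_reactions(frame, group_col='publication', count_col='num_reactions'):
    """Rank reaction counts within each publication and bin them into thirds."""
    # Percentile of each headline among headlines from the same publication.
    percentile = frame.groupby(group_col)[count_col].transform(
        lambda x: rankdata(x) / len(x)
    )
    # Same cut points as the notebook: top third 'Hi', bottom third 'Lo'.
    labels = percentile.apply(
        lambda p: 'Hi' if p > 2. / 3 else 'Lo' if p < 1. / 3 else 'Mid'
    )
    return frame.assign(reaction_percentile=percentile, reaction_bin=labels)


# Hypothetical data: publication B gets far more reactions overall than A.
toy = pd.DataFrame({
    'publication': ['A'] * 4 + ['B'] * 5,
    'num_reactions': [10, 20, 30, 40, 5, 50, 500, 5000, 50000],
})
binned = bin_reactions(toy)
print(binned)
```

Because ranks are computed per publication, a headline with 40 reactions lands in `Hi` for publication A even though it would sit near the bottom of publication B.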


# Build a corpus
Next, we'll build a Scattertext corpus from this data, using the noun phrases that occur in the headlines as features. We keep a feature only if it occurs at least twice in some reaction bin; in bins with more total tokens, the required count is scaled up so that it is proportionally equivalent to two occurrences in the smallest bin. This filtering is performed by the `ClassPercentageCompactor` class.
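
The scaled threshold described above can be illustrated without Scattertext. This is a minimal sketch in plain pandas, not the library's code: the function name `terms_passing_class_percentage` and the toy counts are hypothetical, and Scattertext's implementation may differ in details such as strict versus non-strict comparison.

```python
import pandas as pd


def terms_passing_class_percentage(term_counts, term_count=2):
    """Keep terms whose share of some class reaches term_count / (smallest class size).

    term_counts: DataFrame with one row per term and one column per class.
    """
    class_totals = term_counts.sum(axis=0)
    # Two occurrences in the smallest class, expressed as a proportion.
    min_share = term_count / class_totals.min()
    # Each term's share of the tokens in every class.
    shares = term_counts / class_totals
    keep = (shares >= min_share).any(axis=1)
    return list(term_counts.index[keep])


# Hypothetical counts: 'Hi' has 100 tokens in total, 'Lo' has 200.
counts = pd.DataFrame(
    {'Hi': [3, 1, 1, 95], 'Lo': [0, 3, 5, 192]},
    index=['trump', 'quiz', 'recipe', 'filler'],
)
kept_terms = terms_passing_class_percentage(counts, term_count=2)
print(kept_terms)
```

Here `quiz` is dropped even though it occurs three times in `Lo`, because `Lo` is twice the size of `Hi` and so needs four occurrences to match two in the smaller bin.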

Since we'll generate a number of redundant noun phrases (e.g., the phrase "United States of America" will generate `[United States], [America], [States], [States of America], [United States of America]` as noun phrases), we keep noun phrases that are found in larger noun phrases only if they occurred at least five times outside of the surrounding phrase. This is accomplished by the `CompactTerms` class.
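
The sub-phrase rule can be sketched on a dictionary of phrase counts. This is a simplified illustration, not Scattertext's `CompactTerms` implementation: the function name `drop_redundant_subphrases` and the counts are hypothetical, and "occurrences outside the surrounding phrase" is approximated as the phrase's count minus the count of its most frequent containing phrase.

```python
def drop_redundant_subphrases(phrase_counts, slack=5):
    """Drop phrases that rarely occur outside a longer phrase containing them."""
    kept = {}
    for phrase, count in phrase_counts.items():
        padded = ' ' + phrase + ' '
        # Counts of longer phrases that contain this phrase on word boundaries.
        containing = [
            other_count
            for other, other_count in phrase_counts.items()
            if other != phrase and padded in ' ' + other + ' '
        ]
        # Approximate number of occurrences outside any longer phrase.
        outside = count - max(containing, default=0)
        if outside >= slack:
            kept[phrase] = count
    return kept


# Hypothetical phrase counts.
phrase_counts = {
    'United States of America': 8,
    'United States': 20,
    'States of America': 8,
    'States': 21,
    'America': 30,
}
kept = drop_redundant_subphrases(phrase_counts, slack=5)
print(sorted(kept))
```

In this toy example `States of America` and `States` are removed because nearly all of their occurrences are explained by a longer phrase, while `United States` and `America` survive.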

In [43]:
reaction_corpus = st.CorpusFromParsedDocuments(
    df, 
    parsed_col='parse', 
    category_col='reaction_bin',
    feats_from_spacy_doc=st.PhraseMachinePhrases()
).build(
).compact(
    st.ClassPercentageCompactor(term_count=2)
).compact(
    st.CompactTerms(slack=5)
)

In [44]:
print("Number of unique phrases found:", len(reaction_corpus.get_terms()))

Number of unique phrases found: 624


# Now let's look at phrase-reaction association
The names of presidential candidates appear frequently and are associated with more reactions. Branded content from the NYT underperforms.

In [49]:
def get_metadata_from_corpus(corpus):
    df = corpus.get_df()
    return (df.page_id + ', ' 
            + df.reaction_percentile.apply(lambda x: str(int(x * 100)) + '%') + ', ' 
            + df.status_published.apply(lambda x: str(x.date())))

In [50]:
html = st.produce_frequency_explorer(
    reaction_corpus,
    category='Hi',
    not_categories=['Lo'],
    neutral_categories=['Mid'],
    neutral_category_name='Mid',
    minimum_term_frequency=0,
    pmi_filter_thresold=0,
    use_full_doc = True,
    term_scorer = st.DeltaJSDivergence(),
    width_in_pixels=1000,
    metadata=get_metadata_from_corpus(reaction_corpus),
    show_neutral=True,
    show_characteristic=False
)
file_name = 'reaction_all.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)


Looking at unigram frequencies, we notice similar trends in the high-converting content, but second-person pronouns correlate with lower-performing content.

In [52]:
reaction_corpus_unigram = (st.CorpusFromParsedDocuments(df, parsed_col='parse', category_col='reaction_bin')
                           .build()
                           .compact(st.ClassPercentageCompactor(term_count=3))).get_unigram_corpus()
reaction_corpus_unigram.get_num_terms()

2907

In [53]:
html = st.produce_frequency_explorer(reaction_corpus_unigram,
                                     category='Hi',
                                     not_categories=['Lo'],
                                     neutral_categories=['Mid'],
                                     neutral_category_name='Mid',
                                     minimum_term_frequency=6,
                                     pmi_filter_thresold=0,
                                     use_full_doc = True,
                                     term_scorer = st.DeltaJSDivergence(),
                                     grey_threshold=0,
                                     width_in_pixels=1000,
                                     metadata=get_metadata_from_corpus(reaction_corpus_unigram),
                                     show_neutral=True,
                                     show_characteristic=False)
file_name = 'reaction_unigram.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)

We can use UMAP to cluster unigrams based on their co-occurrence statistics, and visually identify semantically similar terms which indicate high or low performance.

In [56]:
html = st.produce_projection_explorer(
    reaction_corpus_unigram,
    category='Hi', 
    not_categories=['Lo'], 
    neutral_categories=['Mid'],
    term_scorer = st.RankDifference(),
    neutral_category_name='Mid',
    width_in_pixels=1000,
    use_full_doc=True,
    projection_model = umap.UMAP(metric='cosine'),
    term_acceptance_re=re.compile(''),
    metadata=get_metadata_from_corpus(reaction_corpus_unigram)
)
file_name = 'reaction_umap_projection.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
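
To give a feel for what "co-occurrence statistics" with a cosine metric means, here is a minimal NumPy sketch. It does not run UMAP and makes an assumption the notebook does not confirm: that each term is represented by the set of headlines it appears in. The headlines and the helper `similarity` are hypothetical.

```python
import numpy as np

# Hypothetical headlines, already lower-cased and tokenized on whitespace.
headlines = [
    'trump wins debate',
    'clinton wins debate',
    'trump attacks clinton',
    'easy chicken recipe',
    'easy pasta recipe',
]
vocab = sorted({word for headline in headlines for word in headline.split()})

# Term-by-headline matrix: 1.0 if the term occurs in the headline.
X = np.array([
    [1.0 if term in headline.split() else 0.0 for headline in headlines]
    for term in vocab
])

# Cosine similarity between every pair of term vectors.
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
cosine = unit @ unit.T


def similarity(a, b):
    """Cosine similarity between the occurrence vectors of two terms."""
    return cosine[vocab.index(a), vocab.index(b)]


print(similarity('easy', 'recipe'))   # terms that always appear together
print(similarity('trump', 'debate'))  # terms that sometimes appear together
print(similarity('trump', 'recipe'))  # terms that never appear together
```

A projection model such as `umap.UMAP(metric='cosine')` places terms with high cosine similarity near each other in two dimensions, which is what lets related terms form visible clusters in the plot.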

We can now look at publication-specific performance, and compare their performances to each other.

In [58]:
df['category'] = df.publication + ' ' + df.reaction_bin
df_four_square = df[df.reaction_bin.isin(['Hi', 'Lo'])]
# Create corpus and filter terms
four_square_corpus = (st.CorpusFromParsedDocuments(df_four_square, category_col = 'category', parsed_col = 'parse')
                      .build()
                      .compact(st.CompactTerms(minimum_term_count=2, slack=6))
                      .compact(st.ClassPercentageCompactor(term_count=2)))

In [61]:
four_square_axes = st.FourSquareAxes(four_square_corpus, 
                                     ['NYTimes Hi'], 
                                     ['NYTimes Lo'], 
                                     ['BuzzFeed Hi'], 
                                     ['BuzzFeed Lo'], 
                                     labels = {'a': 'Appeals to all',
                                               'a_and_not_b': 'NY Times: Hi Engagement',
                                               'b_and_not_a': 'NY Times: Lo Engagement',
                                               'a_and_b': 'BuzzFeed: Hi Engagement',
                                               'not_a_and_not_b': 'BuzzFeed: Lo Engagement',
                                               'not_a': 'Ignored by all',
                                               'b': 'Ignored by elite, appeals to masses',
                                               'not_b': 'Appeals to elite, ignored by masses'})
html = st.produce_four_square_axes_explorer(
    four_square_axes=four_square_axes,
    x_label='NYT: Hi-Lo',
    y_label='Buzz: Hi-Lo',
    use_full_doc=True,
    pmi_threshold_coefficient=0,
    metadata=get_metadata_from_corpus(four_square_corpus),
    censor_points=False)

In [62]:
file_name = 'reaction_axes.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1600, height=900)

In [59]:
# View chart with multiple terms visible
# Set up chart structure
four_square = st.FourSquare(
    four_square_corpus,
    category_a_list=['NYTimes Hi'],
    category_b_list=['BuzzFeed Hi'],
    not_category_a_list=['BuzzFeed Lo'],
    not_category_b_list=['NYTimes Lo'],
    scorer=st.RankDifference(),
    labels={'a': 'Highbrow Engagement',
            'b': 'Lowbrow Engagement',
            'not_a_and_not_b': 'Few Facebook Reactions',
            'a_and_b': 'Many Facebook Reactions',
            'a_and_not_b': 'NYTimes',
            'b_and_not_a': 'BuzzFeed',
            'not_a': 'Lowbrow Ignored',
            'not_b': 'Highbrow Ignored'})
html = st.produce_four_square_explorer(four_square=four_square,
                                       x_label='NYTimes-Buzz',
                                       y_label='Hi-Low',
                                       use_full_doc=True,
                                       pmi_threshold_coefficient=0,
                                       metadata=get_metadata_from_corpus(four_square_corpus), 
                                       censor_points=False)
file_name = 'reaction_semiotic_censor.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1600, height=900)
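
The axes in these charts are differences between a term's standing in two categories. As a rough illustration, the sketch below computes a dense-rank difference per publication on toy counts. It is an assumption about how a scorer like `st.RankDifference` behaves, not a copy of Scattertext's code, and the function name `rank_difference` and all counts are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata


def rank_difference(a_counts, b_counts):
    """Scaled dense rank of each term in category a minus the same in category b."""
    a_ranks = rankdata(a_counts, method='dense')
    b_ranks = rankdata(b_counts, method='dense')
    return a_ranks / a_ranks.max() - b_ranks / b_ranks.max()


terms = ['trump', 'recipe', 'quiz', 'obituary']
# Hypothetical term counts in each publication/engagement category.
nyt_hi = np.array([30, 2, 1, 5])
nyt_lo = np.array([10, 8, 1, 20])
buzz_hi = np.array([25, 3, 40, 1])
buzz_lo = np.array([5, 30, 10, 1])

x = rank_difference(nyt_hi, nyt_lo)    # NYT: Hi-Lo
y = rank_difference(buzz_hi, buzz_lo)  # Buzz: Hi-Lo

for term, x_score, y_score in zip(terms, x, y):
    print(term, x_score, y_score)
```

Under these assumptions, a term with positive scores on both axes is associated with high engagement at both publications, while a term that is positive on one axis and negative or zero on the other appeals to only one audience.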