# Session 12(a) - Text Mining
## Analysing and summarising collections of text
### Phrases

Having learned how to clean and simplify our text for processing, the next stage is to ask what our text is about. This session will cover finding phrases: groups of words that co-occur often enough to be treated as a single token.

In [1]:
import pandas as pd
import spacy

In [2]:
df = pd.read_csv('sample_news_large.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175 entries, 0 to 174
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   query      175 non-null    object
 1   title      173 non-null    object
 2   text       175 non-null    object
 3   published  175 non-null    object
 4   site       175 non-null    object
dtypes: object(5)
memory usage: 7.0+ KB


In [4]:
df.head()

| | query | title | text | published | site |
|---|---|---|---|---|---|
| 0 | Hong Kong | Horrifying view of fires from space | Video Image Satellite images show insane view ... | 2019-11-08T23:51:00.000+02:00 | news.com.au |
| 1 | Hong Kong | Protester shot with live round in Hong Kong as... | \n Chief Executive addresses the press after c... | 2019-11-11T02:00:00.000+02:00 | scmp.com |
| 2 | Hong Kong | China imposes online gaming curfew for minors ... | Hong Kong (CNN) China has announced a curfew o... | 2019-11-06T02:00:00.000+02:00 | cnn.com |
| 3 | Hong Kong | Trump made 96 false claims last week - CNNPoli... | Washington (CNN) President Donald Trump was re... | 2019-10-30T20:35:00.000+02:00 | cnn.com |
| 4 | Hong Kong | 50 best breads around the world \| CNN Travel | (CNN) — What is bread? You likely don't have t... | 2019-10-16T07:02:00.000+03:00 | cnn.com |


In [5]:
df['query'].value_counts()

Hong Kong         25
brexit            25
cryptocurrency    25
Tesla             25
alt-right         25
billionaire       25
bitcoin           25
Name: query, dtype: int64

In [6]:
df = df.sample(50, random_state=1)

In [7]:
nlp = spacy.load('en_core_web_md')

## Phrases (n-grams)
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/Archer-phrasing.jpg?raw=true" align="right" width="300">

- Phrasing operates on the assumption that if words often co-occur in a corpus, they should be treated as a single 'phrase' rather than as individual words.
- Phrasing can improve the accuracy of various analyses because it recognises that the meaning of a word may be changed by its context.
- For example: 
    - In one document we have the phrase "human rights", in another, "human biology". 
    - **Without phrasing** these may be considered similar as they both use the word "human".
    - However, **with phrasing** these would be transformed into two separate tokens, `human_rights` and `human_biology`, and are therefore more likely to be distinguished as different.
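
To make the example above concrete, here is a minimal sketch that compares two tiny token lists by the share of distinct tokens they have in common. The `jaccard` helper and the toy documents are our own illustration, not part of the session's pipeline.

```python
def jaccard(a, b):
    """Share of distinct tokens that two token lists have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

doc_a = ["human", "rights"]
doc_b = ["human", "biology"]

# Without phrasing: the shared token "human" makes the documents look related
# (1 shared token out of 3 distinct tokens).
print(jaccard(doc_a, doc_b))

# With phrasing: each pair is a single token, so nothing is shared.
phrased_a = ["human_rights"]
phrased_b = ["human_biology"]
print(jaccard(phrased_a, phrased_b))  # 0.0
```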

#### Training the Phraser

`train_phraser` works in two main stages. 
- First we create a list of tokenized *sentences*. 
- We then feed that list of sentences to a Gensim `Phrases` model. This model looks at which tokens co-occur, and how often, and [makes a judgement](https://arxiv.org/abs/1310.4546) about whether the co-occurrence is common enough to consider it a 'phrase'.

Why sentences? Sentences mark out boundaries between words. Consider the phrase 'Human Rights'. In the text below, 'human' ends one sentence and 'Rights' starts the next, so the two words do not form a phrase:

```
... and so recognising that he was only human. Rights based discussions can only....
```

If we ignored sentence boundaries, the same text would look like it contained the phrase 'human rights':

```
... and so recognising that he was only human rights based discussions can only..
```
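
The effect of sentence boundaries can be sketched in plain Python. The two tokenized sentences below are a hypothetical stand-in for the example above, and `bigrams` is a helper we define just for this illustration.

```python
from collections import Counter

def bigrams(tokens):
    """Return adjacent token pairs from a single list of tokens."""
    return list(zip(tokens, tokens[1:]))

# Two hypothetical tokenized sentences, mirroring the example above.
sentences = [
    ["he", "be", "only", "human"],
    ["rights", "base", "discussion", "can", "only"],
]

# Counting pairs sentence by sentence never pairs "human" with "rights".
per_sentence = Counter(pair for sent in sentences for pair in bigrams(sent))

# Flattening the sentences first creates a pair that spans the full stop.
flat = [tok for sent in sentences for tok in sent]
flattened = Counter(bigrams(flat))

print(per_sentence[("human", "rights")])  # 0
print(flattened[("human", "rights")])     # 1
```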

In [8]:
# gensim is a text processing library that has the Phrasing tools we need
from gensim.models import Phrases

In [19]:
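# spaCy's built-in English stop word list
stop_words = nlp.Defaults.stop_words

def process_text(doc):
    # lowercase lemmas of alphabetic tokens only (drops punctuation and numbers)
    return [token.lemma_.lower() for token in doc if token.is_alpha]

def process_sentences(doc):
    # one list of tokens per sentence, so sentence boundaries are preserved
    return [process_text(sent) for sent in doc.sents]

def train_phraser(corpus, stop_words):
    # collect the tokenized sentences from every document in the corpus
    sentences = []
    for doc in nlp.pipe(corpus):
        doc_sents = process_sentences(doc)
        sentences.extend(doc_sents)

    # common_terms lets stop words sit inside a phrase, e.g. 'bank_of_america'
    # (gensim 4.0 and later renamed this parameter to connector_words)
    phraser = Phrases(sentences, common_terms=stop_words)
    return phraser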

In [20]:
%time phraser = train_phraser(df['text'], stop_words)

CPU times: user 6.88 s, sys: 1.96 s, total: 8.84 s
Wall time: 9.88 s
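
To give a feel for the judgement the `Phrases` model makes, here is a minimal sketch of the scoring formula from the paper linked above, which we understand to be the basis of Gensim's default scorer. The `phrase_score` function is our own, the counts are invented, and the `min_count` and `threshold` values are assumptions chosen to match what we believe are Gensim's defaults.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Score a candidate pair: higher means the two words appear together
    more often than their individual frequencies would suggest."""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size

# Invented counts for illustration only.
vocab_size = 10_000

# "hong" and "kong" almost always appear together.
strong = phrase_score(count_a=220, count_b=216, count_ab=216, vocab_size=vocab_size)

# "of" and "the" are both very frequent, so their pairing is unremarkable.
weak = phrase_score(count_a=5000, count_b=9000, count_ab=900, vocab_size=vocab_size)

threshold = 10.0  # assumed cut-off; pairs scoring above it are joined
print(strong > threshold, weak > threshold)
```

Subtracting `min_count` discounts pairs that are only seen a handful of times, so rare chance pairings are less likely to be promoted to phrases.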


In [22]:
test_text = nlp(df['text'].iloc[0])
test_text[:100]

Taxing the rich to help the poor and middle class has defined the presidential campaigns of Senators Elizabeth Warren and Bernie Sanders.
Skeptics of the proposal argue that a wealth tax of this magnitude is impractical, and will not generate the revenue that Senator Warren and supporters anticipate. But French economist Gabriel Zucman, an early proponent of the tax , argues otherwise.
Zucman, now author of the book “The Triumph of Injustice: How the Rich Dodge Taxes and How to Make Them Pay,” joined The Final Round to

In [24]:
tokens = process_text(test_text)
print(tokens[10:30])

['have', 'define', 'the', 'presidential', 'campaign', 'of', 'senators', 'elizabeth', 'warren', 'and', 'bernie', 'sanders', 'skeptic', 'of', 'the', 'proposal', 'argue', 'that', 'a', 'wealth']


In [26]:
phrased = phraser[tokens]
print(phrased[10:30])

['have', 'define', 'the', 'presidential', 'campaign', 'of', 'senators', 'elizabeth_warren', 'and', 'bernie', 'sanders', 'skeptic', 'of', 'the', 'proposal', 'argue', 'that', 'a', 'wealth_tax', 'of']


In [27]:
[token for token in phrased if '_' in token]

['elizabeth_warren',
 'wealth_tax',
 'wealth_tax',
 'elizabeth_warren',
 'wealth_tax',
 'wealth_tax',
 'elizabeth_warren',
 'wealth_tax',
 'wealth_tax',
 'new_york',
 'new_york',
 'chief_executive',
 'chief_executive',
 'wealth_tax']

In [28]:
# We can save our phraser to disk so we don't have to do it again

phraser.save('phraser.bin')

# and load it when we need it

phraser = Phrases.load('phraser.bin')

### Integrating Phrasing
Let's adjust our original functions to accommodate phrasing. We'll make it an optional part of the process.

In [33]:
# OUR REPLACEMENT FUNCTIONS

def process_text(doc, phraser=None):
    if phraser is None:
        return [token.lemma_.lower() for token in doc if token.is_alpha]
    else:
        tokens = []
        sentences = process_sentences(doc)
        for sent in sentences:
            phrased = phraser[sent]
            tokens.extend(phrased)
    return tokens

def process_sentences(doc):
    return [process_text(sent) for sent in doc.sents]
    
def filter_stops(tokens, stop_words):
    return [tok for tok in tokens if tok.lower() not in stop_words]

def process_documents(corpus, phraser=None, stop_words=None): #change here
    docs = nlp.pipe(corpus)
    processed = [process_text(doc, phraser=phraser) for doc in docs] # and here
    if stop_words is not None:
        processed = [filter_stops(doc, stop_words) for doc in processed]
    return processed
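
The order of operations in these functions (phrase each sentence first, then filter stop words) can be sketched without spaCy or a trained model. `ToyPhraser` is a hypothetical stand-in that only mimics the `phraser[tokens]` lookup, and `phrase_then_filter` is a simplified illustration rather than a replacement for the functions above.

```python
class ToyPhraser:
    """Hypothetical stand-in for a trained phraser: joins known word pairs."""

    def __init__(self, pairs):
        self.pairs = set(pairs)

    def __getitem__(self, tokens):
        out, i = [], 0
        while i < len(tokens):
            pair = tuple(tokens[i:i + 2])
            if pair in self.pairs:
                out.append("_".join(pair))
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

def phrase_then_filter(sentences, phraser, stop_words):
    """Apply the phraser sentence by sentence, then drop stop words."""
    tokens = []
    for sent in sentences:
        tokens.extend(phraser[sent])
    return [tok for tok in tokens if tok not in stop_words]

toy = ToyPhraser([("wealth", "tax"), ("elizabeth", "warren")])
sents = [["the", "wealth", "tax", "of", "elizabeth", "warren"],
         ["warren", "argue", "for", "the", "tax"]]
result = phrase_then_filter(sents, toy, stop_words={"the", "of", "for"})
print(result)
# ['wealth_tax', 'elizabeth_warren', 'warren', 'argue', 'tax']
```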

In [34]:
print(process_text(test_text)[10:30])

['have', 'define', 'the', 'presidential', 'campaign', 'of', 'senators', 'elizabeth', 'warren', 'and', 'bernie', 'sanders', 'skeptic', 'of', 'the', 'proposal', 'argue', 'that', 'a', 'wealth']


In [35]:
print(process_text(test_text, phraser=phraser)[10:30])

['have', 'define', 'the', 'presidential', 'campaign', 'of', 'senators', 'elizabeth_warren', 'and', 'bernie', 'sanders', 'skeptic', 'of', 'the', 'proposal', 'argue', 'that', 'a', 'wealth_tax', 'of']


In [37]:
%time tokenized_docs = process_documents(df['text'], phraser=phraser, stop_words=stop_words)

CPU times: user 6.58 s, sys: 1.87 s, total: 8.45 s
Wall time: 9.05 s


In [44]:
tokenized_docs[4][:20]

['private',
 'email',
 'white_house',
 'senior',
 'adviser',
 'stephen',
 'miller',
 'send',
 'breitbart',
 'editor',
 '-pron-',
 'recommend',
 'white',
 'nationalist',
 'website',
 'literature',
 'uphold',
 'coolidge',
 'administration',
 'model']

In [47]:
# to reiterate the steps
df = pd.read_csv('sample_news_large.csv')

# df = df.sample(50, random_state=1) # only use if you want to reduce the number of rows for testing

# Get your list of texts and stop words
corpus = df['text']
stop_words = nlp.Defaults.stop_words

# train your phraser and save it to avoid retraining later
phraser = train_phraser(corpus, stop_words)
phraser.save('phraser.bin')

tokenized_docs = process_documents(corpus,phraser=phraser, stop_words=stop_words)
# Done!!

In [48]:
df['tokens'] = tokenized_docs

In [49]:
df.head()

| | query | title | text | published | site | tokens |
|---|---|---|---|---|---|---|
| 93 | billionaire | Warren's wealth tax would solve economic inequ... | Taxing the rich to help the poor and middle cl... | 2019-10-27T17:45:00.000+02:00 | yahoo.com | [tax, rich, help, poor, middle, class, define,... |
| 114 | bitcoin | The crypto CEO who bailed on a $4.6 million lu... | View Business Insider’s homepage for more stor... | 2019-11-06T12:51:00.000+02:00 | businessinsider.com | [view, business_insider, homepage, story, cryp... |
| 19 | Hong Kong | Stock market news: November 8, 2019 | Stocks ended slightly higher, shrugging off ea... | 2019-11-08T18:30:00.000+02:00 | yahoo.com | [stock, end, slightly, high, shrug, early, los... |
| 69 | alt-right | Victoria Police 'extremely' disappointed with ... | Victoria Police denounces 'inappropriate' meme... | 2019-11-02T09:44:00.000+02:00 | abc.net.au | [victoria_police, denounce, inappropriate, mem... |
| 53 | alt-right | Stephen Miller promoted white supremacist, ant... | Hundreds of private emails White House senior ... | 2019-11-12T23:15:00.000+02:00 | vox.com | [private, email, white_house, senior, adviser,... |


### Top Phrases

In [53]:
def extract_phrases(tokens):
    # phrased tokens are joined with an underscore, e.g. 'wealth_tax'
    return [token for token in tokens if '_' in token]

test_tokens = df.loc[93,'tokens']
extract_phrases(test_tokens)

['elizabeth_warren',
 'wealth_tax',
 'wealth_tax',
 'elizabeth_warren',
 'wealth_tax',
 'wealth_tax',
 'elizabeth_warren',
 'wealth_tax',
 'wealth_tax',
 'new_york',
 'new_york',
 'chief_executive',
 'chief_executive',
 'wealth_tax']

In [55]:
df['phrases'] = df['tokens'].apply(extract_phrases)

In [58]:
# each row's 'phrases' column now holds the phrases found in that document
df.head()

| | query | title | text | published | site | tokens | phrases |
|---|---|---|---|---|---|---|---|
| 93 | billionaire | Warren's wealth tax would solve economic inequ... | Taxing the rich to help the poor and middle cl... | 2019-10-27T17:45:00.000+02:00 | yahoo.com | [tax, rich, help, poor, middle, class, define,... | [elizabeth_warren, wealth_tax, wealth_tax, eli... |
| 114 | bitcoin | The crypto CEO who bailed on a $4.6 million lu... | View Business Insider’s homepage for more stor... | 2019-11-06T12:51:00.000+02:00 | businessinsider.com | [view, business_insider, homepage, story, cryp... | [business_insider, donald_trump] |
| 19 | Hong Kong | Stock market news: November 8, 2019 | Stocks ended slightly higher, shrugging off ea... | 2019-11-08T18:30:00.000+02:00 | yahoo.com | [stock, end, slightly, high, shrug, early, los... | [president_donald, president_donald, white_hou... |
| 69 | alt-right | Victoria Police 'extremely' disappointed with ... | Victoria Police denounces 'inappropriate' meme... | 2019-11-02T09:44:00.000+02:00 | abc.net.au | [victoria_police, denounce, inappropriate, mem... | [victoria_police, victoria_police, alt_right, ... |
| 53 | alt-right | Stephen Miller promoted white supremacist, ant... | Hundreds of private emails White House senior ... | 2019-11-12T23:15:00.000+02:00 | vox.com | [private, email, white_house, senior, adviser,... | [white_house, trump_administration, far_right,... |


In [65]:
# top ten phrases overall
print(df.explode('phrases')['phrases'].value_counts()[:10])

hong_kong           216
hide_caption        162
kong_unrest          38
photos_hong          30
pro_democracy        28
unrest_protester     22
photo_hong           21
new_york             19
tear_gas             19
getty_images         17
Name: phrases, dtype: int64
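
The `explode` and `value_counts` combination above flattens the per-document phrase lists and counts each phrase. A rough plain-Python equivalent, using invented phrase lists as a stand-in for `df['phrases']`, looks like this:

```python
from collections import Counter
from itertools import chain

# Hypothetical per-document phrase lists, standing in for df['phrases'].
phrases_per_doc = [
    ["hong_kong", "tear_gas", "hong_kong"],
    ["wealth_tax", "new_york"],
    ["hong_kong", "new_york"],
]

# Flatten the lists (what explode does row by row), then count each phrase.
counts = Counter(chain.from_iterable(phrases_per_doc))
print(counts.most_common(2))
# [('hong_kong', 3), ('new_york', 2)]
```

Note that these are raw counts, so a phrase repeated many times in a single document (such as `hide_caption` in the output above) can rank highly overall.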


In [69]:
# top ten phrases per group
for query, data in df.groupby('query'):
    print(f"****{query}****")
    print(data.explode('phrases')['phrases'].value_counts()[:10])
    print()

****Hong Kong****
hong_kong           216
hide_caption        159
kong_unrest          38
photos_hong          30
pro_democracy        28
unrest_protester     22
photo_hong           21
tear_gas             17
unrest_police        16
anti_government      13
Name: phrases, dtype: int64

****Tesla****
electric_vehicle    13
electric_car        12
elon_musk            9
mhari_shaw           9
early_retirement     8
mr_musk              7
self_drive           7
rs_automotive        6
yahoo_finance        4
business_insider     4
Name: phrases, dtype: int64

****alt-right****
alt_right               13
victoria_police         11
far_right                9
sandy_hook               7
social_medium            6
new_york                 5
trump_administration     5
white_house              5
washington_post          3
police_officer           2
Name: phrases, dtype: int64

****billionaire****
raw_story             16
trump_organization    15
wealth_tax             7
russell_moyle          7
whi