# Session 12(a) - Text Mining
## Analysing and summarising collections of text
### Phrases

Having learned how to clean and simplify our text for processing, the next stage is to ask what our text is about. This session will cover finding phrases (n-grams) in a collection of text.

In [None]:
import pandas as pd
import spacy

In [None]:
df = pd.read_csv('sample_news_large.csv')

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df['query'].value_counts()

In [None]:
df = df.sample(50, random_state=1)

In [None]:
nlp = spacy.load('en_core_web_md')

## Phrases (n-grams)
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/Archer-phrasing.jpg?raw=true" align="right" width="300">

- Phrasing operates on the assumption that if words often co-occur in a corpus, they should be treated as a single 'phrase' rather than as individual words.
- Phrasing can improve the accuracy of later analyses because it recognises that a word's meaning may be changed by its context.
- For example: 
    - In one document we have the phrase "human rights", in the other, "human biology". 
    - **Without phrasing** these may be considered similar as they both use the word "human".
    - However, **with phrasing** these would be transformed into two separate tokens, `human_rights` and `human_biology`, and are therefore more likely to be distinguished as different.
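
To make the "human rights" versus "human biology" point concrete, here is a minimal sketch in plain Python. The two toy documents and the `jaccard` helper are assumptions made up for illustration; they are not part of the session's corpus or code. The sketch measures how many distinct tokens the two documents share, before and after phrasing.

```python
def jaccard(a, b):
    """Share of distinct tokens that two token lists have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Two toy documents, tokenised word by word (illustrative data only)
doc_rights = ["campaign", "for", "human", "rights"]
doc_biology = ["textbook", "on", "human", "biology"]

# The same documents after phrasing has joined the co-occurring words
doc_rights_phrased = ["campaign", "for", "human_rights"]
doc_biology_phrased = ["textbook", "on", "human_biology"]

overlap_unphrased = jaccard(doc_rights, doc_biology)
overlap_phrased = jaccard(doc_rights_phrased, doc_biology_phrased)

print(f"without phrasing: {overlap_unphrased:.2f}")
print(f"with phrasing:    {overlap_phrased:.2f}")
```

Without phrasing the documents overlap on the token "human"; with phrasing they share no tokens at all, so a similarity-based analysis is less likely to confuse them.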

#### Training the Phraser

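`train_phraser` has two main stages.
- First we create a list of tokenized *sentences*.
- We then feed that list of sentences to a Gensim `Phrases` model. This model looks at which tokens co-occur and how often, and [makes a judgement](https://arxiv.org/abs/1310.4546) about whether the co-occurrence is common enough to count as a 'phrase'.
- Why sentences? Sentence boundaries separate words that only happen to sit next to each other. Consider the phrase 'Human Rights' in the two snippets below. In the first, 'human' ends one sentence and 'Rights' begins the next. In the second, the same text without the sentence break reads as if it contains the phrase 'human rights'.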


```
... and so recognising that he was only human. Rights based discussions can only....

```

```
... and so recognising that he was only human rights based discussions can only..

```
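
The effect of sentence boundaries can be sketched with the standard library alone. The example text, the naive split on full stops, and the `tokenize` and `bigrams` helpers are all assumptions for illustration; the session itself relies on spaCy's `doc.sents` for sentence splitting.

```python
import re
from collections import Counter

text = "He was only human. Rights based discussions can only go so far."

def tokenize(chunk):
    """Lower-case a chunk of text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", chunk.lower())

def bigrams(tokens):
    """Pairs of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))

# Naive sentence split on full stops (toy stand-in for spaCy's doc.sents)
sentences = [s for s in text.split(".") if s.strip()]

# Count adjacent pairs inside each sentence separately
per_sentence = Counter()
for sent in sentences:
    per_sentence.update(bigrams(tokenize(sent)))

# Count adjacent pairs over the whole text as one long stream of tokens
whole_text = Counter(bigrams(tokenize(text)))

print("sentence by sentence:", per_sentence[("human", "rights")])
print("as one long stream:  ", whole_text[("human", "rights")])
```

Counting sentence by sentence never sees "human" followed by "rights" here, whereas treating the text as one stream records a co-occurrence that is only an accident of punctuation being stripped.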

In [None]:
# gensim is a text processing library that has the Phrasing tools we need
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

In [None]:
def process_text(doc):
    return [token.lemma_.lower() for token in doc if token.text != '\n' and token.is_alpha]

# process sentences function
def process_sentences(doc):
    return [process_text(sent) for sent in doc.sents]

def train_phraser(corpus):
    sentences = []
    for doc in nlp.pipe(corpus):
        doc_sents = process_sentences(doc)
        sentences.extend(doc_sents)
    phraser = Phrases(sentences, connector_words=ENGLISH_CONNECTOR_WORDS)
    return phraser
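
To give a feel for the kind of judgement the `Phrases` model makes, the paper linked above scores a pair of words as (count of the pair minus a discount δ) divided by (count of the first word times count of the second). The sketch below computes that score by hand on toy sentences. The data, the `phrase_score` helper and the choice of `delta=1` are assumptions for illustration. Gensim's default scorer is based on this formula but applies its own scaling and threshold, so these numbers should not be expected to match Gensim's output.

```python
from collections import Counter

# Toy tokenised sentences (illustrative only)
sentences = [
    ["human", "rights", "law"],
    ["human", "rights", "court"],
    ["human", "rights", "group"],
    ["the", "law", "court"],
    ["the", "group", "met"],
]

# How often each word, and each adjacent pair of words, appears
unigram_counts = Counter(tok for sent in sentences for tok in sent)
bigram_counts = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))

def phrase_score(a, b, delta=1):
    """(count(a b) - delta) / (count(a) * count(b)); delta discounts rare pairs."""
    return (bigram_counts[(a, b)] - delta) / (unigram_counts[a] * unigram_counts[b])

score_human_rights = phrase_score("human", "rights")
score_the_law = phrase_score("the", "law")

print(f"human rights: {score_human_rights:.3f}")
print(f"the law:      {score_the_law:.3f}")
```

A pair that appears together nearly every time its words appear scores high; a pair seen together only once is discounted to zero. Pairs scoring above a chosen threshold are the ones that get joined into a single token.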

In [None]:
phraser = train_phraser(df['text'])

In [None]:
test_text = nlp(df['text'].iloc[0])
test_text[:100]

In [None]:
tokens = process_text(test_text)
print(tokens[10:35])

In [None]:
phrased = phraser[tokens]
print(phrased[10:35])

In [None]:
print([token for token in phrased if '_' in token])

In [None]:
# We can save our phraser to disk so we don't have to do it again

phraser.save('phraser.bin')

# and load it when we need it

phraser = Phrases.load('phraser.bin')

In [None]:
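# tokenized_docs is not defined until process_documents is run in a later cell,
# so running this line here raises a NameError. The 'tokens' column is filled in
# once the documents have been processed.
# df['tokens'] = tokenized_docs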

In [None]:
# OUR REPLACEMENT FUNCTIONS

def new_process_text(doc, phraser=None):
    if phraser is None:
        tokens = [token.lemma_.lower() for token in doc if token.text != '\n' and token.is_alpha]
    else:
        tokens = []
        sentences = process_sentences(doc)
        for sent in sentences:
            phrased = phraser[sent]
            tokens.extend(phrased)
    return tokens 
    
def filter_stops(tokens, stop_words):
    return [tok for tok in tokens if tok.lower() not in stop_words]

def process_documents(corpus, phraser=None, stop_words=None): #change here
    docs = nlp.pipe(corpus)
    processed = [new_process_text(doc, phraser=phraser) for doc in docs] # and here
    if stop_words is not None:
        processed = [filter_stops(doc, stop_words) for doc in processed]
    return processed

def stringify_tokens(tokenized_documents):
    return [' '.join(doc) for doc in tokenized_documents]
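
One consequence of the order used in `process_documents` is that stop words are removed *after* phrasing, so a stop word that has been absorbed into a phrase survives. The sketch below repeats `filter_stops` and `stringify_tokens` so it runs on its own; the tiny stop word set and the already-phrased tokens (including `bank_of_england`) are hypothetical examples, not output from the trained phraser.

```python
def filter_stops(tokens, stop_words):
    return [tok for tok in tokens if tok.lower() not in stop_words]

def stringify_tokens(tokenized_documents):
    return [' '.join(doc) for doc in tokenized_documents]

# A tiny stand-in for nlp.Defaults.stop_words (assumption for this example)
stop_words = {"the", "of", "be", "to"}

# Hypothetical documents that have already been lemmatised and phrased
phrased_docs = [
    ["the", "bank_of_england", "be", "expect", "to", "raise", "interest_rate"],
    ["the", "cost", "of", "living", "be", "rise"],
]

filtered = [filter_stops(doc, stop_words) for doc in phrased_docs]
stringified = stringify_tokens(filtered)

print(filtered[0])
print(stringified[1])
```

The standalone "of" in the second document is dropped, while the "of" inside `bank_of_england` is kept because the phrase is a single token that is not itself a stop word.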


In [None]:
print(new_process_text(test_text)[10:30])

In [None]:
print(new_process_text(test_text, phraser=phraser)[10:30])

In [None]:
stop_words = nlp.Defaults.stop_words

tokenized_docs = process_documents(df['text'], phraser=phraser, stop_words=stop_words)

In [None]:
print(tokenized_docs[3])

In [None]:
# to reiterate the steps
df = pd.read_csv('sample_news_large.csv')

df = df.sample(50, random_state=1) # only use if you want to reduce the number of rows for testing

# Get your list of texts and stop words
corpus = df['text']
stop_words = nlp.Defaults.stop_words

# train your phraser and save it to avoid retraining later
phraser = train_phraser(corpus)
phraser.save('phraser.bin')

phraser = Phrases.load('phraser.bin')

tokenized_docs = process_documents(corpus,phraser=phraser, stop_words=stop_words)
# Done!!

In [None]:
print(tokenized_docs[0])

In [None]:
stringified = stringify_tokens(tokenized_docs)
print(stringified[0])

In [None]:
df['tokens'] = stringified
df.head()

### Top Phrases
A useful exploratory technique to get a quick sense of your data is to examine the phrases used.

In [None]:
df.head()

In [None]:
def extract_phrases(tokens):
    return [token for token in tokens.split() if '_' in token]

df['phrases'] = df['tokens'].apply(extract_phrases)

In [None]:
# phrases for one document
df['phrases'].iloc[0]

In [None]:
# top ten phrases overall
print(df.explode('phrases')['phrases'].value_counts()[:10])

In [None]:
# top ten phrases per group
for query,data in df.groupby('query'):
    print(f"****{query}****")
    print(data.explode('phrases')['phrases'].value_counts()[:10])
    print()
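
For readers who want to see what the `explode` and `value_counts` combination is computing, here is a rough plain-Python equivalent using `collections.Counter`. The rows and their phrases are made-up data standing in for the `query` and `phrases` columns of `df`.

```python
from collections import Counter, defaultdict

# Made-up rows standing in for the 'query' and 'phrases' columns
rows = [
    {"query": "climate", "phrases": ["climate_change", "fossil_fuel", "climate_change"]},
    {"query": "climate", "phrases": ["climate_change"]},
    {"query": "economy", "phrases": ["interest_rate", "interest_rate", "central_bank"]},
]

overall = Counter()                 # phrase counts across every document
per_group = defaultdict(Counter)    # phrase counts within each query group

for row in rows:
    overall.update(row["phrases"])
    per_group[row["query"]].update(row["phrases"])

print(overall.most_common(2))
for query, counts in per_group.items():
    print(f"****{query}****")
    print(counts.most_common(2))
```

Each phrase occurrence is counted once per appearance, so a phrase repeated within a single document contributes several times, just as it does when the list column is exploded into one row per phrase.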