# Session 12(a) - Text Mining
## Analysing and summarising collections of text
### Phrases

Having learned how to clean and simplify our text for processing, the next stage is to ask what our text is about. This session covers finding phrases (n-grams) in a corpus and using them to summarise it.

In [None]:
import pandas as pd
import spacy

In [None]:
df =

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df['query'].value_counts()

In [None]:
nlp =

## Phrases (n-grams)
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/Archer-phrasing.jpg?raw=true" align="right" width="300">

- Phrasing operates on the assumption that words which frequently co-occur in a corpus should be treated as a single 'phrase', rather than as individual words.
- Phrasing can improve the accuracy of later analyses because it recognises that a word's meaning may be transformed by its context.
- For example: 
    - In one document we have the phrase "human rights", in the other, "human biology". 
    - **Without phrasing** these may be considered similar as they both use the word "human".
    - However, **with phrasing** these would be transformed into two separate tokens, `human_rights` and `human_biology`, and are therefore more likely to be distinguished as different.
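
To make the example above concrete, here is a small illustration using hand-written token lists rather than real model output. The documents and the `jaccard` helper are invented for this sketch; it simply measures how many distinct tokens two documents share before and after phrasing.

```python
def jaccard(tokens_a, tokens_b):
    """Share of distinct tokens that two documents have in common (0 to 1)."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b)

# Without phrasing: both documents contain the token "human"
doc_a = ["human", "rights", "law"]
doc_b = ["human", "biology", "course"]

# With phrasing: the co-occurring pairs have become single tokens
doc_a_phrased = ["human_rights", "law"]
doc_b_phrased = ["human_biology", "course"]

overlap_before = jaccard(doc_a, doc_b)                 # 1 shared token out of 5
overlap_after = jaccard(doc_a_phrased, doc_b_phrased)  # no shared tokens

print(overlap_before, overlap_after)
```

Any similarity measure built on shared tokens would show the same pattern: once the pairs are merged, the two documents no longer look alike just because both mention "human".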

#### Training the Phraser

`train_phraser` has three stages. 
- First we create a list of tokenized *sentences*. 
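- We then feed that list of sentences to a Gensim `Phrases` model. This model looks at which tokens co-occur, how often, and [makes a judgement](https://arxiv.org/abs/1310.4546) about whether the co-occurrence is common enough to consider it a 'phrase'.
- Why sentences? Sentence boundaries tell the model which words genuinely sit next to each other. Consider the phrase 'Human Rights'. In the first extract below, the full stop shows that 'human' and 'rights' belong to different sentences; with the boundary removed, as in the second extract, they would wrongly be counted as co-occurring.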


```
... and so recognising that he was only human. Rights based discussions can only....

```

```
... and so recognising that he was only human rights based discussions can only..

```
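
The effect of the sentence boundary in the two extracts above can be sketched by counting adjacent token pairs with and without it. This is a simplified stand-in for the counting a phrase model does, not Gensim's own code; `count_bigrams` and the token lists are made up for the illustration.

```python
from collections import Counter

def count_bigrams(token_lists):
    """Count adjacent token pairs, never pairing across separate lists."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))
    return counts

# The extract split into two tokenized sentences
sentences = [
    ["he", "was", "only", "human"],
    ["rights", "based", "discussions", "can", "only"],
]
# The same tokens with the sentence boundary thrown away
flattened = [[token for sent in sentences for token in sent]]

by_sentence = count_bigrams(sentences)
ignoring_boundaries = count_bigrams(flattened)

print(by_sentence[("human", "rights")])          # 0
print(ignoring_boundaries[("human", "rights")])  # 1
```

Only the flattened version records a spurious "human rights" pair, which is why the phraser is trained on lists of sentences rather than on whole documents.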

In [None]:
# gensim is a text processing library that has the Phrasing tools we need

from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

In [None]:
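def process_text(doc):
    # Lowercased lemmas for alphabetic tokens only (drops newlines, numbers and punctuation)
    return [token.lemma_.lower() for token in doc if token.text != '\n' and token.is_alpha]

def process_sentences(doc):
    # One token list per sentence, so phrases are never learned across sentence boundaries
    return [process_text(sent) for sent in doc.sents]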

def train_phraser

In [None]:
phraser =

In [None]:
test_text = nlp(df['text'].iloc[0])
test_text[:100]

In [None]:
test_tokens =
print(test_tokens[10:35])

In [None]:
test_tokens_phrased =
print(test_tokens_phrased[10:35])

In [None]:
print([token for token in test_tokens_phrased if '_' in token])

In [None]:
# We can save our phraser to disk so we don't have to retrain it

phraser.save('phraser.bin')

# and load it when we need it

phraser = Phrases.load('phraser.bin')

In [None]:
# OUR REPLACEMENT FUNCTIONS

def new_process_text(doc, phrasing_model=None):
    if phrasing_model is None:
        tokens = [token.lemma_.lower() for token in doc if token.text != '\n' and token.is_alpha]
    else:
        tokens = []



    return tokens 
    
def filter_stops(tokens, stop_words):
    return [tok for tok in tokens if tok.lower() not in stop_words]

def process_documents(texts, phraser_model=None, stop_words=None, n_process=1): #change here
    docs = nlp.pipe(texts, n_process=n_process)
    processed = [new_process_text(doc, phrasing_model=phraser_model) for doc in docs] # and here
    if stop_words is not None:
        processed = [filter_stops(doc, stop_words) for doc in processed]
    return processed

def stringify_tokens(tokenized_documents):



In [None]:
print(new_process_text(test_text)[10:30])

In [None]:
print(new_process_text(test_text, phrasing_model=phraser)[10:30])
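
The phrasing branch of `new_process_text` is left blank above. One plausible way to fill it, offered as an assumption rather than the session's answer, is to phrase each sentence separately and then flatten the results into a single token list. The sketch below uses a hypothetical `ToyPhraser` stand-in so it runs without a trained model; a Gensim phraser is indexed with a token list in the same way.

```python
class ToyPhraser:
    """Hypothetical stand-in for a trained phrasing model: joins known pairs."""

    def __init__(self, pairs):
        self.pairs = set(pairs)

    def __getitem__(self, tokens):
        out, i = [], 0
        while i < len(tokens):
            if tuple(tokens[i:i + 2]) in self.pairs:
                out.append("_".join(tokens[i:i + 2]))
                i += 2  # skip both halves of the phrase
            else:
                out.append(tokens[i])
                i += 1
        return out

def phrase_by_sentence(tokenized_sentences, phrasing_model):
    """Apply the model to each sentence, then flatten to one token list."""
    tokens = []
    for sent in tokenized_sentences:
        tokens.extend(phrasing_model[sent])
    return tokens

toy_model = ToyPhraser([("human", "rights")])
toy_doc_sentences = [
    ["he", "was", "only", "human"],
    ["rights", "based", "discussions"],
    ["human", "rights", "matter"],
]
flat_tokens = phrase_by_sentence(toy_doc_sentences, toy_model)
print(flat_tokens)
```

Because each sentence is phrased on its own, the "human" that ends the first sentence is never merged with the "rights" that opens the second.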

In [None]:
spacy_stop_words = nlp.Defaults.stop_words

tokenized_docs =

In [None]:
print(tokenized_docs[3])

In [None]:
# to reiterate the steps
df =

# Get your list of texts and stop words
corpus =
spacy_stop_words =

# train your phraser and save it to avoid retraining later
phraser =
phraser.save('phraser.bin')

phraser = Phrases.load('phraser.bin')

tokenized_docs =
# Done!!

In [None]:
print(tokenized_docs[0])

In [None]:
stringified =
print(stringified[0])
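
`stringify_tokens` is also left blank in this notebook. A minimal guess at its purpose, judging only from its name and how `stringified` is printed, is to join each document's tokens back into one space-separated string. The function name `stringify_tokens_sketch` and the sample data are hypothetical.

```python
def stringify_tokens_sketch(tokenized_documents):
    """Join each document's tokens into a single space-separated string."""
    return [" ".join(tokens) for tokens in tokenized_documents]

toy_tokenized = [
    ["human_rights", "law", "reform"],
    ["human_biology", "course"],
]
toy_stringified = stringify_tokens_sketch(toy_tokenized)
print(toy_stringified[0])  # human_rights law reform
```

A string form like this is convenient for tools that expect raw text and split on whitespace, since the underscore keeps each phrase together as one token.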

In [None]:
df['tokens'] =
df.head()

### Top Phrases
A useful exploratory technique to get a quick sense of your data is to examine the phrases used.

In [None]:
df.head()

In [None]:
def extract_phrases(tokens):


df['phrases'] =

In [None]:
# phrases for one document
df['phrases'].iloc[0]
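
`extract_phrases` is left blank above. Since the phraser joins words with an underscore, and an earlier cell already filters tokens with `'_' in token`, one reasonable sketch is to keep only the tokens that contain an underscore. The name `extract_phrases_sketch` and the sample tokens are assumptions.

```python
def extract_phrases_sketch(tokens):
    """Keep only tokens the phraser has joined, identified by an underscore."""
    return [token for token in tokens if "_" in token]

toy_tokens = ["government", "climate_change", "policy", "fossil_fuel"]
toy_phrases = extract_phrases_sketch(toy_tokens)
print(toy_phrases)  # ['climate_change', 'fossil_fuel']
```

If the `tokens` column holds lists, a function like this could be applied row by row with `df['tokens'].apply(...)`.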

In [None]:
# top ten phrases overall
print(    )
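
One way to get the most frequent phrases overall, sketched here with the standard library rather than the notebook's own solution, is to flatten the per-document phrase lists and count them. The sample lists are invented; iterating over `df['phrases']` would yield lists in the same shape.

```python
from collections import Counter

# Hypothetical per-document phrase lists
phrases_per_doc = [
    ["climate_change", "global_warming"],
    ["climate_change", "sea_level"],
    ["climate_change", "global_warming", "fossil_fuel"],
]

# Flatten all documents into one stream of phrases and count them
phrase_counts = Counter(p for doc in phrases_per_doc for p in doc)
top_ten = phrase_counts.most_common(10)
print(top_ten)
```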

In [None]:
# top ten phrases per group
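
For the per-group version, a possible pandas approach is to explode the phrase lists into one row per phrase and count within each group. Grouping by the `query` column is an assumption based on the earlier `df['query'].value_counts()` cell, and the toy DataFrame below is invented for the sketch.

```python
import pandas as pd

# Hypothetical data: one list of phrases per document, plus a grouping column
toy_df = pd.DataFrame({
    "query": ["climate", "climate", "health"],
    "phrases": [
        ["climate_change", "global_warming"],
        ["climate_change"],
        ["public_health", "public_health", "mental_health"],
    ],
})

# One row per (document, phrase) pair
exploded = toy_df.explode("phrases")

# Count phrases within each group, then keep the ten most frequent per group
top_per_group = (
    exploded.groupby("query")["phrases"]
    .value_counts()
    .groupby(level=0)
    .head(10)
)
print(top_per_group)
```

`value_counts` sorts each group's phrases by frequency, so taking the first ten rows per group gives that group's most common phrases.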

