# Session 11 - Text Mining
## Introduction to Text as Data

- We broadly understand how text might be used as data for qualitative analysis.
- Words are not treated simply as individual units of data, but we recognise context, structure, pattern.
- How then can text be analysed quantitatively, and how can it be interpreted by a computer to aid that analysis?
- Usually text analysis techniques require text to be prepared for analysis through two stages:

### 1. Tokenizing

- Computers break down text into units of analysis known as *Tokens*. Tokens are often individual words, but they can also be parts of words, common phrases etc. 
- Tokenizing is the first fundamental step in any text analysis.
- How exactly you split up text is not necessarily straightforward.
- There are a range of different strategies which you can see at the [Python NLTK Demo Page](https://text-processing.com/demo/tokenize/).
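
To make the idea of competing strategies concrete, here is a minimal sketch comparing two hand-rolled tokenizers built with Python's standard library. The function names and the regular expression are illustrative assumptions, not what NLTK or any other package does internally.

```python
import re

def whitespace_tokenize(text):
    # Strategy 1: split on whitespace; punctuation stays glued to words
    return text.split()

def regex_tokenize(text):
    # Strategy 2: a word (keeping internal apostrophes) OR a single
    # punctuation character becomes its own token
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

sample = "He has big eyes! It's just his way."

print(whitespace_tokenize(sample))
# ['He', 'has', 'big', 'eyes!', "It's", 'just', 'his', 'way.']
print(regex_tokenize(sample))
# ['He', 'has', 'big', 'eyes', '!', "It's", 'just', 'his', 'way', '.']
```

Neither result is "correct"; each strategy simply makes different decisions about where one token ends and the next begins.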

### 2. Pre-Processing
- Exactly what happens in pre-processing tends to depend on the type of analysis you are doing.
- In general it tends to involve...
    - Filtering out common words and punctuation.
    - Standardising the text to make it less complex for the computer.


## Step 1: Tokenizing

#### Example of the problem
We already know how to split up a string into a series of items in a list. It's pretty simple using `.split()`.

In [38]:
test_phrase = "I don't see my cat. He has a long tail, fluffy ears and big eyes!"\
" He also subscribes to Marxist historical materialism. It's just his way."

In [39]:
test_tokens = test_phrase.split()
print(test_tokens)

['I', "don't", 'see', 'my', 'cat.', 'He', 'has', 'a', 'long', 'tail,', 'fluffy', 'ears', 'and', 'big', 'eyes!', 'He', 'also', 'subscribes', 'to', 'Marxist', 'historical', 'materialism.', "It's", 'just', 'his', 'way.']


In [40]:
# first we check if the string 'ears' is in the list
'ears' in test_tokens

True

In [41]:
# so we would imagine that 'eyes' is also in the list?
'eyes' in test_tokens

False

Having punctuation attached to words like this can cause us problems because the tokens `eyes` and `eyes!` would be considered two separate things. This messes with a lot of analysis further down the line.
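
One naive workaround, sketched here purely for illustration, is to strip punctuation from the ends of each chunk using the standard library's `string.punctuation`. The variable names are hypothetical, and this approach is not what a real tokenizer does. It also leaves contractions such as `don't` untouched.

```python
import string

phrase = "He has a long tail, fluffy ears and big eyes!"

# Remove leading/trailing punctuation from each whitespace-separated chunk
cleaned = []
for chunk in phrase.split():
    cleaned.append(chunk.strip(string.punctuation))

print('eyes' in phrase.split())  # False: the token is 'eyes!'
print('eyes' in cleaned)         # True
```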

### Using Tokenizers
- Tokenizers are functions that split up text for us. 
- Some of them are based on complex sets of rules, others are based on training computers using lots of examples of text.
- There are many (many!) packages available for handling text data. See Moodle for a list of common ones.

#### SpaCy

SpaCy uses pre-trained models of language to do a lot of the tasks we need. To create our spacy text tool we need to load in a model. SpaCy has a [load of different models](https://spacy.io/usage/models) for different languages and different types of task.

We're going to use `en_core_web_md`, which means:
- en: English
- core: Can perform the core features of SpaCy but not some of the more specialised techniques.
- web: trained on content from the web such as blogs, news and comments, making it suitable for similar content.
- md: medium version. There are also small and large models. Small is trained just on web text data from 2013. Medium is trained on [petabytes of data from the contemporary internet](https://commoncrawl.org/big-picture/) and so is much more up to date in how it understands contemporary language use.

In [42]:
#import
import spacy

In [43]:
# nlp represents the trained language model provided by Spacy...
nlp = spacy.load('en_core_web_md')

In [44]:
# In SpaCy tokenization happens the moment you wrap a string in your language model object nlp()

doc = nlp(test_phrase)

doc

I don't see my cat. He has a long tail, fluffy ears and big eyes! He also subscribes to Marxist historical materialism. It's just his way.

In [45]:
# if we iterate over the doc we can see the tokens.
tokens_a = []

for word in doc:
    tokens_a.append(word)
print(tokens_a)

[I, do, n't, see, my, cat, ., He, has, a, long, tail, ,, fluffy, ears, and, big, eyes, !, He, also, subscribes, to, Marxist, historical, materialism, ., It, 's, just, his, way, .]




**List comprehensions** allow us to do in 1 line what would normally take 3. They are more concise than a `for` loop that appends to a list, and usually a little faster. We'll see how they can be used more later.

In [46]:
tokens_b = [word for word in doc]
print(tokens_b)

[I, do, n't, see, my, cat, ., He, has, a, long, tail, ,, fluffy, ears, and, big, eyes, !, He, also, subscribes, to, Marxist, historical, materialism, ., It, 's, just, his, way, .]
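
A list comprehension can also transform and filter items in the same line. The sketch below uses plain strings as stand-ins for SpaCy tokens (an assumption made so the example runs without a language model) and shows that the loop and the comprehension produce the same list.

```python
words = ["I", "do", "n't", "see", "my", "cat", "."]

# for-loop version: keep purely alphabetic items, lowercased
loop_result = []
for w in words:
    if w.isalpha():
        loop_result.append(w.lower())

# equivalent comprehension: [expression for item in iterable if condition]
comp_result = [w.lower() for w in words if w.isalpha()]

print(comp_result)                 # ['i', 'do', 'see', 'my', 'cat']
print(loop_result == comp_result)  # True
```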


A number of things have been done by the tokenizer.
- Punctuation has been separated from words into its own tokens.
- Words that are contractions of two words (It's > It is / Don't > Do not) have been split into two.
- This becomes more useful in a minute and is all part of the process of reducing the nuance of language to make documents more comparable.

## Step 2: Pre-Processing
- Whilst the output above looks just like plain strings again, it is actually a SpaCy **Document**.
- Once a string is processed by SpaCy it becomes a SpaCy [Document object](https://spacy.io/api/doc). 
- The SpaCy document object itself is made up of SpaCy [Token objects](https://spacy.io/api/token).

This means that Documents and Tokens have a range of associated methods and attributes based on SpaCy's analysis.

In [47]:
# Let's do that again
test_phrase = "I don't see my cat. He has a long tail, fluffy ears and big eyes!"\
" He also subscribes to Marxist historical materialism. It's just his way."
doc = nlp(test_phrase)

In [48]:
type(doc)

spacy.tokens.doc.Doc

In [49]:
# check language
doc.lang_

'en'

In [50]:
token = doc[5]
token

cat

In [51]:
type(token)

spacy.tokens.token.Token

#### Pre-Processing: Lemmatisation
Language is very nuanced in real life, but part of the filtering process involves reducing that nuance to strip back to a piece of text's bare bones.

```
"I don't like rabbits in space"
"I do not like rabbits in space"
```
- Semantically the same, computationally different.
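- Lemmatising using the token attribute `.lemma_` allows us to roll words back to a common 'root'.
- The outputs below show `-PRON-` for pronouns, which is how SpaCy 2.x lemmatises them. SpaCy 3.x returns the pronoun itself instead.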

In [52]:
rabbit_1 = nlp("I don't like rabbits in space")
rabbit_2 = nlp("I do not like rabbits in space")

print([token.lemma_ for token in rabbit_1])
print([token.lemma_ for token in rabbit_2])


['-PRON-', 'do', 'not', 'like', 'rabbit', 'in', 'space']
['-PRON-', 'do', 'not', 'like', 'rabbit', 'in', 'space']


These phrases below are semantically similar, but only share 1 word. Lemmatising brings them closer together computationally.

In [53]:
rabbit_1 = nlp("I'm loving these rabbits")
rabbit_2 = nlp("I love this rabbit!")

print([token.lemma_ for token in rabbit_1])
print([token.lemma_ for token in rabbit_2])

['-PRON-', 'be', 'love', 'these', 'rabbit']
['-PRON-', 'love', 'this', 'rabbit', '!']


In [54]:
# Let's see what it does to our test phrase
doc = nlp(test_phrase)

In [55]:
print( [token.lemma_ for token in doc])

['-PRON-', 'do', 'not', 'see', '-PRON-', 'cat', '.', '-PRON-', 'have', 'a', 'long', 'tail', ',', 'fluffy', 'ear', 'and', 'big', 'eye', '!', '-PRON-', 'also', 'subscribe', 'to', 'marxist', 'historical', 'materialism', '.', '-PRON-', 'be', 'just', '-PRON-', 'way', '.']
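
To see why lemmatising brings phrases closer together, here is a toy sketch that uses a hand-written lookup table. `TOY_LEMMAS` and `toy_lemmatise` are hypothetical names invented for this example. SpaCy's lemmatiser is far more sophisticated and does not work from this table.

```python
# Hypothetical lookup table for illustration only (not SpaCy's data)
TOY_LEMMAS = {"'m": "be", "loving": "love", "rabbits": "rabbit"}

def toy_lemmatise(tokens):
    # Replace a token with its lemma if we know one, otherwise keep it
    return [TOY_LEMMAS.get(tok, tok) for tok in tokens]

a = ["I", "'m", "loving", "these", "rabbits"]
b = ["I", "love", "this", "rabbit", "!"]

shared_before = set(a) & set(b)
shared_after = set(toy_lemmatise(a)) & set(toy_lemmatise(b))

print(sorted(shared_before))  # ['I']
print(sorted(shared_after))   # ['I', 'love', 'rabbit']
```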


#### Pre-Processing: Punctuation
Unless punctuation matters for your analysis (such as needing to break text down into sentences), we normally clear punctuation out of the text. We can use SpaCy's `.is_alpha` token attribute to keep only tokens made up entirely of alphabetic characters, while still returning just the lemma.

In [56]:
processed_doc = [token.lemma_ for token in doc if token.is_alpha]
print(processed_doc)

['-PRON-', 'do', 'see', '-PRON-', 'cat', '-PRON-', 'have', 'a', 'long', 'tail', 'fluffy', 'ear', 'and', 'big', 'eye', '-PRON-', 'also', 'subscribe', 'to', 'marxist', 'historical', 'materialism', '-PRON-', 'just', '-PRON-', 'way']
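
Notice that `not` has vanished from the output above. SpaCy's documentation describes `is_alpha` as equivalent to calling `.isalpha()` on the token's text, so the effect can be sketched with plain strings (the list below is a hand-copied subset of the tokens, used here as an assumption-free stand-in for the real `Token` objects).

```python
# Token texts as produced by the tokenizer earlier (subset)
token_texts = ["I", "do", "n't", "see", "my", "cat", ".",
               "It", "'s", "just", "his", "way", "."]

kept = [t for t in token_texts if t.isalpha()]
dropped = [t for t in token_texts if not t.isalpha()]

print(kept)     # ['I', 'do', 'see', 'my', 'cat', 'It', 'just', 'his', 'way']
print(dropped)  # ["n't", '.', "'s", '.']
```

Because `n't` and `'s` contain an apostrophe they fail the check along with the punctuation, so negation is lost. If negation matters for your analysis, a filter such as `not token.is_punct` may be a better fit.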


#### Pre-Processing: Stop Words
Stop words are common words that tend to be structurally useful in sentences, but are too common to provide much meaning alone. They are often stripped out before text analysis.
Let's add this to our list comprehension.

In [57]:
stop_words = nlp.Defaults.stop_words

In [58]:
print(processed_doc)

['-PRON-', 'do', 'see', '-PRON-', 'cat', '-PRON-', 'have', 'a', 'long', 'tail', 'fluffy', 'ear', 'and', 'big', 'eye', '-PRON-', 'also', 'subscribe', 'to', 'marxist', 'historical', 'materialism', '-PRON-', 'just', '-PRON-', 'way']


In [59]:
# list comp that filters out stop_words
print([word for word in processed_doc if word.lower() not in stop_words])

['-PRON-', '-PRON-', 'cat', '-PRON-', 'long', 'tail', 'fluffy', 'ear', 'big', 'eye', '-PRON-', 'subscribe', 'marxist', 'historical', 'materialism', '-PRON-', '-PRON-', 'way']
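
A quick way to see why stop words get in the way is to count tokens before and after removing them. This sketch uses `collections.Counter` and a tiny made-up stop list (`toy_stop_words` is hypothetical, and SpaCy's default set is much larger).

```python
from collections import Counter

# Hypothetical mini stop list for illustration only
toy_stop_words = {"a", "and", "the", "to", "of", "in", "is"}

tokens = "the cat and the dog sat in the shade of a tree".split()

all_counts = Counter(tokens)
content_counts = Counter(t for t in tokens if t not in toy_stop_words)

print(all_counts.most_common(1))  # [('the', 3)]
print(sorted(content_counts))     # ['cat', 'dog', 'sat', 'shade', 'tree']
```

Without filtering, the most frequent token tells us nothing about the topic of the sentence.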


In [60]:
# we'll add .lower() to our result to ensure that we 
# get rid of any distinction when it comes to capitalisation as well

def process_text(doc):
    return [token.lemma_.lower() for token in doc if token.is_alpha]
    
def filter_stops(tokens, stop_words):
    return [tok for tok in tokens if tok.lower() not in stop_words]

In [61]:
test_phrase = "I don't see my cat. He has a long tail, fluffy ears and big eyes!"\
" He also subscribes to Marxist historical materialism. It's just his way."
doc = nlp(test_phrase)

print(process_text(doc))
print()
print(filter_stops(process_text(doc), stop_words))

['-pron-', 'do', 'see', '-pron-', 'cat', '-pron-', 'have', 'a', 'long', 'tail', 'fluffy', 'ear', 'and', 'big', 'eye', '-pron-', 'also', 'subscribe', 'to', 'marxist', 'historical', 'materialism', '-pron-', 'just', '-pron-', 'way']

['-pron-', '-pron-', 'cat', '-pron-', 'long', 'tail', 'fluffy', 'ear', 'big', 'eye', '-pron-', 'subscribe', 'marxist', 'historical', 'materialism', '-pron-', '-pron-', 'way']


## Processing a Corpus
A "Corpus" is a collection of textual documents. In computational text analysis a corpus will often run into the hundreds, thousands or even hundreds of thousands of documents. Whilst we could process each document separately with a `for` loop, it is much more efficient to use SpaCy's `pipe`, which processes the documents as a stream in batches and is memory efficient.


In [62]:
corpus = ["I don't like rabbits in space",
         "I do not like rabbits in space",
         "I'm loving these rabbits",
         "I love this rabbit!"]

In [64]:
# pipe it!
docs = nlp.pipe(corpus)
docs

<generator object Language.pipe at 0x7fd4bc5dbed0>

Spacy has created a generator object. This means that at the moment, no processing has been done. Each document is only processed when we iterate over the generator object as if it were a list.
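
The lazy behaviour of generators can be seen in plain Python. In this sketch `lazy_upper` is a hypothetical stand-in for `nlp.pipe`, and the `log` list records when each item is actually processed.

```python
log = []

def lazy_upper(texts):
    # Generator function: the body only runs when a value is requested
    for text in texts:
        log.append(text)
        yield text.upper()

gen = lazy_upper(["one", "two"])
print(log)            # []  (nothing processed yet)

first = next(gen)     # processes only the first item
print(first, log)     # ONE ['one']

rest = list(gen)      # processes everything that is left
leftover = list(gen)  # a generator can only be consumed once
print(rest, leftover) # ['TWO'] []
```

Because a generator is exhausted after one pass, storing the results in a list is useful whenever we want to loop over the documents more than once.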

In [65]:
# We can force the generator to produce the results by iterating over it in a list comprehension.
docs = [doc for doc in docs]
docs

[I don't like rabbits in space,
 I do not like rabbits in space,
 I'm loving these rabbits,
 I love this rabbit!]

For peace of mind we can check that the objects are indeed SpaCy documents.

In [66]:
for doc in docs:
    print(type(doc))

<class 'spacy.tokens.doc.Doc'>
<class 'spacy.tokens.doc.Doc'>
<class 'spacy.tokens.doc.Doc'>
<class 'spacy.tokens.doc.Doc'>


Which means we can run our processor on each Spacy doc as it is spat out and retain the result.

In [67]:
# process the text as it generates
[process_text(doc) for doc in docs]

[['-pron-', 'do', 'like', 'rabbit', 'in', 'space'],
 ['-pron-', 'do', 'not', 'like', 'rabbit', 'in', 'space'],
 ['-pron-', 'love', 'these', 'rabbit'],
 ['-pron-', 'love', 'this', 'rabbit']]

In [68]:
def process_documents(corpus, stop_words=None):
    # pipe it
    docs = nlp.pipe(corpus)
    # process the text
    processed = [process_text(doc) for doc in docs]
    # remove stop_words if provided
    if stop_words is not None:
        processed = [filter_stops(tokens, stop_words) for tokens in processed]
    return processed

In [69]:
stop_words = nlp.Defaults.stop_words

print( process_documents(corpus) )
print()
print( process_documents(corpus, stop_words) )

[['-pron-', 'do', 'like', 'rabbit', 'in', 'space'], ['-pron-', 'do', 'not', 'like', 'rabbit', 'in', 'space'], ['-pron-', 'love', 'these', 'rabbit'], ['-pron-', 'love', 'this', 'rabbit']]

[['-pron-', 'like', 'rabbit', 'space'], ['-pron-', 'like', 'rabbit', 'space'], ['-pron-', 'love', 'rabbit'], ['-pron-', 'love', 'rabbit']]
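
One rough way to check that processing has made the documents more comparable is to measure vocabulary overlap. The sketch below uses Jaccard similarity, which is a choice made for this illustration rather than something SpaCy provides, and `jaccard` is a hypothetical helper name.

```python
def jaccard(tokens_a, tokens_b):
    # Shared vocabulary divided by combined vocabulary (0.0 to 1.0)
    a, b = set(tokens_a), set(tokens_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# Raw text split on whitespace
raw_a = "I'm loving these rabbits".split()
raw_b = "I love this rabbit!".split()

# Processed tokens copied from the output above (stop words removed)
proc_a = ['-pron-', 'love', 'rabbit']
proc_b = ['-pron-', 'love', 'rabbit']

print(jaccard(raw_a, raw_b))    # 0.0
print(jaccard(proc_a, proc_b))  # 1.0
```

The two sentences share no whitespace-separated words at all, yet after processing they are identical.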


## Real Data Test
So far we've been working on a toy dataset. Let's see what happens with a real dataset.

In [70]:
import pandas as pd

In [71]:
df = pd.read_csv('sample_news_large.csv')

In [75]:
df.head()

| | query | title | text | published | site |
|---|---|---|---|---|---|
| 0 | Hong Kong | Horrifying view of fires from space | Video Image Satellite images show insane view ... | 2019-11-08T23:51:00.000+02:00 | news.com.au |
| 1 | Hong Kong | Protester shot with live round in Hong Kong as... | \n Chief Executive addresses the press after c... | 2019-11-11T02:00:00.000+02:00 | scmp.com |
| 2 | Hong Kong | China imposes online gaming curfew for minors ... | Hong Kong (CNN) China has announced a curfew o... | 2019-11-06T02:00:00.000+02:00 | cnn.com |
| 3 | Hong Kong | Trump made 96 false claims last week - CNNPoli... | Washington (CNN) President Donald Trump was re... | 2019-10-30T20:35:00.000+02:00 | cnn.com |
| 4 | Hong Kong | 50 best breads around the world \| CNN Travel | (CNN) — What is bread? You likely don't have t... | 2019-10-16T07:02:00.000+03:00 | cnn.com |


In [76]:
stop_words = nlp.Defaults.stop_words

%time result =  process_documents(df['text'], stop_words=stop_words)

CPU times: user 27.5 s, sys: 7.91 s, total: 35.4 s
Wall time: 41.5 s


In [83]:
print(df.loc[4,'text'])

(CNN) — What is bread? You likely don't have to think for long, and whether you're hungry for a slice of sourdough or craving some tortillas, what you imagine says a lot about where you're from. But if bread is easy to picture, it's hard to define. Bread historian William Rubel argues that creating a strict definition of bread is unnecessary, even counterproductive. "Bread is basically what your culture says it is," says Rubel, the author of "Bread: A Global History." "It doesn't need to be made with any particular kind of flour." Instead, he likes to focus on what bread does: It turns staple grains such as wheat, rye or corn into durable foods that can be carried into the fields, used to feed an army or stored for winter. Even before the first agricultural societies formed around 10,000 B.C., hunter-gatherers in Jordan's Black Desert made bread with tubers and domesticated grain. Related content 50 of the world's best desserts Today, the descendants of those early breads showcase the 

In [84]:
print(result[4])

['cnn', 'bread', '-pron-', 'likely', 'think', 'long', '-pron-', 'hungry', 'slice', 'sourdough', 'crave', 'tortilla', '-pron-', 'imagine', 'lot', '-pron-', 'bread', 'easy', 'picture', '-pron-', 'hard', 'define', 'bread', 'historian', 'william', 'rubel', 'argue', 'create', 'strict', 'definition', 'bread', 'unnecessary', 'counterproductive', 'bread', 'basically', '-pron-', 'culture', '-pron-', 'rubel', 'author', 'bread', 'global', 'history', '-pron-', 'need', 'particular', 'kind', 'flour', 'instead', '-pron-', 'like', 'focus', 'bread', '-pron-', 'turn', 'staple', 'grain', 'wheat', 'rye', 'corn', 'durable', 'food', 'carry', 'field', 'use', 'feed', 'army', 'store', 'winter', 'agricultural', 'society', 'form', 'hunter', 'gatherer', 'jordan', 'black', 'desert', 'bread', 'tuber', 'domesticated', 'grain', 'relate', 'content', 'world', 'good', 'dessert', 'today', 'descendant', 'early', 'breads', 'showcase', 'remarkable', 'breadth', '-pron-', 'world', 'food', 'tradition', 'rugged', 'mountain', 'g

In [85]:
# to create a new column of tokenized documents we simply assign the result

df['tokens'] = result

In [86]:
df[['text','tokens']]

| | text | tokens |
|---|---|---|
| 0 | Video Image Satellite images show insane view ... | [video, image, satellite, image, insane, view,... |
| 1 | \n Chief Executive addresses the press after c... | [chief, executive, address, press, citywide, c... |
| 2 | Hong Kong (CNN) China has announced a curfew o... | [hong, kong, cnn, china, announce, curfew, onl... |
| 3 | Washington (CNN) President Donald Trump was re... | [washington, cnn, president, donald, trump, re... |
| 4 | (CNN) — What is bread? You likely don't have t... | [cnn, bread, -pron-, likely, think, long, -pro... |
| ... | ... | ... |
| 170 | The head of Facebook’s blockchain project says... | [head, facebook, blockchain, project, developm... |
| 171 | Housing and Urban Development Secretary Ben Ca... | [housing, urban, development, secretary, ben, ... |
| 172 | Wednesday, October 16, 2019\nWall Street lost ... | [wednesday, october, wall, street, lose, milli... |
| 173 | U.S. stocks slid Thursday following reports th... | [stock, slide, thursday, follow, report, chine... |
