# Session 11 - Text Mining
## Introduction to Text as Data

- We broadly understand how text might be used as data for qualitative analysis.
- Words are not treated simply as individual units of data; we also recognise context, structure and pattern.
- How, then, can text be analysed quantitatively, and how can a computer interpret it to aid that analysis?
- Text analysis techniques usually require the text to be prepared in two stages:

### 1. Tokenizing

- Computers break down text into units of analysis known as *Tokens*. Tokens are often individual words, but they can also be parts of words, common phrases etc. 
- Tokenizing is the first fundamental step in any text analysis.
- How exactly you split up text is not necessarily straightforward.
- There are a range of different strategies which you can see at the [Python NLTK Demo Page](https://text-processing.com/demo/tokenize/).

### 2. Pre-Processing
- Exactly what happens in pre-processing tends to depend on the type of analysis you are doing.
- In general it tends to involve...
    - Filtering out common words and punctuation.
    - Standardising the text to make it less complex for the computer.


## Step 1: Tokenizing

#### Example of the problem
We already know how to split up a string into a list of items. It's pretty simple using `.split()`.

In [1]:
test_phrase = "I don't see my cat. He has a long tail, fluffy ears and big eyes!"\
" He also subscribes to Marxist historical materialism. It's just his way."

In [2]:
test_tokens = test_phrase.split()
print(test_tokens)

['I', "don't", 'see', 'my', 'cat.', 'He', 'has', 'a', 'long', 'tail,', 'fluffy', 'ears', 'and', 'big', 'eyes!', 'He', 'also', 'subscribes', 'to', 'Marxist', 'historical', 'materialism.', "It's", 'just', 'his', 'way.']


In [3]:
# first we check if the string 'ears' is in the list
'ears' in test_tokens

True

In [4]:
# so we would imagine that 'eyes' is also in the list?
'eyes' in test_tokens

False

Having punctuation attached to words like this can cause us problems because the tokens `eyes` and `eyes!` would be considered two separate things. This messes with a lot of analysis further down the line.
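To make the problem concrete, here is a minimal sketch of a rule-based tokenizer built on Python's standard `re` module. It is only an illustration of the idea, not what any particular library does, and the pattern is an assumption chosen to suit this one sentence.

```python
import re

test_phrase = ("I don't see my cat. He has a long tail, fluffy ears and big eyes!"
               " He also subscribes to Marxist historical materialism. It's just his way.")

# \w+(?:'\w+)?  matches a word, keeping contractions such as "don't" in one piece
# [^\w\s]       matches any single punctuation mark as a token of its own
simple_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", test_phrase)

print(simple_tokens)

# Punctuation is no longer glued to the word, so this membership test now succeeds
print('eyes' in simple_tokens)
```

Even this small rule set involves choices (should `don't` stay whole or be split?), which is why dedicated tokenizers exist.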

### Using Tokenizers
- Tokenizers are functions that split up text for us. 
- Some of them are based on complex sets of rules, others are based on training computers using lots of examples of text.
- There are many (many!) packages available for handling text data. See Moodle for a list of common ones.

#### SpaCy

SpaCy uses pre-trained models of language to do a lot of the tasks we need. To create our spacy text tool we need to load in a model. SpaCy has a [load of different models](https://spacy.io/usage/models) for different languages and different types of task.

We're going to use `en_core_web_md`, which means:
- en: English
- core: can perform the core features of SpaCy but not some of the more specialised techniques.
- web: trained on content from the web such as blogs, news and comments, making it suitable for similar content.
- md: medium version. There are also small and large versions. In the SpaCy release used here, the sizes differ mainly in the word vectors they ship with: the small model includes none, while the medium and large models include vectors trained on [petabytes of data from the contemporary internet](https://commoncrawl.org/big-picture/), which gives them a richer picture of contemporary language use.

In [5]:
import spacy

In [6]:
# nlp represents the trained language model provided by Spacy...
nlp = spacy.load('en_core_web_md')

In [7]:
# In SpaCy, tokenization happens as soon as you pass a string to your language model object nlp()

doc = nlp(test_phrase)

doc

I don't see my cat. He has a long tail, fluffy ears and big eyes! He also subscribes to Marxist historical materialism. It's just his way.

In [8]:
# if we iterate over the doc we can see the tokens.
tokens_a = []

for word in doc:
    tokens_a.append(word)
print(tokens_a)

[I, do, n't, see, my, cat, ., He, has, a, long, tail, ,, fluffy, ears, and, big, eyes, !, He, also, subscribes, to, Marxist, historical, materialism, ., It, 's, just, his, way, .]




**List comprehensions** allow us to do in one line what would normally take three. They are more concise than an equivalent `for` loop and usually a little faster. We'll see more of how they can be used later.

In [9]:
tokens_b = [word for word in doc]
print(tokens_b)

[I, do, n't, see, my, cat, ., He, has, a, long, tail, ,, fluffy, ears, and, big, eyes, !, He, also, subscribes, to, Marxist, historical, materialism, ., It, 's, just, his, way, .]
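The same pattern works on any list, not just SpaCy documents. As a small sketch using made-up example words, the block below shows a loop and its comprehension equivalent side by side, plus the optional `if` clause that the pre-processing steps later rely on.

```python
words = ['I', 'see', 'my', 'cat']

# Loop version: create an empty list, then append inside the loop
upper_loop = []
for word in words:
    upper_loop.append(word.upper())

# Comprehension version: the same result in a single expression
upper_comp = [word.upper() for word in words]

# An `if` at the end keeps only the items that pass the test
long_words = [word for word in words if len(word) > 2]

print(upper_loop == upper_comp)
print(long_words)
```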


A number of things have been done by the tokenizer.
- Punctuation has been separated from words into its own tokens.
- Contractions of two words (It's > It is / Don't > Do not) have been split into two tokens.
- This becomes more useful in a minute and is all part of the process of reducing the nuance of language to make documents more comparable.

## Step 2: Pre-Processing
- Whilst the output above looks just like the original string, it is actually a SpaCy **Document**.
- Once a string is processed by SpaCy it becomes a SpaCy [Document object](https://spacy.io/api/doc). 
- The SpaCy document object itself is made up of SpaCy [Token objects](https://spacy.io/api/token).

This means that Documents and Tokens have a range of associated methods and attributes based on SpaCy's analysis.

In [10]:
# Let's do that again
test_phrase = "I don't see my cat. He has a long tail, fluffy ears and big eyes!"\
" He also subscribes to Marxist historical materialism. It's just his way."
doc = nlp(test_phrase)

In [11]:
type(doc)

spacy.tokens.doc.Doc

In [12]:
doc.lang_

'en'

In [13]:
token = doc[5]
token

cat

In [14]:
type(token)

spacy.tokens.token.Token

#### Pre-Processing: Lemmatisation
Language is very nuanced in real life, but part of the filtering process involves reducing that nuance to strip back to a piece of text's bare bones.

```
"I don't like rabbits in space"
"I do not like rabbits in space"
```
- Semantically the same, computationally different.
- Lemmatising, using the token attribute `.lemma_`, allows us to roll words back to a common 'root'.

In [15]:
rabbit_1 = nlp("I don't like rabbits in space")
rabbit_2 = nlp("I do not like rabbits in space")

print( [token.lemma_ for token in rabbit_1])
print( [token.lemma_ for token in rabbit_2])

['-PRON-', 'do', 'not', 'like', 'rabbit', 'in', 'space']
['-PRON-', 'do', 'not', 'like', 'rabbit', 'in', 'space']


The phrases below are semantically similar, but share only one word. Lemmatising brings them closer together computationally.

In [16]:
rabbit_1 = nlp("I'm loving these rabbits")
rabbit_2 = nlp("I love this rabbit!")

print( [token.lemma_ for token in rabbit_1])
print( [token.lemma_ for token in rabbit_2])

['-PRON-', 'be', 'love', 'these', 'rabbit']
['-PRON-', 'love', 'this', 'rabbit', '!']
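One rough way to put a number on "closer together" is to count the tokens the two phrases have in common before and after lemmatising. In the sketch below the lemma lists are copied from the output above; the raw token lists are written out by hand, on the assumption that the tokenizer splits `I'm` into `I` and `'m`. The helper `shared` is a name introduced here for illustration.

```python
# Raw token texts (hand-written, assuming "I'm" is split into "I" and "'m")
raw_1 = ["I", "'m", "loving", "these", "rabbits"]
raw_2 = ["I", "love", "this", "rabbit", "!"]

# Lemmas copied from the output above
lemmas_1 = ['-PRON-', 'be', 'love', 'these', 'rabbit']
lemmas_2 = ['-PRON-', 'love', 'this', 'rabbit', '!']

def shared(a, b):
    """Return the sorted tokens that appear in both lists."""
    return sorted(set(a) & set(b))

print(shared(raw_1, raw_2))        # ['I']
print(shared(lemmas_1, lemmas_2))  # ['-PRON-', 'love', 'rabbit']
```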


In [17]:
# Let's see what it does to our test phrase
doc

I don't see my cat. He has a long tail, fluffy ears and big eyes! He also subscribes to Marxist historical materialism. It's just his way.

In [18]:
print( [token.lemma_ for token in doc])

['-PRON-', 'do', 'not', 'see', '-PRON-', 'cat', '.', '-PRON-', 'have', 'a', 'long', 'tail', ',', 'fluffy', 'ear', 'and', 'big', 'eye', '!', '-PRON-', 'also', 'subscribe', 'to', 'marxist', 'historical', 'materialism', '.', '-PRON-', 'be', 'just', '-PRON-', 'way', '.']


#### Pre-Processing: Punctuation
Unless punctuation matters for your analysis (for example, if you need to break text into sentences), we normally remove it. We can use SpaCy's `.is_alpha` attribute to keep only tokens that are alphabetical, while still returning just the lemma.

In [19]:
processed_doc = [token.lemma_ for token in doc if token.is_alpha]
print(processed_doc)

['-PRON-', 'do', 'see', '-PRON-', 'cat', '-PRON-', 'have', 'a', 'long', 'tail', 'fluffy', 'ear', 'and', 'big', 'eye', '-PRON-', 'also', 'subscribe', 'to', 'marxist', 'historical', 'materialism', '-PRON-', 'just', '-PRON-', 'way']
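Notice that `not` and `be` have disappeared from the output as well as the punctuation. SpaCy's documentation describes `.is_alpha` as equivalent to calling `.isalpha()` on the token's text, so the contraction fragments `n't` and `'s` fail the test because of their apostrophes. The sketch below checks this with plain Python strings; the sample list is made up for illustration.

```python
# Token texts of the kind the tokenizer produces, written out by hand
samples = ["eyes", "n't", "'s", "!", "2016", "Marxist"]

for text in samples:
    # str.isalpha() is True only if every character is a letter
    print(repr(text), text.isalpha())

kept = [text for text in samples if text.isalpha()]
print(kept)
```

If negation matters for your analysis, this is worth keeping in mind: filtering on alphabetic tokens alone drops the `n't` in `don't`.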


#### Pre-Processing: Stop Words
Stop words are common words that tend to be structurally useful in sentences, but are too common to provide much meaning alone. They are often stripped out before text analysis.
Let's filter them out with another list comprehension.

In [20]:
stop_words = nlp.Defaults.stop_words

In [21]:
print([word for word in processed_doc if word.lower() not in stop_words])

['-PRON-', '-PRON-', 'cat', '-PRON-', 'long', 'tail', 'fluffy', 'ear', 'big', 'eye', '-PRON-', 'subscribe', 'marxist', 'historical', 'materialism', '-PRON-', '-PRON-', 'way']


In [22]:
# we'll add .lower() to our result to ensure that we 
# get rid of any distinction when it comes to capitalisation as well

def process_text(doc):
    """Return the lowercased lemma of every alphabetic token in a SpaCy doc."""
    return [token.lemma_.lower() for token in doc if token.is_alpha]

def filter_stops(tokens, stop_words):
    """Return only the tokens (strings) that are not in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

In [23]:
test_phrase = "I don't see my cat. He has a long tail, fluffy ears and big eyes!"\
" He also subscribes to Marxist historical materialism. It's just his way."
doc = nlp(test_phrase)

print(process_text(doc))
print()
print(filter_stops(process_text(doc), stop_words))

['-pron-', 'do', 'see', '-pron-', 'cat', '-pron-', 'have', 'a', 'long', 'tail', 'fluffy', 'ear', 'and', 'big', 'eye', '-pron-', 'also', 'subscribe', 'to', 'marxist', 'historical', 'materialism', '-pron-', 'just', '-pron-', 'way']

['-pron-', '-pron-', 'cat', '-pron-', 'long', 'tail', 'fluffy', 'ear', 'big', 'eye', '-pron-', 'subscribe', 'marxist', 'historical', 'materialism', '-pron-', '-pron-', 'way']
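The `-pron-` placeholder survives the stop word filter because it is not itself in the stop word list. If that is unwanted, one option is to filter against an extended set. The sketch below uses a small hand-written set as a stand-in for `nlp.Defaults.stop_words` so that it runs without SpaCy; `custom_stops` is a name introduced here for illustration.

```python
# Hand-written stand-in for nlp.Defaults.stop_words (the real set is much larger)
stop_words = {'do', 'see', 'have', 'a', 'and', 'also', 'to', 'just'}

# Build a new set with the union operator rather than changing the original
custom_stops = stop_words | {'-pron-'}

tokens = ['-pron-', 'do', 'see', '-pron-', 'cat', '-pron-', 'have', 'a', 'long', 'tail']

filtered = [tok for tok in tokens if tok not in custom_stops]
print(filtered)
```

Because `filter_stops` takes the stop word collection as an argument, an extended set like this can be passed straight in.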


## Processing a Corpus
A "corpus" is a collection of textual documents. In computational text analysis a corpus will often run to hundreds, thousands or even hundreds of thousands of documents. Whilst we could process each document separately with a `for` loop, it is more efficient to use SpaCy's `pipe`, which processes documents in batches as a stream and so is both faster and memory efficient.


In [24]:
corpus = ["I don't like rabbits in space",
         "I do not like rabbits in space",
         "I'm loving these rabbits",
         "I love this rabbit!"]

In [25]:
docs = nlp.pipe(corpus)
docs

<generator object Language.pipe at 0x7fbcdfa2dad0>

Spacy has created a generator object. This means that at the moment, no processing has been done. Each document is only processed when we iterate over the generator object as if it were a list.
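Generators are a general Python feature, so the lazy behaviour can be seen without SpaCy. The following is a minimal sketch; `shout` is a made-up function that stands in for an expensive processing step.

```python
def shout(texts):
    """Yield each text in upper case, one at a time."""
    for text in texts:
        print("processing:", text)
        yield text.upper()

gen = shout(["rabbits", "in", "space"])  # nothing is printed yet

first = next(gen)      # only now is the first item processed
rest = list(gen)       # the remaining items are processed here
leftover = list(gen)   # the generator is used up, so this is empty

print(first, rest, leftover)
```

A generator can be consumed only once, which is why it helps to store the results in a list if you need to go over them again.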

In [26]:
# We can force the generator to produce the results by iterating over it in a list comprehension.
docs = [doc for doc in docs]
docs

[I don't like rabbits in space,
 I do not like rabbits in space,
 I'm loving these rabbits,
 I love this rabbit!]

For peace of mind we can check that the objects are indeed SpaCy documents.

In [27]:
for doc in docs:
    print(type(doc))

<class 'spacy.tokens.doc.Doc'>
<class 'spacy.tokens.doc.Doc'>
<class 'spacy.tokens.doc.Doc'>
<class 'spacy.tokens.doc.Doc'>


This means we can run our processor on each SpaCy doc and keep the result.

In [28]:
[process_text(doc) for doc in docs]

[['-pron-', 'do', 'like', 'rabbit', 'in', 'space'],
 ['-pron-', 'do', 'not', 'like', 'rabbit', 'in', 'space'],
 ['-pron-', 'love', 'these', 'rabbit'],
 ['-pron-', 'love', 'this', 'rabbit']]

In [29]:
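def process_documents(corpus, stop_words=None):
    """Process an iterable of strings into lists of tokens.

    Relies on the `nlp` model loaded earlier; stop words are only
    removed when a stop_words collection is supplied.
    """
    docs = nlp.pipe(corpus)
    processed = [process_text(doc) for doc in docs]
    if stop_words is not None:
        processed = [filter_stops(doc, stop_words) for doc in processed]
    return processed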

In [33]:
stop_words = nlp.Defaults.stop_words

print( process_documents(corpus) )
print()
print( process_documents(corpus, stop_words) )

[['-pron-', 'do', 'like', 'rabbit', 'in', 'space'], ['-pron-', 'do', 'not', 'like', 'rabbit', 'in', 'space'], ['-pron-', 'love', 'these', 'rabbit'], ['-pron-', 'love', 'this', 'rabbit']]

[['-pron-', 'like', 'rabbit', 'space'], ['-pron-', 'like', 'rabbit', 'space'], ['-pron-', 'love', 'rabbit'], ['-pron-', 'love', 'rabbit']]
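The second output shows what "making documents more comparable" means in practice: four different strings have collapsed into two distinct token sequences. A quick check, using the lists copied from the output above:

```python
# Copied from the stop-word-filtered output above
processed = [['-pron-', 'like', 'rabbit', 'space'],
             ['-pron-', 'like', 'rabbit', 'space'],
             ['-pron-', 'love', 'rabbit'],
             ['-pron-', 'love', 'rabbit']]

print(processed[0] == processed[1])
print(processed[2] == processed[3])

# Lists cannot go in a set, so convert each one to a tuple first
distinct = {tuple(tokens) for tokens in processed}
print(len(distinct))
```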


## Real Data Test
So far we've been working on a toy dataset. Let's see what happens with a real dataset.

In [34]:
import pandas as pd

In [35]:
df = pd.read_csv('sample_news.csv')

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   uuid               100 non-null    object
 1   query              100 non-null    object
 2   thread.title_full  100 non-null    object
 3   text               100 non-null    object
 4   published          100 non-null    object
 5   thread.site        100 non-null    object
dtypes: object(6)
memory usage: 4.8+ KB


In [37]:
stop_words = nlp.Defaults.stop_words

%time result = process_documents(df['text'], stop_words=stop_words)

CPU times: user 10 s, sys: 2.12 s, total: 12.1 s
Wall time: 12.2 s
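`%time` is an IPython magic command, so it only works inside a notebook or IPython session. In a plain Python script, a rough equivalent for the wall time is the standard library's `time.perf_counter`. The sketch below times a made-up stand-in function, `slow_square`, rather than the real processing step.

```python
import time

def slow_square(numbers):
    """Stand-in for an expensive processing function (hypothetical)."""
    return [n * n for n in numbers]

start = time.perf_counter()
squares = slow_square(range(1_000_000))
elapsed = time.perf_counter() - start

print(f"Wall time: {elapsed:.2f} s")
```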


In [38]:
print(df.loc[0,'text'])

Image copyright Getty Images UK Prime Minister Boris Johnson hopes to persuade MPs to back a deal to take the UK out of the EU.
Doing so would implement the result of the referendum of June 2016, in which 52% of voters backed Leave and 48% Remain.
But where do voters stand on Brexit now, after more than three years of debate and negotiation?
There is no majority for any course of action First, no single course of action is preferred by a majority of voters.
For example, polling firm Kantar has asked voters on a number of occasions which of four possible outcomes they prefer.
The most popular choice has been to remain in the EU. However, this secured the support of only about one in three.
The next most popular, leaving without a deal, is preferred by slightly less than a quarter.
Much the same picture has been painted by another survey. BMG asked people which of five alternatives they would prefer if a deal is not agreed by the end of this month. None has come even close to being backe

In [39]:
print(result[0])

['image', 'copyright', 'getty', 'images', 'uk', 'prime', 'minister', 'boris', 'johnson', 'hope', 'persuade', 'mps', 'deal', 'uk', 'eu', 'implement', 'result', 'referendum', 'june', 'voter', 'leave', 'remain', 'voter', 'stand', 'brexit', 'year', 'debate', 'negotiation', 'majority', 'course', 'action', 'single', 'course', 'action', 'prefer', 'majority', 'voter', 'example', 'polling', 'firm', 'kantar', 'ask', 'voter', 'number', 'occasion', 'possible', 'outcome', '-pron-', 'prefer', 'popular', 'choice', 'remain', 'eu', 'secure', 'support', 'popular', 'leave', 'deal', 'prefer', 'slightly', 'quarter', 'picture', 'paint', 'survey', 'bmg', 'ask', 'people', 'alternative', '-pron-', 'prefer', 'deal', 'agree', 'end', 'month', 'come', 'close', 'half', 'voter', 'agreement', 'reach', 'single', 'popular', 'option', 'leave', 'eu', 'deal', '-pron-', 'popular', 'option', 'hold', 'referendum', 'reverse', 'brexit', 'referendum', 'choose', 'poll', 'opinium', 'panelbase', 'comres', 'ask', 'people', '-pron-'

In [40]:
# to create a new column of tokenized documents we simply assign the result like so...

df['tokens'] = result

In [41]:
df[['text','tokens']]

Unnamed: 0,text,tokens
0,Image copyright Getty Images UK Prime Minister...,"[image, copyright, getty, images, uk, prime, m..."
1,Stocks were off slightly as investors consider...,"[stock, slightly, investor, consider, mixed, b..."
2,Image copyright Getty Images A key part of the...,"[image, copyright, getty, images, key, brexit,..."
3,Send Load more share options\nThe New IRA has ...,"[send, load, share, option, new, ira, break, -..."
4,The issue of the Irish border - and how to han...,"[issue, irish, border, handle, flow, good, peo..."
...,...,...
95,IN GRAPHICS: What happens now? How has the gov...,"[graphics, happen, government, react, mr, raab..."
96,Report\nProtesters travelled from across the U...,"[report, protester, travel, uk, attend, march,..."
97,UK's Johnson asks for a Brexit delay that he d...,"[uk, johnson, ask, brexit, delay, -pron-, want..."
98,The MP who is bidding to replace him next mont...,"[mp, bid, replace, -pron-, month, warn, health..."
