# Introduction to Text Mining
## 2. Pre-Processing Documents to be Data
- A key part of text analysis is preparing your text for the kind of analysis you want to run.

- Different types of analysis require different types of preparation, but two steps are usually fundamental:
    - Tokenizing
    - Filtering


All the processes that we engaged with in Part 1 would have used some sort of tokenizing in order to understand the underlying text. When we prepare text for other kinds of analysis, we also need to tokenize appropriately.


### a) Tokenizing
Tokenizing is the process of splitting up text into a list of individual words (tokens) that can be treated as individual units.

How exactly you split up text is not necessarily straightforward, and there is a range of different strategies. The correct one to use depends on the type of text you are using and the type of analysis that you want to do.

To see how different tokenizer strategies interpret a piece of text you can check out the [Python NLTK Demo Page](https://text-processing.com/demo/tokenize/).

#### Tokenizing: Example of the problem
We already know how to split up a string into a series of items in a list. It's pretty simple using `.split()`.

In [1]:
test_phrase = "I don't see my cat. He has a long tail, fluffy ears and big eyes!"\
" He also subscribes to Marxist historical materialism. It's just his way."

In [2]:
test_tokens = test_phrase.split()
test_tokens

['I',
 "don't",
 'see',
 'my',
 'cat.',
 'He',
 'has',
 'a',
 'long',
 'tail,',
 'fluffy',
 'ears',
 'and',
 'big',
 'eyes!',
 'He',
 'also',
 'subscribes',
 'to',
 'Marxist',
 'historical',
 'materialism.',
 "It's",
 'just',
 'his',
 'way.']

In [3]:
# first we check if the string 'ears' is in the list
'ears' in test_tokens

True

In [4]:
# so we would imagine that 'eyes' is also in the list?
'eyes' in test_tokens

False

Having punctuation attached to words like this can cause us problems because the tokens `eyes` and `eyes!` would be considered two separate things. This messes with a lot of analysis further down the line.
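One naive workaround, sketched here with only the standard library, is to strip punctuation from the ends of each token after splitting. It is an illustration of the problem rather than a recommended approach: it fixes `eyes!` but leaves contractions such as `don't` as single tokens.

```python
import string

test_phrase = ("I don't see my cat. He has a long tail, fluffy ears and big eyes!"
               " He also subscribes to Marxist historical materialism. It's just his way.")

# Split on whitespace, then strip punctuation from both ends of each token.
raw_tokens = test_phrase.split()
stripped_tokens = [token.strip(string.punctuation) for token in raw_tokens]

print('eyes' in raw_tokens)       # the raw token is 'eyes!'
print('eyes' in stripped_tokens)  # the trailing '!' has been stripped
print(stripped_tokens[1])         # internal apostrophes are left untouched
```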

In [5]:
# Consider this tweet

test_tweet = "How will #Brexit affect #customsdeclarations? Check out the @britishchambers"\
" guide to find out! https://t.co/CpoGudZAcb https://t.co/Pr4RW4Tyhw"


In [6]:
# if we split it we get
test_tweet.split()

['How',
 'will',
 '#Brexit',
 'affect',
 '#customsdeclarations?',
 'Check',
 'out',
 'the',
 '@britishchambers',
 'guide',
 'to',
 'find',
 'out!',
 'https://t.co/CpoGudZAcb',
 'https://t.co/Pr4RW4Tyhw']

Tweets pose further challenges because of the unusual use of language that is common on Twitter and other social media, such as hashtags, @-mentions and URLs.
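As a rough sketch of what a tweet-aware splitter might do, the regular expression below keeps URLs, hashtags and mentions intact while separating ordinary punctuation. The pattern and the name `TWEET_PATTERN` are illustrative assumptions, not part of any library.

```python
import re

test_tweet = ("How will #Brexit affect #customsdeclarations? Check out the @britishchambers"
              " guide to find out! https://t.co/CpoGudZAcb https://t.co/Pr4RW4Tyhw")

# Hypothetical pattern, tried left to right at each position:
# URLs, then hashtags and mentions, then ordinary words,
# then any single punctuation character.
TWEET_PATTERN = re.compile(r"https?://\S+|[#@]\w+|\w+|[^\w\s]")

tweet_tokens = TWEET_PATTERN.findall(test_tweet)
print(tweet_tokens)
```

Compared with `.split()`, the `?` and `!` become their own tokens while `#customsdeclarations` and the two links survive in one piece.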

#### Tokenizing: Using Tokenizers
Tokenizers are functions that split up text for us. Some of them are based on complex sets of rules, others are based on training computers using lots of examples of text and the ideal way to split them.
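To give a feel for the rule-based approach, here is a toy tokenizer built from a handful of regular-expression rules. It is a minimal sketch under simplifying assumptions (English contractions only, no special handling of URLs or numbers); the names `TOKEN_PATTERN` and `toy_tokenize` are hypothetical, and real tokenizers use far larger rule sets.

```python
import re

# Toy rules, tried in order at each position:
#   1. a word stem that is immediately followed by "n't"  (do | n't)
#   2. the "n't" contraction itself
#   3. a clitic beginning with an apostrophe ('s, 're, 'll)
#   4. an ordinary run of word characters
#   5. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]")

def toy_tokenize(text):
    """Return the list of tokens matched by the toy rules."""
    return TOKEN_PATTERN.findall(text)

sample = "I don't see my cat. It's just his way."
sample_tokens = toy_tokenize(sample)
print(sample_tokens)
```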

There are many (many!) packages available for handling text data. These include:

- [The Natural Language Toolkit (NLTK)](https://www.nltk.org/): Very well established package. More of a focus on linguistics. Can do most tasks but a bit of a steep learning curve compared to TextBlob. People often dip into its toolset whilst using other packages.
- [TextBlob](https://textblob.readthedocs.io/en/dev/): Good for beginners but lacks some features. Built on top of NLTK but doesn't have all of NLTK's functionality.
- [Gensim](https://radimrehurek.com/gensim/): Good for handling very large amounts of text (100,000s). However, to achieve this it relies on fairly high-level concepts in Python, making its approach a little tricky to get a grip on.
- [SpaCy](https://spacy.io/): Relatively new. Utilises a lot of language models that are pre-trained on very large datasets of text. Incredibly fast, with industry-level tools. Very flexible if used correctly, and has a relatively accessible set of tools once you understand the SpaCy approach.


In [7]:
import spacy

In [8]:
nlp = spacy.load('en_core_web_md')

In [9]:
# In SpaCy tokenization happens the moment you wrap a string in your language model object nlp()

doc = nlp(test_phrase)

doc

I don't see my cat. He has a long tail, fluffy ears and big eyes! He also subscribes to Marxist historical materialism. It's just his way.

In [10]:
# if we iterate over the doc we can see the tokens.
for word in doc:
    print(word)

I
do
n't
see
my
cat
.
He
has
a
long
tail
,
fluffy
ears
and
big
eyes
!
He
also
subscribes
to
Marxist
historical
materialism
.
It
's
just
his
way
.


To get a list of tokens we can do one of two things...

- a) Use a `for` loop

In [11]:
tokens_a = []

for word in doc:
    tokens_a.append(word)
    

In [12]:
print(tokens_a)

[I, do, n't, see, my, cat, ., He, has, a, long, tail, ,, fluffy, ears, and, big, eyes, !, He, also, subscribes, to, Marxist, historical, materialism, ., It, 's, just, his, way, .]


- b) Use a *List Comprehension*

List comprehensions allow us to do in 1 line what would normally take 3. We'll see how they can be used more later.

In [13]:
tokens_b = [word for word in doc]

In [14]:
print(tokens_b)

[I, do, n't, see, my, cat, ., He, has, a, long, tail, ,, fluffy, ears, and, big, eyes, !, He, also, subscribes, to, Marxist, historical, materialism, ., It, 's, just, his, way, .]
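The equivalence between the two approaches does not depend on SpaCy. A minimal plain-Python sketch, using a made-up list of words, shows the loop and the comprehension side by side, plus the optional trailing `if` that comprehensions support:

```python
words = ["cat", "tail", "ears", "eyes"]

# The three-line loop version.
upper_loop = []
for word in words:
    upper_loop.append(word.upper())

# The equivalent one-line list comprehension.
upper_comp = [word.upper() for word in words]

# A comprehension can also filter items with a trailing `if`.
long_words = [word for word in words if len(word) > 3]

print(upper_loop == upper_comp)
print(long_words)
```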


#### Tokenizing: Understanding Tokenization
A number of things have been done by the tokenizer.

- Punctuation has been separated from words into its own tokens.
- Words that are contractions of two words (It's > It is / Don't > Do not) have been split into two.

This becomes more useful in a minute and is all part of the process of reducing the nuance of language to make documents more comparable.

In [15]:
texts = ["I don't like rabbits in space",
         "I do not like rabbits in space",
         "I'm loving these rabbits",
         "I love this rabbit"]

In [16]:
# Remember that nlp.pipe converts a list of documents more quickly than calling nlp() on each item in turn.
# n_process=-1 asks SpaCy to use all available CPU cores.

docs = nlp.pipe(texts, n_process=-1)
docs

<generator object Language.pipe at 0x7fa3ed9d2e50>

In [17]:
# We can force the generator to produce the results by wrapping it in a list
docs = list(docs)
docs

[I don't like rabbits in space,
 I do not like rabbits in space,
 I'm loving these rabbits,
 I love this rabbit]
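The generator behaviour seen here is ordinary Python rather than anything specific to SpaCy. The sketch below uses a hypothetical `shout` generator to show that items are only computed when requested, and that a generator can be consumed only once, which is why the results are stored in a list.

```python
def shout(words):
    """Yield each word upper-cased, one at a time."""
    for word in words:
        yield word.upper()

lazy = shout(["rabbits", "in", "space"])  # nothing has been computed yet
first = next(lazy)                         # computes only the first item
rest = list(lazy)                          # forces the remaining items
leftover = list(lazy)                      # the generator is now exhausted

print(first, rest, leftover)
```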

In [18]:
# each of these list items is now a SpaCy document...

one_doc = docs[0]
print(one_doc.lang_)

one_doc_tokens = [word for word in one_doc]
print(one_doc_tokens)

en
[I, do, n't, like, rabbits, in, space]


### b) Filtering
Often in text analysis, there is a lot of material that is useful to humans but less useful to computers when performing analysis. Language is very nuanced in real life, but part of the filtering process involves reducing that nuance to strip back to a piece of text's bare bones.

For example, how different would we consider these two sentences...?

```
"I don't like rabbits in space"
"I do not like rabbits in space"
```
Semantically they are the same; computationally they are different.
```
"I am loving these rabbits"
"I love this rabbit"
```
Semantically a little different but still similar. Computationally very different as they only share one word, 'I'.
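One simple way to put a number on "computationally different" is to compare the sets of whitespace-separated tokens in each sentence. The sketch below uses the Jaccard similarity (shared tokens divided by all distinct tokens) purely as an illustration; the helper name `token_overlap` is an assumption, and other measures would serve equally well.

```python
def token_overlap(sentence_a, sentence_b):
    """Return the shared tokens and the Jaccard similarity of two sentences."""
    tokens_a = set(sentence_a.split())
    tokens_b = set(sentence_b.split())
    shared = tokens_a & tokens_b
    jaccard = len(shared) / len(tokens_a | tokens_b)
    return shared, jaccard

shared_1, score_1 = token_overlap("I don't like rabbits in space",
                                  "I do not like rabbits in space")
shared_2, score_2 = token_overlap("I am loving these rabbits",
                                  "I love this rabbit")

print(sorted(shared_1), score_1)
print(sorted(shared_2), score_2)
```

Even the first pair, which means the same thing, falls well short of a perfect score because `don't` does not match `do` or `not`.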

#### SpaCy Tokens
As we saw above, once a string is wrapped in an nlp model, it becomes a SpaCy [Document object](https://spacy.io/api/doc), giving access to a range of methods. The SpaCy Document object tokenizes our string, meaning that the Document object is also a sequence of SpaCy [Token objects](https://spacy.io/api/token).

In [19]:
# and to reiterate, a spacy document is a collection of spacy tokens
print(one_doc)
print(type(one_doc))

print(one_doc_tokens[4])
print(type(one_doc_tokens[4]))

I don't like rabbits in space
<class 'spacy.tokens.doc.Doc'>
rabbits
<class 'spacy.tokens.token.Token'>


#### SpaCy Tokens: Lemmatization

This is the process of reducing a word down to its 'root' form whilst still retaining the meaning. By reducing to the root of a word, you reduce the variation of words used and increase the chances that semantically similar phrases have the same tokens.
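To make the idea concrete without a language model, here is a toy lemmatizer based on a tiny hand-written lookup table. The table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical and only cover this example; a real lemmatizer such as SpaCy's draws on much larger resources.

```python
# Hypothetical hand-written lookup table, for illustration only.
TOY_LEMMAS = {"am": "be", "loving": "love", "rabbits": "rabbit"}

def toy_lemmatize(sentence):
    """Replace each whitespace-separated word with its lemma, if one is listed."""
    return [TOY_LEMMAS.get(word, word) for word in sentence.split()]

lemmas_a = toy_lemmatize("I am loving these rabbits")
lemmas_b = toy_lemmatize("I love this rabbit")

print(lemmas_a)
print(lemmas_b)
print(sorted(set(lemmas_a) & set(lemmas_b)))
```

Before lemmatizing, the two sentences shared only `I`; afterwards they also share `love` and `rabbit`.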

In [20]:
texts = ["I don't like rabbits in space",
         "I do not like rabbits in space",
         "I'm loving these rabbits",
         "I love this rabbit"]

docs = list(nlp.pipe(texts))

for doc in docs:
    print([token.lemma_ for token in doc])

['-PRON-', 'do', 'not', 'like', 'rabbit', 'in', 'space']
['-PRON-', 'do', 'not', 'like', 'rabbit', 'in', 'space']
['-PRON-', 'be', 'love', 'these', 'rabbit']
['-PRON-', 'love', 'this', 'rabbit']


#### '-PRON-'
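In SpaCy, `-PRON-` is a stand-in lemma for any pronoun. This applies to SpaCy v2, which produced the output above; SpaCy v3 and later no longer use `-PRON-` as a lemma. From [the documentation](https://spacy.io/api/annotation):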


##### About spaCy's custom pronoun lemma for English

*spaCy adds a special case for English pronouns: all English pronouns are lemmatized to the special token -PRON-. Unlike verbs and common nouns, there’s no clear base form of a personal pronoun. Should the lemma of “me” be “I”, or should we normalize person as well, giving “it” — or maybe “he”? spaCy’s solution is to introduce a novel symbol, -PRON-, which is used as the lemma for all personal pronouns.*


#### SpaCy Tokens: Stop Words

"Stop" words are words within a language that provide structure to the language but often do not convey a lot of information in themselves. Examples include...
- the
- and
- it

Normally, for analysis that goes beyond the document-level tools SpaCy provides, we would want to filter out these stop words.
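The filtering step itself is just a membership test against a set of words. A minimal sketch, assuming a deliberately tiny made-up stop list (`TOY_STOP_WORDS`) rather than SpaCy's much longer one:

```python
# Hypothetical, deliberately tiny stop list for illustration only.
TOY_STOP_WORDS = {"the", "and", "it", "a", "he", "has"}

sentence_tokens = ["He", "has", "a", "long", "tail", "and", "fluffy", "ears"]

# Lower-case each token before the lookup so that "He" matches "he".
content_tokens = [token for token in sentence_tokens
                  if token.lower() not in TOY_STOP_WORDS]

print(content_tokens)
```

Lower-casing before the lookup matters because stop lists are typically stored in lower case.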

In [21]:
# we can load the set of stop words that ships with spacy
stopwords = nlp.Defaults.stop_words

In [22]:
nlp_phrase = nlp(test_phrase)
print(nlp_phrase)

I don't see my cat. He has a long tail, fluffy ears and big eyes! He also subscribes to Marxist historical materialism. It's just his way.


In [23]:
stop_word = nlp_phrase[15].text
not_stop_word = nlp_phrase[5].text

print(stop_word)
print(stop_word in stopwords)

print(not_stop_word)
print(not_stop_word in stopwords)

and
True
cat
False


In [24]:
for word in nlp_phrase:
    print(f"{word} : {word.text.lower() in stopwords}")

I : True
do : True
n't : True
see : True
my : True
cat : False
. : False
He : True
has : True
a : True
long : False
tail : False
, : False
fluffy : False
ears : False
and : True
big : False
eyes : False
! : False
He : True
also : True
subscribes : False
to : True
Marxist : False
historical : False
materialism : False
. : False
It : True
's : True
just : True
his : True
way : False
. : False


In [25]:
# first we lemmatize

lemmas = [word.lemma_ if word.lemma_ != '-PRON-'
          else word.text
          for word in nlp_phrase]
lemmas

['I',
 'do',
 'not',
 'see',
 'my',
 'cat',
 '.',
 'He',
 'have',
 'a',
 'long',
 'tail',
 ',',
 'fluffy',
 'ear',
 'and',
 'big',
 'eye',
 '!',
 'He',
 'also',
 'subscribe',
 'to',
 'marxist',
 'historical',
 'materialism',
 '.',
 'It',
 'be',
 'just',
 'his',
 'way',
 '.']

In [26]:
# we can then filter out the stopwords like so

lemmas_filtered = [word for word in lemmas if word.lower() not in stopwords]
lemmas_filtered

['cat',
 '.',
 'long',
 'tail',
 ',',
 'fluffy',
 'ear',
 'big',
 'eye',
 '!',
 'subscribe',
 'marxist',
 'historical',
 'materialism',
 '.',
 'way',
 '.']

In [27]:
# we can also keep only alphabetical tokens (dropping punctuation) using the string method .isalpha()
# (SpaCy tokens offer an equivalent .is_alpha attribute)

nlp_phrase_filtered = [word for word in lemmas_filtered if word.isalpha()]
nlp_phrase_filtered

['cat',
 'long',
 'tail',
 'fluffy',
 'ear',
 'big',
 'eye',
 'subscribe',
 'marxist',
 'historical',
 'materialism',
 'way']

In [28]:
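# if we lemmatize and keep only alphabetical tokens in one go...
# (stop words are not filtered here, and '-PRON-' is dropped because it is not alphabetical)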

tokens = [word.lemma_ for word in nlp_phrase]
tokens = [word for word in tokens if word.isalpha()]
tokens

['do',
 'not',
 'see',
 'cat',
 'have',
 'a',
 'long',
 'tail',
 'fluffy',
 'ear',
 'and',
 'big',
 'eye',
 'also',
 'subscribe',
 'to',
 'marxist',
 'historical',
 'materialism',
 'be',
 'just',
 'way']

In [29]:
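def filter_text(spacy_doc, lower=True):
    """Lemmatize a SpaCy doc, then drop stop words and non-alphabetical tokens."""
    # relies on the module-level `stopwords` set loaded earlier
    tokens = [word.lemma_ for word in spacy_doc]
    tokens = [word for word in tokens if word.lower() not in stopwords]
    # str.isalpha() also removes punctuation and the '-PRON-' placeholder
    tokens = [word for word in tokens if word.isalpha()]
    if lower:
        tokens = [word.lower() for word in tokens]
    return tokens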

In [30]:
test_phrase = "I don't see my cat. He has a long tail, fluffy ears and big eyes!"\
" He also subscribes to Marxist historical materialism. It's just his way."
doc = nlp(test_phrase)
filter_text(doc)

['cat',
 'long',
 'tail',
 'fluffy',
 'ear',
 'big',
 'eye',
 'subscribe',
 'marxist',
 'historical',
 'materialism',
 'way']

## Cleaning our News Data

In [31]:
import pandas as pd

In [32]:
df = pd.read_pickle('news_data.pkl')

In [33]:
df['text_nlp'] = list(nlp.pipe(df['text']))

In [34]:
df['cleaned_tokens'] = df['text_nlp'].apply(filter_text)

In [35]:
df['cleaned_tokens']

0      [image, copyright, getty, images, uk, prime, m...
1      [stock, slightly, investor, consider, mixed, b...
2      [image, copyright, getty, images, key, brexit,...
3      [send, load, share, option, new, ira, break, l...
4      [issue, irish, border, handle, flow, good, peo...
                             ...                        
967    [victoria, police, denounce, inappropriate, me...
968    [photo, old, canadian, drag, queen, pose, half...
969    [ex, breitbart, writer, milo, yiannopoulos, re...
970    [brexit, party, leader, continue, worry, longe...
971    [screenshot, minor, task, everyday, task, impo...
Name: cleaned_tokens, Length: 972, dtype: object

We can compare the original and the cleaned versions:

In [36]:
print(df.loc[0,'text'])

Image copyright Getty Images UK Prime Minister Boris Johnson hopes to persuade MPs to back a deal to take the UK out of the EU.
Doing so would implement the result of the referendum of June 2016, in which 52% of voters backed Leave and 48% Remain.
But where do voters stand on Brexit now, after more than three years of debate and negotiation?
There is no majority for any course of action First, no single course of action is preferred by a majority of voters.
For example, polling firm Kantar has asked voters on a number of occasions which of four possible outcomes they prefer.
The most popular choice has been to remain in the EU. However, this secured the support of only about one in three.
The next most popular, leaving without a deal, is preferred by slightly less than a quarter.
Much the same picture has been painted by another survey. BMG asked people which of five alternatives they would prefer if a deal is not agreed by the end of this month. None has come even close to being backe

In [37]:
print(' '.join(df.loc[0,'cleaned_tokens']))

image copyright getty images uk prime minister boris johnson hope persuade mps deal uk eu implement result referendum june voter leave remain voter stand brexit year debate negotiation majority course action single course action prefer majority voter example polling firm kantar ask voter number occasion possible outcome prefer popular choice remain eu secure support popular leave deal prefer slightly quarter picture paint survey bmg ask people alternative prefer deal agree end month come close half voter agreement reach single popular option leave eu deal popular option hold referendum reverse brexit referendum choose poll opinium panelbase comres ask people think proposal deal forward mr johnson find slightly voter favour half opinium poll think proposal represent good deal reckon represent bad brexit deal possible difficult simple guide brexit deal brexit people good bad know panelbase comres find support proposal oppose case know backdrop voter compromise deal mr johnson strike eu f

We'll use this technique to clean our text in the next sessions.