# Introduction to Text Mining
## 2. Pre-Processing Documents to be Data
- A key part of text analysis is preparing your text for different kinds of analysis.

- Different types of analysis require different types of preparation, but two steps are usually fundamental:
    - Tokenizing
    - Filtering


All the processes that we used in Part 1 rely on some form of tokenizing in order to understand the underlying text. When we prepare text for other kinds of analysis, we also need to tokenize appropriately.


### a) Tokenizing
Tokenizing is the process of splitting text into a list of individual words (tokens) that can be treated as individual units.

How exactly you split up text is not necessarily straightforward, and there is a range of different strategies. The correct one to use depends on the type of text you are using and the type of analysis that you want to do.

To see how different tokenizer strategies interpret a piece of text you can check out the [Python NLTK Demo Page](https://text-processing.com/demo/tokenize/).

#### Tokenizing: Example of the problem
We already know how to split a string into a list of items. It's pretty simple using `.split()`.

In [None]:
test_phrase = "I don't see my cat. He has a long tail, fluffy ears and big eyes!"\
" He also subscribes to Marxist historical materialism. It's just his way."

In [None]:
test_tokens = test_phrase.split()
test_tokens

In [None]:
# first we check if the string 'ears' is in the list
'ears' in test_tokens

In [None]:
# so we would imagine that 'eyes' is also in the list?
'eyes' in test_tokens

Having punctuation attached to words like this can cause problems, because the tokens `eyes` and `eyes!` would be treated as two separate things. This disrupts a lot of analysis further down the line.
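
As a rough illustration of the problem, the sketch below compares `.split()` with a simple regular-expression rule that peels punctuation away from words. The pattern is a hypothetical rule written for this example, not the rule any particular library uses.

```python
import re

test_phrase = ("I don't see my cat. He has a long tail, fluffy ears and big eyes!"
               " He also subscribes to Marxist historical materialism. It's just his way.")

# Whitespace split: punctuation stays glued to the neighbouring word
naive_tokens = test_phrase.split()

# Hypothetical rule: a token is either a run of word characters/apostrophes
# or a single punctuation character
simple_tokens = re.findall(r"[\w']+|[^\w\s]", test_phrase)

print('eyes' in naive_tokens)   # False, the token is actually 'eyes!'
print('eyes' in simple_tokens)  # True, 'eyes' and '!' are now separate tokens
```

Note that this rule still keeps `don't` as a single token; handling contractions takes further rules.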

In [None]:
# Consider this tweet

test_tweet = "How will #Brexit affect #customsdeclarations? Check out the @britishchambers"\
" guide to find out! https://t.co/CpoGudZAcb https://t.co/Pr4RW4Tyhw"


In [None]:
# if we split it we get
test_tweet.split()

Tweets pose further challenges because of the unusual use of language that is common on Twitter and other social media (hashtags, mentions and links).
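
One way to cope with this kind of text is to give the tokenizer rules for the special items first. The sketch below is a minimal, assumed pattern written for this example (not a library tokenizer) that keeps URLs, hashtags and mentions whole before falling back to words and punctuation.

```python
import re

test_tweet = ("How will #Brexit affect #customsdeclarations? Check out the @britishchambers"
              " guide to find out! https://t.co/CpoGudZAcb https://t.co/Pr4RW4Tyhw")

# Order matters: try URLs first, then hashtags/mentions, then plain words,
# and finally any single punctuation character
TWEET_PATTERN = re.compile(r"https?://\S+|[#@]\w+|\w+|[^\w\s]")

tweet_tokens = TWEET_PATTERN.findall(test_tweet)

hashtags = [token for token in tweet_tokens if token.startswith('#')]
mentions = [token for token in tweet_tokens if token.startswith('@')]
urls = [token for token in tweet_tokens if token.startswith('http')]

print(hashtags)
print(mentions)
print(urls)
```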

#### Tokenizing: Using Tokenizers
Tokenizers are functions that split up text for us. Some are based on complex sets of rules; others are trained on many examples of text paired with the ideal way to split them.

There are many (many!) packages available for handling text data. These include:

- [The Natural Language Toolkit (NLTK)](https://www.nltk.org/): Very well-established package with more of a focus on linguistics. Can do most tasks but has a steeper learning curve than TextBlob. People often dip into its toolset whilst using other packages.
- [TextBlob](https://textblob.readthedocs.io/en/dev/): Good for beginners but lacks some features. Built on top of NLTK but doesn't have all of NLTK's functionality.
- [Gensim](https://radimrehurek.com/gensim/): Good at handling very large amounts of text (100,000s of documents). However, to achieve this it relies on fairly advanced Python concepts, making its approach a little tricky to get to grips with.
- [SpaCy](https://spacy.io/): Relatively new. Uses language models that are pre-trained on very large datasets of text. Very fast, with industry-grade tools. Very flexible if used correctly, and has a relatively accessible set of tools once you understand the SpaCy approach.


In [None]:
import spacy

In [None]:
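# assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")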

In [None]:
# In SpaCy tokenization happens the moment you wrap a string in your language model object nlp()

doc = nlp(test_phrase)

doc

In [None]:
# if we iterate over the doc we can see the tokens.

for token in doc:
    print(token)

To get a list of tokens we can do one of two things...

- a) Use a `for` loop

In [None]:
tokens_a = []
for token in doc:
    tokens_a.append(token)

In [None]:
print(tokens_a)

- b) Use a *List Comprehension*

List comprehensions allow us to do in 1 line what would normally take 3. We'll see how they can be used more later.

In [None]:
tokens_b = [token for token in doc]

In [None]:
print(tokens_b)
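
To make the comparison concrete without needing a language model, here is a minimal sketch of the same two patterns on a plain list of strings. The variable names are made up for this example.

```python
words = "He has a long tail".split()

# a) for loop: create an empty list, loop, append (three lines)
upper_a = []
for word in words:
    upper_a.append(word.upper())

# b) list comprehension: the same result in one line
upper_b = [word.upper() for word in words]

print(upper_a == upper_b)  # True

# A comprehension can also filter as it goes by adding an `if` clause
long_words = [word for word in words if len(word) > 2]
print(long_words)  # ['has', 'long', 'tail']
```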

#### Tokenizing: Understanding Tokenization
A number of things have been done by the tokenizer.
- Punctuation has been separated from words into its own tokens.
- Words that are contractions of two words (It's > It is / Don't > Do not) have been split into two tokens.
- This becomes more useful in a minute and is all part of the process of reducing the nuance of language to make documents more comparable.
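
The contraction splits listed above can be imitated with a small rule. The function below is a hypothetical toy written for this example (it is not SpaCy's tokenizer, which uses a much larger set of rules and exceptions), but it shows the kind of splitting involved.

```python
import re

# Toy rule: a word followed by "n't" or by an apostrophe clitic ('s, 'm, 're, 've, 'll, 'd)
CONTRACTION = re.compile(r"(\w+?)(n't|'(?:s|m|re|ve|ll|d))")

def split_contractions(word):
    """Return the word split into two pieces if it matches the toy rule."""
    match = CONTRACTION.fullmatch(word)
    return list(match.groups()) if match else [word]

sentence = "I don't like rabbits"
pieces = [piece for word in sentence.split() for piece in split_contractions(word)]

print(pieces)  # ['I', 'do', "n't", 'like', 'rabbits']
```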

In [None]:
texts = ["I don't like rabbits in space",
         "I do not like rabbits in space",
         "I'm loving these rabbits",
         "I love this rabbit"]

In [None]:
# Remember we can use nlp.pipe to process a list of documents more quickly than calling nlp() on each one in turn.

docs = nlp.pipe(texts)
docs

In [None]:
# We can force the generator to produce the results by wrapping it in a list
docs = list(nlp.pipe(texts))
docs
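
If generators are unfamiliar, the sketch below shows the behaviour with a hypothetical stand-in function, `fake_pipe`, which simply splits each text on whitespace. It is not SpaCy code, but it is lazy in the same way: nothing is produced until something asks for the results.

```python
def fake_pipe(texts):
    # Hypothetical stand-in for nlp.pipe: yields one processed item at a time
    for text in texts:
        yield text.split()

texts = ["I don't like rabbits in space",
         "I do not like rabbits in space",
         "I'm loving these rabbits",
         "I love this rabbit"]

lazy_docs = fake_pipe(texts)
print(lazy_docs)               # a generator object, no work has been done yet

docs_list = list(lazy_docs)    # wrapping in list() forces every item to be produced
print(len(docs_list))          # 4

# A generator can only be consumed once, so a second pass yields nothing
leftover = list(lazy_docs)
print(leftover)                # []
```

This is why the results are stored in a list when we want to index into them or loop over them more than once.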

In [None]:
# each of these list items is now a SpaCy document...

one_doc = docs[0]
print(one_doc)

one_doc_tokens = [token for token in one_doc]
print(one_doc_tokens)

### b) Filtering
Often in text analysis, there is a lot of material that is useful to humans but less useful to computers when performing analysis. Language is very nuanced in real life, but part of the filtering process involves reducing that nuance to strip back to a piece of text's bare bones.

For example, how different would we consider these two sentences...?

```
"I don't like rabbits in space"
"I do not like rabbits in space"
```
Semantically they are the same, computationally they are different.
```
"I am loving these rabbits"
"I love this rabbit"
```
Semantically a little different but still similar. Computationally very different as they only share one word, 'I'.
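
The "computational" view can be made concrete by counting shared tokens. This is a minimal sketch using plain whitespace splitting and Python sets; the helper name `shared_tokens` is made up for this example.

```python
pairs = [
    ("I don't like rabbits in space", "I do not like rabbits in space"),
    ("I am loving these rabbits", "I love this rabbit"),
]

def shared_tokens(first, second):
    """Return the set of whitespace-separated tokens that appear in both strings."""
    return set(first.split()) & set(second.split())

for first, second in pairs:
    print(sorted(shared_tokens(first, second)))
```

The second pair shares only `'I'`, even though a human reader would judge the sentences to be close in meaning.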

#### SpaCy Tokens
As we saw above, once a string is passed to an `nlp` model, it becomes a SpaCy [Document object](https://spacy.io/api/doc), giving access to a range of methods. The Document object tokenizes our string, which means it is also a sequence of SpaCy [Token objects](https://spacy.io/api/token).

In [None]:
# and to reiterate, a spacy document is a collection of spacy tokens
print(one_doc)
print(type(one_doc))

print(one_doc_tokens[4])
print(type(one_doc_tokens[4]))

#### SpaCy Tokens: Lemmatization

This is the process of reducing a word down to its 'root' form whilst still retaining the meaning. By reducing to the root of a word, you reduce the variation of words used and increase the chances that semantically similar phrases have the same tokens.
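
To see why this helps, the sketch below uses a tiny hand-written lookup table in place of a real lemmatizer. The table `TOY_LEMMAS` is an assumption made purely for illustration and is not how SpaCy derives lemmas.

```python
# Hypothetical, hand-written lookup table: NOT SpaCy's lemmatizer
TOY_LEMMAS = {"am": "be", "loving": "love", "rabbits": "rabbit"}

def toy_lemmatize(sentence):
    """Replace each word with its entry in the toy table, if it has one."""
    return [TOY_LEMMAS.get(word, word) for word in sentence.split()]

first = "I am loving these rabbits"
second = "I love this rabbit"

shared_before = set(first.split()) & set(second.split())
shared_after = set(toy_lemmatize(first)) & set(toy_lemmatize(second))

print(sorted(shared_before))  # ['I']
print(sorted(shared_after))   # ['I', 'love', 'rabbit']
```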

In [None]:
texts = ["I don't like rabbits in space",
         "I do not like rabbits in space",
         "I'm loving these rabbits",
         "I love this rabbit"]

docs = list(nlp.pipe(texts))

for doc in docs:
    print([token.lemma_ for token in doc])

#### '-PRON-'
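In SpaCy v2, `-PRON-` is a stand-in lemma for any pronoun, as described in [the documentation](https://spacy.io/api/annotation) quoted below. SpaCy v3 removed this special case, so with a v3 model pronouns receive ordinary lemmas instead of `-PRON-`.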


##### About spaCy's custom pronoun lemma for English

*spaCy adds a special case for English pronouns: all English pronouns are lemmatized to the special token -PRON-. Unlike verbs and common nouns, there’s no clear base form of a personal pronoun. Should the lemma of “me” be “I”, or should we normalize person as well, giving “it” — or maybe “he”? spaCy’s solution is to introduce a novel symbol, -PRON-, which is used as the lemma for all personal pronouns.*


#### SpaCy Tokens: Stop Words

"Stop" words are words within a language that provide structure but often do not convey much information in themselves. Examples include...
- the
- and
- it

For analysis that goes beyond the document-level tools SpaCy provides, we would normally want to filter out these stop words.
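
The idea can be sketched in plain Python with a small hand-written stop list. `TOY_STOP_WORDS` is a made-up set for this example; real stop lists are much longer.

```python
# Hypothetical, hand-written stop list for illustration only
TOY_STOP_WORDS = {"the", "and", "it", "a", "he", "has"}

tokens = ["He", "has", "a", "long", "tail", "and", "fluffy", "ears"]

# Lowercase each token before the membership test so 'He' matches 'he'
content_tokens = [token for token in tokens if token.lower() not in TOY_STOP_WORDS]

print(content_tokens)  # ['long', 'tail', 'fluffy', 'ears']
```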

In [None]:
# we can load a set of stop words from SpaCy
stopwords = nlp.Defaults.stop_words

In [None]:
nlp_phrase = nlp(test_phrase)
print(nlp_phrase)

In [None]:
for word in nlp_phrase:
    print(f"{word} : {word.text.lower() in stopwords}")

In [None]:
# first we lemmatize

lemmas = [token.lemma_ for token in nlp_phrase]
lemmas

In [None]:
# we can then filter out the stopwords like so

lemmas_filtered = [lemma for lemma in lemmas if lemma.lower() not in stopwords]
lemmas_filtered

In [None]:
# we can also filter for only alphabetical tokens (not punctuation) using .is_alpha

nlp_phrase_filtered = [token for token in nlp_phrase if token.is_alpha]
nlp_phrase_filtered

In [None]:
# if we put all our filters together...

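tokens = [token.lemma_.lower() for token in nlp_phrase if token.is_alpha]
tokens = [token for token in tokens if token not in stopwords]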
tokens

In [None]:
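def filter_text(spacy_doc, lower=True):
    # keep alphabetic, non-stop-word tokens and reduce each one to its lemma
    tokens = [token.lemma_ for token in spacy_doc if token.is_alpha and not token.is_stop]
    if lower:
        tokens = [token.lower() for token in tokens]
    return tokens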

In [None]:
test_phrase = "I don't see my cat. He has a long tail, fluffy ears and big eyes!"\
" He also subscribes to Marxist historical materialism. It's just his way."
doc = nlp(test_phrase)
filter_text(doc)

## Cleaning our News Data

In [None]:
import pandas as pd

In [None]:
df = pd.read_pickle('news_data.pkl')

In [None]:
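# assumes the article text is in the 'text' column, as used further down
df['text_nlp'] = list(nlp.pipe(df['text']))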

In [None]:
df['cleaned_tokens'] = df['text_nlp'].apply(filter_text)

In [None]:
df['cleaned_tokens']

We can compare the original and the cleaned versions.

In [None]:
print(df.loc[0,'text'])

In [None]:
print(' '.join(df.loc[0,'cleaned_tokens']))

We'll use this technique to clean our text in the next sessions.