# SC207 Text Mining
## Preprocessing and Tokenisation
### Preparing text to be used as data

- We broadly understand how text might be used as data for qualitative analysis.
- In that kind of analysis, words are not treated simply as individual units of data; we also recognise context, structure and pattern.
- How, then, can text be analysed quantitatively, and how can it be interpreted by a computer to aid that analysis?
- The tools we've already used for sentiment analysis and entity recognition performed this process for us in the background, but this won't always be the case.
- Usually, text analysis techniques require text to be prepared for analysis in two stages:

### 1. Tokenizing

- Computers break down text into units of analysis known as *Tokens*. Tokens are often individual words, but they can also be parts of words, common phrases etc.
- Tokenizing is the first fundamental step in any text analysis.
- How exactly you split up text is not necessarily straightforward.
- There is a range of different strategies, which you can see at the [Python NLTK Demo Page](https://text-processing.com/demo/tokenize/).
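
To see why the choice of strategy matters, here is a minimal sketch comparing two very simple tokenisers built from the standard library. Both are illustrative assumptions, not the strategies used on the demo page: one splits on whitespace only, the other uses a regular expression that separates words from punctuation. The example sentence is made up for this purpose.

```python
import re

sentence = "Dr. Smith's cat isn't here."

# Strategy 1: split on whitespace only (punctuation stays attached to words).
whitespace_tokens = sentence.split()

# Strategy 2: keep runs of word characters together and turn every
# punctuation mark, including apostrophes, into its own token.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```

Neither result is obviously "right": the first keeps `here.` as one token, the second breaks `isn't` into three pieces. Trained tokenizers exist to make better decisions in cases like these.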

### 2. Pre-Processing
- Exactly what happens in pre-processing tends to depend on the type of analysis you are doing.
- In general it tends to involve:
    - Filtering out common words and punctuation.
    - Standardising the text to make it less complex for the computer.


## Step 1: Tokenizing

#### Example of the problem
We already know how to split up a string into a series of items in a list. It's pretty simple using `.split()`.

In [None]:
test_phrase = "I don't see my cat. He has a long tail, fluffy ears and big eyes!"\
" He also subscribes to Marxist historical materialism. It's just his way."

In [None]:
test_tokens = test_phrase.split()
print(test_tokens)

In [None]:
# first we check if the string 'ears' is in the list
'ears' in test_tokens

In [None]:
# so we would imagine that 'eyes' is also in the list?
'eyes' in test_tokens

Having punctuation attached to words like this can cause us problems because the tokens `eyes` and `eyes!` would be considered two separate things. This messes with a lot of analysis further down the line.
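
As a small illustration of the knock-on effect, the sketch below counts word frequencies with and without the punctuation attached. The sentence is invented for the example, and stripping with `string.punctuation` is just one simple fix, not the approach we use later.

```python
from collections import Counter
import string

text = "He has big eyes! Such big eyes"
raw_tokens = text.split()
raw_counts = Counter(raw_tokens)

# Strip punctuation from both ends of each token so 'eyes!' becomes 'eyes'.
clean_tokens = [token.strip(string.punctuation) for token in raw_tokens]
clean_counts = Counter(clean_tokens)

print(raw_counts["eyes"], raw_counts["eyes!"])  # counted as two different tokens
print(clean_counts["eyes"])                     # counted as one token
```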

### Using Tokenizers
- Tokenizers are functions that split up text for us. 
- Some of them are based on complex sets of rules, others are based on training computers using lots of examples of text.
- There are many (many!) packages available for handling text data. See Moodle for a list of common ones.

#### SpaCy

We met SpaCy when we used it for named entity recognition. Today we'll also make use of its trained model to perform our tokenisation.

We're going to use `en_core_web_md`, which is trained on [petabytes of data from the contemporary internet](https://commoncrawl.org/big-picture/) and so is very up to date in how it understands contemporary language use.

In [None]:
# imports
import spacy

In [None]:
# nlp represents the trained language model provided by Spacy...
nlp = spacy.load('en_core_web_md')

In [None]:
# In SpaCy tokenization happens the moment you wrap a string in your language model object nlp()

doc = nlp(test_phrase)

doc

In [None]:
# if we iterate over the doc we can see the tokens.
tokens_a = []
for token in doc:
    tokens_a.append(token)

print(tokens_a)



**List comprehensions** allow us to do in 1 line what would normally take 3. They are more concise, and usually a little faster, than the equivalent `for` loop. We'll see more of how they can be used later.
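
A minimal side-by-side sketch on a plain list of strings (the word list is just an example) shows that the two forms produce the same result:

```python
words = ["fluffy", "ears", "and", "big", "eyes"]

# Version 1: a for loop that builds the list step by step (3 lines).
lengths_loop = []
for word in words:
    lengths_loop.append(len(word))

# Version 2: the same result as a list comprehension (1 line).
lengths_comprehension = [len(word) for word in words]

print(lengths_loop == lengths_comprehension)
```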

In [None]:
tokens_b = [token for token in doc]
print(tokens_b)

A number of things have been done by the tokenizer.
- Punctuation marks have been separated from words into their own tokens.
- Words that are contractions of two words (It's > It is / Don't > Do not) have been split into two.
- This becomes more useful in a minute and is all part of the process of reducing the nuance of language to make documents more comparable.
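
To get a feel for the kind of rules involved, here is a rough rule-based sketch that produces a similar split using a single regular expression. It is an assumption-laden toy, not how SpaCy's tokenizer is implemented, and `simple_tokenize` is a hypothetical helper that only handles straight apostrophes and a few contraction patterns.

```python
import re

# One pattern, with alternatives tried left to right at each position:
#   \w+(?=n't)  the stem of a negative contraction ("do" in "don't")
#   n't         the negation itself
#   '\w+        other contraction endings such as "'s"
#   \w+         ordinary words
#   [^\w\s]     any single punctuation mark
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word, contraction and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("I don't see my cat. It's just his way."))
```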

## Step 2: Pre-Processing
- Whilst the above looks just like the string again, it is actually a SpaCy **Document**.
- Once a string is processed by SpaCy it becomes a SpaCy [Document object](https://spacy.io/api/doc). 
- The SpaCy document object itself is made up of SpaCy [Token objects](https://spacy.io/api/token).

This means that Documents and Tokens have a range of associated methods and attributes based on SpaCy's analysis.

In [None]:
# Let's do that again
test_phrase = "I don't see my cat. He has a long tail, fluffy ears and big eyes!"\
" He also subscribes to Marxist historical materialism. It's just his way."
doc = nlp(test_phrase)

In [None]:
type(doc)

In [None]:
#language
doc.lang_

In [None]:
token = doc[0]
token

In [None]:
type(token)

#### Pre-Processing: Lemmatisation
Language is very nuanced in real life, but part of the filtering process involves reducing that nuance to strip back to a piece of text's bare bones.

```
"I don't like rabbits in space"
"I do not like rabbits in space"
```
- Semantically the same, computationally different.
- Lemmatising using the token attribute `.lemma_` allows us to roll words back to a common 'root'.

In [None]:
rabbit_1 = nlp("I don't like rabbits in space")
rabbit_2 = nlp("I do not like rabbits in space")

print([token.lemma_ for token in rabbit_1])
print([token.lemma_ for token in rabbit_2])

These phrases below are semantically similar, but only share 1 word. Lemmatising brings them closer together computationally.
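
Before running them through SpaCy, a toy sketch makes the idea concrete. The lemma lookup below is hand-written and hypothetical, covering only the words in these two phrases; it is not SpaCy's lemmatiser and its lemmas may differ from SpaCy's output.

```python
# Hypothetical, hand-written lemma lookup for this example only.
TOY_LEMMAS = {"'m": "be", "loving": "love", "these": "this", "rabbits": "rabbit"}

def toy_lemmatise(tokens):
    """Replace each token with its lemma if we have one, otherwise keep it."""
    return [TOY_LEMMAS.get(token, token) for token in tokens]

# The two phrases, already tokenised (punctuation left out for simplicity).
phrase_1 = ["I", "'m", "loving", "these", "rabbits"]
phrase_2 = ["I", "love", "this", "rabbit"]

shared_before = set(phrase_1) & set(phrase_2)
shared_after = set(toy_lemmatise(phrase_1)) & set(toy_lemmatise(phrase_2))

print(len(shared_before), "shared token(s) before lemmatising")
print(len(shared_after), "shared token(s) after lemmatising")
```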

In [None]:
rabbit_1 = nlp("I'm loving these rabbits")
rabbit_2 = nlp("I love this rabbit!")

print([token.lemma_ for token in rabbit_1])
print([token.lemma_ for token in rabbit_2])
In [None]:
# Let's see what it does to our test phrase
doc

In [None]:
print([token.lemma_ for token in doc])

#### Pre-Processing: Punctuation
Unless punctuation matters for your analysis (such as needing to break text down into sentences), we normally clear punctuation out of the text. We can use SpaCy's `.is_alpha` token attribute to include only tokens that are alphabetical, and still return just the lemma.
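
SpaCy's documentation describes `is_alpha` as equivalent to calling `.isalpha()` on the token's text, so we can preview the effect with plain strings. The token list here is typed by hand for illustration.

```python
tokens = ["I", "do", "n't", "see", "3", "cats", "!"]

# str.isalpha() is True only when every character in the string is a letter,
# so punctuation, numbers and tokens containing an apostrophe are all dropped.
alpha_tokens = [token for token in tokens if token.isalpha()]

print(alpha_tokens)
```

Note the side effect: this filter removes more than punctuation. Numbers and contraction pieces such as `n't` disappear too, which may or may not be what your analysis needs.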

In [None]:
processed_doc = [token.lemma_ for token in doc if token.is_alpha]
print(processed_doc)

#### Pre-Processing: Stop Words
Stop words are common words that tend to be structurally useful in sentences, but are too common to provide much meaning alone. They are often stripped out before text analysis.
Let's add this to our list comprehension.
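
The filtering itself is just a membership test. Here is a minimal sketch using a tiny, hand-picked stop word set, which is an assumption for this example and far smaller than any real stop word list:

```python
# A tiny, hand-picked stop word set (for illustration only).
toy_stop_words = {"he", "has", "a", "and"}

tokens = ["he", "has", "a", "long", "tail", "fluffy", "ears", "and", "big", "eyes"]

# Keep only the tokens that are not in the stop word set.
content_tokens = [token for token in tokens if token not in toy_stop_words]

print(content_tokens)
```

Using a `set` rather than a `list` for the stop words keeps each membership check fast, which matters once the corpus gets large.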

In [None]:
stop_words = nlp.Defaults.stop_words
stop_words

In [None]:
# show stopword filtering
print([token.lemma_ for token in doc if token.is_alpha and token.lemma_ not in stop_words])

In [None]:
# we'll add .lower() to our result to ensure that we 
# get rid of any distinction when it comes to capitalisation as well

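def process_text(doc):
    return [token.lemma_.lower() for token in doc if token.is_alpha]

    
def filter_stops(tokens, stop_words):
    return [token for token in tokens if token not in stop_words]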

In [None]:
test_phrase = "I don't see my cat. He has a long tail, fluffy ears and big eyes!"\
" He also subscribes to Marxist historical materialism. It's just his way."
doc = nlp(test_phrase)

print(process_text(doc))
print()
print(filter_stops(process_text(doc), stop_words))

## Processing a Corpus
A "corpus" is a collection of textual documents. In computational text analysis a corpus will often run into the hundreds, thousands or even hundreds of thousands of documents. Whilst we could process each document separately with a `for` loop, it is much more efficient to use SpaCy's `pipe`, which processes the documents as a stream in batches, is memory efficient, and can optionally spread the work across multiple processes.


In [None]:
corpus = ["I don't like rabbits in space",
         "I do not like rabbits in space",
         "I'm loving these rabbits",
         "I love this rabbit!"]

In [None]:
docs = nlp.pipe(corpus)  # pipe it
docs

Spacy has created a generator object. This means that at the moment, no processing has been done. Each document is only processed when we iterate over the generator object as if it were a list.
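
This lazy behaviour is a general feature of Python generators rather than something specific to SpaCy. The sketch below uses a hypothetical stand-in function, `shout`, in place of the language model, and a log list so we can see when the work actually happens.

```python
processed_log = []

def shout(texts):
    """Yield an upper-case copy of each text, one at a time."""
    for text in texts:
        processed_log.append(text)  # record that this text was processed
        yield text.upper()

lazy_results = shout(["rabbits in space", "i love this rabbit"])
count_before = len(processed_log)   # nothing has been processed yet

results = list(lazy_results)        # iterating forces the processing
count_after = len(processed_log)

print(count_before, count_after)
print(results)
```

A generator can only be consumed once. After the `list(...)` call above, iterating over `lazy_results` again yields nothing, which is why we store the results in a list.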

In [None]:
# We can force the generator to produce the results by iterating over it in a list comprehension.
docs = [doc for doc in nlp.pipe(corpus)]
docs

For peace of mind, we can check that the objects are indeed SpaCy documents.

In [None]:
for doc in docs:
    print(type(doc))

Which means we can run our processor on each Spacy doc as it is spat out and retain the result.

In [None]:
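def process_documents(corpus, stop_words=None):
    processed = [process_text(doc) for doc in nlp.pipe(corpus)]
    if stop_words is not None:
        processed = [filter_stops(tokens, stop_words) for tokens in processed]
    return processed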

In [None]:
stop_words = nlp.Defaults.stop_words

print( process_documents(corpus) )
print()
print( process_documents(corpus, stop_words) )

## Real Data Test
So far we've been working on a toy dataset. Let's see what happens with a real dataset.

In [None]:
import pandas as pd

In [None]:
df =

In [None]:
# check info
df.info()

In [None]:
stop_words = nlp.Defaults.stop_words

result = process_documents(df['text'], stop_words)  # process corpus

In [None]:
print(df.loc[0,'text'])

In [None]:
print(result[0])

# Phrasing
Phrasing looks for common co-occurrences of tokens in the text and joins them together into single tokens.
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/Archer-phrasing.jpg?raw=true" align="right" width="300">
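
To make the idea concrete, here is a deliberately simplified sketch that joins any adjacent pair of tokens seen at least `min_count` times. This is an assumption for illustration: gensim's `Phrases` decides which pairs to join using a statistical score and a threshold rather than a raw count alone, and `join_frequent_pairs` and the toy documents are hypothetical.

```python
from collections import Counter

def join_frequent_pairs(documents, min_count=2):
    """Join adjacent token pairs that occur at least min_count times."""
    pair_counts = Counter()
    for tokens in documents:
        pair_counts.update(zip(tokens, tokens[1:]))

    phrased_documents = []
    for tokens in documents:
        phrased, i = [], 0
        while i < len(tokens):
            pair = tuple(tokens[i:i + 2])
            if len(pair) == 2 and pair_counts[pair] >= min_count:
                phrased.append("_".join(pair))
                i += 2  # skip the second half of the joined pair
            else:
                phrased.append(tokens[i])
                i += 1
        phrased_documents.append(phrased)
    return phrased_documents

toy_documents = [
    ["protest", "hong", "kong", "tear", "gas"],
    ["police", "tear", "gas", "hong", "kong"],
]
toy_phrased = join_frequent_pairs(toy_documents)
print(toy_phrased)
```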


In [None]:
! pip install gensim

In [80]:
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

In [81]:
phraser =
phrased =

In [84]:
for doc in phrased[:5]:
    for token in doc:
        if '_' in token:
            print(token)

climate_change
look_like
social_medium
hong_kong
worth_billion
chief_executive
set_fire
hong_kong
carrie_lam
protester_shoot
set_fire
hong_kong
science_technology
chow_tsz
tear_gas
hong_kong
year_old
world_large
world_large
revenue_billion
president_donald
false_claim
false_claim
false_claim
cabinet_meeting
interview_fox
news_sean
false_claim
false_claim
false_claim
impeachment_inquiry
falsely_claim
ukrainian_president
falsely_claim
falsely_claim
falsely_claim
falsely_claim
quid_pro
falsely_claim
falsely_claim
adam_schiff
falsely_claim
committee_hearing
falsely_claim
falsely_claim
ask_question
impeachment_inquiry
falsely_claim
false_claim
bin_laden
bin_laden
bin_laden
bin_laden
bin_laden
bin_laden
fact_check
false_claim
president_trump
impeachment_inquiry
false_claim
george_washington
george_washington
fact_check
fact_check
october_exchange
reporter_cabinet
october_interview
fox_news
sean_hannity
united_states
october_speech
shale_insight
conference_fact
household_income
fox_news
media

Currently we have a list of lists of strings....


````
Results List
    - Document List
        - String Token
        - String Token
        ...
    - Document List
        - String Token
        ....
````

A cleaner way to store this would be to convert each tokenised document into a string with spaces separating each token. This can be easily reversed later and is more efficient to store.

````
Results List
    - String of Tokens
    - String of Tokens
    - String of Tokens
    ...
````
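
A minimal sketch of the round trip, using a couple of made-up tokenised documents, shows both the joining and the reversal:

```python
tokenized_documents = [
    ["hong_kong", "protester", "tear_gas"],
    ["president", "false_claim"],
]

# Join each document's tokens into one space-separated string.
token_strings = [" ".join(tokens) for tokens in tokenized_documents]

# Reverse the process by splitting each string on whitespace.
restored = [text.split() for text in token_strings]

print(token_strings)
print(restored == tokenized_documents)
```

The reversal is only exact when no token contains whitespace itself, which holds here because phrased tokens are joined with underscores.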


In [None]:
# Takes a list of tokenized documents and joins the tokens together, outputting a list of stringified tokens.
def stringify_tokens(tokenized_documents):
    return [' '.join(tokens) for tokens in tokenized_documents]

In [None]:
result_strings =
result_strings[0]

In [None]:
# to create a new column of tokenized documents we simply assign the result like so...
df['tokens'] = result_strings


In [None]:
df[['text','tokens']]

In [None]:
df.to_csv('sample_news_large_with_tokens.csv', index=False)