# Notebook 2 - Text preprocessing, POS tags, and simple word model

So far we have learned how to read files, manage them using pandas data frames and search for patterns using regular expressions. In this notebook we will look into the following concepts:
- text preprocessing
- POS tagging & Named-entity recognition (NER)
- bag-of-words model

---

## 3. Text preprocessing
Before running any analysis or developing a language model, we have to make sure that our data is in a suitable format, which will guarantee the best performance and accuracy of the algorithm. This step is called **text preprocessing** and consists of several smaller tasks.

### 3.1 Load from csv to pandas
In this section, we will use the same 'dummy_data' dataset which we used in the previous notebook. Firstly, let's load it!

In [None]:
import pandas as pd
import re

In [None]:
dummy_data_dataset_file = "https://raw.githubusercontent.com/TheRootOf3/ucl-nlp-notebook-series/main/Notebook2/datasets/dummy_data.csv"

''' uncomment if you want to run it locally '''
# dummy_data_dataset_file = "./datasets/dummy_data.csv"

dummy_data = pd.read_csv(dummy_data_dataset_file, encoding='utf-8')
dummy_data

In the dummy_data file there are 3 different types of entries. We will create an individual dataframe for each one.

In [None]:
# saving the three types of text data in 3 separate dataframes
sms_df = dummy_data[dummy_data['type'] == "sms"]
review_df = dummy_data[dummy_data['type'] == "review"]
news_df = dummy_data[dummy_data['type'] == "news_article"]

In [None]:
# sample entries on which we can test preprocessing methods

sms_sample = """***** CONGRATlations **** You won 2 tIckETs to Hamilton in 
NYC http://www.hamiltonbroadway.com/J?NaIOl/event   wORtH over $500.00...CALL 
555-477-8914 or send message to: hamilton@freetix.com to get ticket !! !"""
review_sample = """ THIS FOOD AND STAFF WAS AMAZING!!!!! ABSOLUTELY LOVE THAT PLACE <3<3<3"""
news_sample = """worldcom ex-boss launches defence lawyers defending former worldcom chief bernie ebbers against a battery of fraud charges have called a company whistleblower as their first witness.  cynthia cooper  worldcom s ex-head of internal accounting  alerted directors to irregular accounting practices at the us telecoms giant in 2002. her warnings led to the collapse of the firm following the discovery of an $11bn (┬ú5.7bn) accounting fraud. mr ebbers has pleaded not guilty to charges of fraud and conspiracy.  prosecution lawyers have argued that mr ebbers orchestrated a series of accounting tricks at worldcom  ordering employees to hide expenses and inflate revenues to meet wall street earnings estimates. but ms cooper  who now runs her own consulting business  told a jury in new york on wednesday that external auditors arthur andersen had approved worldcom s accounting in early 2001 and 2002. she said andersen had given a  green light  to the procedures and practices used by worldcom. mr ebber s lawyers have said he was unaware of the fraud  arguing that auditors did not alert him to any problems.  ms cooper also said that during shareholder meetings mr ebbers often passed over technical questions to the company s finance chief  giving only  brief  answers himself. the prosecution s star witness  former worldcom financial chief scott sullivan  has said that mr ebbers ordered accounting adjustments at the firm  telling him to  hit our books . however  ms cooper said mr sullivan had not mentioned  anything uncomfortable  about worldcom s accounting during a 2001 audit committee meeting. mr ebbers could face a jail sentence of 85 years if convicted of all the charges he is facing. worldcom emerged from bankruptcy protection in 2004  and is now known as mci. last week  mci agreed to a buyout by verizon communications in a deal valued at $6.75bn."""

### 3.2 Removing unwanted characters
This is a primary step in the process of text cleaning. If we scrape some text from HTML/XML sources, we’ll need to get rid of all the tags, HTML entities, punctuation, non-alphabetic characters, and any other kind of characters that might not be a part of the language. The general methods of such cleaning involve **regular expressions**, which can be used to filter out most of the unwanted text.
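
As a rough illustration of this step, the sketch below strips tags and decodes HTML entities using only the standard library. The function name `strip_html` and the sample string are made up for this example, and a regular expression is only a crude stand-in for a proper HTML parser.

```python
import html
import re

def strip_html(raw_html):
    # Replace anything that looks like a tag (<...>) with a space
    no_tags = re.sub(r"<[^>]+>", " ", raw_html)
    # Turn entities such as &amp; back into the characters they stand for
    decoded = html.unescape(no_tags)
    # Collapse the runs of whitespace left behind by the removed tags
    return " ".join(decoded.split())

print(strip_html("<p>Fish &amp; chips</p>"))
```

Punctuation and digits are deliberately left alone here, so that the decision about what else to remove can be made separately.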

However, sometimes, depending on the type of data, we want to retain certain types of punctuation. Consider for example human-generated tweets which you want to classify as very angry, angry, neutral, happy, and very happy. Simple sentiment analysis might find it hard to differentiate between a happy, and very happy sentiment because the only difference between a happy and a very happy tweet might be punctuation.

Example:

*This is amazing* vs *THIS IS AMAZING!!!!!*

Or what about this one:

*I don't know :) <3* vs *I don't know :(((*

Now let's create a simple function that keeps only letters.

In [None]:
#regular expression keeping only letters 

def keep_letters_only(raw_text):
    letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)
    return letters_only_text

In [None]:
keep_letters_only(sms_sample)

You can see that this is not ideal as this leaves us with a lot of random stuff like "www" and "com". We will get back to that later.

In [None]:
keep_letters_only(review_sample) 

We don't lose any meaning, but as mentioned previously, keeping the exclamation marks might be useful if we want to distinguish between positive and *VERY* positive reviews.
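
If we did want to keep that emphasis, one possible variant (a sketch, with the hypothetical name `keep_letters_and_emphasis`) is to let `!` and `?` through the same regular expression:

```python
import re

def keep_letters_and_emphasis(raw_text):
    # Same idea as keep_letters_only, but "!" and "?" survive the clean-up
    kept = re.sub(r"[^a-zA-Z!?]", " ", raw_text)
    # Collapse repeated spaces so the result is easier to read
    return " ".join(kept.split())

print(keep_letters_and_emphasis(" THIS FOOD AND STAFF WAS AMAZING!!!!! ABSOLUTELY LOVE THAT PLACE <3<3<3"))
```

Which characters to keep is a modelling decision; this variant still throws away emoticons such as `<3`.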

In [None]:
keep_letters_only(news_sample)

For news articles this works perfectly fine: we want to classify them by genre (sports, business, tech, etc.), so we do not lose any relevant information.

### 3.3 Text Normalisation
Recall our sms sample:

***** CONGRATlations **** You won 2 tIckETs to Hamilton in 
NYC http://www.hamiltonbroadway.com/J?NaIOl/event   wORtH over $500.00...CALL 
555-477-8914 or send message to: hamilton@freetix.com to get ticket !! !

I'd definitely deem this as spam. But clearly, there's a lot going on here: phone numbers, emails, website URLs, money amounts, and gratuitous whitespace and punctuation. Some terms are randomly capitalized, others are in all-caps. Since these terms might show up in any one of the training examples in countless forms, we need a way to ensure each training example is on an equal footing via a preprocessing step called **normalization**. 

To detect spam messages we don't want the computer to know or remember which email address or phone number was previously used in a spam message. We want the computer to understand **the pattern** of a spammy message. For example, if the message contains a lot of money amounts, words like "congratulations", "you won", AND an email address, it should be more likely to be considered spam. Again, we do not care what the particular email address was.

So instead of removing the following terms, for each training example, let's replace them with a specific string.

- Replace email addresses with `emailaddr`
- Replace URLs with `httpaddr`
- Replace money symbols with `moneysymb`
- Replace phone numbers with `phonenumbr`
- Replace numbers with `numbr`
- Remove all other punctuation

In [None]:
def normalisation_sms(raw_text):
    cleaned = re.sub(r'\b[\w\-.]+?@\w+?\.\w{2,4}\b', 'emailaddr', raw_text)
    cleaned = re.sub(r'(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)', 'httpaddr',
                     cleaned)
    cleaned = re.sub(r'£|\$|\€', 'moneysymb ', cleaned) #add whitespace
    cleaned = re.sub(
        r'\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b',
        'phonenumbr', cleaned)
    cleaned = re.sub(r'\d+(\.\d+)?', 'numbr', cleaned)
    letters_only_text = re.sub("[^a-zA-Z]", " ", cleaned)
    return letters_only_text

In [None]:
normalisation_sms(sms_sample)

### 3.4 Tokenisation
Tokenisation is the process of splitting a sentence into words (tokens).

As you remember, in the previous notebook we used the `.split()` method which may be helpful in this case. Let's see an easy example:

In [None]:
print("A bad day in London is still better than a bad day anywhere else".split())

Now, this sentence has been broken down to 14 tokens of 12 unique types (token 'A' is not the same as token 'a'). 
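
The token and type counts can be checked with a few lines of plain Python. This is only a sketch; the variable names are chosen for this example.

```python
sentence = "A bad day in London is still better than a bad day anywhere else"

tokens = sentence.split()                                  # every occurrence counts
types = set(tokens)                                        # distinct tokens, case-sensitive
lowercase_types = set(token.lower() for token in tokens)   # distinct tokens, ignoring case

print(len(tokens), len(types), len(lowercase_types))
```

Ignoring case merges 'A' and 'a', which removes one type.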

This example separates the individual tokens but doesn't get rid of the capitalism involved (no pun intended). Whether to capitalise or de-capitalise is, again, dependent on the data and the task at hand.

In this case, it seems reasonable to de-capitalize text. Converting the uppercase 'A' to lowercase is a good idea since it has the same meaning as 'a'. Let's do it!

In [None]:
print("A bad day in London is still better than a bad day anywhere else".lower().split())

Ok, so we have changed all characters to lowercase but is it better now? Look what happened to "London" - its first letter changed as well. This is a good example of when de-capitalization may not be the best solution. Imagine that there exists an item called "london". Because of this, the NLP algorithm developed further may confuse the city of London with the item "london". 

Similarly, if we want to differentiate between sentiments, a word written in uppercase (e.g. *AMAZING*) might express something different from the same word written in lowercase.

Note that in the example above there was no punctuation. Let's see what happens in the following case.

In [None]:
print("A bad day in London, is still better than a bad day anywhere else! London is the capital of the UK.".lower().split())

'london' and 'london,' are not the same thing! Of course, people are smart enough to understand that both tokens have the same meaning. However, computer algorithms looking for *patterns* may treat these two tokens as totally different and unrelated things.

The simplest solution would be to remove all the punctuation but as we said earlier, this may lead to the loss of meaning/sentiment. Is there any clever way we can solve this issue and keep punctuation? Yes! We can use regular expressions to match the word boundaries and treat punctuation as separate tokens.

In [None]:
print(re.findall(r'\b\w+\b|[^\w ]', "A bad day in London, is still better than a bad day anywhere else! London is the capital of the UK.".lower()))

Cool, this seems to work. How about this one?

In [None]:
print(re.findall(r'\b\w+\b|[^\w ]', "Although I do like rain, I don't like this stormy weather!"))

Look what happened to "don't". It has been split into three tokens: "don", the apostrophe and "t". Since "don't" is the negation of "do", it would be more natural to split it in a way that shows this. So let's create a rule that every "don't" will be split into "do" and "n't". But what about "can't" or "shouldn't"? Or even totally different words containing apostrophes like "students'"?  
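
A first attempt at such a rule could look like the sketch below. The function name `split_contractions` is hypothetical, and the pattern only knows about the "n't" ending, so every other use of the apostrophe still falls through to the generic punctuation case.

```python
import re

def split_contractions(text):
    # Alternatives are tried left to right:
    #   \w+(?=n't)  the part of a word that sits right before "n't"
    #   n't         the negation itself
    #   \w+         any other word
    #   [^\w\s]     any single punctuation character
    return re.findall(r"\w+(?=n't)|n't|\w+|[^\w\s]", text)

print(split_contractions("I don't like this, and the students' dog can't either!"))
```

Even this single rule gives odd results for some words ("can't" becomes "ca" and "n't"), and possessives are still split at the apostrophe.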

As you can see, we have to develop rules for multiple cases - quite boring and time-consuming. This is why we introduce another Python module called `nltk` - Natural Language Toolkit. This module contains many useful text processing tools including tokenizers. Let's see how it works.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data expected by word_tokenize in newer NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')

In [None]:
from nltk.tokenize import word_tokenize

print(word_tokenize("I don't like stormy weather after 8 o'clock in the evening!"))

As you can see, `word_tokenize` does exactly what we want! NLTK also provides different tokenizers for different types of input. Let's compare `word_tokenize` with `TweetTokenizer`, which has been designed to work better with Twitter-type source texts (including hashtags, mentions, etc.).

In [None]:
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()

print(word_tokenize("Hey @everyone, this is a sample #Twitter text containing some emojis :))) !!! Have fun <3 !"))
print(tt.tokenize("Hey @everyone, this is a sample #Twitter text containing some emojis :))) !!! Have fun <3 !"))

### 3.5 Stopword removal

Stopwords are words that are used very frequently. Words like “of, are, the, it, is” are some examples of stopwords. In applications like document search engines and document classification, where keywords are more important than general terms, removing stopwords can be a good idea. However, in an application such as song lyrics search, or searching for specific quotes, stopwords can be important. 

“To be, or not to be” - Stopwords in such phrases actually play an important role, and hence, should not be dropped.

Another example is negation. "not" is contained in many stopword lists, but deleting "not" out of a negative review can make a positive out of it.
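
One way to guard against this is to take negation words out of the stopword list before filtering. The sketch below uses a tiny made-up stopword list and a made-up set of negations purely for illustration; neither is the NLTK or SMART list.

```python
# Hypothetical miniature stopword list, for illustration only
example_stop_words = {"the", "was", "a", "is", "not", "it"}

# Words we never want to drop because they flip the meaning (an assumed set)
negations = {"not", "no", "never"}

# Set difference: everything in the stopword list except the negations
safe_stop_words = example_stop_words - negations

review_tokens = "the movie was not good".split()
print([t for t in review_tokens if t not in example_stop_words])  # negation is lost
print([t for t in review_tokens if t not in safe_stop_words])     # negation is kept
```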

There are two common approaches to removing stopwords, and both are fairly straightforward. One way is to count all the word occurrences, choose a threshold value for the count, and get rid of all the terms occurring more often than that threshold. The other way is to have a predetermined list of stopwords, which can be removed from the list of tokens/tokenized sentences. In the beginning, the second one may be better, as determining thresholds can be quite difficult.
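
The threshold-based approach can be sketched with `collections.Counter`. The helper name `frequent_terms`, the toy documents and the threshold of 2 are assumptions made for this example; on real data the threshold would need tuning.

```python
from collections import Counter

def frequent_terms(tokenised_docs, max_count):
    # Count every token across all documents
    counts = Counter(token for doc in tokenised_docs for token in doc)
    # Anything occurring more often than the threshold is treated as a stopword
    return {token for token, count in counts.items() if count > max_count}

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ate", "the", "bone"],
]

derived_stop_words = frequent_terms(docs, max_count=2)
filtered_docs = [[t for t in doc if t not in derived_stop_words] for doc in docs]

print(derived_stop_words)
print(filtered_docs)
```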

NLTK comes with many corpora, including a stopword list. This list contains around 200 terms. However, you may want to use one that contains over 600 terms: [http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a11-smart-stop-list/english.stop](http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a11-smart-stop-list/english.stop) (Apostrophes have been removed from this list, just as they have been removed from the news articles.)

Ok, let's see how we can remove stopwords from the news article sample.

In [None]:
# Firstly, let's read the stopwords file

stop_words = []

''' uncomment if you are running this notebook locally '''
# with open("./datasets/SmartStoplist.txt", 'r') as f:
#     stop_words.extend(f.read().splitlines())

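''' comment out the lines below if you are running this notebook locally (they are meant for e.g. google colab) '''
import urllib.request  # `import urllib` alone does not guarantee that the `request` submodule is loaded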
with urllib.request.urlopen("https://raw.githubusercontent.com/TheRootOf3/ucl-nlp-notebook-series/main/Notebook2/datasets/SmartStoplist.txt") as f:
    stop_words.extend(f.read().decode('utf-8').splitlines())

    
print(stop_words[:10])  # First 10 stopwords

In [None]:
# We can remove stopwords using our stopwords list, or...
nltk_tokens = word_tokenize(news_sample)
filtered_sentence_smart = [w for w in nltk_tokens if w not in stop_words]


# ...again, we can use nltk builtin stopwords feature
from nltk.corpus import stopwords

stop_words_nltk = set(stopwords.words('english'))
filtered_sentence_nltk = [w for w in nltk_tokens if w not in stop_words_nltk]

In [None]:
print(news_sample)

In [None]:
print(filtered_sentence_nltk)

In [None]:
print(filtered_sentence_smart)

### 3.6 Lemmatising and Stemming
Lemmatisation and stemming both refer to a process of reducing a word to its root. The difference is that a stem might not be an actual word, whereas a lemma is always an actual word. It’s a handy tool if you want to avoid treating different forms of the same word as different words, e.g. *love, loved, loving*.

**Lemmatising:** considered, considers, consider → “consider”

**Stemming:** considered, considering, consider → “consid”
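
To make the difference concrete, here is a deliberately naive sketch: a suffix-stripping "stemmer" and a lookup-table "lemmatiser". Both are toy stand-ins written for this example (the suffix list and the lemma table are made up); they are not the Porter algorithm or the WordNet lemmatiser.

```python
# Toy suffix list, longest suffixes first so that "ered" wins over "ed"
TOY_SUFFIXES = ("ering", "ered", "ers", "ing", "er", "ed", "s")

# Toy lemma dictionary: every value is a real word
TOY_LEMMAS = {"considered": "consider", "considers": "consider", "considering": "consider"}

def toy_stem(word):
    for suffix in TOY_SUFFIXES:
        # Strip the first matching suffix, but keep at least 3 characters
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatise(word):
    # Unknown words are returned unchanged
    return TOY_LEMMAS.get(word, word)

for w in ["considered", "considering", "consider"]:
    print(w, "->", toy_stem(w), "/", toy_lemmatise(w))
```

The stemmer maps all three forms to "consid", which is not a word, while the lookup returns "consider".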

In many applications, there may be no significant difference between lemmatising and stemming when training classifiers. However, the best way to find out how they work and when to use which solution is to try them! NLTK comes with many different in-built lemmatisers and stemmers, so just plug and play.

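A note of caution: `WordNetLemmatizer.lemmatize` takes an optional POS tag through its `pos` argument. The default is noun (`pos="n"`), so words from other classes, such as verbs, are often left unchanged unless you pass the matching tag, e.g. `lemmatizer.lemmatize("considers", pos="v")`.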

In [None]:
from nltk.stem import WordNetLemmatizer, PorterStemmer, SnowballStemmer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word = "considers"
word_2 = "apple"

stemmed_word =  stemmer.stem(word)
lemmatised_word = lemmatizer.lemmatize(word)

stemmed_word_2 =  stemmer.stem(word_2)
lemmatised_word_2 = lemmatizer.lemmatize(word_2)

print(stemmed_word)
print(lemmatised_word)

In [None]:
print(stemmed_word_2)
print(lemmatised_word_2)

### 3.7 Putting it all together
Now that we have covered everything we need to know, we can combine everything into one function and apply it to the whole data. Let's keep it simple and write one for the news articles.

In [None]:
def preprocess_news(raw_text):
    
    #keeping only letters
    letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)

    # convert to lower case and tokenise
    tokens = word_tokenize(letters_only_text.lower())
    

    cleaned_words = []
    stemmer = PorterStemmer()
    
    # remove stopwords
    for word in tokens:
        if word not in stop_words:
            cleaned_words.append(word)
    
    # stem the words (a lemmatiser could be used here instead)
    stemmed_words = []
    for word in cleaned_words:
        word = stemmer.stem(word)
        stemmed_words.append(word)
    
    # converting list back to string
    return " ".join(stemmed_words)

In [None]:
news_sample

In [None]:
preprocess_news(news_sample)

In [None]:
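news_df = news_df.copy()  # news_df is a filtered slice of dummy_data; copy it to avoid SettingWithCopyWarning
news_df['prep_text'] = news_df['text'].apply(preprocess_news)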
news_df

---

## 4. Part-of-speech tagging and Named-entity recognition

One may be interested not only in lexical features or pattern-based features like punctuation or upper/lower case letters but also in grammatical and semantic features of the source. This may be, for example, detecting verbs or searching for location names ("London", "New York", etc.).
- **POS tagging** - determining the grammatical category (part of speech) of a given token (verb, noun, etc.) 
- **NER** - identifying named entities and classifying them into general groups (person name, money amount, time)

To understand both tasks let's look at a simple example. Let's say we have the following sentence: "John visited US in 2020.".

| Operation | Output |
|-----------|--------|
| Raw text | John   visited   US   in   2020  . |
| POS tags | John<sub>proper noun</sub>   visited<sub>verb</sub>   US<sub>proper noun</sub>   in<sub>adposition</sub>   2020<sub>number</sub>  .<sub>punctuation</sub> |
| NER | John<sub>Person</sub>   visited   US<sub>Country</sub>   in   2020<sub>Time</sub>  . |

Now, let's try to do the same using another Python module called `spaCy`.

In [None]:
# Run this cell to download spacy english language package
! python3 -m spacy download en_core_web_sm

In [None]:
import spacy

# Firstly, we need to load the English language model
model = spacy.load("en_core_web_sm")

sentence = "John visited US in 2020."
doc = model(sentence)

Now, let's see how `spacy` tags these words with part-of-speech types:

In [None]:
for token in doc:
    print(token, token.pos_)

If we would like to see those types and the relations between words visually, `spacy` comes with a very handy method for doing so.

In [None]:
from spacy import displacy
displacy.render(doc, jupyter=True)

Now, how about named-entities? How can we identify and tag them?

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

We can also visualize those tags!

In [None]:
displacy.render(doc, style="ent", jupyter=True)
# GPE stands for geopolitical entity

`spaCy` provides many tools for this kind of analysis and identification. It also contains preprocessing tools like tokenizers and even advanced word embeddings. Make sure to check its [documentation](https://spacy.io/usage#quickstart)!

---

## 5. Bag-of-words model
Ok, now that we have tokenised and preprocessed our text, it is time to convert it into computer-readable vectors. This is called feature extraction. The **bag-of-words (BOW) model** is a popular and simple feature extraction technique. The intuition behind BOW is that two sentences are said to be similar if they contain a similar set of words. Bag-of-words can be treated as a special case (*n* = 1) of the more general **n-gram language model**.

The general idea of the BOW model is to count how many times each word (*token*) from the dataset occurs in a given sentence/source. The simplest way of implementing this model is using Python dictionaries. Let's try!

In [None]:
sentence1 = "They like apples"
sentence2 = "We like bananas"

def create_bag(text):
    bag = {}
    for token in text.split():
        if token in bag:
            bag[token] += 1
        else:
            bag[token] = 1
    return bag

sentence1_bag = create_bag(sentence1)
sentence2_bag = create_bag(sentence2)
print(sentence1_bag)
print(sentence2_bag)

Ok, now the computer understands how many and which words make up each sentence but is it able to compare them? No, because there isn't any connection between those sentences (yet!). We have to develop a "common denominator" for both sentences so we can compare them. 

### 5.1 One-Hot Vectors
In the case of the BOW model, the solution is to create a *bag* of all used tokens and encode words using computer-readable **One-Hot Vectors**. How does it work? BOW constructs a dictionary of *m* unique words in the corpus (vocabulary) and converts each word into a sparse vector of size *m*, where all values are set to 0 apart from the index of that word in the vocabulary. We can also say that each word is a feature and that sentences consist of features. If a feature is present in a given sentence it means one thing, if it is not present it means something different.

In the case above there are five different words: "They", "We", "like", "apples", "bananas". We can encode them using a vector of length 5.

| word    | associated vector |
|---------|-------------------|
| They    | [1,0,0,0,0]       |
| We      | [0,1,0,0,0]       |
| like    | [0,0,1,0,0]       |
| apples  | [0,0,0,1,0]       |
| bananas | [0,0,0,0,1]       |


A sentence can be represented by adding the vectors together.

For example: *They like apples* can be expressed as *They + like + apples* and using vectors: [1,0,0,0,0] + [0,0,1,0,0] + [0,0,0,1,0] = [1,0,1,1,0], hence the computer-readable version of "They like apples" is [1,0,1,1,0]. Simple, right? Unfortunately, it is often too simple - look what happens if you transform "apples like They" to vectors: the result is exactly the same [1,0,1,1,0]! We can use the n-gram model to mitigate this issue.

| sentence           | associated sum of vectors |
|--------------------|---------------------------|
| They like apples   | [1,0,1,1,0]               |
| We like apples     | [0,1,1,1,0]               |
| We like bananas    | [0,1,1,0,1]               |
| apples like They   | [1,0,1,1,0]               |
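
The table above can be reproduced with a few lines of plain Python. This is a sketch: the helper names `one_hot` and `sentence_vector` are made up for this example, and the vocabulary order is fixed by hand to match the table.

```python
vocabulary = ["They", "We", "like", "apples", "bananas"]
word_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # All zeros except a single 1 at the word's position in the vocabulary
    vector = [0] * len(vocabulary)
    vector[word_index[word]] = 1
    return vector

def sentence_vector(sentence):
    # Add the one-hot vectors of all words element by element
    total = [0] * len(vocabulary)
    for word in sentence.split():
        total = [a + b for a, b in zip(total, one_hot(word))]
    return total

for s in ["They like apples", "We like apples", "We like bananas", "apples like They"]:
    print(s, sentence_vector(s))
```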

What if there is more than one occurrence of the same token? There are different ways of handling that: max-pooling only records whether a word is present, but not how many times. Sum pooling counts the number of occurrences of each word.

| sentence                     | method            | associated sum of vectors |
|------------------------------|-------------------|---------------------------|
| They like like like apples   | max-pooling       |[1,0,1,1,0]                |
| They like like like apples   | sum pooling       |[1,0,3,1,0]                |
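
Both pooling variants can be sketched with `collections.Counter`. The function name `pooled_vector` and its `binary` flag are choices made for this example.

```python
from collections import Counter

vocabulary = ["They", "We", "like", "apples", "bananas"]

def pooled_vector(sentence, binary=False):
    counts = Counter(sentence.split())  # missing words count as 0
    if binary:
        # max-pooling: 1 if the word occurs at all, otherwise 0
        return [min(counts[word], 1) for word in vocabulary]
    # sum pooling: the number of occurrences of each word
    return [counts[word] for word in vocabulary]

print(pooled_vector("They like like like apples", binary=True))
print(pooled_vector("They like like like apples"))
```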



Now, how to implement it in Python? Of course, we could develop our own methods of words vectorization but the `scikit-learn` package gives us a set of useful tools!

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Corpus containing all sentences
corpus = [sentence1, sentence2]

vectorizer = CountVectorizer()
vectorizer.fit(corpus)  # vectorizer learns numbers of word occurrences (features)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

As you can see, count vectorizer created a set of 5 features based on which it will "score" given sentences. Let's see how to transform a sentence into a corresponding vector.

In [None]:
print(sentence1)
print(vectorizer.transform([sentence1]).toarray())

In [None]:
print(sentence2)
print(vectorizer.transform([sentence2]).toarray())

In [None]:
print("like like like")
print(vectorizer.transform(["like like like"]).toarray())

As you can see, CountVectorizer uses sum pooling by default. If you want to change it to max-pooling, pass the parameter `binary=True`.

In [None]:
vectorizer = CountVectorizer(binary=True)
vectorizer.fit(corpus)
print("like like like")
print(vectorizer.transform(["like like like"]).toarray())

### 5.2 BOW model limitations - the word order
Although bag-of-words is simple, easy to implement and works quite well in some applications, it has some serious limitations. The first one is the one we have already discussed - the word order does not matter. If the purpose of the model is to classify texts based on some keywords, then the order of words may not be important. However, this is tricky: two words may mean something quite different when they occur together. Let's take a closer look at this example:

In [None]:
# Let's say there are two reviews of the same movie 
corpus = ["The movie was not bad, actually quite good!", "The movie was not good, actually quite bad!"]

vectorizer = CountVectorizer()
vectorizer.fit(corpus)  # vectorizer learns the vocabulary (features)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
for sentence in corpus:
    print(sentence)
    print(vectorizer.transform([sentence]).toarray())

Although those reviews express completely opposite emotions, both sentences have been represented in exactly the same way. 
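
Looking at pairs of neighbouring words (bigrams) instead of single words is one way to recover some of the lost order. The sketch below is plain Python with hypothetical helper names `words` and `bigrams`; `CountVectorizer` offers a comparable option through its `ngram_range` parameter.

```python
import re

def words(text):
    # Lowercase the text and keep only runs of word characters
    return re.findall(r"\w+", text.lower())

def bigrams(text):
    # Pair every token with the token that follows it
    tokens = words(text)
    return list(zip(tokens, tokens[1:]))

review_a = "The movie was not bad, actually quite good!"
review_b = "The movie was not good, actually quite bad!"

print(sorted(words(review_a)) == sorted(words(review_b)))  # same bag of single words
print(set(bigrams(review_a)) - set(bigrams(review_b)))     # bigrams found only in review_a
```

The single-word bags are identical, but bigrams such as ("not", "bad") and ("not", "good") tell the two reviews apart.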

### 5.3 BOW model limitations - previously unseen words 
The next problem is new words. What happens if we ask our model to vectorize a text which contains previously unseen words (i.e. words that weren't present in the training corpus)? We cannot add them to the vocabulary and encode them on the fly since this would change the length of the vector. The simplest solution is to dismiss all words which were not in the training corpus. 
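
To see what "dismissing" means in practice, here is a plain-Python sketch of a vectoriser with a fixed vocabulary. The helper names `build_vocabulary` and `vectorise` are made up for this example, and the function also reports which tokens were dropped.

```python
import re

def build_vocabulary(corpus):
    # Collect every distinct lowercase word in the training corpus
    vocabulary = set()
    for sentence in corpus:
        vocabulary.update(re.findall(r"\w+", sentence.lower()))
    return sorted(vocabulary)

def vectorise(sentence, vocabulary):
    tokens = re.findall(r"\w+", sentence.lower())
    # The vector length is fixed by the vocabulary, whatever the input contains
    vector = [tokens.count(word) for word in vocabulary]
    # Tokens outside the vocabulary have no slot and are silently lost
    unseen = [token for token in tokens if token not in vocabulary]
    return vector, unseen

vocabulary = build_vocabulary(["They like bananas", "We like apples"])
print(vocabulary)
print(vectorise("We like apples, bananas", vocabulary))
print(vectorise("We like apples, bananas, plums", vocabulary))
```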

In [None]:
corpus = ["They like bananas", "We like apples"]

test_sentence1 = "We like apples, bananas"
test_sentence2 = "We like apples, bananas, plums"


vectorizer = CountVectorizer()
vectorizer.fit(corpus)  # vectorizer learns the vocabulary (features)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

print(test_sentence1)
print(vectorizer.transform([test_sentence1]).toarray())
print(test_sentence2)
print(vectorizer.transform([test_sentence2]).toarray())

As you can see, both sentences were encoded to the same vector. Because of this, we lost the additional information from the second sentence that "we" also like plums. One solution to this problem is to use sufficiently large training corpora, but this leads to a significant performance drop and unreasonable memory usage (for a vocabulary of 100k words, every sentence is represented using a vector of length 100k). I encourage you to read [this chapter](https://web.stanford.edu/~jurafsky/slp3/3.pdf) of the SLP book about a better model - the **n-gram language model**.

