<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# NLP Part 1: Language Data Pre-Processing



### Learning Objectives

1. Get started with the Natural Language Toolkit (NLTK).
2. Define and implement tokenizing, lemmatizing, stemming, part-of-speech tagging and named entity recognition.
3. Preprocess text data by removing stopwords.
4. Implement sentiment analysis.

Before we get started, you will need to install the Natural Language Toolkit, known as [NLTK](https://www.nltk.org/install.html). If it is already installed, you will get a `Requirement already satisfied` message.


In [1]:
pip install nltk

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
import nltk

In [None]:
# Download all the packages from nltk
#nltk.download('all')
# Alternatively, choose package to download from the window that pops up.
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


In [None]:
# Imports
import pandas as pd       
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
import re

# Pre-Processing 

When dealing with text data, there are common pre-processing steps that NLTK can help us to do:
- Remove special characters
- Tokenizing
- Lemmatizing/Stemming
- Stop word removal

Let's say we have the following text that we want to be able to identify as spam.

In [None]:
# Define spam text.
spam = 'Hello,\nI saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only. If you would like to receive this donation please reply by contacting me before I am operated on.'

print(spam)

## Tokenizing

NLTK provides functions to split text into individual words or sentences, called 'tokens'.


In [None]:
# tokenize into sentences
sent_tokenize(spam.lower())

In [None]:
# tokenize into words
word_tokenize(spam.lower())


You might notice that `word_tokenize()` treats punctuation and special characters as tokens too. To avoid this, we can instantiate a `RegexpTokenizer` with the regex `\w+`, which keeps only runs of letters, digits and underscores.

For more on Regular Expressions (and to test a regex) see https://regex101.com/

In [None]:
# \w matches any word character (letters, digits and _; for ASCII text, [a-zA-Z0-9_]), + repeats the match one or more times
# So this tokenizer finds all runs of letters, digits or _
tokenizer = RegexpTokenizer(r'\w+') 


In [None]:
# Use the regular expression tokenizer that has been instantiated 
spam_tokens = tokenizer.tokenize(spam.lower())


In [None]:
# What happened with the numeric value?
spam_tokens


In [None]:
# Look for digits in each of the tokens
[(re.findall(r'\d+', i), i) for i in spam_tokens]

In [None]:
# Instantiate tokenizer that keeps digits joined by , or . together (e.g. 8,750,000.00)
tokenizer_1 = RegexpTokenizer(r'\d[\d,.]+|\w+')
tokenizer_1.tokenize(spam.lower())
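
To see why the second pattern keeps the amount in one piece, the two patterns can be compared on a short string. This is a minimal sketch that uses Python's built-in `re.findall` as a stand-in for `RegexpTokenizer` (an assumption made so the example runs without NLTK; with its default settings the tokenizer should return the same matches). The `sample` string is made up for illustration.

```python
import re

# Hypothetical sample text, loosely based on the spam message above.
sample = "the sum of 8,750,000.00 euros"

# \w+ alone splits the amount at every comma and period.
words_only = re.findall(r"\w+", sample)

# Trying the number pattern first keeps digits, commas and periods together.
numbers_kept = re.findall(r"\d[\d,.]+|\w+", sample)

print(words_only)
print(numbers_kept)
```

The alternatives are tried from left to right, so the number pattern has to come before `\w+`; otherwise `\w+` would claim the leading digit first. One side effect to watch for: a number at the end of a sentence, such as `2024.`, keeps its trailing period.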

## Lemmatizing & Stemming

- "He is *running* really fast!"
- "He *ran* the race."
- "He *runs* a five-minute mile."

If we wanted a computer to interpret these sentences, we might count how many times each word appears. The computer will treat words like "running," "ran," and "runs" as different words... but they mean very similar things (in this context)!

**Lemmatizing** and **stemming** are two forms of shortening words so we can combine similar forms of the same word.

When we "**lemmatize**" data, we take words and attempt to return their *lemma*, or the base/dictionary form of a word.

How do we know what the base form is? Fortunately, we don't have to create a dictionary: NLTK uses the publicly available [WordNet](https://wordnet.princeton.edu/).

In [None]:
# Instantiate lemmatizer.
lemmatizer = WordNetLemmatizer()

In [None]:
# Lemmatize tokens.
tokens_lem = [lemmatizer.lemmatize(i) for i in spam_tokens]

In [None]:
# Compare tokens to lemmatized version.
# zip() pairs up the elements of the two lists into tuples
list(zip(spam_tokens, tokens_lem))

In [None]:
# Print only those lemmatized tokens that are different.
for i in range(len(spam_tokens)):
    if spam_tokens[i] != tokens_lem[i]:
        print((spam_tokens[i], tokens_lem[i]))

Lemmatizing is usually the more grammatically precise approach, but it may change only a few of the tokens.

We can also do this on individual words.

In [None]:
# Lemmatize the word "computers."
lemmatizer.lemmatize("computers")

In [None]:
lemmatizer.lemmatize("computer")

In [None]:
lemmatizer.lemmatize("computation")

In [None]:
lemmatizer.lemmatize("computationally")
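
One reason lemmatizing can appear to do little is that `lemmatize()` treats every word as a noun unless a part of speech is passed through its `pos` argument. The sketch below imitates that behaviour with a tiny hand-written lookup table. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names used only for illustration; the real lemmatizer consults WordNet rather than a dictionary like this one.

```python
# Hypothetical lookup table keyed on (word, part of speech).
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("runs", "v"): "run",
    ("runs", "n"): "run",
    ("computers", "n"): "computer",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is unknown."""
    return TOY_LEMMAS.get((word, pos), word)


# Compare the default (noun) lookup with an explicit verb lookup.
for word in ["running", "ran", "runs"]:
    print(word, toy_lemmatize(word), toy_lemmatize(word, pos="v"))
```

With NLTK itself, the corresponding call would be `lemmatizer.lemmatize("ran", pos="v")`.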

When we "**stem**" data, we take words and attempt to return a base form of the word by stripping suffixes. It tends to be cruder than lemmatization, and the result is not always a real word. The algorithm used below was [developed by Porter in 1980](https://www.cs.toronto.edu/~frank/csc2501/Readings/R2_Porter/Porter-1980.pdf).

In [None]:
# Instantiate PorterStemmer.
p_stemmer = PorterStemmer()

In [None]:
# Stem tokens.
stem_spam = [p_stemmer.stem(i) for i in spam_tokens]

In [None]:
# Compare tokens to stemmed version.
list(zip(spam_tokens, stem_spam))

In [None]:
# Print only those stemmed tokens that are different.

[(spam_tokens[i], stem_spam[i]) for i in range(len(spam_tokens)) if spam_tokens[i] != stem_spam[i]]

In [None]:
# Stem the word "computers"
p_stemmer.stem("computers")



In [None]:
# Stem the word "computer."
p_stemmer.stem("computer")

In [None]:

# Stem the word "computation."
p_stemmer.stem("computation")

In [None]:

# Stem the word "computationally."
p_stemmer.stem("computationally")
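
To get a feel for why stemming is cruder than lemmatizing, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies a much longer sequence of rules and conditions; the suffix list and the `toy_stem` name are made up for this sketch.

```python
# Hypothetical suffix list, checked in order (longest relevant suffix first).
SUFFIXES = ("ationally", "ation", "ers", "er", "ing", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["computers", "computation", "running", "ran", "runs"]:
    print(word, "->", toy_stem(word))
```

Blind rules like these collapse related words onto a shared stem without any dictionary, but they also produce non-words such as `runn` and cannot connect irregular forms such as "ran" to "run".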

## Part of Speech Tagging

Another task in NLP is to tag the different parts of speech. This can be done using NLTK's `pos_tag`.

In [None]:

text = tokenizer.tokenize("Taylor Swift's concert The Eras Tour will be held at the Singapore National Stadium in Singapore on 2,3,4 and 7,8,9 March 2024")


pos_tag_list = pos_tag(text)
print(pos_tag_list)

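NLTK has tagged the tokens as various parts of speech:

- NN: noun (singular)
- NNP: proper noun (singular)
- IN: preposition or subordinating conjunction
- CD: cardinal number
- VBN, VBG: verb forms (past participle, and gerund or present participle)
- JJ: adjective
- PRP: personal pronoun
- MD: modal
- CC: coordinating conjunction
- etc.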
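
`pos_tag` returns a list of `(token, tag)` tuples, so the result can be filtered or counted with ordinary Python. The sketch below works on a short hand-written list in that format. The tags were typed in by hand for illustration and are not real tagger output.

```python
from collections import Counter

# Hand-written (token, tag) pairs in the format pos_tag returns; not real output.
tagged = [
    ("Taylor", "NNP"),
    ("will", "MD"),
    ("sing", "VB"),
    ("in", "IN"),
    ("Singapore", "NNP"),
    ("in", "IN"),
    ("March", "NNP"),
    ("2024", "CD"),
]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# Keep only the proper nouns (tag NNP).
proper_nouns = [token for token, tag in tagged if tag == "NNP"]

print(tag_counts)
print(proper_nouns)
```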


## Named Entity Recognition

Another NLP task is named entity recognition, i.e. identifying locations, people and organizations. This is done by passing the part-of-speech-tagged tokens to NLTK's `ne_chunk` function.
`ne_chunk` parses the tagged tokens to identify those that are probably named entities.


In [None]:
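# binary=False labels each entity by type (e.g. PERSON, GPE) rather than with a generic NE tag
namedEnt = ne_chunk(pos_tag_list, binary=False)
# draw() opens the parse tree in a separate window; print(namedEnt) shows it as text instead
namedEnt.draw()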

## Stop Word Removal

The following quote has had stop words (and punctuation) removed:

"Answer great question life universe everything said deep thought said deep thought paused forty two said deep thought infinite majesty calm."

<details><summary>What book is the above sentence from?</summary>

The Hitchhiker's Guide to the Galaxy!
    
![](../images/hgg.jpg)
    
The original quote reads:  
..."The Answer to the Great Question..."  
"Yes..!"  
"Of Life, the Universe and Everything..." said Deep Thought.  
"Yes...!"  
"Is..." said Deep Thought, and paused.  
"Yes...!"  
"Is..."  
"Yes...!!!...?"  
"Forty-two," said Deep Thought, with infinite majesty and calm.
</details>

<details><summary>If you were familiar with the book, how did you know what book the sentence was from?</summary>

Removing stop words did not remove key identifying words such as "life", "universe", "everything", and "forty-two".
</details>

<details><summary>Based on this, how would you define stop words?</summary>

Stop words are very common words that carry little meaning on their own. They mainly add to the grammatical structure and flow of the sentence, so it is still relatively easy to identify the contents of sentences without stop words.
</details>

In [None]:
# Print English stopwords.
print(stopwords.words("english"))

In [None]:
# Remove stopwords from "spam_tokens."
# Build the stop word set once; set membership checks are fast.
stop_words = set(stopwords.words('english'))
no_stop_words = [token for token in spam_tokens if token not in stop_words]

In [None]:
# Check it
print(no_stop_words)
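
The same filtering idea can be tried on the first part of the quote above. This is a minimal sketch with a hand-picked stop word set, so `TOY_STOP_WORDS` and `remove_stop_words` are hypothetical names and NLTK's English list is much longer. It uses plain `re.findall` for tokenizing so that it runs without any NLTK downloads.

```python
import re

# Hypothetical, hand-picked stop words; NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "to", "of", "and", "with", "is"}


def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Lowercase the text, split it into word tokens and drop the stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop_words]


quote = "The Answer to the Great Question of Life, the Universe and Everything"
print(remove_stop_words(quote))
```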

---

# Sentiment Analysis

![](../images/sent.jpeg)

[Sentiment analysis](https://www.kdnuggets.com/2018/08/emotion-sentiment-analysis-practitioners-guide-nlp-5.html) is an area of natural language processing in which we seek to classify text as having positive or negative emotion.

Let's build a simple function that can classify text as either having positive or negative sentiment.

What words tell us whether certain text is positive?

In [None]:
# Let's come up with a list of positive and negative words we might observe. 
# Suggest more!!
positive_words = ['love', 'good', 'great']
negative_words = ['garbage', 'sad', 'bad']

In [None]:
def simple_sentiment(text):
    # Instantiate tokenizer.
    tokenizer = RegexpTokenizer(r'\w+')  
    # Tokenize text.
    tokens = tokenizer.tokenize(text.lower()) 
    # Instantiate stemmer.
    p_stemmer = PorterStemmer()
    # Stem words.
    stemmed_words = [p_stemmer.stem(i) for i in tokens]
    # Stem our positive/negative words.
    positive_stems = [p_stemmer.stem(i) for i in positive_words]
    negative_stems = [p_stemmer.stem(i) for i in negative_words]

    # Count "positive" words.
    positive_count = sum([1 for i in stemmed_words if i in positive_stems])
    # Count "negative" words
    negative_count = sum([1 for i in stemmed_words if i in negative_stems])
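    # Net sentiment score between -1 and 1: (positive - negative) / number of tokens.
    # Guard against empty input to avoid dividing by zero.
    if not tokens:
        return 0.0
    return round((positive_count - negative_count) / len(tokens), 2)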

In [None]:
# Run our sentiment analyzer 
simple_sentiment("I love programming!")

<details><summary> What are some shortcomings of this method? </summary>

- Primarily, we're limited to the positive/negative words we came up with.
- If someone wrote "not good" or "not bad," our sentiment function would score "not good" as positive and "not bad" as negative, the opposite of what was meant!
- The ordering of the words doesn't matter here, which is not how language generally works.
- We haven't corrected for misspellings.
</details>
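
The negation problem can be patched, at least crudely, by flipping the polarity of a word that directly follows a negator. The sketch below is one naive way to do that and is not how VADER handles negation. `NEGATORS` and `negation_aware_sentiment` are made-up names, the word lists are the ones from the cell above, and stemming is left out to keep the example free of NLTK.

```python
import re

POSITIVE = {"love", "good", "great"}
NEGATIVE = {"garbage", "sad", "bad"}
NEGATORS = {"not", "no", "never"}  # hypothetical, far from complete


def negation_aware_sentiment(text):
    """Score text in [-1, 1], flipping a word's polarity after a negator."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    score = 0
    for i, token in enumerate(tokens):
        # +1 for a positive word, -1 for a negative word, 0 otherwise.
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        # Flip the polarity if the previous token is a negator.
        if i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return round(score / len(tokens), 2)


print(negation_aware_sentiment("not good"))
print(negation_aware_sentiment("not bad"))
print(negation_aware_sentiment("I love programming!"))
```

This only looks one word back, so a phrase like "not very good" would still be scored as positive.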

There are a couple of ways to proceed with sentiment analysis:

1. If you have already-labeled data, you can build a supervised learning model.
2. If you don't have labeled data, you can use a Lexicon (dictionary) that has already been built/trained for sentiment analysis.
    - There are a bunch of these and which to use depends on your purpose/data. Here are just a few that are available:
        - AFINN lexicon
        - MPQA subjectivity lexicon
        - SentiWordNet
        - VADER lexicon

We will use the [VADER](https://www.nltk.org/_modules/nltk/sentiment/vader.html) (Valence Aware Dictionary and sEntiment Reasoner) lexicon to analyze the sentiments of our reviews.

VADER has a basic lexicon of positive and negative words, including emoticons. It also handles negation ('didn't like') and takes intensity into account, for example when words are written in capitals.

In [None]:
# Instantiate Sentiment Intensity Analyzer from nltk.sentiment.vader
sent = SentimentIntensityAnalyzer()

VADER's `polarity_scores` takes in a string and returns a dictionary of scores in each of four categories:

- negative
- neutral
- positive
- compound, a single overall score normalized to lie between -1 (most negative) and +1 (most positive).
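
As far as we can tell from the VADER source linked above, the compound score comes from summing the valence of each word (after VADER's rule-based adjustments) and squashing that sum into the range -1 to 1. The sketch below shows only the squashing step. The formula and the constant `alpha=15` are our reading of the source and should be checked against it; the raw sums in the loop are made-up inputs.

```python
import math


def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


# Made-up raw valence sums, from strongly negative to strongly positive.
for raw in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(raw, round(normalize(raw), 4))
```

Large sums flatten out near -1 or +1, which is why a long, strongly worded text does not get an arbitrarily large compound score.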

In [None]:
sent.polarity_scores('This is FANTASTIC')

In [None]:
sent.polarity_scores("I don't like programming at all and I hate Python especially")

Let's try analyzing the sentiment of IMDb movie reviews. The data is from [Kaggle](https://www.kaggle.com/c/word2vec-nlp-tutorial#part-1-for-beginners-bag-of-words).

In [None]:
# Read in training data.
reviews = pd.read_csv("data/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)

In [None]:
reviews.head()

In [None]:
# Examine a review.


In [None]:
# Use the sent.polarity_scores() to find the sentiment of the review.


In [None]:
# Does this match the sentiment given in the training data?


In [None]:
# apply the scores to all the reviews
reviews['scores'] = reviews['review'].apply(sent.polarity_scores)


In [None]:
# Extract only the compound score
reviews['compound']  = reviews['scores'].apply(lambda score_dict: score_dict['compound'])

In [None]:
# Check for the first 10 reviews
reviews.head(10)
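
To compare the compound scores with the labels, one option is to turn each score into a 0/1 prediction and look for disagreements. The sketch below does this on a few made-up `(compound, sentiment)` pairs standing in for the two DataFrame columns. It assumes the label is 1 for a positive review and 0 for a negative one; the cut-off of 0 and the `to_label` name are illustrative choices, not part of VADER.

```python
# Made-up (compound, sentiment) pairs standing in for the two DataFrame columns.
rows = [(0.93, 1), (-0.71, 0), (0.42, 0), (-0.10, 1)]


def to_label(compound, threshold=0.0):
    """Map a compound score to 1 (positive) or 0 (negative)."""
    return 1 if compound >= threshold else 0


# Row positions where the prediction disagrees with the label.
mismatches = [
    i for i, (compound, label) in enumerate(rows) if to_label(compound) != label
]

# Share of rows where prediction and label agree.
agreement = 1 - len(mismatches) / len(rows)

print(mismatches)
print(agreement)
```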

In [None]:
# Have a look at one of the reviews that is different from the labelled sentiment


# What sentiment would YOU assign?