### SRE - May 8, 2024
#### Introduction to Natural Language Processing 

In [1]:
#import packages
#standard packages
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# bag of words
from sklearn.feature_extraction.text import CountVectorizer
# stemming and lemmatizing
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

from nltk.tokenize import word_tokenize #makes tokens
from nltk.stem import PorterStemmer #word stemming
from nltk.stem import WordNetLemmatizer #lemmatizer
from nltk.corpus import stopwords #remove stopwords 

import re ##regular expressions package that allows us to remove punctuation and change capitalization (among other things)
import string ## package that deals with string operations

from textblob import TextBlob # spell correcting plus others (e.g., sentiment)
print("packages imported")

packages imported


[nltk_data] Downloading package wordnet to /home/jupyter/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/jupyter/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### Datasets

Data set 1: [kaggle link](https://www.kaggle.com/datasets/shivanandmn/multilabel-classification-dataset?select=test.csv)
- classifies academic papers into four topics: computer science, physics, mathematics, and statistics.  Papers can belong to more than one topic.

Data set 2: [kaggle link](https://www.kaggle.com/datasets/vaibhavsatpathy/text-classification)
- classifies survey responses into achievement, affection, bonding, enjoy the moment,
exercise, leisure, and nature.  Responses can only belong to one "type" of experience. 

Data set 3: [kaggle link](https://www.kaggle.com/code/sohaelshafey/multi-label-classification-tweets-preprocessing)
- classifies tweets into joy, sadness, anger, anticipation and fear.  Tweets can belong to more than one emotion. 

#### What is Natural Language Processing?

- A method to allow computers to "understand" text
- Input: text/speech corpora (in other words a natural language data set)
- Output: a vectorized form of the text that a computer can manipulate
- The transformation from input to output can be done in a variety of ways: e.g., using specific rules or a probabilistic approach.
- Some of the most common approaches will be discussed below. 

#### Models

##### 1. Bag of Words

- The idea is to build a large "bag" that contains *all* of the **unique** words appearing in the text of interest.

For example, let's say we have the sentences:
- I like math
- Math is fun
- Calculus is not easy
  
The bag of words we would need to construct would contain the words:
- i
- like
- math
- is
- fun
- calculus
- not
- easy

The words in the bag of words then become the columns of a matrix.  We can then express each sentence as a vector where a 1 indicates the presence of a word and a 0 indicates that the word does not appear in that sentence.

|Sentence| i | like| math | is | fun | calculus | not | easy|
|--------|---|-----|------|----|-----|----------|-----|-----|
|I like math| 1 | 1 | 1| 0 | 0 | 0 | 0 | 0|
|Math is fun| 0 | 0 | 1 | 1| 1| 0| 0| 0|
|Calculus is not easy|0|0|0|1|0|1|1|1|
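
As a rough sketch, the table above can be built by hand in plain Python. Lowercasing and splitting on whitespace are simplifying assumptions made for this illustration, not a description of what any particular package does.

```python
# Build a bag of words by hand for the three example sentences.
sentences = ["I like math", "Math is fun", "Calculus is not easy"]

# Lowercase and split on whitespace (a simplifying assumption for this sketch).
tokenized = [sentence.lower().split() for sentence in sentences]

# Collect the unique words in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One row per sentence: 1 if the word appears in the sentence, 0 otherwise.
vectors = [[1 if word in tokens else 0 for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in vectors:
    print(row)
```

Each printed row corresponds to one row of the table, with the columns in the same order as the bag of words.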

In [2]:
example_text = ["I like math","Math is not easy","Calculus is fun"]

In [3]:
# use bag of words to turn text into a vector
vectorizer = CountVectorizer()
text_vectors = vectorizer.fit_transform(example_text)

print(vectorizer.get_feature_names_out())
print(text_vectors.toarray())

['calculus' 'easy' 'fun' 'is' 'like' 'math' 'not']
[[0 0 0 0 1 1 0]
 [0 1 0 1 0 1 1]
 [1 0 1 1 0 0 0]]


The documentation for CountVectorizer is [here.](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

Notice how "I" is gone! Why does this happen? 

We'll come back to this later!

Note: CountVectorizer uses a basic count, so as the *frequency* of the word increases in the sentence, we will see larger numbers in the row.

In [4]:
example_text_1 = ["I like math","Math is not easy math","Calculus is fun"]
# use bag of words to turn text into a vector
vectorizer = CountVectorizer()
text_vectors = vectorizer.fit_transform(example_text_1)

print(vectorizer.get_feature_names_out())
print(text_vectors.toarray())

# we can add binary=True to record presence (0/1) instead of counts
vectorizer = CountVectorizer(binary=True)
text_vectors = vectorizer.fit_transform(example_text_1)

print(vectorizer.get_feature_names_out())
print(text_vectors.toarray())

['calculus' 'easy' 'fun' 'is' 'like' 'math' 'not']
[[0 0 0 0 1 1 0]
 [0 1 0 1 0 2 1]
 [1 0 1 1 0 0 0]]
['calculus' 'easy' 'fun' 'is' 'like' 'math' 'not']
[[0 0 0 0 1 1 0]
 [0 1 0 1 0 1 1]
 [1 0 1 1 0 0 0]]


- Bag of words generally does well with classification when it has *enough* data to form the "dictionary".
- By dictionary here, we mean all of the "columns" of words needed.
- This means that the matrix for the text can be **large**, which may make it difficult to process.
- One way to fix this is to keep only the most frequently used words, e.g., the top 2000.

##### 2. Bag of Words using bigrams, trigrams, and ngrams

- The previous bag of words model used unigrams: i.e., one word "tokens"
- Another approach is to use multiple words coupled together.
    - two word phrases are called bigrams
    - three word phrases are called trigrams
    - can generalize this to ngrams, where $n$ denotes the length of the coupled words

In the case of bigrams and our previous example:

|Sentence| i like | like math| math is | is fun | calculus is| is not | not easy|
|--------|--------|----------|---------|--------|------------|--------|---------|
|I like math| 1 | 1 | 0| 0 | 0 | 0 | 0 |
|Math is fun| 0 | 0 | 1 | 1| 0| 0| 0|
|Calculus is not easy|0|0|0|0|1|1|1|
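
The bigram columns in the table can be generated by sliding a window over each sentence. The sketch below uses a hypothetical helper, `make_ngrams`, written for this illustration, and again assumes simple lowercasing and whitespace splitting.

```python
def make_ngrams(tokens, n):
    """Return the n-word phrases found in a list of tokens (hypothetical helper)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentences = ["I like math", "Math is fun", "Calculus is not easy"]

# Lowercase and split on whitespace (a simplifying assumption for this sketch).
bigrams_per_sentence = [make_ngrams(s.lower().split(), 2) for s in sentences]

for sentence, bigrams in zip(sentences, bigrams_per_sentence):
    print(sentence, "->", bigrams)
```

Changing the second argument to 3 would give trigrams; a sentence with fewer than $n$ words produces an empty list.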

In [5]:
# use bag of words with bigrams
#ngram_range=(min_n,max_n)
vectorizer = CountVectorizer(binary = True, ngram_range=(2,2))
text_vectors = vectorizer.fit_transform(example_text)

print(vectorizer.get_feature_names_out())
print(text_vectors.toarray())

['calculus is' 'is fun' 'is not' 'like math' 'math is' 'not easy']
[[0 0 0 1 0 0]
 [0 0 1 0 1 1]
 [1 1 0 0 0 0]]


In [6]:
#sometimes will also see the inclusion of both unigrams and bigrams
vectorizer = CountVectorizer(binary = True, ngram_range=(1,2))
text_vectors = vectorizer.fit_transform(example_text)

print(vectorizer.get_feature_names_out())
print(text_vectors.toarray())

['calculus' 'calculus is' 'easy' 'fun' 'is' 'is fun' 'is not' 'like'
 'like math' 'math' 'math is' 'not' 'not easy']
[[0 0 0 0 0 0 0 1 1 1 0 0 0]
 [0 0 1 0 1 0 1 0 0 1 1 1 1]
 [1 1 0 1 1 1 0 0 0 0 0 0 0]]


- Using multi-word phrases can be better for capturing positive or negative sentiment in text. E.g., with unigrams, "math is not great" does not couple "not" and "great" together, which may cause issues down the road when using a classifier to predict. Using bigrams to get the token "not great" may be beneficial.
- Again, we may want to restrict the dictionary to the most frequently used tokens (e.g., the top 1000 or 2000) to reduce its size for computational purposes.
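
To see the "not great" point concretely, the sketch below asks CountVectorizer for its analyzer (the function it uses to turn text into tokens) with and without bigrams. The variable names are chosen for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function CountVectorizer uses to turn text into tokens.
unigram_analyzer = CountVectorizer(ngram_range=(1, 1)).build_analyzer()
bigram_analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()

phrase = "math is not great"
unigram_tokens = unigram_analyzer(phrase)
bigram_tokens = bigram_analyzer(phrase)

print(unigram_tokens)  # "not" and "great" only appear as separate tokens
print(bigram_tokens)   # also contains the coupled token "not great"
```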

Pros: Easy to implement.

Cons: Models may not necessarily "know" how to classify text when there are words that are used in the text but are not defined in the dictionary.  E.g., "Linear algebra is hard" would be hard to classify with a dictionary created from the example phrases as linear, algebra, and hard do not appear in the initial phrases.

##### 3. Augment NLP techniques with other techniques

##### i. Stemming and Lemmatization

- method to "normalize" text.
- breaks words down into their base words (i.e., their stems)
    - examples:
        - running -> run
        - runs -> run
        - studying, studies, study -> studi
- Stemming uses an algorithm to break down words, so in some cases the "base" word may not be an English word (e.g., studi).
- Lemmatization instead uses a dictionary, which ensures that the output is an English word. So it would output "study" for the last example.

This can help with generating the bag of words: by putting words into their stemmed/lemmatized form, we do not need every inflected form of a word to appear in the dictionary in order to generate a vector.  And if a new phrase appears, we have a better chance of being able to classify it.

In [7]:
stemmer = PorterStemmer()

words = word_tokenize("Studying math is hard!") #split the sentence into a list of tokens

print(words)

stemmed_words = []
for word in words:
    stemmed_words.append(stemmer.stem(word)) #stem the words

' '.join(stemmed_words) ## join the stemmed words with a space so it "looks" like a sentence


['Studying', 'math', 'is', 'hard', '!']


'studi math is hard !'

The documentation on the PorterStemmer is [here.](https://www.nltk.org/howto/stem.html)  There are also a few other stemmers in the NLTK package in python that are listed in the documentation. 

There is a nice technical discussion of how the PorterStemmer works [here.](http://people.scs.carleton.ca/~armyunis/projects/KAPI/porter.pdf)

Notice how the punctuation in the phrase also makes it into the word tokens and the stemmed words.  While punctuation is necessary for language to make sense, it is often not needed for computational purposes.

##### ii. Remove punctuation
- Punctuation can make the dictionary vector unnecessarily complex.
- Often we want to remove the punctuation.
- Some Lemmatizers will do this automatically

In [8]:
lemmatizer = WordNetLemmatizer()

words = word_tokenize("Studying math is hard!") #split the sentence into a list of tokens

print(words)

lemmatized_words = []
lemmatized_words_1 = []
for word in words:
    lemmatized_words.append(lemmatizer.lemmatize(word)) #lemmatize the words
    ## if we want to specify a part of speech, can use pos for this.
    lemmatized_words_1.append(lemmatizer.lemmatize(word,pos='v'))
                              
print(' '.join(lemmatized_words)) ## join the lemmatized words with a space so it "looks" like a sentence
print(' '.join(lemmatized_words_1)) ## same, but for the words lemmatized as verbs

['Studying', 'math', 'is', 'hard', '!']
Studying math is hard !
Studying math be hard !


The documentation on WordNetLemmatizer is [here.](https://www.nltk.org/api/nltk.stem.WordNetLemmatizer.html?highlight=wordnet)

Sometimes we may need to specify the part of speech to help create the dictionary.

##### iii. Remove stopwords

These are words that are necessary in English to form sentences, but do not contribute to the overall meaning of the phrase. Examples of stop words are:
- the, then, there, these
- it, if, is, that, this, are
- I, he, she, they, their, them
- a, an, and

It is often beneficial to remove these words, as they may end up "bogging" down the algorithm. Removing them also helps to reduce the dimensionality of the problem.

In [9]:
stop_words = stopwords.words('english')

words = word_tokenize("Studying math is hard!") #split the sentence into a list of tokens

print(words)

remove_stop_words = []
for word in words:
    if word.lower() not in stop_words: #the NLTK stop words are lowercase, so compare in lowercase
        remove_stop_words.append(word) #remove the stop words

                              
print(' '.join(remove_stop_words)) ## join the remaining words with a space so it "looks" like a sentence

['Studying', 'math', 'is', 'hard', '!']
Studying math hard !


##### iv. Removing punctuation

We often do not want the punctuation to be included in the bag of words/dictionary that we are building.  Some built in packages will automatically do this, but it can be helpful to remove the punctuation in our data cleaning process.

This can be done using the `re.sub` function in the re package (documentation [here.](https://docs.python.org/3/library/re.html)), but can also be done using the string method `.translate` together with `string.punctuation` from the string package (documentation [here.](https://docs.python.org/3/library/string.html)).

In [10]:
sentence = "Studying math is hard!"
### using re package
remove_punctuation = re.sub(r'[^\w\s]','',sentence)
#[^\w\s] matches any character that is NOT a word character or white space
#\w is for word characters and includes alphanumeric characters and underscore
#\s is for white space characters
#the r prefix makes this a raw string, so the backslashes are passed to re unchanged
#the '' in the second input means to replace each match with an empty string
#the third input is the text we want to do this for
#so this is essentially saying to find all the punctuation in the string
#and delete it (nothing is inserted in its place)

##using string package 
remove_punctuation_1 = sentence.translate(str.maketrans('', '', string.punctuation))

print(remove_punctuation)
print(remove_punctuation_1)

Studying math is hard
Studying math is hard


Both will work, though the re package is often the more common choice.
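
The two approaches do not always give identical results, because `\w` keeps underscores and the regex removes any non-word character, while `string.punctuation` only lists ASCII punctuation (including the underscore). A small sketch with a made-up test string:

```python
import re
import string

# \u2019 is a curly apostrophe, which is not part of string.punctuation.
text = "snake_case isn\u2019t hard!"

# Regex approach: keeps word characters (letters, digits, underscore) and white space.
with_re = re.sub(r"[^\w\s]", "", text)

# string approach: deletes exactly the ASCII characters listed in string.punctuation.
with_string = text.translate(str.maketrans("", "", string.punctuation))

print(with_re)
print(with_string)
```

The regex version keeps the underscore but drops the curly apostrophe; the `string.punctuation` version does the opposite. Which behavior is preferable depends on the data set.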

##### v. Converting to lower case

It can also be helpful to remove capitalization.  Sometimes the packages used will treat "calculus" and "Calculus" as different "words" due to the capitalization.  Some packages will automatically convert the text to all lower case, but it is often helpful to do this manually during the data cleaning process, just to ensure that the capitalization is removed.

This is done using `lower()`.

In [11]:
lowercase = sentence.lower()
print(lowercase)

studying math is hard!


##### vi. Spell correcting.

It can sometimes be helpful to correct typos in text.  TextBlob can help with this -- full documentation [here](https://textblob.readthedocs.io/en/dev/) and more specific documentation [here.](https://textblob.readthedocs.io/en/dev/quickstart.html#part-of-speech-tagging)

Note: this is not typically used, as we want to ensure that we do not change misspelled words into something that is incorrect.  It is often better either to flag potential misspellings so that the user can correct them, or to use bag of words with a token frequency cutoff.

In [12]:
!python -m textblob.download_corpora #download some necessary packages

[nltk_data] Downloading package brown to /home/jupyter/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /home/jupyter/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jupyter/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.


In [13]:
incorrect_phrase = TextBlob("Spell corercting can be heplful")

correction = incorrect_phrase.correct()

parts_of_speech = incorrect_phrase.tags

sentiment = incorrect_phrase.sentiment

print(correction)
print(parts_of_speech)
print(sentiment)

Spell correcting can be helpful
[('Spell', 'NNP'), ('corercting', 'NN'), ('can', 'MD'), ('be', 'VB'), ('heplful', 'JJ')]
Sentiment(polarity=0.0, subjectivity=0.0)


Polarity here is a measure of how negative/positive the sentence is, with -1 being the most negative and +1 being the most positive.

Subjectivity has a range of 0 to 1, with 0 being very objective and 1 being very subjective.

https://www.youtube.com/watch?v=M7SWr5xObkA