# <font color="maroon"> NLP Toolkits and Preprocessing Techniques </font>
Python libraries for natural language processing
1. Converting text to a meaningful format for analysis
2. Preprocessing and cleaning text

Open-Source Libraries<br> 
1. <font color="red">NLTK<br> </font>
2. <font color="red">TextBlob<br></font>
3. SpaCy<br>
4. GenSim<br>

Cloud-Based NLP Services<br> 
1. IBM Watson<br>
2. Google Cloud Natural Language API
3. Amazon Comprehend
4. Microsoft Azure

## How to Install NLTK?

### Method (i) Command Line
pip install nltk<br>
import nltk<br>
nltk.download()

### Method (ii) Anaconda Navigator (Environment)
![Installation of NLTK library](NLTK.png)

### Method (iii) Download the Package and Place It in the site-packages Directory
Download the NLTK toolkit from https://sourceforge.net/projects/nltk/<br>
![Installation of NLTK library](nltk_package.png)
<br>Place the package in the site-packages directory <br>
(to find the path:<br> import site <br>site.getsitepackages())


In [None]:
import site
site.getsitepackages()

# Method 1

In [None]:
pip install nltk

In [None]:
import nltk
nltk.download()

## Sample Text Data

Consider this sentence:
<br>**Hi Mr. Smith! I am going to buy some vegetables (3 tomatoes and 3 cucumbers) from the
store. Should I pick up some black-eyed peas as well?**

Text data is messy and unstructured. To analyze this data, we need to preprocess the text.


![](https://i.imgur.com/pt5p6Hb.png)

# Code: Tokenization (Words)

In [None]:
from nltk.tokenize import word_tokenize

my_text = '''Hi Mr. Smith! I am going to buy some vegetables (3 tomatoes and 3 cucumbers)
from the store. Should I pick up some black-eyed peas as well?'''

print(word_tokenize(my_text)) # print function requires Python 3

# Code: Tokenization (Sentences)

In [None]:
from nltk.tokenize import sent_tokenize

my_text = '''Hi Mr. Smith! I am going to buy some vegetables (3 tomatoes and 3 cucumbers)
from the store. Should I pick up some black-eyed peas as well?'''

print(sent_tokenize(my_text))

![](https://i.imgur.com/3L6x92C.png)

# Code: Remove Punctuation

In [None]:
import re # Regular expression library
import string

# Option 1: delete every character that is neither a word character nor whitespace
s = re.sub(r'[^\w\s]', '', my_text)
s

# Option 2: replace each punctuation character with a space
# string.punctuation is a string constant in Python's string module that holds
# all ASCII punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
clean_text = re.sub('[%s]' % re.escape(string.punctuation), ' ', my_text)
clean_text

# Code: Make All Text Lowercase

In [None]:
clean_text = s.lower()
clean_text

# Code: Remove Numbers

In [None]:
# Removes all digits (the surrounding text is left untouched)
clean_text = re.sub(r'\d', '', clean_text)  # \d is a regex special sequence that matches any decimal digit, such as 0-9
clean_text

# <font color='blue'>Preprocessing: Stop Words</font>

![](https://i.imgur.com/T5RJXrX.png)

# Code: Stop Words

In [None]:
from nltk.corpus import stopwords
set(stopwords.words('english'))

# Code: Remove Stop Words

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">CountVectorizer</a>

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

my_text = ["Hi Mr. Smith! I’m going to buy some vegetables "
           "(3 tomatoes and 3 cucumbers) from the store. Should I pick up some black-eyed peas as well?"]

# Incorporate stop words when creating the count vectorizer
cv = CountVectorizer(stop_words='english')
X = cv.fit_transform(my_text)  # equivalent to X = CountVectorizer(stop_words='english').fit_transform(my_text)
print(X)
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

# Reference: https://www.geeksforgeeks.org/difference-between-pandas-vs-numpy/
# self learn pandas: https://www.w3schools.com/python/pandas/pandas_intro.asp
# self learn numpy: https://www.w3schools.com/python/numpy/numpy_intro.asp

The process of using CountVectorizer.fit_transform involves the following steps:

(1) Tokenization: The text documents are first tokenized, breaking them into individual words or tokens.

(2) Vocabulary Building (fit): CountVectorizer builds a vocabulary, which is a dictionary mapping each unique word (or token) in the documents to an integer index.

(3) Counting (transform): It then counts the occurrences of each word in each document and stores these counts in a sparse matrix, where rows represent documents and columns represent the vocabulary words. Each element of the matrix is the frequency of the corresponding word in the respective document.
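
The three steps above can be sketched in plain Python. This is only an illustration, not scikit-learn's implementation: the regular expression mirrors CountVectorizer's default token pattern (lowercased runs of two or more word characters), the result is a dense list of lists rather than a sparse matrix, and stop-word removal is left out. The names `docs`, `tokenized`, `vocabulary`, and `matrix` are chosen for this sketch.

```python
import re
from collections import Counter

docs = ["This is the first document.",
        "This is the second document.",
        "And the third one. One is fun."]

# (1) Tokenization: lowercase, then keep runs of two or more word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]

# (2) Vocabulary building (fit): map each unique token to a column index,
#     assigning indices in alphabetical order
unique_tokens = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {token: index for index, token in enumerate(unique_tokens)}

# (3) Counting (transform): one row per document, one column per vocabulary entry
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[token] for token in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row lines up with the vocabulary indices, so the column for "one" holds 2 in the third row because that word appears twice in the third document.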

![](https://i.imgur.com/9qllh8j.png)

# Code: Stemming

In [None]:
from nltk.stem import LancasterStemmer  #more stem library: https://www.nltk.org/api/nltk.stem.html
stemmer = LancasterStemmer()

# Try some stems
print('drive:{}'.format(stemmer.stem('drive')))
print('drives:{}'.format(stemmer.stem('drives')))
print('driver:{}'.format(stemmer.stem('driver')))
print('drivers:{}'.format(stemmer.stem('drivers')))
print('driven:{}'.format(stemmer.stem('driven')))

# Code: Lemmatization

In [None]:
from nltk.stem import WordNetLemmatizer  # Reference: https://www.nltk.org/api/nltk.stem.wordnet.html
from nltk.tokenize import word_tokenize
lemmatizer=WordNetLemmatizer()

input_str = "been had done languages cities mice"
input_str = word_tokenize(input_str)
for word in input_str:
    print(lemmatizer.lemmatize(word))

![](https://i.imgur.com/8edVsCR.png)

# Code: Parts of Speech Tagging

In [None]:
from nltk.tag import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

my_text = "James Smith lives in the United States."
my_text2 = "James Smith is having a live band in the United States."
tokens = pos_tag(word_tokenize(my_text))
tokens2 = pos_tag(word_tokenize(my_text2))
print("Sentence 1:",tokens)
print("Sentence 2:",tokens2)

#Reference:https://pythonspot.com/nltk-speech-tagging/

![POS](nltk-speech-codes.png)

## Named Entity Recognition

In [None]:
nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk.chunk import ne_chunk  #Spacy reference: https://medium.com/geekculture/named-entity-recognition-ner-part-ii-implementation-with-open-source-packages-2713c4c4a8c5

my_text = "James Smith lives in the United States."
tokens = pos_tag(word_tokenize(my_text)) # this labels each word as a part of speech
entities = ne_chunk(tokens) # this extracts entities from the list of words
entities.draw()

# <font color="blue"> Preprocessing: Compound Term Extraction </font>

![](https://i.imgur.com/q1WuWai.png)

# Code: Compound Term Extraction

In [None]:
from nltk.tokenize import MWETokenizer #https://www.nltk.org/api/nltk.tokenize.mwe.html

my_text = "You all are the greatest students of all time."

mwe_tokenizer = MWETokenizer([('You','all'), ('of', 'all', 'time')])
mwe_tokens = mwe_tokenizer.tokenize(word_tokenize(my_text))

mwe_tokens

# New York City, take into account, make use of, high probability, kick the bucket

# Lambda Function

In [None]:
# Basic example, https://www.w3schools.com/python/python_lambda.asp
square_me=lambda x: x*x

my_numbers=[9, 3, 4, 100, 2, 1]
my_numbers_squared = list(map(square_me, my_numbers)) #map = applies a function to all the items in an input_list
                                                      #map(function, iterable)
print(my_numbers_squared)

# <font color=red>Preprocessing Exercise </font>



# Introduction

We will be using review data from Kaggle to practice preprocessing text data. The dataset contains user reviews for many products, but today we'll be focusing on the product in the dataset that had the most reviews - an oatmeal cookie.

The following code will help you load the data. If this is your first time using nltk, you'll need to pip install it first.


In [None]:
import nltk
# nltk.download() <-- Run this if it's your first time using nltk to download all of the datasets and models
import pandas as pd

In [None]:
df = pd.read_csv('cookie_reviews.csv')
df.head()

**Question 1:**

Determine how many reviews there are in total.
   

**Question 2:**
    
Determine the percentage of 1, 2, 3, 4 and 5 star reviews.

**Question 3:**

(a) Remove stop words

1. df['reviews'] refers to the 'reviews' column in your DataFrame df.
2. .apply(lambda x: ...) applies a function (here, the lambda function) to each review in that column.
3. lambda x: ' '.join([word for word in x.split() if word not in (stop)]) is a lambda function that:
   <br>a. Splits each review x into a list of words (x.split()).
   <br>b. Iterates through each word in this list (for word in x.split()).
   <br>c. Checks whether each word is in stop, the collection of stop words.
   <br>d. Keeps only the words that are not stop words ([word for word in x.split() if word not in (stop)]).
   <br>e. Joins these words back into a single string with spaces separating them (' '.join(...)).
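
To see the lambda in action without loading the dataset, here is a minimal sketch that applies it to a plain Python list. The `stop` set and the two reviews are made up for illustration; in the exercise, `stop` would more likely come from `stopwords.words('english')` and the lambda would be passed to `df['reviews'].apply(...)`.

```python
# Hypothetical stand-ins: a tiny stop-word set and two made-up reviews
stop = {"i", "the", "a", "and", "is", "are", "this", "these", "my"}
reviews = ["these cookies are the best",
           "i love this cookie and my kids love it"]

# The same lambda described above: keep only the words that are not in stop
remove_stop_words = lambda x: ' '.join([word for word in x.split() if word not in (stop)])

cleaned = [remove_stop_words(review) for review in reviews]
print(cleaned)
```

The membership test is case-sensitive and NLTK's English stop-word list is lowercase, so "The" would survive unless the text is lowercased first.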

(b) Change to lower case

(c) Perform stemming

1. Constructs a new list (documents) by iterating over each document (i) in the list l_case.
2. For each document i in l_case, the inner list comprehension splits i into words using i.split(" ").
3. It then applies stemming to each word using sno.stem(word), where sno is a stemmer object that provides a stem method.
4. The outer comprehension gathers these lists of stemmed words (one list per document) and constructs a new list (documents) where each element corresponds to a document from l_case, but with each word stemmed.
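
The nested comprehension described in these steps might look like the sketch below. `ToyStemmer` is a hypothetical stand-in that only strips a few suffixes, used here so the example runs without any downloads; in the exercise, `sno` would be a real stemmer such as one from `nltk.stem`. The contents of `l_case` are made up.

```python
class ToyStemmer:
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""

    def stem(self, word):
        for suffix in ("ing", "ed", "s"):
            # Only strip when at least three characters would remain
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word


sno = ToyStemmer()
l_case = ["these cookies tasted amazing", "my kids loved them"]  # made-up lowercase reviews

# Outer comprehension: one list per document; inner comprehension: one stem per word
documents = [[sno.stem(word) for word in i.split(" ")] for i in l_case]
print(documents)
```

The result is a list of lists, so each review keeps its own list of stems instead of being merged into a single flat list.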

# TextBlob

### Another toolkit other than NLTK

- Wraps around NLTK and makes it easier to use

# TextBlob Demo: Tokenization

In [None]:
#pip install textblob  #Install the library before importing

In [None]:
from textblob import TextBlob

my_text = TextBlob("We're moving from NLTK to TextBlob. How fun!")
my_text.words

# TextBlob Demo: Spell Check

In [None]:
blob = TextBlob("I'm graat at speling.")
print(blob.correct()) # print function requires Python 3

<font color="blue">
## How does the correct function work?  <br>
    
- Calculates the Levenshtein distance between the word ‘graat’ and all words in its word list <br>
- Of the words with the smallest Levenshtein distance, it outputs the most popular word <br></font>
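
The two bullet points can be turned into a small sketch. This follows the description above and is not TextBlob's actual implementation; the word list and its popularity counts in `word_counts` are invented for illustration, and `levenshtein` and `correct_word` are names chosen for this sketch.

```python
def levenshtein(a, b):
    """Edit distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical word list with made-up popularity counts
word_counts = {"great": 500, "grant": 120, "groat": 2, "spelling": 300, "spewing": 5}


def correct_word(word):
    if word in word_counts:  # already a known word
        return word
    # Step 1: distance from the input word to every word in the list
    distances = {candidate: levenshtein(word, candidate) for candidate in word_counts}
    # Step 2: among the closest candidates, pick the most popular one
    smallest = min(distances.values())
    closest = [candidate for candidate, d in distances.items() if d == smallest]
    return max(closest, key=word_counts.get)


print(correct_word("graat"))
print(correct_word("speling"))
```

In this toy list, "great", "grant", and "groat" are all one edit away from "graat", so the popularity count decides the outcome.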

# TextBlob Demo: Tagging

In [None]:
blob = TextBlob("John hits the ball.")
for words, tag in blob.tags:
    print (words, tag)

# TextBlob Demo: Language Detection and Translation

In [None]:
from textblob import TextBlob

text = "This is a sample text in English."
blob = TextBlob(text)

In [None]:
!pip install langdetect  #install the library before importing

In [None]:
from langdetect import detect

text = "This is a sample text in English."
language = detect(text)

print("Detected Language:", language)

In [None]:
!pip install googletrans==4.0.0-rc1  #install the library before importing

In [None]:
from langdetect import detect
from googletrans import Translator

text = "This is a sample text in English."

# Detect the language
detected_lang = detect(text)

# Translate to French
translator = Translator()
translated_text = translator.translate(text, src=detected_lang, dest='fr').text

print("Detected Language:", detected_lang)
print("Translated Text (to French):", translated_text)

# Exercise

In [None]:
# Write a Python function using TextBlob to tokenize a given sentence and count the number of tokens.



In [None]:
# Write a Python function using TextBlob to perform Parts of Speech (POS) tagging on a given sentence.



In [None]:
# Write a Python function using TextBlob to perform spell checking on a given text and suggest corrections.


In [None]:
# Write a Python function using langdetect and googletrans to translate a given text from English to Chinese.


# <font color="maroon"> Some other functions in NLP: Text Similarity Measures </font>

- Measure the distance between two strings

Applications
- Information retrieval
- Text classification
- Document clustering
- Topic Modeling
- Matrix decomposition

To measure the word similarity, we use **<font color="blue"><a href="https://pypi.org/project/python-Levenshtein/" target="_blank">Levenshtein distance</a></font>**.
- Minimum number of operations to get from one word to another.

![](https://i.imgur.com/FkdJmPi.png)

In [None]:
pip install python-Levenshtein  #install before importing

In [None]:
from Levenshtein import distance as lev
lev('party', 'park')

In [None]:
#concept behind lev('party', 'park')
def levenshtein_distance(s1, s2):
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        dp[i][0] = i

    for j in range(n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)

    return dp[m][n]

# Example usage
string1 = "party"
string2 = "park"
distance = levenshtein_distance(string1, string2)
print("Levenshtein distance:", distance)

## Let's use the Levenshtein distance to measure the similarity between two sentences:
<br>sentence1 = "The quick brown fox jumps over the lazy dog."
<br>sentence2 = "A quick brown fox jumps over a lazy dog."

In [None]:
from Levenshtein import distance as lev

sentence1 = "The quick brown fox jumps over the lazy dog."
sentence2 = "A quick brown fox jumps over a lazy dog."

words1 = sentence1.lower().split()
words2 = sentence2.lower().split()

distance = lev(words1, words2)  # word-level edit distance: each word in the lists counts as one symbol

# Calculate similarity (adjust based on your specific needs)
max_length = max(len(words1), len(words2))
# print (max_length)
similarity = 1 - (distance / max_length)

print("Levenshtein distance between sentence 1 and sentence 2:", distance)
print("Similarity between sentence 1 and sentence 2:", similarity)

# However, it's important to note that Levenshtein distance is typically used for comparing sequences of characters, not entire sentences or phrases.
# To measure similarity between sentences where the words are not necessarily in the same sequence, 
# you need to consider methods that can account for semantic similarity rather than just sequence-based similarity like Levenshtein distance.
# Here are a few approaches you can explore: TF-IDF/Word Embeddings (pretrained model like Word2Vec, GloVe, or FastText) and Similarity Metrics (Cosine Similarity)

# Text Format for Analysis: Count Vectorizer

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['This is the first document.', 'This is the second document.', 'And the third one. One is fun.']  # corpus = a collection of texts
cv = CountVectorizer()
X = cv.fit_transform(corpus)
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())  # get_feature_names() was removed in newer scikit-learn releases

![](https://i.imgur.com/OQDeQlb.png)

# Document Similarity: Example

![](https://i.imgur.com/PyirXsy.png)

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['The weather is hot under the sun',
'I make my hot chocolate with milk',
'One hot encoding',
'I will have a chai latte with milk',
'There is a hot sale today']
# create the document-term matrix with count vectorizer
cv = CountVectorizer(stop_words="english")
X = cv.fit_transform(corpus).toarray()
dt = pd.DataFrame(X, columns=cv.get_feature_names_out())
dt

# Document Similarity: Example

In [None]:
# calculate the cosine similarity between all combinations of documents
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity

# list all of the combinations of 5 take 2 as well as the pairs of phrases
pairs = list(combinations(range(len(corpus)),2)) #sentence (0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), .., (3,4))
print(pairs)
combos = [(corpus[a_index], corpus[b_index]) for (a_index, b_index) in pairs]
print (combos)

# calculate the cosine similarity for all pairs of phrases and sort by most similar
# cosine_similarity returns a 1x1 array here, so [0][0] extracts the score as a plain number
results = [cosine_similarity([X[a_index]], [X[b_index]])[0][0] for (a_index, b_index) in pairs]
sorted(zip(results, combos), reverse=True)
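
To make the score itself less of a black box, here is a minimal hand-rolled version of cosine similarity: the dot product of two vectors divided by the product of their lengths. The two count vectors are simplified, hypothetical examples and are not the exact rows produced by the vectorizer above; `cosine` is a name chosen for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical count vectors over the vocabulary [chocolate, hot, milk, sun, weather]
doc_a = [0, 1, 0, 1, 1]  # contains: hot, sun, weather
doc_b = [1, 1, 1, 0, 0]  # contains: chocolate, hot, milk

print(cosine(doc_a, doc_b))
print(cosine(doc_a, doc_a))
```

The two documents share only "hot", so their similarity is low, while a document compared with itself scores 1 (up to floating-point rounding).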

In [None]:
pairs = list(combinations(range(5),2))
pairs

![](https://i.imgur.com/jrfN6Jj.png)

![](https://i.imgur.com/BI8XP92.png)

![](https://i.imgur.com/3IbfQXT.png)

![](https://i.imgur.com/pnNqzql.png)

In [None]:
import pandas as pd
corpus = ['This is the first document.',
         'This is the second document.',
         'And the third one. One is fun.']
# original Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
pd.DataFrame(X, columns=cv.get_feature_names_out())

In [None]:
# new TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
cv_tfidf = TfidfVectorizer()
X_tfidf = cv_tfidf.fit_transform(corpus).toarray()
pd.DataFrame(X_tfidf, columns=cv_tfidf.get_feature_names_out())
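
For intuition about where the TF-IDF numbers come from, the sketch below recomputes them in plain Python. It assumes TfidfVectorizer's default settings (raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row) and the default token pattern; it illustrates the weighting and is not scikit-learn's code. The names `doc_freq`, `idf`, and `rows` are chosen for this sketch.

```python
import math
import re
from collections import Counter

corpus = ['This is the first document.',
          'This is the second document.',
          'And the third one. One is fun.']

tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocab = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(corpus)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

# Smoothed inverse document frequency (assumed default behaviour)
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    weights = [counts[term] * idf[term] for term in vocab]   # tf * idf
    norm = math.sqrt(sum(w * w for w in weights))             # L2 length of the row
    rows.append([w / norm for w in weights])

for term, weight in zip(vocab, rows[0]):
    print(f"{term:10s}{weight:.3f}")
```

Words that appear in every document, such as "the" and "is", get the lowest IDF, so rarer words like "first" end up with a larger weight in the first row.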

![](https://i.imgur.com/xlJibKw.png)

## Document Similarity: Example with TF-IDF

In [None]:
corpus = ['The weather is hot under the sun',
'I make my hot chocolate with milk',
'One hot encoding',
'I will have a chai latte with milk',
'There is a hot sale today']

from sklearn.feature_extraction.text import TfidfVectorizer
# create the document-term matrix with TF-IDF vectorizer
cv_tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = cv_tfidf.fit_transform(corpus).toarray()
dt_tfidf = pd.DataFrame(X_tfidf, columns=cv_tfidf.get_feature_names_out())
dt_tfidf

In [None]:
# calculate the cosine similarity for all pairs of phrases and sort by most similar
# cosine_similarity returns a 1x1 array here, so [0][0] extracts the score as a plain number
results_tfidf = [cosine_similarity([X_tfidf[a_index]], [X_tfidf[b_index]])[0][0] for (a_index, b_index) in pairs]
sorted(zip(results_tfidf, combos), reverse=True)

![](https://i.imgur.com/mj4J60v.png)