# Practicing - Natural Language Processing Toolkit for Python

* Author: Cleiber Garcia

* Purpose: Develop competence in using Natural Language Processing with Python
* This Notebook was produced as part of my studies of the course 'Python for Data Science and Machine Learning Bootcamp', taught by Mr Jose Portilla, Head of Data Science at Pierian Training. The course is offered at Udemy (https://www.udemy.com/course/python-for-data-science-and-machine-learning-bootcamp/learn/lecture/5784218?start=15#overview)

* Although this notebook is almost 100% similar to the one written by Jose Portilla for this course, I assure you that I wrote it line by line. I also took the liberty of making some changes to clarify some examples or to make the code more readable, where I judged it appropriate.

* For more information, please contact me at cleiber.garcia@gmail.com

# Background

"Natural Language Processing [1] lies at the heart of countless applications we use every day, from voice assistants to spam filters and machine translation. It allows machines to understand, interpret, and generate human language, bridging the gap between humans and computers. Within the vast landscape of NLP tools and techniques, the Natural Language Toolkit (NLTK) has emerged as one of the most widely used and powerful libraries for Python."

"NLTK provides a wide range of functionalities and resources for text processing, tokenization, part-of-speech tagging, named entity recognition, sentiment analysis, and much more. It also includes various corpora, lexical resources, and pre-trained models to help us get started quickly." [2]

[1] Natural Language Processing. Located at Wikipedia: https://en.wikipedia.org/wiki/Natural_language_processing, accessed on July 3rd, 2023.

[2] Guide to NLTK - Natural Language Toolkit for Python. Located at https://pieriantraining.com/guide-to-nltk-natural-language-toolkit-for-python/?utm_source=email-sendgrid&utm_medium=903744&utm_campaign=2023-06-30&utm_term=9685726&utm_content=educational, accessed on July 3rd, 2023

# 1. Getting Started with NLTK

## 1.1 Installing, Importing and Downloading NLTK Resources

In [2]:
!pip install nltk



In [3]:
import nltk

In [4]:
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers\averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping grammars\basque_grammars.zip.
[nlt

[nltk_data]    |   Unzipping corpora\names.zip.
[nltk_data]    | Downloading package nombank.1.0 to
[nltk_data]    |     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]    | Downloading package nonbreaking_prefixes to
[nltk_data]    |     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\nonbreaking_prefixes.zip.
[nltk_data]    | Downloading package nps_chat to
[nltk_data]    |     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\nps_chat.zip.
[nltk_data]    | Downloading package omw to
[nltk_data]    |     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]    | Downloading package omw-1.4 to
[nltk_data]    |     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]    | Downloading package opinion_lexicon to
[nltk_data]    |     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\opinion_lexicon.zip.
[nltk_data]    | Downloading package panlex_swadesh to
[nltk_data]  

[nltk_data]    | Downloading package verbnet to
[nltk_data]    |     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\verbnet.zip.
[nltk_data]    | Downloading package verbnet3 to
[nltk_data]    |     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\verbnet3.zip.
[nltk_data]    | Downloading package webtext to
[nltk_data]    |     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\webtext.zip.
[nltk_data]    | Downloading package wmt15_eval to
[nltk_data]    |     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping models\wmt15_eval.zip.
[nltk_data]    | Downloading package word2vec_sample to
[nltk_data]    |     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping models\word2vec_sample.zip.
[nltk_data]    | Downloading package wordnet to
[nltk_data]    |     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]    | Downloading package w

True

## 1.2 Tokenization and Text Preprocessing

"Tokenization [3]  plays a crucial role in Natural Language Processing (NLP) as it breaks down text into smaller units called tokens, which can be words, sentences, or even characters. Tokenization serves as the foundation for various NLP tasks such as text classification, sentiment analysis, and named entity recognition." [2]

"NLTK’s word tokenization allows us to split text into individual words or tokens. This process is essential for analyzing the linguistic structure of a sentence and extracting meaningful information from it." [2]

"NLTK also provides sentence tokenization, which is the process of splitting a document or paragraph into individual sentences. Sentence tokenization helps in tasks like document summarization or machine translation." [2]

After tokenization, it is often necessary to preprocess the text further to enhance its quality and remove noise. NLTK offers several preprocessing techniques to assist in this process [2]:

1. Removing Stop Words
2. Stemming
3. Lemmatization
4. Handling Special Characters

[3] Tokenization. Available at https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization, accessed on July 3rd, 2023.

### 1.2.1 Import Python Modules

In [20]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

### 1.2.2 Download Necessary NLTK Resources (only required once)

In [22]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### 1.2.3 Set Sample Text Example

In [21]:
# Sample text for demonstration
text = "Tokenization is an important step in Natural Language Processing (NLP). \
It breaks down text into smaller units called tokens. These tokens can be \
words, sentences, or even characters."

In [11]:
text

'Tokenization is an important step in Natural Language Processing (NLP). It breaks down text into smaller units called tokens. These tokens can be words, sentences, or even characters.'

### 1.2.4 Tokenization - Word Tokenization

In [12]:
tokens = word_tokenize(text)
print('Word Tokens:')
print(tokens)

Word Tokens:
['Tokenization', 'is', 'an', 'important', 'step', 'in', 'Natural', 'Language', 'Processing', '(', 'NLP', ')', '.', 'It', 'breaks', 'down', 'text', 'into', 'smaller', 'units', 'called', 'tokens', '.', 'These', 'tokens', 'can', 'be', 'words', ',', 'sentences', ',', 'or', 'even', 'characters', '.']
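To see why a tokenizer is used instead of a plain `str.split`, here is a small sketch that needs only the standard library. The regular expression is my own rough approximation for illustration, not the algorithm that `word_tokenize` implements.

```python
import re

sample = "It breaks down text into smaller units called tokens."

# Whitespace splitting leaves punctuation glued to the neighbouring word.
whitespace_tokens = sample.split()

# Rough approximation: runs of word characters, or single non-space symbols.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(whitespace_tokens[-1])  # tokens.
print(regex_tokens[-2:])      # ['tokens', '.']
```

With whitespace splitting, the last token keeps its full stop, so 'tokens' and 'tokens.' would be counted as different words.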


### 1.2.5 Tokenization - Sentence Tokenization

In [14]:
sentences = sent_tokenize(text)
print('Sentences: ')
print(sentences)

Sentences: 
['Tokenization is an important step in Natural Language Processing (NLP).', 'It breaks down text into smaller units called tokens.', 'These tokens can be words, sentences, or even characters.']
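Sentence splitting looks trivial on the sample text, but a naive rule breaks as soon as abbreviations appear. The sketch below uses a deliberately simple regular expression (a hypothetical helper of mine, not part of NLTK) to show the kind of mistake a dedicated sentence tokenizer is designed to avoid.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (a deliberately naive rule)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

simple = "It breaks down text into smaller units. These units are called tokens."
tricky = "Apple Inc. was founded by Steve Jobs. Its headquarters are in Cupertino."

print(naive_sent_split(simple))  # two sentences, as expected
print(naive_sent_split(tricky))  # three pieces: the rule wrongly splits after 'Inc.'
```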


### 1.2.6 Preprocessing - Removing Stop Words

In [17]:
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.casefold() not in stop_words]
print('Tokens after removing stop words:')
print(filtered_tokens)

Tokens after removing stop words:
['Tokenization', 'important', 'step', 'Natural', 'Language', 'Processing', '(', 'NLP', ')', '.', 'breaks', 'text', 'smaller', 'units', 'called', 'tokens', '.', 'tokens', 'words', ',', 'sentences', ',', 'even', 'characters', '.']


### 1.2.7 Text Preprocessing - Stemming

In [16]:
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print('Stemmed Tokens:')
print(stemmed_tokens)

Stemmed Tokens:
['token', 'import', 'step', 'natur', 'languag', 'process', '(', 'nlp', ')', '.', 'break', 'text', 'smaller', 'unit', 'call', 'token', '.', 'token', 'word', ',', 'sentenc', ',', 'even', 'charact', '.']


### 1.2.8 Text Preprocessing - Lemmatization

In [18]:
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
print('Lemmatized Tokens:')
print(lemmatized_tokens)

Lemmatized Tokens:
['Tokenization', 'important', 'step', 'Natural', 'Language', 'Processing', '(', 'NLP', ')', '.', 'break', 'text', 'smaller', 'unit', 'called', 'token', '.', 'token', 'word', ',', 'sentence', ',', 'even', 'character', '.']


### 1.2.9 Text Preprocessing - Handling Special Characters

In [19]:
special_chars = set(string.punctuation)
filtered_tokens = [token for token in tokens if token not in special_chars]
print('Tokens after handling special characters:')
print(filtered_tokens)

Tokens after handling special characters:
['Tokenization', 'is', 'an', 'important', 'step', 'in', 'Natural', 'Language', 'Processing', 'NLP', 'It', 'breaks', 'down', 'text', 'into', 'smaller', 'units', 'called', 'tokens', 'These', 'tokens', 'can', 'be', 'words', 'sentences', 'or', 'even', 'characters']
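Note that `set(string.punctuation)` contains single characters only, so the membership test above removes a token only when it is exactly one punctuation character. Tokenizers can also emit multi-character punctuation tokens such as '...'. A minimal sketch of a stricter filter follows; the helper name and the sample tokens are hypothetical.

```python
import string

def is_punctuation(token):
    """Return True when every character of the token is ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

# Hypothetical token list containing multi-character punctuation tokens.
sample_tokens = ["Wait", "...", "``", "really", "''", "?", "state-of-the-art"]
cleaned = [tok for tok in sample_tokens if not is_punctuation(tok)]
print(cleaned)  # ['Wait', 'really', 'state-of-the-art']
```

Hyphenated words are kept because they contain letters as well as punctuation.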


## 1.3 Part-of-Speech Tagging
"Part-of-speech (POS) tagging [4] is a vital process in NLP that involves assigning tags to words in a text, indicating their grammatical category or function within a sentence. POS tagging aids in understanding the structure and meaning of a sentence, which is crucial for various NLP applications such as text analysis, information retrieval, machine translation, and sentiment analysis." [2]

[4] Part-of-speech tagging. Available at https://en.wikipedia.org/wiki/Part-of-speech_tagging, accessed on July 3rd, 2023.

### 1.3.1 Import Python Modules

In [31]:
import nltk
from nltk.tokenize import word_tokenize

### 1.3.2 Download necessary NLTK resources (only required once)

In [33]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

### 1.3.3 Set Sample text for demonstration

In [34]:
text = "NLTK provides powerful tools for performing POS tagging."

### 1.3.4 Tokenize the Sample Text into Words

In [35]:
tokens = word_tokenize(text)
tokens

['NLTK',
 'provides',
 'powerful',
 'tools',
 'for',
 'performing',
 'POS',
 'tagging',
 '.']

### 1.3.5 Perform POS (Part-of-Speech) tagging

In [29]:
pos_tags = nltk.pos_tag(tokens)

In [30]:
# Print the POS tags
for token, pos_tag in pos_tags:
    print(f'{token}: {pos_tag}')

NLTK: NNP
provides: VBZ
powerful: JJ
tools: NNS
for: IN
performing: VBG
POS: NNP
tagging: NN
.: .
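The tag abbreviations are not self-explanatory, so a small lookup table helps when reading the output. The sketch below covers only the Penn Treebank tags that appear in the output above; the dictionary and the helper function are my own illustration, not an NLTK API.

```python
# Descriptions for the Penn Treebank tags that appear in the output above.
PENN_TAGS = {
    "NNP": "proper noun, singular",
    "VBZ": "verb, 3rd person singular present",
    "JJ": "adjective",
    "NNS": "noun, plural",
    "IN": "preposition or subordinating conjunction",
    "VBG": "verb, gerund or present participle",
    "NN": "noun, singular or mass",
    ".": "sentence-final punctuation",
}

# (token, tag) pairs copied from the output above.
tagged = [
    ("NLTK", "NNP"), ("provides", "VBZ"), ("powerful", "JJ"),
    ("tools", "NNS"), ("for", "IN"), ("performing", "VBG"),
    ("POS", "NNP"), ("tagging", "NN"), (".", "."),
]

def describe(pairs, lookup=PENN_TAGS):
    """Attach a human-readable description to each (token, tag) pair."""
    return [(tok, tag, lookup.get(tag, "unknown tag")) for tok, tag in pairs]

for token, tag, meaning in describe(tagged):
    print(f"{token:<12}{tag:<5}{meaning}")
```

NLTK also ships `nltk.help.upenn_tagset()`, which prints the full description of a tag once the 'tagsets' resource has been downloaded.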


"The POS tags are represented using the Penn Treebank tagset, which includes tags such as nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and more." [2]

# 2. Sentiment Analysis with NLTK

"Sentiment analysis [5], also known as opinion mining, is a crucial area of NLP that involves determining the sentiment expressed in a piece of text. It has various applications, including social media monitoring, brand reputation management, market research, and customer feedback analysis. NLTK provides powerful tools and techniques to perform sentiment analysis efficiently. NLTK supports multiple approaches for sentiment analysis, including rule-based and machine learning methods. Rule-based approaches rely on predefined sets of linguistic rules or lexicons to determine the sentiment of words or phrases in a text. [2]

One popular rule-based approach is the Vader sentiment analysis tool included in NLTK, which provides a pre-trained model for analyzing sentiment. Machine learning methods leverage labeled datasets to train models that can automatically classify text into positive, negative, or neutral sentiments. NLTK offers functionality to preprocess and prepare data for machine learning classification models. It also provides access to various classifiers like Naive Bayes, Maximum Entropy, and Support Vector Machines for sentiment analysis tasks." [2]

[5] Sentiment Analysis. Available at https://en.wikipedia.org/wiki/Sentiment_analysis, accessed on July 3rd, 2023.

## 2.1 Import Python Modules

In [37]:
import nltk
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import SVC

## 2.2 Download Necessary NLTK Resources (only required once)

In [40]:
nltk.download('vader_lexicon')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## 2.3 Load Working Example

Suppose we want to analyse the sentiment expressed in a collection of tweets about a particular product.

Let's assume the data is a list of (tweet, label) pairs.

In [105]:
# Labeled data for training a sentiment classifier
# Each item is a (tweet, label) tuple, e.g. ("I love this product", "positive")
labeled_data = [
    ("I love this product", "positive"),
    ("This product is terrible", "negative"),
    ("The quality could be better", "neutral"),
    ("I recommend this product", "positive"),
    ("This product does not fit my expectations", "negative"),
    ("This product is great!", "positive"),
    ("I do not like nor dislike this product", "neutral"),
    ("I recommend this product", "positive"),
    ("You will not like this product", "negative"),
    ("I do not recommend this product", "negative")
    
    # Add more labeled data here...
]

In [106]:
labeled_data

[('I love this product', 'positive'),
 ('This product is terrible', 'negative'),
 ('The quality could be better', 'neutral'),
 ('I recommend this product', 'positive'),
 ('This product does not fit my expectations', 'negative'),
 ('This product is great!', 'positive'),
 ('I do not like nor dislike this product', 'neutral'),
 ('I recommend this product', 'positive'),
 ('You will not like this product', 'negative'),
 ('I do not recommend this product', 'negative')]

## 2.4 Preprocessing the Labeled Data

In [107]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [111]:
# Start from empty lists in the same cell, so re-running it does not append duplicates
preprocessed_data = []
labels = []

for tweet, label in labeled_data:
    # Lowercase and tokenize, then drop stop words and lemmatize what remains
    tokens = word_tokenize(tweet.lower())
    filtered_tokens = [token for token in tokens if token not in stop_words]
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    preprocessed_tweet = ' '.join(lemmatized_tokens)

    preprocessed_data.append(preprocessed_tweet)
    labels.append(label)

## 2.5 Split Preprocessed Data into Train and Test Sets

In [114]:
X_train, X_test, y_train, y_test = train_test_split(preprocessed_data, 
                                                   labels,
                                                   test_size=0.20,
                                                   random_state=42)

## 2.6 Vectorize the Preprocessed Data using TF-IDF

In [88]:
X_train

['product great !',
 'love product',
 'recommend product',
 'quality could better',
 'recommend product',
 'product fit expectation',
 'recommend product',
 'like dislike product']

In [89]:
X_test

['like product', 'product terrible']
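The two lists above reveal a side effect of stop-word removal: NLTK's English stop-word list contains negations such as 'not' and 'nor', so "I do not recommend this product" (negative) and "I recommend this product" (positive) both end up as 'recommend product'. One possible workaround is to keep negation words. In the sketch below, the small stop-word set stands in for `set(stopwords.words('english'))`, whitespace splitting stands in for `word_tokenize`, and the `NEGATIONS` set is my own assumption.

```python
# Illustrative subset of English stop words; in the notebook this would be
# set(stopwords.words('english')).
stop_words = {"i", "do", "not", "this", "nor", "no", "is", "the", "you", "will"}

# Hypothetical list of negation words we want to keep in the text.
NEGATIONS = {"not", "no", "nor"}
stop_words_keep_negation = stop_words - NEGATIONS

def remove_stop_words(text, stops):
    """Lowercase, split on whitespace, and drop every word found in `stops`."""
    return " ".join(word for word in text.lower().split() if word not in stops)

tweet = "I do not recommend this product"
print(remove_stop_words(tweet, stop_words))                # recommend product
print(remove_stop_words(tweet, stop_words_keep_negation))  # not recommend product
```

Whether keeping negations helps depends on the classifier and the amount of data, so treat this as something to experiment with rather than a rule.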

In [115]:
vectorizer = TfidfVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)
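To make clearer what the vectorizer computes, here is a standard-library sketch of TF-IDF. It assumes what I understand to be the scikit-learn defaults: tokens of two or more word characters, a smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalisation of each row. It is an illustration only, not the library's implementation.

```python
import math
import re
from collections import Counter

# Training documents as shown in the X_train output above.
docs = [
    "product great !", "love product", "recommend product",
    "quality could better", "recommend product",
    "product fit expectation", "recommend product", "like dislike product",
]

def tokenize(doc):
    # Keep tokens of two or more word characters, so '!' is dropped.
    return re.findall(r"\b\w\w+\b", doc.lower())

def tfidf(documents):
    """Return the sorted vocabulary and one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in documents]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = {t: tf[t] * idf[t] for t in tf}
        # L2-normalise the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return vocab, rows

vocab, rows = tfidf(docs)
print(len(vocab))  # 11
print(rows[1])     # weights for 'love product'
```

The vocabulary has 11 terms, which matches the 11 columns of the sparse matrix produced by `TfidfVectorizer`. In 'love product', the rare word 'love' receives a higher weight than 'product', which occurs in almost every document.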

## 2.7 Train a Support Vector Machine (SVM) Classifier

In [91]:
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train_vectors, y_train)

## 2.8 Evaluate the Trained Classifier on the Testing Set

In [92]:
X_test_vectors

<2x11 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in Compressed Sparse Row format>

In [98]:
from sklearn.metrics import classification_report, confusion_matrix

In [100]:
y_predicted = svm_classifier.predict(X_test_vectors)

In [104]:
print('Confusion matrix: ', confusion_matrix(y_test, y_predicted))

Confusion matrix:  [[0 1 1]
 [0 0 0]
 [0 0 0]]
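To read the matrix, recall that rows are the true labels and columns are the predicted labels, both in sorted order (negative, neutral, positive). The standard-library sketch below rebuilds the same matrix by hand. Both test tweets are labeled negative; the order of the two predictions is an assumption, because the matrix only tells us that one was 'neutral' and the other 'positive'.

```python
from collections import Counter

def confusion(y_true, y_pred):
    """Build a confusion matrix with rows = true labels, columns = predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix

y_true = ["negative", "negative"]
# Assumed order of predictions; swapping them gives the same matrix.
y_pred = ["neutral", "positive"]

labels, matrix = confusion(y_true, y_pred)
print(labels)  # ['negative', 'neutral', 'positive']
for row in matrix:
    print(row)
```

The first row shows that neither negative tweet was predicted as negative, which is why every score in the classification report is zero. With only ten labeled tweets, this says more about the size of the data set than about the method.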


In [94]:
print('Classification report: ', classification_report(y_test, y_predicted))

Classification report:                precision    recall  f1-score   support

    negative       0.00      0.00      0.00       2.0
     neutral       0.00      0.00      0.00       0.0
    positive       0.00      0.00      0.00       0.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0





## 2.9 Sentiment Analysis of New, Unseen Tweets

In [95]:
unseen_tweets = [
     "This product exceeded my expectations!",
     "I'm really disappointed with the customer service.",
     "The price seems fair for the quality.",
     # Add more unseen tweets here... 
]

In [119]:
# Create the VADER sentiment analyser
analyzer = SentimentIntensityAnalyzer()

In [121]:
for tweet in unseen_tweets:
    sentiment_scores = analyzer.polarity_scores(tweet)
    print(f"Tweet: {tweet}")
    print(f"Sentiment Scores: {sentiment_scores}")
    print()

Tweet: This product exceeded my expectations!
Sentiment Scores: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

Tweet: I'm really disappointed with the customer service.
Sentiment Scores: {'neg': 0.361, 'neu': 0.639, 'pos': 0.0, 'compound': -0.5256}

Tweet: The price seems fair for the quality.
Sentiment Scores: {'neg': 0.0, 'neu': 0.723, 'pos': 0.277, 'compound': 0.3182}
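The compound score is a single value between -1 and 1, so it is often turned into a label with a cut-off. The sketch below uses the commonly cited thresholds of +0.05 and -0.05; the function name is hypothetical and the scores are copied from the output above.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative' or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the output above.
compound_scores = {
    "This product exceeded my expectations!": 0.0,
    "I'm really disappointed with the customer service.": -0.5256,
    "The price seems fair for the quality.": 0.3182,
}

for tweet, compound in compound_scores.items():
    print(f"{label_from_compound(compound):<9} {tweet}")
```

The first tweet comes out as neutral even though a human reader would call it positive, which shows one limit of a lexicon-based approach: it can miss sentiment expressed with words that carry no score in the lexicon.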



# 3. Named Entity Recognition with NLTK

"Named entity recognition (NER) [6] is a natural language processing (NLP) task that identifies and classifies named entities in text into predefined categories, such as people, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. NER is a crucial step in information extraction, which is the process of automatically extracting structured information from unstructured text data." [2]

[6] Named-entity recognition. Available at https://en.wikipedia.org/wiki/Named-entity_recognition, accessed on July 4th, 2023.

## 3.1 Import Python Modules

In [125]:
import nltk
from nltk import ne_chunk
from nltk.tokenize import word_tokenize

## 3.2 Download Necessary Resources (only required once)

In [123]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Cleiber\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

## 3.3 Set Sample Text for Demonstration

In [126]:
text = (
    "Apple Inc. was founded by Steve Jobs, Steve Wozniak, "
    "and Ronald Wayne. Its headquarters are located "
    "in Cupertino, California."
)

In [127]:
text

'Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne. Its headquarters are located in Cupertino, California.'

## 3.4 Tokenize the Text into Words

In [130]:
tokens = word_tokenize(text)
tokens[0:5]

['Apple', 'Inc.', 'was', 'founded', 'by']

## 3.5 Apply NER using NLTK's pre-trained models

In [131]:
ner_tags = ne_chunk(nltk.pos_tag(tokens))

## 3.6 Print the Marked Entities

In [132]:
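for chunk in ner_tags:
    # Named entities are subtrees with a label; plain tokens are (word, tag) tuples
    if hasattr(chunk, 'label'):
        # Join multi-word entities with a space so that names stay readable
        entity = ' '.join(word for word, tag in chunk)
        print(f"Entity: {entity} | Type: {chunk.label()}")

Entity: Apple | Type: PERSON
Entity: Inc. | Type: ORGANIZATION
Entity: Steve Jobs | Type: PERSON
Entity: Steve Wozniak | Type: PERSON
Entity: Ronald Wayne | Type: PERSON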
Entity: Cupertino | Type: GPE
Entity: California | Type: GPE
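Once the entities are extracted, it is usually handy to group them by type. The sketch below does this with the standard library, using (entity, type) pairs taken from the output above, with multi-word names written with a space. The misclassifications (for example 'Apple' tagged as PERSON) are kept exactly as the model produced them.

```python
from collections import defaultdict

# (entity, type) pairs taken from the output above.
entities = [
    ("Apple", "PERSON"),
    ("Inc.", "ORGANIZATION"),
    ("Steve Jobs", "PERSON"),
    ("Steve Wozniak", "PERSON"),
    ("Ronald Wayne", "PERSON"),
    ("Cupertino", "GPE"),
    ("California", "GPE"),
]

# Group entity names by their predicted type, preserving the original order.
by_type = defaultdict(list)
for name, label in entities:
    by_type[label].append(name)

for label, names in by_type.items():
    print(f"{label}: {', '.join(names)}")
```

Grouping makes the errors easy to spot: 'Apple Inc.' was split into two entities, and neither half received the right combination of span and type.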
