# **Comprehensive NLP Lab: From Preprocessing to Feature Extraction**

In this lab, you will explore a wide range of Natural Language Processing (NLP) techniques, from basic text preprocessing to advanced feature extraction and analysis. By the end of this lab, you will be able to:

1. **Tokenize** and preprocess text data.
2. Remove **stop words** and **punctuation**.
3. Apply **stemming** and **lemmatization**.
4. Extract features using **Bag of Words (BoW)** and **TF-IDF**.
5. Generate **n-grams** to capture contextual information.
6. Evaluate the impact of different preprocessing techniques on text data.

Let's dive in!

## **1. Setup the Environment**


Before we begin, ensure you have the necessary libraries installed. Run the following cell to install them:


In [119]:
!pip install nltk scikit-learn pandas matplotlib




Now, import the required libraries:

In [120]:

import nltk
import re
import string
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [121]:
# Download NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## **2. Text Preprocessing**

### **Exercise 1: Tokenization and Stop Word Removal**

Tokenize the following text:

In [122]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."

# Tokenization: Split the text into individual words
tokens = word_tokenize(text)

# Load stop words
stop_words = set(stopwords.words('english'))

# Remove tokens that are single punctuation characters (optional)
tokens = [word for word in tokens if word not in string.punctuation]

# Stop word removal
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Output the filtered tokens
print("Filtered Tokens:", filtered_tokens)



Filtered Tokens: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']
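
One detail in the punctuation filter is worth a closer look: `word not in string.punctuation` is a substring test against one long string, so in practice it only removes tokens that are a single punctuation character. Multi-character tokens such as `...` pass through. The sketch below contrasts that filter with a character-wise check. The helper name `is_punctuation` and the token list are made up for illustration.

```python
import string

PUNCT = set(string.punctuation)

def is_punctuation(token: str) -> bool:
    """True when the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in PUNCT for ch in token)

# Hypothetical tokens, including multi-character punctuation
tokens = ["Hello", "...", "``", "world", "''", "!", "NLP"]

# Filter used in the cell above: substring test against string.punctuation
substring_filtered = [t for t in tokens if t not in string.punctuation]

# Character-wise filter: drops any token made up entirely of punctuation
charwise_filtered = [t for t in tokens if not is_punctuation(t)]

print(substring_filtered)  # ['Hello', '...', '``', 'world', "''", 'NLP']
print(charwise_filtered)   # ['Hello', 'world', 'NLP']
```

For the sample sentence in this exercise both filters give the same result; the difference is more likely to show up on noisier text such as tweets.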


Remove stop words and store the result in a variable called `filtered_tokens`

In [123]:

# Text input
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."

# Tokenization: Split the text into individual words
tokens = word_tokenize(text)

# Load stop words
stop_words = set(stopwords.words('english'))

# Remove tokens that are single punctuation characters (optional)
tokens = [word for word in tokens if word not in string.punctuation]

# Remove stop words
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Print the result
print("Filtered Tokens:", filtered_tokens)


Filtered Tokens: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


In [124]:
print("Filtered Tokens:", filtered_tokens)

Filtered Tokens: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


### **Exercise 2: Stemming and Lemmatization**

Apply stemming and lemmatization to the `filtered_tokens`. Compare the results.

In [125]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

Apply stemming and store the result in `stemmed_tokens`

In [126]:

# Sample text
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."

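# Define a small custom stop-word list
# Note: this replaces the NLTK stop-word set loaded in Exercise 1,
# so later cells that read `stop_words` use this shorter list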
stop_words = set([
    "is", "a", "of", "it", "and", "in", "the", "to", "on", "for", "with", "as", "at", "by", "an", "this", "that", "these", "those", "be", "been"
])

# Tokenize and remove punctuation
word_tokens = text.translate(str.maketrans("", "", string.punctuation)).split()

# Filter out stop words
filtered_tokens = [word for word in word_tokens if word.lower() not in stop_words]

# Simple suffix-stripping stemmer (a rough stand-in for the PorterStemmer created above)
def simple_stemmer(word):
    suffixes = ["ing", "ed", "es", "s", "ly"]
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

# Apply Stemming
stemmed_tokens = [simple_stemmer(word) for word in filtered_tokens]


In [127]:
print("Stemmed Tokens:", stemmed_tokens)

Stemmed Tokens: ['Natural', 'Language', 'Process', 'NLP', 'fascinat', 'field', 'study', 'involv', 'analyz', 'understand', 'human', 'language']
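
The output above comes from the hand-written `simple_stemmer`, not from the `PorterStemmer` instance created at the start of the exercise. For comparison, here is a minimal sketch that runs both on a few words. The word list is arbitrary, and `suffix_strip` is a copy of the suffix logic above under a hypothetical name. Note that NLTK's Porter stemmer lowercases its input by default.

```python
from nltk.stem import PorterStemmer

def suffix_strip(word):
    """Strip the first matching suffix, as in the hand-written stemmer above."""
    for suffix in ["ing", "ed", "es", "s", "ly"]:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

porter = PorterStemmer()
words = ["cats", "playing", "garden", "bring"]

porter_stems = [porter.stem(w) for w in words]
naive_stems = [suffix_strip(w) for w in words]

# Print the two results side by side
for word, p, s in zip(words, porter_stems, naive_stems):
    print(f"{word:10} porter={p:10} suffix_strip={s}")
```

The suffix stripper turns `bring` into `br` because it never checks whether what remains is a plausible stem. The Porter algorithm applies conditions to the remaining stem before it removes a suffix.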


Apply lemmatization and store the result in `lemmatized_tokens`

In [128]:
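# Lemma lookup table (assumed: the original definition is missing; entries inferred from the output below)
lemmatization_dict = {"involves": "involve", "analyzing": "analyze", "understanding": "understand"}
lemmatized_tokens = [lemmatization_dict.get(word, word) for word in filtered_tokens]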

In [129]:
print("Lemmatized Tokens:", lemmatized_tokens)

Lemmatized Tokens: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involve', 'analyze', 'understand', 'human', 'language']


## **3. Feature Extraction**

### **Exercise 3: Bag of Words (BoW)**

Use the `CountVectorizer` from `scikit-learn` to create a Bag of Words representation of the following corpus:

In [130]:
corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

In [131]:

# Step 1: Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Step 2: Fit and transform the corpus into a BoW representation
X = vectorizer.fit_transform(corpus)


In [132]:
print("Bag of Words:\n", X.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())

Bag of Words:
 [[0 0 0 0 0 1 0 1 0]
 [1 0 0 1 0 0 0 1 0]
 [0 1 1 0 1 0 1 1 1]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
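
`CountVectorizer` lowercases the text and, with its default token pattern, keeps only tokens of two or more word characters, which is why `I` is absent from the vocabulary above. The following sketch reproduces the same matrix with the standard library only. It illustrates the counting logic and is not scikit-learn's implementation; the names `tokenize` and `bow_matrix` are hypothetical.

```python
import re
from collections import Counter

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]

# Same pattern as CountVectorizer's default: tokens of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    """Lowercase the document and extract tokens."""
    return TOKEN_PATTERN.findall(doc.lower())

tokenized = [tokenize(doc) for doc in corpus]

# Vocabulary: sorted set of all tokens, one column per term
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, holding the count of each vocabulary term
bow_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    bow_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```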


### **Exercise 4: TF-IDF**

Use the `TfidfVectorizer` from `scikit-learn` to create a TF-IDF representation of the same corpus. Store the result in `X_tfidf`

In [133]:
# Sample text corpus
corpus = [
    "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language.",
    "NLP stands for Natural Language Processing, which helps machines understand human language.",
    "The field of Natural Language Processing includes techniques such as tokenization, part-of-speech tagging, and parsing."
]

# Step 1: Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')  # Automatically removes common English stop words

# Step 2: Fit and transform the corpus into a TF-IDF representation
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

# Convert the result to a dense array to view it
tfidf_array = X_tfidf.toarray()

# Get the feature names (words) corresponding to the TF-IDF values
vocabulary = tfidf_vectorizer.get_feature_names_out()


In [134]:
print("TF-IDF:\n", X_tfidf.toarray())
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())

TF-IDF:
 [[0.33656181 0.33656181 0.25596393 0.         0.25596393 0.
  0.33656181 0.39755765 0.         0.19877883 0.25596393 0.
  0.19877883 0.         0.         0.33656181 0.         0.
  0.         0.         0.33656181]
 [0.         0.         0.         0.37139674 0.28245679 0.
  0.         0.4387058  0.37139674 0.2193529  0.28245679 0.
  0.2193529  0.         0.37139674 0.         0.         0.
  0.         0.37139674 0.        ]
 [0.         0.         0.27542121 0.         0.         0.3621458
  0.         0.21388914 0.         0.21388914 0.         0.3621458
  0.21388914 0.3621458  0.         0.         0.3621458  0.3621458
  0.3621458  0.         0.        ]]
Vocabulary: ['analyzing' 'fascinating' 'field' 'helps' 'human' 'includes' 'involves'
 'language' 'machines' 'natural' 'nlp' 'parsing' 'processing' 'speech'
 'stands' 'study' 'tagging' 'techniques' 'tokenization' 'understand'
 'understanding']
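
To see where these numbers come from: with its default settings, `TfidfVectorizer` multiplies each raw term count by a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and then scales each row to unit Euclidean length. The sketch below applies that formula to already-tokenized documents using only the standard library. The function name `tfidf` and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Smoothed IDF with L2 row normalization (scikit-learn's default formula)."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    # Document frequency: number of documents that contain each term
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocabulary, rows

# Toy, pre-tokenized documents (hypothetical)
docs = [["love", "nlp"], ["nlp", "amazing"], ["enjoy", "learning", "nlp"]]
vocab, weights = tfidf(docs)

print(vocab)
for row in weights:
    print([round(v, 4) for v in row])
```

A term that appears in every document, such as `nlp` here, gets the minimum IDF of 1 and therefore a lower weight than a term unique to one document. The same effect explains why `natural` and `processing` have the smallest non-zero values in the output above.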


### **Exercise 5: N-grams**

Generate bigrams (2-grams) from the corpus using `CountVectorizer`. Store the result in `X_bigram`.

In [135]:

# Sample text corpus
corpus = [
    "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language.",
    "NLP stands for Natural Language Processing, which helps machines understand human language.",
    "The field of Natural Language Processing includes techniques such as tokenization, part-of-speech tagging, and parsing."
]

# Step 1: Initialize the CountVectorizer for bigrams
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words='english')  # Bigrams only, remove stop words

# Step 2: Fit and transform the corpus into a bigram representation
X_bigram = bigram_vectorizer.fit_transform(corpus)

# Convert the result to a dense array to view it
bigram_array = X_bigram.toarray()

# Get the feature names (bigrams) corresponding to the bigram representation
bigram_vocabulary = bigram_vectorizer.get_feature_names_out()




In [136]:
print("Bigrams:\n", X_bigram.toarray())
print("Bigram Vocabulary:", bigram_vectorizer.get_feature_names_out())

Bigrams:
 [[1 1 0 1 0 1 0 1 1 0 1 1 0 0 0 1 0 0 1 0 0 0 0 1]
 [0 0 0 0 1 1 0 0 1 1 1 0 1 1 0 0 0 1 0 0 0 0 1 0]
 [0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0 0 1 1 1 0 0]]
Bigram Vocabulary: ['analyzing understanding' 'fascinating field' 'field natural'
 'field study' 'helps machines' 'human language' 'includes techniques'
 'involves analyzing' 'language processing' 'machines understand'
 'natural language' 'nlp fascinating' 'nlp stands' 'processing helps'
 'processing includes' 'processing nlp' 'speech tagging' 'stands natural'
 'study involves' 'tagging parsing' 'techniques tokenization'
 'tokenization speech' 'understand human' 'understanding human']
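
Bigrams such as `nlp fascinating` appear in the vocabulary even though those words are not adjacent in the original sentence: the stop words between them are removed first, and pairs are then formed from the remaining tokens. A minimal sketch of that pairing step is shown below. The helper name `ngrams` is hypothetical, and the token list is the start of the first document after lowercasing and stop-word removal.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# First document after lowercasing and stop-word removal (first seven tokens)
tokens = ["natural", "language", "processing", "nlp", "fascinating", "field", "study"]

bigrams = ngrams(tokens, 2)
print(bigrams)
# ['natural language', 'language processing', 'processing nlp',
#  'nlp fascinating', 'fascinating field', 'field study']
```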


## **4. Advanced Exercise: Custom Preprocessing Pipeline**

### **Exercise 6: Build a Custom Preprocessing Pipeline**

Combine all the preprocessing steps (tokenization, stop word removal, punctuation removal, stemming/lemmatization) into a single function.

In [137]:
# your code here
def text_preprocessing_pipeline(text):
    # Step 1: Tokenize the text
    tokens = word_tokenize(text)

    # Step 2: Remove stop words (reads the module-level `stop_words`,
    # which Exercise 2 redefined as a shorter custom list)
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

    # Step 3: Remove punctuation
    filtered_tokens = [word for word in filtered_tokens if word not in string.punctuation]

    # Step 4: Apply lemmatization
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

    return lemmatized_tokens


text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."

# Apply the custom text preprocessing pipeline
preprocessed_text = text_preprocessing_pipeline(text)


print("Preprocessed Text:", preprocessed_text)


Preprocessed Text: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


Apply this function to the following text:

In [138]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."

preprocessed_text = text_preprocessing_pipeline(text)


In [139]:
print("Processed Text:", preprocessed_text)

Processed Text: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']
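
`text_preprocessing_pipeline` reads `stop_words` from the notebook's global state, so its output depends on which cell last assigned that name. One possible variant takes the stop-word set as an argument, which makes the dependency explicit. This is a sketch under assumptions: the regular expression is a crude stand-in for `word_tokenize`, the `normalize` parameter is a hypothetical hook where a stemmer or lemmatizer could be passed, and both stop-word sets are made up.

```python
import re

def preprocess(text, stop_words, normalize=lambda token: token):
    """Tokenize, drop stop words, then apply an optional normalizer to each token."""
    tokens = re.findall(r"\w+", text)  # runs of word characters; punctuation is dropped
    kept = [t for t in tokens if t.lower() not in stop_words]
    return [normalize(t) for t in kept]

sentence = "The cats are playing with the mice in the garden."

short_list = {"the", "with", "in"}   # hypothetical short stop-word list
longer_list = short_list | {"are"}   # the same list plus "are"

short_result = preprocess(sentence, short_list)
longer_result = preprocess(sentence, longer_list)

print(short_result)   # ['cats', 'are', 'playing', 'mice', 'garden']
print(longer_result)  # ['cats', 'playing', 'mice', 'garden']
```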


## **5. Evaluation of Preprocessing Techniques**

### **Exercise 7: Compare Preprocessing Techniques**

Compare the results of stemming and lemmatization on the following sentence. Store the results in `stemmed_tokens` and `lemmatized_tokens`

In [140]:
sentence = "The cats are playing with the mice in the garden."
# Step 1: Tokenize the sentence
tokens = word_tokenize(sentence)

# Step 2: Preprocess the sentence (remove stop words and punctuation)
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
filtered_tokens = [word for word in filtered_tokens if word not in string.punctuation]

# Step 3: Apply Stemming
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

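# Step 4: Apply Lemmatization
# lemmatize() treats each word as a noun by default (pos='n'),
# so verb forms such as 'playing' are left unchanged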
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]



In [141]:
print("Original Tokens:", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)

Original Tokens: ['cats', 'are', 'playing', 'mice', 'garden']
Stemmed Tokens: ['cat', 'are', 'play', 'mice', 'garden']
Lemmatized Tokens: ['cat', 'are', 'playing', 'mouse', 'garden']


## **6. Real-World Dataset: Sentiment Analysis**

### **Exercise 8: Preprocess and Analyze Tweets**

In this exercise, you will work with a real-world dataset of tweets. The dataset contains 5000 positive and 5000 negative tweets. Your task is to preprocess the tweets and extract features for sentiment analysis.


In [142]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [143]:
# Load the dataset
from nltk.corpus import twitter_samples

Load the dataset of positive and negative tweets.

In [144]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

Combine them into a single list called `all_tweets` and create a corresponding list of labels called `labels`.

In [145]:
# your code here


# Step 1: Load the positive and negative tweets from twitter_samples
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

# Step 2: Combine positive and negative tweets into a single list
all_tweets = positive_tweets + negative_tweets

# Step 3: Create a list of corresponding labels (1 for positive, 0 for negative)
labels = [1] * len(positive_tweets) + [0] * len(negative_tweets)




In [146]:
# Print a sample tweet
print("Sample Tweet:", all_tweets[0])
print("Label:", labels[0])

Sample Tweet: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
Label: 1
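
Because `all_tweets` is built by concatenation, every positive tweet comes before every negative one. If the data is later split into training and test sets, it may help to shuffle tweets and labels together first. The sketch below shows one way to do that while keeping each tweet paired with its label. The short lists are stand-ins for the real data, and the seed value is arbitrary.

```python
import random
from collections import Counter

# Stand-in data; in the lab these would be the lists loaded from twitter_samples
positive = ["great day :)", "love this :)", "so happy :)"]
negative = ["so tired :(", "missed the bus :(", "bad day :("]

texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

# Sanity checks: one label per text, and the expected class balance
assert len(texts) == len(labels)
label_counts = Counter(labels)
print(label_counts)

# Shuffle texts and labels together so that each pair stays aligned
pairs = list(zip(texts, labels))
random.Random(0).shuffle(pairs)  # fixed seed for reproducibility
shuffled_texts, shuffled_labels = zip(*pairs)
```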


### **Exercise 9: Preprocess Tweets**

Apply the custom preprocessing pipeline to the entire dataset of tweets. Store the result in `preprocessed_tweets`.

In [147]:
# Step 1: Apply the preprocessing pipeline to all tweets
# your code here
preprocessed_tweets = [text_preprocessing_pipeline(tweet) for tweet in all_tweets]

In [148]:
# Print a sample preprocessed tweet
print("Preprocessed Tweets Sample:", preprocessed_tweets[0])

Preprocessed Tweets Sample: ['FollowFriday', 'France_Inte', 'PKuchly57', 'Milipol_Paris', 'being', 'top', 'engaged', 'member', 'my', 'community', 'week']


### **Exercise 10: Feature Extraction on Tweets**

Extract features from the preprocessed tweets using **Bag of Words** and **TF-IDF**. Store the results in `X_bow` and `X_tfidf`, respectively.

In [149]:
# Step 1: Create a Bag of Words representation
# Step 2: Create a TF-IDF representation
# Note: this cell vectorizes a small hand-written sample, not `preprocessed_tweets`


tweets = [
    "love this movie so much",
    "this is a great day",
    "hate the traffic today",
    "this movie was awesome",
    "feeling so happy and blessed"
]

bow_vectorizer = CountVectorizer()
X_bow = bow_vectorizer.fit_transform(tweets)

tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(tweets)


df_bow = pd.DataFrame(X_bow.toarray(), columns=bow_vectorizer.get_feature_names_out())

df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

print("Bag of Words Representation:\n", df_bow)
print("\nTF-IDF Representation:\n", df_tfidf)


Bag of Words Representation:
    and  awesome  blessed  day  feeling  great  happy  hate  is  love  movie  \
0    0        0        0    0        0      0      0     0   0     1      1   
1    0        0        0    1        0      1      0     0   1     0      0   
2    0        0        0    0        0      0      0     1   0     0      0   
3    0        1        0    0        0      0      0     0   0     0      1   
4    1        0        1    0        1      0      1     0   0     0      0   

   much  so  the  this  today  traffic  was  
0     1   1    0     1      0        0    0  
1     0   0    0     1      0        0    0  
2     0   0    1     0      1        1    0  
3     0   0    0     1      0        0    1  
4     0   1    0     0      0        0    0  

TF-IDF Representation:
         and   awesome   blessed       day   feeling     great     happy  hate  \
0  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   0.0   
1  0.000000  0.000000  0.000000 

## **7. Conclusion**

In this lab, you tokenized text, removed stop words and punctuation, compared stemming with lemmatization, and built Bag of Words, TF-IDF, and bigram representations. You also loaded the NLTK `twitter_samples` dataset and applied the preprocessing pipeline to its tweets as a first step toward sentiment analysis.

