# **Comprehensive NLP Lab: From Preprocessing to Feature Extraction**

In this lab, you will explore a wide range of Natural Language Processing (NLP) techniques, from basic text preprocessing to advanced feature extraction and analysis. By the end of this lab, you will be able to:

1. **Tokenize** and preprocess text data.
2. Remove **stop words** and **punctuation**.
3. Apply **stemming** and **lemmatization**.
4. Extract features using **Bag of Words (BoW)** and **TF-IDF**.
5. Generate **n-grams** to capture contextual information.
6. Evaluate the impact of different preprocessing techniques on text data.

Let's dive in!

## **1. Setup the Environment**


Before we begin, make sure the necessary libraries are installed. If any are missing, uncomment and run the following cell:


In [1]:
# !pip install nltk scikit-learn pandas matplotlib

Now, import the required libraries:

In [2]:
import nltk
import re
import string
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [3]:
# Download the NLTK resources used in this lab
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /Users/guillermo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/guillermo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/guillermo/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/guillermo/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/guillermo/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## **2. Text Preprocessing**

### **Exercise 1: Tokenization and Stop Word Removal**

Tokenize the following text.

In [4]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."
# your code here
tokens = word_tokenize(text)
print(tokens)

['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'of', 'study', '!', 'It', 'involves', 'analyzing', 'and', 'understanding', 'human', 'language', '.']


Remove stop words and store the result in a variable called `filtered_tokens`.

In [5]:
stop_words = set(stopwords.words('english'))
# your code here

In [6]:
# Build a new list: keep each word in 'tokens' only if its lowercase form is NOT in the stop_words set.
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print("Filtered Tokens:", filtered_tokens)

Filtered Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'field', 'study', '!', 'involves', 'analyzing', 'understanding', 'human', 'language', '.']
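The `.lower()` call in the filter matters because NLTK's English stop word list is stored in lowercase, so a capitalized token such as `It` would otherwise pass through. The sketch below illustrates the difference with a small, hypothetical stop word set (an assumption for illustration, not the full NLTK list):

```python
# Hypothetical mini stop-word set for illustration; the lab uses NLTK's full English list.
stop_words = {"is", "a", "of", "it", "and"}
tokens = ["It", "is", "a", "fascinating", "field"]

# Case-sensitive test: "It" is not equal to "it", so it slips through.
case_sensitive = [w for w in tokens if w not in stop_words]

# Case-insensitive test, as in the cell above: compare the lowercased form.
case_insensitive = [w for w in tokens if w.lower() not in stop_words]

print(case_sensitive)    # ['It', 'fascinating', 'field']
print(case_insensitive)  # ['fascinating', 'field']
```

Note that the surviving tokens keep their original casing; only the comparison is done in lowercase.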


### **Exercise 2: Stemming and Lemmatization**

Apply stemming and lemmatization to the `filtered_tokens`. Compare the results.

In [7]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

Apply stemming and store the result in `stemmed_tokens`.

In [8]:
# your code here

stemmed_tokens = []

for word in filtered_tokens:
    stem = stemmer.stem(word)  # stem once, then reuse for printing and storing
    print("---- ", word, "----")
    print('PS:', stem)
    stemmed_tokens.append(stem)

----  Natural ----
PS: natur
----  Language ----
PS: languag
----  Processing ----
PS: process
----  ( ----
PS: (
----  NLP ----
PS: nlp
----  ) ----
PS: )
----  fascinating ----
PS: fascin
----  field ----
PS: field
----  study ----
PS: studi
----  ! ----
PS: !
----  involves ----
PS: involv
----  analyzing ----
PS: analyz
----  understanding ----
PS: understand
----  human ----
PS: human
----  language ----
PS: languag
----  . ----
PS: .


In [9]:
print("Stemmed Tokens:", stemmed_tokens)

Stemmed Tokens: ['natur', 'languag', 'process', '(', 'nlp', ')', 'fascin', 'field', 'studi', '!', 'involv', 'analyz', 'understand', 'human', 'languag', '.']
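One effect visible in the output above is that the Porter stemmer lowercases its input, so `Language` and `language` collapse into the same stem `languag`. A short sketch of that merging, using only words that appear in the exercise:

```python
from collections import Counter

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["Language", "language", "study", "Processing"]

# Map each surface form to its stem
stems = {word: stemmer.stem(word) for word in words}
print(stems)

# Count how many surface forms share each stem
stem_counts = Counter(stems.values())
print(stem_counts["languag"])  # 2: both casings map to the same stem
```

Stems such as `languag` or `studi` are not dictionary words; they are just shared prefixes that let different forms of a word be counted together.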


Apply lemmatization and store the result in `lemmatized_tokens`.

In [10]:
# your code here

# Apply lemmatization to each word in the filtered_tokens list
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

In [11]:
print("Lemmatized Tokens:", lemmatized_tokens)

Lemmatized Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'field', 'study', '!', 'involves', 'analyzing', 'understanding', 'human', 'language', '.']


## **3. Feature Extraction**

### **Exercise 3: Bag of Words (BoW)**

Use the `CountVectorizer` from `scikit-learn` to create a Bag of Words representation of the following corpus:

In [12]:
corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

In [13]:
# your code here
# Step 1: Initialize the CountVectorizer (already imported in the setup cell)
# min_df=1 keeps every term that appears in at least one document (the default)
vectorizer = CountVectorizer(min_df=1)

# Step 2: Fit and transform the corpus into a BoW representation
X = vectorizer.fit_transform(corpus)

In [14]:
print("Bag of Words:\n", X.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())

Bag of Words:
 [[0 0 0 0 0 1 0 1 0]
 [1 0 0 1 0 0 0 1 0]
 [0 1 1 0 1 0 1 1 1]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']


### **Exercise 4: TF-IDF**

Use the `TfidfVectorizer` from `scikit-learn` to create a TF-IDF representation of the same corpus. Store the result in `X_tfidf`.

In [18]:
# your code here
# Step 1: Initialize the TfidfVectorizer (already imported in the setup cell)
tfidf_vectorizer = TfidfVectorizer(min_df=1)

# Step 2: Fit and transform the corpus into a TF-IDF representation
X_tfidf = tfidf_vectorizer.fit_transform(corpus)


In [22]:
print("TF-IDF:\n", X_tfidf.toarray())
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())

TF-IDF:
 [[0.         0.         0.         0.         0.         0.861037
  0.         0.50854232 0.        ]
 [0.65249088 0.         0.         0.65249088 0.         0.
  0.         0.38537163 0.        ]
 [0.         0.43238509 0.43238509 0.         0.43238509 0.
  0.43238509 0.2553736  0.43238509]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
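To see where these numbers come from, the first row can be reproduced by hand. The sketch below assumes the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is then scaled to unit length. The document frequencies are read off the Bag of Words matrix: "love" occurs in 1 of 3 documents and "nlp" in all 3.

```python
import math

n_docs = 3
doc_freq = {"love": 1, "nlp": 3}     # number of documents containing each term
term_counts = {"love": 1, "nlp": 1}  # raw counts in "I love NLP." ("I" is dropped)

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

# Raw TF-IDF weight, then L2-normalize the row
raw = {term: term_counts[term] * idf[term] for term in term_counts}
norm = math.sqrt(sum(value * value for value in raw.values()))
tfidf = {term: value / norm for term, value in raw.items()}

print({term: round(value, 6) for term, value in tfidf.items()})
```

The two values match the non-zero entries of the first row printed above (about 0.861 for "love" and 0.509 for "nlp"): the rarer word receives the larger weight.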


### **Exercise 5: N-grams**

Generate bigrams (2-grams) from the corpus using `CountVectorizer`. Store the result in `X_bigram`.

In [25]:
# your code here
# Step 1: Initialize the CountVectorizer with ngram_range=(2, 2)
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))

# Step 2: Fit and transform the corpus into a bigram representation

X_bigram = bigram_vectorizer.fit_transform(corpus)



In [26]:
print("Bigrams:\n", X_bigram.toarray())
print("Bigram Vocabulary:", bigram_vectorizer.get_feature_names_out())

Bigrams:
 [[0 0 0 0 1 0 0 0]
 [0 0 1 0 0 0 1 0]
 [1 1 0 1 0 1 0 1]]
Bigram Vocabulary: ['enjoy learning' 'in nlp' 'is amazing' 'learning new' 'love nlp'
 'new things' 'nlp is' 'things in']
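A bigram is simply a pair of adjacent tokens. The following sketch rebuilds the bigrams of one sentence in plain Python to make that explicit. The helper name `bigrams` and the regular expression are illustrative assumptions that approximate `CountVectorizer`'s default tokenization (lowercasing, tokens of at least two word characters):

```python
import re


def bigrams(text):
    """Return adjacent token pairs, joined by a space."""
    # Approximates CountVectorizer's defaults: lowercase, tokens of 2+ word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # zip pairs each token with its successor
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


print(bigrams("I enjoy learning new things in NLP."))
# ['enjoy learning', 'learning new', 'new things', 'things in', 'in nlp']
```

Because "I" is removed before pairing, there is no "i enjoy" bigram, which is consistent with the vocabulary printed above.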


## **4. Advanced Exercise: Custom Preprocessing Pipeline**

### **Exercise 6: Build a Custom Preprocessing Pipeline**

Combine all the preprocessing steps (tokenization, stop word removal, punctuation removal, stemming/lemmatization) into a single function. 

In [36]:
# your code here
def text_preprocessing_pipeline(text):
    """Tokenize, drop stop words and non-alphabetic tokens, then lemmatize."""
    # Step 1: Tokenize the text
    tokens = word_tokenize(text)

    # Step 2: Remove stop words (case-insensitive match)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

    # Step 3: Remove punctuation by keeping purely alphabetic tokens.
    # Note: this also drops tokens that contain digits, hyphens or underscores.
    tokens_no_punct = [word for word in filtered_tokens if word.isalpha()]

    # Alternative (keeps such tokens): strip punctuation characters from each
    # token instead of discarding the whole token.
    # translator = str.maketrans('', '', string.punctuation)
    # no_punct_tokens = [word.translate(translator) for word in filtered_tokens]
    # # Remove any empty strings left after stripping punctuation
    # no_punct_tokens = [word for word in no_punct_tokens if word]

    # Step 4: Lemmatize each remaining token (default part of speech: noun)
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens_no_punct]

    return lemmatized_tokens


Apply this function to the following text.

In [37]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."

# your code here

processed_text = text_preprocessing_pipeline(text)

In [38]:
print("Processed Text:", processed_text)

Processed Text: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']
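The pipeline removes punctuation with `str.isalpha()`, which discards any token containing a non-letter character. The alternative mentioned in the function's comments, `str.translate` with `string.punctuation`, strips the punctuation characters but keeps the rest of the token. The sketch below contrasts the two on a hand-written token list (hypothetical input, not produced by `word_tokenize`):

```python
import string

# Hand-written tokens for illustration
tokens = ["state-of-the-art", "NLP", "3D", "!", "e-mail"]

# Approach used in the pipeline: keep only purely alphabetic tokens
alpha_only = [token for token in tokens if token.isalpha()]

# Alternative: delete punctuation characters, then drop tokens that became empty
translator = str.maketrans("", "", string.punctuation)
stripped = [token.translate(translator) for token in tokens]
stripped = [token for token in stripped if token]

print(alpha_only)  # ['NLP']
print(stripped)    # ['stateoftheart', 'NLP', '3D', 'email']
```

Neither approach is strictly better: `isalpha()` is simpler but loses hyphenated and alphanumeric tokens, while `translate` keeps them at the cost of gluing their parts together.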


## **5. Evaluation of Preprocessing Techniques**

### **Exercise 7: Compare Preprocessing Techniques**

Compare the results of stemming and lemmatization on the following sentence. Store the results in `stemmed_tokens` and `lemmatized_tokens`.

In [39]:
sentence = "The cats are playing with the mice in the garden."
# your code here
# Step 1: Tokenize and preprocess the sentence

# A) Tokenize
tokens = word_tokenize(sentence)
# B) Remove stop words and store the result in filtered_tokens
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
# C) Remove punctuation (same method as in the pipeline above)
tokens_no_punct = [word for word in filtered_tokens if word.isalpha()]

# Step 2: Apply stemming to each punctuation-free token
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in tokens_no_punct]

# Step 3: Apply lemmatization to each punctuation-free token
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens_no_punct]



In [40]:
print("Original Tokens:", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)

Original Tokens: ['cats', 'playing', 'mice', 'garden', '.']
Stemmed Tokens: ['cat', 'play', 'mice', 'garden']
Lemmatized Tokens: ['cat', 'playing', 'mouse', 'garden']
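In the output above the lemmatizer turns "mice" into "mouse" but leaves "playing" unchanged. `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, so verb forms are only reduced when the verb tag is passed explicitly. The setup cell imports `pos_tag`, which returns Penn Treebank tags; a common approach is to map those tags to WordNet's single-letter codes. The helper below is a minimal sketch of that mapping (the name `penn_to_wordnet` and the hand-written tags are assumptions for illustration):

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun, like lemmatize() itself


# Hand-written (word, tag) pairs; in the lab these would come from pos_tag(tokens)
tagged = [("cats", "NNS"), ("playing", "VBG"), ("mice", "NNS"), ("garden", "NN")]
pos_letters = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pos_letters)
# [('cats', 'n'), ('playing', 'v'), ('mice', 'n'), ('garden', 'n')]

# Intended use with the lab's lemmatizer:
#   lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
```

With the verb tag supplied, `lemmatizer.lemmatize("playing", pos="v")` returns "play", which brings the lemmatized output closer to the stemmed one while still producing dictionary words.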


## **6. Real-World Dataset: Sentiment Analysis**

### **Exercise 8: Preprocess and Analyze Tweets**

In this exercise, you will work with a real-world dataset of tweets. The dataset contains 5000 positive and 5000 negative tweets. Your task is to preprocess the tweets and extract features for sentiment analysis.


In [41]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/guillermo/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

In [42]:
# Load the dataset
from nltk.corpus import twitter_samples

Load the dataset of positive and negative tweets. 

In [43]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

Combine them into a single list called `all_tweets` and create a corresponding list of labels called `labels`.

In [44]:
# your code here

# Combine the datasets
all_tweets = positive_tweets + negative_tweets

# Create labels: 1 for positive, 0 for negative
# The first len(positive_tweets) are positive, the rest are negative
labels = [1] * len(positive_tweets) + [0] * len(negative_tweets)

In [45]:
# Print a sample tweet
print("Sample Tweet:", all_tweets[0])
print("Label:", labels[0])

Sample Tweet: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
Label: 1


### **Exercise 9: Preprocess Tweets**

Apply the custom preprocessing pipeline to the entire dataset of tweets. Store the result in `preprocessed_tweets`.

In [46]:
# Step 1: Apply the preprocessing pipeline to all tweets
# your code here

# Apply the function to every tweet in the all_tweets list
preprocessed_tweets = [text_preprocessing_pipeline(tweet) for tweet in all_tweets]

In [47]:
# Print a sample preprocessed tweet
print("Preprocessed Tweets Sample:", preprocessed_tweets[0])

Preprocessed Tweets Sample: ['FollowFriday', 'top', 'engaged', 'member', 'community', 'week']
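Compared with the raw tweet, the preprocessed sample has lost the `:)` emoticon and the `#` of the hashtag, because `word_tokenize` splits them off as punctuation and the `isalpha()` filter then removes them. For tweets, NLTK also provides `TweetTokenizer`, which keeps emoticons and hashtags as single tokens and can strip @-handles. The sketch below is one possible configuration, shown as an alternative rather than as part of the lab's pipeline:

```python
from nltk.tokenize import TweetTokenizer

tweet = (
    "#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top "
    "engaged members in my community this week :)"
)

# preserve_case=False lowercases words, strip_handles=True removes @usernames,
# reduce_len=True shortens long character repetitions
tweet_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
tweet_tokens = tweet_tokenizer.tokenize(tweet)
print(tweet_tokens)
```

To benefit from this in the pipeline, the punctuation step would also need to change, since `isalpha()` rejects both `:)` and `#followfriday`.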


### **Exercise 10: Feature Extraction on Tweets**

Extract features from the preprocessed tweets using **Bag of Words** and **TF-IDF**. Store the results in `X_bow` and `X_tfidf`, respectively.

In [48]:
# your code here
# Step 1: Create a Bag of Words representation
# First, convert each list of tokens back into a string
corpus_strings = [' '.join(tweet) for tweet in preprocessed_tweets]

# Now, initialize the vectorizer and fit it
bow_vectorizer = CountVectorizer()
X_bow = bow_vectorizer.fit_transform(corpus_strings)


In [49]:
# Step 2: Create a TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus_strings)

## **7. Conclusion**

In this lab, you explored a wide range of NLP techniques, from basic text preprocessing to advanced feature extraction and analysis. You also worked with a real-world dataset of tweets and applied your knowledge to preprocess and extract features for sentiment analysis.



In [50]:
# --- THE NEXT STEP: TRAINING A MODEL ---
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Choose your features (X) and labels (y)
# Use the TF-IDF features: they down-weight words that appear in many tweets
X = X_tfidf  # TF-IDF matrix (the features)
y = labels   # List of 1s and 0s (the target we want to predict)

# 2. Split the data into Training and Testing sets
# We train the model on one part and test its accuracy on a part it has never seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Choose and train a Machine Learning model
# Logistic Regression is a standard baseline for binary (positive/negative) classification.
model = LogisticRegression(max_iter=1000)  # raise the iteration cap so the solver has room to converge
model.fit(X_train, y_train)  # This is the line that "trains" the model!

# 4. Use the model to make predictions on the test set
y_pred = model.predict(X_test)

# 5. See how accurate our model is!
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2%}") # This will print a percentage like "Model Accuracy: 75.50%"

Model Accuracy: 75.05%
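In the cell above, the TF-IDF vectorizer was fitted on all tweets before the train/test split, so its vocabulary and IDF weights were computed with the test tweets included. A common way to keep the test set fully unseen is to split the raw strings first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that the vectorizer is fitted on the training part only. The sketch below uses a tiny, made-up corpus (hypothetical data, not the tweet dataset) to show the pattern:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for corpus_strings and labels
texts = [
    "love great day", "happy thanks friends", "fun excited today", "good night love",
    "sad pray miss", "bad awful day", "miss you sad", "hate this awful",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw strings first, keeping the class balance in both parts
train_texts, test_texts, train_y, test_y = train_test_split(
    texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits the vectorizer on the training texts only
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(train_texts, train_y)
predictions = pipeline.predict(test_texts)

vocabulary = pipeline.named_steps["tfidfvectorizer"].vocabulary_
print(len(vocabulary), "terms learned from the training texts")
```

On a dataset as large as the tweet corpus the effect on accuracy is usually small, but the pipeline form also makes it easy to apply the exact same preprocessing to new text with `pipeline.predict`.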


In [52]:
# --- INSPECTING THE PREDICTIONS ---

# 1. Recover the original text of the test tweets.
# X and y were split into train and test sets above. Splitting an index range
# with the same test_size and random_state reproduces that exact shuffle, so
# indices_test[i] is the position in 'all_tweets' of the i-th row of X_test.
_, _, _, _, indices_train, indices_test = train_test_split(X, y, range(len(y)), test_size=0.2, random_state=42)

# 2. Now, let's create a DataFrame to view everything neatly
# (pandas was already imported as pd in the setup cell)

# Create a list of results for the test set
results = []
for i, index_in_original_data in enumerate(indices_test):
    original_tweet = all_tweets[index_in_original_data]
    cleaned_tweet = corpus_strings[index_in_original_data] # The preprocessed version we used
    true_sentiment = "Positive" if y_test[i] == 1 else "Negative"
    predicted_sentiment = "Positive" if y_pred[i] == 1 else "Negative"
    results.append({
        "Original Tweet": original_tweet,
        "Cleaned Words": cleaned_tweet,
        "True Sentiment": true_sentiment,
        "Predicted Sentiment": predicted_sentiment
    })

# Convert the list to a Pandas DataFrame for a nice table view
results_df = pd.DataFrame(results)

# 3. Print the overall accuracy again and show the first 15 test samples
print(f"\nModel Accuracy on Test Set: {accuracy:.2%}\n")
print("Sample Predictions from the Test Set:")
pd.set_option('display.max_colwidth', None) # This ensures we can see the full tweet text
display(results_df.head(15))


Model Accuracy on Test Set: 75.05%

Sample Predictions from the Test Set:


| | Original Tweet | Cleaned Words | True Sentiment | Predicted Sentiment |
|---|---|---|---|---|
| 0 | I love you, how but you? @Taecyeon2pm8 did you feel the same? Emm I think not :( | love feel Emm think | Negative | Negative |
| 1 | @mayusushita @dildeewana_ @sonalp2591 @deepti_ahmd @armansushita8 Thanks Guys :) | mayusushita Thanks Guys | Positive | Positive |
| 2 | Your love, O Lord, is better than life. :) &lt;3 https://t.co/KPCeYJqKLM | love Lord better life lt http | Positive | Positive |
| 3 | @yasminyasir96 yeah but it will be better if we use her official Account :) Like The Other @PracchiNDesai ❤️ | yeah better use official Account Like PracchiNDesai | Positive | Positive |
| 4 | Ok good night I wish troye wasn't ugly and I met him today:)():)!:!; but ok today was fun I'm excited for tmrw!! | Ok good night wish troye ugly met today ok today fun excited tmrw | Positive | Positive |
| 5 | @scottybev I'm not surprised, that sounds hellish! Why would you do such a thing? :( | scottybev surprised sound hellish would thing | Negative | Positive |
| 6 | Dry, hot, scorching summer #FF :) @infocffm @MediationMK @ExeterMediation @KentFMS @EssexMediation | Dry hot scorching summer FF infocffm MediationMK ExeterMediation KentFMS EssexMediation | Positive | Positive |
| 7 | @hanbined sad pray for me :((( | hanbined sad pray | Negative | Negative |
| 8 | Popol day too :( | Popol day | Negative | Positive |
| 9 | My Song of the Week is Ducktails - Surreal Exposure #SOTW https://t.co/BeXVWh7zIR Jingly jangly loveliness! :-) | Song Week Ducktails Surreal Exposure SOTW http Jingly jangly loveliness | Positive | Positive |
