# **Comprehensive NLP Lab: From Preprocessing to Feature Extraction**

In this lab, you will explore a wide range of Natural Language Processing (NLP) techniques, from basic text preprocessing to advanced feature extraction and analysis. By the end of this lab, you will be able to:

1. **Tokenize** and preprocess text data.
2. Remove **stop words** and **punctuation**.
3. Apply **stemming** and **lemmatization**.
4. Extract features using **Bag of Words (BoW)** and **TF-IDF**.
5. Generate **n-grams** to capture contextual information.
6. Evaluate the impact of different preprocessing techniques on text data.

Let's dive in!

## **1. Setup the Environment**


Before we begin, ensure you have the necessary libraries installed. Run the following cell to install them:


In [1]:
!pip install nltk scikit-learn pandas matplotlib




Now, import the required libraries:

In [2]:

import nltk
import re
import string
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [3]:
# Download NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt_tab')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

## **2. Text Preprocessing**

### **Exercise 1: Tokenization and Stop Word Removal**

Tokenize the following text:

In [15]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."
# your code here

set_sentence = nltk.word_tokenize(text)

print(set_sentence)



['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'of', 'study', '!', 'It', 'involves', 'analyzing', 'and', 'understanding', 'human', 'language', '.']


Remove stop words and store the result in a variable called `filtered_tokens`.

In [32]:
stop_words = set(stopwords.words('english'))

filtered_tokens = [word for word in set_sentence if word.lower() not in stop_words]

print("Filtered Tokens:", filtered_tokens)


Filtered Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'field', 'study', '!', 'involves', 'analyzing', 'understanding', 'human', 'language', '.']
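
The filtered list above still contains punctuation tokens such as `(` and `!`. A minimal sketch of one way to drop them, assuming the token list is copied from the output above; the name `tokens_no_punct` is illustrative and not part of the lab:

```python
import string

# Tokens copied from the "Filtered Tokens" output above.
filtered_tokens = ['Natural', 'Language', 'Processing', '(', 'NLP', ')',
                   'fascinating', 'field', 'study', '!', 'involves',
                   'analyzing', 'understanding', 'human', 'language', '.']

# Keep a token only if it has at least one non-punctuation character.
punctuation = set(string.punctuation)
tokens_no_punct = [
    tok for tok in filtered_tokens
    if not all(ch in punctuation for ch in tok)
]

print("Without punctuation:", tokens_no_punct)
```

This removes tokens made up entirely of punctuation and leaves tokens such as `NLP` untouched.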


### **Exercise 2: Stemming and Lemmatization**

Apply stemming and lemmatization to the `filtered_tokens`. Compare the results.

Apply stemming and store the result in `stemmed_tokens`.

In [34]:
def apply_stemming(tokens):
    """Applies stemming to a list of tokens."""
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in tokens]


stemmed_tokens = apply_stemming(filtered_tokens)

print("Stemmed Tokens:", stemmed_tokens)


Stemmed Tokens: ['natur', 'languag', 'process', '(', 'nlp', ')', 'fascin', 'field', 'studi', '!', 'involv', 'analyz', 'understand', 'human', 'languag', '.']


Apply lemmatization and store the result in `lemmatized_tokens`.

In [35]:
# your code here
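def lemmatization(tokens):
    """Lemmatizes a list of tokens, treating every token as a verb (pos="v")."""
    lemmatizer = WordNetLemmatizer()
    # pos="v" reduces verb forms such as 'involves' to 'involve'
    lemmatized = [lemmatizer.lemmatize(word, pos="v") for word in tokens]
    return lemmatized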

lemmatized_tokens = lemmatization(filtered_tokens)



In [36]:
print("Lemmatized Tokens:", lemmatized_tokens)

Lemmatized Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinate', 'field', 'study', '!', 'involve', 'analyze', 'understand', 'human', 'language', '.']
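
To make the comparison concrete, the sketch below lines up the two recorded outputs and lists only the tokens where the stem and the lemma disagree. The three lists are copied from the outputs above; the `differences` name and the case-insensitive comparison are choices made for this illustration.

```python
# Lists copied from the recorded outputs of Exercises 1 and 2.
original = ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating',
            'field', 'study', '!', 'involves', 'analyzing', 'understanding',
            'human', 'language', '.']
stemmed = ['natur', 'languag', 'process', '(', 'nlp', ')', 'fascin', 'field',
           'studi', '!', 'involv', 'analyz', 'understand', 'human', 'languag', '.']
lemmatized = ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinate',
              'field', 'study', '!', 'involve', 'analyze', 'understand', 'human',
              'language', '.']

# The stemmer lowercases its output, so compare case-insensitively.
differences = [
    (orig, stem, lemma)
    for orig, stem, lemma in zip(original, stemmed, lemmatized)
    if stem != lemma.lower()
]

for orig, stem, lemma in differences:
    print(f"{orig:<15} stem={stem:<12} lemma={lemma}")
```

In this sample the stems are often truncated forms that are not dictionary words (`natur`, `languag`, `studi`), while the lemmas remain valid words (`fascinate`, `involve`, `analyze`).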


## **3. Feature Extraction**

### **Exercise 3: Bag of Words (BoW)**

Use the `CountVectorizer` from `scikit-learn` to create a Bag of Words representation of the following corpus:

In [37]:
corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

In [41]:

# your code here
# Step 1: Initialize the CountVectorizer
vectorizer = CountVectorizer()


# Step 2: Fit and transform the corpus into a BoW representation
X = vectorizer.fit_transform(corpus)

# Convert BoW matrix to a Pandas DataFrame for readability
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())


In [42]:
print("Bag of Words Representation:\n", bow_df)


Bag of Words Representation:
    amazing  enjoy  in  is  learning  love  new  nlp  things
0        0      0   0   0         0     1    0    1       0
1        1      0   0   1         0     0    0    1       0
2        0      1   1   0         1     0    1    1       1
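
Note that `I` does not appear as a column. The default `token_pattern` of `CountVectorizer` only keeps tokens of two or more word characters, and input is lowercased by default. The sketch below reproduces the same counts with the standard library only, as an illustration of what the vectorizer computes rather than its actual implementation.

```python
import re
from collections import Counter

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]

# Same idea as CountVectorizer's default token pattern: 2+ word characters.
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in corpus]

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# One row per document, one column per vocabulary term.
bow = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

print(vocabulary)
for row in bow:
    print(row)
```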


### **Exercise 4: TF-IDF**

Use the `TfidfVectorizer` from `scikit-learn` to create a TF-IDF representation of the same corpus. Store the result in `X_tfidf`

In [49]:
# your code here
# Step 1: Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Step 2: Fit and transform the corpus into a TF-IDF representation
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

# Convert the TF-IDF matrix to a Pandas DataFrame for readability
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())



In [54]:
#print("TF-IDF:\n", X_tfidf)
#print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())


print(tfidf_df)

    amazing     enjoy        in        is  learning      love       new  \
0  0.000000  0.000000  0.000000  0.000000  0.000000  0.861037  0.000000   
1  0.652491  0.000000  0.000000  0.652491  0.000000  0.000000  0.000000   
2  0.000000  0.432385  0.432385  0.000000  0.432385  0.000000  0.432385   

        nlp    things  
0  0.508542  0.000000  
1  0.385372  0.000000  
2  0.255374  0.432385  
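
The values above can be checked by hand. With its default settings, `TfidfVectorizer` uses a smoothed inverse document frequency, $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$, multiplies it by the raw term count, and then scales each row to unit L2 norm. The sketch below applies those steps to the Bag of Words counts from Exercise 3; it is a hand calculation for illustration, not the library's code.

```python
import math

vocabulary = ['amazing', 'enjoy', 'in', 'is', 'learning', 'love', 'new', 'nlp', 'things']
# Counts copied from the Bag of Words table in Exercise 3.
counts = [
    [0, 0, 0, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 1, 1, 1],
]

n_docs = len(counts)
# Document frequency: number of documents containing each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]
# Smoothed IDF.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])  # L2-normalize each document

for row in tfidf:
    print([round(v, 6) for v in row])
```

Because `nlp` occurs in every document, its IDF is the minimum value of 1, which is why it receives the lowest non-zero weight in each row.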


### **Exercise 5: N-grams**

Generate **bigrams (2-grams)** from the corpus using `CountVectorizer`. Store the result in `X_bigram`.

In [57]:
# your code here
# Step 1: Initialize the CountVectorizer with ngram_range=(2, 2)
bigram_vectorizer = CountVectorizer(ngram_range=(2,2))
# Step 2: Fit and transform the corpus into a bigram representation

X_bigram = bigram_vectorizer.fit_transform(corpus)



In [58]:
print("Bigrams:\n", X_bigram.toarray())
print("Bigram Vocabulary:", bigram_vectorizer.get_feature_names_out())

Bigrams:
 [[0 0 0 0 1 0 0 0]
 [0 0 1 0 0 0 1 0]
 [1 1 0 1 0 1 0 1]]
Bigram Vocabulary: ['enjoy learning' 'in nlp' 'is amazing' 'learning new' 'love nlp'
 'new things' 'nlp is' 'things in']
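
A bigram is simply a pair of adjacent tokens. The following standard-library sketch rebuilds the same vocabulary by pairing each token with its successor; the helper name `bigrams` is illustrative, and the token pattern mirrors the vectorizer's default of two or more word characters.

```python
import re

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP.",
]

def bigrams(doc):
    """Return the adjacent token pairs of a document as space-joined strings."""
    tokens = re.findall(r"\b\w\w+\b", doc.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

bigram_vocabulary = sorted({bg for doc in corpus for bg in bigrams(doc)})
print(bigram_vocabulary)
```

Since the single-character token `I` is dropped before pairing, the first sentence yields only one bigram, `love nlp`.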


## **4. Advanced Exercise: Custom Preprocessing Pipeline**

### **Exercise 6: Build a Custom Preprocessing Pipeline**

Combine all the preprocessing steps (tokenization, stop word removal, punctuation removal, stemming/lemmatization) into a single function.

In [147]:
def text_preprocessing_pipeline(text):
    """Tokenize, drop stop words and punctuation, lemmatize, and rejoin."""
    # Tokenization (lowercased so stop-word matching is case-insensitive)
    tokens = word_tokenize(text.lower())

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Remove punctuation characters, then drop tokens that end up empty
    pattern = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [pattern.sub('', token) for token in tokens]
    tokens = [token for token in tokens if token]

    # Apply lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return ' '.join(tokens)  # Return cleaned text

Apply this function to the following text:

In [71]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."

# your code here

processed_text = text_preprocessing_pipeline(text)

In [72]:
print("Processed Text:", processed_text)

Processed Text: natural language processing nlp fascinating field study involves analyzing understanding human language
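
The punctuation step in the pipeline relies on a character class built from `string.punctuation`. The sketch below isolates that step on a few tweet-like tokens (chosen here for illustration) to show two effects: tokens that are pure punctuation become empty strings and need to be filtered out, and punctuation inside a token, such as an underscore, is stripped as well.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
pattern = re.compile('[%s]' % re.escape(string.punctuation))

# Illustrative tokens, similar to what a tokenized tweet contains.
sample_tokens = ['#', 'followfriday', '@', 'france_inte', 'top', ':', ')']

stripped = [pattern.sub('', tok) for tok in sample_tokens]
cleaned = [tok for tok in stripped if tok]  # drop tokens that became empty

print("Stripped:", stripped)
print("Cleaned: ", cleaned)
```

If handles or hashtags should survive intact, a whole-token filter (dropping only tokens made up entirely of punctuation) may be a better fit than character stripping.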


## **5. Evaluation of Preprocessing Techniques**

### **Exercise 7: Compare Preprocessing Techniques**

Compare the results of stemming and lemmatization on the following sentence. Store the results in `stemmed_tokens` and `lemmatized_tokens`

In [73]:
sentence = "The cats are playing with the mice in the garden."
# your code here
# Step 1: Tokenize the lowercased sentence and store the result in tokens
tokens = word_tokenize(sentence.lower())

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming
stemmed_tokens = [stemmer.stem(word) for word in tokens]

# Lemmatization
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]

print("Original Tokens:", tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)


Original Tokens: ['the', 'cats', 'are', 'playing', 'with', 'the', 'mice', 'in', 'the', 'garden', '.']
Stemmed Tokens: ['the', 'cat', 'are', 'play', 'with', 'the', 'mice', 'in', 'the', 'garden', '.']
Lemmatized Tokens: ['the', 'cat', 'are', 'playing', 'with', 'the', 'mouse', 'in', 'the', 'garden', '.']


In [74]:
print("Original Tokens:", tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)

Original Tokens: ['the', 'cats', 'are', 'playing', 'with', 'the', 'mice', 'in', 'the', 'garden', '.']
Stemmed Tokens: ['the', 'cat', 'are', 'play', 'with', 'the', 'mice', 'in', 'the', 'garden', '.']
Lemmatized Tokens: ['the', 'cat', 'are', 'playing', 'with', 'the', 'mouse', 'in', 'the', 'garden', '.']


## **6. Real-World Dataset: Sentiment Analysis**

### **Exercise 8: Preprocess and Analyze Tweets**

In this exercise, you will work with a real-world dataset of tweets. The dataset contains 5000 positive and 5000 negative tweets. Your task is to preprocess the tweets and extract features for sentiment analysis.


In [137]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [149]:
# Import the corpus reader for the Twitter samples
from nltk.corpus import twitter_samples

Load the dataset of positive and negative tweets.

In [148]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

Combine them into a single list called `all_tweets` and create a corresponding list of labels called `labels`.

In [150]:
# your code here

# Combine the datasets
all_tweets = positive_tweets + negative_tweets

labels = [1] * len(positive_tweets) + [0] * len(negative_tweets)



In [151]:
# Print a sample tweet
print("Sample Tweet:", all_tweets[0])
print("Label:", labels[0])

Sample Tweet: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
Label: 1


### **Exercise 9: Preprocess Tweets**

Apply the custom preprocessing pipeline to the entire dataset of tweets. Store the result in `preprocessed_tweets`.

In [152]:
# Step 1: Apply the preprocessing pipeline to all tweets
preprocessed_tweets = [text_preprocessing_pipeline(tweet) for tweet in all_tweets]

# Initialize TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=500)  # Keep only the 500 most frequent terms

# Convert tweets to TF-IDF features
tfidf_matrix = vectorizer.fit_transform(preprocessed_tweets)



In [153]:
# Print a sample preprocessed tweet
print("Preprocessed Tweets Sample:", preprocessed_tweets[0])

Preprocessed Tweets Sample: followfriday franceinte pkuchly57 milipolparis top engaged member community week


### **Exercise 10: Feature Extraction on Tweets**

Extract features from the preprocessed tweets using **Bag of Words** and **TF-IDF**. Store the results in `X_bow` and `X_tfidf`, respectively.

In [154]:
# your code here
# Step 1: Create a Bag of Words representation
bow_vectorizer = CountVectorizer(max_features=500)
X_bow = bow_vectorizer.fit_transform(preprocessed_tweets)


# Step 2: Create a TF-IDF representation
tfidf_vectorizer = TfidfVectorizer(max_features=500)
X_tfidf = tfidf_vectorizer.fit_transform(preprocessed_tweets)

print("Bag of Words Shape:", X_bow.shape)
print("TF-IDF Shape:", X_tfidf.shape)
print("Sample Preprocessed Tweet:", preprocessed_tweets[0])



Bag of Words Shape: (10000, 500)
TF-IDF Shape: (10000, 500)
Sample Preprocessed Tweet: followfriday franceinte pkuchly57 milipolparis top engaged member community week
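
Both matrices have exactly 500 columns because `max_features=500` restricts the vocabulary to the terms with the highest total frequency across the corpus. A toy sketch of that selection, using three made-up documents and `k = 3` purely for illustration:

```python
from collections import Counter

# Hypothetical mini-corpus, already preprocessed into space-separated tokens.
docs = [
    "top engaged member community week",
    "great week great community",
    "sad week",
]

# Total frequency of each term across all documents.
term_counts = Counter(tok for doc in docs for tok in doc.split())

# Keep the k most frequent terms, then sort them to form the vocabulary.
k = 3
top_k_vocabulary = sorted(term for term, _ in term_counts.most_common(k))

print(term_counts)
print(top_k_vocabulary)
```

Terms outside the kept vocabulary are ignored when the documents are transformed, so rare tokens such as individual user handles are unlikely to appear among the 500 features.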


## **7. Conclusion**

In this lab, you explored a wide range of NLP techniques, from basic text preprocessing to advanced feature extraction and analysis. You also worked with a real-world dataset of tweets and applied your knowledge to preprocess and extract features for sentiment analysis.

