<a href="https://colab.research.google.com/github/RamziRBM/lab-py-nlp/blob/main/lab-py-nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Comprehensive NLP Lab: From Preprocessing to Feature Extraction**

In this lab, you will explore a wide range of Natural Language Processing (NLP) techniques, from basic text preprocessing to advanced feature extraction and analysis. By the end of this lab, you will be able to:

1. **Tokenize** and preprocess text data.
2. Remove **stop words** and **punctuation**.
3. Apply **stemming** and **lemmatization**.
4. Extract features using **Bag of Words (BoW)** and **TF-IDF**.
5. Generate **n-grams** to capture contextual information.
6. Evaluate the impact of different preprocessing techniques on text data.

Let's dive in!

## **1. Setup the Environment**


Before we begin, ensure you have the necessary libraries installed. Run the following cell to install them:


In [1]:
!pip install nltk scikit-learn pandas matplotlib




Now, import the required libraries:

In [2]:

import nltk
import re
import string
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [3]:
# Download NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

## **2. Text Preprocessing**

### **Exercise 1: Tokenization and Stop Word Removal**

Tokenize the following text:

In [4]:
text = ["Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."]
tokenized_docs = [word_tokenize(doc) for doc in text]
for doc in tokenized_docs:
    print(doc)

['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'of', 'study', '!', 'It', 'involves', 'analyzing', 'and', 'understanding', 'human', 'language', '.']


Remove stop words and store the result in a variable called `filtered_tokens`.

In [5]:
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokenized_docs[0] if word.lower() not in stop_words]

In [6]:
print("Filtered Tokens:", filtered_tokens)

Filtered Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'field', 'study', '!', 'involves', 'analyzing', 'understanding', 'human', 'language', '.']
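
The filtered tokens above still contain punctuation such as `(`, `)`, `!` and `.`. As a minimal sketch of the punctuation-removal step, assuming it is enough to drop tokens made up entirely of ASCII punctuation, the list can be filtered with the standard `string` module. The name `no_punct_tokens` is hypothetical and is not used elsewhere in the lab.

```python
import string

# Tokens copied from the "Filtered Tokens" output above.
filtered_tokens = ['Natural', 'Language', 'Processing', '(', 'NLP', ')',
                   'fascinating', 'field', 'study', '!', 'involves',
                   'analyzing', 'understanding', 'human', 'language', '.']

PUNCTUATION = set(string.punctuation)

# Hypothetical name: keep a token unless every character in it is punctuation.
no_punct_tokens = [
    tok for tok in filtered_tokens
    if not all(ch in PUNCTUATION for ch in tok)
]

print("Without punctuation:", no_punct_tokens)
```

Checking every character rather than testing `tok in string.punctuation` also drops multi-character punctuation tokens such as the double backticks that `word_tokenize` can emit for quotes.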


### **Exercise 2: Stemming and Lemmatization**

Apply stemming and lemmatization to the `filtered_tokens`. Compare the results.

In [7]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

Apply stemming and store the result in `stemmed_tokens`.

In [8]:
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

In [9]:
print("Stemmed Tokens:", stemmed_tokens)

Stemmed Tokens: ['natur', 'languag', 'process', '(', 'nlp', ')', 'fascin', 'field', 'studi', '!', 'involv', 'analyz', 'understand', 'human', 'languag', '.']


Apply lemmatization and store the result in `lemmatized_tokens`.

In [10]:
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

In [11]:
print("Lemmatized Tokens:", lemmatized_tokens)

Lemmatized Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'field', 'study', '!', 'involves', 'analyzing', 'understanding', 'human', 'language', '.']
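
The lemmatized tokens above are identical to `filtered_tokens`. A likely reason is that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, so verb forms such as `involves` and `analyzing` are left alone. The sketch below shows one common way to supply that argument: a small helper (the name `penn_to_wordnet` is hypothetical) that maps the Penn Treebank tags returned by `pos_tag` to the single-letter WordNet categories.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter.

    Hypothetical helper for illustration; anything that is not an
    adjective, verb, or adverb falls back to noun ('n').
    """
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun (also the lemmatizer's own default)


# Possible use with the objects defined earlier in this notebook
# (left as comments because it needs the NLTK data downloaded above):
# tagged = pos_tag(filtered_tokens)
# pos_lemmas = [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
#               for word, tag in tagged]

print(penn_to_wordnet('VBG'), penn_to_wordnet('NNS'))
```

With part-of-speech information, lemmatization and stemming become easier to compare, since both then act on verbs as well as nouns.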


## **3. Feature Extraction**

### **Exercise 3: Bag of Words (BoW)**

Use the `CountVectorizer` from `scikit-learn` to create a Bag of Words representation of the following corpus:

In [12]:
corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

In [13]:

# your code here
# Step 1: Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Step 2: Fit and transform the corpus into a BoW representation
X = vectorizer.fit_transform(corpus)

In [14]:
print("Bag of Words:\n", X.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())

Bag of Words:
 [[0 0 0 0 0 1 0 1 0]
 [1 0 0 1 0 0 0 1 0]
 [0 1 1 0 1 0 1 1 1]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
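
Note that `i` is missing from the vocabulary above. With its default settings, `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters. The following pure-Python sketch reproduces the same matrix by hand under those assumptions; `tokenize`, `vocabulary` and `bow_matrix` are illustrative names, not part of scikit-learn.

```python
import re
from collections import Counter

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

# Same regular expression as CountVectorizer's default token_pattern:
# runs of two or more word characters, so the one-letter "I" is skipped.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())


tokenized = [tokenize(doc) for doc in corpus]

# Columns are the sorted unique tokens, as in get_feature_names_out().
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# One row per document, one count per vocabulary term.
bow_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    bow_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```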


### **Exercise 4: TF-IDF**

Use the `TfidfVectorizer` from `scikit-learn` to create a TF-IDF representation of the same corpus. Store the result in `X_tfidf`.

In [15]:
# your code here
# Step 1: Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Step 2: Fit and transform the corpus into a TF-IDF representation
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

In [16]:
print("TF-IDF:\n", X_tfidf.toarray())
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())

TF-IDF:
 [[0.         0.         0.         0.         0.         0.861037
  0.         0.50854232 0.        ]
 [0.65249088 0.         0.         0.65249088 0.         0.
  0.         0.38537163 0.        ]
 [0.         0.43238509 0.43238509 0.         0.43238509 0.
  0.43238509 0.2553736  0.43238509]]
Vocabulary: ['amazing' 'enjoy' 'in' 'is' 'learning' 'love' 'new' 'nlp' 'things']
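
To see where these numbers come from, here is a hand computation that should match `TfidfVectorizer` under its default settings: raw term counts, a smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each row. It is a sketch for checking the output above, not a replacement for the library; all names in it are illustrative.

```python
import math
import re
from collections import Counter

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]

# Default-style tokenization: lowercase, tokens of 2+ word characters.
tokenized = [re.findall(r"(?u)\b\w\w+\b", doc.lower()) for doc in corpus]
vocabulary = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(corpus)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in tokenized for term in set(doc))

# Smoothed IDF, as if one extra document contained every term once.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}


def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    return [v / norm if norm else 0.0 for v in raw]


tfidf_matrix = [tfidf_row(doc) for doc in tokenized]

for row in tfidf_matrix:
    print([round(v, 6) for v in row])
```

Because `nlp` occurs in all three documents, its IDF is the minimum value of 1, which is why it has the lowest weight in every row even though it appears as often as the other terms.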


### **Exercise 5: N-grams**

Generate bigrams (2-grams) from the corpus using `CountVectorizer`. Store the result in `X_bigram`.

In [17]:
# your code here
# Step 1: Initialize the CountVectorizer with ngram_range=(2, 2)
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))

# Step 2: Fit and transform the corpus into a bigram representation
X_bigram = bigram_vectorizer.fit_transform(corpus)

In [18]:
print("Bigrams:\n", X_bigram.toarray())
print("Bigram Vocabulary:", bigram_vectorizer.get_feature_names_out())

Bigrams:
 [[0 0 0 0 1 0 0 0]
 [0 0 1 0 0 0 1 0]
 [1 1 0 1 0 1 0 1]]
Bigram Vocabulary: ['enjoy learning' 'in nlp' 'is amazing' 'learning new' 'love nlp'
 'new things' 'nlp is' 'things in']
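
The bigram vocabulary above can be reproduced with a sliding window over each document's tokens. The sketch below assumes the same lowercasing and two-or-more-character tokenization as the default `CountVectorizer`; the `ngrams` helper is a hypothetical name written for this illustration.

```python
import re

corpus = [
    "I love NLP.",
    "NLP is amazing.",
    "I enjoy learning new things in NLP."
]


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokenized = [re.findall(r"(?u)\b\w\w+\b", doc.lower()) for doc in corpus]
bigrams_per_doc = [ngrams(tokens, 2) for tokens in tokenized]
bigram_vocabulary = sorted({bg for doc in bigrams_per_doc for bg in doc})

for doc in bigrams_per_doc:
    print(doc)
print(bigram_vocabulary)
```

Since the single-letter token `I` is dropped before the window is applied, no bigram such as "i love" appears in the vocabulary.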


## **4. Advanced Exercise: Custom Preprocessing Pipeline**

### **Exercise 6: Build a Custom Preprocessing Pipeline**

Combine all the preprocessing steps (tokenization, stop word removal, punctuation removal, stemming/lemmatization) into a single function.

In [19]:
# your code here
def text_preprocessing_pipeline(text):
    # Step 1: Tokenize the text
    tokens = word_tokenize(text)
    print(tokens)  # show the raw tokens before any filtering

    # Step 2: Remove stop words. The comparison is case-insensitive so that
    # capitalized stop words such as "It" are removed too. `stop_words` is the
    # set built in Exercise 1.
    clean_words = [word for word in tokens if word.lower() not in stop_words]

    # Step 3: Remove punctuation and any other non-alphabetic tokens
    clean_data = [word for word in clean_words if word.isalpha()]

    # Step 4: Apply lemmatization
    return [lemmatizer.lemmatize(word) for word in clean_data]



Apply this function to the following text:

In [20]:
text = "Natural Language Processing (NLP) is a fascinating field of study! It involves analyzing and understanding human language."

# your code here
processed_text = text_preprocessing_pipeline(text)

['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'of', 'study', '!', 'It', 'involves', 'analyzing', 'and', 'understanding', 'human', 'language', '.']


In [21]:
print("Processed Text:", processed_text)

Processed Text: ['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'study', 'involves', 'analyzing', 'understanding', 'human', 'language']


## **5. Evaluation of Preprocessing Techniques**

### **Exercise 7: Compare Preprocessing Techniques**

Compare the results of stemming and lemmatization on the following sentence. Store the results in `stemmed_tokens` and `lemmatized_tokens`.

In [22]:
sentence = "The cats are playing with the mice in the garden."
# your code here
# Step 1: Tokenize and preprocess the sentence and store the result in filtered_tokens
filtered_tokens = word_tokenize(sentence)
print(filtered_tokens)

# Step 2: Apply stemming
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

# Step 3: Apply lemmatization
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

['The', 'cats', 'are', 'playing', 'with', 'the', 'mice', 'in', 'the', 'garden', '.']


In [23]:
print("Original Tokens:", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)

Original Tokens: ['The', 'cats', 'are', 'playing', 'with', 'the', 'mice', 'in', 'the', 'garden', '.']
Stemmed Tokens: ['the', 'cat', 'are', 'play', 'with', 'the', 'mice', 'in', 'the', 'garden', '.']
Lemmatized Tokens: ['The', 'cat', 'are', 'playing', 'with', 'the', 'mouse', 'in', 'the', 'garden', '.']
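
To make the comparison easier to read, the three lists printed above can be lined up and filtered down to the tokens where stemming and lemmatization disagree with the original or with each other. This is only a small inspection aid; the lists are copied from the output above and the variable names are illustrative.

```python
# Lists copied from the output above.
original = ['The', 'cats', 'are', 'playing', 'with', 'the', 'mice', 'in', 'the', 'garden', '.']
stemmed = ['the', 'cat', 'are', 'play', 'with', 'the', 'mice', 'in', 'the', 'garden', '.']
lemmatized = ['The', 'cat', 'are', 'playing', 'with', 'the', 'mouse', 'in', 'the', 'garden', '.']

# Keep only the positions where at least one technique changed the token.
differences = [
    (orig, stem, lemma)
    for orig, stem, lemma in zip(original, stemmed, lemmatized)
    if not (orig == stem == lemma)
]

print(f"{'original':<10}{'stemmed':<10}{'lemmatized':<10}")
for orig, stem, lemma in differences:
    print(f"{orig:<10}{stem:<10}{lemma:<10}")
```

The rows show the main contrasts: the Porter stemmer lowercases its input and strips suffixes (`playing` becomes `play`) but leaves the irregular plural `mice` untouched, while the lemmatizer, which treats words as nouns by default, maps `mice` to `mouse` and leaves the verb form `playing` as it is.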


## **6. Real-World Dataset: Sentiment Analysis**

### **Exercise 8: Preprocess and Analyze Tweets**

In this exercise, you will work with a real-world dataset of tweets. The dataset contains 5000 positive and 5000 negative tweets. Your task is to preprocess the tweets and extract features for sentiment analysis.


In [24]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

In [25]:
# Load the dataset
from nltk.corpus import twitter_samples

Load the dataset of positive and negative tweets.

In [26]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

Combine them into a single list called `all_tweets` and create a corresponding list of labels called `labels`.

In [27]:
# your code here

# Combine the datasets: label 1 for positive tweets, 0 for negative tweets
all_tweets = positive_tweets + negative_tweets
labels = [1] * len(positive_tweets) + [0] * len(negative_tweets)

In [28]:
# Print a sample tweet
print("Sample Tweet:", all_tweets[0])
print("Label:", labels[0])

Sample Tweet: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
Label: 1


### **Exercise 9: Preprocess Tweets**

Apply the custom preprocessing pipeline to the entire dataset of tweets. Store the result in `preprocessed_tweets`.

In [29]:
# Step 1: Apply the preprocessing pipeline to all tweets
# your code here
preprocessed_tweets=[text_preprocessing_pipeline(tweet) for tweet in all_tweets]

Streaming output truncated to the last 5000 lines.
['hopeless', 'for', 'tmr', ':', '(']
['Everything', 'in', 'the', 'kids', 'section', 'of', 'IKEA', 'is', 'so', 'cute', '.', 'Shame', 'I', "'m", 'nearly', '19', 'in', '2', 'months', ':', '(']
['@', 'Hegelbon', 'That', 'heart', 'sliding', 'into', 'the', 'waste', 'basket', '.', ':', '(']
['“', '@', 'ketchBurning', ':', 'I', 'hate', 'Japanese', 'call', 'him', '``', 'bani', "''", ':', '(', ':', '(', '”', 'Me', 'too']
['Dang', 'starting', 'next', 'week', 'I', 'have', '``', 'work', "''", ':', '(']
['oh', 'god', ',', 'my', 'babies', "'", 'faces', ':', '(', 'https', ':', '//t.co/9fcwGvaki0']
['@', 'RileyMcDonough', 'make', 'me', 'smile', ':', '(', '(']
['@', 'f0ggstar', '@', 'stuartthull', 'work', 'neighbour', 'on', 'motors', '.', 'Asked', 'why', 'and', 'he', 'said', 'hates', 'the', 'updates', 'on', 'search', ':', '(', 'http', ':', '//t.co/XvmTUikWln']
['why', '?', ':', '(', '``', '@', 'tahuodyy', ':', 'sialan', ':', '(', 'http

In [30]:
# Print a sample preprocessed tweet
print("Preprocessed Tweets Sample:", preprocessed_tweets[0])

Preprocessed Tweets Sample: ['FollowFriday', 'top', 'engaged', 'member', 'community', 'week']
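
The general-purpose pipeline handles tweets only approximately: in the streamed token output above, URLs are split into pieces such as `https` and handles into `@` plus a name, so alphabetic fragments can survive the `isalpha()` filter. One option is to clean the raw text before tokenizing. The sketch below is a minimal regex-based version, assuming that URLs and @-mentions should be dropped and that hashtags should keep their text; `clean_tweet` and the pattern names are hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # http:// or https:// links
MENTION_RE = re.compile(r"@\w+")       # @user handles
HASHTAG_RE = re.compile(r"#(\w+)")     # hashtags; group 1 is the tag text


def clean_tweet(text):
    """Remove URLs and @-mentions, and strip the '#' from hashtags."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)
    return " ".join(text.split())      # collapse repeated whitespace


sample = ("#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being "
          "top engaged members in my community this week :)")
print(clean_tweet(sample))
```

The cleaned string could then be passed to `text_preprocessing_pipeline` in place of the raw tweet. NLTK also provides `nltk.tokenize.TweetTokenizer`, which has a `strip_handles` option and may be a better fit for this kind of text than `word_tokenize`.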


### **Exercise 10: Feature Extraction on Tweets**

Extract features from the preprocessed tweets using **Bag of Words** and **TF-IDF**. Store the results in `X_bow` and `X_tfidf`, respectively.

In [37]:
print(type(preprocessed_tweets[0]))

<class 'list'>


In [39]:
from sklearn.model_selection import train_test_split

# your code here

# Each preprocessed tweet is a list of tokens, but the vectorizers expect one
# string per document. Join the tokens so that every tweet stays a single row
# and the rows remain aligned with `labels`.
tweet_docs = [" ".join(tokens) for tokens in preprocessed_tweets]

# Step 1: Create a Bag of Words representation
tweet_vectorizer = CountVectorizer(max_features=1000)
X_bow = tweet_vectorizer.fit_transform(tweet_docs)
print("BoW shape:", X_bow.shape)

# Step 2: Create a TF-IDF representation
tweet_tfidf_vectorizer = TfidfVectorizer(max_df=0.5)
X_tfidf = tweet_tfidf_vectorizer.fit_transform(tweet_docs)
print("TF-IDF shape:", X_tfidf.shape)

## **7. Conclusion**

In this lab, you explored a wide range of NLP techniques, from basic text preprocessing to advanced feature extraction and analysis. You also worked with a real-world dataset of tweets and applied your knowledge to preprocess and extract features for sentiment analysis.

