**Title:** Implementing a Text Preprocessing Pipeline in Jupyter Notebook or Google Colab

**Objective:**
**By the end of this activity, students will be able to:**

1. Understand the importance of text preprocessing in Natural Language Processing (NLP).
2. Extract a sample dataset from NLTK and store it in a structured format.
3. Implement a text preprocessing pipeline that includes:
  a. Lowercasing
  b. Special character removal
  c. Stopword removal
  d. Tokenization
4. Apply the preprocessing pipeline to real-world text data using Jupyter Notebook or Google Colab.








**Install NLTK and Necessary Libraries**

In [11]:
!pip install nltk pandas  # Use in Google Colab or Jupyter Notebook





**Import the Required Libraries**

In [12]:
import nltk
from nltk.corpus import movie_reviews
import pandas as pd
import random
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download necessary datasets
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**Load and Save the Sample Dataset from NLTK**

Extract movie review texts and their corresponding labels (positive/negative):


In [13]:
# Get a list of file IDs
file_ids = movie_reviews.fileids()

# Create a list of (text, category) tuples
dataset = [(movie_reviews.raw(file_id), movie_reviews.categories(file_id)[0]) for file_id in file_ids]

# Shuffle the dataset for randomness
random.shuffle(dataset)

# Convert into a DataFrame
df = pd.DataFrame(dataset, columns=["text", "category"])

# Display sample data
df.head()


|   | text | category |
|---|------|----------|
| 0 | this british import follows the ( mis- ) adven... | pos |
| 1 | i remember really enjoying this movie when i s... | neg |
| 2 | synopsis : melissa , a mentally-disturbed woma... | neg |
| 3 | i had been expecting more of this movie than t... | pos |
| 4 | whenever writer/director robert altman works i... | pos |


Save the dataset as a CSV file:

In [14]:
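df.to_csv('nltk_movie_reviews.csv', index=False)  # write the file before trying to download it
from google.colab import files  # Colab only; skip this line and the next in plain Jupyter
files.download('nltk_movie_reviews.csv')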




**Define the Text Preprocessing Pipeline**

Implement a function to perform the following preprocessing steps:
  1. Lowercasing
  2. Special character removal
  3. Stopword removal
  4. Tokenization


In [15]:
# Build the stopword set once instead of rebuilding it on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Step 1: Convert to lowercase
    text = text.lower()

    # Step 2: Remove special characters (anything except letters, digits, and whitespace)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # Step 3: Remove stopwords (the text is tokenized first so each word can be checked)
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in STOP_WORDS]

    # Step 4: Tokenization result: return the filtered list of tokens
    return filtered_text


**Test the Preprocessing Function**

Apply the function to a sample sentence:

In [18]:
# Newer NLTK releases load the tokenizer data from 'punkt_tab' (nltk is already imported above)
nltk.download('punkt_tab')

sample_text = "Wow! This movie was amazing, but some parts were too slow."
processed_text = preprocess_text(sample_text)
print(processed_text)


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


['wow', 'movie', 'amazing', 'parts', 'slow']
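
To see what each stage contributes to this result, the sketch below traces the same sample sentence through the steps one at a time. It is a simplified stand-in rather than the notebook's function: `trace_pipeline` and `MINI_STOPWORDS` are hypothetical names, the stopword set is a hand-picked subset (NLTK's English list is much longer), and `str.split` replaces `word_tokenize` so the cell runs without any downloads.

```python
import re

# Hypothetical, hand-picked subset of English stopwords (illustration only)
MINI_STOPWORDS = {"this", "was", "but", "some", "were", "too"}

def trace_pipeline(text):
    """Return the intermediate result of every preprocessing stage."""
    lowered = text.lower()
    cleaned = re.sub(r"[^a-z0-9\s]", "", lowered)
    tokens = cleaned.split()  # whitespace split stands in for word_tokenize
    filtered = [t for t in tokens if t not in MINI_STOPWORDS]
    return {"lowered": lowered, "cleaned": cleaned, "tokens": tokens, "filtered": filtered}

stages = trace_pipeline("Wow! This movie was amazing, but some parts were too slow.")
for name, value in stages.items():
    print(f"{name}: {value}")
```

Printing the stages separately makes it easier to spot which step is responsible when a token looks wrong in the final list.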


**Apply the Preprocessing Pipeline to the Dataset**

Apply the function to the "text" column of the dataset:


In [19]:
df['processed_text'] = df['text'].apply(preprocess_text)
df[['text', 'processed_text', 'category']].head()


|   | text | processed_text | category |
|---|------|----------------|----------|
| 0 | this british import follows the ( mis- ) adven... | [british, import, follows, mis, adventures, gr... | pos |
| 1 | i remember really enjoying this movie when i s... | [remember, really, enjoying, movie, saw, years... | neg |
| 2 | synopsis : melissa , a mentally-disturbed woma... | [synopsis, melissa, mentallydisturbed, woman, ... | neg |
| 3 | i had been expecting more of this movie than t... | [expecting, movie, less, thrilling, twister, t... | pos |
| 4 | whenever writer/director robert altman works i... | [whenever, writerdirector, robert, altman, wor... | pos |
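
In the processed output, "mentally-disturbed" becomes "mentallydisturbed" and "writer/director" becomes "writerdirector", because deleting a special character joins the words on either side of it. One possible variation, sketched here under the assumption that separate words are preferred, replaces each special character with a space instead. The function name `clean_with_spaces` is hypothetical and not part of the notebook.

```python
import re

def clean_with_spaces(text):
    """Lowercase the text and turn each special character into a space."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse the runs of whitespace that the substitution can leave behind
    return re.sub(r"\s+", " ", text).strip()

print(clean_with_spaces("a mentally-disturbed woman"))
print(clean_with_spaces("writer/director robert altman"))
```

The trade-off is that contractions are split as well ("don't" becomes "don t"), so which variant works better depends on the text being processed.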


**Save the Processed Data**

Save the cleaned dataset:

In [20]:
df.to_csv('processed_nltk_reviews.csv', index=False)
print("Preprocessed dataset saved as 'processed_nltk_reviews.csv'")

Preprocessed dataset saved as 'processed_nltk_reviews.csv'
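
One thing to keep in mind when saving: a CSV file holds only text, so each token list in `processed_text` is written as its string representation and comes back as a string when the file is reloaded. The sketch below illustrates the round trip with Python's built-in `csv` module standing in for pandas; the two rows are toy data invented for the example.

```python
import ast
import csv
import io

# Toy rows invented for illustration: (token list, label)
rows = [(["wow", "movie", "amazing"], "pos"), (["parts", "slow"], "neg")]

# Write the rows to an in-memory CSV; the writer stores each list via str()
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["processed_text", "category"])
writer.writerows(rows)

# Read the CSV back: the token column is now plain text
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
as_text = loaded[0]["processed_text"]

# ast.literal_eval safely parses the text back into a Python list
restored = ast.literal_eval(as_text)
print(type(as_text).__name__, as_text)
print(type(restored).__name__, restored)
```

With pandas, the same idea would be to call `pd.read_csv('processed_nltk_reviews.csv')` and then apply `ast.literal_eval` to the `processed_text` column.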


**If using Google Colab, download the processed dataset:**

In [21]:
files.download('processed_nltk_reviews.csv')


Answer the following questions:
* What changes did you observe in the preprocessed text?
* Why is each step of preprocessing important?
* How does preprocessing improve text analysis in machine learning?
* How would you modify this pipeline for sentiment analysis?


**What changes did you observe in the preprocessed text?**
In the preprocessed text:

1. All words are in lowercase.

Example: "Wow! This movie" → "wow! this movie"

2. Special characters like punctuation are removed.

Example: "amazing!" → "amazing"

3. Common words (stopwords) are removed.

Words like "this", "was", "but", "some" are removed because they don’t add much meaning.

4. Text is broken into individual words (tokens).

Example: "wow this movie was amazing" → ["wow", "this", "movie", "was", "amazing"] (stopword removal then reduces this to ["wow", "movie", "amazing"])

**Why is each step of preprocessing important?**
1. Lowercasing
→ Makes all words uniform. "Movie" and "movie" should be treated the same.

2. Removing special characters
→ Cleans the text. Punctuation doesn't usually help the model understand meaning.

3. Removing stopwords
→ Gets rid of common, less useful words. Keeps the important ones.

4. Tokenization
→ Breaks text into words, so machine learning models can work with them.



**How does preprocessing improve text analysis in machine learning?**
1. Makes data clean and consistent.

2. Reduces noise by removing tokens that carry little signal (such as stopwords and punctuation).

3. Can improve accuracy by letting the model focus on the most informative words.

4. Speeds up training because the vocabulary is smaller and there are fewer tokens to process.
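
The effect on vocabulary size can be illustrated with a small experiment. This is a sketch on a two-sentence toy corpus invented for the example, with a hypothetical three-word stopword list; the numbers it prints say nothing about the movie review dataset itself.

```python
import re

# Toy corpus invented for illustration
corpus = [
    "The movie was GREAT, and the acting was great!",
    "The plot was slow; the Movie felt slow.",
]
# Hypothetical mini stopword list (NLTK's English list is far larger)
MINI_STOPWORDS = {"the", "was", "and"}

def raw_tokens(text):
    """Split on whitespace only, keeping case and punctuation."""
    return text.split()

def clean_tokens(text):
    """Lowercase, strip special characters, split, and drop stopwords."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return [t for t in text.split() if t not in MINI_STOPWORDS]

raw_vocab = {t for doc in corpus for t in raw_tokens(doc)}
clean_vocab = {t for doc in corpus for t in clean_tokens(doc)}
print(len(raw_vocab), sorted(raw_vocab))
print(len(clean_vocab), sorted(clean_vocab))
```

Without preprocessing, "GREAT," and "great!" count as two different vocabulary entries, as do "Movie" and "movie"; after preprocessing each pair collapses into one.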

**How would you modify this pipeline for sentiment analysis?**
For sentiment analysis, you might:

1. Keep some stopwords like “not” or “no” – because they change meaning.

Example: “not good” ≠ “good”

2. Add stemming or lemmatization

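Lemmatization maps “running”, “runs”, and “ran” to “run”; stemming only strips suffixes, so it reduces “running” and “runs” to “run” but leaves “ran” unchanged.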

Helps reduce word variations.

3. Use word embeddings or TF-IDF

TF-IDF weights each word by how distinctive it is for a document, while word embeddings map words to vectors that capture the contexts they appear in.

4. Use labeled sentiment data

Make sure your data has labels such as "positive" and "negative"; the movie_reviews corpus used here already provides "pos" and "neg".
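
The first modification in the list above, keeping negations, can be sketched as follows. `BASE_STOPWORDS` is a hypothetical mini list used so that the cell runs without downloads, and `filter_tokens` is an illustrative helper, not part of the notebook.

```python
# Hypothetical mini stopword list; NLTK's English list also contains "not" and "no"
BASE_STOPWORDS = {"this", "is", "a", "the", "was", "not", "no"}
NEGATIONS = {"not", "no"}

# Stopwords to remove for sentiment work: everything except the negations
SENTIMENT_STOPWORDS = BASE_STOPWORDS - NEGATIONS

def filter_tokens(tokens, stop_words):
    """Keep only the tokens that are not in the given stopword set."""
    return [t for t in tokens if t not in stop_words]

tokens = ["this", "movie", "was", "not", "good"]
print(filter_tokens(tokens, BASE_STOPWORDS))       # negation is lost
print(filter_tokens(tokens, SENTIMENT_STOPWORDS))  # negation is kept
```

In the notebook, the same set difference could be applied to the full list, for example `set(stopwords.words('english')) - {"not", "no"}`.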

