**Install NLTK and necessary libraries**

In [None]:
!pip install nltk pandas  # Use in Google Colab or Jupyter Notebook



**Import the required libraries**

In [None]:
import nltk
from nltk.corpus import movie_reviews
import pandas as pd
import random
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download necessary datasets
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**Load and Save the Sample Dataset from NLTK**

In [None]:
# Get a list of file IDs
file_ids = movie_reviews.fileids()

# Create a list of (text, category) tuples
dataset = [(movie_reviews.raw(file_id), movie_reviews.categories(file_id)[0]) for file_id in file_ids]

# Shuffle the dataset for randomness
random.shuffle(dataset)

# Convert into a DataFrame
df = pd.DataFrame(dataset, columns=["text", "category"])

# Display sample data
df.head()


| | text | category |
|---|---|---|
| 0 | the seasoned capt . dudley smith ( james cromw... | pos |
| 1 | contact is the 1997 movie i've seen the most -... | pos |
| 2 | michael robbins' hardball is quite the cinemat... | neg |
| 3 | well i'll be damned , what a most excellent su... | pos |
| 4 | the makers of spawn have created something alm... | neg |


**Save the dataset as a CSV file**

In [None]:
df.to_csv('nltk_movie_reviews.csv', index=False)
print("Dataset saved as 'nltk_movie_reviews.csv'")


Dataset saved as 'nltk_movie_reviews.csv'


In [None]:
from google.colab import files
files.download('nltk_movie_reviews.csv')



**Implement a function to perform the following preprocessing steps:**
* Lowercasing
* Special character removal
* Stopword removal
* Tokenization


In [None]:
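# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    """Lowercase, strip special characters, tokenize, and drop English stopwords."""
    # Step 1: Convert to lowercase
    text = text.lower()

    # Step 2: Remove special characters (keeps letters, digits, and whitespace).
    # Apostrophes are removed too, so "i've" becomes "ive".
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # Step 3: Tokenize into words
    word_tokens = word_tokenize(text)

    # Step 4: Remove stopwords and return the remaining tokens
    return [word for word in word_tokens if word not in STOP_WORDS]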


In [None]:
# Recent NLTK releases look for the 'punkt_tab' resource when word_tokenize runs
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

**Apply the function to a sample sentence**

In [None]:
sample_text = "Wow! This movie was amazing, but some parts were too slow."
processed_text = preprocess_text(sample_text)
print(processed_text)


['wow', 'movie', 'amazing', 'parts', 'slow']


**Apply the Preprocessing Pipeline to the Dataset**

Apply the function to the "text" column of the dataset:


In [None]:
df['processed_text'] = df['text'].apply(preprocess_text)
df[['text', 'processed_text', 'category']].head()


| | text | processed_text | category |
|---|---|---|---|
| 0 | the seasoned capt . dudley smith ( james cromw... | [seasoned, capt, dudley, smith, james, cromwel... | pos |
| 1 | contact is the 1997 movie i've seen the most -... | [contact, 1997, movie, ive, seen, five, times,... | pos |
| 2 | michael robbins' hardball is quite the cinemat... | [michael, robbins, hardball, quite, cinematic,... | neg |
| 3 | well i'll be damned , what a most excellent su... | [well, ill, damned, excellent, surprise, confu... | pos |
| 4 | the makers of spawn have created something alm... | [makers, spawn, created, something, almost, va... | neg |
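
The output above contains tokens such as `ive` and `ill`: punctuation is stripped before stopword filtering, so a contraction like "i've" loses its apostrophe and slips past the filter. The sketch below illustrates the effect of the ordering. It is only an approximation of the pipeline: `STOP_WORDS` here is a tiny hypothetical set and whitespace splitting stands in for `word_tokenize`, so whether the real NLTK list would catch a given contraction depends on its contents and on the tokenizer.

```python
import re

# Hypothetical, tiny stopword set for illustration only (not NLTK's list).
STOP_WORDS = {"i", "i've", "i'll", "the", "is", "be", "most"}

def strip_then_filter(text):
    """Same order as the pipeline above: remove punctuation, then filter stopwords."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return [w for w in text.split() if w not in STOP_WORDS]

def filter_then_strip(text):
    """Alternative order: filter stopwords first, then strip punctuation from the rest."""
    kept = [w for w in text.lower().split() if w not in STOP_WORDS]
    cleaned = [re.sub(r"[^a-z0-9]", "", w) for w in kept]
    return [w for w in cleaned if w]  # drop tokens that were pure punctuation

sample = "contact is the 1997 movie i've seen the most"
print(strip_then_filter(sample))  # ['contact', '1997', 'movie', 'ive', 'seen']
print(filter_then_strip(sample))  # ['contact', '1997', 'movie', 'seen']
```

Filtering before stripping keeps the contraction intact long enough for the stopword check to match it.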


**Save the Processed Data**

Save the cleaned dataset:

In [None]:
df.to_csv('processed_nltk_reviews.csv', index=False)
print("Preprocessed dataset saved as 'processed_nltk_reviews.csv'")


Preprocessed dataset saved as 'processed_nltk_reviews.csv'
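
One caveat when saving: `processed_text` holds Python lists, and CSV stores each list as its string representation, so a reloaded file yields strings rather than lists. A minimal sketch of the round trip using only the standard library (the in-memory buffer and the two sample rows are illustrative and are not read from the saved file):

```python
import ast
import csv
import io

# Illustrative rows shaped like the processed dataset.
rows = [
    {"processed_text": ["contact", "1997", "movie"], "category": "pos"},
    {"processed_text": ["makers", "spawn", "created"], "category": "neg"},
]

# Write to an in-memory buffer; each list is stored as its string form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["processed_text", "category"])
writer.writeheader()
writer.writerows(rows)

# Read back: every field is now a plain string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]["processed_text"]))

# ast.literal_eval safely parses the string back into a list.
for row in loaded:
    row["processed_text"] = ast.literal_eval(row["processed_text"])
print(loaded[0]["processed_text"])
```

With pandas, the same conversion can be applied at load time through the `converters` argument of `pd.read_csv`, for example `converters={'processed_text': ast.literal_eval}`.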


In [None]:
files.download('processed_nltk_reviews.csv')


**Reflection and Discussion**

1. What changes did you observe in the preprocessed text?

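The text was converted to lowercase, punctuation and other special characters were removed, and common stopwords such as "the" and "is" were filtered out, leaving a list of mostly content-bearing tokens. A side effect is that contractions lose their apostrophes, so "i've" and "i'll" appear as "ive" and "ill".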

2. Why is each step of preprocessing important?

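* Lowercasing treats "Movie" and "movie" as the same token, which reduces vocabulary size.
* Removing special characters strips punctuation and symbols that rarely carry meaning in a bag-of-words model.
* Stopword removal drops very frequent words (such as "the" and "was") that carry little topical information.
* Tokenization splits the text into individual words that later steps can count or filter.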

3. How does preprocessing improve text analysis in machine learning?

* It removes noise and shrinks the vocabulary, which makes patterns easier for a model to learn and often improves accuracy.

4. How would you modify this pipeline for sentiment analysis?

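* Keep negation words such as "not" and "no", which NLTK's English stopword list removes but which can flip the sentiment of a sentence.
* Add stemming or lemmatization to group inflected forms of the same word.
* Use n-grams to capture word combinations such as "not good".
* Apply TF-IDF or word embeddings for better feature extraction.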
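
Two of the ideas above, n-grams and TF-IDF, can be sketched in plain Python. This is an illustrative sketch rather than a production implementation: the helper names `ngrams` and `tf_idf` and the toy documents are assumptions, and the weighting is the textbook unsmoothed formula (term frequency times `log(N / df)`). Libraries such as scikit-learn apply smoothing and normalization by default, so their values will differ.

```python
import math
from collections import Counter

def ngrams(tokens, n=2):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Compute TF-IDF per document with tf = count / doc length and idf = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Toy token lists standing in for rows of df['processed_text'].
docs = [
    ["movie", "amazing", "parts", "slow"],
    ["movie", "boring", "slow"],
    ["amazing", "cast", "amazing", "story"],
]

print(ngrams(docs[0]))
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "parts" receives a higher weight than "movie" because it appears in only one of the three documents.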