# NLP Pipeline Exercises: Sentiment Labelled Sentences

**Dataset:** Sentiment Labelled Sentences (UCI repository)  
**URL:** https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences  

**Dataset Description:**  
This dataset contains sentences extracted from product reviews and their sentiment labels.

**Columns/Variables:**
- `sentence`: The text of the review sentence.
- `label`: Sentiment of the sentence, 0 = negative, 1 = positive.

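We will practice **tokenization, normalization, stop word removal, stemming**, and **preparing text for NLP tasks**. Note that the setup code below loads NLTK's `movie_reviews` corpus rather than the UCI files, so each row is a full movie review labelled `pos` or `neg` (not `0`/`1`), and the `sentence` column is built in Exercise 1.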


In [3]:
# 🔧 Setup: Import libraries and load dataset
import nltk
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import random
import pandas as pd

# Download NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('movie_reviews')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\joseg\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\joseg\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\joseg\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\joseg\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [4]:
# Prepare a simple DataFrame
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the dataset
random.shuffle(documents)

# Convert to pandas DataFrame
df = pd.DataFrame(documents, columns=['words', 'label'])
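
Because `random.shuffle` is called without a seed, the row order (and so the rows printed in the following cells) changes on every run. A minimal sketch of a reproducible shuffle, assuming a fixed seed is acceptable for these exercises; the helper `shuffled`, the seed value `42`, and the toy `docs` list are illustrative and not part of the notebook:

```python
import random

# Toy stand-in for the (words, label) pairs built from movie_reviews.
docs = [
    (["good", "film"], "pos"),
    (["bad", "plot"], "neg"),
    (["great", "cast"], "pos"),
    (["dull", "script"], "neg"),
]


def shuffled(items, seed=42):
    """Return a shuffled copy of items using a private, seeded RNG."""
    rng = random.Random(seed)  # leaves the global random state untouched
    out = list(items)          # copy, so the caller's list keeps its order
    rng.shuffle(out)
    return out


first = shuffled(docs)
second = shuffled(docs)
print(first == second)  # True: the same seed gives the same order
```

In the notebook, the same idea would apply to `documents` before it is passed to `pd.DataFrame`.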


In [5]:
# Show the first 100 rows (the display below is truncated)
df.head(100)

| | words | label |
|---|---|---|
| 0 | [note, :, ordinarily, ,, moviereviews, ., org,... | pos |
| 1 | [one, of, the, 90s, ', most, unwelcome, thrill... | neg |
| 2 | [capsule, :, in, 2176, on, the, planet, mars, ... | neg |
| 3 | [a, sensuous, romantic, comedy, ,, about, as, ... | neg |
| 4 | [aggressive, ,, bleak, ,, and, unrelenting, fi... | pos |
| ... | ... | ... |
| 95 | [often, similar, to, a, little, boy, lost, in,... | neg |
| 96 | [the, obvious, reason, for, producing, a, sequ... | neg |
| 97 | [just, how, inseparable, is, the, team, of, sg... | pos |
| 98 | [this, three, hour, movie, opens, up, with, a,... | pos |


## Exercise 1: Convert Words Back to Sentences

Question:
Combine the tokenized words into sentences (strings) to make text easier to preprocess.

In [6]:

# Join the words into a single string for each review
df['sentence'] = df['words'].apply(lambda x: ' '.join(x))  # x is the token list of one review
df[['sentence', 'label']].head()


| | sentence | label |
|---|---|---|
| 0 | note : ordinarily , moviereviews . org will no... | pos |
| 1 | one of the 90s ' most unwelcome thriller trend... | neg |
| 2 | capsule : in 2176 on the planet mars police ta... | neg |
| 3 | a sensuous romantic comedy , about as appealin... | neg |
| 4 | aggressive , bleak , and unrelenting film abou... | pos |
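
The join places a single space between every pair of tokens, punctuation included, which is why the strings read `note : ordinarily ,` rather than natural English. A small sketch of that behaviour, using the first few tokens of row 0 shown earlier (the variable names are illustrative):

```python
# First tokens of the first review, as shown in the DataFrame preview.
words = ["note", ":", "ordinarily", ",", "moviereviews", ".", "org"]

# Same operation as the lambda in the cell above.
sentence = " ".join(words)
print(sentence)  # note : ordinarily , moviereviews . org

# Splitting on a single space recovers the original token list,
# because no token contains a space itself.
print(sentence.split(" ") == words)  # True
```

Since `' '.join` is already a one-argument callable, it can also be passed to `apply` directly, as in `df['words'].apply(' '.join)`.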


## Exercise 2: Tokenization

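Question:
Tokenize the first review using NLTK’s `word_tokenize`. Because the text was already split into tokens, most of them come back unchanged, but `word_tokenize` rewrites each double quote as two backticks and splits `cannot` into `can` and `not`.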

In [7]:
from nltk.tokenize import word_tokenize

first_review = df['sentence'][0]   

# Tokenize the review
tokens = word_tokenize(first_review)

# Display the tokens
print(tokens)


['note', ':', 'ordinarily', ',', 'moviereviews', '.', 'org', 'will', 'not', 'give', 'away', 'any', 'critical', 'plot', 'points', 'of', 'a', 'film', 'that', 'could', 'be', 'interpreted', 'as', '``', 'spoilers', '.', '``', 'however', ',', 'being', 'that', 'music', 'of', 'the', 'heart', 'is', 'based', 'on', 'a', 'true', 'story', 'and', 'that', 'moviereviews', '.', 'org', 'feels', 'the', 'film', 'can', 'not', 'be', 'properly', 'credited', 'without', 'such', 'revelations', ',', 'plot', 'giveaways', 'will', 'appear', 'in', 'the', 'following', 'review', '.', 'if', 'this', 'bothers', 'you', ',', 'please', 'note', 'the', '3', 'star', 'rating', 'of', 'the', 'film', 'and', 'stop', 'reading', 'now', '.', '``', 'what', 'does', 'it', 'take', 'to', 'play', 'carnegie', 'hall', '?', 'practice', '.', '``', 'it', 'takes', 'two', 'hours', 'for', 'music', 'of', 'the', 'heart', 'to', '``', 'play', 'carnegie', 'hall', ',', '``', 'both', 'figuratively', 'and', 'literally', '.', 'like', 'the', 'children', 'it'
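
As a rough illustration of what a tokenizer does, here is a regex-based sketch. It is a deliberate simplification and not the algorithm behind `word_tokenize`; the function name `simple_tokenize` is hypothetical.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+      matches a run of letters, digits, or underscores
    # [^\w\s]  matches one character that is neither a word char nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("It takes two hours, doesn't it?"))
# ['It', 'takes', 'two', 'hours', ',', 'doesn', "'", 't', 'it', '?']
```

The contraction shows the limit of the simple pattern: it breaks `doesn't` at the apostrophe, whereas `word_tokenize` treats contractions specially (it yields `does` and `n't`).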

## Exercise 3: Normalization & Stop Word Removal

Question:
- Convert tokens to lowercase.
- Remove punctuation and stop words.

In [18]:
from nltk.corpus import stopwords
 
def normalize_and_remove_stops(tokens):
    """
    1. Converts all tokens to lowercase (normalization).
    2. Drops tokens that are not purely alphabetic (punctuation and numbers).
    3. Removes common English stop words.
    """
    # 1. Normalization: convert to lowercase
    normalized_tokens = [t.lower() for t in tokens]
 
    # 2. Punctuation removal: isalpha() also drops numeric tokens such as '3'
    alphabetic_tokens = [t for t in normalized_tokens if t.isalpha()]
 
    # 3. Stop word removal
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [t for t in alphabetic_tokens if t not in stop_words]
    return filtered_tokens
 
# Process our tokens
cleaned_tokens = normalize_and_remove_stops(tokens)
print("--- NORMALIZATION & STOP WORDS REMOVAL ---")
print()
print("Original number of tokens:", len(tokens))
print()
print(cleaned_tokens)

--- NORMALIZATION & STOP WORDS REMOVAL ---

Original number of tokens: 879

['note', 'ordinarily', 'moviereviews', 'org', 'give', 'away', 'critical', 'plot', 'points', 'film', 'could', 'interpreted', 'spoilers', 'however', 'music', 'heart', 'based', 'true', 'story', 'moviereviews', 'org', 'feels', 'film', 'properly', 'credited', 'without', 'revelations', 'plot', 'giveaways', 'appear', 'following', 'review', 'bothers', 'please', 'note', 'star', 'rating', 'film', 'stop', 'reading', 'take', 'play', 'carnegie', 'hall', 'practice', 'takes', 'two', 'hours', 'music', 'heart', 'play', 'carnegie', 'hall', 'figuratively', 'literally', 'like', 'children', 'portrays', 'movie', 'starts', 'dark', 'realms', 'awful', 'cinema', 'works', 'way', 'show', 'stopping', 'performance', 'legendary', 'concert', 'hall', 'roberta', 'guaspari', 'academy', 'award', 'winner', 'meryl', 'streep', 'two', 'kids', 'husband', 'left', 'violins', 'bought', 'small', 'shop', 'mediterranean', 'life', 'desperately', 'needs', 'ju
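
To see which filter removes which tokens, the sketch below separates the two steps on a short span taken from the tokenized review above. `STOP_SUBSET` is a hand-picked subset used only for illustration; the notebook itself uses the full list from `stopwords.words('english')`.

```python
# A span of tokens copied from the Exercise 2 output.
tokens = ["please", "note", "the", "3", "star", "rating", "of", "the",
          "film", "and", "stop", "reading", "now", "."]

# Illustrative subset of English stop words (not the full NLTK list).
STOP_SUBSET = {"the", "of", "and", "now"}

# Tokens rejected by isalpha(): punctuation, but also numbers.
non_alpha = [t for t in tokens if not t.isalpha()]

# Alphabetic tokens rejected by the stop word filter.
stops = [t for t in tokens if t.isalpha() and t.lower() in STOP_SUBSET]

# What survives both filters, lowercased.
kept = [t.lower() for t in tokens
        if t.isalpha() and t.lower() not in STOP_SUBSET]

print(non_alpha)  # ['3', '.']
print(stops)      # ['the', 'of', 'the', 'and', 'now']
print(kept)       # ['please', 'note', 'star', 'rating', 'film', 'stop', 'reading']
```

The `kept` list matches the corresponding stretch of the cleaned output above, and the `'3'` in `non_alpha` shows that the `isalpha()` check discards the star rating along with the punctuation.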

## Exercise 4: Stemming

Question:
Stem the cleaned tokens using PorterStemmer.

In [19]:
# Stemming
from nltk.stem import PorterStemmer

def stem_tokens(tokens):
    """
    Applies stemming to a list of tokens using the Porter Stemmer algorithm.
    """
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(t) for t in tokens]
    return stemmed_tokens

# Stem our cleaned tokens
stemmed_tokens = stem_tokens(cleaned_tokens)
print("--- STEMMING ---")
print(stemmed_tokens)

--- STEMMING ---
['note', 'ordinarili', 'moviereview', 'org', 'give', 'away', 'critic', 'plot', 'point', 'film', 'could', 'interpret', 'spoiler', 'howev', 'music', 'heart', 'base', 'true', 'stori', 'moviereview', 'org', 'feel', 'film', 'properli', 'credit', 'without', 'revel', 'plot', 'giveaway', 'appear', 'follow', 'review', 'bother', 'pleas', 'note', 'star', 'rate', 'film', 'stop', 'read', 'take', 'play', 'carnegi', 'hall', 'practic', 'take', 'two', 'hour', 'music', 'heart', 'play', 'carnegi', 'hall', 'figur', 'liter', 'like', 'children', 'portray', 'movi', 'start', 'dark', 'realm', 'aw', 'cinema', 'work', 'way', 'show', 'stop', 'perform', 'legendari', 'concert', 'hall', 'roberta', 'guaspari', 'academi', 'award', 'winner', 'meryl', 'streep', 'two', 'kid', 'husband', 'left', 'violin', 'bought', 'small', 'shop', 'mediterranean', 'life', 'desper', 'need', 'jump', 'start', 'get', 'one', 'meet', 'man', 'introduc', 'job', 'music', 'teacher', 'east', 'harlem', 'elementari', 'school', 'soon'
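
Stemming matters because it merges different surface forms into one token. The sketch below groups a few (word, stem) pairs copied from the two outputs above and reports the stems that absorbed more than one word; the variable names are illustrative.

```python
from collections import defaultdict

# (cleaned token, stemmed token) pairs taken from the outputs above.
pairs = [
    ("stop", "stop"), ("reading", "read"), ("take", "take"),
    ("takes", "take"), ("stopping", "stop"),
    ("critical", "critic"), ("credited", "credit"),
]

# Collect every distinct surface form seen for each stem.
variants = defaultdict(set)
for word, stem in pairs:
    variants[stem].add(word)

# Keep only stems that merged two or more different words.
merged = {stem: sorted(words)
          for stem, words in variants.items() if len(words) > 1}
print(merged)  # {'stop': ['stop', 'stopping'], 'take': ['take', 'takes']}
```

On the full review, `zip(cleaned_tokens, stemmed_tokens)` would supply the pairs, since `stem_tokens` preserves both order and length.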

## Exercise 5: Full Preprocessing Function

Question:
Write a function `preprocess_text(text)` that tokenizes, normalizes, removes stop words, and stems. Apply it to all reviews.

In [20]:

# Preprocessing function
def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = normalize_and_remove_stops(tokens)
    tokens = stem_tokens(tokens)
    return tokens

# Apply preprocessing to the entire DataFrame
df['processed_tokens'] = df['sentence'].apply(preprocess_text)

# Display the first 5 rows of the processed DataFrame
df[['sentence', 'processed_tokens', 'label']].head()


| | sentence | processed_tokens | label |
|---|---|---|---|
| 0 | note : ordinarily , moviereviews . org will no... | [note, ordinarili, moviereview, org, give, awa... | pos |
| 1 | one of the 90s ' most unwelcome thriller trend... | [one, unwelcom, thriller, trend, return, grave... | neg |
| 2 | capsule : in 2176 on the planet mars police ta... | [capsul, planet, mar, polic, take, custodi, ac... | neg |
| 3 | a sensuous romantic comedy , about as appealin... | [sensuou, romant, comedi, appeal, averag, ligh... | neg |
| 4 | aggressive , bleak , and unrelenting film abou... | [aggress, bleak, unrel, film, interraci, coupl... | pos |
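
As written, every call to `preprocess_text` rebuilds the stop word set and creates a new `PorterStemmer`, once per review. One possible alternative is to build those resources once and close over them. The sketch below shows the pattern with toy stand-ins so it runs without NLTK data; `make_preprocessor`, the three-word stop list, and the suffix-stripping lambda are hypothetical and are not the Porter algorithm.

```python
def make_preprocessor(tokenize, stop_words, stem):
    """Build a preprocess(text) function that reuses the given resources."""
    stop_words = frozenset(stop_words)  # built once, shared by every call

    def preprocess(text):
        # Tokenize and lowercase lazily, then filter and stem in one pass.
        tokens = (t.lower() for t in tokenize(text))
        return [stem(t) for t in tokens
                if t.isalpha() and t not in stop_words]

    return preprocess


# Toy stand-ins for the tokenizer, the stop list, and the stemmer.
preprocess = make_preprocessor(
    tokenize=str.split,
    stop_words={"the", "of", "and"},
    stem=lambda t: t[:-1] if t.endswith("s") else t,
)

print(preprocess("The plot points of the film ."))
# ['plot', 'point', 'film']
```

In the notebook, the same factory could be called with `word_tokenize`, `stopwords.words('english')`, and `PorterStemmer().stem`, and the result passed to `df['sentence'].apply`.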


## Exercise 6: Most Common Words

Question:
Find the 10 most common tokens (stems) across all processed reviews.

In [21]:
# Find the 10 most common tokens across all processed reviews
from collections import Counter
all_tokens = [token for tokens in df['processed_tokens'] for token in tokens]
token_counts = Counter(all_tokens)
most_common_tokens = token_counts.most_common(10)
print("Most common tokens:", most_common_tokens) 


Most common tokens: [('film', 11199), ('movi', 6977), ('one', 6029), ('like', 4137), ('charact', 3881), ('make', 3243), ('get', 3220), ('time', 3047), ('scene', 2671), ('even', 2611)]
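
The overall top 10 is led by domain words such as `film` and `movi`, which are likely to be frequent in both classes. Counting per label is one way to look for tokens that differ between positive and negative reviews. A minimal sketch on toy rows (the data and the name `counts_by_label` are illustrative, not results from the corpus):

```python
from collections import Counter

# Toy (processed_tokens, label) rows standing in for the DataFrame.
rows = [
    (["film", "great", "cast"], "pos"),
    (["film", "dull", "film"], "neg"),
    (["great", "film", "great", "score"], "pos"),
]

# One Counter per label, updated review by review
# (no need to flatten all tokens into one list first).
counts_by_label = {}
for tokens, label in rows:
    counts_by_label.setdefault(label, Counter()).update(tokens)

print(counts_by_label["pos"].most_common(2))  # [('great', 3), ('film', 2)]
print(counts_by_label["neg"].most_common(1))  # [('film', 2)]
```

With the real data, the rows could come from `zip(df['processed_tokens'], df['label'])`.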
