# ASSIGNMENT 17 — TEXT CLEANING, PREPROCESSING & NLP PIPELINE

**Dataset:** SMS Spam Collection Dataset  
**Source:** https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

In [5]:
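import pandas as pd
import re
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the NLTK data used below
# (recent NLTK releases may also need "punkt_tab" for tokenization)
for resource in ("stopwords", "punkt", "wordnet"):
    nltk.download(resource, quiet=True)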

In [6]:
# Load the small English model (install once with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")


# PART 1 — NLP PIPELINE & BASIC TEXT CLEANING


In [7]:
# TASK 1: Understanding Raw Text Data
# Step 1: Load dataset using pandas
df = pd.read_csv("spam.csv", encoding="latin-1")

# Keep only useful columns
df = df[['v1','v2']]
df.columns = ['label','text']

# Step 2: Print first 5 samples
print("\nFirst 5 Samples:\n")
print(df.head())

# Step 3: Length of each text
df['text_length'] = df['text'].apply(len)
print("\nText Length:\n")
print(df['text_length'].head())

# Step 4: Identify common issues (slang, abbreviations, repeated punctuation)
print("\nExample Raw Text Issues:")
print(df.loc[3, 'text'])


First 5 Samples:

  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...

Text Length:

0    111
1     29
2    155
3     49
4     61
Name: text_length, dtype: int64

Example Raw Text Issues:
U dun say so early hor... U c already then say...
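
Printing one message shows the kinds of problems by eye. A rough way to check for them programmatically is sketched below. The function name `flag_issues` and the patterns in `ISSUE_PATTERNS` are illustrative assumptions, not part of the assignment, and they are far from exhaustive.

```python
import re

# Hypothetical checks; the names and patterns are illustrative only.
ISSUE_PATTERNS = {
    "repeated_punctuation": re.compile(r"[.!?]{2,}"),
    "digits": re.compile(r"\d"),
    "single_letter_slang": re.compile(r"\b[ucn]\b", re.IGNORECASE),
    "all_caps_word": re.compile(r"\b[A-Z]{2,}\b"),
}

def flag_issues(text):
    """Return the names of the issue patterns that occur in text."""
    return [name for name, pattern in ISSUE_PATTERNS.items() if pattern.search(text)]

sample = "U dun say so early hor... U c already then say..."
print(flag_issues(sample))  # ['repeated_punctuation', 'single_letter_slang']
```

On the full dataset, something like `df['text'].apply(flag_issues)` would give a per-message list of flags.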


In [8]:
# TASK 2: BASIC TEXT CLEANING
def basic_clean(text):
    """Lowercase the text and strip punctuation, digits, and extra whitespace."""

    # Step 1: Lowercase
    text = text.lower()

    # Step 2: Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)

    # Step 3: Remove numbers
    text = re.sub(r'\d+', '', text)

    # Step 4: Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Store cleaned text
df['clean_text_basic'] = df['text'].apply(basic_clean)

print("\nOriginal vs Basic Cleaned:\n")
print(df[['text','clean_text_basic']].head())


Original vs Basic Cleaned:

                                                text  \
0  Go until jurong point, crazy.. Available only ...   
1                      Ok lar... Joking wif u oni...   
2  Free entry in 2 a wkly comp to win FA Cup fina...   
3  U dun say so early hor... U c already then say...   
4  Nah I don't think he goes to usf, he lives aro...   

                                    clean_text_basic  
0  go until jurong point crazy available only in ...  
1                            ok lar joking wif u oni  
2  free entry in a wkly comp to win fa cup final ...  
3        u dun say so early hor u c already then say  
4  nah i dont think he goes to usf he lives aroun...  
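
To see what each step contributes, the sketch below applies the same four substitutions to a made-up message and records the text after every step. `basic_clean_trace` is a hypothetical helper written for this illustration. Note that the amount and the phone number, which may be useful spam cues, disappear along the way.

```python
import re

def basic_clean_trace(text):
    """Apply the four basic cleaning steps and record the text after each one."""
    steps = []
    text = text.lower()
    steps.append(("lowercase", text))
    text = re.sub(r"[^\w\s]", "", text)
    steps.append(("remove punctuation", text))
    text = re.sub(r"\d+", "", text)
    steps.append(("remove numbers", text))
    text = re.sub(r"\s+", " ", text).strip()
    steps.append(("collapse whitespace", text))
    return steps

# Made-up sample message
for name, value in basic_clean_trace("WIN £100 now!!  Call 0800-123"):
    print(f"{name:>20}: {value!r}")
# The last line printed is: collapse whitespace: 'win now call'
```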


# PART 2 — ADVANCED TEXT CLEANING

In [9]:
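# TASK 3: Removing Noise
def advanced_clean(text):
    """Remove URLs, emails, HTML tags, and special characters (incl. emojis)."""

    # Remove URLs (anchored at a word boundary so words like "awww" are left alone)
    text = re.sub(r'\b(?:https?://|www\.)\S+', '', text)

    # Remove emails
    text = re.sub(r'\S+@\S+', '', text)

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove special characters & emojis
    text = re.sub(r'[^\w\s]', '', text)

    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply to the lowercased raw text rather than clean_text_basic: the URL, email,
# and HTML patterns need the punctuation that basic_clean has already stripped.
df['clean_text_advanced'] = df['text'].str.lower().apply(advanced_clean)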
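
Order matters when chaining cleaners. The email and HTML patterns rely on `@`, `<`, and `>`, which a punctuation-stripping pass removes. The sketch below uses two condensed stand-in functions (hypothetical names, URL step omitted for brevity) and a made-up message to show the difference.

```python
import re

def strip_punctuation(text):
    """Stand-in for the punctuation step of basic cleaning."""
    return re.sub(r"[^\w\s]", "", text)

def remove_emails_and_tags(text):
    """Stand-in for the email and HTML steps of advanced cleaning."""
    text = re.sub(r"\S+@\S+", "", text)
    text = re.sub(r"<.*?>", "", text)
    return re.sub(r"\s+", " ", text).strip()

raw = "mail bob@example.com for a <b>prize</b>"  # made-up sample

noise_first = strip_punctuation(remove_emails_and_tags(raw))
punctuation_first = remove_emails_and_tags(strip_punctuation(raw))

print(noise_first)        # mail for a prize
print(punctuation_first)  # mail bobexamplecom for a bprizeb
```

Once the punctuation is gone, the address and the tags no longer match their patterns and survive as garbled tokens.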

In [10]:
# TASK 4: Handling Stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = text.split()
    filtered = [w for w in words if w not in stop_words]
    return " ".join(filtered)

df['text_no_stopwords'] = df['clean_text_advanced'].apply(remove_stopwords)
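
NLTK's English stop-word list includes negations such as "not" and "no", so removing every stop word can change the meaning of a message. For tasks where negation matters, one option is to keep a small allow-list. The sketch below assumes a tiny stand-in stop-word set and a hypothetical `NEGATIONS` keep-list so that it runs without NLTK data.

```python
# Tiny stand-in stop-word set for illustration; the notebook uses NLTK's full list.
STOP_WORDS = {"this", "is", "a", "the", "not", "no"}
NEGATIONS = {"not", "no", "nor", "never"}  # hypothetical keep-list

def remove_stopwords_keep_negations(text, stop_words=STOP_WORDS, keep=NEGATIONS):
    """Drop stop words but keep any word listed in keep."""
    effective = stop_words - keep
    return " ".join(w for w in text.split() if w not in effective)

print(remove_stopwords_keep_negations("this offer is not a scam"))              # offer not scam
print(remove_stopwords_keep_negations("this offer is not a scam", keep=set()))  # offer scam
```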

In [11]:
# TASK 5: Repeated Characters & Slang (Optional)
# Normalize repeated characters: collapse any run to two (e.g., "soooo" -> "soo")
def normalize_repeated(text):
    return re.sub(r'(.)\1+', r'\1\1', text)

# Slang dictionary
slang_dict = {
    "u":"you",
    "gr8":"great",
    "luv":"love"
}

def replace_slang(text):
    words = text.split()
    words = [slang_dict.get(w, w) for w in words]
    return " ".join(words)

df['text_slang_fixed'] = df['text_no_stopwords'].apply(normalize_repeated)
df['text_slang_fixed'] = df['text_slang_fixed'].apply(replace_slang)
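
One caveat about ordering: expanding slang after stop-word removal can reintroduce stop words, because "u" becomes "you" only once the filter has already run. The sketch below restates the helpers so it runs on its own and uses a tiny stand-in stop-word set (an assumption; NLTK's English list also contains "you") on a made-up message.

```python
import re

SLANG = {"u": "you", "gr8": "great", "luv": "love"}
# Tiny stand-in stop-word set for illustration.
STOP_WORDS = {"you", "so", "the"}

def normalize_repeated(text):
    return re.sub(r"(.)\1+", r"\1\1", text)

def replace_slang(text):
    return " ".join(SLANG.get(w, w) for w in text.split())

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

text = "luv u soooo much"  # made-up sample

stopwords_then_slang = replace_slang(normalize_repeated(remove_stopwords(text)))
slang_then_stopwords = remove_stopwords(replace_slang(normalize_repeated(text)))

print(stopwords_then_slang)  # love you soo much
print(slang_then_stopwords)  # love soo much
```

The example also shows a limit of the repeated-character rule: "soooo" becomes "soo", which is shorter but still not a dictionary word.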

# PART 3 — BASIC TEXT PREPROCESSING


In [12]:
# TASK 6: Tokenization
print("\nWord Tokens Example:")
print(word_tokenize(df['text_slang_fixed'][0]))

print("\nSentence Tokens Example:")
print(sent_tokenize(df['text'][0]))


Word Tokens Example:
['go', 'jurong', 'point', 'crazy', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'got', 'amore', 'wat']

Sentence Tokens Example:
['Go until jurong point, crazy..', 'Available only in bugis n great world la e buffet... Cine there got amore wat...']


In [13]:
# TASK 7: Stemming
stemmer = PorterStemmer()

def stemming(text):
    return [stemmer.stem(word) for word in text.split()]

print("\nStemming Example:")
print(stemming(df['text_slang_fixed'][0]))


Stemming Example:
['go', 'jurong', 'point', 'crazi', 'avail', 'bugi', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'got', 'amor', 'wat']


In [14]:
# TASK 8: Lemmatization
lemmatizer = WordNetLemmatizer()

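def lemmatize(text):
    # Note: without a pos argument, WordNetLemmatizer treats every word as a noun,
    # so verb forms such as 'got' are returned unchanged.
    return [lemmatizer.lemmatize(word) for word in text.split()]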

print("\nLemmatization Example:")
print(lemmatize(df['text_slang_fixed'][0]))


Lemmatization Example:
['go', 'jurong', 'point', 'crazy', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'got', 'amore', 'wat']


In [15]:
# TASK 9: FINAL NLP PIPELINE FUNCTION
def nlp_preprocess(text):

    # Lowercase
    text = text.lower()

    # Noise removal
    text = advanced_clean(text)

    # Stopword removal
    text = remove_stopwords(text)

    # Lemmatization using spaCy
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc]

    return " ".join(tokens)

df['final_clean_text'] = df['text'].apply(nlp_preprocess)

print("\nFinal Cleaned Text:")
print(df[['text','final_clean_text']].head())


Final Cleaned Text:
                                                text  \
0  Go until jurong point, crazy.. Available only ...   
1                      Ok lar... Joking wif u oni...   
2  Free entry in 2 a wkly comp to win FA Cup fina...   
3  U dun say so early hor... U c already then say...   
4  Nah I don't think he goes to usf, he lives aro...   

                                    final_clean_text  
0  go jurong point crazy available bugis n great ...  
1                            ok lar joking wif u oni  
2  free entry 2 wkly comp win fa cup final tkts 2...  
3                u dun say early hor u c already say  
4         nah do not think go usf life around though  


##  Task 10 — Observations & Insights

### 1. Difference Between Basic and Advanced Cleaning

**Basic cleaning** standardizes text through simple operations: converting it to lowercase and removing punctuation, numbers, and extra spaces. It mainly improves consistency across the dataset.

**Advanced cleaning** targets the noise found in real-world text, such as URLs, email addresses, HTML tags, special characters, and emojis. Slang and repeated characters are handled in a separate normalization step (Task 5). Together these steps improve the quality and usefulness of the text for NLP tasks.



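### 2. Why Lemmatization is Preferred Over Stemming

**Stemming** reduces words by cutting suffixes using rules, which may produce incomplete or incorrect words (e.g., *studies → studi*). In the run above, stemming produced *crazi*, *avail*, and *bugi*.

**Lemmatization** converts words into their meaningful base form using linguistic knowledge (e.g., *studies → study*). In the same run it left *crazy*, *available*, and *bugis* intact.
Because its output is always a real word, lemmatization preserves meaning better and usually gives cleaner features for NLP applications, though it is slower and works best when part-of-speech information is available.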
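
The contrast can be sketched with two toy stand-ins: a rule-based suffix stripper and a dictionary lookup. Both are assumptions made for illustration. `SUFFIX_RULES`, `LEMMA_TABLE`, `toy_stem`, and `toy_lemmatize` are hypothetical and much simpler than `PorterStemmer`, `WordNetLemmatizer`, or spaCy.

```python
# Toy stand-ins for illustration only: real stemmers (e.g. Porter) apply many
# more rules, and real lemmatizers use a full dictionary plus part-of-speech tags.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]
LEMMA_TABLE = {"studies": "study", "lives": "live", "went": "go"}

def toy_stem(word):
    """Strip the first matching suffix, with no check that the result is a word."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    """Look the word up in a small dictionary; unknown words pass through."""
    return LEMMA_TABLE.get(word, word)

for w in ["studies", "lives", "went"]:
    print(w, toy_stem(w), toy_lemmatize(w))
# studies studi study
# lives live live
# went went go
```

The suffix rules cannot handle the irregular form "went", while the lookup can, but only because the word is in its table.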



### 3. Importance of Preprocessing in NLP Models

Text preprocessing is essential because raw text contains noise and inconsistencies. Cleaning and preprocessing:

* Remove irrelevant information
* Standardize text format
* Reduce vocabulary size
* Help NLP models understand meaningful patterns

Well-chosen preprocessing generally improves model performance and efficiency, since the model sees fewer distinct tokens and less noise.
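
The effect on vocabulary size is easy to measure. The sketch below counts distinct tokens before and after basic cleaning. It restates `basic_clean` so that it runs on its own, `vocabulary` is a hypothetical helper, and the three messages are made up for the example.

```python
import re

def basic_clean(text):
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", text).strip()

def vocabulary(texts):
    """Return the set of distinct whitespace-separated tokens."""
    return {token for text in texts for token in text.split()}

messages = [  # made-up sample messages
    "Free entry! Win a FREE prize.",
    "free Entry, win a prize",
    "WIN a Prize... free entry",
]

raw_vocab = vocabulary(messages)
clean_vocab = vocabulary(basic_clean(m) for m in messages)
print(len(raw_vocab), len(clean_vocab))  # 13 5
```

Case and punctuation variants of the same five words account for all the extra tokens in the raw vocabulary.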