
# Practical 1: Text Preprocessing   

This practical covers the **fundamental concepts and implementation** of text preprocessing for Natural Language Processing (NLP). Preprocessing transforms raw text into a clean, structured format that is suitable for machine learning algorithms.  



## Primary Aim  
**To build a robust, reusable text preprocessing pipeline for downstream NLP tasks.**

---

## Specific Objectives  
- Prepare clean text data for **classification tasks** (sentiment analysis, spam detection, etc.)  
- Improve text quality for **information retrieval** (search engines, recommendation systems)  
- Preprocess corpora for **topic modeling** (LDA, NMF)  
- Perform **feature engineering** with numerical text representations  
- Enhance **data quality** by removing noise and inconsistencies  



## Module Info  
- **Unit**: 3  
- **Learning Outcome**: Implement common text preprocessing techniques  


## Section 0: Creating Data Sets  

### 🔹 Theory Notes  
Before applying preprocessing techniques, we create a sample dataset. In practice, text data may come from sources like social media posts, reviews, or web scraping.  

In [None]:
# Import pandas for data manipulation
import pandas as pd
data = [
    # Sample sentences with various text challenges
    'When life gives you lemons, make lemonade! 🙂',
    'She bought 2 lemons for $1 at Maven Market.',
    'A dozen lemons will make a gallon of lemonade. [AllRecipes]',
    'lemon, lemon, lemons, lemon, lemon, lemons',
    # Example with dash and contraction
    "He's running to the market to get a lemon - there's a great sale today.",
    'Does Maven Market carry Eureka lemons or Meyer lemons?',
    'An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]',
    'iced tea is my favorite'
]
# Convert list to DataFrame for easier manipulation
data_df = pd.DataFrame(data, columns=['sentence'])
# Set display option to show full text content
pd.set_option('display.max_colwidth', None)
# Display the DataFrame
data_df

|   | sentence |
|---|---|
| 0 | When life gives you lemons, make lemonade! 🙂 |
| 1 | She bought 2 lemons for \$1 at Maven Market. |
| 2 | A dozen lemons will make a gallon of lemonade. [AllRecipes] |
| 3 | lemon, lemon, lemons, lemon, lemon, lemons |
| 4 | He's running to the market to get a lemon - there's a great sale today. |
| 5 | Does Maven Market carry Eureka lemons or Meyer lemons? |
| 6 | An Arnold Palmer is half lemonade, half iced tea. [Wikipedia] |
| 7 | iced tea is my favorite |


## Section 1: Preprocessing
### 1.1 Normalization
Convert all text to lowercase so that variants such as `Market` and `market` are treated as the same token.

In [None]:
# Create a copy of the original DataFrame to preserve raw data
spacy_df = data_df.copy()
# Convert all sentences to lowercase for normalization
spacy_df['clean_sentence'] = spacy_df['sentence'].str.lower()
# Show both original and normalized sentences
spacy_df[['sentence', 'clean_sentence']]

|   | sentence | clean_sentence |
|---|---|---|
| 0 | When life gives you lemons, make lemonade! 🙂 | when life gives you lemons, make lemonade! 🙂 |
| 1 | She bought 2 lemons for \$1 at Maven Market. | she bought 2 lemons for \$1 at maven market. |
| 2 | A dozen lemons will make a gallon of lemonade. [AllRecipes] | a dozen lemons will make a gallon of lemonade. [allrecipes] |
| 3 | lemon, lemon, lemons, lemon, lemon, lemons | lemon, lemon, lemons, lemon, lemon, lemons |
| 4 | He's running to the market to get a lemon - there's a great sale today. | he's running to the market to get a lemon - there's a great sale today. |
| 5 | Does Maven Market carry Eureka lemons or Meyer lemons? | does maven market carry eureka lemons or meyer lemons? |
| 6 | An Arnold Palmer is half lemonade, half iced tea. [Wikipedia] | an arnold palmer is half lemonade, half iced tea. [wikipedia] |
| 7 | iced tea is my favorite | iced tea is my favorite |


### 1.2 Text Cleaning
Remove citations, URLs, HTML tags, email addresses, social media handles, hashtags, and any remaining non-alphanumeric characters.

In [None]:
# Remove specific citations like [wikipedia] from text
spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace('[wikipedia]', '', regex=False)
# Define regex pattern for advanced cleaning (URLs, emails, non-alphanumeric, etc.)
combined = r'https?://\S+|www\.\S+|<.*?>|\S+@\S+\.\S+|@\w+|#\w+|[^A-Za-z0-9\s]'
# Apply regex to clean unwanted patterns
spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace(combined, ' ', regex=True)
# Normalize whitespace (replace multiple spaces with single space) and strip leading/trailing spaces
spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace(r'\s+', ' ', regex=True).str.strip()
# Show cleaned sentences
spacy_df[['sentence', 'clean_sentence']]

|   | sentence | clean_sentence |
|---|---|---|
| 0 | When life gives you lemons, make lemonade! 🙂 | when life gives you lemons make lemonade |
| 1 | She bought 2 lemons for \$1 at Maven Market. | she bought 2 lemons for 1 at maven market |
| 2 | A dozen lemons will make a gallon of lemonade. [AllRecipes] | a dozen lemons will make a gallon of lemonade allrecipes |
| 3 | lemon, lemon, lemons, lemon, lemon, lemons | lemon lemon lemons lemon lemon lemons |
| 4 | He's running to the market to get a lemon - there's a great sale today. | he s running to the market to get a lemon there s a great sale today |
| 5 | Does Maven Market carry Eureka lemons or Meyer lemons? | does maven market carry eureka lemons or meyer lemons |
| 6 | An Arnold Palmer is half lemonade, half iced tea. [Wikipedia] | an arnold palmer is half lemonade half iced tea |
| 7 | iced tea is my favorite | iced tea is my favorite |
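
The sample sentences contain no URLs, emails, or handles, so the effect of most branches of the cleaning pattern is not visible in the output above. The following is a minimal sketch using only the standard `re` module that applies the same alternation to a made-up string. The helper name `clean_text` and the `sample` string are assumptions for illustration and are not part of the practical.

```python
import re

# The alternation from the cleaning cell, written with single backslashes
NOISE = re.compile(
    r'https?://\S+|www\.\S+|<.*?>|\S+@\S+\.\S+|@\w+|#\w+|[^A-Za-z0-9\s]'
)

def clean_text(text):
    """Lowercase, replace noisy spans with a space, then collapse whitespace."""
    text = NOISE.sub(' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical input, not taken from the practical's dataset
sample = 'Visit https://example.com or email me@example.com! <b>Thanks</b> @maven #lemons'
print(clean_text(sample))  # visit or email thanks
```

Because the branches are tried left to right at each position, the URL, email, tag, handle, and hashtag branches consume whole spans before the final character class removes leftover punctuation one character at a time.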


### 1.3 Advanced Text Processing with spaCy
Install and load spaCy's English model, then process a sample sentence.

In [None]:
# Import spaCy for advanced NLP processing
import spacy
# If not already installed, uncomment the next line to download the English model
# !python -m spacy download en_core_web_sm
# Load the pre-trained English language model
nlp = spacy.load('en_core_web_sm')
# Select a sample cleaned sentence for processing
phrase = spacy_df.clean_sentence[0]
# Process the sentence with spaCy
doc = nlp(phrase)
# Display tokens extracted from the sentence
[token.text for token in doc]

['when', 'life', 'gives', 'you', 'lemons', 'make', 'lemonade']

### Lemmatization
Reduce each word to its dictionary base form (lemma), for example `gives` becomes `give` and `lemons` becomes `lemon`.

In [None]:
# Extract lemmatized (base) forms of each token in the sentence
[token.lemma_ for token in doc]

['when', 'life', 'give', 'you', 'lemon', 'make', 'lemonade']

### Stop Words Removal
Remove common words that carry little meaning on their own. In the sample sentence, spaCy flags `when`, `you`, and `make` as stop words.

In [None]:
# Remove stop words and show lemmatized tokens that carry meaning
[token.lemma_ for token in doc if not token.is_stop]

['life', 'give', 'lemon', 'lemonade']

## Section 2: Creating Reusable Functions
Wrap lemmatization and stop word removal in a single function so that both steps can be applied to every row of the DataFrame.

In [None]:
# Function to lemmatize and remove stop words from text
def token_lemma_stopw(text):
    doc = nlp(text)
    output = [token.lemma_ for token in doc if not token.is_stop]
    return ' '.join(output)

# Apply the function to all cleaned sentences in the DataFrame
spacy_df['processed'] = spacy_df['clean_sentence'].apply(token_lemma_stopw)
# Show cleaned and processed sentences
spacy_df[['clean_sentence', 'processed']]

|   | clean_sentence | processed |
|---|---|---|
| 0 | when life gives you lemons make lemonade | life give lemon lemonade |
| 1 | she bought 2 lemons for 1 at maven market | buy 2 lemon 1 maven market |
| 2 | a dozen lemons will make a gallon of lemonade allrecipes | dozen lemon gallon lemonade allrecipe |
| 3 | lemon lemon lemons lemon lemon lemons | lemon lemon lemon lemon lemon lemon |
| 4 | he s running to the market to get a lemon there s a great sale today | s run market lemon s great sale today |
| 5 | does maven market carry eureka lemons or meyer lemons | maven market carry eureka lemon meyer lemon |
| 6 | an arnold palmer is half lemonade half iced tea | arnold palmer half lemonade half ice tea |
| 7 | iced tea is my favorite | ice tea favorite |
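
Row 4 of the output above contains stray `s` tokens: the cleaning step turns the apostrophe in `he's` and `there's` into a space, which leaves `s` as a separate word. One possible remedy, sketched here with the standard `re` module, is to drop the `'s` suffix before punctuation is removed. The helper name `strip_clitic_s` is hypothetical, and this is only one of several reasonable ways to handle contractions.

```python
import re

def strip_clitic_s(text):
    """Drop a trailing 's (possessive or contracted "is"/"has") from each word."""
    return re.sub(r"(?<=\w)['’]s\b", '', text)

raw = "He's running to the market to get a lemon - there's a great sale today."
print(strip_clitic_s(raw))
# He running to the market to get a lemon - there a great sale today.
```

The contracted verb is discarded rather than expanded, which is usually acceptable here because forms such as `is` would be removed as stop words anyway.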


## Section 3: Complete NLP Pipeline
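Combine all preprocessing steps into a single pipeline. Unlike Section 1.2, this pipeline does not remove the `[wikipedia]` citation explicitly, so row 6 of the output keeps the token `wikipedia`.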

In [None]:
def lower_replace(series):
    # Lowercase, then replace URLs, HTML tags, emails, handles, hashtags and punctuation with a space
    output = series.str.lower()
    combined = r'https?://\S+|www\.\S+|<.*?>|\S+@\S+\.\S+|@\w+|#\w+|[^A-Za-z0-9\s]'
    output = output.str.replace(combined, ' ', regex=True)
    # Collapse runs of whitespace and trim both ends
    output = output.str.replace(r'\s+', ' ', regex=True).str.strip()
    return output

# Complete pipeline: normalization, cleaning, lemmatization, stop word removal
def nlp_pipeline(series):
    output = lower_replace(series)
    output = output.apply(token_lemma_stopw)
    return output

# Apply the pipeline to the original sentences
cleaned_text = nlp_pipeline(data_df.sentence)
# Display the final processed text
cleaned_text

0                              life give lemon lemonade
1                            buy 2 lemon 1 maven market
2                 dozen lemon gallon lemonade allrecipe
3                   lemon lemon lemon lemon lemon lemon
4                 s run market lemon s great sale today
5           maven market carry eureka lemon meyer lemon
6    arnold palmer half lemonade half ice tea wikipedia
7                                      ice tea favorite
Name: sentence, dtype: object
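
The source tags `[AllRecipes]` and `[Wikipedia]` end up in the processed text as `allrecipe` and `wikipedia` because the cleaning regex only strips the brackets. A generic alternative is to delete any bracketed span before the other cleaning steps run. The sketch below uses the standard `re` module. The helper name `strip_bracketed` is hypothetical, and it assumes that square brackets in the corpus only ever enclose citations.

```python
import re

def strip_bracketed(text):
    """Remove any [bracketed] span, such as a trailing source citation."""
    return re.sub(r'\[[^\]]*\]', ' ', text).strip()

print(strip_bracketed('An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]'))
# An Arnold Palmer is half lemonade, half iced tea.
```

Applied at the start of a pipeline, a step like this would remove both citations without listing each source by name.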

## Section 4: Word Representation (Vectorization)
Convert processed text into numerical vectors using CountVectorizer.

In [None]:
# Import CountVectorizer for Bag-of-Words representation
from sklearn.feature_extraction.text import CountVectorizer
# Create a CountVectorizer instance
cv = CountVectorizer()
# Fit and transform the processed text to get the BoW matrix
bow = cv.fit_transform(cleaned_text)
# Convert the matrix to a DataFrame for easy viewing
pd.DataFrame(bow.toarray(), columns=cv.get_feature_names_out())

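|   | allrecipe | arnold | buy | carry | dozen | eureka | favorite | gallon | give | great | ... | life | market | maven | meyer | palmer | run | sale | tea | today | wikipedia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |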
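
The vocabulary above has no columns for `2`, `1`, or `s`, even though they appear in the processed text. This is because `CountVectorizer`'s default `token_pattern` only keeps tokens of two or more word characters. The sketch below rebuilds the same bag-of-words counts with the standard library to make that behaviour visible. The names `docs`, `tokenize`, `vocab`, and `matrix` are illustrative, and the `docs` list is copied by hand from the pipeline output.

```python
import re
from collections import Counter

# Processed sentences copied from the pipeline output
docs = [
    'life give lemon lemonade',
    'buy 2 lemon 1 maven market',
    'dozen lemon gallon lemonade allrecipe',
    'lemon lemon lemon lemon lemon lemon',
    's run market lemon s great sale today',
    'maven market carry eureka lemon meyer lemon',
    'arnold palmer half lemonade half ice tea wikipedia',
    'ice tea favorite',
]

def tokenize(doc):
    # Mirrors CountVectorizer's default token pattern: 2+ word characters
    return re.findall(r'\b\w\w+\b', doc.lower())

counts = [Counter(tokenize(d)) for d in docs]      # per-document term counts
vocab = sorted(set().union(*counts))               # alphabetical vocabulary
matrix = [[c[term] for term in vocab] for c in counts]

print(len(vocab))                        # 24
print(matrix[3][vocab.index('lemon')])   # 6
```

The 24 terms include `half`, `ice`, `lemon`, and `lemonade`, which are the columns hidden behind `...` in the displayed DataFrame.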


### Advanced Count Vectorization
Filter out English stop words and drop rare terms that appear in fewer than two documents (`min_df=2`).

In [None]:
# Advanced CountVectorizer: remove stop words, use unigrams, filter rare words
cv1 = CountVectorizer(stop_words='english', ngram_range=(1,1), min_df=2)
# Fit and transform the processed text
bow1 = cv1.fit_transform(cleaned_text)
# Convert to DataFrame for analysis
bow1_df = pd.DataFrame(bow1.toarray(), columns=cv1.get_feature_names_out())
# Calculate term frequencies across all documents
term_freq = bow1_df.sum()
# Display term frequencies
term_freq

ice          2
lemon       12
lemonade     3
market       3
maven        2
tea          2
dtype: int64
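
To show what `min_df=2` does, the sketch below computes document frequency (the number of documents containing a term) separately from term frequency (total occurrences) using only the standard library. Stop word filtering is left out, and the variable names are illustrative. For this small corpus the result matches the term frequencies shown above.

```python
import re
from collections import Counter

# Processed sentences copied from the pipeline output
docs = [
    'life give lemon lemonade',
    'buy 2 lemon 1 maven market',
    'dozen lemon gallon lemonade allrecipe',
    'lemon lemon lemon lemon lemon lemon',
    's run market lemon s great sale today',
    'maven market carry eureka lemon meyer lemon',
    'arnold palmer half lemonade half ice tea wikipedia',
    'ice tea favorite',
]

tokenized = [re.findall(r'\b\w\w+\b', d) for d in docs]

# Document frequency counts each term at most once per document
doc_freq = Counter(t for toks in tokenized for t in set(toks))
# Term frequency counts every occurrence
term_freq = Counter(t for toks in tokenized for t in toks)

min_df = 2
kept = {t: term_freq[t] for t in sorted(doc_freq) if doc_freq[t] >= min_df}
print(kept)
# {'ice': 2, 'lemon': 12, 'lemonade': 3, 'market': 3, 'maven': 2, 'tea': 2}
```

Note that `half` occurs twice but only in one document, so it is dropped: `min_df` thresholds on document frequency, not on total count.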

## Section 5: TF-IDF (Term Frequency-Inverse Document Frequency)
Calculate TF-IDF scores, which down-weight terms that occur in many documents and up-weight terms that are distinctive to a few.

In [None]:
# Import TfidfVectorizer for TF-IDF representation
from sklearn.feature_extraction.text import TfidfVectorizer
# Create a TfidfVectorizer instance
tv = TfidfVectorizer()
# Fit and transform the processed text to get the TF-IDF matrix
tvidf = tv.fit_transform(cleaned_text)
# Convert the matrix to a DataFrame for easy viewing
tvidf_df = pd.DataFrame(tvidf.toarray(), columns=tv.get_feature_names_out())
# Display the TF-IDF DataFrame
tvidf_df

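|   | allrecipe | arnold | buy | carry | dozen | eureka | favorite | gallon | give | great | ... | life | market | maven | meyer | palmer | run | sale | tea | today | wikipedia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.600547 | 0.0 | ... | 0.600547 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.63563 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.459683 | 0.532707 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.514841 | 0.0 | 0.0 | 0.0 | 0.514841 | 0.0 | 0.0 | 0.514841 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.457738 | ... | 0.0 | 0.331033 | 0.0 | 0.0 | 0.0 | 0.457738 | 0.457738 | 0.0 | 0.457738 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.437511 | 0.0 | 0.437511 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.316405 | 0.366668 | 0.437511 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.334679 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.334679 | 0.0 | 0.0 | 0.280487 | 0.0 | 0.334679 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.644859 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.540443 | 0.0 | 0.0 |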
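
The values above can be traced by hand. With scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, raw counts as term frequency), each weight is the count multiplied by `ln((1 + n) / (1 + df)) + 1`, and each row is then scaled to unit length. The sketch below recomputes row 7 with the standard library only. The names `tfidf_row`, `doc_freq`, and `n_docs` are illustrative, and the tokenizer is a simplified stand-in for the vectorizer's default.

```python
import math
import re
from collections import Counter

# Processed sentences copied from the pipeline output
docs = [
    'life give lemon lemonade',
    'buy 2 lemon 1 maven market',
    'dozen lemon gallon lemonade allrecipe',
    'lemon lemon lemon lemon lemon lemon',
    's run market lemon s great sale today',
    'maven market carry eureka lemon meyer lemon',
    'arnold palmer half lemonade half ice tea wikipedia',
    'ice tea favorite',
]

tokenized = [re.findall(r'\b\w\w+\b', d) for d in docs]
n_docs = len(docs)
doc_freq = Counter(t for toks in tokenized for t in set(toks))

def tfidf_row(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(tokens)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {
        t: tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t in tf
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row7 = tfidf_row(tokenized[7])
print({t: round(w, 4) for t, w in row7.items()})
```

The result should be close to the `tea` (0.540443) and `favorite` (0.644859) entries in row 7: `favorite` appears in one document and therefore receives a higher weight than `tea`, which appears in two.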


### TF-IDF with Filtering
Focus on words that appear in at least 2 documents.

In [None]:
# TF-IDF with filtering: only include words appearing in at least 2 documents
tv1 = TfidfVectorizer(min_df=2)
# Fit and transform the processed text
tvidf1 = tv1.fit_transform(cleaned_text)
# Convert to DataFrame for analysis
tvidf1_df = pd.DataFrame(tvidf1.toarray(), columns=tv1.get_feature_names_out())
# Display the filtered TF-IDF DataFrame
tvidf1_df

|   | ice | lemon | lemonade | market | maven | tea |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.568471 | 0.822704 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.411442 | 0.0 | 0.595449 | 0.690041 | 0.0 |
| 2 | 0.0 | 0.568471 | 0.822704 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.568471 | 0.0 | 0.822704 | 0.0 | 0.0 |
| 5 | 0.0 | 0.67013 | 0.0 | 0.484914 | 0.561947 | 0.0 |
| 6 | 0.603613 | 0.0 | 0.520868 | 0.0 | 0.0 | 0.603613 |
| 7 | 0.707107 | 0.0 | 0.0 | 0.0 | 0.0 | 0.707107 |


### N-gram Analysis
Include both unigrams and bigrams for phrase-level information.
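
As a rough illustration of which features `ngram_range=(1,2)` produces, the sketch below generates contiguous unigrams and bigrams from a token list in plain Python. The helper name `ngrams` and its parameters are hypothetical. It only shows the enumeration step, not the TF-IDF weighting.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams for n_min <= n <= n_max, joined by spaces."""
    return [
        ' '.join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

print(ngrams('ice tea favorite'.split()))
# ['ice', 'tea', 'favorite', 'ice tea', 'tea favorite']
```

Bigrams such as `ice tea` keep word order within a two-token window, which lets a model distinguish a phrase from its individual words.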

In [None]:
# N-gram TF-IDF: include both unigrams and bigrams for phrase-level features
tv2 = TfidfVectorizer(ngram_range=(1,2))
# Fit and transform the processed text
tvidf2 = tv2.fit_transform(cleaned_text)
# Convert to DataFrame for analysis
tvidf2_df = pd.DataFrame(tvidf2.toarray(), columns=tv2.get_feature_names_out())
# Analyze feature importance by summing TF-IDF scores
tvidf2_df.sum().sort_values(ascending=False)

lemon                 1.583310
lemon lemon           0.857624
market                0.767950
lemonade              0.743321
ice tea               0.625522
ice                   0.625522
tea                   0.625522
maven                 0.621858
maven market          0.621858
half                  0.505881
favorite              0.493436
tea favorite          0.493436
lemon maven           0.439482
buy                   0.439482
buy lemon             0.439482
give lemon            0.416207
life                  0.416207
lemon lemonade        0.416207
give                  0.416207
life give             0.416207
gallon lemonade       0.358685
dozen lemon           0.358685
allrecipe             0.358685
dozen                 0.358685
gallon                0.358685
lemonade allrecipe    0.358685
lemon gallon          0.358685
sale today            0.319884
today                 0.319884
great sale            0.319884
great                 0.319884
market lemon          0.319884
lemon gr