**Text Preprocessing for NLP: Practical Implementation Report**

This report documents a text preprocessing pipeline for NLP applications. The pipeline transforms raw text into clean, numerical representations suitable for machine learning.

**1. Installation & Import**


In [8]:
!python -m spacy download en_core_web_sm



In [27]:
!pip install scikit-learn



**What it does**: Sets up essential libraries for data handling (pandas), advanced NLP (spaCy), and vectorization (scikit-learn).

In [None]:
import pandas as pd
import spacy # Natural language processing library for advanced text processing
from sklearn.feature_extraction.text import CountVectorizer # Convert text documents to a matrix of token counts (bag-of-words)
from sklearn.feature_extraction.text import TfidfVectorizer # Convert text documents to TF-IDF feature vectors (term frequency-inverse document frequency)

# Successfully installed and imported all required tools.
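
Before running the rest of the notebook, it can help to confirm that the three packages are actually importable, since an interrupted install may leave one missing. The following is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical and not part of the original notebook.

```python
import importlib.util

# Import names (not pip names) of the packages this report relies on.
REQUIRED = ["pandas", "spacy", "sklearn"]

def check_packages(names):
    """Return {name: True/False} depending on whether each package is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_packages(REQUIRED)
missing = [name for name, ok in status.items() if not ok]
print("Missing packages:", missing or "none")
```

Note that this only checks the Python packages; the `en_core_web_sm` model is verified later, when `spacy.load` is called.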

**2. Dataset Creation** 

**What it does**: Creates a test dataset with various text challenges (punctuation, emojis, numbers, citations).

In [None]:

data = [

"When life gives you lemons, make lemonade! 🙂",

"She bought 2 lemons for $1 at Maven Market.",

"A dozen lemons will make a gallon of lemonade. [AllRecipes]",

"lemon, lemon, lemons, lemon, lemon, lemons",

"He's running to the market to get a lemon — there's a great sale today.",

"Does Maven Market carry Eureka lemons or Meyer lemons?",

"An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]",

"iced tea is my favorite"

] # A list of 8 diverse sentences for testing preprocessing techniques (converted to a DataFrame in the next cell).


In [3]:

# Convert list to DataFrame

data_df = pd.DataFrame(data, columns=['sentence'])


In [4]:

# Set display options to show full content

pd.set_option('display.max_colwidth', None)


**Text Cleaning: Basic Normalization**


**What it does**: Converts all text to lowercase for consistency and removes the `[wikipedia]` citation marker.

In [None]:
# Create a copy for spaCy processing

spacy_df = data_df.copy()



# Convert text to lowercase

spacy_df['clean_sentence'] = spacy_df['sentence'].str.lower() # Eliminated case sensitivity issues.


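# Remove specific citations
# regex=False forces a literal match; as a regex, '[wikipedia]' would be a character class

spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace('[wikipedia]', '', regex=False)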
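
Whether a pattern like `'[wikipedia]'` is read as a literal string or as a regular expression matters here: pandas changed the default of `regex` in `Series.str.replace` to `False` in version 2.0, while older versions treated the pattern as a regex. The sketch below uses only the standard library's `re` module to show both readings; the sample string is a hypothetical stand-in for the dataset row.

```python
import re

text = "iced tea [wikipedia]"  # hypothetical sample, already lowercased

# Literal replacement: removes the exact substring "[wikipedia]".
literal = text.replace("[wikipedia]", "").strip()

# Regex replacement: "[wikipedia]" is a character class, so every single
# w, i, k, p, e, d or a is deleted and the brackets themselves survive.
as_char_class = re.sub("[wikipedia]", "", text)

# Escaping the pattern makes the regex engine treat it literally again.
escaped = re.sub(re.escape("[wikipedia]"), "", text).strip()

print(repr(literal))        # 'iced tea'
print(repr(as_char_class))  # 'c t []'
print(repr(escaped))        # 'iced tea'
```

Passing `regex=False` explicitly (or escaping the pattern) keeps the behavior the same across pandas versions.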

**Advanced Cleaning**

**What it does**: Removes URLs, HTML tags, email addresses, mentions, hashtags, and any remaining non-alphanumeric characters with a single regex, then collapses repeated whitespace.

In [None]:


# Advanced cleaning with regex

combined = r'https?://\S+|www\.\S+|<.*?>|\S+@\S+\.\S+|@\w+|#\w+|[^A-Za-z0-9\s]'

spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace(combined, ' ', regex=True) # Clean text with only letters, numbers, and spaces.

spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace(r'\s+', ' ', regex=True).str.strip() # Clean up extra whitespace by replacing multiple spaces with single space and removing leading/trailing spaces
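
To see what each alternative in the combined pattern removes, the same two steps can be tried on a single string with the standard library's `re` module. This is a sketch only: the `clean` helper and the sample strings are hypothetical, while the pattern is copied from the cell above.

```python
import re

# Same pattern as in the notebook cell: URLs, HTML tags, emails,
# mentions, hashtags, then any remaining non-alphanumeric character.
COMBINED = r'https?://\S+|www\.\S+|<.*?>|\S+@\S+\.\S+|@\w+|#\w+|[^A-Za-z0-9\s]'

def clean(text):
    lowered = text.lower()
    no_noise = re.sub(COMBINED, ' ', lowered)        # replace noise with a space
    return re.sub(r'\s+', ' ', no_noise).strip()     # collapse repeated whitespace

sample = "Visit https://example.com or mail me@example.com! #lemons @maven <b>50% off</b>"
print(clean(sample))                         # visit or mail 50 off
print(clean("He's running to the market"))   # he s running to the market
```

The second call shows a side effect worth knowing about: the apostrophe in a contraction is replaced by a space, which leaves a stray `s` token that later survives stop word removal.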


**Advanced Text Processing with spaCy**

Takes cleaned text and prepares it for advanced NLP operations by loading language tools and creating a structured representation of the sentence.

In [None]:
# Load the pre-trained pipeline

nlp = spacy.load('en_core_web_sm') # Load spaCy's English model with tokenization, lemmatization, and POS tagging capabilities


# Extract first cleaned sentence from DataFrame as sample text for processing
phrase = spacy_df.clean_sentence[0] # "when life gives you lemons make lemonade"

# Apply spaCy pipeline to create Doc object with tokens, lemmas, and linguistic annotations
doc = nlp(phrase)


**Tokenization**

Shows two ways to get the tokens of the sentence: as plain strings, or as spaCy `Token` objects that carry linguistic attributes for further analysis.

In [10]:

# Extract tokens as text strings

[token.text for token in doc]

# Output: ['when', 'life', 'gives', 'you', 'lemons', 'make', 'lemonade']



# Extract tokens as spaCy objects (with linguistic attributes)

[token for token in doc]

# Output: [when, life, gives, you, lemons, make, lemonade]


[when, life, gives, you, lemons, make, lemonade]

**Lemmatization**


Converts words to their basic dictionary forms - "gives" becomes "give" and "lemons" becomes "lemon".

In [11]:

# Extract lemmatized forms

[token.lemma_ for token in doc]

# Output: ['when', 'life', 'give', 'you', 'lemon', 'make', 'lemonade']


['when', 'life', 'give', 'you', 'lemon', 'make', 'lemonade']

**Stop Words Removal**

Removes common, low-information words such as "when" and "you", keeps only the content words, converts them to their base forms, and joins them back into a clean sentence.

In [None]:

# View all English stop words in spaCy

list(nlp.Defaults.stop_words)

# Show total count of stop words in spaCy's English model
print(f"Total stop words: {len(nlp.Defaults.stop_words)}") # 326 stop words



# Remove stop words
# Filter out common words like 'when', 'the', 'you', keeping only meaningful content
[token for token in doc if not token.is_stop]

# Output: [life, gives, lemons, lemonade]



# Combine lemmatization and stop word removal
# Get root form of words while removing stop words for cleaner text analysis
[token.lemma_ for token in doc if not token.is_stop]

# Output: ['life', 'give', 'lemon', 'lemonade']



# Convert back to sentence format
# Join processed tokens into a clean, normalized sentence
norm = [token.lemma_ for token in doc if not token.is_stop]

' '.join(norm) # Output: 'life give lemon lemonade'


Total stop words: 326


'life give lemon lemonade'

**Creating Reusable Functions**

Creates a reusable function that takes any text, removes common words, converts remaining words to their basic forms, and returns a clean sentence - then applies this function to all sentences in the dataset.

In [None]:
# Function for lemmatization and stop word removal
# Processes text using spaCy to reduce words to root forms and remove common words
def token_lemma_stopw(text):

    # Process text through spaCy NLP pipeline
    doc = nlp(text)

    # Extract lemmatized form of each word, excluding stop words (the, and, is, etc.)
    output = [token.lemma_ for token in doc if not token.is_stop]

    return ' '.join(output) # Join processed tokens back into a single string



# Apply to entire dataset

spacy_df.clean_sentence.apply(token_lemma_stopw)

0                       life give lemon lemonade
1                     buy 2 lemon 1 maven market
2          dozen lemon gallon lemonade allrecipe
3            lemon lemon lemon lemon lemon lemon
4          s run market lemon s great sale today
5    maven market carry eureka lemon meyer lemon
6       arnold palmer half lemonade half ice tea
7                               ice tea favorite
Name: clean_sentence, dtype: object

**Complete NLP Pipeline**

**Creates two functions**: one for basic cleaning (lowercasing and removing special characters) and another that combines all preprocessing steps. The complete pipeline is then applied to the original sentences, and the results are saved for later use.

In [None]:
# Function to convert text to lowercase and remove special characters
def lower_replace(series):

    output = series.str.lower() # Convert all text to lowercase for consistency

    # Remove URLs, emails, HTML tags, mentions, hashtags, and special characters
    combined = r'https?://\S+|www\.\S+|<.*?>|\S+@\S+\.\S+|@\w+|#\w+|[^A-Za-z0-9\s]'

    output = output.str.replace(combined, ' ', regex=True)

    # Collapse repeated whitespace, as in the step-by-step cleaning earlier,
    # so spaCy does not emit whitespace-only tokens
    output = output.str.replace(r'\s+', ' ', regex=True).str.strip()

    return output



# Function that combines all preprocessing steps into one pipeline
def nlp_pipeline(series):

    # First clean and standardize the text
    output = lower_replace(series)

    # Then apply tokenization, lemmatization, and stop word removal
    output = output.apply(token_lemma_stopw)

    return output



# Apply complete pipeline to original dataset

cleaned_text = nlp_pipeline(data_df.sentence)



# Save processed data for future use

pd.to_pickle(cleaned_text, 'preprocessed_text.pkl') # Store cleaned text as pickle file to avoid reprocessing in future runs


**Word Representation (Vectorization)**

Loads the cleaned text, converts each sentence into numerical vectors by counting how many times each word appears, then displays the results in a table where rows are sentences and columns are words.

In [None]:
# Load preprocessed data
series = pd.read_pickle('preprocessed_text.pkl') # Read previously cleaned and processed text data from pickle file

# Create Count Vectorizer

cv = CountVectorizer() # Initialize basic bag-of-words vectorizer to convert text to word counts

bow = cv.fit_transform(series) # Transform text data into numerical word count matrix

# Convert to DataFrame for visualization

pd.DataFrame(bow.toarray(), columns=cv.get_feature_names_out()) # Display the sparse matrix as a readable table with words as columns

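| | allrecipe | arnold | buy | carry | dozen | eureka | favorite | gallon | give | great | ... | life | market | maven | meyer | palmer | run | sale | tea | today | wikipedia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |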
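
What `CountVectorizer` computes can be approximated in a few lines of standard-library Python. The sketch below is an illustration, not scikit-learn's implementation: `bag_of_words` is a hypothetical helper, the two documents are copied from the lemmatized output shown earlier, and the token pattern is scikit-learn's documented default, which keeps only tokens of two or more word characters.

```python
import re
from collections import Counter

# Two preprocessed sentences copied from the lemmatized output shown earlier.
docs = ["buy 2 lemon 1 maven market", "lemon lemon lemon lemon lemon lemon"]

# scikit-learn's default token_pattern: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bag_of_words(documents):
    """Return (sorted vocabulary, count matrix as a list of rows)."""
    counts = [Counter(TOKEN_PATTERN.findall(doc.lower())) for doc in documents]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

vocab, matrix = bag_of_words(docs)
print(vocab)   # ['buy', 'lemon', 'market', 'maven']
print(matrix)  # [[1, 1, 1, 1], [0, 6, 0, 0]]
```

Two things follow from this. The single-character tokens `2` and `1` never reach the vocabulary, and row 3 of the table above only looks empty because its one term, `lemon`, sits in the columns hidden behind `...`.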


**Optimized Bag-of-Words Implementation**

Creates a smarter word counter that removes common English words, focuses only on individual words (not phrases), and keeps only words that appear in at least 2 sentences, then calculates how often each word appears overall.

In [None]:
# Count Vectorizer with filtering
# Creates a clean bag-of-words model by removing common stop words and rare terms
cv1 = CountVectorizer(
    stop_words='english', # Remove English stop words (scikit-learn's built-in list)
    ngram_range=(1,1), # Use only single words (unigrams)
    min_df=2 # Include words that appear in at least 2 documents
)



bow1 = cv1.fit_transform(series) # Transform text into word count matrix


bow1_df = pd.DataFrame(bow1.toarray(), columns=cv1.get_feature_names_out()) # Convert sparse matrix to readable DataFrame format



# Calculate term frequencies

term_freq = bow1_df.sum() # Sum word counts across all documents to see most frequent terms
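
The `min_df=2` filter works on document frequency (how many sentences contain a term), not on the total count. The standard-library sketch below makes that distinction concrete. It is an approximation under stated assumptions: the helper names are hypothetical, the sentences are copied from the lemmatized output shown earlier, and scikit-learn's stop word list and token pattern are ignored.

```python
from collections import Counter

# Preprocessed sentences copied from the lemmatized output shown earlier.
docs = [
    "life give lemon lemonade",
    "buy 2 lemon 1 maven market",
    "dozen lemon gallon lemonade allrecipe",
    "lemon lemon lemon lemon lemon lemon",
    "s run market lemon s great sale today",
    "maven market carry eureka lemon meyer lemon",
    "arnold palmer half lemonade half ice tea",
    "ice tea favorite",
]

def document_frequency(documents):
    """Number of documents in which each term occurs at least once."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.split()))  # set(): count a term once per document
    return df

def total_count(documents):
    """Total number of occurrences of each term across all documents."""
    return Counter(word for doc in documents for word in doc.split())

df = document_frequency(docs)
tf = total_count(docs)

kept = sorted(term for term, n in df.items() if n >= 2)
print(kept)                      # ['ice', 'lemon', 'lemonade', 'market', 'maven', 'tea']
print(tf["half"], df["half"])    # 2 1
```

`half` occurs twice in total but only in one sentence, so a document-frequency threshold of 2 drops it.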

**TF-IDF (Term Frequency-Inverse Document Frequency)**

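Weights each term by how often it occurs in a sentence, scaled down by how many sentences contain it, so terms concentrated in a few sentences score higher than terms that appear everywhere. The second vectorizer additionally keeps only terms that appear in at least 2 sentences (`min_df=2`).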

In [None]:
# Basic TF-IDF vectorization

tv = TfidfVectorizer()

tvidf = tv.fit_transform(series) # Transform text data into TF-IDF matrix

tvidf_df = pd.DataFrame(tvidf.toarray(), columns=tv.get_feature_names_out()) # Convert sparse matrix to readable DataFrame format



# TF-IDF with filtering

tv1 = TfidfVectorizer(min_df=2) # Words must appear in at least 2 documents

tvidf1 = tv1.fit_transform(series) # Transform text with filtering applied

tvidf1_df = pd.DataFrame(tvidf1.toarray(), columns=tv1.get_feature_names_out()) # Convert filtered results to DataFrame
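
The weighting can be worked through by hand for one sentence. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length. The helper `tfidf_row` is hypothetical, and the document frequencies are read off the eight preprocessed sentences for the sentence "ice tea favorite".

```python
import math
from collections import Counter

def tfidf_row(doc_terms, df, n_docs):
    """TF-IDF weights for one document: smoothed IDF, then L2 normalization."""
    tf = Counter(doc_terms)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

# "ice" and "tea" occur in 2 of the 8 sentences, "favorite" in only 1.
df = {"ice": 2, "tea": 2, "favorite": 1}
row = tfidf_row(["ice", "tea", "favorite"], df, n_docs=8)

for term, weight in row.items():
    print(f"{term:10s}{weight:.4f}")
```

All three terms occur once in the sentence, yet `favorite` receives the largest weight because it is the rarest across the corpus.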

**N-gram Analysis**

Uses single words and word pairs (unigrams and bigrams) to capture more context, then ranks each feature by its TF-IDF weight summed across all sentences. This surfaces both important individual words and meaningful phrases such as "ice tea" or "maven market", which single-word analysis alone would miss.
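
As a rough illustration of what `ngram_range=(1,2)` produces for one sentence, the following standard-library sketch builds the unigrams and bigrams from a token list. The `ngrams` helper is hypothetical and skips scikit-learn's own tokenization.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams (joined by spaces) for n from n_min to n_max."""
    out = []
    for n in range(n_min, n_max + 1):
        # Slide a window of length n over the token list.
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

print(ngrams("ice tea favorite".split()))
# ['ice', 'tea', 'favorite', 'ice tea', 'tea favorite']
```

Each of these strings becomes its own column in the TF-IDF matrix, which is why the feature count grows quickly as the n-gram range widens.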

In [None]:

# Unigram + bigram TF-IDF (single words plus pairs of consecutive words)

tv2 = TfidfVectorizer(ngram_range=(1,2)) # Include both unigrams and bigrams

tvidf2 = tv2.fit_transform(series) # Transform text data into TF-IDF feature matrix

tvidf2_df = pd.DataFrame(tvidf2.toarray(), columns=tv2.get_feature_names_out()) # Convert sparse matrix to DataFrame for easier analysis



# Analyze feature importance

tvidf2_df.sum().sort_values(ascending=False)


lemon                 1.583310
lemon lemon           0.857624
market                0.767950
lemonade              0.743321
ice tea               0.625522
ice                   0.625522
tea                   0.625522
maven                 0.621858
maven market          0.621858
half                  0.505881
favorite              0.493436
tea favorite          0.493436
lemon maven           0.439482
buy                   0.439482
buy lemon             0.439482
give lemon            0.416207
life                  0.416207
lemon lemonade        0.416207
give                  0.416207
life give             0.416207
gallon lemonade       0.358685
dozen lemon           0.358685
allrecipe             0.358685
dozen                 0.358685
gallon                0.358685
lemonade allrecipe    0.358685
lemon gallon          0.358685
sale today            0.319884
today                 0.319884
great sale            0.319884
great                 0.319884
market lemon          0.319884
lemon gr