# Text Preprocessing

## Section 0: Creating Data Sets

### Theory Notes

Before diving into preprocessing techniques, we need a sample dataset to work with. In real-world applications, text data comes from various sources like social media posts, customer reviews, documents, or web scraping results.

### Code Implementation

In [1]:
# Import Pandas library

import pandas as pd

In [2]:
data = [

"When life gives you lemons, make lemonade! 🙂",

"She bought 2 lemons for $1 at Maven Market.",

"A dozen lemons will make a gallon of lemonade. [AllRecipes]",

"lemon, lemon, lemons, lemon, lemon, lemons",

"He's running to the market to get a lemon — there's a great sale today.",

"Does Maven Market carry Eureka lemons or Meyer lemons?",

"An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]",

"iced tea is my favorite"

]

In [3]:

# Convert list to DataFrame

data_df = pd.DataFrame(data, columns=['sentence'])

data_df

| | sentence |
|---|---|
| 0 | When life gives you lemons, make lemonade! 🙂 |
| 1 | She bought 2 lemons for $1 at Maven Market. |
| 2 | A dozen lemons will make a gallon of lemonade.... |
| 3 | lemon, lemon, lemons, lemon, lemon, lemons |
| 4 | He's running to the market to get a lemon — th... |
| 5 | Does Maven Market carry Eureka lemons or Meyer... |
| 6 | An Arnold Palmer is half lemonade, half iced t... |
| 7 | iced tea is my favorite |


In [5]:

# Set display options to show full content

pd.set_option('display.max_colwidth', None)


## Section 1: Preprocessing

### 1.1 Normalization

**Theory Notes**

Text normalization is the process of converting text to a standard, consistent format. The most common normalization technique is converting all text to lowercase, which ensures that words like "Apple" and "apple" are treated as the same token.
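
To make the effect of lowercasing concrete, here is a minimal standard-library sketch that counts tokens before and after normalization. The sample sentence is made up for illustration and is not part of the lemon dataset.

```python
from collections import Counter

# Hypothetical sample text, used only for this illustration
text = "Apple pie needs an apple, and APPLE juice needs apples"

# Without normalization, each casing is counted as a different token
raw_counts = Counter(text.replace(",", "").split())

# After lowercasing, "Apple", "apple" and "APPLE" collapse into one token
norm_counts = Counter(text.lower().replace(",", "").split())

print(raw_counts["apple"])   # counts only the exact lowercase form
print(norm_counts["apple"])  # counts all three casings together
```

Note that "apples" is still a separate token after lowercasing; merging inflected forms is the job of lemmatization, not case normalization.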

### Code Implementation


In [6]:

# Create a copy for spaCy processing

spacy_df = data_df.copy()

# Convert text to lowercase

spacy_df['clean_sentence'] = spacy_df['sentence'].str.lower()

### 1.2 Text Cleaning
### Code Implementation

In [7]:
import spacy

In [8]:

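# Remove specific citations (regex=False matches the brackets literally

# instead of treating '[wikipedia]' as a regex character class)

spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace('[wikipedia]', '', regex=False)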

# Advanced cleaning with regex

combined = r'https?://\S+|www\.\S+|<.*?>|\S+@\S+\.\S+|@\w+|#\w+|[^A-Za-z0-9\s]'

spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace(combined, ' ', regex=True)

spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace(r'\s+', ' ', regex=True).str.strip()
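
The two cleaning steps above can be tried on plain strings with the standard `re` module. The sketch below first shows why a bracketed citation should be removed as a literal string (as a regex, `[wikipedia]` is a character class that deletes individual letters), then applies the same combined pattern to a made-up sample containing a URL, an email address, a hashtag, and HTML tags. The sample strings are assumptions for illustration only.

```python
import re

text = "an arnold palmer is half lemonade, half iced tea. [wikipedia]"

# As a regex, '[wikipedia]' means "any one of the letters w, i, k, p, e, d, a"
as_regex = re.sub('[wikipedia]', '', text)

# As a literal string, only the exact citation is removed
as_literal = text.replace('[wikipedia]', '')

print(as_regex)
print(as_literal)

# Same combined pattern as in the cell above
combined = r'https?://\S+|www\.\S+|<.*?>|\S+@\S+\.\S+|@\w+|#\w+|[^A-Za-z0-9\s]'

# Hypothetical noisy input
sample = "visit https://example.com or email me@example.com #lemons <b>now</b>!"

# Replace every match with a space, then collapse the leftover whitespace
cleaned = re.sub(combined, ' ', sample)
cleaned = re.sub(r'\s+', ' ', cleaned).strip()

print(cleaned)
```

Replacing matches with a space rather than an empty string keeps neighboring words from being glued together, which is why the whitespace-collapsing step follows.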


## Section 1.2: Advanced Text Processing with spaCy

### Theory Notes

spaCy is an open-source NLP library designed for production use. It provides tokenization, lemmatization, and linguistic analysis through pre-trained language pipelines that model grammar, syntax, and word relationships.

### Code Implementation

In [9]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [10]:
# Download and install English language model

# !python -m spacy download en_core_web_sm



# Load the pre-trained pipeline

nlp = spacy.load('en_core_web_sm')



# Process a sample sentence

phrase = spacy_df.clean_sentence[0] # "when life gives you lemons make lemonade"

doc = nlp(phrase)


### 1.2.1 Tokenization

**Theory Notes**

Tokenization splits text into individual units (tokens) such as words, punctuation marks, or numbers. Modern tokenizers handle complex cases like contractions, compound words, and special characters intelligently.
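
For intuition, the following standard-library sketch contrasts two naive tokenizers. It is not how spaCy tokenizes (spaCy applies language-specific rules for cases such as contractions); it only shows why splitting on whitespace alone is usually not enough.

```python
import re

text = "He's running to the market."

# Naive approach: split on whitespace; punctuation stays glued to words
whitespace_tokens = text.split()

# Slightly better: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)  # "market." keeps its trailing period
print(regex_tokens)       # punctuation becomes separate tokens
```

The regex version separates the period from "market", but it also breaks "He's" into three pieces, which is the kind of case a rule-based tokenizer is designed to handle more sensibly.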

### Code Implementation

In [11]:

# Extract tokens as text strings

[token.text for token in doc]

# Output: ['when', 'life', 'gives', 'you', 'lemons', 'make', 'lemonade']



# Extract tokens as spaCy objects (with linguistic attributes)

[token for token in doc]

# Output: [when, life, gives, you, lemons, make, lemonade]


[when, life, gives, you, lemons, make, lemonade]

### 1.2.2 Lemmatization

**Theory Notes**

Lemmatization reduces words to their base or root form (lemma) using linguistic knowledge. Unlike stemming, which simply removes suffixes, lemmatization considers the word's part of speech and meaning to find the correct root form.
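
The difference between stemming and lemmatization can be sketched without any NLP library. In the toy example below, `naive_stem` and the tiny `LEMMA_LOOKUP` table are hypothetical helpers written for this illustration; real stemmers and spaCy's lemmatizer are considerably more sophisticated.

```python
def naive_stem(word):
    """Strip a couple of common suffixes without any linguistic knowledge."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Hypothetical hand-written lookup standing in for a real lemmatizer
LEMMA_LOOKUP = {"running": "run", "better": "good", "mice": "mouse"}

def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_LOOKUP.get(word, word)

for word in ["running", "better", "mice", "lemons"]:
    print(word, "->", naive_stem(word), "|", lookup_lemma(word))
```

Suffix stripping turns "running" into the non-word "runn" and cannot relate "better" to "good" or "mice" to "mouse", whereas a lemmatizer that knows the vocabulary can.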

**Examples:**

    "running" → "run"

    "better" → "good"

    "mice" → "mouse"

### Code Implementation

In [12]:

# Extract lemmatized forms

[token.lemma_ for token in doc]

# Output: ['when', 'life', 'give', 'you', 'lemon', 'make', 'lemonade']

['when', 'life', 'give', 'you', 'lemon', 'make', 'lemonade']

### 1.2.3 Stop Words Removal

**Theory Notes**

Stop words are common words that carry little semantic meaning and are often filtered out to focus on more meaningful content. Examples include `"the", "and", "is", "in"`, etc.
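
As a library-free illustration, stop word removal is just a membership test against a set. The small `STOP_WORDS` set below is an assumption made for this sketch; it is not spaCy's list, which is much larger.

```python
# Hypothetical, deliberately tiny stop word set for illustration
STOP_WORDS = {"the", "and", "is", "in", "when", "you", "make"}

tokens = "when life gives you lemons make lemonade".split()

# Keep only tokens that are not in the stop word set
content_tokens = [token for token in tokens if token not in STOP_WORDS]

print(content_tokens)
```

Using a `set` keeps each lookup constant-time, which matters once the stop word list and the corpus grow.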

### Code Implementation

In [13]:

# View all English stop words in spaCy

list(nlp.Defaults.stop_words)

print(f"Total stop words: {len(list(nlp.Defaults.stop_words))}") # 326 stop words



# Remove stop words

[token for token in doc if  not token.is_stop]

# Output: [life, gives, lemons, lemonade]



# Combine lemmatization and stop word removal

[token.lemma_ for token in doc if  not token.is_stop]

# Output: ['life', 'give', 'lemon', 'lemonade']



# Convert back to sentence format

norm = [token.lemma_ for token in doc if  not token.is_stop]

' '.join(norm) # Output: 'life give lemon lemonade'


Total stop words: 326


'life give lemon lemonade'

## Section 2: Creating Reusable Functions

**Theory Notes**

Creating modular, reusable functions is essential for maintainable code and consistent preprocessing across different datasets.

### Code Implementation

In [15]:

# Function for lemmatization and stop word removal

def token_lemma_stopw(text):

    # Run the spaCy pipeline, drop stop words, and join the remaining lemmas

    doc = nlp(text)

    output = [token.lemma_ for token in doc if not token.is_stop]

    return ' '.join(output)



# Apply to entire dataset

spacy_df.clean_sentence.apply(token_lemma_stopw)


0                       life give lemon lemonade
1                     buy 2 lemon 1 maven market
2          dozen lemon gallon lemonade allrecipe
3            lemon lemon lemon lemon lemon lemon
4          s run market lemon s great sale today
5    maven market carry eureka lemon meyer lemon
6       arnold palmer half lemonade half ice tea
7                               ice tea favorite
Name: clean_sentence, dtype: object

## Section 3: Complete NLP Pipeline

**Theory Notes**

An NLP pipeline combines multiple preprocessing steps into a single, streamlined workflow. This approach ensures consistency and makes it easy to apply the same transformations to new data.
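
One way to picture a pipeline is as an ordered list of small functions, each taking text and returning text. The sketch below is a simplified, standard-library stand-in for that idea; the step functions and `run_pipeline` are hypothetical names and do not replace the pandas/spaCy implementation in this section.

```python
import re
from functools import reduce

def lowercase(text):
    return text.lower()

def strip_punctuation(text):
    # Replace anything that is not a lowercase letter, digit, or whitespace
    return re.sub(r"[^a-z0-9\s]", " ", text)

def collapse_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()

# The order matters: strip_punctuation assumes the text is already lowercase
STEPS = [lowercase, strip_punctuation, collapse_whitespace]

def run_pipeline(text, steps=STEPS):
    """Feed the output of each step into the next one."""
    return reduce(lambda acc, step: step(acc), steps, text)

result = run_pipeline("When life gives you lemons, make lemonade!")
print(result)
```

Keeping the steps in a list makes it easy to add, remove, or reorder transformations while guaranteeing that every input goes through the same sequence.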

### Code Implementation

In [16]:

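def lower_replace(series):

    # Lowercase, replace URLs, HTML tags, emails, mentions, hashtags and

    # punctuation with spaces, then collapse the leftover whitespace

    output = series.str.lower()

    combined = r'https?://\S+|www\.\S+|<.*?>|\S+@\S+\.\S+|@\w+|#\w+|[^A-Za-z0-9\s]'

    output = output.str.replace(combined, ' ', regex=True)

    output = output.str.replace(r'\s+', ' ', regex=True).str.strip()

    return output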



def nlp_pipeline(series):

    output = lower_replace(series)

    output = output.apply(token_lemma_stopw)

    return output



# Apply complete pipeline

cleaned_text = nlp_pipeline(data_df.sentence)



# Save processed data for future use

pd.to_pickle(cleaned_text, 'preprocessed_text.pkl')


## Section 4: Word Representation (Vectorization)

**Theory Notes**

Vectorization converts preprocessed text into numerical representations that machine learning algorithms can process. Text must be transformed into vectors (arrays of numbers) because most machine learning algorithms operate on numbers, not on raw strings.

### 4.1 Count Vectorization (Bag of Words)

**Theory Notes**

Count Vectorization creates a matrix where each row represents a document and each column represents a unique word in the corpus. Cell values indicate how many times each word appears in each document. This approach ignores word order but captures word frequency.
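
The document-term matrix described above can be built by hand in a few lines. This is a minimal sketch using only the standard library and three short documents taken from the preprocessed output; it ignores the tokenization details that `CountVectorizer` handles.

```python
from collections import Counter

docs = [
    "life give lemon lemonade",
    "lemon lemon lemon",
    "ice tea favorite",
]

# One column per unique word in the corpus, in alphabetical order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document; each cell is the count of that word in that document
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Because only counts are stored, "lemon lemon lemon" and any reordering of the same words would produce an identical row.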

### Code Implementation

In [19]:
pip install scikit-learn


In [20]:

# Load preprocessed data

import pandas as pd

series = pd.read_pickle('preprocessed_text.pkl')



from sklearn.feature_extraction.text import CountVectorizer



# Create Count Vectorizer

cv = CountVectorizer()

bow = cv.fit_transform(series)



# Convert to DataFrame for visualization

pd.DataFrame(bow.toarray(), columns=cv.get_feature_names_out())


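| | allrecipe | arnold | buy | carry | dozen | eureka | favorite | gallon | give | great | ... | life | market | maven | meyer | palmer | run | sale | tea | today | wikipedia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |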


### Advanced Count Vectorization

In [21]:

# Count Vectorizer with filtering

cv1 = CountVectorizer(

    stop_words='english', # Remove English stop words

    ngram_range=(1,1), # Use only single words (unigrams)

    min_df=2  # Include words that appear in at least 2 documents

)



bow1 = cv1.fit_transform(series)

bow1_df = pd.DataFrame(bow1.toarray(), columns=cv1.get_feature_names_out())



# Calculate term frequencies

term_freq = bow1_df.sum()
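
To see what `min_df=2` does, it helps to compute document frequency by hand. The sketch below is a standard-library approximation that tokenizes by splitting on spaces, so it is not identical to `CountVectorizer`'s own tokenization; the four documents are taken from the preprocessed output above.

```python
from collections import Counter

docs = [
    "life give lemon lemonade",
    "buy 2 lemon 1 maven market",
    "ice tea favorite",
    "arnold palmer half lemonade half ice tea",
]

# Document frequency: in how many documents does each word occur?
# Using set() means a word repeated inside one document is counted once.
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

# Keep only words that occur in at least min_df documents
min_df = 2
kept = sorted(word for word, df in doc_freq.items() if df >= min_df)

print(kept)
```

Note that "half" appears twice but only in one document, so its document frequency is 1 and it is filtered out; `min_df` is about how many documents contain a word, not how often the word occurs overall.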


## Section 5: TF-IDF (Term Frequency-Inverse Document Frequency)

**Theory Notes**

**TF-IDF** addresses a key limitation of simple count vectorization by considering both term frequency (how often a word appears in a document) and inverse document frequency (how rare the word is across the entire corpus).

- **Formula**: $\text{TF-IDF} = \text{TF} \times \text{IDF}$
- **TF (Term Frequency):** Number of times the word appears in a document / Total number of words in that document
- **IDF (Inverse Document Frequency):** log(Total number of documents / Number of documents containing the word)

**Key Insight:** TF-IDF gives higher weights to words that are frequent in a specific document but rare across the corpus, making them more distinctive and informative.
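
The formulas above can be checked with a small hand calculation. The sketch below implements exactly the textbook definitions given in this section on a made-up three-document corpus. It will not reproduce scikit-learn's numbers: by default `TfidfVectorizer` uses raw counts for TF, a smoothed IDF, and L2-normalizes each row.

```python
import math

# Hypothetical tokenized corpus for illustration
docs = [
    ["lemon", "lemon", "lemonade"],
    ["ice", "tea", "lemon"],
    ["ice", "tea", "favorite"],
]

def tf(term, doc):
    """Occurrences of the term divided by the total words in the document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

lemon_score = tf_idf("lemon", docs[0], docs)        # frequent here, but common
lemonade_score = tf_idf("lemonade", docs[0], docs)  # rarer across the corpus

print(round(lemon_score, 4), round(lemonade_score, 4))
```

In the first document "lemon" occurs twice and "lemonade" once, yet "lemonade" receives the higher score because it appears in only one of the three documents.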

### Code Implementation

In [22]:

from sklearn.feature_extraction.text import TfidfVectorizer



# Basic TF-IDF vectorization

tv = TfidfVectorizer()

tvidf = tv.fit_transform(series)

tvidf_df = pd.DataFrame(tvidf.toarray(), columns=tv.get_feature_names_out())



# TF-IDF with filtering

tv1 = TfidfVectorizer(min_df=2) # Words must appear in at least 2 documents

tvidf1 = tv1.fit_transform(series)

tvidf1_df = pd.DataFrame(tvidf1.toarray(), columns=tv1.get_feature_names_out())


#### N-gram Analysis
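
An n-gram is a sequence of n consecutive tokens; bigrams (n = 2) capture short phrases such as "ice tea" that single words miss. As a minimal standard-library sketch, with `ngrams` being a hypothetical helper name, they can be generated like this:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "maven market carry eureka lemon".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)
print(bigrams)
```

A document with k tokens yields k - 1 bigrams, so including bigrams roughly doubles the number of candidate features per document.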

In [23]:

# Unigram + bigram TF-IDF (single words plus pairs of consecutive words)

tv2 = TfidfVectorizer(ngram_range=(1,2)) # (1,2) includes both unigrams and bigrams

tvidf2 = tv2.fit_transform(series)

tvidf2_df = pd.DataFrame(tvidf2.toarray(), columns=tv2.get_feature_names_out())



# Analyze feature importance

tvidf2_df.sum().sort_values(ascending=False)


lemon                 1.583310
lemon lemon           0.857624
market                0.767950
lemonade              0.743321
ice tea               0.625522
ice                   0.625522
tea                   0.625522
maven market          0.621858
maven                 0.621858
half                  0.505881
tea favorite          0.493436
favorite              0.493436
buy lemon             0.439482
buy                   0.439482
lemon maven           0.439482
life give             0.416207
life                  0.416207
give                  0.416207
give lemon            0.416207
lemon lemonade        0.416207
lemonade allrecipe    0.358685
lemon gallon          0.358685
allrecipe             0.358685
dozen                 0.358685
gallon                0.358685
dozen lemon           0.358685
gallon lemonade       0.358685
run                   0.319884
sale today            0.319884
market lemon          0.319884
sale                  0.319884
run market            0.319884
great   