# Text Preprocessing

## Section 0: Creating Data Sets

### Theory Notes

Before diving into preprocessing techniques, we need a sample dataset to work with. In real-world applications, text data comes from various sources like social media posts, customer reviews, documents, or web scraping results.

### Code Implementation

In [3]:

# Import Pandas library

import pandas as pd

In [4]:

data = [
    "When life gives you lemons, make lemonade! 🙂",
    "She bought 2 lemons for $1 at Maven Market.",
    "A dozen lemons will make a gallon of lemonade. [AllRecipes]",
    "lemon, lemon, lemons, lemon, lemon, lemons",
    "He's running to the market to get a lemon — there's a great sale today.",
    "Does Maven Market carry Eureka lemons or Meyer lemons?",
    "An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]",
    "iced tea is my favorite"
]


In [5]:

# Convert list to DataFrame

data_df = pd.DataFrame(data, columns=['sentence'])


In [6]:

# Set display options to show full content

pd.set_option('display.max_colwidth', None)


## Section 1: Preprocessing

### 1.1 Normalization

### Theory Notes

Text normalization is the process of converting text to a standard, consistent format. The most common normalization technique is converting all text to lowercase, which ensures that words like "Apple" and "apple" are treated as the same token.
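
The effect of case normalization can be sketched with the standard library alone. The token list below is made up for illustration and is not part of the dataset; `str.casefold()` is shown only as a stricter alternative to `str.lower()` that may be useful for non-English text.

```python
from collections import Counter

# Hypothetical tokens, chosen only to illustrate case differences
tokens = ["Apple", "apple", "APPLE", "Lemon", "lemon"]

# Without normalization, each casing counts as a distinct token
raw_counts = Counter(tokens)

# After lowercasing, the variants collapse into one entry per word
lower_counts = Counter(token.lower() for token in tokens)

# casefold() is a more aggressive form of lower() meant for caseless matching
folded = "Straße".casefold()

print(raw_counts)
print(lower_counts)
print(folded)
```

Lowercasing also discards information, such as the capitalization of proper nouns like "Maven Market", so whether to apply it depends on the downstream task.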

### Code Implementation


In [7]:

# Create a copy for spaCy processing

spacy_df = data_df.copy()



# Convert text to lowercase

spacy_df['clean_sentence'] = spacy_df['sentence'].str.lower()


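### 1.2 Text Cleaning

### Theory Notes

Text cleaning removes content that carries little meaning for analysis. The cell below first strips a specific citation tag, then uses a single regular expression to replace URLs, HTML tags, email addresses, @mentions, hashtags, and any remaining non-alphanumeric characters with spaces, and finally collapses repeated whitespace.

### Code Implementation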

In [8]:

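# Remove a specific citation tag (the text is already lowercased at this point).
# regex=False forces a literal match; read as a regex, '[wikipedia]' is a character class.

spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace('[wikipedia]', '', regex=False)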
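
Whether `'[wikipedia]'` is read literally or as a regular expression changes the result considerably, and the default of `Series.str.replace` differs between pandas versions (it became literal by default in pandas 2.0). The sketch below uses plain Python strings and the `re` module to show the difference. The last pattern, which removes any bracketed tag, is a hypothetical generalization and not part of the original notebook.

```python
import re

text = "an arnold palmer is half lemonade, half iced tea. [wikipedia]"

# Literal replacement: only the exact substring "[wikipedia]" is removed
literal = text.replace("[wikipedia]", "")

# Regex interpretation: "[wikipedia]" is a character class matching any single
# character among w, i, k, p, e, d, a, so those letters are deleted everywhere
# while the square brackets themselves are left in place
as_regex = re.sub("[wikipedia]", "", text)

# Hypothetical generalization: remove any tag of the form [ ... ]
any_citation = re.sub(r"\[[^\]]*\]", "", text)

print(repr(literal))
print(repr(as_regex))
print(repr(any_citation))
```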



# Advanced cleaning with regex

combined = r'https?://\S+|www\.\S+|<.*?>|\S+@\S+\.\S+|@\w+|#\w+|[^A-Za-z0-9\s]'

spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace(combined, ' ', regex=True)

spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace(r'\s+', ' ', regex=True).str.strip()
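
To make the combined pattern easier to read, the sketch below rebuilds it from its individual alternatives and applies it to a single string with the standard `re` module. The dictionary keys, the `clean` helper, and the sample string are illustrative names introduced here, not part of the original notebook.

```python
import re

# Each alternative of the combined pattern, labeled for readability
parts = {
    "url": r"https?://\S+|www\.\S+",
    "html_tag": r"<.*?>",
    "email": r"\S+@\S+\.\S+",
    "mention": r"@\w+",
    "hashtag": r"#\w+",
    "other_symbol": r"[^A-Za-z0-9\s]",
}

# Joining with "|" reproduces the pattern used in the cell above
combined = re.compile("|".join(parts.values()))


def clean(text: str) -> str:
    """Lowercase, replace unwanted spans with spaces, then collapse whitespace."""
    text = combined.sub(" ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical input that exercises every alternative
sample = "Visit https://example.com <b>now</b>! Email me@example.com or ping @maven #lemons 🙂"

print(clean(sample))
print(clean("When life gives you lemons, make lemonade! 🙂"))
```

Alternatives are tried left to right at each position, which is why the catch-all symbol class comes last. Placed first, it would consume `<`, `@`, or `#` on their own and leave tag names and handles behind in the text.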


### 1.3 Advanced Text Processing with spaCy

### Theory Notes

spaCy is an industrial-strength NLP library that provides tokenization, lemmatization, and other linguistic analysis. Its pre-trained pipelines, such as the small English model `en_core_web_sm` used here, annotate each token with attributes like its lemma, part of speech, and syntactic dependencies.

### Code Implementation

In [None]:

# Install spaCy if it is not already available in the environment

!pip install spacy

import spacy


In [None]:

# Download and install the English language model

!python -m spacy download en_core_web_sm



# Load the pre-trained pipeline

nlp = spacy.load('en_core_web_sm')



# Process a sample sentence

phrase = spacy_df.clean_sentence[0] # "when life gives you lemons make lemonade"

doc = nlp(phrase)


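If these cells fail with `ModuleNotFoundError: No module named 'spacy'`, spaCy is not installed in the active environment. Install it (for example with `pip install spacy`), restart the kernel if needed, and rerun the cells above.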
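
Once spaCy and the model are available, the processed `doc` can be inspected token by token. The following is a minimal sketch under that assumption: `token_table` is a hypothetical helper introduced here, and the guard around the import simply reports a missing `spacy` module or a missing `en_core_web_sm` model instead of raising.

```python
def token_table(doc):
    """Return (text, lemma, part of speech, is stop word) for each token."""
    return [(tok.text, tok.lemma_, tok.pos_, tok.is_stop) for tok in doc]


try:
    import spacy

    # spacy.load raises OSError if the model has not been downloaded
    nlp = spacy.load("en_core_web_sm")
except (ImportError, OSError) as err:
    nlp = None
    print(f"spaCy pipeline unavailable: {err}")

if nlp is not None:
    # Same text as the first cleaned sentence in the notebook
    doc = nlp("when life gives you lemons make lemonade")
    for row in token_table(doc):
        print(row)
```

The lemma column is what later steps would typically use to group inflected forms such as "lemons" and "lemon"; the exact values depend on the installed model version.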