# Introduction to NLP & Text Cleaning

## What is NLP?
Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language. It combines **linguistics, computer science, and machine learning** to analyze text and speech data.

NLP is widely used in applications like:
- Text classification (spam detection, sentiment analysis)
- Machine translation (English → French)
- Chatbots and virtual assistants
- Text summarization and question answering

## Why Text Cleaning is Important
Raw text is messy: it often contains stray characters, inconsistent punctuation, and capitalization differences that make identical words look distinct to a model.
Cleaning is usually the **first step** in NLP preprocessing. Well-cleaned text reduces vocabulary noise, which generally improves the performance of machine learning models and embeddings.
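
To make the point concrete, here is a minimal sketch (the sentence and variable names are illustrative, not part of the notebook's dataset) that counts tokens before and after basic cleaning. Without cleaning, the three spellings of "great" are treated as unrelated tokens.

```python
from collections import Counter
import string

text = "Great product! great price, GREAT service."

# Without cleaning, every surface form is counted as a separate token.
raw_counts = Counter(text.split())

# After lowercasing and stripping ASCII punctuation, the variants collapse.
cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
clean_counts = Counter(cleaned.split())

print("distinct tokens before:", len(raw_counts))
print("distinct tokens after: ", len(clean_counts))
print("count of 'great' before/after:", raw_counts["great"], clean_counts["great"])
```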

**In this notebook, we will learn:**
1. Lowercasing
2. Removing punctuation
3. Removing stopwords
4. Handling special characters and numbers
5. Cleaning a sample dataset


In [1]:
# Import libraries
import re
import string
import pandas as pd
from nltk.corpus import stopwords


In [2]:
# Download stopwords (if first time)
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\IDREES
[nltk_data]     AHMAD\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Sample Dataset

In [3]:
texts = [
    "Hello there! How are you doing today?",
    "NLP is amazing, isn't it?",
    "Text preprocessing is a critical step in NLP!",
    "Remove punctuation, stopwords, and make lowercase."
]

df = pd.DataFrame(texts, columns=['Text'])
df.head()

Unnamed: 0,Text
0,Hello there! How are you doing today?
1,"NLP is amazing, isn't it?"
2,Text preprocessing is a critical step in NLP!
3,"Remove punctuation, stopwords, and make lowerc..."


**Step 1: Lowercasing**

In [4]:
# Lowercase all text
df['Clean_Text'] = df['Text'].str.lower()
df.head()

Unnamed: 0,Text,Clean_Text
0,Hello there! How are you doing today?,hello there! how are you doing today?
1,"NLP is amazing, isn't it?","nlp is amazing, isn't it?"
2,Text preprocessing is a critical step in NLP!,text preprocessing is a critical step in nlp!
3,"Remove punctuation, stopwords, and make lowerc...","remove punctuation, stopwords, and make lowerc..."
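
For English text, `str.lower()` is usually enough. If the data may contain other languages, Python's built-in `str.casefold()` applies a more aggressive, Unicode-aware normalization. The sketch below uses a hypothetical helper, `normalize_case`, and a German example to show the difference; it is an optional variation, not what the notebook does.

```python
def normalize_case(text):
    """Case-normalize text with casefold(), which handles more Unicode cases than lower()."""
    return text.casefold()

# lower() keeps the German sharp s, so the two spellings still differ.
print("Straße".lower() == "STRASSE".lower())

# casefold() maps the sharp s to "ss", so both spellings match.
print(normalize_case("Straße") == normalize_case("STRASSE"))
```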


**Step 2: Remove Punctuation**

In [5]:
# Remove punctuation (build the translation table once and reuse it for every row)
punct_table = str.maketrans('', '', string.punctuation)
df['Clean_Text'] = df['Clean_Text'].apply(lambda x: x.translate(punct_table))
df.head()

Unnamed: 0,Text,Clean_Text
0,Hello there! How are you doing today?,hello there how are you doing today
1,"NLP is amazing, isn't it?",nlp is amazing isnt it
2,Text preprocessing is a critical step in NLP!,text preprocessing is a critical step in nlp
3,"Remove punctuation, stopwords, and make lowerc...",remove punctuation stopwords and make lowercase
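
One caveat: `string.punctuation` contains only ASCII punctuation, so typographic characters such as curly quotes pass through untouched. The following sketch compares the translate-based approach with a regex alternative; the sample sentence and the choice of the pattern `[^\w\s]` are assumptions for illustration.

```python
import re
import string

text = "“Smart quotes” aren’t ASCII, right?"

# ASCII-only removal: the comma and question mark go, the curly quotes stay.
ascii_only = text.translate(str.maketrans("", "", string.punctuation))

# Regex alternative: drop everything that is neither a word character nor whitespace.
# Note that \w includes the underscore, so underscores are kept by this pattern.
unicode_aware = re.sub(r"[^\w\s]", "", text)

print(ascii_only)
print(unicode_aware)
```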


**Step 3: Remove Stopwords**

In [6]:
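# Load NLTK's English stopword list as a set for fast membership tests
stop_words = set(stopwords.words('english'))

# Keep only the words that are not in the stopword set
df['Clean_Text'] = df['Clean_Text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))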
df.head()

Unnamed: 0,Text,Clean_Text
0,Hello there! How are you doing today?,hello today
1,"NLP is amazing, isn't it?",nlp amazing isnt
2,Text preprocessing is a critical step in NLP!,text preprocessing critical step nlp
3,"Remove punctuation, stopwords, and make lowerc...",remove punctuation stopwords make lowercase
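
Notice that `isnt` survives in row 1. Because punctuation was removed in Step 2, the contraction `isn't` became `isnt`, which no longer matches a stopword entry that keeps its apostrophe. One possible fix is to filter stopwords before deleting internal punctuation. The sketch below uses a hypothetical helper, `clean_stopwords_first`, and a small hand-written stopword set standing in for NLTK's list (assuming, as in NLTK's English list, that contractions appear with their apostrophes).

```python
import string

# Small stand-in for NLTK's English stopword list (illustrative assumption).
STOP_WORDS = {"is", "it", "isn't", "a", "the"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_stopwords_first(text):
    """Lowercase, drop stopwords while apostrophes are intact, then remove punctuation."""
    # Strip punctuation only from token edges so "amazing," and "it?" still match.
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    kept = [t for t in tokens if t and t not in STOP_WORDS]
    # Remove any punctuation left inside the surviving tokens (e.g. "don't" -> "dont").
    return " ".join(t.translate(PUNCT_TABLE) for t in kept)

print(clean_stopwords_first("NLP is amazing, isn't it?"))
```

With the real NLTK set, the same function could be used by replacing `STOP_WORDS` with `set(stopwords.words('english'))`.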


**Step 4: Remove Numbers / Special Characters (Optional)**

In [7]:
df['Clean_Text'] = df['Clean_Text'].apply(lambda x: re.sub(r'\d+', '', x))  # Remove digits
df['Clean_Text'] = df['Clean_Text'].apply(lambda x: re.sub(r'\s+', ' ', x).strip())  # Collapse extra whitespace
df.head()


Unnamed: 0,Text,Clean_Text
0,Hello there! How are you doing today?,hello today
1,"NLP is amazing, isn't it?",nlp amazing isnt
2,Text preprocessing is a critical step in NLP!,text preprocessing critical step nlp
3,"Remove punctuation, stopwords, and make lowerc...",remove punctuation stopwords make lowercase
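
The output is unchanged here because the sample sentences contain no digits, and the code above does not target special characters. To see both effects, here is a minimal sketch on a made-up sentence. The helper name `strip_numbers_and_symbols` and the pattern `[^a-z\s]` are assumptions; the pattern expects input that is already lowercased and English-only.

```python
import re

def strip_numbers_and_symbols(text):
    """Remove digits and non-letter symbols, then collapse whitespace."""
    text = re.sub(r"\d+", "", text)        # remove runs of digits
    text = re.sub(r"[^a-z\s]", "", text)   # remove anything that is not a-z or whitespace
    return re.sub(r"\s+", " ", text).strip()

print(strip_numbers_and_symbols("order #42 costs €19.99 • call 555-0100 now"))
```

Whether to drop numbers at all depends on the task: for example, prices or dates can carry useful signal, which is why this step is marked optional.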


**Step 5: Example Cleaning Output**

In [8]:
for original, cleaned in zip(df['Text'], df['Clean_Text']):
    print(f"Original: {original}")
    print(f"Cleaned:  {cleaned}")
    print('-'*50)


Original: Hello there! How are you doing today?
Cleaned:  hello today
--------------------------------------------------
Original: NLP is amazing, isn't it?
Cleaned:  nlp amazing isnt
--------------------------------------------------
Original: Text preprocessing is a critical step in NLP!
Cleaned:  text preprocessing critical step nlp
--------------------------------------------------
Original: Remove punctuation, stopwords, and make lowercase.
Cleaned:  remove punctuation stopwords make lowercase
--------------------------------------------------
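
For reuse, the steps can be bundled into a single function. This is a sketch under stated assumptions: the name `clean_text` and its `stop_words` parameter are hypothetical, and the stopword set is passed in so the function does not depend on NLTK being installed.

```python
import re
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text, stop_words=frozenset()):
    """Apply the notebook's steps in order: lowercase, punctuation, stopwords, digits, whitespace."""
    text = text.lower()                                         # Step 1
    text = text.translate(_PUNCT_TABLE)                         # Step 2
    tokens = [w for w in text.split() if w not in stop_words]   # Step 3
    text = re.sub(r"\d+", "", " ".join(tokens))                 # Step 4: digits
    return re.sub(r"\s+", " ", text).strip()                    # Step 4: whitespace

# A tiny stopword set for demonstration only.
print(clean_text("Text preprocessing is a critical step in NLP!", {"is", "a", "in"}))
```

In the notebook, this could be applied with `df['Text'].apply(lambda t: clean_text(t, stop_words))`, using the `stop_words` set built in Step 3.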
