Data Preprocessing & Cleaning

Why Preprocess?
Remove noise (irrelevant data).

Normalize text (lowercase, stemming).

Handle missing values.

Steps:

Tokenization (splitting text into words/tokens).

Stopword removal.

Lemmatization/Stemming.

Handling special characters, emojis, etc.

Cleaning text data with nltk and re

In [1]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

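# Build these once; calling stopwords.words() for every token is slow.
STOP_WORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # Keep only letters and whitespace (drops digits, punctuation, emojis)
    tokens = text.split()
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in STOP_WORDS]
    return ' '.join(tokens)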

sample_text = "Generative AI is amazing! It can create new data :)"
print(clean_text(sample_text))  # Output: "generative ai amazing create new data"

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


generative ai amazing create new data
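
The steps list mentions stemming, but the cell above only lemmatizes. As a rough sketch of the difference, the snippet below uses NLTK's PorterStemmer, which needs no extra data download. The word list is an illustrative assumption, not part of the original notebook. A stemmer strips suffixes by rule, so its output is not always a dictionary word.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words; stems are produced by suffix-stripping rules
words = ["running", "studies", "cats"]
stems = [stemmer.stem(word) for word in words]
print(stems)  # ['run', 'studi', 'cat']
```

Stemming is fast and dependency-free, while lemmatization returns real words at the cost of a dictionary lookup.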


What are nltk and re in Python?
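Both nltk (Natural Language Toolkit) and re (Regular Expressions) are used for text processing in Python, but they serve different purposes: re is a pattern-matching module in the standard library, while nltk is a third-party package with linguistic tools.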

re (Regular Expressions)
Theory
Purpose: A built-in Python module for pattern matching in strings.

Use Cases:

Finding/replacing text patterns (e.g., emails, phone numbers).

Removing special characters, URLs, or unwanted symbols.

Splitting text based on patterns.

In [3]:
import re

text = "Contact me at email@example.com or call 123-456-7890."

# Extract email
email = re.findall(r'[\w\.-]+@[\w\.-]+', text)
print("Email:", email)  # Output: Email: ['email@example.com']

# Mask digits with '#' (they are replaced, not removed)
masked_text = re.sub(r'\d', '#', text)
print(masked_text)  # Output: "Contact me at email@example.com or call ###-###-####."

Email: ['email@example.com']
Contact me at email@example.com or call ###-###-####.
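
The use cases above also mention removing URLs and splitting on patterns, which the cell does not show. Here is a minimal sketch using only the re module. The sample string and the patterns are assumptions chosen for illustration; the URL pattern in particular is deliberately simple and not a full URL grammar.

```python
import re

# Hypothetical input containing a URL, an emoji (U+1F680), and a hashtag
raw = "Check https://example.com now!!   Great demo \U0001F680 #NLP"

# Remove URLs: "http://" or "https://" followed by non-space characters
no_urls = re.sub(r"https?://\S+", "", raw)

# Drop non-ASCII characters such as emojis
ascii_only = re.sub(r"[^\x00-\x7F]+", "", no_urls)

# Collapse the runs of whitespace left behind
tidy = re.sub(r"\s+", " ", ascii_only).strip()
print(tidy)  # Check now!! Great demo #NLP

# Split on any run of non-word characters, discarding empty pieces
parts = [p for p in re.split(r"\W+", tidy) if p]
print(parts)  # ['Check', 'now', 'Great', 'demo', 'NLP']
```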


nltk (Natural Language Toolkit)
Purpose: A library for natural language processing (NLP).

Use Cases:

Tokenization (splitting text into words/sentences).

Removing stopwords ("the", "is", etc.).

Lemmatization (reducing words to base forms, e.g., "mice" → "mouse"; WordNetLemmatizer treats words as nouns by default, so "running" → "run" requires passing pos='v').

Part-of-speech tagging (identifying nouns, verbs, etc.).

In [5]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# nltk.download('all') also works but fetches every NLTK package (very large);
# the targeted downloads below are enough for this example.
nltk.download('punkt')  # For tokenization
nltk.download('punkt_tab')  # Tokenizer data needed by newer NLTK releases (assumption: 3.9+)
nltk.download('stopwords')  # For stopwords
nltk.download('wordnet')  # For lemmatization

text = "Cats are chasing mice in the garden."

# Tokenize
tokens = word_tokenize(text)
print("Tokens:", tokens)
# Output: ['Cats', 'are', 'chasing', 'mice', 'in', 'the', 'garden', '.']

# Remove stopwords (build the set once instead of once per token)
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in tokens if word.lower() not in stop_words]
print("Without stopwords:", filtered_words)
# Output: ['Cats', 'chasing', 'mice', 'garden', '.']

# Lemmatization (nouns by default; the capitalized 'Cats' is left unchanged)
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in filtered_words]
print("Lemmatized:", lemmatized)
# Output: ['Cats', 'chasing', 'mouse', 'garden', '.']

Tokens: ['Cats', 'are', 'chasing', 'mice', 'in', 'the', 'garden', '.']
Without stopwords: ['Cats', 'chasing', 'mice', 'garden', '.']
Lemmatized: ['Cats', 'chasing', 'mouse', 'garden', '.']


Data Representation & Vectorization

Why Vectorize?
Machine learning models operate on numbers, not raw text, so each document must first be mapped to a numeric vector.

Methods:

Bag of Words (BoW)

TF-IDF (Term Frequency-Inverse Document Frequency)

Word Embeddings (Word2Vec, GloVe)

Transformer Embeddings (BERT, GPT)

Converting text to vectors using TF-IDF

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Generative AI creates data.",
    "AI models need training data.",
    "Deep learning is a subset of AI."
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # Words in vocabulary
print(X.toarray())  # TF-IDF vectors

['ai' 'creates' 'data' 'deep' 'generative' 'is' 'learning' 'models' 'need'
 'of' 'subset' 'training']
[[0.34520502 0.5844829  0.44451431 0.         0.5844829  0.
  0.         0.         0.         0.         0.         0.        ]
 [0.29803159 0.         0.38376993 0.         0.         0.
  0.         0.50461134 0.50461134 0.         0.         0.50461134]
 [0.2553736  0.         0.         0.43238509 0.         0.43238509
  0.43238509 0.         0.         0.43238509 0.43238509 0.        ]]


What is TF-IDF?
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document in a collection (corpus). It's commonly used in:

Search engines (ranking documents)

Text classification (e.g., spam detection)

Information retrieval
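
To make the search-ranking use case concrete, here is a small sketch that scores documents against a query by cosine similarity of their TF-IDF vectors. The documents and the query string are illustrative assumptions; a real search engine would add much more (indexing, query parsing, other ranking signals).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus and query
docs = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat and the dog are friends",
]
query = "dog homework"

vec = TfidfVectorizer()
doc_vectors = vec.fit_transform(docs)    # learn vocabulary and IDF from the corpus
query_vector = vec.transform([query])    # reuse the same vocabulary for the query

# One similarity score per document, then sort from best to worst
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]

print(ranking[0])  # 1, the document containing both "dog" and "homework"
```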



How TF-IDF Works
A. Term Frequency (TF)
Measures how often a word appears in a document.

Formula:
TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in d)

B. Inverse Document Frequency (IDF)
Measures how rare a word is across all documents.

Formula:
IDF(t) = log(Total number of documents / Number of documents containing term t)

C. TF-IDF Calculation
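TF-IDF(t, d) = TF(t, d) * IDF(t)

Note: scikit-learn's TfidfVectorizer, used in the code cells of this notebook, applies a smoothed variant by default. It uses the raw count for TF, computes IDF(t) = ln((1 + N) / (1 + df(t))) + 1 (N = number of documents, df(t) = number of documents containing t), and then scales each document vector to unit L2 norm. Its numbers therefore differ from the textbook formula, and even a term that occurs in every document keeps a nonzero weight.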
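
The textbook formulas above can be sketched in a few lines of plain Python. The helper names (tf, idf, tf_idf) and the whitespace tokenization are assumptions for illustration, and the natural logarithm is assumed since the formula does not fix a base. The idf helper also assumes the term occurs in at least one document.

```python
import math

# Hypothetical corpus, tokenized by whitespace
docs = [
    "the cat sat on the mat".split(),
    "the dog ate my homework".split(),
    "the cat and the dog are friends".split(),
]

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / documents containing `term`)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", docs[0], docs))            # 0.0 ("the" is in every document)
print(round(tf_idf("mat", docs[0], docs), 3))  # 0.183 ("mat" is in one document)
```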

Why Use TF-IDF?
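* Down-weights frequent words: Terms that appear in most documents (e.g., "the", "is") get a low IDF and therefore lower weights.

* Boosts rare, meaningful words: Terms concentrated in a few documents (e.g., "blockchain", "quantum") get higher scores.

* Often better than raw counts: For many retrieval and classification tasks, TF-IDF features are more informative than plain Bag-of-Words (BoW) counts.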

In [7]:
%pip install scikit-learn



In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat and the dog are friends"
]

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names (words)
print("Vocabulary:", vectorizer.get_feature_names_out())
# Output: ['and', 'are', 'ate', 'cat', 'dog', 'friends', 'homework', 'mat', 'my', 'on', 'sat', 'the']

# Convert to dense matrix (for readability)
print("TF-IDF Matrix:\n", tfidf_matrix.toarray().round(2))

Vocabulary: ['and' 'are' 'ate' 'cat' 'dog' 'friends' 'homework' 'mat' 'my' 'on' 'sat'
 'the']
TF-IDF Matrix:
 [[0.   0.   0.   0.34 0.   0.   0.   0.45 0.   0.45 0.45 0.53]
 [0.   0.   0.5  0.   0.38 0.   0.5  0.   0.5  0.   0.   0.3 ]
 [0.42 0.42 0.   0.32 0.32 0.42 0.   0.   0.   0.   0.   0.5 ]]
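
The matrix above can be reproduced by hand, which shows what TfidfVectorizer does with its default settings: raw term counts, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2 normalization of each row. This is a sketch for checking one's understanding, not a replacement for the library; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat and the dog are friends",
]

# Raw term counts, one row per document
counts = CountVectorizer().fit_transform(documents).toarray()

# Smoothed inverse document frequency
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit L2 norm
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(documents).toarray()
print(np.allclose(manual, reference))  # True
```

Because of the "+ 1" in the smoothed IDF, "the" still receives weight even though it appears in every document, and its count of 2 in the first sentence makes it the largest entry in that row (0.53).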
