<h1 style='color:green;'><center>Text Preprocessing in NLP</center></h1>

Text preprocessing is a crucial step in Natural Language Processing (NLP) that prepares raw text data for analysis. It involves cleaning, transforming, and converting text into formats that algorithms can effectively process. The end goal is to convert text into a numerical representation that a machine learning model can work with.

# Key concepts in NLP
* Corpus
* Tokenization
* Stemming
* Lemmatization
* Stopwords
* One-Hot Encoding
* Bag of Words (BoW)
* TF-IDF (Term Frequency - Inverse Document Frequency)
* N-Grams
* Word2Vec


# 1. Corpus
What is it?
 - A corpus is a collection of text data used for analysis or training NLP models. For example, a set of news articles, books, or social media posts can constitute a corpus.

When to use?
 - A corpus is essential in NLP because it provides the raw data for model training or linguistic research.

Advantages:
 - Provides diverse data for training.
 - Can be tailored for specific tasks (e.g., sentiment analysis, chatbots).

Disadvantages:
 - May require extensive cleaning and annotation.
 - Large corpora can be computationally expensive to process.

Example:

In [5]:
import warnings
warnings.filterwarnings('ignore')

In [7]:
## Loading a raw corpus from NLTK

import nltk
from nltk.corpus import gutenberg
nltk.download('gutenberg')

corpus = gutenberg.raw('shakespeare-hamlet.txt')
print(corpus[:500])

[The Tragedie of Hamlet by William Shakespeare 1599]


Actus Primus. Scoena Prima.

Enter Barnardo and Francisco two Centinels.

  Barnardo. Who's there?
  Fran. Nay answer me: Stand & vnfold
your selfe

   Bar. Long liue the King

   Fran. Barnardo?
  Bar. He

   Fran. You come most carefully vpon your houre

   Bar. 'Tis now strook twelue, get thee to bed Francisco

   Fran. For this releefe much thankes: 'Tis bitter cold,
And I am sicke at heart

   Barn. Haue you had quiet Guard?
  Fran. Not


[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\jayku\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
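
A corpus does not have to come from a library. As a minimal sketch, assuming a task-specific corpus is simply a list of strings, the snippet below builds a tiny one and computes a few basic statistics. The `documents` list and the `corpus_stats` helper are hypothetical names used only for illustration, and the whitespace split is a deliberately crude stand-in for a real tokenizer.

```python
from collections import Counter

# Hypothetical mini-corpus: each string is one document (made-up review text).
documents = [
    "The service was quick and the staff were friendly.",
    "Delivery was slow, and the package arrived damaged.",
    "Great value for the price.",
]

def corpus_stats(docs):
    """Return basic corpus statistics and the token frequency table."""
    # Crude tokenization: split on whitespace, strip edge punctuation, lowercase.
    tokens = [word.strip(".,!?").lower() for doc in docs for word in doc.split()]
    tokens = [t for t in tokens if t]
    counts = Counter(tokens)
    stats = {"documents": len(docs), "tokens": len(tokens), "vocabulary": len(counts)}
    return stats, counts

stats, counts = corpus_stats(documents)
print(stats)          # {'documents': 3, 'tokens': 22, 'vocabulary': 17}
print(counts["the"])  # 4
```

Statistics like these give a quick sense of how much cleaning and compute a corpus may need before training.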


# 2. Tokenization
What is it?
 - Tokenization is the process of breaking text into smaller pieces, such as words or sentences. These smaller pieces are called "tokens."

When to use?
 - When text needs to be processed word by word or sentence by sentence.
 - Necessary for tasks like word embeddings, sentiment analysis, and more.

Advantages:
 - Simplifies text processing.
 - Forms the basis for more complex operations (e.g., parsing).

Disadvantages:
 - May split multi-word expressions that belong together (e.g., "New York" becomes "New" and "York").
 - Requires additional steps for handling punctuation or special characters.
 
Example:

In [8]:
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')  # newer NLTK releases may require 'punkt_tab' instead

text = "Natural Language Processing is amazing! Let's tokenize this text."

# Tokenize into words
words = word_tokenize(text)
print("Words:", words)

# Tokenize into sentences
sentences = sent_tokenize(text)
print("Sentences:", sentences)

Words: ['Natural', 'Language', 'Processing', 'is', 'amazing', '!', 'Let', "'s", 'tokenize', 'this', 'text', '.']
Sentences: ['Natural Language Processing is amazing!', "Let's tokenize this text."]


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jayku\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In the above example, the word tokenizer returns a list of tokens (words and punctuation marks), and the sentence tokenizer returns a list of separate sentences.
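
To make the mechanics concrete, here is a rough sketch of a regex-based tokenizer plus a post-processing step that re-joins known multi-word expressions such as "New York". Both `simple_tokenize` and `merge_multiword` are hypothetical helpers written for this illustration; they are not NLTK functions, and the regex handles contractions differently from `word_tokenize`.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def merge_multiword(tokens, expressions):
    """Join known multi-word expressions (non-empty token tuples) into one token."""
    merged, i = [], 0
    while i < len(tokens):
        for expr in expressions:
            if tuple(tokens[i:i + len(expr)]) == expr:
                merged.append(" ".join(expr))
                i += len(expr)
                break
        else:
            # No expression starts at position i, so keep the token as-is.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = simple_tokenize("She moved to New York.")
print(tokens)
# ['She', 'moved', 'to', 'New', 'York', '.']
print(merge_multiword(tokens, [("New", "York")]))
# ['She', 'moved', 'to', 'New York', '.']
```

The list of expressions has to come from somewhere (a gazetteer, a phrase detector), which is one reason tokenization often needs extra steps.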

# 3. Stemming
What is it?
 - Stemming reduces words to a root form by stripping suffixes with heuristic rules (e.g., "running" → "run"). The resulting stem is not always a real word.

When to use?
 - Useful for simple tasks like search engines or spam filters.
 - Appropriate when accuracy isn't critical.

Advantages:
 - Fast and computationally cheap.

Disadvantages:
 - May not produce meaningful results.
 - Over-aggressive stemming can lead to loss of information.
 
Example:

In [9]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runs", "easily", "faster"]
stems = [stemmer.stem(word) for word in words]
print("Stemmed Words:", stems)


Stemmed Words: ['run', 'run', 'easili', 'faster']


Some stems, such as "easili", are not real words.
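
The rule-based nature of stemming can be illustrated with a deliberately naive suffix stripper. This is only a sketch: the `SUFFIXES` list and `naive_stem` function are made up for this example and are far simpler than the Porter algorithm, which adds many conditions to avoid the mistakes shown here.

```python
# Hypothetical suffix list, checked in order; not the Porter rule set.
SUFFIXES = ("ing", "ly", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ["running", "runs", "easily", "faster", "news"]
print([naive_stem(w) for w in words])
# ['runn', 'run', 'easi', 'faster', 'new']
```

"runn" and "easi" are not words, and "news" collapsing to "new" shows how over-aggressive stripping can merge unrelated terms and lose information.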

# 4. Lemmatization
What is it?
 - Lemmatization reduces words to their dictionary form (lemma), using a vocabulary and the word's part of speech rather than just stripping suffixes.

When to use?
 - When accuracy and context matter (e.g., chatbots, text classification).

Advantages:
 - Produces meaningful base forms of words.
 - Handles irregular forms better than stemming.

Disadvantages:
 - Slower than stemming.
 - Requires a dictionary or lexical database.

Example:

In [10]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
words = ["running", "runs", "easily", "faster"]
# pos='v' treats every word as a verb; without it, lemmatize() assumes a noun
lemmas = [lemmatizer.lemmatize(word, pos='v') for word in words]
print("Lemmatized Words:", lemmas)


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jayku\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Lemmatized Words: ['run', 'run', 'easily', 'faster']
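
Why a lemmatizer needs a dictionary and a part of speech can be sketched with a tiny lookup table. The table below is a hand-written, hypothetical stand-in for a lexical database such as WordNet, and `lookup_lemma` is an illustrative helper, not a library function.

```python
# Hypothetical lookup table: (word, part of speech) -> lemma.
LEMMA_TABLE = {
    ("went", "v"): "go",
    ("running", "v"): "run",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
}

def lookup_lemma(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(lookup_lemma("went", "v"))    # go
print(lookup_lemma("mice"))         # mouse
print(lookup_lemma("better", "a"))  # good
print(lookup_lemma("better", "v"))  # better (no entry for this part of speech)
```

Irregular forms like "went" or "mice" cannot be recovered by suffix stripping, which is where the dictionary earns its extra cost. The same word can also map to different lemmas depending on the part of speech supplied.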


# 5. Stopwords
What is it?
 - Stopwords are common words (e.g., "is", "and", "the") that add little value to analysis and are often removed.

When to use?
 - For tasks like text classification, where common words don't contribute much information.

Advantages:
 - Reduces noise in text data.
 - Simplifies analysis.

Disadvantages:
 - Removing stopwords may lose context in some cases.
 
Example:

In [11]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
words = ["This", "is", "a", "simple", "example"]
filtered = [word for word in words if word.lower() not in stop_words]
print("Filtered Words:", filtered)


Filtered Words: ['simple', 'example']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jayku\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
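
The loss-of-context risk is easiest to see with negation, since general-purpose stopword lists commonly include words like "not". The sketch below uses a small hand-written stopword set (an assumption, not the NLTK list) and a hypothetical `keep` parameter that lets selected words survive filtering.

```python
# Hypothetical, hand-picked stopword set for illustration only.
STOP_WORDS = {"this", "is", "a", "the", "not", "was"}
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(tokens, keep=frozenset()):
    """Drop stopwords, except those explicitly listed in `keep`."""
    return [t for t in tokens if t.lower() not in STOP_WORDS or t.lower() in keep]

tokens = "This movie is not good".split()
print(remove_stopwords(tokens))                  # ['movie', 'good']
print(remove_stopwords(tokens, keep=NEGATIONS))  # ['movie', 'not', 'good']
```

Without the exception, a negative review ends up looking positive, so for sentiment-style tasks it may be worth customizing the stopword list.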


# 6. One-Hot Encoding
What is it?
 - One-Hot Encoding converts text data into binary vectors. Each unique word is represented as a vector with one 1 and the rest 0s.

When to use?
 - Useful for converting categorical text data into numerical form for machine learning.
 - Commonly used in simple NLP models or as input to embeddings.

Advantages:
 - Easy to implement.
 - Provides a unique representation for each word.

Disadvantages:
 - Results in high-dimensional vectors for large vocabularies.
 - Does not capture relationships between words (e.g., "king" and "queen").
 
Example:

In [15]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

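# OneHotEncoder expects one sample per row, so reshape the words into a single column
words = np.array(['hello', 'world', 'hello']).reshape(-1, 1)
encoder = OneHotEncoder()
encoded = encoder.fit_transform(words)  # sparse matrix, one column per unique word
print("One-Hot Encoded:\n", encoded.toarray())


One-Hot Encoded:
 [[1. 0.]
 [0. 1.]
 [1. 0.]]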
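
The same idea can be sketched without scikit-learn: build a vocabulary, assign each unique word an index, and set that position to 1. `one_hot_encode` is a hypothetical helper written for this illustration, and the alphabetical ordering of the vocabulary is an arbitrary choice.

```python
def one_hot_encode(words):
    """Return the sorted vocabulary and one binary vector per input word."""
    vocabulary = sorted(set(words))
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for word in words:
        vector = [0] * len(vocabulary)
        vector[index[word]] = 1  # exactly one position is "hot"
        vectors.append(vector)
    return vocabulary, vectors

vocabulary, vectors = one_hot_encode(["hello", "world", "hello"])
print(vocabulary)  # ['hello', 'world']
print(vectors)     # [[1, 0], [0, 1], [1, 0]]
```

Each vector has as many entries as there are unique words, which is why the representation becomes very wide for large vocabularies.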


# 7. Bag of Words (BoW)
What is it?
 - BoW is a representation where text is converted into a vector of word counts. It ignores the order and grammar of words.

When to use?
 - Suitable for document classification, spam detection, or sentiment analysis.
 - Works well with small datasets and simple models.

Advantages:
 - Easy to understand and implement.
 - Can handle large text data efficiently.

Disadvantages:
 - Ignores word order and semantics.
 - Produces sparse matrices with large vocabularies.
 
Example:

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

text = ["This is a sample text", "Text preprocessing is key"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text)

print("BoW Features:\n", X.toarray())
print("Vocabulary:\n", vectorizer.vocabulary_)


BoW Features:
 [[1 0 0 1 1 1]
 [1 1 1 0 1 0]]
Vocabulary:
 {'this': 5, 'is': 0, 'sample': 3, 'text': 4, 'preprocessing': 2, 'key': 1}
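
For intuition, the count matrix can be reproduced by hand with `collections.Counter`. This is a sketch under two assumptions that mirror `CountVectorizer`'s defaults: text is lowercased, and only tokens of two or more word characters are kept (which is why "a" does not appear in the vocabulary). The `tokenize` and `bag_of_words` helpers are hypothetical names.

```python
import re
from collections import Counter

def tokenize(doc):
    """Lowercase and keep tokens with at least two word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

vocabulary, matrix = bag_of_words(["This is a sample text", "Text preprocessing is key"])
print(vocabulary)  # ['is', 'key', 'preprocessing', 'sample', 'text', 'this']
print(matrix)      # [[1, 0, 0, 1, 1, 1], [1, 1, 1, 0, 1, 0]]
```

Each column corresponds to one vocabulary term in alphabetical order, so word order within a document plays no role in the result.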


# 8. TF-IDF (Term Frequency - Inverse Document Frequency)

What is it?
 - TF-IDF weighs words based on how frequently they appear in a document (TF) and how unique they are across the corpus (IDF).

When to use?
 - Useful for keyword extraction, document similarity, or search engines.
 - Better than BoW for capturing word importance.

Advantages:
 - Captures importance of words in context.
 - Reduces the impact of common words like "is" or "the."

Disadvantages:
 - Often still benefits from stopword removal, since IDF only down-weights common words rather than dropping them.
 - Computationally intensive for large corpora.

Example:



In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["This is a sample text", "Text preprocessing is key"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(text)

print("TF-IDF Features:\n", X.toarray())
print("Vocabulary:\n", tfidf.vocabulary_)


TF-IDF Features:
 [[0.40993715 0.         0.         0.57615236 0.40993715 0.57615236]
 [0.40993715 0.57615236 0.57615236 0.         0.40993715 0.        ]]
Vocabulary:
 {'this': 5, 'is': 0, 'sample': 3, 'text': 4, 'preprocessing': 2, 'key': 1}
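
The numbers above can be traced by hand. The sketch below assumes scikit-learn's default settings as I understand them: raw term counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The `tokenize` and `tfidf` helpers are hypothetical and written only for this walkthrough.

```python
import math
import re
from collections import Counter

def tokenize(doc):
    """Lowercase and keep tokens with at least two word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

def tfidf(docs):
    """Return the sorted vocabulary and L2-normalized TF-IDF rows."""
    tokenized = [tokenize(doc) for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed IDF (assumed to match TfidfVectorizer's default).
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = [tf[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

vocabulary, rows = tfidf(["This is a sample text", "Text preprocessing is key"])
for row in rows:
    print([round(value, 8) for value in row])
```

Terms that occur in both documents ("is", "text") get an IDF of 1 and end up near 0.41, while terms unique to one document get an IDF of about 1.405 and end up near 0.576.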


# 9. N-Grams

What is it?
 - N-Grams are sequences of n words or characters. For example, bigrams are two-word sequences, and trigrams are three-word sequences.

When to use?
 - Helpful for capturing word context and structure in text.
 - Commonly used in machine translation and speech recognition.

Advantages:
 - Captures relationships between words.
 - Useful for context-sensitive tasks.

Disadvantages:
 - Increases dimensionality with larger n.
 - May result in sparsity.

Example:

In [2]:
from nltk.util import ngrams

text = "Natural Language Processing"
# Generate bi-grams
bigrams = list(ngrams(text.split(), 2))
print("Bi-grams:", bigrams)

Bi-grams: [('Natural', 'Language'), ('Language', 'Processing')]
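
N-grams are simple enough to build without a library, and the same function works for words and characters. In this sketch, `make_ngrams` is a hypothetical helper that slides a window of size n over any sequence.

```python
def make_ngrams(items, n):
    """Return all contiguous n-item windows of a sequence as tuples."""
    return list(zip(*(items[i:] for i in range(n))))

words = "text preprocessing is a crucial step".split()
print(make_ngrams(words, 3))
# [('text', 'preprocessing', 'is'), ('preprocessing', 'is', 'a'),
#  ('is', 'a', 'crucial'), ('a', 'crucial', 'step')]

# Character bigrams: pass a string instead of a list of words.
print(["".join(gram) for gram in make_ngrams("text", 2)])
# ['te', 'ex', 'xt']
```

A sequence of length L yields L - n + 1 n-grams, but the number of distinct n-grams across a corpus grows quickly with n, which is the source of the dimensionality and sparsity problems.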


# 10. Word2Vec
What is it?
 - Word2Vec is a neural network-based method to generate word embeddings that capture semantic relationships between words.

When to use?
 - For tasks requiring semantic similarity (e.g., recommendation systems, sentiment analysis).

Advantages:
 - Captures relationships between words (e.g., "king - man + woman ≈ queen").
 - Efficient for large vocabularies.

Disadvantages:
 - Requires significant data for training.
 - Training can be computationally intensive.

Example:

In [3]:
from gensim.models import Word2Vec

sentences = [["natural", "language", "processing"], ["text", "mining"]]
model = Word2Vec(sentences, vector_size=10, min_count=1, workers=1)

# Word vector for "natural"
print("Vector for 'natural':", model.wv['natural'])


Vector for 'natural': [-0.0960355   0.05007293 -0.08759586 -0.04391825 -0.000351   -0.00296181
 -0.0766124   0.09614743  0.04982058  0.09233143]
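
Semantic similarity between embeddings is usually measured with cosine similarity, and analogies are answered with vector arithmetic. The sketch below shows both on hand-made 2-D toy vectors. These are not trained Word2Vec embeddings (a two-sentence corpus is far too small to learn meaningful ones), and `cosine_similarity` and `analogy` are hypothetical helpers.

```python
import math

# Hand-made toy vectors (NOT trained): axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (word for word in vectors if word not in {a, b, c})
    return max(candidates, key=lambda word: cosine_similarity(vectors[word], target))

print(analogy("king", "man", "woman", toy_vectors))  # queen
print(round(cosine_similarity(toy_vectors["king"], toy_vectors["man"]), 3))  # 0.707
```

With real embeddings the arithmetic only lands near the expected word, and it takes a large training corpus before such relationships emerge.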
