### 1. Case Folding

Case folding is a text normalization technique in Natural Language Processing (NLP) where all characters in a string are converted to the same case, typically lowercase. This ensures that words like "Apple", "apple", and "APPLE" are treated identically during processing, which is crucial for tasks like text classification, search, and matching. Case folding helps reduce vocabulary size and improves consistency, especially in languages where case doesn’t alter the meaning of words. However, it may not be suitable for languages or domains where case conveys semantic information (e.g., proper nouns or chemical symbols).



In [7]:
txt = "Hello, And Welcome to My World"
print(txt)

Hello, And Welcome to My World


In [8]:
x = txt.casefold()
print(x)

hello, and welcome to my world


In [9]:
y = txt.lower()
print(y)

hello, and welcome to my world
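
For plain ASCII text such as the example above, `lower()` and `casefold()` give the same result. The difference shows up with some non-ASCII characters. The short sketch below uses the German word "Straße" as an illustrative input (the word is an assumption for this example and is not part of the dataset used in this notebook).

```python
# Compare lower() and casefold() on a word containing the German sharp s.
word = "Straße"

lowered = word.lower()      # maps characters to their lowercase form only
folded = word.casefold()    # applies full Unicode case folding

print(lowered)  # straße
print(folded)   # strasse

# Caseless comparison is the typical use of casefold().
print("STRASSE".casefold() == folded)  # True
```

When the goal is case-insensitive matching across languages, `casefold()` is usually the safer choice; for English-only text the two are interchangeable.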


### 2. Special Character Removal

Special character removal is a preprocessing step in NLP where non-alphanumeric characters such as punctuation marks, symbols, or emojis are stripped from the text. This is done to eliminate noise and focus on meaningful tokens that contribute to the analysis or modeling tasks. Commonly removed characters include `@`, `#`, `$`, `%`, and so on. While this step helps simplify the text and reduce dimensionality, care should be taken in domains where special characters have semantic significance—such as programming languages, social media handles, or sentiment expressions.

In [10]:
import re

#input string
input_str = "hello how are$ you!!"

#Using regular expressions to remove special characters
clean_str = re.sub(r"[^a-zA-Z0-9\s]","",input_str)

print(clean_str)

hello how are you


In [11]:
import re

#input string
input_str = "hello123 how are$ you!!"

#Using regular expressions to remove special characters
clean_str = re.sub(r"[^a-zA-Z0-9\s]","",input_str)

print(clean_str)

hello123 how are you
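
The pattern `[^a-zA-Z0-9\s]` keeps only ASCII letters and digits, so it also strips accented letters. The sketch below contrasts it with a `\w`-based pattern, which in Python 3 string patterns matches Unicode letters, digits, and the underscore. The sample text is made up for illustration.

```python
import re

text = "café costs $5!!"

# ASCII-only character class: the accented letter is removed as well.
ascii_clean = re.sub(r"[^a-zA-Z0-9\s]", "", text)

# \w-based class: Unicode letters and digits survive, punctuation is removed.
unicode_clean = re.sub(r"[^\w\s]", "", text)

print(ascii_clean)    # caf costs 5
print(unicode_clean)  # café costs 5
```

Which pattern fits depends on the data: the ASCII class is fine for English-only text, while the `\w` variant is a reasonable default for multilingual input.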


## Libraries in the field of NLP

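- spaCy - Natural language processing library in Python that can be used to tokenize and process textual data
- NLTK - Natural Language Toolkit, a Python library that provides tokenizers, stop word lists, and other text processing utilities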

In [None]:
import spacy

# Load the spacy model
nlp = spacy.load("en_core_web_sm")

# Input string
input_str = "hello how are$ you!!"

# Function to clean the string
def clean_text(text):
  # Keep letters, digits, and whitespace; drop every other character
  cleaned_text = ''.join(char for char in text if char.isalnum() or char.isspace())
  doc = nlp(cleaned_text)
  return ' '.join(token.text for token in doc)

# Get the final output
clean_str = clean_text(input_str)
print(clean_str)

In [12]:
import nltk

nltk.download('punkt')

input_str = "hello how are$ you!!"

# Tokenize
tokens = nltk.word_tokenize(input_str)

# Remove the special characters
clean_tokens = [token for token in tokens if token.isalnum()]

clean_str = ' '.join(clean_tokens)

print(clean_str)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\saila\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


hello how are you


### 3. Contractions

Contractions are shortened versions of words or combinations of words created by omitting certain letters and using an apostrophe, such as "can't" for "cannot" or "they're" for "they are". In NLP, expanding contractions is an important preprocessing step as it helps standardize the text and improve the performance of downstream tasks like sentiment analysis or machine translation. Without contraction handling, the model might treat "isn't" and "is not" as entirely different phrases, leading to inconsistencies. Contraction expansion is typically done using lookup dictionaries or regular expression-based methods.


In [13]:
!pip install contractions




In [14]:
import contractions

txt = "I can't believe it's already raining"

expanded_txt = contractions.fix(txt)

print(expanded_txt)

I cannot believe it is already raining


##### Using regex

In [15]:
import re

def expand_contractions(text):
  contractions_pattern = {
      r"(?i)can't":"cannot",
      r"(?i)won't":"will not",
      r"(?i)it's":"it is",
      r"(?i)weren't":"were not",
      r"(?i)I'm":"I am",
      r"(?i)couldn't":"could not"
  }
  for contraction, expansion in contractions_pattern.items():
    text = re.sub(contraction,expansion,text)

  return text

In [16]:
txt = "I couldn't visit my aunt's place yesterday"
expanded_text = expand_contractions(txt)
print(expanded_text)

I could not visit my aunt's place yesterday
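
The loop above runs one substitution per pattern and always inserts the lowercase expansion. One possible variation, sketched here, compiles a single pattern from a lookup table, adds word boundaries, and keeps an initial capital. The table contents and the names `CONTRACTIONS`, `PATTERN`, and `expand_with_table` are assumptions made for this example.

```python
import re

# Small illustrative lookup table (an assumption; real tables are much larger).
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
    "couldn't": "could not",
}

# One alternation pattern with word boundaries, built from the table keys.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_with_table(text):
    def replace(match):
        word = match.group(0)
        expansion = CONTRACTIONS[word.lower()]
        # Preserve an initial capital, e.g. "It's" -> "It is".
        if word[0].isupper():
            expansion = expansion[0].upper() + expansion[1:]
        return expansion
    return PATTERN.sub(replace, text)

print(expand_with_table("It's late and I'm sure we can't stay"))
# It is late and I am sure we cannot stay
```

Possessives such as "aunt's" are left untouched because they are not in the table.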


### 4. Tokenization

Tokenization is the process of breaking down a sequence of text into smaller units called tokens, which can be words, subwords, or characters. It is a foundational step in NLP that enables machines to understand and manipulate text data. For instance, the sentence "NLP is fun!" can be tokenized into `["NLP", "is", "fun", "!"]`. There are different types of tokenizers, such as whitespace tokenizers, regex-based tokenizers, and more sophisticated ones like WordPiece or Byte Pair Encoding (BPE) used in transformer models. The choice of tokenizer can significantly affect model performance and accuracy.


In [1]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
# Sample text for tokenization
txt = "NLTK provides powerful tools for tokenization. It includes word tokenization and sentence tokenization"

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\saila\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# Word tokenization
words = word_tokenize(txt)
print(words)

['NLTK', 'provides', 'powerful', 'tools', 'for', 'tokenization', '.', 'It', 'includes', 'word', 'tokenization', 'and', 'sentence', 'tokenization']


In [3]:
# Sentence tokenization
sent = sent_tokenize(txt)
print(sent)

['NLTK provides powerful tools for tokenization.', 'It includes word tokenization and sentence tokenization']
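
To make the difference between tokenizer types concrete, here is a minimal sketch comparing a whitespace tokenizer with a simple regex-based one on the sentence "NLP is fun!". The regex used is one common choice and not the rule set NLTK applies internally.

```python
import re

sentence = "NLP is fun!"

# Whitespace tokenizer: punctuation stays attached to the neighbouring word.
whitespace_tokens = sentence.split()

# Regex tokenizer: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)  # ['NLP', 'is', 'fun!']
print(regex_tokens)       # ['NLP', 'is', 'fun', '!']
```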


### 5. Stop Word Removal

Stop word removal is a common text preprocessing technique in NLP in which frequently occurring words that carry little meaning on their own, such as "the", "is", "and", and "in", are filtered out. These words often contribute little to tasks like text classification or information retrieval, so removing them can reduce noise and dimensionality. Libraries such as NLTK and spaCy provide predefined stop word lists that can be customized for the task at hand. However, removing stop words isn't always beneficial, especially in tasks like sentiment analysis or translation, where even small words such as "not" can carry important meaning.


In [4]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [5]:
# Sample Sentence
sentence = "This is a sample sentence, showing off the stop words filtration"

In [6]:
# Tokenize the Sentence
nltk.download('punkt')
nltk.download('stopwords')
words = word_tokenize(sentence)

# Filter out stopwords (build the set once for fast membership tests)
stop_words = set(stopwords.words('english'))
new_sentence = [word for word in words if word.lower() not in stop_words]

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\saila\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\saila\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [7]:
# Print the final sentence
print(sentence)
print(new_sentence)

This is a sample sentence, showing off the stop words filtration
['sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration']
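
Customizing a stop word list can be sketched as below. The tiny `STOP_WORDS` set is hand-written for illustration and is not NLTK's list. The idea is to keep negation words when the downstream task is sentiment analysis.

```python
# Hand-written stop word set for illustration only (an assumption).
STOP_WORDS = {"the", "is", "a", "and", "in", "not", "this"}

# Words we want to keep because they can flip the sentiment of a sentence.
NEGATIONS = {"not", "no", "never"}
custom_stop_words = STOP_WORDS - NEGATIONS

tokens = "this movie is not good".split()

default_filtered = [t for t in tokens if t not in STOP_WORDS]
custom_filtered = [t for t in tokens if t not in custom_stop_words]

print(default_filtered)  # ['movie', 'good']
print(custom_filtered)   # ['movie', 'not', 'good']
```

With the default set the sentence looks positive after filtering; the customized set preserves the negation.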


### 6. N-grams

N-grams are contiguous sequences of *n* items (typically words or characters) extracted from a text, used to capture local word order and context in NLP tasks. For example, for the sentence "I love NLP", the bigrams (2-grams) are "I love" and "love NLP". N-grams help preserve partial sentence structure and are useful in applications like text classification, language modeling, and machine translation. However, increasing the value of *n* can lead to high-dimensional and sparse representations, so a balance is often maintained between granularity and computational efficiency.


In [8]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import ngrams

In [9]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\saila\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [10]:
def generate_ngrams(text, n):
  tokens = word_tokenize(text)
  n_grams = list(ngrams(tokens,n))
  return n_grams

In [11]:
# Example text
txt = "N-Grams are a sequence of n items from a given sample of text or speech"

unigrams = generate_ngrams(txt,1)
print(unigrams)

[('N-Grams',), ('are',), ('a',), ('sequence',), ('of',), ('n',), ('items',), ('from',), ('a',), ('given',), ('sample',), ('of',), ('text',), ('or',), ('speech',)]


In [12]:
bigrams = generate_ngrams(txt,2)
print(bigrams)

[('N-Grams', 'are'), ('are', 'a'), ('a', 'sequence'), ('sequence', 'of'), ('of', 'n'), ('n', 'items'), ('items', 'from'), ('from', 'a'), ('a', 'given'), ('given', 'sample'), ('sample', 'of'), ('of', 'text'), ('text', 'or'), ('or', 'speech')]


In [13]:
trigrams = generate_ngrams(txt,3)
print(trigrams)

[('N-Grams', 'are', 'a'), ('are', 'a', 'sequence'), ('a', 'sequence', 'of'), ('sequence', 'of', 'n'), ('of', 'n', 'items'), ('n', 'items', 'from'), ('items', 'from', 'a'), ('from', 'a', 'given'), ('a', 'given', 'sample'), ('given', 'sample', 'of'), ('sample', 'of', 'text'), ('of', 'text', 'or'), ('text', 'or', 'speech')]
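
The same sliding-window idea can be written without NLTK. The helper below (the name `ngrams_from` is made up for this sketch) works on any sequence, so it produces word n-grams from a token list and character n-grams from a string.

```python
def ngrams_from(items, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    # Zip the sequence with copies of itself shifted by 1, 2, ..., n-1 positions.
    return list(zip(*(items[i:] for i in range(n))))

word_bigrams = ngrams_from("I love NLP".split(), 2)
char_trigrams = ngrams_from("text", 3)

print(word_bigrams)   # [('I', 'love'), ('love', 'NLP')]
print(char_trigrams)  # [('t', 'e', 'x'), ('e', 'x', 't')]
```

A sequence of length L yields L - n + 1 n-grams, which is why the 15-token example above gives 14 bigrams and 13 trigrams.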


### 7. Vectorization

Vectorization is the process of converting text into numerical representations that machine learning models can understand. This transformation enables algorithms to process textual data by representing each word, sentence, or document as a vector of numbers. Common vectorization techniques include Count Vectorizer, TF-IDF, and more advanced methods like word embeddings. Proper vectorization is critical for model performance, as it determines how well the semantic and syntactic properties of text are captured.
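
As a rough illustration, the sketch below builds count vectors and TF-IDF vectors by hand for three toy documents. It uses the plain formula tf × log(N / df), which is only one of several TF-IDF variants; libraries such as scikit-learn apply smoothing and normalization, so their numbers will differ. The documents and helper names are assumptions for this example.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ran"]
tokenized = [doc.split() for doc in docs]

# Fixed, sorted vocabulary so every vector has the same layout.
vocab = sorted({token for doc in tokenized for token in doc})

def count_vector(tokens):
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

# Document frequency: in how many documents each term appears.
n_docs = len(tokenized)
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}
idf = {term: math.log(n_docs / doc_freq[term]) for term in vocab}

def tfidf_vector(tokens):
    counts = Counter(tokens)
    return [counts[term] / len(tokens) * idf[term] for term in vocab]

print(vocab)                       # ['cat', 'dog', 'ran', 'sat', 'the']
print(count_vector(tokenized[0]))  # [1, 0, 0, 1, 1]
print(tfidf_vector(tokenized[0]))
```

Because "the" appears in every document, its IDF is zero and it drops out of the TF-IDF vector, while rarer words such as "dog" receive higher weights.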


### 8. Word Embeddings

Word embeddings are dense vector representations of words that capture their meaning and context based on usage in large corpora. Unlike sparse encodings like one-hot vectors, embeddings map words to continuous vector spaces where semantically similar words have closer vectors. Popular models for generating embeddings include Word2Vec, GloVe, and fastText. Embeddings significantly improve the performance of NLP models by preserving relationships such as similarity and analogy, enabling machines to understand deeper language structure.
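
Closeness between embeddings is commonly measured with cosine similarity. The sketch below uses three-dimensional vectors that are invented purely for illustration; real embeddings from Word2Vec, GloVe, or fastText are learned from data and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, NOT the output of any trained model.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_apple = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])

print(round(sim_king_queen, 3))
print(round(sim_king_apple, 3))
```

With these toy values, "king" is far closer to "queen" than to "apple", which mirrors the behaviour expected from trained embeddings.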



---
Check out the code in the Transformers folder.

Files:
- Word2Vec using gensim
---

### 9. Bag of Words

The Bag of Words (BoW) model is a simple yet effective technique for representing text data in NLP. It treats each document as a "bag" containing the words it includes, disregarding grammar and word order but maintaining frequency. Each document is converted into a vector based on word occurrences from a vocabulary. Though BoW is easy to implement and works well for basic tasks like document classification, it often results in sparse matrices and fails to capture word context or semantics, which can limit its performance on complex NLP problems.
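
A minimal hand-rolled sketch of the idea, using two made-up sentences, shows what "disregarding word order" means in practice: both sentences map to the same vector. The helper name `bow_vector` is an assumption for this example.

```python
from collections import Counter

docs = ["dog bites man", "man bites dog"]

# Vocabulary: every distinct word across the documents, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(text):
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc) for doc in docs]

print(vocab)                     # ['bites', 'dog', 'man']
print(vectors)                   # [[1, 1, 1], [1, 1, 1]]
print(vectors[0] == vectors[1])  # True
```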


In [14]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

In [15]:
df = pd.read_csv('spam.csv')
df.head(5)

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [16]:
df.Category.value_counts()/len(df)*100

Category
ham     86.593683
spam    13.406317
Name: count, dtype: float64

In [17]:
df['spam'] = df['Category'].apply(lambda x: 1 if x=='spam' else 0)
df.head(5)

Unnamed: 0,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [18]:
df.shape

(5572, 3)

In [19]:
#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam,test_size=0.2)

print(X_train.shape)
print(X_test.shape)

(4457,)
(1115,)


In [20]:
#Create bag of words representation using CountVectorizer
v = CountVectorizer()

X_train_cv = v.fit_transform(X_train.values)
X_test_cv = v.transform(X_test)

print(X_train_cv.toarray()[:2][0])
print(X_train_cv.shape)

[0 0 0 ... 0 0 0]
(4457, 7735)


In [21]:
v.get_feature_names_out()[1771]

'chgs'

In [22]:
v.vocabulary_

{'ill': 3597,
 'call': 1599,
 '2mrw': 406,
 'at': 1116,
 'ninish': 4779,
 'with': 7543,
 'my': 4654,
 'address': 818,
 'that': 6810,
 'icky': 3573,
 'american': 943,
 'freek': 2979,
 'wont': 7576,
 'stop': 6495,
 'callin': 1611,
 'me': 4393,
 'bad': 1193,
 'jen': 3799,
 'eh': 2547,
 'just': 3862,
 'do': 2372,
 'what': 7469,
 'ever': 2670,
 'is': 3735,
 'easier': 2508,
 'for': 2931,
 'you': 7699,
 'sorry': 6336,
 'cant': 1635,
 'take': 6689,
 'your': 7703,
 'right': 5788,
 'now': 4835,
 'it': 3747,
 'so': 6290,
 'happens': 3319,
 'there': 6833,
 '2waxsto': 421,
 'wat': 7396,
 'want': 7375,
 'she': 6081,
 'can': 1624,
 'come': 1906,
 'and': 965,
 'get': 3105,
 'her': 3402,
 'medical': 4408,
 'insurance': 3686,
 'll': 4150,
 'be': 1264,
 'able': 761,
 'to': 6936,
 'deliver': 2233,
 'have': 3347,
 'basic': 1232,
 'care': 1647,
 'currently': 2114,
 'shopping': 6117,
 'the': 6814,
 'give': 3134,
 'til': 6899,
 'friday': 2991,
 'morning': 4572,
 'thats': 6813,
 'when': 7475,
 'see': 5993,
 ...

In [23]:
X_train_np = X_train_cv.toarray()
X_train_np[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [24]:
np.where(X_train_np[0]!=0)

(array([ 406,  818,  943, 1116, 1193, 1599, 1611, 2547, 2979, 3573, 3597,
        3799, 4393, 4654, 4779, 6495, 6810, 7543, 7576], dtype=int64),)

In [25]:
# Naive Bayes Classifier
model = MultinomialNB()
model.fit(X_train_cv, y_train)

In [26]:
y_pred = model.predict(X_test_cv)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       973
           1       0.98      0.92      0.95       142

    accuracy                           0.99      1115
   macro avg       0.98      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115



In [27]:
# Test on a random datapoint

message = ["Upto 20% off on parking, exclusing offer just for you"]

message_cnt = v.transform(message)

model.predict(message_cnt)

array([0], dtype=int64)

### 📦 Bag of Words: Pros and Cons

| Pros                                                                                 | Cons                                                                                              |
|--------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
| ✅ Simple and easy to implement — a great starting point for NLP problems.           | ❌ Ignores word order and syntax, losing sentence structure and context.                          |
| ✅ Works well with traditional ML models like Naive Bayes, SVM, and Logistic Regression. | ❌ Cannot capture semantic similarity — treats synonyms as unrelated (e.g., "happy" vs "joyful"). |
| ✅ Fast to compute and scale, especially for smaller datasets.                       | ❌ High dimensionality and sparsity for large vocabularies, leading to inefficient computations.  |
| ✅ Produces interpretable features, enabling better understanding of model behavior. | ❌ Doesn't handle out-of-vocabulary words or unseen text gracefully.                             |
| ✅ Performs well on simple classification tasks like spam detection or topic tagging.| ❌ No context awareness — phrases like "not bad" and "bad" may be misrepresented.                 |
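
The out-of-vocabulary limitation can be illustrated with a small hand-rolled sketch. The training sentences and the new message are made up, and `bow_vector` is a hypothetical helper. Words that were never seen while building the vocabulary simply leave no trace in the vector.

```python
from collections import Counter

train_docs = ["free entry to win", "call me later"]

# The vocabulary is fixed once it has been built from the training documents.
vocab = sorted({word for doc in train_docs for word in doc.split()})

def bow_vector(text):
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

new_message = "exclusive offer to win"
vector = bow_vector(new_message)
unseen = [word for word in new_message.split() if word not in vocab]

print(vocab)   # ['call', 'entry', 'free', 'later', 'me', 'to', 'win']
print(vector)  # [0, 0, 0, 0, 0, 1, 1]
print(unseen)  # ['exclusive', 'offer']
```

The words "exclusive" and "offer" contribute nothing to the representation, so a model sees only "to" and "win".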
