# **Assignment: Advanced Text Preprocessing**

In this assignment, you will explore additional libraries and techniques for text preprocessing. Follow the steps below and implement your solutions in Python.

---

## **Part 1: Text Preprocessing**
Expand the preprocessing steps beyond NLTK. Perform the following tasks:

### **1. Tokenization**
Use **spaCy** and **TextBlob** libraries for tokenization.

- Compare tokenization using **spaCy** and **TextBlob**.
- Explain any differences in the output.

### **2. Lemmatization**
Use **spaCy** and **TextBlob** lemmatizers.

- Perform lemmatization on a given sample dataset.
- Compare results with **WordNetLemmatizer** from NLTK.

### **3. Stemming**
Use additional stemmers, such as:

- **Snowball Stemmer** from NLTK
- **Lancaster Stemmer**

Compare their outputs with **PorterStemmer** from NLTK.

### **4. Stopwords Removal**
Use stopword lists from **Gensim** and **spaCy**, and compare them with the NLTK stopword list.

- Note any additional or missing stopwords between libraries.

### **5. Other Text Cleaning Techniques**
Explore the following additional cleaning techniques:

- **Lowercasing**
- **Removing special characters**
- **Removing numbers**
- **Handling contractions** using the `contractions` library

---


## **Bonus (Optional)**
- Implement **Lemmatization** and **Stemming** in the feature extraction step to observe their impact on the resulting n-grams.
- Explore **preprocessing pipelines** combining multiple steps (e.g., using **spaCy pipelines** or **scikit-learn Pipelines**).

---

## **Submission Requirements**
- Code implementation for each step.
- Written explanation comparing different approaches and libraries.


# Install the Required Libraries

In [None]:
!pip install nltk



In [None]:
!pip install contractions

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl.metadata (1.5 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
Downloading pyahocorasick-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (118 kB)

# Import the Required Libraries

In [None]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
import string
import re
import contractions

# Download necessary NLTK data if not already available

In [None]:

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Printing the stopwords in English
stop_words = set(stopwords.words('english'))
print("Stop words:", stop_words)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\waqar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\waqar\AppData\Roaming\nltk_data...


Stop words: {'down', 'our', 'who', 'because', 'had', 'they', "we're", 'same', "you'd", 'has', "shan't", 'should', "haven't", 'd', "he'd", 'too', "shouldn't", 'where', 'above', "hadn't", 'during', 'under', 'in', 'have', 'you', 'needn', 'herself', 'out', 'which', 'how', 'until', "isn't", 'whom', 'for', "it'd", 'the', 'between', 'doing', 'with', 'hers', 'so', 'it', 'them', 'there', 'itself', 'weren', 'only', 'through', 'having', 'what', 'ain', 'myself', 'then', "should've", 'again', 'me', 'mustn', 'but', "mustn't", 'him', 'those', 'into', 'a', "we'll", "we've", 'other', 'just', 'their', 're', 'few', 'didn', 'not', "needn't", "they'll", 'further', 'theirs', "you're", 'before', 'if', 'shan', 'any', 'being', "wouldn't", "don't", "that'll", "didn't", 'while', 'this', 'its', "aren't", 'wasn', "weren't", 'will', 'when', 'some', "mightn't", "she's", 'her', 'of', "he's", 'nor', 'your', 'haven', 'off', 'she', 'such', 'more', 'than', 's', 'ours', 'y', 'to', 'all', 'or', 'hasn', 'shouldn', "it'll", 

[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\waqar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Create a Dummy Dataset

In [None]:
# Create the DataFrame
data = {
    'ID': [1, 2, 3, 4, 5, 6],
    'Text': [
        "The Qur'an is the holy book of Islam. It's written in Arabic, and Muslims recite it daily.",
        "Prophet Muhammad (PBUH) was born in Mecca. He is regarded as the last prophet of Islam.",
        "Zakat is a mandatory act of charity in Islam, aimed at helping the less fortunate.",
        "Muslims fast during the month of Ramadan to cleanse their body and soul.",
        "Hajj is an annual pilgrimage to Mecca that every Muslim must perform once in their life.",
        "The better player played well and he is a runner"
    ]
}

# Adjusting pandas options to display full content in the dataframe
pd.set_option('display.max_colwidth', None)

df = pd.DataFrame(data)
df

Unnamed: 0,ID,Text
0,1,"The Qur'an is the holy book of Islam. It's written in Arabic, and Muslims recite it daily."
1,2,Prophet Muhammad (PBUH) was born in Mecca. He is regarded as the last prophet of Islam.
2,3,"Zakat is a mandatory act of charity in Islam, aimed at helping the less fortunate."
3,4,Muslims fast during the month of Ramadan to cleanse their body and soul.
4,5,Hajj is an annual pilgrimage to Mecca that every Muslim must perform once in their life.
5,6,The better player played well and he is a runner


## Understanding of Regular Expressions (RE)

In [None]:
# Function to find all words starting with 'a'
# \b: This is the word boundary metacharacter.
# \w: This metacharacter stands for "word character".
def find_words_starting_with_a(text):
    pattern = r'\ba\w*'
    return re.findall(pattern, text, re.IGNORECASE)

# Apply the function to the DataFrame
df['Words_Starting_With_A'] = df['Text'].apply(find_words_starting_with_a)

df

Unnamed: 0,ID,Text,Words_Starting_With_A
0,1,"The Qur'an is the holy book of Islam. It's written in Arabic, and Muslims recite it daily.","[an, Arabic, and]"
1,2,Prophet Muhammad (PBUH) was born in Mecca. He is regarded as the last prophet of Islam.,[as]
2,3,"Zakat is a mandatory act of charity in Islam, aimed at helping the less fortunate.","[a, act, aimed, at]"
3,4,Muslims fast during the month of Ramadan to cleanse their body and soul.,[and]
4,5,Hajj is an annual pilgrimage to Mecca that every Muslim must perform once in their life.,"[an, annual]"
5,6,The better player played well and he is a runner,"[and, a]"
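
The first row of the output contains `an`, which comes from the middle of "Qur'an": an apostrophe is not a word character, so `\b` also matches directly after it. The sketch below reproduces that behavior and shows one possible way around it. The second function and its lookbehind pattern are assumptions added for illustration, not part of the original notebook.

```python
import re

def find_words_starting_with_a(text):
    # Same pattern as in the cell above: \b also matches right after an apostrophe
    return re.findall(r"\ba\w*", text, re.IGNORECASE)

def find_whole_words_starting_with_a(text):
    # Hypothetical variant: the lookbehind rejects an 'a' that follows a
    # word character or an apostrophe, so "Qur'an" no longer yields "an"
    return re.findall(r"(?<![\w'])a\w*", text, re.IGNORECASE)

sample = "The Qur'an is written in Arabic, and Muslims recite it daily."
print(find_words_starting_with_a(sample))        # ['an', 'Arabic', 'and']
print(find_whole_words_starting_with_a(sample))  # ['Arabic', 'and']
```

Whether the apostrophe should count as part of a word depends on the task; the variant simply makes that choice explicit.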


In [None]:
df.drop(df.columns[-1], axis=1, inplace=True)
df

Unnamed: 0,ID,Text
0,1,"The Qur'an is the holy book of Islam. It's written in Arabic, and Muslims recite it daily."
1,2,Prophet Muhammad (PBUH) was born in Mecca. He is regarded as the last prophet of Islam.
2,3,"Zakat is a mandatory act of charity in Islam, aimed at helping the less fortunate."
3,4,Muslims fast during the month of Ramadan to cleanse their body and soul.
4,5,Hajj is an annual pilgrimage to Mecca that every Muslim must perform once in their life.
5,6,The better player played well and he is a runner


# Create the Functions of Data Preprocessing

In [None]:
# Initialize Lemmatizer and Stemmer
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Function 1: Expanding contractions
def expand_contractions(text):
    return contractions.fix(text)

# Function 2: Lowercasing text
def lowercase_text(text):
    return text.lower()

# Function 3: Tokenization
def tokenize_text(text):
    return word_tokenize(text)

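# Function 4: Removing punctuation
# Drop a token only when every character in it is punctuation (e.g. "," or "``").
# A plain `word in string.punctuation` test is a substring check and misses tokens like "``".
def remove_punctuation(tokens):
    return [word for word in tokens if not all(ch in string.punctuation for ch in word)]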

# Function 5: Removing numbers
def remove_numbers(tokens):
    return [word for word in tokens if not word.isdigit()]

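# Function 6: Removing special characters
# Filter after substitution so tokens that become empty are not kept as ''.
def remove_special_characters(tokens):
    cleaned = (re.sub(r'[^A-Za-z0-9]+', '', word) for word in tokens)
    return [word for word in cleaned if word]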

# Function 7: Removing stopwords
def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    return [word for word in tokens if word not in stop_words]

# Function 8: Lemmatization
def lemmatize_text(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]

# Function 9: Stemming
def stem_text(tokens):
    return [stemmer.stem(word) for word in tokens]

# Function 10: Normalization (e.g., converting currency symbols to words)
def normalize_text(tokens):
    return [re.sub(r'\$', 'dollar', word) for word in tokens]

# Function 11: Text standardization (e.g., standardizing variations of "U.S.A." to "USA")
def standardize_text(tokens):
    return [re.sub(r'u\.s\.a\.', 'USA', word) for word in tokens]



# Main function to call all preprocessing steps

In [None]:

# Main function to call preprocessing step-by-step
def preprocess_text(text, use_stemming=False):
    # Step 1: Expand contractions
    text = expand_contractions(text)

    # Step 2: Lowercase the text
    text = lowercase_text(text)

    # Step 3: Tokenization
    tokens = tokenize_text(text)

    # Step 4: Remove punctuation
    tokens = remove_punctuation(tokens)

    # Step 5: Remove numbers
    tokens = remove_numbers(tokens)

    # Step 6: Remove special characters
    tokens = remove_special_characters(tokens)

    # Step 7: Remove stopwords
    tokens = remove_stopwords(tokens)

    # Step 8: Lemmatization
    tokens = lemmatize_text(tokens)

    # Step 9: Normalization
    tokens = normalize_text(tokens)

    # Step 10: Standardization
    tokens = standardize_text(tokens)

    # Step 11: Stemming
    if use_stemming:
        tokens = stem_text(tokens)

    # Join tokens back to string
    processed_text = ' '.join(tokens)

    return processed_text



# Apply preprocessing function to the dataframe with and without stemming
df['Processed_Text_Lemmatization'] = df['Text'].apply(lambda x: preprocess_text(x, use_stemming=False))
df['Processed_Text_Stemming'] = df['Text'].apply(lambda x: preprocess_text(x, use_stemming=True))

df

Unnamed: 0,ID,Text,Processed_Text_Lemmatization,Processed_Text_Stemming
0,1,"The Qur'an is the holy book of Islam. It's written in Arabic, and Muslims recite it daily.",quran holy book islam written arabic muslim recite daily,quran holi book islam written arab muslim recit daili
1,2,Prophet Muhammad (PBUH) was born in Mecca. He is regarded as the last prophet of Islam.,prophet muhammad pbuh born mecca regarded last prophet islam,prophet muhammad pbuh born mecca regard last prophet islam
2,3,"Zakat is a mandatory act of charity in Islam, aimed at helping the less fortunate.",zakat mandatory act charity islam aimed helping le fortunate,zakat mandatori act chariti islam aim help le fortun
3,4,Muslims fast during the month of Ramadan to cleanse their body and soul.,muslim fast month ramadan cleanse body soul,muslim fast month ramadan cleans bodi soul
4,5,Hajj is an annual pilgrimage to Mecca that every Muslim must perform once in their life.,hajj annual pilgrimage mecca every muslim must perform life,hajj annual pilgrimag mecca everi muslim must perform life
5,6,The better player played well and he is a runner,better player played well runner,better player play well runner
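
In `preprocess_text`, normalization and standardization (Steps 9 and 10) run after Step 6 has deleted every non-alphanumeric character, so their patterns (`\$` and `u\.s\.a\.`) have nothing left to match. A minimal sketch of the effect follows. The token list is hypothetical, and the two helpers are simplified copies of Steps 6 and 9 written for this illustration.

```python
import re

def strip_non_alphanumeric(tokens):
    # Mirrors Step 6: delete every character that is not a letter or digit
    cleaned = (re.sub(r"[^A-Za-z0-9]+", "", tok) for tok in tokens)
    return [tok for tok in cleaned if tok]

def normalize_currency(tokens):
    # Mirrors Step 9: spell out the dollar sign
    return [re.sub(r"\$", "dollar", tok) for tok in tokens]

# Hypothetical token list, as a tokenizer might produce for "costs $5 in the u.s.a."
tokens = ["costs", "$", "5", "in", "the", "u.s.a."]

late = normalize_currency(strip_non_alphanumeric(tokens))   # order used above
early = strip_non_alphanumeric(normalize_currency(tokens))  # normalization first

print(late)   # ['costs', '5', 'in', 'the', 'usa']
print(early)  # ['costs', 'dollar', '5', 'in', 'the', 'usa']
```

If the currency and abbreviation rules are meant to have an effect, they would need to run before the steps that strip symbols.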


In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
spacy_stopwords = nlp.Defaults.stop_words
print("spaCy Stopwords count:", len(spacy_stopwords))
print("Sample spaCy Stopwords:", list(spacy_stopwords)[:20])

spaCy Stopwords count: 326
Sample spaCy Stopwords: ['then', 'seems', 'did', 'whereafter', 'every', 'ten', 'an', 'after', 'namely', 'by', 'give', 'although', 'whose', 'sixty', 'his', 'whole', 'who', 'again', 'moreover', 'hereupon']


In [None]:
# Note: TextBlob is never queried for stopwords in this cell. The list below is
# NLTK's English stopword list, so it serves as the NLTK baseline for the comparison.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk_stopwords = stopwords.words("english")
print("NLTK Stopwords count:", len(nltk_stopwords))
print("Sample NLTK Stopwords:", nltk_stopwords[:20])

NLTK Stopwords count: 198
Sample NLTK Stopwords: ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
from gensim.parsing.preprocessing import STOPWORDS

# Gensim stopwords
gensim_stopwords = list(STOPWORDS)

# Print the count and a sample of Gensim stopwords
print("Gensim Stopwords count:", len(gensim_stopwords))
print("Sample Gensim Stopwords:", gensim_stopwords[:20])


Gensim Stopwords count: 337
Sample Gensim Stopwords: ['then', 'seems', 'did', 'whereafter', 'every', 'ten', 'an', 'after', 'namely', 'by', 'cry', 'hasnt', 'give', 'although', 'whose', 'sixty', 'his', 'whole', 'who', 'again']


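## Stopwords Removal: Library Comparison

**NLTK**

- Source: A manually curated list shipped as the `stopwords` corpus.
- Count: 198 in the run above (the size depends on the installed NLTK data version).
- Common examples: `'the'`, `'is'`, `'in'`, `'not'`, `'on'`
- Pros: Widely used and simple to load.
- Cons: A static list that changes rarely, and the shortest of the three.

**spaCy**

- Source: Language defaults bundled with the library (`nlp.Defaults.stop_words`).
- Count: 326 in the run above (`en_core_web_sm`).
- Common examples: Includes contraction fragments such as `n’t`, `'re`, and `'ve`, plus number words such as `'ten'` and `'sixty'`.
- Pros: A larger list that matches the pieces spaCy's tokenizer produces for contractions.
- Cons: May remove more words than some use cases want.

**Gensim**

- Source: A fixed set bundled with the library (`gensim.parsing.preprocessing.STOPWORDS`).
- Count: 337 in the run above.
- Common examples: Largely overlaps with spaCy's list. The sample above also contains `'cry'` and `'hasnt'`, which do not appear in the spaCy sample.
- Pros: Lightweight; commonly used alongside Gensim's topic-modeling tools.
- Cons: Contains apostrophe-free forms such as `'hasnt'`, which only match text whose apostrophes were stripped beforehand.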
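
To note additional or missing stopwords between libraries, set differences are usually enough. The helper below is a sketch written for this notebook (its name is an assumption), and the two inputs are only the 20-word samples printed above, not the full lists. For a real comparison, pass the complete sets, for example `set(stopwords.words('english'))`, `nlp.Defaults.stop_words`, and `STOPWORDS`.

```python
def compare_stopword_lists(named_lists):
    """Return, for each named list, the words that appear in no other list."""
    sets = {name: set(words) for name, words in named_lists.items()}
    unique = {}
    for name, words in sets.items():
        others = set().union(*(s for n, s in sets.items() if n != name))
        unique[name] = sorted(words - others)
    return unique

# The 20-word samples printed by the cells above (not the full lists)
spacy_sample = ['then', 'seems', 'did', 'whereafter', 'every', 'ten', 'an', 'after',
                'namely', 'by', 'give', 'although', 'whose', 'sixty', 'his', 'whole',
                'who', 'again', 'moreover', 'hereupon']
gensim_sample = ['then', 'seems', 'did', 'whereafter', 'every', 'ten', 'an', 'after',
                 'namely', 'by', 'cry', 'hasnt', 'give', 'although', 'whose', 'sixty',
                 'his', 'whole', 'who', 'again']

result = compare_stopword_lists({"spaCy sample": spacy_sample,
                                 "Gensim sample": gensim_sample})
print(result["spaCy sample"])   # ['hereupon', 'moreover']
print(result["Gensim sample"])  # ['cry', 'hasnt']
```

Because only samples are compared here, a word listed as unique may still exist further down the other library's full list.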

In [None]:
import spacy
import pandas as pd

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Sample DataFrame
data = {
    'ID': [1, 2, 3, 4, 5, 6],
    'Text': [
        "The Qur'an is the holy book of Islam. It's written in Arabic, and Muslims recite it daily.",
        "Prophet Muhammad (PBUH) was born in Mecca. He is regarded as the last prophet of Islam.",
        "Zakat is a mandatory act of charity in Islam, aimed at helping the less fortunate.",
        "Muslims fast during the month of Ramadan to cleanse their body and soul.",
        "Hajj is an annual pilgrimage to Mecca that every Muslim must perform once in their life.",
        "The better player played well and he is a runner."
    ]
}

df = pd.DataFrame(data)

# Define the spaCy tokenizer function
def spacy_tokenizer(text):
    doc = nlp(text)
    return [token.text for token in doc]

# Apply the tokenizer function to the DataFrame
df['spaCy_Tokens'] = df['Text'].apply(spacy_tokenizer)

# Ensure the DataFrame is displayed with clear formatting
print(df[['ID', 'spaCy_Tokens']].to_string(index=False))


 ID                                                                                                         spaCy_Tokens
  1 [The, Qur'an, is, the, holy, book, of, Islam, ., It, 's, written, in, Arabic, ,, and, Muslims, recite, it, daily, .]
  2     [Prophet, Muhammad, (, PBUH, ), was, born, in, Mecca, ., He, is, regarded, as, the, last, prophet, of, Islam, .]
  3               [Zakat, is, a, mandatory, act, of, charity, in, Islam, ,, aimed, at, helping, the, less, fortunate, .]
  4                             [Muslims, fast, during, the, month, of, Ramadan, to, cleanse, their, body, and, soul, .]
  5          [Hajj, is, an, annual, pilgrimage, to, Mecca, that, every, Muslim, must, perform, once, in, their, life, .]
  6                                                       [The, better, player, played, well, and, he, is, a, runner, .]


In [None]:
import pandas as pd
from textblob import TextBlob

# Sample DataFrame
data = {
    'ID': [1, 2, 3, 4, 5, 6],
    'Text': [
        "The Qur'an is the holy book of Islam. It's written in Arabic, and Muslims recite it daily.",
        "Prophet Muhammad (PBUH) was born in Mecca. He is regarded as the last prophet of Islam.",
        "Zakat is a mandatory act of charity in Islam, aimed at helping the less fortunate.",
        "Muslims fast during the month of Ramadan to cleanse their body and soul.",
        "Hajj is an annual pilgrimage to Mecca that every Muslim must perform once in their life.",
        "The better player played well and he is a runner."
    ]
}

df = pd.DataFrame(data)

# Define the TextBlob tokenizer function
def textblob_tokenizer(text):
    blob = TextBlob(text)
    return [word for word in blob.words]

# Apply the tokenizer function to the DataFrame
df['TextBlob_Tokens'] = df['Text'].apply(textblob_tokenizer)

# Ensure the DataFrame is displayed with clear formatting
print(df[['ID', 'TextBlob_Tokens']].to_string(index=False))



 ID                                                                                             TextBlob_Tokens
  1 [The, Qur'an, is, the, holy, book, of, Islam, It, 's, written, in, Arabic, and, Muslims, recite, it, daily]
  2        [Prophet, Muhammad, PBUH, was, born, in, Mecca, He, is, regarded, as, the, last, prophet, of, Islam]
  3            [Zakat, is, a, mandatory, act, of, charity, in, Islam, aimed, at, helping, the, less, fortunate]
  4                       [Muslims, fast, during, the, month, of, Ramadan, to, cleanse, their, body, and, soul]
  5    [Hajj, is, an, annual, pilgrimage, to, Mecca, that, every, Muslim, must, perform, once, in, their, life]
  6                                                 [The, better, player, played, well, and, he, is, a, runner]


## Key Differences

**Punctuation handling**

- TextBlob: `blob.words` drops punctuation tokens. No `.`, `,`, `(`, or `)` appears in the output above.
- spaCy: Keeps every punctuation mark as a separate token.

**Contractions**

- TextBlob: Splits "It's" into `It` and `'s`.
- spaCy: Produces the same split (`It`, `'s`). Both libraries keep "Qur'an" as a single token. The sample texts contain no negated contraction such as "isn't", so that case is not compared here.

**Token granularity**

- TextBlob: Words only; row 1 yields 18 tokens.
- spaCy: Words and punctuation; row 1 yields 21 tokens.

**Complexity**

- TextBlob: Simple and straightforward, well suited to quick tasks.
- spaCy: Runs a full language pipeline, better suited to detailed and production-level NLP tasks.
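
Claims about tokenizer differences can be checked directly against the recorded outputs. The sketch below hard-codes the row 2 tokens printed above and uses a small helper (its name is an assumption) to list what one tokenizer emits that the other does not.

```python
from collections import Counter

def token_difference(tokens_a, tokens_b):
    """Tokens (with multiplicity) present in tokens_a but missing from tokens_b."""
    extra = Counter(tokens_a) - Counter(tokens_b)
    return sorted(extra.elements())

# Row 2 tokens copied from the spaCy and TextBlob outputs above
spacy_tokens = ["Prophet", "Muhammad", "(", "PBUH", ")", "was", "born", "in", "Mecca", ".",
                "He", "is", "regarded", "as", "the", "last", "prophet", "of", "Islam", "."]
textblob_tokens = ["Prophet", "Muhammad", "PBUH", "was", "born", "in", "Mecca",
                   "He", "is", "regarded", "as", "the", "last", "prophet", "of", "Islam"]

print(token_difference(spacy_tokens, textblob_tokens))  # ['(', ')', '.', '.']
print(token_difference(textblob_tokens, spacy_tokens))  # []
```

For this row the only difference is punctuation: spaCy keeps it and TextBlob's `.words` drops it.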



In [None]:
import pandas as pd
from textblob import TextBlob
import spacy

# Sample DataFrame
data = {
    'ID': [1, 2, 3, 4, 5, 6],
    'Text': [
        "The Qur'an is the holy book of Islam. It's written in Arabic, and Muslims recite it daily.",
        "Prophet Muhammad (PBUH) was born in Mecca. He is regarded as the last prophet of Islam.",
        "Zakat is a mandatory act of charity in Islam, aimed at helping the less fortunate.",
        "Muslims fast during the month of Ramadan to cleanse their body and soul.",
        "Hajj is an annual pilgrimage to Mecca that every Muslim must perform once in their life.",
        "The better player played well and he is a runner."
    ]
}

df = pd.DataFrame(data)

# Define the TextBlob tokenizer function
def textblob_tokenizer(text):
    blob = TextBlob(text)
    return [word for word in blob.words]

# Define the spaCy tokenizer function
nlp = spacy.load("en_core_web_sm")
def spacy_tokenizer(text):
    doc = nlp(text)
    return [token.text for token in doc]

# Apply the tokenizers to the DataFrame
df['TextBlob_Tokens'] = df['Text'].apply(textblob_tokenizer)
df['spaCy_Tokens'] = df['Text'].apply(spacy_tokenizer)

# Display the DataFrame with separate columns for each tokenizer
print("TextBlob Tokenized Output:")
print(df[['ID', 'TextBlob_Tokens']].to_string(index=False))

print("\nspaCy Tokenized Output:")
print(df[['ID', 'spaCy_Tokens']].to_string(index=False))


TextBlob Tokenized Output:
 ID                                                                                             TextBlob_Tokens
  1 [The, Qur'an, is, the, holy, book, of, Islam, It, 's, written, in, Arabic, and, Muslims, recite, it, daily]
  2        [Prophet, Muhammad, PBUH, was, born, in, Mecca, He, is, regarded, as, the, last, prophet, of, Islam]
  3            [Zakat, is, a, mandatory, act, of, charity, in, Islam, aimed, at, helping, the, less, fortunate]
  4                       [Muslims, fast, during, the, month, of, Ramadan, to, cleanse, their, body, and, soul]
  5    [Hajj, is, an, annual, pilgrimage, to, Mecca, that, every, Muslim, must, perform, once, in, their, life]
  6                                                 [The, better, player, played, well, and, he, is, a, runner]

spaCy Tokenized Output:
 ID                                                                                                         spaCy_Tokens
  1 [The, Qur'an, is, the, holy, book, of, 

## Quick Comparison: TextBlob vs. spaCy vs. NLTK Lemmatizer

- **TextBlob:** Simple API for basic tokenization and lemmatization. `Word(...).lemmatize()` is backed by WordNet and treats a word as a noun unless a part of speech is passed, so it behaves much like the NLTK lemmatizer below. A good fit when quick results matter more than precision.
- **spaCy:** Runs a full pipeline and fills `token.lemma_` using the part-of-speech tag predicted in context, so verbs and nouns are lemmatized without manual tagging. It keeps punctuation as separate tokens. Suited to larger datasets, but it needs a model download such as `en_core_web_sm`.
- **NLTK Lemmatizer:** `WordNetLemmatizer` looks words up in WordNet and defaults to the noun part of speech. In the processed output earlier in this notebook, "muslims" became "muslim", but "played", "regarded", and "helping" were left unchanged and "less" became "le". Passing a part-of-speech tag (for example `pos='v'` for verbs) is needed for accurate results.

In [None]:
import pandas as pd
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

# Sample DataFrame
data = {
    'ID': [1, 2, 3, 4, 5, 6],
    'Text': [
        "The Qur'an is the holy book of Islam. It's written in Arabic, and Muslims recite it daily.",
        "Prophet Muhammad (PBUH) was born in Mecca. He is regarded as the last prophet of Islam.",
        "Zakat is a mandatory act of charity in Islam, aimed at helping the less fortunate.",
        "Muslims fast during the month of Ramadan to cleanse their body and soul.",
        "Hajj is an annual pilgrimage to Mecca that every Muslim must perform once in their life.",
        "The better player played well and he is a runner."
    ]
}

# Create DataFrame
df = pd.DataFrame(data)

# Initialize stemmers
porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# Define a function to stem words with each stemmer
def apply_stemmers(text):
    words = text.split()  # Split the text into words
    porter_stemmed = [porter.stem(word) for word in words]
    snowball_stemmed = [snowball.stem(word) for word in words]
    lancaster_stemmed = [lancaster.stem(word) for word in words]

    return porter_stemmed, snowball_stemmed, lancaster_stemmed

# Apply the stemming function to each row in the 'Text' column
df[['Porter_Stemmed', 'Snowball_Stemmed', 'Lancaster_Stemmed']] = df['Text'].apply(lambda x: pd.Series(apply_stemmers(x)))

# Display the DataFrame with the stemmed results
print(df[['ID', 'Text', 'Porter_Stemmed', 'Snowball_Stemmed', 'Lancaster_Stemmed']].to_string(index=False))



 ID                                                                                       Text                                                                                            Porter_Stemmed                                                                                         Snowball_Stemmed                                                                                       Lancaster_Stemmed
  1 The Qur'an is the holy book of Islam. It's written in Arabic, and Muslims recite it daily. [the, qur'an, is, the, holi, book, of, islam., it', written, in, arabic,, and, muslim, recit, it, daily.] [the, qur'an, is, the, holi, book, of, islam., it, written, in, arabic,, and, muslim, recit, it, daily.] [the, qur'an, is, the, holy, book, of, islam., it's, writ, in, arabic,, and, muslim, recit, it, daily.]
  2    Prophet Muhammad (PBUH) was born in Mecca. He is regarded as the last prophet of Islam.     [prophet, muhammad, (pbuh), wa, born, in, mecca., he, is, regard, as, the, last, 
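
In the output above, tokens such as `islam.`, `arabic,`, and `(pbuh)` still carry punctuation because `text.split()` only splits on whitespace, and the stemmers then see the punctuation as part of the word. A minimal sketch of a regex-based alternative is shown below. The helper name and pattern are assumptions chosen for this example, not a library tokenizer.

```python
import re

def split_words(text):
    # Hypothetical helper: keep runs of letters, allowing one internal apostrophe part
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

sample = ("The Qur'an is the holy book of Islam. It's written in Arabic, "
          "and Muslims recite it daily.")

print(sample.split()[7])       # Islam.  (punctuation still attached)
print(split_words(sample)[7])  # Islam
```

The cleaned words can then be passed to any of the stemmers defined in the cell above, for example `[porter.stem(w) for w in split_words(text)]`.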

**Porter Stemmer**

- One of the oldest and most widely used stemmers.
- Applies a fixed set of suffix-stripping rules (e.g., "connection" → "connect"). Stems are not always real words: the output above contains "holi" for "holy" and "wa" for "was".
- Best for: General NLP tasks that need a balance between aggressiveness and simplicity.

**Lancaster Stemmer**

- More aggressive than Porter; it cuts words down further.
- Often produces stems that are not real words (e.g., "written" → "writ" in the output above, where Porter and Snowball keep "written").
- Best for: Exploratory tasks where heavy conflation is acceptable and exact meaning is not critical.

**Snowball Stemmer**

- A revised version of Porter (the English variant is also known as Porter2) with more consistent rules.
- Less aggressive than Lancaster. In the output above it reduces "It's" to "it", where Porter returns "it'".
- Supports multiple languages.
- Best for: Tasks that need more consistent stems or languages other than English.

In [None]:
import re
import spacy
from textblob import Word
import contractions

# Load the spaCy model
nlp = spacy.load('en_core_web_sm')


In [None]:
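# Lowercasing with Python's built-in str.lower() (TextBlob is not needed for this step)
def lowercase_text(text):
    return text.lower()

# Lowercasing with spaCy: tokens are re-joined with single spaces,
# so punctuation ends up separated from the preceding word
def spacy_lowercase(text):
    doc = nlp(text)
    return " ".join([token.text.lower() for token in doc])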


In [None]:
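# Removing special characters with a regular expression
# (the pattern keeps only letters and whitespace, so digits are removed too)
def remove_special_characters(text):
    return re.sub(r'[^a-zA-Z\s]', '', text)

# Removing special characters with spaCy: keeps only fully alphabetic tokens,
# so a token containing an apostrophe or a digit is dropped entirely
def spacy_remove_special_characters(text):
    doc = nlp(text)
    return " ".join([token.text for token in doc if token.is_alpha])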


In [None]:
# Removing numbers with a regular expression (strips every run of digits)
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

# Removing numbers with spaCy: drops tokens that consist only of digits
def spacy_remove_numbers(text):
    doc = nlp(text)
    return " ".join([token.text for token in doc if not token.is_digit])


In [None]:
# Expanding contractions with the contractions library (TextBlob is not involved)
def expand_contractions(text):
    return contractions.fix(text)

# Function for Handling Contractions using spaCy
def spacy_expand_contractions(text):
    # spaCy itself doesn't provide contraction handling out-of-the-box.
    # So we use the contractions library here.
    return contractions.fix(text)


In [None]:
import pandas as pd
import re
import contractions

# Sample DataFrame
data = {
    'ID': [1, 2, 3, 4, 5, 6],
    'Text': [
        "The Qur'an is the holy book of Islam. It's written in Arabic, and Muslims recite it daily.",
        "Prophet Muhammad (PBUH) was born in Mecca. He is regarded as the last prophet of Islam.",
        "Zakat is a mandatory act of charity in Islam, aimed at helping the less fortunate.",
        "Muslims fast during the month of Ramadan to cleanse their body and soul.",
        "Hajj is an annual pilgrimage to Mecca that every Muslim must perform once in their life.",
        "The better player played well and he is a runner."
    ]
}

df = pd.DataFrame(data)

# Sample functions for text cleaning
def lowercase_text(text):
    return text.lower()

def remove_special_characters(text):
    return re.sub(r'[^a-zA-Z\s]', '', text)

def remove_numbers(text):
    return re.sub(r'\d+', '', text)

def expand_contractions(text):
    return contractions.fix(text)

# Apply Text Cleaning Functions
df['Lowercased_Text'] = df['Text'].apply(lowercase_text)
df['Special_Char_Removed'] = df['Text'].apply(remove_special_characters)
df['Numbers_Removed'] = df['Text'].apply(remove_numbers)
df['Contractions_Expanded'] = df['Text'].apply(expand_contractions)

# Printing each result separately with better formatting

# Print Lowercased Text
print("\n----- Lowercased Text -----")
print(df[['ID', 'Lowercased_Text']].to_string(index=False))

# Print Special Characters Removed
print("\n----- Special Characters Removed -----")
print(df[['ID', 'Special_Char_Removed']].to_string(index=False))

# Print Numbers Removed
print("\n----- Numbers Removed -----")
print(df[['ID', 'Numbers_Removed']].to_string(index=False))

# Print Contractions Expanded
print("\n----- Contractions Expanded -----")
print(df[['ID', 'Contractions_Expanded']].to_string(index=False))



----- Lowercased Text -----
 ID                                                                            Lowercased_Text
  1 the qur'an is the holy book of islam. it's written in arabic, and muslims recite it daily.
  2    prophet muhammad (pbuh) was born in mecca. he is regarded as the last prophet of islam.
  3         zakat is a mandatory act of charity in islam, aimed at helping the less fortunate.
  4                   muslims fast during the month of ramadan to cleanse their body and soul.
  5   hajj is an annual pilgrimage to mecca that every muslim must perform once in their life.
  6                                          the better player played well and he is a runner.

----- Special Characters Removed -----
 ID                                                                    Special_Char_Removed
  1   The Quran is the holy book of Islam Its written in Arabic and Muslims recite it daily
  2     Prophet Muhammad PBUH was born in Mecca He is regarded as the last prophet

## Explanation of Text Cleaning Techniques

**Lowercasing**

- Purpose: Converts all text to lowercase so that comparisons ignore case.
- Why: Treats words like "Islam" and "islam" as the same token.
- Example: "The Qur'an" → "the qur'an"

**Removing Special Characters**

- Purpose: Eliminates punctuation, parentheses, and symbols.
- Why: For bag-of-words features these characters mostly add noise. The pattern used above (`[^a-zA-Z\s]`) also deletes digits and apostrophes, which is why "Qur'an" becomes "Quran" and "It's" becomes "Its" in the output.
- Example: "Prophet Muhammad (PBUH)" → "Prophet Muhammad PBUH"

**Removing Numbers**

- Purpose: Removes digits when they are not relevant to the analysis.
- Why: Keeps the focus on the words. The surrounding whitespace is left in place.
- Example: "30 days" → " days"

**Handling Contractions**

- Purpose: Expands shortened forms (e.g., "it's" to "it is").
- Why: Standardizes text and reduces ambiguity. Run this step before removing special characters; otherwise the apostrophe is already gone and "It's" can no longer be expanded.
- Example: "It's raining" → "It is raining"

**Benefits**

- Consistency: Ensures uniformity across the text.
- Noise reduction: Removes characters and numbers that are irrelevant to the task.
- Model accuracy: Can help NLP models focus on meaningful words.
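
The cell above applies each cleaning function to the original text separately. When the steps are chained, their order changes the result. The sketch below illustrates this with a hypothetical one-entry contraction map standing in for `contractions.fix`, and with a special-character pattern that keeps digits so number removal stays a separate step. All helper names here are assumptions made for the example.

```python
import re

# Hypothetical one-entry stand-in for contractions.fix, for illustration only
CONTRACTION_MAP = {"it's": "it is"}

def expand_contractions_minimal(text):
    for short, full in CONTRACTION_MAP.items():
        text = text.replace(short, full)
    return text

def remove_special_characters(text):
    # Keeps lowercase letters, digits, and whitespace
    return re.sub(r"[^a-z0-9\s]", "", text)

def remove_numbers(text):
    return re.sub(r"\d+", "", text)

def clean(text, expand_first=True):
    text = text.lower()
    if expand_first:
        text = remove_special_characters(expand_contractions_minimal(text))
    else:
        text = expand_contractions_minimal(remove_special_characters(text))
    # Collapse the gaps left behind by removed numbers
    return " ".join(remove_numbers(text).split())

sample = "It's written in Arabic, 30 days."
print(clean(sample, expand_first=True))   # it is written in arabic days
print(clean(sample, expand_first=False))  # its written in arabic days
```

Expanding contractions first preserves "it is"; stripping apostrophes first leaves "its", which a later stopword or lemmatization step would treat as a different word.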

In [None]:
import pandas as pd
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from nltk.stem import PorterStemmer, SnowballStemmer
from spacy.lang.en import English
from sklearn.base import BaseEstimator, TransformerMixin

# Load spaCy model for lemmatization
nlp = spacy.load("en_core_web_sm")

# Sample DataFrame
data = {
    'ID': [1, 2, 3, 4, 5, 6],
    'Text': [
        "The Qur'an is the holy book of Islam. It's written in Arabic, and Muslims recite it daily.",
        "Prophet Muhammad (PBUH) was born in Mecca. He is regarded as the last prophet of Islam.",
        "Zakat is a mandatory act of charity in Islam, aimed at helping the less fortunate.",
        "Muslims fast during the month of Ramadan to cleanse their body and soul.",
        "Hajj is an annual pilgrimage to Mecca that every Muslim must perform once in their life.",
        "The better player played well and he is a runner."
    ]
}

df = pd.DataFrame(data)


In [None]:
# Define a function to apply Lemmatization using spaCy
def lemmatize_text(text):
    doc = nlp(text)
    return " ".join([token.lemma_ for token in doc])

# Define Stemming function (Porter by default, Snowball as an alternative)
def stem_text(text, stemmer_type="porter"):
    words = text.split()
    if stemmer_type == "porter":
        stemmer = PorterStemmer()
    elif stemmer_type == "snowball":
        stemmer = SnowballStemmer("english")
    else:
        # Fail early instead of hitting an unbound `stemmer` below
        raise ValueError(f"Unsupported stemmer_type: {stemmer_type!r}")

    return " ".join([stemmer.stem(word) for word in words])

# Combine different preprocessing techniques
def preprocess_text(text, use_stemming=False, stemmer_type="porter"):
    # Lowercase text, remove special characters, etc.
    text = text.lower()
    text = ''.join([char for char in text if char.isalnum() or char.isspace()])  # Remove special characters

    if use_stemming:
        return stem_text(text, stemmer_type)
    else:
        return lemmatize_text(text)


In [None]:

tfidf_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))  # Extract unigrams and bigrams
X_tfidf = tfidf_vectorizer.fit_transform(df['Text'])

# Display extracted features (n-grams)
feature_names = tfidf_vectorizer.get_feature_names_out()
print("Extracted Features (Unigrams and Bigrams):")
print(feature_names[:20])  # Show first 20 features


Extracted Features (Unigrams and Bigrams):
['act' 'act of' 'aimed' 'aimed at' 'an' 'an annual' 'an is' 'and' 'and he'
 'and muslims' 'and soul' 'annual' 'annual pilgrimage' 'arabic'
 'arabic and' 'as' 'as the' 'at' 'at helping' 'better']
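
The feature list contains the bigram `'an is'` even though no sentence has the word sequence "an is". It comes from "Qur'an is": scikit-learn's default `token_pattern` (`(?u)\b\w\w+\b`) splits at the apostrophe and also drops one-character tokens such as "a". The sketch below approximates that tokenization with the standard library only. It is a simplified illustration of the default behavior, not the vectorizer's actual implementation, and the helper name is an assumption.

```python
import re

# scikit-learn's documented default token_pattern for its text vectorizers
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def word_ngrams(text, n):
    # Lowercase, tokenize with the default pattern, then join n consecutive tokens
    tokens = re.findall(DEFAULT_TOKEN_PATTERN, text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = "The Qur'an is a holy book"
print(word_ngrams(sample, 1))  # ['the', 'qur', 'an', 'is', 'holy', 'book']
print(word_ngrams(sample, 2))  # ['the qur', 'qur an', 'an is', 'is holy', 'holy book']
```

Cleaning the text before vectorizing, or passing a custom `token_pattern` or `tokenizer`, avoids such artificial n-grams.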


In [None]:
# Custom transformer to apply lemmatization or stemming in a pipeline
class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, use_stemming=False, stemmer_type="porter"):
        self.use_stemming = use_stemming
        self.stemmer_type = stemmer_type

    def fit(self, X, y=None):
        return self  # No fitting necessary

    def transform(self, X):
        return [preprocess_text(text, use_stemming=self.use_stemming, stemmer_type=self.stemmer_type) for text in X]

# Create a pipeline combining text preprocessing and feature extraction
pipeline = Pipeline([
    ('preprocessor', TextPreprocessor(use_stemming=True, stemmer_type="porter")),  # Can toggle stemming or lemmatization
    ('vectorizer', TfidfVectorizer(analyzer='word', ngram_range=(1, 2)))  # Extract unigrams and bigrams
])
X_pipeline = pipeline.fit_transform(df['Text'])
print("Extracted n-grams with preprocessing pipeline:")
feature_names_pipeline = pipeline.named_steps['vectorizer'].get_feature_names_out()
print(feature_names_pipeline[:20])


Extracted n-grams with preprocessing pipeline:
['act' 'act of' 'aim' 'aim at' 'an' 'an annual' 'and' 'and he'
 'and muslim' 'and soul' 'annual' 'annual pilgrimag' 'arab' 'arab and'
 'as' 'as the' 'at' 'at help' 'better' 'better player']
