# **Advanced Text Preprocessing and N-gram Feature Extraction**

In this assignment, you will explore more libraries and techniques for text preprocessing and feature extraction. Follow the steps below and implement your solutions in Python.

---

## **Part 1: Text Preprocessing**
Expand the preprocessing steps beyond NLTK. Perform the following tasks:

### **1. Tokenization**
Use **spaCy** and **TextBlob** libraries for tokenization.

- Compare tokenization using **spaCy** and **TextBlob**.
- Explain any differences in the output.

### **2. Lemmatization**
Use **spaCy** and **TextBlob** lemmatizers.

- Perform lemmatization on a given sample dataset.
- Compare results with **WordNetLemmatizer** from NLTK.

### **3. Stemming**
Use additional stemmers, such as:

- **Snowball Stemmer** from NLTK
- **Lancaster Stemmer**

Compare their outputs with **PorterStemmer** from NLTK.

### **4. Stopwords Removal**
Use stopword lists from **Gensim** and **spaCy**, and compare them with the NLTK stopword list.

- Note any additional or missing stopwords between libraries.

### **5. Other Text Cleaning Techniques**
Explore the following additional cleaning techniques:

- **Lowercasing**
- **Removing special characters**
- **Removing numbers**
- **Handling contractions** using the `contractions` library

---

## **Part 2: Feature Extraction Using N-grams**

### **1. Unigram, Bigram, and Trigram Extraction**
Extend the Bag of Words method to capture:

- **Unigrams**: Single words.
- **Bigrams**: Two-word combinations.
- **Trigrams**: Three-word combinations.

Use **CountVectorizer** from **scikit-learn** for implementation. Apply the different n-grams on the sample dataset and explain the results.

### **2. Word and Character N-grams**
- Perform feature extraction using **word-level n-grams** and **character-level n-grams**.
- Analyze the difference in feature vectors between word and character n-grams.


### **3. Vocabulary Size and Sparsity**
- Analyze how different n-gram models affect the vocabulary size and sparsity of the feature vectors.

---

## **Bonus**
- Implement **Lemmatization** and **Stemming** in the feature extraction step to observe their impact on the resulting n-grams.
- Explore **preprocessing pipelines** combining multiple steps (e.g., using **spaCy pipelines** or **scikit-learn Pipelines**).

---


## **Part 1: Text Preprocessing**
### **1. Tokenization using spaCy and TextBlob**
This step involves tokenizing a sample text using both the spaCy and TextBlob libraries to compare their outputs.

In [81]:
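%pip install spacy textblob nltk gensim contractions scikit-learn
!python -m spacy download en_core_web_sm  # model loaded by spacy.load() below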



In [82]:
import pandas as pd
import spacy
from textblob import TextBlob

# Create the DataFrame
data = {
    'ID': [1, 2, 3, 4, 5],
    'Text': [
        "The Qur'an is the holy book of Islam. It's written in Arabic, and Muslims recite it daily.",
        "Prophet Muhammad (PBUH) was born in Mecca. He is regarded as the last prophet of Islam.",
        "Zakat is a mandatory act of charity in Islam, aimed at helping the less fortunate.",
        "Muslims fast during the month of Ramadan to cleanse their body and soul.",
        "Hajj is an annual pilgrimage to Mecca that every Muslim must perform once in their life."
    ]
}

df = pd.DataFrame(data)

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Function to tokenize text using spaCy and TextBlob
def tokenize(text):
    # spaCy keeps punctuation marks as separate tokens
    spacy_tokens = [token.text for token in nlp(text)]
    # TextBlob's .words property drops punctuation
    text_blob_tokens = TextBlob(text).words
    return spacy_tokens, text_blob_tokens

# Apply tokenization to DataFrame
df['spaCy Tokens'], df['TextBlob Tokens'] = zip(*df['Text'].apply(tokenize))

# Display the DataFrame
print(df[['spaCy Tokens', 'TextBlob Tokens']])



                                                                                                           spaCy Tokens  \
0  [The, Qur'an, is, the, holy, book, of, Islam, ., It, 's, written, in, Arabic, ,, and, Muslims, recite, it, daily, .]   
1      [Prophet, Muhammad, (, PBUH, ), was, born, in, Mecca, ., He, is, regarded, as, the, last, prophet, of, Islam, .]   
2                [Zakat, is, a, mandatory, act, of, charity, in, Islam, ,, aimed, at, helping, the, less, fortunate, .]   
3                              [Muslims, fast, during, the, month, of, Ramadan, to, cleanse, their, body, and, soul, .]   
4           [Hajj, is, an, annual, pilgrimage, to, Mecca, that, every, Muslim, must, perform, once, in, their, life, .]   

                                                                                               TextBlob Tokens  
0  [The, Qur'an, is, the, holy, book, of, Islam, It, 's, written, in, Arabic, and, Muslims, recite, it, daily]  
1         [Prophet, Muhammad, PBUH,
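
The first row of the output already shows the main difference: spaCy keeps punctuation marks as separate tokens, while TextBlob's `.words` drops them. The plain-Python sketch below makes that explicit using the two row-0 token lists copied from the output above, so it needs neither spaCy nor TextBlob. The helper name `token_diff` is an assumption introduced only for illustration.

```python
from collections import Counter

# Row 0 token lists, copied from the printed output above
spacy_tokens = ["The", "Qur'an", "is", "the", "holy", "book", "of", "Islam", ".",
                "It", "'s", "written", "in", "Arabic", ",", "and", "Muslims",
                "recite", "it", "daily", "."]
textblob_tokens = ["The", "Qur'an", "is", "the", "holy", "book", "of", "Islam",
                   "It", "'s", "written", "in", "Arabic", "and", "Muslims",
                   "recite", "it", "daily"]

def token_diff(a, b):
    """Return the tokens (with multiplicity) that occur in a but not in b."""
    return sorted((Counter(a) - Counter(b)).elements())

only_spacy = token_diff(spacy_tokens, textblob_tokens)
only_textblob = token_diff(textblob_tokens, spacy_tokens)

print("Only in spaCy:", only_spacy)
print("Only in TextBlob:", only_textblob)
```

For this sentence the only tokens unique to spaCy are the comma and the two full stops; both tokenizers split "It's" into `It` and `'s` and keep `Qur'an` intact.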


### 2. Lemmatization using spaCy and comparing with NLTK's WordNetLemmatizer
We'll compare how lemmatization is handled by spaCy and see how it differs from NLTK's WordNetLemmatizer.


In [83]:
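import nltk
from nltk.stem import WordNetLemmatizer

# WordNet data is required by the NLTK lemmatizer
nltk.download('wordnet', quiet=True)

# NLTK Lemmatization
nltk_lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    spacy_doc = nlp(text)
    # No POS tag is passed, so NLTK treats every token as a noun (hence "was" -> "wa")
    nltk_lemmas = [nltk_lemmatizer.lemmatize(token.text) for token in spacy_doc]
    # spaCy uses its own POS tags when assigning lemmas
    spacy_lemmas = [token.lemma_ for token in spacy_doc]
    return nltk_lemmas, spacy_lemmas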

# Apply lemmatization to DataFrame
df['NLTK Lemmas'], df['spaCy Lemmas'] = zip(*df['Text'].apply(lemmatize))

# Display the DataFrame
print(df[['NLTK Lemmas', 'spaCy Lemmas']])



                                                                                                            NLTK Lemmas  \
0  [The, Qur'an, is, the, holy, book, of, Islam, ., It, 's, written, in, Arabic, ,, and, Muslims, recite, it, daily, .]   
1        [Prophet, Muhammad, (, PBUH, ), wa, born, in, Mecca, ., He, is, regarded, a, the, last, prophet, of, Islam, .]   
2                  [Zakat, is, a, mandatory, act, of, charity, in, Islam, ,, aimed, at, helping, the, le, fortunate, .]   
3                              [Muslims, fast, during, the, month, of, Ramadan, to, cleanse, their, body, and, soul, .]   
4           [Hajj, is, an, annual, pilgrimage, to, Mecca, that, every, Muslim, must, perform, once, in, their, life, .]   

                                                                                                         spaCy Lemmas  
0  [the, Qur'an, be, the, holy, book, of, Islam, ., it, be, write, in, Arabic, ,, and, Muslims, recite, it, daily, .]  
1       [Prophet, Muh
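
`WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, which is why the NLTK column contains forms such as `wa` (from "was") and `le` (from "less"). One common workaround is to map Penn Treebank tags to WordNet's single-letter POS codes first. The sketch below shows only that mapping, in plain Python; the helper name `penn_to_wordnet` is an assumption, and the (word, tag) pairs are written by hand rather than produced by a tagger.

```python
# WordNet POS codes are single letters: noun, verb, adjective, adverb
WORDNET_NOUN, WORDNET_VERB, WORDNET_ADJ, WORDNET_ADV = "n", "v", "a", "r"

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS code."""
    if tag.startswith("J"):
        return WORDNET_ADJ
    if tag.startswith("V"):
        return WORDNET_VERB
    if tag.startswith("R"):
        return WORDNET_ADV
    return WORDNET_NOUN  # fall back to noun, the lemmatizer's own default

# Hand-written (word, tag) pairs for illustration only
tagged = [("was", "VBD"), ("regarded", "VBN"), ("less", "RBR"), ("prophet", "NN")]
pos_codes = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pos_codes)
```

In the notebook, the code could be passed as the second argument, for example `nltk_lemmatizer.lemmatize(token.text, penn_to_wordnet(token.tag_))`. With a verb code, "was" should no longer be reduced to `wa`, although function words tagged outside these four classes still fall back to the noun rule.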


### 3. Stemming using Snowball, Lancaster, and comparing with PorterStemmer from NLTK
This section explores the differences in stemming outputs among different stemmers available in NLTK.


In [84]:
from nltk.stem import SnowballStemmer, LancasterStemmer, PorterStemmer

# Initialize stemmers
snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()
porter = PorterStemmer()

def stem(text):
    tokens = [token.text for token in nlp(text)]
    snowball_stems = [snowball.stem(word) for word in tokens]
    lancaster_stems = [lancaster.stem(word) for word in tokens]
    porter_stems = [porter.stem(word) for word in tokens]
    return snowball_stems, lancaster_stems, porter_stems

# Apply stemming to DataFrame
df['Snowball Stems'], df['Lancaster Stems'], df['Porter Stems'] = zip(*df['Text'].apply(stem))

# Display the DataFrame
print(df[['Snowball Stems', 'Lancaster Stems', 'Porter Stems']])


                                                                                                     Snowball Stems  \
0  [the, qur'an, is, the, holi, book, of, islam, ., it, 's, written, in, arab, ,, and, muslim, recit, it, daili, .]   
1    [prophet, muhammad, (, pbuh, ), was, born, in, mecca, ., he, is, regard, as, the, last, prophet, of, islam, .]   
2                    [zakat, is, a, mandatori, act, of, chariti, in, islam, ,, aim, at, help, the, less, fortun, .]   
3                              [muslim, fast, dure, the, month, of, ramadan, to, cleans, their, bodi, and, soul, .]   
4         [hajj, is, an, annual, pilgrimag, to, mecca, that, everi, muslim, must, perform, onc, in, their, life, .]   

                                                                                                 Lancaster Stems  \
0    [the, qur'an, is, the, holy, book, of, islam, ., it, 's, writ, in, arab, ,, and, muslim, recit, it, dai, .]   
1  [prophet, muhammad, (, pbuh, ), was, born, in, mec


### 4. Stopwords Removal using lists from Gensim, spaCy, and comparing with NLTK
This step will demonstrate the differences in stopwords provided by Gensim, spaCy, and NLTK libraries.


In [85]:

from spacy.lang.en.stop_words import STOP_WORDS as SPACY_STOPWORDS
from gensim.parsing.preprocessing import STOPWORDS as GENSIM_STOPWORDS
import nltk
from nltk.corpus import stopwords

# NLTK stopwords (the corpus must be downloaded once)
nltk.download('stopwords', quiet=True)
nltk_stopwords = set(stopwords.words('english'))

# Compare stopword lists
print("spaCy stopwords:", SPACY_STOPWORDS)
print("Gensim stopwords:", GENSIM_STOPWORDS)
print("NLTK stopwords:", nltk_stopwords)


spaCy stopwords: {'hers', 'others', 'someone', 'thus', 'see', 'next', 'who', 'everything', 'amount', 'everyone', 'whatever', 'well', 'latterly', 'very', 'take', 'neither', 'may', 'beforehand', 'above', 'ten', 'either', "'ll", 'per', 'side', 'give', 'almost', 'is', 'few', 'much', 'just', 'below', 'in', 'it', 'while', 'across', 'whereupon', 'can', 'now', 'whereafter', 'mine', 'mostly', 'about', 'although', 'seeming', 'nine', 'enough', 'whence', 'are', "'ve", 'on', 'call', 'some', 'had', 'so', 'really', 'hereby', 'of', 'quite', 'we', 'not', 'namely', 'often', 'eight', 'many', 'along', 'thereupon', 'why', 'behind', 'every', 'move', 'somehow', 'at', 'us', 'serious', '‘ve', 'themselves', 'latter', 'during', 'less', 'twenty', 'both', 'unless', 'still', 'has', 'afterwards', 'hereupon', 'made', 'the', 'an', 'whither', 'become', 'further', '’ve', 'least', 'i', 'wherever', '‘ll', 'one', 'perhaps', 'somewhere', 'be', 'from', 'thence', 'been', 'with', 'fifteen', 'elsewhere', 'you', 'among', 'part',
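
Printing the full sets makes it hard to see which words are additional or missing in each library. A set-based comparison is one way to surface that. The sketch below is a minimal illustration: the function name `compare_stopword_lists` is an assumption, and the three small sets are toy data, not the real library lists.

```python
def compare_stopword_lists(lists):
    """For each named list, return the words that no other list contains."""
    unique = {}
    for name, words in lists.items():
        others = set().union(*(w for n, w in lists.items() if n != name))
        unique[name] = sorted(set(words) - others)
    return unique

# Toy sets for illustration only, NOT the real library lists
toy_lists = {
    "spaCy": {"the", "is", "thus", "ten"},
    "Gensim": {"the", "is", "thus", "computer"},
    "NLTK": {"the", "is", "doesn"},
}

shared = sorted(set.intersection(*toy_lists.values()))
unique = compare_stopword_lists(toy_lists)

print("Shared by all lists:", shared)
print("Unique to each list:", unique)
```

In the notebook, the same function could be called with `{"spaCy": SPACY_STOPWORDS, "Gensim": GENSIM_STOPWORDS, "NLTK": nltk_stopwords}`, and `len()` on each set gives the list sizes.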


### 5. Other Text Cleaning Techniques
This step applies additional text cleaning techniques: lowercasing, removing special characters, removing numbers, and expanding contractions.


In [86]:
import re
from contractions import fix

# Function to apply various text cleaning techniques
def clean_text(text):
    text = fix(text)  # Expand contractions first, while apostrophes are still present
    text = text.lower()  # Lowercase text
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove special characters
    return text

# Apply cleaning to DataFrame
df['Cleaned Text'] = df['Text'].apply(clean_text)

# Display the DataFrame
print(df[['Cleaned Text']])



                                                                              Cleaned Text
0  the quran is the holy book of islam it is written in arabic and muslims recite it daily
1      prophet muhammad pbuh was born in mecca he is regarded as the last prophet of islam
2         zakat is a mandatory act of charity in islam aimed at helping the less fortunate
3                  muslims fast during the month of ramadan to cleanse their body and soul
4  hajj is an annual pilgrimage to mecca that every muslim must perform once in their life



## Part 2: Feature Extraction Using N-grams
### 1. Unigram, Bigram, and Trigram Extraction
We extend the Bag of Words method to capture unigrams, bigrams, and trigrams using the CountVectorizer from scikit-learn.
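
To see what an n-gram is before handing the work to `CountVectorizer`, the sliding-window idea can be sketched in plain Python. The function name `word_ngrams` is an assumption; the regular expression mirrors `CountVectorizer`'s default token pattern (lowercased tokens of two or more word characters), and the sentence is one of the sample texts used in the cell below.

```python
import re

def word_ngrams(text, n):
    """Return the word n-grams of text as space-joined strings."""
    # Lowercase and keep tokens of two or more word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Slide a window of n tokens across the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Allah loves those who love humanity"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence with six tokens yields six unigrams, five bigrams, and four trigrams, so longer n-grams give fewer features per document but more distinct features across a corpus.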


In [87]:
import pandas as pd

# Sample dataset
data = {
    "Instance No": ["d1", "d2", "d3", "d4", "d5"],
    "Text": [
        "Allah is One. Allah Loves us.",
        "We Love Allah. Allah Loves me.",
        "Allah loves us. Allah created us.",
        "Allah loves those who love humanity",
        "Allah is the creator"
    ],
    "Gender": ["Female", "Male", "Female", "Male", "Female"]
}

# Creating a DataFrame
df = pd.DataFrame(data)

from sklearn.feature_extraction.text import CountVectorizer

# Initialize vectorizers
word_vectorizer = CountVectorizer(ngram_range=(1, 3))
char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 4))

# Fit and transform the data
X_word = word_vectorizer.fit_transform(df['Text'])
X_char = char_vectorizer.fit_transform(df['Text'])

# Print feature names
print("Word-level n-grams:", word_vectorizer.get_feature_names_out())
print("Character n-grams:", char_vectorizer.get_feature_names_out())



Word-level n-grams: ['allah' 'allah allah' 'allah allah loves' 'allah created'
 'allah created us' 'allah is' 'allah is one' 'allah is the' 'allah loves'
 'allah loves me' 'allah loves those' 'allah loves us' 'created'
 'created us' 'creator' 'humanity' 'is' 'is one' 'is one allah' 'is the'
 'is the creator' 'love' 'love allah' 'love allah allah' 'love humanity'
 'loves' 'loves me' 'loves those' 'loves those who' 'loves us'
 'loves us allah' 'me' 'one' 'one allah' 'one allah loves' 'the'
 'the creator' 'those' 'those who' 'those who love' 'us' 'us allah'
 'us allah created' 'we' 'we love' 'we love allah' 'who' 'who love'
 'who love humanity']
Character n-grams: [' a' ' al' ' all' ' c' ' cr' ' cre' ' h' ' hu' ' hum' ' i' ' is' ' is '
 ' l' ' lo' ' lov' ' m' ' me' ' me.' ' o' ' on' ' one' ' t' ' th' ' the'
 ' tho' ' u' ' us' ' us.' ' w' ' wh' ' who' '. ' '. a' '. al' 'ah' 'ah '
 'ah c' 'ah i' 'ah l' 'ah.' 'ah. ' 'al' 'all' 'alla' 'an' 'ani' 'anit'
 'at' 'ate' 'ated' 'ato' 'ator' 'cr' 'cr


### 2. Word and Character N-grams
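The word-level and character-level feature names were produced in the previous cell (`X_word` and `X_char`). The cell below goes one step further and builds a lemmatized word n-gram pipeline with TF-IDF weighting.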
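
The contrast between the two feature types can be sketched without scikit-learn. The snippet below is a rough approximation of `analyzer='char'` (lowercasing, collapsing whitespace, then sliding a window over the characters); the function name `char_ngrams` is an assumption, and the example uses a single sample sentence.

```python
def char_ngrams(text, min_n, max_n):
    """Return all character n-grams with min_n <= n <= max_n, shortest first."""
    text = " ".join(text.lower().split())  # lowercase and collapse whitespace
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

doc = "Allah is the creator"
word_features = sorted(set(doc.lower().split()))
char_features = sorted(set(char_ngrams(doc, 2, 4)))

print(len(word_features), "distinct words:", word_features)
print(len(char_features), "distinct character n-grams, for example:", char_features[:8])
```

Even for one short sentence there are far more character n-grams than words. Character features overlap across related word forms (for example `creat` is shared by "created" and "creator"), while word features keep each form separate.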


In [88]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
import spacy
from nltk.stem import PorterStemmer

# Loading spaCy model for lemmatization
nlp = spacy.load("en_core_web_sm")

def spacy_lemmatizer(text):
    return [token.lemma_ for token in nlp(text)]

# Porter stemmer helper; defined for comparison but not used in the pipeline below
def nltk_stemmer(text):
    stemmer = PorterStemmer()
    return ' '.join([stemmer.stem(word) for word in text.split()])

# Creating a pipeline with CountVectorizer and TfidfTransformer
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=spacy_lemmatizer, ngram_range=(1, 3))),
    ('tfidf', TfidfTransformer())
])

# Applying the pipeline
processed_features = pipeline.fit_transform(df['Text'])
print("Processed Text Feature Names:", pipeline.named_steps['vectorizer'].get_feature_names_out())


Processed Text Feature Names: ['.' '. allah' '. allah create' '. allah love' 'I' 'I .' 'allah' 'allah .'
 'allah . allah' 'allah be' 'allah be one' 'allah be the' 'allah create'
 'allah create we' 'allah love' 'allah love I' 'allah love those'
 'allah love we' 'be' 'be one' 'be one .' 'be the' 'be the creator'
 'create' 'create we' 'create we .' 'creator' 'humanity' 'love' 'love I'
 'love I .' 'love allah' 'love allah .' 'love humanity' 'love those'
 'love those who' 'love we' 'love we .' 'one' 'one .' 'one . allah' 'the'
 'the creator' 'those' 'those who' 'those who love' 'we' 'we .'
 'we . allah' 'we love' 'we love allah' 'who' 'who love'
 'who love humanity']





### 3. Vocabulary Size and Sparsity
Analyze how different n-gram models affect the vocabulary size and sparsity of the feature vectors. Sparsity indicates the proportion of zero values in the feature matrix.
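
Sparsity can also be computed by hand, which shows where the numbers come from: each document contributes one non-zero cell per distinct n-gram it contains, and the matrix has (number of documents) x (vocabulary size) cells. The sketch below uses plain Python on the five sample texts; the function name `ngram_matrix_stats` is an assumption, and the tokenization mimics `CountVectorizer`'s defaults.

```python
import re

docs = [
    "Allah is One. Allah Loves us.",
    "We Love Allah. Allah Loves me.",
    "Allah loves us. Allah created us.",
    "Allah loves those who love humanity",
    "Allah is the creator",
]

def ngram_matrix_stats(docs, n):
    """Return (vocabulary size, sparsity) of a word n-gram count matrix."""
    doc_ngrams = []
    for text in docs:
        tokens = re.findall(r"\b\w\w+\b", text.lower())
        doc_ngrams.append({" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})
    vocabulary = set().union(*doc_ngrams)
    # One non-zero cell per distinct n-gram per document
    non_zero = sum(len(grams) for grams in doc_ngrams)
    total = len(docs) * len(vocabulary)
    return len(vocabulary), 1 - non_zero / total

for n in (1, 2, 3):
    size, sparsity = ngram_matrix_stats(docs, n)
    print(f"n={n}: vocabulary size {size}, sparsity {sparsity:.4f}")
```

For unigrams this gives 24 non-zero cells out of 5 x 14 = 70, so the sparsity is 1 - 24/70, roughly 0.657. Sparsity rises with n because longer n-grams are rarely shared between documents.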


In [89]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample dataset
data = {
    "Instance No": ["d1", "d2", "d3", "d4", "d5"],
    "Text": [
        "Allah is One. Allah Loves us.",
        "We Love Allah. Allah Loves me.",
        "Allah loves us. Allah created us.",
        "Allah loves those who love humanity",
        "Allah is the creator"
    ],
    "Gender": ["Female", "Male", "Female", "Male", "Female"]
}

# Creating a DataFrame
df = pd.DataFrame(data)

# Initialize vectorizers for unigram, bigram, and trigram
unigram_vectorizer = CountVectorizer(ngram_range=(1, 1))
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
trigram_vectorizer = CountVectorizer(ngram_range=(3, 3))

# Fit and transform the data from the DataFrame
X_unigram = unigram_vectorizer.fit_transform(df['Text'])
X_bigram = bigram_vectorizer.fit_transform(df['Text'])
X_trigram = trigram_vectorizer.fit_transform(df['Text'])

def calculate_sparsity(X):
    non_zero_elements = X.nnz
    total_elements = X.shape[0] * X.shape[1]
    sparsity = 1 - (non_zero_elements / total_elements)
    return sparsity

# Output the vocabulary sizes and sparsity
print("Unigram Vocabulary Size:", len(unigram_vectorizer.get_feature_names_out()))
print("Unigram Sparsity:", calculate_sparsity(X_unigram))
print("Bigram Vocabulary Size:", len(bigram_vectorizer.get_feature_names_out()))
print("Bigram Sparsity:", calculate_sparsity(X_bigram))
print("Trigram Vocabulary Size:", len(trigram_vectorizer.get_feature_names_out()))
print("Trigram Sparsity:", calculate_sparsity(X_trigram))


Unigram Vocabulary Size: 14
Unigram Sparsity: 0.6571428571428571
Bigram Vocabulary Size: 18
Bigram Sparsity: 0.7444444444444445
Trigram Vocabulary Size: 17
Trigram Sparsity: 0.788235294117647



### Bonus (Optional)
Implement Lemmatization and Stemming in the feature extraction step to observe their impact on the resulting n-grams. Explore preprocessing pipelines combining multiple steps (e.g., using spaCy pipelines or scikit-learn Pipelines).


In [90]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import spacy
from nltk.stem import PorterStemmer
import pandas as pd

# Sample dataset
data = {
    "Instance No": ["d1", "d2", "d3", "d4", "d5"],
    "Text": [
        "Allah is One. Allah Loves us.",
        "We Love Allah. Allah Loves me.",
        "Allah loves us. Allah created us.",
        "Allah loves those who love humanity",
        "Allah is the creator"
    ],
    "Gender": ["Female", "Male", "Female", "Male", "Female"]
}

# Creating a DataFrame
df = pd.DataFrame(data)

# Load spacy model
nlp = spacy.load("en_core_web_sm")

# Define lemmatizer using spaCy
def spacy_lemmatizer(text):
    return [token.lemma_ for token in nlp(text)]

# Define stemmer using NLTK
def nltk_stemmer(text):
    stemmer = PorterStemmer()
    return ' '.join([stemmer.stem(word) for word in text.split()])

# Create a pipeline with CountVectorizer and TfidfTransformer
# Note: using spacy_lemmatizer as tokenizer; adjust as necessary for nltk_stemmer or other preprocessing
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=spacy_lemmatizer, ngram_range=(1, 3))),
    ('tfidf', TfidfTransformer())
])

# Fit and transform the pipeline on the text data from the DataFrame
processed_features = pipeline.fit_transform(df['Text'])

# Print the feature names extracted by the vectorizer
feature_names = pipeline.named_steps['vectorizer'].get_feature_names_out()
print("Processed Text Feature Names:", feature_names)


Processed Text Feature Names: ['.' '. allah' '. allah create' '. allah love' 'I' 'I .' 'allah' 'allah .'
 'allah . allah' 'allah be' 'allah be one' 'allah be the' 'allah create'
 'allah create we' 'allah love' 'allah love I' 'allah love those'
 'allah love we' 'be' 'be one' 'be one .' 'be the' 'be the creator'
 'create' 'create we' 'create we .' 'creator' 'humanity' 'love' 'love I'
 'love I .' 'love allah' 'love allah .' 'love humanity' 'love those'
 'love those who' 'love we' 'love we .' 'one' 'one .' 'one . allah' 'the'
 'the creator' 'those' 'those who' 'those who love' 'we' 'we .'
 'we . allah' 'we love' 'we love allah' 'who' 'who love'
 'who love humanity']


# **Practice Code:**

In [91]:
!pip install nltk



# Import the Required Libraries

In [92]:
pip install contractions




In [93]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
import string
import re
import contractions

# Download necessary NLTK data if not already available

In [94]:

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Printing the stopwords in English
stop_words = set(stopwords.words('english'))
print("Stop words:", stop_words)


Stop words: {'ll', 'doesn', 'hers', 'why', 'do', 'those', 'for', 'd', 'into', 'when', 'having', 'through', 'all', 'here', 'who', 'at', 'am', 'or', 'a', 'ma', 'too', "hasn't", "mustn't", 'once', 'as', 'themselves', 'me', 'before', 'very', 'but', "she's", 'which', 'during', 'how', "you'd", "isn't", 'wouldn', 'didn', 'wasn', 'both', 'no', 'has', 'above', 'the', 'their', 'doing', 'an', "weren't", 'himself', 'm', 'further', 's', 'than', 'ours', "didn't", 'i', 'her', "you'll", 'whom', 'own', 'hadn', 'off', 'couldn', 't', "couldn't", 'aren', 'these', 'any', 'is', 'few', "won't", 'be', 'from', 'been', 'was', 'yours', 'just', "you've", 'with', 'below', 'in', 'mustn', 'nor', "haven't", 'it', 'while', 'you', 'herself', 'he', 'most', 'hasn', 've', 'our', 'itself', 'they', 'if', "aren't", 'have', 'can', 'she', 'his', 'now', 'y', 'them', 'other', 'my', "mightn't", "needn't", 'needn', 'same', 'does', 'where', 'weren', 'each', 'until', "shan't", 'its', 'then', 'should', 'and', 'being', 'down', "doesn'

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Create a Dummy Dataset

In [95]:
# Create the DataFrame
data = {
    'ID': [1, 2, 3, 4, 5],
    'Text': [
        "The Qur'an is the holy book of Islam. It's written in Arabic, and Muslims recite it daily.",
        "Prophet Muhammad (PBUH) was born in Mecca. He is regarded as the last prophet of Islam.",
        "Zakat is a mandatory act of charity in Islam, aimed at helping the less fortunate.",
        "Muslims fast during the month of Ramadan to cleanse their body and soul.",
        "Hajj is an annual pilgrimage to Mecca that every Muslim must perform once in their life."
    ]
}

# Adjusting pandas options to display full content in the dataframe
pd.set_option('display.max_colwidth', None)

df = pd.DataFrame(data)
df

| | ID | Text |
|---|---|---|
| 0 | 1 | The Qur'an is the holy book of Islam. It's written in Arabic, and Muslims recite it daily. |
| 1 | 2 | Prophet Muhammad (PBUH) was born in Mecca. He is regarded as the last prophet of Islam. |
| 2 | 3 | Zakat is a mandatory act of charity in Islam, aimed at helping the less fortunate. |
| 3 | 4 | Muslims fast during the month of Ramadan to cleanse their body and soul. |
| 4 | 5 | Hajj is an annual pilgrimage to Mecca that every Muslim must perform once in their life. |


# Create the Functions of Data Preprocessing

In [96]:
# Initialize Lemmatizer and Stemmer
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Function 1: Expanding contractions
def expand_contractions(text):
    return contractions.fix(text)

# Function 2: Lowercasing text
def lowercase_text(text):
    return text.lower()

# Function 3: Tokenization
def tokenize_text(text):
    return word_tokenize(text)

# Function 4: Removing punctuation
def remove_punctuation(tokens):
    # Drop tokens made up entirely of punctuation characters (e.g. "." or "...")
    return [word for word in tokens if not all(ch in string.punctuation for ch in word)]

# Function 5: Removing numbers
def remove_numbers(tokens):
    return [word for word in tokens if not word.isdigit()]

# Function 6: Removing special characters
def remove_special_characters(tokens):
    # Strip non-alphanumeric characters, then drop tokens that end up empty
    cleaned = [re.sub(r'[^A-Za-z0-9]+', '', word) for word in tokens]
    return [word for word in cleaned if word]

# Function 7: Removing stopwords
def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    return [word for word in tokens if word not in stop_words]

# Function 8: Lemmatization
def lemmatize_text(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]

# Function 9: Stemming
def stem_text(tokens):
    return [stemmer.stem(word) for word in tokens]

# Function 10: Normalization (e.g., converting currency symbols to words)
def normalize_text(tokens):
    return [re.sub(r'\$', 'dollar', word) for word in tokens]

# Function 11: Text standardization (e.g., standardizing variations of "U.S.A." to "USA")
def standardize_text(tokens):
    return [re.sub(r'u\.s\.a\.', 'USA', word) for word in tokens]



# Main function to call all preprocessing steps

In [97]:

# Main function to call preprocessing step-by-step
def preprocess_text(text, use_stemming=False):
    # Step 1: Expand contractions
    text = expand_contractions(text)

    # Step 2: Lowercase the text
    text = lowercase_text(text)

    # Step 3: Tokenization
    tokens = tokenize_text(text)

    # Step 4: Normalization (must run before '$' is stripped as punctuation)
    tokens = normalize_text(tokens)

    # Step 5: Standardization (must run before the dots are stripped)
    tokens = standardize_text(tokens)

    # Step 6: Remove punctuation
    tokens = remove_punctuation(tokens)

    # Step 7: Remove numbers
    tokens = remove_numbers(tokens)

    # Step 8: Remove special characters
    tokens = remove_special_characters(tokens)

    # Step 9: Remove stopwords
    tokens = remove_stopwords(tokens)

    # Step 10: Lemmatization
    tokens = lemmatize_text(tokens)

    # Step 11: Stemming (optional, applied on top of the lemmas)
    if use_stemming:
        tokens = stem_text(tokens)

    # Join tokens back to string
    processed_text = ' '.join(tokens)

    return processed_text



# Apply preprocessing function to the dataframe with and without stemming
df['Processed_Text_Lemmatization'] = df['Text'].apply(lambda x: preprocess_text(x, use_stemming=False))
df['Processed_Text_Stemming'] = df['Text'].apply(lambda x: preprocess_text(x, use_stemming=True))

df

| | ID | Text | Processed_Text_Lemmatization | Processed_Text_Stemming |
|---|---|---|---|---|
| 0 | 1 | The Qur'an is the holy book of Islam. It's written in Arabic, and Muslims recite it daily. | quran holy book islam written arabic muslim recite daily | quran holi book islam written arab muslim recit daili |
| 1 | 2 | Prophet Muhammad (PBUH) was born in Mecca. He is regarded as the last prophet of Islam. | prophet muhammad pbuh born mecca regarded last prophet islam | prophet muhammad pbuh born mecca regard last prophet islam |
| 2 | 3 | Zakat is a mandatory act of charity in Islam, aimed at helping the less fortunate. | zakat mandatory act charity islam aimed helping le fortunate | zakat mandatori act chariti islam aim help le fortun |
| 3 | 4 | Muslims fast during the month of Ramadan to cleanse their body and soul. | muslim fast month ramadan cleanse body soul | muslim fast month ramadan cleans bodi soul |
| 4 | 5 | Hajj is an annual pilgrimage to Mecca that every Muslim must perform once in their life. | hajj annual pilgrimage mecca every muslim must perform life | hajj annual pilgrimag mecca everi muslim must perform life |


# Feature Extraction using Bag of Words Method (Frequency Count of Words / Count Vectorizer)

In [98]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample dataset
data = {
    "Instance No": ["d1", "d2", "d3", "d4", "d5"],
    "Text": [
        "Allah is One. Allah Loves us.",
        "We Love Allah. Allah Loves me.",
        "Allah loves us. Allah created us.",
        "Allah loves those who love humanity",
        "Allah is the creator"
    ],
    "Gender": ["Female", "Male", "Female", "Male", "Female"]
}

# Creating a DataFrame
df = pd.DataFrame(data)

def count_vectorize(dataframe, ngram_range=(1,1)):
    # Initialize count vectorizer with the specified ngram range.
    # stop_words='english' removes common words such as 'is', 'the', and 'us' from the vocabulary
    vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words='english')
    # Fit and transform the text data
    X = vectorizer.fit_transform(dataframe['Text'])

    # Create a DataFrame from the feature matrix
    feature_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

    # Print vocabulary and total size
    vocabulary = vectorizer.get_feature_names_out()
    total_size = X.shape[1]
    print("Vocabulary:", vocabulary)
    print("Total Size of Vocabulary (Number of Features):", total_size)

    # Return the dataframe
    return feature_df

# Example of usage
result_df = count_vectorize(df, ngram_range=(1,1))  # Unigram vectorization
print("\nFeature Matrix:\n")
result_df

Vocabulary: ['allah' 'created' 'creator' 'humanity' 'love' 'loves']
Total Size of Vocabulary (Number of Features): 6

Feature Matrix:



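| | allah | created | creator | humanity | love | loves |
|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 0 | 0 | 0 | 1 | 1 |
| 2 | 2 | 1 | 0 | 0 | 0 | 1 |
| 3 | 1 | 0 | 0 | 1 | 1 | 1 |
| 4 | 1 | 0 | 1 | 0 | 0 | 0 |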


# Feature Extraction using TF-IDF (TfidfVectorizer)

In [99]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample dataset
data = {
    "Instance No": ["d1", "d2", "d3", "d4", "d5"],
    "Text": [
        "Allah is One. Allah Loves us.",
        "We Love Allah. Allah Loves me.",
        "Allah loves us. Allah created us.",
        "Allah loves those who love humanity",
        "Allah is the creator"
    ],
    "Gender": ["Female", "Male", "Female", "Male", "Female"]
}

# Creating a DataFrame
df = pd.DataFrame(data)

def tfidf_vectorize(dataframe, ngram_range=(1, 1)):
    # Initialize TF-IDF vectorizer with the specified ngram range
    vectorizer = TfidfVectorizer(ngram_range=ngram_range, stop_words=None)
    # Fit and transform the text data
    X = vectorizer.fit_transform(dataframe['Text'])

    # Create a DataFrame from the feature matrix
    feature_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

    # Print vocabulary and total size
    vocabulary = vectorizer.get_feature_names_out()
    total_size = X.shape[1]
    print("Vocabulary:", vocabulary)
    print("Total Size (Number of Features):", total_size)

    # Return the dataframe
    return feature_df

# Example of usage
result_df = tfidf_vectorize(df, ngram_range=(1, 1))  # Unigram vectorization
print("\nTF-IDF Feature Matrix:\n")
result_df

Vocabulary: ['allah' 'created' 'creator' 'humanity' 'is' 'love' 'loves' 'me' 'one'
 'the' 'those' 'us' 'we' 'who']
Total Size (Number of Features): 14

TF-IDF Feature Matrix:



| | allah | created | creator | humanity | is | love | loves | me | one | the | those | us | we | who |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.507419 | 0.0 | 0.0 | 0.0 | 0.429567 | 0.0 | 0.299966 | 0.0 | 0.532438 | 0.0 | 0.0 | 0.429567 | 0.0 | 0.0 |
| 1 | 0.484033 | 0.0 | 0.0 | 0.0 | 0.0 | 0.40977 | 0.286142 | 0.507899 | 0.0 | 0.0 | 0.0 | 0.0 | 0.507899 | 0.0 |
| 2 | 0.433667 | 0.455049 | 0.0 | 0.0 | 0.0 | 0.0 | 0.256367 | 0.0 | 0.0 | 0.0 | 0.0 | 0.734261 | 0.0 | 0.0 |
| 3 | 0.232639 | 0.0 | 0.0 | 0.488219 | 0.0 | 0.393892 | 0.275054 | 0.0 | 0.0 | 0.0 | 0.488219 | 0.0 | 0.0 | 0.488219 |
| 4 | 0.280882 | 0.0 | 0.589463 | 0.0 | 0.475575 | 0.0 | 0.0 | 0.0 | 0.0 | 0.589463 | 0.0 | 0.0 | 0.0 | 0.0 |
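
The weights in the last row (document d5, "Allah is the creator") can be reproduced by hand, which helps explain why a word that appears in every document, such as "allah", receives the lowest weight. The sketch below assumes `TfidfVectorizer`'s default settings (smoothed IDF and L2 row normalisation); the document frequencies are counted from the five sample texts, and the variable names are introduced only for illustration.

```python
import math

n_docs = 5
# Number of sample texts that contain each term of "Allah is the creator"
doc_freq = {"allah": 5, "is": 2, "the": 1, "creator": 1}
# Each term occurs once in this document
term_counts = {"allah": 1, "is": 1, "the": 1, "creator": 1}

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

# Raw tf-idf scores, then L2 normalisation of the row
raw = {term: term_counts[term] * idf[term] for term in term_counts}
norm = math.sqrt(sum(value * value for value in raw.values()))
weights = {term: value / norm for term, value in raw.items()}

for term, weight in weights.items():
    print(f"{term}: {weight:.6f}")
```

"allah" occurs in all five texts, so its IDF is ln(6/6) + 1 = 1, the minimum possible, while "the" and "creator" occur in only one text each and receive the highest weights in the row.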
