## 📘 TF-IDF and N-grams: Feature Extraction for Text

This notebook demonstrates how to convert a text corpus into numerical features using **TF-IDF (Term Frequency–Inverse Document Frequency)** and **N-grams**.
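
To make the weighting concrete before using the library, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. The three documents are hypothetical and are not taken from the spam dataset. The smoothed IDF formula and the L2 normalization are assumptions chosen to mirror scikit-learn's documented defaults; this is an illustration, not the library's implementation.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "free entry win cup",
    "ok lar joking",
    "free call win prize",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Return {term: weight} for one document, L2-normalized."""
    tf = Counter(tokens)
    # Assumed smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, rows):
    print(doc, {term: round(w, 3) for term, w in row.items()})
```

In the first document, "entry" occurs in one document and "free" in two, so "entry" gets the larger weight: rarer terms are treated as more informative.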

In [2]:
# Import the pandas library for data manipulation and analysis, especially for working with DataFrames
import pandas as pd

In [3]:
# Read the CSV file 'spamclassification.csv' from the 'Datasets' folder using pandas
# Specifying encoding='latin1' to avoid UnicodeDecodeError with special characters
messages = pd.read_csv('Datasets/spamclassification.csv', encoding='latin1')

# Display the first 5 rows of the DataFrame to get a quick look at the data
messages.head()

| Unnamed: 0 | Label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Data Cleaning and Preprocessing

In [4]:
# Import the regular expression module for basic text cleaning
import re

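# Import the Natural Language Toolkit (nltk) library for text processing
import nltk

# Download the stopword list and WordNet data if they are not already installed
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)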

# Import a list of common English stopwords (e.g., "and", "is", "with") from NLTK
from nltk.corpus import stopwords

# Import the WordNet Lemmatizer from NLTK for reducing words to their base/dictionary form
from nltk.stem import WordNetLemmatizer

# Initialize the lemmatizer object
lemmatizer = WordNetLemmatizer()

In [5]:
corpus = []  # Initialize an empty list to store the cleaned and processed text data

# Build the stopword set once; this avoids reloading the list for every word
stop_words = set(stopwords.words('english'))

for i in range(len(messages)):
    # Replace every non-alphabetic character in the message text with a space
    review = re.sub('[^a-zA-Z]', ' ', messages['text'][i])

    # Convert all characters to lowercase
    review = review.lower()

    # Split the sentence into individual words (tokenization)
    review = review.split()

    # Lemmatize each word and remove stopwords
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]

    # Join the processed words back into a single string
    review = ' '.join(review)

    # Append the cleaned review to the corpus list
    corpus.append(review)

corpus[:5]

['go jurong point crazy available bugis n great world la e buffet cine got amore wat',
 'ok lar joking wif u oni',
 'free entry wkly comp win fa cup final tkts st may text fa receive entry question std txt rate c apply',
 'u dun say early hor u c already say',
 'nah think go usf life around though']
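
The cleaning steps can also be tried on a single message without loading the dataset or NLTK. The sketch below is a simplified stand-in: `clean` and `STOP_WORDS` are hypothetical names, the stopword set is a tiny hand-picked sample rather than NLTK's English list, and lemmatization is left out.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# the notebook itself uses NLTK's full English stopword list.
STOP_WORDS = {"i", "he", "to", "so", "then", "until", "only", "in", "a"}

def clean(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)

print(clean("Ok lar... Joking wif u oni..."))
# ok lar joking wif u oni
```

Digits and punctuation become spaces, so a token such as "2" disappears entirely rather than being kept as a feature.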

## Create TF-IDF

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Initialize the TF-IDF Vectorizer
# No max_features is set here, so the vocabulary keeps every term found in the corpus
tfidf = TfidfVectorizer()

# Fit and transform the cleaned text corpus
X = tfidf.fit_transform(corpus).toarray()

In [7]:
# Convert to DataFrame for better readability
df_tfidf = pd.DataFrame(X, columns=tfidf.get_feature_names_out())

# Display the result
print("TF-IDF Feature Matrix:\n")
print(df_tfidf.round(3))  # Round for better viewing


TF-IDF Feature Matrix:

       aa  aah  aaniye  aaooooright  aathi   ab  abbey  abdomen  abeg  abel  \
0     0.0  0.0     0.0          0.0    0.0  0.0    0.0      0.0   0.0   0.0   
1     0.0  0.0     0.0          0.0    0.0  0.0    0.0      0.0   0.0   0.0   
2     0.0  0.0     0.0          0.0    0.0  0.0    0.0      0.0   0.0   0.0   
3     0.0  0.0     0.0          0.0    0.0  0.0    0.0      0.0   0.0   0.0   
4     0.0  0.0     0.0          0.0    0.0  0.0    0.0      0.0   0.0   0.0   
...   ...  ...     ...          ...    ...  ...    ...      ...   ...   ...   
5569  0.0  0.0     0.0          0.0    0.0  0.0    0.0      0.0   0.0   0.0   
5570  0.0  0.0     0.0          0.0    0.0  0.0    0.0      0.0   0.0   0.0   
5571  0.0  0.0     0.0          0.0    0.0  0.0    0.0      0.0   0.0   0.0   
5572  0.0  0.0     0.0          0.0    0.0  0.0    0.0      0.0   0.0   0.0   
5573  0.0  0.0     0.0          0.0    0.0  0.0    0.0      0.0   0.0   0.0   

      ...  zed  zero   zf  

## Create TF-IDF using N-Grams

In [13]:

# Initialize the TF-IDF Vectorizer for bigrams only: ngram_range=(2, 2)
# max_features=100 keeps only the 100 bigrams with the highest frequency across the corpus
tfidf = TfidfVectorizer(ngram_range=(2, 2), max_features=100)

# Fit and transform the corpus into a TF-IDF-weighted matrix of bigrams
X = tfidf.fit_transform(corpus).toarray()

# Inspect a few (bigram, column index) pairs from the learned vocabulary
list(tfidf.vocabulary_.items())[:5]

[('free entry', 31),
 ('claim call', 16),
 ('call claim', 3),
 ('free call', 30),
 ('chance win', 15)]
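
To show where vocabulary entries such as 'free entry' come from, here is a minimal sketch of word-level n-gram extraction. `word_ngrams` is a hypothetical helper, not part of scikit-learn. It splits on whitespace, whereas `TfidfVectorizer`'s default tokenizer also ignores single-character tokens, so results on real messages can differ slightly.

```python
def word_ngrams(text, n=2):
    """Return the list of contiguous n-word sequences in text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("free entry wkly comp win fa cup"))
# ['free entry', 'entry wkly', 'wkly comp', 'comp win', 'win fa', 'fa cup']
```

A message with fewer than `n` words yields no n-grams at all, which is one reason many rows in a bigram matrix are entirely zero.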

In [14]:
# Convert to a DataFrame for clearer display
df_tfidf_bigrams = pd.DataFrame(X, columns=tfidf.get_feature_names_out())

# Display the TF-IDF matrix (rounded for readability)
print("TF-IDF Matrix (Bigrams):\n")
print(df_tfidf_bigrams.round(3))

TF-IDF Matrix (Bigrams):

      account statement  attempt contact  await collection  call claim  \
0                   0.0              0.0               0.0         0.0   
1                   0.0              0.0               0.0         0.0   
2                   0.0              0.0               0.0         0.0   
3                   0.0              0.0               0.0         0.0   
4                   0.0              0.0               0.0         0.0   
...                 ...              ...               ...         ...   
5569                0.0              0.0               0.0         0.0   
5570                0.0              0.0               0.0         0.0   
5571                0.0              0.0               0.0         0.0   
5572                0.0              0.0               0.0         0.0   
5573                0.0              0.0               0.0         0.0   

      call customer  call identifier  call land  call landline  call later  \
0      