<a href="https://colab.research.google.com/github/MNourMoslem/Simple-Text-Classifier/blob/master/withTf_Idf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is TF-IDF?
TF-IDF stands for Term Frequency-Inverse Document Frequency, a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It combines two key concepts:

## Term Frequency (TF):

TF measures how frequently a term appears in a document. The assumption is that the more a word appears in a document, the more important it is.
The formula for Term Frequency is:

$\large TF(t,d) = \frac{\text{Number of times term t appears in document d}}{\text{Number of terms in document d}}$
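
As a small illustration of the formula above, here is a minimal sketch in plain Python. The function name `term_frequency` and the whitespace tokenizer are assumptions made for this example only; real tokenizers are usually more careful.

```python
from collections import Counter

def term_frequency(term, document):
    """Return (occurrences of term in document) / (total tokens in document)."""
    tokens = document.lower().split()  # naive whitespace tokenizer (assumption)
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)

doc = "the cat sat on the mat"
print(term_frequency("the", doc))  # "the" is 2 of the 6 tokens
print(term_frequency("cat", doc))  # "cat" is 1 of the 6 tokens
```

Note that a very common word like "the" gets the highest TF here, which is why TF alone is not enough to find informative words.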


## Inverse Document Frequency (IDF):

IDF measures how important a term is in the entire corpus. If a term appears in many documents, its IDF is lower, implying that it's a common word and less important.
The formula for Inverse Document Frequency is:

$\large IDF(t,D) = \log\left(\frac{\text{Total number of documents in corpus D}}{\text{Number of documents in D containing term t}}\right)$
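
To make the idea concrete, the sketch below computes IDF with the common definition: the logarithm of the total number of documents divided by the number of documents that contain the term. The function name, the toy corpus, and the choice to return 0 for unseen terms are assumptions for illustration.

```python
import math

def inverse_document_frequency(term, corpus):
    """Return log(N / df), where df counts the documents that contain the term."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc.lower().split())
    if doc_freq == 0:
        return 0.0  # convention chosen here for unseen terms (assumption)
    return math.log(n_docs / doc_freq)

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
print(inverse_document_frequency("the", corpus))   # appears in all 3 documents -> log(1) = 0.0
print(inverse_document_frequency("bird", corpus))  # appears in 1 of 3 documents -> log(3)
```

A word that appears in every document gets an IDF of zero, while a word found in only one document gets the largest value.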

## Combining TF and IDF:

The TF-IDF score is the product of Term Frequency and Inverse Document Frequency:

$\text{TF-IDF}(t,d,D) = TF(t,d) \times IDF(t,D)$
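
The product can be sketched in a few lines of plain Python. This is a self-contained toy version, assuming whitespace tokenization and IDF defined as log(N / number of documents containing the term); the function name `tf_idf` and the corpus are made up for the example.

```python
import math
from collections import Counter

def tf_idf(term, document, corpus):
    """Return TF(term, document) * IDF(term, corpus) using naive whitespace tokens."""
    tokens = document.lower().split()
    tf = Counter(tokens)[term] / len(tokens) if tokens else 0.0
    doc_freq = sum(1 for doc in corpus if term in doc.lower().split())
    idf = math.log(len(corpus) / doc_freq) if doc_freq else 0.0
    return tf * idf

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
for term in ("the", "cat", "mat"):
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

In the first document, "the" scores 0 because it occurs in every document, while "mat" scores higher than "cat" because it occurs in fewer documents.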

This score helps to highlight words that are important to a specific document but not common across the entire corpus, making them more useful for tasks like text classification, search engines, and information retrieval.

## Why Use TF-IDF?

TF-IDF is widely used in natural language processing tasks because it balances the frequency of words with their uniqueness across documents. It helps to identify keywords that are not only frequent within a document but also provide significant differentiation between documents in a corpus.

In the context of a text classifier, TF-IDF is particularly useful because it allows the model to focus on words that are most informative for distinguishing between different classes, rather than being distracted by common words that appear frequently but contribute little to classification.

In [24]:
'''
Before we start, we clone the TextCleaner module from the repository linked below.
It will help us clean and normalize the text in the dataset.
'''
!git clone https://github.com/MNourMoslem/TextCleaner

fatal: destination path 'TextCleaner' already exists and is not an empty directory.


# Twitter Spam Dataset

## Import The Data

In [25]:
"""
The dataset we are going to work with contains a collection of tweets labeled as ham or spam.
Our goal is to train our model on this dataset so it can predict whether a tweet is spam or not.
(There is additional information about why they are called spam and ham at the end of this file)
"""

import pandas as pd

data = pd.read_csv("https://raw.githubusercontent.com/MNourMoslem/Simple-Text-Classifier/master/dataset/spam.tsv", delimiter='\t')

rand_idx = 3401
rand_message = data['message'][rand_idx]
rand_message

'As a valued customer, I am pleased to advise you that following recent review of your Mob No. you are awarded with a £1500 Bonus Prize, call 09066364589'

## Create a helper function to strip unnecessary text such as emails, tags, and URLs

In [63]:
from TextCleaner import TextCleaner

def clean_text(text : str):
  text = TextCleaner.remove_hashtags(text)
  text = TextCleaner.remove_emails(text)
  text = TextCleaner.remove_urls(text)
  text = TextCleaner.remove_tags(text)
  text = TextCleaner.remove_repeated(text) # Collapses characters repeated more than 2 times into a single character (`helloooooo` -> `hello`)
  text = TextCleaner.remove_non_alphanumerics(text)
  text = TextCleaner.remove_numbers(text)
  text = TextCleaner.remove_wide_spaces(text) # Collapses runs of multiple spaces into a single space ('     ' -> ' ')
  return text.strip()

sample_text1 = "@mnmoslem the last update of clash royal is awful i sent to them on this email clashroyalemail@supercell.com"
sample_text2 = "you could win 10000$ cash from only clicking on this link www.somelink.com/winnerwinnerchickendinner"
sample_text3 = "this text  haaaas   a lot of     sssspaces  and reeeepeated  charechters"
cleaned1 = clean_text(sample_text1)
cleaned2 = clean_text(sample_text2)
cleaned3 = clean_text(sample_text3)
print(cleaned1)
print(cleaned2)
print(cleaned3)

the last update of clash royal is awful i sent to them on this email
you could win cash from only clicking on this link
this text has a lot of spaces and repeated charechters
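
The internals of TextCleaner are not shown in this notebook, so the following is only a rough approximation of the same steps using the standard `re` module. The function name `clean_text_sketch` and every regular expression in it are assumptions for illustration, not TextCleaner's actual implementation.

```python
import re

def clean_text_sketch(text):
    """Approximate the cleaning steps above with plain regular expressions."""
    text = re.sub(r"\S+@\S+", " ", text)               # emails (before mentions, so the domain is removed too)
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)               # @tags and #hashtags
    text = re.sub(r"(.)\1{2,}", r"\1", text)           # 3 or more repeats of a character -> one
    text = re.sub(r"[^A-Za-z\s]", " ", text)           # anything that is not a letter or whitespace
    text = re.sub(r"\s+", " ", text)                   # collapse runs of whitespace
    return text.strip()

print(clean_text_sketch("this text  haaaas   a lot of     sssspaces  and reeeepeated  charechters"))
```

The order matters in this sketch: emails are removed before `@tags`, otherwise the tag pattern would strip only part of the address.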


In [27]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    5572 non-null   object
 1   message  5572 non-null   object
 2   length   5572 non-null   int64 
 3   punct    5572 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 174.2+ KB


## Check the shape of the data

In [28]:
ham = data.loc[data['label'] == 'ham']
spam = data.loc[data['label'] == 'spam']
print("Shape of Ham:", ham.shape)
print("Shape of Spam:", spam.shape)

Shape of Ham: (4825, 4)
Shape of Spam: (747, 4)


In [29]:
# The classes are imbalanced (4825 ham vs. 747 spam), so we undersample ham to match spam.

ham = ham.sample(len(spam)) # draw n random ham rows, where n is the number of spam rows
print("Shape of Ham:", ham.shape)
new_data = pd.concat((ham, spam), axis = 0) # combine both classes into one balanced DataFrame
print("Shape of New Data:", new_data.shape)

Shape of Ham: (747, 4)
Shape of New Data: (1494, 4)
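
The balancing step is random undersampling: keep every example of the smaller class and draw an equally sized random subset of the larger one. The sketch below shows the idea with the standard library only; the function name `undersample`, the integer stand-ins for rows, and the seed value are assumptions for illustration.

```python
import random

def undersample(majority, minority, seed=0):
    """Return a balanced list: a random subset of majority plus all of minority."""
    rng = random.Random(seed)  # fixed seed so the same subset is drawn every run
    return rng.sample(majority, len(minority)) + list(minority)

ham_ids = list(range(4825))               # stand-ins for the 4825 ham rows
spam_ids = list(range(4825, 4825 + 747))  # stand-ins for the 747 spam rows
balanced = undersample(ham_ids, spam_ids, seed=256)
print(len(balanced))  # 747 + 747 = 1494
```

In pandas, passing `random_state` to `DataFrame.sample` serves the same purpose as the seed here and makes the drawn subset reproducible.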


In [30]:
new_data['message'] = new_data['message'].apply(clean_text)

x = new_data['message'].values # gets the tweets
y = (new_data['label'].values == 'spam').astype(int) # gets the labels and converts them from (ham, spam) to (0, 1)

print("Shape of X: ", x.shape)
print("Shape of Y: ", y.shape)

Shape of X:  (1494,)
Shape of Y:  (1494,)


## Import TfidfVectorizer from Sklearn and fit it to the data

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
new_x = tfidf.fit_transform(x)

print("Shape of New_X:", new_x.shape)
print("Number of vocab:", len(tfidf.vocabulary_))

Shape of New_X: (1494, 3933)
Number of vocab: 3933
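
With its default settings, `TfidfVectorizer` does not use the plain textbook formula. It uses raw term counts for TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each document vector to unit L2 length. The sketch below reproduces that weighting in plain Python; the function name `sklearn_style_tfidf`, the toy corpus, and the whitespace tokenizer are assumptions (the real vectorizer uses its own token pattern).

```python
import math
from collections import Counter

def sklearn_style_tfidf(corpus):
    """Return one {term: weight} dict per document, using smoothed IDF and L2 normalization."""
    docs = [doc.lower().split() for doc in corpus]  # naive tokenizer (assumption)
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in docs:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0  # avoid dividing by zero
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
rows = sklearn_style_tfidf(corpus)
print(rows[2])
```

Because of the added 1, a word that appears in every document still gets a small nonzero weight instead of being zeroed out.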


## Split the data into training and test sets

In [33]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(new_x, y, test_size = 0.2, shuffle = True, random_state = 256)

## Import The Support Vector Classifier from Sklearn and train the model

In [34]:
from sklearn.svm import SVC

svc = SVC()
svc.fit(x_train, y_train)
preds = svc.predict(x_test)

## Check the results

In [35]:
from sklearn.metrics import classification_report

cr = classification_report(y_test, preds)
print(cr)

              precision    recall  f1-score   support

           0       0.88      0.99      0.93       140
           1       0.99      0.89      0.93       159

    accuracy                           0.93       299
   macro avg       0.94      0.94      0.93       299
weighted avg       0.94      0.93      0.93       299
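
For readers unfamiliar with the columns in this report, the sketch below shows how precision, recall, and F1 are computed for one class. The function name and the toy labels are made up for illustration; they are not taken from the dataset.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 0, 1]
print(precision_recall_f1(y_true_toy, y_pred_toy))  # (0.75, 0.75, 0.75)
```

Precision answers "of the messages flagged as spam, how many really were spam?", while recall answers "of the real spam messages, how many were caught?".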



## Create functions to detect spam

In [36]:
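def spam_detect(text : str):
  text = clean_text(text)        # apply the same cleaning used on the training data
  raw = tfidf.transform([text])  # vectorize with the already-fitted TF-IDF vocabulary
  return svc.predict(raw)[0]     # 1 = spam, 0 = ham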

def print_spam(text : str):
  result = "Spam" if spam_detect(text) else "Ham"

  def truncate(string, width):
    if len(string) > width:
        string = string[:width-3] + '...'
    return string

  print(f"{truncate(text, 30)} =======> is {result}")

## Final Test

In [37]:
text1 = "clash royal became so bad"
text2 = "if you want to win 100000$ just click the link bellow www.winnerwinnerchickendinner.com"

print_spam(text1)
print_spam(text2)



# Let's Do It Again, This Time on Another Dataset

## Import the Dataset

In [49]:
'''
In this example we are going to use the Rotten Tomatoes Movie Reviews Dataset, which
includes a collection of movie reviews labeled as negative or positive.
'''

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/MNourMoslem/Simple-Text-Classifier/master/dataset/rotten_tomatoes.csv", index_col = False)
df.head()

|   | text | label |
|---|------|-------|
| 0 | the rock is destined to be the 21st century's ... | 1 |
| 1 | the gorgeously elaborate continuation of " the... | 1 |
| 2 | effective but too-tepid biopic | 1 |
| 3 | if you sometimes like to go to the movies to h... | 1 |
| 4 | emerges as something rare , an issue movie tha... | 1 |


## Check the balance of the dataset

In [50]:
print(df['label'].value_counts())

label
1    4265
0    4265
Name: count, dtype: int64


## Import Tf-idf from Sklearn and fit our data to it

In [51]:
x = df.iloc[:, 0].values
y = df.iloc[:, 1].values
print("Shape of X:", x.shape)
print("Shape of Y:", y.shape)
print('')

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf2 = TfidfVectorizer()
new_x = tfidf2.fit_transform(x)

print("Shape of New_X:", new_x.shape)
print("Number of vocab:", len(tfidf2.vocabulary_))

Shape of X: (8530,)
Shape of Y: (8530,)

Shape of New_X: (8530, 16474)
Number of vocab: 16474


## Split the dataset into training and test sets

In [52]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(new_x, y, test_size = 0.2, random_state = 256)

## Import Support Vector Classifier and train it

In [53]:
from sklearn.svm import SVC

svc2 = SVC()
svc2.fit(x_train, y_train) # might take time
preds = svc2.predict(x_test)

## Take a look at the results

In [54]:
print("First 10 predictions:", preds[:10])
print("First 10 test labels:", y_test[:10])

First 10 predictions: [1 0 0 0 1 0 0 0 0 1]
First 10 test labels: [1 0 1 0 1 0 1 1 0 1]
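
A quick way to read these two arrays is to count how many positions agree. The sketch below copies the ten values printed above into plain lists; the variable names are chosen for this example.

```python
preds_head = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]   # first 10 predictions from the output above
labels_head = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1]  # first 10 test labels from the output above

matches = sum(p == t for p, t in zip(preds_head, labels_head))
missed_positives = sum(p == 0 and t == 1 for p, t in zip(preds_head, labels_head))

print(f"{matches}/{len(labels_head)} correct")               # 7/10 correct
print(f"{missed_positives} positive reviews predicted as negative")  # 3
```

On this small slice, every mistake is a positive review predicted as negative. Ten samples are far too few to draw conclusions from, which is why the full classification report is the better summary.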


## Get the classification report

In [55]:
from sklearn.metrics import classification_report

cr = classification_report(y_test, preds)
print(cr)

              precision    recall  f1-score   support

           0       0.74      0.74      0.74       842
           1       0.74      0.75      0.75       864

    accuracy                           0.74      1706
   macro avg       0.74      0.74      0.74      1706
weighted avg       0.74      0.74      0.74      1706



## Final Test

In [56]:
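def rate_review(text : str):
  # Note: tfidf2 was fitted on the raw review text, while clean_text is applied only here
  text = clean_text(text)
  raw = tfidf2.transform([text])  # vectorize with the already-fitted TF-IDF vocabulary
  return svc2.predict(raw)[0]     # 1 = positive, 0 = negative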

def print_rate(text : str):
  result = "Positive" if rate_review(text) else "Negative"

  def truncate(string, width):
    if len(string) > width:
        string = string[:width-3] + '...'
    return string

  print(f"{truncate(text, 30)} =======> is {result}")

In [62]:
review1 = "The movie was so good i liked it"
review2 = "worst movie ever"
review3 = "didn't expect this, i was waiting for something better"
review4 = "good movie i liked it"

print_rate(review1)
print_rate(review2)
print_rate(review3)
print_rate(review4)



# Additional Information

## Why It's Called Spam and Ham?

* **Spam**: The term "spam" originally comes from a 1970s sketch by the British comedy group Monty Python, where a restaurant serves everything with spam (a brand of canned meat), leading to the word being repeated excessively. This sketch became a metaphor for the flood of unwanted and repetitive messages, particularly in email and online communication. Over time, "spam" came to refer to any form of unwanted, unsolicited bulk messages.

* **Ham**: In contrast, "ham" was coined as a playful counterpart to "spam." It was chosen because it sounds similar but represents something desirable in this context: legitimate, wanted messages. It has no deeper etymology related to email classification; it simply serves as a memorable opposite to spam.