## **TF-IDF (Term Frequency-Inverse Document Frequency)**

TF-IDF is a numerical statistic that measures the importance of a word in a document relative to a collection of documents (corpus). It is commonly used in information retrieval and text mining.

The TF-IDF value for a term in a document is calculated using the following formula:

**TF-IDF = Term Frequency(TF) * Inverse Document Frequency(IDF)**

Where:
- Term Frequency (TF) is the number of times a term appears in a document. This count is often normalized by the document's length to prevent bias towards longer documents.

  **Term Frequency(TF) = [number of times the term appears in the document / total number of words in the document]**

- Inverse Document Frequency (IDF) measures how rare the term is across the entire corpus; terms that appear in few documents receive a higher weight. It is calculated as follows:

  **Inverse Document Frequency(IDF) = [log(total number of documents / (number of documents that contain the term + 1))]**

  The "+1" in the denominator avoids division by zero when a term is not present in any document. Libraries often use a smoothed variant of this formula, so the values they report can differ from a hand calculation.

The TF-IDF score is high when a term appears frequently in a particular document but infrequently across the entire corpus. This indicates that the term is likely to be important in that document.

TF-IDF is often used in natural language processing tasks such as document classification, information retrieval, and text mining to assess the significance of words in documents.
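
To make the formulas above concrete, here is a minimal pure-Python sketch that applies them to a made-up three-document corpus. The corpus and the helper names `term_frequency`, `inverse_document_frequency`, and `tf_idf` are illustrative assumptions, and the natural logarithm is assumed because the formula does not fix a base.

```python
import math

# Hypothetical toy corpus, already tokenized (not part of the notebook's data)
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred", "loudly"],
]

def term_frequency(term, doc):
    """Occurrences of `term` divided by the total number of words in `doc`."""
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, docs):
    """log(N / (1 + df)), using the natural log (an assumption)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1 + df))

def tf_idf(term, doc, docs):
    """Product of the two quantities above."""
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

# Score three terms against the second document
for term in ["dog", "cat", "the"]:
    print(term, round(tf_idf(term, docs[1], docs), 4))
```

With this variant, a term that occurs in every document (here `the`) gets a negative IDF, because N / (1 + df) drops below 1. That is one reason library implementations tend to use a smoothed formula instead.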

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import pprint
import pandas as pd

### Understanding how TfidfVectorizer builds its vocabulary and scores

In [2]:
corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes"
]

In [6]:
tf = TfidfVectorizer()
output = tf.fit_transform(corpus)
pprint.pprint(tf.vocabulary_)

{'already': 0,
 'am': 1,
 'amazon': 2,
 'and': 3,
 'announcing': 4,
 'apple': 5,
 'are': 6,
 'ate': 7,
 'biryani': 8,
 'dot': 9,
 'eating': 10,
 'eco': 11,
 'google': 12,
 'grapes': 13,
 'iphone': 14,
 'ironman': 15,
 'is': 16,
 'loki': 17,
 'microsoft': 18,
 'model': 19,
 'new': 20,
 'pixel': 21,
 'pizza': 22,
 'surface': 23,
 'tesla': 24,
 'thor': 25,
 'tomorrow': 26,
 'you': 27}


In [21]:
feature_names = tf.get_feature_names_out()

# Print the learned IDF weight of every term in the vocabulary
for word in feature_names:
    idx = tf.vocabulary_.get(word)
    print("{0} : {1}".format(word, tf.idf_[idx]))

already : 2.386294361119891
am : 2.386294361119891
amazon : 2.386294361119891
and : 2.386294361119891
announcing : 1.2876820724517808
apple : 2.386294361119891
are : 2.386294361119891
ate : 2.386294361119891
biryani : 2.386294361119891
dot : 2.386294361119891
eating : 1.9808292530117262
eco : 2.386294361119891
google : 2.386294361119891
grapes : 2.386294361119891
iphone : 2.386294361119891
ironman : 2.386294361119891
is : 1.1335313926245225
loki : 2.386294361119891
microsoft : 2.386294361119891
model : 2.386294361119891
new : 1.2876820724517808
pixel : 2.386294361119891
pizza : 2.386294361119891
surface : 2.386294361119891
tesla : 2.386294361119891
thor : 2.386294361119891
tomorrow : 1.2876820724517808
you : 2.386294361119891
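
These values do not follow the textbook formula exactly. With its default settings (`smooth_idf=True`), scikit-learn's `TfidfVectorizer` computes `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing `t`. The sketch below reproduces a few of the printed values by hand; the function name `smoothed_idf` is illustrative, and the document frequencies were counted manually from the seven-sentence corpus.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 7  # number of sentences in the corpus

# Document frequencies counted by hand from the corpus
doc_freq = {"pizza": 1, "eating": 2, "announcing": 5, "is": 6}

for word, df in doc_freq.items():
    print(f"{word} : {smoothed_idf(n_docs, df):.6f}")
```

Because of the trailing `+ 1`, even a term that occurs in every document keeps a weight of 1 rather than dropping to 0.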


In [13]:
corpus[:2]

['Thor eating pizza, Loki is eating pizza, Ironman ate pizza already',
 'Apple is announcing new iphone tomorrow']

In [18]:
output.toarray()[:2]

array([[0.24266547, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.24266547, 0.        , 0.        ,
        0.40286636, 0.        , 0.        , 0.        , 0.        ,
        0.24266547, 0.11527033, 0.24266547, 0.        , 0.        ,
        0.        , 0.        , 0.72799642, 0.        , 0.        ,
        0.24266547, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.30652086,
        0.5680354 , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.5680354 ,
        0.        , 0.26982671, 0.        , 0.        , 0.        ,
        0.30652086, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.30652086, 0.        ]])
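
Each row of this matrix can be reproduced by hand. Under the vectorizer's default settings, the raw count of each term is multiplied by its smoothed IDF, and the resulting row is then scaled to unit Euclidean (L2) length. The following sketch recomputes the first row using only the standard library. The regular expression approximates the default tokenizer (lowercased tokens of two or more word characters), and the helper and variable names are illustrative.

```python
import math
import re
from collections import Counter

corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes",
]

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in corpus]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

# Raw count times IDF for every term of the first document
counts = Counter(tokenized[0])
raw = {term: count * smoothed_idf(term) for term, count in counts.items()}

# L2-normalize so that the row has unit length
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
row = {term: value / norm for term, value in raw.items()}

for term in sorted(row):
    print(f"{term} : {row[term]:.8f}")
```

The largest weight belongs to `pizza`, which appears three times in the first sentence and nowhere else in the corpus, while `is` scores lowest because it appears in six of the seven sentences.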

## **Product Classification using TF-IDF Vectorizer**
Dataset Credits: https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification

In [24]:
df = pd.read_csv("data/Ecommerce_data.csv")
df.head()

Unnamed: 0,Text,label
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household
1,"Contrast living Wooden Decorative Box,Painted ...",Household
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories


In [25]:
df['label'].value_counts()

Household                 6000
Electronics               6000
Clothing & Accessories    6000
Books                     6000
Name: label, dtype: int64

In [26]:
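# Map each category name to an integer id.
# The ids are arbitrary identifiers; note that 3 is unused.
df['label_num'] = df['label'].map({
    'Household' : 0,
    'Books' : 1,
    'Electronics' : 2,
    'Clothing & Accessories': 4,
})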
df.head()

Unnamed: 0,Text,label,label_num
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household,0
1,"Contrast living Wooden Decorative Box,Painted ...",Household,0
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics,2
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories,4
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories,4
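
Since the models are trained on these integer ids, it can help to keep the inverse mapping around for turning predictions back into category names. A minimal sketch follows; `label_to_id`, `id_to_label`, and the sample `predictions` list are hypothetical names and values used only for illustration.

```python
# Same mapping as used for the label_num column
label_to_id = {
    'Household': 0,
    'Books': 1,
    'Electronics': 2,
    'Clothing & Accessories': 4,
}

# Invert the mapping: integer id -> category name
id_to_label = {idx: name for name, idx in label_to_id.items()}

# Made-up predictions, standing in for the output of a classifier
predictions = [2, 0, 4]
decoded = [id_to_label[p] for p in predictions]
print(decoded)
```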


### Train test split the dataset

In [35]:
X_train, X_test, y_train, y_test = train_test_split(
    df.Text, 
    df.label_num, 
    test_size=0.2,
    random_state=42,
    stratify=df.label_num
)
(len(X_train), len(X_test)),  y_train.value_counts()

((19200, 4800),
 4    4800
 2    4800
 1    4800
 0    4800
 Name: label_num, dtype: int64)
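
These counts follow from stratification: with `stratify` set, each class contributes the same share of its rows to the test set. The arithmetic can be checked with a short sketch. This is a simplified view rather than scikit-learn's exact allocation procedure; the variable names are illustrative, and the class counts come from the `value_counts()` output shown earlier.

```python
# Rows per class, as reported by df['label'].value_counts()
class_counts = {
    'Household': 6000,
    'Electronics': 6000,
    'Clothing & Accessories': 6000,
    'Books': 6000,
}
test_size = 0.2

# Each class gives up the same fraction of its rows to the test set
test_per_class = {label: round(n * test_size) for label, n in class_counts.items()}
train_per_class = {label: n - test_per_class[label] for label, n in class_counts.items()}

print(sum(train_per_class.values()), sum(test_per_class.values()))
print(train_per_class)
```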

### Create KNN model, train, and evaluate

In [38]:
clf = Pipeline([
     ('vectorizer_tfidf',TfidfVectorizer()),    
     ('KNN', KNeighborsClassifier())         
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.96      0.96      1200
           1       0.96      0.95      0.96      1200
           2       0.97      0.96      0.97      1200
           4       0.98      0.97      0.98      1200

    accuracy                           0.96      4800
   macro avg       0.96      0.96      0.96      4800
weighted avg       0.96      0.96      0.96      4800



### Create naive bayes model, train, and evaluate

In [40]:
clf = Pipeline([
     ('vectorizer_tfidf',TfidfVectorizer()),    
     ('Multi NB', MultinomialNB())         
])

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.97      0.95      1200
           1       0.98      0.93      0.96      1200
           2       0.97      0.97      0.97      1200
           4       0.98      0.98      0.98      1200

    accuracy                           0.96      4800
   macro avg       0.96      0.96      0.96      4800
weighted avg       0.96      0.96      0.96      4800



### Create random forest model, train, and evaluate

In [42]:
# Create a pipeline: TF-IDF features followed by a random forest classifier
clf = Pipeline([
     ('vectorizer_tfidf',TfidfVectorizer()),       
     ('Random Forest', RandomForestClassifier())         
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97      1200
           1       0.97      0.98      0.98      1200
           2       0.98      0.96      0.97      1200
           4       0.98      0.99      0.98      1200

    accuracy                           0.97      4800
   macro avg       0.97      0.97      0.97      4800
weighted avg       0.97      0.97      0.97      4800

