###### Document Frequency (DF) : Number of documents in the corpus that contain term t
* Scoring Mechanism: The more documents a term appears in, the lower its score should be.

* IDF (Inverse Document Frequency) : log ( (Total number of documents) / (Number of documents term t is present in) )
  -  log is used to dampen the ratio, so that very rare terms do not dominate the score
<br>
* Term Frequency (TF(t,d)) : (Number of times term t is present in doc d) / (Total no. of tokens in doc d)
<br>
* TF-IDF = TF(t,d) * IDF(t)
<br>
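* sklearn (TF-IDF) formula : smoothed by default (`smooth_idf=True`) to avoid division by zero; n is the total number of documents, df(t) is the number of documents containing t, and log is the natural log
idf(t) = log[(1+n)/(1+df(t))] + 1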

<br>

*  Limitation of TF-IDF model:
   -  As the vocabulary grows, the dimensionality and sparsity of the vectors increase.
   -  Doesn't capture relationships between words.
   -  Doesn't address the OOV (Out of Vocabulary) problem.
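
The textbook definitions of TF, IDF and TF-IDF given above can be sketched in plain Python. This is a minimal illustration, not sklearn's implementation: the three-document corpus and the helper names `tf`, `idf` and `tf_idf` are made up for the example, and the natural log is an assumption since the notes do not fix a base.

```python
import math

# Hypothetical, already-tokenized corpus (for illustration only)
docs = [
    ["thor", "is", "eating", "pizza"],
    ["loki", "is", "eating", "pizza"],
    ["apple", "is", "announcing", "iphone"],
]

def tf(term, doc):
    # Share of the document's tokens that are `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Unsmoothed IDF; divides by zero if no document contains `term`
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("is", docs[0], docs))    # 0.0, because "is" occurs in every document
print(tf_idf("thor", docs[0], docs))  # 0.25 * ln(3), roughly 0.2747
```

A term that appears in every document gets a score of exactly zero here, which is the scoring mechanism described at the top. The unsmoothed `idf` also fails for a term with df = 0, which is the case the sklearn variant guards against.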

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
#Corpus : Collection of docs
corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes"
]

In [19]:
v = TfidfVectorizer()
transformed_output = v.fit_transform(corpus)
print(v.vocabulary_)

{'thor': 25, 'eating': 10, 'pizza': 22, 'loki': 17, 'is': 16, 'ironman': 15, 'ate': 7, 'already': 0, 'apple': 5, 'announcing': 4, 'new': 20, 'iphone': 14, 'tomorrow': 26, 'tesla': 24, 'model': 19, 'google': 12, 'pixel': 21, 'microsoft': 18, 'surface': 23, 'amazon': 2, 'eco': 11, 'dot': 9, 'am': 1, 'biryani': 8, 'and': 3, 'you': 27, 'are': 6, 'grapes': 13}


In [20]:
#Gives all your features/vocabulary in order
all_feature_names = v.get_feature_names_out()

for word in all_feature_names:
    index = v.vocabulary_.get(word)
    print(f"{word} {v.idf_[index]}") #Retrieve specific score for a word by using the index operator

already 2.386294361119891
am 2.386294361119891
amazon 2.386294361119891
and 2.386294361119891
announcing 1.2876820724517808
apple 2.386294361119891
are 2.386294361119891
ate 2.386294361119891
biryani 2.386294361119891
dot 2.386294361119891
eating 1.9808292530117262
eco 2.386294361119891
google 2.386294361119891
grapes 2.386294361119891
iphone 2.386294361119891
ironman 2.386294361119891
is 1.1335313926245225
loki 2.386294361119891
microsoft 2.386294361119891
model 2.386294361119891
new 1.2876820724517808
pixel 2.386294361119891
pizza 2.386294361119891
surface 2.386294361119891
tesla 2.386294361119891
thor 2.386294361119891
tomorrow 1.2876820724517808
you 2.386294361119891
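
The idf values printed above can be reproduced by hand from the smoothed formula idf(t) = log[(1+n)/(1+df(t))] + 1. The sketch below assumes the natural log and the 7-document corpus from this notebook; the document frequencies are counted by hand, and `smoothed_idf` is a helper name introduced only for this check.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # Smoothed idf: log[(1 + n) / (1 + df(t))] + 1, natural log
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 7  # number of documents in the corpus

# Document frequencies counted by hand from the corpus
doc_freq = {"is": 6, "announcing": 5, "eating": 2, "pizza": 1}

for term, df_t in doc_freq.items():
    print(term, smoothed_idf(n_docs, df_t))
```

Terms that occur in a single document, such as "pizza", all share the highest value, log(8/2) + 1, while "is", present in 6 of the 7 documents, gets the lowest.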


In [21]:
#Printing 1st two sentences from the corpus
corpus[:2]

['Thor eating pizza, Loki is eating pizza, Ironman ate pizza already',
 'Apple is announcing new iphone tomorrow']

In [22]:
#Printing the corresponding tf-idf vector for 1st two sentences
transformed_output.toarray()[:2]

array([[0.24266547, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.24266547, 0.        , 0.        ,
        0.40286636, 0.        , 0.        , 0.        , 0.        ,
        0.24266547, 0.11527033, 0.24266547, 0.        , 0.        ,
        0.        , 0.        , 0.72799642, 0.        , 0.        ,
        0.24266547, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.30652086,
        0.5680354 , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.5680354 ,
        0.        , 0.26982671, 0.        , 0.        , 0.        ,
        0.30652086, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.30652086, 0.        ]])
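
The first row above can be checked by hand. With its default settings, `TfidfVectorizer` uses the raw term count as TF (not the count divided by the document length), multiplies it by the smoothed idf, and then scales each row to unit L2 length. A minimal sketch of that calculation for the first sentence, with counts and document frequencies tallied by hand:

```python
import math

n_docs = 7  # number of documents in the corpus

# Raw term counts in the first sentence, counted by hand
counts = {"thor": 1, "eating": 2, "pizza": 3, "loki": 1,
          "is": 1, "ironman": 1, "ate": 1, "already": 1}

# Number of corpus documents containing each term, counted by hand
doc_freq = {"thor": 1, "eating": 2, "pizza": 1, "loki": 1,
            "is": 6, "ironman": 1, "ate": 1, "already": 1}

# Unnormalized weight: raw count * smoothed idf
raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
       for t, c in counts.items()}

# Scale the row so that its Euclidean (L2) norm is 1
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {t: w / norm for t, w in raw.items()}

for term in sorted(weights):
    print(term, round(weights[term], 8))
```

"pizza" ends up with the largest weight in the row because it occurs three times in the sentence and in no other document.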

###### Problem Statement: Given the description of a product sold on an e-commerce website, classify it into one of 4 categories

* Dataset Credits:
  - https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification

In [23]:
import pandas as pd
df = pd.read_csv("Ecommerce_Data.csv")
print(df.shape)
df.head(5)

(24000, 2)


Unnamed: 0,Text,label
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household
1,"Contrast living Wooden Decorative Box,Painted ...",Household
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories


In [24]:
df.label.value_counts()

label
Household                 6000
Electronics               6000
Clothing & Accessories    6000
Books                     6000
Name: count, dtype: int64

In [25]:
#Map label categories to numbers
df["label_num"] = df.label.map({
    'Household':0,
    'Books':1,
    'Electronics':2,
    'Clothing & Accessories':3
})
df.head(5)

Unnamed: 0,Text,label,label_num
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household,0
1,"Contrast living Wooden Decorative Box,Painted ...",Household,0
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics,2
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories,3
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories,3


In [26]:
from sklearn.model_selection import train_test_split
X_train,X_test, y_train, y_test = train_test_split(
    df.Text, #Independent variable (feature)
    df.label_num, #Dependent variable (target)
    test_size = 0.2,
    random_state = 98,
    stratify = df.label_num
)

In [27]:
print("Shape of X_train",X_train.shape)
print("Shape of X_test",X_test.shape)

Shape of X_train (19200,)
Shape of X_test (4800,)


In [28]:
y_train.value_counts()

label_num
2    4800
0    4800
1    4800
3    4800
Name: count, dtype: int64

In [29]:
y_test.value_counts()

label_num
3    1200
1    1200
2    1200
0    1200
Name: count, dtype: int64

In [38]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
#classification_report expects the true labels first, then the predictions

In [39]:
#1. create a pipeline object
clf = Pipeline([
     ('vectorizer_tfidf',TfidfVectorizer()),    
     ('RandomForest', RandomForestClassifier())         
])

In [40]:
#2. fit with X_train and y_train
clf.fit(X_train, y_train)

In [42]:
#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)

In [44]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.95      0.96      0.95      1200
           1       0.97      0.98      0.97      1200
           2       0.97      0.96      0.97      1200
           3       0.98      0.98      0.98      1200

    accuracy                           0.97      4800
   macro avg       0.97      0.97      0.97      4800
weighted avg       0.97      0.97      0.97      4800



In [45]:
X_test[:5]

3025     DIDA Men's Polyester Brushed Tracksuit Highlig...
3445     The Shining Review Obviously a masterpiece, pr...
22660    Hp Wireless Multimedia Keyboard & Mouse (Wirel...
12664    A History of Ancient and Early Medieval India:...
15387    Guide To JAIIB Legal Aspects Principles Of Ban...
Name: Text, dtype: object

In [49]:
X_test[:5][3025]

"DIDA Men's Polyester Brushed Tracksuit Highlighted with bright contrast details, DIDA offers authentic track suit, a reliable daily wear athletic companion. This product can be used for Running , Training purposes."

In [50]:
X_test[:5][3445]

"The Shining Review Obviously a masterpiece, probably the best supernatural novel in a hundred yearsAs a storyteller, he is up there in the Dickens class \t\t\t\t    \t \t\t\t\t\t Book Description One of the true classics of horror fiction, THE SHINING is regarded as one of Stephen King's masterpieces. \t\t\t\t    \t \t\t\t\t\t              See all Product description"

In [46]:
y_test[:5]

3025     3
3445     1
22660    2
12664    1
15387    1
Name: label_num, dtype: int64

In [47]:
y_pred[:5]

array([3, 1, 2, 1, 1])
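
Numeric predictions are easier to read once they are mapped back to category names. A small sketch, assuming the same label mapping used earlier in this notebook; `num_to_label` and `y_pred_sample` are names introduced here for illustration, and the sample values are copied from the output above.

```python
# Same mapping that was used to create the label_num column
label_to_num = {
    'Household': 0,
    'Books': 1,
    'Electronics': 2,
    'Clothing & Accessories': 3,
}

# Invert the mapping: number -> category name
num_to_label = {num: label for label, num in label_to_num.items()}

# First five predictions, copied from the output above
y_pred_sample = [3, 1, 2, 1, 1]

decoded = [num_to_label[num] for num in y_pred_sample]
print(decoded)
# ['Clothing & Accessories', 'Books', 'Electronics', 'Books', 'Books']
```

The decoded names line up with the sample texts shown earlier: the tracksuit description is classified as Clothing & Accessories and the review of The Shining as Books.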

In [51]:
from sklearn.naive_bayes import MultinomialNB
clf = Pipeline([
    ('vectorizer_tfidf',TfidfVectorizer()),
    ('Multi NB',MultinomialNB())
])

clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.93      0.97      0.95      1200
           1       0.98      0.94      0.96      1200
           2       0.97      0.97      0.97      1200
           3       0.97      0.98      0.98      1200

    accuracy                           0.96      4800
   macro avg       0.96      0.96      0.96      4800
weighted avg       0.96      0.96      0.96      4800



In [53]:
#Preprocess text
import spacy
nlp = spacy.load("en_core_web_sm")
def preprocess(text):
    doc = nlp(text)
    filtered_token = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_token.append(token.lemma_)
        
    return " ".join(filtered_token)

In [None]:
df["preprocessed_txt"] = df["Text"].apply(preprocess)

In [None]:
df.head()

In [None]:
df.Text[0]

In [None]:
df.preprocessed_txt[0]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df.preprocessed_txt,
    df.label_num,
    test_size = 0.2,
    random_state = 98,
    stratify = df.label_num
)

In [None]:
clf = Pipeline([
    ('vectorizer_tfidf',TfidfVectorizer()),
    ('RandomForest',RandomForestClassifier())
])

clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test,y_pred))