### Term Frequency - Inverse Document Frequency

DF(t) => Number of documents that contain the term 't'   
<br>
IDF(t) => $\log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing the term 't'}}\right)$  
<br>
TF(t,d) => $\frac{\text{Number of times the term 't' appears in document d}}{\text{Total number of tokens in document d}}$
<hr>

**TF-IDF** = TF(t,d) * IDF(t)
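
As a rough illustration of the formulas above, the sketch below computes TF, IDF and TF-IDF by hand in plain Python. The three-document corpus and the helper names `tf`, `idf` and `tf_idf` are made up for this example. It uses the plain textbook definitions, without the smoothing or normalization that a library may apply.

```python
import math

# Hypothetical toy corpus, already split into tokens.
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["apple", "cherry", "cherry", "date"],
]

def tf(term, doc):
    """Share of the tokens in `doc` that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(total documents / documents containing `term`).

    Assumes the term occurs in at least one document; otherwise
    the division below raises ZeroDivisionError.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf("apple", docs[0]))             # 2 of the 3 tokens are "apple"
print(idf("apple", docs))               # "apple" is in 2 of 3 documents
print(idf("date", docs))                # "date" is in 1 of 3 documents
print(tf_idf("apple", docs[0], docs))
```

A term found in fewer documents gets a larger IDF, so here "date" is weighted more heavily than "apple".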

**Limitations of TF-IDF**  
* As the vocabulary size n increases, dimensionality and sparsity increase.
* Doesn't capture relationships between words.
* Doesn't address the out-of-vocabulary (OOV) problem.
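
The OOV limitation can be seen in a small sketch like the one below. The two training sentences and the query strings are illustrative only. Words that were not seen during `fit` have no column in the vocabulary, so they are dropped at `transform` time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative training sentences.
train_docs = [
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
]

vec = TfidfVectorizer().fit(train_docs)

# None of these words were seen during fit, so the vector is all zeros.
unseen = vec.transform(["Batman drinks coffee"])

# Only "iphone" is in the learned vocabulary; the other words are ignored.
mixed = vec.transform(["Batman buys iphone"])

print(unseen.nnz)  # number of non-zero entries in the first vector
print(mixed.nnz)   # number of non-zero entries in the second vector
```

A sentence made up entirely of unseen words becomes a zero vector, which gives a downstream classifier nothing to work with.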

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
corpus = [
    'Thor eating pizza, Loki is eating pizza, Ironman ate pizza already',
    'Apple is announcing new iphone tomorrow',
    'Tesla is announcing new model-3 tomorrow',
    'Google is announcing new pixel-6 tomorrow',
    'Microsoft is announcing new surface tomorrow',
    'Amazon is announcing new eco-dot tomorrow',
    'I am eating biryani and you are eating grapes',
    'something is amazing'
]

In [4]:
v = TfidfVectorizer()
transformed_output = v.fit_transform(corpus)
print(v.vocabulary_)

{'thor': 27, 'eating': 11, 'pizza': 23, 'loki': 18, 'is': 17, 'ironman': 16, 'ate': 8, 'already': 0, 'apple': 6, 'announcing': 5, 'new': 21, 'iphone': 15, 'tomorrow': 28, 'tesla': 26, 'model': 20, 'google': 13, 'pixel': 22, 'microsoft': 19, 'surface': 25, 'amazon': 3, 'eco': 12, 'dot': 10, 'am': 1, 'biryani': 9, 'and': 4, 'you': 29, 'are': 7, 'grapes': 14, 'something': 24, 'amazing': 2}


In [8]:
all_feat_names = v.get_feature_names_out()

for word in all_feat_names:
    idx = v.vocabulary_.get(word)
    print(f'{word}: {v.idf_[idx]}')

already: 2.504077396776274
am: 2.504077396776274
amazing: 2.504077396776274
amazon: 2.504077396776274
and: 2.504077396776274
announcing: 1.4054651081081644
apple: 2.504077396776274
are: 2.504077396776274
ate: 2.504077396776274
biryani: 2.504077396776274
dot: 2.504077396776274
eating: 2.09861228866811
eco: 2.504077396776274
google: 2.504077396776274
grapes: 2.504077396776274
iphone: 2.504077396776274
ironman: 2.504077396776274
is: 1.1177830356563834
loki: 2.504077396776274
microsoft: 2.504077396776274
model: 2.504077396776274
new: 1.4054651081081644
pixel: 2.504077396776274
pizza: 2.504077396776274
something: 2.504077396776274
surface: 2.504077396776274
tesla: 2.504077396776274
thor: 2.504077396776274
tomorrow: 1.4054651081081644
you: 2.504077396776274
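
These values do not match the plain log(N / DF) formula, because `TfidfVectorizer` by default uses a smoothed variant: ln((1 + N) / (1 + DF)) + 1. The sketch below reproduces a few of the printed values with the standard library only. The `tokenize` helper is an approximation of the vectorizer's default tokenization (lowercasing, tokens of two or more word characters), not the library's own code.

```python
import math
import re

corpus = [
    'Thor eating pizza, Loki is eating pizza, Ironman ate pizza already',
    'Apple is announcing new iphone tomorrow',
    'Tesla is announcing new model-3 tomorrow',
    'Google is announcing new pixel-6 tomorrow',
    'Microsoft is announcing new surface tomorrow',
    'Amazon is announcing new eco-dot tomorrow',
    'I am eating biryani and you are eating grapes',
    'something is amazing'
]

def tokenize(text):
    # Approximation of the default tokenizer: lowercase, 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

n_docs = len(corpus)

# Document frequency: in how many documents does each term occur?
doc_freq = {}
for text in corpus:
    for term in set(tokenize(text)):
        doc_freq[term] = doc_freq.get(term, 0) + 1

def smoothed_idf(term):
    # Smoothed IDF: ln((1 + N) / (1 + DF)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

for term in ["is", "announcing", "eating", "pizza"]:
    print(term, doc_freq[term], smoothed_idf(term))
```

"is" occurs in 7 of the 8 documents and gets the lowest weight, while terms that occur in only one document, such as "pizza", get the highest.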


In [9]:
corpus[:2]

['Thor eating pizza, Loki is eating pizza, Ironman ate pizza already',
 'Apple is announcing new iphone tomorrow']

In [10]:
transformed_output.toarray()[:2]

array([[0.24247317, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.24247317, 0.        ,
        0.        , 0.40642288, 0.        , 0.        , 0.        ,
        0.        , 0.24247317, 0.10823643, 0.24247317, 0.        ,
        0.        , 0.        , 0.        , 0.7274195 , 0.        ,
        0.        , 0.        , 0.24247317, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.31652498, 0.5639436 , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.5639436 , 0.        , 0.25173606, 0.        , 0.        ,
        0.        , 0.31652498, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.31652498, 0.        ]])
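
Every entry in these rows is below 1 because `TfidfVectorizer` by default (`norm='l2'`) scales each row to unit Euclidean length after multiplying term counts by IDF. The sketch below checks this on the two documents shown above. Because it fits on only these two documents, the individual weights differ from the 8-document output; only the normalization step is being illustrated.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'Thor eating pizza, Loki is eating pizza, Ironman ate pizza already',
    'Apple is announcing new iphone tomorrow',
]

# norm=None keeps the raw count * idf products.
raw = TfidfVectorizer(norm=None).fit_transform(docs).toarray()

# Default settings apply L2 normalization per row.
normalized = TfidfVectorizer().fit_transform(docs).toarray()

# Divide each raw row by its own Euclidean length.
row_norms = np.linalg.norm(raw, axis=1, keepdims=True)
manual = raw / row_norms

print(np.allclose(manual, normalized))      # do the two versions agree?
print(np.linalg.norm(normalized, axis=1))   # length of each normalized row
```

Unit-length rows make cosine similarity between documents a plain dot product, and keep long documents from dominating distance-based models such as KNN.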

In [11]:
import pandas as pd

df = pd.read_csv('assets/Ecommerce_data.csv')
df.head()

| | Text | label |
|---|---|---|
| 0 | Urban Ladder Eisner Low Back Study-Office Comp... | Household |
| 1 | Contrast living Wooden Decorative Box,Painted ... | Household |
| 2 | IO Crest SY-PCI40010 PCI RAID Host Controller ... | Electronics |
| 3 | ISAKAA Baby Socks from Just Born to 8 Years- P... | Clothing & Accessories |
| 4 | Indira Designer Women's Art Mysore Silk Saree ... | Clothing & Accessories |


In [12]:
df.shape

(24000, 2)

In [14]:
df['label'].value_counts()

label
Household                 6000
Electronics               6000
Clothing & Accessories    6000
Books                     6000
Name: count, dtype: int64

In [18]:
df['label_num'] = df['label'].map({
    'Household': 0,
    'Books': 1,
    'Electronics': 2,
    'Clothing & Accessories': 3
})

In [19]:
df.head()

| | Text | label | label_num |
|---|---|---|---|
| 0 | Urban Ladder Eisner Low Back Study-Office Comp... | Household | 0 |
| 1 | Contrast living Wooden Decorative Box,Painted ... | Household | 0 |
| 2 | IO Crest SY-PCI40010 PCI RAID Host Controller ... | Electronics | 2 |
| 3 | ISAKAA Baby Socks from Just Born to 8 Years- P... | Clothing & Accessories | 3 |
| 4 | Indira Designer Women's Art Mysore Silk Saree ... | Clothing & Accessories | 3 |


In [20]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['label_num'], test_size=0.2, random_state=2022, stratify=df['label_num'])

X_train.shape, X_test.shape

((19200,), (4800,))

In [21]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('KNN', KNeighborsClassifier())
])

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.96      0.95      1200
           1       0.97      0.95      0.96      1200
           2       0.97      0.97      0.97      1200
           3       0.97      0.98      0.97      1200

    accuracy                           0.96      4800
   macro avg       0.96      0.96      0.96      4800
weighted avg       0.96      0.96      0.96      4800



In [22]:
X_test[:5]

20706    Lal Haveli Designer Handmade Patchwork Decorat...
19166    GOTOTOP Classical Retro Cotton & PU Leather Ne...
15209    FabSeasons Camouflage Polyester Multi Function...
2462     Indian Superfoods: Change the Way You Eat Revi...
6621     Milton Marvel Insulated Steel Casseroles, Juni...
Name: Text, dtype: object

In [23]:
y_test[:5]

20706    0
19166    2
15209    3
2462     1
6621     3
Name: label_num, dtype: int64

In [24]:
y_pred[:5]

array([0, 2, 3, 1, 0], dtype=int64)
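
In the five samples above, the first four predictions match the true labels and the last one does not (true label 3, predicted 0). A confusion matrix summarizes this kind of comparison per class. The sketch below builds one from just these five values. The variable names are illustrative, and in practice the full `y_test` and `y_pred` would be passed instead.

```python
from sklearn.metrics import confusion_matrix

# The five true labels and predictions shown above.
y_true_sample = [0, 2, 3, 1, 3]
y_pred_sample = [0, 2, 3, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_sample, y_pred_sample, labels=[0, 1, 2, 3])
print(cm)
```

Off-diagonal cells are the errors: the entry in row 3, column 0 counts the sample whose true label is 3 but which was predicted as 0.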

In [25]:
from sklearn.naive_bayes import MultinomialNB

clf = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('nb', MultinomialNB())  # step name now matches the estimator
])

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.96      0.94      1200
           1       0.98      0.92      0.95      1200
           2       0.97      0.97      0.97      1200
           3       0.97      0.99      0.98      1200

    accuracy                           0.96      4800
   macro avg       0.96      0.96      0.96      4800
weighted avg       0.96      0.96      0.96      4800



In [26]:
from sklearn.ensemble import RandomForestClassifier

clf = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('rf', RandomForestClassifier())  # step name now matches the estimator
])

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.96      0.96      1200
           1       0.98      0.98      0.98      1200
           2       0.98      0.97      0.97      1200
           3       0.98      0.98      0.98      1200

    accuracy                           0.97      4800
   macro avg       0.97      0.97      0.97      4800
weighted avg       0.97      0.97      0.97      4800



In [27]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [29]:
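def preprocess(txt):
    """Return txt with punctuation and stop words removed and tokens lemmatized."""
    doc = nlp(txt)
    filtered = []

    for token in doc:
        # Keep only the lemma of tokens that are neither punctuation nor stop words
        if not token.is_punct and not token.is_stop:
            filtered.append(token.lemma_)

    return ' '.join(filtered)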

In [30]:
df['processed_txt'] = df['Text'].apply(preprocess)

In [31]:
df.head()

| | Text | label | label_num | processed_txt |
|---|---|---|---|---|
| 0 | Urban Ladder Eisner Low Back Study-Office Comp... | Household | 0 | Urban Ladder Eisner low Study Office Computer ... |
| 1 | Contrast living Wooden Decorative Box,Painted ... | Household | 0 | contrast live Wooden Decorative Box Painted Bo... |
| 2 | IO Crest SY-PCI40010 PCI RAID Host Controller ... | Electronics | 2 | IO Crest SY PCI40010 PCI raid Host Controller ... |
| 3 | ISAKAA Baby Socks from Just Born to 8 Years- P... | Clothing & Accessories | 3 | ISAKAA Baby Socks bear 8 Years- Pack 4 6 8 12 ... |
| 4 | Indira Designer Women's Art Mysore Silk Saree ... | Clothing & Accessories | 3 | Indira Designer Women Art Mysore Silk Saree Bl... |


In [35]:
X_train, X_test, y_train, y_test = train_test_split(df['processed_txt'], df['label_num'], test_size=0.2, random_state=2022, stratify=df['label_num'])

X_train.shape, X_test.shape

((19200,), (4800,))

In [36]:
clf = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('rf', RandomForestClassifier())  # step name now matches the estimator
])

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      0.96      0.96      1200
           1       0.98      0.98      0.98      1200
           2       0.98      0.97      0.98      1200
           3       0.98      0.99      0.98      1200

    accuracy                           0.98      4800
   macro avg       0.98      0.98      0.98      4800
weighted avg       0.98      0.98      0.98      4800

