# TF-IDF

#### A word is important if it’s frequent in this document but rare across all documents.

TF-IDF(t,d)=TF(t,d)×IDF(t)

The score is high when:

The word repeats within the document

The word is rare across the corpus

The score is low when:

The word is common everywhere

Or the word barely appears in the document

## TF

TF(t,d)=count of term t in document d

But raw counts are crude (long documents get larger counts just by being long), so people normalize by document length:

TF(t,d) = (count of t in d) / (total number of words in d)
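
As a rough illustration of the two definitions above, the sketch below computes the raw and the length-normalized term frequency for a toy sentence. The helper name `term_frequency` and the whitespace tokenization are assumptions made for this example, not part of any library.

```python
from collections import Counter

def term_frequency(term, document):
    """Return (raw count, count / total tokens) for `term` in `document`."""
    tokens = document.lower().split()  # naive whitespace tokenization (assumption)
    raw = Counter(tokens)[term]
    return raw, raw / len(tokens)

# "masks" occurs 2 times among 8 tokens
raw, normalized = term_frequency("masks", "masks are sold out and masks cost more")
print(raw, normalized)
```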



## IDF

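In how many documents does the word appear?

IDF(t) = log(N / df(t))

N = total number of documents

df(t) = number of documents containing t

The logarithm dampens the ratio so that very rare words do not dominate the score.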
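
A minimal sketch of how the pieces combine, using the common logarithmic form of IDF, log(N / df(t)). The three toy documents and the helper names (`idf`, `tf`, `tf_idf`) are made up for illustration; real libraries usually add smoothing on top of this.

```python
import math

docs = [
    "hand sanitizer sold out",
    "masks sold out everywhere",
    "prices of masks increased",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)  # total number of documents

def idf(term):
    df = sum(term in tokens for tokens in tokenized)  # documents containing the term
    return math.log(N / df)  # assumes df > 0, i.e. the term occurs somewhere

def tf(term, tokens):
    return tokens.count(term) / len(tokens)

def tf_idf(term, doc_index):
    tokens = tokenized[doc_index]
    return tf(term, tokens) * idf(term)

rare = tf_idf("sanitizer", 0)  # appears in 1 of 3 documents
common = tf_idf("sold", 0)     # appears in 2 of 3 documents
print(rare > common)
```

Both words occur once in the first document, so their TF is identical; the rarer word scores higher only because of its IDF.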

## Limitations


❌ No semantics

❌ No word order beyond n-grams

❌ Synonyms ignored

❌ Context blind

❌ High-dimensional sparse vectors
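
Two of these limitations are easy to see with plain word counts, which is what TF-IDF reweights without changing which terms are present. The sketch below is a toy illustration; `bag_of_words` is a hypothetical helper and the sentences are invented.

```python
from collections import Counter

def bag_of_words(text):
    """Count tokens, ignoring their order (naive whitespace tokenization)."""
    return Counter(text.lower().split())

# Word order is lost: both sentences map to exactly the same counts.
a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
print(a == b)

# Synonyms share no dimension, so the two phrases have no terms in common.
c = bag_of_words("cheap laptop")
d = bag_of_words("inexpensive notebook")
shared = set(c) & set(d)
print(len(shared))
```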

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()

corpus = ['TRENDING: New Yorkers encounter empty supermarket shelves (pictured, Wegmans in Brooklyn)',
          'sold-out online grocers (FoodKick, MaxDelivery) as #coronavirus-fearing shoppers stock up',
          'When I couldnt find hand sanitizer at Fred Meyer, I turned to #Amazon. But $114.97 for a',
          '2 pack of Purell??!!Check out how  #coronavirus concerns are driving up prices.',
          'Find out how you can protect yourself and loved ones from #coronavirus. ?',
          'Prices of surgical masks have increased six-fold, N95 respirators have more than trebled',
          'gowns cost twice as much""-@DrTedros #coronavirusHI TWITTER! I am a pharmacist. I sell hand',
          'sanitizer for a living! Or I do when any exists. Like masks, it is sold the fuck out everywhere.',
          'SHOULD YOU BE WORRIED? No. Use soap. SHOULD YOU VISIT TWENTY PHARMACIES LOOKING FOR THE LAST BOTTLE?']

transformed = v.fit_transform(corpus)

In [13]:
print(transformed.toarray()[0])

[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.30151134
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.30151134
 0.30151134 0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.30151134 0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.30151134 0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.30151134
 0.         0.         0.         0.         0.         0.
 0.30151134 0.         0.         0.         0.         0.
 0.         0.30151134 0.         0.         0.         0.
 0.         0.30151134 0.         0.         0.         0.
 0.         0.         0.         0.30151134 0.         0.
 0.30151134 0.         0.       
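
The eleven identical non-zero values come from scikit-learn's defaults: `TfidfVectorizer` uses raw counts for TF, a smoothed IDF of `ln((1 + N) / (1 + df(t))) + 1`, and then scales each row to unit Euclidean length. The first document has eleven distinct tokens, each occurring once and in no other document, so every weight ends up as 1/√11 ≈ 0.3015. The sketch below reimplements that recipe in plain Python on the first two corpus entries; the helper names are hypothetical, and the regex is meant to mirror the vectorizer's default `token_pattern`.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"(?u)\b\w\w+\b")  # tokens of two or more word characters

def tokenize(text):
    return TOKEN.findall(text.lower())

def tfidf_row(doc, corpus):
    """TF-IDF weights for one document: raw TF, smoothed IDF, L2 normalization."""
    n_docs = len(corpus)
    doc_sets = [set(tokenize(d)) for d in corpus]
    weights = {}
    for term, tf in Counter(tokenize(doc)).items():
        df = sum(term in s for s in doc_sets)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

corpus = [
    'TRENDING: New Yorkers encounter empty supermarket shelves (pictured, Wegmans in Brooklyn)',
    'sold-out online grocers (FoodKick, MaxDelivery) as #coronavirus-fearing shoppers stock up',
]
row = tfidf_row(corpus[0], corpus)
print(len(row), set(round(v, 8) for v in row.values()))
```

Because all eleven terms share the same IDF, the normalization step cancels it out, which is why the result does not depend on how many other documents are in the corpus.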

In [64]:
import pandas as pd
df = pd.read_csv(r"D:\Jupyter\jupyter notebook\datasets\ecommerceDataset.csv")  # raw string so backslashes are not read as escapes



In [69]:
df.columns = ['category','text']
df['category'].value_counts()

# named category_map so the built-in map() is not shadowed
category_map = {'Household': 0,
                'Books': 1,
                'Electronics': 2,
                'Clothing & Accessories': 3}

In [70]:
df['text'] = df['text'].fillna('')  # fill null values with an empty string

In [71]:
df['cat'] = df['category'].map(category_map)  # new column holding the label-encoded categories

In [72]:
df.head()

| | category | text | cat |
|---|---|---|---|
| 0 | Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... | 0 |
| 1 | Household | SAF 'UV Textured Modern Art Print Framed' Pain... | 0 |
| 2 | Household | SAF Flower Print Framed Painting (Synthetic, 1... | 0 |
| 3 | Household | Incredible Gifts India Wooden Happy Birthday U... | 0 |
| 4 | Household | Pitaara Box Romantic Venice Canvas Painting 6m... | 0 |


In [75]:
import spacy
nlp = spacy.load('en_core_web_sm')
def preprocessor(text):
    '''Remove stop words and punctuation, then lemmatize the remaining tokens.'''
    doc = nlp(text)
    filtered_token = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_token.append(token.lemma_)
    return ' '.join(filtered_token)
    
    

**Note: This preprocessing takes a while**

*around 10 minutes on my computer*

In [None]:
df['new_text'] = df['text'].apply(preprocessor)  # applying the preprocessor to text

In [None]:
df.head()

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['new_text'],df['cat'],
                                                    test_size = 0.3,
                                                    random_state = 34,
                                                    stratify = df['cat'])

In [40]:
X_train.head()

17259    Sugar Knocker Ayurvedic Medicine for Diabetes,...
41642    Generic High Gain 16dBi 2.4GHz Wifi Yagi Anten...
25176    Mathematics Formulae & Definitions (RPH Pocket...
9020     VRCT Classic Off-White Khadi Conical Shade and...
13333    V Guard VIC-15 2000-Watt Induction Cooktop (Bl...
Name: text, dtype: object

In [33]:
y_train.value_counts()

0    15449
1     9456
2     8497
3     6937
Name: cat, dtype: int64

In [34]:
y_test.value_counts()

0    3863
1    2364
2    2124
3    1734
Name: cat, dtype: int64
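
A quick check of what `stratify` buys us: the class shares in the two splits should be nearly identical. The sketch below only redoes the arithmetic on the counts printed above; the helper `proportions` is a hypothetical name.

```python
# Counts copied from the y_train / y_test value_counts() outputs
train_counts = {0: 15449, 1: 9456, 2: 8497, 3: 6937}
test_counts = {0: 3863, 1: 2364, 2: 2124, 3: 1734}

def proportions(counts):
    """Convert absolute class counts into shares of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

train_share = proportions(train_counts)
test_share = proportions(test_counts)

# Largest difference in class share between the two splits
max_gap = max(abs(train_share[k] - test_share[k]) for k in train_counts)

for label in train_counts:
    print(label, round(train_share[label], 3), round(test_share[label], 3))
```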

In [50]:
# now training this data on different models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

In [None]:
pipe = Pipeline([
    ('tf-idf', TfidfVectorizer()),  # text -> sparse TF-IDF matrix
    ('model', LinearSVC())          # linear classifier on those features
])

pipe.fit(X_train,y_train)
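
To see what the pipeline does end to end without waiting for the full dataset, here is a small sketch on an invented four-document training set (the texts, the labels, and the name `toy_pipe` are assumptions for illustration only). It needs scikit-learn installed, like the rest of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical mini training set: 2 = Electronics, 3 = Clothing & Accessories
texts = [
    "wireless bluetooth headphones with mic",
    "usb charging cable for phone",
    "cotton shirt for men",
    "women denim jacket blue",
]
labels = [2, 2, 3, 3]

toy_pipe = Pipeline([
    ("tf-idf", TfidfVectorizer()),
    ("model", LinearSVC(random_state=0)),
])
toy_pipe.fit(texts, labels)

# Raw strings go in; the vectorizer step transforms them before the classifier sees them.
predictions = toy_pipe.predict(["bluetooth headphones", "denim jacket"])
print(predictions)
```

Bundling the vectorizer into the pipeline means the vocabulary and IDF weights are learned from the training split only, and the same transformation is reused at prediction time.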

In [None]:
from sklearn.metrics import classification_report, confusion_matrix


pred = pipe.predict(X_test)

print(classification_report(y_test,pred))

In [61]:
cm = confusion_matrix(y_test,pred)
print(cm)


[[5694   33   43   24]
 [  59 3458   18   11]
 [  70   29 3086    2]
 [  16   10    6 2569]]
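
The matrix can be read by hand: rows are the true classes and columns the predicted ones, in label order 0 to 3. As a sketch, the snippet below recomputes accuracy, precision and recall directly from the numbers printed above (about 97.9% of the test items fall on the diagonal). The variable names are my own.

```python
# Values copied from the printed confusion matrix
cm = [
    [5694, 33, 43, 24],
    [59, 3458, 18, 11],
    [70, 29, 3086, 2],
    [16, 10, 6, 2569],
]
labels = ["Household", "Books", "Electronics", "Clothing & Accessories"]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal = correct predictions
accuracy = correct / total

for i, name in enumerate(labels):
    recall = cm[i][i] / sum(cm[i])                    # row i = items whose true class is i
    precision = cm[i][i] / sum(row[i] for row in cm)  # column i = items predicted as i
    print(f"{name}: precision={precision:.3f}, recall={recall:.3f}")
print(f"accuracy={accuracy:.3f}")
```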


In [None]:
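import seaborn as sns
import matplotlib.pyplot as plt

# class names in label order 0-3, matching the category mapping
class_names = ['Household', 'Books', 'Electronics', 'Clothing & Accessories']
sns.heatmap(
    cm,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=class_names,
    yticklabels=class_names
)
plt.xlabel('Predicted')  # confusion_matrix columns are predicted labels
plt.ylabel('Actual')     # rows are true labels
plt.show()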