### What is TF-IDF?

- TF stands for **Term Frequency**: the number of times a particular word appears in a document divided by the total number of words in that document.
          
         Term Frequency(TF) = [number of times word appeared / total no of words in a document]
 
- Term Frequency values range between 0 and 1. The more often a word occurs in a document, the closer its value is to 1.


- IDF stands for **Inverse Document Frequency**: the log of the ratio of the total number of documents (data points) in the dataset to the number of documents that contain the particular word.

         Inverse Document Frequency(IDF) = [log(Total number of documents / number of documents that contains the word)]
        
- If a word occurs in many documents and is common across the whole dataset, the ratio approaches 1, so its log (the IDF value) approaches 0.


- Finally:
         
         TF-IDF = Term Frequency(TF) * Inverse Document Frequency(IDF)
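The three formulas above can be sketched in plain Python. This is a minimal illustration of the textbook definition only; the toy `docs` list and the helper names `tf`, `idf` and `tf_idf` are made up for this example and are not part of any library.

```python
import math

# Toy corpus, already tokenized (hypothetical data for illustration only).
docs = [
    ["thor", "eating", "pizza"],
    ["loki", "eating", "pizza", "pizza"],
    ["apple", "announcing", "iphone"],
]

def tf(word, doc):
    # Share of the document's tokens that are `word`.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Log of (total documents / documents containing `word`).
    # Assumes the word occurs in at least one document.
    n_containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

pizza_score = tf_idf("pizza", docs[1], docs)    # 0.5 * ln(3/2)
iphone_score = tf_idf("iphone", docs[2], docs)  # (1/3) * ln(3/1)
print(round(pizza_score, 4), round(iphone_score, 4))
```

Although "pizza" is more frequent inside its document, "iphone" scores higher because it appears in only one document of the corpus.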

In [3]:
import spacy

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
corpus = ["Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes"]

In [12]:
v = TfidfVectorizer()  # create a TfidfVectorizer instance

In [16]:
transformed_output = v.fit_transform(corpus)  # fit learns the vocabulary and IDF weights;
                                              # transform returns one TF-IDF row per document

In [17]:
print(v.vocabulary_)

{'thor': 25, 'eating': 10, 'pizza': 22, 'loki': 17, 'is': 16, 'ironman': 15, 'ate': 7, 'already': 0, 'apple': 5, 'announcing': 4, 'new': 20, 'iphone': 14, 'tomorrow': 26, 'tesla': 24, 'model': 19, 'google': 12, 'pixel': 21, 'microsoft': 18, 'surface': 23, 'amazon': 2, 'eco': 11, 'dot': 9, 'am': 1, 'biryani': 8, 'and': 3, 'you': 27, 'are': 6, 'grapes': 13}


In [21]:
# Words in column order: these are the features, one vector entry per word.
# scikit-learn 1.2+ removed get_feature_names(); use get_feature_names_out() there.
print(v.get_feature_names())

['already', 'am', 'amazon', 'and', 'announcing', 'apple', 'are', 'ate', 'biryani', 'dot', 'eating', 'eco', 'google', 'grapes', 'iphone', 'ironman', 'is', 'loki', 'microsoft', 'model', 'new', 'pixel', 'pizza', 'surface', 'tesla', 'thor', 'tomorrow', 'you']


In [26]:
all_feature_names = v.get_feature_names()

for word in all_feature_names:
    indx = v.vocabulary_.get(word)   # column index of this word
    print(f"{word} {v.idf_[indx]}")  # learned IDF weight for that column

already 2.386294361119891
am 2.386294361119891
amazon 2.386294361119891
and 2.386294361119891
announcing 1.2876820724517808
apple 2.386294361119891
are 2.386294361119891
ate 2.386294361119891
biryani 2.386294361119891
dot 2.386294361119891
eating 1.9808292530117262
eco 2.386294361119891
google 2.386294361119891
grapes 2.386294361119891
iphone 2.386294361119891
ironman 2.386294361119891
is 1.1335313926245225
loki 2.386294361119891
microsoft 2.386294361119891
model 2.386294361119891
new 1.2876820724517808
pixel 2.386294361119891
pizza 2.386294361119891
surface 2.386294361119891
tesla 2.386294361119891
thor 2.386294361119891
tomorrow 1.2876820724517808
you 2.386294361119891
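These values are not the plain `log(N / df)` from the definition above. With its default settings (`smooth_idf=True`), `TfidfVectorizer` uses the smoothed form `ln((1 + N) / (1 + df)) + 1`. The sketch below reproduces a few of the printed numbers by hand; the document frequencies in `doc_freq` were counted manually from `corpus`, and `smoothed_idf` is a hypothetical helper name.

```python
import math

n_docs = 7  # number of sentences in `corpus`

# Number of documents containing each word, counted by hand from `corpus`.
doc_freq = {"thor": 1, "eating": 2, "announcing": 5, "is": 6}

def smoothed_idf(df, n):
    # scikit-learn default (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df)) + 1

idf_by_hand = {word: smoothed_idf(df, n_docs) for word, df in doc_freq.items()}
for word, value in idf_by_hand.items():
    print(word, value)
```

The added 1 means that even a word present in every document keeps a non-zero weight.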


In [27]:
corpus[:2]

['Thor eating pizza, Loki is eating pizza, Ironman ate pizza already',
 'Apple is announcing new iphone tomorrow']

In [29]:
transformed_output.toarray()[:2]  # fit_transform returns a sparse matrix, so convert it to a dense array
                                  # rows = TF-IDF vectors of the first 2 documents
                                  # generic word "is" (index 16): 0.115; rare word "thor" (index 25): 0.243

array([[0.24266547, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.24266547, 0.        , 0.        ,
        0.40286636, 0.        , 0.        , 0.        , 0.        ,
        0.24266547, 0.11527033, 0.24266547, 0.        , 0.        ,
        0.        , 0.        , 0.72799642, 0.        , 0.        ,
        0.24266547, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.30652086,
        0.5680354 , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.5680354 ,
        0.        , 0.26982671, 0.        , 0.        , 0.        ,
        0.30652086, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.30652086, 0.        ]])
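The second row can be checked by hand. By default `TfidfVectorizer` uses the raw count of a word as its term frequency (not the ratio from the definition at the top) and then scales each row to unit Euclidean length (`norm='l2'`). The sketch below reuses the IDF values printed earlier; the variable names are chosen for this example only.

```python
import math

# IDF values printed earlier for the words in "Apple is announcing new iphone tomorrow".
idf = {
    "apple": 2.386294361119891,
    "is": 1.1335313926245225,
    "announcing": 1.2876820724517808,
    "new": 1.2876820724517808,
    "iphone": 2.386294361119891,
    "tomorrow": 1.2876820724517808,
}

# Each word occurs exactly once in the sentence, so the raw weight is 1 * idf.
raw = {word: 1 * value for word, value in idf.items()}

# Divide every weight by the Euclidean (L2) norm of the row.
norm = math.sqrt(sum(w * w for w in raw.values()))
row = {word: w / norm for word, w in raw.items()}
print(row)
```

The results agree with the non-zero entries of the second row above: about 0.568 for "apple" and "iphone", 0.307 for "announcing", "new" and "tomorrow", and 0.270 for "is".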

### e-commerce example

In [46]:
import pandas as pd

In [47]:
df = pd.read_csv("C:/Users/Owner/nlp-tutorials/12_tf_idf/Ecommerce_data.csv")

In [48]:
df.head()

Unnamed: 0,Text,label
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household
1,"Contrast living Wooden Decorative Box,Painted ...",Household
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories


In [49]:
df.shape

(24000, 2)

In [50]:
df.label.value_counts()

Household                 6000
Books                     6000
Clothing & Accessories    6000
Electronics               6000
Name: label, dtype: int64

In [51]:
df["label_num"] = df.label.map({
    "Household":0,
    "Books":1,
    "Clothing & Accessories":2,
    "Electronics":3
})

In [52]:
df.head()

Unnamed: 0,Text,label,label_num
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household,0
1,"Contrast living Wooden Decorative Box,Painted ...",Household,0
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics,3
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories,2
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories,2


In [55]:
from sklearn.model_selection import train_test_split  

In [57]:
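# NOTE: train_test_split returns (X_train, X_test, y_train, y_test) in that order.
# With the unpacking below, the names are swapped: "X_train" holds the 20% split
# (4,800 rows) and "X_test" the 80% split (19,200 rows), as the outputs that follow show.
X_test, X_train, y_test, y_train = train_test_split(df["Text"], df["label_num"],
                                                    test_size=0.2,
                                                    random_state=2022,
                                                    stratify=df.label_num)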

In [58]:
X_train.shape

(4800,)

In [59]:
y_train.value_counts()

3    1200
2    1200
1    1200
0    1200
Name: label_num, dtype: int64

### Train classifiers on the TF-IDF features

In [61]:
from sklearn.neighbors import KNeighborsClassifier

In [62]:
from sklearn.pipeline import Pipeline

In [63]:
from sklearn.metrics import classification_report

In [68]:
clf = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", KNeighborsClassifier())    
])

clf.fit(X_train, y_train)

y_preds = clf.predict(X_test)

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.92      0.95      0.93      4800
           1       0.95      0.93      0.94      4800
           2       0.97      0.97      0.97      4800
           3       0.96      0.95      0.95      4800

    accuracy                           0.95     19200
   macro avg       0.95      0.95      0.95     19200
weighted avg       0.95      0.95      0.95     19200
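For reference, the columns of the report can be computed by hand. The sketch below shows the standard definitions of precision, recall, F1 and support for a single class on made-up labels; `per_class_scores`, `y_true_toy` and `y_pred_toy` are hypothetical names, not part of scikit-learn.

```python
def per_class_scores(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correctly predicted as `label`
    fp = sum(1 for t, p in pairs if t != label and p == label)  # wrongly predicted as `label`
    fn = sum(1 for t, p in pairs if t == label and p != label)  # missed instances of `label`

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true instances of `label`
    return precision, recall, f1, support

y_true_toy = [0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 1, 1, 1, 2]

for label in sorted(set(y_true_toy)):
    print(label, per_class_scores(y_true_toy, y_pred_toy, label))
```

The `support` column in the report is therefore just the number of true examples of each class in the evaluated split.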



In [69]:
X_test[:5]

15820    IRIS Furniture Children Deluxe Spiderman Toddl...
23276                     Rupa Thermocot Men's Thermal Top
4959     Kuchipoo Front Open Kids Thermal Top & Pyjama ...
15245    Spread Spain Metallic Gold Bar Trolley/Kitchen...
5009     LG 22 inch (55cm) LCD Monitor - Full HD, IPS P...
Name: Text, dtype: object

In [70]:
y_test[:5]

15820    0
23276    2
4959     2
15245    0
5009     3
Name: label_num, dtype: int64

In [72]:
y_preds[:5]

array([0, 2, 2, 0, 3], dtype=int64)

In [73]:
from sklearn.naive_bayes import MultinomialNB

In [74]:
clf = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    # codebasics starts with Naive Bayes for text classification problems;
    # for this problem Random Forest is tried as well
    ("classifier", MultinomialNB())
])

clf.fit(X_train, y_train)

y_preds = clf.predict(X_test)

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.92      0.96      0.94      4800
           1       0.98      0.92      0.95      4800
           2       0.97      0.98      0.97      4800
           3       0.96      0.96      0.96      4800

    accuracy                           0.95     19200
   macro avg       0.96      0.95      0.95     19200
weighted avg       0.96      0.95      0.95     19200



In [75]:
from sklearn.ensemble import RandomForestClassifier

In [77]:
clf = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", RandomForestClassifier())     
])

clf.fit(X_train, y_train)

y_preds = clf.predict(X_test)

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.92      0.92      0.92      4800
           1       0.93      0.95      0.94      4800
           2       0.96      0.97      0.96      4800
           3       0.96      0.93      0.95      4800

    accuracy                           0.94     19200
   macro avg       0.94      0.94      0.94     19200
weighted avg       0.94      0.94      0.94     19200



<h3>Use text pre-processing to remove stop words and punctuation, and apply lemmatization</h3>

In [78]:
# Utility function for pre-processing the text
import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm") 

def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens) 

In [79]:
df["cleaned_text"] = df["Text"].apply(preprocess)

In [80]:
df.head()

Unnamed: 0,Text,label,label_num,cleaned_text
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household,0,Urban Ladder Eisner Low Study Office Computer ...
1,"Contrast living Wooden Decorative Box,Painted ...",Household,0,contrast live Wooden Decorative Box Painted Bo...
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics,3,IO Crest SY PCI40010 PCI RAID Host Controller ...
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories,2,ISAKAA Baby Socks bear 8 Years- Pack 4 6 8 12 ...
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories,2,Indira Designer Women Art Mysore Silk Saree Bl...


In [81]:
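# NOTE: train_test_split returns (X_train, X_test, y_train, y_test); with this unpacking
# order the variable named X_train actually holds the 20% split and X_test the 80% split.
X_test, X_train, y_test, y_train = train_test_split(df["cleaned_text"], df["label_num"],
                                                    test_size=0.2,
                                                    random_state=2022,
                                                    stratify=df.label_num)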

In [82]:
clf = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", MultinomialNB())    # Naive Bayes again, since it performed well on the raw text
])

clf.fit(X_train, y_train)

y_preds = clf.predict(X_test)

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.92      0.96      0.94      4800
           1       0.98      0.92      0.95      4800
           2       0.96      0.98      0.97      4800
           3       0.96      0.97      0.96      4800

    accuracy                           0.96     19200
   macro avg       0.96      0.96      0.96     19200
weighted avg       0.96      0.96      0.96     19200



In [90]:
nlp = spacy.load("en_core_web_sm")

In [91]:
text = """
There is no need today to labor the point that a scientific approach does not
consist solely, or even mainly, in a complete system and a comprehensive
doctrine. In the formal sense the present work contains no such svstem;
instead of a complete theory it offers only material for one.
"""
text = text.replace("\n", " ").strip(" ")  # join the lines and trim surrounding spaces

In [92]:
doc = nlp(text)

In [94]:
from sklearn.feature_extraction.text import CountVectorizer

In [95]:
v = CountVectorizer(ngram_range=(1,1))

In [96]:
v.fit([text])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [97]:
clean = preprocess(text)  # the helper defined above is named preprocess, not preprocessing

In [98]:
v.fit([clean])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [103]:
words = v.vocabulary_
words.keys()

dict_keys(['need', 'today', 'labor', 'point', 'scientific', 'approach', 'consist', 'solely', 'mainly', 'complete', 'system', 'comprehensive', 'doctrine', 'formal', 'sense', 'present', 'work', 'contain', 'svstem', 'instead', 'theory', 'offer', 'material'])

In [105]:
matrice = v.transform([clean]).toarray()
matrice

array([[1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1]], dtype=int64)
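Note that the columns of this array follow the indices stored in `vocabulary_`, which are assigned in alphabetical order, while `vocabulary_.keys()` iterates in the order the words were first seen. The 2 in the second column therefore belongs to the second word alphabetically, not to the second key printed above. The following is a rough plain-Python sketch of that behaviour, using the `token_pattern` shown in the `CountVectorizer` repr; it is not scikit-learn's code, and `clean_toy` is a shortened, made-up input.

```python
import re
from collections import Counter

clean_toy = "need today labor complete system complete theory"  # shortened, hypothetical input

# Default token pattern: words of two or more characters, lowercased.
tokens = re.findall(r"(?u)\b\w\w+\b", clean_toy.lower())
counts = Counter(tokens)

# Column index = rank of the token in sorted order.
vocabulary = {token: i for i, token in enumerate(sorted(counts))}
row = [counts[token] for token in sorted(counts)]

print(list(counts))  # first-seen order: need, today, labor, complete, ...
print(vocabulary)    # sorted indices: complete -> 0, labor -> 1, ...
print(row)
```

To label the columns correctly, sort the words by their index, for example with `sorted(vocabulary, key=vocabulary.get)`.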

In [126]:
import numpy as np
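# Pair each count with its own column's word: the matrix columns follow the indices
# stored in vocabulary_ (alphabetical), not the insertion order of words.keys().
features = sorted(words, key=words.get)
df = pd.DataFrame({"word": features, "counts": matrice[0]})

In [125]:
df

Unnamed: 0,word,counts
0,approach,1
1,complete,2
2,comprehensive,1
3,consist,1
4,contain,1
5,doctrine,1
6,formal,1
7,instead,1
8,labor,1
9,mainly,1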


In [57]:
count = doc.count_by(spacy.attrs.POS)

In [62]:
count.items()

dict_items([(95, 2), (87, 2), (90, 9), (92, 11), (94, 2), (100, 4), (98, 1), (84, 7), (86, 5), (97, 5), (89, 2), (85, 4), (93, 1)])

In [59]:
doc.vocab[1].text

'IS_ALPHA'

In [64]:
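# Use names other than `v` so the loop does not overwrite the vectorizer defined earlier.
for pos_id, n in count.items():
    print(doc.vocab[pos_id].text, "|", n)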

PRON | 2
AUX | 2
DET | 9
NOUN | 11
PART | 2
VERB | 4
SCONJ | 1
ADJ | 7
ADV | 5
PUNCT | 5
CCONJ | 2
ADP | 4
NUM | 1


In [68]:
clean = [token.text for token in doc if not token.is_stop and not token.is_punct]

In [70]:
freq_tokens= {}
for token in clean:
    if token not in freq_tokens:
        freq_tokens[token] = 1
    else:
        freq_tokens[token] +=1

In [71]:
freq_tokens.items()

dict_items([('need', 1), ('today', 1), ('labor', 1), ('point', 1), ('scientific', 1), ('approach', 1), ('consist', 1), ('solely', 1), ('mainly', 1), ('complete', 2), ('system', 1), ('comprehensive', 1), ('doctrine', 1), ('formal', 1), ('sense', 1), ('present', 1), ('work', 1), ('contains', 1), ('svstem', 1), ('instead', 1), ('theory', 1), ('offers', 1), ('material', 1)])

In [72]:
freq_tokens.keys()

dict_keys(['need', 'today', 'labor', 'point', 'scientific', 'approach', 'consist', 'solely', 'mainly', 'complete', 'system', 'comprehensive', 'doctrine', 'formal', 'sense', 'present', 'work', 'contains', 'svstem', 'instead', 'theory', 'offers', 'material'])

In [75]:
max(freq_tokens.keys(), key=(lambda key: freq_tokens[key]))

'complete'
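The manual counting loop and the `max(...)` lookup can also be written with `collections.Counter` from the standard library. A minimal sketch, using a shortened, made-up token list in place of `clean`:

```python
from collections import Counter

# Shortened, hypothetical stand-in for the `clean` token list built above.
clean_tokens = ["need", "today", "complete", "system", "complete", "theory"]

freq = Counter(clean_tokens)                    # token -> number of occurrences
top_word, top_count = freq.most_common(1)[0]    # most frequent token and its count
print(top_word, top_count)
```

`most_common(n)` returns the `n` most frequent items as `(token, count)` pairs, which avoids the explicit `if`/`else` bookkeeping.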