This notebook applies the TF-IDF method to vectorize product descriptions.

TF-IDF (Term Frequency-Inverse Document Frequency) was originally proposed for document search and information retrieval. The score of a word increases proportionally to the number of times it appears in a document, but is offset by the number of documents that contain it. Thus, words that are common in every document, such as "the", "is", and "of", rank low even though they may appear many times, because they carry little information about any particular document. On the other hand, if the word "Notebook", for example, appears frequently in one document but rarely in the others, it can be considered relevant to that document.

With regard to the metrics, it is important to consider that:


*   The Term Frequency can be calculated in several ways, the simplest being a raw count of how many times a word appears in a document. This count can then be adjusted by the length of the document, or by the raw frequency of the most frequent word in the document.
*   The Inverse Document Frequency indicates how common or rare a word is across the entire document set. It is obtained by taking the total number of documents, dividing it by the number of documents that contain the word, and calculating the logarithm of the result. If the word is very common and appears in almost every document, the ratio approaches 1 and its logarithm approaches 0. The rarer the word, the larger the value, up to log(N) for a word that appears in only one of N documents.
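The two metrics above can be sketched by hand in plain Python. The toy corpus and the helper names `term_frequency` and `inverse_document_frequency` are illustrative assumptions, not part of this notebook's data or of any library. The IDF uses the plain log(N / df) form described above.

```python
import math

# Hypothetical toy corpus, not taken from the dataset
docs = [
    "notebook lenovo ideapad",
    "notebook dell inspiron",
    "carregador usb apple",
    "capa notebook preto",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)


def term_frequency(term, tokens):
    """Raw count of the term, adjusted by the document length."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term):
    """Plain log(N / df); only valid for terms present in the corpus."""
    df = sum(term in tokens for tokens in tokenized)
    return math.log(n_docs / df)


idf_notebook = inverse_document_frequency("notebook")  # appears in 3 of 4 documents
idf_usb = inverse_document_frequency("usb")            # appears in 1 of 4 documents

# TF-IDF of two words from the first document
first_doc = tokenized[0]
tfidf_notebook = term_frequency("notebook", first_doc) * idf_notebook
tfidf_lenovo = term_frequency("lenovo", first_doc) * inverse_document_frequency("lenovo")

print(f"idf(notebook) = {idf_notebook:.3f}")  # 0.288, common word
print(f"idf(usb)      = {idf_usb:.3f}")       # 1.386, rare word
print(f"tfidf(notebook, doc 0) = {tfidf_notebook:.3f}")
print(f"tfidf(lenovo, doc 0)   = {tfidf_lenovo:.3f}")
```

Both words appear once in the first document, yet "lenovo" scores higher because it occurs in fewer documents. Note that scikit-learn's `TfidfVectorizer` does not apply this plain form by default: with `smooth_idf=True` it computes ln((1 + N) / (1 + df)) + 1 and then L2-normalizes each row, so its values will differ from this sketch.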



## TF-IDF Vectorizer

In [None]:
# libs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF transformation
vectorizer = TfidfVectorizer()

## base_pt_en_full

Dataset with Portuguese product descriptions and a lemmatized corpus obtained through an English NLP pipeline.

In [None]:
base_pt_en_full = pd.read_csv('base_pt_en_full.csv', index_col=0)
base_pt_en_full

Unnamed: 0,produto,language,text_lem_en,vec_pt_en,word_count_text_lem_en,char_count_text_lem_en,avg_word_length_text_lem_en,Departamento,preco_inteiro,preco_decimal,...,qtd_reviews,rating,tag,url_produto,preco,rating_ajustado,frete_gratis,prazo_ajustado,frete_preco,frete_gratis_binario
0,Notebook Lenovo Ultrafino ideapad S145 i5-1035...,bg,notebook lenovo ultrafino ideapad windows dolb...,[ 3.20744514e-02 -6.18032664e-02 1.48132816e-...,11,70,6.363636,Computadores e Informática,"3.699,",0.0,...,,"4,4 de 5 estrelas","[<span class=""a-badge-text"" data-a-badge-color...",,3699.00,4.4,,,,0
1,Impressora multifuncional HP DeskJet Ink Advan...,pt,impressora multifuncional hp deskjet ink advan...,[-1.21643379e-01 2.40051225e-01 1.67172253e-...,8,49,6.125000,Computadores e Informática,378,0.0,...,,"4,6 de 5 estrelas",[],,378.00,4.6,,,,0
2,Multifuncional Epson EcoTank L3150 - Tanque de...,pt,multifuncional epson ecotank tanque de tinta c...,[ 3.48832496e-02 -5.45841753e-02 4.45314199e-...,12,66,5.500000,Computadores e Informática,"1.109,",0.0,...,,"4,8 de 5 estrelas","[<span class=""a-badge-text"" data-a-badge-color...",,1109.00,4.8,,,,0
3,"Suporte para Notebook, OCTOO, Uptable, UP-BL, ...",pt,suporte para notebook octoo uptable bl preto,[ 2.78077155e-01 -4.09961402e-01 4.46318574e-...,7,38,5.428571,Computadores e Informática,45,90.0,...,,"4,4 de 5 estrelas","[<span class=""a-badge-text"" data-a-badge-color...",,45.90,4.4,,,,0
4,"Apple iPad 8ª Geração 10.2"", Wi-Fi, 128GB Spac...",pt,apple ipad geracao wi fi space gray,[ 9.22142789e-02 9.51300040e-02 8.22614599e-...,7,29,4.142857,Computadores e Informática,"3.378,",99.0,...,,"4,8 de 5 estrelas",[],,3378.99,4.8,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1211,Carregador Usb-c 18w Fast Charger Apple (Branco),en,carregador usb fast charger apple branco,[-4.77391593e-02 1.27128333e-01 -1.75883230e-...,6,35,5.833333,Celulares e Comunicação,,,...,,,[],,,,,,,0
1212,"Smartphone Samsung Galaxy A21s 32GB Tela 6.5"" ...",pt,smartphone samsung galaxy tela camera versao g...,[ 0.007927 -0.06780162 0.267856 0.050022...,8,49,6.125000,Celulares e Comunicação,138,99.0,...,1.116,"4,7 de 5 estrelas",[],/Xiaomi-Power-Bank-20000-mah/dp/B078K1CHB5/ref...,138.00,4.7,,,10.49,0
1213,Celular Apple iPhone 11 Pro 64gb / Tela 5.8'' ...,en,celular apple iphone pro tela ios,[ 0.0282125 -0.02904633 -0.04370533 0.028462...,6,28,4.666667,Celulares e Comunicação,750,0.0,...,1.864,"4,6 de 5 estrelas",[],/Rel%C3%B3gio-Smartwatch-Amazfit-Amoled-Vers%C...,750.00,3.8,,,20.06,0
1214,Carregador Rápido Samsung sem Fio Pad II 2019 ...,pt,carregador rapido samsung sem fio pad preto or...,[ 0.18468538 -0.10956886 -0.003615 -0.078803...,8,45,5.625000,Celulares e Comunicação,78,30.0,...,541.000,"4,4 de 5 estrelas",[],/Celular-Play-C%C3%A2mera-Multilaser-Preto/dp/...,78.00,4.2,,,10.98,0


In [None]:
# Take the lemmatized corpus as the input text
X_pt_en = base_pt_en_full['text_lem_en']

In [None]:
# Product category distribution
base_pt_en_full['Departamento'].value_counts()

Computadores e Informática    372
Eletrodomésticos              354
Eletroportáteis               262
Celulares e Comunicação       228
Name: Departamento, dtype: int64

In [None]:
# Defines 'Departamento' as target
y_pt_en = pd.DataFrame(base_pt_en_full['Departamento'].copy(deep=True), columns=['Departamento'])

In [None]:
# Defines the class of interest (one-vs-rest problem): 1 for 'Celulares e Comunicação', 0 otherwise
y_pt_en['Target'] = 0
y_pt_en.loc[y_pt_en['Departamento']=='Celulares e Comunicação','Target'] = 1
y_pt_en = y_pt_en.pop('Target')

In [None]:
# Split dataset into train and test subsets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_pt_en, y_pt_en, test_size=0.2, random_state=1)
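The positive class is a minority here (228 of the 1,216 products counted above), and the split above does not request stratification. As a hedged sketch only, the `stratify` parameter of `train_test_split` keeps the class proportions equal in both subsets. The toy data and the names `toy_X`, `toy_y`, `X_tr`, `X_te`, `y_tr`, and `y_te` are hypothetical and do not come from this notebook.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 50 documents, 10 of them in the positive class (20%)
toy_X = [f"doc {i}" for i in range(50)]
toy_y = [1] * 10 + [0] * 40

# stratify=toy_y asks for the same class ratio in the train and test subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1, stratify=toy_y
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```

With these numbers the test subset receives 2 positives out of 10 documents and the training subset 8 out of 40, both matching the original 20% ratio. Whether stratification is appropriate for this dataset is a judgment call left to the analysis.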

In [None]:
# Fits TF-IDF on the training set (learns vocabulary and IDF weights) and transforms it
X_train_tfidf_pt_en = vectorizer.fit_transform(X_train.astype(str))
X_train_tfidf_pt_en.todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
# Transforms the test set with the vocabulary and IDF weights learned from the training set
X_test_tfidf_pt_en = vectorizer.transform(X_test.astype(str))
X_test_tfidf_pt_en.todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
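The dense outputs above are almost entirely zeros, which hides what `fit_transform` and `transform` actually do. The following minimal sketch uses a hypothetical toy corpus (`train_docs`, `test_docs`, and `toy_vectorizer` are illustrative names, not objects from this notebook) to show that the vocabulary is learned from the training text only, and that words seen only in the test text are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for X_train and X_test
train_docs = ["notebook lenovo ideapad", "carregador usb apple"]
test_docs = ["notebook apple macbook"]  # "macbook" never appears in training

toy_vectorizer = TfidfVectorizer()
train_matrix = toy_vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = toy_vectorizer.transform(test_docs)        # reuses them, learns nothing new

# One column per word seen during fitting
print(sorted(toy_vectorizer.vocabulary_))

# Both matrices share the same number of columns
print(train_matrix.shape, test_matrix.shape)

# Number of stored non-zero entries in the test row: only "notebook" and "apple" count
print(test_matrix.nnz)
```

Because the results are sparse matrices, inspecting `.shape` and `.nnz` is usually more informative than calling `.todense()`, which materializes every zero.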