## TF-IDF embedding
TF-IDF (term frequency-inverse document frequency) embedding is a technique used to represent text documents as numerical vectors. It is commonly used for natural language processing tasks such as text classification, information retrieval and text similarity comparison. The idea behind TF-IDF is to give more weight to words that are more informative and less common within a given corpus of documents.
The TF-IDF score of a word in a document is the product of its term frequency (TF) and its inverse document frequency (IDF). In the basic formulation, TF is the number of times the word appears in the document, and IDF is the logarithm of the ratio of the total number of documents to the number of documents that contain the word. A word that appears in many documents has a low IDF and therefore a low TF-IDF score, while a word concentrated in a few documents has a high IDF and therefore a high TF-IDF score. Libraries often use variants of this formula: scikit-learn's `TfidfVectorizer`, used below, by default smooths the IDF as ln((1 + N) / (1 + df)) + 1 and scales each document vector to unit L2 norm.
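The basic definition can be worked through by hand on a small example. The sketch below is illustrative only: the three-document corpus and the names `doc_freq` and `tfidf_scores` are made up for this example, and it uses the raw-count TF and unsmoothed log IDF described above, not scikit-learn's normalized variant.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "travel poker chips",
    "poker chips case",
    "hockey backpack",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))


def tfidf_scores(tokens):
    """Return {word: TF * IDF} using raw counts and log(N / df)."""
    term_freq = Counter(tokens)
    return {
        word: count * math.log(n_docs / doc_freq[word])
        for word, count in term_freq.items()
    }


scores = [tfidf_scores(tokens) for tokens in tokenized]
for doc, doc_scores in zip(docs, scores):
    print(doc, doc_scores)
```

In the first document, "travel" occurs in only one of the three documents and so scores higher than "poker" and "chips", which each occur in two.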
TF-IDF embeddings are useful for text classification because they give a model features that reflect how important each word is to a document. They can also be used for information retrieval: the documents most relevant to a query are found by comparing the query's TF-IDF vector with each document's TF-IDF vector, for example by cosine similarity.
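As a rough illustration of retrieval by vector comparison, the sketch below ranks a toy corpus against a query using cosine similarity. The corpus, the query, and the variable names are assumptions made for this example. It uses `TfidfVectorizer` with default settings, not the configuration used later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (hypothetical, for illustration only)
corpus = [
    "travel poker chips",
    "40 piece poker chips",
    "hockey backpack with padded straps",
]

# Fit on the corpus, then project the query into the same vector space
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform(["poker chips"])

# One similarity score per document, then document indices from best to worst
similarities = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = similarities.argsort()[::-1]

for doc_index in ranking:
    print(f"{similarities[doc_index]:.3f}  {corpus[doc_index]}")
```

The query must be transformed with the already fitted vectorizer (`transform`, not `fit_transform`) so that its columns line up with those of the document vectors.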

citation: https://github.com/omkar34/products-recommendation/blob/master/product_recommend.ipynb


Import libraries

In [1]:
# Import a submodule under a short alias
import matplotlib.pyplot as plt  # Widely used plotting library that many others build on

# Import full packages under short aliases
import numpy as np  # Common package for numerical methods
import pandas as pd  # Common package for data storage and manipulation

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Load the product detail data

In [2]:
product_detail_detail_path = "./clean_data/cleaned_products_detailed.csv"

In [3]:
df_product_standard = pd.read_csv(product_detail_detail_path)

In [4]:
df_product_standard

Unnamed: 0.1,Unnamed: 0,ctr_product_num,attribute_id,attr_value_mdm_seq_num,attr_lov_value_id,attr_value_en_txt,attr_value_en_sentence
0,36,4000,FEATURES_BENEFITS_DLR_TXT,1,,,
1,98,5044,FEATURES_BENEFITS_DLR_TXT,1,,Travel poker chips,Travel poker chips
2,99,5045,FEATURES_BENEFITS_DLR_TXT,1,,40 piece poker chips,40 piece poker chips
3,1739,21465,FEATURES_MUTLI_CD,1,NO_ADVANCED_FEATURES,,
4,1741,21466,FEATURES_MUTLI_CD,1,NO_ADVANCED_FEATURES,,
...,...,...,...,...,...,...,...
1559099,1981940,8997339,FEATURES_BENEFITS_DLR_TXT,2,,Hockey backpack folds away into a separate zip...,Heavy-duty 600D Polyester and WR-coated materi...
1559100,1981941,8997339,FEATURES_BENEFITS_DLR_TXT,3,,Hideaway padded backpack straps double as regular,Heavy-duty 600D Polyester and WR-coated materi...
1559101,1981942,8997339,FEATURES_BENEFITS_DLR_TXT,4,,Multiple grab handles,Heavy-duty 600D Polyester and WR-coated materi...
1559102,1981943,8997339,FEATURES_BENEFITS_DLR_TXT,5,,Large main compartment with separate floating ...,Heavy-duty 600D Polyester and WR-coated materi...


Keep the product number and description columns, then drop missing values, duplicate rows, and placeholder descriptions

In [5]:
# Keep only the product number and its description sentence
df_detailed = df_product_standard[['ctr_product_num', 'attr_value_en_sentence']]
# Drop rows with missing values, then exact duplicate rows
df_detailed = df_detailed.dropna()
df_detailed = df_detailed.drop_duplicates()
# Remove placeholder descriptions that carry no product information
placeholders = ['Features and Benefits not loaded', 'Features and benefits not loaded', 'Features and Benefits not loaded,', 'NaN']
df_detailed = df_detailed[~df_detailed.attr_value_en_sentence.isin(placeholders)]
df_detailed

Unnamed: 0,ctr_product_num,attr_value_en_sentence
1,5044,Travel poker chips
2,5045,40 piece poker chips
6,22726,General Tire GMAX UHP; Features: Wide circumfe...
16,31702,Top 3 Vehicle Applications: Toyota Tercel (199...
19,31703,Top 3 Vehicle Applications: Hyundai Accent (20...
...,...,...
1559079,8997335,Reliable performance through consistency and u...
1559082,8997336,Reliable performance through consistency and u...
1559085,8997337,Made from heavy duty 600 denier water-resistan...
1559091,8997338,Strong 600D polyester exteriorRemovable divide...
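The filter above lists each spelling of the placeholder text explicitly. One possible alternative, sketched here on a made-up toy frame, is to normalize the text before comparing so that differences in case, surrounding whitespace, and trailing commas are handled together. The names `toy`, `normalized`, and `placeholder_keys` are hypothetical, and whether this normalization suits the real data is an assumption.

```python
import pandas as pd

# Toy frame (hypothetical rows, for illustration only)
toy = pd.DataFrame(
    {
        "ctr_product_num": [4000, 5044, 5045, 21465, 21466],
        "attr_value_en_sentence": [
            "Features and Benefits not loaded",
            "Travel poker chips",
            "40 piece poker chips",
            "Features and benefits not loaded,",
            "NaN",
        ],
    }
)

# Normalize only for the comparison; the original text is left untouched
normalized = (
    toy["attr_value_en_sentence"]
    .str.strip()        # remove surrounding whitespace
    .str.rstrip(",")    # remove trailing commas
    .str.casefold()     # case-insensitive comparison
)
placeholder_keys = {"features and benefits not loaded", "nan"}

cleaned = toy[~normalized.isin(placeholder_keys)]
print(cleaned)
```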


Create the TF-IDF vectorizer. It extracts unigrams, bigrams, and trigrams (sequences of one, two, and three words), removes English stop words, and drops terms that appear in fewer than 1000 documents

In [6]:
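# Configure the TF-IDF vectorizer: 1- to 3-word n-grams, English stop words removed,
# and only terms found in at least 1000 documents
tf = TfidfVectorizer(ngram_range=(1, 3), min_df=1000, stop_words='english', analyzer='word')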
tf

Create matrix

In [7]:
# Fit the vocabulary and build the sparse TF-IDF matrix (one row per description)
tf_matrix = tf.fit_transform(df_detailed['attr_value_en_sentence'])


In [8]:
tf_matrix.shape

(235390, 1378)

In [9]:
print(tf_matrix)

  (1, 893)	0.6515558126248835
  (1, 36)	0.7586007006553104
  (2, 1318)	0.22507828397415125
  (2, 597)	0.2477503935400904
  (2, 927)	0.26346949187274754
  (2, 140)	0.32608032199000436
  (2, 1254)	0.5326143170226687
  (2, 1134)	0.2779408632928911
  (2, 559)	0.3276555624672216
  (2, 1341)	0.22182437410443015
  (2, 453)	0.18037737203806478
  (2, 1231)	0.2246880095023191
  (2, 536)	0.33421685526266504
  (3, 88)	0.8648729507786918
  (3, 1294)	0.5019908156643492
  (4, 88)	0.8648729507786918
  (4, 1294)	0.5019908156643492
  (6, 1242)	0.33622536162706224
  (6, 1333)	0.33053463254801335
  (6, 413)	0.4163614938518135
  (6, 262)	0.3738739817556827
  (6, 458)	0.3281215547900794
  (6, 1217)	0.29625025420381146
  (6, 826)	0.2909791588234598
  (6, 1254)	0.32826277966882755
  :	:
  (235387, 390)	0.2411671676225766
  (235387, 130)	0.24903871124663973
  (235387, 74)	0.27384805561027714
  (235387, 1022)	0.1917850797043513
  (235387, 1318)	0.19538851635031831
  (235388, 1377)	0.3492897750623917
  (235388, 

In [33]:
tf_df = pd.DataFrame(tf_matrix.toarray(),columns = tf.get_feature_names_out())

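The resulting DataFrame has one row per product description and one column per term in the vocabulary. The following cells attach the product number to each row and set it as the index before saving.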

In [32]:
df_detailed.reset_index(inplace=True)

In [35]:
tf_df = tf_df.join(df_detailed['ctr_product_num'])

In [36]:
tf_df.set_index('ctr_product_num', inplace=True)

In [38]:
tf_df.to_csv("embeddings/TFIDF_all.csv")