# RAPIDS cuML TfidfVectorizer and KNN to find similar Text and Images
In this notebook we use RAPIDS cuML's TfidfVectorizer and KNN to find items with similar titles and items with similar images. First we use TfidfVectorizer to extract a text embedding from each item's title and compare the embeddings with cuML KNN. Next we extract an image embedding for each item with EfficientNetB0 (EffNetB0) and compare those with cuML KNN as well.
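To make the "compare embeddings with KNN" step concrete, here is a minimal CPU-only sketch of a brute-force nearest-neighbour search. It is not cuML's implementation: the helper names, the toy vectors, and the choice of cosine similarity are all assumptions made for illustration, and cuML's `NearestNeighbors` may be configured with a different metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_neighbors(embeddings, k):
    """For each row, return the indices of the k most similar rows (itself included)."""
    result = []
    for query in embeddings:
        sims = [cosine_similarity(query, other) for other in embeddings]
        # Sort row indices from most to least similar to the query row.
        order = sorted(range(len(embeddings)), key=lambda j: -sims[j])
        result.append(order[:k])
    return result

# Hypothetical toy embeddings: rows 0 and 1 point in nearly the same direction, row 2 does not.
toy_embeddings = [[1.0, 0.0, 1.0], [0.9, 0.1, 1.0], [0.0, 1.0, 0.0]]
toy_neighbors = nearest_neighbors(toy_embeddings, k=2)
print(toy_neighbors)
```

Each item is its own closest neighbour, so a search for `k` neighbours yields at most `k - 1` other candidate matches. The brute-force loop is quadratic in the number of items, which is why the notebook runs this step on the GPU.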

## Load Libraries

In [6]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import cv2, matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.applications import EfficientNetB0
print('TF',tf.__version__)

TF 2.4.1


In [7]:
# RESTRICT TENSORFLOW TO 12GB OF GPU RAM
# SO THAT WE HAVE GPU RAM FOR RAPIDS CUML KNN
LIMIT = 12
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024*LIMIT)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    print(e)
print('Restrict TensorFlow to max %iGB GPU RAM'%LIMIT)
# NOTE: THE 16 BELOW ASSUMES A GPU WITH 16GB OF RAM
print('so RAPIDS can use %iGB GPU RAM'%(16-LIMIT))

1 Physical GPUs, 1 Logical GPUs
Restrict TensorFlow to max 12GB GPU RAM
so RAPIDS can use 4GB GPU RAM


## Load Train Data
In this competition, each item has an image and a title. For the train data, the column `label_group` gives the ground truth of which items are similar. We need to build a model that finds these similar items from their images and title text. In this notebook we explore some tools to help us.
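As a rough illustration of how `label_group` encodes the ground truth, the sketch below builds the list of matching items for every posting. The rows are made-up toy values rather than real data, and the `targets` name is a hypothetical choice for this example.

```python
from collections import defaultdict

# Hypothetical rows of (posting_id, label_group); the real values come from train.csv.
toy_rows = [
    ("train_a", 111),
    ("train_b", 222),
    ("train_c", 111),
    ("train_d", 222),
    ("train_e", 333),
]

# Collect the posting_ids that share each label_group.
members = defaultdict(list)
for posting_id, label_group in toy_rows:
    members[label_group].append(posting_id)

# The target for each item is every posting_id in its group, itself included.
targets = {posting_id: members[label_group] for posting_id, label_group in toy_rows}
print(targets["train_a"])
```

A model's predicted matches for an item can then be scored against its entry in `targets`.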

In [None]:
train_df = pd.read_csv('../input/shopee-product-matching/train.csv')
print('train shape is', train_df.shape )
train_df.head()

In [None]:
# let's see the number of unique label groups we have
n_groups = train_df['label_group'].nunique()
print("Number of unique groups is {} out of {} total records".format(n_groups, len(train_df)))

This means that, on average, a group contains 3-4 items.
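The average group size is the number of records divided by the number of unique groups. A small sketch with made-up labels (not the competition data) shows the arithmetic, and also shows that the mean can hide a skewed distribution of group sizes:

```python
from collections import Counter

# Hypothetical label_group values for six items.
toy_labels = [111, 222, 111, 222, 333, 111]

group_sizes = Counter(toy_labels)                   # items per group
average_size = len(toy_labels) / len(group_sizes)   # records / unique groups
largest_size = max(group_sizes.values())

print("average items per group:", average_size)
print("largest group:", largest_size)
```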

In [None]:
train_df.info()

The dataset is clean: it contains no null values.

## Displaying Random Images from Training Dataset 

In [None]:
BASE = '../input/shopee-product-matching/train_images/'

def display_image(train_df, rows, cols, path=BASE):
    """Show rows x cols randomly chosen images with their wrapped titles."""
    for _ in range(rows):
        plt.figure(figsize=(20,5))
        for j in range(cols):
            row = np.random.randint(0, len(train_df)) # picking a random row
            image_file = train_df.iloc[row]['image']
            # cv2 loads images as BGR; convert to RGB so matplotlib shows true colors
            image_decoded = cv2.cvtColor(cv2.imread(path + image_file), cv2.COLOR_BGR2RGB)
            title = train_df.iloc[row]['title']
            
            # insert a line break every 20 characters so long titles fit above the image
            text_ordering = ""
            for idx, ch in enumerate(title):
                text_ordering += ch
                if idx != 0 and idx % 20 == 0:
                    text_ordering += '\n'
                    
            ## plot it out!
            plt.subplot(1,cols,j+1) # subplot takes (rows, columns, nth plot)
            plt.title(text_ordering)
            plt.axis("off")
            plt.imshow(image_decoded)
            
        plt.show()


display_image(train_df,4,6)

## Displaying some Group items

In [None]:
groups = train_df.label_group.value_counts()
plt.figure(figsize=(20,5))
plt.bar(groups.index.values[:50].astype(str), groups.values[:50])
plt.xticks(rotation = 50)
plt.xlabel("Group Label")
plt.ylabel("Number of items in the group")
plt.title("Top 50 groups by number of items")
plt.show()

In [None]:
for k in range(5):
    print("#"*40)
    print("### Group {} with label: {}".format(k+1, groups.index.values[k].astype(str)))
    print("#"*40)
    Kth_df = train_df.loc[train_df.label_group == groups.index[k]]
    display_image(Kth_df, 2, 4)
    print("\n\n")

# Finding Similar Titles

In [2]:
!pip install cuml



In [4]:
import cuml, cudf, cupy
from cuml.feature_extraction.text import TfidfVectorizer
from cuml.neighbors import NearestNeighbors
print("RAPIDS CuML version: ", cuml.__version__)

RAPIDS CuML version:  0.16.0


In [10]:
# read with cuDF so the dataframe lives on the GPU, as cuML's TfidfVectorizer expects
train_df_gpu = cudf.read_csv("../input/shopee-product-matching/train.csv")
print("Shape of dataset: ", train_df_gpu.shape)
train_df_gpu.head()

Shape of dataset:  (34250, 5)


| | posting_id | image | image_phash | title | label_group |
|---|---|---|---|---|---|
| 0 | train_129225211 | 0000a68812bc7e98c42888dfb1c07da0.jpg | 94974f937d4c2433 | Paper Bag Victoria Secret | 249114794 |
| 1 | train_3386243561 | 00039780dfc94d01db8676fe789ecd05.jpg | af3f9460c2838f0f | Double Tape 3M VHB 12 mm x 4,5 m ORIGINAL / DO... | 2937985045 |
| 2 | train_2288590299 | 000a190fdd715a2a36faed16e2c65df7.jpg | b94cb00ed3e50f78 | Maling TTS Canned Pork Luncheon Meat 397 gr | 2395904891 |
| 3 | train_2406599165 | 00117e4fc239b1b641ff08340b429633.jpg | 8514fc58eafea283 | Daster Batik Lengan pendek - Motif Acak / Camp... | 4093212188 |
| 4 | train_3369186413 | 00136d1cf4edede0203f32f05f660588.jpg | a6f319f924ad708c | Nescafe \xc3\x89clair Latte 220ml | 3648931069 |


### Extracting Text Embeddings using TfidfVectorizer
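To give a feel for what a TF-IDF vectorizer with `binary=True` produces, here is a small pure-Python sketch. It is an approximation for illustration, not cuML's code. It assumes the scikit-learn-style defaults of a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation. It splits titles on whitespace and skips stop-word removal, both of which are simplifications. The function name `binary_tfidf` and the toy titles are hypothetical.

```python
import math

def binary_tfidf(titles):
    """Return (vocabulary, rows) with L2-normalised binary TF-IDF weights."""
    # binary=True: only the presence of a word matters, not how often it repeats.
    docs = [set(title.lower().split()) for title in titles]
    vocabulary = sorted(set().union(*docs))
    n_docs = len(docs)
    # Smoothed inverse document frequency: rarer words get larger weights.
    idf = {}
    for word in vocabulary:
        doc_freq = sum(word in doc for doc in docs)
        idf[word] = math.log((1 + n_docs) / (1 + doc_freq)) + 1
    rows = []
    for doc in docs:
        weights = [idf[word] if word in doc else 0.0 for word in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows

# Hypothetical toy titles: the first two share words, the third shares none.
toy_titles = ["paper bag small", "paper bag large", "canned pork meat"]
vocabulary, rows = binary_tfidf(toy_titles)

# Rows are unit length, so a dot product equals the cosine similarity.
sim_01 = sum(a * b for a, b in zip(rows[0], rows[1]))
sim_02 = sum(a * b for a, b in zip(rows[0], rows[2]))
print(round(sim_01, 3), round(sim_02, 3))
```

Titles that share words get a positive similarity, and titles with no words in common get exactly zero. That property is what lets a nearest-neighbour search over these vectors surface items with similar titles.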

In [21]:
model = TfidfVectorizer(stop_words = 'english', binary=True)