<img src="https://i.imgur.com/gODdzoI.png">

<center><h1>-Data Preprocessing-</h1></center>

# 1. Introduction

### 🟢 Goal:
> Building a model that can identify which images contain the same product/s.

### 🟠 Challenges:
* Finding near-duplicates of the product (and NOT the image)
* Erasing the impact of the background (the area surrounding the product) 
* Using the description of the image (or the title)


### 📚 Libraries + W&B
* Create an account on https://wandb.ai
* Input your personal key of the project (mine will be secret, as it is confidential 🙃)
* You can find my project in the W&B Dashboard by [clicking here](https://wandb.ai/andrada/shopee-kaggle?workspace=user-andrada).

In [None]:
!pip install textfeatures

In [None]:
!pip install pandarallel

In [None]:
!jupyter nbextension enable --py widgetsnbextension

In [None]:
# Libraries
import wandb
import os
import cv2
from PIL import Image
import string
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import albumentations as alb
from albumentations import (RGBShift, HueSaturationValue, HorizontalFlip,
                            CLAHE, RandomCrop, RandomGamma, Rotate,
                            CenterCrop, MedianBlur, VerticalFlip, InvertImg)
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag, ne_chunk
from textblob import TextBlob
import textfeatures
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from wordcloud import WordCloud, ImageColorGenerator
from wordcloud import STOPWORDS as stopwords_wc

from pandarallel import pandarallel
# Enable progress tracking
pandarallel.initialize(progress_bar=True)

# Environment check
os.environ["WANDB_SILENT"] = "true"

# Secrets 🤫
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
personal_key_for_api = user_secrets.get_secret("wandb")

# Color scheme
my_colors = ["#EDAC54", "#F4C5B7", "#DD7555", "#B95F18", "#475A20"]

class color:
    BOLD = '\033[1m' + '\033[93m'
    END = '\033[0m'

In [None]:
! wandb login $personal_key_for_api

In [None]:
def show_values_on_bars(axs, h_v="v", space=0.4):
    '''Plots the value at the end of each bar of a seaborn barplot.
    axs: the ax of the plot
    h_v: whether the barplot is vertical ("v") or horizontal ("h")
    space: horizontal offset of the label (used for horizontal plots)'''
    
    def _show_on_single_plot(ax):
        if h_v == "v":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() / 2
                _y = p.get_y() + p.get_height()
                value = int(p.get_height())
                ax.text(_x, _y, format(value, ','), ha="center") 
        elif h_v == "h":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() + float(space)
                _y = p.get_y() + p.get_height()
                value = int(p.get_width())
                ax.text(_x, _y, format(value, ','), ha="left")

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _show_on_single_plot(ax)
    else:
        _show_on_single_plot(axs)

# 2. Data - Images

* **train**: 
    * `train_images` : the product photos (~ 32,400 files)
    * `train.csv` : the corresponding metadata; each product is assigned a label_group that marks the images with identical products.
* **test**: 
    * `test_images` : the product photos to be predicted (~ 70,000 hidden files, only 3 showing)
    * `test.csv` : the corresponding metadata

In [None]:
run = wandb.init(project="shopee-kaggle", name="image-discover")

# Read in data
train_base = "../input/shopee-product-matching/train_images"
train_df = pd.read_csv("../input/shopee-product-matching/train.csv")

print("Files in train image folder: {:,}".format(len(os.listdir(train_base))), "\n" +
      "Rows in train dataframe: {:,}".format(train_df.shape[0]))

# Log into W&B
wandb.log({'Files in train image folder': len(os.listdir(train_base)), 
           'Rows in train dataframe' : train_df.shape[0]})

## 2.1 Duplicated Images

There are 1,246 images that appear 2 or more times:
* The `title` differs for most of them
* The `label_group` is usually the same, but there are a few cases where it differs as well
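
The second point can be checked directly. Below is a minimal sketch in plain Python, not the notebook's own code: the rows, file names and group ids are made up for illustration. It flags images whose repeated rows carry more than one `label_group`.

```python
from collections import defaultdict

# Hypothetical rows standing in for train.csv: (image file name, label_group)
rows = [
    ("a.jpg", 111),
    ("a.jpg", 111),
    ("b.jpg", 222),
    ("b.jpg", 333),  # same image, different group
    ("c.jpg", 444),
]

def conflicting_images(rows):
    """Return {image: sorted groups} for images listed under 2+ label groups."""
    groups_per_image = defaultdict(set)
    for image, group in rows:
        groups_per_image[image].add(group)
    return {image: sorted(groups)
            for image, groups in groups_per_image.items()
            if len(groups) > 1}

conflicts = conflicting_images(rows)
print(conflicts)
```

On the real dataframe, the same check could probably be written as `train_df.groupby("image")["label_group"].nunique()` and filtered for values above 1.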

In [None]:
# Count how many times each image appears
image_count = train_df["image"].value_counts().reset_index()
image_count.columns = ["image", "count"]
image_count_duplicates = image_count[image_count["count"] > 1]
print("Total no. of images with duplicates: {:,}".format(len(image_count_duplicates)))

#Plot
fig, ax = plt.subplots(figsize=(16, 7))
plt.bar(x=image_count_duplicates.iloc[::16]["image"],
        height=image_count_duplicates.iloc[::16]["count"],
        color=my_colors[4])
plt.title("Duplicated Images: How many apparitions?", fontsize=20)
plt.xticks([])
plt.xlabel("Image ID", fontsize=16)
plt.ylabel("Count", fontsize=16);

In [None]:
# --- Make a custom plot to save into W&B ---

# Prepare data
n = len(image_count_duplicates.iloc[::16]["image"])
labels = ["id_" + str.zfill(str(i), 2) for i in range(n)]
values = image_count_duplicates.iloc[::16]["count"]

data = [[label, val] for (label, val) in zip(labels, values)]

# Create Table & .log() the plot
table = wandb.Table(data=data, columns = ["Image_ID", "count"])
wandb.log({"image_chart" : wandb.plot.bar(table, "Image_ID", "count",
                                          title="Duplicated Images: How many apparitions?")})

> You can check the plot in the W&B Project:
<img src="https://i.imgur.com/bmAG5Zh.png" width=650>

I also wanted to look at how these images with the same "image name" look and what truly differentiates them:
* The description usually refers to the same object, but the wording is different
* The Group ID can sometimes be different, although the image is exactly the same. This means that the text description is what indicates the category in these instances.

> **📌 Note**: `cv2.imread` loads images in `BGR` channel order (blue-green-red, the OpenCV default), while `matplotlib` expects `RGB`. If you plot the array as-is, the red and blue channels appear swapped. To correct that, convert the image with `cv2.cvtColor(image, cv2.COLOR_BGR2RGB)` before displaying it.
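
To make the channel swap concrete, here is a small sketch in plain Python. It is only an illustration on a nested list of pixel tuples (the helper `bgr_to_rgb` is hypothetical, not part of `cv2`): converting `BGR` to `RGB` just reverses the order of the three values in each pixel.

```python
def bgr_to_rgb(image):
    """Reverse the channel order of every pixel: (B, G, R) -> (R, G, B)."""
    return [[(r, g, b) for (b, g, r) in row] for row in image]

# A 1x2 "image" in BGR order: one pure-blue pixel, one pure-red pixel
bgr_image = [[(255, 0, 0), (0, 0, 255)]]
rgb_image = bgr_to_rgb(bgr_image)
print(rgb_image)
```

On the NumPy arrays returned by `cv2.imread`, slicing with `image[..., ::-1]` performs the same reversal.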

In [None]:
def get_image_info(name):
    '''Displays a photo of the image and information on it.
    name: a string containing the code of the image (.jpg format)'''
    
    # Read in the image & corresponding metadata
    sample_image = cv2.imread(train_base + "/" + name)
    sample_image = cv2.cvtColor(sample_image, cv2.COLOR_BGR2RGB)
    sample_df = train_df[train_df["image"] == name]
    
    print(color.BOLD + "Apparitions for this image:" + color.END, len(sample_df), "\n" +
          color.BOLD + "Some titles:" + color.END, sample_df["title"].value_counts().index[:5].values, "\n" +
          color.BOLD + "No. of unique groups:" + color.END, sample_df["label_group"].value_counts().shape[0])

    # Plot image
    plt.figure(figsize=(16, 7))
    plt.imshow(sample_image)
    plt.axis("off")
    plt.show();

In [None]:
# Example 1
sample_name = image_count_duplicates.iloc[0]["image"]
get_image_info(name = sample_name)

In [None]:
# Example 2
sample_name = image_count_duplicates.iloc[11]["image"]
get_image_info(name = sample_name)

In [None]:
# Example 3
sample_name = image_count_duplicates.iloc[12]["image"]
get_image_info(name = sample_name)

### 💾 Clean Duplicates Function

Hence, we'll clean these duplicates by keeping only the first appearance of each image.

In [None]:
def clean_duplicates(df, train=True):
    '''Takes the original dataframe and returns it without duplicated images.
    For train data, also converts label_group to string.'''
    
    if train:
        df["label_group"] = df["label_group"].astype(str)
    df = df.drop_duplicates(subset=['image']).reset_index(drop=True)
    
    return df

In [None]:
# Clean duplicates
train_df = clean_duplicates(df=train_df)

print("Is train metadata now the same length as the image train folder?", "\n",
      train_df.shape[0] == len(os.listdir(train_base)))

## 2.2 Label Group

> 📌**Note**: If there are 2 or more images with the same `label_group` it means that these have been already mapped as being identical.
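
As a rough illustration of what "already mapped" means, the sketch below groups items by label and lists, for each item, every item that shares its group. The ids and group names are hypothetical and the code is plain Python rather than the notebook's pandas workflow.

```python
from collections import defaultdict

# Hypothetical (item id, label_group) pairs
items = [("p1", "g1"), ("p2", "g1"), ("p3", "g2"), ("p4", "g1")]

def matches_by_group(items):
    """Map each item id to the list of all item ids sharing its label group."""
    members = defaultdict(list)
    for item_id, group in items:
        members[group].append(item_id)
    return {item_id: members[group] for item_id, group in items}

matches = matches_by_group(items)
print(matches["p1"])
print(matches["p3"])
```

An item that is alone in its group matches only itself, as `p3` does here.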

In [None]:
# Get count of values on each group
groups_df = train_df["label_group"].value_counts().reset_index()
groups_df.columns = ["group", "count"]

# Print info
print("No. of unique groups: {:,}".format(len(groups_df)), "\n" +
      "Max no. of apparitions in 1 group: {}".format(groups_df["count"].max()), "\n" +
      "Min no. of apparitions in 1 group: {}".format(groups_df["count"].min()))
wandb.log({"No. of unique groups": len(groups_df)})

#Plot
fig, ax = plt.subplots(figsize=(16, 7))
plt.bar(x=groups_df.iloc[::100]["group"],
        height=groups_df.iloc[::100]["count"],
        color=my_colors[3])
plt.title("Group Count Distribution", fontsize=20)
plt.xticks([])
plt.xlabel("Group ID", fontsize=16)
plt.ylabel("Count", fontsize=16)
plt.show();

In [None]:
# --- Make a custom plot to save into W&B ---

# Prepare data
n = len(groups_df.iloc[::100]["group"])
labels = ["id_" + str.zfill(str(i), 3) for i in range(n)]
values = groups_df.iloc[::100]["count"]

data = [[label, val] for (label, val) in zip(labels, values)]

# Create Table & .log() the plot
table = wandb.Table(data=data, columns = ["Group_ID", "count"])
wandb.log({"group_chart" : wandb.plot.bar(table, "Group_ID", "count",
                                          title="Group Count Distribution")})

> And this is the plot in the W&B Project:
<img src="https://i.imgur.com/eOHWhjQ.png" width=650>

Let's also observe the images within some of the groups:
* There is definitely a resemblance between products (for the human eye 👁)
* For some groups however, the overall structure of the images is very different

In [None]:
def get_group_info(group_name):
    '''This function shows a sample of 6 images within a group.
    group_name: a string representing the desired group code'''
    
    # Retrieve a sample of 6 images from this group
    sample_names = train_df[train_df["label_group"] == group_name]["image"].\
                    sample(6, random_state=24).values
    sample_text = train_df[train_df["label_group"] == group_name]["title"].\
                    sample(1, random_state=1).values

    # Plot
    fig = plt.figure(figsize=(16, 8))
    plt.suptitle(f"Group: {group_name}", fontsize=20)
    plt.title(f"{sample_text}", fontsize=15)
    plt.axis("off")
    for k, name in enumerate(sample_names):
        image = cv2.imread(train_base + "/" + name)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        fig.add_subplot(2, 3, k+1)
        plt.imshow(image)
        plt.axis("off")
    
    plt.show();

In [None]:
# Example 1
sample_group = groups_df["group"][3]
get_group_info(group_name=sample_group)

In [None]:
# Example 2
sample_group = groups_df["group"][99]
get_group_info(group_name=sample_group)

In [None]:
# Example 3
sample_group = groups_df["group"][300]
get_group_info(group_name=sample_group)

## 2.3 Image Augmentation

Another aspect I wanted to explore was the different kinds of augmentation that might be performed on the images, so that the model can better pick up unique patterns.

> **📌 Note**: From my research, the best performing augmentations for this type of problem were flips (vertical flip, horizontal flip, etc.), crops (center crop, random crop, etc.) and rotations, as they "display" the product in different positions without changing its color or texture attributes.
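
The claim that flips and crops leave color untouched can be illustrated without any image library. The sketch below is an assumption-laden toy: a single-channel "image" stored as a nested list and three hypothetical helpers, not the `albumentations` implementations. It shows that these operations only move or select pixels and never change their values.

```python
def horizontal_flip(image):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in image]

def vertical_flip(image):
    """Reverse the order of the rows (top-to-bottom mirror)."""
    return image[::-1]

def center_crop(image, height, width):
    """Keep a height x width window around the image center."""
    top = (len(image) - height) // 2
    left = (len(image[0]) - width) // 2
    return [row[left:left + width] for row in image[top:top + height]]

# Toy 4x4 single-channel image
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
flipped = horizontal_flip(image)
upside_down = vertical_flip(image)
cropped = center_crop(image, 2, 2)
print(flipped[0], upside_down[0], cropped)
```

Color transforms such as `RGBShift` or `InvertImg`, by contrast, rewrite the pixel values themselves.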

Below you can see an example of an image and 11 different applied augmentations.

*You can find the [albumentations documentation here](https://vfdev-5-albumentations.readthedocs.io/en/docs_pytorch_fix/api/augmentations.html).*

In [None]:
def display_augmentations(path):
    '''Displays different types of augmentations on a chosen image.
    path: the direct path to the desired image.'''
    
    # Read in original image
    original = cv2.imread(path)
    original = cv2.cvtColor(original, cv2.COLOR_BGR2RGB)

    # Transformations (p=1.0 so that each one is always applied)
    transform_rgb = alb.Compose([ RGBShift(g_shift_limit=50, p=1.0) ])
    transform_hsv = alb.Compose([ HueSaturationValue(hue_shift_limit=100, sat_shift_limit=60, p=1.0) ])
    transform_hf = alb.Compose([ HorizontalFlip(p=1.0) ])
    transform_clahe = alb.Compose([ CLAHE(clip_limit=10.0, p=1.0) ])
    transform_rc = alb.Compose([ RandomCrop(height=300, width=300, p=1.0) ])
    transform_rg = alb.Compose([ RandomGamma(gamma_limit=(200, 400), p=1.0) ])
    transform_rot = alb.Compose([ Rotate(limit=90, p=1.0) ])
    transform_cc = alb.Compose([ CenterCrop(height=450, width=450, p=1.0) ])
    transform_mb = alb.Compose([ MedianBlur(blur_limit=103, p=1.0) ])
    transform_vf = alb.Compose([ VerticalFlip(p=1.0) ])
    transform_ii = alb.Compose([ InvertImg(p=1.0) ])

    # Apply transformations
    transformed_rgb = transform_rgb(image=original)["image"]
    transformed_hsv = transform_hsv(image=original)["image"]
    transformed_hf = transform_hf(image=original)["image"]
    transformed_clahe = transform_clahe(image=original)["image"]
    transformed_rc = transform_rc(image=original)["image"]
    transformed_rg = transform_rg(image=original)["image"]
    transformed_rot = transform_rot(image=original)["image"]
    transformed_cc = transform_cc(image=original)["image"]
    transformed_mb = transform_mb(image=original)["image"]
    transformed_vf = transform_vf(image=original)["image"]
    transformed_ii = transform_ii(image=original)["image"]

    all_transformations = [original, transformed_rgb, transformed_hsv, transformed_hf, 
                           transformed_clahe, transformed_rc, transformed_rg, transformed_rot, 
                           transformed_cc, transformed_mb, transformed_vf, transformed_ii]
    all_names = ["Original", "RGBShift", "HueSaturationValue", "HorizontalFlip", "CLAHE",
                 "RandomCrop", "RandomGamma", "Rotate", "CenterCrop", "MedianBlur",
                 "VerticalFlip", "InvertImg"]
    
    # Plot
    fig = plt.figure(figsize=(20, 14))
    plt.suptitle(f"Image Augmentations", fontsize=20)
    for k, image in enumerate(all_transformations):
        fig.add_subplot(3, 4, k+1)
        plt.title(all_names[k])
        plt.imshow(image)
        plt.axis("off")

    # Log the finished figure once, after all 12 panels are drawn
    wandb.log({"Image Augmentations": plt})

    plt.show();

In [None]:
example_path = "../input/shopee-product-matching/test_images/0006c8e5462ae52167402bac1c2e916e.jpg"
display_augmentations(path = example_path)

In [None]:
# ~ END of EXPERIMENT ~
wandb.finish()
# ~~~~~~~~~~~~~~~~~~~~~

# 3. Data - Texts

## 3.1 Text Preprocessing - step by step
Before analyzing the text, we'll have to prepare it a little bit, so the insights we'll gain afterwards will be as accurate as possible.

> **❗ Disclaimer**: I chose NOT to remove numbers - as they might be very important when it comes to how many ml or how many pieces are in the package of a product. I am thinking numbers might actually give a huge insight for our prediction.
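
One caveat worth seeing in action: removing punctuation keeps the digits, but it also drops decimal points. The sketch below uses a made-up title (an assumption, not a row from the dataset) and the same `str.translate` approach as the preprocessing code.

```python
import string

def strip_punctuation(text):
    """Delete every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))

# Hypothetical product title
title = "Shampoo 2.5L (Pack of 3) - 250ml!"
cleaned = strip_punctuation(title.lower())
tokens = cleaned.split()
print(tokens)
```

The quantities `3` and `250ml` survive intact, while `2.5l` becomes `25l`, so decimal sizes may need separate handling if they turn out to matter.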

In [None]:
# Original
original = train_df["title"][100]
print(color.BOLD + "Before:" + color.END, original, "\n")

# ~~~~~ Convert to lower case ~~~~~
lower = original.lower()
print(color.BOLD + "Lower case:" + color.END, lower)

# ~~~~~ Remove punctuation ~~~~~
# string.punctuation is: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
wo_punct = lower.translate(str.maketrans('','',string.punctuation))
print(color.BOLD + "Remove punctuation:" + color.END, wo_punct)

# ~~~~~ Remove leading/trailing whitespaces ~~~~~
wo_whitespaces = wo_punct.strip()
print(color.BOLD + "Remove whitespaces:" + color.END, wo_whitespaces)

# ~~~~~ Tokenize words ~~~~~
tokenize = word_tokenize(wo_whitespaces)
print(color.BOLD + "Tokenized:" + color.END, tokenize)

# ~~~~~ Remove stopwords ~~~~~
text_wo_sw = [word for word in tokenize if not word in stopwords.words()]
print(color.BOLD + "Remove stopwords:" + color.END, text_wo_sw)

# ~~~~~ Lemmatization ~~~~~
lemmatizer = WordNetLemmatizer()
lemmatized_text = [lemmatizer.lemmatize(word) for word in text_wo_sw]
print(color.BOLD + "Lemmatization:" + color.END, lemmatized_text)

# ~~~~~ Part of speech tagging ~~~~~
pos_text = TextBlob(' '.join(lemmatized_text))
print(color.BOLD + "POS:" + color.END, pos_text.tags)

# # ~~~~~ Named entity recognition ~~~~~
# ner_text = ne_chunk(pos_tag(lemmatized_text))
# print("NER:", ner_text)

### 💾 Text Preprocessing Function
> **📌 Note**: This function takes ~30 minutes in the Kaggle environment when run with a plain `.apply()`. [Maxim Vlah](https://www.kaggle.com/maximvlah) suggested in the comments the amazing `pandarallel` library, which parallelises `.apply()` calls in `pandas` through `.parallel_apply()`. You can check the GitHub repo [here](https://github.com/nalepae/pandarallel). With it, the same function takes about 8-9 minutes.
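
One likely contributor to the runtime (an assumption, since it was not profiled here) is that `stopwords.words()` is called again for every token and returns a list, so each membership test is a linear scan. A minimal sketch of the alternative is to build the stopword set once and reuse it. The helper `make_stopword_filter` and the short stopword list are hypothetical; in the notebook, the list would come from `stopwords.words()`.

```python
def make_stopword_filter(stopword_list):
    """Build the stopword set once and return a function that reuses it."""
    stopword_set = frozenset(stopword_list)  # constant-time membership tests

    def remove_stopwords(tokens):
        return [token for token in tokens if token not in stopword_set]

    return remove_stopwords

# Hypothetical mini stopword list (English + Indonesian)
hypothetical_stopwords = ["the", "for", "and", "of", "dan", "untuk"]
remove_stopwords = make_stopword_filter(hypothetical_stopwords)

tokens = "tas ransel untuk anak dan dewasa".split()
filtered = remove_stopwords(tokens)
print(filtered)
```

This keeps the filtering logic unchanged and only moves the expensive lookup out of the per-token loop.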

**Check out preprocessing methodology below ⬇️**

In [None]:
def preprocess_title(title):
    '''Preprocesses a product title.
    title: the string that needs to be prepped.'''
    
    # Lower Case
    title = title.lower()
    # Remove Punctuation
    title = title.translate(str.maketrans('','',string.punctuation))
    # Remove leading/trailing whitespaces
    title = title.strip()
    # Tokenize
    tokens_title = word_tokenize(title)
    # Remove stopwords
    tokens_title = [word for word in tokens_title if not word in stopwords.words()]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemm_text = [lemmatizer.lemmatize(word) for word in tokens_title]
    prepped_title = ' '.join(lemm_text)

    return prepped_title
    

def get_POS(prepped_title):
    '''Gets the Part of Speech tags.
    prepped_title: the already prepped title'''
    
    # Part of speech tagging
    pos_text = TextBlob(prepped_title)
    pos_text = ' '.join([j for (i, j) in pos_text.tags])

    return pos_text

In [None]:
# Preprocess the title
train_df["title_prep"] = train_df["title"].\
                          parallel_apply(lambda x: preprocess_title(x))

In [None]:
# Add part of speech
train_df["pos"] = train_df["title_prep"].\
                          parallel_apply(lambda x: get_POS(x))

In [None]:
# Read in prepped data
train_df_prep = pd.read_csv("../input/shopee-preprocessed-data/train_title_prepped.csv")
train_df_prep["label_group"] = train_df_prep["label_group"].astype(str)

# Save also as artifact
run = wandb.init(project='shopee-kaggle', name='df_title_prepped')
artifact = wandb.Artifact(name='preprocessed', 
                          type='dataset')

artifact.add_file("../input/shopee-preprocessed-data/train_title_prepped.csv")

wandb.log_artifact(artifact)
wandb.finish()

## 3.2 Text Features Extraction

Another method was to extract features from the `title` column, in an attempt to feed into the final model more useful information. You can check out [this article](https://towardsdatascience.com/textfeatures-library-for-extracting-basic-features-from-text-data-f98ba90e3932) for more about the textfeatures library.

The extractions were:
* `word_count` : counts how many words are in a sentence
* `char_count` : counts how many characters are in a sentence
* `avg_word_length` : computes the average word length in a sentence
* `stopwords_count` : counts how many stopwords are in a sentence
* `numerics_count` : counts how many numbers are in a sentence
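
To show what these five features measure, here is a rough plain-Python approximation. It is not the `textfeatures` implementation: the helper `title_features`, the whitespace tokenisation, the tiny stopword set and the sample title are all assumptions, and the library may treat edge cases (spaces, punctuation, mixed tokens like `250ml`) differently.

```python
def title_features(title, stopword_set):
    """Approximate the five title features on a single string."""
    words = title.split()
    return {
        "word_count": len(words),
        "char_count": len(title),  # counts spaces too
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "stopwords_count": sum(w.lower() in stopword_set for w in words),
        "numerics_count": sum(w.isdigit() for w in words),  # purely numeric tokens
    }

# Hypothetical title and stopword set
features = title_features("Pack of 3 Socks for Kids 100 Cotton", {"of", "for"})
print(features)
```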

In [None]:
def extract_title_features(df_prep):
    '''Extracts features from the unprocessed title column.'''
    
    # Extract Features
    df_prep = textfeatures.word_count(df_prep, "title", "word_count")
    df_prep = textfeatures.char_count(df_prep, "title", "char_count")
    df_prep = textfeatures.avg_word_length(df_prep, "title", "avg_word_length")
    df_prep = textfeatures.stopwords_count(df_prep, "title", "stopwords_count")
    df_prep = textfeatures.numerics_count(df_prep, "title", "numerics_count")
    
    return df_prep

In [None]:
train_df_prep = extract_title_features(df_prep=train_df_prep)

### Explore the Features 

Within our `title` variable, the texts are usually ~10 words long, with ~50 characters, and contain 1 to 2 numerics.

In [None]:
title_features = ['word_count', 'char_count', 'avg_word_length',
                  'stopwords_count', 'numerics_count']

# Plot
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(16, 10), squeeze=False)
plt.suptitle(f"Title : Features Extracted", fontsize=20)
rows = [0, 0, 0, 1, 1, 1]
cols = [0, 1, 2, 0, 1, 2]
axs[1,2].set_visible(False)

for k, (name, i, j) in enumerate(zip(title_features, rows, cols)):
    sns.kdeplot(train_df_prep[name], ax=axs[i, j], color=my_colors[k],
                fill=True, lw=3)
    axs[i, j].set_title(name, fontsize=15)
    axs[i, j].set_xlabel("", fontsize=16)
    axs[i, j].set_ylabel("", fontsize=16)

## 3.3 Text Exploration

Now let's look at the newly created `title_prep`.

In [None]:
# Another W&B Experiment
run = wandb.init(project="shopee-kaggle", name="text-discover")

In [None]:
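# Get bag of words from the title
# Note: `vocabulary_` maps each term to its column index in the matrix,
# so the 'freq' column below holds that index, not an occurrence count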
title_prep = train_df_prep["title_prep"].values.astype('U')
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(title_prep)

bag_of_words = pd.DataFrame({'word' : vectorizer.vocabulary_.keys(),
                             'freq' : vectorizer.vocabulary_.values()})

# Plot
plt.figure(figsize=(16, 14))
plot = sns.barplot(data=bag_of_words.head(25).sort_values('freq', ascending=False),
                   y="word", x="freq", color=my_colors[4])
show_values_on_bars(plot, h_v="h", space=0.4)
plt.title("Example of words & frequencies", fontsize=20)
plt.yticks(fontsize=15)
plt.xticks([],)
plt.xlabel("Frequency of apparition", fontsize=16)
plt.ylabel("", fontsize=16);

> Create Custom Plot for W&B ⬇

In [None]:
# --- Make a custom plot to save into W&B ---

# Prepare data
labels = bag_of_words.head(25).sort_values('freq', ascending=False)["word"]
values = bag_of_words.head(25).sort_values('freq', ascending=False)["freq"]

data = [[label, val] for (label, val) in zip(labels, values)]

# Create Table & .log() the plot
table = wandb.Table(data=data, columns = ["Word", "Frequency"])
wandb.log({"text_chart" : wandb.plot.bar(table, "Word", "Frequency",
                                          title="Example of words & frequencies")})
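
A caveat on the numbers plotted above: in scikit-learn, `vocabulary_` maps each term to its column index in the document-term matrix, so the `freq` values are positions rather than counts. The sketch below uses a made-up corpus and plain Python (`collections.Counter` instead of a vectorizer, as an assumption-free stand-in) to contrast an alphabetical term index with actual occurrence counts.

```python
from collections import Counter

# Hypothetical preprocessed titles
titles = ["black bag", "blue bag", "black shoe", "black lace"]

# How often each term occurs across all titles
term_counts = Counter(word for title in titles for word in title.split())

# Alphabetical position of each term, similar in spirit to a column index
term_index = {term: i for i, term in enumerate(sorted(term_counts))}

print(term_counts.most_common(2))
print(term_index)
```

Here `black` occurs 3 times but sits at index 1, while `shoe` occurs once and sits at index 4. With a fitted vectorizer, summing the columns of the matrix returned by `fit_transform` would give per-term totals instead.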

**Let's also explore the `pos` (Part of Speech) column**:
* **WRB** - wh- adverb (how)
* **WP** - wh- pronoun (who)
* **VBZ** - verb, present tense with 3rd person singular (bases)
* **VBP** - verb, present tense not 3rd person singular (wrap)
* **RP** - particle (about)

You can find the [full list here](https://www.guru99.com/pos-tagging-chunking-nltk.html#:~:text=POS%20Tagging%20in%20NLTK%20is,each%20word%20of%20the%20sentence.).

In [None]:
# Get bag of words from `pos` column
# Note: as above, `vocabulary_` holds column indices, not occurrence counts
pos_prep = train_df_prep["pos"].values.astype('U')
vectorizer = CountVectorizer()
vectorizer.fit_transform(pos_prep)

bag_of_words = pd.DataFrame({'pos' : vectorizer.vocabulary_.keys(),
                             'freq' : vectorizer.vocabulary_.values()})

# Plot
plt.figure(figsize=(16, 10))
plot = sns.barplot(data=bag_of_words.head(25).sort_values('freq', ascending=False),
                   y="pos", x="freq", color=my_colors[0])
show_values_on_bars(plot, h_v="h", space=0)
plt.title("Part of Speech: Frequencies", fontsize=20)
plt.yticks(fontsize=15)
plt.xticks([],)
plt.xlabel("Frequency of apparition", fontsize=16)
plt.ylabel("", fontsize=16);

> Create Custom Plot for W&B ⬇

In [None]:
# --- Make a custom plot to save into W&B ---

# Prepare data
labels = bag_of_words.head(25).sort_values('freq', ascending=False)["pos"]
values = bag_of_words.head(25).sort_values('freq', ascending=False)["freq"]

data = [[label, val] for (label, val) in zip(labels, values)]

# Create Table & .log() the plot
table = wandb.Table(data=data, columns = ["POS", "Frequency"])
wandb.log({"pos_chart" : wandb.plot.bar(table, "POS", "Frequency",
                                          title="Part of Speech: Frequencies")})

### ☁ Wordcloud

In [None]:
# Get all titles
text_for_wc = " ".join(title for title in train_df_prep["title"])

# Wordcloud
font_path = "../input/shopee-preprocessed-data/ACETONE.otf"
stopwords_wc = set(stopwords_wc)
# stopwords_wc.update(["yes"])

wordcloud = WordCloud(stopwords=stopwords_wc, font_path=font_path,
                      max_words=4000,
                      max_font_size=200, random_state=42,
                      width=1600, height=800,
                      colormap = "spring")
wordcloud.generate(text_for_wc)

# Plot
plt.figure(figsize = (16, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show();

In [None]:
# ~ END of EXPERIMENT ~
wandb.finish()
# ~~~~~~~~~~~~~~~~~~~~~

## 3.4 Create Text Embeddings

Let's now append the `TF-IDF` & `CountVectorizer` embeddings explored above to our training dataframe.

> **📌 Note**: We'll end up with 26,705 columns (instead of the 12 we are working with now, or 5 in the original training dataset).
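
A dataframe this wide is expensive to hold in memory once the matrices are converted with `.toarray()`. The back-of-the-envelope sketch below assumes roughly 32,400 rows (the approximate image count mentioned earlier) and 8 bytes per value (64-bit floats or integers), and it ignores the handful of non-numeric columns. The helper name `dense_size_gb` is hypothetical.

```python
def dense_size_gb(n_rows, n_cols, bytes_per_value=8):
    """Approximate size of a dense numeric table in gigabytes (10^9 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e9

estimate = dense_size_gb(n_rows=32_400, n_cols=26_705)
print(f"{estimate:.1f} GB")
```

Under these assumptions the table lands at several gigabytes, which is worth keeping in mind on machines with less memory. Keeping the vectorizer output sparse would be one way to reduce that.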

In [None]:
def get_embeddings(df):
    '''Gets the word embeddings for `title_prep` and `pos` columns.
    df: dataframe which contains cleaned title & part of speech'''
    
    # Vectorizer functions don't support NAs, so we need to remove if any
    df = df[df["title_prep"].notna()].reset_index(drop=True)
    
    # `title` vectorizer
    title_vectorizer = TfidfVectorizer()
    title_matrix = title_vectorizer.fit_transform(df["title_prep"]).toarray()
    # Create dataframe
    title_matrix_df = pd.DataFrame(title_matrix)
    title_matrix_df.columns = [f"title_{i}" for i in range(0, title_matrix_df.shape[1])]
    
    # `pos` vectorizer
    pos_vectorizer = CountVectorizer()
    pos_matrix = pos_vectorizer.fit_transform(df["pos"]).toarray()
    # Create dataframe
    pos_matrix_df = pd.DataFrame(pos_matrix)
    pos_matrix_df.columns = [f"pos_{i}" for i in range(0, pos_matrix_df.shape[1])]
    
    # Concatenate all data together
    final_df = pd.concat([df, title_matrix_df, pos_matrix_df], axis=1)
    
    return final_df

In [None]:
# Get `title` and `pos` embeddings
train_df_prep_final = get_embeddings(df=train_df_prep)

print("Train df shape - Before: {}".format(train_df_prep.shape), '\n' +
      "Train df shape - After: {}".format(train_df_prep_final.shape))

train_df_prep_final.head(3)

In [None]:
# Let's also save it to W&B project
run = wandb.init(project='shopee-kaggle', name='df_title_pos_embeddings')
artifact = wandb.Artifact(name='preprocessed_embeddings', 
                          type='dataset')

artifact.add_file("../input/shopee-preprocessed-data/train_title_prepped_embeddings.parquet")

wandb.log_artifact(artifact)
wandb.finish()

# 4. Final Preprocessing Function

We'll need the functions we created in this notebook to preprocess the `test` dataframe as well, before applying the ML model. Hence, it is best to create a `preprocess_df` function that contains the necessary metadata processing pipeline.

> **📌 Remember**: We'll use an unsupervised ML technique to make our predictions for this competition. Hence, every preprocessing step we apply when computing the **CV** score must also be applied in the **submission** notebook. More on that in my [next notebook](https://www.kaggle.com/andradaolteanu/ii-shopee-model-training-with-pytorch-x-rapids).

In [None]:
def preprocess_df(df):
    '''Preprocessing pipeline.'''
    
    # Clean duplicates
    df = clean_duplicates(df, train=False)
    # Preprocess title + get POS
    df["title_prep"] = df["title"].apply(lambda x: preprocess_title(x))
    df["pos"] = df["title_prep"].apply(lambda x: get_POS(x))
    # Extract title features
    df = extract_title_features(df)
    # Get embeddings from title and pos
    df = get_embeddings(df)
    
    return df

In [None]:
# Test the preprocess_df function 
test = pd.read_csv("../input/shopee-product-matching/test.csv")
preprocess_df(df=test)

### - Training & Submission Notebook here: [🛒II. Shopee: Model Training with Pytorch x RAPIDS](https://www.kaggle.com/andradaolteanu/ii-shopee-model-training-with-pytorch-x-rapids) -

<img src="https://i.imgur.com/cUQXtS7.png">

# Specs on how I prepped & explored ⌨️🎨
### (on my local machine)
* Z8 G4 Workstation 🖥
* 2 CPUs & 96GB Memory 💾
* NVIDIA Quadro RTX 8000 🎮
* RAPIDS version 0.17 🏃🏾‍♀️