# Overview

This notebook explores the example dataset used throughout and creates a BigQuery table from it.

Input:

[Kaggle Flipkart Product Catalog Dataset](https://www.kaggle.com/datasets/PromptCloudHQ/flipkart-products)

Output:

A BigQuery table whose schema is compatible with the downstream notebooks.

## Authenticate

If you are using Colab, you need to authenticate first. The next cell checks whether you are running in Colab and, if so, starts the authentication process.

In [None]:
import sys
if 'google.colab' in sys.modules:
    from google.colab import auth as google_auth
    google_auth.authenticate_user()

## Installation & Configurations

In [None]:
!pip install spacy
!pip install spacy-cleaner

Collecting spacy-cleaner
  Downloading spacy_cleaner-3.1.3-py3-none-any.whl (15 kB)
Collecting spacy<3.5.0,>=3.4.1 (from spacy-cleaner)
  Downloading spacy-3.4.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.5 MB)
Collecting spacy-lookups-data<1.1.0,>=1.0.3 (from spacy-cleaner)
  Downloading spacy_lookups_data-1.0.5-py2.py3-none-any.whl (98.5 MB)
Collecting tqdm<4.65.0,>=4.64.0 (from spacy-cleaner)
  Downloading tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Collecting wasabi<1.1.0,>=0.9.1 (from spacy<3.5.0,>=3.4.1->spacy-cleaner)
  Downloading wasabi-0.10.1-py3-none-any.whl (26 kB)
Collecting typer<0.8.0,>=0.3.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [None]:
!python -m spacy validate
!python -m spacy download en_core_web_sm

2023-11-21 19:13:31.135084: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-21 19:13:31.135180: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-21 19:13:31.135222: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-21 19:13:31.145571: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
✔ Loaded compatibility table

In [None]:
!pip install google-cloud-storage

In [None]:
!python -m pip install openpyxl

# Dataset

[This](https://www.kaggle.com/datasets/PromptCloudHQ/flipkart-products) is a pre-crawled dataset, taken as a subset of a larger dataset (more than 5.8 million products) that was created by extracting data from [Flipkart](https://www.flipkart.com/), a leading Indian eCommerce store.


In [None]:
import pandas as pd
full_ds = pd.read_csv('gs://product_catalog_enrichment/flipkart_20k/flipkart_com-ecommerce_sample.csv')

In [None]:
full_ds.head()

Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,is_FK_Advantage_product,description,product_rating,overall_rating,brand,product_specifications
0,c2d766ca982eca8304150849735ffef9,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FF9KEDEFGF,999.0,379.0,"[""http://img5a.flixcart.com/image/short/u/4/a/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
1,7f7036a6d550aaa89d34c77bd39a5e48,2016-03-25 22:59:23 +0000,http://www.flipkart.com/fabhomedecor-fabric-do...,FabHomeDecor Fabric Double Sofa Bed,"[""Furniture >> Living Room Furniture >> Sofa B...",SBEEH3QGU7MFYJFY,32157.0,22646.0,"[""http://img6a.flixcart.com/image/sofa-bed/j/f...",False,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,No rating available,No rating available,FabHomeDecor,"{""product_specification""=>[{""key""=>""Installati..."
2,f449ec65dcbc041b6ae5e6a32717d01b,2016-03-25 22:59:23 +0000,http://www.flipkart.com/aw-bellies/p/itmeh4grg...,AW Bellies,"[""Footwear >> Women's Footwear >> Ballerinas >...",SHOEH4GRSUBJGZXE,999.0,499.0,"[""http://img5a.flixcart.com/image/shoe/7/z/z/r...",False,Key Features of AW Bellies Sandals Wedges Heel...,No rating available,No rating available,AW,"{""product_specification""=>[{""key""=>""Ideal For""..."
3,0973b37acd0c664e3de26e97e5571454,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2F6HUZMQ6SJ,699.0,267.0,"[""http://img5a.flixcart.com/image/short/6/2/h/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
4,bc940ea42ee6bef5ac7cea3fb5cfbee7,2016-03-25 22:59:23 +0000,http://www.flipkart.com/sicons-all-purpose-arn...,Sicons All Purpose Arnica Dog Shampoo,"[""Pet Supplies >> Grooming >> Skin & Coat Care...",PSOEH3ZYDMSYARJ5,220.0,210.0,"[""http://img5a.flixcart.com/image/pet-shampoo/...",False,Specifications of Sicons All Purpose Arnica Do...,No rating available,No rating available,Sicons,"{""product_specification""=>[{""key""=>""Pet Type"",..."


In [None]:
full_ds['image'][0]

'["http://img5a.flixcart.com/image/short/u/4/a/altht-3p-21-alisha-38-original-imaeh2d5vm5zbtgg.jpeg", "http://img5a.flixcart.com/image/short/p/j/z/altght4p-26-alisha-38-original-imaeh2d5kbufss6n.jpeg", "http://img5a.flixcart.com/image/short/p/j/z/altght4p-26-alisha-38-original-imaeh2d5npdybzyt.jpeg", "http://img5a.flixcart.com/image/short/z/j/7/altght-7-alisha-38-original-imaeh2d5jsz2ghd6.jpeg"]'

In [None]:
full_ds['description'][0]

"Key Features of Alisha Solid Women's Cycling Shorts Cotton Lycra Navy, Red, Navy,Specifications of Alisha Solid Women's Cycling Shorts Shorts Details Number of Contents in Sales Package Pack of 3 Fabric Cotton Lycra Type Cycling Shorts General Details Pattern Solid Ideal For Women's Fabric Care Gentle Machine Wash in Lukewarm Water, Do Not Bleach Additional Details Style Code ALTHT_3P_21 In the Box 3 shorts"

In [None]:
# Count the number of unique values in each column
n = full_ds.nunique(axis=0)
print("No.of.unique values in each column :\n", n)

No.of.unique values in each column :
 uniq_id                    20000
crawl_timestamp              371
product_url                20000
product_name               12676
product_category_tree       6466
pid                        19998
retail_price                2247
discounted_price            2448
image                      18589
is_FK_Advantage_product        2
description                17539
product_rating                36
overall_rating                36
brand                       3499
product_specifications     18825
dtype: int64


In [None]:
full_ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   uniq_id                  20000 non-null  object 
 1   crawl_timestamp          20000 non-null  object 
 2   product_url              20000 non-null  object 
 3   product_name             20000 non-null  object 
 4   product_category_tree    20000 non-null  object 
 5   pid                      20000 non-null  object 
 6   retail_price             19922 non-null  float64
 7   discounted_price         19922 non-null  float64
 8   image                    19997 non-null  object 
 9   is_FK_Advantage_product  20000 non-null  bool   
 10  description              19998 non-null  object 
 11  product_rating           20000 non-null  object 
 12  overall_rating           20000 non-null  object 
 13  brand                    14136 non-null  object 
 14  product_specifications

In [None]:
full_ds.describe()

Unnamed: 0,retail_price,discounted_price
count,19922.0,19922.0
mean,2979.206104,1973.401767
std,9009.639341,7333.58604
min,35.0,35.0
25%,666.0,350.0
50%,1040.0,550.0
75%,1999.0,999.0
max,571230.0,571230.0


In [None]:
df = full_ds[['uniq_id','product_name','description','brand','product_category_tree','image','product_specifications']]

In [None]:
#df[~df['product_category_tree']]
df['product_category_tree'].isnull().sum()

0

In [None]:
# Widen the pandas display limits so long text columns are shown in full
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)


In [None]:
df.head()

Unnamed: 0,uniq_id,product_name,description,brand,product_category_tree,image,product_specifications
0,c2d766ca982eca8304150849735ffef9,Alisha Solid Women's Cycling Shorts,Key Features of Alisha Solid Women's Cycling S...,Alisha,"[""Clothing >> Women's Clothing >> Lingerie, Sl...","[""http://img5a.flixcart.com/image/short/u/4/a/...","{""product_specification""=>[{""key""=>""Number of ..."
1,7f7036a6d550aaa89d34c77bd39a5e48,FabHomeDecor Fabric Double Sofa Bed,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,FabHomeDecor,"[""Furniture >> Living Room Furniture >> Sofa B...","[""http://img6a.flixcart.com/image/sofa-bed/j/f...","{""product_specification""=>[{""key""=>""Installati..."
2,f449ec65dcbc041b6ae5e6a32717d01b,AW Bellies,Key Features of AW Bellies Sandals Wedges Heel...,AW,"[""Footwear >> Women's Footwear >> Ballerinas >...","[""http://img5a.flixcart.com/image/shoe/7/z/z/r...","{""product_specification""=>[{""key""=>""Ideal For""..."
3,0973b37acd0c664e3de26e97e5571454,Alisha Solid Women's Cycling Shorts,Key Features of Alisha Solid Women's Cycling S...,Alisha,"[""Clothing >> Women's Clothing >> Lingerie, Sl...","[""http://img5a.flixcart.com/image/short/6/2/h/...","{""product_specification""=>[{""key""=>""Number of ..."
4,bc940ea42ee6bef5ac7cea3fb5cfbee7,Sicons All Purpose Arnica Dog Shampoo,Specifications of Sicons All Purpose Arnica Do...,Sicons,"[""Pet Supplies >> Grooming >> Skin & Coat Care...","[""http://img5a.flixcart.com/image/pet-shampoo/...","{""product_specification""=>[{""key""=>""Pet Type"",..."


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   uniq_id                 20000 non-null  object
 1   product_name            20000 non-null  object
 2   description             19998 non-null  object
 3   brand                   14136 non-null  object
 4   product_category_tree   20000 non-null  object
 5   image                   19997 non-null  object
 6   product_specifications  19986 non-null  object
dtypes: object(7)
memory usage: 1.1+ MB


# Category Analysis

In [None]:
# Helper function that strips the list brackets and quotes from the raw category string
def reformat(text: str) -> str:
  text = text.replace('[', '')
  text = text.replace(']', '')
  text = text.replace('"', '')
  return text

# df is a slice of full_ds, so build a new frame with assign() instead of
# writing into the slice (this avoids pandas' SettingWithCopyWarning)
df = df.assign(product_category_tree=df['product_category_tree'].apply(reformat))


In [None]:
# Find the depth of each category tree and
# count how many products have a tree of each depth
cat_len = {}
for cat_tree in df.product_category_tree:
  number_of_categories = len(cat_tree.split(">>"))
  #print(number_of_categories)
  if number_of_categories not in cat_len:
    cat_len[number_of_categories] = 1
  else:
    cat_len[number_of_categories] += 1
print(cat_len)

{6: 3640, 4: 4765, 5: 4911, 1: 328, 3: 4419, 7: 778, 2: 1129, 8: 30}


**The category tree has at most 8 levels.**
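The same depth count can be written more compactly with `collections.Counter`. This is a minimal sketch on a few hand-written strings; `sample_trees` and `depth_histogram` are hypothetical names and the strings are illustrative, not taken from the dataset.

```python
from collections import Counter

# Hypothetical sample strings in the same ">>"-separated form as product_category_tree
sample_trees = [
    "Clothing >> Women's Clothing >> Shorts",
    "Footwear >> Women's Footwear",
    "Pet Supplies >> Grooming >> Shampoo",
    "Furniture",
]

def depth_histogram(trees):
    """Map each tree depth (number of levels) to the number of trees with that depth."""
    return Counter(len(tree.split(">>")) for tree in trees)

hist = depth_histogram(sample_trees)
print(dict(hist))
print("max depth:", max(hist))
```

The largest key of the histogram is the maximum depth, which is how the figure of 8 levels can be read off the real data.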

In [None]:
temp_df = df['product_category_tree'].str.split('>>', expand=True)
temp_df.columns = ['c0_name', 'c1_name', 'c2_name', 'c3_name', 'c4_name', 'c5_name', 'c6_name', 'c7_name']
for col in temp_df.columns:
  temp_df[col] = temp_df[col].apply(lambda x: x.strip() if x else x)

**Keep only the first 4 levels of the category tree**

In [None]:
# Keep only the first 4 levels of the category tree
temp_df = temp_df[['c0_name', 'c1_name', 'c2_name', 'c3_name']]
temp_df

Unnamed: 0,c0_name,c1_name,c2_name,c3_name
0,Clothing,Women's Clothing,"Lingerie, Sleep & Swimwear",Shorts
1,Furniture,Living Room Furniture,Sofa Beds & Futons,FabHomeDecor Fabric Double Sofa Bed (Finish Co...
2,Footwear,Women's Footwear,Ballerinas,AW Bellies
3,Clothing,Women's Clothing,"Lingerie, Sleep & Swimwear",Shorts
4,Pet Supplies,Grooming,Skin & Coat Care,Shampoo
...,...,...,...,...
19995,Baby Care,Baby & Kids Gifts,Stickers,WallDesign Stickers
19996,Baby Care,Baby & Kids Gifts,Stickers,Wallmantra Stickers
19997,Baby Care,Baby & Kids Gifts,Stickers,Elite Collection Stickers
19998,Baby Care,Baby & Kids Gifts,Stickers,Elite Collection Stickers
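The split-and-truncate step can be illustrated without pandas. The sketch below assumes a hypothetical helper `split_levels` that splits one category string, trims whitespace, keeps at most `depth` levels, and pads shallower trees with `None`, mirroring what the expanded DataFrame columns contain.

```python
from typing import List, Optional

def split_levels(tree: str, depth: int = 4) -> List[Optional[str]]:
    """Split a '>>'-separated category string into exactly `depth` levels."""
    levels = [part.strip() for part in tree.split(">>")]
    levels = levels[:depth]                           # drop levels deeper than `depth`
    return levels + [None] * (depth - len(levels))    # pad shallow trees with None

print(split_levels("Pet Supplies >> Grooming >> Skin & Coat Care >> Shampoo >> Dog"))
print(split_levels("Furniture >> Living Room Furniture"))
```

Padding with `None` is why `c1_name` to `c3_name` contain missing values for products whose tree has fewer than four levels.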


In [None]:
# Concatenate df and temp_df column-wise (axis=1)
df_with_cat = pd.concat([df, temp_df], axis=1)
df_with_cat = df_with_cat.drop('product_category_tree', axis=1)

In [None]:
df_with_cat.head()

Unnamed: 0,uniq_id,product_name,description,brand,image,product_specifications,c0_name,c1_name,c2_name,c3_name
0,c2d766ca982eca8304150849735ffef9,Alisha Solid Women's Cycling Shorts,Key Features of Alisha Solid Women's Cycling S...,Alisha,"[""http://img5a.flixcart.com/image/short/u/4/a/...","{""product_specification""=>[{""key""=>""Number of ...",Clothing,Women's Clothing,"Lingerie, Sleep & Swimwear",Shorts
1,7f7036a6d550aaa89d34c77bd39a5e48,FabHomeDecor Fabric Double Sofa Bed,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,FabHomeDecor,"[""http://img6a.flixcart.com/image/sofa-bed/j/f...","{""product_specification""=>[{""key""=>""Installati...",Furniture,Living Room Furniture,Sofa Beds & Futons,FabHomeDecor Fabric Double Sofa Bed (Finish Co...
2,f449ec65dcbc041b6ae5e6a32717d01b,AW Bellies,Key Features of AW Bellies Sandals Wedges Heel...,AW,"[""http://img5a.flixcart.com/image/shoe/7/z/z/r...","{""product_specification""=>[{""key""=>""Ideal For""...",Footwear,Women's Footwear,Ballerinas,AW Bellies
3,0973b37acd0c664e3de26e97e5571454,Alisha Solid Women's Cycling Shorts,Key Features of Alisha Solid Women's Cycling S...,Alisha,"[""http://img5a.flixcart.com/image/short/6/2/h/...","{""product_specification""=>[{""key""=>""Number of ...",Clothing,Women's Clothing,"Lingerie, Sleep & Swimwear",Shorts
4,bc940ea42ee6bef5ac7cea3fb5cfbee7,Sicons All Purpose Arnica Dog Shampoo,Specifications of Sicons All Purpose Arnica Do...,Sicons,"[""http://img5a.flixcart.com/image/pet-shampoo/...","{""product_specification""=>[{""key""=>""Pet Type"",...",Pet Supplies,Grooming,Skin & Coat Care,Shampoo


In [None]:
# Save the value counts of each category level to a local .xlsx file, one sheet per level
columns = temp_df.columns
with pd.ExcelWriter('flipkart_cat_analysis_cat_depth4.xlsx') as writer:
  for col in columns:
    temp_df[col].value_counts().to_excel(writer, sheet_name=col)

# Product Description Prep

In [None]:
# Cleaning the description text
import spacy
import spacy_cleaner
from spacy_cleaner.processing import removers, replacers, mutators

MODEL = spacy.load("en_core_web_sm")

# spacy-cleaner pipeline; defined here but not used by parse_nlp_description below
PROC_PIPELINE = spacy_cleaner.Pipeline(
    MODEL,
    replacers.replace_punctuation_token,
    mutators.mutate_lemma_token,
    removers.remove_stopword_token,
)

def parse_nlp_description(description) -> str:
    """Return the unique lemmas of a description, in order of first appearance."""
    if pd.isna(description):
        return " "  # placeholder for missing descriptions
    doc = MODEL(description.lower())
    lemmas = []
    for token in doc:
        # keep alphabetic, non-stop-word tokens and skip lemmas already collected
        if token.lemma_ not in lemmas and not token.is_stop and token.is_alpha:
            lemmas.append(token.lemma_)

    return " ".join(lemmas)

#test
#DESCRIPTION = "Solemio Sleeveless Solid Men's Reversible Sweatshirt Price: Rs. 1,261 Fitz Blue Sweat Shirt is a must-have for any wardrobe. Ensemble with trendy embroidery on the chest and full zipped closure down front. It is marked with ribbed waistband, cuffs and two pouch pockets on the front. Classic color scheme allows you to wear this over a wide range of separates. Relaxed fit for free body movement. Brand: Fitz Color: Red Style Statement: Designed to be worn as casual as well as leisure wear. Team it with a denims and sneakers to look uber cool. Material: Polyester Cotton Wash Care: Do not bleach and tumble dry. Use gentle machine wash and gentle warm Iron. About the brand: FITZ is the company’s flagship sportswear and active wear brand with an Indian soul and international outlook. It is an evolved range of sports and activity gears dedicated to sports enthusiasts. The brand is poised to become an economic force when it comes to young people who are aware of style trends, sportswear designs reflected the spirited, celebrity-conscious sensibilities of the decade. Disclaimer: Product color may slightly vary due to photographic lighting sources or your monitor settings. Size varies from brand to brand. Kindly go through the size chart for more clarity. Fitz Blue Sweat Shirt is a must-have for any wardrobe. Ensemble with trendy embroidery on the chest and full zipped closure down front. It is marked with ribbed waistband, cuffs and two pouch pockets on the front. Classic color scheme allows you to wear this over a wide range of separates. Relaxed fit for free body movement. Brand: Fitz Color: Red Style Statement: Designed to be worn as casual as well as leisure wear. Team it with a denims and sneakers to look uber cool. Material: Polyester Cotton Wash Care: Do not bleach and tumble dry. Use gentle machine wash and gentle warm Iron. About the brand: FITZ is the company’s flagship sportswear and active wear brand with an Indian soul and international outlook. 
#It is an evolved range of sports and activity gears dedicated to sports enthusiasts. The brand is poised to become an economic force when it comes to young people who are aware of style trends, sportswear designs reflected the spirited, celebrity-conscious sensibilities of the decade. Disclaimer: Product color may slightly vary due to photographic lighting sources or your monitor settings. Size varies from brand to brand. Kindly go through the size chart for more clarity."
#print(len(DESCRIPTION))

#print(parse_nlp_description(DESCRIPTION))
#print(len(parse_nlp_description(DESCRIPTION)))
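Apart from lemmatization, the description cleaning amounts to lowercasing, dropping stop words and non-alphabetic tokens, and removing repeated words while keeping their order. The sketch below shows that idea using only the standard library. It does not use spaCy: `STOP_WORDS` is a deliberately tiny hypothetical list, `unique_keywords` is a hypothetical name, and the regex tokenizer is a simplification of spaCy's tokenization.

```python
import re

# Hypothetical, deliberately tiny stop-word list; spaCy's own list is much larger
STOP_WORDS = {"a", "an", "the", "of", "is", "for", "in", "and", "with"}

def unique_keywords(description: str) -> str:
    """Lowercase, keep alphabetic non-stop-word tokens, drop repeats in order."""
    tokens = re.findall(r"[a-z]+", description.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    # dict.fromkeys keeps the first occurrence of each token and preserves order
    return " ".join(dict.fromkeys(kept))

print(unique_keywords("Pack of 3 Cotton Lycra shorts. The shorts is cotton."))
```

Using `dict.fromkeys` for de-duplication avoids the repeated list membership test, which can matter for very long descriptions.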

In [None]:
df_with_cat['description'] = df_with_cat['description'].apply(parse_nlp_description)

In [None]:
#df_with_cat.head()

# Extracting Product Attributes

In [None]:
# Extracting attributes from product specifications
import json
from typing import List, Dict

import jsonpickle
import pandas as pd
import re

import numpy as np
# Captures everything between the outermost [ and ] of the specification string
SPEC_MATCH_ONE = re.compile(r"(.*?)\[(.*)\](.*)")
# Captures the two quoted strings that follow "=>" inside one {...} entry
SPEC_MATCH_TWO = re.compile(r'(.*?)=>"(.*?)"(.*?)=>"(.*?)"(.*)')

def parse_spec(specification: str):
    """Convert the raw specification string into a JSON object string."""
    if pd.isna(specification):
      return None
    m = SPEC_MATCH_ONE.match(specification)
    out = {}
    if m is not None and m.group(2) is not None:
        phrase = ''
        for c in m.group(2):
            if c == '}':
                # '}' closes one entry; extract its two quoted strings
                m2 = SPEC_MATCH_TWO.match(phrase)
                if m2 and m2.group(2) is not None and m2.group(4) is not None:
                    out[m2.group(2)] = m2.group(4)
                phrase = ''
            else:
                phrase += c
    # Return the JSON string without printing it; printing one line per row
    # for 20,000 rows floods the notebook output
    json_string = jsonpickle.encode(out)
    return json_string
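For a single row, the same extraction can be sketched with one regular expression and `re.findall`. This is an alternative illustration, not the parser used in this notebook. It assumes every entry has the shape `{"key"=>"...", "value"=>"..."}`; the `"value"` field name and the `sample` string are assumptions, since the table output above truncates each entry. `PAIR` and `spec_to_json` are hypothetical names.

```python
import json
import re

# Matches one "key"=>"...", "value"=>"..." pair in the Ruby-hash-like dump
PAIR = re.compile(r'"key"=>"(.*?)",\s*"value"=>"(.*?)"')

def spec_to_json(specification: str) -> str:
    """Collect the key/value pairs into a dict and serialize it as JSON."""
    return json.dumps(dict(PAIR.findall(specification)))

# Hypothetical, shortened sample in the assumed shape of product_specifications
sample = ('{"product_specification"=>[{"key"=>"Fabric", "value"=>"Cotton Lycra"}, '
          '{"key"=>"Type", "value"=>"Cycling Shorts"}]}')
print(spec_to_json(sample))
```

Entries that carry only a value and no key are skipped by this pattern, so it is worth checking a few rows by hand before relying on it.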

In [None]:
df_with_cat['attributes'] = df_with_cat['product_specifications'].apply(parse_spec)

In [None]:
df_with_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   uniq_id                 20000 non-null  object
 1   product_name            20000 non-null  object
 2   description             20000 non-null  object
 3   brand                   14136 non-null  object
 4   image                   19997 non-null  object
 5   product_specifications  19986 non-null  object
 6   c0_name                 20000 non-null  object
 7   c1_name                 19672 non-null  object
 8   c2_name                 18543 non-null  object
 9   c3_name                 14124 non-null  object
 10  attributes              19986 non-null  object
dtypes: object(11)
memory usage: 1.7+ MB


# Downloading Images from product image url into GCS

In [None]:
from google.cloud import storage
from google.cloud.storage import Bucket

IMAGE_BUCKET = 'genai-product-catalog'
GCS_IMAGE_FOLDER = 'flipkart_Nov14'

def create_gcs_bucket(bucket_name: str):
  """Create the bucket named bucket_name if it does not already exist."""
  storage_client = storage.Client()
  exists = Bucket(storage_client, bucket_name).exists()
  if exists:
    print("Bucket exists")
  else:
    print("Bucket does not exist")
    # Creates the new bucket
    bucket = storage_client.create_bucket(bucket_name)
    print(f"Bucket {bucket.name} created.")

In [None]:
create_gcs_bucket(IMAGE_BUCKET)

Bucket exists


In [None]:
import http.client
import urllib.request, urllib.error
#from numpy import NaN

# Formatting the image-list string into a list of image URLs
def extract_url(image_list: str) -> List[str]:
  image_list = image_list.replace('[', '')
  image_list = image_list.replace(']', '')
  image_list = image_list.replace('"', '')
  # entries are separated by ", ", so strip the whitespace left after splitting
  image_urls = [url.strip() for url in image_list.split(',')]
  return image_urls

# Downloading an image from a Flipkart URL into the GCS bucket
def download_image(image_url, image_file_name, destination_blob_name):
  storage_client = storage.Client()
  image_found_flag = False
  try:
    urllib.request.urlretrieve(image_url, image_file_name)
    bucket = storage_client.bucket(IMAGE_BUCKET)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(image_file_name)
    print(
        f"File {image_file_name} uploaded to {destination_blob_name}."
    )
    image_found_flag = True
  except urllib.error.HTTPError:
    # HTTPError is a subclass of URLError, so it has to be caught first
    print("HTTPError exception")
  except urllib.error.URLError:
    print("URLError exception")
  except http.client.HTTPException:
    print("HTTPException exception")
  except Exception:
    print("Unknown exception")
  return image_found_flag
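The `image` value printed earlier is a valid JSON array, so `json.loads` is an alternative to stripping brackets and quotes by hand. The sketch below assumes every non-null value is valid JSON, which has not been verified for the whole dataset. `extract_url_json` is a hypothetical name and the URLs are placeholders.

```python
import json
from typing import List

def extract_url_json(image_list: str) -> List[str]:
    """Parse the JSON-array string from the image column into a list of URLs."""
    urls = json.loads(image_list)
    # keep non-empty strings only and trim stray whitespace
    return [u.strip() for u in urls if isinstance(u, str) and u.strip()]

# Hypothetical placeholder URLs, not taken from the dataset
sample = '["http://example.com/a.jpeg", "http://example.com/b.jpeg"]'
print(extract_url_json(sample))
```

Unlike splitting on commas, this approach keeps a URL intact even if it contains a comma.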

In [None]:
# Get image URI
def get_product_image(df):
  products_with_no_image_count = 0
  products_with_no_image = []
  gcs_image_url = []

  for uniq_id, image_list in zip(df['uniq_id'], df['image']):
    image_found_flag = False  # reset for every product

    if pd.isnull(image_list): # No image url
      print("WARNING: No image url: product ", uniq_id)
      products_with_no_image_count += 1
      products_with_no_image.append(uniq_id)
      gcs_image_url.append(None)
      continue

    image_urls = extract_url(image_list)
    # Try each URL in order and keep the first image that downloads successfully
    for index, image_url in enumerate(image_urls):
      image_file_name = '{}_{}.jpg'.format(uniq_id, index)
      destination_blob_name = GCS_IMAGE_FOLDER+'/'+image_file_name
      image_found_flag = download_image(image_url, image_file_name, destination_blob_name)
      if image_found_flag:
        gcs_image_url.append('gs://'+IMAGE_BUCKET+'/'+destination_blob_name)
        break
    if not image_found_flag:
      print("WARNING: No image: product ", uniq_id)
      products_with_no_image_count += 1
      products_with_no_image.append(uniq_id)
      gcs_image_url.append(None)

  # Append the GCS image URIs as a new column (axis=1), aligned on df's index.
  # The 'image' column is kept here; it is dropped in a later cell.
  gcs_image_loc = pd.DataFrame({"image_uri": gcs_image_url}, index=df.index)
  df_with_gcs_image_uri = pd.concat([df, gcs_image_loc], axis=1)

  return df_with_gcs_image_uri
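The core of `get_product_image` is a fallback: try each candidate URL in order and keep the first one that downloads. The sketch below isolates that logic so it can be checked without network access. `first_successful` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `download_image`.

```python
from typing import Callable, Iterable, Optional

def first_successful(urls: Iterable[str], fetch: Callable[[str], bool]) -> Optional[str]:
    """Return the first URL for which fetch(url) succeeds, or None if all fail."""
    for url in urls:
        if fetch(url):
            return url
    return None

# Hypothetical stand-in for download_image: only URLs ending in 'ok.jpg' succeed
def fake_fetch(url: str) -> bool:
    return url.endswith("ok.jpg")

print(first_successful(["http://x/dead.jpg", "http://x/ok.jpg"], fake_fetch))
print(first_successful(["http://x/dead.jpg"], fake_fetch))
```

Passing the downloader in as a parameter keeps the selection logic testable and independent of GCS.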

In [None]:
df_with_gcs_image_uri = get_product_image(df_with_cat)

In [None]:
df_with_gcs_image_uri.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20000 entries, 0 to 19999
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   uniq_id                 20000 non-null  object
 1   product_name            20000 non-null  object
 2   description             20000 non-null  object
 3   brand                   14136 non-null  object
 4   image                   19997 non-null  object
 5   product_specifications  19986 non-null  object
 6   c0_name                 20000 non-null  object
 7   c1_name                 19672 non-null  object
 8   c2_name                 18543 non-null  object
 9   c3_name                 14124 non-null  object
 10  attributes              19986 non-null  object
 11  image_uri               18360 non-null  object
dtypes: object(12)
memory usage: 2.0+ MB


In [None]:
#Filtering null values
non_null_image_df = df_with_gcs_image_uri[df_with_gcs_image_uri['image_uri'].notnull()]
non_null_image_df = non_null_image_df[non_null_image_df['description'].notnull()]

In [None]:
non_null_image_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18360 entries, 0 to 19999
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   uniq_id                 18360 non-null  object
 1   product_name            18360 non-null  object
 2   description             18360 non-null  object
 3   brand                   13476 non-null  object
 4   image                   18360 non-null  object
 5   product_specifications  18347 non-null  object
 6   c0_name                 18360 non-null  object
 7   c1_name                 18064 non-null  object
 8   c2_name                 16940 non-null  object
 9   c3_name                 12829 non-null  object
 10  attributes              18347 non-null  object
 11  image_uri               18360 non-null  object
dtypes: object(12)
memory usage: 1.8+ MB


In [None]:
#Dropping redundant columns
non_null_image_df = non_null_image_df.drop(['image','product_specifications'], axis=1)
non_null_image_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18360 entries, 0 to 19999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   uniq_id       18360 non-null  object
 1   product_name  18360 non-null  object
 2   description   18360 non-null  object
 3   brand         13476 non-null  object
 4   c0_name       18360 non-null  object
 5   c1_name       18064 non-null  object
 6   c2_name       16940 non-null  object
 7   c3_name       12829 non-null  object
 8   attributes    18347 non-null  object
 9   image_uri     18360 non-null  object
dtypes: object(10)
memory usage: 1.5+ MB


In [None]:
non_null_image_df.to_csv('flipkart_preprocessed.csv', header=False, index=False)

In [None]:
#Copying the cleaned dataset into gcs
!gsutil cp flipkart_preprocessed.csv gs://genai-product-catalog/

Copying file://flipkart_preprocessed.csv [Content-Type=text/csv]...
- [1 files][ 14.0 MiB/ 14.0 MiB]                                                
Operation completed over 1 objects/14.0 MiB.                                     


In [None]:
non_null_image_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18449 entries, 0 to 19999
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   uniq_id       18449 non-null  object
 1   product_name  18449 non-null  object
 2   description   18449 non-null  object
 3   c0_name       18449 non-null  object
 4   c1_name       18152 non-null  object
 5   c2_name       17023 non-null  object
 6   c3_name       12908 non-null  object
 7   image_uri     18449 non-null  object
dtypes: object(8)
memory usage: 1.3+ MB


In [None]:
non_null_image_df.head()

Unnamed: 0,uniq_id,product_name,description,brand,c0_name,c1_name,c2_name,c3_name,attributes,image_uri
0,c2d766ca982eca8304150849735ffef9,Alisha Solid Women's Cycling Shorts,key feature alisha solid woman cycling short c...,Alisha,Clothing,Women's Clothing,"Lingerie, Sleep & Swimwear",Shorts,"{""Number of Contents in Sales Package"": ""Pack ...",gs://genai-product-catalog/flipkart_20k_oct26/...
1,7f7036a6d550aaa89d34c77bd39a5e48,FabHomeDecor Fabric Double Sofa Bed,fabhomedecor fabric double sofa bed finish col...,FabHomeDecor,Furniture,Living Room Furniture,Sofa Beds & Futons,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,"{""Installation & Demo Details"": ""Installation ...",gs://genai-product-catalog/flipkart_20k_oct26/...
2,f449ec65dcbc041b6ae5e6a32717d01b,AW Bellies,key feature aw belly sandal wedge heel casual ...,AW,Footwear,Women's Footwear,Ballerinas,AW Bellies,"{""Ideal For"": ""Women"", ""Occasion"": ""Casual"", ""...",gs://genai-product-catalog/flipkart_20k_oct26/...
3,0973b37acd0c664e3de26e97e5571454,Alisha Solid Women's Cycling Shorts,key feature alisha solid woman cycling short c...,Alisha,Clothing,Women's Clothing,"Lingerie, Sleep & Swimwear",Shorts,"{""Number of Contents in Sales Package"": ""Pack ...",gs://genai-product-catalog/flipkart_20k_oct26/...
5,c2a17313954882c1dba461863e98adf2,Eternal Gandhi Super Series Crystal Paper Weig...,key feature eternal gandhi super series crysta...,Eternal Gandhi,Eternal Gandhi Super Series Crystal Paper Weig...,,,,"{""Model Name"": ""Gandhi Paper Weight Mark V"", ""...",gs://genai-product-catalog/flipkart_20k_oct26/...


In [None]:
# Check whether a category name is repeated across levels of the category tree
non_null_image_df.reset_index(drop=True, inplace=True)
col1 = non_null_image_df['c0_name']
col2 = non_null_image_df['c1_name']
col3 = non_null_image_df['c2_name']
col4 = non_null_image_df['c3_name']

In [None]:
'''
Category tree [depth 4]:
root -> child -> sub-child -> leaf
'''

duplicate_index = []
for i in range(0,len(col1)):
    if (col1[i] == col2[i] and col1[i] and col2[i]):
      print('category repeating: root & child is same')
      print(i)
      print(col1[i],col2[i], col3[i], col4[i])
    if (col2[i] == col3[i] and col2[i] and col3[i]):
      print('category repeating: child & sub-child is same')
      print(i)
      print(col1[i],col2[i], col3[i], col4[i])
    if (col3[i] == col4[i] and col3[i] and col4[i]):
      print('category repeating:  sub-child & leaf is same')
      print(i)
      print(col1[i],"'",col2[i], ",", col3[i], ",", col4[i])
    if (col1[i] == col3[i] and col1[i] and col3[i]):
      print('category repeating: root & sub-child is same')
      print(i)
    if (col1[i] == col4[i] and col1[i] and col4[i]):
      print('category repeating: root & leaf is same')
      print(i)
    if (col2[i] == col4[i] and col2[i] and col4[i]):
      print('category repeating: child & leaf is same')
      print(i)

category repeating:  sub-child & leaf is same
1533
Automotive ' Accessories & Spare parts , Tyres , Tyres
category repeating:  sub-child & leaf is same
9220
Clothing ' Women's Clothing , Leggings & Jeggings , Leggings & Jeggings
category repeating:  sub-child & leaf is same
10332
Clothing ' Women's Clothing , Leggings & Jeggings , Leggings & Jeggings
category repeating:  sub-child & leaf is same
10343
Clothing ' Women's Clothing , Leggings & Jeggings , Leggings & Jeggings
category repeating:  sub-child & leaf is same
13868
Clothing ' Women's Clothing , Leggings & Jeggings , Leggings & Jeggings
category repeating:  sub-child & leaf is same
13981
Clothing ' Women's Clothing , Leggings & Jeggings , Leggings & Jeggings
category repeating:  sub-child & leaf is same
13982
Clothing ' Women's Clothing , Leggings & Jeggings , Leggings & Jeggings
category repeating:  sub-child & leaf is same
14007
Clothing ' Women's Clothing , Leggings & Jeggings , Leggings & Jeggings
category repeating:  sub-ch

**Some sub-child and leaf categories are identical, so the duplicated leaf category should be removed.**

*Before running the next cell, check the indices printed above and update the list accordingly.*

*The approach is to set the duplicated leaf category (`c3_name`) to null.*

In [None]:
# Check the indices printed above and update the list below before running this cell
#duplicate_index = [1533,9220, 10332, 10343,13868,13981,13982,14007,14373,16386,17517]
for i in duplicate_index:
  # Use .loc rather than chained indexing, which may not update the DataFrame
  non_null_image_df.loc[i, 'c3_name'] = None
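
The manual step of copying indices into the list could also be automated. The following is a minimal sketch, not the notebook's original method: `find_repeated_levels` and `toy_df` are hypothetical names, and the toy rows only mimic the shape of the Flipkart data. It compares every pair of category levels and then nulls the leaf where it repeats the sub-child.

```python
from itertools import combinations

import pandas as pd


def find_repeated_levels(df, levels):
    """Map each pair of category columns to the row labels where both hold the same non-null value."""
    repeated = {}
    for upper, lower in combinations(levels, 2):
        # Require both values to be present so missing categories never count as a match
        mask = df[upper].notna() & df[lower].notna() & (df[upper] == df[lower])
        if mask.any():
            repeated[(upper, lower)] = df.index[mask].tolist()
    return repeated


# Hypothetical toy data shaped like the category columns used above
toy_df = pd.DataFrame({
    'c0_name': ['Automotive', 'Clothing', 'Footwear'],
    'c1_name': ['Accessories & Spare parts', "Women's Clothing", "Women's Footwear"],
    'c2_name': ['Tyres', 'Leggings & Jeggings', 'Ballerinas'],
    'c3_name': ['Tyres', 'Leggings & Jeggings', None],
})

levels = ['c0_name', 'c1_name', 'c2_name', 'c3_name']
repeated = find_repeated_levels(toy_df, levels)

# Null the leaf category wherever it repeats the sub-child category
duplicate_index = repeated.get(('c2_name', 'c3_name'), [])
toy_df.loc[duplicate_index, 'c3_name'] = None
```

Because the row labels are collected from the comparison itself, the list does not need to be edited by hand when the input data changes.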

**Checking whether brand names appear among the category names**

In [None]:

non_null_image_df[non_null_image_df['brand'].isin(non_null_image_df['c0_name'])]

Unnamed: 0,uniq_id,product_name,description,brand,c0_name,c1_name,c2_name,c3_name,attributes,image_uri


In [None]:
non_null_image_df[non_null_image_df['brand'].isin(non_null_image_df['c1_name'])]

Unnamed: 0,uniq_id,product_name,description,brand,c0_name,c1_name,c2_name,c3_name,attributes,image_uri
17024,1bcf75f2f9cae1f0de161a3c1ff39f88,Frames MDF Photo Frame,key feature frame mdf photo wall table decor f...,Frames,Home Decor & Festive Needs,Wall Decor & Clocks,Wall Photo Frames,Frames Wall Photo Frames,"{""Frame Material"": ""MDF"", ""Backing"": ""Wood"", ""...",gs://genai-product-catalog/flipkart_20k_oct26/...


In [None]:
non_null_image_df[non_null_image_df['brand'].isin(non_null_image_df['c2_name'])]

Unnamed: 0,uniq_id,product_name,description,brand,c0_name,c1_name,c2_name,c3_name,attributes,image_uri
9687,8a3e23b6dc2d811d53bf9544c3e5f1a5,Tennis Tennis Sports Shoes Running Shoes,key feature tennis sport shoe run material eva...,Tennis,Footwear,Men's Footwear,Sports Shoes,Tennis Sports Shoes,"{""Ideal For"": ""Men"", ""Occasion"": ""Sports"", ""So...",gs://genai-product-catalog/flipkart_20k_oct26/...


In [None]:
non_null_image_df[non_null_image_df['brand'].isin(non_null_image_df['c3_name'])]

Unnamed: 0,uniq_id,product_name,description,brand,c0_name,c1_name,c2_name,c3_name,attributes,image_uri
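
Note that `Series.isin` tests whether each brand appears anywhere in the given category column, not whether it equals the category in the same row. That is why a row can match even when its own category differs from its brand. A minimal sketch contrasting the two checks, using hypothetical toy data (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: 'Frames' is a brand in row 0 and a category in row 2
toy_df = pd.DataFrame({
    'brand': ['Frames', 'Alisha', 'Tennis'],
    'c1_name': ['Wall Decor & Clocks', "Women's Clothing", 'Frames'],
})

# Column-wide check: brand matches a category found in ANY row
column_wide = toy_df[toy_df['brand'].isin(toy_df['c1_name'])]

# Row-wise check: brand equals the category in the SAME row
row_wise = toy_df[toy_df['brand'].notna() & (toy_df['brand'] == toy_df['c1_name'])]

print(column_wide.index.tolist())
print(row_wise.index.tolist())
```

Which check is appropriate depends on the question being asked: the column-wide form flags brands that collide with any category name, while the row-wise form flags only products whose own category is their brand.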


In [None]:
non_null_image_df.info()
#non_null_image_df.to_csv('18K_no_duplicate.csv', header=False, index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18360 entries, 0 to 18359
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   uniq_id       18360 non-null  object
 1   product_name  18360 non-null  object
 2   description   18360 non-null  object
 3   brand         13476 non-null  object
 4   c0_name       18360 non-null  object
 5   c1_name       18064 non-null  object
 6   c2_name       16940 non-null  object
 7   c3_name       12818 non-null  object
 8   attributes    18347 non-null  object
 9   image_uri     18360 non-null  object
dtypes: object(10)
memory usage: 1.4+ MB


In [None]:
!gsutil cp 18K_no_duplicate.csv gs://genai-product-catalog/

Copying file://18K_no_duplicate.csv [Content-Type=text/csv]...
-
Operation completed over 1 objects/14.0 MiB.                                     


In [None]:
columns = non_null_image_df.columns
with pd.ExcelWriter('flipkart_category_analysis.xlsx') as writer:
  for col in columns:
    non_null_image_df[col].value_counts().to_excel(writer, sheet_name=col)

# Preparing for Embedding Generation

In [None]:
# Rename the ID column to match the embedding generation code
non_null_image_df.rename(columns={'uniq_id':'id'}, inplace=True)

### Upload preprocessed data into BigQuery

In [None]:
from google.cloud import bigquery

def create_table(client, table_id, schema):
    """Create the table if it does not already exist."""
    table = bigquery.Table(table_id, schema=schema)
    table = client.create_table(table, exists_ok=True)  # Make an API request
    print(
        "Table ready: {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
    )


def upload_df_into_bq(client, table_id, df):
    """Replace the table contents with df. Relies on the global `schema` defined in the next cell."""
    job_config = bigquery.LoadJobConfig(schema=schema)
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
    job_config.autodetect = False
    # Serialize the DataFrame as CSV for the load job
    job_config.source_format = bigquery.SourceFormat.CSV
    job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
    job.result()  # Wait for the load job to finish
    # table_id is already fully qualified, so the project is not prepended again
    print(
        "Uploaded dataframe into table {}".format(table_id)
    )

In [None]:
PROJECT = 'solutions-2023-mar-107'
LOCATION = 'us-central1'
table_id = 'solutions-2023-mar-107.flipkart.preprocessed_data'

schema = [
    bigquery.SchemaField('id', 'STRING', mode='REQUIRED'),
    bigquery.SchemaField('product_name', 'STRING', mode='REQUIRED'),
    bigquery.SchemaField('description', 'STRING', mode='REQUIRED'),
    bigquery.SchemaField('brand', 'STRING', mode='NULLABLE'),
    bigquery.SchemaField('c0_name', 'STRING', mode='NULLABLE'),
    bigquery.SchemaField('c1_name', 'STRING', mode='NULLABLE'),
    bigquery.SchemaField('c2_name', 'STRING', mode='NULLABLE'),
    bigquery.SchemaField('c3_name', 'STRING', mode='NULLABLE'),
    bigquery.SchemaField('attributes', 'JSON', mode='NULLABLE'),
    bigquery.SchemaField('image_uri', 'STRING', mode='REQUIRED')
]
client = bigquery.Client(PROJECT)

create_table(client, table_id, schema)

upload_df_into_bq(client, table_id, non_null_image_df)
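
Before uploading, it may help to confirm locally that the DataFrame satisfies the schema, since a load job would likely reject null values in `REQUIRED` columns or malformed strings in a `JSON` column. The sketch below is an assumption-laden illustration rather than part of the original notebook: `check_against_schema` is a hypothetical helper, and the schema is represented as plain `(name, mode)` tuples so the example runs without a BigQuery client.

```python
import json

import pandas as pd


def check_against_schema(df, field_modes, json_columns=()):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    for name, mode in field_modes:
        if name not in df.columns:
            problems.append(f"missing column: {name}")
        elif mode == 'REQUIRED' and df[name].isna().any():
            problems.append(f"null values in REQUIRED column: {name}")
    for name in json_columns:
        if name not in df.columns:
            continue
        for value in df[name].dropna():
            try:
                json.loads(value)
            except (TypeError, ValueError):
                # Report each column once, at the first value that fails to parse
                problems.append(f"invalid JSON in column: {name}")
                break
    return problems


# Hypothetical toy data with one null id and one malformed attributes string
toy_df = pd.DataFrame({
    'id': ['a1', None],
    'attributes': ['{"Ideal For": "Women"}', '{broken'],
})
field_modes = [('id', 'REQUIRED'), ('attributes', 'NULLABLE'), ('image_uri', 'REQUIRED')]

problems = check_against_schema(toy_df, field_modes, json_columns=['attributes'])
for problem in problems:
    print(problem)
```

With the real data, the `(name, mode)` pairs could be derived from the `schema` list above, for example `[(f.name, f.mode) for f in schema]`.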


**Adding new (empty) columns for text & image embeddings**

In [None]:
table = client.get_table(table_id)
original_schema = table.schema
new_schema = original_schema[:]  # Creates a copy of the schema.
new_schema.append(bigquery.SchemaField('text_embedding', 'FLOAT', mode='REPEATED'))
new_schema.append(bigquery.SchemaField('image_embedding', 'FLOAT', mode='REPEATED'))

table.schema = new_schema
table = client.update_table(table, ["schema"])  # Make an API request.

if len(table.schema) == len(original_schema) + 2 == len(new_schema):
    print("Two new columns have been added.")
else:
    print("Something went wrong.")

Two new columns have been added.
