<div>
    <img src="https://i.imgur.com/kQAVzSD.png">
    </div>

<center><h1>Introduction 📝</h1></center>

> 🎯Goal: To get a better understanding of the textual data.

In [None]:
pip install stylecloud

In [None]:
pip install matplotlib==3.1

In [None]:
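from IPython.display import display, HTML, Javascript
import pandas as pd

def nb():
    # Load the custom stylesheet and wrap it in a <style> tag for the notebook.
    with open("../input/css-style/edit.css", "r") as f:
        styles = f.read()
    return HTML("<style>" + styles + "</style>")

# Display explicitly so the styles apply even when this is not the last expression in the cell.
display(nb())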

train_df = pd.read_csv('../input/shopee-product-matching/train.csv')

> ⏬ We'll work with a top-down approach. Let's first see the wordcloud ☁️ for all the titles.

In [None]:
import stylecloud

stylecloud.gen_stylecloud(text=' '.join(train_df['title']),
                          icon_name='fas fa-shopping-cart',
                          palette='colorbrewer.diverging.Spectral_11',
                          background_color='black',
                          gradient='horizontal',
                          size=1024)

import os
from IPython.display import Image
Image(filename="./stylecloud.png", width=1024, height=1024)

> We see the titles are not entirely in English. There is Indonesian too (a lot of it!) 🙉.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer as CV

def _sorted_term_freq(corpus, ngram_range=(1, 1)):
    # Count every n-gram in the corpus and return (term, count) pairs, most frequent first.
    vec = CV(ngram_range=ngram_range).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    return sorted(words_freq, key=lambda x: x[1], reverse=True)

def get_top_n_words(corpus, n=None):
    # The n most frequent single words.
    return [word for word, _ in _sorted_term_freq(corpus)[:n]]

def get_top_n_words_s(corpus, n=None):
    # The n most frequent words as one string, each word repeated once per occurrence.
    s = ''
    for word, freq in _sorted_term_freq(corpus)[:n]:
        for _ in range(freq):
            s += ' ' + word + ' '
    return s

def get_top_n_bigram(corpus, n=None):
    # The n most frequent two-word sequences.
    return [term for term, _ in _sorted_term_freq(corpus, ngram_range=(2, 2))[:n]]

def get_top_n_trigram(corpus, n=None):
    # The n most frequent three-word sequences.
    return [term for term, _ in _sorted_term_freq(corpus, ngram_range=(3, 3))[:n]]
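
> To make the counting step concrete, here is a minimal sketch on a toy corpus. The three titles and the helper name `ngram_counts` are made up for illustration and are not taken from `train.csv`. It keeps the counts next to the terms, so we can see what the ranking is based on.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy titles, not taken from train.csv.
toy_titles = [
    "tas wanita murah",
    "tas pria murah",
    "sepatu wanita",
]

def ngram_counts(corpus, ngram_range=(1, 1)):
    """Return (term, count) pairs, most frequent first, ties broken alphabetically."""
    vec = CountVectorizer(ngram_range=ngram_range)
    bag_of_words = vec.fit_transform(corpus)   # sparse matrix: documents x vocabulary
    totals = bag_of_words.sum(axis=0)          # 1 x vocabulary matrix of column sums
    pairs = [(term, int(totals[0, idx])) for term, idx in vec.vocabulary_.items()]
    return sorted(pairs, key=lambda p: (-p[1], p[0]))

unigrams = ngram_counts(toy_titles)
bigrams = ngram_counts(toy_titles, ngram_range=(2, 2))
print(unigrams)
print(bigrams)
```

> Each column of the bag-of-words matrix is one term, so summing over the rows gives the total count of that term across all titles.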

> Down the rabbit 🐇 hole we go. We will first see the top 100 words.

In [None]:
top100 = get_top_n_words(train_df['title'], n=100)

In [None]:
import stylecloud

stylecloud.gen_stylecloud(text=' '.join(top100),
                          icon_name='fas fa-paw',
                          palette='colorbrewer.diverging.Spectral_11',
                          background_color='black',
                          gradient='horizontal',
                          size=1024)

import os
from IPython.display import Image
Image(filename="./stylecloud.png", width=1024, height=1024)

> There are a lot of Indonesian words (and I don't know what they mean). So, let's find out what they are. Below is a dictionary that converts them from Indonesian to English.

<center>
 {'bahan' : 'ingredient', 'bisa' : 'can', 'rak' : 'rack', 'panjang' : 'long', 'untuk' : 'to', 'rambut' : 'hair', 'bayi' : 'baby', 'celana' : 'pants', 'isi' : 'contents', 'grosir' : 'wholesaler', 'tas' : 'bag', 'kaki' : 'feet', 'kaos' : 't-shirt', 'lampu' : 'light', 'tali' : 'rope', 'pria' : 'men', 'dan' : 'and', 'plastik' : 'plastic', 'baju' : 'clothes', 'putih' : 'white', 'alat' : 'tool', 'paket' : 'package', 'mobil' : 'car', 'gamis' : 'robe', 'tempat' : 'the place', 'anak' : 'child', 'warna' : 'color', 'dompet' : 'purse', 'wanita' : 'women', 'wajah' : 'face', 'termurah' : 'cheapest', 'mainan' : 'toy', 'sabun' : 'soap', 'dengan' : 'with', 'jilbab' : 'hijab', 'hitam' : 'black', 'tangan' : 'hand', 'karakter' : 'character', 'murah' : 'cheap', 'sarung' : 'scabbard', 'sepatu' : 'shoes', 'pendek' : 'short', 'botol' : 'bottle', 'kain' : 'fabric'}
</center>

> Let's revisualize the above wordcloud after Indonesian words are converted into English.

In [None]:
# Indonesian -> English lookup for the Indonesian words among the top 100.
con = {'bahan' : 'ingredient', 'bisa' : 'can', 'rak' : 'rack', 'panjang' : 'long', 'untuk' : 'to',
       'rambut' : 'hair', 'bayi' : 'baby', 'celana' : 'pants', 'isi' : 'contents', 'grosir' : 'wholesaler',
       'tas' : 'bag', 'kaki' : 'feet', 'kaos' : 'tshirt', 'lampu' : 'light', 'tali' : 'rope', 'pria' : 'men',
       'dan' : 'and', 'plastik' : 'plastic', 'baju' : 'clothes', 'putih' : 'white', 'alat' : 'tool',
       'paket' : 'package', 'mobil' : 'car', 'gamis' : 'robe', 'tempat' : 'the place', 'anak' : 'child',
       'warna' : 'color', 'dompet' : 'purse', 'wanita' : 'women', 'wajah' : 'face', 'termurah' : 'cheapest',
       'mainan' : 'toy', 'sabun' : 'soap', 'dengan' : 'with', 'jilbab' : 'hijab', 'hitam' : 'black',
       'tangan' : 'hand', 'karakter' : 'character', 'murah' : 'cheap', 'sarung' : 'scabbard',
       'sepatu' : 'shoes', 'pendek' : 'short', 'botol' : 'bottle', 'kain' : 'fabric'}

# Translate a word when it is in the dictionary, otherwise keep it unchanged.
convtop100 = [con.get(i, i) for i in top100]
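
> The same lookup idea can be applied to a whole title instead of a list of top words. This is only a sketch: `id_to_en` repeats a few entries of `con` so the snippet runs on its own, and `translate_tokens` is a hypothetical helper, not something used elsewhere in this notebook. It splits on whitespace only, so punctuation attached to a word would block the match.

```python
# A few entries copied from the `con` dictionary, repeated so the sketch is self-contained.
id_to_en = {"tas": "bag", "wanita": "women", "murah": "cheap", "sepatu": "shoes"}

def translate_tokens(title, mapping):
    """Lowercase a title, split on whitespace, and swap every token found in the mapping."""
    return " ".join(mapping.get(tok, tok) for tok in title.lower().split())

translated = translate_tokens("Tas Wanita Murah import", id_to_en)
print(translated)
```

> Words that are missing from the dictionary (like "import" here) pass through unchanged, which is the same behaviour as the top-100 conversion.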

In [None]:
import stylecloud

stylecloud.gen_stylecloud(text=' '.join(convtop100),
                          icon_name='fas fa-paw',
                          palette='colorbrewer.diverging.Spectral_11',
                          background_color='black',
                          gradient='horizontal',
                          size=1024)

import os
from IPython.display import Image
Image(filename="./stylecloud.png", width=1024, height=1024)

> We notice words like Cheapest, Cheap, Premium, Free and Original, which are just used to drive SEO. They won't help us at all in matching the images.

> We also notice the word Korea, which again will not help us match the images. It is probably here because of the hype around Korean beauty 👸 products.

> If we look at the frequency of words, we see that most products are focused on women (Hijab, Bag, Serum, Hair, Cream 🧴).

> The other categories that exist are products for men (Polos, Men, Pants👖 and 👕Shirt), babies (Toy, Bottle), 
household stuff (LED and Tool) and electronics (USB🔌 and Bluetooth🎧).

> One has to definitely mention Mask 😷 (Thanks, Corona).
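
> Coming back to the SEO filler words noted above (Cheap, Cheapest, Premium, Free, Original): one way to drop them before comparing titles is a custom stop-word list. This is a sketch under assumptions. The list `SEO_FILLER` is my own guess built from the words spotted in the wordcloud, and the two titles are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed list of marketing filler terms (English and Indonesian); extend after inspecting the data.
SEO_FILLER = ["murah", "termurah", "cheap", "cheapest", "premium", "free", "original"]

# Hypothetical titles, not taken from train.csv.
toy_titles = [
    "Tas Wanita Murah Original",
    "Serum Wajah Korea PREMIUM free",
]

# The analyzer lowercases, tokenizes, and removes the stop words in one call.
analyzer = CountVectorizer(stop_words=SEO_FILLER).build_analyzer()
cleaned = [" ".join(analyzer(title)) for title in toy_titles]
print(cleaned)
```

> Whether removing these words actually helps the matching is something to check on the data, since a few of them could still separate otherwise similar listings.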

> Let's go further down the rabbit hole. We will now visualize the top 1000 words.

In [None]:
top1000 = get_top_n_words(train_df['title'], n=1000)

In [None]:
import stylecloud

stylecloud.gen_stylecloud(text=' '.join(top1000),
                          icon_name='fas fa-feather-alt',
                          palette='colorbrewer.diverging.Spectral_11',
                          background_color='black',
                          gradient='horizontal',
                          size=1024)

import os
from IPython.display import Image
Image(filename="./stylecloud.png", width=1024, height=1024)

> We see a larger picture here but again we need to convert the Indonesian words to English. Let's do that.

In [None]:
convtop1000 = '''kids women original cheap bag men masks and babies ml anti for baby sets fill hands 100 air t-shirts color shoes motifs shoes cream premium mini import xe2 pants tools gr plain material hair pcs led cheapest fashion serum mask jam can in 10 toy lights hijab case by free place bpom bag shelf long korean promo pro usb cover new plastic cloth sandal character dress cellphone soap black package gamis box plus jilbab 12 jumbo white white x80 bottles size ready with wholesale car super holder face gel feet black 1kg strap wallet 100ml short body bluetooth cover 30 samsung motor multipurpose portable gram book iphone rubber cable hand hair box year ori face lip cm model oil folding digital cleaner 50 muslim flower kg soft honey sling for skin headset spray oppo original cap mukena matte charger adult whitening cotton color up silicone on xiaomi full latest instant medicine skin bestselling bath facial x9d 11 wall stainless glow milk kitchen boy pink glass lotion glasses water sport eye size koko r esmi oil jeans gold hanger brush sleeve powder toner warranty cute wardah women with beauty wash joyko big earphone the all 2020 xe3 soap suit make bike tie xl small acne in series speaker bubble knit pouch thick without on khimar brush brown 20 cat waterproof pants extra red wash jelly asi max fruit refill sweater wireless 15 care best backpack milk light wood micro silicone double cosmetic kids softlens cat 3d redmi knife pay tunic robot clothing type travel packaging price quality protector and al stand natural to collagen vivo glass towel 200 multifunctional electric bra non battery teeth universal magic pack ring note herbal spice perfume vitamin feather paper 500 food sachet xa4 sport 24 food pen tissue liquid pencil negligee 60 ball weighing mi sale fresh liquid clamp pillow seller kit blue fan new bandana tablet casual bali phone glue tempered stock fish sleep hanging table hold eat brush case education blender shirt stack pump realme high 30ml organic 
balloons slip decoration tape sticker filter date card my taste noodles per body wipe egg pigeon machine wrap sneakers packing xxl hanasui data puzzle xef foam spoon guy jacket shari bleach stick plate a5 emina brightening wind chick card meter id pan fit tea pajamas makeup 40 steel drawstring round doll home x9c bag shape 200ml off accessories diamond vanilla shower viva batik real inch cartoon dot essence cotton bass storage couple xs pen sanitizer fast clear embroidery handsfree touch mic pure 50ml bb 500ml capsules cup cleanser drink cake scuba blue waist organizer hot daily sheet mix free maxi powder powder mobile phone distro shaving maxy hot hood slim soft mixer foam top quality fried tissue sd elastic wet 60ml coffee tea shampoo leaves nivea big 25 night you pot lid simple tv 150ml 01 sticker high quality transparent iron tripod strip x98 selfie electric uv halal smart men smooth 250 liter hoodie android clip beautiful faucet uk batam plant lightening star scarlett somebymi bottle kitchen ornamental bergo instant aloe green watch gas jersey 500gr peel 400 yellow one cake cap screen import korean stereo chicken head a7 power silver organic no safety sponge classic tint pump room age bts xf0 green charging dispenser galaxy pimples training 150 cable sugar roll mm perfect sprei niacin a3s headband apple picture onion smartphone audio diaper garnier cussons ecer xb8 camera unisex long single implora 6s rayon paper 250ml 2in1 lemon cook camera base cleansing king style buy hdmi 5mm part teether basket pad sponge multi aqua dark outdoor band 16 moisturizing rubber kpop writing fluid pop lipstick pet helm dry belt pen love me retro adapter eye grade strap pc sorex car f9 sling rose velvet tire key cute star variant clear vera ultra spf 3in1 basic complete hanger lcd connection philips a9 a5s eyebrow mika letter xc2 sling current moon cream onemed sendal some sugar school nail backpack microphone ms no nail adhesive watt brown lace gaming screen l emari rice navy 
battery umbrella ribbon moscrepe control cheese ply gie day high eyebrow japan nano tube jakarta little gift f1s cup cd storage mold cover spatula stick flash 99 overall foundation yuja laundry liptint wipes mat starter shield button cool extra work flexible unique skirt skincare xr pregnant quran pashmina madame 32 flower x9f f11 a37 combination duct tape pan moisture 90 laptop 120 hn by electric a71 wallpaper paris gluta pearl magnet vanilla toilet 36 tents 1000 art ac niqab flip roti hose refrigerator comedo running 100gr snack octopus style from equipment women animal protection bh miracle busui coffee powerbank nice analog wheel life mist rice baso balm of rain learn square aspect 14 wire sauce glossy top massage cheek needle red f5 manual lock vise shop 180 adem waistbag front sheet grip ornament 250gr a31 veil pixy snail ear fit royal ash corset kanebo 15gr mouse telon crack shirt chino home chocolate hello prime game bear scrub net unicorn kant'''

In [None]:
import stylecloud

stylecloud.gen_stylecloud(text=convtop1000,
                          icon_name='fas fa-feather-alt',
                          palette='colorbrewer.diverging.Spectral_11',
                          background_color='black',
                          gradient='horizontal',
                          size=1024)

import os
from IPython.display import Image
Image(filename="./stylecloud.png", width=1024, height=1024)

> We see the top 100 words still blocking our view. Let's remove them.

In [None]:
a = [i.lower() for i in convtop100]
b = [i.lower() for i in convtop1000.split()]

In [None]:
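# Keep only the top-1000 words that are not already in the top 100 (order preserved).
top100_set = set(a)
c = [i for i in b if i not in top100_set]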

In [None]:
import stylecloud

stylecloud.gen_stylecloud(text=' '.join(c),
                          icon_name='fas fa-feather-alt',
                          palette='colorbrewer.diverging.Spectral_11',
                          background_color='black',
                          gradient='horizontal',
                          size=1024)

import os
from IPython.display import Image
Image(filename="./stylecloud.png", width=1024, height=1024)

> We see more of the product categories identified in the top 100 words.

> We see baby products (milk, tissues, wipes, and kids 🧒)

> We see the 'Organic' 🌿 trend coming up. 

> We see kitchen products coming up (Oil, Kitchen, Sugar, Coffee ☕, Pan, Stainless, Knife, Liquid and Sponge).

> We see stationery/school stuff coming up (Paper, Pen🖊️ and Backpack).

> We see pets coming up (Cat🐱).

> We see more electronics (Camera📷, Samsung, OPPO, Xiaomi, Digital and Battery).

> Indonesian is not my strong suit, and I think that is true for a lot of us. Converting the Indonesian words to English has given me a better idea of the distribution of products in the dataset.

> At the global level, we see the broad category of baby/adult stuff and personal/household stuff.

> As we go further down, other categories show up: we get the broad student (stationery/school) category and the pet category.

> I hope you liked the wordclouds.

> There will be more wordclouds to follow. Stay tuned.