# Figuring out Skala's hair products
Thanks to the Skala products I've recently discovered the world of hair treatments, the _curly girl method_, and the many ways to style your hair. These products caught my attention because they are visually very appealing, but there are so many of them that they are a jungle to navigate: the main things you notice are the pot colors and the nice images on the labels.

Inspired by an online tutorial I decided to dig into Skala's hair products with NLP and see if I could understand these products better.
I am going to create a content-based recommendation system where the 'content' is the chemical components of the cosmetics, i.e. their ingredients. Specifically, I will process the ingredient lists of all Skala hair products via [word embedding](https://en.wikipedia.org/wiki/Word_embedding), and then visualize ingredient similarity using a machine learning method called t-SNE, plotted with Bokeh.

[![Skala_masks](./Images/skala_products.jpg)](https://www.olx.com.co/item/tratamiento-y-crema-de-peinar-marca-skala--iid-1111236615)

First of all, I import everything I need: from the fundamental numpy and pandas to the Bokeh modules required to make some nice interactive plots and, of course, t-SNE.

In [1]:
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import re

from bokeh.io import show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.transform import factor_cmap
# bokeh imports for server
from bokeh.layouts import row, column
from bokeh.models import Div

# Render Bokeh output inline in the notebook
output_notebook()

Then I import the data I downloaded with the other notebook (`Get_skalacosmetics_data.ipynb`) and check how many products are in the table per type. Not that many, but let's see where this goes.

In [2]:
# Load the data
df = pd.read_csv("datasets/Skala_hair_products.csv",index_col=0)
#print(df.head())
df.Product_type.value_counts()

Conditioner    32
Mask           31
Shampoo        17
Comb_cream      9
Beard           5
Gel             1
Name: Product_type, dtype: int64

To get to our end goal of comparing ingredients in each product, we first need to do some preprocessing tasks and bookkeeping of the actual words in each product's ingredients list.

Here I define the function I need to get a clean list of ingredients for each product from the raw string I downloaded from the website. This step is called **tokenization** and will be useful to create our bag of words.

In [3]:
def get_ingredients(raw_ingredients):
    #raw_ingredients=soup.find(id="produto-texto-composicao").string
    # Collapse runs of whitespace into a single space
    ingredients = re.sub(r'\s{2,}', ' ', raw_ingredients)
    # Drop periods and backslash escapes (a literal backslash followed by a letter)
    ingredients = re.sub(r'\.|\\[a-zA-Z]', '', ingredients)
    ingredients_lower = ingredients.lower()
    ingredients_list = ingredients_lower.split(',')  # tokenize ingredients
    # Trim leading and trailing spaces/tabs from each token
    ingredients_list = [re.sub(r'^[ \t]+|[ \t]+$', '', ing) for ing in ingredients_list]
    return ingredients_list
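
To make the cleaning steps concrete, here is a minimal sketch of the same tokenization applied to a made-up ingredient string. The sample string is hypothetical (it is not taken from the dataset), and the function is a compact restatement that uses `str.strip` instead of the regex trim, which behaves the same for plain spaces and tabs.

```python
import re

def get_ingredients(raw_ingredients):
    """Split a raw, comma-separated ingredient string into clean lowercase tokens."""
    ingredients = re.sub(r'\s{2,}', ' ', raw_ingredients)    # collapse whitespace runs
    ingredients = re.sub(r'\.|\\[a-zA-Z]', '', ingredients)  # drop periods and backslash escapes
    return [ing.strip(' \t') for ing in ingredients.lower().split(',')]

# Hypothetical raw string, only for illustration
sample = "Aqua,  Cetearyl Alcohol , Glycerin,Parfum."
tokens = get_ingredients(sample)
print(tokens)  # ['aqua', 'cetearyl alcohol', 'glycerin', 'parfum']
```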


After splitting them into tokens, we'll make a binary bag of words. Then we will create a **dictionary** with the tokens, `ingredient_idx`, which will have the following format:

`{ "ingredient_name": index value, … }`

In [4]:
# Initialize dictionary, list, and initial index
ingredient_idx = {}
corpus = []
idx = 0

# For loop for tokenization
for i,df_row in df.iterrows():   
   ingredients = get_ingredients(df_row['Ingredients'])
   corpus.append(ingredients)
   for ingredient in ingredients:
      if ingredient not in ingredient_idx:
            ingredient_idx[ingredient] = idx
            idx += 1
            

# Check the result 
#print(ingredient_idx)
print("The index for Sodium Laureth Sulfate is", ingredient_idx['sodium laureth sulfate'])

The index for Sodium Laureth Sulfate is 14
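
The indexing logic is easier to follow on a tiny example. The sketch below builds the same kind of dictionary from a toy corpus of two hypothetical products; `toy_corpus` and `toy_idx` are illustrative names, and `dict.setdefault` is just a compact alternative to the explicit `if ... not in` check.

```python
# Toy corpus of already-tokenized ingredient lists (hypothetical products)
toy_corpus = [
    ["aqua", "cetearyl alcohol", "glycerin"],
    ["aqua", "sodium laureth sulfate", "glycerin"],
]

toy_idx = {}
for tokens in toy_corpus:
    for ingredient in tokens:
        # setdefault only stores a value when the ingredient is new,
        # so len(toy_idx) is always the next free index
        toy_idx.setdefault(ingredient, len(toy_idx))

print(toy_idx)
# {'aqua': 0, 'cetearyl alcohol': 1, 'glycerin': 2, 'sodium laureth sulfate': 3}
```

Each ingredient gets its index the first time it is seen, so repeated ingredients such as `aqua` keep a single entry.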


The next step is making a **document-term matrix (DTM)**. Here each cosmetic product corresponds to a document, and each ingredient corresponds to a term, so we can think of it as a “cosmetic-ingredient” matrix. The matrix has size $M\times N$, where $M$ is the number of products in the dataframe and $N$ is the number of ingredients in our dictionary. To create it, we first make a matrix filled with zeros and then fill it following a simple rule: if an ingredient is in a cosmetic, the value is 1; if not, it remains 0.

In [5]:
# Get the number of items and tokens 
M = df.shape[0] # n of rows
N = len(ingredient_idx)

# Initialize a matrix of zeros
A = np.zeros(shape=[M,N])

Before we can fill the matrix, let's create a function that turns the tokens of each row (i.e., an ingredients list) into a binary vector. This is what we call **one-hot encoding**. By encoding the ingredients of every item this way, the cosmetic-ingredient matrix is filled with binary values: 1 where an ingredient is present, 0 otherwise.

In [6]:
# Define the oh_encoder function
def oh_encoder(tokens):
    x = np.zeros(len(ingredient_idx))
    for ingredient in tokens:
        # Get the index for each ingredient
        idx = ingredient_idx[ingredient]
        # Put 1 at the corresponding index (stays binary even if an ingredient is listed twice)
        x[idx] = 1
    return x

Now we'll apply the `oh_encoder()` function to the tokens in `corpus` and set the values of each row of the matrix. The result tells us which ingredients each item is composed of.

In [7]:
# Make a document-term matrix: one row per product
for i, tokens in enumerate(corpus):
    A[i, :] = oh_encoder(tokens)
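
As a quick check of the matrix layout, here is a small sketch on a toy corpus. All names and values are hypothetical, and the second product deliberately lists `aqua` twice to show that the entries stay binary.

```python
import numpy as np

# Two hypothetical products; "aqua" is repeated on purpose in the second one
toy_corpus = [
    ["aqua", "cetearyl alcohol", "glycerin"],
    ["aqua", "sodium laureth sulfate", "glycerin", "aqua"],
]
toy_idx = {"aqua": 0, "cetearyl alcohol": 1, "glycerin": 2, "sodium laureth sulfate": 3}

# Rows are products (M = 2), columns are ingredients (N = 4)
toy_A = np.zeros((len(toy_corpus), len(toy_idx)))
for row, tokens in enumerate(toy_corpus):
    cols = [toy_idx[t] for t in tokens]
    toy_A[row, cols] = 1  # assignment keeps the value at 1 for repeated tokens

print(toy_A)
# [[1. 1. 1. 0.]
#  [1. 0. 1. 1.]]
```

Summing a row gives the number of distinct ingredients in that product, and summing a column gives the number of products that contain that ingredient.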

For visualization, we need to downsize the document-term matrix into two dimensions. We'll use t-SNE for reducing the dimension of the data here.

**T-distributed Stochastic Neighbor Embedding (t-SNE)** is a nonlinear dimensionality reduction technique that is well suited for embedding high-dimensional data in a low-dimensional space of two or three dimensions for visualization. Specifically, it reduces the dimension of the data while trying to keep similar instances close together. This lets us place every cosmetic item in our data at a two-dimensional coordinate, where nearby points indicate products with similar ingredient lists.

In [8]:
# Dimension reduction with t-SNE
model = TSNE(n_components=2, learning_rate=200, random_state=42, init='random')  # alternative: init='pca'
tsne_features = model.fit_transform(A)

# Make X, Y columns 
df['X'] = tsne_features[:,0]
df['Y'] = tsne_features[:,1]
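
Since t-SNE is stochastic and mostly preserves local neighbourhoods, distances on the map are only a rough guide to similarity. One way to cross-check a pair of products is to compute a similarity directly on the binary matrix. The sketch below uses Jaccard similarity, which is my own addition and not part of the analysis above, on a small hypothetical matrix; on the real data the same lines would take `A` instead of `toy_A`.

```python
import numpy as np

# Toy binary product-by-ingredient matrix (hypothetical values)
toy_A = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])

shared = toy_A @ toy_A.T                     # ingredients each pair of products has in common
counts = toy_A.sum(axis=1)                   # number of ingredients per product
union = counts[:, None] + counts[None, :] - shared
jaccard = shared / union                     # 1.0 = identical lists, 0.0 = nothing shared

print(np.round(jaccard, 2))
```

This assumes every product has at least one ingredient, otherwise the union would be zero for that product paired with itself.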

## Visualization
We are now ready to create our plot. With the t-SNE values, we can plot all our items on the coordinate plane. And the coolest part here is that it will also show us the name, the brand and the category of each item. Let's make a scatter plot using Bokeh and add a hover tool to show that information. 

In [9]:
cols= ["#1f77b4","#fda000","#e55171","#44aa99","#332288","#e6a1eb","#0b672a","#f6d740","#bb2aa7","#6ab1d1","#5e0101","#3eba00"]
source = ColumnDataSource(df)

def scatter_plot(source):
    # Build a scatter plot from the given data source
    tools = 'pan,wheel_zoom,tap,box_select,reset'
    plot = figure(x_axis_label='X',
                  y_axis_label='Y',
                  width=700, height=500, tools=tools)

    # scatter() draws circles by default; circle(size=...) is deprecated in recent Bokeh releases
    plot.scatter(x='X',
                 y='Y',
                 source=source,
                 size=10, alpha=.8, legend_field="Product_type",
                 color=factor_cmap('Product_type', cols, df.Product_type.unique()))

    # Create a HoverTool object
    hover = HoverTool(tooltips=[('Product', '@Product_name'),
                                ('Brand', '@Brand'),
                                ('Category', '@Product_type')])
    plot.add_tools(hover)
    plot.legend.location = 'bottom_right'
    plot.legend.background_fill_alpha = 0.5
    plot.legend.border_line_width = 3
    return plot

# Plot the map
plot = scatter_plot(source)
show(plot)

Interestingly, shampoos and most beard products clearly cluster together, separately from all other types of products. All styling products, on the other hand, form roughly two clusters, which include conditioners, masks, combing creams and the gel. It also seems that products from the same line tend to cluster together, having very similar ingredients (and making them redundant?).

To take this a bit further, let's now make a Bokeh server that lets us select a few products and check which ingredients they have in common and which ones characterize each of them. This requires a few set methods.
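
Before wiring this into Bokeh, here is a minimal sketch of the set logic on its own. The two products and their ingredients are hypothetical; only `set.intersection` and set difference are needed.

```python
# Hypothetical ingredient sets for two products from the same line
selected = {
    "Mask": {"aqua", "cetearyl alcohol", "glycerin", "argan oil"},
    "Conditioner": {"aqua", "cetearyl alcohol", "glycerin", "citric acid"},
}

# Ingredients present in every selected product
common = set.intersection(*selected.values())

# Ingredients left over once the common ones are removed
characterizing = {name: ingr - common for name, ingr in selected.items()}

print(sorted(common))   # ['aqua', 'cetearyl alcohol', 'glycerin']
print(characterizing)   # {'Mask': {'argan oil'}, 'Conditioner': {'citric acid'}}
```

Note that `set.intersection(*...)` needs at least one set, which is why the selection has to be checked for emptiness before calling it.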

In [10]:
corpus_set=[set(c) for c in corpus]

def ingredients_server(doc):
    # set up widgets
    title = Div(text='', width_policy='fit')
    common_ingr = Div(text='', width_policy='fit')
    characterizing_ingr = Div(text='', width_policy='fit')
    
    # call plot from previous cell
    source = ColumnDataSource(df)
    plot = scatter_plot(source)
    
    # add a callback to a widget
    def update(selected=None):
        title.text = 'Please select some products to compare their ingredients'

    def update_ingredients(selected):
        title.text = '<b>Common ingredients of selected products</b>:'
        data = [corpus_set[s] for s in selected]
        # Ingredients shared by all selected products
        common_ingredients = set.intersection(*data)
        common_ingr.text = str(common_ingredients)
        characterizing_ingr.text = '<b>Characterizing ingredients of each product</b>:<br>'
        for s in selected:
            # Ingredients of this product that are not shared by all selected ones
            characterizing_ingredients = corpus_set[s].difference(common_ingredients)
            characterizing_ingr.text += '<b>'+df.Product_name.iloc[s]+'</b>: '+str(characterizing_ingredients)+'<br>'

    def selection_change(attrname, old, new):
        selected = source.selected.indices
        if selected:
            update_ingredients(selected)
        else:
            # Nothing selected: clear the text of both ingredient widgets
            title.text = 'No products selected'
            common_ingr.text = ''
            characterizing_ingr.text = ''

    source.selected.on_change('indices', selection_change)

    # create a layout for everything
    text_col=column(title,common_ingr,characterizing_ingr)
    layout = row(plot,text_col)
    
    # initialize
    update()

    # add the layout to curdoc
    doc.add_root(layout)
    
# In the notebook, just pass the function that defines the app to show
# You may need to supply notebook_url, e.g notebook_url="http://localhost:8889" 
show(ingredients_server) 

By selecting a conditioner and a mask from the same line that sit very close together on the map, we can see that their lists of ingredients are extremely similar.

As I'm not a specialist in this area, I can't say whether this small difference is still substantial. I do believe, though, that they must have a very similar effect on the hair, as they both have the same "active ingredients" (such as oils, proteins and extracts). Their different use could be due to, e.g., a difference in pH.