In [11]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Weighted Similarity Scheme
Each product's ingredients are given a weight based on their position in the ingredient list. To score the similarity between two products, the weights of each matched ingredient are multiplied together and those products are summed. This sum is then divided by the maximum possible score (the sum of the squared weights from the shorter list). In the code below, the normalization step is carried out with `cosine_similarity`, which divides the same sum by the product of the two weight vectors' norms.
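
As an illustration only, the scoring described in this paragraph can be sketched literally for two products. The function name `described_score` and the ingredient weights are hypothetical, and the weights are made-up values rather than ones taken from the dataset.

```python
def described_score(weights_a: dict, weights_b: dict) -> float:
    """Literal reading of the scheme: sum of matched weight products,
    divided by the sum of squared weights of the shorter list."""
    shared = weights_a.keys() & weights_b.keys()
    matched = sum(weights_a[name] * weights_b[name] for name in shared)

    # The shorter ingredient list defines the maximum possible score
    shorter = weights_a if len(weights_a) <= len(weights_b) else weights_b
    max_score = sum(w * w for w in shorter.values())
    return matched / max_score

# Hypothetical products: ingredient name -> position-based weight
product_a = {"WATER": 0.5, "GLYCERIN": 0.3, "SQUALANE": 0.2}
product_b = {"WATER": 0.6, "NIACINAMIDE": 0.4}

score = described_score(product_a, product_b)
print(round(score, 3))
```

Only `WATER` matches here, so the numerator is a single product of two weights, and the denominator comes from `product_b`, the shorter list.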

To get the weights, we will use the geometric distribution.

![Geometric Distribution (from Wikipedia)](images/geometric_distribution.png)

**Geometric distribution:**
The probability that the first success occurs on trial $i$, when each independent trial succeeds with probability $p$, is
$$
p(1 - p) ^ {i - 1}
$$
Here $i$ is an ingredient's position in the list. A smaller $p$ produces more evenly distributed weights, whereas a larger $p$ puts more importance on earlier ingredients.
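
To make the effect of $p$ concrete, the following sketch compares two values for a five-ingredient list. The helper `geometric_weights` is a hypothetical stand-in written for this example, and the values 0.1 and 0.5 are arbitrary choices.

```python
import numpy as np

def geometric_weights(n, p):
    """Normalized geometric weights p(1-p)^(i-1) for positions i = 1..n."""
    i = np.arange(1, n + 1)
    raw = p * (1 - p) ** (i - 1)
    return raw / raw.sum()

flat = geometric_weights(5, 0.1)   # small p: weights decay slowly
steep = geometric_weights(5, 0.5)  # large p: weights decay quickly

# Ratio of the first weight to the last is (1 - p) ** -(n - 1)
print(flat[0] / flat[-1])
print(steep[0] / steep[-1])
```

With $p = 0.5$ the first ingredient carries 16 times the weight of the fifth, whereas with $p = 0.1$ the ratio is only about 1.5.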

In [12]:
# uses a geometric distribution so each weight
# decreases geometrically according to its position.
def generate_weights(n: int, p: float = 0.2) -> list:
    '''
    n: length of desired weight list
    p: parameter for geometric distribution (between 0 and 1)

    Returns a list of weights that sum to 1 based on the
    geometric distribution.
    '''
    weights = []
    total_weight = 0
    for i in range(1, n + 1):
        weight = p * ((1 - p) ** (i - 1))  # geometric pmf
        weights.append(weight)
        total_weight += weight
    normalized_weights = [weight / total_weight for weight in weights]
    return normalized_weights

generate_weights(5, 0.2)

[0.29747739171822934,
 0.23798191337458352,
 0.19038553069966682,
 0.15230842455973348,
 0.12184673964778678]

## Weighted Distance Matrix
The goal of this similarity scheme is to produce a pairwise matrix that judges the "distance" between products based on the geometric weights assigned to each ingredient. Each product is represented as a vector over every unique ingredient in the dataset, so that it has a weight for each possible ingredient (the weight is 0 if the product doesn't contain that ingredient).
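
A toy version of this representation, using a hypothetical four-ingredient vocabulary and made-up weights, may help. The pairwise score is computed by hand with NumPy as the dot product divided by the product of the row norms, which is the cosine similarity for nonzero rows.

```python
import numpy as np

# Hypothetical vocabulary of every unique ingredient, sorted
vocabulary = ["GLYCERIN", "NIACINAMIDE", "SQUALANE", "WATER"]

# Hypothetical position-based weights for two products
product_a = {"WATER": 0.5, "GLYCERIN": 0.3, "SQUALANE": 0.2}
product_b = {"WATER": 0.6, "NIACINAMIDE": 0.4}

# One row per product, one column per ingredient; absent ingredients stay 0
rows = np.array([[product.get(name, 0.0) for name in vocabulary]
                 for product in (product_a, product_b)])

# Cosine similarity: dot products divided by the product of row norms
norms = np.linalg.norm(rows, axis=1)
similarity = (rows @ rows.T) / np.outer(norms, norms)
print(similarity)
```

The diagonal is 1 because every product is identical to itself, and the off-diagonal entry depends only on the shared ingredient `WATER`.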

First, we need to import the data (formatted in `0_data_preprocess.ipynb`) and add the weights column.

In [13]:
# Dataframe exported/imported as JSON to preserve
# the columns with a list format (csv gets messy)
df_full = pd.read_json('data/skincare_products_1.json')
df = df_full[df_full['Cosing Ref No'].apply(lambda x: len(x) > 1)]
df.head(3)

Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive,Cosing Ref No,INCI Name,Function,Weights,Weighted Similarity Products
0,Moisturizer,LA MER,Crème de la Mer,175,4.1,"[Algae (Seaweed) Extract, Mineral Oil, Petro...",1,1,1,1,1,"[54290.0, 95058.0, 79504.0, 34040.0, 34654.0, ...","[ALGAE EXTRACT, HYDROGENATED MINERAL OIL, PETR...","[FRAGRANCE, HUMECTANT, ORAL CARE, SKIN CONDITI...","[0.20002126990000002, 0.1600170159, 0.12801361...",[Little Miss Miracle Limited-Edition Crème de ...
1,Moisturizer,SK-II,Facial Treatment Essence,179,4.1,"[Galactomyces Ferment Filtrate (Pitera), Buty...",1,1,1,1,1,"[84397, 74756, 58983, 92472, 37735, 35342, 38173]","[GALACTOMYCES FERMENT FILTRATE, BUTYLENE GLYCO...","[HUMECTANT, FRAGRANCE, HUMECTANT, SKIN CONDITI...","[0.2530733224, 0.2024586579, 0.161966926400000...","[Facial Treatment Essence Mini, Facial Treatme..."
2,Moisturizer,DRUNK ELEPHANT,Protini™ Polypeptide Cream,68,4.4,"[Water, Dicaprylyl Carbonate, Glycerin, Cet...",1,1,1,1,0,"[92472, 55832, 34040, 75132, 55337, 38182, 583...","[WATER, DICAPRYLYL CARBONATE, GLYCERIN, CETEAR...","[ANTIPLAQUE, SKIN CONDITIONING, SOLVENT, SKIN ...","[0.2000003831, 0.1600003065, 0.1280002452, 0.1...","[C-Tango™ Multivitamin Eye Cream, After-Sun Mi..."


In [14]:
# Adds weights column to df. assign() returns a new DataFrame,
# so nothing is written into a slice of df_full.
df = df.assign(
    Weights=df['INCI Name'].apply(lambda x: generate_weights(len(x), p=0.2)))

df.head(3)


Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive,Cosing Ref No,INCI Name,Function,Weights,Weighted Similarity Products
0,Moisturizer,LA MER,Crème de la Mer,175,4.1,"[Algae (Seaweed) Extract, Mineral Oil, Petro...",1,1,1,1,1,"[54290.0, 95058.0, 79504.0, 34040.0, 34654.0, ...","[ALGAE EXTRACT, HYDROGENATED MINERAL OIL, PETR...","[FRAGRANCE, HUMECTANT, ORAL CARE, SKIN CONDITI...","[0.20002126990973731, 0.16001701592778989, 0.1...",[Little Miss Miracle Limited-Edition Crème de ...
1,Moisturizer,SK-II,Facial Treatment Essence,179,4.1,"[Galactomyces Ferment Filtrate (Pitera), Buty...",1,1,1,1,1,"[84397, 74756, 58983, 92472, 37735, 35342, 38173]","[GALACTOMYCES FERMENT FILTRATE, BUTYLENE GLYCO...","[HUMECTANT, FRAGRANCE, HUMECTANT, SKIN CONDITI...","[0.25307332242756025, 0.2024586579420482, 0.16...","[Facial Treatment Essence Mini, Facial Treatme..."
2,Moisturizer,DRUNK ELEPHANT,Protini™ Polypeptide Cream,68,4.4,"[Water, Dicaprylyl Carbonate, Glycerin, Cet...",1,1,1,1,0,"[92472, 55832, 34040, 75132, 55337, 38182, 583...","[WATER, DICAPRYLYL CARBONATE, GLYCERIN, CETEAR...","[ANTIPLAQUE, SKIN CONDITIONING, SOLVENT, SKIN ...","[0.20000038312461918, 0.16000030649969535, 0.1...","[C-Tango™ Multivitamin Eye Cream, After-Sun Mi..."


Now, we can define a function that will create a weight matrix out of the dataframe as it is currently formatted. 

In [15]:
def create_weight_matrix(df):
    '''
    Creates a (n,m) array, with n rows of
    products each containing weights for each of
    the m unique ingredients.
    '''
    names = df['Name']

    ingredients = df.explode("INCI Name")['INCI Name'].unique()
    ingredients.sort()

    weight_matrix = \
        np.zeros((len(names), len(ingredients)))
    
    curr_row = 0

    for i in df.index:
        curr_ingredient_vector = np.zeros(len(ingredients))

        indices = np.searchsorted(
            ingredients, df.loc[i, 'INCI Name'])
        
        curr_ingredient_vector[indices] = \
            df.loc[i, 'Weights']
        
        weight_matrix[curr_row, :] = curr_ingredient_vector
        curr_row += 1

    return weight_matrix 
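
The core step of `create_weight_matrix` is the lookup with `np.searchsorted`, which maps each ingredient name to its column in the sorted vocabulary. The sketch below shows that step for a single product, with a hypothetical vocabulary and made-up weights. It relies on every name being present in the sorted array, which holds in the function above because the vocabulary is built from the same column.

```python
import numpy as np

# Hypothetical sorted vocabulary of unique ingredients
ingredients = np.array(["GLYCERIN", "NIACINAMIDE", "SQUALANE", "WATER"])

# One product's ingredients in label order, with position-based weights
product_ingredients = ["WATER", "GLYCERIN", "SQUALANE"]
product_weights = [0.5, 0.3, 0.2]

# Column index of each ingredient in the sorted vocabulary
indices = np.searchsorted(ingredients, product_ingredients)

# Scatter the weights into a row of zeros
row = np.zeros(len(ingredients))
row[indices] = product_weights
print(indices, row)
```

If a product listed the same ingredient twice, the fancy-index assignment would keep only the last weight written to that column.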


In [16]:
# 1248 products, 3248 unique ingredients

weight_matrix = create_weight_matrix(df)
weight_matrix.shape

(1248, 3248)

In [17]:
cos_sim_matrix = cosine_similarity(weight_matrix)
cos_sim_matrix

array([[1.        , 0.00113784, 0.12008987, ..., 0.05044345, 0.12011037,
        0.07764517],
       [0.00113784, 1.        , 0.18860833, ..., 0.21396305, 0.20877504,
        0.18854263],
       [0.12008987, 0.18860833, 1.        , ..., 0.451555  , 0.55583233,
        0.45444632],
       ...,
       [0.05044345, 0.21396305, 0.451555  , ..., 1.        , 0.42045168,
        0.39904943],
       [0.12011037, 0.20877504, 0.55583233, ..., 0.42045168, 1.        ,
        0.69306275],
       [0.07764517, 0.18854263, 0.45444632, ..., 0.39904943, 0.69306275,
        1.        ]])

In [18]:
sorted_indices = sorted(range(len(cos_sim_matrix[1])),
                        key = lambda i: cos_sim_matrix[1][i],
                        reverse = True)
cos_sim_matrix[1, sorted_indices[1:6]]

array([1.        , 0.98697665, 0.75588934, 0.7216478 , 0.71157037])

In [19]:
def get_5_most_similar_products(df, cos_sim_matrix: np.ndarray, index: int) -> tuple:
    '''
    Gets the 5 most similar products (aside from itself) of the
    product at the given position using the cosine similarity matrix.
    Returns a tuple (names, scores) of two arrays.
    df: dataframe whose row order matches cos_sim_matrix
    cos_sim_matrix: the matrix of similarity scores
    index: the positional index of the product
    '''
    # Positions sorted by similarity, highest first (stable for ties)
    sorted_indices = np.argsort(-cos_sim_matrix[index], kind='stable')

    # Drop the product itself explicitly: with exact duplicates
    # (score 1.0) it is not guaranteed to sort first
    top_5 = sorted_indices[sorted_indices != index][:5]

    most_similar_products = df.iloc[top_5]['Name'].values
    most_similar_scores = np.round(cos_sim_matrix[index, top_5], decimals=2)

    return (most_similar_products, most_similar_scores)



In [20]:
# Need to reset index here so that when making a
# new column, it uses the same "positional" index
df2 = df.reset_index()

# Positional indices 0..n-1, matching the rows of cos_sim_matrix
positions = pd.Series(range(len(df2)))

df2['Weighted Similarity Products'] = \
    positions.apply(lambda x: get_5_most_similar_products(df, cos_sim_matrix, x)[0])

df2['Weighted Similarity Scores'] = \
    positions.apply(lambda x: get_5_most_similar_products(df, cos_sim_matrix, x)[1])
    
df2.head(3)

Unnamed: 0,index,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive,Cosing Ref No,INCI Name,Function,Weights,Weighted Similarity Products,Weighted Similarity Scores
0,0,Moisturizer,LA MER,Crème de la Mer,175,4.1,"[Algae (Seaweed) Extract, Mineral Oil, Petro...",1,1,1,1,1,"[54290.0, 95058.0, 79504.0, 34040.0, 34654.0, ...","[ALGAE EXTRACT, HYDROGENATED MINERAL OIL, PETR...","[FRAGRANCE, HUMECTANT, ORAL CARE, SKIN CONDITI...","[0.20002126990973731, 0.16001701592778989, 0.1...",[Little Miss Miracle Limited-Edition Crème de ...,"[1.0, 1.0, 0.55, 0.55, 0.52]"
1,1,Moisturizer,SK-II,Facial Treatment Essence,179,4.1,"[Galactomyces Ferment Filtrate (Pitera), Buty...",1,1,1,1,1,"[84397, 74756, 58983, 92472, 37735, 35342, 38173]","[GALACTOMYCES FERMENT FILTRATE, BUTYLENE GLYCO...","[HUMECTANT, FRAGRANCE, HUMECTANT, SKIN CONDITI...","[0.25307332242756025, 0.2024586579420482, 0.16...","[Facial Treatment Essence Mini, Facial Treatme...","[1.0, 0.99, 0.76, 0.72, 0.71]"
2,2,Moisturizer,DRUNK ELEPHANT,Protini™ Polypeptide Cream,68,4.4,"[Water, Dicaprylyl Carbonate, Glycerin, Cet...",1,1,1,1,0,"[92472, 55832, 34040, 75132, 55337, 38182, 583...","[WATER, DICAPRYLYL CARBONATE, GLYCERIN, CETEAR...","[ANTIPLAQUE, SKIN CONDITIONING, SOLVENT, SKIN ...","[0.20000038312461918, 0.16000030649969535, 0.1...","[C-Tango™ Multivitamin Eye Cream, After-Sun Mi...","[0.89, 0.76, 0.75, 0.74, 0.73]"


In [21]:
# Setting the list of similar products to an empty list
# for products with <2 ingredients

df_full = df_full[~(df_full['Cosing Ref No'].apply(lambda x: len(x) > 1))].copy()
df_full['Weighted Similarity Products'] = [[] for i in range(len(df_full))]
df_full['Weighted Similarity Scores'] = [[] for i in range(len(df_full))]
df_full.head(3)

Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive,Cosing Ref No,INCI Name,Function,Weights,Weighted Similarity Products,Weighted Similarity Scores
7,Moisturizer,DRUNK ELEPHANT,Virgin Marula Luxury Facial Oil,72,4.4,[100% Unrefined Sclerocraya Birrea (Marula) Ke...,1,1,1,1,0,[],[],[],,[],[]
11,Moisturizer,KIEHL'S SINCE 1851,Midnight Recovery Concentrate,47,4.4,[Caprylic/Capric Triglyceride Dicaprylyl Carbo...,1,1,1,1,1,[58059.0],[OZONIZED SUNFLOWER SEED OIL],[SKIN CONDITIONING],,[],[]
26,Moisturizer,DRUNK ELEPHANT,Virgin Marula Luxury Facial Oil Mini,40,4.5,[100% Unrefined Sclerocraya Birrea (Marula) Ke...,1,1,1,1,0,[],[],[],,[],[]


In [22]:
df2 = df2.set_index("index")
df2.index.name = None

merged_df = pd.concat([df2,df_full])
merged_df.sort_index().head(3)

Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive,Cosing Ref No,INCI Name,Function,Weights,Weighted Similarity Products,Weighted Similarity Scores
0,Moisturizer,LA MER,Crème de la Mer,175,4.1,"[Algae (Seaweed) Extract, Mineral Oil, Petro...",1,1,1,1,1,"[54290.0, 95058.0, 79504.0, 34040.0, 34654.0, ...","[ALGAE EXTRACT, HYDROGENATED MINERAL OIL, PETR...","[FRAGRANCE, HUMECTANT, ORAL CARE, SKIN CONDITI...","[0.20002126990973731, 0.16001701592778989, 0.1...",[Little Miss Miracle Limited-Edition Crème de ...,"[1.0, 1.0, 0.55, 0.55, 0.52]"
1,Moisturizer,SK-II,Facial Treatment Essence,179,4.1,"[Galactomyces Ferment Filtrate (Pitera), Buty...",1,1,1,1,1,"[84397, 74756, 58983, 92472, 37735, 35342, 38173]","[GALACTOMYCES FERMENT FILTRATE, BUTYLENE GLYCO...","[HUMECTANT, FRAGRANCE, HUMECTANT, SKIN CONDITI...","[0.25307332242756025, 0.2024586579420482, 0.16...","[Facial Treatment Essence Mini, Facial Treatme...","[1.0, 0.99, 0.76, 0.72, 0.71]"
2,Moisturizer,DRUNK ELEPHANT,Protini™ Polypeptide Cream,68,4.4,"[Water, Dicaprylyl Carbonate, Glycerin, Cet...",1,1,1,1,0,"[92472, 55832, 34040, 75132, 55337, 38182, 583...","[WATER, DICAPRYLYL CARBONATE, GLYCERIN, CETEAR...","[ANTIPLAQUE, SKIN CONDITIONING, SOLVENT, SKIN ...","[0.20000038312461918, 0.16000030649969535, 0.1...","[C-Tango™ Multivitamin Eye Cream, After-Sun Mi...","[0.89, 0.76, 0.75, 0.74, 0.73]"


In [23]:
# Uncomment to re-save dataframe
# merged_df.sort_index().to_json("data/skincare_products_1.json")