In [1]:
import pandas as pd
import numpy as np

# Similarity Scheme
Each ingredient of a product is given a weight based on its position in that product's ingredient list. To score the similarity between two products, the two weights of every ingredient that appears in both lists are multiplied together, and these products are summed. Finally, the sum is divided by the maximum possible score, which is the sum of the squared weights from the shorter list.

To get the weights, we will use the geometric distribution.
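
The scoring scheme described above can be sketched as follows. This is a minimal illustration rather than the notebook's final implementation: the names `geometric_weights` and `similarity`, the default `p=0.3`, and the choice to compute normalized weights separately for each list are all assumptions, and each ingredient name is assumed to appear at most once per list.

```python
def geometric_weights(n, p):
    """Normalized geometric weights for positions 1..n (hypothetical helper)."""
    raw = [p * (1 - p) ** i for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def similarity(ingredients_a, ingredients_b, p=0.3):
    """Position-weighted similarity between two ingredient lists."""
    if not ingredients_a or not ingredients_b:
        return 0.0  # assumption: an empty list matches nothing

    # Map each ingredient name to the weight of its position in its own list.
    weights_a = dict(zip(ingredients_a, geometric_weights(len(ingredients_a), p)))
    weights_b = dict(zip(ingredients_b, geometric_weights(len(ingredients_b), p)))

    # Multiply the two weights of every shared ingredient, then sum.
    matched = sum(weights_a[name] * weights_b[name]
                  for name in weights_a.keys() & weights_b.keys())

    # Maximum possible score: squared weights of the shorter list, summed.
    shorter = min(weights_a, weights_b, key=len)
    max_score = sum(w * w for w in shorter.values())
    return matched / max_score


same = similarity(["water", "glycerin"], ["water", "glycerin"])
swapped = similarity(["water", "glycerin"], ["glycerin", "water"])
disjoint = similarity(["water", "glycerin"], ["squalane"])
print(same, swapped, disjoint)
```

Under these assumptions, identical lists score 1, lists with no shared ingredient score 0, and swapping the order of two shared ingredients gives a score strictly between the two.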

**Geometric distribution:**
The probability that the first success occurs on trial $i$, when the trials are independent and each succeeds with probability $p$, is
$$
P(X = i) = p(1 - p)^{i - 1}
$$

![Geometric Distribution (from Wikipedia)](images/geometric_distribution.png)

In [8]:
# uses a geometric distribution so each weight
# decreases geometrically according to its position.
def generate_weights(n, p):
    weights = []
    total_weight = 0
    for i in range(1, n + 1):
        weight = p * ((1 - p) ** (i - 1)) # geometric pmf (the distribution is discrete)
        weights.append(weight)
        total_weight += weight
    normalized_weights = [weight / total_weight for weight in weights]
    return normalized_weights

In [11]:
generate_weights(4, 0.3)

[0.3947887879984208,
 0.2763521515988946,
 0.19344650611922617,
 0.1354125542834583]
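
As a check on the normalization in `generate_weights`, the weights have a closed form, because the first $n$ terms of the geometric distribution sum to $1 - (1 - p)^n$. Here $w_i$, the normalized weight at position $i$, is notation introduced only for this note:
$$
w_i = \frac{p(1 - p)^{i - 1}}{\sum_{k=1}^{n} p(1 - p)^{k - 1}} = \frac{p(1 - p)^{i - 1}}{1 - (1 - p)^{n}}
$$

For $n = 4$ and $p = 0.3$, the denominator is $1 - 0.7^4 = 0.7599$, so $w_1 = 0.3 / 0.7599 \approx 0.3948$, which matches the first value in the output above.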

In [162]:
df = pd.read_csv("data/skincare_products_merged.csv")

In [124]:
df

Unnamed: 0,Label,Brand,Name,Price,Rank,Combination,Dry,Normal,Oily,Sensitive,Ingredient_Placement,Ingredient,INCI name,COSING Ref No,Ingredient Description,Ingredient Function
0,Moisturizer,LA MER,Crème de la Mer,175,4.1,1,1,1,1,1,1,Algae (Seaweed) Extract,ALGAE EXTRACT,54290.0,Algae Extract is an extract of various species...,"FRAGRANCE, HUMECTANT, ORAL CARE, SKIN CONDITIO..."
1,Moisturizer,LA MER,Crème de la Mer,175,4.1,1,1,1,1,1,2,Mineral Oil,HYDROGENATED MINERAL OIL,95058.0,Hydrogenated Mineral Oil is the end product of...,SKIN PROTECTING
2,Moisturizer,LA MER,Crème de la Mer,175,4.1,1,1,1,1,1,3,Petrolatum,PETROLATUM,79504.0,Petrolatum. A complex combination of hydrocarb...,"ANTISTATIC, SKIN CONDITIONING - EMOLLIENT"
3,Moisturizer,LA MER,Crème de la Mer,175,4.1,1,1,1,1,1,4,Glycerin,GLYCERIN,34040.0,Glycerine ;Glycerol (INN); Glycerol (RIFM); G...,"DENATURANT, HAIR CONDITIONING, HUMECTANT, ORAL..."
4,Moisturizer,LA MER,Crème de la Mer,175,4.1,1,1,1,1,1,5,Isohexadecane,ISOHEXADECANE,34654.0,"Hydrocarbons, C4, 1,3-butadiene-free, polym., ...","SKIN CONDITIONING, SKIN CONDITIONING - EMOLLIE..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45811,Sun protect,ST. TROPEZ TANNING ESSENTIALS,Pro Light Self Tan Bronzing Mist,20,1.0,0,0,0,0,0,15,Alpha-isomethyl Ionone,ALPHA-ISOMETHYL IONONE,39665.0,"3-Methyl-4-(2,6,6-trimethyl-2-cyclohexenyl)-3-...","PERFUMING, SKIN CONDITIONING"
45812,Sun protect,ST. TROPEZ TANNING ESSENTIALS,Pro Light Self Tan Bronzing Mist,20,1.0,0,0,0,0,0,16,CI 14700 (Red 4),CI 14700,32711.0,"Disodium 3-[(2,4-dimethyl-5-sulphonatophenyl)a...",COLORANT
45813,Sun protect,ST. TROPEZ TANNING ESSENTIALS,Pro Light Self Tan Bronzing Mist,20,1.0,0,0,0,0,0,17,CI 19140 (Yellow 5),CI 19140,32737.0,Trisodium 5-hydroxy-1-(4-sulphophenyl)-4-(4-su...,"COLORANT, HAIR DYEING"
45814,Sun protect,ST. TROPEZ TANNING ESSENTIALS,Pro Light Self Tan Bronzing Mist,20,1.0,0,0,0,0,0,18,CI 42090 (Blue 1).,CI 42090,32756.0,Dihydrogen (ethyl)[4-[4-[ethyl(3-sulphonatoben...,"COLORANT, HAIR DYEING"


In [136]:
# brand and names that contain an NA inci name
brand_and_names = df[df['INCI name'].isna()][['Brand', 'Name']].drop_duplicates()
brand_and_names.iloc[0]['Brand']

'LA MER'

In [125]:
#nas = \
#    pd.merge(df[df['INCI name'].isna()][['Brand', 'Name']].drop_duplicates(), \
#             df, on=['Brand', 'Name'], how='inner')

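# Work on an explicit copy so the edits below do not touch df
# (and do not trigger a SettingWithCopyWarning).
curr = df[(df['Brand'] == brand_and_names['Brand'].iloc[1]) &
          (df['Name'] == brand_and_names['Name'].iloc[1])].copy()

curr_indices = curr.index
na_indices = curr[curr['INCI name'].isna()].index
# integer position of the placement column, for use with iloc
placement_col = curr.columns.get_loc('Ingredient_Placement')

for i in na_indices:
    integer_pos = curr.index.get_loc(i)

    print("index: {}".format(i))
    print("integer position: {}".format(integer_pos))
    print("Ingredient placement: {}".format(curr['Ingredient_Placement'].loc[i]))

    # blank out the placement of the ingredient with no INCI name
    curr.loc[i, 'Ingredient_Placement'] = np.nan

    # move every later ingredient up one place
    if len(curr) > (integer_pos + 1):
        curr.iloc[integer_pos + 1:, placement_col] = \
            curr.iloc[integer_pos + 1:, placement_col] - 1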

print(curr['Ingredient_Placement'].iloc[34:70])

index: 144
integer position: 36
Ingredient placement: 37
index: 176
integer position: 68
Ingredient placement: 68.0
142    35.0
143    36.0
144     NaN
145    37.0
146    38.0
147    39.0
148    40.0
149    41.0
150    42.0
151    43.0
152    44.0
153    45.0
154    46.0
155    47.0
156    48.0
157    49.0
158    50.0
159    51.0
160    52.0
161    53.0
162    54.0
163    55.0
164    56.0
165    57.0
166    58.0
167    59.0
168    60.0
169    61.0
170    62.0
171    63.0
172    64.0
173    65.0
174    66.0
175    67.0
176     NaN
177    68.0
Name: Ingredient_Placement, dtype: float64


In [131]:
df.loc[curr.index]

Unnamed: 0,Label,Brand,Name,Price,Rank,Combination,Dry,Normal,Oily,Sensitive,Ingredient_Placement,Ingredient,INCI name,COSING Ref No,Ingredient Description,Ingredient Function
108,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,1,1,1,1,1,1,Algae (Seaweed) Extract,ALGAE EXTRACT,54290.0,Algae Extract is an extract of various species...,"FRAGRANCE, HUMECTANT, ORAL CARE, SKIN CONDITIO..."
109,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,1,1,1,1,1,2,Cyclopentasiloxane,CYCLOPENTASILOXANE,75413.0,Decamethylcyclopentasiloxane,"HAIR CONDITIONING, SKIN CONDITIONING, SKIN CON..."
110,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,1,1,1,1,1,3,Petrolatum,PETROLATUM,79504.0,Petrolatum. A complex combination of hydrocarb...,"ANTISTATIC, SKIN CONDITIONING - EMOLLIENT"
111,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,1,1,1,1,1,4,Glyceryl Distearate,GLYCERYL DISTEARATE,34067.0,"Distearic acid, diester with glycerol","ANTISTATIC, SKIN CONDITIONING - EMOLLIENT"
112,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,1,1,1,1,1,5,Phenyl Trimethicone,PHENYL TRIMETHICONE,79701.0,"1,1,5,5,5-Hexamethyl-3-phenyl-3-[(trimethylsil...","ANTIFOAMING, HAIR CONDITIONING, SKIN CONDITIONING"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,1,1,1,1,1,74,Geraniol,GERANIOL,33991.0,"2,6-Octadien-1-ol, 3,7-dimethyl-, (2E)-","PERFUMING, TONIC"
182,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,1,1,1,1,1,75,Linalool,LINALOOL,35016.0,"3,7-Dimethyl octa-1,6-diene-3-ol","DEODORANT, PERFUMING"
183,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,1,1,1,1,1,76,Limonene,LIMONENE,57187.0,1-Methyl-4-Isopropenylcyclohexene; dipentene,"DEODORANT, PERFUMING, SOLVENT"
184,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,1,1,1,1,1,77,Potassium Sorbate,POTASSIUM SORBATE,37025.0,"Potassium (E,E)-hexa-2,4-dienoate","FRAGRANCE, PRESERVATIVE"


In [149]:
def remove_na_ingredients(df):
    # WARNING: modifies df in-place
    
    # Find the brand and names that contain na
    brand_name_contains_na = \
        df[df['INCI name'].isna()][['Brand', 'Name']].drop_duplicates()
    

    # loop through each of these items, replace na ingredient positions
    # with na, adjust the other numbers accordingly, and then drop nas
    for i in range(len(brand_name_contains_na)):
        curr_df = \
            df[(df['Brand'] == brand_name_contains_na.iloc[i]['Brand']) & \
               (df['Name'] == brand_name_contains_na.iloc[i]['Name'])]

        # get the indices of the item and indices where ingredient is nan
        curr_indices = curr_df.index 
        na_indices = curr_df[curr_df['INCI name'].isna()].index

        print(brand_name_contains_na.iloc[i]['Brand'], \
              brand_name_contains_na.iloc[i]['Name'])
        print("na indices: ")
        print(na_indices)
    
        for j in na_indices:

            df.loc[j, 'Ingredient_Placement'] = np.nan

            # get integer position of na index to check if it's at the end
            integer_pos = curr_df.index.get_loc(j)

            # if there are more ingredients after, move each of them up
            # one place (row j itself is already NaN and stays NaN)
            if len(curr_df) > (integer_pos + 1):
                df.loc[j:curr_indices[-1], 'Ingredient_Placement'] = \
                    df.loc[j:curr_indices[-1], 'Ingredient_Placement'] - 1
    
    # now dropping nans will yield correct ingredient placement
    return df.dropna(subset=['INCI name'])

    

In [164]:
df2 = remove_na_ingredients(df)

LA MER Crème de la Mer
na indices: 
Index([41], dtype='int64')
LA MER The Moisturizing Soft Cream
na indices: 
Index([144, 176], dtype='int64')
IT COSMETICS Your Skin But Better™ CC+™ Cream with SPF 50+
na indices: 
Index([207, 225, 266], dtype='int64')
TATCHA The Water Cream
na indices: 
Index([270, 295, 297], dtype='int64')
DRUNK ELEPHANT Lala Retro™ Whipped Cream
na indices: 
Index([302], dtype='int64')
DRUNK ELEPHANT Virgin Marula Luxury Facial Oil
na indices: 
Index([325], dtype='int64')
LA MER Little Miss Miracle Limited-Edition Crème de la Mer
na indices: 
Index([402], dtype='int64')
KIEHL'S SINCE 1851 Midnight Recovery Concentrate
na indices: 
Index([443, 444, 445, 446, 447, 448, 449, 450, 451, 452], dtype='int64')
BELIF The True Cream Aqua Bomb
na indices: 
Index([457, 463, 485, 503, 508], dtype='int64')
SUNDAY RILEY Luna Sleeping Night Oil
na indices: 
Index([509, 510, 511, 512, 513, 515, 517, 523, 524], dtype='int64')
DRUNK ELEPHANT The Littles™
na indices: 
Index([580, 596,

In [165]:
df2.loc[441:455]

Unnamed: 0,Label,Brand,Name,Price,Rank,Combination,Dry,Normal,Oily,Sensitive,Ingredient_Placement,Ingredient,INCI name,COSING Ref No,Ingredient Description,Ingredient Function
441,Moisturizer,FRESH,Lotus Youth Preserve Moisturizer,45,4.3,0,0,0,0,0,39.0,Limonene,LIMONENE,57187.0,1-Methyl-4-Isopropenylcyclohexene; dipentene,"DEODORANT, PERFUMING, SOLVENT"
442,Moisturizer,FRESH,Lotus Youth Preserve Moisturizer,45,4.3,0,0,0,0,0,40.0,Citral.,CITRAL,32857.0,"2,6-Octadienal, 3,7-dimethyl-; 3,7-Dimethyl-2,...","FLAVOURING, PERFUMING"
453,Moisturizer,KIEHL'S SINCE 1851,Midnight Recovery Concentrate,47,4.4,1,1,1,1,1,1.0,Sunflower Seed Oil.,OZONIZED SUNFLOWER SEED OIL,58059.0,"Helianthus annuus (sunflower) seed oil, produc...",SKIN CONDITIONING
454,Moisturizer,BELIF,The True Cream Aqua Bomb,38,4.5,1,0,1,1,0,1.0,Water,WATER,92472.0,"Aqua (EU),Deionized Water,Distilled Water,Micr...","ANTIPLAQUE, SKIN CONDITIONING, SOLVENT"
455,Moisturizer,BELIF,The True Cream Aqua Bomb,38,4.5,1,0,1,1,0,2.0,Dipropylene Glycol,DIPROPYLENE GLYCOL,75742.0,"1,1'-Oxydipropan-2-ol; Oxydipropan-2-ol; Hydro...","FRAGRANCE, PERFUMING, SOLVENT, VISCOSITY CONTR..."
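
The renumbering done by `remove_na_ingredients` could also be sketched without explicit loops. This is an alternative under stated assumptions, not the approach used in this notebook: it assumes the rows of each product are already in ingredient order and that the pair (`Brand`, `Name`) identifies a product. The function name `drop_na_and_renumber` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_na_and_renumber(df):
    """Drop rows with no INCI name and renumber placements per product.

    Returns a new DataFrame; the input is left unchanged.
    """
    cleaned = df.dropna(subset=["INCI name"]).copy()
    # cumcount() numbers the rows of each group 0, 1, 2, ... in row order
    cleaned["Ingredient_Placement"] = (
        cleaned.groupby(["Brand", "Name"], sort=False).cumcount() + 1
    )
    return cleaned


# Toy data (made up for illustration, not taken from the real CSV)
toy = pd.DataFrame({
    "Brand": ["A", "A", "A", "A", "B", "B"],
    "Name": ["Cream", "Cream", "Cream", "Cream", "Oil", "Oil"],
    "Ingredient_Placement": [1, 2, 3, 4, 1, 2],
    "INCI name": ["WATER", np.nan, "GLYCERIN", "LIMONENE", np.nan, "SQUALANE"],
})
result = drop_na_and_renumber(toy)
print(result[["Brand", "Name", "Ingredient_Placement", "INCI name"]])
```

Unlike the loop-based version, this sketch leaves the input frame untouched and keeps the placements as integers, since no NaN is ever written into the column.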


In [163]:
df.loc[453]

Label                                                           Moisturizer
Brand                                                    KIEHL'S SINCE 1851
Name                                          Midnight Recovery Concentrate
Price                                                                    47
Rank                                                                    4.4
Combination                                                               1
Dry                                                                       1
Normal                                                                    1
Oily                                                                      1
Sensitive                                                                 1
Ingredient_Placement                                                     11
Ingredient                                              Sunflower Seed Oil.
INCI name                                       OZONIZED SUNFLOWER SEED OIL
COSING Ref N