### Learning, conclusions and future work
**Learnings**
- react dom stores all info in some CDATA in order to not render DB all over again

**Conclusion**
- parsing ingredients returns not a full list of items
- categories are not so informative

**Future work:**
- need more data from multiple sources
- recommender bot that sends linkt to buy product @end
- parse full list of ingredients
- recommender system that includes user reviews from zalando

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import TSNE

import warnings
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('data/processed_cosmetics.csv')

Goal:
- Build item by ingredients dataframe (encoding ingredients of each item)
- TSNE dimensionality reduction, plus some visulization


In [4]:
data.category.value_counts()

face care      1271
sun protect     819
skin tone       666
Name: category, dtype: int64

In [48]:
data.head()

Unnamed: 0,category,brand,name,price,rating,ingredients,All skin types,Combination skin,Dry skin,Mature skin,Normal skin,Oily skin,Sensitive skin
0,skin tone,Darphin,MELAPERFECT FOUNDATION NEUTRAL - Foundation,39.95,4.0,"Water\aqua\eau, methyl, trimethicone, phenyl t...",0,0,0,0,0,0,0
1,skin tone,Nyx Professional Makeup,HD PHOTOGENIC CONCEALER WAND - Concealer,6.95,4.143646,"AQUA / WATER, TRIMETHYLSILOXYPHENYL DIMETHICON...",0,0,0,0,0,0,0
2,skin tone,Nyx Professional Makeup,BARE WITH ME TINTED SKIN VEIL - Setting spray ...,9.95,4.56,"Aqua/Water/Eau, Glycerin, Alcohol Denat., Aloe...",0,0,0,0,0,0,0
3,skin tone,MAC,PREP + PRIME FIX + MATTIFYING MIST 100ML - Primer,22.5,4.327586,"Water\Aqua\Eau , Alcohol Denat. , Silica , Sod...",0,0,0,0,0,0,0
4,skin tone,MAC,STUDIO FIX FLUID SPF15 FOUNDATION - Foundation,33.95,4.485348,"Octinoxate 2.50%, Titanium Dioxide 1.00%Water\...",0,0,0,0,0,0,0


In [51]:
'phenyl' in data['ingredients'][0]

True

In [12]:
df = data[data['category'] == 'skin tone'][data['Combination skin'] == 1]
df = df.reset_index()

In [13]:
df

Unnamed: 0,index,category,brand,name,price,rating,ingredients,All skin types,Combination skin,Dry skin,Mature skin,Normal skin,Oily skin,Sensitive skin
0,90,skin tone,Clinique,SUPERBALANCED MAKEUP LIQUID SILK FOUNDATION 30...,32.95,4.692308,"Dimethicone, Water\Aqua\Eau, Titanium Dioxide,...",0,1,1,0,0,1,0
1,178,skin tone,Clinique,SUPERBALANCED MAKEUP LIQUID SILK FOUNDATION 30...,32.95,4.692308,"Dimethicone, Water\Aqua\Eau, Titanium Dioxide,...",0,1,1,0,0,1,0
2,260,skin tone,Cover FX,COVER FX MATTIFYING PRIMER - Primer,34.2,1.0,"Cyclopentasiloxane, Salix Nigra (willow) Bark ...",0,1,0,0,0,1,0
3,310,skin tone,Clinique,SUPERBALANCED MAKEUP LIQUID SILK FOUNDATION 30...,32.95,4.692308,"Dimethicone, Water\Aqua\Eau, Titanium Dioxide,...",0,1,1,0,0,1,0
4,320,skin tone,Clinique,SUPERBALANCED MAKEUP LIQUID SILK FOUNDATION 30...,32.95,4.692308,"Dimethicone, Water\Aqua\Eau, Titanium Dioxide,...",0,1,1,0,0,1,0
5,427,skin tone,A'PIEU,WONDER-TENSION PACT PPOSONG SPF30/PA++ - Found...,18.95,4.0,"Bambusa Vulgaris Water, Water, Isononyl Isonon...",0,1,0,0,0,0,0


In [14]:
df.shape

(6, 14)

In [99]:
data.ingredients.value_counts()

Ingredients 1: Water/Aqua/Eau, Alcohol Denat. (SD Alcohol 40-B), Glycolic Acid, Potassium Hydroxide, Hamamelis Virginiana (Witch Hazel) Water, Salicylic Acid, Polysorbate 20, Lactic Acid, Mandelic Acid, Malic Acid, Citric Acid, Salix Alba (Willow)...          36
Aqua/Water/Eau*, Dicaprylyl Carbonate, Ethylexyl Methoxycrylene, Ethylhexyl Methoxycinnamate, Glyceryl Stearate, Clycerin, Dipropylene Glycol, Peg-100 Stearate, Cyclopentasiloxane, Butyl Methoxydibenzoylmethane, Cetyl Alcohol, Methyl Methacrylate ...          35
Aqua/Water/Eau*, Ethylhexyl Methoxycinnamate, Dicaprylyl Carbonate, Butyloctyl Salicylate, Glycerin, Homosalate, Tribehenin PEG-20 Esters, Cyclopentasiloxane, Butyl Methoxydibenzoylmethane, Dimethicone, Pongamia Glabra Seed Oil, Tocopheryl Acetate...          35
Aqua/Water/Eau, Ethylhexyl Methoxycinnamate, Dibutyl Adipate, Octocrylene, Bis-Ethylhexyloxyphenol Methoxyphenyl Triazine, C12-15 Alkyl Benzoate, Butyl Methoxydibenzoylmethane, Alcohol Denat., Glycerin, Distarch

In [19]:
# tokenize ingredients and save unique ingredients for item-ingredient matrix
encoded = data['ingredients'].str.get_dummies(sep=', ')

In [20]:
encoded.shape

(2756, 3309)

In [31]:
new_data = pd.concat([data['name'], encoded], axis=1)

In [32]:
new_data.set_index('name')

Unnamed: 0_level_0,Pelargonium Graveolens F...,ALCOHOL,DIMETHICONE,ETHYL HEX...,GLYCERETH-26,HYDROXYETHYLCELLULOSE,PEG-60 HYDROGENATED CASTOR OIL,AMBER POWDER,Allantoin,Aqua,...,xanthan gum,zinc gluconate,zinc oxide,zingiber,Aqua (water),Coconut Alkanes,Dimethicone,Homosalate,Öl,ökologischem Anbau.
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
MELAPERFECT FOUNDATION NEUTRAL - Foundation,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
HD PHOTOGENIC CONCEALER WAND - Concealer,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
BARE WITH ME TINTED SKIN VEIL - Setting spray & powder,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PREP + PRIME FIX + MATTIFYING MIST 100ML - Primer,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
STUDIO FIX FLUID SPF15 FOUNDATION - Foundation,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PREP + PRIME FIX +100ML - Primer,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PRO LONGWEAR CONCEALER - Concealer,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
MINERALIZE SKINFINISH - Highlighter,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
MINI PREP + PRIME FIX +LITTLE M.A.C 30ML - Setting spray & powder,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
LOU-MANIZER - Highlighter,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [60]:
new_data.iloc[:, 1:].head()

Unnamed: 0,Pelargonium Graveolens F...,ALCOHOL,DIMETHICONE,ETHYL HEX...,GLYCERETH-26,HYDROXYETHYLCELLULOSE,PEG-60 HYDROGENATED CASTOR OIL,AMBER POWDER,Allantoin,Aqua,...,xanthan gum,zinc gluconate,zinc oxide,zingiber,Aqua (water),Coconut Alkanes,Dimethicone,Homosalate,Öl,ökologischem Anbau.
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [62]:
X = new_data.iloc[:, 1:]
X.shape

(2756, 3309)

In [53]:
# use TSNE to reduce dimensionality of 2706 by 3900 item ingredient matrix
model = TSNE(n_components = 2, random_state=42, n_jobs=-1)

In [63]:
%%time
tsne_features = model.fit_transform(X)

In [64]:
# Make X, Y columns
data['X'] = tsne_features[:, 0]
data['Y'] = tsne_features[:, 1]

In [65]:
data.tail()

Unnamed: 0,category,brand,name,price,rating,ingredients,All skin types,Combination skin,Dry skin,Mature skin,Normal skin,Oily skin,Sensitive skin,X,Y
2751,sun protect,Institut Esthederm,Sun protection,50.73,4.269727,"Aqua/Water/Eau*, Ethylhexyl Methoxycinnamate, ...",0,0,0,0,1,0,0,-7.119333,-65.076187
2752,sun protect,Institut Esthederm,INSTITUT ESTHEDERM ADAPTASUN PROTECTIVE SILKY ...,44.46,4.269727,"Butan, Dibutyl, Adipate, Coco-Caprylate/Caprat...",0,0,0,0,1,0,0,20.860256,-73.293312
2753,sun protect,Institut Esthederm,INSTITUT ESTHEDERM ADAPTASUN PROTECTIVE TANNIN...,46.74,4.269727,"Aqua/water/eau*, Ethylhexyl Methoxycinnamate, ...",0,0,0,0,1,0,0,4.710694,-89.744255
2754,sun protect,Skin Stories,SKIN STORIES COLOR PROTECT SUN STICK - Sun pro...,14.95,4.269727,"C12-15 Alkyl Benzoate, Homosalate, Cera Alba, ...",1,0,0,0,0,0,0,61.138252,-49.978981
2755,sun protect,Institut Esthederm,Self tan,63.84,4.269727,"Aqua/Water/Eau*, Dicaprylyl Carbonate, Ethylex...",1,0,0,0,0,0,0,-57.678764,-26.760834


- Content-based Recommendation Filtering
* Косинусная мера - косинус угла между векторами рейтингов. Если два вектора равнонаправлены, их косинусный коэффициент равен 1. Если направление противоположно, то -1.

$$ \text{similarity}=\cos(\theta )={\mathbf {A} \cdot \mathbf {B}  \over \|\mathbf {A} \|_{2}\|\mathbf {B} \|_{2}}={\frac {\sum \limits _{i=1}^{n}{A_{i}B_{i}}}{{\sqrt {\sum \limits _{i=1}^{n}{A_{i}^{2}}}}{\sqrt {\sum \limits _{i=1}^{n}{B_{i}^{2}}}}}} $$



In [77]:
# df_2 = df[df.Label == 'Moisturizer_Dry'].reset_index().drop('index', axis = 1)

myItem = data[data.name.str.contains('MOISTURE SURGE')]
myItem = myItem[:1]

[мой крем](https://en.zalando.de/clinique-moisture-surge-72-hour-auto-replenishing-hydrator-30ml-skin-moisturizer-neutral-cll31g011-s11.html)

In [78]:
myItem

Unnamed: 0,category,brand,name,price,rating,ingredients,All skin types,Combination skin,Dry skin,Mature skin,Normal skin,Oily skin,Sensitive skin,X,Y
669,face care,Clinique,MOISTURE SURGE 72-HOUR AUTO-REPLENISHING HYDRA...,19.0,4.708502,"Water\Aqua\Eau , Dimethicone , Butylene Glycol...",0,0,1,0,0,0,0,13.993463,17.993937


In [88]:
data['dist'] = 0.0

In [89]:
data.head()

Unnamed: 0,category,brand,name,price,rating,ingredients,All skin types,Combination skin,Dry skin,Mature skin,Normal skin,Oily skin,Sensitive skin,X,Y,dist
0,skin tone,Darphin,MELAPERFECT FOUNDATION NEUTRAL - Foundation,39.95,4.0,"Water\aqua\eau, methyl, trimethicone, phenyl t...",0,0,0,0,0,0,0,7.968571,-0.88837,0.0
1,skin tone,Nyx Professional Makeup,HD PHOTOGENIC CONCEALER WAND - Concealer,6.95,4.143646,"AQUA / WATER, TRIMETHYLSILOXYPHENYL DIMETHICON...",0,0,0,0,0,0,0,30.822571,6.326037,0.0
2,skin tone,Nyx Professional Makeup,BARE WITH ME TINTED SKIN VEIL - Setting spray ...,9.95,4.56,"Aqua/Water/Eau, Glycerin, Alcohol Denat., Aloe...",0,0,0,0,0,0,0,13.999744,0.731336,0.0
3,skin tone,MAC,PREP + PRIME FIX + MATTIFYING MIST 100ML - Primer,22.5,4.327586,"Water\Aqua\Eau , Alcohol Denat. , Silica , Sod...",0,0,0,0,0,0,0,1.605113,4.878013,0.0
4,skin tone,MAC,STUDIO FIX FLUID SPF15 FOUNDATION - Foundation,33.95,4.485348,"Octinoxate 2.50%, Titanium Dioxide 1.00%Water\...",0,0,0,0,0,0,0,33.212692,61.479507,0.0


In [87]:
P1 = np.array([myItem.X.values, myItem.Y.values]).reshape(1, -1)
P1

array([[13.993463, 17.993937]], dtype=float32)

In [96]:
# cosine similarities with other items
for i in range(len(data)):
    P2 = np.array([data['X'][i], data['Y'][i]]).reshape(-1, 1)
    data.dist[i] = (P1 * P2).sum() / (np.sqrt(np.sum(P1))*np.sqrt(np.sum(P2)))
# cosine_similarity(P1, P2)

In [98]:
data['dist'].isnull().sum()

1197

In [93]:
data = data.sort_values('dist')
recommendations_myItem = data[['name', 'brand','price', 'rating','dist']].head(5)

In [94]:
recommendations_myItem

Unnamed: 0,name,brand,price,rating,dist
976,OMOROVICZA BUDAPEST REFINING FACIAL POLISHER -...,Omorovicza Budapest,80.0,4.269727,0.803314
1864,SPF 30 SUNSCREEN BODY SPRAY 200ML - Sun protec...,Mimitika,19.95,3.8,0.973279
852,PERFECTING BODY SCRUB - Body scrub,Darphin,29.95,3.0,1.32803
701,JUMBO CLARIFYING LOTION 2 - Toner,Clinique,32.95,5.0,2.214922
2034,SELF TAN PURITY WATER MOUSSE 200ML - Self tan,St. Tropez,44.95,4.833334,2.833082
