# Skincare Product Recommender

In recent years we have become more and more concerned with what goes into the products we use. That is true of our food, our skincare, and the way our clothes are made.
What happens, though, if you have a favorite skincare product from [Sephora](https://www.sephora.com/)? It is already hard to find products you like, and if you decide to go clean and organic you have to start the process all over again. This is where this project comes in: I wanted to recommend clean, organic alternatives to your favorite Sephora skincare products.

Sephora is a leading retailer that carries many brands and offers a large selection of hair, makeup and skincare products. Here I focus only on skincare. Sephora gives its customers a lot of choices, including a "[Clean Sephora](https://www.sephora.com/beauty/clean-beauty-products)" section, so some of my recommendations stay within Sephora. I also gathered data from [Credo](https://credobeauty.com/) and [Follain](https://follain.com/) to provide more options. One thing to note is that clean and organic retailers currently tend to carry smaller inventories and thus offer fewer choices.

#### Table of Contents

[Gathering the data](#Gathering-the-data)<br>
[Cleaning Data](#Cleaning-Data)<br>
[Creating the Recommender](#Creating-the-Recommender)<br>
[Flask App](#Flask-App)<br>

## Gathering the data

The first thing I needed to do was gather the data. I used Selenium, requests and BeautifulSoup to collect it from the three online stores. Since I had already gathered Sephora data for a previous project on predicting product prices, I reused that data here. You can see the code for gathering the Sephora data in my project called [Capstone](https://github.com/LenaNevel/CAPSTONE) on GitHub. I followed similar steps when gathering data from Credo and Follain. The Jupyter notebooks where I gather the data can be found in the "gathering_data" folder of this project.
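As a rough illustration of the gathering step, the sketch below pulls product links out of a listing page and drops duplicates. It is not the project's scraper: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup, and the markup, the `product-link` class and the slugs are all invented for the example. The real store pages are structured differently.

```python
from html.parser import HTMLParser


class ProductLinkParser(HTMLParser):
    """Collect href values from <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.css_class in classes and attrs.get("href"):
            self.links.append(attrs["href"])


# Invented listing-page markup, used only to exercise the parser.
sample_html = """
<div class="grid">
  <a class="product-link" href="/products/example-cleanser">Example Cleanser</a>
  <a class="nav" href="/about">About</a>
  <a class="product-link" href="/products/example-serum">Example Serum</a>
  <a class="product-link" href="/products/example-cleanser">Example Cleanser</a>
</div>
"""

parser = ProductLinkParser("product-link")
parser.feed(sample_html)

# Drop duplicate slugs while keeping first-seen order.
unique_slugs = list(dict.fromkeys(parser.links))
print(unique_slugs)
```

Removing duplicates this early keeps the later per-product requests from fetching the same page twice.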

In [1]:
import pandas as pd
sephora_slugs = pd.read_csv('./data/sephora_urls.csv')
credo_slugs = pd.read_csv('./data/credo_product_slugs.csv')
follain_slugs = pd.read_csv('./data/follain_product_slugs.csv')

print("Initially {} product urls were gathered from Sephora.com.".format(len(sephora_slugs)))
print("From CredoBeauty.com there were also {} urls initially.".format(len(credo_slugs)))
print("And last but not least, {} urls were gathered from Follain.".format(len(follain_slugs)))

Initially 2768 product urls were gathered from Sephora.com.
From CredoBeauty.com there were also 613 urls initially.
And last but not least, 189 urls were gathered from Follain.


In [2]:
sephora_initial_clean = pd.read_csv('./data/sephora_intial_clean.csv')
credo_initial_clean = pd.read_csv('./data/credo_initial_clean.csv')
follain_initial_clean = pd.read_csv('./data/follain_initial_clean.csv')

print('After removing duplicates, {} sephora, {} credo and {} follain products were left.'.format(len(sephora_initial_clean),
                                                                                                 len(credo_initial_clean),
                                                                                                 len(follain_initial_clean)))

After removing duplicates, 2457 sephora, 517 credo and 164 follain products were left.


In [3]:
clean_sephora = sephora_initial_clean[sephora_initial_clean['type'] == 'clean']
print("There are {} products in the 'Clean Sephora' section of Sephora".format(len(clean_sephora)))

There are 557 products in the 'Clean Sephora' section of Sephora


As you can see, Credo and Follain unfortunately have much smaller inventories and thus offer fewer alternatives. Luckily, Sephora now carries clean options too, which gives us a larger pool of products to choose from.

The data I focused on collecting was the product name, brand, price and ingredients. I also kept track of the category each product belonged to (moisturizer, serum, eye cream, ...) and the URL of its product description page.

In [4]:
follain_initial_clean.head()

|  | name | brand | category | price | ingredients | store | url | type |
|---|---|---|---|---|---|---|---|---|
| 0 | Brightening Cleanser | Indie Lee | cleansers | 34.0 | Water, Decyl Glucoside, Disodium Coco Glucosid... | Follain | https://follain.com//collections/skincare-clea... | clean |
| 1 | Ocean Cleanser | OSEA | cleansers | 48.0 | Water, Algae Extract, Decyl Glucoside, Sodium ... | Follain | https://follain.com//collections/skincare-clea... | clean |
| 2 | Regenerating Cleanser | Tata Harper | cleansers | 84.0 | Aloe Vera Leaf Juice, Cetearyl Alcohol, Cetear... | Follain | https://follain.com//collections/skincare-clea... | clean |
| 3 | Vitamin B Cleansing Oil & Makeup Remover | One Love Organics | cleansers | 42.0 | Helianthus Annus Seed Oil, Caprylic, Di-PPG-2 ... | Follain | https://follain.com//collections/skincare-clea... | clean |
| 4 | Fantastic Face Wash | Ursa Major | cleansers | 28.0 | Aloe Barbadensis Leaf Juice, Lauryl Glucoside,... | Follain | https://follain.com//collections/skincare-clea... | clean |


## Cleaning Data

After collecting the data, a lot of cleanup was needed to get it into working condition. The full cleaning code can be found in the "cleaning_data" folder.

First I looked at each store individually, then combined them into one DataFrame. While cleaning and doing EDA, one of the techniques that helped me the most was looking at all of the unique ingredients individually. The code I wrote is below. I manually went through the unique ingredients, checking for any out-of-place characters.

In [5]:
#this is a quick method to check the ingredients for legibility

products = pd.read_csv('./data/products_combined.csv')

all_ingredients = []

for i in products.index:
    list_ingredients = products.ingredients[i].split(', ')
    for j in list_ingredients:     
        all_ingredients.append(j.strip())
print(len(all_ingredients))
print(len(set(all_ingredients)))

unique_ingredients = set(all_ingredients)
sorted(unique_ingredients)[5000:5005]

86472
7526


['Sorbitan Palmitate',
 'Sorbitan Sesquiisostearate',
 'Sorbitan Sesquioleate',
 'Sorbitan Stearate',
 'Sorbitan Trioleate']
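Part of the manual check for out-of-place characters could be automated along these lines. This is only a sketch under an assumption: the allow-list of characters in `EXPECTED` is my guess at what a well-formed ingredient name looks like, and legitimate names that contain other characters would be flagged for review as well.

```python
import re

# Assumed allow-list: letters, digits, spaces and a few punctuation marks
# that commonly appear in ingredient names. Adjust it to the data at hand.
EXPECTED = re.compile(r"[A-Za-z0-9 \-/().%+',&]+")


def suspicious(ingredients):
    """Return the sorted unique ingredients that contain unexpected characters."""
    return sorted({i for i in ingredients if not EXPECTED.fullmatch(i)})


sample = [
    "Water",
    "Sorbitan Stearate",
    "Glycerin*",                    # stray asterisk, e.g. from a footnote marker
    "Aloe Barbadensis Leaf Juice",
    "PEG-100 Stearate",
    "Tocopherol\u00a0",             # trailing non-breaking space
]

flagged = suspicious(sample)
print(flagged)
```

Anything returned by `suspicious` still needs a human look; the function only narrows down which of the unique ingredients to inspect first.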

In "combining_stores.ipynb" you will notice I used Fuzzy String Matching to combine similar ingredients that might be misspelled or otherwise written differently but essentially mean the same thing. Fuzzy String Matching finds strings that are approximately alike rather than exactly alike. For the purpose of recommending products, I assumed that an ingredient derived from the fruit and one derived from the seed of the same plant are similar enough for our case. I merged ingredients whose Fuzzy String Matching score was larger than 75%.

<img src="./images/example1.jpeg">
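The merging idea can be sketched as follows. This is not the code from "combining_stores.ipynb": it swaps in the standard library's `difflib.SequenceMatcher` for fuzzywuzzy, so scores may differ slightly from the project's, and the helper names and the sample ingredient list are made up for illustration. Only the "larger than 75" threshold comes from the text above.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Case-insensitive similarity score between 0 and 100."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_similar(ingredients, threshold=75):
    """Map each ingredient to the first earlier name it closely resembles."""
    canonical = []   # names kept as the representative spelling
    mapping = {}
    for name in ingredients:
        match = next((c for c in canonical if ratio(name, c) > threshold), None)
        if match is None:
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping


# Illustrative names only, not taken from the scraped data.
names = [
    "Tocopherol", "Tocopheral",          # misspelling
    "Glycerin", "Glycerine",             # spelling variant
    "Water",
    "Vitis Vinifera Fruit Extract",
    "Vitis Vinifera Seed Extract",       # fruit vs. seed of the same plant
]

mapping = merge_similar(names)
for original, merged in mapping.items():
    print(f"{original!r} -> {merged!r}")
```

With this threshold the fruit and seed extracts collapse into one entry, which matches the assumption stated above that they are similar enough for recommending products.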

## Creating the Recommender

To create the recommender I needed to compare products to each other based on their ingredients. I used CountVectorizer to turn each ingredient list into a vector of ingredient counts. Much like the Fuzzy String Matching above measured item-to-item similarity, I then used a pairwise distance matrix (with the cosine metric) to measure similarity between product vectors. The pairwise distance is a number between 0 and 1; the closer the score is to 0, the more similar the two products are based on their ingredient lists.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import pairwise_distances

#CountVectorizer splits the ingredients on commas rather than into individual words
cvec = CountVectorizer(tokenizer=lambda x: x.split(', '))
#training CountVectorizer and transforming
cvec_words = cvec.fit_transform(products['ingredients'])
#calculating the pairwise distance for products
recommender = pairwise_distances(cvec_words, metric='cosine')

#building a recommender DataFrame
recommender_df = pd.DataFrame(recommender, columns=products['name'], index=products['name'])

In [7]:
recommender_df.head()

| name | Protini Polypeptide Moisturizer | The Water Cream | Ultra Facial Cream | CC+ Cream with SPF 50+ | The Dewy Skin Cream | Lala Retro Whipped Moisturizer with Ceramides | Crème de la Mer Moisturizer | F-Balm Electrolyte Waterfacial Mask | The True Cream Aqua Bomb | Virgin Marula Antioxidant Face Oil | ... | Rose Lip Polish | Grapefruit Lip Balm | Sun Protection Lip Balm SPF 15 | Lemon Lip Balm | Mint Lip Balm | Mineral Sunscreen Unscented SPF 30 | Super Shield Sport Stick Sunscreen SPF 50 | Mineral Sunscreen SPF 30 Fragrance-Free | Mineral Sunscreen SPF 30 Lightly Scented | Mineral Sunscreen SPF 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Protini Polypeptide Moisturizer | 0.0 | 0.0 | 0.844636 | 1.0 | 0.879856 | 0.778052 | 0.939217 | 0.739121 | 0.910657 | 1.0 | ... | 1.0 | 0.966097 | 0.972005 | 1.0 | 0.963582 | 0.969876 | 0.964907 | 0.854329 | 0.927164 | 0.971347 |
| The Water Cream | 0.0 | 0.0 | 0.844636 | 1.0 | 0.879856 | 0.778052 | 0.939217 | 0.739121 | 0.910657 | 1.0 | ... | 1.0 | 0.966097 | 0.972005 | 1.0 | 0.963582 | 0.969876 | 0.964907 | 0.854329 | 0.927164 | 0.971347 |
| Ultra Facial Cream | 0.844636 | 0.844636 | 0.0 | 1.0 | 0.819561 | 0.657143 | 0.921754 | 0.798502 | 0.838985 | 1.0 | ... | 1.0 | 0.869069 | 0.927925 | 1.0 | 0.859358 | 0.922443 | 0.954825 | 0.812477 | 0.906239 | 0.926229 |
| CC+ Cream with SPF 50+ | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| The Dewy Skin Cream | 0.879856 | 0.879856 | 0.819561 | 1.0 | 0.0 | 0.896892 | 0.952938 | 0.838409 | 0.875485 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.860058 | 1.0 | 0.915409 | 0.915409 | 0.933444 |
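To make the score concrete, here is a small worked example of cosine distance, 1 − (a·b)/(‖a‖‖b‖), computed by hand on count vectors. It is a plain-Python sketch rather than the scikit-learn call used in the project, and the three ingredient lists are invented for illustration.

```python
import math
from collections import Counter


def ingredient_counts(ingredients):
    """Count comma-separated ingredients, mirroring the split on ', '."""
    return Counter(part.strip().lower() for part in ingredients.split(", "))


def cosine_distance(a, b):
    """1 - cosine similarity of two count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


# Invented ingredient lists, used only to show how the score behaves.
cream_a = ingredient_counts("Water, Glycerin, Squalane, Tocopherol")
cream_b = ingredient_counts("Water, Glycerin, Shea Butter, Tocopherol")
balm = ingredient_counts("Beeswax, Castor Oil")

print(cosine_distance(cream_a, cream_b))  # three of four ingredients shared
print(cosine_distance(cream_a, balm))     # no ingredients shared
```

Two creams that share three of their four ingredients end up at a distance of about 0.25, while products with nothing in common sit at exactly 1. A product compared with itself is at 0, which is why the diagonal of the matrix is all zeros.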


I wanted to make sure the system can find the right item even if the person types the name in lower case or enters only part of it. The code below does just that.

In [8]:
user_input = 'the dewy skin'
#matching case-insensitively on a literal substring (regex=False) so partial names still work
titles = products[products['name'].str.lower().str.contains(user_input.lower(), regex=False)]
titles = titles.reset_index(drop=True)
titles

|  | name | brand | category | price | ingredients | store | url | type | website |
|---|---|---|---|---|---|---|---|---|---|
| 0 | The Dewy Skin Cream | Tatcha | moisturizer | $68.0 | Water, Saccharomyces, Glycerin, Propanediol, D... | Sephora | https://www.sephora.com/product/the-dewy-skin-... | clean | `<a href="https://www.sephora.com/product/the-d...` |


My goal was to build a Flask app to display my recommendations. There were a few things I needed to account for. First, is there a product with that name in our database at all, and if not, what should the output be? Second, is the item already part of "Clean Sephora"? Third, if it exists in the database and is not part of "Clean Sephora", what would the recommendations for it be?
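The three questions amount to a small decision function. The sketch below shows one way to express it; `choose_page`, the page labels, the toy catalog and the "regular" type value are all hypothetical names introduced here, not taken from the project code.

```python
def choose_page(query, catalog):
    """Decide which result page applies to a search query.

    `catalog` maps a product name to a dict with 'store' and 'type' keys.
    """
    query = query.strip().lower()
    matches = [name for name in catalog if query in name.lower()]
    if not matches:
        return "not_found"        # question 1: nothing in the database
    product = catalog[matches[0]]
    if product["store"] == "Sephora" and product["type"] == "clean":
        return "already_clean"    # question 2: already in "Clean Sephora"
    return "recommend"            # question 3: look for clean alternatives


# Toy catalog with placeholder products.
catalog = {
    "Example Clean Cream": {"store": "Sephora", "type": "clean"},
    "Example Regular Serum": {"store": "Sephora", "type": "regular"},
}

print(choose_page("clean cream", catalog))
print(choose_page("regular serum", catalog))
print(choose_page("something else", catalog))
```

Each return value corresponds to one of the cases handled in the cells that follow.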

Take a look at the code below. "Water Rose Moisture Cream" is a cream sold at CVS, and at the moment it is not in our system. If no exact match is found, I use Fuzzy String Matching again to find the titles nearest to what was entered: it measures the similarity between the name entered by the user and each product name in the database. The "website" column of the dataset that is printed to the screen leads to the product description page in the Flask app.

In [9]:
from fuzzywuzzy import fuzz

user_input = 'Water Rose Moisture Cream'
#scoring every product name against the user input, ignoring case
ratios = products['name'].apply(lambda k: fuzz.ratio(k.lower(), user_input.lower()))
#keeping the five closest names together with the columns shown to the user
similar = products.assign(ratio=ratios).sort_values(by = 'ratio', ascending = False)
similar = similar[['name', 'brand', 'price', 'store', 'website']][0:5]
similar = similar.reset_index(drop=True)



In [10]:
similar

|  | name | brand | price | store | website |
|---|---|---|---|---|---|
| 0 | Water Bank Moisture Cream | LANEIGE | $38.0 | Sephora | `<a href="https://www.sephora.com/product/water...` |
| 1 | Chia Seed Moisture Cream | boscia | $38.0 | Sephora | `<a href="https://www.sephora.com/product/chia-...` |
| 2 | Extra Repair Moisture Cream | Bobbi Brown | $102.0 | Sephora | `<a href="https://www.sephora.com/product/extra...` |
| 3 | Moisture Cream | Eve Lom | $150.0 | Sephora | `<a href="https://www.sephora.com/product/the-e...` |
| 4 | Banana Soufflé Moisture Cream | Glow Recipe | $39.0 | Sephora | `<a href="https://www.sephora.com/product/glow-...` |


If the item is in our database AND is also part of "Clean Sephora", it made sense to me to recommend other products from the same line. Any time the user enters a title that is already "clean", the app recommends five other random products from that same brand.

In [11]:
searched_item = 'The Dewy Skin Cream'
searched_category = products[products['name'] == searched_item]['category'].values[0]
searched_brand = products[products['name'] == searched_item]['brand'].values[0]
#if the product is already part of 'clean sephora'
is_clean = products[(products['name'] == searched_item) & (products['store'] == 'Sephora')]['type'].eq('clean').any()
if is_clean:
    brand_other = products[(products['brand'] == searched_brand) & (products['store'] == 'Sephora')]
    brand_other = brand_other[brand_other['name'] != searched_item]
    #sampling at most five, since a brand may carry fewer than five other products
    print_to_screen = brand_other.sample(n=min(5, len(brand_other)))
    print_to_screen = print_to_screen[['name', 'brand', 'price', 'store', 'website']]

In [12]:
print_to_screen

|  | name | brand | price | store | website |
|---|---|---|---|---|---|
| 1149 | Aburatorigami Japanese Blotting Papers | Tatcha | $12.0 | Sephora | `<a href="https://www.sephora.com/product/abura...` |
| 2294 | The Kissu Lip Mask | Tatcha | $28.0 | Sephora | `<a href="https://www.sephora.com/product/tatch...` |
| 1306 | Violet-C Brightening Serum 20% Vitamin C + 10%... | Tatcha | $88.0 | Sephora | `<a href="https://www.sephora.com/product/viole...` |
| 1775 | The Silk Peony Melting Eye Cream | Tatcha | $60.0 | Sephora | `<a href="https://www.sephora.com/product/the-s...` |
| 2334 | Camellia Gold Spun Lip Balm | Tatcha | $30.0 | Sephora | `<a href="https://www.sephora.com/product/camel...` |


Now the third question: what to do if the product is in our database but is not "clean". In this case, I recommend other products that ARE part of "Clean Sephora" as well as products that can be found at Credo and Follain, picking the ones with the lowest pairwise distance score. I have to mention some limitations I encountered here. There was a lot of inconsistency in how companies entered the ingredients of their products on Sephora. Many ingredients were misspelled, and some meant the same thing but were spelled or entered differently. That is one of the reasons I went with Fuzzy String Matching. Just take a look at the photo below. When scraping Credo and Follain, the issue I sometimes ran into was that, for the sake of being more transparent to the customer (my assumption), companies would use "Orange Peel Oil" instead of the "Citrus Aurantium Dulcis Peel Oil" that other companies use. Unfortunately, Fuzzy String Matching does not catch a difference like that.
<img src="./images/example2.jpeg">
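A quick check shows why a common name and a botanical name slip past a character-based matcher. The sketch uses the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy, so the exact score may differ from the project's, but the conclusion does not depend on the precise value.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Case-insensitive similarity score between 0 and 100."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


common = "Orange Peel Oil"
botanical = "Citrus Aurantium Dulcis Peel Oil"

score = ratio(common, botanical)
merged = score > 75   # the threshold used when combining ingredients

print(round(score, 1), merged)
```

The two names share little more than "Peel Oil", so the score stays well under the 75 threshold and they are treated as different ingredients even though they refer to the same one.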

In [13]:
searched_item = 'Truth Serum'
searched_category = products[products['name'] == searched_item]['category'].values[0]
searched_brand = products[products['name'] == searched_item]['brand'].values[0]

df = pd.DataFrame(recommender_df[searched_item].index)
df['score'] = recommender_df[searched_item].values
df[['brand', 'category', 'price', 'store', 'url', 'type', 'website']] = products[['brand',
                                                                       'category', 'price', 'store', 'url', 'type', 'website']]
#pulling out only clean products in the same category as the searched item
appropriete_category = df[(df['category'] == searched_category) & (df['type'] == 'clean')].sort_values(by = 'score')
#first table is the clean sephora recommendations
clean_sephora_recommendation = appropriete_category[appropriete_category['store'] == 'Sephora']
print_to_screen_sep = clean_sephora_recommendation[['name', 'brand', 'price', 'store', 'website']]
print_to_screen_sep = print_to_screen_sep[print_to_screen_sep['name'] != searched_item]
print_to_screen_sep = print_to_screen_sep[0:5]

#second table is the other clean stores
other_clean = appropriete_category[appropriete_category['store'] != 'Sephora']
print_to_screen_other = other_clean[['name', 'brand', 'price', 'store', 'website']]
print_to_screen_other = print_to_screen_other[print_to_screen_other['name'] != searched_item]
print_to_screen_other = print_to_screen_other[0:5]

In [14]:
print_to_screen_sep

|  | name | brand | price | store | website |
|---|---|---|---|---|---|
| 1302 | Squalane + 10% Lactic Acid Resurfacing Night S... | Biossance | $62.0 | Sephora | `<a href="https://www.sephora.com/product/bioss...` |
| 1534 | FAB Skin Lab Retinol Serum 0.25% Pure Concentrate | First Aid Beauty | $58.0 | Sephora | `<a href="https://www.sephora.com/product/fab-s...` |
| 1517 | Skin Rescue Acne Clearing Pads with White Clay | First Aid Beauty | $30.0 | Sephora | `<a href="https://www.sephora.com/product/skin-...` |
| 1556 | Apple Cider Vinegar Resurfacing Peel Pads | Volition Beauty | $64.0 | Sephora | `<a href="https://www.sephora.com/product/apple...` |
| 1309 | D-Bronzi Anti-Pollution Sunshine Drops | Drunk Elephant | $36.0 | Sephora | `<a href="https://www.sephora.com/product/d-bro...` |


In [15]:
print_to_screen_other

|  | name | brand | price | store | website |
|---|---|---|---|---|---|
| 2863 | Kypris Serum: Clearing Serum | Kypris | $90.0 | Credo | `<a href="https://credobeauty.com//collections/...` |
| 2781 | Daily Skin Nutrition | Indie Lee | $80.0 | Credo | `<a href="https://credobeauty.com//collections/...` |
| 2844 | Lightening Serum | Marie Veronique | $110.0 | Credo | `<a href="https://credobeauty.com//collections/...` |
| 2799 | Stem Cell Serum | Indie Lee | $135.0 | Credo | `<a href="https://credobeauty.com//collections/...` |
| 2829 | Nutrient Concentrate | Susanne Kaufmann | $156.0 | Credo | `<a href="https://credobeauty.com//collections/...` |


## Flask App

The full code can be found in, and run from, the file "product_recommender.py". In the simplest terms, you initialize the Flask app and wrap the code above in functions. I ended up with four pages: the first page, where the user enters the product name; a page for when the product is not found; a page for when it is already part of "Clean Sephora"; and a page for when it is in the database but not part of "Clean Sephora". The HTML files called by the "render_template" command in "product_recommender.py" are in the templates folder. Images referenced from the HTML files are saved in the static folder.

<img src="./images/page1.jpeg">

<img src="./images/page2.jpeg">

<img src="./images/page3.jpeg">

<img src="./images/page4.jpeg">
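The four-page structure could be wired up roughly as follows. This is a minimal sketch, not the contents of "product_recommender.py": it uses `render_template_string` with inline placeholder markup so the example is self-contained (the project uses `render_template` with files in the templates folder), and the `CATALOG` dictionary, its "regular" type value and the exact-match lookup are simplifying assumptions.

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

# Toy stand-in for the product table; names and fields are placeholders.
CATALOG = {
    "example clean cream": {"store": "Sephora", "type": "clean"},
    "example regular serum": {"store": "Sephora", "type": "regular"},
}

SEARCH_FORM = (
    '<form method="post">'
    '<input name="product"><input type="submit" value="Search">'
    "</form>"
)


@app.route("/", methods=["GET", "POST"])
def search():
    # Page 1: the search form.
    if request.method == "GET":
        return render_template_string(SEARCH_FORM)

    query = request.form.get("product", "").strip().lower()
    product = CATALOG.get(query)  # exact match only, to keep the sketch short
    if product is None:
        # Page 2: product not found (the project suggests close names here).
        return render_template_string("No match for {{ q }}.", q=query)
    if product["store"] == "Sephora" and product["type"] == "clean":
        # Page 3: already part of "Clean Sephora".
        return render_template_string("{{ q }} is already clean.", q=query)
    # Page 4: in the database but not clean, so show alternatives.
    return render_template_string("Clean alternatives to {{ q }}.", q=query)


# app.run(debug=True)  # uncomment to start the development server
```

Keeping the branching inside one view function mirrors the three questions from the recommender section; each branch would render its own template in a fuller version.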