# Data Pre-Processing
This notebook documents the steps we use to pre-process all of the data for this project.

## Data Files
We currently have 4 files we are working with (update as needed):
* `ingredient_info.csv`: this data set describes individual ingredients, including the INCI name, a description of the ingredient, and its function.
    * Source: *(PUT SOURCE HERE)*
* `sephora_sample_1` / `sephora_sample_2`: these are sample data sets from sites that have web-scraped product information from sephora.com. We want to see how the ingredient lists are formatted and how we can use them in our project.
    * Source 1: https://www.dataandsons.com/categories/product-lists/sephora-products-dataset
    * Source 2: https://www.crawlfeeds.com/datasets/ratings-and-reviews-dataset-from-sephora?params=slug
* `skincare_products.csv`: this file contains product information for skincare, including a well-formatted ingredients list (yay!). We will start our project using this set, incorporating beauty products if time permits and we are able to source that data.
    * Source: *(PUT SOURCE HERE)*

In [1]:
import pandas as pd 
import re
import warnings
from preprocess_funcs import get_INCI_name_list

## ingredient_info.csv
We need to skip the first 9 lines of the file (they precede the header row) and also split the INCI names.

In [2]:
ingredient_info = pd.read_csv('data/ingredient_info.csv', skiprows=9)
ingredient_info.head(3)

Unnamed: 0,COSING Ref No,INCI name,INN name,Ph. Eur. Name,CAS No,EC No,Chem/IUPAC Name / Description,Restriction,Function,Update Date
0,94753,DISODIUM TETRAMETHYLHEXADECENYLCYSTEINE FORM...,,,"2040469-40-5, 2422121-34-2",,Disodium Tetramethylhexadecenylcysteine Formyl...,,SKIN PROTECTING,16/06/2020
1,94896,(LIQUIDAMBAR STYRACIFLUA/TRIBULUS TERRESTRIS)...,,,,,(Liquidambar Styraciflua/Tribulus Terrestris)...,,SKIN CONDITIONING,01/12/2017
2,95645,ACRYLATES/VA/VINYL NEODECANOATE COPOLYMER,,,99728-55-9,,Acrylates/VA/Vinyl Neodecanoate Copolymer is ...,,PLASTICISER,14/02/2018
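
As a small illustration of what `skiprows` does in the cell above, here is a sketch on an in-memory file. The preamble lines and ingredient rows are invented for the demo and are not taken from `ingredient_info.csv`; only the column names mirror the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV that has preamble lines before the real header.
raw = "\n".join([
    "Export info line 1",
    "Export info line 2",
    "COSING Ref No,INCI name,Function",
    "94753,EXAMPLE INGREDIENT A,SKIN PROTECTING",
    "95645,EXAMPLE INGREDIENT B,PLASTICISER",
])

# skiprows=2 discards the two preamble lines, so the third line becomes the header.
demo = pd.read_csv(io.StringIO(raw), skiprows=2)
print(demo.columns.tolist())
print(len(demo))
```

If the number of preamble lines in the real export ever changes, `skiprows` has to change with it, so it is worth checking the column names right after loading.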


In [3]:
ingredient_info.to_csv("data/ingredient_info_processed.csv", index=False)

## sephora_sample_1.csv
We need to separate the ingredients so that each product has one row per ingredient.

In [4]:
#warnings.filterwarnings('ignore')
warnings.simplefilter("always")

sephora_sample_1 = pd.read_csv('data/sephora_sample_1.csv')

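def split_ingredients(raw):
    """Split a raw Sephora ingredient string into a list of ingredient names."""
    # Note: str() turns a missing value (NaN) into the text 'nan', so the dropna below will not remove it.
    text = str(raw).replace("'", '').replace('"', '')
    # Turn HTML line breaks into commas (same order as the original replace chain).
    for tag in ('<br><br><b>', ' <br><br>', '<br><br>', ' <br>', '<br>'):
        text = text.replace(tag, ',')
    # Remove hyphens unless they are directly followed by a number.
    text = re.sub(r'-(?![+-]?\d)', '', text)
    # Split on commas that are outside parentheses and not directly followed by a digit (e.g. "1,2").
    return re.split(r',\s*(?![^()]*\))(?![+-]?\d)', text)

sephora_sample_1['ingredients'] = sephora_sample_1['ingredients'].apply(split_ingredients)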

sephora_sample_1 = sephora_sample_1.explode('ingredients').dropna(subset=['ingredients'])

sephora_sample_1.head(3)

Unnamed: 0,url,pid,name,brand,currency,price,availability,color,description,usage,reviews,rating,likes_count,size,breadcrumbs,images,ingredients,uniq_id,scraped_at
0,https://www.sephora.com/ca/en/product/strobe-l...,P404708,Ambient® Strobe Light Sculptor,Hourglass,CAD,29.5,InStock,,"<b>What it is:</b> <br>An innovative, dual-end...",<b>Suggested Usage:</b> <br>-Sweep angled side...,15,3.6,4126,,"Makeup, Brushes & Applicators, Sponges & Appli...",https://www.sephora.com/productimages/product/...,,d9325a99-1d1d-596c-bcab-c98031c2cfc2,21/04/2022 02:12:25
1,https://www.sephora.com/ca/en/product/sephora-...,P454320,Brow Shaper Pencil - Waterproof,SEPHORA COLLECTION,CAD,16.0,InStock,"03 Rich Chestnut, 12 Granite, 02 Nutmeg Brown,...",<b>What it is: </b>A retractable brow-shaping ...,<b>Suggested Usage:</b><br>-Use the point to d...,131,3.2901,8105,Size 0.007 oz/ 0.2g,"Makeup, Eye, Eyebrow",https://www.sephora.com/productimages/product/...,Polyethylene,31ba526c-0522-5d87-8be5-ab90da039f15,21/04/2022 04:21:34
1,https://www.sephora.com/ca/en/product/sephora-...,P454320,Brow Shaper Pencil - Waterproof,SEPHORA COLLECTION,CAD,16.0,InStock,"03 Rich Chestnut, 12 Granite, 02 Nutmeg Brown,...",<b>What it is: </b>A retractable brow-shaping ...,<b>Suggested Usage:</b><br>-Use the point to d...,131,3.2901,8105,Size 0.007 oz/ 0.2g,"Makeup, Eye, Eyebrow",https://www.sephora.com/productimages/product/...,Octyldodecanol,31ba526c-0522-5d87-8be5-ab90da039f15,21/04/2022 04:21:34
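
To make the split-then-explode step easier to follow, here is a minimal sketch on a toy product. The split pattern is copied from the cell above; the ingredient string and the product name are made up for illustration, and the hyphen-removal and HTML clean-up steps are left out.

```python
import re
import pandas as pd

# Same split pattern as the cell above: a comma that is outside parentheses
# and not directly followed by a (possibly signed) digit.
SPLIT_PATTERN = r',\s*(?![^()]*\))(?![+-]?\d)'

# Invented ingredient string, for illustration only.
sample = "1,2-Hexanediol, Water, Glycerin (Vegetable, Organic), Mica"
parts = re.split(SPLIT_PATTERN, sample)
print(parts)

# One row per product, with the ingredients held as a list ...
df = pd.DataFrame({"name": ["Demo Cream"], "ingredients": [parts]})

# ... becomes one row per ingredient; the other columns are repeated.
long_df = df.explode("ingredients")
print(long_df)
```

The exploded rows keep the index of the product they came from, which is why index `1` appears more than once in the output above. Calling `reset_index(drop=True)` afterwards would give each row its own index, if that is preferred.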


In [5]:
sephora_sample_1.to_csv("data/sephora_sample_1_processed.csv", index=False)

## skincare_products.csv
We need to separate out the ingredients here as well.

In [6]:
skincare_products = pd.read_csv("data/skincare_products.csv")
ingredient_df = pd.read_csv("data/ingredient_info_processed.csv")

# First, split the ingredients into a list
skincare_products['Ingredients'] = \
    skincare_products["Ingredients"].str.split(",")

# Use functions from preprocess_funcs.py to
# match ingredients to the appropriate Cosing Ref No.
skincare_products['Cosing Ref No'] = \
    skincare_products['Ingredients'].apply(lambda x: get_INCI_name_list(ingredient_df, x))


skincare_products.head()

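
`get_INCI_name_list` lives in `preprocess_funcs.py` and its implementation is not shown in this notebook. If the matching step turns out to be slow, one possible alternative is to build a name-to-reference dictionary once and look each ingredient up in it. The sketch below assumes exact, case-insensitive matching on the INCI name, which may be stricter than what `get_INCI_name_list` does; the helper `match_refs` and all data values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for ingredient_df and skincare_products; all values are invented.
ingredient_demo = pd.DataFrame({
    "COSING Ref No": [1001, 1002, 1003],
    "INCI name": ["WATER", "GLYCERIN", "MICA"],
})
products_demo = pd.DataFrame({
    "Name": ["Demo Cream", "Demo Serum"],
    "Ingredients": [["Water", " Glycerin", " Unknownium"], ["Mica", " Water"]],
})

# Build the lookup once instead of scanning the ingredient table for every ingredient.
name_to_ref = dict(zip(ingredient_demo["INCI name"].str.upper(),
                       ingredient_demo["COSING Ref No"]))

def match_refs(ingredients):
    """Hypothetical helper: exact, case-insensitive match; unmatched names are dropped."""
    keys = (name.strip().upper() for name in ingredients)
    return [name_to_ref[key] for key in keys if key in name_to_ref]

products_demo["Cosing Ref No"] = products_demo["Ingredients"].apply(match_refs)
print(products_demo["Cosing Ref No"].tolist())
```

Exact matching will miss ingredients whose label text differs from the INCI name, so this is a trade-off between speed and coverage rather than a drop-in replacement.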

In [None]:
cosing_to_inci = \
    dict(zip(ingredient_df['COSING Ref No'], ingredient_df['INCI name']))

cosing_to_funct = \
    dict(zip(ingredient_df['COSING Ref No'], ingredient_df['Function']))


# Use `ref` rather than `id` as the loop variable to avoid shadowing the built-in id().
skincare_products['INCI Name'] = \
    skincare_products['Cosing Ref No'].apply(lambda x: [cosing_to_inci[ref] for ref in x])

skincare_products['Function'] = \
    skincare_products['Cosing Ref No'].apply(lambda x: [cosing_to_funct[ref] for ref in x])

skincare_products.head()

Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive,Cosing Ref No,INCI Name,Function
0,Moisturizer,LA MER,Crème de la Mer,175,4.1,"[Algae (Seaweed) Extract, Mineral Oil, Petro...",1,1,1,1,1,"[54290.0, 95058.0, 79504.0, 34040.0, 34654.0, ...","[ALGAE EXTRACT, HYDROGENATED MINERAL OIL, PETR...","[FRAGRANCE, HUMECTANT, ORAL CARE, SKIN CONDITI..."
1,Moisturizer,SK-II,Facial Treatment Essence,179,4.1,"[Galactomyces Ferment Filtrate (Pitera), Buty...",1,1,1,1,1,"[84397, 74756, 58983, 92472, 37735, 35342, 38173]","[GALACTOMYCES FERMENT FILTRATE, BUTYLENE GLYCO...","[HUMECTANT, FRAGRANCE, HUMECTANT, SKIN CONDITI..."
2,Moisturizer,DRUNK ELEPHANT,Protini™ Polypeptide Cream,68,4.4,"[Water, Dicaprylyl Carbonate, Glycerin, Cet...",1,1,1,1,0,"[92472, 55832, 34040, 75132, 55337, 38182, 583...","[WATER, DICAPRYLYL CARBONATE, GLYCERIN, CETEAR...","[ANTIPLAQUE, SKIN CONDITIONING, SOLVENT, SKIN ..."
3,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,"[Algae (Seaweed) Extract, Cyclopentasiloxane,...",1,1,1,1,1,"[54290.0, 75413.0, 79504.0, 34067.0, 79701.0, ...","[ALGAE EXTRACT, CYCLOPENTASILOXANE, PETROLATUM...","[FRAGRANCE, HUMECTANT, ORAL CARE, SKIN CONDITI..."
4,Moisturizer,IT COSMETICS,Your Skin But Better™ CC+™ Cream with SPF 50+,38,4.1,"[Water, Snail Secretion Filtrate, Phenyl Tri...",1,1,1,1,1,"[92472.0, 58704.0, 79701.0, 33401.0, 74756.0, ...","[WATER, SNAIL SECRETION FILTRATE, PHENYL TRIME...","[ANTIPLAQUE, SKIN CONDITIONING, SOLVENT, SKIN ..."
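
The output above shows some reference lists as floats (`54290.0`) and others as integers (`84397`). The sketch below, using an invented mapping, shows that a dictionary lookup treats both the same way, and that `dict.get` can be used if a missing reference should give `None` rather than raise a `KeyError`. Whether missing references actually occur in our data is not something this notebook establishes.

```python
# Invented mapping, standing in for cosing_to_inci.
cosing_to_inci_demo = {1001: "WATER", 1002: "GLYCERIN"}

# A float key that equals an integer key hashes to the same slot,
# so 1001.0 finds the entry stored under 1001.
refs = [1001.0, 9999, 1002]

# .get returns None for a reference that is not in the mapping.
names = [cosing_to_inci_demo.get(ref) for ref in refs]
print(names)
```

Indexing with `cosing_to_inci_demo[9999]` would raise a `KeyError` instead, which can be the better choice when a missing reference should stop the run.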


In [None]:
#skincare_products[skincare_products['Cosing Ref No'].apply(lambda x: len(x) > 1)].to_pickle('data/skincare_products_listed.pickle')
skincare_products.to_pickle('data/skincare_products_listed.pickle')
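
This table is saved as a pickle rather than a CSV, presumably because several of its columns hold Python lists. The sketch below, on a toy frame written to a temporary directory, shows the difference: a list column comes back from CSV as a string, but comes back from a pickle as a list. The frame and file names are invented for the demo.

```python
import os
import tempfile
import pandas as pd

# Toy frame with a list-valued column, like 'Ingredients' above.
demo = pd.DataFrame({"Name": ["Demo Cream"], "Ingredients": [["Water", "Glycerin"]]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")
    pkl_path = os.path.join(tmp, "demo.pickle")
    demo.to_csv(csv_path, index=False)
    demo.to_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)
    from_pickle = pd.read_pickle(pkl_path)

# CSV stores the list as its text representation; pickle keeps the list object.
print(type(from_csv.loc[0, "Ingredients"]))
print(type(from_pickle.loc[0, "Ingredients"]))
```

Pickle files should only be loaded from sources we trust, since unpickling can execute arbitrary code.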