# Capstone: Sephora. Predicting prices based on Ingredients

## Problem description

It is an assumption customers make that their skin care product price is dependent on the ingredients in this product. The goal of my projects is to see if I can predict prices of the products based on the ingredients. To accomplish this goal, I first had to gather my data. I used Sephora.com data for this.

### Project Structure:
- Notebook 0. Selenium URL Collection
- Notebook 1. Saving data from URL to an HTML file
- Notebook 2. Collecting Product Data
- Notebook 3. Data Cleaning 
- Notebook 4. EDA
- Notebook 5. Fuzzy String Matching
- Notebook 6. Regression Modeling
- Notebook 7. Classification Modeling

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import re
import numpy as np

In [2]:
products = pd.read_csv('./data/product_info.csv')
products.head(2)

Unnamed: 0,name,brand,category,price,ingredients,no_reviews,hearts,size1,size2,url
0,Protini™ Polypeptide Moisturizer,Drunk Elephant,moisturizing-cream-oils-mists,$68.00,"<div aria-labelledby=""tab2"" class=""css-1rny024...",3K reviews,216935,SIZE 1.69 oz/ 50 mL•ITEM 2025633,0,https://www.sephora.com/product/protini-tm-pol...
1,The Water Cream,Tatcha,moisturizing-cream-oils-mists,$68.00,"<div aria-labelledby=""tab2"" class=""css-1rny024...",2K reviews,197492,ITEM 1932920,SIZE: 1.7 oz/ 50 mL,https://www.sephora.com/product/the-water-crea...


In [3]:
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2765 entries, 0 to 2764
Data columns (total 10 columns):
name           2765 non-null object
brand          2765 non-null object
category       2765 non-null object
price          2765 non-null object
ingredients    2754 non-null object
no_reviews     2765 non-null object
hearts         2765 non-null int64
size1          2765 non-null object
size2          2765 non-null object
url            2765 non-null object
dtypes: int64(1), object(9)
memory usage: 216.1+ KB


In [4]:
products[products.name == 'Pure One Step Camellia Oil Cleanser']

Unnamed: 0,name,brand,category,price,ingredients,no_reviews,hearts,size1,size2,url
34,Pure One Step Camellia Oil Cleanser,Tatcha,moisturizing-cream-oils-mists,$48.00,"<div aria-labelledby=""tab2"" class=""css-1rny024...",1K reviews,88702,ITEM 1673805,SIZE: 5.1 oz/ 150 mL,https://www.sephora.com/product/pure-one-step-...
841,Pure One Step Camellia Oil Cleanser,Tatcha,cleanser,$48.00,"<div aria-labelledby=""tab2"" class=""css-1rny024...",1K reviews,88740,ITEM 1673805,SIZE: 5.1 oz/ 150 mL,https://www.sephora.com/product/pure-one-step-...


In [5]:
#dropping duplicate occurance by 'name' and 'brand' of product
products.drop_duplicates(['name', 'brand'], keep="first", inplace = True)

#### Price

In [6]:
#converting the price to float
products.price = products.price.str.replace("$", "")
# this format for some: '$185.00($214.00 value)'
for i in products.index:
    # this format for some: '$185.00($214.00 value)'
    products.loc[i, 'price'] =  products.price[i].split("(")[0]
    #this format for some: $4.00 $2.00
    products.loc[i, "price"] =  products.price[i].split(" ")[0]
#setting it as float
products.price = products.price.astype(float)

#### Ingredients

In [7]:
products[products.ingredients.isnull()]

Unnamed: 0,name,brand,category,price,ingredients,no_reviews,hearts,size1,size2,url
381,Black Pine 3D Day Facial,KORRES,moisturizing-cream-oils-mists,58.0,,65 reviews,1682,SIZE 1.35 oz/ 40 mL•ITEM 2041630,0,https://www.sephora.com/product/black-pine-ant...
485,Freeze Frame Beauty Essence,Saturday Skin,moisturizing-cream-oils-mists,47.0,,78 reviews,12273,SIZE 1.69 oz/ 50 mL•ITEM 2014314,0,https://www.sephora.com/product/freeze-frame-b...
524,"Black Pine 3D Antiaging, Firming and Lifting S...",KORRES,moisturizing-cream-oils-mists,68.0,,33 reviews,2316,SIZE 1.35 oz/ 40 mL•ITEM 2041648,0,https://www.sephora.com/product/black-pine-ant...
637,& Sleeping Beauty Purifying Mousse - Pink Clay...,Edible Beauty,moisturizing-cream-oils-mists,43.0,,59 reviews,1834,SIZE 1.7 oz/ 50g•ITEM 2266385,0,https://www.sephora.com/product/sleeping-beaut...
644,Plantscription™ Youth-Renewing Face Oil,Origins,moisturizing-cream-oils-mists,57.0,,45 reviews,5720,ITEM 1560937,0,https://www.sephora.com/product/plantscription...
727,Drop Shot Mix-in Facial Oil,Urban Decay,moisturizing-cream-oils-mists,34.0,,26 reviews,5059,SIZE 0.80 oz/ 24 ml•ITEM 2039428,0,https://www.sephora.com/product/drop-shot-mix-...
1372,Rapid Age Spot and Pigment Lightening Serum,Murad,facial-treatments,72.0,,1K reviews,42773,ITEM 654095,SIZE: 1 oz,https://www.sephora.com/product/rapid-age-spot...
1675,The Pore Wonders Set,The INKEY List,facial-treatments,35.0,,2 reviews,3416,ITEM 2259984,0,https://www.sephora.com/product/the-pore-wonde...
2333,Hydrating Floral Mask,Tata Harper,facial-treatment-masks,95.0,,15 reviews,2008,SIZE 1 oz/ 30 mL•ITEM 2169464,0,https://www.sephora.com/product/hydrating-flor...


In [8]:
products.loc[381, 'ingredients'] = "Caprylic/ Capric Triglyceride, Olus Oil/Vegetable Oil/Huile Végétale, Cetearyl Alcohol, Glyceryl Stearate Citrate, Distarch Phosphate, Diheptyl Succinate, Sinorhizobium Meliloti Ferment Filtrate, Hydrogenated Ethylhexyl Olivate, Passiflora Incarnata Seed Oil, Propanediol, Squalane, Alphaisomethyl Ionone, Ammonium Acryloyldimethyltaurate/Vp Copolymer, Argania Spinosa Kernel Oil, Ascorbyl Palmitate, Avena Strigosa Seed Extract, Benzyl Alcohol, Benzyl Salicylate, Butylene Glycol, Butyrospermum Parkii (Shea) Butter, Capryloyl Glycerin/Sebacic Acid Copolymer, Cetyl Hydroxyethylcellulose, Citric Acid, Citronellol, Coumarin, Crambe Abyssinica Seed Oil, Dimethyl Isosorbide, Epigallocatechin Gallate, Epigallocatechin Gallatyl Glucoside, Ethyl Linoleate, Ethyl Linolenate, Ethyl Oleate, Ethyl Palmitate, Ethyl Stearate, Ethylhexylglycerin, Euphorbia Cerifera (Candelilla) Wax, Glyceryl Caprylate, Glyceryl Stearate, Helianthus Annuus (Sunflower) Seed Oil, Hexapeptide-11, Hexyl Cinnamal, Hexyldecyl Stearate, Hydrogenated Olive Oil Unsaponifiables, Hydrogenated Vegetable Oil, Lactic Acid, Lecithin, Linalool, Lonicera Caprifolium (Honeysuckle) Flower Extract, Lonicera Japonica (Honeysuckle) Flower Extract, Parfum/Fragrance, Pentylene Glycol, Peucedanum Graveolens (Dill) Extract, Phenoxyethanol, Pinus Nigra Bud/Needle Extract, Potassium Sorbate, Prunus Armeniaca (Apricot) Kernel Oil, Salicylic Acid, Sesamum Indicum (Sesame) Seed Extract, Sodium Benzoate, Sodium Carboxymethyl Beta-Glucan, Sodium Gluceptate, Sorbic Acid, Spilanthes Acmella Flower Extract, Tocopherol, Tocopheryl Acetate, Xanthan Gum"
products.loc[485, 'ingredients'] = "Dipropylene Glycol, PEG-32, Bis-PEG-18 Methyl Ether Dimethyl Silane, Methyl Gluceth-20, Alcohol Denat., Cyclopentasiloxane, Lauroyl Lysine, Octyldodecanol, Pentylene Glycol, Squalane, Dimethicone, Citrus Limon (Lemon) Peel Oil, Juniperus Mexicana Oil, Citrus Aurantium Dulcis (Orange) Peel Oil, Citrus Grandis (Grapefruit) Peel Oil, Eucalyptus Globulus Leaf Oil, Lavandula Angustifolia (Lavender) Oil, Rosmarinus Officinalis (Rosemary) Leaf Oil, Punica Granatum Extract, Hydrolyzed Avocado Protein, Glycerin, Poloxamer 407, Cyclohexasiloxane, PEG-11 Methyl Ether Dimethicone, PVP, Dimethicone/Vinyl Dimethicone Crosspolymer, Glyceryl Stearate, PEG-40 Stearate, Sodium Acrylate/Sodium Acryloyldimethyl Taurate Copolymer, Isohexadecane, Potassium Hydroxide, Acrylates/C10-30 Alkyl Acrylate Crosspolymer, Polysorbate 80, Polysilicone-11, Sorbitan Oleate, Hydrogenated Lecithin Butylene Glycol 1,2-Hexanediol, Cholesterol, PEG-800, Maltodextrin, Polysorbate 20, Saccharide Hydrolysate, sh-Oligopeptide-1, sh-Polypeptide-1, sh-Oligopeptide-2, sh-Polypeptide-22, sh-Polypeptide-45, sh-Polypeptide-8, sh-Polypeptide-9, Cetearyl Alcohol, Carbomer, Xanthan Gum, Disodium EDTA, Phenoxyethanol"
products.loc[524, 'ingredients'] = "Caprylic/ Capric/Myristic/Stearic Triglyceride, Propanediol, Caprylic/Capric Triglyceride, Glycerin, Cetearyl Alcohol, Glyceryl Stearate Citrate, Isoamyl Laurate, Cetyl Palmitate, Crambe Abyssinica Seed Oil, Olus Oil/Vegetable Oil/Huile Végétale, Passiflora Incarnata Seed Oil, Squalane, Sinorhizobium Meliloti Ferment Filtrate, Ethyl Linoleate, Behenyl Alcohol, Alpha-Isomethyl Ionone, Ascorbyl Palmitate, Avena Strigosa Seed Extract, Benzyl Alcohol, Benzyl Salicylate, Butylene Glycol, Cetyl Hydroxyethylcellulose, Citric Acid, Citronellol, Coumarin, Dimethyl Isosorbide, Epigallocatechin Gallate, Epigallocatechin Gallatyl Glucoside, Ethyl Linolenate, Ethyl Oleate, Ethyl Palmitate, Ethyl Stearate, Glyceryl Caprylate, Helianthus Annuus (Sunflower) Seed Oil, Hexapeptide-11, Hexyl Cinnamal, Lactic Acid, Lecithin, Linalool, Lonicera Caprifolium (Honeysuckle) Flower Extract, Lonicera Japonica (Honeysuckle) Flower Extract, Parfum/ Fragrance, Pentylene Glycol, Peucedanum Graveolens (Dill) Extract, Phenoxyethanol, Pinus Nigra Bud/ Needle Extract, Potassium Sorbate, Salicylic Acid, Sesamum Indicum (Sesame) Seed Extract, Sodium Benzoate, Sodium Carboxymethyl Beta-Glucan, Sodium Citrate, Sorbic Acid, Spilanthes Acmella Flower Extract, Tetrasodium Glutamate Diacetate, Tocopherol, Tocopheryl Acetate, Xanthan Gum"
products.loc[637, 'ingredients'] = "Helianthus Annuus Seed Oil, Glycerin, Cocos Nucifera Oil, Stearyl Alcohol, Kaolin, Cetearyl Olivate, Euphorbia Cerifera Cera, Sorbitan Olivate, Zeolite, Cetyl Alcohol, Glycerol Formal, Citrullus Lanatus Seed Oil, Butyl Avocadate, Sodium Levulinate, Quartz, Glyceryl Caprylate, Sodium Anisate, Passiflora Edulis Seed Oil, Euterpe Oleracea Fruit Oil, Rose Damascena Flower Oil, Acacia Senegal Gum, Xanthum Gum, Tocopherol, Farnesol^, Citronellol^, Geraniol^, Eugenol^, Linalool^, Limonene^, Citral^, Iron Oxide"
products.loc[644, 'ingredients'] = "Caprylic/Capric Triglyceride, Myristyl Myristate, Isodecyl Isononanoate, Triethylhexanoin, Butylene Glycol, Stearic Acid, Peg-100 Stearate, Cetyl Alcohol, Behenyl Alcohol, Hydrogenated Coco-Glycerides, Cetearyl Alcohol, Glycerin, Anogeissus Leiocarpus Bark Extract, Peucedanum Graveolens (Dill) Extract, Cassia Alata Leaf Extract, Rosa Damascena Flower Oil*, Lavandula Angustifolia (Lavender) Oil*, Pelargonium Graveolens Flower Oil*, Illicium Verum (Anise) Fruit/Seed Oil*, Citrus Aurantium Bergamia (Bergamot) Fruit Oil*, Carthamus Tinctorius (Safflower) Seed Oil*, Myristica Fragrans (Nutmeg) Kernel Oil*, Citrus Aurantium Dulcis (Orange) Peel Oil*, Citrus Nobilis (Mandarin Orange) Peel Oil*, Citrus Limon (Lemon) Peel Oil*, Litsea Cubeba Fruit Oil*, Hibiscus Abelmoschus Extract, Linalool, Citronellol, Limonene, Geraniol, Citral, Hypnea Musciformis (Algae) Extract, Gellidiela Acerosa (Algae) Extract, Hordeum Vulgare (Barley) Extract/Extrait D'Orge, Triticum Vulgare (Wheat) Germ Extract, Pisum Sativum (Pea) Extract, Centaurium Erythraea (Centaury) Extract, Bambusa Vulgaris (Bamboo) Extract, Poria Cocos Sclerotium Extract, Cholesterol, Rosmarinus Officinalis (Rosemary) Leaf Extract, Coffea Arabica (Coffee) Seed Extract, Rubus Idaeus (Raspberry) Leaf Extract, Squalane, Sigesbeckia Orientalis (St. Paul'S Wort) Extract, Micrococcus Lysate, Phytosphingosine, Caffeine, Acetyl Hexapeptide-8, Hydrolyzed Algin, Crithmum Maritimum Extract, Hydrogenated Vegetable Oil, Laminaria Digitata Extract, Sodium Hyaluronate, Lecithin, Maltodextrin, Glycine Soja (Soybean) Sterols, Dimethicone, Ceteareth-20, Polysorbate 20, Glyceryl Stearate, Carbomer, Potassium Hydroxide, Tocopheryl Acetate, Boron Nitride, Tetrahexyldecyl Ascorbate, Sodium Dehydroacetate, Xanthan Gum, Glucosamine Hcl, Phenoxyethanol * Essential Oil"
products.loc[727, 'ingredients'] = "Helianthus Annuus Seed Oil/Sunflower Seed Oil, Simmondsia Chinensis Seed Oil/Jojoba Seed Oil, Macadamia Integrifolia Seed Oil, Triticum Vulgare Germ Oil/Wheat Germ Oil, Caprylic/Capric Triglyceride, Silica Silylate [Nano]/Silica Silylate, Limnanthes Alba Seed Oil/Meadowfoam Seed Oil, Punica Granatum Seed Oil, Fragrance, Tocopherol, Spilanthes Acmella Flower Extract, Glycine Soja Oil/Soybean Oil, Plankton Extract, Linalool, Limonene"
products.loc[1372, 'ingredients'] = "Water, Alcohol Denat., Glycolic Acid, Butylene Glycol, Glycerin, Methyl Gluceth-10, Dextran, Hexapeptide-2, Rice Amino Acids, Aloe Barbadensis Leaf Juice, Urea, Yeast Amino Acids, Trehalose, Inositol, Taurine, Betaine, Zinc Gluconate, Ascorbic Acid, Chitosan, Propyl Gallate, Nonoxynol-10, Lecithin, Tocopherol, Magnesium Ascorbyl Phosphate, Dipotassium Glycyrrhizate, Palmitoyl Hydroxypropyltrimonium Amylopectin/Glycerin Crosspolymer, Vitis Vinifera (Grape) Seed Extract, Chitosan PCA, Allantoin, Polyquaternium-10, Sodium Metabisulfite, Sodium Sulfite, Sodium Hydroxide, PPG-26-Buteth-26, PEG-40 Hydrogenated Castor Oil, Hydroxyethylcellulose, Disodium EDTA, Limonene, Linalool, Fragrance"
products.loc[2333, 'ingredients'] = "Helianthus Annuus (Sunflower) Seed Oil, Glycerin, Caprylic/Capric Triglyceride, Butyrospermum Parkii (Shea) Butter*, Cocos Nucifera (Coconut) Oil*, Propanediol, Olea Europaea (Olive) Oil*, Ricinus Communis (Castor) Seed Oil*, Water, Hordeum Vulgare Leaf Juice*, Cocos Nucifera (Coconut) Fruit Extract, Sucrose Laurate, Linoleic Acid, Sambucus Nigra Fruit Extract, Tocopherol, Lactobacillus Ferment, Leuconostoc Ferment Filtrate, Salvia Hispanica Seed Extract, Sodium Hyaluronate, Hyaluronic Acid, Tremella Funciformis Sporocarp Extract, Anigozanthos Flavidus Extract, Banksia Serrata Flower Extract, Grevillea Speciosa Flower Extract, Musa Sapientum Flower Extract, Saccharide Isomerate, Linolenic Acid, Squalane, Beta Vulgaris/Beet Root Extract, Camellia Oleifera Seed Oil*, Lavandula Angustifolia (Lavender) Flower/Leaf/Stem Water*, Rosa Damascena Flower Water*, Arnica Montana (Arnica) Extract*, Borago Officinalis (Borage) Leaf Extract*, Calendula Officinalis (Calendula) Flower Extract*, Medicago Sativa (Alfalfa) Extract*, Spiraea Ulmaria (Meadowsweet) Extract*, Simmondsia Chinensis (Jojoba) Seed Oil*, Ascorbyl Palmitate, Cera Alba/Beeswax *, Copernicia Cerifera (Carnauba) Wax*, Sucrose Palmitate, Sucrose Stearate, Sodium Citrate, Phenethyl Alcohol, Citric Acid, Hydrolyzed Corn Starch, Ci 77288, Aroma"

In [9]:
#dropped index 1675 (The Pore Wonders Set - The INKEY List) 
#because it's a set of multiple products all having different ingredients

In [10]:
#dropping any products that ingredients were unable to be gathered
products = products[products.ingredients.notnull()]

In [11]:
# Take onle the part with the real ingredients. Anything including these words is the part of the page that describes
pattern = ['<b>','<br>', '-\w+: ', 'Please', 'No Info', 
           'This product','Visit', 'Helpes', 'Serves', 'plays', 'Plays',
          'Acts', 'acts', 'Clean at Sephora products are', 'the synthetic fragrances',
           "Please be aware that ingredient lists may change", "Supports", "Reduces", "Restores", 
          "easily", "moisture", 'moisturizes', 'nourishes', 'penetrate', 'Help', 'reduce', 'improves', 'Hydrates',
          'appearance', "improving", 'improve', 'benefits', 'neutrilizes', 'formula', 'original', 'natural',
          'nurture', 'moisturize', 'moisturizes', 'protects', 'soothes', 'promote', 'regenerate',
          'protecting', 'updated periodically', 'addressing', 'nourish', 'Nurture', 'help', 'supports', 'Soothes',
          'Protects', 'innovative technology', 'helps', 'subject to change', 'ingredients', 'and', 'brighten',
          'technology', 'smooths', 'Calms', 'shown', 'moisturizes', 'rich in', 'diminishes', 'softens',
          'brightens', 'eliminates', 'boosts', 'for a more', 'effect', "Moisturizes", 'look', 'hydrated',
          'properties', 'plumps', 'Effects', 'protection', 'moisturizing', "revitalizes", "Active",
          'Division:', 'Nourishes', 'Provides', 'cleansing', 'triggers', 'Diminishes', 'Offers', 'Increases', 'hydrates',
          'Used', 'Revitalizes', 'revived', 'antioxidants', 'Neutralizes', 'physical', 'support', 'smoother']

products.loc[:, 'ingredients'] = products.ingredients.str.split('<br/>')

In [12]:
#looping through all the rows
for i in products.index:
   #looping through the elemnts in the row's list
    for j in range(len(products.ingredients[i])):
        #if non of the words in pattern present in the string then it's an ingredient list
        if all(x not in products.ingredients[i][j] for x in pattern) and (products.ingredients[i][j].strip()):
            ingredient = products.ingredients[i][j]
    products.loc[i, 'ingredients'] = ingredient

   

In [13]:
#removing noise from ingredients
remove1 = ".\r\n"
remove2 = "</div></div>"
remove3 = '<div aria-labelledby="tab2" class="css-1rny024" hidden="" id="tabpanel2" role="tabpanel" tabindex="0"><div class="css-pz80c5">'
remove4 = '\r\r'
remove5 = '\n'
remove6 = '<div arialabelledby="tab2" class="css1rny024" hidden="" id="tabpanel2" role="tabpanel" tabindex="0"><p class="css189uj0g" datacomp="Text Box ">Get more information about<! > </p><div class="cssk13wt5" datacomp="Link Box ">shipping rates'


products['ingredients'] = products['ingredients'].str.replace(remove1, "")
products['ingredients'] = products['ingredients'].str.replace(remove2, "")
products['ingredients'] = products['ingredients'].str.replace(remove3, "")
products['ingredients'] = products['ingredients'].str.replace(remove4, "")
products['ingredients'] = products['ingredients'].str.replace(remove5, "")
products['ingredients'] = products['ingredients'].str.replace(remove6, "")

products['ingredients'] = products['ingredients'].str.replace('-', "")
products['ingredients'] = products['ingredients'].str.replace('*', "")
products['ingredients'] = products['ingredients'].str.replace('+', "")
products['ingredients'] = products['ingredients'].str.replace('.', "")
products['ingredients'] = products['ingredients'].str.replace('^', "")
products['ingredients'] = products['ingredients'].str.replace('\r', "")
products['ingredients'] = products['ingredients'].str.replace('May contain: ', "")
products['ingredients'] = products['ingredients'].str.replace(';', ",")
products['ingredients'] = products['ingredients'].str.replace('\u200b', "")


#1,2 hexanediol shows up in multiple formats. vectorizes as different ingredients
products['ingredients'] = products['ingredients'].str.replace('1,10Decanediol', "110Decanediol")
products['ingredients'] = products['ingredients'].str.replace('1,2 Hexanedio', "12Hexanediol")
products['ingredients'] = products['ingredients'].str.replace('1,2 Hexanediol', "12Hexanediol")
products['ingredients'] = products['ingredients'].str.replace('1,2Hexanediol', "12Hexanediol")
products['ingredients'] = products['ingredients'].str.replace('12HexanediolQuaternium18 Bentonite', "12Hexanediol, Quaternium18 Bentonite")
products['ingredients'] = products['ingredients'].str.replace('1, 2Hexanediol', "12Hexanediol")
products['ingredients'] = products['ingredients'].str.replace('12Hexanedioll', "12Hexanediol")
products['ingredients'] = products['ingredients'].str.replace('4t butylcyclohexanol', "4tbutylcyclohexanol")



#adding space after a comma so it countvectorizes properly
products['ingredients'] = products['ingredients'].str.replace(',', ", ")

#removing all the instances of 100% natural, 100% organic and the like
products['ingredients'] = products['ingredients'].str.replace('100% Natural ', "")
products['ingredients'] = products['ingredients'].str.replace('100% Organic ', "")
products['ingredients'] = products['ingredients'].str.replace('100percent Natural ', "")
products['ingredients'] = products['ingredients'].str.replace('100% Unrefined ', "")
products['ingredients'] = products['ingredients'].str.replace('100% Pure ', "")
products['ingredients'] = products['ingredients'].str.replace('100% ', "")
products['ingredients'] = products['ingredients'].str.replace('All Natural ', "")
products['ingredients'] = products['ingredients'].str.replace('Organic ', "")


#strip leading and trailing white spaces
products['ingredients'] = products['ingredients'].str.strip()

In [14]:
#I decided to remove the "water" ingredient instances because it's mostly just a carrier agent 
#it appears in almost every row and would distort the vectorizing and put too much weight on it

water = ['Water, ', 'Water/Aqua/Eau, ', 'Aqua/Water, ', 'Aqua (Water), ', 'Water , ', 'Aqua/Water/Eau, ',
        'Water (Aqua, Eau), ', 'Water/Aqua/Eau (Aqua), ', ' Water, ', 'Water/Eau, ', 'Water (Aqua / Eau), ',
        'ater/Aqua/Eau, ', 'Water/Eau/Aqua, ', 'Aqua, ']

for i in range(len(water)):
    products['ingredients'] = products['ingredients'].str.replace(water[i], "")



In [15]:
#checking which products do not have an ingredints list and have an empty string instead
for i in products.index:
    if not products.ingredients[i].strip():
        print(i)

45
117
139
222
349
1174
1175
1839
2118


In [16]:
#manually filling in the ingredients that failed to fill
#checkin what is their names

products.loc[[45, 117, 139,222, 349, 1174, 1175, 1839, 2118], 'name']

45      BB Tinted Treatment 12-Hour Primer Broad Spect...
117          SEA drink of H2O hydrating boost moisturizer
139     Bio-Performance Advanced Super Revitalizing Cream
222                    Colored Clay CC Undereye Corrector
349                   Regenerative Anti-Aging Moisturizer
1174    Purity Made Simple® Facial Cleansing Gel & Eye...
1175    EAU FRAÎCHE DOUCEUR Micellar Cleansing Water F...
1839       The Microdelivery Triple-Acid Brightening Peel
2118                                      Eye Contour Gel
Name: name, dtype: object

In [17]:
products.loc[45, 'ingredients'] = "Cyclopentasiloxane, Isododecane, Mica, Polysilicone-11, Polymethylsilsesquioxane, Hexyl Laurate, PEG-10 Dimethicone, Polyglyceryl-4 Isostearate, Stearic Acid, Cetyl PEG/PPG-10/1 Dimethicone, Alumina, Triethoxycaprylylsilane, Dipalmitoyl Hydroxyproline, Diamond Powder, Iron Oxides"
products.loc[117, 'ingredients'] = "Butylene Glycol, Dimethicone, Glycerin, Hydroxyethyl Acrylate/Sodium Acryloyldimethyl Taurate Copolymer, Squalane, Phenoxyethanol, Ammonium Acryloyldimethyltaurate/VP Copolymer, Polysorbate 60, Caprylyl Glycol, Enteromorpha Compressa Extract, Citrus Aurantium Dulcis (Orange) Peel Oil, Disodium EDTA, Propanediol, Sea Salt Extract, Hexylene Glycol, Lavandula Angustifolia (Lavender) Oil, Phospholipids, Stearoyl Inulin, Cocos Nucifera (Coconut) Oil, Hyaluronic Acid, Silanetriol, Limonene, Citric Acid, Sorbic Acid, Sodium Anisate, Sodium Levulinate, Linalool, Potassium Sorbate, Sodium Benzoate, Helianthus Annuus (Sunflower) Seed Oil, Tocopherol, Algae Extract, Gardenia Taitensis Flower Extract, Blue 1 (CI 42090)"
products.loc[139, 'ingredients'] = "Glycerin, Cyclomethicone, Butylene Glycol, Dimethicone, Cetyl Octanoate, Squalane, Dimethicone Copolyol, Disteardimonium Hectorite, Hydrogenated C6-14 Olefin Polymers, PEG-150, Mortierella Oil, Phytosteryl Macadamiate, Tocopheryl Acetate, Stearyl Glycyrrhetinate, Arginine HCI, PEG-60 Hydrogenated Castor Oil, Sodium Glutamate, Disodium Adenosine Triphosphate, Saccharomyces Lysate Extract, Sodium Acetylated Hyaluronate, Rosa Roxburghii Extract, Octyl Methoxycinnamate, Agar, Sodium Hexametaphosphate, Trisodium EDTA, Tocopherol, BHT, Ethylparaben, Butylparaben, Methylparaben, Fragrance, Iron Oxides"
products.loc[222, 'ingredients'] = "Ricinus Communis (Castor) Seed Oil, Octyldodecyl Stearoyl Stearate, Bis-digylceryl Polyacyladipate-2, Rhus Verniciflua Peel Wax, Rhus Succedanea Fruit Wax, Methyl Methacrylate Crosspolymer, Tribehenin, Tocopherol, Tocopheryl Acetate, Glycyrrhiza Glabra (Licorice) Root Extract, Microcrystalline Wax/Cera Microcristallina/Cire Microcristalline, Kaolin, Tetrahexyldecyl Ascorbate, Silica, Montmorillonite, Magnesium Aluminum Silicate, Caffeine, Squalane, Polyethylene, Ethylhexyl Palmitate, Atelocollagen, Butylene Glycol, Mica, Pentaerythrityl Tetraisostearate, Silica Dimethyl Silylate, Sodium Chondroitin Sulfate, Sodium Hyaluronate, Ascorbyl Palmitate, Caprylyl Glycol, Phenoxyethanol, Iron Oxides (CI 77491, CI 77492, CI 77499), Titanium Dioxide (CI 77891)"
products.loc[349, 'ingredients'] = "Caprylic/Capric Triglyceride, Butylene Glycol, Dimethicone, Pentylene Glycol, Butyrospermum Parkii (Shea) Butter, Sorbitan Stearate, Glycerin, Glyceryl Stearate, PEG-100 Stearate, Limnanthes Alba (Meadowfoam) Seed Oil, DI-C12-15 Alkyl Fumarate, Algae Exopolysaccharides, Malus Domestica Fruit (Apple) Cell Culture Extract, Bambusa Vulgaris (Bamboo) Leaf/Stem Extract, Tetrapeptide-21, Pisum Sativum (Pea) Extract, Enantia Chlorantha Bark Extract, Alaria Esculenta Extract, Sodium Ascorbyl Phosphate, Biotin, Glucosamine HCL, Eucalyptus Globulus Leaf Oil, Ergothioneine, Oleanolic Acid, Cetyl Alcohol, Dimethicone/PEG-10/15 Crosspolymer, Stearic Acid, Carbomer, Xanthan Gum, Lecithin, Mannitol, Vinyl Dimethicone/Methicone Silsesquioxane Crosspolymer, Disodium EDTA, Caprylyl Glycol, Ethylhexylglycerin, Aminomethyl Propanol, Hexylene Glycol, Phenoxyethanol"
products.loc[1174, 'ingredients'] = "Sodium Trideceth Sulfate, Disodium Lauroamphodiacetate, Acrylates Copolymer, Polysorbate 20, Sodium C14-16 Olefin Sulfonate, Glycerin, Cocamidopropyl Betaine, Isopropyl Alcohol, Sodium Sulfate, Limnanthes Alba (Meadowfoam) Seed Oil, Aniba Rosaeodora (Rosewood) Wood Oil, Pelargonium Graveolens Flower Oil, Bulnesia Sarmientoi Wood Oil, Cymbopogon Martini Oil, Rosa Centifolia Flower Oil, Amyris Balsamifera Bark Oil, Santalum Album (Sandalwood) Oil, Salvia Sclarea (Clary) Oil, Ormenis Multicaulis Oil, Acacia Dealbata Flower/Stem Extract, Daucus Carota Sativa (Carrot) Seed Oil, Piper Nigrum (Pepper) Fruit Oil, Disteareth-75 Ipdi, Glycereth-7 Caprylate/Caprate, Potassium Chloride, Hydrogen Peroxide, Magnesium Nitrate, Magnesium Chloride, Sodium Benzotriazolyl Butylphenol Sulfonate, Buteth-3, Tributyl Citrate, Sodium Hydroxide, Sodium Chloride, Disodium Edta, Citric Acid, Linalool, Methylchloroisothiazolinone, Methylisothiazolinone"
products.loc[1175, 'ingredients'] = "Hexylene Glycol, Glycerin, Poloxamer 184, Dihydrocholeth-30, Polyaminopropyl Biguanide, Benzyl Salicylate, Propylene Glycol, Fragrance, Disodium Cocoamphodiacetate, Disodium Edta, Rosa Gallica Extract/ Rosa Gallica Flower Extract"
products.loc[1839, 'ingredients'] = "Sd Alcohol 40b (Alcohol Denat.), Glycereth-7 Trimethyl Ether, Niacinamide, Mandelic Acid, Peg-8/Smdi Copolymer, Triethanolamine, Azelaic Acid, Phytic Acid, Bisabolol"
products.loc[2118, 'ingredients'] = "Rosa Damascena Flower Water, Citrus Aurantium Amara (bitter Orange) Flower Water, Butylene Glycol, Glycerin, Carbomer, Tromethamine, Caffeine, Phenoxyethanol, Ethylhexylglycerin, PEG-8, Aloe Barbadensis Leaf Juice, PEG-32, Citric Acid, Benzyl Acohol, Centaurea Cyanus Flower Extract, Sodium Benzoate, Ginkgo Biloba Leaf Extract, Potassium Sorbate, Methylisothiazolinone, Chamomilla Recutita (Matricaria) Flower Extract, Dehydroacetic Acid, CI 42090/Blue 1"



In [18]:
#some ingredients have a format similar to "Ricinus Communis (Castor) Seed Oil"
#in the parenthesis the name laymen terms vs. outside the parenthesis it's scientific name
#This is true majority of the time. I will remove everything in parenthesis to try and make
#all the ingredients more uniform in naming

In [19]:
for i in products.index:
    #remove anything between parethesis and parenthesis
    new_string = re.sub(r'\([^)]*\)', '', products.loc[i, 'ingredients'])
    #remove anything between <> and <>
    new_string= re.sub(r'\<[^)]*\>', '', new_string)
    #remove anything betwee [] and []
    new_string= re.sub(r'\[[^)]*\]', '', new_string)
    products.loc[i, 'ingredients'] = new_string.replace('  ', ' ')
    
products['ingredients'] = products['ingredients'].str.replace('<', "")
products['ingredients'] = products['ingredients'].str.replace(']', "")
products['ingredients'] = products['ingredients'].str.replace(')', "")

In [20]:
#the first ingredient looked like "Sd Alcohol 40b (Alcohol Denat.)". Removed (Alcohol Denat.) with above
products.loc[1839, 'ingredients']

'Sd Alcohol 40b , Glycereth-7 Trimethyl Ether, Niacinamide, Mandelic Acid, Peg-8/Smdi Copolymer, Triethanolamine, Azelaic Acid, Phytic Acid, Bisabolol'

In [21]:
#Following similar logic I will separate where "/" is present and grab what is the first description before "/"

In [22]:
for i in products.index:
    list_of_ingredients = []
    split = products.ingredients[i].split(", ")
    for j in range(len(split)):
        results = split[j].split("/")[0]
        results = results.strip() #stripping any white space 
        list_of_ingredients.append(results)
    products.loc[i, 'ingredients'] = ', '.join(list_of_ingredients)

#### Number of Reviews

In [23]:
#converting # reviews to integer
products['no_reviews'] = products['no_reviews'].str.replace(" reviews", "").str.replace(".review", "").str.replace("K", "000").astype(int)

#### Size

In [24]:
#this extract oz if those exist otherwise sets it to 0
for i in products.index:
    
    split = products.size1[i].split("/")[0]
    results = re.search("SIZE (.*) oz", split)
    try:
        products.loc[i, 'size1'] = results.group(1)
    except:
        products.loc[i, 'size1'] = '0'

    
#size one has a size if size 1 doesn't
for i in products.index:
    
    split = products.size2[i].split("/")[0]
    results = re.search("SIZE: (.*) oz", split)
    try:
        products.loc[i, 'size2'] = results.group(1)
    except:
        products.loc[i, 'size2'] = '0'

In [25]:
        
#remove " fl" and 'Limited Edition' and the like from the number of oz
products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace(" fl", '')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace(" fl", '')

products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace("fl", '')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace("fl", '')

products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace("Limited Edition ", '')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace("Limited Edition ", '')

products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace("2 x ", '')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace("2 x ", '')

products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace("3 x ", '')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace("3 x ", '')

products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace("Balm ", '')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace("Balm ", '')

products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace("Standard Size - ", '')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace("Standard Size - ", '')

products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace("28 x 0.01", '0.28')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace("28 x 0.01", '0.28')

products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace("30 x 0.06", '1.8')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace("30 x 0.06", '1.8')

products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace("90 x 0.012", '1.08')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace("90 x 0.012", '1.08')

products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace("50 x 0.04", '2')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace("50 x 0.04", '2')

products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace("7 Ampoules, .0625", '0.4375')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace("7 Ampoules, .0625", '0.4375')

products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace("4 Count - 0.29", '1.16')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace("4 Count - 0.29", '1.16')

products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace("4 x 1.25", '5')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace("4 x 1.25", '5')

products.loc[2038, 'size1'] = 0
products.loc[37, 'size2'] = 1.7
products = products[products.name != 'Cleansing Spa Water Cloths'] #these are cleansing cloths not liquid products

products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace("Single-use mask ", '')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace("Single-use mask ", '')

products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace("12 packettes 0.02", '0.24')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace("12 packettes 0.02", '0.24')

products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace("7 Sachets; 0.33", '2.31')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace("7 Sachets; 0.33", '2.31')

products.loc[:, 'size1'] = products.loc[:, 'size1'].str.replace("Lip Crayon: ", '')
products.loc[:, 'size2'] = products.loc[:, 'size2'].str.replace("Lip Crayon: ", '')




#converting the size1 and size2 to floats
products.loc[:, ['size1', 'size2']] = products.loc[:, ['size1', 'size2']].astype(float)


#make a column that has the size for the product from size1 + size2
products.loc[:, 'final_size'] = products.size1 + products.size2

In [26]:
products.head(2)

Unnamed: 0,name,brand,category,price,ingredients,no_reviews,hearts,size1,size2,url,final_size
0,Protini™ Polypeptide Moisturizer,Drunk Elephant,moisturizing-cream-oils-mists,68.0,"Dicaprylyl Carbonate, Glycerin, Cetearyl Alcoh...",3000,216935,1.69,0.0,https://www.sephora.com/product/protini-tm-pol...,1.69
1,The Water Cream,Tatcha,moisturizing-cream-oils-mists,68.0,"Dicaprylyl Carbonate, Glycerin, Cetearyl Alcoh...",2000,197492,0.0,1.7,https://www.sephora.com/product/the-water-crea...,1.7


In [27]:
#imputing 0 final_sizes by mean of the category
products['final_size'] = products['final_size'].replace(0, np.NaN)
products.loc[:, 'final_size'] = products.groupby(['category'])['final_size'].apply(lambda x:x.fillna(x.mean()))


#### Price per Ounce

In [28]:
#creating a price per ounce columns
products.loc[:, 'price_per_ounce'] = products['price'] / products['final_size']

In [29]:
products.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2460 entries, 0 to 2764
Data columns (total 12 columns):
name               2460 non-null object
brand              2460 non-null object
category           2460 non-null object
price              2460 non-null float64
ingredients        2460 non-null object
no_reviews         2460 non-null int64
hearts             2460 non-null int64
size1              2459 non-null float64
size2              2459 non-null float64
url                2460 non-null object
final_size         2460 non-null float64
price_per_ounce    2460 non-null float64
dtypes: float64(5), int64(2), object(5)
memory usage: 249.8+ KB


In [30]:
products.head(10)

Unnamed: 0,name,brand,category,price,ingredients,no_reviews,hearts,size1,size2,url,final_size,price_per_ounce
0,Protini™ Polypeptide Moisturizer,Drunk Elephant,moisturizing-cream-oils-mists,68.0,"Dicaprylyl Carbonate, Glycerin, Cetearyl Alcoh...",3000,216935,1.69,0.0,https://www.sephora.com/product/protini-tm-pol...,1.69,40.236686
1,The Water Cream,Tatcha,moisturizing-cream-oils-mists,68.0,"Dicaprylyl Carbonate, Glycerin, Cetearyl Alcoh...",2000,197492,0.0,1.7,https://www.sephora.com/product/the-water-crea...,1.7,40.0
2,Ultra Facial Cream,Kiehl's Since 1851,moisturizing-cream-oils-mists,32.0,"Aqua, Cyclohexasiloxane, Squalane, BisPEG18 Me...",943,87617,0.0,1.7,https://www.sephora.com/product/ultra-facial-c...,1.7,18.823529
3,CC+ Cream with SPF 50+,IT Cosmetics,moisturizing-cream-oils-mists,39.5,"Titanium Dioxide 90%, Zinc Oxide 63%",3000,225410,1.08,0.0,https://www.sephora.com/product/your-skin-but-...,1.08,36.574074
4,The Dewy Skin Cream,Tatcha,moisturizing-cream-oils-mists,68.0,"Saccharomyces, Glycerin, Propanediol, Dimethic...",1000,85005,0.0,1.7,https://www.sephora.com/product/the-dewy-skin-...,1.7,40.0
5,Lala Retro™ Whipped Moisturizer with Ceramides,Drunk Elephant,moisturizing-cream-oils-mists,60.0,"Glycerin, Caprylic, Isopropyl Isostearate, Pse...",909,40470,1.69,0.0,https://www.sephora.com/product/drunk-elephant...,1.69,35.502959
6,Crème de la Mer Moisturizer,La Mer,moisturizing-cream-oils-mists,180.0,"Algae Extract, Mineral Oil, Petrolatum, Glycer...",612,72214,0.0,1.0,https://www.sephora.com/product/creme-de-la-me...,1.0,180.0
7,F-Balm™ Electrolyte Waterfacial Mask,Drunk Elephant,moisturizing-cream-oils-mists,52.0,"Squalane, Propanediol, Niacinamide, Olive Oil ...",478,31337,1.69,0.0,https://www.sephora.com/product/drunk-elephant...,1.69,30.769231
8,The True Cream Aqua Bomb,belif,moisturizing-cream-oils-mists,38.0,"Dipropylene Glycol, Glycerin, Methl Trimethico...",3000,173760,0.0,1.68,https://www.sephora.com/product/the-true-cream...,1.68,22.619048
9,Virgin Marula Antioxidant Face Oil,Drunk Elephant,moisturizing-cream-oils-mists,72.0,Sclerocraya Birrea Kernel Oil,1000,143287,0.0,1.0,https://www.sephora.com/product/virgin-marula-...,1.0,72.0


In [31]:
products.to_csv("./data/products_clean.csv", index = False)