### This notebook cleans up beauty product data and prepares some basic statistical features.

In [1]:
from difflib import SequenceMatcher
from ast import literal_eval
from collections import Counter
from tqdm import tqdm
import pandas as pd
import numpy as np
import re
import operator
import itertools
import gc

### Prepare ingredient information

Read ingredient dataframe:

In [2]:
cols = ['name','category','rating']
ingredient_df = pd.read_csv('../web_scraper/ingredients.csv', usecols=cols, converters={"category": literal_eval})
ingredient_df['name'] = ingredient_df['name'].str.strip()
ingredient_df['rating_num'] = ingredient_df['rating'].map({'Poor':0, 'Average':1, 'Good':2, 'GOOD':2, 'Best':3})
print("number of ingredient:",ingredient_df.shape[0])
ingredient_df.head()

number of ingredient: 1750


Unnamed: 0,name,category,rating,rating_num
0,"1, 2-Hexanediol",[Preservatives],Good,2
1,10-Hydroxydecanoic Acid,[Emollients],Good,2
2,4-T-butylcyclohexanol,"[Emollients, Skin-Soothing]",Good,2
3,Acacia farnesiana extract,"[Plant Extracts, Fragrance: Synthetic and Frag...",Poor,0
4,acacia senegal gum,"[Texture Enhancer, Plant Extracts, Skin-Soothing]",Good,2


Create a class that checks whether an ingredient matches an entry in our existing ingredient dictionary. If there is a match,
look up the ingredient's rating and category.
* Initialize the class with an ingredient rating dictionary and a category dictionary.
* Given a list of ingredients, find the best-matching known ingredient for which a rating and category are available. This is done by computing a similarity score between the given name and the name of every known ingredient (using Python's `difflib.SequenceMatcher`). If the best score does not exceed the threshold `thresh`, the given ingredient is labeled 'unknown'.
* After building the matching dictionary, we can find an ingredient's match, rating and category by calling the `lookup` function.

In [3]:
class look_up_ingredient():
    
    def __init__(self, rating_dict, category_dict):
        self.rating_dict = rating_dict
        self.rating_dict['unknown'] = np.nan
        
        self.category_dict = category_dict
        self.category_dict['unknown'] = []
        
        self.rating = set([value for value in self.rating_dict.values()])
        self.category = set([value for values in self.category_dict.values() for value in values])
        
        self.match_dict = {}
    
    def find_matching_ingredient(self, my_ingredients, thresh=0.2):

        for ingredient in tqdm(my_ingredients):
            if ingredient in self.match_dict.keys():
                continue
            match_matric = {key : SequenceMatcher(None, key, ingredient).ratio() for key in self.rating_dict.keys()}
            best_match, best_metric = max(match_matric.items(), key=operator.itemgetter(1))
            if best_metric > thresh:
                self.match_dict[ingredient] = best_match
            else:
                self.match_dict[ingredient] = 'unknown'
                
    def lookup(self, ingredient, option=''):
        
        key = self.match_dict.get(ingredient, 'unknown')
        rating = self.rating_dict.get(key, -1)
        category = self.category_dict.get(key, [])
        
        if option == 'ingredient':
            return key
        elif option == 'rating':
            return rating
        elif option == 'category':
            return category
        else:
            return key, rating, category
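
To get a feel for the similarity metric, here is a minimal sketch of how `SequenceMatcher.ratio()` behaves on ingredient names. The helper name `similarity` is hypothetical and is not part of the class above. The comparison is case-sensitive, so lowercasing both names before matching may be worth considering.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # ratio() returns 2*M/T, where M is the number of matched characters
    # and T is the combined length of both strings, so the value lies in [0, 1]
    return SequenceMatcher(None, a, b).ratio()

exact = similarity('dimethicone', 'dimethicone')                # identical strings
case_diff = similarity('Dimethicone', 'dimethicone')            # differs only in letter case
normalized = similarity('Dimethicone'.lower(), 'dimethicone')   # lowercased before comparing

print(exact, case_diff, normalized)
```

With the default `thresh=0.2` used above, even loosely related names clear the bar, so the threshold mostly guards against completely unrelated strings.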

Create the ingredient lookup object. Note that for an ingredient with aliases (names separated by '/'), we duplicate the record under each alias.

For example, for "PEG/PPG-18/18 dimethicone" we create three dict items, with different keys but the same value.

In [4]:
ingredient_rating_dict = {name: row['rating_num'] for (idx, row) in ingredient_df.iterrows() for name in row['name'].split('/')}
ingredient_category_dict = {name: row['category'] for (idx, row) in ingredient_df.iterrows() for name in row['name'].split('/')}
lookup = look_up_ingredient(ingredient_rating_dict, ingredient_category_dict)
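
The alias splitting can be illustrated on its own. The sketch below uses a made-up rating for a single toy row; `toy_rows` and `toy_rating_dict` are hypothetical names for illustration only.

```python
# one toy row: (name, rating_num); the rating value is made up for illustration
toy_rows = [('PEG/PPG-18/18 dimethicone', 2)]

# every slash-separated fragment becomes its own key with the same value
toy_rating_dict = {alias: rating
                   for name, rating in toy_rows
                   for alias in name.split('/')}

print(list(toy_rating_dict))  # ['PEG', 'PPG-18', '18 dimethicone']
```

Note that splitting on '/' also breaks numeric parts such as "18/18", so fragments like "18 dimethicone" become keys. That may or may not be desirable, depending on how product labels spell the ingredient.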

### Clean product data 

* Drop products that are not "chemical" products, such as makeup brushes and cleaning devices.
* Split the 'size' column into a number and a unit, and convert units as necessary.
* Compute 'price/size'.
* Basic cleaning of ingredients:
    * split inactive and active ingredients
    * convert ingredients to a list
    * find the number of inactive and active ingredients
    * check whether the ingredients are in alphabetical order: most companies list ingredients in descending order of their quantity in the product, but some list them alphabetically.
* Look up ingredients in our ingredient dictionary.
    * get the set of all unique ingredients in the products dataframe
    * find the match for each of these ingredients
    * for each product, loop over its ingredient list and look up the matching ingredient, rating and ingredient category
    * count how many ingredients in a product have a certain rating (how many ingredients are rated Good/Average, etc.)
    * count how many ingredients in a product belong to a certain category (how many antioxidants/sunscreens, etc.)
    * compute the average ingredient rating; potentially we want to take an ingredient's rank in the list into consideration.
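
The size and price steps can be sketched for a single value as follows. `parse_size` and `FL_OZ_TO_ML` are hypothetical names, and the sketch assumes a non-missing size string such as '1.70 fl. oz.'; it is an illustration of the rules listed above, not the class implementation.

```python
FL_OZ_TO_ML = 29.5735  # millilitres per US fluid ounce

def parse_size(size):
    # split "<number> <unit>" into its two parts
    parts = str(size).split(maxsplit=1)
    number = float(parts[0])
    unit = parts[1] if len(parts) > 1 else ''
    if unit == 'fl. oz.':
        # convert fluid ounces to millilitres
        number, unit = number * FL_OZ_TO_ML, 'ml'
    elif unit == 'grams':
        unit = 'gram'
    elif unit not in ('ml', 'gram', ''):
        # anything else (capsules, pads, ...) is lumped together
        unit = 'piece/other'
    return round(number), unit

number, unit = parse_size('1.70 fl. oz.')
price_per_size = 90.0 / number  # price divided by size
print(number, unit, price_per_size)
```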

In [5]:
class product_df_cleaning:
    
    def __init__(self, df):
        self.df = df.copy(deep=True)
        
    def drop_rows(self, drop_dict):
        for col, values in drop_dict.items():
            self.df = self.df.loc[~self.df[col].isin(values)]
    
    def clean_price(self):
        # convert all price to float
        # compute price/size
        if self.df['price'].dtype != 'float':
            self.df['price'] = self.df['price'].apply(lambda x: x.replace(',','')).astype('float')
        self.df['avg_price'] = self.df['price']/self.df['size_num']
        
    def clean_size(self):
        # split size into a number and a unit
        # convert fl. oz. to ml
        self.df['size_num'] = self.df['size'].apply(lambda x: float(str(x).split()[0]))
        self.df['size_unit'] = self.df['size'].apply(lambda x: str(x)[len(str(x).split()[0])+1:])
        # assign through a single .loc call to avoid chained indexing
        is_floz = self.df['size_unit'] == 'fl. oz.'
        self.df.loc[is_floz, 'size_num'] *= 29.5735
        self.df['size_num'] = self.df['size_num'].round()
        self.df.loc[is_floz, 'size_unit'] = 'ml'
        self.df.loc[self.df['size_unit'] == 'grams', 'size_unit'] = 'gram'
        self.df.loc[~self.df['size_unit'].isin(['ml', 'gram', '']), 'size_unit'] = 'piece/other'
        
    def clean_ingredient(self):
        
        def split_active_inactive(sr_ingredient):
            inactive_start = pd.concat([sr_ingredient.str.find('Other'),
                                       sr_ingredient.str.find('Inactive'),
                                       sr_ingredient.str.find('Cosmetic Ingredients')],
                                       axis=1).max(axis=1)
            
            inactive_start = inactive_start.replace(-1, 0)
            inactive = [ingredient[start:] for (ingredient, start) in zip(sr_ingredient, inactive_start)]
            inactive = [ingredient[ingredient.find(':')+1:] for ingredient in inactive]                           
            active = [ingredient[:start] for (ingredient, start) in zip(sr_ingredient, inactive_start)]
            active = [ingredient[ingredient.rfind(':')+1:] for ingredient in active]
            return active, inactive
             
        def check_alphabetical(word_list, tol=1):
            if(len(word_list)) <= tol:
                return True
            count = 0
            for i in range(len(word_list) - 1):
                if word_list[i] > word_list[i + 1]:
                    count += 1
                if count > tol:
                    return False
            return True
        
        #split active and inactive ingredient
        self.df['ingredient'] = self.df['ingredient'].fillna('')
        self.df['active_ingredient'], self.df['inactive_ingredient'] = split_active_inactive(self.df['ingredient'])
        #convert to list
        self.df['active_ingredient_list'] = self.df['active_ingredient'].apply(lambda x: [l.strip() for l in str(x).split(',') if l.lower().islower()])
        self.df['inactive_ingredient_list'] = self.df['inactive_ingredient'].apply(lambda x: [l.strip() for l in str(x).split(',') if l.lower().islower()])      
        #find number of ingredient
        self.df['n_inactive_ingredient'] = self.df['inactive_ingredient_list'].apply(lambda x: len(x))
        self.df['n_active_ingredient'] = self.df['active_ingredient_list'].apply(lambda x: len(x))
        #check if ingredients are listed alphabetically or perhaps by their quantity
        self.df['is_alphabatical'] = self.df['inactive_ingredient_list'].apply(check_alphabetical)

        
    def lookup_ingredients(self, lookup):
        
        print("processing all ingredients...")
        merged_ingredients = set(list(itertools.chain(*self.df['inactive_ingredient_list'].values)))
        merged_ingredients = merged_ingredients.union(
                             set(list(itertools.chain(*self.df['active_ingredient_list'].values))))
        lookup.find_matching_ingredient(merged_ingredients)
        ingredient_property = pd.DataFrame(index=merged_ingredients)
        print("find all ingredients information...")
        ingredient_property['matching'] = [lookup.lookup(ingredient, option='ingredient') for ingredient in merged_ingredients]
        ingredient_property['rating'] = [lookup.lookup(ingredient, option='rating') for ingredient in merged_ingredients]
        ingredient_property['category'] = [lookup.lookup(ingredient, option='category') for ingredient in merged_ingredients]
        
        # map original ingredient list to matched ingredient
        self.df['inactive_ingredient_matched_list'] = [[ingredient_property.loc[ingredient, 'matching'] 
                                                        for ingredient in ingredients]
                                                        for ingredients in self.df['inactive_ingredient_list'].values]
        self.df['active_ingredient_matched_list'] = [[ingredient_property.loc[ingredient, 'matching'] 
                                                        for ingredient in ingredients]
                                                        for ingredients in self.df['active_ingredient_list'].values]
        
        # map original ingredient list to ingredient rating
        self.df['inactive_ingredient_rating_list'] = [[ingredient_property.loc[ingredient, 'rating'] 
                                                        for ingredient in ingredients]
                                                        for ingredients in self.df['inactive_ingredient_list'].values]
        self.df['active_ingredient_rating_list'] = [[ingredient_property.loc[ingredient, 'rating'] 
                                                        for ingredient in ingredients]
                                                        for ingredients in self.df['active_ingredient_list'].values]
        
        # map original ingredient list to ingredient category
        self.df['inactive_ingredient_category_list'] = [[ingredient_property.loc[ingredient, 'category'] 
                                                        for ingredient in ingredients]
                                                        for ingredients in self.df['inactive_ingredient_list'].values]
        self.df['active_ingredient_category_list'] = [[ingredient_property.loc[ingredient, 'category'] 
                                                        for ingredient in ingredients]
                                                        for ingredients in self.df['active_ingredient_list'].values]
        
        def count_ingredient(col, prefix='count', mode='rating'):
            count_df = pd.DataFrame()
            if mode=='rating':
                count_df = pd.DataFrame.from_dict([dict(Counter(row))
                                                   for row in self.df[col].values])
            elif mode=='cat':
                # if an ingredient belongs to multiple category, we will increment all categories
                count_df = pd.DataFrame.from_dict([dict(Counter([cat for catlist in row for cat in catlist]))
                                                   for row in self.df[col].values])
            else:
                print('unknown mode in count_ingredient')
                return count_df
            
            count_df.set_index(self.df.index, inplace=True)
            count_df.fillna(0, inplace=True)
            count_df = count_df.add_prefix(prefix)
            return count_df
        
        # count inactive rating
        inactive_rating_count = count_ingredient('inactive_ingredient_rating_list', prefix='inactive_rating_count_')
        inactive_rating_count.drop(['inactive_rating_count_nan'],axis=1,inplace=True) #hack...
        
        # count active rating
        active_rating_count = count_ingredient('active_ingredient_rating_list', prefix='active_rating_count_')
        
        # count inactive category
        inactive_category_count = count_ingredient('inactive_ingredient_category_list', 
                                                   prefix='inactive_cat_count_', mode='cat')
        
        # count active category
        active_category_count = count_ingredient('active_ingredient_category_list', 
                                                 prefix='active_cat_count_', mode='cat')
        
        # merge into main dataframe
        self.df = pd.concat([self.df, 
                             inactive_rating_count,
                               active_rating_count,
                             inactive_category_count,
                               active_category_count,], axis=1)
        
        # compute average ingredient rating
        def get_mean_rating(rating_df):
            mean_rating = np.zeros(rating_df.shape[0])
            for col in rating_df.columns.values:
                rating = re.search(r'(\d+(\.\d*)?)', col)
                if rating is not None:
                    rating = float(rating[0])
                    mean_rating += rating_df[col].values * rating
            mean_rating = mean_rating / rating_df.sum(axis=1).values
            return mean_rating
        
        self.df['inactive_mean_rating'] = get_mean_rating(inactive_rating_count)
        self.df['active_mean_rating'] = get_mean_rating(active_rating_count)
        
        del ingredient_property
        del inactive_rating_count, active_rating_count
        del inactive_category_count, active_category_count
        gc.collect()
        
    def basic_clean(self):        
        self.clean_size()
        self.clean_price()
        self.clean_ingredient()
        
    def get_df(self):
        return self.df
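
The tolerant alphabetical check used in `clean_ingredient` can be tried in isolation. The sketch below is a variation, not the method above: `is_roughly_alphabetical` is a hypothetical name, and it lowercases names first (an assumption added here) so that capitalization does not affect the ordering.

```python
def is_roughly_alphabetical(names, tol=1):
    # compare case-insensitively; the class method above compares raw strings
    lowered = [name.lower() for name in names]
    # count adjacent pairs that are out of order and allow up to `tol` of them
    out_of_order = sum(1 for a, b in zip(lowered, lowered[1:]) if a > b)
    return out_of_order <= tol

sorted_list = ['Aloe', 'Glycerin', 'Water']
one_slip = ['Aloe', 'Water', 'Glycerin', 'Zinc Oxide']
by_quantity = ['Water', 'Glycerin', 'Dimethicone', 'Aloe']

print(is_roughly_alphabetical(sorted_list),
      is_roughly_alphabetical(one_slip),
      is_roughly_alphabetical(by_quantity))
```

The tolerance lets a list with a single misplaced entry still count as alphabetical, which helps with labels that have one stray ingredient or a leading product name.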

Read product data from disk

In [6]:
cols = ['product_names','product_category','brand','ingredient','size','price']
skin_care_df = pd.read_csv('../web_scraper/skin_care_products.csv', usecols = cols)
body_care_df = pd.read_csv('../web_scraper/body_care_products.csv', usecols = cols)
makeup_df = pd.read_csv('../web_scraper/makeup_products.csv', usecols = cols)

#### Before processing all products, let's sample a few products and check how the data cleaning is doing.

Check basic data cleaning:

In [7]:
sample = skin_care_df.sample(10)
data_cleaner = product_df_cleaning(sample)
data_cleaner.basic_clean()
sample_cleaned = data_cleaner.get_df()
sample_cleaned



Unnamed: 0,product_names,product_category,brand,ingredient,size,price,size_num,size_unit,avg_price,active_ingredient,inactive_ingredient,active_ingredient_list,inactive_ingredient_list,n_inactive_ingredient,n_active_ingredient,is_alphabatical
1387,180º AHA Facial Peel,Exfoliants,Nu Skin,AHA Facial Peel-Step 1 (18 pads) *pH ~ 3.5* Wa...,,53.5,,,,,AHA Facial Peel-Step 1 (18 pads) *pH ~ 3.5* Wa...,[],[AHA Facial Peel-Step 1 (18 pads) *pH ~ 3.5* W...,25,0,False
3941,Hydraphel Intensive Night Cream,Nighttime Moisturizer,Erno Laszlo,"Water, Hydrogenated Polyisobutene, Dimethico...",1.70 fl. oz.,90.0,50.0,ml,1.8,,"Water, Hydrogenated Polyisobutene, Dimethico...",[],"[Water, Hydrogenated Polyisobutene, Dimethicon...",27,0,False
2693,Superdefense Age Defense Eye Cream Broad Spect...,Eye Cream & Treatment,Clinique,"Titanium Dioxide 5.6%, Zinc Oxide 3.8%. Water...",0.50 fl. oz.,41.0,15.0,ml,2.733333,,"Titanium Dioxide 5.6%, Zinc Oxide 3.8%. Water...",[],"[Titanium Dioxide 5.6%, Zinc Oxide 3.8%. Water...",51,0,False
3755,Retinol Youth Renewal Night Cream,Nighttime Moisturizer,Murad,"Water/Aqua/Eau, Dimethicone, Glycerin, Buty...",1.70 fl. oz.,82.0,50.0,ml,1.64,,"Water/Aqua/Eau, Dimethicone, Glycerin, Buty...",[],"[Water/Aqua/Eau, Dimethicone, Glycerin, Butyro...",57,0,False
89,Yes to Tomatoes Repairing Acne Lotion,Acne & Blemish Treatment,Yes To,Active Ingredient: Salicylic Acid 1%; Inactive...,1.70 fl. oz.,14.99,50.0,ml,0.2998,Salicylic Acid 1%;,"Aloe Barbadensis Leaf Extract, Water, Punic...",[Salicylic Acid 1%;],"[Aloe Barbadensis Leaf Extract, Water, Punica ...",19,1,False
2154,renewed hope in a jar eye,Eyes,philosophy,"Aqua/Water/Eau, Glycerin, Isononyl Isononano...",0.50 fl. oz.,51.0,15.0,ml,3.4,,"Aqua/Water/Eau, Glycerin, Isononyl Isononano...",[],"[Aqua/Water/Eau, Glycerin, Isononyl Isononanoa...",41,0,False
5326,Skin Brightening Serum,Serum,DeVita Natural Skin Care,Aloe barbadensis (certified organic aloe vera ...,1.00 fl. oz.,28.95,30.0,ml,0.965,,Aloe barbadensis (certified organic aloe vera ...,[],[Aloe barbadensis (certified organic aloe vera...,15,0,False
4483,Tru Face Essence Ultra,Nighttime Moisturizer,Nu Skin,"Cyclopentasiloxane, Dimethiconol, Squalane, ...",60.00 capsules,172.1,60.0,piece/other,2.868333,,"Cyclopentasiloxane, Dimethiconol, Squalane, ...",[],"[Cyclopentasiloxane, Dimethiconol, Squalane, E...",17,0,False
1484,CALM Redness Relief 1% BHA Lotion Exfoliant,Exfoliants,Paula's Choice Skincare,"Water (Aqua), Butylene Glycol, Cetyl Alcohol...",3.30 fl. oz.,27.0,98.0,ml,0.27551,,"Water (Aqua), Butylene Glycol, Cetyl Alcohol...",[],"[Water (Aqua), Butylene Glycol, Cetyl Alcohol,...",25,0,False
450,Essential Lift Smoothing Cleanser,Cleansers,Avalon Organics,"Aloe Barbadensis Leaf Juice, Sea Silt Extract...",6.00 fl. oz.,12.95,177.0,ml,0.073164,,"Aloe Barbadensis Leaf Juice, Sea Silt Extract...",[],"[Aloe Barbadensis Leaf Juice, Sea Silt Extract...",31,0,False


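Find all unique ingredients in these 10 products, look them up in the dictionary, and check the matching accuracy.

For most ingredients, our matching algorithm finds a reasonable match. However, there are a few mistakes: in the output below, for example, "Tromethamine" is matched to "bromelain" and "Avena Sativa (Oat) Bran Extract" to "Ananas sativus fruit extract".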


In [8]:
merged_ingredients = set(list(itertools.chain(*sample_cleaned['inactive_ingredient_list'].values)))
ingredient_property = pd.DataFrame(index=merged_ingredients)
lookup.find_matching_ingredient(merged_ingredients)
ingredient_property['matching'] = [lookup.lookup(ingredient, option='ingredient') for ingredient in merged_ingredients]
ingredient_property['rating'] = [lookup.lookup(ingredient, option='rating') for ingredient in merged_ingredients]
ingredient_property['category'] = [lookup.lookup(ingredient, option='category') for ingredient in merged_ingredients]
ingredient_property

100%|██████████| 240/240 [00:53<00:00,  4.46it/s]


Unnamed: 0,matching,rating,category
Behenyl Alcohol,behenyl alcohol,2.0,[Texture Enhancer]
Citric Acid,citric acid,2.0,[Uncategorized]
Cetearyl Alcohol,cetearyl alcohol,2.0,"[Texture Enhancer, Emollients]"
Avena Sativa (Oat) Bran Extract,Ananas sativus fruit extract,2.0,"[Plant Extracts, Exfoliant]"
Tromethamine,bromelain,2.0,"[Plant Extracts, Exfoliant]"
Dunalielia Salina Extract (Colorless Carotenoids),Dunaliella salina extract,3.0,"[Plant Extracts, Antioxidants]"
Dimethicone,dimethicone,2.0,"[Texture Enhancer, Silicones]"
Panthenol,panthenol,3.0,[Vitamins]
Centella asiatica (goto kola),Centella asiatica,3.0,"[Skin-Soothing, Antioxidants]"
Potassium sorbate.,potassium sorbate,2.0,[Preservatives]
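
One possible way to reduce such false matches is to normalize names and require a higher similarity. The sketch below is not the notebook's method: it swaps in `difflib.get_close_matches` with a stricter cutoff, and `normalize`, `strict_match`, the cutoff of 0.8 and the toy `known` list are all assumptions for illustration. It also assumes the known names are already lowercase.

```python
import re
from difflib import get_close_matches

def normalize(name):
    # lowercase and drop parenthetical notes such as "(goto kola)"
    name = re.sub(r'\([^)]*\)', ' ', name.lower())
    # collapse whitespace and trim trailing punctuation
    return re.sub(r'\s+', ' ', name).strip(' .;')

def strict_match(ingredient, known_names, cutoff=0.8):
    # return the closest known name, or 'unknown' if nothing reaches the cutoff
    candidates = get_close_matches(normalize(ingredient), known_names, n=1, cutoff=cutoff)
    return candidates[0] if candidates else 'unknown'

# toy dictionary keys, assumed to be lowercase already
known = ['bromelain', 'potassium sorbate', 'centella asiatica', 'dimethicone']

print(strict_match('Potassium sorbate.', known))
print(strict_match('Centella asiatica (goto kola)', known))
print(strict_match('Tromethamine', known))
```

A stricter cutoff trades false matches for more 'unknown' labels, so the right value would need to be checked against a hand-labeled sample.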


Look up ingredients for these sample products, and inspect the final cleaned dataframe.

In [9]:
data_cleaner.lookup_ingredients(lookup)
sample_cleaned = data_cleaner.get_df()
sample_cleaned

  0%|          | 0/241 [00:00<?, ?it/s]

processing all ingredients...


100%|██████████| 241/241 [00:00<00:00, 1052.23it/s]


find all ingredients information...


Unnamed: 0,product_names,product_category,brand,ingredient,size,price,size_num,size_unit,avg_price,active_ingredient,...,inactive_cat_count_Texture Enhancer,inactive_cat_count_Thickeners,inactive_cat_count_Thickeners/Emulsifiers,inactive_cat_count_Uncategorized,inactive_cat_count_Vitamins,active_cat_count_Anti-Acne,active_cat_count_Exfoliant,active_cat_count_Skin-Soothing,inactive_mean_rating,active_mean_rating
1387,180º AHA Facial Peel,Exfoliants,Nu Skin,AHA Facial Peel-Step 1 (18 pads) *pH ~ 3.5* Wa...,,53.5,,,,,...,3,0.0,0.0,4.0,2.0,0.0,0.0,0.0,2.2,
3941,Hydraphel Intensive Night Cream,Nighttime Moisturizer,Erno Laszlo,"Water, Hydrogenated Polyisobutene, Dimethico...",1.70 fl. oz.,90.0,50.0,ml,1.8,,...,5,0.0,0.0,0.0,3.0,0.0,0.0,0.0,2.222222,
2693,Superdefense Age Defense Eye Cream Broad Spect...,Eye Cream & Treatment,Clinique,"Titanium Dioxide 5.6%, Zinc Oxide 3.8%. Water...",0.50 fl. oz.,41.0,15.0,ml,2.733333,,...,11,0.0,9.0,2.0,3.0,0.0,0.0,0.0,2.28,
3755,Retinol Youth Renewal Night Cream,Nighttime Moisturizer,Murad,"Water/Aqua/Eau, Dimethicone, Glycerin, Buty...",1.70 fl. oz.,82.0,50.0,ml,1.64,,...,18,0.0,4.0,2.0,3.0,0.0,0.0,0.0,2.070175,
89,Yes to Tomatoes Repairing Acne Lotion,Acne & Blemish Treatment,Yes To,Active Ingredient: Salicylic Acid 1%; Inactive...,1.70 fl. oz.,14.99,50.0,ml,0.2998,Salicylic Acid 1%;,...,3,1.0,3.0,1.0,0.0,1.0,1.0,1.0,1.789474,3.0
2154,renewed hope in a jar eye,Eyes,philosophy,"Aqua/Water/Eau, Glycerin, Isononyl Isononano...",0.50 fl. oz.,51.0,15.0,ml,3.4,,...,7,0.0,4.0,3.0,1.0,0.0,0.0,0.0,2.121951,
5326,Skin Brightening Serum,Serum,DeVita Natural Skin Care,Aloe barbadensis (certified organic aloe vera ...,1.00 fl. oz.,28.95,30.0,ml,0.965,,...,1,0.0,0.0,1.0,3.0,0.0,0.0,0.0,2.266667,
4483,Tru Face Essence Ultra,Nighttime Moisturizer,Nu Skin,"Cyclopentasiloxane, Dimethiconol, Squalane, ...",60.00 capsules,172.1,60.0,piece/other,2.868333,,...,1,0.0,0.0,0.0,4.0,0.0,0.0,0.0,2.588235,
1484,CALM Redness Relief 1% BHA Lotion Exfoliant,Exfoliants,Paula's Choice Skincare,"Water (Aqua), Butylene Glycol, Cetyl Alcohol...",3.30 fl. oz.,27.0,98.0,ml,0.27551,,...,4,0.0,5.0,1.0,0.0,0.0,0.0,0.0,2.2,
450,Essential Lift Smoothing Cleanser,Cleansers,Avalon Organics,"Aloe Barbadensis Leaf Juice, Sea Silt Extract...",6.00 fl. oz.,12.95,177.0,ml,0.073164,,...,4,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.580645,
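
The mean rating columns are count-weighted averages of the ingredient ratings. A minimal standalone sketch of that calculation, using made-up ratings and a hypothetical helper `mean_rating` that skips unknown ingredients, could look like this:

```python
import math
from collections import Counter

def mean_rating(ratings):
    # count each rating value, skipping unknown ingredients (NaN)
    counts = Counter(r for r in ratings if not math.isnan(r))
    total = sum(counts.values())
    if total == 0:
        # no rated ingredient at all, e.g. a product without active ingredients
        return float('nan')
    # weighted average: sum(rating * count) / sum(count)
    return sum(rating * n for rating, n in counts.items()) / total

example = [2, 2, 3, 0, float('nan')]  # made-up ratings for one product
print(mean_rating(example))  # 1.75
```

Whether unknown ingredients should be excluded from the denominator, as assumed here, is a design choice that affects products with many unmatched ingredients.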


#### Clean up skin care products and save to disk

In [10]:
data_cleaner = product_df_cleaning(skin_care_df)
data_cleaner.drop_rows({'product_category': ['Cleansing Brushes & Devices']})
data_cleaner.basic_clean()
data_cleaner.lookup_ingredients(lookup)
skin_care_cleaned = data_cleaner.get_df()
skin_care_cleaned.to_csv('skin_care_cleaned.csv',index=False)

  0%|          | 0/14190 [00:00<?, ?it/s]

processing all ingredients...


100%|██████████| 14190/14190 [58:23<00:00,  4.05it/s]


find all ingredients information...




#### Clean up body care products and save to disk

In [11]:
data_cleaner = product_df_cleaning(body_care_df)
data_cleaner.basic_clean()
data_cleaner.lookup_ingredients(lookup)
body_care_cleaned = data_cleaner.get_df()
body_care_cleaned.to_csv('body_care_cleaned.csv',index=False)

  0%|          | 0/2406 [00:00<?, ?it/s]

processing all ingredients...


100%|██████████| 2406/2406 [02:27<00:00, 16.33it/s]


find all ingredients information...




#### Clean up makeup products and save to disk

In [12]:
data_cleaner = product_df_cleaning(makeup_df)
data_cleaner.drop_rows({'product_category': ['Makeup Brushes']})
data_cleaner.basic_clean()
data_cleaner.lookup_ingredients(lookup)
makeup_cleaned = data_cleaner.get_df()
makeup_cleaned.to_csv('makeup_cleaned.csv',index=False)

  0%|          | 0/4359 [00:00<?, ?it/s]

processing all ingredients...


100%|██████████| 4359/4359 [09:16<00:00,  7.83it/s]


find all ingredients information...


