In [30]:
import pandas as pd
import re

In [2]:
raw_data = pd.read_csv('../scrape_notebook/Sephora_product_info.csv')
raw_data.sample(10)

Unnamed: 0,Product_Id,product_name,item_num,brand,category,price,size,love_count,rating,reviews_count,link
651,P381214,Addicted to Speed - Quick-drying Nail Top Coat,1541994,Deborah Lippmann,Makeup/Nail,20.0,na,6673,4.184,125,https://www.sephora.com/product/addicted-to-sp...
60,P405297,CAVIAR Anti-Aging® Multiplying Volume Styling ...,2140325,ALTERNA Haircare,Hair/Hair Styling & Treatments/Hair Styling Pr...,30.0,4.8 oz/ 142 mL,13684,4.2822,365,https://www.sephora.com/product/caviar-anti-ag...
2148,P406661,VOLUME Shampoo,1802016,OUAI,Hair/Shampoo & Conditioner/Shampoo,14.0,10 oz/ 295 mL,6612,2.9704,135,https://www.sephora.com/product/volume-shampoo...
347,P450581,Making Spirits Bright: The Best of boscia Kit,2240661,boscia,Skincare/Value & Gift Sets,42.0,na,1961,4.25,4,https://www.sephora.com/product/making-spirits...
6,P422289,Colonia Pura,1979608,Acqua Di Parma,Fragrance/Women/Perfume,114.0,1.7 oz/ 50 mL,371,4.5,4,https://www.sephora.com/product/colonia-pura-P...
534,P441876,Jessica Rabbit Eyeshadow Palette,2191468,Ciaté London,Makeup/Eye/Eye Palettes,27.0,9 x 0.04 oz/ 1.2 g,3859,3.0,46,https://www.sephora.com/product/jessica-rabbit...
2417,P426818,Wild With Desire Lipstick,2022382,rms beauty,Makeup/Lip/Lipstick,28.0,0.15 oz/ 4.5 g,15706,3.5472,53,https://www.sephora.com/product/wild-with-desi...
866,P428642,No. 3 Exotic Goddess Ageless Serum,2057040,Edible Beauty,Skincare/Treatments/Face Serums,57.0,1 oz/ 30 mL,8463,4.6763,139,https://www.sephora.com/product/no-3-exotic-go...
418,P435106,Discovery set,2236248,By Rosie Jane,Fragrance/Value & Gift Sets/Perfume Gift Sets,30.0,5 x 0.10 oz/ 3 mL,3505,4.5,22,https://www.sephora.com/product/discovery-set-...
1856,P395432,Sexy Amber Rollerball,1703701,Michael Kors,Fragrance/Women/Rollerballs & Travel Size,30.0,0.34 oz/ 10 mL,2734,4.1429,14,https://www.sephora.com/product/sexy-amber-rol...


In [3]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3083 entries, 0 to 3082
Data columns (total 11 columns):
Product_Id       3083 non-null object
product_name     3083 non-null object
item_num         3083 non-null int64
brand            3083 non-null object
category         3083 non-null object
price            3083 non-null object
size             3083 non-null object
love_count       3083 non-null object
rating           3083 non-null object
reviews_count    3083 non-null object
link             3083 non-null object
dtypes: int64(1), object(10)
memory usage: 265.1+ KB


From the sample and the `info()` output, `Product_Id`, `item_num`, and `link` are unlikely to be useful for analysis, so we will drop these three columns. In addition, `category` should be split into three columns, and `size` can contain `na` or other invalid values. Columns such as `price`, `love_count`, `rating`, and `reviews_count` are stored as `object` and should be converted to numeric types.
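
One way to confirm which columns need conversion is to count placeholder strings and check whether each column parses as numeric. The sketch below runs on a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the dataset); the same two expressions could be applied to `raw_data`.

```python
import pandas as pd

# Toy frame mimicking a few scraped columns; the values are illustrative only.
toy = pd.DataFrame({
    'price': ['20.0', '$69.00', '10.0'],
    'size': ['na', '7.05 oz/ 200g', '$10'],
    'love_count': ['6673', '445', 'na'],
    'rating': ['4.184', 'na', 'na'],
    'reviews_count': ['125', '0', '3'],
})

# Count the literal string 'na' in each column.
na_counts = (toy == 'na').sum()
print(na_counts)

# Flag columns holding at least one value that does not parse as a number.
unparseable = toy.apply(
    lambda col: pd.to_numeric(col, errors='coerce').isna().any()
)
print(unparseable)
```

Columns flagged here need a cleaning step (stripping symbols, replacing placeholders) before `astype` can succeed.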

# Let's start cleaning data!

## drop columns

In [4]:
raw_data_c = raw_data.drop(columns=['Product_Id', 'item_num', 'link'])
raw_data_c.sample(5)

Unnamed: 0,product_name,brand,category,price,size,love_count,rating,reviews_count
2131,GinZing™ Oil- Free Energy Boosting Gel Moistur...,Origins,Skincare/Moisturizers/Moisturizers,30.0,1.7 oz/ 50 mL,6172,4.3807,3155
497,Love Story,Chloé,Fragrance/Women/Perfume,32.0,2.5 oz/ 75 mL,12712,4.4427,192
780,Discovery Kit,Dr. Barbara Sturm,Skincare/Value & Gift Sets,110.0,na,2178,2.5,4
2769,Revitalizing Anti-Aging Body Oil,Tata Harper,Bath & Body/Body Moisturizers/Body Lotions & B...,115.0,4.1 oz/ 125 mL,3850,3.8,30
250,The True Cream Aqua Bomb Deco,belif,Skincare/Moisturizers/Moisturizers,22.0,0.84 oz/ 25 mL,0,4.7143,14


## split category into three columns

In [5]:
category_lst_series = raw_data_c['category'].str.split('/')
raw_data_c['category_1'] = category_lst_series.str[0]
raw_data_c['category_2'] = category_lst_series.str[1]
raw_data_c['category_3'] = category_lst_series.str[-1]
raw_data_c.sample(5)

Unnamed: 0,product_name,brand,category,price,size,love_count,rating,reviews_count,category_1,category_2,category_3
351,Detoxifying Black Charcoal Cleanser,boscia,Skincare/Cleansers/Face Wash & Cleansers,30.0,5 oz/ 150 mL,96622,4.3106,2508,Skincare,Cleansers,Face Wash & Cleansers
977,Gloss Bomb Universal Lip Luminizer,FENTY BEAUTY by Rihanna,Makeup/Lip/Lip Gloss,18.0,0.3 oz/ 9 mL,450978,4.6848,9498,Makeup,Lip,Lip Gloss
484,Airbrush Flawless Finish Setting Powder,Charlotte Tilbury,Makeup/Face/Setting Spray & Powder,45.0,0.28 oz,103629,4.2593,810,Makeup,Face,Setting Spray & Powder
1884,MIU MIU Body Lotion,Miu Miu,Fragrance/Women/Lotions & Oils,52.0,6.7 oz/ 198 mL,899,5.0,2,Fragrance,Women,Lotions & Oils
580,Reserve - Warm Cotton,CLEAN RESERVE,Fragrance/Women/Perfume,28.0,3.4 oz/ 101 mL,2785,3.2,205,Fragrance,Women,Perfume
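
Because `category_3` is taken with `.str[-1]`, a path with fewer than three levels repeats its deepest level there, and a single-level path leaves `category_2` missing. The sketch below shows that behaviour on three illustrative paths and, as a possible alternative rather than the notebook's method, uses `str.split(..., expand=True)`, which leaves absent levels as missing values instead.

```python
import pandas as pd

paths = pd.Series([
    'Makeup/Lip/Lip Gloss',        # three levels
    'Skincare/Value & Gift Sets',  # two levels
    'Gifts',                       # one level
])

parts = paths.str.split('/')
second = parts.str[1]   # missing when the path has a single level
last = parts.str[-1]    # deepest level present, whatever the depth
print(pd.DataFrame({'second': second, 'last': last}))

# expand=True returns one column per level and pads short paths
# with missing values.
levels = paths.str.split('/', expand=True)
levels.columns = ['category_1', 'category_2', 'category_3']
print(levels)
```

Which variant is preferable depends on whether later analysis expects `category_3` to always be populated.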


## turn price, love_count, reviews_count, rating into numeric variables.

In [6]:
# Check the situation when love_count is 'na'.
raw_data_c.loc[raw_data_c.love_count=='na']

Unnamed: 0,product_name,brand,category,price,size,love_count,rating,reviews_count,category_1,category_2,category_3
2493,Holiday Gift Card,SEPHORA COLLECTION,Gifts,10.0,$10,na,na,na,Gifts,,Gifts


Because a gift card is not a product we want to analyze, we can drop this row during cleaning.

In [7]:
# Check the situation when reviews_count is 'na'.
raw_data_c.loc[raw_data_c.reviews_count=='na']

Unnamed: 0,product_name,brand,category,price,size,love_count,rating,reviews_count,category_1,category_2,category_3
10,Blu Mediterraneo Arancia di Capri Gift Set,Acqua Di Parma,Fragrance/Value & Gift Sets/Perfume Gift Sets,113.0,na,188,na,na,Fragrance,Value & Gift Sets,Perfume Gift Sets
36,Complete Eye Renewal Balm Duo,Algenist,Skincare/Value & Gift Sets,68.0,na,545,na,na,Skincare,Value & Gift Sets,Value & Gift Sets
201,Gut Primer™ Inner Beauty Support™,The Beauty Chef,Skincare,$69.00,7.05 oz/ 200g,445,na,na,Skincare,,Skincare
371,"Reign, Dear: Texture Set",Bumble and bumble,Hair/Value & Gift Sets,15.0,na,2362,na,na,Hair,Value & Gift Sets,Value & Gift Sets
426,Lake Travel Spray,By Rosie Jane,Fragrance/Women/Perfume,25.0,0.25 oz/ 7.5 mL,198,na,na,Fragrance,Women,Perfume
...,...,...,...,...,...,...,...,...,...,...,...
2946,Best Life 2 Palette,Violet Voss,Makeup/Eye/Eye Palettes,49.0,1.21 oz/ 34.3 g,737,na,na,Makeup,Eye,Eye Palettes
2994,Sleep Duo Essential Oil Set,Vitruvi,Skincare/Value & Gift Sets,50.0,na,230,na,na,Skincare,Value & Gift Sets,Value & Gift Sets
3012,Copper Clove Boxed Scalloped Candlepot,VOLUSPA,Fragrance/Candles & Home Scents,18.0,6.2oz/ 176g,560,na,na,Fragrance,Candles & Home Scents,Candles & Home Scents
3065,Mon Paris Holiday Set,Yves Saint Laurent,Fragrance/Value & Gift Sets/Perfume Gift Sets,97.0,na,1442,na,na,Fragrance,Value & Gift Sets,Perfume Gift Sets


After checking these product pages on Sephora, we found that the review information could not be scraped because these products actually have 0 reviews. We will therefore fill `na` with 0 in `reviews_count`.

In [8]:
# Check the situation when rating is 'na'.
raw_data_c.loc[raw_data_c.rating=='na']

Unnamed: 0,product_name,brand,category,price,size,love_count,rating,reviews_count,category_1,category_2,category_3
10,Blu Mediterraneo Arancia di Capri Gift Set,Acqua Di Parma,Fragrance/Value & Gift Sets/Perfume Gift Sets,113.0,na,188,na,na,Fragrance,Value & Gift Sets,Perfume Gift Sets
36,Complete Eye Renewal Balm Duo,Algenist,Skincare/Value & Gift Sets,68.0,na,545,na,na,Skincare,Value & Gift Sets,Value & Gift Sets
201,Gut Primer™ Inner Beauty Support™,The Beauty Chef,Skincare,$69.00,7.05 oz/ 200g,445,na,na,Skincare,,Skincare
371,"Reign, Dear: Texture Set",Bumble and bumble,Hair/Value & Gift Sets,15.0,na,2362,na,na,Hair,Value & Gift Sets,Value & Gift Sets
426,Lake Travel Spray,By Rosie Jane,Fragrance/Women/Perfume,25.0,0.25 oz/ 7.5 mL,198,na,na,Fragrance,Women,Perfume
...,...,...,...,...,...,...,...,...,...,...,...
2946,Best Life 2 Palette,Violet Voss,Makeup/Eye/Eye Palettes,49.0,1.21 oz/ 34.3 g,737,na,na,Makeup,Eye,Eye Palettes
2994,Sleep Duo Essential Oil Set,Vitruvi,Skincare/Value & Gift Sets,50.0,na,230,na,na,Skincare,Value & Gift Sets,Value & Gift Sets
3012,Copper Clove Boxed Scalloped Candlepot,VOLUSPA,Fragrance/Candles & Home Scents,18.0,6.2oz/ 176g,560,na,na,Fragrance,Candles & Home Scents,Candles & Home Scents
3065,Mon Paris Holiday Set,Yves Saint Laurent,Fragrance/Value & Gift Sets/Perfume Gift Sets,97.0,na,1442,na,na,Fragrance,Value & Gift Sets,Perfume Gift Sets


The rows where `rating` is `na` are the same as those where `reviews_count` is `na`.

From the dataframe above, we can also see that `price` may contain a dollar sign (for example `$69.00`), so we need to strip it.

In [117]:
# Convert price, love_count, reviews_count, and rating to numeric dtypes.
# Drop the gift-card row (love_count == 'na'); .copy() avoids SettingWithCopyWarning.
df_process_1 = raw_data_c.loc[raw_data_c.love_count != 'na'].copy()
df_process_1['price'] = df_process_1['price'].str.strip('$')
# Products without reviews were scraped as 'na'; record them as 0.
df_process_1.loc[df_process_1.reviews_count == 'na', ['reviews_count', 'rating']] = 0
df_process_2 = df_process_1.copy()
df_process_2[['love_count', 'reviews_count']] = df_process_1[[
    'love_count', 'reviews_count']].astype('int64')
df_process_2[['rating', 'price']] = df_process_1[[
    'rating', 'price']].astype('float')
df_process_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3082 entries, 0 to 3082
Data columns (total 11 columns):
product_name     3082 non-null object
brand            3082 non-null object
category         3082 non-null object
price            3082 non-null float64
size             3082 non-null object
love_count       3082 non-null int64
rating           3082 non-null float64
reviews_count    3082 non-null int64
category_1       3082 non-null object
category_2       3079 non-null object
category_3       3082 non-null object
dtypes: float64(2), int64(2), object(7)
memory usage: 288.9+ KB
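
The same conversion can also be sketched with `pd.to_numeric`, which parses strings directly and avoids assigning into a slice. The series below hold illustrative values only, and the names `prices`, `counts`, `price_clean`, and `counts_clean` are hypothetical; this is an alternative under those assumptions, not the notebook's cell.

```python
import pandas as pd

prices = pd.Series(['20.0', '$69.00', '114.0'])  # illustrative values
counts = pd.Series(['125', 'na', '4'])           # 'na' means no reviews

# Strip a leading dollar sign, then parse to float.
price_clean = pd.to_numeric(prices.str.lstrip('$'))

# Treat 'na' as zero reviews, then parse to integers.
counts_clean = pd.to_numeric(counts.replace('na', '0')).astype('int64')

print(price_clean)
print(counts_clean)
```

Passing `errors='coerce'` to `pd.to_numeric` would turn any remaining unparseable value into `NaN` instead of raising, which can help locate stray formats.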


# Size

Different products are measured in different units. Here, we mainly focus on products whose size is given in 'oz', 'mL', or 'g'.

In [118]:
df_process_2

Unnamed: 0,product_name,brand,category,price,size,love_count,rating,reviews_count,category_1,category_2,category_3
0,Blu Mediterraneo MINIATURE Set,Acqua Di Parma,Fragrance/Value & Gift Sets/Perfume Gift Sets,63.0,5 x 0.16oz/5mL,2603,4.0000,4,Fragrance,Value & Gift Sets,Perfume Gift Sets
1,Colonia,Acqua Di Parma,Fragrance/Men/Cologne,63.0,0.7 oz/ 20 mL,2537,4.3067,75,Fragrance,Men,Cologne
2,Rosa Nobile,Acqua Di Parma,Fragrance/Women/Perfume,68.0,3.4 oz/ 101 mL,4582,4.5652,69,Fragrance,Women,Perfume
3,Fico di Amalfi,Acqua Di Parma,Fragrance/Women/Perfume,64.0,5 oz/ 148 mL,2472,4.5714,70,Fragrance,Women,Perfume
4,Mirto di Panarea,Acqua Di Parma,Fragrance/Women/Perfume,64.0,2.5 oz/ 74 mL,2624,4.6364,22,Fragrance,Women,Perfume
...,...,...,...,...,...,...,...,...,...,...,...
3078,Orange Blossom Eau de Parfum Rollerball,The 7 Virtues,Fragrance/Women/Rollerballs & Travel Size,29.0,0.33 oz/ 11 mL,2009,4.2083,24,Fragrance,Women,Rollerballs & Travel Size
3079,Vetiver Elemi Eau de Parfum Rollerball,The 7 Virtues,Fragrance/Women/Rollerballs & Travel Size,29.0,0.33 oz/ 11 mL,1230,4.6000,15,Fragrance,Women,Rollerballs & Travel Size
3080,Patchouli Citrus Eau de Parfum Rollerball,The 7 Virtues,Fragrance/Women/Rollerballs & Travel Size,29.0,0.33 oz/ 11 mL,1336,4.3077,26,Fragrance,Women,Rollerballs & Travel Size
3081,8Greens Gummies Dietary Supplement,8Greens,Skincare/Wellness/Beauty Supplements,45.0,60 Gummies,1668,4.3333,9,Skincare,Wellness,Beauty Supplements


Use a regex to find sizes given in oz, mL, or g, and create the corresponding columns.

In [119]:
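def find_measure(measure, text):
    # Capture the number directly before the unit, e.g. '4.8' in '4.8 oz'.
    # \d* (rather than \d?) keeps every digit before the decimal point,
    # so '34.3 g' yields '34.3' and not '4.3'.
    regex = r'(\d*\.?\d+)' + re.escape(measure)
    result = re.findall(regex, text)
    if len(result) != 0:
        return result[0]
    else:
        return None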
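
A quick way to sanity-check an extraction pattern like this is to run it on a few size strings taken from the tables above. The sketch uses a standalone helper, `extract_amount`, which is a hypothetical name introduced here for illustration; its pattern allows any number of digits before the decimal point.

```python
import re

def extract_amount(unit, text):
    """Return the first number directly preceding `unit` in `text`, or None."""
    pattern = r'(\d*\.?\d+)' + re.escape(unit)
    match = re.search(pattern, text)
    return match.group(1) if match else None

samples = ['4.8 oz/ 142 mL', '1.21 oz/ 34.3 g', '0.28 oz', '6.2oz/ 176g', 'na']
for s in samples:
    # One extracted value (or None) per unit: oz, mL, g.
    print(repr(s), [extract_amount(u, s) for u in (' oz', ' mL', ' g')])
```

Because each unit is searched with a leading space, strings written without one, such as `6.2oz/ 176g`, are not matched and would need a looser pattern.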

In [122]:
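measures = [' oz', ' mL', ' g']
for measure in measures:
    measure_ = measure.strip()  # unit without the leading space, e.g. 'oz'
    name = 'size_' + measure_
    # Only parse sizes that mention the unit and are not multi-packs such as '5 x 0.10 oz'.
    mask = (df_process_2['size'].str.contains(measure_, regex=False) &
            ~df_process_2['size'].str.contains('x', regex=False))
    df_process_2[name] = df_process_2.loc[mask, 'size'].apply(lambda x: find_measure(measure, x))
df_process_2.sample(5)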

Unnamed: 0,product_name,brand,category,price,size,love_count,rating,reviews_count,category_1,category_2,category_3,size_oz,size_mL,size_g
1273,Hyaluronic Acid Hydrating Serum,The INKEY List,Skincare/Treatments/Face Serums,7.99,1 oz/ 30 mL,12744,4.5149,369,Skincare,Treatments,Face Serums,1.0,30.0,
1701,Lilly Lashes 3D Mink Miami & Mini Lash Adhesiv...,Lilly Lashes,Makeup/Eye/False Eyelashes,25.0,na,5373,5.0,5,Makeup,Eye,False Eyelashes,,,
963,Skintune Blur Perfecting Primer Serum,FARSÁLI,Makeup/Face/Face Primer,54.0,1 oz/ 30 mL,74595,3.704,581,Makeup,Face,Face Primer,1.0,30.0,
315,Agave+ Nighttime Lip Therapy,Bite Beauty,Makeup/Lip/Lip Balm & Treatment,22.0,0.5 oz/ 15 g,20649,4.07,243,Makeup,Lip,Lip Balm & Treatment,0.5,,15.0
554,Tonic Body Treatment Oil,Clarins,Bath & Body/Body Care/Cellulite & Stretch Marks,65.0,3.4 oz/ 100 mL,16067,4.3682,220,Bath & Body,Body Care,Cellulite & Stretch Marks,3.4,100.0,
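
The same idea can be expressed without `apply` by using `Series.str.extract`, which returns the first capture group per row and leaves non-matching rows missing. This is a sketch on four illustrative size strings, with a hypothetical frame `out`; it assumes the same rule as above of skipping any size that contains `x`.

```python
import pandas as pd

sizes = pd.Series(['1 oz/ 30 mL', '0.5 oz/ 15 g', '5 x 0.10 oz/ 3 mL', 'na'])

# True for sizes that are not multi-packs such as '5 x 0.10 oz'.
single = ~sizes.str.contains('x', regex=False)

out = pd.DataFrame({'size': sizes})
for unit in ['oz', 'mL', 'g']:
    # Number, one space, then the unit as a whole word.
    pattern = r'(\d*\.?\d+) ' + unit + r'\b'
    extracted = sizes.where(single).str.extract(pattern, expand=False)
    # Convert the captured strings to floats; unmatched rows stay NaN.
    out['size_' + unit] = pd.to_numeric(extracted)

print(out)
```

Converting with `pd.to_numeric` inside the loop yields float columns directly, so no separate casting step is needed afterwards.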
