In [1]:
import pandas as pd
import re

In [2]:
raw_data = pd.read_csv('../scrape_notebook/Sephora_product_info.csv')
raw_data.sample(10)

Unnamed: 0,Product_Id,product_name,item_num,brand,category,price,size,love_count,rating,reviews_count,link
2645,P429683,Self Tan Purity Bronzing Water Mousse,2059038,St. Tropez Tanning Essentials,Skincare/Self Tanners/For Body,42.0,6.7 oz/ 200 mL,5225,3.92,125,https://www.sephora.com/product/self-tan-purit...
1812,P410470,Divine Decadence,1838234,Marc Jacobs Fragrances,Fragrance/Women/Perfume,74.0,1.7 oz/ 50 mL,5318,3.8349,109,https://www.sephora.com/product/divine-decaden...
1775,P440634,REPLICA Under the Lemon Trees Eau de Toilette,2177996,MAISON MARGIELA,Fragrance/Women/Perfume,28.0,3.4oz/ 100mL,5500,4.3476,210,https://www.sephora.com/product/replica-under-...
162,P439201,Aerosol Brush Cleansing Foam,2066686,Artis,Makeup/Brushes & Applicators/Brush Cleaners,25.0,6 oz/ 170 g,1134,3.1429,7,https://www.sephora.com/product/aerosol-brush-...
2194,P454062,Lip Brush,2323772,PATRICK TA,Makeup/Brushes & Applicators/Lip Brushes,18.0,na,388,na,na,https://www.sephora.com/product/patrick-ta-lip...
434,P412434,Deep Euphoria Rollerball,1858133,Calvin Klein,Fragrance/Women/Rollerballs & Travel Size,26.0,0.33 oz/ 10 mL,1207,3.6,5,https://www.sephora.com/product/deep-euphoria-...
2774,P441101,The Dewy Skin Cream,2181006,Tatcha,Skincare/Moisturizers/Moisturizers,68.0,1.7 oz/ 50 mL,74831,4.2517,858,https://www.sephora.com/product/the-dewy-skin-...
1332,P434514,Man Blue,2098093,JIMMY CHOO,Fragrance/Men/Cologne,72.0,1.7 oz/ 50 mL,562,5,6,https://www.sephora.com/product/man-blue-P4345...
2299,P398365,Luna Rossa Sport Eau de Toilette,1725563,Prada,Fragrance/Men/Cologne,70.0,3.4 oz/ 100 mL,3346,4.8218,101,https://www.sephora.com/product/luna-rossa-spo...
1323,P451002,The King of Jacks™,2289700,Jack Black,Skincare/Value & Gift Sets,49.0,na,562,na,na,https://www.sephora.com/product/the-king-jacks...


In [3]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3083 entries, 0 to 3082
Data columns (total 11 columns):
Product_Id       3083 non-null object
product_name     3083 non-null object
item_num         3083 non-null int64
brand            3083 non-null object
category         3083 non-null object
price            3083 non-null object
size             3083 non-null object
love_count       3083 non-null object
rating           3083 non-null object
reviews_count    3083 non-null object
link             3083 non-null object
dtypes: int64(1), object(10)
memory usage: 265.1+ KB


From the sample and the `info()` output, `Product_Id`, `item_num`, and `link` are unlikely to be useful for analysis, so we drop these three columns. In addition, `category` should be split into three columns, and `size` can contain `na` or other invalid values. `price`, `love_count`, `rating`, and `reviews_count` are stored as strings and should be converted to numeric types.
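One quick way to see which columns hold the `na` placeholder is to count it per column. The sketch below uses a small made-up frame (`toy`, not the real scrape) purely for illustration; on the real data the same expression could be applied to `raw_data`.

```python
import pandas as pd

# Toy frame standing in for raw_data; the values are made up for illustration.
toy = pd.DataFrame({
    'price': ['42.0', '$69.00', '18.0'],
    'size': ['6.7 oz/ 200 mL', 'na', 'na'],
    'rating': ['3.92', 'na', 'na'],
})

# Count how many cells in each column hold the literal string 'na'.
na_counts = (toy == 'na').sum()
print(na_counts)
```

Because the scraper wrote `na` as a string, `info()` reports these cells as non-null, so a count like this is needed to find them.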

# Let's start cleaning data!

## drop columns

In [4]:
raw_data_c = raw_data.drop(columns=['Product_Id', 'item_num', 'link'])
raw_data_c.sample(5)

Unnamed: 0,product_name,brand,category,price,size,love_count,rating,reviews_count
2002,Mini Nude Eyeshadow Palette,Natasha Denona,Makeup/Eye/Eye Palettes,25.0,na,42892,4.519,289
984,Killawatt Freestyle Highlighter,FENTY BEAUTY by Rihanna,Makeup/Face/Highlighter,36.0,2 x 0.12 oz/ 3.5 g,351677,4.6473,3295
1453,Amazing Exfoliating Water,KENZOKI,Skincare/Cleansers/Exfoliators,29.0,6.7 oz,880,4.2857,7
2433,Rosebud Salve in a Tube,Rosebud Perfume Co.,Skincare/Lip Treatments/Lip Balms & Treatments,7.0,0.5 oz/ 14.2 g,46973,4.4222,2198
2278,Endless Summer Spray,Playa,Hair/Hair Styling & Treatments/Hair Styling Pr...,24.0,3.65 oz/ 108 mL,4937,4.1176,34


## split category into three columns

In [5]:
category_lst_series = raw_data_c['category'].str.split('/')
raw_data_c['category_1'] = category_lst_series.str[0]
raw_data_c['category_2'] = category_lst_series.str[1]
raw_data_c['category_3'] = category_lst_series.str[2]
raw_data_c.sample(5)

Unnamed: 0,product_name,brand,category,price,size,love_count,rating,reviews_count,category_1,category_2,category_3
2624,Brazilian Bum Bum Cream,Sol de Janeiro,Bath & Body/Body Moisturizers/Body Lotions & B...,20.0,8.1 oz/ 240 mL,175320,4.3424,4947,Bath & Body,Body Moisturizers,Body Lotions & Body Oils
217,the original beautyblender®,beautyblender,Makeup/Brushes & Applicators/Sponges & Applica...,20.0,Original,322757,4.5109,8196,Makeup,Brushes & Applicators,Sponges & Applicators
2426,Luxury Face Oil- Lavender Absolute Mini,RODIN olio lusso,Skincare/Mini Size,96.0,0.5 oz/ 15 mL,504,4.6667,3,Skincare,Mini Size,
942,The Award Winners,Eve Lom,Skincare/Value & Gift Sets,55.0,na,8393,3.4375,16,Skincare,Value & Gift Sets,
1502,Intoxicated,Kilian,Fragrance/Women/Perfume,295.0,1.7 oz/ 50 mL,610,4.25,4,Fragrance,Women,Perfume
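As an alternative sketch, `str.split` with `expand=True` returns the pieces as separate columns in one step. The frame `toy` and its values below are illustrative assumptions, not part of the scraped data.

```python
import pandas as pd

# Toy categories with three, two, and one level (made up for illustration).
toy = pd.DataFrame({'category': [
    'Fragrance/Women/Perfume',
    'Skincare/Mini Size',
    'Gifts',
]})

# expand=True yields one column per piece; missing levels are left empty.
parts = toy['category'].str.split('/', expand=True)

# Force exactly three columns, even if no row happens to have three levels.
parts = parts.reindex(columns=range(3))
parts.columns = ['category_1', 'category_2', 'category_3']

toy = toy.join(parts)
print(toy)
```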


## Turn price, love_count, reviews_count, rating into numeric variables.

In [6]:
# Check the situation when love_count is 'na'.
raw_data_c.loc[raw_data_c.love_count=='na']

Unnamed: 0,product_name,brand,category,price,size,love_count,rating,reviews_count,category_1,category_2,category_3
2493,Holiday Gift Card,SEPHORA COLLECTION,Gifts,10.0,$10,na,na,na,Gifts,,


A gift card is not useful for this analysis, so we drop this row during cleaning.

In [7]:
# Check the situation when reviews_count is 'na'.
raw_data_c.loc[raw_data_c.reviews_count=='na']

Unnamed: 0,product_name,brand,category,price,size,love_count,rating,reviews_count,category_1,category_2,category_3
10,Blu Mediterraneo Arancia di Capri Gift Set,Acqua Di Parma,Fragrance/Value & Gift Sets/Perfume Gift Sets,113.0,na,188,na,na,Fragrance,Value & Gift Sets,Perfume Gift Sets
36,Complete Eye Renewal Balm Duo,Algenist,Skincare/Value & Gift Sets,68.0,na,545,na,na,Skincare,Value & Gift Sets,
201,Gut Primer™ Inner Beauty Support™,The Beauty Chef,Skincare,$69.00,7.05 oz/ 200g,445,na,na,Skincare,,
371,"Reign, Dear: Texture Set",Bumble and bumble,Hair/Value & Gift Sets,15.0,na,2362,na,na,Hair,Value & Gift Sets,
426,Lake Travel Spray,By Rosie Jane,Fragrance/Women/Perfume,25.0,0.25 oz/ 7.5 mL,198,na,na,Fragrance,Women,Perfume
...,...,...,...,...,...,...,...,...,...,...,...
2946,Best Life 2 Palette,Violet Voss,Makeup/Eye/Eye Palettes,49.0,1.21 oz/ 34.3 g,737,na,na,Makeup,Eye,Eye Palettes
2994,Sleep Duo Essential Oil Set,Vitruvi,Skincare/Value & Gift Sets,50.0,na,230,na,na,Skincare,Value & Gift Sets,
3012,Copper Clove Boxed Scalloped Candlepot,VOLUSPA,Fragrance/Candles & Home Scents,18.0,6.2oz/ 176g,560,na,na,Fragrance,Candles & Home Scents,
3065,Mon Paris Holiday Set,Yves Saint Laurent,Fragrance/Value & Gift Sets/Perfume Gift Sets,97.0,na,1442,na,na,Fragrance,Value & Gift Sets,Perfume Gift Sets


After checking these product pages on Sephora, we found that the review information could not be scraped because these products actually have 0 reviews. So we fill `na` with 0 for `reviews_count`.
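One possible way to do this fill and the numeric conversion together is `pd.to_numeric` with `errors='coerce'`, which turns unparseable strings into NaN that can then be filled. This is a sketch on made-up values (`toy` is hypothetical). Note that `coerce` would also swallow any other malformed string, not only `na`.

```python
import pandas as pd

# Toy columns mimicking the scraped strings (values are illustrative).
toy = pd.DataFrame({'reviews_count': ['125', 'na', '7'],
                    'rating': ['3.92', 'na', '3.1429']})

# 'na' cannot be parsed, so it becomes NaN and is then replaced by 0.
toy['reviews_count'] = (pd.to_numeric(toy['reviews_count'], errors='coerce')
                        .fillna(0)
                        .astype('int64'))
toy['rating'] = pd.to_numeric(toy['rating'], errors='coerce').fillna(0.0)

print(toy.dtypes)
```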

In [8]:
# Check the situation when rating is 'na'.
raw_data_c.loc[raw_data_c.rating=='na']

Unnamed: 0,product_name,brand,category,price,size,love_count,rating,reviews_count,category_1,category_2,category_3
10,Blu Mediterraneo Arancia di Capri Gift Set,Acqua Di Parma,Fragrance/Value & Gift Sets/Perfume Gift Sets,113.0,na,188,na,na,Fragrance,Value & Gift Sets,Perfume Gift Sets
36,Complete Eye Renewal Balm Duo,Algenist,Skincare/Value & Gift Sets,68.0,na,545,na,na,Skincare,Value & Gift Sets,
201,Gut Primer™ Inner Beauty Support™,The Beauty Chef,Skincare,$69.00,7.05 oz/ 200g,445,na,na,Skincare,,
371,"Reign, Dear: Texture Set",Bumble and bumble,Hair/Value & Gift Sets,15.0,na,2362,na,na,Hair,Value & Gift Sets,
426,Lake Travel Spray,By Rosie Jane,Fragrance/Women/Perfume,25.0,0.25 oz/ 7.5 mL,198,na,na,Fragrance,Women,Perfume
...,...,...,...,...,...,...,...,...,...,...,...
2946,Best Life 2 Palette,Violet Voss,Makeup/Eye/Eye Palettes,49.0,1.21 oz/ 34.3 g,737,na,na,Makeup,Eye,Eye Palettes
2994,Sleep Duo Essential Oil Set,Vitruvi,Skincare/Value & Gift Sets,50.0,na,230,na,na,Skincare,Value & Gift Sets,
3012,Copper Clove Boxed Scalloped Candlepot,VOLUSPA,Fragrance/Candles & Home Scents,18.0,6.2oz/ 176g,560,na,na,Fragrance,Candles & Home Scents,
3065,Mon Paris Holiday Set,Yves Saint Laurent,Fragrance/Value & Gift Sets/Perfume Gift Sets,97.0,na,1442,na,na,Fragrance,Value & Gift Sets,Perfume Gift Sets


The rows where `rating` is `na` are the same rows where `reviews_count` is `na`.

The output above also shows that `price` can contain a dollar sign (e.g. `$69.00`), so we need to strip it.
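A minimal sketch of that check and fix, on a made-up series (`prices` is hypothetical): first flag the values that carry a dollar sign, then remove it and convert to float.

```python
import pandas as pd

# Toy prices mimicking the scraped strings (values are illustrative).
prices = pd.Series(['42.0', '$69.00', '18.0'])

# regex=False treats '$' as a literal character rather than an end-of-string anchor.
has_dollar = prices.str.contains('$', regex=False)

# Remove the leading dollar sign, then convert to float.
clean = prices.str.lstrip('$').astype('float')

print(has_dollar.sum(), clean.tolist())
```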

In [9]:
# remove observations with null love counts
df_process_1 = raw_data_c.loc[raw_data_c.love_count != 'na'].copy()

# remove dollar sign for price column
df_process_1['price'] = df_process_1['price'].str.strip('$')

# set null review counts and ratings to 0 (build the mask from the frame being modified)
df_process_1.loc[df_process_1.reviews_count == 'na', ['reviews_count','rating']] = 0

# turn love_count, reviews_count, rating, price into numerical variables
df_process_2 = df_process_1.copy()
df_process_2[['love_count', 'reviews_count']] = df_process_1[['love_count', 'reviews_count']].astype('int64')
df_process_2[['rating', 'price']] = df_process_1[['rating', 'price']].astype('float')

df_process_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3082 entries, 0 to 3082
Data columns (total 11 columns):
product_name     3082 non-null object
brand            3082 non-null object
category         3082 non-null object
price            3082 non-null float64
size             3082 non-null object
love_count       3082 non-null int64
rating           3082 non-null float64
reviews_count    3082 non-null int64
category_1       3082 non-null object
category_2       3079 non-null object
category_3       2705 non-null object
dtypes: float64(2), int64(2), object(7)
memory usage: 288.9+ KB


## Fill missing categories with 'no info'

In [10]:
# Turn null categories into 'no info'
df_process_2.fillna('no info', inplace=True)

# Size

Different products are measured in different units. Here we focus on products whose size is given in 'oz', 'mL', or 'g'.

In [11]:
df_process_2

Unnamed: 0,product_name,brand,category,price,size,love_count,rating,reviews_count,category_1,category_2,category_3
0,Blu Mediterraneo MINIATURE Set,Acqua Di Parma,Fragrance/Value & Gift Sets/Perfume Gift Sets,63.0,5 x 0.16oz/5mL,2603,4.0000,4,Fragrance,Value & Gift Sets,Perfume Gift Sets
1,Colonia,Acqua Di Parma,Fragrance/Men/Cologne,63.0,0.7 oz/ 20 mL,2537,4.3067,75,Fragrance,Men,Cologne
2,Rosa Nobile,Acqua Di Parma,Fragrance/Women/Perfume,68.0,3.4 oz/ 101 mL,4582,4.5652,69,Fragrance,Women,Perfume
3,Fico di Amalfi,Acqua Di Parma,Fragrance/Women/Perfume,64.0,5 oz/ 148 mL,2472,4.5714,70,Fragrance,Women,Perfume
4,Mirto di Panarea,Acqua Di Parma,Fragrance/Women/Perfume,64.0,2.5 oz/ 74 mL,2624,4.6364,22,Fragrance,Women,Perfume
...,...,...,...,...,...,...,...,...,...,...,...
3078,Orange Blossom Eau de Parfum Rollerball,The 7 Virtues,Fragrance/Women/Rollerballs & Travel Size,29.0,0.33 oz/ 11 mL,2009,4.2083,24,Fragrance,Women,Rollerballs & Travel Size
3079,Vetiver Elemi Eau de Parfum Rollerball,The 7 Virtues,Fragrance/Women/Rollerballs & Travel Size,29.0,0.33 oz/ 11 mL,1230,4.6000,15,Fragrance,Women,Rollerballs & Travel Size
3080,Patchouli Citrus Eau de Parfum Rollerball,The 7 Virtues,Fragrance/Women/Rollerballs & Travel Size,29.0,0.33 oz/ 11 mL,1336,4.3077,26,Fragrance,Women,Rollerballs & Travel Size
3081,8Greens Gummies Dietary Supplement,8Greens,Skincare/Wellness/Beauty Supplements,45.0,60 Gummies,1668,4.3333,9,Skincare,Wellness,Beauty Supplements


Use regex to find sizes given in oz, mL, or g and create the corresponding columns.

In [12]:
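def find_measure(measure, text):
    # Capture the number directly before the unit, e.g. '3.4' in '3.4 oz'.
    # \d* (rather than \d?) keeps every integer digit, so '10.5 oz' gives '10.5', not '0.5'.
    regex = r'(\d*\.?\d+)' + measure
    result = re.findall(regex, text)
    if len(result)!=0:
        return result[0]
    else:
        return None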
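Some sizes in the data are written without a space before the unit (for example `3.4oz/ 100mL` or `6.2oz/ 176g`). The sketch below shows one way such strings could be handled. The helper `extract_size` is a hypothetical name and not part of this notebook. It allows optional whitespace before the unit and returns a float.

```python
import re

def extract_size(unit, text):
    """Return the number directly before `unit` in `text` as a float, or None."""
    # \s* tolerates '3.4oz' as well as '3.4 oz'; \b stops 'g' from matching 'gummies'.
    pattern = r'(\d*\.?\d+)\s*' + re.escape(unit) + r'\b'
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None

print(extract_size('oz', '3.4oz/ 100mL'))
print(extract_size('mL', '3.4oz/ 100mL'))
print(extract_size('g', '0.5 oz/ 14.2 g'))
print(extract_size('g', '60 Gummies'))
```

The match is case-sensitive, so `60 Gummies` is not mistaken for a weight in grams.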

In [13]:
measures = [' oz', ' mL', ' g']
for measure in measures:
    measure_ = measure.split(' ')[-1]
    name = 'size_' + measure_
    df_process_2[name] = df_process_2.loc[(df_process_2['size'].str.contains(measure_)) &
                                          (~df_process_2['size'].str.contains('x')), 'size'].apply(lambda x: find_measure(measure, x))
df_process_2.sample(5)

Unnamed: 0,product_name,brand,category,price,size,love_count,rating,reviews_count,category_1,category_2,category_3,size_oz,size_mL,size_g
1105,Watermelon Pink Juice Oil-Free Moisturizer Mini,Glow Recipe,Skincare/Mini Size,21.0,0.85 oz/ 25 mL,24498,3.6337,172,Skincare,Mini Size,no info,0.85,25.0,
482,Darling Eyeshadow Palette,Charlotte Tilbury,Makeup/Eye/Eye Palettes,57.0,0.19 oz/ 5.4 g,2758,4.0,1,Makeup,Eye,Eye Palettes,0.19,,5.4
1216,Mighty Night Overnight Renewal Supplement,HUM Nutrition,Bath & Body/Beauty Supplements,40.0,60 Vegan Softgels,1268,4.6524,164,Bath & Body,Beauty Supplements,no info,,,
2670,Stay All Day® Waterproof Liquid Eye Liner,stila,Makeup/Eye/Eyeliner,22.0,0.016 oz/ 0.5 mL,302455,4.3902,10466,Makeup,Eye,Eyeliner,0.016,0.5,
1253,Jelly Shot,INC.redible,Makeup/Lip/Lip Balm & Treatment,10.0,0.15 oz/ 4.4 g,32859,2.6614,189,Makeup,Lip,Lip Balm & Treatment,0.15,,4.4


In [14]:
df_process_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3082 entries, 0 to 3082
Data columns (total 14 columns):
product_name     3082 non-null object
brand            3082 non-null object
category         3082 non-null object
price            3082 non-null float64
size             3082 non-null object
love_count       3082 non-null int64
rating           3082 non-null float64
reviews_count    3082 non-null int64
category_1       3082 non-null object
category_2       3082 non-null object
category_3       3082 non-null object
size_oz          2086 non-null object
size_mL          1556 non-null object
size_g           336 non-null object
dtypes: float64(2), int64(2), object(10)
memory usage: 361.2+ KB
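The `info()` output shows the three size columns stored as `object`, since the regex returns strings. If numeric sizes are needed later, one option is to convert them with `pd.to_numeric`. This is a sketch on a made-up frame (`toy` is hypothetical).

```python
import pandas as pd

# Toy size columns as the regex would leave them: strings or None (illustrative values).
toy = pd.DataFrame({'size_oz': ['0.85', None, '0.016'],
                    'size_mL': ['25', None, '0.5']})

size_cols = ['size_oz', 'size_mL']

# Convert each column to float; missing entries become NaN.
toy[size_cols] = toy[size_cols].apply(pd.to_numeric, errors='coerce')

print(toy.dtypes)
```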


In [15]:
df_process_2.to_csv('cleaned_data.csv', index=False)