# Capstone: Sephora. Predicting Prices Based on Ingredients

## Problem description

Customers often assume that the price of a skin care product depends on its ingredients. The goal of my project is to see whether I can predict product prices from their ingredients. To accomplish this, I first had to gather my data, which I collected from Sephora.com. This notebook (Notebook 0) uses Selenium to collect the product URLs for each skin care category.

### Project Structure:
- Notebook 0. Selenium URL Collection
- Notebook 1. Saving data from URL to an HTML file
- Notebook 2. Collecting Product Data
- Notebook 3. Data Cleaning 
- Notebook 4. EDA
- Notebook 5. Fuzzy String Matching
- Notebook 6. Regression Modeling
- Notebook 7. Classification Modeling

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

In [2]:
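#credit for this code: https://www.hackerearth.com/fr/practice/notes/praveen97uma/crawling-a-website-that-loads-content-using-javascript-with-selenium-webdriver-in-python/
#this function makes the browser scroll down by sending PAGE_DOWN key presses to the page body
from selenium.webdriver.common.by import By

def scrollDown(driver, n_scroll):
    #find_element(By..., ...) replaces find_element_by_tag_name, which newer Selenium releases removed
    body = driver.find_element(By.TAG_NAME, "body")
    #press PAGE_DOWN exactly n_scroll times
    for _ in range(n_scroll):
        body.send_keys(Keys.PAGE_DOWN)
    return driver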
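
The scraping loop in cell `In [3]` calls this helper repeatedly and stops once the number of product links no longer grows. That stopping rule can be isolated and tried without a browser. The following is a minimal sketch under stated assumptions: `scroll_until_stable`, `FakePage`, and the `max_rounds` cap are hypothetical names introduced here for illustration, not part of the notebook.

```python
import time
from typing import Callable


def scroll_until_stable(scroll_once: Callable[[], None],
                        count_items: Callable[[], int],
                        pause: float = 0.0,
                        max_rounds: int = 50) -> int:
    """Scroll until the item count stops growing and return the final count.

    scroll_once and count_items are hypothetical callables supplied by the
    caller, e.g. a lambda wrapping scrollDown(driver, 20) and a function
    that counts the product links currently on the page.
    """
    previous = 0
    for _ in range(max_rounds):
        scroll_once()
        time.sleep(pause)  # give lazily loaded content time to appear
        current = count_items()
        if current == previous:  # nothing new loaded: assume the bottom was reached
            return current
        previous = current
    return previous  # gave up after max_rounds scroll attempts


class FakePage:
    """Stand-in for a page that reveals 3 more items per scroll, up to 7 in total."""

    def __init__(self) -> None:
        self.loaded = 0

    def scroll(self) -> None:
        self.loaded = min(self.loaded + 3, 7)

    def count(self) -> int:
        return self.loaded


page = FakePage()
final_count = scroll_until_stable(page.scroll, page.count)
print(final_count)  # 7
```

The cap on rounds is a design choice of this sketch: it keeps the loop from running forever if a page keeps loading new items.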

In [3]:
#Selenium 4 style imports used in this cell
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

chrome_path = "/Users/yelenanevel/Downloads/chromedriver"
#Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service = Service(chrome_path))

#this list represents the different skin care categories to collect
categories = ['moisturizing-cream-oils-mists',
             'cleanser',
             'facial-treatments',
             'eye-treatment-dark-circle-treatment',
             'facial-treatment-masks',
             'sunscreen-sun-protection',
             'lip-treatments']

#one dictionary per product link; the data frame is built once, after the loop
records = []

for category in categories: #looping through the different categories

    page_num = 1 #the first page

    while True:

        #the browser window needs to stay open and visible on the screen
        #for the following .get request to work properly
        url = 'https://www.sephora.com/shop/' + category + '?pageSize=300&currentPage=' + str(page_num)
        driver.get(url)
        time.sleep(20)

        #check to see if the page is empty; find_elements returns an empty list
        #(instead of raising) when the empty-page element is not there
        empty_page = driver.find_elements(By.CLASS_NAME, 'css-3a7b4s')
        if any(element.is_displayed() for element in empty_page):
            break

        #check to see if there is a pop-up window
        try:
            #exit the pop-up window
            xpath = '//*[@id="modalDialog"]/button'
            btn = driver.find_element(By.XPATH, xpath)
            btn.click()
            time.sleep(20)
        except WebDriverException:
            #no pop-up found (or it could not be clicked), so carry on
            pass

        #keep scrolling until no new product links are loaded
        old_len = 0
        while True:
            scrollDown(driver, 20) #scroll down the page
            time.sleep(10) #give it time to load
            slug = driver.find_elements(By.CLASS_NAME, 'css-ix8km1') #look for the product links
            new_len = len(slug)
            if old_len == new_len: #no new links appeared, so the bottom of the page was reached
                break
            old_len = new_len

        #no product links at all: stop instead of requesting further pages indefinitely
        if not slug:
            break

        #pull the href from every product link and pair it with the category name
        for a in slug:
            records.append({'category': category, 'URL': a.get_attribute('href')})

        #go to the next page
        page_num += 1

#build the data frame that holds all the categories and URLs
final_df = pd.DataFrame(records, columns = ['category', 'URL'])

#close the browser and end the chromedriver session
driver.quit()
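
The listing URL in the loop above is assembled by string concatenation from the category slug, a page size of 300, and the current page number. As a small sketch of the same construction, the query string can be built with `urllib.parse.urlencode`. The function name `category_page_url` and the constant `BASE_URL` are hypothetical names chosen for this example.

```python
from urllib.parse import urlencode

# Same base address as in the scraping loop.
BASE_URL = "https://www.sephora.com/shop/"


def category_page_url(category: str, page_num: int, page_size: int = 300) -> str:
    """Return the listing URL for one results page of a category."""
    query = urlencode({"pageSize": page_size, "currentPage": page_num})
    return f"{BASE_URL}{category}?{query}"


print(category_page_url("cleanser", 2))
# https://www.sephora.com/shop/cleanser?pageSize=300&currentPage=2
```

Keeping the URL logic in one function makes it easy to check the address for any category and page without opening a browser.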


In [4]:
#checking the structure of my final dataframe
final_df

,category,URL
0,moisturizing-cream-oils-mists,https://www.sephora.com/product/protini-tm-pol...
1,moisturizing-cream-oils-mists,https://www.sephora.com/product/the-water-crea...
2,moisturizing-cream-oils-mists,https://www.sephora.com/product/ultra-facial-c...
3,moisturizing-cream-oils-mists,https://www.sephora.com/product/your-skin-but-...
4,moisturizing-cream-oils-mists,https://www.sephora.com/product/the-dewy-skin-...
...,...,...
2763,lip-treatments,https://www.sephora.com/product/dual-nourishin...
2764,lip-treatments,https://www.sephora.com/product/butterstick-li...
2765,lip-treatments,https://www.sephora.com/product/lip-lock-primi...
2766,lip-treatments,https://www.sephora.com/product/kiss-mix-P4039...


In [5]:
#saving the final dataframe of URLs to a CSV file (the ./data folder must already exist)
final_df.to_csv('./data/product_urls2.csv', index = False)
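
After saving, it can be useful to confirm that every category actually made it into the file. The sketch below counts the saved rows per category. It uses the standard-library `csv` module rather than pandas only to keep the example dependency-free, and the function name `urls_per_category` is a hypothetical helper, not part of the notebook.

```python
import csv
from collections import Counter


def urls_per_category(csv_path: str) -> Counter:
    """Count the rows saved for each category in a category/URL CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)  # expects the header row: category,URL
        return Counter(row["category"] for row in reader)
```

Calling `urls_per_category('./data/product_urls2.csv')` would return one count per category. A category missing from the result would suggest that its pages were never scraped.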

In [None]:
#these URLs will be used to gather the required data about each product
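
Since the next notebook saves each product page to an HTML file, one possible way to name those files is to reuse the last path segment of the product URL. This is only a sketch under that assumption: `html_filename` is a hypothetical helper, and the URL in the example is made up for illustration rather than taken from the collected data.

```python
from urllib.parse import urlparse


def html_filename(product_url: str) -> str:
    """Build a local file name from the last path segment of a product URL."""
    path = urlparse(product_url).path           # drops the query string and fragment
    slug = path.rstrip("/").rsplit("/", 1)[-1]  # keep only the last path segment
    return f"{slug}.html"


# Made-up URL for illustration only, not a real product page.
print(html_filename("https://www.sephora.com/product/example-cream-P000000?skuId=1"))
# example-cream-P000000.html
```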