# Capstone: Sephora. Predicting Prices Based on Ingredients

## Problem description

Customers often assume that the price of a skin care product depends on its ingredients. The goal of my project is to see whether I can predict product prices from their ingredients. To do that, I first had to gather the data, which I collected from Sephora.com.

### Project Structure:
- Notebook 0. Selenium URL Collection
- Notebook 1. Saving data from URL to an HTML file
- Notebook 2. Collecting Product Data
- Notebook 3. Data Cleaning 
- Notebook 4. EDA
- Notebook 5. Fuzzy String Matching
- Notebook 6. Regression Modeling
- Notebook 7. Classification Modeling

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import csv

In [2]:
#reading in the urls gathered in 01_getting_soups.ipynb
urls = pd.read_csv('./data/product_urls2.csv')
urls

Unnamed: 0,category,URL
0,moisturizing-cream-oils-mists,https://www.sephora.com/product/protini-tm-pol...
1,moisturizing-cream-oils-mists,https://www.sephora.com/product/the-water-crea...
2,moisturizing-cream-oils-mists,https://www.sephora.com/product/ultra-facial-c...
3,moisturizing-cream-oils-mists,https://www.sephora.com/product/your-skin-but-...
4,moisturizing-cream-oils-mists,https://www.sephora.com/product/the-dewy-skin-...
...,...,...
2763,lip-treatments,https://www.sephora.com/product/dual-nourishin...
2764,lip-treatments,https://www.sephora.com/product/butterstick-li...
2765,lip-treatments,https://www.sephora.com/product/lip-lock-primi...
2766,lip-treatments,https://www.sephora.com/product/kiss-mix-P4039...


In [3]:
#opening a new csv file with headers
header = ['name', 'brand', 'category', 'price', 'ingredients', 'no_reviews', 'hearts', 'size1', 'size2', 'url']

with open('./data/product_info.csv', "w", newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerow(header) # write the header
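
The header written above fixes the column order for every row appended later. As a minimal sketch of the same idea, assuming the standard library's `csv.DictWriter` is acceptable here, each value can be tied to its column name so that a product missing a field still lines up with the header. The helper `write_rows`, the in-memory buffer, and the sample row are hypothetical and only for illustration.

```python
import csv
import io

HEADER = ['name', 'brand', 'category', 'price', 'ingredients',
          'no_reviews', 'hearts', 'size1', 'size2', 'url']

def write_rows(handle, rows, header=HEADER):
    """Write a header line followed by one CSV line per dict in `rows`."""
    # restval='' fills any column that a row dict does not provide
    writer = csv.DictWriter(handle, fieldnames=header, restval='')
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# made-up row for illustration; an in-memory buffer stands in for the CSV file
buffer = io.StringIO(newline='')
write_rows(buffer, [{'name': 'Example Cream',
                     'brand': 'Example Brand',
                     'price': '$10.00'}])
csv_text = buffer.getvalue()
```

With a real file, `handle` would be the object returned by `open(..., "w", newline='')`, as in the cell above.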

In [6]:
for i in range(1629, 2768):  # resuming a partial run; use `for i in urls.index:` to process every page
    # path to the HTML file saved for this URL
    path = "./data/soups/soup" + str(i) + ".html"
    # read the saved page; the context manager closes the file handle
    with open(path, 'rb') as f:
        # naming the parser explicitly avoids bs4's "no parser specified" warning
        soup = BeautifulSoup(f.read(), 'html.parser')
    # gather the data from the page
    try:
        product = {}
        product['name'] = soup.find('span', {'class': 'css-0'}).text
        product['brand'] = soup.find('span', {'class': 'css-euydo4'}).text
        product['category'] = urls.category[i]
        product['price'] = soup.find('div', {'class': 'css-slwsq8'}).text
        # stored as raw HTML; it will be split on '<br/>' during cleaning
        product['ingredients'] = soup.find('div', {'id': 'tabpanel2'})
        product['no_reviews'] = soup.find_all('span', {'class': 'css-2rg6q7'})[0].text
        product['hearts'] = soup.find('span', {'data-at': 'product_love_count'}).text
        # the product size has two different formats on the pages
        product['size1'] = soup.find('div', {'class': 'css-v7k1z0'}).text
        try:
            product['size2'] = soup.find('span', {'class': 'css-ng5oyv'}).text
        except AttributeError:  # find() returned None: no second size element
            product['size2'] = '0'
        product['url'] = urls.URL[i]

        # append this product as one row to the CSV file created above
        product_df = pd.DataFrame([product])
        product_df.to_csv('./data/product_info.csv', mode='a', index=False, header=False)

    except Exception:  # a required element is missing, e.g. the page no longer existed when its HTML was saved
        continue
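
The `ingredients` column is saved as the raw HTML of the `tabpanel2` element and still needs to be broken apart during cleaning. A minimal sketch of that step, assuming the ingredient blocks are separated by `<br>` tags as the loop's comment suggests: the helper `split_ingredients` and the sample markup are hypothetical, and real pages may need extra handling.

```python
import html
import re

def split_ingredients(raw_html):
    """Split a saved ingredients panel into plain-text chunks, one per <br> break."""
    # break on <br>, <br/> or <br /> in any letter case
    chunks = re.split(r'<br\s*/?>', raw_html, flags=re.IGNORECASE)
    cleaned = []
    for chunk in chunks:
        text = re.sub(r'<[^>]+>', '', chunk)  # drop any remaining tags
        text = html.unescape(text).strip()     # decode entities such as &amp;
        if text:                               # skip empty pieces between breaks
            cleaned.append(text)
    return cleaned

# made-up markup for illustration, not copied from a real product page
sample = ('<div id="tabpanel2"><div>-Example Active: Soothes.<br/><br/>'
          'Water, Glycerin, Shea Butter &amp; Oil</div></div>')
parts = split_ingredients(sample)
```

Applied to the saved CSV, the function would be mapped over the `ingredients` column after reading it back as strings.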
   
    

