# DATA SCRAPING

To scrape about 60,000 product details from the e-commerce website, five main steps are necessary:
- Look at the main categories the products are divided into
- Find out how many overview pages there are per category
- Visit every overview page to get the product URL and the rating of each product (df_01)
- Scrape the product details from each product page and save them into a dataframe (df_02)
- Concatenate the two dataframes (df)

This is what the target data frame should look like:

| Column           | Description                        | Example                   |
|------------------|------------------------------------|---------------------------|
| item_nb          | Item number                        | 037414                    |
| brand            | Name of the brand                  | Clinique                  |
| product          | Name of the product                | Lash Power                |
| typ              | Type of the product                | Mascara                   |
| size             | Size of the product in ml          | 6ml                       |
| price            | Price in €                         | 18,39                     |
| category         | Hair, Face, Make-up, Body, Perfume | Make-up                   |
| scope            | Area of application                | Face                      |
| charesteristics  | Trending characteristics           | highly pigmented, shaping |
| effect           | Desired effect of the product      | refining                  |
| product_award    | Trending awards                    | perfume free              |
| age              | Recommended age group              | 20+                       |
| number_rating    | Number of people who rated it      | 55                        |
| rating           | Star rating 1-5                    | 4,5                       |
| url              | URL ID                             | 50000050                  |

In [3]:
import glob
import itertools
import pandas as pd
from helper_scraping import *

# GET OVERVIEW URLS

This function first determines the number of pages to iterate over for each category and then returns the URLs of those pages.

In [None]:
parfum_urls = category_urls('https://www.website.de/de/c/parfum/07')
make_up_urls = category_urls('https://www.website.de/de/c/make-up/08')
face_urls = category_urls('https://www.website.de/de/c/face/09')
body_urls = category_urls('https://www.website.de/de/c/body/10')
hair_urls = category_urls('https://www.website.de/de/c/hair/11')
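`category_urls` lives in `helper_scraping` and is not shown in this notebook. As a rough sketch of the second half of its job (building the overview-page URLs once the page count is known), something like the following could work. The function name `build_overview_urls` and the `?page=N` query parameter are assumptions for illustration, not the site's confirmed URL scheme.

```python
def build_overview_urls(category_url: str, n_pages: int) -> list[str]:
    """Return one URL per overview page of a category.

    Assumption: pages are addressed with a `?page=N` query parameter
    starting at 1. The real site may use a different scheme.
    """
    return [f"{category_url}?page={page}" for page in range(1, n_pages + 1)]


# Example with a made-up page count of 3.
example_urls = build_overview_urls("https://www.website.de/de/c/hair/11", 3)
print(example_urls[0])
print(len(example_urls))
```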

Put the URLs into a dataframe with pd.DataFrame and save it as a CSV file. Afterwards, check that every CSV file contains a realistic number of URLs.

In [None]:
df_hair_urls = pd.DataFrame({"url": hair_urls})
df_hair_urls.to_csv("df_category_hair_urls.csv")

In [5]:
csv_url_category = ["df_category_body_urls.csv", "df_category_parfum.csv", "df_category_face_urls.csv", "df_category_hair_urls.csv", "df_category_make_up_urls.csv"]

liste_all_categroy_urls = []

for i in csv_url_category:
    url_category = pd.read_csv(i)
    liste_url_category = url_category["url"]
    print(i + ", length: " + str(len(liste_url_category)))
    liste_all_categroy_urls.append(liste_url_category)

df_category_body_urls.csv, length: 185
df_category_parfum.csv, length: 151
df_category_face_urls.csv, length: 287
df_category_hair_urls.csv, length: 174
df_category_make_up_urls.csv, length: 302


To combine the URLs of all categories into a single list, use the chain function from itertools:

In [7]:
all_category_urls = list(itertools.chain.from_iterable(liste_all_categroy_urls))
len(all_category_urls)

1099

That gives 1099 overview pages with 55 products on most of them, so after scraping the details of each product the dataset will contain roughly 60,000 products.
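As a quick sanity check of that estimate, the page count can be multiplied by the number of products per page. The two per-page values below are the ones mentioned in this notebook (55 here, 50 in the next section). Which one applies to a given page has not been verified.

```python
n_pages = 1099

# Products per overview page as stated in this notebook.
estimates = {per_page: n_pages * per_page for per_page in (50, 55)}

for per_page, total in estimates.items():
    print(per_page, total)
# 50 54950
# 55 60445
```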

# GET PRODUCT URLS & PRODUCT RATINGS 

Every overview page lists 50 products. Scraping the URL of each product page makes it possible to scrape the product details afterwards. The rating has to be scraped in the same step, because it is only shown on the overview page and not on the product page.
The product ID makes the product accessible through its link (href link = /de/z/2090105322).

This function returns the cleaned rating, the number of ratings, and the URL of every product.

In [None]:
df_01 = id_rating_cleaner(all_category_urls)
df_01.to_csv("df_01.csv")
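`id_rating_cleaner` is defined in `helper_scraping` and is not shown here. The sketch below only illustrates the kind of cleaning it might involve: pulling the product ID out of an href like the one above, and turning a rating written with a decimal comma into a number. The function names `product_id_from_href` and `clean_rating` are hypothetical.

```python
import re


def product_id_from_href(href: str) -> str:
    """Extract the numeric product ID from a link such as /de/z/2090105322."""
    match = re.search(r"/z/(\d+)", href)
    if match is None:
        raise ValueError(f"no product ID found in {href!r}")
    return match.group(1)


def clean_rating(raw: str) -> float:
    """Convert a rating written with a decimal comma ('4,5') to a float."""
    return float(raw.strip().replace(",", "."))


print(product_id_from_href("/de/z/2090105322"))  # 2090105322
print(clean_rating("4,5"))  # 4.5
```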

In [8]:
df_01 = pd.read_csv("df_01.csv")
product_urls = df_01["url"].to_list()
df_01.shape

(52116, 6)

# GET ALL DETAILS 

This function calls several other functions to extract the individual details from a product page and stores them in a dataframe.
Because I want to avoid overloading the server I'm scraping from (and to be able to stop now and then), I save the product details after every 50 scraped products and additionally work with sleep timers.

In [None]:
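start = 0  # set to the last saved offset to resume an interrupted run

# Scrape in batches of 50 and save each batch, so progress survives interruptions.
# len(product_urls) as the upper bound makes sure no URLs at the end are skipped.
for i in range(start, len(product_urls), 50):
    product_urls_list = product_urls[i:i+50]
    df_2 = get_data(product_urls_list)
    df_2.to_csv('df_2_{0}.csv'.format(i))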
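The sleep timers themselves sit inside `get_data`, which is not shown. A minimal sketch of the general pattern (fixed-size batches, a pause after each request, and a resumable start offset) might look as follows. `scrape_politely`, `batches`, and the stand-in `fetch` function are hypothetical names, and the demo makes no network requests.

```python
import time
from typing import Callable, Iterator


def batches(items: list, size: int, start: int = 0) -> Iterator[tuple[int, list]]:
    """Yield (offset, chunk) pairs so a run can be resumed from any offset."""
    for offset in range(start, len(items), size):
        yield offset, items[offset:offset + size]


def scrape_politely(urls: list[str], fetch: Callable[[str], dict],
                    size: int = 50, pause: float = 1.0,
                    start: int = 0) -> Iterator[tuple[int, list[dict]]]:
    """Fetch URLs batch by batch, sleeping `pause` seconds after each request."""
    for offset, chunk in batches(urls, size, start):
        rows = []
        for url in chunk:
            rows.append(fetch(url))
            time.sleep(pause)  # give the server a break between requests
        yield offset, rows  # the caller can save each batch to its own file


# Demo with a stand-in fetch function instead of a real HTTP request.
fake_urls = [f"/de/z/{n}" for n in range(120)]
results = list(scrape_politely(fake_urls, lambda u: {"url": u}, size=50, pause=0.0))
print([(offset, len(rows)) for offset, rows in results])
# [(0, 50), (50, 50), (100, 20)]
```

Yielding the offset together with the rows mirrors the file naming used above, so an interrupted run can continue from the last saved offset.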

# GET A DATAFRAME 

Concatenate everything and save it as a CSV file.

In [None]:
# Read every batch file and concatenate them in a single step.
batch_files = glob.glob(r'C:\Users\charlotte\df_02_50\*.csv')
df_2 = pd.concat([pd.read_csv(file_name) for file_name in batch_files])

In [None]:
frames = [df_01, df_2]  # df_01: URLs and ratings, df_2: product details
df = pd.concat(frames)
df.to_csv("df.csv")
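Note that `pd.concat` with its default `axis=0` stacks the two frames on top of each other. If the goal is one row per product that holds both the rating and the detail columns, a key-based join is a possible alternative. The following is a minimal sketch, assuming both frames carry a `url` column with the same product ID. The toy values are placeholders, not scraped data.

```python
import pandas as pd

# Toy stand-ins for the ratings frame and the product-details frame.
ratings = pd.DataFrame({"url": [50000050, 50000051, 50000052],
                        "rating": [4.5, 3.0, 5.0],
                        "number_rating": [55, 12, 7]})
details = pd.DataFrame({"url": [50000050, 50000052],
                        "brand": ["Clinique", "Example brand"],
                        "product": ["Lash Power", "Example product"]})

# A left join keeps every detail row and attaches its rating columns.
# validate="many_to_one" raises if a url appears more than once in ratings.
merged = details.merge(ratings, on="url", how="left", validate="many_to_one")
print(merged.shape)  # (2, 5)
```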

In [17]:
df = pd.read_csv("df.csv")
len(df["item_nb"])

61598

The dataframe df contains details from 61,598 product pages. In the next step I will clean the data and look for first insights. (Notebook: Cleaning_data_first_insights)