# Getting Skala products ingredient lists
This notebook looks up the ingredient lists of Skala cosmetics products and saves them to a file so that they can be analyzed later (see `"hair_products_classification.ipynb"`).

The web pages of the hair products are retrieved with `requests` and `urllib.request`; the useful information is then extracted with the help of `beautiful soup` and saved into a dataframe. Products are then assigned to categories such as "shampoo" and "conditioner", and finally the dataframe is saved to a file.

First, we do all relevant imports and set up an empty dataframe for all the products' info.

In [2]:
import bs4
import urllib.request
import pandas as pd
import requests
import numpy as np
#import re

In [3]:
columns=['Brand','Product_name','Ingredients']
products_df=pd.DataFrame(columns=columns)

In [4]:
#def get_ingredients(raw_ingredients):
   #raw_ingredients=soup.find(id="produto-texto-composicao").string
#   ingredients=re.sub('\s{2,}',' ',raw_ingredients)
#   ingredients=re.sub('\.|\\\\[a-zA-Z]','',ingredients) 
#   ingredients_lower = ingredients.lower()
#   ingredients_list=ingredients_lower.split(',') # tokenize ingredients
#   ingredients_list=[re.sub('^[ \t]+|[ \t]+$','',ing) for ing in ingredients_list]
#   return ingredients_list
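
The commented-out `get_ingredients` helper above outlines how a raw ingredients string could be tokenized. A runnable variant might look like the sketch below. The function name `tokenize_ingredients`, the sample string and the choice to drop empty tokens are assumptions made for illustration, not part of the original notebook.

```python
import re


def tokenize_ingredients(raw_ingredients):
    """Split a raw ingredients string into a list of clean, lower-case names."""
    # collapse runs of two or more whitespace characters into a single space
    text = re.sub(r'\s{2,}', ' ', raw_ingredients)
    # drop periods and literal backslash escapes (a backslash followed by a letter)
    text = re.sub(r'\.|\\[a-zA-Z]', '', text)
    # tokenize on commas and trim spaces/tabs around each ingredient
    tokens = [token.strip(' \t') for token in text.lower().split(',')]
    # assumption: empty tokens (e.g. from a trailing comma) are not wanted
    return [token for token in tokens if token]


# made-up sample; the doubled backslashes produce literal "\r\n" text
sample = "\\r\\n      Aqua,  Cetyl Alcohol, Parfum."
print(tokenize_ingredients(sample))
```

The result is a Python list, so it would need to be joined back into a string (or stored in a separate structure) before being placed in a dataframe column.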

By examining the website's HTML we found the page where the products are listed. This page can filter products by use, line, needs, etc. We set only the category (`codigoCategoria`) to 1, which refers to hair products, and leave the other filters at 0.

The product list is also split into pages, which are loaded sequentially at the user's request, so we need a while loop to get the entire list and not just the first part. The loop stops when the response body is `0`.

From this product list page we save only the links to the individual product pages, where we are going to look for their ingredients. The links are repeated multiple times in the page, so we pass our link list through a dictionary to remove duplicates while keeping their order.

In [5]:
# the product list is divided into pages, so we loop over all of them to get the entire list
products_page = "https://www.skala.com.br/listaProdutos.php"
data = {
    'pagina': 1,            # page number, incremented after every request
    'codigoLinha': 0,
    'codigoGrupo': 0,
    'codigoCategoria': 1,   # 1 = hair products
    'codigoTipoCabelo': 0,
}
scroll = True
html = ''
while scroll:
    r = requests.get(products_page, params=data)
    data['pagina'] += 1
    if r.text == '0':       # a body of '0' marks the end of the list
        scroll = False
    else:
        html += r.text

# name the parser explicitly so the result does not depend on what is installed
soup = bs4.BeautifulSoup(html, 'html.parser')

# extract all links (skipping <a> tags without href) and remove duplicates
# by passing through a dictionary, which preserves the original order
links = list(dict.fromkeys([l['href'] for l in soup.find_all('a', href=True)]))
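
The pagination and de-duplication logic can be tried offline, without contacting the website. The sketch below uses only the standard library; `fake_pages`, `collect_pages`, `unique_links` and `LinkCollector` are hypothetical names introduced for illustration, and the stub pages are made up rather than taken from the real site.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without an href
                self.hrefs.append(href)


def collect_pages(fetch_page, first_page=1, end_marker="0"):
    """Concatenate page bodies until fetch_page returns end_marker."""
    chunks = []
    page = first_page
    while True:
        body = fetch_page(page)
        if body == end_marker:
            break
        chunks.append(body)
        page += 1
    return "".join(chunks)


def unique_links(html_text):
    """Return hrefs without duplicates, keeping first-seen order."""
    parser = LinkCollector()
    parser.feed(html_text)
    return list(dict.fromkeys(parser.hrefs))


# hypothetical stand-in for the real endpoint: two pages, then "0"
fake_pages = {
    1: '<a href="/p/1">A</a><a href="/p/1">A again</a>',
    2: '<a href="/p/2">B</a><a>no href</a>',
}
demo_html = collect_pages(lambda n: fake_pages.get(n, "0"))
demo_links = unique_links(demo_html)
print(demo_links)
```

`dict.fromkeys` is used instead of `set` because dictionaries keep insertion order, so the links stay in the order in which they appear on the page.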

Now we visit each product page and, with the help of `beautiful soup`, look for the product name, its brand and its ingredients (the element with id "produto-texto-composicao"). Products without an ingredient list are skipped; the rest are saved into our dataframe.

In [6]:
rows = []
for link in links:
    # note: str() on the raw bytes keeps escapes such as \xe9 and \r\n as literal text
    webpage = str(urllib.request.urlopen(link).read())
    soup = bs4.BeautifulSoup(webpage, 'html.parser')

    # the page title has the form "<product name> - <brand>"
    title = soup.title.string.split(' - ')
    product_name = title[0]
    brand = title[1]

    # skip pages that have no ingredient list
    ingredients_tag = soup.find(id="produto-texto-composicao")
    if ingredients_tag is None or ingredients_tag.string is None:
        continue

    # keep the raw string: lists are awkward to store in a dataframe cell
    rows.append([brand, product_name, ingredients_tag.string])

# build the dataframe once instead of concatenating inside the loop
products_df = pd.DataFrame(rows, columns=columns)

products_df.head()

| | Brand | Product_name | Ingredients |
|---|---|---|---|
| 0 | Skala Cosm\xe9ticos | Shampoo Shampoo Camomila | \r\n Aqua, Sodium Methyl 2-Sulf... |
| 1 | Skala Cosm\xe9ticos | Shampoo Skala Genetiqs | \r\n Aqua, Sodium Laureth Sulfa... |
| 2 | Skala Cosm\xe9ticos | Shampoo #MaisLisos | \r\n Aqua, Sodium Laureth Sulfa... |
| 3 | Skala Cosm\xe9ticos | Shampoo Abacate | \r\n Aqua, Sodium Laureth Sulfa... |
| 4 | Skala Cosm\xe9ticos | Shampoo Crespinho Divino | \r\n Aqua, Sodium Methyl 2-Sulf... |
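
The literal `\xe9` and `\r\n` sequences in the output above most likely come from calling `str()` on the raw bytes returned by `urllib.request`, which yields the bytes' printable representation rather than decoded text. The sketch below contrasts the two. The helper `page_text` is hypothetical, and the ISO-8859-1 encoding is an assumption inferred from the single-byte `\xe9`; the site's real encoding should be checked before relying on it.

```python
def page_text(raw_bytes, encoding="iso-8859-1"):
    """Decode raw page bytes into text (the default encoding is an assumption)."""
    return raw_bytes.decode(encoding)


# simulate the bytes a server would send for a short piece of text
raw = "Skala Cosméticos\r\n".encode("iso-8859-1")

as_repr = str(raw)        # printable representation: escapes become literal text
as_text = page_text(raw)  # decoded text: real accented letter and line break

print(as_repr)
print(repr(as_text))
```

Another option is to fetch the product pages with `requests` as well, since `Response.text` returns the body already decoded to a string.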


Based on the product name string we can assign each product a category, so we add the relevant columns to the dataframe: one boolean column per category, plus a `Product_type` column holding the category name.

In [14]:
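# assign product category: English label -> Portuguese keyword searched for in the product name
categories = {'Shampoo': 'shampoo', 'Conditioner': 'condicionador', 'Mask': 'creme de tratamento',
              'Comb_cream': 'creme para pentear', 'Gel': 'gel', 'Beard': 'barba'}
# object dtype so the column can hold category names; unmatched products stay NaN
products_df['Product_type'] = pd.Series(np.nan, index=products_df.index, dtype='object')

for key, value in categories.items():
    # boolean flag: does the lower-cased product name contain the keyword?
    products_df[key] = products_df['Product_name'].str.lower().str.contains(value, regex=False, na=False)
    # if several keywords match, the last matching category in the dictionary wins
    products_df.loc[products_df[key], 'Product_type'] = key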

products_df.head()

| | Brand | Product_name | Ingredients | Shampoo | Conditioner | Mask | Comb_cream | Gel | Beard | Product_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Skala Cosm\xe9ticos | Shampoo Shampoo Camomila | \r\n Aqua, Sodium Methyl 2-Sulf... | True | False | False | False | False | False | Shampoo |
| 1 | Skala Cosm\xe9ticos | Shampoo Skala Genetiqs | \r\n Aqua, Sodium Laureth Sulfa... | True | False | False | False | False | False | Shampoo |
| 2 | Skala Cosm\xe9ticos | Shampoo #MaisLisos | \r\n Aqua, Sodium Laureth Sulfa... | True | False | False | False | False | False | Shampoo |
| 3 | Skala Cosm\xe9ticos | Shampoo Abacate | \r\n Aqua, Sodium Laureth Sulfa... | True | False | False | False | False | False | Shampoo |
| 4 | Skala Cosm\xe9ticos | Shampoo Crespinho Divino | \r\n Aqua, Sodium Methyl 2-Sulf... | True | False | False | False | False | False | Shampoo |
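
The keyword matching used for the category assignment can be checked on individual names without a dataframe. The plain-Python sketch below mirrors the loop's behaviour, including the fact that a later category in the dictionary overwrites an earlier one when several keywords match. The function name `assign_category` and the sample product names are hypothetical.

```python
categories = {'Shampoo': 'shampoo', 'Conditioner': 'condicionador',
              'Mask': 'creme de tratamento', 'Comb_cream': 'creme para pentear',
              'Gel': 'gel', 'Beard': 'barba'}


def assign_category(product_name, categories):
    """Return the last category whose keyword occurs in the name, else None."""
    lowered = product_name.lower()
    matched = None
    for key, keyword in categories.items():
        if keyword in lowered:
            matched = key  # later matches overwrite earlier ones
    return matched


# made-up product names for illustration
for name in ["Shampoo Abacate", "Kit Shampoo + Condicionador", "Creme Para Pentear Abacate"]:
    print(name, "->", assign_category(name, categories))
```

Because the match is a plain substring test, a keyword that happens to appear inside a longer word would also match, which is worth keeping in mind when reading the `Product_type` column.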


Finally, we save the dataframe to a file, which we will read and analyse in the `"hair_products_classification.ipynb"` notebook to gain insight into the composition of the products.

In [9]:
products_df.to_csv('datasets/Skala_hair_products.csv')