# Star Jeans Analysis Project Using Web Scraping

 - Business question

Star Jeans is the company. Eduardo and Marcelo are two Brazilians, friends and business partners. After several successful ventures, they plan to enter the US fashion market with an e-commerce business model.

The initial idea is to enter the market with a single product for a specific audience: jeans for men. The goal is to keep operating costs low and scale as they win customers.

However, even with the entry product and the audience defined, the two partners have no experience in the fashion market and therefore cannot define basics such as the price, the type of pants, and the material used to manufacture each piece.

So the two partners hired a Data Science consultancy to answer the following questions:

1. What is the best selling price for the pants?
2. How many types of pants, and in which colors, should the initial product have?
3. Which raw materials are needed to make the pants?

Star Jeans' main competitors are the American companies H&M and Macy's.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from datetime import datetime
import re


In [2]:
url = 'https://www2.hm.com/en_us/men/products/jeans.html'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
page = requests.get(url, headers=headers)

In [3]:
soup = BeautifulSoup(page.text, 'html.parser')

In [4]:
total_item = soup.find_all('h2', class_='load-more-heading')[0].get('data-total')
total_item

'93'

In [5]:
page_number = np.ceil(int(total_item)/36)
page_number

3.0

In [6]:
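url02 = url + '?page-size=' + str(int(page_number*36))
soup = BeautifulSoup(requests.get(url02, headers=headers).text, 'html.parser')  # re-parse the full listing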
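
The page arithmetic above can be checked in isolation. This is a minimal sketch, assuming the listing shows 36 items per page as in the cells above. `full_listing_url` is a hypothetical helper name, and the `page-size` query parameter is taken from the notebook itself, not from any documented H&M API.

```python
import math
from urllib.parse import urlencode

ITEMS_PER_PAGE = 36  # page size used by the notebook above


def full_listing_url(base_url, total_items, per_page=ITEMS_PER_PAGE):
    """Return a URL whose page-size covers every item in the listing."""
    # Round the page count up so a partially filled last page is included
    pages = math.ceil(int(total_items) / per_page)
    return base_url + '?' + urlencode({'page-size': pages * per_page})


# '93' is the data-total value shown in the notebook output
print(full_listing_url('https://www2.hm.com/en_us/men/products/jeans.html', '93'))
```

With 93 items the helper rounds up to 3 pages and requests a page size of 108, which matches the value built in `url02`.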

In [7]:
products = soup.find('ul', class_='products-listing small')

In [8]:
# product id
product_list = products.find_all('article', class_='hm-product-item')
product_id = [p.get('data-articlecode') for p in product_list]

# product category
product_category = [p.get('data-category') for p in product_list]

In [9]:
# product name
product_list = products.find_all('a', class_='link')
product_name = [p.get_text() for p in product_list]

In [10]:
# product price
product_list = products.find_all('span', class_='price regular')
product_price = [p.get_text() for p in product_list]



In [11]:
data = pd.DataFrame([product_id, product_category, product_name, product_price]).T
data.columns = ['product_id', 'product_category', 'product_name', 'product_price']

# Scrape datetime
data['scrapy_datetime'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S:')
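
Transposing a list of lists pads shorter lists with missing values, so a selector that matches a different number of elements can silently misalign the columns. The sketch below uses only the standard library and hypothetical sample values to check the lengths before combining the fields. `build_records` is an illustrative name, not part of the notebook.

```python
def build_records(**columns):
    """Combine equally long lists into one dict per product."""
    lengths = {name: len(values) for name, values in columns.items()}
    # Refuse to combine fields that were scraped with different counts
    if len(set(lengths.values())) > 1:
        raise ValueError(f'column lengths differ: {lengths}')
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# Hypothetical sample values in the shape produced by the cells above
records = build_records(
    product_id=['1024256001', '1024256006'],
    product_category=['men_jeans_slim', 'men_jeans_slim'],
)
print(records[0])
```

A list of dicts like this can be passed directly to `pd.DataFrame`, which avoids the transpose step.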

## One Product Details

In [12]:
url = 'https://www2.hm.com/en_us/productpage.1024256001.html'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
page = requests.get(url, headers=headers)

In [13]:
soup = BeautifulSoup(page.text, 'html.parser')

In [14]:
# Color Name 
product_list = soup.find_all('a', class_='filter-option miniature')
color_name = [p.get('data-color') for p in product_list]

# Product Id 
product_id = [p.get('data-articlecode') for p in product_list]

df_color = pd.DataFrame([product_id, color_name]).T
df_color.columns = ['product_id', 'color_name']

# Generate style id + color id 
df_color['style_id'] = df_color['product_id'].apply(lambda x: x[:-3])
df_color['color_id'] = df_color['product_id'].apply(lambda x: x[-3:])
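
The slicing above treats the last three characters of an article code as the color and everything before them as the style. A small standalone sketch of that split follows; `split_article_code` is a hypothetical helper name.

```python
def split_article_code(article_code):
    """Split an article code into (style_id, color_id)."""
    code = str(article_code)
    if len(code) <= 3:
        raise ValueError(f'article code too short: {code!r}')
    # Everything but the last three characters is the style; the rest is the color
    return code[:-3], code[-3:]


style_id, color_id = split_article_code('1024256001')
print(style_id, color_id)  # 1024256 001
```

Keeping both parts as strings preserves leading zeros. Casting `color_id` to int, as the cleaning step later does, turns `'006'` into `6`.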

In [15]:
# Composition
product_composition_list = soup.find_all('div', class_='details-attributes-list-item')
product_composition = [list(filter(None, p.get_text().split('\n'))) for p in product_composition_list]

# Use the first row as the column names
df_composition = pd.DataFrame(product_composition).T
df_composition.columns = df_composition.iloc[0]

# Drop the header row and forward-fill missing values
df_composition = df_composition.iloc[1:].ffill()

# Generate style id + color id 
df_composition['style_id'] = df_composition['Art. No.'].apply(lambda x: x[:-3])
df_composition['color_id'] = df_composition['Art. No.'].apply(lambda x: x[-3:])

# Merge Data color + composition 

data_sku = pd.merge(df_color, df_composition[['style_id', 'Fit', 'Composition']], how='left', on='style_id')

## Multiple Products Details

In [16]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}

# Empty dataframe
df_details = pd.DataFrame()

# Columns expected for every product
cols = ['Art. No.', 'Composition', 'Fit', 'Product safety', 'Size']
df_pattern = pd.DataFrame(columns=cols)

for i in range(len(data)):
    # Request the product page
    url = 'https://www2.hm.com/en_us/productpage.' + data.loc[i, 'product_id'] + '.html'
    page = requests.get(url, headers=headers)
    
    soup = BeautifulSoup(page.text, 'html.parser')
     
    # Color Name 
    product_list = soup.find_all('a', class_='filter-option miniature')
    color_name = [p.get('data-color') for p in product_list]

    # Product Id 
    product_id = [p.get('data-articlecode') for p in product_list]

    df_color = pd.DataFrame([product_id, color_name]).T
    df_color.columns = ['product_id', 'color_name']

    # Generate style id + color id 
    df_color['style_id'] = df_color['product_id'].apply(lambda x: x[:-3])
    df_color['color_id'] = df_color['product_id'].apply(lambda x: x[-3:])
    
    # Composition
    product_composition_list = soup.find_all('div', class_='details-attributes-list-item')
    product_composition = [list(filter(None, p.get_text().split('\n'))) for p in product_composition_list]

    # Use the first row as the column names
    df_composition = pd.DataFrame(product_composition).T
    df_composition.columns = df_composition.iloc[0]

    # Drop the header row and forward-fill missing values
    df_composition = df_composition.iloc[1:].ffill()

    # Guarantee the same set of columns for every product
    df_composition = pd.concat([df_pattern, df_composition], axis=0)

    # Generate style id + color id 
    df_composition['style_id'] = df_composition['Art. No.'].apply(lambda x: x[:-3])
    df_composition['color_id'] = df_composition['Art. No.'].apply(lambda x: x[-3:])

    # Merge Data color + composition 
    data_sku = pd.merge(df_color, df_composition[['style_id', 'Fit', 'Composition', 'Size', 'Product safety']], how='left', on='style_id')
    
    # All details products
    df_details = pd.concat([df_details, data_sku], axis = 0)
    
# Join Showroom data + details
data['style_id'] = data['product_id'].apply( lambda x: x[:-3] )
data['color_id'] = data['product_id'].apply( lambda x: x[-3:] )
    
df = pd.merge( data, df_details[['style_id', 'color_name', 'Fit', 'Composition', 'Size', 'Product safety']],
                    how='left', on='style_id' )
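
Because the final merge keys on `style_id` alone, each showroom row is paired with every detail row that shares its style, whatever the color. The sketch below uses plain Python lists and hypothetical data to show how the row count changes once `color_id` is added to the key. `left_join` is an illustrative stand-in for `pd.merge(..., how='left')`, not the pandas implementation.

```python
def left_join(left, right, keys):
    """Left join two lists of dicts on the given key names."""
    joined = []
    for left_row in left:
        matches = [r for r in right if all(left_row[k] == r[k] for k in keys)]
        if not matches:
            joined.append(dict(left_row))
        for right_row in matches:
            # Values from the left row win when both sides share a column
            joined.append({**right_row, **left_row})
    return joined


# Hypothetical data: one style sold in two colors
showroom = [
    {'style_id': '1024256', 'color_id': '001'},
    {'style_id': '1024256', 'color_id': '006'},
]
details = [
    {'style_id': '1024256', 'color_id': '001', 'color_name': 'color_a'},
    {'style_id': '1024256', 'color_id': '006', 'color_name': 'color_b'},
]

by_style = left_join(showroom, details, ['style_id'])
by_style_and_color = left_join(showroom, details, ['style_id', 'color_id'])
print(len(by_style), len(by_style_and_color))  # 4 2
```

Joining on both keys keeps one row per product and color, while joining on the style alone produces every color combination.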

## Data Cleaning

In [17]:
# Drop rows with a missing product_id and cast the column to int
df = df.dropna(subset=['product_id'])
df['product_id'] = df['product_id'].astype(int)

# Standardize product_name: lowercase with underscores
df['product_name'] = df['product_name'].apply(lambda x: x.replace(' ', '_').lower())

# Strip the '$ ' prefix from product_price and cast to float
df['product_price'] = df['product_price'].apply(lambda x: x.replace('$ ', '')).astype(float)

# style_id to int
df['style_id'] = df['style_id'].astype(int)

# color_id to int
df['color_id'] = df['color_id'].astype(int)

# Standardize color_name: lowercase with underscores
df['color_name'] = df['color_name'].apply(lambda x: x.replace(' ', '_').lower())

# Standardize Fit: lowercase with underscores
df['Fit'] = df['Fit'].apply(lambda x: x.replace(' ', '_').lower())

# size_number: extract the measurement that precedes 'cm'
df['size_number'] = df['Size'].apply(lambda x: re.search(r'\d{2}.\d{1}.cm.', x).group(0) if pd.notnull(x) else x)

# Keep only the leading digits, dropping the 'cm' suffix
df['size_number'] = df['size_number'].apply(lambda x: re.search(r'\d+', x).group(0) if pd.notnull(x) else x)

# size_model: extract the digits/digits size pattern
df['size_model'] = df['Size'].str.extract(r'(\d+/\d+)')

# Drop the Size column
df = df.drop(columns=['Size'])

# Drop the Product safety column
df = df.drop(columns=['Product safety'])
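
The price and size extraction above can also be written as small functions that are easy to test outside pandas. This is a minimal sketch under stated assumptions: the sample size string is hypothetical, the helper names `parse_price` and `parse_size` are illustrative, and unlike the notebook the length is returned as a float, not a digit string.

```python
import re

CM_PATTERN = re.compile(r'(\d+(?:\.\d+)?)\s*cm')  # a number followed by 'cm'
MODEL_PATTERN = re.compile(r'(\d+/\d+)')          # a digits/digits size


def parse_price(text):
    """Convert a price such as '$ 24.99' to a float."""
    return float(text.replace('$', '').strip())


def parse_size(text):
    """Return (length_cm, size_model); either part is None when absent."""
    if not isinstance(text, str):  # covers None and NaN
        return None, None
    cm = CM_PATTERN.search(text)
    model = MODEL_PATTERN.search(text)
    return (float(cm.group(1)) if cm else None,
            model.group(1) if model else None)


print(parse_price('$ 24.99'))
# Hypothetical Size text; the real page wording may differ
print(parse_size('Length: 67.0 cm (Size 31/32)'))
```

Returning `None` when a pattern is missing avoids the `AttributeError` that calling `.group(0)` on a failed `re.search` would raise.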

In [18]:
df

| | product_id | product_category | product_name | product_price | scrapy_datetime | style_id | color_id | color_name | Fit | Composition | size_number | size_model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1024256006 | men_jeans_slim | slim_jeans | 24.99 | 2023-05-25 13:05:08: | 1024256 | 6 | black | slim_fit | Shell: Cotton 99%, Spandex 1% | | |
| 1 | 1024256006 | men_jeans_slim | slim_jeans | 24.99 | 2023-05-25 13:05:08: | 1024256 | 6 | black | slim_fit | Pocket lining: Cotton 100% | | |
| 2 | 1024256006 | men_jeans_slim | slim_jeans | 24.99 | 2023-05-25 13:05:08: | 1024256 | 6 | black | slim_fit | Pocket lining: Cotton 100% | | |
| 3 | 1024256006 | men_jeans_slim | slim_jeans | 24.99 | 2023-05-25 13:05:08: | 1024256 | 6 | black | slim_fit | Pocket lining: Cotton 100% | | |
| 4 | 1024256006 | men_jeans_slim | slim_jeans | 24.99 | 2023-05-25 13:05:08: | 1024256 | 6 | black | slim_fit | Pocket lining: Cotton 100% | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8894 | 1166422005 | men_jeans_tapered | tapered_regular_crop_jeans | 34.99 | 2023-05-25 13:05:08: | 1166422 | 5 | denim_gray | regular_fit | Pocket lining: Polyester 65%, Cotton 35% | 67 | |
| 8895 | 1166422005 | men_jeans_tapered | tapered_regular_crop_jeans | 34.99 | 2023-05-25 13:05:08: | 1166422 | 5 | denim_gray | regular_fit | Pocket lining: Polyester 65%, Cotton 35% | 67 | |
| 8896 | 1166422005 | men_jeans_tapered | tapered_regular_crop_jeans | 34.99 | 2023-05-25 13:05:08: | 1166422 | 5 | denim_gray | regular_fit | Pocket lining: Polyester 65%, Cotton 35% | 67 | |
| 8897 | 1166422005 | men_jeans_tapered | tapered_regular_crop_jeans | 34.99 | 2023-05-25 13:05:08: | 1166422 | 5 | denim_gray | regular_fit | Pocket lining: Polyester 65%, Cotton 35% | 67 | |
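
To work toward the raw-materials question, the Composition strings shown above can be broken into a garment part and its material percentages. This is a minimal sketch, assuming every value follows the `Part: Material NN%, Material NN%` shape seen in the output. `parse_composition` is a hypothetical helper, not part of the notebook.

```python
import re

# A material name followed by an integer percentage, e.g. 'Cotton 99%'
MATERIAL_PATTERN = re.compile(r'([A-Za-z][A-Za-z ]*?)\s+(\d+)%')


def parse_composition(text):
    """Split 'Shell: Cotton 99%, Spandex 1%' into (part, {material: percent})."""
    part, _, materials = text.partition(':')
    if not materials:  # no 'Part:' prefix present
        part, materials = '', text
    shares = {name.strip(): int(pct)
              for name, pct in MATERIAL_PATTERN.findall(materials)}
    return part.strip(), shares


print(parse_composition('Shell: Cotton 99%, Spandex 1%'))
print(parse_composition('Pocket lining: Polyester 65%, Cotton 35%'))
```

Each material then becomes a numeric value that can be aggregated per product, which is harder to do with the raw string.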
