## Business Question

The company Star Jeans! Eduardo and Marcelo are two Brazilians, friends and business partners. After several successful ventures, they are planning to enter the US fashion market with an e-commerce business model.

The initial idea is to enter the market with a single product for a specific audience: jeans for men. The goal is to keep operating costs low and to scale as they acquire customers.

However, even with the entry product and the audience defined, the two partners have no experience in the fashion market and therefore do not know how to define basics such as the price, the type of pants, and the material used to make each piece.

The two partners therefore hired a Data Science consultancy to answer the following questions:

1. What is the best selling price for the pants?

2. How many types of pants, and in which colors, should the initial product include?

3. Which raw materials are needed to make the pants?

Note: Star Jeans' main competitors in the American market are H&M and Macy's.

# Cleaning and defining the granularity

# Import Libraries

In [1]:
import re #regex Library
import warnings
import inflection

import numpy             as np
import pandas            as pd
import seaborn           as sns
import matplotlib.pyplot as plt

from IPython.display import Image, HTML, display


warnings.filterwarnings( 'ignore' )

## Helper Functions

In [2]:
def jupyter_settings():
    #%matplotlib notebook
    #%pylab inline
    
    plt.style.use( 'bmh' )
    plt.rcParams['figure.figsize'] = [25, 12]
    plt.rcParams['font.size'] = 24
    
    display( HTML( '<style>.container { width:100% !important; }</style>') )
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option( 'display.expand_frame_repr', False )
    
    sns.set()
    
jupyter_settings()

## Load Data

In [3]:
df = pd.read_csv('../data/data_raw_size_all.csv')

In [4]:
df.sample()

Unnamed: 0.1,Unnamed: 0,product_id,product_category,product_name,product_price,scrapy_datetime,style_id,color_id,color_name,fit,composition,size
3399,3399,938875012,men_jeans_slim,Slim Tapered Jeans,$ 39.99,2022-04-19 13:17:23,938875,12,Light denim blue,Slim fit,"Cotton 99%, Spandex 1%","The model is 187cm/6'2"" and wears a size 31/32"


# Rename columns

In [5]:
data = df.copy()

In [6]:
data.columns

Index(['Unnamed: 0', 'product_id', 'product_category', 'product_name',
       'product_price', 'scrapy_datetime', 'style_id', 'color_id',
       'color_name', 'fit', 'composition', 'size'],
      dtype='object')

# Clean the DataFrame

In [7]:
#data.shape
data.dtypes

Unnamed: 0           int64
product_id           int64
product_category    object
product_name        object
product_price       object
scrapy_datetime     object
style_id             int64
color_id             int64
color_name          object
fit                 object
composition         object
size                object
dtype: object

In [8]:
data.isna().sum()

Unnamed: 0             0
product_id             0
product_category       0
product_name           0
product_price          0
scrapy_datetime        0
style_id               0
color_id               0
color_name             1
fit                 1034
composition         1034
size                1034
dtype: int64

In [9]:
data = data.drop( columns=['Unnamed: 0'] )

In [10]:
#df.isna().sum()
data.shape

(5610, 11)

In [11]:
# product id
data = data.dropna( subset=['product_id'] )
data['product_id'] = data['product_id'].astype( int )

# product name
data['product_name'] = data['product_name'].apply( lambda x: x.replace( ' ', '_' ).lower())

# product price
data['product_price'] = data['product_price'].apply(lambda x: x.replace( '$ ', '' )).astype(float)

# scrapy datetime
data['scrapy_datetime'] = pd.to_datetime(data['scrapy_datetime'],format='%Y-%m-%d %H:%M:%S' )

# style id
data['style_id'] = data['style_id'].astype( int )

# color id
data['color_id'] = data['color_id'].astype( int )

#color name
data['color_name'] = data['color_name'].apply( lambda x: x.replace(' ', '_' ).replace( '/', '_' ).lower() if pd.notnull( x ) else x )

# fit
data['fit'] = data['fit'].apply( lambda x: x.replace( ' ', '_' ).lower() if pd.notnull( x ) else x )
# when a column contains null values, guard the transformation with pd.notnull
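
The string clean-up applied in this cell can be sketched in plain Python, which makes each rule easy to check on a single value. The helper names `parse_price` and `normalize_label` are hypothetical and not part of the notebook; the sample values come from the `df.sample()` output above.

```python
import math

def parse_price(raw):
    """Turn a scraped price such as '$ 39.99' into a float."""
    return float(raw.replace('$', '').strip())

def normalize_label(raw):
    """Lower-case a label and replace spaces and slashes with underscores.

    Missing values (None or NaN) are returned unchanged, mirroring the
    pd.notnull guard used in the notebook.
    """
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return raw
    return raw.replace(' ', '_').replace('/', '_').lower()

print(parse_price('$ 39.99'))
print(normalize_label('Light denim blue'))
```

Either function could be passed to `Series.apply` on the corresponding column.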

In [12]:
# size number: model height in cm, e.g. "187cm" -> 187.0
data['size_number'] = data['size'].str.extract( r'(\d{3})cm', expand=False ).astype( float )

# size model: size worn by the model, e.g. "size 31/32"
data['size_model'] = data['size'].str.extract( r'(size\s\d+.\d+|size\s\w+)', expand=False )

# drop the original column
data = data.drop( columns=['size'] )
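
To see what the two patterns capture, here is a minimal plain-Python sketch applied to the `size` text from the sample row above. The function name `parse_size` is hypothetical, and the size pattern assumes a literal `/` between the two numbers, which is slightly stricter than the `.` used in the notebook.

```python
import re

HEIGHT_RE = re.compile(r'(\d{3})cm')
SIZE_RE = re.compile(r'size\s\d+/\d+|size\s\w+')

def parse_size(text):
    """Return (height_cm, size_label) from a model description.

    Either element is None when its pattern is not found, and both are
    None when the input is not a string (for example NaN).
    """
    if not isinstance(text, str):
        return None, None
    height = HEIGHT_RE.search(text)
    size = SIZE_RE.search(text)
    return (
        float(height.group(1)) if height else None,
        size.group(0) if size else None,
    )

sample = 'The model is 187cm/6\'2" and wears a size 31/32'
print(parse_size(sample))  # (187.0, 'size 31/32')
```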

In [23]:
# composition
# drop rows whose composition describes pocket, lining, or shell parts, or holds a safety warning
data = data[~data['composition'].str.contains( 'Pocket lining:', na=False )]
data = data[~data['composition'].str.contains( 'Pocket:', na=False )]
data = data[~data['composition'].str.contains( 'Lining:', na=False )]
data = data[~data['composition'].str.contains( 'Shell:', na=False )]
data = data[~data['composition'].str.contains( '"FOR CHILD’S SAFETY, GARMENT SHOULD FIT SNUGLY. THIS GARMENT IS NOT FLAME RESISTANT. LOOSE-FITTING GARMENT IS MORE LIKELY TO CATCH FIRE."', na=False )]


# drop duplicates
data = data.drop_duplicates(subset=['product_id', 'product_category', 'product_name', 'product_price',
                                    'scrapy_datetime', 'style_id', 'color_id', 'color_name', 'fit'])
#data = data.drop_duplicates()

# reset index
data = data.reset_index( drop=True )

# break composition by comma
df1 = data['composition'].str.split( ',', expand=True )

# reference frame: cotton | spandex | polyester
df_ref = pd.DataFrame( index=np.arange( len( data ) ), columns=['cotton', 'spandex', 'polyester'] )

# cotton
df_cotton = df1[0]
df_cotton.name = 'cotton'
df_ref = pd.concat( [df_ref, df_cotton ], axis=1 )
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last')]
df_ref['cotton'] = df_ref['cotton'].fillna('Cotton 0%')

# spandex
df_spandex = df1.loc[df1[1].str.contains( 'Spandex', na=True ), 1]
df_spandex.name = 'spandex'
df_ref = pd.concat( [df_ref, df_spandex], axis=1 )
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last') ]
df_ref['spandex'] = df_ref['spandex'].fillna('Spandex 0%')

# polyester
df_polyester = df1.loc[df1[1].str.contains( 'Polyester', na=True ), 1]
df_polyester.name = 'polyester'
df_ref = pd.concat( [df_ref, df_polyester], axis=1 )
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last') ]
df_ref['polyester'] = df_ref['polyester'].fillna('Polyester 0%')
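
The cell above relies on the position of each material after the comma split: the first piece is treated as cotton and only the second piece is searched for spandex or polyester. As a possible alternative, the sketch below keys on the material name instead of its position. The function `parse_composition` and the `MATERIALS` tuple are hypothetical names, and keeping only the first occurrence of a material is an assumption made for this illustration.

```python
import re

MATERIALS = ('cotton', 'spandex', 'polyester')
PART_RE = re.compile(r'([A-Za-z]+)\s+(\d+)%')

def parse_composition(text):
    """Map each tracked material to its share (0.0 to 1.0).

    Materials absent from the text get 0.0. If a material appears more
    than once, only its first occurrence is kept.
    """
    shares = dict.fromkeys(MATERIALS, 0.0)
    if not isinstance(text, str):
        return shares
    seen = set()
    for name, percent in PART_RE.findall(text):
        key = name.lower()
        if key in shares and key not in seen:
            shares[key] = int(percent) / 100
            seen.add(key)
    return shares

print(parse_composition('Cotton 99%, Spandex 1%'))
```

Applied row by row, the returned dictionaries could be expanded into the `cotton`, `spandex` and `polyester` columns directly as numbers, without the intermediate `'Cotton 0%'` strings.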




In [24]:
# final join
data = pd.concat( [data, df_ref], axis=1 )
data.head()

Unnamed: 0,product_id,product_category,product_name,product_price,scrapy_datetime,style_id,color_id,color_name,fit,composition,size_number,size_model,cotton,spandex,polyester,cotton.1,spandex.1,polyester.1,cotton.2,spandex.2,polyester.2
0,1024256001,men_jeans_slim,slim_jeans,19.99,2022-04-19 13:17:23,1024256,1,denim_blue,,,,,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%
1,1024256001,men_jeans_slim,slim_jeans,19.99,2022-04-19 13:17:23,1024256,1,black,,,,,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%
2,1024256001,men_jeans_slim,slim_jeans,19.99,2022-04-19 13:17:23,1024256,1,light_denim_blue,,,,,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%
3,1024256001,men_jeans_slim,slim_jeans,19.99,2022-04-19 13:17:23,1024256,1,dark_blue,,,,,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%
4,1024256001,men_jeans_slim,slim_jeans,19.99,2022-04-19 13:17:23,1024256,1,dark_denim_blue,,,,,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%


In [25]:
data.head(100)

Unnamed: 0,product_id,product_category,product_name,product_price,scrapy_datetime,style_id,color_id,color_name,fit,composition,size_number,size_model,cotton,spandex,polyester,cotton.1,spandex.1,polyester.1,cotton.2,spandex.2,polyester.2
0,1024256001,men_jeans_slim,slim_jeans,19.99,2022-04-19 13:17:23,1024256,1,denim_blue,,,,,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%
1,1024256001,men_jeans_slim,slim_jeans,19.99,2022-04-19 13:17:23,1024256,1,black,,,,,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%
2,1024256001,men_jeans_slim,slim_jeans,19.99,2022-04-19 13:17:23,1024256,1,light_denim_blue,,,,,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%
3,1024256001,men_jeans_slim,slim_jeans,19.99,2022-04-19 13:17:23,1024256,1,dark_blue,,,,,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%
4,1024256001,men_jeans_slim,slim_jeans,19.99,2022-04-19 13:17:23,1024256,1,dark_denim_blue,,,,,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%
5,1024256001,men_jeans_slim,slim_jeans,19.99,2022-04-19 13:17:23,1024256,1,dark_gray,,,,,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%
6,1024256001,men_jeans_slim,slim_jeans,19.99,2022-04-19 13:17:23,1024256,1,white,,,,,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%
7,1024256002,men_jeans_slim,slim_jeans,19.99,2022-04-19 13:17:23,1024256,2,denim_blue,,,,,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%
8,1024256002,men_jeans_slim,slim_jeans,19.99,2022-04-19 13:17:23,1024256,2,black,,,,,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%
9,1024256002,men_jeans_slim,slim_jeans,19.99,2022-04-19 13:17:23,1024256,2,light_denim_blue,,,,,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%,Cotton 0%,Spandex 0%,Polyester 0%
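
In the output above, the same `product_id` appears on several rows with different `color_name` values, so `product_id` alone does not identify a row. A minimal sketch of a granularity check, written in plain Python with hypothetical helper names (`key_counts`, `is_unique_key`) and three rows copied from that output:

```python
from collections import Counter

def key_counts(rows, key_fields):
    """Count how many rows share each combination of key_fields."""
    return Counter(tuple(row[f] for f in key_fields) for row in rows)

def is_unique_key(rows, key_fields):
    """True when no combination of key_fields occurs more than once."""
    return all(n == 1 for n in key_counts(rows, key_fields).values())

# Three rows taken from the output above (only the relevant fields).
rows = [
    {'product_id': 1024256001, 'color_name': 'denim_blue'},
    {'product_id': 1024256001, 'color_name': 'black'},
    {'product_id': 1024256002, 'color_name': 'denim_blue'},
]

print(is_unique_key(rows, ['product_id']))                # False
print(is_unique_key(rows, ['product_id', 'color_name']))  # True
```

Whether `product_id` plus `color_name` is the right grain for the full dataset is not established here; the check only shows how a candidate key could be tested.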


In [19]:
# format composition data
data['cotton'] = data['cotton'].apply( lambda x: int( re.search( r'\d+', x ).group(0) ) /100 if pd.notnull( x ) else x )
data['spandex'] = data['spandex'].apply( lambda x: int( re.search( r'\d+', x ).group(0) ) /100 if pd.notnull( x ) else x )
data['elastomultiester'] = data['elastomultiester'].apply( lambda x: int( re.search( r'\d+', x).group(0) ) /100 if pd.notnull( x ) else x )

In [20]:
#Cleaning dataframe
data = data.dropna( subset=['product_id'] )
data['product_id'] = data['product_id'].astype( int )

In [26]:
data.isna().sum()

product_id            0
product_category      0
product_name          0
product_price         0
scrapy_datetime       0
style_id              0
color_id              0
color_name            1
fit                 393
composition         393
size_number         393
size_model          393
cotton                0
spandex               0
polyester             0
cotton                0
spandex               0
polyester             0
cotton                0
spandex               0
polyester             0
dtype: int64

In [22]:
data.head()

Unnamed: 0,product_id,product_category,product_name,product_price,scrapy_datetime,style_id,color_id,color_name,fit,composition,size_number,size_model,cotton,spandex,elastomultiester
0,1024256000.0,men_jeans_slim,slim_jeans,19.99,2022-04-18 17:54:04,1024256.0,1.0,light_denim_blue,slim_fit,"Shell: Cotton 99%, Spandex 1%",185.0,size 31/32,0.99,0.01,
2,1024256000.0,men_jeans_slim,slim_jeans,19.99,2022-04-18 17:54:04,1024256.0,1.0,light_denim_blue,slim_fit,"Shell: Cotton 99%, Spandex 1%",185.0,size 31/32,0.99,0.01,
4,1024256000.0,men_jeans_slim,slim_jeans,19.99,2022-04-18 17:54:04,1024256.0,1.0,denim_blue,slim_fit,"Shell: Cotton 99%, Spandex 1%",185.0,size 31/32,0.99,0.01,
6,1024256000.0,men_jeans_slim,slim_jeans,19.99,2022-04-18 17:54:04,1024256.0,1.0,dark_blue,slim_fit,"Shell: Cotton 99%, Spandex 1%",185.0,size 31/32,0.99,0.01,
8,1024256000.0,men_jeans_slim,slim_jeans,19.99,2022-04-18 17:54:04,1024256.0,1.0,dark_denim_blue,slim_fit,"Shell: Cotton 99%, Spandex 1%",185.0,size 31/32,0.99,0.01,


In [23]:
# TODO: continue in the next cycle

In [119]:
# save the pre-cleaned dataframe
# index=False avoids writing the index as an extra 'Unnamed: 0' column
data.to_csv('../data/data_raw_clean.csv', index=False)

In [None]:
# TODO: collect the data again