## Business Question

The Star Jeans company! Eduardo and Marcelo are two Brazilians, friends and business partners.
After several successful ventures, they are planning to enter the US fashion market with an
e-commerce business model.
The initial idea is to enter the market with a single product for a specific audience: in this case,
jeans for men. The goal is to keep operating costs low and to scale as they acquire customers.
However, even with the entry product and the audience defined, the two partners have no experience
in the fashion market and therefore do not know how to define basics such as the price, the type of
jeans, and the material used to manufacture each piece.
So the two partners hired a Data Science consultancy to answer the following questions:

1. What is the best selling price for the jeans?

2. How many types of jeans, and in which colors, should the initial product line include?

3. Which raw materials are needed to make the jeans?

Note: Star Jeans' main competitors in the American market are H&M and Macy's.

# Import Libraries

In [1]:
import re #regex Library
import warnings
import inflection

import numpy             as np
import pandas            as pd
import seaborn           as sns
import matplotlib.pyplot as plt

from IPython.display import Image, HTML, display


warnings.filterwarnings( 'ignore' )

## Helper Functions

In [2]:
def jupyter_settings():
    #%matplotlib notebook
    #%pylab inline
    
    # Apply the seaborn theme first: sns.set() rewrites rcParams such as font.size,
    # so calling it last would undo the settings below.
    sns.set()

    plt.style.use( 'bmh' )
    plt.rcParams['figure.figsize'] = [25, 12]
    plt.rcParams['font.size'] = 24

    display( HTML( '<style>.container { width:100% !important; }</style>') )
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option( 'display.expand_frame_repr', False )
    
jupyter_settings()

## Load Data

In [3]:
df = pd.read_csv('../data/data_raw_size.csv')

In [4]:
df.sample()

Unnamed: 0.1,Unnamed: 0,product_id,product_category,product_name,product_price,scrapy_datetime,style_id,color_id,color_name,fit,composition,size
164,164,811993036,men_jeans_regular,Regular Jeans,$ 29.99,2022-04-18 17:54:04,811993,36,Denim blue,,,


# Rename columns

In [5]:
data = df.copy()

In [6]:
data.columns

Index(['Unnamed: 0', 'product_id', 'product_category', 'product_name',
       'product_price', 'scrapy_datetime', 'style_id', 'color_id',
       'color_name', 'fit', 'composition', 'size'],
      dtype='object')

# Clean the DataFrame

In [7]:
data.shape

(1814, 12)

In [8]:
data.isna().sum()

Unnamed: 0            0
product_id            0
product_category      0
product_name          0
product_price         0
scrapy_datetime       0
style_id              0
color_id              0
color_name            0
fit                 274
composition         274
size                274
dtype: int64

In [9]:
# product id
data['product_id'] = data['product_id'].astype( int )

# product category

# product name
data['product_name'] = data['product_name'].apply( lambda x: x.replace( ' ', '_' ).lower())

# product price
data['product_price'] = data['product_price'].apply(lambda x: x.replace( '$ ', '' )).astype(float)

# scrapy datetime
data['scrapy_datetime'] = pd.to_datetime(data['scrapy_datetime'],format='%Y-%m-%d %H:%M:%S' )

# style id
data['style_id'] = data['style_id'].astype( int )

# color id
data['color_id'] = data['color_id'].astype( int )


In [10]:
data.dtypes

Unnamed: 0                   int64
product_id                   int64
product_category            object
product_name                object
product_price              float64
scrapy_datetime     datetime64[ns]
style_id                     int64
color_id                     int64
color_name                  object
fit                         object
composition                 object
size                        object
dtype: object
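
The price conversion above relies on every value starting with exactly `'$ '`. As a hedged alternative, the sketch below strips everything that is not a digit or a decimal point before converting. The helper name `parse_price` is hypothetical and not part of the original notebook, and the second input string is an invented example rather than a value from the scraped data.

```python
import re

def parse_price(raw):
    """Turn a scraped price string such as '$ 29.99' into a float.

    Returns None for missing or non-string values (for example NaN).
    """
    if not isinstance(raw, str):
        return None
    # Keep digits and the decimal point only: drops '$', spaces and thousands commas.
    cleaned = re.sub(r'[^\d.]', '', raw)
    return float(cleaned) if cleaned else None

price_a = parse_price('$ 29.99')      # format seen in the dataset
price_b = parse_price('$1,299.00')    # invented example with a thousands separator
price_c = parse_price(float('nan'))   # missing value
```

If this helper were adopted, it could be used as `data['product_price'].apply( parse_price )` in place of the `replace` and `astype` chain.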

In [11]:
#color name
data['color_name'] = data['color_name'].apply( lambda x: x.replace(' ', '_' ).replace( '/', '_' ).lower() if pd.notnull( x ) else x )

# fit
data['fit'] = data['fit'].apply( lambda x: x.replace( ' ', '_' ).lower() if pd.notnull( x ) else x)
# when a column contains null values, guard the lambda with pd.notnull

In [12]:
data['size'].unique()

array(['The model is 185cm/6\'1" and wears a size 31/32',
       'The model is 189cm/6\'2" and wears a size 31/32',
       'The model is 180cm/5\'11" and wears a size 31/32', nan,
       'The model is 187cm/6\'2" and wears a size 31/32',
       'The model is 186cm/6\'1" and wears a size 31/32',
       'The model is 188cm/6\'2" and wears a size 31/30',
       'The model is 187cm/6\'2" and wears a size 32/32',
       'The model is 183cm/6\'0" and wears a size 31/32',
       'The model is 189cm/6\'2" and wears a size 32/32',
       'The model is 187cm/6\'2" and wears a size 33/32',
       'The model is 182cm/6\'0" and wears a size 31',
       'The model is 183cm/6\'0" and wears a size 32',
       'The model is 180cm/5\'11" and wears a size 31/30'], dtype=object)
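
The strings listed above all follow the same shape: a height in centimeters, then a waist size and an optional length after the word "size". A minimal sketch of pulling the three numbers out in one pass is shown below. The names `SIZE_PATTERN` and `parse_size` are hypothetical, and the pattern is assumed only from the values printed above.

```python
import re

# Assumed shape: "<height>cm ... size <waist>[/<length>]"
SIZE_PATTERN = re.compile(
    r'(?P<height_cm>\d{3})cm.*?size (?P<waist>\d{2})(?:/(?P<length>\d{2}))?'
)

def parse_size(text):
    """Return height, waist and length as integers, or None if nothing matches."""
    if not isinstance(text, str):
        return None  # covers NaN and other missing values
    match = SIZE_PATTERN.search(text)
    if match is None:
        return None
    return {
        key: int(value) if value is not None else None
        for key, value in match.groupdict().items()
    }

full_size = parse_size('The model is 185cm/6\'1" and wears a size 31/32')
waist_only = parse_size('The model is 182cm/6\'0" and wears a size 31')
missing = parse_size(float('nan'))
```

Named groups keep the height and the two size components separate, so a later cycle can decide which of them to use without rewriting the pattern.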

In [13]:
# To speed up delivery of the first cycle, I decided not to use the size information!!!

# size number: height of the model in cm
data['size_number'] = data['size'].apply( lambda x: re.search( r'(\d+cm)', x ).group(0) if pd.notnull( x ) else x )
data['size_number'] = data['size_number'].apply( lambda x: re.search( r'(\d{3})', x ).group(0) if pd.notnull( x ) else x ).astype(float)

# size model: size worn by the model, e.g. "size 31/32" or "size 31"
data['size_model'] = data['size'].str.extract( r'(size.\d{2}.\d+|size.\d{2})', expand=False )

In [14]:
data['composition'].unique()

array(['Shell: Cotton 99%, Spandex 1%',
       'Pocket lining: Polyester 65%, Cotton 35%', nan,
       'Cotton 99%, Spandex 1%', 'Pocket lining: Cotton 100%',
       'Pocket lining: Polyester 63%, Cotton 37%', 'Pocket: Cotton 100%',
       'Shell: Cotton 100%', 'Cotton 98%, Spandex 2%',
       'Lining: Polyester 100%',
       'Shell: Cotton 90%, Elastomultiester 8%, Spandex 2%',
       'Pocket lining: Polyester 80%, Cotton 20%'], dtype=object)
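
Each composition string above is an optional part label (`Shell:`, `Pocket lining:` and so on) followed by `Material NN%` pairs. As a hedged sketch of an alternative to splitting by comma and matching columns by position, the hypothetical helper `parse_composition` below turns one string into a dictionary of fractions. It is not the approach used in this notebook, only an illustration under the assumption that all values look like the ones printed above.

```python
import re

# One "Material NN%" pair, e.g. "Cotton 99%".
MATERIAL_PATTERN = re.compile(r'([A-Za-z]+)\s+(\d+)%')

def parse_composition(text):
    """Map each material name (lower-cased) to its fraction between 0 and 1."""
    if not isinstance(text, str):
        return {}  # covers NaN and other missing values
    # Drop a leading part label such as "Shell:"; strings without a colon are kept whole.
    _, _, materials = text.rpartition(':')
    return {
        name.lower(): int(percent) / 100
        for name, percent in MATERIAL_PATTERN.findall(materials)
    }

shell = parse_composition('Shell: Cotton 90%, Elastomultiester 8%, Spandex 2%')
plain = parse_composition('Cotton 99%, Spandex 1%')
empty = parse_composition(float('nan'))
```

Because the material name is read from the string itself, a material that appears in the second or third position still ends up under the right key.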

In [15]:
# composition
# dropping rows with pocket and lining materials
data = data[~data['composition'].str.contains( 'Pocket lining:', na=False )]
#data = data[~data['composition'].str.contains( 'Pocket:', na=False )]
data = data[~data['composition'].str.contains( 'Lining:', na=False )]
#data = data[~data['composition'].str.contains( 'Shell:', na=False )]

# break composition by comma
df1 = data['composition'].str.split( ',', expand=True )
#
## cotton | polyester | Spandex
df_ref = pd.DataFrame( index=np.arange( len( data ) ), columns=['cotton', 'spandex', 'elastomultiester'] )
#
## cotton
df_cotton = df1[0]
df_cotton.name = 'cotton'
df_ref = pd.concat( [df_ref, df_cotton ], axis=1 )
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last')]
#
## spandex
df_spandex = df1.loc[df1[1].str.contains( 'Spandex', na=True ), 1]
df_spandex.name = 'spandex'
df_ref = pd.concat( [df_ref, df_spandex], axis=1 )
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last') ]

# elastomultiester
df_elastomultiester = df1.loc[df1[1].str.contains( 'Elastomultiester', na=True ), 1]
df_elastomultiester.name = 'elastomultiester'
df_ref = pd.concat( [df_ref, df_elastomultiester], axis=1 )
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last') ]

# dropping columns
data = data.drop( columns=['Unnamed: 0', 'size'] )

In [16]:
# final join
data = pd.concat( [data, df_ref], axis=1 )

In [17]:
data.head()

Unnamed: 0,product_id,product_category,product_name,product_price,scrapy_datetime,style_id,color_id,color_name,fit,composition,size_number,size_model,cotton,spandex,elastomultiester
0,1024256000.0,men_jeans_slim,slim_jeans,19.99,2022-04-18 17:54:04,1024256.0,1.0,light_denim_blue,slim_fit,"Shell: Cotton 99%, Spandex 1%",185.0,size 31/32,Shell: Cotton 99%,Spandex 1%,
2,1024256000.0,men_jeans_slim,slim_jeans,19.99,2022-04-18 17:54:04,1024256.0,1.0,light_denim_blue,slim_fit,"Shell: Cotton 99%, Spandex 1%",185.0,size 31/32,Shell: Cotton 99%,Spandex 1%,
4,1024256000.0,men_jeans_slim,slim_jeans,19.99,2022-04-18 17:54:04,1024256.0,1.0,denim_blue,slim_fit,"Shell: Cotton 99%, Spandex 1%",185.0,size 31/32,Shell: Cotton 99%,Spandex 1%,
6,1024256000.0,men_jeans_slim,slim_jeans,19.99,2022-04-18 17:54:04,1024256.0,1.0,dark_blue,slim_fit,"Shell: Cotton 99%, Spandex 1%",185.0,size 31/32,Shell: Cotton 99%,Spandex 1%,
8,1024256000.0,men_jeans_slim,slim_jeans,19.99,2022-04-18 17:54:04,1024256.0,1.0,dark_denim_blue,slim_fit,"Shell: Cotton 99%, Spandex 1%",185.0,size 31/32,Shell: Cotton 99%,Spandex 1%,


In [18]:
data.shape

(1593, 15)
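
One thing worth checking around the final join: `pd.concat( ..., axis=1 )` aligns rows by index label, not by position. The toy frames below use made-up values, not values from the dataset. They sketch what happens when a filtered frame is joined with a helper frame built on a fresh `0..n-1` index, which may explain rows with a missing `product_id` after the join.

```python
import pandas as pd

# Toy frame whose index has gaps, as it would after filtering rows out.
left = pd.DataFrame({'product_id': [10, 20, 30]}, index=[0, 2, 4])

# Same number of rows, but a fresh 0..n-1 index: label 1 does not exist in `left`.
right_fresh = pd.DataFrame({'cotton': [0.99, 0.98, 1.0]}, index=range(len(left)))

# Same values, reusing the index of `left`.
right_aligned = pd.DataFrame({'cotton': [0.99, 0.98, 1.0]}, index=left.index)

# Outer alignment on the index: the union of labels 0, 1, 2 and 4 gives four rows.
misaligned = pd.concat([left, right_fresh], axis=1)

# Matching labels: three rows and no missing values.
aligned = pd.concat([left, right_aligned], axis=1)

print(misaligned.shape, aligned.shape)
```

Building the helper frame with `index=data.index`, or calling `reset_index( drop=True )` on the filtered frame first, are two common ways to keep both sides on the same labels. The `dropna( subset=['product_id'] )` step later in the notebook removes the extra rows after the fact.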

In [19]:
# format composition data: extract the percentage and convert it to a fraction
data['cotton'] = data['cotton'].apply( lambda x: int( re.search( r'\d+', x ).group(0) ) /100 if pd.notnull( x ) else x )
data['spandex'] = data['spandex'].apply( lambda x: int( re.search( r'\d+', x ).group(0) ) /100 if pd.notnull( x ) else x )
data['elastomultiester'] = data['elastomultiester'].apply( lambda x: int( re.search( r'\d+', x ).group(0) ) /100 if pd.notnull( x ) else x )

In [20]:
#Cleaning dataframe
data = data.dropna( subset=['product_id'] )
data['product_id'] = data['product_id'].astype( int )

In [21]:
data.isna().sum()

product_id           404
product_category     404
product_name         404
product_price        404
scrapy_datetime      404
style_id             404
color_id             404
color_name           404
fit                  678
composition          678
size_number          678
size_model           678
cotton               678
spandex              777
elastomultiester    1589
dtype: int64

In [22]:
data.head()

Unnamed: 0,product_id,product_category,product_name,product_price,scrapy_datetime,style_id,color_id,color_name,fit,composition,size_number,size_model,cotton,spandex,elastomultiester
0,1024256000.0,men_jeans_slim,slim_jeans,19.99,2022-04-18 17:54:04,1024256.0,1.0,light_denim_blue,slim_fit,"Shell: Cotton 99%, Spandex 1%",185.0,size 31/32,0.99,0.01,
2,1024256000.0,men_jeans_slim,slim_jeans,19.99,2022-04-18 17:54:04,1024256.0,1.0,light_denim_blue,slim_fit,"Shell: Cotton 99%, Spandex 1%",185.0,size 31/32,0.99,0.01,
4,1024256000.0,men_jeans_slim,slim_jeans,19.99,2022-04-18 17:54:04,1024256.0,1.0,denim_blue,slim_fit,"Shell: Cotton 99%, Spandex 1%",185.0,size 31/32,0.99,0.01,
6,1024256000.0,men_jeans_slim,slim_jeans,19.99,2022-04-18 17:54:04,1024256.0,1.0,dark_blue,slim_fit,"Shell: Cotton 99%, Spandex 1%",185.0,size 31/32,0.99,0.01,
8,1024256000.0,men_jeans_slim,slim_jeans,19.99,2022-04-18 17:54:04,1024256.0,1.0,dark_denim_blue,slim_fit,"Shell: Cotton 99%, Spandex 1%",185.0,size 31/32,0.99,0.01,


In [23]:
# To be continued in the next cycle!