# Quem Quer Comprar E-commerce Company
### Objective

We are hiring a **data analyst** to carry out a detailed investigation of our sales demand over the year. The goal is to obtain decisive ``insights`` that help increase online sales in the next period. The analysis must be of high quality, so that it identifies the most relevant points and gives us a better understanding of our customers' behavior.

The ``stakeholders`` want a technical presentation with the results of the analysis, including ``absolute`` values and ``margins``. They also request a second, simpler and more direct presentation that highlights the strengths of our business as well as the opportunities for improvement. That way, everyone involved will be clear about the strategic actions to take in the next cycle.

---

##### **Column names and meanings**:  

**Order_Date**: The date the product was ordered.  
**Aging**: The time from the day the product is ordered to the day it is delivered.  
**Customer_id**: Unique id created for each customer.  
**Gender**: Gender of customer.  
**Device_Type**: The device the customer uses to complete the transaction (Web/Mobile).  
**Customer_Login_Type**: How the customer logged in, such as Member, Guest, etc.  
**Product_Category**: Product category  
**Product**: Product  
**Sales**: Total sales amount  
**Quantity**: Unit amount of product  
**Discount**: Percent discount rate  
**Profit**: Profit  
**Shipping_cost**: Shipping cost  
**Order_Priority**: Order priority, such as Critical, High, etc.  
**Payment_method**: Payment method 

---

In [118]:
import pandas as pd
import numpy as np

# import project packages/functions
from e_commerce.data_extraction.extractor import get_data
from e_commerce.preprocessing.column_utils import clean_columns
from e_commerce.preprocessing.data_types import padroniza_tipos_dados
from e_commerce.utils.save import save_to_csv_interim

## Load data

In [70]:
input_path = r'e_commerce_dataset.csv'

df = get_data(source_type='csv', type_name='raw', thousands=',', decimal='.', filename_or_path=input_path)
df

Unnamed: 0,Order_Date,Time,Aging,Customer_Id,Gender,Device_Type,Customer_Login_type,Product_Category,Product,Sales,Quantity,Discount,Profit,Shipping_Cost,Order_Priority,Payment_method
0,2018-01-02,10:56:33,8.0,37077,Female,Web,Member,Auto & Accessories,Car Media Players,140.0,1.0,0.3,46.0,4.6,Medium,credit_card
1,2018-07-24,20:41:37,2.0,59173,Female,Web,Member,Auto & Accessories,Car Speakers,211.0,1.0,0.3,112.0,11.2,Medium,credit_card
2,2018-11-08,08:38:49,8.0,41066,Female,Web,Member,Auto & Accessories,Car Body Covers,117.0,5.0,0.1,31.2,3.1,Critical,credit_card
3,2018-04-18,19:28:06,7.0,50741,Female,Web,Member,Auto & Accessories,Car & Bike Care,118.0,1.0,0.3,26.2,2.6,High,credit_card
4,2018-08-13,21:18:39,9.0,53639,Female,Web,Member,Auto & Accessories,Tyre,250.0,1.0,0.3,160.0,16.0,Critical,credit_card
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51285,2018-02-28,22:59:50,6.0,78489,Female,Mobile,Member,Home & Furniture,Shoe Rack,124.0,4.0,0.3,19.2,1.9,Medium,money_order
51286,2018-02-28,13:19:25,2.0,91941,Female,Mobile,Member,Home & Furniture,Umbrellas,70.0,5.0,0.2,14.0,1.4,Medium,credit_card
51287,2018-02-28,10:25:07,6.0,63313,Male,Web,Member,Home & Furniture,Dinner Crockery,133.0,1.0,0.3,39.7,4.0,Medium,credit_card
51288,2018-02-28,10:50:08,7.0,86485,Male,Web,Member,Home & Furniture,Sofa Covers,216.0,1.0,0.2,131.7,13.2,Medium,credit_card
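
The `get_data` helper comes from the project's own `e_commerce` package and its body is not shown in this notebook. As a rough sketch, assuming it forwards the `thousands` and `decimal` arguments to `pandas.read_csv`, the load step could look like the following. The `load_raw_csv` name and the inline sample are hypothetical.

```python
import io

import pandas as pd


def load_raw_csv(path_or_buffer, thousands=",", decimal="."):
    """Hypothetical stand-in for get_data(source_type='csv', ...)."""
    # thousands=',' lets values such as "1,250.5" parse as numbers
    return pd.read_csv(path_or_buffer, thousands=thousands, decimal=decimal)


# Tiny inline sample used instead of e_commerce_dataset.csv
sample = io.StringIO(
    'Order_Date,Product,Sales\n'
    '2018-01-02,Car Media Players,140.0\n'
    '2018-08-13,Tyre,"1,250.5"\n'
)
df_sample = load_raw_csv(sample)
print(df_sample.dtypes)
```

Passing the separators explicitly keeps numeric columns from being read as text when the file uses a thousands separator.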


### Transform data

In [87]:
# Standardize column names
df_clean = clean_columns(df.copy(), verbose=False)
df_clean

Unnamed: 0,order_date,time,aging,customer_id,gender,device_type,customer_login_type,product_category,product,sales,quantity,discount,profit,shipping_cost,order_priority,payment_method
0,2018-01-02,10:56:33,8.0,37077,Female,Web,Member,Auto & Accessories,Car Media Players,140.0,1.0,0.3,46.0,4.6,Medium,credit_card
1,2018-07-24,20:41:37,2.0,59173,Female,Web,Member,Auto & Accessories,Car Speakers,211.0,1.0,0.3,112.0,11.2,Medium,credit_card
2,2018-11-08,08:38:49,8.0,41066,Female,Web,Member,Auto & Accessories,Car Body Covers,117.0,5.0,0.1,31.2,3.1,Critical,credit_card
3,2018-04-18,19:28:06,7.0,50741,Female,Web,Member,Auto & Accessories,Car & Bike Care,118.0,1.0,0.3,26.2,2.6,High,credit_card
4,2018-08-13,21:18:39,9.0,53639,Female,Web,Member,Auto & Accessories,Tyre,250.0,1.0,0.3,160.0,16.0,Critical,credit_card
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51285,2018-02-28,22:59:50,6.0,78489,Female,Mobile,Member,Home & Furniture,Shoe Rack,124.0,4.0,0.3,19.2,1.9,Medium,money_order
51286,2018-02-28,13:19:25,2.0,91941,Female,Mobile,Member,Home & Furniture,Umbrellas,70.0,5.0,0.2,14.0,1.4,Medium,credit_card
51287,2018-02-28,10:25:07,6.0,63313,Male,Web,Member,Home & Furniture,Dinner Crockery,133.0,1.0,0.3,39.7,4.0,Medium,credit_card
51288,2018-02-28,10:50:08,7.0,86485,Male,Web,Member,Home & Furniture,Sofa Covers,216.0,1.0,0.2,131.7,13.2,Medium,credit_card
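
Judging by the output above, `clean_columns` lower-cases the column names. Its implementation is not shown, so the following is only a sketch of one way to get the same result; `snake_case_columns` is a hypothetical name.

```python
import pandas as pd


def snake_case_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for clean_columns: returns a renamed copy."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()                     # drop stray spaces
        .str.lower()                                # Order_Date -> order_date
        .str.replace(r"\s+", "_", regex=True)       # inner spaces -> underscores
    )
    return out


raw = pd.DataFrame(
    columns=["Order_Date", "Customer_Id", "Customer_Login_type", "Shipping_Cost"]
)
print(list(snake_case_columns(raw).columns))
```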


In [88]:
# Rename the aging column to lead_time (time between order and delivery)
df_clean = df_clean.rename(columns={'aging' : 'lead_time'}).copy()
df_clean

Unnamed: 0,order_date,time,lead_time,customer_id,gender,device_type,customer_login_type,product_category,product,sales,quantity,discount,profit,shipping_cost,order_priority,payment_method
0,2018-01-02,10:56:33,8.0,37077,Female,Web,Member,Auto & Accessories,Car Media Players,140.0,1.0,0.3,46.0,4.6,Medium,credit_card
1,2018-07-24,20:41:37,2.0,59173,Female,Web,Member,Auto & Accessories,Car Speakers,211.0,1.0,0.3,112.0,11.2,Medium,credit_card
2,2018-11-08,08:38:49,8.0,41066,Female,Web,Member,Auto & Accessories,Car Body Covers,117.0,5.0,0.1,31.2,3.1,Critical,credit_card
3,2018-04-18,19:28:06,7.0,50741,Female,Web,Member,Auto & Accessories,Car & Bike Care,118.0,1.0,0.3,26.2,2.6,High,credit_card
4,2018-08-13,21:18:39,9.0,53639,Female,Web,Member,Auto & Accessories,Tyre,250.0,1.0,0.3,160.0,16.0,Critical,credit_card
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51285,2018-02-28,22:59:50,6.0,78489,Female,Mobile,Member,Home & Furniture,Shoe Rack,124.0,4.0,0.3,19.2,1.9,Medium,money_order
51286,2018-02-28,13:19:25,2.0,91941,Female,Mobile,Member,Home & Furniture,Umbrellas,70.0,5.0,0.2,14.0,1.4,Medium,credit_card
51287,2018-02-28,10:25:07,6.0,63313,Male,Web,Member,Home & Furniture,Dinner Crockery,133.0,1.0,0.3,39.7,4.0,Medium,credit_card
51288,2018-02-28,10:50:08,7.0,86485,Male,Web,Member,Home & Furniture,Sofa Covers,216.0,1.0,0.2,131.7,13.2,Medium,credit_card


In [89]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51290 entries, 0 to 51289
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   order_date           51290 non-null  object 
 1   time                 51290 non-null  object 
 2   lead_time            51289 non-null  float64
 3   customer_id          51290 non-null  int64  
 4   gender               51290 non-null  object 
 5   device_type          51290 non-null  object 
 6   customer_login_type  51290 non-null  object 
 7   product_category     51290 non-null  object 
 8   product              51290 non-null  object 
 9   sales                51289 non-null  float64
 10  quantity             51288 non-null  float64
 11  discount             51289 non-null  float64
 12  profit               51290 non-null  float64
 13  shipping_cost        51289 non-null  float64
 14  order_priority       51288 non-null  object 
 15  payment_method       51290 non-null  object 

In [90]:
# First pass at standardizing the data types
df_clean = padroniza_tipos_dados(df_clean.copy(), type_mapping={'order_date' : 'datetime64[ns]'}, auto_detect=False, verbose=True)

🔧 TIPOS DE DADOS:
order_date             datetime64[ns]
time                           object
lead_time                     float64
customer_id                     int64
gender                         object
device_type                    object
customer_login_type            object
product_category               object
product                        object
sales                         float64
quantity                      float64
discount                      float64
profit                        float64
shipping_cost                 float64
order_priority                 object
payment_method                 object
dtype: object
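
`padroniza_tipos_dados` is a project helper whose code is not included here. A minimal sketch of what a `type_mapping` pass could do, assuming it converts each listed column and leaves the rest untouched (`apply_type_mapping` is a hypothetical name):

```python
import pandas as pd


def apply_type_mapping(df, type_mapping):
    """Hypothetical stand-in for padroniza_tipos_dados(..., auto_detect=False)."""
    out = df.copy()
    for column, dtype in type_mapping.items():
        if str(dtype).startswith("datetime64"):
            # to_datetime raises on unparseable dates instead of keeping strings
            out[column] = pd.to_datetime(out[column])
        else:
            out[column] = out[column].astype(dtype)
    return out


demo = pd.DataFrame(
    {"order_date": ["2018-01-02", "2018-07-24"], "sales": ["140.0", "211.0"]}
)
typed = apply_type_mapping(demo, {"order_date": "datetime64[ns]", "sales": "float64"})
print(typed.dtypes)
```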


In [91]:
# Combine the order time with the order_date column, then drop the time column, which is no longer needed
df_clean['order_date'] = pd.to_datetime(df_clean['order_date'].dt.date.astype(str) + ' ' + df_clean['time'])
df_clean = df_clean.drop('time', axis=1)
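
The string concatenation above works. As an alternative sketch on toy data, the time of day can be parsed with `pd.to_timedelta` and added to the date directly, which avoids the round trip through strings:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2018-01-02", "2018-07-24"]),
    "time": ["10:56:33", "20:41:37"],
})

# Date (at midnight) + time of day as a Timedelta gives the full timestamp
orders["order_date"] = orders["order_date"] + pd.to_timedelta(orders["time"])
orders = orders.drop(columns="time")
print(orders)
```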

In [92]:
df_clean

Unnamed: 0,order_date,lead_time,customer_id,gender,device_type,customer_login_type,product_category,product,sales,quantity,discount,profit,shipping_cost,order_priority,payment_method
0,2018-01-02 10:56:33,8.0,37077,Female,Web,Member,Auto & Accessories,Car Media Players,140.0,1.0,0.3,46.0,4.6,Medium,credit_card
1,2018-07-24 20:41:37,2.0,59173,Female,Web,Member,Auto & Accessories,Car Speakers,211.0,1.0,0.3,112.0,11.2,Medium,credit_card
2,2018-11-08 08:38:49,8.0,41066,Female,Web,Member,Auto & Accessories,Car Body Covers,117.0,5.0,0.1,31.2,3.1,Critical,credit_card
3,2018-04-18 19:28:06,7.0,50741,Female,Web,Member,Auto & Accessories,Car & Bike Care,118.0,1.0,0.3,26.2,2.6,High,credit_card
4,2018-08-13 21:18:39,9.0,53639,Female,Web,Member,Auto & Accessories,Tyre,250.0,1.0,0.3,160.0,16.0,Critical,credit_card
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51285,2018-02-28 22:59:50,6.0,78489,Female,Mobile,Member,Home & Furniture,Shoe Rack,124.0,4.0,0.3,19.2,1.9,Medium,money_order
51286,2018-02-28 13:19:25,2.0,91941,Female,Mobile,Member,Home & Furniture,Umbrellas,70.0,5.0,0.2,14.0,1.4,Medium,credit_card
51287,2018-02-28 10:25:07,6.0,63313,Male,Web,Member,Home & Furniture,Dinner Crockery,133.0,1.0,0.3,39.7,4.0,Medium,credit_card
51288,2018-02-28 10:50:08,7.0,86485,Male,Web,Member,Home & Furniture,Sofa Covers,216.0,1.0,0.2,131.7,13.2,Medium,credit_card


In [93]:
df_clean.isnull().sum()

order_date             0
lead_time              1
customer_id            0
gender                 0
device_type            0
customer_login_type    0
product_category       0
product                0
sales                  1
quantity               2
discount               1
profit                 0
shipping_cost          1
order_priority         2
payment_method         0
dtype: int64

In [94]:
# Check whether the row with a null lead_time has any other null values
df_clean[df_clean['lead_time'].isna()]

Unnamed: 0,order_date,lead_time,customer_id,gender,device_type,customer_login_type,product_category,product,sales,quantity,discount,profit,shipping_cost,order_priority,payment_method
27,2018-05-02 11:45:38,,26058,Female,Web,Member,Auto & Accessories,Car Media Players,140.0,1.0,0.3,55.8,5.6,High,credit_card


In [95]:
# Fill each null lead_time with the mean of the previous and next rows.
# Note: these are neighboring rows by index, not neighboring days, and no rolling window is used.
# This assumes the default RangeIndex and that both neighbors exist and are not null.

# Find the index labels of the rows with a null lead_time
nulos = df_clean.index[df_clean['lead_time'].isna()]

for i in nulos:
    anterior = df_clean.loc[i - 1, 'lead_time']
    posterior = df_clean.loc[i + 1, 'lead_time']
    df_clean.loc[i, 'lead_time'] = (anterior + posterior) / 2

df_clean

Unnamed: 0,order_date,lead_time,customer_id,gender,device_type,customer_login_type,product_category,product,sales,quantity,discount,profit,shipping_cost,order_priority,payment_method
0,2018-01-02 10:56:33,8.0,37077,Female,Web,Member,Auto & Accessories,Car Media Players,140.0,1.0,0.3,46.0,4.6,Medium,credit_card
1,2018-07-24 20:41:37,2.0,59173,Female,Web,Member,Auto & Accessories,Car Speakers,211.0,1.0,0.3,112.0,11.2,Medium,credit_card
2,2018-11-08 08:38:49,8.0,41066,Female,Web,Member,Auto & Accessories,Car Body Covers,117.0,5.0,0.1,31.2,3.1,Critical,credit_card
3,2018-04-18 19:28:06,7.0,50741,Female,Web,Member,Auto & Accessories,Car & Bike Care,118.0,1.0,0.3,26.2,2.6,High,credit_card
4,2018-08-13 21:18:39,9.0,53639,Female,Web,Member,Auto & Accessories,Tyre,250.0,1.0,0.3,160.0,16.0,Critical,credit_card
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51285,2018-02-28 22:59:50,6.0,78489,Female,Mobile,Member,Home & Furniture,Shoe Rack,124.0,4.0,0.3,19.2,1.9,Medium,money_order
51286,2018-02-28 13:19:25,2.0,91941,Female,Mobile,Member,Home & Furniture,Umbrellas,70.0,5.0,0.2,14.0,1.4,Medium,credit_card
51287,2018-02-28 10:25:07,6.0,63313,Male,Web,Member,Home & Furniture,Dinner Crockery,133.0,1.0,0.3,39.7,4.0,Medium,credit_card
51288,2018-02-28 10:50:08,7.0,86485,Male,Web,Member,Home & Furniture,Sofa Covers,216.0,1.0,0.2,131.7,13.2,Medium,credit_card
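
The loop above depends on the rows at `i - 1` and `i + 1` existing and holding values. As a sketch on toy data, `Series.interpolate` gives the same result for an isolated gap without manual index arithmetic:

```python
import numpy as np
import pandas as pd

lead_time = pd.Series([8.0, 6.0, np.nan, 7.0, 9.0])

# Linear interpolation over row position: an isolated gap becomes the
# mean of its two neighbors, which matches the manual loop
filled = lead_time.interpolate(method="linear", limit_area="inside")
print(filled.tolist())
```

With `limit_area="inside"`, a null in the first or last row is left as null rather than extrapolated, so it stays visible in a later `isnull().sum()` check.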


In [96]:
# Check whether the row with a null sales value has any other null values
df_clean[df_clean['sales'].isna()]

Unnamed: 0,order_date,lead_time,customer_id,gender,device_type,customer_login_type,product_category,product,sales,quantity,discount,profit,shipping_cost,order_priority,payment_method
793,2018-05-16 21:30:59,6.0,16381,Male,Web,Member,Auto & Accessories,Car Speakers,,1.0,0.1,124.7,12.5,Critical,credit_card


In [97]:
# Look at other orders of the same product to find the value for the missing sales entry
df_clean[df_clean['product'].isin(['Car Speakers'])]

Unnamed: 0,order_date,lead_time,customer_id,gender,device_type,customer_login_type,product_category,product,sales,quantity,discount,profit,shipping_cost,order_priority,payment_method
1,2018-07-24 20:41:37,2.0,59173,Female,Web,Member,Auto & Accessories,Car Speakers,211.0,1.0,0.3,112.0,11.2,Medium,credit_card
10,2018-07-13 19:58:11,10.0,22249,Female,Web,Member,Auto & Accessories,Car Speakers,211.0,4.0,0.1,122.6,12.3,Critical,credit_card
19,2018-02-18 14:31:30,6.0,26127,Female,Web,Member,Auto & Accessories,Car Speakers,211.0,1.0,0.2,122.6,12.3,Critical,credit_card
28,2018-08-25 13:50:30,7.0,38941,Female,Web,Member,Auto & Accessories,Car Speakers,211.0,1.0,0.3,112.0,11.2,Medium,credit_card
37,2018-10-01 12:23:13,3.0,27385,Female,Web,Member,Auto & Accessories,Car Speakers,211.0,4.0,0.1,122.6,12.3,Critical,credit_card
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17650,2018-06-30 17:29:42,6.0,58003,Male,Web,Member,Auto & Accessories,Car Speakers,211.0,1.0,0.2,109.9,11.0,High,money_order
17659,2018-04-20 11:27:51,10.0,53301,Male,Web,Member,Auto & Accessories,Car Speakers,211.0,4.0,0.2,97.2,9.7,High,money_order
17668,2018-09-05 08:57:45,8.0,29153,Female,Web,Member,Auto & Accessories,Car Speakers,211.0,3.0,0.1,124.7,12.5,Critical,e_wallet
17679,2018-07-30 16:48:21,3.0,52305,Male,Web,Member,Auto & Accessories,Car Speakers,211.0,4.0,0.2,97.2,9.7,Medium,credit_card


In [98]:
# Fill the null sales value with 211.0, the sales value of the other Car Speakers orders shown above
df_clean['sales'] = df_clean['sales'].fillna(211.0)
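
The hard-coded `211.0` is enough here because there is a single gap and the Car Speakers rows shown all carry that value. With gaps across several products, a per-product fill is one option. A sketch on toy data, using the median as an assumed choice of statistic:

```python
import numpy as np
import pandas as pd

sales_demo = pd.DataFrame({
    "product": ["Car Speakers", "Car Speakers", "Car Speakers", "Tyre", "Tyre"],
    "sales": [211.0, np.nan, 211.0, 250.0, 250.0],
})

# Median sales value per product, aligned row by row with the original frame
per_product = sales_demo.groupby("product")["sales"].transform("median")
sales_demo["sales"] = sales_demo["sales"].fillna(per_product)
print(sales_demo)
```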

In [99]:
df_clean.isnull().sum()

order_date             0
lead_time              0
customer_id            0
gender                 0
device_type            0
customer_login_type    0
product_category       0
product                0
sales                  0
quantity               2
discount               1
profit                 0
shipping_cost          1
order_priority         2
payment_method         0
dtype: int64

In [100]:
# Check whether the rows with a null quantity have any other null values
df_clean[df_clean['quantity'].isna()]

Unnamed: 0,order_date,lead_time,customer_id,gender,device_type,customer_login_type,product_category,product,sales,quantity,discount,profit,shipping_cost,order_priority,payment_method
95,2018-04-22 11:32:22,5.0,52267,Male,Web,Member,Auto & Accessories,Bike Tyres,72.0,,0.1,36.0,3.6,Critical,credit_card
321,2018-06-05 11:04:11,3.0,41850,Male,Web,Member,Auto & Accessories,Car Mat,54.0,,0.2,54.0,5.4,Critical,credit_card


In [101]:
# Drop every row that still has a null in any column (quantity, discount, shipping_cost, order_priority): 6 rows in total
df_clean = df_clean.dropna()
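
`dropna()` with no arguments removes every row that has a null in any column, so it drops more than the two `quantity` rows: the row count falls from 51290 to 51284. A small sketch on toy data of the difference between that and restricting the drop with `subset`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "quantity": [1.0, np.nan, 5.0, 4.0],
    "discount": [0.3, 0.1, np.nan, 0.2],
    "profit": [46.0, 36.0, 31.2, 19.2],
})

only_quantity = demo.dropna(subset=["quantity"])  # drops rows with a null quantity only
any_null = demo.dropna()                          # drops rows with a null anywhere
print(len(demo) - len(only_quantity), len(demo) - len(any_null))
```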

In [102]:
df_clean.isnull().sum()

order_date             0
lead_time              0
customer_id            0
gender                 0
device_type            0
customer_login_type    0
product_category       0
product                0
sales                  0
quantity               0
discount               0
profit                 0
shipping_cost          0
order_priority         0
payment_method         0
dtype: int64

In [103]:
# Convert the lead_time and quantity columns to int64
df_clean = df_clean.astype({'lead_time': 'int64', 'quantity': 'int64'})
df_clean


Unnamed: 0,order_date,lead_time,customer_id,gender,device_type,customer_login_type,product_category,product,sales,quantity,discount,profit,shipping_cost,order_priority,payment_method
0,2018-01-02 10:56:33,8,37077,Female,Web,Member,Auto & Accessories,Car Media Players,140.0,1,0.3,46.0,4.6,Medium,credit_card
1,2018-07-24 20:41:37,2,59173,Female,Web,Member,Auto & Accessories,Car Speakers,211.0,1,0.3,112.0,11.2,Medium,credit_card
2,2018-11-08 08:38:49,8,41066,Female,Web,Member,Auto & Accessories,Car Body Covers,117.0,5,0.1,31.2,3.1,Critical,credit_card
3,2018-04-18 19:28:06,7,50741,Female,Web,Member,Auto & Accessories,Car & Bike Care,118.0,1,0.3,26.2,2.6,High,credit_card
4,2018-08-13 21:18:39,9,53639,Female,Web,Member,Auto & Accessories,Tyre,250.0,1,0.3,160.0,16.0,Critical,credit_card
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51285,2018-02-28 22:59:50,6,78489,Female,Mobile,Member,Home & Furniture,Shoe Rack,124.0,4,0.3,19.2,1.9,Medium,money_order
51286,2018-02-28 13:19:25,2,91941,Female,Mobile,Member,Home & Furniture,Umbrellas,70.0,5,0.2,14.0,1.4,Medium,credit_card
51287,2018-02-28 10:25:07,6,63313,Male,Web,Member,Home & Furniture,Dinner Crockery,133.0,1,0.3,39.7,4.0,Medium,credit_card
51288,2018-02-28 10:50:08,7,86485,Male,Web,Member,Home & Furniture,Sofa Covers,216.0,1,0.2,131.7,13.2,Medium,credit_card


In [104]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 51284 entries, 0 to 51289
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   order_date           51284 non-null  datetime64[ns]
 1   lead_time            51284 non-null  int64         
 2   customer_id          51284 non-null  int64         
 3   gender               51284 non-null  object        
 4   device_type          51284 non-null  object        
 5   customer_login_type  51284 non-null  object        
 6   product_category     51284 non-null  object        
 7   product              51284 non-null  object        
 8   sales                51284 non-null  float64       
 9   quantity             51284 non-null  int64         
 10  discount             51284 non-null  float64       
 11  profit               51284 non-null  float64       
 12  shipping_cost        51284 non-null  float64       
 13  order_priority       51284 non-null  object        

In [117]:
for valor in df_clean:
    if df_clean[valor].dtype == "O":  # object can be used here in place of "O"
        print(f'{valor}:')
        print(df_clean[valor].unique())
        print('\n\n')

gender:
['Female' 'Male']



device_type:
['Web' 'Mobile']



customer_login_type:
['Member' 'Guest' 'New ' 'First SignUp']



product_category:
['Auto & Accessories' 'Fashion' 'Electronic' 'Home & Furniture']



product:
['Car Media Players' 'Car Speakers' 'Car Body Covers' 'Car & Bike Care'
 'Tyre' 'Bike Tyres' 'Car Mat' 'Car Seat Covers' 'Car Pillow & Neck Rest'
 'Shirts' 'Jeans' 'Suits' 'Sports Wear' 'Casula Shoes' 'Running Shoes'
 'Formal Shoes' 'Sneakers' 'Titak watch' 'Fossil Watch' 'T - Shirts'
 'Samsung Mobile' 'Watch' 'Fans' 'Iron' 'Tablet' 'Mouse' 'Keyboard'
 'Apple Laptop' 'Mixer/Juicer' 'LED' 'LCD' 'Speakers' 'Sofa Covers'
 'Bed Sheets' 'Curtains' 'Towels' 'Sofas' 'Beds' 'Dinning Tables'
 'Shoe Rack' 'Umbrellas' 'Dinner Crockery']



order_priority:
['Medium' 'Critical' 'High' 'Low']



payment_method:
['credit_card' 'money_order' 'e_wallet' 'debit_card' 'not_defined']
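
The `customer_login_type` values above include `'New '` with a trailing space, which would count as a separate category from `'New'` in any later grouping. A minimal sketch on toy data of stripping the whitespace:

```python
import pandas as pd

logins = pd.DataFrame(
    {"customer_login_type": ["Member", "Guest", "New ", "First SignUp", "New "]}
)

# Remove leading and trailing whitespace from the category labels
logins["customer_login_type"] = logins["customer_login_type"].str.strip()
print(logins["customer_login_type"].unique())
```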





### Save analysis file

In [120]:
save_to_csv_interim(df_clean, filename='e_commerce_clean.csv', index=False)

✅ Arquivo salvo em: H:\Portifolios_e_anotacoes_Jackson\PORTIFÓLIOS\america_ecommerce\data\interim\e_commerce_clean.csv


'H:\\Portifolios_e_anotacoes_Jackson\\PORTIFÓLIOS\\america_ecommerce\\data\\interim\\e_commerce_clean.csv'
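
`save_to_csv_interim` is another project helper that is not shown. Based on the output above, it appears to write into a `data/interim` folder and return the path as a string. A sketch under that assumption, with the hypothetical name `save_interim_csv` and a temporary directory standing in for the project root:

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_interim_csv(df, filename, base_dir, index=False):
    """Hypothetical stand-in for save_to_csv_interim; returns the path as a string."""
    target_dir = Path(base_dir) / "data" / "interim"
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folders if missing
    target = target_dir / filename
    df.to_csv(target, index=index)
    return str(target)


with tempfile.TemporaryDirectory() as tmp:
    saved = save_interim_csv(
        pd.DataFrame({"sales": [140.0]}), "e_commerce_clean.csv", tmp
    )
    reloaded = pd.read_csv(saved)
    print(reloaded)
```

CSV does not store dtypes, so `order_date` would need `parse_dates=['order_date']` in `pd.read_csv` to come back as a datetime column.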