<a href="https://colab.research.google.com/github/CardosoJr/bootcamp/blob/main/Labs/Lab_10%20-%20E2E%20ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab - End-to-end Machine Learning

## Dataset

We will work with the e-commerce dataset from [Olist](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce).

This dataset contains information on orders, deliveries, locations, reviews, prices, and more.

## Hypothesis

**Can we predict the rating a customer will give to the service?**

What could cause an order to be poorly rated?


1.   Late delivery
2.   The order arrived wrong or defective, or did not meet the customer's needs

## The Modeling Flow

This flow should be cyclical: we repeat the steps until the model reaches adequate performance.

1. Dataset Construction
   * Join / merge the tables to build an analytical base
   * Clean and preprocess the base
2. EDA
   * A thorough analysis of the data, which feeds feature engineering and modeling
3. Feature Engineering
   * Build new informative variables that help the model find patterns in the data
   * We can also perform feature selection, that is, remove uninformative variables that degrade model performance
4. Modeling
   * Train / test split of the data, so that overfitting can be detected on data the model has not seen
   * Build a baseline to beat
   * Build models suited to the problem at hand
5. Hyperparameter Tuning
   * Use the validation set to tune the hyperparameters of the candidate models
6. Model Selection
   * By this stage we should already have a primary performance metric to optimize. This metric should be tied to the business, meaning that improvements in the model metric imply improvements in business metrics
   * Choose the best model (the one with the best performance on the test set)
7. Performance Analysis
   * Given the chosen best model (or a short-list of the best models), it is important to understand how these models perform on the dataset
   * We will apply interpretability tools to understand the main features used
   * We will analyze the errors on the test set to check for possible biases. These analyses can feed a new iteration of this process, with new features and/or new models

In this notebook we will simplify some of these steps, such as the EDA. It is up to each reader to improve this pipeline and find better models.
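
As a rough illustration of steps 4 to 6, the sketch below splits a synthetic dataset into train, validation, and test sets and compares a trivial baseline against one candidate model. The data comes from scikit-learn's `make_classification`, not from Olist, and every `*_demo` name is an assumption introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the analytical base (NOT the Olist data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)

# Hold out the test set first, then carve a validation set out of the remainder
X_rest_demo, X_test_demo, y_rest_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest_demo, y_rest_demo, test_size=0.25, stratify=y_rest_demo, random_state=0
)

# Baseline to beat: always predict the most frequent class
baseline_demo = DummyClassifier(strategy="most_frequent").fit(X_train_demo, y_train_demo)
# One candidate model; any classifier could take its place
candidate_demo = LogisticRegression(max_iter=1000).fit(X_train_demo, y_train_demo)

# Compare both on the validation set; the test set stays untouched until the end
baseline_acc = accuracy_score(y_val_demo, baseline_demo.predict(X_val_demo))
candidate_acc = accuracy_score(y_val_demo, candidate_demo.predict(X_val_demo))
print(f"baseline accuracy:  {baseline_acc:.3f}")
print(f"candidate accuracy: {candidate_acc:.3f}")
```

Keeping the test set out of every tuning and comparison decision is what makes the final test score an honest estimate of performance on new data.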



In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 500)
import warnings

from pathlib import Path
import pickle
warnings.filterwarnings('ignore')

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

import os
import sys

sns.set_style('darkgrid')
# sns.set_context('talk')
sns.set_palette('rainbow')
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
os.chdir(str(Path("../")))

In [4]:
import src.data.preprocessing as pp
import src.data.problem_definition as problem
import src.data.split as split
import src.feature_engineering.engineering as eng
import src.feature_engineering.encoding as enc
import src.models.train_model as train

## Dataset Construction

### Reading

We will read all the dataset files (stored here as Parquet) and perform the necessary merges.

In [5]:
file_path = "./data/raw/"

In [6]:
df_o = pd.read_parquet(Path(file_path + 'olist_orders_dataset.pq'))
for c in ['order_purchase_timestamp', 'order_approved_at', 'order_delivered_carrier_date',
          'order_delivered_customer_date', 'order_estimated_delivery_date']:
    df_o[c] = pd.to_datetime(df_o[c])
df_oi = pd.read_parquet(Path(file_path + 'olist_order_items_dataset.pq'))
for c in ['shipping_limit_date']:
    df_oi[c] = pd.to_datetime(df_oi[c])
df_op = pd.read_parquet(Path(file_path + 'olist_order_payments_dataset.pq'))
df_or = pd.read_parquet(Path(file_path + 'olist_order_reviews_dataset.pq'))
df_p = pd.read_parquet(Path(file_path + 'olist_products_dataset.pq'))
df_s = pd.read_parquet(Path(file_path + 'olist_sellers_dataset.pq'))
df_c = pd.read_parquet(Path(file_path + 'olist_customers_dataset.pq'))
df_l = pd.read_parquet(Path(file_path + 'olist_geolocation_dataset.pq'))
df_pc = pd.read_parquet(Path(file_path + 'product_category_name_translation.pq'))

### Merges

In [7]:
df_oi, df_o = pp.merge_datasets(
    df_o = df_o,
    df_oi = df_oi,
    df_c = df_c,
    df_or = df_or,
    df_op = df_op,
    df_geo_grouped = pp.preprocess_geo(df_l),
    df_p = df_p,
    df_s = df_s,
)
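
The internals of `pp.merge_datasets` are not shown in this notebook. As a minimal sketch of the kind of join such a step typically relies on, the example below left-joins a toy orders table with a toy reviews table using `pandas.DataFrame.merge`. The toy frames and their values are invented for illustration.

```python
import pandas as pd

# Toy tables (hypothetical values); only the column names mirror the Olist schema
orders_demo = pd.DataFrame(
    {"order_id": ["a", "b", "c"], "customer_id": ["c1", "c2", "c3"]}
)
reviews_demo = pd.DataFrame({"order_id": ["a", "b"], "review_score": [5, 3]})

# A left join keeps every order, even those without a review.
# validate= raises if the key is unexpectedly duplicated on either side,
# and indicator= records where each row came from.
merged_demo = orders_demo.merge(
    reviews_demo,
    on="order_id",
    how="left",
    validate="one_to_one",
    indicator=True,
)
print(merged_demo)
```

Orders without a matching review end up with a missing `review_score`, which is one way missing values can enter the analytical base.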

### Initial Feature Engineering

Building some features that make sense for the problem.

In [8]:
df_oi = eng.create_sellers_features(df_oi)
df_oi = eng.create_product_features(df_oi)
df_oi = eng.simple_fillna(df_oi, ['product_category_name', 'seller_state'])
df_o = eng.create_logistics_features(df_o = df_o)
df_o = eng.create_order_items_features(df_o = df_o, df_oi = df_oi)
df_o = eng.grouping_states_with_low_representativity(df_o, 1e3)
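
The implementation of `eng.create_logistics_features` is not shown here. The sketch below is one plausible way to derive columns like `logistics_length`, `delay_length`, and `is_delayed` from the order timestamps. The function name `sketch_logistics_features` and the choice of `order_approved_at` as the starting point are assumptions. The first demo row reuses the timestamps of the first order in this notebook's `df_o.head()` output, and the second row is invented.

```python
import pandas as pd

def sketch_logistics_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the logistics feature step (not the lab's own code)."""
    out = df.copy()
    delivered = out["order_delivered_customer_date"]
    # Whole days from approval to delivery (.dt.days floors the timedelta)
    out["logistics_length"] = (delivered - out["order_approved_at"]).dt.days
    # Negative values mean the order arrived before the estimated date
    out["delay_length"] = (delivered - out["order_estimated_delivery_date"]).dt.days
    # Missing delays compare as False, so undelivered orders get is_delayed = 0
    out["is_delayed"] = (out["delay_length"] > 0).astype(int)
    return out

orders_ts_demo = pd.DataFrame(
    {
        "order_approved_at": pd.to_datetime(
            ["2017-10-02 11:07:15", "2018-01-01 09:00:00"]
        ),
        "order_delivered_customer_date": pd.to_datetime(
            ["2017-10-10 21:25:13", "2018-01-12 10:00:00"]
        ),
        "order_estimated_delivery_date": pd.to_datetime(["2017-10-18", "2018-01-10"]),
    }
)
logistics_demo = sketch_logistics_features(orders_ts_demo)
print(logistics_demo[["logistics_length", "delay_length", "is_delayed"]])
```

Both features are only known after delivery, so a model that uses them can score an order only once it has been delivered.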

In [9]:
df_o.head()

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state,payment_value,review_score,geolocation_zip_code_prefix,c_lat,c_lng,logistics_length,delay_length,is_delayed,customer_seller_distance,product_volume,product_weight_g,product_photos_qty,freight_value,order_item_id,shipping_limit_date,product_category_name,seller_state,s_total_volume,s_total_items,freight_ratio
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18,7c396fd4830fd04220f754e42b4e5bff,3149,sao paulo,SP,38.71,4.0,3149.0,-23.576983,-46.587161,8.0,-8.0,0,0.0,1976.0,500.0,4.0,8.72,1.0,2017-10-06 11:07:15,utilidades_domesticas,SP,2349.94,53.0,0.225265
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13,af07308b275d755c9edb36a90c618231,47813,barreiras,BA,141.46,4.0,47813.0,-12.177924,-44.660711,12.0,-6.0,0,0.0,4693.0,400.0,1.0,22.76,1.0,2018-07-30 03:24:27,perfumaria,SP,13544.95,126.0,0.160894
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04,3a653a41f6f9fc3d2a113cf8398680e8,75265,vianopolis,GO,179.12,5.0,75265.0,-16.74515,-48.514783,9.0,-18.0,0,0.0,9576.0,420.0,1.0,19.22,1.0,2018-08-13 08:55:23,automotivo,SP,229472.63,1156.0,0.107302
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15,7c142cf63193a1473d2e66489a9ae977,59296,sao goncalo do amarante,NE,72.2,5.0,59296.0,-5.77419,-35.271143,13.0,-13.0,0,0.0,6000.0,450.0,3.0,27.2,1.0,2017-11-23 19:45:59,pet_shop,MG,14362.3,156.0,0.376731
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26,72632f0f9dd73dfee390c9b22eb56dd6,9195,santo andre,SP,28.62,5.0,9195.0,-23.67637,-46.514627,2.0,-10.0,0,0.0,11475.0,250.0,4.0,8.72,1.0,2018-02-19 20:31:37,papelaria,SP,6109.44,174.0,0.304682


In [10]:
categorical_features = ['customer_state', 'product_category_name', 'seller_state']

### Cleaning

The dataset initially shows few missing values, so I decided to fill them in. This decision can be revisited.

In [11]:
drop_cols = ['order_id', 
             'order_item_id',
             'customer_id',
             'order_status',
             'order_purchase_timestamp',
             'order_approved_at', 
             'order_delivered_carrier_date',
             'customer_unique_id',
             'customer_zip_code_prefix',
             'customer_city',
             'geolocation_zip_code_prefix',
             'c_lat', 
             'c_lng',
             'shipping_limit_date',
             'order_delivered_customer_date',
             'order_estimated_delivery_date']
df_o.drop(columns = drop_cols, inplace = True)

In [12]:
df_o.head()

Unnamed: 0,customer_state,payment_value,review_score,logistics_length,delay_length,is_delayed,customer_seller_distance,product_volume,product_weight_g,product_photos_qty,freight_value,product_category_name,seller_state,s_total_volume,s_total_items,freight_ratio
0,SP,38.71,4.0,8.0,-8.0,0,0.0,1976.0,500.0,4.0,8.72,utilidades_domesticas,SP,2349.94,53.0,0.225265
1,BA,141.46,4.0,12.0,-6.0,0,0.0,4693.0,400.0,1.0,22.76,perfumaria,SP,13544.95,126.0,0.160894
2,GO,179.12,5.0,9.0,-18.0,0,0.0,9576.0,420.0,1.0,19.22,automotivo,SP,229472.63,1156.0,0.107302
3,NE,72.2,5.0,13.0,-13.0,0,0.0,6000.0,450.0,3.0,27.2,pet_shop,MG,14362.3,156.0,0.376731
4,SP,28.62,5.0,2.0,-10.0,0,0.0,11475.0,250.0,4.0,8.72,papelaria,SP,6109.44,174.0,0.304682


In [13]:
df_o.isna().mean() * 100

customer_state              0.000000
payment_value               0.001006
review_score                0.772317
logistics_length            2.995746
delay_length                2.981668
is_delayed                  0.000000
customer_seller_distance    0.779357
product_volume              0.779357
product_weight_g            0.779357
product_photos_qty          0.779357
freight_value               0.779357
product_category_name       0.779357
seller_state                0.779357
s_total_volume              0.779357
s_total_items               0.779357
freight_ratio               0.780362
dtype: float64
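
With missing rates this low, filling and dropping are both defensible. The sketch below contrasts the two options on a toy frame with invented values: numeric columns are filled with the column median, categorical ones with an `"unknown"` placeholder. It is not the lab's `eng.simple_fillna`, whose internals are not shown.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (hypothetical values)
missing_demo = pd.DataFrame(
    {
        "payment_value": [38.71, np.nan, 179.12, 72.20],
        "seller_state": ["SP", None, "SP", "MG"],
    }
)

demo_num_cols = missing_demo.select_dtypes(include="number").columns
demo_cat_cols = missing_demo.select_dtypes(exclude="number").columns

# Option 1: fill. Median for numeric columns, a placeholder for categorical ones
filled_demo = missing_demo.copy()
filled_demo[demo_num_cols] = filled_demo[demo_num_cols].fillna(
    filled_demo[demo_num_cols].median()
)
filled_demo[demo_cat_cols] = filled_demo[demo_cat_cols].fillna("unknown")

# Option 2: drop every row that has at least one missing value
dropped_demo = missing_demo.dropna()

print(filled_demo)
print(len(dropped_demo), "rows kept after dropna")
```

In a real pipeline the fill statistics should be computed on the training rows only and then reused on the test rows.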

### Target Definition

We define the target as predicting whether an order received 5 stars, which turns the problem into binary classification.

In [16]:
p = problem.BinaryClassProblem(target = "review_score", data = df_o)

In [17]:
X = p.get_data()
y = p.get_target()
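
How `problem.BinaryClassProblem` builds the target is not shown. Assuming it follows the definition above (5 stars versus everything else), the transformation can be sketched in plain pandas. The toy scores are invented, and dropping unreviewed orders is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Toy review scores, including one order without a review (hypothetical values)
scores_demo = pd.Series([5.0, 4.0, np.nan, 1.0, 5.0], name="review_score")

# Orders without a review carry no label, so they are dropped here (assumption)
labeled_demo = scores_demo.dropna()

# 1 if the order received 5 stars, 0 otherwise
demo_target = (labeled_demo == 5).astype(int).rename("target")

print(demo_target.tolist())
print("positive rate:", demo_target.mean())
```

Checking the positive rate right after defining the target shows how imbalanced the problem is, which matters when reading accuracy later on.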

## Baseline Model

Splitting the dataset into train, validation, and test sets and testing an XGBoost model.

In [15]:
from sklearn.model_selection import KFold
import category_encoders as ce
import xgboost
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [16]:
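# NOTE: `df` is not defined in the cells shown above; it is assumed to be the
# modeling frame (features plus a `target` column) built from `X` and `y`.
df.dropna(inplace = True)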

In [49]:
splitter = split.SimpleSplit(test_size = 0.3, validation_size = None)
idx_train, idx_val, idx_test = splitter.fit_transform(data = df)
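
`split.SimpleSplit` belongs to the lab's own `src` package and its internals are not shown. An index-based split with the same 70/30 proportion can be sketched with scikit-learn's `train_test_split`. The toy frame and the `demo_*` names are assumptions, and stratifying on the target is a choice of this sketch, not necessarily what `SimpleSplit` does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a balanced binary target (hypothetical values)
split_demo = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})

# Split the index labels rather than the rows, so the same labels can be
# reused later with .loc on any frame that shares this index
demo_idx_train, demo_idx_test = train_test_split(
    split_demo.index.to_numpy(),
    test_size=0.3,
    stratify=split_demo["target"],
    random_state=42,
)

train_rows_demo = split_demo.loc[demo_idx_train]
test_rows_demo = split_demo.loc[demo_idx_test]
print(len(train_rows_demo), "train rows /", len(test_rows_demo), "test rows")
```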

In [70]:
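# Treat every object-typed column (except the target) as categorical
cat_cols = [x[0] for x in df.dtypes.items() if x[1] == 'object' and x[0] != 'target']

# CatBoost-style target encoding for the categorical columns
encoder = enc.Encoding(convert_structure = {'catboost' : {'cols' : cat_cols, 'target' : 'target'}})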

In [71]:
# Fit the encoder on the training rows only, so no test information leaks into the encoding
encoder.fit(df.loc[idx_train])
df.loc[idx_train] = encoder.transform(df.loc[idx_train])
df.loc[idx_test] = encoder.transform(df.loc[idx_test])
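
To make the fit-on-train, transform-on-test pattern concrete, the sketch below implements a smoothed mean target encoding in plain pandas. This is a simplified relative of CatBoost encoding, not the ordered scheme CatBoost itself uses, and not the lab's `enc.Encoding` class. The function names, the `smoothing` parameter, and the toy data are all assumptions.

```python
import pandas as pd

def fit_target_encoding(train_df, col, target, smoothing=10.0):
    """Learn a smoothed per-category mean of the target from training rows only."""
    prior = train_df[target].mean()
    stats = train_df.groupby(col)[target].agg(["sum", "count"])
    # Rare categories are pulled toward the global prior
    mapping = (stats["sum"] + prior * smoothing) / (stats["count"] + smoothing)
    return mapping, prior

def apply_target_encoding(frame, col, mapping, prior):
    # Categories never seen during fit fall back to the global prior
    return frame[col].map(mapping).fillna(prior)

# Toy data (hypothetical values)
enc_train_demo = pd.DataFrame(
    {"seller_state": ["SP", "SP", "SP", "MG"], "target": [1, 1, 0, 0]}
)
enc_test_demo = pd.DataFrame({"seller_state": ["SP", "RJ"]})

mapping_demo, prior_demo = fit_target_encoding(
    enc_train_demo, "seller_state", "target", smoothing=1.0
)
encoded_test_demo = apply_target_encoding(
    enc_test_demo, "seller_state", mapping_demo, prior_demo
)
print(encoded_test_demo.tolist())
```

Because the mapping is learned from training rows alone, the test rows never influence their own encoded values.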