# Introduction: Segment customers of an e-commerce site
In this notebook, we segment the customers of [Olist](https://olist.com/pt-br/) so that its e-commerce teams can use the segmentation in their communication campaigns. Our goal is to understand the different types of users through their behavior and their personal data.

## Import
We will use a standard data science stack: `numpy`, `pandas`, `sklearn`, and `matplotlib`, plus `seaborn` and `missingno` for plotting.

In [1]:
# Data manipulation
import numpy as np
import pandas as pd

# sklearn preprocessing to handle categorical variables
from sklearn.preprocessing import LabelEncoder

# File system management
import os

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# matplotlib and seaborn for plotting, missingno for visualizing missing values
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Importing the dataset
First, we list all the available data files. There are 9 files in total, organized as follows:
![image](https://i.imgur.com/HRhd2Y0.png)

In [2]:
print(os.listdir("./archive/"))

['olist_sellers_dataset.csv', 'product_category_name_translation.csv', 'olist_orders_dataset.csv', 'olist_order_items_dataset.csv', 'olist_customers_dataset.csv', 'olist_geolocation_dataset.csv', 'olist_order_payments_dataset.csv', 'olist_order_reviews_dataset.csv', 'olist_products_dataset.csv']


We load the files whose data will be useful for the customer clustering study.

In [3]:
olist_orders_dataset = pd.read_csv('./archive/olist_orders_dataset.csv', sep=',')
olist_sellers_dataset = pd.read_csv('./archive/olist_sellers_dataset.csv', sep=',')
product_category_name_translation = pd.read_csv('./archive/product_category_name_translation.csv', sep=',')
olist_order_items_dataset = pd.read_csv('./archive/olist_order_items_dataset.csv', sep=',')
olist_customers_dataset = pd.read_csv('./archive/olist_customers_dataset.csv', sep=',')
olist_geolocation_dataset = pd.read_csv('./archive/olist_geolocation_dataset.csv', sep=',')
olist_order_payments_dataset = pd.read_csv('./archive/olist_order_payments_dataset.csv', sep=',')
olist_order_reviews_dataset = pd.read_csv('./archive/olist_order_reviews_dataset.csv', sep=',')
olist_products_dataset = pd.read_csv('./archive/olist_products_dataset.csv', sep=',')

The key `geolocation_zip_code_prefix`, which is used to merge the dataframes, maps to several geolocations for the same value. We can simplify the merge by keeping a single location per key.

In [4]:
olist_geolocation_dataset = olist_geolocation_dataset.drop_duplicates(subset=['geolocation_zip_code_prefix'])
olist_geolocation_dataset

Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state
0,1037,-23.545621,-46.639292,sao paulo,SP
1,1046,-23.546081,-46.644820,sao paulo,SP
3,1041,-23.544392,-46.639499,sao paulo,SP
4,1035,-23.541578,-46.641607,sao paulo,SP
5,1012,-23.547762,-46.635361,são paulo,SP
...,...,...,...,...,...
999774,99955,-28.107588,-52.144019,vila langaro,RS
999780,99970,-28.345143,-51.876926,ciriaco,RS
999786,99910,-27.863500,-52.084760,floriano peixoto,RS
999803,99920,-27.858716,-52.300403,erebango,RS
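`drop_duplicates` keeps the first row encountered for each prefix, so the retained coordinates depend on the order of the file. As a minimal sketch of one possible alternative, the coordinates could be averaged per prefix instead. The `toy_geo` frame and its values below are made up for illustration and are not taken from the Olist files.

```python
import pandas as pd

# Toy data (hypothetical values): two rows share the prefix 1037.
toy_geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1037, 1037, 1046],
    "geolocation_lat": [-23.50, -23.60, -23.55],
    "geolocation_lng": [-46.60, -46.70, -46.64],
    "geolocation_city": ["sao paulo", "são paulo", "sao paulo"],
    "geolocation_state": ["SP", "SP", "SP"],
})

# Approach used in the notebook: keep the first row seen for each prefix.
first_per_prefix = toy_geo.drop_duplicates(subset=["geolocation_zip_code_prefix"])

# Alternative sketch: average the coordinates and keep the first city/state label.
mean_per_prefix = (
    toy_geo.groupby("geolocation_zip_code_prefix", as_index=False)
    .agg(
        geolocation_lat=("geolocation_lat", "mean"),
        geolocation_lng=("geolocation_lng", "mean"),
        geolocation_city=("geolocation_city", "first"),
        geolocation_state=("geolocation_state", "first"),
    )
)
print(mean_per_prefix)
```

Both approaches leave one row per prefix; they differ only in which coordinates represent it.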


We merge the dataframes to simplify the analysis of the individuals and their characteristics.

In [5]:
# pd.merge defaults to an inner join: rows without a match in the other table are dropped
olist_orders_dataset = pd.merge(olist_orders_dataset, olist_order_payments_dataset, on='order_id')
olist_orders_dataset = pd.merge(olist_orders_dataset, olist_order_reviews_dataset, on='order_id')
olist_orders_dataset = pd.merge(olist_orders_dataset, olist_order_items_dataset, on='order_id')
olist_orders_dataset = pd.merge(olist_orders_dataset, olist_products_dataset, on='product_id')
olist_orders_dataset = pd.merge(olist_orders_dataset, olist_sellers_dataset, on='seller_id')
# Seller location (these columns get the _x suffix after the second geolocation merge)
olist_orders_dataset = pd.merge(olist_orders_dataset, olist_geolocation_dataset, left_on='seller_zip_code_prefix',
                                 right_on='geolocation_zip_code_prefix')
olist_orders_dataset = pd.merge(olist_orders_dataset, olist_customers_dataset, on='customer_id')
# Customer location (these columns get the _y suffix)
olist_orders_dataset = pd.merge(olist_orders_dataset, olist_geolocation_dataset, left_on='customer_zip_code_prefix',
                                 right_on='geolocation_zip_code_prefix')

In [6]:
olist_orders_dataset = olist_orders_dataset.drop(columns=['geolocation_zip_code_prefix_x', 'geolocation_city_x', 'geolocation_state_x',
                                                          'geolocation_zip_code_prefix_y', 'geolocation_state_y'])
pd.set_option('display.max_columns', None)  # show all columns
olist_orders_dataset

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,payment_sequential,payment_type,payment_installments,payment_value,review_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm,seller_zip_code_prefix,seller_city,seller_state,geolocation_lat_x,geolocation_lng_x,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state,geolocation_lat_y,geolocation_lng_y,geolocation_city_y
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,1,credit_card,1,18.12,a54f0611adc9ed256b57ede6b6eb5114,4,,"Não testei o produto ainda, mas ele veio corre...",2017-10-11 00:00:00,2017-10-12 03:43:48,1,87285b34884572647811a353c7ac498a,3504c0cb71d7fa48d967e0e4c94d59d9,2017-10-06 11:07:15,29.99,8.72,utilidades_domesticas,40.0,268.0,4.0,500.0,19.0,8.0,13.0,9350,maua,SP,-23.680114,-46.452454,7c396fd4830fd04220f754e42b4e5bff,3149,sao paulo,SP,-23.574809,-46.587471,sao paulo
1,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,3,voucher,1,2.00,a54f0611adc9ed256b57ede6b6eb5114,4,,"Não testei o produto ainda, mas ele veio corre...",2017-10-11 00:00:00,2017-10-12 03:43:48,1,87285b34884572647811a353c7ac498a,3504c0cb71d7fa48d967e0e4c94d59d9,2017-10-06 11:07:15,29.99,8.72,utilidades_domesticas,40.0,268.0,4.0,500.0,19.0,8.0,13.0,9350,maua,SP,-23.680114,-46.452454,7c396fd4830fd04220f754e42b4e5bff,3149,sao paulo,SP,-23.574809,-46.587471,sao paulo
2,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,2,voucher,1,18.59,a54f0611adc9ed256b57ede6b6eb5114,4,,"Não testei o produto ainda, mas ele veio corre...",2017-10-11 00:00:00,2017-10-12 03:43:48,1,87285b34884572647811a353c7ac498a,3504c0cb71d7fa48d967e0e4c94d59d9,2017-10-06 11:07:15,29.99,8.72,utilidades_domesticas,40.0,268.0,4.0,500.0,19.0,8.0,13.0,9350,maua,SP,-23.680114,-46.452454,7c396fd4830fd04220f754e42b4e5bff,3149,sao paulo,SP,-23.574809,-46.587471,sao paulo
3,70b35acffdf851e782ebf6fbc35eb620,8e8ee9b08afb49b080d193f98b0505af,delivered,2018-03-22 17:23:21,2018-03-22 18:05:36,2018-03-23 18:03:03,2018-03-25 17:22:41,2018-04-04 00:00:00,1,credit_card,2,223.38,3cd186b6013f4145b9bd406847b61f19,5,,Nâo sabia da entrega aos domingos pelo correio...,2018-03-26 00:00:00,2018-03-27 02:21:27,1,6cc44821f36f3156c782da72dd634e47,da8622b14eb17ae2831f4ac5b9dab84a,2018-03-28 18:05:36,99.90,11.79,cama_mesa_banho,55.0,273.0,1.0,1050.0,38.0,10.0,38.0,13405,piracicaba,SP,-22.716839,-47.657366,8a4002923e801e3120a11070fd31c9e2,3149,sao paulo,SP,-23.574809,-46.587471,sao paulo
4,70b35acffdf851e782ebf6fbc35eb620,8e8ee9b08afb49b080d193f98b0505af,delivered,2018-03-22 17:23:21,2018-03-22 18:05:36,2018-03-23 18:03:03,2018-03-25 17:22:41,2018-04-04 00:00:00,1,credit_card,2,223.38,3cd186b6013f4145b9bd406847b61f19,5,,Nâo sabia da entrega aos domingos pelo correio...,2018-03-26 00:00:00,2018-03-27 02:21:27,2,6cc44821f36f3156c782da72dd634e47,da8622b14eb17ae2831f4ac5b9dab84a,2018-03-28 18:05:36,99.90,11.79,cama_mesa_banho,55.0,273.0,1.0,1050.0,38.0,10.0,38.0,13405,piracicaba,SP,-22.716839,-47.657366,8a4002923e801e3120a11070fd31c9e2,3149,sao paulo,SP,-23.574809,-46.587471,sao paulo
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
116745,99a3d4b1228bc02abf9eec5d9a7a742d,0f71a0b83b78a05118efc4aae3c2a774,delivered,2018-07-17 21:05:10,2018-07-17 21:15:20,2018-07-19 13:53:00,2018-07-26 16:42:01,2018-08-13 00:00:00,2,voucher,1,20.00,c7e023e460c5eff6814409f97e67d762,5,,,2018-07-27 00:00:00,2018-07-27 21:35:24,1,a2ead66678565eb7679dfdf1fa4cbe43,e0a3c37054ade6478d923a6efc10f308,2018-07-23 21:15:20,109.90,19.96,bebes,60.0,2648.0,6.0,800.0,32.0,11.0,27.0,88340,camboriu,SC,-27.032579,-48.658232,99ece6acd6265aa21fee1ae7831e3e89,3050,sao paulo,SP,-23.542495,-46.613578,sao paulo
116746,8937c3e485f73f480931feaca88a35cb,ecfea8bfe1a00c6b4bdac9d7524efce3,processing,2017-02-16 19:56:08,2017-02-18 19:50:19,,,2017-03-29 00:00:00,1,credit_card,8,295.99,4279a5882a3519c9d21805248ed7ffa4,1,,Meu produto foi comprado a mais de 1 mês e não...,2017-04-02 00:00:00,2017-04-02 12:09:33,1,c347971e06135135a97fd33d9db5b74c,9a208dee8f95cfdf00760c4d627828ec,2017-02-22 18:56:08,269.90,26.09,beleza_saude,30.0,293.0,1.0,1133.0,30.0,18.0,23.0,26125,belford roxo,RJ,-22.728758,-43.402930,ab0325ea50327c1d7cab7dd30e5c27cb,58690,livramento,PB,-7.379848,-36.942182,livramento
116747,4442d1fdf454197e9e141f0d83a9031e,3c3c45651f50bb4b13a4de268cad02b0,delivered,2017-02-03 18:08:16,2017-02-04 07:01:55,2017-02-20 10:49:53,2017-03-01 11:24:45,2017-03-06 00:00:00,1,boleto,1,1323.79,cb44c661da8d40b4af9f8ef4b2de292e,5,,"muito boa a informação pós compra, para saber ...",2017-03-02 00:00:00,2017-03-03 11:26:49,1,7cb009e2ae1cdf7d16e8fbf0255ba953,45a3d05fb00435e52a28859dd03703b3,2017-02-07 18:08:16,1299.00,24.79,brinquedos,47.0,662.0,2.0,10800.0,55.0,18.0,19.0,7176,guarulhos,SP,-23.409499,-46.396566,9aed28a7d5e85a3f48fd4a23647dc8c4,97575,santana do livramento,RS,-30.891886,-55.484419,santana do livramento
116748,8fec1057df930bcc1be865fbfef606b4,1964a80794004a2f2456740630cc3b9d,delivered,2018-07-26 20:18:26,2018-07-27 08:24:21,2018-07-30 14:44:00,2018-08-08 20:06:36,2018-08-20 00:00:00,1,debit_card,1,258.56,ab2741a851e1c6d2626f21ce8f085770,5,,Produto excelente como eu esperava.,2018-08-09 00:00:00,2018-08-13 12:07:07,1,0af2b3833420630d9d1eacc2daf2bdf9,8c9348f33ae3dada25c99c99ade2af78,2018-08-02 08:24:21,230.00,28.56,sinalizacao_e_seguranca,60.0,915.0,3.0,1900.0,30.0,21.0,21.0,6340,carapicuiba,SP,-23.565739,-46.823220,57dd3bbd8b010e66ef1c8f4d114ddac6,45750,itape,BA,-14.894157,-39.428203,itape
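The first three rows of the merged output share the same `order_id` and differ only in their payment fields. This happens because an order with several payment lines matches several rows in the payments table, and the merge emits one row per match. The sketch below reproduces the effect on small toy frames and shows one possible way to guard against it with the `validate` argument of `pd.merge`. The frames `orders` and `payments` are hypothetical stand-ins, and aggregating payments per order is an assumption about what the analysis needs, not the notebook's method.

```python
import pandas as pd

# Toy frames (hypothetical): order "a" has three payment lines, order "b" has none.
orders = pd.DataFrame({"order_id": ["a", "b"],
                       "order_status": ["delivered", "delivered"]})
payments = pd.DataFrame({
    "order_id": ["a", "a", "a"],
    "payment_type": ["credit_card", "voucher", "voucher"],
    "payment_value": [18.12, 2.00, 18.59],
})

# Default inner join: "a" appears three times and "b" disappears.
merged = pd.merge(orders, payments, on="order_id")

# validate raises MergeError when the keys do not have the declared relationship.
try:
    pd.merge(orders, payments, on="order_id", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

# Aggregating payments first keeps one row per order; how="left" keeps order "b".
payment_totals = payments.groupby("order_id", as_index=False)["payment_value"].sum()
one_row_per_order = pd.merge(orders, payment_totals, on="order_id",
                             how="left", validate="one_to_one")
print(one_row_per_order)
```

Whether duplicated rows matter depends on the later aggregation: summing `price` over the merged frame would count an item once per payment line.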


### Exploratory analysis
 - study the distribution of values to reveal outliers
 - study missing values
 - translate the product category
 - show the number of customers
 - show the number of sellers
 - show the concentration of customers by region
 - show the concentration of sellers by region
 - show the concentration relative to the population of each region
 - show which regions bring in revenue
 - show the delay between purchase and delivery and the product rating
 - show a customer's average basket
 - show the shipping cost as a function of the parcel volume
 - show the distance between customer and seller and the delivery delay
 - show the number of customers paying by credit and the average amount
 - show the purchase dates
 - show the number of repeat customers
 - show the delay between successive purchases
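
Three of the items above (delivery delay, customer-to-seller distance, repeat customers) can be sketched on a small toy frame that reuses the column names of the merged dataframe. The `toy` rows and the `haversine_km` helper are illustrative assumptions: the customer identifiers are made up, and the great-circle formula is one common choice for distance, not necessarily the one the analysis will use.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))  # 6371 km: mean Earth radius

# Toy rows (hypothetical customers); _x columns are the seller, _y the customer.
toy = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "customer_unique_id": ["c1", "c1", "c2"],
    "order_purchase_timestamp": ["2017-10-02 10:56:33", "2018-03-22 17:23:21",
                                 "2018-07-26 20:18:26"],
    "order_delivered_customer_date": ["2017-10-10 21:25:13", "2018-03-25 17:22:41",
                                      None],
    "geolocation_lat_x": [-23.680114, -22.716839, -23.565739],
    "geolocation_lng_x": [-46.452454, -47.657366, -46.823220],
    "geolocation_lat_y": [-23.574809, -23.574809, -14.894157],
    "geolocation_lng_y": [-46.587471, -46.587471, -39.428203],
})

# Delivery delay in whole days; undelivered orders give NaN.
purchased = pd.to_datetime(toy["order_purchase_timestamp"])
delivered = pd.to_datetime(toy["order_delivered_customer_date"])
toy["delivery_days"] = (delivered - purchased).dt.days

# Distance between seller and customer.
toy["distance_km"] = haversine_km(toy["geolocation_lat_x"], toy["geolocation_lng_x"],
                                  toy["geolocation_lat_y"], toy["geolocation_lng_y"])

# Repeat customers: more than one distinct order per customer_unique_id.
orders_per_customer = toy.groupby("customer_unique_id")["order_id"].nunique()
n_repeat_customers = int((orders_per_customer > 1).sum())
print(toy[["order_id", "delivery_days", "distance_km"]], n_repeat_customers)
```

Counting distinct `order_id` values rather than rows matters here, because the merged dataframe repeats an order once per item and per payment line.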


Additional test