# Segmenting the customers of an e-commerce site
In this notebook, we segment the customers of [Olist](https://olist.com/pt-br/) so that its e-commerce teams can use the segmentation in their communication campaigns. Our goal is to understand the different types of users through their behavior and their personal data.  


![Olist](https://user.oc-static.com/upload/2019/02/24/15510251487267_Capture%20d%E2%80%99e%CC%81cran%202019-02-20%20a%CC%80%2017.37.38.png)  


Olist operates an e-commerce site aimed at sellers, connecting merchants and their products with Brazil's main marketplaces. Olist was founded in 2015 by Tiago Dalvi and is headquartered in Curitiba, Paraná, Brazil. Its flagship product, Olist Store, gives merchants a way to manage product listings, logistics, and store payments. It also offers "a unique selling experience" across channels such as Mercado Livre, B2W, and Via Varejo. Olist has attracted more than 200,000 users in 180 countries and counts more than 45,000 merchants and retailers among its customers.

## Using unsupervised methods to group customers with similar profiles

### Introduction

In a constantly evolving digital landscape, understanding the different types of customers is essential for e-commerce businesses.  

This study aims to categorize the users of the [Olist](https://olist.com/pt-br/) platform according to their behavior and their personal data. By analyzing these parameters, we seek to identify the trends, motivations, and preferences behind customer interactions, in order to improve their experience on the [Olist](https://olist.com/pt-br/) site and the value they get from it.  

This fine-grained segmentation should make it possible to adapt the platform's marketing strategy, improve the personalization of offers, and target campaigns effectively, which in turn should strengthen customer satisfaction and loyalty. The study gives a detailed overview of the different user cohorts and provides insights to guide the strategic direction of the e-commerce platform.

## Table of contents


#### [1. Dataset cleaning](#1.-Nettoyage-du-Dataset)
#### - [A. Handling outliers](#A.-Traitement-des-valeurs-aberrantes)
#### - [B. Handling missing values](#B.-Traitement-des-valeurs-manquantes)

#### [2. Exploratory analysis of the dataset](#2.-Analyse-exploratoire-du-Dataset)
#### - [A. Univariate analysis](#A.-Analyse-univariées)
#### - [B. Bivariate analysis](#B.-Analyse-bivariées)
#### - [C. Multivariate analysis](#C.-Analyse-multivariées)

## Imports
We use a typical data science stack: `numpy`, `pandas`, `sklearn`, `matplotlib`, along with `seaborn` and `missingno` for visualization.

In [25]:
# Data manipulation
import numpy as np
import pandas as pd

# sklearn preprocessing, used to encode categorical variables
from sklearn.preprocessing import LabelEncoder

# File system handling
import os

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# matplotlib and seaborn for plots, missingno for visualizing missing values
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Importing the dataset
First, we can list all the available data files. There are 9 files in total, organized as follows:
![image](https://i.imgur.com/HRhd2Y0.png)

In [26]:
print(os.listdir("./archive/"))

['olist_sellers_dataset.csv', 'product_category_name_translation.csv', 'olist_orders_dataset.csv', 'olist_order_items_dataset.csv', 'olist_customers_dataset.csv', 'olist_geolocation_dataset.csv', 'olist_order_payments_dataset.csv', 'olist_order_reviews_dataset.csv', 'olist_products_dataset.csv']


We import the files whose data will be useful for clustering the individuals.

In [27]:
olist_orders_dataset = pd.read_csv('./archive/olist_orders_dataset.csv', sep=',')
olist_sellers_dataset = pd.read_csv('./archive/olist_sellers_dataset.csv', sep=',')
product_category_name_translation = pd.read_csv('./archive/product_category_name_translation.csv', sep=',')
olist_order_items_dataset = pd.read_csv('./archive/olist_order_items_dataset.csv', sep=',')
olist_customers_dataset = pd.read_csv('./archive/olist_customers_dataset.csv', sep=',')
olist_geolocation_dataset = pd.read_csv('./archive/olist_geolocation_dataset.csv', sep=',')
olist_order_payments_dataset = pd.read_csv('./archive/olist_order_payments_dataset.csv', sep=',')
olist_order_reviews_dataset = pd.read_csv('./archive/olist_order_reviews_dataset.csv', sep=',')
olist_products_dataset = pd.read_csv('./archive/olist_products_dataset.csv', sep=',')

The key `geolocation_zip_code_prefix`, used to merge the dataframes, maps to several geolocations for a single key value. We can simplify the merge by assigning a single location to each key. Note that `drop_duplicates` keeps the first row found for each prefix.

In [28]:
olist_geolocation_dataset = olist_geolocation_dataset.drop_duplicates(subset=['geolocation_zip_code_prefix'])
olist_geolocation_dataset

Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state
0,1037,-23.545621,-46.639292,sao paulo,SP
1,1046,-23.546081,-46.644820,sao paulo,SP
3,1041,-23.544392,-46.639499,sao paulo,SP
4,1035,-23.541578,-46.641607,sao paulo,SP
5,1012,-23.547762,-46.635361,são paulo,SP
...,...,...,...,...,...
999774,99955,-28.107588,-52.144019,vila langaro,RS
999780,99970,-28.345143,-51.876926,ciriaco,RS
999786,99910,-27.863500,-52.084760,floriano peixoto,RS
999803,99920,-27.858716,-52.300403,erebango,RS
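
Because `drop_duplicates` keeps whichever row comes first, the coordinates retained for a prefix depend on row order. One possible alternative, sketched here on a made-up toy frame, is to average the coordinates per prefix. The values and the name `geo_centroids` are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for olist_geolocation_dataset (values are made up).
geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1037, 1037, 1046],
    "geolocation_lat": [-23.50, -23.60, -23.55],
    "geolocation_lng": [-46.60, -46.70, -46.64],
    "geolocation_city": ["sao paulo", "são paulo", "sao paulo"],
    "geolocation_state": ["SP", "SP", "SP"],
})

# One row per prefix: mean latitude/longitude, first city/state label seen.
geo_centroids = (
    geo.groupby("geolocation_zip_code_prefix", as_index=False)
       .agg(
           geolocation_lat=("geolocation_lat", "mean"),
           geolocation_lng=("geolocation_lng", "mean"),
           geolocation_city=("geolocation_city", "first"),
           geolocation_state=("geolocation_state", "first"),
       )
)
print(geo_centroids)
```

Averaging gives a point that represents the whole prefix area, at the cost of a slightly longer preprocessing step than `drop_duplicates`.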


We merge the different dataframes to simplify our analysis of the individuals and their characteristics.

In [29]:
olist_orders_dataset = pd.merge(olist_orders_dataset, olist_order_payments_dataset, on='order_id')
olist_orders_dataset = pd.merge(olist_orders_dataset, olist_order_reviews_dataset, on='order_id')
olist_orders_dataset = pd.merge(olist_orders_dataset, olist_order_items_dataset, on='order_id')
olist_orders_dataset = pd.merge(olist_orders_dataset, olist_products_dataset, on='product_id')
olist_orders_dataset = pd.merge(olist_orders_dataset, olist_sellers_dataset, on='seller_id')
olist_orders_dataset = pd.merge(olist_orders_dataset, olist_geolocation_dataset, left_on='seller_zip_code_prefix',\
                                 right_on='geolocation_zip_code_prefix')
olist_orders_dataset = pd.merge(olist_orders_dataset, olist_customers_dataset, on='customer_id')
olist_orders_dataset = pd.merge(olist_orders_dataset, olist_geolocation_dataset, left_on='customer_zip_code_prefix',\
                                 right_on='geolocation_zip_code_prefix')

In [30]:
olist_orders_dataset = olist_orders_dataset.drop(columns=['geolocation_zip_code_prefix_x', 'geolocation_city_x', 'geolocation_state_x', \
                                                          'geolocation_zip_code_prefix_y', 'geolocation_state_y'])
pd.set_option('display.max_columns', None)  # display all columns

We also go through a translation step to make the data easier to understand.

In [31]:
olist_orders_dataset = pd.merge(olist_orders_dataset, product_category_name_translation, on='product_category_name')
olist_orders_dataset = olist_orders_dataset.drop(columns=['product_category_name'])
olist_orders_dataset

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,payment_sequential,payment_type,payment_installments,payment_value,review_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm,seller_zip_code_prefix,seller_city,seller_state,geolocation_lat_x,geolocation_lng_x,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state,geolocation_lat_y,geolocation_lng_y,geolocation_city_y,product_category_name_english
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,1,credit_card,1,18.12,a54f0611adc9ed256b57ede6b6eb5114,4,,"Não testei o produto ainda, mas ele veio corre...",2017-10-11 00:00:00,2017-10-12 03:43:48,1,87285b34884572647811a353c7ac498a,3504c0cb71d7fa48d967e0e4c94d59d9,2017-10-06 11:07:15,29.99,8.72,40.0,268.0,4.0,500.0,19.0,8.0,13.0,9350,maua,SP,-23.680114,-46.452454,7c396fd4830fd04220f754e42b4e5bff,3149,sao paulo,SP,-23.574809,-46.587471,sao paulo,housewares
1,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,3,voucher,1,2.00,a54f0611adc9ed256b57ede6b6eb5114,4,,"Não testei o produto ainda, mas ele veio corre...",2017-10-11 00:00:00,2017-10-12 03:43:48,1,87285b34884572647811a353c7ac498a,3504c0cb71d7fa48d967e0e4c94d59d9,2017-10-06 11:07:15,29.99,8.72,40.0,268.0,4.0,500.0,19.0,8.0,13.0,9350,maua,SP,-23.680114,-46.452454,7c396fd4830fd04220f754e42b4e5bff,3149,sao paulo,SP,-23.574809,-46.587471,sao paulo,housewares
2,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,2,voucher,1,18.59,a54f0611adc9ed256b57ede6b6eb5114,4,,"Não testei o produto ainda, mas ele veio corre...",2017-10-11 00:00:00,2017-10-12 03:43:48,1,87285b34884572647811a353c7ac498a,3504c0cb71d7fa48d967e0e4c94d59d9,2017-10-06 11:07:15,29.99,8.72,40.0,268.0,4.0,500.0,19.0,8.0,13.0,9350,maua,SP,-23.680114,-46.452454,7c396fd4830fd04220f754e42b4e5bff,3149,sao paulo,SP,-23.574809,-46.587471,sao paulo,housewares
3,128e10d95713541c87cd1a2e48201934,a20e8105f23924cd00833fd87daa0831,delivered,2017-08-15 18:29:31,2017-08-15 20:05:16,2017-08-17 15:28:33,2017-08-18 14:44:43,2017-08-28 00:00:00,1,credit_card,3,37.77,b46f1e34512b0f4c74a72398b03ca788,4,,Deveriam embalar melhor o produto. A caixa vei...,2017-08-19 00:00:00,2017-08-20 15:16:36,1,87285b34884572647811a353c7ac498a,3504c0cb71d7fa48d967e0e4c94d59d9,2017-08-21 20:05:16,29.99,7.78,40.0,268.0,4.0,500.0,19.0,8.0,13.0,9350,maua,SP,-23.680114,-46.452454,3a51803cc0d012c3b5dc8b7528cb05f7,3366,sao paulo,SP,-23.565578,-46.534603,sao paulo,housewares
4,0e7e841ddf8f8f2de2bad69267ecfbcf,26c7ac168e1433912a51b924fbd34d34,delivered,2017-08-02 18:24:47,2017-08-02 18:43:15,2017-08-04 17:35:43,2017-08-07 18:30:01,2017-08-15 00:00:00,1,credit_card,1,37.77,dc90f19c2806f1abba9e72ad3c350073,5,,"Só achei ela pequena pra seis xícaras ,mais é ...",2017-08-08 00:00:00,2017-08-08 23:26:23,1,87285b34884572647811a353c7ac498a,3504c0cb71d7fa48d967e0e4c94d59d9,2017-08-08 18:37:31,29.99,7.78,40.0,268.0,4.0,500.0,19.0,8.0,13.0,9350,maua,SP,-23.680114,-46.452454,ef0996a1a279c26e7ecbd737be23d235,2290,sao paulo,SP,-23.543295,-46.630743,sao paulo,housewares
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115032,25af0443015b8d200489039a00361f2e,1674ec99b39d56ec2aa2329be6a79864,delivered,2017-08-22 16:53:59,2017-08-22 17:32:43,2017-08-25 21:11:35,2017-08-29 13:04:49,2017-09-12 00:00:00,1,credit_card,1,54.94,07d74068b22311efab32f86a06d595bd,5,,,2017-08-30 00:00:00,2017-08-30 17:03:25,1,1dceebcc5f23c02ea23e16d5bedca000,4e922959ae960d389249c378d1c939f5,2017-08-31 17:06:36,45.00,9.94,47.0,117.0,2.0,550.0,35.0,15.0,25.0,12327,jacarei,SP,-23.298186,-45.974828,0829b9a7bb7f0ac382c994b6e43aacd3,9693,sao bernardo do campo,SP,-23.669153,-46.579037,sao bernardo do campo,cds_dvds_musicals
115033,742a36775534b588ed2a62ba4c2d9cd7,b3137029b6e1d3d2e172a8725c0d3e5b,delivered,2017-07-28 09:45:54,2017-07-29 02:23:30,2017-07-31 14:27:28,2017-08-12 11:33:14,2017-08-17 00:00:00,1,boleto,1,54.94,5b96ca8d3c4d4a99cb9c4ab5d0415370,4,,"Ótimo, entregaram na datacert",2017-08-13 00:00:00,2017-08-13 20:06:09,1,1dceebcc5f23c02ea23e16d5bedca000,4e922959ae960d389249c378d1c939f5,2017-08-08 02:23:30,45.00,9.94,47.0,117.0,2.0,550.0,35.0,15.0,25.0,12327,jacarei,SP,-23.298186,-45.974828,d4b56c2e202ff3096477f92a0daef341,8411,sao paulo,SP,-23.555130,-46.415533,sao paulo,cds_dvds_musicals
115034,07fda4bada645831fb11a20eaf52c2ef,6be2538210060931127d78f976f0708e,delivered,2017-12-18 18:43:26,2017-12-18 19:50:13,2017-12-19 22:57:05,2017-12-27 14:22:20,2018-01-10 00:00:00,1,credit_card,7,74.94,10503652f1d2364e1d0e506f7852fe72,5,,,2017-12-28 00:00:00,2018-01-04 15:19:08,1,1dceebcc5f23c02ea23e16d5bedca000,4e922959ae960d389249c378d1c939f5,2017-12-28 19:50:13,65.00,9.94,47.0,117.0,2.0,550.0,35.0,15.0,25.0,12327,jacarei,SP,-23.298186,-45.974828,5426571d21e82bb33eaab30e48e6f290,3345,sao paulo,SP,-23.573424,-46.560441,sao paulo,cds_dvds_musicals
115035,2c4ada2e75c2ad41dd93cebb5df5f023,363d3a9b2ec5c5426608688ca033292d,delivered,2017-01-26 11:09:00,2017-01-26 11:22:17,2017-01-27 14:59:35,2017-02-14 16:24:01,2017-03-07 00:00:00,1,credit_card,1,209.06,82ec4a1c6f0134f607033e23431ee298,4,,Envio muito rápido. Recomendo.,2017-02-15 00:00:00,2017-02-16 02:54:35,1,6c7a0a349ad11817745e3ad58abd5c79,48162d548f5b1b11b9d29d1e01f75a61,2017-01-30 11:09:00,183.29,25.77,55.0,506.0,1.0,1225.0,27.0,35.0,15.0,13403,piracicaba,SP,-22.727375,-47.670610,d8bee9ec375c3a0f9ef8ed7456a51dcd,76940,rolim de moura,RO,-11.712918,-61.775307,rolim de moura,security_and_services
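
The first three rows above describe the same order, repeated once per payment (`payment_sequential` 1, 2, 3): an inner merge on a one-to-many key multiplies rows. If one row per order is preferred, one option is to aggregate payments per order before merging. The following is a minimal sketch on toy data; the frame contents and the names `payments_per_order` and `n_payments` are assumptions made for illustration.

```python
import pandas as pd

orders = pd.DataFrame({"order_id": ["a", "b"],
                       "order_status": ["delivered", "delivered"]})
payments = pd.DataFrame({
    "order_id": ["a", "a", "a", "b"],
    "payment_sequential": [1, 2, 3, 1],
    "payment_type": ["credit_card", "voucher", "voucher", "boleto"],
    "payment_value": [18.12, 18.59, 2.00, 37.77],
})

# Naive merge: order "a" appears three times, once per payment.
naive = pd.merge(orders, payments, on="order_id")

# Aggregate first: total paid and number of payments per order.
payments_per_order = payments.groupby("order_id", as_index=False).agg(
    payment_value=("payment_value", "sum"),
    n_payments=("payment_sequential", "count"),
)

# validate raises MergeError if the key is not unique on both sides.
merged = pd.merge(orders, payments_per_order, on="order_id",
                  validate="one_to_one")
print(len(naive), len(merged))
```

The same idea applies to the order items table, where an order with several items also produces several rows after the merge.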


To properly understand the variables that characterize each individual, it is important to have a clear explanation of what each variable means:  

 - `order_id` : order unique identifier
 - `customer_id` : key to the orders dataset. Each order has a unique customer_id.
 - `order_status` : Reference to the order status (delivered, shipped, etc).
 - `order_purchase_timestamp` : Shows the purchase timestamp.
 - `order_approved_at` : Shows the payment approval timestamp.
 - `order_delivered_carrier_date` : Shows the order posting timestamp, i.e. when it was handed over to the logistics partner.
 - `order_delivered_customer_date` : Shows the actual order delivery date to the customer.
 - `order_estimated_delivery_date` : Shows the estimated delivery date that was informed to customer at the purchase moment.
 - `payment_sequential` : a customer may pay an order with more than one payment method. If he does so, a sequence will be created to accommodate all payments.
 - `payment_type` : method of payment chosen by the customer.
 - `payment_installments` : number of installments chosen by the customer.
 - `payment_value` : transaction value.
 - `review_id` : unique review identifier
 - `review_score` : Score ranging from 1 to 5 given by the customer on a satisfaction survey.
 - `review_comment_title` : Comment title from the review left by the customer, in Portuguese.
 - `review_comment_message` : Comment message from the review left by the customer, in Portuguese.
 - `review_creation_date` : Shows the date in which the satisfaction survey was sent to the customer.
 - `review_answer_timestamp` : Shows satisfaction survey answer timestamp.
 - `order_item_id` : sequential number identifying number of items included in the same order.
 - `product_id` : product unique identifier
 - `seller_id` : seller unique identifier
 - `shipping_limit_date` : Shows the seller's shipping limit date for handing the order over to the logistics partner.
 - `price` : item price
 - `freight_value` : item freight value (if an order has more than one item, the freight value is split between items)
 - `product_name_lenght` : number of characters extracted from the product name.
 - `product_description_lenght` : number of characters extracted from the product description.
 - `product_photos_qty` : number of product published photos
 - `product_weight_g` : product weight measured in grams.
 - `product_length_cm` : product length measured in centimeters.
 - `product_height_cm` : product height measured in centimeters.
 - `product_width_cm` : product width measured in centimeters.
 - `seller_zip_code_prefix` : first 5 digits of zip code
 - `seller_city` : city name
 - `seller_state` : state
 - `geolocation_lat_x` : latitude
 - `geolocation_lng_x` : longitude
 - `customer_unique_id` : unique identifier of a customer.
 - `customer_zip_code_prefix` : first five digits of customer zip code
 - `customer_city` : customer city name
 - `customer_state` : customer state
 - `product_category_name_english` : root category of product
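
One consequence of the definitions above: since each order has its own `customer_id`, repeat buyers can only be identified through `customer_unique_id`. The sketch below illustrates this on a toy frame; the data and the names `orders_per_customer` and `repeat_customers` are hypothetical.

```python
import pandas as pd

# Toy data: customer "u1" placed two orders, each with its own customer_id.
df = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3"],
    "customer_id": ["c1", "c1", "c2", "c3"],
    "customer_unique_id": ["u1", "u1", "u1", "u2"],
})

# Count distinct orders per real customer (rows may repeat per item/payment).
orders_per_customer = df.groupby("customer_unique_id")["order_id"].nunique()
repeat_customers = orders_per_customer[orders_per_customer > 1]

print(orders_per_customer)
print("repeat customers:", len(repeat_customers))
```

Using `nunique` rather than a plain row count avoids inflating the figure when an order spans several rows in the merged table.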

We can now move on to a first analysis of our data. We copy the dataframe under a new name and display the data type of each feature.

In [33]:
data = olist_orders_dataset.copy()
data.dtypes

order_id                          object
customer_id                       object
order_status                      object
order_purchase_timestamp          object
order_approved_at                 object
order_delivered_carrier_date      object
order_delivered_customer_date     object
order_estimated_delivery_date     object
payment_sequential                 int64
payment_type                      object
payment_installments               int64
payment_value                    float64
review_id                         object
review_score                       int64
review_comment_title              object
review_comment_message            object
review_creation_date              object
review_answer_timestamp           object
order_item_id                      int64
product_id                        object
seller_id                         object
shipping_limit_date               object
price                            float64
freight_value                    float64
product_name_len
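
The timestamp columns are listed as `object`, meaning they are stored as strings. Date arithmetic such as delivery delays requires converting them first. Here is a sketch on a toy frame; the column name `delivery_days` is a hypothetical addition.

```python
import pandas as pd

df = pd.DataFrame({
    "order_purchase_timestamp": ["2017-10-02 10:56:33", "2017-08-15 18:29:31"],
    "order_delivered_customer_date": ["2017-10-10 21:25:13", None],
})

date_cols = ["order_purchase_timestamp", "order_delivered_customer_date"]
for col in date_cols:
    # errors="coerce" turns unparseable values into NaT instead of raising.
    df[col] = pd.to_datetime(df[col], errors="coerce")

# Delay between purchase and delivery, in whole days (NaN if undelivered).
df["delivery_days"] = (
    df["order_delivered_customer_date"] - df["order_purchase_timestamp"]
).dt.days

print(df.dtypes)
print(df["delivery_days"])
```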

We can also study the distribution of each variable to try to highlight outliers or atypical values that could bias the generalization of our machine learning model.

In [34]:
data.describe()

Unnamed: 0,payment_sequential,payment_installments,payment_value,review_score,order_item_id,price,freight_value,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm,seller_zip_code_prefix,geolocation_lat_x,geolocation_lng_x,customer_zip_code_prefix,geolocation_lat_y,geolocation_lng_y
count,115037.0,115037.0,115037.0,115037.0,115037.0,115037.0,115037.0,115037.0,115037.0,115037.0,115036.0,115036.0,115036.0,115036.0,115037.0,115037.0,115037.0,115037.0,115037.0,115037.0
mean,1.094057,2.944566,172.409294,4.034319,1.194685,120.650015,20.049325,48.754288,785.867573,2.200318,2112.365746,30.300897,16.658716,23.104089,24523.886054,-22.796059,-47.247909,34983.378296,-21.235765,-46.197857
std,0.731544,2.780108,266.204508,1.385651,0.686686,182.853114,15.850423,10.038336,653.051492,1.713213,3776.693083,16.203684,13.484108,11.730325,27644.643367,2.696612,2.346336,29829.438971,5.571871,4.050645
min,1.0,0.0,0.0,1.0,1.0,0.85,0.0,5.0,4.0,1.0,0.0,7.0,2.0,6.0,1001.0,-36.605374,-67.809656,1003.0,-36.605374,-72.666706
25%,1.0,1.0,60.85,4.0,1.0,39.9,13.08,42.0,345.0,1.0,300.0,18.0,8.0,15.0,6429.0,-23.611654,-48.831547,11095.0,-23.590023,-48.101695
50%,1.0,2.0,108.0,5.0,1.0,74.9,16.32,52.0,600.0,1.0,700.0,25.0,13.0,20.0,13720.0,-23.420739,-46.755211,24230.0,-22.929912,-46.632021
75%,1.0,4.0,189.43,5.0,1.0,134.9,21.2,57.0,982.0,3.0,1800.0,38.0,20.0,30.0,28470.0,-21.766477,-46.518082,58297.0,-20.198222,-43.625993
max,29.0,24.0,13664.08,5.0,21.0,6735.0,409.68,76.0,3992.0,20.0,40425.0,105.0,105.0,118.0,99730.0,-2.546079,-34.847856,99980.0,42.184003,-8.577855
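
The summary above shows long right tails: `payment_value`, for example, has a 75th percentile of 189.43 but a maximum of 13664.08. One common way to flag such values is the interquartile range rule. This is a sketch of that rule, not necessarily the method used later in the notebook; the function name `iqr_outlier_mask` and the factor 1.5 are assumptions.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy values; the last one mimics an extreme payment.
payment_value = pd.Series([60.85, 108.0, 120.0, 189.43, 13664.08])
mask = iqr_outlier_mask(payment_value)
print(payment_value[mask])
```

Whether a flagged value should be dropped, capped, or kept remains a judgment call: a very large payment may be a legitimate order rather than an error.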


### Exploratory analysis
 - study the distribution of values to reveal outliers
 - study missing values
 - translate the product category
 - show the number of customers
 - show the number of sellers
 - show the concentration of customers by region
 - show the concentration of sellers by region
 - show the concentration relative to the population of each region
 - show which regions generate revenue
 - show the delay between purchase and delivery, and the product rating
 - show a customer's average basket
 - show the delivery price as a function of the parcel volume
 - show the distance between customer and seller, and the delivery time
 - show the number of customers paying on credit and the average amount
 - show the purchase dates
 - show the number of repeat customers
 - show the time between successive purchases
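
For the item on the distance between customer and seller, the merged frame carries seller coordinates in `geolocation_lat_x`/`geolocation_lng_x` and customer coordinates in `geolocation_lat_y`/`geolocation_lng_y`. A great-circle distance can be estimated with the haversine formula. The sketch below uses only the standard library; the function name `haversine_km` and the mean Earth radius of 6371 km are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Seller in Maua and customer in Sao Paulo, using the coordinates
# from the first row of the merged table.
print(haversine_km(-23.680114, -46.452454, -23.574809, -46.587471))
```

On the full dataframe, this could be applied row by row with `DataFrame.apply(..., axis=1)`, although a vectorized NumPy version would likely be faster on roughly 115,000 rows.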

