# Project 5: Segmenting the customers of an e-commerce site

*Pierre-Eloi Ragetly*

This project is part of the OpenClassrooms *Data Scientist* track.

The main objective is to build **a customer segmentation** for an e-commerce site. **A maintenance contract proposal** for the segmentation must also be included.

The data made available to us come from *Kaggle*:
https://www.kaggle.com/olistbr/brazilian-ecommerce

# Part I: Data Wrangling

The aim of this notebook is to describe the cleaning operations needed to obtain a usable dataset.

In [1]:
# Import usual libraries
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import pandas as pd
import seaborn as sns

In [2]:
# Change some default parameters of matplotlib using seaborn
plt.rcParams.update(plt.rcParamsDefault)
plt.rcParams.update({'axes.titleweight': 'bold'})
sns.set(style='ticks')
current_palette = sns.color_palette('RdBu')
sns.set_palette(current_palette)

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Récupération-des-données" data-toc-modified-id="Récupération-des-données-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Loading the data</a></span></li><li><span><a href="#Ingénierie-des-variables" data-toc-modified-id="Ingénierie-des-variables-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Feature engineering</a></span><ul class="toc-item"><li><span><a href="#Variables-liées-aux-transactions" data-toc-modified-id="Variables-liées-aux-transactions-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Transaction features</a></span></li><li><span><a href="#Variables-liées-aux-produits" data-toc-modified-id="Variables-liées-aux-produits-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Product features</a></span></li><li><span><a href="#Catégories-des-produits" data-toc-modified-id="Catégories-des-produits-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Product categories</a></span></li><li><span><a href="#Variables-liées-aux-commentaires" data-toc-modified-id="Variables-liées-aux-commentaires-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Review features</a></span></li><li><span><a href="#Moyens-de-paiement" data-toc-modified-id="Moyens-de-paiement-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Payment methods</a></span></li><li><span><a href="#États" data-toc-modified-id="États-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>States</a></span></li></ul></li></ul></div>

## Loading the data

Once the data have been downloaded, we can load each file into a DataFrame using the **pandas** library.

In [3]:
df_customers = pd.read_csv("data/olist_customers_dataset.csv")
df_geolocation = pd.read_csv("data/olist_geolocation_dataset.csv")
df_order_items = pd.read_csv("data/olist_order_items_dataset.csv")
df_order_payments = pd.read_csv("data/olist_order_payments_dataset.csv")
df_order_reviews = pd.read_csv("data/olist_order_reviews_dataset.csv")
df_orders = pd.read_csv("data/olist_orders_dataset.csv")
df_products = pd.read_csv("data/olist_products_dataset.csv")
df_sellers = pd.read_csv("data/olist_sellers_dataset.csv")
df_translation = pd.read_csv("data/product_category_name_translation.csv")

Let us look at what each DataFrame contains.

In [4]:
df_customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   customer_id               99441 non-null  object
 1   customer_unique_id        99441 non-null  object
 2   customer_zip_code_prefix  99441 non-null  int64 
 3   customer_city             99441 non-null  object
 4   customer_state            99441 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.8+ MB


In [5]:
df_geolocation.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000163 entries, 0 to 1000162
Data columns (total 5 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   geolocation_zip_code_prefix  1000163 non-null  int64  
 1   geolocation_lat              1000163 non-null  float64
 2   geolocation_lng              1000163 non-null  float64
 3   geolocation_city             1000163 non-null  object 
 4   geolocation_state            1000163 non-null  object 
dtypes: float64(2), int64(1), object(2)
memory usage: 38.2+ MB


In [6]:
df_order_items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112650 entries, 0 to 112649
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   order_id             112650 non-null  object 
 1   order_item_id        112650 non-null  int64  
 2   product_id           112650 non-null  object 
 3   seller_id            112650 non-null  object 
 4   shipping_limit_date  112650 non-null  object 
 5   price                112650 non-null  float64
 6   freight_value        112650 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 6.0+ MB


In [7]:
df_order_payments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103886 entries, 0 to 103885
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   order_id              103886 non-null  object 
 1   payment_sequential    103886 non-null  int64  
 2   payment_type          103886 non-null  object 
 3   payment_installments  103886 non-null  int64  
 4   payment_value         103886 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 4.0+ MB


In [8]:
df_order_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 7 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   review_id                100000 non-null  object
 1   order_id                 100000 non-null  object
 2   review_score             100000 non-null  int64 
 3   review_comment_title     11715 non-null   object
 4   review_comment_message   41753 non-null   object
 5   review_creation_date     100000 non-null  object
 6   review_answer_timestamp  100000 non-null  object
dtypes: int64(1), object(6)
memory usage: 5.3+ MB


In [9]:
df_orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       99441 non-null  object
 1   customer_id                    99441 non-null  object
 2   order_status                   99441 non-null  object
 3   order_purchase_timestamp       99441 non-null  object
 4   order_approved_at              99281 non-null  object
 5   order_delivered_carrier_date   97658 non-null  object
 6   order_delivered_customer_date  96476 non-null  object
 7   order_estimated_delivery_date  99441 non-null  object
dtypes: object(8)
memory usage: 6.1+ MB


In [10]:
df_products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32951 entries, 0 to 32950
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   product_id                  32951 non-null  object 
 1   product_category_name       32341 non-null  object 
 2   product_name_lenght         32341 non-null  float64
 3   product_description_lenght  32341 non-null  float64
 4   product_photos_qty          32341 non-null  float64
 5   product_weight_g            32949 non-null  float64
 6   product_length_cm           32949 non-null  float64
 7   product_height_cm           32949 non-null  float64
 8   product_width_cm            32949 non-null  float64
dtypes: float64(7), object(2)
memory usage: 2.3+ MB


In [11]:
df_sellers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3095 entries, 0 to 3094
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   seller_id               3095 non-null   object
 1   seller_zip_code_prefix  3095 non-null   int64 
 2   seller_city             3095 non-null   object
 3   seller_state            3095 non-null   object
dtypes: int64(1), object(3)
memory usage: 96.8+ KB


In [12]:
df_translation.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 2 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   product_category_name          71 non-null     object
 1   product_category_name_english  71 non-null     object
dtypes: object(2)
memory usage: 1.2+ KB


We can note several things:
1. The *geolocation* table adds nothing we need beyond the *customers* table (only latitude and longitude are new)
2. Much of the data is redundant across tables
3. The tables contain very few missing values, apart from the optional review comment fields
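
As a complement to reading the `info()` outputs one by one, the share of missing values per column can be computed directly. The sketch below uses a small hypothetical table (`toy`, with made-up values) rather than the real Olist files, so it only illustrates the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for one of the Olist tables
toy = pd.DataFrame({
    "review_id": ["r1", "r2", "r3", "r4"],
    "review_score": [5, 4, 1, 5],
    "review_comment_title": [np.nan, "ok", np.nan, np.nan],
})

# Share of missing values per column, from most to least affected
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The same two calls (`isna()` then `mean()`) should work unchanged on any of the DataFrames loaded above.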

## Feature engineering

Many of the variables cannot be used as they are: they have to be transformed before they can feed clustering models.

In addition, since our goal is to segment customers, we need to group the data by customer. In the database, a new *customer_id* is assigned to each transaction, so a customer who made several transactions ends up with several *customer_id* values. This is why the variable *customer_unique_id* is also provided, and it is the one we will use to group the data.
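
To make the difference between the two identifiers concrete, here is a minimal sketch on a hypothetical table (`toy_customers`, with made-up identifiers) that counts how many *customer_id* values are attached to each *customer_unique_id*.

```python
import pandas as pd

# Hypothetical extract of the customers table: u1 placed two orders
toy_customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3", "c4"],
    "customer_unique_id": ["u1", "u1", "u2", "u3"],
})

# Number of distinct customer_id values per real customer
ids_per_customer = (toy_customers.groupby("customer_unique_id")["customer_id"]
                                 .nunique())

# Customers with more than one customer_id, i.e. repeat buyers
repeat_customers = ids_per_customer[ids_per_customer > 1]
print(repeat_customers)
```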

### Transaction features

We start with the transaction variables. For each customer, we compute the following features:
- Number of transactions made
- Average number of items per basket
- Average basket price
- Average basket shipping cost
- Share of transactions that were completed (status *delivered*)
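
The features listed above can be illustrated on a toy table before applying them to the real data. This is only a sketch: `toy_items` and its values are hypothetical, and it uses `groupby` with named aggregation rather than the pivot table used in the notebook. As in the notebook, the delivered share is measured on item rows.

```python
import pandas as pd

# Hypothetical order items, one row per purchased item
toy_items = pd.DataFrame({
    "customer_unique_id": ["u1", "u1", "u1", "u2"],
    "order_id": ["o1", "o1", "o2", "o3"],
    "price": [10.0, 20.0, 30.0, 50.0],
    "freight_value": [1.0, 2.0, 3.0, 5.0],
    "order_status": ["delivered", "delivered", "canceled", "delivered"],
})
toy_items["delivered"] = toy_items["order_status"].eq("delivered")

# Totals per customer
per_customer = toy_items.groupby("customer_unique_id").agg(
    n_orders=("order_id", "nunique"),
    n_items=("order_id", "size"),
    total_price=("price", "sum"),
    total_freight=("freight_value", "sum"),
    delivered_share=("delivered", "mean"),
)

# Averages per basket: totals divided by the number of orders
per_customer["order_n_products"] = per_customer["n_items"] / per_customer["n_orders"]
per_customer["order_price"] = per_customer["total_price"] / per_customer["n_orders"]
per_customer["order_freight_value"] = per_customer["total_freight"] / per_customer["n_orders"]
print(per_customer)
```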

In [13]:
# Create a new dataframe with all required data
# (one row per order item, with the customer attached)
df = pd.merge(df_orders, df_order_items, how="inner", on="order_id")
df = df.merge(right=df_customers, how="inner", on="customer_id")

values = ["order_id",
          "product_id",
          "price",
          "freight_value",
          "order_status"]
aggfunc = {"order_id": "nunique",   # number of distinct orders
           "product_id": "count",   # number of item rows
           "price": "sum",
           "freight_value": "sum",
           # number of item rows that belong to a delivered order
           "order_status": lambda s: (s == "delivered").sum()}

pivot_orders = (pd.pivot_table(df, index="customer_unique_id",
                               values=values, aggfunc=aggfunc)
                  .rename(columns={"order_id": "n_orders",
                                   "product_id": "order_n_products",
                                   "price": "order_price",
                                   "freight_value": "order_freight_value",
                                   "order_status": "order_delivered"})
                  # assign() evaluates its keyword arguments in order, so the
                  # delivered rate is computed while order_n_products still
                  # holds the raw item count
                  .assign(order_delivered=lambda df: df["order_delivered"]/df["order_n_products"],
                          order_n_products=lambda df: df["order_n_products"]/df["n_orders"],
                          order_price=lambda df: df["order_price"]/df["n_orders"],
                          order_freight_value=lambda df: df["order_freight_value"]/df["n_orders"]))

In [14]:
pivot_orders.describe()

Unnamed: 0,order_freight_value,n_orders,order_delivered,order_price,order_n_products
count,95420.0,95420.0,95420.0,95420.0,95420.0
mean,22.81905,1.034018,0.977804,138.22572,1.1391
std,21.560878,0.211234,0.146304,211.414089,0.52688
min,0.0,1.0,0.0,0.85,1.0
25%,13.89,1.0,1.0,46.4,1.0
50%,17.24,1.0,1.0,87.3825,1.0
75%,24.13,1.0,1.0,149.9,1.0
max,1794.96,16.0,1.0,13440.0,21.0


### Product features

Next we handle the product variables. For each customer, we compute the following features:
- average price per product
- average number of photos per product
- average description length per product
- average shipping cost per product

In [15]:
# Add products data
df = df.merge(right=df_products, how="inner",
                  on="product_id", copy=False)

values = ["price",
          "product_photos_qty",
          "product_description_lenght",
          "freight_value"]
aggfunc = {"price": 'mean',
           "product_photos_qty": 'mean',
           "product_description_lenght": 'mean',
           "freight_value": 'mean'}

pivot_products = pd.pivot_table(df, index="customer_unique_id",
                                values=values, aggfunc=aggfunc)

In [16]:
pivot_products.describe()

Unnamed: 0,freight_value,price,product_description_lenght,product_photos_qty
count,95420.0,95420.0,94108.0,94108.0
mean,20.226742,126.516509,795.837564,2.255114
std,15.821514,191.743686,650.586308,1.735351
min,0.0,0.85,4.0,1.0
25%,13.37,42.9,354.0,1.0
50%,16.4,79.08,611.5,2.0
75%,21.22,139.9,998.0,3.0
max,409.68,6735.0,3992.0,20.0


### Product categories

We now create one feature per product category. We use the categories translated into English, which are easier to read than the Portuguese ones. For each category, we record the proportion of the products bought by each customer that belong to this category.
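
The proportion described above can be checked on a toy example. The sketch below is not the notebook's method: it uses `pd.crosstab` with `normalize="index"` as a compact alternative, on a hypothetical table (`toy`) with made-up purchases.

```python
import pandas as pd

# Hypothetical purchases: u1 bought two toys and one auto part
toy = pd.DataFrame({
    "customer_unique_id": ["u1", "u1", "u1", "u2"],
    "product_category_name_english": ["toys", "toys", "auto", "auto"],
})

# Count purchases per (customer, category), then divide each row by its total
cat_share = pd.crosstab(toy["customer_unique_id"],
                        toy["product_category_name_english"],
                        normalize="index")
print(cat_share)
```

Each row sums to 1, which is a convenient sanity check on the real table as well.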

In [17]:
# Add the English translation of the product category
df = df.merge(right=df_translation, how="inner",
              on="product_category_name", copy=False)

# Create a feature by category
list_products_cat = df["product_category_name_english"].unique().tolist()
aggfunc = {"product_id": 'count'}
for c in list_products_cat:
    cond = df["product_category_name_english"]==c
    df[c] = 1
    df[c] = df[c].where(cond, 0)
    aggfunc[c] = sum

# Create the pivot table
values = ["product_id"] + list_products_cat
pivot_cat = pd.pivot_table(df, index="customer_unique_id",
                           values=values, aggfunc=aggfunc)
# Calculate the proportions
for c in list_products_cat:
    pivot_cat[c] /= pivot_cat["product_id"]
pivot_cat = pivot_cat.drop(columns=["product_id"])

In [18]:
pivot_cat.describe()

Unnamed: 0,agro_industry_and_commerce,air_conditioning,art,arts_and_craftmanship,audio,auto,baby,bed_bath_table,books_general_interest,books_imported,...,security_and_services,signaling_and_security,small_appliances,small_appliances_home_oven_and_coffee,sports_leisure,stationery,tablets_printing_image,telephony,toys,watches_gifts
count,94088.0,94088.0,94088.0,94088.0,94088.0,94088.0,94088.0,94088.0,94088.0,94088.0,...,94088.0,94088.0,94088.0,94088.0,94088.0,94088.0,94088.0,94088.0,94088.0,94088.0
mean,0.0019,0.002562,0.002072,0.000203,0.0036,0.040234,0.029403,0.094342,0.005308,0.000547,...,2.1e-05,0.00145,0.006523,0.000792,0.078312,0.023846,0.000795,0.043443,0.039913,0.05796
std,0.04345,0.050066,0.045101,0.014029,0.059483,0.195708,0.167624,0.289972,0.072413,0.023333,...,0.00461,0.037893,0.080106,0.028081,0.26731,0.151737,0.027902,0.203065,0.194622,0.232698
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### Review features

We now handle the review variables. For each customer, we compute the following features:
- average number of reviews per transaction
- average score
- average response time
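
The response time is the delay between the creation of a review and its answer, expressed in days. Here is a minimal sketch of that conversion on hypothetical timestamps (`toy_reviews` is made up for the example).

```python
import pandas as pd

# Hypothetical review timestamps stored as strings, as in the CSV files
toy_reviews = pd.DataFrame({
    "review_creation_date": ["2018-01-01 00:00:00", "2018-01-10 00:00:00"],
    "review_answer_timestamp": ["2018-01-02 12:00:00", "2018-01-10 06:00:00"],
})

created = pd.to_datetime(toy_reviews["review_creation_date"])
answered = pd.to_datetime(toy_reviews["review_answer_timestamp"])

# Dividing a timedelta by one day gives a float number of days
toy_reviews["review_answer_timedelta"] = (answered - created) / pd.Timedelta(days=1)
print(toy_reviews["review_answer_timedelta"])
```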

In [19]:
# Create a new dataframe with all required data
df = pd.merge(df_orders, df_order_reviews, how="inner",
              on="order_id", copy=True)
df = df.merge(right=df_customers, how="inner",
              on="customer_id", copy=False)
# Convert date features into 'datetime' type
df["review_creation_date"] = pd.to_datetime(df["review_creation_date"])
df["review_answer_timestamp"] = pd.to_datetime(df["review_answer_timestamp"])
# Calculate the time to answer in days
df["review_answer_timedelta"] = df["review_answer_timestamp"] - df["review_creation_date"]
df["review_answer_timedelta"] /= pd.to_timedelta(1, unit="D")

values = ["order_id",
          "review_id",
          "review_score",
          "review_answer_timedelta"]
aggfunc = {"order_id": pd.Series.nunique,
           "review_id": 'count',
           "review_score": 'mean',
           "review_answer_timedelta": 'mean'}

pivot_reviews = (pd.pivot_table(df, index="customer_unique_id",
                               values=values, aggfunc=aggfunc)
                  .rename(columns={"order_id": "n_orders",
                                   "review_id": "order_n_reviews"})
                  .assign(order_n_reviews=lambda df: df["order_n_reviews"]/df["n_orders"])
                  .drop(columns=["n_orders"]))

In [20]:
pivot_reviews.describe()

Unnamed: 0,review_answer_timedelta,order_n_reviews,review_score
count,96096.0,96096.0,96096.0
mean,3.154216,1.0032,4.069535
std,9.824978,0.049501,1.353446
min,0.089225,1.0,1.0
25%,1.007701,1.0,4.0
50%,1.690735,1.0,5.0
75%,3.103235,1.0,5.0
max,518.699213,2.0,5.0


### Payment methods

It could be useful to know which payment method each customer prefers. To that end, we create one feature per available payment method and record the proportion of the customer's total spending paid with that method.
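
As an illustration of these proportions, the sketch below builds them on a hypothetical payments table (`toy_payments`, with made-up amounts). It pivots on `payment_type` directly, which is a slightly different route from the notebook's code, and assumes every customer has a non-zero total (a zero total would give `NaN`).

```python
import pandas as pd

# Hypothetical payments: u1 split 100 between a credit card and a voucher
toy_payments = pd.DataFrame({
    "customer_unique_id": ["u1", "u1", "u2"],
    "payment_type": ["credit_card", "voucher", "boleto"],
    "payment_value": [75.0, 25.0, 40.0],
})

# Total amount paid per customer and per payment type
paid = toy_payments.pivot_table(index="customer_unique_id",
                                columns="payment_type",
                                values="payment_value",
                                aggfunc="sum", fill_value=0)

# Divide each row by the customer's total to get proportions
payment_share = paid.div(paid.sum(axis=1), axis=0)
print(payment_share)
```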

In [21]:
# Create a new dataframe with all required data
df = pd.merge(df_orders, df_order_payments, how="inner",
              on="order_id", copy=True)
df = df.merge(right=df_customers, how="inner",
              on="customer_id", copy=False)

# Create a feature by type of payment
list_payment_type = df["payment_type"].unique().tolist()
aggfunc = {"payment_value": sum}
for c in list_payment_type:
    cond = df["payment_type"]==c
    df[c] = df["payment_value"]
    df[c] = df[c].where(cond, 0)
    aggfunc[c] = sum

# Create the pivot table
values = ["payment_value"] + list_payment_type
pivot_payment = pd.pivot_table(df, index="customer_unique_id",
                           values=values, aggfunc=aggfunc)
# Calculate the proportions
for c in list_payment_type:
    pivot_payment[c] /= pivot_payment["payment_value"]
pivot_payment = pivot_payment.drop(columns=["payment_value", "not_defined"])

In [22]:
pivot_payment.describe()

Unnamed: 0,boleto,credit_card,debit_card,voucher
count,96093.0,96093.0,96093.0,96093.0
mean,0.199304,0.755017,0.015474,0.030205
std,0.398744,0.42522,0.122984,0.159874
min,0.0,0.0,0.0,0.0
25%,0.0,0.67176,0.0,0.0
50%,0.0,1.0,0.0,0.0
75%,0.0,1.0,0.0,0.0
max,1.0,1.0,1.0,1.0


### States

The last features we create are the states where the customers live. There is one feature per possible state, with value:
- **1** if the customer lives there
- **0** otherwise.

This is called **one-hot encoding**. To do it, we use the `OneHotEncoder` class from *sklearn.preprocessing*.
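
For readers who want to see the encoding on a small case first, here is a sketch using `pd.get_dummies`, a pandas alternative to `OneHotEncoder` (not the class used in this notebook). The table `toy_customers` is hypothetical, and taking the maximum per customer is one possible way to collapse several *customer_id* rows into one.

```python
import pandas as pd

# Hypothetical customers: u1 appears twice, both times in SP
toy_customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_unique_id": ["u1", "u1", "u2"],
    "customer_state": ["SP", "SP", "RJ"],
})

# One 0/1 column per state
state_dummies = pd.get_dummies(toy_customers["customer_state"], dtype=int)

# One row per real customer: 1 if any of their rows is in the state
states_per_customer = (pd.concat([toy_customers["customer_unique_id"],
                                  state_dummies], axis=1)
                         .groupby("customer_unique_id")
                         .max())
print(states_per_customer)
```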

In [28]:
from sklearn.preprocessing import OneHotEncoder

# Create the features
df = df_customers.copy()
cat_encoder = OneHotEncoder()
states_1hot = cat_encoder.fit_transform(df['customer_state'].values.reshape(-1, 1))
list_states = cat_encoder.categories_[0].tolist()
df_1hot = pd.DataFrame(states_1hot.toarray(),
                       columns=list_states,
                       index=df.index)
df = pd.concat([df, df_1hot], axis=1)

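# Group by unique customer: "max" gives 1 if any of the customer's rows
# falls in the state and 0 otherwise (a scalar per group, whereas
# pd.Series.unique returns an array)
values = list_states
aggfunc = "max"
pivot_states = pd.pivot_table(df, index="customer_unique_id",
                              values=values, aggfunc=aggfunc)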

In [23]:
from functools import reduce

list_df_orders = [df_orders,
                  df_order_items,
                  df_order_payments,
                  df_order_reviews]
# Merge all DataFrame handling orders data
data = reduce(lambda left, right: pd.merge(left, right, how="outer",
                                           on="order_id", copy=True),
              list_df_orders)
print(data.shape)
# Add customers data
data = data.merge(right=df_customers, how="outer",
                  on="customer_id", copy=False)
print(data.shape)
# Add products data
data = data.merge(right=df_products, how="outer",
                  on="product_id", copy=False)
# Add the English translation of the product category
data = data.merge(right=df_translation, how="outer",
                  on="product_category_name", copy=False)
# Add sellers data
data = data.merge(right=df_sellers, how="outer",
                  on="seller_id", copy=False)

(119151, 24)
(119151, 28)
