# 0. Imports

In [11]:
%load_ext autoreload
%autoreload 2

# Data transformation
# -----------------------------------------------------------------------
import pandas as pd
import numpy as np

# Visualizations
# -----------------------------------------------------------------------
import seaborn as sns
import matplotlib.pyplot as plt

# Progress loops
# -----------------------------------------------------------------------
from tqdm import tqdm

# Handle warnings
# -----------------------------------------------------------------------
import warnings
warnings.filterwarnings("once")

# modify system variables
# -----------------------------------------------------------------------
import sys
sys.path.append("..") # append parent folder to path

# import support functions
# -----------------------------------------------------------------------
import src.soporte_eda as se
import src.soporte_preprocesamiento as sp
import src.soporte_clustering as sc


# statistics functions
# -----------------------------------------------------------------------
from scipy.stats import pearsonr, spearmanr, pointbiserialr





# 1. Intro to this notebook and data

## 1.1 Introduction

The goal of this business case is twofold: (1) to identify groups of similar clients according to their buying behaviour, and (2) to identify groups of products according to their profitability. To that end, exploratory analysis, clustering algorithms and regression models will be used.

The purpose of this notebook is to explore Global Ecommerce's product and customer data and to clean any errors that could impair the extraction of insights or the quality of the models subsequently developed for the brand.

## 1.2 Data import

In [10]:
global_superstore = pd.read_csv("../data/Global_Superstore.csv", encoding="latin1")

# convert column names to lowercase and replace spaces with underscores
global_superstore.columns = [col.lower().replace(" ", "_") for col in global_superstore.columns]

global_superstore.head(3)

index,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,city,state,...,product_id,category,sub-category,product_name,sales,quantity,discount,profit,shipping_cost,order_priority
0,32298,CA-2012-124891,31-07-2012,31-07-2012,Same Day,RH-19495,Rick Hansen,Consumer,New York City,New York,...,TEC-AC-10003033,Technology,Accessories,Plantronics CS510 - Over-the-Head monaural Wir...,2309.65,7,0.0,762.1845,933.57,Critical
1,26341,IN-2013-77878,05-02-2013,07-02-2013,Second Class,JR-16210,Justin Ritter,Corporate,Wollongong,New South Wales,...,FUR-CH-10003950,Furniture,Chairs,"Novimex Executive Leather Armchair, Black",3709.395,9,0.1,-288.765,923.63,Critical
2,25330,IN-2013-71249,17-10-2013,18-10-2013,First Class,CR-12730,Craig Reiter,Consumer,Brisbane,Queensland,...,TEC-PH-10004664,Technology,Phones,"Nokia Smart Phone, with Caller ID",5175.171,9,0.1,919.971,915.49,Medium
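
The `order_date` and `ship_date` values shown above are day-first strings (for example `31-07-2012`). The following is a minimal sketch, on a tiny hand-made frame rather than the real file, of how such columns could be parsed into datetimes after the column-name normalization. The `raw` frame and the derived `days_to_ship` column are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Tiny illustrative frame mimicking the raw CSV headers (not the real data file)
raw = pd.DataFrame({
    "Order Date": ["31-07-2012", "05-02-2013"],
    "Ship Date": ["31-07-2012", "07-02-2013"],
    "Sub-Category": ["Accessories", "Chairs"],
})

# Same normalization as in the notebook: lowercase, spaces -> underscores
# (hyphens are left untouched, hence "sub-category")
raw.columns = [col.lower().replace(" ", "_") for col in raw.columns]

# Parse the day-first strings with an explicit format to avoid month/day ambiguity
for col in ["order_date", "ship_date"]:
    raw[col] = pd.to_datetime(raw[col], format="%d-%m-%Y")

# Hypothetical derived feature: days elapsed between order and shipment
raw["days_to_ship"] = (raw["ship_date"] - raw["order_date"]).dt.days

print(raw.dtypes)
print(raw)
```

Passing an explicit `format` is a deliberate choice here: with day-first data, letting pandas infer the format can silently swap day and month for dates such as `05-02-2013`.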


# 2. Preliminary data analysis and cleaning

For a quick initial exploration:

In [15]:
se.exploracion_dataframe(global_superstore)

The number of rows is 51290 and the number of columns is 24

 ..................... 

5 random rows of the dataframe are:


index,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,city,state,...,product_id,category,sub-category,product_name,sales,quantity,discount,profit,shipping_cost,order_priority
33800,19820,ES-2011-4436456,26-01-2011,26-01-2011,Same Day,HG-15025,Hunter Glantz,Consumer,Reims,Champagne-Ardenne,...,OFF-BI-10001384,Office Supplies,Binders,"Cardinal Binder, Economy",41.49,6,0.5,-10.89,3.98,Medium
6935,20592,ID-2013-22025,09-10-2013,13-10-2013,Standard Class,TS-21655,Trudy Schmidt,Consumer,Bandung,Jawa Barat,...,TEC-CO-10000821,Technology,Copiers,"Canon Ink, High-Speed",675.8775,5,0.07,-43.6725,49.27,High
1653,19562,ES-2011-3142386,06-12-2011,06-12-2011,Same Day,LT-17110,Liz Thompson,Consumer,Hamburg,Hamburg,...,TEC-PH-10001066,Technology,Phones,"Apple Smart Phone, Cordless",636.3,1,0.0,139.98,151.11,High
25870,2861,MX-2014-130295,28-04-2014,30-04-2014,Second Class,SB-20185,Sarah Brown,Consumer,Granada,Granada,...,TEC-PH-10004242,Technology,Phones,"Apple Speaker Phone, VoIP",82.26,1,0.0,18.08,7.66,High
7875,3635,MX-2013-137449,04-11-2013,11-11-2013,Standard Class,RP-19855,Roy Phan,Corporate,Guantánamo,Guantánamo,...,FUR-FU-10001142,Furniture,Furnishings,"Deflect-O Clock, Black",234.78,7,0.0,35.14,43.05,Low



 ..................... 

The column types and their unique value counts are:


column,tipo_dato,conteo
row_id,int64,51290
order_id,object,25035
order_date,object,1430
ship_date,object,1464
ship_mode,object,4
customer_id,object,1590
customer_name,object,795
segment,object,3
city,object,3636
state,object,1094



 ..................... 

The duplicates in the dataset are: 0

 ..................... 

The nulls in the dataset are:


column,%_nulos
postal_code,80.51472



 ..................... 

Checking that no column has a single unique value:

 ..................... 

Checking for a minimum representation of numeric values:

 ..................... 

Descriptive statistics of the numeric columns:


column,count,mean,std,min,25%,50%,75%,max
row_id,51290.0,25645.5,14806.29199,1.0,12823.25,25645.5,38467.75,51290.0
postal_code,9994.0,55190.379428,32063.69335,1040.0,23223.0,56430.5,90008.0,99301.0
sales,51290.0,246.490581,487.565361,0.444,30.758625,85.053,251.0532,22638.48
discount,51290.0,0.142908,0.21228,0.0,0.0,0.0,0.2,0.85
profit,51290.0,28.610982,174.340972,-6599.978,0.0,9.24,36.81,8399.976
shipping_cost,51290.0,26.375915,57.296804,0.0,2.61,7.79,24.45,933.57



 ..................... 

Descriptive statistics of the categorical columns:


column,count,unique,top,freq
order_id,51290,25035,CA-2014-100111,14
order_date,51290,1430,18-06-2014,135
ship_date,51290,1464,22-11-2014,130
ship_mode,51290,4,Standard Class,30775
customer_id,51290,1590,PO-18850,97
customer_name,51290,795,Muhammed Yedwab,108
segment,51290,3,Consumer,26518
city,51290,3636,New York City,915
state,51290,1094,California,2001
country,51290,147,United States,9994



 ..................... 

The values of the categorical columns are: 
Column ORDER_ID has 25035 unique values, the first of which are:


order_id,count,pct
CA-2014-100111,14,0.0
IN-2012-41261,13,0.0
TO-2014-9950,13,0.0
IN-2013-42311,13,0.0
NI-2014-8880,13,0.0


Column ORDER_DATE has 1430 unique values, the first of which are:


order_date,count,pct
18-06-2014,135,0.3
18-11-2014,127,0.2
03-09-2014,126,0.2
20-11-2014,118,0.2
29-12-2014,116,0.2


Column SHIP_DATE has 1464 unique values, the first of which are:


ship_date,count,pct
22-11-2014,130,0.3
07-09-2014,115,0.2
07-12-2014,101,0.2
17-11-2014,101,0.2
29-11-2014,100,0.2


Column SHIP_MODE has 4 unique values, the first of which are:


ship_mode,count,pct
Standard Class,30775,60.0
Second Class,10309,20.1
First Class,7505,14.6
Same Day,2701,5.3


Column CUSTOMER_ID has 1590 unique values, the first of which are:


customer_id,count,pct
PO-18850,97,0.2
BE-11335,94,0.2
JG-15805,90,0.2
SW-20755,89,0.2
MY-18295,85,0.2


Column CUSTOMER_NAME has 795 unique values, the first of which are:


customer_name,count,pct
Muhammed Yedwab,108,0.2
Steven Ward,106,0.2
Gary Hwang,102,0.2
Patrick O'Brill,102,0.2
Bill Eplett,102,0.2


Column SEGMENT has 3 unique values, the first of which are:


segment,count,pct
Consumer,26518,51.7
Corporate,15429,30.1
Home Office,9343,18.2


Column CITY has 3636 unique values, the first of which are:


city,count,pct
New York City,915,1.8
Los Angeles,747,1.5
Philadelphia,537,1.0
San Francisco,510,1.0
Santo Domingo,443,0.9


Column STATE has 1094 unique values, the first of which are:


state,count,pct
California,2001,3.9
England,1499,2.9
New York,1128,2.2
Texas,985,1.9
Ile-de-France,981,1.9


Column COUNTRY has 147 unique values, the first of which are:


country,count,pct
United States,9994,19.5
Australia,2837,5.5
France,2827,5.5
Mexico,2644,5.2
Germany,2065,4.0


Column MARKET has 7 unique values, the first of which are:


market,count,pct
APAC,11002,21.5
LATAM,10294,20.1
EU,10000,19.5
US,9994,19.5
EMEA,5029,9.8


Column REGION has 13 unique values, the first of which are:


region,count,pct
Central,11117,21.7
South,6645,13.0
EMEA,5029,9.8
North,4785,9.3
Africa,4587,8.9


Column PRODUCT_ID has 10292 unique values, the first of which are:


product_id,count,pct
OFF-AR-10003651,35,0.1
OFF-AR-10003829,31,0.1
OFF-BI-10003708,30,0.1
OFF-BI-10002799,30,0.1
FUR-CH-10003354,28,0.1


Column CATEGORY has 3 unique values, the first of which are:


category,count,pct
Office Supplies,31273,61.0
Technology,10141,19.8
Furniture,9876,19.3


Column SUB-CATEGORY has 17 unique values, the first of which are:


sub-category,count,pct
Binders,6152,12.0
Storage,5059,9.9
Art,4883,9.5
Paper,3538,6.9
Chairs,3434,6.7


Column PRODUCT_NAME has 3788 unique values, the first of which are:


product_name,count,pct
Staples,227,0.4
"Cardinal Index Tab, Clear",92,0.2
"Eldon File Cart, Single Width",90,0.2
"Rogers File Cart, Single Width",84,0.2
"Ibico Index Tab, Clear",83,0.2


Column QUANTITY has 14 unique values, the first of which are:


quantity,count,pct
2,12748,24.9
3,9682,18.9
1,8963,17.5
4,6385,12.4
5,4882,9.5


Column ORDER_PRIORITY has 4 unique values, the first of which are:


order_priority,count,pct
Medium,29433,57.4
High,15501,30.2
Critical,3932,7.7
Low,2424,4.7
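
The report above comes from the project helper `se.exploracion_dataframe`, whose source is not shown in this notebook. As a rough approximation of the same kind of checks using only pandas, a sketch could look like the following. The function name `quick_overview`, the keys it returns and the `demo` frame are all hypothetical and chosen for illustration; this is not the helper's actual implementation.

```python
import numpy as np
import pandas as pd


def quick_overview(df: pd.DataFrame) -> dict:
    """Collect basic EDA facts: shape, dtypes, duplicates, nulls, describe()."""
    null_pct = df.isnull().mean() * 100  # share of missing values per column
    return {
        "shape": df.shape,
        "dtypes_and_unique": pd.DataFrame(
            {"dtype": df.dtypes, "n_unique": df.nunique()}
        ),
        "n_duplicates": int(df.duplicated().sum()),
        "null_pct": null_pct[null_pct > 0],  # keep only columns with nulls
        "numeric_stats": df.select_dtypes(include="number").describe().T,
        "categorical_stats": df.select_dtypes(exclude="number").describe().T,
    }


# Small invented sample with one mostly-null column, similar to postal_code
demo = pd.DataFrame({
    "ship_mode": ["Same Day", "First Class", "First Class", "Second Class"],
    "postal_code": [10024.0, np.nan, np.nan, np.nan],
    "sales": [2309.65, 3709.395, 5175.171, 82.26],
})

overview = quick_overview(demo)
for name, value in overview.items():
    print(f"--- {name} ---")
    print(value)
```

Returning a dictionary rather than printing inside the function keeps the individual results available for later checks, for example asserting that no unexpected column contains nulls.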


**Aggregation level**

First and foremost, note that the data at hand consists of order records: each row is a product line within an order (51290 rows for 25035 distinct orders). To analyse by client or by product, the data must first be aggregated at the appropriate level.
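
As a minimal sketch of what such aggregation could look like, the example below groups a small invented sample of order lines by `customer_id` and by `product_id`. The sample values, the chosen metrics and the names `by_customer`, `by_product` and `profit_margin` are assumptions for illustration; the features actually used for clustering may differ.

```python
import pandas as pd

# Small illustrative order-line sample (values invented for the example)
orders = pd.DataFrame({
    "order_id": ["A-1", "A-1", "B-2", "C-3"],
    "customer_id": ["RH-19495", "RH-19495", "JR-16210", "RH-19495"],
    "product_id": ["TEC-1", "FUR-1", "TEC-1", "TEC-1"],
    "sales": [100.0, 50.0, 200.0, 25.0],
    "quantity": [1, 2, 4, 1],
    "profit": [20.0, -5.0, 40.0, 5.0],
})

# Customer level: one row per client, describing buying behaviour
by_customer = orders.groupby("customer_id").agg(
    n_orders=("order_id", "nunique"),  # distinct orders, not order lines
    total_sales=("sales", "sum"),
    total_profit=("profit", "sum"),
)

# Product level: one row per product, describing profitability
by_product = orders.groupby("product_id").agg(
    units_sold=("quantity", "sum"),
    total_sales=("sales", "sum"),
    total_profit=("profit", "sum"),
)
by_product["profit_margin"] = by_product["total_profit"] / by_product["total_sales"]

print(by_customer)
print(by_product)
```

Counting orders with `nunique` rather than `count` matters here because a single order spans several rows, one per product.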