# Analysis of associated items in e-commerce purchases using the Apriori association rule discovery algorithm

In [1]:
import pandas as pd
import matplotlib

In [3]:
ecommerce_df = pd.read_csv('data.csv')

In [44]:
ecommerce_df.dtypes

InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object

In [48]:
# Identifier-like columns are stored as categories.
ecommerce_df['InvoiceNo'] = ecommerce_df['InvoiceNo'].astype('category')
ecommerce_df['StockCode'] = ecommerce_df['StockCode'].astype('category')
# Parse the date strings into datetime64 values.
ecommerce_df['InvoiceDate'] = pd.to_datetime(ecommerce_df['InvoiceDate'])
# Caution: casting to int64 truncates the fractional part of each price (2.55 becomes 2).
ecommerce_df['UnitPrice'] = ecommerce_df['UnitPrice'].astype('int64')
ecommerce_df['CustomerID'] = ecommerce_df['CustomerID'].astype('category')
ecommerce_df['Country'] = ecommerce_df['Country'].astype('category')

In [49]:
ecommerce_df.dtypes

InvoiceNo            category
StockCode            category
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice               int64
CustomerID           category
Country              category
dtype: object
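
The raw `InvoiceDate` strings in this dataset look like `12/1/2010 8:26`. A minimal sketch of parsing them with an explicit format follows. It assumes the strings are in month/day/year order, which the notebook does not confirm, and `sample_dates` is a hypothetical stand-in for the real column.

```python
import pandas as pd

# Hypothetical sample mimicking the raw InvoiceDate strings shown in the notebook.
sample_dates = pd.Series(['12/1/2010 8:26', '4/8/2011 15:06'])

# Assumption: month/day/year order. An explicit format avoids ambiguous parsing.
parsed_dates = pd.to_datetime(sample_dates, format='%m/%d/%Y %H:%M')

print(parsed_dates)
```

If the real data turned out to be day/month/year, only the `format` string would need to change.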

In [52]:
ecommerce_df.count()

InvoiceNo      541909
StockCode      541909
Description    540455
Quantity       541909
InvoiceDate    541909
UnitPrice      541909
CustomerID     406829
Country        541909
dtype: int64

In [51]:
ecommerce_df[ecommerce_df['Description'].isnull()].count()

InvoiceNo      1454
StockCode      1454
Description       0
Quantity       1454
InvoiceDate    1454
UnitPrice      1454
CustomerID        0
Country        1454
dtype: int64
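
The same missing-value counts can be read more directly with `isna().sum()`. The sketch below uses a small hypothetical frame (`toy_df`), not the real data, to show the pattern, including a check of how many rows without a `Description` also lack a `CustomerID`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy_df = pd.DataFrame({
    'Description': ['WHITE METAL LANTERN', None, None],
    'Quantity': [6, -3, 4],
    'CustomerID': [17850.0, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy_df.isna().sum()

# Rows without a Description, and how many of those also lack a CustomerID.
no_description = toy_df[toy_df['Description'].isna()]
also_no_customer = no_description['CustomerID'].isna().sum()

print(missing_per_column)
print(also_no_customer)
```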

##### Description of the numeric variables

In [53]:
ecommerce_df[ecommerce_df['Description'].isnull()].describe()

| | Quantity | UnitPrice |
|---|---|---|
| count | 1454.0 | 1454.0 |
| mean | -9.359697 | 0.0 |
| std | 243.238758 | 0.0 |
| min | -3667.0 | 0.0 |
| 25% | -24.0 | 0.0 |
| 50% | -3.0 | 0.0 |
| 75% | 4.0 | 0.0 |
| max | 5568.0 | 0.0 |


##### Description of the identifier variables

In [43]:
ecommerce_df[ecommerce_df['Description'].isnull()][['InvoiceNo','StockCode','CustomerID']].describe()

| | CustomerID |
|---|---|
| count | 0.0 |
| mean | NaN |
| std | NaN |
| min | NaN |
| 25% | NaN |
| 50% | NaN |
| 75% | NaN |
| max | NaN |


##### Description of the date/time variables

In [42]:
ecommerce_df[ecommerce_df['Description'].isnull()]['InvoiceDate'].describe()

count               1454
unique              1121
top       4/8/2011 15:06
freq                   5
Name: InvoiceDate, dtype: object

In [5]:
ecommerce_df.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |


In [10]:
ecommerce_df.describe()

| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 541909.0 | 541909.0 | 406829.0 |
| mean | 9.55225 | 4.611114 | 15287.69057 |
| std | 218.081158 | 96.759853 | 1713.600303 |
| min | -80995.0 | -11062.06 | 12346.0 |
| 25% | 1.0 | 1.25 | 13953.0 |
| 50% | 3.0 | 2.08 | 15152.0 |
| 75% | 10.0 | 4.13 | 16791.0 |
| max | 80995.0 | 38970.0 | 18287.0 |


In [20]:
ecommerce_df[['InvoiceNo','StockCode']].describe()

| | InvoiceNo | StockCode |
|---|---|---|
| count | 541909 | 541909 |
| unique | 25900 | 4070 |
| top | 573585 | 85123A |
| freq | 1114 | 2313 |


As the summary shows, the dataset contains negative quantities and negative unit prices, which is inconsistent. This should not affect the association rule analysis. Still, it is worth checking that these inconsistent values are not accompanied by other inconsistencies.
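
One way to run that check is to flag the inconsistent rows and see what else they have in common. The sketch below is an illustration on a hypothetical toy frame, not the real data. Reading a leading `C` in `InvoiceNo` as a marker worth comparing against negative quantities is an assumption, not something the notebook states.

```python
import pandas as pd

# Hypothetical toy data; invoice numbers and prices are illustrative only.
toy_df = pd.DataFrame({
    'InvoiceNo': ['100001', 'C100002', 'C100003', '100004'],
    'Quantity': [6, -1, -12, 1],
    'UnitPrice': [2.55, 27.5, 1.65, -10.0],
})

# Boolean masks for each kind of inconsistency.
negative_quantity = toy_df['Quantity'] < 0
negative_price = toy_df['UnitPrice'] < 0

# Assumption: invoices whose number starts with 'C' form a special group.
starts_with_c = toy_df['InvoiceNo'].astype(str).str.startswith('C')

# Count the rows in each group and the rows where two flags coincide.
summary = pd.Series({
    'negative_quantity': negative_quantity.sum(),
    'negative_price': negative_price.sum(),
    'starts_with_c': starts_with_c.sum(),
    'negative_quantity_and_c': (negative_quantity & starts_with_c).sum(),
})

print(summary)
```

If `negative_quantity_and_c` matched `negative_quantity` on the real data, the negative quantities would be fully explained by that invoice prefix.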

Another inconsistency in the dataset is the 

#### Characteristics of the subset with quantity below 0 (inconsistent)

In [13]:
ecommerce_df[ecommerce_df['Quantity'] < 0].head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 141 | C536379 | D | Discount | -1 | 12/1/2010 9:41 | 27.5 | 14527.0 | United Kingdom |
| 154 | C536383 | 35004C | SET OF 3 COLOURED FLYING DUCKS | -1 | 12/1/2010 9:49 | 4.65 | 15311.0 | United Kingdom |
| 235 | C536391 | 22556 | PLASTERS IN TIN CIRCUS PARADE | -12 | 12/1/2010 10:24 | 1.65 | 17548.0 | United Kingdom |
| 236 | C536391 | 21984 | PACK OF 12 PINK PAISLEY TISSUES | -24 | 12/1/2010 10:24 | 0.29 | 17548.0 | United Kingdom |
| 237 | C536391 | 21983 | PACK OF 12 BLUE PAISLEY TISSUES | -24 | 12/1/2010 10:24 | 0.29 | 17548.0 | United Kingdom |


#### Presence of discounts in the dataset

Note the discount item in the dataset. It matters because a discount applied to a purchase or product may indicate a sales strategy in which, at the seller's discretion, buying one product earns a discount on another. This strategy is called *cross-selling*.
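
Whether a product tends to appear together with a discount line can be quantified with the usual association-rule measures. The sketch below computes support, confidence, and lift by hand for a single rule on hypothetical toy data. It is not the Apriori algorithm itself, only the measures that rule mining reports, and `toy_lines`, `support`, and the stock codes `A` and `B` are illustrative names.

```python
import pandas as pd

# Hypothetical invoice lines; 'D' mirrors the discount StockCode seen in the data.
toy_lines = pd.DataFrame({
    'InvoiceNo': ['1', '1', '2', '2', '3', '4'],
    'StockCode': ['A', 'D', 'A', 'D', 'A', 'B'],
})

# One set of stock codes per invoice (a "basket").
baskets = toy_lines.groupby('InvoiceNo')['StockCode'].apply(set)
n_baskets = len(baskets)


def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / n_baskets


# Rule {A} -> {D}: how often a basket with A also carries the discount line.
support_a_d = support({'A', 'D'})
confidence_a_to_d = support_a_d / support({'A'})
lift_a_to_d = confidence_a_to_d / support({'D'})

print(support_a_d, confidence_a_to_d, lift_a_to_d)
```

A lift above 1 would suggest that the discount line appears with product `A` more often than chance alone would predict.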

In [12]:
ecommerce_df[ecommerce_df['Quantity'] < 0].describe()

| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 10624.0 | 10624.0 | 8905.0 |
| mean | -45.60721 | 42.308012 | 14991.667266 |
| std | 1092.214216 | 623.481552 | 1706.772357 |
| min | -80995.0 | 0.0 | 12346.0 |
| 25% | -10.0 | 1.06 | 13510.0 |
| 50% | -2.0 | 2.1 | 14895.0 |
| 75% | -1.0 | 4.95 | 16393.0 |
| max | -1.0 | 38970.0 | 18282.0 |


In [None]:
Note that when the quantity is marked with 