# **APPLIED DATA PIPELINE PROJECT - PHASE 01**


---



**Data Set:**

https://www.kaggle.com/datasets/carrie1/ecommerce-data/data

This is a transnational dataset containing all transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based, registered online retailer with no physical store. The company mainly sells unique all-occasion gifts. Many of its customers are wholesalers.

**541909 rows × 8 columns**

# **1. Data Extraction and Inspection**

In [44]:
import pandas as pd
import kagglehub

path = kagglehub.dataset_download("carrie1/ecommerce-data")
df = pd.read_csv(path + "/data.csv", encoding = "ISO-8859-1")

Using Colab cache for faster access to the 'ecommerce-data' dataset.


In [45]:
display(df)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,12/9/2011 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,12/9/2011 12:50,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,12/9/2011 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,12/9/2011 12:50,4.15,12680.0,France


In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
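
The `float64` dtype reported for `CustomerID` above is a side effect of its missing values: a NumPy-backed integer column cannot hold NaN. A minimal sketch on toy data, assuming one wants integer IDs while keeping the gaps, uses pandas' nullable `Int64` dtype. The toy frame and the `CustomerID_int` column are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the CustomerID column (illustrative values only)
toy = pd.DataFrame({"CustomerID": [17850.0, None, 12680.0]})

# A missing value forces a NumPy-backed numeric column to float64
print(toy["CustomerID"].dtype)

# The nullable Int64 dtype keeps integer IDs and represents gaps as <NA>
toy["CustomerID_int"] = toy["CustomerID"].astype("Int64")
print(toy["CustomerID_int"].tolist())
```

The notebook keeps the float column, which is why IDs show up as `17850.0` in the displayed tables.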


In [47]:
df.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


In [48]:
df.isnull().sum()

Unnamed: 0,0
InvoiceNo,0
StockCode,0
Description,1454
Quantity,0
InvoiceDate,0
UnitPrice,0
CustomerID,135080
Country,0
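
To put these null counts in proportion, the sketch below converts them to percentages of the 541,909 rows. It only reuses the numbers printed above; on the real frame, `df.isnull().mean() * 100` should give the same shares for every column at once. The names `total_rows`, `null_counts` and `null_pct` are illustrative.

```python
import pandas as pd

# Counts taken from the df.isnull().sum() output and the row total above
total_rows = 541909
null_counts = pd.Series({"Description": 1454, "CustomerID": 135080})

# Share of missing rows per column, in percent
null_pct = (null_counts / total_rows * 100).round(2)
print(null_pct)
```

Roughly a quarter of the rows have no `CustomerID`, while missing `Description` values affect well under 1% of the rows.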


# **2. Problem Identification and Transformations**


In [49]:
# New DataFrame for the transformations
df_clean = df.copy()

## 2.1 Missing values




### CustomerID


*   CustomerID | 135,080 NaN rows

---


Checking whether null CustomerID values can be filled from rows that share the same InvoiceNo (invoice number) and have CustomerID populated

In [50]:
# Flag, per InvoiceNo, whether it has at least one null / one non-null CustomerID
invoice_null = df.groupby('InvoiceNo')['CustomerID'].apply(lambda x: x.isnull().any())
invoice_notnull = df.groupby('InvoiceNo')['CustomerID'].apply(lambda x: x.notnull().any())

# How many invoices mix null and non-null values?
mixed_invoice = (invoice_null & invoice_notnull)

print("Total mixed InvoiceNo:", mixed_invoice.sum())
print("Total InvoiceNo with null CustomerID:", invoice_null.sum())
print("Total InvoiceNo with non-null CustomerID:", invoice_notnull.sum())
print(f"Total rows with null CustomerID: {df['CustomerID'].isnull().sum()}")

Total mixed InvoiceNo: 0
Total InvoiceNo with null CustomerID: 3710
Total InvoiceNo with non-null CustomerID: 22190
Total rows with null CustomerID: 135080


 - Total mixed InvoiceNo: 0


No invoice has both null and non-null CustomerID rows, so CustomerID cannot be recovered from InvoiceNo
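
The check above can be reproduced on a small frame to show what a recoverable case would look like. This is a sketch under assumptions: the invoice numbers are made up, and invoice `1003` is a hypothetical mixed invoice of a kind the real data does not contain. The `CustomerID_filled` column is illustrative.

```python
import pandas as pd

# Toy invoices: 1001 is fully identified, 1002 is fully anonymous,
# 1003 is a hypothetical mixed invoice (absent from the real data)
toy = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1002", "1003", "1003"],
    "CustomerID": [17850.0, 17850.0, None, None, 12680.0, None],
})

# Per invoice: does it have any null / any non-null CustomerID?
is_null = toy["CustomerID"].isna()
has_null = is_null.groupby(toy["InvoiceNo"]).any()
has_value = (~is_null).groupby(toy["InvoiceNo"]).any()

# Only mixed invoices could donate a CustomerID to their null rows
mixed = has_null & has_value
print(mixed[mixed].index.tolist())

# For such invoices, the known ID can be propagated within the group
toy["CustomerID_filled"] = toy.groupby("InvoiceNo")["CustomerID"].transform("first")
print(toy)
```

With zero mixed invoices in the real data, the same `transform("first")` would leave all 135,080 null rows unchanged, which is why the notebook does not attempt it.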

### Description

*   Description | 1,454 NaN rows

 Many Description values also contain data-entry errors

---



Each product has its own StockCode. Many Description values (product names) are NaN or filled in incorrectly, yet the same StockCode also appears on rows with a valid Description

In [51]:
# Flag, per StockCode, whether it has at least one null / one non-null Description
StockCode_null = df.groupby('StockCode')['Description'].apply(lambda x: x.isnull().any())
StockCode_notnull = df.groupby('StockCode')['Description'].apply(lambda x: x.notnull().any())

# How many StockCodes mix null and non-null values?
mixed_StockCode = (StockCode_null & StockCode_notnull)

print("Total mixed StockCode:", mixed_StockCode.sum())
print("Total StockCode with null Description:", StockCode_null.sum())
print("Total StockCode with non-null Description:", StockCode_notnull.sum())
print(f"Total rows with null Description: {df['Description'].isnull().sum()}")

Total mixed StockCode: 848
Total StockCode with null Description: 960
Total StockCode with non-null Description: 3958
Total rows with null Description: 1454


 - Total mixed StockCode: 848

848 StockCodes have both null and non-null Description values, so their missing descriptions can be filled from rows with the same StockCode


In [52]:
# Example:
display(df[df["StockCode"] == "35965"])
print("\n")
# Single quotes inside the f-string keep it valid on Python versions before 3.12
print(f"Number of NaN Description values for StockCode 35965: {df[df['StockCode'] == '35965']['Description'].isnull().sum()}")

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
2889,536592,35965,FOLKART HEART NAPKIN RINGS,4,12/1/2010 17:06,3.36,,United Kingdom
6017,536876,35965,FOLKART HEART NAPKIN RINGS,1,12/3/2010 11:36,3.36,,United Kingdom
7205,537013,35965,,-25,12/3/2010 15:40,0.00,,United Kingdom
8071,537126,35965,FOLKART HEART NAPKIN RINGS,1,12/5/2010 12:13,2.95,18118.0,United Kingdom
10678,537237,35965,FOLKART HEART NAPKIN RINGS,3,12/6/2010 9:58,3.36,,United Kingdom
...,...,...,...,...,...,...,...,...
347758,567337,35965,,5,9/19/2011 14:56,0.00,,United Kingdom
349563,567507,35965,FOLKART HEART NAPKIN RINGS,12,9/20/2011 14:46,0.97,,United Kingdom
454169,575513,35965,,7,11/10/2011 10:39,0.00,,United Kingdom
464522,576110,35965,,5,11/14/2011 10:33,0.00,,United Kingdom




Number of NaN Description values for StockCode 35965: 10


In [53]:
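# Build a StockCode -> valid Description mapping (first non-null value per code)
mapa_descricoes = df_clean.dropna(subset=['Description']).groupby('StockCode')['Description'].first().to_dict()

# Replace every Description with the first valid one for its StockCode;
# rows whose StockCode has no valid Description keep their original value.
# Series.map is a vectorized equivalent of the row-wise apply and is much faster.
df_clean['Description'] = (
    df_clean['StockCode'].map(mapa_descricoes).fillna(df_clean['Description'])
)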

In [54]:
print(f"Descriptions recovered: {df['Description'].isnull().sum() - df_clean['Description'].isnull().sum()}")
print(f"Descriptions not recovered: {df_clean['Description'].isnull().sum()}")
print(f"Incorrect Description variants removed: {df['Description'].nunique() - df_clean['Description'].nunique()}")


Descriptions recovered: 1342
Descriptions not recovered: 112
Incorrect Description variants removed: 406
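
The fill step does two things at once: it recovers null descriptions and it overwrites every other variant with the first one seen for that StockCode. The sketch below shows both effects on toy rows. The `"damaged"` entry and StockCode `99999` are hypothetical, and `Description_clean` is an illustrative column name.

```python
import pandas as pd

toy = pd.DataFrame({
    "StockCode": ["35965", "35965", "35965", "99999"],
    "Description": ["FOLKART HEART NAPKIN RINGS", None, "damaged", None],
})

# First non-null Description per StockCode (groupby.first skips nulls)
first_desc = toy.groupby("StockCode")["Description"].first()

# Map every row to its code's first description; keep the original
# value when the code has no non-null description at all
toy["Description_clean"] = toy["StockCode"].map(first_desc).fillna(toy["Description"])
print(toy)
```

Here the null row is recovered, the `"damaged"` variant is replaced, and code `99999` stays null because it never had a description. One caveat of taking the first value: if the first non-null entry for a code happens to be a faulty one, that faulty text is propagated, so a spot check of the resulting mapping may be worthwhile.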


## 2.2 Duplicates

Duplicates are acceptable in this dataset's business model. Each row represents one item line on an invoice, so the same `InvoiceNo` (invoice number) appears several times when an invoice contains multiple products. The sale itself is uniquely identified by its `InvoiceNo`. The fully identical rows counted below are therefore kept.
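
Two different notions of "duplicate" are in play here, and a toy invoice makes the difference concrete. The rows below are illustrative; `repeated_invoice` and `full_duplicates` are names introduced only for this sketch.

```python
import pandas as pd

toy = pd.DataFrame({
    "InvoiceNo": ["536409", "536409", "536409"],
    "StockCode": ["21866", "22866", "21866"],
    "Quantity": [1, 1, 1],
})

# Rows that repeat an earlier InvoiceNo: expected, one row per invoice line
repeated_invoice = toy.duplicated(subset=["InvoiceNo"]).sum()

# Rows identical to an earlier row in every column
full_duplicates = toy.duplicated().sum()

print(repeated_invoice, full_duplicates)
```

`DataFrame.duplicated()` without `subset` counts only the second kind, rows that match an earlier row in every column.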

In [55]:
print(f"Number of duplicated rows: {df_clean.duplicated().sum()}")
display(df_clean[df_clean.duplicated()])

Number of duplicated rows: 5270


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
517,536409,21866,UNION JACK FLAG LUGGAGE TAG,1,12/1/2010 11:45,1.25,17908.0,United Kingdom
527,536409,22866,HAND WARMER SCOTTY DOG DESIGN,1,12/1/2010 11:45,2.10,17908.0,United Kingdom
537,536409,22900,SET 2 TEA TOWELS I LOVE LONDON,1,12/1/2010 11:45,2.95,17908.0,United Kingdom
539,536409,22111,SCOTTIE DOG HOT WATER BOTTLE,1,12/1/2010 11:45,4.95,17908.0,United Kingdom
555,536412,22327,ROUND SNACK BOXES SET OF 4 SKULLS,1,12/1/2010 11:49,2.95,17920.0,United Kingdom
...,...,...,...,...,...,...,...,...
541675,581538,22068,BLACK PIRATE TREASURE CHEST,1,12/9/2011 11:34,0.39,14446.0,United Kingdom
541689,581538,23318,BOX OF 6 MINI VINTAGE CRACKERS,1,12/9/2011 11:34,2.49,14446.0,United Kingdom
541692,581538,22992,REVOLVER WOODEN RULER,1,12/9/2011 11:34,1.95,14446.0,United Kingdom
541699,581538,22694,WICKER STAR,1,12/9/2011 11:34,2.10,14446.0,United Kingdom


## 2.3 General inconsistencies

### Negative values, cancellations, and fees

In [56]:
# StockCodes that represent fees rather than products
stockcode_fees = ['C2', 'DOT', 'POST', 'AMAZONFEE']

# Filter the DataFrame to show inconsistencies:
# InvoiceNo starting with 'C' or 'A' | Quantity <= 0 | UnitPrice <= 0 | fee StockCodes
df_inconsistencias = df_clean[
    (df_clean['InvoiceNo'].astype(str).str.startswith(('C', 'A'))) |
    (df_clean['Quantity'] <= 0) |
    (df_clean['UnitPrice'] <= 0) |
    (df_clean['StockCode'].isin(stockcode_fees))
]

display(df_inconsistencias)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
45,536370,POST,POSTAGE,3,12/1/2010 8:45,18.00,12583.0,France
141,C536379,D,Discount,-1,12/1/2010 9:41,27.50,14527.0,United Kingdom
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,-1,12/1/2010 9:49,4.65,15311.0,United Kingdom
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,-12,12/1/2010 10:24,1.65,17548.0,United Kingdom
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,12/1/2010 10:24,0.29,17548.0,United Kingdom
...,...,...,...,...,...,...,...,...
541716,C581569,84978,HANGING HEART JAR T-LIGHT HOLDER,-1,12/9/2011 11:58,1.25,17315.0,United Kingdom
541717,C581569,20979,36 PENCILS TUBE RED RETROSPOT,-5,12/9/2011 11:58,1.25,17315.0,United Kingdom
541730,581570,POST,POSTAGE,1,12/9/2011 11:59,18.00,12662.0,Germany
541767,581574,POST,POSTAGE,2,12/9/2011 12:09,18.00,12526.0,Germany


### Outliers

In [57]:
df_clean.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


In [58]:
# Identify outliers
display(df_clean[df_clean['Quantity'] > 5000])

print("\n")
print("-------------------------------------------------------")
print("Checking the purchase pattern of the largest outlier/customer")
print("-------------------------------------------------------")

# Change the ID to inspect other customers
display(df_clean[df_clean['CustomerID'] == 16446])

print("The outlier purchases were cancelled")

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
61619,541431,23166,MEDIUM CERAMIC TOP STORAGE JAR,74215,1/18/2011 10:01,1.04,12346.0,United Kingdom
74614,542504,37413,ICON MUG REVOLUTIONARY,5568,1/28/2011 12:03,0.0,,United Kingdom
502122,578841,84826,ASSTD DESIGN 3D PAPER STICKERS,12540,11/25/2011 15:57,0.0,13256.0,United Kingdom
540421,581483,23843,"PAPER CRAFT , LITTLE BIRDIE",80995,12/9/2011 9:15,2.08,16446.0,United Kingdom




-------------------------------------------------------
Checking the purchase pattern of the largest outlier/customer
-------------------------------------------------------


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
194354,553573,22980,PANTRY SCRUBBING BRUSH,1,5/18/2011 9:52,1.65,16446.0,United Kingdom
194355,553573,22982,PANTRY PASTRY BRUSH,1,5/18/2011 9:52,1.25,16446.0,United Kingdom
540421,581483,23843,"PAPER CRAFT , LITTLE BIRDIE",80995,12/9/2011 9:15,2.08,16446.0,United Kingdom
540422,C581484,23843,"PAPER CRAFT , LITTLE BIRDIE",-80995,12/9/2011 9:27,2.08,16446.0,United Kingdom


The outlier purchases were cancelled
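
The manual inspection above can be turned into a check that pairs each sale with a cancellation for the same customer, product and quantity. This is a sketch on rows copied from the CustomerID 16446 output, not the notebook's method. The matching rule (same `CustomerID` and `StockCode`, opposite `Quantity`) is an assumption and would over-match if the same purchase were repeated.

```python
import pandas as pd

# Rows copied from the CustomerID 16446 output above
toy = pd.DataFrame({
    "InvoiceNo": ["553573", "581483", "C581484"],
    "StockCode": ["22980", "23843", "23843"],
    "Quantity": [1, 80995, -80995],
    "CustomerID": [16446.0, 16446.0, 16446.0],
})

is_cancel = toy["InvoiceNo"].str.startswith("C")
sales = toy[~is_cancel]
# Flip the sign so a cancellation lines up with the sale it reverses
cancels = toy[is_cancel].assign(Quantity=lambda d: -d["Quantity"])

# A sale counts as reversed when a cancellation exists for the same
# customer and product with the opposite quantity
matched = sales.merge(
    cancels[["CustomerID", "StockCode", "Quantity", "InvoiceNo"]],
    on=["CustomerID", "StockCode", "Quantity"],
    how="left",
    suffixes=("", "_cancel"),
)
matched["reversed"] = matched["InvoiceNo_cancel"].notna()
print(matched[["InvoiceNo", "Quantity", "reversed", "InvoiceNo_cancel"]])
```

Only the 80,995-unit sale is flagged as reversed, matching the pair of invoices `581483` and `C581484` shown above.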


# **3. Transformations**

In [77]:
df_clean['total_value'] = df_clean['Quantity'] * df_clean['UnitPrice']

# =========================================
#               FACT TABLES
# =========================================


# --------- Fact Sales -----------
fact_sales = df_clean[
    (~df_clean['InvoiceNo'].astype(str).str.startswith(('C', 'A'))) &
    (df_clean['Quantity'] > 0) &
    (df_clean['UnitPrice'] > 0) &
    (~df_clean['StockCode'].isin(stockcode_fees))
].copy()

# Handle NaN CustomerID values (the column becomes mixed-type: floats plus 'Unknown')
fact_sales['CustomerID'] = fact_sales['CustomerID'].fillna('Unknown')

# Include 'Country' in the column selection for later use
fact_sales = fact_sales[[
    'InvoiceNo', 'StockCode', 'CustomerID', 'InvoiceDate',
    'Quantity', 'UnitPrice', 'total_value', 'Country'
]]

# --------- Fact Fees -----------
fact_fees = df_clean[
    (df_clean['StockCode'].isin(stockcode_fees)) &
    (~df_clean['InvoiceNo'].astype(str).str.startswith(('C', 'A')))].copy()
fact_fees = fact_fees[[
    'InvoiceNo', 'StockCode', 'CustomerID', 'InvoiceDate',
    'Quantity', 'UnitPrice', 'total_value', 'Country'
]]


# --------- Fact Cancellations -----------
fact_cancellations = df_clean[
    (df_clean['InvoiceNo'].astype(str).str.startswith(('C', 'A'))) &
    # Note: unlike stockcode_fees, this list does not exclude 'AMAZONFEE'
    (~df_clean['StockCode'].isin(['C2', 'DOT', 'POST']))
].copy()

fact_cancellations = fact_cancellations[[
    'InvoiceNo', 'StockCode', 'CustomerID', 'InvoiceDate',
    'Quantity', 'UnitPrice', 'total_value', 'Country'
]]


# =========================================
#             DIMENSION TABLES
# =========================================

# --------- Dim Customers -----------

dim_customer = (
    fact_sales[['CustomerID', 'Country']]
    .drop_duplicates(subset=['CustomerID'])
    .reset_index(drop=True)  # contiguous index, so the append below cannot overwrite a row
)

# Ensure the 'Unknown' customer exists
if 'Unknown' not in dim_customer['CustomerID'].values:
    dim_customer.loc[len(dim_customer)] = ['Unknown', 'Unknown']



# --------- Dim Products -----------
dim_product = (
    df_clean[['StockCode', 'Description']]
    .drop_duplicates(subset=['StockCode'])
    .rename(columns={'Description': 'ProductDescription'})
    .copy()
)


# --------- Dim Date -----------
dim_date = (
    df_clean[['InvoiceDate']]
    .drop_duplicates(subset=['InvoiceDate'])
    .copy()
)
# Create date-part columns
dim_date['InvoiceDate'] = pd.to_datetime(dim_date['InvoiceDate'])
dim_date['Year'] = dim_date['InvoiceDate'].dt.year
dim_date['Month'] = dim_date['InvoiceDate'].dt.month
dim_date['Day'] = dim_date['InvoiceDate'].dt.day
dim_date['Weekday'] = dim_date['InvoiceDate'].dt.day_name()
dim_date['Hour'] = dim_date['InvoiceDate'].dt.hour

# --------- Dim Country -----------
dim_country = (
    df_clean[['Country']]
    .drop_duplicates()
    .copy()
)
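
One detail worth checking in the cell above: the fact tables keep `InvoiceDate` as the original text (for example `12/1/2010 8:26`), while `dim_date` converts it with `pd.to_datetime`, so the exported key columns would not share a format. A possible way to keep them aligned, sketched on toy rows, is to parse the column once before deriving any table. The explicit month-first `format` string is an assumption based on the sample rows shown earlier.

```python
import pandas as pd

toy = pd.DataFrame({
    "InvoiceNo": ["536365", "581587"],
    "InvoiceDate": ["12/1/2010 8:26", "12/9/2011 12:50"],
})

# Parse once, before any fact or dimension table is derived
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"], format="%m/%d/%Y %H:%M")

fact = toy[["InvoiceNo", "InvoiceDate"]].copy()
dim_date = toy[["InvoiceDate"]].drop_duplicates().copy()
dim_date["Year"] = dim_date["InvoiceDate"].dt.year

# Every fact row should now find its date in the dimension
joined = fact.merge(dim_date, on="InvoiceDate", how="left", indicator=True)
print((joined["_merge"] == "both").all())
```

The same `merge(..., indicator=True)` pattern can serve as a referential-integrity check between any fact table and its dimensions.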

# **4. Loading**

In [79]:
# =========================================
#                 EXPORT
# =========================================
fact_sales.to_csv('fact_sales.csv', index=False)
fact_cancellations.to_csv('fact_cancellations.csv', index=False)
fact_fees.to_csv('fact_fees.csv', index=False)

dim_customer.to_csv('dim_customer.csv', index=False)
dim_product.to_csv('dim_product.csv', index=False)
dim_date.to_csv('dim_date.csv', index=False)
dim_country.to_csv('dim_country.csv', index=False)
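
A small round-trip check after the export can catch truncated or mistyped files. The sketch below writes toy tables to a temporary folder and reads them back. The `tables` dictionary and `out_dir` are hypothetical; in the notebook they would correspond to the seven DataFrames above and to whatever output folder is used.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the exported tables (illustrative)
tables = {
    "fact_sales": pd.DataFrame({"InvoiceNo": ["536365"], "StockCode": ["85123A"]}),
    "dim_product": pd.DataFrame({"StockCode": ["85123A"]}),
}

out_dir = Path(tempfile.mkdtemp())  # hypothetical output folder
for name, table in tables.items():
    table.to_csv(out_dir / f"{name}.csv", index=False)

# Read back as strings so codes such as "85123A" are not re-typed on load
for name, table in tables.items():
    loaded = pd.read_csv(out_dir / f"{name}.csv", dtype=str)
    assert loaded.shape == table.shape, name

print(sorted(p.name for p in out_dir.iterdir()))
```

Reading with `dtype=str` is a deliberate choice here: it compares shapes without letting pandas re-infer numeric types for identifier columns.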

# ....