# Exploratory Data Analysis

The objective of this notebook is to explore the distributions and relationships in the pre-processed data, identify patterns, anomalies and outliers, and generate hypotheses to guide the subsequent RFM modeling and clustering steps.

The notebook is divided into the following sub-sections:
- Overview of the cleaned dataset

In [1]:
import pandas as pd

# Load the cleaned dataset produced by the pre-processing step
path = "../data_clean/clean_online_retail.csv"
df = pd.read_csv(path)

## Overview of the cleaned dataset
Before starting with the exploratory analysis, it is necessary to obtain a general overview of the cleaned dataset. Therefore, the following questions will be answered:
1. What are the main variables of the final dataset?
2. How many transactions does the dataset contain?
3. How many unique customers are represented in the dataset?

### 1. What are the main variables of the final dataset?
The dataset is composed of the following 9 columns:
| Column | Data Type | Description |
|---|---|---|
| Invoice | Nominal, String | A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation. |
| StockCode | Nominal, String | A 5-digit integral number uniquely assigned to each distinct product. |
| Description | Nominal, String | Product item name. |
| Quantity | Numeric, Int64 | The quantity of each product (item) per transaction. |
| InvoiceDate | Numeric, DateTime64 [us] | The date and time when a transaction was generated. |
| Price | Numeric, Float64 | Product price per unit in pounds sterling. |
| Customer ID | Nominal, Int64 | A 5-digit integral number uniquely assigned to each customer. |
| Country | Nominal, String | The name of the country where a customer resides. |
| TotalPrice | Numeric, Float64 | The total monetary value of the transaction (_Quantity_ x _Price_). |
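
The types listed above can be checked right after loading. The snippet below is a minimal sketch, not the notebook's own loading code: it reads two made-up rows from an in-memory CSV (the values are illustrative and not taken from the dataset) and assumes that the code columns should be kept as strings and that `InvoiceDate` should be parsed explicitly. By default, `pd.read_csv` infers digit-only codes as integers and leaves dates as text.

```python
import io

import pandas as pd

# Two made-up rows that only mimic the column layout of the table above.
csv_text = """Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country,TotalPrice
000001,00001,EXAMPLE ITEM A,2,2010-01-04 09:30:00,1.50,10001,United Kingdom,3.00
000001,00002,EXAMPLE ITEM B,3,2010-01-04 09:30:00,0.25,10001,United Kingdom,0.75
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Invoice": str, "StockCode": str},  # keep codes as text (preserves leading zeros)
    parse_dates=["InvoiceDate"],               # parse timestamps instead of leaving them as text
)

# Inspect the inferred type of every column.
print(sample.dtypes)

# Sanity check: TotalPrice should equal Quantity x Price on every row.
max_abs_diff = (sample["Quantity"] * sample["Price"] - sample["TotalPrice"]).abs().max()
print(max_abs_diff < 1e-9)
```

Applied to the real file, the same `dtype` and `parse_dates` arguments would be passed to `pd.read_csv(path, ...)`; whether they are needed depends on how the cleaned CSV was written.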

### 2. How many transactions does the dataset contain?
Each row in the dataset represents one recorded monetary transaction for a single product within a purchase, so a purchase (invoice) that contains several products spans several rows.

In [2]:
df.shape

(779423, 9)

The dataset contains 779,423 product-level transactions (rows) across 9 columns, which provides a substantial basis for analyzing customer purchasing behavior.
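
Because each row covers a single product, the row count is not the same as the number of purchases. The following sketch, built on a made-up three-row frame (the values are illustrative only), shows one way to separate the two counts by using the `Invoice` column.

```python
import pandas as pd

# Made-up line items: two invoices, three rows in total.
toy = pd.DataFrame(
    {
        "Invoice": ["000001", "000001", "000002"],
        "StockCode": ["00001", "00002", "00001"],
        "Quantity": [2, 1, 5],
    }
)

n_rows = len(toy)                      # product-level rows
n_invoices = toy["Invoice"].nunique()  # distinct purchases (invoices)

# Number of product rows that make up each invoice.
rows_per_invoice = toy.groupby("Invoice").size()

print(n_rows, n_invoices)
print(rows_per_invoice)
```

On the notebook's `df`, the equivalent call would be `df["Invoice"].nunique()`; the result is not reported here because it was not computed in the original notebook.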

### 3. How many unique customers are represented in the dataset?

In [3]:
df['Customer ID'].unique().shape

(5878,)

The dataset includes 5,878 unique customers over the analyzed period.

This figure represents the customers observed in the dataset and does not necessarily correspond to the total number of customers of the company.
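
As a side note, the same count can be obtained as a plain integer with `Series.nunique()`, which ignores missing values by default. The sketch below uses a made-up column of customer identifiers (not real data) and also shows how many rows belong to each customer.

```python
import pandas as pd

# Made-up customer identifiers: three distinct customers over five rows.
toy = pd.DataFrame({"Customer ID": [10001, 10001, 10002, 10003, 10003]})

# Distinct customers as an integer (missing values are excluded by default).
n_customers = toy["Customer ID"].nunique()

# Rows recorded for each customer, most frequent first.
rows_per_customer = toy["Customer ID"].value_counts()

print(n_customers)
print(rows_per_customer)
```

When the column has no missing values, `nunique()` and `unique().shape[0]` return the same number.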