# Data Understanding and Cleaning

The aim of this notebook is to understand the structure of the dataset and to identify potential data quality issues that could affect further exploratory data analysis and the RFM calculations.

The notebook is divided into the following sub-sections:
- Dataset dimensions
- Dataset column data-type and meaning
- Missing values
- Column value ranges
- Duplicate values

First, import the Python libraries needed for this initial data exploration. The same cell also defines the path of the file and reads it into a DataFrame.

In [2]:
import pandas as pd

path = "../data_raw/online_retail_II.csv"
df = pd.read_csv(path)

## Dataset dimensions
According to the result shown below, the dataset is rectangular, with 1,067,371 rows and 8 columns.

In [4]:
df.shape

(1067371, 8)

## Dataset column data-type and meaning
The dataset is composed of the following 8 columns:
| Column | Data Type | Description |
|---|---|---|
| Invoice | Nominal, String | A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation. |
| StockCode | Nominal, String | A 5-digit integral number uniquely assigned to each distinct product. |
| Description | Nominal, String | Product item name. |
| Quantity | Numeric, Int64 | The quantity of each product (item) per transaction. |
| InvoiceDate | Datetime, String | The day and time when a transaction was generated (read as text by default). |
| Price | Numeric, Float64 | Product price per unit in sterling. |
| Customer ID | Nominal, Float64 | A 5-digit integral number uniquely assigned to each customer. |
| Country | Nominal, String | The name of the country where a customer resides. |
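
The table shows two columns whose stored type does not match their meaning: _InvoiceDate_ is read as text and _Customer ID_ as a float. The sketch below shows one possible way to convert them and to flag cancellations from the _Invoice_ prefix. The rows in `sample`, the timestamp format, and the `is_cancellation` column name are assumptions made up for illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy rows invented for illustration; they only mimic the column layout described above.
sample = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489435"],
    "InvoiceDate": ["2009-12-01 07:45:00", "2009-12-01 10:33:00", "2009-12-01 07:46:00"],
    "Customer ID": [13085.0, 16321.0, None],
})

# Parse the timestamp strings into datetime values (assumes an ISO-like format).
sample["InvoiceDate"] = pd.to_datetime(sample["InvoiceDate"])

# Store the identifier as a nullable integer so missing IDs stay missing.
sample["Customer ID"] = sample["Customer ID"].astype("Int64")

# Flag cancellations: invoice codes starting with "c" or "C" (hypothetical column name).
sample["is_cancellation"] = sample["Invoice"].str.upper().str.startswith("C")

print(sample.dtypes)
```

Comparing the prefix after upper-casing covers both spellings of the cancellation marker, which may matter if the raw file is not consistent.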

## Missing values
However, not every column is fully populated, as the cell result below shows. Both _Description_ and _Customer ID_ contain null values: _Description_ is missing in 4,382 rows (about 0.4%), while _Customer ID_ is missing in 243,007 rows (about 22.8%), the highest count of any column.

Because the main goal of the project is to implement an RFM analysis, the _Description_ column will be discarded, as it is not relevant to this type of analysis. Rows with a null _Customer ID_ will be removed entirely during cleaning, since RFM analysis requires customer-level identification. This decision reduces the total number of records but ensures accurate and meaningful customer segmentation.

In [4]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 1067371 entries, 0 to 1067370
Data columns (total 8 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   Invoice      1067371 non-null  str    
 1   StockCode    1067371 non-null  str    
 2   Description  1062989 non-null  str    
 3   Quantity     1067371 non-null  int64  
 4   InvoiceDate  1067371 non-null  str    
 5   Price        1067371 non-null  float64
 6   Customer ID  824364 non-null   float64
 7   Country      1067371 non-null  str    
dtypes: float64(2), int64(1), str(5)
memory usage: 65.1 MB
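
The non-null counts above can also be turned into an explicit per-column summary of missing values. The following is a minimal sketch: `missing_summary` is a hypothetical helper, and `toy` is a made-up frame used only so the example runs on its own.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of nulls per column, largest first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# Toy data invented for illustration, not taken from the dataset.
toy = pd.DataFrame({
    "Description": ["MUG", None, "LAMP", "BOX"],
    "Customer ID": [13085.0, None, None, 17850.0],
    "Quantity": [1, 2, 3, 4],
})

result = missing_summary(toy)
print(result)
```

Calling `missing_summary(df)` on the full dataset should give the same information as `df.info()`, expressed as missing counts and percentages rather than non-null counts.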


### Cleaning Customer ID missing values
Before continuing with the rest of the data understanding process, transactions without a valid _Customer ID_ value are removed so that they do not distort the analysis of the remaining data.

In [9]:
df_v1 = df.dropna(subset=['Customer ID'])

In [10]:
df_v1.info()

<class 'pandas.DataFrame'>
Index: 824364 entries, 0 to 1067370
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Invoice      824364 non-null  str    
 1   StockCode    824364 non-null  str    
 2   Description  824364 non-null  str    
 3   Quantity     824364 non-null  int64  
 4   InvoiceDate  824364 non-null  str    
 5   Price        824364 non-null  float64
 6   Customer ID  824364 non-null  float64
 7   Country      824364 non-null  str    
dtypes: float64(2), int64(1), str(5)
memory usage: 56.6 MB
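
The output above shows that the filtered frame keeps the original row labels (`0 to 1067370`). A possible variation, sketched here on made-up data, takes an explicit copy, resets the index, and checks how many rows were removed. The names `toy`, `toy_v1`, and `removed` are assumptions for illustration.

```python
import pandas as pd

# Toy data invented for illustration, not taken from the dataset.
toy = pd.DataFrame({
    "Invoice": ["489434", "489435", "489436"],
    "Customer ID": [13085.0, None, 13078.0],
})

# Drop rows without a customer, take an independent copy, and renumber the rows.
toy_v1 = toy.dropna(subset=["Customer ID"]).copy().reset_index(drop=True)

# Sanity checks: number of rows removed and no nulls left in the key column.
removed = len(toy) - len(toy_v1)
no_nulls_left = bool(toy_v1["Customer ID"].notna().all())
print(removed, no_nulls_left)
```

Applied to the real data, `removed` should equal the difference between the two row counts reported by `info()`, that is 1,067,371 − 824,364 = 243,007.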


In [16]:
df.describe()

| Statistic | Quantity | Price | Customer ID |
|---|---|---|---|
| count | 1067371.0 | 1067371.0 | 824364.0 |
| mean | 9.938898 | 4.649388 | 15324.638504 |
| std | 172.7058 | 123.5531 | 1697.46445 |
| min | -80995.0 | -53594.36 | 12346.0 |
| 25% | 1.0 | 1.25 | 13975.0 |
| 50% | 3.0 | 2.1 | 15255.0 |
| 75% | 10.0 | 4.15 | 16797.0 |
| max | 80995.0 | 38970.0 | 18287.0 |
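
The summary above reports negative minimum values for both _Quantity_ and _Price_. One way to see how many rows fall outside the expected positive range is to count them per condition, as in this sketch. The frame `toy` and the column names of `checks` are made up for illustration, and whether such rows should be kept or removed is not decided here.

```python
import pandas as pd

# Toy data invented for illustration, not taken from the dataset.
toy = pd.DataFrame({
    "Quantity": [12, -3, 5, 0],
    "Price": [6.95, 2.10, -1.00, 0.0],
})

# One boolean column per range check (hypothetical names).
checks = pd.DataFrame({
    "negative_quantity": toy["Quantity"] < 0,
    "non_positive_price": toy["Price"] <= 0,
})

# Summing booleans counts the rows that meet each condition.
counts = checks.sum()
print(counts)
```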
