### Imports

In [8]:
import pandas as pd
import numpy as np

### Loading the data

In [9]:
data = pd.read_csv('../data/online_retail_II.csv')
data.head(3)

| | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |


In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067371 entries, 0 to 1067370
Data columns (total 8 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   Invoice      1067371 non-null  object 
 1   StockCode    1067371 non-null  object 
 2   Description  1062989 non-null  object 
 3   Quantity     1067371 non-null  int64  
 4   InvoiceDate  1067371 non-null  object 
 5   Price        1067371 non-null  float64
 6   Customer ID  824364 non-null   float64
 7   Country      1067371 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 65.1+ MB


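A few things stand out:
1. InvoiceDate is stored as object (string), not as a datetime.
2. Description and Customer ID have missing values: 4,382 and 243,007 of the 1,067,371 rows, respectively.
3. Customer ID is a float rather than an int, because pandas stores the missing values as NaN, which a plain int64 column cannot hold. We won't use str, since the IDs contain no letters.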
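
The second and third observations can be checked on a small example. The sketch below is illustrative only: the frame `toy` and its values are made up, not rows from the dataset. It counts missing values per column and then casts the float IDs to pandas' nullable `Int64` dtype, which is one way (not necessarily the one used later in this notebook) to hold integer IDs alongside missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the two columns with missing values.
toy = pd.DataFrame({
    "Description": ["PINK CHERRY LIGHTS", None, "WHITE CHERRY LIGHTS"],
    "Customer ID": [13085.0, np.nan, 13085.0],
})

# Count missing values per column.
missing = toy.isna().sum()
print(missing)

# NaN forces a float dtype; the nullable Int64 dtype keeps integers
# and represents the missing entry as <NA>.
toy["Customer ID"] = toy["Customer ID"].astype("Int64")
print(toy["Customer ID"].dtype)
```

Note that `"Int64"` (capital I) is the nullable extension dtype; casting to plain `int` or `"int64"` would raise an error while missing values are present.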

In [11]:
# Convert InvoiceDate to datetime
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])

# Check the min and max dates
# (double quotes outside so the f-strings also parse on Python < 3.12)
print(f"Min date: {data['InvoiceDate'].min()}")
print(f"Max date: {data['InvoiceDate'].max()}")

# Confirm the total time span of the dataset
time_span = data['InvoiceDate'].max() - data['InvoiceDate'].min()
print(f"Time span: {time_span}")

# Check the number of unique customers (NaN values are not counted)
print(f"Unique customers: {data['Customer ID'].nunique()}")

# Check the number of unique invoices
print(f"Unique invoices: {data['Invoice'].nunique()}")

# Count rows with a negative quantity
print(f"Negative quantities: {(data['Quantity'] < 0).sum()}")

Min date: 2009-12-01 07:45:00
Max date: 2011-12-09 12:50:00
Time span: 738 days 05:05:00
Unique customers: 5942
Unique invoices: 53628
Negative quantities: 22950
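
As a quick sanity check on the printed time span, the subtraction can be reproduced with the standard library alone. This is a minimal sketch that hard-codes the two timestamps from the output above; the variable names and the 365.25-day year used for the rough conversion are assumptions made for illustration.

```python
from datetime import datetime

# Timestamps copied from the printed min and max dates.
min_date = datetime(2009, 12, 1, 7, 45)
max_date = datetime(2011, 12, 9, 12, 50)

# Subtracting two datetimes yields a timedelta.
span = max_date - min_date
print(span)  # 738 days, 5:05:00

# Rough length in years, assuming 365.25 days per year.
years = span.days / 365.25
print(round(years, 2))  # 2.02
```

The result matches the 738 days 05:05:00 reported by pandas, so the data covers just over two years.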
