# Online Retail Dataset Data Cleaning
<hr style="border: 2px solid #000000;">

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns

---

# Table of Contents
1. [Introduction](#I.-Introduction)
2. [Data Exploration](#II.-Data-Exploration)
3. [Canceled Orders](#III.-Canceled-Orders)
4. [Preprocessing](#IV.-Preprocessing)
5. [Conclusion](#V.-Conclusion)

---

## I. Introduction

Welcome to the "Online Retail Data Cleaning" Notebook for the UCI Machine Learning Repository dataset. This notebook addresses the essential task of refining and preparing the dataset for analysis. The UCI dataset, regarded as a valuable resource, may exhibit imperfections like missing values, outliers, and inconsistencies. The objective is to systematically address these issues, ensuring the creation of a clean and reliable dataset.

The cleaning process plays a vital role in establishing a robust foundation for subsequent data analysis and modeling. Throughout this notebook, we explore the intricacies of the UCI dataset, applying strategies to handle missing data, outliers, and other common challenges. By the conclusion of this process, the goal is to present a meticulously cleaned dataset, ready for meaningful insights and advanced analytics.

This dataset contains transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. Source: http://archive.ics.uci.edu/ml/datasets/Online+Retail

Each row in the dataset represents one product line of a transaction, and understanding the structure of its features is essential for subsequent data analysis and cleaning.

Examination of the fundamental features that characterize each transaction reveals:

- **InvoiceNo:** A 6-digit integral number assigned to each transaction and shared by all of its line items. If it begins with the letter 'C', it signifies a cancellation.
- **StockCode:** A 5-digit integral number assigned to each distinct product, sometimes followed by a letter suffix (for example, 85123A).
- **Description:** The nominal field that holds the product or item name.
- **Quantity:** Numeric field representing the quantities of each product per transaction.
- **InvoiceDate:** Date and time when a transaction occurred.
- **UnitPrice:** Numeric field denoting the unit price of each product in sterling (£).
- **CustomerID:** A 5-digit integral number serving as a unique identifier for each customer.
- **Country:** Nominal field indicating the country where a customer resides.

---

## II. Data Exploration

Now that we have gained an initial understanding of the dataset features, it is time to delve into the exploration phase. Data exploration plays a pivotal role in uncovering patterns, trends, and potential challenges within the dataset. By closely examining the distribution and characteristics of our variables, we aim to gain valuable insights that will inform subsequent steps in our analysis.

### Overview of Dataset Characteristics

In [2]:
#read in dataset
raw_OR = pd.read_csv('OnlineRetail.csv',encoding='latin1')

In [3]:
#viewing head of dataset
raw_OR.head(5)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [4]:
#changing InvoiceDate to datetime format
raw_OR['InvoiceDate'] = pd.to_datetime(raw_OR['InvoiceDate'])
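
The raw timestamps shown in the head of the dataset follow a month/day/year layout (`12/1/2010 8:26`). As a minimal sketch on toy values, and assuming every row uses that same layout, an explicit `format` string removes any day/month guessing from the conversion. The second timestamp is made up for illustration.

```python
import pandas as pd

# Toy timestamps in the same month/day/year layout as the raw file
raw_dates = pd.Series(['12/1/2010 8:26', '12/9/2011 12:50'])

# An explicit format tells the parser which field is the month
parsed = pd.to_datetime(raw_dates, format='%m/%d/%Y %H:%M')
print(parsed)
```

If some rows did not match the assumed layout, this call would raise an error rather than silently misread them, which is usually the safer failure mode during cleaning.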

In [5]:
#data information
raw_OR.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


In [6]:
#convert CustomerID to object 
raw_OR['CustomerID'] = raw_OR['CustomerID'].astype(object)
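
Casting a float column to `object` changes the dtype label but keeps the underlying values as floats such as `17850.0`. If clean string identifiers are wanted, one possible alternative (a sketch on toy values, not the approach used in this notebook) is to pass through pandas' nullable integer type first:

```python
import numpy as np
import pandas as pd

# Toy CustomerID column as pandas reads it: floats with missing values
customer_id = pd.Series([17850.0, np.nan, 13047.0])

# float -> nullable integer -> string drops the trailing ".0"
# and keeps missing values as missing
customer_id_str = customer_id.astype('Int64').astype('string')
print(customer_id_str.tolist())
```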

In [7]:
#quantitative data
raw_OR.describe()

Unnamed: 0,Quantity,UnitPrice
count,541909.0,541909.0
mean,9.55225,4.611114
std,218.081158,96.759853
min,-80995.0,-11062.06
25%,1.0,1.25
50%,3.0,2.08
75%,10.0,4.13
max,80995.0,38970.0


Upon initial inspection of the dataset, several noteworthy observations come to light.

Firstly, **InvoiceDate** is read in as plain text rather than in a datetime format, which may impede the analysis process. To enhance analytical capabilities, it is converted to the appropriate datetime format.

Additionally, the data type for **CustomerID** appears to be set as a float. Given that CustomerID serves as an identifying label, despite its numerical representation, it would be more appropriate to treat it as an object, aligning with the nature of other identification columns.

Also, there are instances of negative values in both **Quantity** and **UnitPrice**. This occurrence is likely associated with canceled orders. To facilitate a more nuanced analysis, it is advisable to segregate canceled and non-canceled orders into distinct datasets. This distinction will allow for a more targeted exploration of each subset and better insights into the underlying patterns within the data.

Finally, it's worth noting the presence of null values in the **Description** and **CustomerID** fields, which necessitates attention and resolution in the upcoming steps of our data preparation.
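
The observations about null and negative values can be quantified directly. The following is a minimal sketch on a toy frame with made-up rows; the column names mirror the dataset, and the same two expressions could be applied to the full data frame.

```python
import numpy as np
import pandas as pd

# Toy rows that mimic the issues noted above
toy = pd.DataFrame({
    'Description': ['LANTERN', None, 'Discount', 'HOLDER'],
    'Quantity': [6, -10, -1, 8],
    'UnitPrice': [3.39, 0.0, 27.5, -2.0],
    'CustomerID': [17850.0, np.nan, 14527.0, np.nan],
})

# Number of missing values per column
null_counts = toy.isna().sum()

# Number of negative values in each numeric column
negative_counts = (toy[['Quantity', 'UnitPrice']] < 0).sum()

print(null_counts)
print(negative_counts)
```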

### Cleaning Data

Before proceeding to address canceled orders, it's essential to examine the dataset for any duplicated entries. This check ensures the integrity of our analysis by identifying and resolving potential issues arising from redundant order records.

In [8]:
#check for duplicates
duplicated=raw_OR[raw_OR.duplicated()]
duplicated.shape[0]

5268

In [9]:
#remove duplicate orders
OR_no_dups=raw_OR.drop_duplicates()

A total of 5268 duplicate entries have been identified. To maintain the integrity of our analysis, these duplicates are removed to prevent any interference with subsequent data exploration and modeling.
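
For reference, `duplicated()` and `drop_duplicates()` both default to `keep='first'`: only the second and later copies of a fully identical row are flagged, and the first copy is retained. A small sketch on toy rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'InvoiceNo': ['536365', '536365', '536365', '536366'],
    'StockCode': ['71053', '71053', '85123A', '71053'],
    'Quantity': [6, 6, 6, 6],
})

# True only for rows that repeat an earlier row in every column
dup_mask = toy.duplicated()

# Keeps the first copy of each distinct row
deduped = toy.drop_duplicates()

print(dup_mask.tolist())
print(len(deduped))
```

Rows that share an invoice and quantity but differ in any other column, such as the third row here, are not treated as duplicates.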

With these duplicates removed, we can now examine the information and description of the enhanced dataset.

In [10]:
#view data information
OR_no_dups.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 536641 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    536641 non-null  object        
 1   StockCode    536641 non-null  object        
 2   Description  535187 non-null  object        
 3   Quantity     536641 non-null  int64         
 4   InvoiceDate  536641 non-null  datetime64[ns]
 5   UnitPrice    536641 non-null  float64       
 6   CustomerID   401604 non-null  object        
 7   Country      536641 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 36.8+ MB


In [11]:
#quantitative Data
OR_no_dups.describe()

Unnamed: 0,Quantity,UnitPrice
count,536641.0,536641.0
mean,9.620029,4.632656
std,219.130156,97.233118
min,-80995.0,-11062.06
25%,1.0,1.25
50%,3.0,2.08
75%,10.0,4.13
max,80995.0,38970.0


The previously identified concerns involving negative values in **Quantity** and **UnitPrice**, as well as the presence of null values, persist in our refined dataset and require further attention and resolution.

---

## III. Canceled Orders

The dataset description reveals that orders with an InvoiceNo starting with the letter 'C' indicate cancellations. The next step involves isolating these canceled transactions into a distinct dataset for further exploration.

In [12]:
#create new dataset for canceled transactions (InvoiceNo begins with 'C')
canceled = OR_no_dups[OR_no_dups['InvoiceNo'].astype(str).str.startswith('C')]
canceled.head(5)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
141,C536379,D,Discount,-1,2010-12-01 09:41:00,27.5,14527.0,United Kingdom
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,-1,2010-12-01 09:49:00,4.65,15311.0,United Kingdom
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,-12,2010-12-01 10:24:00,1.65,17548.0,United Kingdom
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548.0,United Kingdom
237,C536391,21983,PACK OF 12 BLUE PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548.0,United Kingdom


In [13]:
#ensure that no canceled orders have positive values
canceled[canceled['Quantity']>0].shape[0]

0
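
Because the rule is that a cancellation *begins* with 'C', a prefix test and a substring test are not equivalent in general. The sketch below compares the two on toy invoice numbers. `A563186` appears in this dataset, while `53C000` is a hypothetical value included only to show the difference.

```python
import pandas as pd

# The last invoice number is hypothetical
invoices = pd.Series(['536365', 'C536379', 'A563186', '53C000'])

# Prefix test: matches only values that begin with 'C'
starts_with_c = invoices.str.startswith('C')

# Substring test: matches a 'C' anywhere in the value
contains_c = invoices.str.contains('C')

print(starts_with_c.tolist())
print(contains_c.tolist())
```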

### Exploring Canceled Orders

Exploring the distinct dataset further, the initial analysis will aim to identify the products with the highest number of canceled orders.

In [14]:
#group by product and count canceled orders
canceled_product_counts = canceled.groupby(['StockCode', 'Description']).size().reset_index(name='CanceledCount')

#sort products by the number of canceled orders in descending order
canceled_product_counts = canceled_product_counts.sort_values(by='CanceledCount', ascending=False)

#display the top products with the most canceled orders
print(canceled_product_counts.head(5))

     StockCode               Description  CanceledCount
1972         M                    Manual            244
723      22423  REGENCY CAKESTAND 3 TIER            180
1973      POST                   POSTAGE            126
1117     22960  JAM MAKING SET WITH JARS             87
1970         D                  Discount             77


From this analysis, it becomes evident that many of these entries are not cancellations of actual product orders and do not contribute to total sales. Specifically, entries such as "Manual," "Postage," "Discount," and "Samples," which carry non-numerical StockCodes, fall into this category. The next step involves removing these stock codes from our dataset.

In [15]:
#identify rows with non-numerical product IDs
non_numerical_ids = canceled[canceled['StockCode'].astype(str).str.isalpha()]

#display non-numerical product IDs and their descriptions
print(non_numerical_ids[['StockCode', 'Description']].drop_duplicates())

        StockCode      Description
141             D         Discount
13052        POST          POSTAGE
14436           S          SAMPLES
14514   AMAZONFEE       AMAZON FEE
14716           M           Manual
75004         DOT   DOTCOM POSTAGE
317508       CRUK  CRUK Commission


In [16]:
#specify stock codes to remove
stock_codes_to_remove = ['D', 'POST', 'S','AMAZONFEE','M','DOT','CRUK'] 

In [17]:
#remove rows with specified stock codes
canceled_filtered = canceled[~canceled['StockCode'].isin(stock_codes_to_remove)]
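
The identify-then-filter pattern can be sketched end to end on toy rows. `str.isalpha()` is true only when every character is a letter, so a product code with a letter suffix such as `35004C` is kept. Deriving the removal list from the mask, rather than typing it by hand, is shown here as one option.

```python
import pandas as pd

toy = pd.DataFrame({
    'StockCode': ['D', 'POST', '22423', '35004C', 'M'],
    'Description': ['Discount', 'POSTAGE', 'REGENCY CAKESTAND 3 TIER',
                    'SET OF 3 COLOURED FLYING DUCKS', 'Manual'],
})

# True for codes made up of letters only
is_service_code = toy['StockCode'].astype(str).str.isalpha()

# Build the removal list from the flagged rows
codes_to_remove = toy.loc[is_service_code, 'StockCode'].unique().tolist()

# Keep only rows whose code is not in the removal list
products_only = toy[~toy['StockCode'].isin(codes_to_remove)]

print(codes_to_remove)
print(products_only['StockCode'].tolist())
```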

The cleaned data frame is now ready for examination.

In [18]:
#display the resulting dataframe
canceled_filtered.head(5)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,-1,2010-12-01 09:49:00,4.65,15311.0,United Kingdom
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,-12,2010-12-01 10:24:00,1.65,17548.0,United Kingdom
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548.0,United Kingdom
237,C536391,21983,PACK OF 12 BLUE PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548.0,United Kingdom
238,C536391,21980,PACK OF 12 RED RETROSPOT TISSUES,-24,2010-12-01 10:24:00,0.29,17548.0,United Kingdom


In [19]:
#view data information
canceled_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8695 entries, 154 to 541717
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   InvoiceNo    8695 non-null   object        
 1   StockCode    8695 non-null   object        
 2   Description  8695 non-null   object        
 3   Quantity     8695 non-null   int64         
 4   InvoiceDate  8695 non-null   datetime64[ns]
 5   UnitPrice    8695 non-null   float64       
 6   CustomerID   8507 non-null   object        
 7   Country      8695 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 611.4+ KB


In [20]:
#quantitative data
canceled_filtered.describe()

Unnamed: 0,Quantity,UnitPrice
count,8695.0,8695.0
mean,-31.057389,5.25409
std,1183.97569,23.579654
min,-80995.0,0.03
25%,-6.0,1.45
50%,-2.0,2.55
75%,-1.0,4.95
max,-1.0,1050.15


### Removing Canceled Orders

Having identified the canceled orders, the next step involves removing them from the **OR_no_dups** data frame. Subsequently, a new data frame will be created specifically for non-canceled orders.

In [21]:
#remove canceled orders
non_canceled = OR_no_dups[~OR_no_dups['InvoiceNo'].astype(str).str.startswith('C')]

non_canceled.head(5)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [22]:
#view data information
non_canceled.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 527390 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    527390 non-null  object        
 1   StockCode    527390 non-null  object        
 2   Description  525936 non-null  object        
 3   Quantity     527390 non-null  int64         
 4   InvoiceDate  527390 non-null  datetime64[ns]
 5   UnitPrice    527390 non-null  float64       
 6   CustomerID   392732 non-null  object        
 7   Country      527390 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 36.2+ MB


In [23]:
#quantitative data
non_canceled.describe()

Unnamed: 0,Quantity,UnitPrice
count,527390.0,527390.0
mean,10.311272,3.861939
std,160.367285,41.963759
min,-9600.0,-11062.06
25%,1.0,1.25
50%,3.0,2.08
75%,11.0,4.13
max,80995.0,13541.33



## IV. Preprocessing

### Addressing Negative Quantity

Even after addressing canceled orders, the refined dataset still contains negative quantities. The following analysis will investigate this issue further.

In [24]:
#check for entries with negative quantity
non_canceled[non_canceled['Quantity']<0].head(5)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
2406,536589,21777,,-10,2010-12-01 16:50:00,0.0,,United Kingdom
4347,536764,84952C,,-38,2010-12-02 14:42:00,0.0,,United Kingdom
7188,536996,22712,,-20,2010-12-03 15:30:00,0.0,,United Kingdom
7189,536997,22028,,-20,2010-12-03 15:30:00,0.0,,United Kingdom
7190,536998,85067,,-6,2010-12-03 15:30:00,0.0,,United Kingdom


In [25]:
#view data information
non_canceled[non_canceled['Quantity']<0].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1336 entries, 2406 to 538919
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   InvoiceNo    1336 non-null   object        
 1   StockCode    1336 non-null   object        
 2   Description  474 non-null    object        
 3   Quantity     1336 non-null   int64         
 4   InvoiceDate  1336 non-null   datetime64[ns]
 5   UnitPrice    1336 non-null   float64       
 6   CustomerID   0 non-null      object        
 7   Country      1336 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 93.9+ KB


Upon investigation into negative quantities, it is observed that these entries lack a CustomerID, and many also lack a Description. This aligns with one of the previously noted concerns. Unfortunately, the StockCode values here are ordinary product codes, so they cannot offer insights into the negative quantities. The focus will now shift to examining the Description field for additional context.

In [26]:
#sample the descriptions attached to negative quantities
non_canceled[non_canceled['Quantity'] < 0]['Description'].unique()[:50]

array([nan, '?', 'check', 'damages', 'faulty', 'Dotcom sales',
       'reverse 21/5/10 adjustment', 'mouldy, thrown away.', 'counted',
       'Given away', 'Dotcom', 'label mix up', 'samples/damages',
       'thrown away', 'incorrectly made-thrown away.', 'showroom', 'MIA',
       'Dotcom set', 'wrongly sold as sets', 'Amazon sold sets',
       'dotcom sold sets', 'wrongly sold sets', '? sold as sets?',
       '?sold as sets?', 'Thrown away.', 'damages/display',
       'damaged stock', 'broken', 'throw away', 'wrong barcode (22467)',
       'wrong barcode', 'barcode problem', '?lost',
       "thrown away-can't sell.", "thrown away-can't sell", 'damages?',
       're dotcom quick fix.', "Dotcom sold in 6's", 'sold in set?',
       'cracked', 'sold as 22467', 'Damaged',
       'mystery! Only ever imported 1800',
       'MERCHANT CHANDLER CREDIT ERROR, STO', 'POSSIBLE DAMAGES OR LOST?',
       'damaged', 'DAMAGED', 'Display', 'Missing', 'wrong code?'],
      dtype=object)

Upon a brief sampling of the provided descriptions, it is evident that certain entries represent non-standard sale transactions. These entries will be removed from our dataset to ensure the integrity of our analysis.
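
Many of the free-text notes differ only in case or trailing punctuation ('damaged', 'Damaged', 'DAMAGED'). As an optional sketch, normalizing the text before counting gives a clearer picture of how often each reason occurs. The toy values are taken from the sample above, and the normalization rule (lowercase, strip spaces and periods) is an assumption chosen for illustration.

```python
import pandas as pd

notes = pd.Series(['damaged', 'Damaged', 'DAMAGED',
                   'thrown away', 'Thrown away.', None, '?'])

# Lowercase and strip surrounding spaces and periods
normalized = notes.str.lower().str.strip(' .')

# Count each distinct note, including missing ones
note_counts = normalized.value_counts(dropna=False)
print(note_counts)
```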

In [27]:
#only use positive quantities
positive_quantity=non_canceled[non_canceled['Quantity']>0]
positive_quantity.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [28]:
#view data information
positive_quantity.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 526054 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    526054 non-null  object        
 1   StockCode    526054 non-null  object        
 2   Description  525462 non-null  object        
 3   Quantity     526054 non-null  int64         
 4   InvoiceDate  526054 non-null  datetime64[ns]
 5   UnitPrice    526054 non-null  float64       
 6   CustomerID   392732 non-null  object        
 7   Country      526054 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 36.1+ MB


In [29]:
#quantitative data
positive_quantity.describe()

Unnamed: 0,Quantity,UnitPrice
count,526054.0,526054.0
mean,10.730874,3.871747
std,157.591838,42.01656
min,1.0,-11062.06
25%,1.0,1.25
50%,4.0,2.08
75%,11.0,4.13
max,80995.0,13541.33


The issue of negative **Quantity** has been addressed. The previously identified concerns involving negative values in **UnitPrice**, as well as the presence of null values, persist in our refined dataset and require further attention and resolution.

### Addressing Negative Unit Price

In [30]:
#checking for negative unit price
positive_quantity[positive_quantity['UnitPrice']<0]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
299983,A563186,B,Adjust bad debt,1,2011-08-12 14:51:00,-11062.06,,United Kingdom
299984,A563187,B,Adjust bad debt,1,2011-08-12 14:52:00,-11062.06,,United Kingdom


Descriptions associated with negative **UnitPrice** values indicate non-standard transactions and can be safely removed from the refined dataset.

In [31]:
#keep only rows with a positive unit price (this also drops zero-priced rows)
positive_quantity=positive_quantity[positive_quantity['UnitPrice']>0]
positive_quantity.head(5)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
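
A filter of the form `UnitPrice > 0` removes zero-priced rows as well as negative ones. The sketch below, on toy rows, counts both groups before filtering so that the effect of the condition is visible. The zero-priced row is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'StockCode': ['85123A', 'B', '21777', '71053'],
    'Quantity': [6, 1, 4, 6],
    'UnitPrice': [2.55, -11062.06, 0.0, 3.39],
})

# Count the rows the filter is about to drop, by reason
n_negative = int((toy['UnitPrice'] < 0).sum())
n_zero = int((toy['UnitPrice'] == 0).sum())

# Strictly positive prices only
priced = toy[toy['UnitPrice'] > 0]

print(n_negative, n_zero, len(priced))
```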


### Addressing Null Values

Without clear information on the reasons for missing CustomerID values, these rows cannot be attributed to a customer, so they are removed from the dataset. Although the underlying causes for the missing data are uncertain, leaving the null values in place would hinder a thorough and accurate analysis. Even after their removal, the dataset remains sizable for analysis (392,692 rows).

In [32]:
#drop rows with a null in any column (here, Description or CustomerID)
cleaned_data=positive_quantity.dropna()
cleaned_data.head(5)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
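
Called without arguments, `dropna()` removes a row if any column is null. When only one column matters, the `subset` parameter restricts the check. The toy comparison below illustrates the difference and is not a change to the approach used here.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Description': ['LANTERN', None, 'HOLDER', 'HANGER'],
    'CustomerID': [17850.0, 17850.0, np.nan, 13047.0],
})

# Drops rows with a null in any column
any_null_dropped = toy.dropna()

# Drops rows only when CustomerID is null
id_null_dropped = toy.dropna(subset=['CustomerID'])

print(len(any_null_dropped), len(id_null_dropped))
```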


In [33]:
#view data information
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392692 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    392692 non-null  object        
 1   StockCode    392692 non-null  object        
 2   Description  392692 non-null  object        
 3   Quantity     392692 non-null  int64         
 4   InvoiceDate  392692 non-null  datetime64[ns]
 5   UnitPrice    392692 non-null  float64       
 6   CustomerID   392692 non-null  object        
 7   Country      392692 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 27.0+ MB


In [34]:
#quantitative data
cleaned_data.describe()

Unnamed: 0,Quantity,UnitPrice
count,392692.0,392692.0
mean,13.119702,3.125914
std,180.492832,22.241836
min,1.0,0.001
25%,2.0,1.25
50%,6.0,1.95
75%,12.0,3.75
max,80995.0,8142.75


### Additional Preprocessing

All identified concerns with the data have been successfully addressed. Moving forward, we can perform additional preprocessing to facilitate easier analysis. Specifically, we will add new columns for total price, hour, day of the week, and month to enhance the dataset for more insightful analysis. A new column to identify repeat customers will also be added for later analysis of retention rate.

In [35]:
#create copy of data frame
cleaned_data = cleaned_data.copy()

# Create a new column for TotalPrice
cleaned_data['TotalPrice'] = cleaned_data['Quantity'] * cleaned_data['UnitPrice']

# Ensure 'InvoiceDate' is in datetime format before extracting its parts
cleaned_data['InvoiceDate'] = pd.to_datetime(cleaned_data['InvoiceDate'])

# Extract hour from 'InvoiceDate'
cleaned_data['Hour'] = cleaned_data['InvoiceDate'].dt.hour

# Extract day of the week from 'InvoiceDate' (0 = Monday)
cleaned_data['Day'] = cleaned_data['InvoiceDate'].dt.dayofweek

# Extract month from 'InvoiceDate'
cleaned_data['Month'] = cleaned_data['InvoiceDate'].dt.month

In [36]:
# Calculate purchase frequency for each customer
purchase_frequency = cleaned_data.groupby('CustomerID')['InvoiceNo'].nunique().reset_index()
purchase_frequency.columns = ['CustomerID', 'PurchaseFrequency']

# Define a threshold for repeat customers (more than 1 distinct invoice)
threshold = 1

# Create a new column indicating if a customer is a repeat customer
cleaned_data = pd.merge(cleaned_data, purchase_frequency, on='CustomerID', how='left')
cleaned_data['IsRepeatCustomer'] = cleaned_data['PurchaseFrequency'] > threshold

# Drop the intermediate column if needed
cleaned_data = cleaned_data.drop(columns=['PurchaseFrequency'])


Added fundamental features:

- **TotalPrice:** This column represents the total monetary value of each transaction, calculated by multiplying the quantity of items purchased by their respective unit prices.
- **Hour:** Specific hour of the day when each transaction occurred, extracted from the 'InvoiceDate' timestamp.
- **Day:** Day of the week (0 for Monday, 1 for Tuesday, and so on) when each transaction took place, derived from the 'InvoiceDate' timestamp.
- **Month:** Numerical representation of the month when each transaction occurred, extracted from the 'InvoiceDate' timestamp.
- **IsRepeatCustomer:** Binary flag, indicating whether a customer has made more than one purchase ('True') or only a single purchase ('False').
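
The derived columns can be checked on a small toy frame with made-up transactions. As an alternative to the merge-based approach, this sketch computes the repeat-customer flag with `groupby(...).transform('nunique')`, which returns one value per row and so avoids the intermediate column.

```python
import pandas as pd

toy = pd.DataFrame({
    'InvoiceNo': ['536365', '536365', '536400', '536500'],
    'CustomerID': ['17850', '17850', '17850', '13047'],
    'Quantity': [6, 8, 2, 3],
    'UnitPrice': [2.55, 2.75, 1.25, 4.00],
    'InvoiceDate': pd.to_datetime(['2010-12-01 08:26', '2010-12-01 08:26',
                                   '2010-12-02 14:05', '2011-01-03 10:00']),
})

# Line total per row
toy['TotalPrice'] = toy['Quantity'] * toy['UnitPrice']

# Calendar parts of the timestamp (Day: 0 = Monday)
toy['Hour'] = toy['InvoiceDate'].dt.hour
toy['Day'] = toy['InvoiceDate'].dt.dayofweek
toy['Month'] = toy['InvoiceDate'].dt.month

# Distinct invoices per customer, broadcast back to every row
invoices_per_customer = toy.groupby('CustomerID')['InvoiceNo'].transform('nunique')
toy['IsRepeatCustomer'] = invoices_per_customer > 1

print(toy[['CustomerID', 'TotalPrice', 'Hour', 'Day', 'Month', 'IsRepeatCustomer']])
```

Note that the flag is based on distinct invoices, so a customer with several line items on a single invoice is still counted as a single-purchase customer.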

In [37]:
#save cleaned dataset
cleaned_data.to_csv('CleanedData.csv', index=False)
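
CSV files do not store column types, so identifier and date columns need to be re-declared when the file is read back. The sketch below round-trips a one-row toy frame through an in-memory buffer (used in place of a file so the example is self-contained). The `dtype` and `parse_dates` choices are assumptions about how a later notebook might want to load the data.

```python
import io
import pandas as pd

cleaned = pd.DataFrame({
    'InvoiceNo': ['536365'],
    'CustomerID': ['17850'],
    'InvoiceDate': pd.to_datetime(['2010-12-01 08:26:00']),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Re-declare identifier columns as text and parse the date column
reloaded = pd.read_csv(
    buffer,
    dtype={'InvoiceNo': str, 'CustomerID': str},
    parse_dates=['InvoiceDate'],
)
print(reloaded.dtypes)
```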

## V. Conclusion


The cleaning and preprocessing have addressed the challenges identified in the dataset, improving its reliability. Along the way, the analysis has surfaced useful observations about canceled orders, non-product stock codes, and incomplete customer records.

The significance of this project extends beyond technical aspects to offer actionable insights for strategic business decisions. Identification and resolution of concerns, such as canceled orders, negative quantities, and missing data, have laid the foundation for a more accurate analysis.

The addition of columns for total price, hour, day of the week, and month provides enriched dimensions for understanding transactional dynamics and seasonality. These enhanced features contribute to a more nuanced and contextual interpretation of the data.

As the data cleaning concludes, the cleaned dataset sets the stage for the next steps. The analysis will continue with exploratory data analysis (EDA) and customer segmentation, aiming to uncover patterns and trends within the data.