# Unlocking Insights from Retail Data
Retail businesses generate vast amounts of data daily, and within this data lies the key to understanding customer behavior, optimizing operations, and driving revenue growth. For this exercise, we will explore a dataset containing retail transaction data spanning two time periods: 2009-2010 and 2010-2011.

## This dataset includes information such as:
• Invoice Number (column `Invoice`): A unique identifier for each transaction.

• Stock Code (column `StockCode`): A unique code for each product sold.

• Description (column `Description`): The product name.

• Quantity (column `Quantity`): The number of units of the product on each invoice line.

• Invoice Date (column `InvoiceDate`): The date and time of the transaction.

• Unit Price (column `Price`): The price per unit of each product.

• Customer ID (column `Customer ID`): An anonymized identifier for each customer.

• Country (column `Country`): The country where the transaction took place.

## Task Instructions for Mentees

Your task is to perform a clustering analysis on this dataset to group customers or transactions into meaningful segments. Considerations:

• Use every relevant feature to cluster customers based on purchasing patterns.

• Explore clustering algorithms such as K-Means, DBSCAN, or Hierarchical Clustering.

• Visualize the clusters and interpret the results to understand customer behavior.
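As a starting point, the sketch below shows one common way to turn transaction rows into per-customer features and cluster them. It is only an illustration: the recency/frequency/monetary features, the `Revenue` column, the choice of two clusters, and the tiny synthetic table are all assumptions made for the example, not something the task prescribes. Column names follow the dataset (`Invoice`, `Quantity`, `Price`, `InvoiceDate`, `Customer ID`).

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Tiny synthetic stand-in for the transaction table (hypothetical values).
tx = pd.DataFrame({
    "Invoice": ["A1", "A1", "A2", "B1", "C1", "C2", "D1", "D2"],
    "Quantity": [12, 6, 3, 48, 2, 1, 40, 36],
    "Price": [6.95, 2.10, 1.25, 2.10, 4.21, 1.25, 2.55, 3.39],
    "InvoiceDate": pd.to_datetime([
        "2010-01-05", "2010-01-05", "2010-11-20", "2010-03-14",
        "2010-06-01", "2010-12-01", "2010-10-10", "2010-11-30",
    ]),
    "Customer ID": [13085, 13085, 13085, 13078, 15311, 15311, 17850, 17850],
})

# Line-level revenue (an assumed derived feature).
tx["Revenue"] = tx["Quantity"] * tx["Price"]

# Reference date for recency: one day after the last transaction.
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

# One row per customer: days since last purchase, number of invoices, total spend.
rfm = tx.groupby("Customer ID").agg(
    recency_days=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    frequency=("Invoice", "nunique"),
    monetary=("Revenue", "sum"),
)

# Scale the features so no single one dominates the distance metric, then cluster.
scaled = StandardScaler().fit_transform(rfm)
rfm["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

print(rfm)
```

On the real data, the number of clusters would need to be chosen with a diagnostic such as an elbow plot or silhouette score, and DBSCAN or hierarchical clustering can be swapped in on the same scaled feature matrix.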

## Submission Guidelines
Submit your findings and insights to info@oaorogun.co.uk. Be sure to include:

• For data analysts: A comprehensive report with charts and insights.

• For data scientists: A detailed explanation of your clustering methodology, code snippets, and visualizations.

In [1]:
pip install dataprep

Collecting dataprep
  Downloading dataprep-0.4.1-py3-none-any.whl.metadata (14 kB)
Collecting bokeh<3,>=2 (from dataprep)
  Downloading bokeh-2.4.3-py3-none-any.whl.metadata (14 kB)
Collecting dask<3.0,>=2.25 (from dask[array,dataframe,delayed]<3.0,>=2.25->dataprep)
  Downloading dask-2.30.0-py3-none-any.whl.metadata (3.4 kB)
Collecting flask<2.0.0,>=1.1.4 (from dataprep)
  Downloading Flask-1.1.4-py2.py3-none-any.whl.metadata (4.6 kB)
Collecting flask_cors<4.0.0,>=3.0.10 (from dataprep)
  Downloading Flask_Cors-3.0.10-py2.py3-none-any.whl.metadata (5.4 kB)
Collecting ipywidgets<8.0,>=7.5 (from dataprep)
  Downloading ipywidgets-7.8.5-py2.py3-none-any.whl.metadata (1.9 kB)
Collecting jinja2<3.0,>=2.11 (from dataprep)
  Downloading Jinja2-2.11.3-py2.py3-none-any.whl.metadata (3.5 kB)
Collecting jsonpath-ng<2.0,>=1.5 (from dataprep)
  Downloading jsonpath_ng-1.7.0-py3-none-any.whl.metadata (18 kB)
Collecting levenshtein<0.13.0,>=0.12.0 (from dataprep)
  Downloading levenshtein-0.12.0.tar

  error: subprocess-exited-with-error
  
  python setup.py bdist_wheel did not run successfully.
  exit code: 1
  
  [28 lines of output]
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-cpython-311
  creating build\lib.win-amd64-cpython-311\Levenshtein
  copying Levenshtein\StringMatcher.py -> build\lib.win-amd64-cpython-311\Levenshtein
  copying Levenshtein\__init__.py -> build\lib.win-amd64-cpython-311\Levenshtein
  running egg_info
  writing levenshtein.egg-info\PKG-INFO
  writing dependency_links to levenshtein.egg-info\dependency_links.txt
  deleting levenshtein.egg-info\entry_points.txt
  writing namespace_packages to levenshtein.egg-info\namespace_packages.txt
  writing requirements to levenshtein.egg-info\requires.txt
  writing top-level names to levenshtein.egg-info\top_level.txt
  reading manifest file 'levenshtein.egg-info\SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  adding license file 'COPYING'
  writing 

In [None]:
conda install -c conda-forge dataprep

In [None]:
from dataprep.eda import create_report

In [1]:
import pandas as pd

# Path to your Excel file
file_path = "online_retail_II.xlsx"

# Load every sheet in one pass; sheet_name=None returns a dict of
# DataFrames keyed by sheet name, so the file is only read once
dfs = pd.read_excel(file_path, sheet_name=None)

# Access each DataFrame by its sheet name
# (a KeyError here means the sheet name does not match the workbook)
df1 = dfs["Year 2009-2010"]
df2 = dfs["Year 2010-2011"]

# Display the first few rows of each DataFrame
print("DataFrame for Year 2009-2010:")
print(df1.head())

print("\nDataFrame for Year 2010-2011:")
print(df2.head())


DataFrame for Year 2009-2010:
  Invoice StockCode                          Description  Quantity  \
0  489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS        12   
1  489434    79323P                   PINK CHERRY LIGHTS        12   
2  489434    79323W                  WHITE CHERRY LIGHTS        12   
3  489434     22041         RECORD FRAME 7" SINGLE SIZE         48   
4  489434     21232       STRAWBERRY CERAMIC TRINKET BOX        24   

          InvoiceDate  Price  Customer ID         Country  
0 2009-12-01 07:45:00   6.95      13085.0  United Kingdom  
1 2009-12-01 07:45:00   6.75      13085.0  United Kingdom  
2 2009-12-01 07:45:00   6.75      13085.0  United Kingdom  
3 2009-12-01 07:45:00   2.10      13085.0  United Kingdom  
4 2009-12-01 07:45:00   1.25      13085.0  United Kingdom  

DataFrame for Year 2010-2011:
  Invoice StockCode                          Description  Quantity  \
0  536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1  536365     7105
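Since clustering customers across both periods needs a single table, the two DataFrames can be stacked. The sketch below is a minimal illustration on synthetic stand-ins (`df1_demo`, `df2_demo` and their invoice values are hypothetical). It assumes the two sheets might share some invoices, so it keeps only the first sheet's copy of any invoice that appears in both; whether such overlap exists in the real workbook should be checked rather than taken for granted.

```python
import pandas as pd

# Synthetic stand-ins for df1 and df2 (hypothetical values).
df1_demo = pd.DataFrame({
    "Invoice": ["X1", "X1", "X2"],
    "Quantity": [12, 6, 3],
    "Customer ID": [13085.0, 13085.0, 15311.0],
})
df2_demo = pd.DataFrame({
    "Invoice": ["X2", "X3"],
    "Quantity": [3, 8],
    "Customer ID": [15311.0, 17850.0],
})

# Invoices present in both sheets (assumed possible, not confirmed).
overlap = set(df1_demo["Invoice"]) & set(df2_demo["Invoice"])

# Stack the sheets, skipping second-sheet rows whose invoice is already in the first.
combined = pd.concat(
    [df1_demo, df2_demo[~df2_demo["Invoice"].isin(overlap)]],
    ignore_index=True,
)

print(f"{len(overlap)} overlapping invoice(s); combined shape: {combined.shape}")
```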

In [4]:
print(df1.shape)
df1.head(3)

(525461, 8)


|   | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |


In [8]:
df1.describe()

|   | Quantity | InvoiceDate | Price | Customer ID |
|---|---|---|---|---|
| count | 525461.0 | 525461 | 525461.0 | 417534.0 |
| mean | 10.337667 | 2010-06-28 11:37:36.845017856 | 4.688834 | 15360.645478 |
| min | -9600.0 | 2009-12-01 07:45:00 | -53594.36 | 12346.0 |
| 25% | 1.0 | 2010-03-21 12:20:00 | 1.25 | 13983.0 |
| 50% | 3.0 | 2010-07-06 09:51:00 | 2.1 | 15311.0 |
| 75% | 10.0 | 2010-10-15 12:45:00 | 4.21 | 16799.0 |
| max | 19152.0 | 2010-12-09 20:01:00 | 25111.09 | 18287.0 |
| std | 107.42411 | NaN | 146.126914 | 1680.811316 |
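The summary shows a minimum `Quantity` of -9600 and a minimum `Price` of -53594.36, and the `Customer ID` count (417,534) is lower than the row count (525,461). Rows like these would distort customer-level features, so one possible cleaning step is sketched below on synthetic data. Treating non-positive quantities and prices as rows to exclude (for example returns or adjustments) is an assumption of this sketch; depending on the analysis, they may be worth keeping as a separate signal.

```python
import numpy as np
import pandas as pd

# Synthetic rows mimicking the problems visible in describe() (hypothetical values).
raw = pd.DataFrame({
    "Invoice": ["X1", "X2", "X3", "X4", "X5"],
    "Quantity": [12, -4, -1, 3, 5],
    "Price": [6.95, 2.10, 1.25, -10.0, 4.21],
    "Customer ID": [13085.0, 13085.0, 15311.0, np.nan, np.nan],
})

# Keep rows with a positive quantity, a positive price, and a known customer.
mask = (raw["Quantity"] > 0) & (raw["Price"] > 0) & raw["Customer ID"].notna()
clean = raw.loc[mask].copy()

# With the missing values gone, the ID can be stored as an integer.
clean["Customer ID"] = clean["Customer ID"].astype(int)

print(f"kept {len(clean)} of {len(raw)} rows")
```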


In [9]:
df1.isnull()

|   | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 525456 | False | False | False | False | False | False | False | False |
| 525457 | False | False | False | False | False | False | False | False |
| 525458 | False | False | False | False | False | False | False | False |
| 525459 | False | False | False | False | False | False | False | False |
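The row-by-row boolean frame is hard to read at this size. A per-column count is usually more informative: the `describe()` output already implies 525,461 - 417,534 = 107,927 missing `Customer ID` values. The sketch below shows the pattern on a small synthetic frame (the `demo` data and the `summary` layout are illustrative assumptions); on the real data the same calls would be applied to `df1`.

```python
import numpy as np
import pandas as pd

# Small synthetic frame with a few missing values (hypothetical values).
demo = pd.DataFrame({
    "Description": ["PINK CHERRY LIGHTS", None, "WHITE CHERRY LIGHTS", "RECORD FRAME"],
    "Customer ID": [13085.0, np.nan, np.nan, 13085.0],
    "Country": ["United Kingdom"] * 4,
})

# Count of missing values per column, and the same as a percentage of rows.
missing = demo.isnull().sum()
missing_pct = (demo.isnull().mean() * 100).round(1)

summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary)
```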
