# Unlocking Insights from Retail Data
Retail businesses generate vast amounts of data daily, and within this data lies the key to understanding customer behavior, optimizing operations, and driving revenue growth. For this exercise, we will explore a dataset containing retail transaction data spanning two time periods: 2009-2010 and 2010-2011.

## This dataset includes information such as:
•⁠ ⁠Invoice Number: A unique identifier for each transaction.

•⁠ ⁠Stock Code: Unique codes for each product sold.

•⁠ ⁠Description: Detailed information about the products.

•⁠ ⁠Quantity: The number of units of each product purchased on an invoice line.

•⁠ ⁠Invoice Date: The date and time of the transaction.

•⁠ ⁠Unit Price: Price per unit of each product.

•⁠ ⁠Customer ID: An anonymized identifier for each customer.

•⁠ ⁠Country: The country associated with the transaction (typically where the customer resides).
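To make the fields above concrete, here is a minimal sketch of a first preparation step. It assumes the column names `Quantity`, `Price`, and `Customer ID` as they appear in the workbook headers; `add_line_total` and the `LineTotal` column are hypothetical names introduced only for illustration. Dropping rows without a Customer ID is also an assumption, made because such rows cannot be assigned to a customer segment.

```python
import pandas as pd


def add_line_total(transactions: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without anonymous rows and with a LineTotal column."""
    # Assumption: rows with no Customer ID are excluded from customer-level analysis.
    cleaned = transactions.dropna(subset=["Customer ID"]).copy()
    # Customer ID is read as float when blanks are present; cast it back to int.
    cleaned["Customer ID"] = cleaned["Customer ID"].astype(int)
    # LineTotal (hypothetical column): value of one invoice line.
    cleaned["LineTotal"] = cleaned["Quantity"] * cleaned["Price"]
    return cleaned
```

The function returns a new DataFrame and leaves its input unchanged, so the raw data stays available for other checks.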

## Task Instructions for Mentees

Your task is to perform a clustering analysis on this dataset to group customers or transactions into meaningful segments. Considerations:

•⁠ ⁠Use every available feature to cluster customers based on purchasing patterns.

•⁠ ⁠Explore clustering algorithms such as K-Means, DBSCAN, or Hierarchical Clustering.

•⁠ ⁠Visualize the clusters and interpret the results to understand customer behavior.
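One possible way to approach the task above is sketched here: aggregate invoice lines into per-customer recency, frequency, and monetary features, then cluster them with K-Means. This is only an illustrative starting point, not a prescribed method. The feature choice, the helper names `build_rfm` and `cluster_customers`, and the default of four clusters are all assumptions; the column names follow the workbook headers.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def build_rfm(transactions: pd.DataFrame) -> pd.DataFrame:
    """Aggregate invoice lines into one row of features per customer."""
    data = transactions.dropna(subset=["Customer ID"]).copy()
    data["LineTotal"] = data["Quantity"] * data["Price"]
    # Measure recency relative to the day after the last recorded invoice.
    snapshot = data["InvoiceDate"].max() + pd.Timedelta(days=1)
    grouped = data.groupby("Customer ID")
    return pd.DataFrame({
        "Recency": (snapshot - grouped["InvoiceDate"].max()).dt.days,
        "Frequency": grouped["Invoice"].nunique(),
        "Monetary": grouped["LineTotal"].sum(),
    })


def cluster_customers(features: pd.DataFrame, n_clusters: int = 4) -> pd.Series:
    """Scale the features and assign each customer a K-Means cluster label."""
    # K-Means is distance based, so features are standardized first.
    scaled = StandardScaler().fit_transform(features)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = model.fit_predict(scaled)
    return pd.Series(labels, index=features.index, name="Cluster")
```

The number of clusters should be chosen from the data (for example with an elbow plot or silhouette scores) rather than taken from the default used here.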

## Submission Guidelines
Submit your findings and insights to info@oaorogun.co.uk. Be sure to include:

•⁠ ⁠For data analysts: A comprehensive report with charts and insights.

•⁠ ⁠For data scientists: A detailed explanation of your clustering methodology, code snippets, and visualizations.

In [6]:
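import pandas as pd
import openpyxl
from openpyxl import load_workbook
# dataprep is an optional third-party package (pip install dataprep);
# this import fails when it is not installed.
from dataprep.eda import create_report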


ModuleNotFoundError: No module named 'dataprep'
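The error above comes from the `dataprep` import, not from pandas or openpyxl. If the automated report is optional, one way to keep the notebook running without it is a guarded import. The helper name `optional_import` is hypothetical; this is a sketch, and installing the package is equally valid.

```python
import importlib


def optional_import(module_name: str):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return None


# Hypothetical usage in this notebook (left as comments so nothing is imported here):
# eda = optional_import("dataprep.eda")
# if eda is None:
#     print("dataprep is not installed; skipping the automated EDA report.")
```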

In [1]:
import pandas as pd

# Path to your Excel file
file_path = "online_retail_II.xlsx"

# Load every sheet; sheet_name=None returns a dict of {sheet name: DataFrame}
sheets = pd.read_excel(file_path, sheet_name=None)

# Preview the first few rows of each sheet
for sheet_name, sheet_data in sheets.items():
    print(f"Sheet: {sheet_name}")
    print(sheet_data.head())


Sheet: Year 2009-2010
  Invoice StockCode                          Description  Quantity  \
0  489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS        12   
1  489434    79323P                   PINK CHERRY LIGHTS        12   
2  489434    79323W                  WHITE CHERRY LIGHTS        12   
3  489434     22041         RECORD FRAME 7" SINGLE SIZE         48   
4  489434     21232       STRAWBERRY CERAMIC TRINKET BOX        24   

          InvoiceDate  Price  Customer ID         Country  
0 2009-12-01 07:45:00   6.95      13085.0  United Kingdom  
1 2009-12-01 07:45:00   6.75      13085.0  United Kingdom  
2 2009-12-01 07:45:00   6.75      13085.0  United Kingdom  
3 2009-12-01 07:45:00   2.10      13085.0  United Kingdom  
4 2009-12-01 07:45:00   1.25      13085.0  United Kingdom  
Sheet: Year 2010-2011
  Invoice StockCode                          Description  Quantity  \
0  536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1  536365     71053                
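Since both sheets share the same columns, they can be stacked into a single table for analysis. The sketch below assumes a dict of `{sheet name: DataFrame}`, as returned by `pd.read_excel(..., sheet_name=None)`; `combine_sheets` and the `Sheet` column are hypothetical names used for illustration.

```python
import pandas as pd


def combine_sheets(sheets: dict) -> pd.DataFrame:
    """Stack per-sheet DataFrames and record which sheet each row came from."""
    # Tag every row with its source sheet before concatenating.
    frames = [frame.assign(Sheet=name) for name, frame in sheets.items()]
    # ignore_index gives the combined table a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

Keeping the source sheet as a column makes it possible to compare the two periods later without reloading the workbook.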

In [2]:
excel_file = pd.ExcelFile(file_path)

# Get the sheet names
sheet_names = excel_file.sheet_names

# Print the sheet names
print("Sheet names in the workbook:")
print(sheet_names)

Sheet names in the workbook:
['Year 2009-2010', 'Year 2010-2011']


In [4]:
# Path to dataset
file_path = "online_retail_II.xlsx"

# Let's load the workbook
workbook = load_workbook(file_path)

# Rename the individual Excel sheets (titles only; this does not create DataFrames named df1 or df2)
sheet_mapping = {
    "Year 2009-2010": "df1",
    "Year 2010-2011": "df2",
}

for old_name, new_name in sheet_mapping.items():
    if old_name in workbook.sheetnames:
        workbook[old_name].title = new_name

# Saving is disabled, so the rename applies only to the in-memory workbook
#workbook.save(file_path)

print("Sheets renamed successfully!")


Sheets renamed successfully!


In [5]:
df1.shape

NameError: name 'df1' is not defined
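The `NameError` occurs because renaming a worksheet title in openpyxl does not create a Python variable; `df1` was never assigned. A minimal sketch of one way to obtain `df1` and `df2` is to re-key the dict of DataFrames returned by `pd.read_excel(..., sheet_name=None)`. The helper `frames_by_name` and the constant `SHEET_TO_NAME` are hypothetical names; the sheet names are the ones printed earlier in the notebook.

```python
import pandas as pd

# Short names for the two sheets in the workbook.
SHEET_TO_NAME = {"Year 2009-2010": "df1", "Year 2010-2011": "df2"}


def frames_by_name(sheets: dict, mapping: dict = SHEET_TO_NAME) -> dict:
    """Re-key a {sheet name: DataFrame} dict using the short names in mapping."""
    missing = [sheet for sheet in mapping if sheet not in sheets]
    if missing:
        raise KeyError(f"Sheets not found in workbook: {missing}")
    return {short: sheets[sheet] for sheet, short in mapping.items()}


# Usage with the workbook from this notebook (not run here):
# sheets = pd.read_excel("online_retail_II.xlsx", sheet_name=None)
# frames = frames_by_name(sheets)
# df1, df2 = frames["df1"], frames["df2"]
# df1.shape
```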