In the retail sector, it is possible to unlock insights to win and retain customers, drive business efficiencies, and ultimately improve sales and customer interest. Retail organizations are using advanced analysis to understand their customers, improve forecasting, and achieve better, faster results. As a company's resources are limited, it is crucial to identify and target customers to secure their loyalty, enhance business efficiency, and ultimately improve performance.

You have been given access to a dataset containing customer transactions for an online retailer and tasked with using your machine learning tools to gain and report on business insights. The report is aimed at non-specialists. In particular, your tasks are:

Clustering
Apply and evaluate various clustering techniques with the aim of generating actionable insights from the data. 

●	Select and justify the features you will be using.

●	Apply appropriate clustering algorithms to the dataset.

●	Evaluate the performance of the algorithms and make a recommendation as to which gives the “best” results.

●	Include in your report your own interpretation of the results.

Market Basket Analysis
Perform a market basket analysis of the transaction data. 

●	Include in your report a comparison and evaluation of at least two algorithms.

●	Include in your report your own interpretation of the results.


In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns

In [2]:
df = pd.read_excel("data.xlsx")
df.head()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


### Note

To improve readability, we set the 'InvoiceDate' column as the index of the dataset in place of the default numerical index.
Note that this index is not unique: all line items on the same invoice share one timestamp, as the first five rows above show.

In [3]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df = df.set_index('InvoiceDate')
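
Whether a datetime index identifies rows uniquely can be checked directly. The sketch below uses a small hypothetical frame (`toy` and its third row are illustrative, not taken from the dataset) that mirrors the structure of the rows shown earlier, where several line items share one invoice timestamp.

```python
import pandas as pd

# Hypothetical toy data: two line items from one invoice share a timestamp.
toy = pd.DataFrame(
    {
        "Invoice": [489434, 489434, 489435],
        "StockCode": ["85048", "79323P", "22350"],
        "InvoiceDate": pd.to_datetime(
            ["2009-12-01 07:45:00", "2009-12-01 07:45:00", "2009-12-01 07:46:00"]
        ),
    }
).set_index("InvoiceDate")

# Does every row have its own timestamp?
index_is_unique = toy.index.is_unique
# Number of rows whose timestamp already appeared on an earlier row.
repeated_rows = int(toy.index.duplicated().sum())
print(index_is_unique, repeated_rows)
```

If row-level uniqueness matters later, keeping the default integer index (or calling `reset_index()`) is one option.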

In [4]:
df.shape

(525461, 7)

## Exploratory Data Analysis

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 525461 entries, 2009-12-01 07:45:00 to 2010-12-09 20:01:00
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Invoice      525461 non-null  object 
 1   StockCode    525461 non-null  object 
 2   Description  522533 non-null  object 
 3   Quantity     525461 non-null  int64  
 4   Price        525461 non-null  float64
 5   Customer ID  417534 non-null  float64
 6   Country      525461 non-null  object 
dtypes: float64(2), int64(1), object(4)
memory usage: 32.1+ MB


In [6]:
def to_camel_case(s):
    # Split the string into words
    words = s.replace('_', ' ').split()
    # Convert the first word to lowercase and capitalize the initials of the remaining words
    camel_case_str = words[0].lower() + ''.join(word.capitalize() for word in words[1:])
    return camel_case_str

df.columns = [to_camel_case(col) for col in df.columns]

print(df.columns)

Index(['invoice', 'stockcode', 'description', 'quantity', 'price',
       'customerId', 'country'],
      dtype='object')
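
The output shows `stockcode` rather than `stockCode` because the helper splits only on whitespace and underscores, so a name containing neither is treated as a single word and lowercased. A minimal sketch that repeats the helper on three of the original column names:

```python
def to_camel_case(s):
    # Same logic as the cell above: split on underscores and whitespace only.
    words = s.replace('_', ' ').split()
    return words[0].lower() + ''.join(word.capitalize() for word in words[1:])

# Map a few original column names to their renamed form.
renamed = {name: to_camel_case(name) for name in ["Customer ID", "StockCode", "Invoice"]}
print(renamed)
```

All later cells therefore need to refer to the renamed columns, for example `df['customerId']` instead of `df['Customer ID']`.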


In [7]:
df.dtypes

invoice         object
stockcode       object
description     object
quantity         int64
price          float64
customerId     float64
country         object
dtype: object

In [8]:
# To prevent errors, convert 'description', 'invoice' and 'stockcode' to strings.
# df['description'] = df['description'].astype(str)
# df['invoice'] = df['invoice'].astype(str)
# df['stockcode'] = df['stockcode'].astype(str)

In [9]:
df.describe()

Unnamed: 0,quantity,price,customerId
count,525461.0,525461.0,417534.0
mean,10.337667,4.688834,15360.645478
std,107.42411,146.126914,1680.811316
min,-9600.0,-53594.36,12346.0
25%,1.0,1.25,13983.0
50%,3.0,2.1,15311.0
75%,10.0,4.21,16799.0
max,19152.0,25111.09,18287.0


In [10]:
# The columns were renamed in In [6], so 'Country' is now 'country'.
df['country'].nunique()

In [None]:
df['country'].unique()

In [None]:
customer_country = df[['country', 'customerId']]
customer_country.groupby(['country'])['customerId'].aggregate('count').reset_index().sort_values('customerId', ascending=False).head()

In [None]:
pd.DataFrame(df.nunique())

## Null values

In [None]:
df.isnull().sum(axis=0)

In [None]:
df['description'].tail(50)

In [None]:
df[df['description'].isnull()].tail()

Note

- The price in these rows is 0, so these records did not generate any sales.

- For now, we impute the missing descriptions with 'UNKNOWN ITEM' and revisit these rows later in the analysis.
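
The claim about zero prices can be checked by summarizing `price` over the rows with a missing description. The sketch below assumes the renamed columns `description` and `price`; the `toy` frame and its values are hypothetical and only illustrate the check and the imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame using the renamed columns; values are illustrative only.
toy = pd.DataFrame(
    {
        "description": ["PINK CHERRY LIGHTS", np.nan, np.nan, "WHITE CHERRY LIGHTS"],
        "price": [6.75, 0.0, 0.0, 6.75],
    }
)

missing = toy["description"].isnull()
# Share of missing-description rows whose price is exactly zero.
zero_price_share = (toy.loc[missing, "price"] == 0).mean()
# Impute a placeholder so that later string operations do not fail on NaN.
toy["description"] = toy["description"].fillna("UNKNOWN ITEM")
print(zero_price_share, toy["description"].isnull().sum())
```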

#### Analyzing "Description" feature

In [None]:
df['description'].value_counts().tail(20)

In [None]:
df['description'].value_counts().head()

Note

The counts above suggest that valid items are typically written in uppercase, while non-valid or cancelled entries tend to be in lowercase.

#### Analyzing Invoice feature

In [None]:
df['invoice'].value_counts().tail(50)

In [None]:
df['invoice'].value_counts().head()

## Data cleaning

In [None]:
customer_country = df[['country', 'customerId']].drop_duplicates()
customer_country.groupby(['country'])['customerId'].aggregate('count').reset_index().sort_values('customerId', ascending=False).head()
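
An equivalent way to count distinct customers per country, without building an intermediate de-duplicated frame, is `groupby(...).nunique()`. The sketch assumes the renamed columns `country` and `customerId`; the `toy` values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one repeated customer and one row without a customer ID.
toy = pd.DataFrame(
    {
        "country": ["United Kingdom", "United Kingdom", "United Kingdom", "France", "France"],
        "customerId": [13085.0, 13085.0, 13078.0, 12682.0, np.nan],
    }
)

# nunique() ignores NaN, so transactions without a customer ID are not counted.
customers_per_country = (
    toy.groupby("country")["customerId"]
    .nunique()
    .sort_values(ascending=False)
)
print(customers_per_country)
```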

In [None]:
df['description'] = df['description'].fillna('UNKNOWN ITEM')
df.isnull().sum()

In [None]:
# Keep only transactions with a known customer; df1 is used in the cells below.
df1 = df[df['customerId'].notnull()]
df1.isnull().sum(axis=0)

## Exploratory Data Analysis II

### Do we have returns?

In [None]:
df[df['quantity'] < 0].head()

### How many customers are not recurrent?

In [None]:
def unique_counts(frame):
    # Print the number of distinct values in each column.
    for col in frame.columns:
        count = frame[col].nunique()
        print(col, ": ", count)

unique_counts(df1)
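
The per-column counts give the number of distinct customers but not how many of them ordered only once. One possible way to answer the heading's question is to count distinct invoices per customer, sketched here on hypothetical toy data with the renamed columns `customerId` and `invoice`. Whether cancellation invoices should count as orders is a choice this sketch does not address.

```python
import pandas as pd

# Hypothetical toy data: one returning customer and two one-time customers.
toy = pd.DataFrame(
    {
        "customerId": [13085.0, 13085.0, 13085.0, 13078.0, 12682.0, 12682.0],
        "invoice": ["489434", "489434", "489500", "489435", "489436", "489436"],
    }
)

# Distinct invoices per customer; several line items of one invoice count once.
invoices_per_customer = toy.groupby("customerId")["invoice"].nunique()
one_time_customers = int((invoices_per_customer == 1).sum())
one_time_share = one_time_customers / invoices_per_customer.size
print(one_time_customers, round(one_time_share, 2))
```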

### Which items were purchased most frequently?

In [None]:
item_counts = df['description'].value_counts().sort_values(ascending=False).iloc[0:15]
plt.figure(figsize=(18,6))
sns.barplot(x=item_counts.index, y=item_counts.values, palette=sns.cubehelix_palette(15))
plt.ylabel("Counts")
plt.title("Which items were bought more often?");
plt.xticks(rotation=90);

### Which invoices had the highest number of items?

In [None]:
inv_counts = df['invoice'].value_counts().sort_values(ascending=False).iloc[0:15]
plt.figure(figsize=(18,6))
sns.barplot(x=inv_counts.index, y=inv_counts.values, palette=sns.color_palette("BuGn_d"))
plt.ylabel("Counts")
plt.title("Which invoices had the most items?");
plt.xticks(rotation=90);

#### Analyzing "Description" feature II

In [None]:
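# Cast to str so that any non-string or missing description does not break the mask.
df[~df['description'].astype(str).str.isupper()]['description'].value_counts().head()

In [None]:
# Descriptions that are not fully uppercase
lcase_counts = df[~df['description'].astype(str).str.isupper()]['description'].value_counts().sort_values(ascending=False).iloc[0:15]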
plt.figure(figsize=(18,6))
sns.barplot(x=lcase_counts.index, y=lcase_counts.values, palette=sns.color_palette("hls", 15))
plt.ylabel("Counts")
plt.title("Not full upper case items");
plt.xticks(rotation=90);

#### Analyzing Invoice feature II

In [None]:
# 'invoice' has dtype object and may mix integers with strings, so cast to str first.
df[df['invoice'].astype(str).str.startswith('C')].describe()

Invoices that start with 'C' appear to be cancellations or returns, which explains the negative quantities.

These returns deserve a deeper analysis, but for simplicity we ignore them for the moment.

They could be the basis of a separate project that predicts return and cancellation rates for the store.
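
To set cancellations aside, the rows can be split on the invoice prefix. The sketch below assumes the renamed columns `invoice` and `quantity`; the `toy` values, including the invoice number `C489449`, are hypothetical.

```python
import pandas as pd

# Hypothetical toy data mixing integer invoice numbers with 'C'-prefixed strings.
toy = pd.DataFrame(
    {
        "invoice": [489434, 489434, "C489449", "C489449", 489435],
        "quantity": [12, 48, -12, -6, 24],
    }
)

# Cast to str so that integer invoice numbers do not yield NaN from .str methods.
is_cancellation = toy["invoice"].astype(str).str.startswith("C")
cancellations = toy[is_cancellation]
sales = toy[~is_cancellation]
print(len(cancellations), len(sales), (cancellations["quantity"] < 0).all())
```

Keeping `cancellations` as its own frame, rather than dropping those rows, leaves the data available for the follow-up analysis of returns mentioned above.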