<a href="https://colab.research.google.com/github/JoshuaPaul-lasisi/Customer_segmentation/blob/main/Customer_segmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Overview: Customer Segmentation

**Introduction:**

This project aims to leverage customer data to identify meaningful segments within our customer base. By analyzing various aspects of the data, we will uncover patterns, trends, and relationships that can inform targeted marketing strategies and improve customer engagement.

**Project Background:**

The data used in this analysis comes from [brief description of dataset source and contents]. This dataset includes [list key features and variables relevant to customer segmentation].

**Objective:**

The primary objective of this project is to gain a deeper understanding of our customer base and identify distinct groups based on shared characteristics and behaviors. This will allow us to:

* **Develop targeted marketing campaigns:** Tailor messaging and offerings to specific customer segments for increased effectiveness.
* **Improve customer retention:** Identify at-risk segments and implement strategies to foster loyalty and prevent churn.
* **Optimize resource allocation:** Focus efforts on the most valuable customer segments to maximize return on investment.

**Methodology:**

This project will utilize a data-driven approach to customer segmentation. We will employ various exploratory data analysis techniques, including:

* **Univariate analysis:** Examining the distribution and characteristics of individual variables.
* **Bivariate analysis:** Exploring relationships between pairs of variables.
* **Categorical variable analysis:** Understanding the impact of categorical variables on customer behavior.
* **Time series analysis:** Analyzing customer behavior over time, including purchase patterns and trends.
* **Outlier detection:** Identifying and handling potential data anomalies.

**Expected Output:**

By the end of this project, we aim to achieve the following:

* **Identify distinct customer segments:** Define clear and actionable segments based on shared characteristics and behaviors.
* **Develop segment profiles:** Create detailed descriptions of each segment, including key attributes, purchase patterns, and potential needs.
* **Formulate targeted marketing strategies:** Recommend specific marketing approaches and messaging tailored to each segment.

**Next Steps:**

Following this exploratory analysis, we will proceed with the following:

* **Segmentation model development:** Implement and evaluate various segmentation models to identify the most effective approach.
* **Model validation and refinement:** Assess the performance of the chosen model and refine it as needed.
* **Actionable insights and recommendations:** Translate the segmentation findings into actionable strategies for marketing, customer engagement, and product development.

This project will provide valuable insights into our customer base and empower us to make data-driven decisions that improve customer relationships and drive business growth.

**Additional Considerations:**

* **Customer Lifetime Value (CLTV):** Consider incorporating CLTV analysis to understand the long-term value of each customer segment.
* **RFM Analysis:** RFM (Recency, Frequency, Monetary) analysis can provide valuable insights into customer engagement and purchase behavior.
* **Customer Journey Mapping:** Understanding the customer journey across different touchpoints can inform targeted marketing strategies.
* **Ethical Considerations:** Ensure data privacy and ethical considerations are addressed throughout the project.


Necessary imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


...and the dataframe to be used


In [2]:
# Second Dataframe used in the kaggle we went through in our meeting
df = pd.read_csv('https://raw.githubusercontent.com/sheidheda/SusAc-ML-Files/main/cs_data2.csv', encoding="ISO-8859-1")
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


# Data Loading & Cleaning

This section of the notebook focuses on preparing the dataset for analysis by loading it into the environment and performing initial data cleaning steps. These steps are crucial to ensure that the data is in a usable format and free from any inconsistencies that could affect the analysis results. This section will address any missing values, outliers, and duplicates in the dataset, ensuring that the data is clean and ready for further analysis.

______________________________________
### Peeking under the hood of the data



We get a quick glance of the data. Why?

To check:
- How big the data is
- the kind of data in each column
- to check for possible missing values
- to check how much space our data takes

In [3]:
#checking the info for df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


We now have answers to our questions:
- **Data size** -> 541,909 entries
- **Kind of data** -> the dataset contains a mix of data types
- **Possible missing values** -> Description and CustomerID have non-null counts less than the total number of entries. This indicates that there might be missing values in these columns
- **Space data takes** -> Memory usage is approximately 33.1+ MB


A more concise depiction of our dataset inspecting the dimensions of our dataset shows our dataset is **large** with 541,909 columns and 8 columns as seen below


In [4]:
df.shape

(541909, 8)

The **CustomerID** column should not be a float but a string since it is an identifier that could have leading zeros especially for older customers.

We will therefore convert it to string and remove the decimal point

In [5]:
df['CustomerID'] = df['CustomerID'].astype(str).str[:-2]

The **Quantity** ordered should not be less than zero for it to get into an invoice. But just in case, we will check it...

In [6]:
# to check whether quantity column has negative values
if (df['Quantity'].unique() <0).any():

  print("there is presence of Negative values")

there is presence of Negative values


This shows that we have **refunds** in our dataset. We will consider this is in our **segmentation**.

## Sorting out duplicates

The next step is to consider if we have any duplicates in the data.

For large data like this, we will employ the **.duplicate()** function which shows columns with repeated values for all columns.

In [7]:
# Identify rows with duplicate values across all columns in df2
duplicates = df[df.duplicated()]

# Count the number of duplicate rows
number_of_duplicates = len(duplicates)

# Print a message to inform the user about the results
print(f"The DataFrame df has {number_of_duplicates} duplicate rows.")

The DataFrame df has 5268 duplicate rows.


...to view them

In [8]:
duplicates

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
517,536409,21866,UNION JACK FLAG LUGGAGE TAG,1,12/1/2010 11:45,1.25,17908,United Kingdom
527,536409,22866,HAND WARMER SCOTTY DOG DESIGN,1,12/1/2010 11:45,2.10,17908,United Kingdom
537,536409,22900,SET 2 TEA TOWELS I LOVE LONDON,1,12/1/2010 11:45,2.95,17908,United Kingdom
539,536409,22111,SCOTTIE DOG HOT WATER BOTTLE,1,12/1/2010 11:45,4.95,17908,United Kingdom
555,536412,22327,ROUND SNACK BOXES SET OF 4 SKULLS,1,12/1/2010 11:49,2.95,17920,United Kingdom
...,...,...,...,...,...,...,...,...
541675,581538,22068,BLACK PIRATE TREASURE CHEST,1,12/9/2011 11:34,0.39,14446,United Kingdom
541689,581538,23318,BOX OF 6 MINI VINTAGE CRACKERS,1,12/9/2011 11:34,2.49,14446,United Kingdom
541692,581538,22992,REVOLVER WOODEN RULER,1,12/9/2011 11:34,1.95,14446,United Kingdom
541699,581538,22694,WICKER STAR,1,12/9/2011 11:34,2.10,14446,United Kingdom


Considering that they are redundancies, we will remove them **completely** from the dataframe.

In [9]:
# Drop the duplicates
df.drop_duplicates(inplace = True)

Our next course of action is to confirm if we have any missing values in our dataset.

To specify which columns have missing values...

##  Correcting missing values

In [10]:
# Check for missing values in df2
missing_values = df.isnull().sum()

# Calculate missing value percentages as a ratio (avoiding data type issues)
missing_percentages = (missing_values / len(df)) * 100

# Print a summary of missing values for each column
for col in df.columns:
  # Check if there are any missing values (avoiding unnecessary comparisons)
  if missing_values[col] > 0:
    print(f"Column '{col}' has {missing_values[col]} missing values ({missing_percentages[col]:.2f}%)")

Column 'Description' has 1454 missing values (0.27%)


We find that the **Description** and **CustomerID** columns have 1454 and 135037 missing values respectively.

Since CustomerID identifies the customer and Description identifies the product, let's see if there are cases where both product and customer are not identified

In [11]:
# columns with both description and CustomerID as null
no_id_desc = df[(df['Description'].isnull()) & (df['CustomerID'].isnull())]
no_id_desc

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country


We are **positive** that there are such columns and we also notice that they are the **exact same amount** as the amount of missing values in the Description column.

We can conclude that this is a subset of the rows with missing customer identifiers.

Another **observation** is that some of our unit prices are **zero**. Let's check how many...

In [12]:
no_id_desc['UnitPrice'].unique()

array([], dtype=float64)

**All of them!!!!**
Since there is not unit price, it means there is no sale.

When there is no sale, there is no customer.

This part of the data therefore adds no value to our **customer segmentation** then.

We will continue without them.

In [13]:
df_null = df[-df['Description'].isnull()]

Now we check how many null values we have left in the dataset

In [14]:
df_null.isnull().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

In [15]:
df_null['CustomerID'].fillna('Null', inplace = True)

df_null.isnull().sum()

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_null['CustomerID'].fillna('Null', inplace = True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_null['CustomerID'].fillna('Null', inplace = True)


InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

## The finished work

In [16]:
df_cleaned = df_null

df_cleaned.head(10)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850,United Kingdom


# Exploratory Data Analysis

We can look through the data to get necessary informartion for segmentation. Based on the columns in our df_cleaned, we can explore:

**Location**:

- Customer Locations:
- Population Density:

**Behavior**:

- Common Purchase Categories/Products
- Average Order Value and Frequency
- Popular Channels for Engagement:


## Location

## Behavior

### Common Purchase Categories/Products

1. Analyzing Purchase Categories:


We explore potential purchase categories and their frequency by iterating through the 'Description' column. Assuming descriptions might contain category clues, we extract keywords based on word length and alphabetical characters. This is a starting point, and we might need to adjust it for our specific data. We then use Counter to count the occurrences of each extracted category, giving us insights into the most frequently purchased ones.

In [1]:
# Considering that 'Description' column contains category clues
from collections import Counter  # For counting category occurrences (optional)

# Explore common purchase categories (using Description for potential clues)
def analyze_categories(df):
  """
  Analyzes potential purchase categories and identifies frequent items.

  Args:
      df (pandas.DataFrame): The cleaned DataFrame containing purchase data.

  Returns:
      None (prints analysis results to console)
  """

  # Extract potential category keywords from descriptions (optional)
  # This approach might require adjustments based on your data
  categories = []
  for desc in df['Description']:
    words = desc.split()  # Split description into words
    potential_category = [word for word in words if len(word) > 3 and word.isalpha()]  # Filter words with length > 3 and only letters
    if potential_category:
      categories.append(potential_category[0])  # Assuming the first relevant word is the category

  # Count category occurrences (alternative: using groupby)
  if categories:
    category_counts = Counter(categories)
    print("\n**Potential Purchase Categories (from Description):")
    for category, count in category_counts.most_common(10):  # Print top 10 most frequent
      print(f"- {category}: {count} occurrences")
  else:
    print("\n**Unable to extract clear categories from descriptions. Consider alternative methods.**")

2. Understanding Customer Behavior:


To delve into customer behavior, we group the data by the 'CustomerID'. This allows us to analyze spending patterns for each customer. We calculate their total spending by summing the 'UnitPrice' within each customer group. This helps us identify our top spending customers, the ones who contribute the most to overall sales. Additionally, we calculate the most frequent items purchased overall using the 'Description' column. This provides a general picture of the most popular products across all customers. We can further analyze frequent purchases for each individual by grouping by 'CustomerID'.

In [2]:
# Analyze customer behavior (total spending, frequent items)
def analyze_customer_behavior(df):
  """
  Analyzes customer behavior by grouping data and calculating relevant metrics.

  Args:
      df (pandas.DataFrame): The cleaned DataFrame containing purchase data.

  Returns:
      None (prints analysis results to console)
  """

  # Group data by customer ID
  customer_groups = df.groupby('CustomerID')

  # Calculate total spending per customer
  total_spend_per_customer = customer_groups['UnitPrice'].sum()

  # Identify top 5 spending customers
  top_spenders = total_spend_per_customer.sort_values(ascending=False).head(5)

  # Calculate most frequent items purchased by all customers
  # (consider grouping by customer for individual customer analysis)
  all_item_counts = df['Description'].value_counts().head(10)  # Top 10 most frequent items

  # Print analysis results
  print("\n**Analysis of Customer Behavior:**")
  print(f"\n- Top 5 Spending Customers:")
  print(top_spenders)

  print(f"\n- Top 10 Most Frequently Purchased Items:")
  print(all_item_counts)

In [3]:
# Analyzing the cleaned data
analyze_categories(df_cleaned.copy())  # Avoid modifying original data
analyze_customer_behavior(df_cleaned.copy())

NameError: name 'df_cleaned' is not defined

# Feature Engineering

# Model Development

# Model Validation and Refinement

# Conclusion