# Lookalike Model
This notebook outlines the process of building a lookalike model that takes a customer's information as input and recommends the 3 most similar customers based on their profile and transaction history. The model uses both customer and product information, and assigns a similarity score to each recommended customer.

## Setup and Initialization

### Import Necessary Libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

### Loading the dataset

In [2]:
customers = pd.read_csv('../datasets/Customers.csv')
products = pd.read_csv('../datasets/Products.csv')
transactions = pd.read_csv('../datasets/Transactions.csv')

In [3]:
customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   CustomerID    200 non-null    object
 1   CustomerName  200 non-null    object
 2   Region        200 non-null    object
 3   SignupDate    200 non-null    object
dtypes: object(4)
memory usage: 6.4+ KB


In [4]:
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   ProductID    100 non-null    object 
 1   ProductName  100 non-null    object 
 2   Category     100 non-null    object 
 3   Price        100 non-null    float64
dtypes: float64(1), object(3)
memory usage: 3.3+ KB


In [5]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   TransactionID    1000 non-null   object 
 1   CustomerID       1000 non-null   object 
 2   ProductID        1000 non-null   object 
 3   TransactionDate  1000 non-null   object 
 4   Quantity         1000 non-null   int64  
 5   TotalValue       1000 non-null   float64
 6   Price            1000 non-null   float64
dtypes: float64(2), int64(1), object(4)
memory usage: 54.8+ KB


## Data Preparation and Feature Engineering

Clean and prepare the data for analysis by standardizing date formats and enriching transactions.

In [6]:
# Convert date columns to datetime
customers['SignupDate'] = pd.to_datetime(customers['SignupDate'])
transactions['TransactionDate'] = pd.to_datetime(transactions['TransactionDate'])

# Merge transactions with products to get product categories
transactions = transactions.merge(products[['ProductID', 'Category']], on='ProductID', how='left')

# Determine the last transaction date as a reference point
last_transaction_date = transactions['TransactionDate'].max()

### Customer Profile Features

Create a numerical feature that captures when each customer signed up.

In [7]:
# Calculate days since the earliest signup date as a numerical feature
customers['DaysSinceEarliestSignup'] = (customers['SignupDate'] - customers['SignupDate'].min()).dt.days

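Stored as `DaysSinceEarliestSignup`, this numerical feature measures how many days after the earliest signup in the dataset each customer joined. Larger values mean newer customers, so it acts as an inverse proxy for tenure, which may influence their behavior.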
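To make the direction of this feature concrete, here is a minimal sketch on made-up signup dates (the IDs and dates are hypothetical, not taken from `Customers.csv`). It also shows an alternative, `TenureDays`, measured back from a reference date; that column name and the choice of reference date are assumptions for illustration only.

```python
import pandas as pd

# Toy signup dates (hypothetical values, not from Customers.csv)
toy = pd.DataFrame({
    'CustomerID': ['A', 'B', 'C'],
    'SignupDate': pd.to_datetime(['2022-01-01', '2022-01-31', '2023-01-01']),
})

# Days elapsed since the earliest signup: 0 for the oldest account
toy['DaysSinceEarliestSignup'] = (toy['SignupDate'] - toy['SignupDate'].min()).dt.days

# Alternative: tenure counted back from a reference date
# (assumed here to be the latest signup in the toy table)
reference_date = toy['SignupDate'].max()
toy['TenureDays'] = (reference_date - toy['SignupDate']).dt.days

print(toy)
```

The two columns carry the same information with opposite sign, so either one works once the features are standardized.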

### Transaction-Based Features

Generate features summarizing customer transaction behavior.

In [8]:
# Aggregate transaction data per customer
customer_agg = transactions.groupby('CustomerID').agg({
    'TransactionID': 'count',            # Frequency: number of transactions
    'TotalValue': 'sum',                # Monetary: total spend
    'ProductID': 'nunique',            # Unique products purchased
    'Category': 'nunique',            # Unique categories purchased
    'TransactionDate': 'max'          # Date of the last transaction
}).rename(columns={
    'TransactionID': 'Frequency',
    'TotalValue': 'Monetary',
    'ProductID': 'UniqueProducts',
    'Category': 'UniqueCategories',
    'TransactionDate': 'LastTransactionDate'
})

In [9]:
# Calculate Recency: days since the last transaction
customer_agg['Recency'] = (last_transaction_date - customer_agg['LastTransactionDate']).dt.days

# Calculate Average Transaction Value
customer_agg['AvgTransactionValue'] = customer_agg['Monetary'] / customer_agg['Frequency']

- `Frequency`: Counts transactions per customer.
- `Monetary`: Sums total spending.
- `UniqueProducts`: Counts distinct products purchased.
- `UniqueCategories`: Counts distinct categories purchased.
- `LastTransactionDate`: Identifies the most recent transaction.
- `Recency`: Calculates days since the last transaction (lower values mean more recent activity).
- `AvgTransactionValue`: Computes average spend per transaction (Monetary / Frequency).
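The features listed above can be traced by hand on a tiny example. The sketch below uses made-up transactions (hypothetical IDs, values, and dates) and pandas named aggregation, which is an equivalent way of writing the `agg` plus `rename` pattern.

```python
import pandas as pd

# Toy transactions (hypothetical values, not from Transactions.csv)
toy_tx = pd.DataFrame({
    'TransactionID': ['T1', 'T2', 'T3', 'T4'],
    'CustomerID': ['A', 'A', 'B', 'A'],
    'TotalValue': [100.0, 50.0, 30.0, 150.0],
    'TransactionDate': pd.to_datetime(
        ['2024-01-01', '2024-01-10', '2024-01-05', '2024-01-20']
    ),
})

# Reference point: the most recent transaction in the toy data
reference_date = toy_tx['TransactionDate'].max()

# Named aggregation produces the final column names directly
rfm = toy_tx.groupby('CustomerID').agg(
    Frequency=('TransactionID', 'count'),
    Monetary=('TotalValue', 'sum'),
    LastTransactionDate=('TransactionDate', 'max'),
)
rfm['Recency'] = (reference_date - rfm['LastTransactionDate']).dt.days
rfm['AvgTransactionValue'] = rfm['Monetary'] / rfm['Frequency']

print(rfm)
```

Customer `A` has three transactions totalling 300, so the average is 100 and the recency is 0 days; customer `B` last bought 15 days before the reference date.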

In [10]:
# Calculate spending per category
category_spend = transactions.groupby(['CustomerID', 'Category'])['TotalValue'].sum().reset_index()
category_spend_pivot = category_spend.pivot(index='CustomerID', columns='Category', values='TotalValue').fillna(0)

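# Compute proportions of total spend per category. The row total is taken once,
# before any *_prop columns exist, so they cannot inflate the denominator.
total_spend = category_spend_pivot.sum(axis=1)
for col in list(category_spend_pivot.columns):
    category_spend_pivot[col + '_prop'] = category_spend_pivot[col] / total_spend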
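A useful sanity check on the proportion features is that each customer's category shares sum to 1. The sketch below, on made-up spend values, computes the shares with `DataFrame.div` from row totals taken before any new columns are added, then verifies the row sums. The names `toy_spend`, `toy_props`, and `row_sums` are illustrative only.

```python
import pandas as pd

# Toy spend-per-category table (hypothetical values)
toy_spend = pd.DataFrame(
    {'Books': [20.0, 0.0], 'Electronics': [60.0, 50.0], 'Clothing': [20.0, 50.0]},
    index=pd.Index(['A', 'B'], name='CustomerID'),
)

# Row totals are computed once, on the raw spend columns only
total_spend = toy_spend.sum(axis=1)

# Divide every column by the matching row total and tag the result
toy_props = toy_spend.div(total_spend, axis=0).add_suffix('_prop')

# Each row of proportions should add up to 1
row_sums = toy_props.sum(axis=1)
print(toy_props)
print(row_sums)
```

If the denominator were recomputed inside a loop that adds `_prop` columns to the same DataFrame, the earlier proportions would be included in later totals and the shares would sum to slightly less than 1.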

### Create Feature Set

Integrate all features into a single DataFrame and prepare it for modeling.

In [11]:
# Join transaction features with category proportions
customer_features = customer_agg.join(category_spend_pivot.filter(like='_prop'), how='left')

In [12]:
# Merge with customer profile features
customer_features = customers[['CustomerID', 'Region', 'DaysSinceEarliestSignup']].merge(
    customer_features.reset_index(), on='CustomerID', how='left'
)

In [13]:
# One-hot encode the Region column
customer_features = pd.get_dummies(customer_features, columns=['Region'], prefix='Region')

# Handle any missing values (e.g., customers with no transactions)
customer_features.fillna(0, inplace=True)

In [14]:
customer_features.head()

| | CustomerID | DaysSinceEarliestSignup | Frequency | Monetary | UniqueProducts | UniqueCategories | LastTransactionDate | Recency | AvgTransactionValue | Books_prop | Clothing_prop | Electronics_prop | Home Decor_prop | Region_Asia | Region_Europe | Region_North America | Region_South America |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C0001 | 169 | 5.0 | 3354.52 | 5.0 | 3.0 | 2024-11-02 17:04:16 | 55.0 | 670.904 | 0.034163 | 0.0 | 0.842824 | 0.122972 | False | False | False | True |
| 1 | C0002 | 22 | 4.0 | 1862.74 | 4.0 | 2.0 | 2024-12-03 01:41:41 | 25.0 | 465.685 | 0.0 | 0.550512 | 0.0 | 0.449356 | True | False | False | False |
| 2 | C0003 | 775 | 4.0 | 2725.38 | 4.0 | 3.0 | 2024-08-24 18:54:04 | 125.0 | 681.345 | 0.0 | 0.044896 | 0.508251 | 0.446753 | False | False | False | True |
| 3 | C0004 | 260 | 8.0 | 5354.88 | 8.0 | 3.0 | 2024-12-23 14:13:52 | 4.0 | 669.36 | 0.352665 | 0.0 | 0.253162 | 0.394112 | False | False | False | True |
| 4 | C0005 | 205 | 3.0 | 2034.24 | 3.0 | 2.0 | 2024-11-04 00:30:22 | 54.0 | 678.08 | 0.0 | 0.0 | 0.580256 | 0.419624 | True | False | False | False |


## Feature Scaling

### Define Feature Vector

In [15]:
# Select relevant features for similarity calculation
feature_columns = (
    ['DaysSinceEarliestSignup'] +
    [col for col in customer_features.columns if col.startswith('Region_')] +
    ['Recency', 'Frequency', 'Monetary', 'AvgTransactionValue', 'UniqueProducts', 'UniqueCategories'] +
    [col for col in customer_features.columns if col.endswith('_prop')]
)

# Set CustomerID as index for easier lookup
customer_features.set_index('CustomerID', inplace=True)

### Standardize Features

To make the similarity comparison fair, the features are standardized with `StandardScaler`, which rescales each feature to a mean of 0 and a standard deviation of 1. This keeps the similarity metric from being dominated by features with larger scales (like `Monetary`).

In [16]:
scaler = StandardScaler()
features_scaled = scaler.fit_transform(customer_features[feature_columns])
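As a small illustration of what standardization does, the sketch below scales a made-up two-column array and compares the result with the manual formula `(x - mean) / std`. The array values are hypothetical; note that `StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (hypothetical values)
toy_X = np.array([
    [1.0, 1000.0],
    [2.0, 3000.0],
    [3.0, 5000.0],
])

toy_scaled = StandardScaler().fit_transform(toy_X)

# Manual equivalent: subtract the column mean, divide by the population std
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

print(toy_scaled)
print(np.allclose(toy_scaled, manual))
```

After scaling, both columns contain the same values even though the raw magnitudes differed by three orders of magnitude, which is why neither dominates the similarity score.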

## Recommendation Model

Recommendation systems are algorithms that estimate what users will like or find useful and then recommend relevant items to them, such as products, movies, or articles. These systems improve personalization by evaluating user preferences, actions, and traits. There are two major types of recommendation systems: `Collaborative filtering` and `Content-based filtering`.

### Collaborative Filtering

- **Collaborative Filtering** detects patterns based on user-item interactions such as ratings, purchases, and clicks. It assumes that users who have shared similar tastes in the past will do so again.
- It can be either `user-based` (recommends items liked by people similar to the target user) or `item-based` (suggests items similar to those the user previously liked).
- Collaborative filtering captures complex patterns without detailed item information and performs well when plenty of interaction data is available. However, it struggles with "cold start" problems for new users or items with no history, because it depends on that interaction data.
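For contrast with the approach used later in this notebook, here is a minimal sketch of user-based collaborative filtering on a made-up purchase matrix. It is an illustration under simplifying assumptions (binary interactions, a single nearest neighbour), not part of the lookalike model itself.

```python
import numpy as np

# Rows are users, columns are items; 1 means the user bought the item (toy data)
interactions = np.array([
    [1, 1, 0, 0],   # user 0
    [1, 1, 1, 0],   # user 1
    [0, 0, 1, 1],   # user 2
], dtype=float)

# Cosine similarity between user rows
norms = np.linalg.norm(interactions, axis=1, keepdims=True)
user_sim = (interactions @ interactions.T) / (norms * norms.T)

# Find the user most similar to the target, ignoring the target itself
target = 0
sim_to_target = user_sim[target].copy()
sim_to_target[target] = -np.inf
neighbor = int(np.argmax(sim_to_target))

# Recommend items the neighbour bought that the target has not
candidates = np.where((interactions[neighbor] == 1) & (interactions[target] == 0))[0]
print(neighbor, candidates)
```

User 1 shares two purchases with user 0, so item 2 becomes the recommendation. A brand-new user would have an all-zero row and no meaningful neighbours, which is the cold-start problem described above.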

### Content-Based Filtering

- This method recommends items by matching user profiles to item characteristics, rather than relying on user interactions.
- Content-based filtering works well for new users or items with known attributes, and does not require large amounts of interaction data.
- However, its ability to capture complex preferences is limited by the quality of the feature data.

We chose content-based filtering for our approach because it takes advantage of readily available customer attributes such as tenure, geography, spending history, and category preferences. It works in cases with little interaction data, such as those involving new clients, by relying on profile similarities rather than prior interactions. Furthermore, it provides interpretability, since the features explain why two customers are considered similar, and it is consistent with our goal of identifying lookalikes for a given group based on specified criteria. This makes it a realistic and effective option for our use case.

#### Compute Similarity Matrix

In [17]:
similarity_matrix = cosine_similarity(features_scaled)

- `cosine_similarity` computes a matrix where each element is the cosine similarity between two customers’ feature vectors.
- Values range from `-1` to `1`, with higher values indicating greater similarity.
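The cosine similarity of two vectors is their dot product divided by the product of their norms. The sketch below works this out by hand with NumPy on made-up vectors; the helper name `cosine` is hypothetical and is only meant to mirror what `cosine_similarity` computes for a single pair.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two 1-D vectors: dot(u, v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])    # same direction as a, twice the length
c = np.array([-1.0, -2.0, 0.0])  # opposite direction to a
d = np.array([0.0, 0.0, 3.0])    # orthogonal to a

sim_ab = cosine(a, b)  # close to 1: identical direction
sim_ac = cosine(a, c)  # close to -1: opposite direction
sim_ad = cosine(a, d)  # 0: nothing in common

print(sim_ab, sim_ac, sim_ad)
```

Because only direction matters, `a` and `b` count as maximally similar despite their different magnitudes. Negative values are possible here because standardized features are centered around 0.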

#### Function to Get Top Similar Customers

This function takes a DataFrame, a similarity matrix, a customer ID, and the number of similar customers (`n`) to fetch. It finds the customer's index, extracts their similarity scores, sorts them in descending order, and removes the customer's own entry before taking the top `n`.

In [18]:
def get_top_similar(customers_df, sim_matrix, customer_id, n=3):
    """Returns the top N similar customers for a given customer ID."""
    idx = customers_df.index.get_loc(customer_id)
    sim_scores = sim_matrix[idx]
    # Sort indices by similarity (descending) and drop the customer's own index explicitly,
    # since ties or floating-point rounding could place another customer first
    order = sim_scores.argsort()[::-1]
    top_indices = order[order != idx][:n]
    top_customers = customers_df.index[top_indices].tolist()
    top_scores = sim_scores[top_indices].tolist()
    return list(zip(top_customers, top_scores))
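The lookup logic can be checked on a small hand-written similarity matrix. The sketch below uses a hypothetical helper, `top_similar_excluding_self`, that follows the same steps but takes an index of IDs directly; the IDs and scores are made up.

```python
import numpy as np
import pandas as pd

def top_similar_excluding_self(ids, sim_matrix, customer_id, n=3):
    """Return the n most similar IDs and scores, never including customer_id."""
    idx = ids.get_loc(customer_id)
    scores = sim_matrix[idx]
    order = np.argsort(scores)[::-1]   # highest similarity first
    order = order[order != idx][:n]    # drop the customer's own position
    return [(ids[i], float(scores[i])) for i in order]

# Toy symmetric similarity matrix with 1.0 on the diagonal (hypothetical values)
toy_ids = pd.Index(['A', 'B', 'C', 'D'])
toy_sim = np.array([
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.1, 0.4],
    [0.2, 0.1, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])

result = top_similar_excluding_self(toy_ids, toy_sim, 'A', n=2)
print(result)
```

Filtering out the customer's own position, rather than assuming it is always first after sorting, keeps the result correct even when another customer has an identical feature vector.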

## Get Recommendations

In [19]:
# Generate Recommendations for C0001 to C0020
target_customers = [f'C{i:04d}' for i in range(1, 21)]
lookalike_results = []

for target_customer in target_customers:
    if target_customer in customer_features.index:
        top_similar = get_top_similar(customer_features, similarity_matrix, target_customer, n=3)
        lookalike_results.append((target_customer, top_similar))
    else:
        # Handle case where a target customer has no data (unlikely per problem context)
        lookalike_results.append((target_customer, []))

We can now save the dataset that includes each target customer's lookalikes and their similarity scores.

In [20]:
lookalike_df = pd.DataFrame(lookalike_results, columns=['target_customer', 'similar_customers'])

lookalike_df.to_csv('../Reports/Lookalike.csv', index=False)
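Storing a list of tuples in a single CSV cell means the file has to be parsed again before the scores can be used. One possible alternative, sketched here on made-up results, is a long format with one row per (target, lookalike) pair. The column names `rank`, `similar_customer`, and `similarity_score` are assumptions, not a required output schema.

```python
import pandas as pd

# Toy results in the same shape as lookalike_results (hypothetical IDs and scores)
toy_results = [
    ('C0001', [('C0002', 0.91), ('C0003', 0.85)]),
    ('C0002', []),  # a target with no matches contributes no rows
]

# One row per (target, lookalike) pair, ranked by similarity
rows = [
    {
        'target_customer': target,
        'rank': rank,
        'similar_customer': cid,
        'similarity_score': score,
    }
    for target, matches in toy_results
    for rank, (cid, score) in enumerate(matches, start=1)
]
long_df = pd.DataFrame(
    rows, columns=['target_customer', 'rank', 'similar_customer', 'similarity_score']
)

# Render as CSV text; pass a file path to to_csv to write to disk instead
csv_text = long_df.to_csv(index=False)
print(csv_text)
```

In this layout every value sits in its own column, so the file can be filtered or joined without re-parsing strings.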