Build a Lookalike Model that takes a user's information as input and recommends 3 similar
customers based on their profile and transaction history. The model should:
● Use both customer and product information.
● Assign a similarity score to each recommended customer.
Deliverables:
● Give the top 3 lookalikes with their similarity scores for the first 20 customers
(CustomerID: C0001 - C0020) in Customers.csv. Form a “Lookalike.csv” which has
just one map: Map<cust_id, List<cust_id, score>>
● A Jupyter Notebook/Python script explaining your model development.
Evaluation Criteria:
● Model accuracy and logic.
● Quality of recommendations and similarity scores.

####Step 1: Load and Explore Data

In [18]:
#import libraries
import pandas as pd
from datetime import datetime
import numpy as np

In [3]:
# Load datasets
customers = pd.read_csv('Customers.csv')
products = pd.read_csv('Products.csv')
transactions = pd.read_csv('Transactions.csv')

In [4]:
# Inspect data
print("Customers Dataset:")
print(customers.head(), "\n")
print(customers.info(), "\n")

Customers Dataset:
  CustomerID        CustomerName         Region  SignupDate
0      C0001    Lawrence Carroll  South America  2022-07-10
1      C0002      Elizabeth Lutz           Asia  2022-02-13
2      C0003      Michael Rivera  South America  2024-03-07
3      C0004  Kathleen Rodriguez  South America  2022-10-09
4      C0005         Laura Weber           Asia  2022-08-15 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   CustomerID    200 non-null    object
 1   CustomerName  200 non-null    object
 2   Region        200 non-null    object
 3   SignupDate    200 non-null    object
dtypes: object(4)
memory usage: 6.4+ KB
None 



In [5]:
print("Products Dataset:")
print(products.head(), "\n")
print(products.info(), "\n")

Products Dataset:
  ProductID              ProductName     Category   Price
0      P001     ActiveWear Biography        Books  169.30
1      P002    ActiveWear Smartwatch  Electronics  346.30
2      P003  ComfortLiving Biography        Books   44.12
3      P004            BookWorld Rug   Home Decor   95.69
4      P005          TechPro T-Shirt     Clothing  429.31 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   ProductID    100 non-null    object 
 1   ProductName  100 non-null    object 
 2   Category     100 non-null    object 
 3   Price        100 non-null    float64
dtypes: float64(1), object(3)
memory usage: 3.3+ KB
None 



In [6]:
print("Transactions Dataset:")
print(transactions.head(), "\n")
print(transactions.info(), "\n")

Transactions Dataset:
  TransactionID CustomerID ProductID      TransactionDate  Quantity  \
0        T00001      C0199      P067  2024-08-25 12:38:23         1   
1        T00112      C0146      P067  2024-05-27 22:23:54         1   
2        T00166      C0127      P067  2024-04-25 07:38:55         1   
3        T00272      C0087      P067  2024-03-26 22:55:37         2   
4        T00363      C0070      P067  2024-03-21 15:10:10         3   

   TotalValue   Price  
0      300.68  300.68  
1      300.68  300.68  
2      300.68  300.68  
3      601.36  300.68  
4      902.04  300.68   

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   TransactionID    1000 non-null   object 
 1   CustomerID       1000 non-null   object 
 2   ProductID        1000 non-null   object 
 3   TransactionDate  1000 non-null   object 
 4   Quantity         10

####Step 2: Data Preprocessing

Objective: Clean and prepare the data for feature engineering.

Handle Missing Values:


Check if any rows in Customers.csv, Products.csv, or Transactions.csv have missing values.

Drop or fill missing values as appropriate.

In [8]:
# 1. Check each dataset for missing values
print("Missing Values:")
print(customers.isnull().sum(), "\n")
print(products.isnull().sum(), "\n")
print(transactions.isnull().sum(), "\n")

Missing Values:
CustomerID      0
CustomerName    0
Region          0
SignupDate      0
dtype: int64 

ProductID      0
ProductName    0
Category       0
Price          0
dtype: int64 

TransactionID      0
CustomerID         0
ProductID          0
TransactionDate    0
Quantity           0
TotalValue         0
Price              0
dtype: int64 
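
All counts are zero, so nothing has to be dropped or filled in these files. If gaps did appear, one possible treatment is sketched here on a small made-up frame. The `toy` data, the "Unknown" label and the median fill are assumptions for illustration, not part of the datasets above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; not taken from the real CSV files.
toy = pd.DataFrame({
    "CustomerID": ["C1", "C2", None, "C4"],
    "Region": ["Asia", None, "Europe", "Asia"],
    "TotalValue": [100.0, np.nan, 50.0, 150.0],
})

# Rows without a key cannot be joined to anything, so drop them.
cleaned = toy.dropna(subset=["CustomerID"]).copy()

# Fill a categorical gap with an explicit placeholder label.
cleaned["Region"] = cleaned["Region"].fillna("Unknown")

# Fill a numeric gap with the column median (one common choice among several).
cleaned["TotalValue"] = cleaned["TotalValue"].fillna(cleaned["TotalValue"].median())

print(cleaned)
```

Dropping is reserved for the key column here because a row without a CustomerID cannot take part in the later merges.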



In [10]:
# 2. Remove duplicates
customers = customers.drop_duplicates()
products = products.drop_duplicates()
transactions = transactions.drop_duplicates()

Convert Dates to Datetime:


Convert SignupDate in Customers.csv and TransactionDate in Transactions.csv to datetime format.

In [13]:
# 3. Convert dates to datetime
customers['SignupDate'] = pd.to_datetime(customers['SignupDate'])
transactions['TransactionDate'] = pd.to_datetime(transactions['TransactionDate'])

Add new columns:

1. days_since_signup in Customers.csv (difference between today and signup date).

2. days_since_transaction in Transactions.csv (difference between today and transaction date).

In [14]:
# 4. Create new features
today = datetime.now()
customers['days_since_signup'] = (today - customers['SignupDate']).dt.days
transactions['days_since_transaction'] = (today - transactions['TransactionDate']).dt.days
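
Because `datetime.now()` changes on every run, the day counts depend on when the notebook is executed. A minimal sketch of a reproducible variant, assuming a pinned reference date is acceptable for the analysis (the `reference_date` value is an arbitrary assumption; the two signup dates are copied from the Customers preview above):

```python
import pandas as pd

# Two signup dates taken from the Customers preview above.
toy_customers = pd.DataFrame({
    "CustomerID": ["C0001", "C0002"],
    "SignupDate": pd.to_datetime(["2022-07-10", "2022-02-13"]),
})

# Assumed reference date; pinning it makes reruns give identical day counts.
reference_date = pd.Timestamp("2025-01-01")

toy_customers["days_since_signup"] = (
    reference_date - toy_customers["SignupDate"]
).dt.days

print(toy_customers)
```

A natural alternative to a hard-coded date is the latest TransactionDate in the data, which ties the feature to the dataset rather than to the clock.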

Merge Datasets:

1. Merge Transactions.csv with Products.csv to get product details in transactions.

2. Merge the result with Customers.csv to get a complete dataset.

In [15]:
# 5. Merge datasets
transactions = transactions.merge(products, on='ProductID', how='left')
complete_data = transactions.merge(customers, on='CustomerID', how='left')

In [16]:
print("Merged Dataset:")
print(complete_data.head())

Merged Dataset:
  TransactionID CustomerID ProductID     TransactionDate  Quantity  \
0        T00001      C0199      P067 2024-08-25 12:38:23         1   
1        T00112      C0146      P067 2024-05-27 22:23:54         1   
2        T00166      C0127      P067 2024-04-25 07:38:55         1   
3        T00272      C0087      P067 2024-03-26 22:55:37         2   
4        T00363      C0070      P067 2024-03-21 15:10:10         3   

   TotalValue  Price_x  days_since_transaction  \
0      300.68   300.68                     154   
1      300.68   300.68                     244   
2      300.68   300.68                     277   
3      601.36   300.68                     306   
4      902.04   300.68                     311   

                       ProductName     Category  Price_y     CustomerName  \
0  ComfortLiving Bluetooth Speaker  Electronics   300.68   Andrea Jenkins   
1  ComfortLiving Bluetooth Speaker  Electronics   300.68  Brittany Harvey   
2  ComfortLiving Bluetooth Spea
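
The Price_x and Price_y columns in the merged output appear because both Transactions.csv and Products.csv carry a Price column, and pandas adds suffixes to keep them apart. A minimal sketch of one way to keep a single Price column, using toy frames that mirror the column layout (the second product row is made up for illustration):

```python
import pandas as pd

# Toy frames mirroring the column layout above; product P999 is hypothetical.
toy_transactions = pd.DataFrame({
    "TransactionID": ["T00001", "T00002"],
    "ProductID": ["P067", "P999"],
    "Quantity": [1, 2],
    "Price": [300.68, 10.00],
})
toy_products = pd.DataFrame({
    "ProductID": ["P067", "P999"],
    "Category": ["Electronics", "Books"],
    "Price": [300.68, 10.00],
})

# Drop the duplicate Price column from one side before merging,
# so the result keeps a single unsuffixed Price column.
merged = toy_transactions.merge(
    toy_products.drop(columns="Price"),
    on="ProductID",
    how="left",
    validate="many_to_one",  # raises if ProductID is not unique in products
)

print(merged.columns.tolist())
```

The `validate` argument is optional; it is included here as a cheap guard against accidental row duplication during the join.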

####Step 3: Feature Engineering

Objective: Create a feature set that combines customer profiles and transaction history, which can be used to compute similarity.

Aggregate Transaction History:

For each customer:

Total spending: Sum of TotalValue.

Total quantity purchased: Sum of Quantity.

Count of unique products purchased.

Most frequent product categories (use one-hot encoding or counts).

In [17]:
from sklearn.preprocessing import MinMaxScaler

In [20]:
# 1. Aggregate transaction history
agg_data = complete_data.groupby('CustomerID').agg({
    'TotalValue': 'sum',  # Total spending
    'Quantity': 'sum',    # Total quantity purchased
    'ProductID': 'nunique',  # Number of unique products
    'Category': lambda x: x.mode()[0]  # Most frequent category; ties go to the first in sorted order
}).reset_index()
agg_data.rename(columns={
    'TotalValue': 'total_spending',
    'Quantity': 'total_quantity',
    'ProductID': 'unique_products',
    'Category': 'favorite_category'
}, inplace=True)

In [21]:
# One-hot encode favorite_category
agg_data = pd.get_dummies(agg_data, columns=['favorite_category'], prefix='category')
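
The feature list above also allows counts instead of a one-hot flag for the single most frequent category. A minimal sketch of that variant with `pd.crosstab`, on made-up rows (only the column names follow the merged dataset; the `category_share` step is an optional assumption):

```python
import pandas as pd

# Made-up transaction rows; only the column names follow the merged dataset.
toy = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C1", "C2", "C2"],
    "Category": ["Books", "Books", "Clothing", "Electronics", "Books"],
})

# One row per customer, one column per category, values are purchase counts.
category_counts = pd.crosstab(toy["CustomerID"], toy["Category"]).add_prefix("category_")

# Optionally convert counts to per-customer shares so heavy buyers do not dominate.
category_share = category_counts.div(category_counts.sum(axis=1), axis=0)

print(category_counts)
print(category_share)
```

Counts keep information that the most-frequent-category flag discards, for example a customer who splits purchases evenly between two categories.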

In [22]:
# 2. Normalize numerical features
scaler = MinMaxScaler()
agg_data[['total_spending', 'total_quantity', 'unique_products']] = scaler.fit_transform(
    agg_data[['total_spending', 'total_quantity', 'unique_products']]
)
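
For reference, min-max scaling maps each column to the range 0 to 1 by subtracting the column minimum and dividing by the column range. A minimal NumPy sketch of that arithmetic on made-up spending totals (this illustrates the formula only and is not the scaler object used above):

```python
import numpy as np

# Made-up spending totals for four customers.
total_spending = np.array([200.0, 500.0, 800.0, 1400.0])

# Min-max scaling: subtract the column minimum, divide by the column range.
col_min, col_max = total_spending.min(), total_spending.max()
scaled = (total_spending - col_min) / (col_max - col_min)

print(scaled)  # smallest value maps to 0.0, largest to 1.0
```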

In [23]:
# 3. Combine features with Customers.csv
customer_features = customers.merge(agg_data, on='CustomerID', how='left')
customer_features.fillna(0, inplace=True)  # Customers without transactions get 0 for every aggregated feature

print("Final Feature Set:")
print(customer_features.head())

Final Feature Set:
  CustomerID        CustomerName         Region SignupDate  days_since_signup  \
0      C0001    Lawrence Carroll  South America 2022-07-10                932   
1      C0002      Elizabeth Lutz           Asia 2022-02-13               1079   
2      C0003      Michael Rivera  South America 2024-03-07                326   
3      C0004  Kathleen Rodriguez  South America 2022-10-09                841   
4      C0005         Laura Weber           Asia 2022-08-15                896   

   total_spending  total_quantity  unique_products category_Books  \
0        0.308942        0.354839         0.444444          False   
1        0.168095        0.290323         0.333333          False   
2        0.249541        0.419355         0.333333          False   
3        0.497806        0.709677         0.777778           True   
4        0.184287        0.193548         0.222222          False   

  category_Clothing category_Electronics category_Home Decor  
0             Fa

####Step 4: Compute Similarity

Objective: Use the processed customer_features to calculate similarity scores and find the top 3 lookalike customers for each customer.

In [24]:
from sklearn.metrics.pairwise import cosine_similarity

Select Numerical Features:

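Use features like total_spending, total_quantity, unique_products and one-hot encoded product categories for similarity computation. Region, SignupDate and days_since_signup are left out of the matrix, so the scores reflect transaction behaviour only.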

In [25]:
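# 1. Select numerical features for similarity (identifier and profile columns are excluded)
excluded_cols = ['CustomerID', 'CustomerName', 'Region', 'SignupDate', 'days_since_signup']
feature_cols = customer_features.columns.difference(excluded_cols)
# Cast to float so boolean one-hot columns and scaled numerics share one numeric dtype
features_matrix = customer_features[feature_cols].astype(float).values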

Compute Similarity:

Use Cosine Similarity to measure how similar each customer is to others.

In [26]:
# 2. Compute Cosine Similarity
similarity_matrix = cosine_similarity(features_matrix)
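
Cosine similarity compares the direction of two feature vectors and ignores their length. A minimal NumPy sketch of the calculation on made-up rows follows; `cosine_sim_matrix` is a hypothetical helper written for illustration, not the library function called above. With numeric features scaled to 0 to 1 and one-hot flags equal to 1, a shared category flag can carry much of the score, which may be one reason the scores in the final output sit so close to 1.

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Guard against all-zero rows, e.g. customers with no transactions after fillna(0).
    norms[norms == 0] = 1.0
    unit = X / norms
    return unit @ unit.T

# Made-up rows: [total_spending, total_quantity, category_Books, category_Clothing]
toy_features = np.array([
    [0.3, 0.4, 1.0, 0.0],
    [0.6, 0.8, 2.0, 0.0],   # same direction as row 0, twice the magnitude
    [0.3, 0.4, 0.0, 1.0],   # same numeric profile, different category
    [0.0, 0.0, 0.0, 0.0],   # no activity at all
])

sim = cosine_sim_matrix(toy_features)
print(np.round(sim, 4))
```

Rows 0 and 1 score 1.0 despite different magnitudes, while rows 0 and 2 score only 0.2 because they differ in the category flag.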

Get Top 3 Lookalikes:

For each customer:
Compute similarity scores with all other customers.

Sort by similarity and select the top 3.

In [27]:
# 3. Get top 3 lookalikes for each customer
lookalike_dict = {}

for idx, customer_id in enumerate(customer_features['CustomerID']):
    # Exclude self-similarity by skipping the customer's own index
    similarity_scores = list(enumerate(similarity_matrix[idx]))
    similarity_scores = [(i, score) for i, score in similarity_scores if i != idx]

    # Sort by similarity score in descending order and get top 3
    top_3 = sorted(similarity_scores, key=lambda x: x[1], reverse=True)[:3]
    lookalike_dict[customer_id] = [(customer_features['CustomerID'].iloc[i], round(score, 4)) for i, score in top_3]
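
The same top-3 selection can be done without a Python-level sort per customer. A minimal vectorized sketch with `np.argsort`, assuming a square similarity matrix whose row order matches the ID list (the matrix and IDs are made up):

```python
import numpy as np

# Made-up symmetric similarity matrix for four customers.
ids = ["C1", "C2", "C3", "C4"]
sim = np.array([
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.4, 0.1],
    [0.2, 0.4, 1.0, 0.7],
    [0.5, 0.1, 0.7, 1.0],
])

k = 3
masked = sim.copy()
np.fill_diagonal(masked, -np.inf)  # a customer is never its own lookalike

# Indices of the k largest scores per row, best first; stable sort keeps ties in index order.
top_idx = np.argsort(-masked, axis=1, kind="stable")[:, :k]

lookalikes = {
    ids[row]: [(ids[j], round(float(masked[row, j]), 4)) for j in top_idx[row]]
    for row in range(len(ids))
}
print(lookalikes["C1"])
```

For 200 customers either approach is fast; the vectorized form mainly helps if the customer base grows.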

Create a dictionary or DataFrame where each CustomerID maps to a list of its 3 most similar customers with their similarity scores.

Save the result as Lookalike.csv.

In [30]:
# 4. Create Lookalike.csv
lookalike_list = [{'CustomerID': cust_id, 'Lookalikes': str(lookalike_dict[cust_id])} for cust_id in lookalike_dict]
lookalike_df = pd.DataFrame(lookalike_list)
lookalike_df.to_csv('Lookalike.csv', index=False)
print("Lookalike Model Complete! Check 'Lookalike.csv' for results.")

Lookalike Model Complete! Check 'Lookalike.csv' for results.
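
The file written above covers every customer, while the deliverable asks only for C0001 to C0020. A minimal sketch of restricting the export to that range and writing one row per recommendation is given here. The `toy_lookalikes` dictionary is a made-up stand-in for `lookalike_dict`, and the `Rank`, `LookalikeID` and `Score` column names are assumptions, not a required format.

```python
import io
import pandas as pd

# Made-up stand-in for lookalike_dict; IDs and scores are illustrative only.
toy_lookalikes = {
    "C0001": [("C0101", 0.99), ("C0102", 0.98), ("C0103", 0.97)],
    "C0020": [("C0104", 0.95), ("C0105", 0.94), ("C0106", 0.93)],
    "C0021": [("C0107", 0.92), ("C0108", 0.91), ("C0109", 0.90)],
}

# Target IDs C0001 through C0020, built from the zero-padded ID pattern.
wanted = [f"C{n:04d}" for n in range(1, 21)]

rows = []
for cust_id in wanted:
    for rank, (other_id, score) in enumerate(toy_lookalikes.get(cust_id, []), start=1):
        rows.append({"CustomerID": cust_id, "Rank": rank,
                     "LookalikeID": other_id, "Score": score})

flat = pd.DataFrame(rows)

# Write to an in-memory buffer here; pass a file path to write to disk instead.
buffer = io.StringIO()
flat.to_csv(buffer, index=False)
print(buffer.getvalue())
```

A flat layout avoids having to parse a stringified list when the file is read back; the single-column map used above stays closer to the Map<cust_id, List<cust_id, score>> wording of the brief.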


In [32]:
lookalike = pd.read_csv('Lookalike.csv')

In [36]:
lookalike.head(20)  # first 20 customers, C0001 to C0020

,CustomerID,Lookalikes
0,C0001,"[('C0048', 0.9993), ('C0055', 0.9993), ('C0072..."
1,C0002,"[('C0029', 1.0), ('C0010', 0.9984), ('C0133', ..."
2,C0003,"[('C0166', 0.9966), ('C0160', 0.996), ('C0086'..."
3,C0004,"[('C0017', 0.9989), ('C0173', 0.9977), ('C0101..."
4,C0005,"[('C0186', 0.9997), ('C0112', 0.9995), ('C0007..."
5,C0006,"[('C0117', 0.9999), ('C0168', 0.9981), ('C0070..."
6,C0007,"[('C0120', 0.9995), ('C0050', 0.9991), ('C0115..."
7,C0008,"[('C0124', 0.9837), ('C0104', 0.9805), ('C0065..."
8,C0009,"[('C0083', 0.9992), ('C0198', 0.9942), ('C0077..."
9,C0010,"[('C0029', 0.9984), ('C0002', 0.9984), ('C0133..."
