# Lookalike modeling


Acquiring high-value customers is a persistent challenge in e-commerce.

In this notebook, my main goal is to build a model that takes known high-value customers (the seed customers) and returns the customers whose spending habits are most similar to theirs.

For this, I will use four attributes of each customer: total amount spent on the website, number of purchase orders, number of days since the most recent order, and number of distinct product categories bought.

These attributes are used to find the most similar customers.




In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
customers = pd.read_csv('/content/drive/MyDrive/eCommerce Transactions Dataset/Customers.csv')
products =  pd.read_csv('/content/drive/MyDrive/eCommerce Transactions Dataset/Products.csv')
transactions = pd.read_csv('/content/drive/MyDrive/eCommerce Transactions Dataset/Transactions.csv')

In [None]:
customers['SignupDate'] = pd.to_datetime(customers['SignupDate'])
transactions['TransactionDate'] = pd.to_datetime(transactions['TransactionDate'])
df = pd.merge(transactions, customers, on='CustomerID', how='inner')
df = pd.merge(df,products , on='ProductID', how='inner')
df = df.drop('Price_x',axis = 1)
df.head()

Unnamed: 0,TransactionID,CustomerID,ProductID,TransactionDate,Quantity,TotalValue,CustomerName,Region,SignupDate,ProductName,Category,Price_y
0,T00001,C0199,P067,2024-08-25 12:38:23,1,300.68,Andrea Jenkins,Europe,2022-12-03,ComfortLiving Bluetooth Speaker,Electronics,300.68
1,T00112,C0146,P067,2024-05-27 22:23:54,1,300.68,Brittany Harvey,Asia,2024-09-04,ComfortLiving Bluetooth Speaker,Electronics,300.68
2,T00166,C0127,P067,2024-04-25 07:38:55,1,300.68,Kathryn Stevens,Europe,2024-04-04,ComfortLiving Bluetooth Speaker,Electronics,300.68
3,T00272,C0087,P067,2024-03-26 22:55:37,2,601.36,Travis Campbell,South America,2024-04-11,ComfortLiving Bluetooth Speaker,Electronics,300.68
4,T00363,C0070,P067,2024-03-21 15:10:10,3,902.04,Timothy Perez,Europe,2022-03-15,ComfortLiving Bluetooth Speaker,Electronics,300.68


In [None]:
customers

Unnamed: 0,CustomerID,CustomerName,Region,SignupDate
0,C0001,Lawrence Carroll,South America,2022-07-10
1,C0002,Elizabeth Lutz,Asia,2022-02-13
2,C0003,Michael Rivera,South America,2024-03-07
3,C0004,Kathleen Rodriguez,South America,2022-10-09
4,C0005,Laura Weber,Asia,2022-08-15
...,...,...,...,...
195,C0196,Laura Watts,Europe,2022-06-07
196,C0197,Christina Harvey,Europe,2023-03-21
197,C0198,Rebecca Ray,Europe,2022-02-27
198,C0199,Andrea Jenkins,Europe,2022-12-03


# Feature Generation



To find the most similar customers, we first need features that describe each customer's purchasing behavior.

I am assuming that this model will be used to advertise to potential new customers.

For example, if someone is a regular, high-revenue customer of the platform, we would like to find people who behave similarly and show them targeted advertising.



In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Recency: days between each customer's last purchase and the latest purchase in the data
recency_df = df.groupby(by='CustomerID', as_index=False)['TransactionDate'].max()
recency_df.columns = ['CustomerID', 'LastPurchaseDate']
recent_date = recency_df['LastPurchaseDate'].max()
recency_df['Recency'] = (recent_date - recency_df['LastPurchaseDate']).dt.days

# Frequency: number of transactions per customer
frequency_df = df.groupby(by='CustomerID', as_index=False)['TransactionID'].count()
frequency_df.columns = ['CustomerID', 'Frequency']

# Monetary: total amount spent per customer
monetary_df = df.groupby(by='CustomerID', as_index=False)['TotalValue'].sum()
monetary_df.columns = ['CustomerID', 'Monetary']

features_df = recency_df.merge(frequency_df, on='CustomerID')
features_df = features_df.merge(monetary_df, on='CustomerID').drop(columns='LastPurchaseDate')

# Category: number of distinct product categories bought per customer.
# nunique() works on the raw category labels, so no label encoding is needed.
product_category_df = df.groupby('CustomerID')['Category'].nunique().reset_index()
features_df = pd.merge(features_df, product_category_df, on='CustomerID', how='inner')
features_df

Unnamed: 0,CustomerID,Recency,Frequency,Monetary,Category
0,C0001,55,5,3354.52,3
1,C0002,25,4,1862.74,2
2,C0003,125,4,2725.38,3
3,C0004,4,8,5354.88,3
4,C0005,54,3,2034.24,2
...,...,...,...,...,...
194,C0196,13,4,4982.88,3
195,C0197,0,3,1928.65,2
196,C0198,84,2,931.83,2
197,C0199,63,4,1979.28,2


In [None]:
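# Standardize Frequency, Monetary and Recency to zero mean and unit variance.
# Category is left as a raw count of distinct categories.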
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features_df[['Frequency', 'Monetary', 'Recency']])
features_df[['Frequency', 'Monetary', 'Recency']] = scaled_features
features_df

Unnamed: 0,CustomerID,Recency,Frequency,Monetary,Category
0,C0001,-0.266933,-0.011458,-0.061701,3
1,C0002,-0.690872,-0.467494,-0.877744,2
2,C0003,0.722260,-0.467494,-0.405857,3
3,C0004,-0.987630,1.356650,1.032547,3
4,C0005,-0.281064,-0.923530,-0.783929,2
...,...,...,...,...,...
194,C0196,-0.860448,-0.467494,0.829053,3
195,C0197,-1.044155,-0.923530,-0.841689,2
196,C0198,0.142875,-1.379566,-1.386975,2
197,C0199,-0.153882,-0.467494,-0.813993,2
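
Standardizing matters here because the lookalike search relies on Euclidean distance, which is dominated by whichever feature has the largest numeric range. The following is a minimal sketch on made-up data (the `toy` array and the `nearest_to_seed` helper are hypothetical and not part of this notebook) showing how the nearest neighbor of a seed customer can change once the columns are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy customers; columns are [Frequency, Monetary]
toy = np.array([
    [2.0, 1000.0],   # row 0: seed customer
    [10.0, 1010.0],  # row 1: similar spend, very different frequency
    [2.0, 1100.0],   # row 2: same frequency, slightly higher spend
    [6.0, 3000.0],   # row 3
    [4.0, 5000.0],   # row 4
])

def nearest_to_seed(X):
    """Return the row position of the point closest to row 0 (the seed)."""
    distances = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(distances)) + 1

# On raw values, Monetary dominates the distance
raw_nearest = nearest_to_seed(toy)

# After standardization, both columns contribute on a comparable scale
scaled_nearest = nearest_to_seed(StandardScaler().fit_transform(toy))

print(raw_nearest, scaled_nearest)
```

In this toy data the raw distances pick row 1, while the standardized distances pick row 2. The same reasoning may be worth applying to the `Category` column, which stays on its raw scale in the cell above.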


# KNN-based model for the top 3 similar customers

We query for the 4 nearest neighbors of each seed customer. Because the seed customer is part of the fitted data, its closest match is itself, and the remaining 3 neighbors are the lookalikes.
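
To illustrate why the neighbor count is 4 rather than 3, here is a small sketch on hypothetical 2-D points (`toy_X` and `toy_knn` are illustrative names, not objects from this notebook). When the query point is one of the fitted points, it typically comes back as its own first neighbor at distance 0, provided no other point is an exact duplicate of it.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical feature vectors for five customers; row 0 plays the seed
toy_X = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.3],
    [1.0, 1.0],
    [5.0, 5.0],
])

toy_knn = NearestNeighbors(n_neighbors=4, metric='euclidean').fit(toy_X)

# Query with the seed row itself (kept 2-D by indexing with a list)
distances, indices = toy_knn.kneighbors(toy_X[[0]])

seed_hit = int(indices[0][0])         # the seed itself
lookalikes = indices[0][1:].tolist()  # the 3 other closest rows

print(seed_hit, lookalikes, distances[0])
```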

In [None]:
from sklearn.neighbors import NearestNeighbors

# 4 neighbors: the seed customer itself plus its 3 closest lookalikes
KNN = NearestNeighbors(n_neighbors=4, metric='euclidean')
X = features_df.drop('CustomerID', axis=1)
KNN.fit(X)

In [None]:
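def find_top_3(customer_id):
  # Feature row of the seed customer, without the ID column
  seed_vector = features_df[features_df['CustomerID'] == customer_id]
  seed_vector = seed_vector.drop('CustomerID', axis=1)
  distances, indices = KNN.kneighbors(seed_vector)
  # kneighbors returns row positions, so index with iloc;
  # position 0 is assumed to be the seed customer itself and is skipped
  similar_customers = features_df.iloc[indices[0][1:]].copy()
  # Score is 1 - Euclidean distance, so it is negative when the distance exceeds 1
  similar_customers['Similarity'] = 1 - distances[0][1:]
  # tolist() yields plain Python strings and floats
  ids = similar_customers['CustomerID'].tolist()
  scores = similar_customers['Similarity'].tolist()
  return list(zip(ids, scores))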
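
One property of the `1 - distance` score is worth noting: Euclidean distance is unbounded, so the score drops below zero whenever a neighbor is more than 1 unit away. The sketch below uses made-up distances to compare it with `1 / (1 + distance)`, a common alternative that stays in the range (0, 1]. This alternative is only a suggestion and is not the scoring used in this notebook.

```python
import numpy as np

# Hypothetical Euclidean distances from a seed customer to four neighbors
toy_distances = np.array([0.0, 0.4, 1.0, 2.5])

# Score used in this notebook: can go negative for distances above 1
linear_score = 1 - toy_distances

# Alternative (an assumption, not the notebook's method): bounded in (0, 1]
bounded_score = 1 / (1 + toy_distances)

print(linear_score)
print(bounded_score)
```

Both scores decrease as distance grows, so they rank neighbors identically. Only the reported values differ.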

In [None]:
import warnings
warnings.filterwarnings('ignore')

# Build the lookalike list for the first 20 customers
result = features_df[['CustomerID']].head(20).copy()
result['lookalike'] = [find_top_3(customer_id) for customer_id in result['CustomerID']]

In [None]:
result.to_csv('Shaunak_Mujumdar_Lookalike.csv',index= False)
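
Because each `lookalike` cell holds a Python list of `(CustomerID, score)` tuples, the CSV stores it as text. The sketch below shows one way to read such a file back, assuming the tuples contain only plain Python strings and floats. The IDs and scores in `toy_result` are made up for illustration and are not results from this notebook; an in-memory buffer stands in for the file.

```python
import ast
import io

import pandas as pd

# Hypothetical frame in the same shape as the exported result
toy_result = pd.DataFrame({
    'CustomerID': ['A', 'D'],
    'lookalike': [
        [('B', 0.9), ('C', 0.8), ('E', 0.7)],
        [('E', 0.95), ('A', 0.6), ('B', 0.5)],
    ],
})

# Round-trip through CSV text using an in-memory buffer
buffer = io.StringIO()
toy_result.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# The column comes back as strings; literal_eval rebuilds the lists of tuples
loaded['lookalike'] = loaded['lookalike'].apply(ast.literal_eval)
first_pairs = loaded.loc[0, 'lookalike']

print(first_pairs)
```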