# Task 2: Lookalike Model
Build a Lookalike Model that takes a user's information as input and recommends 3 similar
customers based on their profile and transaction history. The model should:
● Use both customer and product information.
● Assign a similarity score to each recommended customer.
Deliverables:
● Give the top 3 lookalikes with their similarity scores for the first 20 customers
(CustomerID: C0001 - C0020) in Customers.csv. Form a “Lookalike.csv” which has
just one map: Map<cust_id, List<cust_id, score>>
● A Jupyter Notebook/Python script explaining your model development.
Evaluation Criteria:
● Model accuracy and logic.
● Quality of recommendations and similarity scores.

In [1]:
import pandas as pd 

In [2]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics.pairwise import cosine_similarity

# LOADING DATASET:

In [13]:
customers = pd.read_csv("C:/Users/prian/OneDrive/Desktop/Customers.csv")
products = pd.read_csv("C:/Users/prian/OneDrive/Desktop/Products.csv")
transactions = pd.read_csv("C:/Users/prian/OneDrive/Desktop/Transactions.csv")


In [15]:
customers.head(5)
    
    

Unnamed: 0,CustomerID,CustomerName,Region,SignupDate
0,C0001,Lawrence Carroll,South America,2022-07-10
1,C0002,Elizabeth Lutz,Asia,2022-02-13
2,C0003,Michael Rivera,South America,2024-03-07
3,C0004,Kathleen Rodriguez,South America,2022-10-09
4,C0005,Laura Weber,Asia,2022-08-15


In [16]:
products.head(5)

Unnamed: 0,ProductID,ProductName,Category,Price
0,P001,ActiveWear Biography,Books,169.3
1,P002,ActiveWear Smartwatch,Electronics,346.3
2,P003,ComfortLiving Biography,Books,44.12
3,P004,BookWorld Rug,Home Decor,95.69
4,P005,TechPro T-Shirt,Clothing,429.31


In [18]:
transactions.head(5)

Unnamed: 0,TransactionID,CustomerID,ProductID,TransactionDate,Quantity,TotalValue,Price
0,T00001,C0199,P067,2024-08-25 12:38:23,1,300.68,300.68
1,T00112,C0146,P067,2024-05-27 22:23:54,1,300.68,300.68
2,T00166,C0127,P067,2024-04-25 07:38:55,1,300.68,300.68
3,T00272,C0087,P067,2024-03-26 22:55:37,2,601.36,300.68
4,T00363,C0070,P067,2024-03-21 15:10:10,3,902.04,300.68


# Merge transactions with product details to get product prices

In [28]:
# Transactions.csv already has a Price column. Drop it before merging so the
# merge does not create Price_x / Price_y and the cell can be re-run safely.
transactions = transactions.drop(columns=['Price']).merge(
    products[['ProductID', 'Price']], on='ProductID', how='left')
transaction_summary = transactions.groupby('CustomerID').agg({
    'ProductID': 'nunique',    # number of distinct products bought
    'Price': 'sum',            # sum of unit prices (Quantity is not applied)
    'TransactionID': 'count'   # number of transactions
}).reset_index()
transaction_summary.rename(columns={
    'ProductID': 'ProductDiversity',
    'Price': 'TotalSpend',
    'TransactionID': 'Frequency'
}, inplace=True)



In [29]:
print(transactions.columns)

Index(['TransactionID', 'CustomerID', 'ProductID', 'TransactionDate',
       'Quantity', 'TotalValue', 'Price'],
      dtype='object')


In [30]:
print(transaction_summary.columns)

Index(['CustomerID', 'ProductDiversity', 'TotalSpend', 'Frequency'], dtype='object')
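
The summary above uses only transaction-level counts and prices. Since the task asks for product information as well, one option is to add per-category spend as extra features. What follows is a minimal sketch on made-up data, assuming the `Category` and `TotalValue` columns shown in the previews. The names `toy_transactions`, `toy_products`, and `category_spend` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for Transactions.csv and Products.csv (values are made up).
toy_transactions = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2", "C2", "C3"],
    "ProductID": ["P1", "P2", "P1", "P3", "P3"],
    "TotalValue": [100.0, 50.0, 200.0, 30.0, 60.0],
})
toy_products = pd.DataFrame({
    "ProductID": ["P1", "P2", "P3"],
    "Category": ["Books", "Electronics", "Books"],
})

# Attach each transaction's product category.
merged = toy_transactions.merge(toy_products, on="ProductID", how="left")

# One row per customer, one column per category, summing TotalValue.
category_spend = merged.pivot_table(
    index="CustomerID",
    columns="Category",
    values="TotalValue",
    aggfunc="sum",
    fill_value=0,
)
print(category_spend)
```

The resulting columns could be joined to the customer table on `CustomerID` and scaled together with the other numeric features.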


# Merge customers with the aggregated transaction summary

In [31]:
data = customers.merge(transaction_summary, on='CustomerID', how='left')

# DATA CLEANING:

In [32]:
data.fillna(0, inplace=True)

In [34]:
data.head(5)

Unnamed: 0,CustomerID,CustomerName,Region,SignupDate,ProductDiversity,TotalSpend,Frequency
0,C0001,Lawrence Carroll,South America,2022-07-10,5.0,1391.67,5.0
1,C0002,Elizabeth Lutz,Asia,2022-02-13,4.0,835.68,4.0
2,C0003,Michael Rivera,South America,2024-03-07,4.0,782.83,4.0
3,C0004,Kathleen Rodriguez,South America,2022-10-09,8.0,1925.09,8.0
4,C0005,Laura Weber,Asia,2022-08-15,3.0,874.81,3.0


In [37]:
data.isnull().sum()

CustomerID          0
CustomerName        0
Region              0
SignupDate          0
ProductDiversity    0
TotalSpend          0
Frequency           0
dtype: int64

# Normalize numeric features

In [38]:
scaler = StandardScaler()
numeric_features = ['TotalSpend', 'Frequency']
data[numeric_features] = scaler.fit_transform(data[numeric_features])
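
For reference, `StandardScaler` replaces each value with its z-score: subtract the column mean, then divide by the population standard deviation (`ddof=0`). The sketch below reproduces that arithmetic by hand on a made-up `TotalSpend` column; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up spend values, for illustration only.
spend = pd.Series([10.0, 20.0, 30.0, 40.0], name="TotalSpend")

# z-score with the population standard deviation (ddof=0).
# Note that pandas' Series.std() defaults to ddof=1, so ddof must be set explicitly.
z = (spend - spend.mean()) / spend.std(ddof=0)
print(z.round(4).tolist())
```

After this step the column has mean 0 and unit variance, so spend and frequency contribute on a comparable scale.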

# Encode categorical features

In [40]:
# OneHotEncoder was already imported in cell In [2].
encoder = OneHotEncoder(sparse_output=False)  # return a dense array
categorical_features = ['Region']
encoded = encoder.fit_transform(data[categorical_features])
# Reuse data's index so the later concat aligns rows correctly.
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(categorical_features), index=data.index)

In [51]:
encoded_df.head(5)

Unnamed: 0,Region_Asia,Region_Europe,Region_North America,Region_South America
0,0.0,0.0,0.0,1.0
1,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0
3,0.0,0.0,0.0,1.0
4,1.0,0.0,0.0,0.0


# Combine all features into final data

In [41]:
final_data = pd.concat([data[['CustomerID']], data[numeric_features], encoded_df], axis=1)


In [49]:
final_data.head(5)

Unnamed: 0,CustomerID,TotalSpend,Frequency,Region_Asia,Region_Europe,Region_North America,Region_South America
0,C0001,0.043323,0.0,0.0,0.0,0.0,1.0
1,C0002,-0.79015,-0.451294,1.0,0.0,0.0,0.0
2,C0003,-0.869377,-0.451294,0.0,0.0,0.0,1.0
3,C0004,0.842962,1.353881,0.0,0.0,0.0,1.0
4,C0005,-0.731491,-0.902587,1.0,0.0,0.0,0.0


# Compute similarity matrix

In [42]:
customer_ids = final_data['CustomerID']
features = final_data.drop(columns=['CustomerID'])
similarity_matrix = cosine_similarity(features)
similarity_df = pd.DataFrame(similarity_matrix, index=customer_ids, columns=customer_ids)
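
Cosine similarity measures the angle between two feature vectors, so it compares the direction of a customer's profile and ignores its magnitude. The sketch below works through the formula with NumPy on hypothetical feature rows. The helper `cosine` and the vectors `a`, `b`, `c` are illustrative names, not part of the notebook.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors: (u . v) / (|u| |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical rows: [TotalSpend, Frequency, Region_Asia, Region_Europe]
a = [0.5, 1.0, 1.0, 0.0]
b = [1.0, 2.0, 2.0, 0.0]    # same direction as a, twice the magnitude
c = [-0.5, -1.0, 0.0, 1.0]  # opposite spend pattern, different region

print(cosine(a, b))
print(cosine(a, c))
```

Scores range from -1 to 1. A negative score, as in several cells of the matrix, means the two customers sit on opposite sides of the average in the standardized features.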


In [46]:
similarity_df

CustomerID,C0001,C0002,C0003,C0004,C0005,C0006,C0007,C0008,C0009,C0010,...,C0191,C0192,C0193,C0194,C0195,C0196,C0197,C0198,C0199,C0200
C0001,1.000000,-0.025295,0.686830,0.550110,-0.020654,0.907562,-0.014225,0.021611,-0.019934,-0.030971,...,0.999344,0.860219,-0.016817,0.016113,0.911644,0.016659,-0.026157,-0.026763,-0.019118,0.007648
C0002,-0.025295,1.000000,0.470571,-0.501768,0.957926,0.164771,0.921847,-0.556207,0.467677,0.514098,...,-0.046416,0.298693,0.974771,-0.425139,-0.167562,-0.098212,0.531364,0.572372,0.381325,0.624718
C0003,0.686830,0.470571,1.000000,-0.130490,0.486203,0.812238,0.408132,-0.565483,0.477781,0.537051,...,0.662796,0.928633,0.363509,-0.431699,0.485468,-0.116644,0.547432,0.587831,0.393310,-0.109745
C0004,0.550110,-0.501768,-0.130490,1.000000,-0.637176,0.166979,-0.602298,0.793300,-0.633984,-0.527094,...,0.565114,0.052432,-0.446594,0.613964,0.802317,-0.100699,-0.654561,-0.731558,-0.463221,0.079129
C0005,-0.020654,0.957926,0.486203,-0.637176,1.000000,0.264490,0.986813,-0.704684,0.569969,0.510659,...,-0.037901,0.368555,0.956486,-0.543816,-0.266738,0.039877,0.602713,0.667298,0.428082,0.557775
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
C0196,0.016659,-0.098212,-0.116644,-0.100699,0.039877,0.137803,0.113749,-0.108550,0.602845,0.369316,...,0.030570,0.039580,-0.005650,-0.092804,-0.135906,1.000000,0.467635,0.394833,0.658029,0.068012
C0197,-0.026157,0.531364,0.547432,-0.654561,0.602713,0.247908,0.536430,-0.724609,0.985537,0.962410,...,-0.047998,0.383238,0.437146,-0.556953,-0.250775,0.467635,1.000000,0.987315,0.947596,-0.106787
C0198,-0.026763,0.572372,0.587831,-0.731558,0.667298,0.288976,0.603262,-0.809497,0.968426,0.921407,...,-0.049110,0.426004,0.479859,-0.623319,-0.291901,0.394833,0.987315,1.000000,0.888252,-0.109261
C0199,-0.019118,0.381325,0.393310,-0.463221,0.428082,0.172511,0.378704,-0.512878,0.961742,0.942780,...,-0.035083,0.271780,0.311502,-0.393936,-0.174609,0.658029,0.947596,0.888252,1.000000,-0.078053


# Generate lookalike recommendations

In [43]:
def get_top_lookalikes(similarity_df, top_n=3):
    lookalike_map = {}
    for cust_id in similarity_df.index[:20]:  # first 20 customers: C0001 - C0020
        # Drop the customer's own entry by label; relying on it being ranked
        # first breaks when another customer also has a score of 1.0.
        scores = similarity_df.loc[cust_id].drop(cust_id).sort_values(ascending=False)
        top_lookalikes = scores.iloc[:top_n]
        lookalike_map[cust_id] = list(top_lookalikes.items())
    return lookalike_map

lookalike_map = get_top_lookalikes(similarity_df)
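
One edge case worth checking is a pair of customers with identical features, which gives them a similarity of 1.0 to each other as well as to themselves. The sketch below uses a toy similarity matrix to show a selection that excludes the customer by label, not by rank position. `top_n_excluding_self` and the matrix `sim` are hypothetical.

```python
import pandas as pd

ids = ["A", "B", "C", "D"]
# Toy symmetric similarity matrix; A and B are identical (score 1.0).
sim = pd.DataFrame(
    [[1.0, 1.0, 0.2, 0.5],
     [1.0, 1.0, 0.2, 0.5],
     [0.2, 0.2, 1.0, 0.9],
     [0.5, 0.5, 0.9, 1.0]],
    index=ids, columns=ids,
)

def top_n_excluding_self(sim, cust_id, n=3):
    """Return the n most similar customers to cust_id, never cust_id itself."""
    scores = sim.loc[cust_id].drop(cust_id)
    return list(scores.nlargest(n).items())

result = top_n_excluding_self(sim, "A", n=2)
print(result)
```

Here customer B is kept as A's closest match even though its score ties with A's self-similarity.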

In [45]:
lookalike_map

{'C0001': [('C0191', 0.9993444249841109),
  ('C0137', 0.9970626119679009),
  ('C0076', 0.9967228863901707)],
 'C0002': [('C0142', 0.998477375718647),
  ('C0178', 0.9964363197713488),
  ('C0027', 0.9955522300590519)],
 'C0003': [('C0025', 0.9991414490658412),
  ('C0031', 0.9895535785705001),
  ('C0052', 0.9810290457748304)],
 'C0004': [('C0147', 0.9931010365094683),
  ('C0113', 0.9928820302373502),
  ('C0165', 0.98495053708105)],
 'C0005': [('C0186', 0.997395829498276),
  ('C0159', 0.9937689276935588),
  ('C0007', 0.986813056707089)],
 'C0006': [('C0133', 0.9899976734185222),
  ('C0192', 0.9706843925413374),
  ('C0158', 0.9668086784126945)],
 'C0007': [('C0186', 0.9959171643541568),
  ('C0115', 0.9931171356525886),
  ('C0040', 0.9891991247463341)],
 'C0008': [('C0109', 0.9912548575598642),
  ('C0065', 0.9838093868961068),
  ('C0068', 0.9819513085487722)],
 'C0009': [('C0061', 0.9998452879289398),
  ('C0167', 0.9995946944804625),
  ('C0132', 0.9951513061804771)],
 'C0010': [('C0121', 0.9

# Save to CSV

In [44]:
output = pd.DataFrame({'CustomerID': lookalike_map.keys(),
                       'Lookalikes': [str(v) for v in lookalike_map.values()]})
output.to_csv('Lookalike.csv', index=False)
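
Storing `str(v)` writes each list as a Python literal, which a reader has to parse with something like `ast.literal_eval`. One language-neutral alternative is to store each list as a JSON string. The sketch below assumes that format is acceptable for the deliverable and uses a made-up `toy_map` with an in-memory buffer in place of the real file.

```python
import csv
import io
import json

# Toy map in the same shape as lookalike_map (scores are made up).
toy_map = {
    "C0001": [("C0191", 0.9993), ("C0137", 0.9971), ("C0076", 0.9967)],
    "C0002": [("C0142", 0.9985), ("C0178", 0.9964), ("C0027", 0.9956)],
}

# In-memory stand-in for open("Lookalike.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["CustomerID", "Lookalikes"])
for cust_id, pairs in toy_map.items():
    # Cast to built-in types in case the scores are NumPy scalars.
    payload = [[str(other), round(float(score), 4)] for other, score in pairs]
    writer.writerow([cust_id, json.dumps(payload)])

# Reading back: each Lookalikes cell parses into a list of [id, score] pairs.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
parsed = {row["CustomerID"]: json.loads(row["Lookalikes"]) for row in rows}
print(parsed["C0001"][0])
```

Rounding to four decimals is a presentation choice. Drop the `round` call to keep full precision.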

In [55]:
output.head(5)

Unnamed: 0,CustomerID,Lookalikes
0,C0001,"[('C0191', 0.9993444249841109), ('C0137', 0.99..."
1,C0002,"[('C0142', 0.998477375718647), ('C0178', 0.996..."
2,C0003,"[('C0025', 0.9991414490658412), ('C0031', 0.98..."
3,C0004,"[('C0147', 0.9931010365094683), ('C0113', 0.99..."
4,C0005,"[('C0186', 0.997395829498276), ('C0159', 0.993..."


In [58]:
print(output)

   CustomerID                                         Lookalikes
0       C0001  [('C0191', 0.9993444249841109), ('C0137', 0.99...
1       C0002  [('C0142', 0.998477375718647), ('C0178', 0.996...
2       C0003  [('C0025', 0.9991414490658412), ('C0031', 0.98...
3       C0004  [('C0147', 0.9931010365094683), ('C0113', 0.99...
4       C0005  [('C0186', 0.997395829498276), ('C0159', 0.993...
5       C0006  [('C0133', 0.9899976734185222), ('C0192', 0.97...
6       C0007  [('C0186', 0.9959171643541568), ('C0115', 0.99...
7       C0008  [('C0109', 0.9912548575598642), ('C0065', 0.98...
8       C0009  [('C0061', 0.9998452879289398), ('C0167', 0.99...
9       C0010  [('C0121', 0.9952834803724361), ('C0062', 0.96...
10      C0011  [('C0137', 0.9995684344437581), ('C0191', 0.99...
11      C0012  [('C0108', 0.9975787744854574), ('C0087', 0.99...
12      C0013  [('C0184', 0.9979757295609692), ('C0155', 0.99...
13      C0014  [('C0060', 0.9989130291797239), ('C0063', 0

# Conclusion:
The Lookalike Model identifies and recommends the top 3 most similar customers for a given user based on their profile and transaction history. Candidates are ranked by cosine similarity over standardized spend, transaction frequency, and one-hot-encoded region.