## Task 2: Lookalike Model
#### Build a Lookalike Model that takes a user's information as input and recommends 3 similar customers based on their profile and transaction history. The model should:
● Use both customer and product information.
● Assign a similarity score to each recommended customer.
#### Deliverables:
● Give the top 3 lookalikes with their similarity scores for the first 20 customers (CustomerID: C0001 - C0020) in Customers.csv. Create a “Lookalike.csv” which has just one map: Map<cust_id, List<cust_id, score>>
● A Jupyter Notebook/Python script explaining your model development.
#### Evaluation Criteria:
● Model accuracy and logic.
● Quality of recommendations and similarity scores.

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity
customers = pd.read_csv("Customers.csv")
products = pd.read_csv("Products.csv")
transactions = pd.read_csv("Transactions.csv")

In [2]:
combined = transactions.merge(customers, on="CustomerID", how="left") \
                        .merge(products, on="ProductID", how="left")
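
Because both Transactions.csv and Products.csv carry a `Price` column, the second merge produces `Price_x` and `Price_y`. The sketch below, on hypothetical toy frames (`transactions_demo` and `products_demo` are illustrative names, not part of the notebook), shows one way to check that the two columns agree and then keep a single one.

```python
import pandas as pd

# Toy stand-ins for Transactions.csv and Products.csv (hypothetical values).
transactions_demo = pd.DataFrame({
    "TransactionID": ["T1", "T2"],
    "ProductID": ["P1", "P2"],
    "Quantity": [1, 2],
    "Price": [10.0, 20.0],
})
products_demo = pd.DataFrame({
    "ProductID": ["P1", "P2"],
    "Category": ["Books", "Clothing"],
    "Price": [10.0, 20.0],
})

merged_demo = transactions_demo.merge(products_demo, on="ProductID", how="left")

# Both inputs have a "Price" column, so pandas suffixes them with _x and _y.
prices_match = bool((merged_demo["Price_x"] == merged_demo["Price_y"]).all())

# If the two columns are identical, keep one and restore the plain name.
if prices_match:
    merged_demo = merged_demo.drop(columns="Price_y").rename(columns={"Price_x": "Price"})

print(merged_demo.columns.tolist())
```

If the check fails on the real data, the two prices mean different things and both columns are worth keeping.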

In [4]:
combined.head()

Unnamed: 0,TransactionID,CustomerID,ProductID,TransactionDate,Quantity,TotalValue,Price_x,CustomerName,Region,SignupDate,ProductName,Category,Price_y
0,T00001,C0199,P067,2024-08-25 12:38:23,1,300.68,300.68,Andrea Jenkins,Europe,2022-12-03,ComfortLiving Bluetooth Speaker,Electronics,300.68
1,T00112,C0146,P067,2024-05-27 22:23:54,1,300.68,300.68,Brittany Harvey,Asia,2024-09-04,ComfortLiving Bluetooth Speaker,Electronics,300.68
2,T00166,C0127,P067,2024-04-25 07:38:55,1,300.68,300.68,Kathryn Stevens,Europe,2024-04-04,ComfortLiving Bluetooth Speaker,Electronics,300.68
3,T00272,C0087,P067,2024-03-26 22:55:37,2,601.36,300.68,Travis Campbell,South America,2024-04-11,ComfortLiving Bluetooth Speaker,Electronics,300.68
4,T00363,C0070,P067,2024-03-21 15:10:10,3,902.04,300.68,Timothy Perez,Europe,2022-03-15,ComfortLiving Bluetooth Speaker,Electronics,300.68


In [6]:
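# Build one row per customer from the transaction-level table.
customer_features = combined.groupby("CustomerID").agg({
    "TotalValue": "sum",   # total spend
    "Quantity": "sum",     # total units bought
    "Price_x": "mean",     # mean unit price (Price from Transactions.csv)
    "Price_y": "mean",     # mean unit price (Price from Products.csv)
    "Region": "first",     # region is constant per customer
    "Category": lambda x: ",".join(x)  # comma-joined string of purchased categories, in row order
}).reset_index()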

In [7]:
customer_features

Unnamed: 0,CustomerID,TotalValue,Quantity,Price_x,Price_y,Region,Category
0,C0001,3354.52,12,278.334000,278.334000,South America,"Books,Home Decor,Electronics,Electronics,Elect..."
1,C0002,1862.74,10,208.920000,208.920000,Asia,"Home Decor,Home Decor,Clothing,Clothing"
2,C0003,2725.38,14,195.707500,195.707500,South America,"Home Decor,Home Decor,Clothing,Electronics"
3,C0004,5354.88,23,240.636250,240.636250,South America,"Books,Home Decor,Home Decor,Home Decor,Books,B..."
4,C0005,2034.24,7,291.603333,291.603333,Asia,"Home Decor,Electronics,Electronics"
...,...,...,...,...,...,...,...
194,C0196,4982.88,12,416.992500,416.992500,Europe,"Books,Clothing,Home Decor,Home Decor"
195,C0197,1928.65,9,227.056667,227.056667,Europe,"Home Decor,Electronics,Electronics"
196,C0198,931.83,3,239.705000,239.705000,Europe,"Electronics,Clothing"
197,C0199,1979.28,9,250.610000,250.610000,Europe,"Electronics,Home Decor,Home Decor,Electronics"


In [8]:
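# Use a separate encoder per column so each fitted encoder stays available.
region_encoder = OneHotEncoder()
encoded_region = region_encoder.fit_transform(customer_features[["Region"]]).toarray()
# Note: this one-hot encodes the whole comma-joined string, so each distinct
# purchase sequence becomes its own column.
category_encoder = OneHotEncoder()
encoded_category = category_encoder.fit_transform(customer_features[["Category"]]).toarray()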
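
The cell above one-hot encodes the full comma-joined `Category` string, so two customers only match on that feature when their purchase sequences are identical. A possible alternative, sketched here on hypothetical toy rows (`demo`, `category_counts`, and `category_share` are illustrative names), is to count purchases per category and compare customers on those counts or shares. This is not what the notebook does, and swapping it in would change the scores shown below.

```python
import pandas as pd

# Toy rows shaped like the merged transaction table (hypothetical values).
demo = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C1", "C2", "C2"],
    "Category": ["Books", "Electronics", "Electronics", "Books", "Clothing"],
})

# One row per customer, one column per category, values = purchase counts.
category_counts = pd.crosstab(demo["CustomerID"], demo["Category"])

# Optional: convert counts to shares so heavy and light buyers are comparable.
category_share = category_counts.div(category_counts.sum(axis=1), axis=0)

print(category_counts)
print(category_share)
```

With this layout, customers who buy from overlapping categories get partial credit instead of an all-or-nothing match.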

In [9]:
scaler = MinMaxScaler()
normalized_values = scaler.fit_transform(customer_features[["TotalValue", "Quantity", "Price_x","Price_y"]])
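
As a quick illustration of what `MinMaxScaler` does to each numeric column, the sketch below compares it with the formula (x - min) / (max - min) on a hypothetical three-value column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical spend values for three customers.
values = np.array([[100.0], [300.0], [500.0]])

scaled = MinMaxScaler().fit_transform(values)

# The same rescaling written out by hand, column by column.
manual = (values - values.min(axis=0)) / (values.max(axis=0) - values.min(axis=0))

print(scaled.ravel())
```

Every scaled column ends up in [0, 1], the same range as the one-hot columns it is later stacked with.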

In [10]:
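import numpy as np
# Stack scaled numeric features and one-hot blocks into one row vector per customer.
feature_matrix = np.hstack([normalized_values, encoded_region, encoded_category])
# Pairwise cosine similarity: entry [i, j] compares customer i with customer j.
similarity_matrix = cosine_similarity(feature_matrix)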
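
To make the similarity score concrete, here is a small sketch that recomputes cosine similarity by hand on a hypothetical 3x3 feature matrix (`toy` is an illustrative name) and checks it against scikit-learn.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature rows: rows 0 and 1 are identical, row 2 is orthogonal to both.
toy = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])

# Cosine similarity = dot product divided by the product of the vector norms.
norms = np.linalg.norm(toy, axis=1)
manual = (toy @ toy.T) / np.outer(norms, norms)

sk = cosine_similarity(toy)
print(np.round(sk, 3))
```

Identical rows score 1 and rows with no shared non-zero feature score 0. Because the one-hot blocks contribute full 1s while the scaled numeric columns lie in [0, 1], the categorical features can carry a large part of the score.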

In [11]:
similarity_matrix

array([[1.        , 0.20922105, 0.60441146, ..., 0.19444895, 0.23636728,
        0.30613636],
       [0.20922105, 1.        , 0.17008727, ..., 0.14742454, 0.18043109,
        0.60259397],
       [0.60441146, 0.17008727, 1.        , ..., 0.14062468, 0.18586415,
        0.25193616],
       ...,
       [0.19444895, 0.14742454, 0.14062468, ..., 1.        , 0.58286675,
        0.20470018],
       [0.23636728, 0.18043109, 0.18586415, ..., 0.58286675, 1.        ,
        0.25844552],
       [0.30613636, 0.60259397, 0.25193616, ..., 0.20470018, 0.25844552,
        1.        ]])

In [12]:
lookalike_data = {}
customer_ids = customer_features["CustomerID"].values

In [14]:
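# groupby sorts by CustomerID, so the first 20 rows are C0001-C0020
# only if each of those customers has at least one transaction.
for idx, customer_id in enumerate(customer_ids[:20]):
    similarities = similarity_matrix[idx].copy()
    # Exclude the customer's own entry explicitly instead of assuming it sorts first.
    similarities[idx] = -np.inf
    similar_indices = np.argsort(similarities)[::-1][:3]
    similar_customers = [
        {"CustomerID": customer_ids[i], "Score": round(float(similarities[i]), 3)} for i in similar_indices
    ]
    lookalike_data[customer_id] = similar_customers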
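
The top-3 selection can also be wrapped in a small helper. The sketch below is one possible version: `top_k_lookalikes`, `toy_sim`, and `toy_ids` are hypothetical names, and the similarity values are made up for illustration.

```python
import numpy as np

def top_k_lookalikes(similarity, ids, row, k=3):
    """Return the k most similar ids to ids[row], excluding ids[row] itself."""
    scores = similarity[row].astype(float).copy()
    scores[row] = -np.inf  # never recommend the customer to themselves
    order = np.argsort(-scores, kind="stable")[:k]  # highest score first
    return [(ids[i], round(float(scores[i]), 3)) for i in order]

# Hypothetical symmetric similarity matrix for four customers.
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])
toy_ids = ["C1", "C2", "C3", "C4"]

result = top_k_lookalikes(toy_sim, toy_ids, row=0, k=2)
print(result)
```

Masking the self entry avoids relying on it sorting first, which could fail if another customer has an identical feature vector.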

In [15]:
lookalike_df = pd.DataFrame([
    {"CustomerID": key, "Lookalikes": value} for key, value in lookalike_data.items()
])
lookalike_df.to_csv("Lookalike.csv", index=False)
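
Writing a column of Python lists with `to_csv` stores their string representation, which is awkward to parse back. One option, sketched below with hypothetical IDs and scores (`lookalike_demo` is an illustrative name), is to JSON-encode each list so the map round-trips cleanly. The sketch writes to an in-memory buffer rather than to Lookalike.csv.

```python
import io
import json
import pandas as pd

# Hypothetical map in the shape Map<cust_id, List<cust_id, score>>.
lookalike_demo = {
    "C0001": [("C0002", 0.9), ("C0003", 0.8), ("C0004", 0.7)],
}

rows = [
    {
        "CustomerID": cust_id,
        # Store the list as a JSON string so it can be parsed back reliably.
        "Lookalikes": json.dumps([{"CustomerID": c, "Score": s} for c, s in pairs]),
    }
    for cust_id, pairs in lookalike_demo.items()
]

buffer = io.StringIO()
pd.DataFrame(rows).to_csv(buffer, index=False)

# Read it back and decode the JSON column.
buffer.seek(0)
restored = pd.read_csv(buffer)
parsed = json.loads(restored.loc[0, "Lookalikes"])
print(parsed)
```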

In [22]:
lookalike_df

Unnamed: 0,CustomerID,Lookalikes
0,C0001,"[{'CustomerID': 'C0096', 'Score': 0.666}, {'Cu..."
1,C0002,"[{'CustomerID': 'C0138', 'Score': 0.604}, {'Cu..."
2,C0003,"[{'CustomerID': 'C0104', 'Score': 0.622}, {'Cu..."
3,C0004,"[{'CustomerID': 'C0082', 'Score': 0.698}, {'Cu..."
4,C0005,"[{'CustomerID': 'C0040', 'Score': 0.653}, {'Cu..."
5,C0006,"[{'CustomerID': 'C0168', 'Score': 0.695}, {'Cu..."
6,C0007,"[{'CustomerID': 'C0040', 'Score': 0.685}, {'Cu..."
7,C0008,"[{'CustomerID': 'C0156', 'Score': 0.681}, {'Cu..."
8,C0009,"[{'CustomerID': 'C0044', 'Score': 0.649}, {'Cu..."
9,C0010,"[{'CustomerID': 'C0124', 'Score': 0.58}, {'Cus..."
