# Task 2: Lookalike Model 
Build a Lookalike Model that takes a user's information as input and recommends 3 similar customers based on their profile and transaction history. The model should: 

● Use both customer and product information. 
● Assign a similarity score to each recommended customer. 

Deliverables: 

● Give the top 3 lookalikes with their similarity scores for the first 20 customers 
(CustomerID: C0001 - C0020) in Customers.csv. Create a “Lookalike.csv” file that contains 
just one map: Map<cust_id, List<cust_id, score>> 
● A Jupyter Notebook/Python script explaining your model development. 
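
The task does not say how Map<cust_id, List<cust_id, score>> should be laid out in a CSV. One possible reading is a two-column file: the customer ID, plus a string holding its list of (lookalike, score) pairs. The sketch below assumes that layout and runs on a made-up 4x4 similarity matrix; `to_lookalike_map` and the column names `cust_id` and `lookalikes` are hypothetical and do not come from the task.

```python
import pandas as pd

def to_lookalike_map(similarity_df, customer_ids, top_n=3):
    """Hypothetical helper: one row per customer, holding a list of (lookalike_id, score) pairs."""
    rows = []
    for cust_id in customer_ids:
        # Drop the customer's own entry explicitly instead of relying on sort order
        scores = similarity_df.loc[cust_id].drop(cust_id)
        top = scores.sort_values(ascending=False).head(top_n)
        pairs = [(other_id, round(float(score), 4)) for other_id, score in top.items()]
        rows.append({"cust_id": cust_id, "lookalikes": str(pairs)})
    return pd.DataFrame(rows)

# Toy symmetric similarity matrix (made-up values, for illustration only)
ids = ["C0001", "C0002", "C0003", "C0004"]
toy_similarity = pd.DataFrame(
    [[1.0, 0.9, 0.2, 0.5],
     [0.9, 1.0, 0.3, 0.4],
     [0.2, 0.3, 1.0, 0.8],
     [0.5, 0.4, 0.8, 1.0]],
    index=ids, columns=ids,
)

lookalike_map = to_lookalike_map(toy_similarity, ids[:2])
print(lookalike_map)
# lookalike_map.to_csv("Lookalike.csv", index=False) would write the two-column file
```

Dropping the customer's own ID by label means a customer is never returned as its own lookalike, even when several scores are tied.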

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load the datasets
customers = pd.read_csv('Customers.csv')
products = pd.read_csv('Products.csv')
transactions = pd.read_csv('Transactions.csv')

print("Customers Data:")
print(customers.head())
print("\nProducts Data:")
print(products.head())
print("\nTransactions Data:")
print(transactions.head())

Customers Data:
  CustomerID        CustomerName         Region  SignupDate
0      C0001    Lawrence Carroll  South America  2022-07-10
1      C0002      Elizabeth Lutz           Asia  2022-02-13
2      C0003      Michael Rivera  South America  2024-03-07
3      C0004  Kathleen Rodriguez  South America  2022-10-09
4      C0005         Laura Weber           Asia  2022-08-15

Products Data:
  ProductID              ProductName     Category   Price
0      P001     ActiveWear Biography        Books  169.30
1      P002    ActiveWear Smartwatch  Electronics  346.30
2      P003  ComfortLiving Biography        Books   44.12
3      P004            BookWorld Rug   Home Decor   95.69
4      P005          TechPro T-Shirt     Clothing  429.31

Transactions Data:
  TransactionID CustomerID ProductID      TransactionDate  Quantity  \
0        T00001      C0199      P067  2024-08-25 12:38:23         1   
1        T00112      C0146      P067  2024-05-27 22:23:54         1   
2        T00166      C0127 

In [3]:
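# Merge transactions with product and customer details (one row per transaction).
# Transactions and Products both appear to carry a Price column, so pandas
# suffixes them as Price_x and Price_y in the merged frame.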
merged = transactions.merge(products, on='ProductID').merge(customers, on='CustomerID')
print("\nMerged Data:")
merged.head()


Merged Data:


Unnamed: 0,TransactionID,CustomerID,ProductID,TransactionDate,Quantity,TotalValue,Price_x,ProductName,Category,Price_y,CustomerName,Region,SignupDate
0,T00001,C0199,P067,2024-08-25 12:38:23,1,300.68,300.68,ComfortLiving Bluetooth Speaker,Electronics,300.68,Andrea Jenkins,Europe,2022-12-03
1,T00112,C0146,P067,2024-05-27 22:23:54,1,300.68,300.68,ComfortLiving Bluetooth Speaker,Electronics,300.68,Brittany Harvey,Asia,2024-09-04
2,T00166,C0127,P067,2024-04-25 07:38:55,1,300.68,300.68,ComfortLiving Bluetooth Speaker,Electronics,300.68,Kathryn Stevens,Europe,2024-04-04
3,T00272,C0087,P067,2024-03-26 22:55:37,2,601.36,300.68,ComfortLiving Bluetooth Speaker,Electronics,300.68,Travis Campbell,South America,2024-04-11
4,T00363,C0070,P067,2024-03-21 15:10:10,3,902.04,300.68,ComfortLiving Bluetooth Speaker,Electronics,300.68,Timothy Perez,Europe,2022-03-15
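
The `Price_x` and `Price_y` columns above are what pandas produces when both sides of a merge share a column name that is not a join key. The sketch below uses small made-up frames, not the real CSV files, and shows two ways to avoid the duplicate: drop the redundant column before merging, or name the suffixes explicitly. It assumes the transaction table carries its own `Price` column, as the merged output suggests.

```python
import pandas as pd

# Toy stand-ins for Transactions.csv and Products.csv (made-up values)
toy_transactions = pd.DataFrame({
    "TransactionID": ["T1", "T2"],
    "CustomerID": ["C1", "C2"],
    "ProductID": ["P1", "P1"],
    "Quantity": [1, 2],
    "TotalValue": [10.0, 20.0],
    "Price": [10.0, 10.0],
})
toy_products = pd.DataFrame({
    "ProductID": ["P1"],
    "ProductName": ["Toy Speaker"],
    "Category": ["Electronics"],
    "Price": [10.0],
})

# Default behaviour: the shared non-key column is suffixed with _x and _y
merged_default = toy_transactions.merge(toy_products, on="ProductID")

# Option 1: drop the redundant column first; validate checks that each
# transaction matches at most one product row
merged_clean = toy_transactions.merge(
    toy_products.drop(columns="Price"), on="ProductID", validate="many_to_one"
)

# Option 2: keep both columns but give them descriptive suffixes
merged_named = toy_transactions.merge(
    toy_products, on="ProductID", suffixes=("_txn", "_catalog")
)

print(merged_default.columns.tolist())
print(merged_clean.columns.tolist())
print(merged_named.columns.tolist())
```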


In [4]:
# Build one feature row per customer
customer_features = merged.groupby('CustomerID').agg({
    'Quantity': 'sum',        # Total number of items bought
    'TotalValue': 'sum',      # Total amount spent
    'Category': lambda x: x.mode()[0]  # Most frequently purchased category (first one if tied)
}).reset_index()

print("\nCustomer Features:")
print(customer_features.head())



Customer Features:
  CustomerID  Quantity  TotalValue     Category
0      C0001        12     3354.52  Electronics
1      C0002        10     1862.74     Clothing
2      C0003        14     2725.38   Home Decor
3      C0004        23     5354.88        Books
4      C0005         7     2034.24  Electronics
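
The features above sit on very different scales: Quantity is in the tens, TotalValue is in the thousands, and a one-hot category flag is 0 or 1. If such vectors are compared with cosine similarity as-is, the TotalValue component dominates and almost every pair looks nearly identical. The sketch below reuses the five rows printed above as toy input and compares raw features against features whose numeric columns are standardized first. Using `StandardScaler` here is an assumption made for illustration, not a step taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# The five customer rows printed above, reused as toy input
toy = pd.DataFrame({
    "CustomerID": ["C0001", "C0002", "C0003", "C0004", "C0005"],
    "Quantity": [12, 10, 14, 23, 7],
    "TotalValue": [3354.52, 1862.74, 2725.38, 5354.88, 2034.24],
    "Category": ["Electronics", "Clothing", "Home Decor", "Books", "Electronics"],
})

numeric = toy[["Quantity", "TotalValue"]].to_numpy(dtype=float)
category = pd.get_dummies(toy["Category"]).to_numpy(dtype=float)

# Raw features: TotalValue is orders of magnitude larger than everything else
raw_features = np.hstack([numeric, category])
# Standardized numeric columns (zero mean, unit variance) next to the 0/1 flags
scaled_features = np.hstack([StandardScaler().fit_transform(numeric), category])

raw_sim = cosine_similarity(raw_features)
scaled_sim = cosine_similarity(scaled_features)

# Compare the lowest off-diagonal similarity under each representation
off_diagonal = ~np.eye(len(toy), dtype=bool)
print("raw min:   ", raw_sim[off_diagonal].min())
print("scaled min:", scaled_sim[off_diagonal].min())
```

On these rows the raw scores all stay above 0.999, while the standardized scores spread out enough to rank customers. Whether standardization is the right normalization for the full dataset is a separate modeling choice.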


In [5]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the 'Category' column
encoder = OneHotEncoder()
category_encoded = encoder.fit_transform(customer_features[['Category']]).toarray()

# Combine the encoded categories with the numeric features
features = np.hstack([
    customer_features[['Quantity', 'TotalValue']].values,
    category_encoded
])

# Calculating cosine similarity
similarity = cosine_similarity(features)

# Converting to DataFrame for easy handling
similarity_df = pd.DataFrame(similarity, index=customer_features['CustomerID'], columns=customer_features['CustomerID'])

# Finding top 3 similar customers for the first 20 customers
results = []
for customer in customer_features['CustomerID'][:20]:
    # Sort scores in descending order; position 0 is expected to be the customer itself, so take positions 1-3
    similar_customers = similarity_df.loc[customer].sort_values(ascending=False).iloc[1:4]
    results.append({
        'CustomerID': customer,
        'Lookalike1': similar_customers.index[0],
        'Score1': similar_customers.iloc[0],
        'Lookalike2': similar_customers.index[1],
        'Score2': similar_customers.iloc[1],
        'Lookalike3': similar_customers.index[2],
        'Score3': similar_customers.iloc[2]
    })

lookalike_data = pd.DataFrame(results)
lookalike_data.to_csv('Pranita_Shelar_Lookalike.csv', index=False)

print("\nLookalike Results:")
print(lookalike_data)


Lookalike Results:
   CustomerID Lookalike1    Score1 Lookalike2    Score2 Lookalike3    Score3
0       C0001      C0069  1.000000      C0120  1.000000      C0028  1.000000
1       C0002      C0029  1.000000      C0176  1.000000      C0034  1.000000
2       C0003      C0031  1.000000      C0025  1.000000      C0136  1.000000
3       C0004      C0175  1.000000      C0169  1.000000      C0017  1.000000
4       C0005      C0069  1.000000      C0127  1.000000      C0026  1.000000
5       C0006      C0126  1.000000      C0171  1.000000      C0118  1.000000
6       C0007      C0146  1.000000      C0048  1.000000      C0184  1.000000
7       C0008      C0113  1.000000      C0086  1.000000      C0160  1.000000
8       C0009      C0198  1.000000      C0150  1.000000      C0092  1.000000
9       C0010      C0049  1.000000      C0144  1.000000      C0091  1.000000
10      C0011      C0139  1.000000      C0057  1.000000      C0173  1.000000
11      C0012      C0065  1.000000      C0179  1.000000 