Task 2: Lookalike Model 

Steps to Build the Lookalike Model

1. Preprocessing and Feature Engineering

Combine customer profile data (Customers.csv) with their transaction and product data.

Create meaningful features (e.g., total spending, product category preferences, purchase frequency).

2. Calculate Similarity

Use a similarity measure (e.g., cosine similarity) to compute the closeness between customer profiles.

3. Find Top 3 Lookalikes

For each customer, rank all other customers by similarity and select the top 3 with their scores.

4. Save Output to Lookalike.csv

Create a mapping of customer IDs to their top 3 lookalikes and scores.
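
The similarity and ranking steps above can be illustrated on a toy example before working with the real data. This is a minimal sketch, not the notebook's code: the customer labels A to D and their feature vectors are made up, and cosine similarity is computed directly with NumPy as the dot product of L2-normalized rows.

```python
import numpy as np

# Hypothetical, already-scaled feature vectors for four toy customers
ids = ["A", "B", "C", "D"]
X = np.array([
    [1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0],   # same direction as A
    [0.0, 1.0, 0.0],   # orthogonal to A
    [1.0, 1.0, 1.0],
])

# Cosine similarity: dot products of L2-normalized rows
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / norms
sim = X_unit @ X_unit.T

def top_k(sim, idx, k=3):
    """Return the k most similar (index, score) pairs for row idx, excluding itself."""
    scores = sim[idx].copy()
    scores[idx] = -np.inf            # exclude the customer itself
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

top_for_a = [(ids[i], round(s, 4)) for i, s in top_k(sim, 0)]
print(top_for_a)
```

Because cosine similarity depends only on direction, B (a scaled copy of A) ranks first, ahead of D and the orthogonal C.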

In [15]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

In [16]:
# Load datasets
customers = pd.read_csv("Customers.csv")
products = pd.read_csv("Products.csv")
transactions = pd.read_csv("Transactions.csv")

In [3]:
# Merge datasets for feature engineering
merged_data = transactions.merge(customers, on="CustomerID", how="left")
merged_data = merged_data.merge(products, on="ProductID", how="left")
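
A left merge keeps every transaction even when its key has no match, filling the missing side with NaN, so mismatches can go unnoticed. The sketch below shows one way to surface them on toy frames (the rows are made up, and C9999 is a hypothetical ID with no profile), using the `validate` and `indicator` parameters of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for Transactions.csv and Customers.csv (values are made up)
tx = pd.DataFrame({
    "TransactionID": ["T1", "T2", "T3"],
    "CustomerID": ["C0001", "C0002", "C9999"],  # C9999 has no profile
    "TotalValue": [100.0, 50.0, 75.0],
})
cust = pd.DataFrame({
    "CustomerID": ["C0001", "C0002"],
    "Region": ["Asia", "Europe"],
})

# validate= raises if CustomerID is duplicated on the customer side;
# indicator= adds a _merge column showing where each row came from
checked = tx.merge(cust, on="CustomerID", how="left",
                   validate="many_to_one", indicator=True)

unmatched = checked.loc[checked["_merge"] == "left_only", "CustomerID"].tolist()
print("Transactions without a customer profile:", unmatched)
```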

In [4]:
# Feature Engineering: Create aggregated customer-level features
customer_features = merged_data.groupby('CustomerID').agg(
    total_spending=('TotalValue', 'sum'),
    total_transactions=('TransactionID', 'count'),
    avg_transaction_value=('TotalValue', 'mean'),
    product_categories=('Category', 'nunique'),  # Number of unique categories purchased
).reset_index()
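
The aggregation above counts how many categories a customer bought from, but not which ones. If category preferences are wanted as features (as the plan in step 1 suggests), one possible approach is a per-category share of spending. This is a sketch on made-up rows, not part of the notebook's pipeline; the `share_` column prefix is an illustrative choice.

```python
import pandas as pd

# Toy merged transaction rows (values are made up)
toy = pd.DataFrame({
    "CustomerID": ["C0001", "C0001", "C0001", "C0002"],
    "Category": ["Books", "Books", "Electronics", "Clothing"],
    "TotalValue": [30.0, 20.0, 50.0, 40.0],
})

# Total spend per customer and category; missing combinations become 0
spend = toy.pivot_table(index="CustomerID", columns="Category",
                        values="TotalValue", aggfunc="sum", fill_value=0)

# Convert to shares so each customer's row sums to 1
category_share = spend.div(spend.sum(axis=1), axis=0).add_prefix("share_")
print(category_share)
```

Adding such columns to customer_features would change the similarity scores, so the results shown later in this notebook would no longer apply.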


Verify Customers.csv Structure

Before merging, inspect the customers DataFrame to confirm it contains the Region column and ensure there are no data loading issues:

In [5]:
print(customers.head())  # Display the first few rows of Customers.csv
print(customers.columns)  # Verify column names in Customers.csv

  CustomerID        CustomerName         Region  SignupDate
0      C0001    Lawrence Carroll  South America  2022-07-10
1      C0002      Elizabeth Lutz           Asia  2022-02-13
2      C0003      Michael Rivera  South America  2024-03-07
3      C0004  Kathleen Rodriguez  South America  2022-10-09
4      C0005         Laura Weber           Asia  2022-08-15
Index(['CustomerID', 'CustomerName', 'Region', 'SignupDate'], dtype='object')


In [6]:
# Strip leading/trailing whitespace from column names
customers.columns = customers.columns.str.strip()
customer_features.columns = customer_features.columns.str.strip()

In [7]:
# Check for inconsistencies in CustomerID format
print(customer_features['CustomerID'].head())  # Sample from customer_features
print(customers['CustomerID'].head())         # Sample from customers

0    C0001
1    C0002
2    C0003
3    C0004
4    C0005
Name: CustomerID, dtype: object
0    C0001
1    C0002
2    C0003
3    C0004
4    C0005
Name: CustomerID, dtype: object


In [8]:
# Ensure CustomerID is the same type (string)
customer_features['CustomerID'] = customer_features['CustomerID'].astype(str)
customers['CustomerID'] = customers['CustomerID'].astype(str)

In [9]:
# Merge customer_features with Region data
customer_features = customer_features.merge(customers[['CustomerID', 'Region']], on='CustomerID', how='left')

Verify the Merged Customer Features

After merging, inspect the customer_features DataFrame to confirm that the Region column was added correctly:

In [10]:
print(customer_features.head())  # Display first few rows
print(customer_features.columns)  # Ensure 'Region' exists


  CustomerID  total_spending  total_transactions  avg_transaction_value  \
0      C0001         3354.52                   5                670.904   
1      C0002         1862.74                   4                465.685   
2      C0003         2725.38                   4                681.345   
3      C0004         5354.88                   8                669.360   
4      C0005         2034.24                   3                678.080   

   product_categories         Region  
0                   3  South America  
1                   2           Asia  
2                   3  South America  
3                   3  South America  
4                   2           Asia  
Index(['CustomerID', 'total_spending', 'total_transactions',
       'avg_transaction_value', 'product_categories', 'Region'],
      dtype='object')


In [11]:
if 'Region' in customer_features.columns:
    customer_features = pd.get_dummies(customer_features, columns=['Region'])
    print("One-hot encoding completed successfully.")
else:
    print("Error: 'Region' column still not found after merge.")


One-hot encoding completed successfully.


In [12]:
# Standardize numerical features
scaler = StandardScaler()
numerical_features = ['total_spending', 'total_transactions', 'avg_transaction_value', 'product_categories']
customer_features[numerical_features] = scaler.fit_transform(customer_features[numerical_features])
print(numerical_features)

['total_spending', 'total_transactions', 'avg_transaction_value', 'product_categories']
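
Standardization rescales each column to zero mean and unit variance, so a large-valued feature such as total spending does not dominate the similarity computation. The sketch below checks this on a made-up array and compares `StandardScaler` against the manual formula z = (x - mean) / std, where the standard deviation is the population one (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy columns: spending (large values) and transaction count (small values)
X = np.array([[1000.0, 2.0],
              [3000.0, 4.0],
              [5000.0, 9.0]])

scaled = StandardScaler().fit_transform(X)

# Equivalent manual computation; NumPy's std defaults to ddof=0
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```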


In [18]:
# Calculate cosine similarity
feature_matrix = customer_features.drop(columns=['CustomerID'])
similarity_matrix = cosine_similarity(feature_matrix)

In [19]:
# The previous cell raised an error, so inspect the feature matrix dtypes and contents before recomputing cosine similarity
print(feature_matrix.dtypes)
print(feature_matrix.head())


total_spending           float64
total_transactions       float64
avg_transaction_value    float64
product_categories       float64
Region_Asia                 bool
Region_Europe               bool
Region_North America        bool
Region_South America        bool
dtype: object
   total_spending  total_transactions  avg_transaction_value  \
0       -0.061701           -0.011458              -0.070263   
1       -0.877744           -0.467494              -0.934933   
2       -0.405857           -0.467494              -0.026271   
3        1.032547            1.356650              -0.076769   
4       -0.783929           -0.923530              -0.040028   

   product_categories  Region_Asia  Region_Europe  Region_North America  \
0            0.160540        False          False                 False   
1           -0.904377         True          False                 False   
2            0.160540        False          False                 False   
3            0.160540        False   

In [20]:
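# Keep only numeric columns. Note: the Region_* dummies are bool, which np.number
# does not include, so this step drops them and leaves the four scaled features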
feature_matrix = feature_matrix.select_dtypes(include=[np.number])
print("Feature matrix after ensuring numeric data:")
print(feature_matrix.head())


Feature matrix after ensuring numeric data:
   total_spending  total_transactions  avg_transaction_value  \
0       -0.061701           -0.011458              -0.070263   
1       -0.877744           -0.467494              -0.934933   
2       -0.405857           -0.467494              -0.026271   
3        1.032547            1.356650              -0.076769   
4       -0.783929           -0.923530              -0.040028   

   product_categories  
0            0.160540  
1           -0.904377  
2            0.160540  
3            0.160540  
4           -0.904377  
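
As the output above shows, filtering with `select_dtypes(include=[np.number])` removes the boolean Region columns, so region does not contribute to the similarity scores in this notebook. If region should be kept, one option is to cast the frame to float instead. The sketch below demonstrates the difference on a made-up frame; applying it to the real data would change the scores reported below.

```python
import numpy as np
import pandas as pd

# Toy feature frame mirroring the dtypes printed above (values are made up)
toy = pd.DataFrame({
    "total_spending": [-0.06, -0.88, 1.03],
    "Region_Asia": [False, True, False],
    "Region_Europe": [True, False, False],
})

# select_dtypes(include=[np.number]) does not treat bool as numeric
numeric_only = toy.select_dtypes(include=[np.number])
print(list(numeric_only.columns))

# Casting the whole frame to float keeps the dummies as 0.0 / 1.0
all_float = toy.astype(float)
print(list(all_float.columns))
```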


In [21]:
# Cosine similarity calculation on the numeric feature matrix
similarity_matrix = cosine_similarity(feature_matrix)
print("Cosine similarity calculation successful.")

Cosine similarity calculation successful.


In [22]:
# Create Lookalike Mapping for Customers C0001 to C0020
lookalike_map = {}
for idx, customer_id in enumerate(customer_features['CustomerID'][:20]):  # First 20 customers
    # Get similarity scores for this customer
    scores = list(enumerate(similarity_matrix[idx]))
    
    # Exclude self (similarity with itself)
    scores = [(i, score) for i, score in scores if i != idx]
    
    # Sort scores in descending order and get top 3
    top_3 = sorted(scores, key=lambda x: x[1], reverse=True)[:3]
    
    # Map customer ID to top 3 lookalike customers and their scores
    lookalike_map[customer_id] = [
        (customer_features['CustomerID'].iloc[i], round(score, 4)) for i, score in top_3
    ]
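
The loop above sorts one row at a time in Python. An equivalent vectorized approach, sketched here on a made-up 4x4 similarity matrix with hypothetical IDs C1 to C4, masks the diagonal and ranks all rows at once with `np.argsort`.

```python
import numpy as np

# Toy symmetric similarity matrix for four customers (values are made up)
ids = np.array(["C1", "C2", "C3", "C4"])
sim = np.array([
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.4, 0.1],
    [0.2, 0.4, 1.0, 0.7],
    [0.5, 0.1, 0.7, 1.0],
])

k = 3
masked = sim.copy()
np.fill_diagonal(masked, -np.inf)          # exclude self-matches

# Column indices of the k largest scores in each row, in descending order
top_idx = np.argsort(-masked, axis=1)[:, :k]
top_scores = np.take_along_axis(masked, top_idx, axis=1)

lookalikes = {
    cid: list(zip(ids[top_idx[row]].tolist(), top_scores[row].round(4).tolist()))
    for row, cid in enumerate(ids.tolist())
}
print(lookalikes["C1"])
```

For a few hundred customers the explicit loop is fast enough; the vectorized form mainly helps when the matrix grows large.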

In [23]:
# Save Lookalike Map to CSV
lookalike_df = pd.DataFrame({
    'CustomerID': list(lookalike_map.keys()),
    'Lookalikes': [str(lookalike_map[customer]) for customer in lookalike_map.keys()]
})
lookalike_df.to_csv("Lookalike.csv", index=False)
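
Storing each list as a string keeps the file at two columns, but readers must parse the string to get at individual IDs and scores. A flat layout with one column per lookalike is an alternative. The sketch below uses a made-up mapping and hypothetical column names (Lookalike1, Score1, and so on), and writes to an in-memory buffer rather than to disk.

```python
import io
import pandas as pd

# Toy mapping in the same shape as lookalike_map (IDs and scores are made up)
toy_map = {
    "C0001": [("C0002", 0.99), ("C0003", 0.95), ("C0004", 0.90)],
    "C0002": [("C0001", 0.99), ("C0004", 0.80), ("C0003", 0.70)],
}

rows = []
for customer_id, matches in toy_map.items():
    row = {"CustomerID": customer_id}
    for rank, (match_id, score) in enumerate(matches, start=1):
        row[f"Lookalike{rank}"] = match_id
        row[f"Score{rank}"] = score
    rows.append(row)

flat_df = pd.DataFrame(rows)

# Write to an in-memory buffer here; a real run would pass a file path instead
buffer = io.StringIO()
flat_df.to_csv(buffer, index=False)
print(buffer.getvalue())
```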

In [24]:
# Display the Lookalike Map for reference
print("Lookalike Map for Customers C0001 to C0020:")
print(lookalike_df)

Lookalike Map for Customers C0001 to C0020:
   CustomerID                                         Lookalikes
0       C0001  [('C0086', 0.9966), ('C0189', 0.9948), ('C0055...
1       C0002  [('C0199', 0.9982), ('C0010', 0.998), ('C0033'...
2       C0003  [('C0178', 0.9996), ('C0036', 0.9786), ('C0035...
3       C0004  [('C0101', 0.9971), ('C0156', 0.9964), ('C0108...
4       C0005  [('C0073', 0.9997), ('C0159', 0.9993), ('C0112...
5       C0006  [('C0079', 1.0), ('C0196', 0.992), ('C0158', 0...
6       C0007  [('C0080', 0.9929), ('C0078', 0.9917), ('C0042...
7       C0008  [('C0109', 0.9709), ('C0147', 0.9419), ('C0093...
8       C0009  [('C0077', 0.9998), ('C0083', 0.9969), ('C0060...
9       C0010  [('C0002', 0.998), ('C0199', 0.9924), ('C0009'...
10      C0011  [('C0114', 0.9984), ('C0183', 0.9906), ('C0016...
11      C0012  [('C0155', 0.9978), ('C0065', 0.9935), ('C0165...
12      C0013  [('C0126', 0.9926), ('C0105', 0.9919), ('C0087...
13      C0014  [('C0058', 0.9964), ('C0151', 0

In [None]:
#Completed Task 2: Lookalike Model

Now, let's check the output in Lookalike.csv to verify that the lookalike mappings and scores are correct.

Steps:

Load the Lookalike.csv file and display its content.

In [27]:
lookalike_output = pd.read_csv("Lookalike.csv")
print(lookalike_output.head(20))  # Display the first 20 rows (index 0-19) to verify the mappings

   CustomerID                                         Lookalikes
0       C0001  [('C0086', 0.9966), ('C0189', 0.9948), ('C0055...
1       C0002  [('C0199', 0.9982), ('C0010', 0.998), ('C0033'...
2       C0003  [('C0178', 0.9996), ('C0036', 0.9786), ('C0035...
3       C0004  [('C0101', 0.9971), ('C0156', 0.9964), ('C0108...
4       C0005  [('C0073', 0.9997), ('C0159', 0.9993), ('C0112...
5       C0006  [('C0079', 1.0), ('C0196', 0.992), ('C0158', 0...
6       C0007  [('C0080', 0.9929), ('C0078', 0.9917), ('C0042...
7       C0008  [('C0109', 0.9709), ('C0147', 0.9419), ('C0093...
8       C0009  [('C0077', 0.9998), ('C0083', 0.9969), ('C0060...
9       C0010  [('C0002', 0.998), ('C0199', 0.9924), ('C0009'...
10      C0011  [('C0114', 0.9984), ('C0183', 0.9906), ('C0016...
11      C0012  [('C0155', 0.9978), ('C0065', 0.9935), ('C0165...
12      C0013  [('C0126', 0.9926), ('C0105', 0.9919), ('C0087...
13      C0014  [('C0058', 0.9964), ('C0151', 0.9957), ('C0128...
14      C0015  [('C0095',
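
Printing the frame only confirms that the file loads; the Lookalikes column comes back as plain strings. To check the contents programmatically, the strings can be parsed with `ast.literal_eval`. The sketch below does this on a made-up one-row CSV in the same layout, so the IDs and scores are illustrative only.

```python
import ast
import io
import pandas as pd

# Toy CSV text in the same layout as Lookalike.csv (IDs and scores are made up)
csv_text = """CustomerID,Lookalikes
C0001,"[('C0002', 0.99), ('C0003', 0.95), ('C0004', 0.9)]"
"""
df = pd.read_csv(io.StringIO(csv_text))

# Each cell is a string; literal_eval safely turns it back into a list of tuples
df["Lookalikes"] = df["Lookalikes"].apply(ast.literal_eval)

for customer_id, matches in zip(df["CustomerID"], df["Lookalikes"]):
    assert len(matches) == 3                          # exactly three lookalikes
    assert all(m != customer_id for m, _ in matches)  # no self-match
    assert all(-1.0 <= s <= 1.0 for _, s in matches)  # valid cosine range

parsed = df.loc[0, "Lookalikes"]
print(parsed)
```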

In [29]:
# For assignment submission:
# save the first 20 rows to a separate CSV file
lookalike_top20 = lookalike_output.head(20)
lookalike_top20.to_csv("Lookalike_Top20.csv", index=False)
print("Top 20 rows saved to Lookalike_Top20.csv!")


Top 20 rows saved to Lookalike_Top20.csv!
