# **Task 2: Customer Lookalike Model**

In this task, we build a customer lookalike model based on customer spending behavior across different product categories. The goal is to find similar customers using cosine similarity on a vectorized customer profile.

---


In [1]:
# **Import necessary libraries**

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity


### **Step 1: Load Datasets**

We load the **Customers**, **Products**, and **Transactions** datasets, and merge the product categories with transactions for category-level analysis.

---


In [2]:
# **Loading the datasets**

# Load the customer, product, and transaction data
customers = pd.read_csv(r"C:\Users\mistr\Desktop\zeotap\data\Customers.csv")
products = pd.read_csv(r"C:\Users\mistr\Desktop\zeotap\data\Products.csv")
transactions = pd.read_csv(r"C:\Users\mistr\Desktop\zeotap\data\Transactions.csv")

# Merge transactions with product categories
transactions = transactions.merge(products[['ProductID', 'Category']], on='ProductID', how='left')
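
A left merge keeps every transaction, but a `ProductID` that is missing from the products table ends up with a `NaN` category, and `groupby` drops `NaN` keys by default. The sketch below, on made-up toy tables rather than the real CSV files, shows one way to surface such rows using the `validate` and `indicator` options of `merge`.

```python
import pandas as pd

# Toy stand-ins for the Products and Transactions tables (values are made up)
products = pd.DataFrame({
    "ProductID": ["P1", "P2"],
    "Category": ["Books", "Electronics"],
})
transactions = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "ProductID": ["P1", "P2", "P9"],  # P9 has no entry in products
    "TotalValue": [10.0, 250.0, 40.0],
})

# validate="many_to_one" raises if ProductID is duplicated in products;
# indicator=True adds a _merge column showing which rows found a match
merged = transactions.merge(
    products[["ProductID", "Category"]],
    on="ProductID",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Transactions whose product could not be matched to a category
unmatched = merged.loc[merged["_merge"] == "left_only", "ProductID"].tolist()
print(unmatched)
```

If `unmatched` is non-empty, those transactions would not contribute to any category column in the next step.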


### **Step 2: Calculate Category-Level Spend**

We calculate how much each customer has spent in each product category, filling missing values with 0.

---


In [3]:
# **Calculate category-level spend**

# Calculate category spend for each customer (sum of TotalValue for each category)
category_spend = transactions.groupby(['CustomerID', 'Category'])['TotalValue'].sum().unstack(fill_value=0)

# Merge category spend into the customer profile
customer_profile = customers.merge(category_spend, on='CustomerID', how='left')

# Fill missing values with 0 (if a customer has not spent in a particular category)
customer_profile.fillna(0, inplace=True)
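
To make the reshaping concrete, here is a minimal sketch of the same `groupby` / `unstack` pattern on a made-up four-row table. The customer IDs, categories, and amounts are illustrative only.

```python
import pandas as pd

# Toy transactions: C1 buys in two categories, C2 in one
transactions = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C1", "C2"],
    "Category": ["Books", "Books", "Electronics", "Books"],
    "TotalValue": [10.0, 5.0, 250.0, 40.0],
})

# Sum spend per (customer, category), then pivot categories into columns;
# fill_value=0 covers customer/category pairs with no transactions
category_spend = (
    transactions.groupby(["CustomerID", "Category"])["TotalValue"]
    .sum()
    .unstack(fill_value=0)
)
print(category_spend)
```

The result has one row per customer and one column per category: C1 gets 15 for Books and 250 for Electronics, while C2 gets 40 for Books and 0 for Electronics.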


### **Step 3: Normalize Data**

We standardize the category-level spend, total spend, and recency (measured here as days since signup) to zero mean and unit variance so that all features are on a comparable scale.

---


In [4]:
# **Normalize numerical columns**

# Standardize category spend features
numerical_cols = list(category_spend.columns)  # Use only category columns for now
scaler = StandardScaler()
customer_profile[numerical_cols] = scaler.fit_transform(customer_profile[numerical_cols])

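# Add additional features: Total Spend and Recency
# Look up each customer's total by CustomerID; assigning the grouped Series directly
# would align on the DataFrame's integer index and leave every row as NaN
total_spend = transactions.groupby('CustomerID')['TotalValue'].sum()
customer_profile['TotalSpend'] = customer_profile['CustomerID'].map(total_spend).fillna(0)
customer_profile['Recency'] = (pd.to_datetime('today') - pd.to_datetime(customer_profile['SignupDate'])).dt.days

# Normalize TotalSpend and Recency as well (this refits the scaler on these two columns)
customer_profile[['TotalSpend', 'Recency']] = scaler.fit_transform(customer_profile[['TotalSpend', 'Recency']])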
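
When attaching a per-customer aggregate such as `TotalSpend` to the profile, it helps to remember that pandas aligns on index labels during column assignment. The sketch below uses made-up data to contrast direct assignment of a grouped Series with a lookup through the `CustomerID` column; the frame names `misaligned` and `aligned` are purely illustrative.

```python
import pandas as pd

# Toy customer table with the default integer index 0..2
customers = pd.DataFrame({"CustomerID": ["C1", "C2", "C3"]})
transactions = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "TotalValue": [10.0, 5.0, 40.0],
})

# The grouped result is indexed by CustomerID ("C1", "C2")
total_spend = transactions.groupby("CustomerID")["TotalValue"].sum()

# Direct assignment aligns on index labels: 0, 1, 2 never match "C1", "C2"
misaligned = customers.copy()
misaligned["TotalSpend"] = total_spend

# Mapping through the CustomerID column looks each ID up in the Series;
# customers without transactions (C3) get 0
aligned = customers.copy()
aligned["TotalSpend"] = aligned["CustomerID"].map(total_spend).fillna(0)

print(misaligned["TotalSpend"].isna().all())
print(aligned["TotalSpend"].tolist())
```

In the misaligned frame every value is `NaN`, whereas the mapped version carries 15, 40, and 0 for the three customers.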




### **Step 4: Prepare Customer Vectors**

We drop non-numerical columns and convert the remaining data into vectors for each customer, ensuring there are no missing values.

---


In [5]:
# **Prepare customer vectors**

# Drop non-numerical columns
customer_vectors = customer_profile.drop(['CustomerID', 'CustomerName', 'Region', 'SignupDate'], axis=1).values

# Replace any remaining NaN values with 0 (mean or median imputation are alternatives)
customer_vectors = np.nan_to_num(customer_vectors, nan=0)

# Sanity check: no NaNs should remain after the replacement above
assert not np.isnan(customer_vectors).any(), "NaN values remain in customer_vectors"
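
As an alternative to listing the non-numerical columns by name, the feature columns can be selected by dtype. This is a sketch on a made-up two-row profile, not the notebook's actual data, and the names `profile`, `feature_cols`, and `vectors` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy profile mixing identifier, categorical, and numeric columns
profile = pd.DataFrame({
    "CustomerID": ["C1", "C2"],
    "Region": ["Asia", "Europe"],
    "Books": [0.5, np.nan],
    "Electronics": [-1.2, 0.3],
})

# Keep only numeric columns, impute missing values with 0, and convert to a float array
feature_cols = profile.select_dtypes(include="number").columns.tolist()
vectors = profile[feature_cols].fillna(0).to_numpy(dtype=float)

print(feature_cols)
print(vectors.shape)
```

Selecting by dtype avoids an error if a column name changes, at the cost of silently including any numeric column that was not meant to be a feature.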


### **Step 5: Compute Similarity Matrix**

We compute the cosine similarity matrix to measure the similarity between customers based on their spending patterns.

---


In [6]:
# **Compute similarity matrix**

# Compute cosine similarity between customer vectors
similarity_matrix = cosine_similarity(customer_vectors)
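
Cosine similarity between two vectors a and b is their dot product divided by the product of their norms, so it depends on direction and not on magnitude. The following NumPy sketch computes it by hand on made-up vectors; `cosine_similarity_matrix` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows as zeros instead of dividing by 0
    unit = X / norms         # scale every row to unit length
    return unit @ unit.T     # dot products of unit vectors are cosines

vectors = np.array([
    [1.0, 0.0],
    [2.0, 0.0],   # same direction as row 0, larger magnitude
    [0.0, 3.0],   # orthogonal to row 0
    [-1.0, 0.0],  # opposite direction to row 0
])
sim = cosine_similarity_matrix(vectors)
print(np.round(sim[0], 2))
```

Row 0 scores 1 against itself and against row 1, 0 against the orthogonal row, and -1 against the opposite row. Because standardized features can be negative, similarities in this notebook can range from -1 to 1.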


### **Step 6: Find Similar Customers**

We define a function to find the top N most similar customers to a given customer based on their similarity scores.

---


In [7]:
# **Find similar customers function**

# Function to find similar customers for a given input customer
def find_similar_customers(input_customer_id, top_n=3):
    # Get the index of the input customer
    customer_idx = customer_profile.index[customer_profile['CustomerID'] == input_customer_id].tolist()[0]
    
    # Get the similarity scores of this customer with all others
    similarity_scores = similarity_matrix[customer_idx]
    
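    # Rank all customers by similarity (highest first) and drop the input customer explicitly,
    # since ties mean it is not guaranteed to sort into position 0
    ranked_indices = similarity_scores.argsort()[::-1]
    similar_indices = [idx for idx in ranked_indices if idx != customer_idx][:top_n]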
    
    # Create a list of similar customers with similarity scores
    similar_customers = [(customer_profile.iloc[idx]['CustomerID'], similarity_scores[idx]) for idx in similar_indices]
    
    return similar_customers
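
The ranking step can be checked in isolation on a small hand-made similarity matrix. In this sketch, `top_n_similar` is a hypothetical helper that works on row positions only, and the matrix values are invented so that customer 1 ties with customer 0's self-similarity.

```python
import numpy as np

def top_n_similar(similarity_matrix, idx, top_n=3):
    """Return (position, score) pairs for the top_n rows most similar to row idx."""
    scores = similarity_matrix[idx]
    order = np.argsort(scores)[::-1]     # positions sorted by score, highest first
    order = order[order != idx][:top_n]  # remove the query row itself, then truncate
    return [(int(i), float(scores[i])) for i in order]

# Symmetric toy matrix; rows 0 and 1 are perfectly similar (score 1.0)
sim = np.array([
    [1.0, 1.0, 0.2, 0.9],
    [1.0, 1.0, 0.1, 0.8],
    [0.2, 0.1, 1.0, 0.3],
    [0.9, 0.8, 0.3, 1.0],
])
print(top_n_similar(sim, idx=0, top_n=2))
```

Filtering by position rather than slicing off the first element keeps customer 1 in the result even though its score equals the self-similarity, a case where skipping "the first entry" may discard the wrong row.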


### **Step 7: Generate Lookalike Results**

We generate the lookalike results for the first 20 customers by calling the `find_similar_customers` function.

---


In [8]:
# **Generate lookalike results**

# Generate lookalike results for the first 20 customers
lookalike_results = {}
for customer_id in customer_profile['CustomerID'][:20]:
    lookalike_results[customer_id] = find_similar_customers(customer_id)


### **Step 8: Save Results to CSV**

We format the results and save the lookalike information to a CSV file called **Meet_Mistry_Lookalike.csv**.

---


In [9]:
# **Format and save lookalike results**

# Format the results and save them to a CSV file

lookalike_df = pd.DataFrame({
    "CustomerID": list(lookalike_results.keys()),
    "Lookalikes": [", ".join([f"{cust_id} ({score:.2f})" for cust_id, score in lookalikes]) for lookalikes in lookalike_results.values()]
})

# Save to Meet_Mistry_Lookalike.csv
lookalike_df.to_csv(r"C:\Users\mistr\Desktop\zeotap\output\Meet_Mistry_Lookalike.csv", index=False)

print(lookalike_df.head())


  CustomerID                                Lookalikes
0      C0001  C0091 (0.96), C0120 (0.95), C0184 (0.92)
1      C0002  C0159 (0.96), C0134 (0.95), C0106 (0.89)
2      C0003  C0127 (0.95), C0085 (0.91), C0026 (0.89)
3      C0004  C0075 (0.95), C0148 (0.81), C0104 (0.78)
4      C0005  C0007 (0.97), C0166 (0.94), C0197 (0.90)
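
If the results need to be filtered or joined later, a long format with one row per customer and lookalike pair may be easier to work with than a single formatted string. The sketch below assumes a dictionary shaped like `lookalike_results`; the entries are copied from the first two rows of the printed output and truncated to two lookalikes each, and the column names `Rank`, `LookalikeID`, and `Score` are illustrative.

```python
import pandas as pd

# Same shape as lookalike_results: customer ID -> list of (lookalike ID, score)
lookalike_results = {
    "C0001": [("C0091", 0.96), ("C0120", 0.95)],
    "C0002": [("C0159", 0.96), ("C0134", 0.95)],
}

# One row per (customer, lookalike) pair, with rank 1 as the closest match
rows = [
    {"CustomerID": cust_id, "Rank": rank, "LookalikeID": other_id, "Score": round(score, 2)}
    for cust_id, matches in lookalike_results.items()
    for rank, (other_id, score) in enumerate(matches, start=1)
]
long_df = pd.DataFrame(rows)
print(long_df)
```

A frame like this can be written with `to_csv(..., index=False)` in the same way as `lookalike_df`.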


### **Step 9: Run Streamlit App**

Finally, after completing all preprocessing and lookalike calculations, run the Streamlit app using the following command:

```bash
python -m streamlit run app.py
```
