Task 2: Lookalike Model

Build a Lookalike Model that takes a user's information as input and recommends 3 similar
customers based on their profile and transaction history. The model should:
● Use both customer and product information.
● Assign a similarity score to each recommended customer.
Deliverables:
● Give the top 3 lookalikes with their similarity scores for the first 20 customers
(CustomerID: C0001 - C0020) in Customers.csv. Create a “Lookalike.csv” file that contains
just one map: Map<cust_id, List<cust_id, score>>
● A Jupyter Notebook/Python script explaining your model development.
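
One way to read the required map format is one row per customer, with that customer's list of (lookalike, score) pairs serialised into a single cell. The following is a minimal sketch under that assumption; the helper name `to_lookalike_map`, the column names, and the sample IDs and scores are hypothetical and are not real results:

```python
import csv
import io
from collections import defaultdict


def to_lookalike_map(rows):
    """Group (customer_id, lookalike_id, score) rows into
    {customer_id: [(lookalike_id, score), ...]}, best score first."""
    mapping = defaultdict(list)
    for customer_id, lookalike_id, score in rows:
        mapping[customer_id].append((lookalike_id, round(float(score), 4)))
    return {
        cid: sorted(pairs, key=lambda pair: pair[1], reverse=True)
        for cid, pairs in mapping.items()
    }


# Made-up rows, for illustration only.
sample_rows = [
    ("C0001", "C0107", 0.91),
    ("C0001", "C0042", 0.97),
    ("C0002", "C0010", 0.88),
]
lookalike_map = to_lookalike_map(sample_rows)

# Write one line per customer: the ID and its serialised list of pairs.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["CustomerID", "Lookalikes"])
for cid, pairs in lookalike_map.items():
    writer.writerow([cid, str(pairs)])
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to a real file instead of the in-memory buffer only requires replacing `io.StringIO()` with `open("Lookalike.csv", "w", newline="")`.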

In [8]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer  # For product descriptions
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import OneHotEncoder, StandardScaler

print(sklearn.__version__)  # OneHotEncoder(sparse_output=...) requires scikit-learn >= 1.2

# Read data (assuming you have CSV files)
customers_df = pd.read_csv("Customers.csv")
products_df = pd.read_csv("Products.csv")
transactions_df = pd.read_csv("Transactions.csv")


# Feature engineering
merged_df = pd.merge(customers_df, transactions_df, on="CustomerID")
merged_df = pd.merge(merged_df, products_df, on="ProductID")

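# Per-customer behavioural features, broadcast back onto every transaction row
merged_df["TransactionDate"] = pd.to_datetime(merged_df["TransactionDate"])
merged_df["purchase_frequency"] = merged_df.groupby("CustomerID")["TransactionID"].transform("count")
# Recency: whole days since each customer's most recent transaction
last_purchase = merged_df.groupby("CustomerID")["TransactionDate"].transform("max")
merged_df["last_purchase_day"] = (pd.Timestamp.today().normalize() - last_purchase).dt.days
merged_df["total_purchase_amount"] = merged_df.groupby("CustomerID")["TotalValue"].transform("sum")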

# Text-based features for product descriptions (if applicable)
if "ProductDescription" in products_df.columns:
    vectorizer = TfidfVectorizer()
    product_features = vectorizer.fit_transform(products_df["ProductDescription"].fillna(""))

    # Prefix the term columns and join on ProductID so each transaction row
    # receives the TF-IDF vector of the product it refers to
    product_features_df = pd.DataFrame(
        product_features.toarray(),
        columns=[f"tfidf_{term}" for term in vectorizer.get_feature_names_out()],
    )
    product_features_df["ProductID"] = products_df["ProductID"].values
    merged_df = merged_df.merge(product_features_df, on="ProductID", how="left")

# One-hot encode categorical features
categorical_cols = ["Region", "Category"]
encoder = OneHotEncoder(sparse_output=False)
encoded_df = pd.DataFrame(
    encoder.fit_transform(merged_df[categorical_cols]),
    columns=encoder.get_feature_names_out(categorical_cols),
    index=merged_df.index,  # keep rows aligned with merged_df
)
merged_df = pd.concat([merged_df, encoded_df], axis=1)

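# Build one profile per customer (purchase behaviour, one-hot encoded features, product preferences)
tfidf_cols = [c for c in merged_df.columns if c.startswith("tfidf_")]  # empty if no descriptions
relevant_features = [
    "purchase_frequency", "last_purchase_day", "total_purchase_amount",
    *encoded_df.columns.tolist(),
    *tfidf_cols,
]
customer_profiles = merged_df.groupby("CustomerID")[relevant_features].mean()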

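# Similarity calculation: standardize features so no single scale dominates, then use cosine similarity
scaled_profiles = StandardScaler().fit_transform(customer_profiles)
similarity_matrix = cosine_similarity(scaled_profiles)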

# Lookalike recommendation
def recommend_lookalikes(customer_id, k=3):
    """Recommends the k most similar customers to a given customer, excluding the customer itself."""
    if customer_id not in customer_profiles.index:
        raise ValueError(f"Customer ID {customer_id} not found in the profiles.")
    customer_index = customer_profiles.index.get_loc(customer_id)
    # Rank all customers by descending similarity, drop the query customer, then keep k
    ranked_indices = similarity_matrix[customer_index].argsort()[::-1]
    similar_indices = [i for i in ranked_indices if i != customer_index][:k]
    return [(customer_profiles.index[i], similarity_matrix[customer_index, i]) for i in similar_indices]

# Generate Lookalike.csv
lookalike_data = []
for customer_id in customer_profiles.index[:20]: 
    lookalikes = recommend_lookalikes(customer_id)
    for lookalike_id, similarity in lookalikes:
        lookalike_data.append([customer_id, lookalike_id, similarity])

lookalike_df = pd.DataFrame(lookalike_data, columns=["CustomerID", "LookalikeID", "Similarity"])
lookalike_df.to_csv("Lookalike.csv", index=False)
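
The ranking step in recommend_lookalikes can be checked by hand on a toy example. The following is a minimal sketch that uses plain Python rather than scikit-learn; the names `cosine` and `top_k_lookalikes` and the two-dimensional profile vectors are hypothetical and chosen only so that the expected order is easy to verify:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_lookalikes(profiles, customer_id, k=3):
    """Return the k most similar other customers as (id, score), best first."""
    target = profiles[customer_id]
    scored = [
        (other_id, cosine(target, vector))
        for other_id, vector in profiles.items()
        if other_id != customer_id  # never recommend the customer to itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up profiles: B points almost the same way as A, D is at 45 degrees,
# C is orthogonal, and E points in the opposite direction.
toy_profiles = {
    "A": [1.0, 0.0],
    "B": [1.0, 0.1],
    "C": [0.0, 1.0],
    "D": [1.0, 1.0],
    "E": [-1.0, 0.0],
}
result = top_k_lookalikes(toy_profiles, "A", k=3)
for other_id, score in result:
    print(other_id, round(score, 3))
```

Filtering out the query customer before truncating to k matters: taking the top k first and then removing the customer itself leaves only k - 1 recommendations, because a customer is always its own best match.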

