A lookalike model is a machine learning approach for identifying new users or customers who closely resemble a high-value target group (e.g., existing loyal customers, converters, or high spenders). It uses the behavioral, demographic, or transactional patterns of a "seed audience" to find similar profiles in a broader population.

How It Works:
Seed Audience Definition:
A seed audience (e.g., customers who purchased a product) is selected based on desired traits or actions. This group’s features (e.g., age, purchase frequency, browsing behavior, geographic location) are analyzed to identify patterns.
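
As a minimal sketch of this step, the snippet below picks a seed audience by filtering a purchase log. The `purchases` table, its column names, and the `select_seed` helper are hypothetical and only serve the illustration; a real seed definition would depend on the business goal.

```python
import pandas as pd

# Hypothetical purchase log; column names and values are illustrative assumptions.
purchases = pd.DataFrame({
    "customer_id": ["A", "A", "B", "C", "C", "C", "D"],
    "product": ["shoes", "hat", "hat", "shoes", "shoes", "bag", "bag"],
    "amount": [80.0, 20.0, 25.0, 90.0, 85.0, 40.0, 35.0],
})


def select_seed(purchases: pd.DataFrame, product: str) -> list[str]:
    """Return the sorted IDs of customers who bought `product` at least once."""
    bought = purchases.loc[purchases["product"] == product, "customer_id"]
    return sorted(bought.unique())


seed_ids = select_seed(purchases, "shoes")
```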

Feature Engineering:
Relevant attributes (e.g., recency, frequency, and monetary (RFM) metrics, device usage, interests) are extracted and transformed into numerical features. For example:

Avg_session_duration

Days_since_last_purchase

Category_preferences
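
The three example features could be derived roughly as follows. This is a sketch on made-up data: the `sessions` and `orders` tables, their columns, and the `as_of` reference date are assumptions, not part of the dataset used later in this notebook.

```python
import pandas as pd

# Hypothetical raw tables; all column names and values are illustrative.
sessions = pd.DataFrame({
    "customer_id": ["A", "A", "B"],
    "duration_sec": [120, 240, 60],
})
orders = pd.DataFrame({
    "customer_id": ["A", "A", "B"],
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-01", "2024-01-20"]),
    "category": ["Books", "Books", "Electronics"],
})
as_of = pd.Timestamp("2024-03-01")  # assumed reference date for recency

# Mean session length per customer
avg_session_duration = sessions.groupby("customer_id")["duration_sec"].mean()

# Days between the reference date and each customer's most recent order
last_order = orders.groupby("customer_id")["order_date"].max()
days_since_last_purchase = (as_of - last_order).dt.days

# Share of each customer's orders falling in each category (rows sum to 1)
category_preferences = pd.crosstab(
    orders["customer_id"], orders["category"], normalize="index"
)

features = pd.concat(
    [
        avg_session_duration.rename("avg_session_duration"),
        days_since_last_purchase.rename("days_since_last_purchase"),
        category_preferences.add_prefix("pref_"),
    ],
    axis=1,
)
```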

Model Training:
Algorithms (e.g., logistic regression, decision trees, or neural networks) are trained to distinguish the seed audience from the general population. The fitted model captures how much each feature contributes to separating the two groups, for example as coefficients in logistic regression or as split importances in a decision tree.
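
A minimal sketch of the training step, assuming logistic regression and purely synthetic data (the two features and their distributions are invented for illustration). Seed users are labeled 1 and the general population 0.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): two features per user,
# e.g. purchase frequency and average spend. Seed users are shifted upward.
seed = rng.normal(loc=2.0, scale=1.0, size=(200, 2))
population = rng.normal(loc=0.0, scale=1.0, size=(800, 2))

X = np.vstack([seed, population])
y = np.concatenate([np.ones(len(seed)), np.zeros(len(population))])  # 1 = seed

# Standardize features, then fit a linear classifier
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# Coefficients on the standardized features: larger magnitude = more influence
coefficients = model.named_steps["logisticregression"].coef_[0]
train_accuracy = model.score(X, y)
```

In practice the "population" class may itself contain users who resemble the seed but have not converted yet, so the labels are noisy and training accuracy should be read with that in mind.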

Scoring & Expansion:
The model assigns a similarity score to all users in the broader population. Those with high scores are deemed "lookalikes" and prioritized for targeted campaigns.
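
The expansion step can be sketched as a simple ranking: drop users already in the seed, then keep the highest scores. The `top_lookalikes` helper and the score values below are hypothetical; the scores stand in for whatever the model outputs (for instance a predicted probability).

```python
import pandas as pd


def top_lookalikes(scores: pd.Series, seed_ids: set[str], k: int) -> pd.Series:
    """Return the k highest-scoring users that are not already in the seed."""
    candidates = scores[~scores.index.isin(seed_ids)]
    return candidates.nlargest(k)


# Hypothetical model scores indexed by user ID.
scores = pd.Series({"u1": 0.91, "u2": 0.15, "u3": 0.78, "u4": 0.66, "u5": 0.97})
lookalikes = top_lookalikes(scores, seed_ids={"u5"}, k=2)
```

A fixed `k` is one possible cut-off; a score threshold or a top percentile would work the same way and may suit campaign budgets better.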

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
import csv

# Load data
customers = pd.read_csv('/kaggle/input/zeotap-dataset/Customers.csv')
products = pd.read_csv('/kaggle/input/zeotap-dataset/Products.csv')
transactions = pd.read_csv('/kaggle/input/zeotap-dataset/Transactions.csv')

# Convert date columns to datetime
customers['SignupDate'] = pd.to_datetime(customers['SignupDate'])
transactions['TransactionDate'] = pd.to_datetime(transactions['TransactionDate'])

# Merge transactions with products to get category information
merged_transactions = pd.merge(transactions, products[['ProductID', 'Category']], on='ProductID', how='left')

# Calculate last transaction date for each customer
last_transaction_dates = merged_transactions.groupby('CustomerID')['TransactionDate'].max().reset_index()
last_transaction_dates.rename(columns={'TransactionDate': 'LastTransactionDate'}, inplace=True)

# Merge with customers to get signup date and compute tenure
customer_dates = pd.merge(customers, last_transaction_dates, on='CustomerID', how='left')
customer_dates['Tenure'] = (customer_dates['LastTransactionDate'] - customer_dates['SignupDate']).dt.days
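# Customers with no transactions have no last transaction date; set their tenure to 0.
# Assign the result back instead of using a chained inplace call, which pandas 3.0 will not apply.
customer_dates['Tenure'] = customer_dates['Tenure'].fillna(0)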

# Calculate transaction statistics
transaction_stats = merged_transactions.groupby('CustomerID').agg(
    total_transactions=('TransactionID', 'count'),
    total_quantity=('Quantity', 'sum'),
    total_spending=('TotalValue', 'sum'),
    avg_price=('Price', 'mean')
).reset_index()
transaction_stats['avg_quantity'] = transaction_stats['total_quantity'] / transaction_stats['total_transactions']
transaction_stats['avg_spending'] = transaction_stats['total_spending'] / transaction_stats['total_transactions']

# Calculate category proportions
category_counts = merged_transactions.groupby(['CustomerID', 'Category']).size().reset_index(name='counts')
total_transactions = category_counts.groupby('CustomerID')['counts'].sum().reset_index(name='total')
category_counts = pd.merge(category_counts, total_transactions, on='CustomerID')
category_counts['proportion'] = category_counts['counts'] / category_counts['total']
category_pivot = category_counts.pivot(index='CustomerID', columns='Category', values='proportion').fillna(0).reset_index()

# Combine all features
customer_features = pd.merge(
    customer_dates[['CustomerID', 'Region', 'Tenure']],
    transaction_stats,
    on='CustomerID',
    how='left'
)
customer_features = pd.merge(customer_features, category_pivot, on='CustomerID', how='left').fillna(0)

# One-hot encode Region
encoder = OneHotEncoder(sparse_output=False)
region_encoded = encoder.fit_transform(customer_features[['Region']])
region_columns = encoder.get_feature_names_out(['Region'])
region_df = pd.DataFrame(region_encoded, columns=region_columns, index=customer_features.index)
customer_features_encoded = pd.concat([customer_features.drop('Region', axis=1), region_df], axis=1)

# List of all feature columns excluding CustomerID
feature_columns = customer_features_encoded.columns.drop('CustomerID').tolist()

# Scale features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(customer_features_encoded[feature_columns])

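# Compute pairwise cosine similarity between all customers.
# This is an unsupervised, nearest-neighbour take on lookalikes: no classifier is trained here.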
similarity_matrix = cosine_similarity(scaled_features)

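# Prepare the target customers (C0001 to C0020).
# The index labels are used as row positions in similarity_matrix below; this holds
# because the merges and concat above leave a default RangeIndex.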
target_customers = [f'C{i:04d}' for i in range(1, 21)]
target_indices = customer_features_encoded[customer_features_encoded['CustomerID'].isin(target_customers)].index

# Generate lookalike recommendations
lookalike_results = []
for idx in target_indices:
    target_id = customer_features_encoded.loc[idx, 'CustomerID']
    similarities = []
    for i, score in enumerate(similarity_matrix[idx]):
        if i != idx:  # Exclude self
            similarities.append((customer_features_encoded.loc[i, 'CustomerID'], score))
    # Sort by score descending and take top 3
    similarities.sort(key=lambda x: -x[1])
    top_3 = similarities[:3]
    lookalike_results.append((target_id, top_3))

# Write to Lookalike.csv
with open('Lookalike.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['CustomerID', 'Lookalike1', 'Score1', 'Lookalike2', 'Score2', 'Lookalike3', 'Score3'])
    for cust_id, lookalikes in lookalike_results:
        row = [cust_id]
        for lookalike, score in lookalikes:
            row.extend([lookalike, f"{score:.4f}"])
        writer.writerow(row)

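
The per-customer loop over the similarity matrix can also be written with NumPy array operations. The sketch below is one possible vectorized variant on toy vectors; `top_k_neighbours` is a hypothetical helper and not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def top_k_neighbours(features: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray]:
    """For each row, return indices and scores of its k most similar other rows."""
    sim = cosine_similarity(features)
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches
    idx = np.argsort(-sim, axis=1)[:, :k]  # highest similarity first
    scores = np.take_along_axis(sim, idx, axis=1)
    return idx, scores


# Toy feature vectors (illustrative only)
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
neighbour_idx, neighbour_scores = top_k_neighbours(toy, k=2)
```

Row positions in `neighbour_idx` would still need to be mapped back to customer IDs, for example through the `CustomerID` column of the feature table.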
