This model finds customers who resemble one another in behavior and characteristics. It works much like Netflix's recommendation system, except that it recommends similar customers rather than movies.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics.pairwise import cosine_similarity

First, we combine customer information with their shopping history to get a complete picture:

We calculate total spending for each customer

We count how many purchases they've made (purchase frequency)

We take their region information

In [2]:
def preprocess_data(customers_df, transactions_df):
    customer_metrics = transactions_df.groupby('CustomerID').agg({
        'TotalValue': ['sum', 'count']
    }).reset_index()
    customer_metrics.columns = ['CustomerID', 'TotalSpending', 'PurchaseFrequency']

    return pd.merge(customers_df, customer_metrics, on='CustomerID', how='inner')
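
To see what this step produces, here is a minimal sketch on made-up data (the IDs and values below are invented for illustration and do not come from Customers.csv or Transactions.csv). It uses pandas named aggregation, which gives the same columns without renaming them afterwards, and it shows one consequence of the inner merge: a customer with no transactions is dropped.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_customers = pd.DataFrame({
    'CustomerID': ['A', 'B', 'C'],
    'Region': ['Asia', 'Europe', 'Asia'],
})
toy_transactions = pd.DataFrame({
    'CustomerID': ['A', 'A', 'B'],
    'TotalValue': [100.0, 50.0, 20.0],
})

# Named aggregation: one output column per (input column, function) pair
toy_metrics = toy_transactions.groupby('CustomerID').agg(
    TotalSpending=('TotalValue', 'sum'),
    PurchaseFrequency=('TotalValue', 'count'),
).reset_index()

# The inner join keeps only customers present in both tables,
# so customer 'C' (no transactions) does not appear in the result
toy_merged = pd.merge(toy_customers, toy_metrics, on='CustomerID', how='inner')
print(toy_merged)
```

If customers without purchases should stay in the analysis, a left merge followed by filling the missing metrics with 0 would be one option, at the cost of changing which customers get compared.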

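Next, we create a feature matrix from the preprocessed data: Region is one-hot encoded and the numerical features are standardized. The function returns the combined feature DataFrame; the customer IDs are taken separately from the preprocessed data.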

In [3]:
def create_feature_matrix(data):
    # One-hot encode Region (categorical column)
    encoder = OneHotEncoder(sparse_output=False)
    encoded_features = encoder.fit_transform(data[['Region']])
    encoded_df = pd.DataFrame(
        encoded_features,
        columns=encoder.get_feature_names_out(['Region'])
    )
    # Scale numerical features to zero mean and unit variance
    scaler = StandardScaler()
    numerical_features = ['TotalSpending', 'PurchaseFrequency']
    scaled_numerical = scaler.fit_transform(data[numerical_features])
    scaled_df = pd.DataFrame(
        scaled_numerical,
        columns=numerical_features
    )
    return pd.concat([scaled_df, encoded_df], axis=1)

We use cosine similarity to find similar customers:
Think of each customer as a vector in feature space. The more closely two vectors point in the same direction, the more similar those customers are based on their features. We compute this score for every pair of customers using cosine similarity.

Technically, cosine similarity is the cosine of the angle between two customer vectors: the smaller the angle, the more similar the customers.
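
For two vectors a and b, the score is (a · b) / (|a| |b|). The sketch below computes it by hand for three made-up vectors and compares the result with scikit-learn's `cosine_similarity`; the helper name `cosine` is only for illustration. Two points are worth noticing: vectors of different lengths that point the same way still score 1, and because standardized features can be negative, scores can fall anywhere between -1 and 1.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine(a, b):
    """Cosine of the angle between vectors a and b (illustrative helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])    # same direction as a, twice the length
c = np.array([-1.0, -2.0])  # opposite direction to a

print(round(cosine(a, b), 6))  # same direction
print(round(cosine(a, c), 6))  # opposite direction

# The pairwise matrix from scikit-learn holds the same values:
# entry [i, j] is the cosine similarity between row i and row j
matrix = cosine_similarity(np.vstack([a, b, c]))
print(matrix.round(6))
```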

In [4]:
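def get_similar_customers(similarity_matrix, customer_ids, target_idx, n_recommendations=3):
    # Rank all customers by descending similarity to the target
    ranked_indices = np.argsort(-similarity_matrix[target_idx])
    # Drop the target explicitly: with tied scores it is not guaranteed to be ranked first
    similar_indices = ranked_indices[ranked_indices != target_idx][:n_recommendations]
    return [
        (customer_ids[idx], similarity_matrix[target_idx][idx])
        for idx in similar_indices
    ]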

For each customer, we rank all other customers by similarity score.

We pick the top 3 most similar customers and record their IDs and how similar each one is to the target customer.
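
One detail matters when ranking: if another customer has a feature vector pointing in exactly the same direction as the target's, both score 1.0 against the target, so a sort cannot be relied on to put the target itself first. A minimal sketch with made-up scores, where `top_k_excluding_self` is a hypothetical helper written for this illustration, filters out the target's index explicitly instead of skipping the first sorted position:

```python
import numpy as np

def top_k_excluding_self(similarity_row, target_idx, k=3):
    """Hypothetical helper: indices of the k highest scores, skipping target_idx."""
    order = np.argsort(-similarity_row, kind='stable')
    order = order[order != target_idx]
    return order[:k].tolist()

# Made-up similarity row for the customer at index 2;
# the customer at index 0 ties with it at 1.0
row = np.array([1.0, 0.4, 1.0, 0.9, -0.2])

# Skipping the first sorted position keeps the target and loses index 0
naive = np.argsort(-row, kind='stable')[1:4].tolist()
print(naive)  # [2, 3, 1]

# Filtering by index returns only other customers
print(top_k_excluding_self(row, target_idx=2, k=3))  # [0, 3, 1]
```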

In [5]:
def generate_lookalike_recommendations(customers_path, transactions_path, output_path, n_customers=20):
    customers = pd.read_csv(customers_path)
    transactions = pd.read_csv(transactions_path)

    # Preprocess data, create feature matrix, calculate similarity matrix
    processed_data = preprocess_data(customers, transactions)
    feature_matrix = create_feature_matrix(processed_data)
    similarity_matrix = cosine_similarity(feature_matrix)

    # Generate recommendations for first n customers
    customer_ids = processed_data['CustomerID'].values
    lookalike_map = {}

    for idx, customer_id in enumerate(customer_ids[:n_customers]):
        similar_customers = get_similar_customers(
            similarity_matrix,
            customer_ids,
            idx
        )
        lookalike_map[customer_id] = similar_customers

    # One row per customer, with the lookalikes formatted as a single string
    results_df = pd.DataFrame({
        'CustomerID': list(lookalike_map.keys()),
        'LookalikeCustomers': [
            ', '.join([f"{cid} (similarity: {score:.3f})"
                      for cid, score in lookalikes])
            for lookalikes in lookalike_map.values()
        ]
    })
    # Save the results to CSV
    results_df.to_csv(output_path, index=False)
    print(f"Lookalike recommendations saved to {output_path}")

    return lookalike_map

We create a simple report showing:

Each customer ID -> Their 3 most similar customers -> How similar they are (a cosine similarity score between -1 and 1, where 1 means the two feature vectors point in exactly the same direction)

In [7]:
recommendations = generate_lookalike_recommendations(
    customers_path="Customers.csv",
    transactions_path="Transactions.csv",
    output_path="Lookalike.csv"
)

for customer_id, similar_customers in list(recommendations.items())[:3]:
    print(f"\nCustomer {customer_id} similar customers:")
    for similar_id, score in similar_customers:
        print(f"  {similar_id}: similarity = {score:.3f}")

Lookalike recommendations saved to /content/Lookalike.csv

Customer C0001 similar customers:
  C0137: similarity = 1.000
  C0152: similarity = 1.000
  C0107: similarity = 0.989

Customer C0002 similar customers:
  C0142: similarity = 0.992
  C0177: similarity = 0.974
  C0043: similarity = 0.970

Customer C0003 similar customers:
  C0133: similarity = 0.997
  C0052: similarity = 0.995
  C0192: similarity = 0.968


Conclusion: This model helps businesses to:

Target marketing to similar customers

Find potential customers who might like the same products

Understand customer segments better

Make personalized recommendations
