# My Approach

Here's an explanation of my approach:

1. Main Idea:
- Find similar customers based on their shopping behavior
- Use buying patterns, amount spent, and types of products bought
- Return the 3 most similar customers for each customer

2. Feature Creation:
- Build a shopping profile for each customer:
  * How much they spend (total and average)
  * How many items they buy
  * How often they shop
  * What kinds of products they like

3. Similarity Method:
- Convert customer shopping patterns into numeric matrices
- Compare customers in 4 different ways:
  * Based on money spent on each product
  * Based on how often they buy each product
  * Based on money spent in each category
  * Based on overall summary features (spend, quantity, number of transactions)
- Combine these comparisons as a weighted sum, giving each a different importance

4. Technical Steps:
- Use truncated SVD to compress the large product matrices
- Scale the summary features (Z-score) so high spenders don't overshadow others
- Calculate the cosine similarity between every pair of customers
- Pick the top 3 most similar customers for each of the first 20 customers
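
The similarity method in steps 3 and 4 can be written compactly. This is only a sketch: the symbols $u_a$, $u_b$, and $S$ are notation introduced here, while the weights are the ones set in the weighting cell of this notebook. For two customers $a$ and $b$ with vectors $u_a$ and $u_b$ in one representation, cosine similarity and the combined score are:

```latex
\cos(u_a, u_b) = \frac{u_a \cdot u_b}{\lVert u_a \rVert \, \lVert u_b \rVert}

S(a, b) = 0.35\, S_{\text{value}}(a, b) + 0.25\, S_{\text{freq}}(a, b)
        + 0.25\, S_{\text{cat}}(a, b) + 0.15\, S_{\text{features}}(a, b)
```

Here $S_{\text{features}}$ is the cosine similarity of the scaled summary features. Because the weights sum to 1 and each term is at most 1, the combined score is also at most 1.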

# Importing Packages 

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from datetime import datetime

# Load data


In [2]:
customers = pd.read_csv('/kaggle/input/e-comm-data/Data/Customers.csv')
products = pd.read_csv('/kaggle/input/e-comm-data/Data/Products.csv')
transactions = pd.read_csv('/kaggle/input/e-comm-data/Data/Transactions.csv')

In [3]:
transactions['TransactionDate'] = pd.to_datetime(transactions['TransactionDate'])

# Data Preprocessing

> Calculating recency (days since each transaction) and adding it to transactions

In [4]:
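# Days since each transaction; datetime.now() makes these values change from run to run
transactions['Recency'] = (datetime.now() - transactions['TransactionDate']).dt.days
# Add 1 so a transaction made today gets a recency of 1 rather than 0
transactions['Recency'] = (transactions['Recency'] + 1).astype(int)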

In [5]:
data = transactions.merge(customers[['CustomerID', 'Region']], on='CustomerID')
data = data.merge(products[['ProductID', 'Category']], on='ProductID')
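
Both merges are inner joins by default, so a transaction whose `CustomerID` or `ProductID` is missing from the lookup table would be dropped silently. The sketch below shows one way to guard against that on toy data. The `*_demo` frames and their values are hypothetical stand-ins, not rows from the real dataset, and it assumes each ID appears once in its lookup table.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values, for illustration only).
transactions_demo = pd.DataFrame({
    "TransactionID": ["T1", "T2", "T3"],
    "CustomerID": ["C1", "C1", "C2"],
    "ProductID": ["P1", "P2", "P1"],
})
customers_demo = pd.DataFrame({"CustomerID": ["C1", "C2"], "Region": ["Asia", "Europe"]})
products_demo = pd.DataFrame({"ProductID": ["P1", "P2"], "Category": ["Books", "Toys"]})

# validate="many_to_one" raises MergeError if a key is duplicated on the right side.
merged_demo = transactions_demo.merge(customers_demo, on="CustomerID", validate="many_to_one")
merged_demo = merged_demo.merge(products_demo, on="ProductID", validate="many_to_one")

# An inner join keeps only matching keys, so compare row counts to spot dropped rows.
rows_dropped = len(transactions_demo) - len(merged_demo)
print(f"Rows dropped by the merges: {rows_dropped}")
```

If `rows_dropped` were non-zero on the real data, some transactions would be excluded from every similarity matrix built afterwards.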

# Feature Engineering

> Creating per-customer summary features and customer-product interaction matrices

In [6]:
# Per-customer aggregates; std is NaN for customers with a single transaction, hence fillna(0)
customer_features = data.groupby('CustomerID').agg({
    'TotalValue': ['sum', 'mean', 'std'],
    'Quantity': ['sum', 'mean'],
    'TransactionDate': ['count', lambda x: (x.max() - x.min()).days],
    'ProductID': 'nunique'
}).fillna(0)


In [7]:
customer_features.columns = [
    'total_spend', 'avg_spend', 'std_spend',
    'total_quantity', 'avg_quantity',
    'transaction_count', 'purchase_span', 'unique_products'
]

In [8]:
# Total amount spent by each customer on each product
value_matrix = data.pivot_table(
    index='CustomerID',
    columns='ProductID',
    values='TotalValue',
    aggfunc='sum',
    fill_value=0
)

In [9]:
# Number of transactions per customer per product (counts rows, not units bought)
freq_matrix = data.pivot_table(
    index='CustomerID',
    columns='ProductID',
    values='Quantity',
    aggfunc='count',
    fill_value=0
)

In [10]:
# Total amount spent by each customer in each product category
cat_matrix = data.pivot_table(
    index='CustomerID',
    columns='Category',
    values='TotalValue',
    aggfunc='sum',
    fill_value=0
)
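
The value and frequency matrices differ only in the aggregation function, which is easy to misread. A minimal sketch on made-up rows (the `demo` frame is hypothetical) shows that `aggfunc='sum'` adds up `TotalValue`, while `aggfunc='count'` on `Quantity` counts transactions rather than units bought:

```python
import pandas as pd

# Hypothetical rows: customer C1 buys product P1 in two separate transactions.
demo = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "ProductID": ["P1", "P1", "P2"],
    "Quantity": [2, 3, 1],
    "TotalValue": [20.0, 30.0, 5.0],
})

value_demo = demo.pivot_table(
    index="CustomerID", columns="ProductID",
    values="TotalValue", aggfunc="sum", fill_value=0,
)
freq_demo = demo.pivot_table(
    index="CustomerID", columns="ProductID",
    values="Quantity", aggfunc="count", fill_value=0,
)

print(value_demo.loc["C1", "P1"])  # summed spend across both transactions
print(freq_demo.loc["C1", "P1"])   # number of transactions, not the 5 units bought
```

Customer and product pairs with no transactions are filled with 0 in both matrices.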

# SVD decomposition


In [11]:
# Cap the number of components so it stays below the number of columns in each matrix
n_components_value = min(20, value_matrix.shape[1] - 1)
n_components_freq = min(20, freq_matrix.shape[1] - 1)
n_components_cat = min(3, cat_matrix.shape[1] - 1)

In [12]:
svd_value = TruncatedSVD(n_components=n_components_value, random_state=42)
svd_freq = TruncatedSVD(n_components=n_components_freq, random_state=42)
svd_cat = TruncatedSVD(n_components=n_components_cat, random_state=42)

In [13]:
latent_value = svd_value.fit_transform(value_matrix)
latent_freq = svd_freq.fit_transform(freq_matrix)
latent_cat = svd_cat.fit_transform(cat_matrix)
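
One way to judge whether the chosen number of components is reasonable is to look at how much variance the reduced matrix retains. The sketch below does this on a random toy matrix. The names `toy_matrix`, `svd_demo`, and `retained` are hypothetical, and the numbers say nothing about the real data.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)
toy_matrix = rng.random((30, 12))  # 30 hypothetical customers x 12 products

svd_demo = TruncatedSVD(n_components=5, random_state=42)
latent_demo = svd_demo.fit_transform(toy_matrix)

# Fraction of the original variance captured by the 5 retained components.
retained = svd_demo.explained_variance_ratio_.sum()
print(latent_demo.shape, round(float(retained), 3))
```

The same attribute is available on `svd_value`, `svd_freq`, and `svd_cat` after fitting. A low retained fraction would suggest trying more components.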

# Scale features

In [14]:
scaler = StandardScaler()
scaled_features = scaler.fit_transform(customer_features)
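
To illustrate what the scaling step does, here is a minimal sketch on made-up numbers. The `toy_features` array is hypothetical: its first column stands for total spend and its second for transaction count.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: the third spends far more than the other two.
toy_features = np.array([
    [100.0, 2.0],
    [200.0, 4.0],
    [9000.0, 3.0],
])

toy_scaled = StandardScaler().fit_transform(toy_features)

# After Z-scoring, every column has mean 0 and standard deviation 1,
# so the spend column no longer dominates by sheer magnitude.
print(toy_scaled.mean(axis=0).round(6))
print(toy_scaled.std(axis=0).round(6))
```

Without this step, cosine similarity on the raw features would be driven almost entirely by the largest-valued column.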

# Calculate similarities

In [15]:
sim_value = cosine_similarity(latent_value)
sim_freq = cosine_similarity(latent_freq)
sim_cat = cosine_similarity(latent_cat)
sim_features = cosine_similarity(scaled_features)

In [16]:
# Combine similarities with weights
weights = np.array([0.35, 0.25, 0.25, 0.15])
similarity = (weights[0] * sim_value + 
             weights[1] * sim_freq + 
             weights[2] * sim_cat + 
             weights[3] * sim_features)
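
A quick sanity check on the weighted combination, sketched on random toy vectors (the `blocks`, `sims`, and `combined` names are hypothetical): because the weights sum to 1, each customer's similarity to itself stays at 1 and the combined matrix remains symmetric.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
# Four toy representations of 6 customers; the +0.1 avoids all-zero rows.
blocks = [rng.random((6, 4)) + 0.1 for _ in range(4)]
sims = [cosine_similarity(block) for block in blocks]

weights = np.array([0.35, 0.25, 0.25, 0.15])
combined = sum(w * s for w, s in zip(weights, sims))

print(np.isclose(weights.sum(), 1.0))
print(np.allclose(np.diag(combined), 1.0))
print(np.allclose(combined, combined.T))
```

If any customer had an all-zero row in one of the real matrices, that matrix would contribute 0 to the customer's self-similarity, and the diagonal check would fail for that customer.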

# Generate recommendations


In [17]:
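results = {}
for i, customer_id in enumerate(value_matrix.index[:20]):
    # Rank all customers by similarity, then drop the customer itself explicitly
    ranked = np.argsort(similarity[i])[::-1]
    ranked = ranked[ranked != i][:3]
    similar_customers = [
        # float() keeps plain Python floats so str() output stays clean under NumPy 2
        (value_matrix.index[j], round(float(similarity[i, j]), 4))
        for j in ranked
    ]
    results[customer_id] = similar_customers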

# Save results

In [18]:
output_df = pd.DataFrame({
    'CustomerID': list(results.keys()),
    'Lookalikes': [str(v) for v in results.values()]
})
output_df.to_csv('FirstName_LastName_Lookalike.csv', index=False)
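
Because the `Lookalikes` column is stored as the string form of a list of tuples, anyone reading the file has to parse it back. The sketch below shows one way to do that with `ast.literal_eval`. It uses an in-memory buffer instead of a file, the `demo_*` names are hypothetical, and it assumes the scores were written as plain Python floats.

```python
import ast
import io

import pandas as pd

# One row in the same shape as the saved output (scores as plain floats).
demo_results = {"C0001": [("C0069", 0.6895), ("C0105", 0.6723), ("C0050", 0.6069)]}
demo_df = pd.DataFrame({
    "CustomerID": list(demo_results.keys()),
    "Lookalikes": [str(v) for v in demo_results.values()],
})

# Write to an in-memory buffer rather than disk, then read it back.
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# literal_eval safely turns the string back into a list of (id, score) tuples.
loaded["Lookalikes"] = loaded["Lookalikes"].apply(ast.literal_eval)
first_id, first_score = loaded.loc[0, "Lookalikes"][0]
print(first_id, first_score)
```

`ast.literal_eval` only accepts Python literals, so it would fail on strings containing NumPy scalar representations such as `np.float64(...)`.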

In [19]:
for cust_id, lookalikes in results.items():
    print(f"\nCustomer {cust_id}:")
    for similar_cust, score in lookalikes:
        print(f"  Similar customer: {similar_cust}, Score: {score}")


Customer C0001:
  Similar customer: C0069, Score: 0.6895
  Similar customer: C0105, Score: 0.6723
  Similar customer: C0050, Score: 0.6069

Customer C0002:
  Similar customer: C0109, Score: 0.7196
  Similar customer: C0076, Score: 0.6866
  Similar customer: C0178, Score: 0.6692

Customer C0003:
  Similar customer: C0031, Score: 0.6686
  Similar customer: C0144, Score: 0.6324
  Similar customer: C0181, Score: 0.623

Customer C0004:
  Similar customer: C0075, Score: 0.7101
  Similar customer: C0041, Score: 0.6577
  Similar customer: C0053, Score: 0.6542

Customer C0005:
  Similar customer: C0192, Score: 0.837
  Similar customer: C0088, Score: 0.768
  Similar customer: C0072, Score: 0.7376

Customer C0006:
  Similar customer: C0040, Score: 0.7667
  Similar customer: C0182, Score: 0.7275
  Similar customer: C0057, Score: 0.6157

Customer C0007:
  Similar customer: C0020, Score: 0.7289
  Similar customer: C0140, Score: 0.7138
  Similar customer: C0085, Score: 0.6769

Customer C0008:
  Simi