

## Objective
For each customer, the Lookalike Model identifies the three most similar customers based on profile and transaction history, drawing on both customer and product data.

## Workflow

### 1. Data Loading and Exploration
- Load the data from `Customers.csv`, `Products.csv`, and `Transactions.csv`.
- Inspect the datasets for structure, relationships, and missing values.
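
The inspection step above can be sketched as follows. This is a minimal illustration on small in-memory frames with made-up values, not the real CSV contents; the helper name `summarize` is hypothetical. It reports missing values per column and checks whether every key in the transactions table has a match in the lookup tables.

```python
import pandas as pd

# Toy stand-ins for the three CSV files (hypothetical values, not the real data)
customers = pd.DataFrame({
    "CustomerID": ["C0001", "C0002", "C0003"],
    "Region": ["Asia", None, "Europe"],
})
products = pd.DataFrame({
    "ProductID": ["P001", "P002"],
    "Category": ["Books", "Electronics"],
    "Price": [10.0, 250.0],
})
transactions = pd.DataFrame({
    "TransactionID": ["T1", "T2", "T3"],
    "CustomerID": ["C0001", "C0001", "C0004"],
    "ProductID": ["P001", "P002", "P002"],
    "Quantity": [2, 1, 1],
    "TotalValue": [20.0, 250.0, 250.0],
})

def summarize(df: pd.DataFrame) -> dict:
    """Return the frame's shape and the number of missing values per column."""
    return {
        "shape": df.shape,
        "missing": df.isna().sum().to_dict(),
    }

# Keys used in transactions that have no match in the lookup tables
orphan_customers = set(transactions["CustomerID"]) - set(customers["CustomerID"])
orphan_products = set(transactions["ProductID"]) - set(products["ProductID"])

print(summarize(customers))
print("Unmatched customers:", orphan_customers)
print("Unmatched products:", orphan_products)
```

Unmatched keys matter here because an inner merge silently drops those transactions.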

### 2. Data Preprocessing
- Merge the datasets to create a unified view combining customer, product, and transaction information.
- Address missing values and standardize the format for further analysis.
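
A minimal sketch of the merge and clean-up described above, again on toy frames with invented values. The missing-value policy shown (label missing regions as `"Unknown"`, drop rows with no `TotalValue`) is an assumption for illustration, not necessarily what the notebook does.

```python
import pandas as pd

# Toy frames (hypothetical values)
transactions = pd.DataFrame({
    "TransactionID": ["T1", "T2", "T3"],
    "CustomerID": ["C0001", "C0002", "C0002"],
    "ProductID": ["P001", "P002", "P001"],
    "TotalValue": [20.0, None, 10.0],
})
customers = pd.DataFrame({
    "CustomerID": ["C0001", "C0002"],
    "Region": ["Asia", None],
})
products = pd.DataFrame({
    "ProductID": ["P001", "P002"],
    "Category": ["Books", "Electronics"],
})

# validate="many_to_one" raises if a key is duplicated in the lookup table,
# which would otherwise silently multiply transaction rows
full_data = (
    transactions
    .merge(customers, on="CustomerID", how="inner", validate="many_to_one")
    .merge(products, on="ProductID", how="inner", validate="many_to_one")
)

# One possible policy: label missing categoricals, drop rows missing the amount
full_data["Region"] = full_data["Region"].fillna("Unknown")
full_data = full_data.dropna(subset=["TotalValue"]).reset_index(drop=True)

print(full_data)
```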

### 3. Feature Engineering
- Generate meaningful features like average purchase value, unique products purchased, and preferred product category.
- Encode categorical variables and scale numerical features to ensure consistency in similarity calculations.
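
The features named above could be derived roughly as follows. This is a sketch on a toy merged table; the feature names `avg_purchase_value`, `unique_products`, and `preferred_category` are illustrative, and the min-max scaling is written out by hand to show what the scaler computes.

```python
import pandas as pd

# Toy merged table (hypothetical values)
full_data = pd.DataFrame({
    "CustomerID": ["C0001", "C0001", "C0001", "C0002", "C0002"],
    "ProductID": ["P001", "P002", "P003", "P003", "P003"],
    "Category": ["Books", "Books", "Electronics", "Electronics", "Electronics"],
    "TotalValue": [20.0, 30.0, 100.0, 100.0, 300.0],
})

# One row per customer
profiles = full_data.groupby("CustomerID").agg(
    avg_purchase_value=("TotalValue", "mean"),
    unique_products=("ProductID", "nunique"),
    preferred_category=("Category", lambda s: s.mode().iloc[0]),  # most frequent category
).reset_index()

# One-hot encode the categorical feature
features = pd.get_dummies(
    profiles, columns=["preferred_category"], prefix="category", dtype=float
)

# Min-max scale the numeric features to [0, 1]: (x - min) / (max - min)
numeric_cols = ["avg_purchase_value", "unique_products"]
col_min = features[numeric_cols].min()
col_range = (features[numeric_cols].max() - col_min).replace(0, 1)  # avoid division by zero
features[numeric_cols] = (features[numeric_cols] - col_min) / col_range

print(features)
```

Scaling matters because an unscaled column with a large range, such as total spend, would otherwise dominate the similarity score.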

### 4. Similarity Computation
- Compute pairwise similarity between customer feature vectors using cosine similarity (equivalently, one minus cosine distance).
- Create a similarity matrix to rank other customers for each individual.
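
As a minimal sketch of the similarity matrix, the cosine similarity of two vectors is their dot product divided by the product of their lengths, so normalizing each row to unit length and multiplying the matrix by its transpose yields all pairwise values at once. The function name `cosine_similarity_matrix` and the toy matrix are illustrative.

```python
import numpy as np

def cosine_similarity_matrix(X: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows as zeros instead of dividing by zero
    unit = X / norms
    return unit @ unit.T

# Toy feature matrix: one row per customer (hypothetical values)
X = np.array([
    [1.0, 0.0],
    [2.0, 0.0],
    [0.0, 3.0],
    [1.0, 1.0],
])

sim = cosine_similarity_matrix(X)
print(np.round(sim, 4))
```

Rows 0 and 1 point in the same direction, so their similarity is 1 even though their magnitudes differ; rows 0 and 2 are orthogonal, so their similarity is 0.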

### 5. Recommendations
- For each customer, identify the top 3 most similar customers and assign similarity scores.
- Compile the results into a `Lookalike.csv` file.
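
One way to turn a similarity matrix into top-3 recommendations is sketched below. The matrix values are invented, and the long-format output (columns `Rank`, `LookalikeID`, `Score`) is an assumed layout rather than the one this notebook writes.

```python
import numpy as np
import pandas as pd

customer_ids = ["C0001", "C0002", "C0003", "C0004", "C0005"]

# Hypothetical symmetric similarity matrix (row i, column j = similarity of i and j)
sim = np.array([
    [1.00, 0.90, 0.20, 0.70, 0.50],
    [0.90, 1.00, 0.30, 0.60, 0.40],
    [0.20, 0.30, 1.00, 0.10, 0.80],
    [0.70, 0.60, 0.10, 1.00, 0.35],
    [0.50, 0.40, 0.80, 0.35, 1.00],
])

def top_k_lookalikes(sim: np.ndarray, ids: list, k: int = 3) -> pd.DataFrame:
    """Return the k most similar other customers for every customer."""
    scores = sim.astype(float).copy()
    np.fill_diagonal(scores, -np.inf)  # a customer is never their own lookalike
    order = np.argsort(-scores, axis=1)[:, :k]  # column indices, highest score first
    rows = []
    for i, cid in enumerate(ids):
        for rank, j in enumerate(order[i], start=1):
            rows.append({
                "CustomerID": cid,
                "Rank": rank,
                "LookalikeID": ids[j],
                "Score": round(float(scores[i, j]), 4),
            })
    return pd.DataFrame(rows)

lookalikes = top_k_lookalikes(sim, customer_ids)
print(lookalikes.head(6))
# lookalikes.to_csv("Lookalike.csv", index=False)  # uncomment to write the file
```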

## Business Insights
This model enables personalized marketing strategies, customer segmentation, and targeted recommendations. By understanding customer preferences, businesses can enhance engagement and optimize cross-sell or up-sell opportunities.


In [77]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import warnings
warnings.filterwarnings("ignore")

# Load the datasets
customers = pd.read_csv(r"C:\Users\krish\Downloads\Customers.csv")
products = pd.read_csv(r"C:\Users\krish\Downloads\Products.csv")
transactions = pd.read_csv(r"C:\Users\krish\Downloads\Transactions.csv")

In [81]:
# Merge datasets
transactions_customers = transactions.merge(customers, on="CustomerID", how="inner")
full_data = transactions_customers.merge(products, on="ProductID", how="inner")

# Aggregate data to create customer profiles
customer_profiles = full_data.groupby('CustomerID').agg(
    total_spent=('TotalValue', 'sum'),
    total_quantity=('Quantity', 'sum'),
    avg_price=('Price_y', 'mean'),
    product_diversity=('ProductID', 'nunique'),  # Unique products purchased
    purchase_frequency=('TransactionID', 'count')  # Number of transactions
).reset_index()

In [83]:
# One-hot encode product categories and regions.
# Use a separate encoder for each column so that each keeps its own fitted categories_.
category_encoder = OneHotEncoder(sparse_output=False)
region_encoder = OneHotEncoder(sparse_output=False)
category_encoded = category_encoder.fit_transform(full_data[['Category']])
region_encoded = region_encoder.fit_transform(full_data[['Region']])

In [85]:
# Add encoded features to customer profiles
category_columns = [f"category_{cat}" for cat in category_encoder.categories_[0]]
region_columns = [f"region_{reg}" for reg in region_encoder.categories_[0]]
category_df = pd.DataFrame(category_encoded, columns=category_columns)
region_df = pd.DataFrame(region_encoded, columns=region_columns)

category_aggregated = pd.concat([full_data['CustomerID'], category_df], axis=1).groupby('CustomerID').sum().reset_index()
region_aggregated = pd.concat([full_data['CustomerID'], region_df], axis=1).groupby('CustomerID').sum().reset_index()

customer_profiles = customer_profiles.merge(category_aggregated, on='CustomerID', how='left')
customer_profiles = customer_profiles.merge(region_aggregated, on='CustomerID', how='left')

In [87]:
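# Scale every feature column to the [0, 1] range (CustomerID is excluded)
feature_cols = customer_profiles.columns[1:]
scaler = MinMaxScaler()
# Assign by column label so integer columns are replaced by their scaled float values
customer_profiles[feature_cols] = scaler.fit_transform(customer_profiles[feature_cols])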


In [89]:
# Prepare data for KNN
X = customer_profiles.iloc[:, 1:].values

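# Fit a nearest-neighbors model using cosine distance.
# n_neighbors=4 because each customer's nearest neighbor is normally the customer itself,
# which leaves 3 lookalikes after the self-match is removed.
knn = NearestNeighbors(n_neighbors=4, metric='cosine')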
knn.fit(X)

# Find top 3 lookalikes for the first 20 customers (C0001 to C0020)
first_20_ids = [f"C{i:04d}" for i in range(1, 21)]
first_20_customers = customer_profiles[customer_profiles['CustomerID'].isin(first_20_ids)]
# customer_profiles has a default RangeIndex after the merges, so index labels match row positions in X
first_20_indices = first_20_customers.index.tolist()
distances, indices = knn.kneighbors(X[first_20_indices])

In [91]:
# Create a lookalike map
lookalike_map = {}
for i, customer_idx in enumerate(first_20_indices):
    customer_id = customer_profiles.iloc[customer_idx]['CustomerID']
    similar_customers = [
        (customer_profiles.iloc[idx]['CustomerID'], round(float(1 - distances[i][j]), 4))  # 1 - cosine distance = similarity; float() keeps a plain Python number in the saved string
        for j, idx in enumerate(indices[i]) if idx != customer_idx
    ][:3]  # Top 3 lookalikes
    lookalike_map[customer_id] = similar_customers


In [93]:
# Convert lookalike map into a DataFrame for saving
lookalike_df = pd.DataFrame({
    "CustomerID": lookalike_map.keys(),
    "Lookalikes": [str(similar) for similar in lookalike_map.values()]
})
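
Because the `Lookalikes` column stores each list as its string representation, anyone reading the CSV back has to parse that string before using the IDs or scores. A minimal sketch using the standard library's `ast.literal_eval`, with a made-up cell value in the same shape:

```python
import ast

# Hypothetical cell value shaped like one entry of the "Lookalikes" column
raw = "[('C0002', 0.91), ('C0003', 0.85), ('C0004', 0.8)]"

# literal_eval only evaluates Python literals, so it is safer than eval()
pairs = ast.literal_eval(raw)

best_id, best_score = pairs[0]
print(best_id, best_score)
```

This only works when the stored scores are plain numbers; a long-format CSV with one row per lookalike would avoid the parsing step entirely.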


In [109]:
lookalike_df.to_csv(r"C:\Users\krish\Desktop\Krishnaveni_Nenavath_Lookalike1.csv", index=False)


In [111]:
res = pd.read_csv(r"C:\Users\krish\Desktop\Krishnaveni_Nenavath_Lookalike1.csv")

In [113]:
res

,CustomerID,Lookalikes
0,C0001,"[('C0112', 0.9804), ('C0190', 0.9687), ('C0048..."
1,C0002,"[('C0106', 0.9768), ('C0134', 0.976), ('C0145'..."
2,C0003,"[('C0195', 0.9892), ('C0031', 0.9849), ('C0039..."
3,C0004,"[('C0113', 0.9877), ('C0012', 0.9772), ('C0104..."
4,C0005,"[('C0007', 0.9946), ('C0146', 0.9738), ('C0140..."
5,C0006,"[('C0187', 0.9887), ('C0082', 0.9656), ('C0011..."
6,C0007,"[('C0005', 0.9946), ('C0140', 0.9808), ('C0146..."
7,C0008,"[('C0189', 0.9729), ('C0098', 0.9715), ('C0194..."
8,C0009,"[('C0198', 0.9763), ('C0061', 0.9292), ('C0074..."
9,C0010,"[('C0111', 0.9287), ('C0141', 0.9002), ('C0062..."
