### **Goal 🎯**
Build a Lookalike Model to recommend the top 3 similar customers for each of the first 20 customers (C0001 - C0020) based on their profile and transaction history. The model uses both customer and product information and assigns a similarity score.

---

### **Steps 📝**

1. **Data Preprocessing 📊**:
   - Merge the `Customers.csv`, `Products.csv`, and `Transactions.csv` datasets.
   
2. **Feature Engineering ⚙️**:
   - Extract customer spending, number of purchases, and product categories.

3. **Normalize Data ⚖️**:
   - Standardize spending and purchase frequency.

4. **Cosine Similarity 🔍**:
   - Calculate similarity between customers using their features.

5. **Top 3 Lookalikes 🔝**:
   - For each of the first 20 customers, find the top 3 most similar customers.

6. **Save Results 📂**:
   - Save the results to `Lookalike.csv` with each customer ID, lookalike ID, and similarity score.

### Importing Libraries

In [None]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler
import numpy as np

In [None]:
# Load datasets
customers_df = pd.read_csv('/content/drive/MyDrive/Zeotap/Customers.csv')
products_df = pd.read_csv('/content/drive/MyDrive/Zeotap/Products.csv')
transactions_df = pd.read_csv('/content/drive/MyDrive/Zeotap/Transactions.csv')

### Data Preprocessing

In [None]:
merged_df = pd.merge(transactions_df, customers_df, on="CustomerID", how="inner")
df = pd.merge(merged_df, products_df, on="ProductID", how="inner")
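
Because both merges use `how="inner"`, any customer without a transaction drops out of `df` silently. The sketch below shows one way to detect that, using made-up stand-in frames rather than the real CSV files; the names `customers`, `transactions`, and `no_transactions` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the real CSV files (values are made up for illustration).
customers = pd.DataFrame({"CustomerID": ["C0001", "C0002", "C0003"],
                          "Region": ["Asia", "Europe", "Asia"]})
transactions = pd.DataFrame({"TransactionID": ["T1", "T2", "T3"],
                             "CustomerID": ["C0001", "C0001", "C0003"],
                             "TotalValue": [100.0, 50.0, 75.0]})

# validate="many_to_one" raises MergeError if CustomerID is duplicated in `customers`.
# indicator=True adds a `_merge` column showing which side each row came from.
check = pd.merge(transactions, customers, on="CustomerID",
                 how="outer", validate="many_to_one", indicator=True)

# Customers that would vanish from an inner merge because they have no transactions.
no_transactions = check.loc[check["_merge"] == "right_only", "CustomerID"].tolist()
print(no_transactions)
```

If this list is non-empty on the real data, the feature table will have fewer rows than `customers_df`, which matters whenever the two are indexed by position.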

In [None]:
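# Both Transactions and Products carry a 'Price' column, so the merge suffixed
# them as Price_x (transactions) and Price_y (products). Keep a single 'Price'.
df['Price'] = df['Price_x']
df = df.drop(['Price_x', 'Price_y'], axis=1)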

In [None]:
df.head(5)

| | TransactionID | CustomerID | ProductID | TransactionDate | Quantity | TotalValue | CustomerName | Region | SignupDate | ProductName | Category | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T00001 | C0199 | P067 | 2024-08-25 12:38:23 | 1 | 300.68 | Andrea Jenkins | Europe | 2022-12-03 | ComfortLiving Bluetooth Speaker | Electronics | 300.68 |
| 1 | T00112 | C0146 | P067 | 2024-05-27 22:23:54 | 1 | 300.68 | Brittany Harvey | Asia | 2024-09-04 | ComfortLiving Bluetooth Speaker | Electronics | 300.68 |
| 2 | T00166 | C0127 | P067 | 2024-04-25 07:38:55 | 1 | 300.68 | Kathryn Stevens | Europe | 2024-04-04 | ComfortLiving Bluetooth Speaker | Electronics | 300.68 |
| 3 | T00272 | C0087 | P067 | 2024-03-26 22:55:37 | 2 | 601.36 | Travis Campbell | South America | 2024-04-11 | ComfortLiving Bluetooth Speaker | Electronics | 300.68 |
| 4 | T00363 | C0070 | P067 | 2024-03-21 15:10:10 | 3 | 902.04 | Timothy Perez | Europe | 2022-03-15 | ComfortLiving Bluetooth Speaker | Electronics | 300.68 |


### Feature Engineering

In [None]:
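# One row per customer: summed unit prices and number of transaction lines.
# Note: 'Price' is the unit price, so this sum ignores Quantity; aggregating
# 'TotalValue' instead would give the amount actually spent.
customer_features = df.groupby('CustomerID').agg({
    'Price': 'sum',
    'ProductID': 'count'
}).reset_index()

customer_features.rename(columns={'Price': 'TotalSpending', 'ProductID': 'TotalPurchases'}, inplace=True)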
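
The steps list mentions product categories, while the aggregation here uses only spending and purchase counts. As a rough sketch of how category information could be turned into features, the example below computes each customer's share of spend per category. It runs on made-up rows that reuse the `CustomerID`, `Category`, and `TotalValue` column names from the merged frame; `toy_df`, `category_spend`, and `category_share` are hypothetical names, and this is not part of the pipeline as written.

```python
import pandas as pd

# Toy rows with the same column names as the merged `df` (values are made up).
toy_df = pd.DataFrame({
    "CustomerID": ["C0001", "C0001", "C0002", "C0002", "C0002"],
    "Category":   ["Electronics", "Books", "Books", "Books", "Clothing"],
    "TotalValue": [300.0, 100.0, 50.0, 150.0, 200.0],
})

# Spend per customer and category, with 0 where a customer bought nothing.
category_spend = toy_df.pivot_table(index="CustomerID", columns="Category",
                                    values="TotalValue", aggfunc="sum", fill_value=0)

# Convert to shares so each row sums to 1, independent of total spend.
category_share = category_spend.div(category_spend.sum(axis=1), axis=0)
print(category_share.round(2))
```

Shares like these could be joined to the per-customer table on `CustomerID` before scaling, if category preferences should influence the similarity score.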


In [None]:
scaler = StandardScaler()
scaled_features = scaler.fit_transform(customer_features[['TotalSpending', 'TotalPurchases']])
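
To make the standardization step concrete, the following small check reproduces `StandardScaler` by hand on made-up numbers: each column has its mean subtracted and is divided by its population standard deviation. The array `X` is invented for illustration and does not come from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up [TotalSpending, TotalPurchases] rows for four customers.
X = np.array([[100.0, 1.0], [200.0, 2.0], [300.0, 3.0], [400.0, 6.0]])

scaled = StandardScaler().fit_transform(X)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, both columns have mean 0 and unit variance, so spending (in currency units) no longer dominates purchase counts purely because of its larger magnitude.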


In [None]:
similarity_matrix = cosine_similarity(scaled_features)
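
As a sanity check on what `cosine_similarity` returns, this sketch compares it with a hand-rolled version on three made-up feature rows: normalize every row to unit length, then take pairwise dot products. The array `F` is illustrative only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up standardized feature rows for three customers.
F = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, 1.0]])

sim = cosine_similarity(F)

# By hand: normalize each row to unit length, then take dot products.
unit = F / np.linalg.norm(F, axis=1, keepdims=True)
manual = unit @ unit.T

print(np.allclose(sim, manual))
print(sim.round(3))
```

Note that the first two rows get a similarity of 1 even though one is twice the other: cosine similarity compares direction only, not magnitude. With just two features, many customer pairs can therefore end up with scores very close to 1.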

In [None]:
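lookalike_data = []

# Rows of similarity_matrix follow customer_features (one row per customer with
# transactions), so customer IDs are looked up there rather than in customers_df.
customer_ids = customer_features['CustomerID'].to_numpy()

for i in range(20):
    similarity_scores = similarity_matrix[i]

    # Take the 4 highest scores and drop the last one, which is normally the
    # customer's own score of 1.0, leaving the top 3 in descending order.
    top_3_indices = similarity_scores.argsort()[-4:-1][::-1]

    lookalike_data.extend([
        (customer_ids[i], customer_ids[j], similarity_scores[j])
        for j in top_3_indices
    ])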
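
The slicing approach relies on each customer's own score sorting last and on row positions lining up with customer IDs. A slightly more defensive variant is sketched below, under the assumption that a list of IDs labels the rows of the similarity matrix. The helper `top_k_lookalikes` is hypothetical and the 4x4 matrix is made up; it masks the customer's own entry explicitly and looks targets up by ID rather than by position.

```python
import numpy as np

def top_k_lookalikes(sim, ids, targets, k=3):
    """Return (target, lookalike, score) rows; `ids[i]` labels row i of `sim`."""
    position = {cid: i for i, cid in enumerate(ids)}
    rows = []
    for cid in targets:
        i = position[cid]
        scores = sim[i].astype(float).copy()
        scores[i] = -np.inf                      # never recommend the customer themself
        best = np.argsort(scores)[::-1][:k]      # highest scores first
        rows.extend((cid, ids[j], float(sim[i, j])) for j in best)
    return rows

# Made-up 4x4 similarity matrix; C0002 is absent, as it would be with no transactions.
ids = ["C0001", "C0003", "C0004", "C0005"]
sim = np.array([[1.0, 0.9, 0.2, 0.5],
                [0.9, 1.0, 0.3, 0.4],
                [0.2, 0.3, 1.0, 0.8],
                [0.5, 0.4, 0.8, 1.0]])
result = top_k_lookalikes(sim, ids, ["C0001"], k=2)
print(result)
```

Masking the diagonal entry avoids a subtle failure when another customer ties with the self-score, in which case a fixed slice could drop a genuine match and keep the customer themself.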


In [None]:
lookalike_df = pd.DataFrame(lookalike_data, columns=['CustomerID', 'LookalikeID', 'SimilarityScore'])

In [None]:
lookalike_df.head(3)

| | CustomerID | LookalikeID | SimilarityScore |
|---|---|---|---|
| 0 | C0001 | C0149 | 0.998943 |
| 1 | C0001 | C0164 | 0.994865 |
| 2 | C0001 | C0069 | 0.989268 |


In [None]:
lookalike_df.to_csv('Lookalike.csv', index=False)
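
Before sharing the file, a few structural checks can catch common mistakes. The sketch below runs on made-up rows in the same three-column layout; `check_lookalikes` is a hypothetical helper, and the small tolerance on the score range is an assumption to allow for floating-point rounding.

```python
import pandas as pd

# Made-up rows in the same three-column layout as `lookalike_df`.
toy = pd.DataFrame({
    "CustomerID":  ["C0001"] * 3 + ["C0002"] * 3,
    "LookalikeID": ["C0005", "C0007", "C0009", "C0004", "C0001", "C0008"],
    "SimilarityScore": [0.99, 0.95, 0.90, 0.98, 0.97, 0.60],
})

def check_lookalikes(frame, k=3, tol=1e-9):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if not (frame.groupby("CustomerID").size() == k).all():
        problems.append(f"some customers do not have exactly {k} lookalikes")
    if (frame["CustomerID"] == frame["LookalikeID"]).any():
        problems.append("a customer is listed as their own lookalike")
    if not frame["SimilarityScore"].between(-1 - tol, 1 + tol).all():
        problems.append("similarity score outside [-1, 1]")
    return problems

print(check_lookalikes(toy))
```

The same function could be applied to the real results, for example after reading `Lookalike.csv` back with `pd.read_csv`.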


In [None]:
print(lookalike_df.head())

  CustomerID LookalikeID  SimilarityScore
0      C0001       C0149         0.998943
1      C0001       C0164         0.994865
2      C0001       C0069         0.989268
3      C0002       C0029         0.999994
4      C0002       C0181         0.999306


In [28]:
from IPython.display import FileLink
FileLink('Lookalike.csv')
