### Using Cosine Similarity to Identify Lookalike Customers

- To create `NareshRaja_ML_Lookalike.csv`, I used **cosine similarity** to identify customers who resemble each other based on their profile and purchase behavior: region, time since signup, total quantity purchased, total transaction value, and number of distinct products bought.
- Cosine similarity measures the cosine of the angle between two non-zero vectors. It compares the direction of the feature vectors rather than their length, so two customers with the same proportions across features score as similar even when one buys at a much larger scale.
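
To make the "direction, not length" point concrete, here is a minimal sketch of the formula $\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$ in NumPy. The helper `cosine_sim` and the three example customers are hypothetical and are not part of the notebook's pipeline, which uses `sklearn.metrics.pairwise.cosine_similarity`.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical customers: (quantity, total value, distinct products)
small_buyer = [2, 100, 1]
big_buyer = [20, 1000, 10]   # same proportions, ten times the volume
other_buyer = [20, 100, 10]  # different mix across the features

same_direction = cosine_sim(small_buyer, big_buyer)   # vectors are parallel
different_mix = cosine_sim(small_buyer, other_buyer)  # vectors diverge

print(same_direction, different_mix)
```

The first pair points in the same direction and scores essentially 1, while the second pair scores lower because its features are in different proportions.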


In [45]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity
import warnings

warnings.filterwarnings('ignore')

In [46]:
# Load datasets
customers = pd.read_csv('Customers.csv')
transactions = pd.read_csv('Transactions.csv')

In [47]:
# Merge datasets
data = transactions.merge(customers, on='CustomerID')


In [48]:
# Aggregate features for analysis
customer_features = data.groupby('CustomerID').agg({
    'Region': 'first',
    'SignupDate': 'first',
    'Quantity': 'sum',
    'TotalValue': 'sum',
    'ProductID': 'nunique'
}).reset_index()

In [49]:
# Add derived feature
customer_features['SignupDate'] = pd.to_datetime(customer_features['SignupDate'], errors='coerce')
customer_features['YearsSinceSignup'] = (pd.Timestamp.now() - customer_features['SignupDate']).dt.days / 365
customer_features.drop(columns='SignupDate', inplace=True)

In [50]:
# Encode categorical variables
encoder = LabelEncoder()
customer_features['Region'] = encoder.fit_transform(customer_features['Region'])
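
One caveat with the cell above: `LabelEncoder` maps each region to an integer, which implies an ordering and a distance between regions that may not be meaningful for a similarity measure. A possible alternative, sketched here on hypothetical data (the `demo` frame and its region names are illustrative and not taken from `Customers.csv`), is one-hot encoding with `pd.get_dummies`. This is not what the notebook does, so the outputs shown further down would differ if it were swapped in.

```python
import pandas as pd

# Hypothetical stand-in for customer_features; region names are illustrative
demo = pd.DataFrame({
    'CustomerID': ['C0001', 'C0002', 'C0003'],
    'Region': ['Asia', 'Europe', 'Asia'],
    'Quantity': [12, 10, 14],
})

# One indicator column per region instead of a single integer code
one_hot = pd.get_dummies(demo, columns=['Region'], dtype=float)
print(one_hot)
```

With indicator columns, two customers either share a region or they do not, and no region is treated as numerically "closer" to another.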

In [51]:
# Normalize features
scaler = MinMaxScaler()
normalized_features = scaler.fit_transform(customer_features.iloc[:, 1:])

In [52]:
print("Normalized Feature Vector : \n")
normalized_features[:10]

Normalized Feature Vector : 



array([[1.        , 0.35483871, 0.30894178, 0.44444444, 0.74861878],
       [0.        , 0.29032258, 0.16809501, 0.33333333, 0.96593002],
       [1.        , 0.41935484, 0.24954138, 0.33333333, 0.16390424],
       [1.        , 0.70967742, 0.49780626, 0.77777778, 0.77348066],
       [0.        , 0.19354839, 0.18428723, 0.22222222, 0.79742173],
       [1.        , 0.35483871, 0.39137101, 0.33333333, 0.16574586],
       [0.        , 0.22580645, 0.2357983 , 0.22222222, 0.85082873],
       [0.66666667, 0.61290323, 0.39552906, 1.        , 0.32228361],
       [0.33333333, 0.06451613, 0.07686723, 0.22222222, 0.46224678],
       [0.33333333, 0.35483871, 0.15438686, 0.33333333, 0.68508287]])

In [53]:
# Calculate similarity matrix
similarity_matrix = cosine_similarity(normalized_features)

In [54]:
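# Function to find the top 3 lookalikes for each customer
def get_top_3_lookalikes(similarity_matrix, customer_ids):
    lookalikes = {}
    for i, customer_id in enumerate(customer_ids):
        # Rank the other customers by similarity, highest first. The customer
        # itself is skipped explicitly: with ties at 1.0 it is not guaranteed
        # to sort into position 0.
        similar_customers = sorted(
            ((j, score) for j, score in enumerate(similarity_matrix[i]) if j != i),
            key=lambda x: x[1], reverse=True)
        # Cast to plain Python floats so the scores serialize cleanly
        top_3 = [(customer_ids[j], round(float(score), 2)) for j, score in similar_customers[:3]]
        lookalikes[customer_id] = top_3
    return lookalikes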

# Generate lookalikes
customer_ids = customer_features['CustomerID'].values
lookalikes = get_top_3_lookalikes(similarity_matrix, customer_ids)

# Print the lookalikes for the first 10 customers
for customer_id, top_3 in list(lookalikes.items())[:10]:
    print(f"Customer ID: {customer_id}")
    print("Top 3 Lookalikes:")
    for lookalike_id, score in top_3:
        print(f"  - Lookalike: {lookalike_id}, Similarity Score: {score}")
    print("\n")


Customer ID: C0001
Top 3 Lookalikes:
  - Lookalike: C0011, Similarity Score: 1.0
  - Lookalike: C0184, Similarity Score: 1.0
  - Lookalike: C0152, Similarity Score: 0.99


Customer ID: C0002
Top 3 Lookalikes:
  - Lookalike: C0043, Similarity Score: 1.0
  - Lookalike: C0159, Similarity Score: 1.0
  - Lookalike: C0005, Similarity Score: 1.0


Customer ID: C0003
Top 3 Lookalikes:
  - Lookalike: C0190, Similarity Score: 1.0
  - Lookalike: C0036, Similarity Score: 0.99
  - Lookalike: C0191, Similarity Score: 0.99


Customer ID: C0004
Top 3 Lookalikes:
  - Lookalike: C0067, Similarity Score: 1.0
  - Lookalike: C0104, Similarity Score: 1.0
  - Lookalike: C0102, Similarity Score: 1.0


Customer ID: C0005
Top 3 Lookalikes:
  - Lookalike: C0159, Similarity Score: 1.0
  - Lookalike: C0007, Similarity Score: 1.0
  - Lookalike: C0123, Similarity Score: 1.0


Customer ID: C0006
Top 3 Lookalikes:
  - Lookalike: C0048, Similarity Score: 1.0
  - Lookalike: C0137, Similarity Score: 0.99
  - Lookalike: C

### Saving the Results as a `.csv` File

In [55]:
# Save to CSV
lookalike_df = pd.DataFrame({
    'CustomerID': list(lookalikes.keys()),
    'Top3_Lookalikes': [str(lookalikes[c]) for c in lookalikes]
})
lookalike_df.to_csv('NareshRaja_ML_Lookalike.csv', index=False)
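
Because `Top3_Lookalikes` is stored as the string form of a Python list, anyone reading the file back has to parse that column. The sketch below shows one way to do it with `ast.literal_eval`, using an in-memory buffer and two rows copied from the output above instead of the real file. It assumes the scores were written as plain Python floats; the names `demo_df`, `buffer`, and `loaded` are hypothetical.

```python
import ast
import io

import pandas as pd

# Two rows in the same shape as lookalike_df (values from the printed output)
demo_df = pd.DataFrame({
    'CustomerID': ['C0001', 'C0002'],
    'Top3_Lookalikes': [
        str([('C0011', 1.0), ('C0184', 1.0), ('C0152', 0.99)]),
        str([('C0043', 1.0), ('C0159', 1.0), ('C0005', 1.0)]),
    ],
})

# Round-trip through CSV text without touching the file system
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
# Turn each stored string back into a list of (customer_id, score) tuples
loaded['Top3_Lookalikes'] = loaded['Top3_Lookalikes'].apply(ast.literal_eval)
first = loaded.loc[0, 'Top3_Lookalikes']
print(first)
```

If the lookalikes need to be consumed by other tools, a long format with one row per (customer, lookalike, score) triple may be easier to work with than a stringified list.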