In [12]:
from sklearn.neighbors import NearestNeighbors
import pandas as pd

# Load datasets
customers = pd.read_csv('https://drive.google.com/uc?export=download&id=1bu_--mo79VdUG9oin4ybfFGRUSXAe-WE')
transactions = pd.read_csv('https://drive.google.com/uc?export=download&id=1saEqdbBB-vuk2hxoAf4TzDEsykdKlzbF')

# Feature engineering
customer_features = (
    transactions.groupby('CustomerID')
    .agg({'TotalValue': 'sum', 'TransactionID': 'count'})
    .rename(columns={'TotalValue': 'TotalSpent', 'TransactionID': 'NumTransactions'})
)

# Merge with customer data
customer_data = customers.set_index('CustomerID').join(customer_features).fillna(0)

# Standardize the numeric features (z-score: zero mean, unit standard deviation)
numeric_features = ['TotalSpent', 'NumTransactions']
features = customer_data[numeric_features]
customer_data_normalized = (features - features.mean()) / features.std()

# Build k-NN model
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')
model_knn.fit(customer_data_normalized)

# Find lookalikes for the first 20 customers
lookalike_results = {}
for cust_id in customer_data_normalized.index[:20]:
    # Query as a one-row DataFrame so the feature names match those used in fit()
    query = customer_data_normalized.loc[[cust_id]]
    # Ask for 4 neighbors: the customer itself is normally among them at distance 0
    distances, indices = model_knn.kneighbors(query, n_neighbors=4)
    neighbors = zip(customer_data_normalized.index[indices[0]], distances[0])
    # Drop the query customer explicitly (ties at distance 0 can change the order), keep the top 3
    similar_customers = [(cid, dist) for cid, dist in neighbors if cid != cust_id][:3]
    lookalike_results[cust_id] = similar_customers

# Save results to CSV
lookalike_df = pd.DataFrame([(k, [f"{c[0]}:{c[1]}" for c in v]) for k, v in lookalike_results.items()], columns=['CustomerID', 'Lookalikes'])
lookalike_df.to_csv('D Veera Harsha Vardhan Reddy_Lookalike.csv', index=False)

# Print the resulting DataFrame to verify
print(lookalike_df.head())



  CustomerID                                         Lookalikes
0      C0001                  [C0137:0.0, C0119:0.0, C0088:0.0]
1      C0002  [C0029:0.00018354971379253016, C0199:0.0005116...
2      C0003  [C0095:1.7032049602905275e-06, C0150:1.2180692...
3      C0004  [C0067:9.965377501530703e-06, C0021:0.00013233...
4      C0005  [C0130:4.444249014690094e-06, C0144:1.26062190...




**Interpretation**

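**Highly Similar Customers:** The scores above are cosine distances returned by `NearestNeighbors(metric='cosine')`, so lower means more similar. A score of **0.0**, or very close to it, means the two customers' standardized feature vectors (total spend and number of transactions) point in nearly the same direction, which suggests similar purchase behavior.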
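
To make the meaning of a 0.0 score concrete, here is a minimal sketch of cosine distance computed by hand. The helper `cosine_distance` and the three example vectors are hypothetical and are not taken from the dataset. They only illustrate that the metric compares direction, not magnitude.

```python
import math

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity (assumes non-zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical standardized (TotalSpent, NumTransactions) vectors
a = (0.5, 1.0)
b = (1.0, 2.0)    # same direction as a, twice the magnitude
c = (1.0, -0.5)   # perpendicular to a

d_ab = cosine_distance(a, b)  # approximately 0: same direction
d_ac = cosine_distance(a, c)  # 1: no directional similarity
```

Under this metric, two customers can score 0.0 even when their absolute spending differs, as long as their standardized features keep the same proportions.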

**Less Similar Customers:** As the score (a cosine distance) increases, similarity decreases. For instance, a score of **0.05131820830206402** indicates less similarity than a score of **0.0001151047563676677**.
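
If a "higher is more similar" score is preferred, the stored distances can be converted with similarity = 1 - distance. The sketch below assumes the `ID:distance` string format written to the `Lookalikes` column. `parse_lookalike` is a hypothetical helper, and the two entries are copied from the output shown above.

```python
def parse_lookalike(entry):
    """Split an 'ID:distance' string into (customer_id, cosine_similarity)."""
    customer_id, distance = entry.rsplit(":", 1)
    return customer_id, 1.0 - float(distance)

# Entries taken from the printed output for C0001 and C0002
entries = ["C0137:0.0", "C0029:0.00018354971379253016"]
parsed = [parse_lookalike(e) for e in entries]

for customer_id, similarity in parsed:
    print(f"{customer_id}: similarity {similarity:.6f}")
```

With this conversion, a distance of 0.0 maps to a similarity of 1.0, and larger distances map to lower similarities.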