
### Step 1: Load and Explore Data
In this step, we will:

- Load customer, product, and transaction data from Google Drive
- Explore the structure of each dataset
- Convert date columns to `datetime` format for analysis
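
As a minimal sketch of the exploration step, the snippet below uses a tiny hand-made frame (not the real `Customers.csv`) to show how `dtypes`, `isna()`, and `is_unique` can confirm that a date column parsed and that the key column has no duplicates. The sample rows and the `errors="coerce"` choice are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for Customers.csv; the values are made up for illustration.
sample = pd.DataFrame({
    "CustomerID": ["C0001", "C0002", "C0003"],
    "Region": ["Asia", "Europe", "Asia"],
    "SignupDate": ["2022-07-10", "2022-02-13", "not a date"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
sample["SignupDate"] = pd.to_datetime(sample["SignupDate"], errors="coerce")

# Structure checks: column dtypes, missing values per column, duplicate keys.
print(sample.dtypes)
print(sample.isna().sum())
print(sample["CustomerID"].is_unique)
```

Counting `NaT` values after parsing is a quick way to spot malformed dates before they affect later date arithmetic.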


In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Load datasets
import pandas as pd
customers = pd.read_csv('/content/drive/MyDrive/Customers.csv')
products = pd.read_csv('/content/drive/MyDrive/Products.csv')
transactions = pd.read_csv('/content/drive/MyDrive/Transactions.csv')

# Convert date columns
customers['SignupDate'] = pd.to_datetime(customers['SignupDate'])
transactions['TransactionDate'] = pd.to_datetime(transactions['TransactionDate'])

# Display first few rows
customers.head(), products.head(), transactions.head()


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


(  CustomerID        CustomerName         Region SignupDate
 0      C0001    Lawrence Carroll  South America 2022-07-10
 1      C0002      Elizabeth Lutz           Asia 2022-02-13
 2      C0003      Michael Rivera  South America 2024-03-07
 3      C0004  Kathleen Rodriguez  South America 2022-10-09
 4      C0005         Laura Weber           Asia 2022-08-15,
   ProductID              ProductName     Category   Price
 0      P001     ActiveWear Biography        Books  169.30
 1      P002    ActiveWear Smartwatch  Electronics  346.30
 2      P003  ComfortLiving Biography        Books   44.12
 3      P004            BookWorld Rug   Home Decor   95.69
 4      P005          TechPro T-Shirt     Clothing  429.31,
   TransactionID CustomerID ProductID     TransactionDate  Quantity  \
 0        T00001      C0199      P067 2024-08-25 12:38:23         1   
 1        T00112      C0146      P067 2024-05-27 22:23:54         1   
 2        T00166      C0127      P067 2024-04-25 07:38:55         1   


### Step 2: Merge Datasets
We need to merge:

- `customers` (customer details)
- `products` (product details)
- `transactions` (purchase history)

The merged table has one row per transaction and is the basis for the **customer profile dataset** built in Step 3.
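
The notebook's merged output contains `Price_x` and `Price_y` because both `transactions` and `products` carry a `Price` column. One possible way to avoid the duplicate, sketched here on invented toy frames, is to drop `Price` from `products` before merging and to let `validate="many_to_one"` check that each key appears only once on the right-hand side.

```python
import pandas as pd

# Toy frames with made-up values; only the column names mirror the notebook.
transactions = pd.DataFrame({
    "TransactionID": ["T1", "T2", "T3"],
    "CustomerID": ["C1", "C2", "C1"],
    "ProductID": ["P1", "P1", "P2"],
    "Quantity": [1, 2, 1],
    "Price": [10.0, 10.0, 25.0],
})
customers = pd.DataFrame({"CustomerID": ["C1", "C2"], "Region": ["Asia", "Europe"]})
products = pd.DataFrame({
    "ProductID": ["P1", "P2"],
    "Category": ["Books", "Clothing"],
    "Price": [10.0, 25.0],
})

# validate raises MergeError if a key is duplicated in the right-hand frame.
merged = (
    transactions
    .merge(customers, on="CustomerID", how="left", validate="many_to_one")
    .merge(products.drop(columns="Price"), on="ProductID", how="left",
           validate="many_to_one")
)
print(merged.columns.tolist())
```

With the duplicate dropped, the result keeps a single `Price` column and the row count stays equal to the number of transactions.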


In [2]:
# Merge transactions with customers
df = transactions.merge(customers, on='CustomerID', how='left')

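# Merge the result with products. Both frames have a Price column,
# so pandas suffixes them as Price_x (transactions) and Price_y (products).
df = df.merge(products, on='ProductID', how='left')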

# Display merged dataset
df.head()


|   | TransactionID | CustomerID | ProductID | TransactionDate | Quantity | TotalValue | Price_x | CustomerName | Region | SignupDate | ProductName | Category | Price_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T00001 | C0199 | P067 | 2024-08-25 12:38:23 | 1 | 300.68 | 300.68 | Andrea Jenkins | Europe | 2022-12-03 | ComfortLiving Bluetooth Speaker | Electronics | 300.68 |
| 1 | T00112 | C0146 | P067 | 2024-05-27 22:23:54 | 1 | 300.68 | 300.68 | Brittany Harvey | Asia | 2024-09-04 | ComfortLiving Bluetooth Speaker | Electronics | 300.68 |
| 2 | T00166 | C0127 | P067 | 2024-04-25 07:38:55 | 1 | 300.68 | 300.68 | Kathryn Stevens | Europe | 2024-04-04 | ComfortLiving Bluetooth Speaker | Electronics | 300.68 |
| 3 | T00272 | C0087 | P067 | 2024-03-26 22:55:37 | 2 | 601.36 | 300.68 | Travis Campbell | South America | 2024-04-11 | ComfortLiving Bluetooth Speaker | Electronics | 300.68 |
| 4 | T00363 | C0070 | P067 | 2024-03-21 15:10:10 | 3 | 902.04 | 300.68 | Timothy Perez | Europe | 2022-03-15 | ComfortLiving Bluetooth Speaker | Electronics | 300.68 |


### Step 3: Feature Engineering
In this step, we will:

- Convert categorical features (e.g., Region, Category) into numerical values
- Aggregate **transaction history per customer**
- Convert `SignupDate` into numerical format (days since the earliest signup)


In [3]:
from sklearn.preprocessing import LabelEncoder

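# Encode categorical features. Each column gets its own encoder so the
# integer codes can be mapped back to labels later with inverse_transform.
region_enc = LabelEncoder()
category_enc = LabelEncoder()
df['Region'] = region_enc.fit_transform(df['Region'])
df['Category'] = category_enc.fit_transform(df['Category'])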

# Aggregate customer transaction history
customer_features = df.groupby('CustomerID').agg({
    'Region': 'first',
    'SignupDate': 'first',
    'TotalValue': 'sum',
    'Quantity': 'sum',
    'Category': lambda x: x.mode().iloc[0]  # Most frequent category across the customer's transactions
}).reset_index()

# Convert SignupDate to numeric (days since first customer signup)
customer_features['SignupDate'] = (customer_features['SignupDate'] - customer_features['SignupDate'].min()).dt.days

# Display final feature dataset
customer_features.head()


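|   | CustomerID | Region | SignupDate | TotalValue | Quantity | Category |
|---|---|---|---|---|---|---|
| 0 | C0001 | 3 | 169 | 3354.52 | 12 | 2 |
| 1 | C0002 | 0 | 22 | 1862.74 | 10 | 1 |
| 2 | C0003 | 3 | 775 | 2725.38 | 14 | 3 |
| 3 | C0004 | 3 | 260 | 5354.88 | 23 | 0 |
| 4 | C0005 | 0 | 205 | 2034.24 | 7 | 2 |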
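
Label encoding assigns each region an arbitrary integer, so once the codes are scaled, 'Asia' (0) and 'South America' (3) end up far apart for no meaningful reason. If that implied ordering is unwanted, one-hot encoding is a common alternative. The sketch below uses invented rows and is not what the notebook does; it only illustrates the idea with `pd.get_dummies`.

```python
import pandas as pd

# Invented customer rows; only the column names follow the notebook.
features = pd.DataFrame({
    "CustomerID": ["C1", "C2", "C3"],
    "Region": ["Asia", "Europe", "South America"],
    "TotalValue": [100.0, 250.0, 80.0],
})

# One indicator column per region; no ordering between regions is implied.
encoded = pd.get_dummies(features, columns=["Region"], prefix="Region", dtype=int)
print(encoded)
```

The trade-off is a wider feature matrix, which changes how much weight the categorical information carries in the similarity score.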


### Step 4: Compute Customer Similarity
To find **similar customers**, we will:

- **Standardize features** (zero mean, unit variance) so that no single feature dominates
- Compute **Cosine Similarity** to measure how close customers are
- Create a **similarity matrix** to compare all customers
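
To make the two operations concrete, here is a minimal pure-Python sketch of standardization followed by cosine similarity, cos(a, b) = a·b / (‖a‖‖b‖). The three feature rows are invented, and the helper names `standardize` and `cosine` are hypothetical; the notebook itself uses scikit-learn for both steps.

```python
import math

def standardize(rows):
    """Scale each column to zero mean and unit population variance.

    Assumes no column is constant (a zero standard deviation would divide by zero).
    """
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def cosine(a, b):
    """Cosine of the angle between vectors a and b (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented (TotalValue, Quantity) rows for three customers.
rows = [
    [1000.0, 5.0],
    [2000.0, 10.0],
    [500.0, 20.0],
]
scaled = standardize(rows)

# Full pairwise matrix: entry [i][j] compares customer i with customer j.
matrix = [[cosine(a, b) for b in scaled] for a in scaled]
for line in matrix:
    print([round(v, 3) for v in line])
```

Because the features are centered first, scores can be negative: a value near -1 means two customers deviate from the average in opposite directions.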


In [4]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

# Standardize features (zero mean, unit variance per column)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(customer_features.drop(columns=['CustomerID']))

# Compute similarity matrix
similarity_matrix = cosine_similarity(scaled_features)

# Convert similarity scores into a DataFrame
similarity_df = pd.DataFrame(similarity_matrix, index=customer_features['CustomerID'], columns=customer_features['CustomerID'])

# Display similarity matrix
similarity_df.head()


| CustomerID | C0001 | C0002 | C0003 | C0004 | C0005 | C0006 | C0007 | C0008 | C0009 | C0010 | ... | C0191 | C0192 | C0193 | C0194 | C0195 | C0196 | C0197 | C0198 | C0199 | C0200 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C0001 | 1.0 | 0.006242 | 0.423064 | 0.320381 | -0.009981 | 0.068329 | 0.03249 | 0.141956 | -0.160422 | 0.004744 | ... | -0.010572 | 0.871211 | -0.521311 | -0.480751 | 0.184118 | 0.416126 | 0.102036 | 0.257292 | 0.237721 | -0.620918 |
| C0002 | 0.006242 | 1.0 | -0.608672 | -0.263902 | 0.876211 | -0.602945 | 0.890102 | -0.554733 | 0.506551 | 0.894088 | ... | -0.527031 | 0.128091 | 0.762051 | -0.525459 | -0.832976 | 0.326771 | 0.678425 | 0.8239 | 0.773373 | 0.352412 |
| C0003 | 0.423064 | -0.608672 | 1.0 | -0.149333 | -0.317027 | 0.006338 | -0.372756 | 0.711855 | -0.163471 | -0.42367 | ... | 0.115426 | 0.45331 | -0.927218 | -0.20821 | 0.904022 | 0.057365 | 0.062935 | -0.342031 | -0.021682 | -0.677721 |
| C0004 | 0.320381 | -0.263902 | -0.149333 | 1.0 | -0.645132 | 0.525207 | -0.538178 | 0.124834 | -0.751102 | -0.268559 | ... | 0.268575 | -0.077158 | -0.139454 | 0.3704 | 0.027163 | -0.103204 | -0.78309 | -0.419663 | -0.646593 | 0.058393 |
| C0005 | -0.009981 | 0.876211 | -0.317027 | -0.645132 | 1.0 | -0.79222 | 0.985058 | -0.358516 | 0.632133 | 0.73406 | ... | -0.663372 | 0.204794 | 0.539663 | -0.730983 | -0.573059 | 0.497923 | 0.900401 | 0.800248 | 0.932437 | 0.244335 |


### Step 5: Find Top 3 Similar Customers
For each customer, we:

- Find the **3 most similar customers**
- Extract their **Customer IDs & Similarity Scores**
- Collect the results in a DataFrame


In [5]:
def get_top_3_similar(customers_list, similarity_df):
    lookalike_dict = {}

    for customer in customers_list:
        # Drop the customer's own entry explicitly (a tie at 1.0 could otherwise
        # push "self" out of first place), then keep the 3 highest scores
        similar_customers = similarity_df[customer].drop(customer).nlargest(3)
        lookalike_dict[customer] = list(zip(similar_customers.index, similar_customers.values))  # (CustomerID, score) pairs

    return lookalike_dict

# Get top 3 similar customers for CustomerIDs C0001 - C0020
customer_ids = customer_features['CustomerID'][:20]  # First 20 customers
lookalike_results = get_top_3_similar(customer_ids, similarity_df)

# Convert results to DataFrame
lookalike_df = pd.DataFrame(list(lookalike_results.items()), columns=['CustomerID', 'Similar_Customers'])
lookalike_df.head()


|   | CustomerID | Similar_Customers |
|---|---|---|
| 0 | C0001 | [(C0184, 0.9941840106952722), (C0152, 0.937791... |
| 1 | C0002 | [(C0106, 0.9215099852826913), (C0103, 0.895633... |
| 2 | C0003 | [(C0076, 0.967813381807394), (C0052, 0.9512180... |
| 3 | C0004 | [(C0165, 0.9737034936426925), (C0169, 0.962472... |
| 4 | C0005 | [(C0007, 0.9850576778941272), (C0159, 0.939718... |
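
The `Similar_Customers` column holds lists of tuples, which is awkward to filter or export. One possible way to store the results, sketched here with invented IDs and scores, is to flatten them into one row per (customer, match) pair and write that as CSV. The column names `Rank`, `SimilarCustomerID`, and `Score` are assumptions, and the sketch writes to an in-memory buffer rather than a real file.

```python
import io
import pandas as pd

# Hypothetical results in the same shape as lookalike_results; values are made up.
lookalike_results = {
    "X1": [("X7", 0.99), ("X4", 0.94), ("X9", 0.91)],
    "X2": [("X5", 0.92), ("X3", 0.90), ("X8", 0.88)],
}

# One row per match, with an explicit rank (1 = most similar).
rows = [
    {"CustomerID": cust, "Rank": rank, "SimilarCustomerID": other, "Score": score}
    for cust, matches in lookalike_results.items()
    for rank, (other, score) in enumerate(matches, start=1)
]
long_df = pd.DataFrame(rows)

# Write CSV text to a buffer; pass a file path to to_csv instead to save to disk.
buffer = io.StringIO()
long_df.to_csv(buffer, index=False)
print(buffer.getvalue())
```

The long format keeps every score as a plain numeric column, so the file can be sorted or filtered without parsing nested lists.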
