# **Task 2: Lookalike Model**

**Step 1: Import Required Libraries**

Import the libraries used throughout this task.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

**Step 2: Load and Merge Datasets**

Load the Customers.csv, Products.csv, and Transactions.csv datasets and merge them to create a complete dataset.

In [None]:
# Load datasets
customers = pd.read_csv("Customers.csv")
products = pd.read_csv("Products.csv")
transactions = pd.read_csv("Transactions.csv")

# Merge datasets
merged = pd.merge(transactions, customers, on="CustomerID", how="left")
merged = pd.merge(merged, products, on="ProductID", how="left")

# Check the structure
print(merged.head())

  TransactionID CustomerID ProductID   TransactionDate  Quantity  TotalValue  \
0        T00001      C0199      P067  25-08-2024 12:38         1      300.68   
1        T00112      C0146      P067  27-05-2024 22:23         1      300.68   
2        T00166      C0127      P067  25-04-2024 07:38         1      300.68   
3        T00272      C0087      P067  26-03-2024 22:55         2      601.36   
4        T00363      C0070      P067  21-03-2024 15:10         3      902.04   

   Price_x     CustomerName         Region  SignupDate  \
0   300.68   Andrea Jenkins         Europe  03-12-2022   
1   300.68  Brittany Harvey           Asia  04-09-2024   
2   300.68  Kathryn Stevens         Europe  04-04-2024   
3   300.68  Travis Campbell  South America  11-04-2024   
4   300.68    Timothy Perez         Europe  15-03-2022   

                       ProductName     Category  Price_y  
0  ComfortLiving Bluetooth Speaker  Electronics   300.68  
1  ComfortLiving Bluetooth Speaker  Electronics   30
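
A left merge silently keeps transactions whose CustomerID or ProductID has no match, filling the joined columns with NaN. The sketch below shows one way to check for this using the `validate` and `indicator` parameters of `pd.merge`. The toy frames and their values are made up for illustration and are not taken from the real CSV files.

```python
import pandas as pd

# Toy stand-ins for the real CSV files (hypothetical values).
transactions = pd.DataFrame({
    "TransactionID": ["T1", "T2", "T3"],
    "CustomerID": ["C1", "C2", "C9"],  # C9 has no customer record
    "ProductID": ["P1", "P1", "P2"],
})
customers = pd.DataFrame({
    "CustomerID": ["C1", "C2"],
    "Region": ["Europe", "Asia"],
})

# validate="many_to_one" raises MergeError if CustomerID repeats in customers.
# indicator=True adds a _merge column recording where each row found a match.
checked = pd.merge(
    transactions, customers,
    on="CustomerID", how="left",
    validate="many_to_one", indicator=True,
)

# Rows that exist only on the left side had no matching customer.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["TransactionID", "CustomerID"]])
```

If `unmatched` is empty on the real data, the left merge has not introduced missing customer attributes.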

**Step 3: Create Customer Profiles**

Aggregate transaction and product information to create customer profiles. This includes total spending (TotalValue), total quantity of products purchased (Quantity), and average product price.

3.1 - Verify the Columns in merged

Print the columns of the merged DataFrame to confirm what exists.

In [None]:
print(merged.columns)

Index(['TransactionID', 'CustomerID', 'ProductID', 'TransactionDate',
       'Quantity', 'TotalValue', 'Price_x', 'CustomerName', 'Region',
       'SignupDate', 'ProductName', 'Category', 'Price_y'],
      dtype='object')


3.2 - Resolve Conflicts in Column Names

Both Transactions.csv and Products.csv contain a Price column, so the merge appends pandas' default suffixes to distinguish them:

- Price_x: The Price column from Transactions.csv (the left side of the merge).
- Price_y: The Price column from Products.csv (the right side of the merge).

Resolve this by copying the column you want to keep into a single Price column:

In [None]:
# Check for Price column
if 'Price_x' in merged.columns and 'Price_y' in merged.columns:
    merged['Price'] = merged['Price_x']  # Or choose 'Price_y' if appropriate
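
To make the suffix behavior concrete, here is a small sketch on toy data. It also shows the `suffixes` parameter of `pd.merge`, which gives the two columns descriptive names, and a quick comparison to see whether the two prices ever disagree. The names `_txn` and `_catalog` and the toy prices are assumptions chosen for illustration.

```python
import pandas as pd

# Two toy frames that both carry a Price column (hypothetical values).
left = pd.DataFrame({"ProductID": ["P1", "P2"], "Price": [10.0, 20.0]})
right = pd.DataFrame({"ProductID": ["P1", "P2"], "Price": [10.0, 25.0]})

# Default behavior: overlapping columns get _x (left) and _y (right).
default = pd.merge(left, right, on="ProductID", how="left")
print(list(default.columns))

# Explicit suffixes make the origin of each column obvious.
named = pd.merge(left, right, on="ProductID", how="left",
                 suffixes=("_txn", "_catalog"))

# Rows where the two sources disagree deserve a look before picking one.
mismatch = named[named["Price_txn"] != named["Price_catalog"]]
print(mismatch)
```

In the head of the merged data shown earlier, Price_x and Price_y agree, but running a comparison like this over every row is a cheap way to confirm that the choice between them does not matter.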

3.3 - Create customer_profile

With the column names confirmed and the Price conflict resolved, build customer_profile with one row per customer.

In [None]:
# Create customer profiles
customer_profile = merged.groupby('CustomerID').agg({
    'TotalValue': 'sum',  # Total spending
    'Quantity': 'sum',    # Total quantity purchased
    'Price': 'mean'       # Average price of purchased products
}).reset_index()

print(customer_profile.head())

  CustomerID  TotalValue  Quantity       Price
0      C0001     3354.52        12  278.334000
1      C0002     1862.74        10  208.920000
2      C0003     2725.38        14  195.707500
3      C0004     5354.88        23  240.636250
4      C0005     2034.24         7  291.603333
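
As an alternative sketch, pandas' named aggregation lets the output columns say what they contain, which avoids a column called Price that actually holds a mean. The toy data and the output names TotalSpend, TotalQuantity, and AvgPrice are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy transactions (hypothetical values).
toy = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "TotalValue": [100.0, 50.0, 30.0],
    "Quantity": [2, 1, 3],
    "Price": [50.0, 50.0, 10.0],
})

# Named aggregation: output_name=(input_column, aggregation).
profile = toy.groupby("CustomerID", as_index=False).agg(
    TotalSpend=("TotalValue", "sum"),
    TotalQuantity=("Quantity", "sum"),
    AvgPrice=("Price", "mean"),
)
print(profile)
```

Note that the mean of Price here is taken per transaction row, not weighted by Quantity, which matches the aggregation used in the cell above.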


**Step 4: Normalize Features**

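Standardize the numerical features (TotalValue, Quantity, and Price) to zero mean and unit variance with StandardScaler, so that no single feature dominates the similarity calculation because of its scale.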

In [None]:
# Normalize features
scaler = StandardScaler()
customer_features = scaler.fit_transform(customer_profile[['TotalValue', 'Quantity', 'Price']])
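
For reference, the sketch below reproduces what StandardScaler computes: each column has its mean subtracted and is divided by its population standard deviation (ddof=0). The four rows are the first four customer profiles printed above. Because only four rows are used here, the scaled values will differ from those obtained on the full dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First four customer profiles from the output above:
# columns are TotalValue, Quantity, Price.
X = np.array([
    [3354.52, 12, 278.334],
    [1862.74, 10, 208.920],
    [2725.38, 14, 195.7075],
    [5354.88, 23, 240.63625],
])

# Manual standardization; np.std defaults to ddof=0 (population std).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

# Both routes give the same result, column means near 0 and std of 1.
print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

One detail to keep in mind: pandas' `.std()` defaults to ddof=1, so standardizing by hand with a DataFrame gives slightly different numbers unless `ddof=0` is passed.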

**Step 5: Calculate Similarity Matrix**

Compute the similarity between all customers using cosine similarity.

In [None]:
# Calculate similarity matrix
similarity_matrix = cosine_similarity(customer_features)

# Convert similarity matrix into a DataFrame
similarity_df = pd.DataFrame(similarity_matrix, index=customer_profile['CustomerID'], columns=customer_profile['CustomerID'])
print(similarity_df.head())

CustomerID     C0001     C0002     C0003     C0004     C0005     C0006  \
CustomerID                                                               
C0001       1.000000  0.104513 -0.524923 -0.925208  0.909351  0.442395   
C0002       0.104513  1.000000  0.791531 -0.464035  0.506433 -0.844066   
C0003      -0.524923  0.791531  1.000000  0.172432 -0.124725 -0.994780   
C0004      -0.925208 -0.464035  0.172432  1.000000 -0.990272 -0.083333   
C0005       0.909351  0.506433 -0.124725 -0.990272  1.000000  0.029596   

CustomerID     C0007     C0008     C0009     C0010  ...     C0191     C0192  \
CustomerID                                          ...                       
C0001       0.957854 -0.980620  0.885035 -0.268370  ...  0.953552  0.875392   
C0002      -0.126391 -0.208586  0.552510  0.929885  ...  0.366172  0.561020   
C0003      -0.694381  0.426063 -0.070251  0.960431  ... -0.270712 -0.056387   
C0004      -0.786871  0.960972 -0.985116 -0.108724  ... -0.969254 -0.975266   
C0005  
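
Cosine similarity between two feature vectors is their dot product divided by the product of their lengths, so it measures direction and ignores magnitude. The sketch below computes it by hand on made-up vectors and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up feature vectors; row 1 is row 0 scaled by 2.
features = np.array([
    [1.0, 2.0, -1.0],
    [2.0, 4.0, -2.0],
    [-1.0, 0.5, 3.0],
])

# Normalize each row to unit length, then take all pairwise dot products.
norms = np.linalg.norm(features, axis=1, keepdims=True)
unit = features / norms
manual = unit @ unit.T

print(np.allclose(manual, cosine_similarity(features)))
print(manual.round(3))
```

Rows 0 and 1 point in the same direction, so their similarity is 1 even though one is twice as large. On standardized features, values range from -1 to 1, which is why negative entries appear in the matrix above.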

**Step 6: Find Top 3 Lookalikes for Each Customer**

For each of the first 20 customers (C0001 to C0020), find the 3 most similar customers and their similarity scores:

In [None]:
lookalikes = {}

# Loop through the first 20 customers (C0001 - C0020)
for customer_id in customer_profile['CustomerID'][:20]:
    # Get similarity scores for the current customer
    scores = similarity_df[customer_id].sort_values(ascending=False)

    # Exclude the customer itself by label, then pick the top 3
    # (safer than skipping the first row, which may not be the customer itself when scores tie)
    top_3 = scores.drop(customer_id).head(3)

    # Store the results in the lookalikes dictionary
    lookalikes[customer_id] = list(zip(top_3.index, top_3.values))

# Convert lookalikes dictionary to a DataFrame
lookalikes_df = pd.DataFrame({
    'CustomerID': list(lookalikes.keys()),
    'Lookalikes': list(lookalikes.values())
})
print(lookalikes_df.head())

  CustomerID                                         Lookalikes
0      C0001  [(C0103, 0.9975729385618538), (C0092, 0.996878...
1      C0002  [(C0029, 0.9998543931340029), (C0077, 0.996103...
2      C0003  [(C0111, 0.9984874468302141), (C0190, 0.996656...
3      C0004  [(C0165, 0.9983897071764074), (C0162, 0.998086...
4      C0005  [(C0167, 0.9999721868436701), (C0020, 0.999714...
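
Selecting the top matches by position assumes the customer's own score of 1.0 always sorts first. If another customer has an identical standardized profile, the two tie at 1.0 and the order is not guaranteed. The sketch below uses a hypothetical helper, `top_k`, that removes the customer by label before ranking. The similarity values are made up to force such a tie.

```python
import pandas as pd

ids = ["C1", "C2", "C3", "C4"]
# Made-up symmetric similarity matrix in which C1 and C4 are identical.
sim = pd.DataFrame(
    [[1.0, 0.9, 0.2, 1.0],
     [0.9, 1.0, 0.1, 0.9],
     [0.2, 0.1, 1.0, 0.2],
     [1.0, 0.9, 0.2, 1.0]],
    index=ids, columns=ids,
)

def top_k(sim, customer_id, k=3):
    """Return the k most similar other customers as (id, score) pairs."""
    # Drop the customer's own entry by label, not by position.
    scores = sim[customer_id].drop(customer_id)
    top = scores.nlargest(k)
    # Convert scores to plain Python floats for clean printing and saving.
    return [(other, float(score)) for other, score in top.items()]

print(top_k(sim, "C4", k=2))
```

Here C4 never appears in its own result, even though C1 ties with it at 1.0.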


**Step 7: Save Results**

Save the lookalike recommendations into Lookalike.csv:

In [None]:
lookalikes_df.to_csv("Lookalike.csv", index=False)
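
Writing a column of Python lists to CSV stores each list as a single string, which is awkward to parse when the file is read back. One possible alternative, sketched here, is a long format with one row per customer and lookalike pair. The column names Rank, LookalikeID, and Score are assumptions. The IDs come from the output above, with scores rounded for brevity. The sketch writes to an in-memory buffer and does not touch Lookalike.csv.

```python
import io
import pandas as pd

# Two entries in the same shape as the lookalikes dictionary
# (scores rounded from the printed output).
lookalikes = {
    "C0001": [("C0103", 0.9976), ("C0092", 0.9969), ("C0135", 0.9927)],
    "C0002": [("C0029", 0.9999), ("C0077", 0.9961), ("C0157", 0.9955)],
}

# One row per (customer, lookalike) pair, with an explicit rank.
rows = [
    {"CustomerID": cid, "Rank": rank, "LookalikeID": other, "Score": float(score)}
    for cid, matches in lookalikes.items()
    for rank, (other, score) in enumerate(matches, start=1)
]
long_df = pd.DataFrame(rows)

# Round-trip through an in-memory CSV to show the columns survive intact.
buffer = io.StringIO()
long_df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Each column comes back with a usable type, so the scores can be filtered or sorted without parsing strings.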

**Step 8: Validate the Results**

In [None]:
for idx, row in lookalikes_df.iterrows():
    print(f"Customer: {row['CustomerID']} -> Lookalikes: {row['Lookalikes']}")

Customer: C0001 -> Lookalikes: [('C0103', 0.9975729385618538), ('C0092', 0.9968787968825864), ('C0135', 0.9927364238882178)]
Customer: C0002 -> Lookalikes: [('C0029', 0.9998543931340029), ('C0077', 0.9961038168882547), ('C0157', 0.9954784900159904)]
Customer: C0003 -> Lookalikes: [('C0111', 0.9984874468302141), ('C0190', 0.9966561574371822), ('C0038', 0.9901332836738033)]
Customer: C0004 -> Lookalikes: [('C0165', 0.9983897071764074), ('C0162', 0.9980867096016259), ('C0075', 0.996932345616167)]
Customer: C0005 -> Lookalikes: [('C0167', 0.9999721868436701), ('C0020', 0.99971426883456), ('C0128', 0.9987615592886807)]
Customer: C0006 -> Lookalikes: [('C0168', 0.9976122332196319), ('C0196', 0.9950250564515252), ('C0187', 0.9947524750205508)]
Customer: C0007 -> Lookalikes: [('C0125', 0.9998486580402707), ('C0089', 0.99834375759003), ('C0085', 0.9960335186380587)]
Customer: C0008 -> Lookalikes: [('C0084', 0.9960866913262758), ('C0113', 0.9958170325568012), ('C0017', 0.993173208985394)]
Custom

**Step 9: Download the CSV File**

In [None]:
# This cell only works inside Google Colab
from google.colab import files
files.download('Lookalike.csv')
