# eCommerce Transactions Data Science Assignment

## Overview
This assignment involves performing exploratory data analysis (EDA), building a Lookalike Model, and performing customer segmentation on an eCommerce transactions dataset.


### Dataset Files
1. **Customers.csv**: Contains customer information.
2. **Products.csv**: Contains product information.
3. **Transactions.csv**: Contains transaction details.


### Tasks
 1. **Task 1**: Perform EDA and derive business insights.
 2. **Task 2**: Build a Lookalike Model.
 3. **Task 3**: Perform Customer Segmentation using clustering.

# Task 2: Lookalike Model for Customer Similarity

This notebook builds a Lookalike Model that recommends the 3 most similar customers for a given customer, based on profile and transaction history. The model uses both customer and product information and assigns a similarity score to each recommended customer.

#### Steps:
 1. Load and preprocess data.
 2. Perform feature engineering.
 3. Normalize numerical features.
 4. One-hot encode categorical features.
 5. Compute similarity scores using cosine similarity.
 6. Generate the top 3 lookalikes for the first 20 customers.
 7. Save results to `Likhith_Varigonda_Lookalike.csv`.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

### Step 1: Load and Preprocess Data

In [2]:
def load_data(customers_path, transactions_path, products_path):
    """
    Load customer, transaction, and product data from CSV files.
    
    Args:
        customers_path (str): Path to Customers.csv.
        transactions_path (str): Path to Transactions.csv.
        products_path (str): Path to Products.csv.
    
    Returns:
        customers (pd.DataFrame): Customer data.
        transactions (pd.DataFrame): Transaction data.
        products (pd.DataFrame): Product data.
    """
    customers = pd.read_csv(customers_path)
    transactions = pd.read_csv(transactions_path)
    products = pd.read_csv(products_path)
    return customers, transactions, products

# Load data
customers, transactions, products = load_data('Customers.csv', 'Transactions.csv', 'Products.csv')

# Display the first few rows of each dataset
print("Customers Data:")
print(customers.head())

print("\nTransactions Data:")
print(transactions.head())

print("\nProducts Data:")
print(products.head())

Customers Data:
  CustomerID        CustomerName         Region  SignupDate
0      C0001    Lawrence Carroll  South America  2022-07-10
1      C0002      Elizabeth Lutz           Asia  2022-02-13
2      C0003      Michael Rivera  South America  2024-03-07
3      C0004  Kathleen Rodriguez  South America  2022-10-09
4      C0005         Laura Weber           Asia  2022-08-15

Transactions Data:
  TransactionID CustomerID ProductID      TransactionDate  Quantity  \
0        T00001      C0199      P067  2024-08-25 12:38:23         1   
1        T00112      C0146      P067  2024-05-27 22:23:54         1   
2        T00166      C0127      P067  2024-04-25 07:38:55         1   
3        T00272      C0087      P067  2024-03-26 22:55:37         2   
4        T00363      C0070      P067  2024-03-21 15:10:10         3   

   TotalValue   Price  
0      300.68  300.68  
1      300.68  300.68  
2      300.68  300.68  
3      601.36  300.68  
4      902.04  300.68  

Products Data:
  ProductID      
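`Tenure` is later derived from `SignupDate`, so it can help to parse date columns when the files are read. The following is a minimal sketch that uses small inline samples (rows copied from the previews above) instead of the real files; the helper name `load_with_dates` is hypothetical.

```python
import io
import pandas as pd

# Inline samples standing in for Customers.csv and Transactions.csv
# (rows copied from the previews above; the real files are not read here).
customers_csv = io.StringIO(
    "CustomerID,CustomerName,Region,SignupDate\n"
    "C0001,Lawrence Carroll,South America,2022-07-10\n"
    "C0002,Elizabeth Lutz,Asia,2022-02-13\n"
)
transactions_csv = io.StringIO(
    "TransactionID,CustomerID,ProductID,TransactionDate,Quantity,TotalValue,Price\n"
    "T00001,C0199,P067,2024-08-25 12:38:23,1,300.68,300.68\n"
    "T00272,C0087,P067,2024-03-26 22:55:37,2,601.36,300.68\n"
)

def load_with_dates(source, date_columns):
    """Hypothetical helper: read a CSV and parse the given columns as datetimes."""
    return pd.read_csv(source, parse_dates=date_columns)

customers_sample = load_with_dates(customers_csv, ["SignupDate"])
transactions_sample = load_with_dates(transactions_csv, ["TransactionDate"])

print(customers_sample.dtypes)
print(transactions_sample.dtypes)
```

With the dates parsed up front, later steps can subtract timestamps directly without calling `pd.to_datetime` again.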

### Step 2: Feature Engineering
 
 Combine customer, transaction, and product data to create features for each customer. The cell below builds:
 - **Profile**: Region and tenure (days since signup). The customer file previewed above has no age or gender columns.
 - **Transaction History**: Total spend, total quantity purchased, and number of transactions.
 - **Product Preferences**: Most frequently purchased product category.
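`pd.merge` performs an inner join by default, so a customer with no transactions would not appear in the merged result. The sketch below, on toy data, shows one way to check for such customers before aggregating; `C9999` is a made-up customer used only for illustration.

```python
import pandas as pd

# Toy frames; C9999 is a made-up customer with no transactions.
customers_toy = pd.DataFrame({
    "CustomerID": ["C0001", "C0002", "C9999"],
    "Region": ["South America", "Asia", "Europe"],
})
transactions_toy = pd.DataFrame({
    "TransactionID": ["T1", "T2", "T3"],
    "CustomerID": ["C0001", "C0001", "C0002"],
    "TotalValue": [100.0, 50.0, 20.0],
})

# A left join with indicator=True keeps every customer and records
# whether a matching transaction row was found.
merged = pd.merge(customers_toy, transactions_toy, on="CustomerID",
                  how="left", indicator=True)
no_transactions = merged.loc[merged["_merge"] == "left_only", "CustomerID"].tolist()

print(no_transactions)
```

If the list is non-empty, the first 20 rows of the feature table may not correspond exactly to C0001 through C0020.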

In [3]:
def preprocess_data(customers, transactions, products):
    """
    Merge datasets and create customer-level features.
    
    Args:
        customers (pd.DataFrame): Customer data.
        transactions (pd.DataFrame): Transaction data.
        products (pd.DataFrame): Product data.
    
    Returns:
        features (pd.DataFrame): Engineered features for each customer.
    """
    # Merge datasets (inner joins: rows without a match on both sides are dropped)
    customer_transactions = pd.merge(customers, transactions, on='CustomerID')
    customer_transactions = pd.merge(customer_transactions, products, on='ProductID')
    
    # Aggregate to one row per customer; named aggregation sets the output column names directly
    features = customer_transactions.groupby('CustomerID').agg(
        Region=('Region', 'first'),
        SignupDate=('SignupDate', 'first'),
        TotalSpend=('TotalValue', 'sum'),
        TotalQuantity=('Quantity', 'sum'),
        TransactionFrequency=('TransactionID', 'count'),
        PreferredCategory=('Category', lambda x: x.mode().iloc[0])
    ).reset_index()
    
    # Calculate tenure in days as of the time the notebook is run
    features['Tenure'] = (pd.Timestamp.now() - pd.to_datetime(features['SignupDate'])).dt.days
    features = features.drop(columns=['SignupDate'])
    
    return features

# Perform feature engineering
features = preprocess_data(customers, transactions, products)

# Display engineered features
print("Engineered Features:")
print(features.head())

Engineered Features:
  CustomerID         Region  TotalSpend  TotalQuantity  TransactionFrequency  \
0      C0001  South America     3354.52             12                     5   
1      C0002           Asia     1862.74             10                     4   
2      C0003  South America     2725.38             14                     4   
3      C0004  South America     5354.88             23                     8   
4      C0005           Asia     2034.24              7                     3   

  PreferredCategory  Tenure  
0       Electronics     931  
1          Clothing    1078  
2        Home Decor     325  
3             Books     840  
4       Electronics     895  
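Two possible refinements are an average transaction value per customer and a tenure measured against a fixed date, so that the feature does not change with the day the notebook is run. The sketch below uses two rows copied from the outputs above; `snapshot_date` and the column name `AvgTransactionValue` are assumptions, not part of the original pipeline.

```python
import pandas as pd

# Rows copied from the engineered-features output above, with SignupDate
# taken from the customers preview.
features_toy = pd.DataFrame({
    "CustomerID": ["C0001", "C0002"],
    "TotalSpend": [3354.52, 1862.74],
    "TransactionFrequency": [5, 4],
    "SignupDate": pd.to_datetime(["2022-07-10", "2022-02-13"]),
})

# Average transaction value = total spend / number of transactions.
features_toy["AvgTransactionValue"] = (
    features_toy["TotalSpend"] / features_toy["TransactionFrequency"]
)

# Hypothetical fixed reference date so tenure is reproducible across runs.
snapshot_date = pd.Timestamp("2025-01-01")
features_toy["Tenure"] = (snapshot_date - features_toy["SignupDate"]).dt.days

print(features_toy[["CustomerID", "AvgTransactionValue", "Tenure"]])
```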


### Step 3: Normalize Features
 
Scale the numerical features to the [0, 1] range with Min-Max scaling so that large-valued features such as `TotalSpend` do not dominate the similarity scores.

In [4]:
def normalize_features(features):
    """
    Normalize numerical features using Min-Max scaling.
    
    Args:
        features (pd.DataFrame): Engineered features.
    
    Returns:
        features_normalized (pd.DataFrame): Normalized features.
    """
    # Normalize numerical features
    numeric_cols = ['TotalSpend', 'TotalQuantity', 'TransactionFrequency', 'Tenure']
    scaler = MinMaxScaler()
    scaled = scaler.fit_transform(features[numeric_cols])
    
    # Convert normalized features back to a DataFrame, keeping the original index so CustomerID aligns
    features_normalized = pd.DataFrame(scaled, columns=numeric_cols, index=features.index)
    features_normalized['CustomerID'] = features['CustomerID']
    
    return features_normalized

# Normalize features
features_normalized = normalize_features(features)

# Display normalized features
print("Normalized Features:")
print(features_normalized.head())

Normalized Features:
   TotalSpend  TotalQuantity  TransactionFrequency    Tenure CustomerID
0    0.308942       0.354839                   0.4  0.842204      C0001
1    0.168095       0.290323                   0.3  0.979458      C0002
2    0.249541       0.419355                   0.3  0.276377      C0003
3    0.497806       0.709677                   0.7  0.757236      C0004
4    0.184287       0.193548                   0.2  0.808590      C0005
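Min-Max scaling maps each column to [0, 1] using `(x - min) / (max - min)`. As a small check on toy values (not taken from the dataset), the sketch below compares the manual formula with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy spend values (not taken from the dataset).
spend = np.array([[100.0], [250.0], [400.0]])

# Manual min-max scaling: (x - min) / (max - min), computed per column.
manual = (spend - spend.min(axis=0)) / (spend.max(axis=0) - spend.min(axis=0))

# The same transformation via scikit-learn.
scaled = MinMaxScaler().fit_transform(spend)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

Because the minimum and maximum come from the data, a single extreme customer compresses everyone else toward 0, which is worth keeping in mind when reading the similarity scores.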


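### Step 4: One-Hot Encoding

One-hot encode the categorical features (`Region` and `PreferredCategory`) so they can be included in the similarity computation alongside the normalized numerical features.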

In [5]:
def encode_categorical_features(features, features_normalized):
    """
    One-hot encode categorical features and add them to the normalized DataFrame.
    
    Args:
        features (pd.DataFrame): Engineered features.
        features_normalized (pd.DataFrame): Normalized features.
    
    Returns:
        features_encoded (pd.DataFrame): Normalized and encoded features.
    """
    # One-hot encode categorical features
    categorical_cols = ['Region', 'PreferredCategory']
    dummies = pd.get_dummies(features[categorical_cols], columns=categorical_cols)
    
    # Append the one-hot columns to a copy of the normalized DataFrame (the input is left unchanged)
    features_encoded = pd.concat([features_normalized, dummies], axis=1)
    
    return features_encoded

# Encode categorical features
features_normalized = encode_categorical_features(features, features_normalized)

# Display encoded features
print("Encoded Features:")
print(features_normalized.head())

Encoded Features:
   TotalSpend  TotalQuantity  TransactionFrequency    Tenure CustomerID  \
0    0.308942       0.354839                   0.4  0.842204      C0001   
1    0.168095       0.290323                   0.3  0.979458      C0002   
2    0.249541       0.419355                   0.3  0.276377      C0003   
3    0.497806       0.709677                   0.7  0.757236      C0004   
4    0.184287       0.193548                   0.2  0.808590      C0005   

   Region_Asia  Region_Europe  Region_North America  Region_South America  \
0        False          False                 False                  True   
1         True          False                 False                 False   
2        False          False                 False                  True   
3        False          False                 False                  True   
4         True          False                 False                 False   

   PreferredCategory_Books  PreferredCategory_Clothing  \
0         
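As the output above shows, `pd.get_dummies` produces boolean columns here. If numeric 0/1 indicators are preferred before computing similarities, one option is to pass `dtype=float`. This is a sketch on a toy frame, not the notebook's own cell.

```python
import pandas as pd

# Toy frame using category values that appear in the outputs above.
features_toy = pd.DataFrame({
    "CustomerID": ["C0001", "C0002", "C0003"],
    "Region": ["South America", "Asia", "South America"],
    "PreferredCategory": ["Electronics", "Clothing", "Home Decor"],
})

# dtype=float yields 0.0/1.0 indicator columns instead of booleans.
encoded = pd.get_dummies(
    features_toy, columns=["Region", "PreferredCategory"], dtype=float
)

print(encoded.columns.tolist())
print(encoded.drop(columns="CustomerID").dtypes.unique())
```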

### Step 5: Compute Similarity Scores
 
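Use **Cosine Similarity** to compute similarity scores between customers. Cosine similarity measures the angle between two feature vectors, so it compares the shape of two customer profiles rather than their overall magnitude. Because every feature here is non-negative, scores fall between 0 and 1, and identical profiles score 1.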
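For reference, the score between two vectors is their dot product divided by the product of their norms. The sketch below checks that definition against `cosine_similarity` on two made-up vectors.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two toy feature vectors (illustrative values only).
a = np.array([0.3, 0.4, 1.0, 0.0])
b = np.array([0.2, 0.3, 1.0, 0.0])

# cos(a, b) = (a . b) / (||a|| * ||b||)
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# scikit-learn expects 2-D inputs and returns a matrix of pairwise scores.
library = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(np.isclose(manual, library))
```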


In [6]:
def compute_similarity(features_normalized):
    """
    Compute cosine similarity between customers.
    
    Args:
        features_normalized (pd.DataFrame): Normalized and encoded features.
    
    Returns:
        similarity_matrix (np.ndarray): Cosine similarity matrix.
        similarity_df (pd.DataFrame): Similarity matrix as a DataFrame.
    """
    # Compute cosine similarity matrix
    similarity_matrix = cosine_similarity(features_normalized.drop('CustomerID', axis=1))
    
    # Convert similarity matrix to a DataFrame
    similarity_df = pd.DataFrame(similarity_matrix, index=features_normalized['CustomerID'], columns=features_normalized['CustomerID'])
    
    return similarity_matrix, similarity_df

# Compute similarity scores
similarity_matrix, similarity_df = compute_similarity(features_normalized)

# Display similarity matrix
print("Similarity Matrix:")
print(similarity_df.head())


Similarity Matrix:
CustomerID     C0001     C0002     C0003     C0004     C0005     C0006  \
CustomerID                                                               
C0001       1.000000  0.351832  0.579095  0.676631  0.645342  0.594865   
C0002       0.351832  1.000000  0.190183  0.357436  0.655801  0.208705   
C0003       0.579095  0.190183  1.000000  0.607893  0.159246  0.584962   
C0004       0.676631  0.357436  0.607893  1.000000  0.302159  0.945626   
C0005       0.645342  0.655801  0.159246  0.302159  1.000000  0.179330   

CustomerID     C0007     C0008     C0009     C0010  ...     C0191     C0192  \
CustomerID                                          ...                       
C0001       0.655570  0.298605  0.196858  0.302921  ...  0.575560  0.993658   
C0002       0.664893  0.252708  0.578864  0.647394  ...  0.182292  0.320449   
C0003       0.171429  0.595902  0.100926  0.183437  ...  0.579475  0.559278   
C0004       0.322482  0.416105  0.196759  0.330387  ...  0.936173  

### Step 6: Generate Top 3 Lookalikes
 
For each of the first 20 customers (C0001 - C0020), find the top 3 most similar customers and their similarity scores.


In [7]:
def generate_lookalikes(similarity_matrix, features_normalized, top_n=3, num_customers=20):
    """
    Generate top N lookalikes for the first M customers.
    
    Args:
        similarity_matrix (np.ndarray): Cosine similarity matrix.
        features_normalized (pd.DataFrame): Normalized and encoded features.
        top_n (int): Number of lookalikes to recommend.
        num_customers (int): Number of customers to process.
    
    Returns:
        lookalike_map (dict): Map of customer IDs to their top N lookalikes and similarity scores.
    """
    lookalike_map = {}
    
    # Iterate over the first M customers
    for i in range(num_customers):
        customer_id = features_normalized.iloc[i]['CustomerID']
        similarity_scores = similarity_matrix[i]
        
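        # Rank all customers by similarity (ascending) and drop the customer's own index explicitly,
        # rather than assuming the self-match always sorts last
        ranked = np.argsort(similarity_scores)
        ranked = ranked[ranked != i]
        top_indices = ranked[-top_n:]  # top N, listed from least to most similar
        top_customers = [(features_normalized.iloc[idx]['CustomerID'], similarity_scores[idx]) for idx in top_indices]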
        
        # Store the results in the dictionary
        lookalike_map[customer_id] = top_customers
    
    return lookalike_map

# Generate lookalikes
lookalike_map = generate_lookalikes(similarity_matrix, features_normalized)

# Display lookalike recommendations
print("Lookalike Recommendations:")
for cust_id, similar_customers in lookalike_map.items():
    print(f"{cust_id}: {similar_customers}")

Lookalike Recommendations:
C0001: [('C0112', 0.9884238286228844), ('C0192', 0.9936579522253058), ('C0184', 0.9940860392728573)]
C0002: [('C0106', 0.9886648636130803), ('C0040', 0.9921200427077677), ('C0134', 0.9925276872555436)]
C0003: [('C0076', 0.9955701479019959), ('C0031', 0.9956515859335413), ('C0052', 0.9965122787711371)]
C0004: [('C0169', 0.9923454037920215), ('C0155', 0.9932083539079324), ('C0165', 0.9937352037310774)]
C0005: [('C0186', 0.9736900443509858), ('C0140', 0.9834630435232143), ('C0007', 0.9990750616371955)]
C0006: [('C0126', 0.9921318224633991), ('C0191', 0.9936258028507181), ('C0137', 0.994825896280192)]
C0007: [('C0186', 0.9666653205856185), ('C0140', 0.9775220458438616), ('C0005', 0.9990750616371955)]
C0008: [('C0189', 0.953679390764243), ('C0065', 0.9789643919640681), ('C0059', 0.9829013459966449)]
C0009: [('C0010', 0.9736906570486986), ('C0062', 0.9852869238052722), ('C0061', 0.991424164562953)]
C0010: [('C0103', 0.9875572129307074), ('C0061', 0.9905000796764254
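The recommendations above are listed from least to most similar. As an alternative sketch, the labeled similarity DataFrame can be queried by customer ID, which removes the customer by label and returns the most similar customer first. The helper name `top_n_lookalikes` and the toy matrix are hypothetical.

```python
import numpy as np
import pandas as pd

def top_n_lookalikes(similarity_df, customer_id, top_n=3):
    """Hypothetical helper: most similar customers first, excluding the customer itself."""
    scores = similarity_df.loc[customer_id].drop(customer_id)
    best = scores.sort_values(ascending=False).head(top_n)
    return list(zip(best.index, best.to_numpy()))

# Toy symmetric similarity matrix (values are made up).
ids = ["C1", "C2", "C3", "C4"]
toy = pd.DataFrame(
    np.array([
        [1.0, 0.9, 0.2, 0.5],
        [0.9, 1.0, 0.3, 0.4],
        [0.2, 0.3, 1.0, 0.8],
        [0.5, 0.4, 0.8, 1.0],
    ]),
    index=ids, columns=ids,
)

print(top_n_lookalikes(toy, "C1", top_n=2))
```

Looking up by ID also avoids relying on the row position of each customer in the feature table.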

### Step 7: Save Results to `Lookalike.csv`
 
Save the lookalike recommendations in the required format: `Map<cust_id, List<cust_id, score>>`.

In [8]:
def save_results(lookalike_map, output_path='Likhith_Varigonda_Lookalike.csv'):
    """
    Save lookalike recommendations to a CSV file.
    
    Args:
        lookalike_map (dict): Map of customer IDs to their top N lookalikes and similarity scores.
        output_path (str): Path to save the output CSV file.
    """
    with open(output_path, 'w') as f:
        for cust_id, similar_customers in lookalike_map.items():
            f.write(f"{cust_id}, {similar_customers}\n")
    
    print(f"Results saved to {output_path}")

# Save results
save_results(lookalike_map)

Results saved to Likhith_Varigonda_Lookalike.csv
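The file written above stores each customer's list as its Python representation on a single line. If a flat table is easier for downstream tools to read, one option is to write one row per (customer, lookalike) pair. This is a sketch only; the helper `lookalikes_to_frame` and the column names `LookalikeID` and `Score` are assumptions.

```python
import pandas as pd

def lookalikes_to_frame(lookalike_map):
    """Hypothetical helper: flatten {cust_id: [(lookalike_id, score), ...]} into rows."""
    rows = [
        {"CustomerID": cust_id, "LookalikeID": other_id, "Score": float(score)}
        for cust_id, similar in lookalike_map.items()
        for other_id, score in similar
    ]
    return pd.DataFrame(rows, columns=["CustomerID", "LookalikeID", "Score"])

# Small map in the same shape as `lookalike_map` (scores rounded from the output above).
toy_map = {
    "C0005": [("C0140", 0.9835), ("C0007", 0.9991)],
    "C0007": [("C0140", 0.9775), ("C0005", 0.9991)],
}

flat = lookalikes_to_frame(toy_map)

# to_csv without a path returns the CSV text; pass a file path to write to disk.
print(flat.to_csv(index=False))
```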


## Conclusion
The Lookalike Model identifies, for each of the first 20 customers, the 3 most similar customers based on profile and transaction history, using feature engineering, Min-Max normalization, one-hot encoding, and cosine similarity. The results are saved in `Likhith_Varigonda_Lookalike.csv`, which can be used for further analysis or integrated into marketing workflows such as personalized campaigns.

## Assignment Summary
- **Task 1**: EDA and business insights were derived.
- **Task 2**: A Lookalike Model was built and recommendations were saved.
- **Task 3**: Customer segmentation was performed using K-Means clustering.