# TASK 2 - BUILDING LOOKALIKE MODEL
**Build a Lookalike Model that takes a user's information as input and recommends 3 similar customers based on their profile and transaction history.**

**The model should:**
* Use both customer and product information.
* Assign a similarity score to each recommended customer.

**1. Import Necessary Libraries**
* Purpose: Import libraries for data manipulation, visualization, and machine learning tasks.
* pandas and numpy: For data handling and computation.
* matplotlib and seaborn: For data visualization.
* OneHotEncoder and StandardScaler: For preprocessing categorical and numerical data.
* cosine_similarity: To compute the similarity between customer profiles.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

**2. Load and Merge Datasets**
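* Purpose: Load the three datasets (Customers, Products, and Transactions) and merge them into a single dataset, data2.
* First merge transactions with customers on CustomerID, then merge the result with products on ProductID.
* Both merges use the pandas default inner join, so customers without any transactions do not appear in data2.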

In [None]:
# Load and merge datasets
customers = pd.read_csv('/content/Customers.csv')
products = pd.read_csv('/content/Products.csv')
transactions = pd.read_csv('/content/Transactions - Transactions.csv')

In [None]:

# Merge transactions with customers, then with products
data2 = transactions.merge(customers, on='CustomerID').merge(products, on='ProductID')
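Because `merge` defaults to an inner join, a customer with no transactions silently drops out of the merged data. The following is a minimal sketch on made-up toy frames (the IDs and values are hypothetical, not taken from the real CSV files). It shows how a left join with `indicator=True` can surface such customers, and how `validate` can guard against duplicate keys on the customer side.

```python
import pandas as pd

# Hypothetical toy data, not the real Customers/Transactions files
toy_customers = pd.DataFrame({
    'CustomerID': ['C1', 'C2', 'C3'],
    'Region': ['Asia', 'Europe', 'Asia'],
})
toy_transactions = pd.DataFrame({
    'TransactionID': ['T1', 'T2', 'T3'],
    'CustomerID': ['C1', 'C1', 'C2'],
    'TotalValue': [10.0, 20.0, 5.0],
})

# Inner join (the default): C3 has no transactions and disappears.
# validate='many_to_one' raises if CustomerID is duplicated in toy_customers.
inner = toy_transactions.merge(
    toy_customers, on='CustomerID', validate='many_to_one'
)

# Left join from the customer side keeps C3 and flags it as 'left_only'
flagged = toy_customers.merge(
    toy_transactions, on='CustomerID', how='left', indicator=True
)
missing_customers = flagged.loc[
    flagged['_merge'] == 'left_only', 'CustomerID'
].tolist()

print(sorted(inner['CustomerID'].unique()))
print(missing_customers)
```

Running a check like this on the real data before aggregating would show whether any customer is missing from the lookalike model.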

**3. Aggregate Data at the Customer Level**
* Purpose: Summarize transaction data for each customer.
* TotalSpending: Sum of all transactions made by the customer.
* AverageSpending: Average value of the customer's transactions.
* NumTransactions: Total number of transactions.
* MostFrequentCategory: The most purchased product category.
* Regions: The region associated with the customer (assumes one region per customer).

In [None]:
# Aggregate data at the customer level
customer_summary = data2.groupby('CustomerID').agg(
    TotalSpending=('TotalValue', 'sum'),
    AverageSpending=('TotalValue', 'mean'),
    NumTransactions=('TransactionID', 'count'),
    MostFrequentCategory=('Category', lambda x: x.mode()[0]),  # Most common product category
    Regions=('Region', 'first')  # Assuming each customer is associated with a single region
).reset_index()
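One subtlety in `x.mode()[0]`: when a customer has bought equally often from several categories, `Series.mode` returns all tied values in sorted order, so `[0]` picks the alphabetically first one. A small sketch on hypothetical toy data illustrates this:

```python
import pandas as pd

# Hypothetical purchases: C1 has a clear favourite, C2 has a tie
toy = pd.DataFrame({
    'CustomerID': ['C1', 'C1', 'C1', 'C2', 'C2'],
    'Category': ['Electronics', 'Books', 'Electronics', 'Home Decor', 'Clothing'],
})

# Same lambda as in the aggregation above
most_frequent = toy.groupby('CustomerID')['Category'].agg(lambda x: x.mode()[0])

# C2's tie between 'Clothing' and 'Home Decor' resolves to the alphabetically first value
print(most_frequent.to_dict())
```

If ties matter for the recommendations, a different tie-breaking rule (for example, the category with the highest spending) could be substituted; that would be a design choice beyond what this notebook does.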

In [None]:
customer_summary

Unnamed: 0,CustomerID,TotalSpending,AverageSpending,NumTransactions,MostFrequentCategory,Regions
0,C0001,3354.52,670.904000,5,Electronics,South America
1,C0002,1862.74,465.685000,4,Clothing,Asia
2,C0003,2725.38,681.345000,4,Home Decor,South America
3,C0004,5354.88,669.360000,8,Books,South America
4,C0005,2034.24,678.080000,3,Electronics,Asia
...,...,...,...,...,...,...
194,C0196,4982.88,1245.720000,4,Home Decor,Europe
195,C0197,1928.65,642.883333,3,Electronics,Europe
196,C0198,931.83,465.915000,2,Clothing,Europe
197,C0199,1979.28,494.820000,4,Electronics,Europe


**4. Encode Categorical Variables**
* Purpose: Convert categorical data (Regions and MostFrequentCategory) into numerical format using one-hot encoding.
* fit_transform: Encodes each unique category as a binary column (1 or 0).
* get_feature_names_out: Retrieves the names of the generated columns for clarity; it is called in the next step, when the features are combined.

In [None]:
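# Encode categorical variables
# Note: sparse_output requires scikit-learn 1.2 or later; the same encoder is refit for each column
encoder = OneHotEncoder(sparse_output=False)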
encoded_regions = encoder.fit_transform(customer_summary[['Regions']])
encoded_categories = encoder.fit_transform(customer_summary[['MostFrequentCategory']])
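To make concrete what one-hot encoding produces, here is a small NumPy sketch that builds the same kind of 0/1 matrix by hand. It is an illustration only, not scikit-learn's implementation, and the four region values are sample labels chosen for the example.

```python
import numpy as np

# Sample region labels, one per customer
regions = np.array(['South America', 'Asia', 'South America', 'Europe'])

# np.unique returns the sorted categories and, for each row, the index of its category
categories, codes = np.unique(regions, return_inverse=True)

# Row i gets a 1.0 in the column of its category and 0.0 elsewhere
one_hot = np.zeros((len(regions), len(categories)))
one_hot[np.arange(len(regions)), codes] = 1.0

# Column names follow the '<feature>_<category>' pattern
column_names = [f'Regions_{c}' for c in categories]
print(column_names)
print(one_hot)
```

Each row contains exactly one 1.0, which is why a customer's region or category ends up as a single active column in the feature table.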

**5. Combine Features**
* Purpose: Combine all customer-related features (numerical and encoded categorical) into a single DataFrame for analysis.

In [None]:
# Re-encode each column and capture its column names right after each fit,
# because the same encoder object is reused for both columns
encoded_regions = encoder.fit_transform(customer_summary[['Regions']])
encoded_region_columns = encoder.get_feature_names_out(['Regions'])

encoded_categories = encoder.fit_transform(customer_summary[['MostFrequentCategory']])
encoded_category_columns = encoder.get_feature_names_out(['MostFrequentCategory'])

# Combine features
customer_features = pd.concat(
    [
        customer_summary[['TotalSpending', 'AverageSpending', 'NumTransactions']],
        pd.DataFrame(encoded_regions, columns=encoded_region_columns),
        pd.DataFrame(encoded_categories, columns=encoded_category_columns)
    ],
    axis=1
)


In [None]:
customer_features

Unnamed: 0,TotalSpending,AverageSpending,NumTransactions,Regions_Asia,Regions_Europe,Regions_North America,Regions_South America,MostFrequentCategory_Books,MostFrequentCategory_Clothing,MostFrequentCategory_Electronics,MostFrequentCategory_Home Decor
0,3354.52,670.904000,5,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1,1862.74,465.685000,4,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,2725.38,681.345000,4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,5354.88,669.360000,8,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
4,2034.24,678.080000,3,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
194,4982.88,1245.720000,4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
195,1928.65,642.883333,3,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
196,931.83,465.915000,2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
197,1979.28,494.820000,4,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


**6. Standardize Features**
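* Purpose: Standardize each feature to have a mean of 0 and a standard deviation of 1.
* This puts all columns on a comparable scale, so that large-valued features such as TotalSpending do not dominate the similarity calculation. Because the features are centered, cosine similarity scores can range from -1 to 1.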

In [None]:

# Standardize features
scaler = StandardScaler()
customer_features_scaled = scaler.fit_transform(customer_features)
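Standardization computes z = (x - mean) / std for each column. The sketch below reproduces that arithmetic in NumPy on a small sample matrix (three columns picked for illustration). It uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Sample feature matrix: columns are TotalSpending, NumTransactions, Regions_Asia
features = np.array([
    [3354.52, 5.0, 0.0],
    [1862.74, 4.0, 1.0],
    [2725.38, 4.0, 0.0],
    [5354.88, 8.0, 0.0],
])

# Column-wise mean and population standard deviation (ddof=0 is NumPy's default)
mean = features.mean(axis=0)
std = features.std(axis=0)

# z-score: every column now has mean 0 and standard deviation 1
scaled = (features - mean) / std
print(scaled.round(3))
```

Note that the one-hot column is rescaled as well, so a rare region or category ends up with a larger magnitude than a common one.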


**7. Define Lookalike Recommendation Function**
* Purpose: Recommend similar customers based on their profile and transaction history.
* Retrieve the feature vector for the input customer.
* Calculate the cosine similarity between the input customer and all others.
* Identify the indices of the top n most similar customers (excluding the input customer).
* Return the IDs and similarity scores of the recommended customers.

In [None]:

# Build Lookalike Model
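def recommend_similar_customers(input_customer_id, top_n=3):
    # Locate the input customer's row; fail with a clear message if it is missing
    matches = customer_summary.index[customer_summary['CustomerID'] == input_customer_id]
    if len(matches) == 0:
        raise ValueError(f"Customer {input_customer_id} not found in customer_summary")
    customer_index = matches[0]
    customer_vector = customer_features_scaled[customer_index]

    # Calculate cosine similarity between the input customer and every customer
    similarities = cosine_similarity([customer_vector], customer_features_scaled)[0]

    # Rank by descending similarity and drop the input customer explicitly,
    # rather than assuming it always sorts last (ties at 1.0 can break that)
    ranked_indices = similarities.argsort()[::-1]
    similar_indices = ranked_indices[ranked_indices != customer_index][:top_n]
    similar_customers = customer_summary.iloc[similar_indices].copy()
    similar_customers['SimilarityScore'] = similarities[similar_indices]

    return similar_customers[['CustomerID', 'SimilarityScore']]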
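Cosine similarity between two vectors a and b is a·b / (|a| |b|). As a sketch of the ranking idea independent of scikit-learn, the hypothetical helper `top_n_similar` below computes the scores in NumPy and excludes the query row by masking it. It assumes no row is all zeros, since that would divide by zero.

```python
import numpy as np

def top_n_similar(matrix, query_index, top_n=3):
    """Return (indices, scores) of the rows most similar to row `query_index`."""
    norms = np.linalg.norm(matrix, axis=1)
    query = matrix[query_index]
    # cos(a, b) = a.b / (|a| |b|), computed for every row at once
    scores = matrix @ query / (norms * norms[query_index])
    # Exclude the query row itself by giving it the lowest possible score
    masked = scores.copy()
    masked[query_index] = -np.inf
    order = np.argsort(masked)[::-1][:top_n]
    return order, scores[order]

toy = np.array([
    [1.0, 0.0],
    [2.0, 0.0],    # same direction as row 0
    [0.0, 1.0],    # orthogonal to row 0
    [-1.0, 0.0],   # opposite direction to row 0
])
indices, scores = top_n_similar(toy, query_index=0, top_n=2)
print(indices, scores)
```

Row 1 points in the same direction as row 0 despite being twice as long, which shows that cosine similarity compares the shape of a profile rather than its magnitude.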



**8. Generate Lookalike Map**
* Purpose: Create a dictionary mapping the first 20 customers to their top 3 lookalikes.

In [None]:
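# Create a dictionary to store lookalikes for the first 20 customers
known_ids = set(customer_summary['CustomerID'])
lookalike_map = {}
for customer_id in customers['CustomerID'][:20]:
    # Customers with no transactions are absent from customer_summary, so skip them
    if customer_id not in known_ids:
        continue
    similar_customers = recommend_similar_customers(customer_id)
    lookalike_map[customer_id] = similar_customers.to_dict('records')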


**9. Save Lookalike Data**
* Purpose: Convert the lookalike map into a DataFrame, inspect it, and save it as a CSV file.

In [None]:
# Convert the lookalike map to a DataFrame (each list of lookalikes is stored as a string)
lookalike_df = pd.DataFrame({'CustomerID': list(lookalike_map.keys()),
                             'Lookalikes': [str(value) for value in lookalike_map.values()]})

In [None]:
lookalike_df

Unnamed: 0,CustomerID,Lookalikes
0,C0001,"[{'CustomerID': 'C0190', 'SimilarityScore': 0...."
1,C0002,"[{'CustomerID': 'C0088', 'SimilarityScore': 0...."
2,C0003,"[{'CustomerID': 'C0052', 'SimilarityScore': 0...."
3,C0004,"[{'CustomerID': 'C0155', 'SimilarityScore': 0...."
4,C0005,"[{'CustomerID': 'C0186', 'SimilarityScore': 0...."
5,C0006,"[{'CustomerID': 'C0171', 'SimilarityScore': 0...."
6,C0007,"[{'CustomerID': 'C0140', 'SimilarityScore': 0...."
7,C0008,"[{'CustomerID': 'C0065', 'SimilarityScore': 0...."
8,C0009,"[{'CustomerID': 'C0010', 'SimilarityScore': 0...."
9,C0010,"[{'CustomerID': 'C0111', 'SimilarityScore': 0...."

In [None]:
# Save the lookalike DataFrame to a CSV file
lookalike_df.to_csv('Vasudha_Guddeti_Lookalike.csv', index=False)
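Storing each lookalike list with `str(...)` produces a Python-literal string that other tools may not parse easily. One possible alternative, sketched here with a hypothetical map and made-up scores, is to serialize each list as JSON so that it can be read back with `json.loads`.

```python
import json

# Hypothetical lookalike map with made-up scores (not real model output)
toy_map = {
    'C0001': [{'CustomerID': 'C0002', 'SimilarityScore': 0.91},
              {'CustomerID': 'C0003', 'SimilarityScore': 0.85}],
}

# Serialize each list of lookalikes to a JSON string (one string per customer)
rows = [(cid, json.dumps(lookalikes)) for cid, lookalikes in toy_map.items()]

# Parse each stored string back into a list of dicts
restored = {cid: json.loads(text) for cid, text in rows}
print(restored['C0001'][0]['CustomerID'])
```

The `rows` list could be passed to `pd.DataFrame(rows, columns=['CustomerID', 'Lookalikes'])` before calling `to_csv`, keeping the same two-column layout as the file written above.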