# Task 2: Lookalike Model

Lookalike Model Approach
1. Data Used:
* Merged Customers.csv (customer profiles), Transactions.csv (transaction history), and Products.csv (product details) to create a unified dataset with customer, transaction, and product information.
2. Similarity Calculation:
* Used Total Purchase Value, Total Quantity, Average Price, most frequent Category (one-hot encoded), and Region (one-hot encoded) as features, and calculated customer similarity using Cosine Similarity.
3. Selecting Top 3 Lookalikes:
* For each customer, the top 3 most similar customers were selected by sorting the cosine similarity scores in descending order, excluding the customer itself.
4. Final Output:
* Saved the lookalike recommendations for the first 20 customers in a CSV file, including CustomerID, Lookalikes (3 similar customers), and Scores (similarity scores).
5. Mean Pairwise Distance:
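* Calculated the mean pairwise distance (Euclidean, the default of pairwise_distances) between a held-out 20% test split and the train split as a rough sanity check. Because there are no ground-truth lookalike labels, this is not an accuracy score; a lower distance only indicates that test customers lie close to train customers in the reduced feature space.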
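
To make the mean pairwise distance in step 5 concrete, here is a minimal sketch on made-up toy vectors. The arrays `train` and `test` are hypothetical and are not the notebook's data. It illustrates that `pairwise_distances` uses Euclidean distance unless `metric="cosine"` is passed, so the value it reports is not on the same scale as the cosine similarity scores used for the lookalikes.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical toy feature vectors (illustration only, not the real customers).
train = np.array([[1.0, 0.0], [0.0, 1.0]])
test = np.array([[2.0, 0.0]])

# Default metric is Euclidean, which is sensitive to vector length.
euclidean_mean = pairwise_distances(test, train).mean()

# Cosine distance is 1 - cosine similarity and ignores vector length.
cosine_mean = pairwise_distances(test, train, metric="cosine").mean()

print(f"Euclidean mean: {euclidean_mean:.4f}, cosine mean: {cosine_mean:.4f}")
```

If the goal is to stay consistent with the cosine-based lookalike scores, passing `metric="cosine"` may be the more natural choice, though either number is only a descriptive statistic.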

Logic Used:
1. Feature Engineering:
* Aggregated transaction data per customer (total purchase value, total quantity, average price, most frequent category) and one-hot encoded the categorical features (top category, region) to create a customer feature set.
2. Dimensionality Reduction:
* Standardized the numerical features, then applied PCA with 5 components to compress the customer features while keeping the directions of greatest variance.
3. Similarity Measurement:
* Used Cosine Similarity to compare customers based on their reduced feature vectors. Scores range from -1 to 1, and a higher score means two customers point in a more similar direction in feature space.
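
The three steps above can be sketched end to end on a tiny example. The customer IDs and feature values below are hypothetical, and the helper `top_lookalikes` is an illustrative name rather than part of the notebook. The sketch assumes the features are already standardized, reduces them with PCA, computes cosine similarity, and picks the top matches while explicitly dropping the customer's own entry.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical, already-standardized features for five customers.
ids = ["C1", "C2", "C3", "C4", "C5"]
features = pd.DataFrame(
    [
        [1.0, 0.9, 0.1],
        [0.9, 1.0, 0.0],
        [-1.0, -0.8, 0.2],
        [-0.9, -1.0, 0.1],
        [0.1, 0.0, 1.0],
    ],
    index=ids,
)

# Reduce to two principal components, then compare customers by cosine similarity.
reduced = PCA(n_components=2).fit_transform(features)
similarity = pd.DataFrame(cosine_similarity(reduced), index=ids, columns=ids)


def top_lookalikes(customer_id, k=3):
    """Return the k most similar customers, excluding the customer itself."""
    scores = similarity[customer_id].drop(customer_id)
    return scores.nlargest(k)


lookalikes = top_lookalikes("C1")
print(lookalikes)
```

Dropping the customer by label, rather than skipping the first sorted row, avoids relying on the self-similarity always sorting to the top when ties or rounding are involved.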

import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.metrics import pairwise_distances

# Step 1: Load the data
customers_df = pd.read_csv('/content/Customers.csv')
transactions_df = pd.read_csv('/content/Transactions.csv')
products_df = pd.read_csv('/content/Products.csv')

# Step 2: Merge and aggregate data
merged_transactions = transactions_df.merge(products_df, on="ProductID", how="left")

# Clean column names by stripping any leading/trailing spaces
merged_transactions.columns = merged_transactions.columns.str.strip()

customer_transactions = merged_transactions.merge(customers_df, on="CustomerID", how="left")

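# Inspect the merged DataFrame. Both Transactions.csv and Products.csv have a 'Price' column,
# so the merge suffixes them as 'Price_x' (transactions) and 'Price_y' (products).
print(customer_transactions.head())  # Debugging aid; remove once the columns are confirmed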

# Aggregate transaction-level data into customer-level features
customer_features = customer_transactions.groupby("CustomerID").agg({
    "TotalValue": "sum",
    "Quantity": "sum",
    "Price_x": "mean",  # Price from Transactions.csv (suffixed '_x' by the merge)
    "Category": lambda x: x.mode()[0] if not x.mode().empty else None
}).reset_index()

customer_features.columns = ["CustomerID", "TotalPurchaseValue", "TotalQuantity", "AvgPrice", "TopCategory"]
customer_features = customer_features.merge(customers_df, on="CustomerID", how="left")

# Step 3: Encode categorical features
encoder = OneHotEncoder()
encoded_category = pd.DataFrame(encoder.fit_transform(customer_features[["TopCategory"]]).toarray(),
                                columns=encoder.get_feature_names_out(["TopCategory"]))
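# dtype=float keeps the dummy columns numeric (recent pandas versions default to bool)
encoded_region = pd.get_dummies(customer_features["Region"], prefix="Region", dtype=float)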

# Combine features into a single DataFrame
customer_features = pd.concat([customer_features, encoded_category, encoded_region], axis=1)
customer_features.drop(columns=["Region", "CustomerName", "SignupDate", "TopCategory"], inplace=True)

# Step 4: Standardize numerical features
scaler = StandardScaler()
numerical_features = ["TotalPurchaseValue", "TotalQuantity", "AvgPrice"]
customer_features[numerical_features] = scaler.fit_transform(customer_features[numerical_features])

# Step 5: Dimensionality reduction with PCA
pca = PCA(n_components=5)
feature_matrix = customer_features.set_index("CustomerID")  # CustomerID becomes the index, not a feature
customer_pca = pca.fit_transform(feature_matrix)
customer_features_pca = pd.DataFrame(customer_pca, index=feature_matrix.index)

# Step 6: Calculate similarity
similarity_matrix = cosine_similarity(customer_features_pca)
similarity_df = pd.DataFrame(similarity_matrix, index=customer_features["CustomerID"], columns=customer_features["CustomerID"])

# Step 7: Recommend lookalikes
lookalike_results = {}
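for customer_id in customer_features["CustomerID"][:20]:
    # Drop the customer's own entry explicitly instead of assuming it sorts first
    # (ties or floating-point error can place another customer at position 0).
    similar_customers = similarity_df[customer_id].drop(customer_id).nlargest(3)
    lookalike_results[customer_id] = list(similar_customers.index), list(similar_customers.values)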

# Convert to DataFrame
lookalike_df = pd.DataFrame([{
    "CustomerID": cust_id,
    "Lookalikes": lookalike_results[cust_id][0],
    "Scores": lookalike_results[cust_id][1]
} for cust_id in lookalike_results])

# Step 8: Print sample rows for clarity
print(lookalike_df.head())

# Step 9: Save results
lookalike_df.to_csv("Saras Chandrika_Akkineni_Lookalike.csv", index=False)

# Step 10: Sanity check on a held-out split
# There are no ground-truth lookalike labels, so this is a descriptive check, not a true accuracy score.
train, test = train_test_split(customer_features_pca, test_size=0.2, random_state=42)
train_similarity = cosine_similarity(train)       # train-vs-train similarity (not used below)
test_similarity = cosine_similarity(test, train)  # test-vs-train similarity (not used below)

# pairwise_distances defaults to Euclidean distance; a lower mean means the test
# customers lie closer to the train customers in the reduced feature space.
mean_distance = pairwise_distances(test, train).mean()
print(f"Mean pairwise distance between test and train data: {mean_distance:.4f}")

  TransactionID CustomerID ProductID      TransactionDate  Quantity  \
0        T00001      C0199      P067  2024-08-25 12:38:23         1   
1        T00112      C0146      P067  2024-05-27 22:23:54         1   
2        T00166      C0127      P067  2024-04-25 07:38:55         1   
3        T00272      C0087      P067  2024-03-26 22:55:37         2   
4        T00363      C0070      P067  2024-03-21 15:10:10         3   

   TotalValue  Price_x                      ProductName     Category  Price_y  \
0      300.68   300.68  ComfortLiving Bluetooth Speaker  Electronics   300.68   
1      300.68   300.68  ComfortLiving Bluetooth Speaker  Electronics   300.68   
2      300.68   300.68  ComfortLiving Bluetooth Speaker  Electronics   300.68   
3      601.36   300.68  ComfortLiving Bluetooth Speaker  Electronics   300.68   
4      902.04   300.68  ComfortLiving Bluetooth Speaker  Electronics   300.68   

      CustomerName         Region  SignupDate  
0   Andrea Jenkins         Europe  202