# Amazon Customer Purchase Behaviour Analysis

In this notebook, I analyze the Amazon Customer Purchase Dataset to understand customer behavior, clean and transform the data, extract useful patterns, build predictive models, and identify actionable insights.

The main goals are:

- Clean and preprocess the data
- Create customer-level features
- Perform clustering (customer segmentation)
- Build predictive models
  - Linear Regression (CLV prediction)
  - Logistic Regression (churn prediction)
- Generate insights for business decisions

# Import Libraries

In [138]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score, mean_squared_error, accuracy_score, confusion_matrix, classification_report


# Loading the Dataset

In [140]:

df= pd.read_excel(r"C:\Users\adars\Downloads\Hero Vired\Machine Learning\Amazon_Customer_Purchase_Data\Amazon_Customer_Purchase_Data.xlsx")
df.head()

Unnamed: 0,Customer_ID,Customer_Name,Age,Gender,Location,Product_Category,Product_ID,Purchase_Date,Purchase_Amount,Payment_Method,Rating,Feedback_Comments,Customer_Lifetime_Value,Loyalty_Score,Discount_Applied,Return_Status,Customer_Segment,Preferred_Shopping_Channel
0,17270,John,56.0,Other,New York,Books,674,2020-01-01 00:00:00,491.643012,Cash,,,3673.712747,60,No,No,Regular,In-store
1,10860,Eve,33.0,Other,Houston,Home Appliances,393,2020-01-01 01:00:00,144.326722,Cash,5.0,Good,2103.060388,29,Yes,Yes,New,In-store
2,15390,John,50.0,Female,Houston,Clothing,995,2020-01-01 02:00:00,109.301892,Bank Transfer,,,899.115059,92,No,No,VIP,Online
3,15191,Eve,66.0,Other,San Francisco,Electronics,405,2020-01-01 03:00:00,226.655516,Bank Transfer,2.0,Excellent,2591.137716,62,Yes,Yes,Regular,Online
4,15734,Eve,38.0,Female,New York,Toys,353,2020-01-01 04:00:00,37.85188,Bank Transfer,2.0,,548.620397,80,No,Yes,Regular,Both


# Basic Data Inspection

In [142]:
df.info()
df.describe()
df.isna().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   Customer_ID                 2000 non-null   int64         
 1   Customer_Name               1900 non-null   object        
 2   Age                         1900 non-null   float64       
 3   Gender                      2000 non-null   object        
 4   Location                    2000 non-null   object        
 5   Product_Category            2000 non-null   object        
 6   Product_ID                  2000 non-null   int64         
 7   Purchase_Date               2000 non-null   datetime64[ns]
 8   Purchase_Amount             1800 non-null   float64       
 9   Payment_Method              2000 non-null   object        
 10  Rating                      1860 non-null   float64       
 11  Feedback_Comments           1099 non-null   object      

Customer_ID                     0
Customer_Name                 100
Age                           100
Gender                          0
Location                        0
Product_Category                0
Product_ID                      0
Purchase_Date                   0
Purchase_Amount               200
Payment_Method                  0
Rating                        140
Feedback_Comments             901
Customer_Lifetime_Value       200
Loyalty_Score                   0
Discount_Applied                0
Return_Status                   0
Customer_Segment                0
Preferred_Shopping_Channel      0
dtype: int64

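Observation:

Four numerical columns have missing values: Age (100), Purchase_Amount (200), Rating (140), and Customer_Lifetime_Value (200).

Two text columns also have gaps: Customer_Name (100) and Feedback_Comments (901). Neither is used in the models below, so they are left as they are.

The remaining columns are complete, but the data still needs transformation before modeling.

No major data type issues: Purchase_Date is already parsed as datetime64[ns].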
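
The missing-value counts above are easier to compare as percentages. The sketch below shows one way to compute them. It runs on a small made-up frame (`toy`) rather than the real `df`, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `df`; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [56.0, np.nan, 50.0, 66.0],
    "Purchase_Amount": [491.6, 144.3, np.nan, np.nan],
    "Gender": ["Other", "Other", "Female", "Other"],
})

# Share of missing values per column, as a percentage, largest first.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Applied to the real data, the same expression on `df` would rank Feedback_Comments first, since it has the most missing entries.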

# Handling Missing Values

In [145]:
#Impute numerical columns with median
data = df.copy()

num_cols = ["Age", "Purchase_Amount", "Rating", "Customer_Lifetime_Value"]
for col in num_cols:
    data[col] = data[col].fillna(data[col].median())


In [146]:
#Impute categorical columns with mode
cat_cols = ["Payment_Method"]
for col in cat_cols:
    if data[col].isna().any():
        data[col] = data[col].fillna(data[col].mode()[0])


# Remove Duplicate Transactions

In [148]:

data = data.drop_duplicates(subset=["Customer_ID", "Purchase_Date"])
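
Before dropping rows, it can help to check how many would be removed. This is a minimal sketch on made-up transactions (`tx` is a hypothetical stand-in for `data`), using the same key of Customer_ID plus Purchase_Date.

```python
import pandas as pd

# Made-up transactions: the first two rows share Customer_ID and Purchase_Date.
tx = pd.DataFrame({
    "Customer_ID": [1, 1, 2],
    "Purchase_Date": pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 00:00", "2020-01-01 01:00"]
    ),
    "Purchase_Amount": [10.0, 10.0, 20.0],
})

key = ["Customer_ID", "Purchase_Date"]

# duplicated() marks every repeat after the first occurrence of each key.
n_dupes = int(tx.duplicated(subset=key).sum())
deduped = tx.drop_duplicates(subset=key)

print(n_dupes, len(deduped))
```

Note that `drop_duplicates` keeps the first row for each key by default, so the row order before this step decides which duplicate survives.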


# Outlier Treatment (Winsorization)

Winsorization is used to clip extreme values without deleting rows.

In [150]:
def winsorize(s, lower=0.01, upper=0.99):
    return s.clip(s.quantile(lower), s.quantile(upper))

data["Purchase_Amount"] = winsorize(data["Purchase_Amount"])
data["Customer_Lifetime_Value"] = winsorize(data["Customer_Lifetime_Value"])
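
To make the effect of clipping concrete, here is a small worked example on made-up amounts. The helper is repeated so the snippet runs on its own. The 0.75 upper quantile is chosen only so that the effect is visible on five data points; the notebook itself uses 0.01 and 0.99.

```python
import pandas as pd

def winsorize(s, lower=0.01, upper=0.99):
    # Same helper as in the notebook: clip values to the [lower, upper] quantile range.
    return s.clip(s.quantile(lower), s.quantile(upper))

# Made-up amounts with one extreme value at the end.
amounts = pd.Series([10.0, 20.0, 30.0, 40.0, 10_000.0])
clipped = winsorize(amounts, lower=0.0, upper=0.75)

print(len(clipped) == len(amounts))  # True: no rows are dropped
print(clipped.tolist())              # the extreme value is capped at the 75th percentile (40.0)
```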


# Feature Engineering

In [152]:
#Create flags for discount & return
data["Discount_Flag"] = (data["Discount_Applied"] == "Yes").astype(int)
data["Return_Flag"] = (data["Return_Status"] == "Yes").astype(int)


In [153]:
#Aggregate to Customer-Level Dataset

#I want one-row-per-customer for segmentation and modeling.

def get_mode(x):
    return x.mode()[0]

agg = {
    "Age": "median",
    "Gender": get_mode,
    "Location": get_mode,
    "Preferred_Shopping_Channel": get_mode,
    "Purchase_Amount": ["sum", "mean", "count"],
    "Rating": "mean",
    "Discount_Flag": "mean",
    "Return_Flag": "mean",
    "Customer_Lifetime_Value": "mean",
    "Loyalty_Score": "mean"
}

cust = data.groupby("Customer_ID").agg(agg)
cust.columns = ["_".join(col).strip("_") for col in cust.columns]
cust.head()


Customer_ID,Age_median,Gender_get_mode,Location_get_mode,Preferred_Shopping_Channel_get_mode,Purchase_Amount_sum,Purchase_Amount_mean,Purchase_Amount_count,Rating_mean,Discount_Flag_mean,Return_Flag_mean,Customer_Lifetime_Value_mean,Loyalty_Score_mean
10001,44.5,Male,Chicago,In-store,145.014556,72.507278,2,3.5,0.0,0.5,485.826528,22.0
10004,40.0,Female,New York,Online,257.545509,257.545509,1,2.0,0.0,0.0,2305.36438,48.0
10005,32.0,Other,Chicago,Both,453.168418,453.168418,1,4.0,0.0,0.0,4522.924674,94.0
10009,39.0,Female,New York,Online,721.487122,240.495707,3,2.333333,0.666667,0.0,3285.252463,20.0
10011,43.0,Female,Los Angeles,Online,89.687865,89.687865,1,2.0,0.0,0.0,965.220686,90.0


In [154]:
#Renaming Columns for Clarity

In [155]:
cust = cust.rename(columns={
    "Age_median": "Age",                               
    "Gender_get_mode": "Gender",
    "Location_get_mode": "Location",
    "Preferred_Shopping_Channel_get_mode": "Preferred_Shopping_Channel",
    "Purchase_Amount_sum": "Total_Purchase_Amount",
    "Purchase_Amount_mean": "Avg_Purchase_Amount",
    "Purchase_Amount_count": "Num_Orders",
    "Rating_mean": "Avg_Rating",
    "Discount_Flag_mean": "Discount_Usage_Rate",
    "Return_Flag_mean": "Return_Rate",
    "Customer_Lifetime_Value_mean": "CLV",
    "Loyalty_Score_mean": "Loyalty_Score"
})
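
As an alternative to aggregating with a dictionary and then renaming, pandas named aggregation can produce the final column names in one step. The sketch below is an option, not the method used in this notebook. It runs on made-up transactions (`tx`) with a subset of the columns.

```python
import pandas as pd

# Toy transactions; column names follow the notebook, values are made up.
tx = pd.DataFrame({
    "Customer_ID": [1, 1, 2],
    "Age": [30.0, 32.0, 45.0],
    "Gender": ["Male", "Male", "Female"],
    "Purchase_Amount": [100.0, 50.0, 200.0],
    "Return_Flag": [1, 0, 0],
})

def get_mode(x):
    # Most frequent value; ties resolve to the first value in sorted order.
    return x.mode()[0]

# Named aggregation: output_name=(input_column, function).
cust_toy = tx.groupby("Customer_ID").agg(
    Age=("Age", "median"),
    Gender=("Gender", get_mode),
    Total_Purchase_Amount=("Purchase_Amount", "sum"),
    Avg_Purchase_Amount=("Purchase_Amount", "mean"),
    Num_Orders=("Purchase_Amount", "count"),
    Return_Rate=("Return_Flag", "mean"),
)
print(cust_toy)
```

This avoids the flattened names such as `Gender_get_mode` and the separate rename cell.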


In [156]:
#Add Payment Method (Most Frequent)
cust["Payment_Method"] = data.groupby("Customer_ID")["Payment_Method"].agg(get_mode)


In [157]:
#Creating Customer Segment (Rule-based)
def segment(score):
    if score < 40:
        return "New"
    elif score < 70:
        return "Regular"
    return "VIP"

cust["Customer_Segment"] = cust["Loyalty_Score"].apply(segment)
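
The same rule can be written without a Python-level function by using `pd.cut`. This is a sketch of an equivalent, vectorized form, shown on made-up scores chosen to sit on the bin edges.

```python
import numpy as np
import pandas as pd

# Made-up loyalty scores on and around the thresholds 40 and 70.
scores = pd.Series([39.0, 40.0, 69.9, 70.0])

# right=False makes each bin [low, high), matching `score < 40` and `score < 70`.
segments = pd.cut(
    scores,
    bins=[-np.inf, 40, 70, np.inf],
    labels=["New", "Regular", "VIP"],
    right=False,
)
print(segments.tolist())  # ['New', 'Regular', 'Regular', 'VIP']
```

The result is a categorical column, which also keeps the segment order (New, Regular, VIP) for later sorting or plotting.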


In [158]:
#Create a Churn Label (Based on Recency)
last_purchase = data.groupby("Customer_ID")["Purchase_Date"].max()
threshold = last_purchase.quantile(0.75)

cust["Churn"] = (last_purchase < threshold).astype(int)
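
Because the threshold is the 75th percentile of last-purchase dates, this definition labels roughly three quarters of customers as churned by construction. The toy example below, with made-up dates, shows the resulting class balance.

```python
import pandas as pd

# Made-up last-purchase dates for eight customers, one day apart.
last_purchase_toy = pd.Series(
    pd.date_range("2020-01-01", periods=8, freq="D"),
    index=range(101, 109),
)

# Same rule as in the notebook: churned if the last purchase is before the 75th percentile date.
threshold_toy = last_purchase_toy.quantile(0.75)
churn_toy = (last_purchase_toy < threshold_toy).astype(int)

# Fraction of customers labeled as churned.
print(churn_toy.mean())  # 0.75
```

This imbalance matters when reading accuracy later: a model that always predicts churn would already be right about 75% of the time.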


In [159]:
#Customer Segmentation (K-Means)
seg_features = cust[["Total_Purchase_Amount", "Num_Orders", "Loyalty_Score"]]


In [160]:
#Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(seg_features)


# Run K-Means (3 clusters)

In [162]:

kmeans = KMeans(n_clusters=3, random_state=42)
cust["Cluster"] = kmeans.fit_predict(X_scaled)
cust["Cluster"].value_counts()


Cluster
1    827
2    796
0    177
Name: count, dtype: int64
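
The notebook fixes the number of clusters at 3. One common way to check that choice is to compare silhouette scores across several values of k. The sketch below does this on synthetic blobs (`X_demo` is a hypothetical stand-in for `X_scaled`), so it shows the procedure and not a result for this dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled segmentation features: three separated blobs.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
X_demo = StandardScaler().fit_transform(X_demo)

# Silhouette score for several candidate k; higher means better-separated clusters.
scores_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores_by_k[k] = silhouette_score(X_demo, labels)

best_k = max(scores_by_k, key=scores_by_k.get)
print(scores_by_k, best_k)
```

On the real data the same loop would run on `X_scaled`; the best k there is not known in advance and may differ from 3.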

In [163]:
cust.groupby("Cluster")[["Total_Purchase_Amount", "Num_Orders", "Loyalty_Score", "CLV"]].mean()


Cluster,Total_Purchase_Amount,Num_Orders,Loyalty_Score,CLV
0,545.744613,2.129944,48.945857,2535.328216
1,258.556496,1.0,76.391778,2522.764188
2,258.017855,1.0,26.526382,2554.580945


# Linear Regression - CLV Prediction

In [165]:
reg_df = cust[["Age", "Total_Purchase_Amount", "Num_Orders",
               "Discount_Usage_Rate", "Loyalty_Score", "CLV",
               "Payment_Method"]]

reg_df = pd.get_dummies(reg_df, columns=["Payment_Method"], drop_first=True)

X = reg_df.drop("CLV", axis=1)
y = reg_df["CLV"]
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model
lr = LinearRegression()
lr.fit(X_train, y_train)

pred = lr.predict(X_test)
# Evaluation (RMSE via np.sqrt, because the `squared` argument is deprecated since scikit-learn 1.4)
print("R² Score:", r2_score(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))



R² Score: 0.7077253039115983
RMSE: 846.4017482023796




# Logistic Regression - Churn Prediction

In [167]:
churn_df = cust[[
    "Churn", "Age", "Total_Purchase_Amount", "Num_Orders", "Avg_Rating",
    "Discount_Usage_Rate", "Loyalty_Score", "Gender", "Location",
    "Preferred_Shopping_Channel", "Payment_Method"
]]

churn_df = pd.get_dummies(
    churn_df,
    columns=["Gender", "Location", "Preferred_Shopping_Channel", "Payment_Method"],
    drop_first=True
)

Xc = churn_df.drop("Churn", axis=1)
yc = churn_df["Churn"]
# Train/test split (stratified on the churn label)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.2, random_state=42, stratify=yc)

# Fit the model

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(Xc_train, yc_train)
yc_pred = log_reg.predict(Xc_test)
# Evaluation
print("Accuracy:", accuracy_score(yc_test, yc_pred))
print(confusion_matrix(yc_test, yc_pred))
print(classification_report(yc_test, yc_pred))


Accuracy: 0.7555555555555555
[[  5  85]
 [  3 267]]
              precision    recall  f1-score   support

           0       0.62      0.06      0.10        90
           1       0.76      0.99      0.86       270

    accuracy                           0.76       360
   macro avg       0.69      0.52      0.48       360
weighted avg       0.73      0.76      0.67       360
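
The headline accuracy can be checked against the confusion matrix reported above. This short calculation recomputes accuracy and per-class recall from the four cells, and compares them with the accuracy of always predicting the majority class.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[5, 85],
               [3, 267]])

accuracy = np.trace(cm) / cm.sum()                    # (5 + 267) / 360
recall_per_class = np.diag(cm) / cm.sum(axis=1)       # [5 / 90, 267 / 270]
majority_baseline = cm.sum(axis=1).max() / cm.sum()   # always predict class 1: 270 / 360

print(round(accuracy, 4))         # 0.7556
print(recall_per_class.round(2))  # [0.06 0.99]
print(majority_baseline)          # 0.75
```

The model is less than one percentage point above the majority-class baseline, and it correctly identifies only 5 of the 90 class-0 customers.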



In [168]:
cust.to_csv("amazon_customer_level_output.csv")


In [169]:
# Save customer-level cleaned dataset with new features.
# Customer_ID is the index of `cust`, so the index is kept (no index=False) to preserve the IDs.
cust.to_csv("amazon_customer_cleaned_with_features.csv")
print("Saved: amazon_customer_cleaned_with_features.csv")


Saved: amazon_customer_cleaned_with_features.csv


# Additional Comparison of Regression Models

In [171]:
# SIMPLE REGRESSION MODEL COMPARISON PIPELINE


# Models
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor   # remove if not installed


def compare_regression_models(X, y):

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # All models you want to test
    models = [
        ("Linear Regression", LinearRegression()),
        ("Ridge", Ridge(alpha=1.0)),
        ("Lasso", Lasso(alpha=0.001)),
        ("Random Forest", RandomForestRegressor(n_estimators=300, random_state=42)),
        ("Gradient Boosting", GradientBoostingRegressor(random_state=42)),
        ("XGBoost", XGBRegressor(
            n_estimators=300, learning_rate=0.05, max_depth=6,
            subsample=0.8, colsample_bytree=0.8, random_state=42
        ))
    ]

    results = []

    # Train and evaluate each model
    for name, model in models:
        model.fit(X_train, y_train)
        preds = model.predict(X_test)

        r2 = r2_score(y_test, preds)
        rmse = np.sqrt(mean_squared_error(y_test, preds))  # `squared=False` is deprecated since scikit-learn 1.4

        results.append([name, r2, rmse])

    # Results table
    return pd.DataFrame(results, columns=["Model", "R2 Score", "RMSE"])\
             .sort_values(by="R2 Score", ascending=False)


# ---- RUN PIPELINE -----
regression_results = compare_regression_models(X, y)
regression_results




Unnamed: 0,Model,R2 Score,RMSE
4,Gradient Boosting,0.716641,833.392399
2,Lasso,0.707726,846.401449
0,Linear Regression,0.707725,846.401748
1,Ridge,0.707618,846.556871
5,XGBoost,0.659675,913.330176
3,Random Forest,0.657266,916.556636


### Findings based on Output
Gradient Boosting performed best among all models for predicting CLV.

It gave the highest R² (0.717), meaning it explained the most variation in CLV on the test set.

It also had the lowest RMSE (833.4), meaning its predictions were the most accurate.

Linear models (Linear, Ridge, Lasso) performed almost identically (R² of about 0.708) and gave good, stable results.

The gap to Gradient Boosting is under 0.01 in R², which suggests the relationships in the data are mostly linear and simple models capture them.

Random Forest and XGBoost performed worse (R² of about 0.66), showing they didn't find strong nonlinear patterns. With untuned settings on a dataset of this size, they may also be overfitting.

These results come from a single 80/20 split, so small differences between models should be read with caution.

Model performance can improve further by adding more features (recency, frequency, monetary value, behavior patterns) or tuning tree-model parameters.
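
The comparison above relies on a single train/test split. One way to see whether small gaps between models are stable is k-fold cross-validation. The sketch below uses synthetic data (`X_demo`, `y_demo` are hypothetical stand-ins for `X` and `y`) and only two of the models, to keep it short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data; in the notebook this would be X and y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Five shuffled folds; each model is scored on every held-out fold.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_results = {}
for name, model in [("Linear Regression", LinearRegression()), ("Ridge", Ridge(alpha=1.0))]:
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R2 = {scores.mean():.3f}, std = {scores.std():.3f}")
```

If the standard deviation across folds is larger than the gap between two models, the ranking between them is not reliable.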

In [173]:
# SIMPLE CLASSIFICATION MODEL COMPARISON PIPELINE


# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import f1_score   # used below but not imported earlier
from xgboost import XGBClassifier


def compare_classification_models(X, y):

    # Train-test split (stratify keeps class balance similar in train & test)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # All models you want to try
    models = [
        ("Logistic Regression", LogisticRegression(max_iter=1000)),
        ("Random Forest", RandomForestClassifier(
            n_estimators=300, class_weight="balanced", random_state=42)),
        ("Gradient Boosting", GradientBoostingClassifier(random_state=42)),
        ("XGBoost", XGBClassifier(
            n_estimators=400, learning_rate=0.05, max_depth=5,
            subsample=0.8, eval_metric='logloss', random_state=42
        ))
    ]

    results = []

    # Train and evaluate each model
    for name, model in models:
        model.fit(X_train, y_train)
        preds = model.predict(X_test)

        acc = accuracy_score(y_test, preds)
        f1 = f1_score(y_test, preds)

        results.append([name, acc, f1])

    # Results table
    return pd.DataFrame(results, columns=["Model", "Accuracy", "F1 Score"])\
             .sort_values(by="F1 Score", ascending=False)


# ---- RUN PIPELINE -----
classification_results = compare_classification_models(Xc, yc)
classification_results


Unnamed: 0,Model,Accuracy,F1 Score
0,Logistic Regression,0.755556,0.858521
2,Gradient Boosting,0.741667,0.84878
1,Random Forest,0.730556,0.842788
3,XGBoost,0.686111,0.808799


# Key Findings from Churn Model Comparison
Logistic Regression had the highest accuracy (0.756) and F1 score (0.859) of the four models.

These numbers need context: about 75% of customers are labeled as churned (270 of 360 in the test set), so always predicting churn would already reach 0.75 accuracy.

The classification report for Logistic Regression shows the same thing: recall for class 0 is only 0.06, so the model predicts churn for almost everyone.

Logistic Regression is still the easiest model to interpret and explain to business teams.

Gradient Boosting came second (accuracy 0.742), Random Forest third (0.731), and XGBoost last (0.686), possibly because of the small data size.

None of the more complex models beat Logistic Regression, so added model complexity does not help here.

Because no model scores meaningfully above the majority-class baseline, the current features appear to carry little signal about the recency-based churn label.

No model drastically outperformed the others, so feature engineering may help more than new models.

Overall, Logistic Regression is the simplest reasonable choice for churn prediction in this case, but it is best treated as a baseline and not as a finished model.
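
One way to put accuracy figures in context on imbalanced labels is to report a majority-class baseline and a balance-aware metric next to each model. The sketch below does this on synthetic data with a roughly 25/75 class split (`X_demo`, `y_demo` are hypothetical stand-ins for `Xc` and `yc`). The use of `class_weight="balanced"` is a suggestion, not something the notebook's logistic model does.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 25% class 0, 75% class 1).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_clusters_per_class=1,
    class_sep=1.5, weights=[0.25, 0.75], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

candidates = {
    "Majority baseline": DummyClassifier(strategy="most_frequent"),
    "Logistic (balanced)": LogisticRegression(max_iter=1000, class_weight="balanced"),
}

# Store (accuracy, balanced accuracy) per model.
summary = {}
for name, model in candidates.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    summary[name] = (accuracy_score(y_te, preds), balanced_accuracy_score(y_te, preds))
    print(name, summary[name])
```

Balanced accuracy averages recall over the classes, so a model that always predicts the majority class scores exactly 0.5 regardless of how high its plain accuracy is.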