# Ecommerce Consumer Behavior Analysis  
## 01 – Data Understanding & Validation

In this notebook I load the raw ecommerce consumer behavior dataset, understand the schema, and perform initial data quality checks before cleaning and feature engineering.

In [1]:
import pandas as pd 
import numpy as np

pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", lambda x: f"{x:,.2f}")

In [2]:
file_path = "../data/raw/ecommerce_consumer_behavior.csv"
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Education_Level,Occupation,Location,Purchase_Category,Purchase_Amount,Frequency_of_Purchase,Purchase_Channel,Brand_Loyalty,Product_Rating,Time_Spent_on_Product_Research(hours),Social_Media_Influence,Discount_Sensitivity,Return_Rate,Customer_Satisfaction,Engagement_with_Ads,Device_Used_for_Shopping,Payment_Method,Time_of_Purchase,Discount_Used,Customer_Loyalty_Program_Member,Purchase_Intent,Shipping_Preference,Time_to_Decision
0,37-611-6911,22,Female,Middle,Married,Bachelor's,Middle,Évry,Gardening & Outdoors,$333.80,4,Mixed,5,5,2.0,,Somewhat Sensitive,1,7,,Tablet,Credit Card,03-01-2024,True,False,Need-based,No Preference,2
1,29-392-9296,49,Male,High,Married,High School,High,Huocheng,Food & Beverages,$222.22,11,In-Store,3,1,2.0,Medium,Not Sensitive,1,5,High,Tablet,PayPal,4/16/2024,True,False,Wants-based,Standard,6
2,84-649-5117,24,Female,Middle,Single,Master's,High,Huzhen,Office Supplies,$426.22,2,Mixed,5,5,0.3,Low,Not Sensitive,1,7,Low,Smartphone,Debit Card,3/15/2024,True,True,Impulsive,No Preference,3
3,48-980-6078,29,Female,Middle,Single,Master's,Middle,Wiwilí,Home Appliances,$101.31,6,Mixed,3,1,1.0,High,Somewhat Sensitive,0,1,,Smartphone,Other,10-04-2024,True,True,Need-based,Express,10
4,91-170-9072,33,Female,Middle,Widowed,High School,Middle,Nara,Furniture,$211.70,6,Mixed,3,4,0.0,Medium,Not Sensitive,2,10,,Smartphone,Debit Card,1/30/2024,False,False,Wants-based,No Preference,4


### Basic shape & columns


In [3]:
print("Rows", df.shape[0])
print("Columns", df.shape[1])

df.columns

Rows 1000
Columns 28


Index(['Customer_ID', 'Age', 'Gender', 'Income_Level', 'Marital_Status',
       'Education_Level', 'Occupation', 'Location', 'Purchase_Category',
       'Purchase_Amount', 'Frequency_of_Purchase', 'Purchase_Channel',
       'Brand_Loyalty', 'Product_Rating',
       'Time_Spent_on_Product_Research(hours)', 'Social_Media_Influence',
       'Discount_Sensitivity', 'Return_Rate', 'Customer_Satisfaction',
       'Engagement_with_Ads', 'Device_Used_for_Shopping', 'Payment_Method',
       'Time_of_Purchase', 'Discount_Used', 'Customer_Loyalty_Program_Member',
       'Purchase_Intent', 'Shipping_Preference', 'Time_to_Decision'],
      dtype='object')

### Info & dtypes

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 28 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   Customer_ID                            1000 non-null   object 
 1   Age                                    1000 non-null   int64  
 2   Gender                                 1000 non-null   object 
 3   Income_Level                           1000 non-null   object 
 4   Marital_Status                         1000 non-null   object 
 5   Education_Level                        1000 non-null   object 
 6   Occupation                             1000 non-null   object 
 7   Location                               1000 non-null   object 
 8   Purchase_Category                      1000 non-null   object 
 9   Purchase_Amount                        1000 non-null   object 
 10  Frequency_of_Purchase                  1000 non-null   int64  
 11  Purch

In [5]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,1000.0,34.3,9.35,18.0,26.0,34.5,42.0,50.0
Frequency_of_Purchase,1000.0,6.95,3.15,2.0,4.0,7.0,10.0,12.0
Brand_Loyalty,1000.0,3.03,1.42,1.0,2.0,3.0,4.0,5.0
Product_Rating,1000.0,3.03,1.44,1.0,2.0,3.0,4.0,5.0
Time_Spent_on_Product_Research(hours),1000.0,1.01,0.79,0.0,0.0,1.0,2.0,2.0
Return_Rate,1000.0,0.95,0.81,0.0,0.0,1.0,2.0,2.0
Customer_Satisfaction,1000.0,5.4,2.87,1.0,3.0,5.0,8.0,10.0
Time_to_Decision,1000.0,7.55,4.04,1.0,4.0,8.0,11.0,14.0


In [6]:
df.describe(include="object").T

Unnamed: 0,count,unique,top,freq
Customer_ID,1000,1000,37-611-6911,1
Gender,1000,8,Female,452
Income_Level,1000,2,High,515
Marital_Status,1000,4,Widowed,260
Education_Level,1000,3,Bachelor's,341
Occupation,1000,2,High,517
Location,1000,969,Oslo,4
Purchase_Category,1000,24,Electronics,54
Purchase_Amount,1000,989,$178.04,2
Purchase_Channel,1000,3,Mixed,340


### Descriptive statistics

Here we look at:
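- Ranges and distributions of numeric features (e.g., Age, Frequency_of_Purchase). `Purchase_Amount` is absent from the numeric summary because it is stored as a string (e.g., `$333.80`) and must be converted first.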
- Cardinality of categorical fields (e.g., Purchase_Channel, Location, Payment_Method).
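
Unique counts alone do not show which labels a column actually holds. A minimal sketch for listing the labels of low-cardinality columns follows; the helper name `label_summary`, the `max_unique` threshold, and the three demo rows are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def label_summary(frame: pd.DataFrame, columns: list, max_unique: int = 10) -> dict:
    """Return value counts for each listed column with at most `max_unique` distinct labels."""
    summary = {}
    for col in columns:
        if frame[col].nunique() <= max_unique:
            # dropna=False keeps missing values visible as their own row
            summary[col] = frame[col].value_counts(dropna=False)
    return summary

# Made-up rows for illustration only; in the notebook, pass `df` instead.
demo = pd.DataFrame({
    "Income_Level": ["High", "Middle", "High"],
    "Occupation": ["High", "Middle", "Middle"],
    "Location": ["Oslo", "Nara", "Huzhen"],
})

summary = label_summary(demo, ["Income_Level", "Occupation", "Location"], max_unique=2)
for col, counts in summary.items():
    print(col)
    print(counts)
```

Applied to `df`, this would show, for example, which two labels `Occupation` holds (its most frequent value is `High`, the same label that appears in `Income_Level`).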

### Missing values analysis

In [11]:
missing_count = df.isna().sum().sort_values(ascending=False)
print(missing_count)
missing_pct = (df.isna().mean() * 100).round(2).sort_values(ascending=False)
print(missing_pct)

Engagement_with_Ads                      256
Social_Media_Influence                   247
Customer_ID                                0
Age                                        0
Shipping_Preference                        0
Purchase_Intent                            0
Customer_Loyalty_Program_Member            0
Discount_Used                              0
Time_of_Purchase                           0
Payment_Method                             0
Device_Used_for_Shopping                   0
Customer_Satisfaction                      0
Return_Rate                                0
Discount_Sensitivity                       0
Time_Spent_on_Product_Research(hours)      0
Product_Rating                             0
Brand_Loyalty                              0
Purchase_Channel                           0
Frequency_of_Purchase                      0
Purchase_Amount                            0
Purchase_Category                          0
Location                                   0
Occupation

In [12]:
missing_df = pd.DataFrame({
    "missing_count": missing_count,
    "missing_pct": missing_pct
})

missing_df[missing_df["missing_count"] > 0]

Unnamed: 0,missing_count,missing_pct
Engagement_with_Ads,256,25.6
Social_Media_Influence,247,24.7
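
The missing-value counts depend on which strings `read_csv` treats as NA, since pandas applies a built-in list of NA tokens by default. Whether any literal label in this dataset was read as missing is not confirmed by the outputs above. The sketch below uses a made-up three-row CSV (an assumption for illustration) to show one way to check: re-read the file with the default list disabled so that only empty cells count as missing.

```python
import io
import pandas as pd

# Made-up CSV for illustration: one literal "None" label and one empty cell.
raw_csv = io.StringIO(
    "Customer_ID,Social_Media_Influence\n"
    "a,None\n"
    "b,High\n"
    "c,\n"
)

# keep_default_na=False disables pandas' built-in list of NA strings;
# na_values=[""] keeps only genuinely empty cells as missing.
raw = pd.read_csv(raw_csv, keep_default_na=False, na_values=[""])

print(raw["Social_Media_Influence"].value_counts(dropna=False))
print("Missing after strict read:", raw["Social_Media_Influence"].isna().sum())
```

Comparing these counts with `df.isna().sum()` from the default read would show whether the roughly 25% missing rate reflects empty cells or a label that was converted on load.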


 Duplicate checks

In [15]:
dup_rows = df.duplicated().sum()
print("Duplicate rows:", dup_rows)

Duplicate rows: 0


In [17]:
# Customer_ID duplicates: check whether any customer appears in more than one row
cust_duplicate = df["Customer_ID"].duplicated().sum()
print("Customer_ID duplicates:", cust_duplicate)

Customer_ID duplicates: 0


### Duplicate records
- Total duplicate rows in dataset: 0.
- Duplicate `Customer_ID` values: 0. Every `Customer_ID` is unique (1000 unique values in 1000 rows), so each row represents a distinct customer rather than one of several purchases by the same customer.

No exact row duplicates exist, so the duplicate-removal step in cleaning acts only as a safeguard.
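
If a later version of the data does contain repeated customers, it helps to see the affected rows rather than only a count. A minimal sketch, using made-up IDs (the demo frame is an assumption for illustration):

```python
import pandas as pd

# Made-up rows for illustration only; in the notebook, use `df` instead.
demo = pd.DataFrame({
    "Customer_ID": ["A-1", "B-2", "A-1"],
    "Purchase_Category": ["Furniture", "Electronics", "Office Supplies"],
})

# is_unique is a quick yes/no check on the key column.
ids_unique = demo["Customer_ID"].is_unique
print("Customer_ID unique:", ids_unique)

# keep=False marks every occurrence of a repeated ID, not just the later ones.
repeated = demo[demo["Customer_ID"].duplicated(keep=False)]
print(repeated)
```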


Basic value checks per key columns

In [19]:
cols_to_check = [
    "Purchase_Amount",
    "Frequency_of_Purchase",
    "Return_Rate",
    "Product_Rating",
    "Discount_Used",
    "Time_Spent_on_Product_Research(hours)",
    "Time_to_Decision"
]

df[cols_to_check].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Frequency_of_Purchase,1000.0,6.95,3.15,2.0,4.0,7.0,10.0,12.0
Return_Rate,1000.0,0.95,0.81,0.0,0.0,1.0,2.0,2.0
Product_Rating,1000.0,3.03,1.44,1.0,2.0,3.0,4.0,5.0
Time_Spent_on_Product_Research(hours),1000.0,1.01,0.79,0.0,0.0,1.0,2.0,2.0
Time_to_Decision,1000.0,7.55,4.04,1.0,4.0,8.0,11.0,14.0


In [23]:
df[["Purchase_Amount", "Frequency_of_Purchase",
    "Return_Rate", "Discount_Used"]].dtypes
# negative values?
# (df[["Purchase_Amount", "Frequency_of_Purchase", "Return_Rate", "Discount_Used"]] < 0).sum()

Purchase_Amount          object
Frequency_of_Purchase     int64
Return_Rate               int64
Discount_Used              bool
dtype: object

In [25]:
df["Purchase_Amount_clean"] = (
    df["Purchase_Amount"]
    .replace(r"[\$,]", "", regex=True)
    .astype(float)
)

In [26]:
(df[["Purchase_Amount_clean",
     "Frequency_of_Purchase",
     "Return_Rate"]] < 0).sum()

Purchase_Amount_clean    0
Frequency_of_Purchase    0
Return_Rate              0
dtype: int64
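
`astype(float)` raises on the first value it cannot convert, which hides how many values are affected. A hedged alternative is sketched below on a made-up series (the sample values, including `N/A`, are assumptions): `pd.to_numeric` with `errors="coerce"` turns unparseable entries into NaN so they can be listed and reviewed.

```python
import pandas as pd

# Made-up values for illustration; in the notebook, use df["Purchase_Amount"].
amounts = pd.Series(["$333.80", "$1,222.22", "N/A", "$101.31"])

# Strip the currency symbol and thousands separators.
cleaned = amounts.str.replace(r"[\$,]", "", regex=True)

# Coerce instead of raising, so bad entries become NaN.
numeric = pd.to_numeric(cleaned, errors="coerce")

# Entries that were present in the raw data but failed to convert.
unparseable = amounts[numeric.isna() & amounts.notna()]

print(numeric.tolist())
print(unparseable.tolist())
```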

In [27]:
df["Return_Rate"].describe()

count   1,000.00
mean        0.95
std         0.81
min         0.00
25%         0.00
50%         1.00
75%         2.00
max         2.00
Name: Return_Rate, dtype: float64

In [30]:
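# Each comparison needs its own parentheses: `|` binds tighter than `<`,
# so the unparenthesized form compares Return_Rate against (0 | mask).
# With the integer values 0 to 2 seen here, this selects rows where Return_Rate is 0.
invalid_return = df[(df["Return_Rate"] < 0) | (df["Return_Rate"] < 1)]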
invalid_rating = df[(df["Product_Rating"] < 1) | (df["Product_Rating"] > 5)]
print(invalid_return)
print(invalid_rating)


### Sanity checks on numeric fields

- Verify there are no negative purchase amounts or frequencies.
- `Return_Rate` is an integer ranging from 0 to 2 in this dataset, so it behaves like a count of returns rather than a proportion; `Discount_Used` is boolean (True/False).
- Product_Rating should fall within the expected rating scale (e.g., 1–5).

     Customer_ID  Age    Gender Income_Level Marital_Status Education_Level  \
3    48-980-6078   29    Female       Middle         Single        Master's   
6    90-144-9193   21    Female       Middle       Divorced     High School   
10   44-674-4037   33      Male       Middle       Divorced      Bachelor's   
11   78-116-8349   38    Female         High        Widowed      Bachelor's   
13   80-684-5072   32      Male         High        Married     High School   
..           ...  ...       ...          ...            ...             ...   
991  48-271-1908   40  Bigender         High         Single      Bachelor's   
992  46-978-3874   22    Female         High         Single     High School   
996  41-366-4205   50    Female         High         Single     High School   
997  77-241-7621   26      Male         High        Married      Bachelor's   
998  53-091-2176   21    Female         High        Widowed      Bachelor's   

    Occupation     Location     Purchase_Category P
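
Printing whole DataFrames makes it hard to see at a glance whether a check passed. A minimal sketch of a count-based range check follows; the helper name `out_of_range`, the demo rows, and the bounds (0 to 2 for `Return_Rate`, taken from the observed range, and 1 to 5 for `Product_Rating`) are assumptions for illustration.

```python
import pandas as pd

def out_of_range(frame: pd.DataFrame, bounds: dict) -> dict:
    """Count values outside the inclusive [low, high] bounds for each column."""
    counts = {}
    for col, (low, high) in bounds.items():
        # between() is inclusive on both ends; NaN is treated as out of range.
        counts[col] = int((~frame[col].between(low, high)).sum())
    return counts

# Made-up rows for illustration only; in the notebook, pass `df` instead.
demo = pd.DataFrame({
    "Return_Rate": [0, 1, 2, 3],
    "Product_Rating": [1, 5, 6, 0],
})
bounds = {"Return_Rate": (0, 2), "Product_Rating": (1, 5)}

violations = out_of_range(demo, bounds)
print(violations)
```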

In [31]:
cat_cols = [
    "Gender",
    "Income_Level",
    "Marital_Status",
    "Education_Level",
    "Occupation",
    "Location",
    "Purchase_Category",
    "Purchase_Channel",
    "Brand_Loyalty",
    "Social_Media_Influence",
    "Discount_Sensitivity",
    "Customer_Satisfaction",
    "Engagement_with_Ads",
    "Device_Used_for_Shopping",
    "Payment_Method",
    "Customer_Loyalty_Program_Member",
    "Purchase_Intent",
    "Shipping_Preference"
]

unique_counts = df[cat_cols].nunique().sort_values(ascending=False)
unique_counts



### Categorical cardinality

This helps decide:
- Which columns are high-cardinality (e.g., Location, with 969 distinct values in 1000 rows) and need to be grouped into broader buckets before they are useful for breakdowns.
- Which columns are low-cardinality segments (e.g., Gender, Payment_Method) and good for breakdowns in EDA.

Location                           969
Purchase_Category                   24
Customer_Satisfaction               10
Gender                               8
Payment_Method                       5
Brand_Loyalty                        5
Marital_Status                       4
Purchase_Intent                      4
Device_Used_for_Shopping             3
Engagement_with_Ads                  3
Social_Media_Influence               3
Discount_Sensitivity                 3
Purchase_Channel                     3
Education_Level                      3
Shipping_Preference                  3
Income_Level                         2
Occupation                           2
Customer_Loyalty_Program_Member      2
dtype: int64
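
One common way to tame a high-cardinality column is to keep the most frequent labels and fold the rest into a single bucket. A minimal sketch on a made-up series (the helper name `group_rare`, the `top_n` value, and the `Other` label are assumptions):

```python
import pandas as pd

def group_rare(values: pd.Series, top_n: int = 2, other_label: str = "Other") -> pd.Series:
    """Keep the `top_n` most frequent labels and replace all others with `other_label`."""
    top = values.value_counts().nlargest(top_n).index
    # where() keeps values that satisfy the condition and replaces the rest.
    return values.where(values.isin(top), other_label)

# Made-up values for illustration; in the notebook, use df["Location"].
locations = pd.Series(["Oslo", "Oslo", "Nara", "Nara", "Huzhen", "Wiwilí"])
grouped = group_rare(locations, top_n=2)
print(grouped.tolist())
```

With 969 distinct locations in 1000 rows, a frequency-based cut would place almost every row in `Other`, so a mapping to a coarser level such as country or region may be more useful if such a lookup is available.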

In [38]:
time_cols = ["Time_of_Purchase"]
for col in time_cols:
    df[col] = pd.to_datetime(df[col], errors="coerce")
    
df[time_cols].head()
df[time_cols].isna().sum()

### Time fields

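`Time_of_Purchase` is converted to datetime so that we can later derive:
- hour of day,
- day of week,
- part of day (morning/evening),
which will be useful for behaviour analysis. Note that the sample rows shown earlier contain dates only (e.g., `03-01-2024`, `4/16/2024`), so hour of day and part of day are only derivable if other rows carry a time component.

The conversion is not yet reliable: with `errors="coerce"`, 618 of the 1000 values become `NaT`, most likely because the column mixes date formats (`03-01-2024` alongside `4/16/2024`).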

Time_of_Purchase    618
dtype: int64
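
A large `NaT` count after coercion usually points to more than one date format in the column. The sketch below tries a list of explicit formats in turn and keeps the first successful parse for each value. The helper name `parse_mixed_dates` and the two formats are assumptions; in particular, reading `03-01-2024` as month-day-year is a guess that should be verified against the data source before use.

```python
import pandas as pd

def parse_mixed_dates(values: pd.Series, formats: list) -> pd.Series:
    """Parse each value with the first format in `formats` that matches it."""
    parsed = None
    for fmt in formats:
        # Values that do not match this format become NaT.
        attempt = pd.to_datetime(values, format=fmt, errors="coerce")
        # Fill only the gaps left by earlier formats.
        parsed = attempt if parsed is None else parsed.fillna(attempt)
    return parsed

# Date strings copied from the sample rows shown in df.head().
sample = pd.Series(["03-01-2024", "4/16/2024", "3/15/2024", "10-04-2024", "1/30/2024"])

# Assumption: both styles are month-first.
parsed_sample = parse_mixed_dates(sample, ["%m/%d/%Y", "%m-%d-%Y"])

print(parsed_sample)
print("Unparsed:", parsed_sample.isna().sum())
```

Parsing from the raw strings matters here: once the column has been coerced in place, the original text of the failed values is gone, so this would need to run on a fresh copy of `Time_of_Purchase`.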

## Data Quality Summary

From the validation step:

- The dataset contains 1000 rows and 28 columns covering customer demographics, behaviour, purchase details, and engagement.
- Data types are mostly appropriate, with two exceptions: `Purchase_Amount` is stored as a string with a `$` prefix, and `Time_of_Purchase` only partially converts to datetime (618 of 1000 values become `NaT` under `errors="coerce"`), so its date formats must be handled in cleaning.
- Missing values are present in `Engagement_with_Ads` (25.6%) and `Social_Media_Influence` (24.7%). These will be handled in the cleaning notebook.
- There are 0 fully duplicated rows and 0 duplicated `Customer_ID` values, so nothing needs to be dropped.
- No negative values were found in `Purchase_Amount` (after stripping the currency symbol), `Frequency_of_Purchase`, or `Return_Rate`.

This notebook confirms that the raw data is usable and highlights the issues that need to be fixed in **02_data_cleaning_feature_engineering.ipynb**.
