# Customer Purchase Prediction - Exploratory Data Analysis

## Business Context

RetailTech Solutions is an international e-commerce platform seeking to optimize its marketing spend by predicting customer purchase likelihood from browsing behavior. This analysis explores the customer session data to understand patterns and relationships that can inform our predictive modeling approach.

## Dataset Overview

The dataset contains 500 customer browsing sessions with 7 columns: a customer identifier, five features capturing browsing behavior and customer attributes, and the purchase label. Our goal is to predict whether a customer will make a purchase (binary classification) based on their session characteristics.

### Feature Description

| Feature | Type | Description | Expected Range |
|---------|------|-------------|----------------|
| `customer_id` | Integer | Unique identifier | 1-500 (no missing) |
| `time_spent` | Float | Minutes on website | 0+ minutes |
| `pages_viewed` | Integer | Pages viewed in session | 0+ pages |
| `basket_value` | Float | Basket monetary value | 0+ dollars |
| `device_type` | Categorical | Device used | Mobile/Desktop/Tablet |
| `customer_type` | Categorical | Customer status | New/Returning |
| `purchase` | Binary | Target variable | 0=No, 1=Yes |
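
The expected ranges in the table can be turned into a quick sanity check. The following is a minimal sketch, not part of the original pipeline: the helper name `check_expected_ranges` and the three-row sample are hypothetical, and missing values are deliberately not counted as violations.

```python
import pandas as pd

# Hypothetical three-row sample; the real data comes from raw_customer_data.csv.
sample = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "time_spent": [12.5, None, 40.0],
    "pages_viewed": [3, 7, -1],                      # -1 violates "0+ pages"
    "basket_value": [0.0, 25.0, 80.0],
    "device_type": ["Mobile", "Desktop", "Phone"],   # "Phone" is not an expected level
    "customer_type": ["New", "Returning", None],
    "purchase": [0, 1, 1],
})

def check_expected_ranges(df: pd.DataFrame) -> dict:
    """Count rows that violate the documented ranges (missing values are ignored)."""
    devices = ["Mobile", "Desktop", "Tablet"]
    customers = ["New", "Returning"]
    return {
        "customer_id": int(df["customer_id"].duplicated().sum()),
        "time_spent": int((df["time_spent"] < 0).sum()),
        "pages_viewed": int((df["pages_viewed"] < 0).sum()),
        "basket_value": int((df["basket_value"] < 0).sum()),
        "device_type": int((~df["device_type"].dropna().isin(devices)).sum()),
        "customer_type": int((~df["customer_type"].dropna().isin(customers)).sum()),
        "purchase": int((~df["purchase"].isin([0, 1])).sum()),
    }

violations = check_expected_ranges(sample)
print(violations)
```

Running the same helper on the loaded `df` would flag any column whose values fall outside the ranges documented above.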

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

# Load the raw data
data_path = Path('../data/raw/raw_customer_data.csv')
df = pd.read_csv(data_path)

print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns")

## Data Quality Assessment

First, let's examine the data types, missing values, and basic statistics to understand data quality issues.

In [None]:
# Examine data types and missing values
print("=== DATA TYPES ===")
print(df.dtypes)
print("\n=== MISSING VALUES ===")
missing_summary = df.isnull().sum()
missing_percentage = (df.isnull().sum() / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_summary,
    'Missing Percentage': missing_percentage
})
print(missing_df[missing_df['Missing Count'] > 0])

### Key Data Quality Findings:

- **Dataset Size**: 500 customer sessions
- **Missing Values**: Several features have missing data:
  - `time_spent`: 13.0% missing (65 values)
  - `pages_viewed`: 8.4% missing (42 values)
  - `basket_value`: 16.6% missing (83 values)
  - `device_type`: 3.8% missing (19 values)
  - `customer_type`: 5.4% missing (27 values)
- **Target Variable**: `purchase` has no missing values
- **Customer ID**: Complete, can be used as index

The missing values will need to be handled during preprocessing according to business rules.
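
One possible imputation scheme is sketched below. The fill values mirror those used in this notebook's correlation cell (median for `time_spent`, mean for `pages_viewed`, 0 for `basket_value`, `'Unknown'` for `device_type`, `'New'` for `customer_type`); whether they match the actual business rules is an assumption. The `impute` helper and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one or two gaps per column.
toy = pd.DataFrame({
    "time_spent": [10.0, np.nan, 30.0, 50.0],
    "pages_viewed": [2.0, 4.0, np.nan, 9.0],
    "basket_value": [np.nan, 20.0, 35.0, np.nan],
    "device_type": ["Mobile", None, "Tablet", "Desktop"],
    "customer_type": ["New", "Returning", None, "Returning"],
})

def impute(df: pd.DataFrame) -> pd.DataFrame:
    """Return an imputed copy; the input frame is left untouched."""
    out = df.copy()
    out["time_spent"] = out["time_spent"].fillna(out["time_spent"].median())
    out["pages_viewed"] = out["pages_viewed"].fillna(out["pages_viewed"].mean())
    out["basket_value"] = out["basket_value"].fillna(0)  # assumed: no basket recorded
    out["device_type"] = out["device_type"].fillna("Unknown")
    out["customer_type"] = out["customer_type"].fillna("New")
    return out

imputed = impute(toy)
print(imputed)
```

In a modeling workflow, the median and mean would normally be computed on the training split only and then reused for validation and test data.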

## Descriptive Statistics

Let's examine the statistical properties of our numerical features.

In [None]:
# Descriptive statistics for numerical features
numerical_cols = ['time_spent', 'pages_viewed', 'basket_value']
print("=== DESCRIPTIVE STATISTICS (Numerical Features) ===")
desc_stats = df[numerical_cols].describe()
print(desc_stats.round(2))

print("\n=== TARGET DISTRIBUTION ===")
target_dist = df['purchase'].value_counts()
target_pct = df['purchase'].value_counts(normalize=True) * 100
target_summary = pd.DataFrame({
    'Count': target_dist,
    'Percentage': target_pct.round(2)
})
print(target_summary)

### Key Statistical Insights:

**Numerical Features:**
- **Time Spent**: Mean = 34.33 minutes, Std = 15.43 minutes, Range = 6.95-59.58 minutes
- **Pages Viewed**: Mean = 9.78 pages, Std = 5.52 pages, Range = 1-19 pages
- **Basket Value**: Mean = $49.63, Std = $27.57, Range = $0-$130.53

**Target Variable:**
- **Class Imbalance**: 81.4% purchases (407) vs 18.6% no-purchases (93)
- **Imbalance Ratio**: ~4.4:1 (purchase:no-purchase)
- **Business Implication**: This imbalance will need to be addressed in modeling

The dataset is strongly skewed toward purchases, so the minority no-purchase class requires careful handling during model training.
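
The imbalance ratio and one common reweighting heuristic can be worked out directly from the class counts reported above. This is a sketch under the assumption that inverse-frequency weights, `n_samples / (n_classes * class_count)`, are an acceptable starting point; it is not necessarily the approach the final model will use.

```python
# Class counts reported above: 407 purchases, 93 no-purchases.
counts = {1: 407, 0: 93}
n_total = sum(counts.values())

# Majority-to-minority ratio (purchase : no-purchase).
ratio = counts[1] / counts[0]

# Inverse-frequency ("balanced") weights: rarer classes get larger weights.
weights = {label: n_total / (len(counts) * n) for label, n in counts.items()}

print(f"Imbalance ratio: {ratio:.2f}:1")
print({label: round(w, 3) for label, w in weights.items()})
```

With these counts, each no-purchase session is weighted roughly 4.4 times as heavily as a purchase session, which offsets the imbalance in the loss.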

## Categorical Feature Analysis

Let's examine the distribution of categorical features.

In [None]:
# Categorical feature distributions
categorical_cols = ['device_type', 'customer_type']

fig, axes = plt.subplots(1, 2, figsize=(15, 6))

for i, col in enumerate(categorical_cols):
    # Get value counts (excluding missing values for visualization)
    value_counts = df[col].value_counts()
    
    # Create bar plot
    bars = axes[i].bar(range(len(value_counts)), value_counts.values, 
                       color=sns.color_palette("husl", len(value_counts)))
    
    # Add value labels on bars
    for bar, value in zip(bars, value_counts.values):
        axes[i].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5, 
                    f'{value}', ha='center', va='bottom', fontweight='bold')
    
    axes[i].set_title(f'Distribution of {col.replace("_", " ").title()}', fontsize=14, fontweight='bold')
    axes[i].set_xticks(range(len(value_counts)))
    axes[i].set_xticklabels(value_counts.index, rotation=45)
    axes[i].set_ylabel('Count')
    axes[i].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# Print detailed statistics
print("=== CATEGORICAL FEATURE DETAILS ===")
for col in categorical_cols:
    print(f"\n{col.replace('_', ' ').title()}:")
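    # Keep missing values as their own row so percentages are relative to all sessions
    counts = df[col].value_counts(dropna=False)
    percentages = df[col].value_counts(normalize=True, dropna=False) * 100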
    summary = pd.DataFrame({'Count': counts, 'Percentage': percentages.round(2)})
    print(summary)

### Categorical Feature Insights:

**Device Type Distribution:**
- **Mobile**: 45.8% (229 users) - Largest segment
- **Desktop**: 34.0% (170 users)
- **Tablet**: 16.4% (82 users)
- **Missing**: 3.8% (19 users)

**Customer Type Distribution:**
- **Returning**: 58.2% (291 users) - Majority are repeat customers
- **New**: 36.4% (182 users)
- **Missing**: 5.4% (27 users)

**Business Implications:**
- Mobile dominates device usage (nearly half of sessions)
- Returning customers represent the majority, suggesting good retention
- Missing values in categorical features will be imputed during preprocessing according to the business rules
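
By default, `value_counts(normalize=True)` drops missing values, so its percentages are relative to non-missing rows, whereas the shares listed above are relative to all 500 sessions. The sketch below rebuilds the device counts reported above as a synthetic series (not the real data) to show both denominators side by side.

```python
import numpy as np
import pandas as pd

# Synthetic series matching the reported device counts (229 / 170 / 82 / 19 missing).
device = pd.Series(
    ["Mobile"] * 229 + ["Desktop"] * 170 + ["Tablet"] * 82 + [np.nan] * 19,
    name="device_type",
)

# Share of all sessions, with missing values kept as their own category.
share_of_all = device.fillna("Missing").value_counts(normalize=True) * 100

# Share of sessions with a known device (default behavior drops NaN).
share_of_known = device.value_counts(normalize=True) * 100

print(share_of_all.round(1))
print(share_of_known.round(1))
```

Mobile accounts for 45.8% of all sessions but about 47.6% of sessions with a known device, so it is worth stating which denominator a reported percentage uses.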

## Feature Distributions by Target

Let's examine how our features vary between purchasers and non-purchasers.

In [None]:
# Feature distributions by purchase outcome
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

# Numerical features
numerical_cols = ['time_spent', 'pages_viewed', 'basket_value']

for i, col in enumerate(numerical_cols):
    # Separate data by purchase outcome
    purchase_data = df[df['purchase'] == 1][col].dropna()
    no_purchase_data = df[df['purchase'] == 0][col].dropna()
    
    # Create histograms
    axes[i].hist(purchase_data, alpha=0.7, label='Purchase', bins=20, density=True)
    axes[i].hist(no_purchase_data, alpha=0.7, label='No Purchase', bins=20, density=True)
    
    axes[i].set_title(f'{col.replace("_", " ").title()} by Purchase Outcome', fontweight='bold')
    axes[i].set_xlabel(col.replace('_', ' ').title())
    axes[i].set_ylabel('Density')
    axes[i].legend()
    axes[i].grid(alpha=0.3)

# Categorical features
categorical_cols = ['device_type', 'customer_type']

for i, col in enumerate(categorical_cols):
    ax_idx = i + 3
    
    # Create cross-tabulation
    cross_tab = pd.crosstab(df[col].fillna('Missing'), df['purchase'], normalize='index') * 100
    
    # Plot
    cross_tab.plot(kind='bar', ax=axes[ax_idx], width=0.8)
    axes[ax_idx].set_title(f'{col.replace("_", " ").title()} vs Purchase Rate', fontweight='bold')
    axes[ax_idx].set_xlabel(col.replace('_', ' ').title())
    axes[ax_idx].set_ylabel('Share of Sessions (%)')  # bars show both outcomes, not only the purchase rate
    axes[ax_idx].legend(['No Purchase', 'Purchase'])
    axes[ax_idx].tick_params(axis='x', rotation=45)
    axes[ax_idx].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# Statistical comparison
print("=== STATISTICAL COMPARISON BY PURCHASE OUTCOME ===")
for col in numerical_cols:
    purchase_mean = df[df['purchase'] == 1][col].mean()
    no_purchase_mean = df[df['purchase'] == 0][col].mean()
    print(f"\n{col.replace('_', ' ').title()}:")
    print(f"  Purchase: {purchase_mean:.2f}")
    print(f"  No Purchase: {no_purchase_mean:.2f}")
    print(f"  Difference: {purchase_mean - no_purchase_mean:.2f}")

### Purchase Behavior Insights:

**Numerical Features Comparison:**
- **Time Spent**: Purchasers spend more time (34.5 min vs 33.2 min, small difference)
- **Pages Viewed**: Purchasers view more pages (10.1 vs 7.9, moderate difference)
- **Basket Value**: Purchasers have higher basket values ($52.1 vs $33.5, large difference)

**Categorical Features:**
- **Device Type**: Mobile users have the highest purchase rate (~83%), Desktop users the lowest (~77%)
- **Customer Type**: Returning customers purchase at a higher rate (~85% vs ~76% for new customers)

**Key Patterns:**
- Basket value appears to be the strongest discriminator between purchasers and non-purchasers
- Returning customers show higher purchase intent than new customers
- Device type shows some variation but less pronounced differences
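
Labels such as "small", "moderate", and "large" are easier to compare across features when the mean difference is expressed in standard-deviation units. The sketch below computes Cohen's d with a pooled standard deviation; the `cohens_d` helper and the toy inputs are illustrative and not part of the original analysis.

```python
import numpy as np

def cohens_d(a, b) -> float:
    """Standardized mean difference (a minus b) using the pooled sample std."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a[~np.isnan(a)]  # drop missing values, as the notebook does with dropna()
    b = b[~np.isnan(b)]
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Toy example: means 4 vs 3, both groups with sample variance 4.
d = cohens_d([2, 4, 6, np.nan], [1, 3, 5])
print(f"Cohen's d: {d:.2f}")
```

Applied to the notebook data, the call would look like `cohens_d(df.loc[df['purchase'] == 1, col], df.loc[df['purchase'] == 0, col])` for each numerical column.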

## Correlation Analysis

Let's examine relationships between features and identify potential multicollinearity.

In [None]:
# Correlation analysis
# First, create a copy with filled missing values for correlation analysis
df_corr = df.copy()

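# Fill missing values for correlation analysis only (median, mean, 0, or a default level).
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may not modify df_corr in recent pandas versions.
df_corr['time_spent'] = df_corr['time_spent'].fillna(df_corr['time_spent'].median())
df_corr['pages_viewed'] = df_corr['pages_viewed'].fillna(df_corr['pages_viewed'].mean())
df_corr['basket_value'] = df_corr['basket_value'].fillna(0)
df_corr['device_type'] = df_corr['device_type'].fillna('Unknown')
df_corr['customer_type'] = df_corr['customer_type'].fillna('New')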

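# Encode categorical variables for correlation
# Note: device_type is nominal, so these integer codes impose an arbitrary order;
# treat its Pearson correlation as a rough signal only
df_corr['device_type_encoded'] = df_corr['device_type'].map({'Mobile': 0, 'Desktop': 1, 'Tablet': 2, 'Unknown': 3})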
df_corr['customer_type_encoded'] = df_corr['customer_type'].map({'New': 0, 'Returning': 1})

# Select features for correlation
corr_features = ['time_spent', 'pages_viewed', 'basket_value', 'device_type_encoded', 'customer_type_encoded', 'purchase']

# Calculate correlation matrix
corr_matrix = df_corr[corr_features].corr()

# Plot correlation heatmap
plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Matrix', fontsize=16, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Print correlation with target
print("\n=== CORRELATION WITH TARGET (Purchase) ===")
target_corr = corr_matrix['purchase'].drop('purchase').sort_values(ascending=False)
for feature, corr in target_corr.items():
    print(f"{feature.replace('_', ' ').title()}: {corr:.3f}")

### Correlation Analysis Insights:

**Key Correlations with Purchase:**
- **Basket Value**: 0.431 (Strongest positive correlation)
- **Pages Viewed**: 0.266 (Moderate positive correlation)
- **Customer Type**: 0.157 (Weak positive correlation)
- **Time Spent**: 0.042 (Negligible positive correlation)
- **Device Type**: -0.063 (Weak negative correlation; the integer codes assigned to this nominal feature are arbitrary, so the sign carries no meaning)

**Multicollinearity Check:**
- No strong correlations between predictor variables (all |r| < 0.3)
- Features appear relatively independent

**Feature Importance Implications:**
- Basket value shows the strongest linear association with purchase
- Time spent correlates only weakly with purchase, despite being an intuitive predictor
- Categorical features have modest predictive power
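
Because `device_type` has no natural order, a single integer-coded column can hide per-device effects. An alternative, sketched here on hypothetical toy data, is to one-hot encode the feature and correlate each indicator column with the target separately.

```python
import pandas as pd

# Hypothetical toy data; the real analysis would use df_corr.
toy = pd.DataFrame({
    "device_type": ["Mobile", "Mobile", "Desktop", "Desktop", "Tablet", "Tablet"],
    "purchase": [1, 1, 0, 0, 1, 0],
})

# One indicator column per device level (device_Desktop, device_Mobile, device_Tablet).
dummies = pd.get_dummies(toy["device_type"], prefix="device", dtype=float)

# Pearson correlation of each indicator with the binary target.
device_corr = dummies.corrwith(toy["purchase"])
print(device_corr.round(3))
```

Each value answers a yes/no question ("is this session on Mobile?"), so the signs are interpretable regardless of how the levels are ordered.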

## Outlier Analysis

Let's check for potential outliers in our numerical features.

In [None]:
# Outlier analysis using box plots
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

numerical_cols = ['time_spent', 'pages_viewed', 'basket_value']

for i, col in enumerate(numerical_cols):
    # Create box plot
    box_data = df[col].dropna()
    bp = axes[i].boxplot(box_data, patch_artist=True, 
                        boxprops=dict(facecolor='lightblue', color='blue'),
                        medianprops=dict(color='red', linewidth=2),
                        whiskerprops=dict(color='blue'),
                        capprops=dict(color='blue'),
                        flierprops=dict(marker='o', markerfacecolor='red', markersize=8, alpha=0.6))
    
    axes[i].set_title(f'{col.replace("_", " ").title()} Distribution', fontweight='bold')
    axes[i].set_ylabel(col.replace('_', ' ').title())
    axes[i].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# Statistical outlier detection using IQR method
print("=== OUTLIER ANALYSIS (IQR Method) ===")
for col in numerical_cols:
    data = df[col].dropna()
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    
    print(f"\n{col.replace('_', ' ').title()}:")
    print(f"  IQR: {IQR:.2f}")
    print(f"  Outlier Range: < {lower_bound:.2f} or > {upper_bound:.2f}")
    print(f"  Outliers Found: {len(outliers)} ({len(outliers)/len(data)*100:.1f}%)")
    if len(outliers) > 0:
        # Show up to the five smallest outlier values
        print(f"  Smallest Outlier Values: {sorted(outliers.values)[:5]}")

### Outlier Analysis Results:

**Outlier Distribution:**
- **Time Spent**: 6 outliers (1.4%) - Some extreme browsing sessions
- **Pages Viewed**: 3 outliers (0.7%) - Users viewing many pages
- **Basket Value**: 3 outliers (0.7%) - High-value shopping baskets

**Assessment:**
- Relatively few outliers in the dataset
- Outliers appear legitimate (e.g., users spending a long time shopping, high basket values)
- No extreme outliers that would require removal
- Data scaling will handle any remaining scale differences

## Summary and Key Findings

### Dataset Characteristics
- **Size**: 500 customer sessions
- **Features**: 5 predictors + 1 identifier (`customer_id`) + 1 binary target
- **Missing Values**: Present in all five predictors; `customer_id` and `purchase` are complete
- **Class Imbalance**: 81.4% purchase rate (407 yes, 93 no)

### Key Insights

**Strongest Predictors:**
1. **Basket Value** (correlation = 0.431) - Most predictive feature
2. **Pages Viewed** (correlation = 0.266) - Moderate relationship
3. **Customer Type** (correlation = 0.157) - Returning customers more likely to purchase

**Data Quality:**
- Missing values require imputation following business rules
- No concerning multicollinearity between features
- Outliers are minimal and appear legitimate

**Business Implications:**
- **Marketing Focus**: Target customers with high basket values
- **User Experience**: Optimize for mobile users (45.8% of traffic)
- **Retention**: Leverage returning customer behavior patterns

### Next Steps for Modeling
1. **Preprocessing**: Handle missing values, encode categoricals, scale features
2. **Feature Engineering**: Consider interaction terms, behavioral segments
3. **Class Imbalance**: Address 4.4:1 purchase:no-purchase ratio
4. **Model Selection**: Focus on precision-recall balance for business value
5. **Evaluation**: Use appropriate metrics for imbalanced classification
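
To illustrate why accuracy alone is a weak metric here, the sketch below scores a trivial baseline that predicts "purchase" for every session, using the class counts reported in this notebook. The labels are synthetic and only reproduce the 407/93 split.

```python
import numpy as np

# Synthetic labels reproducing the reported split: 407 purchases, 93 no-purchases.
y_true = np.array([1] * 407 + [0] * 93)

# Trivial baseline: always predict the majority class (purchase).
y_pred = np.ones_like(y_true)

accuracy = (y_pred == y_true).mean()

# Recall per class: the fraction of each true class that the baseline recovers.
recall_purchase = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()
recall_no_purchase = ((y_pred == 0) & (y_true == 0)).sum() / (y_true == 0).sum()

# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = (recall_purchase + recall_no_purchase) / 2

print(f"Accuracy:            {accuracy:.3f}")
print(f"Recall (purchase):   {recall_purchase:.3f}")
print(f"Recall (no purchase):{recall_no_purchase:.3f}")
print(f"Balanced accuracy:   {balanced_accuracy:.3f}")
```

The baseline reaches 81.4% accuracy while never identifying a single non-purchaser, so any candidate model should be judged against per-class recall or balanced accuracy rather than raw accuracy.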