# Exploratory Data Analysis and Visualization

**Author:** Nino Gagnidze  
**Purpose:** Comprehensive exploratory analysis with statistical insights and visualizations

## Objectives
- Perform statistical analysis of customer demographics and behavior
- Create at least 5 different types of visualizations
- Investigate relationships and correlations between features
- Identify patterns, trends, and customer segments
- Generate insights for business decision-making

## Visualization Types Included
1. Distribution plots (histograms with KDE)
2. Box plots for outlier visualization
3. Correlation heatmap
4. Scatter plots with trend analysis
5. Bar charts for categorical variables
6. Pair plots for multivariate analysis
7. Violin plots
8. Pie charts

## 1. Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import warnings
# Silence all warnings to keep cell output readable; comment this out when debugging
warnings.filterwarnings('ignore')

# Add src directory to path
sys.path.append('../src')

# Import custom visualization functions
from visualization import (
    plot_distribution,
    plot_boxplot,
    plot_correlation_heatmap,
    plot_scatter,
    plot_count_bar,
    plot_pairplot,
    plot_grouped_bar,
    plot_violin,
    plot_pie_chart,
    plot_multiple_distributions,
    create_statistical_summary_table
)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100

In [None]:
# Load processed data
data_path = '../data/processed/mall_customers_processed.csv'
df = pd.read_csv(data_path)

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
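
Later cells read several derived columns (`Gender_Encoded`, `Age_Group`, `Income_Category`, `Spending_Category`) that are assumed to come from an earlier preprocessing step. A minimal sketch of an early check is shown below; `missing_columns` and the small `demo` frame are hypothetical and use made-up values.

```python
import pandas as pd

# Columns this notebook reads later on; the derived ones are assumed to be
# produced by a preprocessing step that is not shown here.
REQUIRED_COLUMNS = [
    "Gender", "Age", "Annual Income (k$)", "Spending Score (1-100)",
    "Gender_Encoded", "Age_Group", "Income_Category", "Spending_Category",
]


def missing_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]


# Tiny stand-in frame (made-up values) that lacks the derived columns.
demo = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Age": [19, 35],
    "Annual Income (k$)": [15, 60],
    "Spending Score (1-100)": [39, 81],
})
missing = missing_columns(demo)
print(f"Missing columns: {missing}")
```

Calling `missing_columns(df)` right after loading would surface a stale or incomplete processed file before any plot is attempted.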

In [None]:
# Display first few rows
df.head(10)

## 2. Statistical Summary

In [None]:
# Basic descriptive statistics
print("Descriptive Statistics:")
df.describe()

In [None]:
# Comprehensive statistical summary
numerical_features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
stats_summary = create_statistical_summary_table(df, numerical_features)

print("Comprehensive Statistical Summary:")
print("=" * 80)
stats_summary
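
The source of `create_statistical_summary_table` is not shown in this notebook. As a rough idea of what such a table might contain, here is a sketch built with plain pandas; `summarize_numeric` is a hypothetical stand-in, not the project's function, and the data is made up.

```python
import pandas as pd


def summarize_numeric(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical summary helper: one row per feature, one column per statistic."""
    subset = frame[columns]
    summary = pd.DataFrame({
        "mean": subset.mean(),
        "median": subset.median(),
        "std": subset.std(),            # sample standard deviation (ddof=1)
        "min": subset.min(),
        "max": subset.max(),
        "skew": subset.skew(),
        "missing": subset.isna().sum(),
    })
    return summary.round(2)


# Made-up values for illustration only.
demo = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "Annual Income (k$)": [15, 40, 60, 137],
})
demo_summary = summarize_numeric(demo, ["Age", "Annual Income (k$)"])
print(demo_summary)
```

Skewness and missing counts are included here because `df.describe()` does not report them.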

In [None]:
# Categorical features summary
print("Gender Distribution:")
print(df['Gender'].value_counts())
print("\nPercentage:")
print((df['Gender'].value_counts(normalize=True) * 100).round(2))

## 3. Visualization Type 1: Distribution Plots (Histograms with KDE)

In [None]:
# Age distribution
plot_distribution(
    df, 
    'Age',
    title='Distribution of Customer Age',
    bins=20,
    color='skyblue',
    save_path='../reports/figures/01_age_distribution.png'
)

In [None]:
# Annual Income distribution
plot_distribution(
    df,
    'Annual Income (k$)',
    title='Distribution of Annual Income',
    bins=20,
    color='lightgreen',
    save_path='../reports/figures/02_income_distribution.png'
)

In [None]:
# Spending Score distribution
plot_distribution(
    df,
    'Spending Score (1-100)',
    title='Distribution of Spending Score',
    bins=20,
    color='salmon',
    save_path='../reports/figures/03_spending_distribution.png'
)

## 4. Visualization Type 2: Box Plots

In [None]:
# Box plots for all numerical features
plot_boxplot(
    df,
    numerical_features,
    title='Box Plots of Numerical Features - Outlier Detection',
    save_path='../reports/figures/04_boxplots_all_features.png'
)
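
Box plots conventionally draw whiskers at 1.5 times the interquartile range (the matplotlib and seaborn default), and points beyond them appear as outliers. Assuming `plot_boxplot` follows that convention, the same rule can be applied numerically. The helper `iqr_outliers` and the income values below are illustrative only.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up incomes in k$; 137 is deliberately far from the rest.
income = pd.Series([15, 40, 54, 60, 62, 78, 137], name="Annual Income (k$)")
flagged = iqr_outliers(income)
print(flagged.tolist())
```

Running `iqr_outliers(df[col])` for each entry in `numerical_features` would give a count to report alongside the figure.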

## 5. Visualization Type 3: Correlation Heatmap

In [None]:
# Correlation heatmap for numerical features plus the encoded gender column
correlation_features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)', 'Gender_Encoded']
plot_correlation_heatmap(
    df,
    columns=correlation_features,
    title='Correlation Heatmap of Customer Features',
    save_path='../reports/figures/05_correlation_heatmap.png'
)

In [None]:
# Print correlation matrix
print("Correlation Matrix:")
print(df[correlation_features].corr().round(3))
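
A full matrix repeats every pair twice and includes the trivial diagonal. One way to read it faster is to list each pair once, ordered by absolute correlation. This is a sketch on made-up data, and `ranked_correlations` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def ranked_correlations(frame: pd.DataFrame) -> pd.Series:
    """List each feature pair once, sorted by absolute Pearson correlation."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears exactly once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order)


# Made-up values: spending falls linearly with age, income is only loosely related.
demo = pd.DataFrame({
    "Age": [20, 30, 40, 50, 60],
    "Spending Score (1-100)": [90, 70, 50, 30, 10],
    "Annual Income (k$)": [30, 10, 50, 20, 40],
})
ranked = ranked_correlations(demo)
print(ranked.round(3))
```

Note that Pearson correlation only captures linear association, so a weak coefficient does not rule out clustered or non-linear structure.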

## 6. Visualization Type 4: Scatter Plots

In [None]:
# Scatter plot: Income vs Spending Score
plot_scatter(
    df,
    'Annual Income (k$)',
    'Spending Score (1-100)',
    title='Annual Income vs Spending Score',
    save_path='../reports/figures/06_income_vs_spending.png'
)

In [None]:
# Scatter plot with Gender coloring
plot_scatter(
    df,
    'Annual Income (k$)',
    'Spending Score (1-100)',
    hue='Gender',
    title='Annual Income vs Spending Score (by Gender)',
    save_path='../reports/figures/07_income_vs_spending_by_gender.png'
)

In [None]:
# Scatter plot: Age vs Spending Score
plot_scatter(
    df,
    'Age',
    'Spending Score (1-100)',
    hue='Gender',
    title='Age vs Spending Score (by Gender)',
    save_path='../reports/figures/08_age_vs_spending.png'
)
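
The trend behind a scatter plot can be summarised with a least-squares line. Whether `plot_scatter` draws one is not visible from this notebook, so the following is only an illustration using `numpy.polyfit` on made-up points.

```python
import numpy as np

# Made-up points lying exactly on the line score = 100 - 1.5 * age.
age = np.array([20.0, 30.0, 40.0, 50.0])
score = 100.0 - 1.5 * age

# A degree-1 fit returns the slope and intercept of the least-squares line.
slope, intercept = np.polyfit(age, score, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

On real data the slope would be read together with the scatter itself, since a single line can hide distinct customer groups.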

## 7. Visualization Type 5: Bar Charts

In [None]:
# Gender distribution bar chart
plot_count_bar(
    df,
    'Gender',
    title='Customer Gender Distribution',
    save_path='../reports/figures/09_gender_distribution.png'
)

In [None]:
# Age Group distribution
plot_count_bar(
    df,
    'Age_Group',
    title='Customer Age Group Distribution',
    save_path='../reports/figures/10_age_group_distribution.png'
)

In [None]:
# Income Category distribution
plot_count_bar(
    df,
    'Income_Category',
    title='Customer Income Category Distribution',
    save_path='../reports/figures/11_income_category_distribution.png'
)

In [None]:
# Spending Category distribution
plot_count_bar(
    df,
    'Spending_Category',
    title='Customer Spending Category Distribution',
    save_path='../reports/figures/12_spending_category_distribution.png'
)

## 8. Visualization Type 6: Pair Plots

In [None]:
# Pair plot for numerical features colored by Gender
plot_pairplot(
    df,
    numerical_features,
    hue='Gender',
    save_path='../reports/figures/13_pairplot_by_gender.png'
)

## 9. Visualization Type 7: Violin Plots

In [None]:
# Violin plot: Spending Score by Gender
plot_violin(
    df,
    x_col='Gender',
    y_col='Spending Score (1-100)',
    title='Spending Score Distribution by Gender',
    save_path='../reports/figures/14_spending_violin_by_gender.png'
)

In [None]:
# Violin plot: Income by Age Group
plot_violin(
    df,
    x_col='Age_Group',
    y_col='Annual Income (k$)',
    title='Annual Income Distribution by Age Group',
    save_path='../reports/figures/15_income_violin_by_age_group.png'
)

## 10. Visualization Type 8: Pie Charts

In [None]:
# Pie chart for Gender
plot_pie_chart(
    df,
    'Gender',
    title='Gender Distribution',
    save_path='../reports/figures/16_gender_pie_chart.png'
)

In [None]:
# Pie chart for Spending Category
plot_pie_chart(
    df,
    'Spending_Category',
    title='Spending Category Distribution',
    save_path='../reports/figures/17_spending_category_pie_chart.png'
)

## 11. Grouped Analysis

In [None]:
# Grouped bar chart: Age Group by Gender
plot_grouped_bar(
    df,
    'Age_Group',
    'Gender',
    title='Age Group Distribution by Gender',
    save_path='../reports/figures/18_age_group_by_gender.png'
)

In [None]:
# Grouped bar chart: Income Category by Gender
plot_grouped_bar(
    df,
    'Income_Category',
    'Gender',
    title='Income Category Distribution by Gender',
    save_path='../reports/figures/19_income_category_by_gender.png'
)

## 12. Statistical Analysis by Groups

In [None]:
# Average statistics by Gender
print("Average Statistics by Gender:")
print("=" * 80)
gender_stats = df.groupby('Gender')[numerical_features].mean().round(2)
print(gender_stats)

In [None]:
# Average statistics by Age Group
print("Average Statistics by Age Group:")
print("=" * 80)
age_group_stats = df.groupby('Age_Group')[numerical_features].mean().round(2)
print(age_group_stats)

In [None]:
# Average statistics by Income Category
print("Average Statistics by Income Category:")
print("=" * 80)
income_stats = df.groupby('Income_Category')[numerical_features].mean().round(2)
print(income_stats)

In [None]:
# Average statistics by Spending Category
print("Average Statistics by Spending Category:")
print("=" * 80)
spending_stats = df.groupby('Spending_Category')[numerical_features].mean().round(2)
print(spending_stats)
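
Group means alone hide how many customers fall in each group and how spread out they are. A small sketch of reporting count, mean, median, and standard deviation together, using made-up values:

```python
import pandas as pd

# Made-up values for illustration only.
demo = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male"],
    "Spending Score (1-100)": [80, 60, 40, 50, 30],
})

# Several aggregates per group; std is the sample standard deviation.
group_summary = (
    demo.groupby("Gender")["Spending Score (1-100)"]
    .agg(["count", "mean", "median", "std"])
    .round(2)
)
print(group_summary)
```

The same `.agg([...])` call should work on `df.groupby('Age_Group')[numerical_features]`, where it produces one block of statistics per feature.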

## 13. Cross-tabulation Analysis

In [None]:
# Cross-tabulation: Gender vs Age Group
print("Cross-tabulation: Gender vs Age Group")
print("=" * 80)
ct1 = pd.crosstab(df['Gender'], df['Age_Group'], margins=True)
print(ct1)

In [None]:
# Cross-tabulation: Gender vs Income Category
print("Cross-tabulation: Gender vs Income Category")
print("=" * 80)
ct2 = pd.crosstab(df['Gender'], df['Income_Category'], margins=True)
print(ct2)

In [None]:
# Cross-tabulation: Income Category vs Spending Category
print("Cross-tabulation: Income Category vs Spending Category")
print("=" * 80)
ct3 = pd.crosstab(df['Income_Category'], df['Spending_Category'], margins=True)
print(ct3)
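
Raw counts are hard to compare when the row groups differ in size. `pd.crosstab` accepts `normalize="index"` to return row shares instead. The category labels below are made up, since the actual labels in `Income_Category` and `Spending_Category` are not shown here.

```python
import pandas as pd

# Made-up category labels and rows for illustration only.
demo = pd.DataFrame({
    "Income_Category": ["Low", "Low", "Low", "Low", "High", "High"],
    "Spending_Category": ["High", "High", "Low", "Low", "High", "Low"],
})

# normalize="index" divides each row by its row total, giving row shares.
row_pct = (
    pd.crosstab(
        demo["Income_Category"],
        demo["Spending_Category"],
        normalize="index",
    )
    * 100
).round(1)
print(row_pct)
```

Each row then sums to 100, which makes it easier to see whether spending behaviour differs across income groups.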

## 14. Key Insights and Patterns

Run all cells above and document your findings here:

### Distribution Analysis:
- Age Distribution: [To be filled after running]
- Income Distribution: [To be filled after running]
- Spending Score Distribution: [To be filled after running]

### Correlation Findings:
- Income vs Spending Score correlation: [To be filled after running]
- Age vs Spending Score correlation: [To be filled after running]
- Gender impact on spending: [To be filled after running]

### Customer Segments Identified:
1. [To be filled after running]
2. [To be filled after running]
3. [To be filled after running]

### Business Insights:
1. [To be filled after running]
2. [To be filled after running]
3. [To be filled after running]

### Patterns Observed:
- [To be filled after running]
- [To be filled after running]

### Anomalies or Interesting Findings:
- [To be filled after running]

## 15. Save EDA Summary Report

In [None]:
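# Save comprehensive EDA report
import os

report_path = '../reports/results/eda_summary_report.txt'
# Create the output directory if it does not exist yet
os.makedirs(os.path.dirname(report_path), exist_ok=True)

with open(report_path, 'w', encoding='utf-8') as f: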
    f.write("EXPLORATORY DATA ANALYSIS SUMMARY REPORT\n")
    f.write("=" * 80 + "\n\n")
    
    f.write("1. DATASET OVERVIEW\n")
    f.write("-" * 80 + "\n")
    f.write(f"Total Records: {len(df)}\n")
    f.write(f"Total Features: {len(df.columns)}\n\n")
    
    f.write("2. STATISTICAL SUMMARY\n")
    f.write("-" * 80 + "\n")
    f.write(stats_summary.to_string())
    f.write("\n\n")
    
    f.write("3. CORRELATION ANALYSIS\n")
    f.write("-" * 80 + "\n")
    f.write(df[correlation_features].corr().to_string())
    f.write("\n\n")
    
    f.write("4. CATEGORICAL DISTRIBUTIONS\n")
    f.write("-" * 80 + "\n")
    f.write("Gender Distribution:\n")
    f.write(df['Gender'].value_counts().to_string())
    f.write("\n\nAge Group Distribution:\n")
    f.write(df['Age_Group'].value_counts().to_string())
    f.write("\n\nIncome Category Distribution:\n")
    f.write(df['Income_Category'].value_counts().to_string())
    f.write("\n\nSpending Category Distribution:\n")
    f.write(df['Spending_Category'].value_counts().to_string())
    f.write("\n\n")
    
    f.write("5. GROUP STATISTICS\n")
    f.write("-" * 80 + "\n")
    f.write("Average by Gender:\n")
    f.write(gender_stats.to_string())
    f.write("\n\nAverage by Age Group:\n")
    f.write(age_group_stats.to_string())
    f.write("\n\nAverage by Income Category:\n")
    f.write(income_stats.to_string())
    f.write("\n\nAverage by Spending Category:\n")
    f.write(spending_stats.to_string())
    f.write("\n\n")
    
    f.write("6. VISUALIZATIONS CREATED\n")
    f.write("-" * 80 + "\n")
    f.write("Total visualizations: 19\n")
    f.write("Types: Distribution plots, Box plots, Correlation heatmap, \n")
    f.write("       Scatter plots, Bar charts, Pair plots, Violin plots, Pie charts\n")
    f.write("All figures saved to: reports/figures/\n")

print(f"EDA summary report saved to: {report_path}")

In [None]:
# Display total number of visualizations created
import os
figures_dir = '../reports/figures/'
figure_count = len([name for name in os.listdir(figures_dir) if name.endswith('.png')])
print(f"Total visualizations created and saved: {figure_count}")
print(f"Location: {figures_dir}")
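
Counting every `.png` in the folder also counts stale files left over from earlier runs. A stricter check compares the folder against the file names this notebook is expected to write. The sketch below uses a temporary directory and only two of the file names; `missing_figures` is a hypothetical helper.

```python
from pathlib import Path
import tempfile


def missing_figures(figures_dir, expected_names):
    """Return the expected file names that are not present in figures_dir."""
    present = {p.name for p in Path(figures_dir).glob("*.png")}
    return sorted(set(expected_names) - present)


# Demonstration in a throwaway directory with only one of two files written.
expected = ["01_age_distribution.png", "02_income_distribution.png"]
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "01_age_distribution.png").touch()
    still_missing = missing_figures(tmp, expected)
print(still_missing)
```

In the notebook, `expected` would hold all 19 names passed as `save_path` above, and `figures_dir` would be `'../reports/figures/'`.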

## Summary

This EDA notebook has successfully:
- Performed comprehensive statistical analysis of all features
- Created 8 different types of visualizations (19 figures in total)
- Analyzed correlations and relationships between features
- Generated insights through grouped analysis and cross-tabulations
- Saved all visualizations and reports for presentation

**Next Steps:**
1. Use insights to inform machine learning model selection
2. Apply K-Means clustering to identify customer segments
3. Build classification models to predict customer categories