# 📊 Sales Dataset EDA - Exploratory Analysis of Sales Data

## 🎯 Objectives
Explore the sales data to understand:
- Sales trends over time
- Performance by product, region, and customer
- Seasonal patterns and trends
- Customer behavior

## 📋 Dataset Overview
- **Source**: Synthetic sales data
- **Period**: 2 years (2022-2023)
- **Features**: Date, Product, Category, Region, Customer_ID, Sales, Quantity, Price
- **Goals**: Time series analysis, seasonal patterns, customer segmentation

## 🔍 Techniques Used
- Time series analysis
- Seasonal decomposition
- Customer segmentation
- Product performance analysis
- Geographic analysis
- Trend analysis


In [None]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Configure plot styling
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print("✅ All libraries imported successfully!")
print("📊 Ready to start analyzing the Sales Dataset!")


## 📊 Step 1: Create the Synthetic Sales Dataset

Generate a synthetic sales dataset with realistic characteristics:
- Two years of daily data (2022-2023)
- 10 products in 2 categories (Electronics and Fashion)
- 4 sales regions
- Several thousand customer IDs, drawn at random from `CUST_1000` to `CUST_9998`
- Seasonal patterns and trends
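
The seasonal, weekend, and holiday effects are combined multiplicatively in the generation cell that follows. As a minimal sketch of that rule, the expected number of transactions for a given date can be computed with the standard library alone. The helper name `expected_daily_transactions` and the constant `BASE_RATE` are introduced here for illustration; the numeric factors mirror the generation code.

```python
import math
from datetime import date

BASE_RATE = 50  # mean transactions per day before any adjustment


def expected_daily_transactions(day: date) -> float:
    """Expected transaction count for one day (hypothetical helper)."""
    day_of_year = day.timetuple().tm_yday
    seasonal = 1 + 0.3 * math.sin(2 * math.pi * day_of_year / 365)  # yearly sine cycle
    weekend = 0.7 if day.weekday() >= 5 else 1.0  # Saturday/Sunday dip
    holiday = 1.5 if day.month in (11, 12) else 1.0  # November-December uplift
    return BASE_RATE * seasonal * weekend * holiday


# A weekday in January, a Saturday in April, a weekday in December
for d in (date(2022, 1, 3), date(2022, 4, 2), date(2022, 12, 5)):
    print(d, round(expected_daily_transactions(d), 1))
```

The actual daily count is drawn from a Poisson distribution with this mean, so individual days scatter around the expected value.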


In [2]:
# Create the synthetic sales dataset
import numpy as np
import pandas as pd  # imported here as well so this cell runs on its own

np.random.seed(42)

# Basic parameters
start_date = '2022-01-01'
end_date = '2023-12-31'
date_range = pd.date_range(start=start_date, end=end_date, freq='D')

# Products and their categories
products = {
    'Laptop Gaming': 'Electronics',
    'iPhone 14': 'Electronics', 
    'Nike Air Max': 'Fashion',
    'Adidas Ultraboost': 'Fashion',
    'MacBook Pro': 'Electronics',
    'Samsung Galaxy': 'Electronics',
    'Levi\'s Jeans': 'Fashion',
    'Zara Jacket': 'Fashion',
    'iPad Air': 'Electronics',
    'Nike T-Shirt': 'Fashion'
}

# Regions
regions = ['North', 'South', 'East', 'West']

# Base price per product (defined once, outside the loops)
base_prices = {
    'Laptop Gaming': 1200, 'iPhone 14': 800, 'Nike Air Max': 120,
    'Adidas Ultraboost': 150, 'MacBook Pro': 2000, 'Samsung Galaxy': 600,
    'Levi\'s Jeans': 80, 'Zara Jacket': 120, 'iPad Air': 500, 'Nike T-Shirt': 30
}

# Generate the transactions
sales_data = []

for date in date_range:
    # Number of transactions for the day (seasonal, weekend, and holiday effects)
    day_of_year = date.timetuple().tm_yday
    seasonal_factor = 1 + 0.3 * np.sin(2 * np.pi * day_of_year / 365)  # Yearly sine cycle
    weekend_factor = 0.7 if date.weekday() >= 5 else 1.0  # Fewer transactions on weekends
    holiday_factor = 1.5 if date.month in [11, 12] else 1.0  # Holiday season uplift

    num_transactions = int(np.random.poisson(50 * seasonal_factor * weekend_factor * holiday_factor))

    for _ in range(num_transactions):
        product = np.random.choice(list(products.keys()))
        category = products[product]
        region = np.random.choice(regions)
        customer_id = f"CUST_{np.random.randint(1000, 9999)}"

        # Product price: base price with random per-transaction noise
        base_price = base_prices[product]
        price_variation = np.random.normal(1, 0.1)  # Multiplier with 10% standard deviation
        price = base_price * price_variation
        
        # Quantity (1-3 items, most often 1)
        quantity = np.random.choice([1, 2, 3], p=[0.7, 0.2, 0.1])
        
        # Total sales
        sales = price * quantity
        
        # Discount factor (10% chance of discount)
        if np.random.random() < 0.1:
            discount = np.random.uniform(0.05, 0.25)
            sales *= (1 - discount)
        
        sales_data.append({
            'Date': date,
            'Product': product,
            'Category': category,
            'Region': region,
            'Customer_ID': customer_id,
            'Price': round(price, 2),
            'Quantity': quantity,
            'Sales': round(sales, 2),
            'Year': date.year,
            'Month': date.month,
            'Day': date.day,
            'Weekday': date.weekday(),
            'Quarter': date.quarter
        })

# Build the DataFrame
df = pd.DataFrame(sales_data)

print("✅ Sales dataset created successfully!")
print(f"📊 Dataset shape: {df.shape}")
print(f"📅 Period: {df['Date'].min()} to {df['Date'].max()}")
print(f"💰 Total revenue: ${df['Sales'].sum():,.2f}")
print(f"🛍️ Total transactions: {len(df):,}")



## 📊 Step 2: Data Loading & Overview


In [None]:
# Basic information about the dataset
print("🔍 BASIC DATASET INFORMATION")
print("=" * 50)
print(f"📊 Shape: {df.shape}")
print(f"💾 Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"📅 Date range: {df['Date'].min()} to {df['Date'].max()}")
print(f"📈 Total days: {(df['Date'].max() - df['Date'].min()).days + 1}")

print("\n📋 COLUMNS INFO:")
print("=" * 30)
df.info()  # prints directly; wrapping it in print() would also print "None"

print("\n🔢 DATA TYPES:")
print("=" * 20)
print(df.dtypes)

print("\n📊 SAMPLE DATA (first 5 rows):")
print("=" * 35)
df.head()
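
The overview cell above reports memory usage with `memory_usage(deep=True)`. As a minimal sketch, on a small stand-in frame rather than the notebook's `df`, converting low-cardinality text columns to the pandas `category` dtype can reduce that footprint. The frame `demo` and its size are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 10_000

# Stand-in frame with the same kind of repeated labels as Region and Category
demo = pd.DataFrame({
    'Region': rng.choice(['North', 'South', 'East', 'West'], size=n_rows),
    'Category': rng.choice(['Electronics', 'Fashion'], size=n_rows),
})

bytes_before = demo.memory_usage(deep=True).sum()

# Each distinct label is stored once; rows hold small integer codes
compact = demo.astype('category')
bytes_after = compact.memory_usage(deep=True).sum()

print(f"before: {bytes_before / 1024:.1f} KiB, after: {bytes_after / 1024:.1f} KiB")
```

The saving depends on how many distinct values a column has, so a near-unique column such as `Customer_ID` is likely to benefit far less than `Region` or `Category`.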


In [None]:
# Check for missing values
print("🔍 MISSING VALUES ANALYSIS")
print("=" * 40)
missing_data = df.isnull().sum()
missing_percent = (missing_data / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percentage': missing_percent
})

if missing_df['Missing Count'].sum() == 0:
    print("✅ No missing values in the dataset!")
else:
    print(missing_df[missing_df['Missing Count'] > 0])

# Kiểm tra duplicate rows
duplicates = df.duplicated().sum()
print(f"\n🔄 Duplicate rows: {duplicates}")

# Kiểm tra unique values cho categorical columns
print("\n📊 UNIQUE VALUES:")
print("=" * 25)
categorical_cols = ['Product', 'Category', 'Region', 'Customer_ID']
for col in categorical_cols:
    unique_count = df[col].nunique()
    print(f"{col}: {unique_count} unique values")
    
    if col in ['Product', 'Category', 'Region']:
        print(f"  Values: {list(df[col].unique())}")
    print()


## 📊 Step 3: Data Profiling


In [None]:
# Statistical summary cho numerical columns
print("📊 STATISTICAL SUMMARY")
print("=" * 30)
numerical_cols = ['Price', 'Quantity', 'Sales']
print(df[numerical_cols].describe().round(2))

# Additional statistics (the "$" prefix applies only to monetary columns)
print("\n📈 ADDITIONAL STATISTICS")
print("=" * 35)
for col in numerical_cols:
    unit = '' if col == 'Quantity' else '$'
    print(f"\n{col}:")
    print(f"  Mean: {unit}{df[col].mean():.2f}")
    print(f"  Median: {unit}{df[col].median():.2f}")
    print(f"  Std: {unit}{df[col].std():.2f}")
    print(f"  Min: {unit}{df[col].min():.2f}")
    print(f"  Max: {unit}{df[col].max():.2f}")
    print(f"  Skewness: {df[col].skew():.3f}")
    print(f"  Kurtosis: {df[col].kurtosis():.3f}")


In [None]:
# Analyze categorical features
print("📊 CATEGORICAL FEATURES ANALYSIS")
print("=" * 40)

# Product analysis
print("\n🛍️ PRODUCT ANALYSIS:")
print("-" * 25)
product_stats = df.groupby('Product').agg({
    'Sales': ['count', 'sum', 'mean'],
    'Quantity': 'sum',
    'Price': 'mean'
}).round(2)
product_stats.columns = ['Transactions', 'Total_Sales', 'Avg_Sales', 'Total_Quantity', 'Avg_Price']
product_stats = product_stats.sort_values('Total_Sales', ascending=False)
print(product_stats)

# Category analysis
print("\n📂 CATEGORY ANALYSIS:")
print("-" * 25)
category_stats = df.groupby('Category').agg({
    'Sales': ['count', 'sum', 'mean'],
    'Quantity': 'sum',
    'Price': 'mean'
}).round(2)
category_stats.columns = ['Transactions', 'Total_Sales', 'Avg_Sales', 'Total_Quantity', 'Avg_Price']
category_stats = category_stats.sort_values('Total_Sales', ascending=False)
print(category_stats)

# Region analysis
print("\n🌍 REGION ANALYSIS:")
print("-" * 20)
region_stats = df.groupby('Region').agg({
    'Sales': ['count', 'sum', 'mean'],
    'Quantity': 'sum',
    'Price': 'mean'
}).round(2)
region_stats.columns = ['Transactions', 'Total_Sales', 'Avg_Sales', 'Total_Quantity', 'Avg_Price']
region_stats = region_stats.sort_values('Total_Sales', ascending=False)
print(region_stats)
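
The product, category, and region summaries above repeat the same aggregation three times. One way to factor it out is sketched below with pandas named aggregation, which also makes the output column names explicit instead of relying on their order. The helper `summarize_by` and the frame `demo` are hypothetical names introduced for this example.

```python
import pandas as pd


def summarize_by(frame: pd.DataFrame, key: str) -> pd.DataFrame:
    """Per-group transaction count, revenue, quantity, and mean price (hypothetical helper)."""
    return (
        frame.groupby(key)
        .agg(
            Transactions=('Sales', 'count'),
            Total_Sales=('Sales', 'sum'),
            Avg_Sales=('Sales', 'mean'),
            Total_Quantity=('Quantity', 'sum'),
            Avg_Price=('Price', 'mean'),
        )
        .round(2)
        .sort_values('Total_Sales', ascending=False)
    )


# Tiny stand-in frame with the same column names as the notebook's df
demo = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'East'],
    'Price': [100.0, 50.0, 200.0, 30.0],
    'Quantity': [1, 2, 1, 3],
    'Sales': [100.0, 100.0, 200.0, 90.0],
})
summary = summarize_by(demo, 'Region')
print(summary)
```

With such a helper, the three blocks would reduce to calls like `summarize_by(df, 'Product')`, `summarize_by(df, 'Category')`, and `summarize_by(df, 'Region')`.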


## 📊 Step 4: Missing Value Analysis

Because this dataset is synthetic, it contains no missing values. We still run the check and visualize the result to confirm this and to see completeness per column.
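
Since the real data has no gaps, the missing-value report never shows a non-trivial result. A minimal sketch of how it could be exercised: blank out a few cells on a copy of a small stand-in frame, then build the same count and percentage table. The frame `demo`, the number of blanked cells, and the seed are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Small stand-in frame; the notebook's df uses the same column names
demo = pd.DataFrame({
    'Price': rng.normal(100, 10, size=200).round(2),
    'Quantity': rng.integers(1, 4, size=200),
    'Region': rng.choice(['North', 'South', 'East', 'West'], size=200),
})

# Blank out 10 randomly chosen Price cells and 5 Region cells on a copy
with_gaps = demo.copy()
price_rows = rng.choice(with_gaps.index.to_numpy(), size=10, replace=False)
region_rows = rng.choice(with_gaps.index.to_numpy(), size=5, replace=False)
with_gaps.loc[price_rows, 'Price'] = np.nan
with_gaps.loc[region_rows, 'Region'] = np.nan

missing_count = with_gaps.isnull().sum()
report = pd.DataFrame({
    'Missing Count': missing_count,
    'Missing Percentage': missing_count / len(with_gaps) * 100,
})
print(report)
```

Working on a copy keeps the original frame intact, so the rest of the analysis is unaffected by the injected gaps.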


In [None]:
# Missing value visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Missing value heatmap
missing_data = df.isnull().sum()
if missing_data.sum() > 0:
    sns.heatmap(df.isnull(), yticklabels=False, cbar=True, cmap='viridis', ax=axes[0])
    axes[0].set_title('Missing Values Heatmap')
else:
    axes[0].text(0.5, 0.5, '✅ No Missing Values', ha='center', va='center', 
                transform=axes[0].transAxes, fontsize=16, color='green')
    axes[0].set_title('Missing Values Status')

# Data completeness
completeness = (1 - missing_data / len(df)) * 100
bars = axes[1].bar(range(len(completeness)), completeness.values, color='lightgreen')
axes[1].set_title('Data Completeness by Column')
axes[1].set_xlabel('Columns')
axes[1].set_ylabel('Completeness (%)')
axes[1].set_xticks(range(len(completeness)))
axes[1].set_xticklabels(completeness.index, rotation=45)
axes[1].set_ylim(0, 105)

# Add percentage labels on bars
for i, bar in enumerate(bars):
    height = bar.get_height()
    axes[1].text(bar.get_x() + bar.get_width()/2., height + 1,
                f'{height:.1f}%', ha='center', va='bottom')

plt.tight_layout()
plt.show()

if missing_data.sum() == 0:
    print("✅ Dataset is fully clean - no missing values!")
else:
    print(f"⚠️ Dataset contains {missing_data.sum():,} missing values.")


## 📊 Step 5: Univariate Analysis


In [None]:
# Univariate Analysis - Numerical Features
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

numerical_cols = ['Price', 'Quantity', 'Sales']

# Histograms
for i, col in enumerate(numerical_cols):
    axes[i].hist(df[col], bins=50, alpha=0.7, color='skyblue', edgecolor='black')
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
    axes[i].grid(True, alpha=0.3)

# Box plots
for i, col in enumerate(numerical_cols):
    axes[i+3].boxplot(df[col], patch_artist=True, 
                     boxprops=dict(facecolor='lightcoral', alpha=0.7))
    axes[i+3].set_title(f'Box Plot of {col}')
    axes[i+3].set_ylabel(col)
    axes[i+3].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Statistical insights
print("📊 UNIVARIATE ANALYSIS INSIGHTS")
print("=" * 40)
for col in numerical_cols:
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    # Tukey's rule: values more than 1.5 * IQR beyond the quartiles count as outliers
    n_outliers = ((df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)).sum()
    skew = df[col].skew()
    shape = 'Right-skewed' if skew > 0.5 else 'Left-skewed' if skew < -0.5 else 'Approximately symmetric'
    unit = '' if col == 'Quantity' else '$'
    print(f"\n{col}:")
    print(f"  Distribution: {shape}")
    print(f"  Outliers: {n_outliers} values")
    print(f"  Range: {unit}{df[col].min():.2f} - {unit}{df[col].max():.2f}")
    print(f"  IQR: {unit}{iqr:.2f}")
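
The outlier count above applies the IQR rule to each column as a whole. Prices in this dataset are generated around a different base price per product, so a rule applied within each product may be more informative than a global one. The sketch below compares the two on a hand-made stand-in frame. The helper `iqr_outliers` and the sample values are assumptions for illustration.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside Tukey's fences (hypothetical helper)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


demo = pd.DataFrame({
    'Product': ['Nike T-Shirt'] * 6 + ['MacBook Pro'] * 5,
    'Price': [28.0, 29.0, 30.0, 31.0, 32.0, 60.0,
              1900.0, 1950.0, 2000.0, 2050.0, 2100.0],
})

# Rule applied to the whole column at once
global_mask = iqr_outliers(demo['Price'])

# Same rule applied separately inside each product group
within_mask = demo.groupby('Product')['Price'].transform(iqr_outliers).astype(bool)

print("Flagged globally:", int(global_mask.sum()))
print("Flagged within product:")
print(demo[within_mask])
```

In this toy frame the unusually expensive T-shirt is only caught by the per-product rule, because the global fences are stretched by the much higher laptop prices.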


In [None]:
# Univariate Analysis - Categorical Features
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Product distribution
product_counts = df['Product'].value_counts()
axes[0,0].bar(range(len(product_counts)), product_counts.values, color='lightblue')
axes[0,0].set_title('Product Distribution')
axes[0,0].set_xlabel('Products')
axes[0,0].set_ylabel('Number of Transactions')
axes[0,0].set_xticks(range(len(product_counts)))
axes[0,0].set_xticklabels(product_counts.index, rotation=45, ha='right')

# Category distribution
category_counts = df['Category'].value_counts()
axes[0,1].pie(category_counts.values, labels=category_counts.index, autopct='%1.1f%%', 
              colors=['lightcoral', 'lightgreen', 'lightblue'])
axes[0,1].set_title('Category Distribution')

# Region distribution
region_counts = df['Region'].value_counts()
axes[1,0].bar(range(len(region_counts)), region_counts.values, color='lightgreen')
axes[1,0].set_title('Region Distribution')
axes[1,0].set_xlabel('Regions')
axes[1,0].set_ylabel('Number of Transactions')
axes[1,0].set_xticks(range(len(region_counts)))
axes[1,0].set_xticklabels(region_counts.index)

# Weekday distribution
weekday_counts = df['Weekday'].value_counts().sort_index()
weekday_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
axes[1,1].bar(range(len(weekday_counts)), weekday_counts.values, color='lightcoral')
axes[1,1].set_title('Weekday Distribution')
axes[1,1].set_xlabel('Day of Week')
axes[1,1].set_ylabel('Number of Transactions')
axes[1,1].set_xticks(range(len(weekday_counts)))
axes[1,1].set_xticklabels(weekday_names)

plt.tight_layout()
plt.show()

# Categorical insights
print("\n📊 CATEGORICAL ANALYSIS INSIGHTS")
print("=" * 40)
print(f"Most popular product: {product_counts.index[0]} ({product_counts.iloc[0]} transactions)")
print(f"Most popular category: {category_counts.index[0]} ({category_counts.iloc[0]} transactions)")
print(f"Most active region: {region_counts.index[0]} ({region_counts.iloc[0]} transactions)")
print(f"Busiest day: {weekday_names[weekday_counts.idxmax()]} ({weekday_counts.max()} transactions)")


## 📊 Step 6: Bivariate Analysis


In [None]:
# Correlation Analysis
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Correlation matrix
numerical_cols = ['Price', 'Quantity', 'Sales']
correlation_matrix = df[numerical_cols].corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, ax=axes[0])
axes[0].set_title('Correlation Matrix - Numerical Features')

# Scatter plots
axes[1].scatter(df['Price'], df['Sales'], alpha=0.5, color='blue')
axes[1].set_xlabel('Price')
axes[1].set_ylabel('Sales')
axes[1].set_title('Price vs Sales')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Correlation insights
print("📊 CORRELATION ANALYSIS")
print("=" * 30)
print("Correlation coefficients:")
for i in range(len(numerical_cols)):
    for j in range(i+1, len(numerical_cols)):
        corr = correlation_matrix.iloc[i, j]
        print(f"  {numerical_cols[i]} vs {numerical_cols[j]}: {corr:.3f}")
        
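# Keep only the upper triangle so the diagonal (self-correlation of 1.0) is excluded
upper = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1))
strongest_pair = upper.abs().stack().idxmax()
print(f"\nStrongest correlation: {strongest_pair[0]} vs {strongest_pair[1]} ({upper.loc[strongest_pair]:.3f})")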
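
Because the generation cell computes `Sales` as `Price * Quantity` with an occasional discount, the correlations above are largely structural. As a minimal sketch on a hand-made stand-in frame, the implied discount of each transaction can be recovered from the three columns. The column names `Implied_Discount` and `Discounted` and the 0.01 tolerance are assumptions introduced here.

```python
import pandas as pd

demo = pd.DataFrame({
    'Price': [100.0, 80.0, 500.0, 30.0],
    'Quantity': [1, 2, 1, 3],
    'Sales': [100.0, 144.0, 400.0, 90.0],  # rows 1 and 2 carry a discount
})

# Value of the transaction at list price, before any discount
list_value = demo['Price'] * demo['Quantity']
demo['Implied_Discount'] = (1 - demo['Sales'] / list_value).round(4)

# The tolerance absorbs the rounding of Price and Sales to two decimals
demo['Discounted'] = demo['Implied_Discount'] > 0.01
print(demo)
```

On the notebook's `df`, the share of rows flagged this way should land near the 10% discount probability used during generation.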


In [None]:
# Categorical vs Numerical Analysis
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Sales by Category
sns.boxplot(data=df, x='Category', y='Sales', ax=axes[0,0])
axes[0,0].set_title('Sales Distribution by Category')
axes[0,0].tick_params(axis='x', rotation=45)

# Sales by Region
sns.boxplot(data=df, x='Region', y='Sales', ax=axes[0,1])
axes[0,1].set_title('Sales Distribution by Region')

# Sales by Weekday
sns.boxplot(data=df, x='Weekday', y='Sales', ax=axes[1,0])
axes[1,0].set_title('Sales Distribution by Weekday')
weekday_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
axes[1,0].set_xticklabels(weekday_names)

# Average Sales by Product
product_sales = df.groupby('Product')['Sales'].mean().sort_values(ascending=False)
axes[1,1].bar(range(len(product_sales)), product_sales.values, color='lightcoral')
axes[1,1].set_title('Average Sales by Product')
axes[1,1].set_xlabel('Products')
axes[1,1].set_ylabel('Average Sales ($)')
axes[1,1].set_xticks(range(len(product_sales)))
axes[1,1].set_xticklabels(product_sales.index, rotation=45, ha='right')

plt.tight_layout()
plt.show()

# Bivariate insights
print("\n📊 BIVARIATE ANALYSIS INSIGHTS")
print("=" * 40)
print("Average Sales by Category:")
for category in df['Category'].unique():
    avg_sales = df[df['Category'] == category]['Sales'].mean()
    print(f"  {category}: ${avg_sales:.2f}")

print("\nAverage Sales by Region:")
for region in df['Region'].unique():
    avg_sales = df[df['Region'] == region]['Sales'].mean()
    print(f"  {region}: ${avg_sales:.2f}")

print("\nAverage Sales by Weekday:")
for i, day in enumerate(weekday_names):
    avg_sales = df[df['Weekday'] == i]['Sales'].mean()
    print(f"  {day}: ${avg_sales:.2f}")


## 📊 Step 7: Time Series Analysis


In [None]:
# Time Series Analysis
# Aggregate to daily sales
daily_sales = df.groupby('Date').agg({
    'Sales': 'sum',
    'Quantity': 'sum',
    'Customer_ID': 'nunique'
}).reset_index()

daily_sales.columns = ['Date', 'Daily_Sales', 'Daily_Quantity', 'Unique_Customers']

# Aggregate to monthly sales
monthly_sales = df.groupby(['Year', 'Month']).agg({
    'Sales': 'sum',
    'Quantity': 'sum',
    'Customer_ID': 'nunique'
}).reset_index()

monthly_sales['Date'] = pd.to_datetime(monthly_sales[['Year', 'Month']].assign(day=1))
monthly_sales.columns = ['Year', 'Month', 'Monthly_Sales', 'Monthly_Quantity', 'Unique_Customers', 'Date']

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Daily sales trend
axes[0,0].plot(daily_sales['Date'], daily_sales['Daily_Sales'], color='blue', alpha=0.7)
axes[0,0].set_title('Daily Sales Trend')
axes[0,0].set_xlabel('Date')
axes[0,0].set_ylabel('Daily Sales ($)')
axes[0,0].grid(True, alpha=0.3)

# Monthly sales trend
axes[0,1].plot(monthly_sales['Date'], monthly_sales['Monthly_Sales'], 
               marker='o', color='red', linewidth=2)
axes[0,1].set_title('Monthly Sales Trend')
axes[0,1].set_xlabel('Date')
axes[0,1].set_ylabel('Monthly Sales ($)')
axes[0,1].grid(True, alpha=0.3)

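# Average monthly revenue across years (seasonal pattern).
# Seasonality here lives in transaction counts, so the mean per transaction would hide it.
monthly_avg = df.groupby(['Year', 'Month'])['Sales'].sum().groupby(level='Month').mean()
axes[1,0].bar(monthly_avg.index, monthly_avg.values, color='lightgreen')
axes[1,0].set_title('Average Monthly Sales (Seasonal Pattern)')
axes[1,0].set_xlabel('Month')
axes[1,0].set_ylabel('Average Monthly Sales ($)')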
axes[1,0].set_xticks(range(1, 13))
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
axes[1,0].set_xticklabels(month_names)

# Sales by quarter
quarterly_sales = df.groupby('Quarter')['Sales'].sum()
axes[1,1].pie(quarterly_sales.values, labels=[f'Q{i}' for i in quarterly_sales.index], 
              autopct='%1.1f%%', colors=['lightcoral', 'lightblue', 'lightgreen', 'lightyellow'])
axes[1,1].set_title('Sales Distribution by Quarter')

plt.tight_layout()
plt.show()

# Time series insights
print("📊 TIME SERIES ANALYSIS INSIGHTS")
print("=" * 40)
print(f"Highest daily sales: ${daily_sales['Daily_Sales'].max():,.2f} on {daily_sales.loc[daily_sales['Daily_Sales'].idxmax(), 'Date'].strftime('%Y-%m-%d')}")
print(f"Lowest daily sales: ${daily_sales['Daily_Sales'].min():,.2f} on {daily_sales.loc[daily_sales['Daily_Sales'].idxmin(), 'Date'].strftime('%Y-%m-%d')}")
print(f"Average daily sales: ${daily_sales['Daily_Sales'].mean():,.2f}")
print(f"Best performing month: {month_names[monthly_avg.idxmax()-1]} (${monthly_avg.max():,.2f})")
print(f"Worst performing month: {month_names[monthly_avg.idxmin()-1]} (${monthly_avg.min():,.2f})")
print(f"Best performing quarter: Q{quarterly_sales.idxmax()} (${quarterly_sales.max():,.2f})")
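
The overview lists seasonal decomposition among the techniques, but the cells above only aggregate and plot. The sketch below shows a classical additive decomposition built from a centered moving average, using pandas and NumPy only. It runs on a noise-free toy series rather than `daily_sales`, and the weekly pattern and trend values are assumptions chosen so the result is easy to check.

```python
import numpy as np
import pandas as pd

# Noise-free toy series: linear trend plus a weekly pattern that sums to zero
dates = pd.date_range('2022-01-03', periods=12 * 7, freq='D')  # starts on a Monday
weekly_pattern = np.array([3.0, 2.0, 1.0, 0.0, 1.0, -3.0, -4.0])  # Mon..Sun
weekday_codes = dates.dayofweek.to_numpy()
trend_true = 100 + 0.5 * np.arange(len(dates))
series = pd.Series(trend_true + weekly_pattern[weekday_codes], index=dates)

# 1. Trend: centered 7-day moving average (one full weekly cycle)
trend = series.rolling(window=7, center=True).mean()

# 2. Seasonal: mean detrended value per weekday, re-centered to sum to zero
detrended = series - trend
seasonal_by_weekday = detrended.groupby(detrended.index.dayofweek).mean()
seasonal_by_weekday -= seasonal_by_weekday.mean()
seasonal = pd.Series(seasonal_by_weekday.to_numpy()[weekday_codes], index=series.index)

# 3. Residual: what trend and seasonality leave unexplained
residual = series - trend - seasonal

print(seasonal_by_weekday.round(2))
```

The moving average leaves three undefined values at each end of the series. For real data, `statsmodels.tsa.seasonal.seasonal_decompose` offers a packaged version of the same idea, and a yearly cycle would need a much longer window than the weekly one used here.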


## 📊 Step 8: Multivariate Analysis


In [None]:
# Advanced Multivariate Analysis
fig, axes = plt.subplots(2, 2, figsize=(18, 12))

# Sales by Category and Region (Heatmap)
category_region_sales = df.groupby(['Category', 'Region'])['Sales'].sum().unstack()
sns.heatmap(category_region_sales, annot=True, fmt='.0f', cmap='YlOrRd', ax=axes[0,0])
axes[0,0].set_title('Sales Heatmap: Category vs Region')

# Sales by Product and Month
product_month_sales = df.groupby(['Product', 'Month'])['Sales'].sum().unstack()
sns.heatmap(product_month_sales, annot=False, cmap='viridis', ax=axes[0,1])
axes[0,1].set_title('Sales Heatmap: Product vs Month')
axes[0,1].set_xticklabels(month_names, rotation=45)

# Average Sales by Category and Weekday
category_weekday_sales = df.groupby(['Category', 'Weekday'])['Sales'].mean().unstack()
sns.heatmap(category_weekday_sales, annot=True, fmt='.0f', cmap='Blues', ax=axes[1,0])
axes[1,0].set_title('Average Sales: Category vs Weekday')
axes[1,0].set_xticklabels(weekday_names)

# Sales distribution by Region and Category
sns.violinplot(data=df, x='Region', y='Sales', hue='Category', ax=axes[1,1])
axes[1,1].set_title('Sales Distribution: Region vs Category')
axes[1,1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Multivariate insights
print("📊 MULTIVARIATE ANALYSIS INSIGHTS")
print("=" * 45)

# Top performing combinations
print("\n🏆 TOP PERFORMING COMBINATIONS:")
print("-" * 35)

# Category-Region combination
cat_reg_comb = df.groupby(['Category', 'Region'])['Sales'].sum().sort_values(ascending=False)
print("Top Category-Region combinations:")
for i, (combo, sales) in enumerate(cat_reg_comb.head(3).items()):
    print(f"  {i+1}. {combo[0]} in {combo[1]}: ${sales:,.2f}")

# Product-Month combination
prod_month_comb = df.groupby(['Product', 'Month'])['Sales'].sum().sort_values(ascending=False)
print("\nTop Product-Month combinations:")
for i, (combo, sales) in enumerate(prod_month_comb.head(3).items()):
    print(f"  {i+1}. {combo[0]} in {month_names[combo[1]-1]}: ${sales:,.2f}")

# Category-Weekday combination
cat_weekday_comb = df.groupby(['Category', 'Weekday'])['Sales'].mean().sort_values(ascending=False)
print("\nTop Category-Weekday combinations (by average):")
for i, (combo, sales) in enumerate(cat_weekday_comb.head(3).items()):
    print(f"  {i+1}. {combo[0]} on {weekday_names[combo[1]]}: ${sales:.2f}")
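
The cross-tabulations above are built with `groupby(...).unstack()`. An alternative sketch uses `pd.pivot_table`, which can also append row and column totals through `margins=True`. The frame `demo` and the label `Total` are assumptions made for this example.

```python
import pandas as pd

demo = pd.DataFrame({
    'Category': ['Electronics', 'Electronics', 'Fashion', 'Fashion', 'Electronics'],
    'Region': ['North', 'South', 'North', 'South', 'North'],
    'Sales': [1200.0, 800.0, 120.0, 150.0, 600.0],
})

table = pd.pivot_table(
    demo,
    values='Sales',
    index='Category',
    columns='Region',
    aggfunc='sum',
    margins=True,          # append row and column totals
    margins_name='Total',
)
print(table)
```

The totals make it easy to read each cell as a share of its row or column, which the heatmap alone does not show.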


In [None]:
# Customer Analysis
print("👥 CUSTOMER ANALYSIS")
print("=" * 25)

# Top customers by total sales
top_customers = df.groupby('Customer_ID').agg({
    'Sales': 'sum',
    'Quantity': 'sum',
    'Date': 'count'
}).sort_values('Sales', ascending=False)

top_customers.columns = ['Total_Sales', 'Total_Quantity', 'Transaction_Count']
print("Top 10 customers by total sales:")
print(top_customers.head(10).round(2))

# Customer segments based on purchase behavior
customer_stats = df.groupby('Customer_ID').agg({
    'Sales': ['sum', 'mean', 'count'],
    'Quantity': 'sum',
    'Product': 'nunique'
}).round(2)

customer_stats.columns = ['Total_Sales', 'Avg_Sales', 'Transaction_Count', 'Total_Quantity', 'Unique_Products']

# Create customer segments
customer_stats['Customer_Segment'] = pd.cut(customer_stats['Total_Sales'], 
                                           bins=[0, 1000, 5000, 10000, float('inf')],
                                           labels=['Low Value', 'Medium Value', 'High Value', 'VIP'])

segment_analysis = customer_stats.groupby('Customer_Segment', observed=False).agg({
    'Total_Sales': ['count', 'sum', 'mean'],
    'Transaction_Count': 'mean',
    'Unique_Products': 'mean'
}).round(2)

segment_analysis.columns = ['Customer_Count', 'Total_Sales', 'Avg_Sales', 'Avg_Transactions', 'Avg_Unique_Products']

print("\n📊 CUSTOMER SEGMENTATION:")
print("-" * 30)
print(segment_analysis)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Customer segment distribution
# Sort by segment order so the colors match the bar chart on the right
segment_counts = customer_stats['Customer_Segment'].value_counts().sort_index()
axes[0].pie(segment_counts.values, labels=segment_counts.index, autopct='%1.1f%%',
            colors=['lightcoral', 'lightblue', 'lightgreen', 'gold'])
axes[0].set_title('Customer Segment Distribution')

# Average sales by segment
segment_avg_sales = customer_stats.groupby('Customer_Segment', observed=False)['Total_Sales'].mean()
axes[1].bar(range(len(segment_avg_sales)), segment_avg_sales.values, 
            color=['lightcoral', 'lightblue', 'lightgreen', 'gold'])
axes[1].set_title('Average Sales by Customer Segment')
axes[1].set_xlabel('Customer Segment')
axes[1].set_ylabel('Average Sales ($)')
axes[1].set_xticks(range(len(segment_avg_sales)))
axes[1].set_xticklabels(segment_avg_sales.index)

plt.tight_layout()
plt.show()
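
The segmentation above bins customers by total spend alone. A common complement is an RFM table (recency, frequency, monetary value) per customer. The sketch below builds one on a hand-made stand-in frame. The column names `Recency_Days`, `Frequency`, and `Monetary` and the choice of snapshot date are assumptions introduced here.

```python
import pandas as pd

demo = pd.DataFrame({
    'Customer_ID': ['CUST_1001', 'CUST_1001', 'CUST_2002',
                    'CUST_3003', 'CUST_3003', 'CUST_3003'],
    'Date': pd.to_datetime(['2023-01-05', '2023-12-20', '2022-03-10',
                            '2023-06-01', '2023-09-15', '2023-12-30']),
    'Sales': [1200.0, 80.0, 500.0, 30.0, 150.0, 2000.0],
})

# Reference date: the day after the last transaction in the data
snapshot = demo['Date'].max() + pd.Timedelta(days=1)

by_customer = demo.groupby('Customer_ID')
last_purchase = by_customer['Date'].max()

rfm = pd.DataFrame({
    'Recency_Days': (snapshot - last_purchase).dt.days,  # days since last purchase
    'Frequency': by_customer.size(),                     # number of transactions
    'Monetary': by_customer['Sales'].sum(),              # total spend
})
print(rfm)
```

Each of the three columns could then be binned, for example with `pd.qcut`, to form segments that separate recent heavy buyers from lapsed ones.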


## 📊 Step 9: Insights & Conclusions


In [None]:
# Summarize insights and conclusions
print("🎯 SALES DATASET EDA - INSIGHTS & CONCLUSIONS")
print("=" * 55)

print("\n📊 KEY FINDINGS:")
print("-" * 20)

# 1. Dataset Overview
print("1. DATASET OVERVIEW:")
print(f"   • Total transactions: {len(df):,}")
print(f"   • Total revenue: ${df['Sales'].sum():,.2f}")
print(f"   • Date range: {df['Date'].min().strftime('%Y-%m-%d')} to {df['Date'].max().strftime('%Y-%m-%d')}")
print(f"   • Unique customers: {df['Customer_ID'].nunique():,}")
print(f"   • Products: {df['Product'].nunique()}")
print(f"   • Categories: {df['Category'].nunique()}")
print(f"   • Regions: {df['Region'].nunique()}")

# 2. Sales Performance
print("\n2. SALES PERFORMANCE:")
best_product = df.groupby('Product')['Sales'].sum().idxmax()
best_category = df.groupby('Category')['Sales'].sum().idxmax()
best_region = df.groupby('Region')['Sales'].sum().idxmax()
best_month = df.groupby('Month')['Sales'].mean().idxmax()

print(f"   • Best performing product: {best_product}")
print(f"   • Best performing category: {best_category}")
print(f"   • Best performing region: {best_region}")
print(f"   • Best performing month: {month_names[best_month-1]}")

# 3. Customer Insights
print("\n3. CUSTOMER INSIGHTS:")
avg_transaction_value = df['Sales'].mean()
avg_customer_value = df.groupby('Customer_ID')['Sales'].sum().mean()
repeat_customers = (df.groupby('Customer_ID').size() > 1).sum()

print(f"   • Average transaction value: ${avg_transaction_value:.2f}")
print(f"   • Average total spend per customer (2022-2023): ${avg_customer_value:.2f}")
print(f"   • Repeat customers: {repeat_customers:,} ({repeat_customers/df['Customer_ID'].nunique()*100:.1f}%)")

# 4. Seasonal Patterns
print("\n4. SEASONAL PATTERNS:")
monthly_variance = df.groupby('Month')['Sales'].mean().std()
weekday_variance = df.groupby('Weekday')['Sales'].mean().std()

print(f"   • Std. dev. of mean transaction value across months: ${monthly_variance:.2f}")
print(f"   • Std. dev. of mean transaction value across weekdays: ${weekday_variance:.2f}")
print(f"   • Peak season: {month_names[df.groupby('Month')['Sales'].sum().idxmax()-1]}")
print(f"   • Busiest day: {weekday_names[df.groupby('Weekday')['Sales'].sum().idxmax()]}")

# 5. Business Recommendations
print("\n5. BUSINESS RECOMMENDATIONS:")
print(f"   • Focus marketing efforts on the {best_category} category (highest revenue)")
print("   • Expand operations in best-performing regions")
print("   • Develop seasonal strategies for peak months")
print("   • Implement customer retention programs for repeat buyers")
print("   • Consider pricing strategies for high-value products")

# 6. Data Quality
print("\n6. DATA QUALITY:")
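# Compute the checks instead of asserting the results in fixed text
n_missing = int(df.isnull().sum().sum())
n_duplicates = int(df.duplicated().sum())
n_days_with_sales = df['Date'].nunique()
n_days_in_range = (df['Date'].max() - df['Date'].min()).days + 1
valid_ids = df['Customer_ID'].str.fullmatch(r'CUST_\d{4}').all()
positive_values = ((df['Price'] > 0) & (df['Quantity'] > 0)).all()
print(f"   • Missing values: {n_missing:,}")
print(f"   • Duplicate rows: {n_duplicates:,}")
print(f"   • Days with transactions: {n_days_with_sales:,} of {n_days_in_range:,}")
print(f"   • All customer IDs match CUST_####: {valid_ids}")
print(f"   • All prices and quantities positive: {positive_values}")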

print("\n✅ EDA COMPLETED SUCCESSFULLY!")
print("📈 Ready for advanced analytics and machine learning models!")


## 🎯 Next Steps

After completing the EDA for the Sales Dataset, you can move on to:

### 📈 Advanced Analytics
- **Time Series Forecasting**: Forecast future revenue
- **Customer Lifetime Value**: Compute CLV for each customer
- **Market Basket Analysis**: Analyze baskets and cross-selling opportunities
- **Churn Prediction**: Predict which customers are likely to leave
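
As a starting point for the forecasting item above, a seasonal-naive baseline is often worth having before any model is trained: each future day is predicted with the value observed one period earlier. The sketch below runs on a toy weekly series. The helper `seasonal_naive_forecast`, the values, and the 7-day period are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy daily revenue with a repeating weekly shape (stand-in for daily sales)
dates = pd.date_range('2023-11-06', periods=28, freq='D')
weekly_shape = np.array([50.0, 52.0, 51.0, 53.0, 55.0, 36.0, 34.0])
history = pd.Series(np.tile(weekly_shape, 4), index=dates)


def seasonal_naive_forecast(series: pd.Series, horizon: int, period: int = 7) -> pd.Series:
    """Forecast each future day with the value seen one period earlier (hypothetical helper)."""
    last_cycle = series.to_numpy()[-period:]
    values = np.resize(last_cycle, horizon)  # repeat the last cycle as often as needed
    future_index = pd.date_range(
        series.index[-1] + pd.Timedelta(days=1), periods=horizon, freq='D'
    )
    return pd.Series(values, index=future_index)


forecast = seasonal_naive_forecast(history, horizon=10)
print(forecast)
```

Any more elaborate forecasting model can then be judged by whether it beats this baseline on held-out days.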

### 🤖 Machine Learning Models
- **Sales Prediction Model**: Predict revenue over time
- **Customer Segmentation**: Group customers with clustering
- **Price Optimization**: Optimize product pricing
- **Demand Forecasting**: Forecast product demand

### 📊 Business Intelligence
- **Dashboard Creation**: Build real-time dashboards
- **KPI Monitoring**: Track key business metrics
- **A/B Testing**: Test marketing strategies
- **ROI Analysis**: Analyze return on investment

### 🔄 Data Pipeline
- **Automated EDA**: Automate the EDA process
- **Data Validation**: Check data quality
- **Real-time Processing**: Process data in real time
- **Data Warehousing**: Store and manage data
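
For the data validation item above, a minimal sketch of a reusable check is shown below. It returns a list of problems instead of raising, so it can run at the start of a pipeline. The helper `validate_sales_frame`, the chosen rules, and the rounding tolerance are assumptions. The rules follow how this notebook generates its data, where discounts only lower `Sales`.

```python
import pandas as pd

REQUIRED_COLUMNS = ['Date', 'Product', 'Customer_ID', 'Price', 'Quantity', 'Sales']


def validate_sales_frame(frame: pd.DataFrame) -> list:
    """Return human-readable problems; an empty list means every check passed."""
    missing_cols = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing_cols:
        return [f"missing columns: {missing_cols}"]

    problems = []
    if frame[REQUIRED_COLUMNS].isnull().any().any():
        problems.append("null values present")
    if (frame['Price'] <= 0).any():
        problems.append("non-positive prices")
    if (frame['Quantity'] < 1).any():
        problems.append("quantities below 1")
    # Allow one cent per unit for the rounding of Price and Sales
    list_value = frame['Price'] * frame['Quantity']
    if (frame['Sales'] > list_value + 0.01 * frame['Quantity']).any():
        problems.append("sales above list value")
    return problems


good = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01']), 'Product': ['iPad Air'],
    'Customer_ID': ['CUST_1234'], 'Price': [100.0], 'Quantity': [2], 'Sales': [180.0],
})
bad = good.assign(Price=[-5.0], Sales=[10.0])

print(validate_sales_frame(good))
print(validate_sales_frame(bad))
```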

---

**🎉 Congratulations! You have completed the EDA for the Sales Dataset!**

*Continue with the next dataset or apply these insights to real projects!*
