# 📊 Sales Dataset EDA - Sales Data Analysis

## 🎯 Objectives
Exploratory analysis of the sales data to understand:
- Sales trends over time
- Performance by product, region, and customer
- Seasonal patterns and trends
- Customer behavior

## 📋 Dataset Overview
- **Source**: Synthetic sales data
- **Period**: 2 years (2022-2023)
- **Features**: Date, Product, Category, Region, Customer_ID, Price, Quantity, Sales, plus the derived Year, Month, Day, Weekday, and Quarter
- **Goal**: Time series analysis, seasonal patterns, customer segmentation

## 🔍 Techniques Used
- Time series analysis
- Seasonal decomposition
- Customer segmentation
- Product performance analysis
- Geographic analysis
- Trend analysis


In [None]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Set the plot style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print("✅ All libraries imported successfully!")
print("📊 Ready to start analyzing the Sales Dataset!")


## 📊 Step 1: Create the Synthetic Sales Dataset

Generate a synthetic sales dataset with realistic characteristics:
- 2 years of daily data (2022-2023)
- 10 products across 2 categories (Electronics and Fashion)
- 4 sales regions
- 1,000+ customers (IDs drawn at random from CUST_1000 to CUST_9998)
- A yearly seasonal cycle, a weekend dip, and a November-December holiday boost


In [2]:
# Create the synthetic sales dataset
import numpy as np
import pandas as pd  # imported here too so this cell also runs on its own

np.random.seed(42)

# Basic settings
start_date = '2022-01-01'
end_date = '2023-12-31'
date_range = pd.date_range(start=start_date, end=end_date, freq='D')

# Products and their categories
products = {
    'Laptop Gaming': 'Electronics',
    'iPhone 14': 'Electronics',
    'Nike Air Max': 'Fashion',
    'Adidas Ultraboost': 'Fashion',
    'MacBook Pro': 'Electronics',
    'Samsung Galaxy': 'Electronics',
    'Levi\'s Jeans': 'Fashion',
    'Zara Jacket': 'Fashion',
    'iPad Air': 'Electronics',
    'Nike T-Shirt': 'Fashion'
}

# Base price per product; defined once here instead of on every loop iteration
base_prices = {
    'Laptop Gaming': 1200, 'iPhone 14': 800, 'Nike Air Max': 120,
    'Adidas Ultraboost': 150, 'MacBook Pro': 2000, 'Samsung Galaxy': 600,
    'Levi\'s Jeans': 80, 'Zara Jacket': 120, 'iPad Air': 500, 'Nike T-Shirt': 30
}

# Regions
regions = ['North', 'South', 'East', 'West']

# Generate the data
sales_data = []

for date in date_range:
    # Number of transactions for the day (follows a seasonal pattern)
    day_of_year = date.timetuple().tm_yday
    seasonal_factor = 1 + 0.3 * np.sin(2 * np.pi * day_of_year / 365)  # Seasonal pattern
    weekend_factor = 0.7 if date.weekday() >= 5 else 1.0  # Weekend effect
    holiday_factor = 1.5 if date.month in [11, 12] else 1.0  # Holiday season

    num_transactions = int(np.random.poisson(50 * seasonal_factor * weekend_factor * holiday_factor))

    for _ in range(num_transactions):
        product = np.random.choice(list(products.keys()))
        category = products[product]
        region = np.random.choice(regions)
        customer_id = f"CUST_{np.random.randint(1000, 9999)}"

        # Unit price: base price with random variation per transaction
        base_price = base_prices[product]
        price_variation = np.random.normal(1, 0.1)  # standard deviation of 10%
        price = base_price * price_variation

        # Quantity (usually 1-3 items)
        quantity = np.random.choice([1, 2, 3], p=[0.7, 0.2, 0.1])

        # Total sales
        sales = price * quantity

        # Discount (10% chance); applied to Sales only, so Price stays the undiscounted unit price
        if np.random.random() < 0.1:
            discount = np.random.uniform(0.05, 0.25)
            sales *= (1 - discount)

        sales_data.append({
            'Date': date,
            'Product': product,
            'Category': category,
            'Region': region,
            'Customer_ID': customer_id,
            'Price': round(price, 2),
            'Quantity': quantity,
            'Sales': round(sales, 2),
            'Year': date.year,
            'Month': date.month,
            'Day': date.day,
            'Weekday': date.weekday(),
            'Quarter': date.quarter
        })

# Build the DataFrame
df = pd.DataFrame(sales_data)

print("✅ Sales dataset created successfully!")
print(f"📊 Dataset shape: {df.shape}")
print(f"📅 Period: {df['Date'].min()} to {df['Date'].max()}")
print(f"💰 Total revenue: ${df['Sales'].sum():,.2f}")
print(f"🛍️ Total transactions: {len(df):,}")


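
To make the generator's demand model concrete, the sketch below recomputes the expected number of transactions for a single date using the same three multipliers as the loop above. The helper name `expected_transactions` is hypothetical and introduced only for illustration; the constants (base rate 50, amplitude 0.3, weekend 0.7, holiday 1.5) are taken from the cell.

```python
import math
from datetime import date

def expected_transactions(d: date, base_rate: float = 50.0) -> float:
    """Mean of the Poisson draw used for day `d` (hypothetical helper)."""
    day_of_year = d.timetuple().tm_yday
    seasonal = 1 + 0.3 * math.sin(2 * math.pi * day_of_year / 365)  # yearly cycle
    weekend = 0.7 if d.weekday() >= 5 else 1.0                      # Saturday/Sunday dip
    holiday = 1.5 if d.month in (11, 12) else 1.0                   # November-December boost
    return base_rate * seasonal * weekend * holiday

for d in (date(2022, 1, 1), date(2022, 4, 1), date(2022, 12, 1)):
    print(d.isoformat(), round(expected_transactions(d), 1))
```

The sine term peaks around day 91 (early April) and is negative in November and December, so the holiday boost and the seasonal cycle partly offset each other late in the year.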

## 📊 Step 2: Data Loading & Overview


In [None]:
# Basic information about the dataset
print("🔍 BASIC DATASET INFORMATION")
print("=" * 50)
print(f"📊 Shape: {df.shape}")
print(f"💾 Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"📅 Date range: {df['Date'].min()} to {df['Date'].max()}")
print(f"📈 Total days: {(df['Date'].max() - df['Date'].min()).days + 1}")

print("\n📋 COLUMNS INFO:")
print("=" * 30)
df.info()  # info() prints directly and returns None, so it is not wrapped in print()

print("\n🔢 DATA TYPES:")
print("=" * 20)
print(df.dtypes)

print("\n📊 SAMPLE DATA (first 5 rows):")
print("=" * 35)
df.head()
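
The memory figure printed above is largely driven by the string columns. One option, sketched here on a small stand-in frame rather than on `df` itself, is to store low-cardinality columns as the pandas `category` dtype. The column names mirror the dataset; the frame `demo` and the helper `to_categories` are hypothetical.

```python
import pandas as pd

def to_categories(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy with the given columns stored as `category` dtype."""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out

# Small stand-in for the sales data (assumption: same column names as `df`)
demo = pd.DataFrame({
    "Product": ["iPhone 14", "MacBook Pro", "Nike Air Max", "iPad Air"] * 250,
    "Region": ["North", "South", "East", "West"] * 250,
    "Sales": range(1000),
})

before = demo.memory_usage(deep=True).sum()
after = to_categories(demo, ["Product", "Region"]).memory_usage(deep=True).sum()
print(f"before: {before:,} bytes, after: {after:,} bytes")
```

`Customer_ID` has thousands of distinct values, so the saving there is likely to be smaller than for `Product`, `Category`, and `Region`.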


In [None]:
# Check for missing values
print("🔍 MISSING VALUES ANALYSIS")
print("=" * 40)
missing_data = df.isnull().sum()
missing_percent = (missing_data / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percentage': missing_percent
})

print(missing_df[missing_df['Missing Count'] > 0])

if missing_df['Missing Count'].sum() == 0:
    print("✅ No missing values in the dataset!")

# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"\n🔄 Duplicate rows: {duplicates}")

# Unique values for the categorical columns
print("\n📊 UNIQUE VALUES:")
print("=" * 25)
categorical_cols = ['Product', 'Category', 'Region', 'Customer_ID']
for col in categorical_cols:
    unique_count = df[col].nunique()
    print(f"{col}: {unique_count} unique values")
    
    if col in ['Product', 'Category', 'Region']:
        print(f"  Values: {list(df[col].unique())}")
    print()
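
Beyond counting unique values, it can be worth confirming that the categorical columns are internally consistent, for example that each product always appears under the same category. A minimal sketch on a toy frame (the helper name `inconsistent_products` is hypothetical):

```python
import pandas as pd

def inconsistent_products(frame: pd.DataFrame) -> list[str]:
    """Products that appear under more than one category."""
    per_product = frame.groupby("Product")["Category"].nunique()
    return sorted(per_product[per_product > 1].index)

toy = pd.DataFrame({
    "Product": ["iPad Air", "iPad Air", "Zara Jacket", "Zara Jacket"],
    "Category": ["Electronics", "Electronics", "Fashion", "Electronics"],
})
print(inconsistent_products(toy))
```

On the generated `df` this should return an empty list, since every category is looked up from the `products` dictionary.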


## 📊 Step 3: Data Profiling


In [None]:
# Statistical summary for the numerical columns
print("📊 STATISTICAL SUMMARY")
print("=" * 30)
numerical_cols = ['Price', 'Quantity', 'Sales']
print(df[numerical_cols].describe().round(2))

# A few additional statistics
print("\n📈 ADDITIONAL STATISTICS")
print("=" * 35)
for col in numerical_cols:
    unit = "" if col == "Quantity" else "$"  # Quantity is a unit count, not a dollar amount
    print(f"\n{col}:")
    print(f"  Mean: {unit}{df[col].mean():.2f}")
    print(f"  Median: {unit}{df[col].median():.2f}")
    print(f"  Std: {unit}{df[col].std():.2f}")
    print(f"  Min: {unit}{df[col].min():.2f}")
    print(f"  Max: {unit}{df[col].max():.2f}")
    print(f"  Skewness: {df[col].skew():.3f}")
    print(f"  Kurtosis: {df[col].kurtosis():.3f}")


In [None]:
# Analyze the categorical features
print("📊 CATEGORICAL FEATURES ANALYSIS")
print("=" * 40)

# Product analysis
print("\n🛍️ PRODUCT ANALYSIS:")
print("-" * 25)
product_stats = df.groupby('Product').agg({
    'Sales': ['count', 'sum', 'mean'],
    'Quantity': 'sum',
    'Price': 'mean'
}).round(2)
product_stats.columns = ['Transactions', 'Total_Sales', 'Avg_Sales', 'Total_Quantity', 'Avg_Price']
product_stats = product_stats.sort_values('Total_Sales', ascending=False)
print(product_stats)

# Category analysis
print("\n📂 CATEGORY ANALYSIS:")
print("-" * 25)
category_stats = df.groupby('Category').agg({
    'Sales': ['count', 'sum', 'mean'],
    'Quantity': 'sum',
    'Price': 'mean'
}).round(2)
category_stats.columns = ['Transactions', 'Total_Sales', 'Avg_Sales', 'Total_Quantity', 'Avg_Price']
category_stats = category_stats.sort_values('Total_Sales', ascending=False)
print(category_stats)

# Region analysis
print("\n🌍 REGION ANALYSIS:")
print("-" * 20)
region_stats = df.groupby('Region').agg({
    'Sales': ['count', 'sum', 'mean'],
    'Quantity': 'sum',
    'Price': 'mean'
}).round(2)
region_stats.columns = ['Transactions', 'Total_Sales', 'Avg_Sales', 'Total_Quantity', 'Avg_Price']
region_stats = region_stats.sort_values('Total_Sales', ascending=False)
print(region_stats)
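
The three summaries above apply the same aggregation to different keys. As an alternative sketch, pandas named aggregation lets a single helper produce the table without renaming the columns afterwards. `summarize_by` is a hypothetical name, and the toy frame stands in for `df`.

```python
import pandas as pd

def summarize_by(frame: pd.DataFrame, key: str) -> pd.DataFrame:
    """Per-group transaction count, revenue, quantity, and mean price."""
    return (
        frame.groupby(key)
        .agg(
            Transactions=("Sales", "count"),
            Total_Sales=("Sales", "sum"),
            Avg_Sales=("Sales", "mean"),
            Total_Quantity=("Quantity", "sum"),
            Avg_Price=("Price", "mean"),
        )
        .round(2)
        .sort_values("Total_Sales", ascending=False)
    )

toy = pd.DataFrame({
    "Region": ["North", "North", "South"],
    "Price": [100.0, 50.0, 80.0],
    "Quantity": [1, 2, 1],
    "Sales": [100.0, 100.0, 80.0],
})
print(summarize_by(toy, "Region"))
```

With the real data, `summarize_by(df, 'Product')`, `summarize_by(df, 'Category')`, and `summarize_by(df, 'Region')` would be expected to reproduce the three tables.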


## 📊 Step 4: Missing Value Analysis

Because this dataset is synthetic, it contains no missing values, but we still run the check and plot the result to show the workflow.


In [None]:
# Missing value visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Missing value heatmap
missing_data = df.isnull().sum()
if missing_data.sum() > 0:
    sns.heatmap(df.isnull(), yticklabels=False, cbar=True, cmap='viridis', ax=axes[0])
    axes[0].set_title('Missing Values Heatmap')
else:
    axes[0].text(0.5, 0.5, '✅ No Missing Values', ha='center', va='center',
                transform=axes[0].transAxes, fontsize=16, color='green')
    axes[0].set_title('Missing Values Status')

# Data completeness
completeness = (1 - missing_data / len(df)) * 100
bars = axes[1].bar(range(len(completeness)), completeness.values, color='lightgreen')
axes[1].set_title('Data Completeness by Column')
axes[1].set_xlabel('Columns')
axes[1].set_ylabel('Completeness (%)')
axes[1].set_xticks(range(len(completeness)))
axes[1].set_xticklabels(completeness.index, rotation=45)
axes[1].set_ylim(0, 105)

# Add percentage labels on bars
for i, bar in enumerate(bars):
    height = bar.get_height()
    axes[1].text(bar.get_x() + bar.get_width()/2., height + 1,
                f'{height:.1f}%', ha='center', va='bottom')

plt.tight_layout()
plt.show()

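# Report the actual result instead of assuming the data is clean
if missing_data.sum() == 0:
    print("✅ The dataset is fully clean: no missing values!")
else:
    print(f"Found {int(missing_data.sum())} missing values in the dataset.")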
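
Since the synthetic data has no gaps, the heatmap branch above never runs. To exercise it, one could blank out a share of the values in a copy of the data. This is a sketch only: `inject_missing` is a hypothetical helper, the toy frame stands in for `df`, and the 5% rate is arbitrary.

```python
import numpy as np
import pandas as pd

def inject_missing(frame: pd.DataFrame, column: str, frac: float, seed: int = 0) -> pd.DataFrame:
    """Return a copy of `frame` with a fraction `frac` of `column` set to NaN."""
    rng = np.random.default_rng(seed)
    out = frame.copy()
    n_missing = round(len(out) * frac)
    # Pick distinct row positions so the number of gaps is exact
    rows = rng.choice(len(out), size=n_missing, replace=False)
    out.loc[out.index[rows], column] = np.nan
    return out

toy = pd.DataFrame({"Price": np.linspace(10, 100, 200), "Quantity": 1})
with_gaps = inject_missing(toy, "Price", frac=0.05)
print(with_gaps.isnull().sum())
```

Passing `with_gaps` (or a similarly modified copy of `df`) through the plotting code would show how the heatmap and completeness bars respond to real gaps.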


## 📊 Step 5: Univariate Analysis


In [None]:
# Univariate Analysis - Numerical Features
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

numerical_cols = ['Price', 'Quantity', 'Sales']

# Histograms
for i, col in enumerate(numerical_cols):
    axes[i].hist(df[col], bins=50, alpha=0.7, color='skyblue', edgecolor='black')
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
    axes[i].grid(True, alpha=0.3)

# Box plots
for i, col in enumerate(numerical_cols):
    axes[i+3].boxplot(df[col], patch_artist=True, 
                     boxprops=dict(facecolor='lightcoral', alpha=0.7))
    axes[i+3].set_title(f'Box Plot of {col}')
    axes[i+3].set_ylabel(col)
    axes[i+3].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Statistical insights
print("📊 UNIVARIATE ANALYSIS INSIGHTS")
print("=" * 40)
for col in numerical_cols:
    unit = "" if col == "Quantity" else "$"  # Quantity is a unit count, not a dollar amount
    skew = df[col].skew()
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles
    n_outliers = int(((df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)).sum())
    shape = 'Right-skewed' if skew > 0.5 else 'Left-skewed' if skew < -0.5 else 'Approximately symmetric'
    print(f"\n{col}:")
    print(f"  Distribution: {shape}")
    print(f"  Outliers: {n_outliers} values")
    print(f"  Range: {unit}{df[col].min():.2f} - {unit}{df[col].max():.2f}")
    print(f"  IQR: {unit}{iqr:.2f}")
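
The outlier counts above rely on Tukey's 1.5 * IQR rule. A small reusable version, shown on a toy series (the function name `iqr_outlier_mask` is hypothetical):

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value lies more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

toy = pd.Series([1, 2, 3, 4, 5, 100])
mask = iqr_outlier_mask(toy)
print(toy[mask].tolist())  # values flagged as outliers
```

Because the dataset pools low-priced fashion items with high-priced electronics, applying the mask per product (for example inside a `groupby('Product')`) may be more informative than applying it to the whole `Price` or `Sales` column.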


In [None]:
# Univariate Analysis - Categorical Features
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Product distribution
product_counts = df['Product'].value_counts()
axes[0,0].bar(range(len(product_counts)), product_counts.values, color='lightblue')
axes[0,0].set_title('Product Distribution')
axes[0,0].set_xlabel('Products')
axes[0,0].set_ylabel('Number of Transactions')
axes[0,0].set_xticks(range(len(product_counts)))
axes[0,0].set_xticklabels(product_counts.index, rotation=45, ha='right')

# Category distribution
category_counts = df['Category'].value_counts()
axes[0,1].pie(category_counts.values, labels=category_counts.index, autopct='%1.1f%%', 
              colors=['lightcoral', 'lightgreen', 'lightblue'])
axes[0,1].set_title('Category Distribution')

# Region distribution
region_counts = df['Region'].value_counts()
axes[1,0].bar(range(len(region_counts)), region_counts.values, color='lightgreen')
axes[1,0].set_title('Region Distribution')
axes[1,0].set_xlabel('Regions')
axes[1,0].set_ylabel('Number of Transactions')
axes[1,0].set_xticks(range(len(region_counts)))
axes[1,0].set_xticklabels(region_counts.index)

# Weekday distribution
weekday_counts = df['Weekday'].value_counts().sort_index()
weekday_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
axes[1,1].bar(range(len(weekday_counts)), weekday_counts.values, color='lightcoral')
axes[1,1].set_title('Weekday Distribution')
axes[1,1].set_xlabel('Day of Week')
axes[1,1].set_ylabel('Number of Transactions')
axes[1,1].set_xticks(range(len(weekday_counts)))
axes[1,1].set_xticklabels(weekday_names)

plt.tight_layout()
plt.show()

# Categorical insights
print("\n📊 CATEGORICAL ANALYSIS INSIGHTS")
print("=" * 40)
print(f"Most popular product: {product_counts.index[0]} ({product_counts.iloc[0]} transactions)")
print(f"Most popular category: {category_counts.index[0]} ({category_counts.iloc[0]} transactions)")
print(f"Most active region: {region_counts.index[0]} ({region_counts.iloc[0]} transactions)")
# weekday_counts is sorted by weekday, so look up the maximum instead of taking the first entry
print(f"Busiest day: {weekday_names[weekday_counts.idxmax()]} ({weekday_counts.max()} transactions)")


## 📊 Step 6: Bivariate Analysis


In [None]:
# Correlation Analysis
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Correlation matrix
numerical_cols = ['Price', 'Quantity', 'Sales']
correlation_matrix = df[numerical_cols].corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, ax=axes[0])
axes[0].set_title('Correlation Matrix - Numerical Features')

# Scatter plots
axes[1].scatter(df['Price'], df['Sales'], alpha=0.5, color='blue')
axes[1].set_xlabel('Price')
axes[1].set_ylabel('Sales')
axes[1].set_title('Price vs Sales')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Correlation insights
print("📊 CORRELATION ANALYSIS")
print("=" * 30)
print("Correlation coefficients:")
for i in range(len(numerical_cols)):
    for j in range(i+1, len(numerical_cols)):
        corr = correlation_matrix.iloc[i, j]
        print(f"  {numerical_cols[i]} vs {numerical_cols[j]}: {corr:.3f}")
        
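# Keep only the upper triangle so self-correlations (always 1.0) are excluded
upper = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1))
strongest_pair = upper.abs().stack().idxmax()
print(f"\nStrongest correlation: {strongest_pair[0]} vs {strongest_pair[1]} ({correlation_matrix.loc[strongest_pair]:.3f})")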


In [None]:
# Categorical vs Numerical Analysis
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Sales by Category
sns.boxplot(data=df, x='Category', y='Sales', ax=axes[0,0])
axes[0,0].set_title('Sales Distribution by Category')
axes[0,0].tick_params(axis='x', rotation=45)

# Sales by Region
sns.boxplot(data=df, x='Region', y='Sales', ax=axes[0,1])
axes[0,1].set_title('Sales Distribution by Region')

# Sales by Weekday
sns.boxplot(data=df, x='Weekday', y='Sales', ax=axes[1,0])
axes[1,0].set_title('Sales Distribution by Weekday')
weekday_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
axes[1,0].set_xticklabels(weekday_names)

# Average Sales by Product
product_sales = df.groupby('Product')['Sales'].mean().sort_values(ascending=False)
axes[1,1].bar(range(len(product_sales)), product_sales.values, color='lightcoral')
axes[1,1].set_title('Average Sales by Product')
axes[1,1].set_xlabel('Products')
axes[1,1].set_ylabel('Average Sales ($)')
axes[1,1].set_xticks(range(len(product_sales)))
axes[1,1].set_xticklabels(product_sales.index, rotation=45, ha='right')

plt.tight_layout()
plt.show()

# Bivariate insights
print("\n📊 BIVARIATE ANALYSIS INSIGHTS")
print("=" * 40)
print("Average Sales by Category:")
for category in df['Category'].unique():
    avg_sales = df[df['Category'] == category]['Sales'].mean()
    print(f"  {category}: ${avg_sales:.2f}")

print("\nAverage Sales by Region:")
for region in df['Region'].unique():
    avg_sales = df[df['Region'] == region]['Sales'].mean()
    print(f"  {region}: ${avg_sales:.2f}")

print("\nAverage Sales by Weekday:")
for i, day in enumerate(weekday_names):
    avg_sales = df[df['Weekday'] == i]['Sales'].mean()
    print(f"  {day}: ${avg_sales:.2f}")


## 📊 Step 7: Time Series Analysis


In [None]:
# Time Series Analysis
# Build the daily sales data
daily_sales = df.groupby('Date').agg({
    'Sales': 'sum',
    'Quantity': 'sum',
    'Customer_ID': 'nunique'
}).reset_index()

daily_sales.columns = ['Date', 'Daily_Sales', 'Daily_Quantity', 'Unique_Customers']

# Build the monthly sales data
monthly_sales = df.groupby(['Year', 'Month']).agg({
    'Sales': 'sum',
    'Quantity': 'sum',
    'Customer_ID': 'nunique'
}).reset_index()

monthly_sales['Date'] = pd.to_datetime(monthly_sales[['Year', 'Month']].assign(day=1))
monthly_sales.columns = ['Year', 'Month', 'Monthly_Sales', 'Monthly_Quantity', 'Unique_Customers', 'Date']

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Daily sales trend
axes[0,0].plot(daily_sales['Date'], daily_sales['Daily_Sales'], color='blue', alpha=0.7)
axes[0,0].set_title('Daily Sales Trend')
axes[0,0].set_xlabel('Date')
axes[0,0].set_ylabel('Daily Sales ($)')
axes[0,0].grid(True, alpha=0.3)

# Monthly sales trend
axes[0,1].plot(monthly_sales['Date'], monthly_sales['Monthly_Sales'], 
               marker='o', color='red', linewidth=2)
axes[0,1].set_title('Monthly Sales Trend')
axes[0,1].set_xlabel('Date')
axes[0,1].set_ylabel('Monthly Sales ($)')
axes[0,1].grid(True, alpha=0.3)

# Average daily sales by month (seasonal pattern).
# Seasonality here drives the number of transactions, so revenue is totaled per day
# first; a per-transaction mean would hide the pattern.
monthly_avg = daily_sales.groupby(daily_sales['Date'].dt.month)['Daily_Sales'].mean()
axes[1,0].bar(monthly_avg.index, monthly_avg.values, color='lightgreen')
axes[1,0].set_title('Average Daily Sales by Month (Seasonal Pattern)')
axes[1,0].set_xlabel('Month')
axes[1,0].set_ylabel('Average Daily Sales ($)')
axes[1,0].set_xticks(range(1, 13))
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
axes[1,0].set_xticklabels(month_names)

# Sales by quarter
quarterly_sales = df.groupby('Quarter')['Sales'].sum()
axes[1,1].pie(quarterly_sales.values, labels=[f'Q{i}' for i in quarterly_sales.index], 
              autopct='%1.1f%%', colors=['lightcoral', 'lightblue', 'lightgreen', 'lightyellow'])
axes[1,1].set_title('Sales Distribution by Quarter')

plt.tight_layout()
plt.show()

# Time series insights
print("📊 TIME SERIES ANALYSIS INSIGHTS")
print("=" * 40)
print(f"Highest daily sales: ${daily_sales['Daily_Sales'].max():,.2f} on {daily_sales.loc[daily_sales['Daily_Sales'].idxmax(), 'Date'].strftime('%Y-%m-%d')}")
print(f"Lowest daily sales: ${daily_sales['Daily_Sales'].min():,.2f} on {daily_sales.loc[daily_sales['Daily_Sales'].idxmin(), 'Date'].strftime('%Y-%m-%d')}")
print(f"Average daily sales: ${daily_sales['Daily_Sales'].mean():,.2f}")
print(f"Best performing month: {month_names[monthly_avg.idxmax()-1]} (${monthly_avg.max():,.2f})")
print(f"Worst performing month: {month_names[monthly_avg.idxmin()-1]} (${monthly_avg.min():,.2f})")
print(f"Best performing quarter: Q{quarterly_sales.idxmax()} (${quarterly_sales.max():,.2f})")


## 📊 Step 8: Multivariate Analysis


In [None]:
# Advanced Multivariate Analysis
fig, axes = plt.subplots(2, 2, figsize=(18, 12))

# Sales by Category and Region (Heatmap)
category_region_sales = df.groupby(['Category', 'Region'])['Sales'].sum().unstack()
sns.heatmap(category_region_sales, annot=True, fmt='.0f', cmap='YlOrRd', ax=axes[0,0])
axes[0,0].set_title('Sales Heatmap: Category vs Region')

# Sales by Product and Month
product_month_sales = df.groupby(['Product', 'Month'])['Sales'].sum().unstack()
# Pass the labels to heatmap directly so they always match the number of ticks
sns.heatmap(product_month_sales, annot=False, cmap='viridis', xticklabels=month_names, ax=axes[0,1])
axes[0,1].set_title('Sales Heatmap: Product vs Month')
axes[0,1].tick_params(axis='x', rotation=45)

# Average Sales by Category and Weekday
category_weekday_sales = df.groupby(['Category', 'Weekday'])['Sales'].mean().unstack()
sns.heatmap(category_weekday_sales, annot=True, fmt='.0f', cmap='Blues', xticklabels=weekday_names, ax=axes[1,0])
axes[1,0].set_title('Average Sales: Category vs Weekday')

# Sales distribution by Region and Category
sns.violinplot(data=df, x='Region', y='Sales', hue='Category', ax=axes[1,1])
axes[1,1].set_title('Sales Distribution: Region vs Category')
axes[1,1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Multivariate insights
print("📊 MULTIVARIATE ANALYSIS INSIGHTS")
print("=" * 45)

# Top performing combinations
print("\n🏆 TOP PERFORMING COMBINATIONS:")
print("-" * 35)

# Category-Region combination
cat_reg_comb = df.groupby(['Category', 'Region'])['Sales'].sum().sort_values(ascending=False)
print("Top Category-Region combinations:")
for i, (combo, sales) in enumerate(cat_reg_comb.head(3).items()):
    print(f"  {i+1}. {combo[0]} in {combo[1]}: ${sales:,.2f}")

# Product-Month combination
prod_month_comb = df.groupby(['Product', 'Month'])['Sales'].sum().sort_values(ascending=False)
print("\nTop Product-Month combinations:")
for i, (combo, sales) in enumerate(prod_month_comb.head(3).items()):
    print(f"  {i+1}. {combo[0]} in {month_names[combo[1]-1]}: ${sales:,.2f}")

# Category-Weekday combination
cat_weekday_comb = df.groupby(['Category', 'Weekday'])['Sales'].mean().sort_values(ascending=False)
print("\nTop Category-Weekday combinations (by average):")
for i, (combo, sales) in enumerate(cat_weekday_comb.head(3).items()):
    print(f"  {i+1}. {combo[0]} on {weekday_names[combo[1]]}: ${sales:.2f}")
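
Raw totals in a heatmap are dominated by the largest category. A row-normalized view, sketched here on toy data, shows each region's share of a category's revenue instead. The helper name `revenue_share` is hypothetical.

```python
import pandas as pd

def revenue_share(frame: pd.DataFrame, rows: str, cols: str) -> pd.DataFrame:
    """Share of each row's total Sales that falls in each column (rows sum to 1)."""
    totals = frame.groupby([rows, cols])["Sales"].sum().unstack(fill_value=0)
    return totals.div(totals.sum(axis=1), axis=0)

toy = pd.DataFrame({
    "Category": ["Electronics", "Electronics", "Fashion", "Fashion"],
    "Region": ["North", "South", "North", "South"],
    "Sales": [300.0, 100.0, 50.0, 50.0],
})
print(revenue_share(toy, "Category", "Region"))
```

Since regions are drawn uniformly at random in this synthetic dataset, the shares on `df` would be expected to sit close to 25% per region for both categories.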


In [None]:
# Customer Analysis
print("👥 CUSTOMER ANALYSIS")
print("=" * 25)

# Top customers by total sales
top_customers = df.groupby('Customer_ID').agg({
    'Sales': 'sum',
    'Quantity': 'sum',
    'Date': 'count'
}).sort_values('Sales', ascending=False)

top_customers.columns = ['Total_Sales', 'Total_Quantity', 'Transaction_Count']
print("Top 10 customers by total sales:")
print(top_customers.head(10).round(2))

# Customer segments based on purchase behavior
customer_stats = df.groupby('Customer_ID').agg({
    'Sales': ['sum', 'mean', 'count'],
    'Quantity': 'sum',
    'Product': 'nunique'
}).round(2)

customer_stats.columns = ['Total_Sales', 'Avg_Sales', 'Transaction_Count', 'Total_Quantity', 'Unique_Products']

# Create customer segments
customer_stats['Customer_Segment'] = pd.cut(customer_stats['Total_Sales'], 
                                           bins=[0, 1000, 5000, 10000, float('inf')],
                                           labels=['Low Value', 'Medium Value', 'High Value', 'VIP'])

segment_analysis = customer_stats.groupby('Customer_Segment').agg({
    'Total_Sales': ['count', 'sum', 'mean'],
    'Transaction_Count': 'mean',
    'Unique_Products': 'mean'
}).round(2)

segment_analysis.columns = ['Customer_Count', 'Total_Sales', 'Avg_Sales', 'Avg_Transactions', 'Avg_Unique_Products']

print("\n📊 CUSTOMER SEGMENTATION:")
print("-" * 30)
print(segment_analysis)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Customer segment distribution
segment_counts = customer_stats['Customer_Segment'].value_counts().sort_index()  # keep segment order so colors match the bar chart
axes[0].pie(segment_counts.values, labels=segment_counts.index, autopct='%1.1f%%',
            colors=['lightcoral', 'lightblue', 'lightgreen', 'gold'])
axes[0].set_title('Customer Segment Distribution')

# Average sales by segment
segment_avg_sales = customer_stats.groupby('Customer_Segment')['Total_Sales'].mean()
axes[1].bar(range(len(segment_avg_sales)), segment_avg_sales.values, 
            color=['lightcoral', 'lightblue', 'lightgreen', 'gold'])
axes[1].set_title('Average Sales by Customer Segment')
axes[1].set_xlabel('Customer Segment')
axes[1].set_ylabel('Average Sales ($)')
axes[1].set_xticks(range(len(segment_avg_sales)))
axes[1].set_xticklabels(segment_avg_sales.index)

plt.tight_layout()
plt.show()
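
The segments above use fixed revenue thresholds only. A common complement is an RFM table (recency, frequency, monetary value) per customer. The sketch below is one possible construction on toy data; `rfm_table` and the choice of reference date are assumptions, not part of the original notebook.

```python
import pandas as pd

def rfm_table(frame: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Recency (days since last purchase), frequency, and monetary value per customer."""
    rfm = frame.groupby("Customer_ID").agg(
        Last_Purchase=("Date", "max"),
        Frequency=("Date", "count"),
        Monetary=("Sales", "sum"),
    )
    rfm["Recency_Days"] = (as_of - rfm["Last_Purchase"]).dt.days
    return rfm[["Recency_Days", "Frequency", "Monetary"]]

toy = pd.DataFrame({
    "Customer_ID": ["CUST_1001", "CUST_1001", "CUST_2002"],
    "Date": pd.to_datetime(["2023-12-01", "2023-12-30", "2023-06-30"]),
    "Sales": [120.0, 80.0, 2000.0],
})
print(rfm_table(toy, as_of=pd.Timestamp("2023-12-31")))
```

Here the second customer has the higher monetary value but has not purchased for about six months, a distinction the revenue-only segments cannot express.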


## 📊 Step 9: Insights & Conclusions


In [None]:
# Summary of insights and conclusions
print("🎯 SALES DATASET EDA - INSIGHTS & CONCLUSIONS")
print("=" * 55)

print("\n📊 KEY FINDINGS:")
print("-" * 20)

# 1. Dataset Overview
print("1. DATASET OVERVIEW:")
print(f"   • Total transactions: {len(df):,}")
print(f"   • Total revenue: ${df['Sales'].sum():,.2f}")
print(f"   • Date range: {df['Date'].min().strftime('%Y-%m-%d')} to {df['Date'].max().strftime('%Y-%m-%d')}")
print(f"   • Unique customers: {df['Customer_ID'].nunique():,}")
print(f"   • Products: {df['Product'].nunique()}")
print(f"   • Categories: {df['Category'].nunique()}")
print(f"   • Regions: {df['Region'].nunique()}")

# 2. Sales Performance
print("\n2. SALES PERFORMANCE:")
best_product = df.groupby('Product')['Sales'].sum().idxmax()
best_category = df.groupby('Category')['Sales'].sum().idxmax()
best_region = df.groupby('Region')['Sales'].sum().idxmax()
best_month = df.groupby('Month')['Sales'].sum().idxmax()  # total revenue, consistent with the lines above

print(f"   • Best performing product: {best_product}")
print(f"   • Best performing category: {best_category}")
print(f"   • Best performing region: {best_region}")
print(f"   • Best performing month: {month_names[best_month-1]}")

# 3. Customer Insights
print("\n3. CUSTOMER INSIGHTS:")
avg_transaction_value = df['Sales'].mean()
avg_customer_value = df.groupby('Customer_ID')['Sales'].sum().mean()
repeat_customers = (df.groupby('Customer_ID').size() > 1).sum()

print(f"   • Average transaction value: ${avg_transaction_value:.2f}")
print(f"   • Average revenue per customer over the period: ${avg_customer_value:.2f}")
print(f"   • Repeat customers: {repeat_customers:,} ({repeat_customers/df['Customer_ID'].nunique()*100:.1f}%)")

# 4. Seasonal Patterns
print("\n4. SEASONAL PATTERNS:")
# Standard deviation of the mean transaction value across months and weekdays
monthly_std = df.groupby('Month')['Sales'].mean().std()
weekday_std = df.groupby('Weekday')['Sales'].mean().std()

print(f"   • Std of mean transaction value across months: ${monthly_std:.2f}")
print(f"   • Std of mean transaction value across weekdays: ${weekday_std:.2f}")
print(f"   • Peak season: {month_names[df.groupby('Month')['Sales'].sum().idxmax()-1]}")
print(f"   • Busiest day: {weekday_names[df.groupby('Weekday')['Sales'].sum().idxmax()]}")

# 5. Business Recommendations (category and region are taken from the results above)
print("\n5. BUSINESS RECOMMENDATIONS:")
print(f"   • Focus marketing efforts on the {best_category} category (highest revenue)")
print(f"   • Expand operations in the best-performing region ({best_region})")
print("   • Develop seasonal strategies for peak months")
print("   • Implement customer retention programs for repeat buyers")
print("   • Consider pricing strategies for high-value products")

# 6. Data Quality (computed rather than asserted)
print("\n6. DATA QUALITY:")
n_missing = int(df.isnull().sum().sum())
n_duplicates = int(df.duplicated().sum())
expected_days = (df['Date'].max() - df['Date'].min()).days + 1
valid_ids = df['Customer_ID'].str.fullmatch(r'CUST_\d{4}').all()
positive_values = (df['Price'] > 0).all() and (df['Quantity'] > 0).all()
print(f"   • Missing values: {n_missing}")
print(f"   • Duplicate rows: {n_duplicates}")
print(f"   • Days with data: {df['Date'].nunique()} of {expected_days}")
print(f"   • All customer IDs match CUST_####: {valid_ids}")
print(f"   • All prices and quantities positive: {positive_values}")

print("\n✅ EDA COMPLETED SUCCESSFULLY!")
print("📈 Ready for advanced analytics and machine learning models!")


## 🎯 Next Steps

With the EDA for the Sales Dataset complete, you can move on to:

### 📈 Advanced Analytics
- **Time Series Forecasting**: Predict future revenue
- **Customer Lifetime Value**: Compute CLV for each customer
- **Market Basket Analysis**: Analyze baskets and cross-selling
- **Churn Prediction**: Predict which customers are likely to leave

### 🤖 Machine Learning Models
- **Sales Prediction Model**: Predict revenue over time
- **Customer Segmentation**: Group customers with clustering
- **Price Optimization**: Optimize product prices
- **Demand Forecasting**: Forecast product demand

### 📊 Business Intelligence
- **Dashboard Creation**: Build real-time dashboards
- **KPI Monitoring**: Track business metrics
- **A/B Testing**: Test marketing strategies
- **ROI Analysis**: Analyze return on investment

### 🔄 Data Pipeline
- **Automated EDA**: Automate the EDA process
- **Data Validation**: Check data quality
- **Real-time Processing**: Process data in real time
- **Data Warehousing**: Store and manage data
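
As a starting point for the forecasting items above, a seasonal-naive baseline simply repeats the last observed week. It is a sketch of a baseline to beat, not a recommended model; `seasonal_naive` is a hypothetical helper and the history is toy data.

```python
import pandas as pd

def seasonal_naive(daily: pd.Series, horizon: int, season: int = 7) -> pd.Series:
    """Forecast `horizon` days ahead by repeating the last `season` observations."""
    last_season = daily.iloc[-season:].to_numpy()
    future_index = pd.date_range(daily.index[-1] + pd.Timedelta(days=1), periods=horizon, freq="D")
    values = [last_season[i % season] for i in range(horizon)]
    return pd.Series(values, index=future_index, name="forecast")

# Two identical toy weeks ending on 2023-12-31
history = pd.Series(
    [10.0, 11.0, 12.0, 13.0, 14.0, 7.0, 6.0] * 2,
    index=pd.date_range("2023-12-18", periods=14, freq="D"),
)
print(seasonal_naive(history, horizon=10))
```

Any more elaborate model can then be judged by whether it beats this baseline on held-out days.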

---

**🎉 Congratulations! You have completed the EDA for the Sales Dataset!**

*Continue with the next dataset, or apply these insights to real-world projects!*
