# 📊 Sales Dataset EDA - Sales Data Analysis

## 🎯 Objectives
Exploratory analysis of sales data to understand:
- Sales trends over time
- Breakdowns by product, region, and customer
- Seasonal patterns and trends
- Customer behavior

## 📋 Dataset Overview
- **Source**: Synthetic Sales Data
- **Time period**: 2 years (2022-2023)
- **Features**: Date, Product, Category, Region, Customer, Sales, Quantity, Price
- **Goal**: Time series analysis, seasonal patterns, customer segmentation

## 🔍 Techniques Used
- Time series analysis
- Seasonal decomposition
- Customer segmentation
- Product performance analysis
- Geographic analysis
- Trend analysis
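
To illustrate the seasonal decomposition item above, here is a minimal sketch of an additive decomposition done by hand with pandas. The toy series, its weekly pattern, and all variable names are assumptions made for this example; the analysis later in the notebook may use a different method or library.

```python
import numpy as np
import pandas as pd

# Toy daily series: linear trend plus a weekly cycle (illustrative data only).
idx = pd.date_range("2022-01-01", periods=56, freq="D")
weekly = np.tile([0, 1, 2, 3, 2, 1, -9], 8)  # sums to zero over each week
series = pd.Series(100 + 0.5 * np.arange(56) + weekly, index=idx)

# Trend: a centered 7-day moving average averages out the weekly cycle.
trend = series.rolling(window=7, center=True).mean()

# Seasonal component: mean detrended value for each day of the week.
detrended = series - trend
seasonal = detrended.groupby(detrended.index.dayofweek).transform("mean")

# Residual: whatever trend and seasonality leave unexplained.
residual = series - trend - seasonal
print(residual.dropna().abs().max())
```

Because the toy series is built from an exact trend and an exact weekly cycle, the residual should be numerically close to zero; on real or noisier data it would carry the unexplained variation.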


In [None]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Set the plot style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print("✅ All libraries imported successfully!")
print("📊 Ready to start analyzing the Sales Dataset!")


## 📊 Step 1: Create the Synthetic Sales Dataset

Create a synthetic sales dataset with realistic characteristics:
- 2 years of data (2022-2023)
- 10 products across 2 categories (Electronics, Fashion)
- 4 sales regions
- 1,000+ customers (IDs drawn at random from CUST_1000 to CUST_9998)
- Seasonal patterns (annual cycle, weekend dip, November-December holiday uplift)
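
The generator code in this step draws each day's transaction count from a Poisson distribution whose mean is 50 scaled by a seasonal, a weekend, and a holiday factor. The sketch below computes that mean for a few dates using only the standard library. `expected_daily_transactions` is a hypothetical helper written for this illustration; it is not part of the notebook.

```python
import math
from datetime import date

def expected_daily_transactions(d, base_rate=50.0):
    """Poisson mean for one day (hypothetical helper mirroring the generator's factors)."""
    day_of_year = d.timetuple().tm_yday
    seasonal = 1 + 0.3 * math.sin(2 * math.pi * day_of_year / 365)  # annual sine cycle
    weekend = 0.7 if d.weekday() >= 5 else 1.0                      # Saturday/Sunday dip
    holiday = 1.5 if d.month in (11, 12) else 1.0                   # Nov-Dec uplift
    return base_rate * seasonal * weekend * holiday

for d in (date(2022, 4, 1), date(2022, 10, 1), date(2022, 12, 15)):
    print(d, round(expected_daily_transactions(d), 1))
```

The sine term peaks around day 91 (early April) and bottoms out around day 274 (early October), so the November-December uplift is applied while the seasonal factor is below 1. As a result, a mid-December weekday has an expected count of roughly 69, only slightly above the roughly 65 of an early-April weekday.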


In [None]:
# Create the synthetic sales dataset
np.random.seed(42)

# Basic parameters
start_date = '2022-01-01'
end_date = '2023-12-31'
date_range = pd.date_range(start=start_date, end=end_date, freq='D')

# Products and their categories
products = {
    'Laptop Gaming': 'Electronics',
    'iPhone 14': 'Electronics', 
    'Nike Air Max': 'Fashion',
    'Adidas Ultraboost': 'Fashion',
    'MacBook Pro': 'Electronics',
    'Samsung Galaxy': 'Electronics',
    'Levi\'s Jeans': 'Fashion',
    'Zara Jacket': 'Fashion',
    'iPad Air': 'Electronics',
    'Nike T-Shirt': 'Fashion'
}

# Regions
regions = ['North', 'South', 'East', 'West']

# Base price per product (defined once, outside the loops)
base_prices = {
    'Laptop Gaming': 1200, 'iPhone 14': 800, 'Nike Air Max': 120,
    'Adidas Ultraboost': 150, 'MacBook Pro': 2000, 'Samsung Galaxy': 600,
    'Levi\'s Jeans': 80, 'Zara Jacket': 120, 'iPad Air': 500, 'Nike T-Shirt': 30
}

# Generate the data
sales_data = []

for date in date_range:
    # Number of transactions for the day (seasonal, weekend, and holiday effects)
    day_of_year = date.timetuple().tm_yday
    seasonal_factor = 1 + 0.3 * np.sin(2 * np.pi * day_of_year / 365)  # Seasonal pattern
    weekend_factor = 0.7 if date.weekday() >= 5 else 1.0  # Weekend effect
    holiday_factor = 1.5 if date.month in [11, 12] else 1.0  # Holiday season
    
    num_transactions = int(np.random.poisson(50 * seasonal_factor * weekend_factor * holiday_factor))
    
    for _ in range(num_transactions):
        product = np.random.choice(list(products.keys()))
        category = products[product]
        region = np.random.choice(regions)
        # randint excludes the upper bound, so IDs run from CUST_1000 to CUST_9998
        customer_id = f"CUST_{np.random.randint(1000, 9999)}"
        
        # Unit price: base price with random variation
        base_price = base_prices[product]
        price_variation = np.random.normal(1, 0.1)  # standard deviation of 10%
        price = base_price * price_variation
        
        # Quantity: 1 to 3 items, weighted toward 1
        quantity = np.random.choice([1, 2, 3], p=[0.7, 0.2, 0.1])
        
        # Total sales
        sales = price * quantity
        
        # Discount factor (10% chance of discount)
        if np.random.random() < 0.1:
            discount = np.random.uniform(0.05, 0.25)
            sales *= (1 - discount)
        
        sales_data.append({
            'Date': date,
            'Product': product,
            'Category': category,
            'Region': region,
            'Customer_ID': customer_id,
            'Price': round(price, 2),
            'Quantity': quantity,
            'Sales': round(sales, 2),
            'Year': date.year,
            'Month': date.month,
            'Day': date.day,
            'Weekday': date.weekday(),
            'Quarter': date.quarter
        })

# Build the DataFrame
df = pd.DataFrame(sales_data)

print("✅ Sales Dataset created successfully!")
print(f"📊 Dataset shape: {df.shape}")
print(f"📅 Period: {df['Date'].min()} to {df['Date'].max()}")
print(f"💰 Total revenue: ${df['Sales'].sum():,.2f}")
print(f"🛍️ Total transactions: {len(df):,}")


## 📊 Step 2: Data Loading & Overview
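
Besides shape and dtypes, an overview step often tabulates missing and distinct values per column. A minimal sketch on a toy frame follows; the `toy` and `overview` names and the sample values are made up for illustration and are not taken from the generated dataset.

```python
import pandas as pd

# Toy frame with the same kind of columns as the sales data (illustrative values only).
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-01-02"]),
    "Product": ["iPad Air", "Zara Jacket", "iPad Air"],
    "Quantity": [1, 2, 1],
    "Sales": [512.40, 231.10, None],
})

# One row per column: dtype, number of missing values, number of distinct values.
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
    "unique": toy.nunique(),
})
print(overview)
print("duplicate rows:", toy.duplicated().sum())
```

The same three expressions could be applied to `df` once it has been created; the synthetic generator is not expected to produce missing values, so the check mainly serves as a sanity test.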


In [None]:
# Check basic information about the dataset
print("🔍 BASIC DATASET INFORMATION")
print("=" * 50)
print(f"📊 Shape: {df.shape}")
print(f"💾 Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"📅 Date range: {df['Date'].min()} to {df['Date'].max()}")
print(f"📈 Total days: {(df['Date'].max() - df['Date'].min()).days + 1}")

print("\n📋 COLUMNS INFO:")
print("=" * 30)
df.info()  # info() prints its report itself and returns None

print("\n🔢 DATA TYPES:")
print("=" * 20)
print(df.dtypes)

print("\n📊 SAMPLE DATA (first 5 rows):")
print("=" * 35)
df.head()
