### Data Analysis

- **How many rows**: Number of rows in the dataset.
- **How many columns**: Number of columns in the dataset.
- **List of the columns**: List of columns in the dataset.
- **Total Products**: Number of product line items in the dataset (one per row).
- **Distinct Products**: Number of unique products.
- **Total Transactions**: Number of transactions recorded.
- **Total Customers**: Number of customers in the dataset.
- **Distinct Categories**: Number of unique product categories.


In [13]:
import pandas as pd
import os

path = r'C:\Users\moham\Apriori_VS_Word2Vec'
excel_file = 'df_merged_items_category.xlsx'
excel_file_path = os.path.join(path, excel_file)

# Load the dataset
df = pd.read_excel(excel_file_path)

# Basic Dataset Statistics
print("===== DATASET OVERVIEW =====")
print(f"How many rows: {df.shape[0]}")
print(f"How many columns: {df.shape[1]}")
print(f"List of columns: {list(df.columns)}")

# Total and distinct products
total_products = df.shape[0]
distinct_products = df['Itemname'].nunique()
print(f"Total Products: {total_products}")
print(f"Distinct Products: {distinct_products}")

# Transactions analysis
total_transactions = df['BillNo'].nunique()
print(f"Total Transactions: {total_transactions}")

# Customer analysis
total_customers = df['CustomerID'].nunique()
print(f"Total Customers: {total_customers}")

# Category analysis
if 'category' in df.columns:
    distinct_categories = df['category'].nunique()
    print(f"Distinct Categories: {distinct_categories}")
    # Category distribution
    category_counts = df['category'].value_counts()
    print("\n===== CATEGORY DISTRIBUTION =====")
    for category, count in category_counts.items():
        percentage = (count / total_products) * 100
        print(f"{category}: {count} products ({percentage:.2f}%)")

# Date range
if 'Date' in df.columns:
    min_date = df['Date'].min()
    max_date = df['Date'].max()
    print(f"\nDate Range: {min_date} to {max_date}")




ModuleNotFoundError: No module named 'numpy.rec'

### Transaction Patterns

Analyze the purchasing patterns visible in the data:

- **Transaction frequency over time**: Daily, weekly, and monthly patterns.
- **Average items per transaction**.
- **Distribution of transaction sizes**: Histogram showing the number of items per invoice.
- **Temporal patterns**: Time of day, day of week, and seasonal trends.

In [None]:
import pandas as pd
import os

path = r'C:\Users\moham\Apriori_VS_Word2Vec'
excel_file = 'df_merged_items_category.xlsx'
excel_file_path = os.path.join(path, excel_file)

# Load the dataset
df = pd.read_excel(excel_file_path)

# Transaction patterns
print("\n===== TRANSACTION PATTERNS =====")
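# Items per transaction
items_per_transaction = df.groupby('BillNo').size()
# Totals used for the percentages below (recomputed so this cell runs on its own)
total_products = df.shape[0]
total_transactions = df['BillNo'].nunique()
avg_items = items_per_transaction.mean()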
print(f"Average Items per Transaction: {avg_items:.2f}")
print(f"Min Items per Transaction: {items_per_transaction.min()}")
print(f"Max Items per Transaction: {items_per_transaction.max()}")

# Distribution of transaction sizes
transaction_size_counts = items_per_transaction.value_counts().sort_index()
print("\nTransaction Size Distribution:")
for size, count in transaction_size_counts.items():
    percentage = (count / total_transactions) * 100
    print(f"{size} items: {count} transactions ({percentage:.2f}%)")

# Top products
print("\n===== TOP PRODUCTS =====")
top_products = df['Itemname'].value_counts().head(10)
print("Top 10 Most Frequently Purchased Products:")
for product, count in top_products.items():
    percentage = (count / total_products) * 100
    print(f"{product}: {count} purchases ({percentage:.2f}%)")

# Time-based analysis
if 'Date' in df.columns:
    # Convert to datetime if not already
    if not pd.api.types.is_datetime64_any_dtype(df['Date']):
        df['Date'] = pd.to_datetime(df['Date'])
    
    # Add time components
    df['Day'] = df['Date'].dt.day_name()
    df['Hour'] = df['Date'].dt.hour
    df['Month'] = df['Date'].dt.month_name()
    
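    # Daily pattern (reindexed so days print Monday to Sunday, not alphabetically)
    day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    daily_transactions = df.groupby('Day')['BillNo'].nunique().reindex(day_order, fill_value=0)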
    print("\n===== TEMPORAL PATTERNS =====")
    print("Transactions by Day of Week:")
    for day, count in daily_transactions.items():
        percentage = (count / total_transactions) * 100
        print(f"{day}: {count} transactions ({percentage:.2f}%)")
    
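    # Monthly pattern (reindexed so months print in calendar order, not alphabetically)
    month_order = pd.date_range('2000-01-01', periods=12, freq='MS').month_name()
    monthly_transactions = df.groupby('Month')['BillNo'].nunique().reindex(month_order, fill_value=0)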
    print("\nTransactions by Month:")
    for month, count in monthly_transactions.items():
        percentage = (count / total_transactions) * 100
        print(f"{month}: {count} transactions ({percentage:.2f}%)")
    
    # Hourly pattern
    hourly_transactions = df.groupby('Hour')['BillNo'].nunique()
    print("\nTransactions by Hour of Day:")
    for hour, count in hourly_transactions.items():
        percentage = (count / total_transactions) * 100
        print(f"{hour}:00 - {hour+1}:00: {count} transactions ({percentage:.2f}%)")
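
The cell above groups by weekday name, month name, and hour, which shows recurring patterns but not how transaction volume changes along the calendar. A minimal sketch of the "transaction frequency over time" item follows, assuming `Date` is already a datetime column. The toy data and the helper name `transactions_over_time` are hypothetical; only the column names `BillNo` and `Date` come from the notebook.

```python
import pandas as pd

# Hypothetical toy data; only the column names 'BillNo' and 'Date' come from the notebook.
toy = pd.DataFrame({
    'BillNo': [1, 1, 2, 3, 3, 4],
    'Date': pd.to_datetime([
        '2024-01-01 09:00', '2024-01-01 09:00',
        '2024-01-02 14:30',
        '2024-01-09 10:15', '2024-01-09 10:15',
        '2024-02-05 16:45',
    ]),
})


def transactions_over_time(frame, freq):
    """Count distinct BillNo values per calendar period (e.g. 'D' or 'W')."""
    return frame.set_index('Date')['BillNo'].resample(freq).nunique()


daily = transactions_over_time(toy, 'D')    # one row per calendar day
weekly = transactions_over_time(toy, 'W')   # weeks ending on Sunday

# Monthly counts via to_period, which sidesteps version-specific month aliases.
monthly = toy.groupby(toy['Date'].dt.to_period('M'))['BillNo'].nunique()

print(weekly)
print(monthly)
```

Counting distinct `BillNo` values rather than rows keeps multi-item invoices from inflating the totals. Periods with no sales appear as zeros in the resampled series, which makes gaps visible when the result is plotted.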

### Product Analysis

Examine the product-related characteristics:

- **Top 10-20 most frequently purchased products**.
- **Product category distribution**: Percentage of items in each category.
- **Price distribution**: Across different product categories.
- **Product popularity vs. price**: Correlation analysis.
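
The items above could be computed along the following lines. This is a sketch under stated assumptions, not the notebook's own method: the toy data is hypothetical, and the `Price` column name is an assumption (`Itemname` and `category` are the names used in the earlier cells). Popularity versus price is measured here with a Spearman rank correlation, computed as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical toy data. 'Itemname' and 'category' appear in the notebook;
# the 'Price' column name is an assumption.
toy = pd.DataFrame({
    'Itemname': ['Mug', 'Mug', 'Mug', 'Lamp', 'Lamp', 'Clock'],
    'category': ['Kitchen', 'Kitchen', 'Kitchen', 'Home', 'Home', 'Home'],
    'Price': [2.0, 2.0, 2.0, 10.0, 10.0, 25.0],
})

# Category distribution: share of line items in each category, in percent.
category_share = toy['category'].value_counts(normalize=True) * 100

# Price distribution per category (count, mean, std, quartiles).
price_by_category = toy.groupby('category')['Price'].describe()

# Popularity vs. price: aggregate to one row per product first.
per_product = toy.groupby('Itemname').agg(
    purchases=('Price', 'size'),    # number of line items for the product
    mean_price=('Price', 'mean'),   # average listed price
)

# Spearman correlation = Pearson correlation of the ranks.
spearman = per_product['purchases'].rank().corr(per_product['mean_price'].rank())

print(category_share.round(1))
print(price_by_category[['count', 'mean']])
print(f"Rank correlation (purchases vs. price): {spearman:.2f}")
```

A rank correlation is used because purchase counts are usually heavily skewed, so a few best-sellers would otherwise dominate a plain Pearson coefficient. Aggregating to one row per product first prevents frequently bought items from being counted many times in the correlation.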