# Phase 2: Exploratory Data Analysis of Sales

This notebook is the second phase of the sales analysis project. It explores the preprocessed sales dataset (`preprocessed_sales.csv`) with summary statistics and plots, answering a set of specific questions about order patterns and sales trends.

## Objectives
- Determine the number of unique orders.
- Identify the time window of data collection.
- Analyze order distribution across days of the week.
- Visualize sales trends over time.

## Step 1: Import Libraries

Load essential Python libraries for data manipulation, visualization, and logging.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

## Step 2: Load Preprocessed Dataset

Read the preprocessed sales data from `preprocessed_sales.csv` into a DataFrame. This dataset has been cleaned in Phase 1, removing missing values, duplicates, invalid prices, and canceled orders.

In [2]:
try:
    df = pd.read_csv('preprocessed_sales.csv')
    logger.info(f"Dataset loaded successfully. Shape: {df.shape}")
except FileNotFoundError:
    logger.error("File 'preprocessed_sales.csv' not found.")
    raise
except Exception as e:
    logger.error(f"Error loading dataset: {e}")
    raise

df.head()

2025-04-10 12:20:01,983 - ERROR - File 'preprocessed_sales.csv' not found.


FileNotFoundError: [Errno 2] No such file or directory: 'preprocessed_sales.csv'
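
Phase 1 is described as having removed missing values, duplicates, and invalid prices, so it can be worth confirming those guarantees once the file loads. The sketch below shows one way to do that on toy data. The helper `check_preprocessed` is hypothetical, and treating `UnitPrice <= 0` as "invalid" is an assumption, since the exact Phase 1 rule is not stated here.

```python
import pandas as pd

def check_preprocessed(frame: pd.DataFrame) -> dict:
    """Return counts of rows that violate the assumed Phase 1 guarantees."""
    return {
        "missing_values": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        # Assumption: an "invalid" price means UnitPrice <= 0.
        "non_positive_prices": int((frame["UnitPrice"] <= 0).sum()),
    }

# Toy data for illustration only, not taken from the real dataset.
sample = pd.DataFrame({
    "InvoiceNumber": ["A1", "A1", "A2"],
    "Quantity": [2, 1, 5],
    "UnitPrice": [3.5, 1.0, 0.0],
})
report = check_preprocessed(sample)
print(report)
```

On the real data, every count should be zero if Phase 1 ran as described; a non-zero count suggests the wrong file was loaded or the cleaning rules differ from the assumptions above.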

## Step 3: Count Unique Orders

Calculate the number of unique invoices in the dataset and store it in `number_of_orders`.

In [None]:
number_of_orders = df['InvoiceNumber'].nunique()
logger.info(f"Number of unique orders: {number_of_orders}")
print(f"Number of unique orders: {number_of_orders}")
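
An invoice usually spans several rows (one per item), which is why the count uses `nunique()` rather than the row count. A minimal illustration on made-up invoice numbers:

```python
import pandas as pd

# Toy data: five line items belonging to three invoices (values are invented).
toy = pd.DataFrame({
    "InvoiceNumber": ["536365", "536365", "536366", "536367", "536367"],
    "Quantity": [6, 8, 2, 32, 6],
})

row_count = len(toy)                             # counts line items
unique_orders = toy["InvoiceNumber"].nunique()   # counts distinct invoices
print(f"rows={row_count}, unique orders={unique_orders}")
```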

## Step 4: Determine Time Window

Find the earliest and latest timestamps in `InvoiceDate` to define the data collection period. Store the result as a tuple in `window_period`.

In [None]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])  # Ensure datetime format
window_period = (df['InvoiceDate'].min().strftime('%Y-%m-%d %H:%M:%S'), 
                 df['InvoiceDate'].max().strftime('%Y-%m-%d %H:%M:%S'))
logger.info(f"Data collection window: {window_period}")
print(f"Data collection window: {window_period}")
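
The same min/max pattern can be tried on toy timestamps. This sketch also passes `errors="coerce"` so that unparseable dates become `NaT` and can be counted; that option is an addition for illustration, not part of the cell above, and the timestamps are invented.

```python
import pandas as pd

# Toy timestamps for illustration only.
toy = pd.DataFrame({
    "InvoiceDate": ["2010-12-01 08:26:00", "2011-03-15 14:05:00", "2010-12-01 08:28:00"]
})

# Coerce unparseable values to NaT instead of raising, then count them.
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"], errors="coerce")
unparseable = int(toy["InvoiceDate"].isna().sum())

fmt = "%Y-%m-%d %H:%M:%S"
window = (toy["InvoiceDate"].min().strftime(fmt),
          toy["InvoiceDate"].max().strftime(fmt))
span = toy["InvoiceDate"].max() - toy["InvoiceDate"].min()
print(window, span, unparseable)
```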

## Step 5: Analyze Orders by Day of Week

Create a bar chart showing the number of unique orders issued on each day of the week (Monday through Sunday).

### Chart Specifications
| **Feature**                      | **Value**       |
|----------------------------------|-----------------|
| Figure size                      | 15 x 6 inches   |
| Bar color                        | `lime`          |
| X/Y label color                  | `lightseagreen` |
| Title color                      | `green`         |
| Font size (title, labels, ticks) | 15              |
| X-axis tick rotation             | 0 degrees       |

In [None]:
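# Extract the weekday name (e.g., 'Monday') for each invoice
df['DayOfWeek'] = df['InvoiceDate'].dt.day_name()
# Count unique invoices per weekday and order the index Monday through Sunday;
# days with no orders are shown as 0
order_counts = df.groupby('DayOfWeek')['InvoiceNumber'].nunique().reindex(
    ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
    fill_value=0
)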

# Create bar chart
fig1, ax1 = plt.subplots(figsize=(15, 6))
order_counts.plot(kind='bar', ax=ax1, color='lime')

# Customize chart
ax1.set_xlabel('Day of Week', color='lightseagreen', size=15)
ax1.set_ylabel('Number of Orders', color='lightseagreen', size=15)
ax1.set_title('Number of Orders for Different Days', color='green', size=15)
ax1.tick_params(axis='x', rotation=0, labelsize=15)
ax1.tick_params(axis='y', labelsize=15)

plt.tight_layout()
logger.info("Bar chart for orders by day of week generated.")
plt.show()
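
The group-then-reindex pattern can be checked without plotting. The sketch below uses invented rows and assumes all seven day names are wanted, with days that have no orders reported as zero.

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Toy data: invoice A1 has two line items on a Wednesday; A2 falls on a
# Friday and A3 on a Sunday (dates are invented for illustration).
toy = pd.DataFrame({
    "InvoiceNumber": ["A1", "A1", "A2", "A3"],
    "InvoiceDate": pd.to_datetime(
        ["2010-12-01", "2010-12-01", "2010-12-03", "2010-12-05"]
    ),
})

counts = (
    toy.groupby(toy["InvoiceDate"].dt.day_name())["InvoiceNumber"]
    .nunique()                          # distinct invoices per weekday
    .reindex(WEEKDAYS, fill_value=0)    # calendar order, zeros for empty days
)
print(counts)
```

Without the `reindex`, `groupby` would return the days in alphabetical order and drop any day that has no rows.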

## Step 6: Visualize Monthly Sales Trends

Plot a bar chart of total sales amounts by month and year to explore temporal trends.

### Chart Adjustments
- Compute `TotalPrice` as `Quantity * UnitPrice`.
- Aggregate sales by month-year.
- Customize labels for clarity (e.g., rename 'Jul_2010' to 'July_2010').

In [None]:
# Calculate total price
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Extract month-year
df['MonthYear'] = df['InvoiceDate'].dt.strftime('%b_%Y')

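# Group and sum sales; sort=False keeps months in order of first appearance,
# which is chronological only if the rows are already sorted by InvoiceDate
monthly_sales = df.groupby('MonthYear', sort=False)['TotalPrice'].sum()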

# Create bar chart
fig2, ax2 = plt.subplots(figsize=(15, 6))
monthly_sales.plot(kind='bar', ax=ax2, color='darkkhaki')

# Customize labels: spell out 'Jul_2010' as 'July_2010', keep the rest unchanged
labels = [
    'July_2010' if label.get_text() == 'Jul_2010' else label.get_text()
    for label in ax2.get_xticklabels()
]
ax2.set_xticklabels(labels, rotation=45, size=13)

# Set labels and title
ax2.set_xlabel('Month', color='orange', size=15)
ax2.set_ylabel('Sales Amount', color='orange', size=15)
ax2.set_title('Sales Amount for Different Months', color='cadetblue', size=15)

plt.tight_layout()
logger.info("Bar chart for monthly sales trends generated.")
plt.show()
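
String labels such as `Dec_2010` do not sort chronologically, so the month order in the chart depends on the row order of the data. One alternative, sketched here on invented rows, is to group by a monthly `Period` (which sorts by date) and format the labels afterwards. This is a variation for illustration, not the method used in the cell above.

```python
import pandas as pd

# Toy rows deliberately out of date order (values are invented).
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2011-01-10", "2010-12-05", "2010-12-20", "2011-01-02"]
    ),
    "Quantity": [1, 2, 3, 4],
    "UnitPrice": [10.0, 5.0, 2.0, 2.5],
})
toy["TotalPrice"] = toy["Quantity"] * toy["UnitPrice"]

# Grouping by a monthly Period sorts the result chronologically.
monthly = toy.groupby(toy["InvoiceDate"].dt.to_period("M"))["TotalPrice"].sum()

# Convert the PeriodIndex to readable labels only after sorting.
monthly.index = monthly.index.strftime("%B_%Y")
print(monthly)
```

Using `%B` yields full month names for every label, which would also remove the need to rename a single tick by hand.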

## Step 7: Prepare Submission

Save outputs and compress them into `result.zip` for submission.

In [None]:
import zipfile
import joblib

# Save outputs
joblib.dump(number_of_orders, 'number_of_orders')
joblib.dump(window_period, 'window_period')
joblib.dump(fig1, 'fig1')
joblib.dump(fig2, 'fig2')

# Compress files
file_names = ['number_of_orders', 'window_period', 'fig1', 'fig2', 'final_project_2_exploration.ipynb']
with zipfile.ZipFile('result.zip', 'w', compression=zipfile.ZIP_DEFLATED) as zf:
    for file in file_names:
        zf.write(file)
logger.info("Submission files compressed into 'result.zip'")
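
Before submitting, it can help to confirm that the dumped objects load back and that the archive is intact. The sketch below does this in a temporary directory with stand-in values, so it does not touch the real output files; the values and the use of `arcname` are assumptions made for illustration.

```python
import os
import tempfile
import zipfile

import joblib

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in values; the real ones come from the earlier steps.
    count_path = os.path.join(tmp, "number_of_orders")
    window_path = os.path.join(tmp, "window_period")
    joblib.dump(3, count_path)
    joblib.dump(("2010-12-01 08:26:00", "2011-03-15 14:05:00"), window_path)

    # Write the archive, storing each file under its base name only.
    archive_path = os.path.join(tmp, "result.zip")
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in (count_path, window_path):
            zf.write(path, arcname=os.path.basename(path))

    # Re-open the archive: list its members and run a CRC check.
    with zipfile.ZipFile(archive_path) as zf:
        archived_names = sorted(zf.namelist())
        corrupt_member = zf.testzip()  # None means no corrupt member was found

    # Confirm the dumped objects round-trip through joblib.
    restored_count = joblib.load(count_path)
    restored_window = joblib.load(window_path)

print(archived_names, corrupt_member, restored_count, restored_window)
```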