# 🟣 Project 1: Sales Data Analysis
🎯 Project Goal:

You are given 12 CSV files, one for each month of sales in 2019.

Your goal is to analyze all this data together and answer business questions like:
- What was the best month for sales?
- Which city sold the most product?
- What time should we show ads to increase sales?
- What products are sold together?
- Which product sold the most?
- Which products are often bought by high-value customers?
- What is the average basket size per order?
- What’s the revenue share of top 20% customers (Pareto 80/20)?
- Which products have seasonal sales patterns?
- What product combinations happen most with high-value orders?

## 📚 Step 1: Load and Combine All Monthly Files

In [None]:
import os
import pandas as pd

# Path to your data folder
folder = '../data'
# Sort the file names so they load in the same order on every machine
files = sorted(file for file in os.listdir(folder) if file.endswith('.csv'))

# Read each file, then concatenate once (faster than growing a DataFrame inside the loop)
frames = [pd.read_csv(os.path.join(folder, file)) for file in files]
all_months_data = pd.concat(frames, ignore_index=True)

print(all_months_data.shape)
print(all_months_data.head())


## 🧹 Step 2: Clean the Data

In [None]:
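# Drop rows that are completely empty
all_months_data = all_months_data.dropna(how='all')

# Remove repeated header rows that ended up inside the data;
# .copy() avoids a SettingWithCopyWarning in the conversions below
all_months_data = all_months_data[all_months_data['Order Date'].str[0:2] != 'Or'].copy()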

# Convert columns
all_months_data['Quantity Ordered'] = pd.to_numeric(all_months_data['Quantity Ordered'], errors='coerce')
all_months_data['Price Each'] = pd.to_numeric(all_months_data['Price Each'], errors='coerce')
all_months_data['Order Date'] = pd.to_datetime(all_months_data['Order Date'], format='%m/%d/%y %H:%M', errors='coerce')

# Add total column
all_months_data['Total Price'] = all_months_data['Quantity Ordered'] * all_months_data['Price Each']
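
To see what these cleaning steps do, here is a minimal sketch on a three-row toy frame. The values are made up for illustration and are not taken from the dataset: one valid row, one header row repeated inside the data, and one empty row.

```python
import pandas as pd

# Toy data (made up): a valid row, a repeated header row, and an empty row
raw = pd.DataFrame({
    'Order ID': ['1001', 'Order ID', None],
    'Product': ['USB Cable', 'Product', None],
    'Quantity Ordered': ['2', 'Quantity Ordered', None],
    'Price Each': ['11.95', 'Price Each', None],
    'Order Date': ['01/22/19 21:25', 'Order Date', None],
})

# Same steps as the cell above, applied to the toy frame
clean = raw.dropna(how='all')
clean = clean[clean['Order Date'].str[0:2] != 'Or'].copy()
clean['Quantity Ordered'] = pd.to_numeric(clean['Quantity Ordered'], errors='coerce')
clean['Price Each'] = pd.to_numeric(clean['Price Each'], errors='coerce')
clean['Order Date'] = pd.to_datetime(clean['Order Date'], format='%m/%d/%y %H:%M', errors='coerce')
clean['Total Price'] = clean['Quantity Ordered'] * clean['Price Each']

print(clean[['Order ID', 'Quantity Ordered', 'Price Each', 'Total Price']])
```

Only the valid row should survive, with numeric columns and a parsed timestamp.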


## ⁉️ Question 01: What was the best month for sales?

In [None]:
all_months_data['Month'] = all_months_data['Order Date'].dt.month

monthly_sales = all_months_data.groupby('Month')['Total Price'].sum()
print(monthly_sales)
monthly_sales.plot(kind='bar', title='Total Sales by Month')


## ⁉️ Question 02: What city had the highest sales?

In [None]:
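# Extract "City (ST)" from the address; assumes the format "street, city, ST zip".
# Keeping the state separates same-named cities (e.g. Portland, OR and Portland, ME).
def city_with_state(address):
    parts = address.split(',')
    return f"{parts[1].strip()} ({parts[2].split()[0]})"

all_months_data['City'] = all_months_data['Purchase Address'].apply(city_with_state)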

city_sales = all_months_data.groupby('City')['Total Price'].sum()
city_sales = city_sales.sort_values(ascending=False)
print(city_sales)
city_sales.plot(kind='bar', title='Sales by City')


## ⁉️ Question 03: What time should we advertise?

In [None]:
all_months_data['Hour'] = all_months_data['Order Date'].dt.hour

# Count each order once, even when it spans several product rows
hourly_orders = all_months_data.groupby('Hour')['Order ID'].nunique()
hourly_orders.plot(kind='line', title='Number of Orders by Hour')


📌 Application: Deciding when to run ads or offer discounts
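
The line chart shows the overall shape. To pull out the busiest hours as numbers, a small sketch on made-up data follows. It counts unique `Order ID` values per hour, so an order with several product rows is counted once; the frame `orders` and its values are illustrative assumptions.

```python
import pandas as pd

# Toy data (made up): order 2 has two product rows in the same hour
orders = pd.DataFrame({
    'Order ID': [1, 2, 2, 3, 4, 5],
    'Order Date': pd.to_datetime([
        '2019-01-05 11:10', '2019-01-05 19:02', '2019-01-05 19:02',
        '2019-01-06 19:45', '2019-01-07 12:30', '2019-01-08 19:15',
    ]),
})
orders['Hour'] = orders['Order Date'].dt.hour

# Unique orders per hour, then the two busiest hours
orders_per_hour = orders.groupby('Hour')['Order ID'].nunique()
peak_hours = orders_per_hour.nlargest(2)
print(peak_hours)
```

Scheduling ads shortly before the top hours is one way to use this result.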

## ⁉️ Question 04: What products are most often sold together?
🎯 Goal: Check for cross-sell between products, looking only at high-value orders

The trick is to keep only the rows whose Order ID appears more than once, since those orders contain several products:

In [None]:
from collections import Counter

# Step 1: Start from the cleaned, combined data from Step 2 (all 12 months),
# keeping only rows whose date, quantity, and price were parsed successfully
df = all_months_data.dropna(subset=['Order Date', 'Quantity Ordered', 'Price Each']).copy()

# Step 2: Filter high-value orders (order total above 500, summed over all rows of the order)
order_total = df.groupby('Order ID')['Total Price'].transform('sum')
high_value = df[order_total > 500]

# Step 3: Find multi-product orders
grouped_orders = df[df.duplicated(['Order ID'], keep=False)].copy()

# Step 4: Group products into one row
grouped_orders['Grouped'] = grouped_orders.groupby('Order ID')['Product'].transform(lambda x: ', '.join(x))

# Step 5: Drop repeated rows for clean combination list
grouped_orders = grouped_orders[['Order ID', 'Grouped']].drop_duplicates()

# Step 6: Filter grouped orders by high-value Order IDs
high_orders = grouped_orders[grouped_orders['Order ID'].isin(high_value['Order ID'])]

# Step 7: Count combinations
combo_counter = Counter()
for row in high_orders['Grouped']:
    items = tuple(sorted(row.split(', ')))
    combo_counter[items] += 1

# Step 8: Show results
for combo in combo_counter.most_common(10):
    print(combo)

Bar Chart:

In [None]:
import matplotlib.pyplot as plt

# Prepare data for plotting
top_combos = combo_counter.most_common(10)

labels = [' + '.join(combo[0]) for combo in top_combos]
counts = [combo[1] for combo in top_combos]

# Plot
plt.figure(figsize=(12, 6))
plt.barh(labels, counts, color='teal')
plt.title('Top 10 High-Value Product Combinations')
plt.xlabel('Number of Orders')
plt.gca().invert_yaxis()  # Show most frequent on top
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

🔸 high_value['Order ID'] → Orders with a high amount

🔸 grouped_orders['Grouped'] → Combination of products in each order (e.g. "iPhone, USB Cable")

📌 Application: Automatic second product suggestion on checkout page
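
The cell above counts whole baskets, so "A + B" and "A + B + C" are treated as different combinations. For a second-product suggestion, counting pairs is often more useful. The sketch below is an alternative, not the method used above: it uses `itertools.combinations` on a made-up dictionary of orders (the order IDs and product lists are illustrative).

```python
from collections import Counter
from itertools import combinations

# Toy data (made up): order ID -> products in that order
orders = {
    'A1': ['iPhone', 'Lightning Charging Cable', 'Wired Headphones'],
    'A2': ['iPhone', 'Lightning Charging Cable'],
    'A3': ['Google Phone', 'USB-C Charging Cable'],
}

pair_counter = Counter()
for products in orders.values():
    # sorted(set(...)) makes each pair order-independent and counted once per order
    for pair in combinations(sorted(set(products)), 2):
        pair_counter[pair] += 1

for pair, count in pair_counter.most_common(3):
    print(pair, count)
```

With real data, the dictionary could be built from `df.groupby('Order ID')['Product'].apply(list)`.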

## ⁉️ Question 05: What product sold the most?

In [None]:
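# Sort so the best seller comes first and the chart is easy to read
product_sales = all_months_data.groupby('Product')['Quantity Ordered'].sum().sort_values(ascending=False)
print(f"Best seller: {product_sales.idxmax()} ({product_sales.max():.0f} units)")
product_sales.plot(kind='bar', title='Quantity Sold per Product')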


## ⁉️ Question 06: Which products are often bought by high-value customers?

🎯 Goal: Find popular products for customers who make large purchases.

In [None]:
# Filter high-value rows (a single product line worth more than 500)
high_value = all_months_data[all_months_data['Total Price'] > 500]

# Count top products
top_products = high_value['Product'].value_counts().head(10)
print(top_products)


📌 Application: This type of analysis helps the marketing team understand which products are suitable for premium campaigns.
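
The filter above works on single rows, so it mostly surfaces expensive products. To look at high-value customers instead, one option is to total the spend per customer first. The sketch below uses made-up data and treats `Purchase Address` as a stand-in for a customer ID, which is an assumption; the 500 threshold is also just illustrative.

```python
import pandas as pd

# Toy data (made up); 'Purchase Address' stands in for a customer ID (an assumption)
sales = pd.DataFrame({
    'Purchase Address': ['1 Main St', '1 Main St', '2 Oak St', '3 Elm St', '3 Elm St'],
    'Product': ['Laptop', 'USB Cable', 'USB Cable', 'Monitor', 'USB Cable'],
    'Total Price': [900.0, 10.0, 10.0, 400.0, 10.0],
})

# Total spend per customer, then keep customers above the threshold
spend = sales.groupby('Purchase Address')['Total Price'].sum()
threshold = 500  # hypothetical cut-off
big_spenders = spend[spend > threshold].index

# Everything those customers bought, including cheap add-ons
products_of_big_spenders = (
    sales[sales['Purchase Address'].isin(big_spenders)]['Product'].value_counts()
)
print(products_of_big_spenders)
```

This version also captures low-priced items that big spenders buy, which a row-level filter would miss.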

## ⁉️ Question 07: What is the average basket size per order?

🎯 Goal: Calculate the number of items each customer purchases on average in an order.

In [None]:
order_size = all_months_data.groupby('Order ID')['Quantity Ordered'].sum()
avg_basket_size = order_size.mean()
print(f"Average items per order: {avg_basket_size:.2f}")


📌 Application: For designing incremental offers (cross-sell / upsell)

## ⁉️ Question 08: What’s the revenue share of top 20% customers (Pareto 80/20)?

🎯 Goal: Investigate whether 20% of customers generate 80% of sales.

In [None]:
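# 'Purchase Address' is used here as a stand-in for a customer ID
customer_revenue = all_months_data.groupby('Purchase Address')['Total Price'].sum().sort_values(ascending=False)
top_20_percent = max(1, int(0.2 * len(customer_revenue)))  # always include at least one customer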

revenue_top = customer_revenue.iloc[:top_20_percent].sum()
revenue_total = customer_revenue.sum()

share = revenue_top / revenue_total
print(f"Top 20% customers bring in {share:.1%} of total revenue")


📌 Application: Focus on more profitable customers
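
The same question can be turned around: what fraction of customers is needed to reach 80% of revenue? A minimal sketch using a cumulative share follows; the revenue figures and customer labels are made up for illustration.

```python
import pandas as pd

# Toy data (made up): revenue per customer, largest first
customer_revenue = pd.Series(
    [500.0, 350.0, 100.0, 30.0, 20.0],
    index=['c1', 'c2', 'c3', 'c4', 'c5'],
).sort_values(ascending=False)

# Running share of total revenue as customers are added from the top
cumulative_share = customer_revenue.cumsum() / customer_revenue.sum()

# Number of top customers needed to reach at least 80% of revenue
customers_needed = int((cumulative_share < 0.8).sum()) + 1
fraction_needed = customers_needed / len(customer_revenue)
print(f"{fraction_needed:.0%} of customers generate at least 80% of revenue")
```

Plotting `cumulative_share` gives the usual Pareto curve.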

## ⁉️ Question 09: Which products have seasonal sales patterns?

🎯 Goal: Identify products whose sales increase or decrease in certain months.

In [None]:
monthly = all_months_data.copy()
monthly['Month'] = monthly['Order Date'].dt.month

seasonality = monthly.groupby(['Month', 'Product'])['Quantity Ordered'].sum().unstack()
seasonality.plot(figsize=(14,6))


📌 Application: Inventory forecasting and seasonal promotions
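
With many products on one chart, seasonal patterns can be hard to spot by eye. One possible way to rank products is to compare each month with the product's own average and measure how much the monthly values vary. The sketch below uses a made-up table shaped like `seasonality` (months as rows, products as columns); the product names and numbers are illustrative.

```python
import pandas as pd

# Toy data (made up): units sold per month for two products
seasonality = pd.DataFrame(
    {'Steady Product': [100, 100, 100, 100],
     'Seasonal Product': [50, 50, 100, 200]},
    index=pd.Index([1, 2, 3, 4], name='Month'),
)

# Index each product against its own monthly average (1.0 = an average month)
seasonal_index = seasonality / seasonality.mean()

# Coefficient of variation: higher means sales swing more from month to month
variation = (seasonality.std() / seasonality.mean()).sort_values(ascending=False)
print(variation)
```

Products at the top of `variation` are the first candidates for a closer seasonal look.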