## Introduction to the Dataset
---

The Global Superstore dataset is a rich, large-scale retail sales dataset that represents a fictional but realistic multi-national store. It is widely used in data science, analytics, and business intelligence projects because it mirrors the complexities of real-world business operations such as sales, shipping, profit management, and customer segmentation.

---
### 📌 Background & Purpose
---
The dataset is designed to simulate how a global retail company operates across different countries, regions, and markets. It includes details of orders, customers, products, shipping, and financial metrics (sales, profit, shipping costs, discounts).

It is commonly used in:

- Exploratory Data Analysis (EDA): Understanding data trends and anomalies.

- Business Intelligence (BI): Creating dashboards to monitor KPIs.

- Machine Learning (ML): Building models for sales prediction, customer segmentation, and demand forecasting.

- Decision Science: Optimizing shipping, discounts, and profitability.
---
### 📑 Structure of the Dataset

---
The dataset contains 51,290 transactions (rows) and 24 attributes (columns), covering all aspects of a sales process:

- Order Information

    - Order ID, Order Date, Ship Date, Ship Mode, Order Priority.
    - These fields describe how and when the order was placed, and the urgency of fulfillment.

- Customer Information

    - Customer ID, Customer Name, Segment, City, State, Country, Postal Code, Market, Region.
    - These provide demographic and geographical details about buyers.

- Product Information

    - Product ID, Product Name, Category, Sub-Category.
    - These capture the catalog structure and type of items purchased.

- Financial Information

    - Sales, Quantity, Discount, Profit, Shipping Cost.
    - These indicate the revenue generated, items sold, discounts applied, and profitability.

---
### 🌎 Geographic & Market Scope
---
- Covers customers from 147 countries across 7 global markets (Africa, APAC, Canada, EMEA, EU, LATAM, US).

- Includes a detailed breakdown by region (e.g., East, West, North, South).

- Captures both developed and developing markets, making it ideal for global comparative studies.

---
### 📊 Business Value
---
The dataset provides a 360-degree view of a retail operation, enabling:

- Sales Analysis: Which products, categories, and regions drive the most sales?

- Profitability Studies: Where are discounts hurting profit margins?

- Customer Segmentation: Understanding behavior across Consumer, Corporate, and Home Office customers.

- Shipping Optimization: Which shipping modes are cost-efficient vs. loss-making?

- Market Expansion Strategy: Identifying high-performing regions or underperforming countries.

---
### ⚙️ Why It’s Popular
---
- Realistic & Complex: Covers multi-dimensional data (customer, product, geography, finance).

- Sizeable: With 51k+ records, it’s large enough to test models but manageable on personal computers.

- Widely Used: Adopted in tutorials, Kaggle competitions, Tableau dashboards, and academic research.

- Versatile Applications: Supports both descriptive analytics (dashboards) and predictive analytics (ML models).
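To make one of the business questions above concrete (where do discounts hurt profit margins?), here is a minimal sketch. It assumes only the `Discount`, `Sales`, and `Profit` columns from the dataset; the six rows, the band edges, and the band labels are made up for illustration.

```python
import pandas as pd

# Hypothetical mini-sample; only the column names mirror the real dataset.
orders = pd.DataFrame({
    'Discount': [0.0, 0.1, 0.1, 0.4, 0.5, 0.0],
    'Sales':    [100.0, 200.0, 300.0, 150.0, 50.0, 400.0],
    'Profit':   [30.0, 40.0, 50.0, -30.0, -20.0, 120.0],
})

# Bucket each order into an (assumed) discount band.
bands = pd.cut(
    orders['Discount'],
    bins=[-0.01, 0.0, 0.2, 1.0],
    labels=['none', 'low', 'high'],
)

# Total sales and profit per band, then profit margin = profit / sales.
summary = orders.groupby(bands, observed=True)[['Sales', 'Profit']].sum()
summary['Margin'] = summary['Profit'] / summary['Sales']
print(summary)
```

On the real data the same pattern applies once the CSV is loaded; the band edges are a judgment call and worth tuning to the actual discount distribution.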

Creating an interactive business dashboard using Streamlit involves several steps, including cleaning the data, building the dashboard, and visualizing key performance indicators (KPIs). Below is a step-by-step guide to accomplish this using the Global Superstore dataset.

## Step 1: *Import Libraries*

In [4]:
import numpy as np
import pandas as pd
import streamlit as st
import plotly.express as px
import matplotlib.pyplot as plt

## Step 2: Load and Clean the Dataset

In [5]:
df = pd.read_csv('./Global_Superstore.csv', encoding='ISO-8859-1')
df.head()

| | Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | City | State | ... | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit | Shipping Cost | Order Priority |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32298 | CA-2012-124891 | 31-07-2012 | 31-07-2012 | Same Day | RH-19495 | Rick Hansen | Consumer | New York City | New York | ... | TEC-AC-10003033 | Technology | Accessories | Plantronics CS510 - Over-the-Head monaural Wir... | 2309.65 | 7 | 0.0 | 762.1845 | 933.57 | Critical |
| 1 | 26341 | IN-2013-77878 | 05-02-2013 | 07-02-2013 | Second Class | JR-16210 | Justin Ritter | Corporate | Wollongong | New South Wales | ... | FUR-CH-10003950 | Furniture | Chairs | Novimex Executive Leather Armchair, Black | 3709.395 | 9 | 0.1 | -288.765 | 923.63 | Critical |
| 2 | 25330 | IN-2013-71249 | 17-10-2013 | 18-10-2013 | First Class | CR-12730 | Craig Reiter | Consumer | Brisbane | Queensland | ... | TEC-PH-10004664 | Technology | Phones | Nokia Smart Phone, with Caller ID | 5175.171 | 9 | 0.1 | 919.971 | 915.49 | Medium |
| 3 | 13524 | ES-2013-1579342 | 28-01-2013 | 30-01-2013 | First Class | KM-16375 | Katherine Murray | Home Office | Berlin | Berlin | ... | TEC-PH-10004583 | Technology | Phones | Motorola Smart Phone, Cordless | 2892.51 | 5 | 0.1 | -96.54 | 910.16 | Medium |
| 4 | 47221 | SG-2013-4320 | 05-11-2013 | 06-11-2013 | Same Day | RH-9495 | Rick Hansen | Consumer | Dakar | Dakar | ... | TEC-SHA-10000501 | Technology | Copiers | Sharp Wireless Fax, High-Speed | 2832.96 | 8 | 0.0 | 311.52 | 903.04 | Critical |


- Let's use the `info()` method to check the column types and non-null counts.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51290 entries, 0 to 51289
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Row ID          51290 non-null  int64  
 1   Order ID        51290 non-null  object 
 2   Order Date      51290 non-null  object 
 3   Ship Date       51290 non-null  object 
 4   Ship Mode       51290 non-null  object 
 5   Customer ID     51290 non-null  object 
 6   Customer Name   51290 non-null  object 
 7   Segment         51290 non-null  object 
 8   City            51290 non-null  object 
 9   State           51290 non-null  object 
 10  Country         51290 non-null  object 
 11  Postal Code     9994 non-null   float64
 12  Market          51290 non-null  object 
 13  Region          51290 non-null  object 
 14  Product ID      51290 non-null  object 
 15  Category        51290 non-null  object 
 16  Sub-Category    51290 non-null  object 
 17  Product Name    51290 non-null 

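- The dataset has 51,290 rows and 24 columns. Only a handful are numeric: `Row ID` and `Quantity` are integers, while `Postal Code`, `Sales`, `Discount`, `Profit`, and `Shipping Cost` are floats.
  All remaining columns, including the two date columns, are stored with the *object* data type.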
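A quick way to tally columns by data type, instead of reading them off the `info()` listing, is sketched below. The two-row frame is a stand-in built from values in the preview above; `dtype_counts` and `numeric_columns` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Small stand-in frame; on the real data, use the loaded `df` instead.
sample = pd.DataFrame({
    'Row ID': [32298, 26341],
    'Order Date': ['31-07-2012', '05-02-2013'],  # still plain strings here
    'Sales': [2309.65, 3709.395],
    'Quantity': [7, 9],
})

# Number of columns per dtype, keyed by the dtype's name.
dtype_counts = sample.dtypes.astype(str).value_counts().to_dict()

# Names of all numeric (integer or float) columns.
numeric_columns = sample.select_dtypes(include='number').columns.tolist()

print(dtype_counts)
print(numeric_columns)
```

Running the same two lines on the full DataFrame gives a compact cross-check of the dtype summary.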


In [7]:
df.isnull().sum()

Row ID                0
Order ID              0
Order Date            0
Ship Date             0
Ship Mode             0
Customer ID           0
Customer Name         0
Segment               0
City                  0
State                 0
Country               0
Postal Code       41296
Market                0
Region                0
Product ID            0
Category              0
Sub-Category          0
Product Name          0
Sales                 0
Quantity              0
Discount              0
Profit                0
Shipping Cost         0
Order Priority        0
dtype: int64

In [8]:
df.columns

Index(['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode',
       'Customer ID', 'Customer Name', 'Segment', 'City', 'State', 'Country',
       'Postal Code', 'Market', 'Region', 'Product ID', 'Category',
       'Sub-Category', 'Product Name', 'Sales', 'Quantity', 'Discount',
       'Profit', 'Shipping Cost', 'Order Priority'],
      dtype='object')

## Column Names in the Dataset
---

- These are our column names; only the **Postal Code** column has missing values.

- It is missing 41,296 of 51,290 values (about 80%), so we need to deal with it.

- Let's drop this column: it is not important for our analysis, and most of its values are missing.
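The decision to drop a mostly empty column can also be made programmatically. The sketch below computes the share of missing values per column and drops anything above a cut-off; the sample rows (including the postal code) are made up, and the 50% `threshold` is an assumption rather than a rule from the original analysis.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
sample = pd.DataFrame({
    'City': ['Berlin', 'Dakar', 'Brisbane', 'New York City', 'Wollongong'],
    'Postal Code': [np.nan, np.nan, np.nan, 10024.0, np.nan],
    'Sales': [2892.51, 2832.96, 5175.17, 2309.65, 3709.40],
})

# Fraction of missing values in each column (0.0 = complete, 1.0 = empty).
missing_share = sample.isnull().mean()

threshold = 0.5  # hypothetical cut-off: drop columns that are more than half empty
to_drop = missing_share[missing_share > threshold].index.tolist()

# drop() returns a new frame, leaving `sample` untouched.
cleaned = sample.drop(columns=to_drop)
print(missing_share)
print(cleaned.columns.tolist())
```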

In [9]:
df.drop(['Postal Code'], axis=1, inplace=True)
df.columns


Index(['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode',
       'Customer ID', 'Customer Name', 'Segment', 'City', 'State', 'Country',
       'Market', 'Region', 'Product ID', 'Category', 'Sub-Category',
       'Product Name', 'Sales', 'Quantity', 'Discount', 'Profit',
       'Shipping Cost', 'Order Priority'],
      dtype='object')

In [10]:
df.isnull().sum()

Row ID            0
Order ID          0
Order Date        0
Ship Date         0
Ship Mode         0
Customer ID       0
Customer Name     0
Segment           0
City              0
State             0
Country           0
Market            0
Region            0
Product ID        0
Category          0
Sub-Category      0
Product Name      0
Sales             0
Quantity          0
Discount          0
Profit            0
Shipping Cost     0
Order Priority    0
dtype: int64

- As the output shows, no missing values remain, so the data is clean.
- Convert dates: parse the 'Order Date' column as datetime and extract the year for later use.

In [11]:
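# The dates are stored day-first (e.g. 31-07-2012), so state the format explicitly;
# otherwise ambiguous values such as 05-02-2013 may be read as May 2 instead of February 5.
df['Order Date'] = pd.to_datetime(df['Order Date'], format='%d-%m-%Y')
df['Year'] = df['Order Date'].dt.year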
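Because the dates in this file are written day-first (DD-MM-YYYY), it is worth checking how they are parsed. The sketch below uses three date strings taken from the preview above and an explicit `format`, so a value like `05-02-2013` is read as 5 February rather than 2 May. The variable names are illustrative.

```python
import pandas as pd

# Three 'Order Date' strings from the preview, all in DD-MM-YYYY form.
raw_dates = pd.Series(['31-07-2012', '05-02-2013', '17-10-2013'])

# An explicit format removes any day/month ambiguity.
parsed = pd.to_datetime(raw_dates, format='%d-%m-%Y')

years = parsed.dt.year.tolist()
months = parsed.dt.month.tolist()
print(parsed)
print(years, months)
```

Passing `dayfirst=True` instead of `format` is another option; the explicit format is stricter because it raises an error on any value that does not match.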


## Step 3: Build the Streamlit Dashboard
---
- The Streamlit dashboard itself will be created in a separate `app.py` file.
- Before that, we can explore the data visually right here in the notebook.
- Let's start by visualizing sales by category.

In [None]:
import seaborn as sns
# Group data by category and sum the sales
category_sales = df.groupby('Category')['Sales'].sum()

# Set the color palette
sns.set_palette("pastel")

# Create a pie chart
plt.figure(figsize=(10, 6))
plt.pie(category_sales, labels=category_sales.index, autopct='%1.1f%%', startangle=140)
plt.title('Sales Distribution by Category')
plt.axis('equal')  # Equal aspect ratio ensures that pie chart is a circle.
# Save the figure
plt.savefig('sales_distribution_by_category.png')


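For the `app.py` file mentioned above, a minimal sketch of the dashboard could look like the following. This is one possible layout under stated assumptions, not the finished app: `compute_kpis` and `render_dashboard` are hypothetical helper names, the choice of KPIs and the `Market` filter are illustrative, and the Streamlit and Plotly imports are kept inside the rendering function so the KPI helper can be exercised on its own.

```python
import pandas as pd


def compute_kpis(df: pd.DataFrame) -> dict:
    """Return headline KPIs for the (already filtered) orders."""
    total_sales = float(df['Sales'].sum())
    total_profit = float(df['Profit'].sum())
    # Guard against division by zero when the filter leaves no rows.
    margin = total_profit / total_sales if total_sales else 0.0
    return {
        'total_sales': total_sales,
        'total_profit': total_profit,
        'profit_margin': margin,
    }


def render_dashboard(df: pd.DataFrame) -> None:
    """Draw the page: a sidebar filter, three KPI tiles, and one bar chart."""
    import plotly.express as px
    import streamlit as st

    st.title('Global Superstore Dashboard')

    # Sidebar filter on the Market column (all markets selected by default).
    markets = sorted(df['Market'].unique())
    selected = st.sidebar.multiselect('Market', markets, default=markets)
    filtered = df[df['Market'].isin(selected)]

    kpis = compute_kpis(filtered)
    col_sales, col_profit, col_margin = st.columns(3)
    col_sales.metric('Total Sales', f"{kpis['total_sales']:,.0f}")
    col_profit.metric('Total Profit', f"{kpis['total_profit']:,.0f}")
    col_margin.metric('Profit Margin', f"{kpis['profit_margin']:.1%}")

    by_category = filtered.groupby('Category', as_index=False)['Sales'].sum()
    st.plotly_chart(px.bar(by_category, x='Category', y='Sales',
                           title='Sales by Category'))


# In app.py, finish with the two lines below and launch with `streamlit run app.py`:
# df = pd.read_csv('./Global_Superstore.csv', encoding='ISO-8859-1')
# render_dashboard(df)
```

Keeping the KPI arithmetic in a plain function means it can be checked in the notebook before any Streamlit code runs.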

