# Pandas GroupBy & Aggregation

## Objective
Analyze cleaned sales data using aggregation techniques to answer
key business questions and identify performance trends.

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv("data/superstore_cleaned.csv",encoding="latin1")

In [6]:
df.head()

|  | order_id | order_date | ship_date | ship_mode | customer_id | customer_name | segment | country | city | state | product_id | category | sub-category | product_name | sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CA-2017-152156 | 2017-11-08 | 2017-11-11 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | Kentucky | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.96 |
| 1 | CA-2017-152156 | 2017-11-08 | 2017-11-11 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | Kentucky | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.94 |
| 2 | CA-2017-138688 | 2017-06-12 | 2017-06-16 | Second Class | DV-13045 | Darrin Van Huff | Corporate | United States | Los Angeles | California | OFF-LA-10000240 | Office Supplies | Labels | Self-Adhesive Address Labels for Typewriters b... | 14.62 |
| 3 | US-2016-108966 | 2016-10-11 | 2016-10-18 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | Florida | FUR-TA-10000577 | Furniture | Tables | Bretford CR4500 Series Slim Rectangular Table | 957.5775 |
| 4 | US-2016-108966 | 2016-10-11 | 2016-10-18 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | Florida | OFF-ST-10000760 | Office Supplies | Storage | Eldon Fold 'N Roll Cart System | 22.368 |


The cleaned dataset is used for this analysis to ensure accurate aggregations and reliable business insights.

## Analysis Approach
The data is grouped by business dimensions (category, sub-category, segment, and state) to evaluate sales performance at each level.

In [56]:
df.groupby("category")["sales"].sum().sort_values(ascending=False)

category
Technology         825070.4070
Furniture          720038.5927
Office Supplies    702073.1650
Name: sales, dtype: float64

In [57]:
df.groupby("category")["sales"].count().sort_values(ascending=False)

category
Office Supplies    5871
Furniture          2063
Technology         1804
Name: sales, dtype: int64
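
Dividing each category's total sales by its row count gives the average value per order line, which helps explain why Technology can lead on value while having the fewest rows. A quick check using only the figures printed above (the variable names are illustrative, not part of the original notebook):

```python
import pandas as pd

# Totals and row counts copied from the two outputs above.
total_sales = pd.Series(
    {"Technology": 825070.4070, "Furniture": 720038.5927, "Office Supplies": 702073.1650}
)
row_counts = pd.Series(
    {"Technology": 1804, "Furniture": 2063, "Office Supplies": 5871}
)

# Series align on their index labels, so the division is per category.
avg_per_line = (total_sales / row_counts).sort_values(ascending=False)
print(avg_per_line.round(2))
```

On the full DataFrame the same figure should come from `df.groupby("category")["sales"].mean()`, provided `sales` has no missing values.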

In [50]:
df.groupby("sub-category")["sales"].count().sort_values()

sub-category
Copiers          66
Machines        115
Supplies        183
Fasteners       212
Bookcases       224
Envelopes       247
Tables          311
Labels          354
Appliances      457
Chairs          603
Accessories     752
Art             779
Storage         826
Phones          871
Furnishings     925
Paper          1333
Binders        1480
Name: sales, dtype: int64

In [30]:
df.groupby("segment")["sales"].count()

segment
Consumer       5068
Corporate      2933
Home Office    1737
Name: sales, dtype: int64

In [51]:
df.groupby("state")["sales"].sum().sort_values(ascending=False)

state
California              442579.1485
New York                306209.0070
Texas                   166356.8382
Washington              135022.0020
Pennsylvania            111986.0970
Florida                  87311.0440
Illinois                 79213.6130
Michigan                 75644.5240
Ohio                     74971.9820
Virginia                 70532.7100
North Carolina           55156.5720
Indiana                  48680.1800
Georgia                  48219.1100
Kentucky                 36458.3900
Arizona                  35249.4570
New Jersey               34604.4320
Colorado                 31266.6620
Wisconsin                31160.2300
Tennessee                30661.8730
Minnesota                29821.9500
Massachusetts            28634.4340
Delaware                 27322.9990
Maryland                 23705.5230
Rhode Island             22525.0260
Missouri                 22195.4700
Oklahoma                 19683.3900
Alabama                  19302.4800
Nevada                
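
A long state ranking is often easier to read when trimmed to the top few entries with each one's share of the total. The sketch below shows one way to do that with `nlargest`. It runs on the five rows shown by `df.head()` rather than the full file, so the numbers it prints are not the dataset's real state totals.

```python
import pandas as pd

# Assumption: only the five df.head() rows, reduced to the columns needed.
sample = pd.DataFrame({
    "state": ["Kentucky", "Kentucky", "California", "Florida", "Florida"],
    "sales": [261.96, 731.94, 14.62, 957.5775, 22.368],
})

state_sales = sample.groupby("state")["sales"].sum()

# Keep the two largest totals and express each as a share of all sales.
top_states = state_sales.nlargest(2).to_frame("total_sales")
top_states["share_pct"] = (top_states["total_sales"] / state_sales.sum() * 100).round(1)
print(top_states)
```

Replacing `sample` with `df` and `2` with the desired cutoff would apply the same steps to the full dataset.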

In [48]:
df.groupby(["category","sub-category"])["sales"].mean().reset_index().sort_values(["category","sub-category"],ascending=False)

|  | category | sub-category | sales |
|---|---|---|---|
| 16 | Technology | Phones | 373.944744 |
| 15 | Technology | Machines | 1645.553313 |
| 14 | Technology | Copiers | 2215.880212 |
| 13 | Technology | Accessories | 217.92262 |
| 12 | Office Supplies | Supplies | 253.603049 |
| 11 | Office Supplies | Storage | 263.95269 |
| 10 | Office Supplies | Paper | 57.488759 |
| 9 | Office Supplies | Labels | 33.441684 |
| 8 | Office Supplies | Fasteners | 14.091415 |
| 7 | Office Supplies | Envelopes | 65.23387 |


In [33]:
df.groupby(["category","segment"])["sales"].count()

category         segment    
Furniture        Consumer       1084
                 Corporate       625
                 Home Office     354
Office Supplies  Consumer       3054
                 Corporate      1769
                 Home Office    1048
Technology       Consumer        930
                 Corporate       539
                 Home Office     335
Name: sales, dtype: int64

**Insights:**
- Technology generates the highest total sales (about 825,070), while Office Supplies has by far the most order lines (5,871 rows versus 1,804 for Technology).
- The top three states by total sales are California (about 442,579), New York (about 306,209), and Texas (about 166,357).
- The Consumer segment has the most order lines in every category (1,084 in Furniture, 3,054 in Office Supplies, and 930 in Technology). These are row counts; revenue by segment is not computed above.
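
To compare segments by revenue rather than by row count, the same category and segment grouping can sum `sales` instead of counting rows. The following is a minimal sketch on the five `df.head()` rows only, so it demonstrates the technique and says nothing about the full dataset.

```python
import pandas as pd

# Assumption: only the five df.head() rows, reduced to the columns needed.
sample = pd.DataFrame({
    "category": ["Furniture", "Furniture", "Office Supplies", "Furniture", "Office Supplies"],
    "segment": ["Consumer", "Consumer", "Corporate", "Consumer", "Consumer"],
    "sales": [261.96, 731.94, 14.62, 957.5775, 22.368],
})

# Sum sales per (category, segment), then pivot segments into columns.
revenue = (
    sample.groupby(["category", "segment"])["sales"]
    .sum()
    .unstack(fill_value=0)
)

# Segment with the highest revenue in each category.
top_segment = revenue.idxmax(axis=1)
print(revenue)
print(top_segment)
```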

The data is grouped by category, sub-category, segment, and state to evaluate performance at several levels.
Aggregations such as total sales, average sales, and row counts identify high- and low-performing areas and support data-driven decisions.
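
Total, average, and count can also be computed in a single pass with named aggregation instead of separate `groupby` calls. A small sketch follows, again using only the five `df.head()` rows; the output column names (`total_sales`, `avg_sales`, `order_lines`) are illustrative choices, not names from the original notebook.

```python
import pandas as pd

# Assumption: only the five df.head() rows, reduced to the columns needed.
sample = pd.DataFrame({
    "category": ["Furniture", "Furniture", "Office Supplies", "Furniture", "Office Supplies"],
    "sales": [261.96, 731.94, 14.62, 957.5775, 22.368],
})

# Each keyword names an output column and maps it to an aggregation.
summary = (
    sample.groupby("category")["sales"]
    .agg(total_sales="sum", avg_sales="mean", order_lines="count")
    .sort_values("total_sales", ascending=False)
)
print(summary)
```

Keeping the three measures side by side makes it easier to see when a category ranks high on total sales mainly because of volume rather than value per line.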

## GroupBy Summary
- Sales performance varies across product categories, segments, and states.
- Technology leads total sales (about 825,070), followed by Furniture (about 720,039) and Office Supplies (about 702,073).
- Home Office has the fewest order lines of the three segments (1,737), and total sales drop quickly below the top few states, so these areas may need focused attention.
- Aggregation-based analysis helps identify priority areas for business growth and optimization.