# Sales Performance Analysis Project

## Project Overview
This project analyzes retail sales data to uncover insights about:
- Sales trends and patterns
- Product performance
- Customer behavior
- Geographic distribution

## Dataset
- Source: Online Retail Dataset (UCI ML Repository)
- Sample: the first 25,000 rows of the file (one row per invoice line item)
- Columns: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country

## Data Loading

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [2]:

use_columns = ["InvoiceNo", "StockCode", "Description", "Quantity", "InvoiceDate", "UnitPrice", "CustomerID", "Country"]
sales = pd.read_excel("Online Retail.xlsx", usecols=use_columns, nrows=25_000)

### Checking Data Structure


In [3]:
sales.head(10)

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
5,536365,22752,SET 7 BABUSHKA NESTING BOXES,2,2010-12-01 08:26:00,7.65,17850.0,United Kingdom
6,536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,2010-12-01 08:26:00,4.25,17850.0,United Kingdom
7,536366,22633,HAND WARMER UNION JACK,6,2010-12-01 08:28:00,1.85,17850.0,United Kingdom
8,536366,22632,HAND WARMER RED POLKA DOT,6,2010-12-01 08:28:00,1.85,17850.0,United Kingdom
9,536367,84879,ASSORTED COLOUR BIRD ORNAMENT,32,2010-12-01 08:34:00,1.69,13047.0,United Kingdom


In [4]:
sales.tail(10)

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
24990,538351,21976,PACK OF 60 MUSHROOM CAKE CASES,6,2010-12-10 15:17:00,2.13,,United Kingdom
24991,538351,21989,PACK OF 20 SKULL PAPER NAPKINS,2,2010-12-10 15:17:00,2.51,,United Kingdom
24992,538351,22069,BROWN PIRATE TREASURE CHEST,2,2010-12-10 15:17:00,4.21,,United Kingdom
24993,538351,22075,6 RIBBONS ELEGANT CHRISTMAS,2,2010-12-10 15:17:00,4.21,,United Kingdom
24994,538351,22077,6 RIBBONS RUSTIC CHARM,11,2010-12-10 15:17:00,4.21,,United Kingdom
24995,538351,22078,RIBBON REEL LACE DESIGN,2,2010-12-10 15:17:00,4.81,,United Kingdom
24996,538351,22080,RIBBON REEL POLKADOTS,1,2010-12-10 15:17:00,4.21,,United Kingdom
24997,538351,22081,RIBBON REEL FLORA + FAUNA,1,2010-12-10 15:17:00,4.21,,United Kingdom
24998,538351,22082,RIBBON REEL STRIPES DESIGN,3,2010-12-10 15:17:00,4.21,,United Kingdom
24999,538351,22083,PAPER CHAIN KIT RETROSPOT,23,2010-12-10 15:17:00,6.77,,United Kingdom


In [5]:

sales.shape

(25000, 8)

In [6]:
sales.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

In [7]:
sales.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object

In [8]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   InvoiceNo    25000 non-null  object        
 1   StockCode    25000 non-null  object        
 2   Description  24889 non-null  object        
 3   Quantity     25000 non-null  int64         
 4   InvoiceDate  25000 non-null  datetime64[ns]
 5   UnitPrice    25000 non-null  float64       
 6   CustomerID   16056 non-null  float64       
 7   Country      25000 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 1.5+ MB


### Unique Value Counts


In [9]:
sales['InvoiceNo'].nunique(), sales['StockCode'].nunique(), sales['Country'].nunique(), sales['CustomerID'].nunique()


(1166, 2540, 18, 654)

## Data Quality Check

In [10]:
sales['InvoiceDate'].min(), sales['InvoiceDate'].max()

(Timestamp('2010-12-01 08:26:00'), Timestamp('2010-12-10 15:17:00'))

In [11]:
sales['Quantity'].describe()

count    25000.000000
mean         7.252680
std         70.411142
min      -9360.000000
25%          1.000000
50%          2.000000
75%          6.000000
max       2880.000000
Name: Quantity, dtype: float64

In [12]:
sales['UnitPrice'].describe()

count    25000.000000
mean         7.108175
std        181.004981
min          0.000000
25%          1.450000
50%          2.510000
75%          4.250000
max      13541.330000
Name: UnitPrice, dtype: float64

In [13]:
sales['CustomerID'].isnull().sum()

np.int64(8944)
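
The raw count is easier to judge as a share of all rows. A minimal sketch, assuming a small stand-in frame (`demo` is hypothetical and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `sales`; only CustomerID matters here.
demo = pd.DataFrame({"CustomerID": [101.0, np.nan, 102.0, np.nan, np.nan]})

# isna() yields booleans, so mean() is the fraction of missing values.
missing_share = demo["CustomerID"].isna().mean()
print(f"{missing_share:.1%} of rows have no CustomerID")
```

Applied to the real frame, the same expression on `sales['CustomerID']` should give 8944 / 25000, or about 0.358.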


  
### Initial Observations

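- The loaded sample contains 25,000 rows (invoice line items) and 8 columns.
- The `InvoiceDate` column is parsed as `datetime64[ns]`; the sample spans 2010-12-01 to 2010-12-10.
- Missing values are present in `Description` (111 rows) and `CustomerID` (8,944 rows).
- Reported memory usage is at least **1.5 MB**; the `+` in the `info()` output means the contents of the object columns are not fully counted.
- There are 1,166 unique invoices, 2,540 unique stock codes, 18 unique countries, and 654 unique customers (missing IDs excluded).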
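
The memory figure reported by `info()` is a lower bound when object columns are present. As a rough sketch of how the shallow and deep measurements differ, assuming a tiny stand-in frame (`demo` is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in frame; the text column is forced to object dtype,
# matching the dtypes reported for the real data.
demo = pd.DataFrame({
    "Description": pd.Series(["SAMPLE ITEM A", "SAMPLE ITEM B"], dtype=object),
    "Quantity": [6, 6],
})

# Shallow: object columns are counted as 8-byte pointers only.
shallow = demo.memory_usage(deep=False).sum()
# Deep: the Python string objects behind those pointers are counted too.
deep = demo.memory_usage(deep=True).sum()

print(shallow, deep)
```

On the real frame, `sales.info(memory_usage="deep")` reports the deep figure directly.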

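### Data Quality Issues Identified
- Negative quantities (likely returns or cancellations): the minimum is -9,360.
- Zero unit prices are present (minimum 0.00) and need investigation.
- 35.8% of rows (8,944 of 25,000) are missing `CustomerID`.
- Extreme values appear in both columns: `Quantity` ranges from -9,360 to 2,880 against a median of 2, and `UnitPrice` reaches 13,541.33 against a median of 2.51.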
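
The issues listed above can be turned into boolean flags and counted. This is a minimal sketch on made-up rows, not the notebook's own cleaning step. The `demo` frame and the flag names are hypothetical, and the "C" prefix check rests on the assumption that cancelled invoices carry an `InvoiceNo` starting with "C".

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the relevant columns; values are invented.
demo = pd.DataFrame({
    "InvoiceNo": [100001, "C100002", 100003, 100004],
    "Quantity": [6, -1, 56, 1],
    "UnitPrice": [2.55, 27.50, 0.0, 2.51],
    "CustomerID": [101.0, 102.0, np.nan, np.nan],
})

flags = pd.DataFrame({
    # Assumption: invoice numbers starting with "C" mark cancellations.
    "is_cancellation": demo["InvoiceNo"].astype(str).str.startswith("C"),
    "negative_qty": demo["Quantity"] < 0,
    "zero_price": demo["UnitPrice"] == 0,
    "missing_customer": demo["CustomerID"].isna(),
})

# Number of rows affected by each issue.
print(flags.sum())
```

Keeping the flags as separate columns leaves the original rows untouched, so the decision to drop or keep each group can be made later.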

