# Data Understanding & Validation

This notebook focuses on understanding the structure and quality of the raw online retail dataset.
Key activities include validating data types, identifying missing values, detecting invalid records
(negative quantities, zero or negative prices), and distinguishing sales from returns.

The goal is to establish clear business rules and data quality assumptions before performing
any cleaning or analysis.


In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("online_retail.csv")

In [4]:
df.head()


index,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [5]:
df.info()


<class 'pandas.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  str    
 1   StockCode    541909 non-null  str    
 2   Description  540455 non-null  str    
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  str    
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  str    
dtypes: float64(2), int64(1), str(5)
memory usage: 33.1 MB
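
The `df.info()` output implies 1,454 missing `Description` values (541,909 − 540,455) and 135,080 missing `CustomerID` values (541,909 − 406,829), and it shows `InvoiceDate` stored as a string. The following is a minimal sketch of how these points might be checked, run on a small made-up frame (`sample` and all of its values are hypothetical, not rows from the dataset):

```python
import pandas as pd

# Hypothetical toy frame reusing the dataset's column names; values are made up.
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "C536379"],
    "Description": ["WHITE METAL LANTERN", None, "DISCOUNT"],
    "InvoiceDate": ["2010-12-01 08:26:00", "2010-12-01 08:28:00", "not a date"],
    "CustomerID": [17850.0, None, 14527.0],
})

# Count missing values per column.
missing_counts = sample.isna().sum()

# Parse timestamps; errors="coerce" turns unparseable strings into NaT
# so they can be counted instead of raising an exception.
parsed_dates = pd.to_datetime(
    sample["InvoiceDate"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)
unparseable = int(parsed_dates.isna().sum())

# CustomerID is float only because it contains missing values;
# the nullable Int64 dtype keeps integer IDs alongside <NA>.
customer_id = sample["CustomerID"].astype("Int64")

print(missing_counts)
print("Unparseable dates:", unparseable)
print(customer_id)
```

On the real data the same calls would be applied to `df`; whether to convert the columns at this stage or during cleaning is a design choice left open here.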


In [6]:
df['Quantity'].describe()

count    541909.000000
mean          9.552250
std         218.081158
min      -80995.000000
25%           1.000000
50%           3.000000
75%          10.000000
max       80995.000000
Name: Quantity, dtype: float64

### Quantity Column Observations

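- Quantity has no missing values (541,909 non-null)
- Values range from -80,995 to 80,995; the matching magnitudes suggest a single bulk order that was later reversed, which should be verified
- Negative quantities indicate returns, cancellations, or other stock adjustments
- The standard deviation (218.08) is far larger than the mean (9.55), so a few extreme values dominate the spread
- The median (3) is more representative of a typical line item than the mean (9.55)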
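
To make the observations above concrete, here is a minimal sketch that counts quantities by sign and flags extreme values with Tukey's 1.5 × IQR fences. The fence rule is an assumption chosen for illustration, not a rule stated in this notebook, and the `quantity` series contains made-up values:

```python
import pandas as pd

# Hypothetical quantities; only the two extremes echo the real min/max.
quantity = pd.Series([6, 8, -1, 3, 80995, -80995, 1, 10, 2], name="Quantity")

# How many rows are negative, zero, or positive.
sign_counts = pd.Series({
    "negative": int((quantity < 0).sum()),
    "zero": int((quantity == 0).sum()),
    "positive": int((quantity > 0).sum()),
})

# Tukey fences: values beyond 1.5 * IQR from the quartiles are flagged.
q1 = quantity.quantile(0.25)
q3 = quantity.quantile(0.75)
iqr = q3 - q1
outliers = quantity[(quantity < q1 - 1.5 * iqr) | (quantity > q3 + 1.5 * iqr)]

print(sign_counts)
print(outliers)
```

Flagged rows are candidates for inspection rather than automatic removal, since a genuine bulk order would also land outside the fences.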


In [7]:
df['UnitPrice'].describe()

count    541909.000000
mean          4.611114
std          96.759853
min      -11062.060000
25%           1.250000
50%           2.080000
75%           4.130000
max       38970.000000
Name: UnitPrice, dtype: float64

### UnitPrice Column Observations

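- UnitPrice has no missing values
- Minimum price is negative (-11,062.06), which is not valid for a product price
- Standard deviation (96.76) is roughly 21 times the mean (4.61), indicating extreme values
- Median price (2.08) is less than half the mean (4.61) and the maximum is 38,970, suggesting a skewed distribution with a long right tail
- Data type is float64, which is adequate for exploration; binary floats cannot represent every decimal amount exactly, so exact revenue totals may need rounding or a fixed-point representation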



In [8]:
(df['UnitPrice'] < 0).sum()


np.int64(2)

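- Only 2 of the 541,909 rows contain negative UnitPrice values
- Their rarity suggests data entry errors, system errors, or manual accounting adjustments rather than product sales; the rows should be inspected before a rule is fixed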
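
With so few rows involved, looking at them directly is cheaper than inferring their cause. A minimal sketch, using a made-up frame (`sample`, the `A000001` invoice number and its description are hypothetical placeholders, not values from the dataset):

```python
import pandas as pd

# Hypothetical rows; the negative-price entry is invented for illustration.
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "A000001", "536370"],
    "StockCode": ["71053", "B", "22728"],
    "Description": ["WHITE METAL LANTERN", "hypothetical adjustment", "ALARM CLOCK"],
    "Quantity": [6, 1, 24],
    "UnitPrice": [3.39, -100.00, 3.75],
})

# Show negative-price rows with enough context to judge what they are.
context_columns = ["InvoiceNo", "StockCode", "Description", "Quantity", "UnitPrice"]
negative_price_rows = sample.loc[sample["UnitPrice"] < 0, context_columns]

print(negative_price_rows)
```

On the real data the same filter on `df` would show whether the invoice number, stock code, or description explains the negative price.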


In [9]:
(df['UnitPrice'] == 0).sum()


np.int64(2515)

- 2,515 rows have UnitPrice equal to 0
- Zero-priced items require further investigation before revenue calculations


In [10]:
df.loc[df['UnitPrice'] == 0, ['Quantity']].describe()

statistic,Quantity
count,2515.0
mean,-53.421074
std,540.206783
min,-9600.0
25%,-32.0
50%,-1.0
75%,3.0
max,12540.0


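- The median Quantity of zero-priced rows is -1, so at least half of them have a negative quantity
- The 75th percentile is 3 and the maximum is 12,540, so more than a quarter of zero-priced rows have a positive quantity
- The negative-quantity rows likely represent returns, write-offs, or stock adjustments; this should be verified before they are labelled as returns
- Zero-priced rows contribute nothing to revenue and should be excluded from revenue calculations, but may be retained for return analysis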
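
Because the `describe()` output shows both negative and positive quantities among zero-priced rows, a breakdown by quantity sign and by cancellation prefix may be more informative than the median alone. The sketch below is one way to do this on made-up data (`sample` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: three zero-priced, one cancellation, one normal sale.
sample = pd.DataFrame({
    "InvoiceNo": ["536414", "536545", "536550", "C536379", "536365"],
    "Quantity": [56, -10, -1, -1, 6],
    "UnitPrice": [0.0, 0.0, 0.0, 27.5, 2.55],
})

zero_price = sample[sample["UnitPrice"] == 0]

# Label each zero-priced row by the sign of its quantity.
quantity_sign = (
    np.sign(zero_price["Quantity"])
    .map({-1: "negative", 0: "zero", 1: "positive"})
    .rename("quantity_sign")
)

# Flag rows whose invoice number carries the cancellation prefix.
is_cancelled = zero_price["InvoiceNo"].str.startswith("C").rename("is_cancelled")

# Cross-tabulate the two labels.
breakdown = pd.crosstab(quantity_sign, is_cancelled)
print(breakdown)
```

If the real data yields no `True` column, the zero-priced rows are not cancellations in the invoice-number sense and need their own business rule.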


In [11]:
df['InvoiceNo'].str.startswith('C').sum()


np.int64(9288)

### Cancelled Invoices Observation

- 9,288 rows (line items) have an InvoiceNo starting with the letter "C"; this is a row count, not a count of distinct invoices
- These represent cancelled transactions or credit notes
- Cancelled invoices must be handled separately to avoid distorting sales and revenue analysis
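
Note that `str.startswith('C').sum()` counts rows, and one invoice can span several rows. A minimal sketch of the difference between counting rows and counting distinct invoices, on made-up invoice numbers:

```python
import pandas as pd

# Hypothetical invoice numbers; each repeat is another line item on the same invoice.
invoice_no = pd.Series(
    ["536365", "536365", "C536379", "C536383", "C536383", "C536383"],
    name="InvoiceNo",
)

cancelled_mask = invoice_no.str.startswith("C")

# Rows (line items) that belong to cancelled invoices.
cancelled_rows = int(cancelled_mask.sum())

# Distinct cancelled invoices.
cancelled_invoices = int(invoice_no[cancelled_mask].nunique())

print(cancelled_rows, cancelled_invoices)
```

Reporting both figures avoids overstating how many cancellations occurred.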


In [12]:
df.loc[df['InvoiceNo'].str.startswith('C'), ['Quantity', 'UnitPrice']].describe()


statistic,Quantity,UnitPrice
count,9288.0,9288.0
mean,-29.885228,48.393661
std,1145.786965,666.60043
min,-80995.0,0.01
25%,-6.0,1.45
50%,-2.0,2.95
75%,-1.0,5.95
max,-1.0,38970.0


### Cancelled Invoice Analysis

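- Rows on cancelled invoices have strictly negative quantities (median = -2)
- Maximum quantity for cancelled invoices is -1, confirming no sales occur in these rows
- UnitPrice is always positive here (minimum 0.01), so none of the 2,515 zero-priced rows belong to cancelled invoices; the median price (2.95) is typical, although the mean (48.39) and maximum (38,970) show a few very large entries
- The quantity sign marks these rows as non-sales, but zero-priced rows also contain negative quantities, so the "C" prefix and the quantity sign should be checked together rather than either alone
- Cancelled invoices should be excluded from sales analysis and handled separately as returns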
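
The observations in this notebook could be combined into a single row classification. The sketch below is one possible encoding, not a rule set defined by this notebook: the `row_type` column, its label names, and the precedence of the conditions are all assumptions, and `sample` holds made-up rows.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering each case discussed above.
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536414", "536545", "536370", "536600"],
    "Quantity": [6, -1, 56, -10, 24, -5],
    "UnitPrice": [2.55, 27.50, 0.0, 0.0, 3.75, 1.25],
})

is_cancelled = sample["InvoiceNo"].str.startswith("C")

# Conditions are evaluated in order; the first match wins.
conditions = [
    is_cancelled,              # "C" prefix: cancellation
    sample["UnitPrice"] <= 0,  # zero or negative price
    sample["Quantity"] <= 0,   # non-positive quantity without a "C" prefix
]
labels = ["cancellation", "zero_or_negative_price", "negative_quantity"]
sample["row_type"] = np.select(conditions, labels, default="sale")

# Revenue is computed from rows classified as sales only.
sales = sample[sample["row_type"] == "sale"]
gross_revenue = (sales["Quantity"] * sales["UnitPrice"]).sum()

print(sample[["InvoiceNo", "row_type"]])
print(round(gross_revenue, 2))
```

Keeping the label in a column, instead of dropping rows, leaves cancellations and adjustments available for a later return analysis.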
