# Exploratory Data Analysis: Online Retail Dataset  
*Author: Pragun Sapotra*  
*Date: August 2025*

This notebook explores the "Online Retail" dataset to understand its structure, column data types, and basic statistics.  
It sets the foundation for subsequent data cleaning and analysis tasks.


### Importing libraries

In [3]:
import os
import pandas as pd

### Define file path to Excel dataset

In [28]:
base_dir = os.getcwd()
file_path = os.path.join(base_dir, "../Data/global_indicators_raw.xlsx")
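
A relative path built from `os.getcwd()` depends on where the notebook is launched, so a quick existence check before loading can save a confusing traceback later. The following is a minimal sketch, assuming the same `../Data/` layout as the cell above; the printed message is illustrative only.

```python
import os

base_dir = os.getcwd()

# Same relative location as in the cell above; normpath collapses the "../" segment
# so the printed path is easier to read.
file_path = os.path.normpath(
    os.path.join(base_dir, "../Data/global_indicators_raw.xlsx")
)

# Report a missing file early instead of failing inside pd.ExcelFile / pd.read_excel.
if not os.path.isfile(file_path):
    print(f"Dataset not found at: {file_path}")
```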

### Retrieving Sheet Names from the Excel File

In [9]:
xlsx = pd.ExcelFile(file_path)
print(xlsx.sheet_names)


['Online Retail']


### Load Data from 'Online Retail' Sheet  
Read the 'Online Retail' sheet into a DataFrame for analysis.


In [12]:
df = pd.read_excel(file_path, sheet_name='Online Retail')

### Exploring Data

In [23]:
df.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [30]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


### Initial Observations:
- The Description and CustomerID columns have missing values: Description is null in 1,454 of 541,909 rows (about 0.3%) and CustomerID in 135,080 rows (about 25%), so both will need attention during cleaning.
- CustomerID is stored as float64, which is unusual because IDs are usually strings or integers. I think this is caused by the missing values: NaN is itself a float, so pandas stores an integer column that contains NaNs as float.
- InvoiceNo has many repeated values (the first five rows shown above all share invoice 536365). This could mean multiple items per invoice or duplicate records, so I need to check that further.
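
The three observations above can be checked with a few pandas calls. The sketch below is only an illustration: it uses a small hypothetical `sample` DataFrame that mirrors some of the dataset's columns (the rows are made up, not read from the file), and the nullable `Int64` dtype is one possible way to store IDs with gaps, not necessarily the approach taken later in the cleaning step.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the dataset's columns (not real rows from the file).
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536365", "536366", "536367"],
    "StockCode": ["85123A", "71053", "71053", "22633", "84879"],
    "Description": ["WHITE HANGING HEART T-LIGHT HOLDER", "WHITE METAL LANTERN",
                    "WHITE METAL LANTERN", None, "ASSORTED COLOUR BIRD ORNAMENT"],
    "Quantity": [6, 6, 6, 6, 32],
    "CustomerID": [17850.0, 17850.0, 17850.0, np.nan, 13047.0],
})

# 1. Count missing values per column.
missing = sample.isna().sum()
print(missing)

# 2. Convert the float IDs to pandas' nullable integer dtype,
#    which keeps missing entries as <NA> instead of forcing float.
customer_id = sample["CustomerID"].astype("Int64")
print(customer_id.dtype)

# 3. Separate "many items on one invoice" from "fully duplicated rows".
items_per_invoice = sample.groupby("InvoiceNo").size()  # rows per invoice
exact_duplicates = sample.duplicated().sum()            # rows identical to an earlier row
print(items_per_invoice)
print(exact_duplicates)
```

On the real data, the same calls would be run on `df` instead of `sample`. A high row count per invoice alongside a low exact-duplicate count would point to multi-item invoices rather than repeated records.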
