# **Raw Data Ingestion**

## **Purpose**
- Read raw Excel sheets without modifying source data and produce a unified dataset for downstream processing.


### **1) Installing the Libraries**

In [1]:
%pip install openpyxl

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
from pathlib import Path
import openpyxl

### **2) Checking the Path**

In [15]:
RAW_DATA_PATH = Path("../data/raw/online_retail.xlsx")
RAW_DATA_PATH.exists()

True

In [16]:
PROCESSED_DATA_PATH = Path("../data/processed")
PROCESSED_DATA_PATH.exists()

True
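
A bare `exists()` call only displays `True` or `False`; it does not stop the notebook if the file is missing. One possible alternative is a small guard that fails early with a readable message. The helper name `require_file` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def require_file(path: Path) -> Path:
    """Return the resolved path, or raise if it does not point to an existing file."""
    resolved = path.resolve()
    if not resolved.is_file():
        # Fail before any expensive read is attempted.
        raise FileNotFoundError(f"Expected a file at {resolved}")
    return resolved
```

In this notebook it could be called as `require_file(RAW_DATA_PATH)` before the workbook is opened.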

### **3) Inspecting the sheets**

In [5]:
xls = pd.ExcelFile(RAW_DATA_PATH)
xls.sheet_names

['Year 2009-2010', 'Year 2010-2011']

In [6]:
# Reuse the ExcelFile handle opened above so the workbook is not reopened for each sheet
df_2009 = pd.read_excel(xls, sheet_name="Year 2009-2010")
df_2010 = pd.read_excel(xls, sheet_name="Year 2010-2011")

In [7]:
df_2009.shape, df_2010.shape

((525461, 8), (541910, 8))

In [8]:
df_2009.columns

Index(['Invoice', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'Price', 'Customer ID', 'Country'],
      dtype='object')

In [9]:
df_2010.columns

Index(['Invoice', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'Price', 'Customer ID', 'Country'],
      dtype='object')
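
Comparing the two `columns` outputs by eye works for eight columns, but the same check can be made explicit so that a mismatch stops the run. The following is a minimal sketch; `assert_same_columns` is a hypothetical helper, and it assumes at least one frame is passed.

```python
import pandas as pd


def assert_same_columns(*frames: pd.DataFrame) -> None:
    """Raise ValueError unless every frame has the same column names in the same order."""
    reference = list(frames[0].columns)
    for position, frame in enumerate(frames[1:], start=1):
        current = list(frame.columns)
        if current != reference:
            raise ValueError(
                f"Frame {position} columns {current} differ from frame 0 columns {reference}"
            )
```

Here it would be called as `assert_same_columns(df_2009, df_2010)` before the two sheets are combined.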

### **4) Combining the datasets**

In [10]:
df_raw = pd.concat([df_2009, df_2010], ignore_index=True)

### **5) Minimal sanity checks**

In [12]:
df_raw.shape

(1067371, 8)
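
The row count matches the sum of the two sheets (525461 + 541910 = 1067371). A sketch of how that comparison could be automated is shown below, using tiny stand-in frames instead of the real data; `check_row_count` is a hypothetical helper name.

```python
import pandas as pd


def check_row_count(parts: list[pd.DataFrame], combined: pd.DataFrame) -> int:
    """Return the expected row count, or raise if the combined frame does not match it."""
    expected = sum(len(part) for part in parts)
    if len(combined) != expected:
        raise ValueError(f"Expected {expected} rows, found {len(combined)}")
    return expected


# Stand-in frames for illustration only; the notebook would pass df_2009 and df_2010.
first = pd.DataFrame({"Invoice": ["a", "b"]})
second = pd.DataFrame({"Invoice": ["c"]})
combined = pd.concat([first, second], ignore_index=True)
check_row_count([first, second], combined)
```

With the notebook's variables the call would be `check_row_count([df_2009, df_2010], df_raw)`.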

In [11]:
df_raw.head()

|   | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country |
|---|---------|-----------|-------------|----------|-------------|-------|-------------|---------|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.1 | 13085.0 | United Kingdom |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom |


In [13]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067371 entries, 0 to 1067370
Data columns (total 8 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   Invoice      1067371 non-null  object        
 1   StockCode    1067371 non-null  object        
 2   Description  1062989 non-null  object        
 3   Quantity     1067371 non-null  int64         
 4   InvoiceDate  1067371 non-null  datetime64[ns]
 5   Price        1067371 non-null  float64       
 6   Customer ID  824364 non-null   float64       
 7   Country      1067371 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 65.1+ MB
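
The non-null counts above imply 4382 missing `Description` values (1067371 - 1062989) and 243007 missing `Customer ID` values (1067371 - 824364). A compact way to surface such gaps directly is sketched below. `missing_summary` is a hypothetical helper, the demo frame contains invented values, and the function assumes a non-empty frame.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """List columns that contain missing values, with counts and percentages."""
    missing = df.isna().sum()
    summary = pd.DataFrame(
        {"missing": missing, "pct": (missing / len(df) * 100).round(2)}
    )
    # Keep only columns with at least one gap, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Invented demo data, not taken from the retail workbook.
demo = pd.DataFrame(
    {
        "Quantity": [1, 2, 3, 4],
        "Description": [None, "x", None, "y"],
        "Customer ID": [1.0, None, 3.0, 4.0],
    }
)
demo_summary = missing_summary(demo)
```

In the notebook the call would be `missing_summary(df_raw)`. This step only reports the gaps; it does not fill or drop anything, which keeps it in line with the stated purpose of leaving the source data unmodified.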


### **6) Write processed output**

In [17]:
PROCESSED_DATA_PATH.mkdir(parents=True, exist_ok=True)  # also create missing parent folders
df_raw.to_csv(PROCESSED_DATA_PATH / "raw_unified.csv", index=False)
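
CSV files do not store dtypes, so a downstream notebook that reads `raw_unified.csv` has to restate them. The sketch below shows one way to do that on a tiny invented frame written to a temporary folder. Treating `Invoice` and `StockCode` as strings is an assumption, based on their `object` dtype in the `info()` output; the demo values are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented demo rows; the real file would be PROCESSED_DATA_PATH / "raw_unified.csv".
demo = pd.DataFrame(
    {
        "Invoice": ["489434", "C489449"],
        "StockCode": ["85048", "79323P"],
        "InvoiceDate": pd.to_datetime(["2009-12-01 07:45:00", "2009-12-01 10:33:00"]),
        "Price": [6.95, 6.75],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "raw_unified.csv"
    demo.to_csv(out_path, index=False)
    # Restate the types that the CSV text format cannot carry.
    restored = pd.read_csv(
        out_path,
        dtype={"Invoice": str, "StockCode": str},
        parse_dates=["InvoiceDate"],
    )
```

Without the `dtype` and `parse_dates` arguments, pandas would infer types from the text, so `InvoiceDate` would come back as plain strings and purely numeric codes could be read as integers.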