#### Lab 2 — Data Collection & Pre-Processing
**Student:** Albright Maduka  

**Course:** PROG8245 


#### Lab Assignment
##### You will execute the 12-step Data Engineering road-map practiced in class, this time end-to-end on a realistic e-commerce dataset.
##### Your deliverable is a well-commented Jupyter Notebook that loads raw data, cleans and enriches it, and finishes with a concise analytical insight. All code, data, and documentation must live in a GitHub repository you control.

#### 1. Hello, Data!

Aim: Load the raw CSV and display the first 3 rows.

Reason: It ensures that my dataset is read and loaded correctly.

In [26]:
import pandas as pd
# Read the CSV file into a DataFrame
data = pd.read_csv(r"data\1000 Sales Records.csv")

print(data.head(3)) # Display the first 3 rows of the DataFrame


                         Region Country   Item Type Sales Channel  \
0  Middle East and North Africa   Libya   Cosmetics       Offline   
1                 North America  Canada  Vegetables        Online   
2  Middle East and North Africa   Libya   Baby Food       Offline   

  Order Priority  Order Date   Order ID   Ship Date  Units Sold  Unit Price  \
0              M  10/18/2014  686800706  10/31/2014        8446      437.20   
1              M   11/7/2011  185941302   12/8/2011        3018      154.06   
2              C  10/31/2016  246222341   12/9/2016        1517      255.28   

   Unit Cost  Total Revenue  Total Cost  Total Profit  
0     263.33     3692591.20  2224085.18    1468506.02  
1      90.93      464953.08   274426.74     190526.34  
2     159.42      387259.76   241840.14     145419.62  
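
The raw-string path in the cell above uses a backslash separator, which only resolves on Windows. A minimal sketch of a more portable load, assuming the `data` folder sits next to the notebook (the name `csv_path` is introduced here for illustration):

```python
from pathlib import Path

import pandas as pd

# Build the path from parts so the separator matches the operating system.
csv_path = Path("data") / "1000 Sales Records.csv"

if csv_path.exists():
    data = pd.read_csv(csv_path)
    print(data.head(3))  # first 3 rows
else:
    # Report the resolved location instead of failing with a long traceback.
    print(f"CSV not found at: {csv_path.resolve()}")
```

The existence check is optional; it just makes a wrong working directory easier to spot.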


#### 2. Pick the Right Container
Aim: Dict vs namedtuple vs set (1–2 sentences).

Reason: Choosing the right container decides how each record is stored and accessed.


In [27]:
# Dict is for mapping keys to values, allowing for flexible and dynamic data storage.
# Namedtuple is for creating lightweight, immutable objects with named fields, providing better structure and readability.
# Set is for storing unique values, such as the distinct product types, with fast membership checks.
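
To make the comparison concrete, here is a small sketch of the three containers using values from the first rows of the dataset. The names `order`, `Sale`, and `item_types` are made up for this illustration.

```python
from collections import namedtuple

# dict: flexible key -> value mapping; keys can be added later.
order = {"Item Type": "Cosmetics", "Units Sold": 8446, "Unit Price": 437.20}
order["Country"] = "Libya"

# namedtuple: fixed, named fields; instances are immutable.
Sale = namedtuple("Sale", ["item_type", "units_sold", "unit_price"])
sale = Sale("Cosmetics", 8446, 437.20)

# set: keeps only unique values, so the repeated "Cosmetics" is stored once.
item_types = {"Cosmetics", "Vegetables", "Baby Food", "Cosmetics"}

print(order["Country"])      # Libya
print(sale.units_sold)       # 8446
print(len(item_types))       # 3

try:
    sale.units_sold = 0      # namedtuple fields cannot be reassigned
except AttributeError as err:
    print("Immutable:", err)
```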


#### 3. Implement Functions and Data Structure

In [28]:
from collections import namedtuple

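Transaction = namedtuple('Transaction', ['date', 'customer_id', 'product', 'price', 'quantity', 'coupon_code', 'shipping_city'])

def clean(row):
    # Map one DataFrame row to a Transaction.
    # customer_id, coupon_code and shipping_city are not columns in this dataset,
    # so .get() returns None for them instead of raising a KeyError.
    return Transaction(
        date=row['Order Date'],
        customer_id=row.get('customer_id'),
        product=row['Item Type'],
        price=row['Unit Price'],
        quantity=row['Units Sold'],
        coupon_code=row.get('coupon_code'),
        shipping_city=row.get('shipping_city')
    )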
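
As a quick check of the row-to-object mapping, the sketch below converts the first row of the dataset, written out by hand as a plain dict, and compares price times quantity with the Total Revenue shown in step 1. `to_transaction` and `sample_row` are hypothetical names used only here; the function tolerates the columns this dataset does not have.

```python
from collections import namedtuple

Transaction = namedtuple(
    "Transaction",
    ["date", "customer_id", "product", "price", "quantity", "coupon_code", "shipping_city"],
)

def to_transaction(row):
    """Hypothetical variant of clean(): absent columns become None."""
    return Transaction(
        date=row["Order Date"],
        customer_id=row.get("customer_id"),
        product=row["Item Type"],
        price=row["Unit Price"],
        quantity=row["Units Sold"],
        coupon_code=row.get("coupon_code"),
        shipping_city=row.get("shipping_city"),
    )

# First row of the dataset, copied from the head(3) output.
sample_row = {
    "Order Date": "10/18/2014",
    "Item Type": "Cosmetics",
    "Unit Price": 437.20,
    "Units Sold": 8446,
}

t = to_transaction(sample_row)
revenue = round(t.price * t.quantity, 2)
print(t.product, t.customer_id)
print("Revenue:", revenue)  # should agree with Total Revenue for row 0
```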

In [29]:
print("Data Info:")
print(data.info()) # Display a summary of the DataFrame: columns, non-null counts and dtypes. info() prints the summary itself and returns None, so print() adds a trailing "None"

print("Data Description:")
print(data.describe(include='all')) # Summary statistics for every column (count, unique, top, freq, mean, std, min, quartiles, max), excluding NaN values

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Region          1000 non-null   object 
 1   Country         1000 non-null   object 
 2   Item Type       1000 non-null   object 
 3   Sales Channel   1000 non-null   object 
 4   Order Priority  1000 non-null   object 
 5   Order Date      1000 non-null   object 
 6   Order ID        1000 non-null   int64  
 7   Ship Date       1000 non-null   object 
 8   Units Sold      1000 non-null   int64  
 9   Unit Price      1000 non-null   float64
 10  Unit Cost       1000 non-null   float64
 11  Total Revenue   1000 non-null   float64
 12  Total Cost      1000 non-null   float64
 13  Total Profit    1000 non-null   float64
dtypes: float64(5), int64(2), object(7)
memory usage: 109.5+ KB
None
Data Description:
        Region Country  Item Type Sales Channel Order Priority Order Date  \

#### 3 Cleaning
Cleaning is the process of handling missing values, correcting typos, and making sure the numerical columns have the correct types.

In [19]:
# Check for missing values in the dataset; 0 means no missing values
print("Missing Values:")
print(data.isnull().sum())

# isnull() function is used to detect missing values in a DataFrame. It returns a DataFrame of the same shape as the original, with boolean values indicating whether each element is missing (True) or not (False). 
# The sum() function is then applied to count the number of missing values in each column.

Missing Values:
Region            0
Country           0
Item Type         0
Sales Channel     0
Order Priority    0
Order Date        0
Order ID          0
Ship Date         0
Units Sold        0
Unit Price        0
Unit Cost         0
Total Revenue     0
Total Cost        0
Total Profit      0
dtype: int64
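
Because the real dataset has no gaps, the counts above are all zero. The sketch below shows what `isnull().sum()` reports when values are missing; the `toy` frame and its gaps are invented for illustration and are not part of the sales data.

```python
import numpy as np
import pandas as pd

# Toy frame with two deliberately missing values (not real data).
toy = pd.DataFrame({
    "Country": ["Libya", None, "Canada"],
    "Units Sold": [8446, 3018, np.nan],
    "Unit Price": [437.20, 154.06, 255.28],
})

mask = toy.isnull()                  # same shape as toy, True where a value is missing
missing_per_column = mask.sum()      # True counts as 1, so this counts gaps per column
rows_with_gaps = toy[mask.any(axis=1)]  # rows that have at least one missing value

print(missing_per_column)
print(rows_with_gaps)
```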


In [35]:
# Remove duplicates if any
data = data.drop_duplicates()
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Region          1000 non-null   object 
 1   Country         1000 non-null   object 
 2   Item Type       1000 non-null   object 
 3   Sales Channel   1000 non-null   object 
 4   Order Priority  1000 non-null   object 
 5   Order Date      1000 non-null   object 
 6   Order ID        1000 non-null   int64  
 7   Ship Date       1000 non-null   object 
 8   Units Sold      1000 non-null   int64  
 9   Unit Price      1000 non-null   float64
 10  Unit Cost       1000 non-null   float64
 11  Total Revenue   1000 non-null   float64
 12  Total Cost      1000 non-null   float64
 13  Total Profit    1000 non-null   float64
dtypes: float64(5), int64(2), object(7)
memory usage: 109.5+ KB
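
The row count stays at 1000, which suggests there were no duplicates to drop. A minimal sketch of how to count duplicates before removing them, using an invented `toy` frame with one repeated row:

```python
import pandas as pd

# Toy frame: the third row repeats the first (not real data).
toy = pd.DataFrame({
    "Order ID": [686800706, 185941302, 686800706],
    "Item Type": ["Cosmetics", "Vegetables", "Cosmetics"],
})

n_dupes = int(toy.duplicated().sum())   # rows identical to an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)

print("Duplicate rows:", n_dupes)
print("Rows before:", len(toy), "after:", len(deduped))
```

Passing `subset=["Order ID"]` to both calls would treat rows as duplicates when only the order ID repeats.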


In [34]:
def clean_data(df): # Function to clean the dataset
    df = df.drop_duplicates() # Remove duplicates if any
    critical = [c for c in ['Order ID', 'Order Date', 'Ship Date', 'Sales'] if c in df.columns] # Skip names such as 'Sales' that this dataset does not have
    df = df.dropna(subset=critical).copy() # Drop rows with missing critical values; copy to avoid SettingWithCopyWarning
    df['Item Type'] = df['Item Type'].str.strip() # Remove leading and trailing spaces
    df['Country'] = df['Country'].str.strip()
    df['Order Date'] = pd.to_datetime(df['Order Date']) # Convert to datetime
    df['Ship Date'] = pd.to_datetime(df['Ship Date'])
    if 'Coupon Code' in df.columns:
        df['Coupon Code'] = df['Coupon Code'].fillna('None') # Fill missing coupon codes with 'None'
    return df


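Reason: This function turns the raw data into a clean dataset.
i. I removed duplicate rows.
ii. I removed rows with missing values in the critical columns.
iii. I stripped leading and trailing spaces from Item Type and Country.
iv. I converted Order Date and Ship Date to datetime.
v. I filled missing coupon codes with 'None'.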
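
The whitespace and date steps can be sketched on three rows taken from the head of the dataset. The stray spaces in `Item Type` are added on purpose for the demo, and `Shipping Days` is a hypothetical derived column, not something the lab requires.

```python
import pandas as pd

toy = pd.DataFrame({
    "Item Type": [" Cosmetics", "Vegetables ", "Baby Food"],  # spaces added for the demo
    "Order Date": ["10/18/2014", "11/7/2011", "10/31/2016"],
    "Ship Date": ["10/31/2014", "12/8/2011", "12/9/2016"],
})

# Strip leading and trailing whitespace from the text column.
toy["Item Type"] = toy["Item Type"].str.strip()

# Parse month/day/year strings into datetime64 values.
for col in ["Order Date", "Ship Date"]:
    toy[col] = pd.to_datetime(toy[col], format="%m/%d/%Y")

# With real datetimes, date arithmetic becomes possible.
toy["Shipping Days"] = (toy["Ship Date"] - toy["Order Date"]).dt.days

print(toy.dtypes)
print(toy)
```

Giving an explicit `format` avoids any ambiguity between month-first and day-first dates.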

#### 4. Bulk Loaded
Each row of the DataFrame is passed to clean(), which maps the raw columns to the named fields of a Transaction. The resulting objects are collected in a list.

In [None]:
# Convert every row of the DataFrame loaded in step 1 (data) into a Transaction
transactions = [clean(row) for _, row in data.iterrows()]
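
`iterrows()` builds a pandas Series for every row, which gets slow on larger files. One possible alternative, sketched here on a three-row sample, is to convert the frame to a list of dicts with `to_dict("records")` and build the objects from those. `Sale` and `sample` are illustrative names, and the tuple keeps only the fields this dataset actually has.

```python
from collections import namedtuple

import pandas as pd

# Reduced record type for the illustration (hypothetical name).
Sale = namedtuple("Sale", ["date", "product", "price", "quantity"])

# First three rows of the dataset, restricted to four columns.
sample = pd.DataFrame({
    "Order Date": ["10/18/2014", "11/7/2011", "10/31/2016"],
    "Item Type": ["Cosmetics", "Vegetables", "Baby Food"],
    "Unit Price": [437.20, 154.06, 255.28],
    "Units Sold": [8446, 3018, 1517],
})

# One dict per row, then one Sale per dict.
sales = [
    Sale(r["Order Date"], r["Item Type"], r["Unit Price"], r["Units Sold"])
    for r in sample.to_dict("records")
]

print(len(sales))
print(sales[0])
```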

#### 5. Quick Profiling

In [44]:
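print("Min price:", data['Unit Price'].min())
print("Mean price:", data['Unit Price'].mean())
print("Max price:", data['Unit Price'].max())
# This dataset has no shipping_city column, so fall back to Country as the closest available field
city_col = 'shipping_city' if 'shipping_city' in data.columns else 'Country'
print(f"Unique {city_col} values:", data[city_col].nunique())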
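
The same profile can be computed in one call with `agg`. The sketch below uses only the three rows shown in step 1, so its numbers describe that sample and not the full 1000-row file; `sample`, `price_stats`, and `n_countries` are illustrative names.

```python
import pandas as pd

# First three rows of the dataset (Country and Unit Price only).
sample = pd.DataFrame({
    "Country": ["Libya", "Canada", "Libya"],
    "Unit Price": [437.20, 154.06, 255.28],
})

# min, mean and max in a single pass; the result is a Series indexed by statistic name.
price_stats = sample["Unit Price"].agg(["min", "mean", "max"])
print(price_stats.round(2))

# nunique() counts distinct values without building a Python set.
n_countries = sample["Country"].nunique()
print("Unique countries:", n_countries)
```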