#### Lab 2 — Data Collection & Pre-Processing
Albright Maduka  

PROG8245 


#### Lab Assignment

This notebook follows the 12-step Data Engineering roadmap to process a synthetic e-commerce dataset.  
Primary dataset: `1000 Sales Records.csv` (synthetic, 1000 rows).  
Secondary dataset: `worldcities.csv` (city metadata from SimpleMaps).

#### 1. Hello, Data!

Aim: Load the raw CSV and display the first 3 rows.

Reason: This confirms that the dataset is read and loaded correctly.

In [2]:
import pandas as pd
import numpy as np
# Read the CSV file into a DataFrame (a forward-slash path works on every OS)
df = pd.read_csv("data/1000 Sales Records.csv")
print(df.head(3))  # Display the first 3 rows of the DataFrame


                         Region Country   Item Type Sales Channel  \
0  Middle East and North Africa   Libya   Cosmetics       Offline   
1                 North America  Canada  Vegetables        Online   
2  Middle East and North Africa   Libya   Baby Food       Offline   

  Order Priority  Order Date   Order ID   Ship Date  Units Sold  Unit Price  \
0              M  10/18/2014  686800706  10/31/2014        8446      437.20   
1              M   11/7/2011  185941302   12/8/2011        3018      154.06   
2              C  10/31/2016  246222341   12/9/2016        1517      255.28   

   Unit Cost  Total Revenue  Total Cost  Total Profit  
0     263.33     3692591.20  2224085.18    1468506.02  
1      90.93      464953.08   274426.74     190526.34  
2     159.42      387259.76   241840.14     145419.62  


#### 2. Pick the Right Container
Question: Dict vs namedtuple vs set (1–2 sentences).

Answer: A dict is used for lookups/mappings, e.g. mapping a coupon code to its discount.

A set is used for uniqueness checks, e.g. the distinct shipping cities.

A namedtuple creates lightweight objects with named fields, which gives records better structure and readability.
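
To make the comparison concrete, here is a minimal sketch of the three containers side by side. The sample values are illustrative, and the `Transaction` record type is a hypothetical name that the rest of this notebook does not use.

```python
from collections import namedtuple

# dict: map a coupon code to its discount rate (lookup by key)
coupon_to_discount = {"PROMO10": 0.10, "SAVE20": 0.20, "FREESHIP": 0.00}
rate = coupon_to_discount.get("SAVE20", 0.0)  # unknown codes fall back to 0.0

# set: keep only the distinct shipping cities
cities_seen = ["Toronto", "Tokyo", "Toronto", "London"]
unique_cities = set(cities_seen)

# namedtuple: a lightweight, immutable record with named fields
# ("Transaction" is a hypothetical name used only for this example)
Transaction = namedtuple("Transaction", ["customer_id", "product", "price", "quantity"])
row = Transaction("CUST00001", "Cosmetics", 437.20, 8446)

print(rate)                # discount rate for SAVE20
print(len(unique_cities))  # number of distinct cities
print(row.product)         # fields are read by name instead of by position
```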


#### 3. Implement Functions and Data Structure
Implement functions in a class and use them to populate a data structure.

In [51]:
class SalesData:
    # Initialize with a DataFrame
    def __init__(self, df):
        self.df = df.copy()
    
    def clean(self):
        # data cleaning
        # basic example: drop rows missing critical values
        before = len(self.df)
        self.df = self.df.dropna(subset=["Units Sold", "Unit Price", "Total Revenue"]) #remove rows with missing values in these columns
        after = len(self.df)
        print(f"Cleaned: {before} → {after} rows")
        return self.df
    
    def total(self):
        # compute and return
        # total revenue cross-check
        return float(self.df["Total Revenue"].sum())

sales = SalesData(df)
_ = sales.clean()
sales.total()

Cleaned: 1000 → 1000 rows


1327321840.33
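
One way to take the revenue cross-check further is to recompute `Total Revenue` from `Units Sold` × `Unit Price` and compare it with the stored column. The sketch below does this on the three rows printed in step 1. The 0.01 tolerance is an assumption to absorb floating-point rounding, and it is not a claim that every row in the full file matches.

```python
import numpy as np
import pandas as pd

# The three sample rows shown in step 1
sample = pd.DataFrame({
    "Units Sold": [8446, 3018, 1517],
    "Unit Price": [437.20, 154.06, 255.28],
    "Total Revenue": [3692591.20, 464953.08, 387259.76],
})

# Recompute revenue and compare with the stored column within a small tolerance
recomputed = sample["Units Sold"] * sample["Unit Price"]
matches = np.isclose(recomputed, sample["Total Revenue"], atol=0.01)

all_match = bool(matches.all())
print("All sample rows consistent:", all_match)
```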

#### 4. Bulk Loaded 
Map data structures from dataframes to dictionaries

In [7]:
# Convert sample rows to list of dictionaries
records = df.to_dict(orient="records")[:3]
records


[{'Region': 'Middle East and North Africa',
  'Country': 'Libya',
  'Item Type': 'Cosmetics',
  'Sales Channel': 'Offline',
  'Order Priority': 'M',
  'Order Date': '10/18/2014',
  'Order ID': 686800706,
  'Ship Date': '10/31/2014',
  'Units Sold': 8446,
  'Unit Price': 437.2,
  'Unit Cost': 263.33,
  'Total Revenue': 3692591.2,
  'Total Cost': 2224085.18,
  'Total Profit': 1468506.02},
 {'Region': 'North America',
  'Country': 'Canada',
  'Item Type': 'Vegetables',
  'Sales Channel': 'Online',
  'Order Priority': 'M',
  'Order Date': '11/7/2011',
  'Order ID': 185941302,
  'Ship Date': '12/8/2011',
  'Units Sold': 3018,
  'Unit Price': 154.06,
  'Unit Cost': 90.93,
  'Total Revenue': 464953.08,
  'Total Cost': 274426.74,
  'Total Profit': 190526.34},
 {'Region': 'Middle East and North Africa',
  'Country': 'Libya',
  'Item Type': 'Baby Food',
  'Sales Channel': 'Offline',
  'Order Priority': 'C',
  'Order Date': '10/31/2016',
  'Order ID': 246222341,
  'Ship Date': '12/9/2016',
  'Uni

Note: The cell above shows how the dataset rows look as dictionaries.

Reason: It is part of the process of cleaning raw data, in which:  
i. I removed duplicates  
ii. I removed the rows with missing values  
iii. I removed extra spaces  
iv. I converted dates to datetime  
v. I filled the missing coupon codes with a placeholder (`NO_COUPON`)

In [16]:
# Create the 7-field "transactions"
np.random.seed(7) 

transactions = pd.DataFrame({
    "date": pd.to_datetime(df["Order Date"]), # map from Order Date
    "customer_id": ["CUST" + str(i).zfill(5) for i in range(1, len(df)+1)],  # synthetic
    "product": df["Item Type"], # map from Item Type                 
    "price": df["Unit Price"], # map from Unit Price
    "quantity": df["Units Sold"], # map from Units Sold                  
})

# Synthetic coupon codes & shipping cities
coupon_pool = ["PROMO11", "PROMO10", "DISCOUNT5", "FREESHIP", "SAVE20"]
city_pool   = ["Toronto","New York","London","Dubai","Sydney","Mumbai","Paris","Berlin","Tokyo","Mexico City"]

transactions["coupon_code"]   = np.random.choice(coupon_pool, size=len(transactions))
transactions["shipping_city"] = np.random.choice(city_pool, size=len(transactions))

transactions.head(3)


|   | date | customer_id | product | price | quantity | coupon_code | shipping_city |
|---|---|---|---|---|---|---|---|
| 0 | 2014-10-18 | CUST00001 | Cosmetics | 437.2 | 8446 | SAVE20 | Toronto |
| 1 | 2011-11-07 | CUST00002 | Vegetables | 154.06 | 3018 | PROMO10 | Tokyo |
| 2 | 2016-10-31 | CUST00003 | Baby Food | 255.28 | 1517 | FREESHIP | New York |


Note: Bulk Loaded (Create Required Transactions Schema)

My dataset does not have the required 7 fields:
date, customer_id, product, price, quantity, coupon_code, shipping_city

- *date*: mapped from `Order Date`
- *customer_id*: generated synthetic IDs (`CUST00001`, …)
- *product*: mapped from `Item Type`
- *price*: mapped from `Unit Price`
- *quantity*: mapped from `Units Sold`
- *coupon_code*: drawn at random from a pool of synthetic codes (`PROMO10`, `DISCOUNT5`, etc.)
- *shipping_city*: drawn at random from a pool of cities

This mapping allows my dataset to match the required fields.
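
A small guard can confirm that a frame really carries the seven required fields before later steps rely on them. This is a sketch only: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names introduced here, and the one-row frame stands in for the real `transactions` table.

```python
import pandas as pd

# The seven fields required by the lab (names taken from the list above)
REQUIRED_FIELDS = ["date", "customer_id", "product", "price",
                   "quantity", "coupon_code", "shipping_city"]

def missing_fields(frame, required=REQUIRED_FIELDS):
    """Return the required column names that are absent from the frame."""
    return [name for name in required if name not in frame.columns]

# One-row stand-in for the transactions table
sample = pd.DataFrame({
    "date": pd.to_datetime(["2014-10-18"]),
    "customer_id": ["CUST00001"],
    "product": ["Cosmetics"],
    "price": [437.20],
    "quantity": [8446],
    "coupon_code": ["SAVE20"],
    "shipping_city": ["Toronto"],
})

print(missing_fields(sample))                                 # complete schema
print(missing_fields(sample.drop(columns=["coupon_code"])))   # one field removed
```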


#### 5. Quick Profiling
We already did something like this in class

In [18]:
data = transactions.copy()  # stick to a single working variable

print("Data Info:") # This is used to display summary of the DataFrame
print(data.info())

print("\nData Description:") # Generate or describe statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.
print(data.describe(include='all'))

# Optional: standardize headers to lower-snake for easier coding later
data.columns = (data.columns
                .str.strip()
                .str.lower() 
                .str.replace(" ", "_")) # replace spaces with underscores
data.head(3)


Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           1000 non-null   datetime64[ns]
 1   customer_id    1000 non-null   object        
 2   product        1000 non-null   object        
 3   price          1000 non-null   float64       
 4   quantity       1000 non-null   int64         
 5   coupon_code    1000 non-null   object        
 6   shipping_city  1000 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 54.8+ KB
None

Data Description:
                                 date customer_id    product       price  \
count                            1000        1000       1000  1000.00000   
unique                            NaN        1000         12         NaN   
top                               NaN   CUST00001  Beverages         NaN   
freq           

|   | date | customer_id | product | price | quantity | coupon_code | shipping_city |
|---|---|---|---|---|---|---|---|
| 0 | 2014-10-18 | CUST00001 | Cosmetics | 437.2 | 8446 | SAVE20 | Toronto |
| 1 | 2011-11-07 | CUST00002 | Vegetables | 154.06 | 3018 | PROMO10 | Tokyo |
| 2 | 2016-10-31 | CUST00003 | Baby Food | 255.28 | 1517 | FREESHIP | New York |


#### 6. Spot the Grime

In [None]:
# Check for missing values
print("\nMissing Values:")
print(data.isnull().sum())

# Count duplicate rows, if any are present (they are removed in the cleaning step)
dup_count = data.duplicated().sum()
print(f"\nDuplicate rows: {dup_count}")

# Spot negative or zero values in quantity and price
print("\nNon-positive quantities:", (data["quantity"] <= 0).sum())
print("Non-positive prices:", (data["price"] <= 0).sum())



Missing Values:
date             0
customer_id      0
product          0
price            0
quantity         0
coupon_code      0
shipping_city    0
dtype: int64

Duplicate rows: 0

Non-positive quantities: 0
Non-positive prices: 0
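
Because this dataset is already clean, every check above returns zero. To see the same checks actually fire, the sketch below runs them on a small, deliberately dirty frame. All values in `dirty` are invented for illustration.

```python
import pandas as pd

# Invented rows: one exact duplicate, one missing coupon, one bad price/quantity
dirty = pd.DataFrame({
    "customer_id": ["CUST00001", "CUST00002", "CUST00002", "CUST00003"],
    "price": [437.20, 154.06, 154.06, -5.00],
    "quantity": [8446, 3018, 3018, 0],
    "coupon_code": ["SAVE20", "PROMO10", "PROMO10", None],
})

missing_per_column = dirty.isnull().sum()              # missing values per column
duplicate_rows = int(dirty.duplicated().sum())         # rows identical to an earlier row
bad_quantity = int((dirty["quantity"] <= 0).sum())     # non-positive quantities
bad_price = int((dirty["price"] <= 0).sum())           # non-positive prices

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print("Non-positive quantities:", bad_quantity)
print("Non-positive prices:", bad_price)
```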


#### 7. Cleaning
This step follows the cleaning approach from the lecture lab notes.


In [23]:
before = len(data)

# Remove duplicates
data = data.drop_duplicates()

# Handle missing values (example: fill missing coupon codes with "NO_COUPON")
data["coupon_code"] = data["coupon_code"].fillna("NO_COUPON")

# Standardize text fields (trim whitespace and lowercase shipping_city)
data["shipping_city"] = data["shipping_city"].str.strip().str.lower()

# Convert data types
data["date"] = pd.to_datetime(data["date"])
data["quantity"] = pd.to_numeric(data["quantity"], errors="coerce")
data["price"] = pd.to_numeric(data["price"], errors="coerce")

# Remove rows with non-positive quantity or price
data = data[(data["quantity"] > 0) & (data["price"] > 0)]

after = len(data)
print(f"Cleaning complete: {before} → {after} rows")
data.head(3)


Cleaning complete: 1000 → 1000 rows


|   | date | customer_id | product | price | quantity | coupon_code | shipping_city |
|---|---|---|---|---|---|---|---|
| 0 | 2014-10-18 | CUST00001 | Cosmetics | 437.2 | 8446 | SAVE20 | toronto |
| 1 | 2011-11-07 | CUST00002 | Vegetables | 154.06 | 3018 | PROMO10 | tokyo |
| 2 | 2016-10-31 | CUST00003 | Baby Food | 255.28 | 1517 | FREESHIP | new york |


#### 8. Transformation (Transforming Coupons to Discounts, Revenue, Transaction key)

The lecture lab notes used an address field, but the lab assignment did not ask me to use address.

- Map *coupon_code* to a numeric discount using a dictionary.
- *net_price* = price × (1 − discount)
- *revenue* = net_price × quantity
- *transaction_key* (date + customer_id + product) for tracking.
- FREESHIP maps to a discount of `0.00`.


In [32]:
# Map coupon_code → numeric discount rate in the dictionary lookup 
coupon_to_disc = {
    "PROMO10": 0.10,
    "DISCOUNT5": 0.05,
    "FREESHIP": 0.00,
    "SAVE20": 0.20,
    "PROMO11": 0.11,
}
# Map each coupon code to its discount rate; fillna(0.0) assigns a rate of 0.0
# to any coupon code that is not found in the dictionary
data["discount"] = data["coupon_code"].map(coupon_to_disc).fillna(0.0)

# Derived price after discount & line revenue
data["net_price"] = data["price"] * (1 - data["discount"])
data["revenue"]   = data["net_price"] * data["quantity"]

# Transaction key (composite string)
data["transaction_key"] = (
    data["date"].dt.strftime("%Y%m%d") + "_" +
    data["customer_id"] + "_" +
    data["product"].str.replace(r"\s+", "", regex=True)
)

data.head(5)


|   | date | customer_id | product | price | quantity | coupon_code | shipping_city | discount | net_price | revenue | transaction_key |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-10-18 | CUST00001 | Cosmetics | 437.2 | 8446 | SAVE20 | toronto | 0.2 | 349.76 | 2954072.96 | 20141018_CUST00001_Cosmetics |
| 1 | 2011-11-07 | CUST00002 | Vegetables | 154.06 | 3018 | PROMO10 | tokyo | 0.1 | 138.654 | 418457.772 | 20111107_CUST00002_Vegetables |
| 2 | 2016-10-31 | CUST00003 | Baby Food | 255.28 | 1517 | FREESHIP | new york | 0.0 | 255.28 | 387259.76 | 20161031_CUST00003_BabyFood |
| 3 | 2010-04-10 | CUST00004 | Cereal | 205.7 | 3322 | FREESHIP | new york | 0.0 | 205.7 | 683335.4 | 20100410_CUST00004_Cereal |
| 4 | 2011-08-16 | CUST00005 | Fruits | 9.33 | 9845 | SAVE20 | london | 0.2 | 7.464 | 73483.08 | 20110816_CUST00005_Fruits |
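
Two properties of this transformation are worth checking: an unknown coupon code should fall back to a discount of 0.0, and the composite key should be unique per row. The sketch below tests both on three rows. The `UNKNOWN` code is invented here to exercise the fallback and does not appear in the dataset.

```python
import pandas as pd

coupon_to_disc = {"PROMO10": 0.10, "SAVE20": 0.20, "FREESHIP": 0.00}

sample = pd.DataFrame({
    "date": pd.to_datetime(["2014-10-18", "2011-11-07", "2016-10-31"]),
    "customer_id": ["CUST00001", "CUST00002", "CUST00003"],
    "product": ["Cosmetics", "Vegetables", "Baby Food"],
    "price": [437.20, 154.06, 255.28],
    "quantity": [8446, 3018, 1517],
    "coupon_code": ["SAVE20", "PROMO10", "UNKNOWN"],  # "UNKNOWN" is invented
})

# Codes missing from the dictionary map to NaN, then to 0.0
sample["discount"] = sample["coupon_code"].map(coupon_to_disc).fillna(0.0)
sample["net_price"] = sample["price"] * (1 - sample["discount"])
sample["revenue"] = sample["net_price"] * sample["quantity"]

# Composite key: date + customer_id + product (whitespace removed)
sample["transaction_key"] = (
    sample["date"].dt.strftime("%Y%m%d") + "_" +
    sample["customer_id"] + "_" +
    sample["product"].str.replace(r"\s+", "", regex=True)
)

keys_are_unique = bool(sample["transaction_key"].is_unique)
print(sample[["coupon_code", "discount", "revenue", "transaction_key"]])
print("Keys unique:", keys_are_unique)
```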


#### 9. Feature Engineering

In [36]:
# Date features
data["order_year"]  = data["date"].dt.year
data["order_month"] = data["date"].dt.month

# Binary engineered feature: high-discount flag (1 if discount >= 10%, else 0).
# The original file had Region/Order Priority columns; this flag plays a similar role here.
# astype(int) converts the boolean values (True/False) to integers (1/0).
data["is_high_discount"] = (data["discount"] >= 0.10).astype(int)

# First letter of the product name (empty string for missing or empty values)
data["product_initial"] = data["product"].apply(lambda x: x[0] if isinstance(x, str) and len(x)>0 else "")

data.head(3)


|   | date | customer_id | product | price | quantity | coupon_code | shipping_city | discount | net_price | revenue | transaction_key | order_year | order_month | is_high_discount | product_initial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-10-18 | CUST00001 | Cosmetics | 437.2 | 8446 | SAVE20 | toronto | 0.2 | 349.76 | 2954072.96 | 20141018_CUST00001_Cosmetics | 2014 | 10 | 1 | C |
| 1 | 2011-11-07 | CUST00002 | Vegetables | 154.06 | 3018 | PROMO10 | tokyo | 0.1 | 138.654 | 418457.772 | 20111107_CUST00002_Vegetables | 2011 | 11 | 1 | V |
| 2 | 2016-10-31 | CUST00003 | Baby Food | 255.28 | 1517 | FREESHIP | new york | 0.0 | 255.28 | 387259.76 | 20161031_CUST00003_BabyFood | 2016 | 10 | 0 | B |
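
The product initial can also be taken with pandas' vectorized string accessor instead of `apply` with a lambda. This is an alternative sketch, not the method used in the cell above. Slicing with `.str[:1]` returns an empty string for empty values and a missing value for missing ones, which `fillna("")` then blanks out.

```python
import pandas as pd

# Sample values, including an empty string and a missing value
products = pd.Series(["Cosmetics", "Baby Food", "", None])

# .str[:1] takes the first character of each string without a Python-level loop
initials = products.str[:1].fillna("")

print(initials.tolist())
```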


#### 10. Mini Aggregation

In [40]:
# Revenue by shipping_city for the top 10 cities (alternative view, left disabled)
# city_revenue = (data.groupby("shipping_city")["revenue"]
#                     .sum()
#                     .sort_values(ascending=False)
#                     .head(10))
# city_revenue

# Revenue by product for the top 10 products
product_revenue = (data.groupby("product")["revenue"]
                       .sum()
                       .sort_values(ascending=False)
                       .head(10))
product_revenue


product
Office Supplies    2.618948e+08
Household          2.206722e+08
Cosmetics          1.696349e+08
Meat               1.555799e+08
Baby Food          9.991841e+07
Cereal             7.323029e+07
Vegetables         6.536898e+07
Snacks             5.428317e+07
Clothes            3.721904e+07
Personal Care      3.498388e+07
Name: revenue, dtype: float64
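
If more than one statistic per product is useful, pandas' named aggregation can compute several in a single `groupby`. The sketch below uses a tiny invented frame. The output column names (`total_revenue`, `orders`, `avg_quantity`) are illustrative choices, not columns from this notebook.

```python
import pandas as pd

# Invented rows standing in for the transformed transactions
sample = pd.DataFrame({
    "product": ["Cosmetics", "Cosmetics", "Fruits"],
    "quantity": [10, 30, 5],
    "revenue": [100.0, 300.0, 20.0],
})

# Named aggregation: new_column=(source_column, aggregation)
summary = (sample.groupby("product")
                 .agg(total_revenue=("revenue", "sum"),
                      orders=("revenue", "count"),
                      avg_quantity=("quantity", "mean"))
                 .sort_values("total_revenue", ascending=False))

print(summary)
```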

#### 11. Serialization Checkpoint
Save the cleaned data to CSV and JSON.

In [41]:
# CSV + JSON
data.to_csv("data/transactions_cleaned_final.csv", index=False)
data.to_json("data/transactions_cleaned_final.json", orient="records")

print("Saved: data/transactions_cleaned_final.csv & .json")

Saved: data/transactions_cleaned_final.csv & .json
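
A checkpoint is only useful if it reads back the same way it was written. The sketch below round-trips a two-row frame through JSON in memory. It passes `date_format="iso"`, which is an assumption beyond the cell above (by default `to_json` writes dates as epoch milliseconds), and it uses `io.StringIO` so that no file is created.

```python
import io
import pandas as pd

sample = pd.DataFrame({
    "date": pd.to_datetime(["2014-10-18", "2011-11-07"]),
    "customer_id": ["CUST00001", "CUST00002"],
    "revenue": [2954072.96, 418457.772],
})

# date_format="iso" writes readable timestamps instead of epoch milliseconds
json_text = sample.to_json(orient="records", date_format="iso")

# convert_dates tells read_json which columns to parse back into datetimes
restored = pd.read_json(io.StringIO(json_text), orient="records",
                        convert_dates=["date"])

same_dates = (restored["date"].dt.strftime("%Y-%m-%d").tolist()
              == sample["date"].dt.strftime("%Y-%m-%d").tolist())
same_revenue = (restored["revenue"].round(3).tolist()
                == sample["revenue"].round(3).tolist())
print("Dates preserved:", same_dates)
print("Revenue preserved:", same_revenue)
```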


#### 12. Soft Interview Reflection


Using functions and a class made my work easier and more organized.

The `clean()` method kept all my cleaning in one place, so I don't repeat code.

The `total()` method let me check total revenue and confirm that my data was sound.
By putting logic inside functions, I can re-run my notebook on new data without rewriting steps.  
This structure also made the notebook easier to read and debug, since each part had a clear job.  
Overall, functions helped me save time, avoid mistakes, and keep my workflow simple.


#### After Step 12 (Data-Dictionary)

Merge field definitions from the primary CSV header and the secondary metadata source. 

Present as a tidy Markdown table including the new columns, for example: Field, Type, Description, Source. Explain how they were created, e.g. synthetic, combination, mean, etc
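
One possible way to produce such a table is to keep the field definitions as plain tuples and render them as Markdown. The sketch below covers the seven schema fields plus three derived columns. The descriptions and the informal type labels are my own wording based on the steps above, and `to_markdown_table` is a hypothetical helper, not a pandas function.

```python
# (field, type, description, source) tuples; type labels are informal
DATA_DICTIONARY = [
    ("date", "datetime", "Order date", "Primary CSV: `Order Date`"),
    ("customer_id", "string", "Customer identifier", "Synthetic: generated IDs (`CUST00001`, ...)"),
    ("product", "string", "Product category", "Primary CSV: `Item Type`"),
    ("price", "float", "Unit price before discount", "Primary CSV: `Unit Price`"),
    ("quantity", "integer", "Units sold", "Primary CSV: `Units Sold`"),
    ("coupon_code", "string", "Coupon applied to the order", "Synthetic: random draw from a coupon pool"),
    ("shipping_city", "string", "Destination city", "Synthetic: random draw from a city pool"),
    ("discount", "float", "Discount rate for the coupon", "Derived: dictionary lookup on coupon_code"),
    ("net_price", "float", "Price after discount", "Derived: price * (1 - discount)"),
    ("revenue", "float", "Line revenue", "Derived: net_price * quantity"),
]

def to_markdown_table(rows, headers=("Field", "Type", "Description", "Source")):
    """Render rows of equal-length tuples as a Markdown table string."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join(["---"] * len(headers)) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)

table = to_markdown_table(DATA_DICTIONARY)
print(table)
```

The city metadata columns from the secondary source could be appended to `DATA_DICTIONARY` in the same way.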

In [53]:
# Read city metadata
cities = pd.read_csv("data/worldcities.csv")

# I picked only the columns I need
cities_small = cities[["city_name","country","province","population","latitude","longitude"]].copy()
cities_small["city_name"] = cities_small["city_name"].str.title()

# shipping_city was lowercased in step 7; restore title case so it matches city_name
data["shipping_city"] = data["shipping_city"].str.title()

# I filtered to just the shipping cities to shrink size
my_cities = data["shipping_city"].dropna().unique().tolist()  # unique shipping cities in the transactions data
cities_filtered = cities_small[cities_small["city_name"].isin(my_cities)]

# I merged metadata to transactions
enriched = data.merge(cities_filtered, left_on="shipping_city", right_on="city_name", how="left")
enriched.head(5)


|   | date | customer_id | product | price | quantity | coupon_code | shipping_city | discount | net_price | revenue | ... | order_year | order_month | is_high_discount | product_initial | city_name | country | province | population | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-10-18 | CUST00001 | Cosmetics | 437.2 | 8446 | SAVE20 | Toronto | 0.2 | 349.76 | 2954072.96 | ... | 2014 | 10 | 1 | C | Toronto | Canada | Ontario | 5647656.0 | 43.7417 | -79.3733 |
| 1 | 2011-11-07 | CUST00002 | Vegetables | 154.06 | 3018 | PROMO10 | Tokyo | 0.1 | 138.654 | 418457.772 | ... | 2011 | 11 | 1 | V | Tokyo | Japan | Tōkyō | 37785000.0 | 35.687 | 139.7495 |
| 2 | 2016-10-31 | CUST00003 | Baby Food | 255.28 | 1517 | FREESHIP | New York | 0.0 | 255.28 | 387259.76 | ... | 2016 | 10 | 0 | B | New York | United States | New York | 18832416.0 | 40.6943 | -73.9249 |
| 3 | 2010-04-10 | CUST00004 | Cereal | 205.7 | 3322 | FREESHIP | New York | 0.0 | 205.7 | 683335.4 | ... | 2010 | 4 | 0 | C | New York | United States | New York | 18832416.0 | 40.6943 | -73.9249 |
| 4 | 2011-08-16 | CUST00005 | Fruits | 9.33 | 9845 | SAVE20 | London | 0.2 | 7.464 | 73483.08 | ... | 2011 | 8 | 1 | F | London | United Kingdom | London, City of | 11262000.0 | 51.5072 | -0.1275 |
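
City names are not guaranteed to be unique in a world-cities table, and a left merge on the name alone would duplicate a transaction for every city that shares it. The sketch below shows one hedged way to guard against that: keep the most populous city per name, then let `validate="m:1"` raise an error if duplicates remain. The second "London" row and its population are invented for illustration, and picking the largest city is an assumption about which one is meant.

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": ["CUST00001", "CUST00005"],
    "shipping_city": ["Toronto", "London"],
})

cities = pd.DataFrame({
    "city_name": ["Toronto", "London", "London"],
    "country": ["Canada", "United Kingdom", "Canada"],
    "population": [5647656, 11262000, 400000],  # last value is invented
})

# Keep only the most populous city for each name (an assumption, see above)
largest_per_name = (cities.sort_values("population", ascending=False)
                          .drop_duplicates(subset="city_name", keep="first"))

# validate="m:1" raises MergeError if city_name is still duplicated on the right
enriched_sample = orders.merge(largest_per_name, left_on="shipping_city",
                               right_on="city_name", how="left", validate="m:1")

print(len(orders), len(enriched_sample))  # row counts before and after the merge
print(enriched_sample[["shipping_city", "country", "population"]])
```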
