# HW1 — Python Programming for Data Analysis (No Pandas)

**Instructions**
- Complete this notebook **top to bottom**.
- Use **only Python standard library + NumPy** (no Pandas).
- Write clean, readable code with meaningful variable names.
- Unless told otherwise, do not change provided starter code.

---

## Dataset: Realistic E-commerce Orders (Synthetic, Reproducible)

You will work with a dataset of e-commerce orders (hundreds of rows) that includes:
- numeric fields (price, quantity, discount, shipping days)
- categorical fields (state, category, payment method)
- boolean/flags (returned)
- missing values and a few outliers (like real data)

In [3]:
# Dataset generator (DO NOT MODIFY)
import random
import datetime as dt

RNG_SEED = 551
random.seed(RNG_SEED)

STATES = ["CA","NY","TX","FL","IL","WA","MA","PA","GA","NC","VA","AZ","CO","NJ","OH"]
CATEGORIES = ["Electronics","Home","Beauty","Grocery","Sports","Books","Clothing","Toys"]
PAYMENTS = ["card","paypal","apple_pay","google_pay","bank_transfer"]

def _choice_weighted(items, weights):
    # items: list, weights: list of positive numbers
    total = sum(weights)
    r = random.random() * total
    upto = 0.0
    for item, w in zip(items, weights):
        upto += w
        if upto >= r:
            return item
    return items[-1]

def generate_orders(n=420):
    """
    Returns a list of dicts with keys:
      order_id, customer_id, order_date, state, category, payment,
      unit_price, quantity, discount_pct, shipping_days, returned, rating
    """
    base_date = dt.date(2025, 8, 1)
    orders = []
    for i in range(1, n+1):
        # skew categories a bit
        cat = _choice_weighted(CATEGORIES, [12, 11, 7, 8, 6, 5, 10, 4])
        state = _choice_weighted(STATES, [14, 12, 11, 9, 8, 7, 6, 6, 6, 5, 5, 4, 4, 4, 3])

        payment = _choice_weighted(PAYMENTS, [60, 18, 10, 8, 4])
        order_date = base_date + dt.timedelta(days=random.randint(0, 140))

        # price distribution by category (roughly realistic)
        if cat == "Electronics":
            unit_price = round(random.uniform(35, 1200), 2)
        elif cat == "Home":
            unit_price = round(random.uniform(8, 380), 2)
        elif cat == "Beauty":
            unit_price = round(random.uniform(5, 95), 2)
        elif cat == "Grocery":
            unit_price = round(random.uniform(2, 40), 2)
        elif cat == "Sports":
            unit_price = round(random.uniform(10, 260), 2)
        elif cat == "Books":
            unit_price = round(random.uniform(6, 55), 2)
        elif cat == "Clothing":
            unit_price = round(random.uniform(9, 180), 2)
        else:  # Toys
            unit_price = round(random.uniform(6, 120), 2)

        # quantity: mostly small, sometimes larger
        quantity = _choice_weighted([1,2,3,4,5,6,7,8,9,10], [45,26,10,6,4,3,2,2,1,1])

        # discount: many zero, some small, few larger
        discount_pct = _choice_weighted([0,5,10,15,20,25,30,40,50], [55,14,10,7,5,3,2,2,2])

        # shipping days: mostly 2–6, sometimes slower
        shipping_days = int(round(random.uniform(2, 6)))
        if random.random() < 0.08:
            shipping_days += random.randint(3, 8)

        # returns: influenced by category + shipping delay
        base_return_prob = {
            "Electronics": 0.10, "Home": 0.08, "Beauty": 0.05, "Grocery": 0.03,
            "Sports": 0.07, "Books": 0.04, "Clothing": 0.12, "Toys": 0.06
        }[cat]
        return_prob = base_return_prob + (0.02 if shipping_days >= 8 else 0.0)
        returned = (random.random() < return_prob)

        # rating: missing sometimes; worse if returned
        rating = None
        if random.random() > 0.12:
            mu = 4.2 - (0.9 if returned else 0.0) - (0.2 if shipping_days >= 7 else 0.0)
            raw = random.gauss(mu, 0.6)
            rating = max(1.0, min(5.0, round(raw, 1)))

        order = {
            "order_id": 10_000 + i,
            "customer_id": 2_000 + random.randint(1, 180),
            "order_date": order_date.isoformat(),  # keep as string for parsing practice
            "state": state,
            "category": cat,
            "payment": payment,
            "unit_price": unit_price,
            "quantity": quantity,
            "discount_pct": discount_pct,
            "shipping_days": shipping_days,
            "returned": returned,
            "rating": rating
        }
        orders.append(order)

    # Inject a few outliers + messy values (like real data)
    for idx in random.sample(range(n), 6):
        orders[idx]["unit_price"] = round(orders[idx]["unit_price"] * random.uniform(4, 12), 2)  # extreme price
    for idx in random.sample(range(n), 8):
        orders[idx]["shipping_days"] = orders[idx]["shipping_days"] + random.randint(10, 18)  # extreme delay
    for idx in random.sample(range(n), 10):
        orders[idx]["discount_pct"] = 60  # invalid (should be <= 50)
    for idx in random.sample(range(n), 8):
        orders[idx]["order_date"] = "2025-13-40"  # invalid date string
    for idx in random.sample(range(n), 10):
        orders[idx]["rating"] = "N/A"  # messy rating

    return orders

orders = generate_orders()
len(orders), orders[0]


(420,
 {'order_id': 10001,
  'customer_id': 2155,
  'order_date': '2025-08-04',
  'state': 'PA',
  'category': 'Grocery',
  'payment': 'apple_pay',
  'unit_price': 9.9,
  'quantity': 1,
  'discount_pct': 0,
  'shipping_days': 5,
  'returned': False,
  'rating': 3.3})

## Part 1 — Quick Warm‑up (Loops, Indexing, Conditionals)

1. Print the **first 3 orders** (nicely formatted, one per line).
2. Count how many orders are from **CA**.
3. Count how many orders have `quantity >= 5`.
4. Create a list `high_value_order_ids` containing `order_id` for orders where `unit_price * quantity >= 500`.
   - Print the number of such orders.

In [4]:
# TODO 1: first 3 orders (nicely formatted)
for i in range(3):
    
    print(f"Details of order {i}:")
    for key,value in orders[i].items():
        print(f" |{key}: {value}\t")
        
    
        

# TODO 2: coun
count =0
for i in orders:
    if 'state' in i and i['state'] == "CA":
        
        count = count +1

print("There are "+str(count)+ " orders from CA\n")    
# TODO 3: count quantity >= 5
amount =0
for i in orders:
    if 'quantity' in i and i['quantity'] >=5:
        amount +=1
print("There are " +str(amount) +" orders of at least 5\n")

# TODO 4: high_value_order_ids (subtotal >= 500)

high_value_order_ids = []
number =0
for i in orders:
	a = i['unit_price']
	b = i['quantity']

	if a*b >= 500:
		number += 1
print("There are "+str(number)+" orders that are high value, given that the product of the unit price and quantity of the order is at least 500 or greater.\n")

Details of order 0:
 |order_id: 10001	
 |customer_id: 2155	
 |order_date: 2025-08-04	
 |state: PA	
 |category: Grocery	
 |payment: apple_pay	
 |unit_price: 9.9	
 |quantity: 1	
 |discount_pct: 0	
 |shipping_days: 5	
 |returned: False	
 |rating: 3.3	
Details of order 1:
 |order_id: 10002	
 |customer_id: 2113	
 |order_date: 2025-12-01	
 |state: CO	
 |category: Clothing	
 |payment: card	
 |unit_price: 133.18	
 |quantity: 1	
 |discount_pct: 10	
 |shipping_days: 6	
 |returned: False	
 |rating: 3.7	
Details of order 2:
 |order_id: 10003	
 |customer_id: 2005	
 |order_date: 2025-10-01	
 |state: WA	
 |category: Books	
 |payment: card	
 |unit_price: 9.72	
 |quantity: 1	
 |discount_pct: 0	
 |shipping_days: 5	
 |returned: False	
 |rating: 4.0	
There are 55 orders from CA

There are 48 orders of at least 5

There are 110 orders that are high value, given that the product of the unit price and quantity of the order is at least 500 or greater.



## Part 2 — Functions (Reusable Analysis)

Write the functions below. Each should have a docstring and handle edge cases.

### 2.1 `order_subtotal(order)`
Return `unit_price * quantity` for one order.

### 2.2 `order_total(order)`
Return the total after discount:
total = subtotal * (1 - discount_pct/100)

If `discount_pct` is invalid (> 50 or < 0), treat it as **0**.

### 2.3 `safe_float(x)`
Convert `x` to float if possible; return `None` if not.

Test your functions on 5 random orders.

In [5]:

def safe_float(x):
    """Convert x to float if possible; otherwise return None."""
    # TODO
    try:
        float(x)
    except:
        print("There was an error converting the provided variable into a float, are you sure that the variable provided can be turned into a float?\n")
        return None
    x = float(x)

    return  x
    
def order_subtotal(order):
    """Return unit_price * quantity for one order."""
    # TODO
    a = order['unit_price']
    b=order['quantity']

    subtotal = a*b

    return subtotal

def order_total(order):
    """Return subtotal after discount. Invalid discounts are treated as 0."""
    # TODO
    if 'discount_pct' in order:
        dis = order['discount_pct']
    elif 'discount_pct_clean' in order:
        dis = order['discount_pct_clean']
        
    if dis is None:
        try:
            dis = order['discount_pct_clean']
        except:
            print('error with discount_error')
            return order

    if dis >50 or dis <0:
        dis=0

    
    sub = order_subtotal(order)
    
    total = sub*(1-(dis/100))
    total = safe_float(total)
    return total

# TODO: test on 5 random orders
count=[]
j=1
for i in orders:
    
    j+=1
    count.append(j)
for i in range(5):
    j=random.choice(count)
    res = order_total(orders[i])
    print(f"for order {j}, the total will be {res:.2f}\n")


for order 286, the total will be 9.90

for order 56, the total will be 119.86

for order 20, the total will be 9.72

for order 34, the total will be 488.50

for order 114, the total will be 1582.38



## Part 3 — Data Cleaning (Realistic Messy Data)

### 3.1 Fix invalid discounts
Create `clean_discount_pct(order)` that returns a valid discount in **[0, 50]**:
- if discount is missing or invalid, return 0
- if discount is 60 (or any > 50), return 50

### 3.2 Parse dates safely
Create `parse_date(date_str)` that returns a `datetime.date` or `None` if invalid.

### 3.3 Clean ratings
Create `clean_rating(x)` that returns:
- float rating in [1,5] if valid
- None if missing or "N/A" or invalid

### 3.4 Create `clean_orders`
Create a new list `clean_orders` where each order:
- has `order_date_obj` (parsed date)
- has `discount_pct_clean`
- has `rating_clean`
Do not delete rows yet.

Print:
- how many orders have invalid dates
- how many have missing/invalid ratings

In [6]:
import datetime as dt

def clean_discount_pct(order):
    # TODO
    try:
        order['discount_pct']
    except:
        print(f"Error with {order} discount_pct field")
        return None
        
    a = order['discount_pct']
    if a >50:
        a =50
    order['discount_pct'] = a 
    return order

def parse_date(date_str):
    # TODO
    #if not date_str:
     #   return None
    try:
        a = dt.datetime.strptime(date_str.strip(), '%Y-%m-%d').date()
        return a
        
    except:
        print("error turning provided string into a date")
        return None
        
def clean_rating(x):
    # TODO
    a = safe_float(x)
    #error handling handled within safe_float function
    return a 
    


clean_orders = []
# TODO: build clean_orders

bad_date =0
invalid_rating=0
empty_date=0
count =0


#parse order dates
for i in orders:
    new = parse_date(i['order_date'])
    
    if new is None:
        bad_date += 1
    else:
        i['order_date'] = new
        i['order_date_obj'] = i.pop('order_date')
       # print(i['order_date_obj'])

    new = clean_discount_pct(i)
    
    if i['discount_pct'] is None:
        print('error with discount_pct')
        
    else:
        i['discount_pct_clean'] = i.pop('discount_pct')
        #print(i['discount_pct_clean'])
        
    new = clean_rating(i['rating'])
    
    if new is None: 
        print('error with rating')
        invalid_rating+=1
        
    else:
        i['rating'] = new
        i['rating_clean']= i.pop('rating')

    if 'rating_clean' in i and 'discount_pct_clean' in i and 'order_date_obj' in i:
        clean_orders.append(i)

print(f"\nHere are all the orders with invalid ratings: {invalid_rating}\n")
print(f"\nHere are all the orders with invalid dates: {bad_date}\n")

# TODO: print invalid date count, invalid rating count

There was an error converting the provided variable into a float, are you sure that the variable provided can be turned into a float?

error with rating
There was an error converting the provided variable into a float, are you sure that the variable provided can be turned into a float?

error with rating
There was an error converting the provided variable into a float, are you sure that the variable provided can be turned into a float?

error with rating
There was an error converting the provided variable into a float, are you sure that the variable provided can be turned into a float?

error with rating
There was an error converting the provided variable into a float, are you sure that the variable provided can be turned into a float?

error with rating
There was an error converting the provided variable into a float, are you sure that the variable provided can be turned into a float?

error with rating
error turning provided string into a date
There was an error converting the provid

## Part 4 — Lists, Tuples, and Summary Statistics (No NumPy Yet)

Using only Python (no NumPy in this part):

1. Build a list `totals` containing the **order total** for each order (use cleaned discount).
2. Compute and print:
   - count
   - min, max
   - mean
   - median
   - range
3. Implement from scratch:
   - population variance
   - population standard deviation
   - percentile(values, p) with linear interpolation
   - IQR (Q3 - Q1) using 75th and 25th percentiles
4. Print all results rounded to 2 decimals.

In [7]:
# TODO: totals list
totals = []
for i in orders:
    if 'discount_pct_clean' in i:
        total = order_total(i)
        totals.append(total)
    elif 'discount_pct' in i:
        i = clean_discount_pct(i)
        total = order_total(i)
        totals.append(total)
    else:
        continue
        
    
def mean(values):
    sum = 0
    count = len(values)
    for i in values:
        sum+=i
    mean = sum/count
    safe_float(mean)
    return mean

    
def median_f(values):
    values = sorted(values)
    length = len(values)
    
    if length % 2 == 0:
        b = length/2
        
        b = int(b)
       
        med_2= values[b]
        med_1 = values[b-1]
        a = (med_1+med_2)/2
        a = float(a)
        return a 
    median=((length+1)/2)
    a = values[median]
    return a 

median = median_f(totals)
print(median)

def variance_pop(values):
    data_sq = []
    m = mean(values)

    sum = 0

    for i in values :
        var = i -m
        var = var * var
        data_sq.append(var)

    for i in data_sq:
        sum +=i

    length = len(data_sq)
    pop_var = (sum/length)

    return pop_var
a = variance_pop(totals)

print ( a  )

def std_pop(values):
    var_pop = variance_pop(values)
    std_pop=var_pop **(.5)

    return std_pop

def percentile(values, p):
    """p in [0,100]. Use linear interpolation between closest ranks."""
    values = sorted(values)
    n = len(values)

    position=(p/100) * (n-1)
    index = int(position)
    frac = position - index

    if frac ==0:
        return values[index]
    else:
        l = values[index]
        h=values[index+1]
        return l + frac * (h-l)

def IQR(values):
    Q3 = percentile(values,75)
    Q1 = percentile(values, 25)
    IQR = Q3 - Q1
    return IQR


print(f"Here is the mean of totals: {mean(totals):.2f}\n")
print(f"Here is the median of totals: {median_f(totals):.2f}\n")
print(f"Here is the variance of the population of totals: {variance_pop(totals):.2f}\n")
print(f"Here is the standard deviation of the population of totals: {std_pop(totals):.2f}\n")
print(f"Here is the 25th, 50th, and 75th percentiles of totals: {percentile(totals, 25):.2f}, {percentile(totals, 50):.2f}, {percentile(totals, 75):.2f}\n")
print(f"Here is the IQR of totals: {IQR(totals):.2f}\n")

    
# TODO: compute + print summary stats

157.52499999999998
792129.0155501126
Here is the mean of totals: 490.16

Here is the median of totals: 157.52

Here is the variance of the population of totals: 792129.02

Here is the standard deviation of the population of totals: 890.02

Here is the 25th, 50th, and 75th percentiles of totals: 45.38, 157.52, 513.27

Here is the IQR of totals: 467.89



### Part 5 — Dictionaries and Sets (Grouping Like Analysts Do)

### 5.1 Unique values
- Create a set of unique `states`
- Create a set of unique `categories`
Print both and their counts.

### 5.2 Revenue by category
Create a dictionary `revenue_by_category` mapping category -> total revenue (sum of order total).

### 5.3 Return rate by category
Create `return_rate_by_category` mapping category -> return rate (returned_count / total_count).

### 5.4 Top categories
Print the **top 3 categories by revenue**, in descending order, as:
`Category: $revenue (return_rate=...)`

In [14]:
# TODO: sets
states = set()
categories = set()

for i in orders:
    if 'state' in i:
        states.add(i['state'])
                      
    if 'category' in i:
        categories.add(i['category'])
                          
    else:
        continue    
        
print(f"This is how many unique states there are in orders: {len(states)}\n")
print(f"This is how many unique categories there are in orders: {len(categories)}\n")


revenue_by_category = {}
return_rate_by_category = {}

# TODO: fill dictionaries

# TODO: print top 3 by revenue

This is how many unique sates there are in orders: 15

This is how many unique categories there are in orders: 8



## Part 6 — Scope (Local, Global, Nonlocal)

### 6.1 Global vs local
Create a global variable `TAX_RATE = 0.08`.
Write `total_with_tax(order)` that uses `TAX_RATE` and returns:
`order_total_clean(order) * (1 + TAX_RATE)`.

### 6.2 Nonlocal (harder)
Write a function `make_counter()` that returns a function `counter()`:
- Each time you call `counter()`, it returns 1, 2, 3, ...
This requires using `nonlocal`.

Test by calling it 5 times.

In [34]:
# TODO 6.1
TAX_RATE = 0.08

def total_with_tax(order):
    # TODO
    pass

# TODO 6.2
def make_counter():
    # TODO
    pass

# TODO: test counter

## Part 7 — NumPy (Arrays, Masking, Vectorized Computation)

Create NumPy arrays for:
- unit_price
- quantity
- discount_clean
- shipping_days
- returned (as int 0/1)

### 7.1 Vectorized totals
Compute vectorized order totals using NumPy and compare:
- the mean total from Python list `totals`
- the mean total from NumPy

### 7.2 Boolean masking
Compute:
- average total for returned orders
- average total for not-returned orders
- percent of orders with shipping_days >= 10

### 7.3 Outlier detection (z-score)
Using NumPy, compute z-scores for totals and flag outliers with |z| >= 3.
Print how many outliers and show the top 5 outlier orders (order_id, total, z).

In [35]:
import numpy as np

# TODO: build arrays

# TODO 7.1 vectorized totals

# TODO 7.2 masking stats

# TODO 7.3 z-score outliers

## Part 8 — Mini “Analyst Task” (Hard)

You are asked to create a short report for a manager:

1. Find the **worst 3 states by average shipping_days** (highest averages).
   - Ignore orders with invalid dates (`order_date_obj is None`).
2. For each of those states, compute:
   - average shipping_days
   - return rate
   - average rating (ignore missing ratings)
3. Print a clean report (one state per line).

**Constraints**
- Use dictionaries + loops (no Pandas).
- Use helper functions where appropriate.

In [36]:
# TODO: manager report

## Submission Checklist
- All TODOs completed
- Notebook runs without errors top-to-bottom
- Outputs are readable (use rounding and formatting)