# **Task 1: Generate the raw dataset using fixed rules*

**Use the rules below exactly. Do not change formulas or thresholds.**

**Create seed_value from your birth date in format DDMM (example: 7 April -> 704).**

**Set n = 320.**

**Create RNG: rng = np.random.default_rng(seed_value).**


**Generate a list of dictionaries named tickets with exactly n records. For index i from 1 to n:**

**ticket_id: "T{seed_value}-{i:04d}"**

**route: choose from ["NYC-LAX", "LHR-JFK", "SFO-SEA", "DXB-SIN", "MAD-ROM"] using (i + seed_value) % 5**

**day: choose from ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"] using (i + seed_value) % 7**

**days_to_departure: 1 + ((i * 3 + seed_value) % 60)**

**class: choose from ["economy", "premium", "business"] using (i * 2 + seed_value) % 3**

**price_usd:**

**base = 120 + (days_to_departure * -1.5)**

**route_adj = [140, 220, 60, 180, 80] based on route index**

**class_adj = [0, 80, 220] based on class index**

**noise = rng.normal(0, 25)**

**price_usd = round(base + route_adj + class_adj + noise, 2)**


**Inject deterministic data issues:**
**
**If i % 28 == 0, set price_usd = ""**
**
**If i % 45 == 0, multiply price_usd by -1**
**
**If i % 37 == 0, set class to uppercase**

**After generation, print total record count and show first five records.**

In [4]:
import random
import numpy as np
seed_value = 1605
n = 320
rng = np.random.default_rng(seed_value)

In [5]:
route = ["NYC-LAX", "LHR-JFK", "SFO-SEA", "DXB-SIN", "MAD-ROM"]
day = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
classes = ["economy", "premium", "business"]
route_adj = [140, 220, 60, 180, 80]
class_adj = [0, 80, 220]

In [6]:
tickets = []

In [7]:
for i in range(1,n+1):
    
    route_idx = (i + seed_value) % 5
    day_idx = (i + seed_value) % 7
    days_to_dep = 1 + ((i * 3 + seed_value) % 60)
    classes_idx = (i * 2 + seed_value) % 3


    
    base = 120 + (days_to_dep * -1.5)
    noise = rng.normal(0, 25)
    price_usd = round(base + route_idx + classes_idx + noise, 2)

    
    ticket_class = classes[classes_idx]
    if i % 28 == 0:
        price_usd = ""
    if i % 45 == 0:
        price_usd = price_usd * (-1)
    if i % 37 == 0:
        ticket_class = ticket_class.upper()

    
    tickets.append({
        "ticket_id": f"T{seed_value}-{i:04d}",
        "route": route[route_idx],
        "day": day[day_idx],
        "days_to_departure": days_to_dep,
        "class": ticket_class,
        "price_usd": price_usd
    })

print(f"Task 1: Total Records Generated: {len(tickets)}")
print("First 5 records:", tickets[:5])

Task 1: Total Records Generated: 320
First 5 records: [{'ticket_id': 'T1605-0001', 'route': 'LHR-JFK', 'day': 'Thu', 'days_to_departure': 49, 'class': 'business', 'price_usd': 83.49}, {'ticket_id': 'T1605-0002', 'route': 'SFO-SEA', 'day': 'Fri', 'days_to_departure': 52, 'class': 'premium', 'price_usd': 37.46}, {'ticket_id': 'T1605-0003', 'route': 'DXB-SIN', 'day': 'Sat', 'days_to_departure': 55, 'class': 'economy', 'price_usd': 69.39}, {'ticket_id': 'T1605-0004', 'route': 'MAD-ROM', 'day': 'Sun', 'days_to_departure': 58, 'class': 'business', 'price_usd': 18.72}, {'ticket_id': 'T1605-0005', 'route': 'NYC-LAX', 'day': 'Mon', 'days_to_departure': 1, 'class': 'premium', 'price_usd': 102.19}]


--------------------------------------------

# **Task 2: Validate and clean records with core Python*

**Identify invalid records (missing/non-numeric price_usd, or negative prices). Build cleaned_tickets with only valid records and normalized lowercase class.**

**After cleaning, confirm cleaned count and verify no invalid prices remain. Show two cleaned records.**

In [8]:
cleaned_tickets =[]
for t in tickets:
    p = t['price_usd']
    if isinstance(p,(int,float)) and p>0:
        cleaned_t = t.copy()
        cleaned_t['class'] = t['class'].lower()
        cleaned_tickets.append(t)
print(f"Task2: Cleaned Records: {len(cleaned_tickets)}")
print("Sample Cleaned Records:", cleaned_tickets[:2])
                                    

Task2: Cleaned Records: 302
Sample Cleaned Records: [{'ticket_id': 'T1605-0001', 'route': 'LHR-JFK', 'day': 'Thu', 'days_to_departure': 49, 'class': 'business', 'price_usd': 83.49}, {'ticket_id': 'T1605-0002', 'route': 'SFO-SEA', 'day': 'Fri', 'days_to_departure': 52, 'class': 'premium', 'price_usd': 37.46}]


# **Task 3: Convert to NumPy for analysis*
**Create NumPy arrays for prices and days. Compute mean and standard deviation of prices. Compute total revenue per day and ticket counts per day using vectorized operations (no loops). Validate daily totals sum to overall total revenue.**

In [16]:
prices = np.array([t["price_usd"] for t in cleaned_tickets])
days_arr = np.array([t["day"] for t in cleaned_tickets])

In [10]:
mean_p = np.mean(prices)
std_p = np.std(prices)

In [29]:
unique_days = np.array(['Mon','Tue','Wed','Thu','Fri','Sat','Sun'])
day_indices = np.searchsorted(unique_days,days_arr)

counts_vector = np.bincount(day_indices,minlength=7)
totals_vector = np.bincount(day_indices,weights=prices,minlength=7)

daily_totals = dict(zip(unique_days,totals_vector))
daily_counts = dict(zip(unique_days,counts_vector))

In [30]:
# 1. Calculate total revenue from the original cleaned prices array
total_revenue = np.sum(prices)

# 2. Calculate the sum of our vectorized daily totals
# (totals_vector is the array we got from np.bincount)
sum_of_daily_buckets = np.sum(totals_vector)

# 3. Compare them
# We use np.isclose because computers sometimes have tiny rounding errors with decimals
is_valid = np.isclose(total_revenue, sum_of_daily_buckets)

print(f"Total Revenue: ${total_revenue:,.2f}")
print(f"Sum of Daily Totals: ${sum_of_daily_buckets:,.2f}")
print(f"Validation Successful: {is_valid}")

Total Revenue: $24,811.83
Sum of Daily Totals: $24,811.83
Validation Successful: True


# **Task 4: Identify high-price tickets*
**Define high-price tickets as above the 90th percentile of prices. Compute threshold and count. Verify all selected prices are >= threshold.**

In [31]:
p90 = np.percentile(prices,90)

In [32]:
high_price_mask = prices>90

In [33]:
high_price_count = np.sum(high_price_mask)
high_price_count

np.int64(130)

In [34]:
print(f"Task4: High-Price (90th percentile) is >= {p90}")

Task4: High-Price (90th percentile) is >= 127.27000000000001


# **Task 5: Produce a final report*

**Create a report dictionary with keys:**

**total_tickets**

**cleaned_tickets**

**mean_price**

**std_price**

**daily_totals**

**high_price_count**

**Print a readable report and include at leas**

In [38]:
report = {
    "total_tickets": len(tickets),
    "cleaned_tickets": len(cleaned_tickets),
    "mean_price": round(float(mean_p), 2),
    "std_price": round(float(std_p), 2),
    "daily_totals": {k: round(v, 2) for k, v in daily_totals.items()},
    "high_price_count": int(high_price_count)
}

print("\n--- FINAL REPORT ---")
for key, value in report.items():
    print(f"{key}: {value}")

print(f"\nExplicit Validation: The cleaning removed {len(tickets) - len(cleaned_tickets)} invalid records.")


--- FINAL REPORT ---
total_tickets: 320
cleaned_tickets: 302
mean_price: 82.16
std_price: 34.86
daily_totals: {np.str_('Mon'): np.float64(7430.24), np.str_('Tue'): np.float64(7742.09), np.str_('Wed'): np.float64(0.0), np.str_('Thu'): np.float64(0.0), np.str_('Fri'): np.float64(0.0), np.str_('Sat'): np.float64(0.0), np.str_('Sun'): np.float64(3329.95)}
high_price_count: 130

Explicit Validation: The cleaning removed 18 invalid records.
