#### Task 1: Generate the raw dataset using fixed rules
Use the rules below exactly. Do not change the formulas or thresholds.

Create a seed called seed_value from your birth date in DDMM format. Example: if your birthday is 7 April, DDMM is 0704, so seed_value is the integer 704 (the leading zero is dropped).
Set the number of records to n = 320.
Create a NumPy random generator with rng = np.random.default_rng(seed_value).
Now generate a list of dictionaries called tickets with exactly n records, where each record is built as follows for index i from 1 to n:

ticket_id: "T{seed_value}-{i:04d}"
route: choose from ["NYC-LAX", "LHR-JFK", "SFO-SEA", "DXB-SIN", "MAD-ROM"] using index (i + seed_value) % 5
day: choose from ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"] using index (i + seed_value) % 7
days_to_departure: 1 + ((i * 3 + seed_value) % 60) (integer 1–60)
class: choose from ["economy", "premium", "business"] using index (i * 2 + seed_value) % 3
price_usd: compute:
base = 120 + (days_to_departure * -1.5)
route_adj = [140, 220, 60, 180, 80] based on the chosen route index
class_adj = [0, 80, 220] based on class index
noise = rng.normal(0, 25)
price_usd = round(base + route_adj + class_adj + noise, 2)
Inject data issues deterministically by modifying records as you generate them:

If i % 28 == 0, set price_usd to an empty string ""
If i % 45 == 0, set price_usd to a negative number by multiplying it by -1
If i % 37 == 0, set class to uppercase
After generation, print the total number of records and show the first five entries to confirm structure.
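
The seed rule above can be made concrete with a small helper. This is a minimal sketch that assumes the birthday is available as separate day and month integers; `seed_from_birthday` is a hypothetical name, not part of the assignment. Python rejects an integer literal with a leading zero such as `0704`, so the DDMM string is built first and then converted.

```python
def seed_from_birthday(day: int, month: int) -> int:
    """Build the DDMM string and convert it to an integer seed (hypothetical helper)."""
    ddmm = f"{day:02d}{month:02d}"  # e.g. 7 April -> "0704"
    return int(ddmm)                # the leading zero is dropped -> 704


seed_example = seed_from_birthday(7, 4)
print(seed_example)  # 704
```

With a birthday of 23 December the same helper returns 2312, the seed used in the cells below.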

In [73]:
import numpy as np

In [74]:
seed_value=2312
n=320
rng=np.random.default_rng(seed_value)

routes = ["NYC-LAX", "LHR-JFK", "SFO-SEA", "DXB-SIN", "MAD-ROM"]
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
classes = ["economy", "premium", "business"]

In [75]:
tickets = [] 

for i in range(1, n + 1):

    route_idx = (i + seed_value) % 5
    day_idx = (i + seed_value) % 7
    class_idx = (i * 2 + seed_value) % 3

    route = routes[route_idx]
    day = days[day_idx]
    ticket_class = classes[class_idx]

    days_to_departure = 1 + ((i * 3 + seed_value) % 60)

    base = 120 + (days_to_departure * -1.5)
    route_adj = [140, 220, 60, 180, 80][route_idx]
    class_adj = [0, 80, 220][class_idx]
    noise = rng.normal(0, 25)

    price_usd = round(base + route_adj + class_adj + noise, 2)

    if i % 28 == 0:
        price_usd = ""
    elif i % 45 == 0:
        price_usd = -price_usd

    if i % 37 == 0:
        ticket_class = ticket_class.upper()

    ticket = {
        "ticket_id": f"T{seed_value}-{i:04d}",
        "route": route,
        "day": day,
        "days_to_departure": days_to_departure,
        "class": ticket_class,
        "price_usd": price_usd
    }

    tickets.append(ticket)

In [76]:
print("Total records:", len(tickets))
print("First 5 records:")
for t in tickets[:5]:
    print(t)

Total records: 320
First 5 records:
{'ticket_id': 'T2312-0001', 'route': 'DXB-SIN', 'day': 'Thu', 'days_to_departure': 36, 'class': 'premium', 'price_usd': 303.6}
{'ticket_id': 'T2312-0002', 'route': 'MAD-ROM', 'day': 'Fri', 'days_to_departure': 39, 'class': 'economy', 'price_usd': 125.63}
{'ticket_id': 'T2312-0003', 'route': 'NYC-LAX', 'day': 'Sat', 'days_to_departure': 42, 'class': 'business', 'price_usd': 435.39}
{'ticket_id': 'T2312-0004', 'route': 'LHR-JFK', 'day': 'Sun', 'days_to_departure': 45, 'class': 'premium', 'price_usd': 334.82}
{'ticket_id': 'T2312-0005', 'route': 'SFO-SEA', 'day': 'Mon', 'days_to_departure': 48, 'class': 'economy', 'price_usd': 95.01}


#### Task 2: Validate and clean records with core Python
Write validation logic that identifies invalid records. Treat missing or non-numeric price_usd values and negative prices as invalid. Build a new list cleaned_tickets that contains only valid records and normalizes the class values to lowercase.

After cleaning, confirm the number of cleaned records and verify that no invalid prices remain. Show two cleaned records to demonstrate normalization.
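
One way to express the validity rule is a single predicate. The sketch below is illustrative only: `is_valid_price` is a hypothetical helper, and it goes slightly beyond the task by also rejecting booleans and non-finite values (NaN, infinity), which the task does not mention. Zero is treated as valid on the assumption that only negative prices are invalid.

```python
import math
from numbers import Real


def is_valid_price(value) -> bool:
    """Return True only for finite, non-negative real numbers (hypothetical helper)."""
    if isinstance(value, bool):      # bool subclasses int; treated as invalid here
        return False
    if not isinstance(value, Real):  # rejects "", None, and other non-numeric values
        return False
    return math.isfinite(value) and value >= 0


checks = [is_valid_price(v) for v in (303.6, "", -125.63, None, float("nan"), 0)]
print(checks)  # [True, False, False, False, False, True]
```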

In [77]:
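cleaned_tickets = []

for t in tickets:
    price = t["price_usd"]

    # Skip missing ("") or otherwise non-numeric prices
    if not isinstance(price, (int, float)):
        continue

    # Skip negative prices
    if price < 0:
        continue

    # Copy so the raw record stays untouched, then normalize the class label
    cleaned_ticket = t.copy()
    cleaned_ticket["class"] = cleaned_ticket["class"].lower()
    cleaned_tickets.append(cleaned_ticket)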

In [78]:
print("Cleaned tickets:", len(cleaned_tickets))

Cleaned tickets: 302
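
The count of 302 can be cross-checked from the injection rules alone, without regenerating the data. The sketch below assumes every generated price is positive before the sign flip, so each `i % 45 == 0` record really ends up negative.

```python
n = 320
missing = sum(1 for i in range(1, n + 1) if i % 28 == 0)   # prices set to ""
negative = sum(1 for i in range(1, n + 1) if i % 45 == 0)  # prices multiplied by -1
both = sum(1 for i in range(1, n + 1) if i % 28 == 0 and i % 45 == 0)

# Inclusion-exclusion: a record hit by both rules is only removed once
expected_clean = n - missing - negative + both
print(missing, negative, both, expected_clean)  # 11 7 0 302
```

No index up to 320 is divisible by both 28 and 45 (their least common multiple is 1260), so the two rules never collide in this dataset.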


In [79]:
invalid_exists = any(
    (t["price_usd"] == "" or not isinstance(t["price_usd"], (int, float)) or t["price_usd"] < 0)
    for t in cleaned_tickets
)


print("Any invalid prices left?:", invalid_exists)

Any invalid prices left?: False


In [80]:
print("Sample cleaned records:")
for t in cleaned_tickets[:2]:
    print(t)

Sample cleaned records:
{'ticket_id': 'T2312-0001', 'route': 'DXB-SIN', 'day': 'Thu', 'days_to_departure': 36, 'class': 'premium', 'price_usd': 303.6}
{'ticket_id': 'T2312-0002', 'route': 'MAD-ROM', 'day': 'Fri', 'days_to_departure': 39, 'class': 'economy', 'price_usd': 125.63}


#### Task 3: Convert to NumPy for analysis
Create NumPy arrays for numeric analysis. Build an array prices from the cleaned ticket prices and an array days from the cleaned day labels. Use NumPy to compute the overall mean and standard deviation of prices. Compute total revenue per day and the number of tickets per day using vectorized operations, not loops. Validate that the sum of daily totals matches the overall total revenue from prices.
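
One loop-free way to get per-group sums and counts is to encode each label as an integer with `np.unique(..., return_inverse=True)` and then aggregate with `np.bincount`. The sketch below uses toy data, not the notebook's dataset, and the `toy_` names are made up for illustration.

```python
import numpy as np

# Toy data, not taken from the notebook's dataset
toy_prices = np.array([100.0, 50.0, 25.0, 10.0])
toy_days = np.array(["Mon", "Tue", "Mon", "Tue"])

# labels: sorted unique day names; codes: position of each ticket's day in labels
labels, codes = np.unique(toy_days, return_inverse=True)

totals = np.bincount(codes, weights=toy_prices)  # sum of prices per label
counts = np.bincount(codes)                      # number of tickets per label

print(labels, totals, counts)
```

Here `totals` holds 125.0 for Mon and 60.0 for Tue, and `counts` holds 2 for each.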

In [81]:
prices = np.array([t["price_usd"] for t in cleaned_tickets], dtype=float)
days = np.array([t["day"] for t in cleaned_tickets])

In [82]:
price_mean=prices.mean()
print("Overall mean of prices:", round(price_mean, 2))

Overall mean of prices: 307.23


In [83]:
price_std=prices.std()
print("Std of prices: ",round(price_std,2))

Std of prices:  115.41
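
Note that `ndarray.std()` defaults to `ddof=0`, the population standard deviation. If the tickets are treated as a sample, `ddof=1` gives the sample estimate instead. The task does not say which is intended, so the sketch below only illustrates the difference on toy values.

```python
import numpy as np

sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # toy values

pop_std = sample.std()           # ddof=0: divide by N
sample_std = sample.std(ddof=1)  # ddof=1: divide by N - 1

print(pop_std, sample_std)
```

For these values the population figure is 2.0, and the sample figure is slightly larger because the divisor is smaller.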


In [84]:
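# return_inverse gives, for every ticket, the index of its day in unique_days
unique_days, day_codes = np.unique(days, return_inverse=True)

In [85]:
# Vectorized grouping: no Python loop over days
daily_revenue = np.bincount(day_codes, weights=prices)
daily_count = np.bincount(day_codes)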

In [86]:
print("\nDaily revenue and ticket count:")
for i, d in enumerate(unique_days):
    print(f"{d}: revenue = {daily_revenue[i]:.2f}, count = {daily_count[i]}")


Daily revenue and ticket count:
Fri: revenue = 13736.29, count = 45
Mon: revenue = 13548.82, count = 45
Sat: revenue = 14367.91, count = 45
Sun: revenue = 13782.55, count = 45
Thu: revenue = 13643.49, count = 45
Tue: revenue = 13316.07, count = 44
Wed: revenue = 10388.16, count = 33
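
The days appear as Fri, Mon, Sat, ... because `np.unique` returns labels in sorted (alphabetical) order. If calendar order is preferred, the results can be reindexed. The sketch below copies the revenue figures from the output above and assumes all seven days are present; the variable names are illustrative.

```python
import numpy as np

calendar = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Values copied from the output above, in the alphabetical order np.unique returns
labels = np.array(["Fri", "Mon", "Sat", "Sun", "Thu", "Tue", "Wed"])
revenue = np.array([13736.29, 13548.82, 14367.91, 13782.55, 13643.49, 13316.07, 10388.16])

# labels is sorted, so searchsorted finds each calendar day's position in it
order = np.searchsorted(labels, calendar)
for day, total in zip(labels[order], revenue[order]):
    print(f"{day}: revenue = {total:.2f}")
```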


In [87]:
total_revenue = prices.sum()
sum_of_daily = daily_revenue.sum()

In [88]:
print("\nTotal revenue from prices:", round(total_revenue, 2))
print("Sum of daily revenue:", round(sum_of_daily, 2))
print("Match?:", np.isclose(total_revenue, sum_of_daily))


Total revenue from prices: 92783.29
Sum of daily revenue: 92783.29
Match?: True


#### Task 4: Identify high-price tickets
Define high-price tickets as those above the 90th percentile of prices. Use NumPy to compute the percentile threshold and select the corresponding tickets. Report the threshold and the count of high-price tickets, and verify that all selected prices are greater than or equal to the threshold.

In [89]:
threshold_90 = np.percentile(prices, 90)

In [90]:
high_price_tickets = np.array(cleaned_tickets)[prices >= threshold_90]

In [91]:
print("90th percentile threshold:", round(threshold_90,2))
print("Number of high-price tickets:", high_price_tickets.shape[0])
print("All prices >= threshold?:", np.all(prices[prices >= threshold_90] >= threshold_90))

90th percentile threshold: 466.61
Number of high-price tickets: 31
All prices >= threshold?: True
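
The task text says "above" the 90th percentile while the check uses "greater than or equal to". The two masks differ only when some prices sit exactly on the threshold. A toy illustration with tied top values (not the notebook's data):

```python
import numpy as np

# Toy values with a tie at the top, so the 90th percentile lands exactly on 10.0
toy = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10], dtype=float)
cutoff = np.percentile(toy, 90)  # default method interpolates linearly

strictly_above = toy[toy > cutoff]   # excludes values equal to the cutoff
at_or_above = toy[toy >= cutoff]     # includes them

print(cutoff, strictly_above.size, at_or_above.size)  # 10.0 0 2
```

With continuous prices rounded to cents, exact ties with the interpolated threshold are unlikely, so the choice rarely changes the count, but it is worth stating which definition is used.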


#### Task 5: Produce a final report
Create a report dictionary with keys total_tickets, cleaned_tickets, mean_price, std_price, daily_totals, and high_price_count. Convert the report into a readable string and print it in the notebook. Include at least one explicit validation statement, such as confirming that cleaned_tickets is less than or equal to total_tickets.
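
As an alternative to building the string by hand, a report dictionary can be rendered with `json.dumps`. This is only a sketch under assumptions: the numbers are made up, and `to_builtin` is a hypothetical fallback for NumPy values that the standard encoder does not accept (for example `np.int64`).

```python
import json

import numpy as np


def to_builtin(value):
    """Fallback for json.dumps: unwrap NumPy scalars and arrays (hypothetical helper)."""
    if isinstance(value, np.generic):
        return value.item()
    if isinstance(value, np.ndarray):
        return value.tolist()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


# Toy report with made-up numbers, shaped like the one requested above
toy_report = {
    "total_tickets": 4,
    "cleaned_tickets": np.int64(3),
    "mean_price": np.float64(61.67),
    "daily_totals": {"Mon": 125.0, "Tue": 60.0},
}

# Explicit validation statement, as the task asks for
assert toy_report["cleaned_tickets"] <= toy_report["total_tickets"]

report_text = json.dumps(toy_report, indent=2, default=to_builtin)
print(report_text)
```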

In [92]:
report = {
    "total_tickets": len(tickets),
    "cleaned_tickets": len(cleaned_tickets),
    "mean_price": round(prices.mean(), 2),
    "std_price": round(prices.std(), 2),
    "daily_totals": {d: round(total, 2) for d, total in zip(unique_days, daily_revenue)},
    "high_price_count": high_price_tickets.shape[0]
}

assert report["cleaned_tickets"] <= report["total_tickets"], "Cleaned tickets cannot exceed total tickets"

In [93]:

report_str = (
    f"Report Summary:\n"
    f"-----------------\n"
    f"Total tickets: {report['total_tickets']}\n"
    f"Cleaned tickets: {report['cleaned_tickets']}\n"
    f"Mean price: ${report['mean_price']}\n"
    f"Std dev of prices: ${report['std_price']}\n"
    f"Daily revenue totals:\n"
)

In [94]:
for day, total in report["daily_totals"].items():
    report_str += f"  {day}: ${total}\n"

report_str += f"High-price ticket count (90th percentile+): {report['high_price_count']}\n"

print(report_str)

Report Summary:
-----------------
Total tickets: 320
Cleaned tickets: 302
Mean price: $307.23
Std dev of prices: $115.41
Daily revenue totals:
  Fri: $13736.29
  Mon: $13548.82
  Sat: $14367.91
  Sun: $13782.55
  Thu: $13643.49
  Tue: $13316.07
  Wed: $10388.16
High-price ticket count (90th percentile+): 31

