## Setup and Context 

You will work with synthetic data representing airline ticket prices for a week of departures. Each record includes `ticket_id`, `route`, `day`, `days_to_departure`, `class`, and `price_usd`.

### Task 1: Generate the raw dataset using fixed rules 

Use the rules below exactly. Do not change the formulas or thresholds.

1. Create a seed called `seed_value` from your birth date in format `DDMM`. Example: if your birthday is 7 April, `DDMM` is `0704`, so `seed_value = 704` as an integer.
2. Set the number of records to `n = 320`.
3. Create a NumPy random generator with `rng = np.random.default_rng(seed_value)`.

Now generate a list of dictionaries called `tickets` with exactly `n` records, where each record is built as follows for index `i` from 1 to `n`:

- `ticket_id`: `"T{seed_value}-{i:04d}"`
- `route`: choose from `["NYC-LAX", "LHR-JFK", "SFO-SEA", "DXB-SIN", "MAD-ROM"]` using index `(i + seed_value) % 5`
- `day`: choose from `["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]` using index `(i + seed_value) % 7`
- `days_to_departure`: `1 + ((i * 3 + seed_value) % 60)` (integer 1–60)
- `class`: choose from `["economy", "premium", "business"]` using index `(i * 2 + seed_value) % 3`
- `price_usd`: compute: 
  - `base = 120 + (days_to_departure * -1.5)`
  - `route_adj = [140, 220, 60, 180, 80]` based on the chosen route index
  - `class_adj = [0, 80, 220]` based on class index
  - `noise = rng.normal(0, 25)`
  - `price_usd = round(base + route_adj + class_adj + noise, 2)`

Inject data issues deterministically by modifying records as you generate them:

- If `i % 28 == 0`, set `price_usd` to an empty string `""`
- If `i % 45 == 0`, set `price_usd` to a negative number by multiplying it by `-1`
- If `i % 37 == 0`, set `class` to uppercase
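Because the injection rules depend only on `i`, the number of affected records is known before any data is generated. A quick sketch for `n = 320` (the variable names are illustrative, not required by the task):

```python
n = 320

# Indices hit by each injection rule.
empty_ids = {i for i in range(1, n + 1) if i % 28 == 0}
negative_ids = {i for i in range(1, n + 1) if i % 45 == 0}
upper_ids = {i for i in range(1, n + 1) if i % 37 == 0}

# Indices hit by more than one rule (none for n = 320).
overlap = (empty_ids & negative_ids) | (empty_ids & upper_ids) | (negative_ids & upper_ids)

invalid_price_count = len(empty_ids | negative_ids)
print(len(empty_ids), len(negative_ids), len(upper_ids), len(overlap), invalid_price_count)
```

With 320 records, 11 have an empty price, 7 have a negative price, 8 have an uppercase class, and no record is hit by two rules, so 18 records carry an invalid price.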

After generation, print the total number of records and show the first five entries to confirm structure. 

In [1]:
import numpy as np 
import random 

In [2]:
route = ["NYC-LAX", "LHR-JFK", "SFO-SEA", "DXB-SIN", "MAD-ROM"]
day = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
class_ = ["economy", "premium", "business"]
route_adj = [140, 220, 60, 180, 80]
class_adj = [0, 80, 220]

seed_value = 2603
n = 320
rng = np.random.default_rng(seed_value)

def generate_data(seed_value):
    # Yield one ticket dict per index i = 1..n; one noise value is drawn per record.
    for i in range(1, n + 1):
        route_index = (i + seed_value) % 5
        day_index = (i + seed_value) % 7
        class_index = (i * 2 + seed_value) % 3
        days_to_departure = 1 + ((i * 3 + seed_value) % 60)
        base = 120 + (days_to_departure * -1.5)
        noise = rng.normal(0, 25)
        price_usd = round(base + route_adj[route_index] + class_adj[class_index] + noise, 2)
        ticket_class = class_[class_index]

        # Deterministic data issues, checked in the order 28, 45, 37.
        if i % 28 == 0:
            price_usd = ""
        elif i % 45 == 0:
            price_usd = -1 * price_usd
        elif i % 37 == 0:
            ticket_class = ticket_class.upper()

        yield {
            "ticket_id": f"T{seed_value}-{i:04d}",
            "route": route[route_index],
            "day": day[day_index],
            "days_to_departure": days_to_departure,
            "class": ticket_class,
            "price_usd": price_usd
        }

# 320 is passed as the function's seed_value, so the IDs and index formulas use 320
# (hence the "T320-" prefix), while rng above is seeded with 2603.
tickets = list(generate_data(320))
print(tickets)

[{'ticket_id': 'T320-0001', 'route': 'LHR-JFK', 'day': 'Sun', 'days_to_departure': 24, 'class': 'premium', 'price_usd': 400.49}, {'ticket_id': 'T320-0002', 'route': 'SFO-SEA', 'day': 'Mon', 'days_to_departure': 27, 'class': 'economy', 'price_usd': 136.18}, {'ticket_id': 'T320-0003', 'route': 'DXB-SIN', 'day': 'Tue', 'days_to_departure': 30, 'class': 'business', 'price_usd': 506.1}, {'ticket_id': 'T320-0004', 'route': 'MAD-ROM', 'day': 'Wed', 'days_to_departure': 33, 'class': 'premium', 'price_usd': 221.03}, {'ticket_id': 'T320-0005', 'route': 'NYC-LAX', 'day': 'Thu', 'days_to_departure': 36, 'class': 'economy', 'price_usd': 205.29}, {'ticket_id': 'T320-0006', 'route': 'LHR-JFK', 'day': 'Fri', 'days_to_departure': 39, 'class': 'business', 'price_usd': 496.32}, {'ticket_id': 'T320-0007', 'route': 'SFO-SEA', 'day': 'Sat', 'days_to_departure': 42, 'class': 'premium', 'price_usd': 197.97}, {'ticket_id': 'T320-0008', 'route': 'DXB-SIN', 'day': 'Sun', 'days_to_departure': 45, 'class': 'econom

In [3]:
# Copy each record so that cleaning does not modify the dictionaries in `tickets`.
tickets_copy = [dict(t) for t in tickets]

### Task 2: Validate and clean records with core Python

Identify invalid records (missing/non-numeric `price_usd`, or negative prices). Build `cleaned_tickets` with only valid records and normalized lowercase `class`.

After cleaning, confirm cleaned count and verify no invalid prices remain. Show two cleaned records.

In [4]:
cleaned_tickets = []
def clean_records(tickets_copy): 
    for item in tickets_copy: 
        price_usd = item["price_usd"] 
        class_ = item["class"] 

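        # A record is dropped only when price_usd is "" (or 0) and class is already lowercase:
        # `price_usd == -1*price_usd` is False for negative numbers, so negative prices are kept.
        if not(price_usd == "" or price_usd == -1*price_usd) or not(class_ == class_.lower()):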
            item["class"] = item["class"].lower()
            cleaned_tickets.append(item)
            
    return cleaned_tickets  

print(clean_records(tickets_copy)) 

[{'ticket_id': 'T320-0001', 'route': 'LHR-JFK', 'day': 'Sun', 'days_to_departure': 24, 'class': 'premium', 'price_usd': 400.49}, {'ticket_id': 'T320-0002', 'route': 'SFO-SEA', 'day': 'Mon', 'days_to_departure': 27, 'class': 'economy', 'price_usd': 136.18}, {'ticket_id': 'T320-0003', 'route': 'DXB-SIN', 'day': 'Tue', 'days_to_departure': 30, 'class': 'business', 'price_usd': 506.1}, {'ticket_id': 'T320-0004', 'route': 'MAD-ROM', 'day': 'Wed', 'days_to_departure': 33, 'class': 'premium', 'price_usd': 221.03}, {'ticket_id': 'T320-0005', 'route': 'NYC-LAX', 'day': 'Thu', 'days_to_departure': 36, 'class': 'economy', 'price_usd': 205.29}, {'ticket_id': 'T320-0006', 'route': 'LHR-JFK', 'day': 'Fri', 'days_to_departure': 39, 'class': 'business', 'price_usd': 496.32}, {'ticket_id': 'T320-0007', 'route': 'SFO-SEA', 'day': 'Sat', 'days_to_departure': 42, 'class': 'premium', 'price_usd': 197.97}, {'ticket_id': 'T320-0008', 'route': 'DXB-SIN', 'day': 'Sun', 'days_to_departure': 45, 'class': 'econom
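A stricter validity check can be sketched as follows. This is an illustrative alternative, not the cell above: `is_valid_price` and `clean_records_strict` are hypothetical names, the sample records are made up, and treating a price of exactly 0 as valid is an assumption the task does not settle.

```python
import math
from numbers import Real


def is_valid_price(value):
    """Return True for a finite, non-negative real number (bool excluded)."""
    if isinstance(value, bool) or not isinstance(value, Real):
        return False
    return math.isfinite(value) and value >= 0


def clean_records_strict(records):
    """Keep records with a valid price and normalize `class` to lowercase."""
    cleaned = []
    for record in records:
        if not is_valid_price(record["price_usd"]):
            continue
        item = dict(record)  # copy so the input record is not mutated
        item["class"] = item["class"].lower()
        cleaned.append(item)
    return cleaned


# Hypothetical sample covering each injected issue.
sample = [
    {"ticket_id": "T0-0001", "class": "economy", "price_usd": 136.18},
    {"ticket_id": "T0-0002", "class": "premium", "price_usd": ""},
    {"ticket_id": "T0-0003", "class": "business", "price_usd": -418.99},
    {"ticket_id": "T0-0004", "class": "PREMIUM", "price_usd": 221.03},
]
cleaned_sample = clean_records_strict(sample)
print([t["ticket_id"] for t in cleaned_sample])
```

In this sample the empty and negative prices are rejected, and the uppercase class is kept after being lowercased.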

In [5]:
for item in cleaned_tickets: 
        price_usd = item["price_usd"] 
        class_ = item["class"] 

        if price_usd == "" or price_usd == -1*price_usd or class_ == class_.upper():
            print(item)
        
print("All clean") 

All clean
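The loop above prints "All clean" whether or not it found anything. A small sketch of a check that reports only what it actually finds, assuming a hypothetical helper `find_invalid` and made-up sample records:

```python
def find_invalid(records):
    """Return records with a non-numeric or negative price, or a non-lowercase class."""
    bad = []
    for record in records:
        price = record["price_usd"]
        numeric = isinstance(price, (int, float)) and not isinstance(price, bool)
        if not numeric or price < 0 or record["class"] != record["class"].lower():
            bad.append(record)
    return bad


# Hypothetical records: the second one still has a negative price.
sample_records = [
    {"ticket_id": "T0-0001", "class": "economy", "price_usd": 136.18},
    {"ticket_id": "T0-0002", "class": "business", "price_usd": -394.74},
]

invalid = find_invalid(sample_records)
if invalid:
    print(f"{len(invalid)} invalid record(s): {[t['ticket_id'] for t in invalid]}")
else:
    print("All clean")
```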


In [6]:
cleaned_tickets[:2]

[{'ticket_id': 'T320-0001',
  'route': 'LHR-JFK',
  'day': 'Sun',
  'days_to_departure': 24,
  'class': 'premium',
  'price_usd': 400.49},
 {'ticket_id': 'T320-0002',
  'route': 'SFO-SEA',
  'day': 'Mon',
  'days_to_departure': 27,
  'class': 'economy',
  'price_usd': 136.18}]

### Task 3: Convert to NumPy for analysis

Create NumPy arrays for numeric analysis. Build an array `prices` from the cleaned ticket prices and an array `days` from the cleaned day labels. Use NumPy to compute the overall mean and standard deviation of `prices`. Compute total revenue per day and the number of tickets per day using vectorized operations, not loops. Validate that the sum of daily totals matches the overall total revenue from `prices`. 

In [7]:
# Build both arrays in one pass each instead of growing them with np.append.
prices = np.array([t["price_usd"] for t in cleaned_tickets], dtype=float)
days = np.array([t["day"] for t in cleaned_tickets])
print(prices, days)

[ 400.49  136.18  506.1   221.03  205.29  496.32  197.97  203.27  343.92
  289.3   233.33  330.28  281.92  153.58  520.77  395.79  152.63  512.6
  244.48  246.56  528.11  221.68  233.82  407.35  301.92  324.87  316.16
  152.64  392.92  317.25   69.87  447.03  334.25  209.86  535.05  287.15
  240.31  411.6   303.96  325.41  342.26  307.8   154.3  -418.99  397.41
  105.31  459.52  203.79  186.12  482.99  135.95  236.85  418.29  335.91
  376.66  365.21  147.27  416.94  388.22  125.65  465.39  261.46  164.89
  448.96  186.35  213.24  348.76  289.93  245.42  308.89  324.35  167.8
  469.99  407.04  141.37  459.33  287.84  254.85  508.6   204.6   277.26
  315.35  253.74  320.8   315.46   94.42 -394.74  345.78   82.77  418.35
  253.57  259.23  512.71  235.51  275.86  386.35  276.87  320.94  336.88
  300.36  161.44  458.66  383.19   90.72  484.5   220.25  173.79  505.15
  193.73  392.6   294.5   362.11  388.16  386.8   195.57  444.9   356.77
  187.71  508.94  249.33  211.12  520.76  245.17  240

In [8]:
mean_price = prices.mean() 

In [9]:
std_price = prices.std()
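`prices.std()` uses NumPy's default `ddof=0`, which is the population standard deviation. The sketch below contrasts it with the sample version (`ddof=1`). It uses the first eight prices from the output above as a small stand-in array; `sample_prices` is an illustrative name.

```python
import numpy as np

# First eight cleaned prices from the output above, used as a small sample.
sample_prices = np.array([400.49, 136.18, 506.10, 221.03, 205.29, 496.32, 197.97, 203.27])

mean_price = sample_prices.mean()
population_std = sample_prices.std()      # ddof=0: divides by n
sample_std = sample_prices.std(ddof=1)    # ddof=1: divides by n - 1

print(f"mean={mean_price:.2f} population_std={population_std:.2f} sample_std={sample_std:.2f}")
```

The task does not say which convention to use, so it helps to state the chosen one in the report.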

#### Compute total revenue per day and the number of tickets per day using vectorized operations, not loops. Validate that the sum of daily totals matches the overall total revenue from prices.

In [12]:
import pandas as pd

In [13]:
df = pd.DataFrame(cleaned_tickets)

In [14]:
df

| | ticket_id | route | day | days_to_departure | class | price_usd |
|---|---|---|---|---|---|---|
| 0 | T320-0001 | LHR-JFK | Sun | 24 | premium | 400.49 |
| 1 | T320-0002 | SFO-SEA | Mon | 27 | economy | 136.18 |
| 2 | T320-0003 | DXB-SIN | Tue | 30 | business | 506.10 |
| 3 | T320-0004 | MAD-ROM | Wed | 33 | premium | 221.03 |
| 4 | T320-0005 | NYC-LAX | Thu | 36 | economy | 205.29 |
| ... | ... | ... | ... | ... | ... | ... |
| 304 | T320-0316 | LHR-JFK | Sun | 9 | premium | 399.31 |
| 305 | T320-0317 | SFO-SEA | Mon | 12 | economy | 156.46 |
| 306 | T320-0318 | DXB-SIN | Tue | 15 | business | 472.66 |
| 307 | T320-0319 | MAD-ROM | Wed | 18 | premium | 249.28 |


In [16]:
day_revenue = df.groupby("day")["price_usd"].sum()

In [17]:
day_revenue

day
Fri    13213.07
Mon    12763.37
Sat     9726.03
Sun    13437.82
Thu    12991.91
Tue    13577.34
Wed    13281.07
Name: price_usd, dtype: float64

In [18]:
day_tickets = df.groupby("day")["ticket_id"].count()

In [19]:
day_tickets

day
Fri    45
Mon    46
Sat    34
Sun    46
Thu    46
Tue    46
Wed    46
Name: ticket_id, dtype: int64

In [20]:
# Compare against the total from `prices`, with a tolerance for floating-point rounding.
np.isclose(day_revenue.sum(), prices.sum())

np.True_
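The task asks for NumPy vectorized operations, and the same per-day totals and counts can be computed without pandas. A minimal sketch using `np.unique` and `np.bincount`, with the first eight cleaned records standing in for the full `prices` and `days` arrays (the other names are illustrative):

```python
import numpy as np

# First eight cleaned records from the notebook output, as a small stand-in.
prices = np.array([400.49, 136.18, 506.10, 221.03, 205.29, 496.32, 197.97, 203.27])
days = np.array(["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])

# Map each day label to an integer code, then aggregate by code.
day_labels, day_codes = np.unique(days, return_inverse=True)
daily_totals = np.bincount(day_codes, weights=prices)  # revenue per day
daily_counts = np.bincount(day_codes)                  # tickets per day

# The loop is only for display; the aggregation above is vectorized.
for label, total, count in zip(day_labels, daily_totals, daily_counts):
    print(f"{label}: total={total:.2f} tickets={count}")

totals_match = bool(np.isclose(daily_totals.sum(), prices.sum()))
print("Daily totals match overall revenue:", totals_match)
```

`np.unique` returns the labels in sorted order, so the rows come out alphabetically (`Fri`, `Mon`, `Sat`, ...) rather than in calendar order.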

### Task 4: Identify high-price tickets

Define high-price tickets as above the 90th percentile of `prices`. Compute threshold and count. Verify all selected prices are `>=` threshold.

In [23]:
perc_price = np.percentile(prices, 90)

In [24]:
perc_price 

np.float64(470.524)

In [25]:
high_price_tickets = df[df["price_usd"]>=perc_price]

In [26]:
count_hp_ticks = high_price_tickets.count() 

In [27]:
high_price_tickets

| | ticket_id | route | day | days_to_departure | class | price_usd |
|---|---|---|---|---|---|---|
| 2 | T320-0003 | DXB-SIN | Tue | 30 | business | 506.1 |
| 5 | T320-0006 | LHR-JFK | Fri | 39 | business | 496.32 |
| 14 | T320-0015 | NYC-LAX | Sun | 6 | business | 520.77 |
| 17 | T320-0018 | DXB-SIN | Wed | 15 | business | 512.6 |
| 20 | T320-0021 | LHR-JFK | Sat | 24 | business | 528.11 |
| 34 | T320-0036 | LHR-JFK | Sun | 9 | business | 535.05 |
| 49 | T320-0051 | LHR-JFK | Mon | 54 | business | 482.99 |
| 78 | T320-0081 | LHR-JFK | Wed | 24 | business | 508.6 |
| 92 | T320-0096 | LHR-JFK | Thu | 9 | business | 512.71 |
| 104 | T320-0108 | DXB-SIN | Tue | 45 | business | 484.5 |


In [28]:
count_hp_ticks

ticket_id            31
route                31
day                  31
days_to_departure    31
class                31
price_usd            31
dtype: int64

In [29]:
check = (high_price_tickets["price_usd"] >= perc_price).all()

In [30]:
check

np.True_
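The task defines high-price tickets as above the 90th percentile, while the selection above uses `>=`. The two can differ when a price equals the threshold exactly. A short sketch of both masks on a small stand-in array (the first ten cleaned prices from the notebook output; names are illustrative):

```python
import numpy as np

# First ten cleaned prices from the notebook output, as a small stand-in.
sample_prices = np.array([400.49, 136.18, 506.10, 221.03, 205.29,
                          496.32, 197.97, 203.27, 343.92, 289.30])

threshold = np.percentile(sample_prices, 90)  # default linear interpolation

strictly_above = sample_prices[sample_prices > threshold]
at_or_above = sample_prices[sample_prices >= threshold]

print(f"threshold={threshold:.3f}")
print("strictly above:", strictly_above.size, "at or above:", at_or_above.size)

# Every selected price satisfies the condition it was selected with.
selection_ok = bool((strictly_above > threshold).all() and (at_or_above >= threshold).all())
print("selection verified:", selection_ok)
```

With interpolated percentiles the threshold usually falls between two data points, in which case both masks select the same tickets.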

### Task 5: Produce a final report

Create a report dictionary with keys: 

- `total_tickets`
- `cleaned_tickets`
- `mean_price`
- `std_price`
- `daily_totals`
- `high_price_count`

Print a readable report and include at least one explicit validation statement.

In [41]:
report_dictionary = {"total_tickets": len(tickets_copy),
                     "cleaned_tickets": len(cleaned_tickets),
                     "mean_price": mean_price,
                     "std_price": std_price,
                     "daily_totals": day_revenue.sum(),
                     "high_price_count": high_price_tickets["ticket_id"].count()}

In [43]:
print(report_dictionary) 

{'total_tickets': 320, 'cleaned_tickets': 309, 'mean_price': np.float64(287.99550161812294), 'std_price': np.float64(158.84451754122242), 'daily_totals': np.float64(88990.60999999999), 'high_price_count': np.int64(31)}
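One way to make the report easier to read is to store per-day totals under `daily_totals`, convert NumPy scalars to plain Python numbers, and print one key per line with an explicit validation statement. A sketch under stated assumptions: the arrays are the first eight cleaned records, and `raw_count` is a made-up raw record count for this sample.

```python
import numpy as np

# Stand-in inputs: first eight cleaned records from the notebook output.
prices = np.array([400.49, 136.18, 506.10, 221.03, 205.29, 496.32, 197.97, 203.27])
days = np.array(["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
raw_count = 9  # hypothetical number of records before cleaning

labels, codes = np.unique(days, return_inverse=True)
totals = np.bincount(codes, weights=prices)
threshold = np.percentile(prices, 90)

report = {
    "total_tickets": raw_count,
    "cleaned_tickets": int(prices.size),
    "mean_price": round(float(prices.mean()), 2),
    "std_price": round(float(prices.std()), 2),
    "daily_totals": {str(d): round(float(t), 2) for d, t in zip(labels, totals)},
    "high_price_count": int((prices > threshold).sum()),
}

for key, value in report.items():
    print(f"{key:>18}: {value}")

# Explicit validation: the per-day totals add up to the overall revenue.
totals_match = bool(np.isclose(sum(report["daily_totals"].values()), prices.sum()))
print(f"Validation: daily totals sum to overall revenue -> {totals_match}")
```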


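#### Cleaning removed only 11 of the 320 tickets, so the synthetic data is not very "dirty". The standard deviation (about 158.84) is large relative to the mean (about 288.00), which means prices are widely spread; the negative prices still present in `prices` (for example `-418.99`) contribute to that spread.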
