# Machine Learning & Events

## How do I combine Machine Learning with Big Data?

Modern big data systems often have streaming and event systems at their heart.

Machine learning fits into this quite nicely: a model can score each event as it arrives and decide what happens to it next.

### Case Study: Flight Booking System

In [14]:
import seaborn as sns
import pandas as pd

Let's simulate a stream of flight availabilities being sent to us (a travel agent); we can sell some of them.

In [17]:
flights = sns.load_dataset('flights')

We won't know their sale price when they launch, so we will want to predict it before advertising the flight.

To predict the price, we need a history of prices for similar flights, so let's simulate this.

(In a real-world case, this data would have to be available...)

In [31]:
# "month" is a categorical column; map its string labels to numeric base prices
base_price = flights["month"].astype(str).map({
    "Jan": 100,
    "Feb": 100,
    "Mar": 120,
    "Apr": 125,
    "May": 175,
    "Jun": 200,
    "Jul": 225,
    "Aug": 300,
    "Sep": 250,
    "Oct": 175,
    "Nov": 100,
    "Dec": 90
})

The ticket price is a function of the number of passengers (i.e., the capacity of the airplane) and the month:

In [39]:
flights["ticket"] = 10*(flights["passengers"]/8 + base_price/10)
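As a quick check of the formula above, the expression simplifies to `1.25 * passengers + base_price`. A minimal sketch, where `ticket_price` is a hypothetical helper (not part of the notebook) and the inputs are illustrative:

```python
def ticket_price(passengers: float, base_price: float) -> float:
    """Simulated ticket price, mirroring the notebook's formula."""
    return 10 * (passengers / 8 + base_price / 10)


# Illustrative inputs: 112 seats and the January base price of 100.
price = ticket_price(112, 100)
print(price)

# The same value from the simplified form 1.25 * passengers + base_price.
simplified = 1.25 * 112 + 100
print(simplified)
```

Both lines print the same number, which shows that the month only shifts the price while the seat count drives its slope.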

Let's simulate the events arriving:

In [24]:
stream = []
for index, row in flights.iterrows():
    stream.append({
        "subject": "BOOKING_SYSTEM",
        "verb": "ADVERTISING",
        "object": "FLIGHT",
        "context": {
            "seats": row["passengers"],
            "month": row["month"],
            "year": 2022 
        }
    })
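The loop above materialises every event in a list. A real stream is usually consumed one event at a time, which a generator can mimic. A minimal sketch, assuming a hypothetical `event_stream` helper and a few hand-written rows in place of the `flights` DataFrame:

```python
from typing import Dict, Iterable, Iterator


def event_stream(rows: Iterable[Dict]) -> Iterator[Dict]:
    """Yield one ADVERTISING event per row, lazily."""
    for row in rows:
        yield {
            "subject": "BOOKING_SYSTEM",
            "verb": "ADVERTISING",
            "object": "FLIGHT",
            "context": {
                "seats": row["passengers"],
                "month": row["month"],
                "year": 2022,
            },
        }


# Hand-written stand-ins for rows of the flights dataset (illustrative values).
sample_rows = [
    {"passengers": 112, "month": "Jan"},
    {"passengers": 148, "month": "Jul"},
]

events = list(event_stream(sample_rows))
print(len(events), events[0]["context"])
```

Because the generator yields events on demand, the same consumer code would work whether the rows come from a list, a file, or a message queue.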

This finishes the simulation.

----

#### Building the Event-ML System

From here on, we only have the event log to process:

In [72]:
len(stream)

144

##### Processing the event log

We want to use a machine learning model to predict prices:

In [41]:
from sklearn.linear_model import LinearRegression

Let's train it on historical data:

In [42]:
lm = LinearRegression().fit(flights[["passengers"]], flights["ticket"])

Incidentally, a rough score (R² measured on the same data the model was trained on):

In [43]:
lm.score(flights[["passengers"]], flights["ticket"])

0.8767116324486033

The stream processing algorithm runs across the event log, watches for "ADVERTISING" events, and inserts a new event into a "sellable_stream" when "the predicted price is right":

In [48]:
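sellable_stream = []

customer_cutoff = 500
for event in stream:
    if event["verb"] == "ADVERTISING":
        seats = event["context"]["seats"]
        # predict() returns an array; take the single estimate as a scalar
        price_est = lm.predict(pd.DataFrame({"passengers": [seats]}))[0]

        if price_est < customer_cutoff:
            # append a copy so the original event log is left unmodified
            sellable_stream.append({**event, "subject": "SALES_SYSTEM"})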

The derived stream will have fewer events (i.e., flights), as we don't want all of them:

In [51]:
len(stream)

144

In [50]:
len(sellable_stream)

75
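The filtering step can also be packaged as a reusable function that accepts any price predictor. A minimal sketch: `filter_sellable` and `toy_predict` are hypothetical names, and the toy predictor's coefficients are made up for illustration rather than taken from the fitted model:

```python
from typing import Callable, Dict, Iterable, Iterator


def filter_sellable(
    events: Iterable[Dict],
    predict_price: Callable[[float], float],
    cutoff: float,
) -> Iterator[Dict]:
    """Yield a SALES_SYSTEM copy of each ADVERTISING event priced under cutoff."""
    for event in events:
        if event["verb"] != "ADVERTISING":
            continue
        if predict_price(event["context"]["seats"]) < cutoff:
            # Copy rather than mutate, so the source log stays unchanged.
            yield {**event, "subject": "SALES_SYSTEM"}


def toy_predict(seats: float) -> float:
    """Stand-in for lm.predict; coefficients are illustrative only."""
    return 1.25 * seats + 160


log = [
    {"subject": "BOOKING_SYSTEM", "verb": "ADVERTISING", "object": "FLIGHT",
     "context": {"seats": 112, "month": "Jan", "year": 2022}},
    {"subject": "BOOKING_SYSTEM", "verb": "ADVERTISING", "object": "FLIGHT",
     "context": {"seats": 400, "month": "Aug", "year": 2022}},
    # A hypothetical non-advertising event, included to show it is skipped.
    {"subject": "BOOKING_SYSTEM", "verb": "CANCELLED", "object": "FLIGHT",
     "context": {"seats": 100, "month": "Feb", "year": 2022}},
]

sellable = list(filter_sellable(log, toy_predict, cutoff=500))
print(len(sellable), sellable[0]["subject"])
```

Passing the predictor in as an argument keeps the stream logic independent of the model, so the model can be swapped without touching the loop.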

##### Aside: More, interesting, complexity

If we take the above approach, i.e., filtering on price, we'd want a more accurate assessment of our customers' actual price preferences.

Let's simulate a dataset we might have:

In [52]:
import numpy as np

The dataset below simulates 1,000 ticket sales per month:

In [58]:
customer_ticket_history = {
    "Jan": np.random.normal(200, 25, 1_000) + 50,
    "Feb": np.random.normal(200, 25, 1_000) + 50,
    "Mar": np.random.normal(200, 25, 1_000) + 50,
    "Apr": np.random.normal(200, 25, 1_000) + 50,
    "May": np.random.normal(300, 75, 1_000) + 50,
    "Jun": np.random.normal(300, 75, 1_000) + 50,
    "Jul": np.random.normal(300, 75, 1_000) + 50,
    "Aug": np.random.normal(400, 75, 1_000) + 50,
    "Sep": np.random.normal(400, 75, 1_000) + 50,
    "Oct": np.random.normal(300, 75, 1_000) + 50,
    "Nov": np.random.normal(300, 75, 1_000) + 50,
    "Dec": np.random.normal(200, 25, 1_000) + 50
}

We will use these sales to build a cutoff point for each month, based on our "tolerance" for losing customers:

In [66]:
cutoff_95 = {}
for month, data in customer_ticket_history.items():
    # 95th percentile of historical prices, computed once per month
    cutoff_95[month] = np.percentile(data, 95).round()
    print(month, cutoff_95[month])

Jan 289.0
Feb 291.0
Mar 291.0
Apr 292.0
May 481.0
Jun 474.0
Jul 470.0
Aug 572.0
Sep 567.0
Oct 473.0
Nov 464.0
Dec 290.0
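To make the 95th-percentile step concrete, here is a pure-Python sketch of the linear-interpolation rule that `np.percentile` uses by default. The `percentile` helper and the ten sample prices are illustrative, not drawn from the simulated history:

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    # Fractional position of the percentile within the sorted data.
    position = (len(ordered) - 1) * q / 100
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


# Ten made-up ticket prices for a single month.
prices = [210, 225, 230, 240, 250, 255, 260, 275, 290, 310]

cutoff = percentile(prices, 95)
print(cutoff)
```

With only ten prices the cutoff falls between the two highest values; with 1,000 simulated sales per month it settles near the upper tail of each month's distribution.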


Setting the cutoff at the 95th percentile means we *might* lose the 5% of customers who would have paid more than the price of the flights we're throwing away. However, if we keep 100% of flights, we waste a lot of time selling tickets our customers don't want.

Modifying the processing code, we look up the cutoff for each event's month in this table:

In [68]:
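sellable_stream = []

for event in stream:
    if event["verb"] == "ADVERTISING":
        seats = event["context"]["seats"]
        # predict() returns an array; take the single estimate as a scalar
        price_est = lm.predict(pd.DataFrame({"passengers": [seats]}))[0]

        if price_est < cutoff_95[event["context"]["month"]]:
            # append a copy so the original event log is left unmodified
            sellable_stream.append({**event, "subject": "SALES_SYSTEM"})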

This turns out to be more aggressive than our initial version:

In [69]:
len(sellable_stream)

45

----

## Reflections

* This is a fairly realistic picture of an event-driven ML system
* The code in a real-world system would look similar
* The datasets would look similar
* There is nothing *wrong* with the code
    * but, in practice, we might have more going on

* More going on:
    * looking up cutoffs in a database
    * an event system (e.g., Kafka) providing the stream to loop over
    * a pre-saved model (rather than fitting and then processing a stream in one go)
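
The three "more going on" points can be sketched together using only the standard library. Everything here is a stand-in chosen for illustration: an in-memory SQLite table plays the cutoff database, a generator plays the Kafka consumer, and a JSON string of made-up coefficients plays the pre-saved model. None of these names or values come from the notebook:

```python
import json
import sqlite3
from typing import Dict, Iterator

# 1. Cutoff database: an in-memory SQLite table (illustrative values).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cutoffs (month TEXT PRIMARY KEY, price REAL)")
db.executemany("INSERT INTO cutoffs VALUES (?, ?)", [("Jan", 290.0), ("Aug", 570.0)])


def lookup_cutoff(month: str) -> float:
    """Fetch the price cutoff for a month from the database."""
    row = db.execute("SELECT price FROM cutoffs WHERE month = ?", (month,)).fetchone()
    if row is None:
        raise KeyError(f"no cutoff stored for {month}")
    return row[0]


# 2. Pre-saved model: coefficients serialised earlier, loaded at start-up.
saved_model = '{"coef": 1.25, "intercept": 160.0}'  # made-up parameters
params = json.loads(saved_model)


def predict_price(seats: float) -> float:
    """Linear model rebuilt from the saved parameters."""
    return params["coef"] * seats + params["intercept"]


# 3. Event source: a generator standing in for a Kafka consumer loop.
def consume() -> Iterator[Dict]:
    for seats, month in [(100, "Jan"), (150, "Jan"), (300, "Aug")]:
        yield {"verb": "ADVERTISING", "context": {"seats": seats, "month": month}}


sellable = [
    {**event, "subject": "SALES_SYSTEM"}
    for event in consume()
    if predict_price(event["context"]["seats"]) < lookup_cutoff(event["context"]["month"])
]
print([e["context"]["seats"] for e in sellable])
```

The processing logic is the same as in the notebook; only the sources of the events, the cutoffs, and the model parameters have moved outside the loop.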

## Appendix

A more reliable score, measured on a held-out test split:

In [88]:
from sklearn.model_selection import train_test_split

In [92]:
Xtr, Xte, ytr, yte = train_test_split(flights[["passengers"]], flights["ticket"])


# built-in round() works whether score() returns a Python or a NumPy float
round(LinearRegression().fit(Xtr, ytr).score(Xte, yte), 2)

0.89
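
To see what the held-out score measures, here is a dependency-free sketch that fits a one-feature least-squares line on a training split and computes R² on the remaining rows. The synthetic data, the seed, and the helper names are assumptions for illustration; it does not reproduce the scikit-learn numbers above:

```python
import random
from typing import List, Tuple

random.seed(0)  # fixed seed so the split is repeatable

# Synthetic (seats, price) pairs: a linear trend plus Gaussian noise.
data: List[Tuple[float, float]] = []
for _ in range(144):
    seats = random.uniform(100, 600)
    data.append((seats, 1.25 * seats + 160 + random.gauss(0, 20)))

# Shuffle, then hold out the last 25% of rows for testing.
random.shuffle(data)
split = int(len(data) * 0.75)
train, test = data[:split], data[split:]


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(points: List[Tuple[float, float]], slope: float, intercept: float) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y for _, y in points) / len(points)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


slope, intercept = fit_line(train)
print(round(r_squared(test, slope, intercept), 2))
```

Scoring on rows the model never saw during fitting is what separates this number from the rough score computed earlier on the training data.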