# Python: Functions for ML Methodology

## Machine Learning Process

Recall the Machine Learning process,

`Data -> Algorithm -> prediction = Model(newdata, parameters)`

$(X, y) \rightarrow \mathcal{Alg} \rightarrow \hat{y} = \hat{f}(X_{new}; w, b)$

`generate_data(...) -> min(algo_linear_regression(...)) -> yhat = predict(..., w, b)`

In [4]:
def generate_data():
    pass

In [2]:
def algo_linear_regression():
    pass

In [3]:
def predict():
    pass

---

In [5]:
from random import random

**Q. Import `choice` from `random`. What does `choice([True, False])` do?**

**Q. How would you use `choice` to simulate categorical data?**
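One possible way to explore these two questions is sketched below. The `colours` list and the fixed seed are illustrative assumptions, not part of the original notebook; any non-empty list of labels would do.

```python
from random import choice, seed

seed(1)  # assumption: fixed seed so the run is repeatable

# choice picks one element uniformly at random from a non-empty sequence
coin = choice([True, False])

# hypothetical categories, chosen for illustration only
colours = ["red", "green", "blue"]
sample = [choice(colours) for _ in range(10)]

print(coin)
print(sample)
```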


The `random()` function from the `random` module generates random floats in the half-open interval `[0, 1)`,

In [13]:
random()

0.8139780834117014
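Python's documentation gives the range of `random()` as the half-open interval `[0.0, 1.0)`. As a quick check, the sketch below draws many samples and confirms that each one lies in that interval. The `seed(0)` call is an addition for reproducibility and is not used elsewhere in the notebook.

```python
from random import random, seed

seed(0)  # assumption: fixed seed so the run is repeatable

samples = [random() for _ in range(1000)]

# every draw is at least 0.0 and strictly less than 1.0
in_range = all(0.0 <= s < 1.0 for s in samples)
print(in_range)
```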

## The Data

In [21]:
def generate_data(w, b, n=5):
    X = [ round(random(), 2) for _ in range(0, n)]
    y = [ w * x + b for x in X ]
    
    return (X, y)

In [22]:
generate_data(10, 5)

([0.61, 0.52, 0.71, 0.63, 0.77], [11.1, 10.2, 12.1, 11.3, 12.7])

**Q. Define a new `generate_data_category` so that `y` is `True` or `False`**

**HINT: use, e.g., `< 0` on `w*x + b`.**

## The Algorithm

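We would like to guess $w, b$ until we get close to the "true" ones. Here we try every integer pair with $0 \le w, b < 100$ and record the total absolute error of each guess,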

In [54]:
def algo_linear_regression(X, y):
    history = []
    
    for wguess in range(0, 100):
        for bguess in range(0, 100):
            # total absolute error of this (wguess, bguess) over all points
            error = sum(abs(yi - (wguess * xi + bguess)) for xi, yi in zip(X, y))
            
            history.append((error, (wguess, bguess)))
            
    return history
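To see what a single step of the inner loop computes, the sketch below scores one guess at a time against a tiny hand-made dataset. The data values and the helper name `abs_error` are illustrative assumptions, not part of the notebook.

```python
def abs_error(X, y, w, b):
    # sum of absolute differences between each target and w * x + b
    return sum(abs(yi - (w * xi + b)) for xi, yi in zip(X, y))

# hypothetical data built from w = 10, b = 5
X = [0.0, 1.0, 2.0]
y = [5.0, 15.0, 25.0]

print(abs_error(X, y, 10, 5))  # 0.0, the true parameters fit exactly
print(abs_error(X, y, 0, 0))   # 45.0, a poor guess
```

The grid search simply repeats this calculation for every candidate pair and stores the results.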

**Q. EXTRA: define `algo_cat_score` which works with a categorical `y` (i.e., `True`, `False`).**

**HINT: modify `yhat` to use `< 0`.**

With the functions above we can now `generate_data`,

In [42]:
X_discount, y_spent = generate_data(10, 5)

Let's inspect the first five entries, 

In [43]:
X_discount[:5]

[0.23, 0.43, 0.18, 0.81, 0.25]

In [44]:
y_spent[:5]

[7.300000000000001, 9.3, 6.8, 13.100000000000001, 7.5]

## The Algorithm on The Data:  "Learning"

Using `algo_linear_regression(X_discount, y_spent)` we can get a list with lots of $w, b$ attempts and the associated error, 

In [45]:
trials = algo_linear_regression(X_discount, y_spent)

The first five trials, 

In [46]:
trials[:5]

[(44.0, (0, 0)),
 (39.0, (0, 1)),
 (34.0, (0, 2)),
 (29.000000000000004, (0, 3)),
 (24.000000000000004, (0, 4))]

`min` returns the entry with the smallest error: tuples are compared element by element, and the error is the first element of each entry,

In [48]:
min(trials)

(0.0, (10, 5))
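The way `min` orders tuples can be seen on a small hand-written list. The entries below are made up for illustration, and the `key=` variant is offered as an optional alternative rather than something the notebook uses.

```python
# hypothetical (error, (w, b)) entries
trials = [
    (3.5, (2, 1)),
    (0.5, (9, 6)),
    (0.5, (9, 4)),
]

# tuples compare element by element, so error is compared first;
# ties on error fall through to the (w, b) pair
best = min(trials)
print(best)  # (0.5, (9, 4))

# a key function compares on error only and keeps the first tied entry
best_by_error = min(trials, key=lambda t: t[0])
print(best_by_error)  # (0.5, (9, 6))
```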

We can assign several variables at once if the pattern on the left matches the structure of the value being assigned (tuple unpacking),

In [49]:
error, (wbest, bbest) = min(trials)

In [51]:
print(wbest, bbest)

10 5
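A short sketch of how the pattern has to match the shape of the value. The `entry` tuple is a stand-in with the same shape as the result of `min(trials)`.

```python
entry = (0.0, (10, 5))

error, (w, b) = entry  # nested pattern mirrors the nested tuple
print(error, w, b)     # 0.0 10 5

error, params = entry  # a flatter pattern keeps the pair together
print(params)          # (10, 5)

try:
    # wrong shape: entry holds two items, but three names are given
    error, w, b = entry
except ValueError as exc:
    print("unpacking failed:", exc)
```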


## Prediction

To predict for a new point we use the $w, b$ we found by minimizing the error,

In [40]:
def predict(x, w, b):
    return w * x + b

**Q. Define a `predict_cat` function which predicts using a score, i.e., `< 0`.**

Here, for $x_{discount} = 0.5$, the prediction is $\hat{y}_{spent} = 10$ (that is, £10),

In [52]:
yhat = predict(0.5, wbest, bbest)

In [53]:
yhat

10.0
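Putting the three steps together, the sketch below runs the whole process in one self-contained cell: generate data, search for the best $w, b$, then predict. The fixed seed is an assumption added for reproducibility; the function bodies follow the definitions used in this notebook.

```python
from random import random, seed

def generate_data(w, b, n=5):
    X = [round(random(), 2) for _ in range(n)]
    y = [w * x + b for x in X]
    return X, y

def algo_linear_regression(X, y):
    history = []
    for wguess in range(100):
        for bguess in range(100):
            # total absolute error of this guess over all points
            error = sum(abs(yi - (wguess * xi + bguess)) for xi, yi in zip(X, y))
            history.append((error, (wguess, bguess)))
    return history

def predict(x, w, b):
    return w * x + b

seed(42)  # assumption: fixed seed so the run is repeatable

X, y = generate_data(10, 5)                                # Data
error, (wbest, bbest) = min(algo_linear_regression(X, y))  # Algorithm
yhat = predict(0.5, wbest, bbest)                          # Prediction

print(wbest, bbest)  # 10 5
print(yhat)          # 10.0
```

The search recovers the true parameters exactly only because the data is noise-free and the true values are integers inside the searched grid.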

---

## Stretch Exercise

Having answered the questions above, use your three functions for categorical data to: 

1. generate categorical data; 

2. find a best (w, b) for classifying/scoring it; 

3. predict using your best (w,b) with your prediction function. 