# Introduction to Python Machine Learning 
### with scikit-learn

## What simple library do I use for Machine Learning?

The scikit-learn library in Python is well suited to simple machine learning problems.

If you have a complex problem (e.g., working with images or audio), you may find that other libraries perform better (e.g., TensorFlow).

These libraries are industry-standard, perform reasonably well, and are free for commercial use.

If they do not suit your problem, you will probably need expert assistance.

## How do I import sklearn?

We do not import the sklearn package itself; we always select a more specific part...

Exercise:

Run the below section of code to import the linear_model library.

In [2]:
# from PACKAGE import SPECIFIC LIBRARY
from sklearn import linear_model

The library here is `linear_model`...

In [4]:
linear_model.LinearRegression()

LinearRegression()

## What methodology do I follow?

i.e., How do I solve a machine learning problem in Python?

* Pre-ML Phase
    * Get Data (Experiments, etc.)
    * Data Preparation
        * gather together, join...
    * EDA (Data Understanding)

* ML Phase
    * Feature Selection
        * Choose $x, y$
    * (Mathematical, Statistical) Data Preparation
        * e.g., suppose we have (w, h)
            * perhaps we compute bmi = w/h^2
    * Modelling
        * fit data to a model
            * ie., find the best parameters 
        * Model Choice
            * try lots of different models
            * keep "best" -- depends on project goal
    * (Statistical, Experimental) Evaluation
        * try using your model for predictions
        * how good is it?

* Post-ML Phase
    * Deployment <- put ML model into practice
        * eg., develop software using model
        * eg., make some business decision
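
The "(Mathematical, Statistical) Data Preparation" step above can be sketched in a few lines of pandas. This is only an illustration: the table `people` and its values are made up, and it uses the standard definition of BMI (weight in kilograms divided by height in metres squared).

```python
import pandas as pd

# Hypothetical measurements: weight in kilograms, height in metres
people = pd.DataFrame({
    "w": [70.0, 80.0, 60.0],
    "h": [1.75, 1.80, 1.60],
})

# Derived feature: body mass index = weight / height squared
people["bmi"] = people["w"] / people["h"] ** 2

print(people.round(2))
```

The new `bmi` column can then be used as a feature in place of (or alongside) the raw `w` and `h` columns.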

## How do I select features?

We typically use pandas to *provide* data...

Exercise:

1) Import pandas with its typical alias.

2) Use read_csv() with the appropriate filepath to import the tips.csv file. Save this as tips.

3) Run the code: tips.info() to view how the tips data has been stored in python.

4) Use addressing (square brackets) to select the columns for total bill, size, sex, day and save these together as features.

5) Use addressing to select the tip column and save this as target.

In [6]:
#Solution
import pandas as pd

tips = pd.read_csv('datasets/tips.csv')

In [8]:
#Solution
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB


In [17]:
#Solution
select_columns = ['total_bill', 'size', 'sex', 'day']

features = tips[ select_columns ]
target = tips['tip']

## How do I prepare data for modelling?

Observe that two of the features (sex and day) are just text:

Exercise:

1) Run the below line of code to check your features table.

2) We wish to transform the sex and day columns from being text to being numbered categories. To do this, run the next line of code to import OrdinalEncoder.

In [30]:
features.sample(2)

    total_bill  size     sex   day
83       32.68     2    Male  Thur
51       10.29     2  Female   Sun


In [31]:
from sklearn.preprocessing import OrdinalEncoder

The `OrdinalEncoder` will scan *categorical* columns, determine their unique values (i.e., categories), and swap these out for integers.

* E.g., Leeds, London, Manchester -> 0, 1, 2
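
As a small, self-contained illustration of the city example above, the sketch below encodes a made-up column of city names. The array `cities` is hypothetical; `categories_` is the attribute where a fitted encoder stores the unique values it found.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical single categorical column (one row per observation)
cities = np.array([["London"], ["Leeds"], ["Manchester"], ["Leeds"]])

encoder = OrdinalEncoder()
codes = encoder.fit_transform(cities)

print(encoder.categories_)  # the unique values found, in sorted order
print(codes.ravel())        # the integer code (stored as a float) for each row
```

Note that the codes follow the sorted order of the categories, not the order in which they first appear in the data.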

Exercise:

1) Use OrdinalEncoder().fit_transform(features[['sex', 'day']]) and save this as X_categorical.

2) Use addressing with X_categorical to view the first 4 rows of the sex and day columns (they should now appear as integers rather than text).

3) Use addressing with the features table to view the first 4 rows of the sex and day columns (so you can see what text they had originally).

In [32]:
#Solution
X_categorical = OrdinalEncoder().fit_transform(features[['sex', 'day']])

The first rows of `X_categorical`:

In [34]:
#Solution
X_categorical[0:4, :]

array([[0., 2.],
       [1., 2.],
       [1., 2.],
       [1., 2.]])

In [35]:
#Solution
features.loc[0:3, ['sex', 'day']]

      sex  day
0  Female  Sun
1    Male  Sun
2    Male  Sun
3    Male  Sun


Note here we have columns: `total_bill`, `size`, `sex`, `day`.

So, for `sex`: `0` Female, `1` Male; for `day`: `0` Fri, `1` Sat, `2` Sun, `3` Thur (the encoder numbers the categories in sorted order).

## How do I combine my dataset into one matrix?

Now, notice we have two split datasets... the processed one called `X_categorical` which is the numerical version of the categorical columns.

And we have the real-valued columns `total_bill` and `size`, still in our `features` table.

Exercise:

Run the below line of code to select the total bill and size columns and save them as X_real.

In [39]:
X_real = features[['total_bill', 'size']].to_numpy() # selecting the numerical columns

So now we need to combine them into one feature matrix, `X`.

In [41]:
import numpy as np

In numpy, `np.column_stack` means *join* -- glue columns together into one matrix. 
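
A minimal sketch of what `np.column_stack` does, using two small made-up arrays (`a` and `b` are illustrative names, not part of the tips data):

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])      # 2 rows, 2 columns
b = np.array([[10.0],
              [20.0]])          # 2 rows, 1 column

# Glue the columns of b onto the right-hand side of a
joined = np.column_stack((a, b))

print(joined)
print(joined.shape)  # (2, 3)
```

The arrays must have the same number of rows; the result keeps the rows aligned and simply has more columns.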

Exercise:

1) Use column_stack() with X_real and X_categorical as arguments. Save this as X.

2) View the first 4 rows (all columns) of X to check how your data has been stored.

X is the table of prepared data we will use as the features to predict the target.

In [46]:
#Solution
X = np.column_stack((X_real, X_categorical))

X[0:4,:]

array([[16.99,  2.  ,  0.  ,  2.  ],
       [10.34,  3.  ,  1.  ,  2.  ],
       [21.01,  3.  ,  1.  ,  2.  ],
       [23.68,  2.  ,  1.  ,  2.  ]])

## How do I choose a model?

* What type of problem?
    * Supervised Learning
        * Regression 
            * $y$ is tip, is a number!
* What type of regression models are there?
    * **Linear** <- CHOSEN!
    * Polynomial
    * Neural Network
        * Loosely: layers of linear models joined by non-linear functions
    * ... 
    
* What should I do now?
    * Either: try them all 
    * Or: try ones you think most likely to be accurate
    
    
When we have covered the Statistics section of your programme, we will look in more detail at how to determine whether linear regression is appropriate for all of the features.

## How do I import a model?

You will need to consult the documentation, but LinearRegression lives in `sklearn.linear_model`...

Exercise:

Run the below section of code to import the LinearRegression class.

In [47]:
from sklearn.linear_model import LinearRegression

## How do I train a model?

Exercise:

1) Use target.to_numpy() to save the target column in the correct format. Save this as y. Note: This has already been done to X.

In [50]:
#Solution
X = X # reminder that we have X
y = target.to_numpy()

All the machine learning happens here:

Exercise:

Use LinearRegression().fit() with X and y as the arguments for fit(). We are giving the model the formatted features and target to create the model equation. Save this as model.

In [76]:
#Solution
model = LinearRegression().fit(X, y) # <- empirical loss minimization
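
To see the "model equation" that `fit()` produces, you can inspect the fitted `coef_` and `intercept_` attributes. The sketch below uses made-up data (`X_demo`, `y_demo`) generated from a known equation, so it is easy to check that the fitted numbers recover it; it is an illustration, not the tips model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated exactly from y = 2*x1 + 3*x2 + 1
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]])
y_demo = 2 * X_demo[:, 0] + 3 * X_demo[:, 1] + 1

demo_model = LinearRegression().fit(X_demo, y_demo)

print(demo_model.coef_)       # one weight per feature column
print(demo_model.intercept_)  # the constant term
```

With real, noisy data the fitted weights will not match any "true" equation exactly; they are simply the values that minimize the squared prediction error on the data given to `fit()`.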

## How do I use a model to predict?

Recall that `X` has four elements per observation:

Exercise:

1) Use addressing to view the first couple of rows of X to refresh your memory.

2) Create a row called x_newpoint with the values 20, 1, 1, 1. Hint: it must be a matrix.

3) When you created the model it was stored as an object containing the equation information and a function called predict(). Use model.predict() with x_newpoint as the argument to see what target value it calculates for your new point.

In [52]:
#Solution
X[:2, :] #first two, all columns

array([[16.99,  2.  ,  0.  ,  2.  ],
       [10.34,  3.  ,  1.  ,  2.  ]])

`model.predict` requires the same structure we used in `X`...

In [53]:
#Solution
x_newpoint = [
    [20, 1, 1, 1]  # even though this is ONE observation, it must be a matrix
]

model.predict(x_newpoint)

array([2.70323683])
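
If a new observation arrives as a flat array rather than a nested list, one way to turn it into the one-row matrix that `predict` expects is `reshape(1, -1)`. The sketch below uses a small made-up model (`demo_model`, trained on hypothetical data where y is exactly x1 + x2) purely to show the shapes involved.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data with two feature columns; here y = x1 + x2
X_demo = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 2.0], [4.0, 3.0]])
y_demo = np.array([2.0, 3.0, 5.0, 7.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

one_observation = np.array([5.0, 3.0])       # shape (2,): a flat list of values
as_matrix = one_observation.reshape(1, -1)   # shape (1, 2): one row, two columns

prediction = demo_model.predict(as_matrix)   # one prediction per row
print(as_matrix.shape, prediction.shape)
```

The result is always an array with one entry per row of the input, even when there is only one row.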

## How do I predict for multiple observations?

Exercise:

1) Create a matrix called x_newpoint with 3 rows: all three rows should have 1 for size, sex and day. For the total bill column, use 20, 30, and 40.

2) Use model.predict() to view the predicted tips for x_newpoint.

In [55]:
#Solution
x_newpoint = [
    [20, 1, 1, 1],
    [30, 1, 1, 1],
    [40, 1, 1, 1],
]

yhat = model.predict(x_newpoint) # predictions

yhat

array([2.70323683, 3.63332512, 4.5634134 ])

## How do I score a model?

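The model object you created contains not just the information needed for the equation of the line and a predict() function. It also has a function called score(), which for regression models returns the coefficient of determination, $R^2$: 1 is a perfect fit, 0 is no better than always predicting the mean of y, and the value can be negative for a model that does worse than that.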

Exercise:

Run the line of code below:

In [57]:
model.score(X, y) # scores model on historical (X, y)

0.4679728655676543

...this is quite low, so maybe we need a more complex model. (Or: maybe this is good enough for your use case...)
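
For regression models, the number returned by `score()` is the coefficient of determination, $R^2$. The sketch below computes it by hand on made-up data (`X_demo`, `y_demo` are hypothetical) and compares it with `score()`, to show what the number measures.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noisy data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 1))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 2, size=50)

demo_model = LinearRegression().fit(X_demo, y_demo)
yhat_demo = demo_model.predict(X_demo)

ss_res = np.sum((y_demo - yhat_demo) ** 2)       # error left over after the model
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)   # error of always guessing the mean
r2_by_hand = 1 - ss_res / ss_tot

print(r2_by_hand, demo_model.score(X_demo, y_demo))
```

A score of 1 means the predictions match exactly; a score of 0 means the model is no better than always predicting the mean of y.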

## Aside: How do I choose a better model?

Two thoughts:
* reconsider your data
    * prepare it differently
        * use different columns
        * formula to combine them 
        * etc. 
    * get more data
* try a different model


## Trying a Different Model

The k-Nearest Neighbors algorithm is very simple.

When you `fit(X, y)` it just remembers them exactly. 

`Database = (X, y)`

When you `predict(z)`

It finds the `k` most similar $X$s, and tells you their average $y$.

eg., $(x, y) = (Age, SalesPrice)$

A prediction for an 18-year-old is made by finding the people who are "about 18" in the dataset, finding out what they paid, and taking the average.

```
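-- pseudo-SQL: average Price over the k rows whose Age is closest to the new point
SELECT AVG(Price)
FROM (
    SELECT Price
    FROM database
    ORDER BY ABS(Age - age_newpoint)
    LIMIT k
) AS nearest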
```
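
The same idea can be sketched directly in numpy. This is a simplified illustration under stated assumptions: the data (`ages`, `prices`) and the helper `knn_predict` are made up, distance is plain Euclidean distance, and all k neighbours are weighted equally. On this small example it gives the same answer as `KNeighborsRegressor`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical (Age, Price) records
ages = np.array([[17.0], [18.0], [19.0], [30.0], [45.0], [60.0]])
prices = np.array([10.0, 12.0, 14.0, 40.0, 55.0, 70.0])

def knn_predict(X_train, y_train, z, k):
    """Average the y values of the k training rows closest to z."""
    distances = np.linalg.norm(X_train - z, axis=1)   # distance from z to every row
    nearest = np.argsort(distances)[:k]               # indices of the k smallest
    return y_train[nearest].mean()

by_hand = knn_predict(ages, prices, np.array([18.0]), k=3)
by_sklearn = KNeighborsRegressor(n_neighbors=3).fit(ages, prices).predict([[18.0]])[0]

print(by_hand, by_sklearn)
```

Here the three closest ages to 18 are 17, 18 and 19, so the prediction is the mean of their prices.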

Exercise:

Run the below sections of code to produce and use a KNeighborsRegressor model.

In [79]:
from sklearn.neighbors import KNeighborsRegressor

In [92]:
for k in range(3, 11):
    model = KNeighborsRegressor(k).fit(X, y)
    
    # find the k closest points in database (X, y)
    # report their mean(y)

    print(model.score(X, y))

0.6435960451483365
0.5848980368607244
0.5247895125206865
0.4996194323723515
0.48323409891574254
0.47662276752680455
0.4497514915579601
0.4417900984789215


In [93]:
model = KNeighborsRegressor(3).fit(X, y)
model.score(X, y)

0.6435960451483365

## Exercise (30min)


### Q1. Obtain & Prepare Data

* read the titanic csv using pandas
* clean the dataset
    * dropna to remove missing rows

### Q2. Select & Prepare Columns

* select the survived column as your target
* use age, fare, pclass as your features
    * NOTE: these do not need to be prepared as they are all numbers!

### Q3. Fit & Predict
* Use `LogisticRegression` to build a *classifier* and *predict* for a point
    * HINT: exactly the same as `LinearRegression`
    * HINT: `from sklearn.linear_model import LogisticRegression`
    
* This algorithm is a classification algorithm, so it will output e.g., `0`, `1` as predictions

In [None]:
#Solution

In [None]:
import pandas as pd

ti = pd.read_csv("datasets/titanic.csv").dropna()

In [None]:
y = ti['survived'].to_numpy()
X = ti[['age', 'fare', 'pclass']].to_numpy()

In [None]:
from sklearn.linear_model import LogisticRegression

# fit: give historical (X, y) to algorithm 
model = LogisticRegression().fit(X, y) 

# model: formula which takes a new observation (x_new), and predicts y_new

In [None]:
x_new = [
    [18, 20, 2], # age=18, fare=20, pclass=2 ; 1 = SURVIVE
    [1, 5, 1],   #  ... ; SURVIVE
    [70, 35, 2]  #  ... ; DIE
]

yhat = model.predict(x_new) # model.predict == f

yhat 

$f(x) = \frac{1}{1+e^{-(ax + b)}}$

$y = f(5)$
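
To connect the formula to the fitted model, the sketch below evaluates the standard logistic function, $1 / (1 + e^{-(ax + b)})$, by hand and compares it with `predict_proba`. The data (`X_demo`, `y_demo`) is made up and has a single feature, so `a` and `b` are single numbers taken from `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data: larger x makes class 1 more likely
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y_demo = np.array([0, 0, 0, 1, 0, 1, 1, 1])

demo_model = LogisticRegression().fit(X_demo, y_demo)
a = demo_model.coef_[0, 0]      # slope
b = demo_model.intercept_[0]    # offset

def f(x):
    """Logistic function: squashes a*x + b into a probability between 0 and 1."""
    return 1 / (1 + np.exp(-(a * x + b)))

p_by_hand = f(5.0)
p_sklearn = demo_model.predict_proba([[5.0]])[0, 1]   # probability of class 1

print(p_by_hand, p_sklearn)
```

`predict()` then reports class `1` when this probability is above 0.5, and class `0` otherwise.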

## Aside: Visually inspecting the model (predictions)

Let's run this again with a single column, to simplify so we can plot in 2D easily...

In [None]:
import seaborn as sns


y = ti['survived'].to_numpy()
X = ti[['age']].to_numpy() 

# NB. we select the list ['age'] in order to get a matrix back

model = LogisticRegression().fit(X, y) 

# pass x and y by keyword: recent seaborn versions do not accept them positionally
sns.scatterplot(x=X[:, 0], y=y);
sns.lineplot(x=X[:, 0], y=model.predict(X), color="red");
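
The note above about selecting the list `['age']` can be checked on a tiny made-up table (`df` is hypothetical): single brackets give a flat vector, while double brackets give a one-column matrix, which is the shape scikit-learn expects for features.

```python
import pandas as pd

# Hypothetical miniature table
df = pd.DataFrame({"age": [22, 38, 26], "fare": [7.25, 71.28, 7.92]})

as_vector = df["age"].to_numpy()     # single brackets: a 1-D array
as_matrix = df[["age"]].to_numpy()   # double brackets: a 2-D array with one column

print(as_vector.shape)  # (3,)
print(as_matrix.shape)  # (3, 1)
```

The target `y` can stay as a flat vector; only the features `X` need the matrix form.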