## 3) Fit the model to data and use it to make predictions

Now that you've chosen a model, the next step is to have it learn from the data so it can be used to make predictions.

### 3.1 Fitting a model to data

In Scikit-Learn, a machine learning model learns patterns from a dataset when you call its fit() method and pass it data, for example fit(X, y).

Here, X is a feature array and y is a target array.

Other names for X include:

* Data
* Feature variables
* Features

Other names for y include:

* Labels
* Target variable

For supervised learning there is usually an X and y. For unsupervised learning, there's no y (no labels).
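To make the difference concrete, here is a minimal sketch that fits one supervised and one unsupervised estimator. The synthetic data from make_blobs and the choice of KMeans are assumptions made for illustration only; they are not part of the heart disease example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (illustrative only): 100 samples, 2 features, 2 groups
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=42)

# Supervised: the model is given both the features (X) and the labels (y)
clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit(X, y)

# Unsupervised: the model is given the features only, there is no y
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X)

print(clf.predict(X[:5]))   # labels predicted by the supervised model
print(kmeans.labels_[:5])   # cluster assignments found without any labels
```

The cluster numbers returned by KMeans are arbitrary group ids, so they do not have to match the values in y.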

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [5]:
heart_df = pd.read_csv("data/heart-disease.csv")
heart_df.head() # classification dataset - supervised learning

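| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |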


In [10]:
# y (the target column): 1 = heart disease, 0 = no heart disease

In [6]:
# No. of samples in the dataset
len(heart_df)

303

In [8]:
# Import the RandomForestClassifier model class from the ensemble module
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Setup random seed
np.random.seed(42)

# Split the data into X (features/data) and y (target/labels)
X = heart_df.drop("target", axis=1)
y = heart_df["target"]

# Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the model
clf = RandomForestClassifier()

# Train the model: fit() learns patterns between X (features) and y (labels).
# A supervised model like this one needs y; unsupervised estimators are fit on X alone.
clf.fit(X_train, y_train)

# Evaluate the model on the test set, using the patterns it has learned.
# For a classifier, score() returns the mean accuracy on the given data.
heart_score = clf.score(X_test, y_test)

In [18]:
print(f"Test Accuracy = {heart_score*100}%")

Test Accuracy = 85.24590163934425%


Passing X and y to fit() will cause the model to go through all of the examples in X (data) and see what their corresponding y (label) is.

How the model does this is different depending on the model you use.
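As a rough illustration of this point, the sketch below calls fit() on two different estimators with the same data. The synthetic dataset and the choice of LogisticRegression as the second model are assumptions for illustration. The call looks identical, but what each model stores after training is different.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Same fit(X, y) call, two different learning procedures
log_reg = LogisticRegression(max_iter=1000).fit(X, y)
forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)

# Logistic regression stores one weight per feature
print(log_reg.coef_.shape)       # (1, 5)

# A random forest stores a collection of fitted decision trees
print(len(forest.estimators_))   # 25
```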

#### During training (finding patterns in data):

A machine learning algorithm looks at a dataset, finds patterns, tries to use those patterns to predict something, and corrects itself as best it can with the available data and labels. It stores these patterns for later use.

#### During testing or in production (using learned patterns):

A machine learning algorithm uses the patterns it has previously learned from a dataset to make predictions on unseen data.
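The two phases can be sketched as follows, assuming a synthetic dataset in place of the heart disease data. The model is trained on one portion of the data and then scored both on that portion and on a held-out portion it has never seen.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Training: the model finds and stores patterns from the training data
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Testing: the stored patterns are applied to data the model has not seen
train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(f"Accuracy on training data: {train_accuracy:.2f}")
print(f"Accuracy on unseen test data: {test_accuracy:.2f}")
```

The score on unseen data is the one that indicates how the model is likely to behave in production, which is why evaluation is done on the test set.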

### 3.2 Making predictions using a machine learning model

Now that you've got a trained model, you'll want to use it to make predictions.

Scikit-Learn enables this in several ways. Two of the most common and useful are [predict()](https://github.com/scikit-learn/scikit-learn/blob/5f3c3f037/sklearn/multiclass.py#L299) and [predict_proba()](https://github.com/scikit-learn/scikit-learn/blob/5f3c3f037/sklearn/linear_model/_logistic.py#L1617).

In [12]:
# Make predictions on the test data (data the model did not see during training)
clf.predict(X_test) # predicted labels for the test samples

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int64)

In [14]:
np.array([y_test]) # true labels for the test samples

array([[0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
        0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0]], dtype=int64)

In [20]:
# Compare predictions to the true labels to evaluate the model

# 2nd method of evaluation: the fraction of predictions that match the true labels

y_preds = clf.predict(X_test)
test_accuracy = np.mean(y_preds == y_test)
print(f"Test Accuracy = {test_accuracy*100}%")

Test Accuracy = 85.24590163934425%


It's standard practice to save these predictions to a variable named something like y_preds so they can later be compared to y_test, also called y_true (usually the same labels under another name).

Another way of doing this is with Scikit-Learn's [accuracy_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) function.

In [21]:
# 3rd method of evaluation: Scikit-Learn's accuracy_score()
from sklearn.metrics import accuracy_score
test_accuracy = accuracy_score(y_test, y_preds)
print(f"Test Accuracy = {test_accuracy*100}%")

Test Accuracy = 85.24590163934425%


#### Make predictions with predict_proba()

#### Note: 
For the predict() function to work, it must be passed X (data) in the same format the model was trained on, meaning the same features in the same order. The number of rows can differ, but a different number of feature columns will raise an error.
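The following sketch shows what that error looks like. It assumes a small synthetic dataset with four features; dropping one feature column before calling predict() causes Scikit-Learn to raise a ValueError.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 4 features (illustrative only)
X, y = make_classification(n_samples=100, n_features=4, random_state=42)
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Same number of features as in training: works for any number of rows
print(clf.predict(X[:3]).shape)

# One feature column missing: predict() raises a ValueError
try:
    clf.predict(X[:3, :3])
    raised = False
except ValueError as err:
    raised = True
    print(f"ValueError: {err}")
```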

predict_proba() returns probability estimates for each classification label.
Each number is the estimated probability of a label given a sample.

In this heart disease example, the class labels are 0 and 1.

In [22]:
heart_df["target"].value_counts()

1    165
0    138
Name: target, dtype: int64

In [26]:
# Returns the probability of each classification label for each sample

# Let's look at the first five samples
clf.predict_proba(X_test[:5])

# Each row is [probability of class 0, probability of class 1]
# 0.89 > 0.11 -> class 0
# 0.43 < 0.57 -> class 1

array([[0.89, 0.11],
       [0.49, 0.51],
       [0.43, 0.57],
       [0.84, 0.16],
       [0.18, 0.82]])

#### Threshold: 0.5

Our problem is a binary classification task (heart disease or not heart disease), so once the predicted probability of a label for a sample passes 0.5, the sample is assigned that label.
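One way to check this relationship is to take the class with the highest probability in each row and compare it with the output of predict(). The sketch below does this on synthetic data (an assumption for illustration). For a two-class problem, picking the larger of the two probabilities is the same as applying a 0.5 threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X, y)

# One row per sample, one column per class; each row sums to 1
probs = clf.predict_proba(X[:5])

# Pick the class whose probability is largest in each row
labels_from_probs = clf.classes_[np.argmax(probs, axis=1)]

print(probs)
print(labels_from_probs)
print(clf.predict(X[:5]))  # matches the labels derived from the probabilities
```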

In [27]:
clf.predict(X_test[:5]) # return a single label for each sample

array([0, 1, 1, 0, 1], dtype=int64)

In [29]:
# predict_proba() is useful when we care about how confident the model is in a prediction
# a split such as 0.49, 0.51 means the model is close to undecided
# a split such as 0.89, 0.11 means the model is confident (one value close to 1.0)
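As a sketch of how these probabilities might be used, the example below keeps only the predictions whose highest class probability is at least 0.8. The cut-off of 0.8 and the synthetic data are assumptions chosen for illustration, not recommended values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Confidence of each prediction = the largest class probability in its row
probs = clf.predict_proba(X_test)
confidence = probs.max(axis=1)

# Hypothetical cut-off: treat predictions below 0.8 as uncertain
confident_mask = confidence >= 0.8
print(f"{confident_mask.sum()} of {len(X_test)} predictions have confidence >= 0.8")
```

Samples that fall below the cut-off could, for instance, be flagged for manual review instead of being acted on automatically.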

#### Making predictions with regression models

In [34]:
# Import the Boston housing dataset, a regression dataset built into Scikit-Learn
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
boston = load_boston()

In [35]:
# Convert it to a pandas DataFrame for easier inspection

# Take the data key and label the columns with the feature names
boston_df = pd.DataFrame(boston["data"], columns=boston["feature_names"])

# Create a target column in the DataFrame from the dataset's target values
boston_df["target"] = pd.Series(boston["target"])
boston_df

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |


In [36]:
# No. of samples
len(boston_df)

506

In [39]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

# Create data
X = boston_df.drop("target",axis=1) # features
y = boston_df["target"] # labels

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

# Instantiate and fit model
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train,y_train) # train (learn) on training data

# Make predictions
y_preds = model.predict(X_test)

In [42]:
y_preds[:10] # predicted values

array([23.081, 30.574, 16.759, 23.46 , 16.893, 21.644, 19.113, 15.334,
       21.14 , 20.639])

In [43]:
np.array(y_test[:10]) # true values

array([23.6, 32.4, 13.6, 22.8, 16.1, 20. , 17.8, 14. , 19.6, 16.8])
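The Boston housing loader is not available in scikit-learn 1.2 and later. For readers on a newer version, the same fit and predict workflow can be sketched with another bundled regression dataset. The use of load_diabetes here is a substitution made for illustration, so the numbers it produces are unrelated to the housing results above.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Bundled regression dataset, swapped in for the Boston data (illustrative only)
X, y = load_diabetes(return_X_y=True)

# Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Instantiate and fit the model on the training data
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict continuous values for the unseen test data
y_preds = model.predict(X_test)
print(y_preds[:5])  # predicted values
print(y_test[:5])   # true values
```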

Now that we've seen how to get a model to find patterns in data using the fit() function and how to make predictions with what it has learned using the predict() and predict_proba() functions, it's time to evaluate those predictions.

In [45]:
# Compare the predicted values to the true values
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_preds)

2.136382352941176

Mean absolute error (MAE) is the average of the absolute differences between the predicted values and the true values.
For example: |23.6 - 23.081|, |32.4 - 30.574|, and so on. Add these absolute differences together and divide by the number of samples.
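The calculation can be checked by hand on the ten true and predicted values printed earlier. The sketch below computes the MAE manually with NumPy and compares it with mean_absolute_error(). Because it uses only the first ten samples, the result differs from the MAE reported for the full test set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# First ten true values and predictions, copied from the outputs above
y_true = np.array([23.6, 32.4, 13.6, 22.8, 16.1, 20.0, 17.8, 14.0, 19.6, 16.8])
y_pred = np.array([23.081, 30.574, 16.759, 23.46, 16.893,
                   21.644, 19.113, 15.334, 21.14, 20.639])

# Absolute difference for each sample, then the average over all samples
abs_errors = np.abs(y_true - y_pred)
manual_mae = abs_errors.mean()

# Scikit-Learn's implementation gives the same number
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae)
print(sklearn_mae)
```

Taking the absolute value matters: without it, predictions that are too high and too low would cancel each other out.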