# 3. Fit the model/algorithm on our data and use it to make predictions

## 3.1 Fitting the model to the data

In [2]:
# Importing essentials
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
# Reading the data
heart_disease = pd.read_csv('data/heart-disease.csv')
heart_disease.head()

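|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |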


## For Classification Model

In [11]:
# Import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Setting a random seed
np.random.seed(27)

# Getting the data ready
X = heart_disease.drop('target', axis = 1)
y = heart_disease['target']

# Dividing into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size = 0.2)

# Instantiate the model
model = RandomForestClassifier()

# Fitting the model to the data
model.fit(X_train, y_train)

# Checking the accuracy
model.score(X_test, y_test)

0.7868852459016393

## 3.2 Make predictions using a machine learning model

There are two ways to make predictions:
1. `predict()`
2. `predict_proba()`

#### Predicting using `predict()`

In [14]:
# using predict() 
model.predict(X_test)

array([0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1], dtype=int64)

In [15]:
np.array(y_test)

array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1], dtype=int64)

In [16]:
# Comparing predictions to the truth labels (y_test) to evaluate the model
y_preds = model.predict(X_test)
np.mean(y_preds == y_test)

0.7868852459016393

In [17]:
model.score(X_test, y_test)

0.7868852459016393

Both approaches give the same result because `model.score()` on a classifier returns the mean accuracy, which is exactly the fraction of predictions that match the true labels.

In [18]:
# Another way to evaluate the score
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

0.7868852459016393
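The equivalence of the three evaluations above can be checked in isolation. The following is a minimal sketch on synthetic data from `make_classification` (an assumption for illustration; it is not the heart-disease dataset), and the names `X_demo`, `clf`, `manual_acc`, `score_acc` and `metric_acc` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: used only to keep the example self-contained)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=27)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=27)

clf = RandomForestClassifier(n_estimators=50, random_state=27)
clf.fit(X_tr, y_tr)
preds = clf.predict(X_te)

# Three ways of computing the same quantity: the fraction of correct predictions
manual_acc = np.mean(preds == y_te)        # element-wise comparison, then mean
score_acc = clf.score(X_te, y_te)          # estimator's built-in scorer
metric_acc = accuracy_score(y_te, preds)   # metric function from sklearn.metrics

print(manual_acc, score_acc, metric_acc)
```

All three values should agree, since each one counts the matching predictions and divides by the number of test samples.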

#### Predicting using `predict_proba()`

- `predict_proba()` is useful when we want to keep only the predictions the model is confident about, that is, those with a high predicted probability

In [21]:
# predict_proba() returns the probability of each class label for every row
model.predict_proba(X_test[:5])

array([[0.62, 0.38],
       [0.03, 0.97],
       [0.05, 0.95],
       [0.52, 0.48],
       [0.02, 0.98]])

In [22]:
model.predict(X_test[:5])

array([0, 1, 1, 0, 1], dtype=int64)

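Comparing the two outputs, `predict()` returns the class label chosen by the model for each row, whereas `predict_proba()` returns the estimated probability that each row belongs to each class. The label from `predict()` is the class with the highest probability: the first row has probabilities `[0.62, 0.38]`, so it is predicted as class `0`.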
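One way to keep only high-probability predictions is to threshold the largest probability in each row. This is a sketch under stated assumptions: the data is synthetic (`make_classification`), the `threshold` of 0.8 is an arbitrary illustrative cut-off, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the heart-disease dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=27)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=27)

clf = RandomForestClassifier(n_estimators=50, random_state=27).fit(X_tr, y_tr)

# One row per sample, one column per class; each row sums to 1
proba = clf.predict_proba(X_te)

# Picking the most probable class per row reproduces predict()
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

# Keep only the predictions whose top probability clears the cut-off
confidence = proba.max(axis=1)
threshold = 0.8  # hypothetical cut-off, tune for the problem at hand
confident_mask = confidence >= threshold
confident_preds = labels_from_proba[confident_mask]

print(f"{confident_mask.sum()} of {len(X_te)} predictions have confidence >= {threshold}")
```

Rows below the threshold could, for example, be routed to a human for review rather than acted on automatically.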

## For Regression Model

In [23]:
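# Importing the data
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell only runs on older versions.
from sklearn.datasets import load_boston
boston = load_boston()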

In [24]:
boston_df = pd.DataFrame(boston['data'], columns = boston['feature_names'])
boston_df['target'] = pd.Series(boston['target'])
boston_df.head()

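|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |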


In [27]:
# Importing RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

# Setting a random seed
np.random.seed(27)

# Getting data ready
X = boston_df.drop('target', axis = 1)
y = boston_df['target']

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size = 0.2)

# Instantiate the model
model = RandomForestRegressor()

# Fitting the model to the data
model.fit(X_train, y_train)

# Checking the score (for a regressor this is R^2, not accuracy)
model.score(X_test, y_test)

0.9033142910210461
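For a regressor, `score()` returns the coefficient of determination (R^2) rather than an accuracy. The sketch below, on synthetic data from `make_regression` (an assumption for illustration; it is not the Boston dataset), compares `score()` with `r2_score` and with a hand-computed R^2. All variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: used only to keep the example self-contained)
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=10.0, random_state=27)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=27)

reg = RandomForestRegressor(n_estimators=50, random_state=27)
reg.fit(X_tr, y_tr)
reg_preds = reg.predict(X_te)

score_r2 = reg.score(X_te, y_te)        # estimator's built-in scorer
metric_r2 = r2_score(y_te, reg_preds)   # metric function from sklearn.metrics

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_te - reg_preds) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(score_r2, metric_r2, manual_r2)
```

A value of 1.0 means perfect predictions, while 0.0 means the model does no better than always predicting the mean of the targets.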

#### Predicting using `predict()`

In [34]:
y_preds = model.predict(X_test)
y_preds[:5]

array([16.75 , 32.239, 18.817,  7.622, 19.076])

In [35]:
np.array(y_test[:5])

array([10.9, 33.8, 18.3,  5. , 19.6])

One common way to evaluate a regression model is the mean absolute error (`mean_absolute_error`): take the absolute difference between the predicted value and the true value for each sample, add these differences together, then divide by the total number of samples.

Formula and reference: https://www.statisticshowto.com/absolute-error/#:~:text=Find%20all%20of%20your%20absolute,10%20measurements%2C%20divide%20by%2010.

In [36]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_preds)

2.0334705882352933
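The mean absolute error can also be worked out by hand. As a small sketch, the code below applies the definition to the first five predictions and true values printed earlier and compares the result with `mean_absolute_error`; the names `y_pred_sample`, `y_true_sample`, `manual_mae` and `sklearn_mae` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# First five predictions and true values from the outputs shown earlier
y_pred_sample = np.array([16.75, 32.239, 18.817, 7.622, 19.076])
y_true_sample = np.array([10.9, 33.8, 18.3, 5.0, 19.6])

# Absolute difference per sample, summed, then divided by the sample count
abs_errors = np.abs(y_true_sample - y_pred_sample)
manual_mae = abs_errors.sum() / len(abs_errors)

# The library function should give the same number
sklearn_mae = mean_absolute_error(y_true_sample, y_pred_sample)

print(manual_mae, sklearn_mae)
```

On these five rows the error is about 2.21, which differs from the 2.03 reported for the full test set because it averages over only a small subset.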