1. Getting the data ready.
2. Choosing the right estimator (algorithm) for our problem.
3. Fitting the model to the training data and using it to make predictions.
4. Evaluating the model.
5. Improving the model.
6. Saving and loading a trained model.
7. Putting it all together.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [35]:
import warnings
warnings.filterwarnings("ignore")  # suppress warnings; call warnings.filterwarnings("default") to show them again

## 1. Getting the Data Ready.

In [2]:
df=pd.read_csv("Heart Disease Dataset.csv")
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


In [3]:
# X holds the independent variables (the features)
X = df.drop("target", axis=1)

# y holds the dependent variable (the target label)
y = df["target"]
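Before modelling, it can help to check the data for missing values and duplicated rows, since both affect how far later scores can be trusted. The sketch below is illustrative only: `summarize_frame` is a hypothetical helper, and the tiny `toy` frame is made up rather than taken from the heart disease file.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Return basic data-quality counts for a DataFrame (hypothetical helper)."""
    return {
        "n_rows": len(df),
        # Total number of missing cells across all columns.
        "n_missing": int(df.isna().sum().sum()),
        # Rows that exactly repeat an earlier row.
        "n_duplicates": int(df.duplicated().sum()),
    }


# Made-up data: the third row repeats the first, and one age is missing.
toy = pd.DataFrame({
    "age": [52, 53, 52, None],
    "chol": [212, 203, 212, 294],
    "target": [0, 0, 0, 1],
})

summary = summarize_frame(toy)
print(summary)
```

In the notebook, the same helper could be called as `summarize_frame(df)` right after loading the CSV.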

## 2. Choosing the right estimator algorithm for our problem.

In [4]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

# Keep the default hyperparameters and list them.
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
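Any of the hyperparameters listed above can be overridden, either when the estimator is created or later through `set_params`. The values below are arbitrary choices for illustration, not recommendations for this dataset.

```python
from sklearn.ensemble import RandomForestClassifier

# Override a few defaults at construction time (values chosen arbitrarily).
model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)

params = model.get_params()
print(params["n_estimators"], params["max_depth"], params["random_state"])

# Hyperparameters can also be changed on an existing estimator.
# The change takes effect the next time fit() is called.
model.set_params(n_estimators=200)
updated = model.get_params()["n_estimators"]
print(updated)
```

Setting `random_state` makes repeated fits on the same data produce the same forest, which simplifies comparing runs.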

## 3. Fit the model/algorithm to the training data.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
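The split above is random, so the rows in each set change on every run. A possible variation, assuming reproducibility and balanced class proportions are wanted, passes `random_state` and `stratify`. Both arguments are additions that the original cell does not use, and the data below is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 70/30 class balance.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 70 + [1] * 30)

# random_state fixes the shuffle; stratify keeps the class ratio in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("positives in test set:", int(y_te.sum()), "of", len(y_te))
```

With the notebook's variables, the equivalent call would be `train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)`.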

In [6]:
df.shape  # shape of the full dataset: (rows, columns)

(1025, 14)

In [7]:
X_train.shape, X_test.shape  # shapes of the training and test features

((820, 13), (205, 13))

In [8]:
y_train.shape, y_test.shape  # shapes of the training and test labels

((820,), (205,))

In [9]:
clf.fit(X_train, y_train)

RandomForestClassifier()

## 3.1 Make a prediction

In [11]:
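# predict() needs a 2D array with the same 13 feature columns used for training, so this 1D example would fail:
# y_label = clf.predict(np.array([0, 1, 2, 3]))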
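scikit-learn estimators expect input of shape `(n_samples, n_features)`, so a single sample has to be reshaped into one row before it is passed to `predict`. The sketch below shows this on synthetic data with four features; the model and data are stand-ins, not the notebook's `clf`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 60 samples, 4 features, label depends on feature 0.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

sample = X_demo[0]                   # 1D array, shape (4,)
sample_2d = sample.reshape(1, -1)    # one row, shape (1, 4)

pred = model.predict(sample_2d)          # predicted class, shape (1,)
proba = model.predict_proba(sample_2d)   # class probabilities, shape (1, 2)
print(pred, proba)
```

When the model was fitted on a DataFrame, as in this notebook, selecting a one-row DataFrame such as `X_test.iloc[[0]]` keeps the column names and avoids the reshape.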

In [12]:
y_preds=clf.predict(X_test)
y_preds

array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1,
       0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1,
       1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 1, 0], dtype=int64)

In [13]:
y_test

144    1
662    1
651    1
90     1
850    0
      ..
433    1
361    1
520    0
374    1
822    0
Name: target, Length: 205, dtype: int64

## 4. Evaluate the model on the training data and the test data.

In [18]:
clf.score(X_train,y_train)

1.0

In [19]:
clf.score(X_test, y_test)

1.0

In [20]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [21]:
print(classification_report(y_test,y_preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       106
           1       1.00      1.00      1.00        99

    accuracy                           1.00       205
   macro avg       1.00      1.00      1.00       205
weighted avg       1.00      1.00      1.00       205



In [22]:
print(confusion_matrix(y_test,y_preds))

[[106   0]
 [  0  99]]
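The numbers in the classification report can be derived directly from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and treats class 1 as the positive class. `metrics_from_confusion` is a hypothetical helper, and the second matrix is made up to show a case with errors.

```python
import numpy as np


def metrics_from_confusion(cm):
    """Accuracy, precision and recall for class 1 from a 2x2 confusion matrix."""
    cm = np.asarray(cm)
    # Row-major order of a 2x2 matrix: true negatives, false positives,
    # false negatives, true positives.
    tn, fp, fn, tp = cm.ravel()
    return {
        "accuracy": float((tp + tn) / cm.sum()),
        "precision": float(tp / (tp + fp)),
        "recall": float(tp / (tp + fn)),
    }


# The matrix printed above: no errors, so every metric is 1.0.
perfect = metrics_from_confusion([[106, 0], [0, 99]])
print(perfect)

# A made-up matrix with 6 false positives and 9 false negatives.
imperfect = metrics_from_confusion([[100, 6], [9, 90]])
print(imperfect)
```

In the made-up case, precision is 90 / 96 and recall is 90 / 99, which shows how the two metrics react to different kinds of error.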


In [23]:
print(accuracy_score(y_test,y_preds))

1.0


The model reaches 100% accuracy on both the training set and the test set, which is the highest score possible. A perfect test score is unusual and deserves scrutiny: it can mean that some test rows also occur in the training set (for example, through duplicated records) rather than that the model generalizes perfectly. On genuinely new data the accuracy may be lower. The next step, "5. Improve a model", tries different hyperparameter values to see how they affect the score.
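One way to investigate a perfect test score is to count how many test rows also appear, feature for feature, in the training set. The notebook does not show whether this dataset contains such overlap, so the sketch below only illustrates the check on made-up data; `count_leaked_rows` is a hypothetical helper.

```python
import pandas as pd


def count_leaked_rows(train: pd.DataFrame, test: pd.DataFrame) -> int:
    """Count test rows whose feature values also occur as a row in train."""
    # Drop repeated training rows so each test row matches at most once.
    train_unique = train.drop_duplicates()
    # With no `on` argument, merge joins on all shared columns.
    merged = test.merge(train_unique, how="left", indicator=True)
    return int((merged["_merge"] == "both").sum())


# Made-up frames: the first test row is identical to a training row.
train_demo = pd.DataFrame({"age": [52, 53, 70], "chol": [212, 203, 174]})
test_demo = pd.DataFrame({"age": [52, 61], "chol": [212, 203]})

leaked = count_leaked_rows(train_demo, test_demo)
print(leaked)
```

In the notebook this could be called as `count_leaked_rows(X_train, X_test)`. A large count would suggest that the test score overstates how well the model handles unseen patients.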

## 5. Improve a model

In [29]:
np.random.seed(42)
for i in range(10,100,10):
    print(f"Trying model with {i} estimators...")
    clf=RandomForestClassifier(n_estimators=i).fit(X_train,y_train)
    print(f"Model accuracy on test set: {clf.score(X_test, y_test) *100:.2f}%")
print("\n")

Trying model with 10 estimators...
Model accuracy on test set: 99.02%
Trying model with 20 estimators...
Model accuracy on test set: 100.00%
Trying model with 30 estimators...
Model accuracy on test set: 100.00%
Trying model with 40 estimators...
Model accuracy on test set: 100.00%
Trying model with 50 estimators...
Model accuracy on test set: 100.00%
Trying model with 60 estimators...
Model accuracy on test set: 100.00%
Trying model with 70 estimators...
Model accuracy on test set: 100.00%
Trying model with 80 estimators...
Model accuracy on test set: 100.00%
Trying model with 90 estimators...
Model accuracy on test set: 100.00%
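The loop above scores every candidate on the same test set, so the test set ends up guiding the choice of `n_estimators`. A common alternative is k-fold cross-validation, which compares candidates using only resampled training data. The sketch below uses `cross_val_score` on synthetic data from `make_classification`; the candidate values and fold count are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

results = {}
for n in (10, 50):
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    # Five folds: each fold is held out once while the rest is used for fitting.
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[n] = float(scores.mean())
    print(f"{n} estimators: mean CV accuracy {results[n] * 100:.2f}%")
```

With the notebook's variables, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, leaving the test set untouched until the final evaluation.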




## 6. Save a model and load it

In [30]:
import pickle

In [32]:
with open("random_forest_model.pkl", "wb") as f:
    pickle.dump(clf, f)

In [34]:
with open("random_forest_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

1.0
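Matching scores suggest that the model survived the round trip, and comparing predictions directly is a slightly stricter check. The sketch below saves a stand-in model to a temporary directory and confirms that the restored copy predicts identically. The data, the model and the file name `model.pkl` are all assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and model.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code, and a model is safest to load with the same scikit-learn version that saved it.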