# Scikit-Learn Library
What you'll learn
- An end-to-end Scikit-learn workflow
    1. Getting data ready
    2. Choosing a machine learning model
    3. Fitting a model to the data and making predictions
    4. Evaluating model predictions
    5. Improving model predictions
    6. Saving & Loading models

## 0. An End-to-end Scikit-Learn Workflow

#### 1) Getting the data ready


In [1]:
import pandas as pd
import numpy as np

In [2]:
heart_disease = pd.read_csv("/home/hp1/Documents/College/Coding/Machine Learning/zero_to_mastery_course/csv/heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [3]:
# Create X (the feature matrix) by dropping the label column
X = heart_disease.drop("target", axis=1)
X

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3


In [4]:
# Create y (the labels)
y = heart_disease["target"]
y

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64
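The split into features and labels above can be tried on a tiny frame first. This is a minimal sketch: `toy`, `X_toy` and `y_toy` are hypothetical names, and the values are illustrative rather than taken from the real file.

```python
import pandas as pd

# Hypothetical toy frame that reuses three column names from the table above.
toy = pd.DataFrame({
    "age": [63, 37, 41],
    "chol": [233, 250, 204],
    "target": [1, 1, 0],
})

# drop() returns a new DataFrame; the original frame keeps its "target" column.
X_toy = toy.drop("target", axis=1)
# Selecting a single column returns a Series of labels.
y_toy = toy["target"]

print(X_toy.shape, y_toy.shape)
```

Because `drop` does not modify the frame in place by default, `heart_disease` itself still contains `target` after `X` is created.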

#### 2) Choose the right model and hyperparameters


In [5]:
from sklearn.ensemble import RandomForestClassifier


In [6]:
clf = RandomForestClassifier()

#We'll keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

Meaning of some of the hyperparameters listed above:
REFERENCE: https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

- n_estimators = number of trees in the forest
- max_features = max number of features considered when splitting a node
- max_depth = max number of levels in each decision tree
- min_samples_split = min number of data points a node must hold before it can be split
- min_samples_leaf = min number of data points allowed in a leaf node
- bootstrap = whether each tree is trained on a bootstrap sample (drawn with replacement) or on the whole dataset
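The hyperparameters listed above can also be set explicitly when the classifier is created. The sketch below uses arbitrary illustrative values (they are assumptions, not recommendations from the course), and `clf_custom` is a hypothetical name.

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative values only; good settings depend on the dataset.
clf_custom = RandomForestClassifier(
    n_estimators=200,       # number of trees
    max_depth=5,            # limit the depth of each tree
    min_samples_split=4,    # a node needs at least 4 samples to be split
    min_samples_leaf=2,     # every leaf keeps at least 2 samples
    bootstrap=True,         # sample rows with replacement for each tree
    random_state=42,        # make results repeatable
)

# get_params() reflects the values passed to the constructor.
params = clf_custom.get_params()
print(params["n_estimators"], params["max_depth"], params["bootstrap"])
```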

#### 3) Fit the model to the training data

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
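To see what `test_size=0.2` does to 303 rows, here is a small sketch on stand-in arrays. `X_demo` and `y_demo` are hypothetical, and `random_state` and `stratify` are optional extras that the cell above does not use: the first makes the split repeatable, the second keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the heart-disease frame.
X_demo = np.arange(303 * 2).reshape(303, 2)
y_demo = np.array([0] * 138 + [1] * 165)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=42,    # repeatable shuffle
    stratify=y_demo,    # keep the 0/1 ratio similar in both splits
)

print(X_tr.shape, X_te.shape)
```

The test set gets 61 rows and the training set the remaining 242, which matches `len(X_test)` in the notebook.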

In [9]:
clf.fit(X_train,y_train)

RandomForestClassifier()

#### 3.1) Make a prediction


In [10]:
X_test

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
271,61,1,3,134,234,0,1,145,0,2.6,1,2,2
36,54,0,2,135,304,1,1,170,0,0.0,2,0,2
84,42,0,0,102,265,0,0,122,0,0.6,1,0,2
213,61,0,0,145,307,0,0,146,1,1.0,1,0,3
59,57,0,0,128,303,0,0,159,0,0.0,2,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
159,56,1,1,130,221,0,0,163,0,0.0,2,0,3
93,54,0,1,132,288,1,0,159,1,0.0,2,1,2
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2
143,67,0,0,106,223,0,1,142,0,0.3,2,2,2


In [11]:
y_predicted = clf.predict(X_test)
y_predicted

array([0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0])
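`predict` returns hard class labels like the array above. If the underlying class probabilities are of interest, `predict_proba` exposes them. The sketch below uses synthetic data from `make_classification` as a stand-in for the heart-disease frame; `demo_clf` and the other names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the heart-disease dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# One row per sample, one column per class; each row sums to 1.
proba = demo_clf.predict_proba(X_demo[:5])
labels = demo_clf.predict(X_demo[:5])

# The predicted label is the class with the highest probability.
from_proba = demo_clf.classes_[proba.argmax(axis=1)]

print(proba.round(2))
print(labels, from_proba)
```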

In [12]:
len(X_test)

61

In [13]:
y_test

271    0
36     1
84     1
213    0
59     1
      ..
159    1
93     1
6      1
143    1
255    0
Name: target, Length: 61, dtype: int64

#### 4) Evaluate the model

In [14]:
# Evaluate the model on the training data and on the test data
clf.score(X_train, y_train)             # usually 1.0 or very close, because the model was trained on these rows

1.0

In [15]:
clf.score(X_test, y_test)               # if this gives 1.0, then something is probably wrong

0.8360655737704918

In [16]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

<b>Precision</b> measures the exactness of the model: of all the samples predicted positive, the fraction that are actually positive <br>
Precision = TP/(TP+FP) <br>
<br>
<b>Recall</b> is the true positive recognition rate: of all the actually positive samples, the fraction the model predicted positive <br>
Recall = TP/(TP+FN)<br>
<br>
<b>Support</b> is the number of samples of each class in the evaluated data (here, y_test)

In [17]:
print(classification_report(y_test,y_predicted))

              precision    recall  f1-score   support

           0       0.84      0.84      0.84        31
           1       0.83      0.83      0.83        30

    accuracy                           0.84        61
   macro avg       0.84      0.84      0.84        61
weighted avg       0.84      0.84      0.84        61



In [18]:
confusion_matrix(y_test, y_predicted)

array([[26,  5],
       [ 5, 25]])
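The numbers in the classification report can be recomputed by hand from the confusion matrix above. scikit-learn lays the matrix out with true labels as rows and predicted labels as columns, so for labels 0 and 1 it reads `[[TN, FP], [FN, TP]]`. The sketch below is plain Python and treats class 1 as the positive class.

```python
# Confusion matrix from the cell above: rows are true labels, columns are predictions.
cm = [[26, 5],
      [5, 25]]
(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
# prints: precision=0.83 recall=0.83 f1=0.83 accuracy=0.8361
```

These values agree with the row for class 1 in the report and with the accuracy score that follows.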

In [19]:
accuracy_score(y_test, y_predicted)

0.8360655737704918

#### 5) Improve the model
- Try different values of n_estimators

In [20]:
# Try different values of n_estimators
np.random.seed(14)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set : {clf.score(X_test, y_test)}\n")

# In this run, test accuracy peaks at 40 and 50 estimators (about 0.836)

Trying model with 10 estimators
Model accuracy on test set : 0.7868852459016393

Trying model with 20 estimators
Model accuracy on test set : 0.8032786885245902

Trying model with 30 estimators
Model accuracy on test set : 0.7704918032786885

Trying model with 40 estimators
Model accuracy on test set : 0.8360655737704918

Trying model with 50 estimators
Model accuracy on test set : 0.8360655737704918

Trying model with 60 estimators
Model accuracy on test set : 0.819672131147541

Trying model with 70 estimators
Model accuracy on test set : 0.819672131147541

Trying model with 80 estimators
Model accuracy on test set : 0.7704918032786885

Trying model with 90 estimators
Model accuracy on test set : 0.819672131147541
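The scores above come from a single 61-row test set, so they move around a lot from one setting to the next. One alternative, which the notebook itself does not use, is to compare settings with cross-validation. The sketch below assumes synthetic stand-in data and hypothetical names (`X_demo`, `y_demo`, `cv_means`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, not the heart-disease dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=14)

cv_means = {}
for n in (10, 50, 100):
    model = RandomForestClassifier(n_estimators=n, random_state=14)
    # 5-fold cross-validation: five accuracy scores, one per held-out fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_means[n] = scores.mean()
    print(f"{n} estimators: mean CV accuracy {cv_means[n]:.3f}")
```

Averaging over folds gives a steadier comparison, and it leaves the test set untouched for a final check.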



#### 6) Save a model and load it


In [21]:
# Save the model with pickle
import pickle

# The with-block closes the file once the model has been written
with open("random-forest-model.pkl", "wb") as f:
    pickle.dump(clf, f)

In [22]:
with open("random-forest-model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

0.819672131147541
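The loaded model scores the same as the last model trained in the loop (90 estimators), which suggests the round trip preserved it. A more direct check is to compare predictions before and after saving. The sketch below uses synthetic data and a temporary directory so nothing is left on disk; all names in it are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model to keep the example fast.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)        # save
    with open(path, "rb") as f:
        restored = pickle.load(f)    # load

# The restored model should give exactly the same predictions.
same = (restored.predict(X_demo) == model.predict(X_demo)).all()
print(same)
```

Pickle files should only be loaded from sources you trust, and ideally with the same scikit-learn version that wrote them.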

In [23]:
import sklearn
sklearn.get_config()

{'assume_finite': False,
 'working_memory': 1024,
 'print_changed_only': True,
 'display': 'text'}