## Introduction to Scikit-Learn
This notebook demonstrates some of the most useful functions in scikit-learn.
This section gives only a high-level overview of each step.
Each step is then covered in detail in its own scikit-learn section.

Topics covered:
1. An end-to-end scikit-learn workflow
2. Getting the data ready
3. Choose the right estimator/algorithm
4. Fit the model/algorithm and use it to make predictions
5. Evaluating a model
6. Improve the model
7. Save and load a trained model
8. Putting it all together

## 1. End-to-end scikit-learn workflow

In [20]:
import pandas as pd
import numpy as np

## 2. Getting the data ready

In [21]:
# Load the heart disease dataset into a DataFrame
heart_disease = pd.read_csv("heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [22]:
# Use every column except "target" to predict whether a patient has heart disease

# Create X (features matrix)
X = heart_disease.drop("target", axis=1)

# Create Y (labels)
Y = heart_disease["target"]
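
The feature/label split can be checked on a tiny hand-made frame. The sketch below uses a hypothetical three-row DataFrame (`toy`, not the heart disease data) to show that `drop` returns a new frame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy data standing in for heart-disease.csv
toy = pd.DataFrame({
    "age": [63, 37, 41],
    "chol": [233, 250, 204],
    "target": [1, 1, 0],
})

# Features: every column except the label
X_toy = toy.drop("target", axis=1)
# Labels: the column we want to predict
y_toy = toy["target"]

print(X_toy.shape)               # (3, 2)
print(list(X_toy.columns))       # ['age', 'chol']
print("target" in toy.columns)   # True: the original frame is unchanged
```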


## 3. Choose the right estimator/algorithm

In [23]:
# Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
# We'll keep the default hyperparameters
classifier.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
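
Any of the defaults listed above can be overridden. As a minimal sketch (the values 50, 5, and 200 are arbitrary choices for illustration), hyperparameters can be set at construction time or changed later with `set_params`:

```python
from sklearn.ensemble import RandomForestClassifier

# Override two defaults at construction time
clf = RandomForestClassifier(n_estimators=50, max_depth=5)
params = clf.get_params()
print(params["n_estimators"], params["max_depth"])  # 50 5

# set_params changes hyperparameters on an existing estimator
clf.set_params(n_estimators=200)
print(clf.get_params()["n_estimators"])  # 200
```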

## 4. Fit the model/algorithm and use it to make predictions

In [24]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
# test_size=0.2 holds out 20% of the data for testing; the remaining 80% is used for training
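
Without a fixed seed, `train_test_split` shuffles differently on every run, so scores change between runs. The sketch below, on a synthetic 303-row array (an assumption, not the real dataset), shows how `random_state` makes the split repeatable and how `stratify` keeps the class balance similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 303 rows, 2 features
X_demo = np.arange(303 * 2).reshape(303, 2)
y_demo = np.array([0, 1, 1] * 101)  # 101 zeros, 202 ones

# random_state fixes the shuffle; stratify preserves the class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)  # (242, 2) (61, 2)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(np.array_equal(X_te, X_te2))  # True
```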

In [25]:
# Train the model on the training data
classifier.fit(X_train, Y_train);

In [26]:
# Make predictions on the test set
y_predict = classifier.predict(X_test)
y_predict

array([1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1,
       1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1])

In [27]:
Y_test

58     1
267    0
135    1
223    0
90     1
      ..
300    0
244    0
192    0
248    0
222    0
Name: target, Length: 61, dtype: int64
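
Comparing predictions with the true labels element by element is what an accuracy score boils down to. A minimal sketch with hypothetical five-element arrays (not the notebook's actual predictions):

```python
import numpy as np

# Hypothetical predictions and true labels
y_pred_demo = np.array([1, 0, 1, 1, 0])
y_true_demo = np.array([1, 0, 0, 1, 1])

matches = y_pred_demo == y_true_demo   # boolean array, True where the prediction is correct
accuracy = matches.mean()              # fraction of correct predictions

print(matches.sum(), "of", len(matches), "correct")  # 3 of 5 correct
print(accuracy)                                       # 0.6
```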

## 5. Evaluating a model

In [28]:
# Evaluate the model: score() returns the mean accuracy
classifier.score(X_train, Y_train)  # Training data

1.0

In [29]:
classifier.score(X_test, Y_test)  # Test data

0.819672131147541
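
A perfect training score next to a test score of about 0.82 is a common sign that the forest fits the training data more closely than it generalizes, and a single 61-row test split is a noisy estimate. One way to get a steadier number is cross-validation. The sketch below assumes synthetic data from `make_classification` in place of the heart disease data, so its scores will not match the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 300 rows, 13 features
X_syn, y_syn = make_classification(n_samples=300, n_features=13, random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42)

# 5-fold cross-validation: train on 4 folds, score on the held-out fold, 5 times
scores = cross_val_score(clf, X_syn, y_syn, cv=5)
print(scores)         # one accuracy per fold
print(scores.mean())  # average accuracy across folds
```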

In [30]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

print(classification_report(Y_test,y_predict))

              precision    recall  f1-score   support

           0       0.88      0.72      0.79        29
           1       0.78      0.91      0.84        32

    accuracy                           0.82        61
   macro avg       0.83      0.82      0.82        61
weighted avg       0.83      0.82      0.82        61



In [31]:
confusion_matrix(Y_test,y_predict)

array([[21,  8],
       [ 3, 29]])
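
The numbers in the classification report can be recomputed by hand from this confusion matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]` when class 1 is treated as positive. A short worked check:

```python
# Values copied from the confusion matrix output above
tn, fp = 21, 8    # true class 0: predicted 0, predicted 1
fn, tp = 3, 29    # true class 1: predicted 0, predicted 1

precision_1 = tp / (tp + fp)   # of everything predicted 1, how much was really 1
recall_1 = tp / (tp + fn)      # of everything really 1, how much was found
precision_0 = tn / (tn + fn)   # same quantities with class 0 as the class of interest
recall_0 = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_0, 2), round(recall_0, 2))  # 0.88 0.72
print(round(precision_1, 2), round(recall_1, 2))  # 0.78 0.91
print(round(accuracy, 2))                          # 0.82
```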

In [32]:
accuracy_score(Y_test,y_predict)

0.819672131147541

## 6. Improve the model

In [33]:
# Improve the model
# Try different values of n_estimators
np.random.seed(42)
for i in range(10,110,10):
    print(f"Trying model with {i} estimators  ")
    classifier=RandomForestClassifier(n_estimators=i).fit(X_train,Y_train)
    print(f"Model accuracy on test set: {classifier.score(X_test,Y_test)}")
    print("")

Trying model with 10 estimators  
Model accuracy on test set: 0.7704918032786885

Trying model with 20 estimators  
Model accuracy on test set: 0.819672131147541

Trying model with 30 estimators  
Model accuracy on test set: 0.8524590163934426

Trying model with 40 estimators  
Model accuracy on test set: 0.8360655737704918

Trying model with 50 estimators  
Model accuracy on test set: 0.819672131147541

Trying model with 60 estimators  
Model accuracy on test set: 0.8360655737704918

Trying model with 70 estimators  
Model accuracy on test set: 0.7868852459016393

Trying model with 80 estimators  
Model accuracy on test set: 0.8360655737704918

Trying model with 90 estimators  
Model accuracy on test set: 0.819672131147541

Trying model with 100 estimators  
Model accuracy on test set: 0.819672131147541



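Here the test accuracy peaks at 30 estimators (about 0.85), not at the largest setting. With only 61 test samples, each prediction is worth about 1.6 percentage points, so the gap between the best and worst settings is only five predictions.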
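
Choosing `n_estimators` by comparing test-set scores lets the test set influence model selection, which can make the final score look better than it is. A common alternative is to search with cross-validation on the training data only and touch the test set once at the end. The sketch below uses `GridSearchCV` on synthetic data; the dataset and the grid values `[10, 30, 60]` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): 300 rows, 13 features
X_syn, y_syn = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Try each n_estimators value with 3-fold cross-validation on the training set
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 30, 60]},
    cv=3,
)
search.fit(X_tr, y_tr)

print(search.best_params_)       # the setting with the best cross-validated accuracy
print(search.score(X_te, y_te))  # test accuracy of the model refit with that setting
```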

## 7. Save and load a trained model


In [34]:
import pickle
# Save the trained model; the with block closes the file after writing
with open("random_forest.pkl", "wb") as f:
    pickle.dump(classifier, f)

In [35]:
# Load the saved model; the with block closes the file after reading
with open("random_forest.pkl", "rb") as f:
    load_model = pickle.load(f)
load_model.score(X_test, Y_test)   # accuracy of the last model trained in the loop (100 estimators)

0.819672131147541
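
Matching scores are a good sign, but a stricter check is that the restored model makes exactly the same predictions as the original. A minimal round-trip sketch, assuming synthetic data and a temporary file (`random_forest_demo.pkl` is a hypothetical name):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model (assumptions for illustration)
X_syn, y_syn = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_syn, y_syn)

# Write the model to a temporary directory, then read it back
path = os.path.join(tempfile.mkdtemp(), "random_forest_demo.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly
same = np.array_equal(model.predict(X_syn), restored.predict(X_syn))
print(same)  # True
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.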