## Welcome to Scikit-Learn!

**I'll write my code in this notebook as I walk through the Scikit-Learn lectures.**

**The instructor's complete notebook is [here](./Resources/introduction-to-scikit-learn-video.ipynb).**

I will write about my learning experience here.😊

👉 **Here is what we're going to cover!**

In [1]:
# List the topics we're going to cover
what_were_covering = [
    "0. An end-to-end Scikit-Learn workflow",
    "1. Getting the data ready",
    "2. Choose the right estimator/algorithm for our problems",
    "3. Fit the model/algorithm and use it to make predictions on our data",
    "4. Evaluating a model",
    "5. Improve a model",
    "6. Save and load a trained model",
    "7. Putting it all together!"]

## 0. End-to-End Scikit-Learn Workflow

#### **1. Get the Data Ready**

In [2]:
import pandas as pd
heart_disease_df = pd.read_csv("../6. Matplotlib - Visualizing Data/CSVs/heart-disease.csv")

heart_disease_df.head(5)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [3]:
# Separate the data into inputs (features) and the corresponding outputs (labels)

# x (feature matrix): all the predictor variables that are used together to predict the target
x = heart_disease_df.drop(columns="target") # Drop the target column

# y (labels): the target column, i.e. the output we want to predict
y = heart_disease_df.target
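
To make the feature/label split concrete, here is a minimal sketch on a tiny made-up table. The `toy_df` values and the names `features` and `labels` are hypothetical stand-ins, not the real heart-disease data.

```python
import pandas as pd

# Hypothetical toy data standing in for the heart-disease table
toy_df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "target": [1, 1, 0, 0],
})

features = toy_df.drop(columns="target")  # every column except the label
labels = toy_df["target"]                 # the label column as a Series

print(features.shape)                 # (4, 2)
print(labels.shape)                   # (4,)
print("target" in features.columns)   # False
```

Note that `drop` returns a new DataFrame, so `toy_df` itself still contains the `target` column afterwards.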

#### Our problem statement: we want to predict whether or not a patient has heart disease,
#### based on the feature matrix **(*x*)** (the predicting variables).
### **That makes it a classification problem.**

#### **2. Choose the Right Model for This Classification Problem**

In [4]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier() # Create the model (instantiate the algorithm)

# Take a look at the default hyperparameters, the settings we can tune to fit the model to the available data
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
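
The defaults listed above can be overridden. As a small sketch (the values 50, 5, and 200 are arbitrary examples, not recommendations), hyperparameters can be passed to the constructor or changed later with `set_params`:

```python
from sklearn.ensemble import RandomForestClassifier

# Override some defaults at construction time; the values are arbitrary examples
model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)

params = model.get_params()
print(params["n_estimators"])  # 50
print(params["max_depth"])     # 5

# set_params changes hyperparameters on an existing estimator
model.set_params(n_estimators=200)
print(model.get_params()["n_estimators"])  # 200
```

Changing hyperparameters does not retrain anything; the new values only take effect the next time `fit` is called.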

#### **3. Now Fit Our Model/Algorithm (an `Estimator` in Scikit-Learn) to the Data**

In [5]:
# Before fitting, we have to hold back part of the data for testing.
# So let's split the data into separate training and testing sets.

from sklearn.model_selection import train_test_split

# x and y are split into 75% (training) and 25% (testing) sets, because test_size=0.25

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
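
To check which side of the split gets which share, here is a small sketch on made-up arrays (`X_toy` and `y_toy` are hypothetical). It also shows two optional arguments not used in this notebook: `random_state`, which makes the split reproducible, and `stratify`, which keeps the class ratio the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 3 features, two balanced classes
X_toy = np.arange(300).reshape(100, 3)
y_toy = np.array([0, 1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.25,    # 25% of the rows go to the test set
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y_toy,    # keep the class ratio the same in both sets
)

print(X_tr.shape, X_te.shape)  # (75, 3) (25, 3)
```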

In [6]:
# Let's fit the model to the training data

clf.fit(x_train, y_train) # Classification Model has been trained with Training Data

##### **Making Predictions with Our Trained Model:**

In [7]:
# Imports used below
import numpy as np

## The model can only predict on data shaped like its training data:
## a 2-D array with the same number of feature columns as x_train.

# Error : clf.predict(np.array([1,34,2,4,5])) ## It won't work: this is a 1-D array with the wrong number of features.

# Let's predict on our testing data
y_predicted = clf.predict(x_test)

y_predicted ## Let's see our predictions

array([0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
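
Here is a minimal sketch of why a bare 1-D array is rejected and how a single sample can be passed instead. It uses synthetic data from `make_classification` with 5 features (an assumption for illustration, not the heart-disease set), so that a 5-value sample has the right width once reshaped.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 5 features (not the heart-disease set)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

one_sample = np.array([1, 34, 2, 4, 5], dtype=float)  # shape (5,): a 1-D array

try:
    demo_clf.predict(one_sample)  # 1-D input is rejected
except ValueError as err:
    print("1-D input failed:", type(err).__name__)

# Reshape to (1, n_features): one row, five columns
prediction = demo_clf.predict(one_sample.reshape(1, -1))
print(prediction.shape)  # (1,)
```

In the notebook's own case the array would also need as many values as `x_train` has columns, not just a second dimension.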

In [8]:
# x_test has been predicted, but we also have y_test: the true labels we can compare against.
y_test.values

array([0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 1], dtype=int64)

#### **4. Evaluate the Model :**

In [9]:
# Now let's use y_test to evaluate our model
# by comparing "y_test" to "y_predicted".

# Score the model on the training data first!
# It was trained on this same data, so how likely is it to get these rows wrong?
clf.score(x_train, y_train)

# A score of 1.0 means 100% accurate! 🥳
# But that's cheating: the model was already trained on this data.
# We should try some new data that the model hasn't seen yet!🤔

1.0

**Let's check our model's score on the testing data.**

In [10]:
# Evaluate on the test data
test_score = clf.score(x_test, y_test)

print(f"Yeah!! It's pretty good, {test_score*100:.1f}% accurate!!")

Yeah!! It's pretty good, 84.2% accurate!!


##### **There are many more ways to evaluate a model. Let's see them!**

In [11]:
# Importing Different Evaluation Metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# First use Classification_report.
print("\n* Classification Report Score : \n")
print(classification_report(y_test, y_predicted))

# Confusion Matrix Score.
print("\n* Confusion matrix Score : \n")
print(confusion_matrix(y_test, y_predicted))

# Accuracy Score.
print("\n* Accuracy Score : \n")
print(accuracy_score(y_test, y_predicted))


* Classification Report Score : 

              precision    recall  f1-score   support

           0       0.82      0.77      0.79        30
           1       0.85      0.89      0.87        46

    accuracy                           0.84        76
   macro avg       0.84      0.83      0.83        76
weighted avg       0.84      0.84      0.84        76


* Confusion matrix Score : 

[[23  7]
 [ 5 41]]

* Accuracy Score : 

0.8421052631578947
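
The numbers in the classification report can be recomputed by hand from the confusion matrix. As a worked sketch, using the four counts printed above (rows are the true class, columns are the predicted class) and variable names chosen here for illustration:

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
tn, fp = 23, 7   # true class 0: predicted 0, predicted 1
fn, tp = 5, 41   # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp          # 76 test samples
accuracy = (tp + tn) / total       # share of all predictions that were correct
precision_1 = tp / (tp + fp)       # of everything predicted 1, how much was truly 1
recall_1 = tp / (tp + fn)          # of everything truly 1, how much was found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy:    {accuracy:.2f}")     # 0.84
print(f"precision 1: {precision_1:.2f}")  # 0.85
print(f"recall 1:    {recall_1:.2f}")     # 0.89
print(f"f1-score 1:  {f1_1:.2f}")         # 0.87
```

These match the row for class `1` and the `accuracy` line in the report.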


#### **5. Improve the Model :**

In [12]:
# Try tuning the hyperparameters of the classification model.

# Let's try different values of "n_estimators" for our model.

for estimator in range(10, 101, 5):
    print(f"Trying model with {estimator} estimators :")
    clf = RandomForestClassifier(n_estimators=estimator).fit(x_train, y_train)
    print(f"Model accuracy on test set: {clf.score(x_test, y_test) * 100:.2f}%\n")


# The best accuracy I got was 85.53% (when I ran it; results vary from run to run because no random_state is set)


Trying model with 10 estimators :
Model accuracy on test set: 81.58%

Trying model with 15 estimators :
Model accuracy on test set: 77.63%

Trying model with 20 estimators :
Model accuracy on test set: 78.95%

Trying model with 25 estimators :
Model accuracy on test set: 84.21%

Trying model with 30 estimators :
Model accuracy on test set: 84.21%

Trying model with 35 estimators :
Model accuracy on test set: 85.53%

Trying model with 40 estimators :
Model accuracy on test set: 80.26%

Trying model with 45 estimators :
Model accuracy on test set: 82.89%

Trying model with 50 estimators :
Model accuracy on test set: 85.53%

Trying model with 55 estimators :
Model accuracy on test set: 81.58%

Trying model with 60 estimators :
Model accuracy on test set: 84.21%

Trying model with 65 estimators :
Model accuracy on test set: 85.53%

Trying model with 70 estimators :
Model accuracy on test set: 84.21%

Trying model with 75 estimators :
Model accuracy on test set: 84.21%

Trying model with 80 estimators :
Model accuracy on test set: 80.26%

Trying model with 85 estimators :
Model accuracy on test set: 82.89%

Trying model with 90 estimators :
Model accuracy on test set: 84.21%

Trying model with 95 estimators :
Model accuracy on te
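
The scores above jump around partly because each forest is seeded randomly and is judged on a single small test split. One possible way to get steadier comparisons, sketched here on synthetic data (an assumption; `make_classification` stands in for the heart-disease table), is to fix `random_state` and average over cross-validation folds with `cross_val_score`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the heart-disease table)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

results = {}
for n in (10, 50, 100):
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    # 5-fold cross-validation: five accuracy scores, one per held-out fold
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[n] = scores.mean()
    print(f"{n:>3} estimators: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")

best_n = max(results, key=results.get)
print("Best n_estimators on this synthetic data:", best_n)
```

Comparing averaged scores also avoids picking a setting just because it happened to do well on one particular test split.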

#### **6. Save the Model & Load It!**

In [15]:
# Using pickle to save our model.
import pickle

# Note: "clf" now refers to the last model trained in the loop above (n_estimators=100).
with open("./Heart_Disease_Trained_Model.pkl", "wb") as f: # wb stands for "write binary"
    pickle.dump(clf, f) # Save the random forest classifier

In [17]:
# Now let's load our saved model and score it.

with open("./Heart_Disease_Trained_Model.pkl", "rb") as f: # rb stands for "read binary"
    loaded_model = pickle.load(f)
loaded_model.score(x_test, y_test) # Accuracy of the last trained model, i.e. with 100 estimators.

0.8289473684210527
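
As a self-contained sketch of the same save-and-load round trip, the example below trains a small forest on synthetic data, pickles it to a temporary file (the file name `demo_model.pkl` is hypothetical), and checks that the restored model predicts exactly like the original.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model to save
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Hypothetical file name inside a temporary directory that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")

    with open(model_path, "wb") as f:   # "wb" = write binary
        pickle.dump(demo_clf, f)

    with open(model_path, "rb") as f:   # "rb" = read binary
        restored_clf = pickle.load(f)

# The restored model should give identical predictions
same_predictions = np.array_equal(demo_clf.predict(X_demo), restored_clf.predict(X_demo))
print(same_predictions)  # True
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.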

#### `That was fast! Now, let's break down every step in the next notebook.⏩`