# Introduction to scikit-learn (sklearn)
This notebook provides an introduction to the scikit-learn library, which is a 
powerful tool for machine learning in Python. It covers the basic concepts and 
functionalities of scikit-learn, including data preprocessing, model selection, 
and evaluation metrics. The notebook also includes practical examples and code 
snippets to help you get started with using scikit-learn for your own machine 
learning projects.

What we are going to cover:

0. An end-to-end scikit-learn workflow
1. Getting the data ready
2. Choose the right estimator/algorithm for our problems
3. Fit the model/algorithm and use it to make predictions on our data 
4. Evaluating a model
5. Improve a model
6. Save and load a trained model
7. Putting it all together


## 0. An end-to-end scikit-learn workflow

In [23]:
# 1. Get the data ready
# Import the libraries
%matplotlib inline
import matplotlib.pyplot as plt 
import numpy as np
import pandas as pd

In [2]:
heart_disease = pd.read_csv("data sets/heart-disease.csv", encoding='utf-8-sig')

heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [3]:
# Step 2: Clean column names (strip whitespace and any stray byte-order mark)
heart_disease.columns = heart_disease.columns.str.strip().str.replace('\ufeff', '')


# Step 3: Replace 'undefined' or any bad strings with NaN
heart_disease.replace('undefined', np.nan, inplace=True)

# Step 4: Convert all data to numeric (unparseable values become NaN)
heart_disease = heart_disease.apply(pd.to_numeric, errors='coerce')
# Drop every row that still contains a missing value
heart_disease.dropna(inplace=True)
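
To see what the cleaning steps above do, here is a minimal sketch on a small made-up frame (the `raw` DataFrame and its values are hypothetical, not taken from the heart-disease file). It also counts the affected rows first, because `dropna` removes them silently.

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw CSV (not the real data).
raw = pd.DataFrame({
    "age": ["63", "37", "undefined"],
    "chol": ["233", "oops", "204"],
})

# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric = raw.apply(pd.to_numeric, errors="coerce")

# Count how many rows would be lost before dropping them.
rows_with_nan = int(numeric.isna().any(axis=1).sum())
cleaned = numeric.dropna()

print(f"Dropping {rows_with_nan} of {len(numeric)} rows")
print(cleaned)
```

Because `errors="coerce"` already converts a string such as `'undefined'` to `NaN`, the explicit `replace` step is mainly there for readability.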

In [4]:
print(heart_disease.columns.tolist())


['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']


In [5]:
# Split the data into features and labels for the machine learning model.
# Create X (features matrix)
x = heart_disease.drop("target", axis=1)

# Create y (labels)
y = heart_disease["target"]


In [6]:
# 2. Choose the right model and hyperparameters
# Import the RandomForestClassifier class from sklearn.ensemble and create an
# instance of it. This is the model-selection step of the workflow.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)


# We will keep the default hyperparameters for now.

# `clf.get_params()` returns the hyperparameters currently set on `clf`.
# It is useful for checking the defaults or confirming which values the model is using.
clf.get_params()



{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
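
Since `get_params()` returns a plain dictionary, individual settings can be looked up by name, and `set_params()` changes them on an existing estimator. A minimal sketch (the values 50 and 42 are arbitrary choices for illustration):

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)

# get_params() returns a dict, so single hyperparameters can be read by key.
params = clf.get_params()
print(params["n_estimators"], params["criterion"])

# set_params() updates hyperparameters on the same estimator object.
clf.set_params(n_estimators=50, random_state=42)
print(clf.get_params()["n_estimators"])
```

Changing hyperparameters this way does not retrain anything. A model that was already fitted has to be fitted again for the new values to take effect.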

In [7]:
# 3. Fit the model to the data 
# Split the dataset into training and test sets (80% / 20%) with `train_test_split` from `sklearn.model_selection`.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
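
The split above is random, so each run produces different train and test rows. The sketch below shows one way to make a split repeatable with `random_state` and to keep the class balance with `stratify`. The `x_demo` and `y_demo` data are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 rows, 2 features, balanced binary target.
x_demo = pd.DataFrame({"a": range(10), "b": range(10, 20)})
y_demo = pd.Series([0, 1] * 5, name="target")

# random_state fixes the shuffle; stratify keeps the 0/1 ratio in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(x_tr.shape, x_te.shape)
print(y_te.value_counts().to_dict())
```

With `test_size=0.2`, 8 of the 10 rows go to training and 2 to testing, and stratification puts one row of each class in the test part.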

In [8]:
# `clf.fit(x_train, y_train)` trains the model: it learns patterns in the training
# features `x_train` and their labels `y_train` so that it can predict on new, unseen data.
clf.fit(x_train, y_train);

In [9]:
x_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
197,67.0,1.0,0.0,125.0,254.0,1.0,1.0,163.0,0.0,0.2,1.0,2.0,3.0
376,29.0,1.0,1.0,130.0,204.0,0.0,0.0,202.0,0.0,0.0,2.0,0.0,2.0
153,66.0,0.0,2.0,146.0,278.0,0.0,0.0,152.0,0.0,0.0,1.0,1.0,2.0
282,59.0,1.0,2.0,126.0,218.0,1.0,1.0,134.0,0.0,2.2,1.0,1.0,1.0
225,70.0,1.0,0.0,145.0,174.0,0.0,1.0,125.0,1.0,2.6,0.0,0.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
131,49.0,0.0,1.0,134.0,271.0,0.0,1.0,162.0,0.0,0.0,1.0,0.0,2.0
32,44.0,1.0,1.0,130.0,219.0,0.0,0.0,188.0,0.0,0.0,2.0,0.0,2.0
274,47.0,1.0,0.0,110.0,275.0,0.0,0.0,118.0,1.0,1.0,1.0,1.0,2.0
458,39.0,0.0,2.0,138.0,220.0,0.0,1.0,152.0,0.0,0.0,1.0,0.0,2.0


In [10]:
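# Make a prediction
# Note: this call fails, because predict() expects a 2D array of shape
# (n_samples, n_features), not a 1D array (see the error below).
y_label = clf.predict(np.array([0,2,3,4]))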



ValueError: Expected 2D array, got 1D array instead:
array=[0. 2. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
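
The error message suggests `reshape(1, -1)` for a single sample. Here is a minimal sketch of that fix on a made-up model trained on 4 features (`x_toy`, `y_toy` and `toy_clf` are hypothetical). The model in this notebook was trained on 13 features, so a 4-value array would still be rejected there even after reshaping, because the number of features has to match.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy training set with 4 features (the notebook's data has 13).
rng = np.random.default_rng(0)
x_toy = rng.random((20, 4))
y_toy = np.array([0, 1] * 10)
toy_clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(x_toy, y_toy)

single_sample = np.array([0, 2, 3, 4])   # shape (4,): rejected by predict()
as_row = single_sample.reshape(1, -1)    # shape (1, 4): one sample, four features

print(as_row.shape)
print(toy_clf.predict(as_row))
```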

In [11]:
y_preds = clf.predict(x_test)
y_preds

array([0., 1., 1., 1., 0., 1., 1., 0., 0., 1., 0., 1., 1., 1., 1., 1., 1.,
       0., 1., 1., 1., 1., 0., 1., 0., 1., 0., 1., 1., 1., 1., 0., 1., 1.,
       1., 1., 1., 1., 0., 0., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0.,
       1., 1., 1., 1., 1., 0., 0., 0., 1., 0., 0., 1., 1., 1., 0., 1., 0.,
       0., 0., 0., 1., 0., 1., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0.,
       0., 1., 1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 1., 0., 0.,
       0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0.,
       1., 1., 1.])

In [12]:
# Count, per column, how many cells still contain the string 'undefined'
print(x_test[x_test == "undefined"].count())


age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
dtype: int64


In [13]:
y_test


209    0.0
41     1.0
420    1.0
46     1.0
528    0.0
      ... 
66     1.0
245    0.0
582    0.0
451    1.0
464    1.0
Name: target, Length: 122, dtype: float64

In [14]:
# 4. Evaluate the model
clf.score(x_train, y_train)

1.0

In [15]:
clf.score(x_test, y_test)

0.9508196721311475

In [16]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

         0.0       0.96      0.92      0.94        53
         1.0       0.94      0.97      0.96        69

    accuracy                           0.95       122
   macro avg       0.95      0.95      0.95       122
weighted avg       0.95      0.95      0.95       122



In [17]:
confusion_matrix(y_test, y_preds)

array([[49,  4],
       [ 2, 67]], dtype=int64)
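
In scikit-learn's confusion matrix, rows are the true labels and columns are the predicted labels. As a worked example, the precision, recall and accuracy reported earlier can be recomputed by hand from the four counts in the output above:

```python
import numpy as np

# Confusion matrix copied from the cell output above.
# Rows are true labels (0, 1); columns are predicted labels (0, 1).
cm = np.array([[49, 4],
               [2, 67]])

# For a binary problem, ravel() yields the counts in this order.
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)      # of everything predicted 1, how much was truly 1
recall_1 = tp / (tp + fn)         # of everything truly 1, how much was found
accuracy = (tp + tn) / cm.sum()   # share of all predictions that were correct

print(f"precision(1) = {precision_1:.2f}")
print(f"recall(1)    = {recall_1:.2f}")
print(f"accuracy     = {accuracy:.4f}")
```

These come out as 0.94, 0.97 and 0.9508, matching the classification report row for class `1.0` and the test-set score.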

In [18]:
accuracy_score(y_test, y_preds)

0.9508196721311475
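
`accuracy_score` gives the same number as `clf.score(x_test, y_test)` because, for classifiers, `score` returns the mean accuracy. A small sketch on made-up data (all names and values here are hypothetical) shows the two calls agreeing:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy data: the label depends on the first feature only.
rng = np.random.default_rng(1)
x_toy = rng.random((40, 3))
y_toy = (x_toy[:, 0] > 0.5).astype(int)

# Train on the first 30 rows, evaluate on the last 10.
toy_clf = RandomForestClassifier(n_estimators=10, random_state=1)
toy_clf.fit(x_toy[:30], y_toy[:30])

via_score = toy_clf.score(x_toy[30:], y_toy[30:])
via_metric = accuracy_score(y_toy[30:], toy_clf.predict(x_toy[30:]))

print(via_score, via_metric)
```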

In [19]:
# 5. Improve a model
# Try different values of n_estimators

np.random.seed(42)
for i in range(10,100,10):
    print(f"Trying model with {i} estimators ..")
    clf = RandomForestClassifier(n_estimators = i).fit(x_train, y_train)
    print(f"Model accuracy on test set : {clf.score(x_test, y_test)*100:.2f}%")
    print("")

Trying model with 10 estimators ..
Model accuracy on test set : 94.26%

Trying model with 20 estimators ..
Model accuracy on test set : 95.08%

Trying model with 30 estimators ..
Model accuracy on test set : 95.08%

Trying model with 40 estimators ..
Model accuracy on test set : 95.08%

Trying model with 50 estimators ..
Model accuracy on test set : 95.08%

Trying model with 60 estimators ..
Model accuracy on test set : 95.08%

Trying model with 70 estimators ..
Model accuracy on test set : 95.08%

Trying model with 80 estimators ..
Model accuracy on test set : 95.08%

Trying model with 90 estimators ..
Model accuracy on test set : 95.08%
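
A variation on the loop above, sketched on synthetic data so it runs on its own (the data and the names `scores` and `best_n` are assumptions for illustration). It stores each score in a dictionary rather than only printing it, so the best setting can be picked programmatically.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data so the sketch is self-contained.
rng = np.random.default_rng(42)
x_toy = rng.random((200, 5))
y_toy = (x_toy[:, 0] + x_toy[:, 1] > 1).astype(int)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=42
)

# Map each n_estimators value to its test-set accuracy.
scores = {}
for n in range(10, 100, 10):
    model = RandomForestClassifier(n_estimators=n, random_state=42).fit(x_tr, y_tr)
    scores[n] = model.score(x_te, y_te)

best_n = max(scores, key=scores.get)
print(f"Best n_estimators: {best_n} ({scores[best_n] * 100:.2f}%)")
```

Choosing a hyperparameter by its test-set score tends to give a slightly optimistic estimate, since the test set is then no longer unseen data.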



In [21]:
# 6. Save a model and load it 
import pickle
# Use a context manager so the file is closed after writing
with open("random_forest_model1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [22]:
with open("random_forest_model1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(x_test, y_test)

0.9508196721311475
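
As an alternative to `pickle`, `joblib` (installed alongside scikit-learn) offers `dump` and `load` functions that take a file path directly. This is a minimal sketch on a made-up model. The file name `toy_model.joblib` is hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy model to save and restore.
rng = np.random.default_rng(0)
x_toy = rng.random((30, 3))
y_toy = np.array([0, 1, 1] * 10)
toy_clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(x_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.joblib")  # hypothetical file name
    joblib.dump(toy_clf, model_path)
    restored = joblib.load(model_path)

# The restored model should predict exactly like the original one.
same_predictions = bool((restored.predict(x_toy) == toy_clf.predict(x_toy)).all())
print(same_predictions)
```

As with `pickle`, only load model files from sources you trust, and ideally with the same scikit-learn version that saved them.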

## 1. Getting our data ready to be used with Machine Learning 

Three main things we have to do:

1. Split the data into features and labels (usually `x` & `y`)
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values to numerical values (also called feature encoding)
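
The three steps can be sketched on a small made-up table (the `cars` frame and its columns are hypothetical, not the heart-disease data). This version uses plain pandas for brevity. scikit-learn also ships its own tools for these jobs, such as `SimpleImputer` and `OneHotEncoder`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one text column and one missing value.
cars = pd.DataFrame({
    "colour": ["red", "blue", "red", "green"],
    "doors": [4, np.nan, 4, 5],
    "price": [15000, 12000, 18000, 9000],
})

# 1. Split into features and labels.
x_toy = cars.drop("price", axis=1)
y_toy = cars["price"]

# 2. Fill (impute) missing values, here with the column median.
x_toy["doors"] = x_toy["doors"].fillna(x_toy["doors"].median())

# 3. Encode non-numerical values as numbers (one-hot encoding).
x_encoded = pd.get_dummies(x_toy, columns=["colour"], dtype=int)

print(x_encoded)
```

After these steps every column in `x_encoded` is numeric and free of missing values, which is what scikit-learn estimators such as `RandomForestClassifier` generally expect.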


In [24]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63.0,1.0,3.0,145.0,233.0,1.0,0.0,150.0,0.0,2.3,0.0,0.0,1.0,1.0
1,37.0,1.0,2.0,130.0,250.0,0.0,1.0,187.0,0.0,3.5,0.0,0.0,2.0,1.0
2,41.0,0.0,1.0,130.0,204.0,0.0,0.0,172.0,0.0,1.4,2.0,0.0,2.0,1.0
3,56.0,1.0,1.0,120.0,236.0,0.0,1.0,178.0,0.0,0.8,2.0,0.0,2.0,1.0
4,57.0,0.0,0.0,120.0,354.0,0.0,1.0,163.0,1.0,0.6,2.0,0.0,2.0,1.0
