# Introduction to scikit-learn (sklearn)
This notebook introduces scikit-learn, a widely used Python library for 
machine learning. It covers the library's basic concepts and functionality, 
including data preprocessing, model selection, and evaluation metrics. 
Practical examples and code snippets are included to help you start using 
scikit-learn in your own machine learning projects.

What we are going to cover:

0. An end-to-end scikit-learn workflow
1. Getting the data ready
2. Choose the right estimator/algorithm for our problem
3. Fit the model/algorithm and use it to make predictions on our data 
4. Evaluating a model
5. Improve a model
6. Save and load a trained model
7. Putting it all together


## 0. An end-to-end scikit-learn workflow

In [58]:
# 1. Get the data ready
# Import the libraries
%matplotlib inline
import matplotlib.pyplot as plt 
import numpy as np
import pandas as pd

In [75]:
heart_disease = pd.read_csv("data sets/heart-disease.csv", encoding='utf-8-sig')

heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [76]:
# Strip surrounding whitespace and any leftover byte order mark (BOM) from the column names
heart_disease.columns = heart_disease.columns.str.strip().str.replace('\ufeff', '')


In [77]:
print(heart_disease.columns.tolist())


['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']
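The column clean-up above can be tried in isolation. The following is a minimal sketch on a hypothetical three-column frame (the dirty names are invented for illustration), showing that `str.strip()` removes the surrounding spaces while `str.replace` removes the byte order mark.

```python
import pandas as pd

# Hypothetical frame whose column names carry a BOM and stray spaces.
demo = pd.DataFrame({"\ufeffage": [63, 37], " sex ": [1, 1], "target": [1, 1]})

# str.strip() removes surrounding whitespace; the BOM character is not
# whitespace, so it is removed separately with a literal replace.
demo.columns = demo.columns.str.strip().str.replace("\ufeff", "", regex=False)

print(demo.columns.tolist())
```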


In [78]:
# Prepare the data for a machine learning model.
# Create x (features matrix): every column except the target
x = heart_disease.drop("target", axis=1)

# Create y (labels)
y = heart_disease["target"]
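As a small illustration of the feature/label split, here is a sketch on a hypothetical three-row frame (the values are made up and only reuse a few column names from the dataset). It shows that `drop` returns a new frame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy data; not taken from heart-disease.csv.
toy = pd.DataFrame({"age": [63, 37, 41], "chol": [233, 250, 204], "target": [1, 1, 0]})

features = toy.drop("target", axis=1)  # every column except the label
labels = toy["target"]                 # the label column as a Series

print(features.shape, labels.shape)
print("target" in toy.columns)  # the original frame still has the label column
```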


In [79]:
# Import RandomForestClassifier from sklearn.ensemble and create an instance.
# n_estimators=100 sets the number of trees in the forest.

# 2. Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)


# We will keep the default hyperparameters for now.

# get_params() returns the current hyperparameters of the model instance `clf`.
# It is useful for checking the default settings or the specific values in use.
clf.get_params()



{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
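Hyperparameters listed by `get_params()` can also be changed after the estimator is created. A minimal sketch, assuming we want a larger forest with limited depth (the values 200 and 5 are arbitrary choices for illustration, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier

demo_clf = RandomForestClassifier()

# set_params updates hyperparameters in place and returns the estimator.
demo_clf.set_params(n_estimators=200, max_depth=5)

params = demo_clf.get_params()
print(params["n_estimators"], params["max_depth"])
```

Changing parameters this way does not train anything; the model still has to be fitted afterwards.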

In [80]:
# 3. Fit the model to the data 
# Split the dataset into training and test sets with `train_test_split` from `sklearn.model_selection`.
# test_size=0.2 holds out 20% of the rows for testing.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
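Because no `random_state` is passed above, the split differs on every run. The sketch below, on hypothetical synthetic arrays, shows how `random_state` could make the split repeatable and how `stratify` keeps the class balance in both parts. Both arguments are optional additions, not part of the original cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, two balanced classes.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # 16 training rows, 4 test rows
print(np.bincount(y_te))       # class counts in the test set
```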

In [81]:
# `clf.fit(x_train, y_train)` fits the model `clf` to the training features `x_train` and labels `y_train`.
# Training lets the model learn patterns in the data so it can make predictions on new, unseen data.
clf.fit(x_train, y_train);

ValueError: could not convert string to float: '\ufeffage'
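The message says a cell in the feature matrix holds the string `'\ufeffage'` rather than a number. One possible cause, which the notebook does not confirm, is a repeated header line inside the CSV that was read as a data row. Under that assumption, a sketch of one way to find and drop such rows on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame in which a repeated header line was read as a data row.
dirty = pd.DataFrame({
    "age": ["63", "\ufeffage", "37"],
    "chol": ["233", "chol", "250"],
    "target": ["1", "target", "1"],
})

# Object dtype columns are a sign that non-numeric strings are present.
print(dirty.dtypes)

# Convert every column to numbers; values that cannot be parsed become NaN.
numeric = dirty.apply(pd.to_numeric, errors="coerce")

# Drop the rows that failed to parse and restore integer dtypes.
clean = numeric.dropna().astype(int).reset_index(drop=True)
print(clean)
```

Checking `heart_disease.dtypes` before fitting would show whether the real data has the same problem.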

In [82]:
x_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
597,67,1,2,152,212,0,0,150,0,0.8,1,0,3
258,62,0,0,150,244,0,1,154,1,1.4,1,0,2
162,41,1,1,120,157,0,1,182,0,0,2,0,2
213,61,0,0,145,307,0,0,146,1,1,1,0,3
544,70,1,2,160,269,0,1,112,1,2.9,1,1,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
529,70,1,0,145,174,0,1,125,1,2.6,0,0,3
326,42,1,0,140,226,0,1,178,0,0,2,0,2
209,59,1,0,140,177,0,1,162,1,0,2,1,3
114,55,1,1,130,262,0,1,155,0,0,2,0,2


In [67]:
# Make a prediction
# Note: predict needs a fitted model and a 2D array with one column per training feature
y_label = clf.predict(np.array([0,2,3,4]))

AttributeError: 'RandomForestClassifier' object has no attribute 'estimators_'
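The error above appears because `clf` was never successfully fitted. Even with a fitted model, `predict` expects a 2D array of shape `(n_samples, n_features)`. The following sketch uses hypothetical synthetic data with four features (not the heart disease data) to show a single sample being reshaped before prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: 40 samples with 4 features, label derived from the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_clf = RandomForestClassifier(n_estimators=10, random_state=0)
demo_clf.fit(X_demo, y_demo)

# A single sample is 1D; reshape it to one row with four columns.
single_sample = np.array([0, 2, 3, 4])
prediction = demo_clf.predict(single_sample.reshape(1, -1))

print(prediction.shape)  # one prediction for one sample
```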

In [69]:
y_preds = clf.predict(x_test)
y_preds

AttributeError: 'RandomForestClassifier' object has no attribute 'estimators_'
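Both prediction cells fail for the same reason: the earlier `fit` call raised an error, so the model has no trained trees. A minimal sketch of one way to test for this before predicting, using a hypothetical helper `is_fitted` built on scikit-learn's `check_is_fitted`:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


def is_fitted(model):
    """Return True if the estimator has been fitted, False otherwise."""
    try:
        check_is_fitted(model)
        return True
    except NotFittedError:
        return False


demo_clf = RandomForestClassifier(n_estimators=5, random_state=0)
before = is_fitted(demo_clf)

# Fit on a tiny hypothetical dataset, then check again.
demo_clf.fit([[0, 0], [1, 1], [0, 1], [1, 0]], [0, 1, 0, 1])
after = is_fitted(demo_clf)

print(before, after)
```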