## A Beginner's Guide to Machine Learning Modeling: Tutorial with Python and Scikit-Learn (**Part 1**)
* As I begin my machine learning and data science journey, I thought it would be a good idea to share the parts of it that I feel are important for any aspiring machine learning engineer or data scientist.
* In this notebook I will go through a range of common and useful features of the Scikit-Learn library.

* One thing I quickly learned is to read the documentation and not be afraid to search online when you don't understand something. If you get stuck, I suggest the Scikit-Learn user guide: https://scikit-learn.org/stable/user_guide.html

## Background Information
The field of data science and machine learning is extensive, but the main goal is always the same: find patterns within data and then use those patterns to make predictions. Most problems fall into one of a few categories:

* If you're trying to create a machine learning model to predict whether something is one thing or another (e.g. whether an email is spam or not), you're working on a `classification problem`.
* If you're trying to create a machine learning model to predict a number (e.g. the price of a house given its characteristics), you're working on a `regression problem`.
* If you're trying to get a machine learning algorithm to group together similar samples (when you don't necessarily know which should go together), you're working on a `clustering problem`.

Once you know what kind of problem you are working on, there are similar steps you can take for each of them, such as:

1. Splitting the data into different sets: one for your machine learning algorithm to learn on (the training set) and another to test on (the test set).
2. Choosing a machine learning model and then evaluating whether or not it has learned anything.
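To make the three problem types a little more concrete, here is a minimal sketch that fits one estimator of each kind. Everything in it is an assumption for illustration: the data is synthetic (generated with Scikit-Learn's `make_*` helpers), and the estimators and variable names are just examples, not the only reasonable choices.

```python
# Synthetic data only; nothing here comes from a real dataset.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Classification: predict a category (here 0 or 1)
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=42)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_clf, y_clf, test_size=0.25, random_state=42)
clf_demo = RandomForestClassifier(n_estimators=50, random_state=42)
clf_demo.fit(Xc_train, yc_train)
clf_accuracy = clf_demo.score(Xc_test, yc_test)  # accuracy for classifiers

# Regression: predict a number
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.25, random_state=42)
reg_demo = RandomForestRegressor(n_estimators=50, random_state=42)
reg_demo.fit(Xr_train, yr_train)
reg_r2 = reg_demo.score(Xr_test, yr_test)  # R^2 for regressors

# Clustering: group similar samples without using any labels
X_blobs, _ = make_blobs(n_samples=200, centers=3, random_state=42)
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_blobs)

print(f"Classification accuracy: {clf_accuracy:.2f}")
print(f"Regression R^2: {reg_r2:.2f}")
print(f"Cluster ids found: {sorted(set(cluster_labels.tolist()))}")
```

Notice that the classification and regression examples follow the same split, fit, score pattern, while clustering never sees any labels.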





In [1]:
# Standard Imports 
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn

## End-to-End Scikit-Learn Workflow
Let's quickly check out what an end-to-end Scikit-Learn workflow might look like.
* Once we've seen an end-to-end workflow, we'll dive into each step a little deeper.

We'll get hands on with the following steps:
1. Get the data ready (split into features and labels, prepare training and test sets)
2. Choose a model for our problem
3. Fit the model to the data and use it to make predictions
4. Evaluate the model
5. Experiment to improve
6. Save a model for someone else to use 


## Random Forest Classifier Workflow for Classifying Heart Disease
### 1. Get the data ready

As an example dataset, we'll import `heart-disease.csv`.
This file contains anonymized patient medical records and whether or not each patient has heart disease (this is a `classification problem` since we're trying to predict whether something is one thing or another).

In [2]:
import pandas as pd
heart_disease = pd.read_csv('heart-disease.csv')
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


Here, each row is a different patient and all columns except `target` are different patient characteristics.

* The `target` column indicates whether the patient has heart disease `(target=1)` or not `(target=0)`. This is our "label" column, the variable we are trying to predict.

* The rest of the columns (often called features) are what we'll use to predict the `target` value.

**Note**: It's a common convention to save features to a variable `X` and labels to a variable `y`. In practice, we'd like to use `X` (features) to build a predictive algorithm to predict the `y` labels.

In [3]:
# Create X (all the feature columns)
X = heart_disease.drop('target', axis=1)

# Create y (the target column)
y = heart_disease['target']

# Check the head of the feature DataFrame
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [4]:
# Check the head and the value counts of the labels
y.head(), y.value_counts()

(0    1
 1    1
 2    1
 3    1
 4    1
 Name: target, dtype: int64,
 target
 1    165
 0    138
 Name: count, dtype: int64)

One of the most important practices in machine learning is to split datasets into training and test sets.
This means a model will **train on the training set** to learn patterns, and then those patterns can be **evaluated on the test set**.

* Scikit-Learn provides the `sklearn.model_selection.train_test_split` function to split datasets into training and test sets.
    **Note**: It is common practice to use an 80/20, 75/25 or 70/30 split for training/test data.

In [5]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.25)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((227, 13), (76, 13), (227,), (76,))
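One thing worth knowing about `train_test_split` is that it shuffles the data randomly, so the split changes on every run unless you pass `random_state`. The sketch below also uses `stratify` to keep the class balance similar in both sets. The stand-in data is an assumption: random numbers with the same shape (303 rows, 13 features) and the same label counts (138 zeros, 165 ones) as the dataset above, not the real patient records.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data (random features, not the real heart disease records)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 13))
y_demo = np.array([0] * 138 + [1] * 165)

# random_state makes the shuffle repeatable; stratify keeps the 0/1 ratio similar
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

# Running the same split again with the same random_state gives identical sets
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

print(X_tr.shape, X_te.shape)
print(f"Share of positive labels - train: {y_tr.mean():.3f}, test: {y_te.mean():.3f}")
print(f"Same split both times: {np.array_equal(X_te, X_te2)}")
```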

### 2. Choose the model and hyperparameters
Choosing a model often depends on the type of problem you're working on.
For instance, Scikit-Learn recommends different models depending on whether you're working on a classification or regression problem.

You can see a map breaking down the different kinds of model options and recommendations in the Scikit-Learn documentation:
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html 
https://scikit-learn.org/1.3/tutorial/machine_learning_map/index.html

**Note**: Scikit-Learn refers to models as "estimators"; however, they are often also referred to as `model` or `clf` (short for classifier).

* A model's 'hyperparameters' are settings you can change to adjust it to your problem, much like knobs on an oven you can tune to your favorite dish.

In [6]:
# Since we're working on a classification problem, we'll start with a RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

We can see the current hyperparameters of a model with the `get_params()` method.

In [7]:
# View the current hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
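Hyperparameters can be set when you create the estimator, or changed afterwards with `set_params()`. Here is a small sketch; the specific values (200 trees, a depth of 5, and so on) are arbitrary choices for illustration, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

# Set hyperparameters when creating the estimator...
clf_demo = RandomForestClassifier(n_estimators=200, max_depth=5)

# ...or change them later with set_params()
clf_demo.set_params(n_estimators=150, random_state=42)

# Look at just the hyperparameters we touched
changed = {name: clf_demo.get_params()[name]
           for name in ("n_estimators", "max_depth", "random_state")}
print(changed)
```

Changing a hyperparameter does not retrain anything; the new settings only take effect the next time you call `fit()`.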

### 3. Fit the model to the data and use it to make a prediction

Fitting a model involves passing it data and asking it to figure out the patterns.

If there are labels (supervised learning), the model tries to work out the relationship between the data and the labels.

If there are no labels (unsupervised learning), the model tries to find patterns and group similar samples together.

Most Scikit-Learn models have the `fit(X, y)` method built in, where the `X` parameter is the features and the `y` parameter is the labels.

In our case, we start by fitting a model on the training split `(X_train, y_train)`.

In [8]:
clf.fit(X_train, y_train)

### Use the model to make predictions

The whole point of training a machine learning model is to use it to make some kind of prediction in the future.

Once your model instance is trained, you can use the `predict()` method to predict a target value given a set of features.

In other words, you use the model you just fitted on new, unseen and unlabelled data to predict the label.

**Note**: Data you predict on should be in the same shape and format as data you trained on.

* Our goal in many machine learning problems is to use patterns learned from the training data to make predictions on the test data (or future unseen data)

In [9]:
# Use the model to make a prediction on the test data
clf.predict(X_test)

array([1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 1])

In [10]:
# assign the 'clf.predict(X_test)' to y_preds variable (for further evaluation)
y_preds = clf.predict(X_test)
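Besides `predict()`, classifiers such as `RandomForestClassifier` also offer `predict_proba()`, which returns the estimated probability of each class instead of a single label. The sketch below runs on synthetic data (an assumption, generated with `make_classification` to have 13 features like our dataset) and shows how the two outputs relate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 13 features (not the real heart disease data)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

clf_demo = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# predict() returns one label per sample
labels = clf_demo.predict(X_te[:5])

# predict_proba() returns one probability per class, per sample (each row sums to 1)
probs = clf_demo.predict_proba(X_te[:5])

# The predicted label is the class with the highest probability
labels_from_probs = clf_demo.classes_[probs.argmax(axis=1)]

print(labels)
print(probs)
```

Probabilities are handy when you care about how confident the model is, not just which label it picked.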

### 4. Evaluate the model

Now that we've made some predictions, we can start to use some Scikit-Learn methods to figure out how good our model is.

Each model or estimator has a built-in `score()` method.

This method measures how well the model was able to learn the patterns between the features and labels: it makes predictions on the features you pass in and compares them to the labels you pass in.

The `score()` method for each model uses a standard metric to measure your model's results.

In the case of a classifier (our model), the default metric is accuracy (the fraction of correct predictions out of total predictions), one of the most common evaluation metrics.



In [11]:
# Evaluate the model on the training set
train_acc = clf.score(X_train, y_train)
print(f"The model's accuracy on the training dataset is: {train_acc*100:.2f}%")

The model's accuracy on the training dataset is: 100.00%


In [12]:
# Evaluate the model on the test set
test_acc = clf.score(X_test, y_test)
print(f"The model's accuracy on the test dataset is: {test_acc*100:.2f}%")

The model's accuracy on the test dataset is: 81.58%


Our model's accuracy is lower on the test dataset (about 82%) than on the training dataset (100%).

This is quite often the case because the model has never seen the test examples before, whereas it learned directly from the training examples.

There are also a number of other evaluation methods we can use for our **classification models**.

All of the following classification metrics come from the `sklearn.metrics` module:

* `classification_report(y_true, y_pred)` - Build a text report showing various classification metrics such as [precision, recall](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html)
* `confusion_matrix(y_true, y_pred)` - Create a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) to compare predictions to truth labels.
* `accuracy_score(y_true, y_pred)` - Find the accuracy score (the default metric) for a classifier.

All metrics have the following in common: they compare a model's predictions `(y_pred)` to truth labels `(y_true)`.

In [13]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Create a classification report
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.90      0.72      0.80        39
           1       0.76      0.92      0.83        37

    accuracy                           0.82        76
   macro avg       0.83      0.82      0.81        76
weighted avg       0.83      0.82      0.81        76



In [14]:
# Create a confusion matrix
conf_mat = confusion_matrix(y_test, y_preds)
conf_mat

array([[28, 11],
       [ 3, 34]])

In [15]:
# Compute the accuracy score (same as the score() method)
accuracy_score(y_test, y_preds)

0.8157894736842105
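It can help to see how these numbers connect. In Scikit-Learn's confusion matrix, rows are the true labels and columns are the predicted labels, so the accuracy and the class `1` row of the classification report can be recomputed by hand. The sketch below copies the confusion matrix values from the output above; the variable names are my own.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true labels, columns = predictions)
conf_mat_demo = np.array([[28, 11],
                          [3, 34]])

# For a binary problem, ravel() gives: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = conf_mat_demo.ravel()

accuracy = (tp + tn) / conf_mat_demo.sum()   # correct predictions / all predictions
precision_1 = tp / (tp + fp)                 # of the predicted 1s, how many were really 1
recall_1 = tp / (tp + fn)                    # of the real 1s, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"Accuracy: {accuracy:.4f}")
print(f"Class 1 precision: {precision_1:.2f}, recall: {recall_1:.2f}, f1: {f1_1:.2f}")
```

The results line up with `accuracy_score` (0.8158) and with the class `1` row of the classification report (0.76, 0.92, 0.83).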

### 5. Experiments to Improve

The first model you build is often referred to as a baseline.
Once you've got a baseline model, like we have here, it's important to remember, this is often not the final model you'll use.

The next step in the workflow is to try and improve upon your baseline model.

Experiments can come in many different forms.

But let's break it into two:

1. From a model perspective
2. From a data perspective

Experimenting from a **model** perspective may involve things such as using a more complex model or tuning your model's hyperparameters.

Experimenting from a **data** perspective may involve collecting more data or better quality data so your existing model has more of a chance to learn the patterns within.

If you're already working on an existing dataset, it's often easier to try a series of model perspective experiments first and then turn to data perspective experiments if you aren't getting the results you're looking for.

One thing you should be aware of is that if you're tuning a model's hyperparameters in a series of experiments, your results should always be cross-validated.

[Cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html) is a way of making sure the results you're getting are consistent across your training and test datasets (because it uses multiple versions of training and test sets) rather than just luck because of how the original training and test sets were created.

* Try different hyperparameters.
* All different parameters should be cross-validated.
   * **Note:** Beware of cross-validation for time series problems (as for time series, you don't want to mix samples from the future with samples from the past).
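For the time series case mentioned in the note above, Scikit-Learn provides `TimeSeriesSplit`, which always trains on earlier samples and tests on later ones. This is a minimal sketch on made-up data (12 ordered samples, an assumption for illustration), just to show which indices end up in each split.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 samples in time order (made-up data; index 0 is the oldest sample)
X_time = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X_time))

for train_idx, test_idx in splits:
    # Every training index comes before every test index
    print(f"train: {train_idx}, test: {test_idx}")
```

The splitter can be passed as the `cv` argument wherever an integer such as `cv=5` is accepted, for example `cross_val_score(model, X, y, cv=tscv)`.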
 
Different models will have different hyperparameters you can tune.

In the case of our model, the `RandomForestClassifier()`, we'll start by trying different values for `n_estimators` (the number of trees in the random forest).

By default, `n_estimators=100`, so we can try values from 100 to 290 in steps of 10 and see what happens.

In [16]:
# Try different numbers of estimators (trees)... (no cross-validation)
np.random.seed(42)
for i in range(100,300, 10):
    print(f'Trying model with {i} estimators...')
    model = RandomForestClassifier(n_estimators=i).fit(X_train,y_train)
    print(f"Model accuracy on test set: {model.score(X_test,y_test)*100:.2f}%")
    print("")

Trying model with 100 estimators...
Model accuracy on test set: 81.58%

Trying model with 110 estimators...
Model accuracy on test set: 81.58%

Trying model with 120 estimators...
Model accuracy on test set: 77.63%

Trying model with 130 estimators...
Model accuracy on test set: 80.26%

Trying model with 140 estimators...
Model accuracy on test set: 80.26%

Trying model with 150 estimators...
Model accuracy on test set: 82.89%

Trying model with 160 estimators...
Model accuracy on test set: 80.26%

Trying model with 170 estimators...
Model accuracy on test set: 82.89%

Trying model with 180 estimators...
Model accuracy on test set: 81.58%

Trying model with 190 estimators...
Model accuracy on test set: 82.89%

Trying model with 200 estimators...
Model accuracy on test set: 81.58%

Trying model with 210 estimators...
Model accuracy on test set: 81.58%

Trying model with 220 estimators...
Model accuracy on test set: 81.58%

Trying model with 230 estimators...
Model accuracy on test set: 

The metrics above were measured on a single train and test split.

Let's use `sklearn.model_selection.cross_val_score` to measure the result across 5 different train and test sets.

We can achieve this by calling `cross_val_score(model, X, y, cv=5)`.

Here, `model` is the estimator to evaluate, `X` is the **full** feature set, `y` is the **full** label set and `cv` is the number of train and test splits `cross_val_score` will automatically create from the data (in our case, `5` different splits; this is known as 5-fold cross-validation).
    

In [18]:
from sklearn.model_selection import cross_val_score

# With cross-validation
np.random.seed(42)
for i in range(100, 300, 10):
    print(f"Trying model with {i} estimators..")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)

    # Measure the model score on a single train/test split
    model_score = model.score(X_test,y_test)
    print(f"Model accuracy on single test set split:{model_score*100:.2f}%")

    # Measure the mean cross-validation score across 5 different train and test sets.
    cross_val_mean = np.mean(cross_val_score(model, X,y, cv=5))
    print(f"5-fold cross-validation score: {cross_val_mean * 100:.2f}%")

    print("")

Trying model with 100 estimators..
Model accuracy on single test set split:81.58%
5-fold cross-validation score: 82.15%

Trying model with 110 estimators..
Model accuracy on single test set split:82.89%
5-fold cross-validation score: 81.17%

Trying model with 120 estimators..
Model accuracy on single test set split:81.58%
5-fold cross-validation score: 83.16%

Trying model with 130 estimators..
Model accuracy on single test set split:81.58%
5-fold cross-validation score: 83.14%

Trying model with 140 estimators..
Model accuracy on single test set split:81.58%
5-fold cross-validation score: 82.48%

Trying model with 150 estimators..
Model accuracy on single test set split:82.89%
5-fold cross-validation score: 80.17%

Trying model with 160 estimators..
Model accuracy on single test set split:82.89%
5-fold cross-validation score: 80.83%

Trying model with 170 estimators..
Model accuracy on single test set split:80.26%
5-fold cross-validation score: 81.83%

Trying model with 180 estimators

Which model had the best cross-validation score?

This is usually a better indicator of a quality model than a single split accuracy score.
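To see roughly what `cross_val_score(..., cv=5)` does behind the scenes, here is a sketch that runs the 5 folds by hand. It relies on two things: for a classifier, an integer `cv` uses stratified K-fold splitting, and a fresh copy of the model is trained on each fold. The data is synthetic and `random_state=42` is only there so both approaches produce the same numbers.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (not the real heart disease data)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
base_model = RandomForestClassifier(n_estimators=50, random_state=42)

# Manual 5-fold cross-validation: train on 4 folds, score on the remaining fold
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(base_model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# The one-liner that does the same thing
auto_scores = cross_val_score(base_model, X_demo, y_demo, cv=5)

print(f"Manual fold scores: {np.round(manual_scores, 3)}")
print(f"cross_val_score:    {np.round(auto_scores, 3)}")
print(f"Mean accuracy: {np.mean(auto_scores) * 100:.2f}%")
```

Every sample gets used for testing exactly once, which is why the mean score depends less on one lucky or unlucky split.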

Rather than set up and track the results of these experiments manually, we can get Scikit-Learn to do the exploration for us.

Scikit-Learn's `sklearn.model_selection.GridSearchCV` is a way to search over a set of different hyperparameter values and automatically track which performs the best.



In [21]:
# Let's try tuning hyperparameters with GridSearchCV
np.random.seed(42)
from sklearn.model_selection import GridSearchCV

# Define the parameters to search over in dictionary form
# (these can be any of your target model's hyperparameters)
param_grid = {'n_estimators': list(range(100, 300, 10))}

# Set up the grid search 
grid = GridSearchCV(estimator=RandomForestClassifier(),
                    param_grid=param_grid,
                    cv=5,
                    verbose=1)

# Fit the grid search to the data 
grid.fit(X,y)

# Find the best parameters
print(f"The best parameter values are: {grid.best_params_}")
print(f"With a score of: {grid.best_score_*100:.2f}%")

Fitting 5 folds for each of 20 candidates, totalling 100 fits
The best parameter values are: {'n_estimators': 120}
With a score of: 82.82%


We can extract the best model/estimator with the `best_estimator_` attribute.


In [22]:
# Set the model to be the best estimator
clf = grid.best_estimator_
clf

And now we've got the best cross-validated model, we can fit and score it on our original single train/test split of data.


In [24]:
# Fit the best model
clf = clf.fit(X_train, y_train)

# Find the best model scores on our single test split
# (note: this may differ from the cross-validated score since it's only on one split)
print(f"Best model score on single split of the data:{clf.score(X_test,y_test)*100:.2f}%")

Best model score on single split of the data:82.89%
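`GridSearchCV` can also search over several hyperparameters at once, and it keeps the score of every combination in its `cv_results_` attribute. Below is a small sketch on synthetic data; the grid values (`n_estimators` of 20 or 50, `max_depth` of 3 or `None`) and `cv=3` are assumptions chosen only to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the real heart disease data)
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=42)

# 2 values x 2 values = 4 combinations to try
param_grid_demo = {"n_estimators": [20, 50],
                   "max_depth": [3, None]}

grid_demo = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                         param_grid=param_grid_demo,
                         cv=3)
grid_demo.fit(X_demo, y_demo)

# cv_results_ is a dictionary; a DataFrame makes it easier to read
results = (pd.DataFrame(grid_demo.cv_results_)
           [["params", "mean_test_score", "rank_test_score"]]
           .sort_values("rank_test_score"))
print(results)
print(f"Best parameters: {grid_demo.best_params_}")
```

Keep in mind the number of fits grows quickly: every extra value in the grid multiplies the number of combinations, and each combination is trained once per fold.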


### 6. Save a model for someone else to use 

When you've done a few experiments and you're happy with how your model is doing, you'll likely want someone else to be able to use it.

This may come in the form of a teammate or colleague trying to replicate and validate your results or through a customer using your model as part of a service or application you offer.

Saving a model also allows you to reuse it later without having to retrain it, which is especially helpful when your training times start to increase.

You can [save a Scikit-Learn model](https://scikit-learn.org/stable/model_persistence.html) using Python's built-in `pickle` module.

In [25]:
import pickle

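# Save an existing model to file
# Note: this saves `model` (last assigned in the experiment loop above), not the tuned `clf`
with open('random_forest_model_2.pkl', 'wb') as f:
    pickle.dump(model, f)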

In [27]:
# Load a saved pickle model and evaluate it
with open("random_forest_model_2.pkl", 'rb') as f:
    pickle_model_load = pickle.load(f)
print(f"Loaded pickle model prediction score: {pickle_model_load.score(X_test,y_test)*100:.2f}%")


Loaded pickle model prediction score: 80.26%


For larger models, it may be more efficient to use [Joblib](https://joblib.readthedocs.io/en/stable/)

In [28]:
from joblib import dump, load

# Save a model using Joblib
dump(model, "random_forest_model_2.joblib")

['random_forest_model_2.joblib']

In [29]:
# Load a saved joblib model and evaluate it 
joblib_model = load("random_forest_model_2.joblib")
print(f"Loaded joblib model prediction score: {joblib_model.score(X_test, y_test)*100:.2f}%")

Loaded joblib model prediction score: 80.26%
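A quick way to convince yourself that saving and loading worked is to check that the loaded model makes exactly the same predictions as the original. Here is a sketch of that round trip using a temporary folder so nothing is left on disk; the synthetic data and the file name `demo_model.joblib` are assumptions for illustration. One caution that applies to both `pickle` and Joblib: only load model files from sources you trust, since loading a pickle can run arbitrary code.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the real heart disease data)
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=42)
clf_demo = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Save to a temporary folder, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    dump(clf_demo, model_path)
    loaded_clf = load(model_path)

# The loaded model should predict exactly what the original predicts
same_predictions = np.array_equal(clf_demo.predict(X_demo), loaded_clf.predict(X_demo))
print(f"Loaded model matches the original: {same_predictions}")
```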


This has been a quick but broad overview of the capabilities of Scikit-Learn. In **Part 2** we will break down the steps covered in this notebook more thoroughly. For now, you have a working knowledge of the basics of Scikit-Learn and machine learning. You should be proud of yourself. Thank you for reading!

Special thanks to Daniel Bourke and Andrei Neagoie for making the start of this journey challenging and rewarding all at the same time.