# INTRODUCTION TO MACHINE LEARNING USING SCIKIT-LEARN
Machine learning is a subfield of artificial intelligence devoted to building methods that learn from data, loosely imitating the way humans learn. These methods use algorithms and data to improve performance on a set of tasks, and most fall into one of three common types of learning:

* Supervised learning: a type of machine learning that learns the relationship between inputs and labeled outputs.
* Unsupervised learning: a type of machine learning that learns the underlying structure of an unlabeled dataset.
* Reinforcement learning: a type of machine learning in which a software agent learns which actions to take in an environment in order to maximize its reward.

In this hands-on sklearn tutorial, we will cover various aspects of the machine learning lifecycle, such as data processing, model training, and model evaluation.

# STEP 1: Load data
The first aspect of sklearn we will explore is the data. Scikit-learn comes with some standard machine learning datasets, which means you’re not required to download them from an external website or database.

Examples of the toy datasets available in sklearn include the iris dataset for classification and the diabetes dataset for regression. For our example, we will be using the wine dataset.

Let's load it into memory:

In [2]:
from sklearn.datasets import load_wine

wine_data = load_wine()

Executing the code above returns a dictionary-like object (a scikit-learn Bunch) containing the data along with metadata that describes it.

The data we need is stored under the data key of this object. Because the object is dictionary-like rather than a plain dictionary, we can also access the data as an attribute of the wine_data instance:

In [4]:
wine_data.data

array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
        1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
        1.050e+03],
       [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
        1.185e+03],
       ...,
       [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
        8.350e+02],
       [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
        8.400e+02],
       [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
        5.600e+02]])

This returns an N x M array, where N is the number of samples and M is the number of features (here, 178 x 13).
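As a quick check of the description above, the following sketch inspects the object returned by load_wine(). It assumes a reasonably recent scikit-learn version; the names n_samples and n_features are illustrative and do not appear in the original tutorial.

```python
from sklearn.datasets import load_wine

wine_data = load_wine()

# The Bunch object exposes its contents both as dictionary keys and as attributes
print(sorted(wine_data.keys()))
print(wine_data["data"] is wine_data.data)

# N samples (rows) by M features (columns)
n_samples, n_features = wine_data.data.shape
print(n_samples, n_features)

# One label per sample, plus human-readable names for features and classes
print(wine_data.target.shape)
print(wine_data.feature_names[:3])
print(list(wine_data.target_names))
```

Key access and attribute access reach the same underlying array, so either style can be used.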

Let's use this knowledge to load our data into a pandas DataFrame, which is much easier to manipulate and analyze.

In [6]:
import pandas as pd
from sklearn.datasets import load_wine

wine_data = load_wine()

# Convert data to pandas dataframe
wine_df = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)

# Add the target label
wine_df["target"] = wine_data.target

# Take a preview
wine_df.head()

|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |
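As an alternative to assembling the DataFrame by hand, scikit-learn's dataset loaders accept an as_frame=True argument. The sketch below assumes scikit-learn 0.23 or later, where this argument is available; wine_bunch and wine_frame are illustrative names.

```python
from sklearn.datasets import load_wine

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays
wine_bunch = load_wine(as_frame=True)

# .frame holds the features and the target column together in one DataFrame
wine_frame = wine_bunch.frame
print(wine_frame.shape)
print(wine_frame.columns[-1])
```

The result should match the DataFrame built manually above: thirteen feature columns followed by the target column.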


# STEP 2: Data exploration and visualization
Pandas DataFrames are two-dimensional labeled data structures consisting of columns, which may contain different data types. The easiest way to conceptualize a DataFrame is to think of it as three components merged together: 1) data, 2) an index, and 3) columns.
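To make the three components concrete, here is a small sketch using a hand-made DataFrame. The values and labels are invented for illustration and are not taken from the wine dataset.

```python
import pandas as pd

# A tiny illustrative DataFrame (made-up values, not taken from the wine dataset)
df = pd.DataFrame(
    {"alcohol": [14.2, 13.2], "label": ["a", "b"]},
    index=["first", "second"],
)

print(df.to_numpy())      # 1) the data
print(list(df.index))     # 2) the index (row labels)
print(list(df.columns))   # 3) the columns (column labels)
print(df.dtypes)          # each column keeps its own data type
```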

Data exploration is not the main focus of this article, but it is an extremely important step in any data project; you can learn more about it in our Python Exploratory Data Analysis tutorial. We will do a brief exploration to get a better idea of what our dataset contains, which in turn will tell us how to process the data.

The first thing we are going to do is call the info() method on our pandas DataFrame; this will print a concise summary of the wine data contained within the DataFrame.




In [8]:
wine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       178 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280/od315_of_diluted_wines  178 non-null    float64
 12  proline                       178 non-null    float64
 13  target                        178 non-null    int64  
dtypes: float64(13), int64(1)
memory usage: 19.6 KB

After executing this cell, you will learn:

* The data contains 178 data samples
* There are 14 total columns including the target column (what we want to predict)
* There are 0 columns with missing values; you can infer this from the “Non-Null Count” column.
* All features are of data type float64, whereas the target label is an int64.
* The data uses 19.6 KB of memory.
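The observations in the list above can also be checked programmatically rather than read off the printed summary. The following is a minimal sketch; n_missing and class_counts are illustrative names introduced here.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine_data = load_wine()
wine_df = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)
wine_df["target"] = wine_data.target

# Total number of missing values across the whole DataFrame
n_missing = int(wine_df.isna().sum().sum())
print(f"Missing values: {n_missing}")

# Data types: count how many columns have each dtype
print(wine_df.dtypes.value_counts())

# Number of samples per class, useful for spotting class imbalance
class_counts = wine_df["target"].value_counts().sort_index()
print(class_counts)
```

Looking at the class counts is a useful extra step, since a strongly imbalanced target would affect how the evaluation metrics should be read.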

We can also call the describe() method on our DataFrame to get descriptive statistics about each feature in the dataset.

In [10]:
wine_df.describe()

|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 |
| mean | 13.000618 | 2.336348 | 2.366517 | 19.494944 | 99.741573 | 2.295112 | 2.02927 | 0.361854 | 1.590899 | 5.05809 | 0.957449 | 2.611685 | 746.893258 | 0.938202 |
| std | 0.811827 | 1.117146 | 0.274344 | 3.339564 | 14.282484 | 0.625851 | 0.998859 | 0.124453 | 0.572359 | 2.318286 | 0.228572 | 0.70999 | 314.907474 | 0.775035 |
| min | 11.03 | 0.74 | 1.36 | 10.6 | 70.0 | 0.98 | 0.34 | 0.13 | 0.41 | 1.28 | 0.48 | 1.27 | 278.0 | 0.0 |
| 25% | 12.3625 | 1.6025 | 2.21 | 17.2 | 88.0 | 1.7425 | 1.205 | 0.27 | 1.25 | 3.22 | 0.7825 | 1.9375 | 500.5 | 0.0 |
| 50% | 13.05 | 1.865 | 2.36 | 19.5 | 98.0 | 2.355 | 2.135 | 0.34 | 1.555 | 4.69 | 0.965 | 2.78 | 673.5 | 1.0 |
| 75% | 13.6775 | 3.0825 | 2.5575 | 21.5 | 107.0 | 2.8 | 2.875 | 0.4375 | 1.95 | 6.2 | 1.12 | 3.17 | 985.0 | 2.0 |
| max | 14.83 | 5.8 | 3.23 | 30.0 | 162.0 | 3.88 | 5.08 | 0.66 | 3.58 | 13.0 | 1.71 | 4.0 | 1680.0 | 2.0 |


The output shows that our features are on very different scales: proline, for example, ranges from 278 to 1680, while hue ranges from 0.48 to 1.71. This may cause problems for algorithms trained with gradient-based optimization, such as logistic regression, and for distance-based algorithms, such as support vector machines, because both are sensitive to the range of the input values.
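The difference in scale can be quantified directly. The sketch below compares the range (maximum minus minimum) of each feature; feature_ranges and scale_ratio are illustrative names introduced here.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine_data = load_wine()
X = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)

# Range (max - min) of each feature, sorted from widest to narrowest
feature_ranges = (X.max() - X.min()).sort_values(ascending=False)
print(feature_ranges)

# Ratio between the widest and narrowest feature ranges
scale_ratio = feature_ranges.iloc[0] / feature_ranges.iloc[-1]
print(f"Widest range is about {scale_ratio:.0f}x the narrowest")
```

A ratio this large is a strong hint that some form of feature scaling is worth applying before training.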

In a normal machine learning workflow, this exploration would be much more drawn out, but we are going to skip ahead to data processing to get back to the main focus of this tutorial, Scikit-learn.

## Data preprocessing
We have a decent understanding of what our data looks like. When you’ve reached this point, it usually means you’re ready to begin moving toward preparing the data to be fed into a machine learning model.

Data processing is a vital step in the machine learning workflow because data from the real world is messy. It may contain:

* Missing values
* Redundant values
* Outliers
* Errors
* Noise

You must deal with all of this before feeding the data to a machine learning model; otherwise, the model will incorporate these mistakes into its approximation function and learn to make the same mistakes on new instances. This is the idea behind the famous machine learning saying, “Garbage in, garbage out.”

Another reason is that machine learning models typically require numeric data.  
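A few of the issues listed above can be screened for with short pandas checks. The sketch below is one possible approach, not a complete data-quality audit; the 1.5 * IQR rule for outliers is a common rule of thumb chosen here for illustration, and the variable names are assumptions.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine_data = load_wine()
X = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)

# Redundant values: rows that are exact duplicates of an earlier row
n_duplicates = int(X.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")

# Outliers: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = X.quantile(0.25), X.quantile(0.75)
iqr = q3 - q1
is_outlier = (X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)
print(is_outlier.sum().sort_values(ascending=False).head())

# Numeric check: every column should have a numeric dtype before modeling
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in X.dtypes)
print(f"All columns numeric: {all_numeric}")
```

Flagged values are not necessarily errors; whether to keep, cap, or drop them depends on the domain.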

Other than our features being on different scales, there's not much else wrong with our data at first glance. To address the scaling problem, let's standardize the features using sklearn's StandardScaler class; this rescales each feature so that its mean is approximately zero and its standard deviation is approximately one.

In [13]:
from sklearn.preprocessing import StandardScaler

# Split data into features and label
X = wine_df[wine_data.feature_names].copy()
y = wine_df["target"].copy()

# Instantiate scaler and fit on features
scaler = StandardScaler()
scaler.fit(X)

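# Transform features (pass the DataFrame itself so the feature names
# match those seen during fit; the result is a NumPy array)
X_scaled = scaler.transform(X)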

# View first instance
print(X_scaled[0])

[ 1.51861254 -0.5622498   0.23205254 -1.16959318  1.91390522  0.80899739
  1.03481896 -0.65956311  1.22488398  0.25171685  0.36217728  1.84791957
  1.01300893]
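
To confirm what standardization does, the sketch below checks the column means and standard deviations after scaling and reproduces the transformation by hand. X_manual is an illustrative name introduced here.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X = load_wine().data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After standardization each column has mean ~0 and standard deviation ~1
print(np.round(X_scaled.mean(axis=0), 6))
print(np.round(X_scaled.std(axis=0), 6))

# StandardScaler computes z = (x - mean) / std using the population std (ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X_scaled, X_manual))
```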





# STEP 3: Model training
Before a machine learning model can make predictions, it must be trained on a set of data to learn an approximation function.

But how will we know if the model performs well on data it has not seen before? We won’t unless we test it out.

One way to test a machine learning model before placing it in an environment where it impacts others is to split the available data into a training set and a test set, and to use the test set to evaluate what the model has learned; this is known as offline evaluation.

There are several ways to split data into train and test sets, but scikit-learn has a built-in function to do this on our behalf called train_test_split().

We’ll use this function to split our data such that 70% is used to train the model and 30% is used to evaluate the model's ability to generalize to unseen instances.

In [15]:
from sklearn.model_selection import train_test_split

# Split data into train and test
X_train_scaled, X_test_scaled, y_train, y_test = train_test_split(
    X_scaled,
    y,
    train_size=0.7,
    random_state=25,
)

# Check the splits are correct
print(f"Train size: {round(len(X_train_scaled) / len(X) * 100)}% \n\
Test size: {round(len(X_test_scaled) / len(X) * 100)}%")

Train size: 70% 
Test size: 30%
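
One caveat worth noting: in the steps above, the scaler was fitted on all 178 samples before the split, so statistics from the test rows influence the scaled training data. A common way to avoid this is to split first and let a pipeline fit the scaler on the training rows only. The following is a sketch of that alternative, not the approach used in this tutorial; stratify=y is an extra assumption added here to keep class proportions similar in both sets.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Split the raw (unscaled) data first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=25, stratify=y
)

# The pipeline fits the scaler on the training data only, then trains the model
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

# At prediction time the test data is scaled with the training-set statistics
test_accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```

On a small, clean dataset like this one the difference is likely to be minor, but the pipeline pattern generalizes better to real projects.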


# STEP 4: Building the model
Thanks to sklearn, building a machine learning model is extremely simple.

We are going to build three models to predict the class of wine:

* Logistic regression
* Support vector machine
* Decision tree classifier

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

In [18]:
# Instantiate the model
logistic_regression = LogisticRegression()

# Train the model
logistic_regression.fit(X_train_scaled, y_train)

# Make predictions on the test set
log_reg_preds = logistic_regression.predict(X_test_scaled)

In [19]:
log_reg_preds

array([1, 0, 0, 0, 1, 1, 0, 2, 1, 2, 1, 1, 0, 1, 1, 1, 2, 0, 1, 1, 2, 2,
       0, 0, 2, 0, 1, 0, 2, 1, 0, 1, 1, 1, 1, 1, 0, 2, 0, 0, 1, 1, 0, 1,
       1, 2, 2, 0, 2, 0, 2, 2, 2, 1])

In [20]:
# Instantiate the model
svm = SVC()

# Train the model
svm.fit(X_train_scaled, y_train)

# Make predictions on the test set
svm_preds = svm.predict(X_test_scaled)

In [21]:
svm_preds

array([1, 0, 0, 0, 1, 1, 0, 2, 1, 2, 1, 1, 0, 1, 1, 1, 2, 0, 1, 1, 2, 2,
       0, 0, 2, 0, 1, 0, 2, 1, 0, 1, 1, 1, 1, 1, 0, 2, 0, 0, 1, 1, 0, 1,
       1, 2, 1, 0, 2, 0, 2, 2, 1, 1])

In [22]:
# Instantiate the model
# (no random_state is set, so the fitted tree may differ slightly between runs)
tree = DecisionTreeClassifier()

# Train the model
tree.fit(X_train_scaled, y_train)

# Make predictions on the test set
tree_preds = tree.predict(X_test_scaled)


In [23]:
tree_preds

array([1, 0, 1, 0, 1, 1, 0, 2, 1, 2, 1, 1, 0, 0, 1, 1, 2, 0, 1, 1, 2, 2,
       0, 0, 2, 0, 1, 0, 2, 2, 0, 1, 0, 1, 1, 1, 0, 2, 0, 0, 1, 1, 0, 1,
       1, 2, 1, 0, 2, 0, 2, 2, 1, 1])
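
Because every scikit-learn estimator shares the same fit/predict interface, the three cells above can be condensed into a loop. The sketch below is one way to do that; it recreates the scaled split so it can run on its own, and it sets random_state=0 on the decision tree, which is an addition for reproducibility and not part of the original cells.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Recreate the scaled train/test split used in this tutorial
X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train_scaled, X_test_scaled, y_train, y_test = train_test_split(
    X_scaled, y, train_size=0.7, random_state=25
)

# Every scikit-learn estimator exposes the same fit/predict interface
models = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Machine": SVC(),
    # random_state is an addition here so the tree is reproducible
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

model_preds = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    model_preds[name] = model.predict(X_test_scaled)
    print(name, model_preds[name][:10])
```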

# STEP 5: Model evaluation
Model evaluation is done to test how well the model generalizes to unseen instances. Scikit-learn provides an array of classification and regression metrics to evaluate a trained model's performance.

For our use case, we are going to use classification_report() from the metrics module to build a text report showing the main classification metrics: precision, recall, F1 score, and accuracy.
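
Before running the report, it may help to see what these metrics measure. The toy example below uses hand-made labels (not the wine data) so that each value can be verified by counting; the F1 score, not computed here, is the harmonic mean of precision and recall.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hand-made labels for illustration only (not taken from the wine data)
y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2]

# Precision: of the samples predicted as class k, what fraction really is class k?
precision = precision_score(y_true, y_pred, average=None)

# Recall: of the samples that really are class k, what fraction was predicted as k?
recall = recall_score(y_true, y_pred, average=None)

# Accuracy: fraction of all predictions that are correct
accuracy = accuracy_score(y_true, y_pred)

print(precision)  # class 2: 2 of 3 predictions are correct
print(recall)     # class 1: 2 of 3 true samples are found
print(accuracy)   # 6 of 7 predictions are correct
```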

In [25]:
from sklearn.metrics import classification_report

# Store model predictions in a dictionary;
# this makes it easier to iterate through each model
# and print the results.
model_preds = {
    "Logistic Regression": log_reg_preds,
    "Support Vector Machine": svm_preds,
    "Decision Tree": tree_preds
}

for model, preds in model_preds.items():
    # sep has no effect when print() receives a single argument, so it is omitted
    print(f"{model} Results:\n{classification_report(y_test, preds)}")

Logistic Regression Results:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        17
           1       1.00      0.92      0.96        25
           2       0.86      1.00      0.92        12

    accuracy                           0.96        54
   macro avg       0.95      0.97      0.96        54
weighted avg       0.97      0.96      0.96        54

Support Vector Machine Results:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        17
           1       1.00      1.00      1.00        25
           2       1.00      1.00      1.00        12

    accuracy                           1.00        54
   macro avg       1.00      1.00      1.00        54
weighted avg       1.00      1.00      1.00        54

Decision Tree Results:
              precision    recall  f1-score   support

           0       0.89      0.94      0.91        17
           1       0.96      0.88      0.92  

At first glance, it seems as though the support vector machine is the best model. In a typical workflow, a perfect score like this should spark curiosity about the model: is it really as good as it appears, or have we made a mistake somewhere? You should be intrigued to learn more about your models and what they are learning, as this will give you better insight into their strengths and weaknesses.

This information is valuable to stakeholders, since it allows them to find solutions that compensate for where the model falls short.
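
One way to check whether a result such as the perfect support vector machine score holds up is k-fold cross-validation, which evaluates the model on several different train/test splits instead of one. The sketch below is a suggested follow-up rather than part of the original workflow; the choice of five folds and the name svm_pipeline are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# Scaling happens inside the pipeline, so it is refit on each training fold
svm_pipeline = make_pipeline(StandardScaler(), SVC())

# 5-fold cross-validation: five different train/test splits instead of one
cv_scores = cross_val_score(svm_pipeline, X, y, cv=5)
print(cv_scores)
print(f"Mean accuracy: {cv_scores.mean():.2f} (std {cv_scores.std():.2f})")
```

If the scores vary noticeably across folds, the single-split result was probably optimistic; if they stay consistently high, the model is more likely to generalize.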