By now you know that there are two main types of machine learning: supervised and unsupervised. And you are also familiar with one of the most popular supervised learning models, linear regression, which is especially helpful for predicting numerical values, like the price of a house or the number of umbrellas. But what if we are not interested in a number, but a category?

Classification is the process of categorizing a given set of data into classes. Here, the classes act as our labels, or ground truth. A classification model uses the features of an object to predict its label. Since we are talking about labels, you can probably guess that we have another supervised learning model at hand.

The algorithm used by your email service provider to separate spam from non-spam emails is an example of classification. This model takes the features of an email (subject, sender’s email address, body, and attachments) as inputs and predicts one of two classes: spam or non-spam. This is an example of binary classification, where the output is restricted to two classes: spam and non-spam, true and false, zero and one, yes and no, positive and negative, and so on.
 
If there are more than two classes, we have a multi-class classification problem. Examples include classifying types of fruit based on their color, weight, and size, or sorting movies into genres like comedy, romance, drama, and horror.

The question is, how can machine learning solve this problem? Let’s start with our first classification model: logistic regression. The best way to think about logistic regression is that it is linear regression adapted for classification problems. Logistic regression uses a logistic function, specifically the sigmoid function. This function takes any real input and outputs a value between zero and one, which we can read as the probability of the positive class. Unlike linear regression, logistic regression doesn’t need a linear relationship between input and output variables.

Once we have the predicted results from our classification model, or classifier, we compare these results with the actual labels, the ground truth, and evaluate the performance of our model.
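To make the sigmoid concrete, here is a minimal sketch in plain Python. The function names and the 0.5 threshold are assumptions chosen for illustration; this is not how sklearn implements logistic regression internally.

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the range between zero and one."""
    # Two branches keep math.exp from overflowing for large negative inputs.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_label(z: float, threshold: float = 0.5) -> int:
    """Turn the sigmoid output into a class: 1 if it reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3), predict_label(z))
```

Large negative inputs land near zero, large positive inputs land near one, and an input of zero gives exactly 0.5.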

We sort each prediction into one of four categories: true positives, false positives, true negatives, and false negatives. True positives are the instances that were predicted as positive and whose ground truth was also positive. False positives are the instances that were predicted as positive but were actually negative. Likewise, true negatives are the instances that were predicted as negative and whose ground truth was also negative. And false negatives are the instances that were predicted as negative but whose ground truth was positive. The table that holds these four counts is called the confusion matrix. We want our true positives and negatives to be maximized and false positives and negatives to be minimized. We use these categories to define our evaluation metrics: accuracy, precision, recall, and F1 score.
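The four categories can be tallied with a short loop. This is a sketch in plain Python; the function name and the label lists are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1      # predicted positive, actually positive
        elif predicted == positive:
            fp += 1      # predicted positive, actually negative
        elif actual == positive:
            fn += 1      # predicted negative, actually positive
        else:
            tn += 1      # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

# Hypothetical labels, invented for this example only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
```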

The accuracy of an algorithm is the ratio of correctly classified instances to the total number of samples. The precision of an algorithm is the ratio of correctly classified positive instances to the total number of samples that are predicted positive. The recall metric is the number of correctly classified positive instances divided by the total number of instances that are actually positive. The idea behind recall is to know how many of the actual positive samples the classifier found, and therefore how many it missed. Recall is also called sensitivity.

The F1 score is also known as the F measure. It indicates the balance between precision and recall: it is their harmonic mean, so it is high only when both are high. Let’s see how this plays out in practice.
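As a sketch of how the four metrics follow from the four counts, the function below computes them directly. The counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the definitions prescribe.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts, invented for this example only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(name, round(value, 3))
```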

We have a binary classification problem. We need to classify tumors as either malignant, which means cancerous, or benign, which means non-cancerous. Our dataset contains statistical data from histopathology examinations. We will use this dataset to train our logistic regression model.

 Let’s import pandas and read the dataset. 

In [None]:
import pandas as pd
dataset = pd.read_csv("breast-cancer.csv")

Using the shape attribute, we can easily get the number of observations and columns. In this dataset, there are 569 instances and 31 columns. One column is the target, so there are 30 features in total: radius mean, texture mean, radius worst, and so on.

In [None]:
dataset.shape

(569, 31)

We can check the first 5 instances in our dataset using the head() method, and the last 5 using tail(). Here, we can see that the first column, diagnosis, represents the target variable. The following columns are the features.

In [None]:
dataset.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,2.217515,...,1.88669,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015
1,M,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,0.001392,...,1.805927,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.24389,0.28119
2,M,1.579888,0.456187,1.566503,1.558884,0.94221,1.052926,1.363478,2.037231,0.939685,...,1.51187,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955,1.152255,0.201391
3,M,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,2.867383,...,-0.281464,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.93501
4,M,1.750297,-1.151816,1.776573,1.826229,0.280372,0.53934,1.371011,1.428493,-0.00956,...,1.298575,-1.46677,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.3971


In [None]:
dataset.tail()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
564,M,2.110995,0.721473,2.060786,2.343856,1.041842,0.21906,1.947285,2.320965,-0.312589,...,1.901185,0.1177,1.752563,2.015301,0.378365,-0.273318,0.664512,1.629151,-1.360158,-0.709091
565,M,1.704854,2.085134,1.615931,1.723842,0.102458,-0.017833,0.693043,1.263669,-0.217664,...,1.53672,2.047399,1.42194,1.494959,-0.69123,-0.39482,0.236573,0.733827,-0.531855,-0.973978
566,M,0.702284,2.045574,0.672676,0.577953,-0.840484,-0.03868,0.046588,0.105777,-0.809117,...,0.561361,1.374854,0.579001,0.427906,-0.809587,0.350735,0.326767,0.414069,-1.104549,-0.318409
567,M,1.838341,2.336457,1.982524,1.735218,1.525767,3.272144,3.296944,2.658866,2.137194,...,1.961239,2.237926,2.303601,1.653171,1.430427,3.904848,3.197605,2.289985,1.919083,2.219635
568,B,-1.808401,1.221792,-1.814389,-1.347789,-3.112085,-1.150752,-1.114873,-1.26182,-0.82007,...,-1.410893,0.76419,-1.432735,-1.075813,-1.859019,-1.207552,-1.305831,-1.745063,-0.048138,-0.751207


This is a real-life dataset, and before we can apply machine learning algorithms to it, it has to be cleaned and organized. Since machines operate on numbers, we need to convert our target variable from a categorical to a numerical type. There are many ways to do this. One of the simpler methods is called label encoding. With this, we can convert “M” and “B” to 1 and 0. As a first step, we import LabelEncoder from the sklearn library. Then, we create a LabelEncoder object and assign it to the “labelencoder” variable. Finally, we convert the “diagnosis” column from categorical to numeric with the code in the third line. LabelEncoder numbers the classes in sorted order, which is why “B” becomes 0 and “M” becomes 1.

In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
dataset["diagnosis"] = labelencoder.fit_transform(dataset["diagnosis"].values) 
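The idea behind label encoding can be sketched in a few lines of plain Python. This only mimics the sorted-order numbering that LabelEncoder applies; the helper name and the sample labels are assumptions for illustration.

```python
def fit_label_mapping(labels):
    """Give each distinct label an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}

# Hypothetical sample of the diagnosis column.
diagnoses = ["M", "M", "B", "M", "B"]
mapping = fit_label_mapping(diagnoses)
encoded = [mapping[d] for d in diagnoses]

print(mapping)
print(encoded)
```

Because "B" sorts before "M", benign is encoded as 0 and malignant as 1.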

In [None]:
dataset.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,2.217515,...,1.88669,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015
1,1,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,0.001392,...,1.805927,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.24389,0.28119
2,1,1.579888,0.456187,1.566503,1.558884,0.94221,1.052926,1.363478,2.037231,0.939685,...,1.51187,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955,1.152255,0.201391
3,1,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,2.867383,...,-0.281464,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.93501
4,1,1.750297,-1.151816,1.776573,1.826229,0.280372,0.53934,1.371011,1.428493,-0.00956,...,1.298575,-1.46677,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.3971


In [None]:
dataset.tail()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
564,1,2.110995,0.721473,2.060786,2.343856,1.041842,0.21906,1.947285,2.320965,-0.312589,...,1.901185,0.1177,1.752563,2.015301,0.378365,-0.273318,0.664512,1.629151,-1.360158,-0.709091
565,1,1.704854,2.085134,1.615931,1.723842,0.102458,-0.017833,0.693043,1.263669,-0.217664,...,1.53672,2.047399,1.42194,1.494959,-0.69123,-0.39482,0.236573,0.733827,-0.531855,-0.973978
566,1,0.702284,2.045574,0.672676,0.577953,-0.840484,-0.03868,0.046588,0.105777,-0.809117,...,0.561361,1.374854,0.579001,0.427906,-0.809587,0.350735,0.326767,0.414069,-1.104549,-0.318409
567,1,1.838341,2.336457,1.982524,1.735218,1.525767,3.272144,3.296944,2.658866,2.137194,...,1.961239,2.237926,2.303601,1.653171,1.430427,3.904848,3.197605,2.289985,1.919083,2.219635
568,0,-1.808401,1.221792,-1.814389,-1.347789,-3.112085,-1.150752,-1.114873,-1.26182,-0.82007,...,-1.410893,0.76419,-1.432735,-1.075813,-1.859019,-1.207552,-1.305831,-1.745063,-0.048138,-0.751207


Did you notice that something is missing? Where is our test dataset? This time we have only one file, so we need to divide this dataset into a train set and a test set ourselves. We can do that using the train_test_split function from the sklearn library.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train, test = train_test_split(dataset, test_size=0.3)

In [None]:
X_train = train.drop("diagnosis",axis=1)
y_train = train.loc[:,"diagnosis"]

X_test = test.drop("diagnosis",axis=1)
y_test = test.loc[:,"diagnosis"]
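To see what a 70/30 split does to the 569 rows, here is a rough sketch using only the standard library. It shuffles row indices and rounds the test share up, which reproduces the 171 test rows that appear in the reports further down; the helper name and the seed are assumptions, and this is not sklearn's implementation.

```python
import math
import random

def split_indices(n_rows: int, test_size: float = 0.3, seed: int = 0):
    """Shuffle row indices and cut them into train and test parts."""
    rng = random.Random(seed)          # fixed seed makes the split repeatable
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = math.ceil(test_size * n_rows)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(569, test_size=0.3, seed=0)
print(len(train_idx), len(test_idx))
```

With sklearn itself, passing a `random_state` value to `train_test_split` plays the role of the seed here, so the same rows end up in the test set on every run.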

Also, as we did for the regression problem, we need to define the “target” we want to predict. In this problem, we’re trying to predict if the tumor is malignant (1) or benign (0). Hence, our target variable is the “diagnosis” column, and the rest of the columns are the “features”. Let’s assign the features to the X variable and the target to the y variable. And remember, we need to do this for both the train and test datasets. By the way, you can also separate the features and the target for the whole dataset first and then divide into train and test. It’s up to you.

We’re ready to import our logistic regression model from the sklearn library. After importing it, we create the model and assign it to the “model_1” variable. Now, we’re ready to train our model, which means teaching it the hidden patterns in the train dataset. And finally, we can make the predictions on the test dataset.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model_1 = LogisticRegression()

In [None]:
model_1.fit(X_train,y_train)

LogisticRegression()

In [None]:
predictions = model_1.predict(X_test)
predictions

array([0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0])

Using the confusion matrix, we can check how many of our predictions are correct. First, we import confusion_matrix from sklearn and display the count for each category. In the run shown below, we have 101 true negatives, 0 false positives, 6 false negatives, and 64 true positives. That means 165 out of 171 predictions are correct. Isn’t it amazing? Because train_test_split shuffles the data randomly, your own counts may differ slightly. Let’s continue. We import classification_report from sklearn and display the evaluation metrics. The scores we get are quite high. We’ve done a really good job.

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, predictions)

array([[101,   0],
       [  6,  64]])

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.94      1.00      0.97       101
           1       1.00      0.91      0.96        70

    accuracy                           0.96       171
   macro avg       0.97      0.96      0.96       171
weighted avg       0.97      0.96      0.96       171
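
As a cross-check, the report for the logistic regression model can be recomputed by hand from its confusion matrix. The sketch below copies the matrix printed above and assumes sklearn's layout, where rows are actual classes and columns are predicted classes; the helper name is made up for illustration.

```python
# Confusion matrix from the logistic regression run: rows = actual, columns = predicted.
cm = [[101, 0],
      [6, 64]]

def per_class_scores(cm, cls):
    """Precision, recall, F1, and support when `cls` is treated as the positive class."""
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)   # column sum: everything predicted as cls
    actual = sum(cm[cls])                     # row sum: everything that really is cls
    precision = tp / predicted                # assumes at least one prediction of cls
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, actual

rows = {cls: per_class_scores(cm, cls) for cls in (0, 1)}
total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
macro_f1 = sum(r[2] for r in rows.values()) / len(rows)
weighted_f1 = sum(r[2] * r[3] for r in rows.values()) / total

for cls, (p, r, f1, support) in rows.items():
    print(cls, round(p, 2), round(r, 2), round(f1, 2), support)
print("accuracy", round(accuracy, 2))
print("macro f1", round(macro_f1, 2), "weighted f1", round(weighted_f1, 2))
```

The macro average treats both classes equally, while the weighted average weights each class by its support.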



Well done! We have actually trained and tested a logistic regression classifier. Now, why not try another classification algorithm: Support Vector Machine. SVM is a supervised machine learning technique that can be used to solve classification and regression problems. It is, however, mostly used for classification. In this algorithm, we use an axis to represent each feature and plot all data points in the resulting space. Then, the SVM model finds boundaries to separate the classes. The decision boundary is what separates different data samples into specific classes.

Consider a dataset of different animals of two classes: birds and fish. In this dataset there are only three features: body weight, body length, and daily food consumption. We draw a 3-dimensional grid and plot all these points. An SVM model will try to find a 2D plane that separates the 2 classes.
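Once a separating plane is known, classifying a new point only requires checking which side of the plane it falls on. The sketch below illustrates this for the bird and fish example; the weights, bias, and feature values are invented for illustration and were not learned from any data.

```python
def decision_value(weights, bias, features):
    """Signed score: positive on one side of the plane, negative on the other."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def classify(weights, bias, features):
    """Assign a class according to the side of the plane the point lies on."""
    return "bird" if decision_value(weights, bias, features) >= 0 else "fish"

# Hypothetical plane parameters for the features
# (body weight, body length, daily food consumption).
weights = [-0.5, -0.2, 1.0]
bias = 0.1

sample_a = [0.03, 0.15, 0.5]   # made-up animal
sample_b = [2.0, 1.8, 0.1]     # made-up animal
print(classify(weights, bias, sample_a))
print(classify(weights, bias, sample_b))
```

Training an SVM amounts to choosing the weights and bias so that the plane separates the training classes as cleanly as possible.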

In [None]:
from sklearn.svm import LinearSVC

In [None]:
model_2 = LinearSVC()

In [None]:
model_2.fit(X_train,y_train)

LinearSVC()

In [None]:
predictions = model_2.predict(X_test)
predictions

array([0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0])

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, predictions)

array([[101,   0],
       [  7,  63]])

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.94      1.00      0.97       101
           1       1.00      0.90      0.95        70

    accuracy                           0.96       171
   macro avg       0.97      0.95      0.96       171
weighted avg       0.96      0.96      0.96       171



If there were more than 3 features, we would have a hyper-space. A hyper-space is a space with more than 3 dimensions (4D, 5D, and so on), so it is not possible to visualize. In such a space, we look for a hyper-plane that clearly separates the classes. A hyper-plane is the higher-dimensional counterpart of a plane: a flat boundary with one dimension fewer than the space it divides. This hyper-plane is used as the condition for classification: the predicted class of a point depends on which side of the hyper-plane it falls. If the boundary is a flat hyper-plane, the SVM is called a Linear Kernel SVM. However, the boundary can be nonlinear as well. In that case we use a Polynomial Kernel or other advanced SVMs.

Let’s see how this model performs with the same breast cancer dataset we used earlier. We start by importing LinearSVC from sklearn and assigning a new model to the “model_2” variable. Now, we’re ready to train our model, which means teaching it the hidden patterns in the train dataset. Finally, we can make the predictions on the test dataset.
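The difference between a linear and a polynomial kernel mentioned above can be sketched as two small similarity functions. The parameter names `degree`, `gamma`, and `coef0` are chosen to resemble common SVM options, and the input vectors are made up; treat this as an illustration rather than a library implementation.

```python
def linear_kernel(x, y):
    """Plain dot product: leads to a flat (linear) decision boundary."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, gamma=1.0, coef0=1.0):
    """Dot product raised to a power: allows curved decision boundaries."""
    return (gamma * linear_kernel(x, y) + coef0) ** degree

# Hypothetical feature vectors.
x = [1.0, 2.0]
y = [3.0, 4.0]
print(linear_kernel(x, y))
print(polynomial_kernel(x, y))
```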

Our predictions with the Support Vector Classifier are ready! Now, we can check the accuracy of our model in the same way we did for Logistic Regression. We can start with the confusion matrix. In the run shown above, we have 101 true negatives, 0 false positives, 7 false negatives, and 63 true positives. This means that 164 out of 171 predictions are correct, one fewer than with Logistic Regression. We should also check the classification report. We get quite high metrics here as well, but Logistic Regression did slightly better: recall for the malignant class drops from 0.91 to 0.90. False positives are a key metric to minimize for this dataset, because we don’t want healthy patients to be diagnosed with cancer, and in this run neither model produces any. The SVM does, however, produce one more false negative, that is, one more malignant tumor classified as benign. Therefore, we prefer using the Logistic Regression model for this problem and dataset.

The confusion matrix is a table that is used to evaluate the performance of a classification model. It is a square matrix with two dimensions: the actual class and the predicted class. Each entry counts the instances that belong to a particular actual class and were predicted to be in a particular class, so the correct predictions lie on the diagonal.
The formula for accuracy is:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where:
* TP = true positives
* TN = true negatives
* FP = false positives
* FN = false negatives
Accuracy is the proportion of instances that were correctly classified. It is a measure of how well the model can distinguish between the different classes.

In the confusion matrix, we want to maximize the true positives (TP) and true negatives (TN) and minimize the false positives (FP) and false negatives (FN).
TP represents the number of instances that were correctly classified as positive. TN represents the number of instances that were correctly classified as negative. FP represents the number of instances that were incorrectly classified as positive. FN represents the number of instances that were incorrectly classified as negative.
Maximizing TP and TN will increase the accuracy of the model, while minimizing FP and FN will reduce the error rate.

Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. One application of the logistic regression algorithm is to solve binary classification problems, such as predicting spam or non-spam mails.

The sigmoid function takes any real input and outputs a value between zero and one. In logistic regression, the sigmoid function is used to solve classification problems. When plotted, its curve resembles the letter S.

The recall metric provides information about the model's ability to find positive samples. It is defined as the number of correctly classified positive instances (TP) divided by the total number of instances that are actually positive (TP + FN).

Classification is the process of categorizing a given set of data into classes. Types of problems with two class labels are called binary classification. Unlike linear regression, logistic regression does not need a linear relationship between input and output variables. In the confusion matrix, we want our true positives and negatives maximized and false positives and negatives minimized.

Hyper-planes are the higher-dimensional counterparts of planes: flat boundaries with one dimension fewer than the space they divide. If we try to solve a classification problem with more than three features using the SVM algorithm, it looks for a hyper-plane that clearly distinguishes the different classes.

If the hyper-planes are linear, the SVM is called a linear kernel SVM. A polynomial kernel or other advanced SVMs are used for non-linear boundaries. The F1 score is also known as the F measure. It indicates the balance between precision and recall.

Binary classification refers to classification tasks that have two class labels. Spam detection is an example of binary classification where the output is restricted to two labels, spam or non-spam.

If only one dataset is available, it needs to be split into train and test data. To do this, the train_test_split() function from sklearn can be used.

The support vector machine (SVM) algorithm can be used for classification and regression problems, but it is mostly used for classification. SVM uses decision boundaries to separate classes. In the data, the target variable needs to be converted to a numerical type if it is categorical. The SVM algorithm uses hyper-planes if there are more than three features, because the data then lives in a hyper-space, that is, a space with more than three dimensions.