### Constructing a classifier


##### **Introduction**

In the field of machine learning, **classification refers to the process of using the characteristics of data to separate it into a certain number of classes**. This is different from regression, which we discussed in Chapter 1, *The Realm of Supervised Learning*, **where the output is a real number**. A **supervised learning classifier builds a model using labeled training data and then uses this model to classify unknown data**.

A classifier can be any algorithm that implements classification. In simple cases, **a classifier can be a straightforward mathematical function**. **In more real-world cases, a classifier can take very complex forms**. In the course of study, we will see that classification can be either binary, where we separate data into two classes, or multi-class, where we separate data into more than two classes. The mathematical techniques devised for classification problems tend to deal with two classes, so we extend them in different ways to handle multi-class problems as well.

Evaluating the accuracy of a classifier is vital for machine learning. What we need to know is how to use the available data to get a glimpse of how the model will perform in the real world. In this chapter, we will look at recipes that deal with all these things.

#### **Building a simple classifier**

A classifier is a system with some characteristics that allow you to identify the class of the sample examined. In different classification methods, groups are called classes. The goal of a classifier is to establish the classification criterion to maximize performance. The performance of a classifier is measured by evaluating the capacity for generalization. Generalization means attributing the correct class to each new experimental observation. The way in which these classes are identified discriminates between the different methods that are available. 

##### Getting ready
Classifiers identify the class of a new object, based on knowledge that's been extracted from a series of samples (a dataset). Starting from a dataset, a classifier extracts a model, which is then used to classify new instances.

##### How to do it
Let's see how to build a simple classifier using some training data.

We will use the `simple_classifier.py` file, already provided. To start, we import the `numpy` and `matplotlib.pyplot` packages, as we did in Chapter 1, The Realm of Supervised Learning, and then we create some sample data.


##### How it works
In this recipe, we showed how simple it is to build a classifier. We started from a series of points on a plane, each identified by a pair of coordinates $(x, y)$. We then assigned a class ($0$ or $1$) to each of these points so as to divide them into two groups. To understand the spatial arrangement of these points, we visualized them by associating a different marker to each class. Finally, to divide the two groups, we drew the line of the equation $y = x$.

##### There's more
We built a simple classifier using the following rule: *The input point $(a, b)$ belongs to `class_0` if $a$ is greater than or equal to $b$; otherwise, it belongs to `class_1`*. If you inspect the points one by one, you will see that this is, in fact, true. That's it! You just built a linear classifier that can classify unknown data. It's a linear classifier because the separating line is a straight line. If it's a curve, then it becomes a nonlinear classifier.
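The rule above can be sketched in a few lines. This is a minimal illustration, not the contents of `simple_classifier.py`: the sample points, the labels, and the `classify` helper are hypothetical names chosen for this example.

```python
import numpy as np

# Hypothetical sample points (a, b) with hand-assigned class labels
X = np.array([[3, 1], [2, 5], [1, 8], [6, 4], [5, 2], [3, 5], [4, 7], [4, -1]])
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])

def classify(point):
    """Return 0 (class_0) if a >= b, otherwise 1 (class_1)."""
    a, b = point
    return 0 if a >= b else 1

# Apply the rule to every sample and compare with the assigned labels
predictions = np.array([classify(p) for p in X])
print(predictions)
print("All points match their labels:", bool((predictions == y).all()))
```

Because the rule compares the two coordinates directly, the decision boundary is exactly the line $y = x$ used in the recipe.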

#### **Building a logistic regression classifier**

Despite the word regression being present in the name, logistic regression is actually used for classification purposes. Given a set of datapoints, our goal is to build a model that can draw linear boundaries between our classes. It extracts these boundaries by solving a set of equations derived from the training data. In this recipe, we will build a logistic regression classifier.

##### Getting ready
Logistic regression is a non-linear regression model used when the dependent variable is dichotomous. The purpose is to establish the probability with which an observation can generate one or the other value of the dependent variable; it can also be used to classify observations, according to their characteristics, into two categories.

##### How to do it
Let's see how to build a logistic regression classifier

We'll use the `logistric_regression.py` file.
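As a rough idea of what such a script might contain, here is a minimal sketch that assumes scikit-learn is installed. The data points are hypothetical, and the default settings of `LogisticRegression` are an assumption rather than the recipe's actual configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical two-class training data: two well-separated groups of points
X = np.array([[1.0, 1.2], [1.5, 0.8], [0.8, 1.9], [2.0, 1.5],
              [5.0, 5.2], [5.5, 4.8], [4.8, 5.9], [6.0, 5.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Fit the model with default settings (an assumption for this sketch)
classifier = LogisticRegression()
classifier.fit(X, y)

# Classify two unseen points and inspect the class probabilities
new_points = np.array([[1.0, 1.0], [6.0, 6.0]])
predicted = classifier.predict(new_points)
probabilities = classifier.predict_proba(new_points)
print(predicted)
print(probabilities)
```

Each row returned by `predict_proba` holds the probabilities of the two classes, so the values in a row add up to $1$.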

##### How it works
**Logistic regression** is a classification method within the family of supervised learning algorithms. Using statistical methods, logistic regression allows us to generate a result that, in fact, represents the probability that a given input value belongs to a given class. **In binomial logistic regression problems, the probability that the output belongs to one class will be $P$, whereas the probability of it belonging to the other class will be $1-P$ (where $P$ is a number between $0$ and $1$ because it expresses a probability).**

Logistic regression uses the logistic function to determine the classification of input values. Also called the **sigmoid** function, the logistic function is an S-shaped curve that can take any real-valued number and map it to a value between $0$ and $1$, extremes excluded. It can be described by the following equation:

$$
F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \times x)}}
$$

This function transforms the real values into numbers between $0$ and $1$.
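To make the behavior of the logistic function concrete, the following sketch evaluates it at a few inputs. The function name `logistic` and the default coefficients $\beta_0 = 0$ and $\beta_1 = 1$ are assumptions chosen only for illustration.

```python
import math

def logistic(x, beta_0=0.0, beta_1=1.0):
    """Logistic (sigmoid) function: 1 / (1 + e^-(beta_0 + beta_1 * x))."""
    return 1.0 / (1.0 + math.exp(-(beta_0 + beta_1 * x)))

# Evaluate the function on a few real values, from very negative to very positive
inputs = (-10.0, -1.0, 0.0, 1.0, 10.0)
values = [logistic(x) for x in inputs]
for x, v in zip(inputs, values):
    print(f"F({x}) = {v:.5f}")
```

The outputs grow with the input, stay strictly between $0$ and $1$, and pass through $0.5$ when $\beta_0 + \beta_1 \times x = 0$.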

##### There's more
To express the logistic regression equation in probabilistic terms, we write the probability as a function of the input:

$$
P(x) = \frac{e^{\beta_0 + \beta_1 \times x}}{1 + e^{\beta_0 + \beta_1 \times x}}
$$
Recalling that the exponential function is the inverse of the natural logarithm $\ln$, we can write

$$
\ln\left(\frac{P(x)}{1 - P(x)}\right) = \beta_0 + \beta_1 \times x
$$
This function is called a **logit** function. <font color=pink>The logit function allows us to associate the probabilities (therefore, a value between $0$ and $1$) with the whole range of real numbers.</font> It's a link function and represents the inverse of the logistic function.
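The inverse relationship between the logit and the logistic function can be checked numerically. In this small sketch, `logistic` and `logit` are illustrative helper names, and `z` stands for the linear term $\beta_0 + \beta_1 \times x$.

```python
import math

def logistic(z):
    """Map a real number z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Map a probability p (0 < p < 1) back to a real number: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Applying logit after logistic should give back the original value
test_values = (-3.0, 0.0, 2.5)
round_trip = [logit(logistic(z)) for z in test_values]
print(round_trip)
```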

#### **Building a Naive Bayes classifier**


##### Getting ready
The underlying principle of a Bayesian classifier is **that some individuals belong to a class of interest with a given probability based on some observations**. This probability is based on the **assumption that the characteristics observed can be either dependent or independent from one another**; in this second case, **the Bayesian classifier is called Naive because it assumes that the presence or absence of a particular characteristic in a given class of interest** is not related to the presence or absence of other characteristics, greatly simplifying the calculation. Let's go ahead and build a Naive Bayes classifier.

##### How to do it
See the naive_bayes.py file
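For orientation, here is a minimal sketch of a Naive Bayes classifier, assuming scikit-learn's `GaussianNB` and synthetic data generated on the spot. The contents of `naive_bayes.py` may differ, and the cluster centers and sizes below are arbitrary choices.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data: two Gaussian clusters, one per class (illustrative values)
rng = np.random.default_rng(0)
class_0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
class_1 = rng.normal(loc=[4.0, 4.0], scale=0.5, size=(50, 2))
X = np.vstack([class_0, class_1])
y = np.array([0] * 50 + [1] * 50)

# Fit the classifier and measure accuracy on the training data
classifier = GaussianNB()
classifier.fit(X, y)
accuracy = classifier.score(X, y)
predicted = classifier.predict([[0.0, 0.0], [4.0, 4.0]])
print("Training accuracy:", accuracy)
print("Predictions for the two cluster centers:", predicted)
```

Accuracy measured on the training data is optimistic; a held-out test set gives a more honest estimate.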

##### How it works
A **Bayesian classifier** is a classifier based on the application of Bayes' theorem. This classifier requires the knowledge of a priori and conditional probabilities related to the problem; quantities that, in general, are not known but are typically estimable. If reliable estimates of the probabilities involved in the theorem can be obtained, the Bayesian classifier is generally reliable and potentially compact.

The probability that a given event ($E$) occurs, is the ratio between the number ($s$) of favorable cases of the event itself and the total number ($n$) of the possible cases, provided all the considered cases are equally probable. This can be better represented using the following formula:

$$
P = P(E) = \frac{\text{number of favorable cases}}{\text{total number of the possible cases}} = \frac{s}{n}
$$

Given two events, $A$ and $B$, if the two events are independent (the occurrence of one does not affect the probability of the other), the joint probability of the event is equal to the product of the probabilities of $A$ and $B$:

$$
P(A \cap B) = P(A) \times P(B)
$$

If the two events are dependent (that is, the occurrence of one affects the probability of the other), then a similar rule applies, provided that $P(B \mid A)$ denotes the probability of event $B$ given that event $A$ has occurred. This condition introduces conditional probability, which we are going to dive into now:

$$
P(A \cap B) = P(A) \times P(B \mid A)
$$

The probability that event $B$ occurs, calculated on the condition that event $A$ occurred, is called conditional probability, and is indicated by $P(B \mid A)$. It is calculated using the following formula:

$$
P(B \mid A) = \frac{P(A \cap B)}{P(A)}
$$

Let $A$ and $B$ be two dependent events. As we stated, the joint probability between them is calculated using the following formula:

$$
P(A \cap B) = P(A) \times P(B \mid A)
$$

Or, similarly, we can use the following formula:
$$
P(A \cap B) = P(B) \times P(A \mid B)
$$

The two formulas have the same left-hand side, so their right-hand sides must also be equal, and the following equation can be written:

$$
P(A) \times P(B \mid A) = P(B) \times P(A \mid B)
$$

By solving this equation for the conditional probability, we get the following:

$$
P(B \mid A) = \frac{P(B) \times P(A \mid B)}{P(A)}
$$

The proposed formulas represent the mathematical statement of Bayes' theorem. The use of one or the other depends on what we are looking for.
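A small worked example can make these formulas tangible. The sketch below uses a fair six-sided die, which is an illustrative choice and not part of the recipe: event $A$ is "the roll is even" and event $B$ is "the roll is greater than 3". Probabilities are computed as favorable cases over possible cases, using exact fractions.

```python
from fractions import Fraction

# All equally probable outcomes of a fair six-sided die
outcomes = range(1, 7)
A = {o for o in outcomes if o % 2 == 0}   # even roll: {2, 4, 6}
B = {o for o in outcomes if o > 3}        # roll greater than 3: {4, 5, 6}

def prob(event):
    """Classical probability: favorable cases divided by possible cases."""
    return Fraction(len(event), len(outcomes))

p_a, p_b = prob(A), prob(B)
p_a_and_b = prob(A & B)                   # joint probability P(A and B)

# Conditional probabilities from the definition
p_b_given_a = p_a_and_b / p_a
p_a_given_b = p_a_and_b / p_b

# Bayes' theorem: P(B | A) = P(B) * P(A | B) / P(A)
bayes_rhs = p_b * p_a_given_b / p_a
print(p_b_given_a, bayes_rhs)
```

Both ways of computing $P(B \mid A)$ give the same fraction, as Bayes' theorem requires. Note that $P(A) \times P(B) = \frac{1}{4}$ differs from $P(A \cap B) = \frac{1}{3}$, so these two events are dependent.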

A classifier solves the problem of **identifying subpopulations of individuals with certain features in a larger set**, with the possible use of a subset of individuals known a priori (a training set). A Naive Bayes classifier is a supervised learning classifier that uses Bayes' theorem to build the model. In this recipe, we built a Naive Bayes classifier.

#### **Splitting a dataset for training and testing**

Let's see how to split our data properly into training and testing datasets. As we said in Chapter 1, The Realm of Supervised Learning, in the Building a linear regressor recipe, when we build a machine learning model, we need a way to validate our model to check whether it is performing at a satisfactory level. To do this, we need to separate our data into two groups—a training dataset and a testing dataset. The training dataset will be used to build the model, and the testing dataset will be used to see how this trained model performs on unknown data. 

In this recipe, we will learn how to split the dataset for training and testing phases.

##### Getting ready
The fundamental objective of a model based on machine learning is to make accurate predictions. Before using a model to make predictions, it is necessary to evaluate its predictive performance. To estimate the quality of a model's predictions, it is necessary to use data that the model has never seen before. Training a predictive model and testing it on the same data is a methodological error: a model that simply repeats the labels of the samples it has just seen would have a high score but would not be able to predict the class of new data. Under these conditions, the generalization capacity of the model would be poor.

##### How it works
In this recipe, we split the data using the `train_test_split()` function of the scikit-learn library. This function splits arrays or matrices into random train and testing subsets. Random division of input data into data sources for training and testing ensures that data distribution is similar for training and testing data sources. You choose this option when it is not necessary to preserve the order of the input data.

##### How to do it

See the splitting_dataset.py file
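The split itself takes one call. The following sketch assumes scikit-learn and uses a tiny made-up array; the 25% test size and the `random_state` value are arbitrary choices for illustration and may not match `splitting_dataset.py`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 20 samples with 2 features each, and binary labels
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 25% of the samples for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5)

print("Training samples:", X_train.shape[0])
print("Testing samples:", X_test.shape[0])
```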

##### There's more
The performance estimate depends on the data used. Therefore, simply dividing data randomly into a training and a testing set does not guarantee that the results are statistically significant. The repetition of the evaluation on different random divisions and the calculation of the performance in terms of the average and standard deviation of the individual evaluations creates a more reliable estimate.

However, even the repetition of evaluations on different random divisions could prevent the most complex data being classified in the testing (or training) phase.

#### **Evaluating accuracy using cross-validation metrics**

**Cross-validation** is an important concept in machine learning. In the previous recipe, we split the data into training and testing datasets. However, to make the evaluation more robust, we need to repeat this process with different subsets. If we just fine-tune the model for a particular subset, we may end up overfitting it. **Overfitting** refers to **a situation where we fine-tune a model to a dataset too much and it fails to perform well on unknown data**. We want our machine learning model to perform well on unknown data. In this recipe, we will learn how to evaluate model accuracy using cross-validation metrics.

##### Getting ready
When we are dealing with machine learning models, we usually care about three things: **precision**, **recall**, and **F1 score**. We can select the required performance metric using the `scoring` parameter. ***Precision** refers to the number of items that are correctly classified as a percentage of the overall number of items that the classifier identified*. ***Recall** refers to the number of items that are correctly retrieved as a percentage of the overall number of items of interest in the dataset*.
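As a sketch of how the metric is selected, the example below calls scikit-learn's `cross_val_score` with different `scoring` strings. The synthetic data, the `GaussianNB` estimator, and the choice of five folds are assumptions made for this illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 2)),
               rng.normal(2.5, 1.0, size=(60, 2))])
y = np.array([0] * 60 + [1] * 60)

classifier = GaussianNB()

# Evaluate the same classifier with several metrics using 5-fold cross-validation
results = {}
for metric in ("accuracy", "precision_weighted", "recall_weighted", "f1_weighted"):
    scores = cross_val_score(classifier, X, y, scoring=metric, cv=5)
    results[metric] = scores.mean()
    print(f"{metric}: {100 * results[metric]:.2f}%")
```

Each call returns one score per fold; averaging them gives a single summary figure per metric.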

##### How it works
Let's consider a test dataset containing $100$ items, out of which $82$ are of interest to us. Now, we want our classifier to identify these $82$ items for us. Our classifier picks out $73$ items as the items of interest. Out of these $73$ items, only $65$ are actually items of interest, and the remaining $8$ are misclassified. We can compute **precision** in the following way

* **The number of correct identifications** = $65$
* **The total number of identifications** = $73$
* **`Precision`** = $\frac{65}{73} = 89.04\%$

To compute **recall**, we use the following:
* **The total number of items of interest in the dataset** = $82$
* **The number of items retrieved correctly** = $65$
* **`Recall`** = $\frac{65}{82} = 79.27\%$

A good machine learning model needs to have good precision and good recall simultaneously. It's easy to get one of them to $100\%$, but the other metric suffers! We need to keep both metrics high at the same time. To quantify this, we use an F1 score, which is a combination of precision and recall. This is actually the harmonic mean of precision and recall:

$F_{1score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

In the preceding case, the F1 score will be as follows:

$F_{1score} = \frac{2 \times 0.89 \times 0.79}{0.89 + 0.79} = 0.8370$
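The arithmetic of the worked example can be reproduced directly. The variable names below are illustrative; the counts come from the example above.

```python
# Counts from the worked example
correct_identifications = 65   # items picked out that are truly of interest
total_identifications = 73     # all items the classifier picked out
items_of_interest = 82         # all items of interest in the dataset

precision = correct_identifications / total_identifications
recall = correct_identifications / items_of_interest

# F1 score: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {100 * precision:.2f}%")
print(f"Recall: {100 * recall:.2f}%")
print(f"F1 score: {f1:.4f}")
```

When the unrounded precision and recall are used, the F1 score comes out at roughly $0.8387$, slightly different from the figure obtained with the inputs rounded to two decimals.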


##### There's more
In cross-validation, all available data is used, in groups of a fixed size, alternatively as a testing and as a training set. Therefore, each pattern is either classified (at least once) or used for training. The performances obtained depend, however, on the particular division. Therefore, it may be useful to repeat cross-validation several times in order to become independent of the particular division.

#### **Visualizing a confusion matrix**

A confusion matrix is a table that we use to understand the performance of a classification model. This helps us understand how we classify testing data into different classes. When we want to fine-tune our algorithms, we need to understand how data gets misclassified before we make these changes. Some classes are worse than others, and the confusion matrix will help us understand this.

![Confusion matrix](cm.png)

In the preceding diagram, we can see how we categorize data into different classes. Ideally, we want all the non-diagonal elements to be 0. This would indicate perfect classification! Let's consider class 0. Overall, 52 items actually belong to class 0. We get 52 if we sum up the numbers in the first row. Now, 45 of these items are being predicted correctly, but our classifier says that 4 of them belong to class 1 and three of them belong to class 2. We can apply the same analysis to the remaining 2 rows as well. An interesting thing to note is that 11 items from class 1 are misclassified as class 0. This constitutes around 16% of the datapoints in this class. This is an insight that we can use to optimize our model.

##### Getting ready
A confusion matrix identifies the nature of the classification errors, as our classification results are compared to real data. In this matrix, the diagonal cells show the number of cases that were correctly classified; all the other cells show the misclassified cases.

##### How to do it
Let's see how to visualize the confusion matrix

See the confusion_matrix.py file

##### How it works
A confusion matrix displays information about the actual and predicted classifications made by a model. The performance of such systems is evaluated with the help of the data in the matrix.
The following table shows the confusion matrix for a two-class classifier:

|                  | PREDICTED POSITIVE | PREDICTED NEGATIVE |
| :--------------- |:---------------:| :-----:|
| ACTUAL POSITIVE  | TP | FN |
| ACTUAL NEGATIVE  | FP | TN |

The entries in the confusion matrix have the following meanings:

* **TP** is the number of **correct predictions that an instance is positive**
* **FN** is the number of **incorrect predictions that an instance is negative**
* **FP** is the number of **incorrect predictions that an instance is positive**
* **TN** is the number of **correct predictions that an instance is negative**
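These four counts can be obtained with scikit-learn's `confusion_matrix` function. In the sketch below, the labels are hypothetical, with `1` standing for positive and `0` for negative. Passing `labels=[1, 0]` is a choice made here so that the result is laid out like the two-class table above; by default the function orders the classes in ascending order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# Rows are actual classes, columns are predicted classes.
# labels=[1, 0] puts the positive class first: [[TP, FN], [FP, TN]]
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
(tp, fn), (fp, tn) = cm.tolist()

print(cm)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
```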

##### There's more
The confusion matrix shows us the performance of an algorithm. Each row returns the instances in an actual class, while each column returns the instances in a predicted class. The term *confusion matrix* results from the fact that it makes it easy to see whether the system is confusing two classes.

#### **Extracting a performance report**

In the Evaluating accuracy using cross-validation metrics recipe, we calculated some metrics to measure the accuracy of the model. Let's recall their meanings. **The accuracy returns the percentage of correct classifications**. **Precision returns the percentage of positive classifications that are correct**. **Recall (sensitivity) returns the percentage of positive elements of the testing set that have been classified as positive**. Finally, in **F1, both the precision and the recall are used to compute the score**. In this recipe, we will learn how to extract a performance report.

##### Getting ready
We also have a function in scikit-learn that can directly print the precision, recall and F1 scores for us. Let's see how

##### How to do it
Let's extract a performance report

See the performance_report.py file

##### How it works
In this recipe, we used the `classification_report()` function of the scikit-learn library to extract a performance report. This function builds a text report showing the main classification metrics. A text summary of the precision, recall, and the F1 score for each class is returned. Referring to the terms introduced in the confusion matrix addressed in the previous recipe, these metrics are calculated as follows:

* The **precision** is the ratio $\frac{tp}{tp + fp}$, where $tp$ is the number of true positives and $fp$ the number of false positives. The precision is the ability of the classifier to not label a sample that is negative as positive.
* The **recall** is the ratio $\frac{tp}{tp + fn}$, where $tp$ is the number of true positives and $fn$ the number of false negatives. The recall is the ability of the classifier to find the positive samples.
* The **F1 score** is the harmonic mean of the precision and recall, that is, the F-beta score with $\beta = 1$; it reaches its best value at $1$ and its worst at $0$.
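The formulas in this list can be checked against scikit-learn on a small example. The labels below are hypothetical (`1` = positive, `0` = negative), and the class names passed to `target_names` are chosen only for readability.

```python
from sklearn.metrics import (classification_report, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 3 true positives, 1 false negative, 2 false positives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
tp, fn, fp = 3, 1, 2

# Metrics for the positive class, computed by hand from the counts
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)
manual_f1 = (2 * manual_precision * manual_recall
             / (manual_precision + manual_recall))

# The same metrics computed by scikit-learn
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Full text report with per-class metrics and averages
report = classification_report(y_true, y_pred,
                               target_names=["negative", "positive"])
print(report)
```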

##### There's more
The reported averages include the **micro average** (averaging the total true positives, false negatives, and false positives), the **macro average** (averaging the unweighted mean per label), the **weighted average** (averaging the support-weighted mean per label), and the **sample average** (only for multilabel classification).

#### **Evaluating cars based on their characteristics**

In this recipe, let's see how we can apply classification techniques to a real-world problem. We will use a dataset that contains some details about cars, such as number of doors, boot space, maintenance costs, and so on. Our goal is to determine the quality of the car. For the purposes of classification, quality can take four values: *unacceptable, acceptable, good, or very good*.

##### Getting ready
Let's download the dataset at [this address](https://archive.ics.uci.edu/ml/datasets/Car+Evaluation)

You need to treat each value in the dataset as a string. We consider six attributes in the dataset. Here are the attributes along with the possible values they can take:

* `buying`: These will be `vhigh`, `high`, `med`, and `low`.
* `maint`: These will be `vhigh`, `high`, `med`, and `low`.
* `doors`: These will be `2`, `3`, `4`, and `5more`.
* `persons`: These will be `2`, `4`, and `more`.
* `lug_boot`: These will be `small`, `med`, and `big`.
* `safety`: These will be `low`, `med`, and `high`.

Given that each line contains strings, we need to assume that all the features are strings and design a classifier. In the previous chapter, we used random forests to build a regressor. In this recipe, we will use random forests as a classifier.

##### How to do it
Let's see how to evaluate cars based on their characteristics.

See the `car.py` file.
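The following sketch shows one way the string features could be encoded and fed to a random forest. It is not the contents of `car.py`: the rows are hypothetical examples written in the dataset's format, and the hyperparameter values are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows: buying, maint, doors, persons, lug_boot, safety, class
rows = [
    ["vhigh", "vhigh", "2", "2", "small", "low", "unacc"],
    ["high", "low", "4", "more", "med", "high", "acc"],
    ["low", "low", "4", "4", "big", "high", "vgood"],
    ["med", "med", "3", "4", "med", "med", "acc"],
    ["low", "med", "5more", "more", "big", "med", "good"],
    ["vhigh", "high", "2", "2", "small", "med", "unacc"],
    ["med", "low", "4", "more", "big", "high", "vgood"],
    ["high", "high", "3", "2", "med", "low", "unacc"],
]
data = np.array(rows)

# Encode every column (features and class) with its own label encoder
encoders = []
encoded = np.empty(data.shape, dtype=int)
for col in range(data.shape[1]):
    encoder = LabelEncoder()
    encoded[:, col] = encoder.fit_transform(data[:, col])
    encoders.append(encoder)

X, y = encoded[:, :-1], encoded[:, -1]

# Train a random forest classifier (hyperparameter values are illustrative)
classifier = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=7)
classifier.fit(X, y)

# Classify a new car, reusing the encoders fitted on the training data
new_car = ["high", "low", "2", "more", "med", "high"]
new_encoded = [int(encoders[i].transform([value])[0])
               for i, value in enumerate(new_car)]
predicted = classifier.predict([new_encoded])
predicted_label = encoders[-1].inverse_transform(predicted)[0]
print("Predicted class:", predicted_label)
```

Keeping one encoder per column matters: the same encoders must be reused on new data so that each string maps to the integer seen during training.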

##### How it works
The **random forest** was developed by Leo Breiman (University of California, Berkeley, USA) based on the use of classification trees. He has extended the classification tree technique by integrating it into a Monte Carlo simulation procedure and named it **random forest**. It is based on the creation of a large set of tree classifiers, each of which is proposed to classify a single instance, wherein some features have been evaluated. Comparing the classification proposals provided by each tree in the forest shows the class to which to attribute the request: it is the one that received the most votes.

##### There's more
Random forest has three adjustment parameters: the number of trees, the minimum amplitude of the terminal nodes, and the number of variables sampled in each node. The absence of overfitting makes the first two parameters important only from a computational point of view.

#### **Extracting validation curves**

We used random forests to build a classifier in the previous recipe, Evaluating cars based on their characteristics, but we don't exactly know how to define the parameters. In our case, we dealt with two parameters: `n_estimators` and `max_depth`. They are called **hyperparameters**, and the performance of the classifier depends on them. **It would be nice to see how the performance gets affected as we change the hyperparameters**. This is where **validation curves come into the picture**. 

##### Getting ready

Validation curves help us understand how each hyperparameter influences the training score. Basically, all other parameters are kept constant and we vary the hyperparameter of interest according to our range. We will then be able to visualize how this affects the score.

##### How to do it
Let's see how to extract validation curves

We'll continue using the previous Python file, `car.py`.

##### How it works

In this recipe, we used the `validation_curve` function of the scikit-learn library to plot the validation curve. This function determines training and test scores for varying parameter values and computes scores for an estimator with different values of a specified parameter. 
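A minimal sketch of a call to `validation_curve` is shown below. It uses synthetic data from `make_classification` instead of the car dataset, and the parameter range and number of folds are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic classification data standing in for the encoded car dataset
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           random_state=7)

# Keep max_depth fixed and vary only n_estimators
classifier = RandomForestClassifier(max_depth=4, random_state=7)
param_range = [10, 25, 50]
train_scores, validation_scores = validation_curve(
    classifier, X, y, param_name="n_estimators",
    param_range=param_range, cv=5)

# One row per parameter value, one column per cross-validation fold
for value, row in zip(param_range, validation_scores):
    print(f"n_estimators={value}: mean validation score {row.mean():.3f}")
```

Plotting the mean of each row against `param_range` gives the validation curve itself.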

##### There's more
Choosing an estimator's hyperparameters is a fundamental procedure for setting up a model. Among the available procedures, **grid search is one of the most used**. This procedure **selects the hyperparameter with the maximum score on a validation set or a multiple validation set**.

#### **Extracting learning curves**

Learning curves help us **understand how the size of our training dataset influences the machine learning model**. This is very useful when you have to **deal with computational constraints**. Let's go ahead and plot learning curves by varying the size of our training dataset.

##### Getting ready
A learning curve shows the validation and training score of an estimator for varying numbers of training samples.

##### How to do it
Let's see how to extract learning curves

We'll continue in the same file `car.py`

##### How it works
In this recipe, we used the `learning_curve` function of the scikit-learn library to plot the learning curve. This function determines cross-validated training and testing scores for different training set sizes.
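As a sketch, scikit-learn's `learning_curve` function computes cross-validated scores for different training set sizes and can be called as follows. The synthetic data, the training sizes, and the hyperparameters are assumptions made for this illustration and differ from the car dataset used in the recipe.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic classification data standing in for the encoded car dataset
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           random_state=7)

classifier = RandomForestClassifier(n_estimators=25, max_depth=4, random_state=7)

# With 200 samples and 5 folds, at most 160 samples are available for training
train_sizes, train_scores, validation_scores = learning_curve(
    classifier, X, y, train_sizes=[40, 80, 120, 160], cv=5)

# One row per training size, one column per cross-validation fold
for size, row in zip(train_sizes, validation_scores):
    print(f"{size} training samples: mean validation score {row.mean():.3f}")
```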

##### There's more
A learning curve allows us to check whether the addition of training data leads to a benefit. It also allows us to estimate the contribution deriving from variance error and bias error. If the validation score and the training score both converge to a value that is too low as the size of the training set increases, we will not benefit from further training data.

#### **Estimating the income bracket**

We will build a classifier to estimate the income bracket of a person based on 14 attributes. The possible output classes are higher than $50,000$ or lower than or equal to $50,000$. There is a slight twist in this dataset, in the sense that each datapoint is a mixture of numbers and strings. **Numerical data is valuable, and we cannot use a label encoder in these situations**. We need to design a system that can deal with numerical and non-numerical data at the same time.

##### Getting ready
We will use the census income dataset available [here](https://archive.ics.uci.edu/ml/datasets/census+income).

The dataset has the following characteristics:
* Number of instances: $48,842$
* Number of attributes: $14$

The following is a list of attributes:

* **Age**: continuous
* **Workclass**: text
* **fnlwgt**: continuous
* **Education**: text
* **Education-num**: continuous
* **Marital-status**: text
* **Occupation**: text
* **Relationship**: text
* **Race**: text
* **Sex**: female or male
* **Capital-gain**: continuous
* **Capital-loss**: continuous
* **Hours-per-week**: continuous
* **Native-country**: text

##### How to do it
Let's see how to estimate the income bracket.

We'll use the `income.py` file.
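One possible way to handle mixed data is to keep numerical columns as numbers and label-encode only the text columns. The sketch below illustrates that idea on a few hypothetical rows with a reduced set of columns; it is not the contents of `income.py`, and the use of `GaussianNB` here is an assumption.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows: age, workclass, education, hours-per-week, income bracket
rows = [
    ["39", "State-gov", "Bachelors", "40", "<=50K"],
    ["50", "Self-emp", "Bachelors", "13", "<=50K"],
    ["38", "Private", "HS-grad", "40", "<=50K"],
    ["52", "Self-emp", "HS-grad", "45", ">50K"],
    ["31", "Private", "Masters", "50", ">50K"],
    ["42", "Private", "Bachelors", "40", ">50K"],
]
data = np.array(rows)

encoded = np.empty(data.shape, dtype=int)
encoders = {}
for col in range(data.shape[1]):
    column = data[:, col]
    if all(value.isdigit() for value in column):
        # Numerical column: keep the values as numbers
        encoded[:, col] = column.astype(int)
    else:
        # Text column: map each distinct string to an integer
        encoders[col] = LabelEncoder()
        encoded[:, col] = encoders[col].fit_transform(column)

X, y = encoded[:, :-1], encoded[:, -1]

# Fit a Naive Bayes classifier on the mixed, now fully numerical, data
classifier = GaussianNB()
classifier.fit(X, y)
print("Encoded features:")
print(X)
```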

##### How it works
The underlying principle of a Bayesian classifier is that some individuals belong to a class of interest with a given probability based on some observations. This probability is based on the assumption that the characteristics observed can be dependent or independent from one another; in the second case, the Bayesian classifier is called naive because it assumes that the presence or absence of a particular characteristic in a given class of interest is not related to the presence or absence of other characteristics, greatly simplifying the calculation. In this recipe, we relied on this principle to build a Naive Bayes classifier.

##### There's more
The concept of Bayes applied to classification is very intuitive: if I look at a particular measurable feature, I can estimate the probability that this feature represents a certain class after the observation.

#### **Predicting the quality of wine**

In this recipe, we will predict the quality of wine based on the chemical properties of wines grown. The code uses a wine dataset, which contains a DataFrame with 177 rows and 13 columns; the first column contains the class labels. This data is obtained from the chemical analyses of wines grown in the same region in Italy (Piemonte) but derived from three different cultivars: the Nebbiolo, Barbera, and Grignolino grapes. The wine from the Nebbiolo grape is called Barolo.

##### Getting ready
The data consists of the amounts of several constituents found in each of the three types of wine, as well as some spectroscopic variables. The attributes are as follows:

* Alcohol 
* Malic acid 
* Ash 
* Alcalinity of ash 
* Magnesium 
* Total phenols 
* Flavanoids 
* Nonflavanoid phenols 
* Proanthocyanins 
* Color intensity 
* Hue 
* OD280/OD315 of diluted wines 
* Proline 

The first column of the DataFrame contains the class which indicates one of three types of wine as (0, 1, or 2).

##### How to do it
Let's see how to predict the quality of wine.

We'll use the `wine_quality.py` file.
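A decision tree classifier for this kind of data can be sketched as follows. As an assumption, the example loads the copy of the wine data bundled with scikit-learn through `load_wine` instead of the file used by the recipe, so the number of rows may differ from the DataFrame described above. The split ratio and tree depth are arbitrary choices.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Bundled wine data: 13 chemical attributes and a class label (0, 1, or 2)
wine = load_wine()
X, y = wine.data, wine.target

# Hold out 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5)

# Fit a decision tree with a limited depth (illustrative value)
classifier = DecisionTreeClassifier(max_depth=5, random_state=5)
classifier.fit(X_train, y_train)

accuracy = classifier.score(X_test, y_test)
print(f"Test accuracy: {100 * accuracy:.2f}%")
```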

##### How it works
In this recipe, the quality of wine based on the chemical properties of wines grown was predicted. To do this, a decision tree algorithm was used. A decision tree shows graphically the choices made or proposed. It does not happen so often that things are so clear that the choice between two solutions is immediate. Often, a decision is determined by a series of cascading conditions. Representing this concept with tables and numbers is difficult. In fact, even if a table represents a phenomenon, it may confuse the reader because the justification for the choice is not obvious. 

##### There's more
A tree structure allows us to extract the information with clear legibility by highlighting the branch we have inserted to determine the choice or evaluation. Decision tree technology is useful for identifying a strategy or pursuing a goal by creating a model with probable results. The decision tree graph immediately orients the reading of the result. A plot is much more eloquent than a table full of numbers. The human mind prefers to see a solution first and then go back to understand a justification of the solution, instead of a series of algebraic descriptions, percentages, and data to describe a result.

#### **Newsgroup trending topics classification**

Newsgroups are discussion groups on many issues and are made available by news-servers, located all over the world, which collect messages from clients and transmit them, on the one hand, to all their users and, on the other, to other news-servers connected to the network. The success of this technology is due to user interaction in discussions. Everyone has to respect the rules of the group.

##### Getting ready
In this recipe, we will build a classifier that will allow us to classify the membership of a topic into a particular discussion group. This operation will be useful to verify whether the topic is relevant to the discussion group. We will use the data contained in the 20 newsgroups dataset, available at the following URL: [download here](http://qwone.com/~jason/20Newsgroups/).

This is a collection of about 20,000 newsgroup documents, divided into 20 different newsgroups. Originally collected by Ken Lang and published in the paper *NewsWeeder: Learning to filter netnews*, the dataset is particularly useful for dealing with text classification problems.

##### How to do it
In this recipe, we'll learn how to perform newsgroup trending topics classification.

We'll use the `post_classification.py` file.

##### How it works
In this recipe, we built a classifier to classify the membership of a topic into a particular discussion group. To extract features from the text, a **tokenization** procedure was needed. In the tokenization phase, within each single sentence, atomic elements called **tokens** are identified; based on the token identified, it's possible to carry out an analysis and evaluation of the sentence itself. Once the characteristics of the text had been extracted, a classifier based on the multinomial Naive Bayes algorithm was constructed.
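The pipeline described here, tokenization followed by a multinomial Naive Bayes classifier, can be sketched on a tiny in-memory corpus. The sentences below are made up for illustration and only borrow two category names from the dataset; the actual recipe works on the full 20 newsgroups data and may include further steps.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical training corpus with two newsgroup-style categories
train_docs = [
    "the team won the hockey game last night",
    "the goalie made a great save in the game",
    "the new graphics card renders images faster",
    "this software draws three dimensional graphics",
]
train_labels = ["rec.sport.hockey", "rec.sport.hockey",
                "comp.graphics", "comp.graphics"]

# Tokenize the text and count how often each token occurs in each document
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Train a multinomial Naive Bayes classifier on the token counts
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

# Classify new sentences using the vocabulary learned from the training data
new_docs = ["who won the hockey game", "the graphics software renders images"]
predicted = classifier.predict(vectorizer.transform(new_docs))
for doc, label in zip(new_docs, predicted):
    print(f"{doc!r} -> {label}")
```

Words that never appeared in the training corpus are simply ignored by `transform`, which is why the vectorizer fitted on the training data must be reused for new text.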

##### There's more
The Naive Bayes multinomial algorithm is used for text and images when features represent the frequency of words (textual or visual) in a document.