<a href="https://colab.research.google.com/github/LeonardoGoncRibeiro/06_MachineLearning/blob/main/01_Basic/02_FurtherIntoClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning: Further into Classification

In this course, we go further into classification with Machine Learning and see how to use it to solve real-world problems. In this project, we will use the following packages:





In [79]:
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier

from sklearn.metrics import accuracy_score

# Classification algorithms

A classification algorithm assigns something to a class based on a set of explanatory features. This is similar to how humans perceive something and then make a judgement about it based on its characteristics. For instance, when we see an animal, we can use its characteristics to classify it as a cat, a dog, a pig, or something else.

Similarly, when we see an e-mail in our inbox, we can usually tell whether it is spam or not. Can we teach a computer to identify spam?

We could use several pieces of information about the e-mail to classify it:

* E-mail size;
* Whether the e-mail has an attachment;
* Whether the sender was previously classified as a spammer;
* Time the e-mail was sent;
* And others.

Since models operate on numbers, yes/no information like this is commonly encoded as 0 (false) or 1 (true).
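As a minimal sketch of this encoding, the cell below builds a tiny table of e-mails and converts its yes/no columns to 0/1. The column names and values are made up for illustration and do not come from any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical e-mails: every column name and value here is invented for illustration.
emails = pd.DataFrame({
    "size_kb":        [12, 850, 3],          # numeric feature, kept as-is
    "has_attachment": [False, True, False],  # yes/no information
    "known_spammer":  [False, True, True],   # yes/no information
})

# Encode the yes/no columns as 0 (false) / 1 (true).
emails = emails.astype({"has_attachment": int, "known_spammer": int})

print(emails)
```

After the conversion, every column is numeric and the table can be passed to a scikit-learn model.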

No classification algorithm is perfect: every model makes some errors, and its predictions always carry uncertainty. We should therefore always evaluate the accuracy of a model to understand how good it is.

Other common examples of real-world classification problems are: based on the pages a user has visited, will the user buy our product? Will a client leave our company (churn)?



## Code example

As an example, let's look at a dataset that records which pages each user visited on our site and whether that user bought an item:

In [4]:
df = pd.read_csv('data_site.csv')
df.columns = ['home', 'how_it_works', 'contact', 'bought']

In [5]:
df.head( )

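|   | home | how_it_works | contact | bought |
|---|------|--------------|---------|--------|
| 0 | 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 |
| 3 | 1 | 1 | 0 | 0 |
| 4 | 1 | 1 | 0 | 0 |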


Nice! Here, we want to predict whether someone has bought an item on our site, so the column 'bought' is the target of our model. We can therefore separate the data into explanatory features $X$ and a target $y$:

In [6]:
y = df.bought
X = df.drop('bought', axis = 1)

Nice! Now, let's use our data to train a very simple Naive Bayes model.

In [7]:
model_NB = MultinomialNB( )
model_NB.fit(X, y)

MultinomialNB()

We can get the accuracy of our model using:

In [8]:
y_pred = model_NB.predict(X)

diff = y - y_pred

hits  = diff.value_counts( ).loc[0]
total = len(diff)
acc = hits/total

print("Accuracy: {:.2f}%".format(acc*100))

Accuracy: 93.94%


So, 93.94% of our predictions were correct! Note, however, that we trained the algorithm on a dataset and then tested the model on that same dataset. This is not good practice: a model will usually be good at predicting the data it was trained on. We need a separate test set to get a realistic sense of the model's accuracy on data it has never seen.

To improve this analysis, we can first split our data into a training and testing set:

In [9]:
SEED = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, stratify = y, random_state = SEED)

Note that the test set is 10% of the total number of entries:

In [10]:
X.shape

(99, 3)

In [11]:
X_test.shape

(10, 3)

In [12]:
X_train.shape

(89, 3)
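The `stratify = y` argument used in the split above asks `train_test_split` to keep the class proportions of `y` roughly equal in the training and test sets. The sketch below illustrates this on made-up data (100 rows, 20% positives); the names `X_toy` and `y_toy` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, made-up data: 100 rows, 20% of them labelled 1.
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([1] * 20 + [0] * 80, name="bought")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, stratify=y_toy, random_state=42
)

# The share of positives should be the same in every subset.
print("positives overall: ", y_toy.mean())
print("positives in train:", y_tr.mean())
print("positives in test: ", y_te.mean())
```

Without `stratify`, a small test set could end up with very few (or even zero) entries of the less frequent class, which makes the measured accuracy harder to interpret.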

Now, we train our model based on the training set:

In [13]:
model_NB = MultinomialNB( )
model_NB.fit(X_train, y_train)

MultinomialNB()

And we test our model based on the test set:

In [14]:
y_pred = model_NB.predict(X_test)

diff = y_test - y_pred

hits  = diff.value_counts( ).loc[0]
total = len(diff)
acc = hits/total

print("Accuracy: {:.2f}%".format(acc*100))

Accuracy: 90.00%


This time, the accuracy was 90%. That is still high, but keep in mind that the test set has only 10 entries, so each prediction is worth 10 percentage points.
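The hit counting done by hand in the cells above can be cross-checked against `accuracy_score` from scikit-learn. The sketch below uses made-up labels and predictions (`y_true_toy` and `y_pred_toy` are hypothetical names) and computes the accuracy both ways.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions, for illustration only.
y_true_toy = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred_toy = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])

# Manual version: a prediction is a hit when it equals the true label.
manual_acc = (y_true_toy == y_pred_toy).mean()

# Library version of the same quantity.
sklearn_acc = accuracy_score(y_true_toy, y_pred_toy)

print("Manual accuracy:  {:.2f}%".format(manual_acc * 100))
print("sklearn accuracy: {:.2f}%".format(sklearn_acc * 100))
```

Comparing with `==` also works for non-numeric labels, whereas subtracting the predictions from the labels only works for numbers, and looking up `.loc[0]` in the value counts raises a `KeyError` when there are no hits at all.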

## Non-binary features

Features are not always binary. They can take several discrete values, or even continuous values. Discrete features can be ordinal or nominal: the values of an ordinal feature have a natural order, while the values of a nominal feature do not.

When features are categorical, we can map the possible categories to numerical values so that the model can work with them. For instance, if we have a feature "Programming language", which can be "Python", "C++", or "Java", we could map these to 0, 1, and 2. However, this feature is nominal, and integer codes impose an artificial order and distance between the categories (the model would see Java as "greater than" Python), which can mislead it. In these cases, it is better to transform the "Programming language" feature into three binary features: one for Python, one for C++, and one for Java.

Let's see an example:

In [69]:
df = pd.read_csv('data_site_alura.csv')
df.columns = ['home', 'search', 'logged_in', 'bought']

In [70]:
df.head( )

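|   | home | search | logged_in | bought |
|---|------|--------|-----------|--------|
| 0 | 0 | algoritmos | 1 | 1 |
| 1 | 0 | java | 0 | 1 |
| 2 | 1 | algoritmos | 0 | 1 |
| 3 | 1 | ruby | 1 | 0 |
| 4 | 1 | ruby | 0 | 1 |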


We want to create dummy (binary) features from the "search" column. By default, `pd.get_dummies` does this for every non-numeric column, so we can do:

In [71]:
df = pd.get_dummies(df)

In [72]:
df.head( )

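|   | home | logged_in | bought | search_algoritmos | search_java | search_ruby |
|---|------|-----------|--------|-------------------|-------------|-------------|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 1 | 1 | 0 | 0 |
| 3 | 1 | 1 | 0 | 0 | 0 | 1 |
| 4 | 1 | 0 | 1 | 0 | 0 | 1 |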


Nice! We have transformed a nominal categorical feature into several binary features. Now, let's separate our explanatory features from our target, then split our data:

In [73]:
y = df.bought
X = df.drop('bought', axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = SEED)

Finally, we train our model and evaluate its accuracy. This time, instead of counting hits by hand, we use the `accuracy_score` function from scikit-learn:

In [74]:
model_NB = MultinomialNB( )
model_NB.fit(X_train, y_train)

MultinomialNB()

In [75]:
y_pred = model_NB.predict(X_test)

acc = accuracy_score(y_test, y_pred)

print("Accuracy: {:.2f}%".format(acc*100))

Accuracy: 82.00%


Nice! So, our model is right 82% of the time. That is lower than the accuracy we got on the previous dataset, but accuracies on different datasets are not directly comparable. We need a criterion to decide whether the algorithm is good enough on *this* dataset.

Thus, we need a baseline algorithm to compare against. To that end, we can use a dummy classifier that always predicts the most frequent class:

In [67]:
dummy_clf = DummyClassifier(strategy = 'most_frequent')
dummy_clf.fit(X_train, y_train)

y_pred = dummy_clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print("Accuracy: {:.2f}%".format(acc*100))

Accuracy: 82.00%


So the classifier that simply picks the most frequent class reaches exactly the same accuracy as Naive Bayes! In other words, Naive Bayes does no better than the baseline here, so its 82% is not a good result.
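The accuracy of a most-frequent baseline is simply the share of the test set that belongs to the class that was most frequent in training. The sketch below checks this on made-up data; all arrays and names in it are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up data: the feature values are irrelevant to a most-frequent baseline.
X_tr_toy = np.zeros((50, 1))
y_tr_toy = np.array([1] * 40 + [0] * 10)   # class 1 is the majority in training
X_te_toy = np.zeros((10, 1))
y_te_toy = np.array([1] * 7 + [0] * 3)     # 70% of the test labels are class 1

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr_toy, y_tr_toy)

baseline_acc = accuracy_score(y_te_toy, baseline.predict(X_te_toy))
majority_share = (y_te_toy == 1).mean()

print("Baseline accuracy:        {:.2f}%".format(baseline_acc * 100))
print("Share of class 1 in test: {:.2f}%".format(majority_share * 100))
```

This is why a high accuracy on an imbalanced dataset means little on its own: it has to be read against this share.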

In fact, the Naive Bayes model makes exactly the same predictions as the dummy classifier. It predicts class 1 for every entry in the test set:

In [78]:
y_pred

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Let's understand more about the Naive Bayes algorithm.

## Naive Bayes algorithm

Multinomial Naive Bayes is a very simple algorithm. It starts from the prior probability of each class, which is just the class frequency in the training set and does not depend on the explanatory features. It then updates this prior with the likelihood of the observed features under each class, assuming ("naively") that the features are independent of each other given the class. When the features carry little information about the target, the posterior stays close to the prior and the model ends up predicting the most frequent class for every entry. That is most likely what happened above, which is why the predictions match those of the dummy classifier.

A different rule, which ignores the features entirely, is to pick a class at random with probabilities equal to the class frequencies. For instance, if 82% of the data is 1 and 18% is 0, we generate a uniform random number between 0 and 1 and choose class 1 if it is lower than 0.82, and class 0 otherwise. In scikit-learn, this is the `stratified` strategy of `DummyClassifier`.

Naive Bayes itself uses the maximum a posteriori rule: it predicts the class with the highest posterior probability.

The posterior probability is a conditional probability: given that the user accessed the home page, *this* is the probability that the user bought or not. It is obtained from the prior and the likelihood through Bayes' rule, $P(y \mid x) \propto P(y)\,P(x \mid y)$.
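To see the difference between prior and posterior in practice, the sketch below fits `MultinomialNB` on made-up data and prints both. The two columns are a hypothetical one-hot pair (`visited_pricing`, `did_not_visit`); none of this comes from the datasets used in this notebook.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up data. Columns are a hypothetical one-hot pair: [visited_pricing, did_not_visit].
buyers     = [[1, 0]] * 15 + [[0, 1]] * 1   # 16 buyers, most of them visited the page
non_buyers = [[0, 1]] * 4                   # 4 non-buyers, none of them visited it
X_toy = np.array(buyers + non_buyers)
y_toy = np.array([1] * 16 + [0] * 4)

model = MultinomialNB()
model.fit(X_toy, y_toy)

# Prior: class frequencies in the training set, in the order of model.classes_.
priors = np.exp(model.class_log_prior_)
print("classes:", model.classes_)
print("priors: ", priors)

# Posterior: P(class | features) for a visitor and a non-visitor of the page.
queries = np.array([[1, 0], [0, 1]])
print("posteriors:\n", model.predict_proba(queries))
print("predictions:", model.predict(queries))
```

In this toy case the prior favours class 1, yet the model predicts class 0 for the user who did not visit the page, because the feature is informative enough to overturn the prior. When features are only weakly related to the target, the prior wins and every prediction is the most frequent class.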

One of the advantages of Naive Bayes is that it is a probability-based model that is very easy (and fast) to train. It is also often used to classify text, for example in spam detection.

## Using a better algorithm

Ok, we saw that the Naive Bayes algorithm did not show great accuracy on our data. We can try several things to improve on it, such as:

* Remove or add new features.
* Add more data.
* Test other models.
* Test the same model with different parameters.

Note that the more we test, the more time we spend. Also, if we run many tests and keep the configuration with the highest accuracy on the test set, we are indirectly fitting our model to the test set. For this reason, it is important to use a third set, known as the validation set: we compare configurations on the validation set and keep the test set aside to check whether the accuracy of the chosen model **really** carries over to real-world data.
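One possible way to set this up is to call `train_test_split` twice: once to hold out the test set, and once more to split the rest into training and validation sets. The sketch below does this on synthetic, made-up data and uses the validation set to choose `n_estimators` for an `AdaBoostClassifier`. The data, the candidate values, and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (made up): 300 rows, 4 binary features, and a label
# that depends on the first two features, with about 10% of labels flipped.
rng = np.random.default_rng(42)
X_all = rng.integers(0, 2, size=(300, 4))
noise = rng.random(300) < 0.1
y_all = ((X_all[:, 0] & X_all[:, 1]).astype(bool) ^ noise).astype(int)

# First hold out the test set, then split the remainder into train and validation.
X_rest, X_test_s, y_rest, y_test_s = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=42
)
X_train_s, X_val_s, y_train_s, y_val_s = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

# Choose a configuration using the validation set only.
val_scores = {}
for n_estimators in (10, 50, 100):
    candidate = AdaBoostClassifier(n_estimators=n_estimators, random_state=42)
    candidate.fit(X_train_s, y_train_s)
    val_scores[n_estimators] = accuracy_score(y_val_s, candidate.predict(X_val_s))

best_n = max(val_scores, key=val_scores.get)

# Touch the test set once, with the chosen configuration.
final_model = AdaBoostClassifier(n_estimators=best_n, random_state=42)
final_model.fit(X_train_s, y_train_s)
test_acc = accuracy_score(y_test_s, final_model.predict(X_test_s))

print("Validation accuracies:", val_scores)
print("Chosen n_estimators:", best_n)
print("Test accuracy: {:.2f}%".format(test_acc * 100))
```

With these proportions, the data ends up split 60% / 20% / 20% between training, validation, and test.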

Here, we will try the AdaBoost algorithm and check whether it improves our classification:

In [80]:
model_ada = AdaBoostClassifier( )
model_ada.fit(X_train, y_train)

y_pred = model_ada.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print("Accuracy: {:.2f}%".format(acc*100))

Accuracy: 88.00%


So AdaBoost does improve on both the dummy baseline and Naive Bayes: 88% against 82% on the same test set!