# Lab2.1 Machine learning basics

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook explains the basics of machine learning. By the end of this notebook, you will have learned:

- the basic principles of machine learning for text classification
- how features are represented as vectors
- how to train a classifier from vector representations
- how to train and apply a classifier to text represented by its words
- what a bag-of-words representation is
- what the information value (TF*IDF) of a word is

**Background reading:**

NLTK Book
Chapter 6, sections 1 and 3: https://www.nltk.org/book/ch06.html



## 1. Introduction 

### Machine Learning schema

The overall process of machine learning for classification is shown in the next image that is taken from Chapter 6 of the NLTK book. In general, machine learning consists of a training phase in which an algorithm associates data features with certain labels (e.g. sentiment, part-of-speech). The training results in a classifier model that can be applied to unseen data. The classifier compares the features of the unseen data with the previously seen data and makes a prediction of the label on the basis of some similarity calculation.

![title](images/machine-learning-schema.png)


Crucial in this process are 1) the features that represent the data and 2) the algorithm that is used. In this course, we will not discuss the various machine learning algorithms in depth; rather, we focus on the text features and how they are represented as 'vectors'. Since we are working with text, which is not a vector representation, we need to define the features that characterize the text and decide how to transform these features into a feature vector representation that the algorithm and model can handle. In order to compare the unseen text with the training texts, it is crucial that features are extracted and represented in the same way during training and application.

#### Preparations

We are going to use the Scikit-learn package to transform the diverse feature values into a vector representation:

https://scikit-learn.org/stable/install.html

Scikit-learn is a package that contains a lot of machine learning algorithms and functions for dealing with different types of features as well as carrying out evaluation and error analysis. To install it run one of the following commands from the command line:

- conda install scikit-learn

or

- pip install -U scikit-learn

We are also using a package called "NumPy", which is a package for scientific computing that is particularly suitable for working with multi-dimensional data: https://numpy.org.

Install NumPy from the command line following the instructions on the website. After installing, you can import it.

### 1.1 Vector representations


Before we turn to a text example, we are going to use a very simple data set. We show how to train and evaluate an SVM (Support-Vector-Machine) using a made-up example of multi-class classification for a non-linguistic dataset. The goal is to predict someone's weight category (say: skinny, fit, average, overweight) based on their properties.

We use three features:
* **age in years**
* **height in cms**
* **number of ice cream cones eaten per year**


The feature representation (for 5 people) is an array of arrays*. Each instance (or person) is represented by an array of numbers in which the first is the age, the second the height in cms and the third the number of cones per year: 

\* for those of you interested in technicalities: in Python it is technically a list of lists here, which we can convert to an array using Numpy.

In [1]:
X = [[30, 180, 1000], 
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]
    ]

The first person is thus 30 years old, 180 cms tall and eats 1000 cones per year. The next command prints the data for the first instance.

In [3]:
print('First instance in the data set X :', X[0])

First instance in the data set X : [30, 180, 1000]


An array of numbers in which each position holds a value for a specific feature is what we call a feature vector. For all our data in the data set we must have a feature vector of the same length. If there is no value, it will be zero.
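As a small aside on the footnote above, the list of lists can be converted to a NumPy array, whose shape makes the equal-length requirement explicit. This is only a minimal sketch; the name `X_array` is introduced here for illustration and is not used elsewhere in the notebook.

```python
import numpy as np

# the same five instances as above: [age, height in cm, cones per year]
X = [[30, 180, 1000],
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]]

X_array = np.array(X)   # list of lists -> 2-dimensional array
print(X_array.shape)    # (5, 3): 5 instances with 3 features each
print(X_array[:, 1])    # the second column holds the height of every instance
```

If one of the inner lists had a different length, NumPy could not build a regular 5 x 3 array, which is why every instance needs a value (possibly zero) for every feature.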

In addition to the data that is now assigned to the variable 'X', we also need to have the label that goes with the instances. For this we use another array with the values that we assign to the variable 'Y'. 

In [6]:
Y = ["overweight", 
     "skinny",
     "fit",
     "average",
     "average"]

We need to have as many values as we have instances in our data set, as the software pairs the elements in X with the elements in Y. Obviously, the values should also be in the correct order to correspond with the instances!

In [7]:
print('The length of the data set =', len(X))
print('The length of the labels =', len(Y))
print('The first label =', Y[0])

The length of the data set = 5
The length of the labels = 5
The first label = overweight


A nice function to pair lists in Python is the "zip" function, which yields tuples that pair the elements of two lists. We can use this to pair the instances with their labels:

In [8]:
for instance, label in zip(X, Y):
    print(instance, label)

[30, 180, 1000] overweight
[80, 180, 100] skinny
[50, 180, 100] fit
[40, 160, 500] average
[15, 160, 400] average


### 1.2 Using Scikit-learn to build a classifier

Now that we have the data and the labels, we can train a model. We are going to use the **svm** module from **sklearn**, from which we will select the **LinearSVC** (Linear Support Vector Classification) class. Support Vector Machines or SVMs are powerful supervised machine learning approaches that find the optimal division (a so-called hyperplane in a multidimensional data space) between positive and negative examples of a class. For now it is not important to know the details of this algorithm. You will learn about that in the machine learning class. We instantiate a model with the variable name 'lin_classifier' (any name will do, and you can instantiate as many variables as you want until you run out of memory). We will use this instance for training and classifying.

In [10]:
from sklearn import svm

lin_classifier = svm.LinearSVC()

Now we train the model by feeding it with the data set 'X' and the labels 'Y'. For this we use ``fit()``.

In [12]:
lin_classifier.fit(X,Y)



LinearSVC()

Calling the fit function gives a response that shows the (default) parameter settings of this model.

When you train the model through the 'fit' command, you might get a warning stating that:
```
ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
```
This is to be expected here, given that we only train using five instances.
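Besides the small data set, the very different scales of the three features (ages below 100, cone counts up to 1000) may also make convergence harder. The following is a sketch of one common mitigation, not part of the original lab: it standardizes the features and raises the iteration limit. The name `scaled_classifier` and the value `max_iter=10000` are assumptions chosen for illustration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X = [[30, 180, 1000],
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]]
Y = ["overweight", "skinny", "fit", "average", "average"]

# StandardScaler rescales every feature to zero mean and unit variance;
# max_iter raises the iteration limit that the warning refers to
scaled_classifier = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
scaled_classifier.fit(X, Y)

# the pipeline applies the same scaling to unseen instances before predicting
print(scaled_classifier.predict([[18, 171, 400]]))
```

Because the scaler is part of the pipeline, new instances are scaled with the statistics learned from the training data, so training and application stay consistent.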

### 1.3 Using Scikit-learn to classify unseen data

Let's now apply the model to a new instance 'Z'. For this we use ``predict()``.
What does our trained SVM instance think about the weight category of an instance who is 18 years old, 171cm tall, and who eats 400 ice cream cones per year?

In [13]:
Z=[[18, 171, 400]] # an array containing exactly one feature vector
predicted_label = lin_classifier.predict(Z)
print(predicted_label)

['average']


Apparently the SVM instance thinks it is **average**, which is not surprising since **number of ice cream cones eaten per year** and **height** seem to correlate highly with the weight categories.

Note that as people, we reason with some (weak) causal explanatory model. Our SVM does not - it only uses data patterns and association.

## 2. Representing a text as a Bag-Of-Words

A critical component of almost any machine learning approach is **feature representation**. 
This is not strange since we need to somehow convert a textual unit, e.g., word, sentence, tweet, or document, into something meaningful that can not only be interpreted by a computer, but is also useful for the type of learning we want to do. 

A text consists of a sequence of words on which we impose syntax and semantics. A machine needs to learn to associate the structural properties of the text to some interpretation.
We can use various properties to do this:

- the words (regardless of the order or in order)
- the words and their frequency
- the part-of-speech of words
- grammatical relations such as dependencies
- word pairs, sequences of three words, four words, etc. (so-called word n-grams)
- the characters that make up the words (so-called character n-grams)
- sentences with words
- phrases
- the meaning of words
- the meaning of combinations of words
- word length, sentence length
- word position in a text
- discourse structure: title, header, caption, body, conclusion sections
- etc....

Some of the above properties we get for free if we split a text into tokens (the words), e.g. by using spaces. Still, we need to consider what to do with punctuation and how to treat upper/lower case (the word shape). Other properties are not explicit, such as the part-of-speech of words, phrases, syntax and the meaning.
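A few of the properties listed above can be derived with plain Python. The sketch below is only an illustration, assuming naive splitting on spaces and lowercasing; the variable names are made up for this example.

```python
text = "A rose is a rose"

# splitting on spaces gives the tokens; lowercasing merges "A" and "a"
tokens = text.lower().split()

# word bigrams: every pair of adjacent tokens
word_bigrams = list(zip(tokens, tokens[1:]))

# character trigrams: every window of three characters (spaces included)
char_trigrams = [text[i:i + 3] for i in range(len(text) - 2)]

print(tokens)
print(word_bigrams)
print(char_trigrams[:4])
```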

For now, we are only considering the words of a text as features. In fact, we are going to ignore the order of the words and consider a text as a *Bag-Of-Words*.

**If you want to learn more: (information from these blogs was used in this notebook)**
* [bag of words introduction](http://www.insightsbot.com/blog/R8fu5/bag-of-words-algorithm-in-python-introduction)
* [TF-IDF introduction](https://medium.freecodecamp.org/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3)
* [another TF-IDF introduction](https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/)

In the next notebook of this course, we explain how other features can be combined with a word representation.

### 2.1 Bag of words

We are going to create a vector representation of a text in which the words are the features that characterize the content of the text. To keep things simple, we ignore the order of the words, but we do want to know how often a word occurs in a document so that we can give it a weight.

In our vector representation we want each word to occupy a unique position in the array, just like the age [0], height [1] and number of cones [2] in our first example. That means that our vector needs to be as long as the number of unique words that we find in the texts. 

The first thing we therefore need to do is to create a word-to-document index:

* 1 we extract all the unique words from a collection of textual units, e.g., documents
* 2 we compute the frequency of each word in each document

Knowing the full vocabulary of all the documents, we can create a vector whose length equals the size of the vocabulary and in which each position corresponds to one vocabulary word. 

Next, we represent each document (an instance of a text) as a row, scoring each position with the frequency of the corresponding word in that document. Instead of just counting each word, we can also weight the information value of the word for the document, thus using the *TF.IDF* value.
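The two steps can be sketched in plain Python before handing the work to a library. The two documents and all variable names below are made up for illustration, and tokenization is done naively by lowercasing and splitting on spaces.

```python
from collections import Counter

# two made-up documents, tokenized naively
documents = ["the cat sat", "the dog sat down"]
tokenized = [doc.lower().split() for doc in documents]

# step 1: the vocabulary is the sorted set of all unique words
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# step 2: one row per document, one position per vocabulary word
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)  # a Counter returns 0 for words it has not seen
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['cat', 'dog', 'down', 'sat', 'the']
for vector in vectors:
    print(vector)  # [1, 0, 0, 1, 1] and then [0, 1, 1, 1, 1]
```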

Let's look at an example.

To do all the above, we will use two classes from sklearn that do all the work:

* CountVectorizer: turns a text data set (text, numbers) into a vector representation consisting of a vector array and a vocabulary that relates each data point to the corresponding vector array position
* TfidfTransformer: calculates the *TF.IDF* values from the basic statistics

We also need the NLTK package from the previous notebooks.

In [9]:
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import nltk

Let's try this for the following three sentences that we list in an array (note that sentences can also be complete documents).

In [10]:
sents = ['A rose is a rose',
         'A rose is stinks',
         "A book is nice"]

We have three instances of text, with words that occur across the texts at different frequencies. We will use the **CountVectorizer** to create the bag-of-words representation from the above array. We set two parameters when we create an instance of the CountVectorizer: 1) the minimum number of documents in which a term should occur and 2) the tokenizer that should be used.

We create the instance *vectorizer* and feed it with our sentences to derive the data arrays for the instances with the function *fit_transform*. This will give us two things:

* a data structure that represents the instances through their vectors
* the vocabulary that maps to the columns of the data structure


In [11]:
# you can adapt min_df to restrict the representation to more frequent words e.g. 2, 3, etc..

vectorizer = CountVectorizer(min_df=1, # in how many documents the term minimally occurs
                             tokenizer=nltk.word_tokenize) # we use the nltk tokenizer to split the text into tokens
sents_vector_data = vectorizer.fit_transform(sents)

Let us now inspect the data created by the *vectorizer*. The data itself is assigned to the variable *sents_vector_data*. The vocabulary is stored in the *vectorizer*.

We first look at *sents_vector_data*. It is a SciPy sparse matrix (csr_matrix), an object for which many functions and attributes are defined. We are going to look at its *shape*, which holds the dimensions of the data. Printing the shape of sents_vector_data shows us that we have 3 documents and 6 unique words spread over these documents:

In [12]:
print(type(sents_vector_data))
# sents_vector_data has a dimension of 3 (document count) by 6 (# of unique words)

print(sents_vector_data.shape)
print('The vector representation of the sentences looks as follows:')
print(sents_vector_data.toarray())

<class 'scipy.sparse.csr.csr_matrix'>
(3, 6)
The vector representation of the sentences looks as follows:
[[2 0 1 0 2 0]
 [1 0 1 0 1 1]
 [1 1 1 1 0 0]]


Great!! That looks very similar to the numerical data that we used to train our SVM for predicting the weight of people with certain features. Now the columns stand for words and the rows are the sentences or documents.

Important to note is that the rows are longer than any sentence because they represent all the vocabulary of all the sentences. That's why the documents have zero values in their representation as well.

Let's check the vocabulary now, which is stored in the *vectorizer*:

In [16]:
# this vector is small enough to view in full! 
print('The vocabulary of all the sentences  consists of the following words:', list(vectorizer.vocabulary_.keys()))
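# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print('These words are mapped to the data columns as feature names:', vectorizer.get_feature_names_out().tolist())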

The vocabulary of all the sentences  consists of the following words: ['a', 'rose', 'is', 'stinks', 'book', 'nice']
These words are mapped to the data columns as feature names: ['a', 'book', 'is', 'nice', 'rose', 'stinks']


Through the feature name, we can now recover the three texts from the previous data array.

The first array has 6 positions representing the complete vocabulary. The first position represents the first word "a" and it has value '2', which means it occurs twice in the sentence. The third slot is for "is" which occurs once and the fifth slot is for "rose" which occurs twice. The other slots are zero because these words do not occur in the first sentence.

Try to figure out if you understand the representation of the other two sentences!
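To check your answer, each row can be read back as word counts through the `vocabulary_` attribute, which maps every word to its column index. This is a self-contained sketch: it assumes `str.split` as the tokenizer instead of `nltk.word_tokenize` so that it runs without NLTK (for these three sentences both give the same tokens), and the name `column_of` is made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

sents = ['A rose is a rose',
         'A rose is stinks',
         'A book is nice']

# assumption: str.split replaces nltk.word_tokenize to avoid the NLTK dependency
vectorizer = CountVectorizer(min_df=1, tokenizer=str.split)
counts = vectorizer.fit_transform(sents).toarray()

# vocabulary_ maps each word to the column that holds its count
column_of = vectorizer.vocabulary_
print(column_of['rose'])  # 4, i.e. the fifth slot

# read each row back as {word: count}, skipping the zero positions
for row in counts:
    print({word: int(row[col]) for word, col in sorted(column_of.items()) if row[col] > 0})
```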


### 2.2 Training a classifier with word vectors
Now we have seen how we can turn a text into a vector representation. We can associate these text representations with labels, as we have seen above for predicting somebody's weight category. We now use different labels, but note that for the algorithm the labels are meaningless. They could be numbers or any other label.

It is now not so difficult to see how we can train an SVM instance with these data. All we need is to pair a set of labels to the data instances. Let's use sentiment values: neutral, negative and positive.

In [13]:
sentiment_labels=["neutral", "negative", "positive"]

In [14]:
for instance, label in zip(sents_vector_data.toarray(),sentiment_labels):
    print(instance, label)

[2 0 1 0 2 0] neutral
[1 0 1 0 1 1] negative
[1 1 1 1 0 0] positive


We have nicely paired sentence representations and sentiment values. Let's train and test an SVM.

In [15]:
from sklearn import svm

lin_classifier = svm.LinearSVC()
lin_classifier.fit(sents_vector_data,sentiment_labels)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

### 2.3 Classifying a new text with our text classifier

Now we want to apply this model to a new text. We need to create a vector representation for this text as well but we can ONLY(!!!) use the words from the training data since the vectors need to have the same semantics as the training data. The feature names stored in the vectorizer present the vocabulary in the right order:

In [16]:
new_text="a good book is a rose"
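# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())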

['a', 'book', 'is', 'nice', 'rose', 'stinks']


We thus need to create an array with the length of the training vocabulary and add the counts of these words on the basis of the new text. This would look as follows:

In [17]:
new_text_vector=[[2, 1, 1, 0, 1, 0]]

Note that the word "good" is not represented as it does not occur in the training vocabulary. The word "a" occurs twice, "book" and "is" occur once, "nice" and "stinks" do not occur and "rose" also occurs once.
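Instead of counting by hand, the fitted vectorizer can build this vector itself with its `transform` method, which reuses the training vocabulary and silently drops unknown words such as "good". The sketch below is self-contained and again assumes `str.split` as a stand-in tokenizer so that it runs without NLTK.

```python
from sklearn.feature_extraction.text import CountVectorizer

sents = ['A rose is a rose',
         'A rose is stinks',
         'A book is nice']

# assumption: str.split replaces nltk.word_tokenize to avoid the NLTK dependency
vectorizer = CountVectorizer(min_df=1, tokenizer=str.split)
vectorizer.fit(sents)  # learns the vocabulary from the training sentences only

new_text = "a good book is a rose"
# transform (not fit_transform!) keeps the training vocabulary and column order
new_text_vector = vectorizer.transform([new_text]).toarray()
print(new_text_vector)  # [[2 1 1 0 1 0]]
```

Calling `fit_transform` on the new text would instead build a new vocabulary, and the resulting columns would no longer line up with the ones the classifier was trained on.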

In [18]:
predicted_label = lin_classifier.predict(new_text_vector)
print(predicted_label)

['neutral']


The prediction is *neutral*, which makes sense since neither of the distinguishing words "nice" and "stinks" occurs in the text. So let's manipulate the data and turn the value for "stinks" to "1":

In [19]:
new_text_vector=[[2, 1, 1, 0, 1, 1]]

In [20]:
predicted_label = lin_classifier.predict(new_text_vector)
print(predicted_label)

['negative']


It seems to help. Now let's see what happens if we turn the value for *nice* to 1.

In [21]:
new_text_vector=[[2, 1, 1, 1, 1, 0]]
predicted_label = lin_classifier.predict(new_text_vector)
print(predicted_label)

['positive']


### 2.4 TF-IDF
One big problem of the bag of words approach is that it treats all words equally. Why is that a disadvantage? It means that words that occur in many documents, such as *a*, contribute more strongly to the decision making of the machine learning than other words that may be more informative, e.g. *rose*. 
TF-IDF addresses this problem by assigning less weight to words that occur in many documents.
You can read a nice introduction to TF-IDF [here](https://medium.freecodecamp.org/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3).

This is how you can do this in Python using sklearn:

In [22]:
tfidf_transformer = TfidfTransformer()
sents_tfidf = tfidf_transformer.fit_transform(sents_vector_data)

In [23]:
tf_idf_array = sents_tfidf.toarray()
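# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())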
print(numpy.round(tf_idf_array, decimals=1))

['a', 'book', 'is', 'nice', 'rose', 'stinks']
[[0.6 0.  0.3 0.  0.8 0. ]
 [0.4 0.  0.4 0.  0.5 0.7]
 [0.4 0.6 0.4 0.6 0.  0. ]]


This is a good result! In the bag-of-words approach, the words *"a"* and *"book"* both had a frequency of 1 in the third sentence. Now that we have applied TF-IDF, we see that *"book"* has a higher weight (0.6) than *"a"* (0.4), since *"a"* occurs in all three sentences and *"book"* in only one, which might indicate that it is more informative.
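To see where these weights come from, the calculation can be reproduced with NumPy. The sketch below follows what the `TfidfTransformer` does with its default settings as far as we understand them (a smoothed inverse document frequency, followed by scaling every row to unit length); the variable names are chosen for this example only.

```python
import numpy as np

# the bag-of-words counts from above; columns: a, book, is, nice, rose, stinks
counts = np.array([[2, 0, 1, 0, 2, 0],
                   [1, 0, 1, 0, 1, 1],
                   [1, 1, 1, 1, 0, 0]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                # number of documents each word occurs in
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1    # smoothed inverse document frequency
weighted = counts * idf                            # term frequency times idf
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # rows scaled to length 1

print(np.round(idf, decimals=2))
print(np.round(tfidf, decimals=1))
```

The word "a" occurs in all three documents, so its idf is exactly 1, whereas "book" occurs in only one document and gets an idf of about 1.69; this difference is what lifts "book" above "a" in the third row.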

In [24]:
lin_classifier_weight = svm.LinearSVC()
lin_classifier_weight.fit(tf_idf_array,sentiment_labels)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [25]:
# redefine the new text vector without manipulation
new_text_vector=[[2, 1, 1, 0, 1, 0]]

In [26]:
predicted_label = lin_classifier_weight.predict(new_text_vector)
print(predicted_label)

['neutral']


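The small difference still did not lead to a different prediction. More data is needed, or perhaps 'neutral' is correct. Note, however, that the test vector above still holds raw counts, whereas this classifier was trained on TF-IDF weights; strictly speaking, the new text should be passed through the same `tfidf_transformer` (using `transform`) before prediction, so that training and application use the same representation.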
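A sketch of the complete chain, in which the new text goes through exactly the same two transformations as the training sentences, could look as follows. It is self-contained and assumes `str.split` as a stand-in for `nltk.word_tokenize`; the names `classifier`, `new_counts` and `new_tfidf` are introduced for this example. No expected label is shown, because the outcome on such a tiny data set should be checked by running the code.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

sents = ['A rose is a rose',
         'A rose is stinks',
         'A book is nice']
sentiment_labels = ["neutral", "negative", "positive"]

# training: fit the vectorizer and the TF-IDF transformer on the training sentences
# assumption: str.split replaces nltk.word_tokenize to avoid the NLTK dependency
vectorizer = CountVectorizer(min_df=1, tokenizer=str.split)
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(vectorizer.fit_transform(sents))

classifier = LinearSVC()
classifier.fit(train_tfidf, sentiment_labels)

# application: only transform, so vocabulary and idf weights come from the training data
new_counts = vectorizer.transform(["a good book is a rose"])
new_tfidf = tfidf_transformer.transform(new_counts)
print(classifier.predict(new_tfidf))
```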

## End of this notebook.