# Lab3.1 Machine learning basics

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

RMA/Text Mining MA, Introduction to HLT

This notebook explains the basics of machine learning. At the end of this notebook, you will have learned:

- the basic principles of machine learning for text classification
- how features are represented as vectors
- how to train a classifier from vector representations
- how to train and apply a classifier to text represented by its words
- what a bag-of-words representation is
- what the information value (TF*IDF) of a word is

**Background reading:**

NLTK Book
Chapter 6, section 1 and 3: https://www.nltk.org/book/ch06.html



## 1. Machine Learning schema

The overall process of machine learning is shown in the next image that is taken from Chapter 6 of the NLTK book. In general, machine learning consists of a training phase in which an algorithm associates data features with certain labels (e.g. sentiment, part-of-speech). The training results in a classifier model that can be applied to unseen data. The classifier compares the features of the unseen data with the previously seen data and makes a prediction of the label on the basis of some similarity calculation.

![title](images/ml-schema.png)


Crucial in this process are 1) the features that represent the data and 2) the algorithm that is used. In this course, we are not going to discuss the various machine learning algorithms in depth; instead we focus on the text features and how they are represented as so-called feature vectors. In the case of a text, we need to define which features characterize the text. These features are transformed into a feature vector representation that the algorithm and model can handle. In order to compare unseen text with the training texts, it is crucial that features are extracted and represented in the same way during training and when applying the model (which includes testing).

**Preparations**

We are going to use the Scikit-learn package to transform the feature values into a vector representation:

https://scikit-learn.org/stable/install.html

Scikit-learn is a package that contains a lot of machine learning algorithms and functions for dealing with features and carrying out evaluation and error analysis. To install it run one of the following commands from the command line:

- pip install -U scikit-learn

or
 
- conda install scikit-learn


In [1]:
%conda install scikit-learn

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/piek/opt/anaconda3

  added / updated specs:
    - scikit-learn


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    scikit-learn-1.1.1         |   py39he9d5cce_0         5.7 MB
    ------------------------------------------------------------
                                           Total:         5.7 MB

The following packages will be UPDATED:

  scikit-learn                         1.0.2-py39hae1ba45_1 --> 1.1.1-py39he9d5cce_0 None



Downloading and Extracting Packages
scikit-learn-1.1.1   | 5.7 MB    | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: / 

    Installed package of scikit-learn can be accelerated using scikit-learn-intelex.
    More details are available here: ht

We are also using a package called "numpy": https://numpy.org. *Numpy* is a package for working with numerical data and vectors in Python.

Install "numpy" from the command line following the instructions on the website. After installing, you can import it.

In [2]:
%conda install numpy

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Retrieving notices: ...working... done

Note: you may need to restart the kernel to use updated packages.


### 1.1 Vector representations


Before we turn to a text example, we are going to use a very simple data set. We show how to train and evaluate an SVM (Support-Vector-Machine) using a made-up example of multi-class classification for a non-linguistic dataset. The goal is to predict someone's weight category (say: skinny, fit, average, overweight) based on their properties.

We use three features:

* **age in years**
* **height in cms**
* **number of ice cream cones eaten per year**

For each of these features, a person can have a value, e.g. 45, 178, 100. We can thus represent a person as an array of numbers: [45, 178, 100].

The feature representation for 5 people will then be a list of 5 arrays (or a matrix). Each array in the list represents the data for a single person.

Each row (or person) is represented by an array of numbers in which the first is the age, the second the height in cms and the third the number of cones per year: 

In [3]:
X = [[30, 180, 1000], 
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]
    ]

The first person is thus 30 years old, 180 cms tall and eats 1000 cones per year. The next command prints the data for the first instance.

In [4]:
print('First instance in the data set X =', X[0])

First instance in the data set X = [30, 180, 1000]


An array of numbers in which each position holds a value for a specific feature is what we call a feature vector. All instances in the data set must have a feature vector of the same length. If there is no value for a feature, we still need to represent it, but the value will be zero.
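The requirement that all feature vectors have the same length can be checked with a small sketch. It assumes *numpy* is installed; the variable names `X_array`, `new_person` and `ragged` are made up for this illustration and are not used elsewhere in the notebook.

```python
import numpy as np

# The same five people as in this notebook: [age, height in cm, cones per year]
X = [[30, 180, 1000],
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]]

# Converting to a numpy array only gives a clean matrix if all rows have equal length
X_array = np.array(X)
print(X_array.shape)  # (number of instances, number of features)

# Hypothetical person whose number of cones is unknown:
# the position is kept and filled with zero, so the vector still has 3 values
new_person = [25, 175, 0]
print(len(new_person) == X_array.shape[1])

# A row that simply drops the missing feature makes the data ragged
ragged = X + [[25, 175]]
lengths = {len(row) for row in ragged}
print(lengths)
```

If the set of row lengths contains more than one value, the data cannot be used as a feature matrix.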

In addition to the data that is now assigned to the variable 'X', we also need to have the prediction that goes with the instances. For this we use another array with the values that we assign to the variable 'Y'. 

In [5]:
Y = ["overweight", 
     "skinny",
     "fit",
     "average",
     "average"]

We need to have as many values as we have instances in our data set, as the software pairs the elements in X with the elements in Y. Obviously, the values should also be in the correct order to correspond with the instances!

In [6]:
print('The length of the data set =', len(X))
print('The length of the predictions =', len(Y))
print('The first prediction =', Y[0])

The length of the data set = 5
The length of the predictions = 5
The first prediction = overweight


A nice function to pair lists in Python is the "zip" function which creates a list of tuples from two lists. We can use this to pair the instances with their labels:

In [7]:
for instance, label in zip(X, Y):
    print(instance, label)

[30, 180, 1000] overweight
[80, 180, 100] skinny
[50, 180, 100] fit
[40, 160, 500] average
[15, 160, 400] average


### 1.2 Using Scikit-learn to build a classifier

Since we have the data and the prediction, we can train a model. We are going to use the **svm** module from **sklearn**, from which we will select the **LinearSVC** (Linear Support Vector Classification: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) class. Support Vector Machines or SVMs are powerful supervised machine learning approaches that find the optimal division (a so-called hyperplane in a multidimensional data space) between positive and negative examples of a class. For now it is not important to know the details about this algorithm. You will learn about that in the machine learning course. 


We first import *svm* from sklearn and next we instantiate a model with the variable name 'lin_classifier' (any name will do and you can instantiate as many variables as you want until you run out of memory). We will use this instance for training and classifying.

In [8]:
from sklearn import svm

lin_classifier = svm.LinearSVC()

Now we train the model by feeding it the data set 'X' and the labels 'Y' using the 'fit' function. The 'fit' function trains the model on the data. The model is defined by the number of features in the data and also by their order. The current data example has 3 features, and the first position always holds the value for the age and not something else.

In [9]:
lin_classifier.fit(X,Y)



Calling the fit function gives a response that shows the (default) parameter settings of this model.

When you train the model through the 'fit' command, you might get a warning stating that:
```
ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
```
This is to be expected given that we only train using five instances.

The default setting of LinearSVC is to iterate maximally 1,000 times over the data to get convergence. See the documentation:

https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

What we can try is to increase this by setting the parameter for *max_iter* to 10,000:

In [10]:
lin_classifier = svm.LinearSVC(max_iter=10000)
lin_classifier.fit(X,Y)



We see that the warning did not disappear. The data set is really too small for convergence, even after 10,000 iterations. We leave it for now. You will learn more about SVM classifiers in the Machine Learning course.
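To get a feeling for what *fit* has stored, the following sketch inspects two attributes of a trained LinearSVC: `classes_` (the labels it knows) and `coef_` (one row of weights per label, one column per feature). The variable name `demo_classifier` is made up for this illustration, and the convergence warning is silenced only to keep the output short.

```python
import warnings
from sklearn import svm
from sklearn.exceptions import ConvergenceWarning

X = [[30, 180, 1000],
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]]
Y = ["overweight", "skinny", "fit", "average", "average"]

demo_classifier = svm.LinearSVC(max_iter=10000)
with warnings.catch_warnings():
    # The tiny data set is expected to trigger a ConvergenceWarning
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    demo_classifier.fit(X, Y)

# The distinct labels found in Y, in sorted order
print(demo_classifier.classes_)

# One weight per (label, feature) pair: 4 labels x 3 features
print(demo_classifier.coef_.shape)
```

The number of columns in `coef_` equals the number of features, which is why every instance must use the same three positions in the same order.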

### 1.3 Using Sklearn to classify unseen data

Let's now apply the model to a new instance 'Z'. What does our trained SVM instance think about the weight category of someone who is 18 years old, 171cm tall, and who eats 450 ice cream cones per year? We can use the *predict* function of our classifier instance to tell us:

In [11]:
Z=[[18, 171, 450]]
predicted_label = lin_classifier.predict(Z)
print(predicted_label)

['average']


Apparently the SVM instance thinks it is **average**, which is not surprising since **number of ice cream cones eaten per year** and **height** seem to correlate highly with the weight categories.

Note that we people reason with some (weak) causal explanatory model but our Machine Learning model just uses data patterns and association. It does not know why the answer *average* makes sense or not.
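The model only accepts instances that have the same number of features as the training data. The sketch below retrains a classifier on the same toy data and then passes an instance with only two values; scikit-learn is expected to raise a `ValueError`. The names `demo_classifier` and `prediction_failed` are made up for this illustration.

```python
import warnings
from sklearn import svm
from sklearn.exceptions import ConvergenceWarning

X = [[30, 180, 1000],
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]]
Y = ["overweight", "skinny", "fit", "average", "average"]

demo_classifier = svm.LinearSVC(max_iter=10000)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    demo_classifier.fit(X, Y)

prediction_failed = False
try:
    # Only age and height are given: the cones feature is missing
    demo_classifier.predict([[18, 171]])
except ValueError as error:
    prediction_failed = True
    print("Prediction failed:", error)
```

Note that the model cannot detect a wrong *order* of features in this way: a vector with three values in the wrong positions is accepted and simply leads to a meaningless prediction.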

## 2. Representing a text as a Bag-Of-Words

So now let's move from people to text. When we process language, we typically want to look at instances of text and to represent each text by the linguistic features. So in our data structure, each row will be a piece of text (as rows were people in the previous example) and the array will have values for different properties of the text. Instead of predicting the *weight*, we now want to predict interpretations of the text, such as its sentiment or the topic.

A critical component of almost any machine learning approach is **feature representation**. 
This is not so strange since we need to somehow convert a complex text, e.g., words, sentences, a tweet, or a document, into something numerical that can be interpreted by a computer, and is also useful for the type of learning we want to do.

A text consists of a sequence of words on which we impose syntax and semantics. A machine needs to learn to associate the structural properties of the text to some interpretation.
We can use various language properties to do this, both structural and external:

- the words (regardless of the order or in order) and their frequency in a text
- the part-of-speech of words
- grammatical relations such as dependencies
- combinations of words: sequences of three words, four words, etc. (so-called word n-grams), phrases, sentences
- the characters that make up the words (so-called character n-grams)
- the meaning of words or combinations of words in a lexicon
- word length, sentence length, word position in a text or a sentence
- discourse structure: title, header, caption, body, conclusion sections
- etc....

Which features work often depends on the kind of task and is often determined experimentally.

Some of the above structural properties we get for free if we split a text into tokens. Other properties are not explicit, such as the part-of-speech of words, phrases, syntax and their meaning.

For now, we are only considering the words of a text as features. In fact, we are going to ignore the order of the words and consider a text as a *Bag-Of-Words*.

**If you want to learn more: (information from these blogs was used in this notebook)**
* [bag of words introduction](http://www.insightsbot.com/blog/R8fu5/bag-of-words-algorithm-in-python-introduction)
* [TF-IDF introduction](https://medium.freecodecamp.org/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3)
* [another TF-IDF introduction](https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/)

In the machine learning course, we explain how other features can be combined with a word representation.

### 2.1 Bag of words

We are going to create a vector representation of a text in which the words are the features that characterize the content of the text. To keep things simple, we ignore the order of the words but we do want to know how often a word occurs in a document so that we can give it a weight.

In our vector representation we want each word to occupy a unique position in the array, just as the age [0], height [1] and number of cones [2] did in our *weight* prediction example. That means that our vector needs to be as long as the number of unique words that we find in the texts.

The first thing we therefore need to do is to create a word-to-document index:

* we extract all the unique words from a collection of textual units, e.g., documents or tweets
* we compute the frequency of each word in each document

Knowing the unique vocabulary of the texts, we can create a vector array with the length of the vocabulary and the order of the words in our vocabulary corresponds with the order in the array. 

Next, we can represent each document by the vector array by adding a row for a document (an instance of a text) where we score each position with the frequency of this word in the text. Instead of just counting each word, we can also determine the information value of the word for the document as the *TF.IDF* value.

Let's look at an example.

To do all the above, we use two modules from sklearn that do all the work:

* CountVectorizer: turns a textual data set into a vector representation consisting of a vector array with frequency counts and a vocabulary that maps each word to the corresponding vector array position
* TfidfTransformer: calculates the *TF.IDF* values from the basic counts produced by the CountVectorizer function.

We also need the NLTK package from the previous notebooks.

In [12]:
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import nltk

Let's try this for the following three sentences that we list in an array (note that sentences can also be complete documents).

In [13]:
sents = ['A rose is a rose',
         'A rose stinks',
         "A book is nice"]

We have three instances of text, with words occurring across the texts and with different frequencies. We will use the **CountVectorizer** class to create a bag-of-words representation from the above texts. We set two parameters when we create an instance of the CountVectorizer: 1) the minimal number of documents (rows in our data) in which a term should occur and 2) the tokenizer that is used to split the text into separate tokens. Since our data set is small, we set the minimal number of documents to 1. As a tokenizer, we use the NLTK function *word_tokenize*.

We first create the instance *our_vectorizer* and feed it with our sentences to derive the data arrays for the instances with the function *fit_transform*. This will give us two things:

* a data structure that represents the instances through their vectors
* the vocabulary that maps to the columns of the data structure

The result of calling this function is assigned to the variable *sents_vector_data*. 

In [14]:
# you can adapt min_df to restrict the representation to more frequent words e.g. 2, 3, etc..
# 
our_vectorizer = CountVectorizer(min_df=1, # in how many documents the term minimally occurs
                             tokenizer=nltk.word_tokenize) # we use the nltk tokenizer to split the text into tokens
sents_vector_data = our_vectorizer.fit_transform(sents)

Note there is also a parameter max_df to ignore words that occur in more documents than specified. Check the API of scikit-learn to see what other parameters can be used:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Let us now inspect the *sents_vector_data* object created by *our_vectorizer*. The data representation of the text is assigned to the variable *sents_vector_data*. The vocabulary is stored in *our_vectorizer*.

We first look at the type of *sents_vector_data*. It is a special class from the scipy package (csr_matrix), designed for large and sparse matrices. Various functions and attributes are provided. We are going to look at the *shape* attribute. Printing the so-called "shape" of *sents_vector_data* shows us the dimensions of the matrix. It tells us we have 3 rows and 6 columns to represent the data.

In [15]:
print(type(sents_vector_data))
print(sents_vector_data.shape)
# sents_vector_data has a dimension of 3 (document count) by 6 (# of unique words)


<class 'scipy.sparse.csr.csr_matrix'>
(3, 6)


'csr' stands for 'compressed sparse row'. A csr matrix is a table with rows and columns in which only the non-zero values are stored.

The actual data now looks as follows:

In [25]:
print('The vector representation of the sentences looks as follows:')
print (sents_vector_data.toarray())

The vector representation of the sentences looks as follows:
[[2 0 1 0 2 0]
 [1 0 0 0 1 1]
 [1 1 1 1 0 0]]


Great!! That looks very similar to the numerical data that we used to train our SVM for predicting the weight of people with certain features. Now the columns stand for words and the rows are the sentences or pieces of text.

Important to note is that the rows are longer than any sentence we gave it because they represent the complete vocabulary of all the sentences. That's why the texts have zero values in their representation for words that do not occur in it but occur in other texts.
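What the sparse (csr) format stores can be made visible with a small sketch. It builds a csr matrix directly from the dense counts shown above, so it does not depend on the vectorizer; the variable names `dense` and `sparse` are made up for this illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# The count matrix of the three sentences, written out in full
dense = np.array([[2, 0, 1, 0, 2, 0],
                  [1, 0, 0, 0, 1, 1],
                  [1, 1, 1, 1, 0, 0]])

sparse = csr_matrix(dense)

# Number of stored (non-zero) values, instead of 3 x 6 = 18 cells
print(sparse.nnz)

# The non-zero values, row by row
print(sparse.data)

# The column index of each stored value
print(sparse.indices)

# Where each row starts and ends in the two arrays above
print(sparse.indptr)
```

For three short sentences the saving is small, but with a vocabulary of tens of thousands of words almost all cells are zero and the sparse format becomes essential.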

Let's check the vocabulary now, which is stored in *our_vectorizer*:

In [26]:
# this vector is small enough to view in full! 
print('The vocabulary of all the sentences  consists of the following words:', 
      list(our_vectorizer.vocabulary_.keys()))
print('These words are mapped to the data columns as feature names:', 
      our_vectorizer.get_feature_names_out())

The vocabulary of all the sentences  consists of the following words: ['a', 'rose', 'is', 'stinks', 'book', 'nice']
These words are mapped to the data columns as feature names: ['a' 'book' 'is' 'nice' 'rose' 'stinks']


Look carefully at the list. Notice that they contain the same words but in different order. The first is the vocabulary and the second lists the names of the features in the order of the columns in our matrix.

Through the feature names, we can now recover the words of the three texts from the previous data array. If we find a value greater than zero in a cell, the word occurs in the text, otherwise not. Obviously, we cannot recover the order, as this is a bag-of-words representation.

The first array has 6 positions representing the complete vocabulary. The first position represents the first word "a" and it has value '2', which means it occurs twice in the sentence. The third slot is for "is" which occurs once and the fifth slot is for "rose" which occurs twice. The other slots are zero because these words do not occur in the first sentence.

Try to figure out if you understand the representation of the other two sentences!
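One way to check your understanding is to rebuild the matrix by hand. The sketch below does this in plain Python, assuming that lowercasing and splitting on whitespace is a good enough tokenizer for these three sentences (it is not a replacement for the NLTK tokenizer in general). The names `tokenized`, `vocabulary` and `rows` are made up for this illustration.

```python
from collections import Counter

sents = ['A rose is a rose',
         'A rose stinks',
         'A book is nice']

# Assumption: lowercase + whitespace split is sufficient for these sentences
tokenized = [sentence.lower().split() for sentence in sents]

# The vocabulary: all unique words, sorted alphabetically
vocabulary = sorted({token for tokens in tokenized for token in tokens})
print(vocabulary)

# One row per sentence, one column per vocabulary word
rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    rows.append([counts[word] for word in vocabulary])

for row in rows:
    print(row)
```

The rows should be identical to the array printed from *sents_vector_data*.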


### 2.2 Training a classifier with word vectors
Now we have seen how we can turn a text into a vector representation. We can associate these text representations with labels as we have seen above for predicting somebody's weight. We now use different labels but note that for the algorithm the labels are meaningless. They could be numbers or any set of words.

It is not so difficult to see how we can train an SVM instance with these data. All we need is to pair a set of labels to the data instances. Let's use sentiment values: neutral, negative and positive. You can also use other labels such as A, B, C.

In [27]:
sentiment_labels=["neutral", "negative", "positive"]

In [28]:
for instance, label in zip(sents_vector_data.toarray(),sentiment_labels):
    print(instance, label)

[2 0 1 0 2 0] neutral
[1 0 0 0 1 1] negative
[1 1 1 1 0 0] positive


We have nicely paired sentence representations and sentiment values. Let's train and test an SVM. To train the model, we use the *fit* function of the svm again as before. We feed it with the sents_vector_data generated by *our_vectorizer* with the labels.

In [29]:
from sklearn import svm

lin_classifier = svm.LinearSVC()
lin_classifier.fit(sents_vector_data,sentiment_labels)

### 2.3 Classifying a new text with our text classifier

Now we want to apply this model to a new text. We need to create a vector representation for this text as well, but we can ONLY(!!!) use the words from the training data since the vectors need to have the same semantics as the training data. We cannot represent words that are not in the training data in our model. Furthermore, we need to represent the words that do occur in the right order. The feature names stored in the vectorizer present the vocabulary in the right order:

In [30]:
new_text="a good book is a rose"
print(our_vectorizer.get_feature_names_out())

['a' 'book' 'is' 'nice' 'rose' 'stinks']


We thus need to create an array with the length of the training vocabulary and add the counts of these words on the basis of the new text. This would look as follows:

In [31]:
new_text_vector=[[2, 1, 1, 0, 1, 0]]

We can also obtain this representation by using the vectorizer we created before, and convert the csr object to an array. We use the function *transform* which transforms data into the representation of a given model. This literally means represent the text according to the vocabulary of the training data only.

**Note (!!!!)** that we do NOT use the function *fit* (or the alternative *fit_transform*) here. What will happen if you did? Think about it ....

In [33]:
new_text_vector = our_vectorizer.transform([new_text]).toarray()
print(new_text_vector)

[[2 1 1 0 1 0]]


Note that the word "good" is not represented as it does not occur in the training vocabulary. The word "a" occurs twice, "book" and "is" occur once, "nice" and "stinks" do not occur and "rose" also occurs once.

In [34]:
predicted_label = lin_classifier.predict(new_text_vector)
print(predicted_label)

['neutral']


The prediction is *neutral* which makes sense since none of the distinguishing words "nice" and "stinks" occur in the text. So let's manipulate the data and turn the value for "stinks" to "1":

In [35]:
new_text_vector=[[2, 1, 1, 0, 1, 1]]

In [36]:
predicted_label = lin_classifier.predict(new_text_vector)
print(predicted_label)

['negative']


That did help.

### 2.4 TF-IDF
One big problem of the bag of words approach is that it treats all words as equal. Why is that a disadvantage? It means that words that occur in many documents, such as *a*, contribute equally to the decision making of the machine learning approach as other words that are much more informative, e.g., *rose*. 
TF-IDF addresses this problem by assigning less weight to words that occur in many documents.
You can read a nice introduction to TF-IDF [here](https://medium.freecodecamp.org/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3). 

TF-IDF comes from information retrieval research. You can imagine that it matters there to measure how discriminative words are across different documents. For other applications such as sentiment classification, it is less clear what the contribution will be. Here we are going to apply it anyway and see what it does to the representation.

This is how you can do it in Python using sklearn. We create an instance of the class TfidfTransformer and feed it the bag-of-words representation that we created before for training with *our_vectorizer*. This instance has a *fit_transform* function that now produces a representation with the TF-IDF values for the words.

In [37]:
tfidf_transformer = TfidfTransformer()
sents_tfidf = tfidf_transformer.fit_transform(sents_vector_data)

To inspect the result, we need to convert it to an *array*. 

In [39]:
tf_idf_array = sents_tfidf.toarray()
print(our_vectorizer.get_feature_names_out())
print(tf_idf_array)

['a' 'book' 'is' 'nice' 'rose' 'stinks']
[[0.57048339 0.         0.36730061 0.         0.73460123 0.        ]
 [0.42544054 0.         0.         0.         0.54783215 0.72033345]
 [0.34520502 0.5844829  0.44451431 0.5844829  0.         0.        ]]


To get rounded values, we can use a *numpy* function:

In [40]:
print(our_vectorizer.get_feature_names_out())
print(numpy.round(tf_idf_array, decimals=2))

['a' 'book' 'is' 'nice' 'rose' 'stinks']
[[0.57 0.   0.37 0.   0.73 0.  ]
 [0.43 0.   0.   0.   0.55 0.72]
 [0.35 0.58 0.44 0.58 0.   0.  ]]


##### This is an expected result! 

In the bag-of-words approach, the words **"a"** and **"book"** both had a frequency of 1 in the third sentence. Now that we've applied the TF-IDF approach, we see that the word "*book*" has a higher weight (0.58) than the word "*a*" (0.35) in the third text, since "*a*" occurs in all three sentences and "*book*" only in one. Can you tell why in the first sentence "*a*" scores 0.57?
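One way to check your answer is to recompute the values by hand. The sketch below assumes the default settings of TfidfTransformer: a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, followed by scaling each row to length 1 (L2 normalization). The variable names are made up for this illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Count matrix of the three sentences; columns: a, book, is, nice, rose, stinks
counts = np.array([[2, 0, 1, 0, 2, 0],
                   [1, 0, 0, 0, 1, 1],
                   [1, 1, 1, 1, 0, 0]])

n_docs = counts.shape[0]

# In how many documents does each word occur?
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed sklearn default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Term frequency times idf, then scale every row to unit length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)
print(np.round(manual, 2))

# Compare with the values that sklearn computes
sklearn_result = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, sklearn_result))
```

In the first sentence "*a*" has the lowest possible idf (1.0) but occurs twice, and the row is normalized over only three non-zero words, which together gives 0.57.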

Let's try training a model again with the new data representation. We create a new classifier instance and train it with the function *fit*:

In [41]:
lin_classifier_weight = svm.LinearSVC()
lin_classifier_weight.fit(tf_idf_array,sentiment_labels)

Now, to test the new sentence we must apply the same transformations we did to the training data. This means we have to also represent the new sentence using TF-IDF weights.

In [44]:
#redefine new test without manipulation
new_text_vector=[[2, 1, 1, 0, 1, 0]]

# transform the counts to tf-idf features
new_tf_idf_text_vector = tfidf_transformer.transform(new_text_vector)

new_tf_idf_array = new_tf_idf_text_vector.toarray()
print(our_vectorizer.get_feature_names_out())
print(numpy.round(new_tf_idf_array, decimals=2))

['a' 'book' 'is' 'nice' 'rose' 'stinks']
[[0.63 0.53 0.4  0.   0.4  0.  ]]


Note that although the word "*book*" is assigned a high weight (0.53), the TF-IDF approach still gets confused with the test data and assigns a higher weight to the word "*a*". Why is this?

Finally, we can use the newly trained model with the new representations to get a prediction:

In [45]:
predicted_label = lin_classifier_weight.predict(new_tf_idf_text_vector)
print(predicted_label)

['neutral']


Sadly, the difference in representation did not lead to a different prediction. 

It is important to see two things in this example:

<ol>
<li> Regardless of the changes we made in representing the data, the model cannot process words that it has not seen, such as "*good*". 
<li> The model does not know that unseen words can be **semantically** related to words that it has seen. For example, the model does not know that the word "*good*" is related to "*nice*".
</ol>

One way to fix these problems would be to have a larger (training) dataset with more diverse words. However, given the (small) data the model has seen so far, perhaps the prediction 'neutral' is correct.

## Summary

What are the important functions to remember?

Two classes to turn text data into vector representations, **CountVectorizer** and **TfidfTransformer**, share two functions:

* fit_transform: used for the training data from which it 1) extracts (fit) a bag of words as features and 2) represents (transform) each training document according to the vector
* transform: applied to test data to represent it according to the vector model of the training data

From sklearn you can import various classifiers such as NaiveBayes or SVM. All classifiers come with at least two functions:

* fit: learn from the training data the association between the vector representations and the labels
* predict: apply the model to a vector representation to predict the associated label
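The four functions can be put together in one compact sketch of the whole process. It repeats the steps of this notebook on the same three sentences, but uses a simple regular expression as `token_pattern` instead of the NLTK tokenizer (an assumption made to keep it self-contained); all variable names are made up for this recap.

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train_texts = ['A rose is a rose',
               'A rose stinks',
               'A book is nice']
labels = ["neutral", "negative", "positive"]

# Assumption: a regex that also keeps one-letter words such as "a"
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf = TfidfTransformer()

# Training: fit_transform learns the vocabulary and the IDF values
train_counts = vectorizer.fit_transform(train_texts)
train_tfidf = tfidf.fit_transform(train_counts)

# Training: fit learns the association between vectors and labels
classifier = svm.LinearSVC()
classifier.fit(train_tfidf, labels)

# Applying: transform reuses the vocabulary and IDF values of the training data
test_counts = vectorizer.transform(["a good book is a rose"])
test_tfidf = tfidf.transform(test_counts)

# Applying: predict assigns a label to the new vector
prediction = classifier.predict(test_tfidf)
print(prediction)
```

The key pattern is that *fit* and *fit_transform* are only ever called on the training data, while the test data only passes through *transform* and *predict*.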


## End of this notebook