# A First Jupyter Notebook to Explore

## Machine Learning Example: A model that learns the similarity of words

The first machine learning model we will examine in more detail belongs to the class of unsupervised models. We will cover this class of models later this semester. For now, our focus is on getting you familiar with changing and running code in Jupyter notebooks.

Semantic spaces (also called word embeddings) are a type of machine learning model that attempts to learn the relationships between words.

What does this mean? While for a human it is simple to identify that speaking and spoken are very closely related and cat and dog share a relationship, this is not evident for a computer. How can we represent concepts in a form that allows computers to operate based on their meaning?

Early attempts to do so (for example as part of search engines) were based on linguistic analysis of words. Identifying the common base form of verbs and plural and singular nouns allowed computer programs to "understand" that "rocket" and "rockets" are very closely related concepts.

Modern approaches to representing the meaning of concepts calculate a numeric representation (a vector of values) for each concept. With this representation it is possible to calculate how similar two concepts are by using a metric (e.g. the Euclidean distance or the cosine similarity).
Probably the most famous of these semantic word spaces is Word2Vec ([original Word2Vec paper by researchers at Google](https://arxiv.org/pdf/1301.3781.pdf)).
The idea behind Word2Vec is pretty simple. We are making the assumption that "you can tell the meaning of a word by the company it keeps" (Firth, 1957, a linguist). This is analogous to the saying *show me your friends, and I'll tell you who you are*. So if you have two words that have very similar neighbors (i.e. the usage context is about the same), then these words are probably quite similar in meaning or are at least highly related. For example, the words `shocked`, `appalled` and `astonished` are typically used in a similar context.
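
To make the "company it keeps" idea concrete, here is a minimal sketch that represents each word by a count of its neighboring words and compares those counts with the cosine similarity. This is a simple count-based stand-in and not the Word2Vec algorithm itself; the toy sentences and the helper names `context_counts` and `cosine` are invented for this illustration.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
sentences = [
    "i was shocked by the dirty room",
    "i was appalled by the dirty room",
    "i was astonished by the rude staff",
    "the breakfast was served by the pool",
]

def context_counts(target, sentences, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            if word == target:
                lo = max(0, i - window)
                hi = min(len(words), i + window + 1)
                # collect every neighbor in the window except the target itself
                counts.update(words[j] for j in range(lo, hi) if j != i)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

shocked = context_counts("shocked", sentences)
appalled = context_counts("appalled", sentences)
breakfast = context_counts("breakfast", sentences)

sim_close = cosine(shocked, appalled)   # words with identical neighbors
sim_far = cosine(shocked, breakfast)    # words with mostly different neighbors
print(sim_close, sim_far)
```

In this toy corpus `shocked` and `appalled` have exactly the same neighbors, so they score higher than `shocked` and `breakfast`. Word2Vec learns dense vectors instead of raw counts, but the intuition is the same.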

### Loading the program

First, we start with importing the code we need to run our machine learning program. The code is shipped in a so-called `package`. Think of a `package` as a container or a zip file that contains all the code required to use it. At some point someone wrote the code for a machine learning algorithm that can learn the similarity of words and placed it in a `package` called `gensim`.
It is an open source library developed by a machine learning consulting company from the Czech Republic. Among other algorithms it contains a reliable implementation of `Word2Vec`.

If you are working in a new environment or on your own machine, the `gensim` package will likely be missing and you will have to install it. If we run the notebook on the Google Colab platform, the module is already installed.
In that case we just have to `import` the gensim package. Importing means we load the module into the current environment so we can make use of it.

In [1]:
# import needed modules
import gzip
import gensim
import logging
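
If the import above fails on your own machine, the following sketch may help to check whether `gensim` is available before importing it. It only uses the Python standard library; the variable name `gensim_available` is made up for this example, and the install command in the message assumes that `pip` is the package manager in use.

```python
import importlib.util

# find_spec returns None when the package cannot be found in this environment
gensim_available = importlib.util.find_spec("gensim") is not None

if gensim_available:
    print("gensim is installed and can be imported.")
else:
    # inside a notebook cell the command is typically prefixed with "!"
    print("gensim is missing; try installing it with: pip install gensim")
```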

### Dataset

Apart from the implementation of the model, the second ingredient that is necessary for us to start with the training of unsupervised models is a dataset.

You will find the following datasets on the course moodle page under `Introduction`.

* `swiss-sms.txt.gz` is a set of Swiss German text messages (http://www.sms4science.ch/)
* `reviews_data.txt.gz` is a collection of English customer reviews from different websites.
* `reviews_data_small.txt.gz` is a smaller version of the above dataset for faster training times

Upload the datasets into Colab using the sidebar on the left.
In order to use Google Colab you will need a Google account (for example a Gmail account).
Uploading the larger `reviews` dataset can take a couple of minutes.

### Output a Line of the Dataset

By executing the following `code cell` we print a line from the dataset and get an idea of what kind of text it contains. You can execute the cell multiple times. It will output a different random line every time.

In [2]:
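import random

# path to the uploaded dataset
data_file = "./content/reviews_data.txt.gz"

# read all lines of the compressed file and print one of them at random
with gzip.open(data_file, 'rb') as f:
    lines = f.read().splitlines()
    print(random.choice(lines))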



b'Sep 27 2007\tNEVER AGAIN! Horrible!\tWe stayed only for one night and paid 150 pound! The room was in the basement - not very clean. The lady at the reception terrible. I had the feeling that she felt disturbed in reading her book. The breakfast was horrible. We then changed the hotel and got a much better one for 71 pound.THIS &quot;Athena Hotel&quot; is not a hotel! It is more or less a dosshouse (but a very expensive one)l Through your money out of the window and you have much jmore than staying at this place.NEVER AGAIN!!!!!!\t'


### Read files into a list

Now that we've had a sneak peek at our dataset, we can read it into a list so that we can pass it on to the `Word2Vec` machine learning algorithm for learning.

The model learns from the plain input text as we printed it in the last cell. The only thing that happens beforehand is some pre-processing when we call `gensim.utils.simple_preprocess(line)`. This does some basic pre-processing such as tokenization (splitting the text into individual words) and lowercasing, and returns a list of tokens (words). Documentation of this pre-processing method can be found on the official [Gensim documentation site](https://radimrehurek.com/gensim/utils.html).
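
To get a feeling for what this pre-processing step produces, here is a rough standard-library approximation. It is not gensim's actual implementation: the function name `rough_preprocess` and the regular expression are assumptions made for this sketch, and the length limits are chosen to mirror the defaults documented for `simple_preprocess`.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase the text, split it into alphabetic tokens and
    drop tokens that are very short or very long."""
    if isinstance(text, bytes):
        # the gzip file is read in binary mode, so lines arrive as bytes
        text = text.decode("utf-8", errors="ignore")
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

line = b"Sep 27 2007\tNEVER AGAIN! Horrible!\tWe stayed only for one night."
tokens = rough_preprocess(line)
print(tokens)
```

Numbers and punctuation disappear and everything is lowercased, which is why the model later only knows lowercase words.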



In [3]:

def read_input(input_file):
    """Read a gzip-compressed file and yield one list of tokens per line."""

    print("reading file {0}...this may take a while".format(input_file))

    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):

            # report progress every 10,000 lines
            if i % 10000 == 0:
                print("read {0} lines".format(i))
            # do some pre-processing and return a list of words for each review text
            yield gensim.utils.simple_preprocess(line)

# read the tokenized reviews into a list
# each review item becomes a series of words
# so this becomes a list of lists
documents = list(read_input(data_file))
print("Done reading data file")

reading file ./content/reviews_data.txt.gz...this may take a while
read 0 lines
read 10000 lines
read 20000 lines
read 30000 lines
read 40000 lines
read 50000 lines
read 60000 lines
read 70000 lines
read 80000 lines
read 90000 lines
read 100000 lines
read 110000 lines
read 120000 lines
read 130000 lines
read 140000 lines
read 150000 lines
read 160000 lines
read 170000 lines
read 180000 lines
read 190000 lines
read 200000 lines
read 210000 lines
read 220000 lines
read 230000 lines
read 240000 lines
read 250000 lines
Done reading data file


## Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec (load the program into memory so that it is ready for execution) and pass the lines that we read in the previous step (we often call such inputs `documents` instead of using the term `line`). Word2Vec uses all these lines to internally create a vocabulary. By vocabulary, we mean the set of unique words.
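
The idea of a vocabulary can be sketched in a few lines of plain Python. This is only an illustration of the concept, not how gensim builds its vocabulary internally; the toy documents and the function `build_vocabulary` are invented for this example.

```python
from collections import Counter

# Toy tokenized documents, invented for illustration.
toy_documents = [
    ["great", "hotel", "great", "staff"],
    ["dirty", "room", "rude", "staff"],
    ["great", "room", "freindly", "staff"],
]

def build_vocabulary(documents, min_count=1):
    """Return the set of unique words that occur at least `min_count` times."""
    counts = Counter(word for doc in documents for word in doc)
    return {word for word, n in counts.items() if n >= min_count}

full_vocab = build_vocabulary(toy_documents)
frequent_vocab = build_vocabulary(toy_documents, min_count=2)
print(sorted(full_vocab))
print(sorted(frequent_vocab))
```

With a minimum count of 2, rare words such as the misspelled `freindly` are dropped from the vocabulary, which is the effect of the `min_count` parameter used when the model is created.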

After building the vocabulary, we call `train(...)` to train the Word2Vec model. Note that when the documents are passed to the constructor, as in the cell below, gensim already builds the vocabulary and runs an initial training; the explicit `train(...)` call then performs one additional pass over the data.
Training time depends on the size of the training data. If you use the reviews dataset, training might take a couple of minutes on the Google Colab platform.
The Swiss SMS dataset is rather small and should train very quickly.

Execute the code cell below in order to start the learning process of the Word2Vec algorithm. With the small reviews dataset (`reviews_data_small.txt.gz`) this will take approximately 5 minutes.

In [19]:
model = gensim.models.Word2Vec(documents, vector_size=150, window=5, min_count=15, workers=10)
model.train(documents, total_examples=len(documents), epochs=1)
print("Done training the machine learning model.")

Done training the machine learning model.


## Inspecting the Semantic Space

Now that we have trained a first model, we can start to inspect what the machine learning model has learnt.
This first example shows a simple case of looking up the most similar words. All we need to do here is to call the `most_similar` function and provide a word as input. This returns the top 10 similar words.

Execute the cell below to see the most similar words for the word assigned to `w1`.
Test it with other words that come to mind. If the word was not contained in the dataset that was used for the training, an error message (a `KeyError`) will appear. We started with a blank model that knew nothing about human language, so its knowledge about relations between words is limited to the words that appeared in the dataset.

In [20]:

w1 = "hotel"
model.wv.most_similar(positive=w1)


[('property', 0.8275459408760071),
 ('establishment', 0.6695693731307983),
 ('resort', 0.6585601568222046),
 ('place', 0.6144008636474609),
 ('accomodation', 0.5225114822387695),
 ('accommodation', 0.5152970552444458),
 ('motel', 0.5040884613990784),
 ('facility', 0.4804632365703583),
 ('travelodge', 0.47616931796073914),
 ('jbh', 0.46400994062423706)]

That looks pretty good, right? Let's look at a few more. As you can see below the `topn` parameter specifies the number of similar words to return.

In [12]:
# look up top 6 words similar to 'polite'
w1 = ["polite"]
model.wv.most_similar(positive=w1, topn=6)


[('courteous', 0.9069100022315979),
 ('cordial', 0.8610674738883972),
 ('curtious', 0.8574371337890625),
 ('friendly', 0.8353148102760315),
 ('curteous', 0.8314024806022644),
 ('freindly', 0.8203408122062683)]

That's nice. You can even specify several positive examples to get words that are related in the provided context, and provide negative examples to say what should not be considered related. In the example below we are asking for words that relate to *hotel* and *resort* but not to *restaurant*:

In [7]:
# get words related to 'hotel' and 'resort' but not to 'restaurant'
w1 = ['hotel','resort']
w2 = ['restaurant']
model.wv.most_similar(positive=w1, negative=w2, topn=10)


[('property', 0.759322464466095),
 ('establishment', 0.5897150635719299),
 ('jbh', 0.46248143911361694),
 ('motel', 0.4615221917629242),
 ('grandview', 0.4508134424686432),
 ('lrm', 0.43939098715782166),
 ('travelodge', 0.4390363097190857),
 ('hotell', 0.43639495968818665),
 ('hgvc', 0.43085941672325134),
 ('swissotel', 0.43083128333091736)]
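
The combination of positive and negative examples can be pictured as simple vector arithmetic: add the vectors of the positive words, subtract the vectors of the negative words, and rank all other words by their cosine similarity to the result. The sketch below illustrates this idea with hand-made three-dimensional vectors. The vectors and the function `toy_most_similar` are invented for this example and only approximate what a library implementation does.

```python
import math

# Hand-made toy vectors; real Word2Vec vectors are learned and much longer.
toy_vectors = {
    "hotel":      [0.9, 0.1, 0.1],
    "resort":     [0.8, 0.2, 0.1],
    "motel":      [0.9, 0.0, 0.2],
    "restaurant": [0.2, 0.9, 0.1],
    "cafe":       [0.1, 0.9, 0.2],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, weight in weighted:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += weight * x
    query = unit(query)
    # score every word that was not part of the query itself
    scores = {
        word: sum(q * x for q, x in zip(query, unit(vec)))
        for word, vec in vectors.items()
        if word not in positive and word not in negative
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topn]

result = toy_most_similar(toy_vectors, positive=["hotel", "resort"], negative=["restaurant"])
print(result)
```

In this toy space `motel` ranks above `cafe`, because subtracting `restaurant` pushes the query away from food-related words.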

### Similarity between two words in the vocabulary

You can even use the Word2Vec model to return the similarity between two words that are present in the vocabulary.
This is the basic usage of the representations of the concepts.
Having a numerical representation allows us to compute the similarity between any two words.
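
Both words have to be in the vocabulary, otherwise the lookup fails with an error. A small guard like the following can make experimenting more comfortable. The helper `safe_similarity` is hypothetical and not part of gensim; it assumes a gensim 4.x style vectors object (such as `model.wv`) that exposes `key_to_index` and `similarity`.

```python
def safe_similarity(wv, word_a, word_b):
    """Return the similarity of two words, or None if either word is unknown."""
    missing = [w for w in (word_a, word_b) if w not in wv.key_to_index]
    if missing:
        print("Not in the vocabulary:", ", ".join(missing))
        return None
    return wv.similarity(word_a, word_b)

# Usage with the model trained above (not executed here):
# safe_similarity(model.wv, "hotel", "restaurant")
```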



In [15]:
# similarity between two different words
model.wv.similarity(w1="bed",w2="sleep")

0.21008974

In [9]:
# similarity between two related words
model.wv.similarity(w1="hotel",w2="restaurant")

0.25024793

In [10]:
# similarity between two unrelated words
model.wv.similarity(w1="hotel",w2="street")

0.13489385

Under the hood, the above three snippets compute the cosine similarity between the word vectors of the two specified words. If you compare two identical words, the score will be 1.0. In general, the cosine similarity lies in the range [-1.0, 1.0]; the higher the score, the more similar the two words are. You can read more about cosine similarity scoring [here](https://en.wikipedia.org/wiki/Cosine_similarity).
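
For reference, the cosine similarity of two vectors $a$ and $b$ can be written as follows. The two-dimensional vectors in the worked example are made up purely to show the arithmetic.

```latex
\[
\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
             = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2}\,\sqrt{\sum_{i} b_i^2}}
\]

\[
a = (1, 2),\quad b = (2, 1):\qquad
\cos(\theta) = \frac{1 \cdot 2 + 2 \cdot 1}{\sqrt{5}\,\sqrt{5}} = \frac{4}{5} = 0.8
\]
```

If $a$ and $b$ are the same vector, the numerator equals the denominator and the score is exactly 1.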

### Find the odd one out
You can even use Word2Vec to find odd items given a list of items.

In [21]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["hotel","motel","street"])

'street'
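
One way to picture how such an odd-one-out lookup can work: average the vectors of all words in the list and pick the word whose vector is least similar to that average. The sketch below follows this idea with hand-made two-dimensional vectors. The vectors and the function `toy_doesnt_match` are invented for this illustration and should be read as an approximation, not as gensim's exact code.

```python
import math

# Hand-made toy vectors, for illustration only.
toy_vectors = {
    "hotel":  [0.9, 0.1],
    "motel":  [0.8, 0.2],
    "street": [0.1, 0.9],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_doesnt_match(vectors, words):
    """Return the word whose vector is least similar to the mean of all vectors."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dims)])
    # the smallest cosine similarity to the mean marks the outlier
    return min(words, key=lambda w: sum(m * x for m, x in zip(mean, units[w])))

odd = toy_doesnt_match(toy_vectors, ["hotel", "motel", "street"])
print(odd)
```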

## Understanding some of the (hyper-) parameters
To train the model earlier, we had to set some hyperparameters. Now, let's try to understand what some of them mean. For reference, this is the command that we used to create the model.

```
model = gensim.models.Word2Vec(documents, vector_size=150, window=5, min_count=15, workers=10)
```

### `vector_size`
The size of the dense vector that represents each token or word (this parameter was called `size` in gensim versions before 4.0). If you have very limited data, then `vector_size` should be a much smaller value. If you have lots of data, it's good to experiment with various sizes. Values of 100-150 dimensions are common.

### `window`
The maximum distance between the target word and its neighboring words. Neighbors that lie further away than the window width, to the left or to the right, are not considered to be related to the target word. In theory, a smaller window should strengthen relationships that are synonymous, while larger window sizes favor associative relationships.

    The distinction between synonymous and associative relationships is based on findings in cognitive linguistics. Based on word priming experimentation, two main relations between words have been identified (see [CHIA1990]): synonymous relations (also referred to as similar or semantic relations in the cognitive science literature) and associative relations. As outlined in [CHIA1990], the distinction for both relationship types is not exclusive; that is, word relations are not exclusively synonymous or associative. Doctor - Nurse is an example of a word relation that can be considered as being of a synonymous-associative nature.


    Two terms/words are associatively related if they often appear in a shared context. The following are examples of this type of relationship:

            Spider - Web
            Google - Page rank
            Smoke - Cigarette
            Phone - Call
            Lighter - Fire

    Two terms/words are synonymously related if they share common features. The following are examples of this type of relationship:

            Wolf - Dog
            Cetacean - Whale
            Astronaut - Man
            Car - Van
            Smartphone - iPhone 4s

[CHIA1990] Chiarello, Christine, et al. Semantic and associative priming in the cerebral hemispheres: Some words do, some words don’t... sometimes, some places. Brain and Language 38.1 (1990): 75-104.
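
To make the `window` parameter concrete, the following sketch lists which neighbors count as context for a target word at two different window sizes. The function `context_pairs` and the example sentence are made up for this illustration. It always uses the full window, whereas real implementations may vary the effective window during training.

```python
def context_pairs(tokens, window):
    """Yield (target, neighbor) pairs for neighbors at most `window` positions away."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield target, tokens[j]

tokens = ["the", "room", "was", "very", "clean"]

# neighbors of "was" with a narrow and a wider window
narrow = [n for t, n in context_pairs(tokens, window=1) if t == "was"]
wide = [n for t, n in context_pairs(tokens, window=2) if t == "was"]
print(narrow)
print(wide)
```

With `window=1` only the direct neighbors `room` and `very` are seen as context of `was`; with `window=2` the words `the` and `clean` are included as well.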


### `min_count`
Minimum frequency count of words. The model ignores words that do not satisfy the `min_count`. Extremely infrequent words are usually unimportant (spelling mistakes, non-words), so it's best to get rid of those. Unless your dataset is really tiny, this does not really affect the model.

### `workers`
The number of threads used behind the scenes to train the model in parallel.
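
A reasonable starting point is often the number of CPU cores of the machine. The following sketch shows one way to look this up with the standard library. Treat it as a rule of thumb rather than a recommendation from the gensim authors; the variable names are chosen for this example.

```python
import os

# os.cpu_count() can return None on some platforms, so fall back to 1
available_cores = os.cpu_count() or 1
workers = max(1, available_cores)
print("suggested workers value:", workers)
```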
