<h1><center>Training on the DTM package: Session 1</center></h1>
<div style="text-align: center">Adnane Ez-zizi, 04 May 2020</div> 
<img src="./Figures/logo.png" width="150"/>

# Table of Contents

- - -

##### 0. <a href=#0>Preliminary steps</a>
##### I. <a href=#I>Introduction</a>
##### II. <a href=#II>The naive discriminative learning model</a>
##### III. <a href=#III>Exercises</a>
##### IV. <a href=#IV>Practical application</a>

# 0. Preliminary steps <a ID="0"></a> 

- - -

- Install [Jupyter Notebook](https://jupyter.readthedocs.io/en/latest/install.html) 
- Clone the [GitHub repo](https://github.com/Adnane017/DTM_training) for the training session
- For more information about the DTM package, visit: https://github.com/Adnane017/Deep_text_modelling 

# I. Introduction <a ID="I"></a> 



## Examples of classification problems on language data

- - -

Some examples of classification problems on text data from both machine learning and behavioural perspectives:

- **Language research:** 
 
 
     - English articles 
     - English tense
     - Russian aspect


- **Machine learning (NLP):** 


     - Sentiment analysis of tweets or customers’ reviews 
     - Spam filter and email classification
     - Plagiarism detection  


- **Question:** Any other examples from your work?

## Why are both the machine learning and plausible behavioural approaches interesting? 

- - -

- Why should each camp consider the other camp’s approach?  


    - Powerful machine learning models (e.g. deep learning) can provide an indication 
      about the maximum amount of information that can be extracted from the data. 
    
    - Deep learning models are often seen as black boxes. Behaviourally plausible models do 
      not suffer as much from this problem.
    
    - Behaviourally plausible models do not necessarily perform badly. People outperform the 
      most powerful engineering models in some basic tasks (e.g. ).

## The Deep Text Modelling (DTM) package 

- - -

A Python package for processing and modelling text data, designed (mainly) for language researchers. <br>
Two main objectives behind the creation of the package:

1) Make life easier for researchers who want to model language learning. <br>


2) Bridge the gap between the machine learning and behavioural worlds by proposing a unifying framework <br> 
   that both types of users can work with. The framework also makes it easy to compare behaviourally <br> plausible and deep learning models for different types of language data. 

## Main features of DTM

- - -

- Offers a consistent and easy-to-use code syntax to model language data from small and large corpora.


- All the “ugly” work needed to run Keras algorithms on large corpora is taken care of in the background for the user.


- Provides useful tools to speed up the pre-processing and evaluation steps necessary before or after the modelling.


- Can train and tune pre-trained and task-specific embeddings in addition to classical one-hot encodings.


- For now DTM can work with binary and multiclass classification problems, but the plan is to offer <br>
  support for all possible types of classification problems. 


- The package comes with multiple examples to illustrate how its models work (more examples will be added in the future).

# II. The naive discriminative learning model <a ID="II"></a> 



## The associative learning framework: Pavlov's experiment

- - -

<img src="./Figures/pavlov.png" width="450"/><center><br>Image source: <a href="https://mariyamulwan.wordpress.com/2014/03/02/classical-conditioning-in-behavioural-learning-theory/">https://mariyamulwan.wordpress.com/2014/03/02/classical-conditioning-in-behavioural-learning-theory</a></center>

## The associative learning framework for language learning

- - -

- We will use terminology that differs from the one used in psychology: 

   - The stimulus that we want to predict is called an outcome (e.g. food). Multiple outcomes are possible.
   - The stimuli that predict the occurrence of an outcome are called cues (e.g. the sound of a bell).
   - A learning event is one experience of the co-occurrence of cues and outcomes.
   
   
- In the same way that the dog in Pavlov’s experiment learned to associate the sound of the bell with food, we can <br> form associations between the language stimuli that we are exposed to.


- **Example:** Learn to predict the grammatical number of nouns based on context. The cues could be all the words that <br> surround the noun, and the outcome is whether the noun is singular or plural. Each noun in a sentence would form <br> a separate event.
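
The grammatical number example can be sketched in a few lines of Python. This is only an illustration of how cues and outcomes might be laid out: the function name `make_events` is hypothetical (it is not part of DTM), and the noun positions and their number are annotated by hand rather than detected automatically.

```python
def make_events(tokens, noun_numbers):
    """Build one (cues, outcome) learning event per annotated noun.

    tokens: list of words in the sentence.
    noun_numbers: dict mapping the position of each noun to its
        grammatical number ("singular" or "plural"); hand-annotated here.
    """
    events = []
    for position, number in sorted(noun_numbers.items()):
        # Cues: every distinct word found at the other positions of the sentence.
        cues = sorted({tok for i, tok in enumerate(tokens) if i != position})
        events.append((cues, number))
    return events


tokens = "the players gathered on the pitch".split()
# Hand annotation (assumption): "players" is plural, "pitch" is singular.
events = make_events(tokens, {1: "plural", 5: "singular"})
for cues, outcome in events:
    print(cues, "->", outcome)
```

Each noun yields its own event, so this six-word sentence produces two events that share most of their cues but have different outcomes.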

## The Rescorla-Wagner model

- - -

- The naive discriminative learning (NDL) model is an adaptation of the Rescorla-Wagner model (R-W; Rescorla and Wagner, 1972) for <br> language learning and processing. R-W describes computationally how the associations between cues and outcomes are acquired. 


- After encountering each event, the learner updates the association weight between a cue $i$ and an outcome $j$, depending on <br> whether or not they appear in the event:

$$ w^{t+1}_{ij} = w^{t}_{ij} + \Delta w^{t}_{ij} $$

<br>

\begin{equation}
\Delta w^{t}_{ij} =
\left\{
	\begin{array}{ll}
		0  & \mbox{if cue $i$ is absent} \\
        \gamma(1 - \sum_{i'}w^{t}_{i'j})  & \mbox{if cue $i$ and outcome $j$ are both present} \\
		\gamma(0 - \sum_{i'}w^{t}_{i'j}) & \mbox{if cue $i$ is present and outcome $j$ is absent} 
	\end{array}
\right.
\end{equation}

- $t$: current trial
- $\gamma$: learning rate
- $i'$: index running over the cues present in the current event
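
The update rule can be written as a short Python function. This is a minimal sketch of the simplified rule stated in this section, not the DTM implementation: the name `rw_update`, the dictionary layout of the weights and the default learning rate of 0.1 are all assumptions made for illustration.

```python
def rw_update(weights, cues, outcomes, all_outcomes, gamma=0.1):
    """Apply one Rescorla-Wagner update for a single learning event.

    weights: dict mapping (cue, outcome) pairs to association weights;
        missing pairs are treated as having weight 0.
    cues: cues present in the event.
    outcomes: outcomes present in the event.
    all_outcomes: every outcome the learner knows about.
    gamma: learning rate (0.1 is an arbitrary illustrative value).
    """
    cues = set(cues)
    for outcome in all_outcomes:
        # Summed weight of the cues present in this event for this outcome.
        total = sum(weights.get((cue, outcome), 0.0) for cue in cues)
        # Target is 1 if the outcome is present, 0 if it is absent.
        target = 1.0 if outcome in outcomes else 0.0
        delta = gamma * (target - total)
        # Only present cues are updated; absent cues keep their weights.
        for cue in cues:
            weights[(cue, outcome)] = weights.get((cue, outcome), 0.0) + delta
    return weights


weights = {}
all_outcomes = ["singular", "plural"]
# The same event is experienced twice.
rw_update(weights, ["the", "gathered"], ["plural"], all_outcomes)
rw_update(weights, ["the", "gathered"], ["plural"], all_outcomes)
print(round(weights[("the", "plural")], 3))  # 0.18
```

On the first event the summed weight is 0, so each present cue gains $0.1 \times (1 - 0) = 0.1$. On the second event the summed weight is already 0.2, so the gain shrinks to $0.1 \times (1 - 0.2) = 0.08$: the better the cues already predict the outcome, the less is learned.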

## Interpretation of the association weights

- - -

- An association weight measures the tendency of an outcome to be triggered by the presence of a cue. <br><br>

- In the grammatical number example, it reflects the tendency of the singular or plural form to occur in the presence of a certain word. <br><br>

- A higher positive association weight value for a particular form corresponds to a higher likelihood of occurrence of that form. <br><br>

- A more strongly negative value corresponds to a higher likelihood of non-occurrence of that form. <br><br> 

- Values close to zero mean low chances of observing the form. <br><br>


## Generating choices from the model: Procedure

---

To generate an outcome choice given a certain set of cues: 

1) We calculate the activation of each outcome by summing the association weights between the outcome and each of the cues.

2) We convert the computed activations into choice probabilities using the [softmax rule](https://en.wikipedia.org/wiki/Softmax_function).

3) The predicted choice from the model is then the outcome that has the highest softmax probability.
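
The three steps above can be sketched in Python as follows. The function name `predict` and the weight values are hypothetical and chosen by hand for illustration; they do not come from a trained model or from the DTM API.

```python
import math


def predict(weights, cues, outcomes):
    """Return (predicted outcome, softmax probabilities) for a set of cues."""
    # Step 1: activation = sum of the weights from each present cue.
    activations = {
        o: sum(weights.get((c, o), 0.0) for c in cues) for o in outcomes
    }
    # Step 2: softmax. Subtracting the maximum activation avoids overflow
    # and does not change the resulting probabilities.
    max_act = max(activations.values())
    exps = {o: math.exp(a - max_act) for o, a in activations.items()}
    norm = sum(exps.values())
    probs = {o: e / norm for o, e in exps.items()}
    # Step 3: choose the outcome with the highest probability.
    return max(probs, key=probs.get), probs


# Hypothetical weights, for illustration only.
weights = {
    ("the", "singular"): 0.1, ("the", "plural"): 0.1,
    ("gathered", "singular"): -0.3, ("gathered", "plural"): 0.5,
}
choice, probs = predict(weights, ["the", "gathered"], ["singular", "plural"])
print(choice, probs)
```

With these weights the activations are $-0.2$ for the singular and $0.6$ for the plural, so the plural receives the higher probability and is the predicted choice.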

## Generating choices from the model: An example

---

Back to our grammatical number example: given the sentence "the players gathered on the pitch" and the target noun "players", the cues are the surrounding words, and the activations of the singular (s) and plural (p) outcomes are as follows: 

\begin{equation}
\begin{aligned}
a(s) &= w(\text{the}, s) + w(\text{gathered}, s) + w(\text{on}, s) + w(\text{pitch}, s)\\
a(p) &= w(\text{the}, p) + w(\text{gathered}, p) + w(\text{on}, p) + w(\text{pitch}, p)
\end{aligned}
\end{equation}

The softmax probabilities are given by:

\begin{equation}
\begin{aligned}
prob(s) &= \frac{e^{a(s)}}{e^{a(s)} + e^{a(p)}}\\
prob(p) &= \frac{e^{a(p)}}{e^{a(s)} + e^{a(p)}}
\end{aligned}
\end{equation}

If $prob(s) > prob(p)$, the model predicts the singular form; otherwise it predicts the plural form.   
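
As a purely illustrative calculation, suppose the summed weights gave the activations $a(s) = 0.2$ and $a(p) = 1.2$ (hypothetical values, not taken from a trained model). The softmax rule then gives:

\begin{equation}
\begin{aligned}
prob(s) &= \frac{e^{0.2}}{e^{0.2} + e^{1.2}} \approx \frac{1.221}{1.221 + 3.320} \approx 0.27\\
prob(p) &= \frac{e^{1.2}}{e^{0.2} + e^{1.2}} \approx \frac{3.320}{1.221 + 3.320} \approx 0.73
\end{aligned}
\end{equation}

Under these assumed activations the plural form has the higher probability, so the model would predict the plural for "players".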

# III. Exercises <a ID="III"></a>


## Exercise 1

- - -

This is an example that you can (and should) work through with a pocket calculator, paper and pencil to fully understand how the association weights are updated. The example is presented in the PDF file `Handmade_Tomatoes.pdf`. You can also read the following three blog posts later for a more detailed presentation of the NDL model:

- [https://outofourminds.bham.ac.uk/blog/8](https://outofourminds.bham.ac.uk/blog/8)
- [https://outofourminds.bham.ac.uk/blog/9](https://outofourminds.bham.ac.uk/blog/9)
- [https://outofourminds.bham.ac.uk/blog/10](https://outofourminds.bham.ac.uk/blog/10)

## Exercise 2

- - -

This is the same example as in Exercise 1, but this time you will use DTM to learn the weights.

To run the code, please open the Jupyter notebook `Python_Tomatoes.ipynb`.


# IV. Practical application <a ID="IV"></a>

- - -

In our practical application, we will use the popular IMDB dataset to predict the sentiment of a movie review (positive or negative). The code is in the Jupyter notebook `Sentiment_analysis.ipynb`.


**THE END**