Word embeddings such as word2vec, GloVe and other Distributional Semantic Models have in recent years been successfully applied to various Natural Language Processing tasks. Many such tasks involve some kind of classification, for example sentiment analysis and text categorization.
This repository contains the main code of my Master's Thesis [1], in which I compare the Random Indexing [2] embedding to other popular embeddings by their performance in supervised learning on sentiment analysis tasks. I also explore two approaches that update the embeddings jointly with the task to boost performance.
The code is written for Python 2.7 with the following dependencies:
- numpy (v1.10.1)
- theano (v0.7.0)
- lasagne (v0.1)
- scikit-learn (v0.17)
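For example, the pinned versions can be installed with pip (this exact command is a suggestion, not taken from the repository):

pip install numpy==1.10.1 Theano==0.7.0 Lasagne==0.1 scikit-learn==0.17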
- To run the experiments in the thesis, begin by making sure you have the dependencies installed.
- Clone the repository:
git clone https://github.com/tobiasnorlund/Parameterized_RI.git
- Download or build the embeddings you intend to use.
- The Random Indexing vectors (RI, SGD_RI, ATT_RI) were generated using http://github.com/tobiasnorlund/CorpusParser.git
- The Skip-Gram embeddings were generated using https://github.com/danielfrg/word2vec
- The GloVe embeddings were generated using http://nlp.stanford.edu/projects/glove/
- Edit run.py and set the paths to the embeddings.
- Run an experiment by:
python run.py BOW|RAND|PMI|RI|SGD_RI|ATT_RI|SG|GL MLP|CNN PL05|SST
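For example, to train the MLP model on the SST dataset using the Random Indexing embedding:

python run.py RI MLP SST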
Additionally, to run batches and/or replications, run:
python batch.py <emb(s)> <mdl(s)> <ds(s)> [<replications>=1]
for example:
python batch.py RI,SGD_RI - SST 2
runs two replications each of the RI MLP SST, RI CNN SST, SGD_RI MLP SST and SGD_RI CNN SST experiments (the - selects all models).
To separate concerns, the code is split into three Python packages, embedding, model and dataset, which contain the embeddings, models and datasets used in the thesis. This follows the pipeline described in the thesis, where, using well-defined interfaces, any combination of embedding, model and dataset can be composed into an experiment. See each package's __init__.py for the exact interface definitions.
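For illustration, composing an experiment might look roughly like the following sketch; all class names and constructor signatures here are assumptions, and the actual interfaces are the ones defined in the __init__.py files:

```python
# Hypothetical composition, for illustration only; the real class names
# and signatures are defined in each package's __init__.py
from embedding import RandomIndexing   # assumed name
from model import MLP                  # assumed name
from dataset import SST                # assumed name

embedding = RandomIndexing('/path/to/ri/vectors')  # assumed constructor
dataset = SST()
model = MLP(embedding, dataset)                    # assumed signature
model.train()
```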
Conceptual experiment:
----------------- -------------------- ------------------
| | | | | |
| Embedding |---->| Model |<---| Dataset |
| | | | | |
----------------- -------------------- ------------------
An Embedding implementation supplies the model with a Theano expression of an n-by-d matrix, where row i corresponds to the i-th word vector in the sentence to be classified.
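As a minimal sketch of what such an expression could look like in Theano (the lookup-table approach and all variable names here are assumptions for illustration, not the repository's actual implementation):

```python
import numpy as np
import theano
import theano.tensor as T

# Hypothetical vocabulary size and embedding dimensionality
V, d = 10000, 300

# Shared lookup table: one d-dimensional vector per vocabulary word
W = theano.shared(np.random.randn(V, d).astype(theano.config.floatX), name='W')

# Indices of the n words in the sentence to be classified
word_ids = T.ivector('word_ids')

# Theano expression for the n-by-d matrix; row i is the i-th word's vector
sentence_matrix = W[word_ids]

# Compile and evaluate on a toy sentence of three word ids
f = theano.function([word_ids], sentence_matrix)
print(f(np.array([1, 5, 42], dtype='int32')).shape)  # (3, 300)
```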
[1] Norlund, T., The Use of Distributional Semantics in Text Classification Models, 2016, http://www.tobias.norlund.se/portfolio/masters-thesis
[2] Sahlgren, M., An Introduction to Random Indexing, 2005