LSTM-Language-Models

Experiments with LSTM networks for language modeling. The language models implemented here can easily be used with any type of corpus. In my experiments I am using the Hutter Prize 100 MB dataset, which I have split into training and validation sets (80%/20%).

The language model implemented here is a two-layer LSTM network that predicts the next character for a given sequence of characters. The trained model can then be used to generate new text, character by character, that resembles the original training corpus.
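
As a rough illustration, a two-layer character-level LSTM of this kind might look as follows in Keras. The layer sizes, sequence length, and vocabulary size below are assumptions for the sketch, not the exact hyper-parameters used in this repository.

```python
# Minimal sketch of a two-layer character-level LSTM in Keras.
# SEQ_LEN, VOCAB_SIZE and the layer widths are illustrative assumptions,
# not the exact values used in this repository.
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

SEQ_LEN = 512      # characters per training sequence (see Training below)
VOCAB_SIZE = 256   # size of the character vocabulary

inputs = Input(shape=(SEQ_LEN, VOCAB_SIZE))           # one-hot encoded characters
x = LSTM(512, return_sequences=True)(inputs)          # first LSTM layer
x = LSTM(512)(x)                                      # second LSTM layer
outputs = Dense(VOCAB_SIZE, activation='softmax')(x)  # distribution over the next character

model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy')
```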

Requirements

I highly suggest using the GPU version of TensorFlow, because training involves heavy computation. To use the GPU version of TensorFlow you will first need to install CUDA.

Run pip install -r requirements.txt to install the requirements.

Usage

Data

To train your own language models you will need a corpus of data. The required pre-processing is minimal: in my experiments I simply split the raw text into training and validation sets. You can use util.py to split the Wiki corpus. If you want to use your own data, split it into train_set.txt and val_set.txt and place both files in a data/ directory.
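
If you prefer to do the split yourself rather than use util.py, a simple 80%/20% split could look like the sketch below. The corpus file name is a placeholder, and the split point mirrors the ratio used in my experiments.

```python
# Hedged sketch: split a raw text corpus 80%/20% into the train_set.txt and
# val_set.txt files expected in the data/ directory. util.py is the
# repository's own splitting script; this is only an illustration.
import os

with open('corpus.txt', encoding='utf-8') as f:   # path to your raw corpus (placeholder name)
    text = f.read()

split = int(0.8 * len(text))
os.makedirs('data', exist_ok=True)
with open('data/train_set.txt', 'w', encoding='utf-8') as f:
    f.write(text[:split])
with open('data/val_set.txt', 'w', encoding='utf-8') as f:
    f.write(text[split:])
```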

Training

Once the data are ready you can start training your own language model. In my experiments I split each set into wiki pages (using \n as the delimiter) and each wiki page into batches of 512 characters. These batches are then fed into the model one at a time for training.
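
A sketch of that batching scheme (splitting on \n into pages, then slicing each page into 512-character chunks) might look like the following. The function name and the iteration style are hypothetical; the repository's training script may batch differently.

```python
# Hedged sketch of the batching described above: split the corpus into wiki
# pages on '\n', then cut each page into 512-character chunks. The function
# name is illustrative, not part of the repository's API.
SEQ_LEN = 512

def iter_batches(path):
    with open(path, encoding='utf-8') as f:
        text = f.read()
    for page in text.split('\n'):                  # one wiki page per line
        for i in range(0, len(page), SEQ_LEN):     # fixed-length character chunks
            chunk = page[i:i + SEQ_LEN]
            if chunk:
                yield chunk

# Example: count the batches produced from the training set.
n = sum(1 for _ in iter_batches('data/train_set.txt'))
print(f'{n} batches of up to {SEQ_LEN} characters')
```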

Generation

Once the model has converged, you can use it to generate new text samples. Generation requires an initial seed text to start the sampling chain, and the randomness of the output can be controlled with the temperature parameter. Use generate_text.py to generate new text with a trained model.
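
The temperature parameter rescales the model's output distribution before sampling: low temperatures make sampling greedier and more conservative, high temperatures make it more random. A common implementation of this idea, similar in spirit to what generate_text.py does (the exact code may differ), is sketched below.

```python
# Hedged sketch of temperature sampling over a softmax output. This follows
# the usual character-level sampling recipe; generate_text.py may differ in detail.
import numpy as np

def sample_char(probs, temperature=1.0):
    """Sample an index from a probability vector, rescaled by temperature."""
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-8) / temperature   # lower T -> sharper distribution
    exp = np.exp(logits - np.max(logits))
    probs = exp / exp.sum()
    return np.random.choice(len(probs), p=probs)
```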
