
# LTM: language time machine

AIR - WS2023 - Group 11

| Team member | Responsibility |
| --- | --- |
| Felix Holz | Data (pre-)processing |
| David Wildauer | ML model |
| Leopold Magurano | Visualization and evaluation |

Language evolves over time. We assume that it is possible for a machine learning model to pick up on these small changes, given a large enough dataset.

In this project, we employ various document representation techniques, such as word frequency analysis and doc2vec (an extension of word2vec), to create embeddings of the documents. These embeddings are used to train a machine learning model that predicts the time period (year) in which a text snippet was published.
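
As a rough sketch of this pipeline (not the project's actual configuration), the snippet below trains gensim's Doc2Vec on a toy corpus and fits a linear regression to predict the publication year from the resulting document vectors; the corpus, hyperparameters, and choice of regressor are all illustrative assumptions.

```python
# Illustrative sketch only: the toy corpus, hyperparameters, and the linear
# regressor are assumptions, not this project's final configuration.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LinearRegression

# (snippet, publication year) pairs; the real data comes from Project Gutenberg
corpus = [
    ("thou art most welcome hither good sir", 1610),
    ("the carriage rolled through the foggy streets", 1853),
    ("she uploaded the file and closed her laptop", 2015),
]

# Tag each snippet so Doc2Vec learns one vector per document
tagged = [TaggedDocument(words=text.split(), tags=[i])
          for i, (text, _) in enumerate(corpus)]

# Small vector size and few epochs keep the toy example fast
model = Doc2Vec(tagged, vector_size=32, min_count=1, epochs=40)

# Use the learned document vectors as features for a year regressor
X = [model.dv[i] for i in range(len(corpus))]
y = [year for _, year in corpus]
regressor = LinearRegression().fit(X, y)

# Embed an unseen snippet with infer_vector, then predict its year
vec = model.infer_vector("prithee good sir attend my tale".split())
print(round(regressor.predict([vec])[0]))
```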

The raw, unprocessed dataset can be downloaded here.

For our training, testing, and evaluation datasets we use text snippets from a subset of the publicly available Project Gutenberg eBooks. The text snippets and their corresponding publishing dates are parsed from the eBooks extracted from the .zim file[1] and written to a simple database, giving everyone on our team easy access to them. We implement and test different embeddings of these text snippets, which we then feed to our machine learning model to predict the time period, or more precisely the year, in which a snippet was published.
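
A minimal sketch of this extraction step, assuming the python-libzim bindings and SQLite as the simple database; the archive filename, entry path, table schema, and the extract_year helper are all hypothetical stand-ins for the real parsing logic:

```python
# Sketch under assumptions: python-libzim reads the archive, SQLite is the
# "simple database"; filename, path, schema, and extract_year are hypothetical.
import re
import sqlite3

from libzim.reader import Archive

def extract_year(text):
    """Hypothetical helper: grab the first plausible four-digit year from an
    eBook's header text; the real parser is more involved."""
    match = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", text)
    return int(match.group(1)) if match else None

zim = Archive("gutenberg.zim")       # hypothetical archive filename
db = sqlite3.connect("snippets.db")  # shared database for the team
db.execute("CREATE TABLE IF NOT EXISTS snippets (text TEXT, year INTEGER)")

# Hypothetical entry path; real paths come from walking the archive's index
entry = zim.get_entry_by_path("A/Moby_Dick")
html = bytes(entry.get_item().content).decode("utf-8")

year = extract_year(html)
if year is not None:
    # Store a fixed-size snippet together with its publication year
    db.execute("INSERT INTO snippets VALUES (?, ?)", (html[:500], year))
    db.commit()
```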
