
Java-Tool to extract all historical data from the IMDB dumps (which are semi-structured text files)


HPI-Information-Systems/IMDBParser


IMDBParser

A Java tool that extracts all historical data from the IMDB dumps (semi-structured text files) and converts it into change records of the following form:

<t,e,p,v> (meaning that at time t, property p of entity e changed to the new value v)
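
To make the record format concrete, here is a minimal sketch of how such change records could be represented and derived by comparing two snapshots of one property. It is not taken from the tool's source: the names ChangeRecordSketch, Change and diff, as well as the sample titles, values and date, are hypothetical and only illustrate the <t,e,p,v> idea. Deleted entities are ignored for brevity.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Hypothetical illustration of the <t,e,p,v> format; not the tool's actual classes.
final class ChangeRecordSketch {

    /** One change record: at time t, property p of entity e changed to value v. */
    record Change(LocalDate t, String e, String p, String v) {
        @Override
        public String toString() {
            return "<" + t + "," + e + "," + p + "," + v + ">";
        }
    }

    /**
     * Compares two snapshots (entity -> value) of a single property and emits
     * one record for every entity whose value is new or different in the later
     * snapshot. Entities that disappeared are not reported in this sketch.
     */
    static List<Change> diff(LocalDate t, String property,
                             Map<String, String> before, Map<String, String> after) {
        List<Change> changes = new ArrayList<>();
        // Copy into a TreeMap so the output order is deterministic (sorted by entity).
        for (Map.Entry<String, String> entry : new TreeMap<>(after).entrySet()) {
            String oldValue = before.get(entry.getKey());
            if (!Objects.equals(oldValue, entry.getValue())) {
                changes.add(new Change(t, entry.getKey(), property, entry.getValue()));
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        // Invented sample data, for illustration only.
        Map<String, String> before = Map.of("Movie A (1999)", "7.1", "Movie B (2001)", "6.4");
        Map<String, String> after = Map.of("Movie A (1999)", "7.2", "Movie B (2001)", "6.4");
        diff(LocalDate.of(2015, 1, 9), "rating", before, after).forEach(System.out::println);
    }
}
```

With the invented sample data, only the first movie's rating differs between the two snapshots, so a single record is printed for it.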

The Data Source

Before IMDB started publishing its data as TSV files, it provided weekly updates as semi-structured text files. This parser extracts information from that original data, which can be obtained at ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/. The archive contains an original version of the database, whose exact date is unclear, as well as diff files from which the history can be reconstructed. Unfortunately, some diff files are missing, which is why only roughly three and a half years of data can be reconstructed.

Usage

The tool only works on Linux because it uses the patch command (see DiffApplyer). To extract the data, download the dumps as described above, then run ChangeExtractorMain with no parameters to view the usage instructions. The tool only supports parsing actors, composers, countries, directors, editors, genres, locations, plots and ratings.
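
As an illustration of the Linux dependency, the following sketch shows one way a Java program can hand a diff file to the external patch command through ProcessBuilder. It is an assumption about the general approach, not a copy of DiffApplyer: the class PatchSketch, its method names and the file names in the usage note are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch; the real DiffApplyer may work differently.
final class PatchSketch {

    /** Builds the argument list for: patch -i <diffFile> <targetFile> */
    static List<String> patchCommand(Path targetFile, Path diffFile) {
        return List.of("patch", "-i", diffFile.toString(), targetFile.toString());
    }

    /**
     * Runs patch, which modifies targetFile in place, and returns true if the
     * process exited with status 0. Requires the patch binary on the PATH.
     */
    static boolean applyDiff(Path targetFile, Path diffFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(patchCommand(targetFile, diffFile))
                .inheritIO() // forward patch's own messages to the console
                .start();
        return process.waitFor() == 0;
    }
}
```

A call such as PatchSketch.applyDiff(Path.of("ratings.list"), Path.of("ratings.diff")) would then advance that file by one diff. Applying the diffs in chronological order yields one snapshot per update.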

Architecture

Since the semi-structured text files all differ slightly, there is an individual parser for each of them. Most of these parsers are generated with ANTLR; the grammars can be found here.
