A series of scripts to download and parse the OpenSubtitles corpus.
.ipynb_checkpoints
.gitignore
OpenSubtitles.ipynb
README.md
analyze.py
explore.py
load.py
opensubtitles.py
opensubtitles.sh
opensubtitles_imdb.tsv
output.html
parse.py
xml.py

README.md

OpenSubtitles

This repository is a collection of scripts that help download and parse the OpenSubtitles corpus.

  • opensubtitles.sh: downloads, extracts, and merges the 2012/2013 corpora from http://opus.lingfil.uu.se
  • opensubtitles.py: naive attempt to download a single English subtitle for each IMDb ID; rate-limited to 200 downloads per day

  • analyze.py: tries to cluster a single year of movie transcripts
  • explore.py: prints a list of all genres for the given year
  • load.py: loads all subtitles into memory for a given year, used by all other scripts
  • xml.py: parses an XML file into a subtitle.txt file
  • parse.py: finds the IMDb ID corresponding to an OpenSubtitles ID and fetches its metadata as JSON
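
The XML-to-text step that xml.py performs can be sketched roughly as follows. This is not the code from xml.py. It is a minimal illustration that assumes the OPUS-style layout in which each sentence is an `<s>` element and each token is a `<w>` child. The function name `subtitle_lines` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def subtitle_lines(xml_text):
    """Yield one plain-text line per <s> element (assumed OPUS-style layout)."""
    root = ET.fromstring(xml_text)
    for sentence in root.iter("s"):
        # Assumption: every token lives in a <w> child; <time> elements are skipped.
        tokens = [w.text.strip() for w in sentence.iter("w") if w.text and w.text.strip()]
        if tokens:
            yield " ".join(tokens)


# Hypothetical sample, shaped like a tokenized subtitle file.
SAMPLE = """<document id="hypothetical">
  <s id="1">
    <time id="T1S" value="00:00:01,000" />
    <w id="1.1">Hello</w>
    <w id="1.2">there</w>
    <w id="1.3">.</w>
    <time id="T1E" value="00:00:02,500" />
  </s>
  <s id="2">
    <w id="2.1">Goodbye</w>
    <w id="2.2">.</w>
  </s>
</document>"""

lines = list(subtitle_lines(SAMPLE))
print("\n".join(lines))
```

One thing to watch for when trying this out: a file named xml.py in the working directory can shadow Python's standard `xml` package, so a sketch like this is best run from outside the repository directory.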

Citations

OpenSubtitles: http://www.opensubtitles.org/

Jörg Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012).