Some scripts for processing movie subtitles

srt2xml ....... convert subtitles in srt-format to simple OPUS-style XML format
                (does sentence splitting and tokenization)
                (uses nonbreaking_prefix.* files for tokenization, which are
                just copies of the files distributed with the Europarl corpus
                version 3)

                Note that subtitle files are usually DOS files and srt2xml
                expects UNIX-style text files!
                --> use dos2unix before piping the text into srt2xml.pl

srtalign ...... align srt-files which have been converted to XML using srt2xml
                (requires time-stamps!)

                For more information on using this script and its options:
                Look at the header of the script!

share/dic ..... This directory contains word alignment dictionaries obtained by
                aligning the OpenSubtitles corpus from OPUS.
                These dictionaries can be used to improve sentence alignment by
                synchronizing time stamps with the help of anchor points found
                by matching dictionary entries with word pairs in the subtitle
                pair.
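
The idea of synchronizing time stamps with anchor points can be illustrated
with a small Python sketch. This is not the code used by srtalign (look at the
header of that script for its actual behaviour). It assumes that the two
subtitle files differ only by a constant offset and a constant speed ratio, so
that two anchor points are enough to fit the mapping. All function names, the
in-memory dictionary of word pairs and the sample data are hypothetical and
only serve the illustration.

```python
import re

# SRT time stamps look like "00:01:02,500" (hours:minutes:seconds,milliseconds).
_TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def parse_timestamp(text):
    """Convert an SRT time stamp to seconds (float)."""
    match = _TIMESTAMP.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an SRT time stamp: {text!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def find_anchor_candidates(source, target, dictionary):
    """Return (source_time, target_time) pairs linked by a dictionary entry.

    source and target are lists of (start_seconds, tokens); dictionary is a
    set of (source_word, target_word) pairs. Hypothetical in-memory format,
    not the file format of the dictionaries in share/dic.
    """
    candidates = []
    for s_time, s_tokens in source:
        for t_time, t_tokens in target:
            if any((s, t) in dictionary for s in s_tokens for t in t_tokens):
                candidates.append((s_time, t_time))
    return candidates


def fit_time_mapping(anchor_a, anchor_b):
    """Fit target = ratio * source + offset through two anchor points."""
    (s1, t1), (s2, t2) = anchor_a, anchor_b
    if s1 == s2:
        raise ValueError("anchor points need different source times")
    ratio = (t2 - t1) / (s2 - s1)
    offset = t1 - ratio * s1
    return ratio, offset


def rescale(seconds, ratio, offset):
    """Map a source time onto the target time line."""
    return ratio * seconds + offset


# Made-up example data: the target subtitles start later and run slower.
source = [(0.0, ["hello"]), (100.0, ["goodbye"])]
target = [(2.0, ["hallo"]), (127.0, ["vaarwel"])]
dictionary = {("hello", "hallo"), ("goodbye", "vaarwel")}

anchors = find_anchor_candidates(source, target, dictionary)
ratio, offset = fit_time_mapping(anchors[0], anchors[-1])
```

With the mapping fitted, every source time stamp can be passed through
rescale() before the two files are compared. Picking anchors far apart (here
the first and the last candidate) keeps the estimated ratio less sensitive to
small timing differences. A real dictionary would also produce spurious
matches, which this sketch does not filter.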