@maxkoryukov
Contributor

Hello!

This PR implements Sentence Buffer: Split (SBS), which outputs subtitles split into whole sentences.

Usage:

./ccextractor -sbs ~/source.ts

Currently it works only for subtitles with sub->type == CC_BITMAP. Implementation details are in the review comments on this PR.

Long example

New output

1
00:00:00,001 --> 00:00:00,189
Oleon costs.

2
00:00:00,191 --> 00:00:00,783
buried in the annex, 95 Oleon costs.

3
00:00:00,785 --> 00:00:05,159
Didn't want to acknowledge the pressures on hospitals, schools and infrastructure.

Old output

1
00:00:00,001 --> 00:00:00,000
Oleon

2
00:00:00,001 --> 00:00:00,189
Oleon costs.

3
00:00:00,190 --> 00:00:00,889
buried in the annex, 95 Oleon costs.
Didn't

4
00:00:00,890 --> 00:00:01,129
buried in the annex, 95 Oleon costs.
Didn't want

5
00:00:01,130 --> 00:00:01,359
buried in the annex, 95 Oleon costs.
Didn't want to

6
00:00:01,360 --> 00:00:02,059
buried in the annex, 95 Oleon costs.
Didn't want to acknowledge

7
00:00:02,060 --> 00:00:02,299
buried in the annex, 95 Oleon costs.
Didn't want to acknowledge the

8
00:00:02,300 --> 00:00:03,419
Didn't want to acknowledge the
pressures

9
00:00:03,420 --> 00:00:03,609
Didn't want to acknowledge the
pressures on

10
00:00:03,610 --> 00:00:04,029
Didn't want to acknowledge the
pressures on hospitals,

11
00:00:04,030 --> 00:00:04,779
Didn't want to acknowledge the
pressures on hospitals, schools

12
00:00:04,780 --> 00:00:05,019
Didn't want to acknowledge the
pressures on hospitals, schools and

13
00:00:05,020 --> 00:00:05,159
pressures on hospitals, schools and
infrastructure.

@maxkoryukov
Contributor Author

fix maxkoryukov#1

return wrote_something;
}
else
// Write subtitles as they come
Contributor Author

A lot of lines are changed, but most of the changes are just deleted characters, so ignore whitespace when reviewing:

https://github.com/CCExtractor/ccextractor/pull/491/files?w=1

 // in sentences
 if (sub->type == CC_BITMAP)
-	wrote_something = write_cc_bitmap_to_sentence_buffer(sub, context);
+	sub = reformat_cc_bitmap_through_sentence_buffer(sub, context);
Contributor Author

This is the most important point in this PR.

SBS works as a transformation filter: it takes incoming subs and converts them into other subs with sub->type = CC_TEXT. The transformation happens before any of the encoders run, so the encoders themselves remain unchanged.
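
To illustrate the idea of buffering text until a sentence is complete, here is a minimal sketch. It is not the code from this PR: `struct sbs_buffer`, `sbs_init`, `sbs_append` and `sbs_pop_sentence` are hypothetical names, and the sketch assumes each incoming fragment contains only new words (it does not de-duplicate the overlapping lines seen in the old output, and it ignores timestamps).

```c
#include <stddef.h>
#include <string.h>

#define SBS_CAPACITY 1024

/* Hypothetical buffer type for this sketch; not the PR's actual structure. */
struct sbs_buffer {
    char text[SBS_CAPACITY]; /* always NUL-terminated */
    size_t len;              /* strlen(text) */
};

static void sbs_init(struct sbs_buffer *b)
{
    b->len = 0;
    b->text[0] = '\0';
}

/* Append a fragment, inserting one space between fragments.
 * Returns 0 on success, -1 if the fragment does not fit. */
static int sbs_append(struct sbs_buffer *b, const char *fragment)
{
    size_t n = strlen(fragment);
    size_t sep = (b->len > 0) ? 1 : 0;

    if (b->len + sep + n >= SBS_CAPACITY)
        return -1;
    if (sep)
        b->text[b->len++] = ' ';
    memcpy(b->text + b->len, fragment, n + 1); /* n + 1 copies the NUL too */
    b->len += n;
    return 0;
}

/* Move the first complete sentence (ending in '.', '!' or '?') into out.
 * Returns 1 if a sentence was extracted, 0 if no sentence is complete yet
 * or out is too small to hold it. */
static int sbs_pop_sentence(struct sbs_buffer *b, char *out, size_t out_size)
{
    size_t end = strcspn(b->text, ".!?");
    size_t rest;

    if (end == b->len)
        return 0;            /* no terminator found yet */
    end++;                   /* keep the terminator in the sentence */
    if (end >= out_size)
        return 0;

    memcpy(out, b->text, end);
    out[end] = '\0';

    /* Drop the sentence and any spaces that follow it from the buffer. */
    rest = end;
    while (b->text[rest] == ' ')
        rest++;
    memmove(b->text, b->text + rest, b->len - rest + 1);
    b->len -= rest;
    return 1;
}
```

In a filter like the one described above, each popped sentence would become one outgoing sub with sub->type = CC_TEXT; how the real code assigns start and end times is not covered by this sketch.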

@@ -0,0 +1,11 @@
#ifndef _DEBUG_DEF_H_
Contributor Author

This is just a helper for debugging. You could remove this file and all references to LOG_DEBUG, but they are useful for debugging (with the existing tests).
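
For readers who have not opened the file, a debug helper of this kind usually looks something like the sketch below. This is an assumption about the general shape, not the actual contents of the header: only the name LOG_DEBUG comes from this PR, and `SBS_DEBUG_ENABLED` is a hypothetical switch invented for the example.

```c
#include <stdio.h>

/* Hypothetical switch for this sketch: build with -DSBS_DEBUG_ENABLED=1
 * to turn logging on. The real header may use a different mechanism. */
#ifndef SBS_DEBUG_ENABLED
#define SBS_DEBUG_ENABLED 0
#endif

/* Print a printf-style message to stderr when logging is enabled.
 * When disabled, the call is still compiled (so format strings keep being
 * checked) but its arguments are not evaluated and nothing is printed. */
#define LOG_DEBUG(...)                        \
    do {                                      \
        if (SBS_DEBUG_ENABLED)                \
            fprintf(stderr, __VA_ARGS__);     \
    } while (0)
```

With a macro like this, removing the debug output later only requires flipping the switch rather than deleting every call site.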

@@ -0,0 +1,59 @@
SHELL = /bin/sh
Contributor Author

The new tests folder contains unit tests for SBS, written with libcheck.

It also has a README.md with short how-to instructions.

@maxkoryukov
Contributor Author

@cfsmp3, @canihavesomecoffee, is there a chance of merging this upstream?

@cfsmp3
Contributor

cfsmp3 commented Dec 14, 2016 via email
