Dmitr15/php-text-scraper

Simple text scraper in PHP

The parser does the following:

1) Outputs only direct speech (paragraphs starting with an em dash).

Output
— Папа, как насчет сказки? — спросил Кристофер Робин.
— Что насчет сказки? — спросил папа.
— Ты не мог бы рассказать Винни-Пуху сказочку? Ему очень хочется!
...
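The direct-speech filter can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function name `extractDirectSpeech` is hypothetical, and it assumes the input is plain text in which every non-empty line is one paragraph.

```php
<?php
// Hypothetical helper: keep only paragraphs that begin with an em dash.
function extractDirectSpeech(string $text): array
{
    // Assumption: each non-empty line of the input is one paragraph.
    $paragraphs = preg_split('/\R+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    $speech = [];
    foreach ($paragraphs as $paragraph) {
        $paragraph = trim($paragraph);
        // "\u{2014}" is the em dash, which takes 3 bytes in UTF-8.
        if (strncmp($paragraph, "\u{2014}", 3) === 0) {
            $speech[] = $paragraph;
        }
    }
    return $speech;
}

// Usage: narration lines are dropped, dialogue lines are kept.
$sample = "Once upon a time.\n\u{2014} How about a story?\nHe sat down.\n\u{2014} Which one?";
print_r(extractDirectSpeech($sample));
```

For HTML input, the same check would be applied to the text content of each paragraph element rather than to raw lines.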

2) Automatically places commas before "a" and "but" (the Russian conjunctions "а" and "но") and replaces three dots with the special ellipsis character (…).
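One possible way to implement this punctuation step is shown below. It is a sketch under assumptions: the function name `fixPunctuation` is hypothetical, the conjunctions are taken to be the Russian "а" and "но", and a comma is added only when the preceding character is a letter or digit, so existing punctuation is left alone.

```php
<?php
// Hypothetical helper: add missing commas before the conjunctions and
// turn "..." into the single ellipsis character.
function fixPunctuation(string $text): string
{
    // \x{0430} is Cyrillic "а"; \x{043D}\x{043E} is "но" (assumed targets).
    // The lookbehind requires a letter or digit right before the whitespace,
    // so text that already has a comma there is not touched.
    $text = preg_replace(
        '/(?<=[\p{L}\p{N}])(\s+)(\x{0430}|\x{043D}\x{043E})(?=\s)/u',
        ',$1$2',
        $text
    );

    // "\u{2026}" is the horizontal ellipsis character.
    return str_replace('...', "\u{2026}", $text);
}

// Usage: a comma appears before the conjunction, the dots become an ellipsis.
echo fixPunctuation("Он пришёл но опоздал..."), "\n";
```

The match is case-sensitive, so a conjunction that starts a sentence is left unchanged.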

3) Automatically generates a working table of contents for headings of levels 1-3. The table of contents, indented by heading level, is displayed below the form. Clicking a heading in the table of contents jumps to the corresponding heading in the full text.
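A table of contents like this could be built along the following lines. The sketch is illustrative only: `buildToc`, the `heading-N` id scheme, and the 20px indent per level are all assumptions, and a regular expression stands in for a real HTML parser.

```php
<?php
// Hypothetical helper: returns [tocHtml, bodyHtml] for headings h1-h3.
function buildToc(string $html): array
{
    $toc = [];
    $counter = 0;

    // Give every h1-h3 a generated id and record a link to it.
    $body = preg_replace_callback(
        '/<h([1-3])\b[^>]*>(.*?)<\/h\1>/isu',
        function (array $m) use (&$toc, &$counter): string {
            $counter++;
            $id = 'heading-' . $counter;          // assumed id scheme
            $level = (int) $m[1];
            $title = trim(strip_tags($m[2]));
            $toc[] = sprintf(
                '<div style="margin-left:%dpx"><a href="#%s">%s</a></div>',
                ($level - 1) * 20,                // indent grows with level
                $id,
                htmlspecialchars($title, ENT_QUOTES, 'UTF-8', false)
            );
            return sprintf('<h%d id="%s">%s</h%d>', $level, $id, $m[2], $level);
        },
        $html
    );

    return [implode("\n", $toc), $body];
}

// Usage: h4 is outside levels 1-3, so it gets no entry and no id.
[$tocHtml, $bodyHtml] = buildToc('<h1>Intro</h1><p>Text</p><h2 class="x">Part <b>one</b></h2><h4>Skip</h4>');
echo $tocHtml, "\n", $bodyHtml, "\n";
```

The links work because each `href="#heading-N"` matches the `id` written onto the corresponding heading in the full text.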

4) Removes all visual formatting from the original HTML, leaving only functional and structural elements: <H?>, <P>, <div>, table and link tags. If text was inside a removed tag, for example text wrapped in a font tag, the text is preserved. All attributes and their values are removed from the remaining tags.
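The clean-up step might look something like the sketch below. It is not taken from the repository: `stripFormatting` is a hypothetical name, the exact list of table tags is an assumption, and the attribute regex assumes attribute values do not contain a `>` character.

```php
<?php
// Hypothetical helper: keep only structural tags and drop all attributes.
function stripFormatting(string $html): string
{
    // Assumed whitelist: headings, paragraphs, divs, table tags, links.
    $allowed = '<h1><h2><h3><h4><h5><h6><p><div>'
             . '<table><thead><tbody><tr><th><td><a>';

    // strip_tags removes every other tag but keeps the text inside it.
    $clean = strip_tags($html, $allowed);

    // Rewrite each remaining opening tag without its attributes.
    // Closing tags start with "</" and are therefore not matched.
    return preg_replace('/<([a-z][a-z0-9]*)\b[^>]*>/i', '<$1>', $clean);
}

// Usage: the font and b tags disappear, their text stays.
echo stripFormatting('<p class="a"><font color="red">Hi</font> <b>there</b> <a href="x">link</a></p>'), "\n";
```

Taken literally, removing all attributes also drops `href` from links. Note too that `strip_tags` keeps the text inside removed `script` and `style` elements, so those may need separate handling.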

Additional features

- Opening ****/text.php?preset=1 in the browser loads the form with the text of this article: https://ru.wikipedia.org/wiki/%D0%9A%D0%B8%D0%BD%D0%BE%D1%80%D0%B8%D0%BD%D1%85%D0%B8

- Opening ****/text.php?preset=2 in the browser loads the form with the text of this article: https://www.gazeta.ru/culture/2021/12/16/a_14322589.shtml
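The preset lookup could be handled roughly like this. The function name `presetUrl` is hypothetical, and how text.php actually loads the article is not described here, so the fetch is shown only as a commented-out assumption.

```php
<?php
// Hypothetical helper: map the "preset" query parameter to an article URL.
function presetUrl(array $query): ?string
{
    // The two URLs listed above, keyed by preset number.
    $presets = [
        1 => 'https://ru.wikipedia.org/wiki/%D0%9A%D0%B8%D0%BD%D0%BE%D1%80%D0%B8%D0%BD%D1%85%D0%B8',
        2 => 'https://www.gazeta.ru/culture/2021/12/16/a_14322589.shtml',
    ];

    // Accept only integer values; anything else means "no preset".
    $preset = filter_var($query['preset'] ?? null, FILTER_VALIDATE_INT);
    if ($preset === false) {
        return null;
    }
    return $presets[$preset] ?? null;
}

// Assumed usage inside text.php (not executed here):
// $url  = presetUrl($_GET);
// $text = $url !== null ? file_get_contents($url) : '';
```

Validating the parameter against a fixed list keeps the script from fetching arbitrary URLs supplied by the visitor.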
