Spaces inside and around links are concatenated #58

Closed
aykevl opened this issue Apr 10, 2015 · 6 comments
Comments

@aykevl

aykevl commented Apr 10, 2015

When there's a space before a link and before the link's content, both are preserved:

~$ python
Python 2.7.9 (default, Mar  1 2015, 12:57:24) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import html2text
>>> h = html2text.HTML2Text()
>>> h.handle('foo <a href="/"> bar</a>')
u'foo [ bar](/)\n\n'
>>> h.ignore_links = True
>>> h.handle('foo <a href="/"> bar</a>')
u'foo  bar\n\n'
>>> html2text.__version__
'2014.12.29'

Browsers strip the leading space from the text inside the link, so only one space is rendered:

foo bar

This is html2text installed from pip, on Debian jessie.

aykevl added a commit to aykevl/kninfra that referenced this issue Apr 12, 2015
@Alir3z4 Alir3z4 added bug and removed bug labels Apr 13, 2015
@Alir3z4
Owner

Alir3z4 commented Apr 13, 2015

I don't think this is a big bug.
Spaces and half-spaces should be preserved for some formatting reasons.

I'll leave this issue open for now, but it shouldn't be considered a bug.
That said, formatting options/flags are welcome.

Half-spaces: in Persian-based languages such as Arabic, Farsi, Dari, etc., they are a must. Words like می باشد and می‌باشد are such examples.

@theSage21
Collaborator

I do not understand the issue. Is the preservation of spaces the bug in question?

@aykevl
Author

aykevl commented Jun 17, 2015

I had to work around this bug in another project, as the text version was showing two spaces where the HTML version (in an email) was showing only one space, as intended.

As an additional argument, when the spaces are at one side of the link tag, html2text only outputs one space:

>>> h.handle('foo  <a href="/">bar</a>')
u'foo [bar](/)\n\n'
>>> h.handle('foo<a href="/">  bar</a>')
u'foo[ bar](/)\n\n'
>>> h.ignore_links = True
>>> h.handle('foo  <a href="/">bar</a>')
u'foo bar\n\n'
>>> h.handle('foo<a href="/">  bar</a>')
u'foo bar\n\n'

Just like browsers do:

foo bar
foo bar

Note that the space in the second example is part of the link (in Chrome).

I don't know how non-Latin-based languages do this formatting.

@theSage21 the problem is that there are two spaces rendered where there should be (in my opinion and as browsers indicate) only one.
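
The workaround mentioned above is not shown in this thread. One possible approach, given here only as a minimal sketch and not as what that project actually did, is to post-process the plain-text output (the `ignore_links = True` case) and collapse runs of spaces that sit between two non-whitespace characters. The function name `collapse_inner_spaces` is hypothetical and is not part of html2text.

```python
import re

# Two or more spaces with a non-whitespace character on each side.
# The lookbehind/lookahead leave leading indentation and trailing
# spaces (Markdown hard line breaks) untouched.
_INNER_SPACES = re.compile(r"(?<=\S) {2,}(?=\S)")


def collapse_inner_spaces(text):
    """Collapse runs of spaces inside a line to a single space."""
    return _INNER_SPACES.sub(" ", text)


# The doubled space from the first report collapses to one.
print(repr(collapse_inner_spaces("foo  bar\n\n")))
```

This does not touch the `[ bar](/)` form produced when links are kept, and it would also collapse intentional double spaces inside inline code, so it is only suitable when the output is treated as plain text.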

@theSage21
Collaborator

A browser renders a single space no matter how many spaces there are between two words.
Creating a file sample.html (or testing here) with the following HTML

<p>arjoonn sharma
<br>
arjoonn                     sharma</p>

shows that the spaces do not matter and only one space is rendered even though the html is evidently not the same.
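
The collapsing behaviour described here can be approximated in a few lines. The sketch below is a simplified model written for Python 3, not how any browser or html2text is implemented: it treats every tag as inline, ignores `<pre>` and `<br>`, and only collapses ASCII whitespace, so non-breaking spaces and zero-width non-joiners pass through unchanged. The names `CollapsingTextExtractor` and `rendered_text` are made up for illustration.

```python
from html.parser import HTMLParser

# ASCII whitespace only: a non-breaking space must not be collapsed.
_COLLAPSIBLE = " \t\n\r\f"


class CollapsingTextExtractor(HTMLParser):
    """Collect text, collapsing whitespace runs even across tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        # Start as if a space was just seen, so leading whitespace is dropped.
        self.last_was_space = True

    def handle_data(self, data):
        for ch in data:
            if ch in _COLLAPSIBLE:
                # Emit at most one space per run; the flag survives tag
                # boundaries because tags do not reset it.
                if not self.last_was_space:
                    self.parts.append(" ")
                self.last_was_space = True
            else:
                self.parts.append(ch)
                self.last_was_space = False

    def text(self):
        return "".join(self.parts).rstrip()


def rendered_text(html):
    parser = CollapsingTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(rendered_text('foo <a href="/"> bar</a>'))
print(rendered_text("<p>arjoonn                     sharma</p>"))
```

Because the "last character was a space" flag is kept across tag boundaries, a space before `<a>` and a space right after it count as one run, which is the case this issue is about.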

@aykevl My point being that html2text preserves information from the original html and translates that to text.

Edit: Maybe we should be stripping the link text? @Alir3z4, what is your take on this?
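
Until something like that exists in the library, stripping can be approximated on the caller's side by moving whitespace out of the link before conversion. This is a rough sketch under stated assumptions: `hoist_link_whitespace` is a hypothetical helper, not an html2text function, and the regular expressions only cope with simple, well-formed `<a>` tags (no `>` inside attribute values). It relies on the behaviour quoted earlier in this thread, where `'foo  <a href="/">bar</a>'` is converted with a single space before the link.

```python
import re

# Whitespace directly after an opening <a ...> tag.
_LEADING = re.compile(r"(<a\b[^>]*>)(\s+)", re.IGNORECASE)
# Whitespace directly before a closing </a> tag.
_TRAILING = re.compile(r"(\s+)(</a\s*>)", re.IGNORECASE)


def hoist_link_whitespace(html):
    """Move leading/trailing whitespace of link text outside the <a> element."""
    html = _LEADING.sub(r"\2\1", html)   # "<a> x"  becomes " <a>x"
    return _TRAILING.sub(r"\2\1", html)  # "x </a>" becomes "x</a> "


print(hoist_link_whitespace('foo <a href="/"> bar</a>'))
```

The result would then be passed to `h.handle(...)` as usual. No whitespace is deleted, only moved, so information such as half-spaces inside the link text is left intact.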

@Alir3z4
Owner

Alir3z4 commented Jun 18, 2015

As @mcepl once mentioned: can we transform the data back to its original form, or are we going to lose the format of the original?

We're fine with any implementation as long as the text keeps its original form after being parsed back to HTML by the Python Markdown parser.

@mcepl do you have any input on this?

@Alir3z4
Owner

Alir3z4 commented Feb 11, 2016

Not a bug and based on #58 (comment) I'll close this.

@Alir3z4 Alir3z4 closed this as completed Feb 11, 2016