Enforce or detect UTF-8 encoding when 'charset' is not set #59

Open
aleksblendwerk opened this Issue May 31, 2012 · 9 comments

@aleksblendwerk

I am using Goutte to scrape a couple of sites. A few of them serve UTF-8 content but send only "text/html" as the Content-Type, with no charset. DomCrawler then assumes ISO-8859-1, which results in double-encoded UTF-8 strings in the returned DOMDocument (and in the results of text() and so on).
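For readers unfamiliar with the symptom, here is a minimal sketch of what "double-encoded" means at the byte level. It does not use Goutte or DomCrawler; the function name `misreadAsLatin1` is hypothetical, and the sketch assumes the `mbstring` extension is available.

```php
<?php
// Hypothetical helper (not part of Goutte): treat the given bytes as
// ISO-8859-1 and convert them to UTF-8. When the bytes were already UTF-8,
// this reproduces the double-encoding described above.
function misreadAsLatin1(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

// "é" in UTF-8 is the two bytes 0xC3 0xA9.
$utf8 = "\xC3\xA9";

// Each byte is read as a separate ISO-8859-1 character and re-encoded,
// so two bytes become four and the text displays as "Ã©".
$doubleEncoded = misreadAsLatin1($utf8);

echo bin2hex($utf8), PHP_EOL;          // c3a9
echo bin2hex($doubleEncoded), PHP_EOL; // c383c2a9
```

The second line of output is twice as long as the first, which is a quick way to recognise this problem in scraped strings.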

Right now I am working around this by extending Goutte\Client and overriding createCrawlerFromContent, calling the parent method with ";charset=UTF-8" appended to the type when it has no charset attribute. This is probably not a very good way to do it, so I didn't want to make a pull request just yet.
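The string handling in that workaround might look roughly like the sketch below. The helper name `withDefaultCharset` is hypothetical and not part of Goutte; the subclass itself appears only as a comment, since the exact signature of createCrawlerFromContent depends on the installed Goutte/BrowserKit version.

```php
<?php
// Hypothetical helper: append a default charset to a Content-Type value
// when it does not already declare one.
function withDefaultCharset(string $type, string $charset = 'UTF-8'): string
{
    // Leave empty values and values that already carry a charset untouched.
    if ($type === '' || stripos($type, 'charset=') !== false) {
        return $type;
    }

    return $type . ';charset=' . $charset;
}

// Assumed usage inside a class extending Goutte\Client, as described above:
//
//     protected function createCrawlerFromContent($uri, $content, $type)
//     {
//         return parent::createCrawlerFromContent(
//             $uri, $content, withDefaultCharset($type)
//         );
//     }

echo withDefaultCharset('text/html'), PHP_EOL;
// text/html;charset=UTF-8
echo withDefaultCharset('text/html; charset=ISO-8859-1'), PHP_EOL;
// text/html; charset=ISO-8859-1
```

Note that this forces UTF-8 for every response without a declared charset, so it only suits sites known to serve UTF-8.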

My main point is that this took me quite a while to figure out. Goutte could probably be more convenient, and save other new users from falling into the same trap, by letting users specify an encoding. Besides that, thanks for a great library!

@mashpie
mashpie commented Jun 5, 2012

+1 :)

@olragon
olragon commented Aug 12, 2012

+1

@akbortoli

+1

@neochief

+1

@RageZBla

+1

@abardan
abardan commented Feb 23, 2016

+1

@aik099
aik099 commented Feb 23, 2016

@aleksblendwerk, this may be what you're after:

  1. allow specifying default charset in Goutte
  2. allow specifying default charset in DomCrawler
  3. pass through default charset from Goutte to DomCrawler

P.S.
Maybe Goutte/DomCrawler can already do that and I just don't know what the relevant settings are called.

@envision

+1

@Oxicode
Oxicode commented Nov 30, 2016

+1
