Enforce or detect UTF-8 encoding when 'charset' is not set #59

aleksblendwerk opened this Issue May 31, 2012 · 9 comments




I am using Goutte to scrape a couple of sites. A few of them serve UTF-8 content but send only "text/html" as the Content-Type, with no charset. DomCrawler then assumes ISO-8859-1, which results in double-encoded UTF-8 strings in the returned DOMDocument (and in the results of text() and so on).
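To make the symptom concrete, here is a minimal sketch of the double-encoding effect itself. It is written in Python only for illustration and does not use Goutte or DomCrawler; the function names `misread_as_latin1` and `repair` are hypothetical. It assumes the bytes on the wire are valid UTF-8 and that the consumer decodes them as ISO-8859-1.

```python
def misread_as_latin1(text: str) -> str:
    """Simulate a consumer that decodes UTF-8 bytes as ISO-8859-1."""
    utf8_bytes = text.encode("utf-8")
    # ISO-8859-1 maps every byte to a character, so this never raises;
    # it silently produces the wrong characters instead.
    return utf8_bytes.decode("iso-8859-1")


def repair(mojibake: str) -> str:
    """Undo the misreading, assuming the text really was UTF-8."""
    return mojibake.encode("iso-8859-1").decode("utf-8")


original = "Köln"
garbled = misread_as_latin1(original)

# Re-encoding the garbled string as UTF-8 yields the "double-encoded" bytes:
# each non-ASCII character now takes more bytes than in the original.
double_encoded = garbled.encode("utf-8")

print(garbled)
print(len(original.encode("utf-8")), len(double_encoded))
print(repair(garbled) == original)
```

The repair step only works if the assumption holds; applying it to text that was correctly decoded in the first place can raise an error or corrupt the text, which is why fixing the charset before parsing is the better place to intervene.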

Right now I am working around this by extending Goutte\Client and overriding createCrawlerFromContent so that it calls the parent method with ";charset=UTF-8" appended to the type when there is no charset attribute. This is probably not a very good way to do it, so I didn't want to make a pull request just yet.
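The core of that workaround is a small string check on the Content-Type value. The sketch below shows only that check, in Python for illustration. The actual override lives in PHP inside a Goutte\Client subclass and is not reproduced here. The function name `with_default_charset` and its `default` parameter are hypothetical.

```python
def with_default_charset(content_type: str, default: str = "UTF-8") -> str:
    """Append a charset parameter when the Content-Type has none.

    Hypothetical helper: an explicit charset from the server is kept as-is.
    """
    # Everything after the first ';' is a parameter such as "charset=...".
    params = [p.strip().lower() for p in content_type.split(";")[1:]]
    if any(p.startswith("charset=") for p in params):
        return content_type
    # Drop a dangling ';' so the result does not end up with ';;'.
    base = content_type.strip().rstrip(";").rstrip()
    return f"{base}; charset={default}"


print(with_default_charset("text/html"))
print(with_default_charset("text/html; charset=ISO-8859-1"))
```

In the override, the value returned by a check like this would be passed as the type argument to the parent createCrawlerFromContent. Forcing UTF-8 is itself an assumption about the scraped sites, so a configurable default is safer than a hard-coded one.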

My main point is that this took me quite a while to figure out. Goutte could be more convenient, and save other new users from falling into the same trap, by letting users specify an encoding. Besides that, thanks for a great library!

mashpie commented Jun 5, 2012

+1 :)

olragon commented Aug 12, 2012








abardan commented Feb 23, 2016


aik099 commented Feb 23, 2016

@aleksblendwerk, this might be what you're after:

  1. allow specifying default charset in Goutte
  2. allow specifying default charset in DomCrawler
  3. pass through default charset from Goutte to DomCrawler

Maybe Goutte/DomCrawler can already do that and I just don't know what the relevant settings are called.



Oxicode commented Nov 30, 2016

