Encoded txt files as UTF-8 #4

Merged
merged 1 commit into from
Oct 1, 2013

seanknox
Contributor

No description provided.

@jasonlally
Contributor

Thanks for doing that. Did you do this manually or programmatically? The reason I ask is that our vendor will be pushing updates to FTP, which we then automatically add and commit to the repo. I'm checking whether they can save with UTF-8 encoding on their end, but just in case, I may need to script this so it doesn't have to be done manually.

@seanknox
Contributor Author

Programmatically. Here's my quick hack:

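require 'charlock_holmes'

detector = CharlockHolmes::EncodingDetector.new

# Convert each file named on the command line to UTF-8, in place.
ARGV.each do |f|
  content = File.read(f)
  detection = detector.detect(content)
  # Skip files with no detected text encoding (e.g. binary content).
  next unless detection && detection[:encoding]

  puts "#{f} encoding: #{detection[:encoding]}"
  utf8_encoded_content = CharlockHolmes::Converter.convert(content, detection[:encoding], 'UTF-8')
  File.write(f, utf8_encoded_content)
end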
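
For the automated FTP-to-repo workflow mentioned above, it may help to check whether a file is already valid UTF-8 before transcoding or committing it. The following is only a sketch using Ruby's standard library; the helper names `valid_utf8?` and `non_utf8_files` are hypothetical and not part of this pull request.

```ruby
# Hypothetical helper (not from the PR): true when the raw bytes
# already form a valid UTF-8 sequence.
def valid_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Hypothetical helper: return the paths whose contents are NOT valid UTF-8,
# i.e. the files that would still need conversion.
def non_utf8_files(paths)
  paths.reject { |path| valid_utf8?(File.binread(path)) }
end

# Example use from the command line:
#   ruby check_utf8.rb *.txt
non_utf8_files(ARGV).each { |path| puts "#{path} is not valid UTF-8" }
```

Note that a file can be valid UTF-8 and still contain the wrong characters if it was transcoded from a misdetected encoding, so this check only catches files that were never converted.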

(J) Petitioning�Requesting persons to sign a petition.
(K) Publicize�To inform the public of a planned event by means of newspaper articles or notices, radio or television stories or notices, announcements in public places, leafletting, posting signs or written notices in places viewed by the public, or by other means calculated to notify the public of an event.
(L) Soliciting�Requesting persons to contribute money or anything else of value for charitable, religious or political cause.
(A) Amusement Park Rides—Rides of the type normally found in amusement parks or carnivals, such as ferris wheels.
Contributor


#4 In the original Municipal Code, these are em dashes, I believe; encoding them removes them. Thoughts on why this is happening?

@seanknox
Contributor Author

seanknox commented Oct 1, 2013

That's possible. I'm not sure there's a way to make the transcoder smarter about characters like that, but I'll look. I think the best way forward is to have the vendor encode as UTF-8 directly.
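
One possible explanation, which nothing in this thread confirms: if the vendor files were saved as Windows-1252 but detected as a close relative such as ISO-8859-1, the em dash byte (0x97 in Windows-1252) would be mapped to an invisible control character instead of U+2014. The sketch below illustrates that difference with Ruby's built-in transcoding; the sample text and the `decode_as` helper are assumptions made for illustration.

```ruby
# Assumed sample: the byte 0x97 sits where the em dash should be.
raw = "Petitioning\x97Requesting".b

# Hypothetical helper: interpret raw bytes as the named encoding,
# then transcode the result to UTF-8.
def decode_as(bytes, encoding_name)
  bytes.dup.force_encoding(encoding_name).encode(Encoding::UTF_8)
end

as_cp1252 = decode_as(raw, 'Windows-1252')
as_latin1 = decode_as(raw, 'ISO-8859-1')

# In Windows-1252, 0x97 is the em dash (U+2014).
puts as_cp1252.include?("\u2014")
# In ISO-8859-1, 0x97 is a C1 control character (U+0097), which most
# viewers render as nothing or as a placeholder glyph.
puts as_latin1.include?("\u0097")
```

If this is what happened, passing the known source encoding to the converter rather than relying on detection would preserve the dashes.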

jasonlally added a commit that referenced this pull request Oct 1, 2013
Encoded txt files as UTF-8
@jasonlally jasonlally merged commit 6014ac9 into SFMOCI:master Oct 1, 2013
@jasonlally
Contributor

#2 addressed
