How are accent marks handled? #366

Closed
cgb37 opened this issue Dec 2, 2020 · 2 comments

@cgb37

cgb37 commented Dec 2, 2020

We harvest tweets that contain accent marks. How are they handled? Should we search for each hashtag both with and without its accent marks?
For example, we're searching for the following term:
#CubaYChacón

Should we configure our search to use both the accented and the unaccented form?
#CubaYChacón
#CubaYChacon

@edsu
Member

edsu commented Dec 2, 2020

twarc simply passes the query through to the search API endpoint. The query does get URL encoded for the GET request, but nothing else changes. So this isn't really a question about twarc; it's about what Twitter does. I just tested a search for each of these and both returned the same three tweets, so in this case at least I don't think there is any difference. Twitter must be doing some kind of normalization at index and query time.
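
To make the URL-encoding step concrete, here is a minimal sketch using only the Python standard library. It is not twarc's actual code (twarc's HTTP layer is not shown in this thread); it only illustrates what percent-encoding does to the accented hashtag before it goes into a GET request. The variable names are made up for illustration.

```python
from urllib.parse import quote, unquote

# The accented hashtag from the question.
query = "#CubaYChacón"

# Percent-encode every reserved or non-ASCII character as UTF-8 bytes.
# safe="" means "#" is encoded too, as it must be inside a query string.
encoded = quote(query, safe="")
print(encoded)  # %23CubaYChac%C3%B3n

# Encoding is lossless: decoding gives back the original text, accent included.
decoded = unquote(encoded)
print(decoded == query)  # True
```

The accent survives the round trip unchanged, so any accent-insensitive matching has to happen on Twitter's side rather than in the client.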

@igorbrigadir
Contributor

Closing some old issues: the Twitter API uses the Normalization Form C (NFC) version of the text, see https://developer.twitter.com/en/docs/counting-characters. In the API, a search for #CubaYChacon will match both #CubaYChacón and #CubaYChacon. There are more clues about exactly how text is processed in the https://github.com/twitter/twitter-text conformance tests and in the Unicode character map at https://github.com/twitter/twitter-text/tree/master/unicode_regex
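
For anyone who wants to check this locally, here is a small sketch with Python's standard `unicodedata` module. NFC on its own only unifies the composed and decomposed spellings of "ó"; it does not remove the accent, so the accent-insensitive match seen in search presumably comes from some additional folding on Twitter's side that isn't documented in this thread. The `strip_accents` helper is hypothetical, an approximation for building the unaccented variant yourself, and not Twitter's algorithm.

```python
import unicodedata

composed = "#CubaYChac\u00f3n"     # "ó" as one code point (U+00F3)
decomposed = "#CubaYChaco\u0301n"  # "o" followed by a combining acute accent (U+0301)


def nfc(text: str) -> str:
    """Return the NFC (canonically composed) form of text."""
    return unicodedata.normalize("NFC", text)


def strip_accents(text: str) -> str:
    """Hypothetical helper: decompose with NFD, then drop combining marks."""
    parts = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in parts if not unicodedata.combining(ch))


# The two spellings look identical but differ as strings until normalized.
print(composed == decomposed)            # False
print(nfc(decomposed) == composed)       # True

# NFC keeps the accent; only the extra stripping step removes it.
print(nfc(composed) == "#CubaYChacon")   # False
print(strip_accents(composed))           # #CubaYChacon
```

If you would rather not rely on undocumented server-side behavior, you could query for both `composed` and `strip_accents(composed)` and de-duplicate the results by tweet id.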
