Add sentiment analysis capability for 68 more languages #9
In the past, we integrated with the Cognitive Services Text Analytics API to extract the sentiment from a text. When we started the integration, the API only had support for 4 languages. Now, there is support for 15 languages.† However, Fortis would like to support many more languages. This PR addresses that problem.
Most sentiment detection work is done for the big NLP languages (English, German, etc.). However, I did manage to find a paper by the data science group at Stony Brook University that focuses on sentiment analysis for all the other languages out there. Through a combination of techniques (including machine translation and morphological propagation across linguistically similar languages), they managed to create word-polarity lists for over 100 languages. I analyzed the lists they provide, kept the lists for the languages where they have at least 500 positive and negative terms, and uploaded them to our fortis-models blob in a machine-readable format.
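As a rough illustration of the filtering step, here is a minimal sketch. It assumes the lists are already loaded into memory as a mapping from language code to a pair of positive and negative term sets, and it reads "at least 500 positive and negative terms" as a per-polarity threshold. Both the data layout and that reading are assumptions, and `keep_usable_languages` is a hypothetical name, not something from the Fortis codebase.

```python
from typing import Dict, Set, Tuple

# Threshold mentioned in the description above.
MIN_TERMS = 500

# Hypothetical layout: language code -> (positive terms, negative terms).
PolarityLists = Dict[str, Tuple[Set[str], Set[str]]]


def keep_usable_languages(lists: PolarityLists, min_terms: int = MIN_TERMS) -> PolarityLists:
    """Keep only languages whose positive AND negative lists each have
    at least `min_terms` entries (assumed reading of the threshold)."""
    return {
        lang: (positive, negative)
        for lang, (positive, negative) in lists.items()
        if len(positive) >= min_terms and len(negative) >= min_terms
    }
```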
Inside of Fortis, we then use the word-polarity lists to compute sentiment for languages that are unsupported by Cognitive Services.
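A minimal sketch of what a naive word-list approach can look like, for illustration only. It counts how many tokens appear in the positive and negative lists and returns the positive share. The function name, the regex tokenization, and the 0-to-1 score scale are all assumptions here and are not taken from the Fortis code.

```python
import re
from typing import Optional, Set


def word_list_sentiment(text: str, positive: Set[str], negative: Set[str]) -> Optional[float]:
    """Naive polarity score: share of polarity-bearing words that are positive.

    Returns a value in [0, 1] (1.0 = all positive, 0.0 = all negative),
    or None when the text contains no word from either list.
    """
    # Lowercase and split on non-word characters; \w is Unicode-aware in Python 3.
    tokens = re.findall(r"\w+", text.lower())
    positive_hits = sum(1 for token in tokens if token in positive)
    negative_hits = sum(1 for token in tokens if token in negative)
    total_hits = positive_hits + negative_hits
    if total_hits == 0:
        return None  # no evidence either way
    return positive_hits / total_hits
```

A sketch like this ignores negation, word order, and intensity, which is the sense in which a word-list approach is naive.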
This approach is super naive (I've asked for support from the Machine Learning TWG to improve the approach if possible), but it's better than nothing and similar to how we did sentiment analysis in Fortis-v1, so I assert it's good enough for now.
This closes issue #18.
NB: The PR also includes some re-structuring that was necessary to implement the new functionality, e.g. moving around some packages, extracting helper utilities, etc.
†: Cognitive Services currently supports English, Spanish, Portuguese, French, German, Italian, Dutch, Norwegian, Swedish, Polish, Danish, Finnish, Russian, Greek and Turkish.