This repository has been archived by the owner. It is now read-only.

Add sentiment analysis capability for 68 more languages #9

Merged
merged 22 commits into master from more-sentiments on Jun 10, 2017

@c-w
Member

commented Jun 7, 2017

In the past, we integrated with the Cognitive Services Text Analytics API to extract the sentiment from a text. When we started the integration, the API only had support for 4 languages. Now, there is support for 15 languages.† However, Fortis would like to support many more languages. This PR addresses that problem.

Most sentiment detection work targets the big NLP languages (English, German, etc.). However, I did manage to find a paper by the data science group at Stony Brook University that focuses on sentiment analysis for all the other languages out there. Through a combination of techniques (including machine translation and morphological propagation across linguistically similar languages), they managed to create word polarity lists for over 100 languages. I analyzed the lists they provide, kept those for languages with at least 500 positive and negative terms, and uploaded them to our fortis-models blob in a machine-readable format.

Inside of Fortis, we then use the word polarity lists to compute sentiment for languages that are unsupported by Cognitive Services, like so:

  1. Download the polarity lists from blob (cache on disk)
  2. Read the positive/negative words (cache in memory)
  3. Count how many positive/negative words are in the sentence.
  4. If there are more positive words, assume positive sentiment. If there are more negative words, assume negative sentiment. If the count is the same, assume neutral sentiment.
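
Steps 3 and 4 above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the object name `WordListSentimentSketch`, the tokenization rule, and the `Neutral` and `Negative` values are assumptions made for the example. Only a positive value of 1.0 is visible in the diff under review.

```scala
object WordListSentimentSketch {
  // Only the positive value (1.0) appears in the PR; the other two are placeholders.
  val Positive: Double = 1.0
  val Neutral: Double = 0.5
  val Negative: Double = 0.0

  // Hypothetical tokenizer: lower-case, then split on runs of non-letter characters.
  def tokenize(sentence: String): Seq[String] =
    sentence.toLowerCase.split("""[^\p{L}]+""").filter(_.nonEmpty).toSeq

  // Count polarity words and compare the two counts (steps 3 and 4).
  def detect(sentence: String, positiveWords: Set[String], negativeWords: Set[String]): Double = {
    val words = tokenize(sentence)
    val numPositive = words.count(positiveWords.contains)
    val numNegative = words.count(negativeWords.contains)
    if (numPositive > numNegative) Positive
    else if (numNegative > numPositive) Negative
    else Neutral
  }
}
```

In this sketch every occurrence of a word counts, so a sentence that repeats a positive word twice outweighs a single negative word.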

This approach is super naive (I've asked for support from the Machine Learning TWG to improve the approach if possible), but it's better than nothing and similar to how we did sentiment analysis in Fortis-v1 so I assert it's good enough for now.

This closes Issue#18.

NB: The PR also includes some re-structuring that was necessary to implement the new functionality, e.g. moving around some packages, extracting helper utilities, etc.

†: Cognitive Services currently support English, Spanish, Portuguese, French, German, Italian, Dutch, Norwegian, Swedish, Polish, Danish, Finnish, Russian, Greek and Turkish.

@c-w c-w added the in progress label Jun 7, 2017

@c-w c-w changed the title from "More sentiments" to "Add sentiment analysis capability for 68 more languages" Jun 7, 2017

@c-w c-w requested review from kevinhartman and erikschlegel Jun 7, 2017

@kevinhartman
Contributor

left a comment

LGTM w/ comments.

}
}
}

object SentimentDetector {
val POSITIVE: Double = 1.0

@kevinhartman

kevinhartman Jun 8, 2017

Contributor

Scala constants should be Pascal-cased.

@c-w

c-w Jun 8, 2017

Author Member

Done in cb10a83

enabledLanguages: Set[String] = Set("de", "en", "es", "eu", "it", "nl")
) extends Serializable with Logger {
class ZipModelsProvider(
formatModelsDownloadUrl: String => String,

@kevinhartman

kevinhartman Jun 8, 2017

Contributor

Naming this like a function might make its use clearer. Maybe something like modelsUrlFromLanguage.

@c-w

c-w Jun 8, 2017

Author Member

Done in 9ec3011

) extends Serializable with Logger {
class ZipModelsProvider(
formatModelsDownloadUrl: String => String,
modelsSource: Option[String] = None

@kevinhartman

kevinhartman Jun 8, 2017

Contributor

nit: how about modelDirectory?

@c-w

c-w Jun 8, 2017

Author Member

It could also be a custom URL to a models zip file.
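
To make the two constructor parameters discussed above concrete, here is a hedged sketch of how a function-typed parameter named `modelsUrlFromLanguage` and an optional `modelsSource` override might interact. The class name `ZipModelsProviderSketch`, the `resolveSource` method, the URL template, and the "override wins" behavior are all assumptions for illustration; the actual provider in this PR may resolve models differently.

```scala
object ModelsLocationSketch {
  // Hypothetical URL template; the real blob location is not shown in this PR.
  val modelsUrlFromLanguage: String => String =
    language => s"https://example.invalid/models/sentiment-$language.zip"

  class ZipModelsProviderSketch(
    modelsUrlFromLanguage: String => String,
    modelsSource: Option[String] = None
  ) {
    // Assumed behavior: an explicit source (a local directory or a custom
    // zip URL) takes precedence over the per-language default URL.
    def resolveSource(language: String): String =
      modelsSource.getOrElse(modelsUrlFromLanguage(language))
  }
}
```

Because `modelsSource` may hold either a directory or a URL in this sketch, a name like `modelDirectory` would describe only one of its two uses.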


logDebug(s"Analyzed text $text in language $language: $kaf")

kaf.getEntities.toList.filter(entityIsPlace).map(_.getStr).toSet

@kevinhartman

kevinhartman Jun 8, 2017

Contributor

toSet not needed here since return type is Iterable.

@c-w

c-w Jun 8, 2017

Author Member

The function fetches the entities in a sentence so there shouldn't be duplicates. I've changed the return type to Set in 5861eab.
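
As a small illustration of why a `Set` return type matches the intent here, the sketch below deduplicates place names that occur more than once in a sentence. The `Entity` case class and the `"LOCATION"` label are hypothetical stand-ins for the OpeNER entity type and the `entityIsPlace` check, which are not shown in full in this diff.

```scala
object PlaceExtractionSketch {
  // Hypothetical stand-in for the entity type returned by the NLP library.
  final case class Entity(text: String, kind: String)

  // Declaring Set[String] in the signature documents that callers never see duplicates.
  def extractPlaces(entities: List[Entity]): Set[String] =
    entities.filter(_.kind == "LOCATION").map(_.text).toSet
}
```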


kaf.getEntities.toList.filter(entityIsPlace).map(_.getStr).toSet
} catch {
case ex @ (_ : NullPointerException | _ : IOError) =>

@kevinhartman

kevinhartman Jun 8, 2017

Contributor

Which code will throw a null pointer exception? If it's OpeNER, we should likely be calling it differently, since throwing NullPointerException should not be part of an error handling contract.

@c-w

c-w Jun 8, 2017

Author Member

I wish it weren't the case, but unfortunately the OpeNER code is research-quality code: in some cases when it can't load a model, it swallows an IOException internally and later hits the resulting null pointer, so it throws an NPE.

@kevinhartman

kevinhartman Jun 9, 2017

Contributor

Ah yikes. Can we tighten the Try scope to just the relevant OpeNER statements?

@c-w

c-w Jun 9, 2017

Author Member

Done in e89573a.
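
The idea of tightening the try scope can be sketched as below: only the third-party call that may throw sits inside the `try`, and the post-processing runs outside it so that unrelated bugs are not silently swallowed. The function `analyzeWithLibrary` is a hypothetical stand-in for the OpeNER call, and the capitalized-word filter is a placeholder for the real entity filtering; neither is taken from commit e89573a.

```scala
import java.io.IOError

object NarrowTryScopeSketch {
  // Hypothetical stand-in for a library call that throws when its model fails to load.
  def analyzeWithLibrary(text: String, modelAvailable: Boolean): List[String] = {
    if (!modelAvailable) throw new NullPointerException("model failed to load")
    text.split("""\s+""").toList
  }

  def extractCapitalizedWords(text: String, modelAvailable: Boolean): Set[String] = {
    // Only the risky library call is guarded.
    val analyzed: Option[List[String]] =
      try {
        Some(analyzeWithLibrary(text, modelAvailable))
      } catch {
        case _: NullPointerException | _: IOError => None
      }

    // Post-processing happens outside the try, so its own errors still surface.
    analyzed.getOrElse(Nil).filter(_.headOption.exists(_.isUpper)).toSet
  }
}
```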

import scala.io.Source

@SerialVersionUID(100L)
class WordListSentimentDetector(

@kevinhartman

kevinhartman Jun 8, 2017

Contributor

Perhaps you could define a trait for SentimentDetector.

@c-w

c-w Jun 8, 2017

Author Member

There could be an interface for it, but right now there are only two implementations, both wrapped by a single class, so there wouldn't be much value in it.

@kevinhartman

kevinhartman Jun 9, 2017

Contributor

I agree that it wouldn't provide immediate value, but might be nice to think through to set us up for easier extensibility in the future.

@c-w

c-w Jun 9, 2017

Author Member

Done in c7008f8.
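
For readers following the trait discussion, one possible shape is sketched below. It is an assumption about what such a trait could look like, not the code from c7008f8: the method name `detectSentiment`, the `Option[Double]` return type, and both helper classes are invented for illustration.

```scala
object SentimentDetectorTraitSketch {
  trait SentimentDetector extends Serializable {
    // None means "this detector cannot handle the given language".
    def detectSentiment(text: String, language: String): Option[Double]
  }

  // Hypothetical stand-in for a real implementation (API-backed or word-list-backed).
  class FixedScoreDetector(languages: Set[String], score: Double) extends SentimentDetector {
    override def detectSentiment(text: String, language: String): Option[Double] =
      if (languages.contains(language)) Some(score) else None
  }

  // Wrapper that asks each detector in order and returns the first answer.
  class FirstMatchDetector(detectors: Seq[SentimentDetector]) extends SentimentDetector {
    override def detectSentiment(text: String, language: String): Option[Double] =
      detectors.iterator
        .map(_.detectSentiment(text, language))
        .collectFirst { case Some(score) => score }
  }
}
```

With a shared trait, adding a third detector later only requires appending it to the sequence passed to the wrapper.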


import com.microsoft.partnercatalyst.fortis.spark.logging.Loggable
import com.microsoft.partnercatalyst.fortis.spark.transforms.ZipModelsProvider
import com.microsoft.partnercatalyst.fortis.spark.transforms.nlp.Tokenizer.tokenize

@kevinhartman

kevinhartman Jun 8, 2017

Contributor

Importing a specific function like this globally seems a little misleading. If you change tokenize to apply in Tokenizer, then you can tokenize in this file with Tokenizer("input string")

@c-w

c-w Jun 8, 2017

Author Member

Done in 880c5af

object Tokenizer {
@transient private lazy val wordTokenizer = """\b""".r

def tokenize(sentence: String): Seq[String] = {

@kevinhartman

kevinhartman Jun 8, 2017

Contributor

Might as well call this apply so we can treat Tokenizer like a global function.

@c-w

c-w Jun 8, 2017

Author Member

Done in 880c5af
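
A minimal sketch of the `apply` suggestion follows. The word-boundary regex comes from the diff above, but the method body (trimming pieces and dropping empty ones) is an assumption, since the original `tokenize` body is not shown in this review.

```scala
object Tokenizer {
  @transient private lazy val wordTokenizer = """\b""".r

  // Naming the method apply lets callers write Tokenizer("input string").
  def apply(sentence: String): Seq[String] =
    wordTokenizer.split(sentence).toSeq.map(_.trim).filter(_.nonEmpty)
}
```

Callers can then use `Tokenizer("great day")` without importing a specific function.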

}

private def computeSentimentScore(numPositiveWords: Int, numNegativeWords: Int) = {
if (numPositiveWords > numNegativeWords) {

@kevinhartman

kevinhartman Jun 8, 2017

Contributor

This would be much more beautiful with a match :)

@c-w

c-w Jun 8, 2017

Author Member

I'm not a fan of sprinkling Scala constructs left, right and center that make the code harder to understand for folks coming from more mainstream languages.

@kevinhartman

kevinhartman Jun 9, 2017

Contributor

Looking at this again, I'd say it's fine as is, but as a general response, this codebase is written in Scala 😛.

@c-w

c-w Jun 9, 2017

Author Member

Sure; then again, Java is as close to a lingua franca as we get in the software world so using Scala as Java++ (as opposed to going crazy and using Scala as a Haskell dialect) will make the code easier for others to grok :)
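
For comparison, the two styles debated in this thread might look like the sketch below. Both functions are illustrative: the `Neutral` and `Negative` values are placeholders, and the `match` form shown is one of several ways it could be written, not necessarily what the reviewer had in mind.

```scala
object ScoreStyleSketch {
  val Positive: Double = 1.0
  val Neutral: Double = 0.5  // placeholder value
  val Negative: Double = 0.0 // placeholder value

  // "Java++" style, as in the diff under review.
  def scoreWithIf(numPositiveWords: Int, numNegativeWords: Int): Double = {
    if (numPositiveWords > numNegativeWords) {
      Positive
    } else if (numNegativeWords > numPositiveWords) {
      Negative
    } else {
      Neutral
    }
  }

  // Pattern-matching style with guards.
  def scoreWithMatch(numPositiveWords: Int, numNegativeWords: Int): Double =
    (numPositiveWords, numNegativeWords) match {
      case (p, n) if p > n => Positive
      case (p, n) if n > p => Negative
      case _               => Neutral
    }
}
```

The two functions return the same result for every input, so the choice is purely about readability for the team.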

) extends WordListSentimentDetector {

protected override def readWords(path: String): Set[String] = {
if (path.contains("pos.txt")) {

@kevinhartman

kevinhartman Jun 8, 2017

Contributor

nit: endsWith

@c-w

c-w Jun 8, 2017

Author Member

Done in d5cbd60.
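
The difference behind the `endsWith` nit can be shown with a short sketch. The helper names and the example paths are hypothetical; the point is only that `contains` also matches when `pos.txt` appears earlier in the path.

```scala
object PathSuffixSketch {
  // Matches "pos.txt" anywhere in the path, including directory names.
  def looksPositiveWithContains(path: String): Boolean = path.contains("pos.txt")

  // Matches only when the path actually ends with "pos.txt".
  def looksPositiveWithEndsWith(path: String): Boolean = path.endsWith("pos.txt")
}
```

For a path such as `/models/pos.txt.backup/neg.txt`, the `contains` variant reports a positive list while the `endsWith` variant does not.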

@c-w c-w merged commit d0d9835 into master Jun 10, 2017

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed

@c-w c-w deleted the more-sentiments branch Jun 10, 2017

@c-w c-w removed the in progress label Jun 10, 2017
