Skip to content
This repository has been archived by the owner on Mar 7, 2018. It is now read-only.

Add sentiment analysis capability for 68 more languages #9

Merged
merged 22 commits into from Jun 10, 2017
Merged

Conversation

c-w
Copy link
Contributor

@c-w c-w commented Jun 7, 2017

In the past, we integrated with the Cognitive Services Text Analytics API to extract the sentiment from a text. When we started the integration, the API only had support for 4 languages. Now, there is support for 15 languages.† However, Fortis would like to support many more languages. This PR addresses that problem.

Most sentiment detection work is done for the big NLP languages (English, German, etc.), however, I did manage to find a paper by the data science group at Stony Brook University which focused on sentiment analysis work for all the other languages out there. Through a combination of a variety of techniques (including machine translation and morphological propagation across linguistically similar languages), they managed to create word-polarity lists for over 100 languages. I analyzed the lists they provide, kept the lists for the languages where they have at least 500 positive and negative terms and uploaded them to our fortis-models blob in a machine readable format.

Inside of Fortis, we then use the word polarity lists to compute sentiment for languages that are unsupported by Cognitive Services, like so:

  1. Download the polarity lists from blob (cache on disk)
  2. Read the positive/negative words (cache in memory)
  3. Count how many positive/negative words are in the sentence.
  4. If there are more positive words, assume positive sentiment. If there are more negative words, assume negative sentiment. If the count is the same, assume neutral sentiment.

This approach is super naive (I've asked for support from the Machine Learning TWG to improve the approach if possible), but it's better than nothing and similar to how we did sentiment analysis in Fortis-v1 so I assert it's good enough for now.

This closes Issue#18.

NB: The PR also includes some re-structuring that was necessary to implement the new functionality, e.g. moving around some packages, extracting helper utilities, etc.

†: Cognitive Services currently support English, Spanish, Portuguese, French, German, Italian, Dutch, Norwegian, Swedish, Polish, Danish, Finnish, Russian, Greek and Turkish.

@c-w c-w added the in progress label Jun 7, 2017
@c-w c-w changed the title More sentiments Add sentiment analysis capability for 68 more languages Jun 7, 2017
Copy link
Contributor

@kevinhartman kevinhartman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM w/ comments.

}
}
}

object SentimentDetector {
val POSITIVE: Double = 1.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scala constants should be Pascal-cased.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in cb10a83

enabledLanguages: Set[String] = Set("de", "en", "es", "eu", "it", "nl")
) extends Serializable with Logger {
class ZipModelsProvider(
formatModelsDownloadUrl: String => String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming this like a function might make its use clearer. Maybe something like modelsUrlFromLanguage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 9ec3011

) extends Serializable with Logger {
class ZipModelsProvider(
formatModelsDownloadUrl: String => String,
modelsSource: Option[String] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: how about modelDirectory?

Copy link
Contributor Author

@c-w c-w Jun 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could also be a custom URL to a models zip file.


logDebug(s"Analyzed text $text in language $language: $kaf")

kaf.getEntities.toList.filter(entityIsPlace).map(_.getStr).toSet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toSet not needed here since return type is Iterable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function fetches the entities in a sentence so there shouldn't be duplicates. I've changed the return type to Set in 5861eab.


kaf.getEntities.toList.filter(entityIsPlace).map(_.getStr).toSet
} catch {
case ex @ (_ : NullPointerException | _ : IOError) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which code will throw a null pointer exception? If it's OpeNER, we should likely be calling it differently, since throwing NullPointerException should not be part of an error handling contract.

Copy link
Contributor Author

@c-w c-w Jun 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish it weren't the case... but unfortunately the OpeNER code is research-quality code so it throws a NPE in some cases when it can't load a model since they swallow an IOException internally and then later hit the null-pointer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yikes. Can we tighten the Try scope to just the relevant OpeNER statements?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in e89573a.

import scala.io.Source

@SerialVersionUID(100L)
class WordListSentimentDetector(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you could define a trait for SentimentDetector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There could be an interface for it, but right now there's only two implementations which are both wrapped by a single class so there wouldn't be much value in it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it wouldn't provide immediate value, but might be nice to think through to set us up for easier extensibility in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in c7008f8.


import com.microsoft.partnercatalyst.fortis.spark.logging.Loggable
import com.microsoft.partnercatalyst.fortis.spark.transforms.ZipModelsProvider
import com.microsoft.partnercatalyst.fortis.spark.transforms.nlp.Tokenizer.tokenize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Importing a specific function like this globally seems a little misleading. If you change tokenize to apply in Tokenizer, then you can tokenize in this file with Tokenizer("input string")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 880c5af

object Tokenizer {
@transient private lazy val wordTokenizer = """\b""".r

def tokenize(sentence: String): Seq[String] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well call this apply so we can treat Tokenizer like a global function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 880c5af

}

private def computeSentimentScore(numPositiveWords: Int, numNegativeWords: Int) = {
if (numPositiveWords > numNegativeWords) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be much more beautiful with a match :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of sprinkling Scala constructs left right and center which make the code harder to understand for folks coming from more mainstream languages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this again, I'd say it's fine as is, but as a general response, this codebase is written in Scala 😛.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure; then again, Java is as close to a lingua franca as we get in the software world so using Scala as Java++ (as opposed to going crazy and using Scala as a Haskell dialect) will make the code easier for others to grok :)

) extends WordListSentimentDetector {

protected override def readWords(path: String): Set[String] = {
if (path.contains("pos.txt")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: endsWith

Copy link
Contributor Author

@c-w c-w Jun 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in d5cbd60.

@c-w c-w merged commit d0d9835 into master Jun 10, 2017
@c-w c-w deleted the more-sentiments branch June 10, 2017 21:50
@c-w c-w removed the in progress label Jun 10, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants