Add sentiment analysis capability for 68 more languages #9
Conversation
This can be shared between the place recognizer and the file-based sentiment detector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM w/ comments.
} | ||
} | ||
} | ||
|
||
object SentimentDetector { | ||
val POSITIVE: Double = 1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scala constants should be Pascal-cased.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in cb10a83
enabledLanguages: Set[String] = Set("de", "en", "es", "eu", "it", "nl") | ||
) extends Serializable with Logger { | ||
class ZipModelsProvider( | ||
formatModelsDownloadUrl: String => String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming this like a function might make its use clearer. Maybe something like modelsUrlFromLanguage
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 9ec3011
) extends Serializable with Logger { | ||
class ZipModelsProvider( | ||
formatModelsDownloadUrl: String => String, | ||
modelsSource: Option[String] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: how about modelDirectory
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could also be a custom URL to a models zip file.
|
||
logDebug(s"Analyzed text $text in language $language: $kaf") | ||
|
||
kaf.getEntities.toList.filter(entityIsPlace).map(_.getStr).toSet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toSet
not needed here since return type is Iterable
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function fetches the entities in a sentence so there shouldn't be duplicates. I've changed the return type to Set in 5861eab.
|
||
kaf.getEntities.toList.filter(entityIsPlace).map(_.getStr).toSet | ||
} catch { | ||
case ex @ (_ : NullPointerException | _ : IOError) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which code will throw a null pointer exception? If it's OpeNER, we should likely be calling it differently, since throwing NullPointerException should not be part of an error handling contract.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish it weren't the case... but unfortunately the OpeNER code is research-quality code so it throws a NPE in some cases when it can't load a model since they swallow an IOException internally and then later hit the null-pointer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yikes. Can we tighten the Try scope to just the relevant OpeNER statements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in e89573a.
import scala.io.Source | ||
|
||
@SerialVersionUID(100L) | ||
class WordListSentimentDetector( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps you could define a trait for SentimentDetector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There could be an interface for it, but right now there's only two implementations which are both wrapped by a single class so there wouldn't be much value in it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it wouldn't provide immediate value, but might be nice to think through to set us up for easier extensibility in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in c7008f8.
|
||
import com.microsoft.partnercatalyst.fortis.spark.logging.Loggable | ||
import com.microsoft.partnercatalyst.fortis.spark.transforms.ZipModelsProvider | ||
import com.microsoft.partnercatalyst.fortis.spark.transforms.nlp.Tokenizer.tokenize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Importing a specific function like this globally seems a little misleading. If you change tokenize
to apply
in Tokenizer
, then you can tokenize in this file with Tokenizer("input string")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 880c5af
object Tokenizer { | ||
@transient private lazy val wordTokenizer = """\b""".r | ||
|
||
def tokenize(sentence: String): Seq[String] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might as well call this apply
so we can treat Tokenizer like a global function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 880c5af
} | ||
|
||
private def computeSentimentScore(numPositiveWords: Int, numNegativeWords: Int) = { | ||
if (numPositiveWords > numNegativeWords) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be much more beautiful with a match
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of sprinkling Scala constructs left right and center which make the code harder to understand for folks coming from more mainstream languages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this again, I'd say it's fine as is, but as a general response, this codebase is written in Scala 😛.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure; then again, Java is as close to a lingua franca as we get in the software world so using Scala as Java++ (as opposed to going crazy and using Scala as a Haskell dialect) will make the code easier for others to grok :)
) extends WordListSentimentDetector { | ||
|
||
protected override def readWords(path: String): Set[String] = { | ||
if (path.contains("pos.txt")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: endsWith
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in d5cbd60.
In the past, we integrated with the Cognitive Services Text Analytics API to extract the sentiment from a text. When we started the integration, the API only had support for 4 languages. Now, there is support for 15 languages.† However, Fortis would like to support many more languages. This PR addresses that problem.
Most sentiment detection work is done for the big NLP languages (English, German, etc.), however, I did manage to find a paper by the data science group at Stony Brook University which focused on sentiment analysis work for all the other languages out there. Through a combination of a variety of techniques (including machine translation and morphological propagation across linguistically similar languages), they managed to create word-polarity lists for over 100 languages. I analyzed the lists they provide, kept the lists for the languages where they have at least 500 positive and negative terms and uploaded them to our fortis-models blob in a machine readable format.
Inside of Fortis, we then use the word polarity lists to compute sentiment for languages that are unsupported by Cognitive Services, like so:
This approach is super naive (I've asked for support from the Machine Learning TWG to improve the approach if possible), but it's better than nothing and similar to how we did sentiment analysis in Fortis-v1 so I assert it's good enough for now.
This closes Issue#18.
NB: The PR also includes some re-structuring that was necessary to implement the new functionality, e.g. moving around some packages, extracting helper utilities, etc.
†: Cognitive Services currently support English, Spanish, Portuguese, French, German, Italian, Dutch, Norwegian, Swedish, Polish, Danish, Finnish, Russian, Greek and Turkish.