Creating your first sentiment analysis application.

Mario Muñoz edited this page Jun 26, 2014 · 10 revisions

Designing a simple sentiment analysis algorithm

Sentiment analysis is a challenging task in the field of semantic technology for several reasons, including the inherent complexity of human language and the differences between languages and application domains (for example, "cold" does not mean the same thing in a "beverages" context as in a "room" context). Using high-quality resources is therefore key to getting good results. Although there are several complex, state-of-the-art sentiment analysis algorithms and techniques (visit this link to read an example), this section covers the implementation of a very basic algorithm that integrates Eurosentiment services and resources.

Imagine a simple text like this one: "The room was fantastic and the service was wonderful. However, we find the restaurant was high-priced." It contains some words that carry sentiment information: fantastic, wonderful and high-priced. Two of them (fantastic and wonderful) express positive sentiment, whilst high-priced expresses negative sentiment. With a rough normalization, the text scores (2 - 1)/3 ≈ 0.33 on a scale from -1 (fully negative) to 1 (fully positive), so it is moderately positive; equivalently, 2 of its 3 sentiment words (about 67%) are positive. So, the steps of our first sentiment algorithm will be:

  • Detect all positive-sentiment words in a given text.
  • Detect all negative-sentiment words in a given text.
  • Return a sentiment score, calculated as (positive word count - negative word count) / (positive + negative word count).
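The three steps above can be sketched in a few lines of Java. This is only an illustrative sketch, not the code shipped with the samples: the class name `BasicSentimentScore`, the hard-coded word lists, the tokenization rule and the choice to return 0 when no sentiment word is found are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class BasicSentimentScore {
    // Hypothetical hard-coded word lists, for illustration only.
    // A real application would load these from a sentiment lexicon.
    static final Set<String> POSITIVE =
            new HashSet<>(Arrays.asList("fantastic", "wonderful"));
    static final Set<String> NEGATIVE =
            new HashSet<>(Arrays.asList("high-priced"));

    /** Returns (positive - negative) / (positive + negative), in [-1, 1]. */
    static double score(String text) {
        int positive = 0;
        int negative = 0;
        // Split on anything that is neither a letter nor a hyphen,
        // so that "high-priced" stays a single token.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}-]+")) {
            if (POSITIVE.contains(token)) {
                positive++;
            } else if (NEGATIVE.contains(token)) {
                negative++;
            }
        }
        int total = positive + negative;
        // Assumption: a text without sentiment words is treated as neutral (0).
        return total == 0 ? 0.0 : (double) (positive - negative) / total;
    }
}
```

For the example text about the room, the service and the restaurant, `BasicSentimentScore.score(...)` counts two positive words and one negative word and returns (2 - 1)/3.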

Caveats: language and domain

Now imagine that your sentiment service becomes a huge success and you want to start processing text in other languages. You would also like to extend your hotel review site with reviews of electronic devices. Should you change your algorithm? With Eurosentiment's multilingual, multidomain and standards-based approach, you only have to change the resources you consume. The only change you need to make is to add stages for language and domain detection and to redirect each text to the right resource.

But language and domain detection seem to be quite complicated tasks... don't worry! There are several ready-made Eurosentiment services that do this for you. You just have to call them, parse the standard NIF result, and focus on your sentiment algorithm.

Then, the final algorithm steps are:

  • Detect the language of the given text.
  • Detect the domain of the given text.
  • Detect all positive-sentiment words in the text, using the right resources for the detected language and domain.
  • Detect all negative-sentiment words in the text, using the right resources for the detected language and domain.
  • Return a "positiveness score", calculated as positive word count / (positive + negative word count). Unlike the first score, it ranges from 0 to 1: the example text above gives 2/3 ≈ 67%.
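The final steps can be sketched as a small pipeline. The sketch below does not call the real Eurosentiment services: `Detector` and `Lexicon` are hypothetical stand-ins for the language/domain detection services and the lexicon resources, the `"language/domain"` map key is an invented convention, and returning 0.5 for a text with no sentiment words is an assumption.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class PositivenessPipeline {

    /** Hypothetical stand-in for a language or domain detection service call. */
    interface Detector {
        String detect(String text);
    }

    /** Hypothetical stand-in for the sentiment words of one language and domain. */
    static final class Lexicon {
        final Set<String> positive;
        final Set<String> negative;

        Lexicon(Set<String> positive, Set<String> negative) {
            this.positive = positive;
            this.negative = negative;
        }
    }

    private final Detector languageDetector;
    private final Detector domainDetector;
    private final Map<String, Lexicon> lexicons; // keyed by "language/domain"

    PositivenessPipeline(Detector languageDetector, Detector domainDetector,
                         Map<String, Lexicon> lexicons) {
        this.languageDetector = languageDetector;
        this.domainDetector = domainDetector;
        this.lexicons = lexicons;
    }

    /** Returns positive / (positive + negative), in [0, 1]. */
    double positiveness(String text) {
        // Steps 1 and 2: detect language and domain, then pick the matching lexicon.
        String key = languageDetector.detect(text) + "/" + domainDetector.detect(text);
        Lexicon lexicon = lexicons.get(key);
        if (lexicon == null) {
            throw new IllegalArgumentException("No lexicon available for " + key);
        }
        // Steps 3 and 4: count positive and negative words from that lexicon.
        int positive = 0;
        int negative = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}-]+")) {
            if (lexicon.positive.contains(token)) {
                positive++;
            } else if (lexicon.negative.contains(token)) {
                negative++;
            }
        }
        // Step 5: assumption, a text without sentiment words scores a neutral 0.5.
        int total = positive + negative;
        return total == 0 ? 0.5 : (double) positive / total;
    }
}
```

Because the scoring code only sees a `Lexicon`, supporting a new language or domain in this sketch means adding one more entry to the map, which mirrors the idea that only the consumed resources change.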

Implementation

This example is already implemented in the provided code. Just examine the classes samples/PositiveWordsMatcher, samples/NegativeWordsMatcher and samples/SimpleSentimentAnalyzer. These classes make use of the concepts explained in the sections Using Eurosentiment LRP Services and Using Eurosentiment LRP Resources.

Please note the use of some common operations, such as text normalization (utils/TextNormalizer.java), which makes heterogeneous input texts as uniform as possible and thus increases the chances of matching sentiment words.
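To give an idea of what such a normalization step might look like, here is a minimal sketch. It is not the contents of utils/TextNormalizer.java: the class name `SimpleTextNormalizer` and the chosen steps (stripping diacritics, lower-casing, collapsing whitespace) are assumptions for illustration.

```java
import java.text.Normalizer;
import java.util.Locale;

final class SimpleTextNormalizer {
    /** Strips diacritics, lower-cases and collapses whitespace. */
    static String normalize(String text) {
        // Decompose accented characters into base letter + combining mark...
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // ...then drop the combining marks.
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        // Lower-case independently of the default locale and tidy up whitespace.
        return withoutMarks.toLowerCase(Locale.ROOT)
                .replaceAll("\\s+", " ")
                .trim();
    }
}
```

Note that stripping diacritics is a trade-off: it helps match inconsistently accented input, but in some languages it can merge distinct words, so it should be applied consistently to both the input text and the lexicon entries.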

Further steps

The algorithm presented here is quite basic, so it probably will not achieve high precision or recall. There are many points where it can be improved:

  • Lemmatize the input words (for example, using OpenNLP or TreeTagger). This increases your algorithm's recall by increasing the chances of matching words.
  • Apply POS tagging rules to filter out unwanted results. For example, you may be interested only in adjectives that carry sentiment information.
  • Explore the possibilities of the Eurosentiment resources. The given lexicons contain more than sentiment words: you can also find their POS tags, the entities they apply to, and even their WordNet synsets or additional DBpedia information.
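As an illustration of the POS filtering idea, the sketch below keeps only the lexicon entries tagged as adjectives. The `Entry` class, its field names and the `"adjective"` tag value are hypothetical and do not reflect the actual Eurosentiment lexicon schema.

```java
import java.util.ArrayList;
import java.util.List;

final class PosFilter {

    /** Hypothetical lexicon entry: a word plus its part-of-speech tag. */
    static final class Entry {
        final String word;
        final String pos;

        Entry(String word, String pos) {
            this.word = word;
            this.pos = pos;
        }
    }

    /** Returns only the entries whose POS tag is "adjective", preserving order. */
    static List<Entry> onlyAdjectives(List<Entry> entries) {
        List<Entry> result = new ArrayList<>();
        for (Entry entry : entries) {
            if ("adjective".equals(entry.pos)) {
                result.add(entry);
            }
        }
        return result;
    }
}
```

Filtering the lexicon once, before matching, keeps the scoring loop unchanged: the matcher simply receives a smaller word list.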