[GSoC 2013] Cassio Melo New Insights based on Sentiment and Frequent Pattern Analysis

Cassio Melo edited this page May 3, 2013 · 6 revisions

This proposal aims at creating new qualitative insights based on Sentiment Analysis (Opinion Mining) and Frequent Pattern Analysis to complement ThinkUp's existing analytics.

Machine learning algorithms will be employed to look for non-trivial relationships and semantics in the text, as opposed to purely number-crunching approaches. Several insights may emerge from the output of the algorithms. An additional challenge is to manage storage space efficiently and to handle incremental updates without re-computing from scratch. Moreover, the design of the algorithms should be modular enough to be extended by other developers in the future. A brief description of each method, followed by application examples, is given below.

Sentiment Analysis

A basic task in sentiment analysis is classifying the "opinion" a user expresses in a given text, at the document or word level, as positive, negative, or neutral. Insights based on sentiment analysis can describe a user's reputation over time or provide statistics about his or her sentiment (e.g. which words are associated with a happy or sad state). Sentiment140 is a good example of how sentiment analysis can be done on Twitter.

[Image: Sentiment Analysis Insight]

Implementation. Sentiment analysis can be accomplished in many ways. It can be treated as a classification problem in which words (or tweets) are mapped to a positive or negative label. A Naive Bayes classifier is a good candidate for this task: in practice it yields good results and it is simple to implement. First, the classifier is trained on examples of "positive" and "negative" tweets prepared beforehand. Second, the trained classifier is called to predict the sentiment of each new tweet. Several Bayesian classifiers are already implemented for PHP. Classified tweets are stored in the database and fetched regularly by the different insights. Alternatively, sentiment classification can be delegated to a third-party API such as Alchemy (free limited usage or commercial license).
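The two steps above (train, then classify) can be sketched as follows. This is a minimal illustration in Python rather than PHP, assuming a multinomial Naive Bayes model with add-one smoothing; the class name, the tokenizer, and the four training tweets are hypothetical and are not part of ThinkUp or of any existing PHP classifier.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a tweet and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSentiment:
    """Hypothetical multinomial Naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training tweets
        self.vocabulary = set()

    def train(self, text, label):
        """Add one labeled tweet; can be called at any time to update the model."""
        self.doc_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocabulary.add(word)

    def log_scores(self, text):
        """Return log P(label) + sum of log P(word | label) for each known label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = [w for w in tokenize(text) if w in self.vocabulary]  # skip unseen words
        scores = {}
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def classify(self, text):
        """Return the label with the highest log score."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Step 1: train on examples prepared beforehand (made-up tweets).
classifier = NaiveBayesSentiment()
classifier.train("I love this phone, great battery", "positive")
classifier.train("what a great and happy day", "positive")
classifier.train("I hate waiting, terrible service", "negative")
classifier.train("sad and terrible weather today", "negative")

# Step 2: ask the classifier for the sentiment of new tweets.
print(classifier.classify("great happy phone"))      # positive
print(classifier.classify("terrible sad service"))   # negative
```

Because the model consists only of counts, new labeled tweets can be added with `train` without re-computing anything, which fits the incremental-update requirement mentioned earlier.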

Frequent Pattern Analysis

Frequent Pattern Analysis looks at itemsets (e.g. words) that often occur together, for example, {"barack obama", "politics", "united states"}. Frequent itemsets can be used to derive Association Rules, which provide valuable information on how categorical variables are related to each other. Non-trivial examples include, for instance, which car brands are related to {"reliable", "cheap", "5 seats"} per country. Frequent pattern analysis helps users both qualitatively (co-occurrence among sets of words, classes and hierarchies) and quantitatively (number of co-occurrences, strength of the implication, etc). Frequent patterns can also be used to predict the next words the user is likely to write based on his or her history.
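As a rough illustration of how frequent itemsets lead to association rules, the sketch below counts itemsets by brute force and derives rules with their confidence (the "strength of the implication"). The function names, thresholds, and sample statuses are assumptions made for this example; a real implementation would use a proper mining algorithm instead of enumerating every combination.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset of up to max_size items; keep those seen in >= min_support transactions."""
    counts = Counter()
    for items in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(items), size):
                counts[frozenset(itemset)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


def association_rules(frequent, min_confidence):
    """Derive rules antecedent -> consequent with confidence = support(itemset) / support(antecedent)."""
    rules = []
    for itemset, support in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                # Every subset of a frequent itemset is itself frequent, so this lookup succeeds.
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Each status is reduced to the set of words it contains (made-up data).
statuses = [
    {"obama", "politics", "usa"},
    {"obama", "politics"},
    {"obama", "usa"},
    {"politics", "economy"},
    {"obama", "politics", "economy"},
]

frequent = frequent_itemsets(statuses, min_support=3)
rules = association_rules(frequent, min_confidence=0.7)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), confidence)
```

With this data, {"obama", "politics"} appears in 3 of 5 statuses, and each of the two words implies the other with confidence 3/4.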

The images below illustrate an example of one insight based on frequent pattern mining:

[Images: Frequent Pattern Insight distribution chart]

Implementation. Ideally, the pattern mining algorithm should run as a background task. It works as follows. The first phase consists of text fetching and treatment (removing stop words and handling derivation, inflection, coined words, etc). Then, it will create a term table relating statuses and words. Because text documents are sparse, a compact data structure, such as a linked list, may be needed. The algorithm proceeds with the computation of frequent itemsets using the incidence table generated by the previous step. Two approaches are possible: periodic batch processing or online stream mining. The main difference is that the online stream mining algorithm runs indefinitely once it is started. It is also more complex than periodic batch processing. Because of this, and taking into account the existing ThinkUp architecture, I suggest implementing the batch processing option first. For batch processing, an efficient choice is a variant of the CHARM algorithm that represents itemsets as bitsets. The result of the algorithm will be stored in the database and be promptly available to the insights that depend on it. A simple first insight could look like the example illustrated above. More sophisticated insights can be created using the same input.
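The batch pipeline described above can be sketched as below. Note that this is not CHARM: it is a simplified depth-first search that intersects bitsets (in the style of Eclat) and then keeps only the closed itemsets, which is the kind of output CHARM produces more efficiently. The stop-word list, function names, and sample statuses are all assumptions for illustration.

```python
STOP_WORDS = {"the", "a", "is", "in", "of", "and", "to"}  # tiny illustrative list


def build_term_bitsets(statuses):
    """Incidence table: map each term to an int whose bit i is set when status i contains it."""
    bitsets = {}
    for index, text in enumerate(statuses):
        for term in set(text.lower().split()) - STOP_WORDS:
            bitsets[term] = bitsets.get(term, 0) | (1 << index)
    return bitsets


def support(bits):
    """Number of statuses in a bitset (count of 1 bits)."""
    return bin(bits).count("1")


def mine_frequent(bitsets, min_support):
    """Depth-first search: extend an itemset one term at a time by AND-ing bitsets."""
    terms = sorted(t for t, b in bitsets.items() if support(b) >= min_support)
    found = {}

    def extend(prefix, prefix_bits, start):
        for i in range(start, len(terms)):
            bits = prefix_bits & bitsets[terms[i]]
            if support(bits) >= min_support:
                itemset = prefix + (terms[i],)
                found[frozenset(itemset)] = bits
                extend(itemset, bits, i + 1)

    extend((), -1, 0)  # -1 has every bit set, so the first AND keeps the term's own bitset
    return found


def closed_itemsets(found):
    """Merge frequent itemsets that occur in exactly the same statuses into one closed itemset."""
    by_bits = {}
    for itemset, bits in found.items():
        by_bits[bits] = by_bits.get(bits, frozenset()) | itemset
    return {itemset: support(bits) for bits, itemset in by_bits.items()}


statuses = [
    "obama politics usa",
    "obama politics",
    "obama politics economy",
    "economy is in the news",
]

bitsets = build_term_bitsets(statuses)
closed = closed_itemsets(mine_frequent(bitsets, min_support=2))
for itemset, count in closed.items():
    print(sorted(itemset), count)
```

Here "obama" and "politics" always occur in the same three statuses, so they are reported once as the closed itemset {"obama", "politics"} with support 3 instead of as three separate frequent itemsets. Storing only closed itemsets is one way to limit the storage space the results take in the database.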

Other Insights

I developed statistical indexes to measure influence, information spread and event characterization on Twitter during a Master's internship at INRIA, Paris. A case study article can be found here. I can easily implement these indexes as insights for ThinkUp.


The GSoC internship period lasts a total of 18 weeks. I propose splitting the period into four phases of about 4 weeks each, organized by deliverable, so that community feedback can be obtained early. The phases are: a) Prototyping, b) Sentiment Insights, c) Pattern Analysis Insights and d) Tests & Enhancements. During each "Insight" phase the goal is to produce insights and visualizations.

May 27: Accepted student proposals announced.
May 27 - June 16: Preparation period. Literature review, definition of the methods for sentiment and pattern analysis.
June 17: Internship begins.
June 17 - July 15: Prototyping phase.
July 16 - August 12: Sentiment Insights phase.
July 29 - August 2: Mid-term evaluation.
August 3 - September 9: Pattern Analysis Insights phase.
August 4 - September 22: Tests & Enhancements phase.
September 23-29: Final evaluation.

This plan will be reviewed by the mentor(s), and changes can be made throughout the process.


  • Algorithm for sentiment analysis (tweet-oriented)
  • Insights based on sentiment analysis
  • Algorithm for mining closed frequent itemsets in a data stream using a sliding window approach
  • Insights based on pattern analysis
  • Insights visualizations (e.g. implications among words, distribution of co-occurrences for a given set of words, confidence of the rules, etc)
  • Prediction/Suggestion of related content
  • Other insights?