# WikiSentimentRanking

## Problem Statement

This project aims to create a tool that lets a user get a ranking of Wikipedia articles relevant to a user-defined query. The ranking is built by the sentiment strength of the article text.


## Motivation

Recently, a lot of research has been done on sentiment analysis of news and articles. For social media text, like tweets or Facebook posts, it is relatively easy to analyse sentiment. At the same time, longer and more complex formal texts like news and articles may contain words of strong positive/negative sentiment while being neutral overall. For instance, a neutral article describing a crime will have strong negative sentiment according to social media sentiment analysis tools.

To overcome this, several approaches were developed. E.g. Balahur et al. **[1]** tried to determine the best approach to sentiment analysis in news. An enhanced vocabulary-based approach was used. Subtracting subject-field specific vocabulary significantly improved the quality of the sentiment analysis.

Another option is to detect subjectivity in the text along with the concepts this sentiment is related to. This was successfully tested by Godbole et al. in **[2]**.

Nielsen et al. in **[3]** tried to build a system for real-time monitoring of sentiment in company-related articles, while Zhou et al. compared the neutrality of articles in different languages in **[4]**. In particular, articles about recent wars were considered.

#### Who cares and why? Summary.
1. Wikipedia community: Wikipedia articles should be written from a neutral point of view ([NPOV](https://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view)). Conflicts of interest should be detected and biased articles should be fixed.

2. Authors: Authors want to create high-quality, neutral content.

3. Readers: Readers want to read neutral articles to get information, not opinions.

4. PR department of a company or public person: Companies want to track the attitude of society towards them and their products.

Moreover, being developed in a modular fashion, such a system can be extended to the APIs of other information sources, like Google Search.


## Problem formulation (ML task)
Given a set of $n$ texts (articles) $\{t_i | i=\overline{1;n}\}$ related to a certain query, build a ranking of these texts by sentiment strength $\{j-th\text{ rank}:\hat{t_i} | j=\overline{1;n}\}$.

In the context of Machine Learning this is a classical regression problem: for a text one should predict a sentiment score ranging from $s_{-}$, the minimum possible negative score indicating the strongest negative sentiment, to $s_{+}$, the maximum possible positive score indicating the strongest positive sentiment, with $0$ denoting an absolutely neutral text. I.e. $|s|$ represents sentiment strength, while $sign(s)$ corresponds to positive or negative sentiment polarity.
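As a minimal sketch of this formulation, the snippet below ranks texts by a signed score and splits a score into strength and polarity. Ordering by signed score with the most positive text first is an assumption, chosen to match the Evaluation section where the positive copy is expected to rank higher. The titles and score values are invented for illustration.

```python
from typing import Dict, List, Tuple


def rank_by_score(scores: Dict[str, float]) -> List[Tuple[int, str, float]]:
    """Return (rank, title, score) tuples, most positive score first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(rank, title, s) for rank, (title, s) in enumerate(ordered, start=1)]


def strength_and_polarity(s: float) -> Tuple[float, int]:
    """Split a signed score into strength |s| and polarity sign(s)."""
    polarity = (s > 0) - (s < 0)  # -1, 0 or 1
    return abs(s), polarity


# Hypothetical scores, not produced by any real scorer.
scores = {"Article A": 0.8, "Article B": -0.6, "Article C": 0.1}
ranking = rank_by_score(scores)
```

Ranking by strength alone would only require replacing the sort key with `abs(item[1])`.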

## Approach to Solution
### Pipeline

![Solution pipeline](./img/pipeline_full.png "Solution pipeline")
Figure 1. Solution architecture

The high-level pipeline of the proposed solution can be described in the following steps:
1. A query is received from the user and passed to the next step.
2. Wikipedia articles relevant to the query are retrieved through the Wiki API in batches. Article text is cleaned of markup symbols and irrelevant information. Each batch is written to the output folder.
3. In parallel, batches of retrieved articles are read by the scorer module and a sentiment score is evaluated for each article. Results are written in batches as well.
4. The UI part reads new batches of scoring results and updates the ranking that is displayed to the user.




### Wiki_reader 

This module retrieves articles relevant to a query from Wikipedia via the Wiki API.
The `pywikibot` library is used for interaction with the API. Results are written as JSON in batches.
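The batch hand-off between the reader and the scorer could look roughly like the sketch below. It deliberately leaves out the `pywikibot` retrieval step and only shows writing and reading JSON batches. The file naming scheme (`batch_0000.json`), the `title`/`text` field names and the function names are hypothetical, not taken from the project code.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator, List


def _dump(batch: List[Dict[str, str]], out_dir: Path, index: int) -> Path:
    """Write one batch to a numbered JSON file and return its path."""
    path = out_dir / f"batch_{index:04d}.json"
    path.write_text(json.dumps(batch, ensure_ascii=False), encoding="utf-8")
    return path


def write_batches(articles: Iterable[Dict[str, str]], out_dir: Path,
                  batch_size: int = 10) -> List[Path]:
    """Group articles into batches of `batch_size` and write each to disk."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    batch: List[Dict[str, str]] = []
    for article in articles:
        batch.append(article)
        if len(batch) == batch_size:
            paths.append(_dump(batch, out_dir, len(paths)))
            batch = []
    if batch:  # write the last, possibly smaller, batch
        paths.append(_dump(batch, out_dir, len(paths)))
    return paths


def read_batches(out_dir: Path) -> Iterator[List[Dict[str, str]]]:
    """Yield batches in the order they were written."""
    for path in sorted(out_dir.glob("batch_*.json")):
        yield json.loads(path.read_text(encoding="utf-8"))
```

Writing numbered files lets a downstream scorer pick up new batches as they appear without waiting for the whole retrieval to finish.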


### ML module

Two different approaches were considered: vocabulary- and rule-based sentiment analysis (VADER) and model-based sentiment analysis (logistic regression).
This way it is possible to compare results and choose the best one, or combine them for even better results.

Each approach has its own advantages and drawbacks, which are described in the subsections below.

### 1. VADER - Valence Aware Dictionary and sEntiment Reasoner
As mentioned in the original paper [5] by Hutto et al., this scorer was developed based on social media text analysis. However, it is successfully applicable to more formal texts like articles and news, especially if one aims to detect simple language structures that express positive or negative sentiment.

Detecting clear signs of subjective opinions in Wikipedia articles is within our research interest. Thus, we decided to utilize this approach. Its advantages and disadvantages can be summarized as follows:
#### Pros:
+ Developed for social media, but well applicable to other formats - news, articles.
+ Considers negations and other complex language structures in a text.
+ Good results for simple subjectivities - can be used as a baseline, or first step method as a rough estimator.
+ Easy to use.
+ Works faster.

#### Cons:
- A rule-based approach misses everything outside its rules. In addition, complex language structures like irony are missed.
- It isn't possible to determine what exactly is negative: the sentiment or the concept. E.g. a neutral article describing a crime is scored as strongly negative.


### 2. Model-based - Logistic regression 

In sentiment analysis of complex natural texts there is a common problem: separating subjective opinions related to a concept from negative concepts described neutrally.

To overcome this obstacle, a more complex model was considered. However, as at first we aim to detect clear signs of subjective opinions in Wikipedia articles, a simple model, logistic regression, was chosen. The model is trained to predict the sentiment class (positive or negative polarity of the text). Then the probability of belonging to the positive class, after a certain adjustment, is used as the sentiment score. Its advantages and disadvantages can be summarized as follows:
#### Pros:
+ Theoretically, can be used to determine what is negative: the sentiment or the concept. However, for that, more complex models, like Neural Networks (in particular, LSTMs), should be utilized.
+ Can achieve better results because of a more complex model structure.

#### Cons:
- Harder to use properly.
- Hard to find proper dataset.
- Slower scoring.
- Overfitting to certain particularities of the data.
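The scoring rule described above for logistic regression (probability of the positive class, adjusted into a score) can be illustrated as follows. The exact adjustment is not specified here, so the affine map below is only an assumed example; $x_t$ is a hypothetical feature vector of text $t$, $w$ and $b$ are the learned parameters and $\sigma$ is the logistic function.

$$
p = P(\text{positive} \mid t) = \sigma(w^\top x_t + b), \qquad s = 2p - 1 \in [-1, 1]
$$

Under this assumption $p = 0.5$ maps to the neutral score $s = 0$, and the score bounds are $s_{-} = -1$ and $s_{+} = 1$.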

## Data
### Vocabulary-based approach - VADER
VADER uses a combination of a sentiment lexicon and a set of linguistic rules to determine polarity and sentiment strength.
A sentiment lexicon is a dictionary where lexical features (e.g., words) correspond to a score that represents positive or negative semantic orientation. In this case, we don't explicitly use any training data, but the sentiment lexicon in the VADER scorer is a kind of implicit labeled data.
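To make the idea of "lexicon plus rules" concrete, here is a toy sketch. It is not VADER and does not use its lexicon or its rules: the word valences are invented and the single negation rule (flip the sign when a negation word directly precedes a lexicon word) is a simplification chosen for illustration.

```python
import re

# Toy lexicon: word -> valence. Values are invented, not taken from VADER.
TOY_LEXICON = {
    "good": 1.9, "love": 3.2, "honest": 2.3,
    "bad": -2.5, "hate": -2.7, "evil": -3.4,
}
NEGATIONS = {"not", "never", "no"}


def toy_valence_sum(text: str) -> float:
    """Sum lexicon valences over the text, applying one negation rule."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, token in enumerate(tokens):
        valence = TOY_LEXICON.get(token)
        if valence is None:
            continue  # word is not in the lexicon, contributes nothing
        # Rule: a negation directly before the word flips its polarity.
        if i > 0 and tokens[i - 1] in NEGATIONS:
            valence = -valence
        total += valence
    return total
```

The lexicon plays the role of labeled data (each word carries a human-assigned score), while the rules adjust those scores based on context.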

### Model-based approach - Twitter sentiment dataset
For the model-based approach, 1.6M tweets [6], each labeled as belonging to either the positive or the negative class, were utilized. Obviously, social media texts don't contain all the information needed to train a general model for sentiment detection in formal texts. However, such training data is applicable in the case of direct subjectivity detection, i.e. when one tries to detect sentiment expressed in simple natural language structures.


## Evaluation

At the moment we are much more interested in the ranking of the articles than in the exact values of the sentiment score.
Thus, we decided to conduct a qualitative evaluation of our approach.

Qualitative evaluation procedure can be formalized in the following steps:
1. Determine a search query.
2. Get a set of relevant articles.
3. Evaluate sentiment scores for these articles and build a ranking.
4. Take an article from the middle of the ranking (without loss of generality, let it be the article with the median sentiment score in the sample).
5. Locally add 2 more copies of this article to the sample. Edit one copy to be clearly (for a human expert) negative and the other to be clearly positive, e.g. by adding a template paragraph of strong negative/positive sentiment respectively.
6. Evaluate sentiment scores for the extended set of articles and build a new ranking.
7. Expected result: the copy edited to be clearly positive should be higher in the ranking than the original one. The other edited copy should be lower, as it clearly has stronger negative sentiment.

Here are two templates of positive/negative paragraphs utilized:
> `<Query>` is very bad. And author doesn't provide any justification. 
    People don't like `<Query>`. 
    Some even hate `<Query>`, because `<Query>` is evil.
    Some groups believe `<Query>` is their main enemy.
    
> `<Query>` is very good. And author doesn't provide any justification. 
    People like `<Query>`. 
    Some even love `<Query>`, because `<Query>` is honest.
    Some groups believe `<Query>` is their best friend.

We added the query to the paragraphs to make it clear that the subjectivity is expressed in relation to the concept of the user-defined query. This may be useful later, when more complex models that separate negative concepts from negative opinions about these concepts are evaluated.
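The evaluation procedure above can be sketched as a small check. The scorer is passed in as a function, so the sketch does not depend on either approach. The function names and the placeholder titles for the edited copies are hypothetical; the template texts are the ones given above, appended to the end of the article (where exactly the paragraph is inserted is an assumption).

```python
from typing import Callable, Dict, List

NEGATIVE_TEMPLATE = (
    "{q} is very bad. And author doesn't provide any justification. "
    "People don't like {q}. Some even hate {q}, because {q} is evil. "
    "Some groups believe {q} is their main enemy."
)
POSITIVE_TEMPLATE = (
    "{q} is very good. And author doesn't provide any justification. "
    "People like {q}. Some even love {q}, because {q} is honest. "
    "Some groups believe {q} is their best friend."
)


def rank(articles: Dict[str, str], scorer: Callable[[str], float]) -> List[str]:
    """Return article titles ordered from highest to lowest score."""
    return sorted(articles, key=lambda title: scorer(articles[title]), reverse=True)


def perturbation_check(articles: Dict[str, str], query: str,
                       scorer: Callable[[str], float]) -> bool:
    """True if the positive copy ranks above the original and the negative below."""
    ranking = rank(articles, scorer)
    median_title = ranking[len(ranking) // 2]  # article from the middle
    original = articles[median_title]

    extended = dict(articles)
    extended["__positive_copy__"] = original + "\n\n" + POSITIVE_TEMPLATE.format(q=query)
    extended["__negative_copy__"] = original + "\n\n" + NEGATIVE_TEMPLATE.format(q=query)

    new_ranking = rank(extended, scorer)
    pos = new_ranking.index("__positive_copy__")
    orig = new_ranking.index(median_title)
    neg = new_ranking.index("__negative_copy__")
    return pos < orig < neg
```

A scorer that assigns the same score to every article fails this check, because the edited copies do not move relative to the original.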

### VADER evaluation
![VADER_evaluation_before](./img/eval_vader_before.PNG "VADER evaluation before")
Figure 2. VADER Approach Evaluation. Ranking before edits

![VADER_evaluation_after](./img/eval_vader_after.PNG "VADER evaluation after")
Figure 3. VADER Approach Evaluation. Ranking after edits

Expected behaviour is observed: the article edited to be more negative is lower in the ranking, while the positive one is higher. However, all sentiment scores are close to 1. I.e. the order of articles in the ranking is reasonable and the results are satisfactory, but if one is more interested in the score itself, the method should be improved.
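One plausible explanation for the scores close to 1, assuming the VADER compound score is computed over the whole article text: the reference VADER implementation normalizes the summed valence $x$ as $x / \sqrt{x^2 + \alpha}$ with $\alpha = 15$, which saturates towards $\pm 1$ as the text, and therefore the sum, grows. The sketch below reproduces only that normalization; the input sums are invented.

```python
import math


def normalize(valence_sum: float, alpha: float = 15.0) -> float:
    """Map an unbounded valence sum into (-1, 1), VADER-style."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


# Hypothetical valence sums for a short, a medium and two long texts.
for total in (2.0, 10.0, 50.0, 200.0):
    print(total, round(normalize(total), 4))
```

Because the function is monotonic, the order of articles is preserved even when the values themselves are nearly indistinguishable, which is consistent with a reasonable ranking but uninformative scores.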

### Logistic Regression evaluation
![Logistic_Regression_evaluation_before](./img/eval_lr_before.PNG "Logistic Regression evaluation before")
Figure 4. Logistic Regression Approach Evaluation. Ranking before edits

![Logistic_Regression_evaluation_after](./img/eval_lr_after.PNG "Logistic Regression evaluation after")
Figure 5. Logistic Regression Approach Evaluation. Ranking after edits

Expected behaviour is *not* observed: both the negative and the positive edited copies have the same score and rank as the original article. As all sentiment scores are equal to 1, these results are not satisfactory and the method should be improved.

This poor performance can be explained by the low predictive power of Logistic Regression on long texts. Thus we extended this approach by scoring parts of the articles separately and averaging the scores. Results of the extended method evaluation are below:
![Advanced_Logistic_Regression_evaluation_before](./img/eval_lr_adv_before.PNG "Extended Logistic Regression evaluation before")
Figure 6. Extended Logistic Regression Approach Evaluation. Ranking before edits

![Advanced_Logistic_Regression_evaluation_after](./img/eval_lr_adv_after.PNG "Extended Logistic Regression evaluation after")
Figure 7. Extended Logistic Regression Approach Evaluation. Ranking after edits

Expected behaviour is observed: the article edited to be more negative is lower in the ranking, while the positive one is higher. Moreover, the sentiment scores are reasonable and not close to 1, unlike in the previous evaluations. I.e. the order of articles in the ranking is reasonable and the results are satisfactory.
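The extended method (scoring parts of an article separately and averaging) can be sketched as below. How an article is split into parts is not specified here, so splitting on blank lines into paragraphs is an assumption, as are the function names. The part scorer is passed in as a function, e.g. a wrapper around the trained logistic regression model.

```python
from statistics import mean
from typing import Callable, List


def split_into_parts(text: str) -> List[str]:
    """Split text into non-empty paragraphs separated by blank lines."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def average_part_score(text: str, score_part: Callable[[str], float]) -> float:
    """Score every part separately and return the mean of the part scores."""
    parts = split_into_parts(text)
    if not parts:
        return 0.0  # treat an empty article as neutral
    return mean(score_part(part) for part in parts)
```

Averaging keeps a few strongly worded paragraphs from being drowned out by, or from dominating, a long and otherwise neutral article.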

## Results & Discussion
*(Note: will be updated soon)*

A ranking for the query `Amazon Company` was built with the VADER and extended logistic regression approaches.
The top results are shown in Figures 8 and 9 respectively.
![VADER_evaluation_before](./img/eval_vader_before.PNG "VADER ranking results")
Figure 8. VADER Approach. Ranking results: top 5

As mentioned in the evaluation section, the first approach lets one build a reasonable ranking, but the exact score values are not precise enough. Thus further usage of the absolute values of the sentiment score is not advised.

![Advanced_Logistic_Regression_evaluation_before](./img/eval_lr_adv_before.PNG "Extended Logistic Regression ranking results")
Figure 9. Extended Logistic Regression Approach. Ranking results: top 4

With the more complex model, not only is the ranking reasonable, but the scores can also be used for further purposes.
It is important to emphasize that this model is still pretty simple, i.e. it detects subjectivity directly expressed with simple natural language structures.

Thus, the results of this project are good: we satisfied our research and engineering interests.
However, these results are just a first step in developing a production solution for the mentioned problem.

The modular structure of the project makes it scalable, but one should improve the scoring quality to go further than a research prototype. In particular, one should overcome the problem of separating negative concepts from negative sentiment related to these concepts.

## Possible Extensions
* Highlight paragraphs/sentences that triggered scorer
* Summarize score from sources
* Use other sources (e.g. Google) to estimate sentiment
* Apply other NLP models to achieve better results (LSTM, extend vocabulary approach)
* Real-time plugin for advising readers whether content is neutral or biased



## References
[1] Balahur, Alexandra, et al. "Sentiment analysis in the news." arXiv preprint arXiv:1309.6202 (2013). [Source](https://arxiv.org/ftp/arxiv/papers/1309/1309.6202.pdf).

[2] Godbole, Namrata, Manja Srinivasaiah, and Steven Skiena. "Large-Scale Sentiment Analysis for News and Blogs." ICWSM 7.21 (2007): 219-222. [Source](http://www.uvm.edu/pdodds/files/papers/others/2007/godbole2007a.pdf).

[3] Nielsen, F. Å., M. Etter, and L. K. Hansen. "Real-time monitoring of sentiment in business related Wikipedia articles." Technical University of Denmark, 2013. [Source](https://pdfs.semanticscholar.org/74e6/b642042d33980d70ce2ce7e5c4d1b54aa790.pdf).

[4] Zhou, Yiwei, Alexandra Cristea, and Zachary Roberts. "Is wikipedia really neutral? A sentiment perspective study of war-related wikipedia articles since 1945." Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation. 2015. [Source](https://www.aclweb.org/anthology/Y15-1019).

[5] Hutto, Clayton J., and Eric Gilbert. "Vader: A parsimonious rule-based model for sentiment analysis of social media text." Eighth international AAAI conference on weblogs and social media. 2014. [Source](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8109/8122).

[6] Twitter Sentiment Dataset. Version cleaned from emoticons. [Source](https://1drv.ms/u/s!AqlC23XtB27BisoM5u56CMPeNOBQKw).