# WikiSentimentRanking

## Problem Statement

This project aims to create a tool that lets a user get a ranking of Wikipedia articles relevant to a user-defined query. The ranking is built by the sentiment strength of each article's text.


## Motivation

Recently, a lot of research has been done on sentiment analysis of news and articles. For social media text, like tweets or Facebook posts, it is relatively easy to analyse sentiment. At the same time, longer and more complex formal texts, like news and articles, may contain words with strong positive or negative sentiment while being neutral overall. For instance, a neutral article describing a crime will have strong negative sentiment according to social media sentiment analysis tools.

To overcome this, several approaches have been developed. For example, Balahur et al. **[1]** tried to determine the best approach to sentiment analysis in news. They used an enhanced vocabulary-based approach, and subtracting subject-field-specific vocabulary significantly improved the quality of the sentiment analysis.

Another option is to detect subjectivity in the text along with the concepts this sentiment relates to. This was successfully applied and tested by Godbole et al. in **[2]**.

Nielsen et al. in **[3]** tried to build a system for real-time monitoring of sentiment in company-related articles, while Zhou et al. in **[4]** compared the neutrality of articles in different languages, considering in particular articles about recent wars.

#### Who cares and why? Summary.
1. Wikipedia community: Wikipedia articles should be written from a neutral point of view ([NPOV](https://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view)). Conflicts of interest should be detected and biased articles should be fixed.

2. Authors: Authors want to create such high-quality content.

3. Readers: Readers want to read neutral articles to get information, not opinions.

4. PR department of a company or public person: Companies want to track society's attitude toward them and their products.

Moreover, if developed in a modular fashion, such a system can be extended to the APIs of other information sources, like Google Search.


## Problem formulation (ML task)
Given a set of $n$ texts (articles) $\{t_i \mid i=\overline{1;n}\}$ related to a certain query, build a ranking of these texts by sentiment strength $\{\hat{t}_j \mid j=\overline{1;n}\}$, where $\hat{t}_j$ denotes the text placed at the $j$-th rank.

In the context of machine learning this is a classical regression problem: for each text one should predict a sentiment score $s$ ranging from $s_{-}$, the minimum possible score indicating the strongest negative sentiment, to $s_{+}$, the maximum possible score indicating the strongest positive sentiment, with $0$ denoting an absolutely neutral text. That is, $|s|$ represents sentiment strength, while $sign(s)$ corresponds to positive or negative sentiment polarity.
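
As a minimal sketch of this formulation, assuming scores have already been predicted, the ranking can be obtained by sorting on $|s|$. The function name, article titles and score values below are hypothetical and only illustrate the idea:

```python
from typing import Dict, List, Tuple


def rank_by_sentiment_strength(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort (title, score) pairs by |score|, strongest sentiment first."""
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical scores; titles and values are made up for illustration.
example_scores = {"Article A": -0.8, "Article B": 0.1, "Article C": 0.5}
ranking = rank_by_sentiment_strength(example_scores)

for rank, (title, score) in enumerate(ranking, start=1):
    polarity = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    print(rank, title, abs(score), polarity)
```

Sorting on the absolute value keeps strength and polarity separate: a strongly negative article and a strongly positive one both end up near the top of the ranking.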

## Approach to Solution
### Pipeline

![Solution pipeline](./img/pipeline_full.png "Solution pipeline")
Figure 1. Solution architecture

The high-level pipeline of the proposed solution can be described in the following steps:
1. A query is received from the user and passed to the next step.
2. Wikipedia articles relevant to the query are retrieved through the Wiki API in batches. The article text is cleaned of markup symbols and irrelevant information. Each batch is written to the output folder.
3. In parallel, the scorer module reads batches of retrieved articles and evaluates a sentiment score for each article. Results are written in batches as well.
4. The UI reads new batches of scoring results and updates the ranking that is displayed to the user.
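
The batch hand-off between the scorer and the UI could look roughly like the sketch below. It assumes JSON batch files named `batch_NNNN.json` and records with `title` and `score` fields; these names, and all three helper functions, are hypothetical and not taken from the project code:

```python
import json
import tempfile
from pathlib import Path


def write_batch(folder, index, records):
    """Write one batch as JSON; zero-padded names keep batches in order."""
    path = folder / f"batch_{index:04d}.json"
    path.write_text(json.dumps(records), encoding="utf-8")
    return path


def read_new_batches(folder, seen):
    """Return records from batch files not processed yet and mark them as seen."""
    records = []
    for path in sorted(folder.glob("batch_*.json")):
        if path.name not in seen:
            seen.add(path.name)
            records.extend(json.loads(path.read_text(encoding="utf-8")))
    return records


def update_ranking(ranking, new_records):
    """Merge newly scored articles and re-sort by sentiment strength |score|."""
    return sorted(ranking + new_records, key=lambda r: abs(r["score"]), reverse=True)


seen, ranking = set(), []
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Made-up scored batches standing in for the scorer output.
    write_batch(folder, 0, [{"title": "A", "score": 0.2}, {"title": "B", "score": -0.9}])
    ranking = update_ranking(ranking, read_new_batches(folder, seen))
    write_batch(folder, 1, [{"title": "C", "score": 0.5}])
    ranking = update_ranking(ranking, read_new_batches(folder, seen))

print([r["title"] for r in ranking])  # ['B', 'C', 'A']
```

If the writer and reader really run in parallel, the writer would probably need to write to a temporary name and rename the file afterwards, so that the reader never sees a half-written batch.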




### Wiki_reader 

This module retrieves articles relevant to a query from Wikipedia via the Wiki API.
The `pywikibot` library is used to interact with the API. Results are written as JSON in batches.
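
The cleaning and batching steps can be sketched without network access. The regular expressions below are an assumption about what "cleaning" involves (references, simple templates, link syntax, bold/italic quotes); they are not the project's actual rules, and both function names are hypothetical. In the real module the article stream would come from `pywikibot` rather than from a literal list:

```python
import re
from itertools import islice
from typing import Iterable, Iterator, List


def clean_wikitext(text: str) -> str:
    """Strip a few common wikitext constructs (illustrative, not exhaustive)."""
    text = re.sub(r"<ref[^>]*/>", "", text)                         # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)     # <ref>...</ref>
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                      # non-nested templates
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)   # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                               # bold/italic markers
    return re.sub(r"\s+", " ", text).strip()


def batched(items: Iterable, size: int) -> Iterator[List]:
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


raw = "'''Python''' is a [[programming language|language]].<ref>cite</ref> {{stub}}"
print(clean_wikitext(raw))          # Python is a language.
print(list(batched(range(5), 2)))   # [[0, 1], [2, 3], [4]]
```

Nested templates and tables are deliberately not handled here; a dedicated wikitext parser would be a more robust choice for those.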


### ML module

Two different approaches were considered: vocabulary- and rule-based sentiment analysis (VADER) and model-based sentiment analysis (linear regression).
This makes it possible to compare results and choose the better approach, or combine them for even better results.

Each approach has its own advantages and drawbacks. These can be summarized as follows:

### 1. VADER - Valence Aware Dictionary and sEntiment Reasoner
As mentioned in the original paper [5] by Hutto et al., this scorer was developed based on social media text analysis. However, it is successfully applicable to more formal texts like articles and news, especially if one aims to detect simple language structures that express positive or negative sentiment.

Detecting clear signs of subjective opinions in Wikipedia articles is within our research interest. Thus, we decided to use this approach. Its advantages and disadvantages can be summarized as follows:
#### Pros:
+ Developed for social media, but well applicable to other formats - news, articles
+ Considers negations and other complex language structures in a text
+ Good results for simple subjectivities - can be used as a baseline or as a first-step rough estimator.
+ Easy to use

#### Cons:
- A rule-based approach misses everything outside its rules. In addition, complex language structures, like irony, are missed.
- It is not possible to determine what exactly is negative: the sentiment or the concept. E.g. a neutral article describing a crime still receives a strongly negative score.
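
To make the pros and cons above concrete, here is a toy lexicon-plus-rules scorer. It is not VADER: the lexicon, the single negation rule, the squashing constant and the function name are all made up for illustration, while VADER itself uses a much larger human-rated lexicon and more rules [5]:

```python
import math
import re

# Tiny made-up lexicon; the valence values are illustrative only.
LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "crime": -2.0, "murder": -3.0}
NEGATIONS = {"not", "no", "never"}


def toy_rule_score(text: str) -> float:
    """Sum word valences, flip negated words, squash the total into (-1, 1)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, token in enumerate(tokens):
        valence = LEXICON.get(token, 0.0)
        # Rule: a negation directly before a sentiment word flips and dampens it.
        if valence and i > 0 and tokens[i - 1] in NEGATIONS:
            valence *= -0.5
        total += valence
    # The constant 15 is an arbitrary choice that controls how fast scores saturate.
    return total / math.sqrt(total * total + 15.0)


print(toy_rule_score("This is good"))                          # positive
print(toy_rule_score("This is not good"))                      # negative
print(toy_rule_score("The suspect was charged with murder."))  # negative, although neutral in tone
```

The last call shows the second drawback: a factual sentence about a negative concept gets a negative score, because the scorer only sees word valences and not who holds the opinion.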


### 2. Model-based - Linear regression (based on 1.6M tweets)

Sentiment analysis of complex natural texts faces a common problem: separating subjective opinions about a concept from negative concepts described neutrally.

To overcome this obstacle, a more complex model was considered. However, since we first aim to detect clear signs of subjective opinions in Wikipedia articles, a simple model, linear regression, was chosen. Its advantages and disadvantages can be summarized as follows:
#### Pros:
+ Theoretically, can be used to determine what is negative: the sentiment or the concept. However, this requires more complex models, like neural networks (in particular, LSTMs).
+ Can achieve better results because of a more complex model structure.

#### Cons:
- Harder to use properly
- Hard to find a proper dataset
- Overfitting to certain particularities of the data
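
The model-based idea can be sketched with a bag-of-words linear regression trained by plain gradient descent. Everything below is an assumption for illustration: the four training sentences are made up (they are not the 1.6M tweet dataset), and the feature extraction, learning rate and epoch count are arbitrary choices rather than the project's settings:

```python
from collections import Counter

# Made-up toy data standing in for a labeled tweet corpus.
TRAIN = [
    ("great movie love it", 1.0),
    ("awful movie hate it", -1.0),
    ("love this great day", 1.0),
    ("hate this awful day", -1.0),
]
VOCAB = sorted({word for text, _ in TRAIN for word in text.split()})


def featurize(text):
    """Bag-of-words counts over the training vocabulary; unknown words are ignored."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def fit(samples, lr=0.1, epochs=500):
    """Minimize mean squared error with full-batch gradient descent."""
    weights, bias = [0.0] * len(VOCAB), 0.0
    rows = [(featurize(text), target) for text, target in samples]
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(VOCAB), 0.0
        for features, target in rows:
            error = sum(w * x for w, x in zip(weights, features)) + bias - target
            grad_w = [g + error * x for g, x in zip(grad_w, features)]
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict(text, weights, bias):
    """Predicted sentiment score: the sign is polarity, the magnitude is strength."""
    return sum(w * x for w, x in zip(weights, featurize(text))) + bias


weights, bias = fit(TRAIN)
print(predict("love great", weights, bias))   # close to +1
print(predict("awful hate", weights, bias))   # close to -1
print(predict("this movie", weights, bias))   # close to 0
```

Words that appear equally often in positive and negative examples ("movie", "this") end up with weights near zero, while words never seen in training contribute nothing at all, which is one way the dataset and overfitting concerns listed above show up in practice.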

    
    

## Data
### Vocabulary-based approach - VADER

### Model-based approach - Twitter sentiment dataset


## Evaluation

## Results & Discussion

## Possible Extensions
* Highlight paragraphs/sentences that triggered scorer
* Summarize score from sources
* Use other sources (e.g. Google) to estimate sentiment
* Apply other NLP models to achieve better results (LSTM, extend vocabulary approach)
* Real-time plugin for advising readers whether content is neutral or biased



## References
[1] Balahur, Alexandra, et al. "Sentiment analysis in the news." arXiv preprint arXiv:1309.6202 (2013). [Source](https://arxiv.org/ftp/arxiv/papers/1309/1309.6202.pdf).

[2] Godbole, Namrata, Manja Srinivasaiah, and Steven Skiena. "Large-Scale Sentiment Analysis for News and Blogs." Icwsm 7.21 (2007): 219-222. [Source](http://www.uvm.edu/pdodds/files/papers/others/2007/godbole2007a.pdf).

[3] Nielsen, F. Å., M. Etter, and L. K. Hansen. "Real-time monitoring of sentiment in business related Wikipedia articles." Technical University of Denmark (2013). [Source](https://pdfs.semanticscholar.org/74e6/b642042d33980d70ce2ce7e5c4d1b54aa790.pdf).

[4] Zhou, Yiwei, Alexandra Cristea, and Zachary Roberts. "Is wikipedia really neutral? A sentiment perspective study of war-related wikipedia articles since 1945." Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation. 2015. [Source](https://www.aclweb.org/anthology/Y15-1019).

[5] Hutto, Clayton J., and Eric Gilbert. "Vader: A parsimonious rule-based model for sentiment analysis of social media text." Eighth international AAAI conference on weblogs and social media. 2014. [Source](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8109/8122).