# Diversity of Wikipedia article references

## Mining massive databases course final project

### Motivation and goals

The diversity of sources and content matters. When people acquire new knowledge, they do not want to be misled by fake news or to trust information that is not backed by an authoritative source. Some Wikipedia articles contain poorly sourced or unreferenced information, so in an era of exponential data growth and post-truth there is a strong need for automatic detection of articles that lack sources.


Our solution may help readers be more confident in, or more sceptical of, the information they find, and may help editors concentrate on the most important gaps in an article.

Moreover, the same article often differs across language editions; our solution can detect such differences so that they can be fixed.

### Problem statement

Estimate the quality of an article from its references in an unsupervised way. Combine the results of our modeling with the results of the ORES model and check them against each other.

![problem_statement](assets/clusters_w.png)

### Work pipeline

*First we'll import prepared modules:*

In [4]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
import os
import sys
sys.path.insert(0, "1_data_collection")
sys.path.insert(0, "2_feature_engineering")
sys.path.insert(0, "3_modeling")
from xml_to_csv import process_dumps
from csv_to_features import create_features
from features_to_clusters import get_clusters
from test import test_article



The working pipeline looks like this:

![pipeline](assets/pipeline_w.png)

### Data processing

* Download Wikipedia XML dumps:
    * We use the pages-articles multistream dumps. To keep development loops fast, we have worked with a single dump so far; next we will run the full pipeline on the whole Wikipedia data. <br> <br>

* Parse XML to CSV using a streaming XML parser:
    * We use lxml and a handwritten parser that walks through the file tag by tag and extracts each article together with the metadata of the article and of its last revision. The fetched data includes the article text, title, revision author, revision comment and timestamp.

* Fetch ORES assessments:
    * **ORES (Objective Revision Evaluation Service)** provides a score that represents article quality. The score itself consists of the probabilities that the article is:
        * **FA** (Featured Article)
        * **A** (A-class, well organized and essentially complete.)
        * **GA** (Good Article)
        * **B** (B-class, mostly complete and without major problems, but requires some further work)
        * **C** (C-class, substantial, but is still missing important content or contains much irrelevant material)
        * **Start** (Developing, quite incomplete; might or might not cite adequate reliable sources)
        * **Stub** (A very basic description of the topic / very-bad-quality article)
    * Here we use mwapi and the ORES web service to get article scores.

* Inspect internal structure of text:
    * Wikipedia articles have their own syntax for declaring blocks inside an article: [source](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Layout)
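
The streaming parse described above can be sketched roughly as follows. This is a minimal illustration, not the project's `xml_to_csv` module: it uses the standard library's `xml.etree.ElementTree.iterparse` instead of lxml so that it runs without extra dependencies, and the `iter_pages` helper and the tiny inline sample are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a MediaWiki export file (real dumps also declare an XML namespace).
SAMPLE_DUMP = b"""<mediawiki>
  <page>
    <title>Example</title>
    <revision>
      <timestamp>2019-07-01T00:00:00Z</timestamp>
      <contributor><username>Alice</username></contributor>
      <comment>initial version</comment>
      <text>Some '''wikitext''' here.</text>
    </revision>
  </page>
</mediawiki>"""


def local(tag):
    """Strip an optional '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield one dict per <page>, clearing each element after it is consumed."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        row = {"title": None, "timestamp": None, "author": None,
               "comment": None, "text": None}
        for node in elem.iter():
            name = local(node.tag)
            if name == "username":
                row["author"] = node.text
            elif name in row:
                row[name] = node.text
        yield row
        elem.clear()  # release the subtree so memory stays flat on large dumps


rows = list(iter_pages(io.BytesIO(SAMPLE_DUMP)))
print(rows[0]["title"], rows[0]["author"])
```

Because the parser only reacts to the closing `</page>` tag and clears the element afterwards, the whole dump never has to fit in memory at once.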

*See the data processing result for sample data below:*

In [5]:
DATA_DIR = "sample_data"
DATE = "20190701"
XML_DIR = os.path.join(DATA_DIR, "xml")
CSV_DIR = os.path.join(DATA_DIR, "csv")

DUMP_BASE_URL = "https://dumps.wikimedia.org/enwiki/{}".format(DATE)

dump_names = ["enwiki-20190701-pages-articles-multistream14.xml-p7697599p7744799"]
dump_ext = ".bz2"

!rm $DATA_DIR/xml/* 2> /dev/null
for dump_name in dump_names:
    print("Loading {}...".format(dump_name))
    !wget -P $DATA_DIR/xml/ $DUMP_BASE_URL/$dump_name$dump_ext 2> /dev/null
    !bzip2 -d $DATA_DIR/xml/$dump_name$dump_ext 2> /dev/null
    
print('Parsing XML + Fetching ORES...')
process_dumps(XML_DIR, CSV_DIR, jupyter=True)
!rm $DATA_DIR/xml/* 2> /dev/null
print('Collected wiki dump(s) with ORES in {}/csv'.format(DATA_DIR))

Loading enwiki-20190701-pages-articles-multistream14.xml-p7697599p7744799...
Parsing XML + Fetching ORES...
XML Files found: 
Collected wiki dump(s) with ORES in sample_data/csv


### Feature Engineering

We built features that reflect the diversity of text, sources and links:

* Internal and external reference counts
* Average number of references per block of text (number of references / number of paragraphs)
* Citation counts by type (journals, books, web, news)
* Number of images, files, etc. in the article
* Number of unsupported statements (marked “citation needed”)
* Heading counts (per heading level)
* ORES features
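
As a rough illustration of how such counts can be derived from raw wikitext, here is a small regex-based sketch. It is not the project's `create_features` implementation: the `extract_features` helper, the regular expressions, and the definition of a paragraph as a blank-line-separated block are all assumptions made for this example; the output keys simply mirror the column names in the schema below.

```python
import re

SAMPLE_WIKITEXT = """'''Foo''' is a thing.<ref>{{cite journal |title=A}}</ref>

== History ==
Foo was found in [[Bar]].{{citation needed|date=July 2019}}

=== Early years ===
See [https://example.org the site] and {{cite book |title=B}}.
[[File:Foo.png|thumb]]
"""


def extract_features(wikitext):
    """Count reference-related markers in wikitext using rough regex heuristics."""
    flags = re.IGNORECASE
    # Assumption: a paragraph is any non-empty block separated by blank lines.
    paragraphs = [b for b in re.split(r"\n\s*\n", wikitext) if b.strip()]
    n_paragraphs = len(paragraphs)
    features = {
        "n_paragraphs": n_paragraphs,
        # [[Target]] links, excluding [[File:...]] and [[Image:...]]
        "n_internal_links": len(re.findall(r"\[\[(?!(?:File|Image):)", wikitext, flags)),
        # [https://...] links with a single opening bracket
        "n_external_links": len(re.findall(r"(?<!\[)\[https?://", wikitext, flags)),
        "n_images": len(re.findall(r"\[\[(?:File|Image):", wikitext, flags)),
        "n_unreferenced": len(re.findall(r"\{\{\s*citation needed", wikitext, flags)),
    }
    # Citation templates by type, e.g. {{cite journal ...}}
    for kind in ("book", "journal", "web", "news"):
        pattern = r"\{\{\s*cite\s+" + kind + r"\b"
        features[kind + "_citations"] = len(re.findall(pattern, wikitext, flags))
    # Headings: "== Title ==" is level 2, "=== Title ===" is level 3, and so on.
    for level in range(2, 7):
        features["level{}".format(level)] = 0
    for match in re.finditer(r"^(={2,6})[^=\n].*?\1[ \t]*$", wikitext, re.MULTILINE):
        features["level{}".format(len(match.group(1)))] += 1
    per_block = max(n_paragraphs, 1)  # avoid division by zero on empty text
    features["average_internal_links"] = features["n_internal_links"] / per_block
    features["average_external_links"] = features["n_external_links"] / per_block
    return features


features = extract_features(SAMPLE_WIKITEXT)
print(features)
```

Regexes like these miss nested templates and unusual markup, so a production pipeline would more likely rely on a dedicated wikitext parser.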

In [6]:
df_features = create_features(CSV_DIR, DATE, save=False)
df_features.printSchema()

root
 |-- title: string (nullable = true)
 |-- Stub: double (nullable = true)
 |-- Start: double (nullable = true)
 |-- C: double (nullable = true)
 |-- B: double (nullable = true)
 |-- GA: double (nullable = true)
 |-- FA: double (nullable = true)
 |-- n_words: double (nullable = false)
 |-- n_internal_links: double (nullable = false)
 |-- n_external_links: double (nullable = false)
 |-- level2: double (nullable = false)
 |-- level3: double (nullable = false)
 |-- level4: double (nullable = false)
 |-- level5: double (nullable = false)
 |-- level6: double (nullable = false)
 |-- book_citations: double (nullable = true)
 |-- journal_citations: double (nullable = true)
 |-- web_citations: double (nullable = true)
 |-- news_citations: double (nullable = true)
 |-- average_external_links: double (nullable = true)
 |-- average_internal_links: double (nullable = true)
 |-- n_paragraphs: double (nullable = false)
 |-- n_unreferenced: double (nullable = false)
 |-- n_images: double (nullable 

### Modeling

We are dealing with a problem closely related to the one ORES solves, but not exactly the same. Also, ORES scores are not provided for every article, and certainly not for every language. We therefore decided to use unsupervised learning to group the data. We use ORES scores as labels where they are available; for data without labels, we find the closest cluster and assign the most representative label of that cluster.
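
The label-assignment step can be sketched as follows. This is only an illustration under stated assumptions: "most representative label" is interpreted here as the most frequent ORES class among the labeled articles of a cluster, distance is squared Euclidean distance to the cluster centroid, and the helper names and toy data are hypothetical.

```python
from collections import Counter


def cluster_labels(assignments, ores_labels):
    """Map each cluster id to the most common ORES class among its labeled articles."""
    votes = {}
    for article, cluster in assignments.items():
        label = ores_labels.get(article)
        if label is not None:
            votes.setdefault(cluster, Counter())[label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


def closest_cluster(point, centroids):
    """Return the id of the centroid with the smallest squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
    )


# Toy data: two clusters, four articles with ORES labels, one article ("E") without.
centroids = {0: (0.0, 0.0), 1: (10.0, 10.0)}
assignments = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}
ores_labels = {"A": "Stub", "B": "Stub", "C": "Start", "D": "GA"}

labels = cluster_labels(assignments, ores_labels)
# An unlabeled article with feature vector (9, 8) inherits the label of its nearest cluster.
predicted = labels[closest_cluster((9.0, 8.0), centroids)]
print(labels, predicted)
```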

We chose **bisecting K-means** as the clustering algorithm for our data. The algorithm starts from a single cluster. It then iteratively finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf cluster is divisible.
Among the key properties of the algorithm:

\+ hierarchical top-down approach

\+ parallelism

\+ high speed and good clustering quality (in terms of entropy, F-measure and overall similarity) [1]

\- needs a hyperparameter k, the fixed number of clusters, as input; k can be chosen by picking the value that maximizes the evaluation metric.
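
To make the top-down procedure concrete, here is a minimal pure-Python sketch. It is a simplified variant and not the implementation used in the project: instead of bisecting every divisible cluster on the bottom level, it repeatedly splits the single leaf with the largest within-cluster sum of squared errors, and it uses a deterministic initialisation for the two centers. All function names are hypothetical.

```python
def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(points):
    """Within-cluster sum of squared distances to the cluster mean."""
    center = mean(points)
    return sum(sq_dist(p, center) for p in points)


def two_means(points, iterations=20):
    """Split one cluster in two with plain k-means (k = 2)."""
    # Deterministic init (an assumption of this sketch): the point farthest
    # from the mean, and the point farthest from that one.
    center = mean(points)
    c0 = max(points, key=lambda p: sq_dist(p, center))
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    left, right = [], []
    for _ in range(iterations):
        new_left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        new_right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not new_left or not new_right:
            break  # keep the last valid split
        left, right = new_left, new_right
        new_c0, new_c1 = mean(left), mean(right)
        if (new_c0, new_c1) == (c0, c1):
            break  # converged
        c0, c1 = new_c0, new_c1
    return left, right


def bisecting_kmeans(points, k):
    """Return up to k leaf clusters (lists of points) built by repeated bisection."""
    leaves = [list(points)]
    while len(leaves) < k:
        # A leaf is divisible if it contains at least two distinct points.
        divisible = [c for c in leaves if len(set(c)) > 1]
        if not divisible:
            break
        target = max(divisible, key=sse)
        left, right = two_means(target)
        if not left or not right:
            break
        leaves.remove(target)
        leaves.extend([left, right])
    return leaves


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (21, 0)]
clusters = bisecting_kmeans(points, k=3)
print(sorted(sorted(c) for c in clusters))
```

Points are given as tuples so that duplicates can be detected with `set`; on this toy data the three visually separate groups end up in three leaves.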

In [None]:
# add code here

### Evaluation

We chose the **Silhouette coefficient** to measure how well the data have been clustered.

For each point $i$, the Silhouette coefficient $s(i)$ is calculated from:
- the mean intra-cluster distance $a(i)$:

\begin{equation}
a(i)=\frac{1}{\left|C_{i}\right|-1} \sum_{j \in C_{i}, i \neq j} d(i, j)
\end{equation}

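- the mean nearest-cluster distance $b(i)$, where the minimum runs over all clusters $C_{k}$ other than the cluster $C_{i}$ that contains point $i$:

\begin{equation}
b(i)=\min _{C_{k} \neq C_{i}} \frac{1}{\left|C_{k}\right|} \sum_{j \in C_{k}} d(i, j)
\end{equation}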

- $a(i)$ and $b(i)$ are combined in the following way:

\begin{equation}
s(i)=\left\{\begin{array}{ll}{1-a(i) / b(i),} & {\text { if } a(i)<b(i)} \\ {0,} & {\text { if } a(i)=b(i)} \\ {b(i) / a(i)-1,} & {\text { if } a(i)>b(i)}\end{array}\right.
\end{equation}


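So the score $s(i)$ lies between $-1$ and $1$: values close to $1$ mean that the point fits its own cluster well, while values close to $-1$ suggest that it would fit a neighboring cluster better.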
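
The definitions of $a(i)$, $b(i)$ and $s(i)$ above translate into a few lines of Python. The sketch below is illustrative only: it assumes Euclidean distance for $d(i, j)$, at least two clusters, and the common convention that a point in a singleton cluster gets $s(i) = 0$; the function names are hypothetical. It uses the identity $s(i) = (b(i) - a(i)) / \max(a(i), b(i))$, which is equivalent to the piecewise form.

```python
import math


def euclid(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette(points, labels):
    """Return s(i) for every point; requires at least two distinct labels."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        a = sum(euclid(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(
            sum(euclid(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = silhouette(points, [0, 0, 1, 1])  # nearby points share a cluster
bad = silhouette(points, [0, 1, 0, 1])   # each point is grouped with a distant one
print(good, bad)
```

Averaging $s(i)$ over all points gives a single score per clustering, which can be compared across different values of k.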

In [None]:
# add code here

### Testing our framework

In [9]:
test_article("Principal component analysis")

- the references distribution:
  >   76% scientific papers (journals, publications, etc)
  >   12% books
  >   12% internet resources (news, archive, etc)
  >    0% media materials (prints)
- this article has a good amount of content and references


### Future work

* Create a supervised machine learning model with ORES scores as labels and our features as inputs. The resulting model should then be transferred to other languages that are not supported by ORES (like Ukrainian) <br><br>

* Test the PySpark MLP / Random Forest / Gradient Boosting models and extract feature importances (to visualize the strengths and weaknesses of an article's references) <br><br>

* Fix clustering with ORES features <br>

### References

[1] Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. In: KDD Workshop on Text Mining (2000)