Course Human-Centered Data Science ([HCDS](https://www.mi.fu-berlin.de/en/inf/groups/hcc/teaching/winter_term_2020_21/course_human_centered_data_science.html)) - Winter Term 2020/21 - [HCC](https://www.mi.fu-berlin.de/en/inf/groups/hcc/index.html) | [Freie Universität Berlin](https://www.fu-berlin.de/)

***

# A4 - Transparency
Please use the following structure as a starting point. Extend and change the notebook according to your needs. This structure should guide you through your analysis. This notebook is the foundation for condensing your results and writing your reflection at the end, so please first read what we expect from you regarding the reflection and structure your analysis accordingly.

## [1] General understanding
> What is the model about and who is using it?

* What is your model about?

`reverted`: The model behind this endpoint predicts whether an edit will eventually be reverted.


* Why is this model useful?
 * It is useful for quality control tools
 * Helps reviewers find potentially damaging contributions, which makes filtering through the Special:RecentChanges feed easier
 * Can be used for detection and removal of damaging contributions
 * There's also the need to identify good-faith contributors



* Who is using this model? 
  * **Users** The model is designed to help human editors perform critical wiki work and to increase their productivity by automating tasks such as detecting vandalism and removing edits made in bad faith.
  * **Developers** The model aims to provide data for developers of tools for Wikipedia (see below)
  * (**Scientists**)


* Who are the stakeholders or users of ORES? 
  * Volunteer tool developers and product developers at the **Wikimedia Foundation** and **Wikimedia Deutschland**
  * Authors/Editors of articles (because their edits get assessed)
  * Editors/Reviewers


* Why is this model useful to Wikipedia?

It aims to improve the quality of articles and to reduce the work of reviewers/other editors.


* What applications/projects/... within Wikipedia are using this model?

Third-party tools that use the model:
  * [Edit Review Improvements (ERI)](https://www.mediawiki.org/wiki/Edit_Review_Improvements/New_filters_for_edit_review)
  * [Huggle](https://en.wikipedia.org/wiki/Wikipedia:Huggle)


## [2] API
> What does the ORES API (v3) tell you about a specific model? What functions does the API offer?

Use the API to investigate your model: https://ores.wikimedia.org/v3/#/. What do the following API calls do and what do they tell you about your model?

* `https://ores.wikimedia.org/v3/scores/`
* `https://ores.wikimedia.org/v3/scores/?model_info`
* `https://ores.wikimedia.org/v3/scores/enwiki`
* `https://ores.wikimedia.org/v3/scores/enwiki?models=YOURMODELNAME&model_info`
* `https://ores.wikimedia.org/v3/scores/enwiki?models=YOURMODELNAME&revids=SOMEIDHERE`
* `https://ores.wikimedia.org/v3/scores/enwiki/REVID/YOURMODELNAME?model_info`
* `https://ores.wikimedia.org/v3/scores/enwiki/REVID/YOURMODELNAME?features=true`
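
The calls listed above all follow the same pattern: an optional path of project, revision ID and model, followed by query parameters. A minimal sketch of a URL builder is shown here; `build_ores_url` is a hypothetical helper written for this notebook, not part of ORES, and it only assembles the address without sending a request.

```python
from urllib.parse import urlencode

BASE = "https://ores.wikimedia.org/v3/scores/"


def build_ores_url(project=None, rev_id=None, model=None, **query):
    """Assemble an ORES v3 URL from optional path parts and query parameters.

    Hypothetical helper: path parts are expected in the order
    project, rev_id, model, as in the calls listed above.
    """
    path = "/".join(str(p) for p in (project, rev_id, model) if p is not None)
    url = BASE + path
    if query:
        # A value of None yields a bare flag such as `model_info`
        parts = [k if v is None else urlencode({k: v}) for k, v in query.items()]
        url += "?" + "&".join(parts)
    return url


# Print a few of the calls from the list (no request is sent)
print(build_ores_url())
print(build_ores_url(model_info=None))
print(build_ores_url("enwiki", models="reverted", model_info=None))
print(build_ores_url("enwiki", 991397091, "damaging", features="true"))
```

Each resulting URL can then be passed to `requests.get` together with the identifying headers.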

### Feature Injection
Please check out the _feature injection_ feature of ORES: https://www.mediawiki.org/wiki/ORES/Feature_injection

**Example:**

     # Here you can get the prediction for a revision as if the user had been anonymous:
     https://ores.wikimedia.org/v3/scores/enwiki/991397091/damaging?features&feature.revision.user.is_anon=true
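
To try several injected values, the URL from the example can be generated instead of typed by hand. The following is a minimal sketch under the assumption that every injected feature is passed as a `feature.<name>=<value>` query parameter, as in the example above; `feature_injection_url` is a hypothetical helper name, and the function only builds the address.

```python
from urllib.parse import urlencode


def feature_injection_url(project, rev_id, model, injected):
    """Build a scoring URL that overrides feature values for one revision.

    `injected` maps feature names (without the `feature.` prefix) to values.
    """
    query = {"features": "true"}
    for name, value in injected.items():
        # Booleans are written in lowercase, as in the example URL
        if isinstance(value, bool):
            value = str(value).lower()
        query["feature." + name] = value
    return "https://ores.wikimedia.org/v3/scores/{}/{}/{}?{}".format(
        project, rev_id, model, urlencode(query)
    )


url = feature_injection_url(
    "enwiki", 991397091, "damaging", {"revision.user.is_anon": True}
)
print(url)
```

Comparing the score returned for this URL with the score of the unmodified revision shows how much the injected feature shifts the prediction.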

## [3] ML algorithm and training/test data
> Which machine learning model is underlying and what data is used to build the model?

* Check out `model_info` in detail.


* What does it tell you about the model performance?

`model_info` provides several measures of the model's performance.
Under the attribute `statistics` you can find the values of the confusion matrix, several related measures (recall, precision, false-positive rate), as well as the AUC (area under the curve) of the ROC (receiver operating characteristic) curve.
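
How the related measures follow from the confusion matrix can be sketched as below. The counts are hypothetical and not taken from any ORES model, and the exact layout of the `statistics` attribute is not assumed here; the function only takes the four raw counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive recall, precision and false-positive rate from confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),     # share of reverted edits that were flagged
        "precision": tp / (tp + fp),  # share of flagged edits that were reverted
        "fpr": fp / (fp + tn),        # share of kept edits that were flagged
    }


# Hypothetical counts for illustration only
m = classification_metrics(tp=80, fp=20, tn=880, fn=20)
print(m)
```

With strongly imbalanced classes, as is typical for reverted edits, a low false-positive rate can still coincide with modest precision, so the measures should be read together.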


* You can visualise and explain your results regarding model performance.


* What data was used to train and test the model?

The history of edits (and reverted edits) from a wiki.


* What machine learning algorithm is your model using? Please explain briefly. \
The algorithm differs between wikis. Most wikis use GradientBoosting (bnwiki, elwiki, glwiki, hrwiki, idwiki, iswiki, tawiki, viwiki). Gradient boosting builds an ensemble learner by iteratively adding weak learners, each fitted to the errors of the current ensemble. Only enwiktionary uses RandomForest, and testwiki uses RevIDScorer.
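
The idea of iteratively adding weak learners can be illustrated with a toy example. This is a minimal sketch and not the implementation used by ORES: it assumes a single numeric feature with at least two distinct values, squared loss, and one-split decision stumps as weak learners. All names and the data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimises squared error."""
    best = None
    # The largest value is skipped so that the right side is never empty
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boost regression stumps under squared loss and return a prediction function."""
    base = sum(ys) / len(ys)  # the initial model predicts the mean
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(n_rounds):
        # For squared loss the negative gradient is the residual
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


# Toy data: label 1 for large feature values, 0 otherwise
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
predict = gradient_boost(xs, ys)
print(predict(2), predict(7))
```

Each round corrects a fraction of the remaining error, which is why the learning rate and the number of rounds are usually tuned together.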


## [4] Features
> Which features are used and which have the greatest influence on the prediction?

* What features is your model using?
* What do they mean?
* Which are the most important features?
* `https://ores.wikimedia.org/v3/scores/enwiki/991379667/articlequality?features=true`
* Do all models (across all language editions of Wikipedia) use the same features?
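
To answer the last question, the feature names returned for two wikis can be compared as sets. The sketch below assumes that a `features=true` response nests the features under project, `scores`, revision ID and model name; this layout should be checked against a real response. The sample dictionaries, feature names other than `revision.user.is_anon`, and both helper functions are hypothetical.

```python
def extract_features(response, project, rev_id, model):
    """Return the feature dict of one revision (assumed v3 response layout)."""
    return response[project]["scores"][str(rev_id)][model]["features"]


def compare_feature_names(features_a, features_b):
    """Split feature names into shared ones and those unique to each side."""
    a, b = set(features_a), set(features_b)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Hypothetical, heavily shortened responses for illustration only
sample_en = {"enwiki": {"scores": {"991379667": {"articlequality": {
    "features": {"feature.revision.user.is_anon": False,
                 "feature.example.chars": 1200}}}}}}
sample_other = {"glwiki": {"scores": {"1": {"articlequality": {
    "features": {"feature.revision.user.is_anon": True,
                 "feature.example.refs": 3}}}}}}

en = extract_features(sample_en, "enwiki", 991379667, "articlequality")
other = extract_features(sample_other, "glwiki", 1, "articlequality")
diff = compare_feature_names(en, other)
print(diff)
```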

## Sample code

In [70]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [4]:
import requests
import json

# Customize these with your own information
headers = {
    'User-Agent': 'https://github.com/chrisk280',
    'From': 'chrisk31@zedat.fu-berlin.de'
}

def get_ores_data(rev_id, headers):
    """Fetch the ORES score of one revision and return it as a JSON string."""
    # Define the endpoint: This is an example!
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

    params = {'project' : 'enwiki',
              'model'   : 'reverted',
              'revids'  : rev_id
              }

    # Pass the headers so that the request identifies its sender
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    data = json.dumps(response)

    return data

In [38]:
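def get_model_info(headers, model, project="enwiki"):
    # Per-model variant: https://ores.wikimedia.org/v3/scores/enwiki?models=YOURMODELNAME&model_info
    # The endpoint below returns the model info for all wikis and all models,
    # so `model` and `project` are currently not used to narrow the request.
    endpoint = 'https://ores.wikimedia.org/v3/scores/?model_info'

    # Pass the headers so that the request identifies its sender
    api_call = requests.get(endpoint, headers=headers)
    response = api_call.json()
    data = json.dumps(response)

    return data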

In [47]:
data_str = get_model_info(headers, "reverted")

In [48]:
"reverted" in data_str

True

In [84]:
data = json.loads(get_model_info(headers, "reverted"))

In [62]:
from pprint import pprint

In [83]:
for k in data.keys():
    if "reverted" in data[k]["models"].keys():
        print("Wiki", k, "| Models", list(data[k]["models"].keys()))
        #pprint(data[k]["models"]["reverted"])
        print("Reverted_Model_TYPE", data[k]["models"]["reverted"]["type"], "\n")

Wiki bnwiki | Models ['reverted']
Reverted_Model_TYPE GradientBoosting 

Wiki elwiki | Models ['reverted']
Reverted_Model_TYPE GradientBoosting 

Wiki enwiktionary | Models ['reverted']
Reverted_Model_TYPE RandomForest 

Wiki glwiki | Models ['articlequality', 'reverted']
Reverted_Model_TYPE GradientBoosting 

Wiki hrwiki | Models ['reverted']
Reverted_Model_TYPE GradientBoosting 

Wiki idwiki | Models ['reverted']
Reverted_Model_TYPE GradientBoosting 

Wiki iswiki | Models ['reverted']
Reverted_Model_TYPE GradientBoosting 

Wiki tawiki | Models ['reverted']
Reverted_Model_TYPE GradientBoosting 

Wiki testwiki | Models ['articlequality', 'articletopic', 'damaging', 'draftquality', 'drafttopic', 'goodfaith', 'reverted', 'wp10']
Reverted_Model_TYPE RevIDScorer 

Wiki viwiki | Models ['articletopic', 'reverted']
Reverted_Model_TYPE GradientBoosting 



***

#### Credits

We release the notebooks under the [Creative Commons Attribution license (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).