Course Human-Centered Data Science ([HCDS](https://www.mi.fu-berlin.de/en/inf/groups/hcc/teaching/winter_term_2020_21/course_human_centered_data_science.html)) - Winter Term 2020/21 - [HCC](https://www.mi.fu-berlin.de/en/inf/groups/hcc/index.html) | [Freie Universität Berlin](https://www.fu-berlin.de/)

***

# A4 - Transparency
Please use the following structure as a starting point. Extend and change the notebook according to your needs. The structure should help guide you through your analysis. This notebook is the foundation for condensing your results and writing your reflection at the end, so please first read what we expect from the reflection and structure your analysis accordingly.

## [1] General understanding
> What is the model about and who is using it?

* What is your model about? 

The drafttopic model is designed to route newly created articles based on their apparent topical nature to interested reviewers.

* Why is this model useful? 

This is useful because one of the biggest difficulties in reviewing new articles is finding someone appropriate for the task. Because of the huge variety of topics on Wikipedia, reviewers have to be picked according to their skills and knowledge so that they can judge the notability, relevance, and accuracy of an article.

* Who is using this model? 

The drafttopic model can be used by anyone because it is open source. It is most likely used mainly by data scientists.

* Who are the stakeholders or users of ORES? 

Stakeholders are, for example, people who publish content on Wikipedia or who want to understand user activity on the website.

* Why is this model useful to Wikipedia? 

It is useful for Wikipedia and its users alike: routing new articles to suitable reviewers makes it more likely that the overall quality of information increases.

* What applications/projects/... within Wikipedia are using this model? 

The model is used only within enwiki.

## [2] API
> What does the ORES API (v3) tell you about a specific model? What functions does the API offer?

Use the API to investigate your model: https://ores.wikimedia.org/v3/#/. What do the following API calls do and what do they tell you about your model?

* `https://ores.wikimedia.org/v3/scores/`

Lists all wikis and their models, mainly with the version of each model.

* `https://ores.wikimedia.org/v3/scores/?model_info`

Gives details about the models, e.g. the topic labels used for categorization, model parameters and results on the training data, but also information about the running environment. Without a specified model, the information is returned for all models.

* `https://ores.wikimedia.org/v3/scores/enwiki`

Returns data for the English Wikipedia (enwiki), mainly the version of each model.

* `https://ores.wikimedia.org/v3/scores/enwiki?models=drafttopic&model_info`

Gives details about a specific model, in this case drafttopic.

* `https://ores.wikimedia.org/v3/scores/enwiki?models=drafttopic&revids=SOMEIDHERE`

Returns the drafttopic score for the given revision ID: the predicted topics and a probability for each topic label.

* `https://ores.wikimedia.org/v3/scores/enwiki/REVID/drafttopic?model_info`

Returns the model information for the specified model, addressed through a revision ID.

* `https://ores.wikimedia.org/v3/scores/enwiki/REVID/drafttopic?features=true`

Lists the feature values and the score of the specified model for the given revision ID.
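
The URL variants listed above differ only in their path segments and query parameters. As a minimal sketch, the helper below assembles them from those parts. `build_ores_url` is a hypothetical name introduced here for illustration and is not part of any ORES client library; it only builds strings and sends no request.

```python
from urllib.parse import urlencode

BASE_URL = "https://ores.wikimedia.org/v3/scores"  # base path used throughout this notebook


def build_ores_url(context=None, revid=None, model=None, **query):
    """Build an ORES v3 scores URL from optional path parts and query parameters.

    Hypothetical helper for illustration only. Path parts are expected in the
    order context / revid / model, as in the URLs listed above.
    """
    parts = [BASE_URL] + [str(p) for p in (context, revid, model) if p is not None]
    url = "/".join(parts)
    if context is None:
        url += "/"  # the list above writes the root endpoint with a trailing slash
    if query:
        url += "?" + urlencode(query)
    return url


print(build_ores_url())
print(build_ores_url("enwiki", models="drafttopic", revids=991397091))
print(build_ores_url("enwiki", 991397091, "drafttopic", features="true"))
```

The three calls reproduce the root listing, the per-revision score query, and the feature query from the list above. A value-less flag such as `model_info` can be passed as `model_info=""`, which yields `model_info=` in the URL.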

### Feature Injection
Please check out the _feature injection_ feature of ORES: https://www.mediawiki.org/wiki/ORES/Feature_injection

**Example:**

     # Here you can get the prediction for a revision as if the user had been anonymous:
     https://ores.wikimedia.org/v3/scores/enwiki/991397091/damaging?features&feature.revision.user.is_anon=true
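
A feature-injection URL like the one above can also be assembled programmatically. The following is a minimal sketch under stated assumptions: `injected_score_url` is a hypothetical helper name, it writes the flag as `features=true` (the spelling used in the API list of this notebook), and it only builds the URL without contacting ORES.

```python
from urllib.parse import urlencode


def injected_score_url(context, revid, model, injected):
    """Return a v3 scores URL that overrides feature values via query parameters.

    `injected` maps feature names (e.g. "feature.revision.user.is_anon") to
    values. Hypothetical helper: it assembles the URL and sends no request.
    """
    base = f"https://ores.wikimedia.org/v3/scores/{context}/{revid}/{model}"
    # Ask for the feature values alongside the score.
    query = {"features": "true"}
    # Booleans are lower-cased to match the `=true` spelling in the example.
    for name, value in injected.items():
        query[name] = str(value).lower() if isinstance(value, bool) else value
    return base + "?" + urlencode(query)


url = injected_score_url(
    "enwiki", 991397091, "damaging", {"feature.revision.user.is_anon": True}
)
print(url)
```

Requesting the same revision once with and once without the injected value, and comparing the two scores, is one way to see how much a single feature influences the prediction.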

In [15]:
import requests
import json
import urllib.parse

# Customize these with your own information
headers = {
    'User-Agent': 'https://github.com/kuzniarz',
    'From': 'sebastian.kuzniarz@fu-berlin.de'
}

def get_ores_data(path, params):
    """Query the ORES v3 scores API and return the decoded JSON response."""
    # Example of a full request URL:
    # https://ores.wikimedia.org/v3/scores/enwiki?models=drafttopic&revids=REVID
    base_url = 'https://ores.wikimedia.org/v3/scores'
    url = base_url + path

    # Append the query string only when parameters are given
    if params:
        url += '?' + urllib.parse.urlencode(params)

    print(url)

    # Send the identifying headers defined above with every request
    api_call = requests.get(url, headers=headers)

    # .json() already returns Python dicts and lists, so no further conversion is needed
    return api_call.json()

data = get_ores_data('/', {'project': 'enwiki', 'model': 'drafttopic'})

https://ores.wikimedia.org/v3/scores/?project=enwiki&model=drafttopic


Only enwiki uses drafttopic, so let's look at the data there.

In [14]:
data["enwiki"]

{'models': {'articlequality': {'version': '0.8.2'},
  'articletopic': {'version': '1.2.0'},
  'damaging': {'version': '0.5.1'},
  'draftquality': {'version': '0.2.1'},
  'drafttopic': {'version': '1.2.0'},
  'goodfaith': {'version': '0.5.1'},
  'wp10': {'version': '0.8.2'}}}
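
The statement that only enwiki offers drafttopic can be checked against the response itself. The sketch below filters a response of this shape by model name. `wikis_with_model` is a hypothetical helper, and `sample` is a small stand-in dict: its enwiki entry copies two models from the output above, while the arwiki entry is illustrative and not a complete listing.

```python
def wikis_with_model(scores_index, model):
    """Return the sorted wiki names whose model list contains `model`."""
    return sorted(
        wiki
        for wiki, entry in scores_index.items()
        if model in entry.get("models", {})
    )


# Stand-in for the `data` dict returned by the root endpoint (assumption:
# reduced to two wikis and a few models for illustration).
sample = {
    "enwiki": {"models": {"articletopic": {"version": "1.2.0"},
                          "drafttopic": {"version": "1.2.0"}}},
    "arwiki": {"models": {"articletopic": {}}},
}

print(wikis_with_model(sample, "drafttopic"))    # ['enwiki']
print(wikis_with_model(sample, "articletopic"))  # ['arwiki', 'enwiki']
```

With the real response, `wikis_with_model(data, "drafttopic")` would list every wiki that serves the model.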

In [34]:
data_mi = get_ores_data('/?model_info', {})
data_mi

https://ores.wikimedia.org/v3/scores/?model_info


{'arwiki': {'models': {'articletopic': {'environment': {'machine': 'x86_64',
     'platform': 'Linux-4.9.0-11-amd64-x86_64-with-debian-9.12',
     'processor': '',
     'python_branch': '',
     'python_build': ['default', 'Sep 27 2018 17:25:39'],
     'python_compiler': 'GCC 6.3.0 20170516',
     'python_implementation': 'CPython',
     'python_revision': '',
     'python_version': '3.5.3',
     'release': '4.9.0-11-amd64',
     'revscoring_version': '2.8.2',
     'system': 'Linux',
     'version': '#1 SMP Debian 4.9.189-3+deb9u1 (2019-09-20)'},
    'params': {'ccp_alpha': 0.0,
     'center': False,
     'criterion': 'friedman_mse',
     'init': None,
     'label_weights': {},
     'labels': ['Culture.Biography.Biography*',
      'Culture.Biography.Women',
      'Culture.Food and drink',
      'Culture.Internet culture',
      'Culture.Linguistics',
      'Culture.Literature',
      'Culture.Media.Books',
      'Culture.Media.Entertainment',
      'Culture.Media.Films',
      'Culture

In [26]:
data_enwiki = get_ores_data('/enwiki', {'models': 'drafttopic', 'model_info': ''})
data_enwiki

https://ores.wikimedia.org/v3/scores/enwiki?models=drafttopic&model_info=


{'enwiki': {'models': {'drafttopic': {'environment': {'machine': 'x86_64',
     'platform': 'Linux-4.9.0-11-amd64-x86_64-with-debian-9.12',
     'processor': '',
     'python_branch': '',
     'python_build': ['default', 'Sep 27 2018 17:25:39'],
     'python_compiler': 'GCC 6.3.0 20170516',
     'python_implementation': 'CPython',
     'python_revision': '',
     'python_version': '3.5.3',
     'release': '4.9.0-11-amd64',
     'revscoring_version': '2.8.2',
     'system': 'Linux',
     'version': '#1 SMP Debian 4.9.189-3+deb9u1 (2019-09-20)'},
    'params': {'ccp_alpha': 0.0,
     'center': False,
     'criterion': 'friedman_mse',
     'init': None,
     'label_weights': {},
     'labels': ['Culture.Biography.Biography*',
      'Culture.Biography.Women',
      'Culture.Food and drink',
      'Culture.Internet culture',
      'Culture.Linguistics',
      'Culture.Literature',
      'Culture.Media.Books',
      'Culture.Media.Entertainment',
      'Culture.Media.Films',
      'Culture.M

In [38]:
data_enwiki_model_info = get_ores_data('/enwiki', {'models': 'drafttopic', 'revids': 991397091})
data_enwiki_model_info

https://ores.wikimedia.org/v3/scores/enwiki?models=drafttopic&revids=991397091


{'enwiki': {'models': {'drafttopic': {'version': '1.2.0'}},
  'scores': {'991397091': {'drafttopic': {'score': {'prediction': ['Geography.Regions.Americas.North America',
       'Geography.Regions.Asia.West Asia',
       'Geography.Regions.Europe.Eastern Europe',
       'History and Society.Politics and government'],
      'probability': {'Culture.Biography.Biography*': 0.37227279249199124,
       'Culture.Biography.Women': 0.037097852105352165,
       'Culture.Food and drink': 0.0044497105086688206,
       'Culture.Internet culture': 0.028424382912199363,
       'Culture.Linguistics': 0.00044749092714304044,
       'Culture.Literature': 0.01533391989763697,
       'Culture.Media.Books': 0.004531056640817715,
       'Culture.Media.Entertainment': 0.010065666298636403,
       'Culture.Media.Films': 0.002341004096949225,
       'Culture.Media.Media*': 0.06082809468266984,
       'Culture.Media.Music': 0.0014570502485747377,
       'Culture.Media.Radio': 0.017343727580575955,
       'Cult
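
The `probability` dict is long, so it can help to rank the labels before reading them. This is a minimal sketch: `top_topics` is a hypothetical helper, and `sample_probability` contains only a few entries copied from the output above, not the full label set.

```python
def top_topics(probabilities, k=3):
    """Return the k (topic, probability) pairs with the highest probability."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# A few entries copied from the output above; the real dict has many more labels.
sample_probability = {
    "Culture.Biography.Biography*": 0.37227279249199124,
    "Culture.Biography.Women": 0.037097852105352165,
    "Culture.Food and drink": 0.0044497105086688206,
    "Culture.Internet culture": 0.028424382912199363,
    "Culture.Media.Media*": 0.06082809468266984,
}

for topic, p in top_topics(sample_probability):
    print(f"{topic}: {p:.3f}")
```

On the full response, the same call would be `top_topics(scores["probability"])`, where `scores` is assumed to be the `score` entry for the revision.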

In [39]:
data_enwiki_model_revision = get_ores_data('/enwiki/991397091/drafttopic', { 'model_info': '' })
data_enwiki_model_revision

https://ores.wikimedia.org/v3/scores/enwiki/991397091/drafttopic?model_info=


{'enwiki': {'models': {'drafttopic': {'environment': {'machine': 'x86_64',
     'platform': 'Linux-4.9.0-11-amd64-x86_64-with-debian-9.12',
     'processor': '',
     'python_branch': '',
     'python_build': ['default', 'Sep 27 2018 17:25:39'],
     'python_compiler': 'GCC 6.3.0 20170516',
     'python_implementation': 'CPython',
     'python_revision': '',
     'python_version': '3.5.3',
     'release': '4.9.0-11-amd64',
     'revscoring_version': '2.8.2',
     'system': 'Linux',
     'version': '#1 SMP Debian 4.9.189-3+deb9u1 (2019-09-20)'},
    'params': {'ccp_alpha': 0.0,
     'center': False,
     'criterion': 'friedman_mse',
     'init': None,
     'label_weights': {},
     'labels': ['Culture.Biography.Biography*',
      'Culture.Biography.Women',
      'Culture.Food and drink',
      'Culture.Internet culture',
      'Culture.Linguistics',
      'Culture.Literature',
      'Culture.Media.Books',
      'Culture.Media.Entertainment',
      'Culture.Media.Films',
      'Culture.M

In [41]:
data_drafttopic_features = get_ores_data('/enwiki/991397091/drafttopic', { 'features': 'true' })
data_drafttopic_features

https://ores.wikimedia.org/v3/scores/enwiki/991397091/drafttopic?features=true


{'enwiki': {'models': {'drafttopic': {'version': '1.2.0'}},
  'scores': {'991397091': {'drafttopic': {'features': {'feature.len(<datasource.wikitext.revision.tokens_matching(\\b(he|him|his)\\b)>)': 599.0,
      'feature.len(<datasource.wikitext.revision.tokens_matching(\\b(she|her|hers)\\b)>)': 12.0,
      'feature_vector.revision.text.en_vectors_mean': [0.15932169242613747,
       0.02844641831563397,
       -0.005585385418560723,
       0.014845608697324585,
       -0.008998420161020506,
       0.2205494042844125,
       0.021162965430486288,
       0.07294982320482798,
       0.023351661305086292,
       -0.10927588664442822,
       -0.10943054775761185,
       -0.004606454745338108,
       -0.0457362645893769,
       0.1771136298441107,
       -0.0384764843411688,
       0.15587926338224375,
       -0.20595068332996774,
       -0.10785879764923707,
       -0.14171333052289484,
       -0.09233033606382264,
       -0.05359327057752053,
       0.1924576984177859,
       0.066383310427
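
The feature output mixes single numbers (such as the pronoun counts) with a long embedding vector. As a sketch, the two kinds can be separated so the scalar features are readable on their own. `split_features` is a hypothetical helper, and `sample_features` copies the two scalar features and only the first three vector components from the output above.

```python
def split_features(features):
    """Split a feature dict into scalar features and vector-valued features."""
    scalars, vectors = {}, {}
    for name, value in features.items():
        if isinstance(value, list):
            vectors[name] = value
        else:
            scalars[name] = value
    return scalars, vectors


# Shortened copy of the output above (the vector is cut to three components).
sample_features = {
    "feature.len(<datasource.wikitext.revision.tokens_matching(\\b(he|him|his)\\b)>)": 599.0,
    "feature.len(<datasource.wikitext.revision.tokens_matching(\\b(she|her|hers)\\b)>)": 12.0,
    "feature_vector.revision.text.en_vectors_mean": [
        0.15932169242613747,
        0.02844641831563397,
        -0.005585385418560723,
    ],
}

scalars, vectors = split_features(sample_features)
for name, value in scalars.items():
    print(value, name)
for name, vec in vectors.items():
    print(len(vec), "components in", name)
```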

## [3] ML algorithm and training/test data
> Which machine learning model is underlying and what data is used to build the model?

* Check out `model_info` in detail.


* What does it tell you about the model performance?
* You can visualise and explain your results regarding model performance.
* What data was used to train and test the model?
* What machine learning algorithm is your model using? Please explain briefly.
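
One way to start on `model_info` is to outline its nested keys before reading individual values. The sketch below does this for a nested dict. `outline` is a hypothetical helper, and `sample_model_info` is a shortened stand-in that uses only keys and values visible in the truncated output earlier in this notebook; the real response contains more entries than are shown there.

```python
def outline(node, depth=0, max_depth=2):
    """Return indented lines naming the keys of a nested dict down to max_depth."""
    lines = []
    if isinstance(node, dict) and depth < max_depth:
        for key, value in node.items():
            lines.append("  " * depth + str(key))
            lines.extend(outline(value, depth + 1, max_depth))
    return lines


# Shortened stand-in for data_enwiki["enwiki"]["models"]["drafttopic"].
sample_model_info = {
    "environment": {"python_version": "3.5.3", "revscoring_version": "2.8.2"},
    "params": {"criterion": "friedman_mse", "labels": ["Culture.Biography.Biography*"]},
}

print("\n".join(outline(sample_model_info)))
```

Applied to the real response, the outline shows which top-level sections exist, which is a reasonable starting point for the performance and training-data questions above.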

## [4] Features
> Which features are used and which have the greatest influence on the prediction?

* Which features does your model use?
* What do they mean?
* Which are the most important features?
* `https://ores.wikimedia.org/v3/scores/enwiki/991379667/articlequality?features=true`
* Do all models (across all language editions of Wikipedia) use the same features?

## Sample code

In [2]:
import requests
import json

# Customize these with your own information
headers = {
    'User-Agent': 'https://github.com/YOUR-USER-NAME',
    'From': 'YOUR-EMAIL@fu-berlin.de'
}

def get_ores_data(rev_id, headers):
    
    # Define the endpoint: This is an example!
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

    params = {'project' : 'enwiki',
              'model'   : 'YOUR-MODEL-NAME',
              'revids'  : rev_id
              }

    # Pass the headers so that the request identifies its sender
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    # Serialize the decoded response to a JSON-formatted string
    data = json.dumps(response)

    return data

***

#### Credits

We release the notebooks under the [Creative Commons Attribution license (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).