## [4] Features
> Which features are used and which have the greatest influence on the prediction?

* (1) What features is your model using?
* (2) What do they mean?
* (3) Which are the most important features?
* (4) Do all models (in all language editions of Wikipedia) use the same features?

### 4.1 Features - _What features is your model using?_

In 2.7, a call to ORES was made that shows the scores and also the features and their values for a specific revid, like so:

ores.wikimedia.org/v3/scores/`context`/`revid`/`model`?features=true

We didn't find a way to get a **definitive** list of the features _without_ referring to a revid. That's unfortunate, because different revids could in principle yield different features. That would not make sense, but so far we don't know. So let's use the request from above to find the features:

In [6]:
import requests, json

def get_features(revid="485104318", project="enwiki", model="articlequality"):
    """Return the feature names ORES reports for one revision, or [] on error."""
    revid = str(revid)  # the response is keyed by the revid as a string
    link = "https://ores.wikimedia.org/v3/scores/{0}/{1}/{2}?features=true".format(project, revid, model)
    rsp = requests.get(link).json()
    if "error" in rsp:  # the request as a whole failed
        print("\nERROR!!! - ", rsp, "\n")
        return []
    rsp = rsp[project]["scores"][revid][model]

    # An error entry for this revid carries no features, so report an empty list.
    if "error" in rsp:
        return []

    return list(rsp["features"])

features_reference_aq_en = get_features()

print("   #####   These are the features within model 'articlequality' of the project 'enwiki':   #####")
print("   #####   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   #####")
for i in features_reference_aq_en:
    print(i)

   #####   These are the features within model 'articlequality' of the project 'enwiki':   #####
   #####   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   #####
feature.english.stemmed.revision.stems_length
feature.enwiki.infobox_images
feature.enwiki.main_article_templates
feature.enwiki.revision.category_links
feature.enwiki.revision.cite_templates
feature.enwiki.revision.cn_templates
feature.enwiki.revision.image_links
feature.enwiki.revision.image_template
feature.enwiki.revision.images_in_tags
feature.enwiki.revision.images_in_templates
feature.enwiki.revision.infobox_templates
feature.enwiki.revision.paragraphs_without_refs_total_length
feature.enwiki.revision.shortened_footnote_templates
feature.enwiki.revision.who_templates
feature.len(<datasource.english.idioms.revision.matches>)
feature.len(<datasource.english.words_to_watch.revision.matches>)
feature.len(<datasource.wikitext.revision.words>)
feature.wikitext.revision.chars
feature.wikitext.re

As said above, we are not sure whether this result is the same for **_all_** articles. We can't try ALL articles, so instead we grab a handful of random revids and check whether they yield the same features:

In [7]:
def check_if_features_are_the_same(startRange, endRange, count, project="enwiki", model="articlequality"):
    import random
    # Draw "count" distinct random revids from startRange to endRange (inclusive)
    random_revids = random.sample(range(startRange, endRange+1), count)

    reference = []  # features of the first valid revid that gets checked
    suspicious_revid = []  # revids whose features don't match the reference
    invalid_revid = 0  # number of revids that don't exist

    print("Checking revids: ", random_revids)
    print("_"* len(random_revids))
    for revid in random_revids:
        rsp = get_features(revid=revid, project=project, model=model)
        if rsp == []:
            invalid_revid += 1
            print("-", end="")
        elif reference == []:  # First valid response: take its features as the reference and print a ^
            reference = rsp
            print("^", end="")
        elif rsp != reference:  # Features don't match the reference: print an X
            suspicious_revid.append(revid)
            print("X", end="")
        else:
            print("^", end="")  # Features match the reference: print a ^

    if invalid_revid == count:  # print some text as result
        print("\n### !! ALL REVIDS WERE INVALID !! ###")
    elif suspicious_revid == []:
        valid_revids_count = count - invalid_revid
        print("\n" + str(valid_revids_count) + " of " + str(count) + " revids were valid and those yield the same features. That's good!" )
        print("-"*80)
        return reference
    else:
        print("\n### !! SOME revids HAVE UNEXPECTED FEATURES: ", suspicious_revid, "!! ###")
    
    return []

res = check_if_features_are_the_same(1000,10000000,15)

Checking revids:  [2130372, 8098803, 6367719, 953033, 7205212]
_____
^^^^^
5 of 5 valid revids yield the same features. That's good!
--------------------------------------------------------------------------------


So, good enough. In this sample, the features are always the same for model articlequality and project enwiki.

### 4.2 Features - _What do they mean?_

- feature.english.stemmed.revision.stems_length

- feature.enwiki.infobox_images

- feature.enwiki.main_article_templates

- feature.enwiki.revision.category_links

- feature.enwiki.revision.cite_templates

- feature.enwiki.revision.cn_templates

- feature.enwiki.revision.image_links

- feature.enwiki.revision.image_template

- feature.enwiki.revision.images_in_tags

- feature.enwiki.revision.images_in_templates

- feature.enwiki.revision.infobox_templates

- feature.enwiki.revision.paragraphs_without_refs_total_length

- feature.enwiki.revision.shortened_footnote_templates

- feature.enwiki.revision.who_templates

- feature.len(<datasource.english.idioms.revision.matches>)

- feature.len(<datasource.english.words_to_watch.revision.matches>)

- feature.len(<datasource.wikitext.revision.words>)

- feature.wikitext.revision.chars

- feature.wikitext.revision.content_chars

- feature.wikitext.revision.external_links

- feature.wikitext.revision.headings_by_level(2)

- feature.wikitext.revision.headings_by_level(3)

- feature.wikitext.revision.ref_tags

- feature.wikitext.revision.templates

- feature.wikitext.revision.wikilinks


### 4.3 Features - _Which are the most important features?_

We may have missed the place where this is explained, but searching for information did not turn up an answer. So one idea is to use feature injection: by manually changing the values of some features, we can see how much the result changes and try to find out whether some features change it more than others. This is how an injection is applied:

```https://ores.wikimedia.org/v3/scores/enwiki/485104318/articlequality?feature.enwiki.revision.images_in_templates=9876543```
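The injection idea can be sketched roughly as follows. This is a minimal sketch, not a tested analysis: the helper names `injection_url`, `fetch_probabilities` and `probability_shift` are made up for illustration, the `User-Agent` string is a placeholder, and the response layout (`score` -> `probability`) is an assumption about the ORES v3 answer. It uses only the standard library so that it does not depend on the notebook state.

```python
import json
import urllib.parse
import urllib.request

ORES_SCORES = "https://ores.wikimedia.org/v3/scores"


def injection_url(project, revid, model, injected=None):
    """Build a scoring URL; 'injected' maps feature names to override values."""
    url = "{0}/{1}/{2}/{3}".format(ORES_SCORES, project, revid, model)
    if injected:
        url += "?" + urllib.parse.urlencode(injected)
    return url


def fetch_probabilities(project, revid, model, injected=None):
    """Request a score and return its class probabilities as a dict."""
    request = urllib.request.Request(
        injection_url(project, revid, model, injected),
        headers={"User-Agent": "feature-injection-sketch"},  # placeholder value
    )
    with urllib.request.urlopen(request) as rsp:
        data = json.load(rsp)
    # Assumed layout: project -> "scores" -> revid -> model -> "score" -> "probability"
    return data[project]["scores"][str(revid)][model]["score"]["probability"]


def probability_shift(baseline, changed):
    """Largest absolute change of any class probability between two scores."""
    return max(abs(changed[label] - baseline[label]) for label in baseline)
```

One possible use is to call `fetch_probabilities` once without `injected` as a baseline, then once per feature with a changed value, and rank the features by `probability_shift`. A large shift for one feature only says that the model reacts strongly to that particular change for that particular revid, so the ranking is a hint rather than a definitive measure of importance.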


In [48]:

# This function is close to "get_features", but it does not need a revid. Since we believe the features stay the
# same for every article (within the same model and project), we can just take the features that the first hit yields.
def get_features_list(project="enwiki", model="articlequality"):
    import random

    rsp = []  # features of the first valid revid that gets checked
    random_revids = []  # pool of random revids still to try; refilled when empty

    while rsp == []:
        if random_revids == []:
            random_revids = random.sample(range(1, 99999999), 100)

        rsp = get_features(random_revids.pop(), project=project, model=model)
    return rsp

    

get_features_list(project="enwiki", model="articlequality")




['feature.english.stemmed.revision.stems_length',
 'feature.enwiki.infobox_images',
 'feature.enwiki.main_article_templates',
 'feature.enwiki.revision.category_links',
 'feature.enwiki.revision.cite_templates',
 'feature.enwiki.revision.cn_templates',
 'feature.enwiki.revision.image_links',
 'feature.enwiki.revision.image_template',
 'feature.enwiki.revision.images_in_tags',
 'feature.enwiki.revision.images_in_templates',
 'feature.enwiki.revision.infobox_templates',
 'feature.enwiki.revision.paragraphs_without_refs_total_length',
 'feature.enwiki.revision.shortened_footnote_templates',
 'feature.enwiki.revision.who_templates',
 'feature.len(<datasource.english.idioms.revision.matches>)',
 'feature.len(<datasource.english.words_to_watch.revision.matches>)',
 'feature.len(<datasource.wikitext.revision.words>)',
 'feature.wikitext.revision.chars',
 'feature.wikitext.revision.content_chars',
 'feature.wikitext.revision.external_links',
 'feature.wikitext.revision.headings_by_level(2)',
 

### 4.4 Features - _Do all models (in all language editions of Wikipedia) use the same features?_

So, first we grab all available projects (= "languages") that are included in ORES. For this, we fetch the information about scores in version 3 and grab the index for `models`. We then iterate over these 46 projects and request the models that each project incorporates; with this result we can check whether the model we are looking for is among them. The scan takes about twenty seconds.

In [30]:
def get_projects_for_model(model):
    scores = requests.get("https://ores.wikimedia.org/v3/scores").json()
    print("There are " + str(len(scores)) + " projects for ORES. Scanning them for model '" + model + "':" )
    print("_"*len(scores))
    ret = []
    for project in scores:
        rsp = requests.get("https://ores.wikimedia.org/v3/scores/{0}".format(project))
        rsp = rsp.json()
        models = rsp[project]["models"].keys()
        if model in models:
            ret.append(project)
            print("X", end="")
        else:
            print(".", end="")
    print("\r")
    return ret

model = "articlequality"
projects_with_aq = get_projects_for_model(model)
print("project", "  \t  ", "url                    ", " \t  response-code  \t   language")
for i in projects_with_aq:
    link = "http://{0}.wikipedia.org".format(i[:-4])  # strip the trailing 'wiki' and put what's left into a link, like "http://de.wikipedia.org"
    req = requests.get(link)
    print(i, "  \t  ", link, "\t ", req, " \t  ", req.headers["Content-Language"])

There are 46 projects for ORES. Scanning them for model 'articlequality':
______________________________________________
.......X.....X.X.X.X............X.XX..X.XXX...
project   	   url                      	  response-code  	   language
enwiki   	   http://en.wikipedia.org 	  <Response [200]>  	   en
euwiki   	   http://eu.wikipedia.org 	  <Response [200]>  	   eu
fawiki   	   http://fa.wikipedia.org 	  <Response [200]>  	   fa
frwiki   	   http://fr.wikipedia.org 	  <Response [200]>  	   fr
glwiki   	   http://gl.wikipedia.org 	  <Response [200]>  	   gl
ptwiki   	   http://pt.wikipedia.org 	  <Response [200]>  	   pt
ruwiki   	   http://ru.wikipedia.org 	  <Response [200]>  	   ru
simplewiki   	   http://simple.wikipedia.org 	  <Response [200]>  	   en
svwiki   	   http://sv.wikipedia.org 	  <Response [200]>  	   sv
testwiki   	   http://test.wikipedia.org 	  <Response [200]>  	   en
trwiki   	   http://tr.wikipedia.org 	  <Response [200]>  	   tr
ukwiki   	   http://uk.wikipedia.or

Now we can iterate over this list of projects and a) check whether the features WITHIN each project stay the same and then b) compare them with the other projects. This may take a while...

In [42]:
# project_list = list of projects whose features should be checked
# samples = how many random revids should be checked for each project? Each check is a call, so the higher this number, the longer it all takes!
def check_projects(project_list, samples = 10):
    result = []  # one [project, features] pair per consistent project
    all_good = True
    for p in project_list:
        print("CHECKING PROJECT " + p.upper()[:-4] + "! " , end="")
        res = check_if_features_are_the_same(1000,10000,samples,project=p)
        if res == []:
            all_good = False
            print("Problem with project '" + p + "' - inconsistencies regarding the features!")
        else:
            result.append([p,res])
    if all_good:
        print("All given projects are consistent regarding their OWN features! This is good")
    
    return result

features_of_projects = check_projects(projects_with_aq)


CHECKING PROJECT EN! Checking revids:  [6472, 1688, 6282, 5815, 8829, 5950, 3376, 3901, 8725, 5589]
__________
^^^^^^^^^^
10 of 10 valid revids yield the same features. That's good!
--------------------------------------------------------------------------------
CHECKING PROJECT EU! Checking revids:  [5147, 1361, 1049, 8805, 5668, 5645, 4045, 9040, 1979, 6531]
__________
^--^^^^^^^
8 of 10 valid revids yield the same features. That's good!
--------------------------------------------------------------------------------
CHECKING PROJECT FA! Checking revids:  [1504, 8749, 3687, 6419, 5072, 8026, 7165, 5544, 2613, 7555]
__________
^-^^^^^^-^
8 of 10 valid revids yield the same features. That's good!
--------------------------------------------------------------------------------
CHECKING PROJECT FR! Checking revids:  [7731, 2549, 5003, 6604, 8771, 9182, 2190, 9715, 6270, 6553]
__________
-^---^^^^^
6 of 10 valid revids yield the same features. That's good!
--------------------------------

### Short answer: Features stay the same within a project but are not the same across all projects.
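To see where the projects differ, the per-project feature lists can be compared as sets. The following is a minimal sketch under the assumption that the input has the shape returned by `check_projects` (a list of `[project, features]` pairs). The function name `compare_feature_sets` is made up for illustration, and the toy input, including the project `xxwiki` and its feature name, is invented and not real ORES output.

```python
def compare_feature_sets(features_of_projects):
    """Split feature names into those shared by all projects and the rest.

    Returns (common, only): 'common' is a sorted list of names present in
    every project, 'only' maps each project to its names outside 'common'.
    """
    sets = {project: set(features) for project, features in features_of_projects}
    common = set.intersection(*sets.values()) if sets else set()
    only = {project: sorted(names - common) for project, names in sets.items()}
    return sorted(common), only


# Invented toy input, only to show the shape of the result.
toy = [
    ["enwiki", ["feature.wikitext.revision.chars", "feature.enwiki.revision.cite_templates"]],
    ["xxwiki", ["feature.wikitext.revision.chars", "feature.xxwiki.hypothetical"]],
]
common, only = compare_feature_sets(toy)
print("shared by all projects:", common)
for project, names in only.items():
    print(project, "has in addition:", names)
```

Sets are used here because only the presence of a name matters for this comparison, not the order in which a project reports it.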