
# Keystone Proposal
Wikipedia is a major information source used by millions of people each day. It is therefore in a company's own interest that its article is balanced and free of vandalism. This can be especially important for recruiting, since potential applicants will research information about the company.

Wikipedia is an open platform, meaning anyone can edit an article without even registering. Several mechanisms nevertheless keep unbalanced edits and vandalism from going unnoticed.
The most important one is the watchlist. Wikipedia authors can put an article on their watchlist and will be notified whenever a change is committed, giving them a chance to review it. For highly controversial or frequently vandalized articles, additional protection can be enabled. These protections limit who is allowed to edit and may require peer review before a change is published.

From a data science perspective, the advantage of Wikipedia is that all of its data is freely available.
In my keystone project I want to investigate how public interest in an article relates to the protections in place.


### Technical description:

Data was acquired using the Wikipedia API and the sister project ['Massviews Analysis'](https://tools.wmflabs.org/massviews/).
Massviews tracks article views, while the Wikipedia API supplies categories, watchlist counts, protection settings, past changes and most other information available on Wikipedia.
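
The API returns category members in batches and signals that more batches exist through a `continue` object in the response. The sketch below shows that loop in isolation. The helper name `collect_pages` and the canned responses in `fake_fetch` are made up for illustration, so the example runs without network access; in practice `fetch` would wrap a call such as `requests.get(...).json()`.

```python
def collect_pages(fetch, params):
    """Merge the `pages` of every batch, following `continue` values.

    `fetch` is any callable that takes a parameter dict and returns the
    decoded JSON response (hypothetical interface for this sketch).
    """
    params = dict(params)  # work on a copy so the caller's dict stays untouched
    pages = {}
    while True:
        response = fetch(params)
        pages.update(response.get("query", {}).get("pages", {}))
        if "continue" not in response:
            return pages
        # Send every continuation value back with the next request.
        params.update(response["continue"])


def fake_fetch(params):
    """Stand-in for the real API: two canned batches with made-up content."""
    if "gcmcontinue" not in params:
        return {"continue": {"gcmcontinue": "page|2", "continue": "gcmcontinue||"},
                "query": {"pages": {"1": {"title": "A"}}}}
    return {"query": {"pages": {"2": {"title": "B"}}}}


pages = collect_pages(fake_fetch, {"action": "query", "generator": "categorymembers"})
print(sorted(pages))
```

Passing the fetch function in as an argument keeps the paging logic testable without touching the live API.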

Data was gathered for all companies traded on NASDAQ or the New York Stock Exchange, as listed in the corresponding Wikipedia categories.
As metrics I currently use:

 1. Average number of visitors per day
 2. Number of registered Wikipedia accounts that have the article on their watchlist
 3. Is additional protection in place (y/n)

Future features might include the category of each company (e.g. Pharmaceuticals, Oil, Technology, Tobacco) and proposed changes that were flagged as vandalism.
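
As a rough illustration of how the three current metrics fit together, the sketch below derives them from a single page record. It assumes the record has the shape of an `info` result with a `protection` list and an optional `watchers` count. The helper name `page_metrics` and the sample record are made up, and the daily average is passed in separately because it comes from Massviews rather than from the API.

```python
def page_metrics(page, average_daily_views):
    """Collect the three metrics for one article (hypothetical helper)."""
    return {
        "title": page["title"],
        "average_daily_views": average_daily_views,
        # The watcher count may be missing from a record, so fall back to None.
        "watchers": page.get("watchers"),
        # An empty protection list means no additional protection is in place.
        "protected": bool(page.get("protection")),
    }


# Made-up record in the shape of an `info` result; the values are illustrative only.
sample_page = {
    "pageid": 1,
    "title": "Example Corp",
    "watchers": 120,
    "protection": [{"type": "edit", "level": "autoconfirmed", "expiry": "infinity"}],
}

metrics = page_metrics(sample_page, average_daily_views=850.0)
print(metrics)
```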

### Processing:
First I remove the outliers using the 0.95 quantile. Specifically, I remove entries that are in the top 5% of either average daily views or watchlist count. The remaining companies are visualized as a scatter plot, using color to indicate whether protection is in place. I also fitted a linear regression and included it in the plot.
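
The outlier filter can be sketched on toy data without any libraries. The example below assumes a quantile with linear interpolation between neighbouring values, which is also the default behaviour of pandas' `Series.quantile`. The helper names and the ten sample companies are made up for illustration.

```python
def quantile(values, q):
    """Quantile with linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def drop_top_share(rows, keys, q=0.95):
    """Keep only rows that lie below the q-quantile in every listed key."""
    limits = {key: quantile([row[key] for row in rows], q) for key in keys}
    return [row for row in rows if all(row[key] < limits[key] for key in keys)]


# Made-up companies: nine ordinary ones and one with extreme values.
companies = [{"title": f"Company {i}", "watchers": 30 + 10 * i, "average": 300 + 100 * i}
             for i in range(9)]
companies.append({"title": "Huge Corp", "watchers": 5000, "average": 90000})

kept = drop_top_share(companies, keys=("watchers", "average"))
print([row["title"] for row in kept])
```

A row is dropped as soon as it exceeds the limit in one of the two metrics, matching the "either ... or" rule described above.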

For my second plot I focused on the outliers of the previous plot. It is vital to understand whether an excess view count is merely due to a recent spike in public interest. To investigate this, I plotted the daily (non-averaged) view counts over the last month for all companies whose absolute deviation from the linear regression exceeds the 0.95 quantile.
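
Selecting the companies that deviate most from the regression line can be sketched as follows. This is a simplified stand-in, not the notebook's exact method: it fits the line with a closed-form least-squares formula and keeps a fixed share of the points ranked by absolute deviation, instead of comparing against a quantile threshold. The function names and the toy data are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def largest_deviations(xs, ys, share=0.05):
    """Indices of the points farthest from the fitted line, largest first."""
    slope, intercept = fit_line(xs, ys)
    deviations = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    count = max(1, round(share * len(xs)))
    ranked = sorted(range(len(xs)), key=lambda i: deviations[i], reverse=True)
    return ranked[:count]


# Toy data: 20 points on a straight line, one of them shifted far upwards.
watchers = list(range(10, 210, 10))
views = [10 * w for w in watchers]
views[7] += 5000

print(largest_deviations(watchers, views))
```

Ranking by absolute or by squared deviation selects the same points, because squaring preserves the order of non-negative values.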

### Interpretation
The first plot shows that for many companies the protection level does not match the interest in their Wikipedia page. These companies may want to review their articles more frequently to spot lingering vandalism. The second plot reveals a caveat that always has to be considered: a high view count does not necessarily reflect persistent public interest in the article, but can be caused by rare spikes, e.g. in media coverage.
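
One possible way to tell a short-lived spike from persistent interest is to compare the peak of the daily view counts with their median. The sketch below is only an illustration of that idea and is not part of the analysis above. The name `spike_ratio` and both sample series are made up, and any cut-off applied to the ratio would have to be chosen by inspecting real data.

```python
import statistics


def spike_ratio(daily_views):
    """Peak daily views divided by the median; large values hint at a spike."""
    median = statistics.median(daily_views)
    return max(daily_views) / median if median else float("inf")


# Made-up daily view counts for two articles over one week.
steady = [1000, 1050, 980, 1020, 990, 1010, 1005]
spiky = [200, 210, 190, 205, 6000, 220, 195]

print(round(spike_ratio(steady), 2), round(spike_ratio(spiky), 2))
```

The median is used instead of the mean because a single extreme day barely moves it, so the ratio stays sensitive to exactly the kind of spike described above.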

In [1]:
import requests

API_URL = 'https://en.wikipedia.org/w/api.php'
CATEGORIES = ['Category:Companies_listed_on_the_New_York_Stock_Exchange',
              'Category:Companies_listed_on_NASDAQ']
query_pars = {"action": "query", "format": "json",
              "prop": "info", "generator": "categorymembers",
              "inprop": "protection|visitingwatchers|watchers",
              "gcmlimit": "100"}
result = {}
i = 0
for category in CATEGORIES:
    # Start every category from a fresh copy so no continuation value carries over.
    params = dict(query_pars, gcmtitle=category)
    while True:
        i += 1
        print(i, '... will be finished at around 30')
        response = requests.get(API_URL, params=params).json()
        # Merge this batch of pages (keyed by page id) into the overall result.
        result.update(response['query']['pages'])
        if 'continue' not in response:
            break
        # Send all continuation values back to request the next batch.
        params.update(response['continue'])



In [2]:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
import pandas as pd

# Page info from the Wikipedia API, one row per article.
data1 = pd.DataFrame.from_dict(result, orient='index')
# Massviews exports with the view counts for January 2017.
data2 = pd.concat([pd.read_json('massviews-20170101-20170131_nystock.json'),
                   pd.read_json('massviews-20170101-20170131_nasdaq.json')])
data = pd.merge(data1, data2, how='inner', left_on='title', right_on='label')
data['visiting_per_view'] = data['watchers'] / data['average']

# Keep only articles for which both metrics are available.
data = data.loc[data['watchers'].notnull() & data['average'].notnull(), :]

# Drop the top 5% in either metric.
data_nofarout = data.loc[(data.watchers < data.watchers.quantile(q=.95)) &
                         (data.average < data.average.quantile(q=.95))].copy()

# A non-empty protection list means additional protection is in place.
is_protected = data_nofarout.protection.apply(bool)

plt.figure()
axis = plt.gca()
data_nofarout[~is_protected].plot(kind='scatter', x='watchers', y='average',
                                  c='b', label='unprotected', ax=axis)
data_nofarout[is_protected].plot(kind='scatter', x='watchers', y='average',
                                 c='r', label='protected', ax=axis)

# scikit-learn expects a 2-D feature array of shape (n_samples, 1).
watchers_2d = data_nofarout.watchers.values[:, np.newaxis]
model = LinearRegression()
model.fit(watchers_2d, data_nofarout.average)
axis.plot(data_nofarout.watchers, model.predict(watchers_2d),
          color='orange', label='linear regression')
axis.set_xlabel('Number of people having the article on their watchlist')
axis.set_ylabel('Average number of daily visitors')
axis.legend()




In [3]:
import numpy as np

# Absolute deviation of each article from the regression line.
data_nofarout['deviations'] = np.abs(
    data_nofarout.average - model.predict(data_nofarout.watchers.values[:, np.newaxis]))

# Daily view counts of the articles whose deviation exceeds the 0.95 quantile.
far_off = data_nofarout.loc[
    data_nofarout.deviations > data_nofarout.deviations.quantile(.95), 'data']

plt.figure()
for daily_views in far_off:
    # One line per article; the x axis runs from -31 to -1, one value per day.
    plt.plot(np.arange(-31, 0, 1), daily_views)

plt.ylabel('Daily view count')
plt.xlabel('Days')
plt.show()

