# Signpost Article Views

We on the editorial board of the *Wikipedia Signpost* have wanted for a while now to run statistics on the page views that our stories generate. Understanding our distribution helps us understand our readership, and understanding our readership helps us direct our future writing towards the stories that our readers most want to read. Distribution analysis is a basic tenet of newspaper management (I know of at least one "recovering academic" working in circulation analytics at *The New York Times*), but unfortunately one we've never quite been able to practice before.

In the past the only way to get ever-popular pageview statistics was to use [stats.grok.se](http://stats.grok.se/), a venerable website run by an involved community member which did define an API for queries but which, alas, didn't have the throughput to serve results in reasonable time. It relied on huge database dumps which are a pain to download, let alone parse. Finally (belatedly) acting on this fact, the Wikimedia Foundation in December 2015 [rolled out a full-suite Pageview RESTBase API](http://blog.wikimedia.org/2015/12/14/pageview-data-easily-accessible/). Getting view statistics on Wikipedia pages has since become worlds easier, especially with the R and Python API clients also released for the purpose.

**Caveat**: The dataset for this analysis is the set of all *Signpost* article views from the period 2015-10-07 to 2015-12-30. The *Wikipedia Signpost* is published on a weekly basis (and has been since its inception in 2005): this period saw 12 publications (one issue was skipped around Christmas-time). We would like to extend the analysis further back, but due to differing standards of reporting in older applications and logs, for the moment early October 2015 is the earliest date for which Pageview API data is available.

## Initialization code

This notebook uses [pandas](http://pandas.pydata.org/) for tabulation and [mwapi](https://github.com/mediawiki-utilities/python-mwapi) plus [mwviews](https://github.com/mediawiki-utilities/python-mwviews) for querying (I had to patch a bug in the latter to get it to work with escaped characters).
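
For readers curious about what a pageview client does under the hood, here is a minimal sketch of how a per-article request URL could be assembled. The endpoint layout follows the public Wikimedia REST API as I understand it; the `pageviews_url` helper and its default arguments are illustrative assumptions and are not part of this notebook or of mwviews.

```python
from urllib.parse import quote

# Assumed base path of the public per-article pageviews endpoint.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(article, start, end, project="en.wikipedia",
                  access="all-access", agent="all-agents", granularity="daily"):
    """Build a per-article request URL (hypothetical helper, no network call)."""
    # Titles use underscores for spaces; every reserved character,
    # including "/" and ":", is percent-escaped so the title stays
    # a single path segment.
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([BASE, project, access, agent, title, granularity, start, end])


url = pageviews_url("Wikipedia:Wikipedia Signpost/2015-10-07/Op-ed",
                    "2015100700", "2015102100")
print(url)
```

Escaping the slashes matters for *Signpost* titles in particular, since every story title contains at least two of them.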

In [61]:
from pageviews import PageviewsClient
import arrow
import datetime
import urllib.parse
from pandas import DataFrame
from pandas import Series
import pandas as pd
import mwapi
import requests


def viewcounts(article_name, start=None, end=None):
    """
    Fetches the daily viewcounts for an article as a list ordered by date.
    """
    article_name = article_name.replace(' ', '_')
    # Escape the title for use in the request path; slashes must be escaped too.
    parsed_article_name = urllib.parse.quote(article_name).replace('/', '%2F')
    p = PageviewsClient().article_views("en.wikipedia",
                                        [parsed_article_name],
                                        access="all-access",
                                        # access="users",
                                        granularity="daily",
                                        start=start,
                                        end=end)
    # Results are keyed by date, then by the unescaped article name.
    return [p[key][article_name] for key in sorted(p.keys())]

def article_viewcounts(article_name):
    """
    Fetches a list of the Signpost article viewcount from the date of the publication window.
    The Signpost is usually published late, so a generous 14 day news "cycle" is allotted as the publication window.
    In reality views are low before publication and after publication of the next issue, so this doesn't have much effect.
    """
    pubdate = arrow.get(article_name.split("/")[1])
    enddate = (pubdate + datetime.timedelta(days=14)).strftime('%Y%m%d%H')
    pubdate = pubdate.strftime('%Y%m%d%H')
    return viewcounts(article_name, start=pubdate, end=enddate)

def total_viewcount(article_name):
    """
    Returns the total viewcount over the 14-day publication window.
    """
    return sum(article_viewcounts(article_name))

def average_daily_viewcount(article_name):
    """
    Returns the average daily viewcount of the article.
    """
    counts = article_viewcounts(article_name)
    return sum(counts)/len(counts)

def get_all_articles(prefix):
    """
    Returns a list of the titles of all of the Signpost articles published after a certain prefix.
    Prefix is 2015-10-07 for now, the earliest published Signpost story for which data is available (yet).
    """
    session = mwapi.Session('https://en.wikipedia.org', user_agent='signpostviews Jupyter notebook')
    raw_result = session.get(action='query',
                             list='allpages',
                             apfrom=prefix,
                             apto='Wikipedia Signpost/A',
                             apprefix='Wikipedia Signpost',
                             apnamespace=4,
                             aplimit=500,
                             formatversion=2)
    # Requiring at least two slashes filters out issue-level pages, e.g. Wikipedia:Wikipedia Signpost/2015-07-18.
    # Titles from 2016, which are too recent to have full data, are filtered out by the caller.
    result = [r['title'] for r in raw_result['query']['allpages'] if r['title'].count("/") >= 2]
    return result

def tabulate(articles):
    pass_dict = {article: article_viewcounts(article) for article in articles}
    return pass_dict
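
The publication window used by `article_viewcounts` above can also be computed with the standard library alone. This is a minimal sketch, assuming titles of the form `prefix/YYYY-MM-DD/section`; `publication_window` is a hypothetical helper and not part of the notebook.

```python
from datetime import datetime, timedelta


def publication_window(article_name, days=14):
    """Return (start, end) timestamps in YYYYMMDDHH form for a story title."""
    # The issue date is the second path segment of the title.
    issue_date = datetime.strptime(article_name.split("/")[1], "%Y-%m-%d")
    fmt = "%Y%m%d%H"
    return issue_date.strftime(fmt), (issue_date + timedelta(days=days)).strftime(fmt)


start, end = publication_window("Wikipedia Signpost/2015-10-07/Op-ed")
print(start, end)
```

If both endpoints are counted, a 14-day offset yields 15 daily values, which is consistent with the columns numbered 0 through 14 in the tables in this notebook.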

In [2]:
targets = [article for article in get_all_articles("Wikipedia Signpost/2015-10-07/Op-ed") if '2016' not in article]
all_views = tabulate(targets)

In [3]:
pd.set_option('display.max_rows', None)
frame = DataFrame([all_views[key] for key in sorted(all_views.keys())],
                  index=sorted(all_views.keys()))
# Fill missing values (not enough views to be logged so the API returns NaN) with 0.
frame = frame.fillna(0)
# Compute some per-page summary statistics.
# frame['avg'] = frame.apply(lambda x: int(sum(x) / 15), axis=1) # Not that useful because of the many empty values.
frame['total'] = frame.apply(lambda x: sum(x), axis=1)
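
To make the shape of this step concrete, here is a small sketch on made-up counts (the titles and numbers are hypothetical). Rows of unequal length are padded with missing values, which are then replaced by zeros before the totals are taken.

```python
import pandas as pd

# Hypothetical daily counts; the second story has fewer logged days.
views = {
    "Signpost/2015-10-07/Op-ed": [16, 3, 2, 8],
    "Signpost/2015-10-07/Traffic report": [7, 32],
}
titles = sorted(views)
toy = pd.DataFrame([views[t] for t in titles], index=titles)

# Days with no logged views come back as missing values; treat them as 0.
toy = toy.fillna(0)

# Row-wise total across all day columns.
toy["total"] = toy.sum(axis=1)
print(toy)
```

Using `sum(axis=1)` does the same job as the row-wise `apply` in the cell above.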

In [4]:
# Drop entries which obviously never made it to publication.
frame = frame[frame['total'] > 200]
# Fix a particular trouble spot, where an article was renamed post-publication.
frame.loc['Wikipedia:Wikipedia Signpost/2015-10-14/WikiConference report'] = frame.loc['Wikipedia:Wikipedia Signpost/2015-10-14/WikiConference report'] + frame.loc['Wikipedia:Wikipedia Signpost/2015-10-14/WikiConference Report']
frame = frame.drop('Wikipedia:Wikipedia Signpost/2015-10-14/WikiConference Report')
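
The rename fix is easier to follow on a toy frame. The sketch below uses hypothetical titles and counts: the views logged under the old title are added to the row for the new title, and the old row is then dropped.

```python
import pandas as pd

# Hypothetical counts for one story logged under two capitalizations.
toy = pd.DataFrame(
    {0: [10, 5, 7], 1: [20, 1, 3]},
    index=["Issue/Report", "Issue/report", "Issue/Other"],
)
toy["total"] = toy.sum(axis=1)

old_title, new_title = "Issue/Report", "Issue/report"

# Element-wise sum of the two rows, including the 'total' column.
toy.loc[new_title] = toy.loc[new_title] + toy.loc[old_title]
toy = toy.drop(old_title)
print(toy)
```

One thing to watch, as far as I can tell: if either title falls below the 200-view threshold it is removed before the merge, and the row lookup then raises a `KeyError`. Merging first and filtering afterwards avoids that.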

## Raw Data

The following table contains the raw data of all *Signpost* article views from the period 2015-10-07 to 2015-12-30. In these 12 issues the *Signpost* published 86 individual stories. The numbered columns correspond to days after the official publication date (0 means day of publication), with a long "after" time in order to account for the variability in *Signpost* publication time: due to editorial constraints, in this period copies were usually published about two days after the official publication date.

In [60]:
frame

article,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,total
Wikipedia:Wikipedia Signpost/2015-10-07/Op-ed,16,3,2,8,382,663,363,270,180,169,172,117,87,22,29,2483
Wikipedia:Wikipedia Signpost/2015-10-07/Technology report,6,11,8,3,104,174,125,111,113,104,91,55,15,14,14,948
Wikipedia:Wikipedia Signpost/2015-10-07/Traffic report,7,32,5,3,203,233,148,145,146,108,102,77,30,28,14,1281
Wikipedia:Wikipedia Signpost/2015-10-14/Blog,28,154,170,122,84,88,93,86,72,6,11,18,0,0,0,932
Wikipedia:Wikipedia Signpost/2015-10-14/Editorial,52,355,268,151,109,108,111,88,79,19,39,16,0,0,0,1395
Wikipedia:Wikipedia Signpost/2015-10-14/Featured content,40,40,17,23,175,203,141,116,98,96,82,77,11,14,22,1155
Wikipedia:Wikipedia Signpost/2015-10-14/News and notes,22,13,19,27,336,262,167,104,106,112,103,82,10,11,36,1410
Wikipedia:Wikipedia Signpost/2015-10-14/Op-ed,17,4,10,10,175,241,147,101,112,103,89,78,10,14,23,1134
Wikipedia:Wikipedia Signpost/2015-10-14/Technology report,8,1,9,5,114,155,121,86,88,100,81,69,4,12,17,870
Wikipedia:Wikipedia Signpost/2015-10-14/Traffic report,17,13,8,4,165,224,141,100,95,104,95,75,10,11,65,1127


The *Wikipedia Signpost* is considered the [newspaper of record](https://en.wikipedia.org/wiki/Newspaper_of_record) within the internal Wikimedian community (one kind reader I spoke to in person once excitedly called it "the *New York Times* of Wikipedia"). Within individual issues, however, different stories have different lifetimes, depending on how interested the readers are and on where they get reposted, and so some stories have stronger and longer impact periods than others. The hard cut-off is often the publication of the next *Signpost* issue, but for the best stories discussion and reverberation can continue for a week or even more afterwards.

Some caveats:

* The data here is likely a very slight undercount (due to the effects described above).

* Stories are occasionally summarized or linked to in more mainstream media or within major discussions within the project long after the fact of publication, which can produce sudden and very strong one-day spikes in viewership. Such spikes are not captured within this data for reasons of expense (besides, they're not really in the current news cycle anymore).

* Stories are occasionally used as references in other pages, reread by community members searching through the *Signpost* archives, or clicked through to by "see more" content within current articles. These effects are also not captured.

## Summary statistics

**How many views does the average *Signpost* story get?**

In [6]:
int(round(sum(frame['total'])/len(frame['total']), 0))

1550

**What is the standard deviation?**

[Standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) is a measure of the variability of a dataset: here, the variability of the number of views that a *Signpost* article begets.

In [7]:
int(round(frame['total'].describe()['std']))

730

As you can see, the standard deviation in views of a *Signpost* story is extraordinary! The variation in the number of views that a story can get is very high, which hints at the (intuitive, but never before so clearly stated) fact that different types of *Signpost* stories operate at different magnitudes of viewership.

It is more meaningful to examine individual sections, not the *Signpost* as a whole. First, though, the ratio of the standard deviation to the mean (the coefficient of variation) expresses this spread in relative terms:

In [8]:
round(frame['total'].describe()['std'] / (sum(frame['total'])/len(frame['total'])), 6)

0.470997
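
The figure above is the coefficient of variation, the standard deviation divided by the mean. Here is a minimal sketch with the standard library; the toy totals are made up, and `stdev` is the sample standard deviation, the same convention that pandas' `describe()` reports.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation expressed as a fraction of the mean."""
    return stdev(values) / mean(values)


toy_totals = [1000, 2000, 3000]  # hypothetical view totals
cv = coefficient_of_variation(toy_totals)

# The rounded figures reported above give roughly the same ratio.
reported = round(730 / 1550, 3)
print(cv, reported)
```

A value near 0.47 means that a typical story's total deviates from the mean by almost half of the mean itself.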

**What was the most popular story? The least popular one?**

Keep in mind that we are working with a small slice of *Signpost* publication history, just in the range October to December 2015 (the *Signpost* has been published continuously since early 2005). So these are not all-time values, just the ones within our dataset.

In [9]:
print("Most popular:", '"' + frame[frame['total'] == frame['total'].max()].index[0] + '"', "with", int(frame['total'].max()), "views")
print("Least popular:", '"' + frame[frame['total'] == frame['total'].min()].index[0] + '"', "with", int(frame['total'].min()), "views")
print("This is a difference of", int(frame['total'].max() - frame['total'].min()), "views!")

Most popular: "Wikipedia:Wikipedia Signpost/2015-12-30/News and notes" with 5666 views
Least popular: "Wikipedia:Wikipedia Signpost/2015-10-28/Technology report" with 849 views
This is a difference of 4817 views!
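
The same lookup can be written more compactly with `idxmax` and `idxmin`. This sketch uses three totals taken from the tables in this notebook, keyed by shortened, illustrative labels.

```python
import pandas as pd

totals = pd.Series(
    {
        "2015-12-30/News and notes": 5666,
        "2015-10-07/Op-ed": 2483,
        "2015-10-28/Technology report": 849,
    },
    name="total",
)

most = totals.idxmax()    # label of the largest total
least = totals.idxmin()   # label of the smallest total
spread = int(totals.max() - totals.min())
print(most, least, spread)
```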


The most popular story within the dataset is [a *Signpost* story on the sudden removal of community-elected trustee James Heilman from the Wikimedia Foundation's governing body](http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-12-30/News_and_notes), the Board of Trustees, an unprecedented and terrifically opaque move whose fallout has continued to dominate the conversation within the movement in the weeks since. Note that the data for this story is also cut off, so this number is likely a fairly sizable undercount.

The least popular story is a [Tech/News republication](http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-10-28/Technology_report), which has the misfortune of being the least popular article in the least popular section.

**What are the top five most popular stories?**

Keep in mind that we are working with a small slice of *Signpost* publication history, just in the range October to December 2015.

In [59]:
frame.sort_values('total', ascending=False).head(5)

article,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,total
Wikipedia:Wikipedia Signpost/2015-12-30/News and notes,1695,1213,673,375,480,389,274,244,129,69,61,64,0,0,0,5666
Wikipedia:Wikipedia Signpost/2015-10-21/Editorial,954,915,575,337,399,317,226,169,109,32,36,0,0,0,0,4069
Wikipedia:Wikipedia Signpost/2015-12-02/Op-ed,326,1388,654,482,350,239,209,137,145,71,40,0,0,0,0,4041
Wikipedia:Wikipedia Signpost/2015-12-09/Op-ed,68,82,560,702,585,286,201,144,120,89,36,32,11,14,0,2930
Wikipedia:Wikipedia Signpost/2015-11-18/Special report,375,852,309,256,236,178,223,148,63,29,17,221,0,0,0,2907
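
For a top-N listing like this one, `nlargest` is a compact alternative to sorting the whole frame. A minimal sketch on four totals taken from the tables in this notebook, with shortened labels for readability:

```python
import pandas as pd

toy = pd.DataFrame(
    {"total": [2483, 948, 5666, 4069]},
    index=[
        "2015-10-07/Op-ed",
        "2015-10-07/Technology report",
        "2015-12-30/News and notes",
        "2015-10-21/Editorial",
    ],
)

# Rows with the two largest totals, in descending order.
top_two = toy.nlargest(2, "total")
print(top_two)
```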


## By section

**How does popularity break down by section?**

Quick notes on what the sections are:

* The "Technology report" used to be a specialized report for the tech crowd. It has not had a regular writer for a long time, however, and so has now been converted to a script-based republication of the similar but bare-bones [Tech/News](https://meta.wikimedia.org/wiki/Tech/News) report. Since many will have originally read it there first anyway, this is likely the least popular section, just as a matter of course.
* The "Gallery" section is a composition of images, either from photo competitions or from meetups or, occasionally, just to go with the season.
* The "Blog" section selectively republishes reports first posted on the [Wikimedia Blog](http://blog.wikimedia.org/).
* The "Traffic report" speculates on the popularity of the week's ten most-viewed articles on Wikipedia. It is light-hearted material that is well-read for what it is.
* "Recent research" (also compiled separately as the [Recent Research report](https://meta.wikimedia.org/wiki/Research:Newsletter)) is a monthly section which summarizes the state of academic research on Wikipedia and the Wikimedias. It is thus a dense read, for good reason!
* The "Arbitration report" summarizes the outcomes and progress of cases taken on by the [Arbitration Committee](https://en.wikipedia.org/wiki/Wikipedia:Arbitration_Committee), the community court of last resort.
* The "WikiProject report" interviews the active members of a WikiProject such as WikiProject Volcanoes or WikiProject Iceland.
* "In the media" summarizes the week's media mentions of Wikipedia and the Wikimedia movement.
* "News and notes" summarizes the week's internal news specific to the operations of the encyclopedia and the Wikimedia movement.
* "Opinion" is when someone sticks their nose out and stakes a case in the *Signpost*, arguing for or against something. It's no wonder it's well-read!
* The "Special report" is a well-researched *Signpost* inside scoop on some aspect of the project, and is more of a just-the-facts report than an opinion column.

In [50]:
def views(section_title):
    """
    Fetch a list of view counts for a particular section.
    """
    return [frame.loc[article, 'total'] for article in frame.index if section_title in article]

def avg_views(section_title):
    """
    Fetch the average view count for a particular section.
    """
    v = views(section_title)
    return int(round(sum(v) / len(v)))


ind = ['News and notes',
       'In the media',
       'Technology report',
       'Traffic report',
       'Arbitration report',
       'WikiProject report',
       'Recent research',
       'Blog',
       'Gallery',
       'Opinion',
       'Special report']
viewf = DataFrame([avg_views('News and notes'),
                   avg_views('In the media'),
                   avg_views('Technology report'),
                   avg_views('Traffic report'),
                   avg_views('Arbitration report'),
                   avg_views('WikiProject report'),
                   avg_views('Recent research'),
                   avg_views('Blog'),
                   avg_views('Gallery'),
                   # Opinion
                   int(round(sum(views('Op-ed') + views('Editorial')) / len(views('Op-ed') + views('Editorial')))),
                   # Special
                   int(round(sum(views('Special report') + views('In focus')) / len(views('Special report') + views('In focus'))))],
                   index=ind,
                   columns=['average']).sort_values('average', ascending=False)
viewf

section,average
Opinion,2303
Special report,2243
News and notes,2155
In the media,1679
WikiProject report,1600
Arbitration report,1545
Recent research,1422
Traffic report,1251
Blog,1163
Gallery,1073


These results are not exactly surprising, but it's nice to have concrete numbers at last!
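
As an aside, the per-section averages can also be computed with a `groupby` instead of one `avg_views` call per section. This is a sketch under assumptions: the `SECTION_GROUPS` mapping and the `section_of` helper are hypothetical names, and only four totals from the raw data table are used.

```python
import pandas as pd

# Hypothetical mapping that folds related section names together,
# mirroring the Opinion and Special report groupings used above.
SECTION_GROUPS = {"Op-ed": "Opinion", "Editorial": "Opinion", "In focus": "Special report"}


def section_of(title):
    """Map a full story title to its (possibly merged) section name."""
    raw = title.split("/")[2]
    return SECTION_GROUPS.get(raw, raw)


totals = pd.Series({
    "Wikipedia:Wikipedia Signpost/2015-10-07/Op-ed": 2483,
    "Wikipedia:Wikipedia Signpost/2015-10-14/Editorial": 1395,
    "Wikipedia:Wikipedia Signpost/2015-10-07/Traffic report": 1281,
    "Wikipedia:Wikipedia Signpost/2015-10-14/Traffic report": 1127,
})

# Passing a function to groupby applies it to each index label.
averages = (totals.groupby(section_of).mean()
            .round().astype(int)
            .sort_values(ascending=False))
print(averages)
```

Matching on the third path segment exactly, and not on a substring of the whole title, avoids accidental matches when one section name happens to appear inside another title.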

## Publication cycles

The *Signpost* has been published on a near-continuous weekly basis since early 2005. An interesting question to ask is one of publication cycles. The *Signpost* is usually published a day or two late, both to accommodate the schedules of the publishers and to best fit whatever the current news cycle may be.

In this context a **publication cycle** is one start-to-finish *Signpost* issue: from the initial drafts produced by the sections' writers to the polished final copy that is published for all to read. The **news cycle** is the almost-but-not-quite-weekly accumulation of goings-on that is covered in each *Signpost* issue, that is, the space between one issue and the next. Views are considered to be **in-cycle** if they occur while that issue is the current one.

All of the data above ignored cycles: it included page views a little before and a little after the story was live. We now recalibrate the data somewhat with additional information to get a better picture of what a publication cycle is like.
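
To illustrate the in-cycle definition, here is a minimal sketch that sums only the views falling between the day a story went live and the day the next issue replaced it. The counts are those of the 2015-10-07 Op-ed from the raw data table; the offsets 4 and 11 are my assumptions, read off the jump in the data plus one week, and `in_cycle_views` is a hypothetical helper.

```python
def in_cycle_views(daily_counts, live_from, live_until):
    """Sum views from day offset live_from (inclusive) to live_until (exclusive)."""
    return sum(daily_counts[live_from:live_until])


# Daily views for days 0..14 after the official publication date.
op_ed = [16, 3, 2, 8, 382, 663, 363, 270, 180, 169, 172, 117, 87, 22, 29]

# Assumed offsets: live from day 4, next issue live on day 11.
in_cycle = in_cycle_views(op_ed, live_from=4, live_until=11)
share = in_cycle / sum(op_ed)
print(in_cycle, round(share, 2))
```

Under these assumed offsets the large majority of the story's views in the window are in-cycle.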

In [121]:
def get_title(pub):
    """
    Fetches the rendered article page over HTTP and returns the text of its first section headline.
    """
    ret = requests.get("http://en.wikipedia.org/wiki/" + pub).text
    # Keep only the text between the first mw-headline span's opening tag and its closing </span></h2>.
    ret = ret[ret.find('<span class="mw-headline"'):]
    ret = ret[:ret.find('</span></h2>')]
    ret = ret[ret.find('>') + 1:]
    return ret
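
String slicing like this returns garbage when the marker is missing, because `find` returns -1. A slightly more defensive sketch is shown below; it assumes the page markup wraps headlines in `<span class="mw-headline">`, and both `first_headline` and the sample HTML are hypothetical.

```python
import re
from html import unescape

# Matches the contents of the first mw-headline span (non-greedy).
HEADLINE = re.compile(r'<span class="mw-headline"[^>]*>(.*?)</span>', re.DOTALL)


def first_headline(html):
    """Return the text of the first headline span, or None if there is none."""
    match = HEADLINE.search(html)
    if match is None:
        return None
    # Drop any inline tags, then decode entities such as &amp;.
    text = re.sub(r"<[^>]+>", "", match.group(1))
    return unescape(text).strip()


sample = '<h2><span class="mw-headline" id="Example">Example &amp; headline</span></h2>'
print(first_headline(sample))
print(first_headline("<p>no headline here</p>"))
```

Returning `None` makes a failed lookup easy to spot before the result is used as an index label.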

In [124]:
table = frame.copy()
table['type'] = [title.split("/")[2] for title in frame.index]
table['date'] = [title.split("/")[1] for title in frame.index]

In [139]:
table['title'] = [get_title(title) for title in frame.index]
table = table.set_index('title')
table.index.name=None

In [150]:
cols = table.columns.tolist()
# cols = cols[-2:] + cols[:-2]
cols[1], cols[0] = cols[0], cols[1]

In [152]:
table = table[cols]
table

title,date,type,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,total
Walled gardens of corruption,2015-10-07,Op-ed,16,3,2,8,382,663,363,270,180,169,172,117,87,22,29,2483
Tech news in brief,2015-10-07,Technology report,6,11,8,3,104,174,125,111,113,104,91,55,15,14,14,948
Reality is for losers,2015-10-07,Traffic report,7,32,5,3,203,233,148,145,146,108,102,77,30,28,14,1281
Third Wikimedia Spain conference takes place in Madrid,2015-10-14,Blog,28,154,170,122,84,88,93,86,72,6,11,18,0,0,0,932
Why the news media needs a Wikipedian in residence,2015-10-14,Editorial,52,355,268,151,109,108,111,88,79,19,39,16,0,0,0,1395
A fistful of dollars,2015-10-14,Featured content,40,40,17,23,175,203,141,116,98,96,82,77,11,14,22,1155
2015–2016 Q1 fundraising update sparks mailing list debate,2015-10-14,News and notes,22,13,19,27,336,262,167,104,106,112,103,82,10,11,36,1410
WikiConference USA 2015: built on good faith,2015-10-14,Op-ed,17,4,10,10,175,241,147,101,112,103,89,78,10,14,23,1134
Tech news in brief,2015-10-14,Technology report,8,1,9,5,114,155,121,86,88,100,81,69,4,12,17,870
"Screens, sport, Reddit, and death",2015-10-14,Traffic report,17,13,8,4,165,224,141,100,95,104,95,75,10,11,65,1127


In [157]:
multiindex = pd.MultiIndex.from_arrays([table['date'], table.index])

In [164]:
table2 = table.copy()
table2.index = multiindex
table2.index.names = ['date', 'title']
del table2['date']
table2

date,title,type,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,total
2015-10-07,Walled gardens of corruption,Op-ed,16,3,2,8,382,663,363,270,180,169,172,117,87,22,29,2483
2015-10-07,Tech news in brief,Technology report,6,11,8,3,104,174,125,111,113,104,91,55,15,14,14,948
2015-10-07,Reality is for losers,Traffic report,7,32,5,3,203,233,148,145,146,108,102,77,30,28,14,1281
2015-10-14,Third Wikimedia Spain conference takes place in Madrid,Blog,28,154,170,122,84,88,93,86,72,6,11,18,0,0,0,932
2015-10-14,Why the news media needs a Wikipedian in residence,Editorial,52,355,268,151,109,108,111,88,79,19,39,16,0,0,0,1395
2015-10-14,A fistful of dollars,Featured content,40,40,17,23,175,203,141,116,98,96,82,77,11,14,22,1155
2015-10-14,2015–2016 Q1 fundraising update sparks mailing list debate,News and notes,22,13,19,27,336,262,167,104,106,112,103,82,10,11,36,1410
2015-10-14,WikiConference USA 2015: built on good faith,Op-ed,17,4,10,10,175,241,147,101,112,103,89,78,10,14,23,1134
2015-10-14,Tech news in brief,Technology report,8,1,9,5,114,155,121,86,88,100,81,69,4,12,17,870
2015-10-14,"Screens, sport, Reddit, and death",Traffic report,17,13,8,4,165,224,141,100,95,104,95,75,10,11,65,1127
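
A two-level index like this one can also be built in a single step with `set_index`, after which per-issue lookups and sums become one-liners. This sketch uses three rows from the table above, so the per-issue sum covers only those rows and not the full issue.

```python
import pandas as pd

flat = pd.DataFrame({
    "date": ["2015-10-07", "2015-10-07", "2015-10-14"],
    "title": ["Reality is for losers", "Walled gardens of corruption", "A fistful of dollars"],
    "type": ["Traffic report", "Op-ed", "Featured content"],
    "total": [1281, 2483, 1155],
})

# Promote both columns to index levels named 'date' and 'title'.
indexed = flat.set_index(["date", "title"])

# All stories of one issue, and the summed views per issue date.
first_issue = indexed.loc["2015-10-07"]
per_issue = indexed.groupby(level="date")["total"].sum()
print(per_issue)
```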


We expected almost all views of a story to occur after publication, and this does seem to be broadly true; but even within a single issue there is a surprising lack of correspondence between when the publication views begin and when they end. This mismatch becomes even more apparent when we refactor our data based on date.