# Results

In [1]:
import pandas as pd 
import numpy as np
from IPython.display import display, display_html
%load_ext autoreload
from pprint import pprint
import matplotlib.pyplot as plt

In [2]:
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)

In [3]:
with open('./res/revisions_FAC.csv', 'r') as file:
    date_cols_0 = [col for col in file.readline().strip().split(';') if 'date' in col]
with open('./res/FAC_merged.csv', 'r') as file:
    date_cols_1 = [col for col in file.readline().strip().split(';') if 'date' in col]




In [15]:
df_FAC = pd.merge(
    pd.read_csv('./res/revisions_FAC.csv', sep=';', index_col=0, parse_dates=date_cols_0),
    pd.read_csv('./res/FAC_merged.csv', sep=';', index_col=0, parse_dates=date_cols_1),
    on='title')
df_FAC['nomination_period'] = df_FAC['end_date'] - df_FAC['date_nomination']

df_FA = pd.merge(pd.read_csv('./res/revisions_FA.csv', sep=';', index_col=0, parse_dates=date_cols_0),
                  pd.read_csv('./res/FA_merged.csv', sep=';', index_col=0, parse_dates=date_cols_1),
                  on= 'title')
df_FA['nomination_period'] = df_FA['end_date'] - df_FA['date_nomination']

print(f'We were able to successfully retrieve information on {df_FA.shape[0]} successful and {df_FAC.shape[0]} unsuccessful nominations.')

d = df_FA.nomination_period.mean()


We were able to successfully retrieve information on 5152 successful and 3743 unsuccessful nominations.


## Erroneous Nomination Periods

If we look at the nomination periods, it becomes clear that some were not recorded correctly. This is most obvious for the longest nomination periods. The reason is that we use the earliest and latest comments to determine the nomination period, which can lead to errors. Upon manually inspecting the discussions of the suspicious records, I found the following reasons for incorrect records:
* references to prior or future nominations
* copied discussions for example from peer reviews or article talk pages
* comments outside the nomination period 

The last case in particular is fairly common. It seems that these comments were added after the article had already been archived. For more recent nominations I was able to confirm this from the edit history of the archives. For a small number of older articles I consulted the revision history of the Featured Article Discussion. I opted to remove all questionable observations I could find. The phenomenon is very common in 2005 and 2006, so I excluded these years from my analysis (all nominations before 2007-01-01 are dropped below). Additional articles were removed after inspection. The results of this inspection are documented in res/article_inspection.txt.
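To illustrate why taking the earliest and latest comments can inflate a nomination period, here is a minimal sketch on made-up data. The article title, the timestamps, and the helper `nomination_period` are hypothetical and are not taken from the actual dataset or scraping code.

```python
import pandas as pd

# Hypothetical comment timestamps for a single nomination discussion.
comments = pd.DataFrame({
    "title": ["Example article"] * 4,
    "timestamp": pd.to_datetime([
        "2008-03-01 10:00",
        "2008-03-05 18:30",
        "2008-03-12 09:15",
        "2009-01-20 14:00",  # stray comment added after archiving
    ]),
})


def nomination_period(df):
    """Latest minus earliest comment per article (illustrative helper)."""
    grouped = df.groupby("title")["timestamp"]
    return grouped.max() - grouped.min()


with_stray = nomination_period(comments)["Example article"]
without_stray = nomination_period(comments.iloc[:-1])["Example article"]

print("with stray comment:   ", with_stray)
print("without stray comment:", without_stray)
```

A single late comment is enough to stretch the period from about eleven days to most of a year, which is why the longest nomination periods are the natural place to look for such records.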

In [16]:
n_FA = len(df_FA)
n_FAC = len(df_FAC)

df_FA = df_FA.loc[df_FA.date_nomination >= np.datetime64('2007-01-01')]
df_FAC = df_FAC.loc[df_FAC.date_nomination >= np.datetime64('2007-01-01')]

remove_FA = ['Cretaceous–Tertiary extinction event', 'M-theory', 'Beijing opera', 'Columbian mammoth', 'Elvis Presley',
'Baron Munchausen', 'Blue men of the Minch', 'Amphetamine', 'Meshuggah', 'John J. Crittenden']
remove_FAC = ['Vector space', 'Ecology', 'Menominee Tribe v. United States', 'Sesame Street research', 'History of KFC',
              'Ravenloft (module)', 'David Falk', '1997 Michigan Wolverines football team', 'God of War (video game)']

df_FAC = df_FAC[~df_FAC.title.isin(remove_FAC)]
df_FA = df_FA[~df_FA.title.isin(remove_FA)]
n_FA2 = len(df_FA)
n_FAC2 = len(df_FAC)
print(f'We dropped {n_FA-n_FA2} successful and {n_FAC-n_FAC2} unsuccessful nominations ({n_FA2} / {n_FAC2} remaining)')

We dropped 983 successful and 1292 unsuccessful nominations (4169 / 2451 remaining)


In [6]:
#df_FA_red.loc[df_FA_red.nomination_period.nlargest(30).index, ['title', 'nomination_period', 'date_nomination', 'date_last_comment', 'end_date']]
#df_FAC_red.loc[df_FAC_red.nomination_period.nlargest(30).index, ['title', 'nomination_period', 'date_nomination', 'date_last_comment', 'end_date']]

## Nomination Periods

In [7]:
sum_nomination = pd.DataFrame([df_FAC.nomination_period.describe(), df_FA.nomination_period.describe()]).T
sum_nomination.columns = ['unsuccessful', 'successful']
sum_nomination

statistic,unsuccessful,successful
count,2451,4169
mean,16 days 16:42:06.722154,23 days 07:10:50.858239
std,14 days 10:18:24.233626,16 days 11:25:42.793709
min,0 days 00:07:00,1 days 01:30:00
25%,5 days 16:38:00,10 days 13:43:37
50%,12 days 20:51:59,18 days 22:38:04
75%,23 days 06:18:35.500000,31 days 16:27:00
max,94 days 09:00:39,155 days 10:11:40


In general, we find that successful nominations remain on the discussion page for longer. This is to be expected. If we look at the shortest unsuccessful nominations, we find articles with few sources or with missing license information for their pictures. When it is this obvious that an article does not fulfill the Featured Article Criteria, decisions are made swiftly. To promote an article, however, there needs to be some time for potential critics to respond. With one exception, the shortest successful nominations took at least 2 days. The exception (M-553, 1 day 1:30) is unusual because it had been unsuccessfully nominated 3 days prior.

In [8]:
# By uncommenting these statements you can inspect the shortest and longest nomination periods
# n = 5
#df_FA.loc[df_FA.nomination_period.nsmallest(n).index, ['title', 'date_nomination', 'end_date', 'nomination_period']]
#df_FAC.loc[df_FAC.nomination_period.nsmallest(n).index, ['title', 'date_nomination', 'end_date','nomination_period']]
#df_FA.loc[df_FA.nomination_period.nlargest(n).index, ['title', 'date_nomination', 'end_date', 'nomination_period']]
#df_FAC.loc[df_FAC.nomination_period.nlargest(n).index, ['title', 'date_nomination', 'end_date', 'nomination_period']]


## Edits

In [9]:
df_FAC['edits_p2w'] = df_FAC['edits_after']/(df_FAC['nomination_period']/np.timedelta64(2, 'W'))
df_FA['edits_p2w'] = df_FA['edits_after']/(df_FA['nomination_period']/np.timedelta64(2, 'W'))

summary_FAC_edits = pd.DataFrame([
df_FAC.edits_before.describe(),
df_FAC.edits_after.describe(),
df_FAC.edits_p2w.describe()])

summary_FA_edits = pd.DataFrame([
df_FA.edits_before.describe(),
df_FA.edits_after.describe(),
df_FA.edits_p2w.describe()])

summary_FA_edits['kurtosis'] = [df_FA.edits_before.kurtosis(), df_FA.edits_after.kurtosis(), df_FA.edits_p2w.kurtosis()]
summary_FA_edits['skewness'] = [df_FA.edits_before.skew(), df_FA.edits_after.skew(), df_FA.edits_p2w.skew()]
summary_FAC_edits['kurtosis'] = [df_FAC.edits_before.kurtosis(), df_FAC.edits_after.kurtosis(), df_FAC.edits_p2w.kurtosis()]
summary_FAC_edits['skewness'] = [df_FAC.edits_before.skew(), df_FAC.edits_after.skew(), df_FAC.edits_p2w.skew()]


summary_FA_edits.drop('count', axis=1, inplace=True)
summary_FAC_edits.drop('count', axis=1, inplace=True)
s1_edits = summary_FAC_edits.style.format({col: '{:.2f}' for col in summary_FAC_edits.columns})
s2_edits = summary_FA_edits.style.format({col: '{:.2f}' for col in summary_FA_edits.columns})

print('Unsuccessful Nominations')
display(s1_edits)
print('\n\n')

print('Successful Nominations')
display(s2_edits)

Unsuccessful Nominations


variable,mean,std,min,25%,50%,75%,max,kurtosis,skewness
edits_before,35.67,63.11,0.0,3.0,15.0,43.5,1095.0,59.67,5.76
edits_after,52.0,82.4,0.0,6.0,26.0,63.0,1143.0,40.51,4.85
edits_p2w,73.5,167.22,0.0,8.91,31.36,75.06,2880.0,110.91,8.91





Successful Nominations


variable,mean,std,min,25%,50%,75%,max,kurtosis,skewness
edits_before,46.16,71.51,0.0,3.0,18.0,58.0,947.0,17.49,3.25
edits_after,75.56,97.76,0.0,20.0,46.0,93.0,1941.0,51.01,4.73
edits_p2w,60.47,79.36,0.0,14.0,34.96,75.7,856.37,15.8,3.26


We think it is important to point out that the number of edits is a flawed metric. It does not account for the magnitude of the change to the article: a simple edit count does not distinguish between correcting a single typo and writing an additional section. Differences in the edit count could also just reflect differences in working style, i.e. how often an author saves their changes when editing an article.

On average, successful nominations receive slightly more edits in the two weeks leading up to the nomination than unsuccessful ones. The difference in the mean number of edits is, however, strongly driven by outliers. The medians differ by only 3 edits (18 compared to 15).

After the nomination, too, more work is expended on successful nominations. The problem with this comparison is that it does not take the differences in the nomination period into account: for successful nominations we record the edits over a longer time. For this reason we added an additional measurement, the number of edits per two weeks. We chose this measure because it also facilitates comparisons with the pre-nomination period. Generally there are more edits after the nomination, indicating that the additional attention leads to significant changes to the article. The increase from the average edits before the nomination to the average edits per two weeks after it is bigger for unsuccessful nominations (35.67 to 73.50) than for successful ones (46.16 to 60.47). One explanation for this would be that more flaws get pointed out during the nomination of these candidates, leaving the author(s) with more issues to address. From my inspection of the discussions I learned that insufficient or low-quality references are a common issue for unsuccessful candidates. During the review, low-quality references are usually removed, some authors add new ones, and unreferenced sections may even be deleted, causing frequent edits.
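The normalization to edits per two weeks can be sketched on a toy table as follows. The three rows are invented for illustration. Only the first row is chosen on purpose: a single edit within the shortest recorded unsuccessful nomination period (7 minutes) would yield exactly the maximum of 2880 reported for `edits_p2w`. I have not verified that this is the observation behind that maximum, but it shows how sensitive the measure is to very short nomination periods.

```python
import pandas as pd

# Length of the reference window used for normalization.
TWO_WEEKS = pd.Timedelta(weeks=2)

# Invented example rows, not taken from the dataset.
toy = pd.DataFrame({
    "edits_after": [1, 40, 90],
    "nomination_period": pd.to_timedelta(["7 minutes", "14 days", "42 days"]),
})

# Dividing two timedeltas gives the period as a fraction of two weeks;
# dividing the edit count by that fraction rescales it to a two-week window.
toy["edits_p2w"] = toy["edits_after"] / (toy["nomination_period"] / TWO_WEEKS)

print(toy)
```

The second and third rows behave as intended (40 edits in two weeks, 90 edits in six weeks), while the first row shows that very short periods produce extreme values, which is one reason to prefer the quantiles over the mean when reading the tables.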

In [17]:
df_FAC['authors_p2w'] = df_FAC['authors_after']/(df_FAC['nomination_period']/np.timedelta64(2, 'W'))
df_FA['authors_p2w'] = df_FA['authors_after']/(df_FA['nomination_period']/np.timedelta64(2, 'W'))


summary_FAC_authors = pd.DataFrame([
df_FAC.authors_before.describe(),
df_FAC.authors_after.describe(),
df_FAC.authors_p2w.describe()])
summary_FA_authors = pd.DataFrame([
df_FA.authors_before.describe(),
df_FA.authors_after.describe(),
df_FA.authors_p2w.describe()])

summary_FA_authors['kurtosis'] = [df_FA.authors_before.kurtosis(), df_FA.authors_after.kurtosis(), df_FA.authors_p2w.kurtosis()]
summary_FA_authors['skewness'] = [df_FA.authors_before.skew(), df_FA.authors_after.skew(), df_FA.authors_p2w.skew()]
summary_FAC_authors['kurtosis'] = [df_FAC.authors_before.kurtosis(), df_FAC.authors_after.kurtosis(), df_FAC.authors_p2w.kurtosis()]
summary_FAC_authors['skewness'] = [df_FAC.authors_before.skew(), df_FAC.authors_after.skew(), df_FAC.authors_p2w.skew()]

summary_FAC_authors_no_outlier = df_FAC.loc[df_FAC.authors_before <= df_FAC.authors_before.quantile(0.995), ['authors_before', 'authors_after', 'authors_p2w']].describe().T
summary_FAC_authors_no_outlier.drop('count', axis=1, inplace=True)

summary_FA_authors_no_outlier = df_FA.loc[df_FA.authors_before <= df_FA.authors_before.quantile(0.995), ['authors_before', 'authors_after', 'authors_p2w']].describe().T
summary_FA_authors_no_outlier.drop('count', axis=1, inplace=True)


summary_FA_authors.drop('count', axis=1, inplace=True)
summary_FAC_authors.drop('count', axis=1, inplace=True)
s1_a = summary_FAC_authors.style.format({col: '{:.2f}' for col in summary_FAC_authors.columns})
s2_a = summary_FA_authors.style.format({col: '{:.2f}' for col in summary_FA_authors.columns})
s3_a = summary_FAC_authors_no_outlier.style.format({col: '{:.2f}' for col in summary_FAC_authors_no_outlier.columns})
s4_a = summary_FA_authors_no_outlier.style.format({col: '{:.2f}' for col in summary_FA_authors_no_outlier.columns})

print('Unsuccessful Nominations')
display(s1_a)
print('\n\n')
print('Successful Nominations')
display(s2_a)


Unsuccessful Nominations


variable,mean,std,min,25%,50%,75%,max,kurtosis,skewness
authors_before,5.0,9.31,0.0,1.0,3.0,6.0,245.0,224.57,11.32
authors_after,7.0,7.47,0.0,2.0,5.0,9.0,84.0,19.36,3.26
authors_p2w,15.01,73.71,0.0,2.45,5.92,13.03,2880.0,985.1,27.93





Successful Nominations


variable,mean,std,min,25%,50%,75%,max,kurtosis,skewness
authors_before,3.93,5.05,0.0,1.0,3.0,5.0,85.0,41.03,4.68
authors_after,8.69,7.44,0.0,5.0,7.0,11.0,130.0,28.89,3.44
authors_p2w,7.59,7.94,0.0,2.6,5.33,10.14,94.48,13.2,2.74


The first thing we can verify is that the nomination of an article causes an influx of new authors. There are generally more unique authors during the nomination than in the time leading up to it.

At first glance, unsuccessful nominations seem to be worked on by more authors before the nomination than successful ones. This, however, is mainly driven by a few outliers. Kurtosis and skewness also indicate that the distribution for unsuccessful nominations is more skewed. If we look at the quantiles, we find that both groups have the same median and that at the 75% quantile unsuccessful candidates have only one more editor. This suggests that there is a similar amount of cooperation in preparing an article for its candidacy.
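How strongly a few outliers can move the mean while leaving the median untouched can be seen in a small sketch. The author counts below are invented, and the 0.9 cutoff is chosen only because the toy sample is tiny; the cell above uses the 0.995 quantile for the same kind of trimming.

```python
import pandas as pd

# Invented author counts; the last value is an extreme outlier.
authors_before = pd.Series([0, 1, 1, 2, 3, 3, 4, 5, 6, 245])

# Drop everything above a high quantile (toy cutoff, see lead-in).
cutoff = authors_before.quantile(0.9)
trimmed = authors_before[authors_before <= cutoff]

# Compare location and skewness with and without the outlier.
summary = pd.DataFrame({
    "all": authors_before.agg(["mean", "median", "skew"]),
    "trimmed": trimmed.agg(["mean", "median", "skew"]),
})

print(summary.round(2))
```

In this toy sample the mean drops sharply once the outlier is removed, whereas the median stays the same, which mirrors the pattern of similar medians but different means in the tables above.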

A slightly different picture emerges during the nomination period. Successful nominations attract more new editors than unsuccessful ones. From our perspective, the most probable explanation is that authors are more inclined to participate in promising articles. For example, some authors who specialize in tasks like reviewing sources or images will only start their review once an article has gathered some amount of support. Another possible explanation is simply the longer observation period. As we have noted above, successful nominations tend to take longer, so there is more time in which users can see the article and decide to contribute to it. We find support for this interpretation in the number of authors normalized to two weeks, where the gap between successful and unsuccessful nominations is considerably smaller (medians of 5.33 and 5.92, compared to 7 and 5 for the raw counts).