# Results

In [22]:
import pandas as pd 
import numpy as np
from IPython.display import display, display_html
%load_ext autoreload
from pprint import pprint
import matplotlib.pyplot as plt

from scipy.stats import median_test


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [23]:
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)

In [24]:
with open('./res/revisions_FAC.csv', 'r') as file:
    date_cols_0 = [col for col in file.readline().strip().split(';') if 'date' in col]
with open('./res/FAC_merged.csv', 'r') as file:
    date_cols_1 = [col for col in file.readline().strip().split(';') if 'date' in col]




In [25]:
df_FAC = pd.merge(
    pd.read_csv('./res/revisions_FAC.csv', sep=';', index_col=0, parse_dates=date_cols_0),
    pd.read_csv('./res/FAC_merged.csv', sep=';', index_col=0, parse_dates=date_cols_1),
    on='title')
df_FAC = pd.merge(
    df_FAC, pd.read_csv('./res/revisions_FAC_2w.csv', sep=';', index_col=0),
    on='title')

df_FAC['nomination_period'] = df_FAC['end_date'] - df_FAC['date_nomination']

df_FA = pd.merge(pd.read_csv('./res/revisions_FA.csv', sep=';', index_col=0, parse_dates=date_cols_0),
                  pd.read_csv('./res/FA_merged.csv', sep=';', index_col=0, parse_dates=date_cols_1),
                  on= 'title')
df_FA = pd.merge(
    df_FA, pd.read_csv('./res/revisions_FA_2w.csv', sep=';', index_col=0),
    on='title')

df_FA['nomination_period'] = df_FA['end_date'] - df_FA['date_nomination']

print(f'We were able to successfully retrieve information on {df_FA.shape[0]} successful and {df_FAC.shape[0]} unsuccessful nominations.')

d = df_FA.nomination_period.mean()
print(df_FAC.columns)

We were able to successfully retrieve information on 5152 successful and 3743 unsuccessful nominations.
Index(['title', 'edits_before', 'authors_before', 'edits_after',
       'authors_after', 'idx', 'date_nomination', 'date_last_comment',
       'has_duplicate', 'dates', 'end_date', 'start_date', 'edits_2w_later',
       'authors_2w_later', 'Unnamed: 3', 'nomination_period'],
      dtype='object')


## Erroneous Nomination Periods

A look at the nomination periods shows that some were not recorded correctly, which is most obvious for the longest ones. The cause is that we determine the nomination period from the earliest and latest comments in the discussion, which can lead to errors. Manually inspecting the discussions of the suspicious records, I found the following reasons for incorrect records:
* references to prior or future nominations
* copied discussions for example from peer reviews or article talk pages
* comments outside the nomination period 

The last case in particular is fairly common. It seems that these comments were added after the article had already been archived. For more recent nominations I was able to confirm this from the edit history of the archives. For a small number of older articles I consulted the revision history of the Featured Article Discussion. I opted to remove all questionable observations I could find. The phenomenon is very common in 2005 and 2006, so I excluded nominations from before 2007 from my analysis. Additional articles were removed after inspection. The results of this inspection are documented in res/article_inspection.txt.
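
The following is a minimal sketch of how such suspicious records can be surfaced for manual inspection. It uses a made-up toy frame: only the column names follow `df_FA`, while the titles and dates are invented for illustration. The cutoff date mirrors the exclusion of 2005 and 2006.

```python
import pandas as pd

# Toy data with invented titles and dates; only the column names follow df_FA.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'date_nomination': pd.to_datetime(['2005-03-01', '2008-05-10', '2010-01-02', '2012-07-15']),
    'end_date': pd.to_datetime(['2005-03-20', '2008-05-30', '2011-06-01', '2012-08-01']),
})
toy['nomination_period'] = toy['end_date'] - toy['date_nomination']

# Drop early nominations, where comments added after archiving are most common.
toy = toy.loc[toy['date_nomination'] >= pd.Timestamp('2007-01-01')]

# List the longest remaining periods as candidates for manual inspection.
suspects = (toy.sort_values('nomination_period', ascending=False)
               .head(2)[['title', 'nomination_period']])
print(suspects)
```

Sorting only produces a list of candidates. Whether a record is actually wrong still has to be decided by reading the discussion.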

In [26]:
n_FA = len(df_FA)
n_FAC = len(df_FAC)

df_FA= df_FA.loc[df_FA.date_nomination >= np.datetime64('2007-01-01')]
df_FAC= df_FAC.loc[df_FAC.date_nomination >= np.datetime64('2007-01-01')]

remove_FA = ['Cretaceous–Tertiary extinction event', 'M-theory', 'Beijing opera', 'Columbian mammoth', 'Elvis Presley',
'Baron Munchausen', 'Blue men of the Minch', 'Amphetamine', 'Meshuggah', 'John J. Crittenden']
remove_FAC = ['Vector space', 'Ecology', 'Menominee Tribe v. United States', 'Sesame Street research', 'History of KFC',
              'Ravenloft (module)', 'David Falk', '1997 Michigan Wolverines football team', 'God of War (video game)']

df_FAC =df_FAC[~df_FAC.title.isin(remove_FAC)]
df_FA =df_FA[~df_FA.title.isin(remove_FA)]
n_FA2 = len(df_FA)
n_FAC2 = len(df_FAC)
print(f'We dropped {n_FA-n_FA2} successful and {n_FAC-n_FAC2} unsuccessful nominations ({n_FA2} / {n_FAC2} remaining)')

We dropped 983 successful and 1292 unsuccessful nominations (4169 / 2451 remaining)


In [27]:
#df_FA_red.loc[df_FA_red.nomination_period.nlargest(30).index, ['title', 'nomination_period', 'date_nomination', 'date_last_comment', 'end_date']]
#df_FAC_red.loc[df_FAC_red.nomination_period.nlargest(30).index, ['title', 'nomination_period', 'date_nomination', 'date_last_comment', 'end_date']]

## Nomination Periods

In [28]:
sum_nomination = pd.DataFrame([df_FAC.nomination_period.describe(), df_FA.nomination_period.describe()]).T
sum_nomination.columns = ['Unsuccessful', 'Successful']
sum_nomination

| | Unsuccessful | Successful |
|---|---|---|
| count | 2451 | 4169 |
| mean | 16 days 16:42:06.722154 | 23 days 07:10:50.858239 |
| std | 14 days 10:18:24.233626 | 16 days 11:25:42.793709 |
| min | 0 days 00:07:00 | 1 days 01:30:00 |
| 25% | 5 days 16:38:00 | 10 days 13:43:37 |
| 50% | 12 days 20:51:59 | 18 days 22:38:04 |
| 75% | 23 days 06:18:35.500000 | 31 days 16:27:00 |
| max | 94 days 09:00:39 | 155 days 10:11:40 |


In general, successful nominations remain on the discussion page longer. This is to be expected. Among the shortest unsuccessful nominations we find articles with few sources or with pictures lacking license information. When it is obvious that an article does not fulfill the Featured Article Criteria, decisions are made swiftly. To promote an article, however, there needs to be some time for potential critics to respond. The shortest successful nominations took at least 2 days. The one exception (M-553, 1 day 1:30) is unusual because it had been unsuccessfully nominated 3 days earlier.

In [29]:
# By uncommenting these statements you can inspect the shortest and longest nomination periods
# n = 5
#df_FA.loc[df_FA.nomination_period.nsmallest(n).index, ['title', 'date_nomination', 'end_date', 'nomination_period']]
#df_FAC.loc[df_FAC.nomination_period.nsmallest(n).index, ['title', 'date_nomination', 'end_date','nomination_period']]
#df_FA.loc[df_FA.nomination_period.nlargest(n).index, ['title', 'date_nomination', 'end_date', 'nomination_period']]
#df_FAC.loc[df_FAC.nomination_period.nlargest(n).index, ['title', 'date_nomination', 'end_date', 'nomination_period']]


In [30]:
# You can remap the summary tables again, for easy row access using
# summary_FA_edits.rename(index=label_dict_edits)

## Edits

In [31]:
# Calculate edit intensity (edits per two weeks of nomination)
df_FAC['edits_p2w'] = df_FAC['edits_after']/(df_FAC['nomination_period']/np.timedelta64(2, 'W'))
df_FA['edits_p2w'] = df_FA['edits_after']/(df_FA['nomination_period']/np.timedelta64(2, 'W'))

summary_FAC_edits = pd.DataFrame([
    df_FAC.edits_before.describe(),
    df_FAC.edits_after.describe(),
    df_FAC.edits_p2w.describe(),
    df_FAC.edits_2w_later.describe()
])

summary_FA_edits = pd.DataFrame([
    df_FA.edits_before.describe(),
    df_FA.edits_after.describe(),
    df_FA.edits_p2w.describe(),
    df_FA.edits_2w_later.describe()
])

summary_FA_edits['kurtosis'] = [df_FA.edits_before.kurtosis(), df_FA.edits_after.kurtosis(), 
                                df_FA.edits_p2w.kurtosis(), df_FA.edits_2w_later.kurtosis()]
summary_FA_edits['skewness'] = [df_FA.edits_before.skew(), df_FA.edits_after.skew(), 
                                 df_FA.edits_p2w.skew(), df_FA.edits_2w_later.skew()] 
summary_FAC_edits['kurtosis'] = [df_FAC.edits_before.kurtosis(), df_FAC.edits_after.kurtosis(), 
                                 df_FAC.edits_p2w.kurtosis(), df_FAC.edits_2w_later.kurtosis()]
summary_FAC_edits['skewness'] = [df_FAC.edits_before.skew(), df_FAC.edits_after.skew(), 
                                 df_FAC.edits_p2w.skew(), df_FAC.edits_2w_later.skew()]


summary_FA_edits.drop('count', axis=1, inplace=True)
summary_FAC_edits.drop('count', axis=1, inplace=True)

idx_labels = ['Before Nom.', 'During Nom.', 'Edit Intensity', 'After Nom.']

label_dict_edits= {name: label for name, label in zip(summary_FAC_edits.index, idx_labels)}
label_dict_edits_inv= {v:k for k,v in label_dict_edits.items()}
summary_FA_edits.rename(index=label_dict_edits, inplace=True)
summary_FAC_edits.rename(index=label_dict_edits, inplace=True)
s1_edits = summary_FAC_edits.style.format({col: '{:.2f}' for col in summary_FAC_edits.columns})
s2_edits = summary_FA_edits.style.format({col: '{:.2f}' for col in summary_FA_edits.columns})

print('Unsuccessful Nominations')
display(s1_edits)
print('\n\n')

print('Successful Nominations')
display(s2_edits)

Unsuccessful Nominations


| | mean | std | min | 25% | 50% | 75% | max | kurtosis | skewness |
|---|---|---|---|---|---|---|---|---|---|
| Before Nom. | 35.67 | 63.11 | 0.0 | 3.0 | 15.0 | 43.5 | 1095.0 | 59.67 | 5.76 |
| During Nom. | 52.0 | 82.4 | 0.0 | 6.0 | 26.0 | 63.0 | 1143.0 | 40.51 | 4.85 |
| Edit Intensity | 73.5 | 167.22 | 0.0 | 8.91 | 31.36 | 75.06 | 2880.0 | 110.91 | 8.91 |
| After Nom. | 20.12 | 43.08 | 0.0 | 0.0 | 5.0 | 21.0 | 549.0 | 34.07 | 4.96 |





Successful Nominations


| | mean | std | min | 25% | 50% | 75% | max | kurtosis | skewness |
|---|---|---|---|---|---|---|---|---|---|
| Before Nom. | 46.16 | 71.51 | 0.0 | 3.0 | 18.0 | 58.0 | 947.0 | 17.49 | 3.25 |
| During Nom. | 75.56 | 97.76 | 0.0 | 20.0 | 46.0 | 93.0 | 1941.0 | 51.01 | 4.73 |
| Edit Intensity | 60.47 | 79.36 | 0.0 | 14.0 | 34.96 | 75.7 | 856.37 | 15.8 | 3.26 |
| After Nom. | 11.26 | 24.84 | 0.0 | 2.0 | 5.0 | 10.0 | 476.0 | 98.41 | 7.94 |


We think it is important to point out that the number of edits is a flawed metric. It does not account for the magnitude of the change to the article: a simple edit count does not distinguish between correcting a single typo and writing an additional section. Differences in the edit count could also just reflect differences in working style, i.e. how often an author saves their changes when editing an article.

On average, successful nominations receive slightly more edits in the two weeks leading up to the nomination than unsuccessful ones. The difference in the mean number of edits is, however, strongly driven by outliers: the medians differ by only 3 edits (15 vs. 18). So there seems to be little difference before the nomination.
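
As a small illustration of why the median is the more robust summary here, the sketch below adds a single extreme value to a handful of invented edit counts. The numbers are made up and are not taken from the data set.

```python
import pandas as pd

# Invented edit counts; 1095 plays the role of a single heavily edited article.
edits = pd.Series([0, 3, 10, 15, 20, 43, 60])
edits_with_outlier = pd.concat([edits, pd.Series([1095])], ignore_index=True)

# Compare how far each summary statistic moves when the outlier is added.
mean_shift = edits_with_outlier.mean() - edits.mean()
median_shift = edits_with_outlier.median() - edits.median()
print(f'mean shift: {mean_shift:.2f}, median shift: {median_shift:.2f}')
```

The mean moves by more than a hundred edits, while the median moves by only a few.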

It is fairly obvious that the nomination process increases activity on an article. In both cases we see more edits during the nomination than before. During the nomination, more edits are made on successful nominations, so in absolute terms more work is expended on them. The problem with this comparison is that it does not take the differences in the nomination period into account: for successful nominations we record edits over a longer time. For this reason we added another measure, the number of edits per two weeks, which also facilitates comparison with the two-week pre-nomination period. On this measure the two groups look more similar. The differences in the mean and in most quantiles become smaller.
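
The edits-per-two-weeks measure can be sketched on two invented records as follows. The column names follow the notebook, while the values and the `TWO_WEEKS` constant are assumptions made for illustration.

```python
import pandas as pd

TWO_WEEKS = pd.Timedelta(weeks=2)

# Two invented nominations: a typical one and an extremely short one.
toy = pd.DataFrame({
    'edits_after': [46, 1],
    'nomination_period': [pd.Timedelta(days=21), pd.Timedelta(minutes=7)],
})

# Edits divided by the nomination period expressed in units of two weeks.
toy['edits_p2w'] = toy['edits_after'] / (toy['nomination_period'] / TWO_WEEKS)
print(toy)
```

The second row shows a weakness of the measure: very short nominations inflate it. A single edit within a 7-minute nomination already corresponds to 2880 edits per two weeks. This would be consistent with the shortest unsuccessful nomination (7 minutes) and the maximum intensity of 2880 in the table of unsuccessful nominations, although the sketch does not verify that this is the record behind that maximum.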

Finally, we can see that interest in the article slows down after the nomination is closed. We observe the lowest number of edits of all observation periods. The effect is more pronounced for successful articles. This is reasonable, since Featured Articles are expected to be stable. It should be noted, however, that 50% of all featured articles receive at least 5 changes in the 2 weeks after the nomination. Without more detailed insight into these changes, we think it is futile to speculate whether these articles are truly stable. Many unsuccessful nominations are still frequently edited. It might be interesting to investigate how the activity develops over a longer time period.

In [32]:
df_FAC['authors_p2w'] = df_FAC['authors_after']/(df_FAC['nomination_period']/np.timedelta64(2, 'W'))
df_FA['authors_p2w'] = df_FA['authors_after']/(df_FA['nomination_period']/np.timedelta64(2, 'W'))


summary_FAC_authors = pd.DataFrame([
    df_FAC.authors_before.describe(),
    df_FAC.authors_after.describe(),
    df_FAC.authors_p2w.describe(),
    df_FAC.authors_2w_later.describe()
])
summary_FA_authors = pd.DataFrame([
    df_FA.authors_before.describe(),
    df_FA.authors_after.describe(),
    df_FA.authors_p2w.describe(),
    df_FA.authors_2w_later.describe()
])

summary_FA_authors['kurtosis'] = [df_FA.authors_before.kurtosis(), df_FA.authors_after.kurtosis(),
                                  df_FA.authors_p2w.kurtosis(), df_FA.authors_2w_later.kurtosis()]
summary_FA_authors['skewness'] = [df_FA.authors_before.skew(), df_FA.authors_after.skew(), 
                                  df_FA.authors_p2w.skew(), df_FA.authors_2w_later.skew()]
summary_FAC_authors['kurtosis'] = [df_FAC.authors_before.kurtosis(), df_FAC.authors_after.kurtosis(),
                                   df_FAC.authors_p2w.kurtosis(), df_FAC.authors_2w_later.kurtosis()]
summary_FAC_authors['skewness'] = [df_FAC.authors_before.skew(), df_FAC.authors_after.skew(), 
                                   df_FAC.authors_p2w.skew(), df_FAC.authors_2w_later.skew()]

summary_FAC_authors_no_outlier = df_FAC.loc[df_FAC.authors_before <= df_FAC.authors_before.quantile(0.995), ['authors_before', 'authors_after', 'authors_p2w']].describe().T
summary_FAC_authors_no_outlier.drop('count', axis=1, inplace=True)

summary_FA_authors_no_outlier = df_FA.loc[df_FA.authors_before <= df_FA.authors_before.quantile(0.995), ['authors_before', 'authors_after', 'authors_p2w']].describe().T
summary_FA_authors_no_outlier.drop('count', axis=1, inplace=True)

# Drop Redundant
summary_FA_authors.drop('count', axis=1, inplace=True)
summary_FAC_authors.drop('count', axis=1, inplace=True)

# Styling
idx_labels = ['Before Nom.', 'During Nom.', 'Author Intensity', 'After Nom.']
label_dict = {name: label for name, label in zip(summary_FAC_authors.index, idx_labels)}
summary_FA_authors.rename(index=label_dict, inplace=True)
summary_FAC_authors.rename(index=label_dict, inplace=True)
s1_a = summary_FAC_authors.style.format({col: '{:.2f}' for col in summary_FAC_authors.columns})
s2_a = summary_FA_authors.style.format({col: '{:.2f}' for col in summary_FA_authors.columns})
#s3_a = summary_FAC_authors_no_outlier.style.format({col: '{:.2f}' for col in summary_FAC_authors_no_outlier.columns})
#s4_a = summary_FA_authors_no_outlier.style.format({col: '{:.2f}' for col in summary_FA_authors_no_outlier.columns})

print('Unsuccessful Nominations')
display(s1_a)
print('\n\n')
print('Successful Nominations')
display(s2_a)


Unsuccessful Nominations


| | mean | std | min | 25% | 50% | 75% | max | kurtosis | skewness |
|---|---|---|---|---|---|---|---|---|---|
| Before Nom. | 5.0 | 9.31 | 0.0 | 1.0 | 3.0 | 6.0 | 245.0 | 224.57 | 11.32 |
| During Nom. | 7.0 | 7.47 | 0.0 | 2.0 | 5.0 | 9.0 | 84.0 | 19.36 | 3.26 |
| Author Intensity | 15.01 | 73.71 | 0.0 | 2.45 | 5.92 | 13.03 | 2880.0 | 985.1 | 27.93 |
| After Nom. | 4.16 | 6.88 | 0.0 | 0.0 | 2.0 | 5.0 | 108.0 | 53.55 | 5.36 |





Successful Nominations


| | mean | std | min | 25% | 50% | 75% | max | kurtosis | skewness |
|---|---|---|---|---|---|---|---|---|---|
| Before Nom. | 3.93 | 5.05 | 0.0 | 1.0 | 3.0 | 5.0 | 85.0 | 41.03 | 4.68 |
| During Nom. | 8.69 | 7.44 | 0.0 | 5.0 | 7.0 | 11.0 | 130.0 | 28.89 | 3.44 |
| Author Intensity | 7.59 | 7.94 | 0.0 | 2.6 | 5.33 | 10.14 | 94.48 | 13.2 | 2.74 |
| After Nom. | 4.59 | 6.74 | 0.0 | 2.0 | 3.0 | 5.0 | 117.0 | 66.08 | 6.33 |


We can see that the nomination also draws new contributors to the article. Interestingly, there is little difference between successful and unsuccessful nominations in the number of contributors before and after the nomination, so there is little difference in cooperation in the time leading up to the nomination. Successful nominations, however, attract more additional authors during the Featured Article process. Again, this can be partly attributed to their longer nomination period. Finally, we see that a similar number of people work on an article before and after the nomination. It would certainly be interesting to investigate whether these are the same editors.
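
One possible way to approach the question of whether the same editors are involved is to compare the sets of author names in two periods. The data files used here only contain author counts, so the sketch below assumes that per-period author names have been retrieved separately. The helper `author_overlap` and the example names are hypothetical.

```python
def author_overlap(before, after):
    """Return the Jaccard overlap between two collections of author names."""
    before, after = set(before), set(after)
    union = before | after
    if not union:
        # No authors in either period: define the overlap as zero.
        return 0.0
    return len(before & after) / len(union)


# Invented author names for one article, before and after its nomination.
authors_before = ['Alice', 'Bob', 'Carol']
authors_after = ['Bob', 'Carol', 'Dave', 'Erin']
print(author_overlap(authors_before, authors_after))
```

A value close to 1 would indicate that largely the same people edit the article in both periods, while a value close to 0 would indicate a change of contributors.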

#### "Inference..."

In [34]:


#median_test()
summary_FA_edits.rename(index=label_dict_edits_inv, inplace=True)
#summary_FA_edits.columns
#print(summary_FA_edits.index)

t_ed_before = median_test(df_FAC.edits_before, df_FA.edits_before)
t_ed_p2w = median_test(df_FAC.edits_p2w, df_FA.edits_p2w)

t_at_before = median_test(df_FAC.authors_before, df_FA.authors_before)
t_at_p2w = median_test(df_FAC.authors_p2w, df_FA.authors_p2w)
# median_test returns (statistic, p-value, grand median, contingency table); index 1 is the p-value
print(t_ed_before[1], t_ed_p2w[1])
print(t_at_before[1], t_at_p2w[1])

0.0008665492011884198 0.02859900840626778
0.11063968554751881 0.004361599452682988
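
For readers unfamiliar with `scipy.stats.median_test` (Mood's median test), the following sketch shows what the call returns, using synthetic counts rather than the nomination data. The group sizes, the Poisson parameters and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import median_test

rng = np.random.default_rng(0)
# Synthetic counts standing in for two groups of nominations.
group_a = rng.poisson(lam=15, size=500)
group_b = rng.poisson(lam=18, size=500)

# The result unpacks into the test statistic, the p-value, the grand median,
# and the contingency table of counts above / not above the grand median.
stat, p_value, grand_median, table = median_test(group_a, group_b)
print(f'statistic={stat:.2f}, p-value={p_value:.4f}, grand median={grand_median}')
print(table)
```

The test only asks whether the groups share a common median, so it is less sensitive to the heavy right tails seen in the summary tables than a comparison of means would be.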
