In this notebook, we'll create a balanced data set with approx. 25K edited articles and the same number of unedited articles. We'll run logistic regression to see how well it predicts future edits when the training data is artificially balanced. We'll then test it on a test set with the real-life class balance (i.e. highly imbalanced).
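
To illustrate why the class balance matters when scoring, here is a minimal sketch on made-up labels (not the article data). The 1% edit rate and the always-"not edited" predictor are assumptions chosen only for illustration. On such data, plain accuracy looks excellent even though no edited article is ever found.

```python
import numpy as np

# Hypothetical imbalanced labels: 10 edited (1) and 990 unedited (0) articles.
y_true = np.array([1] * 10 + [0] * 990)
# A trivial "model" that predicts "not edited" for every article.
y_pred = np.zeros_like(y_true)

# Accuracy: share of all predictions that are correct.
accuracy = (y_pred == y_true).mean()
# Recall: share of truly edited articles that were predicted as edited.
recall = (y_pred[y_true == 1] == 1).mean()

print(accuracy, recall)
```

This is why a score obtained on the balanced set and a score obtained on the imbalanced set should not be compared directly.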

In [15]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib
import sklearn
from sklearn.model_selection import train_test_split

%matplotlib inline  
import matplotlib.pyplot as plt  
#import statsmodels.api as sm
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error
#from sklearn.model_selection import train_test_split


In [16]:
raw_data = pd.read_csv('aggregate-20160501.csv')


In [76]:
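# Drop columns that are entirely NaN; copy so that adding a column later cannot raise SettingWithCopyWarning.
dataset = raw_data.dropna(axis=1, how='all').copy()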

In [77]:
# Binary label: 1 if the article received at least one edit, else 0.
dataset['num_edits_binary'] = (dataset['num_edits'] > 0).astype(int)

In [78]:
feature_names = ["views_30d", 
        "edits_30d", 
        "minor_edits_30d", 
        "avg_size_30d", 
        "talk_views_30d",
        "talk_minor_edits_30d"]
label_name = 'num_edits_binary'

In [79]:
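np.random.seed(seed=13579)
# Pick half of the row positions at random for set 1; the remaining rows form set 2.
# The positions are passed to .loc below, which assumes the default RangeIndex from read_csv.
set1_idx = np.random.choice(range(len(dataset)), int(len(dataset) * .5), replace=False)
set2_idx = sorted(set(range(len(dataset))) - set(set1_idx))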

In [80]:
set1_X = dataset.loc[set1_idx, feature_names]
set1_Y = dataset.loc[set1_idx, label_name]

In [81]:
set2 = dataset.loc[set2_idx]

OK, we have divided the original set into two equal parts. We'll use one of them (set 2) to prepare a balanced data set.
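
As an aside, balancing by undersampling can be sketched on a toy frame as follows. This is only an illustration under stated assumptions: the `toy` data, the helper name `undersample_majority` and the seed are all hypothetical, and the helper draws the unedited rows at random, which is a variation rather than what the cells in this notebook do.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `set2`: 1000 rows, roughly 5% with at least one edit.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "views_30d": rng.integers(0, 1000, size=1000),
    "num_edits": rng.choice([0, 1, 2], size=1000, p=[0.95, 0.03, 0.02]),
})

def undersample_majority(df, label_col="num_edits", seed=0):
    """Keep every edited row plus an equally sized random sample of unedited rows."""
    edited = df[df[label_col] > 0]
    not_edited = df[df[label_col] == 0]
    # Assumes the unedited class is the larger one, as in the real data.
    sampled = not_edited.sample(n=len(edited), random_state=seed)
    # Concatenate both classes and shuffle the rows.
    return pd.concat([edited, sampled]).sample(frac=1, random_state=seed)

toy_balanced = undersample_majority(toy)
print(toy_balanced["num_edits"].gt(0).value_counts())
```

Sampling at random avoids depending on whatever order the unedited rows happen to be stored in.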

In [82]:
edited = set2[set2.num_edits > 0.0].copy(deep=True)
    

In [83]:
not_edited = set2[set2.num_edits == 0.0].copy(deep=True)

In [84]:
# Take the first 12528 unedited rows (presumably len(edited)); note this is not a random sample.
not_edited_selected = not_edited[0:12528]

In [85]:
balanced_set = pd.concat([edited, not_edited_selected])

In [86]:
from sklearn.utils import shuffle
balanced_set = shuffle(balanced_set)

In [90]:
balanced_set[feature_names].head()

          views_30d  edits_30d  minor_edits_30d   avg_size_30d  talk_views_30d  talk_minor_edits_30d
18031698       98.0        0.0              0.0   13394.565217            14.0                   0.0
21114328     6044.0        0.0              0.0  358992.608696             6.0                   0.0
    2425     7310.0        3.0              1.0  328943.913043            23.0                   0.0
25815959     4626.0        4.0              1.0  288581.565217             7.0                   0.0
     752        1.0        0.0              0.0         4684.0             0.0                   0.0


In [88]:
X_train, X_test, y_train, y_test = train_test_split(balanced_set[feature_names], balanced_set[label_name],
                                                    train_size=0.75, test_size=0.25)
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [89]:
model.score(X_test, y_test)

0.50255427841634737

OK, the score on the balanced data, 0.502, is at chance level (0.5 for two equally sized classes), which means our classifier is not working. This is a good reason to revise our set of features and tune other parameters.
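
One possible first check, sketched here on synthetic data rather than on the article data, is to compare the model against a trivial baseline and to standardize the features before fitting. The data below are made up (two informative features on very different scales), and all names such as `X_toy` and `scaled_model` are hypothetical. Whether scaling helps on the real features is not something this notebook has tested.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, roughly balanced stand-in data (NOT the article data):
# two informative features whose magnitudes differ by five orders.
rng = np.random.default_rng(0)
n = 2000
small = rng.normal(size=n)
large = rng.normal(size=n) * 1e5
X_toy = np.column_stack([small, large])
y_toy = (small + large / 1e5 + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_score = baseline.score(X_te, y_te)

# Logistic regression fitted on standardized features.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_tr, y_tr)
scaled_score = scaled_model.score(X_te, y_te)

print(baseline_score, scaled_score)
```

If a model does not beat the baseline score by a clear margin, as with the 0.502 above, it has learned nothing useful from the features.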