This notebook contains the code for an exploratory analysis of the data.

## Problem Statement

Given a user U with edit history (e1, e2, e3, ..., eN), where the last edit eN has been reverted, we need to predict whether the user will make a further edit e(N+1).

Assumption:
The edit history relates to a single article. Therefore, e1, e2, ..., eN are the edits made by one particular user to one particular article.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
train_data = pd.read_csv('../data/wikichallenge/training.tsv', sep='\t', header=0)
namespaces = pd.read_csv('../data/wikichallenge/namespaces.tsv', sep='\t', header=0)
comments = pd.read_csv('../data/wikichallenge/comments.tsv', sep='\t', header=0)
categories = pd.read_csv('../data/wikichallenge/categories.tsv', sep='\t', header=0)
titles = pd.read_csv('../data/wikichallenge/titles.tsv', sep='\t', header=0)

In [None]:
print(train_data.columns)
print(namespaces.columns)
print(comments.columns)
print(categories.columns)
print(titles.columns)

In [None]:
print(train_data.shape)
print(train_data.dtypes)
train_data.head(n=10)

The _reverted_ column indicates whether an edit was reverted. We can check how many reverts there are in the training data.

In [None]:
# distplot is deprecated in recent seaborn releases; countplot shows the same class counts
sns.countplot(x='reverted', data=train_data)

Roughly 12.5% of the edits in the training data were reverted.
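The ratio can also be read off numerically rather than from the plot. The following is a minimal sketch on a toy frame; `toy` and its values are made up for illustration, and it assumes `reverted` is coded as 0/1, as the comparisons elsewhere in this notebook suggest.

```python
import pandas as pd

# Toy stand-in for train_data (illustrative values, not taken from the dataset)
toy = pd.DataFrame({'reverted': [0, 0, 0, 1, 0, 0, 0, 0]})

# With a 0/1 indicator, the mean is the fraction of reverted edits
revert_ratio = toy['reverted'].mean()
print(f"revert ratio: {revert_ratio:.3f}")

# Per-class shares, as an alternative view of the same imbalance
class_shares = toy['reverted'].value_counts(normalize=True)
print(class_shares)
```

On the real data, the same two calls would be applied to `train_data['reverted']`.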

The next thing to check is how many unique users have had at least one edit reverted, i.e. the number of distinct users with the reverted flag set to **True**.


In [None]:
all_unique_users = train_data['user_id'].unique()
reverted = train_data.loc[train_data['reverted'] == 1, 'user_id'].unique()
counts = [len(all_unique_users),len(reverted)]
plt.bar([0,1],counts)
plt.xticks([0,1],('all_users','reverted_users'))
plt.show()
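The bar chart can be complemented by computing the share directly. This is a sketch on a made-up frame (`toy` and its values are hypothetical) that assumes the same `user_id` and `reverted` columns with a 0/1 flag.

```python
import pandas as pd

# Toy stand-in for train_data (illustrative values, not taken from the dataset)
toy = pd.DataFrame({
    'user_id':  [1, 1, 2, 3, 3, 4],
    'reverted': [0, 1, 0, 0, 0, 0],
})

# True for each user who has at least one reverted edit
user_has_revert = toy.groupby('user_id')['reverted'].max() == 1

# Fraction of users with at least one reverted edit
share_reverted_users = user_has_revert.mean()
print(f"{share_reverted_users:.2%} of users have at least one reverted edit")
```

Grouping by user first avoids counting a user several times when more than one of their edits was reverted.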

This shows that users whose edits have been reverted at least once make up roughly 25% of the population. Next, we look at how many reverts are followed by another edit from the same user.

In [None]:
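# Reset the index after sorting so that index labels match row positions,
# which the iloc lookups in the next cell rely on
sorted_df = train_data.sort_values(by=['user_id','article_id','timestamp']).reset_index(drop=True)
sorted_df.head()
# Take the mask from sorted_df (not train_data) so it lines up with the sorted rows
reverts = sorted_df.index[sorted_df['reverted'] == 1].tolist()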

In [None]:
# Count reverted edits that were ('yes') or were not ('no') followed by
# another edit from the same user on the same article
counts = {'yes': 0, 'no': 0}

rows, cols = sorted_df.shape

for row in reverts:
    # A revert on the very last row has no following edit to compare against
    if row + 1 >= rows:
        counts['no'] += 1
        continue

    u_id = sorted_df.iloc[row]['user_id']
    a_id = sorted_df.iloc[row]['article_id']

    next_u_id = sorted_df.iloc[row + 1]['user_id']
    next_a_id = sorted_df.iloc[row + 1]['article_id']

    if u_id == next_u_id and a_id == next_a_id:
        counts['yes'] += 1
    else:
        counts['no'] += 1

plt.bar([0, 1], [counts['no'], counts['yes']])
plt.xticks([0, 1], ('no', 'yes'))
plt.show()
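The row-by-row loop can be slow on a large frame. As a possible alternative, the sketch below computes the same yes/no counts with a grouped `cumcount`. The frame `toy` and its values are hypothetical, and the timestamps are assumed to sort chronologically as strings, which may not hold for the real `timestamp` column.

```python
import pandas as pd

# Toy stand-in for train_data (illustrative values, not taken from the dataset)
toy = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 2, 3],
    'article_id': [10, 10, 10, 10, 20, 30],
    'timestamp':  ['2010-01-01', '2010-01-02', '2010-01-03',
                   '2010-01-01', '2010-01-05', '2010-01-04'],
    'reverted':   [0, 1, 0, 1, 0, 1],
})

toy = toy.sort_values(['user_id', 'article_id', 'timestamp']).reset_index(drop=True)

# For each edit, the number of later edits by the same user on the same article
edits_after = toy.groupby(['user_id', 'article_id']).cumcount(ascending=False)

# A reverted edit has a follow-up if at least one later edit exists in its group
has_followup = edits_after[toy['reverted'] == 1] > 0
followup_counts = {'yes': int(has_followup.sum()),
                   'no': int((~has_followup).sum())}
print(followup_counts)
```

Because the frame is sorted by user, article and time, "a later edit exists in the same group" is the same condition as "the next row has the same user and article", so the result should match the loop.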
    

In [None]:
sorted_df.head()

In [None]:
print(titles.shape)
print(titles.dtypes)
titles.head(n=10)

In [None]:
print(comments.shape)
print(comments.dtypes)
comments.head(n=10)

In [None]:
print(namespaces.shape)
print(namespaces.dtypes)
namespaces.head(n=10)

In [None]:
print(categories.shape)
print(categories.dtypes)
categories.head(n=10)