## Wikipedia Page Protections
This notebook explores page protections on Wikipedia via the MediaWiki API. It has two stages:
1. Accessing the Page Protection API
2. Analysis of page protection data (both descriptive statistics and learning a predictive model)

In [2]:
from copy import deepcopy
import json
import os
import time
import gzip  # necessary for decompressing dump file into text format

In [3]:
# Every language on Wikipedia has its own page restrictions table
# you can find all the dbnames (e.g., enwiki) here: https://www.mediawiki.org/w/api.php?action=sitematrix
# for example, you could replace the LANGUAGE parameter of 'enwiki' with 'arwiki' to study Arabic Wikipedia
LANGUAGE = 'enwiki'
# e.g., enwiki -> en.wikipedia (this is necessary for the API section)
SITENAME = LANGUAGE.replace('wiki', '.wikipedia')
# directory on PAWS server that holds Wikimedia dumps
DUMP_DIR = "/public/dumps/public/{0}/latest/".format(LANGUAGE)
DUMP_FN = '{0}-latest-page_restrictions.sql.gz'.format(LANGUAGE)

In [4]:
# The dataset isn't huge -- 1.1 MB -- so should be quick to process in full
!ls -shH "{DUMP_DIR}{DUMP_FN}"

ls: /public/dumps/public/enwiki/latest/enwiki-latest-page_restrictions.sql.gz: No such file or directory


In [5]:
# Inspect the first 1000 characters of the page protections dump to see what it looks like
!zcat "{DUMP_DIR}{DUMP_FN}" | head -46 | cut -c1-1000

zcat: can't stat: /public/dumps/public/enwiki/latest/enwiki-latest-page_restrictions.sql.gz (/public/dumps/public/enwiki/latest/enwiki-latest-page_restrictions.sql.gz.Z): No such file or directory


In [None]:
import random
#import mwapi 

## Accessing the Page Protection APIs
NOTE: I used the API rather than the dump to extract page protection data because it returns more complete data with fewer missing values. It was also the more time-efficient option.

In [None]:
print(SITENAME)

get_protection function
The get_protection function extracts the raw protection records for a Wikipedia page by making an HTTP request to the MediaWiki API and parsing the response. It accepts a page title as its parameter. Its purpose is to wrap the process of fetching page protections in a reusable unit of code.

In [None]:
import requests


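def get_protection(page_title):
  # Build the API endpoint from SITENAME (e.g., en.wikipedia) defined above
  URL = "https://{0}.org/w/api.php".format(SITENAME)

  PARAMS = {
      "action": "query",
      "format": "json",
      "prop": "info",
      "titles": page_title,
      "inprop": "protection"
  }

  R = requests.get(url=URL, params=PARAMS)
  R.raise_for_status()  # fail loudly on HTTP errors
  DATA = R.json()

  # the response maps page ids to page info; only one title was requested
  pages = DATA["query"]["pages"]
  return list(pages.values())[0].get('protection', [])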
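
To make the parsing step concrete, here is a minimal offline sketch of the response shape that get_protection relies on. The sample dictionary (page id, title and protection values) is made up for illustration, and `extract_protection` is a hypothetical helper name, not part of the notebook.

```python
# Hypothetical sample shaped like an action=query, prop=info, inprop=protection
# response. The page id, title and values are invented for illustration.
sample_response = {
    "query": {
        "pages": {
            "12345": {
                "pageid": 12345,
                "title": "Example",
                "protection": [
                    {"type": "edit", "level": "autoconfirmed", "expiry": "infinity"},
                    {"type": "move", "level": "sysop", "expiry": "infinity"},
                ],
            }
        }
    }
}


def extract_protection(response):
    """Return the protection list of the first page, or [] if there is none."""
    pages = response.get("query", {}).get("pages", {})
    if not pages:
        return []
    first_page = next(iter(pages.values()))
    return first_page.get("protection", [])


records = extract_protection(sample_response)
for record in records:
    print(record["type"], record["level"], record["expiry"])
```

Each record carries the three fields analysed later in the notebook: type, level and expiry.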

get_cities function
get_cities downloads a JSON document containing a list of countries and their associated cities, and iterates over it. It reuses get_protection to fetch the protection records of each city (and country), each of which is a page on Wikipedia, and saves them in a dataframe.

In [None]:
import requests
import pandas as pd

CITIES = 'https://raw.githubusercontent.com/russ666/all-countries-and-cities-json/master/countries.json'
df_cities = pd.DataFrame()

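def get_cities(df_cities):
  cities = requests.get(CITIES).json()
  rows = []
  for country, names in cities.items():
    # look up the country page as well as each of its cities
    for name in names + [country]:
      try:
        for protection in get_protection(name):
          print(protection)
          rows.append(protection)
      except Exception as e:
        print(e)
  # build the result once and return it; DataFrame.append was removed in pandas 2.0
  return pd.concat([df_cities, pd.DataFrame(rows)], ignore_index=True)

In [None]:
# run the download first, then save the result
df_cities = get_cities(df_cities)
df_cities.to_csv(r"cities_wiki.csv", index=False)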
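
Fetching one title per request is slow for a long list of cities. The MediaWiki API accepts several titles in one query, separated by `|` (up to 50 for ordinary clients). The sketch below only builds the request parameters and makes no network call; `chunked`, `build_params` and the placeholder titles are hypothetical names introduced for illustration.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_params(titles):
    """Build query parameters asking for the protection of several titles."""
    return {
        "action": "query",
        "format": "json",
        "prop": "info",
        "inprop": "protection",
        "titles": "|".join(titles),
    }


# Placeholder titles, standing in for the city names from the JSON document
titles = ["City {}".format(i) for i in range(120)]
batches = [build_params(batch) for batch in chunked(titles, 50)]
print(len(batches))
```

With batched requests the response contains several pages, so the parsing step would need to loop over every entry in `pages` rather than take only the first one.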

In [None]:
#importing libraries for analysis
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

import os, sys
import warnings
warnings.filterwarnings('ignore')

# load the data extracted from the API into the notebook for analysis, preprocessing, visualisation and predictive modelling

In [None]:
data = pd.read_csv(r'cities_wiki.csv')

In [None]:
data.head()

In [None]:
#to check for missing data
data.isnull().sum()

In [None]:
data.describe()

In [None]:
data.info()

## TYPE

In [None]:
sns.countplot(x='type', data=data)

Basic Details
Total counts of each level and each expiry, by protection type. Findings:
1. For level, sysop is the most common in the collected protection records, followed by autoconfirmed; extendedconfirmed is the least common.
2. For expiry, infinite (infinity) expiry is more common in the collected protection records than finite (non-infinity) expiry.


## LEVEL

In [None]:
sns.countplot(x='level', data=data)

In [None]:
np.unique(data['level'])

In [None]:
sns.countplot(x='level', data=data, hue='type')

## Relationship with level and type of protection
Move protection is more common than edit protection at the sysop level, while at the autoconfirmed and extendedconfirmed levels edit protection is more common than move protection. Whether a record is a move or an edit protection therefore does seem to matter for the level: move protections are concentrated at the sysop level, whereas the autoconfirmed and extendedconfirmed levels are used more for edit protections than for move protections.
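
The pattern read off the plot can also be checked numerically with a contingency table. The sketch below uses a small made-up frame in place of the real data; only the column names `type` and `level` match the notebook.

```python
import pandas as pd

# Toy records, invented for illustration; the real data comes from cities_wiki.csv
toy = pd.DataFrame({
    "type": ["edit", "move", "edit", "move", "edit", "move"],
    "level": ["autoconfirmed", "sysop", "autoconfirmed",
              "sysop", "sysop", "autoconfirmed"],
})

# raw counts of each level per protection type
counts = pd.crosstab(toy["level"], toy["type"])

# share of each level within a protection type (each column sums to 1)
shares = pd.crosstab(toy["level"], toy["type"], normalize="columns")

print(counts)
print(shares)
```

Normalising by column makes the two protection types comparable even when one of them has many more records than the other.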

In [None]:
# Two common forms of encoding are one-hot and label encoding (both are available in scikit-learn).
# Here we use the pandas method for one-hot encoding, because the levels should not be given an implied order.
data = pd.get_dummies(data, columns = ['level'])
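
As a quick illustration of what this step produces, the sketch below applies `pd.get_dummies` to a toy `level` column. The values are made up; only the column name matches the notebook.

```python
import pandas as pd

# Toy column, invented for illustration
toy = pd.DataFrame({
    "level": ["sysop", "autoconfirmed", "extendedconfirmed", "sysop"],
})

# one indicator column per distinct level; the original column is dropped
encoded = pd.get_dummies(toy, columns=["level"])

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one indicator set, so no ordering between the levels is implied.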

In [None]:
data.info()

## EXPIRY

In [None]:
sns.countplot(x='expiry', data=data)

In [None]:
np.unique(data['expiry'])

In [None]:
# convert 'expiry' from object to integer:
# 1 for infinite expiry ('infinity'), 0 for any finite expiry
data['expiry'] = (data['expiry'] == 'infinity').astype(int)
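
The binary encoding discards the actual expiry date. If the finite dates are needed later, one option is to keep them in a separate datetime column. This is a sketch on made-up values, assuming finite expiries are ISO 8601 timestamps as returned by the API; the variable names are hypothetical.

```python
import pandas as pd

# Toy expiry values, invented for illustration
expiry = pd.Series(["infinity", "2030-01-01T00:00:00Z", "infinity"])

# same binary flag as in the notebook: 1 for infinity, 0 otherwise
is_infinite = (expiry == "infinity").astype(int)

# mask out 'infinity' and parse the rest; masked entries become NaT
finite_expiry = pd.to_datetime(expiry.where(expiry != "infinity"), utc=True)

print(is_infinite.tolist())
print(finite_expiry)
```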

In [None]:
# relationship between expiry and type of protection
sns.countplot(x='expiry', data=data, hue='type')

## Relationship with expiry and type of protection
Move protection is more common than edit protection among records with infinite expiry, while edit protection is more common among records with finite (non-infinity) expiry. Whether a record is a move or an edit protection therefore seems to matter: move protections are more often set to infinity than edit protections.

In [None]:
data.info()

Convert the data into a more usable form by changing the 'type' column from object to integer. There are two types of protection, edit protection and move protection:
edit (protection of a page from being edited) becomes 1, and move (protection of a page from being moved) becomes 0.

In [None]:
# encode 'type' as an integer: 1 for edit protection, 0 for move protection
data['type'] = (data['type'] == 'edit').astype(int)

In [None]:
data.info()

In [None]:
data.shape

## Predictive Modelling
We've established that there is a clear relationship between level, expiry and type of protection. Now we want to see how accurately we can predict the protection type (move or edit) from the level and expiry. This can tell us which type of protection we would expect a Wikipedia page to have.
NOTE: the model presented below is very simplistic and only tells us about classification on this dataset.
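
When judging a simple model, it helps to know what accuracy a trivial rule would reach. The sketch below uses scikit-learn's `DummyClassifier`, which always predicts the majority class, on made-up labels; it is a point of comparison, not part of the original analysis.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels, invented for illustration: 8 of class 1, 2 of class 0
X = np.zeros((10, 1))  # the features are ignored by the dummy model
y = np.array([1] * 8 + [0] * 2)

# always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

baseline_accuracy = baseline.score(X, y)
print(baseline_accuracy)
```

A real model is only informative if its test accuracy clearly exceeds this baseline.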

In [None]:
# Now that the features have been cleaned and encoded,
# separate the target from the rest of the data.

y = data['type']
x = data.drop('type', axis=1)

In [None]:
# Next comes modelling with scikit-learn. The algorithms used are LogisticRegression and
# RandomForestClassifier. First, the data is split into train and test sets
# so that one part trains the model and the rest evaluates its performance.
# Here 80% is used for training and 20% for testing, via the train_test_split function in scikit-learn.

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,
                                                    y, 
                                                    test_size = 0.2,
                                                    random_state = 42)
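
Because the classes are unbalanced, a plain random split can leave the test set with very few examples of the rarer class. A possible variant, sketched here on made-up data, passes `stratify` so both splits keep the same class proportions. The variable names are chosen not to clash with the notebook's.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 15 of class 0, 5 of class 1
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array([0] * 15 + [1] * 5)

# stratify=y_toy keeps the 75/25 class ratio in both train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(y_tr), int(y_tr.sum()))
print(len(y_te), int(y_te.sum()))
```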

## Random forest classifier

In [None]:
rand = RandomForestClassifier(random_state= 42)
rand.fit(x_train, y_train) # model learning

# evaluating the train data using accuracy score 
print('Training score is:', rand.score(x_train, y_train))

# make your predictions on the test data
pred= rand.predict(x_test)

# evaluate the test data using accuracy score
print('Testing score is:', accuracy_score(y_test, pred))

In [None]:
# print the report so that it renders as a table rather than an escaped string
print(classification_report(y_test, pred))

## other analysis
Because the dataset is unbalanced, accuracy alone can be misleading. I therefore also evaluated the model with the quantities derived from the confusion matrix: on unbalanced data, precision, recall and F1-score describe the performance of the model better than the accuracy score does.
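
To show how these metrics relate to the confusion matrix, here is a small sketch on made-up labels and predictions (not results from this notebook).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels and predictions, invented for illustration
y_true = [1, 1, 1, 1, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0]

# rows are true classes, columns are predicted classes (labels sorted: 0, 1)
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

precision = precision_score(y_true, y_hat)  # tp / (tp + fp)
recall = recall_score(y_true, y_hat)        # tp / (tp + fn)

print(cm)
print(precision, recall)
```

Here precision and recall are both 0.75, while accuracy is 4/6, which shows that the three numbers can differ on the same predictions.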

## Logistic regression

In [None]:
Lr = LogisticRegression() # algorithm instantiation
Lr.fit(x_train, y_train) # model learning

# evaluating the train data using accuracy score 
print('Training score is: ', Lr.score(x_train, y_train))

# make your predictions on the test data
pred = Lr.predict(x_test)

# evaluate the test data using accuracy score
print('Testing score is: ', accuracy_score(y_test, pred))

In [None]:
# compute the F1 score of the predictions as a further check on model performance
f1_score(y_test, pred)

In [None]:
# classification report of the predictions
print(classification_report(y_test, pred))

## Future Analyses
A better predictive model needs more data and more variables. For example:
1. Consider a more advanced classifier, provided that a larger dataset is available.
2. Add protection types other than move and edit, e.g., create or upload protection.
3. Use other labels, such as the level or expiry of protection, as the prediction target, for example to predict:
   a. whether the expiry of a page protection will be infinity or not.
   b. whether the level of protection will be 'autoconfirmed', 'extendedconfirmed' or 'sysop'.