<a href="https://colab.research.google.com/github/OmriMan/Search-Engine-Wiki/blob/master/PageViewsProcces.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import bz2
from functools import partial
from collections import Counter
import pickle
from itertools import islice
from xml.etree import ElementTree
import codecs
import csv
import time
import os
import re
from pathlib import Path

Page view data for Wikipedia is available at https://dumps.wikimedia.org, along with documentation on how a page view is defined and how each line in the files is formatted. In this project we use page view data, provided by the course staff, for ALL of English Wikipedia for the month of August 2021, covering more than 10.7 million viewed articles. The code below shows how that data is generated.
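As a rough illustration of the line format, the sketch below splits one made-up line into fields. The positions of the article ID (field 3) and the monthly total (field 5) follow the `cut -d' ' -f3,5` call used later in this notebook; the names given to the other fields, the `PageViewLine` type, the `parse_line` helper, and the sample line itself are assumptions for illustration, so check the dump documentation before relying on them.

```python
import re
from typing import NamedTuple, Optional


class PageViewLine(NamedTuple):
    """Hypothetical container for the first five space-separated fields."""
    wiki: str           # field 1, e.g. 'en.wikipedia'
    title: str          # field 2 (assumed to be the article title)
    page_id: int        # field 3, the article ID
    access: str         # field 4 (assumed to describe the access type)
    monthly_views: int  # field 5, the monthly total


ASCII_DIGITS = re.compile(r'[0-9]+')


def parse_line(line: str) -> Optional[PageViewLine]:
    """Return the parsed fields, or None when the line is unusable."""
    parts = line.rstrip('\n').split(' ')
    if len(parts) < 5:
        return None
    wiki, title, page_id, access, total = parts[:5]
    # Mirror the notebook's filter: both values must be plain digit sequences.
    if not (ASCII_DIGITS.fullmatch(page_id) and ASCII_DIGITS.fullmatch(total)):
        return None
    return PageViewLine(wiki, title, int(page_id), access, int(total))


# Made-up line, not taken from the real dump.
sample = 'en.wikipedia Example_Article 12345 desktop 678 A10B20'
parsed = parse_line(sample)
print(parsed)
```

Lines with a non-numeric ID or too few fields come back as `None`, which corresponds to the rows the notebook discards.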

***
In the following code we:
* Download the file pageviews-202108-user.bz2, which holds the number of user views of each Wikipedia page for the month of August 2021. The temporary and cleaned output files are named after this file's stem.
* Filter out noise and keep only the page ID and the monthly number of views for each page, dropping any line in which either value is not a sequence of digits.
* Build a dictionary (Counter) that sums the number of views for each page ID, giving a mapping from page ID to total page views.
* Save this Counter to a binary file (pickle it).
*****
Expect this code to take some time to run: about 10-15 minutes on the home internet connection of an average student apartment in the D neighborhood of Be'er Sheva.
*****
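The steps above can also be sketched in pure Python, which avoids the shell tools and the temporary text file. This is a minimal sketch and not the notebook's own method: `count_english_pageviews` is a hypothetical helper, and the tiny compressed file it runs on is made up so that the example works without the 2.3GB download.

```python
import bz2
import re
import tempfile
from collections import Counter
from pathlib import Path

ASCII_DIGITS = re.compile(r'[0-9]+')


def count_english_pageviews(path):
    """Sum monthly views per article ID over en.wikipedia lines (hypothetical helper)."""
    counts = Counter()
    # In text mode, bz2.open decompresses the file lazily while we iterate.
    with bz2.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            # Same idea as grep "^en\.wikipedia".
            if not line.startswith('en.wikipedia'):
                continue
            parts = line.rstrip('\n').split(' ')
            if len(parts) < 5:
                continue
            page_id, total = parts[2], parts[4]  # fields 3 and 5
            # Keep only rows where both values are digit sequences.
            if ASCII_DIGITS.fullmatch(page_id) and ASCII_DIGITS.fullmatch(total):
                counts[int(page_id)] += int(total)
    return counts


# Made-up lines standing in for the real dump.
sample_lines = [
    'en.wikipedia Example_A 101 desktop 7 A7',
    'en.wikipedia Example_A 101 mobile-web 5 B5',
    'en.wikipedia Example_B null desktop 9 C9',
    'de.wikipedia Beispiel 101 desktop 4 D4',
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / 'sample-pageviews.bz2'
    with bz2.open(sample_path, 'wt', encoding='utf-8') as f:
        f.write('\n'.join(sample_lines) + '\n')
    sample_counts = count_english_pageviews(sample_path)

print(sample_counts)
```

On this sample, the two `en.wikipedia` rows for article 101 are summed into a single entry, while the row with a non-numeric ID and the `de.wikipedia` row are skipped.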

In [2]:
# Paths
# Using user page views (as opposed to spiders and automated traffic) for the 
# month of August 2021
pv_path = 'https://dumps.wikimedia.org/other/pageview_complete/monthly/2021/2021-08/pageviews-202108-user.bz2'
p = Path(pv_path) 
pv_name = p.name
pv_temp = f'{p.stem}-4dedup.txt'
pv_clean = f'{p.stem}.pkl'
# Download the file (2.3GB) 
!wget -N $pv_path
# Filter for English pages, and keep just two fields: article ID (3) and monthly 
# total number of page views (5). Then, remove lines with article id or page 
# view values that are not a sequence of digits.
!bzcat $pv_name | grep "^en\.wikipedia" | cut -d' ' -f3,5 | grep -P "^\d+\s\d+$" > $pv_temp
# Create a Counter (dictionary) that sums up the pages views for the same 
# article, resulting in a mapping from article id to total page views.
wid2pv = Counter()
with open(pv_temp, 'rt') as f:
  for line in f:
    wid, views = line.split()
    wid2pv[int(wid)] += int(views)
# write out the counter as binary file (pickle it)
with open(pv_clean, 'wb') as f:
  pickle.dump(wid2pv, f)


--2022-01-03 13:46:49--  https://dumps.wikimedia.org/other/pageview_complete/monthly/2021/2021-08/pageviews-202108-user.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.7, 2620:0:861:1:208:80:154:7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2503235912 (2.3G) [application/octet-stream]
Saving to: ‘pageviews-202108-user.bz2’


2022-01-03 13:55:32 (4.57 MB/s) - ‘pageviews-202108-user.bz2’ saved [2503235912/2503235912]



The following code shows how to read back the dictionary (Counter) that was saved by the code above.

In [3]:
# read in the counter
with open(pv_clean, 'rb') as f:
  wid2pv = pickle.load(f)
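Once loaded, the Counter behaves like a dictionary that returns 0 for missing keys, which is convenient when looking up articles with no recorded views. The sketch below uses a small made-up Counter and a temporary file in place of the real pickle; the IDs, counts, and the `toy_` names are illustrative only.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path

# Made-up article IDs and view counts, standing in for the real wid2pv.
toy_wid2pv = Counter({101: 12, 202: 30, 303: 3})

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / 'toy-pageviews.pkl'
    # Write and read back the Counter the same way the notebook does.
    with open(toy_path, 'wb') as f:
        pickle.dump(toy_wid2pv, f)
    with open(toy_path, 'rb') as f:
        loaded = pickle.load(f)

views_known = loaded[202]        # an ID that is present
views_missing = loaded[999]      # a missing ID gives 0 instead of a KeyError
top_two = loaded.most_common(2)  # IDs with the most views, highest first

print(views_known, views_missing, top_two)
```

Looking up a missing ID this way does not add it to the Counter, so repeated lookups leave the mapping unchanged.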