**Note: this notebook may not run as-is and needs some touch-ups to work properly. Its purpose is to show the steps involved in building an inverted index. The code is not optimized and may not run on a local machine; we recommend running it on a cloud service such as Google Cloud Platform.**

In [1]:
# if the following command generates an error, you probably didn't enable 
# the cluster security option "Allow API access to all Google Cloud services"
# under Manage Security → Project Access when setting up the cluster
!gcloud dataproc clusters list --region us-central1

NAME          PLATFORM  PRIMARY_WORKER_COUNT  SECONDARY_WORKER_COUNT  STATUS   ZONE           SCHEDULED_DELETE
cluster-748d  GCE       4                                             RUNNING  us-central1-a


# Imports & Setup

In [2]:
!pip install -q google-cloud-storage==1.43.0
!pip install -q graphframes


In [3]:
import pyspark
import sys
from collections import Counter, OrderedDict, defaultdict
import itertools
from itertools import islice, count, groupby
import pandas as pd
import os
import re
from operator import itemgetter
import nltk
from gensim.parsing.porter import PorterStemmer
from nltk.corpus import stopwords
from time import time
from pathlib import Path
import pickle
from google.cloud import storage

import hashlib
def _hash(s):
    return hashlib.blake2b(bytes(s, encoding='utf8'), digest_size=5).hexdigest()

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
# if nothing prints here you forgot to include the initialization script when starting the cluster
!ls -l /usr/lib/spark/jars/graph*

-rw-r--r-- 1 root root 247882 Mar  5 18:01 /usr/lib/spark/jars/graphframes-0.8.2-spark3.1-s_2.12.jar


In [5]:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf, SparkFiles
from pyspark.sql import SQLContext
from graphframes import *

In [6]:
spark

In [7]:
# Put your bucket name below and make sure you can access it without an error
bucket_name = 'bgu-ir-ass3-fab-stem' 
full_path = f"gs://{bucket_name}/"
paths=[]

client = storage.Client()

In [None]:
blobs = client.list_blobs(bucket_name)
for b in blobs:
    if b.name != 'graphframes.sh' and not b.name.startswith('postings_gcp') and not b.name.startswith('pr'):
        paths.append(full_path+b.name)

In [None]:
parquetFile = spark.read.parquet(*paths)
doc_text_pairs = parquetFile.select("text", "id").rdd
doc_title_pairs = parquetFile.select("title", "id").rdd

In [None]:
# Count number of wiki pages
N = parquetFile.count()
N

In [8]:
# if nothing prints here you forgot to upload the file inverted_index_gcp.py to the home dir
%cd -q /home/dataproc
!ls inverted_index_gcp.py

inverted_index_gcp.py


In [9]:
# adding our python module to the cluster
sc.addFile("/home/dataproc/inverted_index_gcp.py")
sys.path.insert(0,SparkFiles.getRootDirectory())

In [10]:
from inverted_index_gcp import InvertedIndex

In [None]:
english_stopwords = frozenset(stopwords.words('english'))
corpus_stopwords = ["category", "references", "also", "external", "links", 
                    "may", "first", "see", "history", "people", "one", "two", 
                    "part", "thumb", "including", "second", "following", 
                    "many", "however", "would", "became"]

all_stopwords = english_stopwords.union(corpus_stopwords)
RE_WORD = re.compile(r"""[\#\@\w](['\-]?\w){2,24}""", re.UNICODE)
stemmer = PorterStemmer()

In [None]:
NUM_BUCKETS = 124
def token2bucket_id(token):
    return int(_hash(token),16) % NUM_BUCKETS  # if we plan on running this again, create separate funcs for text, title

def word_count(text, id):
    ''' Count the frequency of each word in `text` (tf) that is not included in
    `all_stopwords` and return entries that will go into our posting lists.
    Parameters:
    -----------
    text: str
      Text of one document
    id: int
      Document id
    Returns:
    --------
    List of tuples
      A list of (token, (doc_id, tf)) pairs
      for example: [("Anarchism", (12, 5)), ...]
    '''
    tokens = [token.group() for token in RE_WORD.finditer(text.lower())]
    tokens = [stemmer.stem(token) for token in tokens if token not in all_stopwords]
    word_count = Counter(tokens)
    return [(token, (id, tf)) for token, tf in word_count.items()]

def reduce_word_counts(unsorted_pl):
    ''' Returns a sorted posting list by wiki_id.
    Parameters:
    -----------
    unsorted_pl: list of tuples
      A list of (wiki_id, tf) tuples
    Returns:
    --------
    list of tuples
      A sorted posting list.
    '''
    return sorted(unsorted_pl, key=itemgetter(0))

def partition_postings_and_write(postings, base_dir):
    ''' A function that partitions the posting lists into buckets, writes out
    all posting lists in a bucket to disk, and returns the posting locations for
    each bucket. Partitioning is done through `token2bucket_id` above. Writing
    to disk uses `write_a_posting_list`, a static method of the InvertedIndex
    class implemented in inverted_index_gcp.py.
    Parameters:
    -----------
    postings: RDD
      An RDD where each item is a (w, posting_list) pair.
    base_dir: str
      Base directory passed to `write_a_posting_list` for the posting files.
    Returns:
    --------
    RDD
      An RDD where each item is a posting locations dictionary for a bucket. The
      posting locations maintain a list for each word of file locations and
      offsets its posting list was written to. See `write_a_posting_list` for
      more details.
    '''
    buckets = postings.map(lambda x: (token2bucket_id(x[0]), x))
    buckets = buckets.groupByKey().mapValues(list)
    return buckets.map(lambda x: InvertedIndex.write_a_posting_list(x, base_dir, bucket_name))

def calculate_df(postings):
    ''' Takes a posting list RDD and calculate the df for each token.
    Parameters:
    -----------
    postings: RDD
      An RDD where each element is a (token, posting_list) pair.
    Returns:
    --------
    RDD
      An RDD where each element is a (token, df) pair.
    '''
    return postings.mapValues(len)

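def calculate_tf(postings):
    ''' Sum the tf of each token over all documents.
    `postings` holds (token, (doc_id, tf)) pairs; returns (token, total_tf). '''
    return postings.mapValues(lambda x: x[1]).reduceByKey(lambda a, b: a + b)

def calculate_tf_per_doc(postings):
    ''' Group (token, (doc_id, tf)) pairs into (token, {doc_id: tf}) pairs. '''
    return postings.groupByKey().mapValues(dict)

def calculate_document_length(postings):
    ''' Sum the tf of each document; returns (doc_id, length) pairs. '''
    return postings.map(lambda x: (x[1][0], x[1][1])).reduceByKey(lambda a, b: a + b)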
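
The tokenization and bucketing steps above can be tried without a cluster. The sketch below is only an illustration: it reuses the same regular expression and hash as `word_count` and `token2bucket_id`, but `TOY_STOPWORDS`, `toy_word_count` and `toy_bucket_id` are hypothetical names, the stop-word set is a tiny stand-in for the NLTK list, and stemming is skipped.

```python
import hashlib
import re
from collections import Counter

# Same pattern as RE_WORD above: tokens of at least 3 characters.
RE_WORD = re.compile(r"""[\#\@\w](['\-]?\w){2,24}""", re.UNICODE)
# Tiny stand-in for all_stopwords (assumption: the real set comes from NLTK).
TOY_STOPWORDS = frozenset({"the", "and", "also", "see"})
NUM_BUCKETS = 124

def toy_word_count(text, doc_id):
    """Return (token, (doc_id, tf)) pairs; stemming is skipped in this sketch."""
    tokens = [m.group() for m in RE_WORD.finditer(text.lower())]
    counts = Counter(t for t in tokens if t not in TOY_STOPWORDS)
    return [(token, (doc_id, tf)) for token, tf in counts.items()]

def toy_bucket_id(token):
    """Map a token to a bucket with the same hash that token2bucket_id uses."""
    digest = hashlib.blake2b(token.encode("utf8"), digest_size=5).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

pairs = toy_word_count("The anarchism page; see also anarchism and politics", 12)
print(pairs)  # [('anarchism', (12, 2)), ('page', (12, 1)), ('politics', (12, 1))]
print({token: toy_bucket_id(token) for token, _ in pairs})
```

Because the bucket id depends only on the token, all postings of a token land in the same bucket regardless of which worker processes them.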

In [None]:
# word counts map
word_counts_text = doc_text_pairs.flatMap(lambda x: word_count(x[0], x[1]))
postings_text = word_counts_text.groupByKey().mapValues(reduce_word_counts)
# filtering postings and calculate df
postings_text_filtered = postings_text.filter(lambda x: len(x[1])>50)
w2df_text = calculate_df(postings_text_filtered)
w2df_text_dict = w2df_text.collectAsMap()
# partition posting lists and write out
_ = partition_postings_and_write(postings_text_filtered, './postings_gcp/text/').collect()
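
To make the group, sort, filter and df steps of this cell concrete, here is a small pure-Python sketch on toy data. It mimics the RDD operations with dictionaries; `MIN_DF` and the sample pairs are assumptions chosen for illustration (the notebook keeps tokens with more than 50 postings for text).

```python
from collections import defaultdict
from operator import itemgetter

# Toy (token, (doc_id, tf)) pairs, in the shape word_count emits.
word_counts = [
    ("index", (7, 2)), ("index", (3, 1)), ("index", (5, 4)),
    ("rare", (3, 1)),
]
MIN_DF = 2  # toy threshold, much smaller than the notebook's 50

# groupByKey equivalent: collect the postings of each token.
grouped = defaultdict(list)
for token, posting in word_counts:
    grouped[token].append(posting)

# mapValues(reduce_word_counts): sort every posting list by doc id.
postings = {t: sorted(pl, key=itemgetter(0)) for t, pl in grouped.items()}
# filter: keep tokens whose posting list is longer than the threshold.
filtered = {t: pl for t, pl in postings.items() if len(pl) > MIN_DF}
# calculate_df: the document frequency is the posting-list length.
w2df = {t: len(pl) for t, pl in filtered.items()}

print(filtered)  # {'index': [(3, 1), (5, 4), (7, 2)]}
print(w2df)      # {'index': 3}
```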

In [None]:
# word counts map
word_counts_title = doc_title_pairs.flatMap(lambda x: word_count(x[0], x[1]))
postings_title = word_counts_title.groupByKey().mapValues(reduce_word_counts)
# filtering postings and calculate df
postings_title_filtered = postings_title.filter(lambda x: len(x[1])>10)
w2df_title = calculate_df(postings_title_filtered)
w2df_title_dict = w2df_title.collectAsMap()
# partition posting lists and write out
_ = partition_postings_and_write(postings_title_filtered, './postings_gcp/title/').collect()

In [None]:
def collect_posting_locs(prefix):
    ''' Merge the posting locations of every *.pickle blob under `prefix`. '''
    super_posting_locs = defaultdict(list)
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if not blob.name.endswith("pickle"):
            continue
        with blob.open("rb") as f:
            posting_locs = pickle.load(f)
        for k, v in posting_locs.items():
            super_posting_locs[k].extend(v)
    return super_posting_locs

# collect all posting lists locations into one super-set per index
super_posting_text_locs = collect_posting_locs('postings_gcp/text')
super_posting_title_locs = collect_posting_locs('postings_gcp/title')
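
The merge performed in this cell can be sketched locally as follows. The per-bucket dictionaries, file names and offsets are made up for illustration, and the pickles are kept in memory instead of being read from the bucket.

```python
import pickle
from collections import defaultdict

# Two hypothetical per-bucket dictionaries: token -> [(file_name, offset), ...].
# "index" appears in both only to show that extend() concatenates locations.
bucket_a = {"index": [("0_000.bin", 0)], "search": [("0_000.bin", 120)]}
bucket_b = {"rank": [("1_000.bin", 0)], "index": [("1_000.bin", 64)]}

# Round-trip through pickle to mimic reading the *.pickle blobs.
serialized = [pickle.dumps(bucket_a), pickle.dumps(bucket_b)]

merged = defaultdict(list)
for raw in serialized:
    for token, locs in pickle.loads(raw).items():
        merged[token].extend(locs)

print(dict(merged))
```

Using `extend` rather than assignment means locations are never overwritten if a token shows up in more than one file.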

In [None]:
# =============================================
# takes a while so run only when you need it!!!
# =============================================

# calculate document lengths:
title_doc_length = calculate_document_length(word_counts_title)
text_doc_length = calculate_document_length(word_counts_text)

title_length = title_doc_length.collectAsMap()
document_length = text_doc_length.collectAsMap()

total_title_length = title_doc_length.map(lambda x: x[1]).sum()
total_title_docs = title_doc_length.count()

# Calculate total length and total number of documents for texts:
total_text_length = text_doc_length.map(lambda x: x[1]).sum()
total_text_docs = text_doc_length.count()

# Calculate avdl:
average_title_length = total_title_length / total_title_docs
average_text_length = total_text_length / total_text_docs
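
As a worked example of the length statistics computed in this cell, the following sketch derives per-document lengths and the average document length (avdl) from toy `(token, (doc_id, tf))` pairs. The data is invented; the arithmetic mirrors `calculate_document_length` followed by the sum and count.

```python
from collections import defaultdict

# Toy (token, (doc_id, tf)) pairs in the shape word_count emits.
word_counts = [
    ("index", (1, 3)), ("search", (1, 2)),
    ("index", (2, 1)),
    ("rank", (3, 4)), ("search", (3, 2)),
]

# reduceByKey equivalent: sum the term frequencies of each document.
doc_length = defaultdict(int)
for _token, (doc_id, tf) in word_counts:
    doc_length[doc_id] += tf

total_length = sum(doc_length.values())
avdl = total_length / len(doc_length)

print(dict(doc_length))  # {1: 5, 2: 1, 3: 6}
print(avdl)              # 4.0
```

Note that lengths are counted after stop-word removal, and a document that yields no tokens never appears in the result, so the document count used for the average can be smaller than `N`.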

In [None]:
def generate_graph(pages):
  ''' Compute the directed graph generated by wiki links.
  Parameters:
  -----------
    pages: RDD
      An RDD where each row consists of one Wikipedia article with 'id' and
      'anchor_text'.
  Returns:
  --------
    edges: RDD
      An RDD where each row represents an edge in the directed graph created by
      the Wikipedia links. The first entry is the source page id and the
      second entry is the destination page id. No duplicates should be present.
    vertices: RDD
      An RDD where each row represents a vertex (node) in the directed graph
      created by the Wikipedia links. No duplicates should be present.
  '''
  # one (src, dst) pair per outgoing link; each anchor_text entry starts with the destination id
  edges = pages.flatMapValues(lambda x: x).mapValues(lambda x: x[0]).distinct()
  vertices = edges.flatMap(lambda x: x).distinct().map(lambda x: (x, ))
  return edges, vertices
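
The same edge and vertex extraction can be checked on toy data without Spark. The sketch assumes each `anchor_text` entry is a `(destination_id, link_text)` pair, which is the shape `generate_graph` indexes into; the pages themselves are invented.

```python
# Toy pages: (page_id, anchor_text), where anchor_text is a list of
# (destination_id, link_text) entries.
pages = [
    (1, [(2, "two"), (3, "three"), (2, "two again")]),
    (2, [(3, "three")]),
    (3, []),
]

# flatMapValues + mapValues(x[0]) + distinct: one (src, dst) pair per link.
edges = {(src, link[0]) for src, links in pages for link in links}
# flatMap + distinct: every id that appears as a source or a destination.
vertices = {page_id for edge in edges for page_id in edge}

print(sorted(edges))     # [(1, 2), (1, 3), (2, 3)]
print(sorted(vertices))  # [1, 2, 3]
```

Since vertices are derived from the edges, a page with no incoming and no outgoing links does not appear in the graph at all.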

In [None]:
pages_links = parquetFile.select ("id","anchor_text").rdd
# construct the graph 
edges, vertices = generate_graph(pages_links)
# compute PageRank
edgesDF = edges.toDF(['src', 'dst']).repartition(124, 'src')
verticesDF = vertices.toDF(['id']).repartition(124, 'id')
g = GraphFrame(verticesDF, edgesDF)
pr_results = g.pageRank(resetProbability=0.15, maxIter=6)
pr = pr_results.vertices.select("id", "pagerank")
page_rank = {row['id']: row['pagerank'] for row in pr.collect()}
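
For intuition about what the `pageRank` call computes, here is a textbook power-iteration sketch in plain Python. It is not GraphFrames' implementation: the function name, the dangling-node handling and the normalization (ranks sum to 1) are assumptions of this sketch, and GraphFrames scales its scores differently. Only the parameter values 0.15 and 6 are taken from the cell above.

```python
def pagerank(edges, reset_probability=0.15, max_iter=6):
    """Textbook power iteration; ranks sum to 1 (not GraphFrames' scaling)."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        # Every node starts with its share of the reset probability.
        new_rank = {n: reset_probability / len(nodes) for n in nodes}
        for src, targets in out_links.items():
            if targets:
                share = (1 - reset_probability) * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = (1 - reset_probability) * rank[src] / len(nodes)
                for n in nodes:
                    new_rank[n] += share
        rank = new_rank
    return rank

ranks = pagerank([(1, 2), (1, 3), (2, 3), (3, 1)])
print(ranks)
```

In this toy graph page 3 receives links from both other pages, so it ends up ranked above page 2, which receives only half of page 1's outgoing weight.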

In [None]:
# ======================================================
# not needed if you download the pickles from the bucket
# ======================================================
 
# Create inverted index instance
inverted_title = InvertedIndex()
inverted_text = InvertedIndex()

# Adding the posting locations dictionary to the inverted index
inverted_title.posting_locs = super_posting_title_locs
inverted_text.posting_locs = super_posting_text_locs
# Add the token - df dictionary to the inverted index
inverted_title.df = w2df_title_dict
inverted_text.df = w2df_text_dict

inverted_title.corpus_size = N
inverted_text.corpus_size = N

inverted_title.doc_len = title_length
inverted_text.doc_len = document_length

inverted_title.avdl = average_title_length
inverted_text.avdl = average_text_length


# write the global stats out
inverted_title.write_index('.', 'index_title')
inverted_text.write_index('.', 'index_text')

In [None]:
# upload to gs
############################################################
# Only uncomment this if you updated the index properly!!!!#
############################################################
# index_title_src = "index_title.pkl"
# index_text_src = "index_text.pkl"
# index_title_dst = f'gs://{bucket_name}/postings_gcp/title/{index_title_src}'
# index_text_dst = f'gs://{bucket_name}/postings_gcp/text/{index_text_src}'
# !gsutil cp $index_title_src $index_title_dst
# !gsutil cp $index_text_src $index_text_dst

In [None]:
# requires index_title_dst / index_text_dst from the (commented-out) upload cell above
!gsutil ls -lh $index_title_dst
!gsutil ls -lh $index_text_dst

In [11]:
# path = '/home/dataproc'
for blob in client.list_blobs(bucket_name):
    if blob.name.endswith('.pkl'):
        file_name = blob.name[blob.name.rfind('/')+1:]
        blob.download_to_filename(file_name)

In [12]:
with open('inverted_title_v1.pkl', 'rb') as file:
    inverted_title = pickle.load(file)
    
with open('inverted_text_v1.pkl', 'rb') as file:
    inverted_text = pickle.load(file)

In [None]:
text_doc_len = inverted_text.doc_len.copy()
title_doc_len = inverted_title.doc_len.copy()

In [13]:
def update_attribute(instance, attribute_name, new_data):
    """
    Sets the specified attribute of an object instance, refusing to
    overwrite data that is already there.

    Args:
        instance: The object instance to modify.
        attribute_name (str): The name of the attribute to update.
        new_data: The new value to assign to the attribute.

    Raises:
        AttributeError: If the attribute already exists and is non-empty.
    """

    if hasattr(instance, attribute_name) and len(getattr(instance, attribute_name)) > 0:
        raise AttributeError(f"Object already has '{attribute_name}'")

    setattr(instance, attribute_name, new_data)

def update_pickle(instance, filename, new_data):
    for attr, data in new_data.items():
        update_attribute(instance, attr, data)
    instance._posting_list = []
    instance.write_index('.', filename)
    # NOTE: every index, text included, is uploaded under postings_gcp/title/
    index_src = f'{filename}.pkl'
    index_dst = f'gs://{bucket_name}/postings_gcp/title/{index_src}'
    !gsutil cp $index_src $index_dst

if False:
    text_new_data = {
    }

    title_new_data = {
    }
    
    update_pickle(inverted_text, 'inverted_text_v1', text_new_data)
    update_pickle(inverted_title, 'inverted_title_v1', title_new_data)

Copying file://inverted_text_v1.pkl [Content-Type=application/octet-stream]...
==> NOTE: You are uploading one or more large file(s), which would run          
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

\ [1 files][342.2 MiB/342.2 MiB]                                                
Operation completed over 1 objects/342.2 MiB.                                    
Copying file://inverted_title_v1.pkl [Content-Type=application/octet-str

At first we saved the PageRank scores in the indexes, but we later moved them to a separate file: PageRank is not part of the inverted index and should not be stored with it. We also decided to save the scores as a dictionary.
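
A minimal sketch of that separation, assuming the scores are kept as a plain `doc_id -> score` dictionary. The values below are invented and the file is written to a temporary directory; only the file name matches the one used in the cells that follow.

```python
import os
import pickle
import tempfile

# Hypothetical doc_id -> normalized PageRank mapping (values are made up).
normalized_pagerank = {12: 1.0, 25: 0.31, 39: 0.02}

path = os.path.join(tempfile.mkdtemp(), "normalized_pagerank.pkl")
with open(path, "wb") as f:
    pickle.dump(normalized_pagerank, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

# Documents without a score can fall back to 0.0 at query time.
print(loaded.get(12), loaded.get(999, 0.0))  # 1.0 0.0
```

Keeping the scores in their own pickle lets the index and the ranking signal be rebuilt and loaded independently.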

In [0]:
inverted_text.set_normalized_page_rank()

In [None]:
with open('normalized_pagerank.pkl', 'wb') as f:
    pickle.dump(inverted_text.pagerank_normalized, f)

In [None]:
inverted_title.set_normalized_page_rank()

In [None]:
with open('normalized_pagerank_title.pkl', 'wb') as f:
    pickle.dump(inverted_title.pagerank_normalized, f)

In [None]:
inverted_title.calculte_idf()

In [None]:
inverted_title.idf
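
The implementation of `calculte_idf` lives in inverted_index_gcp.py and is not shown here. As a rough illustration only, the sketch below uses one common definition, idf(t) = log10(N / df(t)); the formula, the corpus size and the df values are all assumptions and may differ from what the module actually computes.

```python
import math

corpus_size = 1000  # toy N
df = {"index": 10, "search": 100, "page": 1000}

# Assumed formula: idf(t) = log10(N / df(t)).
idf = {token: math.log10(corpus_size / d) for token, d in df.items()}
print(idf)
```

Under this definition a token that occurs in every document gets an idf of 0, and rarer tokens get larger weights.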

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.hist(inverted_text.get_normalized_page_rank(99.9).values(), density=True, bins=100)
plt.xlabel("Data Values")
plt.ylabel("Probability Density")
plt.yscale('log')
plt.ylim(10e-7, 10e1)
plt.title("Density Histogram")
plt.show()

In [None]:
plt.hist(inverted_text.get_normalized_page_rank(99.995).values(), bins=200)
plt.xlabel("Data Values")
plt.ylabel("Count")
plt.yscale('log')
plt.title("Histogram")
plt.show()

In [None]:
plt.hist(inverted_text.get_normalized_page_rank(99.999).values(), density=True, bins=100)
plt.xlabel("Data Values")
plt.ylabel("Probability Density")
plt.yscale('log')
plt.ylim(10e-7, 10e1)
plt.title("Density Histogram")
plt.show()
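
The numeric argument passed to `get_normalized_page_rank` looks like a percentile, but the method is defined in inverted_index_gcp.py and its behavior is not shown here. One plausible reading, given purely as an assumption, is that scores are capped at that percentile and then scaled to [0, 1]. The helper `clip_and_scale` below is hypothetical and uses a nearest-rank percentile on invented scores.

```python
import math

def clip_and_scale(scores, percentile):
    """Cap scores at the given percentile (nearest-rank), then scale to [0, 1].

    Assumes the cap is positive.
    """
    ordered = sorted(scores.values())
    # Nearest-rank percentile: smallest value with at least p% of the data
    # at or below it.
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    cap = ordered[rank - 1]
    return {doc_id: min(score, cap) / cap for doc_id, score in scores.items()}

toy_scores = {1: 0.2, 2: 0.4, 3: 0.5, 4: 1.0, 5: 50.0}
print(clip_and_scale(toy_scores, 80))
# {1: 0.2, 2: 0.4, 3: 0.5, 4: 1.0, 5: 1.0}
```

Capping at a high percentile keeps a handful of extremely well-linked pages from compressing every other score toward zero, which is also why the histograms above change shape as the percentile moves.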