# ENGG*6600: Special Topics in Information Retrieval - Fall 2022
## Assignment 5: TextRank (Total: 100 points)

**Description**

This is a coding assignment in which you will implement the TextRank algorithm to extract keywords from a document. Basic proficiency in Python is recommended.

**Instructions**

* To start working on the assignment, first save the notebook to your Google Drive by clicking the *Copy to Drive* button. Alternatively, click the *Share* button in the top right corner, then click *Copy Link* under *Get Link* to obtain a link, and use it to copy this notebook to your Google Drive.

*   For questions with descriptive answers, please replace the text in the cell which states "Enter your answer here!" with your answer. If you are using mathematical notation in your answers, please define the variables.
*   You should implement all the functions yourself and should not use a library or tool for the computation.
*   For coding questions, you can add code where it says "enter code here" and execute the cell to print the output.
* To create the final PDF submission file, execute *Runtime->RunAll* from the menu to re-execute all the cells, then generate a PDF using *File->Print->Save as PDF*. Make sure that the generated PDF contains all the code and printed outputs before submission.
* To create the final Python submission file, click *File->Download .py*.


**Submission Details**

* Due date: Nov. 21, 2022 at 11:59 PM (EDT).
* The final PDF and python file must be uploaded on CourseLink.
* After copying this notebook to your Google Drive, please paste a link to it below. Use the same process given above to generate a link. ***You will not receive any credit if you don't paste the link!*** Make sure we can access the file.
***LINK:*** https://colab.research.google.com/drive/1DVRi8BJfCKcYcf18TnTIcJCfuExTLXL4?usp=sharing

**Academic Honesty**

Please follow the guidelines under the *Collaboration and Help* section in the first lecture.     

# Download input files

Please execute the cell below to download the input files.

In [None]:

import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)


import zipfile

download = drive.CreateFile({'id': '1nx6a70YBfOBT357ujg0MDI4TDj8yWheT'})
download.GetContentFile('HW05.zip')

with zipfile.ZipFile('HW05.zip', 'r') as zip_file:
    zip_file.extractall('./')
os.remove('HW05.zip')
# We will use HW05 as our working directory
os.chdir('HW05')

# Set the input file
doc = "covid_nyt.tok.clean_nostop"

# 1 : Initial Data Setup (20 points)

The input file consists of a single long document, a news article about Covid. The file has been pre-processed: punctuation, non-alphanumeric characters, and stopwords were removed, and the text was tokenized so that terms are space-separated. Please note that this file contains a single long line.

In the TextRank algorithm, we create a graph corresponding to the document, where each node is a term and an edge between two terms indicates that they co-occur within a window of size $w$.

In the cell below, you have to implement the following:

1) Generate the vocabulary from the text. This is the set of unique terms in the text. Each term must be mapped to a unique integer in [0, vocab_size-1].

2) Generate all term pairs that co-occur within a window. For this implementation, we will set the window size $w=2$. Please note that the windows overlap. For example, for the text "the sun rises", the term pairs are ["the","sun"] and ["sun","rises"].
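
The two steps above can be sketched on the example sentence as follows. This is only an illustrative sketch, not the required solution: the function name `build_vocab_and_pairs` and the `example_*` variables are hypothetical, and it assumes that a window of size $w$ pairs each term with the $w-1$ terms that follow it.

```python
def build_vocab_and_pairs(tokens, window=2):
    """Map each distinct token to an integer id and list co-occurring pairs."""
    # Ids are assigned in order of first appearance: 0, 1, ..., vocab_size-1.
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)

    # Pair token i with every later token that falls inside the same window.
    # Repeated pairs are kept, as the assignment asks for the entire set.
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return vocab, pairs


example_vocab, example_pairs = build_vocab_and_pairs("the sun rises".split())
print(example_vocab)  # {'the': 0, 'sun': 1, 'rises': 2}
print(example_pairs)  # [('the', 'sun'), ('sun', 'rises')]
```

With `window=2` only adjacent terms are paired, which matches the example in step 2.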


In [None]:
'''
In this function, iterate through the input file and store the vocabulary terms
and the term pairs that co-occur within a window of 2.
Return Variables:
vocab - dict which consists of term as key and the value is an integer [0,vocab_size-1]
termPairs - All term pairs which co-occur within a window size 2.
            This would be the entire set, without removing the repeating term pairs.
idVocab - list of words where the index corresponds to the vocab dict value for that word.
          This is used to map ids to words.
'''


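def genInit(doc, size):
    # The input file is a single long line of space-separated tokens.
    with open(doc) as f:
        words = f.read().split()

    # Assign ids in order of first appearance: 0, 1, ..., vocab_size-1.
    vocab = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    # Dicts preserve insertion order, so position i holds the term with id i.
    idVocab = list(vocab.keys())

    # Pair each term with the terms that follow it inside the same window.
    # For size=2 this yields exactly the adjacent pairs.
    termPairs = []
    for i in range(len(words)):
        for j in range(i + 1, min(i + size, len(words))):
            termPairs.append([words[i], words[j]])

    return vocab, termPairs, idVocab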

size = 2
vocab, termPairs, idVocab = genInit(doc, size)

print('Total number of unique terms in the collection :{0}'.format(len(vocab)))
print('Total number of term pairs :{0}'.format(len(termPairs)))


Total number of unique terms in the collection :1640
Total number of term pairs :3464


# 2 : Transition Matrix Creation (40 points)

---



Let the vocab_size be $n$. The TextRank algorithm creates a weighted graph based on term co-occurrences. However, in this assignment, for the sake of simplicity, we will assume that the edges are unweighted and undirected.

In the cell below, implement the following steps:

1) Create an $n \times n$ transition matrix $M$ where $M_{ij}=1$ if the $i$-th and the $j$-th words co-occur within a window, and $0$ otherwise. This can be implemented by iterating through the term pairs list, getting the integer mapping for each term from the vocab dict, and using these integers to index into the matrix. Note that since this is an undirected graph, $M_{ij} = M_{ji}$ for all $i$ and $j$.

2) Normalize the matrix such that, $\forall i: \sum_j M_{ij}=1$ i.e., divide the row elements by the sum of the row elements.
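
As a small illustration of both steps, the sketch below builds and row-normalizes the matrix for the three-term example "the sun rises". The names `toy_transition_matrix`, `toy_vocab`, `toy_pairs`, and `toy_M` are hypothetical, and the sketch assumes every term has at least one neighbor, so no row sum is zero.

```python
import numpy as np


def toy_transition_matrix(vocab, pairs):
    """Build a symmetric 0/1 co-occurrence matrix and row-normalize it."""
    n = len(vocab)
    adj = np.zeros((n, n))
    for a, b in pairs:
        i, j = vocab[a], vocab[b]
        # Undirected graph: set both (i, j) and (j, i).
        adj[i, j] = 1.0
        adj[j, i] = 1.0
    # Divide each row by its sum; keepdims=True keeps an (n, 1) shape
    # so that broadcasting divides row-wise.
    # Assumption: no row sums to zero (every term has a neighbor).
    row_sums = adj.sum(axis=1, keepdims=True)
    return adj / row_sums


toy_vocab = {"the": 0, "sun": 1, "rises": 2}
toy_pairs = [("the", "sun"), ("sun", "rises")]
toy_M = toy_transition_matrix(toy_vocab, toy_pairs)
print(toy_M)
# [[0.  1.  0. ]
#  [0.5 0.  0.5]
#  [0.  1.  0. ]]
```

The middle term "sun" has two neighbors, so its row splits the probability mass equally between them.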


In [None]:
'''
In this function, create the transition matrix for the input document.
Return Variables:
init_matrix - transition matrix containing the 0 or 1 element values depending on co-occurrence.
'''
import numpy as np

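def createMatrix(vocab, termPairs):
    init_matrix = np.zeros((len(vocab), len(vocab)))
    for first, second in termPairs:
        # Look up ids in the vocab dict rather than searching the idVocab list.
        i, j = vocab[first], vocab[second]
        # The graph is undirected, so set both directions.
        init_matrix[i, j] = 1
        init_matrix[j, i] = 1

    return init_matrix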


'''
In this function, normalize the transition matrix such that sum of the elements of a row is 1.
Return Variables:
norm_matrix - normalized transition matrix
'''
def normalizeMatric(init_matrix):
    norm_matrix=(init_matrix/np.sum(init_matrix, axis=1).reshape(-1,1))

    return norm_matrix



init_matrix = createMatrix(vocab, termPairs)
norm_matrix = normalizeMatric(init_matrix)

print('Shape of the transition matrix :{0}'.format(np.shape(norm_matrix)))


Shape of the transition matrix :(1640, 1640)


# 3 : TextRank -- PageRank Algorithm over the Constructed Graph of Terms (40 points)

In the cell below, implement the PageRank Algorithm on the created graph by executing the following.

$$p^{t+1} = (\frac{\alpha}{n} + (1-\alpha) M)^T p^t $$

$t$ is the iteration number starting from $0$.

$p^t$ is an $n \times 1$ matrix where each row corresponds to a word.

$p^0$ is initialized randomly. In other words, $p^0$ is a random $n \times 1$ matrix whose elements sum to 1. You can generate a random $n \times 1$ matrix and divide all elements by their sum.

$\alpha$ is the random jump probability. Set $\alpha=0.15$.

The superscript $T$ denotes *transpose* of the given matrix.

Execute this for 50 iterations to ensure convergence for this particular example.

After the final iteration, display the top 10 terms with highest PageRank scores in $p^{50}$. These terms are supposed to be the keywords of the document.
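
One possible way to pick the highest-scoring terms from an $n \times 1$ score matrix is sketched below. The helper `top_k_terms` and the toy inputs are hypothetical and only illustrate the indexing; they are not part of the assignment template.

```python
import numpy as np


def top_k_terms(scores, id_to_term, k=10):
    """Return the k (term, score) pairs with the highest scores."""
    # Flatten the (n, 1) matrix to shape (n,) so argsort returns plain ids.
    flat = np.asarray(scores).ravel()
    # argsort is ascending, so reverse it and keep the first k ids.
    order = np.argsort(flat)[::-1][:k]
    return [(id_to_term[i], float(flat[i])) for i in order]


toy_scores = np.array([[0.2], [0.5], [0.3]])
toy_terms = ["the", "sun", "rises"]
print(top_k_terms(toy_scores, toy_terms, k=2))  # [('sun', 0.5), ('rises', 0.3)]
```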

**Hint:** as a sanity check to make sure that the implementation is correct, make sure that the sum of all elements in every $p^t$ is equal to 1 (there may be an epsilon difference due to floating point calculations, so you can expect $1 \pm \epsilon$ where $\epsilon < 10^{-10}$).
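
The update rule and the sanity check from the hint can be sketched together on a three-term toy graph. This is a minimal sketch under stated assumptions, not the reference solution: `toy_pagerank`, its `seed` parameter, and the toy matrix are hypothetical, and the matrix passed in is assumed to be row-normalized already.

```python
import numpy as np


def toy_pagerank(M, alpha=0.15, iterations=50, seed=0):
    """Iterate p <- (alpha/n + (1 - alpha) * M)^T p, checking sum(p) == 1."""
    n = M.shape[0]
    rng = np.random.default_rng(seed)
    # Random start vector p^0, scaled so its entries sum to 1.
    p = rng.random((n, 1))
    p /= p.sum()
    # The update matrix does not change between iterations, so build it once.
    update = (alpha / n + (1 - alpha) * M).T
    for _ in range(iterations):
        p = update @ p
        # Sanity check from the hint: the scores must still sum to 1.
        assert abs(p.sum() - 1.0) < 1e-10
    return p


# Row-normalized matrix of the path graph the - sun - rises.
toy_M = np.array([[0.0, 1.0, 0.0],
                  [0.5, 0.0, 0.5],
                  [0.0, 1.0, 0.0]])
toy_p = toy_pagerank(toy_M)
print(toy_p.ravel(), toy_p.sum())
```

The sum is preserved because every row of $\frac{\alpha}{n} + (1-\alpha) M$ sums to $\alpha + (1-\alpha) = 1$ when $M$ is row-normalized. In this toy graph the middle term, which has two neighbors, ends up with the highest score.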

In [None]:

'''
In this function, implement PageRank Algorithm.
Return Variables:
wordWeights - n x 1 matrix of PageRank scores, one row per term.
              The top 10 terms are selected from it after the call.
'''

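def pageRank(norm_matrix, vocab, idVocab):
    n = len(vocab)
    alpha = 0.15  # random jump probability

    # Random start vector p^0, scaled so that its elements sum to 1.
    wordWeights = np.random.rand(n, 1)
    wordWeights = wordWeights / wordWeights.sum()

    # The update matrix is the same in every iteration, so build it once.
    update = (alpha / n + (1 - alpha) * norm_matrix).transpose()
    for _ in range(50):
        wordWeights = np.matmul(update, wordWeights)
    return wordWeights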


wordWeights = pageRank(norm_matrix, vocab, idVocab)

'''
Hint: You don't have to display the word weight values. This is only for debugging.
The weight of the top keyword is in the interval [0.0090,0.0095].
The weight of the 10th top keyword is in the interval [0.0040,0.0045].
'''
print('Top 10 keywords :{0}'.format(np.array(idVocab)[np.argsort(wordWeights,axis=0)[::-1][:10]]))



Top 10 keywords :[['said']
 ['people']
 ['vaccine']
 ['coronavirus']
 ['pandemic']
 ['schools']
 ['health']
 ['mr']
 ['new']
 ['vaccines']]


In [None]:
np.sort(wordWeights,axis=0)[::-1][:10]

array([[0.00947376],
       [0.00894989],
       [0.00680972],
       [0.00580891],
       [0.00552994],
       [0.00549831],
       [0.00541205],
       [0.00532733],
       [0.00486775],
       [0.00479199]])