# INFO 4271 - Exercise 6 - Link Analysis

Issued: May 20, 2025

Due: May 26, 2025

Please submit this filled sheet via Ilias by the due date.

---

# 1. Co-Linking Similarity 
The directed graph of resource pointers (e.g., hyperlinks on the Internet, or citations in academic publishing) implicitly encodes topic information but can be much cheaper to process than the content words of the individual documents.

a) Implement a document similarity measure based only on graph topology, assuming that documents are similar if they link to similar documents.

In [16]:
#An example graph topology. Each entry represents a document alongside the outgoing links found in its content.
graph = {'D1' : ['D14', 'D16'],
		 'D2' : ['D5', 'D6', 'D7'],
		 'D3' : ['D4', 'D14', 'D15', 'D18', 'D19'],
		 'D4' : ['D2', 'D9', 'D14'],
		 'D5' : ['D2', 'D8', 'D17'],
		 'D6' : ['D3', 'D8', 'D12', 'D15'],
		 'D7' : ['D3', 'D19'],
		 'D8' : ['D1', 'D2', 'D3', 'D5', 'D9', 'D10', 'D11', 'D13', 'D14', 'D15', 'D17', 'D19'],
		 'D9' : [],
		 'D10' : ['D1', 'D14', 'D19'],
		 'D11' : ['D6'],
		 'D12' : ['D9', 'D11', 'D13', 'D16', 'D18'],
		 'D13' : ['D2', 'D4', 'D18'],
		 'D14' : ['D2', 'D14'],
		 'D15' : ['D7'],
		 'D16' : ['D2', 'D10', 'D16'],
		 'D17' : ['D1', 'D4', 'D6', 'D7', 'D11', 'D12'],
		 'D18' : ['D2', 'D9', 'D14'],
		 'D19' : [],
		 'D20' : ['D12']
		}

#Measure the similarity between two documents x and y in a graph based on their outgoing links.
def sim_out(x, y, graph):
	set_x = set(graph[x])
	set_y = set(graph[y])
	if len(set_x) == 0 or len(set_y) == 0: return int(set_x == set_y)
	return len(set_x.intersection(set_y)) / len(set_x.union(set_y))

#Print a document similarity matrix
l = '\t'
for doc in graph:
	l += doc+'\t'
print(l)
for doc in graph:
	l = doc+'\t'
	for d in graph:
		v = sim_out(doc, d, graph)
		c = ""
		if v >= 0.99: c = "\33[32m"
		elif v > 0: c = "\33[33m"
		l += c + f"{v:.3f}\33[37m\t"
	print(l)

	D1	D2	D3	D4	D5	D6	D7	D8	D9	D10	D11	D12	D13	D14	D15	D16	D17	D18	D19	D20	
D1	1.000	0.000	0.167	0.250	0.000	0.000	0.000	0.077	0.000	0.250	0.000	0.167	0.000	0.333	0.000	0.250	0.000	0.250	0.000	0.000
D2	0.000	1.000	0.000	0.000	0.000	0.000	0.000	0.071	0.000	0.000	0.333	0.000	0.000	0.000	0.333	0.000	0.286	0.000	0.000	0.000
D3	0.167	0.000	1.000	0.143	0.000	0.125	0.167	0.214	0.000	0.333	0.000	0.111	0.333	0.167	0.000	0.000	0.100	0.143	0.000	0.000
D4	0.250	0.000	0.143	1.000	0.200	0.000	0.000	0.250	0.000

b) Now let us modify the above scheme to also use the documents' incoming links in the calculation of the similarity score.

In [18]:
#Measure the similarity between two documents x and y in a graph based on their incoming and outgoing links.
def sim_inout(x, y, graph):
    out = sim_out(x, y, graph)
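    # Documents that link to x and to y, respectively.
    set_x = {k for k, v in graph.items() if x in v}
    set_y = {k for k, v in graph.items() if y in v}
    # Jaccard overlap of the incoming-link sets; two empty sets count as identical.
    if len(set_x) == 0 or len(set_y) == 0:
        i = int(set_x == set_y)
    else:
        i = len(set_x.intersection(set_y)) / len(set_x.union(set_y))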
    return (out + i) / 2

#Print a document similarity matrix
l = '\t'
for doc in graph:
	l += doc+'\t'
print(l)
for doc in graph:
	l = doc+'\t'
	for d in graph:
		v = sim_inout(doc, d, graph)
		c = ""
		if v >= 0.99: c = "\33[32m"
		elif v > 0: c = "\33[33m"
		l += c + f"{v:.3f}\33[37m\t"
	print(l)

	D1	D2	D3	D4	D5	D6	D7	D8	D9	D10	D11	D12	D13	D14	D15	D16	D17	D18	D19	D20	
D1	1.000	0.056	0.183	0.225	0.125	0.100	0.100	0.038	0.083	0.250	0.250	0.183	0.125	0.292	0.100	0.125	0.125	0.125	0.200	0.000
D2	0.056	1.000	0.056	0.056	0.062	0.000	0.000	0.098	0.188	0.143	0.222	0.000	0.062	0.200	0.222	0.056	0.286	0.056	0.050	0.000
D3	0.183	0.056	1.000	0.071	0.125	0.062	0.083	0.232	0.083	0.292	0.100	0.156	0.292	0.139	0.250	0.000	0.175	0.071	0.200	0.

c) Discuss the differences between these two similarity score variants. What are the salient advantages and disadvantages they offer?

The implementation of these methods may vary. In my implementation, the second method gives more document pairs a similarity score > 0, but the scores are lower on average. More scores > 0 means that as many documents as possible receive a rank by which they can be compared and selected. Fewer scores > 0, on the other hand, shrink the candidate set and keep the documents with scores > 0 more relevant. The first method also reaches a score of 1 (a perfect match) more easily, so if only perfect matches are returned, it yields more results.
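
The trade-off described above can be checked on a small example. The following is a minimal sketch, not the code from the cells above: the five-document `toy` graph and the helpers `jaccard` and `summarize` are hypothetical and were introduced only for illustration. It counts, for each variant, how many document pairs score above 0, how many are (near-)perfect matches, and the mean score.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a or not b:
        return float(a == b)
    return len(a & b) / len(a | b)

def sim_out(x, y, graph):
    # Similarity based on outgoing links only.
    return jaccard(set(graph[x]), set(graph[y]))

def sim_inout(x, y, graph):
    # Average of outgoing-link and incoming-link similarity.
    in_x = {k for k, v in graph.items() if x in v}
    in_y = {k for k, v in graph.items() if y in v}
    return (sim_out(x, y, graph) + jaccard(in_x, in_y)) / 2

def summarize(sim, graph):
    """Return (#pairs with score > 0, #pairs with score >= 0.99, mean score)."""
    scores = [sim(x, y, graph) for x, y in combinations(graph, 2)]
    nonzero = sum(s > 0 for s in scores)
    perfect = sum(s >= 0.99 for s in scores)
    return nonzero, perfect, sum(scores) / len(scores)

# Hypothetical toy graph (an assumption, not the exercise graph).
toy = {
    'A': ['C', 'D'],
    'B': ['C', 'D'],
    'C': ['E'],
    'D': ['A'],
    'E': ['A'],
}

for name, sim in [('sim_out', sim_out), ('sim_inout', sim_inout)]:
    nonzero, perfect, mean = summarize(sim, toy)
    print(f"{name}: nonzero={nonzero} perfect={perfect} mean={mean:.3f}")
```

On this toy graph the outgoing-link variant finds two nonzero pairs, both perfect matches, while the combined variant finds three nonzero pairs, none of them perfect, with a lower mean score. Whether the same pattern holds on another graph depends on its topology.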

# 2. PageRank

The PageRank algorithm models page authoritativeness. Is it robust to tampering? Can you think of ways to game the PageRank scheme and give your website an artificially high score? What are ways to defend against such attacks?

Creating pages that link to your page could skew the results in its favor. To defend against this, pages that receive no incoming links could be removed. An attacker could get around that defense by having the created pages also link to each other. Unless large pages point to them, or they are large themselves, their scores will be rather small, so this could in turn be prevented by removing pages with low ranks.
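
The link-farm attack described above can be illustrated with a small simulation. This is a minimal sketch of a textbook power-iteration PageRank, not the implementation of any real search engine. The `pagerank` function, the damping factor of 0.85, the four-page `web` graph and the five `farm` pages are all assumptions made for the example.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank; dangling pages spread their rank evenly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is shared by all pages.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n
                    for node in nodes}
        # Every page passes its rank on in equal parts to the pages it links to.
        for node, links in graph.items():
            for target in links:
                new_rank[target] += damping * rank[node] / len(links)
        rank = new_rank
    return rank

# Hypothetical small web: 'target' has no incoming links at all.
web = {
    'home': ['news', 'shop'],
    'news': ['home'],
    'shop': ['home', 'news'],
    'target': ['home'],
}

# Link farm: five created pages that link to 'target', which links back to them.
farm = [f'farm{i}' for i in range(5)]
boosted_web = dict(web)
boosted_web['target'] = ['home'] + farm
for page in farm:
    boosted_web[page] = ['target']

baseline = pagerank(web)
boosted = pagerank(boosted_web)
print(f"target without farm: {baseline['target']:.3f}")
print(f"target with farm:    {boosted['target']:.3f}")
```

In this sketch the farm pages have no incoming links from the rest of the web, yet each still receives the baseline share `(1 - damping) / n` and forwards most of it to `target`. Because `target` links back to the farm, removing pages without incoming links would not catch it, which is the circumvention mentioned in the answer.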