# Lab 12

# Dataset
Once again, we are using the Wikispeedia dataset, which consists of 4,604 Wikipedia articles chosen for an online version of the game (`articles.txt`) and all the links between those pages (`links.txt`). We’ve added two more files: the names of the articles for U.S. presidents (`presidents.txt`) and the names of the articles for countries (`contries.txt`).

# Jaccard Similarity

We learned about the Jaccard measure of similarity in class. It is an efficient and straightforward way to compare two sets. Today we will use Jaccard similarity to measure how similar two Wikipedia pages are by comparing the links on each page: two pages are similar if they contain many of the same links. Let’s call the set of links on the first page A and the set of links on the second page B. The Jaccard similarity is the ratio of the size of the intersection of the two sets to the size of their union. $$J=\frac{|A \cap B|}{|A \cup B|}$$
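As a quick sanity check of the formula, here is a minimal sketch on two small, made-up sets of link names. The sets and the helper name `jaccard_similarity` are illustrative and not part of the lab files, and treating two empty sets as having similarity 0 is an assumption made here to avoid dividing by zero.

```python
def jaccard_similarity(a, b):
    """Return |A & B| / |A | B| for two sets (0.0 if both are empty, by assumption)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical link sets for two pages.
page_a = {"United_States", "Michigan", "Richard_Nixon", "Vietnam_War"}
page_b = {"United_States", "Richard_Nixon", "Watergate_scandal"}

# Two shared links out of five distinct links overall.
print(jaccard_similarity(page_a, page_b))  # 0.4
```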

The Python program below will compare the Jaccard similarity of two Wikipedia articles. Try a few different pairs of articles and observe how the measurement changes.

In [None]:

'''
Read in list
'''
names=[]
links={}
for line in open("articles.txt"):
    if line[0]=="#":
        continue
    line=line.strip()
    names.append(line)
    links[line]=[]

'''
Read in links
'''
for line in open("links.txt"):
    if line[0]=="#":
        continue
    line=line.strip()
    splitline=line.split("\t")
    if len(splitline)!=2:
        continue
    a,b=splitline
    if a in links:
        links[a].append(b)

article1="Gerald_Ford"
article2="United_States"

s1=set(links[article1])
s2=set(links[article2])

union=s1|s2
# Guard against dividing by zero when neither article has any links
jaccard=float(len(s1&s2))/len(union) if union else 0.0

print("Jaccard similarity is: %f\n" % (jaccard))

# Clustering

We can use the Jaccard similarity to perform hierarchical clustering. One slight problem, though: clustering requires a *distance measure*. That means it should be small when the items are similar and large when they are different. Jaccard similarity has the *opposite* property. This is simple enough to solve, though. Jaccard similarity is a number between 0 and 1. Taking $$1-J(A,B)$$ results in a distance measure between A and B.
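The sketch below illustrates the conversion on three made-up pages and builds the flat list of pairwise distances, one entry per unordered pair. The page names, link sets, and the helper `jaccard_distance` are hypothetical, and two empty sets are assumed to be at distance 1.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """Return 1 - J(A, B); two empty sets are treated as distance 1.0 (assumption)."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)

# Hypothetical pages and their link sets.
pages = {
    "P1": {"x", "y", "z"},
    "P2": {"x", "y"},
    "P3": {"q"},
}
names = sorted(pages)

# One distance per unordered pair, in the order (0,1), (0,2), (1,2).
condensed = [jaccard_distance(pages[n1], pages[n2])
             for n1, n2 in combinations(names, 2)]

for (n1, n2), d in zip(combinations(names, 2), condensed):
    print(n1, n2, round(d, 3))
```

Pages that share most of their links end up close together (P1 and P2), while pages with nothing in common sit at the maximum distance of 1.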

The Python program below uses this method to calculate the distances between all the pairs of presidents in `presidents.txt`. It then plots a dendrogram. If it’s hard to read, you can open the file `dendrogram.pdf` for a better image. 

In [None]:
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

'''
Read in list
'''
names=[]
links={}
for line in open("presidents.txt"):
    if line[0]=="#":
        continue
    line=line.strip()
    names.append(line)
    links[line]=[]

'''
Read in links
'''
for line in open("links.txt"):
    if line[0]=="#":
        continue
    line=line.strip()
    splitline=line.split("\t")
    if len(splitline)!=2:
        continue
    a,b=splitline
    if a in links:
        links[a].append(b)

distances=[]
for i in range(len(names)):
    for j in range(i+1,len(names)):
        n1=names[i]
        n2=names[j]
        s1=set(links[n1])
        s2=set(links[n2])
        d=1.0-float(len(s1&s2))/len(s1|s2)
        distances.append(d)

link=linkage(distances,method='average')
fig = plt.figure(figsize=(30,10))
dn = dendrogram(link,labels=names,color_threshold=.8)
plt.savefig('dendrogram.pdf')
plt.show()

The agglomeration method used is "average". Try using a few different methods and observe how the resulting clusters change. You'll do that by changing the line that looks like this in the code above:

`link=linkage(distances,method='average')`

Swap out 'average' with something else. You can find the different methods listed on this web page:

`https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.linkage.html`
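To get a feel for what the method changes, the sketch below computes the distance between two small clusters under three of the rules described on that page: `'single'` (closest pair), `'complete'` (farthest pair), and `'average'` (mean over all cross-cluster pairs). The points and the helper `cluster_distance` are made up for illustration; `linkage` handles this bookkeeping internally.

```python
def cluster_distance(c1, c2, dist, method):
    """Distance between clusters c1 and c2 under a given agglomeration rule."""
    cross = [dist(p, q) for p in c1 for q in c2]
    if method == "single":
        return min(cross)              # closest pair across the clusters
    if method == "complete":
        return max(cross)              # farthest pair across the clusters
    if method == "average":
        return sum(cross) / len(cross) # mean over all cross-cluster pairs
    raise ValueError("unknown method: %s" % method)

# Hypothetical one-dimensional points; distance is the absolute difference.
def dist(p, q):
    return abs(p - q)

left = [0.0, 1.0]
right = [4.0, 6.0]

for method in ("single", "complete", "average"):
    print(method, cluster_distance(left, right, dist, method))
```

Because the rule decides which two clusters look closest at each merge, changing it can reorder the merges and reshape the dendrogram.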

# More Clustering

We can also use this method to cluster the countries dataset. You'll have to modify this line in the code above:

`for line in open("presidents.txt"):`

This is a slightly larger dataset, so you’ll have to be patient and scroll around the dendrogram to interpret it. Again, try a few different clustering methods to see how the results change.

Clustering the entire Wikipedia dataset this way is probably not feasible on this computer: it would take a long time and a lot of memory. There are 4,604 articles, which means there are 10,596,106 pairs of articles. We need to compute a distance for each pair, which in practice requires about 2 GB of RAM in Python. Clearly, this isn’t the best way to go about things. Fortunately, we’re learning about locality sensitive hashing in class.
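The pair count comes from the formula n(n-1)/2 for the number of unordered pairs among n items. A short check (the helper name `pair_count` is just for illustration):

```python
def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

n_articles = 4604
print(pair_count(n_articles))  # 10596106

# The count grows quadratically: doubling the articles roughly quadruples the pairs.
print(pair_count(2 * n_articles) / pair_count(n_articles))
```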

What if we want a fast way of identifying articles that are similar? We learned how to estimate Jaccard similarity with minhash and how to find similar objects using locality sensitive hashing. The Python program below will perform this analysis on the entire Wikipedia dataset in a matter of seconds. The output will be stored in the file `minhash.csv`, which you can download and open in Excel. Each row lists a pair of articles and their estimated Jaccard similarity.
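Before running it on the full dataset, it may help to see the minhash estimate on a toy example. The sketch below compares the exact Jaccard similarity of two made-up sets with the fraction of matching signature positions. It is an illustration under stated assumptions, not the lab's code: the helper names are hypothetical, and it uses salted MD5 from `hashlib` rather than Python's built-in `hash` so that results repeat across runs (Python 3 randomizes string hashes per process by default).

```python
import hashlib

def stable_hash(n, s):
    """Hash string s with the n-th hash function (salted MD5, returned as an int)."""
    digest = hashlib.md5(("%d:%s" % (n, s)).encode("utf-8")).hexdigest()
    return int(digest, 16)

def signature(items, m):
    """Minhash signature: the minimum hash value under each of m hash functions."""
    return [min(stable_hash(n, item) for item in items) for n in range(m)]

def estimate_jaccard(sig1, sig2):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return matches / len(sig1)

# Hypothetical link sets: 50 shared items out of 150 distinct, so J = 1/3.
set_a = {"link_%d" % i for i in range(0, 100)}
set_b = {"link_%d" % i for i in range(50, 150)}

exact = len(set_a & set_b) / len(set_a | set_b)
approx = estimate_jaccard(signature(set_a, 512), signature(set_b, 512))
print("exact %.3f, estimated %.3f" % (exact, approx))
```

Longer signatures give a tighter estimate at the cost of more hashing, which is the same trade-off controlled by the signature length `m` in the program.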

The program also displays some data about its own performance: how long the various calculations took and how many pairs of articles it compared. Try slightly changing the values of `b` (the number of bands) and `r` (the number of rows) to see how this changes the results.
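A useful guide when choosing `b` and `r` is the standard banding analysis: if two articles have Jaccard similarity s, the chance that they land in the same bucket in at least one band is 1-(1-s^r)^b. The sketch below tabulates that probability for a few settings. The helper name `candidate_probability` is illustrative, and the formula assumes independent hash functions and ignores accidental bucket collisions, so the counts printed by the program will not match it exactly.

```python
def candidate_probability(s, b, r):
    """Probability that two items with similarity s match in at least one of b bands of r rows."""
    return 1.0 - (1.0 - s ** r) ** b

# Each row shows the probability at similarity 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
for b, r in [(16, 4), (32, 2), (8, 8)]:
    row = ["%.3f" % candidate_probability(s / 10.0, b, r) for s in range(0, 11, 2)]
    print("b=%d r=%d: %s" % (b, r, " ".join(row)))
```

More bands make it easier for a pair to become a candidate (more pairs compared), while more rows per band make it harder (fewer pairs compared, but more similar pairs missed).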

In [None]:
import itertools, operator
from timeit import default_timer as timer

'''
Read in list
'''
names=[]
links={}
for line in open("articles.txt"):
    if line[0]=="#":
        continue
    line=line.strip()
    names.append(line)
    links[line]=[]


'''
Read in links
'''
for line in open("links.txt"):
    if line[0]=="#":
        continue
    line=line.strip()
    splitline=line.split("\t")
    if len(splitline)!=2:
        continue
    a,b=splitline
    if a in links:
        links[a].append(b)

names=[n for n in names if len(links[n])>0]

'''
Hash function
n is the number (which hash we want)
s is the string we want to hash
'''
def qhash(n,s):
    return hash("%d%s" %(n,s))

b=16        # bands
r=4         # rows
B=100000    # buckets
m=b*r       # signature length

# Compute minhash signatures
start=timer()
minhashes={}
for name in names:
    if len(links[name])==0:
        continue
    minhashes[name]=[]
    for hash_num in range(m):
        minhash=min([qhash(hash_num,link) for link in links[name]])
        minhashes[name].append(minhash)
end=timer()
signature_time=end-start

start=timer()
distances={}
for band in range(b):
    buckets=[[] for i in range(B)]
    for name in names:
        bucket_num=0
        for row in range(r):
            sig=band*r+row
            bucket_num^=minhashes[name][sig]
        bucket_num=qhash(b,str(bucket_num))%B
        buckets[bucket_num].append(name)
    for bucket in buckets:
        if len(bucket)<2:
            continue
        for n1,n2 in itertools.combinations(bucket,2):
            if (n1,n2) not in distances:
                score=float([x==y for x,y in zip(minhashes[n1],minhashes[n2])].count(True))/m
                distances[n1,n2]=score
end=timer()
locality_time=end-start

print("Bands: b=%d" % b)
print("Rows: r=%d" %r)
print("Signature length: m=%d" % m)
print("Time computing signatures: %f seconds" % signature_time)
print("Time performing locality sensitive hashing: %f seconds" % locality_time)
print("Pairs compared: %d" % len(distances))
print("Percent of pairs compared: %f%%." % (100*len(distances)/((len(names)*(len(names)-1))/2.)))

sorted_distances=sorted(distances.items(), key=operator.itemgetter(1),reverse=True)
f=open("minhash.csv",'w')
f.write("Article 1, Article 2, Signature Similarity\n")
for pair,distance in sorted_distances:
    f.write("%s,%s,%f\n" % (pair[0],pair[1],distance))
f.close()

# Lab Report Questions
1. In your experiments with Jaccard similarity, which pages did you find that had a high similarity? Which pages had low similarity? Does this match your intuition?
2. Does the clustering produced for the U.S. presidents match your knowledge of U.S. history? Can you explain what some of the clusters mean?
3. What are some of the most similar pairs of presidents? Does this make intuitive sense to you?
4. Does the clustering produced for the countries match your knowledge of geography? Can you explain what some of the clusters mean?
5. What are some of the most similar pairs of countries? Does this make intuitive sense to you?
6. How do the different agglomeration methods change the resulting clusters on our two datasets? Which method do you think results in the best clusters?
7. How long does the minhash program take to run? What proportion of the pairs of articles did it compare? If you change the number of bands `b`, how does that change the results? If you change the number of rows `r`, how does that change the results?
8. The file `all_articles.pdf` has a very large dendrogram of **all** the Wikipedia articles. Do you see any patterns in it? What do some of the large clusters correspond to?
9. Looking at the full dendrogram of all the articles, did our minhash method do a good job of identifying the highly similar articles?