# Lab 10

# Wikispeedia Dataset

There are many games a person can play using Wikipedia. `Wikispeedia` is a game that challenges players to find a path from a starting article to a destination article using only Wikipedia links. For example, navigating from `Popcorn` to `Guitar` as quickly as possible would involve some creative thinking.
We have 4,604 Wikipedia articles chosen for an online version of this game (`articles.txt`) and all the links between those pages (`links.txt`). These two files are included with this lab, but if you want to know more, you can download the original data here: `https://snap.stanford.edu/data/wikispeedia.html`.

# Random Walk

The Python program below reads `articles.txt` and `links.txt`. It then picks a random Wikipedia article as a starting point and follows a random link `num_steps` times. We call this a random walk on the graph. Run this program a few times. Look for pages that show up frequently in the random walks. What would it mean for an article to show up a lot in these random walks?

In [None]:
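import random

# Read in article names (skip blank lines and "#" comment lines)
names = []
links = {}
with open("articles.txt") as f:
    for line in f:
        line = line.strip()
        if not line or line[0] == "#":
            continue
        names.append(line)
        links[line] = []

# Read in links: each line is "source<TAB>target"
with open("links.txt") as f:
    for line in f:
        line = line.strip()
        if not line or line[0] == "#":
            continue
        splitline = line.split("\t")
        if len(splitline) != 2:
            continue
        a, b = splitline
        links[a].append(b)

current = random.choice(names)
print(current)

num_steps = 100
for i in range(num_steps):
    # An article may have no outgoing links; restart from a random
    # article instead of crashing on an empty list.
    if not links[current]:
        current = random.choice(names)
    else:
        current = random.choice(links[current])
    print(current)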
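
One way to make "shows up frequently" concrete is to tally visits over a longer walk. The sketch below is a minimal illustration on a made-up four-page graph. The helper name `count_visits`, the `toy_links` dictionary, and the choice to restart at a random page on a dead end are assumptions for illustration, not part of the lab program.

```python
import random
from collections import Counter

def count_visits(links, num_steps, seed=None):
    """Random-walk over `links` and count how often each page is visited."""
    rng = random.Random(seed)
    pages = list(links)
    current = rng.choice(pages)
    visits = Counter([current])
    for _ in range(num_steps):
        # Assumption: restart at a random page when there are no outgoing links.
        if links[current]:
            current = rng.choice(links[current])
        else:
            current = rng.choice(pages)
        visits[current] += 1
    return visits

# Hypothetical toy graph: every other page links only to "Hub".
toy_links = {
    "Hub": ["A", "B", "C"],
    "A": ["Hub"],
    "B": ["Hub"],
    "C": ["Hub"],
}

visits = count_visits(toy_links, num_steps=1000, seed=0)
for page, count in visits.most_common():
    print(page, count)
```

Passing the `links` dictionary built from `links.txt` instead of `toy_links` should give the same kind of tally for the real articles.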

# PageRank

Next, we will use the `Wikispeedia` dataset to explore the PageRank algorithm. The Python program below will read the files, compute the PageRank for each of the Wikipedia articles, and then write the result into a file called `ranking.txt`. Each line in the file has the name of the article and its PageRank. The file is sorted, so pages with the highest PageRank are displayed first. Try to relate what you observe to what you observed in the random walks.

In [1]:
import operator
import networkx as nx

# Read in article names and give each one a numeric ID (its position in `names`)
names = []
name2num = {}
with open("articles.txt") as f:
    for line in f:
        line = line.strip()
        if not line or line[0] == "#":
            continue
        name2num[line] = len(names)
        names.append(line)

# Read in links and add one directed edge per "source<TAB>target" line
G = nx.DiGraph()
with open("links.txt") as f:
    for line in f:
        line = line.strip()
        if not line or line[0] == "#":
            continue
        splitline = line.split("\t")
        if len(splitline) != 2:
            continue
        a, b = splitline
        i, j = name2num[a], name2num[b]
        G.add_edge(i, j)

# Compute PageRank; the result maps node ID -> score
pr = nx.pagerank(G)

# Sort by score, highest first
sorted_pr = sorted(pr.items(), key=operator.itemgetter(1), reverse=True)

with open("ranking.txt", "w") as f:
    for pagenum, rank in sorted_pr:
        f.write(names[pagenum] + "\t" + str(rank) + "\n")

To view the output of this program, click `File` in the top left, click `Open...`, and then choose `ranking.txt`. It's a big file, so be patient. If it's not there, don't forget to run the program first!
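
To get a rough feel for what `nx.pagerank` computes, the sketch below runs a simple power iteration on a made-up three-page graph. The function name `simple_pagerank`, the toy graph, and the fixed iteration count are assumptions for illustration. The damping factor 0.85 matches the default `alpha` in NetworkX, but this is not NetworkX's actual implementation.

```python
def simple_pagerank(links, damping=0.85, num_iters=100):
    """Power-iteration PageRank on a dict mapping page -> list of linked pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(num_iters):
        # Every page gets a baseline share from the random "teleport" step.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links[p]
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dead end: spread this page's rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical toy graph; every link target must also be a key.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = simple_pagerank(toy_links)
for page, score in sorted(rank.items(), key=lambda item: item[1], reverse=True):
    print(page, round(score, 3))
```

Each round, a page passes most of its current score along its outgoing links, so pages that are linked to by high-scoring pages end up with high scores themselves.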

# Lab Questions

1. Looking over the random walks, what articles showed up frequently? Were there common themes in the articles you encountered most often? Why did the random walk visit these pages so frequently?
2. What are some of the Wikipedia articles that have the highest PageRank in our dataset? Why might they have such a high PageRank?
3. What are some of the Wikipedia articles that have the lowest PageRank in our dataset? Why might they have such a low PageRank?
4. What is the total of all the PageRank values? Why is that the total?
5. (CHALLENGE) Below, you'll find five SHA-1 hashes for passwords of length 2. The first character of each password is a lowercase letter and the second is a digit (e.g. `a1` or `w7`). How many different passwords are possible? Can you crack the passwords?

`a9dfb15be45a5f3128784c80c733f2cdee2f756a`

`8b5eaccb28f2182b656fc1a2589dd64891df3a08`

`68ee74f7d6afe0164fe0f1197aa9177c946d8834`

`96d828acda84e1bc81c96b361f248dbe7758bace`

`9865e9d9fefa7e998f1adaf68443075219c864fd`
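
For the challenge, Python's standard `hashlib` module can compute SHA-1 hashes. The sketch below only shows how to hash a single candidate string and compare it with a target. It assumes the passwords were hashed as plain text with no salt or trailing newline, which the lab does not state. The helper name `sha1_hex` is made up for illustration.

```python
import hashlib

def sha1_hex(text):
    """Return the SHA-1 hash of `text` as a 40-character hex string."""
    # Assumption: the text is encoded as UTF-8 bytes, with no salt or newline.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

# "abc" is a standard SHA-1 test input, not one of the lab passwords.
digest = sha1_hex("abc")
print(digest)  # a9993e364706816aba3e25717850c26c9cd0d89d

target = "a9993e364706816aba3e25717850c26c9cd0d89d"
print(digest == target)  # True
```

Trying every letter-digit combination and comparing each hash against the five targets is left as the exercise.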