In [5]:
import numpy as np
import json
import httplib2
import urllib
from urllib.parse import urlencode, quote_plus

---

# 2. Crawling the data

**Q2.1** Which Wikipedia category is crawled in this script?

**Q2.2** What does this script output?

**Q2.3** When running the script `crawl.py`, what should the file `wiki.lst` contain?


### Answers

**R2.1:** In `crawl.py`, the Wikipedia category crawled is **Biology**, as shown in the following code snippet:
```python
categoryToCrawl = "Category:Biology"
pagesToDw = getPages(categoryToCrawl)
```

In [None]:
categoryToCrawl = "Category:Biology"
crawlingDepth = 2

def getPages(category):
	h = httplib2.Http()
	params = dict()
	params["cmlimit"] = "500"
	params["list"] = "categorymembers"
	params["action"] = "query"
	params["format"] = "json"
	params["cmtitle"] = category
	encodedParams = urlencode(params)
	(resp_headers, content) = h.request("https://en.wikipedia.org/w/api.php?" + encodedParams, "GET")
	jsonContent = content.decode('utf-8')
	
	try:
		j = json.loads(jsonContent)["query"]["categorymembers"]
	except (json.JSONDecodeError, KeyError):  # malformed JSON, or an API error response without "query"
		return []

	return j

pagesToDw = getPages(categoryToCrawl)
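
Since `cmlimit` is set to 500, a single call to `getPages` returns at most 500 members, so a larger category would be truncated. The sketch below shows one way to follow the API's continuation token. It assumes the response carries a `continue` object with a `cmcontinue` key when more results are available; `get_all_pages` and `fetch` are hypothetical names that are not part of `crawl.py`, and `fetch` stands in for the `httplib2` request so that the logic can be tried offline.

```python
def get_all_pages(category, fetch):
    """Collect every member of a category by following continuation tokens.

    `fetch` is a hypothetical helper: it receives a dict of query
    parameters and returns the decoded JSON response as a dict.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "format": "json",
        "cmlimit": "500",
        "cmtitle": category,
    }
    members = []
    while True:
        response = fetch(dict(params))
        members += response.get("query", {}).get("categorymembers", [])
        # Assumption: a "cmcontinue" token is present while results remain.
        token = response.get("continue", {}).get("cmcontinue")
        if token is None:
            return members
        params["cmcontinue"] = token
```

With the real API, `fetch` could wrap the request and `json.loads` call already used in `getPages`.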

**R2.2:** At each depth, `crawl.py` prints the current crawling depth and the number of pages to download, as shown in the following code snippet:
```python
print("Crawling at depth",depth,". Pages to dw:",len(pagesToDw))
```
It also creates the file `wiki.lst`, whose content is discussed in the next question.

In [13]:
with open("wiki.lst",'w',encoding='utf-8') as outFile:
	for depth in range(crawlingDepth):
		print("Crawling at depth",depth,". Pages to dw:",len(pagesToDw))
		deeperLevelPages = list()
		for page in pagesToDw:
			pageTitle = page["title"]
			if pageTitle.startswith("Category:"):
				deeperLevelPages += getPages(pageTitle)
			outFile.write(pageTitle+"\n")
		pagesToDw = deeperLevelPages

Crawling at depth 0 . Pages to dw: 63
Crawling at depth 1 . Pages to dw: 1474


**R2.3:** The file `wiki.lst` should contain the titles of all pages visited by the crawler (category pages included), one title per line, as illustrated in the following code snippet:
```python
outFile.write(pageTitle+"\n")
```
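
Because the crawler writes every title it meets, `wiki.lst` mixes category titles and article titles. As a small sketch of how the file could be read back and split, the example below uses a hypothetical helper `split_titles` and made-up sample titles in place of the real file content.

```python
import io


def split_titles(lines):
    """Separate category titles from article titles (hypothetical helper)."""
    categories, articles = [], []
    for line in lines:
        title = line.rstrip("\n")
        if not title:
            continue  # skip empty lines
        if title.startswith("Category:"):
            categories.append(title)
        else:
            articles.append(title)
    return categories, articles


# Made-up sample standing in for the content of wiki.lst.
sample = io.StringIO("Category:Genetics\nDNA\nCell (biology)\n")
categories, articles = split_titles(sample)
```

On the real data, `sample` would be replaced by `open("wiki.lst", encoding="utf-8")`.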

---

# 3. Downloading the data

**Q3.1** How many pages are downloaded per batch?

**Q3.2** Which Wikipedia API is used to download a set of pages?

**Q3.3** How does the crawling work here?

**Q3.4** By opening the API page in your browser and reading the documentation paragraph, can you tell in what format the pages will be encoded?

---

# 4. Parsing the data

**Q4.1** From the code, how are the two matrices encoded (i.e., what type of Python object)? What is the name of this encoding?

**Q4.2** Take a look at the database of Wikipedia documents in the `dws` folder, for example using the command `vi` or `less`. How are the links encoded in the wiki language?

**Q4.3** Complete the regular expression for extracting the links.

**Q4.4** Find and complete a simple regular expression for removing noisy data such as external links (outside of Wikipedia) and info boxes.

**Q4.5** Implement your regular expression in Python such that the first group contains everything in the link (the target as well as its potential displayed text).
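
In wikitext, internal links are written as `[[Target]]` or `[[Target|displayed text]]`. The sketch below is one possible pattern for Q4.5, not necessarily the one expected by the lab skeleton: its first group captures everything between the double brackets. The name `LINK_RE` and the sample sentence are made up for illustration, and nested constructs are deliberately ignored.

```python
import re

# First group: everything between [[ and ]], i.e. the target and,
# if present, the displayed text after the pipe.
LINK_RE = re.compile(r"\[\[([^\[\]]+)\]\]")

text = "[[DNA]] is studied in [[Molecular biology|biology]]."
links = LINK_RE.findall(text)

# Keep only the target by dropping the optional displayed text.
targets = [link.split("|")[0] for link in links]
```

Excluding brackets inside the group keeps the match from running across two adjacent links.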

**Q4.6** The current implementation builds a doc-tok matrix. You need to transpose it to obtain a reverse sparse index. As this looks suboptimal, also try building the reverse version directly when parsing the documents (i.e., directly create a tok-doc index) and measure the performance you gain or lose (as a percentage of execution time). How do you explain that?
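
To make the two strategies of Q4.6 concrete, here is a toy sketch that assumes the sparse matrices are stored as nested dictionaries; the lab code may use a different encoding. The names `docs`, `doc_tok`, `tok_doc` and `transpose` are hypothetical.

```python
from collections import defaultdict

# Toy corpus: document name -> list of tokens (made-up data).
docs = {"d1": ["dna", "cell", "dna"], "d2": ["cell"]}

# Strategy 1: build the doc-tok matrix, then transpose it.
doc_tok = {doc: {} for doc in docs}
for doc, tokens in docs.items():
    for tok in tokens:
        doc_tok[doc][tok] = doc_tok[doc].get(tok, 0) + 1


def transpose(matrix):
    """Swap the two key levels of a nested-dictionary sparse matrix."""
    out = defaultdict(dict)
    for row, cols in matrix.items():
        for col, value in cols.items():
            out[col][row] = value
    return dict(out)


# Strategy 2: build the tok-doc index directly while parsing.
tok_doc = defaultdict(dict)
for doc, tokens in docs.items():
    for tok in tokens:
        tok_doc[tok][doc] = tok_doc[tok].get(doc, 0) + 1
tok_doc = dict(tok_doc)
```

Both strategies yield the same index, so timing each one on the full corpus (for instance with `time.perf_counter`) gives the comparison the question asks for.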

---

# 5. Page Rank of the Document

**Q5.1** In the random surfer model, at each iteration, random clicks are "simulated" with a given probability. Complete the code with the correct probability.

**Q5.2** What is the name of the effect we circumvent by adding `sourceVector` to the newly computed page rank vector `pageRanksNew`?

**Q5.3** Implement the formula of the convergence $\delta$.
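
The notebook does not state which norm is intended for $\delta$. One common choice, given here as an assumption, is the L1 distance between two successive rank vectors, where $PR^{(t)}_i$ (notation introduced for illustration) is the rank of page $i$ after iteration $t$ and $N$ is the number of pages:

$$\delta = \sum_{i=1}^{N} \left| PR^{(t+1)}_i - PR^{(t)}_i \right|$$

The iteration stops once $\delta$ falls below a chosen threshold.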

**Q5.4** Run the PageRank program in interactive mode with `python3 -i pageRank.py`, and use the Python interface to answer the following:
- How many iterations did it need to converge?
- What is the page rank of "DNA"?
- What is the page with the highest rank?
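
For intuition about the random surfer model, the following is a minimal power-iteration sketch on a three-page toy graph. It is not the code of `pageRank.py`: the function `page_rank`, the damping factor of 0.85, the tolerance, and the handling of pages without outgoing links are all assumptions chosen for illustration.

```python
import numpy as np


def page_rank(links, damping=0.85, tol=1e-8, max_iter=100):
    """links[i] lists the indices of the pages that page i links to."""
    n = len(links)
    ranks = np.full(n, 1.0 / n)
    # Contribution of random jumps, added at every iteration.
    source = np.full(n, (1.0 - damping) / n)
    for iteration in range(1, max_iter + 1):
        new_ranks = np.zeros(n)
        for page, targets in enumerate(links):
            if targets:
                for target in targets:
                    new_ranks[target] += damping * ranks[page] / len(targets)
            else:
                # Page without outgoing links: spread its rank over all pages.
                new_ranks += damping * ranks[page] / n
        new_ranks += source
        # Convergence measure: L1 distance between successive vectors.
        delta = np.abs(new_ranks - ranks).sum()
        ranks = new_ranks
        if delta < tol:
            break
    return ranks, iteration


# Toy graph: page 0 -> 1, page 1 -> 0 and 2, page 2 -> 0.
ranks, iterations = page_rank([[1], [0, 2], [0]])
```

Here `ranks` sums to 1 and `int(np.argmax(ranks))` gives the index of the highest-ranked page, which mirrors the questions asked above.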

---

# 6. Woogle!

**Q6.1** What type of page is selected by the vector model? By looking at the Wikipedia page, how can you explain it? What is the name of this classical cheating technique?

**Q6.2** Propose and implement a way of correcting this phenomenon. Check whether this corrects the effect for the top 15 pages.

**Q6.3** Take a look at the vector model rankings for your query. What is the rank of the page "Bacterial Evolution"? Is it expected? How would you correct for it (see the extra section)? Play around with standard queries and try to understand the behavior. You can try the following queries ("dna", "darwin", "crispr") while:
- ranking and not ranking, to check the difference
- correcting or not for "classical cheating"
- varying the number of results to be ranked (say 2, 10, 20, 200)

**Q6.4** What is your feeling about the right parameters? Is it better to rank with PageRank or not?

**Q6.5** Can you devise a way of automating the search for the right parameters ? If you think that you have somehow limited data, try to get more data and see if it solves your problem.

**Q6.6**

---

# 7. Extras