## Part 1. Traversing a Web Graph (8 points)

You begin by building a web crawler function, `webCrawler`, that traverses a web graph consisting of a self-contained set of linked web pages.

First, you can use the `urllib` package to retrieve web pages as follows:

```
import urllib.request
webUrl = urllib.request.urlopen('https://ischool.berkeley.edu/')
data = webUrl.read()
```

Starting with the following URL:

[https://people.ischool.berkeley.edu/~chuang/i206/b5/index.html](https://people.ischool.berkeley.edu/~chuang/i206/b5/index.html)

Your crawler should identify and follow the links on the page, as well as the links found on the other pages reachable from this source page, using the breadth-first search (BFS) technique. 

You can use regex (which you have now mastered) or the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library (e.g., its `find_all()` and `get()` methods; `findAll()` is the older name for the same method) to find the links on each page. To simplify your task, every link in this set of web pages is a relative link, e.g.,

`<a href="somepage.html">This is a link to some random page</a>`

which should be resolved to:

[https://people.ischool.berkeley.edu/~chuang/i206/b5/somepage.html](https://people.ischool.berkeley.edu/~chuang/i206/b5/somepage.html)
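The link extraction and resolution described above can be sketched as follows. This is a minimal illustration rather than the required approach: `SAMPLE_HTML`, `HREF_PATTERN`, and `extract_links` are hypothetical names, and the regex assumes every `href` value is quoted. `urljoin` from the standard library resolves a relative link against the URL of the page it was found on.

```python
import re
from urllib.parse import urljoin

# Assumed base URL: the page on which the link was found.
BASE_URL = "https://people.ischool.berkeley.edu/~chuang/i206/b5/index.html"

# Hypothetical page source, used only for illustration.
SAMPLE_HTML = '<p><a href="somepage.html">This is a link to some random page</a></p>'

# Capture the href value up to (but not including) the closing quote.
HREF_PATTERN = re.compile(r'href=["\']([^"\']+)["\']')


def extract_links(html, base_url):
    """Return the absolute URL for every quoted href found in html."""
    return [urljoin(base_url, href) for href in HREF_PATTERN.findall(html)]


resolved = extract_links(SAMPLE_HTML, BASE_URL)
print(resolved)
```

Because `urljoin` also handles fully specified links, the same helper would keep working if a page mixed relative and absolute links.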

Note that pages may link to one another via loops, e.g., A links to B, B links to C, and C links back to A. Your crawler has to avoid loops by keeping track of which pages have already been visited, so that you don't visit the same page twice. Use Python's `deque` data structure (from the `collections` module) to implement the BFS queue.

Note: please do not try to run your code on the open Web unless you have properly implemented the following: (i) checking and conforming to a site’s robots.txt file, (ii) rate-limiting your crawler, (iii) properly resolving fully specified and relative links. Otherwise you may get a nasty call from someone.
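Points (i) and (ii) of the note above might look roughly like the sketch below, which uses the standard library's `urllib.robotparser`. The robots.txt rules, the `example.com` URLs, the user agent string, and the `polite_fetch_allowed` helper are all assumptions made for illustration; a real crawler would load the site's own file with `set_url()` and `read()` instead of parsing a hard-coded list.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch the site's own file.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
]

robots = RobotFileParser()
robots.parse(ROBOTS_LINES)


def polite_fetch_allowed(url, user_agent="i206-crawler", delay_seconds=0.0):
    """Pause for delay_seconds, then report whether robots.txt permits url."""
    time.sleep(delay_seconds)  # crude rate limit: one pause per request
    return robots.can_fetch(user_agent, url)


blocked = polite_fetch_allowed("https://example.com/private/data.html")
allowed = polite_fetch_allowed("https://example.com/public.html")
print(blocked, allowed)
```

In practice `delay_seconds` would be set to a second or more; it defaults to zero here only so the example runs instantly.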

Upon completion of the crawl, your crawler function should return the following:

- A list of the pages found, in the exact order in which they were visited, starting with `index.html`
- The total number of pages crawled (including `index.html`)
- The total number of links found
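As a rough sketch of the traversal and the three return values, the example below runs BFS over a small in-memory graph instead of live pages. `SAMPLE_GRAPH`, `bfs_crawl`, and the `get_links` parameter are hypothetical; in the real crawler `get_links` would download a page and extract its links. It also assumes that "links found" counts every link occurrence, including links to pages already seen.

```python
from collections import deque

# Hypothetical web graph: page name -> links found on that page.
SAMPLE_GRAPH = {
    "index.html": ["a.html", "b.html"],
    "a.html": ["b.html", "c.html"],
    "b.html": ["index.html"],  # loops back to the start page
    "c.html": ["a.html"],
}


def bfs_crawl(start, get_links):
    """Breadth-first traversal that never visits the same page twice."""
    order = []            # pages in the order they were visited
    seen = {start}        # pages already queued or visited
    queue = deque([start])
    total_links = 0

    while queue:
        page = queue.popleft()      # FIFO order gives breadth-first search
        order.append(page)
        links = get_links(page)
        total_links += len(links)   # assumption: count every link occurrence
        for link in links:
            if link not in seen:    # skip pages already queued or visited
                seen.add(link)
                queue.append(link)

    return order, len(order), total_links


pages, page_count, link_count = bfs_crawl(
    "index.html", lambda page: SAMPLE_GRAPH.get(page, [])
)
print(pages, page_count, link_count)
```

Marking a page as seen when it is queued, rather than when it is visited, keeps a page that several others link to from entering the queue more than once.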
 

In [24]:
#import
import urllib.request
import re
import collections 


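# define webCrawler
def webCrawler(link):
    # open the URL and decode the response bytes into text
    with urllib.request.urlopen(link) as webUrl:
        data = webUrl.read().decode('utf-8', errors='replace')

    # find links with regex; the negated character class stops at the
    # closing quote, so several links on one line are matched separately
    pattern = re.compile(r'href=[\'"]?([^\'" >]+)')
    results = pattern.findall(data)
    return results

print(webCrawler("https://people.ischool.berkeley.edu/~chuang/i206/b5/index.html"))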


# notes for future drake
# https://www.geeksforgeeks.org/deque-in-python/
# https://www.youtube.com/watch?v=oDqjPvD54Ss

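# so we need to use de.count() to check whether a page is already in the queue.
# if it is, we don't add it again. pages already popped from the queue need
# tracking too (e.g., in a separate visited list), or loops will revisit them.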

# we also need to count how many pages visited, how many links found, and the list of pages visited
# in a specific order


['information.html']


## Part 2. Indexing Web Pages (6 points)

Extend your web crawler from Part 1 so that as it encounters web pages, it also builds an inverted index (using the dictionary data structure) based on the words found on each web page. Call this function `webCrawlIndexer`.

Each time you retrieve a new web page, you will need to extract the words from the page. You may re-use your code from Assignment 3 (sentiment analysis), or you can also use the `get_text()` method from BeautifulSoup for this purpose.
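One possible way to extract the words is sketched below. It swaps in the standard library's `html.parser` instead of Beautiful Soup's `get_text()`, purely so the example has no third-party dependency. `TextExtractor` and `extract_words` are hypothetical names, and lowercasing the words and splitting on non-alphanumeric characters are assumptions about what counts as a word.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text that appears between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_words(html):
    """Return the lowercased words found in the text of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return re.findall(r"[a-z0-9']+", text.lower())


words = extract_words(
    "<html><body><h1>Hello World</h1><p>Hello again, world!</p></body></html>"
)
print(words)
```

This simple version also picks up text inside `<script>` and `<style>` elements, which a fuller solution would probably want to skip.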

When your indexer encounters a new word, it should add a new entry to the inverted index, with the word as the key and a list containing the page name (e.g., `somepage.html`) as the value. When it encounters a word already in the index, it should append the new page name to that entry's list. However, if a word appears multiple times on a web page, you should not append the same page name more than once. For example:

Correct: `inv_index = {'word1': ['page1.html', 'page2.html']}`

Incorrect: `inv_index = {'word1': ['page1.html', 'page2.html', 'page2.html']}`

Upon completion, your webCrawlIndexer function should return:

- The number of entries in the inverted index
- The inverted index dictionary data structure
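The indexing rule and return values above could be sketched like this. `build_inverted_index` and `sample_pages` are hypothetical; the sketch assumes the words for each page have already been extracted, whereas `webCrawlIndexer` itself would do this work page by page during the crawl.

```python
def build_inverted_index(pages):
    """Map each word to the list of pages it appears on, without duplicates.

    pages: dict of page name -> list of words found on that page.
    """
    inv_index = {}
    for page_name, words in pages.items():
        for word in words:
            page_list = inv_index.setdefault(word, [])
            if page_name not in page_list:  # record each page once per word
                page_list.append(page_name)
    return len(inv_index), inv_index


# Hypothetical input: "word1" appears twice on page2.html.
sample_pages = {
    "page1.html": ["word1", "word2"],
    "page2.html": ["word1", "word1", "word3"],
}

entry_count, inv_index = build_inverted_index(sample_pages)
print(entry_count, inv_index)
```

Checking `page_name not in page_list` is a linear scan, which is fine for a handful of pages; a dictionary of sets would be one alternative if the crawl were much larger.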

In [None]:
#part 2

## Part 3. Search Query Interface (2 points)

Write a search query interface that prompts a user to enter a search query term, and prints a list of web pages corresponding to the query term if it exists in the inverted index from Part 2, or prints "No results found" if it does not exist, or quits the interface if the user enters 'q'.

For simplicity, the query terms are limited to a single word. You do not need to support search queries with multiple keywords.
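A minimal sketch of such an interface follows. The function name `search_loop` and its `read` and `write` parameters are hypothetical; they default to `input` and `print` and exist only so the loop can be exercised without a keyboard. Lowercasing the query assumes the index keys were lowercased when the index was built.

```python
def search_loop(inv_index, read=input, write=print):
    """Prompt for single-word queries until the user enters 'q'."""
    while True:
        query = read("Enter a search term (q to quit): ").strip().lower()
        if query == "q":
            break
        pages = inv_index.get(query)
        if pages:
            write(", ".join(pages))
        else:
            write("No results found")


# Hypothetical index used only for illustration.
sample_index = {"word1": ["page1.html", "page2.html"]}
```

Calling `search_loop(sample_index)` starts the interactive prompt. One consequence of this design is that the letter `q` itself can never be searched for.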

In [None]:
#part 3

## Extra Credit. Search Results Webpage (1 point)

Optional: Construct and display a search results webpage (in HTML format) that shows a list of web pages (including actual hyperlinks to the pages) that contain the search term.

Python provides an easy way to display a web page with the `webbrowser` module. If you run the following, a web browser opens showing the specified page:
```
import webbrowser
webbrowser.open("https://ischool.berkeley.edu/")
```

If you write your search results webpage out to a local file on your computer, you can use `webbrowser.open()` to display it, e.g.:

`webbrowser.open("file:///Users/name/Documents/search_results.html")`

The web page should be readable but it does not have to be pretty. Be sure to handle the case where there are no matches.
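One way the results page might be assembled is sketched below. `build_results_page` is a hypothetical helper, and `BASE_URL` assumes the crawled pages all live under the course directory used in Part 1. `html.escape` keeps a search term containing characters such as `<` from breaking the markup.

```python
import html

# Assumed location of the crawled pages (the directory from Part 1).
BASE_URL = "https://people.ischool.berkeley.edu/~chuang/i206/b5/"


def build_results_page(term, pages, base_url=BASE_URL):
    """Return an HTML document listing hyperlinks to the matching pages."""
    safe_term = html.escape(term)
    if pages:
        items = "\n".join(
            f'<li><a href="{html.escape(base_url + page)}">{html.escape(page)}</a></li>'
            for page in pages
        )
        body = f"<ul>\n{items}\n</ul>"
    else:
        body = "<p>No results found</p>"  # the no-match case
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>Results for {safe_term}</title></head>\n"
        f"<body><h1>Results for {safe_term}</h1>\n{body}\n</body></html>\n"
    )


page_html = build_results_page("word1", ["page1.html"])
empty_html = build_results_page("nothing", [])
print(page_html)
```

To display the page, the string can be written to a file such as `search_results.html`, and the file's absolute `file://` URL passed to `webbrowser.open()`; `pathlib.Path("search_results.html").resolve().as_uri()` produces that URL.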

In [None]:
# extra credit