## Part 1. Traversing a Web Graph (8 points)

You begin by building a web crawler function, `webCrawler`, that traverses a web graph consisting of a self-contained set of linked web pages.

First, you can use the `urllib` package to retrieve web pages as follows:

```
import urllib.request
webUrl = urllib.request.urlopen('https://ischool.berkeley.edu/')
data = webUrl.read()
```
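Note that `read()` returns a `bytes` object rather than a `str`. The following is a minimal sketch of the difference between wrapping those bytes in `str()` and decoding them. It uses a hard-coded byte string (`raw` is made up for illustration) instead of a live request, and it assumes the pages are UTF-8 encoded:

```python
# Hypothetical page content, standing in for the bytes returned by webUrl.read().
raw = b"<p>California\xe2\x80\x94in brief</p>\n"

# str() on bytes yields the repr: a b'...' wrapper with escaped bytes inside.
as_repr = str(raw)

# decode() turns the bytes into real text (UTF-8 is an assumption here).
as_text = raw.decode("utf-8")

print(as_repr)
print(as_text)
```

Text extracted from the decoded string does not carry the `b'` prefix or the `\x..` escape sequences that the `str()` version contains.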

Starting with the following URL:

[https://people.ischool.berkeley.edu/~chuang/i206/b5/index.html](https://people.ischool.berkeley.edu/~chuang/i206/b5/index.html)

Your crawler should identify and follow the links on the page, as well as the links found on the other pages reachable from this source page, using the breadth-first search (BFS) technique. 

You can use regex (which you have now mastered) or the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library (e.g., its `find_all()` and `get()` methods) to find the links on each page. To simplify your task, all the links in this set of web pages use relative links, e.g.:

`<a href="somepage.html">This is a link to some random page</a>`

which should be resolved to:

[https://people.ischool.berkeley.edu/~chuang/i206/b5/somepage.html](https://people.ischool.berkeley.edu/~chuang/i206/b5/somepage.html)
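One possible way to perform this resolution, shown here only as a sketch, is `urljoin` from the standard library's `urllib.parse` module. The assignment does not require it; concatenating the base path and the page name also works for this set of pages.

```python
from urllib.parse import urljoin

# The page on which the link was found (the starting URL from the assignment).
base = "https://people.ischool.berkeley.edu/~chuang/i206/b5/index.html"

# A relative href is resolved against the directory of the base page.
resolved = urljoin(base, "somepage.html")
print(resolved)

# A fully specified href is returned unchanged.
absolute = urljoin(base, "https://ischool.berkeley.edu/")
print(absolute)
```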

Note that pages may link to one another in loops, e.g., A links to B, B links to C, and C links back to A. Your crawler must avoid following such loops forever by keeping track of which pages have already been visited, so that it never visits the same page twice. Use Python's `deque` data structure (from the `collections` module) to implement the queue of pages waiting to be crawled.
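To illustrate the idea without touching the network, here is a minimal BFS sketch over an in-memory link graph. The graph `links` and the function name `bfs_order` are hypothetical and exist only for this example; a real crawler would fetch each page and extract its links instead of looking them up in a dictionary.

```python
from collections import deque

# Hypothetical link graph: page name -> pages it links to (note the A-B-C loop).
links = {
    "A.html": ["B.html", "C.html"],
    "B.html": ["C.html"],
    "C.html": ["A.html", "D.html"],
    "D.html": [],
}

def bfs_order(start, graph):
    """Return pages in the order a breadth-first crawl would visit them."""
    queue = deque([start])
    seen = {start}              # pages already queued or visited
    order = []
    while queue:
        page = queue.popleft()  # take the oldest queued page first
        order.append(page)
        for target in graph.get(page, []):
            if target not in seen:   # skip pages we already know about
                seen.add(target)
                queue.append(target)
    return order

print(bfs_order("A.html", links))
```

Using a set for `seen` keeps the membership test fast, and marking a page as seen at the moment it is queued prevents it from being queued twice.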

Note: please do not try to run your code on the open Web unless you have properly implemented the following: (i) checking and conforming to a site’s robots.txt file, (ii) rate-limiting your crawler, (iii) properly resolving fully specified and relative links. Otherwise you may get a nasty call from someone.
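As a rough sketch of points (i) and (ii), the standard library's `urllib.robotparser` can evaluate robots.txt rules, and `time.sleep` can space out requests. The rules, the `example.com` URLs, the agent name `i206-crawler`, and the helper `polite_fetch_allowed` are all assumptions made up for this illustration.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would instead call
# parser.set_url(...) followed by parser.read() to download the file.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

def polite_fetch_allowed(url, delay_seconds=1.0, agent="i206-crawler"):
    """Pause to rate-limit, then report whether robots.txt permits the URL."""
    time.sleep(delay_seconds)
    return parser.can_fetch(agent, url)

print(polite_fetch_allowed("https://example.com/index.html", delay_seconds=0.1))
print(polite_fetch_allowed("https://example.com/private/notes.html", delay_seconds=0.1))
```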

Upon completion of the crawl, your crawler function should return the following:

* A list of the pages found (following the exact order in which they were visited, starting with `index.html`)
* Total number of pages crawled (including `index.html`)
* Total number of links found
 

In [2]:
##### Drake's notes #####
# make a list of all of them, even with duplicates, and then just iterate over them
# don't visit the duplicates? 

# nvm - this would just go infinitely because we won't know when we run out of content

# so we would need to recur for every web page, and stop when we don't find any new URLs 

# something like while deque != 0?

# how do we make it recursive? 
    # we don't have to

# notes for future drake
# https://www.geeksforgeeks.org/deque-in-python/
# https://www.youtube.com/watch?v=oDqjPvD54Ss

# so we need to use de.count() to check to make sure it isn't already in the queue. 
# if it isn't then we don't count it.

# we also need to count how many pages visited, how many links found, and the list of pages visited
# in a specific order


# get the url
# collect all links in the url
# visit each of those links
    # get all the links in those websites
    # if there are links on the website that aren't in the deque, run it on them too
    # if link not in deque
        # append it to the end

# while the deque is not empty
    # fetch the page at deque[0] with urllib
    # find links
        # for every link, check if we have already seen it
        # if we haven't, append it to the deque
        # if we have, skip it
    # deque.popleft()





In [3]:
#import
import urllib.request
import re
from collections import deque


# define webCrawler
def webCrawler(link):

    # start the queue, seeded with the starting page name (e.g. "index.html")
    queue = deque([link])

    # pages in the order they were actually crawled
    web_pages = []

    # every page name seen so far (queued or crawled), including index.html
    sites_visited = [link]

    # number of links found; starts at 1 to count the starting page itself
    num_links = 1

    # start the loop, keep looping until deque is empty
    while len(queue) > 0:
        
        #TRACING
        #print(queue)

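        # fetch the page at the front of the queue
        link = "https://people.ischool.berkeley.edu/~chuang/i206/b5/" + queue[0]
        with urllib.request.urlopen(link) as webUrl:
            # decode the bytes into text (assuming UTF-8) instead of calling str(),
            # which would return the b'...' repr with escaped characters
            data = webUrl.read().decode("utf-8", errors="replace")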

        # FOR TESTING LOCALLY
        ##file = open(queue[0], "r")
        ##file_contents = file.read()

        # also append to visited pages
        web_pages.append(queue[0])

        #TRACING
        #print("All web pages visited:", web_pages)

        # find all the links in the first entry
        pattern = re.compile(r'href=[\'"]?([^\'" >]+)')
        results = pattern.findall(data)

        # FOR TESTING LOCALLY
        ##results = pattern.findall(file_contents)

        # TRACING
        #print("Pattern results:", results)

        # loop over all the links in the webpage
        for each in results:
            # increment our counter of links found (duplicates included)
            num_links += 1
            # if we haven't seen this page yet, add it to the end of the queue
            if each not in sites_visited:
                queue.append(each)
                # and record it so it is never queued twice
                sites_visited.append(each)
        
        # remove the first item 
        queue.popleft()

        # FOR TESTING LOCALLY
        #file.close()

    return web_pages, len(web_pages), num_links

## MAIN

web_pages, page_amt, link_amt = webCrawler("index.html")

print("Web Pages:", web_pages, "\n")

print("We found", page_amt, "unique pages!\n")

print("In total, there were", link_amt, "links!")


Web Pages: ['index.html', 'information.html', 'Berkeley.html', 'ISchool.html', 'MIMS.html', 'CityOfBerkeley.html', 'UCBerkeley.html', 'BerkeleyCollege.html', 'SouthHall.html', 'Campanile.html'] 

We found 10 unique pages!

In total, there were 23 links!


## Part 2. Indexing Web Pages (6 points)

Extend your web crawler from Part 1 so that as it encounters web pages, it also builds an inverted index (using the dictionary data structure) based on the words found on each web page. Call this function `webCrawlIndexer`.

Each time you retrieve a new web page, you will need to extract the words from the page. You may re-use your code from Assignment 3 (sentiment analysis), or you can also use the `get_text()` method from BeautifulSoup for this purpose.

When your indexer encounters a new word, it should add a new entry to the inverted index, with the word as the key, and the page name (e.g., `somepage.html`) as the value. When it encounters a word already in the index, it should update the entry to append the new page name as the value. However, if a word appears multiple times in a web page, you should not append the same web page name multiple times. For example: 

Correct: `inv_index = {'word1': ['page1.html', 'page2.html']}`

Incorrect: `inv_index = {'word1': ['page1.html', 'page2.html', 'page2.html']}`

Upon completion, your webCrawlIndexer function should return:

* The number of entries in the inverted index
* The inverted index dictionary data structure
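The de-duplication rule described above can be sketched on in-memory text, without any crawling. The page texts in `pages` and the function name `build_index` are hypothetical, and the punctuation stripping shown is just one possible way to extract words.

```python
import string

# Hypothetical page texts, standing in for the words extracted from each page.
pages = {
    "page1.html": "Berkeley is home. Home is Berkeley!",
    "page2.html": "The campus is in Berkeley.",
}

def build_index(page_texts):
    """Map each lower-cased word to the list of pages that contain it."""
    index = {}
    strip_punct = str.maketrans("", "", string.punctuation)
    for page, text in page_texts.items():
        # set() collapses repeated words so each page is added at most once
        for word in set(text.lower().translate(strip_punct).split()):
            index.setdefault(word, []).append(page)
    return index

inv_index = build_index(pages)
print(inv_index["berkeley"])
print(len(inv_index))
```

Here `len(inv_index)` is the number of entries and `inv_index` is the dictionary itself, matching the two values the function is asked to return.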

In [4]:
#part 2



#import beautiful soup
from bs4 import BeautifulSoup
import string

#strip word of punctuation and convert to all lower-case
def stripWord( w ):
    w = w.translate(str.maketrans('', '', string.punctuation))
    w = w.lower()
    return( w )

def webCrawlIndexer():
    #start our lexicon
    lexicon = {}

    # loop over the previously determined webpages
    for each in web_pages:
        # open them and slurp their data
        link = "https://people.ischool.berkeley.edu/~chuang/i206/b5/" + each
        webUrl  = urllib.request.urlopen(link)
        data = str(webUrl.read())

        # make them into a soup (yum!)
        soup = BeautifulSoup(data, 'html.parser')

        # I apologize for my puns it's past my bed time 
        ingredients = soup.get_text()

        # remove special characters
        ingredients = ingredients.replace("\\n", '')
        ingredients = stripWord(ingredients)


        # split them into individual words
        words = ingredients.split()

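        # cleanup: the page was read with str(webUrl.read()), so the text is the repr
        # of a bytes object; the stray 'b' word most likely comes from its b'...' prefix
        #words.remove('californiaxe2x80x94in')
        words.remove('b')

        # The same str() call leaves escaped bytes such as \xe2\x80\x94 in the text, which
        # is why tokens like californiaxe2x80x94in appear once punctuation is stripped.
        # Decoding the bytes instead of calling str() would avoid these artifacts. Since
        # the assignment didn't ask me to clean this up, the output below keeps them.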
        

        # TRACING
        #print(words)

        # loop over the words
        for word in words:
            # get this word's page list, creating an empty one for a new word
            pages = lexicon.setdefault(word, [])
            # record the page only once, even if the word repeats on it
            if each not in pages:
                pages.append(each)

    return len(lexicon), lexicon
    

word_count, lex = webCrawlIndexer()
print("Our word count is:", word_count, "\n")
print("For the record - I did some cleanup, which the assignment did not say I needed to do. It just said to use the beautiful soup module above. If you expected us to remove every single weird entry or duplicate word with some weird characters on the end that get_text seems to find, the assignment description should have said so explicitly.\n")
for each in lex:
    print(each, "=", lex[each])

Our word count is: 461 

For the record - I did some cleanup, which the assignment did not say I needed to do. It just said to use the beautiful soup module above. If you expected us to remove every single weird entry or duplicate word with some weird characters on the end that get_text seems to find, the assignment description should have said so explicitly.

206 = ['index.html']
crawler = ['index.html']
home = ['index.html', 'information.html', 'Berkeley.html', 'ISchool.html', 'MIMS.html', 'CityOfBerkeley.html', 'UCBerkeley.html', 'BerkeleyCollege.html', 'SouthHall.html', 'Campanile.html']
page = ['index.html', 'information.html']
the = ['index.html', 'information.html', 'ISchool.html', 'MIMS.html', 'CityOfBerkeley.html', 'UCBerkeley.html', 'BerkeleyCollege.html', 'SouthHall.html', 'Campanile.html']
where = ['index.html']
any = ['index.html', 'UCBerkeley.html']
information = ['index.html', 'information.html', 'ISchool.html', 'MIMS.html']
can = ['index.html', 'Berkeley.html']
be = ['i

## Part 3. Search Query Interface (2 points)

Write a search query interface that prompts a user to enter a search query term, and prints a list of web pages corresponding to the query term if it exists in the inverted index from Part 2, or prints "No results found" if it does not exist, or quits the interface if the user enters 'q'.

For simplicity, the query terms are limited to a single word. You do not need to support search queries with multiple keywords.
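One way to structure such an interface, sketched here under assumptions, is to keep the lookup separate from the prompt loop so the lookup can be checked on its own. The names `sample_index`, `lookup`, and `search_loop` are hypothetical, and the sketch assumes the index keys are lower-cased.

```python
# Hypothetical miniature index, standing in for the dictionary from Part 2.
sample_index = {"berkeley": ["Berkeley.html", "ISchool.html"]}

def lookup(term, index):
    """Return the pages for a single-word query, or None if it is absent."""
    # Normalize case and surrounding whitespace (assumes lower-cased index keys).
    return index.get(term.strip().lower())

def search_loop(index):
    """Prompt until the user enters 'q', printing matches or a fallback."""
    while True:
        term = input("Enter a search term (q to quit): ")
        if term.strip().lower() == "q":
            break
        pages = lookup(term, index)
        print(pages if pages else "No results found")
```

Calling `search_loop(sample_index)` starts the interactive prompt. Normalizing the query means that "Berkeley" and "berkeley" return the same pages.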

In [5]:
#part 3

#define the function
def searchInterface():
    # take input
    print("Please enter a word (q to quit): ")
    search_in = input()
    # keep looping until they hit q
    while search_in.lower() != "q":
        # check if its in the dictionary
        if search_in in lex.keys():
            print("Your word can be found in the following web pages:", lex[search_in])
        # otherwise make them try again
        else:
            print("Your word is not in the lexicon. Try again!")
            
        # prompt for input again and keep looping
        print("Please enter a word (q to quit): ")
        search_in = input()
    

searchInterface()


Please enter a word (q to quit): 
Your word is not in the lexicon. Try again!
Please enter a word (q to quit): 
Your word is not in the lexicon. Try again!
Please enter a word (q to quit): 
Your word is not in the lexicon. Try again!
Please enter a word (q to quit): 
Your word is not in the lexicon. Try again!
Please enter a word (q to quit): 
Your word is not in the lexicon. Try again!
Please enter a word (q to quit): 
Your word is not in the lexicon. Try again!
Please enter a word (q to quit): 
Your word is not in the lexicon. Try again!
Please enter a word (q to quit): 
Your word can be found in the following web pages: ['Berkeley.html', 'ISchool.html', 'CityOfBerkeley.html', 'UCBerkeley.html', 'SouthHall.html']
Please enter a word (q to quit): 
Your word can be found in the following web pages: ['information.html', 'Berkeley.html', 'ISchool.html', 'CityOfBerkeley.html', 'UCBerkeley.html', 'BerkeleyCollege.html']
Please enter a word (q to quit): 
Your word is not in the lexicon. Try

## Extra Credit. Search Results Webpage (1 point)

Optional: Construct and display a search results webpage (in HTML format) that shows a list of web pages (including actual hyperlinks to the pages) that contain the search term.

Python provides an easy way to display a web page with the `webbrowser` package. If you run the following, a web browser opens and shows the specified page:
```
import webbrowser
webbrowser.open("https://ischool.berkeley.edu/")
```

If you write your search results webpage out to a local file on your computer, you can use `webbrowser.open()` to display it, e.g.:

`webbrowser.open("file:///Users/name/Documents/search_results.html")`

The web page should be readable but it does not have to be pretty. Be sure to handle the case where there are no matches.
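A minimal sketch of building such a page follows, including the no-match case. The function names `render_results` and `write_results` and the output filename are assumptions for illustration. `html.escape` is used so that a search term containing markup is shown as text, and `Path.as_uri()` produces an absolute `file://` address for `webbrowser.open()`.

```python
import html
from pathlib import Path

BASE_URL = "https://people.ischool.berkeley.edu/~chuang/i206/b5/"

def render_results(term, pages):
    """Build a minimal HTML results page; pages may be an empty list."""
    safe_term = html.escape(term)   # avoid injecting user input as markup
    if pages:
        items = "".join(
            f'<li><a href="{BASE_URL}{html.escape(p)}">{html.escape(p)}</a></li>'
            for p in pages
        )
        body = f"<ul>{items}</ul>"
    else:
        body = "<p>No results found</p>"
    return (
        "<html><head><title>Search Results</title></head><body>"
        f"<h1>Results for {safe_term}</h1>{body}</body></html>"
    )

def write_results(term, pages, filename="search_results.html"):
    """Write the page to disk and return a file:// URI for webbrowser.open()."""
    path = Path(filename).resolve()
    path.write_text(render_results(term, pages), encoding="utf-8")
    return path.as_uri()
```

A call such as `webbrowser.open(write_results("berkeley", ["Berkeley.html"]))` would then display the page, assuming `webbrowser` has been imported.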

In [18]:
# extra credit

# import webbrowser module
import webbrowser

#define the function
def searchInterfacev2():

    # prompt for input
    print("Please enter a word (q to quit): ")
    search_in = input()

    # keep looping until they enter "q"
    while search_in.lower() != "q":
        # start the html contents
        html_contents = '''
        <html>
            <head>
                <title> Search Results </title>
            </head>
            <body>
                <h1> 206b Web Crawler Search Results (Assignment 5) </h1>
                <p> Your search results are below! </p>
        '''
        # append some stuff
        html_contents += "<p> Your search term is: <b>"
        html_contents += search_in
        html_contents += "</b></p>"

        # if it's in the lexicon, add some html to include links 
        if search_in in lex.keys():
            print("Your results have been opened in a browser window.\n")

            for each in lex[search_in]:
                html_contents += "<p>"
                html_contents += "<a href='https://people.ischool.berkeley.edu/~chuang/i206/b5/"
                html_contents += each
                html_contents += "'>"
                html_contents += each
                html_contents += "</a></p>"

        # instructions were unclear if I should handle this in browser window, so I just 
        # did it to be safe    
        else:
            html_contents += "<p> Your word is not in the lexicon. Please switch back to your python window and try again. </p>"
        
        # gotta close up those tags!
        html_contents += "</body> </html>"

        # write to file (the with-block closes the file automatically)
        with open("search_output.html", "w") as html_file:
            html_file.write(html_contents)

        # use browser to open the file we just wrote to 
        webbrowser.open("search_output.html")

        # loop again
        print("Please enter a word (q to quit): ")
        search_in = input()

    # pretty handling of a quit
    print("\nYou have chosen to quit. Thanks for playing!")
    
## MAIN
searchInterfacev2()

Please enter a word (q to quit): 
Your results have been opened in a browser window.

Please enter a word (q to quit): 
Please enter a word (q to quit): 

You have chosen to quit. Thanks for playing!
