# Danilo's first web crawler

I'm taking a Udacity course on data science and have decided to challenge myself by writing my first web crawler.
I'll do that using a combination of Python 3, Anaconda and Jupyter notebooks.

When finished, this crawler will retrieve all web links from a given URL.

## Creating an environment with Anaconda

$ conda create -n first_crawler python=3.6.4

## Activating the environment

$ conda activate first_crawler

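## Installing a package to read the content of a web page

$ conda install urllib3

Note that the code below only uses `urllib.request`, which ships with Python's standard library, so this step is optional.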


## The function below will retrieve the HTML content of a given URL

In [30]:
import urllib.request

def get_page(url):
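    # Only fetch web URLs; anything else (e.g. protocol-relative '//host' links) is skipped
    if not url.startswith('http'):
        return ''
    # Skip XML documents such as feeds and sitemaps
    if url.endswith('xml'):
        return ''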
    page_reader = urllib.request.urlopen(url)
    return page_reader.read().decode('utf-8')
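
Network requests can fail or hang, so a more defensive variant may be worth having. The sketch below is only an assumption about how that could look: `get_page_safe` and its `timeout` parameter are hypothetical names that are not part of the notebook, and it assumes that only `http` and `https` URLs should be fetched.

```python
import urllib.request


def get_page_safe(url, timeout=10):
    """Return the decoded body of `url`, or '' when it is skipped or fails."""
    # Only plain web URLs are fetched; XML documents (feeds, sitemaps) are skipped.
    if not url.startswith(('http://', 'https://')) or url.endswith('xml'):
        return ''
    try:
        with urllib.request.urlopen(url, timeout=timeout) as page_reader:
            # errors='replace' keeps going when a page is not valid UTF-8.
            return page_reader.read().decode('utf-8', errors='replace')
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and timeouts; ValueError covers malformed URLs.
        return ''
```

Returning an empty string on failure keeps the same contract as `get_page`, so a broken link simply contributes no child links instead of stopping the whole crawl.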


## Giving a taste of the page's content

In [3]:
html = get_page('http://www.jexperts.com.br/')

html_length = len(html)

html_maximum_length = min(100, html_length)

print(html[:html_maximum_length])

<!DOCTYPE html>
<html lang="pt-BR" prefix="og: http://ogp.me/ns#">
<head>
	<meta charset="UTF-8">



## The following function will find the next quote after a given position

### (It could be a single or double quote)

In [4]:
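def get_next_quote_index(initial_position, html_content):
    link_single_quotes = html_content.find("'", initial_position + 1)
    link_double_quotes = html_content.find('"', initial_position + 1)
    # str.find returns -1 when nothing is found, so ignore misses before taking the minimum
    found_quotes = [index for index in (link_single_quotes, link_double_quotes) if index >= 0]
    return min(found_quotes) if found_quotes else -1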
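
Scanning for quotes by hand works on simple pages, but it can be confused by unquoted attributes or by the word `href` appearing in ordinary text. As a possible alternative, and not what this notebook uses, the standard library's `html.parser` can extract the attribute values directly. `HrefCollector` and `extract_hrefs` are hypothetical names introduced only for this sketch.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the value of every href attribute, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name == 'href' and value:
                self.hrefs.append(value)


def extract_hrefs(html_content):
    collector = HrefCollector()
    collector.feed(html_content)
    return collector.hrefs


# Double-quoted, single-quoted and unquoted attribute values in one sample.
sample = '<a href="http://a.example/">A</a> <link href=\'/style.css\'> <a href=plain.html>B</a>'
print(extract_hrefs(sample))
```

The parser deals with the quoting itself, so no index arithmetic is needed.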

## Finally, the code below will go through the entire content of the page and find all the web links' URLs

In [31]:
def get_all_links(urls_to_ignore, initial_position, html_content):

    # Search the page passed in as an argument, not the global `html` variable
    href_index = html_content.find('href', initial_position)
    
    if (href_index >= 0):
        
        link_beginning_quotes = get_next_quote_index(href_index, html_content)
        link_end_quotes = get_next_quote_index(link_beginning_quotes, html_content)
        found_link = html_content[link_beginning_quotes + 1:link_end_quotes]
        
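        # Skip empty or already visited links, but keep scanning the rest of the page
        if not found_link or found_link in urls_to_ignore:
            return get_all_links(urls_to_ignore, link_end_quotes, html_content)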
        
        new_urls_to_ignore = (found_link, *urls_to_ignore)
        children_links = get_all_links(new_urls_to_ignore, 0, get_page(found_link))
        not_none_children_links = children_links if children_links else ()
        
        next_link = get_all_links(new_urls_to_ignore, link_end_quotes, html_content)
 
        if next_link: 
            return (found_link, *next_link, *not_none_children_links)
        
        return (found_link, *not_none_children_links)  
    
    return
            

all_links = get_all_links((), 0, html)

print('\n'.join([str(link) for link in all_links]))
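
The recursive approach above adds one stack frame per link, so a large site can hit Python's recursion limit. A hedged alternative is an iterative breadth-first crawl with a set of visited URLs. In the sketch below `crawl`, `fetch_links`, `max_pages` and `fake_site` are all hypothetical names; `fetch_links` stands in for "download the page and extract its hrefs" so that the example runs without network access. `urljoin` is used to resolve relative and protocol-relative links such as `//s.w.org`.

```python
from collections import deque
from urllib.parse import urljoin


def crawl(start_url, fetch_links, max_pages=50):
    """Breadth-first crawl; returns every URL discovered, in discovery order."""
    seen = [start_url]        # discovery order, kept as a list for stable output
    seen_set = {start_url}    # same URLs, for fast membership checks
    queue = deque([start_url])
    pages_fetched = 0

    while queue and pages_fetched < max_pages:
        current = queue.popleft()
        pages_fetched += 1
        for href in fetch_links(current):
            # Resolve relative ('/feed/') and protocol-relative ('//s.w.org') links.
            absolute = urljoin(current, href)
            if absolute not in seen_set:
                seen_set.add(absolute)
                seen.append(absolute)
                queue.append(absolute)
    return seen


# Offline stand-in for a real site: each URL maps to the hrefs found on that page.
fake_site = {
    'http://site.example/': ['/about', '//cdn.example/lib.js', 'http://site.example/'],
    'http://site.example/about': ['/', '/contact'],
}
result = crawl('http://site.example/', lambda url: fake_site.get(url, []))
print('\n'.join(result))
```

The `max_pages` cap keeps the crawl bounded even when pages keep linking to new URLs. The listing that follows was produced by the recursive `get_all_links` call above, not by this sketch.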


https://fonts.googleapis.com/css?family=Roboto+Condensed:300,300i,400,400i,700,700i
http://www.jexperts.com.br/
//maps.google.com
//fonts.googleapis.com
//s.w.org
http://www.jexperts.com.br/feed/
http://www.jexperts.com.br/comments/feed/
http://www.jexperts.com.br/wp-content/plugins/contact-form-7/includes/css/styles.css?ver=4.8
http://www.jexperts.com.br/wp-content/plugins/revslider/public/assets/css/settings.css?ver=5.4.1
http://www.jexperts.com.br/wp-content/themes/experts/style.css?ver=4.8
http://www.jexperts.com.br/wp-content/themes/experts/css/bootstrap.min.css?ver=4.8
http://www.jexperts.com.br/wp-content/themes/experts/css/bootstrap-select.min.css?ver=4.8
http://www.jexperts.com.br/wp-content/themes/experts/css/jquery.bootstrap-touchspin.css?ver=4.8
http://www.jexperts.com.br/wp-content/themes/experts/css/font-awesome.css?ver=4.8
http://www.jexperts.com.br/wp-content/themes/experts/css/flaticon.css?ver=4.8
http://www.jexperts.com.br/wp-content/themes/experts/css/icomoon.css?ver