<a href="https://colab.research.google.com/github/Sixsamuraip/Simple_web_crawler/blob/main/Simple_web_crawler_Pieng.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Web Crawler Implementation

The simple web crawler designed here is composed of four main modules:
* <b>Scheduler</b>: maintains a queue of URLs to visit
* <b>Downloader</b>: downloads web pages
* <b>Analyzer</b>: analyzes content and extracts links
* <b>Storage</b>: stores content and metadata
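
As a rough orientation, the four modules can be wired together in a single loop. The following is only a minimal sketch: the `crawl()` function, its `download`, `analyze`, and `store` parameters, and the in-memory `fake_web` dictionary are hypothetical names introduced here so that the example runs without network access.

```python
from collections import deque

def crawl(seed_url, download, analyze, store, max_pages=10):
    """Run a tiny crawl loop; the three callables stand in for the modules."""
    frontier = deque([seed_url])   # Scheduler: URLs waiting to be visited
    seen = {seed_url}              # URLs that have already been scheduled
    pages = 0
    while frontier and pages < max_pages:
        url = frontier.popleft()
        page = download(url)               # Downloader
        store(url, page)                   # Storage
        pages += 1
        for link in analyze(url, page):    # Analyzer
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

# Hypothetical in-memory "web": each page is just its list of out-going links
fake_web = {'a': ['b', 'c'], 'b': ['a'], 'c': []}
saved = {}
count = crawl('a',
              download=lambda url: fake_web.get(url, []),
              analyze=lambda url, page: page,
              store=lambda url, page: saved.update({url: page}))
print(count, sorted(saved))  # -> 3 ['a', 'b', 'c']
```

The sections below fill in each of these roles with real code.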

## 1) Basic Downloader
Every web crawler should declare a <i>name</i> and identify its <i>owner</i> (the '`User-Agent`' and '`From`' request headers, respectively). A request can fail, for instance because the connection timed out or the page was not found. You can print '`response.status_code`' to track down such problems.

In [None]:
import requests
from requests.exceptions import HTTPError

headers = {
    'User-Agent': '6110500135_porpieng',
    'From': 'porpieng.n@ku.th'
}
seed_url = 'https://www.ku.ac.th/th'

def get_page(url):
    global headers
    text = ''
    try:
        response = requests.get(url, headers=headers, timeout=2)
        # If the response was successful, no Exception will be raised
        response.raise_for_status()
    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')  # Python 3.6
    except Exception as err:
        print(f'Other error occurred: {err}')  # Python 3.6
    else:
        print('Success!')
        text = response.text
    return text.lower()

raw_html = get_page(seed_url)
print(raw_html)

Success!
<!doctype html>
<html lang="th">

<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
  <meta http-equiv="x-ua-compatible" content="ie=edge">
  <title>มหาวิทยาลัยเกษตรศาสตร์</title>

  <meta name="keywords" content="ku,kasetsart university,มหาวิทยาลัยเกษตรศาสตร์"/>
<meta name="description" content="มหาวิทยาลัยเกษตรศาสตร์ สร้างสรรค์ศาสตร์แห่งแผ่นดินสู่สากลเพื่อพัฒนาประเทศอย่างยั่งยืน kasetsart university is a public research university in bangkok,..." />
<meta property="og:site_name" content="www.ku.ac.th"/>
<meta property="og:locale" content="th_th"/>
<meta property="og:type" content="website"/>
<meta property="og:title" content="ku | มหาวิทยาลัยเกษตรศาสตร์ รอบรั้วชาวนนทรี" />
<meta property="og:url" content="https://www.ku.ac.th/th"/>
<meta property="og:image" content="https://www.ku.ac.th/assets/ku_logo.png" />
<meta property="og:description" content="มหาวิทยาลัยเกษตรศาสตร์ สร้างสรรค์ศาสตร์

## 2) Basic Analyzer
### 2.1 Link Parser
The following code is an example of a simple link parser. The program extracts links by considering the <i>anchor</i> tag only, and stores them in a `urls` list.

In [None]:
def link_parser(raw_html):
    urls = []
    pattern_start = '<a href="'
    pattern_end = '"'
    index = 0
    while True:
        # str.find() returns -1 when nothing is found; 0 is a valid match position
        start = raw_html.find(pattern_start, index)
        if start == -1:
            break
        start += len(pattern_start)
        end = raw_html.find(pattern_end, start)
        if end == -1:
            break  # unterminated href attribute
        link = raw_html[start:end]
        # Keep non-empty links and skip duplicates
        if link and link not in urls:
            urls.append(link)
        index = end
    return urls

raw_html = '<html><body><a href="http://test1.com">test1</a><br><a href="http://test2.com">test2</a></body></html>'
print(link_parser(raw_html))

['http://test1.com', 'http://test2.com']
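
The string-matching parser above only recognizes anchors written exactly as `<a href="...">`. As a possible alternative, the sketch below uses the standard library's `html.parser.HTMLParser`, which also copes with extra attributes, single quotes, and upper-case tags. The names `AnchorParser` and `parse_links` are hypothetical and not part of the original notebook.

```python
from html.parser import HTMLParser

class AnchorParser(HTMLParser):
    """Collect the href values of all <a> tags, without duplicates."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value and value not in self.urls:
                self.urls.append(value)

def parse_links(raw_html):
    parser = AnchorParser()
    parser.feed(raw_html)
    return parser.urls

sample = ('<a class="x" href=\'http://test1.com\'>1</a>'
          '<A HREF="/th/page">2</A>'
          '<a href="http://test1.com">duplicate</a>')
print(parse_links(sample))  # -> ['http://test1.com', '/th/page']
```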


### 2.2 URL Normalization
The following code is an example of using the `urljoin()` function to resolve a relative URL into an absolute one.

In [None]:
from urllib.parse import urljoin

# Define an absolute (base) URL of a web page
base_url = 'https://mike.cpe.ku.ac.th'

# An example of the extracted absolute link
link_1 = 'http://www.ku.ac.th'
# An example of the extracted relative link
link_2 = 'download/homework.html'

base = 'https://www.ku.ac.th/th/'
link_3 = '/th/community-home'

# Resolve links
abs_link_1 = urljoin(base_url, link_1)
abs_link_2 = urljoin(base_url, link_2)

print(abs_link_1)  # -> http://www.ku.ac.th
print(abs_link_2)  # -> https://mike.cpe.ku.ac.th/download/homework.html

http://www.ku.ac.th
https://mike.cpe.ku.ac.th/download/homework.html
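
The cell above defines `base` and `link_3` but never resolves them. The sketch below covers that case and a few other link forms that commonly appear in pages, and also strips the `#fragment` part with `urldefrag()` so that the same page is not scheduled twice. The helper name `normalize` is an assumption made for this example.

```python
from urllib.parse import urljoin, urldefrag

def normalize(page_url, link):
    """Resolve a link against the page it was found on and drop any #fragment."""
    absolute, _fragment = urldefrag(urljoin(page_url, link))
    return absolute

page_url = 'https://www.ku.ac.th/th/'
for link in ['/th/community-home', 'news.html', '../en/', '#top', '//www.ku.ac.th/en']:
    print(link, '->', normalize(page_url, link))

# /th/community-home -> https://www.ku.ac.th/th/community-home
# news.html -> https://www.ku.ac.th/th/news.html
# ../en/ -> https://www.ku.ac.th/en/
# #top -> https://www.ku.ac.th/th/
# //www.ku.ac.th/en -> https://www.ku.ac.th/en
```

Note that relative links are resolved against the URL of the page they were found on, not against the seed URL.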


## 3) Basic Scheduler
The following code uses a FIFO queue to hold the extracted URLs that are still to be downloaded. The main crawling process invokes the two functions defined earlier, `get_page()` and `link_parser()`, to download a web page and then extract its outgoing links. All extracted links are then stored in a queue. We define two queues here: `frontier_q` and `visited_q`. The former is the FIFO queue of URLs to download next, while the latter records which web pages have already been downloaded.

In [None]:
seed_url = 'https://www.ku.ac.th/th'
frontier_q = [seed_url]
visited_q = []

def enqueue(links):
    # Add only links that are neither waiting in the frontier nor already visited
    for link in links:
        if link not in frontier_q and link not in visited_q:
            frontier_q.append(link)

# FIFO queue: remove and return the oldest URL
def dequeue():
    return frontier_q.pop(0)

#--- main process ---#
current_url = dequeue()
visited_q.append(current_url)
raw_html = get_page(current_url)
extracted_links = link_parser(raw_html)
enqueue(extracted_links)
print(frontier_q)

Success!
['/th/community-home', '/th/newcomer-home', '/th/partner-home']
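
With plain lists, every membership test (`link not in frontier_q`) scans the whole list, which becomes slow once thousands of URLs have been collected. One possible alternative, shown here only as a sketch, pairs a `collections.deque` with a `set` of already-seen URLs. The `Frontier` class is a hypothetical name and not part of the original notebook.

```python
from collections import deque

class Frontier:
    """FIFO queue of URLs that never schedules the same URL twice."""

    def __init__(self, seeds):
        self._queue = deque(seeds)
        self._seen = set(seeds)   # every URL ever enqueued, visited or not

    def enqueue(self, links):
        for link in links:
            if link not in self._seen:
                self._seen.add(link)
                self._queue.append(link)

    def dequeue(self):
        # Return None instead of raising when the queue is empty
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)

frontier = Frontier(['https://www.ku.ac.th/th'])
frontier.enqueue(['/th/community-home', '/th/newcomer-home', '/th/community-home'])
first = frontier.dequeue()
print(first, len(frontier))  # -> https://www.ku.ac.th/th 2
```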


## 4) Storing Text into a File
In the following code, we first use the `os.makedirs()` function to create (sub)directories. Note that the `exist_ok=True` parameter prevents an exception if the target directory already exists. We then use the `open()`, `write()`, and `close()` functions to open a file, write some text into it, and close it afterwards. In addition, we import the `codecs` module and use the '`utf-8`' encoding so that non-English content is written correctly.

In [None]:
import os, codecs

# Create (sub)directories with the 0o755 permission
# @param 'exist_ok' is True for no exception if the target directory already exists
path = 'html/subdir1/subdir3'
os.makedirs(path, 0o755, exist_ok=True)

# Write content into a file
raw_html = '<html><body><a href="http://test1.com">test1</a><br><a href="http://test2.com">test2</a></body></html>'
abs_file = path + '/index.html'
f = codecs.open(abs_file, 'w', 'utf-8')
f.write(raw_html)
f.close()
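
When storing crawled pages, the target path has to be derived from the URL. The following is a minimal sketch of one possible mapping, assuming that URLs whose last path segment has no file extension are stored as a directory containing `index.html`. The helper names `url_to_path` and `save_page` are hypothetical, and the built-in `open()` with `encoding='utf-8'` is used here in place of `codecs.open()`.

```python
import os
import tempfile
from urllib.parse import urlparse

def url_to_path(url, root='html'):
    """Map a URL to a local file path (illustrative layout only)."""
    parts = urlparse(url)
    path = parts.path.lstrip('/')
    last_segment = path.rsplit('/', 1)[-1]
    # Directory-like URLs (no file extension) are stored as .../index.html
    if '.' not in last_segment:
        path = (path.rstrip('/') + '/index.html').lstrip('/')
    return f'{root}/{parts.netloc}/{path}'

def save_page(url, text, root='html'):
    file_path = url_to_path(url, root)
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    # The with-statement closes the file even if write() fails
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(text)
    return file_path

# Write into a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    saved_file = save_page('https://www.ku.ac.th/th/', '<html>มหาวิทยาลัย</html>', root=tmp)
    with open(saved_file, encoding='utf-8') as f:
        content = f.read()

print(url_to_path('https://www.ku.ac.th/th/'))  # -> html/www.ku.ac.th/th/index.html
```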

# <font color="blue">Your Turn ...</font>
Write a web crawler to collect 10,000 web pages (including only '`.htm`' and '`.html`' files) within the '`ku.ac.th`' domain.
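
One way to read the two constraints (domain and file type) is as a filter applied to every URL before it is enqueued. The sketch below is one strict interpretation, not the required solution: the function name `in_scope` and its parameters are assumptions, and it compares the host name instead of searching for `ku.ac.th` anywhere in the URL.

```python
from urllib.parse import urlparse

def in_scope(url, domain='ku.ac.th', extensions=('.htm', '.html')):
    """Return True for http(s) URLs on the domain whose path ends in .htm/.html."""
    parts = urlparse(url)
    if parts.scheme not in ('http', 'https'):
        return False
    host = (parts.hostname or '').lower()
    # Accept the domain itself and any of its subdomains
    if host != domain and not host.endswith('.' + domain):
        return False
    return parts.path.lower().endswith(extensions)

print(in_scope('https://mike.cpe.ku.ac.th/download/homework.html'))  # -> True
print(in_scope('https://www.ku.ac.th/assets/report.pdf'))            # -> False
print(in_scope('https://example.com/ku.ac.th/index.html'))           # -> False
```

A looser interpretation would also follow directory-like URLs such as `/th`, which usually return HTML as well.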

In [None]:
import re
from urllib.parse import urljoin,unquote
import requests
import os, codecs
from requests.exceptions import HTTPError


headers = {
    'User-Agent': '6110500135_porpieng',
    'From': 'porpieng.n@ku.th'
}
seed_url = 'https://www.ku.ac.th/th'

page_count = 0

def get_page(url):
    global headers
    global page_count
    global page_err
    global robot_err
    text = ''
    try:
        response = requests.get(url, headers=headers, timeout=10)
        # If the response was successful, no Exception will be raised
        response.raise_for_status()
    except HTTPError as http_err:
        if 'robots.txt' in url:
            robot_err = True
        else:
            page_err = True
        print(f'HTTP error occurred: {http_err}')  # Python 3.6
    except Exception as err:
        if 'robots.txt' in url:
            robot_err = True
        else:
            page_err = True
        print(f'Other error occurred: {err}')  # Python 3.6
    else:
        print('Success!')
        if 'robots.txt' not in url:
            page_count += 1
        text = response.text
    return text.lower()


# def link_parser(raw_html):
#     urls = [];
#     regex_http = "^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$"
#     urls = re.findall(regex_http,raw_html)
#     return urls

def link_parser(raw_html):
    urls = []
    pattern_start = '<a href="'
    pattern_end = '"'
    index = 0
    while True:
        # str.find() returns -1 when nothing is found; 0 is a valid match position
        start = raw_html.find(pattern_start, index)
        if start == -1:
            break
        start += len(pattern_start)
        end = raw_html.find(pattern_end, start)
        if end == -1:
            break  # unterminated href attribute
        # Decode percent-escapes before the duplicate check
        link = unquote(raw_html[start:end])
        if link and link not in urls:
            urls.append(link)
        index = end
    return urls

def is_pdf(url):
    # True when the URL ends with the '.pdf' extension
    return url.endswith('.pdf')


seed_url = 'https://www.ku.ac.th/th'
frontier_q = [seed_url]
visited_q = []

def enqueue(links):
    # Add only links that are neither waiting in the frontier nor already visited
    for link in links:
        if link not in frontier_q and link not in visited_q:
            frontier_q.append(link)

# FIFO queue: remove and return the oldest URL
def dequeue():
    return frontier_q.pop(0)


###############################################
abs_link = 'https://www.ku.ac.th/th'
current_url = ''
page_err = False
robot_err = False
abs_visited = []

while (page_count < 10500):
    try:
        page_err = False
        robot_err = False
        current_url = dequeue()
        abs_link = urljoin(seed_url, current_url)
        if is_pdf(abs_link) :
            continue
        if 'download' in abs_link:
            continue
        if 'ku.ac.th' not in abs_link:
            continue
        if abs_link not in abs_visited:
            abs_visited.append(abs_link)
        else:
            continue
        print('absolute path is :',abs_link)

        raw_html = get_page(abs_link)

        links = link_parser(raw_html)
        visited_q.append(current_url)
        #print(links)
        extracted_links = links
        enqueue(extracted_links)
        if page_err:
            continue

        # robots.txt is served from the site root, not from each page's path
        robot_link = urljoin(abs_link, '/robots.txt')
        #print(robot_link)
        raw_robot = get_page(robot_link)


        file_link = abs_link
        file_link = file_link.replace('https://','')
        file_link = file_link.replace('http://','')
        file_link = file_link.replace('?','{question}')
        file_link = file_link.replace('*','{star}')
        path = 'html/' + file_link
        #print('current path :',path)
        os.makedirs(path, 0o755, exist_ok=True)

        # Write content into a file
        abs_file = path + '/index.html'
        f = codecs.open(abs_file, 'w', 'utf-8')
        f.write(raw_html)
        f.close()
        print('file successfully saved at', abs_file)

        if ('user-agent' in raw_robot) and (robot_err == False):
            robot_file = path + '/robots.txt'
            r = codecs.open(robot_file, 'w', 'utf-8')
            r.write(raw_robot)
            r.close()

            ro = codecs.open('html/list_robot.txt', 'a', 'utf-8')
            ro.write(abs_link + '\n')
            ro.close()
            print(robot_link, 'has robots.txt')

        print('now have', page_count, 'pages')
        print('')
        #print(frontier_q)
        #print(visited_q)
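    except IndexError:
        # dequeue() raised because the frontier is empty: nothing left to crawl
        print('frontier is empty, stopping')
        break
    except Exception as err:
        # Skip this URL and continue with the next one in the frontier
        print('error:', err)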
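
The crawler above downloads and saves `robots.txt` files but does not consult them before fetching a page. The sketch below shows one way this could be done with the standard library's `urllib.robotparser`. The `robots_txt` text is made-up sample content parsed from a string, so the example needs no network access; a real crawler would pass in the downloaded text instead.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

user_agent = '6110500135_porpieng'
page_url = 'https://www.ku.ac.th/th/community-home'

# robots.txt is conventionally located at the root of the host
robots_url = urljoin(page_url, '/robots.txt')
print(robots_url)  # -> https://www.ku.ac.th/robots.txt

# Made-up robots.txt content, used only for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(robots_txt.splitlines())

allowed = rules.can_fetch(user_agent, page_url)
blocked = rules.can_fetch(user_agent, 'https://www.ku.ac.th/private/report.html')
print(allowed, blocked)  # -> True False
```

Caching one parsed `RobotFileParser` per host would avoid downloading the same `robots.txt` for every page.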

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
a = 'html/'
b = 'https://www.ku.ac.th/th/'
b = b.replace('https://','')
c = a+b
print(c)

html/www.ku.ac.th/th/


In [None]:
base = 'https://www.ku.ac.th/th/'
link = ''
ans = urljoin(base, link)  # urljoin() takes the base URL first, then the link
print(ans)

https://www.ku.ac.th/th/


In [None]:
!rm -rf html/

In [None]:
!zip -r /content/test.zip /content/html

  adding: content/html/ (stored 0%)
  adding: content/html/clgc.agri.kps.ku.ac.th/ (stored 0%)
  adding: content/html/clgc.agri.kps.ku.ac.th/index.html (deflated 82%)
  adding: content/html/askme.registrar.ku.ac.th/ (stored 0%)
  adding: content/html/askme.registrar.ku.ac.th/contact-us/ (stored 0%)
  adding: content/html/askme.registrar.ku.ac.th/contact-us/index.html (deflated 75%)
  adding: content/html/kuappstore.ku.ac.th/ (stored 0%)
  adding: content/html/kuappstore.ku.ac.th/index.html (deflated 85%)
  adding: content/html/annualconference.ku.ac.th/ (stored 0%)
  adding: content/html/annualconference.ku.ac.th/index.html (deflated 77%)
  adding: content/html/annualconference.ku.ac.th/index.php/ (stored 0%)
  adding: content/html/annualconference.ku.ac.th/index.php/index.html (deflated 77%)
  adding: content/html/mis.grad.ku.ac.th/ (stored 0%)
  adding: content/html/mis.grad.ku.ac.th/faculty/ (stored 0%)
  adding: content/html/mis.grad.ku.ac.th/faculty/register/ (stored 0%)
  adding:

In [None]:
from google.colab import files
files.download("/content/test.zip")  # must match the archive path created by the zip command