# Focused web crawl of a discussion forum

## Open-domain crawling

* Naive open-domain crawling: start from an index page, follow every link, and check whether each page contains something useful
* In practice you just seed the crawler with a few URLs and let it run
* You can use ready-made crawlers, e.g. [SpiderLing](http://corpus.tools/wiki/SpiderLing), [Heritrix](https://webarchive.jira.com/wiki/display/Heritrix/Heritrix)
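
The naive strategy above can be sketched as a breadth-first traversal. This is only an illustration written for Python 3, not how SpiderLing or Heritrix work: the names `LinkExtractor` and `crawl` are made up here, and `fetch` is a hypothetical callable (URL in, HTML string or `None` out) so that the sketch can be tried without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    fetch is a hypothetical callable: url -> HTML string, or None on failure.
    Returns a dict mapping each downloaded URL to its HTML.
    """
    queue = deque(seeds)
    seen = set(seeds)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # skip pages that could not be downloaded
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # resolve relative links and drop the '#fragment' part
            absolute = urldefrag(urljoin(url, href))[0]
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

In a real crawl `fetch` would wrap an HTTP download with identifying headers and a delay between requests; here it can be as simple as the `get` method of a dictionary that maps URLs to HTML strings.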

## Case study: Futisforum http://futisforum2.org/

* When you target the crawl at a specific site, it makes sense to first eyeball the site
* Check its [robots.txt](http://futisforum2.org/robots.txt)
* **IMPORTANT: set crawl headers (state who you are) and a crawl delay (do not choke the server)**
* In Python, the [Requests library](http://docs.python-requests.org/en/master/) makes downloading pages easy, and [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) is handy for processing HTML
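
Reading robots.txt can also be done programmatically. The sketch below uses `urllib.robotparser` from the Python 3 standard library to check whether a URL may be fetched and what crawl delay the site asks for. The robots.txt content, the `forum.example` URLs, the agent name and the helper `load_rules` are all made up for illustration; they are not the actual rules of the forum.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration; NOT the real file of any forum.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 20
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(EXAMPLE_ROBOTS_TXT)
agent = "ExampleResearchCrawler"  # hypothetical User-Agent name

# May this agent fetch these (hypothetical) URLs?
allowed = rules.can_fetch(agent, "http://forum.example/index.php")
blocked = rules.can_fetch(agent, "http://forum.example/private/profile")

# Crawl delay requested by the site in seconds, or None if not specified
delay = rules.crawl_delay(agent)
```

The value of `delay` can then be passed to `time.sleep` between requests, falling back to a conservative default when the site does not state one.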


In [None]:
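import sys
sys.exit() # stops the cell here so that the crawl is not started by accident; remove this line to run it

import re
import codecs
import time
import xml.etree.ElementTree as et

import requests
from bs4 import BeautifulSoup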

crawl_headers={'User-Agent':'Research Group from University of Turku, Educational purposes', 'From':'jmnybl@utu.fi'}


## Download and process urls from sitemap-boards.php
sys.stderr.write("Downloading http://futisforum2.org/sitemap-boards.php\n")
wp=requests.get(u"http://futisforum2.org/sitemap-boards.php", headers=crawl_headers) # sitemap of the boards (its url comes from robots.txt), wp is a response object
time.sleep(20) # wait for 20sec (crawl delay)
tree = et.fromstring(wp.text)

boards=[] # boards are the main sections of the forum (Suomen maajoukkue, Veikkausliiga...), e.g. http://futisforum2.org/index.php?board=1.0

for elem in tree.iter():
    if elem.tag.endswith(u"loc"):
        url=elem.text.strip().rsplit(u".",1)[0] # strip the .0 part from url
        boards.append(url)
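
The crawling loop calls a helper `get_max_page(regex, page)` whose definition is not shown here. Because the loop compares its result against the offsets `i` and `j` in the urls, one plausible reading is that it returns the largest page offset linked from the page. The following is a hypothetical reconstruction under that assumption, not the original helper; it also assumes that the regex captures the offset in its second group and that the page is either an HTML string or an object whose `str()` is the HTML.

```python
import re


def get_max_page(regex, page):
    """Return the largest page offset linked from the page.

    Hypothetical reconstruction: regex must capture the board/topic id in
    group 1 and the page offset in group 2. page may be an HTML string or
    any object whose str() is the HTML, such as a parsed document.
    """
    text_type = type(u"")  # the text type on both Python 2 and Python 3
    html = page if isinstance(page, text_type) else str(page)
    offsets = [int(match.group(2)) for match in regex.finditer(html)]
    if not offsets:
        # the caller's try/except treats this as a failed page
        raise ValueError("no page links found")
    return max(offsets)
```

Note that this simple version looks at every matching link on the page, including links to other boards or topics, which typically point at offset 0 and therefore do not change the maximum.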

In [None]:
sys.exit()

visited_topics=set() # here we collect all urls pointing to topics (message chains) we have visited
# http://futisforum2.org/index.php?topic=187475.0

counter=1
board_regex=re.compile(r"board=([0-9]+)\.([0-9]+)")
topic_regex=re.compile(r"topic=([0-9]+)\.([0-9]+)")

## now go through all boards (e.g. http://futisforum2.org/index.php?board=1.0)
for board in boards:
    i=0 #(http://futisforum2.org/index.php?board=1.i)
    max_board=None

    while True:
        
        if max_board is not None and i>max_board: # all pages of this board are done
            break
            
        board_url=u".".join((board,str(i)))
        sys.stderr.write("Downloading %s\n" % board_url)
        wp=requests.get(board_url,headers=crawl_headers) # download the page
        time.sleep(20) # wait for 20sec (crawl delay)

        parsed_html=BeautifulSoup(wp.text, 'html.parser')

        if max_board is None:
            try:
                max_board=get_max_page(board_regex,parsed_html) # this function returns the number of pages this board has, so that we can iterate over them
            except Exception:
                sys.stderr.write("Something went wrong: %s\n" % board) # always keep a log file
                break

                
        # now iterate through all topic chain links on this page
        for link in parsed_html.find_all(u"a"): # BeautifulSoup makes this really easy

            if u"nofollow" in link.get(u"rel",[]): # crawlers are not supposed to follow these (user profiles, shortcuts to newest messages etc.)
                continue
            url=link.get(u"href")
            if not url or u"#" in url: # take only clean urls (the ones with '#' point to a specific message etc.)
                continue

            if u"topic=" in url: # these are message chains
                try:
                    topic=url.rsplit(u".",1)[0]
                    if topic in visited_topics: # make sure that we don't download the same pages twice
                        continue
                    visited_topics.add(topic)
                except Exception:
                    sys.stderr.write("Something went wrong: %s\n" % url) # always keep a log file
                    continue
                    
                j=0 #(http://futisforum2.org/index.php?topic=187475.j)
                max_topic=None
                while True:
                    if max_topic is not None and j>max_topic: # all pages of this topic are done
                        break
                    topic_url=u".".join((topic,str(j)))
                    sys.stderr.write("Downloading %s\n" % topic_url) # always keep a log file
                    tp=requests.get(topic_url,headers=crawl_headers) # download the page
                    time.sleep(20) # wait for 20sec (crawl delay)

                    
                    if max_topic is None: # again we have to know how many pages there are so that we can download them all
                        try:
                            max_topic=get_max_page(topic_regex,tp.text)
                        except Exception:
                            sys.stderr.write("Something went wrong: %s\n" % topic_url)
                            break
                            

                    # SAVE THE PAGE
                    sys.stderr.write("Saving %s\n" % topic_url)
                    
                    with codecs.open(u"page_"+str(counter)+u".html",u"w",u"utf-8") as f: # the file is closed even if writing fails
                        f.write(tp.text) # write the whole page at once
                    counter+=1
                    
                    j+=25 # increase the topic counter
                    
        i+=30 # increase the board counter

## When working with big data

* Use try-except statements. It does not really matter if you miss a page or two, but it does matter if one bad page stops a long-running crawl.
* Keep sufficient progress and error logs, so that you know how far the crawl has progressed and can restart it in case something unexpected happens.
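
Both points can be combined in a small wrapper. The sketch below, written for Python 3, is one possible way to do it and makes several assumptions: `download` is a hypothetical callable that fetches and saves one url and raises an exception on failure, and the names `load_done`, `crawl_all`, `done_path` and `log_path` are made up for this example. Finished urls are appended to a plain text file, so a restarted run skips them.

```python
import logging
import os


def load_done(path):
    """Read the urls finished in earlier runs (empty set if no file yet)."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def crawl_all(urls, download, done_path, log_path):
    """Download every url not yet listed in done_path, logging as we go.

    download is a hypothetical callable that fetches and saves one url
    and raises an exception on failure.
    """
    logger = logging.getLogger("crawl")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    done = load_done(done_path)
    try:
        with open(done_path, "a", encoding="utf-8") as done_file:
            for url in urls:
                if url in done:
                    continue  # already saved in an earlier run
                try:
                    download(url)
                except Exception:
                    # log the error with its traceback and move on
                    logger.exception("Failed: %s", url)
                    continue
                done_file.write(url + "\n")
                done_file.flush()  # keep the progress file up to date
                done.add(url)
                logger.info("Saved: %s", url)
    finally:
        logger.removeHandler(handler)
        handler.close()
    return done
```

Failed urls are not written to the progress file, so they are retried on the next run, while the log file keeps the traceback of each failure for later inspection.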