<a href="https://colab.research.google.com/github/LachlanSharp/Lorem-ipsum/blob/main/A3_Lachlan_Sharp_web_crawler.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MA5851 - Assessment 3: WebCrawler and NLP System

---

* **Author**: Lachlan Sharp
* **Due Date**: 2021-12-08

---


<div style="font-size:8pt; background-color:WhiteSmoke; padding-left:32px;">

\  

*Note*: When reviewing this notebook in Google Colab, use the ***Table of contents*** for easier navigation and an overview of this data-project's structure.

</div>


---

# Setup of data-project 

## Import Dependencies

In [None]:
import os.path # Operating system interface
from time import time, sleep # Time access and conversions
from datetime import datetime # Basic date and time types
import glob # Unix style pathname pattern expansion

import numpy as np # array/matrices support
import pandas as pd # data analysis

import requests # HTTP library
from bs4 import BeautifulSoup # HTML/XML document parser
import html # HyperText Markup Language support

import re # regular expressions


# Get python version
import platform # Platform’s identifying data
print('\nPython version {}'.format(platform.python_version()))


Python version 3.7.12


## Set directory paths

Set pertinent file paths and names.

### Mount Google Drive

This section applies to use of this Python notebook within Google Colab.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Set file paths

In [None]:
def mk_dir(path):
    """
    Creates a directory if it doesn't exist

    Parameters:
    -----------
    str : path
        The full directory path for the folder to be created.
        Note: the path should not terminate with a '/'

    Returns:
    --------
    bool
        If folder exists, True, otherwise, False
    """
    if not os.path.exists(path):
        try:
            os.makedirs(path)
        except OSError:
            # Creation failed (e.g., permissions); the return value reports the outcome
            pass
    return os.path.exists(path)

# Set the root and data directory paths for the data-project
filepath_root = '/content/drive/MyDrive/Colab Notebooks/MA5851/Assessment 3' # data-project root
filepath_data = os.path.join(filepath_root, 'data') # data folder
print("Directory 'data' exists?: {}".format(mk_dir(filepath_data)))

# Set filenames
filename_tophorse_items_df = "df_tophorse_items.csv"
filename_tophorse_item_details_df = "df_tophorse_item_details.csv"

filename_horseforum_threads_df = "df_horseforum_threads.csv" 
filename_horseforum_thread_posts_df = "df_horseforum_thread_posts.csv"
filename_horseforum_thread_pages_df = "df_horseforum_thread_pages.csv"

Directory 'data' exists?: True
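
The behaviour that `mk_dir()` documents (create the folder if it is missing, then report whether it exists) can also be sketched with the `exist_ok` flag of `os.makedirs`. This is a minimal alternative, not the notebook's own helper; the name `ensure_dir` and the temporary folder are assumptions made for illustration.

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path` (and any missing parents); return True if it is a directory."""
    # exist_ok=True makes the call a no-op when the directory is already there
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)


# Hypothetical usage inside a throwaway temporary folder
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data", "raw")
    created = ensure_dir(target)        # first call creates the nested folders
    created_again = ensure_dir(target)  # second call finds them already present
```

Calling the helper twice is safe, which suits a notebook cell that may be re-run many times.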


## User defined functions


# WebCrawler

In [None]:
def web_crawler(
    url_domain
    ,url_path
    ,fx_get_ResultSet
    ,fx_get_ResultSet_data
    ,fx_get_next_page_url    
    ,fx_get_page_data = None
    ,max_pages = None
    ,echo_every_n = 1
    ,wait_page_sec = 0
    ):
    """
    Generic web crawler

    Parameters:
    -----------
    str : url_domain
        The base URL domain (including protocol) to be crawled (e.g., https://www.tophorse.com.au)
    str : url_path
        The URL path for the first page to be crawled (e.g. /horses-for-sale/)
        Note that the full URL is formed via concatenation with url_domain, hence
        care should be taken to provide the correct prefixing & suffixing '/'
    function : fx_get_ResultSet
        A function to return a BeautifulSoup ResultSet object; 
        this will be a collection of items (i.e., tags) within the page being crawled.
    function : fx_get_ResultSet_data
        A function that returns scraped data from a ResultSet, as a list;
        this will be data from the collection of items (i.e., tags) returned by `fx_get_ResultSet()`.
    function : fx_get_next_page_url
        A function that returns the "next" page's url path as a string; 
        this will be specific for the page being crawled.
    function : fx_get_page_data
        An optional function that returns scraped data from the page being crawled, as a dictionary;
        this will be data of the page itself, as opposed to data from the collection of items within
        the page returned by `fx_get_ResultSet_data()`.        
    int : max_pages
        The maximum number of pages to iterate (handy while still testing code)
    int : echo_every_n
        prints output every n iterations, where n = echo_every_n.
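    int : wait_page_sec
        Duration in seconds to wait between requesting pages

    Returns:
    --------
    tuple (list, list)
        `data_items`, the scraped item data accumulated over all pages, and
        `data_pages`, the page-level dictionaries (empty if `fx_get_page_data` is None).
    """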
    # Initialise
    if max_pages is None:
        max_pages = float('inf') 
    data_pages = [] 
    data_items = []
    count_page = 1

    # Crawl through pages until `url_path` is an empty string.
    while (url_path != "") and (count_page <= max_pages):
        response = requests.get(url_domain + url_path)

        # Echo progress
        if (echo_every_n > 0) and (count_page % echo_every_n == 0): # lazy evaluation
            print('Page {}, HTTP response status: {}'.format(count_page, response.status_code))

        if response.status_code == 200:
            # Scrape page

            soup = BeautifulSoup(response.text, 'html.parser')
 
            # Get data of the page itself (optional)
            if fx_get_page_data is not None:
                page_data = fx_get_page_data(soup)
                page_data['page_url_path'] = url_path # Add key
                # Append results
                data_pages.extend([page_data])

            # Get data from a collection of items within the page
            tags = fx_get_ResultSet(soup)
            item_data = fx_get_ResultSet_data(tags)

            # Append results
            data_items.extend(item_data)
            if (echo_every_n > 0) and (count_page % echo_every_n == 0): # lazy evaluation
                print("\t{} items...".format(len(item_data)))

            # Get next page url
            url_path = fx_get_next_page_url(soup)
            #print(url_path)
        else:
            # force condition to exit loop
            url_path = ""

        count_page += 1
        sleep(wait_page_sec)
    
    return data_items, data_pages
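
The control flow of `web_crawler()` (follow the "next page" path until it is an empty string or `max_pages` is reached) can be illustrated without any network access. The sketch below is a simplified stand-in, not the crawler itself: `FAKE_SITE`, its paths, and the function `crawl` are hypothetical, and a dictionary lookup replaces the HTTP request and the scraping callbacks.

```python
# Hypothetical in-memory "site": each path maps to (items on the page, next path)
FAKE_SITE = {
    "/page-1/": (["a", "b"], "/page-2/"),
    "/page-2/": (["c"], "/page-3/"),
    "/page-3/": ([], ""),  # an empty next path marks the last page
}


def crawl(start_path, max_pages=None):
    """Follow 'next' paths until one is empty or `max_pages` pages were read."""
    if max_pages is None:
        max_pages = float("inf")
    items = []
    path = start_path
    page = 1
    while path != "" and page <= max_pages:
        page_items, path = FAKE_SITE[path]  # stands in for request + scrape
        items.extend(page_items)
        page += 1
    return items


all_items = crawl("/page-1/")
first_page_only = crawl("/page-1/", max_pages=1)
```

Setting `max_pages=1` stops after the first page, which mirrors how the parameter is meant to be used while testing.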

## tophorse sales

### UDF

In [None]:
def get_next_page_url_tophorse(tophorse_soup):
    """
    Gets the 'next page' url path from the www.tophorse.com.au main page.

    Parameters:
    -----------
    bs4.BeautifulSoup : tophorse_soup
        A BeautifulSoup object for the web page.

    Returns:
    --------
    str
        the relative url path for the 'next page' if successful, otherwise an empty string.
    """
    soup = tophorse_soup
    url_path = ""
    try:
        url_path = soup.find(
            'div'
            ,id = 'catalogueMidCol'
            ).find(
                'a'
                ,class_='redLink'
                ,string=re.compile("^Next")
                )['href']
    except:
        pass

    return url_path



def get_listings_tophorse(tophorse_soup):
    """
    Gets the advert listings from a www.tophorse.com.au listings page.

    Parameters:
    -----------
    bs4.BeautifulSoup : tophorse_soup
        A BeautifulSoup object for the web page.

    Returns:
    --------
    bs4.element.ResultSet
        An iterable collection of advert listings, as returned by BeautifulSoup
    """
    soup = tophorse_soup
    return soup.find_all('div', class_='listAdvert hd')




def scrape_listing_details_tophorse(advert_listings):
    """
    Iterates a bs4 ResultSet of horse advert (i.e., www.tophorse.com.au) listings 
    to get predefined values.

    Parameters:
    -----------
    bs4.element.ResultSet : advert_listings
        An iterable collection of advert listings, as returned by BeautifulSoup


    Returns:
    --------
    list of dict {str, Any}
        the predefined details of each advert
            + item_title
            + item_url_path
            + item_created_date
            + item_img_url_path
    """
    listings = advert_listings
    data_items=[]
    if listings is not None:
        for item in listings:
            # initialise
            item_title = ""
            item_url_path = ""
            item_created_date = ""
            item_img_url_path = ""

            # Get item's listing title and url path
            tag1 = None  # reset so a tag from the previous item is never reused
            try:
                tag1 = item.find('h2', class_='listAdvertTitle')
                tag_url = tag1.find('a')
                item_title = tag_url.get_text()
                item_url_path = tag_url['href']
            except:
                pass

            # Get item's image
            try:
                item_img_url_path = item.find('div', class_='advertImage').find('a').img['src']
            except:
                pass

            # Get item's created date
            try:
                item_created_date = tag1.find('span').get_text()
            except:
                pass

            # Save result
            dict_item = {
                'item_title' : item_title
                ,'item_url_path' : item_url_path
                ,'item_created_date' : item_created_date
                ,'item_img_url_path' : item_img_url_path
                }
            data_items.append(dict_item.copy())

    return data_items
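
`get_next_page_url_tophorse()` selects the pagination link with `string=re.compile("^Next")`. Beautiful Soup applies such a pattern with its `search()` method, so the `^` anchor is what restricts matches to text that starts with "Next". The sketch below shows the effect on a few hypothetical link texts; these strings are assumptions, not values taken from the site.

```python
import re

# Same pattern as used in get_next_page_url_tophorse()
NEXT_LINK = re.compile("^Next")

# Hypothetical link texts, for illustration only
link_texts = ["Next", "Next >>", "Previous", " Next", "See Next page"]

# search() with a leading ^ only succeeds when the text begins with "Next"
matches = [text for text in link_texts if NEXT_LINK.search(text)]
```

Note that a link text with leading whitespace would not match, so the pattern relies on the site emitting the label without padding.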

### Crawl TopHorse listings

In [None]:
# Specify which web page should be crawled
url_domain =  "https://www.tophorse.com.au"
url_path = "/horses-for-sale/" # note terminating '/'

# Set functions relevant to scraping the web page
get_ResultSet = get_listings_tophorse
get_ResultSet_data = scrape_listing_details_tophorse
get_next_page_url = get_next_page_url_tophorse

# Crawl pages and scrape
start_time = time()
print('Crawl start UTC: {}'.format(datetime.fromtimestamp(start_time)))
data_tophorse_items, _ = web_crawler(
    url_domain =  url_domain
    ,url_path = url_path
    ,fx_get_ResultSet = get_ResultSet
    ,fx_get_ResultSet_data = get_ResultSet_data
    ,fx_get_next_page_url = get_next_page_url 
    ,echo_every_n = 1   
)
end_time = time()
print("---crawl complete---")
print()
print("Listings count: {}".format(len(data_tophorse_items)))
print("Crawl time (sec): {0}".format(end_time - start_time))
print('Crawl end UTC: {}'.format(datetime.fromtimestamp(end_time)))

Crawl start UTC: 2021-11-30 13:40:45.652615
Page 1, HTTP response status: 200
	15 items...
Page 2, HTTP response status: 200
	15 items...
Page 3, HTTP response status: 200
	15 items...
Page 4, HTTP response status: 200
	14 items...
Page 5, HTTP response status: 200
	13 items...
Page 6, HTTP response status: 200
	14 items...
Page 7, HTTP response status: 200
	13 items...
Page 8, HTTP response status: 200
	14 items...
Page 9, HTTP response status: 200
	13 items...
Page 10, HTTP response status: 200
	14 items...
Page 11, HTTP response status: 200
	15 items...
Page 12, HTTP response status: 200
	12 items...
Page 13, HTTP response status: 200
	3 items...
Page 14, HTTP response status: 200
	0 items...
---crawl complete---

Listings count: 170
Crawl time (sec): 15.835863828659058
Crawl end UTC: 2021-11-30 13:41:01.488478
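
`web_crawler()` builds each request URL by plain concatenation of `url_domain` and `url_path`, which is why its docstring warns about the leading and trailing '/'. As a possible alternative (not what this notebook uses), `urllib.parse.urljoin` from the standard library tolerates a missing or doubled slash. A minimal sketch:

```python
from urllib.parse import urljoin

domain = "https://www.tophorse.com.au"

# Plain concatenation silently produces a broken URL if the '/' is missing
concatenated = domain + "horses-for-sale/"

# urljoin resolves the path against the domain in each of these cases
joined_plain = urljoin(domain, "/horses-for-sale/")
joined_trailing = urljoin(domain + "/", "/horses-for-sale/")
joined_no_slash = urljoin(domain, "horses-for-sale/")
```

All three `urljoin` calls give the same address, whereas `concatenated` runs the host name and path together.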


In [None]:
# Convert to data frame and save result locally
df_tophorse_items = pd.DataFrame(data_tophorse_items)

# Save dataframe locally
df_tophorse_items.to_csv(os.path.join(filepath_data, filename_tophorse_items_df), index=False)

df_tophorse_items

Unnamed: 0,item_title,item_url_path,item_created_date,item_img_url_path
0,,/horses-for-sale/paint-horse/Buckskin-Paint-Co...,30/11/2021,/images/ResizedImages/30-11-21-296166Image1_w2...
1,,/horses-for-sale/endurance-and-trail/Vee-Rock-...,30/11/2021,/images/ResizedImages/30-11-21-488434Image1_w2...
2,,/horses-for-sale/ponies/Tory__27-11-21-563432,27/11/2021,/images/ResizedImages/27-11-21-586832Image1_w2...
3,,/horses-for-sale/australian-stock-horse/Johnny...,27/11/2021,/images/ResizedImages/27-11-21-562895Image1_w2...
4,,/horses-for-sale/broodmare/Wanted---Riding-Pon...,26/11/2021,/images/ResizedImages/24-11-21-231789Image1_w2...
...,...,...,...,...
165,,/horses-for-sale/allrounder-pony/Tilly__16-12-...,17/12/2019,/images/ResizedImages/16-12-19-224877Image7_w2...
166,,/horses-for-sale/allrounder-horse/Grey-TB-mare...,04/12/2019,/images/ResizedImages/4-12-19-248828Image1_w22...
167,,/horses-for-sale/allrounder-pony/stunning-pony...,04/12/2019,/images/ResizedImages/4-12-19-214163Image3_w22...
168,,/horses-for-sale/allrounder-pony/amazing-Bitle...,03/12/2019,/images/ResizedImages/3-12-19-723137Image1_w22...


### Reload data

In [None]:
# Reload tophorse listings
df_tophorse_items = pd.read_csv(os.path.join(filepath_data, filename_tophorse_items_df))
df_tophorse_items

Unnamed: 0,item_title,item_url_path,item_created_date,item_img_url_path
0,,/horses-for-sale/paint-horse/Buckskin-Paint-Co...,30/11/2021,/images/ResizedImages/30-11-21-296166Image1_w2...
1,,/horses-for-sale/endurance-and-trail/Vee-Rock-...,30/11/2021,/images/ResizedImages/30-11-21-488434Image1_w2...
2,,/horses-for-sale/ponies/Tory__27-11-21-563432,27/11/2021,/images/ResizedImages/27-11-21-586832Image1_w2...
3,,/horses-for-sale/australian-stock-horse/Johnny...,27/11/2021,/images/ResizedImages/27-11-21-562895Image1_w2...
4,,/horses-for-sale/broodmare/Wanted---Riding-Pon...,26/11/2021,/images/ResizedImages/24-11-21-231789Image1_w2...
...,...,...,...,...
165,,/horses-for-sale/allrounder-pony/Tilly__16-12-...,17/12/2019,/images/ResizedImages/16-12-19-224877Image7_w2...
166,,/horses-for-sale/allrounder-horse/Grey-TB-mare...,04/12/2019,/images/ResizedImages/4-12-19-248828Image1_w22...
167,,/horses-for-sale/allrounder-pony/stunning-pony...,04/12/2019,/images/ResizedImages/4-12-19-214163Image3_w22...
168,,/horses-for-sale/allrounder-pony/amazing-Bitle...,03/12/2019,/images/ResizedImages/3-12-19-723137Image1_w22...


### Crawl listing details

In [None]:
# Crawl pages and scrape
start_time = time()
print('Crawl start UTC: {}'.format(datetime.fromtimestamp(start_time)))

df_item_details = pd.DataFrame()
count_max = df_tophorse_items.shape[0]
echo_every_n = 10
for count, url_path in enumerate(df_tophorse_items['item_url_path']):
    # Request advert details
    url_item = url_domain + url_path
    status_code = 0
    try:
        response = requests.get(url_item)
        status_code = response.status_code
        soup = BeautifulSoup(response.text, 'html.parser')
    except requests.exceptions.ConnectionError:
        # Target machine actively refused connection
        status_code = "connection refused"
    except:
        pass

    # Show progress
    if (echo_every_n > 0) and (count % echo_every_n == 0): # lazy evaluation
        print("item {} of {}:".format(count + 1, count_max))
        print("\t{}".format(url_item))
        print('\tHTTP response status: {}'.format(status_code))

    df_item = pd.DataFrame({ 'url_path':  [url_path] })

    if status_code == 200:
        # Initialise
        item_description, main_image_url_path, video_url, Seller_Id = "", "", "", ""
        df_table_details = pd.DataFrame()
        image_count = 0
        parents = [ '', '']
        grand_parents = [ '', '', '', '']
        great_grand_parents = [''] * 8  # the third generation has 2**3 ancestors

        try: # Item description
            item_description = soup.find('div', class_='itemDescription').get_text()
        except:
            pass

        try: # Details table
            table_details = soup.find('table', class_='productPageItemDetails')
            df_table_details = pd.read_html(table_details.prettify())[0]
            df_table_details.rename(columns={0: 'key', 1: 'value'}, inplace=True)
            df_table_details = df_table_details.iloc[:df_table_details[df_table_details['key'].str.contains('(?i)^contact', regex= True)].index[0], ]
            df_table_details['key'] = df_table_details['key'].str.replace(':','')
            df_table_details.set_index('key', inplace=True)
            df_table_details = df_table_details.transpose()
            df_table_details.reset_index(inplace=True)
            df_table_details            
        except:
            pass

        try: # Breeding Tree
            table_breeding_tree = soup.find('table', class_='breedingTree')
            df_table_breeding_tree = pd.read_html(table_breeding_tree.prettify())[0]
            df_table_breeding_tree.rename(columns={0: 'parents', 1: 'grand_parents', 2: 'great_grand_parents'}, inplace=True)
            df_table_breeding_tree.fillna('', inplace=True)
            parents = df_table_breeding_tree[df_table_breeding_tree.index % 4 == 0]['parents'].tolist()
            grand_parents = df_table_breeding_tree[df_table_breeding_tree.index % 2 == 0]['grand_parents'].tolist()
            great_grand_parents = df_table_breeding_tree['great_grand_parents'].tolist()
        except:
            pass

        try: # Number of images
            image_count = len(soup.find('div', class_='contentBlock').find_all('img'))
        except:
            pass

        try: # Main image url_path
            main_image_url_path = soup.find('div', class_='mainImage').find('img')['src']
        except:
            pass

        try: # Video url
            video_url = soup.find('div', id='videoContent').find('iframe')['src']
        except:
            pass

        try: # Seller Id
            Seller_Id = soup.find('div', class_='productPageRightCol').find('a', id=re.compile(r'SellersOtherAdverts$'))['href']
            Seller_Id = re.findall(r'[^/]+$', Seller_Id)[0]
        except:
            pass

        # Set item results
        df_item = pd.DataFrame({ 'url_path':  [url_path] })
        df_item['Description'] = item_description
        df_item = pd.concat([df_item.reset_index(drop=True), df_table_details], axis=1)
        df_item['image_count'] = image_count
        df_item['main_image_url_path'] = main_image_url_path
        df_item['video_url'] = video_url
        df_item['Seller_Id'] = Seller_Id
        df_item['parents'] = ";".join(parents)
        df_item['grand_parents'] = ";".join(grand_parents)
        df_item['great_grand_parents'] = ";".join(great_grand_parents)

    df_item_details = pd.concat([df_item_details, df_item])

df_item_details.reset_index(inplace=True)
df_tophorse_item_details = df_item_details.drop(df_item_details.columns[0], axis=1).copy()

# A horse's category on the site is encoded in its URL path, e.g.,
#    https://www.tophorse.com.au/horses-for-sale/ --> `performance-horse` <-- /Super-quiet-educated-gelding-__9-1-20-596629
df_tophorse_item_details['category'] = df_tophorse_item_details['url_path'].replace({ r'/horses-for-sale/([^/]+)/.*' : r'\1' }, regex = True, inplace=False)
del df_item_details

end_time = time()
print("---crawl complete---")
print()
print("Listings count: {}".format(len(df_tophorse_item_details.index)))
print("Crawl time (sec): {0}".format(end_time - start_time))
print('Crawl end UTC: {}'.format(datetime.fromtimestamp(end_time)))

Crawl start UTC: 2021-11-30 13:42:50.237417
item 1 of 170:
	https://www.tophorse.com.au/horses-for-sale/paint-horse/Buckskin-Paint-Colt-__30-11-21-178636
	HTTP response status: 200
item 11 of 170:
	https://www.tophorse.com.au/horses-for-sale/australian-stock-horse/Stock-horse-for-sale-__22-11-21-474731
	HTTP response status: 200
item 21 of 170:
	https://www.tophorse.com.au/horses-for-sale/ex-racehorse/FOR-LEASE---beautiful-OTTB-__12-11-21-637184
	HTTP response status: 200
item 31 of 170:
	https://www.tophorse.com.au/horses-for-sale/allrounder-horse/Luke__19-10-21-822854
	HTTP response status: 200
item 41 of 170:
	https://www.tophorse.com.au/horses-for-sale/allrounder-horse/Phoenix__21-9-21-519817
	HTTP response status: 200
item 51 of 170:
	https://www.tophorse.com.au/horses-for-sale/allrounder-horse/HRV-Hero--Parisian-Pride--Paris-__5-8-21-116345
	HTTP response status: 200
item 61 of 170:
	https://www.tophorse.com.au/horses-for-sale/performance-horse/F-rst-Love-x-Versace-Black-Colt__15
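
The two regular expressions used in the cell above can be tried in isolation with the standard `re` module. The category pattern and the example advert path are the ones quoted in the code comment; the seller `href` prefix is a hypothetical value, since only the trailing segment (the seller id) appears in the scraped output.

```python
import re

# Category: keep the path segment that follows /horses-for-sale/
CATEGORY_PATTERN = re.compile(r"/horses-for-sale/([^/]+)/.*")

example_path = (
    "/horses-for-sale/performance-horse/"
    "Super-quiet-educated-gelding-__9-1-20-596629"
)
category = CATEGORY_PATTERN.sub(r"\1", example_path)

# Seller id: keep everything after the last '/' of the link
# (the '/hypothetical/prefix/' part is an assumption for illustration)
seller_href = "/hypothetical/prefix/HERO__9-6-20-752499"
seller_id = re.findall(r"[^/]+$", seller_href)[0]
```

A path that does not match the category pattern is returned unchanged by `sub()`, so unexpected URL layouts would surface as full paths in the `category` column.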

In [None]:
# Save dataframe locally
df_tophorse_item_details.to_csv(os.path.join(filepath_data, filename_tophorse_item_details_df), index=False)
df_tophorse_item_details

Unnamed: 0,url_path,Description,index,Height,Age,Colour,Sex,Breed,Terms,Location,P/Trade,Price,Ad Code,image_count,main_image_url_path,video_url,Seller_Id,parents,grand_parents,great_grand_parents,category
0,/horses-for-sale/paint-horse/Buckskin-Paint-Co...,\r\n Phaa Rego pending - DECKAL...,value,16.0 hh,0 yrs,Buckskin,Colt,Paint Horse,For Sale,"Beechwood, New South Wales",Private,$ 0 ONO,30-11-21-178636,4,/images/ResizedImages/30-11-21-296166Image1_w5...,,,QL RIGHTON RINGER;WARRALEE APPLE DANISH,QL RIGHT ON THE MONEY;QUIRRAN LEA MISS GALAXY;...,STRAIT SMOKIN MONEY (IMP USA) 9047;QUIRRAN LEA...,paint-horse
1,/horses-for-sale/endurance-and-trail/Vee-Rock-...,"\r\n 75 starts, $136,562 Prize ...",value,16.0 hh,12 yrs,Bay,Gelding,Standardbred,For Sale,"Oaklands Junction, Victoria",Trade,$ 2500.00,30-11-21-497655,6,/images/ResizedImages/30-11-21-488434Image1_w5...,,HERO__9-6-20-752499,;,;;;,;;;,endurance-and-trail
2,/horses-for-sale/ponies/Tory__27-11-21-563432,\r\n The most sweetest pony I h...,value,11.0 hh,10 yrs,Grey,Mare,Other,For Sale,"Yamba, New South Wales",Private,$ 3500.00 ONO,27-11-21-563432,4,/images/ResizedImages/27-11-21-586832Image1_w5...,https://www.youtube.com/embed/lUedBJKyWAk,,;,;;;,;;;,ponies
3,/horses-for-sale/australian-stock-horse/Johnny...,\r\n A real sweetheart who enjo...,value,14.3 hh,10 yrs,Bay,Gelding,Australian Stock Horse,For Sale,"Yamba, New South Wales",Private,$ 4500.00 ONO,27-11-21-191667,4,/images/ResizedImages/27-11-21-562895Image1_w5...,https://www.youtube.com/embed/wIFXwr6xACE,,;,;;;,;;;,australian-stock-horse
4,/horses-for-sale/broodmare/Wanted---Riding-Pon...,\r\n Looking for a registered r...,value,13.0 hh,7 yrs,Black,Mare,Riding Pony,Wanted,"Currawang, New South Wales",Private,$ 500.00,24-11-21-697984,1,/images/ResizedImages/24-11-21-231789Image1_w5...,,,;,;;;,;;;,broodmare
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165,/horses-for-sale/allrounder-pony/Tilly__16-12-...,\r\n 12 year old Australian sto...,value,12.0 hh,12 yrs,Brown,Mare,Australian Pony,For Sale,"Mcgraths Hill, New South Wales",Private,$ 3500.00 Inc GST,16-12-19-224877,6,/images/ResizedImages/16-12-19-224877Image7_w5...,,,;,;;;,;;;,allrounder-pony
166,/horses-for-sale/allrounder-horse/Grey-TB-mare...,"\r\n Grey mare, 7 years old, st...",value,15.2 hh,7 yrs,Grey,Mare,Thoroughbred,For Sale,"Geelong, Victoria",Private,POA,4-12-19-945675,1,/images/ResizedImages/4-12-19-248828Image1_w57...,,,;,;;;,;;;,allrounder-horse
167,/horses-for-sale/allrounder-pony/stunning-pony...,\r\n Green broken Welsh B Pony....,value,12.2 hh,0 yrs,Black,Mare,Welsh Section B,For Sale,"Drysdale, Victoria",Private,$ 5000.00 ONO,4-12-19-214163,1,/images/ResizedImages/4-12-19-214163Image3_w57...,,,;,;;;,;;;,allrounder-pony
168,/horses-for-sale/allrounder-pony/amazing-Bitle...,\r\n This seven year old stocky...,value,14.0 hh,7 yrs,Buckskin,Mare,Other,For Sale,"Bendigo South, Victoria",Private,$ 4500.00 Inc GST,3-12-19-282883,6,/images/ResizedImages/3-12-19-723137Image1_w57...,,,;,;;;,;;;,allrounder-pony


In [None]:
df_tophorse_item_details[df_tophorse_item_details['parents'] != ";"]

Unnamed: 0,url_path,Description,index,Height,Age,Colour,Sex,Breed,Terms,Location,P/Trade,Price,Ad Code,image_count,main_image_url_path,video_url,Seller_Id,parents,grand_parents,great_grand_parents,category
0,/horses-for-sale/paint-horse/Buckskin-Paint-Co...,\r\n Phaa Rego pending - DECKAL...,value,16.0 hh,0 yrs,Buckskin,Colt,Paint Horse,For Sale,"Beechwood, New South Wales",Private,$ 0 ONO,30-11-21-178636,4,/images/ResizedImages/30-11-21-296166Image1_w5...,,,QL RIGHTON RINGER;WARRALEE APPLE DANISH,QL RIGHT ON THE MONEY;QUIRRAN LEA MISS GALAXY;...,STRAIT SMOKIN MONEY (IMP USA) 9047;QUIRRAN LEA...,paint-horse
5,/horses-for-sale/allrounder-horse/Sophie__25-1...,\r\n Racename: Mamzelle Sophie\...,value,16.1 hh,6 yrs,Chestnut,Mare,Thoroughbred,For Sale,"Barnawartha, Victoria",Private,POA,25-11-21-919693,3,/images/ResizedImages/25-11-21-896767Image46_w...,,JW-Equestrian__24-6-21-819416,Choisir;Sophielicious,;;;,;;;;;;;,allrounder-horse
6,/horses-for-sale/allrounder-horse/Manny__25-11...,\r\n Racename: All Starr Courag...,value,16.1 hh,7 yrs,Chestnut,Gelding,Thoroughbred,For Sale,"Barnawartha, Victoria",Private,POA,25-11-21-149549,6,/images/ResizedImages/25-11-21-318156Image41_w...,,JW-Equestrian__24-6-21-819416,Strategic Maneuver;Caseys Courage,;;;,;;;;;;;,allrounder-horse
7,/horses-for-sale/allrounder-horse/Alex__25-11-...,\r\n Racename: Ajyaal\r DOB: 13...,value,16.2 hh,5 yrs,Bay,Gelding,Thoroughbred,For Sale,"Barnawartha, Victoria",Private,POA,25-11-21-227556,6,/images/ResizedImages/25-11-21-256712Image36_w...,,JW-Equestrian__24-6-21-819416,Deep Field;Albakoor,;;;,;;;;;;;,allrounder-horse
8,/horses-for-sale/allrounder-horse/Prince__25-1...,"\r\n Price enjoys going slow, t...",value,16.0 hh,4 yrs,Chestnut,Gelding,Thoroughbred,For Sale,"Barnawartha, Victoria",Private,$ 2500.00,25-11-21-732997,6,/images/ResizedImages/25-11-21-682328Image31_w...,,JW-Equestrian__24-6-21-819416,Snitzle;Star Of Sydney,;;;,;;;;;;;,allrounder-horse
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
150,/horses-for-sale/mountain-and-moorland/BAMBORO...,\r\n Welsh Mountain Pony\r Grey...,value,11.1 hh,1 yrs,Grey,Gelding,Other,For Sale,"Cowra, New South Wales",Private,POA,5-3-20-363618,3,/images/ResizedImages/5-3-20-811358Image1_w577...,,,Llangollen Top of the Pops;Bamborough Pou Pi Du,;;;,;;;;;;;,mountain-and-moorland
151,/horses-for-sale/performance-pony/Quality-Part...,\r\n BAMBOROUGH GOSPEL\r DOB: 2...,value,13.1 hh,7 yrs,Chestnut,Filly,Other,For Sale,"Cowra, New South Wales",Private,POA,5-3-20-312285,1,/images/ResizedImages/5-3-20-312285Image1_w577...,,,Rathowen Paragon;Bamborough Graceful,;;;,;;;;;;;,performance-pony
152,/horses-for-sale/broodmare/Bamborough-Royal-Me...,\r\n DOB: 25.11.08\r Lovely bro...,value,13.0 hh,12 yrs,Chestnut,Mare,Other,For Sale,"Cowra, New South Wales",Private,POA,5-3-20-727149,1,/images/ResizedImages/5-3-20-525165Image6_w577...,,,Naruni Park Taylor Made;Bamborough Royal Rhapsody,;;;,;;;;;;;,broodmare
153,/horses-for-sale/broodmare/Lovely-Part-Welsh-B...,\r\n SILKWOOD SUPERNATURAL\r DO...,value,14.1 hh,14 yrs,Chestnut,Mare,Other,For Sale,"Cowra, New South Wales",Private,POA,5-3-20-953732,1,/images/ResizedImages/5-3-20-312145Image7_w577...,,,Silkwood Puss N Boots;Silkwood Sweet Charity,;;;,;;;;;;;,broodmare
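
The breeding-tree extraction picks every fourth row for parents and every second row for grandparents. This works on the assumption that the parsed pedigree table has eight rows and that cells spanning several rows are repeated on each row they cover. The pure-Python sketch below reproduces that selection on hypothetical names and shows why an advert without a pedigree yields `";"` in the `parents` column, which is the value the filter above excludes.

```python
# Hypothetical expanded pedigree table: (parent, grandparent, great-grandparent)
rows = [
    ("Sire", "Sire's sire", "GG1"),
    ("Sire", "Sire's sire", "GG2"),
    ("Sire", "Sire's dam", "GG3"),
    ("Sire", "Sire's dam", "GG4"),
    ("Dam", "Dam's sire", "GG5"),
    ("Dam", "Dam's sire", "GG6"),
    ("Dam", "Dam's dam", "GG7"),
    ("Dam", "Dam's dam", "GG8"),
]

# Each parent covers 4 rows and each grandparent covers 2 rows
parents = [row[0] for i, row in enumerate(rows) if i % 4 == 0]
grand_parents = [row[1] for i, row in enumerate(rows) if i % 2 == 0]
great_grand_parents = [row[2] for row in rows]

parents_field = ";".join(parents)

# With no pedigree, the two empty placeholders join to a lone separator
empty_parents_field = ";".join(["", ""])
```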


## horseforum-health

### UDF scrape forum

In [None]:
def get_next_page_url_horseforum(horseforum_soup):
    """
    Gets the 'next page' url path from the www.horseforum.com forum page.

    Parameters:
    -----------
    bs4.BeautifulSoup : horseforum_soup
        A BeautifulSoup object for the web page.

    Returns:
    --------
    str
        the relative url path for the 'next page' if successful, otherwise an empty string.
    """
    soup = horseforum_soup
    url_path = ""
    try:
        url_path = soup.find('a', { "aria-label" : "Next Page" })['href']
    except:
        pass

    return url_path



def get_threads_horseforum(horseforum_soup):
    """
    Gets the thread listings from a www.horseforum.com forum page.

    Parameters:
    -----------
    bs4.BeautifulSoup : horseforum_soup
        A BeautifulSoup object for the web page.

    Returns:
    --------
    bs4.element.ResultSet
        An iterable collection of forum threads, as returned by BeautifulSoup
    """
    soup = horseforum_soup
    return soup.find_all('div', class_="california-thread-item")




def scrape_thread_details_horseforum(forum_listings):
    """
    Iterates a bs4 ResultSet of horse forum (i.e., https://www.horseforum.com/forums/horse-health.17/)
    listings to get predefined values.

    Parameters:
    -----------
    bs4.element.ResultSet : forum_listings
        An iterable collection of forum listings, as returned by BeautifulSoup


    Returns:
    --------
    list of dict {str, Any}
        the predefined details of each forum-post:
            + item_title
            + item_url_path
            + item_created_date
            + item_created_by
            + item_created_by_url_path
            + item_view_count
            + item_reply_count
            + item_last_replied_date
            + item_last_replied_by
            + item_last_replied_by_url_path

    """
    listings = forum_listings
    data_items=[]
    if listings is not None:
        for item in listings:
            # initialise
            item_title = ""
            item_url_path = ""
            item_created_date = ""
            item_created_by = ""
            item_created_by_url_path = ""
            item_view_count = ""
            item_reply_count = ""
            item_last_replied_date = ""
            item_last_replied_by = ""
            item_last_replied_by_url_path = ""

            # Get item's listing title and details URL
            try:
                tag = item.find('div', class_="structItem-title").find('a', { 'qid' : "thread-item-title"})
                item_title = tag.get_text()
                item_url_path = tag['href']
            except:
                pass

            # Get item's created date
            try:
                item_created_date = item.find(
                    'div', class_="structItem-minor"
                    ).find('a', { 'qid' : "thread-item-start-date"}).find('time')['datetime']
            except:
                pass

            # Get item's created by
            try:
                tag = item.find(class_="structItem-username").find('a', class_="username")
                item_created_by = tag.get_text()
                item_created_by_url_path = tag['href']
            except:
                pass

            # Get item's view count
            try:
                item_view_count = item.find('div', class_="view-count").find('span')['title']
            except:
                pass

            # Get item's reply count
            try:
                item_reply_count = item.find('div', class_="reply-count").find('span').get_text()
            except:
                pass            

            # Get item's last replied date
            try:
                item_last_replied_date = item.find('div', class_="last-poster").find('time')['datetime']
            except:
                pass            

            # Get item's last replied by
            try:
                tag = item.find('div', class_="last-poster").find('a', class_="username")
                item_last_replied_by = tag.get_text()
                item_last_replied_by_url_path = tag['href']
            except:
                pass            

            # Save result
            dict_item = {
                'item_title' : item_title
                ,'item_url_path' : item_url_path
                ,'item_created_date' : item_created_date
                ,'item_created_by' : item_created_by
                ,'item_created_by_url_path' : item_created_by_url_path
                ,'item_view_count' : item_view_count
                ,'item_reply_count' : item_reply_count
                ,'item_last_replied_date' : item_last_replied_date
                ,'item_last_replied_by' : item_last_replied_by
                ,'item_last_replied_by_url_path' : item_last_replied_by_url_path
                }

            data_items.append(dict_item.copy())

    return data_items
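
`scrape_thread_details_horseforum()` stores dates and counts as the raw strings found in the page. For later analysis these can be converted with the standard library. The sketch below is one possible approach, assuming the `datetime` attribute always follows the form seen in the scraped data (for example `2020-12-27T19:19:33-0500`) and that view counts are plain digit strings.

```python
from datetime import datetime, timezone

# Values in the form returned by the scraper
raw_created = "2020-12-27T19:19:33-0500"
raw_views = "1076"

# %z accepts a numeric UTC offset such as -0500
created = datetime.strptime(raw_created, "%Y-%m-%dT%H:%M:%S%z")

# Normalise to UTC so threads from different offsets compare consistently
created_utc = created.astimezone(timezone.utc)

view_count = int(raw_views)
```

Fields that failed to scrape are empty strings, so a real conversion step would need to skip or default those before calling `strptime` or `int`.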

### Crawl horseforum - health

In [None]:
# Specify which web page should be crawled
url_domain =  "https://www.horseforum.com"
url_path = "/forums/horse-health.17/" # note terminating '/'

# Set functions relevant to scraping the web page
get_ResultSet = get_threads_horseforum
get_ResultSet_data = scrape_thread_details_horseforum
get_next_page_url = get_next_page_url_horseforum

# Crawl pages and scrape
start_time = time()
print('Crawl start UTC: {}'.format(datetime.fromtimestamp(start_time)))
data_horseforum_threads, _ = web_crawler(
    url_domain =  url_domain
    ,url_path = url_path
    ,fx_get_ResultSet = get_ResultSet
    ,fx_get_ResultSet_data = get_ResultSet_data
    ,fx_get_next_page_url = get_next_page_url    
    #,max_pages = 2
    ,echo_every_n = 25
)
end_time = time()
print("---crawl complete---")
print()
print("Thread count: {}".format(len(data_horseforum_threads)))
print("Crawl time (sec): {0}".format(end_time - start_time))
print('Crawl end UTC: {}'.format(datetime.fromtimestamp(end_time)))

Crawl start UTC: 2021-11-30 13:28:37.212199
Page 25, HTTP response status: 200
	35 items...
Page 50, HTTP response status: 200
	35 items...
Page 75, HTTP response status: 200
	35 items...
Page 100, HTTP response status: 200
	35 items...
Page 125, HTTP response status: 200
	35 items...
Page 150, HTTP response status: 200
	35 items...
Page 175, HTTP response status: 200
	35 items...
Page 200, HTTP response status: 200
	35 items...
Page 225, HTTP response status: 200
	35 items...
Page 250, HTTP response status: 200
	35 items...
Page 275, HTTP response status: 200
	35 items...
Page 300, HTTP response status: 200
	35 items...
Page 325, HTTP response status: 200
	35 items...
Page 350, HTTP response status: 200
	35 items...
Page 375, HTTP response status: 200
	35 items...
Page 400, HTTP response status: 200
	35 items...
Page 425, HTTP response status: 200
	35 items...
Page 450, HTTP response status: 200
	35 items...
Page 475, HTTP response status: 200
	35 items...
Page 500, HTTP response stat
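
The progress messages label their timestamps as UTC, but `datetime.fromtimestamp()` without a `tz` argument returns a naive datetime in the machine's local time zone. The label is therefore only accurate where the runtime clock is set to UTC. The following sketch, which is a suggestion and not the notebook's code, shows a time-zone-aware alternative that stays correct on any machine:

```python
from datetime import datetime, timezone
from time import time

start_time = time()

# Naive datetime in the machine's local time zone (what the cell above prints)
local_stamp = datetime.fromtimestamp(start_time)

# Aware datetime pinned to UTC, independent of the machine's settings
utc_stamp = datetime.fromtimestamp(start_time, tz=timezone.utc)

print("Crawl start UTC: {}".format(utc_stamp.isoformat()))
```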

In [None]:
# Convert to data frame and save result locally
df_horseforum_threads = pd.DataFrame(data_horseforum_threads)

# Save dataframe locally
df_horseforum_threads.to_csv(os.path.join(filepath_data, filename_horseforum_threads_df), index=False)

df_horseforum_threads

| | item_title | item_url_path | item_created_date | item_created_by | item_created_by_url_path | item_view_count | item_reply_count | item_last_replied_date | item_last_replied_by | item_last_replied_by_url_path |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PLEASE READ BEFORE POSTING (both older and new... | /threads/please-read-before-posting-both-older... | 2020-12-27T19:19:33-0500 | TaMMa89 | /members/tamma89.3542/ | 1076 | 0 | 2020-12-27T19:19:33-0500 | TaMMa89 | /members/tamma89.3542/ |
| 1 | poisonous plants from HF member locations | /threads/poisonous-plants-from-hf-member-locat... | 2018-01-01T22:03:05-0500 | Smilie | /members/smilie.18361/ | 2151 | 5 | 2020-12-08T16:29:12-0500 | stevenson | /members/stevenson.26572/ |
| 2 | The Care of an Emaciated Horse | /threads/the-care-of-an-emaciated-horse.100412/ | 2011-10-14T09:02:04-0400 | xxBarry Godden | /members/xxbarry-godden.9451/ | 48970 | 54 | 2020-11-26T10:09:30-0500 | horselovinguy | /members/horselovinguy.79162/ |
| 3 | Making a Vet Kit | /threads/making-a-vet-kit.251/ | 2007-01-07T02:14:09-0500 | Skippy! | /members/skippy.157/ | 2091441 | 277 | 2019-08-26T04:44:59-0400 | AmiraAchek | /members/amiraachek.280747/ |
| 4 | Information on Myopathies - PSSM1, PSSM2, MFM,... | /threads/information-on-myopathies-pssm1-pssm2... | 2017-08-17T20:06:14-0400 | Espy | /members/espy.168162/ | 17546 | 5 | 2018-09-17T16:44:33-0400 | Espy | /members/espy.168162/ |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20949 | Recovery from Torn Rear Suspensory Ligament | /threads/recovery-from-torn-rear-suspensory-li... | 2006-12-22T05:26:11-0500 | trusspt | /members/trusspt.170/ | 6106 | 1 | 2006-12-30T04:51:20-0500 | stacyh | /members/stacyh.186/ |
| 20950 | how to get horses to gaine weight | /threads/how-to-get-horses-to-gaine-weight.99/ | 2006-12-05T11:54:51-0500 | nybarrelracer | /members/nybarrelracer.121/ | 4988 | 14 | 2006-12-21T07:55:03-0500 | brandig | /members/brandig.57/ |
| 20951 | PAXIL? | /threads/paxil.102/ | 2006-12-06T10:09:20-0500 | tuffstuff | /members/tuffstuff.123/ | 3240 | 3 | 2006-12-11T20:29:12-0500 | kristy | /members/kristy.132/ |
| 20952 | Loose stool | /threads/loose-stool.101/ | 2006-12-05T18:42:31-0500 | Cedarsgirl | /members/cedarsgirl.120/ | 5694 | 10 | 2006-12-10T21:52:11-0500 | Cedarsgirl | /members/cedarsgirl.120/ |
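
The date columns are stored as ISO 8601 strings with a UTC offset (e.g., `2020-12-27T19:19:33-0500`). A minimal sketch of turning them into time-zone-aware datetimes with the standard library is shown below; the helper name `parse_forum_timestamp` is an assumption for illustration.

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S%z"  # e.g. 2020-12-27T19:19:33-0500


def parse_forum_timestamp(text):
    """Parse a scraped timestamp into an aware datetime, or None if malformed."""
    try:
        return datetime.strptime(text, ISO_FORMAT)
    except (TypeError, ValueError):
        # TypeError: value is not a string (e.g. NaN); ValueError: bad format
        return None


created = parse_forum_timestamp("2011-10-14T09:02:04-0400")
replied = parse_forum_timestamp("2020-11-26T10:09:30-0500")

print(created.astimezone(timezone.utc).isoformat())
print("Days between creation and last reply:", (replied - created).days)
```

Because both values carry their own offset, the subtraction is correct even though the two dates fall on different sides of a daylight-saving change.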


### Reload previous crawl: horseforum threads

In [None]:
# Reload  horseforum threads
df_horseforum_threads = pd.read_csv(os.path.join(filepath_data, filename_horseforum_threads_df))
df_horseforum_threads

| | item_title | item_url_path | item_created_date | item_created_by | item_created_by_url_path | item_view_count | item_reply_count | item_last_replied_date | item_last_replied_by | item_last_replied_by_url_path |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PLEASE READ BEFORE POSTING (both older and new... | /threads/please-read-before-posting-both-older... | 2020-12-27T19:19:33-0500 | TaMMa89 | /members/tamma89.3542/ | 1076 | 0 | 2020-12-27T19:19:33-0500 | TaMMa89 | /members/tamma89.3542/ |
| 1 | poisonous plants from HF member locations | /threads/poisonous-plants-from-hf-member-locat... | 2018-01-01T22:03:05-0500 | Smilie | /members/smilie.18361/ | 2151 | 5 | 2020-12-08T16:29:12-0500 | stevenson | /members/stevenson.26572/ |
| 2 | The Care of an Emaciated Horse | /threads/the-care-of-an-emaciated-horse.100412/ | 2011-10-14T09:02:04-0400 | xxBarry Godden | /members/xxbarry-godden.9451/ | 48970 | 54 | 2020-11-26T10:09:30-0500 | horselovinguy | /members/horselovinguy.79162/ |
| 3 | Making a Vet Kit | /threads/making-a-vet-kit.251/ | 2007-01-07T02:14:09-0500 | Skippy! | /members/skippy.157/ | 2091441 | 277 | 2019-08-26T04:44:59-0400 | AmiraAchek | /members/amiraachek.280747/ |
| 4 | Information on Myopathies - PSSM1, PSSM2, MFM,... | /threads/information-on-myopathies-pssm1-pssm2... | 2017-08-17T20:06:14-0400 | Espy | /members/espy.168162/ | 17546 | 5 | 2018-09-17T16:44:33-0400 | Espy | /members/espy.168162/ |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20949 | Recovery from Torn Rear Suspensory Ligament | /threads/recovery-from-torn-rear-suspensory-li... | 2006-12-22T05:26:11-0500 | trusspt | /members/trusspt.170/ | 6106 | 1 | 2006-12-30T04:51:20-0500 | stacyh | /members/stacyh.186/ |
| 20950 | how to get horses to gaine weight | /threads/how-to-get-horses-to-gaine-weight.99/ | 2006-12-05T11:54:51-0500 | nybarrelracer | /members/nybarrelracer.121/ | 4988 | 14 | 2006-12-21T07:55:03-0500 | brandig | /members/brandig.57/ |
| 20951 | PAXIL? | /threads/paxil.102/ | 2006-12-06T10:09:20-0500 | tuffstuff | /members/tuffstuff.123/ | 3240 | 3 | 2006-12-11T20:29:12-0500 | kristy | /members/kristy.132/ |
| 20952 | Loose stool | /threads/loose-stool.101/ | 2006-12-05T18:42:31-0500 | Cedarsgirl | /members/cedarsgirl.120/ | 5694 | 10 | 2006-12-10T21:52:11-0500 | Cedarsgirl | /members/cedarsgirl.120/ |


### UDF scrape thread

In [None]:
def get_horseforum_thread_participant_count(horseforum_soup):
    """
    Parameters:
    -----------
    bs4.BeautifulSoup : horseforum_soup
        A BeautifulSoup object for the web page, being a 'page' of a 'thread'
        on the forum.

    Returns:
    --------
    int
        the number of users who have posted against the thread, otherwise np.nan.
    """
    soup = horseforum_soup
    participant_count = np.nan
    try:
        participant_count = soup.find(
            'div', class_="stats-container"
            ).find('span', { 'qid' : 'thread-about-discussion-participant-count' }).get_text()
        participant_count = int(participant_count)
    except:
        pass

    return participant_count



def scrape_thread_page_details_horseforum(horseforum_soup):
    """
    Scrapes the page-level details from one page of a horse forum thread
    (e.g., https://www.horseforum.com/threads/the-care-of-an-emaciated-horse.100412/).

    Parameters:
    -----------
    bs4.BeautifulSoup : horseforum_soup
        A BeautifulSoup object for the web page, being a 'page'
        of posts for a 'thread' on the forum.


    Returns:
    --------
    dict {str, Any}
        the predefined details of a thread's page:
            + participant_count

    """
    soup = horseforum_soup
    participant_count = get_horseforum_thread_participant_count(soup)

    # Save result
    dict_page = {
        'participant_count' : participant_count
        }
       
    return dict_page






def get_horseforum_thread_page_next(horseforum_soup):
    """
    Parameters:
    -----------
    bs4.BeautifulSoup : horseforum_soup
        A BeautifulSoup object for the web page, being a 'page' of posts for 
        a 'thread' on the forum.

    Returns:
    --------
    str
        the relative url path for the 'next page' if successful, otherwise an empty string.
    """
    soup = horseforum_soup
    url_path = ""
    try:
        url_path = soup.find('a', { 'qid' : 'page-nav-next-button'})['href']
    except:
        pass

    return url_path


def get_horseforum_thread_page_last(horseforum_soup):
    """
    Parameters:
    -----------
    bs4.BeautifulSoup : horseforum_soup
        A BeautifulSoup object for the web page, being a 'page' of posts for 
        a 'thread' on the forum.

    Returns:
    --------
    str
        the relative url path for the 'last page' if successful, otherwise an empty string.
    """
    soup = horseforum_soup
    url_path = ""
    try:
        url_path = soup.find('a', { 'qid' : "thread-jump-to-latest"})['href']
    except:
        pass

    return url_path



def get_horseforum_thread_page_posts(horseforum_soup):
    """
    Parameters:
    -----------
    bs4.BeautifulSoup : horseforum_soup
        A BeautifulSoup object for the web page.

    Returns:
    --------
    bs4.element.ResultSet
        An iterable collection of posts (for the thread on the current page), 
        as returned by BeautifulSoup.
    """
    soup = horseforum_soup
    post_items = None
    try:
        post_items = soup.find('div', { 'qid' : 'thread-box-parent'}).find_all('article', { 'qid' : 'post-item'})

    except:
        pass

    return post_items



def scrape_thread_post_details_horseforum(page_posts, url_domain):
    """
    Iterates a bs4 ResultSet of the posts on one page of a horse forum thread
    (e.g., https://www.horseforum.com/threads/the-care-of-an-emaciated-horse.100412/)
    to get predefined values for each post.

    Parameters:
    -----------
    bs4.element.ResultSet : page_posts
        An iterable collection of posts, as returned by BeautifulSoup
    str : url_domain
        The base URL domain (including protocol) to be crawled (e.g., https://www.horseforum.com)

    Returns:
    --------
    list of dict {str, Any}
        the predefined details of each thread-post:
            + post_id
            + post_number
            + post_url_path
            + post_reactions_url_path
            + post_reactions
            + post_datetime
            + post_username
            + post_userid
            + post_user_url_path
            + post_text

    """
    listings = page_posts
    data_items=[]
    if listings is not None:
        for item in listings:   
            # Get result
            dict_item = scrape_horseforum_post(item, url_domain)

            data_items.append(dict_item.copy())

    return data_items
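
The functions above follow a simple contract: the 'next page' getter returns an empty string when there is no further page, and the posts getter returns `None` when nothing is found. The sketch below shows how a pagination loop can rely on that contract. It is not the notebook's `web_crawler` (defined earlier, and possibly different in detail); `crawl_pages` and the in-memory `fake_site` are hypothetical names used so the example runs without network access.

```python
def crawl_pages(fetch_page, get_items, get_next_path, start_path, max_pages=None):
    """Follow 'next page' links from start_path, collecting items per page."""
    items, visited = [], []
    path = start_path
    while path and path not in visited:        # stop on "" or on a link loop
        if max_pages is not None and len(visited) >= max_pages:
            break
        page = fetch_page(path)
        visited.append(path)
        items.extend(get_items(page) or [])    # tolerate a None result set
        path = get_next_path(page)
    return items, visited


# Hypothetical in-memory "thread" with three pages
fake_site = {
    "/threads/demo.1/": {"posts": ["p1", "p2"], "next": "/threads/demo.1/page-2"},
    "/threads/demo.1/page-2": {"posts": ["p3", "p4"], "next": "/threads/demo.1/page-3"},
    "/threads/demo.1/page-3": {"posts": ["p5"], "next": ""},
}

posts, pages = crawl_pages(
    fetch_page=fake_site.__getitem__,
    get_items=lambda page: page["posts"],
    get_next_path=lambda page: page["next"],
    start_path="/threads/demo.1/",
)
print(len(posts), "posts from", len(pages), "pages")
```

Tracking the visited paths guards against a 'next' link that points back to an earlier page, which would otherwise loop forever.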

### UDF scrape post

In [None]:
def get_horseforum_post_id(post):
    """
    Parameters:
    -----------
    bs4.BeautifulSoup : post
        A BeautifulSoup object for the web page, being a 'post' within a 
        'thread' on the forum.

    Returns:
    --------
    str
        The identifier for the post, if successful, otherwise ""
    """
    soup = post
    post_id = ""
    try:
        post_id = post['data-content']

    except:
        pass

    return post_id


def get_horseforum_post_number(post):
    """
    Parameters:
    -----------
    bs4.BeautifulSoup : post
        A BeautifulSoup object for the web page, being a 'post' within a 
        'thread' on the forum.

    Returns:
    --------
    int
        The post number (relative to the thread), if successful, otherwise np.nan
    """
    soup = post
    post_number = np.nan
    try:
        post_number = post.find('a', { 'qid' : "post-number"}).get_text()
        post_number = int(re.findall(r'\d+', post_number)[0])
    except:
        pass

    return post_number


def get_horseforum_post_url_path(post):
    """
    Parameters:
    -----------
    bs4.BeautifulSoup : post
        A BeautifulSoup object for the web page, being a 'post' within a 
        'thread' on the forum.

    Returns:
    --------
    str
        The url path to the post, if successful, otherwise ""
    """
    soup = post
    post_url_path = ""
    try:
        post_url_path = post.find('a', { 'qid' : "post-number"})['href']
    except:
        pass

    return post_url_path


def get_horseforum_post_reactions_url_path(post):
    """
    Parameters:
    -----------
    bs4.BeautifulSoup : post
        A BeautifulSoup object for the web page, being a 'post' within a 
        'thread' on the forum.

    Returns:
    --------
    str
        The url path to the post's reactions, if successful, otherwise ""
    """
    soup = post
    post_reactions_url_path = ""
    try:
        post_reactions_url_path = post.find('a', class_="reactionsBar-link")['href']
        if post_reactions_url_path[-1] != '/':
            post_reactions_url_path = post_reactions_url_path + '/'
    except:
        pass

    return post_reactions_url_path



def get_horseforum_post_reactions(post, url_domain):
    """
    Parameters:
    -----------
    bs4.BeautifulSoup : post
        A BeautifulSoup object for the web page, being a 'post' within a 
        'thread' on the forum.
    str : url_domain
        The base URL domain (including protocol) to be crawled (e.g., https://www.horseforum.com)

    Returns:
    --------
    str
        The post's reactions, if successful, otherwise ""
    """
    soup = post
    post_reactions = ""
    url_path = get_horseforum_post_reactions_url_path(soup)
    if url_path != "":
        response = requests.get(url_domain + url_path)
        if response.status_code == 200:
            reactions_soup = BeautifulSoup(response.text, 'html.parser')
            try:
                post_reactions = reactions_soup.find('span', class_=re.compile(r'reaction-text')).get_text()
            except:
                pass

    return post_reactions


def get_horseforum_post_datetime(post):
    """
    Parameters:
    -----------
    bs4.BeautifulSoup : post
        A BeautifulSoup object for the web page, being a 'post' within a 
        'thread' on the forum.

    Returns:
    --------
    str
        The datetime for the post, if successful, otherwise ""
    """
    soup = post
    post_datetime = ""
    try:
        post_datetime = post.find('div', class_="message-attribution-main").find('time')['datetime']
    except:
        pass

    return post_datetime


def get_horseforum_post_username(post):
    """
    Parameters:
    -----------
    bs4.BeautifulSoup : post
        A BeautifulSoup object for the web page, being a 'post' within a 
        'thread' on the forum.

    Returns:
    --------
    str
        The username for the post, if successful, otherwise ""
    """
    soup = post
    post_username = ""
    try:
        post_username = post.find('div', class_="message-userDetails").find('a', class_="username").get_text()
    except:
        pass

    return post_username




def get_horseforum_post_userid(post):
    """
    Parameters:
    -----------
    bs4.BeautifulSoup : post
        A BeautifulSoup object for the web page, being a 'post' within a 
        'thread' on the forum.

    Returns:
    --------
    str
        The userid for the post, if successful, otherwise ""
    """
    soup = post
    post_userid = ""
    try:
        post_userid = post.find('div', class_="message-userDetails").find('a', class_="username")['data-user-id']
    except:
        pass

    return post_userid




def get_horseforum_post_user_url_path(post):
    """
    Parameters:
    -----------
    bs4.BeautifulSoup : post
        A BeautifulSoup object for the web page, being a 'post' within a 
        'thread' on the forum.

    Returns:
    --------
    str
        The url path to the user for the post, if successful, otherwise ""
    """
    soup = post
    post_user_url_path = ""
    try:
        post_user_url_path = post.find('div', class_="message-userDetails").find('a', class_="username")['href']
    except:
        pass

    return post_user_url_path


def get_horseforum_post_text(post):
    """
    Returns the message text of a post.

    Note:
    ----
    Emoji within a post's text are rendered using image tags <img>, not characters.
    However, the `get_text()` method of BeautifulSoup ignores image tags and hence the emoji are lost.
    The post's text is therefore manually wrangled from the tag object of the post.

    Parameters:
    -----------
    bs4.BeautifulSoup : post
        A BeautifulSoup object for the web page.

    Returns:
    --------
    str
        The message text for the post, if successful, otherwise ""
    """
    soup = post
    post_text = ""
    try:
        post_text = post.find('article', { 'qid' : 'post-text'}).find('div')
        post_text = str(post_text)
        post_text = re.sub(r'^<[^>]+>|<[^>]+>$', '', post_text) # remove the outer div tags 
        post_text = re.sub(r'<img [^>]+ title="([^"]+)"[^>]*>', r'\1', post_text) # Extract image's title attribute
        post_text = re.sub(r'<br/>', r'\n', post_text) # Replace line breaks
        post_text = re.sub(r'<[^>]+>', r'', post_text) # Remove all remaining tags

    except:
        pass

    return post_text



def scrape_horseforum_post(post, url_domain):
    """
    Parameters:
    -----------
    bs4.BeautifulSoup : post
        A BeautifulSoup object for the web page, being a 'post' within a 
        'thread' on the forum.
    str : url_domain
        The base URL domain (including protocol) to be crawled (e.g., https://www.horseforum.com)        

    Returns:
    --------
    dict {str, Any}
        the predefined details of the thread-post:
            + post_id
            + post_number
            + post_url_path
            + post_reactions_url_path
            + post_reactions
            + post_datetime
            + post_username
            + post_userid
            + post_user_url_path
            + post_text

    """
    post_id = get_horseforum_post_id(post)
    post_number = get_horseforum_post_number(post)
    post_url_path = get_horseforum_post_url_path(post)
    post_reactions_url_path = get_horseforum_post_reactions_url_path(post)
    post_reactions = get_horseforum_post_reactions(post, url_domain)
    post_datetime = get_horseforum_post_datetime(post)
    post_username = get_horseforum_post_username(post)
    post_userid = get_horseforum_post_userid(post)
    post_user_url_path = get_horseforum_post_user_url_path(post)
    post_text = get_horseforum_post_text(post)

    # Save result
    dict_post = {
        'post_id' : post_id
        ,'post_number' : post_number
        ,'post_url_path' : post_url_path
        ,'post_reactions_url_path' : post_reactions_url_path
        ,'post_reactions' : post_reactions
        ,'post_datetime' : post_datetime
        ,'post_username' : post_username
        ,'post_userid' : post_userid
        ,'post_user_url_path' : post_user_url_path
        ,'post_text' : post_text
        }
       
    return dict_post
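
To make the effect of the regular expressions in `get_horseforum_post_text` concrete, the sketch below applies the same four substitutions to a hand-written HTML string. The sample markup, the function name `wrangle_post_html` and the final `html.unescape` step are assumptions added for illustration; the real forum markup may differ.

```python
import html
import re


def wrangle_post_html(raw):
    """Apply the same substitutions as get_horseforum_post_text to a string."""
    text = re.sub(r'^<[^>]+>|<[^>]+>$', '', raw)                     # outer tags
    text = re.sub(r'<img [^>]+ title="([^"]+)"[^>]*>', r'\1', text)  # emoji title
    text = re.sub(r'<br/>', '\n', text)                              # line breaks
    text = re.sub(r'<[^>]+>', '', text)                              # other tags
    return html.unescape(text)  # assumption: also decode entities such as &amp;


sample = ('<div class="bbWrapper">Thanks!<br/>He is eating well '
          '<img alt=":)" class="smilie" title="Smile"/> '
          '<b>Tom &amp; Co</b></div>')

print(wrangle_post_html(sample))
```

One limitation of the image pattern: it requires at least one other attribute before `title`, so an `<img>` tag whose first attribute is `title` would not match and would be removed by the final substitution instead.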

### UDF crawler wrapper

In [None]:
def wrapper_web_crawler(
    url_domain
    ,list_url_path
    ,fx_get_ResultSet
    ,fx_get_ResultSet_data
    ,fx_get_next_page_url
    ,fx_get_page_data = None
    ,max_pages = None
    ,echo_every_n = 1
    ,echo_page_crawl_every_n = 0
    ,wait_thread_sec = 0
    ,wait_page_sec = 0
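):
    """
    Crawls a list of forum threads with `web_crawler` and combines the results.

    Parameters:
    -----------
    str : url_domain
        The base URL domain (including protocol) to be crawled (e.g., https://www.horseforum.com)
    iterable of str : list_url_path
        The relative url path of each thread to crawl.
    fx_get_ResultSet, fx_get_ResultSet_data, fx_get_next_page_url, fx_get_page_data, max_pages
        Passed through unchanged to `web_crawler` for each thread.
    int : echo_every_n
        Print progress every n threads (0 disables).
    int : echo_page_crawl_every_n
        Passed to `web_crawler` as `echo_every_n`.
    float : wait_thread_sec
        Seconds to sleep after crawling each thread.
    float : wait_page_sec
        Passed to `web_crawler` as `wait_page_sec`.

    Returns:
    --------
    list of pandas.DataFrame
        [posts, pages] for all threads, each keyed by 'thread_url_path'.
    """

    # Initialise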
    count_max = len(list_url_path)
    df_forum_thread_posts = pd.DataFrame()
    df_forum_thread_pages = pd.DataFrame()

    # Crawl threads and scrape
    start_time = time()
    print('Crawl start UTC: {}'.format(datetime.fromtimestamp(start_time)))
    for count, thread_url_path in enumerate(list_url_path):
        # Show progress
        if (echo_every_n > 0) and (count % echo_every_n == 0): # short-circuit evaluation avoids modulo by zero
            print("item {} of {}:".format(count + 1, count_max))
            print("\t{}".format(thread_url_path))

        # Crawl pages and scrape
        data_thread_posts, data_thread_pages = web_crawler(
            url_domain =  url_domain
            ,url_path = thread_url_path
            ,fx_get_ResultSet = fx_get_ResultSet
            ,fx_get_ResultSet_data = fx_get_ResultSet_data
            ,fx_get_next_page_url = fx_get_next_page_url
            ,fx_get_page_data = fx_get_page_data 
            ,max_pages = max_pages
            ,echo_every_n = echo_page_crawl_every_n
            ,wait_page_sec = wait_page_sec
        )

        # Convert thread's data to dataframe
        df_thread_posts = pd.DataFrame(data_thread_posts)
        df_thread_pages = pd.DataFrame(data_thread_pages)

        # Add thread's url_path as unique key    
        df_thread_posts.insert(0, 'thread_url_path', thread_url_path)
        df_thread_pages.insert(0, 'thread_url_path', thread_url_path)

        # Append to forum's dataframe
        df_forum_thread_posts = pd.concat([df_forum_thread_posts, df_thread_posts])
        df_forum_thread_pages = pd.concat([df_forum_thread_pages, df_thread_pages])

        # Wait until crawling next thread
        sleep(wait_thread_sec)

    # Reset forum dataframe's index
    df_forum_thread_posts.reset_index(inplace=True, drop=True)
    df_forum_thread_pages.reset_index(inplace=True, drop=True)

    end_time = time()
    print("---crawl complete---")
    print()
    print("Posts scraped: {}".format(len(df_forum_thread_posts.index))) 
    print("Crawl time (sec): {0}".format(end_time - start_time))
    print('Crawl end UTC: {}'.format(datetime.fromtimestamp(end_time))) 

    return [df_forum_thread_posts, df_forum_thread_pages]
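
The wrapper calls `pd.concat` inside the loop, which copies the accumulated dataframe on every iteration. An alternative, sketched here with plain Python records so that it runs without pandas, is to tag each record with its thread key, collect everything in one list, and build the dataframe once at the end with `pd.DataFrame(records)`. The names `collect_thread_records` and `fake_threads` are hypothetical.

```python
def collect_thread_records(list_url_path, crawl_thread):
    """Crawl each thread and return one flat list of keyed post records."""
    all_posts = []
    for thread_url_path in list_url_path:
        for post in crawl_thread(thread_url_path):
            # Put the thread key first, mirroring insert(0, 'thread_url_path', ...)
            all_posts.append({"thread_url_path": thread_url_path, **post})
    return all_posts


# Hypothetical crawl results keyed by thread path
fake_threads = {
    "/threads/a.1/": [{"post_number": 1}, {"post_number": 2}],
    "/threads/b.2/": [],                       # thread with no scraped posts
    "/threads/c.3/": [{"post_number": 1}],
}

records = collect_thread_records(list(fake_threads), fake_threads.__getitem__)
print(len(records), "records")
```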

### Test: Crawl single thread

In [None]:
# Specify which web page should be crawled
url_domain =  "https://www.horseforum.com"
url_path = "/threads/the-care-of-an-emaciated-horse.100412/" # note terminating '/'

# Set functions relevant to scraping the web page
get_ResultSet = get_horseforum_thread_page_posts
get_ResultSet_data = lambda page_posts: scrape_thread_post_details_horseforum(page_posts=page_posts, url_domain=url_domain)
get_next_page_url = get_horseforum_thread_page_next
get_page_data = scrape_thread_page_details_horseforum

# Crawl pages and scrape
start_time = time()
print('Crawl start UTC: {}'.format(datetime.fromtimestamp(start_time)))
data_thread_posts, data_thread_pages = web_crawler(
    url_domain =  url_domain
    ,url_path = url_path
    ,fx_get_ResultSet = get_ResultSet
    ,fx_get_ResultSet_data = get_ResultSet_data
    ,fx_get_next_page_url = get_next_page_url
    ,fx_get_page_data = get_page_data 
    #,max_pages = 2
    ,echo_every_n = 2
)
end_time = time()
print("---crawl complete---")
print()
print("Post count: {}".format(len(data_thread_posts)))
print("Crawl time (sec): {0}".format(end_time - start_time))
print('Crawl end UTC: {}'.format(datetime.fromtimestamp(end_time)))

Crawl start: 2021-11-30 13:07:54.593890
Page 2, HTTP response status: 200
	20 items...
---crawl complete---

Post count: 55
Crawl time (sec): 13.726727962493896
Crawl start: 2021-11-30 13:08:08.320618


### Crawl forum

In [None]:
# Specify which web page should be crawled
url_domain =  "https://www.horseforum.com"
list_url_threads = df_horseforum_threads['item_url_path']
print("Total thread count to crawl: {}".format(len(list_url_threads)))

# Set functions relevant to scraping the web page
get_ResultSet = get_horseforum_thread_page_posts
get_ResultSet_data = lambda page_posts: scrape_thread_post_details_horseforum(page_posts=page_posts, url_domain=url_domain)
get_next_page_url = get_horseforum_thread_page_next
get_page_data = scrape_thread_page_details_horseforum

# Batch control: define "chunks" to crawl
n_chunks = 20
start_chunk = 3 # 1-based
end_chunk = n_chunks # 1-based
echo_every_n = 250
wait_chunk_sec = 0

#n_chunks = 20000
#start_chunk = 25 # 1-base
#end_chunk = 28 # 1-base

print("Avg. chunk size (threads): {}".format(len(list_url_threads) // n_chunks))

# Create a location to save crawler chunks
filepath_chunks_save = datetime.now().strftime("%Y%m%d-%H%M%S")
filepath_chunks_save = os.path.join(filepath_data, filepath_chunks_save)
#filepath_chunks_save = '/content/drive/MyDrive/Colab Notebooks/MA5851/Assessment 3/data/20211201-014114'

flag_make_chunk_folder = mk_dir(filepath_chunks_save)
if flag_make_chunk_folder:
    print("Save location created: {}".format(filepath_chunks_save))
    print()
else:
    print("!!!Failed to create save location: {}".format(filepath_chunks_save))

if flag_make_chunk_folder:
    start_main = time()
    print('~~~~~~~~ Chunks crawl start ~~~~~~~~')
    print('Main start UTC: {}'.format(datetime.fromtimestamp(start_main)))
    print()

    # Set filenames to save locally
    filename_horseforum_posts_chunk_df = "df_horseforum_posts_chunk{value:{fill}{align}{width}}.csv"
    filename_horseforum_pages_chunk_df = "df_horseforum_pages_chunk{value:{fill}{align}{width}}.csv"
    pad_chunk = len(str(n_chunks)) # used for building chunk's filename

    rge = range(df_horseforum_threads.shape[0])
    for count_chunk, chunk in enumerate(np.array_split(np.array(rge), n_chunks)):
        if count_chunk + 1 > end_chunk:
            break
        elif count_chunk + 1 >= start_chunk:
            print("***** Chunk: {} *****".format(count_chunk + 1))

            # Crawl chunk of forum threads
            df_forum_thread_posts_chunk, df_forum_thread_pages_chunk = wrapper_web_crawler(
                url_domain = url_domain
                ,list_url_path = list_url_threads[chunk]
                ,fx_get_ResultSet = get_ResultSet
                ,fx_get_ResultSet_data = get_ResultSet_data
                ,fx_get_next_page_url = get_next_page_url
                ,fx_get_page_data = get_page_data
                #,max_pages = None
                ,echo_every_n = echo_every_n
                ,echo_page_crawl_every_n = 0
                ,wait_thread_sec = 0
                ,wait_page_sec = 0                
            )

            # Save dataframe locally
            df_forum_thread_posts_chunk.to_csv(
                os.path.join(filepath_chunks_save
                            ,filename_horseforum_posts_chunk_df.format(value=count_chunk + 1, fill=0, align=">", width=pad_chunk)
                            )
                ,index=False)
            
            df_forum_thread_pages_chunk.to_csv(
                os.path.join(filepath_chunks_save
                            ,filename_horseforum_pages_chunk_df.format(value=count_chunk + 1, fill=0, align=">", width=pad_chunk)
                            )
                ,index=False)

            print("*************************")
            print()

            sleep(wait_chunk_sec)
        else:
            # skip this chunk
            pass


    end_main = time()
    print("~~~~~~~~ Chunks crawl complete ~~~~~~~~ ")
    print()
    print("Duration time (sec): {0}".format(end_main - start_main))
    print('Main end UTC: {}'.format(datetime.fromtimestamp(end_main)))

Total thread count to crawl: 20954
Avg. chunk size (threads): 1047
Save location created: /content/drive/MyDrive/Colab Notebooks/MA5851/Assessment 3/data/20211201-014114

~~~~~~~~ Chunks crawl start ~~~~~~~~
Main start UTC: 2021-12-01 03:27:02.058748

***** Chunk: 3 *****
Crawl start UTC: 2021-12-01 03:27:02.064146
item 1 of 1048:
	/threads/misdiagnosis-best-news-ever.787217/
item 251 of 1048:
	/threads/orphan-weanling-care-5-6mo.727361/
item 501 of 1048:
	/threads/does-soaking-grain-reduce-fiber-content.747689/
item 751 of 1048:
	/threads/hemp-seed-oil.725809/
item 1001 of 1048:
	/threads/vet-farrier-dentist-oh-my.704305/
---crawl complete---

Posts scraped: 13219
Crawl time (sec): 2176.2238006591797
Crawl end UTC: 2021-12-01 04:03:18.287946
*************************

***** Chunk: 4 *****
Crawl start UTC: 2021-12-01 04:03:18.712484
item 1 of 1048:
	/threads/can-anyone-age-this-horse.700625/
item 251 of 1048:
	/threads/purdue-equine-workshop.641802/
item 501 of 1048:
	/threads/cellulit
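
The progress output reports `item 1 of 1048` although the average chunk size is 1047. This follows from how `np.array_split` divides a length that is not an exact multiple of the chunk count: the first `length % n_chunks` chunks receive one extra element. A pure-Python sketch of that sizing rule (the helper name `chunk_bounds` is an assumption) is:

```python
def chunk_bounds(n_items, n_chunks):
    """Return (start, stop) index pairs sized the way np.array_split sizes them."""
    base, extra = divmod(n_items, n_chunks)
    bounds, start = [], 0
    for i in range(n_chunks):
        size = base + 1 if i < extra else base  # first `extra` chunks get one more
        bounds.append((start, start + size))
        start += size
    return bounds


bounds = chunk_bounds(20954, 20)
sizes = [stop - start for start, stop in bounds]
print("Chunk 3 size:", sizes[2])
print("Chunks of 1048 threads:", sizes.count(1048))
```

Knowing the index bounds of each chunk also makes it straightforward to re-crawl a single chunk after an interruption.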

In [None]:
# Combine horseforum post chunks (sorted, as glob returns files in arbitrary order)
chunk_files = sorted(glob.glob(
    os.path.join(filepath_chunks_save
                 ,filename_horseforum_posts_chunk_df.format(value="*", fill="", align=">", width=1))
    ))

ldf = [] # list of dataframes
for filename in chunk_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    ldf.append(df)
df_horseforum_thread_posts = pd.concat(ldf, axis=0, ignore_index=True)

# Save combined dataframe locally
df_horseforum_thread_posts.to_csv(os.path.join(filepath_data, filename_horseforum_thread_posts_df), index=False)

# display
df_horseforum_thread_posts

| | thread_url_path | post_id | post_number | post_url_path | post_reactions_url_path | post_reactions | post_datetime | post_username | post_userid | post_user_url_path | post_text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /threads/please-read-before-posting-both-older... | post-1970931865 | 1.0 | /threads/please-read-before-posting-both-older... | | | 2020-12-27T19:19:33-0500 | TaMMa89 | 3542.0 | /members/tamma89.3542/ | New Horseforum.com format was launched in Nove... |
| 1 | /threads/poisonous-plants-from-hf-member-locat... | post-1970473609 | 1.0 | /threads/poisonous-plants-from-hf-member-locat... | /posts/1970473609/reactions/ | Like (2) | 2018-01-01T22:03:05-0500 | Smilie | 18361.0 | /members/smilie.18361/ | MOD NOTE (Jaydee)\n\nPlease could we keep this... |
| 2 | /threads/poisonous-plants-from-hf-member-locat... | post-1970473615 | 2.0 | /threads/poisonous-plants-from-hf-member-locat... | | | 2018-01-01T22:22:01-0500 | k9kenai | 257833.0 | /members/k9kenai.257833/ | The most common around New Mexico is Braken Fe... |
| 3 | /threads/poisonous-plants-from-hf-member-locat... | post-1970473753 | 3.0 | /threads/poisonous-plants-from-hf-member-locat... | | | 2018-01-02T09:04:30-0500 | QtrBel | 33711.0 | /members/qtrbel.33711/ | http://www.aces.edu/pubs/docs/A/ANR-0975/ANR-0... |
| 4 | /threads/poisonous-plants-from-hf-member-locat... | post-1970473821 | 4.0 | /threads/poisonous-plants-from-hf-member-locat... | | | 2018-01-02T09:50:56-0500 | egrogan | 24027.0 | /members/egrogan.24027/ | The red maple is one we have to worry about a ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 232758 | /threads/loose-stool.101/ | post-469 | 10.0 | /threads/loose-stool.101/post-469 | | | 2006-12-10T09:50:22-0500 | child in time | 118.0 | /members/child-in-time.118/ | Did you have called a vet? |
| 232759 | /threads/loose-stool.101/ | post-479 | 11.0 | /threads/loose-stool.101/post-479 | | | 2006-12-10T21:52:11-0500 | Cedarsgirl | 120.0 | /members/cedarsgirl.120/ | Yup, vets aware of her condition. First he tho... |
| 232760 | /threads/quietex.49/ | post-170 | 1.0 | /threads/quietex.49/post-170 | | | 2006-11-14T23:17:30-0500 | KristyMarie87 | 65.0 | /members/kristymarie87.65/ | I have a 4 year old Paint Gelding who is very ... |
| 232761 | /threads/quietex.49/ | post-172 | 2.0 | /threads/quietex.49/post-172 | | | 2006-11-15T23:06:17-0500 | sweetwaterarabians | 67.0 | /members/sweetwaterarabians.67/ | Hi\n\n\n\nSorry your boy is so fussy with the ... |
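
The `post_reactions` column holds free text such as `Like (2)`. For later analysis it may be convenient to split this into a reaction name and a count. The sketch below assumes every reaction follows the `Name (count)` shape seen in the table above. How the forum renders several reaction types at once is not shown in this notebook, so the pattern is a guess that would need checking against real data.

```python
import re

# Assumed shape: a name followed by a count in parentheses, e.g. "Like (2)"
REACTION_PATTERN = re.compile(r'([^,()]+?)\s*\((\d+)\)')


def parse_reactions(text):
    """Turn a string such as 'Like (2)' into {'Like': 2}; {} if nothing matches."""
    if not isinstance(text, str):  # NaN/None from an empty CSV field
        return {}
    return {name.strip(): int(count) for name, count in REACTION_PATTERN.findall(text)}


print(parse_reactions("Like (2)"))
print(parse_reactions(float("nan")))
```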


In [None]:
# Combine horseforum page chunks (sorted, as glob returns files in arbitrary order)
chunk_files = sorted(glob.glob(
    os.path.join(filepath_chunks_save
                 ,filename_horseforum_pages_chunk_df.format(value="*", fill="", align=">", width=1))
    ))

ldf = [] # list of dataframes
for filename in chunk_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    ldf.append(df)
df_horseforum_thread_pages = pd.concat(ldf, axis=0, ignore_index=True)

# Save combined dataframe locally
df_horseforum_thread_pages.to_csv(os.path.join(filepath_data, filename_horseforum_thread_pages_df), index=False)

# display
df_horseforum_thread_pages

| | thread_url_path | participant_count | page_url_path |
|---|---|---|---|
| 0 | /threads/please-read-before-posting-both-older... | 1.0 | /threads/please-read-before-posting-both-older... |
| 1 | /threads/poisonous-plants-from-hf-member-locat... | 5.0 | /threads/poisonous-plants-from-hf-member-locat... |
| 2 | /threads/the-care-of-an-emaciated-horse.100412/ | 31.0 | /threads/the-care-of-an-emaciated-horse.100412/ |
| 3 | /threads/the-care-of-an-emaciated-horse.100412/ | 31.0 | /threads/the-care-of-an-emaciated-horse.100412... |
| 4 | /threads/the-care-of-an-emaciated-horse.100412/ | 31.0 | /threads/the-care-of-an-emaciated-horse.100412... |
| ... | ... | ... | ... |
| 24622 | /threads/recovery-from-torn-rear-suspensory-li... | 2.0 | /threads/recovery-from-torn-rear-suspensory-li... |
| 24623 | /threads/how-to-get-horses-to-gaine-weight.99/ | 9.0 | /threads/how-to-get-horses-to-gaine-weight.99/ |
| 24624 | /threads/paxil.102/ | 4.0 | /threads/paxil.102/ |
| 24625 | /threads/loose-stool.101/ | 4.0 | /threads/loose-stool.101/ |
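
The chunk filenames are built from a template with a nested format specification, and the same template is reused with `value="*"` to produce the glob pattern. The sketch below writes empty files to a temporary folder (a stand-in for the chunk folder) to show both uses. `chunk_filename` is a hypothetical helper. The result of `glob.glob` is sorted explicitly because the function does not guarantee any order.

```python
import glob
import os
import tempfile

template = "df_horseforum_posts_chunk{value:{fill}{align}{width}}.csv"


def chunk_filename(number, n_chunks):
    """Zero-pad the chunk number to the width of the largest chunk number."""
    return template.format(value=number, fill=0, align=">", width=len(str(n_chunks)))


with tempfile.TemporaryDirectory() as folder:  # stand-in for the chunk folder
    for number in (10, 3, 1):
        open(os.path.join(folder, chunk_filename(number, 20)), "w").close()

    # value="*" with an empty fill turns the template into a wildcard pattern
    pattern = template.format(value="*", fill="", align=">", width=1)
    found = sorted(glob.glob(os.path.join(folder, pattern)))
    names = [os.path.basename(path) for path in found]

print(pattern)
print(names)
```

Zero-padding is what makes the lexical sort agree with the numeric chunk order.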


### Reload crawled data

In [None]:
df_horseforum_thread_posts = pd.read_csv(os.path.join(filepath_data, filename_horseforum_thread_posts_df))
print("Re-imported posts data: {x[0]} rows × {x[1]} cols".format(x = df_horseforum_thread_posts.shape))
df_horseforum_thread_pages = pd.read_csv(os.path.join(filepath_data, filename_horseforum_thread_pages_df))
print("Re-imported pages data {x[0]} rows × {x[1]} cols".format(x = df_horseforum_thread_pages.shape))

Re-imported posts data: 232763 rows × 11 cols
Re-imported pages data 24627 rows × 3 cols


## Python notebook (.ipynb) conversion to HTML

The following code cell converts this Python notebook (.ipynb) to HTML (.html).

In [None]:
# Conversion of notebook to html (Google Colab):
filename_Notebook_ipynb = 'A3_Lachlan_Sharp.ipynb'
filename_Notebook_ipynb = '\"' + os.path.join(filepath_root, filename_Notebook_ipynb) + '\"'

!jupyter nbconvert --to html $filename_Notebook_ipynb

[NbConvertApp] Converting notebook /content/drive/MyDrive/Colab Notebooks/MA5851/Assessment 3/A3_Lachlan_Sharp.ipynb to html
[NbConvertApp] Writing 496442 bytes to /content/drive/MyDrive/Colab Notebooks/MA5851/Assessment 3/A3_Lachlan_Sharp.html
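
The conversion cell wraps the path in escaped double quotes because the Drive folder names contain spaces. An alternative that avoids manual quoting is to pass the command as an argument list to `subprocess.run`. The sketch below only builds and displays the command; the helper name `nbconvert_command` and the shortened folder path are assumptions for illustration.

```python
import os
import shlex


def nbconvert_command(notebook_path, to="html"):
    """Build an argument list for `jupyter nbconvert`; no shell quoting needed."""
    return ["jupyter", "nbconvert", "--to", to, notebook_path]


# Hypothetical folder containing a space, as the Drive path in this notebook does
path = os.path.join("/content/drive/MyDrive/Colab Notebooks", "A3_Lachlan_Sharp.ipynb")
cmd = nbconvert_command(path)

# Shell-safe rendering, for display only
print(" ".join(shlex.quote(part) for part in cmd))

# Where jupyter is installed, the command could then be run with:
# subprocess.run(cmd, check=True)
```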


In [None]:
# Conversion of notebook to pdf (Google Colab):

#!apt-get install texlive texlive-xetex texlive-latex-extra pandoc
#!pip install pypandoc
#!jupyter nbconvert --to PDF $filename_Notebook_ipynb