<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Author:-Larry-M.-(Larz60+)" data-toc-modified-id="Author:-Larry-M.-(Larz60+)-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Author: Larry M. (Larz60+)</a></span></li><li><span><a href="#Introduction" data-toc-modified-id="Introduction-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Part-1----RfcPaths.py---Set-up-paths" data-toc-modified-id="Part-1----RfcPaths.py---Set-up-paths-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Part 1 -- RfcPaths.py - Set up paths</a></span></li><li><span><a href="#Part-2---CheckInternet.py---Check-Internet-Availability" data-toc-modified-id="Part-2---CheckInternet.py---Check-Internet-Availability-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Part 2 - CheckInternet.py - Check Internet Availability</a></span></li><li><span><a href="#Part-3---GetUrl.py---Fetching-the-webpage" data-toc-modified-id="Part-3---GetUrl.py---Fetching-the-webpage-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Part 3 - GetUrl.py - Fetching the webpage</a></span></li><li><span><a href="#Part-4---GetPage.py---A-wrapper-around-GetUrl,-Updates-only-if-data-too-old." data-toc-modified-id="Part-4---GetPage.py---A-wrapper-around-GetUrl,-Updates-only-if-data-too-old.-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Part 4 - GetPage.py - A wrapper around GetUrl, Updates only if data too old.</a></span></li><li><span><a href="#Part-5---GetRemoteDir.py---Fetches-directory-of-page-referenced-by-url." data-toc-modified-id="Part-5---GetRemoteDir.py---Fetches-directory-of-page-referenced-by-url.-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Part 5 - GetRemoteDir.py - Fetches directory of page referenced by url.</a></span></li></ul></div>

# January 2018 Python Users Group - Manchester, New Hampshire, USA

## Author: Larry M. (Larz60+)

## Introduction


This presentation was originally going to be about creating a file server where RFCs could be searched by Title, Author, RFC_Id, Date, etc., with the selected RFC then fetched and presented in a nice GUI. I may get that far tonight, but the material may have to be divided into two sessions, because the end result requires some very interesting web scraping and other subjects of interest in preparation. So if you like what I present here tonight, and want to give me the floor for another session, I'll continue next month.

![RfcViewer.png](attachment:RfcViewer.png)

Here's a list by module:

* RfcPaths.py - Set up Paths, Files, and URLs needed for project

* CheckInternet.py - Check Internet availability.

* GetUrl.py -   Create Class to fetch page at URL

* GetPage.py -  Create Class to fetch a page if the cached copy is older than a specified time, or if it doesn't exist, and save the html. If the cached copy is not older than the specified time, read it from the saved file. In all cases return the page html

* GetRemoteDir.py - Create class to extract a directory (by file suffix) from a download URL

* CreateRfcIndex.py - Module to do the file scraping, and extract the data that will be needed for the RFC index. Then save that data in json format. Will contain URLs for text, PDF, and Postscript format (as available)

* RfcViewer.py - Create a GUI, either as a web site, wxPython Phoenix, or both. This will have a search capability, and be able to display in all available formats.

* ImageLib.py - Class used to create an icon image library directory (uses icons8 public domain icon library). These images will be used to create a navigation pane for the PDF viewer in RfcViewer. There are over 300 icons in this library which can be reused for other applications. A copy of the icons8 license is in the images directory.

## Part 1 -- RfcPaths.py - Set up paths

I like to keep all of my file paths, filenames, and URLs in a single class that I can import into any module requiring access to the data pointed to by the various entries in the class. All paths are relative to the location of the Python source directory (the class starts from Path('.'), so the scripts are expected to be run from that directory). I use pathlib rather than os because it is totally object oriented, and offers many advantages as such.

Keeping this data in a separate class has several advantages:

* If a path changes, relative to the source directory, only a single location has to be changed. This is by far the most important advantage.

* Need a path? Just import RfcPaths and it's there.

* Opening a pathlib file is as simple as 'RfcIndex.open()'

* Getting an absolute path from a pathlib path is as simple as 'pathname.resolve()'

* Or to get a directory of all files '[x for x in pathname.iterdir() if x.is_file()]'
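
The pathlib operations listed above can be tried out in a throwaway directory. This is only a minimal sketch: the temporary directory and the file name `rfc-index.txt` used here are stand-ins chosen for illustration, not part of the project layout.

```python
import tempfile
from pathlib import Path

# Work in a throwaway directory so the sketch does not touch real project data
with tempfile.TemporaryDirectory() as tmp:
    homepath = Path(tmp)
    datapath = homepath / 'data'
    datapath.mkdir(exist_ok=True)            # create the directory if it is missing

    index_file = datapath / 'rfc-index.txt'  # hypothetical file name
    with index_file.open('w') as f:          # open straight from the Path object
        f.write('RFC0001 Host Software\n')

    absolute = index_file.resolve()          # absolute path of the file
    files = [x.name for x in datapath.iterdir() if x.is_file()]
    print(absolute.is_absolute(), files)
```

Because every path is built from `homepath`, pointing that one variable somewhere else moves the whole tree, which is the main reason for keeping the paths together.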

In [14]:
from pathlib import Path
import os


class RfcPaths:
    def __init__(self):
        # Directory paths assure required paths exist
        self.prev_cwd = None
        self.homepath = Path('.')
        self.datapath = self.homepath / 'data'
        self.datapath.mkdir(exist_ok=True)
        self.htmlpath = self.datapath / 'html'
        self.htmlpath.mkdir(exist_ok=True)
        self.jsonpath = self.datapath / 'json'
        self.jsonpath.mkdir(exist_ok=True)
        # No leading slash: '/text' would be treated as an absolute path
        self.textpath = self.datapath / 'text'
        self.textpath.mkdir(exist_ok=True)
        # self.miscpath = self.datapath / 'MiscData'
        # self.miscpath.mkdir(exist_ok=True)
        # self.pdfpath  = self.datapath / 'pdf'
        # self.pdfpath.mkdir(exist_ok=True)
        # self.samplepath = self.datapath / 'sampledata'
        # self.samplepath.mkdir(exist_ok=True)
        self.temppath = self.datapath / 'temp'
        self.temppath.mkdir(exist_ok=True)
        self.imagepath = self.homepath / 'images'
        self.imagepath.mkdir(exist_ok=True)

        # File paths
        self.text_index = self.textpath / 'rfc-index.txt'
        # self.ien_index = self.miscpath / 'ien_index.txt'
        # self.bcp_index = self.miscpath / 'bcp_index.txt'
        # self.testdoc = self.samplepath / 'rfc110.pdf'
        # self.pdf_index_file = self.pdfpath / 'rfc-index.txt.pdf'
        self.imagedict = self.jsonpath / 'images.json'
        self.text_test_file = self.temppath / 'text.txt'
        self.pdf_test_file = self.temppath / 'pdf.txt'
        self.ps_test_file = self.temppath / 'ps.txt'

        self.rfc_index_html = self.htmlpath / 'rfc_index.html'
        self.rfc_int_std_html = self.htmlpath / 'int_std_index.html'
        self.rfc_filepage = self.htmlpath / 'file_info.html'
        self.rfc_download_dir_html = self.htmlpath / 'download.html'

        self.rfc_index_json = self.jsonpath / 'rfc_index.json'

        # url's
        self.rfc_index_url = 'https://www.rfc-editor.org/rfc-index.html'
        self.int_standards_url = 'https://www.rfc-editor.org/search/rfc_search_detail.php?sortkey=Number' \
                                 '&sorting=DESC&page=All&pubstatus%5B%5D=Standards%20Track&std_trk=Internet' \
                                 '%20Standard'
        self.base_rfc_url = 'https://www.rfc-editor.org/info/'
        self.sitemap_url = 'https://www.rfc-editor.org/sitemap/'
        self.rfc_download_page_url = 'https://www.rfc-editor.org/rfc/'
        self.pdfrfc_download_page_url = 'https://www.rfc-editor.org/rfc/pdfrfc/'
        self.std_download_page_url = 'https://www.rfc-editor.org/rfc/std/'
        self.bcp_download_page_url = 'https://www.rfc-editor.org/rfc/bcp/'
        self.fyi_download_page_url = 'https://www.rfc-editor.org/rfc/fyi/'
        self.ien_download_page_url = 'https://www.rfc-editor.org/rfc/ien/'
        self.ien__pdf_download_page_url = 'https://www.rfc-editor.org/rfc/ien/scanned/'
        self.rfc_ref_index = f'{self.rfc_download_page_url}rfc-ref.txt.new'


def testit():
    rp = RfcPaths()
    print(f'{[x for x in rp.datapath.iterdir()]}')
    print(f'{[x for x in rp.imagepath.iterdir()]}')
    print(f'{rp.imagepath.resolve()}')

if __name__ == '__main__':
    testit()


This script can be imported into other modules. This makes it easy to change the location of an individual file, or entire directory, and have it available immediately to the entire package.

Example:

In [None]:
import RfcPaths as rp
import json


rpaths = rp.RfcPaths()

with rpaths.rfc_index_json.open() as f:
    mydict = json.load(f)

keys = mydict.keys()
print(keys)
print(mydict['RFC0037'])
print(mydict['RFC0037']['authors'])


If you need to know where a file is located, simply call resolve:

In [None]:
import RfcPaths as rp
import json


rpaths = rp.RfcPaths()
rpaths.rfc_index_json.resolve()



## Part 2 - CheckInternet.py - Check Internet Availability
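The project needs a module to check whether internet access is available before trying to download anything. This one compares the address that the local host name resolves to against the loopback address 127.0.0.1.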

In [None]:
import socket


class CheckInternet:
    def __init__(self):
        self.internet_available = False

    def check_availability(self):
        self.internet_available = False
        if socket.gethostbyname(socket.gethostname()) != '127.0.0.1':
            self.internet_available = True
        return self.internet_available


def testit():
    ci = CheckInternet()
    print('Please turn internet OFF, then press Enter')
    input()
    ci.check_availability()
    print(f'ci.internet_available: {ci.internet_available}')
    if not ci.internet_available:
        print('    Off test successful')
    else:
        print('    Off test failed')
    print('Please turn internet ON, then press Enter')
    input()
    ci.check_availability()
    print(f'ci.internet_available: {ci.internet_available}')
    if ci.internet_available:
        print('    On test successful')
    else:
        print('    On test failed')


if __name__ == '__main__':
    testit()
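
The host name comparison above depends on how the local machine resolves its own name, so it may not reflect real connectivity on every system. As a hedged alternative (not part of the original module), the sketch below tries to open a TCP connection to a given host and port. The function name `can_connect` and the default timeout are assumptions made for this example.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and failed name lookups
        return False
```

A call such as `can_connect('www.rfc-editor.org', 443)` would probe the site this project downloads from. Which host to probe is a design choice, and a failure only shows that this particular host could not be reached.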

## Part 3 - GetUrl.py - Fetching the webpage
The GetUrl module does only one thing: fetch a webpage and return its contents.

In [None]:
import RfcPaths
import requests
import CheckInternet
import sys


class GetUrl:
    def __init__(self):
        self.rpath = RfcPaths.RfcPaths()
        self.ci = CheckInternet.CheckInternet()
        self.ok_status = 200
        self.r = None

    def fetch_url(self, url):
        self.r = None
        if self.ci.check_availability():
            self.r = requests.get(url, allow_redirects=False)
        return self.r


def testit():
    gu = GetUrl()
    page = gu.fetch_url('https://www.google.com/')
    count = 0
    maxcount = 20
    try:
        if page.status_code == 200:
            # Split on the newline character ('\n', not '/n')
            ptext = page.text.split('\n')
            for line in ptext:
                print(f'{line}\n')
                count += 1
                if count > maxcount:
                    break
        else:
            print(f'Error retrieving file, status code: {page.status_code}')
    except AttributeError:
        print('Please enable internet and try again')

if __name__ == '__main__':
    testit()


GetUrl uses the requests package (http://docs.python-requests.org/en/master/) to fetch a webpage. If you don't have it, you can install it with pip:

In [None]:
pip install requests

The initialization is simple: rpath is an instance of the RfcPaths path class, ci is the CheckInternet instance, ok_status is the standard HTTP status code for a successful fetch (200), and r will be used to hold the response object returned by requests.

In [None]:
class GetUrl:
    def __init__(self):
        self.rpath = RfcPaths.RfcPaths()
        self.ci = CheckInternet.CheckInternet()
        self.ok_status = 200
        self.r = None

The fetch_url method uses requests to fetch the URL, with redirects not allowed so that the entire web is not downloaded, and for security reasons.

If the internet check fails, this method returns None. The calling routine can let that None cause an AttributeError exception when it tries to read the response, and so determine that the reason was lack of internet.

If there is no error, it returns the raw requests response object, which can be queried as required by the calling program.

In [None]:
    def fetch_url(self, url):
        self.r = None
        if self.ci.check_availability():
            self.r = requests.get(url, allow_redirects=False)
        return self.r
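
A caller can also inspect the return value explicitly instead of relying on the AttributeError. The sketch below is illustrative only: `classify_response` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic the `status_code` and `headers` attributes of a requests response so the example runs without network access.

```python
from types import SimpleNamespace


def classify_response(response):
    """Return a short label describing what fetch_url handed back."""
    if response is None:
        # fetch_url returns None when the internet check fails
        return 'no internet'
    if response.status_code == 200:
        return 'ok'
    if 300 <= response.status_code < 400:
        # With allow_redirects=False the redirect is reported, not followed
        target = response.headers.get('Location', 'unknown')
        return f'redirect to {target}'
    return f'error {response.status_code}'


# Stand-in objects with the same attribute names as a requests response
ok = SimpleNamespace(status_code=200, headers={})
moved = SimpleNamespace(status_code=301, headers={'Location': 'https://example.org/'})

print(classify_response(None))
print(classify_response(ok))
print(classify_response(moved))
```

Since redirects are not followed, a page that has moved shows up as a 3xx status rather than as its new content, which is worth keeping in mind when a fetch unexpectedly fails the status check.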

Jupyter Notebook limits the rate at which a cell can send output data to the browser. Running this script from within Notebook prints enough data to exceed that limit and produces an error.
To bypass this error, the limit must be increased.

If you haven't already done this, restart Jupyter from the command line with the following:

In [None]:
jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000


## Part 4 - GetPage.py - A wrapper around GetUrl, Updates only if data too old.

Using a local cache, this module will first check to see if the desired webpage is cached in the sub-directory data/html.
If present, it checks the file's date/time modified stamp to see if it is older than a specified time, and attempts to fetch a new copy from the internet if this is the case.

If refresh_hours_every is passed as an argument with a value of 0, the code will unconditionally go to the website and
download a new copy (and write to cache).

This is accomplished using the lstat method of pathlib, which is very similar to os.lstat and returns the same kind of result, so subtracting lstat().st_mtime from the current time and dividing by 3600 gives the age of the last modification in hours.

The download_new_file method makes use of GetUrl's internet checking, which will return None if internet is not available.
Reading status_code from None causes an AttributeError exception, because None is not a valid requests response object. This is a trick!
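
The age test described above can be isolated into a few lines. This is a sketch under stated assumptions: the helper name `is_stale` is made up for illustration, and `os.utime` is used only to backdate a temporary file so that the stale case can be shown without waiting.

```python
import os
import tempfile
import time
from pathlib import Path


def is_stale(savefile, refresh_hours_every=48):
    """Return True if savefile is missing, empty, or older than the refresh window."""
    if not savefile.exists():
        return True
    stats = savefile.lstat()
    elapsed_hours = (time.time() - stats.st_mtime) / 3600
    return stats.st_size == 0 or elapsed_hours > refresh_hours_every


with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / 'page.html'
    missing = is_stale(cached)                 # file not written yet
    cached.write_text('<html></html>')
    fresh = is_stale(cached)                   # just written, inside the window
    three_days_ago = time.time() - 72 * 3600
    os.utime(cached, (three_days_ago, three_days_ago))  # backdate the modification time
    old = is_stale(cached)                     # 72 hours old, window is 48
    print(missing, fresh, old)
```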

In [None]:
import GetUrl
import time
import sys


class GetPage:
    def __init__(self):
        """
        Initialize - Instantiate imported modules, initialize class variables
        """
        self.elapsed_hours = 0
        self.gu = GetUrl.GetUrl()
        self.savefile = None

    def get_page(self, url, savefile=None, refresh_hours_every=48):
        self.url = url
        self.savefile = savefile
        self.refresh_hours_every = refresh_hours_every
        self.page = None
        if self.savefile:
            if self.savefile.exists():
                lstats = savefile.lstat()
                self.elapsed_hours = (time.time() - lstats.st_mtime) / 3600
                if lstats.st_size == 0 or (self.elapsed_hours > self.refresh_hours_every):
                    self.page = self.download_new_file()
                else:
                    with self.savefile.open('r') as f:
                        self.page = f.read()
            else:
                self.page = self.download_new_file()
        else:
            self.page = self.download_new_file()
        return self.page


    def download_new_file(self):
        """Fetch self.url, cache it if a savefile was given, and return the html text (or None)."""
        html = None
        try:
            response = self.gu.fetch_url(self.url)
            # fetch_url returns None when the internet is unavailable, so the
            # attribute access below raises AttributeError in that case.
            if response.status_code == 200:
                html = response.text
                if self.savefile:
                    with self.savefile.open('wb') as f:
                        f.write(response.content)
            else:
                print(f'Invalid status code: {response.status_code}')
        except AttributeError:
            print('Please enable internet and try again')
        return html

def testit():
    import RfcPaths
    rpath = RfcPaths.RfcPaths()
    # Test url = rfc download page, save to data/html/rfc_index.html, refresh always
    gp = GetPage()
    page = gp.get_page(url=rpath.rfc_download_page_url, savefile=rpath.rfc_index_html,
                       refresh_hours_every=0)
    # get_page returns the html as text, or None if nothing could be fetched
    if page:
        print(f'Page contents: {page}')
    else:
        print('Page is empty or invalid')

if __name__ == '__main__':
    testit()


## Part 5 - GetRemoteDir.py - Fetches directory of page referenced by url.

This module is used to get a list of files by type from a webpage that (hopefully) contains files available for download. It does this by getting an html listing of the page containing the files, then using BeautifulSoup to extract the hrefs of the a tags, keeping those whose type matches the one requested, and returning them to the calling program as a list.

One thing to notice in the download_new_file method: after the document is fetched using GetUrl, the content of the requests response is extracted, and that is what BeautifulSoup parses.

All a tags are extracted from the html and then the href is isolated. If the suffix matches the one requested, the href is appended to dirlist, and once complete, that list is returned to the calling program.
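
The href filtering step can be shown on a small piece of html. To keep this sketch free of third-party dependencies it uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup, so it is a stand-in for the technique rather than the module's own code. The class name `HrefCollector` and the sample listing are made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags whose target ends with a given suffix."""

    def __init__(self, suffix=''):
        super().__init__()
        self.suffix = suffix
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == 'href' and value and value.endswith(self.suffix):
                self.hrefs.append(value)


# Made-up directory listing, not real rfc-editor.org output
sample = '''
<a href="rfc1.txt">rfc1.txt</a>
<a href="rfc1.pdf">rfc1.pdf</a>
<a name="top">no href here</a>
<a href="rfc2.txt">rfc2.txt</a>
'''

collector = HrefCollector(suffix='.txt')
collector.feed(sample)
print(collector.hrefs)
```

Anchors without an href are skipped rather than treated as errors, and including the dot in the suffix avoids matching names that merely happen to end in the same letters.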

In [None]:
import GetUrl
from bs4 import BeautifulSoup
import time
import sys
import socket


class GetRemoteDir:
    def __init__(self):
        self.soup = None
        self.gu = GetUrl.GetUrl()
        self.internet_available = socket.gethostbyname(socket.gethostname()) != '127.0.0.1'
        self.page = None
        self.refresh_hours_every = None

    def list_file_descriptor(self, url, savefile=None, refresh_hours_every=48, suffix=''):
        elapsed_hours = 0
        dirlist = []
        self.refresh_hours_every = refresh_hours_every
        self.url = url
        self.savefile = savefile
        self.suffix = suffix

        if self.savefile:
            if self.savefile.exists():
                lstats = self.savefile.lstat()
                elapsed_hours = (time.time() - lstats.st_mtime) / 3600
                if elapsed_hours > self.refresh_hours_every:
                    self.page = self.download_new_file()
                else:
                    with savefile.open('r') as f:
                        self.page = f.read()
            else:
                self.page = self.download_new_file()
        else:
            self.page = self.download_new_file()

        if self.page is None:
            # Nothing cached and nothing downloaded, so there is nothing to parse
            return dirlist
        self.soup = BeautifulSoup(self.page, 'html.parser')
        # Only anchors that carry an href attribute can name a file
        for link in self.soup.find_all('a', href=True):
            if link['href'].endswith(suffix):
                dirlist.append(link['href'])
        return dirlist

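    def download_new_file(self):
        page = None
        if self.internet_available:
            document = self.gu.fetch_url(self.url)
            # fetch_url returns None if its own internet check fails
            if document is not None and document.status_code == 200:
                page = document.content
                if self.savefile:
                    with self.savefile.open('wb') as f:
                        f.write(page)
            else:
                print(f'Unable to download {self.url}')
        else:
            print('Please enable internet and re-try')
        return page


    def list_dir(self, url, suffix):
        # suffix must be passed by keyword; the second positional parameter is savefile
        for item in self.list_file_descriptor(url, suffix=suffix):
            print(item)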


def testit():
    import RfcPaths
    rpath = RfcPaths.RfcPaths()
    grd              = GetRemoteDir()
    text_files       = grd.list_file_descriptor(url=rpath.rfc_download_page_url,
                                                savefile=rpath.text_test_file, suffix='txt')
    # The test file paths are attributes of rpath, not module-level names
    pdf_files        = grd.list_file_descriptor(url=rpath.rfc_download_page_url,
                                                savefile=rpath.pdf_test_file, suffix='pdf')
    postscript_files = grd.list_file_descriptor(url=rpath.rfc_download_page_url,
                                                savefile=rpath.ps_test_file, suffix='ps')
    print(f'text_files: {text_files}')
    print(f'pdf_files: {pdf_files}')
    print(f'postscript_files: {postscript_files}')


if __name__ == '__main__':
    testit()