<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#RfcPaths.py---Set-up-paths" data-toc-modified-id="RfcPaths.py---Set-up-paths-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>RfcPaths.py - Set up paths</a></span></li><li><span><a href="#FreshStart.py---Clean-html-and-json-directories" data-toc-modified-id="FreshStart.py---Clean-html-and-json-directories-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>FreshStart.py - Clean html and json directories</a></span></li><li><span><a href="#CheckInternet.py---Check-Internet-Availability" data-toc-modified-id="CheckInternet.py---Check-Internet-Availability-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>CheckInternet.py - Check Internet Availability</a></span></li><li><span><a href="#GetUrl.py---Fetching-the-webpage" data-toc-modified-id="GetUrl.py---Fetching-the-webpage-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>GetUrl.py - Fetching the webpage</a></span></li><li><span><a href="#GetPage.py---A-wrapper-around-GetUrl,-Updates-only-if-data-too-old." data-toc-modified-id="GetPage.py---A-wrapper-around-GetUrl,-Updates-only-if-data-too-old.-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>GetPage.py - A wrapper around GetUrl, Updates only if data too old.</a></span></li><li><span><a href="#GetRemoteDir.py---Fetches-directory-of-page-referenced-by-url." 
data-toc-modified-id="GetRemoteDir.py---Fetches-directory-of-page-referenced-by-url.-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>GetRemoteDir.py - Fetches directory of page referenced by url.</a></span></li><li><span><a href="#CreateRfcIndex.py---Gets-main-RFC-index-and-creates-json-file" data-toc-modified-id="CreateRfcIndex.py---Gets-main-RFC-index-and-creates-json-file-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>CreateRfcIndex.py - Gets main RFC index and creates json file</a></span></li><li><span><a href="#RfcViewer.py---Gui-for-viewing-index,-fetching-and-displaying-Documents" data-toc-modified-id="RfcViewer.py---Gui-for-viewing-index,-fetching-and-displaying-Documents-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>RfcViewer.py - Gui for viewing index, fetching and displaying Documents</a></span></li><li><span><a href="#ImageLib.py---Create-icon8-icon-index-json-file" data-toc-modified-id="ImageLib.py---Create-icon8-icon-index-json-file-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>ImageLib.py - Create icon8 icon index json file</a></span></li><li><span><a href="#PdfAttempt.py---Pdf-code-that-works-...-sort-of" data-toc-modified-id="PdfAttempt.py---Pdf-code-that-works-...-sort-of-11"><span class="toc-item-num">11&nbsp;&nbsp;</span>PdfAttempt.py - Pdf code that works ... sort of</a></span></li></ul></div>

# January 2018 Python Users Group - Manchester, New Hampshire, USA

Author: Larry M. (Larz60+)


## Introduction


This presentation was originally going to be on creating a file server, where RFCs could be searched by Title, Author, RFC_Id, Date, etc., and then the actual RFC would be fetched and presented in a nice GUI. There is some very interesting web scraping (CreateRfcIndex, which needs to look at some JavaScript), and things I had fun with, like GetRemoteDir.py.

<img src='RfcViewer.png' alt='Image goes here' title='RFC Viewer' align='right' />

Here's a list by module:

* __RfcPaths.py__ - Set up Paths, Files, and URL's needed for project

* __CheckInternet.py__ - Check Internet availability.

* __GetUrl.py__ -   Create Class to fetch page at URL

* __GetPage.py__ -  Create Class to fetch a page if the cached copy is older than a specified time, or if it doesn't exist, and save the html. If the cached copy is not older than the specified time, read it from the cache instead. In all cases return the page html

* __GetRemoteDir.py__ - Create class to extract a directory (by file suffix) from a download URL

* __CreateRfcIndex.py__ - Module to do the file scraping, and extract the data that will be needed for the RFC index. Then save that data in json format. Will contain URL's for text, PDF, and Postscript format (as available)

* __RfcViewer.py__ - Create a GUI, either as a web site, WXpython Phoenix, or both. This will have a search capability, and be able to display in all available formats.

* __ImageLib.py__ - Class used to create an icon image library directory (uses icons8 public domain icon library). These images will be used to create a navigation pane for the PDF viewer in RfcViewer. There are over 300 icons in this library which can be reused for other applications. A copy of the icons8 license is in the images directory

**Note**

Jupyter Notebook starts with a fixed limit on the output data rate. Running this script from within Notebook will exceed that limit and raise an error.
To bypass this error, the limit must be increased.

If you haven't already done this, copy and paste the following and restart Jupyter from the command line:

In [None]:
jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000

## RfcPaths.py - Set up paths

I like to keep all of my file paths, filenames, and URL's in a single class that I can import into any module requiring access to the data pointed to by the various entries in the class. All paths are relative to the location of the python source directory. I use pathlib, rather than os because it is object oriented, and offers many advantages as such.

Keeping this data in a separate class has several advantages:

* If a path changes, relative to the source directory, only a single location has to be changed. This is by far the most important advantage.

* Need a path? just import RfcPaths and it's there.

* Opening a pathlib file is as simple as 'RfcIndex.open()'

* Getting an absolute path from a pathlib path is as simple as 'pathname.resolve()'

* Or to get a directory of all files '[x for x in pathname.iterdir() if x.is_file()]'
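
The pathlib operations listed above can be tried out in isolation. The following is a minimal sketch, assuming a throwaway temporary directory in place of the project's data directory; the file name `sample.txt` is made up for the example.

```python
import tempfile
from pathlib import Path

# A throwaway directory stands in for the project's data directory.
with tempfile.TemporaryDirectory() as tmp:
    datapath = Path(tmp) / 'data'
    datapath.mkdir(exist_ok=True)

    # Paths are joined with '/'; the right-hand side must not start with
    # a slash, or the result becomes an absolute path from the root.
    sample = datapath / 'sample.txt'

    # Open the file directly from the Path object.
    with sample.open('w') as f:
        f.write('hello')

    # resolve() returns an absolute Path object, not a string.
    absolute = sample.resolve()

    # List only the regular files in the directory.
    files = [x.name for x in datapath.iterdir() if x.is_file()]
    contents = sample.read_text()

print(absolute.is_absolute(), files, contents)
```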

In [None]:
from pathlib import Path
import os


class RfcPaths:
    def __init__(self):
        # Directory paths assure required paths exist
        self.prev_cwd = None
        self.homepath = Path('.')
        self.datapath = self.homepath / 'data'
        self.datapath.mkdir(exist_ok=True)
        self.htmlpath = self.datapath / 'html'
        self.htmlpath.mkdir(exist_ok=True)
        self.jsonpath = self.datapath / 'json'
        self.jsonpath.mkdir(exist_ok=True)
        # No leading slash: '/text' would discard datapath and point at the filesystem root
        self.textpath = self.datapath / 'text'
        self.textpath.mkdir(exist_ok=True)
        # self.miscpath = self.datapath / 'MiscData'
        # self.miscpath.mkdir(exist_ok=True)
        # self.pdfpath  = self.datapath / 'pdf'
        # self.pdfpath.mkdir(exist_ok=True)
        # self.samplepath = self.datapath / 'sampledata'
        # self.samplepath.mkdir(exist_ok=True)
        self.temppath = self.datapath / 'temp'
        self.temppath.mkdir(exist_ok=True)
        self.imagepath = self.homepath / 'images'
        self.imagepath.mkdir(exist_ok=True)

        # File paths
        self.text_index = self.textpath / 'rfc-index.txt'
        # self.ien_index = self.miscpath / 'ien_index.txt'
        # self.bcp_index = self.miscpath / 'bcp_index.txt'
        # self.testdoc = self.samplepath / 'rfc110.pdf'
        # self.pdf_index_file = self.pdfpath / 'rfc-index.txt.pdf'
        self.imagedict = self.jsonpath / 'images.json'
        self.text_test_file = self.temppath / 'text.txt'
        self.pdf_test_file = self.temppath / 'pdf.txt'
        self.ps_test_file = self.temppath / 'ps.txt'

        self.rfc_index_html = self.htmlpath / 'rfc_index.html'
        self.rfc_int_std_html = self.htmlpath / 'int_std_index.html'
        self.rfc_filepage = self.htmlpath / 'file_info.html'
        self.rfc_download_dir_html = self.htmlpath / 'download.html'

        self.rfc_index_json = self.jsonpath / 'rfc_index.json'

        # url's
        self.rfc_index_url = 'https://www.rfc-editor.org/rfc-index.html'
        self.int_standards_url = 'https://www.rfc-editor.org/search/rfc_search_detail.php?sortkey=Number' \
                                 '&sorting=DESC&page=All&pubstatus%5B%5D=Standards%20Track&std_trk=Internet' \
                                 '%20Standard'
        self.base_rfc_url = 'https://www.rfc-editor.org/info/'
        self.sitemap_url ='https://www.rfc-editor.org/sitemap/'
        self.rfc_download_page_url = 'https://www.rfc-editor.org/rfc/'
        self.pdfrfc_download_page_url = 'https://www.rfc-editor.org/rfc/pdfrfc/'
        self.std_download_page_url = 'https://www.rfc-editor.org/rfc/std/'
        self.bcp_download_page_url = 'https://www.rfc-editor.org/rfc/bcp/'
        self.fyi_download_page_url = 'https://www.rfc-editor.org/rfc/fyi/'
        self.ien_download_page_url = 'https://www.rfc-editor.org/rfc/ien/'
        self.ien__pdf_download_page_url = 'https://www.rfc-editor.org/rfc/ien/scanned/'
        self.rfc_ref_index = f'{self.rfc_download_page_url}rfc-ref.txt.new'


def testit():
    rp = RfcPaths()
    print(f'{[x for x in rp.datapath.iterdir()]}')
    print(f'{[x for x in rp.imagepath.iterdir()]}')
    print(f'{rp.imagepath.resolve()}')

if __name__ == '__main__':
    testit()

To use, simply import into a module.

For Example:

In [None]:
import RfcPaths as rp
import json


rpaths = rp.RfcPaths()

with rpaths.rfc_index_json.open() as f:
    mydict = json.load(f)

keys = mydict.keys()
print(keys)
print(f"\nSingle dict entry: {mydict['RFC0037']}")
print(f"\nAuthor: {mydict['RFC0037']['authors']}")


The resolve method is like os.path.abspath, but it returns a Path object rather than a string, so the result can be used as is for further path manipulations.

In [8]:
import RfcPaths as rp
import json


rpaths = rp.RfcPaths()
rpaths.rfc_index_json.resolve()


WindowsPath('M:/python/m-p/m/MakerProject/venv/src/data/json/rfc_index.json')

## FreshStart.py - Clean html and json directories

Clean out the html and json directories (sub-directories of the data directory). A warning message appears first, because if this class is run, __CreateRfcIndex__ and __ImageLib__ will have to be rerun. It uses wx.MessageDialog.

In [9]:
import RfcPaths
import os
import wx


class FreshStart():
    def __init__(self):
        self.app = wx.App()
        self.rpath = RfcPaths.RfcPaths()
        val = self.warning('YOU ARE ABOUT TO DELETE ALL HTML AND JSON DATA')
        if val == wx.ID_OK:
            self.delete_data_files()

    def warning(self, message):
        """
        Display message in standard wx.MessageDialog
        :param message: (string) Value to be displayed
        :return: (int) ID of the button pressed, e.g. wx.ID_OK or wx.ID_CANCEL
        """
        msg_dlg = wx.MessageDialog(None, message, '', wx.OK | wx.CANCEL | wx.ICON_ERROR)
        # ShowModal blocks until the user answers, so no separate Show() call is needed
        val = msg_dlg.ShowModal()
        msg_dlg.Destroy()
        return val

    def delete_data_files(self):
        # html_files = [x for x in self.rpath.htmlpath.iterdir() if x.is_file()]
        html_files = [x for x in self.rpath.htmlpath.iterdir() if x.is_file() and (x.name.endswith('.html')
                      or x.name.endswith('.htm'))]
        json_files = [x for x in self.rpath.jsonpath.iterdir() if x.is_file() and x.name.endswith('.json')]
        all_files = html_files + json_files
        print(f'all_files: {all_files}')
        for file in all_files:
            os.remove(file.resolve())

if __name__ == '__main__':
    FreshStart()
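
The file selection and removal in delete_data_files can also be written with pathlib alone, using Path.suffix and Path.unlink. This is only a sketch of that idea, not the module's code: the helper name `delete_by_suffix` and the sample file names are made up, and it runs against a temporary directory so nothing real is deleted.

```python
import tempfile
from pathlib import Path


def delete_by_suffix(directory, suffixes):
    """Delete regular files in directory whose suffix is in suffixes.

    Returns the sorted names of the files that were removed.
    """
    # Build the list first so the directory is not modified while iterating.
    targets = [p for p in directory.iterdir()
               if p.is_file() and p.suffix in suffixes]
    for path in targets:
        path.unlink()
    return sorted(p.name for p in targets)


with tempfile.TemporaryDirectory() as tmp:
    htmlpath = Path(tmp)
    for name in ('a.html', 'b.htm', 'keep.txt'):
        (htmlpath / name).write_text('x')

    removed = delete_by_suffix(htmlpath, {'.html', '.htm'})
    remaining = sorted(p.name for p in htmlpath.iterdir())

print(removed, remaining)
```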

## CheckInternet.py - Check Internet Availability
This module uses gethostbyname to look up the address of the local host name and compares it with the localhost address 127.0.0.1 to decide whether the internet is available.
There are instances where this doesn't work properly, but they are few, and it can be safely used for most projects; test before relying on it. It's fine for this package.

In [None]:
import socket


class CheckInternet:
    def __init__(self):
        self.internet_available = False

    def check_availability(self):
        self.internet_available = False
        if socket.gethostbyname(socket.gethostname()) != '127.0.0.1':
            self.internet_available = True
        return self.internet_available


def testit():
    ci = CheckInternet()
    print('Please turn internet OFF, then press Enter')
    input()
    ci.check_availability()
    print(f'ci.internet_available: {ci.internet_available}')
    if not ci.internet_available:
        print('    Off test successful')
    else:
        print('    Off test failed')
    print('Please turn internet ON, then press Enter')
    input()
    ci.check_availability()
    print(f'ci.internet_available: {ci.internet_available}')
    if ci.internet_available:
        print('    On test successful')
    else:
        print('    On test failed')


if __name__ == '__main__':
    testit()
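
For the cases where the gethostbyname comparison gives the wrong answer, one possible alternative is to attempt a real TCP connection. This is a hedged sketch and not part of the package: the function name `can_connect` and the timeout value are assumptions, and the check is only as good as the host chosen to connect to.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the name and opens the socket;
        # the with block closes it again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False
```

A call such as `can_connect('www.rfc-editor.org', 443)` would then test reachability of the site this package actually uses.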

## GetUrl.py - Fetching the webpage
The __GetUrl__ module only does one thing: fetch a webpage and return its contents. It uses the requests module, which is more robust than urllib. This is where __CheckInternet__ is called, and it is the only place it should be called. If the internet is not available, it returns None rather than a requests response structure. This requires that the calling module surround the call with a try/except block, but it also allows testing for AttributeError to see if a 'turn internet on' message needs to be issued.

In [None]:
import RfcPaths
import requests
import CheckInternet
import sys


class GetUrl:
    def __init__(self):
        self.rpath = RfcPaths.RfcPaths()
        self.ci = CheckInternet.CheckInternet()
        self.ok_status = 200
        self.r = None

    def fetch_url(self, url):
        self.r = None
        if self.ci.check_availability():
            self.r = requests.get(url, allow_redirects=False)
        return self.r


def testit():
    gu = GetUrl()
    page = gu.fetch_url('https://www.google.com/')
    count = 0
    maxcount = 20
    try:
        if page.status_code == 200:
            ptext = page.text.split('\n')
            for line in ptext:
                print(f'{line}\n')
                count += 1
                if count > maxcount:
                    break
        else:
            print(f'Error retrieving file status code: {page.status_code}')
    except AttributeError:
        print('Please enable internet and try again')

if __name__ == '__main__':
    testit()


__GetUrl__ uses the requests package (http://docs.python-requests.org/en/master/) to fetch a webpage. If you don't have it, you can install it using pip:

In [None]:
pip install requests

Initialization is simple: rpath is an instance of the __RfcPaths__ path class, ci is an instance of __CheckInternet__, ok_status is the standard HTTP OK status code, and self.r will hold the requests response structure.

In [None]:
class GetUrl:
    def __init__(self):
        self.rpath = RfcPaths.RfcPaths()
        self.ci = CheckInternet.CheckInternet()
        self.ok_status = 200
        self.r = None

The fetch_url method uses requests to fetch the URL, with allow_redirects set to False so that redirects are not followed; this keeps the fetch limited to the requested URL and is also safer.
If the network is available, the raw requests response object is returned, otherwise None.

In [None]:
    def fetch_url(self, url):
        self.r = None
        if self.ci.check_availability():
            self.r = requests.get(url, allow_redirects=False)
        return self.r
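
The None return and the AttributeError test described above can be seen in isolation with a stand-in for the response object. This is a sketch only: `FakeResponse` and `describe` are hypothetical names invented for the example, and no network access is involved.

```python
class FakeResponse:
    """Hypothetical stand-in for a requests response (illustration only)."""
    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text


def describe(page):
    """Turn the result of a fetch_url-style call into a short status message."""
    try:
        if page.status_code == 200:
            return f'OK: {len(page.text)} characters'
        return f'Error retrieving page, status code: {page.status_code}'
    except AttributeError:
        # page is None, which is what fetch_url returns when no request was made
        return 'Please enable internet and try again'


print(describe(FakeResponse(200, 'abc')))
print(describe(FakeResponse(404, '')))
print(describe(None))
```

An explicit `if page is None:` test would work equally well; the try/except form simply mirrors what the test code in this module does.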

## GetPage.py - A wrapper around GetUrl, Updates only if data too old.

Using a local cache, this module will first check to see if the desired webpage is cached in the sub-directory data/html.
If present, it checks the file's date/time modified stamp to see if it is older than a specified time, and will attempt to fetch a new copy from the internet if this is the case. If refresh_hours_every is passed as an argument with a value of 0, the code will unconditionally go to the website and download a new copy (and write it to the cache).

This is accomplished by using the lstat method of pathlib, which is equivalent to os.lstat and returns a result with the same format and field names as os.stat, so subtracting st_mtime from the current time and dividing by 3600 gives the time since the last modification in hours.
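
The age calculation can be checked on its own with a temporary file whose modification time is set by hand. This is a minimal sketch under that assumption; the helper names `age_in_hours` and `needs_refresh` are made up for the example and do not exist in the module.

```python
import os
import tempfile
import time
from pathlib import Path


def age_in_hours(path):
    """Hours elapsed since the file at path was last modified."""
    return (time.time() - path.lstat().st_mtime) / 3600


def needs_refresh(path, refresh_hours_every=48):
    """True if the cached file is missing, empty, or older than the limit."""
    if not path.exists():
        return True
    if path.lstat().st_size == 0:
        return True
    return age_in_hours(path) > refresh_hours_every


with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / 'page.html'
    cached.write_text('<html></html>')
    # Pretend the file was last modified three hours ago.
    three_hours_ago = time.time() - 3 * 3600
    os.utime(cached, (three_hours_ago, three_hours_ago))

    hours = age_in_hours(cached)
    stale_at_48 = needs_refresh(cached, refresh_hours_every=48)
    stale_at_2 = needs_refresh(cached, refresh_hours_every=2)
    missing = needs_refresh(Path(tmp) / 'absent.html')

print(round(hours, 1), stale_at_48, stale_at_2, missing)
```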

The __download_new_file__ code is surrounded with try/except, with the exception being AttributeError (as discussed before in GetUrl); this is the None-return trick in use.

In [None]:
import GetUrl
import time
import sys


class GetPage:
    def __init__(self):
        """
        Initalize - Instantiate imported modules, initialize class variables
        """
        self.elapsed_hours = 0
        self.gu = GetUrl.GetUrl()
        self.savefile = None

    def get_page(self, url, savefile=None, refresh_hours_every=48):
        self.url = url
        self.savefile = savefile
        self.refresh_hours_every = refresh_hours_every
        self.page = None
        if self.savefile:
            if self.savefile.exists():
                lstats = savefile.lstat()
                self.elapsed_hours = (time.time() - lstats.st_mtime) / 3600
                if lstats.st_size == 0 or (self.elapsed_hours > self.refresh_hours_every):
                    self.page = self.download_new_file()
                else:
                    with self.savefile.open('r') as f:
                        self.page = f.read()
            else:
                self.page = self.download_new_file()
        else:
            self.page = self.download_new_file()
        return self.page


    def download_new_file(self):
        page = None
        try:
            page = self.gu.fetch_url(self.url)
            if page.status_code == 200:
                with self.savefile.open('wb') as f:
                    f.write(page.content)
            else:
                print(f'Invalid status code: {page.status_code}')
        except AttributeError:
            print('Please enable internet and try again')
        return page

def testit():
    import RfcPaths
    rpath = RfcPaths.RfcPaths()
    # Test url = rfc index download page, save to data/html/rfc_index.html, refresh always
    gp = GetPage()
    page = gp.get_page(url=rpath.rfc_download_page_url, savefile=rpath.rfc_index_html,
                       refresh_hours_every=0)
    if page:
        if page.status_code == 200:
            print(f'Page contents: {page.text}')
        else:
            print('Page is empty or in')

if __name__ == '__main__':
    testit()


## GetRemoteDir.py - Fetches directory of page referenced by url.

This module is used to get a list of files by type from a webpage. Its purpose in this package so far is to get a list of all RFC documents and their type, which is done using BeautifulSoup to extract the 'a' link tags and examine the href text, keeping those whose suffix matches the one requested. The list is what is returned.

One thing to notice in the download_new_file method: after the document is fetched using __GetUrl__, the content of the requests response structure is what Beautiful Soup parses, as this structure is still in requests raw format.
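
The suffix filter itself is simple enough to try without any downloads. The sketch below swaps in the standard library's html.parser in place of BeautifulSoup so that it runs with no extra packages; the class name `LinkCollector` and the sample listing are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that end with a given suffix."""
    def __init__(self, suffix=''):
        super().__init__()
        self.suffix = suffix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        # Anchors without an href are skipped rather than raising an error.
        if href and href.endswith(self.suffix):
            self.links.append(href)


# Hypothetical directory listing, made up for this example.
sample_html = '''
<html><body>
<a href="rfc1.txt">rfc1.txt</a>
<a href="rfc1.pdf">rfc1.pdf</a>
<a name="top">no href here</a>
<a href="rfc2.txt">rfc2.txt</a>
</body></html>
'''

collector = LinkCollector(suffix='.txt')
collector.feed(sample_html)
text_files = collector.links
print(text_files)
```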

In [None]:
import GetUrl
from bs4 import BeautifulSoup
import time
import sys
import socket


class GetRemoteDir:
    def __init__(self):
        self.soup = None
        self.gu = GetUrl.GetUrl()
        self.internet_available = socket.gethostbyname(socket.gethostname()) != '127.0.0.1'
        self.page = None
        self.refresh_hours_every = None

    def list_file_descriptor(self, url, savefile=None, refresh_hours_every=48, suffix=''):
        elapsed_hours = 0
        dirlist = []
        self.refresh_hours_every = refresh_hours_every
        self.url = url
        self.savefile = savefile
        self.suffix = suffix

        if self.savefile:
            if self.savefile.exists():
                lstats = self.savefile.lstat()
                elapsed_hours = (time.time() - lstats.st_mtime) / 3600
                if elapsed_hours > self.refresh_hours_every:
                    self.page = self.download_new_file()
                else:
                    with savefile.open('r') as f:
                        self.page = f.read()
            else:
                self.page = self.download_new_file()
        else:
            self.page = self.download_new_file()

        self.soup = BeautifulSoup(self.page, 'html.parser')
        links = self.soup.select('a')
        for link in links:
            try:
                if link['href'].endswith(suffix):
                    dirlist.append(link['href'])
            except:
                print("Unexpected error:", sys.exc_info()[0])
        return dirlist

    def download_new_file(self):
        page = None
        if self.internet_available:
            document = self.gu.fetch_url(self.url)
            page = document.content
            with self.savefile.open('wb') as f:
                f.write(page)
        else:
            print('Please enable internet and re-try')
        return page


    def list_dir(self, url, suffix):
        # Pass suffix by keyword; positionally it would be taken as savefile
        for item in self.list_file_descriptor(url, suffix=suffix):
            print(item)


def testit():
    import RfcPaths
    rpath = RfcPaths.RfcPaths()
    grd              = GetRemoteDir()
    text_files       = grd.list_file_descriptor(url=rpath.rfc_download_page_url,
                                                savefile=rpath.text_test_file, suffix='txt')
    pdf_files        = grd.list_file_descriptor(url=rpath.rfc_download_page_url,
                                                savefile=rpath.pdf_test_file, suffix='pdf')
    postscript_files = grd.list_file_descriptor(url=rpath.rfc_download_page_url,
                                                savefile=rpath.ps_test_file, suffix='ps')
    print(f'text_files: {text_files}')
    print(f'pdf_files: {pdf_files}')
    print(f'postscript_files: {postscript_files}')


if __name__ == '__main__':
    testit()

## CreateRfcIndex.py - Gets main RFC index and creates json file

This module parses the index page found at rfc-editor.org/rfc-index.html (rfc_index_url in RfcPaths). There are index files available for download, but I found them more difficult to parse than the html itself, so that's what I did.

__init__ instantiates classes for RfcPaths, GetPage, and GetUrl. It also creates an empty dictionary which will eventually become rfc_index.json, the purpose of this module.



In [None]:
import RfcPaths
import GetPage
import GetUrl
import json
import GetRemoteDir
from bs4 import BeautifulSoup


class CreateRfcIndex:
    def __init__(self):
        self.rpath = RfcPaths.RfcPaths()
        self.gp = GetPage.GetPage()
        self.gu = GetUrl.GetUrl()
        self.pno = 0
        self.page = self.get_rfc_index()
        if self.page is not None:
            self.text_index = {}
            self.create_rfc_text_index()

    def get_rfc_index(self):
        url = self.rpath.rfc_index_url
        save = self.rpath.rfc_index_html
        refresh_hrs = 1
        document = self.gp.get_page(url=url, savefile=save, refresh_hours_every=refresh_hrs)
        return document

    def create_rfc_text_index(self):
        """
        Parse rfc_index from rfc-editor.org, create a dictionary and save as json file
        :return: None
        """
        with self.rpath.rfc_index_html.open('rb') as f:
            page = f.read()
        soup = BeautifulSoup(page, 'lxml')

        tr_list = soup.select('tr')

        for tr in tr_list:
            td_list = tr.select('td')
            # Skip rows without td cells (e.g. header rows) to avoid an IndexError
            if td_list and td_list[0].find('script'):
                self.extract_data(td_list)

        self.add_file_links()

        with self.rpath.rfc_index_json.open('w') as temp:
            json.dump(self.text_index, temp)

    def get_file_locations(self):
        grd = GetRemoteDir.GetRemoteDir()
        self.text_files = grd.list_file_descriptor(url=self.rpath.rfc_download_page_url,
                                              savefile=self.rpath.rfc_download_dir_html, suffix='txt')
        self.pdf_files = grd.list_file_descriptor(url=self.rpath.rfc_download_page_url,
                                             savefile=self.rpath.rfc_download_dir_html, suffix='pdf')
        self.postscript_files = grd.list_file_descriptor(url=self.rpath.rfc_download_page_url,
                                                    savefile=self.rpath.rfc_download_dir_html, suffix='ps')

    def add_file_links(self):
        self.get_file_locations()
        for key, value in self.text_index.items():
            abbreviated_key = f'rfc{int(key[3:])}'
            filename = f'{abbreviated_key}.txt'
            if filename in self.text_files:
                self.text_index[key]['text_file'] = f'{self.rpath.rfc_download_page_url}{filename}'
            filename = f'{abbreviated_key}.pdf'
            if filename in self.pdf_files:
                self.text_index[key]['pdf_file'] = f'{self.rpath.rfc_download_page_url}{filename}'
            filename = f'{abbreviated_key}.ps'
            if filename in self.postscript_files:
                self.text_index[key]['ps_file'] = f'{self.rpath.rfc_download_page_url}{filename}'

    def extract_data(self, tds):
        """
        Extract data from soup, building dictionary along the way
        :param tds: list of all td's of proper type
        :return: None
        """
        for n, td in enumerate(tds):
            if td is None:
                continue
            if n == 0:
                item_dict = self.extract_header(td)
            else:
                item_dict['title'] = td.find('b').text
                lines = td.text.strip().split('\n')
                lines = list(map(str.strip, lines))
                lines[0] = lines[0][len(item_dict['title']):]
                item_dict['authors'] = lines[0]
                del lines[0]
                self.get_remaining_fields(lines, item_dict)

    def extract_header(self, td):
        """
        Extract the RFC number from the id script
        :param td: td containing script
        :return: key entry of dictionary entry
        """
        rfc_id = td.text[td.text.index('(') + 2:td.text.index(')') - 1]
        self.text_index[rfc_id] = {}
        return self.text_index[rfc_id]

    @staticmethod
    def get_remaining_fields(lines, item_dict):
        """
        Extracts remaining fields and adds them to the dictionary
        :param lines: list of lines from td
        :param item_dict: dictionary key for rfc
        :return: None
        """
        multi = False
        dtag = None
        dtext = None
        mm = False
        for item in lines:
            if item == '':
                continue
            if multi or item.startswith('('):
                # multi line entry
                if item.startswith('('):
                    if multi:
                        item_dict[dtag] = dtext
                    multi = False
                if multi:
                    if '\>)' in item:
                        multi = False
                        item_dict[dtag] = dtext
                        continue
                    if item.startswith('do'):
                        if mm:
                            nd = item[item.index("('")+2:item.index("')")]
                            dtext = f'{dtext}, {nd}'
                        else:
                            dtext = item[item.index("('")+2:item.index("')")]
                            mm = True
                    continue
                if ')' not in item:
                    dtag = item[item.index('(')+1:]
                    multi = True
                    mm = False
                    continue
                # Two formats, with ':' or with '='
                item = item[1:item.index(')')]
                if '=' in item:
                    item = item.split()
                    item_dict['type'] = item[0].lower()
                    item_dict['size'] = item[2]
                else:
                    item = item.split(': ')
                    item_dict[item[0].lower()] = item[1]
            elif item.startswith('['):
                item_dict['pub_date'] = item[item.index('[')+1:item.index(']')-1].strip()
            else:
                print(f'oddball found: {item}')


if __name__ == '__main__':
    CreateRfcIndex()
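
Two of the string manipulations in this module are easy to misread: the slice in extract_header and the key shortening in add_file_links. The sketch below isolates both. The helper names are made up, and the function name inside the sample string is hypothetical; only the `('RFC0037')` shape is taken from the slicing logic above.

```python
def header_to_id(script_text):
    """Pull the quoted id out of text shaped like name('RFC0037')."""
    start = script_text.index('(') + 2   # skip "(" and the opening quote
    end = script_text.index(')') - 1     # stop before the closing quote
    return script_text[start:end]


def id_to_filename(rfc_id, suffix):
    """Map an index key such as RFC0037 to a download name such as rfc37.txt."""
    # int() drops the leading zeros that the index keys carry.
    return f'rfc{int(rfc_id[3:])}.{suffix}'


# The function name in this sample string is hypothetical.
sample = "showDoc('RFC0037')"
rfc_id = header_to_id(sample)
filename = id_to_filename(rfc_id, 'txt')
print(rfc_id, filename)
```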

## RfcViewer.py - Gui for viewing index, fetching and displaying Documents


In [None]:
#!/usr/bin/python
# See for pdf: https://wxpython.org/Phoenix/docs/html/wx.lib.pdfviewer.html#module-wx.lib.pdfviewer
import wx
import fitz
from wx.lib.pdfviewer import pdfViewer, pdfButtonPanel
# Only the agw version of aui is used; a second import under the same name would be shadowed
import wx.lib.agw.aui as aui
import RfcPaths
import json
# import CheckInternet
import requests


class MainPanel(wx.Panel):
    """
    Just a simple derived panel where we override Freeze and Thaw to work
    around an issue on wxGTK.
    """
    def Freeze(self):
        if 'wxMSW' in wx.PlatformInfo:
            return super(MainPanel, self).Freeze()

    def Thaw(self):
        if 'wxMSW' in wx.PlatformInfo:
            return super(MainPanel, self).Thaw()


class RfcViewer(wx.Frame):
    def __init__(self, parent, id=wx.ID_ANY, title="Rfc Viewer", pos=wx.DefaultPosition,
                 size=(800, 600), style=wx.DEFAULT_FRAME_STYLE, name='RfcViewer'):
        """
        Initialize - inherits from wx.Frame. Instantiates all widgets and variables for application

        :param parent: (wx.Window) The window parent. This may be, and often is, None. If it is not None, the frame
                       will be minimized when its parent is minimized and restored when it is restored (although it
                       will still be possible to minimize and restore just this frame itself).

        :param id:     (wx.WindowID) The window identifier. It may take a value of -1 to indicate a default value.

        :param title:  (string)  The caption to be displayed on the frame’s title bar.

        :param pos:    (wx.point) The window position. The value DefaultPosition indicates a default position,
                       chosen by either the windowing system or wxWidgets, depending on platform.

        :param size:   (wx.Size) – The window size. The value DefaultSize indicates a default size, chosen by either
                       the windowing system or wxWidgets, depending on platform.

        :param style:  (long) – The window style. See wx.Frame class description.

        :param name:   (string) The name of the window. This parameter is used to associate a name with the item,
                       allowing the application user to set Motif resource values for individual windows.
        """
        wx.Frame.__init__(self,
                          parent,
                          id,
                          title,
                          pos,
                          size,
                          style)
        print(f'PyMuPDF version: {fitz.__doc__}')
        self.rpath = RfcPaths.RfcPaths()
        self.rfc_document_data = None

        with self.rpath.rfc_index_json.open() as f:
            self.rfc_index = json.load(f)

        self.pnl = pnl = MainPanel(self)
        self._mgr = aui.AuiManager(pnl)

        # notify AUI which frame to use
        self._mgr.SetManagedWindow(self)

        # First pane is Rfc Selection window located top left
        # ----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
        self.rfc_selector = wx.ListCtrl(self,
                                        id=wx.ID_ANY,
                                        pos=wx.DefaultPosition,
                                        size=wx.Size(200, 150),
                                        style=wx.NO_BORDER | wx.LC_REPORT,
                                        name='RfcSelector')

        self.rfc_selector.InsertColumn(0, 'RFC Id', width=60)
        self.rfc_selector.InsertColumn(1, 'Title', width=200)

        self.rfc_selector.SetMinSize(wx.Size(500, 300))
        self.rfc_selector.SetMaxSize(wx.Size(1000, 800))

        # Need to distinguish between single and double click so
        self.rfc_id = None

        self.dbl_clk_delay = 250
        self.rfc_selector.Bind(wx.EVT_LIST_ITEM_FOCUSED, self.run_summary)
        self.rfc_selector.Bind(wx.EVT_LEFT_DCLICK, self.display_detail)

        self.rfc_selector_load()

        # Second pane is the Rfc Summary window. Data is displayed in this window on a single
        # click in the selection window.
        # ----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
        self.summary = wx.TextCtrl(self,
                                   id=wx.ID_ANY,
                                   value="Pane 2 - Summary Text Here",
                                   pos=wx.DefaultPosition,
                                   size=wx.Size(200, 150),
                                   style=wx.NO_BORDER | wx.TE_MULTILINE,
                                   name='RfcSummary')

        # Third pane (center) is a two tab notebook, one for text files and one for pdf files. One
        # or both may be populated for any given RFC document.
        # ----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
        self.nb = aui.AuiNotebook(self,
                                  id=wx.ID_ANY,
                                  pos=wx.DefaultPosition,
                                  size=wx.DefaultSize,
                                  style=0,
                                  agwStyle=wx.lib.agw.aui.AUI_NB_DEFAULT_STYLE,
                                  name="RfcDocumentNotebook")

        self.text_page = wx.TextCtrl(self.nb,
                                     id=wx.ID_ANY,
                                     pos=wx.DefaultPosition,
                                     size=wx.DefaultSize,
                                     style=wx.NO_BORDER | wx.TE_MULTILINE,
                                     name='DetailNotebookTabs')

        self.pdf_page = wx.Panel(self.nb,
                                 id=wx.ID_ANY,
                                 pos=wx.DefaultPosition,
                                 size=wx.DefaultSize,
                                 style=wx.TAB_TRAVERSAL,
                                 name='pdfpage')

        hsizer = wx.BoxSizer(wx.HORIZONTAL)
        vsizer = wx.BoxSizer(wx.VERTICAL)

        self.btnpanl = pdfButtonPanel(self.pdf_page,
                                      nid=wx.ID_ANY,
                                      pos=wx.DefaultPosition,
                                      size=wx.DefaultSize,
                                      style=0)

        vsizer.Add(self.btnpanl, 0, wx.GROW | wx.LEFT | wx.RIGHT | wx.TOP, 5)

        self.viewer = pdfViewer(self.pdf_page,
                                nid=wx.ID_ANY,
                                pos=wx.DefaultPosition,
                                size=wx.DefaultSize,
                                style=wx.HSCROLL | wx.VSCROLL | wx.SUNKEN_BORDER)

        vsizer.Add(self.viewer, 1, wx.GROW | wx.LEFT | wx.RIGHT | wx.BOTTOM, 5)
        hsizer.Add(vsizer, 1, wx.GROW | wx.ALL, 5)

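        # The sizer manages children of pdf_page, so it belongs to that panel, not the frame.
        self.pdf_page.SetSizer(hsizer)
        self.pdf_page.SetAutoLayout(True)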

        self.btnpanl.viewer = self.viewer
        self.viewer.buttonpanel = self.btnpanl

        self.nb.AddPage(self.text_page, "Text Document")
        self.nb.AddPage(self.pdf_page, "PDF Document")

        self._mgr.AddPane(self.rfc_selector, aui.AuiPaneInfo().Left().Caption("RFC Selection"))
        self._mgr.AddPane(self.summary, aui.AuiPaneInfo().Bottom().Caption("RFC Summary"))
        self._mgr.AddPane(self.nb, aui.AuiPaneInfo().CenterPane().Name('RFC Detail'))

        # tell the manager to "commit" all the changes just made
        self._mgr.Update()

        self.Bind(wx.EVT_CLOSE, self.OnClose)

    def error_msg(self, message):
        """
        Display message in standard wx.MessageDialog
        :param message: (string) Value to be displayed
        :return: None
        """
        msg_dlg = wx.MessageDialog(self, message, '', wx.OK | wx.ICON_ERROR)
        msg_dlg.ShowModal()
        msg_dlg.Destroy()

    def display_detail(self, event):
        """
        Display the actual document. An improvement would be to display from cache, and
        only fetch from the web if not in cache. This would require an update manager which
        would check for new documents. Old documents are not changed at this time, so saving
        documents would not be a problem.

        Will display text documents in the text window. This is the desired mode, but if no text
        document is available, will use a pdf if available (displayed in a separate tab).
        :param event: The event that triggered this method.
        :return: None
        """
        # Start with an empty text tab so successive documents do not accumulate.
        self.text_page.Clear()
        file_format = None
        file_url = None
        print(f'RFC Id: {self.rfc_id}')

        try:
            file_url = self.rfc_index[self.rfc_id]['text_file']
            file_format = 'txt'
        except KeyError:
            try:
                file_url = self.rfc_index[self.rfc_id]['pdf_file']
                file_format = 'pdf'
            except KeyError:
                pass

        if file_format is None:
            self.text_page.AppendText("No URL listed for this file.")
        else:
            wx.BeginBusyCursor()
            try:
                self.rfc_document_data = self.download_file(file_url, file_format)
                if self.rfc_document_data is None:
                    self.text_page.AppendText(f'Unable to download {file_url}')
                elif file_format == 'txt':
                    # Load the whole document at once; iterating over a str would
                    # append it one character at a time.
                    self.text_page.SetValue(self.rfc_document_data)
                else:
                    print('Pdf viewer not yet implemented')
                    # self.viewer.LoadFile(fitz.open("pdf", self.rfc_document_data))
                    # print(pdf_doc.metadata)
                    # self.viewer.LoadFile(pdf_doc)
                    # self.viewer.LoadFile(self.rfc_document_data)
            finally:
                # Always restore the cursor, even if the download raises.
                wx.EndBusyCursor()


    def run_summary(self, event):
        """
        Stores the id of the RFC that received focus (single or double click) in the
        self.rfc_selector ListCtrl, then calls display_summary to render information.
        :param event: The event that triggered this method.
        :return: None (keeps event with Skip, so double click will still work).
        """
        self.rfc_id = event.GetText()
        self.display_summary()
        event.Skip()

    def display_summary(self):
        """
        Display summary information in bottom pane. Called by run_summary event.
        :return: None
        """
        self.summary.Clear()
        for key, value in self.rfc_index[self.rfc_id].items():
            self.summary.AppendText(f'{key}: {value}\n')


    def rfc_selector_load(self):
        """
        Clears, then Loads self.rfc_selector ListCtrl with data from self.rfc_index dictionary.
        :return: None
        """
        self.rfc_selector.DeleteAllItems()
        for key, value in self.rfc_index.items():
            index = self.rfc_selector.GetItemCount()
            self.rfc_selector.InsertItem(index, key)
            self.rfc_selector.SetItem(index, 1, value['title'])

    @staticmethod
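    def download_file(url, file_format):
        """
        Download Rfc document
        :param url: (string) Location of the document
        :param file_format: (string) 'txt' returns str, anything else returns bytes
        :return: document, or None if the server did not answer with status 200
        """
        ok_status = 200
        print(f'url: {url}')
        doc = requests.get(url, allow_redirects=False)
        if doc.status_code != ok_status:
            # Redirects are not followed, so anything but 200 carries no document.
            print(f'status: {doc.status_code}')
            return None
        if file_format == 'txt':
            return doc.text
        return doc.content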

    def OnClose(self, event):
        """
        Graceful exit, removes aui manager and reissues event
        :param event: The event that triggered this method.
        :return: None
        """
        # deinitialize the frame manager
        self._mgr.UnInit()
        event.Skip()


def main():
    """
    Instantiate and execute wx.App and RfcViewer
    :return: None
    """
    app = wx.App()
    frame = RfcViewer(None)
    app.SetTopWindow(frame)
    frame.Show()
    app.MainLoop()

if __name__ == '__main__':
    main()
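
The `display_detail` docstring mentions fetching from the web only when a document is not already cached. A minimal sketch of that idea follows. The names `fetch_with_cache`, `cache_dir` and `downloader` are hypothetical and not part of RfcViewer, and the downloader is passed in so the sketch runs without network access.

```python
import tempfile
from pathlib import Path


def fetch_with_cache(url, cache_dir, downloader):
    """
    Return the document at url as bytes, using cache_dir to avoid repeat downloads.
    downloader is any callable taking a url and returning bytes, or None on failure.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Cache under the last path component of the URL, e.g. 'rfc8.txt'.
    cache_file = cache_dir / url.rstrip('/').rsplit('/', 1)[-1]
    if cache_file.exists():
        return cache_file.read_bytes()
    data = downloader(url)
    if data is not None:
        cache_file.write_bytes(data)
    return data


def demo():
    """Fetch the same URL twice with a stand-in downloader and count real downloads."""
    calls = []

    def fake_downloader(url):
        # Stand-in for a real HTTP request; records each call.
        calls.append(url)
        return b'Request for Comments'

    with tempfile.TemporaryDirectory() as tmp:
        url = 'https://www.rfc-editor.org/rfc/rfc8.txt'  # example URL, never contacted
        first = fetch_with_cache(url, tmp, fake_downloader)
        second = fetch_with_cache(url, tmp, fake_downloader)
    return first, second, len(calls)


if __name__ == '__main__':
    print(demo())
```

In the viewer, the downloader could be a thin wrapper around `requests.get(url).content`, with text documents decoded after reading. Because published RFCs do not change, a cached copy should never need refreshing.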

## ImageLib.py - Create icon8 icon index json file
This class creates an icon image index (json) from the icons8 public domain icon library. These images will be used in this application to create a navigation pane for PDF documents in RfcViewer. The library contains over 300 common menu icons with a fairly open license (read the license document in the images directory).
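
As a rough illustration of the index format, the sketch below builds a dictionary keyed by file name and round-trips it through JSON. It is not the class itself: `build_image_index`, the two-field subset and the file name `images.json` are assumptions made for the example, and a temporary directory stands in for the images directory.

```python
import json
import tempfile
from pathlib import Path


def build_image_index(image_dir):
    """Return {filename: {stat field: value}} for every .png file in image_dir."""
    fields = ('st_size', 'st_mtime')  # subset chosen for illustration
    index = {}
    for png in sorted(Path(image_dir).glob('*.png')):
        if png.is_file():
            stats = png.stat()
            # Look fields up by name rather than by position in the stat tuple.
            index[png.name] = {field: getattr(stats, field) for field in fields}
    return index


def demo():
    """Build an index for a throwaway directory and read it back from JSON."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        (tmp_path / 'open.png').write_bytes(b'12345')   # stand-in for a real icon
        (tmp_path / 'notes.txt').write_text('ignored')  # non-png files are skipped
        index_file = tmp_path / 'images.json'           # hypothetical file name
        index_file.write_text(json.dumps(build_image_index(tmp_path)))
        return json.loads(index_file.read_text())


if __name__ == '__main__':
    print(demo())
```

Looking an icon up is then a plain dictionary access by file name.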

In [None]:
from pathlib import Path
import RfcPaths
import json


class ImageLib:
    def __init__(self):
        """
        Initialize
        """
        self.rpath = RfcPaths.RfcPaths()
        self.image_list = None

    def make_json(self):
        """
        Create image list json file from dictionary, indexed by filename, and include file metadata.
        :return: None
        """
        self.image_list = [x for x in self.rpath.imagepath.iterdir() if x.is_file() and x.name.endswith('.png')]
        stat_fields = ['st_mode', 'st_ino', 'st_dev', 'st_nlink', 'st_uid', 'st_gid', 'st_size', 'st_atime',
                       'st_mtime', 'st_ctime']

        image_dict = {}

        for image_file in self.image_list:
            stats = image_file.lstat()
            image_dict[image_file.name] = {}
            idx = image_dict[image_file.name]

            for n, field in enumerate(stat_fields):
                idx[field] = stats[n]

        with self.rpath.imagedict.open('w') as fo:
            json.dump(image_dict, fo)


def testit():
    """
    Test routine
    :return: None
    """
    il = ImageLib()
    il.make_json()

if __name__ == '__main__':
    testit()

## PdfAttempt.py - Pdf code that works ... sort of
The following code was taken from the wxPython Phoenix demo and modified slightly into a format that I thought might work for the viewer. There are, however, some issues with resizing, where everything gets lost after playing with it for a while.
This is what led me to create the image index for the icons8 icons. The idea is to create my own navigation menu, and to break
the pdf into pages in a way that can handle any of the various pdf file modes (there are at least eight that I know of).
Alas, there was no way that I was going to finish this for the presentation tonight, so this is just a teaser, but it is included in the GitHub repository if you want to play with it.

I plan to finish it and update the repository within the next few weeks.
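
One piece a custom navigation menu would need is a record of the current page that stays in range when the first, previous, next and last buttons are pressed. The sketch below shows only that bookkeeping. `PageNavigator` is a hypothetical helper, not part of the repository, and contains no wx or PDF code.

```python
class PageNavigator:
    """Track the current page of a document. Pages are numbered from zero."""

    def __init__(self, page_count):
        if page_count < 1:
            raise ValueError('page_count must be at least 1')
        self.page_count = page_count
        self.current = 0

    def go_to(self, page):
        """Move to page, clamped to the valid range, and return the new position."""
        self.current = max(0, min(page, self.page_count - 1))
        return self.current

    def first_page(self):
        return self.go_to(0)

    def previous_page(self):
        return self.go_to(self.current - 1)

    def next_page(self):
        return self.go_to(self.current + 1)

    def last_page(self):
        return self.go_to(self.page_count - 1)


if __name__ == '__main__':
    nav = PageNavigator(page_count=3)
    # Repeated "next" presses stop at the last page.
    print(nav.next_page(), nav.next_page(), nav.next_page())
    print(nav.previous_page(), nav.first_page(), nav.previous_page())
```

Each button handler in the menu would call one of these methods and then ask the viewer to render the returned page.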

In [None]:
import wx
import wx.lib.sized_controls as sc
from wx.lib.pdfviewer import pdfViewer, pdfButtonPanel


class PDFViewer(sc.SizedFrame):
    def __init__(self, parent, **kwds):
        super(PDFViewer, self).__init__(parent, **kwds)

        paneCont = self.GetContentsPane()
        # wx.NewId() is deprecated in wxPython 4; wx.ID_ANY lets wx choose the id.
        self.buttonpanel = pdfButtonPanel(paneCont, wx.ID_ANY,
                                wx.DefaultPosition, wx.DefaultSize, 0)
        self.buttonpanel.SetSizerProps(expand=True)
        self.viewer = pdfViewer(paneCont, wx.ID_ANY, wx.DefaultPosition,
                                wx.DefaultSize,
                                wx.HSCROLL|wx.VSCROLL|wx.SUNKEN_BORDER)
        self.viewer.UsePrintDirect = False
        self.viewer.SetSizerProps(expand=True, proportion=1)

        # introduce buttonpanel and viewer to each other
        self.buttonpanel.viewer = self.viewer
        self.viewer.buttonpanel = self.buttonpanel


if __name__ == '__main__':
    import wx.lib.mixins.inspection as WIT
    app = WIT.InspectableApp(redirect=False)

    # UsePrintDirect is already set to False in PDFViewer.__init__
    pdfV = PDFViewer(None, size=(800, 600))
    pdfV.viewer.LoadFile('rfc8.pdf')
    pdfV.Show()

    app.MainLoop()