**Import the scraping function**

Scraping code lives in the meetings_scraper.py file.

In [1]:
from meetings_scraper import get_meetings

In [2]:
help(get_meetings)

Help on function get_meetings in module meetings_scraper:

get_meetings(year=None)
    Calls pdf scrapers for extracting meeting dates
    
    Parameters
    ----------
    year: four digit number of desired year
          Defaults to None (scraper classes use current year)
    
    Returns
    -------
    meetings: Combined and sorted list of meetings generated from online pdfs



**Call get_meetings function with year**

Calls every defined scraper and combines the results into a single sorted list of Meeting objects. Each item is a specific kind of Meeting object, depending on the scraper that produced it.

In [3]:
meetings = get_meetings(year=2017)
meetings

2017 => http://www.wichita.gov/Government/Council/CityCouncilDocument/2017%20CITY%20COUNCIL%20MEETING%20SCHEDULE.pdf
Meeting PDF written to file: files/council_meeting_2017.pdf
34 lines of text extracted
44 meetings generated
2017 => http://www.wichita.gov/Government/Departments/Planning/PlanningDocument/2017%20Subdivision%20Calendar.pdf
Meeting PDF written to file: files/subdivision_meeting_2017.pdf
24 meetings parsed


[Tue Jan 03 09:00 AM: Regular Council Meeting,
 Tue Jan 10 09:00 AM: Regular Council Meeting,
 Thu Jan 12 10:00 AM: Subdivision and Utility Advisory Meeting,
 Tue Jan 17 09:00 AM: Regular Council Meeting,
 Tue Jan 24 09:30 AM: Consent/Workshop Council Meeting,
 Thu Jan 26 10:00 AM: Subdivision and Utility Advisory Meeting,
 Tue Feb 07 09:00 AM: Regular Council Meeting,
 Tue Feb 14 09:00 AM: Regular Council Meeting,
 Thu Feb 16 10:00 AM: Subdivision and Utility Advisory Meeting,
 Tue Feb 21 09:00 AM: Regular Council Meeting,
 Tue Feb 28 09:30 AM: Consent/Workshop Council Meeting,
 Thu Mar 02 10:00 AM: Subdivision and Utility Advisory Meeting,
 Tue Mar 07 09:00 AM: Regular Council Meeting,
 Thu Mar 16 10:00 AM: Subdivision and Utility Advisory Meeting,
 Tue Mar 21 09:00 AM: Regular Council Meeting,
 Tue Mar 28 09:30 AM: Consent/Workshop Council Meeting,
 Thu Mar 30 10:00 AM: Subdivision and Utility Advisory Meeting,
 Tue Apr 04 09:00 AM: Regular Council Meeting,
 Tue Apr 11 09:00 AM: Reg
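
The combine-and-sort step can be pictured with a minimal sketch. This is not the code in meetings_scraper.py: `SketchMeeting` and `combine_meetings` are hypothetical names, and sorting on a `date` attribute is an assumption based on the chronological output above.

```python
from datetime import datetime
from itertools import chain


class SketchMeeting:
    """Hypothetical stand-in for the library's Meeting objects."""

    def __init__(self, summary, date):
        self.summary = summary
        self.date = date

    def __repr__(self):
        return "{0:%a %b %d %I:%M %p}: {1}".format(self.date, self.summary)


def combine_meetings(*meeting_lists):
    """Flatten the per-scraper lists and sort the result by date."""
    return sorted(chain.from_iterable(meeting_lists), key=lambda m: m.date)


council = [
    SketchMeeting("Regular Council Meeting", datetime(2017, 1, 3, 9, 0)),
    SketchMeeting("Regular Council Meeting", datetime(2017, 1, 17, 9, 0)),
]
subdivision = [
    SketchMeeting("Subdivision and Utility Advisory Meeting", datetime(2017, 1, 12, 10, 0)),
]

combined = combine_meetings(council, subdivision)
print(combined)
```

Because the sort key is only the date, each scraper can return its own kind of meeting object as long as it exposes a `date`.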

**Check for meeting on date**

Uses a date string in "m/d" format.

In [4]:
"4/18" in meetings

True

**Get meeting using the string format above**

In [5]:
meeting = meetings["4/18"]
meeting

Tue Apr 18 09:00 AM: Regular Council Meeting
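
One way a list can answer both `"4/18" in meetings` and `meetings["4/18"]` is to override `__contains__` and `__getitem__`. The sketch below is an assumption about how such behavior could be built, not the library's implementation; `MeetingList` is a hypothetical name and plain dicts stand in for Meeting objects.

```python
from datetime import datetime


class MeetingList(list):
    """Hypothetical list subclass that also understands "m/d" strings."""

    def _find(self, key):
        # Split "4/18" into month 4 and day 18, then scan for a match.
        month, day = (int(part) for part in key.split("/"))
        for meeting in self:
            if (meeting["date"].month, meeting["date"].day) == (month, day):
                return meeting
        return None

    def __contains__(self, key):
        if isinstance(key, str):
            return self._find(key) is not None
        return super().__contains__(key)

    def __getitem__(self, key):
        if isinstance(key, str):
            found = self._find(key)
            if found is None:
                raise KeyError(key)
            return found
        return super().__getitem__(key)  # normal index or slice access


sample = MeetingList([
    {"summary": "Regular Council Meeting", "date": datetime(2017, 4, 18, 9, 0)},
])
print("4/18" in sample)
print(sample["4/18"]["summary"])
```

In this sketch a string lookup returns the first meeting on that date, so two meetings on the same day would need a different design.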

**Iterate over properties and print**

★ Note: The agenda PDF is put online 4 days before the meeting. The final agenda PDF is posted the day before.

In [6]:
for key, value in meeting:
    print("{0}: {1}".format(key, value))

agenda: http://www.wichita.gov/Government/Council/Agendas/4-18-2017%20City%20Council%20Agenda%20Packet.pdf
date: 2017-04-18 09:00:00
description: Provide policy direction for Wichita
summary: Regular Council Meeting
email: JCJohnson@wichita.gov;EGlock@wichita.gov;MLovely@wichita.gov;JHensley@wichita.gov;DLCityCouncilMembers@wichita.gov
location: 455 N Main, 1st Floor Board Room Wichita, KS 67202
agenda_final: http://www.wichita.gov/Government/Council/Agendas/4-18-2017%20Final%20City%20Council%20Agenda%20Packet.pdf
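
Looping over a meeting as `key, value` pairs works when the object's `__iter__` yields two-item tuples. The sketch below shows the general pattern under the assumption that properties are stored as instance attributes; `SketchMeeting` is hypothetical and the real Meeting class may do this differently.

```python
from datetime import datetime


class SketchMeeting:
    """Hypothetical class; the real Meeting may store its data differently."""

    def __init__(self, summary, date, location):
        self.summary = summary
        self.date = date
        self.location = location

    def __iter__(self):
        # Yield (name, value) pairs for every instance attribute.
        for key, value in vars(self).items():
            yield key, value


meeting = SketchMeeting(
    "Regular Council Meeting",
    datetime(2017, 4, 18, 9, 0),
    "455 N Main, 1st Floor Board Room Wichita, KS 67202",
)

for key, value in meeting:
    print("{0}: {1}".format(key, value))

# The same protocol lets dict() build a plain dictionary of the properties.
props = dict(meeting)
```

The `dict(meeting)` conversion is what makes exporters such as JSON or CSV straightforward to build on top of iteration.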


**Convert to ical format**

Returns byte string for writing to .ics file.

In [7]:
meeting.to_ics()

b'BEGIN:VCALENDAR\r\nBEGIN:VEVENT\r\nSUMMARY:Regular Council Meeting\r\nDTSTART;VALUE=DATE-TIME:20170418T090000\r\nDTEND;VALUE=DATE-TIME:20170418T110000\r\nDTSTAMP;VALUE=DATE-TIME:20170418T090000Z\r\nDESCRIPTION:Provide policy direction for Wichita\r\nLOCATION:455 N Main\\, 1st Floor Board Room Wichita\\, KS 67202\r\nURL:http://www.wichita.gov/Government/Council/Agendas/4-18-2017%20City%20C\r\n ouncil%20Agenda%20Packet.pdf\r\nEND:VEVENT\r\nEND:VCALENDAR\r\n'

**Pass save=True to save to a file**

In [8]:
# When save param is True, filename is returned instead
filename = meeting.to_ics(save=True)

Meeting saved to files/council_meeting_4_18_2017.ics


**Read .ics file back in**

In [9]:
# The with block closes the file automatically, so no f.close() is needed
with open(filename, 'r') as f:
    for line in f:
        print(line.rstrip("\n"))

BEGIN:VCALENDAR
BEGIN:VEVENT
SUMMARY:Regular Council Meeting
DTSTART;VALUE=DATE-TIME:20170418T090000
DTEND;VALUE=DATE-TIME:20170418T110000
DTSTAMP;VALUE=DATE-TIME:20170418T090000Z
DESCRIPTION:Provide policy direction for Wichita
LOCATION:455 N Main\, 1st Floor Board Room Wichita\, KS 67202
URL:http://www.wichita.gov/Government/Council/Agendas/4-18-2017%20City%20C
 ouncil%20Agenda%20Packet.pdf
END:VEVENT
END:VCALENDAR
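
The URL above is split over two lines, and the second begins with a space. That is iCalendar line folding: RFC 5545 asks writers to fold long content lines by inserting a line break followed by a single whitespace character. A minimal sketch for joining folded lines again, using only the standard library (`unfold_ics` is a hypothetical helper, not part of meetings_scraper):

```python
def unfold_ics(text):
    """Join folded iCalendar content lines back into logical lines."""
    logical = []
    for raw in text.splitlines():
        if raw[:1] in (" ", "\t") and logical:
            # Continuation line: drop the single leading whitespace character.
            logical[-1] += raw[1:]
        else:
            logical.append(raw)
    return logical


sample = (
    "BEGIN:VEVENT\r\n"
    "URL:http://www.wichita.gov/Government/Council/Agendas/4-18-2017%20City%20C\r\n"
    " ouncil%20Agenda%20Packet.pdf\r\n"
    "END:VEVENT\r\n"
)

for line in unfold_ics(sample):
    print(line)
```

Calendar applications unfold these lines themselves, so this is only needed when reading the .ics text by hand.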


**Convert to json format**

In [10]:
meeting.to_json()

'{"agenda": "http://www.wichita.gov/Government/Council/Agendas/4-18-2017%20City%20Council%20Agenda%20Packet.pdf", "summary": "Regular Council Meeting", "date": "2017-04-18T09:00:00", "description": "Provide policy direction for Wichita", "email": "JCJohnson@wichita.gov;EGlock@wichita.gov;MLovely@wichita.gov;JHensley@wichita.gov;DLCityCouncilMembers@wichita.gov", "location": "455 N Main, 1st Floor Board Room Wichita, KS 67202", "agenda_final": "http://www.wichita.gov/Government/Council/Agendas/4-18-2017%20Final%20City%20Council%20Agenda%20Packet.pdf"}'
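
The `date` value in the JSON above is an ISO 8601 string, because `json.dumps` cannot serialize `datetime` objects on its own. A minimal sketch of one common approach, the `default` hook, is shown below. Whether `to_json` uses this exact technique is an assumption, and `encode_extra` is a hypothetical name.

```python
import json
from datetime import datetime

props = {
    "summary": "Regular Council Meeting",
    "date": datetime(2017, 4, 18, 9, 0),
}


def encode_extra(value):
    """Fallback for json.dumps: turn datetimes into ISO 8601 strings."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError("Cannot serialize {0!r}".format(value))


encoded = json.dumps(props, default=encode_extra)
print(encoded)
```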

**Save to file**

In [11]:
meeting.to_json(save=True)

Meeting saved to files/council_meeting_4_18_2017.json


'files/council_meeting_4_18_2017.json'

**Convert to csv format**

In [12]:
meeting.to_csv()

'"http://www.wichita.gov/Government/Council/Agendas/4-18-2017%20City%20Council%20Agenda%20Packet.pdf",2017-04-18 09:00:00,"Provide policy direction for Wichita","Regular Council Meeting","JCJohnson@wichita.gov;EGlock@wichita.gov;MLovely@wichita.gov;JHensley@wichita.gov;DLCityCouncilMembers@wichita.gov","455 N Main, 1st Floor Board Room Wichita, KS 67202","http://www.wichita.gov/Government/Council/Agendas/4-18-2017%20Final%20City%20Council%20Agenda%20Packet.pdf"'
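
The location contains commas, which is why the text fields are quoted. The sketch below reads such a row back with the standard `csv` module. The column order is assumed from the property listing printed earlier, and the URLs and email are shortened placeholders rather than the real values.

```python
import csv
import io

# Column order is an assumption taken from the property listing above.
FIELDS = ["agenda", "date", "description", "summary", "email", "location", "agenda_final"]

# Placeholder row in the same shape as the to_csv() output.
row_text = (
    '"http://example.com/agenda.pdf",2017-04-18 09:00:00,'
    '"Provide policy direction for Wichita","Regular Council Meeting",'
    '"someone@example.com","455 N Main, 1st Floor Board Room Wichita, KS 67202",'
    '"http://example.com/agenda_final.pdf"'
)

# csv.reader keeps quoted commas inside a single field.
values = next(csv.reader(io.StringIO(row_text)))
record = dict(zip(FIELDS, values))
print(record["location"])
```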

**Save to file**

In [13]:
meeting.to_csv(save=True)

Meeting saved to files/council_meeting_4_18_2017.csv


'files/council_meeting_4_18_2017.csv'

**Filter for Consent/Workshops**

In [14]:
# filter() returns a lazy iterator
# list() consumes it into a list
list(filter(lambda meeting: 'Workshop' in meeting.summary, meetings))

[Tue Jan 24 09:30 AM: Consent/Workshop Council Meeting,
 Tue Feb 28 09:30 AM: Consent/Workshop Council Meeting,
 Tue Mar 28 09:30 AM: Consent/Workshop Council Meeting,
 Tue Apr 25 09:30 AM: Consent/Workshop Council Meeting,
 Tue May 23 09:30 AM: Consent/Workshop Council Meeting,
 Tue Jun 27 09:30 AM: Consent/Workshop Council Meeting,
 Tue Jul 25 09:30 AM: Consent/Workshop Council Meeting,
 Tue Aug 22 09:30 AM: Consent/Workshop Council Meeting,
 Tue Sep 26 09:30 AM: Consent/Workshop Council Meeting,
 Tue Oct 24 09:30 AM: Consent/Workshop Council Meeting,
 Tue Nov 28 09:30 AM: Consent/Workshop Council Meeting]

**Filter for Subdivision meetings**

In [15]:
subdivisions = filter(lambda meeting: meeting.type == 'subdivision', meetings)
subdivisions

<filter at 0x10493a438>

**Get the next (first) subdivision meeting**

In [16]:
sub_meeting = next(subdivisions)
sub_meeting

Thu Jan 12 10:00 AM: Subdivision and Utility Advisory Meeting
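
A filter object is a one-shot iterator: each `next()` call consumes an item, so a later `list(subdivisions)` would no longer include the Jan 12 meeting. The behavior is easy to see with plain numbers:

```python
numbers = [1, 2, 3, 4, 5, 6]
evens = filter(lambda n: n % 2 == 0, numbers)

first = next(evens)   # consumes 2
rest = list(evens)    # only the remaining matches: 4 and 6
again = list(evens)   # the iterator is now exhausted

print(first, rest, again)
```

When the filtered meetings are needed more than once, wrapping the call in `list(...)` or using a list comprehension keeps them around.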

**Iterate over and print properties**

★ Note: Agenda and Plat Drawing PDFs are only put online a few days prior to the meeting.

In [17]:
for key, value in sub_meeting:
    print(key, "=>", value)

agenda => http://www.wichita.gov/Government/Departments/Planning/AgendasMinutes/1-12-2017%20Subdivision%20Agenda%20packet.pdf
date => 2017-01-12 10:00:00
plat_drawings => http://www.wichita.gov/Government/Departments/Planning/AgendasMinutes/1-12-2017%20Subdivision%20Agenda%20-%20Plat%20drawings.pdf
description => City utility design planning and review
summary => Subdivision and Utility Advisory Meeting
email => nstrahl@wichita.gov
location => The Ronald Reagan Building, 271 W. 3rd St N, Suite 203, Wichita KS 67201


<hr style="border:6px outset green">

<p style="font-size:16px;font-weight:bold;">Example: Add a new Scraper</p>

<p style="font-size:16px">This would probably be done inside the meetings_scraper library. Below is a screenshot of the sample PDF we'll use:</p>

<img src="files/Test%20Meeting%20screenshot.png" style="width:700px;margin:0;">


<p style="font-size:16px">The PDF XML for text boxes looks like this, and the elements are not listed in reading order:</p>

```xml
<LTTextBoxHorizontal bbox="[396.15, 607.426, 537.72, 620.152]" height="12.726" index="0" width="141.57" x0="396.15" x1="537.72" y0="607.426" y1="620.152">June 28 CANCELLED</LTTextBoxHorizontal>
```

<p style="font-size:16px">Sometimes the first letter of a line ends up in a separate element, for some reason.</p>

```xml
<LTTextBoxHorizontal bbox="[82.021, 671.696, 185.58, 684.422]" height="12.726" index="6" width="103.559" x0="82.021" x1="185.58" y0="671.696" y1="684.422">ecember 31, 2016</LTTextBoxHorizontal>
```
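
Since the elements are not in reading order, the bounding-box attributes can be used to sort them. The sketch below parses trimmed copies of the two elements above with the standard library and orders them top to bottom, then left to right. The `<page>` wrapper and the `reading_order` helper are hypothetical, and the scraper itself does not need this step.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two elements shown above, in their original order.
xml_text = """
<page>
  <LTTextBoxHorizontal index="0" x0="396.15" y1="620.152">June 28 CANCELLED</LTTextBoxHorizontal>
  <LTTextBoxHorizontal index="6" x0="82.021" y1="684.422">ecember 31, 2016</LTTextBoxHorizontal>
</page>
"""

boxes = ET.fromstring(xml_text).findall("LTTextBoxHorizontal")


def reading_order(box):
    # PDF y coordinates grow upward, so negate y1 to put higher boxes first.
    return (-float(box.get("y1")), float(box.get("x0")))


ordered = sorted(boxes, key=reading_order)
for box in ordered:
    print(box.get("index"), box.text)
```

Here the box with index 6 sorts first because its top edge (`y1`) is higher on the page.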

In [18]:
# Only needed if outside of library
import re
from datetime import datetime
from meetings_scraper import Meeting


class ElixirMeeting(Meeting):
    type = "elixir"  #  could be extracted from class name (cls.__name__)
    pdf_url = "http://localhost:8888/files/Open%20Wichita/files/Test%20Meeting%20Schedule.pdf"
    
    def __init__(self, year, month, day):
        self.summary = "Elixir Meeting"
        self.description = "testing adding a custom scraper"
        self.location = "216 N Mosley St, Wichita, KS 67202"
        self.date = datetime(year, month, day, hour=14, minute=15)
        self.agenda = "https://media.readthedocs.org/pdf/elixir-lang/latest/elixir-lang.pdf"
        self.email = "fake_marcus@gmail.com"
        
    def __iter__(self):
        # get props from parent
        props = dict(super().__iter__())
        
        # Add extra properties
        props["other"] = "Something additional"
        
        for key, value in props.items(): # Iterate and yield
            yield key, value
        
    @classmethod
    def parse_meetings(cls, pdf):
        # Find some text in the pdf
        header = pdf.pq('LTTextBoxHorizontal:contains("Meeting Schedule")')
        header_line = header[0]
        
        print("\nFound:", header_line.text)
        for key, value in header_line.items():
            print(key, value)
    
        meetings = []
        day_pattern = r"\d{1,2}"  # First run of one or two digits (day of month)
        year_pattern = r"\d{4}"  # Four-digit year
        
        for month_text in cls.months:
            
            # Get any lines that match month text
            # Could be LTTextLineHorizontal or LTTextBoxHorizontal
            lines_h = pdf.pq('LTTextLineHorizontal:contains("{0}")'.format(month_text))
            lines_box = pdf.pq('LTTextBoxHorizontal:contains("{0}")'.format(month_text))
            lines = lines_h + lines_box # Merge lists
            
            # Iterate over lines to create Meetings
            # Exclude cancelled meetings and meetings in other years
            for line in lines:
                match_day = re.search(day_pattern, line.text)
                
                if match_day:
                    day = int(match_day.group())
                    month = cls.months.index(month_text) + 1
                   
                    if "CANCELLED" in line.text:
                        print("Meeting cancelled on {0}/{1}".format(month, day))
                        continue
                    else:
                        match_year = re.search(year_pattern, line.text)
                        if match_year:
                            year = int(match_year.group())
                            
                            if year != cls.year:
                                print("Wrong year:", line.text)
                                continue
                                
                    meeting = cls(cls.year, month, day)
                    meetings.append(meeting)
                # end if
            #end for line
        #end for month_text
            
        return meetings

In [19]:
ElixirMeeting.year = 2017 # override current year
elixir_meetings = ElixirMeeting.get_meetings()
elixir_meetings

2017 => http://localhost:8888/files/Open%20Wichita/files/Test%20Meeting%20Schedule.pdf
Meeting PDF written to file: files/elixir_meeting_2017.pdf

Found: Meeting Schedule 
bbox [252.85, 703.946, 362.65, 716.448]
height 12.502
index 4
width 109.8
x0 252.85
x1 362.65
y0 703.946
y1 716.448
Wrong year: April 30, 2018 
Meeting cancelled on 6/28
Wrong year: ecember 31, 2016 


[Sun Jan 01 02:15 PM: Elixir Meeting,
 Thu Feb 23 02:15 PM: Elixir Meeting,
 Mon Mar 13 02:15 PM: Elixir Meeting,
 Fri Aug 11 02:15 PM: Elixir Meeting]
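
The skip rules behind the "Wrong year" and "Meeting cancelled" messages can be tried out without a PDF. The sketch below restates them as a standalone function. `parse_line` and `MONTHS` are hypothetical, and accepting a month name without its first letter is an assumption about how a line like "ecember 31, 2016" could be matched. The contents of `cls.months` are not shown in this notebook.

```python
import re

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def parse_line(text, wanted_year):
    """Return (month, day), or None if the line should be skipped."""
    month = None
    for number, name in enumerate(MONTHS, start=1):
        # Also accept the name without its first letter ("ecember").
        if name in text or text.lstrip().startswith(name[1:]):
            month = number
            break

    day_match = re.search(r"\b\d{1,2}\b", text)
    if month is None or day_match is None:
        return None
    if "CANCELLED" in text:
        return None

    # A year is optional, but when present it must be the wanted one.
    year_match = re.search(r"\b\d{4}\b", text)
    if year_match and int(year_match.group()) != wanted_year:
        return None
    return month, int(day_match.group())


for text in ["August 11", "June 28 CANCELLED", "April 30, 2018", "ecember 31, 2016"]:
    print(repr(text), "=>", parse_line(text, 2017))
```

Keeping the text rules in a plain function like this makes them easy to check before wiring them into a `parse_meetings` classmethod.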

In [20]:
first_meeting = elixir_meetings[0]
first_meeting

Sun Jan 01 02:15 PM: Elixir Meeting

In [21]:
first_meeting.to_json()

'{"agenda": "https://media.readthedocs.org/pdf/elixir-lang/latest/elixir-lang.pdf", "date": "2017-01-01T14:15:00", "description": "testing adding a custom scraper", "summary": "Elixir Meeting", "other": "Something additional", "email": "fake_marcus@gmail.com", "location": "216 N Mosley St, Wichita, KS 67202"}'

**For the hell of it**

In [22]:
first_meeting.to_html()