In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path
import pandas as pd
import json
import sys
import re
from html2text import HTML2Text
from bs4 import BeautifulSoup
import mistletoe
from IPython.display import HTML as HTML_raw, display

In [3]:
def HTML(text):
    # Escape dollar signs so the notebook doesn't typeset text between them as LaTeX
    text = text.replace('$', r'\$')
    return HTML_raw(text)

In [4]:
DATA_DIR = Path('../data/02_primary/')
paths = list(DATA_DIR.glob('*/*.jsonl'))

In [5]:
def read_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

# Converting HTML to Text

Tags: data, python, nlp

date: 2020-08-05T08:00:00+10:00

feature_image: /images/jupyter-blog.png
  
<!--eofm-->

How can we convert HTML into text for processing?

Whitespace in HTML [is complicated](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Whitespace).

In [6]:
def html2md(html):
    parser = HTML2Text()
    parser.ignore_images = True
    parser.ignore_anchors = True
    parser.body_width = 0
    md = parser.handle(html)
    return md

In [7]:
def html2plain(html):
    # HTML to Markdown
    md = html2md(html)
    # Normalise custom lists
    md = re.sub(r'(^|\n) ? ? ?\\?[•·–-—-*]( \w)', r'\1  *\2', md)
    # Convert back into HTML
    html_simple = mistletoe.markdown(md)
    # Convert to plain text
    soup = BeautifulSoup(html_simple)
    text = soup.getText()
    # Strip off table formatting
    text = re.sub(r'(^|\n)\|\s*', r'\1', text)
    # Strip off extra emphasis
    text = re.sub(r'\*\*', '', text)
    # Remove trailing whitespace and leading newlines
    text = re.sub(r' *$', '', text)
    text = re.sub(r'\n\n+', r'\n\n', text)
    text = re.sub(r'^\n+', '', text)
    return text
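
The list normalisation step in `html2plain` is the least obvious one, so here is a rough stand-alone sketch of it. The character class is a simplified rewrite that spells the en and em dash as explicit escapes, and `normalise_bullets` is a hypothetical helper name rather than part of the notebook. It assumes each ad-hoc bullet starts a line, optionally indented and optionally preceded by a backslash escape.

```python
import re

# Simplified variant of the pattern in html2plain (an assumption, not the
# exact original): up to three leading spaces, an optional backslash escape,
# one bullet-like character, then a space and a word character.
BULLET = re.compile(r'(^|\n) ? ? ?\\?[•·\u2013\u2014*-]( \w)')

def normalise_bullets(md):
    """Rewrite ad-hoc bullets as markdown list items ('  * item')."""
    return BULLET.sub(r'\1  *\2', md)

sample = '• Competitive salary\n· Dynamic team\n\\- Key stakeholders'
print(normalise_bullets(sample))
#   * Competitive salary
#   * Dynamic team
#   * Key stakeholders
```

Because the pattern requires a preceding newline (or the start of the string), a dash in the middle of a sentence is left alone.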

# Example with linebreaks instead of paragraphs

In [8]:
p = paths[6]
data = read_jsonl(p)

In [9]:
html = data[0]['description']

The plain HTML

In [10]:
print(html)

<strong><u>The Client</u></strong><br/><br/>Our client is a secondary education institution in the Eastern Suburbs of Melbourne. They offer rewarding and diverse programs for Australian and overseas students as well as workforce development for Australia's corporate, government, and commercial organisations. They operate seven campuses in Victoria, deliver online learning, and have over 1000 passionate and dedicated team members from diverse backgrounds, making them a high flexible and student-focussed organization.<br/><br/><u><strong>The Job</strong></u><br/>This is a job for a talented Information Security Administrator. You'll be providing crucial services in responding to security exposures, monitoring risk, and reporting on internal information security practices across the entire organization.<br/><br/>You'll be chiefly responsible for resolving security issues and proactively assessing risk and vulnerabilities by conducting gap analyses, user education, and providing reports an

How it looks in a browser

In [11]:
HTML(html)

Beautiful Soup runs the sentences together because it doesn't process the `<br>` tags.

In [12]:
print(BeautifulSoup(html).getText())

The ClientOur client is a secondary education institution in the Eastern Suburbs of Melbourne. They offer rewarding and diverse programs for Australian and overseas students as well as workforce development for Australia's corporate, government, and commercial organisations. They operate seven campuses in Victoria, deliver online learning, and have over 1000 passionate and dedicated team members from diverse backgrounds, making them a high flexible and student-focussed organization.The JobThis is a job for a talented Information Security Administrator. You'll be providing crucial services in responding to security exposures, monitoring risk, and reporting on internal information security practices across the entire organization.You'll be chiefly responsible for resolving security issues and proactively assessing risk and vulnerabilities by conducting gap analyses, user education, and providing reports and assessments.You will be a key and active member of the Security Team, providing s

Using a space as the separator is better, but we lose the separation between headers and content.

In [13]:
print(BeautifulSoup(html).getText(' '))

The Client Our client is a secondary education institution in the Eastern Suburbs of Melbourne. They offer rewarding and diverse programs for Australian and overseas students as well as workforce development for Australia's corporate, government, and commercial organisations. They operate seven campuses in Victoria, deliver online learning, and have over 1000 passionate and dedicated team members from diverse backgrounds, making them a high flexible and student-focussed organization. The Job This is a job for a talented Information Security Administrator. You'll be providing crucial services in responding to security exposures, monitoring risk, and reporting on internal information security practices across the entire organization. You'll be chiefly responsible for resolving security issues and proactively assessing risk and vulnerabilities by conducting gap analyses, user education, and providing reports and assessments. You will be a key and active member of the Security Team, provid

Using a newline as the separator works for this particular case.

In [14]:
print(BeautifulSoup(html).getText('\n'))

The Client
Our client is a secondary education institution in the Eastern Suburbs of Melbourne. They offer rewarding and diverse programs for Australian and overseas students as well as workforce development for Australia's corporate, government, and commercial organisations. They operate seven campuses in Victoria, deliver online learning, and have over 1000 passionate and dedicated team members from diverse backgrounds, making them a high flexible and student-focussed organization.
The Job
This is a job for a talented Information Security Administrator. You'll be providing crucial services in responding to security exposures, monitoring risk, and reporting on internal information security practices across the entire organization.
You'll be chiefly responsible for resolving security issues and proactively assessing risk and vulnerabilities by conducting gap analyses, user education, and providing reports and assessments.
You will be a key and active member of the Security Team, provid
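
Since the line breaks live in the `<br>` tags rather than in the text nodes, another option is to turn each `<br>` into a newline explicitly. The following is only a minimal sketch using the standard library's `html.parser` instead of Beautiful Soup. `BrAwareExtractor` and `text_with_breaks` are hypothetical names, the sample string is made up, and block-level tags such as `<p>` and `<div>` are ignored.

```python
from html.parser import HTMLParser

class BrAwareExtractor(HTMLParser):
    """Collect text nodes, emitting a newline for every <br> tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing <br/> via the default handle_startendtag
        if tag == 'br':
            self.parts.append('\n')

    def handle_data(self, data):
        self.parts.append(data)

def text_with_breaks(html):
    parser = BrAwareExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)

sample_html = ('<strong><u>The Client</u></strong><br/><br/>'
               'Our client is a school.<br/><br/>'
               '<u><strong>The Job</strong></u><br/>This is a job.')
print(text_with_breaks(sample_html))
```

Unlike a fixed separator, this keeps the double line break between sections and the single break between a heading and its body.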

HTML2Text does an excellent job of converting this into markdown (though notice it's sensitive to spaces around markup in the headings, which aren't visible in the HTML).

In [15]:
md = html2md(html)
print(md)

**_The Client_**  
  
Our client is a secondary education institution in the Eastern Suburbs of Melbourne. They offer rewarding and diverse programs for Australian and overseas students as well as workforce development for Australia's corporate, government, and commercial organisations. They operate seven campuses in Victoria, deliver online learning, and have over 1000 passionate and dedicated team members from diverse backgrounds, making them a high flexible and student-focussed organization.  
  
 _**The Job**_  
This is a job for a talented Information Security Administrator. You'll be providing crucial services in responding to security exposures, monitoring risk, and reporting on internal information security practices across the entire organization.  
  
You'll be chiefly responsible for resolving security issues and proactively assessing risk and vulnerabilities by conducting gap analyses, user education, and providing reports and assessments.  
  
You will be a key and active 

We can then round-trip the markdown back to HTML with mistletoe.

In [16]:
html2 = mistletoe.markdown(md)
print(html2)

<p><strong><em>The Client</em></strong></p>
<p>Our client is a secondary education institution in the Eastern Suburbs of Melbourne. They offer rewarding and diverse programs for Australian and overseas students as well as workforce development for Australia's corporate, government, and commercial organisations. They operate seven campuses in Victoria, deliver online learning, and have over 1000 passionate and dedicated team members from diverse backgrounds, making them a high flexible and student-focussed organization.</p>
<p><em><strong>The Job</strong></em><br />
This is a job for a talented Information Security Administrator. You'll be providing crucial services in responding to security exposures, monitoring risk, and reporting on internal information security practices across the entire organization.</p>
<p>You'll be chiefly responsible for resolving security issues and proactively assessing risk and vulnerabilities by conducting gap analyses, user education, and providing reports

In [17]:
HTML(html2)

In [18]:
print(BeautifulSoup(html2).get_text(''))

The Client
Our client is a secondary education institution in the Eastern Suburbs of Melbourne. They offer rewarding and diverse programs for Australian and overseas students as well as workforce development for Australia's corporate, government, and commercial organisations. They operate seven campuses in Victoria, deliver online learning, and have over 1000 passionate and dedicated team members from diverse backgrounds, making them a high flexible and student-focussed organization.
The Job
This is a job for a talented Information Security Administrator. You'll be providing crucial services in responding to security exposures, monitoring risk, and reporting on internal information security practices across the entire organization.
You'll be chiefly responsible for resolving security issues and proactively assessing risk and vulnerabilities by conducting gap analyses, user education, and providing reports and assessments.
You will be a key and active member of the Security Team, provid

In [19]:
print(html2plain(html))

The Client
Our client is a secondary education institution in the Eastern Suburbs of Melbourne. They offer rewarding and diverse programs for Australian and overseas students as well as workforce development for Australia's corporate, government, and commercial organisations. They operate seven campuses in Victoria, deliver online learning, and have over 1000 passionate and dedicated team members from diverse backgrounds, making them a high flexible and student-focussed organization.
The Job
This is a job for a talented Information Security Administrator. You'll be providing crucial services in responding to security exposures, monitoring risk, and reporting on internal information security practices across the entire organization.
You'll be chiefly responsible for resolving security issues and proactively assessing risk and vulnerabilities by conducting gap analyses, user education, and providing reports and assessments.
You will be a key and active member of the Security Team, provid

# Example: HTML Tables

In [20]:
p = paths[16]
data = read_jsonl(p)

In [21]:
[idx for idx, d in enumerate(data) if '<tr>' in d['description']]

[68,
 70,
 73,
 74,
 76,
 81,
 91,
 131,
 132,
 133,
 134,
 141,
 156,
 163,
 165,
 179,
 180,
 183,
 185,
 186,
 188,
 198,
 202,
 251,
 253,
 343,
 354,
 357,
 361]

In [22]:
html = data[68]['description']

In [23]:
print(html)

<div class="txt-pre-line">

                                    <style>.conten-detail table *{word-wrap: break-word !important;white-space: pre-wrap !important;word-break !important: break-all !important;} .content-detail table {max-width: 100% !important;width: 100% !important;}  .PABOLDTEXT{font-weight:bold}</style> <table border="0" cellpadding="0" cellspacing="0" class="PABACKGROUNDINVISIBLE" cols="2" id="ACE_HRS_SCH_PSTDSC$0" role="presentation" style="border-style:none" width="1114">
<tbody><tr>
<td height="8" width="7"></td>
<td width="1106"></td>
</tr>
<tr>
<td height="20"></td>
<td align="left" valign="top">
<div id="win0divHRS_SCH_PSTDSC_DESCR$0"><span class="PABOLDTEXT" id="HRS_SCH_PSTDSC_DESCR$0">Location Profile</span>
</div></td>
</tr>
<tr>
<td height="18"></td>
<td align="left" valign="top">
<div id="win0divHRS_SCH_PSTDSC_DESCRLONG$0"><div class="PSLONGEDITBOX" id="HRS_SCH_PSTDSC_DESCRLONG$0" style="word-wrap: break-word;">
<br/><p><b><u><span style=" 11pt;"><span style=

In [24]:
HTML(html)

0,1
,
,Location Profile
,"SCHOOL PROFILE Greenhills Primary School, established in 1962, is situated in a quiet residential location between the north-eastern Melbourne suburbs of Greensborough and Diamond Creek, within the municipality of Nillumbik. The location of the school, on a well maintained grassed and treed site provides a geographic and social centre for the community. The current school enrolment is 517 students. It’s a great place to be! The school has undergone a series of major works providing our community with modern, spacious and well resourced teaching facilities, which include classrooms equipped with the latest IT equipment including interactive whiteboards. We also run a iPad Program in Years 5 and 6. Vision Greenhills Primary School aims to provide a caring, safe, supportive and nurturing environment, which will foster all students’ educational and behavioural development, to enable them to become effective citizens in our school and, the broader society. Values The following values are fundamental to the ethos of Greenhills Primary School and are the underpinning principles for all school-based activity and decision-making. G - Grit R - Respect E - Effort A - Adventure T – Teamwork These values are reflected in the provision of components considered essential to the management of the school, such as; a safe environment, comprehensive educational programs, pastoral care, a whole-school approach to student welfare and discipline, and maximising community involvement. The school is the hub of a family-based community. A close relationship is fostered between the school, the parents and the wider community. This is encouraged by an open-door policy where families are invited to participate actively in all aspects of school life. Greenhills Primary School is committed to providing a safe, stimulating and secure learning environment where the individual needs of all children are met regardless of gender, ability, culture or socio-economic circumstances. 
Opportunities are provided for all children to experience success through a challenging and varied curriculum with a particular emphasis placed on literacy and numeracy across the school. Sequentially planned learning units are developed by teachers working in teams. Classroom practice encompasses a variety of approaches including co-operative learning, cross-age tutoring and small group learning. The school offers specialist programs in the areas of Visual Arts, Italian, Music and Physical Education. Teachers use data, ongoing assessments and observations to monitor individual students’ learning as well as groups and cohorts of students. Our staff are a valued resource and we pride ourselves on being a collegiate team who not only work together, but respect and assist each other at all times. We believe that learning from each other through formal observations, and giving meaningful feedback is essential to improving our own practice and all teachers are involved in a formal observation and feedback program throughout the year. Our school offers a range of extra-curricular opportunities which include: Inter-school sport, Camps, Transition Programs, Family Life Program, Choirs, Instrumental Band, Privately run instrumental lessons, Swimming Program, Life Education Program, Out of School Hours Care Program and Holiday Care."
,Selection Criteria
,"SC1 Demonstrated knowledge of the relevant curriculum, including the ability to incorporate the teaching of literacy and numeracy skills. Demonstrated experience in responding to student learning needs. SC2 Demonstrated experience in planning for and implementing high impact teaching strategies, guided by how students learn, and evaluating the impact of learning and teaching programs on student learning growth. SC3 Demonstrated experience in monitoring and assessing student learning. Demonstrated experience in using data to inform teaching practice and providing feedback on student learning growth and achievement to students and parents.SC4 Demonstrated interpersonal and communication skills. Demonstrated experience in establishing and maintaining collaborative relationships with students, parents, colleagues and the broader school community to support student learning, agency, wellbeing and engagement.SC5 Demonstrated behaviours and attitudes consistent with Department values. Demonstrated experience in reflecting upon practice and engaging in professional learning to continually improve the quality of teaching."
,Role
,"The classroom teacher classification comprises two salary ranges- range 1 and range 2. The primary focus of the classroom teacher is on the planning, preparation and teaching of programs to achieve specific student outcomes. The classroom teacher engages in critical reflection and inquiry in order to improve knowledge and skills to effectively engage students and improve their learning. As the classroom teacher gains experience his or her contribution to the school program beyond the classroom increases. All classroom teachers may be required to undertake other duties in addition to their rostered teaching duties provided the responsibility is appropriate to the salary range, qualifications, training and experience of the teacher. Classroom teacher Range 2 Range 2 classroom teachers play a significant role in assisting the school to improve student performance and educational outcomes determined by the school strategic plan and state-wide priorities and contributing to the development and implementation of school policies and priorities. A critical component of this work will focus on increasing the knowledge base of staff within their school about student learning and high quality instruction to assist their school to define quality teacher practice. Range 2 classroom teachers will be expected to: - Have the content knowledge and pedagogical practice to meet the diverse needs of all students - Model exemplary classroom practice and mentor/coach other teachers in the school to engage in critical reflection of their practice and to support staff to expand their capacity - Provide expert advice about the content, processes and strategies that will shape individual and school professional learning - Supervise and train one or more student teachers - Assist staff to use student data to inform teaching approaches that enable targets related to improving student learning outcomes to be achieved. 
Classroom teacher Range 1 The primary focus of the range 1 classroom teacher is on further developing skills and competencies to become an effective classroom practitioner with structured support and guidance from teachers at higher levels and the planning, preparation and teaching of programs to achieve specific student outcomes. These teachers teach a range of students/classes and are accountable for the effective delivery of their programs. Range 1 classroom teachers are skilled teachers who operate under general direction within clear guidelines following established work practices and documented priorities and may have responsibility for the supervision and training of one or more student teachers. At range 1, teachers participate in the development of school policies and programs and assist in the implementation of school priorities. The focus of a range 1 classroom teacher is on classroom management, subject content and teaching practice. New entrants to the teaching profession in their initial teaching years receive structured support, mentoring and guidance from teachers at higher levels. Under guidance, new entrants to the teaching profession will plan and teach student groups in one or more subjects and are expected to participate in induction programs and other professional learning activities that are designed to ensure the integration of curriculum, assessment and pedagogy across the school. Teachers at range 1 are responsible for teaching their own classes and may also assist and participate in policy development, project teams and the organisation of co-curricula activities."
,Responsibilities
,"The role of classroom teacher may include but is not limited to: - Direct teaching of groups of students and individual students; - Contributing to the development, implementation and evaluation of a curriculum area or other curriculum program within the school; - Undertaking other classroom teaching related and organisational duties as determined by the School Principal; - Participating in activities such as parent/teacher meetings; staff meetings; camps and excursions; - Undertaking other non-teaching supervisory duties."
,Who May Apply


In [25]:
md = html2md(html)
print(md)

|   
---|---  
| 

Location Profile  
  
| 

  


**_SCHOOL PROFILE_**   
Greenhills Primary School, established in 1962, is situated in a quiet residential location between the north-eastern Melbourne suburbs of Greensborough and Diamond Creek, within the municipality of Nillumbik. The location of the school, on a well maintained grassed and treed site provides a geographic and social centre for the community. The current school enrolment is 517 students. It’s a great place to be!

  
The school has undergone a series of major works providing our community with modern, spacious and well resourced teaching facilities, which include classrooms equipped with the latest IT equipment including interactive whiteboards. We also run a iPad Program in Years 5 and 6.

  
 **Vision**  
Greenhills Primary School aims to provide a caring, safe, supportive and nurturing environment, which will foster all students’ educational and behavioural development, to enable them to become effective citizens 

In [26]:
html2 = mistletoe.markdown(md)
print(html2)

<table>
<thead>
<tr>
<th align="left"></th>
<th align="left"></th>
</tr>
</thead>
<tbody>
<tr>
<td align="left"></td>
<td align="left"></td>
</tr>
</tbody>
</table>
<p>Location Profile</p>
<p>|</p>
<p><strong><em>SCHOOL PROFILE</em></strong><br />
Greenhills Primary School, established in 1962, is situated in a quiet residential location between the north-eastern Melbourne suburbs of Greensborough and Diamond Creek, within the municipality of Nillumbik. The location of the school, on a well maintained grassed and treed site provides a geographic and social centre for the community. The current school enrolment is 517 students. It’s a great place to be!</p>
<p>The school has undergone a series of major works providing our community with modern, spacious and well resourced teaching facilities, which include classrooms equipped with the latest IT equipment including interactive whiteboards. We also run a iPad Program in Years 5 and 6.</p>
<p><strong>Vision</strong><br />
Greenhills Primar

In [27]:
HTML(html2)

In [28]:
print(BeautifulSoup(html2).get_text(''))















Location Profile
|
SCHOOL PROFILE
Greenhills Primary School, established in 1962, is situated in a quiet residential location between the north-eastern Melbourne suburbs of Greensborough and Diamond Creek, within the municipality of Nillumbik. The location of the school, on a well maintained grassed and treed site provides a geographic and social centre for the community. The current school enrolment is 517 students. It’s a great place to be!
The school has undergone a series of major works providing our community with modern, spacious and well resourced teaching facilities, which include classrooms equipped with the latest IT equipment including interactive whiteboards. We also run a iPad Program in Years 5 and 6.
Vision
Greenhills Primary School aims to provide a caring, safe, supportive and nurturing environment, which will foster all students’ educational and behavioural development, to enable them to become effective citizens in our school and, the broader society.
V

In [29]:
text = html2plain(html)
print(text)

Location Profile
SCHOOL PROFILE
Greenhills Primary School, established in 1962, is situated in a quiet residential location between the north-eastern Melbourne suburbs of Greensborough and Diamond Creek, within the municipality of Nillumbik. The location of the school, on a well maintained grassed and treed site provides a geographic and social centre for the community. The current school enrolment is 517 students. It’s a great place to be!
The school has undergone a series of major works providing our community with modern, spacious and well resourced teaching facilities, which include classrooms equipped with the latest IT equipment including interactive whiteboards. We also run a iPad Program in Years 5 and 6.
Vision
Greenhills Primary School aims to provide a caring, safe, supportive and nurturing environment, which will foster all students’ educational and behavioural development, to enable them to become effective citizens in our school and, the broader society.
Values
The follow
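
Comparing the last two outputs, the stray `|` left over from the table and the run of leading blank lines are both gone. As a rough illustration, the cleanup regexes from `html2plain` can be applied on their own to a shortened, made-up version of that text. `clean_text` is a hypothetical name for just those steps.

```python
import re

def clean_text(text):
    # Drop a pipe at the start of a line, plus any whitespace after it
    text = re.sub(r'(^|\n)\|\s*', r'\1', text)
    # Collapse runs of blank lines into a single blank line
    text = re.sub(r'\n\n+', r'\n\n', text)
    # Remove newlines at the very start of the text
    text = re.sub(r'^\n+', '', text)
    return text

raw = '\n\n\n\nLocation Profile\n|\nSCHOOL PROFILE\nGreenhills Primary School'
print(clean_text(raw))
# Location Profile
# SCHOOL PROFILE
# Greenhills Primary School
```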

# Example: Blank bold

In [30]:
p = paths[21]
data = read_jsonl(p)

In [31]:
[idx for idx, d in enumerate(data) if '<tr>' in d['description']]

[186, 315, 543, 546]

In [32]:
html = data[546]['description']

In [33]:
print(html)

<div class="tabs margin-top30" id="about-role">
                                <h4 class="color-purple-bold">About the role</h4>
                                                                <div class="org-excerpt"><p><strong>Position Title:</strong><strong>  </strong><strong>                   </strong>Field Organiser</p>
<p><strong>Position Location:</strong> <strong>             </strong>Darwin, NT</p>
<p><strong>Employment Status:</strong>        Ongoing (subject to probation) / Full Time</p>
<p><strong>Classification and Salary range:</strong>   </p>
<ul>
<li>Organiser Level 1 – 2, $74,984 – $97,713 per annum (includes Organiser Expense Allowance paid as salary) + 15.4% superannuation</li>
<li>Darwin Remote Localities Allowance of $5,941 per annum is also payable on a fortnightly basis</li>
</ul>
<p><strong>Position reports to:</strong><strong>  </strong><strong>        </strong>Regional Secretary</p>
<p><strong>Positions reporting to this position are:</strong>    Nil</p>
<p>

In [34]:
HTML(html)

0
CONDITIONS OF EMPLOYMENT

0
COMMUNITY AND PUBLIC SECTOR UNION (CPSU) – PSU GROUP

0
OVERVIEW OF POSITION

0
"REQUIRED SKILLS, KNOWLEDGE & ABILITY"

0
HOW TO APPLY: You must complete the on-line Employment Application Questionnaire and address the Selection Criteria to be considered for this position. Please visit our website at https://cpsu.wufoo.com/forms/cpsu-employment-application/ to access the Employment Application Questionnaire and submit your application.


In [35]:
md = html2md(html)
print(md)

#### About the role

**Position Title:** ******** Field Organiser

**Position Location:** **** Darwin, NT

**Employment Status:** Ongoing (subject to probation) / Full Time

**Classification and Salary range:**

  * Organiser Level 1 – 2, $74,984 – $97,713 per annum (includes Organiser Expense Allowance paid as salary) + 15.4% superannuation
  * Darwin Remote Localities Allowance of $5,941 per annum is also payable on a fortnightly basis



**Position reports to:** ******** Regional Secretary

**Positions reporting to this position are:** Nil

**CONDITIONS OF EMPLOYMENT**  
  
---  
  
  * Flexible work practices and access to accrued days off.
  * Best practice leave provisions including paid primary carer leave (26 weeks) and supporting partner leave (six weeks); and paid family violence leave (20 days pro rata per annum, non-cumulative).
  * Employer super contributions on paid or unpaid parental leave for a period equal to a maximum of 52 weeks.
  * A comprehensive Employee Assista

In [36]:
html2 = mistletoe.markdown(md)
print(html2)

<h4>About the role</h4>
<p><strong>Position Title:</strong> ******** Field Organiser</p>
<p><strong>Position Location:</strong> **** Darwin, NT</p>
<p><strong>Employment Status:</strong> Ongoing (subject to probation) / Full Time</p>
<p><strong>Classification and Salary range:</strong></p>
<ul>
<li>Organiser Level 1 – 2, $74,984 – $97,713 per annum (includes Organiser Expense Allowance paid as salary) + 15.4% superannuation</li>
<li>Darwin Remote Localities Allowance of $5,941 per annum is also payable on a fortnightly basis</li>
</ul>
<p><strong>Position reports to:</strong> ******** Regional Secretary</p>
<p><strong>Positions reporting to this position are:</strong> Nil</p>
<p><strong>CONDITIONS OF EMPLOYMENT</strong></p>
<hr />
<ul>
<li>Flexible work practices and access to accrued days off.</li>
<li>Best practice leave provisions including paid primary carer leave (26 weeks) and supporting partner leave (six weeks); and paid family violence leave (20 days pro rata per annum, non-cu

In [37]:
HTML(html2)

In [38]:
print(BeautifulSoup(html2).get_text(''))

About the role
Position Title: ******** Field Organiser
Position Location: **** Darwin, NT
Employment Status: Ongoing (subject to probation) / Full Time
Classification and Salary range:

Organiser Level 1 – 2, $74,984 – $97,713 per annum (includes Organiser Expense Allowance paid as salary) + 15.4% superannuation
Darwin Remote Localities Allowance of $5,941 per annum is also payable on a fortnightly basis

Position reports to: ******** Regional Secretary
Positions reporting to this position are: Nil
CONDITIONS OF EMPLOYMENT


Flexible work practices and access to accrued days off.
Best practice leave provisions including paid primary carer leave (26 weeks) and supporting partner leave (six weeks); and paid family violence leave (20 days pro rata per annum, non-cumulative).
Employer super contributions on paid or unpaid parental leave for a period equal to a maximum of 52 weeks.
A comprehensive Employee Assistance Program.
A strong commitment to training and development.
Health and well

In [39]:
text = html2plain(html)
print(text)

About the role
Position Title:  Field Organiser
Position Location:  Darwin, NT
Employment Status: Ongoing (subject to probation) / Full Time
Classification and Salary range:

Organiser Level 1 – 2, $74,984 – $97,713 per annum (includes Organiser Expense Allowance paid as salary) + 15.4% superannuation
Darwin Remote Localities Allowance of $5,941 per annum is also payable on a fortnightly basis

Position reports to:  Regional Secretary
Positions reporting to this position are: Nil
CONDITIONS OF EMPLOYMENT

Flexible work practices and access to accrued days off.
Best practice leave provisions including paid primary carer leave (26 weeks) and supporting partner leave (six weeks); and paid family violence leave (20 days pro rata per annum, non-cumulative).
Employer super contributions on paid or unpaid parental leave for a period equal to a maximum of 52 weeks.
A comprehensive Employee Assistance Program.
A strong commitment to training and development.
Health and wellbeing initiatives.
Sa
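
The empty `<strong>` tags in the source become runs of asterisks in the markdown, and stripping every `**` pair removes them but leaves a double space behind, as in `Position Title:  Field Organiser` above. A small sketch of that step follows. The final whitespace collapse is an optional extra that is not part of `html2plain`.

```python
import re

line = 'Position Title: ******** Field Organiser'

# Same substitution as in html2plain: drop every pair of asterisks
stripped = re.sub(r'\*\*', '', line)
print(repr(stripped))   # 'Position Title:  Field Organiser'

# Optional extra (not in the original function): collapse repeated spaces
collapsed = re.sub(r' {2,}', ' ', stripped)
print(repr(collapsed))  # 'Position Title: Field Organiser'
```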

# Example: Complicated HTML

In [40]:
p = paths[4]
data = read_jsonl(p)

In [41]:
html = data[0]['description']

The plain HTML

In [42]:
print(html)

<div class="job-detail-des">
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="315" src="https://www.youtube.com/embed/dwoywMmGZEI" width="560"></iframe>
<div style=" "> </div>
<div align="center" style=""><b style="">Organisation Design Specialist</b></div>
<div align="center" style=""><b>TAFE Worker Level 9 – Talent Pool</b></div>
<div style=""><br/>
</div>
<div style="">We are seeking candidates who are interested in joining TAFE NSW’s Organisation Design Specialist Talent Pool.</div>
<div style="">This is a great opportunity for you to be considered for future roles over the next 12 months.</div>
<div style=""><br/>
</div>
<div style=""><br/>
</div>
<div style=""><br/>
</div>
<div style="">• Competitive salary package and access to multiple benefits</div>
<div style="">• Opportunity to join a dynamic team </div>
<div style="">• Build and maintain relationships with key stakeholders</div>
<div style="">

How it looks in a browser (after removing the video iframe):

In [43]:
HTML(re.sub('<iframe[^>]*>[^<]*</iframe>', '', html))

HTML2Text does an excellent job of converting this into markdown.
Notice there are some empty bold `**` that are invisible in the rendered HTML.

Also note that in CLOSING DATE an additional space has been inserted between the phrase and the colon so that the bold is valid markdown, which differs from how the original reads.

In [44]:
md = html2md(html)
print(md)

**Organisation Design Specialist**

**TAFE Worker Level 9 – Talent Pool**

  


We are seeking candidates who are interested in joining TAFE NSW’s Organisation Design Specialist Talent Pool.

This is a great opportunity for you to be considered for future roles over the next 12 months.

  


  


  


• Competitive salary package and access to multiple benefits

• Opportunity to join a dynamic team 

• Build and maintain relationships with key stakeholders

  


**THE OPPORTUNITY**

With TAFE NSW you will have the opportunity to grow your professional career in a dynamic and collaborative environment, where you can innovate, create value and proudly play a meaningful role in the once in a generation transformation of Australia’s largest skills and training provider!

**  
**

**THE ROLE**

The Design Specialist is responsible for supporting organisation design projects to develop fit for purpose organisation designs aligned to strategy and service delivery models. 

  


• Coordinate O
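
The empty bold markers in the output above (a `**` on each of two consecutive lines) could also be dropped at the markdown stage instead of being stripped from the final text. A minimal sketch, assuming paragraphs are separated by blank lines; `drop_empty_emphasis` is a hypothetical helper and not part of HTML2Text or mistletoe:

```python
import re

def drop_empty_emphasis(md):
    """Drop paragraphs that contain only emphasis markers and whitespace.

    Hypothetical helper for illustration. Caveat: a thematic break written
    as `***` or `___` would be dropped as well.
    """
    # Split on blank (or whitespace-only) lines; runs of them act as one separator
    paragraphs = re.split(r'\n\s*\n', md)
    # Keep a paragraph only if something is left after removing *, _ and whitespace
    kept = [p for p in paragraphs if p.strip(' \t\n*_')]
    return '\n\n'.join(kept)

sample = (
    "**THE OPPORTUNITY**\n\n"
    "With TAFE NSW you will have the opportunity to grow.\n\n"
    "**  \n**\n\n"
    "**THE ROLE**\n"
)
print(drop_empty_emphasis(sample))
```

Working paragraph by paragraph avoids a pitfall of a plain regex such as `\*\*\s*\*\*`, which could also match the closing `**` of one heading and the opening `**` of the next.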

In [45]:
html2 = mistletoe.markdown(md.replace('•', '  *'))
print(html2)

<p><strong>Organisation Design Specialist</strong></p>
<p><strong>TAFE Worker Level 9 – Talent Pool</strong></p>
<p>We are seeking candidates who are interested in joining TAFE NSW’s Organisation Design Specialist Talent Pool.</p>
<p>This is a great opportunity for you to be considered for future roles over the next 12 months.</p>
<ul>
<li>
<p>Competitive salary package and access to multiple benefits</p>
</li>
<li>
<p>Opportunity to join a dynamic team</p>
</li>
<li>
<p>Build and maintain relationships with key stakeholders</p>
</li>
</ul>
<p><strong>THE OPPORTUNITY</strong></p>
<p>With TAFE NSW you will have the opportunity to grow your professional career in a dynamic and collaborative environment, where you can innovate, create value and proudly play a meaningful role in the once in a generation transformation of Australia’s largest skills and training provider!</p>
<p>**<br />
**</p>
<p><strong>THE ROLE</strong></p>
<p>The Design Specialist is responsible for supporting organisati

In [46]:
HTML(html2)

In [47]:
text = BeautifulSoup(html2).getText()
print(text)

Organisation Design Specialist
TAFE Worker Level 9 – Talent Pool
We are seeking candidates who are interested in joining TAFE NSW’s Organisation Design Specialist Talent Pool.
This is a great opportunity for you to be considered for future roles over the next 12 months.


Competitive salary package and access to multiple benefits


Opportunity to join a dynamic team


Build and maintain relationships with key stakeholders


THE OPPORTUNITY
With TAFE NSW you will have the opportunity to grow your professional career in a dynamic and collaborative environment, where you can innovate, create value and proudly play a meaningful role in the once in a generation transformation of Australia’s largest skills and training provider!
**
**
THE ROLE
The Design Specialist is responsible for supporting organisation design projects to develop fit for purpose organisation designs aligned to strategy and service delivery models.


Coordinate Organisation Design workshops to ensure organisational struct

In [48]:
text = html2plain(html)
print(text)

Organisation Design Specialist
TAFE Worker Level 9 – Talent Pool
We are seeking candidates who are interested in joining TAFE NSW’s Organisation Design Specialist Talent Pool.
This is a great opportunity for you to be considered for future roles over the next 12 months.

Competitive salary package and access to multiple benefits

Opportunity to join a dynamic team

Build and maintain relationships with key stakeholders

THE OPPORTUNITY
With TAFE NSW you will have the opportunity to grow your professional career in a dynamic and collaborative environment, where you can innovate, create value and proudly play a meaningful role in the once in a generation transformation of Australia’s largest skills and training provider!

THE ROLE
The Design Specialist is responsible for supporting organisation design projects to develop fit for purpose organisation designs aligned to strategy and service delivery models.

Coordinate Organisation Design workshops to ensure organisational structures and j
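
Most of the difference between the last two outputs is whitespace and the leftover `**`. That cleanup can be sketched on its own using only `re`. The name `tidy_whitespace` is hypothetical, and the `re.MULTILINE` flag is an assumption about intent: with it, trailing spaces are stripped on every line rather than only at the end of the string.

```python
import re

def tidy_whitespace(text):
    """Illustrative cleanup of text extracted from HTML (hypothetical helper)."""
    # Remove leftover bold markers
    text = re.sub(r'\*\*', '', text)
    # Strip trailing spaces and tabs on every line
    text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)
    # Collapse three or more newlines into a single blank line
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Drop leading newlines
    return text.lstrip('\n')

raw = (
    "\n\nCompetitive salary package\n\n\n"
    "Opportunity to join a dynamic team \n"
    "**\n**\n"
    "THE ROLE\n"
)
print(tidy_whitespace(raw))
```

The bold markers are removed first so that the blank lines they leave behind are collapsed by the later substitutions.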

# Example - Processing Lists

In [49]:
p = paths[14]
data = read_jsonl(p)

In [50]:
html = data[0]['description']

The plain HTML

In [51]:
html

'<ul>\n<li><strong>Progressive peak body for Dementia</strong></li>\n<li><strong>Full time, fixed term opportunity until June 2020</strong></li>\n<li><strong>Attractive salary packaging options available</strong></li>\n</ul>\n<p>Dementia Australia is a well-known and respected organisation transforming the experience of people impacted by dementia by elevating their voices and inspiring excellence in support and care free from discrimination.</p>\n<p>We are currently seeking a Younger Onset Dementia Support Coordinator to join our Client Services team. This role covers the Metro Melbourne and Gippsland regions and is based with our team in Hawthorn. You will be responsible for the provision of dementia specialist support to assist people living with younger onset dementia, aged under 65 years, to interface with the National Disability Insurance Scheme (NDIS). The Younger Onset Dementia Support Coordinator role plays an active role in assisting clients to implement their NDIS plans, mee

How it looks in a browser

In [52]:
HTML(html)

HTML2Text again does an excellent job of converting this into markdown.
Note that there's a mixing of markup and styling in the Key Selection Criteria heading.

In [53]:
md = html2md(html)
print(md)

  * **Progressive peak body for Dementia**
  * **Full time, fixed term opportunity until June 2020**
  * **Attractive salary packaging options available**



Dementia Australia is a well-known and respected organisation transforming the experience of people impacted by dementia by elevating their voices and inspiring excellence in support and care free from discrimination.

We are currently seeking a Younger Onset Dementia Support Coordinator to join our Client Services team. This role covers the Metro Melbourne and Gippsland regions and is based with our team in Hawthorn. You will be responsible for the provision of dementia specialist support to assist people living with younger onset dementia, aged under 65 years, to interface with the National Disability Insurance Scheme (NDIS). The Younger Onset Dementia Support Coordinator role plays an active role in assisting clients to implement their NDIS plans, meet identified goals and connect with services and supports that best meet their

In [54]:
html2 = mistletoe.markdown(md)
print(html2)

<ul>
<li><strong>Progressive peak body for Dementia</strong></li>
<li><strong>Full time, fixed term opportunity until June 2020</strong></li>
<li><strong>Attractive salary packaging options available</strong></li>
</ul>
<p>Dementia Australia is a well-known and respected organisation transforming the experience of people impacted by dementia by elevating their voices and inspiring excellence in support and care free from discrimination.</p>
<p>We are currently seeking a Younger Onset Dementia Support Coordinator to join our Client Services team. This role covers the Metro Melbourne and Gippsland regions and is based with our team in Hawthorn. You will be responsible for the provision of dementia specialist support to assist people living with younger onset dementia, aged under 65 years, to interface with the National Disability Insurance Scheme (NDIS). The Younger Onset Dementia Support Coordinator role plays an active role in assisting clients to implement their NDIS plans, meet ident

In [55]:
HTML(html2)

In [56]:
text = BeautifulSoup(html2).getText()
print(text)


Progressive peak body for Dementia
Full time, fixed term opportunity until June 2020
Attractive salary packaging options available

Dementia Australia is a well-known and respected organisation transforming the experience of people impacted by dementia by elevating their voices and inspiring excellence in support and care free from discrimination.
We are currently seeking a Younger Onset Dementia Support Coordinator to join our Client Services team. This role covers the Metro Melbourne and Gippsland regions and is based with our team in Hawthorn. You will be responsible for the provision of dementia specialist support to assist people living with younger onset dementia, aged under 65 years, to interface with the National Disability Insurance Scheme (NDIS). The Younger Onset Dementia Support Coordinator role plays an active role in assisting clients to implement their NDIS plans, meet identified goals and connect with services and supports that best meet their needs.
To be successful i

In [57]:
text = html2plain(html)
print(text)

Progressive peak body for Dementia
Full time, fixed term opportunity until June 2020
Attractive salary packaging options available

Dementia Australia is a well-known and respected organisation transforming the experience of people impacted by dementia by elevating their voices and inspiring excellence in support and care free from discrimination.
We are currently seeking a Younger Onset Dementia Support Coordinator to join our Client Services team. This role covers the Metro Melbourne and Gippsland regions and is based with our team in Hawthorn. You will be responsible for the provision of dementia specialist support to assist people living with younger onset dementia, aged under 65 years, to interface with the National Disability Insurance Scheme (NDIS). The Younger Onset Dementia Support Coordinator role plays an active role in assisting clients to implement their NDIS plans, meet identified goals and connect with services and supports that best meet their needs.
To be successful in

# Example - Custom Lists

In [58]:
p = paths[6]
data = read_jsonl(p)

In [59]:
html = data[2]['description']

The markdown produced by HTML2Text; note the custom `·` bullets

In [60]:
md = html2md(html)
print(md)

Do you have a fintech background and are hungry for your next move? National BDM role where you can work from home, apply now!  
  
  
 **Duties and Responsibilities**  
  
  
· Develop sales plans and exceed set KPI's  
  
· Generate leads by researching and networking with key stakeholders  
  
· Prepare presentations and proposals  
  
· Keep abreast of product and industry knowledge  
  
  
 ****Skills and Experience****  
  
  
· 5 years' experience selling B2B products or services  
  
· Experience selling payment products or services- highly regarded  
  
· Experience in the FinTech space- highly regarded  
  
· Experience using Salesforce CRM  
  
· Experience in hunter sales roles  
  
· Excellent written and verbal communication skills  
  
· Demonstrated successful negotiation and influencing skills  
  
· Fantastic presenting skills  
  
  
 ****Thank you in advance of your application, we would kindly ask you submit your resume in WORD format****  
  
  
 ****Please note o

In [61]:
html2 = mistletoe.markdown(md.replace('·', '  *'))
print(html2)

<p>Do you have a fintech background and are hungry for your next move? National BDM role where you can work from home, apply now!</p>
<p><strong>Duties and Responsibilities</strong></p>
<ul>
<li>
<p>Develop sales plans and exceed set KPI's</p>
</li>
<li>
<p>Generate leads by researching and networking with key stakeholders</p>
</li>
<li>
<p>Prepare presentations and proposals</p>
</li>
<li>
<p>Keep abreast of product and industry knowledge</p>
</li>
</ul>
<p><strong><strong>Skills and Experience</strong></strong></p>
<ul>
<li>
<p>5 years' experience selling B2B products or services</p>
</li>
<li>
<p>Experience selling payment products or services- highly regarded</p>
</li>
<li>
<p>Experience in the FinTech space- highly regarded</p>
</li>
<li>
<p>Experience using Salesforce CRM</p>
</li>
<li>
<p>Experience in hunter sales roles</p>
</li>
<li>
<p>Excellent written and verbal communication skills</p>
</li>
<li>
<p>Demonstrated successful negotiation and influencing skills</p>
</li>
<li>
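
`md.replace('·', '  *')` rewrites every `·` in the text, including any that happen to appear mid-sentence. A slightly more conservative sketch, assuming custom bullets sit at the start of a line with at most three leading spaces; `normalise_bullets` and the set of bullet characters are hypothetical choices for illustration:

```python
import re

# Assumed set of custom bullet characters; extend as needed for other data
BULLET_LINE = re.compile(r'^[ \t]{0,3}[•·][ \t]+', flags=re.MULTILINE)

def normalise_bullets(md):
    """Rewrite custom bullets at the start of a line as markdown list markers."""
    return BULLET_LINE.sub('  * ', md)

sample = (
    "**Duties and Responsibilities**\n\n"
    "· Develop sales plans and exceed set KPI's\n\n"
    "· Prepare presentations and proposals\n\n"
    "Price · quality\n"
)
print(normalise_bullets(sample))
```

Only the two lines that start with `·` become list items; the `·` in the last line is left alone.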


In [62]:
HTML(html2)

In [64]:
text = BeautifulSoup(html2).getText()
print(text)

Do you have a fintech background and are hungry for your next move? National BDM role where you can work from home, apply now!
Duties and Responsibilities


Develop sales plans and exceed set KPI's


Generate leads by researching and networking with key stakeholders


Prepare presentations and proposals


Keep abreast of product and industry knowledge


Skills and Experience


5 years' experience selling B2B products or services


Experience selling payment products or services- highly regarded


Experience in the FinTech space- highly regarded


Experience using Salesforce CRM


Experience in hunter sales roles


Excellent written and verbal communication skills


Demonstrated successful negotiation and influencing skills


Fantastic presenting skills


Thank you in advance of your application, we would kindly ask you submit your resume in WORD format
Please note only shortlisted candidates will be contacted



In [65]:
text = html2plain(html)
print(text)

Do you have a fintech background and are hungry for your next move? National BDM role where you can work from home, apply now!
Duties and Responsibilities

Develop sales plans and exceed set KPI's

Generate leads by researching and networking with key stakeholders

Prepare presentations and proposals

Keep abreast of product and industry knowledge

Skills and Experience

5 years' experience selling B2B products or services

Experience selling payment products or services- highly regarded

Experience in the FinTech space- highly regarded

Experience using Salesforce CRM

Experience in hunter sales roles

Excellent written and verbal communication skills

Demonstrated successful negotiation and influencing skills

Fantastic presenting skills

Thank you in advance of your application, we would kindly ask you submit your resume in WORD format
Please note only shortlisted candidates will be contacted



# Example - Actually plain text

In [66]:
p = paths[10]
data = read_jsonl(p)

In [67]:
html = data[3]['description']

The raw description, which is already plain text rather than HTML

In [68]:
print(html)

The Opportunity

Do you want to conduct impactful research of strategic importance to Australia?
Incorporate host plant resistance to pests and diseases in the cotton breeding program
Grow your research career with a CSIRO PhD Fellowship

CSIRO Early Research Career (CERC) Postdoctoral Fellowships provide opportunities to scientists and engineers who have completed their doctorate and have less than three years of relevant postdoctoral work experience. These fellowships aim to develop the next generation of future leaders of the innovation system.

In this position you will be the lead researcher for a project titled "Incorporating host plant resistance to pests and diseases in the cotton breeding program. "

The key output from this project is to incorporate host plant resistance traits to verticillium wilt, spider mites and whitefly into advanced lines in the CSIRO cotton breeding program, as well as to improve the methodologies to select for resistance.

As rhe Postdoctoral Fellow i

How it looks in a browser

In [69]:
HTML(html)

Because the input is plain text rather than HTML, HTML2Text treats the line breaks as insignificant whitespace, just as a browser would.
The whole description collapses into a single paragraph.

In [70]:
md = html2md(html)
print(md)

The Opportunity Do you want to conduct impactful research of strategic importance to Australia? Incorporate host plant resistance to pests and diseases in the cotton breeding program Grow your research career with a CSIRO PhD Fellowship CSIRO Early Research Career (CERC) Postdoctoral Fellowships provide opportunities to scientists and engineers who have completed their doctorate and have less than three years of relevant postdoctoral work experience. These fellowships aim to develop the next generation of future leaders of the innovation system. In this position you will be the lead researcher for a project titled "Incorporating host plant resistance to pests and diseases in the cotton breeding program. " The key output from this project is to incorporate host plant resistance traits to verticillium wilt, spider mites and whitefly into advanced lines in the CSIRO cotton breeding program, as well as to improve the methodologies to select for resistance. As rhe Postdoctoral Fellow in Hos

In [71]:
html2 = mistletoe.markdown(md)
print(html2)

<p>The Opportunity Do you want to conduct impactful research of strategic importance to Australia? Incorporate host plant resistance to pests and diseases in the cotton breeding program Grow your research career with a CSIRO PhD Fellowship CSIRO Early Research Career (CERC) Postdoctoral Fellowships provide opportunities to scientists and engineers who have completed their doctorate and have less than three years of relevant postdoctoral work experience. These fellowships aim to develop the next generation of future leaders of the innovation system. In this position you will be the lead researcher for a project titled &quot;Incorporating host plant resistance to pests and diseases in the cotton breeding program. &quot; The key output from this project is to incorporate host plant resistance traits to verticillium wilt, spider mites and whitefly into advanced lines in the CSIRO cotton breeding program, as well as to improve the methodologies to select for resistance. As rhe Postdoctoral 

In [72]:
HTML(html2)

In [74]:
text = BeautifulSoup(html2).getText()
print(text)

The Opportunity Do you want to conduct impactful research of strategic importance to Australia? Incorporate host plant resistance to pests and diseases in the cotton breeding program Grow your research career with a CSIRO PhD Fellowship CSIRO Early Research Career (CERC) Postdoctoral Fellowships provide opportunities to scientists and engineers who have completed their doctorate and have less than three years of relevant postdoctoral work experience. These fellowships aim to develop the next generation of future leaders of the innovation system. In this position you will be the lead researcher for a project titled "Incorporating host plant resistance to pests and diseases in the cotton breeding program. " The key output from this project is to incorporate host plant resistance traits to verticillium wilt, spider mites and whitefly into advanced lines in the CSIRO cotton breeding program, as well as to improve the methodologies to select for resistance. As rhe Postdoctoral Fellow in Hos

In [75]:
text = html2plain(html)
print(text)

The Opportunity Do you want to conduct impactful research of strategic importance to Australia? Incorporate host plant resistance to pests and diseases in the cotton breeding program Grow your research career with a CSIRO PhD Fellowship CSIRO Early Research Career (CERC) Postdoctoral Fellowships provide opportunities to scientists and engineers who have completed their doctorate and have less than three years of relevant postdoctoral work experience. These fellowships aim to develop the next generation of future leaders of the innovation system. In this position you will be the lead researcher for a project titled "Incorporating host plant resistance to pests and diseases in the cotton breeding program. " The key output from this project is to incorporate host plant resistance traits to verticillium wilt, spider mites and whitefly into advanced lines in the CSIRO cotton breeding program, as well as to improve the methodologies to select for resistance. As rhe Postdoctoral Fellow in Hos
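
One way to keep the line breaks of a description like this is to check whether it contains any HTML tags at all, and skip the conversion when it doesn't. A rough heuristic sketch; `looks_like_html` and `to_plain` are hypothetical names, and the regular expression is an assumption that will misfire on text such as `x<y and z>w`:

```python
import re

# A tag is assumed to be '<' or '</' followed directly by a letter, up to the next '>'
TAG_RE = re.compile(r'</?[a-zA-Z][^<>]*>')

def looks_like_html(text):
    """Heuristic: does the text contain something that looks like an HTML tag?"""
    return TAG_RE.search(text) is not None

def to_plain(text, convert):
    """Apply `convert` only to text that looks like HTML; otherwise return it unchanged."""
    return convert(text) if looks_like_html(text) else text

# Stand-in converter that collapses whitespace, mimicking what happens above
collapse = lambda s: ' '.join(s.split())

plain = "The Opportunity\n\nDo you want to conduct impactful research?"
print(to_plain(plain, collapse))
```

In the notebook the `convert` argument would be `html2plain`; the stand-in is used here only so the sketch runs without extra dependencies.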