# Guernsey Deputy Meetings - Hansard Text Mining

*A learning exercise that uses text mining in Python as exploratory data analysis, to identify patterns in Deputy meetings and build a comprehensive source of information on the opinions expressed by Deputies*


## Intro
I've chosen to pursue this because the steps involved require a mix of disciplines, all of which should help expand my skills in both the practical side of analytics and code management.

**Storage and Management**  
SharePoint will be used to store files but they will be primarily managed in a GitHub repository

**Presentation**  
Jupyter Notebooks will be used in order to document any decisions made and the process undertaken to analyse or manipulate data

**Data Collection**  
Most initial data will be imported manually at this stage, but web scraping might be investigated as a later exercise

**Data Transformation and Analysis**  
Initial data transformation and analysis will be performed in Python. Some presentation may be done in Power BI

_________
## Data Collection

There are a number of options for collecting data on discussions held during States meetings. The main ones considered for identifying opinions expressed by Deputies are:

1. The audio from recordings available through Microsoft Teams
2. A summary Excel or PDF log (used in Meeting Analysis.pbix to identify time spent speaking by each Deputy)
3. A verbatim official PDF report of proceedings, named a 'Hansard'

The quality of the audio recordings varies widely, so while they may be interesting to look at in future, they are unlikely to produce usable results.  
The Excel logs are useful for determining the length of time spent speaking, but lack sufficient detail to offer further meaningful analysis.  
Therefore, the Hansard reports have been selected for analysis. 

Only final Hansard reports will be used. They will be collected manually from each meeting page, which can be accessed through the [States meeting information index](https://www.gov.gg/article/163276/States-Meeting-information-index).



### Import PDFs
Import Hansard files and use PDFMiner to begin exploring the text.  
Before analysis, the aim is to convert this into a structured format where each record is either a sentence or a contiguous speech from one Deputy, with variables showing at least the date of the meeting, the speaker and the text.
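
As a rough illustration of that target structure, the sketch below groups lines of already-extracted text into one record per contiguous speech. It is only a sketch under stated assumptions: the `Speech` class, the `split_speeches` function and the `SPEAKER_RE` pattern are hypothetical names, and the assumption that a speech starts with a label such as `Deputy A:` at the beginning of a line has not been checked against the real Hansard layout.

```python
import re
from dataclasses import dataclass
from datetime import date

# Assumption: a new speech starts with a label like "Deputy A:" at the start
# of a line. The real Hansard layout may differ and need another pattern.
SPEAKER_RE = re.compile(r"^(Deputy [A-Z][\w' .-]*?):\s*(.*)$")


@dataclass
class Speech:
    """One contiguous speech by a single speaker."""
    meeting_date: date
    speaker: str
    text: str


def split_speeches(lines, meeting_date):
    """Group lines of text into one Speech record per contiguous speaker."""
    speeches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = SPEAKER_RE.match(line)
        if match:
            # A speaker label starts a new record.
            speeches.append(Speech(meeting_date, match.group(1), match.group(2)))
        elif speeches:
            # Any other line continues the current speaker's speech.
            speeches[-1].text = f"{speeches[-1].text} {line}".strip()
        # Lines before the first speaker label are ignored.
    return speeches
```

A list of `Speech` records like this could later be loaded into a table with one row per speech, which matches the date, speaker and text variables described above.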

```python
import io

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTChar


def extract_char_from_text(layout):
    # Exploratory step: print the page layout and each top-level object in it.
    # Nothing is returned yet, so the caller currently receives None.
    text = ""
    print(layout)
    for lt_obj in layout:
        print(lt_obj)
#         if isinstance(lt_obj, LTTextBox) or isinstance(lt_obj, LTTextLine):
#             # Recursion until down to LTChar
#             extract_char_from_text(lt_obj)
#         elif isinstance(lt_obj, LTChar):
#             # If at LTChar in tree, get result
#             text = lt_obj._text
#             print(text)
#         return text


def extract_text_from_pdf(pdf_path):
    # Open the PDF file; the context manager closes it when processing ends.
    with open(pdf_path, 'rb') as fp:
        # Create a PDF parser object associated with the file object.
        parser = PDFParser(fp)
        # Create a PDF document object that stores the document structure.
        document = PDFDocument(parser)
        # Check if the document allows text extraction. If not, abort.
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed
        # Create a PDF resource manager object that stores shared resources.
        rsrcmgr = PDFResourceManager()
        # Set parameters for layout analysis.
        laparams = LAParams()
        # Create a PDF device object that aggregates each page into a layout tree.
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Create a string output object (only used by the commented-out code below).
        strout = io.StringIO()
        # Process each page contained in the document.
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            layout = device.get_result()

            text = extract_char_from_text(layout)
            print(text)

#             for lt_obj in layout:
#                 if hasattr(lt_obj, "get_text"):
#                     text = lt_obj.get_text()
#                     strout.write(text)

#         text = strout.getvalue()
#         strout.close()
#         return text

# See https://stackoverflow.com/questions/25248140/how-does-one-obtain-the-location-of-text-in-a-pdf-with-pdfminer
# for help on recursively exploring the tree in order to find font info


extract_text_from_pdf('Hansard/Test.pdf')
```

```
<LTPage(1) 0.000,0.000,595.320,841.920 rotate=0>
<LTTextBoxHorizontal(0) 72.024,756.720,93.719,767.760 'Test \n'>
<LTTextBoxHorizontal(1) 72.024,734.260,74.519,745.300 ' \n'>
<LTTextBoxHorizontal(2) 72.024,711.700,93.359,722.740 'Test \n'>
None
```
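
The trailing `None` appears because `extract_char_from_text` does not return anything yet. The recursive exploration referred to in the code comments could look something like the sketch below, which walks the layout tree down to individual characters and joins consecutive characters that share a font. The function names `iter_chars` and `font_runs` are hypothetical. The sketch uses duck typing rather than `isinstance` checks so that it runs without a PDF to hand: it assumes that character objects expose `fontname` and `get_text()` (as pdfminer's `LTChar` does), that containers are iterable, and that other text objects without a font, such as inferred spaces, should inherit the font of the preceding run.

```python
def iter_chars(lt_obj):
    """Yield (text, font name) pairs from a pdfminer-style layout tree."""
    if hasattr(lt_obj, "fontname"):
        # Leaf carrying font information (LTChar in pdfminer).
        yield lt_obj.get_text(), lt_obj.fontname
    elif hasattr(lt_obj, "__iter__"):
        # Container (page, text box, text line): recurse into its children.
        for child in lt_obj:
            yield from iter_chars(child)
    elif hasattr(lt_obj, "get_text"):
        # Text without font information (e.g. an inferred space or newline).
        yield lt_obj.get_text(), None
    # Anything else (lines, rectangles, images) is skipped.


def font_runs(layout):
    """Join consecutive characters that share a font into (font, text) runs."""
    runs = []
    for text, font in iter_chars(layout):
        if font is None and runs:
            # Assumption: font-less text belongs to the preceding run.
            font = runs[-1][0]
        if runs and runs[-1][0] == font:
            runs[-1][1] += text
        else:
            runs.append([font, text])
    return [(font, text) for font, text in runs]
```

Passing each `layout` returned by `device.get_result()` to `font_runs` would give a list of `(font, text)` pairs per page. If speaker names turn out to be set in a distinct font in the Hansard PDFs, which still needs checking, those runs could help separate speakers from speech text.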
