Exercise about European Parliament Plenary Protocol

# Task 1
In Moodle you will find the file 110723EUParl.docx. Read the text inside into your console.
It contains a plenary protocol of the European Parliament from the 11th of July 2023. In this exercise, you will use regular expressions to extract meta information from the file and separate the text into smaller parts. Write functions that can be generalized to other protocols of a similar structure. That is: when you are for instance trying to remove the table of contents, do not just remove the first 7 pages, but find a way to automatically detect when the main part starts (for instance using Regex).
When you want to find a certain set of tokens using Regex you might want to proceed as follows:
1. Look at the original document and identify structural features that could be used to identify the tokens you are looking for.
2. Translate these loose strucutral features into Regex.
3. Use Regex to see whether you detect your desired tokens and only your desired tokens.
4. Modify your Regex to solve the exercise.


In [1]:
pip install python-docx

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [6]:
import docx
import re

# Function to read and extract text from a DOCX file
def read_docx(file_path):
    doc = docx.Document(file_path)
    text = ''
    for para in doc.paragraphs:
        text += para.text + '\n'
    return text

# Function to find specific tokens using Regex
def extract_meta_info(text):
    # Example: Extracting the date using a regex pattern (e.g., date format "11th of July 2023")
    date_pattern = r'\d{1,2}(?:st|nd|rd|th)?\s+of\s+[A-Za-z]+\s+\d{4}'
    date = re.search(date_pattern, text)
    if date:
        print(f"Date found: {date.group(0)}")

    # Example: Extracting a specific structure, such as titles or headers
def extract_titles_refined(text):
    # Refined pattern to match relevant titles only
    title_pattern = r'\b(?:MINUTES|ATTENDANCE REGISTER|PARLIAMENT|EU PARLIAMENT|SESSION|SUMMARY)\b'
    
    titles = re.findall(title_pattern, text)
    
    # Remove duplicates while keeping the order
    titles = list(dict.fromkeys(titles))
    
    return titles

# Example usage
file_path = '/Users/oayanwale/Downloads/NLP_Exercise_23/110723EUParl.docx'
text = read_docx(file_path)
extract_meta_info(text)
refined_titles = extract_titles_refined(text)
print("Refined Titles:", refined_titles)

Refined Titles: ['MINUTES', 'PARLIAMENT', 'ATTENDANCE REGISTER']


In [15]:
import re
from docx import Document

def read_docx(file_path):
    # Load the .docx file
    doc = Document(file_path)
    full_text = []
    
    for paragraph in doc.paragraphs:
        full_text.append(paragraph.text)
        
    return '\n'.join(full_text)

file_path = r'/Users/oayanwale/Downloads/NLP_Exercise_23/110723EUParl.docx'
document_text = read_docx(file_path)

# Display a portion of the document text to understand its structure
print(document_text[:1000])  # Print first 1000 characters for inspection



 2023 - 2024



MINUTES




of the sitting of

Tuesday

11 July 2023







P9_PV-PROV(2023)07-11	PROVISIONAL VERSION	PE 751.377



Contents
1. Opening of the sitting	4
2. Amendment of the agenda	4
3. Nature restoration ***I (debate)	4
4. European Chips Act ***I (debate)	5
5. Resumption of the sitting	6
6. Commemorating the 28th anniversary of Srebrenica (statement by the President)	6
7. Resumption of the sitting	6
8. Voting time	6
8.1. Request for the waiver of the immunity of Georgios Kyrtsos (vote)	6
8.2. Temporary trade-liberalisation measures supplementing trade concessions applicable to Moldovan products under the EU/Euratom/Moldova Association Agreement ***I (vote)	7
8.3. New Regulation on Construction Products ***I (vote)	7
8.4. Draft amending budget No 2/2023: Entering the surplus of the financial year 2022 (vote)	8
8.5. Mobilisation of the European Globalisation Adjustment Fund: application EGF/2023/001 BE/LNSA - Belgium (vote)	8
8.6. EU/Chile Agreement: modification of con

# Task 2
Identify the date and the weekday of the plenary discussion. Transform the date into a date-object that can be used to create timelines in your programming language (e.g. pack- age datetime in Python and as.Date() in R).


In [7]:
#import re
#from datetime import datetime

In [12]:
file_path = '/Users/oayanwale/Downloads/NLP_Exercise_23/110723EUParl.docx'
document_text = read_docx(file_path)
print(document_text)



 2023 - 2024



MINUTES




of the sitting of

Tuesday

11 July 2023







P9_PV-PROV(2023)07-11	PROVISIONAL VERSION	PE 751.377



Contents
1. Opening of the sitting	4
2. Amendment of the agenda	4
3. Nature restoration ***I (debate)	4
4. European Chips Act ***I (debate)	5
5. Resumption of the sitting	6
6. Commemorating the 28th anniversary of Srebrenica (statement by the President)	6
7. Resumption of the sitting	6
8. Voting time	6
8.1. Request for the waiver of the immunity of Georgios Kyrtsos (vote)	6
8.2. Temporary trade-liberalisation measures supplementing trade concessions applicable to Moldovan products under the EU/Euratom/Moldova Association Agreement ***I (vote)	7
8.3. New Regulation on Construction Products ***I (vote)	7
8.4. Draft amending budget No 2/2023: Entering the surplus of the financial year 2022 (vote)	8
8.5. Mobilisation of the European Globalisation Adjustment Fund: application EGF/2023/001 BE/LNSA - Belgium (vote)	8
8.6. EU/Chile Agreement: modification of con

In [16]:
from datetime import datetime

def extract_date_and_weekday(text):
    # Regex pattern to find dates in format "11 July 2023" or similar variations.
    date_pattern = r'(\d{1,2}\s+[A-Za-z]+\s+\d{4})'
    
    match = re.search(date_pattern, text)
    
    if match:
        date_str = match.group(1)  # Extracted date string
        date_obj = datetime.strptime(date_str, '%d %B %Y')  # Convert to datetime object
        
        return date_obj.date(), date_obj.strftime('%A')  # Return date and weekday name
    
    return None, None

date_object, weekday_name = extract_date_and_weekday(document_text)
print("Date:", date_object)
print("Weekday:", weekday_name)

Date: 2023-07-11
Weekday: Tuesday


# Task 3
From your large text, filter out the Attendance Register (here: page 35) and the cover sheet as well as the table of contents (here: everything until page 8) and remove them so that only the main part of the document is left.

In [29]:
# Task 3: Remove Unnecessary Sections
# Define regex patterns for cover sheet, table of contents, and attendance register
cover_pattern = r"MINUTES.*?United in diversity"
toc_pattern = r"Contents\n(.*?)\n1\. Opening of the sitting"
attendance_pattern = r"ATTENDANCE REGISTER\n.*"

# Remove unwanted sections
text_cleaned = re.sub(cover_pattern, "", text, flags=re.DOTALL)
text_cleaned = re.sub(toc_pattern, "", text_cleaned, flags=re.DOTALL)
text_cleaned = re.sub(attendance_pattern, "", text_cleaned, flags=re.DOTALL)

# Task 4
Split the text into the individual chapters provided in the original document.


In [30]:
# Task 4: Split text into individual chapters based on headings (e.g., "1. Opening of the sitting")
chapter_pattern = r"(\d+\.\s[A-Z].+)"  # Match numbered section headings
chapters = re.split(chapter_pattern, text_cleaned)

# Reorganize chapters (odd indices contain headings, even indices contain content)
structured_chapters = {chapters[i].strip(): chapters[i+1].strip() for i in range(1, len(chapters)-1, 2)}

# Output first few chapters for verification
for i, (title, content) in enumerate(structured_chapters.items()):
    print(f"\n--- {title} ---\n{content[:500]}...\n")  # Print first 500 chars for preview
    if i == 2:  # Show only first 3 sections
        break


--- 2. Amendment of the agenda ---
On a proposal from the President of Parliament and in agreement with the political groups in accordance with Rule 158(2), the President proposed the following amendment to the agenda:
Thursday
Inclusion of ‘Act in Support of Ammunition Production’ in voting time. 
Tabling deadlines:
- Amendments: 11 July 2023 at 13.00
- Requests for separate votes and split votes: 11 July 2023 at 19....


--- 00.
Parliament agreed to the proposal. The agenda was amended accordingly. ---
...


--- 3. Nature restoration ***I (debate) ---
Report on the proposal for a regulation of the European Parliament and of the Council on nature restoration [COM(2022)0304 - C9-0208/2022- 2022/0195(COD)] - Committee on the Environment, Public Health and Food Safety. Rapporteur: César Luena (A9-0220/2023)
César Luena introduced the report.
The following spoke: Anne Sander (rapporteur for the opinion of the AGRI Committee) and Caroline Roose (rapporteur for the opinion of the PECH Comm

In [31]:
# Print all chapters in the console
for title, content in structured_chapters.items():
    print(f"\n--- {title} ---\n{content[:500]}...\n")  # Print first 500 chars for preview


--- 2. Amendment of the agenda ---
On a proposal from the President of Parliament and in agreement with the political groups in accordance with Rule 158(2), the President proposed the following amendment to the agenda:
Thursday
Inclusion of ‘Act in Support of Ammunition Production’ in voting time. 
Tabling deadlines:
- Amendments: 11 July 2023 at 13.00
- Requests for separate votes and split votes: 11 July 2023 at 19....


--- 00.
Parliament agreed to the proposal. The agenda was amended accordingly. ---
...


--- 3. Nature restoration ***I (debate) ---
Report on the proposal for a regulation of the European Parliament and of the Council on nature restoration [COM(2022)0304 - C9-0208/2022- 2022/0195(COD)] - Committee on the Environment, Public Health and Food Safety. Rapporteur: César Luena (A9-0220/2023)
César Luena introduced the report.
The following spoke: Anne Sander (rapporteur for the opinion of the AGRI Committee) and Caroline Roose (rapporteur for the opinion of the PECH Comm