### Task 1

In Moodle you will find the file 110723EUParl.docx. Read its text into your console.
It contains a plenary protocol of the European Parliament from the 11th of July 2023. In this
exercise, you will use regular expressions to extract meta information from the file and separate
the text into smaller parts. Write functions that can be generalized to other protocols of a
similar structure. That is: when you are for instance trying to remove the table of contents, do
not just remove the first 7 pages, but find a way to automatically detect when the main part
starts (for instance using Regex).

#### When you want to find a certain set of tokens using Regex you might want to proceed as follows:

1. Look at the original document and identify structural features that could be used to identify the tokens you are looking for.
2. Translate these loose structural features into Regex.
3. Test the Regex to check that it detects your desired tokens and only your desired tokens.
4. Modify your Regex to solve the exercise.
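Steps 3 and 4 can be sketched as a small check that runs a pattern against strings it should and should not match. This is only an illustration: the helper `check_pattern` and the sample strings are hypothetical and not taken from the protocol.

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Return (missed, false_hits) for a pattern tested against example strings."""
    regex = re.compile(pattern)
    missed = [s for s in should_match if not regex.search(s)]
    false_hits = [s for s in should_not_match if regex.search(s)]
    return missed, false_hits

# Hypothetical example strings, not taken from the protocol
positives = ["Tuesday 11 July 2023", "sitting of 3 May 2022"]
negatives = ["Rule 112 (4)", "2023-07-11", "see 12 items 2023"]

# Step 2: a loose first attempt (day, any word, four-digit year)
loose = r"\b\d{1,2} [A-Za-z]+ \d{4}\b"
loose_missed, loose_false_hits = check_pattern(loose, positives, negatives)

# Step 4: tighten the pattern by allowing only English month names
months = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
strict = r"\b\d{1,2} (?:" + months + r") \d{4}\b"
strict_missed, strict_false_hits = check_pattern(strict, positives, negatives)

print("loose false hits:", loose_false_hits)
print("strict false hits:", strict_false_hits)
```

The loose pattern also accepts "12 items 2023", which is the kind of false hit step 3 is meant to reveal before the pattern is refined.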


### Solution

#### Reading the Document and Pre-processing

This step involves reading the document and preparing it for further processing, such as extracting meta information and splitting the document into chapters.

In [28]:
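# Install python-docx, which provides the `docx` module (notebook shell command)
!pip install python-docx

from docx import Document
import re
from datetime import datetime

def read_docx(file_path):
    """Return the text of all body paragraphs of a .docx file, joined by newlines."""
    doc = Document(file_path)
    # doc.paragraphs covers body paragraphs only; text inside tables is not included
    return '\n'.join(para.text for para in doc.paragraphs)

doc_text = read_docx('110723EUParl.docx')  # Update this to the correct path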
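
A light pre-processing pass can make the later regular expressions more predictable. The following is a minimal sketch, assuming that non-breaking spaces, trailing whitespace and runs of blank lines carry no meaning in the protocol. The function name `normalize_text` is hypothetical. Tabs inside a line are kept on purpose, because tab characters can help to tell table-of-contents entries apart from headings.

```python
import re

def normalize_text(text):
    """Clean up whitespace without touching the content of the lines."""
    text = text.replace("\u00a0", " ")                       # non-breaking spaces
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)  # trailing whitespace per line
    text = re.sub(r"\n{3,}", "\n\n", text)                   # collapse runs of blank lines
    return text.strip()

# Hypothetical input for illustration
cleaned = normalize_text("A\u00a0B  \n\n\n\nC\t4 \n")
```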




### Task 2

#### Identify the date and the weekday of the plenary discussion. Transform the date into a date object that can be used to create timelines in your programming language (e.g. package datetime in Python and as.Date() in R).

### Solution

#### Identifying the Date and Transforming It into a Date Object

In [29]:
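# Day (1-2 digits), a word for the month, and a four-digit year, e.g. "11 July 2023"
date_pattern = r"\b\d{1,2} [a-zA-Z]+ \d{4}\b"
date_match = re.search(date_pattern, doc_text)  # first such date in the document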
if date_match:
    date_str = date_match.group()
    date_obj = datetime.strptime(date_str, "%d %B %Y")  # Adjust the format based on the document
    print("Date of Discussion:", date_obj)
else:
    print("Date not found.")


Date of Discussion: 2023-07-11 00:00:00
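
The task also asks for the weekday, which can be derived from the date itself. Below is a minimal sketch, assuming that the date is written as day, English month name and year. The function name `extract_sitting_date` and the sample text are hypothetical. Month and weekday names are looked up in explicit lists so that the result does not depend on the system locale, which `%B` and `%A` in `strptime`/`strftime` do.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def extract_sitting_date(text):
    """Return (date, weekday name) for the first date found, or None."""
    pattern = r"\b(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})\b"
    match = re.search(pattern, text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    sitting_date = date(int(year), MONTHS.index(month_name) + 1, int(day))
    # date.weekday() counts from Monday = 0
    return sitting_date, WEEKDAYS[sitting_date.weekday()]

# Hypothetical cover-sheet text for illustration
sample = "MINUTES\nTuesday 11 July 2023\nSTRASBOURG"
result = extract_sitting_date(sample)
print(result)
```

Computing the weekday from the date object also makes it possible to cross-check a weekday printed in the document.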


### Task 3

#### From your large text, filter out the Attendance Register (here: page 35) and the cover sheet as well as the table of contents (here: everything until page 8) and remove them so that only the main part of the document is left.

### Solution

#### Filtering Out the Attendance Register, Cover Sheet, and Table of Contents

This task depends on markers or text patterns that are specific to the document structure. One possible approach splits the text at the heading of the first chapter and at the heading of the attendance register:

In [30]:
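# Assuming the main content starts after a specific heading, find that heading and split.
# Caveat: split(...)[1] returns the text between the first and the second occurrence of
# the heading (or everything after it if it occurs only once). If the heading is also
# listed in the table of contents, as the tab-separated page numbers in the printed
# chapters suggest, the result is the rest of the table of contents, not the main part.
start_main_content = "1. Opening of the sitting"
main_content = doc_text.split(start_main_content)[1]  # Adjust based on actual content

# Remove the Attendance Register, assuming it's at the end and its heading is in capitals
end_marker = "ATTENDANCE REGISTER"
main_content_without_register = main_content.split(end_marker)[0]  # Adjust based on actual content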
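
Splitting on a fixed string is fragile when the same heading also appears in the table of contents. A more general variant is sketched below under two assumptions that should be checked against the actual file: entries in the table of contents end in a tab followed by a page number, and the register heading stands on a line of its own. The function name `extract_main_part` and the sample text are hypothetical.

```python
import re

def extract_main_part(text):
    """Return the text from the first body heading up to the attendance register."""
    # First line "1. <title>" that does NOT end in a tab plus page number;
    # contents entries are assumed to look like "1. Opening of the sitting\t4".
    start = re.search(r"^1\.[ \t]+(?!.*\t\d+[ \t]*$)\S.*$", text, flags=re.MULTILINE)
    if start is None:
        raise ValueError("No body heading of the form '1. <title>' found")
    # Register heading on its own line, searched only after the body has started
    register = re.compile(r"^(?:\d+\.[ \t]+)?ATTENDANCE REGISTER[ \t]*$",
                          flags=re.MULTILINE | re.IGNORECASE)
    end = register.search(text, start.end())
    stop = end.start() if end else len(text)
    return text[start.start():stop].strip()

# Hypothetical miniature protocol for illustration
sample = (
    "MINUTES\nContents\n"
    "1. Opening of the sitting\t4\n2. Debate\t5\nAttendance register\t35\n"
    "1. Opening of the sitting\nThe sitting opened at 09:00.\n"
    "2. Debate\nSome text.\n"
    "ATTENDANCE REGISTER\nPresent: A, B"
)
main_part = extract_main_part(sample)
print(main_part)
```

Unlike the string split, this version keeps the first heading in the result, which is useful when the text is split into chapters afterwards.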


### Task 4

#### Split the text into the individual chapters provided in the original document.

### Solution

#### Splitting the Text into Individual Chapters

In [31]:
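# Example approach: split at lines that look like top-level chapter headings ("8. Title").
# Adjust the regex to match the chapter heading format in the document.
# Note: re.split discards the matched heading lines, and sub-headings such as "8.1." are
# not matched because the pattern expects whitespace directly after the first dot.
chapters = re.split(r'\n\d+\.\s+[A-Z].*', main_content_without_register)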
chapters = [chapter.strip() for chapter in chapters if chapter.strip()]  # Remove empty chapters
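
If the chapter numbers and titles should be kept, the headings can be located with `re.finditer` and the text sliced between them. This is a sketch, assuming that headings start a line with a number such as "8." or "8.1." followed by a capitalised title. Numbered lines inside a chapter body would be mistaken for headings, so the pattern may need tightening. The function name `split_chapters` and the sample text are hypothetical.

```python
import re

# "8. Title" or "8.1. Title" at the start of a line
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.[ \t]+([A-Z].*)$", flags=re.MULTILINE)

def split_chapters(text):
    """Return a list of dicts with the number, title and body of each chapter."""
    headings = list(HEADING.finditer(text))
    chapters = []
    for i, match in enumerate(headings):
        # A chapter body runs until the next heading or the end of the text
        body_end = headings[i + 1].start() if i + 1 < len(headings) else len(text)
        chapters.append({
            "number": match.group(1),
            "title": match.group(2).strip(),
            "body": text[match.end():body_end].strip(),
        })
    return chapters

# Hypothetical main part for illustration
sample_main = (
    "1. Opening of the sitting\nThe sitting opened at 09:00.\n"
    "8. Voting time\n"
    "8.1. Request for the waiver of immunity (vote)\nThe request was adopted.\n"
)
sample_chapters = split_chapters(sample_main)
for chapter in sample_chapters:
    print(chapter["number"], "|", chapter["title"])
```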


In [32]:
print(chapters)

['4', "8.1. Request for the waiver of the immunity of Georgios Kyrtsos (vote)\t6\n8.2. Temporary trade-liberalisation measures supplementing trade concessions applicable to Moldovan products under the EU/Euratom/Moldova Association Agreement ***I (vote)\t7\n8.3. New Regulation on Construction Products ***I (vote)\t7\n8.4. Draft amending budget No 2/2023: Entering the surplus of the financial year 2022 (vote)\t8\n8.5. Mobilisation of the European Globalisation Adjustment Fund: application EGF/2023/001 BE/LNSA - Belgium (vote)\t8\n8.6. EU/Chile Agreement: modification of concessions on all the tariff rate quotas included in the EU Schedule CLXXV as a consequence of the United Kingdom's withdrawal from the European Union *** (vote)\t8\n8.7. EU/Japan Agreement: air services *** (vote)\t9\n8.8. Objection pursuant to Rule 112 (4): The food additives nitrites (E 249-250) and nitrates (E 251-252) (vote)\t9\n8.9. Objection pursuant to Rule 111 (3): Detailed production rules for organic sea salt

### Thank you