# Information Extraction


How can we extract structured information from text?

Once the information is in structured form, we can answer questions such as:

- Who receives questions from MEPs?
- When were the answers submitted? How long is the response time?
- Which party is asking most questions?



# Reading in data

In [None]:
import pandas as pd

In [None]:
path = './data/parliamentary-questions_2023_sample.csv'
data = pd.read_csv(path, index_col=1)

In [None]:
sample_question = data.question_text.values[100]
print(sample_question)

In [None]:
sample_answer = data.answer_text.values[110]
print(sample_answer)

# Extracting information from text

## Methods

- Document structure: read a value from a fixed position in the text, such as a specific line. This only works if all documents follow exactly the same layout; otherwise it produces noise or wrong extractions.
- Named Entity Recognition (NER)

## Extract the recipient

In [None]:
sample_question.split('\n')

In [None]:
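def get_recipient_from_question(question):
    # The recipient is expected on the second line, in the form "to the <recipient>"
    lines = question.split('\n') if isinstance(question, str) else []
    if len(lines) < 2:
        return None
    return lines[1].replace('to the ', '').strip()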

In [None]:
data['recipient'] = data['question_text'].apply(get_recipient_from_question)

In [None]:
data.recipient.value_counts()
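
Structure-based extraction returns noise without any warning when a document deviates from the expected layout, so it can help to check the layout before extracting. The following is a minimal sketch, assuming the recipient always sits on the second line in the form "to the <recipient>". The function name `get_recipient_strict` and the example text are hypothetical and not taken from the dataset.

```python
def get_recipient_strict(question, prefix='to the '):
    """Return the recipient, or None if the text lacks the assumed layout."""
    if not isinstance(question, str):
        return None
    lines = question.split('\n')
    # Assumed layout: the second line starts with the prefix "to the "
    if len(lines) < 2 or not lines[1].startswith(prefix):
        return None
    return lines[1][len(prefix):].strip()


# Made-up example text, for illustration only
example = 'Question for written answer\nto the Commission\nSubject: Example'
print(get_recipient_strict(example))
print(get_recipient_strict('A text with only one line'))
```

Rows for which this variant returns `None` are the ones worth inspecting by hand, because their layout differs from the rest.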

## Digression: Regular Expressions

Regular expressions (regex) are concise patterns used for searching and manipulating text. 

Examples:
- Strings that start with `www`: `^www`
- Extracting dates in the format dd/mm/yyyy: `\d{2}/\d{2}/\d{4}`
- Matching email addresses 
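
The example patterns can be tried out with Python's built-in `re` module. This is a small sketch on a made-up string; the email pattern is deliberately simplified and is an assumption, since real addresses can be more complex.

```python
import re

# Made-up text, for illustration only
text = 'Visit www.example.org before 05/03/2023 or write to jane.doe@example.org.'

# A leading ^ anchors the pattern at the start of the string
print(bool(re.match(r'^www', 'www.example.org')))

# Dates in the format dd/mm/yyyy
print(re.findall(r'\d{2}/\d{2}/\d{4}', text))

# Simplified email pattern: local part, @, domain with at least one dot
email_regex = r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'
print(re.findall(email_regex, text))
```

Writing patterns as raw strings (`r'...'`) keeps Python from interpreting backslashes such as `\d` as string escapes.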


Tools:

- RegEx Generator: https://www.autoregex.xyz/
- RegEx Online Tester: https://regexr.com/

## Extract the date of the answer submission

In [None]:
import re

# Raw string, so that \d and \. reach the regex engine unchanged
date_regex = r'(\d{1,2}\.\d{1,2}\.\d{4})'

In [None]:
re.findall(date_regex, sample_answer)

In [None]:
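def get_answer_date(answer):
    # Missing answers are read by pandas as NaN (a float), not as strings
    if not isinstance(answer, str):
        return None
    # Take the first date found in the text as the submission date
    matches = re.findall(date_regex, answer)
    return matches[0] if matches else None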

In [None]:
data['answer_date'] = data['answer_text'].apply(get_answer_date)

In [None]:
data.answer_date.unique()

In [None]:
data[data.answer_date.isnull()]
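
The pattern only checks the shape of a date, so it also matches impossible values such as `45.13.2023`. One possible safeguard is to keep only candidates that parse as real calendar dates. The following is a sketch under that assumption; `strict_date_regex` and `get_valid_answer_date` are hypothetical names, and the example text is made up.

```python
import re
from datetime import datetime

# Same shape as before, with word boundaries to avoid matching inside longer numbers
strict_date_regex = r'\b(\d{1,2}\.\d{1,2}\.\d{4})\b'


def get_valid_answer_date(answer):
    """Return the first match that is a real calendar date, else None."""
    if not isinstance(answer, str):
        return None
    for candidate in re.findall(strict_date_regex, answer):
        try:
            return datetime.strptime(candidate, '%d.%m.%Y').date()
        except ValueError:
            continue  # matches the pattern but is not a valid date
    return None


# Made-up text, for illustration only
example = 'Ref. 45.13.2023. Answer given on 7.3.2023 by the Commission.'
print(get_valid_answer_date(example))
```

This still assumes that the first valid date in the text is the submission date, which may not hold for every answer.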

## How long does it take to respond to a question?

In [None]:
data['question_date'] = pd.to_datetime(data['document_date'])
data['answer_date'] = pd.to_datetime(data['answer_date'], dayfirst=True)

In [None]:
# Vectorised difference in whole days; rows without an answer date yield NaN
data['response_time'] = (data['answer_date'] - data['question_date']).dt.days
data['response_time']

In [None]:
data['response_time'].describe()
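
To make the computation concrete for a single question, here is a standard-library sketch with made-up dates. The function name `response_time_days` is hypothetical, and the date values are assumptions for illustration only.

```python
from datetime import date, datetime


def response_time_days(question_date, answer_date):
    """Whole days between question and answer, or None if either is missing."""
    if question_date is None or answer_date is None:
        return None
    return (answer_date - question_date).days


# Made-up dates, for illustration only
asked = date(2023, 1, 16)
answered = datetime.strptime('7.3.2023', '%d.%m.%Y').date()
print(response_time_days(asked, answered))  # 50
print(response_time_days(asked, None))      # None
```

A negative result would indicate that the extracted date lies before the question date, which suggests the first date found in the answer text was not the submission date.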

In [None]:
# Response time statistics per institution
data.groupby('recipient')['response_time'].describe()

In [None]:
# Response time statistics by month of questioning
data['question_month'] = data['question_date'].dt.strftime('%Y-%m')
data.groupby('question_month')['response_time'].mean()

In [None]:
data

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.swarmplot(x='document_type', y='response_time', data=data)
plt.xlabel('Document Type')
plt.ylabel('Response Time (Days)')
plt.title('Response Time Distribution by Document Type')
plt.show()

## Named Entity Recognition

In [None]:
!python3 -m spacy download en_core_web_sm

In [None]:
import spacy

# Load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

def extract_named_entities(text):
    doc = nlp(text)
    named_entities = []

    for entity in doc.ents:
        if entity.label_ in ["DATE", "PERSON", "ORG"]:
            named_entities.append((entity.text, entity.label_))

    return named_entities

In [None]:
sample_question

In [None]:
extract_named_entities(sample_answer)