**Loading of libraries and dataset**

In [1]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.17-py310-none-any.whl.metadata (7.2 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.1-py3-none-any.whl (471 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)

In [2]:
from datasets import load_dataset
import pandas as pd
import re

In [12]:
# Load train dataset
ds = load_dataset("ailsntua/QEvasion")

# Convert to pandas and keep only useful columns
df_train = ds["train"].to_pandas()[["question","interview_question",
                                    "interview_answer", "label","url"]]

In [4]:
df_train.head(5)

Unnamed: 0,question,interview_question,interview_answer,label
0,How would you respond to the accusation that t...,\nQ. Of the Biden administration. And accused ...,"\nThe President. Well, look, first of all, the...",Explicit
1,Do you think President Xi is being sincere abo...,\nQ. Of the Biden administration. And accused ...,"\nThe President. Well, look, first of all, the...",General
2,1. Q1: Do you believe the country's slowdown a...,\nQ. No worries. Do you believe the country's ...,"\nThe President. Look, I think China has a dif...",Partial/half-answer
3,2. Q2: Are you worried about the meeting betwe...,\nQ. No worries. Do you believe the country's ...,"\nThe President. Look, I think China has a dif...",Dodging
4,Is the President's engagement with Asian coun...,"\nQ. I can imagine. It is evening, I'd like to...","\nThe President. Well, I hope I get to see Mr....",Explicit


**Main preprocessing procedure**

In [13]:
# Regex explanation:

# ^ anchors the match at the start of the string
# (\d+\. Q\d+:|\d+\.|Part \d+:|Q\d+:|-) is a capturing group that matches
# one of the following alternatives, tried from left to right:
#     \d+\. Q\d+: : one or more digits followed by a period, a space,
#     "Q", one or more digits, and a colon (listed first so that it is
#     tried before the shorter \d+\. alternative)
#
#     \d+\. : one or more digits followed by a period
#
#     Part \d+: : the string "Part " followed by one or more digits
#     and a colon
#
#     Q\d+: : the string "Q" followed by one or more digits and a colon
#
#     - : a leading "-"
#
# The space after the prefix is not part of the match, so it stays in the string.

# Remove indexing from questions
index_pattern = r'^(\d+\. Q\d+:|\d+\.|Part \d+:|Q\d+:|-)'
df_train['question'] = df_train['question'].str.replace(index_pattern,
                                                        '', regex=True)

# Remove double quotes and newline characters
df_train['question'] = df_train['question'].str.replace(r'["\n]',
                                                        '', regex=True)
df_train['interview_answer'] = df_train['interview_answer'].str.replace(
                                                            r'\n', '',
                                                            regex=True)

# Remove the first sentence from the answer (the speaker label, e.g. "The President.")
sentence_pattern = r'^[^.]+\.?'
df_train['interview_answer'] = df_train['interview_answer'].str.replace(
                                                        sentence_pattern,
                                                        '', regex=True)

# Remove the leading description (everything up to the first ": ") from questions

# Row indexes that are excluded from this step
exceptions = [142,493,699,809,1052,1053,1446,
              2417,2631,2821,3181,3390]

mask = ~df_train.index.isin(exceptions)
df_train.loc[mask, 'question'] = df_train.loc[mask, 'question'].str.replace(
    r'^[^:]+: ', '', regex=True)
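
The three patterns used in this cell can be checked on a few toy strings before running them on the whole DataFrame. This is a minimal sketch using only `re`; the sample strings are made up for illustration, and `description_pattern` is just a name given here to the inline pattern `^[^:]+: `. The `.strip()` call is an addition for readability and is not part of the cell above.

```python
import re

index_pattern = r'^(\d+\. Q\d+:|\d+\.|Part \d+:|Q\d+:|-)'
sentence_pattern = r'^[^.]+\.?'
description_pattern = r'^[^:]+: '  # hypothetical name for the inline pattern

# Toy questions (not taken from the dataset)
samples = [
    "1. Q1: Do you believe the slowdown matters?",
    "2. Inquiry about the reaction of Kyiv",
    "Part 1: What is the plan?",
    "How would you respond?",
]
cleaned = [re.sub(index_pattern, '', s).strip() for s in samples]
print(cleaned)

# The first sentence of an answer is the speaker label
answer = "The President. Well, look, first of all."
print(re.sub(sentence_pattern, '', answer).strip())

# The description pattern removes everything up to the first ": " ...
print(re.sub(description_pattern, '', "Minority opinion: Is the sentiment shared?"))
# ... including the start of a question that legitimately contains ": "
print(re.sub(description_pattern, '', "Why did you say: no comment?"))
```

The last line shows why some rows need to be skipped: any question that contains ": " as part of its wording loses everything before it.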

In [6]:
df_train.head(5)

Unnamed: 0,question,interview_question,interview_answer,label
0,How would you respond to the accusation that t...,\nQ. Of the Biden administration. And accused ...,"Well, look, first of all, the—I am sincere ab...",Explicit
1,Do you think President Xi is being sincere abo...,\nQ. Of the Biden administration. And accused ...,"Well, look, first of all, the—I am sincere ab...",General
2,Do you believe the country's slowdown and gro...,\nQ. No worries. Do you believe the country's ...,"Look, I think China has a difficult economic ...",Partial/half-answer
3,Are you worried about the meeting between Pre...,\nQ. No worries. Do you believe the country's ...,"Look, I think China has a difficult economic ...",Dodging
4,Is the President's engagement with Asian coun...,"\nQ. I can imagine. It is evening, I'd like to...","Well, I hope I get to see Mr. Xi sooner than ...",Explicit


---

**Exploring data noise at the end of interview answers**

In [20]:
df_train[["question","interview_answer","url"]]

Unnamed: 0,question,interview_answer,url
0,How would you respond to the accusation that t...,"Well, look, first of all, the—I am sincere ab...",https://www.presidency.ucsb.edu/documents/the-...
1,Do you think President Xi is being sincere abo...,"Well, look, first of all, the—I am sincere ab...",https://www.presidency.ucsb.edu/documents/the-...
2,Do you believe the country's slowdown and gro...,"Look, I think China has a difficult economic ...",https://www.presidency.ucsb.edu/documents/the-...
3,Are you worried about the meeting between Pre...,"Look, I think China has a difficult economic ...",https://www.presidency.ucsb.edu/documents/the-...
4,Is the President's engagement with Asian coun...,"Well, I hope I get to see Mr. Xi sooner than ...",https://www.presidency.ucsb.edu/documents/the-...
...,...,...,...
3443,Why shouldn't Americans give Democrats a chan...,That's a tricky little question there. [Laugh...,https://www.presidency.ucsb.edu/documents/the-...
3444,Inquiry about the belief regarding the abilit...,"Mike, I believe Iraq will be able to defend, ...",https://www.presidency.ucsb.edu/documents/the-...
3445,Are you resentful that some Republican candid...,"You know, no, I'm not resentful, nor am I res...",https://www.presidency.ucsb.edu/documents/the-...
3446,If you really didn't think that Republicans w...,"You know, no, I'm not resentful, nor am I res...",https://www.presidency.ucsb.edu/documents/the-...


In [15]:
import requests
from bs4 import BeautifulSoup

In [None]:
url = "https://www.presidency.ucsb.edu/documents/the-presidents-news-conference-hanoi-vietnam-0"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
text = soup.get_text()

In [23]:
# Locate the <div> with class "field-docs-content"
div_content = soup.find('div', class_='field-docs-content')

exception_list = ["The President.", "Q."]

# Extract unique sentences from <i> tags, excluding specific phrases
italic_sentences = {i.get_text(strip=True) for i in div_content.find_all('i')}
unique_sentences = [sentence for sentence in italic_sentences if sentence not in exception_list]


In [24]:
unique_sentences

['At this point, several reporters began asking questions at once.',
 'Human Rights Issues',
 'Laughter',
 '—',
 'White House Press Secretary Karine Jean-Pierre.',
 'Several reporters spoke at once.',
 'China/Taiwan/U.S. Export Controls',
 'inaudible',
 'Inaudible',
 'Climate Change/Forest Conservation Efforts/Clean Energy Transition Assistance/India-Middle East-Europe Economic Corridor',
 'laughter',
 'President Xi Jinping of China/Global Trade Infrastructure/Africa',
 'Press Secretary Jean-Pierre.',
 'Russia/China-U.S. Relations',
 'State Council Premier Li Keqiang of China',
 'China-U.S. Relations/Indo-Pacific Diplomatic Efforts/Quadrilateral Security Dialogue']
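
The extraction step can be tried offline on a small HTML fragment. This is only a sketch: the markup below is invented, and it assumes that the real pages wrap just the note itself in italics with the square brackets outside the tag, which would explain why entries such as 'Laughter' appear without brackets in the list above.

```python
from bs4 import BeautifulSoup

# Invented fragment that mimics the assumed page structure
html = """
<div class="field-docs-content">
  <p><i>The President.</i> Thank you. [<i>Laughter</i>]</p>
  <p><i>Q.</i> A question? [<em>Inaudible</em>]</p>
  <p><i>The President.</i> An answer. [<i>Laughter</i>]</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div_content = soup.find("div", class_="field-docs-content")

speaker_labels = {"The President.", "Q."}
# A set keeps each italic string once; sorting makes the order reproducible
italic = {tag.get_text(strip=True) for tag in div_content.find_all(["i", "em"])}
annotations = sorted(italic - speaker_labels)
print(annotations)  # ['Inaudible', 'Laughter']
```

Because a set has no fixed order, the list printed in the notebook can change between runs; sorting is one way to make it stable.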

In [31]:
def get_italic_sentences(url):
    """Return the unique italic (<i>/<em>) strings found in a transcript page."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')

    # Locate the <div> with class "field-docs-content"
    div_content = soup.find('div', class_='field-docs-content')
    if div_content is None:
        return []

    # Speaker labels that should not be treated as noise
    exception_list = ["The President.", "Q."]

    # Collect unique strings from <i> and <em> tags, excluding the speaker labels
    italic_sentences = {i.get_text(strip=True) for i in div_content.find_all(['i', 'em'])}
    unique_sentences = [sentence for sentence in italic_sentences if sentence not in exception_list]
    return unique_sentences

In [32]:
# Create a dictionary to store unique sentences for each URL
url_sentences = {}

# Iterate through unique URLs in the DataFrame
for url in df_train['url'].unique():
    unique_sentences = get_italic_sentences(url)
    url_sentences[url] = unique_sentences

# Remove the collected italic strings from interview_answer for each row
# Note: str.replace removes every occurrence, so short entries such as
# '—' or 'China' are also stripped from ordinary answer text
for index, row in df_train.iterrows():
    unique_sentences = url_sentences[row['url']]

    for sentence in unique_sentences:
        df_train.at[index, 'interview_answer'] = df_train.at[index, 'interview_answer'].replace(sentence, '')

# Clean up the interview_answer column (optional)
# df_train['interview_answer'] = df_train['interview_answer'].str.replace(r'\s+', ' ', regex=True).str.strip()

df_train

ChunkedEncodingError: ('Connection broken: IncompleteRead(9836 bytes read, 404 more expected)', IncompleteRead(9836 bytes read, 404 more expected))
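
The error above is a transient network failure, so one option is to retry the download a few times before giving up. The helper below is a sketch, not part of the original notebook: `call_with_retries` and `flaky_fetch` are hypothetical names, and the flaky function only simulates a broken connection so that the example runs offline.

```python
import time

def call_with_retries(func, *args, retries=3, backoff=1.0, exceptions=(Exception,)):
    """Call func(*args), retrying on the given exceptions with linear backoff."""
    for attempt in range(1, retries + 1):
        try:
            return func(*args)
        except exceptions:
            if attempt == retries:
                raise  # out of attempts: propagate the last error
            time.sleep(backoff * attempt)

# Toy stand-in for a flaky download: fails twice, then succeeds
calls = {"n": 0}

def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated broken connection")
    return ["Laughter"]

result = call_with_retries(flaky_fetch, "https://example.org", backoff=0.0)
print(result, calls["n"])  # ['Laughter'] 3
```

In the notebook, the loop body could then read `call_with_retries(get_italic_sentences, url, exceptions=(requests.exceptions.RequestException,))`; `ChunkedEncodingError` is a subclass of `RequestException`.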

In [29]:
url_sentences

{'https://www.presidency.ucsb.edu/documents/the-presidents-news-conference-hanoi-vietnam-0': ['At this point, several reporters began asking questions at once.',
  'Human Rights Issues',
  'Laughter',
  '—',
  'White House Press Secretary Karine Jean-Pierre.',
  'Several reporters spoke at once.',
  'China/Taiwan/U.S. Export Controls',
  'inaudible',
  'Inaudible',
  'Climate Change/Forest Conservation Efforts/Clean Energy Transition Assistance/India-Middle East-Europe Economic Corridor',
  'laughter',
  'President Xi Jinping of China/Global Trade Infrastructure/Africa',
  'Press Secretary Jean-Pierre.',
  'Russia/China-U.S. Relations',
  'State Council Premier Li Keqiang of China',
  'China-U.S. Relations/Indo-Pacific Diplomatic Efforts/Quadrilateral Security Dialogue'],
 'https://www.presidency.ucsb.edu/documents/the-presidents-news-conference-with-president-yoon-suk-yeol-south-korea-and-prime-minister': ['President Biden.',
  'President Yoon.',
  'China',
  'At this point, President
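
The dictionary contains very short entries such as '—' and 'China'. Removing them with a plain string replace would also delete those characters from ordinary answer text. A more conservative variant is sketched below; `strip_annotations` and the `min_words` threshold are assumptions made for illustration, as is the idea that editorial notes appear in square brackets in the answers (as in "[Laughter]").

```python
import re

def strip_annotations(answer, sentences, min_words=3):
    """Remove italic annotations from an answer without touching ordinary words."""
    # Longest entries first, so a short entry never breaks up a longer one
    for sentence in sorted(set(sentences), key=len, reverse=True):
        # Bracketed editorial notes such as [Laughter] are always removed
        answer = answer.replace(f"[{sentence}]", "")
        # Plain-text removal is limited to multi-word entries (assumed threshold)
        if len(sentence.split()) >= min_words:
            answer = answer.replace(sentence, "")
    # Collapse the whitespace left behind by the removals
    return re.sub(r"\s+", " ", answer).strip()

sentences = ["Laughter", "—", "China", "Several reporters spoke at once."]
answer = ("Well, China is a partner. [Laughter] And a rival, too. "
          "Several reporters spoke at once.")
print(strip_annotations(answer, sentences))
# Well, China is a partner. And a rival, too.
```

The word "China" survives because it is a single-word entry that does not occur in brackets, while the multi-word stage direction and the bracketed note are removed.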

In [30]:
empty_url_sentences = [url for url, sentences in url_sentences.items() if not sentences]
empty_url_sentences

['https://www.presidency.ucsb.edu/documents/the-presidents-news-conference-vientiane-laos',
 'https://www.presidency.ucsb.edu/documents/the-presidents-news-conference-hangzhou-china',
 'https://www.presidency.ucsb.edu/documents/the-presidents-new-conference-the-pentagon-arlington-virginia',
 'https://www.presidency.ucsb.edu/documents/the-presidents-news-conference-with-prime-minister-lee-hsien-loong-singapore',
 'https://www.presidency.ucsb.edu/documents/the-presidents-news-conference-warsaw-poland',
 'https://www.presidency.ucsb.edu/documents/the-presidents-news-conference-with-prime-minister-justin-pj-trudeau-canada-and-president',
 'https://www.presidency.ucsb.edu/documents/the-presidents-news-conference-with-president-tran-dai-quang-vietnam-hanoi-vietnam',
 'https://www.presidency.ucsb.edu/documents/the-presidents-news-conference-with-prime-minister-david-wd-cameron-the-united-kingdom',
 'https://www.presidency.ucsb.edu/documents/the-presidents-news-conference-1141',
 'https://www.

In [27]:
df_train.to_csv('output.csv', index=False)

---

**Exploring unhandled data noise**

1) Declarative questions (no question mark)

In [None]:
filtered_no_quest_df = df_train[~df_train['question'].str.contains(r'\?')]
len(filtered_no_quest_df)

772

In [None]:
# Only the last expression of a cell is displayed, so the output is len(df_train)
len(df_train[df_train['question'].str.contains(r'\?')])
len(df_train)

3448

In [None]:
filtered_no_quest_df

Unnamed: 0,question,interview_question,interview_answer,label
8,1. Concerns about the lack of communication be...,"\nQ. Well, let me ask you about—you've spent l...",\nThe President. It's not a wedge issue of the...,Explicit
9,2. Inquiry about the reaction of Kyiv regardin...,"\nQ. Well, let me ask you about—you've spent l...",\nThe President. It's not a wedge issue of the...,Explicit
16,ensuring Finland that the U.S. will remain a r...,"\nQ. In Washington, a bipartisan group of Sena...",\nPresident Biden. I absolutely guarantee it. ...,Explicit
18,Concerns about the comments motivating Putin ...,"\nQ. Thank you, Mr. President. You've said tha...","\nPresident Biden. First of all, no one can jo...",Deflection
19,The risk of the war dragging on for years.,"\nQ. Thank you, Mr. President. You've said tha...","\nPresident Biden. First of all, no one can jo...",Dodging
...,...,...,...,...
3403,I wonder what your reaction is to that,\nQ. But the results are being interpreted as ...,"\nThe President. You know, I really haven't—I'...",Explicit
3415,Asking for an explanation of not knowing somet...,\nQ. How could you not know that and not be ou...,"\nThe President. You didn't know it, either.",Dodging
3416,Adjustments to the agenda regarding Social Sec...,"\nQ. Mr. President, you mentioned entitlements...","\nThe President. I told—Ken, I told Hank Pauls...",General
3430,Secretary Rumsfeld Accountability,"\nQ. When you first ran for President, sir, yo...","\nThe President. Peter, you're asking me why I...",Dodging


2) Multiple questions

In [None]:
df_questionmark_filtered = df_train[df_train['question'].str.count(r'\?') > 1]
df_questionmark_filtered

Unnamed: 0,question,interview_question,interview_answer,label
68,"For the 15,000 migrants that Canada will welc...","\nQ. Good afternoon, Mr. President. Good after...","\nPresident Biden. Well, no, I'm not disappoin...",Partial/half-answer
193,What is President Biden's message to the roug...,"\nQ. Thank you, Mr. President. Based on everyt...","\nPresident Biden. Well, I've had discussions—...",Dodging
218,How long should Americans expect to face highe...,\nQ. ——of inflation. Oil prices have been at a...,"\nThe President. Well, look, as you know, Ken,...",Deflection
261,"Why did you tell Jeff [Jeff Zeleny, CNN] that ...",\nQ. Right. We appreciate it. We very much do....,"\nThe President. Well, first of all, the messa...",Partial/half-answer
269,Have you decided who you will nominate to chai...,"\nQ. Well, I'm going to ask a very Bloomberg q...","\nThe President. No, no, and no. No, I'm not g...",Declining to answer
...,...,...,...,...
3274,"Concerning energy matters, 3 days before the c...","\nQ. Good morning. President Calderon, concern...",\nPresident Calderon. The truth of the matter ...,Explicit
3324,\n3. Minority opinion: Is the sentiment of que...,"\nQ. As you know, a growing number of troops a...",\nThe President. I am—what I hear from command...,Dodging
3345,Were the efforts of the Quartet weak or are th...,\nQ. I have a question to the President and th...,\nPresident Bush. I don't know if I'd call thi...,Partial/half-answer
3375,Did you make any representations to the Presid...,"\nQ. Mr. President, the memo from your Nationa...",\nPresident Bush. I will let the Prime Ministe...,Deflection


In [None]:
len(df_questionmark_filtered)

86
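
Both checks above count question marks, so they can be combined into a single pass that labels each question. This is a sketch with a hypothetical helper `question_type` and made-up sample questions; it uses only the standard library so that it runs without the dataset.

```python
from collections import Counter

def question_type(question):
    """Label a question by the number of question marks it contains."""
    n = question.count("?")
    if n == 0:
        return "no question mark"
    if n == 1:
        return "single question"
    return "multiple questions"

# Toy examples in the style of the rows shown above
questions = [
    "The risk of the war dragging on for years.",
    "I wonder what your reaction is to that",
    "Are you worried about the meeting?",
    "Were the efforts weak? Or are they enough?",
]
counts = Counter(question_type(q) for q in questions)
print(counts)
```

On the real data the same helper could be applied as `df_train['question'].map(question_type).value_counts()`.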