<a href="https://colab.research.google.com/github/Kontilenia/thesis/blob/main/data_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Loading libraries and dataset**

In [None]:
!pip install datasets


In [None]:
from datasets import load_dataset
import pandas as pd
import re
from typing import List
import requests
from bs4 import BeautifulSoup

**Main preprocessing procedure**

In [None]:
def pattern_cleaning(
    df: pd.DataFrame,
    exceptions: List[int]
    ) -> pd.DataFrame:
    """
    Clean four unwanted patterns from the dataset: question indexing,
    special characters, the speaker's name, and question descriptions.

    Arguments:
    df – Dataframe to be cleaned
    exceptions - list of row indexes where the description of the
    question is needed and must be kept

    Returns:
    df – Cleaned dataframe
    """

    r"""
    Regex explanation:

    ^ matches the start of the string
    (\d+\. Q\d+:|\d+\.|Part \d+:|Q\d+:|-) is a capturing group that
    matches one of the following:
        \d+\. Q\d+: : one or more digits followed by a period, a space,
        "Q", one or more digits, and a colon

        \d+\. : one or more digits followed by a period

        Part \d+: : the string "Part " followed by one or more digits
        and a colon

        Q\d+: : the string "Q" followed by one or more digits and a
        colon

        - : a leading hyphen

    The pattern does not consume the space that follows the prefix.
    """

    # 1) Remove indexing from questions
    index_pattern = r'^(\d+\. Q\d+:|\d+\.|Part \d+:|Q\d+:|-)'
    df['question'] = df['question'].str.replace(
        index_pattern,
        '',
        regex=True
        )

    # 2) Remove quotes and newline characters
    df['question'] = df['question'].str.replace(
        r'["\n]',
        '',
        regex=True
        )
    df['interview_answer'] = df['interview_answer'].str.replace(
        r'\n',
        '',
        regex=True
        )

    # 3) Remove first sentence from answer (indicates which president is
    # speaking)
    sentence_pattern = r'^[^.]+\.?'
    df['interview_answer'] = df['interview_answer'].str.replace(
        sentence_pattern,
        '',
        regex=True
        )

    # 4) Remove description from questions
    df.loc[~df.index.isin(exceptions), 'question'] = df.loc[
        ~df.index.isin(exceptions), 'question'].apply(
        lambda x: re.sub(r'^[^:]+: ', '', x))
    return df


def get_italic_sentences(url: str) -> list:
    """
    Fetch a page and collect the text of its italic elements. Request
    and parsing errors are caught and yield an empty list.

    Arguments:
    url - Link of the text

    Returns:
    List of unique italic sentences, excluding specific phrases
    """
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise exception for bad responses
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract text from the <div> with class "field-docs-content"
        div_content = soup.find('div', class_='field-docs-content')

        # Return an empty list if the div is not found
        if div_content is None:
            return []

        exception_list = {
            "The President.",
            "Q.",
            "Inaudible",
            "inaudible"
            }

        # Extract unique sentences from <i> or <em> tags, excluding
        # specific phrases
        italic_sentences = {
            i.get_text(strip=True)
            for i in div_content.find_all(['i', 'em'])
            }
        return [
            sentence
            for sentence in italic_sentences
            if sentence not in exception_list
            ]

    except (requests.RequestException, AttributeError) as e:
        print(f"Error retrieving or parsing {url}: {e}")
        return []


def clean_interview_answer(row: pd.Series, url_sentences: dict) -> str:
    """
    Remove unnecessary sentences from an interview_answer. Intended to
    be applied row by row with DataFrame.apply.

    Arguments:
    row: row of a dataframe
    url_sentences: dictionary mapping each url to the sentences to be
    removed from interview answers that come from that url

    Returns:
    Interview answer string with removed sentences
    """
    unique_sentences = url_sentences.get(row['url'], [])
    interview_answer = row['interview_answer']
    for sentence in unique_sentences:
        interview_answer = interview_answer.replace(sentence, '')
    return interview_answer


def remove_unrelated_text(df: pd.DataFrame) -> pd.DataFrame:
    """
    Function to remove italic sentences from the 'interview_answer' column.

    Arguments:
    df – Dataframe to be cleaned

    Returns:
    df – Cleaned dataframe
    """

    # Create a dictionary to store unique sentences for each URL
    url_sentences = {}

    # Collect the distinct URLs present in the dataframe
    unique_urls = df['url'].unique()

    # Get sentences for each URL (optionally use parallel processing for
    # speedup)
    for url in unique_urls:
        url_sentences[url] = get_italic_sentences(url)

    df['interview_answer'] = df.apply(
        lambda x: clean_interview_answer(x, url_sentences), axis=1)

    # Optional: Clean up whitespace after sentence removal
    df['interview_answer'] = df['interview_answer'].str.replace(
        r'\s+', ' ',
        regex=True
        ).str.strip()

    return df

def extra_labels(df: pd.DataFrame) -> pd.DataFrame:
    """
    Add inaudible and multiple-question labels to the dataset

    Arguments:
    df – Dataframe

    Returns:
    df – Labeled dataframe
    """
    df["inaudible"] = df['interview_answer'].str.contains(
        'inaudible', case=False)
    df["multiple_questions"] = df['question'].str.count(r'\?') > 1
    return df
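
The three anchored patterns used in `pattern_cleaning` can be tried in isolation. The following is a minimal sketch using plain `re` instead of pandas; the input strings are invented for illustration and are not taken from the dataset.

```python
import re

# Patterns copied from pattern_cleaning
index_pattern = r'^(\d+\. Q\d+:|\d+\.|Part \d+:|Q\d+:|-)'
sentence_pattern = r'^[^.]+\.?'
description_pattern = r'^[^:]+: '

# Hypothetical questions, one per alternative of index_pattern
questions = [
    "1. Q2: Why now?",
    "3. What is the plan?",
    "Part 2: Next steps?",
    "Q1: Any comment?",
    "- Will you sign it?",
]
stripped = [re.sub(index_pattern, '', q) for q in questions]
print(stripped)

# Step 3: everything up to and including the first period is dropped
with_speaker = re.sub(sentence_pattern, '', "The President. It's not a wedge issue.")
# An answer without any period is removed entirely
no_period = re.sub(sentence_pattern, '', "No comment")
print(repr(with_speaker), repr(no_period))

# Step 4: text up to the first ": " is treated as a description
no_description = re.sub(description_pattern, '', "Economy: What about inflation?")
print(no_description)
```

Two details are visible here: the index pattern leaves the space after the prefix in place, and the speaker pattern empties an answer that contains no period at all.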

In [None]:
# Load train dataset
ds = load_dataset("ailsntua/QEvasion")

# Convert to pandas and keep only useful columns
df_train = ds["train"].to_pandas()[["question","interview_question",
                                    "interview_answer", "label","url"]]

# Remove unwanted patterns
exception_list = [142,493,699,809,1052,1053,1446,
                  2417,2631,2821,3181,3390]
df_train = pattern_cleaning(df_train, exception_list)

# Extract noise from the end of interview answer
df_train = remove_unrelated_text(df_train)

# Add 2 more labels for multiple questions and inaudible speech
df_train = extra_labels(df_train)

df_train.to_csv('output.csv', index=False)
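
`remove_unrelated_text` requests each URL one after another, and its comment mentions parallel processing as an optional speedup. One way this could look is sketched below with a thread pool. The helper `fetch_all`, the stand-in `fake_fetch`, the `example.org` URLs, and the `max_workers` value are all assumptions made for this sketch; in the notebook the fetcher would be `get_italic_sentences`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], List[str]],
    max_workers: int = 8,  # arbitrary choice; keep it low to be polite to the server
) -> Dict[str, List[str]]:
    """Apply `fetch` to every URL concurrently and map each URL to its result."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result
        results = list(pool.map(fetch, urls))
    return dict(zip(urls, results))


def fake_fetch(url: str) -> List[str]:
    """Offline stand-in for get_italic_sentences (hypothetical)."""
    return [f"italic text from {url}"]


url_sentences = fetch_all(
    ["https://example.org/a", "https://example.org/b"],
    fake_fetch,
)
print(url_sentences)
```

Threads suit this step because the work is dominated by waiting on network responses rather than by computation.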


---

**Exploring unhandled affirmative questions**

In [None]:
df_train = pd.read_csv('output.csv')

In [None]:
filtered_no_quest_df = df_train[~df_train['question'].str.contains(r'\?')]
len(filtered_no_quest_df)

772

In [None]:
len(df_train[df_train['question'].str.contains(r'\?')])
len(df_train)

3448
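
A notebook cell displays only its last expression, so the count of questions that do contain a question mark is not shown above. The sketch below reports both counts and the share in one step on a toy frame whose rows are illustrative stand-ins for `df_train`, then repeats the arithmetic with the two counts shown above.

```python
import pandas as pd

# Toy frame standing in for df_train; rows are for illustration only
toy = pd.DataFrame({"question": [
    "Will you sign the bill?",
    "The risk of the war dragging on for years.",
    "Why now? And what comes next?",
    "Secretary Rumsfeld Accountability",
]})

# Raw string so that \? reaches the regex engine unchanged
has_qmark = toy["question"].str.contains(r"\?")
summary = {
    "with_qmark": int(has_qmark.sum()),
    "without_qmark": int((~has_qmark).sum()),
    "share_without": float((~has_qmark).mean()),
}
print(summary)

# Share of rows without a question mark, from the counts reported above
share = round(772 / 3448, 3)
print(share)
```

With the reported counts, roughly 22% of the questions lack a question mark. Assuming the `question` column has no missing values, the remaining 3448 - 772 = 2676 rows contain at least one.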

In [None]:
filtered_no_quest_df

| Unnamed: 0 | question | interview_question | interview_answer | label |
|---|---|---|---|---|
| 8 | 1. Concerns about the lack of communication be... | \nQ. Well, let me ask you about—you've spent l... | \nThe President. It's not a wedge issue of the... | Explicit |
| 9 | 2. Inquiry about the reaction of Kyiv regardin... | \nQ. Well, let me ask you about—you've spent l... | \nThe President. It's not a wedge issue of the... | Explicit |
| 16 | ensuring Finland that the U.S. will remain a r... | \nQ. In Washington, a bipartisan group of Sena... | \nPresident Biden. I absolutely guarantee it. ... | Explicit |
| 18 | Concerns about the comments motivating Putin ... | \nQ. Thank you, Mr. President. You've said tha... | \nPresident Biden. First of all, no one can jo... | Deflection |
| 19 | The risk of the war dragging on for years. | \nQ. Thank you, Mr. President. You've said tha... | \nPresident Biden. First of all, no one can jo... | Dodging |
| ... | ... | ... | ... | ... |
| 3403 | I wonder what your reaction is to that | \nQ. But the results are being interpreted as ... | \nThe President. You know, I really haven't—I'... | Explicit |
| 3415 | Asking for an explanation of not knowing somet... | \nQ. How could you not know that and not be ou... | \nThe President. You didn't know it, either. | Dodging |
| 3416 | Adjustments to the agenda regarding Social Sec... | \nQ. Mr. President, you mentioned entitlements... | \nThe President. I told—Ken, I told Hank Pauls... | General |
| 3430 | Secretary Rumsfeld Accountability | \nQ. When you first ran for President, sir, yo... | \nThe President. Peter, you're asking me why I... | Dodging |
