### This notebook standardizes the examples in the training set. The main issue I see is that `discourse_start` and `discourse_end` are often slightly off. By standardizing and cleaning the examples, we can hopefully create consistent training data that improves model performance.

#### If `file_text` is the text from the file, then we would expect `file_text[discourse_start:discourse_end]==discourse_text`, but in 44k rows this is not the case.

#### Here is an example where `file_text[discourse_start:discourse_end]!=discourse_text`
```
discourse_id = 1622992280991.0
discourse_text = "First, cell phones are a benefit and allows everyone to have access to a telephone at all times of the day.\n"
file_text[discourse_start:discourse_end]="rst, cell phones are a benefit and allows everyone to have access to a telephone at all times of the day.

K"
```
After it goes through my script: 
```
'First, cell phones are a benefit and allows everyone to have access to a telephone at all times of the day.\n'
```

#### Here is an example straight from the csv where there is a bit at the beginning that looks unnecessary:
```
discourse_id = 1622489430075.0
discourse_text = '. Drivers should not be able to use cell phones in any capacity while operating a motor vehicle. '
file_text[discourse_start:discourse_end] = '. Drivers should not be able to use cell phones in any capacity while operating a motor vehicle.\n'
```
After it goes through my script: 
```
'Drivers should not be able to use cell phones in any capacity while operating a motor vehicle.\n'
```

# Please look at the following discussions for more information about this topic!
- [Mystery Solved - Discrepancy Between PredictionString and DiscourseText](https://www.kaggle.com/c/feedback-prize-2021/discussion/297591)  
- [Additional Information from Competition Hosts (rubric, dataset, raters, etc.)](https://www.kaggle.com/c/feedback-prize-2021/discussion/297688)
- [Correcting the labels (Minor magic?)](https://www.kaggle.com/c/feedback-prize-2021/discussion/296778)


#### Looking closer, there are many instances of a single character being swapped for another. In 30k out of 44k instances, one has `"\n"` when the other has `" "` as the final character. That difference doesn't actually matter, but there are some other instances that have much worse alignment so this notebook will be my attempt at cleaning it up. 

#### After making these changes, about 16k `discourse_start` values change and 66k `discourse_end` values change. 😮

#### Please leave a comment if you have questions or suggestions! (Some of the cells take a few minutes. I tried to use the easy multi-processing capabilities of `datasets` to speed it up as much as possible)
<p style="font-size: 40px">😊</p>

## This section compares the text extracted using the `discourse_start` and `discourse_end` positions with `discourse_text`

The text might look the same because the only difference is " " and "\n", so below each example I indicate which index is different between the two of them and what characters they are. `[(87, ' ', '\n')]` means that the character at index 87 is different. In `discourse_text` it has a space and in the file text it has a newline character.
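To make the diff format concrete, here is a minimal sketch on a made-up pair of strings. The helper name `char_diffs` and the sample strings are only illustrative and do not come from the dataset.

```python
def char_diffs(a, b):
    """Return (index, char_in_a, char_in_b) for every position where two
    equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("char_diffs expects strings of equal length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Hypothetical example: the CSV text ends in a space, the file span in a newline.
csv_text = "Phones are useful. "
file_span = "Phones are useful.\n"
print(char_diffs(csv_text, file_span))  # [(18, ' ', '\n')]
```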

In [None]:
%%time

import re
import string

import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
from datasets import Dataset

df = pd.read_csv("../input/feedback-prize-2021/train.csv")

# for each row, grab the span of text from the file using discourse_start and discourse_end
def get_text_by_index(example):
    id_ = example["id"]
    start = example["discourse_start"]
    end = example["discourse_end"]
    with open(f"../input/feedback-prize-2021/train/{id_}.txt") as fp:
        file_text = fp.read()
    return {
        "text_by_index": file_text[int(start) : int(end)]
    }

id_ds = Dataset.from_pandas(df[["id", "discourse_start", "discourse_end"]])

text_ds = id_ds.map(get_text_by_index, num_proc=4)
df["text_by_index"] = text_ds["text_by_index"]

not_equal_texts = df[df["discourse_text"] != df["text_by_index"]]
print(f"There are {len(not_equal_texts)} that are not equal")

# Let's look at a few
discourse_texts = not_equal_texts["discourse_text"]
file_spans = not_equal_texts["text_by_index"]
discourse_ids = not_equal_texts["discourse_id"]

for counter, (discourse_text, file_span, discourse_id) in enumerate(
    zip(discourse_texts, file_spans, discourse_ids)
):
    if counter > 5:
        break

    if len(discourse_text) != len(file_span):
        continue

    print("discourse_id =", discourse_id)
    print("\n***discourse_text in train.csv***\n")
    print(discourse_text)
    print("\n"+"-" * 20)
    print("\n***Using discourse_start and discourse_end***\n")
    print(file_span)

    # Print index of character that differs between the two texts
    print(
        [
            (i, char1, char2)
            for i, (char1, char2) in enumerate(zip(discourse_text, file_span))
            if char1 != char2
        ]
    )

    print("\n" + "*" * 20 + "\n")

## At first glance it just looks like newlines and spaces are getting swapped. If we only look for instances with letters being swapped, let's see what comes out

In [None]:
counter = 0
discourse_texts = not_equal_texts["discourse_text"]
file_spans = not_equal_texts["text_by_index"]
discourse_ids = not_equal_texts["discourse_id"]

for discourse_text, file_span, discourse_id in zip(
    discourse_texts, file_spans, discourse_ids
):
    if counter >= 2:
        break

    if len(discourse_text) != len(file_span):
        continue

    # Print index of character that differs between the two texts
    diffs = [
        (i, char1, char2)
        for i, (char1, char2) in enumerate(zip(discourse_text, file_span))
        if char1 != char2
    ]

    if not diffs[0][1].isalpha():
        continue

    print("discourse_id =", discourse_id)
    print("\n***discourse_text in train.csv***\n")
    print(discourse_text)
    print("-" * 20)
    print("\n***Using discourse_start and discourse_end***\n")
    print(file_span)

    # Print index of difference in char
    print(diffs)

    print("\n" + "*" * 20 + "\n")
    counter += 1


## Ok now it looks like a few are misaligned by a few characters which makes it look like there are tons of characters that are different. Here is a counter of all the times the characters didn't align. Keep in mind that this includes the cases when one is a shifted version of another (like in `discourse_id=1622992466917.0`)
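As a rough illustration of why a small shift inflates the counts, the sketch below compares a made-up sentence with a copy of itself read two characters late. The strings are hypothetical; only the counting approach mirrors the notebook.

```python
from collections import Counter

# Hypothetical strings: `shifted` is the same sentence read two characters late.
source = "First, cell phones are a benefit to everyone."
expected = source[:-2]
shifted = source[2:]

# Count every (expected_char, shifted_char) pair that does not match.
pairs = Counter((a, b) for a, b in zip(expected, shifted) if a != b)
n_diff = sum(pairs.values())
print(f"{n_diff} of {len(expected)} positions differ")
```

Almost every position differs even though the two strings contain the same words, so the counts in the next cell overstate how many characters are genuinely wrong.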

In [None]:
from collections import Counter

all_diffs = []
for discourse_text, file_text in not_equal_texts[["discourse_text", "text_by_index"]].values:
    
    if len(discourse_text) != len(file_text):
        continue
        
    all_diffs.extend([(char1, char2) for char1, char2 in zip(discourse_text, file_text) if char1!=char2])

    
counter = Counter(all_diffs)

counter.most_common(20)

## My approach to fix the incorrect `discourse_start` and `discourse_end` values

The first step is to check if we can use the entire `discourse_text` to find the starting index in the file text.


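If a match with the entire `discourse_text` string is not found, I'll drop characters from the end of `discourse_text` (5 at a time) until the remaining prefix is found in the file text, and use that match as the starting point. If there are several matches, I take the one closest to the original `discourse_start`.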
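A minimal sketch of a fallback search of this kind, assuming the search string is shortened from the end in fixed steps (the step of 5 mirrors the code cell further down). The function name `find_with_trimming` and the sample texts are hypothetical.

```python
import re


def find_with_trimming(needle, haystack, step=5):
    """Search for `needle`; if it is absent, retry with `step` more characters
    removed from its end each time. Returns (matches, chars_trimmed)."""
    trimmed = 0
    while trimmed < len(needle):
        chunk = needle[: len(needle) - trimmed]
        matches = list(re.finditer(re.escape(chunk), haystack))
        if matches:
            return matches, trimmed
        trimmed += step
    return [], trimmed


# Hypothetical essay text and a label whose tail does not match the file.
essay = "Intro. Phones help people stay in touch.\nNext paragraph."
label = "Phones help people stay in touch!!"
matches, trimmed = find_with_trimming(label, essay)
print(matches[0].start(), trimmed)
```

Returning the number of trimmed characters makes it easy to flag rows where only a prefix could be located.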

If the span starts with punctuation and then text, I'll keep increasing the start index until it isn't whitespace or punctuation. This eliminates the examples that begin with a period or comma. 

If the span ends in whitespace, I'll keep it. This whitespace could be beneficial for the model. 

If the span does not end in whitespace and the character after the span is punctuation or whitespace, extend the span to include it. Extending it can hopefully add useful information and it also standardizes the examples to have trailing whitespace but not leading whitespace. Adding the punctuation or whitespace would ***not*** change the `predictionstring` but it ***would*** change how the NER labeling is done.
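The start and end rules above can be sketched on a toy string as follows. This is a simplified illustration, not the exact cell below: `adjust_span` and the sample essay are hypothetical, and `PUNCTUATION` uses the same three characters as the notebook.

```python
PUNCTUATION = set(".,;")


def adjust_span(text, start, end):
    """Apply the start/end rules described above to a [start, end) span."""
    # Skip leading whitespace and punctuation.
    while start < end and (text[start].isspace() or text[start] in PUNCTUATION):
        start += 1
    # If the span does not already end in whitespace, absorb one following
    # whitespace or punctuation character.
    if (
        start < end < len(text)
        and not text[end - 1].isspace()
        and (text[end].isspace() or text[end] in PUNCTUATION)
    ):
        end += 1
    return start, end


# Hypothetical essay: the labelled span starts with ". " and stops before "\n".
essay = "I agree. Drivers should not text\nwhile driving."
start = essay.index(". Drivers")
end = essay.index("\nwhile")
new_start, new_end = adjust_span(essay, start, end)
print(repr(essay[start:end]))
print(repr(essay[new_start:new_end]))
```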

In [None]:
%%time

PUNCTUATION = set(".,;")

def get_new_positions(examples):
    
    disc_ids = []
    new_starts = []
    new_ends = []
    new_texts = []
    
    for id_ in examples["id"]:
    
        with open(f"../input/feedback-prize-2021/train/{id_}.txt") as fp:
            file_text = fp.read()

        discourse_data = df[df["id"] == id_]

        discourse_ids = discourse_data["discourse_id"]
        discourse_texts = discourse_data["discourse_text"]
        discourse_starts = discourse_data["discourse_start"]
        for disc_id, disc_text, disc_start in zip(discourse_ids, discourse_texts, discourse_starts):
            disc_text = disc_text.strip()

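            # Try the full text first; if that fails, drop 5 characters at a
            # time from the end until the remaining prefix is found in the file.
            matches = list(re.finditer(re.escape(disc_text), file_text))
            offset = 5
            while not matches and offset < len(disc_text):
                chunk = disc_text[:-offset]
                matches = list(re.finditer(re.escape(chunk), file_text))
                offset += 5
            # Check the matches themselves: the offset can pass len(disc_text)
            # on the same iteration that finally finds a match.
            if not matches:
                print(f"Could not find substring in {disc_id}")
                continue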

            # There are some instances when there are multiple matches, 
            # so we'll take the closest one to the original discourse_start
            distances = [abs(disc_start-match.start()) for match in matches]

            idx = matches[np.argmin(distances)].start()                

            end_idx = idx + len(disc_text)

            # If the span starts with whitespace or punctuation, move the start forward
            while (idx < min(end_idx, len(file_text)) and
                   (file_text[idx].isspace() or file_text[idx] in PUNCTUATION)):
                idx += 1

            # If the span does not end in whitespace and the next character is
            # whitespace or punctuation, extend the span to include it
            if (end_idx < len(file_text) and
                not file_text[end_idx-1].isspace() and
                (file_text[end_idx].isspace() or file_text[end_idx] in PUNCTUATION)):
                end_idx += 1

            final_text = file_text[idx:end_idx]
            
            disc_ids.append(disc_id)
            new_starts.append(idx)
            new_ends.append(idx + len(final_text))
            new_texts.append(final_text)
            
    return {
        "discourse_id": disc_ids,
        "new_start": new_starts,
        "new_end": new_ends,
        "text_by_new_index": new_texts,
    }

# using Dataset will make it easy to do multi-processing        
dataset = Dataset.from_dict({"id": df["id"].unique()})   

results = dataset.map(get_new_positions, batched=True, num_proc=4, remove_columns=["id"])

In [None]:
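# These assignments are positional: they assume the rows of train.csv are grouped
# by essay id and that get_new_positions did not skip any row.
assert len(results) == len(df), "Some rows were skipped; align on discourse_id instead"
df["new_start"] = results["new_start"]
df["new_end"] = results["new_end"]
df["text_by_new_index"] = results["text_by_new_index"]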

# Let's check how many of these new spans of text don't match the original `discourse_text` values


In [None]:
new_not_equal_texts = df[df["discourse_text"]!=df["text_by_new_index"]].copy()
print(f"There are {new_not_equal_texts['id'].nunique()} files and {len(new_not_equal_texts)} rows with mismatched spans.")

## There are still many that don't match because I deleted some leading punctuation and added some trailing punctuation

#### NOTE: One row does not match because `discourse_text` did not have the PII masked for some reason (discourse_id = 1623258656795.0). 

In [None]:
# Compare a single character (index 2) of each stripped text. This is only a
# rough filter for rows that differ near the start, not an exact comparison.
old_text = new_not_equal_texts["discourse_text"].str.strip().str.slice(start=2, stop=3)
new_text = new_not_equal_texts["text_by_new_index"].str.strip().str.slice(start=2, stop=3)


char_unequal_mask = old_text!=new_text

unequal_texts = new_not_equal_texts[char_unequal_mask]

unequal_texts[["discourse_text", "text_by_new_index"]].sample(n=25).values

## Getting predictionstring values
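A `predictionstring` lists the indices of the whitespace-separated words that a span covers. Below is a minimal sketch of converting a character span into those indices on a toy essay; `span_to_predictionstring` and the sample text are hypothetical, and the mid-word adjustment follows the same idea as the cell below.

```python
def span_to_predictionstring(text, start, end):
    """Convert a character span [start, end) into space-separated word indices,
    where words are the items of text.split()."""
    first_word = len(text[:start].split())
    # If the span begins mid-word, that word was already counted in the prefix,
    # so step back by one.
    if start > 0 and not text[start].isspace() and not text[start - 1].isspace():
        first_word -= 1
    n_words = len(text[start:end].split())
    return " ".join(str(i) for i in range(first_word, first_word + n_words))


# Hypothetical essay: words are numbered 0..6 by text.split().
essay = "Phones are useful. Drivers should not text."
start = essay.index("Drivers")
print(span_to_predictionstring(essay, start, len(essay)))  # 3 4 5 6
```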

In [None]:
%%time

def find_pred_string(examples):
    
    new_pred_strings = []
    discourse_ids = []
    
    for id_ in examples["id"]:
        with open(f"../input/feedback-prize-2021/train/{id_}.txt") as fp:
            file_text = fp.read()

        discourse_data = df[df["id"] == id_]
        
        left_idxs = discourse_data["new_start"]
        right_idxs = discourse_data["new_end"]
        disc_ids = discourse_data["discourse_id"]
        
        for left_idx, right_idx, disc_id in zip(left_idxs, right_idxs, disc_ids):
            start_word_id = len(file_text[:left_idx].split())
            
            # If the span starts in the middle of a word (neither the first
            # character of the span nor the character before it is whitespace),
            # `len(file_text[:left_idx].split())` counts that partial word, so
            # the start word id must be reduced by 1.
            # ex: word__word___sp[an starts in the middle of a word]
            # `len(file_text[:left_idx].split())==3`, but the span starts in the
            # 3rd word, which is word_id=2
            if left_idx > 0 and file_text[left_idx].split() != [] and file_text[left_idx-1].split() != []:
                start_word_id -= 1
                
            end_word_id = start_word_id + len(file_text[left_idx:right_idx].split())
            
            new_pred_strings.append(" ".join(list(map(str, range(start_word_id, end_word_id)))))
            discourse_ids.append(disc_id)
            
            
    return {
        "new_predictionstring": new_pred_strings,
        "discourse_id": discourse_ids
    }
        

id_ds = Dataset.from_pandas(df[["id"]].drop_duplicates())
new_pred_string_ds = id_ds.map(find_pred_string, batched=True, num_proc=4, remove_columns=id_ds.column_names)

# How many failed to find a substring?

There should be an empty string if no intersection is found.

In [None]:
df["new_predictionstring"] = new_pred_string_ds["new_predictionstring"]
len([x for x in new_pred_string_ds["new_predictionstring"] if x == ""])

## Let's compare some new and old `predictionstring` values

In all the examples I looked at, the `new_predictionstring` values looked better

In [None]:
different_value_mask = df["new_predictionstring"] != df["predictionstring"]

for idx, row in df[different_value_mask].sample(n=5, random_state=18).iterrows():
    with open(f"../input/feedback-prize-2021/train/{row.id}.txt") as fp:
        words = fp.read().split()
    # Sets make the membership checks below cheap
    old_ids = set(map(int, row.predictionstring.split()))
    new_ids = set(map(int, row.new_predictionstring.split()))
    print("Old predictionstring=", row.predictionstring)
    print("New predictionstring=", row.new_predictionstring)
    print("words using old predictionstring=", [w for i, w in enumerate(words) if i in old_ids])
    print("words using new predictionstring=", [w for i, w in enumerate(words) if i in new_ids])
    print("discourse text=", row.text_by_new_index)
    print(f"start_idx/end_idx= {row.new_start}/{row.new_end}")
    print("discourse_id=",row.discourse_id, "\n")

## How many `discourse_start` and `discourse_end` values got modified?

In [None]:
print(sum(df["discourse_start"].astype(int) != df["new_start"]))
print(sum(df["discourse_end"].astype(int) != df["new_end"]))

## Looks pretty good to me!

#### But please let me know if there is something I missed because I don't want the new training set to be more detrimental than the original 😱

## Saving corrected information

New columns are:
- `text_by_index` (can ignore)
- `new_start` (replaces `discourse_start`)
- `new_end` (replaces `discourse_end`)
- `text_by_new_index` (replaces `discourse_text`)
- `new_predictionstring` (replaces `predictionstring`)

In [None]:
df.to_csv("corrected_train.csv", index=False)

### Hopefully this makes our models better! I know it can be confusing, so please comment and I'll do my best to answer your questions!

<p style="font-size: 40px">😊</p>