## Introduction
This notebook preprocesses the SC-Ques dataset so that it can be used with our project. We aim to use this additional dataset to increase the size of our testing set. The dataset comes from the paper [SC-Ques: A Sentence Completion Question Dataset for English as a Second Language Learners](https://arxiv.org/abs/2206.12036). The authors provide a [link to their code and data](https://github.com/ai4ed/SC-Ques) for research purposes, which in turn links to a [Dropbox archive containing the data](https://www.dropbox.com/s/lzznin2hxt6rmft/SC-Ques.tar.gz?dl=0). The data we would use to expand our training set is in `test.jsons` and `train.jsons`. An example from `train.jsons` is included below.
```
{"stem": "The plane is scheduled to arrive ___ because of bad weather. ", "choice": {"A": "A.latest", "B": "B.later", "C": "C.late"}, "answer": "C", "choice_dict": {"choice_dict": {"A": "The plane is scheduled to arrive latest because of bad weather.", "B": "The plane is scheduled to arrive later because of bad weather.", "C": "The plane is scheduled to arrive late because of bad weather."}}}
```

From each record we extract the `stem` key for the question, the `answer` key for our `ans` column, and the nested `choice_dict` for our response options. Note that we will need to modify our code to allow for an arbitrarily sized choice dict. These questions also seem less complex than the SAT questions, though some of them have more blanks. There are 241195 records in `train.jsons` and 47953 records in `test.jsons`.
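
As a rough illustration of the extraction described above, the sketch below parses the example record with the standard `json` module. The helper name `extract_fields` is hypothetical and is not part of the SC-Ques code; it only assumes the record layout shown in the example, with the full-sentence options nested under `choice_dict` twice.

```python
import json

# The example record from train.jsons, split across lines for readability.
raw_line = (
    '{"stem": "The plane is scheduled to arrive ___ because of bad weather. ", '
    '"choice": {"A": "A.latest", "B": "B.later", "C": "C.late"}, '
    '"answer": "C", '
    '"choice_dict": {"choice_dict": {'
    '"A": "The plane is scheduled to arrive latest because of bad weather.", '
    '"B": "The plane is scheduled to arrive later because of bad weather.", '
    '"C": "The plane is scheduled to arrive late because of bad weather."}}}'
)


def extract_fields(line):
    """Hypothetical helper: pull out the question, answer key, and options."""
    record = json.loads(line)
    # The full-sentence options sit one level deeper than the key name suggests.
    options = record["choice_dict"]["choice_dict"]
    return {
        "question": record["stem"],
        "ans": record["answer"],
        # Keep every option that is present, however many there are.
        "options": dict(sorted(options.items())),
    }


fields = extract_fields(raw_line)
print(fields["ans"], len(fields["options"]))
print(fields["options"][fields["ans"]])
```

Because the options are kept as a dictionary rather than fixed columns, a record with four or more choices goes through the same code path.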

I am currently unsure what the CSV files are for, even after skimming the paper.

## Library Imports
This section imports the relevant libraries.

In [1]:
from pathlib import Path 
import pandas as pd
import numpy as np
import json
from tqdm.notebook import tqdm
from nltk import word_tokenize
tqdm.pandas()

## Method from Sentence Completion Notebook

In [2]:
def tokenize_question(question):
    sent = word_tokenize(question)
    start_of_sentence = "<sent>"
    end_of_sentence = "<\\sent>"
    sent.insert(0, start_of_sentence)
    sent.append(end_of_sentence)
    return sent

## Load the Original Data
This section loads the original data from `test.jsons` and `train.jsons`, the two JSON Lines files in the SC-Ques data (one JSON record per line). Both use the same format, so we join them into a single dataframe, which works better for our purposes.

In [3]:
def parse_raw_scques_record(line: str):
    record = json.loads(line)
    out = {}
    question = record['stem']
    ans = record['answer']
    # choice = record['choice']
    choice_dict = record['choice_dict']['choice_dict']
    for option in choice_dict.keys():
        out[option] = choice_dict[option]
    out['question'] = question
    out['ans'] = ans
    return out

# Build the paths from parts so they work on any OS, not just Windows.
train_path = Path("Datasets") / "SC-Ques" / "train.jsons"
test_path = Path("Datasets") / "SC-Ques" / "test.jsons"
records = []
for path in (train_path, test_path):
    print("Parsing %s" % str(path))
    # Assumes the files are UTF-8 encoded; the questions contain non-ASCII punctuation.
    with path.open(encoding="utf-8") as f:
        lines = f.readlines()
    for line in tqdm(lines):
        records.append(parse_raw_scques_record(line))

additional_data = pd.DataFrame.from_records(records)
print(len(additional_data))
additional_data.head()

Parsing Datasets\SC-Ques\train.jsons


  0%|          | 0/241195 [00:00<?, ?it/s]

Parsing Datasets\SC-Ques\test.jsons


  0%|          | 0/47953 [00:00<?, ?it/s]

289148


Unnamed: 0,A,B,C,question,ans,D
0,The plane is scheduled to arrive latest becaus...,The plane is scheduled to arrive later because...,The plane is scheduled to arrive late because ...,The plane is scheduled to arrive ___ because o...,C,
1,Because he was preparing food for tomorrow's p...,While he was preparing food for tomorrow's par...,"If he was preparing food for tomorrow's party,...",___ he was preparing food for tomorrow's part...,B,
2,I don't like the people who may get angry easily.,I don't like the people that may get angry eas...,I don't like the people which may get angry ea...,I don't like the people ___ may get angry easi...,D,I don't like the people both may get angry eas...
3,Stop making so much noise. It is comfortable t...,Stop making so much noise. It is relaxed to th...,Stop making so much noise. It is harmful to th...,Stop making so much noise. It is ___ to the sl...,C,
4,Charles Dickens write a lot of novels.,Charles Dickens wrote a lot of novels.,Charles Dickens writes a lot of novels.,Charles Dickens ___ a lot of novels.,B,
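
The loading step above reads each file fully into memory with `readlines()`. A minimal sketch of a streaming alternative is shown below, assuming one JSON record per line; `iter_records` is a hypothetical helper, and the two in-memory streams are made-up stand-ins for `train.jsons` and `test.jsons`, not real SC-Ques records.

```python
import io
import json


def iter_records(handle):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)


# Made-up records standing in for the two real files (assumption for the demo).
fake_train = io.StringIO('{"stem": "a ___", "answer": "A"}\n\n')
fake_test = io.StringIO('{"stem": "b ___", "answer": "B"}\n')

# Chain both streams into a single list, mirroring how the two files are joined.
records = [rec for handle in (fake_train, fake_test) for rec in iter_records(handle)]
print(len(records))
```

With real files, the `io.StringIO` objects would be replaced by open file handles; the generator then never holds more than one line at a time.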


## Count Blanks
This section counts the number of blanks in each question and stores that value in a new `blanks` column of the dataframe. It also changes the mask from three underscores to five, to conform to the format used in the SAT question dataset.

In [4]:
def count_num_blanks(question, mask="___"):
    question_tokens = tokenize_question(question)
    # question = question.replace(mask, "_____")
    num_blanks = 0
    for token in question_tokens:
        if token == mask:
            num_blanks += 1
    return num_blanks

additional_data['blanks'] = additional_data['question'].progress_apply(lambda x: count_num_blanks(x, mask="___"))
additional_data['question'] = additional_data['question'].progress_apply(lambda x: x.replace("___", "_____"))
additional_data

Unnamed: 0,A,B,C,question,ans,D,blanks
0,The plane is scheduled to arrive latest becaus...,The plane is scheduled to arrive later because...,The plane is scheduled to arrive late because ...,The plane is scheduled to arrive _____ because...,C,,1
1,Because he was preparing food for tomorrow's p...,While he was preparing food for tomorrow's par...,"If he was preparing food for tomorrow's party,...",_____ he was preparing food for tomorrow's pa...,B,,1
2,I don't like the people who may get angry easily.,I don't like the people that may get angry eas...,I don't like the people which may get angry ea...,I don't like the people _____ may get angry ea...,D,I don't like the people both may get angry eas...,1
3,Stop making so much noise. It is comfortable t...,Stop making so much noise. It is relaxed to th...,Stop making so much noise. It is harmful to th...,Stop making so much noise. It is _____ to the ...,C,,1
4,Charles Dickens write a lot of novels.,Charles Dickens wrote a lot of novels.,Charles Dickens writes a lot of novels.,Charles Dickens _____ a lot of novels.,B,,1
...,...,...,...,...,...,...,...
289143,—How many students are there in your school? —...,—How many students are there in your school? —...,—How many students are there in your school? —...,—How many students are there in your school? —...,B,—How many students are there in your school? —...,3
289144,"--- Did you find your dictionary yet？---Yes, I...","--- Have you found your dictionary yet？---Yes,...","--- Have you found your dictionary yet？---Yes,...",--- _____ you _____ your dictionary yet？---Yes...,B,"--- Did you find your dictionary yet？---Yes, I...",3
289145,"— Are you feeling any better today, young lady...","— Are you feeling any better today, young lady...","— Do you feel any better today, young lady? —Y...","— _____ you _____ any better today, young lady...",A,"— Do you feel any better today, young lady? —Y...",4
289146,"--- Did you find out your watch?--- No, I didn...","--- Have you found your watch?--- No, not yet .","--- Have you looked for your watch?--- No, I h...","--- _____ you _____ your watch?--- No, _____ .",B,"--- Did you find your watch?--- No, not yet .",3
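
The count in this section relies on `word_tokenize` emitting each `___` as its own token. As a possible cross-check, the sketch below counts runs of underscores directly with a regular expression. The names `count_blanks_regex` and `normalise_blanks` are hypothetical, and treating any run of at least three underscores as a single blank is an assumption about the data, not something stated by the SC-Ques authors.

```python
import re

# Assumption: a blank is any run of three or more underscores.
BLANK_PATTERN = re.compile(r"_{3,}")


def count_blanks_regex(question):
    """Count blanks without depending on how a tokenizer splits the text."""
    return len(BLANK_PATTERN.findall(question))


def normalise_blanks(question, mask="_____"):
    """Rewrite every blank to the five-underscore mask used by the SAT data."""
    return BLANK_PATTERN.sub(mask, question)


# Example modelled on one of the multi-blank questions shown above.
example = "--- ___ you ___ your watch?--- No, ___ ."
print(count_blanks_regex(example))
print(normalise_blanks(example))
```

Comparing this count against the `blanks` column would flag any question where the tokenizer attached a blank to neighbouring characters.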


## Remap Column Names
This section remaps the column names to match those used in the original SAT dataset. It also adds an empty `e)` column and lowercases all of the answers. This should allow the data to be used directly with our existing code.

In [5]:
additional_data['a)'] = additional_data['A']
additional_data['b)'] = additional_data['B']
additional_data['c)'] = additional_data['C']
additional_data['d)'] = additional_data['D']
additional_data['e)'] = np.nan
additional_data['ans'] = additional_data['ans'].progress_apply(lambda x: x.lower())
additional_data = additional_data.drop(columns=['A', 'B', 'C', 'D'])
additional_data

Unnamed: 0,question,ans,blanks,a),b),c),d),e)
0,The plane is scheduled to arrive _____ because...,c,1,The plane is scheduled to arrive latest becaus...,The plane is scheduled to arrive later because...,The plane is scheduled to arrive late because ...,,
1,_____ he was preparing food for tomorrow's pa...,b,1,Because he was preparing food for tomorrow's p...,While he was preparing food for tomorrow's par...,"If he was preparing food for tomorrow's party,...",,
2,I don't like the people _____ may get angry ea...,d,1,I don't like the people who may get angry easily.,I don't like the people that may get angry eas...,I don't like the people which may get angry ea...,I don't like the people both may get angry eas...,
3,Stop making so much noise. It is _____ to the ...,c,1,Stop making so much noise. It is comfortable t...,Stop making so much noise. It is relaxed to th...,Stop making so much noise. It is harmful to th...,,
4,Charles Dickens _____ a lot of novels.,b,1,Charles Dickens write a lot of novels.,Charles Dickens wrote a lot of novels.,Charles Dickens writes a lot of novels.,,
...,...,...,...,...,...,...,...,...
289143,—How many students are there in your school? —...,b,3,—How many students are there in your school? —...,—How many students are there in your school? —...,—How many students are there in your school? —...,—How many students are there in your school? —...,
289144,--- _____ you _____ your dictionary yet？---Yes...,b,3,"--- Did you find your dictionary yet？---Yes, I...","--- Have you found your dictionary yet？---Yes,...","--- Have you found your dictionary yet？---Yes,...","--- Did you find your dictionary yet？---Yes, I...",
289145,"— _____ you _____ any better today, young lady...",a,4,"— Are you feeling any better today, young lady...","— Are you feeling any better today, young lady...","— Do you feel any better today, young lady? —Y...","— Do you feel any better today, young lady? —Y...",
289146,"--- _____ you _____ your watch?--- No, _____ .",b,3,"--- Did you find out your watch?--- No, I didn...","--- Have you found your watch?--- No, not yet .","--- Have you looked for your watch?--- No, I h...","--- Did you find your watch?--- No, not yet .",


In [6]:
additional_data['question'].iloc[289145]

'— _____ you _____ any better today, young lady? —Yes, thank you, Doctor Mason. It _____ as much as it _____ yesterday.'
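
The remapping in this section can also be expressed per record, which makes it easy to check that every lowercased answer still points at a non-empty option. The sketch below is a plain-Python illustration under that assumption; `remap_record` and `answer_is_present` are hypothetical helpers, and `None` stands in for the `NaN` values used in the dataframe.

```python
OPTION_MAP = {"A": "a)", "B": "b)", "C": "c)", "D": "d)"}
TARGET_COLUMNS = ["a)", "b)", "c)", "d)", "e)"]


def remap_record(record):
    """Rename option keys to the SAT-style names and lowercase the answer."""
    remapped = {"question": record["question"], "ans": record["ans"].lower()}
    # Start with every target option empty, including the always-empty "e)".
    for column in TARGET_COLUMNS:
        remapped[column] = None
    # Copy across whichever source options the record actually has.
    for source, target in OPTION_MAP.items():
        if record.get(source) is not None:
            remapped[target] = record[source]
    return remapped


def answer_is_present(remapped):
    """Check that the answer letter refers to an option that is filled in."""
    return remapped.get(remapped["ans"] + ")") is not None


example = {
    "question": "Charles Dickens _____ a lot of novels.",
    "ans": "B",
    "A": "Charles Dickens write a lot of novels.",
    "B": "Charles Dickens wrote a lot of novels.",
    "C": "Charles Dickens writes a lot of novels.",
}
remapped = remap_record(example)
print(remapped["ans"], answer_is_present(remapped))
```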

## Save Data
This section saves the results of the data cleaning process. One of the main issues with the original data was that it was too large to store on GitHub. This section circumvents that problem by saving the data into twenty separate CSV files instead of one combined file, using an approach found on [Stack Overflow](https://stackoverflow.com/a/44502862).

In [7]:
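# Split into 20 roughly equal chunks and write each to its own CSV file.
# The output path is built from parts so it works on any OS, not just Windows.
output_dir = Path("Datasets") / "SC-Ques"
for idx, chunk in enumerate(np.array_split(additional_data, 20)):
    chunk.to_csv(output_dir / f"processed_data_{idx}.csv")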
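
To see how large each file ends up, the sketch below reproduces the sizing rule that `np.array_split` documents: for `l` rows split into `n` sections, the first `l % n` sections get `l // n + 1` rows and the rest get `l // n`. The function `chunk_sizes` is a hypothetical helper written only for this illustration.

```python
def chunk_sizes(n_rows, n_chunks):
    """Row counts per chunk, following the np.array_split sizing rule."""
    base, extra = divmod(n_rows, n_chunks)
    # The first `extra` chunks each take one additional row.
    return [base + 1 if i < extra else base for i in range(n_chunks)]


sizes = chunk_sizes(289148, 20)
print(sizes[0], sizes[-1], sum(sizes))  # 14458 14457 289148
```

For the 289148 records here, the first eight files hold 14458 rows each and the remaining twelve hold 14457.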