## Data cleaning and format conversion

The raw CSL dataset is stored as TSV files, and we want to convert it into the `DatasetDict` format.
Each line in the CSL dataset represents one paper, with its fields separated by '\t'. The first field is the task type, which in this case is "to title". So we need to drop the task type and split the rest of the line into the document and summary parts.
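
To make the line layout concrete, here is a minimal sketch of how one line breaks down. The sample line is a made-up English placeholder, not a row from CSL, and the variable names are chosen only for illustration.

```python
# Hypothetical sample line: task type, document, summary, separated by tabs.
sample_line = "to title\tAbstract text of the paper.\tTitle of the paper\n"

# Remove surrounding whitespace, then split on tabs.
fields = sample_line.strip().split("\t")

# A well-formed line has exactly three fields.
task_type, document, summary = fields

# The task type is discarded; only the document and summary are kept.
record = {"document": document, "summary": summary}
print(record)
```

Lines that do not split into exactly three fields would fail the unpacking here, which is why the reader function checks the field count before using a line.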

In [1]:
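def read_dataset(path):
    """Read a CSL TSV file and return its columns as a dict of lists."""
    documents = []
    summaries = []
    ids = []
    with open(path, mode="r", encoding="utf-8") as f:
        for line_id, line in enumerate(f):
            # Each line holds: task type, document, summary (tab-separated).
            fields = line.strip().split('\t')

            # Skip malformed lines; the task type (fields[0]) is dropped.
            if len(fields) == 3:
                documents.append(fields[1])
                summaries.append(fields[2])
                # The id is the line's index in the file, so skipped lines leave gaps.
                ids.append(line_id)

    data_cleaned = {
        "document": documents,
        "summary": summaries,
        "id": ids,
    }

    return data_cleaned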

data_test = read_dataset("/home/xxliu/Other projects/Title_generation/CSL/benchmark/ts/test.tsv")
data_val = read_dataset("/home/xxliu/Other projects/Title_generation/CSL/benchmark/ts/dev.tsv")
data_train = read_dataset("/home/xxliu/Other projects/Title_generation/CSL/benchmark/ts/train.tsv")

In [2]:
from datasets import Dataset, DatasetDict

train_dataset = Dataset.from_dict(data_train)
test_dataset = Dataset.from_dict(data_test)
val_dataset = Dataset.from_dict(data_val)

dataset_dict = DatasetDict({
    "train": train_dataset,
    "test": test_dataset, 
    "validation": val_dataset
})

In [3]:
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 1000
    })
})

In [4]:
dataset_dict.save_to_disk("./Paper")

Saving the dataset (0/1 shards):   0%|          | 0/8000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1000 [00:00<?, ? examples/s]

In [5]:
import datasets  
datasets.load_from_disk

<function datasets.load.load_from_disk(dataset_path: str, fs='deprecated', keep_in_memory: Optional[bool] = None, storage_options: Optional[dict] = None) -> Union[datasets.arrow_dataset.Dataset, datasets.dataset_dict.DatasetDict]>

In [6]:
dataset_dict["train"][0]

{'document': '太平天国占领区街市没有刻字铺，所有刻字匠人都编入镌刻营，“朝勋詹记”一印应为太平天国镌刻营所出。但它不属于太平天国礼部统一制发的印章。太平天国私人便章实物从无发现，“朝勋詹记”一印的发现，弥足珍贵，它为研究太平天国用印情况及制度提供了第一手重要实物资料。',
 'summary': '太平天国私人便章“朝勋詹记”考',
 'id': 0}

In [7]:
dataset_dict["validation"][0]

{'document': '采用修正的Rodrigues参数(MRP),建立了飞行器姿态控制系统的数学模型；利用刚体姿态动力学方程和运动学方程建立了带有转动惯量体坐标系各轴间耦合量的非线性控制系统模型；在镇定控制器设计中,用消去法建立了控制器与Lyapunov函数之间的关系,减少了在迭代算法求解过程中的迭代次数,提出了基于平方和(SOS)方法的一种新的设计方法.仿真结果表明,用这种方法设计镇定控制器简化了设计过程,并且控制器具有较快的响应速度和较好的收敛性.',
 'summary': '一种基于平方和优化的飞行器大角度机动镇定控制器设计方法',
 'id': 0}

In [8]:
dataset_dict["test"][0]

{'document': '双官能团活性艳蓝GN和RN在固色浴中凝聚性小、骤染性小、匀染性好,且吸尽率和固色率高、提升性和重现性好,较好地克服了常用单乙烯砜型活性艳蓝(C.I.B-19)的性能缺陷.该染料最适合70℃染色,与嫩黄Y-160或翠蓝B-21配伍拼染艳绿色或艳蓝色,可以大幅提高染色一等品率.',
 'summary': '双官能团活性艳蓝的应用性能',
 'id': 0}