### Downloading the Dataset from Hugging Face

In [1]:
from datasets import load_dataset
import pandas as pd
import numpy as np

dataset = load_dataset("multi_x_science_sum")

Downloading builder script: 100%|██████████| 3.94k/3.94k [00:00<00:00, 18.8MB/s]
Downloading metadata: 100%|██████████| 2.44k/2.44k [00:00<00:00, 8.77MB/s]
Downloading readme: 100%|██████████| 6.39k/6.39k [00:00<00:00, 20.4MB/s]
Downloading data: 100%|██████████| 46.1M/46.1M [00:35<00:00, 1.30MB/s]
Downloading data: 100%|██████████| 7.60M/7.60M [00:04<00:00, 1.86MB/s]
Downloading data: 100%|██████████| 7.68M/7.68M [00:03<00:00, 2.27MB/s]
Generating train split: 100%|██████████| 30369/30369 [00:01<00:00, 22114.20 examples/s]
Generating test split: 100%|██████████| 5093/5093 [00:00<00:00, 26650.19 examples/s]
Generating validation split: 100%|██████████| 5066/5066 [00:00<00:00, 23066.52 examples/s]


In [49]:
#converting the train, validation and test sets to pandas dataframes
train_df = pd.DataFrame({'summary': dataset['train']['abstract'], 'ref_abstract': dataset['train']['ref_abstract']})
validation_df = pd.DataFrame({'summary': dataset['validation']['abstract'], 'ref_abstract': dataset['validation']['ref_abstract']})
test_df = pd.DataFrame({'summary': dataset['test']['abstract'], 'ref_abstract': dataset['test']['ref_abstract']})

train_df['cite_N'] = train_df['ref_abstract'].apply(lambda x: x['cite_N'])
train_df['abstracts'] = train_df['ref_abstract'].apply(lambda x: x['abstract'])
train_df.drop('ref_abstract', axis=1, inplace=True)

validation_df['cite_N'] = validation_df['ref_abstract'].apply(lambda x: x['cite_N'])
validation_df['abstracts'] = validation_df['ref_abstract'].apply(lambda x: x['abstract'])
validation_df.drop('ref_abstract', axis=1, inplace=True)

test_df['cite_N'] = test_df['ref_abstract'].apply(lambda x: x['cite_N'])
test_df['abstracts'] = test_df['ref_abstract'].apply(lambda x: x['abstract'])
test_df.drop('ref_abstract', axis=1, inplace=True)
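
The three blocks above repeat the same flattening steps for each split. As a minimal sketch, the steps could be wrapped in one helper. The name `split_to_frame` and the toy values below are assumptions made for illustration, not part of the original notebook; the helper only assumes a mapping with an `abstract` field and a `ref_abstract` field holding `cite_N` and `abstract` lists, as used above.

```python
import pandas as pd

def split_to_frame(split):
    """Flatten one split into summary / cite_N / abstracts columns (hypothetical helper)."""
    refs = split["ref_abstract"]
    return pd.DataFrame({
        "summary": split["abstract"],
        "cite_N": [ref["cite_N"] for ref in refs],
        "abstracts": [ref["abstract"] for ref in refs],
    })

# Toy stand-in for one dataset split; the values are made up for illustration.
toy_split = {
    "abstract": ["summary of paper A", "summary of paper B"],
    "ref_abstract": [
        {"cite_N": ["@cite_1", "@cite_2"], "abstract": ["ref A1", "ref A2"]},
        {"cite_N": ["@cite_3"], "abstract": ["ref B1"]},
    ],
}

toy_df = split_to_frame(toy_split)
print(toy_df.columns.tolist())
```

With the real data, the same helper would presumably be called as `split_to_frame(dataset['train'])`, and likewise for the validation and test splits.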

In [50]:
#original sizes of the train, validation and test sets
print(len(train_df), len(validation_df), len(test_df))

30369 5066 5093


In [51]:
def remove_empty_strings(lst):
    return [item for item in lst if item.strip()]

train_df['abstracts'] = train_df['abstracts'].apply(remove_empty_strings)
validation_df['abstracts'] = validation_df['abstracts'].apply(remove_empty_strings)
test_df['abstracts'] = test_df['abstracts'].apply(remove_empty_strings)
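
To make the effect of `remove_empty_strings` concrete, here is a small example on a toy list (the strings are invented for illustration). Entries that are empty or contain only whitespace are dropped; everything else is kept in order.

```python
def remove_empty_strings(lst):
    # Keep only items that contain at least one non-whitespace character.
    return [item for item in lst if item.strip()]

# Toy input: two real abstracts mixed with an empty and a whitespace-only entry.
toy_abstracts = ["First abstract.", "", "   ", "Second abstract."]
cleaned = remove_empty_strings(toy_abstracts)
print(cleaned)  # ['First abstract.', 'Second abstract.']
```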

### Filtering to the suitable format

In [52]:
#dropping the rows that have exactly one reference abstract
train_df = train_df[train_df['abstracts'].apply(len) != 1]
train_df.reset_index(drop=True, inplace=True) #reset the index

validation_df = validation_df[validation_df['abstracts'].apply(len) != 1]
validation_df.reset_index(drop=True, inplace=True) #reset the index

test_df = test_df[test_df['abstracts'].apply(len) != 1]
test_df.reset_index(drop=True, inplace=True) #reset the index

#new sizes of the train, validation and test sets
print(len(train_df), len(validation_df), len(test_df))
print(train_df['abstracts'][0])
print(len(train_df['abstracts'][0]))
print(train_df['abstracts'][1])
print(len(train_df['abstracts'][1]))
print(train_df['abstracts'][2])
print(len(train_df['abstracts'][2]))
print(train_df['abstracts'][3])
print(len(train_df['abstracts'][3]))

22555 3773 3785
['This note is a sequel to our earlier paper of the same title [4] and describes invariants of rational homology 3-spheres associated to acyclic orthogonal local systems. Our work is in the spirit of the Axelrod–Singer papers [1], generalizes some of their results, and furnishes a new setting for the purely topological implications of their work.', 'Recently, Mullins calculated the Casson-Walker invariant of the 2-fold cyclic branched cover of an oriented link in S^3 in terms of its Jones polynomial and its signature, under the assumption that the 2-fold branched cover is a rational homology 3-sphere. Using elementary principles, we provide a similar calculation for the general case. In addition, we calculate the LMO invariant of the p-fold branched cover of twisted knots in S^3 in terms of the Kontsevich integral of the knot.']
2
["Despite the apparent success of the Java Virtual Machine, its lackluster performance makes it ill-suited for many speed-critical applicatio

### Storing original files

In [53]:
#store train, validation and test in csv files
train_df.to_csv('train.csv', index=False)
validation_df.to_csv('validation.csv', index=False)
test_df.to_csv('test.csv', index=False)

### Adding Random Abstracts and Creating New Files

In [54]:
import random
import ast

random_train = []
random_validation = []
random_test = []

for i in train_df['abstracts']:
    random_train.append(random.choice(i))
    
for i in validation_df['abstracts']:
    random_validation.append(random.choice(i))
    
for i in test_df['abstracts']:
    random_test.append(random.choice(i))


In [55]:
modified_train_rows = []
modified_validation_rows = []
modified_test_rows = []

for _, row in train_df.iterrows():
    row_index = row.name #get the index of the row
    valid_indices = [i for i in range(len(random_train)) if i != row_index]
    random_indices = random.sample(valid_indices, 2)
    # Extract the corresponding abstracts from the random_train list
    random_abstracts = [random_train[i] for i in random_indices]
    # Copy the list so the extend below does not mutate train_df['abstracts'] in place
    abstract_list = list(row['abstracts'])
    # print(abstract_list)
    abstract_list.extend(random_abstracts)
    modified_train_rows.append({'abstracts': abstract_list, 'summary': row['summary'], 'num_abstracts': len(abstract_list)})
    # print(random_indices)
    # print(abstract_list)
    
print(len(modified_train_rows))

for _, row in validation_df.iterrows():
    row_index = row.name #get the index of the row
    valid_indices = [i for i in range(len(random_validation)) if i != row_index]
    random_indices = random.sample(valid_indices, 2)
    # Extract the corresponding abstracts from the random_validation list
    random_abstracts = [random_validation[i] for i in random_indices]
    # Copy the list so the extend below does not mutate validation_df['abstracts'] in place
    abstract_list = list(row['abstracts'])
    abstract_list.extend(random_abstracts)
    modified_validation_rows.append({'abstracts': abstract_list, 'summary': row['summary'], 'num_abstracts': len(abstract_list)})
    
print(len(modified_validation_rows))

for _, row in test_df.iterrows():
    row_index = row.name #get the index of the row
    valid_indices = [i for i in range(len(random_test)) if i != row_index]
    random_indices = random.sample(valid_indices, 2)
    # Extract the corresponding abstracts from the random_test list
    random_abstracts = [random_test[i] for i in random_indices]
    # Copy the list so the extend below does not mutate test_df['abstracts'] in place
    abstract_list = list(row['abstracts'])
    abstract_list.extend(random_abstracts)
    modified_test_rows.append({'abstracts': abstract_list, 'summary': row['summary'], 'num_abstracts': len(abstract_list)})
    
print(len(modified_test_rows))
    
modified_train_df = pd.DataFrame(modified_train_rows)
modified_validation_df = pd.DataFrame(modified_validation_rows)
modified_test_df = pd.DataFrame(modified_test_rows)
    

22555
3773
3785
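
The loops above rebuild a list of all other row indices for every row, which grows quadratically with the split size. A minimal sketch of an alternative is shown below: it draws from `n - 1` positions and shifts any pick at or after the current row by one, so the row's own pool entry is never selected. The function name `add_distractors`, its parameters, and the toy data are assumptions for illustration, not the notebook's own code.

```python
import random

def add_distractors(abstract_lists, pool, k=2, seed=None):
    """Return new lists, each extended with k pool entries taken from other rows."""
    rng = random.Random(seed)
    n = len(pool)
    augmented = []
    for row_index, abstracts in enumerate(abstract_lists):
        # Sample k distinct positions out of n - 1, then skip over row_index.
        picks = rng.sample(range(n - 1), k)
        picks = [p + 1 if p >= row_index else p for p in picks]
        # Build a new list so the input lists are left untouched.
        augmented.append(list(abstracts) + [pool[p] for p in picks])
    return augmented

# Toy data: pool holds one abstract per row, in the same role as random_train.
toy_lists = [["a1", "a2"], ["b1", "b2", "b3"], ["c1", "c2"], ["d1", "d2"]]
toy_pool = ["a1", "b3", "c2", "d1"]

result = add_distractors(toy_lists, toy_pool, k=2, seed=0)
print([len(row) for row in result])  # [4, 5, 4, 4]
```

Passing a `seed` makes the draw repeatable, which the cells above do not do because they use the module-level `random` functions without seeding.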


In [56]:
#save the modified DataFrames as CSV files
modified_train_df.to_csv('modified_train.csv', index=False)
modified_validation_df.to_csv('modified_validation.csv', index=False)
modified_test_df.to_csv('modified_test.csv', index=False)
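
One thing to keep in mind when reloading these files: `to_csv` writes each list in the `abstracts` column as its string representation, so `pd.read_csv` returns strings rather than lists. The sketch below shows one way to recover the lists with `ast.literal_eval`; it uses a toy DataFrame and an in-memory buffer instead of the real files, and those toy values are assumptions for illustration.

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column, shaped like the modified DataFrames.
toy_df = pd.DataFrame({
    "abstracts": [["first abstract", "second abstract"], ["third abstract"]],
    "summary": ["summary one", "summary two"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)

# A plain read gives back the list as a string.
buffer.seek(0)
raw_df = pd.read_csv(buffer)
print(type(raw_df["abstracts"][0]))  # <class 'str'>

# Parsing the column with ast.literal_eval restores real Python lists.
buffer.seek(0)
parsed_df = pd.read_csv(buffer, converters={"abstracts": ast.literal_eval})
print(parsed_df["abstracts"].apply(len).tolist())  # [2, 1]
```

For the real files, the same `converters` argument would presumably be passed when reading `modified_train.csv` and the other CSVs.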

In [60]:
num = 3
print(modified_train_df['abstracts'][num])
print(len(modified_train_df['abstracts'][num]))

['This paper describes the motivations and strategies behind our group’s efforts to integrate the Tcl and Java programming languages. From the Java perspective, we wish to create a powerful scripting solution for Java applications and operating environments. From the Tcl perspective, we want to allow for cross-platform Tcl extensions and leverage the useful features and user community Java has to offer. We are specifically focusing on Java tasks like Java Bean manipulation, where a scripting solution is preferable to using straight Java code. Our goal is to create a synergy between Tcl and Java, similar to that of Visual Basic and Visual C++ on the Microsoft desktop, which makes both languages more powerful together than they are individually.', "A mechanical brake actuator includes a manual lever which is self-locking in the active braking position. In such position, the lever and associated cable means applies tension to a spring whose force is applied to the plunger of a hydraulic m

### Generating Sample Files

In [61]:
#taking a subset of rows for each dataset
sample_train_df = modified_train_df.sample(n=500, random_state=42)
sample_validation_df = modified_validation_df.sample(n=250, random_state=42)
sample_test_df = modified_test_df.sample(n=250, random_state=42)

#save the sample DataFrames as CSV files
sample_train_df.to_csv('sample_train.csv', index=False)
sample_validation_df.to_csv('sample_validation.csv', index=False)
sample_test_df.to_csv('sample_test.csv', index=False)

print(len(sample_train_df), len(sample_validation_df), len(sample_test_df))

500 250 250


In [62]:
print(sample_train_df.head())

                                               abstracts  \
15526  [Erasure codes, such as Reed-Solomon (RS) code...   
14187  [Most modern convolutional neural networks (CN...   
8056   [We describe a technique for building hash ind...   
5523   [The rapid urban expansion has greatly extende...   
14989  [We show that if a connected graph with n node...   

                                                 summary  num_abstracts  
15526  Erasure codes offer an efficient way to decrea...              6  
14187  This paper proposes a new method, that we call...              5  
8056   Similarity-preserving hashing is a widely-used...              4  
5523   In this paper, we address the problem of perso...              4  
14989  We perform a thorough study of various charact...              8  
