This repository contains code for downloading the datasets used in our experiments.

# Downloading the CS Dataset from S2ORC
_We do not own this dataset._

The dataset comes from the [S2ORC project](https://github.com/allenai/s2orc), maintained by Allen AI.
<br/>
Please refer to their repository for access and licensing terms.

---


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
"""
Example of how to download and process a single batch of S2ORC, keeping only papers from one field of study.
Useful when the full dataset does not fit on disk.
Adapt the filter to your own field of study.


Creates this directory structure:

|-- metadata/
    |-- raw/
        |-- metadata_0.jsonl.gz      << input; deleted after processing
|-- pdf_parses/
    |-- raw/
        |-- pdf_parses_0.jsonl.gz    << input; deleted after processing
|-- /content/drive/MyDrive/CsDataset/
    |-- metadata/
        |-- computer science/        << filtered metadata, one uncompressed file per shard
    |-- pdf_parses/
        |-- computer science/        << filtered full text, one uncompressed file per shard

"""


import os
import subprocess
import gzip
import io
import json
from tqdm import tqdm


# process single batch
def process_batch(batch: dict):
    # download the metadata and full-text files for this shard;
    # check=True raises CalledProcessError if wget fails, instead of continuing with an empty file
    cmd = ["wget", "-O", batch['input_metadata_path'], batch['input_metadata_url']]
    subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True)

    cmd = ["wget", "-O", batch['input_pdf_parses_path'], batch['input_pdf_parses_url']]
    subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True)

    # first, let's filter metadata JSONL to only papers with a particular field of study.
    # we also want to remember which paper IDs to keep, so that we can get their full text later.
    paper_ids_to_keep = set()
    with gzip.open(batch['input_metadata_path'], 'rb') as gz, open(batch['output_metadata_path'], 'wb') as f_out:
        f = io.BufferedReader(gz)
        for line in tqdm(f):    # stream line by line instead of loading the whole shard into memory
            metadata_dict = json.loads(line)
            paper_id = metadata_dict['paper_id']
            mag_field_of_study = metadata_dict['mag_field_of_study']
            if mag_field_of_study and 'Computer Science' in mag_field_of_study:     # TODO: <<< change this to your filter
                paper_ids_to_keep.add(paper_id)
                f_out.write(line)

    # now, we get those papers' full text
    with gzip.open(batch['input_pdf_parses_path'], 'rb') as gz, open(batch['output_pdf_parses_path'], 'wb') as f_out:
        f = io.BufferedReader(gz)
        for line in tqdm(f):    # stream line by line instead of loading the whole shard into memory
            pdf_parse_dict = json.loads(line)
            paper_id = pdf_parse_dict['paper_id']
            if paper_id in paper_ids_to_keep:
                f_out.write(line)

    # now delete the raw files to clear up space for other shards
    os.remove(batch['input_metadata_path'])
    os.remove(batch['input_pdf_parses_path'])


if __name__ == '__main__':

    METADATA_INPUT_DIR = 'metadata/raw/'
    METADATA_OUTPUT_DIR = '/content/drive/MyDrive/CsDataset/metadata/computer science/'
    PDF_PARSES_INPUT_DIR = 'pdf_parses/raw/'
    PDF_PARSES_OUTPUT_DIR = '/content/drive/MyDrive/CsDataset/pdf_parses/computer science/'

    os.makedirs(METADATA_INPUT_DIR, exist_ok=True)
    os.makedirs(METADATA_OUTPUT_DIR, exist_ok=True)
    os.makedirs(PDF_PARSES_INPUT_DIR, exist_ok=True)
    os.makedirs(PDF_PARSES_OUTPUT_DIR, exist_ok=True)

    # TODO: paste the shard download links you received from the S2ORC maintainers here.
    # There are 100 shards with IDs 0 to 99; make sure each metadata link is paired with its pdf_parses link.
    download_linkss = [
        {
            "metadata": "Add Shard Link here ",
            "pdf_parses": "Add Shard Link here "
        },
        {
            "metadata": "Add Shard Link here ",
            "pdf_parses": "Add Shard Link here "
        },
    ]

    # turn these into batches of work
    # TODO: feel free to come up with your own naming convention for 'input_{metadata|pdf_parses}_path'
    batches = [{
        'input_metadata_url': download_links['metadata'],
        'input_metadata_path': os.path.join(METADATA_INPUT_DIR,
                                            os.path.basename(download_links['metadata'].split('?')[0])),
        'output_metadata_path': os.path.join(METADATA_OUTPUT_DIR,
                                             os.path.basename(download_links['metadata'].split('?')[0])),
        'input_pdf_parses_url': download_links['pdf_parses'],
        'input_pdf_parses_path': os.path.join(PDF_PARSES_INPUT_DIR,
                                              os.path.basename(download_links['pdf_parses'].split('?')[0])),
        'output_pdf_parses_path': os.path.join(PDF_PARSES_OUTPUT_DIR,
                                               os.path.basename(download_links['pdf_parses'].split('?')[0])),
    } for download_links in download_linkss]

    for batch in batches:
        process_batch(batch=batch)
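
The filtering step inside `process_batch` can be tried on its own before any real shard is downloaded. The sketch below is a minimal, self-contained version of that step. It assumes, as the code above does, that each line of a gzipped JSONL file is a JSON object with `paper_id` and `mag_field_of_study` keys. The helper name `filter_by_field`, the file names, and the sample records are hypothetical and only serve the illustration.

```python
import gzip
import json
import os
import tempfile


def filter_by_field(src_gz_path, dst_path, field):
    """Copy records whose mag_field_of_study contains `field`; return the kept paper IDs."""
    kept_ids = set()
    with gzip.open(src_gz_path, 'rb') as f_in, open(dst_path, 'wb') as f_out:
        for line in f_in:  # streams one record at a time
            record = json.loads(line)
            fields = record.get('mag_field_of_study')
            if fields and field in fields:
                kept_ids.add(record['paper_id'])
                f_out.write(line)
    return kept_ids


# Hypothetical sample records, not taken from S2ORC.
sample_records = [
    {'paper_id': '1', 'mag_field_of_study': ['Computer Science']},
    {'paper_id': '2', 'mag_field_of_study': ['Medicine']},
    {'paper_id': '3', 'mag_field_of_study': None},
]

tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, 'sample.jsonl.gz')
dst_path = os.path.join(tmp_dir, 'sample_cs.jsonl')
with gzip.open(src_path, 'wb') as f:
    for record in sample_records:
        f.write((json.dumps(record) + '\n').encode('utf-8'))

kept = filter_by_field(src_path, dst_path, 'Computer Science')
print(kept)
```

To target another field of study, pass a different `field` value. The returned ID set plays the role of `paper_ids_to_keep` when filtering the matching `pdf_parses` shard.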

# Downloading the Amazon Review Dataset (2018)

_We do not own this dataset._

This dataset is an updated version of the [Amazon review dataset](http://jmcauley.ucsd.edu/data/amazon/index_2014.html) released in 2014.


**We use the 5-core version of the dataset.**

The dataset was created by [Jianmo Ni, UCSD](https://nijianmo.github.io/amazon/index.html#subsets).

Please cite the following paper if you use the data in any way:

Justifying recommendations using distantly-labeled reviews and fine-grained aspects
Jianmo Ni, Jiacheng Li, Julian McAuley
Empirical Methods in Natural Language Processing (EMNLP), 2019


<br/>
Please refer to the dataset page for access and licensing terms.

---

The list of URLs can be found [here](https://nijianmo.github.io/amazon/index.html).

We have created a simplified file of these links [here]().

In [None]:
with open('amazon5corelinks.txt', 'r') as f:
    # split on whitespace so each URL becomes one list entry
    # (assumes the file holds whitespace- or newline-separated URLs)
    listofURLs = f.read().split()

In [None]:
listofPaths=[]
for i in listofURLs:
  listofPaths.append(i.replace("http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/","/content/drive/MyDrive/AmazonReview/" ))
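
The prefix replacement above only works when every URL starts with exactly that prefix. As an alternative sketch, the local path can be built from the file name at the end of the URL. The helper name `url_to_drive_path` and the file name in the sample URL are assumptions for illustration. The target directory is the one used above.

```python
import os
import posixpath
from urllib.parse import urlsplit

DRIVE_DIR = '/content/drive/MyDrive/AmazonReview/'  # same target directory as above


def url_to_drive_path(url, target_dir=DRIVE_DIR):
    """Map a download URL to a path inside target_dir, keeping only the file name."""
    file_name = posixpath.basename(urlsplit(url.strip()).path)
    return os.path.join(target_dir, file_name)


# Hypothetical file name under the URL prefix used above; the trailing newline
# mimics a line read straight from a text file.
example_url = 'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Example_5.json.gz\n'
print(url_to_drive_path(example_url))
```

Because the helper strips whitespace and ignores everything before the last path segment, it also tolerates trailing newlines and query strings in the link file.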

In [None]:
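os.makedirs("/content/drive/MyDrive/AmazonReview/", exist_ok=True)  # wget -O needs the target directory to exist
for url, path in zip(listofURLs, listofPaths):
  if not url.strip():  # skip blank entries, e.g. a trailing empty line
    continue
  print(url)
  cmd = ["wget", url, "-O", path]
  subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=False)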
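
With `check=False` and the output captured but never read, a failed download passes silently. The sketch below shows one possible way to surface failures using the return code. The helper name `run_and_report` is hypothetical, and the commands are stand-ins so the example runs without network access. In the download loop above, `cmd` would be the `wget` command.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run a command; return True on success, otherwise print its stderr and return False."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=False)
    if result.returncode != 0:
        print(f"command failed ({result.returncode}): {' '.join(cmd)}")
        print(result.stderr.decode('utf-8', errors='replace'))
        return False
    return True


# Stand-in commands (assumption: the current Python interpreter replaces wget here).
ok = run_and_report([sys.executable, "-c", "print('done')"])
failed = run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"])
print(ok, failed)
```

Collecting the URLs for which the helper returns `False` gives a list of downloads to retry.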