# Download datasets

### Note: the datasets can be very large! Check the [dataset website](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home) and download only the files you really need.

**The datasets are hosted on Google Drive, so downloading them without a GUI requires the [gdown](https://github.com/wkentaro/gdown) package.**

## Download with bash commands

**We can install the gdown package from the shell:**

`$ pip install gdown`

**and download a file by putting its Google Drive ID in place of `{fileID}` in the following command:**

`$ gdown 'https://drive.google.com/uc?id={fileID}'`

## Download in Python

**We can also install the gdown package from within the notebook and download files in Python:**

In [None]:
import sys
!{sys.executable} -m pip install gdown

In [2]:
import pandas as pd
import gdown
import os

**Specify your directories here. `DIR` is where `gdrive_id.csv` lives, and `OUT_DIR` is where the downloaded files are saved:**

In [3]:
DIR = './'
OUT_DIR = '/home/mengting/'

**Load the Google Drive IDs of the datasets:**

In [4]:
file_ids = pd.read_csv(os.path.join(DIR, 'gdrive_id.csv'))
display(file_ids)

| Unnamed: 0 | id | name |
|---|---|---|
| 0 | 1TLmSvzHvTLLLMjMoQdkx6pBWon-4bli7 | goodreads_book_works.json.gz |
| 1 | 19cdwyXwfXx_HDIgxXaHzH0mrx8nMyLvC | goodreads_book_authors.json.gz |
| 2 | 1op8D4e5BaxU2JcPUgxM3ZqrodajryFBb | goodreads_book_series.json.gz |
| 3 | 1LXpK1UfqtP89H1tYy0pBGHjYk8IhigUK | goodreads_books.json.gz |
| 4 | 1ah0_KpUterVi-AHxJ03iKD6O0NfbK0md | goodreads_book_genres_initial.json.gz |
| 5 | 1Cf90P5TH84ufrs8qyLrM-iWOXOGjBi9r | goodreads_interactions_children.json.gz |
| 6 | 1CCj-cQw_mJLMdvF_YYfQ7ibKA-dC_GA2 | goodreads_interactions_comics_graphic.json.gz |
| 7 | 1EFHocJIh5nknbUMcz4LnrMEJkwW3Vk6h | goodreads_interactions_fantasy_paranormal.json.gz |
| 8 | 10j181giCD94pcYynd6fy2U0RyAlL66YH | goodreads_interactions_history_biography.json.gz |
| 9 | 1xuujDT-vOMMkk2Kog0CTT9ADmnD8pa9D | goodreads_interactions_mystery_thriller_crime.... |
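
The `Unnamed: 0` column in the output above is most likely a row index that was written into the CSV when it was saved. A minimal sketch of reading such a file with `index_col=0`, so that only `id` and `name` remain as columns. The in-memory `csv_text` is a stand-in for `gdrive_id.csv`, and its empty first header field is an assumption about how that file is laid out:

```python
import io

import pandas as pd

# Stand-in for gdrive_id.csv: two rows copied from the output above.
# Assumption: the first header field in the real file is empty.
csv_text = (
    ",id,name\n"
    "0,1TLmSvzHvTLLLMjMoQdkx6pBWon-4bli7,goodreads_book_works.json.gz\n"
    "1,19cdwyXwfXx_HDIgxXaHzH0mrx8nMyLvC,goodreads_book_authors.json.gz\n"
)

# index_col=0 makes pandas use the first column as the row index
# instead of exposing it as a data column named "Unnamed: 0".
file_ids = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(file_ids.columns))
```

The rest of the notebook only uses the `id` and `name` columns, so it works the same with either way of reading the file.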


**Now we can create a map from file name to Google Drive ID, and a function to download files by name:**

In [5]:
file_id_map = dict(zip(file_ids['name'].values, file_ids['id'].values))

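def download_by_name(fname, output=None, quiet=False):
    """Download the file named `fname` from Google Drive using its mapped ID."""
    if fname in file_id_map:
        # Build the direct-download URL from the file's Google Drive ID.
        url = 'https://drive.google.com/uc?id=' + file_id_map[fname]
        gdown.download(url, output=output, quiet=quiet)
    else:
        print('The file', fname, 'cannot be found!')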
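
When several files are needed, it can help to loop over the names and skip anything already on disk. The following is a minimal sketch of that idea, not part of the original notebook: `download_many`, its `out_dir` argument, and the injectable `downloader` parameter are hypothetical names introduced here. By default it calls `gdown.download` with the same `output` and `quiet` keywords used above, and it writes straight into `out_dir` rather than relying on the current working directory.

```python
import os


def download_many(fnames, file_id_map, out_dir='.', downloader=None):
    """Download several files by name.

    Returns three lists: (downloaded, skipped, missing).
    `downloader` is a hypothetical hook; it defaults to gdown.download.
    """
    if downloader is None:
        import gdown  # imported lazily so gdown is only needed for real downloads
        downloader = gdown.download

    os.makedirs(out_dir, exist_ok=True)
    downloaded, skipped, missing = [], [], []
    for fname in fnames:
        if fname not in file_id_map:
            missing.append(fname)      # no Google Drive ID known for this name
            continue
        target = os.path.join(out_dir, fname)
        if os.path.exists(target):
            skipped.append(fname)      # a file with this name is already there
            continue
        url = 'https://drive.google.com/uc?id=' + file_id_map[fname]
        downloader(url, output=target, quiet=False)
        downloaded.append(fname)
    return downloaded, skipped, missing
```

For example, `download_many(['goodreads_book_works.json.gz', 'goodreads_book_authors.json.gz'], file_id_map, out_dir=OUT_DIR)` would fetch both files into `OUT_DIR`. The existence check only looks at the file name, so if an interrupted download leaves a partial file behind, delete it before retrying.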

**Switch to the output directory specified before:**

In [6]:
os.chdir(OUT_DIR)

**Here we go!**

In [7]:
download_by_name('goodreads_books_poetry.json.gz')

Downloading...
From: https://drive.google.com/uc?id=1H6xUV48D5sa2uSF_BusW-IBJ7PCQZTS1
To: /home/mengting/goodreads_books_poetry.json.gz
27.9MB [00:00, 73.3MB/s]
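
To check that a download succeeded, we can peek at the first few records. This is a minimal sketch that assumes the file is gzip-compressed text with one JSON object per line; that layout is not stated in this notebook, so adjust the parsing if the file is structured differently. `peek_json_gz` is a hypothetical helper name.

```python
import gzip
import json
from itertools import islice


def peek_json_gz(path, n=3):
    """Return the first `n` records of a gzip-compressed JSON-lines file."""
    # 'rt' opens the archive in text mode so each line is a str, not bytes.
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        # islice stops after n lines, so the whole file is never loaded.
        return [json.loads(line) for line in islice(f, n)]
```

For example, `peek_json_gz('goodreads_books_poetry.json.gz', n=1)` would return a one-element list holding the first record, which is a quick way to see which fields the file contains.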
