# Download datasets

### Note: the datasets can be very large! Check the [dataset website](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home) and download only the files you really need.

**The datasets are hosted on Google Drive, so downloading them without a GUI requires the [gdown](https://github.com/wkentaro/gdown) package.**

## Download with bash commands

**We can install the gdown package from the shell:**

`$ pip install gdown`

**and download a file by substituting its Google Drive ID for `{fileID}` in the following command:**

`$ gdown 'https://drive.google.com/uc?id={fileID}'`

## Download in Python

**We can also install the gdown package from the notebook (uncomment the `pip` line below if gdown is not installed yet) and download files in Python:**

In [1]:
import sys
# !{sys.executable} -m pip install gdown

In [2]:
import pandas as pd
import gdown
import os

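**Specify your directories here. `DIR` is the folder that contains `gdrive_id.csv`, and `OUT_DIR` is where the downloaded files are saved:**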

In [3]:
DIR = './'
OUT_DIR = '../dataset/'

**Load the Google Drive IDs of the datasets:**

In [5]:
file_ids = pd.read_csv(os.path.join(DIR, 'gdrive_id.csv'))
display(file_ids)

| Unnamed: 0 | id | name |
|---|---|---|
| 0 | 1TLmSvzHvTLLLMjMoQdkx6pBWon-4bli7 | goodreads_book_works.json.gz |
| 1 | 19cdwyXwfXx_HDIgxXaHzH0mrx8nMyLvC | goodreads_book_authors.json.gz |
| 2 | 1op8D4e5BaxU2JcPUgxM3ZqrodajryFBb | goodreads_book_series.json.gz |
| 3 | 1LXpK1UfqtP89H1tYy0pBGHjYk8IhigUK | goodreads_books.json.gz |
| 4 | 1ah0_KpUterVi-AHxJ03iKD6O0NfbK0md | goodreads_book_genres_initial.json.gz |
| 5 | 1R3wJPgyzEX9w6EI8_LmqLbpY4cIC9gw4 | goodreads_books_children.json.gz |
| 6 | 1ICk5x0HXvXDp5Zt54CKPh5qz1HyUIn9m | goodreads_books_comics_graphic.json.gz |
| 7 | 1x8IudloezYEg6qDTPxuBkqGuQ3xIBKrt | goodreads_books_fantasy_paranormal.json.gz |
| 8 | 1roQnVtWxVE1tbiXyabrotdZyUY7FA82W | goodreads_books_history_biography.json.gz |
| 9 | 1ACGrQS0sX4-26D358G2i5pja1Y6CsGtz | goodreads_books_mystery_thriller_crime.json.gz |


**Now we can build a map from file name to Google Drive ID and a function that downloads a file by name:**

In [6]:
file_id_map = dict(zip(file_ids['name'].values, file_ids['id'].values))

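def download_by_name(fname, output=None, quiet=False):
    """Download the dataset file `fname` using its Google Drive ID."""
    if fname in file_id_map:
        url = 'https://drive.google.com/uc?id=' + file_id_map[fname]
        # With output=None, the file is saved into the current working directory.
        gdown.download(url, output=output, quiet=quiet)
    else:
        print('The file', fname, 'cannot be found!')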
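Because the files are large, it can help to avoid downloading the same file twice. The following is a minimal sketch of one way to do that. The helper name `download_missing` and its parameters are hypothetical and not part of the original notebook. It only checks whether a file with the same name already exists, so it does not verify that an earlier download completed.

```python
import os

def download_missing(names, download_fn, out_dir='.'):
    """Call download_fn(name) for every name not already present in out_dir.

    `download_fn` is any callable that takes a file name; in this notebook
    it could be `download_by_name`. Returns the names that were requested.
    """
    fetched = []
    for name in names:
        # Assumption: a file with the same name means it was downloaded before.
        if os.path.exists(os.path.join(out_dir, name)):
            print('Skipping', name, '(already present)')
            continue
        download_fn(name)
        fetched.append(name)
    return fetched
```

Once the working directory is the output directory, a call such as `download_missing(['goodreads_book_works.json.gz'], download_by_name)` would fetch only the files that are not there yet.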

**Switch to the output directory specified before:**

In [7]:
os.makedirs(OUT_DIR, exist_ok=True)  # create the output directory if it does not exist yet
os.chdir(OUT_DIR)

**Here we go!**

In [8]:
download_by_name('goodreads_book_works.json.gz')

Downloading...
From: https://drive.google.com/uc?id=1TLmSvzHvTLLLMjMoQdkx6pBWon-4bli7
To: /home/a/ajayago/cs5260/goodreads/dataset/goodreads_book_works.json.gz
100%|██████████| 74.9M/74.9M [00:00<00:00, 76.9MB/s]


In [9]:
download_by_name('goodreads_reviews_young_adult.json.gz')

Downloading...
From: https://drive.google.com/uc?id=1M5iqCZ8a7rZRtsmY5KQ5rYnP9S0bQJVo
To: /home/a/ajayago/cs5260/goodreads/dataset/goodreads_reviews_young_adult.json.gz
100%|██████████| 899M/899M [00:12<00:00, 70.9MB/s] 


In [10]:
download_by_name('goodreads_reviews_poetry.json.gz')

Downloading...
From: https://drive.google.com/uc?id=1FVD3LxJXRc5GrKm97LehLgVGbRfF9TyO
To: /home/a/ajayago/cs5260/goodreads/dataset/goodreads_reviews_poetry.json.gz
100%|██████████| 49.3M/49.3M [00:01<00:00, 31.4MB/s]


In [11]:
download_by_name('goodreads_interactions_young_adult.json.gz')

Downloading...
From: https://drive.google.com/uc?id=1NNX7SWcKahezLFNyiW88QFPAqOAYP5qg
To: /home/a/ajayago/cs5260/goodreads/dataset/goodreads_interactions_young_adult.json.gz
100%|██████████| 1.84G/1.84G [01:15<00:00, 24.4MB/s]


In [12]:
download_by_name('goodreads_books_young_adult.json.gz')

Downloading...
From: https://drive.google.com/uc?id=1gH7dG4yQzZykTpbHYsrw2nFknjUm0Mol
To: /home/a/ajayago/cs5260/goodreads/dataset/goodreads_books_young_adult.json.gz
100%|██████████| 105M/105M [00:02<00:00, 48.5MB/s] 


In [13]:
download_by_name('goodreads_books.json.gz')

Downloading...
From: https://drive.google.com/uc?id=1LXpK1UfqtP89H1tYy0pBGHjYk8IhigUK
To: /home/a/ajayago/cs5260/goodreads/dataset/goodreads_books.json.gz
100%|██████████| 2.08G/2.08G [01:19<00:00, 26.1MB/s]


In [14]:
download_by_name('goodreads_book_genres_initial.json.gz')

Downloading...
From: https://drive.google.com/uc?id=1ah0_KpUterVi-AHxJ03iKD6O0NfbK0md
To: /home/a/ajayago/cs5260/goodreads/dataset/goodreads_book_genres_initial.json.gz
100%|██████████| 24.3M/24.3M [00:01<00:00, 12.4MB/s]
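
After a download finishes, a quick way to check that a file is readable is to decompress it and parse its first few records. The sketch below assumes each `.json.gz` file stores one JSON object per line; that layout is an assumption here, so check the dataset website if parsing fails. The helper name `peek_json_gz` is hypothetical.

```python
import gzip
import json

def peek_json_gz(path, n=3):
    """Return the first n records of a gzip-compressed, line-delimited JSON file."""
    records = []
    # 'rt' opens the compressed stream in text mode, so lines arrive as str.
    with gzip.open(path, 'rt', encoding='utf-8') as fin:
        for line in fin:
            if len(records) >= n:
                break
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Assumption: every non-empty line is one complete JSON object.
            records.append(json.loads(line))
    return records
```

For example, `peek_json_gz('goodreads_book_works.json.gz')` reads only the beginning of the file, so it stays fast even for the multi-gigabyte downloads.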
