# Download datasets

### Note: the datasets can be very large! Check the [dataset website](https://mengtingwan.github.io/data/goodreads) and download only the files you really need.

## Download in Python

**We can also download the files in Python; this notebook uses the `requests` package**

In [1]:
import pandas as pd
import requests
import os

**Specify your directory here:**

In [2]:
DIR = './'

**Load data types and names**

In [3]:
file_names = pd.read_csv(os.path.join(DIR, 'dataset_names.csv'))
display(file_names)

Unnamed: 0,type,name
0,complete,goodreads_book_works.json.gz
1,complete,goodreads_book_authors.json.gz
2,complete,goodreads_book_series.json.gz
3,complete,goodreads_books.json.gz
4,complete,goodreads_book_genres_initial.json.gz
5,byGenre,goodreads_books_children.json.gz
6,byGenre,goodreads_books_comics_graphic.json.gz
7,byGenre,goodreads_books_fantasy_paranormal.json.gz
8,byGenre,goodreads_books_history_biography.json.gz
9,byGenre,goodreads_books_mystery_thriller_crime.json.gz


**Now we can construct the URLs to download files by name**

In [4]:
BASE_URL = 'https://mcauleylab.ucsd.edu/public_datasets/gdrive/goodreads/'

file_name_type_mapping = dict(zip(file_names['name'].values, file_names['type'].values))
file_name_url_mapping = {}

for fname, ftype in file_name_type_mapping.items():
    if ftype == "complete":
        # Complete files sit at the root of the dataset directory.
        file_name_url_mapping[fname] = BASE_URL + fname
    elif ftype == "byGenre":
        # Per-genre files sit under the byGenre/ subdirectory.
        file_name_url_mapping[fname] = BASE_URL + 'byGenre/' + fname

In [5]:
def download_by_name(fname, local_filename):
    """Download the dataset `fname` and save it to `local_filename`."""
    if fname in file_name_url_mapping:
        url = file_name_url_mapping[fname]
        # Stream the response so large files are never held in memory at once.
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with open(local_filename, 'wb') as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
        print('Dataset', fname, 'has been downloaded!')
    else:
        print('Dataset', fname, 'cannot be found!')
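
**Optional: check a file's size before downloading**

Since some files are very large, it can help to ask the server for a file's size first. The sketch below is not part of the original notebook: `format_size` and `remote_size` are hypothetical helper names, it uses the standard library's `urllib` rather than `requests`, and it assumes the server answers `HEAD` requests and reports a `Content-Length` header, which may not hold for every file.

```python
import urllib.request


def format_size(num_bytes):
    """Render a byte count as a human-readable string (binary units)."""
    size = float(num_bytes)
    for unit in ('B', 'KiB', 'MiB', 'GiB'):
        if size < 1024:
            return f'{size:.1f} {unit}'
        size /= 1024
    return f'{size:.1f} TiB'


def remote_size(url, timeout=30):
    """Return the Content-Length reported for `url`, or None if absent.

    Sends a HEAD request, so the file body itself is not transferred.
    """
    request = urllib.request.Request(url, method='HEAD')
    with urllib.request.urlopen(request, timeout=timeout) as response:
        length = response.headers.get('Content-Length')
    return int(length) if length is not None else None
```

With the URL mapping built above, `remote_size(file_name_url_mapping[fname])` would return the size in bytes for a known `fname`, and `format_size` turns that number into something readable.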

**Here we go!**

In [6]:
OUT_DIR = './genre'
os.makedirs(OUT_DIR, exist_ok=True)

output_path = os.path.join(OUT_DIR, 'goodreads_books_poetry.json.gz')
download_by_name('goodreads_books_poetry.json.gz', output_path)

Dataset goodreads_books_poetry.json.gz has been downloaded!
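
**Optional: peek at the downloaded file**

To confirm a download worked, we can read the first few records without decompressing the whole file. This is a minimal sketch, assuming the file is gzip-compressed with one JSON object per line; `peek_records` is a hypothetical helper name, not something defined by the dataset or this notebook.

```python
import gzip
import json
from itertools import islice


def peek_records(path, n=3):
    """Return the first `n` records of a gzipped JSON-lines file."""
    # Text mode ('rt') decompresses lazily, so only the first lines are read.
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        return [json.loads(line) for line in islice(f, n)]
```

For example, `peek_records(output_path)` would return a list with up to three dictionaries from the file downloaded above.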
