# Import Dataset from Data Asset eXchange

This notebook downloads a dataset file from a public location. If the file is a compressed archive, it is extracted.

This notebook requires the following environment variables:
- `dataset_download_url`: public dataset URL, e.g. `https://dax-cdn.cdn.appdomain.cloud/dax-fashion-mnist/1.0.2/fashion-mnist.tar.gz`
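
As a minimal sketch of how such a variable could be read, the snippet below looks up `dataset_download_url` in the process environment and falls back to the example URL above. The helper name `resolve_dataset_url` and the fallback behavior are assumptions made for illustration; they are not part of this notebook.

```python
import os

# Example URL taken from the description above; used only as a fallback.
DEFAULT_DATASET_URL = (
    'https://dax-cdn.cdn.appdomain.cloud/dax-fashion-mnist/1.0.2/fashion-mnist.tar.gz'
)


def resolve_dataset_url(default=DEFAULT_DATASET_URL):
    """Return the URL from the environment, or the default if unset or blank.

    Hypothetical helper: the notebook itself assigns the URL in a code cell.
    """
    value = os.environ.get('dataset_download_url', '').strip()
    return value or default
```

Calling `resolve_dataset_url()` before the download step would let the same notebook run unchanged against different datasets.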

### Table of Contents:
* [0. Prerequisites](#cell0)
* [1. Download and Extract Dataset Archive](#cell1)
* [2. Add Dataset Files to Watson Studio Project](#cell2)
* [Authors](#authors)


<a id="cell0"></a>
### 0. Prerequisites

Before you run this notebook, complete the following steps:
- Insert a project token
- Import required packages

#### Insert a project token

When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:

```python
# @hidden_cell
# The project token is an authorization token that is used to access project resources such as data sources and connections, and is used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context
```

If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:

* Click on `More -> Insert project token` in the top-right menu section

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

* This should insert a cell at the top of this notebook similar to the example given above.

  > If an error is displayed indicating that no project token is defined, follow [these instructions](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/token.html?audience=wdp&context=data).

* Run the newly inserted cell before proceeding with the notebook execution below

#### Import required packages

In [None]:
import requests
import tarfile
from pathlib import Path
from urllib.parse import urlparse

<a id="cell1"></a>
### 1. Download and Extract Dataset Archive

First, we set `dataset_download_url` to the URL of the dataset archive and `data_path_name` to the directory where the extracted data files are expected.

This cell uses the `Oil Reservoir Simulations Dataset` as an example. Feel free to change the URL and the data path; the remaining cells should work the same way for all [Data Asset eXchange](https://developer.ibm.com/exchanges/data/) datasets.

In [None]:
# Dataset archive location on public cloud storage
dataset_download_url = 'https://dax-cdn.cdn.appdomain.cloud/dax-oil-reservoir-simulations/1.0.0/oil-reservoir-simulations.tar.gz'

data_path_name = 'oil-reservoir-simulations/data/'
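
To make the relationship between the two settings concrete, here is a small sketch that derives the archive file name from the URL (the same expression the download cell uses) and the top-level folder from the data path. The variable names `archive_name` and `top_level_dir` are introduced here for illustration only, and the assumption that the archive unpacks into that folder follows from `data_path_name` rather than from inspecting the archive.

```python
from pathlib import Path
from urllib.parse import urlparse

dataset_download_url = (
    'https://dax-cdn.cdn.appdomain.cloud/dax-oil-reservoir-simulations/'
    '1.0.0/oil-reservoir-simulations.tar.gz'
)
data_path_name = 'oil-reservoir-simulations/data/'

# The last component of the URL path becomes the local archive file name.
archive_name = Path(urlparse(dataset_download_url).path).name
print(archive_name)   # oil-reservoir-simulations.tar.gz

# The first component of the data path is the folder expected after extraction.
top_level_dir = Path(data_path_name).parts[0]
print(top_level_dir)  # oil-reservoir-simulations
```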

Download the dataset archive, extract it if it is a valid TAR file, and remove the archive afterwards.

In [None]:
print('Downloading dataset archive {} ...'.format(dataset_download_url))

data_file_name = Path(urlparse(dataset_download_url).path).name

r = requests.get(dataset_download_url)

if r.status_code != 200:
    print('Error. Dataset archive download failed.')
else:
    # save the downloaded archive
    print('Saving downloaded archive as {} ...'.format(data_file_name))
    with open(data_file_name, 'wb') as downloaded_file:
        downloaded_file.write(r.content)
    
    if tarfile.is_tarfile(data_file_name):
        # extract the downloaded archive
        print('Extracting downloaded archive ...')
        with tarfile.open(data_file_name, 'r') as tar:
            tar.extractall()
        print('Removing downloaded archive ...')
        Path(data_file_name).unlink()
        print('Done.')
    else:
        print('Error. The downloaded file is not a valid TAR archive.')
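
The extraction step above trusts the archive contents. If you point the notebook at an archive from a less trusted source, a more defensive variant might check member paths first. The sketch below is one possible approach, not what this notebook does; `safe_extract` is a hypothetical helper, and it only guards against path traversal through member names (it does not inspect symbolic links).

```python
import tarfile
from pathlib import Path


def safe_extract(archive_path, target_dir='.'):
    """Extract a TAR archive, refusing members that would land outside target_dir."""
    target = Path(target_dir).resolve()
    with tarfile.open(archive_path, 'r') as tar:
        for member in tar.getmembers():
            # Resolve where this member would be written on disk.
            destination = (target / member.name).resolve()
            if destination != target and target not in destination.parents:
                raise ValueError('Unsafe path in archive: {}'.format(member.name))
        # All member names passed the check, so extract everything.
        tar.extractall(path=target)
    return target
```

With this helper, `safe_extract(data_file_name)` could stand in for the `tarfile.open(...)` / `extractall()` pair in the cell above.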
    

<a id="cell2"></a>
### 2. Add Dataset Files to Watson Studio Project

Next, we add the extracted data files to the Watson Studio project to make them available to the other notebooks.

In [None]:
data_path = Path(data_path_name)

# Verify that the extracted artifacts are located in the expected location
if not data_path.exists():
    print('Error. The extracted data files are not located in the {} directory.'.format(data_path))
else:
    # Save extracted data file(s) as project assets
    data_asset_count = 0
    for file in list(data_path.glob('**/*')):
        # glob('**/*') also yields directories; only regular files can be read and removed
        if not file.is_file():
            continue
        # Skip .tgz archives; they are not saved as data assets
        if file.suffix != '.tgz':
            print('Saving {} as a project data asset ...'.format(file.name))
            # save data file as a data asset in the project
            # (open the file by its full path so files in subdirectories are found)
            with open(file, 'rb') as f:
                project.save_data(file.name, f.read(), set_project_asset=True, overwrite=True)
            data_asset_count = data_asset_count + 1
        # remove the file to free up space
        file.unlink()
    print('Number of added data assets: {} '.format(data_asset_count))
    print('You are ready to run the other notebooks.')
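
Because the cell above deletes each file after saving it, it can be useful to preview which files would become data assets before running it. The following sketch applies the same selection rule (regular files only, `.tgz` skipped) without touching the project; `collect_data_files` is a hypothetical helper and is not part of `project_lib`.

```python
from pathlib import Path


def collect_data_files(data_dir, skip_suffixes=('.tgz',)):
    """Return the regular files below data_dir, skipping the given suffixes."""
    return sorted(
        p for p in Path(data_dir).glob('**/*')
        if p.is_file() and p.suffix not in skip_suffixes
    )
```

For example, `[p.name for p in collect_data_files(data_path_name)]` lists the asset names that would be used. Since assets are saved under `file.name` with `overwrite=True`, two files with the same name in different subdirectories would end up as a single asset, so a preview like this is a cheap way to spot collisions.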

<a id="authors"></a> 
### Authors

This notebook was created by the [Center for Open-Source Data & AI Technologies](http://codait.org).
<br><br>

Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.