# Your first Jupyter Notebook on FloydHub

This tutorial introduces FloydHub and how to use Jupyter Notebooks for your experiments.

### Here’s what we’ll learn in this guide:

- How to use Jupyter Notebooks on FloydHub
- How to Create, Explore, and Mount datasets on FloydHub to use in your code
- FloydHub best practices:
 1. How and why to keep datasets separate from code as standalone Datasets
 2. How to sync your remote FloydHub experiments locally to your machine
 3. How to use .floydignore for low-bandwidth situations

## 1. What is a Jupyter Notebook

The [Jupyter Notebook](https://jupyter.org/) is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. It is widely adopted for:

- **Teaching and workshops**: students and participants can experiment during live sessions or for homework assignments,
- **Experiment visualization**: enriches the coding experience and helps developers during debugging,
- **Reproducibility**: allows anyone who has your code to reproduce your results or explore new directions.

A notebook is made up of Cells, which can either be Code Cells or Markdown Cells. The most important thing to learn is the **`shift + enter`** shortcut, which runs any command in a Code Cell. 

Try it now. Run the 'Hello FloydHub' code just below by selecting the Code Cell and clicking the play button, or by using the **`shift + enter`** shortcut.

In [1]:
print ('Hello FloydHub')

Hello FloydHub


## 2. Using FloydHub datasets in your code

Now, let's go through some simple examples to learn how to use public datasets in your Jupyter Notebooks on FloydHub.

### a. Download & explore a dataset

Let’s take a public [URL](https://raw.githubusercontent.com/plotly/datasets/master/2011_us_ag_exports.csv) of a CSV dataset (2011 US Agriculture Exports by State) and plot it as a [United States Choropleth Map](https://plot.ly/python/choropleth-maps/) using [Plotly](https://github.com/plotly/plotly.py) inside our Jupyter Notebook.

By default, we don’t have the plotly package installed on this instance (although FloydHub does automatically include lots of great libraries like [Numpy](http://www.numpy.org/), [Pandas](http://pandas.pydata.org/) and [Matplotlib](https://matplotlib.org/)).

But, don’t worry, we have two options for adding plotly to our FloydHub Jupyter Notebook instance:

1. **Install the package through the Jupyter terminal.** Just type `!` at the beginning of a Code cell to run a shell command, where we can install plotly. Note that you can list the available [Jupyter Magics](https://ipython.readthedocs.io/en/stable/interactive/magics.html) with the `%lsmagic` command. In this way you can extend your notebook with more functionality.


In [2]:
! pip install plotly

Collecting plotly
  Downloading plotly-2.2.1.tar.gz (1.1MB)
    100% |████████████████████████████████| 1.1MB 1.3MB/s eta 0:00:01
Building wheels for collected packages: plotly
  Running setup.py bdist_wheel for plotly ... done
  Stored in directory: /root/.cache/pip/wheels/cc/87/3f/6a282eb21da5d8223472bed40ee023cdcf2e9691b117969a4c
Successfully built plotly
Installing collected packages: plotly
Successfully installed plotly-2.2.1


2. **Declare the dependencies inside the floyd_requirements.txt file and relaunch the Jupyter instance.** This is the best practice when you already know the dependencies that your code requires. Just add a text file called `floyd_requirements.txt` to your project repository and add the line `plotly`, as done by the code snippet just below. FloydHub *will install all the Python dependencies* listed in this file during the creation of your job’s environment.


Note:

1. This works only for Python dependencies. For non-Python dependencies you have to run a bash command pipeline; see the docs for more info: [installing-non-python-dependencies](https://docs.floydhub.com/guides/jobs/installing_dependencies/#installing-non-python-dependencies).
2. `floyd_requirements.txt` is only used during the creation of a job instance, so you’ll need to recreate your instance with the floyd restart command or by clicking the Restart button on your dashboard.


In [1]:
! touch floyd_requirements.txt && echo plotly > floyd_requirements.txt # if you want to append more lines use >>

In [2]:
%cat floyd_requirements.txt

plotly


Now, you have to `Restart` your Job. To do this, [Stop](https://docs.floydhub.com/guides/stop_job/) the current Job and [Restart](https://docs.floydhub.com/guides/restart_job/) it from the Web Dashboard or CLI.

For a more detailed reference, see [Install Extra Dependencies](https://docs.floydhub.com/guides/jobs/installing_dependencies/) in our docs.
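The shell one-liner used for `floyd_requirements.txt` overwrites the file each time it runs. As a minimal sketch, the same step can be done from a Python cell so that a dependency is appended only when it is not already listed. The helper name `add_requirement` is hypothetical and is not part of any FloydHub tooling.

```python
from pathlib import Path


def add_requirement(package, path="floyd_requirements.txt"):
    """Append `package` to the requirements file unless it is already listed.

    `add_requirement` is an illustrative helper, not a FloydHub API.
    Returns True when the file was modified, False otherwise.
    """
    req_file = Path(path)
    existing = []
    if req_file.exists():
        # Keep the non-empty lines that are already in the file.
        existing = [line.strip() for line in req_file.read_text().splitlines()]
        existing = [line for line in existing if line]
    if package in existing:
        return False
    existing.append(package)
    req_file.write_text("\n".join(existing) + "\n")
    return True
```

Calling `add_requirement("plotly")` twice leaves a single `plotly` line in the file, which makes the cell safe to re-run.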

Now we are ready to explore our dataset, but before running any commands, it's good practice to keep a single cell with all the package imports we need, which makes the code much easier to maintain. What’s great is that imports persist into subsequent cells, so `plotly` and `pandas` will be available in later code blocks.

In [3]:
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import pandas as pd

print (__version__) # requires version >= 1.9.0 to use Offline mode

# This allows us to plot our graphs offline inside a Jupyter Notebook environment
init_notebook_mode(connected=True)

2.2.1
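The comment in the import cell notes that Offline mode needs plotly version 1.9.0 or later. Comparing version strings directly can mislead, because `'1.10.0' < '1.9.0'` when compared as text. A minimal sketch of a numeric check follows; the helper names `version_tuple` and `supports_offline_mode` are made up for this example.

```python
def version_tuple(version):
    """Turn a dotted version string such as '2.2.1' into (2, 2, 1).

    Only the leading digits of each part are kept, so '1.9.0rc1'
    becomes (1, 9, 0).
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def supports_offline_mode(version, minimum=(1, 9, 0)):
    """Check the '>= 1.9.0' requirement noted in the import cell."""
    current = version_tuple(version)
    # Pad short versions such as '1.9' with zeros before comparing.
    current += (0,) * (len(minimum) - len(current))
    return current >= minimum
```

In the notebook this could be called as `supports_offline_mode(__version__)` right after the imports.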


Now it's time to plot a United States [Choropleth Map](https://en.wikipedia.org/wiki/Choropleth_map) of Agriculture Exports by State during 2011. (Just press **`Shift + Enter`** in the Code cell below to create your chart.)

In [4]:
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2011_us_ag_exports.csv')

for col in df.columns:
    df[col] = df[col].astype(str)

scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
            [0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]

df['text'] = df['state'] + '<br>' +\
    'Beef '+df['beef']+' Dairy '+df['dairy']+'<br>'+\
    'Fruits '+df['total fruits']+' Veggies ' + df['total veggies']+'<br>'+\
    'Wheat '+df['wheat']+' Corn '+df['corn']

data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = df['code'],
        z = df['total exports'].astype(float),
        locationmode = 'USA-states',
        text = df['text'],
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "Millions USD")
        ) ]

layout = dict(
        title = '2011 US Agriculture Exports by State<br>(Hover for breakdown)',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)'),
             )
    
fig = dict( data=data, layout=layout )
iplot( fig, filename='d3-cloropleth-map' )

Having a proper way to visualize experiments is one of the most important aspects of every Data Scientist's workflow and a great way to debug code. If you are more interested in an ML workflow, you can take a look at the [MNIST Notebook tutorial](https://docs.floydhub.com/getstarted/get_started_jupyter/).
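One detail of the plotting cell is worth spelling out: every column is first cast to a string so that the hover text can be built by plain concatenation, and `total exports` is then cast back to a float because the color scale needs numbers. The sketch below reproduces that idea with the standard library only. The two rows are invented for illustration and are not taken from the real CSV; only the column names mirror the ones used above.

```python
import csv
import io

# Two invented rows; the values are placeholders, not real export figures.
SAMPLE_CSV = """code,state,total exports,beef,dairy
AA,Example State,1234.5,10.0,2.5
BB,Another State,99.0,1.0,0.0
"""


def build_hover_text(row):
    """Concatenate string fields into the HTML-like hover label."""
    return row["state"] + "<br>" + "Beef " + row["beef"] + " Dairy " + row["dairy"]


# csv.DictReader yields every field as a string, like astype(str) does.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
hover = [build_hover_text(row) for row in rows]

# The value that drives the color scale has to be numeric again.
z = [float(row["total exports"]) for row in rows]
```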

### b. Creating a FloydHub dataset


After you’ve done the work to clean and transform a dataset inside a Notebook, it’s a very common practice to save this data as a separate [Dataset](https://docs.floydhub.com/guides/create_and_upload_dataset/) on FloydHub. This will let you easily mount your data inside future jobs, so you won’t have to repeat yourself. Using FloydHub to manage your data also helps you speed up your workflow because you won’t need to redownload the data multiple times.
Let’s try it out in our current notebook!


#### Downloading COCO Dataset

[COCO - Common Objects in Context](http://cocodataset.org/#home) is a famous dataset widely adopted for Detection, Keypoint and Stuff Segmentation (from this year) Challenges. Let's take the 2014 Train images dataset, which is around 13 GB, and upload it as a FloydHub dataset in a few minutes.

**Note: FloydHub instances have a max storage of 100GB, so make sure not to exceed this size when downloading, processing or creating a dataset.**

Our CPU instance should get about 50 MB/s in download, so the download should take around 4 minutes. Compressing the data takes about 10 minutes and uploading about 12 minutes (at 15 MB/s in upload).
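These figures are simple size-over-bandwidth arithmetic, and the storage limit can be checked before starting. The sketch below shows both. The helper names are hypothetical, the rate is assumed to be constant, and 1 MB is taken as 10**6 bytes. Keep in mind that unzipping before deleting the archive needs room for both copies at once.

```python
import shutil


def transfer_minutes(size_bytes, rate_mb_per_s):
    """Estimated transfer time in minutes at a constant rate (1 MB = 10**6 bytes)."""
    return size_bytes / (rate_mb_per_s * 10**6) / 60


def has_room(required_bytes, path="."):
    """True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


# Archive size in bytes, as reported in the wget output below.
COCO_TRAIN_2014_BYTES = 13_510_573_713

# Roughly four and a half minutes at 50 MB/s.
download = transfer_minutes(COCO_TRAIN_2014_BYTES, 50)
```

The same function gives an upload estimate once you know the size of the data you are sending and your upload rate.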

In [5]:
! wget http://images.cocodataset.org/zips/train2014.zip

--2017-10-30 15:19:39--  http://images.cocodataset.org/zips/train2014.zip
Resolving images.cocodataset.org (images.cocodataset.org)... 172.217.3.208, 2607:f8b0:400a:800::2010
Connecting to images.cocodataset.org (images.cocodataset.org)|172.217.3.208|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13510573713 (13G) [application/zip]
Saving to: ‘train2014.zip’


2017-10-30 15:23:50 (51.5 MB/s) - ‘train2014.zip’ saved [13510573713/13510573713]



Check that the file is in the current directory.

In [6]:
!ls

command.sh  floyd_requisite.txt  quick_start.ipynb  train2014.zip


Unzip the dataset and remove the zipped version.

In [8]:
!unzip train2014.zip && rm train2014.zip

Collecting floyd-cli
  Downloading floyd-cli-0.10.17.tar.gz
Collecting click>=6.7 (from floyd-cli)
  Downloading click-6.7-py2.py3-none-any.whl (71kB)
    100% |████████████████████████████████| 71kB 4.0MB/s eta 0:00:01
Collecting clint>=0.5.1 (from floyd-cli)
  Downloading clint-0.5.1.tar.gz
Collecting requests>=2.12.4 (from floyd-cli)
  Downloading requests-2.18.4-py2.py3-none-any.whl (88kB)
    100% |████████████████████████████████| 92kB 4.9MB/s eta 0:00:01
Collecting requests-toolbelt>=0.7.1 (from floyd-cli)
  Downloading requests_toolbelt-0.8.0-py2.py3-none-any.whl (54kB)
    100% |████████████████████████████████| 61kB 5.3MB/s eta 0:00:01
Collecting marshmallow>=2.11.1 (from floyd-cli)
  Downloading marshmallow-2.14.0-py2.py3-none-any.whl (45kB)
    100% |████████████████████████████████| 51kB 7.4MB/s eta 0:00:01
Collecting tabulate>=0.7.7 (from floyd-cli)
  Downloading tabulate-0.8.1.tar.gz (45kB)
    100% |████████████████████████████████| 51kB

The dataset we have downloaded/created is ready to be imported as a FloydHub dataset with a *new* feature: Dataset acquisition from a Job's Output! As soon as you have stopped the Job, you can create a dataset from its output through the Web Dashboard or CLI. - SEE MORE ON ...

Now you’ve got a separate Dataset that you can use in future jobs!

#### Mount datasets or Previous Job Output

This is pretty straightforward, just follow our great docs: [Mounting Data](https://docs.floydhub.com/guides/data/mounting_data/).

**Note about output**: To retrieve the **Output** of your Job, you have to save the artifacts produced by your experiments in the `/output` folder and make sure that your scripts/programs comply with this policy; otherwise the output of your Jobs will be empty. The default working directory of a Jupyter Notebook is the `/output` folder, so you do not have to worry about this when you run your Jobs in `--mode jupyter`. For more, see [Save Output](https://docs.floydhub.com/guides/data/storing_output/) in our docs.
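For scripts that run outside Jupyter mode, a minimal sketch of writing an artifact into the output folder might look like this. The helper name `save_artifact`, the file name `metrics.json`, and the `OUTPUT_DIR` environment variable are assumptions of this example (the variable only makes local testing easier) and are not FloydHub settings. The default `/output` path comes from the note above.

```python
import json
import os
from pathlib import Path


def save_artifact(name, payload, output_dir=None):
    """Write `payload` as JSON under the output directory and return the path.

    The directory defaults to `/output`; `OUTPUT_DIR` is a hypothetical
    override used only by this sketch.
    """
    target_dir = Path(output_dir or os.environ.get("OUTPUT_DIR", "/output"))
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_text(json.dumps(payload, indent=2))
    return target
```

For example, `save_artifact("metrics.json", {"accuracy": 0.9})` would place the file where the Job's Output can pick it up; the metric value here is a placeholder.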

## 3. FloydHub Best Practices

### a. Keeping code separate from data

The key point of your experiments, and a Data Science best practice, is to keep a clean separation between the code and the data it uses. This lets you structure your experiments/Jobs more elegantly, reduces the code you need to upload to FloydHub, and speeds up your experiment iteration cycle.
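One way to keep this separation in practice is to have the code read the data location from configuration instead of hard-coding a path inside the project. The sketch below assumes a dataset mounted at `/coco` and an optional `DATA_DIR` environment variable. Both names are illustrative assumptions, as are the helper names.

```python
import os
from pathlib import Path


def data_dir(default="/coco"):
    """Return the dataset directory, overridable through DATA_DIR.

    Both the `/coco` mount point and the `DATA_DIR` variable are
    assumptions made for this example.
    """
    return Path(os.environ.get("DATA_DIR", default))


def list_images(root, extensions=(".jpg", ".jpeg", ".png")):
    """Return the sorted image paths found anywhere under `root`."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )
```

With this layout the same code runs locally and remotely: only the value of the data directory changes, and no data files need to live next to the code.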

### b. Sync your remote experiments locally

If you have followed this tutorial, you have certainly noticed that we have worked only on a remote FloydHub Job, and the code we have locally is not synced with the current state of our Jupyter Notebook.

If you’d like to update everything locally, you can download everything from the Output tab of the Job's Overview page on the Web Dashboard, or use the CLI with `floyd data clone <output>`.

You can read more on [output download](https://docs.floydhub.com/guides/download_output/) in our docs.

### c. Using .floydignore

Using `.floydignore` can speed up your uploads and experiment iterations if your project contains items that are irrelevant to your experiment's code (such as docs, images and video). See our FAQ about [long sync](https://docs.floydhub.com/faqs/job/#my-job-is-taking-a-while-to-sync-changes-how-do-i-make-it-go-faster).

**Note**: If your internet connection has low upload bandwidth, this file can really improve your experience on our service.
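To get a feel for how much an ignore list could save, the sketch below sums the size of a project folder with and without a set of patterns. It matches file and directory names with `fnmatch`, which only approximates how the real ignore file is interpreted. The helper name `upload_size` and the example patterns are assumptions.

```python
import fnmatch
import os


def _ignored(name, patterns):
    """True if a file or directory name matches any of the patterns."""
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)


def upload_size(root, ignore_patterns=()):
    """Sum file sizes in bytes under `root`, skipping names that match a pattern."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories so their contents are never visited.
        dirnames[:] = [d for d in dirnames if not _ignored(d, ignore_patterns)]
        for name in filenames:
            if _ignored(name, ignore_patterns):
                continue
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

Comparing `upload_size(".")` with `upload_size(".", ["docs", "*.mp4"])` gives a rough idea of how many bytes would no longer be uploaded.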
