<a href="https://colab.research.google.com/github/HSV-AI/presentations/blob/master/2019/191002_Data_Analysis_Jupyter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analyzing Data with Jupyter Notebook

This notebook is based in part on the Python Data Science Handbook by Jake VanderPlas, which you can find [here](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb).



## IPython Tips

Here are a few tips that will help you get started using a Jupyter Notebook.

### Tip 1 - Help

You can use the ? character at the end of a function or type to access the help for that function or type.

In [0]:
# Let's create a list and see how to get the length
v = [1.0, 2.0, 3.0]
len?

In [6]:
len(v)

3

In [0]:
# We can even get information about the list itself:
v?

In [0]:
# Let's create a function with a helpful description
def empty_function():
    """This is just an empty function. Please don't call it."""
    return 1

In [0]:
# Now the description is available by asking for help
empty_function?

In [0]:
# Two question marks will display the source for the function
empty_function??
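
The ? syntax only works inside IPython and Jupyter. As a rough sketch of how to reach similar information from plain Python, the standard library's inspect module can be used. The function describe_me below is hypothetical and exists only for illustration.

```python
import inspect


def describe_me():
    """Return a fixed greeting. Exists only to demonstrate introspection."""
    return "hello"


# Roughly what `describe_me?` reports: the signature and the docstring
signature = inspect.signature(describe_me)
doc = inspect.getdoc(describe_me)
print(f"describe_me{signature}")
print(doc)

# Built-ins carry docstrings too, which is the text that `len?` displays
len_doc = inspect.getdoc(len)
print(len_doc)
```

The closest counterpart to the double question mark is inspect.getsource, which works when the source of the function is available to the interpreter.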

### Tip 2 - Tab Completion

I was going to add a section about using the \<TAB\> key to autocomplete, but Colab already has that feature built into the editor.

Just in case, try typing

v.\<TAB\>

in a code cell below:
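
Tab completion on an object offers its attributes and methods. A minimal plain-Python sketch of roughly the same list, assuming the list v from Tip 1, uses the built-in dir function:

```python
v = [1.0, 2.0, 3.0]

# Public attributes and methods of the list, i.e. names without a leading underscore
public_names = [name for name in dir(v) if not name.startswith("_")]
print(public_names)
```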

### Tip 3 - Magic Commands

No really, they are called magic commands, and you can recognize them by the leading '%'.

One of the most useful commands for splitting up a large notebook is the %run magic command. With it, you can run external Python scripts or even IPython notebooks inside the context of the current notebook.

An example of this can be found in the HSV-AI Bug Analysis notebook [here](https://colab.research.google.com/github/HSV-AI/bug-analysis/blob/master/Doc2Vec.ipynb)
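
In a notebook, running a script with %run makes the names it defines available in the current session. As a rough stand-in outside IPython (not how %run itself is implemented), the standard library's runpy module can execute a script and hand back the names it defined. The script name helper_script.py and its contents are hypothetical.

```python
import runpy
import tempfile
from pathlib import Path

# Write a tiny throwaway script (hypothetical contents, for illustration only)
with tempfile.TemporaryDirectory() as tmp_dir:
    script_path = Path(tmp_dir) / "helper_script.py"
    script_path.write_text("answer = 6 * 7\n")

    # Execute the script and collect the global names it defined
    script_globals = runpy.run_path(str(script_path))

print(script_globals["answer"])
```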

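Two other very useful magic commands are **%time** and **%timeit**. The first times a single run of a statement, while the second runs the statement repeatedly to get a more reliable measurement.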


In [14]:
# Using an example directly from VanderPlas:
print("Here's the output of the %timeit command:")
%timeit L = [n ** 2 for n in range(1000)]

print("\nHere's the output of the %time command:")
%time L = [n ** 2 for n in range(1000)]

Here's the output of the %timeit command:
1000 loops, best of 3: 246 µs per loop

Here's the output of the %time command:
CPU times: user 275 µs, sys: 0 ns, total: 275 µs
Wall time: 277 µs
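
Outside a notebook, a similar measurement can be sketched with the standard library's timeit module. The loop and repeat counts below are chosen to mirror the output above and are not necessarily what %timeit would pick on its own.

```python
import timeit

statement = "L = [n ** 2 for n in range(1000)]"

# Run the statement 1000 times per trial, for 3 trials
trials = timeit.repeat(statement, number=1000, repeat=3)

# Report the fastest trial, converted to microseconds per loop
best_us_per_loop = min(trials) / 1000 * 1e6
print(f"1000 loops, best of 3: {best_us_per_loop:.0f} µs per loop")
```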


### Tip 4 - Suppressing Output

Jupyter displays the value of the last expression in a cell as the cell's output. Sometimes you just want it to stay quiet, though, especially if you are generating plots and other items that do not need the output text inserted.

To suppress the output, just end the line with a ';'.

In [18]:
# This code has output
p1 = 'This little piggy had roast beef'
p1

'This little piggy had roast beef'

In [0]:
# This code does not have output
p2 = 'This little piggy had none'
p2;

### Tip 5 - %history Magic

This particular piece of magic is very helpful when you have run cells out of order and need to work out what was executed, and in what sequence.

In [0]:
%history

### Tip 6 - Shell Commands

Most Linux shell commands are available from the Jupyter notebook as well; just prefix the command with a '!'. The example below downloads a CSV file from data.nasa.gov and sets us up to start doing some data analysis.

In [1]:
!wget https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv

--2019-10-02 12:46:47--  https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv
Resolving data.nasa.gov (data.nasa.gov)... 128.102.186.77, 2001:4d0:6311:2c05:60b0:5ad8:1210:ea07
Connecting to data.nasa.gov (data.nasa.gov)|128.102.186.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘rows.csv’

rows.csv                [         <=>        ]   4.55M  2.57MB/s    in 1.8s    

2019-10-02 12:46:55 (2.57 MB/s) - ‘rows.csv’ saved [4769811]
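
IPython can also capture the output of a shell command into a variable. In plain Python, a comparable effect can be sketched with the subprocess module. The child command here is just the Python interpreter printing a line, chosen so the example does not depend on any particular shell tool or on network access.

```python
import subprocess
import sys

# Run a child process and capture what it prints, line by line
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    capture_output=True,
    text=True,
    check=True,  # raise an error if the command exits with a non-zero status
)
lines = result.stdout.splitlines()
print(lines)
```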



## Pandas Dataframes

Pandas dataframes are a great way to work with tabular data.

There are several ways to load a dataframe from common file formats like JSON and CSV.


In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('rows.csv')

# Now let's look at the columns available in the dataframe
df.columns

Index(['name', 'id', 'nametype', 'recclass', 'mass (g)', 'fall', 'year',
       'reclat', 'reclong', 'GeoLocation'],
      dtype='object')
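
Both read_csv and read_json accept file-like objects as well as file paths, which makes it easy to try them without a file on disk. The sketch below uses made-up rows (not values from rows.csv) held in in-memory strings.

```python
import io

import pandas as pd

# Made-up rows for illustration only; these are not values from rows.csv
csv_text = "name,id,mass (g)\nsample_a,1,10.5\nsample_b,2,20.0\n"
json_text = (
    '[{"name": "sample_a", "id": 1, "mass (g)": 10.5},'
    ' {"name": "sample_b", "id": 2, "mass (g)": 20.0}]'
)

# Wrap each string in StringIO so pandas treats it like an open file
df_from_csv = pd.read_csv(io.StringIO(csv_text))
df_from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(df_from_csv.columns.tolist())
print(df_from_json.shape)
```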

In [0]:
# We can also get an idea of the data by calling the head method,
# which returns the first five rows by default
df.head()

In [0]:
# We can copy columns of the dataframe into a new dataframe.
# The explicit .copy() makes it clear that the result is independent of df
df_copy = df[['name', 'id']].copy()

df_copy.head()

In [0]:
# Changes to the copy do not affect the original
df_copy['id'] = 0

print(df_copy.head())

print(df.head())
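
The independence of a column subset from its source can be checked on a small frame. This is a minimal sketch with made-up rows; calling .copy() explicitly also avoids the SettingWithCopyWarning that some pandas versions emit when assigning into a selection.

```python
import pandas as pd

# Made-up rows for illustration only
original = pd.DataFrame(
    {"name": ["sample_a", "sample_b"], "id": [1, 2], "year": [1900, 1950]}
)

# Take an explicit, independent copy of two columns
subset = original[["name", "id"]].copy()
subset["id"] = 0

# The copy was modified, the original was not
print(subset["id"].tolist())
print(original["id"].tolist())
```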

In [21]:
df_view = df.loc[:, ['name','id']]

df_view['id'] = 0

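# head is referenced without parentheses here, so print shows the bound method
# together with the whole dataframe; call head() to print only the first five rows
print(df_view.head)

print(df.head)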

<bound method NDFrame.head of                       name  id
0                   Aachen   0
1                   Aarhus   0
2                     Abee   0
3                 Acapulco   0
4                  Achiras   0
5                 Adhi Kot   0
6      Adzhi-Bogdo (stone)   0
7                     Agen   0
8                   Aguada   0
9            Aguila Blanca   0
10        Aioun el Atrouss   0
11                     Aïr   0
12         Aire-sur-la-Lys   0
13                   Akaba   0
14                Akbarpur   0
15                 Akwanga   0
16                 Akyumak   0
17                 Al Rais   0
18               Al Zarnkh   0
19                   Alais   0
20                Albareto   0
21                 Alberta   0
22         Alby sur Chéran   0
23               Aldsworth   0
24                  Aleppo   0
25             Alessandria   0
26           Alexandrovsky   0
27              Alfianello   0
28                 Allegan   0
29                 Allende   0
...      