# Hello, notebook!!

Welcome to Jupyter (IPython) notebooks.

## Notebooks
### What is it?
A notebook is an interactive coding session. It is a *recipe book* with instructions for a particular sequence. You can read the notebook and execute code snippets interactively to follow the recipe. You can also tweak the ingredients to make your own, new recipe!

### No really though, what is it?
Jupyter runs a web server that interactively displays the contents of a `.ipynb` file, which can be edited dynamically. The server also runs an IPython session which can be used to run python commands. Let's figure out what's in a jupyter notebook file:

In [None]:
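# Print the first 30 lines of the notebook file to see its raw contents.
with open('pandas_walkthru.ipynb') as jupyter_file:
    for line_count, line in enumerate(jupyter_file, start=1):
        print(line, end='')  # each line already ends with a newline

        if line_count >= 30:
            break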

Cool! So this file, `pandas_walkthru.ipynb`, is a `JSON` object with markdown and code embedded inside. Everything we type here is automatically saved there and displayed in the web browser.
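
To make that structure a bit more concrete, here is a minimal sketch that builds a tiny notebook-shaped dictionary, writes it to a temporary file, and reads it back with the standard `json` module. The file name `tiny.ipynb` and the cell contents are made up for illustration, and a real notebook carries more metadata than this.

```python
import json
import os
import tempfile

# A hand-written, stripped-down notebook (hypothetical contents).
tiny_notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Hello, notebook!!"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print('hi')"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny.ipynb")
    with open(path, "w") as f:
        json.dump(tiny_notebook, f)

    # Read it back exactly like any other JSON file.
    with open(path) as f:
        notebook = json.load(f)

# Each cell records whether it holds markdown or code.
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)
```

The same `json.load` call should work on `pandas_walkthru.ipynb` itself, if you want to poke at the cells of this notebook.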

### So, why notebooks?

If it isn't obvious yet, let me explain. Notebooks are neat because you can write notes alongside your code. You can tweak both interactively, then also see the results of your code on-the-fly. Then, you can share your notebook with someone else and they can follow the exact same steps as you. And (hopefully) they will get the same results. 

## Python for automation

### What just happened there?
Let's take a step back and see what we just did. While we were learning about jupyter notebooks we also learned something about python: it's *ridiculously* easy to read files in python. It takes only 3 lines of code:

```python
with open('a_file') as f:
    for line in f:
        do_something_to(line)
```

When we printed the file above I was kind and limited the output to a few lines, so that needed a bit of extra code. The cool thing is, most data is text-based so it's almost never more difficult than this to read a file. 
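
As a runnable version of the pattern above, the sketch below writes a small throwaway file and then reads it back line by line. The file name `a_file.txt` and its contents are made up, and stripping the trailing newline stands in for `do_something_to`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small text file to read (hypothetical contents).
    path = os.path.join(tmp_dir, "a_file.txt")
    with open(path, "w") as f:
        f.write("first line\nsecond line\nthird line\n")

    # The same three-line pattern: open, loop, do something.
    cleaned_lines = []
    with open(path) as f:
        for line in f:
            cleaned_lines.append(line.rstrip("\n"))

print(cleaned_lines)
```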

### Why we choose python 
This is why we use python. It works like magic and you don't need to spend a lot of time writing code. When you use python you get access to the python community. This group of folks loves collaboration and open-source software, so they give you lots of advice and free code that does cool things. So let's do something cool.

## Getting new stuff

Most of us need to work with a lot of spreadsheets to do our jobs. Sometimes it's really tedious to read a bunch of spreadsheets and extract the data we need. Luckily, python has us covered here too. We'll need to make sure we have a few tools first. We'll be using a library called `pandas`. We'll need some data, too. Since most of our data is confidential or restricted, we'll be best off using a public data set.

### Install software using `conda`

If we need a new package, or want to make sure a library is installed, we use the command line tool `conda` that comes with our Anaconda installation. Let's get `pandas`. It's usually best to run `conda update --all` first, but we're running python right now, so we can't.


In [None]:
!conda install -y pandas

Your output could vary, depending on which versions of things have been installed. Hopefully you saw that pandas came preinstalled. Anaconda comes with most of the basic analysis tools that we might want. If we want to see more info about a particular package:

In [None]:
!conda list pandas

In [None]:
!conda search pandas --info
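
The same question can also be asked from inside python. This is a small sketch using the standard library's `importlib.metadata` (available in Python 3.8 and later); `installed_version` is a hypothetical helper written for this example, not part of `conda` or `pip`, and it only reports packages visible to the interpreter that is running the notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# The output depends on what is installed on your machine.
print(installed_version("pandas"))
print(installed_version("not-a-real-package-name-xyz"))
```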

### Install software using pip

`conda` is great for stable, well-defined libraries. It is well curated, so it has only the most common packages. If you want another open-source library, it will usually be found on PyPI (the Python Package Index). This is a public repository hosted by the Python Software Foundation, and anyone can package and submit their code there. To install from it, we use the tool `pip` ("Pip Installs Packages").

We're going to use a website [data.world](https://data.world) that allows users to share interesting data sets. 

In [None]:
!pip install -I datadotworld

Awesome. Now we have a module `datadotworld` that we can use to download data. First, we will need to set up an account on [data.world](https://data.world) and get an API access token.

We will open a command line and run

```shell
dw configure
```

Then paste our token. Now we should be all set.

## Working with pandas

### Download from data.world

In [None]:
import datadotworld as dw

df = dw.load_dataset('mattschroyer/netflix-original-series').dataframes['netflix_originals_series']

This will download the `netflix_originals_series` table and load it as a `pandas.DataFrame` object.

### Load locally

Alternatively, if the data is local, you can load it using `pandas.read_excel()`. The original data can be found at https://data.world/mattschroyer/netflix-original-series

In [None]:
ls

In [None]:
import pandas as pd

df = pd.read_excel('netflix_originals_series.xlsx')

## Manipulating a `DataFrame`

In [None]:
type(df)

A `DataFrame` is tabular data that has named headers. `pandas` provides a smooth interface to view, plot and analyze this data.
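
If you don't have the Netflix data handy, a toy table is enough to see what "tabular data with named headers" means. The sketch below builds a three-row `DataFrame` by hand; the titles and values are invented for illustration, and only the column names echo the ones used later in this notebook.

```python
import pandas as pd

# A tiny hand-made table (made-up rows, not the real dataset).
toy = pd.DataFrame({
    "Title": ["Show A", "Show B", "Show C"],
    "Premiere_Year": [2013, 2015, 2015],
    "Major_Genre": ["Drama", "Comedy", "Drama"],
})

print(type(toy))
print(list(toy.columns))  # the named headers
print(toy.shape)          # (number of rows, number of columns)
```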

In [None]:
df.head()

This printed the first 5 rows of the data. We can also print the last 10:

In [None]:
df.tail(10)

In [None]:
df.index

In [None]:
df.columns

Let's try and sort the data:

In [None]:
df.sort_values(by='Premiere Date', ascending=False)
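
One thing worth knowing is that `sort_values` returns a new, sorted `DataFrame` and leaves the original alone unless you assign the result back. Here is a small sketch on made-up data (the titles and years are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "Title": ["Show A", "Show B", "Show C"],
    "Premiere_Year": [2013, 2016, 2015],
})

# Sort newest first and keep the result in a new variable.
newest_first = toy.sort_values(by="Premiere_Year", ascending=False)

print(newest_first["Title"].tolist())
print(toy["Title"].tolist())  # the original order is unchanged
```

To keep the sorted order, assign it back, for example `toy = toy.sort_values(by="Premiere_Year")`.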

### Querying

`pandas` supports high-level manipulation with a SQL-like syntax. This makes aggregation pretty easy.

In [None]:
df_by_year = df.groupby('Premiere_Year').count()
df_by_year

In [None]:
df.groupby(['Major_Genre', 'Subgenre']).size()
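
The two cells above use `count()` and `size()`, which look similar but answer slightly different questions: `count()` tallies the non-missing values in each column, while `size()` just counts the rows in each group. A minimal sketch on made-up data, with one missing genre on purpose:

```python
import pandas as pd

toy = pd.DataFrame({
    "Title": ["Show A", "Show B", "Show C", "Show D"],
    "Premiere_Year": [2013, 2015, 2015, 2015],
    "Major_Genre": ["Drama", "Comedy", None, "Drama"],  # one value is missing
})

# Non-missing values per column, for each year.
counts = toy.groupby("Premiere_Year").count()

# Number of rows per year, missing values or not.
sizes = toy.groupby("Premiere_Year").size()

print(counts)
print(sizes)
```

For 2015 the `Major_Genre` count comes out one lower than the group size, because the missing genre is skipped by `count()`.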

### Plotting 

Or maybe we want a histogram?

In [None]:
%matplotlib inline 
df.hist(column='Premiere_Year')

## More information

This tutorial is an extremely basic overview of the kinds of things you can do in python. For more information about using pandas to read and write tabular data, check out [10 Minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html). In order to work with your own data you will need to get down into the weeds, but you will be able to write a script that can be replayed over and over again. Who knows how much time you will save? 