# 1. Introduction to Data Science in Python
![HKU logo](./445px_hku.png)![HKU SAAS logo](./statistics3.png)

This notebook helps introduce some of the most basic tools that are commonly used for doing data science and statistics in Python.

# Table of contents
* [Introduction](#introduction)
* [Jupyter Notebook](#Jupyter-Notebook)
* [pandas](#pandas)
* [NumPy](#Introduction-to-NumPy)
* [Interactive data visualization with Plotly](#plotly)
* [Linear regression with Scikit-learn and statsmodels](#Linear-regression-with-scikit-learn-and-statsmodels)
* [Additional references](#additional-references)

## Note: you will need to run the following code cell every time you restart this notebook
If this is your first time using Jupyter, click the block of code below and either press the <i class="fa-step-forward fa"></i><span class="toolbar-btn-label">Run</span> button or press `Shift + Enter` on your keyboard.

In [4]:
ls



1. Introduction to Data Science in Python.ipynb*
2. Intro to Machine Learning in Python with Scikit-learn.ipynb*
3. Building Neural Networks with Keras.ipynb*
445px_hku.png
Dockerfile*
README.md*
assets/
requirements.txt*
solutions/
spec-file.txt*
statistics3.png


In [1]:
import pandas               as pd
import matplotlib.pyplot    as plt
import numpy                as np
import statsmodels.api      as sm
import seaborn              as sns
import plotly.offline       as py
import plotly.graph_objs    as go
from   sklearn.linear_model import LinearRegression

from IPython.display import display

iris = sns.load_dataset('iris')

# Allow plotly to display plots inside the notebook
py.init_notebook_mode(connected=True)

# Jupyter Notebook <a id="Jupyter-Notebook"></a>
[Jupyter Notebook](https://jupyter.org/) is an interactive tool for running code and visualizing data. Each notebook consists of a series of _code cells_ and _Markdown cells_.

* Code cells allow you to run code in a number of languages. Behind the scenes, Jupyter runs a "kernel" that processes the code whenever you execute a cell. Since this is a Python notebook, Jupyter is running the [IPython](https://ipython.org/) kernel. However, kernels also exist for Julia, R, and many other languages.
* Markdown cells display text using the [Markdown language](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html). In addition to displaying text, you can write equations in these cells using $\LaTeX$.

To run code, click a code cell (like the one below) and do one of the following:
* Press `Shift + Enter` on your keyboard
* On the toolbar at the top of this notebook, press the <i class="fa-step-forward fa"></i><span class="toolbar-btn-label">Run</span> button.

In [2]:
print("Hello, world!")

Hello, world!


You can render a markdown cell in the same way. Double click the text below, and try putting in some of the following items:

    # This is a large heading!
    ## This is a smaller heading!
    ### This is an even smaller heading!
    Here is some code: `x = y + z`
    And here is an equation: $x = y + z$



## *Double-click this text!*



### Cell magic
The IPython kernel provides some useful tools for programmers, including

* [Magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html), which allow you to do things like look up documentation and past commands that you've run, and
* [Support for graphical user interfaces (GUIs)](https://ipython.org/ipython-doc/stable/interactive/reference.html#gui-event-loop-support), which makes it easier to interact with your code.

Here's an example of a useful magic command. `?` will look up the documentation for a library, class, or function to help you figure out how to use it. For instance, if I want to learn about [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), I can run the following:

In [None]:
?pd.DataFrame

If you want to see all the magic functions that IPython makes available to you, `%quickref` can give you a high-level overview.

In [None]:
%quickref

### Widgets
IPython and Jupyter Notebook also make it easy to build [widgets](https://ipywidgets.readthedocs.io/en/latest/index.html), which give you a richer interface for interacting with the notebook. Try running the code cell below. This code creates two plots and displays them in adjacent tabs.

In [3]:
%matplotlib inline

import matplotlib.pyplot as plt
import ipywidgets as widgets
from scipy.stats import norm, linregress

out = [widgets.Output(), widgets.Output()]
tabs = widgets.Tab(children=[out[0], out[1]])
tabs.set_title(0, 'Linear regression')
tabs.set_title(1, 'Normal distribution')

with out[0]:
    # Fit line to some random data
    x = np.random.uniform(size=30)
    y = x + np.random.normal(scale=0.1, size=30)
    slope, intercept, _, _, _ = linregress(x,y)
    u = np.linspace(0, 1)
    
    # Plot
    fig1, axes1 = plt.subplots()
    axes1.scatter(x, y)
    axes1.plot(u, slope * u + intercept, 'k')
    plt.show()

with out[1]:
    # Plot the probability density function (pdf) of the
    # standard normal distribution.
    x = np.linspace(-3.5, 3.5, num=100)
    p = norm.pdf(x)
    
    # Plot
    fig2, axes2 = plt.subplots()
    axes2.plot(x, p)
    plt.show()

display(tabs)

Tab(children=(Output(), Output()), _titles={'0': 'Linear regression', '1': 'Normal distribution'})

You can create much richer and more complex interfaces that include buttons, sliders, progress bars, and more with Jupyter's ipywidgets library ([docs](https://ipywidgets.readthedocs.io/en/latest/index.html)).

# pandas <a id="pandas"></a>
[pandas](https://pandas.pydata.org/) is a Python library that provides useful data structures and tools for analyzing data.

The fundamental type of the pandas library is the `DataFrame`. In the following code, we load the [iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) using the [seaborn library](https://seaborn.pydata.org/). By default, this dataset is stored in a pandas `DataFrame`.

In [None]:
import pandas as pd
import seaborn as sns

iris = sns.load_dataset('iris')

# `iris` is stored as a pandas DataFrame
print('Type of "iris":', type(iris))

# Show the first few entries in this DataFrame
iris.head()

Let's get some information about the iris dataset. Let's try to do the following:

1. Find out how many columns there are in the `DataFrame` object, and what kinds of data are in each column
2. Calculate the average petal length
3. Determine what species of flowers are in the dataset
4. Get an overall summary of the dataset

In [None]:
# 1. Column labels, and types of data in each column
iris.dtypes

In [None]:
# 2. Calculate the average petal length
iris['petal_length'].mean()

In [None]:
# 3. Determine which iris species are in the dataset
iris['species'].unique()

In [None]:
# 4. Summary of the data
iris.describe()
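The summary from `describe()` pools all three species together, and a per-species summary is often more informative. The sketch below shows one way to get it with `groupby`. To keep it runnable without downloading anything, it uses a small hand-made table called `flowers`; the name and the values are invented for illustration and are not taken from the iris dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for the iris table (values invented)
flowers = pd.DataFrame({
    'species':      ['setosa', 'setosa', 'virginica', 'virginica'],
    'petal_length': [1.4, 1.6, 5.5, 6.5],
    'petal_width':  [0.2, 0.4, 2.0, 2.4],
})

# Average of every numeric column, computed separately for each species
species_means = flowers.groupby('species').mean()
print(species_means)
```

The same call on the real data would be `iris.groupby('species').mean()`.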

Sometimes we need to extract certain rows or columns of a DataFrame. For instance, in the following code we store each species of flower in its own variable:

In [None]:
"""
IPython.display.display is a convenience function that works in Jupyter Notebook
(or, more generally, any IPython-based application) that will show
objects in a nicer way than using print(). We'll use it in this notebook
to show some pandas DataFrames.
"""
from IPython.display import display

"""
Create a DataFrame for each species of flower. I've provided two
methods for creating these DataFrames below; pick whichever you
prefer as they are equivalent.
"""
# Method 1: "query" function
setosa     = iris.query('species == "setosa"')
versicolor = iris.query('species == "versicolor"')

# Method 2: index into the DataFrame
virginica = iris[iris['species'] == 'virginica']

"""
Show the first few entries of the DataFrame corresponding to each species
"""
print('Setosa data:')
display(setosa.head())

print('Versicolor data:')
display(versicolor.head())

print('Virginica data:')
display(virginica.head())
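Both methods also work when you need to filter on more than one condition. The following is a minimal sketch on a hand-made table (the `flowers` name and its values are invented for illustration); with boolean indexing, each condition goes in parentheses and the conditions are joined with `&` (and) or `|` (or).

```python
import pandas as pd

# Hypothetical miniature table (values invented for illustration)
flowers = pd.DataFrame({
    'species':      ['setosa', 'setosa', 'virginica', 'virginica'],
    'petal_length': [1.4, 1.6, 5.5, 6.5],
})

# Boolean indexing: parenthesize each condition and join them with &
long_virginica = flowers[(flowers['species'] == 'virginica')
                         & (flowers['petal_length'] > 6)]

# The same filter written with query()
long_virginica_q = flowers.query('species == "virginica" and petal_length > 6')

print(long_virginica)
```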

To extract a column, we can either use `iris[column_name]` or `iris.iloc[:,column_index]`.

In [None]:
"""
Get the first column.

Note: whenever we extract a single column of a pandas DataFrame,
we get back a pandas Series object. To turn it back into a DataFrame,
we add the line `first_column = pd.DataFrame(first_column)`.
"""
first_column = iris.iloc[:,0]
first_column = pd.DataFrame(first_column)

print('First column:')
display(first_column.head())

"""
Get the first through third columns
"""
first_through_third_columns = iris.iloc[:,0:3]

print('First through third columns:')
display(first_through_third_columns.head())

"""
Get the 'species' column.
"""
species = iris['species']
species = pd.DataFrame(species)

print('Species column:')
display(species.head())

"""
Get all columns *except* the species column
"""
all_but_species = iris.iloc[:, iris.columns != 'species']

print("All columns *except* species:")
display(all_but_species.head())
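There are a few other common ways to write these selections. The sketch below, on a hand-made table with invented values, shows double brackets (which return a `DataFrame` directly instead of a `Series`), label-based selection with `.loc`, and `drop()` for removing a column by name.

```python
import pandas as pd

# Hypothetical miniature table (values invented for illustration)
flowers = pd.DataFrame({
    'sepal_length': [5.1, 7.0, 6.3],
    'petal_length': [1.4, 4.7, 6.0],
    'species':      ['setosa', 'versicolor', 'virginica'],
})

# Double brackets return a DataFrame rather than a Series
species_df = flowers[['species']]

# .loc selects by label; here, every row and two named columns
lengths = flowers.loc[:, ['sepal_length', 'petal_length']]

# drop() returns a copy of the DataFrame without the named column
all_but_species = flowers.drop(columns='species')

print(type(species_df).__name__)
print(list(all_but_species.columns))
```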

If you want to create your own pandas `DataFrame`, you have to specify the names of the columns and the items in the rows of the `DataFrame`.

In [None]:
column_labels = ['A', 'B']

# Each inner list is one row of the DataFrame
row_entries = [
    [1, 2],
    [4, 5],
    [7, 8]
]

pd.DataFrame(row_entries, columns=column_labels)
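If your data is organized by column rather than by row, you can instead pass a dictionary that maps each column label to a list of values. This is a small sketch that builds the same table as the cell above.

```python
import pandas as pd

# Each key is a column label; each list holds that column's values
table = pd.DataFrame({
    'A': [1, 4, 7],
    'B': [2, 5, 8],
})
print(table)
```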

## Introduction to NumPy
[NumPy](https://www.numpy.org/) is another Python package providing useful data structures and mathematical functions. NumPy's fundamental data type is the array, `numpy.ndarray`, which is like a stripped-down version of a pandas `DataFrame`. However, the `numpy.ndarray` supports much faster operations, which makes it a lot more practical for scientific computing than, say, Python's list objects.

In [None]:
import numpy as np

# 1. Create an array with the numbers [1, 2, 3]
x = np.array([1, 2, 3])

# 2. Create a 2 x 2 matrix with [1, 2] in the first row and [3, 4]
#    in the second row.
x = np.array( [[1,2], [3,4]] )

# 3. Create an array with the numbers 0, 1, ... , 9. Equivalent to
#    calling np.array(range(10))
x = np.arange(10)

# 4. Create a 2 x 2 matrix with zeros in all entries
x = np.zeros( (2,2) )

# 5. Get the total number of items in the matrix, and the shape of
#    the matrix.
num_items    = x.size
matrix_shape = x.shape
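Once you have an array, you will usually want to pull pieces out of it. The sketch below shows reshaping and basic indexing; the variable names are only for illustration.

```python
import numpy as np

# Reshape the numbers 0..5 into a matrix with 2 rows and 3 columns
m = np.arange(6).reshape(2, 3)

first_row    = m[0]       # array([0, 1, 2])
last_column  = m[:, -1]   # array([2, 5])
single_entry = m[1, 0]    # 3 (row 1, column 0; indices start at 0)

print(m.shape, m.size)
```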

Besides just providing data structures, though, NumPy provides many mathematical utilities as well.

In [None]:
### Constants: pi
print('π      = %f' % np.pi)
print()

### Simple functions: sine, cosine, e^x, log, ...
print('sin(0) = %f' % np.sin(0))
print('cos(0) = %f' % np.cos(0))
print('e^1    = %f' % np.exp(1))
print('ln(1)  = %f' % np.log(1))
print()

### Minimums, maximums, sums...
x = np.array([1,2,3])
print('Min of [1,2,3] = %d' % x.min())
print('Max of [1,2,3] = %d' % x.max())
print('Sum of [1,2,3] = %d' % x.sum())
print()

### Random numbers: uniform distribution, normal distribution, ...
print('Random numbers:')
print('Uniform([0,1]): %f' % np.random.uniform(0,1))
print('Normal(0,1):    %f' % np.random.normal(loc=0, scale=1))
print('Poisson(1):     %f' % np.random.poisson(1))
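The random numbers above change every time the cell runs. If you need reproducible results, one option is NumPy's `default_rng` with a fixed seed. The sketch below assumes a NumPy version recent enough to provide `np.random.default_rng` (1.17 or later); the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same draws
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

sample_a = rng_a.normal(loc=0, scale=1, size=5)
sample_b = rng_b.normal(loc=0, scale=1, size=5)

print(np.array_equal(sample_a, sample_b))   # True
```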

NumPy is primarily used to do large-scale operations on arrays of numbers. Because it has C code running behind the scenes, it can do these computations extremely quickly -- much faster than you could do with regular Python code. Among other things, with NumPy you can

* add a number to every element of an array;
* multiply every element of an array by a number;
* add or multiply two arrays together; or
* calculate a matrix-vector or matrix-matrix product between arrays.

In [None]:
x = np.array([1,2,3])
y = np.array([4,5,6])

print('1 + [1,2,3] =', 1 + x)
print('3 * [1,2,3] =', 3 * x)
print('[1,2,3] * [4,5,6] =', x * y)
print('[1,2,3] + [4,5,6] =', x + y)
print('Dot product of [1,2,3] and [4,5,6] =', x.dot(y))
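The cell above covers element-wise arithmetic and the dot product of two vectors. For the matrix-vector and matrix-matrix products mentioned in the list, here is a small sketch using the `@` operator (Python 3.5 or later). Note that `*` between two matrices multiplies entry by entry; it is not a matrix product.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
v = np.array([1, 1])

matrix_vector = A @ v   # [1*1 + 2*1, 3*1 + 4*1] = [3, 7]
matrix_matrix = A @ A   # [[7, 10], [15, 22]]
elementwise   = A * A   # [[1, 4], [9, 16]]; entry-by-entry, not a matrix product

print(matrix_vector)
print(matrix_matrix)
```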

## Interactive data visualization with Plotly <a id="plotly"></a>
Python has a *massive* number of libraries that can be used for data visualization; [this article](https://www.anaconda.com/blog/developer-blog/python-data-visualization-2018-why-so-many-libraries/) gives a high-level overview of many of them. [matplotlib](https://matplotlib.org/) is the most commonly used of these libraries. However, in this workshop we'll look at a different library, [Plotly](https://plot.ly/python/). Whereas matplotlib is primarily for static visualizations, Plotly is designed to support interactive graphs.

There are two ways to use Plotly: [online mode](https://plot.ly/python/getting-started/#initialization-for-online-plotting) and [offline mode](https://plot.ly/python/getting-started/#initialization-for-offline-plotting). In online mode, you can save graphs and share them with others using Plotly's web services. To make things simpler, we're going to use Plotly offline. Transitioning from one mode to the other is fairly straightforward; most of the code in this section will remain the same regardless of which mode you are using.

In [None]:
import plotly.offline    as py
import plotly.graph_objs as go

# Allow plotly to display plots inside the notebook
py.init_notebook_mode(connected=True)

Plotly offers a _massive_ variety of plots beyond just scatter plots and heatmaps. You can look at [the introduction to Plotly in Python](https://plot.ly/python/) to get an idea of what's possible in Plotly.

To help you understand the general Plotly API, I'll cover just a few of these plots. The simplest is a basic two-dimensional scatter plot.

In [None]:
"""
Create some random data with numpy
"""
x = np.random.normal(size=2000)
y = np.random.normal(size=x.size)

"""
Create a scatter plot with plotly.graph_objs.Scatter

mode = 'markers' tells plotly to show all of the data points as
individual dots.
"""
trace = go.Scatter(
    x = x,
    y = y,
    mode = 'markers')

"""
Create a Python list that contains all the traces that we want to show.
"""
data = [trace]

"""
Plot the data. py.iplot() tells Plotly to show the graph inside of the
Jupyter Notebook.
"""
py.iplot(data)

Alternatively, you can use `py.plot()` to create a webpage that displays this graph.

In [None]:
# Creates an HTML page. By default the web page is stored in temp-plot.html. If you
# want to save it to a different file, use `py.plot(data, filename="my_filename.html")`.
py.plot(data)

We can overlay multiple plots on the same axes by giving `py.iplot()` or `py.plot()` a list containing two or more graph objects.

In [None]:
# Create some more random data with numpy
x = np.random.normal(size=2000)
y = np.random.normal(size=x.size)
u = np.random.uniform(-1, 1, size=300)
v = np.random.uniform(-1, 1, size=u.size)

# First plot
trace1 = go.Scatter(
    x = x,
    y = y,
    mode = 'markers',
    
    # Set the style of the points in this plot
    marker = {
        # Color of the points
        'color': 'black',
        
        # Affects how large the points will be
        'size': 3,
        
        # Make the points a little less dark by making them partially
        # transparent
        'opacity': .3
    })

# Second plot
trace2 = go.Scatter(
    x=u,
    y=v,
    mode='markers')

# Before calling py.plot or py.iplot, we have to put all the graphs
# we want to show in a list. In this example we want to overlay two
# plots, so the `data` list contains both traces
data = [trace1, trace2]

py.iplot(data)

Now let's create two histograms with `plotly.graph_objs.Histogram()`, and see how they look when we try to put them in the same plot.

In [None]:
# Create some random data with numpy
x1 = np.random.binomial(n=300, p=0.55, size=2000)
x2 = np.random.binomial(n=300, p=0.45, size=2000)

hist1 = go.Histogram(x=x1, name='Dataset #1')
hist2 = go.Histogram(x=x2, name='Dataset #2')
data  = [hist1, hist2]

py.iplot(data)

The default behavior of Plotly is to plot the bars from both histograms next to one another, which makes the two distributions hard to compare. Another approach is to overlay the bars on top of one another, which makes each histogram look a little more continuous.

To make this modification, we have to explicitly tell Plotly what changes we want to make to the plot layout. We do this with a `plotly.graph_objs.Layout` object, which we then pass to a `plotly.graph_objs.Figure` object.

In [None]:
"""
Create the histograms again. We set opacity=0.75 to make the bars partially
see-through, which will make it easier to overlay the plots.
"""
hist1 = go.Histogram(x=x1, opacity=0.75, name='Dataset #1')
hist2 = go.Histogram(x=x2, opacity=0.75, name='Dataset #2')
data = [hist1, hist2]

"""
Create a plotly.graph_objs.Layout object. We specify barmode=overlay so that
plotly will place bars from both histograms on top of one another.
"""
layout = go.Layout(barmode='overlay', title='Some example histograms')

"""
Instead of using py.iplot(data) as we have in the last few examples, we
first create a Figure object, and specify how the figure should be
formatted by setting layout=layout.
"""
fig = go.Figure(data=data, layout=layout)

py.iplot(fig)

Plotly also offers a lot of options for 3D graphs. For instance, we can use `plotly.graph_objs.Scatter3d` to create a three-dimensional scatter plot.

In [None]:
# Create some random data with numpy
C = [[1, 0, 0],[0, 4, 2],[0, 2, 3]]
x, y, z = np.random.multivariate_normal([0,0,0], C, 400).T

"""
Before we create the scatter plot, we're going to create a dictionary
that will specify attributes of the points in our plot. We're going to
set the following parameters:

- 'size': how large the points should be.
- 'colorscale': specifies a color gradient that should be used to color
  the points. 'colorscale': 'viridis' creates a color gradient going
  from dark purple through green to yellow.
- 'color': an array of numbers (one number for each point in the scatter
  plot) that tells plotly how to color each point. Higher values indicate
  that we should color points on the higher end of the gradient specified
  by 'colorscale', and lower values correspond to colors on the lower end
  of the gradient.
  Here I've chosen 'color': z, telling plotly to color points according to
  the magnitude of their z-coordinate.
"""
marker_dict = {
    'size': 4,
    'colorscale': 'viridis',
    'color': z
}

"""
Create the scatter plot.
"""
trace = go.Scatter3d(
    x=x,
    y=y,
    z=z,
    mode='markers',
    marker=marker_dict)
data=[trace]

"""
Set the margins of the plot, i.e. the amount of whitespace surrounding
the plot.

Creating a Layout here isn't strictly necessary, but it does make the
plot a little bigger and easier to see in this notebook. If you really
want to see the difference, change the line py.iplot(fig) to py.plot(fig)
to open the graph in a different browser tab. Then try changing
go.Figure(data=data, layout=layout) to go.Figure(data=data) and look at
how the two plots differ from one another.
"""
layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    )
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig)

There's a wide range of other things that Plotly can help you visualize beyond typical scatter plots, line plots, histograms, and so on. For instance, Plotly can be used to draw two- and three-dimensional networks.

In [None]:
"""
Based on the tutorial in https://plot.ly/python/network-graphs/
"""

import networkx as nx
    
"""
Create a random graph
"""
G   = nx.random_geometric_graph(200,0.125)
pos = nx.get_node_attributes(G,'pos')

"""
Get all of the edges between nodes in the graph
"""
edge_x = ()
edge_y = ()
for edge in G.edges():
    x0, y0 = G.nodes[edge[0]]['pos']
    x1, y1 = G.nodes[edge[1]]['pos']
    edge_x += tuple([x0, x1, None])
    edge_y += tuple([y0, y1, None])

"""
Plot all of the edges of the graph
"""
edge_trace = go.Scatter(
    # Each entry of edge_x contains the starting and ending x
    # coordinates of an edge
    x=edge_x,
    
    # Each entry of edge_y contains the starting and ending y
    # coordinates of an edge
    y=edge_y,
    
    # The edges should appear as lines
    mode = 'lines',
    
    # Settings for how the lines should appear
    line = {
        'width': 0.5,
        'color': 'black',
    },
    
    # Don't display any information when the user hovers their
    # mouse over an edge
    hoverinfo = 'none')

"""
Get the x- and y-coordinates of the nodes in the graph
"""
node_x, node_y = (), ()
for node in G.nodes():
    x, y = G.nodes[node]['pos']
    node_x += tuple([x])
    node_y += tuple([y])

"""
Store the number of connections that each node has. Also, add a
label for each node in the format of '# of connections: <number>'
"""
node_connections = ()
node_labels      = ()
for adjacencies in G.adjacency():
    node_connections += tuple([len(adjacencies[1])])
    node_labels += tuple(['# of connections: %d' % len(adjacencies[1])])

"""
Plot all of the nodes in the graph
"""
node_trace = go.Scatter(
    # x-positions of the nodes
    x = node_x,
    
    # y-positions of the nodes
    y = node_y,
    
    # Show the labels of the nodes when the user hovers their mouse over them
    hoverinfo='text',
    
    # The labels that should appear over the nodes
    text=node_labels,
    
    # Each node should appear as a point
    mode='markers',
    marker = {
        'showscale': True,
        
        # colorscale options, via https://plot.ly/python/network-graphs/
        #'Greys' | 'YlGnBu' | 'Greens' | 'YlOrRd' | 'Bluered' | 'RdBu' |
        #'Reds' | 'Blues' | 'Picnic' | 'Rainbow' | 'Portland' | 'Jet' |
        #'Hot' | 'Blackbody' | 'Earth' | 'Electric' | 'Viridis' |
        'colorscale': 'YlGnBu',
        'reversescale': True,
        
        # Color the nodes by the number of connections they have. Nodes with
        # more connections are colored more darkly
        'color': node_connections,
        
        # Size of the nodes
        'size': 10,
        
        # Show a bar next to the graph that shows how colors of the nodes
        # correspond to the number of connections they have
        'colorbar': {
            'thickness': 15,
            'title': 'Node Connections',
            'xanchor': 'left',
            'titleside': 'right'
         },
    },
    line = {
        'width': 2
    })

fig = go.Figure(data=[edge_trace, node_trace],
             layout=go.Layout(
                title='Network graph in Plotly',
                titlefont=dict(size=16),
                showlegend=False,
                hovermode='closest',
                margin=dict(b=20,l=5,r=5,t=40),
                xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)))

py.iplot(fig)

# Linear regression with scikit-learn and statsmodels
In the last section of the notebook, we're going to use linear regression to try to predict the petal length of each iris from its sepal length, sepal width, and petal width.

Before we even start using linear regression, let's look at how each of these variables is related to the others. Below we plot each pair of variables against one another, with the color of the points reflecting which species we're looking at. On the diagonal are box-and-whisker plots that tell us a little about how each of the variables is distributed.

In [None]:
import plotly.figure_factory as ff

data = ff.create_scatterplotmatrix(iris, diag='box', index='species', height=800,
            width=800, opacity=0.75, title='Comparison of pairs of variables in the iris dataset')
py.iplot(data)

The most important conclusion we can draw from these plots is that each pair of variables is roughly linearly related. As a result, we can expect that we should be able to accurately predict petal length from the other three variables using linear regression.

If you just want to do a basic regression and get the coefficients for each variable, you can use `LinearRegression` from the scikit-learn library:

In [None]:
from sklearn.linear_model import LinearRegression

"""
Get all of the irises of the species "setosa" and place them in
a pandas DataFrame called `data`.

Also try with 'setosa' replaced by 'versicolor' and 'virginica'
"""
data = iris[iris['species'] == 'setosa']
    
"""
Split the data into two pieces: the independent variables
(sepal_length, sepal_width, and petal_width), and the dependent
variable (petal_length).
"""
x = data[['sepal_length','sepal_width','petal_width']]
y = data['petal_length']

"""
Create a scikit-learn LinearRegression object which we will
fit to the data.
"""
lm = LinearRegression()
lm.fit(x,y)

for (coef,col) in zip(lm.coef_, x.columns):
    print("%-30s %+.3f" % ("Coefficient for " + col + ":", coef))
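To make it clearer what the fit is computing, here is a sketch of ordinary least squares written directly in NumPy. This is not scikit-learn's internal code; it only illustrates the same least-squares problem. The data is synthetic, and the "true" coefficients (2, 1.5, and -0.5) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 1.5*x1 - 0.5*x2 + small noise
X = rng.normal(size=(200, 2))
y = 2 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.01, size=200)

# Prepend a column of ones so the first coefficient is the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ beta ≈ y
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coefs = beta[0], beta[1:]

print('Intercept:   ', intercept)
print('Coefficients:', coefs)
```

The recovered values should be close to the ones used to generate the data. With scikit-learn, the corresponding quantities are `lm.intercept_` and `lm.coef_`.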

However, when we run a linear regression we also want to find out other things about our linear model. For instance, we might want to get a confidence interval for each coefficient in the model. The [StatsModels library](https://www.statsmodels.org/stable/index.html) gives us this functionality with `statsmodels.api.OLS` (available as `sm.OLS` after `import statsmodels.api as sm`):

In [None]:
import statsmodels.api as sm

# Also try with 'setosa' replaced by 'versicolor' and 'virginica'
data = iris[iris['species'] == 'setosa']

x = data[['sepal_length','sepal_width','petal_width']]
y = data['petal_length']
    
"""
Add y-intercept term to our linear model
"""
x = sm.add_constant(x)
    
"""
Perform linear regression
"""
lm      = sm.OLS(y,x)
results = lm.fit()

"""
Print the results
"""
print(results.summary())
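The summary table reports a standard error and a confidence interval for each coefficient. As a rough sketch of where those numbers come from, the code below computes them by hand with NumPy on synthetic data. It uses the normal quantile 1.96 as an approximation; statsmodels uses the corresponding t-distribution quantile, so its intervals will differ slightly, especially for small samples. The data and the variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y = 1 + 3*x + noise with standard deviation 0.5
n = 500
x = rng.uniform(size=n)
y = 1 + 3 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])       # intercept column + predictor
beta = np.linalg.solve(X.T @ X, X.T @ y)   # OLS coefficients

residuals = y - X @ beta
dof = n - X.shape[1]                       # observations minus parameters
sigma2 = residuals @ residuals / dof       # estimated noise variance
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# Approximate 95% interval for each coefficient
lower = beta - 1.96 * std_err
upper = beta + 1.96 * std_err
print(np.column_stack([lower, beta, upper]))
```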

You can also use R-style formulas to specify what variables you want to use to perform linear regression. 

In [None]:
import statsmodels.formula.api as smf

df = iris[iris['species'] == 'setosa']

# Predict petal length from sepal_width, sepal_length, and petal_width
model = smf.ols(formula='petal_length ~ sepal_width + sepal_length + petal_width',
                data=df)
results = model.fit()

print("Results (petal_length ~ sepal_width + sepal_length + petal_width)")
print(results.summary())
print('-' * 80)

# Predict petal length from just sepal_length and petal_width
model = smf.ols(formula='petal_length ~ sepal_length + petal_width',
                data=df)
results = model.fit()

print("Results (petal_length ~ sepal_length + petal_width)")
print(results.summary())

## Additional References <a id="additional-references"></a>

* O'Reilly provides a couple of good books that go in-depth about these tools and more:
  * [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do)
  * [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) -- this book was published in 2012 and may be slightly dated. However, the author provides some Jupyter Notebooks for free in [this repository](https://github.com/wesm/pydata-book) that you may find helpful.
* Check out the full documentation for Jupyter on the [Project Jupyter site](https://jupyter.org/documentation).
* Plotting tools:
  * Plotly
    * [Documentation](https://plot.ly/python/reference/)
    * [Examples](https://plot.ly/python/)
  * Matplotlib
    * [Documentation](https://matplotlib.org/contents.html)
    * [Tutorials](https://matplotlib.org/tutorials/index.html)
  * Seaborn
    * [Documentation](https://seaborn.pydata.org/api.html)
    * [Introduction](https://seaborn.pydata.org/introduction.html)
* [Statsmodels documentation](https://www.statsmodels.org/stable/index.html)