# STAT4609 Example Class 1 (I) {-}
# Introduction to Data Science in Python {-}

This notebook helps introduce some of the most basic tools that are commonly used for doing data science and statistics in Python.

# Table of contents {-}
* [Jupyter Notebook](#Jupyter-Notebook)
* [pandas](#pandas)
* [NumPy](#Introduction-to-NumPy)
* [Additional references](#additional-references)

**Note: you will need to run the following code cell every time you restart this notebook**

If this is your first time using Jupyter, click the block of code below and either press the <i class="fa-step-forward fa"></i><span class="toolbar-btn-label">Run</span> button or press `Shift + Enter` on your keyboard.

In [None]:
from platform import python_version

print(python_version())

3.6.9


In [None]:
import pandas               as pd
import matplotlib.pyplot    as plt
import numpy                as np
import seaborn              as sns
from   sklearn.linear_model import LinearRegression

from IPython.display import display

iris = sns.load_dataset('iris')


# Jupyter Notebook <a id="Jupyter-Notebook"></a>
[Jupyter Notebook](https://jupyter.org/) is an interactive tool for running code and visualizing data. Each notebook consists of a series of _code cells_ and _Markdown cells_.

* Code cells allow you to run code in a number of languages. Behind the scenes, Jupyter runs a "kernel" that processes the code whenever you execute a cell. Since this is a Python notebook, Jupyter is running the [IPython](https://ipython.org/) kernel. However, kernels also exist for Julia, R, and many other languages.
* Markdown cells display text using the [Markdown language](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html). In addition to displaying text, you can write equations in these cells using $\LaTeX$.

To run code, click a code cell (like the one below) and do one of the following:
* Press `Shift + Enter` on your keyboard
* On the toolbar at the top of this notebook, press the <i class="fa-step-forward fa"></i><span class="toolbar-btn-label">Run</span> button.

In [None]:
print("Hello, world!")

Hello, world!


You can render a markdown cell in the same way. Double click the text below, and try putting in some of the following items:

    # This is a large heading!
    ## This is a smaller heading!
    ### This is an even smaller heading!
    Here is some code: `x = y + z`
    And here is an equation: $x = y + z$

## Cell magic
The IPython kernel provides some useful tools for programmers, including

* [Magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html), which allow you to do things like look up documentation and past commands that you've run, and
* [GUI event loop support](https://ipython.org/ipython-doc/stable/interactive/reference.html#gui-event-loop-support), which makes it easier to build graphical user interfaces (GUIs) for interacting with your code.

Here's an example of a useful IPython shortcut. Placing `?` before (or after) the name of a library, class, or function looks up its documentation to help you figure out how to use it. For instance, to learn about [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), you can run the following:

In [None]:
?pd.DataFrame
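
The `?` syntax only works inside IPython and Jupyter. As a rough equivalent for plain Python scripts, the sketch below reads a docstring with the standard library's `inspect.getdoc`. The choice of `pd.DataFrame.head` as the object to inspect is just an illustration.

```python
import inspect

import pandas as pd

# inspect.getdoc returns the cleaned-up docstring of an object as a string.
doc = inspect.getdoc(pd.DataFrame.head)

# Print only the first few lines to keep the output short.
first_lines = doc.splitlines()[:3]
print("\n".join(first_lines))
```

The built-in `help(pd.DataFrame.head)` prints the same documentation in full.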

If you want to see all the magic functions that IPython makes available to you, `%quickref` can give you a high-level overview.

In [None]:
%quickref

## Widgets
IPython and Jupyter Notebook also make it easy to build [widgets](https://ipywidgets.readthedocs.io/en/latest/index.html), which give you a richer interface for interacting with the notebook. Try running the code cell below. This code creates two plots and displays them in adjacent tabs.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import ipywidgets as widgets
from scipy.stats import norm, linregress

out = [widgets.Output(), widgets.Output()]
tabs = widgets.Tab(children=[out[0], out[1]])
tabs.set_title(0, 'Linear regression')
tabs.set_title(1, 'Normal distribution')

with out[0]:
    # Fit line to some random data
    x = np.random.uniform(size=30)
    y = x + np.random.normal(scale=0.1, size=30)
    slope, intercept, _, _, _ = linregress(x,y)
    u = np.linspace(0, 1)
    
    # Plot
    fig1, axes1 = plt.subplots()
    axes1.scatter(x, y)
    axes1.plot(u, slope * u + intercept, 'k')
    plt.show()

with out[1]:
    # Plot the probability density function (pdf) of the
    # standard normal distribution.
    x = np.linspace(-3.5, 3.5, num=100)
    p = norm.pdf(x)
    
    # Plot
    fig2, axes2 = plt.subplots()
    axes2.plot(x, p)
    plt.show()

display(tabs)

Tab(children=(Output(), Output()), _titles={'0': 'Linear regression', '1': 'Normal distribution'})

You can create much richer and more complex interfaces that include buttons, sliders, progress bars, and more with Jupyter's ipywidgets library ([docs](https://ipywidgets.readthedocs.io/en/latest/index.html)).

# Pandas <a id="pandas"></a>
[pandas](https://pandas.pydata.org/) is a Python library that provides useful data structures and tools for analyzing data.

The fundamental type of the pandas library is the `DataFrame`. In the following code, we load the [iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) using the [seaborn library](https://seaborn.pydata.org/). By default, this dataset is stored in a pandas `DataFrame`.

In [None]:
iris = sns.load_dataset('iris')

# `iris` is stored as a pandas DataFrame
print('Type of "iris":', type(iris))

# Show the first few entries in this DataFrame
iris.head()

Type of "iris": <class 'pandas.core.frame.DataFrame'>


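|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |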


Let's get some information about the iris dataset. Let's try to do the following:

1. Find out how many columns there are in the `DataFrame` object, and what kinds of data are in each column
2. Calculate the average petal length
3. Determine what species of flowers are in the dataset
4. Get an overall summary of the dataset

In [None]:
# 1. Column labels, and types of data in each column
iris.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object
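
`dtypes` lists the columns and their types, but it does not report the column count directly. The sketch below shows one way to get the count and the labels, using a tiny hand-made frame (`df`, with illustrative values) in place of `iris` so that it runs without downloading anything.

```python
import pandas as pd

# A tiny hand-made frame standing in for `iris` (values are illustrative).
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

n_rows, n_cols = df.shape        # (number of rows, number of columns)
column_names = list(df.columns)  # column labels as a plain Python list

print(n_rows, n_cols)
print(column_names)
```

The same two attributes, `iris.shape` and `iris.columns`, apply to the full dataset.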

In [None]:
# 2. Calculate the average petal length
iris['petal_length'].mean()

In [None]:
# 3. Determine which iris species are in the dataset
iris['species'].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [None]:
# 4. Summary of the data
iris.describe()

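|       | sepal_length | sepal_width | petal_length | petal_width |
|-------|--------------|-------------|--------------|-------------|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean  | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std   | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min   | 4.3 | 2.0 | 1.0 | 0.1 |
| 25%   | 5.1 | 2.8 | 1.6 | 0.3 |
| 50%   | 5.8 | 3.0 | 4.35 | 1.3 |
| 75%   | 6.4 | 3.3 | 5.1 | 1.8 |
| max   | 7.9 | 4.4 | 6.9 | 2.5 |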
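
`describe()` summarizes the whole dataset at once. If you want a statistic per species instead, `groupby` is one option. The following is a minimal sketch on a made-up four-row frame (the numbers are not taken from the iris data).

```python
import pandas as pd

# Made-up values, chosen so the group means are easy to check by hand.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.0, 2.0, 5.0, 7.0],
})

# Group rows by species, then average petal_length within each group.
mean_by_species = df.groupby("species")["petal_length"].mean()
print(mean_by_species)
```

On the real dataset the analogous call would be `iris.groupby('species')['petal_length'].mean()`.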


Sometimes we need to extract certain rows or columns of a DataFrame. For instance, in the following code we store each species of flower in its own variable:

In [None]:
"""
IPython.display.display is a convenience function available in Jupyter
Notebook (or, more generally, any IPython-based application) that shows
objects in a nicer way than print(). We'll use it in this notebook
to show some pandas DataFrames.
"""
from IPython.display import display

"""
Create a DataFrame for each species of flower. I've provided two
methods for creating these DataFrames below; pick whichever you
prefer as they are equivalent.
"""
# Method 1: "query" function
setosa     = iris.query('species == "setosa"')
versicolor = iris.query('species == "versicolor"')

# Method 2: index into the DataFrame
virginica = iris[iris['species'] == 'virginica']

"""
Show the first few entries of the DataFrame corresponding to each species
"""
print('Setosa data:')
display(setosa.head())

print('Versicolor data:')
display(versicolor.head())

print('Virginica data:')
display(virginica.head())

To extract a column, we can either use `iris[column_name]` or `iris.iloc[:,column_index]`.

In [None]:
"""
Get the first column.

Note: whenever we extract a single column of a pandas DataFrame,
we get back a pandas Series object. To turn it back into a DataFrame,
we add the line `first_column = pd.DataFrame(first_column)`.
"""
first_column = iris.iloc[:,0]
first_column = pd.DataFrame(first_column)

print('First column:')
display(first_column.head())

"""
Get the first through third columns
"""
first_through_third_columns = iris.iloc[:,0:3]

print('First through third columns:')
display(first_through_third_columns.head())

"""
Get the 'species' column.
"""
species = iris['species']
species = pd.DataFrame(species)

print('Species column:')
display(species.head())

"""
Get all columns *except* the species column
"""
all_but_species = iris.iloc[:, iris.columns != 'species']

print("All columns *except* species:")
display(all_but_species.head())
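
The cell above selects columns by position with `iloc`. A label-based alternative is sketched below, assuming a small stand-in frame `df` with illustrative values. It uses double brackets, `.loc`, and `drop`, all of which are standard pandas calls.

```python
import pandas as pd

# Small stand-in for `iris` (values are illustrative).
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

# Double brackets keep the result a DataFrame instead of a Series.
species_frame = df[["species"]]

# .loc selects by label: all rows, two named columns.
lengths = df.loc[:, ["sepal_length", "petal_length"]]

# drop(columns=...) returns a new DataFrame without the named column.
all_but_species = df.drop(columns="species")

print(type(species_frame))
print(all_but_species)
```

Selecting by label tends to be more robust than selecting by position, because it keeps working if the column order changes.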

If you want to create your own pandas `DataFrame`, you have to specify the names of the columns and the items in the rows of the `DataFrame`.

In [None]:
column_labels = ['A', 'B']

# Each inner list is one row of the DataFrame
rows = [
    [1, 2],
    [4, 5],
    [7, 8]
]

pd.DataFrame(rows, columns=column_labels)
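
Another common way to build the same table is to pass a dictionary that maps each column label to that column's values. This is a minimal sketch of that alternative, not a replacement for the row-based version above.

```python
import pandas as pd

# Each dictionary key becomes a column label;
# each list holds that column's values, top to bottom.
df = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8]})
print(df)
```

Which form is more convenient depends on whether your data arrives row by row or column by column.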

# Introduction to NumPy <a id="Introduction-to-NumPy"></a>
[NumPy](https://www.numpy.org/) is another Python package providing useful data structures and mathematical functions. NumPy's fundamental data type is the array, `numpy.ndarray`, an n-dimensional grid of values that all share one data type. You can think of it as a stripped-down version of a pandas `DataFrame`, since it has no row or column labels. In exchange, operations on a `numpy.ndarray` are much faster than the same operations on, say, Python's list objects, which makes it far more practical for scientific computing.

In [None]:
import numpy as np

# 1. Create an array with the numbers [1, 2, 3]
x = np.array([1, 2, 3])

# 2. Create a 2 x 2 matrix with [1, 2] in the first row and [3, 4]
#    in the second row.
x = np.array( [[1,2], [3,4]] )

# 3. Create an array with the numbers 0, 1, ... , 9. Equivalent to
#    calling np.array(range(10))
x = np.arange(10)

# 4. Create a 2 x 2 matrix with zeros in all entries
x = np.zeros( (2,2) )

# 5. Get the total number of items in the matrix, and the shape of
#    the matrix.
num_items    = x.size
matrix_shape = x.shape
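
To get a feel for the speed difference between arrays and lists, here is a rough timing sketch using the standard library's `timeit`. The array size and repeat count are arbitrary choices, and the measured times will vary from machine to machine.

```python
import timeit

import numpy as np

n = 100_000
values_list = list(range(n))
# int64 is requested explicitly so the squares cannot overflow.
values_array = np.arange(n, dtype=np.int64)

def square_with_list():
    # Pure Python: one multiplication per loop iteration.
    return [v * v for v in values_list]

def square_with_numpy():
    # NumPy: one vectorized multiplication over the whole array.
    return values_array * values_array

list_seconds = timeit.timeit(square_with_list, number=10)
numpy_seconds = timeit.timeit(square_with_numpy, number=10)

print(f"list comprehension: {list_seconds:.4f} s")
print(f"NumPy array:        {numpy_seconds:.4f} s")
```

Both functions compute the same squares. Only the time taken differs.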

Besides just providing data structures, though, NumPy provides many mathematical utilities as well.

In [None]:
### Constants: pi
print('π      = %f' % np.pi)
print()

### Simple functions: sine, cosine, e^x, log, ...
print('sin(0) = %f' % np.sin(0))
print('cos(0) = %f' % np.cos(0))
print('e^1    = %f' % np.exp(1))
print('ln(1)  = %f' % np.log(1))
print()

### Minimums, maximums, sums...
x = np.array([1,2,3])
print('Min of [1,2,3] = %d' % x.min())
print('Max of [1,2,3] = %d' % x.max())
print('Sum of [1,2,3] = %d' % x.sum())
print()

### Random numbers: uniform distribution, normal distribution, ...
print('Random numbers:')
print('Uniform([0,1]): %f' % np.random.uniform(0,1))
print('Normal(0,1):    %f' % np.random.normal(loc=0, scale=1))
print('Poisson(1):     %d' % np.random.poisson(1))
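
The random numbers above change on every run. If you need reproducible results, one option is NumPy's `Generator` interface with a fixed seed, sketched below. The seed value `0` is an arbitrary choice.

```python
import numpy as np

# Two generators created with the same seed produce the same stream.
rng_a = np.random.default_rng(seed=0)
rng_b = np.random.default_rng(seed=0)

sample_a = rng_a.normal(loc=0.0, scale=1.0, size=3)
sample_b = rng_b.normal(loc=0.0, scale=1.0, size=3)

print(sample_a)
print(np.array_equal(sample_a, sample_b))
```

Fixing the seed is useful when you want someone else to be able to rerun a notebook and see the same numbers.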

NumPy is primarily used to do large-scale operations on arrays of numbers. Because it has C code running behind the scenes, it can do these computations extremely quickly -- much faster than you could do with regular Python code. Among other things, with NumPy you can

* add a number to every element of an array;
* multiply every element of an array by a number;
* add or multiply two arrays together, element by element; or
* calculate a matrix-vector or matrix-matrix product between arrays.

In [None]:
x = np.array([1,2,3])
y = np.array([4,5,6])

print('1 + [1,2,3] =', 1 + x)
print('3 * [1,2,3] =', 3 * x)
print('[1,2,3] * [4,5,6] =', x * y)
print('[1,2,3] + [4,5,6] =', x + y)
print('Dot product of [1,2,3] and [4,5,6] =', x.dot(y))
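
The cell above covers element-wise operations and the dot product of two vectors. For the matrix-vector and matrix-matrix products in the list, here is a short sketch using the `@` operator. The matrices `A` and `B` and the vector `v` are made up for illustration.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])
v = np.array([1, 1])

matrix_vector = A @ v  # matrix-vector product
matrix_matrix = A @ B  # matrix-matrix product
elementwise = A * B    # element-wise product, NOT a matrix product

print('A @ v =', matrix_vector)
print('A @ B =\n', matrix_matrix)
print('A * B =\n', elementwise)
```

Note that `*` between two arrays multiplies entry by entry, while `@` (equivalently `np.matmul`) performs the linear-algebra product.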

# Additional References <a id="additional-references"></a>

* O'Reilly publishes a couple of good books that go in-depth about these tools and more:
  * [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do)
  * [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) -- this book was published in 2012 and may be slightly dated. However, the author provides some Jupyter Notebooks for free in [this repository](https://github.com/wesm/pydata-book) that you may find helpful.
* Check out the full documentation for Jupyter on the [Project Jupyter site](https://jupyter.org/documentation).
* Plotting tools:
  * **Matplotlib** (Highly Recommended!)
    * [**Documentation**](https://matplotlib.org/contents.html)
    * [**Tutorials**](https://matplotlib.org/tutorials/index.html)
  * Plotly
    * [Documentation](https://plot.ly/python/reference/)
    * [Examples](https://plot.ly/python/)
  * Seaborn (note: its API differs considerably between versions, which can be confusing)
    * [Documentation](https://seaborn.pydata.org/api.html)
    * [Introduction](https://seaborn.pydata.org/introduction.html)
* [scikit-learn documentation](https://scikit-learn.org/stable/index.html)