# Machine Learning with Python

#### Collaboratory workshop, 02/20/2018


This is a notebook developed throughout the first day of the Collaboratory Workshop, _Machine Learning with Python_. For more information, go to the workshop home page:

https://github.com/QCB-Collaboratory/W17.MachineLearning/wiki/Day-1


## Python nano-review

In the following we will review very basic concepts from Python and learn how to use Jupyter Notebooks.

In [1]:
x = 5*3 + 2

In [2]:
y = 7*x**2 + 0.6*x
z = 1/x
y*z

119.6

In [3]:
x = range(10)
for element in x:
    print("Element",element)

Element 0
Element 1
Element 2
Element 3
Element 4
Element 5
Element 6
Element 7
Element 8
Element 9


This is some text. We can even use formulas:
$$ \theta = \frac{[L]^n}{K_d+[L]^n}$$
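
To make the formula concrete, here is a minimal sketch that evaluates it in plain Python. The function name `hill_fraction` and the argument names are hypothetical, and reading $[L]$ as a ligand concentration and $K_d$ as a dissociation-like constant is an assumption made only for illustration.

```python
def hill_fraction(ligand, k_d, n):
    """Evaluate theta = L^n / (K_d + L^n) for a single concentration.

    `ligand`, `k_d` and `n` are illustrative names, not part of the workshop material.
    """
    return ligand**n / (k_d + ligand**n)


# With L = 2, K_d = 4 and n = 2 we get 4 / (4 + 4)
theta = hill_fraction(ligand=2.0, k_d=4.0, n=2)
print(theta)  # 0.5
```

Whenever $[L]^n$ equals $K_d$, the numerator is exactly half of the denominator, so $\theta = 0.5$.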

In [4]:
print(y*z)

119.6


## Introducing NumPy and Matplotlib

Next, we will explore two key libraries used not only for Machine Learning, but for any quantitative project written in Python. NumPy is a library for optimized array and matrix operations, while Matplotlib allows you to create virtually any visualization (2-D plots, scatter plots, bar plots, etc.).

Let's start with NumPy.

In [5]:
import numpy

In [6]:
a = [1,2,3,4,5]
print( type(a) )   # displays the type of the variable a

b = numpy.array( [1,2,3,4,5] )
print( type(b) )   # displays the type of the variable b

<class 'list'>
<class 'numpy.ndarray'>


In [7]:
import numpy as np  # allows user to use np instead of numpy

c = np.array( [5,4,3,2,1] )
print( type(c) )    # displays the type of the variable c

<class 'numpy.ndarray'>


Using mathematical operations on numpy arrays:

In [8]:
print("Sum of lists: ", a + a)

print("Sum of NumPy arrays: ", b + c )

Sum of lists:  [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
Sum of NumPy arrays:  [6 6 6 6 6]


In [9]:
print("Original array: ", b)
print("Summing a constant: ", b + 7.1 )
print("Multiplying by a constant: ", b*2.5 )
print("Power function: ", b**3 )
print("Element-wise product: ", b*c )
print("Element-wise division: ", b/c )
print("Element-wise modulo: ", b%c )

Original array:  [1 2 3 4 5]
Summing a constant:  [ 8.1  9.1 10.1 11.1 12.1]
Multiplying by a constant:  [ 2.5  5.   7.5 10.  12.5]
Power function:  [  1   8  27  64 125]
Element-wise product:  [5 8 9 8 5]
Element-wise division:  [0.2 0.5 1.  2.  5. ]
Element-wise modulo:  [1 2 0 0 0]


In [10]:
print("Exponential: ", np.exp(b) )
print("Sine: ", np.sin(b) )
print("Cosine: ", np.cos(b) )

Exponential:  [  2.71828183   7.3890561   20.08553692  54.59815003 148.4131591 ]
Sine:  [ 0.84147098  0.90929743  0.14112001 -0.7568025  -0.95892427]
Cosine:  [ 0.54030231 -0.41614684 -0.9899925  -0.65364362  0.28366219]


Selecting elements and sets of elements from your numpy array:

In [11]:
print("First element in b: ", b[0] )
print("Second element in b: ", b[1] )
print("Last element in b: ", b[-1] )
print("All elements in b: ", b[:] )
print("Elements from 1 to 3: ", b[1:4] ) # last one doesn't count!
print("From 0 to end in steps of 2: ", b[0:-1:2] ) # stop=-1 excludes the last element; use b[0::2] to include it

First element in b:  1
Second element in b:  2
Last element in b:  5
All elements in b:  [1 2 3 4 5]
Elements from 1 to 3:  [2 3 4]
From 0 to end in steps of 2:  [1 3]
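
The slicing syntax is `start:stop:step`, and any of the three parts can be omitted. The following sketch (variable names are illustrative) contrasts a slice that stops before the last element with one that runs through the whole array.

```python
import numpy as np

b = np.array([1, 2, 3, 4, 5])

every_other = b[::2]       # start and stop omitted: whole array, in steps of 2
without_last = b[0:-1:2]   # stop=-1 means "up to, but not including, the last element"
reversed_b = b[::-1]       # a negative step walks through the array backwards

print(every_other)    # [1 3 5]
print(without_last)   # [1 3]
print(reversed_b)     # [5 4 3 2 1]
```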


In [12]:
b.shape

(5,)

In [13]:
b.sum()

15

In [14]:
np.arange(1,5,0.5)

array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])

In [15]:
np.linspace(0,10,4)

array([ 0.        ,  3.33333333,  6.66666667, 10.        ])

In [16]:
np.ones(5)

array([1., 1., 1., 1., 1.])

In [17]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [18]:
A = np.array([[1,2,3,4],[10,20,30,40],[100,200,300,400]])
print("The array A:\n",A)
print("Shape of A:",A.shape)
print("Number of dimensions of A:",A.ndim)

The array A:
 [[  1   2   3   4]
 [ 10  20  30  40]
 [100 200 300 400]]
Shape of A: (3, 4)
Number of dimensions of A: 2


In [21]:
B = np.arange(10)
print("Shape:",B.shape)
print("Ndim:",B.ndim)

Shape: (10,)
Ndim: 1


In [23]:
B = B.reshape((10,1))
print("Shape:",B.shape)
print("Ndim:",B.ndim)

Shape: (10, 1)
Ndim: 2
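
When reshaping, one of the dimensions can be given as `-1`, in which case NumPy infers it from the total number of elements. A short sketch (the names `column`, `row` and `flat` are illustrative):

```python
import numpy as np

B = np.arange(10)

column = B.reshape(-1, 1)   # one column; the number of rows is inferred
row = B.reshape(1, -1)      # one row; the number of columns is inferred
flat = column.ravel()       # back to a one-dimensional array

print(column.shape, row.shape, flat.shape)  # (10, 1) (1, 10) (10,)
```

This pattern is convenient later on, since Scikit-Learn expects feature arrays with two dimensions.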


In [None]:
print("Array A:\n",A)
print("Element at pos (1,2):",A[1,2],'\n')
print("Row #2 of A:\n",A[2,:],'\n')
print("Column #1 of A:\n",A[:,1],'\n')


Let's recall how to perform _logical tests_ in Python.

Applied to a NumPy array, a logical test is evaluated element by element, and we can select elements based on its result:

In [None]:
print( A > 25 )

In [None]:
print( A[A > 25] )
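
Logical tests can also be combined. The sketch below uses the element-wise operator `&` (logical AND) to keep only the values inside a range; the thresholds are arbitrary and chosen just for illustration.

```python
import numpy as np

A = np.array([[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]])

# Each comparison needs its own parentheses because & binds tighter than > and <
mask = (A > 25) & (A < 250)

selected = A[mask]   # one-dimensional array with the elements where mask is True
count = mask.sum()   # True counts as 1, so the sum is the number of matches

print(selected)  # [ 30  40 100 200]
print(count)     # 4
```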

As an example of an attribute, recall the "shape" of one of our NumPy arrays, which we inspected earlier with `b.shape`.

As an example of a method, let's evaluate the sum of all elements in one of our numpy arrays:

In [None]:
b.sum()

This means that the sum of the elements in b is 15.
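
To contrast the two ideas side by side, here is a small sketch on the two-dimensional array `A` from before. Attributes are read without parentheses, while methods are called and may take arguments such as `axis`.

```python
import numpy as np

A = np.array([[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]])

shape = A.shape              # attribute: no parentheses
total = A.sum()              # method: sum of all elements
column_sums = A.sum(axis=0)  # method with an argument: sum down each column
row_sums = A.sum(axis=1)     # sum along each row

print(shape)        # (3, 4)
print(total)        # 1110
print(column_sums)  # 111, 222, 333, 444
print(row_sums)     # 10, 100, 1000
```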

### Matplotlib

Next, we will explore the basics of Matplotlib to create simple plots.

In [None]:
import matplotlib.pyplot as plt

In [None]:
y = np.array([0,10,3,4,2])

plt.plot(y)
plt.show()

In [None]:
x = np.arange(15)
y1 = np.sin(0.5*x)
y2 = np.cos(0.5*x)

plt.plot(x, y1, 'o--', markersize=10, linewidth=1.2, 
            color='r', label='Sine')
plt.plot(x, y2, 's--', markersize=10, linewidth=1.2, 
            color='k', label='Cosine')

plt.xlabel('Time (s)')
plt.ylabel('Fluorescence (a.u.)')

plt.yticks([-1.0,-0.5,0.0,0.5,1.0])

plt.legend(frameon=False)
plt.show()

To export a plot, here is the usual structure:

```python
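import matplotlib.pyplot as plt

f = plt.figure( figsize=(5,3) )   # create a figure with the desired size (in inches)

# ... plotting commands go here ...

plt.tight_layout()                # adjust margins so that labels are not clipped
plt.savefig('Fig1.png', dpi=300)  # write the figure to a file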
```

Next we plot the same functions from the previous cell, but save the figure as a PNG file instead of displaying it.

In [None]:
f = plt.figure( figsize=(5,3) )

x = np.arange(15)
y1 = np.sin(0.5*x)
y2 = np.cos(0.5*x)

plt.plot(x, y1, 'o--', markersize=10, linewidth=1.2, 
            color='r', label='Sine')
plt.plot(x, y2, 's--', markersize=10, linewidth=1.2, 
            color='k', label='Cosine')

plt.xlabel('Time (s)')
plt.ylabel('Fluorescence (a.u.)')

plt.yticks([-1.0,-0.5,0.0,0.5,1.0])

plt.tight_layout()
plt.savefig('Fig1.png', dpi=300)
plt.close()

print("Figure saved.")

## Getting started with Scikit-Learn

In this last section of our first day, we explore Scikit-Learn's general structure to create our very first Machine Learning model.

In [None]:
import sklearn
import matplotlib.pyplot as plt

import sklearn.datasets
bcancer = sklearn.datasets.load_breast_cancer()

In [None]:
print("Features: ", bcancer.data.shape)
print("Target: ", bcancer.target.shape)

Quick reminder:

* "Features" in Machine Learning are a set of quantitative characteristics (or measurements) of each of your samples. Models use features to make predictions about a target variable. In the Breast Cancer Wisconsin Dataset, there are 30 different features and 569 samples.

* "Target" is our objective: our model should use the features to predict the target. During training, the target is known and is also called the _ground truth_.
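
As a quick check of the numbers quoted above, the following sketch reads the dimensions of the dataset and prints the names of a few features. It reloads the dataset so that it runs on its own; `n_samples` and `n_features` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer

bcancer = load_breast_cancer()

# Rows are samples, columns are features
n_samples, n_features = bcancer.data.shape
print(n_samples, n_features)  # 569 30

# feature_names holds one label per column of bcancer.data
print(bcancer.feature_names[:5])
```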

Let's take a look at the targets:

In [None]:
print( bcancer.target )

As you can see, the targets are always 0 or 1, indicating a malignant or a benign sample, respectively.
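
The meaning of each label is stored in the dataset itself. The sketch below pairs each label with its name and counts how many samples fall in each class, using `np.bincount`; the variable `counts` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

bcancer = load_breast_cancer()

# counts[k] is the number of samples whose target equals k
counts = np.bincount(bcancer.target)

# target_names[k] is the name of class k
for label, (name, count) in enumerate(zip(bcancer.target_names, counts)):
    print(label, name, count)
```

In this dataset the two classes are not equally represented, which is worth keeping in mind when judging how good a classifier is.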

Let's inspect one of the samples. The very first sample, which has target 0 (i.e. malignant), has the following features:

In [None]:
bcancer.data[0]

Although the numbers have little meaning on their own, they were measured from the nuclei of cells extracted from patients. So our goal is to create a model that receives numbers like these and predicts whether they come from a malignant or a benign sample.

In [None]:
from sklearn.tree import DecisionTreeClassifier

Next we create a model and store it in a variable, and then we fit it to the data at hand.

In [None]:
bcancer_model = DecisionTreeClassifier( max_depth=4 )

In [None]:
bcancer_model.fit( bcancer.data, bcancer.target )

Let's evaluate the prediction of this model on two samples: one from class 0 (malignant) and one from class 1 (benign).

In [None]:
print( bcancer_model.predict( [bcancer.data[0]] ) )
print( bcancer.target[0] )

In [None]:
print( bcancer_model.predict( [bcancer.data[19]] ) )
print( bcancer.target[19] )
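
Note that `predict` expects a two-dimensional input with shape (number of samples, number of features), which is why a single sample is wrapped in a list above. An equivalent approach, sketched below, is to reshape the sample. The sketch refits its own model so that it runs independently; `random_state=0` is an addition made here only to keep the result reproducible.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

bcancer = load_breast_cancer()
model = DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(bcancer.data, bcancer.target)

# A single sample has shape (30,); reshape it into one row with 30 columns
sample = bcancer.data[0].reshape(1, -1)

pred = model.predict(sample)         # one predicted label per row
proba = model.predict_proba(sample)  # one probability per class, per row

print(sample.shape)  # (1, 30)
print(pred)
print(proba)
```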

We can also evaluate the predictions of the model on the entire dataset:

In [None]:
bcancer_predictions = bcancer_model.predict( bcancer.data )
print( bcancer_predictions )

Let's compare the ground truth with the predictions evaluated above:

In [None]:
bcancer.target == bcancer_predictions

Every ```True``` above means a prediction that matched its corresponding ground truth. ```False``` elements reflect wrong predictions (also known as misclassifications). Let's visualize how many correct classifications and how many misclassifications we had with this first classifier.

In [None]:
compar = (bcancer.target == bcancer_predictions).astype(int)

plt.hist( compar, bins=3 )

plt.ylabel('Count')
plt.xticks( [0.15,0.85], ['Misclassified', 'Correctly classified'] )

plt.show()
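
The same comparison can be summarized in a single number, the fraction of correct predictions (the accuracy). The sketch below computes it directly from the boolean array and, as a cross-check, with `sklearn.metrics.accuracy_score`. It refits its own model with `random_state=0`, an addition made here for reproducibility, so the exact numbers may differ from the cells above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

bcancer = load_breast_cancer()
model = DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(bcancer.data, bcancer.target)
predictions = model.predict(bcancer.data)

matches = bcancer.target == predictions  # True where the prediction is correct
accuracy = matches.mean()                # fraction of True values
n_wrong = (~matches).sum()               # number of misclassified samples

print("Accuracy:", accuracy)
print("Misclassified samples:", n_wrong)
print("Accuracy (sklearn):", accuracy_score(bcancer.target, predictions))
```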

This is an example of how to create a model. On Day 2 we will explore more models and learn how to validate a model, making sure that it indeed has predictive power.
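
As a preview of why validation matters, note that above the model was evaluated on the same data it was trained on. One common way to get a more honest estimate is to hold out part of the data, as in the sketch below. This is not necessarily the procedure used on Day 2; the split size, `stratify` and `random_state=0` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

bcancer = load_breast_cancer()

# Keep 25% of the samples aside; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    bcancer.data, bcancer.target,
    test_size=0.25, random_state=0, stratify=bcancer.target,
)

model = DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on data seen in training
test_acc = model.score(X_test, y_test)     # accuracy on held-out data

print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)
```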

<br/>
The end, my friends.

In [None]:
(compar == 0).sum() / len(compar)   # fraction of misclassified samples