## Python Data Science Libraries

* NumPy
    * fast lists (a.k.a. arrays)
* Pandas
    * provides single-machine dataframes (a.k.a. tables)
* Seaborn & matplotlib
    * visualization
* sklearn
    * machine learning library
    


* Spark
    * query over distributed file systems 
* plotly
    * interactive visuals
* scipy, sklearn, tensorflow, pytorch, statsmodels
    * scientific & statistical programming
* Aside:
    * NB: `sqlite3` wraps SQLite, which is written in C

# Introduction to NumPy

Plain Python loops are slow for numerical work.

In [1]:
from random import random

In [2]:
dataset = []

for _ in range(1_000_000):
    dataset.append(random())

In [5]:
dataset[-5:]

[0.8672591442681461,
 0.2065241313994416,
 0.2346035655805635,
 0.3618896034845297,
 0.33853662053052025]

In [6]:
%%timeit
   
total = 0

for x in dataset:
    total += x

65.2 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [7]:
%%timeit

sum(dataset)

8.24 ms ± 664 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [8]:
import numpy as np

In [9]:
ds = np.random.uniform(0, 1, 1_000_000)

In [10]:
ds[:5]

array([0.30738247, 0.9631745 , 0.90721054, 0.22760025, 0.61021491])

In [11]:
%%timeit

ds.sum()

1.31 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [13]:
1.31/8.24

0.15898058252427186
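Outside a notebook, the same comparison can be sketched with the standard-library `timeit` module. This is a minimal sketch rather than the original benchmark: the helper `loop_sum` and the repeat count `REPEATS` are assumptions made for illustration, and absolute timings will differ from machine to machine.

```python
import timeit
from random import random

import numpy as np

N = 1_000_000
REPEATS = 3  # arbitrary, chosen only to keep the run short

dataset = [random() for _ in range(N)]  # plain Python list of floats
ds = np.array(dataset)                  # the same values as a NumPy array


def loop_sum(values):
    """Hypothetical helper: add the values one at a time in a Python loop."""
    total = 0.0
    for v in values:
        total += v
    return total


# Average seconds per call for each approach.
t_loop = timeit.timeit(lambda: loop_sum(dataset), number=REPEATS) / REPEATS
t_builtin = timeit.timeit(lambda: sum(dataset), number=REPEATS) / REPEATS
t_numpy = timeit.timeit(lambda: ds.sum(), number=REPEATS) / REPEATS

print(f"for loop : {t_loop * 1e3:.2f} ms")
print(f"sum()    : {t_builtin * 1e3:.2f} ms")
print(f"ds.sum() : {t_numpy * 1e3:.2f} ms")
```

All three approaches compute the same total up to floating-point rounding; only the time taken differs.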

### NumPy is essentially an API

It gives access to well-written C and Fortran code.

In [14]:
import sys

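# Percentage of memory saved, rounded to the nearest 10%.
# Note: sys.getsizeof(dataset) counts only the list's pointer array,
# not the float objects it points to.
100 * round(1 -  ds.nbytes / sys.getsizeof(dataset), 1)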

10.0
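`sys.getsizeof` on a list measures the list object and its array of pointers, not the float objects those pointers refer to. A rough sketch of a fuller comparison follows, assuming every element is a separate float object (which holds for values produced by `random()`). The helper name `list_total_bytes` is made up for illustration.

```python
import sys
from random import random

import numpy as np

dataset = [random() for _ in range(100_000)]
ds = np.array(dataset)


def list_total_bytes(values):
    """Hypothetical helper: size of the list plus every element it points to."""
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


print(ds.nbytes)                  # 8 bytes per float64 element
print(list_total_bytes(dataset))  # list overhead plus one float object per element
print(f"saving: {1 - ds.nbytes / list_total_bytes(dataset):.0%}")
```

Counted this way, the saving from the array is considerably larger than the figure based on the pointer array alone.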

## Using NumPy

In [16]:
x_age = [18, 22, 33, 41]

x = np.array(x_age)

In [17]:
x

array([18, 22, 33, 41])

In [18]:
x.mean()

28.5

In [20]:
list(range(0, 10, 2))

[0, 2, 4, 6, 8]

In [22]:
print(np.arange(0, 10, 2))

[0 2 4 6 8]


In [24]:
print(np.repeat(["Heads", "Tails"], 5))

['Heads' 'Heads' 'Heads' 'Heads' 'Heads' 'Tails' 'Tails' 'Tails' 'Tails'
 'Tails']


In [27]:
np.random.choice([1,2,3,4,5,6],10, p=(0,0,0,0,0,1))

array([6, 6, 6, 6, 6, 6, 6, 6, 6, 6])

Simulate ages of people in a school year

In [28]:
x_age = np.random.normal(16, 1, 20)

In [29]:
x_age

array([16.66022306, 17.01071247, 14.87633762, 16.49979209, 16.41968887,
       15.12990257, 16.46311003, 17.21601865, 16.32791911, 15.76172784,
       14.13062403, 15.36792355, 17.22087214, 15.81477596, 17.48527044,
       17.52957057, 16.03544614, 15.81893965, 15.65788477, 13.26519128])

In [30]:
ls_age = [16.66022306, 17.01071247, 14.87633762, 16.49979209, 16.41968887,
       15.12990257, 16.46311003, 17.21601865, 16.32791911, 15.76172784,
       14.13062403, 15.36792355, 17.22087214, 15.81477596, 17.48527044,
       17.52957057, 16.03544614, 15.81893965, 15.65788477, 13.26519128]

In [31]:
score = []
for element in ls_age:
    score.append(3 * element + 1)

In [33]:
[3 * element + 1 for element in ls_age]

[50.98066918,
 52.032137410000004,
 45.629012859999996,
 50.49937627,
 50.259066610000005,
 46.38970771,
 50.38933009,
 52.64805595,
 49.98375733,
 48.285183520000004,
 43.39187209,
 47.10377065,
 52.662616420000006,
 48.44432788,
 53.45581132,
 53.58871171,
 49.10633842,
 48.45681895,
 47.97365431,
 40.79557384]

In [32]:
score

[50.98066918,
 52.032137410000004,
 45.629012859999996,
 50.49937627,
 50.259066610000005,
 46.38970771,
 50.38933009,
 52.64805595,
 49.98375733,
 48.285183520000004,
 43.39187209,
 47.10377065,
 52.662616420000006,
 48.44432788,
 53.45581132,
 53.58871171,
 49.10633842,
 48.45681895,
 47.97365431,
 40.79557384]

NumPy arithmetic is **vectorised** by default: an operation applies to every element without an explicit loop.

In [34]:
3 * x_age + 1

array([50.98066917, 52.03213741, 45.62901287, 50.49937628, 50.25906662,
       46.3897077 , 50.38933009, 52.64805596, 49.98375732, 48.28518351,
       43.39187208, 47.10377064, 52.66261643, 48.44432789, 53.45581133,
       53.58871171, 49.10633843, 48.45681895, 47.9736543 , 40.79557385])
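The loop, the comprehension, and the vectorised expression all compute the same scores. A small sketch of that equivalence, using made-up ages rather than the simulated ones above:

```python
import numpy as np

ages = np.array([16.0, 17.5, 14.9, 15.2])  # made-up ages for illustration

# Element-by-element version, as with a plain list.
scores_loop = [3 * a + 1 for a in ages]

# Vectorised version: the arithmetic is applied to every element at once.
scores_vec = 3 * ages + 1

print(np.allclose(scores_loop, scores_vec))

# On a plain list, `*` means repetition rather than arithmetic.
repeated = 3 * [1, 2]
print(repeated)  # [1, 2, 1, 2, 1, 2]
```

This is why `3 * ls_age + 1` cannot replace the loop for a plain list: the list is repeated, and adding `1` to a list raises a `TypeError`.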

## An array is a sequence

In [36]:
x_age.shape

(20,)

In [38]:
len(x_age)

20

In [42]:
x_age[[0]]

array([16.66022306])

In [40]:
x_age[0:2]

array([16.66022306, 17.01071247])

In [41]:
x_age[-1]

13.26519128333291

In [54]:
print(x_age[::3])

[16.66022306 16.49979209 16.46311003 15.76172784 17.22087214 17.52957057
 15.65788477]
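The kind of index decides what comes back: an integer gives a single value, a list of integers gives an array, and a slice gives a sub-array. A short sketch with made-up values:

```python
import numpy as np

x = np.array([16.7, 17.0, 14.9, 16.5, 16.4, 15.1, 16.5])  # made-up values

first = x[0]          # integer index: a single NumPy scalar
first_arr = x[[0]]    # list index: a one-element array
head = x[0:2]         # slice: elements 0 and 1
every_third = x[::3]  # slice with step 3: elements 0, 3 and 6

print(first, first_arr.shape)
print(head)
print(every_third)
```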


## What Are Matrices?

In [51]:
M = np.array([
    (1000, 12, +1),  # e.g., loan, duration, settle
    (2000, 9, -1),   # e.g., loan, duration, settle
    (3000, 6, -1),   # e.g., loan, duration, settle
])

In [53]:
print(M)

[[1000   12    1]
 [2000    9   -1]
 [3000    6   -1]]


`M[row-index, col-index]`

In [55]:
M[0, 0]

1000

In [57]:
M[1:3, -1]

array([-1, -1])

In [62]:
M.shape

(3, 3)
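In the `M[row-index, col-index]` pattern, either position can also be a slice, and `:` alone means "everything along this axis". A brief sketch on the same loan matrix (the variable names are chosen here for illustration):

```python
import numpy as np

M = np.array([
    (1000, 12, +1),
    (2000, 9, -1),
    (3000, 6, -1),
])

loans = M[:, 0]      # all rows, column 0
first_row = M[0, :]  # row 0, all columns
block = M[0:2, 0:2]  # first two rows of the first two columns

print(loans)
print(first_row)
print(block)
```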

## Vectors

In [59]:
x_profit = np.array([
    [10],
    [11],
    [12]
])

In [61]:
print(x_profit)

[[10]
 [11]
 [12]]


In [63]:
x_profit.shape

(3, 1)

In [65]:
x_profit[0, 0]

10

In [70]:
x_profit.reshape(-1,1)

array([[10],
       [11],
       [12]])
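A 1-D array of shape `(3,)` and a column vector of shape `(3, 1)` hold the same values but are different shapes. In `reshape`, a `-1` asks NumPy to infer that dimension from the number of elements. A minimal sketch:

```python
import numpy as np

flat = np.array([10, 11, 12])  # 1-D array, shape (3,)
col = flat.reshape(-1, 1)      # column vector, shape (3, 1); -1 is inferred as 3
row = flat.reshape(1, -1)      # row vector, shape (1, 3)

print(flat.shape, col.shape, row.shape)
print(col.ravel())             # flatten back to 1-D
```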

## Comparisons are vectorised

In [72]:
x_age < 16

array([False, False,  True, False, False,  True, False, False, False,
        True,  True,  True, False,  True, False, False, False,  True,
        True,  True])

In [73]:
np.where(x_age < 16)

(array([ 2,  5,  9, 10, 11, 13, 17, 18, 19], dtype=int64),)

In [74]:
x_age[np.where(x_age < 16)]

array([14.87633762, 15.12990257, 15.76172784, 14.13062403, 15.36792355,
       15.81477596, 15.81893965, 15.65788477, 13.26519128])

In [75]:
x_age[x_age < 16]

array([14.87633762, 15.12990257, 15.76172784, 14.13062403, 15.36792355,
       15.81477596, 15.81893965, 15.65788477, 13.26519128])

In [79]:
x_age[(x_age < 16) | ~(x_age > 17)]

array([16.66022306, 14.87633762, 16.49979209, 16.41968887, 15.12990257,
       16.46311003, 16.32791911, 15.76172784, 14.13062403, 15.36792355,
       15.81477596, 16.03544614, 15.81893965, 15.65788477, 13.26519128])
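Boolean masks combine with `|` (or), `&` (and) and `~` (not). Each comparison needs its own parentheses, because these operators bind more tightly than `<` and `>`. A sketch with made-up ages:

```python
import numpy as np

x = np.array([16.7, 17.2, 14.9, 16.0, 17.0, 13.3])  # made-up ages

young = x < 16
old = x > 17

either = x[young | old]    # below 16 OR above 17
middle = x[~young & ~old]  # neither: from 16 to 17 inclusive
not_old = x[young | ~old]  # selects the same elements as x <= 17

print(either)
print(middle)
print(np.array_equal(not_old, x[x <= 17]))
```

The last line illustrates that `(x < 16) | ~(x > 17)` selects the same elements as the simpler `x <= 17`, since every value below 16 is already not above 17.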

In [None]:
X = np.array([
    [21, 1_000, False],  # temp, power, window_open
    [19, 1_000, False],
    [24, 3_000, False],
    [26, 3_000, True],
])

## Exercise 1: Select Values
* the temperature column
    * HINT: all rows of column 0
* the power column
    * HINT: all rows of column 1
* the last column
    * HINT: all rows of column -1
* the first observation row
    * HINT: row 0 of all columns
* the last observation row
    * HINT: row -1 of all columns
* the temp and power of the first two observations
    * HINT: the first two rows of the first two columns
    * HINT: `0` until `2`
* the temp and power when the window is open
    * HINT: we want the first two columns with a *row* condition (i.e., a mask or test)
    * HINT: the condition is that the third column `X[:, 2]` is `True`
* the power when the window is closed
    * HINT: as above, condition is that third column is `False`
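
One detail worth knowing before building the masks: a NumPy array has a single dtype, so mixing integers and booleans in one array stores `True`/`False` as `1`/`0`. The sketch below uses a shortened, two-row version of `X` and only shows how to recover a boolean mask; the selections themselves are left to the exercise.

```python
import numpy as np

X = np.array([
    [21, 1_000, False],  # temp, power, window_open
    [26, 3_000, True],
])

print(X.dtype)  # an integer dtype: the booleans were upcast

# Two equivalent ways to turn the 0/1 column back into a boolean mask.
window_open = X[:, 2] == 1
window_open_alt = X[:, 2].astype(bool)

print(window_open)
print(window_open_alt)
```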
    

In [109]:
age = np.array([5, 10, 20, 30])
action = np.array([1,1,-1, 1])
comedy = np.array([1,-1,1, -1])

In [110]:
np.array([age, action, comedy]).T

array([[ 5,  1,  1],
       [10,  1, -1],
       [20, -1,  1],
       [30,  1, -1]])

In [111]:
films = np.column_stack([age, action, comedy])

In [122]:
films3d = films.reshape(2,3,2)

In [123]:
print(films3d)

[[[ 5  1]
  [ 1 10]
  [ 1 -1]]

 [[20 -1]
  [ 1 30]
  [ 1 -1]]]


In [125]:
films3d[1, 0, 0]

20

The product of the dimension sizes must equal the number of elements: `films` has 4 × 3 = 12 elements, so `(2, 3, 2)` is a valid shape because 2 × 3 × 2 = 12.
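
A small sketch of that rule, using stand-in data from `np.arange` rather than the film features. The flag `failed` is introduced here only to record the outcome.

```python
import numpy as np

data = np.arange(12).reshape(4, 3)  # stand-in data with 12 elements

ok = data.reshape(2, 3, 2)          # 2 * 3 * 2 == 12, so this works
print(ok.shape, ok.size)

try:
    data.reshape(5, 2)              # 5 * 2 == 10, which is not 12
    failed = False
except ValueError as err:
    failed = True
    print("reshape failed:", err)
```

Reshaping keeps the elements in the same row-by-row order and only changes how they are grouped.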

$x_{age} \sim N(\mu=35, \sigma=5) \in \mathbb{R}^{25}$

In [80]:
np.random.normal(35, 5, 25)

array([43.33867953, 29.96260058, 34.94868308, 40.53911314, 33.55847808,
       33.67779828, 42.4534349 , 31.0593559 , 35.29010369, 34.67707595,
       44.04357257, 27.07798154, 36.92132216, 44.5856371 , 41.37004161,
       39.69890399, 40.72503791, 38.2582065 , 37.62689486, 25.60562848,
       27.41127743, 38.35047307, 34.88224708, 35.23144587, 35.57701933])
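Each run of `np.random.normal` gives different numbers. For repeatable draws, one option is NumPy's `Generator` interface with a fixed seed. This is a sketch only; the seed value `0` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=0)           # seed chosen arbitrarily
x_age = rng.normal(loc=35, scale=5, size=25)  # 25 draws from N(35, 5)

print(x_age.shape)
print(x_age.mean(), x_age.std())  # near 35 and 5, but not exact with 25 draws
```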

In [86]:
w = np.array([1,2,3]).reshape(-1, 1)

$f(X; W, b) = W_0X_0 + W_1X_1 + \dots + b$

$y = mx + c$

In [126]:
w0 = 5
X0 = age

w1 = 10
X1 = action


b = 0

prediction = w0 * X0 + w1 * X1 + b

In [127]:
prediction

array([ 35,  60,  90, 160])

In [128]:
films

array([[ 5,  1,  1],
       [10,  1, -1],
       [20, -1,  1],
       [30,  1, -1]])

In [132]:
w = np.array([5,10,2]).reshape(1, -1)
w

array([[ 5, 10,  2]])

In [134]:
w * films

array([[ 25,  10,   2],
       [ 50,  10,  -2],
       [100, -10,   2],
       [150,  10,  -2]])

In [147]:
(w * films).sum(axis=1) + b

array([ 37,  58,  92, 158])
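
Multiplying elementwise and then summing across each row is a matrix-vector product, which NumPy writes with the `@` operator. The sketch below assumes `w` is a 1-D array of shape `(3,)` instead of the `(1, 3)` row used above; the names `by_sum` and `by_matmul` are chosen for illustration.

```python
import numpy as np

films = np.array([
    [5, 1, 1],
    [10, 1, -1],
    [20, -1, 1],
    [30, 1, -1],
])
w = np.array([5, 10, 2])  # 1-D weights, one per column of films
b = 0

by_sum = (w * films).sum(axis=1) + b  # multiply elementwise, then add across each row
by_matmul = films @ w + b             # the matrix-vector product does both steps

print(by_sum)
print(by_matmul)
```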

$f(X; w, b) = w^TX + b$

In [87]:
X = np.array([
    [21, 1_000, False],  # temp, power, window_open
    [19, 1_000, False],
    [24, 3_000, False],
    [26, 3_000, True],
])

In [149]:
w * X

array([[  105, 10000,     0],
       [   95, 10000,     0],
       [  120, 30000,     0],
       [  130, 30000,     2]])
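
The product works even though `w` has one row and `X` has several: NumPy broadcasts the size-1 axis of `w` so that it is reused for every row of `X`. A sketch with a shortened two-row `X`, written with `0`/`1` in place of the booleans:

```python
import numpy as np

w = np.array([[5, 10, 2]])  # shape (1, 3)
X = np.array([
    [21, 1_000, 0],
    [26, 3_000, 1],
])                          # shape (2, 3)

scaled = w * X              # w is reused for each row of X
print(scaled.shape)

# The same result with the stretching made explicit.
explicit = np.broadcast_to(w, X.shape) * X
print(np.array_equal(scaled, explicit))
```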