# Python Data Science Handbook

This notebook works through the book and looks at various aspects of using Python for data science.

The book is available online: [Python Data Science Handbook](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb)

## IPython

- Stands for Interactive Python
- Launched with the `ipython` command
- `?` is shorthand for help
- `??` shows the source code

Demonstration below:

In [2]:
len?

In [3]:
help(len)

Help on built-in function len in module builtins:

len(obj, /)
    Return the number of items in a container.



In [4]:
L = [1,2,3]
L?

In [5]:
L.insert?


In [6]:
def square(a):
    '''Return square of a'''
    return a ** 2

In [7]:
square?

In [8]:
square??

In [9]:
len??
# len is implemented in C (a compiled extension language), so ?? has no Python source to show and gives the same output as ?

#### Tab completion is awesome

#### For dunder methods it works too

In [10]:
*Warning?
# Wildcard search: lists every name in the namespace that ends with "Warning"

In [11]:
# The %timeit magic measures the execution time of a single-line Python statement

%timeit L = [x **2 for x in range(10000)]
# Note: this line only times creating the generator object; no squares are computed
%timeit L = (x **2 for x in range(10000))

100 loops, best of 3: 3.58 ms per loop
1000000 loops, best of 3: 753 ns per loop
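The two timings above are not directly comparable, because a generator expression does no work until it is consumed. A minimal sketch with the standard-library `timeit` module makes the comparison fairer; the function names are illustrative, not from the book:

```python
import timeit

def squares_list(n=10000):
    # Builds the full list of squares in memory.
    return [x ** 2 for x in range(n)]

def squares_gen_unconsumed(n=10000):
    # Only creates the generator object; nothing is computed yet.
    return (x ** 2 for x in range(n))

def squares_gen_consumed(n=10000):
    # Forces the generator to produce every value.
    return list(x ** 2 for x in range(n))

for fn in (squares_list, squares_gen_unconsumed, squares_gen_consumed):
    seconds = timeit.timeit(fn, number=100)
    print(f"{fn.__name__}: {seconds / 100 * 1e6:.1f} us per call")
```

On most machines the consumed generator should land in the same range as the list comprehension, while the unconsumed one stays far faster.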


In [12]:
# The %%timeit cell magic times an entire cell; it must be the first line of the cell

#%%timeit
L = []
for n in range(10000):
    L.append(n ** 2)

#### Input Output History

- `In` is a list of the input commands
- `Out` is a dict mapping input numbers to their outputs

In [13]:
# In
# Out

In [14]:
a = 10
b = 20
c = 30
a


10

In [15]:
a*b; # a trailing semicolon suppresses the output

In [16]:
%history -n 4-6 # show history

   4:
L = [1,2,3]
L?
   5: L.insert?
   6:
def square(a):
    '''Return square of a'''
    return a ** 2


In [17]:
# Run shell commands by prefixing them with !
!pwd

/home/arjunil/Learn to Code/Python-refresher


In [18]:
contents = !ls
print(contents)

['Bob.txt', 'file.json', 'new.txt', 'Python Data Science Handbook.ipynb', 'python_revision.ipynb', 'README.md']


In [19]:
type(contents) # not a plain list, but IPython's SList (a list subclass with extra helpers)

IPython.utils.text.SList

In [20]:
message = "hello from python"
!echo {message}

hello from python


#### Controlling the exceptions displayed
__%xmode__ takes a single argument, the mode, and there are three possibilities: Plain, Context, and Verbose. The default is Context, which shows a few lines of source around each step of the traceback. Plain is more compact and gives less information:

In [21]:
def div(a,b):
    return a/b

def num_div(x):
    a = x
    b = x - 1
    return div(a,b)

In [22]:
num_div(1)

ZeroDivisionError: division by zero

In [23]:
# example

%xmode Plain
num_div(1)

Exception reporting mode: Plain


ZeroDivisionError: division by zero

In [24]:
%xmode Context
num_div(1)

Exception reporting mode: Context


ZeroDivisionError: division by zero
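Outside IPython there is no `%xmode`, but the standard-library `traceback` module gives similar control over how much detail is shown. A rough sketch reusing `div` and `num_div` from above; treating the short form as "Plain-like" is only a loose analogy, not how IPython implements its modes:

```python
import traceback

def div(a, b):
    return a / b

def num_div(x):
    a = x
    b = x - 1
    return div(a, b)

try:
    num_div(1)
except ZeroDivisionError as exc:
    # Compact form: only the exception type and message.
    short = "".join(traceback.format_exception_only(type(exc), exc))
    # Full form: every frame between the call site and the failing line.
    full = traceback.format_exc()

print(short)
print(full)
```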

In [25]:
%debug
num_div(1)

> <ipython-input-21-33932b1d0c15>(2)div()
      1 def div(a,b):
----> 2     return a/b
      3 
      4 def num_div(x):
      5     a = x

ipdb> 
ipdb> 
ipdb> 
ipdb> 
ipdb> exit


ZeroDivisionError: division by zero

In [26]:
# %debug opens an ipdb prompt at the point of the last exception
# pdb and ipdb can also be used directly

## NumPy

In [27]:
import numpy as np
np.__version__


'1.10.4'

In [28]:
np?

#### Understanding Data Types in Python
A single integer in Python 3.4 actually contains four pieces:

- ob_refcnt, a reference count that helps Python silently handle memory allocation and deallocation
- ob_type, which encodes the type of the variable
- ob_size, which specifies the size of the following data members
- ob_digit, which contains the actual integer value that we expect the Python variable to represent.
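The per-object overhead described above can be observed from Python itself. A small sketch, assuming CPython (exact byte counts and reference counts vary by version and platform, so none are hard-coded here):

```python
import sys

small = 1
big = 2 ** 100

# getsizeof reports the whole object in bytes, header fields included,
# so even a small integer takes more than the 8 bytes of a C long.
print(sys.getsizeof(small))

# Larger values need more ob_digit entries, so the object grows.
print(sys.getsizeof(big))

# ob_type is what type() reports.
print(type(small))

# ob_refcnt is visible through getrefcount; the number is illustrative only.
print(sys.getrefcount(big))
```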


In [29]:
# integer type definition effectively looks like this 
'''
struct _longobject {
    long ob_refcnt;
    PyTypeObject *ob_type;
    size_t ob_size;
    long ob_digit[1];
};
''';

__The differences between a NumPy array (fixed-type) and a Python list__
- The array contains a single pointer to one contiguous block of data
- The Python list contains a pointer to a block of pointers, each of which in turn points to a full Python object such as the Python integer described above
- This makes the list flexible, while the array is efficient
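The flexibility versus efficiency trade-off can be seen in a few lines. This is a sketch with made-up values, not an example from the book:

```python
import numpy as np

# A list can mix types because every slot is a pointer to a full object.
mixed = [1, "two", 3.0, True]
mixed_types = [type(item).__name__ for item in mixed]
print(mixed_types)

# A NumPy array stores raw values of a single dtype in one contiguous buffer.
arr = np.arange(1000, dtype=np.int64)
print(arr.dtype, arr.itemsize, arr.nbytes)

# Mixing numeric types in an array forces one common dtype for all elements.
coerced = np.array([1, 2.5, 3])
print(coerced.dtype)
```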

In [30]:
import array
L = list(range(10))
A = array.array('i', L) # i indicates type code - int
A

array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

### Welcome to NumPy arrays
- The built-in `array` module gives efficient storage of data
- NumPy adds efficient operations on that data too



In [31]:
# integer array
np.array([1,2,3,4,5])

array([1, 2, 3, 4, 5])

In [32]:
# upcasting to floating point - since np arrays have homogeneous data
np.array([3.14,1,4,5,3]) 

array([ 3.14,  1.  ,  4.  ,  5.  ,  3.  ])

In [33]:
np.array([1.,2.,3,4,5], dtype='int')

array([1, 2, 3, 4, 5])

In [34]:
# Multi-dimensional arrays
np.array([range(i,i+3) for i in [2,4,6]])

array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])

#### Creating arrays from scratch

In [35]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [36]:
# Create 3x5 floating point array filled with ones
np.ones((3,5),dtype=float)

array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

In [37]:
# Create 3x4 array filled with num
num = 3.14
np.full((3,4), num)

array([[ 3.14,  3.14,  3.14,  3.14],
       [ 3.14,  3.14,  3.14,  3.14],
       [ 3.14,  3.14,  3.14,  3.14]])

In [38]:
# Array with linear sequence with start, stop and step
np.arange(0,20,2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [39]:
# Create array with five values evenly spaced between 0 and 1

np.linspace(0, 1, 5)

array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])

In [40]:
# Create 3x3 array of uniformly distributed random values between 0 and 1
np.random.random((3,3))

array([[ 0.62969288,  0.02029721,  0.29127598],
       [ 0.71298855,  0.05278465,  0.12423418],
       [ 0.19673598,  0.65114901,  0.87306582]])

In [41]:
# 3x3 array of normally distributed random values with mean=0, SD=1
np.random.normal(0,1,(3,3))

array([[ 0.25497   ,  0.11365453, -0.4090928 ],
       [ 0.42006726,  1.63207511,  0.2147068 ],
       [ 0.24423871, -0.32503543,  0.97322874]])

In [42]:
# 3x3 array of random integers in the interval [0, 10) - the upper bound is excluded
np.random.randint(0,10,(3,3))

array([[0, 2, 4],
       [8, 9, 1],
       [0, 4, 4]])

In [43]:
# identity matrix
np.eye(4)

array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

In [44]:
# Uninitialized array of three values (the default dtype is float, not int)
# Values will be whatever happens to be at that memory location
np.empty(3)

array([  0.00000000e+000,   1.93538748e-309,   4.10074486e-322])

See the [list of NumPy standard data types](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.01-Understanding-Data-Types.ipynb#NumPy-Standard-Data-Types).

#### Basics of NumPy arrays

- Attributes
- Indexing
- Slicing
- Reshaping
- Joining/Splitting

##### Attributes

In [45]:
import numpy as np
np.random.seed(0)

x1 = np.random.randint(10, size=6)
x2 = np.random.randint(10, size=(3,4))
x3 = np.random.randint(10, size=(3,4,5))

In [46]:
x1

array([5, 0, 3, 3, 7, 9])

In [47]:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [48]:
x3

array([[[8, 1, 5, 9, 8],
        [9, 4, 3, 0, 3],
        [5, 0, 2, 3, 8],
        [1, 3, 3, 3, 7]],

       [[0, 1, 9, 9, 0],
        [4, 7, 3, 2, 7],
        [2, 0, 0, 4, 5],
        [5, 6, 8, 4, 1]],

       [[4, 9, 8, 1, 1],
        [7, 9, 9, 3, 6],
        [7, 2, 0, 3, 5],
        [9, 4, 4, 6, 4]]])

In [49]:
# attributes ndim(no. of dimensions), shape(size of each dim), size(total size)

print("x3 ndim: %s\nx3 shape:%s\nx3 size:%s" %(x3.ndim,x3.shape,x3.size))

x3 ndim: 3
x3 shape:(3, 4, 5)
x3 size:60


In [50]:
print('dtype: ', x3.dtype)

dtype:  int64


In [51]:
print('itemsize: ',x3.itemsize, 'bytes') # each item
print('nbytes:',x3.nbytes,'bytes') # total array size

itemsize:  8 bytes
nbytes: 480 bytes
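These attributes are related: `size` is the product of the entries in `shape`, and `nbytes` is `size * itemsize`. A short check, with the dtype pinned to `int64` so the numbers do not depend on the platform default (the `int8` conversion is an extra illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
x3 = rng.randint(10, size=(3, 4, 5)).astype(np.int64)

# size is the product of the shape entries
print(x3.size, int(np.prod(x3.shape)))

# nbytes is size times itemsize
print(x3.nbytes, x3.size * x3.itemsize)

# A smaller dtype shrinks the footprint; the values 0-9 fit in int8
x3_small = x3.astype(np.int8)
print(x3_small.itemsize, x3_small.nbytes)
```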


##### Indexing

In [52]:
x1

array([5, 0, 3, 3, 7, 9])

In [53]:
# similar to basic Python - negative indexing works fine
x1[-1]

9

In [55]:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [57]:
# Multidimensional arrays work the same way, using a comma-separated tuple of indices
x2[-1,-1]

7

In [58]:
x2[2,0]

1

In [59]:
x2[1,3] = 999
x2

array([[  3,   5,   2,   4],
       [  7,   6,   8, 999],
       [  1,   6,   7,   7]])

In [60]:
x2.dtype

dtype('int64')

In [63]:
# x2 has dtype int64, so assigning a float does not raise an error
x2[1,3] = 3.14
x2[1] # the value is silently truncated to an integer

array([7, 6, 8, 3])
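If the fractional part matters, the array has to be converted to a floating-point dtype first. A minimal sketch using `astype`, which returns a new array and leaves the original untouched:

```python
import numpy as np

x = np.array([7, 6, 8, 8], dtype=np.int64)
x[3] = 3.14            # silently truncated to 3
print(x)

# Convert to float before assigning if the fraction should be kept.
xf = x.astype(float)
xf[3] = 3.14
print(xf)
```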

#### Discovering array slicing (python-esque)

In [64]:
x = np.arange(10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [65]:
x[:5]

array([0, 1, 2, 3, 4])

In [66]:
x[::2]

array([0, 2, 4, 6, 8])

In [68]:
# The usual start:stop:step syntax applies here.
# With a negative step the defaults for start and stop are swapped, which reverses the array

x[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
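A negative step also combines with an explicit start and stop. In that case start has to be the larger index, and stop is still excluded:

```python
import numpy as np

x = np.arange(10)

# Start at index 5 and walk backwards two at a time.
back_by_two = x[5::-2]
print(back_by_two)      # [5 3 1]

# From index 7 down to, but not including, index 2.
seven_to_three = x[7:2:-1]
print(seven_to_three)   # [7 6 5 4 3]

# start < stop with a negative step selects nothing.
empty = x[2:7:-1]
print(empty)            # []
```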

In [69]:
# going nD
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 3],
       [1, 6, 7, 7]])

In [72]:
x2[1:, :1]

array([[7],
       [1]])

In [73]:
x2[::-1, :]

array([[1, 6, 7, 7],
       [7, 6, 8, 3],
       [3, 5, 2, 4]])

In [74]:
x2[:,::-1]

array([[4, 2, 5, 3],
       [3, 8, 6, 7],
       [7, 7, 6, 1]])


#### Please take note

- NumPy and Python differ in what a slice returns
- NumPy array slices return views of the original data
- Python list slices return copies
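A side-by-side sketch of the difference, using small made-up data. `np.shares_memory` and the `.base` attribute are two ways to confirm that a slice is a view:

```python
import numpy as np

# Python list: the slice is a copy, so the original is unaffected.
py_list = [1, 2, 3, 4]
py_slice = py_list[:2]
py_slice[0] = 99
print(py_list)

# NumPy array: the slice is a view, so the write reaches the original.
np_arr = np.array([1, 2, 3, 4])
np_slice = np_arr[:2]
np_slice[0] = 99
print(np_arr)

print(np.shares_memory(np_arr, np_slice))
print(np_slice.base is np_arr)
```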

In [75]:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 3],
       [1, 6, 7, 7]])

In [76]:
x2_sub = x2[:2,:2]

In [77]:
x2_sub


array([[3, 5],
       [7, 6]])

In [78]:
x2_sub[0,0] = 999

In [79]:
x2_sub

array([[999,   5],
       [  7,   6]])

In [80]:
x2

array([[999,   5,   2,   4],
       [  7,   6,   8,   3],
       [  1,   6,   7,   7]])

In [81]:
# the original array was modified through the view

In [83]:
# If a copy is necessary, use copy method
x2_copy = x2[:2,:2].copy()
x2_copy[0,0] = 10000
print(x2)
print(x2_copy)

[[999   5   2   4]
 [  7   6   8   3]
 [  1   6   7   7]]
[[10000     5]
 [    7     6]]


### Reshaping arrays

In [84]:
grid = np.arange(1,10).reshape((3,3))

In [85]:
grid

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [87]:
x = np.array([1,2,3])

# create row vector
x.reshape((1,3))

array([[1, 2, 3]])

In [88]:
x

array([1, 2, 3])

In [90]:
x[np.newaxis, :]

array([[1, 2, 3]])

In [91]:
x.reshape((3,1))

array([[1],
       [2],
       [3]])

In [92]:
x[:,np.newaxis]

array([[1],
       [2],
       [3]])
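`reshape` and `np.newaxis` are two spellings of the same operation here. A quick check of that, plus the `-1` shorthand that lets NumPy infer one dimension (not used in the cells above, shown only as a convenience):

```python
import numpy as np

x = np.array([1, 2, 3])

row_a = x.reshape((1, 3))
row_b = x[np.newaxis, :]
col_a = x.reshape((3, 1))
col_b = x[:, np.newaxis]

print(np.array_equal(row_a, row_b), row_a.shape)
print(np.array_equal(col_a, col_b), col_a.shape)

# -1 asks NumPy to work out that dimension from the total size.
inferred = np.arange(12).reshape((3, -1))
print(inferred.shape)
```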

#### Array concatenation

In [94]:
x = np.array([1,2,3])
y = np.array([1,2,3])
np.concatenate([x,y])

array([1, 2, 3, 1, 2, 3])

In [95]:
z = [99] * 3
z

[99, 99, 99]

In [97]:
np.concatenate([x,y,z])

array([ 1,  2,  3,  1,  2,  3, 99, 99, 99])

In [103]:
new_arr = np.array(range(6)).reshape(2,3)
np.concatenate([new_arr,new_arr])
#np.concatenate([x,y,z], axis=1)

array([[0, 1, 2],
       [3, 4, 5],
       [0, 1, 2],
       [3, 4, 5]])

In [104]:
np.concatenate([new_arr,new_arr], axis = 1)

array([[0, 1, 2, 0, 1, 2],
       [3, 4, 5, 3, 4, 5]])

In [106]:
np.vstack([new_arr, new_arr])

array([[0, 1, 2],
       [3, 4, 5],
       [0, 1, 2],
       [3, 4, 5]])

In [111]:
np.hstack([new_arr, np.arange(3,9).reshape(2,3)])

array([[0, 1, 2, 3, 4, 5],
       [3, 4, 5, 6, 7, 8]])

In [116]:
grid = np.arange(9).reshape((3,3))

In [117]:
grid

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [121]:
np.hsplit(grid, [2])

[array([[0, 1],
        [3, 4],
        [6, 7]]), array([[2],
        [5],
        [8]])]
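Splitting is the inverse of concatenation. The list passed to the split functions holds the split points, so N points give N + 1 pieces. A short sketch with `np.split` and `np.vsplit` alongside `np.hsplit`, on made-up data:

```python
import numpy as np

x = np.array([1, 2, 3, 99, 99, 3, 2, 1])

# Two split points give three pieces.
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)

grid = np.arange(16).reshape((4, 4))

# vsplit cuts between rows, hsplit cuts between columns.
upper, lower = np.vsplit(grid, [2])
left, right = np.hsplit(grid, [2])
print(upper.shape, lower.shape, left.shape, right.shape)

# Stacking the pieces back together recovers the original grid.
print(np.array_equal(np.vstack([upper, lower]), grid))
```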

### Why NumPy - Meet the vectorized ops


In [122]:

import numpy as np
np.random.seed(0)

In [125]:
def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output


In [126]:
big_arr = np.random.randint(1,100,size = 1000000)
%timeit compute_reciprocals(big_arr)


1 loop, best of 3: 1.96 s per loop


In [127]:
# the vectorized version, by contrast
%timeit (1.0/big_arr)

100 loops, best of 3: 2.93 ms per loop
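The speedup would mean little if the two approaches disagreed, so it is worth checking that they produce the same values. A small sketch on a shorter array, using a local `RandomState` so the notebook's global seed is left alone:

```python
import numpy as np

def compute_reciprocals(values):
    # Element-by-element loop, as in the cell above.
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

rng = np.random.RandomState(0)
values = rng.randint(1, 100, size=1000)

looped = compute_reciprocals(values)
vectorized = 1.0 / values

print(np.allclose(looped, vectorized))
```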


In [131]:
# This speedup is made possible by NumPy's ufuncs:
# universal functions that apply an operation element-wise (vectorized operations)
np.arange(5) / np.arange(1,6)

array([ 0.        ,  0.5       ,  0.66666667,  0.75      ,  0.8       ])

In [132]:
x = np.arange(9).reshape((3,3))

In [133]:
x

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [134]:
x**2

array([[ 0,  1,  4],
       [ 9, 16, 25],
       [36, 49, 64]])

### Respect the ufuncs 
- They support all the standard arithmetic operators
- Each operator is a convenient wrapper around a specific NumPy function (for example, `+` wraps `np.add`), so ufunc expressions read like ordinary Python arithmetic, which makes them a delight to use
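The operator-to-function mapping can be checked directly. A short sketch comparing each operator with its named ufunc:

```python
import numpy as np

x = np.arange(5)

# Each pair holds the operator form and the equivalent named ufunc call.
pairs = [
    (x + 10, np.add(x, 10)),
    (x - 2, np.subtract(x, 2)),
    (-x, np.negative(x)),
    (x % 2, np.mod(x, 2)),
    (x ** 2, np.power(x, 2)),
    (abs(-x), np.absolute(-x)),
]

results = [np.array_equal(via_operator, via_ufunc)
           for via_operator, via_ufunc in pairs]
print(results)
```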

In [135]:
x = np.arange(5)

In [136]:
x

array([0, 1, 2, 3, 4])

In [137]:
x - 2

array([-2, -1,  0,  1,  2])

In [138]:
x+10

array([10, 11, 12, 13, 14])

In [139]:
x%2

array([0, 1, 0, 1, 0])

In [141]:
x**2

array([ 0,  1,  4,  9, 16])

In [142]:
-x # unary ufunc

array([ 0, -1, -2, -3, -4])

In [143]:
abs(-x)

array([0, 1, 2, 3, 4])

In [144]:
a = 3 +4j
b = 3 -4j
x = np.array([a,b])

In [145]:
x

array([ 3.+4.j,  3.-4.j])

In [147]:
abs(x) # computes magnitude. How cool is that.

array([ 5.,  5.])

### Trigonometric functions