# Python Data Science Handbook

This notebook works through the book and looks at various aspects of using Python for data science.

The [book](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb)

### IPython

- Stands for "Interactive Python"
- Launched with the `ipython` command
- `?` is shorthand for `help()` and shows an object's documentation
- `??` shows the source code, when it is available

Demonstration below:

In [7]:
len?

In [8]:
help(len)

Help on built-in function len in module builtins:

len(obj, /)
    Return the number of items in a container.



In [11]:
L = [1,2,3]
L?

In [14]:
L.insert?


In [15]:
def square(a):
    '''Return square of a'''
    return a ** 2

In [18]:
square?

In [17]:
square??

In [24]:
len??
# ?? shows no source here because len is implemented in C (or another compiled extension language)

#### Tab completion is awesome

#### For dunder methods it works too

In [28]:
*Warning?
# Searching with a wildcard: lists every name that ends in "Warning"

In [34]:
# The IPython magic %timeit measures the execution time of a single-line Python statement

%timeit L = [x **2 for x in range(10000)]
%timeit L = (x **2 for x in range(10000))

10 loops, best of 3: 15.7 ms per loop
100000 loops, best of 3: 3.18 µs per loop
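The second timing above is much smaller because the statement only creates a generator object; no squares are computed until the generator is consumed. The sketch below is one way to make a fairer comparison. It uses the standard-library `timeit` module instead of the `%timeit` magic so that it also runs outside IPython; the function names and `number=100` are arbitrary choices for illustration. Timings depend on the machine, so none are shown.

```python
import timeit

def squares_list(n=10000):
    # Build the full list of squares in memory.
    return [x ** 2 for x in range(n)]

def squares_generator_consumed(n=10000):
    # Consume the generator so that every square is actually computed.
    return sum(x ** 2 for x in range(n))

# number=100 is an arbitrary repeat count chosen for illustration.
list_time = timeit.timeit(squares_list, number=100)
gen_time = timeit.timeit(squares_generator_consumed, number=100)

print(f"list comprehension: {list_time:.4f} s for 100 runs")
print(f"consumed generator: {gen_time:.4f} s for 100 runs")
```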


In [38]:
# The cell magic %%timeit times an entire code block in IPython
# (commented out here; to use it, make %%timeit the first line of the cell)

#%%timeit
L = []
for n in range(10000):
    L.append(n ** 2)

#### Input Output History

- `In` is a list of the input commands, in order
- `Out` is a dict mapping each input number to its output (if any)

In [42]:
# In
# Out

In [50]:
a = 10
b = 20
c = 30
a


10

In [57]:
a*b; # suppress output

In [60]:
%history -n 4-6 # show history

   4:
import numpy
import scipy
import sklearn
import pandas
import matplotlib
   5: ipython
   6: len?


In [62]:
# Run shell commands by prefixing them with !
!pwd

/home/arjpatha/exit_prep/Python-refresher


In [63]:
contents = !ls
print(contents)

['Bob.txt', 'file.json', 'j', 'new.txt', 'Python Data Science Handbook.ipynb', 'python_revision.ipynb', 'README.md']


In [65]:
type(contents) # not a plain list: SList is a list subclass with extra helper methods

IPython.utils.text.SList

In [66]:
message = "hello from python"
!echo {message}

hello from python


#### Controlling the exceptions displayed
__%xmode__ takes a single argument, the mode, which is one of three values: Plain, Context, or Verbose. The default is Context, which shows a few lines of source around each step of the traceback. Plain is more compact and gives less information, while Verbose adds extra detail such as the arguments passed to each function:

In [69]:
def div(a,b):
    return a/b

def num_div(x):
    a = x
    b = x - 1
    return div(a,b)

In [70]:
num_div(1)

ZeroDivisionError: division by zero

In [71]:
# example

%xmode Plain
num_div(1)

Exception reporting mode: Plain


ZeroDivisionError: division by zero

In [72]:
%xmode Context
num_div(1)

Exception reporting mode: Context


ZeroDivisionError: division by zero

In [73]:
%debug
num_div(1)

> <ipython-input-69-33932b1d0c15>(2)div()
      1 def div(a,b):
----> 2     return a/b
      3 
      4 def num_div(x):
      5     a = x

ipdb> up
> <ipython-input-69-33932b1d0c15>(7)num_div()
      3 
      4 def num_div(x):
      5     a = x
      6     b = x - 1
----> 7     retur

ZeroDivisionError: division by zero

In [74]:
# also pdb and ipdb
# %debug opens ipdb prompt

## NumPy

In [1]:
import numpy as np
np.__version__


'1.10.4'

In [2]:
np?

#### Understanding Data Types in Python
A single integer in Python 3.4 actually contains four pieces:

- ob_refcnt, a reference count that helps Python silently handle memory allocation and deallocation
- ob_type, which encodes the type of the variable
- ob_size, which specifies the size of the following data members
- ob_digit, which contains the actual integer value that we expect the Python variable to represent.


In [5]:
# integer type definition effectively looks like this 
'''
struct _longobject {
    long ob_refcnt;
    PyTypeObject *ob_type;
    size_t ob_size;
    long ob_digit[1];
};
''';
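The overhead described above can be observed from Python itself. The following is a minimal sketch that assumes CPython, where `sys.getsizeof` reports the size of the whole object (header fields included) and `sys.getrefcount` exposes the reference count. Exact numbers vary by Python version and platform, so none are shown.

```python
import sys

x = 10

# On CPython, getsizeof reports the whole object, header fields included,
# so the result is larger than the few bytes the value itself would need.
int_size = sys.getsizeof(x)
print("bytes used by the int object:", int_size)

# The type information is stored with the object, not with the variable name.
print("type:", type(x).__name__)

# getrefcount also counts the temporary reference held by its own argument.
print("reference count:", sys.getrefcount(x))
```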

__The differences between a NumPy array(fixed-type) and Python list__
- The array contains a single pointer to one contiguous block of data
- The Python list contains a pointer to a block of pointers, each of which in turn points to a full Python object such as the Python integer described above
- This makes the list flexible, while the array is efficient
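The trade-off in the list above can be sketched with the standard library alone. The example below assumes CPython for the size estimate, and the variable names are illustrative. It compares the approximate memory used by a list of integers with the raw buffer of a typed `array.array`, and shows that the typed array rejects a value of the wrong type.

```python
import array
import sys

values = list(range(1000))
typed = array.array('i', values)

# A list can mix types because it stores pointers to full Python objects.
mixed = [1, "two", 3.0]
print([type(v).__name__ for v in mixed])

# Approximate list cost: the pointer block plus every int object it refers to.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
# Cost of the raw contiguous buffer held by the typed array.
array_bytes = typed.itemsize * len(typed)
print("list (approx.):", list_bytes, "bytes")
print("array buffer:  ", array_bytes, "bytes")

# A fixed-type array refuses a value of the wrong type.
try:
    typed.append("not an int")
    rejected = False
except TypeError:
    rejected = True
print("string rejected by typed array:", rejected)
```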

In [8]:
import array
L = list(range(10))
A = array.array('i', L) # i indicates type code - int
A

array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

### Welcome to NumPy arrays
- The built-in `array` module gives efficient storage of data
- numpy gives us efficient operations on data too
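As a small illustration of "efficient operations", the sketch below doubles every element first with a Python-level loop and then with a single NumPy expression. The variable names are illustrative; the point is only that the NumPy version applies the operation to the whole array at once.

```python
import numpy as np

values = list(range(5))

# Pure Python: an explicit loop over individual int objects.
doubled_list = [v * 2 for v in values]

# NumPy: one vectorized operation over the whole typed buffer.
arr = np.array(values)
doubled_arr = arr * 2

print(doubled_list)
print(doubled_arr)
```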



In [10]:
# integer array
np.array([1,2,3,4,5])

array([1, 2, 3, 4, 5])

In [12]:
# upcasting to floating point - since np arrays have homogeneous data
np.array([3.14,1,4,5,3]) 

array([ 3.14,  1.  ,  4.  ,  5.  ,  3.  ])

In [14]:
np.array([1.,2.,3,4,5], dtype='int')

array([1, 2, 3, 4, 5])
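The cells above rely on NumPy choosing one dtype for the whole array. The sketch below makes that visible by inspecting `dtype` directly. It also uses `astype(int)`, which is swapped in here for the `dtype='int'` argument to show that converting floats to integers drops the fractional part rather than rounding.

```python
import numpy as np

# All integers: NumPy infers an integer dtype.
ints = np.array([1, 2, 3])
print(ints.dtype)

# One float is enough to upcast every element to floating point.
mixed = np.array([3.14, 1, 4])
print(mixed.dtype)

# Converting floats to integers truncates toward zero; it does not round.
truncated = np.array([1.7, 2.9]).astype(int)
print(truncated)
```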

In [15]:
# Multi-dimensional arrays
np.array([range(i,i+3) for i in [2,4,6]])

array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])

#### Creating arrays from scratch

In [17]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [18]:
# Create 3x5 floating point array filled with ones
np.ones((3,5),dtype=float)

array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

In [19]:
# Create 3x4 array filled with num
num = 3.14
np.full((3,4), num)

array([[ 3.14,  3.14,  3.14,  3.14],
       [ 3.14,  3.14,  3.14,  3.14],
       [ 3.14,  3.14,  3.14,  3.14]])

In [20]:
# Array with linear sequence with start, stop and step
np.arange(0,20,2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [21]:
# Create array with five values evenly spaced between 0 and 1

np.linspace(0, 1, 5)

array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])
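`np.arange` and `np.linspace` are easy to confuse. A minimal sketch of the difference: `arange` takes a step and excludes the stop value, while `linspace` takes a number of points and includes the stop value by default.

```python
import numpy as np

# arange: start, stop, step. The stop value (1) is excluded.
stepped = np.arange(0, 1, 0.25)

# linspace: start, stop, number of points. The stop value (1) is included.
spaced = np.linspace(0, 1, 5)

print(stepped)
print(spaced)
```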

In [22]:
# Create 3x3 array of uniformly distributed random values between 0 and 1
np.random.random((3,3))

array([[ 0.43057394,  0.43842088,  0.83811394],
       [ 0.81037854,  0.71831776,  0.82157272],
       [ 0.61558355,  0.73669647,  0.31133019]])

In [23]:
# 3x3 array of normally distributed random values with mean=0, SD=1
np.random.normal(0,1,(3,3))

array([[-0.84366027,  0.6881004 ,  0.39114499],
       [ 1.34976061, -0.62848295, -0.26508517],
       [-0.3765121 ,  0.00899633,  1.00012962]])

In [25]:
# 3x3 array of random integers in the half-open interval [0, 10): 10 itself is never drawn
np.random.randint(0,10,(3,3))

array([[3, 0, 7],
       [8, 4, 1],
       [7, 8, 2]])
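The random outputs above will differ on every run. A short sketch of how to make them repeatable with `np.random.seed`; the seed value 42 is an arbitrary choice for illustration. It also checks that `randint(0, 10, ...)` never returns 10.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed chosen for illustration
first = np.random.randint(0, 10, (3, 3))

np.random.seed(42)  # resetting the seed replays the same sequence
second = np.random.randint(0, 10, (3, 3))

print(np.array_equal(first, second))
print(first.min() >= 0 and first.max() <= 9)
```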

In [26]:
# identity matrix
np.eye(4)

array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

In [27]:
# Uninitialized array of three values (floats, since the default dtype is float64)
# The values are whatever happens to already be at that memory location
np.empty(3)

array([  2.40901646e-316,   2.41034392e-316,   3.91135922e-316])
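The output above is floating point because `np.empty` defaults to `float64`. A minimal sketch of requesting integers explicitly; the contents are arbitrary in both cases, so only the dtypes and shapes are printed.

```python
import numpy as np

# Default dtype is float64; the contents are uninitialized memory.
default_empty = np.empty(3)

# Pass dtype explicitly to get an uninitialized integer array instead.
int_empty = np.empty(3, dtype=int)

print(default_empty.dtype, default_empty.shape)
print(int_empty.dtype, int_empty.shape)
```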

See the [list of NumPy standard data types](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.01-Understanding-Data-Types.ipynb#NumPy-Standard-Data-Types) for the available dtypes.

#### Basics of NumPy arrays

- Attributes
- Indexing
- Slicing
- Reshaping
- Joining/Splitting

##### Attributes

In [29]:
import numpy as np
np.random.seed(0)

x1 = np.random.randint(10, size=6)
x2 = np.random.randint(10, size=(3,4))
x3 = np.random.randint(10, size=(3,4,5))

In [30]:
x1

array([5, 0, 3, 3, 7, 9])

In [31]:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [32]:
x3

array([[[8, 1, 5, 9, 8],
        [9, 4, 3, 0, 3],
        [5, 0, 2, 3, 8],
        [1, 3, 3, 3, 7]],

       [[0, 1, 9, 9, 0],
        [4, 7, 3, 2, 7],
        [2, 0, 0, 4, 5],
        [5, 6, 8, 4, 1]],

       [[4, 9, 8, 1, 1],
        [7, 9, 9, 3, 6],
        [7, 2, 0, 3, 5],
        [9, 4, 4, 6, 4]]])

In [34]:
# attributes ndim(no. of dimensions), shape(size of each dim), size(total size)

print("x3 ndim: %s\nx3 shape:%s\nx3 size:%s" %(x3.ndim,x3.shape,x3.size))

x3 ndim: 3
x3 shape:(3, 4, 5)
x3 size:60


In [35]:
print('dtype: ', x3.dtype)

dtype:  int32


In [37]:
print('itemsize: ',x3.itemsize, 'bytes') # each item
print('nbytes:',x3.nbytes,'bytes') # total array size

itemsize:  4 bytes
nbytes: 240 bytes
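The numbers above are related by `nbytes == size * itemsize`. The sketch below checks this on an array of the same shape. It fixes the dtype explicitly, because the default integer dtype (and therefore `itemsize`) depends on the platform and NumPy version; the array names are illustrative.

```python
import numpy as np

# Same shape as x3, with the dtype fixed so the result is platform independent.
demo32 = np.zeros((3, 4, 5), dtype=np.int32)
demo64 = np.zeros((3, 4, 5), dtype=np.int64)

for name, arr in [("int32", demo32), ("int64", demo64)]:
    # Total bytes = number of elements * bytes per element.
    print(name, arr.size, "*", arr.itemsize, "=", arr.nbytes, "bytes")
```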


##### Indexing