# Python for Machine Learning Bootcamp
## 1. Numpy
### Lists vs. Arrays
* Lists are standard Python objects (they come out of the box), whereas arrays are NumPy-specific objects.
* A list's size can change (i.e. append/delete), whereas an array's size/memory allocation is fixed; functions such as np.append return a new array instead.

In [4]:
# load module
import numpy as np

# enable intellisense
%config IPCompleter.greedy=True

# basic list
list1 = [0,1,2,3,4]

# create array
arr1d = np.array(list1)

# print array
print(list1)
print(arr1d)

[0, 1, 2, 3, 4]
[0 1 2 3 4]


* Lists do not support element-wise arithmetic (the + operator on a list means concatenation), whereas arrays apply arithmetic to every item.

In [5]:
# attempt to increment list, this will not work
list1 + 2

TypeError: can only concatenate list (not "int") to list

In [6]:
# increment array, this increments all array items by value specified
arr1d + 2

array([2, 3, 4, 5, 6])
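
The difference between `+` on a list and `+` on an array can be seen side by side in the minimal sketch below. The names (`values`, `joined`, `shifted`, `shifted_list`) are illustrative only and are not part of the notebook.

```python
import numpy as np

values = [0, 1, 2, 3, 4]      # illustrative list with the same contents as list1
arr = np.array(values)

# '+' between two lists concatenates them
joined = values + [5]

# '+' between an array and a scalar is applied to every element
shifted = arr + 2

# the list equivalent of element-wise addition needs an explicit loop
shifted_list = [v + 2 for v in values]

print(joined)
print(shifted)
print(shifted_list)
```

The list comprehension gives the same numbers as the array expression, but the array version avoids the Python-level loop.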

* There are many basic actions you can perform with arrays.

In [7]:
# create list of lists
list2 = [[1,1,1], [2,2,2], [3,3,3]]

# convert to array
arr2d = np.array(list2)

# print array
arr2d

array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])

In [8]:
# check object type
type(arr2d)

numpy.ndarray

In [9]:
# check data type within array
arr2d.dtype

dtype('int32')

__NOTE:__
* If you press shift + tab within the parentheses of a method (e.g. within the "array(...)" call below), it will show you the different options for that method.
* Shift + double tab will show you the full, expanded text relating to each method parameter.

In [12]:
# convert type of array 
arr2d = np.array(list2, dtype = 'float')

# print array
print(arr2d)

[[1. 1. 1.]
 [2. 2. 2.]
 [3. 3. 3.]]


In [11]:
# check data type within array
arr2d.dtype

dtype('float64')

In [20]:
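# convert type of array (to strings, then to integers)
# note: this assumes arr2d holds ints; a float array becomes strings like '1.0', which int conversion rejects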
arr2d = arr2d.astype('str')
arr2d = arr2d.astype('int')

* Lists can contain multiple data types, whereas all items in an array share a single data type.

In [16]:
# append string to list of ints
list1.append('6')

# show list
print(list1)

[0, 1, 2, 3, 4, '6']


In [21]:
# convert array into a list
np2list = arr2d.tolist()

# show list
print(np2list)

[[1, 1, 1], [2, 2, 2], [3, 3, 3]]


* You can convert arrays into many different object types

In [22]:
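# convert array into raw bytes (despite its name, tostring() returns bytes, not text;
# it is a deprecated alias of tobytes() and was removed in NumPy 2.0)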
arr2d.tostring()

b'\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x03\x00\x00\x00\x03\x00\x00\x00'

In [23]:
# convert array into bytes
arr2d.tobytes()

b'\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x03\x00\x00\x00\x03\x00\x00\x00'
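
The bytes shown above carry no shape or dtype information, so turning them back into an array means supplying both again. The sketch below illustrates one way to do this with `np.frombuffer`; the array is rebuilt under an illustrative name (`arr`) with an assumed `int32` dtype to match the 4-byte items in the output above.

```python
import numpy as np

arr = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]], dtype=np.int32)

# raw buffer contents only: no shape, no dtype
raw = arr.tobytes()

# rebuilding the array requires passing the dtype and shape back in
restored = np.frombuffer(raw, dtype=np.int32).reshape(arr.shape)

print(len(raw))                        # 9 items * 4 bytes each
print(np.array_equal(arr, restored))
```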

### Properties of Arrays

In [26]:
# show list 2
print(list2)

# convert array into float
arr2d = arr2d.astype('float')

# show arr2d
print(arr2d)

[[1, 1, 1], [2, 2, 2], [3, 3, 3]]
[[1. 1. 1.]
 [2. 2. 2.]
 [3. 3. 3.]]


* Shape indicates rows and columns of object (i.e. dimensions)

In [28]:
# shape of array
print('Shape: ', arr2d.shape)

Shape:  (3, 3)


* Size indicates the count of items within the array (i.e. 3 x 3 here)

In [29]:
# size of array
print('Size: ', arr2d.size)

Size:  9


* Ndim gets the number of dimensions of an object; a simple array has 1 dimension, whilst the standard grid has 2 dimensions.

In [32]:
# get dimensions of arrays
print(arr2d.ndim, arr1d.ndim)

2 1


* You can apply functions to arrays to modify their contents.
* You can extract values from a 1d array using standard indexing.

In [34]:
# square values within array
arr1d = arr1d * arr1d
print(arr1d)

# extract specific values
arr1d[1]

[  0   1  16  81 256]


1

* For 2 dimensional arrays, you must specify the row and column combination to extract specific values.

In [37]:
# print array
print(arr2d)

# extract row 1, column 0
arr2d[1][0] # [R][C], equivalent to arr2d[1, 0]

[[1. 1. 1.]
 [2. 2. 2.]
 [3. 3. 3.]]


2.0

* Boolean indexing and logic can be used on arrays.

In [40]:
# check logic on array
boolarr = arr2d<3

# show results (True for <3, False for >=3)
print(boolarr)

[[ True  True  True]
 [ True  True  True]
 [False False False]]
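
A boolean array like the one above can be used directly as a mask to select, count or overwrite values. The sketch below rebuilds the same 3x3 values under illustrative names (`arr`, `mask`, `arr_capped`), so it can run on its own.

```python
import numpy as np

arr = np.array([[1., 1., 1.], [2., 2., 2.], [3., 3., 3.]])
mask = arr < 3

# indexing with the mask returns a 1d array of the values where mask is True
selected = arr[mask]

# True counts as 1, so summing the mask counts the matches
n_true = int(mask.sum())

# '~' inverts the mask; here every value that fails the condition is set to 0
arr_capped = arr.copy()
arr_capped[~mask] = 0

print(selected)
print(n_true)
print(arr_capped)
```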


* Arrays can be sliced and reordered using indexing.
* Before the comma refers to rows, after the comma refers to columns.
* Below we use square brackets to index the array; the '::' means we are selecting all rows/columns because we're not specifying start/stop.
* The '-1' means the step is backwards, going through 1 item at a time.
* This is standard __[start:stop:step]__ slicing of an array/list.

In [42]:
# reverse array order
arr2d = arr2d[::-1, ::-1] # before comma changes row order, after comma changes column order
print(arr2d)

[[3. 3. 3.]
 [2. 2. 2.]
 [1. 1. 1.]]
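
Because every row of arr2d holds identical values, the column reversal is not visible in the output above. The sketch below uses a made-up 3x4 array (`grid`) with distinct values so the effect of each slice can be seen.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)   # illustrative 3x4 array holding 0..11

every_other_col = grid[:, ::2]       # all rows, every second column
rows_reversed = grid[::-1, :]        # rows in reverse order, columns untouched
both_reversed = grid[::-1, ::-1]     # rows and columns both reversed

print(every_other_col)
print(rows_reversed)
print(both_reversed)
```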


### Infinite and NaN
* Any calculation that includes these values (e.g. mean, sum...) returns nan or inf rather than a useful number, so they normally need to be found and replaced first.

In [43]:
# not a number (e.g. the result of an undefined calculation such as inf - inf)
np.nan

nan

In [44]:
# infinite number (e.g. divide by 0)
np.inf

inf

In [45]:
# insert nan and inf into array
arr2d[0][0] = np.nan # assign value to position
arr2d[0][1] = np.inf
print(arr2d)

[[nan inf  3.]
 [ 2.  2.  2.]
 [ 1.  1.  1.]]


In [48]:
# check which items are nan
print(np.isnan(arr2d))

# check which items are inf
print(np.isinf(arr2d))

[[ True False False]
 [False False False]
 [False False False]]
[[False  True False]
 [False False False]
 [False False False]]


In [50]:
# combine both to find any missing values
missing_flag = np.isnan(arr2d) | np.isinf(arr2d)
missing_flag

array([[ True,  True, False],
       [False, False, False],
       [False, False, False]])

In [53]:
# replace inf and nan with 0
arr2d[missing_flag] = 0
arr2d

array([[0., 0., 3.],
       [2., 2., 2.],
       [1., 1., 1.]])
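
As an alternative to overwriting the values, the aggregation itself can skip them. The sketch below is one possible approach, using `np.isfinite` and `np.nanmean` on a rebuilt copy of the array (the name `arr` and the variables holding the results are illustrative).

```python
import numpy as np

arr = np.array([[np.nan, np.inf, 3.0],
                [2.0, 2.0, 2.0],
                [1.0, 1.0, 1.0]])

plain_mean = arr.mean()            # nan: a single NaN takes over the aggregate

finite = np.isfinite(arr)          # False for NaN, +inf and -inf
finite_mean = arr[finite].mean()   # mean of the 7 finite values only

nan_mean = np.nanmean(arr)         # ignores NaN only, so the inf still dominates

print(plain_mean, finite_mean, nan_mean)
```

Note that `np.nanmean` only skips NaN, so an array that may also contain inf still needs the `np.isfinite` mask.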

### Numpy Statistical Operations
* Lots of standard operations such as mean, max, min, var, std etc.

In [54]:
arr2d.mean()

1.3333333333333333

In [55]:
arr2d.max()

3.0

In [56]:
arr2d.min()

0.0

In [57]:
arr2d.std()

0.9428090415820634

In [58]:
arr2d.var()

0.8888888888888888

* Press tab after the '.' and it will show you all of the other operations you can perform on the array.

In [59]:
arr2d.squeeze() # press shift + tab in brackets for explanation

array([[0., 0., 3.],
       [2., 2., 2.],
       [1., 1., 1.]])

In [60]:
arr2d.cumsum() # cumulative sum of each value in array

array([ 0.,  0.,  3.,  5.,  7.,  9., 10., 11., 12.])

In [63]:
# take first 2 rows and columns; a slice goes from the value before : up to (but not incl.) the value after :
arr = arr2d[:2,:2] # [R, C]
arr

array([[0., 0.],
       [2., 2.]])

* You can reshape a numpy array into any format that fits.
* For a 3x3 grid the below are the only 2D options; using e.g. reshape(2,4) wouldn't work because there are 9 values, not 8.

In [66]:
arr2d.reshape(1,9)
arr2d.reshape(9,1)
arr2d.reshape(3,3)

array([[0., 0., 3.],
       [2., 2., 2.],
       [1., 1., 1.]])
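
The sketch below shows two related points under illustrative names (`arr`, `col`, `grid`, `reshape_error`): passing -1 lets NumPy work out one dimension from the number of values, and a shape that does not fit raises a ValueError.

```python
import numpy as np

arr = np.arange(9)            # 9 illustrative values

col = arr.reshape(-1, 1)      # -1 lets NumPy infer the row count: (9, 1)
grid = arr.reshape(3, -1)     # inferred column count: (3, 3)

reshape_error = None
try:
    arr.reshape(2, 4)         # 2 * 4 = 8 slots cannot hold 9 values
except ValueError as err:
    reshape_error = str(err)

print(col.shape, grid.shape)
print(reshape_error)
```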

* You can also flatten arrays into 1d arrays. The values come out in the same order as with '.reshape(1,9)' from above, but the result has shape (9,) rather than (1, 9).

In [75]:
a = arr2d.flatten()
a # this is a fresh copy of the original arr2d, changes to a won't affect arr2d

array([0., 0., 3., 2., 2., 2., 1., 1., 1.])

* Ravel is usually more memory efficient than flatten. Flatten always creates a copy of the original array, so changes never affect the parent. Ravel returns a view of the parent whenever the memory layout allows it (so changes to the result also change the parent) and only falls back to a copy when it does not.

In [76]:
b = arr2d.ravel()
b # a view of arr2d only if arr2d is contiguous in memory, otherwise ravel returns a copy

array([0., 0., 3., 2., 2., 2., 1., 1., 1.])

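* In the below code, changes to a never affect arr2d because flatten always copies. Changes to b do not reach arr2d here either: arr2d was reversed earlier with [::-1, ::-1], which leaves it as a non-contiguous view, so ravel had to return a copy instead of a view.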

In [79]:
a[0] = -1
print(arr2d)

b[0] = -1
arr2d

[[0. 0. 3.]
 [2. 2. 2.]
 [1. 1. 1.]]


array([[0., 0., 3.],
       [2., 2., 2.],
       [1., 1., 1.]])
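
One way to check whether flatten or ravel handed back a copy or a view is `np.shares_memory`. The sketch below uses made-up arrays (`base`, `reversed_view`) rather than arr2d, and is only meant to illustrate the behaviour.

```python
import numpy as np

base = np.arange(9, dtype=float).reshape(3, 3)   # contiguous array
reversed_view = base[::-1, ::-1]                 # non-contiguous view of base

flat_copy = base.flatten()          # always a copy
flat_view = base.ravel()            # a view, because base is contiguous
flat_rev = reversed_view.ravel()    # a copy, because the layout is not contiguous

print(np.shares_memory(base, flat_copy))          # False
print(np.shares_memory(base, flat_view))          # True
print(np.shares_memory(reversed_view, flat_rev))  # False

flat_view[0] = -1      # writes through to base
print(base[0, 0])      # -1.0
```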

### Sequences
* Generate sequences of numbers with various parameters.
* Parameters include type (object, int, float etc.) and start/stop/step.

In [83]:
# create range from 1 to 7 in steps of 2 (does not include end of range specified)
a = np.arange(1, 8, 2, dtype = 'object') # specify type here, object means generic Python objects
print(a)

[1 3 5 7]


* Linspace is similar but generates a set amount of numbers between start and stop in a linearly spaced (i.e. evenly spaced) way, and it includes the stop value by default.

In [88]:
# generate 4 numbers between 1 and 50, evenly spaced
np.linspace(1, 50, 4)

array([ 1.        , 17.33333333, 33.66666667, 50.        ])

In [89]:
# generate 10 numbers between 10**1 and 10**50, evenly spaced on a log scale (base 10)
np.logspace(1, 50, 10)

array([1.00000000e+01, 2.78255940e+06, 7.74263683e+11, 2.15443469e+17,
       5.99484250e+22, 1.66810054e+28, 4.64158883e+33, 1.29154967e+39,
       3.59381366e+44, 1.00000000e+50])
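
The start and stop passed to logspace are exponents, not the values themselves. The sketch below illustrates this with small, made-up inputs and shows one possible way to get log-spaced values between 1 and 50 directly (the variable names are illustrative).

```python
import numpy as np

exponents = np.linspace(1, 3, 3)   # [1., 2., 3.]
powers = np.logspace(1, 3, 3)      # 10**1, 10**2, 10**3

# to get log-spaced values between 1 and 50 themselves,
# pass log10 of the end points as the exponents...
between = np.logspace(np.log10(1), np.log10(50), 4)

# ...or use geomspace, which takes the end points directly
between_geo = np.geomspace(1, 50, 4)

print(powers)
print(between)
```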

In [92]:
# generate an array filled with zeros
np.zeros([2,2])

array([[0., 0.],
       [0., 0.]])

In [93]:
# generate an array filled with ones
np.ones([2,2])

array([[1., 1.],
       [1., 1.]])

* You can create large arrays using sub-sections.
* Tile repeats the whole sequence end to end, retaining the original order.
* Repeat repeats each item in place before moving on to the next.

In [95]:
# create single row to be repeated
a = [1, 2, 3]

# create array from sub-section
b = np.tile(a, 3) # repeats a 3 times
b

array([1, 2, 3, 1, 2, 3, 1, 2, 3])

In [96]:
# repeat each item of the original, in order
np.repeat(a, 3)

array([1, 1, 1, 2, 2, 2, 3, 3, 3])

### Random Numbers

In [102]:
# create random values in 3x3 grid
# values are drawn uniformly from 0 up to (but not including) 1
np.random.rand(3, 3)

array([[0.96924004, 0.42660643, 0.69763458],
       [0.81692623, 0.11082988, 0.21879002],
       [0.61085334, 0.47716499, 0.5105914 ]])

In [103]:
# create random values but normally distributed
# standard normal: mean 0 and standard deviation 1, so values are not limited to a fixed range
np.random.randn(3, 3)

array([[-0.59403322, -0.87331152, -0.56172087],
       [-0.21892688, -1.04301048, -0.15963469],
       [ 0.83537078, -0.88104998, -1.18634497]])

In [106]:
# generates random integers from 0 up to (but not including) 10 in a 3x3 grid
# dtype sets the integer type of the output ('l' is the type code for a C long, 'h' for a C short)
np.random.randint(0, 10, [3, 3], dtype='l')

array([[3, 4, 5],
       [6, 2, 6],
       [5, 6, 1]])

* Seeds ensure that the random numbers generated are the same each time.
* The above code will produce different random numbers each time because it is unseeded.
* Whereas the below code will always produce the same random numbers because a seed has been set.

In [109]:
# a seed ensures that the random numbers generated are identical each time
np.random.seed(0)
np.random.randn(3, 3)

array([[ 1.76405235,  0.40015721,  0.97873798],
       [ 2.2408932 ,  1.86755799, -0.97727788],
       [ 0.95008842, -0.15135721, -0.10321885]])
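
The notebook uses the global `np.random.seed`. As an alternative sketch (not what the cells above use), NumPy also offers seeded generator objects via `np.random.default_rng`, which keep the random state local to one object. The names `rng_a`, `rng_b`, `draw_a`, `draw_b` and `ints` are illustrative.

```python
import numpy as np

rng_a = np.random.default_rng(0)   # seeded generator
rng_b = np.random.default_rng(0)   # second generator with the same seed

draw_a = rng_a.standard_normal((3, 3))
draw_b = rng_b.standard_normal((3, 3))

# same seed, same sequence of draws
same = np.array_equal(draw_a, draw_b)
print(same)

# integers from 0 up to (but not including) 10
ints = rng_a.integers(0, 10, size=(3, 3))
print(ints)
```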

In [112]:
# show unique values within array
np.unique(arr2d)

array([0., 1., 2., 3.])

In [115]:
# show unique values followed by the count of each
uniques, counts = np.unique(arr2d, return_counts=True)
print(arr2d, uniques, counts)

[[0. 0. 3.]
 [2. 2. 2.]
 [1. 1. 1.]] [0. 1. 2. 3.] [2 3 3 1]


### Conditional Indexing
* np.where lets you extract things based on specific criteria.

In [118]:
# create array
arr = np.array([8,94,8,56,1,3,4,5,7,23,18])
print(arr)

[ 8 94  8 56  1  3  4  5  7 23 18]


In [122]:
# get values greater than 10
index_get10 = np.where(arr>10)

# show index of results
print(index_get10) # shows index of values, rather than values themselves

# show actual values
print(arr[index_get10])

# alternate method
print(arr[arr>10])

# show boolean indexing of condition
print(arr>10)

(array([ 1,  3,  9, 10], dtype=int64),)
[94 56 23 18]
[94 56 23 18]
[False  True False  True False False False False False  True  True]


In [123]:
# replace values with "if condition then x, else y" logic
np.where(arr>10, 'gt10', 'lt10')

array(['lt10', 'gt10', 'lt10', 'gt10', 'lt10', 'lt10', 'lt10', 'lt10',
       'lt10', 'gt10', 'gt10'], dtype='<U4')
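
The "if condition then x, else y" form also accepts arrays for x and y, so values that fail the condition can be left unchanged. A minimal sketch, reusing the same values under illustrative names (`capped`, `count_gt10`):

```python
import numpy as np

arr = np.array([8, 94, 8, 56, 1, 3, 4, 5, 7, 23, 18])

# values above 10 become 10, all other values stay as they are
capped = np.where(arr > 10, 10, arr)

# True counts as 1, so summing the condition counts the matches
count_gt10 = int((arr > 10).sum())

print(capped)
print(count_gt10)
```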

* As well as getting max and min values, you can also get the index of these values if you like.

In [126]:
print(arr)
print(arr.max())
print(arr.argmax())
print(arr.min())
print(arr.argmin())

[ 8 94  8 56  1  3  4  5  7 23 18]
94
1
1
4


### Reading and Writing CSVs
* Two main functions: genfromtxt() and loadtxt().
* genfromtxt() can handle missing values and columns of mixed types (e.g. text, ints etc.), whilst loadtxt() is simpler and expects every row to have the same number of values, with none missing.
* If you get errors initially, look at the source data and see what conditions/parameters you need to add to your code to handle things like delimiters, headers etc.

In [135]:
# read in CSV text from GitHub
# use shift + tab to see options (lots here!)
data = np.genfromtxt('https://raw.githubusercontent.com/selva86/datasets/master/Auto.csv',
                     delimiter=',', skip_header=1, filling_values=-1000, dtype='float')

# check rows and cols
data.shape

(392, 9)

In [134]:
# turn off scientific notation (ie. e+01)
np.set_printoptions(suppress=True)

# inspect first 3 rows
data[:3]

array([[   18. ,     8. ,   307. ,   130. ,  3504. ,    12. ,    70. ,
            1. , -1000. ],
       [   15. ,     8. ,   350. ,   165. ,  3693. ,    11.5,    70. ,
            1. , -1000. ],
       [   18. ,     8. ,   318. ,   150. ,  3436. ,    11. ,    70. ,
            1. , -1000. ]])
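
The effect of `filling_values` can be tried without a network connection by reading from an in-memory string. The sketch below is a small illustration only: the CSV text is made up, and `usecols` is used to keep just the two numeric columns.

```python
import io
import numpy as np

# small in-memory CSV (made-up values) with one missing entry in the second row
csv_text = "mpg,cylinders,name\n18.0,8,chevy\n,6,buick\n24.5,4,toyota\n"

table = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1,
                      usecols=(0, 1), filling_values=-1000, dtype=float)

print(table.shape)
print(table)
```

The empty field in the second row comes back as -1000, which is the same mechanism that fills the text column with -1000 in the output above.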

In [137]:
# read in data
data2 = np.genfromtxt('https://raw.githubusercontent.com/selva86/datasets/master/Auto.csv',
                      delimiter=',', skip_header=1, dtype=None)

# get first 3 rows
data2[:3]


array([(18., 8, 307., 130, 3504, 12. , 70, 1, b'"chevrolet chevelle malibu"'),
       (15., 8, 350., 165, 3693, 11.5, 70, 1, b'"buick skylark 320"'),
       (18., 8, 318., 150, 3436, 11. , 70, 1, b'"plymouth satellite"')],
      dtype=[('f0', '<f8'), ('f1', '<i4'), ('f2', '<f8'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<f8'), ('f6', '<i4'), ('f7', '<i4'), ('f8', 'S38')])

In [138]:
# save data (a relative path is written to the current working directory)
np.savetxt('data.csv', data, delimiter=',')

In [139]:
# save array as an actual numpy array (not as text etc.)
np.save('data.npy', data)

In [140]:
# saves multiple arrays together
np.savez('data2.npz', data, data2)

In [144]:
# load data into memory from file
d = np.load('data2.npz')

# check number of arrays/files
print(d.files)

# extract specific array
print(d['arr_1'])

['arr_0', 'arr_1']
[(18. , 8, 307. , 130, 3504, 12. , 70, 1, b'"chevrolet chevelle malibu"')
 (15. , 8, 350. , 165, 3693, 11.5, 70, 1, b'"buick skylark 320"')
 (18. , 8, 318. , 150, 3436, 11. , 70, 1, b'"plymouth satellite"')
 (16. , 8, 304. , 150, 3433, 12. , 70, 1, b'"amc rebel sst"')
 (17. , 8, 302. , 140, 3449, 10.5, 70, 1, b'"ford torino"')
 (15. , 8, 429. , 198, 4341, 10. , 70, 1, b'"ford galaxie 500"')
 (14. , 8, 454. , 220, 4354,  9. , 70, 1, b'"chevrolet impala"')
 (14. , 8, 440. , 215, 4312,  8.5, 70, 1, b'"plymouth fury iii"')
 (14. , 8, 455. , 225, 4425, 10. , 70, 1, b'"pontiac catalina"')
 (15. , 8, 390. , 190, 3850,  8.5, 70, 1, b'"amc ambassador dpl"')
 (15. , 8, 383. , 170, 3563, 10. , 70, 1, b'"dodge challenger se"')
 (14. , 8, 340. , 160, 3609,  8. , 70, 1, b'"plymouth \'cuda 340"')
 (15. , 8, 400. , 150, 3761,  9.5, 70, 1, b'"chevrolet monte carlo"')
 (14. , 8, 455. , 225, 3086, 10. , 70, 1, b'"buick estate wagon (sw)"')
 (24. , 4, 113. ,  95, 2372, 15. , 70, 3, b'"t
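
The default names 'arr_0' and 'arr_1' can be replaced by passing the arrays to savez as keyword arguments. The sketch below assumes made-up arrays (`features`, `labels`) and a hypothetical file name, and writes to a temporary folder so that it leaves nothing behind.

```python
import os
import tempfile
import numpy as np

features = np.arange(6).reshape(2, 3)   # illustrative arrays
labels = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bundle.npz')    # hypothetical file name

    # keyword names become the keys inside the .npz file
    np.savez(path, features=features, labels=labels)

    # the loaded object behaves like a dictionary of arrays
    with np.load(path) as bundle:
        names = sorted(bundle.files)
        loaded = bundle['features']           # read while the file is open

print(names)
print(loaded)
```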

### Concatenate Arrays
* 3 major methods of vertical concatenation; np.concatenate, np.vstack, np.r_
* 3 major methods of horizontal concatenation; np.concatenate, np.hstack, np.c_

In [149]:
# create dummy arrays
arr1 = np.zeros([4,4])
arr2 = np.ones([4,4])

# concatenate by rows
np.concatenate([arr1, arr2], axis=0) # 0 axis means rows, this is default
np.vstack([arr1, arr2])
np.r_[arr1, arr2]

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [153]:
# 3 methods of horizontal concatenation
np.concatenate([arr1, arr2], axis=1) # 1 axis means columns
np.hstack([arr1, arr2])
np.c_[arr1, arr2]

array([[0., 0., 0., 0., 1., 1., 1., 1.],
       [0., 0., 0., 0., 1., 1., 1., 1.],
       [0., 0., 0., 0., 1., 1., 1., 1.],
       [0., 0., 0., 0., 1., 1., 1., 1.]])
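
Concatenation only works when the arrays agree on every axis except the one being joined. The sketch below uses two made-up arrays (`top`, `bottom`) with different row counts to illustrate this.

```python
import numpy as np

top = np.zeros((2, 3))
bottom = np.ones((4, 3))     # different row count, same column count

# works: the column counts match, result has 2 + 4 rows
stacked = np.vstack([top, bottom])
print(stacked.shape)

# fails: joining side by side needs matching row counts (2 vs 4)
hstack_error = None
try:
    np.hstack([top, bottom])
except ValueError as err:
    hstack_error = str(err)
print(hstack_error)
```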

### Sorting Arrays

In [155]:
# create dummy array
arr = np.random.randint(1, 10, size=[10,5]) # size is rows and cols
arr

array([[6, 5, 5, 7, 5],
       [5, 4, 5, 5, 9],
       [5, 4, 8, 6, 6],
       [1, 2, 6, 4, 1],
       [6, 1, 2, 3, 5],
       [3, 1, 4, 3, 1],
       [8, 6, 1, 3, 8],
       [3, 3, 4, 4, 3],
       [4, 5, 2, 3, 2],
       [5, 7, 9, 3, 4]])

In [156]:
# sort array (default axis=-1 sorts the values within each row independently)
np.sort(arr)

array([[5, 5, 5, 6, 7],
       [4, 5, 5, 5, 9],
       [4, 5, 6, 6, 8],
       [1, 1, 2, 4, 6],
       [1, 2, 3, 5, 6],
       [1, 1, 3, 3, 4],
       [1, 3, 6, 8, 8],
       [3, 3, 3, 4, 4],
       [2, 2, 3, 4, 5],
       [3, 4, 5, 7, 9]])

* The below code sorts each column independently (i.e. down the rows), so it doesn't retain whole rows; the first row below (11131) doesn't exist anywhere in the above array.

In [157]:
# sort array down each column (each column is sorted independently, so whole rows are broken up)
np.sort(arr, axis=0) # axis 0 means sort along the rows axis, i.e. within each column

array([[1, 1, 1, 3, 1],
       [3, 1, 2, 3, 1],
       [3, 2, 2, 3, 2],
       [4, 3, 4, 3, 3],
       [5, 4, 4, 3, 4],
       [5, 4, 5, 4, 5],
       [5, 5, 5, 4, 5],
       [6, 5, 6, 5, 6],
       [6, 6, 8, 6, 8],
       [8, 7, 9, 7, 9]])

* To solve this issue, we can separately get the sort order of whatever we want to sort by (in this case, the first column).
* Then we apply that sort order to the array to sort it without losing row integrity.

In [163]:
# sort first column only (showing index order of each item)
sorted_index = arr[:,0].argsort()
sorted_index

array([3, 5, 7, 8, 1, 2, 9, 0, 4, 6], dtype=int64)

In [162]:
# sort array based on first column order
arr[sorted_index]

array([[1, 2, 6, 4, 1],
       [3, 1, 4, 3, 1],
       [3, 3, 4, 4, 3],
       [4, 5, 2, 3, 2],
       [5, 4, 5, 5, 9],
       [5, 4, 8, 6, 6],
       [5, 7, 9, 3, 4],
       [6, 5, 5, 7, 5],
       [6, 1, 2, 3, 5],
       [8, 6, 1, 3, 8]])
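
When the first column contains ties (as with the 3s, 5s and 6s above), a second column can be used as a tie-breaker. The sketch below is one possible approach using `np.lexsort` on a small made-up array (`rows`); note that lexsort treats the last key as the primary one.

```python
import numpy as np

rows = np.array([[3, 9],
                 [1, 5],
                 [3, 2],
                 [1, 7]])

# sort whole rows by the first column only (stable, so ties keep their original order)
by_first = rows[rows[:, 0].argsort(kind='stable')]

# sort by the first column, then by the second column within ties
order = np.lexsort((rows[:, 1], rows[:, 0]))   # last key is the primary key
by_both = rows[order]

print(by_first)
print(by_both)
```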

### Working with Dates

In [165]:
# create date
d = np.datetime64('2020-05-28')
d

numpy.datetime64('2020-05-28')

In [166]:
# increment date by 1 day
d + 1

numpy.datetime64('2020-05-29')

In [168]:
# create date with hours
d = np.datetime64('2020-05-28 15:00:00')
d

numpy.datetime64('2020-05-28T15:00:00')

In [170]:
# add 16m40s (1000 seconds = 16 * 60 + 40)
d + 1000 # an integer is added in the datetime's own unit, which is seconds here

numpy.datetime64('2020-05-28T15:16:40')

In [175]:
# create single day
oneday = np.timedelta64(1, 'D') # 'D' for day

# add to datetime
d + oneday

# create one minute
oneminute = np.timedelta64(1, 'm')

# add to datetime
d + oneminute

numpy.datetime64('2020-05-28T15:01:00')

__NOTE:__
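* A standard Python range and a numpy arange both generate evenly spaced values. The main differences are that arange returns an array (which makes element-wise calculations on the result much quicker) and that it also accepts non-integer types such as floats and dates.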

In [177]:
# create a daterange
dates = np.arange(np.datetime64('2020-01-01'), np.datetime64('2020-05-28'))
dates[:2]

array(['2020-01-01', '2020-01-02'], dtype='datetime64[D]')
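
Subtracting two dates gives a timedelta, and arange accepts a timedelta as its step. The sketch below illustrates both with the same start and end dates as above; the names `span`, `n_days` and `weekly` are illustrative.

```python
import numpy as np

start = np.datetime64('2020-01-01')
end = np.datetime64('2020-05-28')

# difference between two dates is a timedelta64 measured in days
span = end - start
n_days = int(span / np.timedelta64(1, 'D'))

# every 7th day from start up to (but not including) end
weekly = np.arange(start, end, np.timedelta64(7, 'D'))

print(n_days)
print(weekly[:3])
```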

### Advanced Numpy Functions
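* The below function is basic and runs quickly on a single number input.
* However, it cannot be called on a whole array as written, because the if statement only works on a single value. __vectorize__ wraps the function so that it is called on every element of an array.
* Note that np.vectorize is provided mainly for convenience (internally it is essentially a loop), so it does not give the speed of true NumPy array operations.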

In [181]:
# define basic function/method
def foo(x):
    if x%2 == 1:
        return x**2
    else:
        return x/2
    
# call function on single int
foo(11)

121

In [184]:
# wrap function using vectorize
foo_v = np.vectorize(foo, otypes=[float])

# apply function to array using vectorize
foo_v(arr)

array([[ 3., 25., 25., 49., 25.],
       [25.,  2., 25., 25., 81.],
       [25.,  2.,  4.,  3.,  3.],
       [ 1.,  1.,  3.,  2.,  1.],
       [ 3.,  1.,  1.,  9., 25.],
       [ 9.,  1.,  2.,  9.,  1.],
       [ 4.,  3.,  1.,  9.,  4.],
       [ 9.,  9.,  2.,  2.,  9.],
       [ 2., 25.,  1.,  9.,  1.],
       [25., 49., 81.,  9.,  2.]])
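
For a simple if/else rule like foo, the same result can be computed with whole-array operations instead of np.vectorize. The sketch below is one possible rewrite using `np.where`; the input `values` is made up, and foo is repeated so the block runs on its own.

```python
import numpy as np

def foo(x):
    # same rule as above: square odd numbers, halve even numbers
    if x % 2 == 1:
        return x ** 2
    else:
        return x / 2

values = np.array([1, 2, 3, 4, 11])     # illustrative input

# element-by-element call through np.vectorize
looped = np.vectorize(foo, otypes=[float])(values)

# whole-array version: both branches are computed for every element,
# then np.where picks the right one per position
whole_array = np.where(values % 2 == 1, values ** 2, values / 2)

print(looped)
print(whole_array)
```

Because np.where evaluates both branches for every element, this style suits cheap arithmetic like the example here; it is a sketch of the idea rather than a drop-in replacement for every function.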