<a href="https://colab.research.google.com/github/cboyda/MachineLearning/blob/main/L1_ML_Tools_with_Python_(Optional).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###Disclaimer
The information contained in this notebook and any accompanying files is proprietary and confidential to the participants of the Machine Learning Technician program and should not be copied, distributed or reproduced in whole or in part, nor passed to any third party, without written permission from the Alberta Machine Intelligence Institute, Amii.

#L1: ML Tools with Python (Optional)



Before we start, let us point out that, as the title indicates, this lab is meant as an optional (but highly recommended) review of the prerequisites and tools we use throughout the program and is not intended as a part of the ML Technician program.

Also remember that you can search the documentation for any of the functions mentioned here. All packages mentioned here have thorough and easily accessible documentation.

**Note**: You only have read-only access to this and all further notebooks. To be able to modify things in this notebook, you have to copy the notebook to your own personal Google Drive, which you can do by selecting the 'Save a copy in Drive' option in the 'File' menu. Then you will have a copy of the notebook in the 'Colab Notebooks' folder in your personal Google Drive.

## Jupyter notebooks and Google Colab notebooks

Please review the following ideas before proceeding:

- Jupyter notebooks and the Google Colab environment:
  - Cells
  - Execution
  - Text, markdown and math
  - Outputs (text, graphics, tables)
  - Menu
  - Magic
    - `%...?`
    - `%whos`
    - `%reset`, `%reset_selective`
  - Installing packages `!pip install ...`, `!apt-get install ...`
- Python Objects
  - Classes
  - Objects or instances
  - Methods
  - Attributes
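
The Python object vocabulary in the list above can be illustrated with a minimal sketch. The class `Vector2D` and its members are hypothetical names chosen only for this example:

```python
class Vector2D:
    """A tiny 2-D vector class (hypothetical, for illustration only)."""

    def __init__(self, x, y):
        # Attributes: data stored on each instance
        self.x = x
        self.y = y

    def norm(self):
        # Method: a function that operates on the instance's attributes
        return (self.x ** 2 + self.y ** 2) ** 0.5


p = Vector2D(3, 4)        # p is an object (an instance of the class Vector2D)
print(type(p).__name__)   # name of the class the object belongs to
print(p.x, p.y)           # attributes are accessed with the dot operator
print(p.norm())           # methods are called with the dot operator too
```

The same dot notation is what you see later in expressions such as `A.shape` (an attribute) and `A.mean()` (a method).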

## Vectors, matrices and Numpy 

In Python, we can define vectors with lists and matrices with lists of lists:

In [None]:
v = [3, 2, 1]

In [None]:
v

[3, 2, 1]

In [19]:
M = [[1, 4, 5], [-1, 2, 7], [0, 0, 3]]

In [None]:
M

...and we can access elements using indexing operators (we can also access rows of a matrix defined this way by using a single indexing operator):

In [None]:
v[1]

In [None]:
M[1][0]

In [18]:
M[2]

However, this is a time-inefficient way to store vectors and matrices. Vectors and matrices are usually composed of elements of the same data type, and we often perform arithmetic (and other kinds of) operations on very large ones, so we need those operations to be fast.

Fortunately, Python has a package just for that: [NumPy](https://numpy.org/)!

In [6]:
import numpy as np

Now, let's create a NumPy *array* (which is the data structure we use to keep vectors and matrices) from our matrix `M` defined as a list of lists: 

In [20]:
A = np.array(M)

This is one way of defining a NumPy array. There are other ways as well. For now, let's check this new NumPy array `A`: 

In [None]:
type(A)

The type is `ndarray` ($n$-D array as in $n$-dimensional array) defined in package `numpy` (remember the dot here denotes package membership).

In [None]:
A

The data is displayed like a list of lists wrapped in an array. That is only how it is printed; it does not reflect how the values are stored inside the array.
You can't have an array with mixed data types:

In [None]:
np.array([1, "Hello"])

See how the `1` is no longer an integer but has been converted to the string `'1'`, and how the array has a single data type, given by its `dtype` property, for the whole array (in this case `<U21`, a kind of Unicode text string).

Nor can you build a regular array from a list of lists of different lengths:

In [1]:
np.array([[1, 2, 3], [4, 5]])

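In older NumPy versions, you get a 1-D array containing two `list` objects rather than a 2-D array. Recent versions (NumPy 1.24 and later) raise a `ValueError` instead, unless you explicitly pass `dtype=object`.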
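
As a small sketch of what happens when rows of different lengths are stored on purpose, the following passes `dtype=object` explicitly. The variable name `ragged` is an assumption made for this example:

```python
import numpy as np

# Explicitly ask for an array of Python objects
ragged = np.array([[1, 2, 3], [4, 5]], dtype=object)

print(ragged.shape)      # one axis only: each element is a whole row
print(ragged.dtype)      # object
print(type(ragged[0]))   # the rows are kept as plain Python lists
```

Such an array loses most of the speed benefits of NumPy, so it is rarely what you want for numerical work.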

NumPy arrays come with new operations of their own. You can use the `shape` attribute to check the size of an array:

In [None]:
A.shape

...and if you had an array containing a vector:

In [None]:
w = np.array(v)
display(w)
w.shape

Just be aware that sometimes we may prefer to store a vector as a matrix with a single column:

In [None]:
v2 = [[3], [2], [1]]

w2 = np.array([[3], [2], [1]])
display(w2)
w2.shape

To index NumPy arrays, you can put the indices for all dimensions inside a single pair of square brackets, separated by commas:

In [None]:
A[1, 2]

You can use the lists of lists indexing technique as well:

In [None]:
A[1, 2] == A[1][2]

You can use the same techniques for choosing a range of indices:

In [None]:
A[0:2, :-1]

You can also use `...` (an ellipsis) to select the whole range along all remaining axes (especially useful with higher-dimensional arrays):

In [None]:
A[1, ...]

...which is basically:

In [None]:
A[1, :]

...and if you had a higher dimensional array:

In [2]:
A_h = np.array([[[[1, 2, 3, 4], 
                  [2, 3, 4, 5], 
                  [3, 4, 5, 6]], 
                 [[-1, -2, -3, -4], 
                  [-2, -3, -4, -5], 
                  [-3, -4, -5, -6]]]])

In [None]:
A_h

In [None]:
A_h.shape

In [1]:
A_h.ndim

Then:

In [None]:
A_h[0, ...]

In [None]:
A_h[0, :, :, :]
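
The ellipsis can also stand for the leading axes. A minimal sketch, using a hypothetical 4-D array `T` built only for this example:

```python
import numpy as np

T = np.arange(24).reshape(1, 2, 3, 4)   # hypothetical 4-D array

first = T[0, ...]    # same as T[0, :, :, :]
last = T[..., 0]     # same as T[:, :, :, 0]

print(first.shape)
print(last.shape)
```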

###Matrix multiplication

Let's first implement matrix multiplication with matrices represented as lists of lists:

In [3]:
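def multiply_list_vec(A, B):
  # Sizes: A is nrA x ncA and B is nrB x ncB
  nrA = len(A)
  ncA = len(A[0])
  nrB = len(B)
  ncB = len(B[0])
  # The product is defined only if the inner sizes match
  if ncA != nrB:
    return None
  C = []
  for rnum in range(nrA):
    C.append([])
    for cnum in range(ncB):
      C[rnum].append(0)
      # Entry (rnum, cnum) is the dot product of row rnum of A and column cnum of B
      for third in range(ncA):
        C[rnum][cnum] += A[rnum][third] * B[third][cnum]
  return C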

Now, let's use that:

In [4]:
exampleA = [[1, 1, 1], [1, 1, 1]]
exampleB = [[3], [4], [5]]

multiply_list_vec(exampleA, exampleB)

[[12], [12]]

Matrix multiplication of NumPy arrays is done simply with the `dot` function defined in the NumPy package:

In [7]:
M1 = np.array(exampleA)
M2 = np.array(exampleB)

In [8]:
np.dot(M1, M2)

array([[12],
       [12]])

You can also use the `@` operator (which calls `np.matmul`); for 2-D arrays like these, it gives the same result as `dot`:

In [9]:
M1 @ M2

array([[12],
       [12]])
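
For 2-D arrays, `@` and `dot` agree, as in the outputs above. For arrays with more than two dimensions they behave differently, which the following sketch illustrates with two hypothetical arrays of ones:

```python
import numpy as np

a = np.ones((2, 3, 4))
b = np.ones((2, 4, 5))

batched = a @ b          # matmul: the first axis is treated as a batch of matrices
general = np.dot(a, b)   # dot: sums over the last axis of a and the second-to-last of b

print(batched.shape)
print(general.shape)
```

In machine learning code, the batched behaviour of `@` is usually the one you want.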

####Why NumPy?

Let's time the two matrix multiplications and compare them:

In [10]:
import time

In [11]:
testA = np.random.random((100, 200))
testB = np.random.random((200, 400))

`np.random.random` generates random matrices; we will explain it later. Let's proceed to the timing comparison:

In [12]:
start_time = time.time()
multiply_list_vec(testA, testB)
end_time = time.time()
print("Time elapsed using lists :", end_time - start_time)

Time elapsed using lists : 6.420735120773315


In [13]:
start_time = time.time()
testA @ testB
end_time = time.time()
print("Time elapsed using NumPy :", end_time - start_time)

Time elapsed using NumPy : 0.008374929428100586
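
A variation on this comparison, as a sketch: it converts the arrays to genuine lists of lists with `tolist` before running the loop version, uses `time.perf_counter` (a clock intended for measuring short durations), and checks that both approaches give the same result. The helper `matmul_lists` and the smaller matrix sizes are assumptions made for this example:

```python
import time
import numpy as np

def matmul_lists(A, B):
    """Triple-loop matrix product on lists of lists (illustrative helper)."""
    n_rows, n_inner, n_cols = len(A), len(B), len(B[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(n_inner):
                C[i][j] += A[i][k] * B[k][j]
    return C

rng = np.random.default_rng(0)
small_a = rng.random((30, 40))
small_b = rng.random((40, 20))

# Convert to real lists of lists so the loop version does not index NumPy arrays
list_a, list_b = small_a.tolist(), small_b.tolist()

start = time.perf_counter()
loop_result = matmul_lists(list_a, list_b)
loop_time = time.perf_counter() - start

start = time.perf_counter()
numpy_result = small_a @ small_b
numpy_time = time.perf_counter() - start

print("loops:", loop_time, "NumPy:", numpy_time)
print("same result:", np.allclose(loop_result, numpy_result))
```

The exact timings depend on the machine, so only the order of magnitude of the difference is meaningful.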


###More NumPy methods, attributes and functions

The arithmetic operators of regular Python (`+`, `-`, `*`, `/`, ...) act *element-wise* on NumPy arrays. That means addition and subtraction behave like matrix addition and subtraction; multiplication, however, is what is called the *Hadamard product* rather than matrix multiplication:

In [48]:
B = np.array([[10, -1, 3], [6, 6, 6], [-9, 5, 8]])

In [21]:
display(A)

array([[ 1,  4,  5],
       [-1,  2,  7],
       [ 0,  0,  3]])

In [16]:
display(B)

array([[10, -1,  3],
       [ 6,  6,  6],
       [-9,  5,  8]])

Addition:

In [22]:
A + B

array([[11,  3,  8],
       [ 5,  8, 13],
       [-9,  5, 11]])

You can also use `add` from NumPy:

In [23]:
np.add(A, B)

array([[11,  3,  8],
       [ 5,  8, 13],
       [-9,  5, 11]])

Subtraction:

In [24]:
A - B

array([[-9,  5,  2],
       [-7, -4,  1],
       [ 9, -5, -5]])

You can also use `subtract` from NumPy:

In [25]:
np.subtract(A, B)

array([[-9,  5,  2],
       [-7, -4,  1],
       [ 9, -5, -5]])

Hadamard (element-wise) product, $A \odot B$:

In [26]:
A * B

array([[10, -4, 15],
       [-6, 12, 42],
       [ 0,  0, 24]])

Where matrix multiplication would have given:

In [47]:
A @ B

array([[-11,  48,  67],
       [-61,  48,  65],
       [-27,  15,  24]])

You can also use `multiply` for Hadamard product:

In [28]:
np.multiply(A, B)

array([[10, -4, 15],
       [-6, 12, 42],
       [ 0,  0, 24]])

You can also use other arithmetic operations, and all of them are element-wise. For example, division with `/` or `divide`:

In [29]:
A / B

array([[ 0.1       , -4.        ,  1.66666667],
       [-0.16666667,  0.33333333,  1.16666667],
       [-0.        ,  0.        ,  0.375     ]])

In [30]:
np.divide(A, B)

array([[ 0.1       , -4.        ,  1.66666667],
       [-0.16666667,  0.33333333,  1.16666667],
       [-0.        ,  0.        ,  0.375     ]])

And other examples:

In [46]:
A // B

array([[ 0, -4,  1],
       [-1,  0,  1],
       [ 0,  0,  0]])

In [32]:
A % B

array([[1, 0, 2],
       [5, 2, 1],
       [0, 0, 3]])

In [33]:
A ** B

ValueError: Integers to negative integer powers are not allowed.

Oops! You can't raise integers to negative integer powers, since the result would not be an integer. Let's use a math function from NumPy, `abs`, which takes absolute values element-wise, to avoid this:

In [45]:
A ** np.abs(B)

array([[     1,      4,    125],
       [     1,     64, 117649],
       [     0,      0,   6561]])

We can use many other math functions element-wise as well:

In [34]:
np.sin(A)

array([[ 0.84147098, -0.7568025 , -0.95892427],
       [-0.84147098,  0.90929743,  0.6569866 ],
       [ 0.        ,  0.        ,  0.14112001]])

In [35]:
np.exp(A)

array([[2.71828183e+00, 5.45981500e+01, 1.48413159e+02],
       [3.67879441e-01, 7.38905610e+00, 1.09663316e+03],
       [1.00000000e+00, 1.00000000e+00, 2.00855369e+01]])

In [36]:
np.log(A)

array([[0.        , 1.38629436, 1.60943791],
       [       nan, 0.69314718, 1.94591015],
       [      -inf,       -inf, 1.09861229]])

You can't take logarithms of negative values either, and the logarithm of zero is $-\infty$. Note the `nan` and `inf` values: `nan` in NumPy stands for *n*ot-*a*-*n*umber and marks a result that is undefined or invalid, like the logarithm of `-1` here. `inf` means *inf*inity, so `-inf` is $-\infty$:

In [37]:
np.nan

nan

In [38]:
np.inf

inf

Floating-point division by zero also produces `inf` or `-inf` (or `nan` for `0 / 0`), together with a warning.
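
To test for these special values, comparison with `==` is not reliable, because `nan` never compares equal to anything. A minimal sketch using `np.isnan` and `np.isinf`, with a hypothetical array `x`:

```python
import numpy as np

x = np.array([1.0, -1.0, 0.0])

# errstate silences the divide-by-zero / invalid-value warnings for this block only
with np.errstate(divide="ignore", invalid="ignore"):
    ratios = x / 0.0      # [inf, -inf, nan]

print(ratios)
print(np.isinf(ratios))
print(np.isnan(ratios))
print(np.nan == np.nan)   # nan never compares equal, not even to itself
```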

Also, if you use comparison operators, they will be element-wise:

In [44]:
A == B

array([[False, False, False],
       [False, False, False],
       [False, False, False]])

In [43]:
A < B

array([[ True, False, False],
       [ True,  True, False],
       [False,  True,  True]])

However, if you want to know whether a comparison is `True` for any or for all elements, use `any` or `all` from NumPy:

In [49]:
np.all(A == B)

False

In [50]:
np.all(A == A)

True

In [51]:
np.any(A > 5)

True

You can also use NumPy versions of logical operators:

In [52]:
np.logical_and(A < B, A > 1)

array([[False, False, False],
       [False,  True, False],
       [False, False,  True]])

In [53]:
np.logical_or(A > 1, B < 10)

array([[False,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])

In [54]:
np.logical_not(A == B)

array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])

Or use the "bitwise" operators to accomplish the same things:

In [55]:
(A < B) & (A > 1)

array([[False, False, False],
       [False,  True, False],
       [False, False,  True]])

In [56]:
(A > 1) | (B < 10)

array([[False,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])

In [57]:
~(A == B)

array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])

Transpose of a matrix:

In [59]:
display(A)

array([[ 1,  4,  5],
       [-1,  2,  7],
       [ 0,  0,  3]])

In [58]:
A.transpose()

array([[ 1, -1,  0],
       [ 4,  2,  0],
       [ 5,  7,  3]])

...or simply:

In [60]:
A.T

array([[ 1, -1,  0],
       [ 4,  2,  0],
       [ 5,  7,  3]])

Creating a matrix of all zeros:

In [61]:
np.zeros((2, 5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

Note the double parentheses. The outer ones are for the function call. The inner ones are there because, as you saw when you read the `shape` of an array, sizes are given as a tuple: that is how you specify an array size in NumPy.

Now, a matrix of all ones:

In [62]:
np.ones((3, 4))

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

Which is different from an *identity matrix*:

In [63]:
np.eye(4)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [64]:
np.eye(8)


array([[1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1.]])

The identity matrix is square, so you only specify the length of one side rather than the full size, which is why there are no inner parentheses.

You can also generate a range of numbers in a vector with `arange` from NumPy:

In [None]:
np.arange(0.0, 10.0, 0.1)

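The parameters work like those of Python's `range` function (start, stop, step), except that they can be floating-point numbers; the stop value is excluded.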

You can alternatively use `linspace`:

In [None]:
np.linspace(0.0, 10.0, 101)

Here, the third parameter is the number of points rather than step size. Also, note that the end parameter is inclusive here.
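
The difference in how the two functions treat the end value can be seen on a small range. The variable names here are chosen only for this example:

```python
import numpy as np

steps = np.arange(0.0, 1.0, 0.25)    # step size given, stop value excluded
points = np.linspace(0.0, 1.0, 5)    # number of points given, stop value included

print(steps)
print(points)
```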

You can generate points with logarithmic spacing as well:

In [None]:
np.logspace(0.0, 20.0, 21)

For many functions that generate arrays, you can specify the data type of the created array with the `dtype` parameter:

In [None]:
np.logspace(0.0, 20.0, 21, dtype='uint64')

The data type `'uint64'` is an *u*nsigned *int*eger that is *64* bits long. The last value is obviously wrong, because a 64-bit unsigned integer cannot hold $10^{20}$. If we were to use the default `'int'` kind (which on many platforms is synonymous with `'int64'`), the last two values would likely be wrong:

In [None]:
np.logspace(0.0, 20.0, 21, dtype='int')
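
The limits behind these wrong values can be inspected with `np.iinfo`. A short sketch that compares the largest representable values with $10^{19}$ and $10^{20}$:

```python
import numpy as np

for int_type in (np.int64, np.uint64):
    info = np.iinfo(int_type)
    # Python integers have arbitrary precision, so these comparisons are exact
    print(info.dtype, info.max, 10**19 <= info.max, 10**20 <= info.max)
```

A signed 64-bit integer cannot hold $10^{19}$, while an unsigned one can; neither can hold $10^{20}$.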

You can create random arrays by using the submodule `random` from NumPy. 

Function `random` creates random numbers uniformly distributed between 0 and 1:

In [None]:
E = np.random.random((3, 4))

In [None]:
E

You can define arrays of random integers:

In [None]:
I = np.random.randint(0, 10, (5, 5))

In [None]:
I

You can also randomly choose a number of elements from a $1$-D array or list:

In [None]:
vr = np.arange(0, 100)
display(vr)

np.random.choice(vr, 3)

You can also generate random variables from a number of distributions. For example, you can generate numbers from a univariate Gaussian (normal) distribution:

In [None]:
np.random.normal(0.0, 1.0, (2, 4))

Here, the first parameter is the mean of the distribution, the second parameter is its standard deviation (not the variance), and the third argument is the array size.

Computer-generated random numbers are not truly random but rather pseudo-random. You can control this by specifying a seed: if the same seed is used, the same sequence of numbers is generated:

In [None]:
np.random.seed(100)
display(np.random.random((2, 2)))
display(np.random.random((2, 2)))

Compare that to:

In [None]:
np.random.seed(100)
display(np.random.random((2, 2)))
np.random.seed(100)
display(np.random.random((2, 2)))
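
NumPy also offers a newer interface in which the seed belongs to a generator object created with `np.random.default_rng`, rather than to global state. A minimal sketch of the same reproducibility idea with that interface (the variable names are assumptions made for this example):

```python
import numpy as np

rng_a = np.random.default_rng(100)
rng_b = np.random.default_rng(100)

first = rng_a.random((2, 2))
second = rng_b.random((2, 2))
print(np.array_equal(first, second))   # same seed, same numbers

third = rng_a.random((2, 2))           # rng_a has advanced, so this draw differs
print(np.array_equal(first, third))
```

Because each generator carries its own state, seeding one part of a program this way does not affect random numbers drawn elsewhere.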

Now that we are talking about distributions, let's talk about some statistical methods on NumPy arrays.

You can take the empirical mean of numbers in an array with the `mean` method:

In [None]:
print(w)
print(w.mean())

If you use that on a matrix (or a higher-dimensional array), it gives you the mean over all elements:

In [None]:
print(A)
print(A.mean())

...unless you specify the axis along which you want to take the mean (remember, in Python, indices, including indices of axes, start from 0): 

In [None]:
print(A.mean(axis=0))

In [None]:
print(A_h)
print(A_h.mean(axis=2))

You can take the variance using `var`:

In [None]:
A.var()

Same rules apply. If you want to take variances along a specific axis, specify that. If you want to get a covariance matrix, use `cov`:

In [None]:
G = np.random.randint(0, 100, (10, 4))
display(G)

np.cov(G)

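Note that we used the `cov` function of NumPy here, not a method of the `ndarray` `G`. Also, by default `cov` treats each row as a variable and each column as an observation, so this call returns a $10\times 10$ matrix; pass `rowvar=False` to treat the columns as variables.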
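
In machine learning, data is usually laid out with one observation per row and one feature per column. A sketch of how the `rowvar` parameter of `np.cov` changes the result in that layout, using a hypothetical random array `data`:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 100, (10, 4))    # 10 observations (rows) of 4 features (columns)

by_rows = np.cov(data)                  # default: every row is a variable
by_cols = np.cov(data, rowvar=False)    # every column is a variable

print(by_rows.shape)
print(by_cols.shape)
```

The same parameter exists for `np.corrcoef`.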

You can take medians:

In [None]:
np.median(w)

...correlation coefficients:

In [None]:
np.corrcoef(G)

...standard deviations:

In [None]:
w.std()

...and many other things.

Apart from statistical functions, you can calculate other aggregate functions.

For example summations:

In [None]:
w.sum()

In [None]:
A.sum()

In [None]:
A.sum(axis=1)

Minimums, maximums, cumulative sums, ....

In [None]:
w.min()

In [None]:
w.max()

In [None]:
w.cumsum()

In [None]:
A.min()

In [None]:
A.max()

In [None]:
A.cumsum()

In [None]:
A.max(axis=1)

In [None]:
A.max(axis=0)

In [None]:
A.cumsum(axis=0)

You can also take `argmin`s and `argmax`s (like $\arg\min$ and $\arg\max$ operations in math). These will tell you not the minimum or maximum values but the index at which they happen:

In [None]:
w.argmin()

In [None]:
w.argmax()

In [None]:
A.argmin()

In [None]:
A.argmax()

In [None]:
A.argmin(axis=1)

In [None]:
A.argmax(axis=0)

If you check `w`, the minimum is indeed at index `2` and the maximum at index `0`. If there are multiple minima or maxima, the first index at which they occur is returned.

With `A`, a single index is returned instead of two. This is the index at which the minimum or maximum occurs if you flatten the array into a $1$-D array. We will see more about flattening and translating this index into an $n$-D index shortly.

In [None]:
display(w)
display(A)

Finally you can sort arrays as well:

In [None]:
w_s = w
display(w_s)
w_s.sort()
display(w_s)

In [None]:
A_s = A
display(A_s)
A_s.sort()
display(A_s)

In [None]:
A_s = A
display(A_s)
A_s.sort(axis=0)
display(A_s)

Notice that, as with Python lists, the `sort` method works **in place**: the changes are made to the original data structure. Because `w_s = w` and `A_s = A` only give new names to the same arrays, `w` and `A` themselves are now sorted.

Also, if you don't specify an axis, `sort` does not sort the entire array. Instead, `axis=-1` (the last axis) is used by default for the `axis` parameter.

In [None]:
A_s = A
display(A_s)
A_s.sort(axis=1)
display(A_s)

The values were already sorted along the axis with index `1`, so nothing changed.

FYI, we also do have `argsort`.
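
A short sketch of `argsort`, together with the function `np.sort`, which returns a sorted copy and leaves the original untouched. The array `scores` is a hypothetical example:

```python
import numpy as np

scores = np.array([30, 10, 20])   # hypothetical values

order = scores.argsort()          # indices that would sort the array
sorted_copy = np.sort(scores)     # np.sort returns a sorted copy

print(order)
print(scores[order])              # indexing with the order gives the sorted values
print(sorted_copy)
print(scores)                     # the original is left unchanged
```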

Now back to some basic array attributes:

You can find the number of dimensions of an array with `ndim` attribute:

In [None]:
A.ndim

The number of elements can be found with `size`:

In [None]:
A.size

You can find the data type of an array with `dtype`:

In [None]:
A.dtype

Here you see `dtype('int64')` is not a string but rather an object of type `dtype`. If you just want the name, you can use the `name` attribute of `dtype` objects:

In [None]:
A.dtype.name

You can cast an array into another `dtype` using the method `astype`: 

In [None]:
F = E.astype('float')
display(F)

In [None]:
F.dtype

You can flatten arrays:

In [None]:
F.ravel()

...or reshape them:

In [None]:
F.reshape((2, 6))

If you have an index from a flattened array, you can translate that index into an $n$-D index with `unravel_index`:

In [None]:
ind1 = F.argmin()
print(ind1)

print(np.unravel_index(ind1, (3, 4)))
print(np.unravel_index(ind1, (2, 6)))

You can use a Boolean array of the same size to index another array:

In [None]:
F > 0.2

In [None]:
F[F > 0.2]

You can also use $n$ other arrays or lists of the same size to index an $n$-D array, and you will get an array with the shape of the index arrays. The values will be the elements found at the positions given by the index arrays, one index array per dimension:

In [None]:
ind_arr1 = np.array([0, 2, 2])
ind_arr2 = np.array([1, 0, -1])

F[ind_arr1, ind_arr2]

You can find the indices where a condition happens (i.e., where a Boolean array is `True`) using `where`:

In [None]:
np.where(F > 0.2)

As you can see, you will get a tuple of arrays containing indices for each dimension.
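
The tuple of index arrays can be unpacked into one array per dimension. `where` also has a three-argument form that picks values from one of two sources depending on the condition. A sketch with a small hypothetical array `values`:

```python
import numpy as np

values = np.array([[0.1, 0.7], [0.5, 0.05]])   # hypothetical data

rows, cols = np.where(values > 0.2)            # one index array per dimension
print(list(zip(rows.tolist(), cols.tolist()))) # (row, column) pairs

# Three-argument form: keep values above the threshold, replace the rest by 0.0
clipped = np.where(values > 0.2, values, 0.0)
print(clipped)
```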

You can copy an array by using the `copy` method:

In [None]:
H = G.copy()
print("H is a copy of G.")
print("H:\n", H)
print("G:\n", G)
H[1, 1] = 0
print("\nChange applied to H[1, 1].\n")
print("H:\n", H)
print("G:\n", G)

You can create a **view**, a new array object that shares its data with the original, by:

In [None]:
H = G.view()
print("H is a view of G.")
print("H:\n", H)
print("G:\n", G)
H[1, 1] = 0
print("\nChange applied to H[1, 1].\n")
print("H:\n", H)
print("G:\n", G)

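Simple assignment creates neither a copy nor a view: it only gives a second name to the same array object, so changes made through either name affect both: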

In [None]:
H = G
print("H = G is executed (simple assignment).")
print("H:\n", H)
print("G:\n", G)
H[1, 1] = 1
print("\nChange applied to H[1, 1].\n")
print("H:\n", H)
print("G:\n", G)
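
The three cases can be told apart programmatically. This sketch uses the `is` operator to check object identity and `np.shares_memory` to check whether two arrays use the same underlying data; the variable names are chosen only for this example:

```python
import numpy as np

original = np.arange(6).reshape(2, 3)

alias = original             # plain assignment: same object, second name
view = original.view()       # new array object, shared data
copy = original.copy()       # new array object, separate data

print(alias is original, view is original, copy is original)
print(np.shares_memory(original, view), np.shares_memory(original, copy))

view[0, 0] = 99              # visible through original and alias, not through copy
print(original[0, 0], alias[0, 0], copy[0, 0])
```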

### Even more NumPy

There are many more methods to do array manipulation:

- You can add dimensions to an array with `np.newaxis` or `None` as indices:

In [None]:
A2 = A[:, :, np.newaxis]
A2

In [None]:
A[:, :, None]

In [None]:
A2.shape

- You can remove axes of length one from an array by using `squeeze`:

In [None]:
A2.squeeze()

In [None]:
A2.squeeze().shape

- You can use `tile` to tile an array to create a bigger array:


In [None]:
np.tile(A, (2, 3))

You can of course combine `tile` and adding new axes with `np.newaxis` or `None` indexing to extend arrays into new dimensions:

In [None]:
np.tile(A[:, :, None], (1, 1, 3))

- You can use `resize` method of an `ndarray` array to resize an array:

In [None]:
E

In [None]:
E.resize((6,2))
E

In [None]:
E.resize((3, 4))
E

- You can use `append` function of NumPy to append values:

In [None]:
np.append(A, [[1, 2, 3]])

In [None]:
np.append(A, [[1], [2], [3]], axis=1)

- You can use `insert` function of NumPy to insert values:

In [None]:
np.insert(A, 1, 5)

In [None]:
np.insert(A, 1, [1, 2, 3])

In [None]:
np.insert(A, 1, 5, axis=1)

In [None]:
np.insert(A, 1, [[7], [8], [9]], axis=1)

In [None]:
np.insert(A, 1, [7, 8, 9], axis=1)

In [None]:
np.insert(A, [1], [[7], [8], [9]], axis=1)

In [None]:
np.insert(A, [1, 2], [[7], [8], [9]], axis=1)

- You can use `delete` function of NumPy to delete values:

In [None]:
np.delete(A, 1)

In [None]:
np.delete(A, 1, axis=0)

In [None]:
np.delete(A, [1, 2], axis=1)

In [None]:
mask = (A.sum(axis=0) < 1.0)
display(mask)
np.delete(A, mask, axis=0)

- You can use `concatenate` function of NumPy to concatenate arrays together:

In [None]:
np.concatenate((A, B))

In [None]:
np.concatenate((A, B), axis=0)

In [None]:
np.concatenate((A, B), axis=1)

In [None]:
np.concatenate((A, B), axis=None)

- You can use the `vstack` function of NumPy (or the shorthand `np.r_[`...`]`, note the indexing operator instead of function call parentheses) to vertically stack arrays:

In [None]:
np.vstack((A, B))

In [None]:
np.r_[A, B]

- You can use the `hstack` function of NumPy (or, for 2-D arrays, the shorthand `np.c_[`...`]`, note the indexing operator instead of function call parentheses) to horizontally stack arrays:

In [None]:
np.hstack((A, B))

In [None]:
np.c_[A, B]

- You can use `column_stack` function of NumPy on 1D arrays (which is like doing vertical stacking and then transposing):

In [None]:
np.column_stack(([1, 2, 3], [4, 5, 6]))

In [None]:
np.vstack(([1, 2, 3], [4, 5, 6])).T

- You can use the `split` function of NumPy to split an array into equal parts or at given indices:

In [None]:
np.split(A, 3, axis=1)

In [None]:
np.split(A, [1], axis=1)

In [None]:
np.split(A, [1, 2], axis=0)

...and many, many more operations!


One last thing before we move on to another package: you can use the `linalg` sub-package of NumPy to do more complicated linear algebra operations. The two most useful examples would be using `linalg.inv` to calculate the inverse of a (non-singular square) matrix:

In [None]:
np.linalg.inv(A)

`A` is singular at this point (the in-place sorts above changed its contents), so it cannot be inverted.

In [None]:
np.linalg.inv(B)

The other example is to solve a system of linear equations, $\mathbf{Ax}=\mathbf{b}$ (where $\mathbf{A}$ are the coefficients given in the matrix form, $\mathbf{x}$ are the variables and $\mathbf{b}$ is a vector of values), using `linalg.solve`:

In [None]:
np.linalg.solve(B, w)

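(For those interested: `linalg.solve` requires a square, non-singular $\mathbf{A}$ and returns $\mathbf{x}=\mathbf{A}^{-1}\mathbf{b}$, computed in a more numerically stable way than taking a direct inverse. For an over-determined system where $\mathbf{A}$ has full column rank, `np.linalg.lstsq` instead returns the least-squares solution $\mathbf{x}=\mathbf{A}^\dagger\mathbf{b}$, where $\mathbf{A}^\dagger$ is the Moore–Penrose pseudoinverse of $\mathbf{A}$:
$$\mathbf{A}^\dagger=\left(\mathbf{A}^\top\mathbf{A}\right)^{-1}\mathbf{A}^\top\,,$$
again without forming the inverse explicitly.)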
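
A quick way to check a solution returned by `linalg.solve` is to substitute it back into the equations. A sketch on a small hypothetical system, $2x + y = 3$ and $x + 3y = 5$:

```python
import numpy as np

coeffs = np.array([[2.0, 1.0], [1.0, 3.0]])   # hypothetical coefficient matrix
rhs = np.array([3.0, 5.0])                    # right-hand side values

x = np.linalg.solve(coeffs, rhs)
print(x)
print(np.allclose(coeffs @ x, rhs))           # substituting back reproduces rhs
```

`np.allclose` is used instead of `==` because floating-point results are rarely exact.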

##pandas

[pandas](https://pandas.pydata.org/) is a data analysis package for Python. It works somewhat like a spreadsheet or a database: you can have tables, manipulate them and display them. Since data is the central object of machine learning, we will be using this package a lot as well.

In [None]:
import pandas as pd

Pandas has two main data structures.

Series:

In [None]:
ds = pd.Series(w, index=["a", "b", "c"])

A `Series` is a $1$-D series of data. Each element in the series can have an index name. 

In [None]:
ds

We can get one element of a series by using its index:

In [None]:
ds["b"]

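...or by its position, using `iloc` (recent pandas versions deprecate plain integer keys such as `ds[0]` on a Series with a labelled index):

In [None]:
ds.iloc[0]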

The second kind is a `DataFrame`, which is like a table:

In [None]:
df = pd.DataFrame(A, columns=["feature_1", "feature_2", "feature_3"])

We could also have had index (row names) in addition to column names, but we usually don't need those unless datapoints have a meaningful identifier.

In [None]:
print(df)

In [None]:
display(df)

Notice how `display` shows a nicer output because it is defined in Jupyter and hence uses Jupyter's graphical interface.

In [None]:
df

Also notice how inspecting a variable is actually doing `display`.

pandas is built on top of NumPy so many functions of NumPy are available here as well and they have the same name.

Before jumping to the next topic, let's see other ways you can create pandas `Series` and `DataFrame`s:

Creating a DataFrame using dictionaries:

In [None]:
pd.DataFrame({'feature_1': [-1, 0, 1], 'feature_2': [0, 2, 4], 'feature_3': [3, 5, 6]})

BTW, like in Series, DataFrame indices can have names instead of numbers as well:


In [None]:
pd.DataFrame({'feature_1': [-1, 0, 1], 'feature_2': [0, 2, 4], 'feature_3': [3, 5, 6]}, 
             index=['A', 'B', 'C'])

Finally, a `Series` has only one column, and its name can be passed with the `name` argument:


In [None]:
pd.Series([1, 2, 3], name='w', index=['a', 'b', 'c'])

Now, let's read a DataFrame object from a CSV (*C*omma *S*eparated *V*alues, perhaps the most famous data storage format) file on the web (you can also read CSV files from your own computer or, in the case of Colab, from Google storage). We will read the famous [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) from the datasets in the [*Machine Learning Repository*](https://archive.ics.uci.edu/ml/datasets.php) hosted by [UCI](https://uci.edu/): University of California, Irvine ([dataset link](https://archive.ics.uci.edu/ml/datasets/Iris)):

In [None]:
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", names=["sepal_length", "sepal_width", "petal_length", "petal_width"], header=None, index_col=False)

The `names` parameter gives the names of the columns, and by indicating `header=None` we tell pandas not to use the first row as a list of column names (to be accurate, this was not necessary, because if column names are explicitly provided, the first row is automatically not used as the header). Also, by indicating `index_col=False` we tell pandas that we don't want to use the first column as the index column (if we wanted to use a specific column as the index column, we could have used the index of that column as the value for `index_col`).

In [None]:
df
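
The effect of `header=None` and `names` can be tried without any download by reading from a string. In this sketch the three rows and the column names `a`, `b`, `c` are made up for illustration, and `io.StringIO` stands in for a file:

```python
import io
import pandas as pd

# A few made-up rows standing in for a header-less CSV file
csv_text = "5.1,3.5,1.4\n4.9,3.0,1.4\n4.7,3.2,1.3\n"

sample = pd.read_csv(io.StringIO(csv_text), header=None, names=["a", "b", "c"])
print(sample)
print(sample.shape)
```

Because `header=None` is given, the first row is read as data rather than as column names.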

You can use `head` and `tail` to look at the beginning and end of data:



In [None]:
df.head()

In [None]:
df.tail()

Five rows are shown by default and you can override that:

In [None]:
df.head(10)

We can use indexing operator with index numbers to select some rows of a DataFrame:

In [None]:
df[1:]

We can use indexing operator with column names to select a column of a DataFrame and get a Series:

In [None]:
df["sepal_length"]

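You can also use the `.` operator to get a column if the column name is a valid Python identifier (for example, it contains no spaces) and does not clash with an existing DataFrame attribute or method: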

In [None]:
df.sepal_length

We can use a list (or array) of column names to select a sub-DataFrame from those columns:

In [None]:
df[["sepal_length", "sepal_width"]]

You can use two indexing operators to get a sub-DataFrame:

In [None]:
df[["sepal_length", "sepal_width"]][2:4]

You can also (and preferably) select a sub-DataFrame by row and column index numbers using the `iloc` indexing operator:


In [None]:
df.iloc[[0, 2, 3], [0, 3]]

Notice the difference in ordering. Here, we gave the rows first and then columns.

Or slice using ranges:

In [None]:
df.iloc[0:2, 1:-1]

If you specify a single row or column index, you will get a Series back instead:

In [None]:
df.iloc[3, 1:]

In [None]:
df.iloc[0:4, 2]

If you specify singular row and column indices instead of ranges, you will get a value back:

In [None]:
df.iloc[0, 2]

You can use row and column labels by using `loc` (here row, or index labels are the numbers themselves since we did not specify index names):

In [None]:
df.loc[[0, 1], ["petal_width", "sepal_width"]]

The same situation applies with regards to being able to specify ranges and getting Series or values back:

In [None]:
df.loc[0:2, "sepal_length":"petal_length"]

Note, however, how with `loc` the ranges are inclusive at the end and not exclusive.

In [None]:
df.loc[0, :]

In [None]:
df.loc[:, "petal_length"]

In [None]:
df.loc[6, "petal_width"]
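
The difference between the end-exclusive ranges of `iloc` and the end-inclusive ranges of `loc` is easy to overlook when the row labels are the default integers. A sketch on a small hypothetical DataFrame `table`:

```python
import pandas as pd

table = pd.DataFrame({"x": [10, 20, 30, 40], "y": [1, 2, 3, 4]})

by_position = table.iloc[0:2]   # positions 0 and 1 (end excluded)
by_label = table.loc[0:2]       # labels 0, 1 and 2 (end included)

print(len(by_position), len(by_label))
```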

You can also use Boolean Series for indexing. One way of creating a Boolean Series is to use a comparison on a Series.

In this following example, this is a Series:

In [None]:
df["petal_length"]

We can create a Boolean Series from it:

In [None]:
df["petal_length"] > 0.5

Then, we can use that for indexing:

In [None]:
df[df["petal_length"] > 0.5]

You can combine Boolean Series by using `|` (or), `&` (and) and `~` (not):

In [None]:
df[(df["petal_length"] > 0.5) & (df["petal_length"] < 2.0)]

In [None]:
df[(df["sepal_length"] < 3.1) | (df["sepal_length"] > 3.8)]

In [None]:
df[~(df["petal_length"] < 2.0)]

You can also use `any` and `all` methods with Boolean Series and they are like their counterparts in NumPy.

You can drop rows or columns using `drop`:

In [None]:
df.drop([0, 2])

To drop a column, you have to pass `axis=1` (the default is `axis=0`, which drops rows):

In [None]:
df.drop(["sepal_length"], axis=1)

But `drop` is not in-place: it returns a new object and does not change the original DataFrame/Series:

In [None]:
df
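
To keep the result of `drop`, assign it to a name. A sketch on a small hypothetical DataFrame `table`; the `columns=` keyword used here is an alternative to passing `axis=1`:

```python
import pandas as pd

table = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

table.drop(["x"], axis=1)                 # result discarded, table unchanged
columns_before = list(table.columns)
print(columns_before)

table = table.drop(columns=["x"])         # keep the result by reassigning
columns_after = list(table.columns)
print(columns_after)
```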

You can sort a DataFrame by index values using `sort_index`:

In [None]:
df.sort_index()

BTW, you can change the index column of a DataFrame: 

In [None]:
df.set_index("petal_length")

...and reset index numbers:

In [None]:
df.set_index("petal_length").reset_index()

You can sort by values in a column:

In [None]:
df.sort_values(by="sepal_length")

These are not in-place:

In [None]:
df

You can find the ranks of entries as well (if there are equal entries, their average rank is shown):

In [None]:
df.rank()

You can use the `describe` method to get a statistical summary of the DataFrame:

In [None]:
df.describe()

You can also use `mean`, `median`, `mode`, `var` and `std` methods to get mean, median, mode, variance and standard deviation values (mode will return a DataFrame, others a Series):

In [None]:
df.mean()

In [None]:
df.median()

In [None]:
df.mode()

In [None]:
df.var()

In [None]:
df.std()

You can use the attributes `shape` to find dimensions, `index` to describe rows and `columns` to describe columns, and the methods `info` to get information on the DataFrame and `count` to get the number of entries that are not missing (that is, not `nan` or `NA`). `NA` or `nan` values are usually used to mark missing values in a dataset.

In [None]:
df.shape

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.count()

You can also use `sum`, `cumsum`, `min`, `max`, `idxmin` and `idxmax` to get sums, cumulative sums, minimums, maximums, index of minimums (like an $\arg\min$) and index of maximums (like an $\arg\max$), respectively:

In [None]:
df.sum()

In [None]:
df.cumsum()

In [None]:
df.min()

In [None]:
df.max()

In [None]:
df.idxmin()

In [None]:
df.idxmax()
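
Note that `idxmin` and `idxmax` return index labels rather than positions. A minimal sketch with a made-up DataFrame that uses letters as its index makes this visible:

```python
import pandas as pd

# Hypothetical toy DataFrame with a non-numeric index
toy = pd.DataFrame({"x": [3, 1, 2], "y": [7, 9, 8]}, index=["a", "b", "c"])

labels_of_min = toy.idxmin()  # index label of the smallest value in each column
labels_of_max = toy.idxmax()  # index label of the largest value in each column
print(labels_of_min)
print(labels_of_max)
```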

Let's load another dataset to show some other features. We are going to use the [Adult dataset](https://archive.ics.uci.edu/ml/datasets/adult), which contains census data on incomes from the US. But this time, rather than reading from a web URL, let's read the file from Google Drive. Make sure the file `adult.data` is located on your Google Drive. Then, click on the folder icon on the bar on the left of your notebook and authorize the notebook to access your Drive. Use the file browser provided to find the dataset file you want to load in your Drive. Click on the three dots to the right of the file name and select 'Copy path'. Paste that path and assign it as the value of `file_path` (this is just a string-valued variable we define here; the name is not special):

In [None]:
file_path = '/content/drive/MyDrive/Colab Notebooks/Datasets/adult.data'

Now, we can load our data. In this dataset there is a space after each comma, so to avoid reading values with a leading space we have to specify the separator as `', '` rather than the default `','`:

In [None]:
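# A multi-character separator is not supported by the default C parser, so we ask for the Python engine explicitly
df2 = pd.read_csv(file_path, header=None, index_col=False, sep=', ', engine='python')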
display(df2)
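
To see why the separator matters, here is a minimal sketch that parses a made-up two-line string (not the real `adult.data` file) in three ways. The variable names are hypothetical; `skipinitialspace=True` is another `read_csv` option that skips spaces after the delimiter:

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same "comma plus space" layout
raw = "39, State-gov, 77516\n50, Private, 83311\n"

# Default separator: string values keep their leading space
with_spaces = pd.read_csv(io.StringIO(raw), header=None, sep=',')

# Separator ', ' (needs the Python engine): leading spaces are consumed
stripped = pd.read_csv(io.StringIO(raw), header=None, sep=', ', engine='python')

# Alternative: keep ',' but skip the spaces that follow it
skipped = pd.read_csv(io.StringIO(raw), header=None, skipinitialspace=True)

print(repr(with_spaces.loc[0, 1]))
print(repr(stripped.loc[0, 1]))
print(repr(skipped.loc[0, 1]))
```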

First things first, let's specify column names. We can do that using the `rename` method of the DataFrame:

In [None]:
df2 = df2.rename(columns={
                           0: 'age',
                           1: 'workclass',
                           2: 'fnlwgt',
                           3: 'education',
                           4: 'education-num',
                           5: 'marital-status',
                           6: 'occupation',
                           7: 'relationship',
                           8: 'race',
                           9: 'sex',
                          10: 'capital-gain',
                          11: 'capital-loss',
                          12: 'hours-per-week',
                          13: 'native-country',
                          14: 'income'
                         })

In [None]:
df2

We can also rename the axes themselves:

In [None]:
df2 = df2.rename_axis('attributes', axis='columns')
df2 = df2.rename_axis('individuals', axis='rows')
df2

We can check the data type of a column:

In [None]:
df2['age'].dtype

...or for all columns:

In [None]:
df2.dtypes

We can check unique values of a column (which is really useful for non-numerical columns):

In [None]:
df2['workclass'].unique()

We can find instances where a column has value in a set of specific values:

In [None]:
df2['workclass'].isin(['Federal-gov', 'State-gov', 'Local-gov'])
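
The Boolean Series returned by `isin` can be used for indexing like any other. A minimal sketch on a made-up DataFrame (the rows are hypothetical, not the real census data):

```python
import pandas as pd

# Hypothetical toy rows used only for illustration
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Local-gov", "Private"],
    "age": [39, 50, 38, 53],
})

is_gov = toy["workclass"].isin(["Federal-gov", "State-gov", "Local-gov"])
gov_workers = toy[is_gov]        # rows whose workclass is in the list
non_gov_workers = toy[~is_gov]   # all the other rows
print(gov_workers)
```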

If we get a summary using `describe`, columns with categorical values have different stats:

In [None]:
df2['workclass'].describe()

The standard way to represent missing values is to use `NaN` values. However, in this dataset, `'?'` is used to represent missing values. Let's convert those to the standard representation by assigning new values:

In [None]:
df2[df2 == '?'] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions
display(df2)
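
Another way to do the same conversion is the `replace` method, which returns a new DataFrame instead of modifying the original. This is only a sketch on a made-up DataFrame, not the approach used in the rest of this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with '?' standing in for a missing value
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "age": [39, 50, 38],
})

cleaned = toy.replace("?", np.nan)  # toy itself is left untouched
print(cleaned)
```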

We can check where in the DataFrame we have missing values with `isnull`:

In [None]:
df2.isnull()

...and where we don't have missing values with `notnull`:

In [None]:
df2.notnull()

Let's find rows where we have missing values:

In [None]:
individuals_with_missing_values = df2.isnull().any(axis=1)
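
To make the role of `any(axis=1)` concrete, here is a minimal sketch on a made-up DataFrame. It also shows, as an extra, how summing the Boolean values counts the missing entries per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the second and third rows contain missing values
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "State-gov"],
    "occupation": ["Sales", np.nan, np.nan],
})

rows_with_missing = toy.isnull().any(axis=1)  # True where any column is missing
missing_per_column = toy.isnull().sum()       # True counts as 1 when summing
complete_rows = toy[~rows_with_missing]       # rows with no missing values
print(rows_with_missing.tolist())
print(missing_per_column)
```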

We can replace nulls (`NaN`s) with a specific value with `fillna`:

In [None]:
df2.fillna('Unknown')[individuals_with_missing_values]

We can simply add a column by assigning a value to a non-existing column name (notice how `df2` did not change when we replaced `NaN`s with `'Unknown'`, as we did not assign the result back to `df2`):

In [None]:
df2['dummy'] = -1
df2

Let's find which people have a missing **workclass**:

In [None]:
is_workclass_missing = df2['workclass'].isnull()
is_workclass_missing

We can convert data types with `astype`, as in NumPy:

In [None]:
is_workclass_missing_int = is_workclass_missing.astype('int')
is_workclass_missing_int

Let's get the number of people who have missing **workclass** values:

In [None]:
num_missing_workclasses = is_workclass_missing_int.sum()

You saw how we assigned a single fixed value to several elements at once. We can also assign a list (or array) of values, as long as its length matches the number of selected elements. However, in order to do that, we should use `loc` or `iloc` indexing and not the simple `[...]` indexing:

In [None]:
df2.loc[is_workclass_missing, 'dummy'] = np.arange(num_missing_workclasses)
df2[is_workclass_missing]
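
The same pattern on a made-up DataFrame, small enough to check by eye (the names `toy` and `mask` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the second and fourth have a missing workclass
toy = pd.DataFrame({"workclass": ["Private", None, "State-gov", None]})
toy["dummy"] = -1                      # one fixed value for every row

mask = toy["workclass"].isnull()
# One value per selected row: the array length must equal mask.sum()
toy.loc[mask, "dummy"] = np.arange(mask.sum())
print(toy)
```

Only the rows selected by `mask` receive the new values 0 and 1; the other rows keep -1.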

We can use `replace` to replace values as well:

In [None]:
df2.loc[:, 'dummy'].replace(-1, -np.inf)

We can count occurrences of unique values with `value_counts`:

In [None]:
df2['workclass'].value_counts()
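
By default `value_counts` leaves missing values out of the count. As a sketch on a made-up Series, passing `dropna=False` includes them:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values with one missing entry
toy = pd.Series(["Private", "Private", "State-gov", np.nan])

counts = toy.value_counts()                      # NaN is ignored
counts_with_nan = toy.value_counts(dropna=False) # NaN gets its own row
print(counts)
print(counts_with_nan)
```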

There is a lot more you can do with pandas! You can:

- do arithmetic operations on DataFrames and Series;
- use `stack` and `melt` to convert columns into rows, `unstack` to reverse stacking, and `pivot` and `pivot_table` to reshape data;
- use `where` to keep only the values that satisfy a condition, and `select_dtypes` and `filter` to do more advanced selections;
- use `dropna` to drop missing values;
- use `to_datetime`, `date_range`, `datetime` objects and `DatetimeIndex` to handle time and date data;
- use `merge`, `join` and `concat` to combine data;
- use `map` and `apply` to apply functions to DataFrames and Series;
- use `duplicated` and `drop_duplicates` to handle duplicate data;
- iterate over data, use multi-indexing, use `groupby` to handle groups of data points with similar characteristics, and use `agg` to do multiple operations on them.

pandas also provides tools for easy visualization and data storage. We will explain the tools we use as we need them.
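
As a small taste of a few of these tools, here is a minimal sketch of `groupby` with `agg`, `merge` and `apply` on made-up data. The DataFrames `people` and `codes` and all of their values are hypothetical:

```python
import pandas as pd

# Hypothetical toy data used only for illustration
people = pd.DataFrame({
    "workclass": ["Private", "Private", "State-gov"],
    "age": [30, 50, 40],
})
codes = pd.DataFrame({
    "workclass": ["Private", "State-gov"],
    "sector": ["private", "public"],
})

# groupby + agg: several statistics per group in one call
summary = people.groupby("workclass")["age"].agg(["mean", "max"])

# merge: attach the matching 'sector' to each person via the shared column
merged = people.merge(codes, on="workclass")

# apply: run a function on every entry of a column
age_in_months = people["age"].apply(lambda years: years * 12)

print(summary)
print(merged)
```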

## Covered in later modules

###plotly and Dash

Usually, [Matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/) are used for visualization in machine learning projects. However, in this course we prefer to use [plotly](https://plotly.com/) as it's fun, fast and beautiful. We will sometimes use seaborn as well, and we will explain these tools when we use them. [Bokeh](https://bokeh.org/) is another visualization tool that is sometimes used, but we will not be using it in this course.

Also, if you want to present results, a very good tool is [Dash](https://plotly.com/dash/) by the same people who have developed plotly.

Visualizations are an important part of data exploration, which will be covered in module 2. If you are interested in seeing how plotly works, you can look at [some examples](https://plotly.com/python/). You can also watch a [YouTube tutorial video for plotly and Dash](https://www.youtube.com/watch?v=DIk-y41djCQ&t=6s).

##scikit-learn

[scikit-learn](https://scikit-learn.org/stable/) is the machine learning library we will be using. Explanation will come in due time!

##FYI

###SciPy library

[SciPy](https://www.scipy.org/) is a Python-based ecosystem, some components of which we already introduced, like NumPy, pandas and Matplotlib. There is also [SymPy](https://www.sympy.org/en/index.html) for symbolic computation and [IPython](http://ipython.org/) for interactive Python (which [Jupyter](https://jupyter.org/), and thus Google Colab, is derived from). There is one last package that can be used for machine learning but one that we don't use in this course and that is the [SciPy Library](https://www.scipy.org/scipylib/index.html). SciPy library has a collection of various tools that can help with scientific computing and machine learning. You can find a tutorial [here](https://docs.scipy.org/doc/scipy/reference/tutorial/index.html).

###TensorFlow, Keras and PyTorch

[TensorFlow](https://www.tensorflow.org/), [Keras](https://keras.io/) and [PyTorch](https://pytorch.org/) are tools used for more complicated kinds of machine learning like deep learning. We will not be using these in this course.

That's all Folks!