# Introduction to NumPy

---

**NumPy**, or **Numerical Python**, is the fundamental package for numeric computing with Python. It provides powerful ways to create,
store, and manipulate data, which lets it integrate quickly and seamlessly with a wide variety
of databases. NumPy is the foundation of several libraries such as `Pandas` and `SciPy`.


In this lecture, we will talk about creating arrays with certain data types, manipulating arrays, selecting
elements from arrays, as well as NumPy's universal functions and how to use its statistical and mathematical capabilities. Moreover, we will see how to load a dataset into an array. These functions are useful for manipulating data and
understanding the functionality of other common Python data packages.


### Lecture outline

---

* Scalar, Vector, Matrix, and NdArray


* Shape, Size, Dimension of matrices, and type of entries


* Indexing and Slicing


* Boolean Indexing


* Universal Functions


* Statistical Methods


* Linear Algebra


* Input/Output

### Homework:

[101-numpy-exercises-python](https://www.machinelearningplus.com/python/101-numpy-exercises-python/)


[Paper to read](https://www.nature.com/articles/s41586-020-2649-2) - Optional

In [1]:
import numpy as np

from scipy import sparse

## Scalar, Vector, Matrix, NdArray or Tensor

![alt text](images/scalar-vector-matrix-tensor.png "Title")

In [2]:
# Scalar (stored here as a one-element 1-D array; np.array(5) would create a 0-D array)

scalar = np.array([5])

scalar

array([5])

In [3]:
# Row Vector

row_vector = np.array([1, 2, 3, 4, 5])

row_vector

array([1, 2, 3, 4, 5])

In [4]:
# Column Vector

column_vector = np.array([[1], [2], [3], [4], [5]])

column_vector

array([[1],
       [2],
       [3],
       [4],
       [5]])

In [5]:
# Matrix

matrix = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

matrix

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

#### Tensor

![alt text](images/tensor.png "Title")

In [6]:
# NdArray or Tensor

tensor = np.array([[[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]],
                  
                  [[10, 11, 12],
                  [13, 14, 15],
                  [16, 17, 18]],
                  
                  [[19, 20, 21],
                  [22, 23, 24],
                  [25, 26, 27]]])

tensor

array([[[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9]],

       [[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]],

       [[19, 20, 21],
        [22, 23, 24],
        [25, 26, 27]]])

#### Different types of matrices

NumPy has functions that generate several common kinds of matrices.

In [7]:
# Matrix of ones

np.ones(shape=(3, 3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [8]:
# Matrix of zeros

np.zeros(shape=(3, 3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [9]:
# Identity matrix

np.eye(N=3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

There are also `Sparse Matrices`, where most of the values are zero and only a few are non-zero. If we represent sparse matrices in the usual dense way, they take a huge amount of memory. Hence, it's better to represent them in a compact way that stores only the non-zero values. That's where the notion of a sparse matrix comes in.

NumPy does not support sparse matrices, so we have to use `SciPy` to express them.

In [None]:
# Create compressed sparse row (CSR) matrix

# Dense array that is mostly zeros
sparse_matrix = np.array([[0, 0],
                          [0, 1],
                          [3, 0]])

# Convert to CSR format, which stores only the non-zero entries
matrix_sparse = sparse.csr_matrix(sparse_matrix)

print(matrix_sparse)
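
The memory argument for sparse storage can be made concrete with a small comparison. The sketch below is only illustrative: the 1000 x 1000 size and the three non-zero entries are hypothetical values chosen for the example, and the byte count for the CSR matrix is approximated by adding up the three arrays that back it.

```python
import numpy as np
from scipy import sparse

# Hypothetical 1000 x 1000 matrix with only three non-zero entries
dense = np.zeros((1000, 1000))
dense[0, 1] = 1.0
dense[500, 2] = 3.0
dense[999, 999] = 7.0

csr = sparse.csr_matrix(dense)

# Bytes used by the dense array vs. the three arrays that back the CSR format
dense_bytes = dense.nbytes
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(csr.nnz)                 # number of stored non-zero values
print(dense_bytes, csr_bytes)  # the CSR figure is far smaller
```

Calling `csr.toarray()` converts back to a dense NumPy array when one is needed.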

## Shape, Size, Dimension of matrices, and type of entries

In [11]:
matrix

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [10]:
# Number of rows and columns

matrix.shape

(3, 3)

In [12]:
# Number of elements (rows * columns)

matrix.size

9

In [13]:
# Number of dimensions (axes)

matrix.ndim

2

In [14]:
# Check the type of entries

matrix.dtype

dtype('int64')
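
The element type does not have to be left to inference. As a minimal sketch (the variable names are illustrative), the `dtype` argument requests a type at creation time and `astype` converts an existing array:

```python
import numpy as np

# Request a specific element type at creation time
ints = np.array([1, 2, 3], dtype=np.int32)
floats = np.array([1, 2, 3], dtype=np.float64)

# astype returns a converted copy; the original array is unchanged
as_float = ints.astype(np.float64)

print(ints.dtype, floats.dtype, as_float.dtype)
print(ints.itemsize, floats.itemsize)  # bytes per element
```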

### Numpy `arange` and `linspace`

---

They are used to generate a sequence of numbers in a given range.

![alt text](images/arange.png "Title")

In [19]:
np.arange(start=30, stop=40, step=0.2) # upper boundary is not included

array([30. , 30.2, 30.4, 30.6, 30.8, 31. , 31.2, 31.4, 31.6, 31.8, 32. ,
       32.2, 32.4, 32.6, 32.8, 33. , 33.2, 33.4, 33.6, 33.8, 34. , 34.2,
       34.4, 34.6, 34.8, 35. , 35.2, 35.4, 35.6, 35.8, 36. , 36.2, 36.4,
       36.6, 36.8, 37. , 37.2, 37.4, 37.6, 37.8, 38. , 38.2, 38.4, 38.6,
       38.8, 39. , 39.2, 39.4, 39.6, 39.8])

In [21]:
np.linspace(start=30, stop=40, num=15) # upper boundary is included

array([30.        , 30.71428571, 31.42857143, 32.14285714, 32.85714286,
       33.57142857, 34.28571429, 35.        , 35.71428571, 36.42857143,
       37.14285714, 37.85714286, 38.57142857, 39.28571429, 40.        ])
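
A smaller side-by-side example may make the difference easier to see: `arange` is driven by a step size and excludes the stop value, while `linspace` is driven by the number of points and includes it. The ranges below are arbitrary illustrative values.

```python
import numpy as np

by_step = np.arange(0, 10, 2)          # step of 2, stop value 10 excluded
by_count = np.linspace(0, 10, num=6)   # 6 evenly spaced points, 10 included

print(by_step)
print(by_count)
```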

## Indexing and Slicing

---

Indexing, slicing, and iterating are extremely important for data manipulation and analysis because these techniques allow us to select data based on conditions, and to copy or update data. Slicing is a way to create a sub-array from the original array.

NumPy array indexing is a rich topic and there are many ways one can select a subset of data from an array or matrix.

> It is important to realize that a slice of an array is a **view** into the same data, not a copy. Modifying the sub-array will therefore also modify the original array.

In [24]:
# Indexing for vector

row_vector = np.array([5, 15, 20, 25, 30, 35])

row_vector[0]

row_vector[:3]

row_vector[3:]

row_vector[2:5]

row_vector[-1]

35
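
The fact that a slice shares data with the original array can be checked directly. This is a small sketch with illustrative variable names; `np.shares_memory` reports whether two arrays overlap in memory, and `.copy()` is the usual way to get an independent array.

```python
import numpy as np

original = np.array([5, 15, 20, 25, 30, 35])

view = original[:3]                 # a slice shares memory with the original
view[0] = 99                        # so this also changes original[0]

independent = original[:3].copy()   # an explicit copy owns its own data
independent[1] = -1                 # original[1] is left untouched

print(original)
print(np.shares_memory(original, view), np.shares_memory(original, independent))
```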

Indexing for matrices is slightly different from indexing for vectors because matrices have more than one axis and we have to decide which axis we need. In multidimensional arrays, the first argument selects rows and the second argument selects columns.

![alt text](images/indexing.png "Title")

---

![alt text](images/slicing.png "Title")

In [28]:
# Indexing for matrices

print(matrix)

matrix[0] # Select first row vector

matrix[:, 0] # Select first column vector

matrix[:2, :2] # Select the first two rows and first two columns

matrix[:2, :] # Select the first two rows and all columns of a matrix

matrix[:, 1:2] # Select all rows and the second column

matrix[2, 1] # Select only one element

matrix[:] # Select all elements

[[1 2 3]
 [4 5 6]
 [7 8 9]]


array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
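
One detail worth checking by hand is how the result's shape depends on whether an integer or a slice is used for an axis. The following sketch, using a local copy of the same 3 x 3 matrix, compares the two ways of selecting the second column:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

col_1d = m[:, 1]      # integer index drops the axis: shape (3,)
col_2d = m[:, 1:2]    # slice keeps the axis: shape (3, 1)
element = m[2, 1]     # row 2, column 1

print(col_1d.shape, col_2d.shape, element)
```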

## Boolean Indexing

---

Boolean indexing allows us to select arbitrary elements based on conditions. For example, if we want to find the elements of a matrix that are greater than 5, we set up a condition, which returns an array of boolean values.

In [29]:
matrix

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [30]:
matrix > 5

array([[False, False, False],
       [False, False,  True],
       [ True,  True,  True]])

In [31]:
matrix[matrix > 5]

array([6, 7, 8, 9])

In [32]:
# Tilde is the negation operator. It converts True to False and False to True

matrix[~(matrix > 5)]

array([1, 2, 3, 4, 5])

We can combine several conditions by using the elementwise boolean operators & (and) and | (or). Each condition must be wrapped in parentheses.

In [33]:
(matrix > 3) & (matrix < 7)

array([[False, False, False],
       [ True,  True,  True],
       [False, False, False]])

In [34]:
matrix[(matrix > 3) & (matrix < 7)]

array([4, 5, 6])
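
Boolean masks can also be used on the left-hand side of an assignment. The sketch below is one possible illustration: it caps values at 5 on a copy of the matrix and shows the `|` operator, which the cells above do not exercise.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

capped = m.copy()            # work on a copy so m stays intact
capped[capped > 5] = 5       # assign through the mask

outside = m[(m < 3) | (m > 7)]   # "or" of two conditions

print(capped)
print(outside)
```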

## Fancy Indexing

---

Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays. To select out a subset of the rows in a particular order, you can simply pass a list or ndarray of integers specifying the desired order.


> Keep in mind that fancy indexing, unlike slicing, always copies the data into a new array.

In [35]:
arr = np.array([[ 0, 1, 2, 3],
                [ 4, 5, 6, 7],
                [ 8, 9, 10, 11],
                [12, 13, 14, 15],
                [16, 17, 18, 19],
                [20, 21, 22, 23],
                [24, 25, 26, 27],
                [28, 29, 30, 31]])
               
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

In [36]:
arr[[0, 2, 4]] # Positive integers

array([[ 0,  1,  2,  3],
       [ 8,  9, 10, 11],
       [16, 17, 18, 19]])

In [37]:
arr[[-1, -3, -5]] # Negative integers

array([[28, 29, 30, 31],
       [20, 21, 22, 23],
       [12, 13, 14, 15]])
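
The copy behavior of fancy indexing can be verified in the same way as the view behavior of slices. In this sketch the 8 x 4 array is rebuilt with `np.arange(32).reshape(8, 4)`, which yields the same values as the literal above; passing two index lists, shown on the last indexing line, selects individual elements by (row, column) pairs.

```python
import numpy as np

arr = np.arange(32).reshape(8, 4)

rows = arr[[0, 2, 4]]        # fancy indexing returns a new array
rows[0, 0] = -1              # arr is not affected

pairs = arr[[1, 5, 7], [0, 3, 1]]   # elements (1, 0), (5, 3), (7, 1)

print(arr[0, 0], np.shares_memory(arr, rows))
print(pairs)
```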

## Arithmetic Operations On Array

---

We can perform many mathematical operations on arrays, such as addition, subtraction, multiplication, division, and exponentiation. These operators are applied to arrays elementwise.

In [38]:
a = np.array([10, 20, 30, 40, 50])

b = np.array([1, 2, 3, 4, 5])

In [39]:
print(a - b)

print(a + b)

print(a / b)

print(a * b)

print(a ** b)

[ 9 18 27 36 45]
[11 22 33 44 55]
[10. 10. 10. 10. 10.]
[ 10  40  90 160 250]
[       10       400     27000   2560000 312500000]
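
To make "elementwise" concrete, the sketch below compares an array expression with the equivalent explicit loop, and shows that a scalar operand is applied to every element. The variable names are illustrative.

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])
b = np.array([1, 2, 3, 4, 5])

elementwise = a * b
with_loop = np.array([x * y for x, y in zip(a, b)])   # same result, written out

scaled = a * 2 + 1        # a scalar is applied to every element

print(np.array_equal(elementwise, with_loop))
print(scaled)
```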


## Universal Functions: Fast Element-Wise Array Functions

---

A universal function, or *ufunc*, performs fast elementwise operations on arrays. NumPy has two types of universal functions: **unary ufuncs** and **binary ufuncs**. A unary ufunc takes one input array and produces one output array, while a binary ufunc takes two input arrays and produces one output array.

In [40]:
arr = np.array([4, 8, 16, 20, 25])

Unary ufuncs

In [43]:
np.sqrt(arr) # Square root of array values

np.exp(arr) # Exponential (e ** x) of array values

np.log(arr) # Natural logarithm of array values

array([1.38629436, 2.07944154, 2.77258872, 2.99573227, 3.21887582])

Binary ufuncs

In [44]:
np.add(a, b) # Add two arrays

np.subtract(a, b) # Subtract two arrays

np.power(a, b) # Exponentiation

np.multiply(a, b) # Multiplication

np.divide(a, b) # Division

np.greater(a, b) # Elementwise comparison, equivalent to > sign

np.less(a, b) # Elementwise comparison, equivalent to < sign

array([False, False, False, False, False])
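
Since only the last expression of each cell is displayed, a short check may help confirm what the other calls do. This sketch compares two binary ufuncs with their operator forms and adds `np.maximum` as one more binary ufunc; the array `c` is a hypothetical input chosen for the example.

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])
b = np.array([1, 2, 3, 4, 5])
c = np.array([12, 5, 40, 1, 60])    # hypothetical values for the comparison

# Arithmetic operators give the same results as the corresponding ufuncs
same_add = np.array_equal(np.add(a, b), a + b)
same_pow = np.array_equal(np.power(a, b), a ** b)

# np.maximum is another binary ufunc: elementwise larger of two arrays
larger = np.maximum(a, c)

# np.sqrt is a unary ufunc: one input array, one output array
roots = np.sqrt(np.array([4, 16, 25]))

print(same_add, same_pow)
print(larger, roots)
```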

## Statistical Methods

---

NumPy supports various statistical functions for data analysis. These functions operate on an entire array or along a particular axis of it.

In [45]:
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

matrix

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [46]:
np.max(matrix) # Maximum element in matrix

np.min(matrix) # Minimum element in matrix

1

Often, we need to know the maximum and minimum value along an axis.

In [48]:
np.max(matrix, axis=0) # Along axis 0 (down the rows): maximum of each column

np.max(matrix, axis=1) # Along axis 1 (across the columns): maximum of each row

array([3, 6, 9])
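
The meaning of `axis` is easier to see on a non-square array, where the two results have different lengths. A minimal sketch with a hypothetical 2 x 3 array:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])      # 2 rows, 3 columns

col_max = np.max(m, axis=0)    # collapse the rows: one value per column
row_max = np.max(m, axis=1)    # collapse the columns: one value per row

print(col_max, col_max.shape)
print(row_max, row_max.shape)
```

The axis passed to the function is the one that disappears from the result's shape.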

We can calculate sum and cumulative sum of array elements

In [51]:
np.sum(matrix) # Sum of the elements

np.sum(matrix, axis=0) # Sum along axis 0: one sum per column

np.sum(matrix, axis=1) # Sum along axis 1: one sum per row

np.cumsum(matrix) # Cumulative sum over the flattened array

np.cumsum(matrix, axis=0) # Cumulative sum along axis 0 (down each column)

np.cumsum(matrix, axis=1) # Cumulative sum along axis 1 (across each row)

array([[ 1,  3,  6],
       [ 4,  9, 15],
       [ 7, 15, 24]])

Moreover, we can calculate mean, median, variance, and standard deviation of an array.

In [54]:
np.mean(matrix) # Arithmetic average of array

np.mean(matrix, axis=0) # Arithmetic average of each column (axis 0)

np.mean(matrix, axis=1) # Arithmetic average of each row (axis 1)

# -------------------------------------------------------

np.median(matrix) # Median of array

np.median(matrix, axis=0) # Median of each column (axis 0)

np.median(matrix, axis=1) # Median of each row (axis 1)

# -------------------------------------------------------

np.var(matrix) # Variance

np.var(matrix, axis=0) # Variance of each column (axis 0)

np.var(matrix, axis=1) # Variance of each row (axis 1)

# -------------------------------------------------------

np.std(matrix) # Standard deviation (square root of the variance)

np.std(matrix, axis=0) # Standard deviation of each column (axis 0)

np.std(matrix, axis=1) # Standard deviation of each row (axis 1)

array([0.81649658, 0.81649658, 0.81649658])
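
The value 0.8165 above is the population standard deviation of a row such as `[1, 2, 3]`, because `np.var` and `np.std` divide by N by default. As a small sketch, the `ddof` parameter switches to the sample estimate, which divides by N - 1:

```python
import numpy as np

row = np.array([1, 2, 3])

population_var = np.var(row)          # divides by N (default ddof=0)
sample_var = np.var(row, ddof=1)      # divides by N - 1

print(population_var, sample_var)
print(np.std(row), np.sqrt(population_var))   # std is the square root of var
```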

## Linear Algebra

---

NumPy is highly optimized for linear algebra operations. These operations are defined mostly on matrices and may differ from the conventional ones. For example, multiplication of matrices is different from multiplication of two numbers.

In [55]:
matrix

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [56]:
# Transposing a matrix means to exchange rows and columns

matrix.T

array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])

In [57]:
# star (*) operator performs elementwise multiplications between matrices

matrix * matrix

array([[ 1,  4,  9],
       [16, 25, 36],
       [49, 64, 81]])

In [58]:
# For proper matrix multiplication, use np.dot() or the @ operator

np.dot(matrix, matrix)

matrix @ matrix

array([[ 30,  36,  42],
       [ 66,  81,  96],
       [102, 126, 150]])
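
With a square matrix both `*` and `@` succeed, which can hide the difference between them. The sketch below uses two hypothetical non-square matrices, where the inner dimensions must match for `@` to work:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])           # shape (3, 2)

product = A @ B                  # (2, 3) @ (3, 2) -> (2, 2)
same = np.array_equal(product, np.dot(A, B))

print(product)
print(same)
```

Here `A * B` would raise an error, because elementwise multiplication needs compatible shapes rather than matching inner dimensions.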

In [60]:
matrix

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [59]:
# Diagonal elements of a matrix

np.diag(matrix)

array([1, 5, 9])

In [61]:
# Find the trace of a matrix - Trace is the sum of the main diagonal elements

np.trace(matrix)

15

NumPy has a dedicated sub-package for linear algebra operations, `numpy.linalg`. Let's look at a few of its functions.

In [62]:
matrix

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [63]:
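# Find the determinant of a matrix
# This matrix is singular, so the exact determinant is 0; the tiny non-zero
# value below is floating-point rounding error

np.linalg.det(matrix)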

-9.51619735392994e-16

In [None]:
# Find the inverse of a matrix
# Note: this matrix is singular (determinant 0), so it has no true inverse

# np.linalg.inv(matrix)
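
Because the 3 x 3 example matrix is singular, it is not a good candidate for demonstrating inversion. The following sketch checks its rank and then inverts a small invertible matrix instead; the 2 x 2 matrix `A` is a hypothetical example, not part of the lecture data.

```python
import numpy as np

singular = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])
rank = np.linalg.matrix_rank(singular)   # rank below 3 means no inverse exists

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])               # hypothetical invertible matrix
det_A = np.linalg.det(A)
A_inv = np.linalg.inv(A)

# Compare with a tolerance, since floating-point results are rarely exact
is_identity = np.allclose(A @ A_inv, np.eye(2))

print(rank, det_A, is_identity)
```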

## File Input and Output with Arrays

---

NumPy is able to save and load data to and from disk either in text or binary format.

`np.save` and `np.load` are the two workhorse functions for efficiently saving and loading array data on disk.

In [64]:
matrix

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [65]:
# Save array to disk. The array is saved in uncompressed raw binary format with the extension .npy

np.save("some_matrix", matrix)

In [66]:
# Load data in array

np.load("some_matrix.npy")

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
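
For comparison with the text format mentioned above, here is a minimal round-trip sketch that writes the same matrix both as binary `.npy` and as plain-text CSV. The file names are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "m.npy")
    txt_path = os.path.join(tmp, "m.csv")

    np.save(npy_path, m)                                # binary .npy file
    from_npy = np.load(npy_path)

    np.savetxt(txt_path, m, delimiter=",", fmt="%d")    # plain-text CSV
    from_txt = np.loadtxt(txt_path, delimiter=",", dtype=int)

print(np.array_equal(m, from_npy), from_npy.dtype == m.dtype)
print(np.array_equal(m, from_txt))
```

The `.npy` format stores the dtype and shape alongside the data, whereas the text file has to be told what type to read back.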

Moreover, with NumPy we can read and write **CSV** files, too.

We have graduate school admissions data in a CSV file. It has fields such as GRE score, TOEFL score, university rating, GPA, research experience (yes or no), and chance of admission. With this dataset, we can do data manipulation and basic analysis to infer what conditions are associated with a higher chance of admission. Let's take a look.

In [67]:
# We can specify data field names when using genfromtxt() to load CSV data.
# Spaces in field names are replaced with underscores, e.g. 'GRE Score' becomes 'GRE_Score'.
# Also, we can have NumPy try to infer the type of each column by setting the dtype parameter to None.


graduate_admission = np.genfromtxt('data/admission.csv',
                                   dtype=None,
                                   delimiter=',',
                                   skip_header=1,
                                   names=('Serial No','GRE Score', 'TOEFL Score',
                                          'University Rating', 'SOP', 'LOR',
                                          'CGPA','Research', 'Chance of Admit'))

graduate_admission

array([(  1, 337, 118, 4, 4.5, 4.5, 9.65, 1, 0.92),
       (  2, 324, 107, 4, 4. , 4.5, 8.87, 1, 0.76),
       (  3, 316, 104, 3, 3. , 3.5, 8.  , 1, 0.72),
       (  4, 322, 110, 3, 3.5, 2.5, 8.67, 1, 0.8 ),
       (  5, 314, 103, 2, 2. , 3. , 8.21, 0, 0.65),
       (  6, 330, 115, 5, 4.5, 3. , 9.34, 1, 0.9 ),
       (  7, 321, 109, 3, 3. , 4. , 8.2 , 1, 0.75),
       (  8, 308, 101, 2, 3. , 4. , 7.9 , 0, 0.68),
       (  9, 302, 102, 1, 2. , 1.5, 8.  , 0, 0.5 ),
       ( 10, 323, 108, 3, 3.5, 3. , 8.6 , 0, 0.45),
       ( 11, 325, 106, 3, 3.5, 4. , 8.4 , 1, 0.52),
       ( 12, 327, 111, 4, 4. , 4.5, 9.  , 1, 0.84),
       ( 13, 328, 112, 4, 4. , 4.5, 9.1 , 1, 0.78),
       ( 14, 307, 109, 3, 4. , 3. , 8.  , 1, 0.62),
       ( 15, 311, 104, 3, 3.5, 2. , 8.2 , 1, 0.61),
       ( 16, 314, 105, 3, 3.5, 2.5, 8.3 , 0, 0.54),
       ( 17, 317, 107, 3, 4. , 3. , 8.7 , 0, 0.66),
       ( 18, 319, 106, 3, 4. , 3. , 8.  , 1, 0.65),
       ( 19, 318, 110, 3, 4. , 3. , 8.8 , 0, 0.63),
       ( 20,

In [68]:
# The resulting array is a one-dimensional structured array with 400 records

graduate_admission.shape

(400,)
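
To try `genfromtxt` without the data file, the same call can be run on an in-memory string. This is only a sketch: the three rows reuse values from the output above, while the header text and the reduced set of three columns are assumptions made for the example.

```python
from io import StringIO

import numpy as np

# Three rows copied from the output above; the header line is hypothetical
csv_text = StringIO(
    "Serial No.,GRE Score,CGPA\n"
    "1,337,9.65\n"
    "2,324,8.87\n"
    "3,316,8.00\n"
)

sample = np.genfromtxt(csv_text,
                       dtype=None,
                       delimiter=',',
                       skip_header=1,
                       names=('Serial No', 'GRE Score', 'CGPA'))

print(sample.dtype.names)   # spaces in the names are replaced by underscores
print(sample.shape)         # one record per data row
print(sample['GRE_Score'])
```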

In [69]:
# We can retrieve a column from the array using the column's name. For example, let's get the CGPA column and
# only the first five values.

graduate_admission['CGPA'][0:5]

array([9.65, 8.87, 8.  , 8.67, 8.21])

In [70]:
# Since the GPA in the dataset ranges from 1 to 10, and in the US it's more common to use a scale of up to 4,
# a common task might be to convert the GPA by dividing it by 10 and then multiplying by 4.

graduate_admission['CGPA'] = graduate_admission['CGPA'] / 10 * 4

graduate_admission['CGPA'][0:20] # print 20 values to check the result

array([3.86 , 3.548, 3.2  , 3.468, 3.284, 3.736, 3.28 , 3.16 , 3.2  ,
       3.44 , 3.36 , 3.6  , 3.64 , 3.2  , 3.28 , 3.32 , 3.48 , 3.2  ,
       3.52 , 3.4  ])

In [71]:
# Recall boolean masking. We can use this to find out how many students have had research experience by
# creating a boolean mask and passing it to the array indexing operator

len(graduate_admission[graduate_admission['Research'] == 1])

219

In [72]:
# Since we have the data field "chance of admission", which ranges from 0 to 1, we can try to see if students
# with a high chance of admission (>0.8) on average have a higher GRE score than those with a lower chance of
# admission (<0.4)

# So first we use boolean masking to pull out only those students we are interested in based on their chance
# of admission, then we pull out only their GRE scores, then we print the mean values.

print(graduate_admission[graduate_admission['Chance_of_Admit'] > 0.8]['GRE_Score'].mean())

print(graduate_admission[graduate_admission['Chance_of_Admit'] < 0.4]['GRE_Score'].mean())


328.7350427350427
302.2857142857143


In [74]:
# Let's do the same for GPA

print(graduate_admission[graduate_admission['Chance_of_Admit'] > 0.8]['CGPA'].mean())

print(graduate_admission[graduate_admission['Chance_of_Admit'] < 0.4]['CGPA'].mean())

3.7106666666666666
3.0222857142857142
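
The masking pattern used in the last few cells also works on a structured array built by hand, which makes it easy to check the arithmetic. In this sketch the four records reuse values from rows 1, 2, 5, and 6 of the output above, restricted to three fields; the dtype declaration is an assumption made for the example.

```python
import numpy as np

# Four records taken from the output above, limited to three fields
records = np.array(
    [(337, 1, 0.92), (324, 1, 0.76), (314, 0, 0.65), (330, 1, 0.90)],
    dtype=[('GRE_Score', 'i8'), ('Research', 'i8'), ('Chance_of_Admit', 'f8')])

has_research = records['Research'] == 1
high_chance = records['Chance_of_Admit'] > 0.8

n_research = np.count_nonzero(has_research)          # same as len(records[has_research])
mean_gre_high = records[high_chance]['GRE_Score'].mean()

print(n_research, mean_gre_high)
```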


# Summary

---

This lecture should serve as a strong foundation for the more advanced Pandas material, as Pandas is built on top of NumPy and many of the functions and capabilities of NumPy are available to you within Pandas.