# What is Numpy?

https://numpy.org/

Numpy is a Python library that provides an n-dimensional array data structure, along with functions that operate on it. These arrays make working with lists of numbers and matrices much easier and faster.

# Import Numpy Package

After our import statement, we can use "as" to give our package an alias. This alias lets us call numpy functions without spelling out "numpy" every time; we can simply use "np".

In [4]:
import numpy as np

# Numpy Arrays

The most useful feature of numpy is the numpy array. Numpy arrays are Python objects that hold a sequence of values, much like a list, and simplify calculations done with them, among other things.

To show off the usefulness of numpy arrays, we will first try to do some calculations on basic Python lists.

In [1]:
list1 = [1,2,3,4,5]
list2 = [6,7,8,9,10]

#This WILL throw an error
list1 * list2

TypeError: can't multiply sequence by non-int of type 'list'

In [2]:
for i in range(0,len(list1)):
    list1[i] = list1[i]*list2[i]
    
list1

[6, 14, 24, 36, 50]

As we can see, element-by-element arithmetic cannot be done directly on Python lists; we have to write a loop ourselves. Let's see what happens when we use numpy arrays.

In [5]:
array1 = np.array([1,2,3,4,5])
array2 = np.array([6,7,8,9,10])

array1*array2

array([ 6, 14, 24, 36, 50])
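
The same element-by-element behaviour applies to the other arithmetic operators, and to an array combined with a single number. The short sketch below (the variable names are only for illustration) also shows why the list version failed: for lists, "*" with an integer means repetition, not multiplication.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([6, 7, 8, 9, 10])

# Arithmetic operators act element by element on arrays of the same shape
total = a + b        # [ 7  9 11 13 15]
difference = b - a   # [5 5 5 5 5]

# A single number is applied to every element of the array
doubled = a * 2      # [ 2  4  6  8 10]

# For comparison: multiplying a list by an int repeats the list
repeated = [1, 2, 3] * 2   # [1, 2, 3, 1, 2, 3]

print(total)
print(difference)
print(doubled)
print(repeated)
```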

# Numpy Functions

Not only can we do operations with numpy arrays, we can also call mathematical functions on them.

Full list of numpy math functions: https://numpy.org/doc/stable/reference/routines.math.html

In [11]:
array = np.array([1,2,3,4,5])

#Examples of functions done directly on arrays
print(array.sum())
print(array.mean())
print(array.std())
print(array.max())

15
3.0
1.4142135623730951
5


In [12]:
#Using numpy functions on array
print(np.sin(array))
print(np.exp(array))
print(np.log(array))

[ 0.84147098  0.90929743  0.14112001 -0.7568025  -0.95892427]
[  2.71828183   7.3890561   20.08553692  54.59815003 148.4131591 ]
[0.         0.69314718 1.09861229 1.38629436 1.60943791]


# Multi-Dimensional Arrays

Where numpy shines is with multi-dimensional arrays or matrices. With basic lists, working with multi-dimensional arrays would need bulky nested loops, but numpy handles all of it internally.

In [13]:
#Add 2 matrices with lists
matrix1 = [[1,2,3],[4,5,6],[7,8,9]]
matrix2 = [[1,2,3],[4,5,6],[7,8,9]]

for i in range(0,len(matrix1)):
    for j in range(0,len(matrix1[i])):  #loop over the columns of row i
        matrix1[i][j] = matrix1[i][j] + matrix2[i][j]
        
matrix1

[[2, 4, 6], [8, 10, 12], [14, 16, 18]]

In [14]:
#Add 2 matrices with numpy arrays
npMatrix1 = np.array([[1,2,3],[4,5,6],[7,8,9]])
npMatrix2 = np.array([[1,2,3],[4,5,6],[7,8,9]])

npMatrix1 + npMatrix2


array([[ 2,  4,  6],
       [ 8, 10, 12],
       [14, 16, 18]])

In [15]:
#Functions applied to every element of the matrix
print(npMatrix1.sum())
print(npMatrix1.mean())
print(npMatrix1.std())
print(npMatrix1.max())

45
5.0
2.581988897471611
9
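
By default these functions use every element of the matrix. They also accept an axis argument to work along one dimension at a time. A minimal sketch, with variable names chosen only for illustration:

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# axis=0 collapses the rows, giving one result per column
column_sums = m.sum(axis=0)

# axis=1 collapses the columns, giving one result per row
row_sums = m.sum(axis=1)
row_means = m.mean(axis=1)

print(column_sums)  # [12 15 18]
print(row_sums)     # [ 6 15 24]
print(row_means)    # [2. 5. 8.]
```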


# Array Indexing and Slicing

Arrays (and nested lists) are indexed row first, then column: the first index picks the row (counting down) and the second picks the column (counting across).

Remember, all indices in Python start at 0!

In [16]:
matrix = np.array([[1,2,3],[4,5,6],[7,8,9]])

#first 
print(matrix[0][0])
#middle
print(matrix[1][1])
#last
print(matrix[2][2])
#another way to get last
print(matrix[-1][-1])


1
5
9
9


We can also get entire rows and columns out of numpy arrays.

In [17]:
#Rows
print(matrix[0])
print(matrix[1])
print(matrix[2])

[1 2 3]
[4 5 6]
[7 8 9]


In [18]:
#Columns: index as [row, column] - ":" selects every row
print(matrix[:,0])
print(matrix[:,1])
print(matrix[:,2])


[1 4 7]
[2 5 8]
[3 6 9]


In [19]:
#Transpose
print(matrix.T)

[[1 4 7]
 [2 5 8]
 [3 6 9]]
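
The comma form used for columns works for single elements too, and each position can take a start:stop slice to pull out a block of the matrix. A short sketch of both, with variable names that are only illustrative:

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# [row, column] selects the same element as [row][column]
middle = matrix[1, 1]

# Slices are start:stop, with stop excluded: rows 0-1 and columns 0-1
top_left = matrix[0:2, 0:2]

# Negative indices count from the end; ":" keeps every column
last_row = matrix[-1, :]

print(middle)    # 5
print(top_left)
print(last_row)  # [7 8 9]
```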


# Masking

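Masking is the process of filtering our data down to the subset we want. A mask is an array of True/False values, one per element; indexing with it keeps only the elements where the mask is True.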

In [20]:
array = np.array([-3,-2,-1,0,1,2,3])

mask = array > 0

print(mask)


[False False False False  True  True  True]


In [21]:
array[mask]

array([1, 2, 3])
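
The mask does not need its own variable, and several conditions can be combined into one mask. A minimal sketch (the names positives, inner and outer are only for illustration):

```python
import numpy as np

array = np.array([-3, -2, -1, 0, 1, 2, 3])

# The condition can be written directly inside the brackets
positives = array[array > 0]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
inner = array[(array > -2) & (array < 2)]
outer = array[(array < -1) | (array > 1)]

print(positives)  # [1 2 3]
print(inner)
print(outer)
```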

In [22]:
#Masks can also be used to find nan (missing) values

array2 = np.array([1,2,np.nan,3,np.nan,5])

mask = np.isnan(array2)

mask


array([False, False,  True, False,  True, False])

In [23]:
array2[~mask]

array([1., 2., 3., 5.])
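
Filtering out nan values matters because a single nan makes most calculations return nan. The sketch below compares the plain mean, the mean of the masked array, and np.nanmean, which skips nan values itself:

```python
import numpy as np

array2 = np.array([1, 2, np.nan, 3, np.nan, 5])

# Any nan in the data makes the ordinary mean nan
plain_mean = array2.mean()

# Mask out the nan values first, then take the mean of what is left
masked_mean = array2[~np.isnan(array2)].mean()

# np.nanmean ignores nan values without an explicit mask
nan_mean = np.nanmean(array2)

print(plain_mean)   # nan
print(masked_mean)  # 2.75
print(nan_mean)     # 2.75
```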

# Loading Data 

Numpy has many useful functions for reading and writing files. We will look at np.loadtxt, which reads a text file into an array, as an example.

Documentation: https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html

We will be looking at a data set of 80 cereals: https://www.kaggle.com/datasets/crawford/80-cereals

In [24]:
cereal = np.loadtxt("cereal.csv", dtype = str, delimiter =",")
cereal

array([['name', 'mfr', 'type', ..., 'weight', 'cups', 'rating'],
       ['100% Bran', 'N', 'C', ..., '1', '0.33', '68.402973'],
       ['100% Natural Bran', 'Q', 'C', ..., '1', '1', '33.983679'],
       ...,
       ['Wheat Chex', 'R', 'C', ..., '1', '0.67', '49.787445'],
       ['Wheaties', 'G', 'C', ..., '1', '1', '51.592193'],
       ['Wheaties Honey Gold', 'G', 'C', ..., '1', '0.75', '36.187559']],
      dtype='<U38')
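
np.loadtxt can also skip the header row and read only selected columns, which lets it return numbers instead of strings. The sketch below uses a small made-up CSV held in memory rather than cereal.csv; the column names and values are hypothetical and only show the pattern.

```python
import io
import numpy as np

# Hypothetical three-line CSV standing in for a real file
text = "name,cups,rating\nA,0.5,60.0\nB,1,40.5\n"

# Read everything as strings, then separate the header from the data rows
table = np.loadtxt(io.StringIO(text), dtype=str, delimiter=",")
header = table[0]
rows = table[1:]

# skiprows=1 drops the header and usecols keeps only the numeric columns,
# so the default float dtype can be used
numbers = np.loadtxt(io.StringIO(text), delimiter=",", skiprows=1, usecols=(1, 2))

print(header)
print(rows)
print(numbers)
```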

# Data Exploration

With your knowledge of numpy, explore the data set we just loaded in.

Here are some questions to guide you: <br>
What manufacturer makes the most types of cereal? <br>
Which cereal has the highest amount of sugar? <br>
How much variation is there in the amount of sugar?<br>

Hints:<br>
The first row of the array contains the column names<br>
Every element was loaded as a string and will have to be converted to a type that can be used in math operations - use np.array(..., dtype = float)<br>
Full list of numpy math functions: https://numpy.org/doc/stable/reference/routines.math.html
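
As a starting point, the hints above can be sketched on a tiny made-up array with the same layout as the loaded data (a header row followed by rows of strings). The column name "score" and all the values are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for the loaded data: header row, then string values
data = np.array([["name", "score"], ["A", "1.5"], ["B", "4"], ["C", "2.5"]])

header = data[0]

# Find the position of the column named "score"
col = np.where(header == "score")[0][0]

# Skip the header row and convert that column from strings to floats
scores = np.array(data[1:, col], dtype=float)

# argmax gives the row position of the largest value
best = data[1:, 0][scores.argmax()]

print(scores)        # [1.5 4.  2.5]
print(best)          # B
print(scores.std())
```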