3. NumPy

Why use NumPy?

NumPy (short for Numerical Python) is a Python library for working with arrays.

Python lists can serve the purpose of arrays, but they are slow to process element by element.
NumPy provides an array object that can be many times faster than a traditional Python list for numerical work. A figure of up to 50x is often quoted, though the actual speedup depends on the operation and the size of the data.
The array object in NumPy is called ndarray, and it comes with many supporting functions that make it easy to work with.
Arrays are used very frequently in data science, where speed and resource use are very important.
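
The speed difference can be checked with a rough timing comparison. This is only a sketch: the array size, the repeat count and the names py_list, np_arr, list_time and numpy_time are choices made for this illustration, and the numbers printed will vary from machine to machine.

```python
import timeit

import numpy as np

# The same 100,000 numbers, once as a Python list and once as a NumPy array
n = 100_000
py_list = list(range(n))
np_arr = np.arange(n)

# Time adding 5 to every element, repeated 20 times for each approach
list_time = timeit.timeit(lambda: [x + 5 for x in py_list], number=20)
numpy_time = timeit.timeit(lambda: np_arr + 5, number=20)

print(f"list:  {list_time:.4f} s")
print(f"numpy: {numpy_time:.4f} s")
```

On most machines the NumPy version should come out clearly ahead, but the exact ratio is not fixed.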

In [3]:
# Let's start by importing NumPy. Remember from the previous session: you have to import a package in every file that uses it
# We will alias it as np for simplicity

import numpy as np

In [4]:
# As a refresher, we create a numpy array by converting a list using the array function, as shown below.

arr = np.array([1, 2, 3, 4, 5])
print(arr)

print(type(arr))

[1 2 3 4 5]
<class 'numpy.ndarray'>


In [9]:
# Indexing a NumPy array works much like indexing a normal list

arr = np.array([1, 2, 3, 4])
print(arr[1])
print(arr[1:3])
print(arr[-1])

2
[2 3]
4


In [23]:
# This is where the similarities between NumPy arrays and standard Python lists end
# Perhaps I want to add 5 to each number, as written below...

# (named numbers rather than list, so we don't shadow Python's built-in list type)
numbers = [10,20,30,40,50,60,70,80,90,100]
numbers = numbers + 5  # this line raises the TypeError shown below
print(numbers)

# For the above to work, you would have to write a custom for loop to get the desired outcome

TypeError: can only concatenate list (not "int") to list
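
As a minimal sketch of the custom loop mentioned above, here are two plain-Python ways to add 5 to every element. The names numbers, plus_five and plus_five_short are chosen for this illustration only.

```python
numbers = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Option 1: an explicit for loop that builds a new list
plus_five = []
for n in numbers:
    plus_five.append(n + 5)

# Option 2: the same thing as a list comprehension
plus_five_short = [n + 5 for n in numbers]

print(plus_five)        # [15, 25, 35, 45, 55, 65, 75, 85, 95, 105]
print(plus_five_short)  # [15, 25, 35, 45, 55, 65, 75, 85, 95, 105]
```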

In [25]:
# But with a NumPy array, you can simply write the following:

numbers = [10,20,30,40,50,60,70,80,90,100]
numbers_arr = np.array(numbers)
numbers_arr = numbers_arr + 5
print(numbers_arr)

[ 15  25  35  45  55  65  75  85  95 105]


In [42]:
# This becomes even more useful when manipulating large amounts of data.

# Here we have an array of distances in kilometers
distances_km = np.array([12,32,29,55,11,19,23,24,34,15])
print(distances_km)

# Let's convert this array of kilometers to miles
distances_miles = distances_km * 0.621371
print(distances_miles)

[12 32 29 55 11 19 23 24 34 15]
[ 7.456452 19.883872 18.019759 34.175405  6.835081 11.806049 14.291533
 14.912904 21.126614  9.320565]


In [56]:
# Or when working with multiple arrays of data

# Here we have one array per weekday, each holding the hours worked by every person in our team
mon = np.array([8, 8, 6, 8, 5])
tue = np.array([7, 8, 4, 0, 7])
wed = np.array([8, 8, 6, 0, 5])
thu = np.array([6, 8, 8, 7, 9])
fri = np.array([5, 5, 8, 8, 5])

# We can add these arrays element by element to get the total number of working hours for each person.
working_hours = mon + tue + wed + thu + fri
print(working_hours)

[34 37 32 23 31]
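
An alternative sketch, assuming we would rather keep the whole week in one 2D array: stack the days as rows and sum along an axis. The names week, hours_per_person and hours_per_day are introduced here for illustration.

```python
import numpy as np

# One row per weekday, one column per team member
week = np.array([
    [8, 8, 6, 8, 5],  # Monday
    [7, 8, 4, 0, 7],  # Tuesday
    [8, 8, 6, 0, 5],  # Wednesday
    [6, 8, 8, 7, 9],  # Thursday
    [5, 5, 8, 8, 5],  # Friday
])

# axis=0 sums down each column: total hours per person
hours_per_person = week.sum(axis=0)
print(hours_per_person)  # [34 37 32 23 31]

# axis=1 sums across each row: total hours per day
hours_per_day = week.sum(axis=1)
print(hours_per_day)  # [35 26 27 38 31]
```

Keeping the data in one array makes it just as easy to get per-day totals as per-person totals.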


In [70]:
# Let's make this a bit more dynamic. Using the column_stack function, we can create a 2D array from the two arrays
# Note: an array holds a single data type, so the hours are converted to strings when combined with the names
team = np.array(['Kevin', 'Michael','Dylan','Jess','Amy'])
team_working_hours = np.column_stack((team, working_hours))

# We will pass "\n" to print so that a blank line follows the output
print(team_working_hours, "\n")

# A 2D array behaves much like a list of lists. We can fetch a whole row, or use [row, column] to fetch a single element
print(team_working_hours[0], "\n")
print(team_working_hours[1,0])

# Or just go buck wild!
print(team_working_hours[:,1])
print(team_working_hours[:,0], "\n")
print(team_working_hours[0,:], team_working_hours[-1,:])

# See the NumPy documentation for more indexing examples

[['Kevin' '34']
 ['Michael' '37']
 ['Dylan' '32']
 ['Jess' '23']
 ['Amy' '31']] 

['Kevin' '34'] 

Michael
['34' '37' '32' '23' '31']
['Kevin' 'Michael' 'Dylan' 'Jess' 'Amy'] 

['Kevin' '34'] ['Amy' '31']
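
Notice the quotes around the hours in the output above: because the names are strings, the combined array stores the hours as strings too. The sketch below shows one way to check this and to convert the hours column back to integers before doing arithmetic. The variable hours is introduced here for illustration.

```python
import numpy as np

team = np.array(['Kevin', 'Michael', 'Dylan', 'Jess', 'Amy'])
working_hours = np.array([34, 37, 32, 23, 31])
team_working_hours = np.column_stack((team, working_hours))

# dtype.kind is 'U' for Unicode strings, so the hours are no longer numbers
print(team_working_hours.dtype.kind)  # U

# Convert the second column back to integers before calculating with it
hours = team_working_hours[:, 1].astype(int)
print(hours.sum())  # 157
```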


In [79]:
# We can also calculate summary statistics from a NumPy array.

# Here we have an array of student ages in our class
ages = np.array([19,21,22,23,18,19,20,19,21,21,19,23,26,25,31,44])
print(ages)

# We can use NumPy to collect some insights from our array
avg_age = np.mean(ages)
print(avg_age)

median_age = np.median(ages)
print(median_age)

# By default np.std returns the population standard deviation (ddof=0)
stdv_age = np.std(ages)
print(stdv_age)



[19 21 22 23 18 19 20 19 21 21 19 23 26 25 31 44]
23.1875
21.0
6.267163931955187
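
One detail worth knowing: np.std divides by N by default, which gives the population standard deviation. If the class is treated as a sample from a larger group, the ddof parameter switches the divisor to N - 1. A short sketch, with population_std and sample_std as illustrative names:

```python
import numpy as np

ages = np.array([19, 21, 22, 23, 18, 19, 20, 19, 21, 21, 19, 23, 26, 25, 31, 44])

# Default: divide by N (population standard deviation)
population_std = np.std(ages)

# ddof=1: divide by N - 1 (sample standard deviation), always slightly larger
sample_std = np.std(ages, ddof=1)

print(population_std)
print(sample_std)
```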


In [90]:
# If we combine this with data on how well each student scored, we can measure how strongly
# student scores are related to age

ages = np.array([19,21,22,23,18,19,20,19,21,21,19,23,26,25,31,44])
scores = np.array([66,77,76,57,70,88,79,61,64,74,83,91,95,73,54,77])

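# np.corrcoef returns a 2x2 correlation matrix: the diagonal is always 1 (each variable against itself)
# and the off-diagonal entries hold the correlation between ages and scores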
cor = np.corrcoef(ages, scores)
print(cor)

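# Let's make this a little easier to read by printing only the age/score coefficient
# A value this close to 0 suggests almost no linear relationship between age and score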
print("\n", round(cor[0,1],6))

[[ 1.        -0.0237061]
 [-0.0237061  1.       ]]

 -0.023706
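
To see what the off-diagonal value represents, the Pearson correlation coefficient can also be computed by hand from the deviations around each mean. This is a sketch for illustration (the names age_dev, score_dev and r are not from the original notebook); in practice np.corrcoef is the simpler choice.

```python
import numpy as np

ages = np.array([19, 21, 22, 23, 18, 19, 20, 19, 21, 21, 19, 23, 26, 25, 31, 44])
scores = np.array([66, 77, 76, 57, 70, 88, 79, 61, 64, 74, 83, 91, 95, 73, 54, 77])

# How far each value lies from its own mean
age_dev = ages - ages.mean()
score_dev = scores - scores.mean()

# Pearson r: sum of the products of deviations, divided by the square root
# of the product of the two sums of squared deviations
r = (age_dev * score_dev).sum() / np.sqrt((age_dev ** 2).sum() * (score_dev ** 2).sum())
print(round(r, 6))

# The hand-computed value should agree with NumPy's result
print(np.isclose(r, np.corrcoef(ages, scores)[0, 1]))
```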
