In [None]:
import numpy as np
import random
import time

# Introducing NumPy Arrays!

The NumPy library provides high-performance arrays and matrices that we can use to dramatically speed up our code. NumPy achieves this by running its core operations in compiled code (C/C++)!

Processing data in a NumPy array is typically MUCH faster than processing it in a Python list. If you find yourself processing lists of numbers, you should ALWAYS ask yourself whether you could model the data as an array and the processing steps as vector/matrix math.

For those of you who have taken the Python for Data Analysis course, you should know that Pandas is built on top of NumPy, and this is what makes column operations so fast.

NumPy is a large Python package, and we could spend many hours exploring it. If you like what you see in these lectures, feel free to explore the docs and tutorials here: http://www.numpy.org/


**Note**: When thinking about how to make your Python code run faster, it often comes down to one question: "Is there a module/library I can use that runs compiled code (like C)?" NumPy, Scikit-Learn, Pandas, Dask, and TensorFlow are all either built on top of NumPy or use their own compiled code.



## A Motivating Example:

Below is the function we used from a previous lecture to convert a list of temperatures in Celsius to Fahrenheit. Let's test this on a long list!

In [None]:
def convert_c_to_f(temps_c):
    '''Take a list of temps in Celsius, convert them to Fahrenheit, and
    return them in a new list'''
    temps_fahrenheit = []
    for tc in temps_c:
        tf = (tc * (9/5)) + 32
        temps_fahrenheit.append(tf)
    return temps_fahrenheit

##### Let's now create a list of fake Celsius temps and time our function with timeit

In [None]:
temps_c_list = list(range(1000))  # materialize the range as an actual list

In [66]:
%%timeit
convert_c_to_f(temps_c_list)

196 µs ± 772 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


##### Now create a numpy array of the same fake Celsius temps and test the conversion to Fahrenheit

In [None]:
temps_c_array = np.array(temps_c_list)

In [None]:
print(type(temps_c_array))
print(temps_c_array[:10])

In [65]:
%%timeit
temps_c_array = np.array(temps_c_list)
temps_c_array * (9/5) + 32

151 µs ± 707 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
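
Note that the cell above times the list-to-array conversion together with the arithmetic, so part of the 151 µs is spent just building the array. The following is a minimal sketch that measures the two steps separately, using the standard-library `timeit` module rather than the `%%timeit` magic. The variable names and the repeat count are illustrative assumptions, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

temps_c_list = list(range(1000))
temps_c_array = np.array(temps_c_list)

n_runs = 1000  # arbitrary repeat count for this sketch

# Time only the list -> array conversion
convert_time = timeit.timeit(lambda: np.array(temps_c_list), number=n_runs)

# Time only the vectorized arithmetic on an array that already exists
math_time = timeit.timeit(lambda: temps_c_array * (9/5) + 32, number=n_runs)

print(f"conversion: {convert_time / n_runs * 1e6:.1f} µs per call")
print(f"arithmetic: {math_time / n_runs * 1e6:.1f} µs per call")
```

If the data already lives in an array, only the second number applies to each subsequent operation.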


In [None]:
temps_f_list = convert_c_to_f(temps_c_list)
temps_f_array = temps_c_array * (9/5) + 32
print(temps_f_list[:10])
print(temps_f_array[:10])
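
Printing the first ten values is a quick visual check. As a sketch of a more thorough comparison, `np.allclose` tests whether every pair of elements agrees within floating-point tolerance. The block below redefines the conversion function (as a compact list comprehension) so that it runs on its own.

```python
import numpy as np

def convert_c_to_f(temps_c):
    '''Compact version of the loop-based conversion, for a standalone example'''
    return [(tc * (9/5)) + 32 for tc in temps_c]

temps_c_list = list(range(1000))
temps_c_array = np.array(temps_c_list)

temps_f_list = convert_c_to_f(temps_c_list)
temps_f_array = temps_c_array * (9/5) + 32

# True only if every element matches within floating-point tolerance
results_match = np.allclose(temps_f_list, temps_f_array)
print(results_match)
```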

## Basic NumPy Math and Operations

In [None]:
my_array_1 = np.array([1, 2, 3, 4])
my_array_2 = np.array([5, 6, 7, 8])

# element-wise addition, subtraction, multiplication and division
print(my_array_1 + my_array_2)
print(my_array_1 - my_array_2)
print(my_array_1 * my_array_2)
print(my_array_1 / my_array_2)

# the same four operations between an array and a single number (scalar)

print(my_array_1 + 31)
print(my_array_1 - 7.6)
print(my_array_1 * 4.4)
print(my_array_1 / 21)

# mean, sum, std
print(my_array_1.mean())
print(my_array_1.sum())
print(my_array_1.std())
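
One detail worth knowing about `.std()`: by default NumPy computes the population standard deviation (dividing by N). Passing `ddof=1` gives the sample standard deviation (dividing by N - 1), which is the default in Pandas. A short sketch of the difference:

```python
import numpy as np

my_array_1 = np.array([1, 2, 3, 4])

population_std = my_array_1.std()        # divides by N (NumPy default)
sample_std = my_array_1.std(ddof=1)      # divides by N - 1

print(population_std)
print(sample_std)
```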

## Booleans

In [None]:
print(my_array_1 > 2.1)
print(my_array_2 == 6)

# chaining some things together, how many 7s are in this array?
my_array = np.array([1, 2, 7, 5, 4, 7, 8, 7])
print((my_array == 7).sum())

# Are all of the elements in this array 1s?
my_array = np.array([1, 1, 1, 1, 0])
print((my_array == 1).all())

# Are any of the elements in this array 1s?
my_array = np.array([0, 0, 0, 1, 0])
print((my_array == 1).any())
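
Boolean arrays can also be combined element-wise. The sketch below uses `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>` or `==`. The array values are made up for illustration.

```python
import numpy as np

my_array = np.array([1, 2, 7, 5, 4, 7, 8, 7])

# elements greater than 3 AND at most 7
between_3_and_7 = (my_array > 3) & (my_array <= 7)
print(between_3_and_7)
print(between_3_and_7.sum())

# elements equal to 1 OR equal to 8
is_1_or_8 = (my_array == 1) | (my_array == 8)
print(is_1_or_8.sum())

# elements that are NOT 7
not_seven = ~(my_array == 7)
print(not_seven.sum())
```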

## Indexing

In [None]:
my_array = np.array([3.1, 6.3, 1.1, 3.4])
print(my_array[my_array > 3])

my_array = np.array([3.1, 6.3, 1.1, 3.4])
print((my_array[my_array > 3]).mean())
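
A boolean mask can be used for assignment as well as selection. This sketch, with an arbitrary threshold chosen for illustration, caps every value above 3 on a copy of the array and then uses `np.where` to build a new array of flags without modifying the original.

```python
import numpy as np

my_array = np.array([3.1, 6.3, 1.1, 3.4])

# assign through a mask: cap values above 3 (on a copy, so my_array is unchanged)
capped = my_array.copy()
capped[capped > 3] = 3.0
print(capped)

# np.where(condition, value_if_true, value_if_false) returns a new array
flags = np.where(my_array > 3, 1, 0)
print(flags)
```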

## Multidimensional Arrays

### 2d Array

In [None]:
array_of_zeros = np.zeros((4, 5))
print(array_of_zeros)

### 3d Array

In [None]:
array_of_ones = np.ones((3, 4, 5))
print(array_of_ones)

### Arrays From Nested Lists

In [None]:
nested_lists = [[1, 2], [3, 4], [5, 6]]
my_2d_array = np.array(nested_lists)

In [None]:
my_2d_array
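
Once the data is in a 2d array, the `shape` attribute, comma-separated indexing, and the `axis` argument become the main tools. The following sketch reuses the nested-list array from above; `axis=0` reduces down the rows (one result per column) and `axis=1` reduces across the columns (one result per row).

```python
import numpy as np

my_2d_array = np.array([[1, 2], [3, 4], [5, 6]])

print(my_2d_array.shape)        # (number of rows, number of columns)
print(my_2d_array[0, 1])        # row 0, column 1
print(my_2d_array[:, 0])        # every row, column 0

col_sums = my_2d_array.sum(axis=0)   # one sum per column
row_sums = my_2d_array.sum(axis=1)   # one sum per row
print(col_sums)
print(row_sums)
```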

## Another Example Problem

You have a year's worth of hourly temperatures for 100 weather stations, and you need to find the max temperature at each station. Should we process each station's data as an individual list, or should we create one large 2d array, where each row is a station, and find the max of each row in a single NumPy command?

### Create The Station Data

In [67]:
all_station_temps = []
num_weather_stations = 100
num_temps = 24*365
for station in range(num_weather_stations):
    station_temps = [random.random()*100 for _ in range(num_temps)]
    all_station_temps.append(station_temps)
    

print(len(all_station_temps))
print(len(all_station_temps[0]))


100
8760


### Calculate The Max Temp From Each Station

In [68]:
%%timeit
max_station_temps = []
for station_temps in all_station_temps:
    max_station_temps.append(max(station_temps))

20.5 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [71]:
%%timeit
station_temps_array = np.array(all_station_temps)
max_station_temps_array = station_temps_array.max(axis=1)

462 µs ± 8.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
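
The NumPy timing above includes converting the nested lists to an array. If the data can be created or loaded as an array in the first place, that cost disappears. Here is a minimal sketch, assuming random data is an acceptable stand-in and using NumPy's `default_rng` generator (the seed is arbitrary). It also shows `argmax`, which returns the position of each station's maximum rather than its value.

```python
import numpy as np

num_weather_stations = 100
num_temps = 24 * 365

# Build the (stations x hours) array directly, with no intermediate lists
rng = np.random.default_rng(seed=0)
station_temps_array = rng.random((num_weather_stations, num_temps)) * 100

# One max per row (station), and the hour index at which it occurred
max_station_temps_array = station_temps_array.max(axis=1)
hottest_hour_index = station_temps_array.argmax(axis=1)

print(station_temps_array.shape)
print(max_station_temps_array.shape)
```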


## Indexing With Strings Versus Ints

Another tip I want to share with you, because it has served me well, is converting categorical string data to integer data before processing.

In [72]:
labels = ['ABC', 'DEF', 'GHI', 'JKL']
labels_dict = {'ABC': 0, 'DEF': 1, 'GHI': 2, 'JKL': 3}

# Create a random list of the label data
list_of_labels = [labels[random.randint(0,3)] for _ in range(1000)]
label_array = np.array(list_of_labels)

# Encode the same labels as integers instead of strings.
list_of_label_nums = [labels_dict[label] for label in list_of_labels]
label_num_array = np.array(list_of_label_nums)

### Now test indexing the array!

In [73]:
%%timeit
ix_abc = label_array == "ABC"

15.4 µs ± 69.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [74]:
%%timeit
ix_abc = label_num_array == 0

3.41 µs ± 80.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
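
The integer codes above were built with a hand-written dictionary. One way to derive them automatically is `np.unique` with `return_inverse=True`, which returns the sorted unique labels along with an integer code for each element. This is a sketch of one option rather than the approach used in this lecture, and the sample labels are made up.

```python
import numpy as np

label_array = np.array(['DEF', 'ABC', 'JKL', 'ABC', 'GHI', 'DEF'])

# unique_labels is sorted; label_num_array[i] is the position of
# label_array[i] within unique_labels
unique_labels, label_num_array = np.unique(label_array, return_inverse=True)

print(unique_labels)
print(label_num_array)

# Look up the integer code for 'ABC', then compare integers instead of strings
abc_code = np.where(unique_labels == 'ABC')[0][0]
ix_abc = label_num_array == abc_code
print(label_array[ix_abc])
```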
