In [None]:
Numerical Computing with Python and Numpy:

In [None]:
Working with numerical data
The "data" in Data Analysis typically refers to numerical data, e.g., stock prices, sales figures, sensor measurements, sports scores, database tables, etc. The Numpy library provides specialized data structures, functions, and other tools for numerical computing in Python.

In [None]:
Suppose we want to use climate data like the temperature, rainfall, and humidity to determine if a region is well suited for growing apples. A simple approach for doing this would be to formulate the relationship between the annual yield of apples (tons per hectare) and the climatic conditions like the average temperature (in degrees Fahrenheit), rainfall (in millimeters) & average relative humidity (in percentage) as a linear equation.

yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity

In [None]:
Here, we express the yield of apples as a weighted sum of the temperature, rainfall, and humidity. This equation is an approximation, since the actual relationship may not be linear and other factors may be involved. But a simple linear model like this often works well in practice.

Based on some statistical analysis of historical data, we might come up with reasonable values for the weights w1, w2, and w3. Here's an example set of values:

w1, w2, w3 = 0.3, 0.2, 0.5

In [1]:
kanto_temp = 73
kanto_rainfall = 67
kanto_humidity = 43
w1, w2, w3 = 0.3, 0.2, 0.5
kanto_yield_apples = kanto_temp * w1 + kanto_rainfall * w2 + kanto_humidity * w3
print("The expected yield of apples in Kanto region is {} tons per hectare.".format(kanto_yield_apples))

The expected yield of apples in Kanto region is 56.8 tons per hectare.


In [3]:
def crop_yield(region, weights):
    result = 0
    for x, w in zip(region, weights):
        result += x * w
    return result
kanto = [73, 67, 43]
johto = [91, 88, 64]
hoenn = [87, 134, 58]
sinnoh = [102, 43, 37]
unova = [69, 96, 70]
weights = [0.3, 0.2, 0.5]
crop_yield(kanto, weights)

56.8

In [4]:
crop_yield(johto, weights)

76.9

In [5]:
crop_yield(unova, weights)

74.9
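
Rather than calling `crop_yield` once per region, the regions can be collected in a dictionary and processed in one pass. The following is a minimal sketch using the same data as above; the names `regions` and `yields` are introduced here purely for illustration.

```python
def crop_yield(region, weights):
    """Weighted sum of a region's climate measurements."""
    result = 0
    for x, w in zip(region, weights):
        result += x * w
    return result

# Temperature, rainfall, humidity for each region (same values as above)
regions = {
    "kanto": [73, 67, 43],
    "johto": [91, 88, 64],
    "hoenn": [87, 134, 58],
    "sinnoh": [102, 43, 37],
    "unova": [69, 96, 70],
}
weights = [0.3, 0.2, 0.5]

# Compute the expected yield for every region in one pass
yields = {name: crop_yield(data, weights) for name, data in regions.items()}
for name, value in yields.items():
    print(f"{name}: {value:.1f}")
```

This keeps the loop-based approach but removes the need to repeat the call for each region.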

In [None]:
Going from Python lists to Numpy arrays
The calculation performed by the `crop_yield` function (element-wise multiplication of two vectors, followed by a sum of the results) is called the dot product.
The Numpy library provides a built-in function to compute the dot product of two vectors. However, we must first convert the lists into Numpy arrays.

If Numpy is not already installed, it can be installed using the pip package manager (`pip install numpy`). We then import it under the conventional alias `np`.

In [7]:
import numpy as np
kanto = np.array([73, 67, 43])
kanto

array([73, 67, 43])

In [8]:
import numpy as np
weights = np.array([w1, w2, w3])
weights

array([0.3, 0.2, 0.5])

In [9]:
import numpy as np
kanto = np.array([73, 67, 43])
type(kanto)


numpy.ndarray

In [10]:
import numpy as np
weights = np.array([w1, w2, w3])
type(weights)

numpy.ndarray

In [11]:
import numpy as np
weights = np.array([w1, w2, w3])
weights[0]

0.3

In [12]:
import numpy as np
kanto = np.array([73, 67, 43])
kanto[2]

43

In [None]:
Operating on Numpy arrays

We can now compute the dot product of the two vectors using the `np.dot` function.

In [13]:
import numpy as np
kanto = np.array([73, 67, 43])
weights = np.array([w1, w2, w3])
np.dot(kanto, weights)

56.8

In [14]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr1 * arr2

array([ 4, 10, 18])

In [15]:
arr2.sum()

15
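
The two operations above can be chained: multiplying element-wise and then summing the products gives the same value as `np.dot`. A small sketch, reusing the `arr1` and `arr2` values from the cells above (the names `manual` and `builtin` are only for this example):

```python
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Element-wise product followed by a sum: 4 + 10 + 18
manual = (arr1 * arr2).sum()

# The built-in dot product computes the same quantity
builtin = np.dot(arr1, arr2)

print(manual, builtin)
```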

In [None]:
## Benefits of using Numpy arrays

Numpy arrays offer the following benefits over Python lists for operating on numerical data:

- **Ease of use**: You can write small, concise, and intuitive mathematical expressions like `(kanto * weights).sum()` rather than using loops & custom functions like `crop_yield`.
- **Performance**: Numpy operations and functions are implemented internally in compiled C code, which makes them much faster than equivalent Python statements & loops that are interpreted at runtime.

Here's a comparison of dot products performed using Python loops vs. Numpy arrays on two vectors with a million elements each.

In [17]:
# Python lists
arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))

# Numpy arrays
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)

result = 0
for x1, x2 in zip(arr1, arr2):
    result += x1*x2
result

833332333333500000

In [18]:
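# Widen to 64-bit integers first: with 32-bit integers (the default here), the sum overflows.
np.dot(arr1_np.astype(np.int64), arr2_np.astype(np.int64))

833332333333500000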
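
To put rough numbers on the performance claim, both approaches can be timed with the standard library's `timeit` module. This is only a sketch: the helper `loop_dot` is a hypothetical name used for illustration, absolute timings depend on the machine, and `dtype=np.int64` is requested explicitly so that the large sum fits in the array's integer type.

```python
import timeit

import numpy as np

n = 1_000_000

# Python lists
arr1 = list(range(n))
arr2 = list(range(n, 2 * n))

# Numpy arrays, explicitly 64-bit so the result does not overflow
arr1_np = np.array(arr1, dtype=np.int64)
arr2_np = np.array(arr2, dtype=np.int64)


def loop_dot(a, b):
    """Dot product using a plain Python loop (illustrative helper)."""
    result = 0
    for x1, x2 in zip(a, b):
        result += x1 * x2
    return result


# Average time per run over three runs of each approach
loop_seconds = timeit.timeit(lambda: loop_dot(arr1, arr2), number=3) / 3
numpy_seconds = timeit.timeit(lambda: np.dot(arr1_np, arr2_np), number=3) / 3

print(f"Python loop: {loop_seconds:.4f} s per run")
print(f"np.dot:      {numpy_seconds:.4f} s per run")
```

Both approaches should return the same value; only the running time is expected to differ.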

In [None]:
Multi-dimensional Numpy arrays

In [20]:
climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])
climate_data


array([[ 73,  67,  43],
       [ 91,  88,  64],
       [ 87, 134,  58],
       [102,  43,  37],
       [ 69,  96,  70]])

In [21]:
climate_data.shape

(5, 3)

In [22]:
weights

array([0.3, 0.2, 0.5])

In [23]:
weights.shape

(3,)

In [25]:

arr3 = np.array([
    [[11, 12, 13], 
     [13, 14, 15]], 
    [[15, 16, 17], 
     [17, 18, 19.5]]])
arr3.shape

(2, 2, 3)

In [26]:
weights.dtype


dtype('float64')

In [27]:
climate_data.dtype

dtype('int32')
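
The two `dtype` values differ because of how the arrays were created: an array built only from Python integers gets an integer type, while a single floating-point value makes the whole array floating point. The default integer width depends on the platform and Numpy version, so the sketch below checks only the kind of type and not its exact size. The names `ints` and `mixed` are introduced for this example.

```python
import numpy as np

ints = np.array([73, 67, 43])     # integers only
mixed = np.array([17, 18, 19.5])  # one float among integers

print(ints.dtype)   # an integer type; the width is platform dependent
print(mixed.dtype)  # float64

# np.issubdtype tests the general category of a dtype
print(np.issubdtype(ints.dtype, np.integer))
print(np.issubdtype(mixed.dtype, np.floating))
```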

In [28]:
np.matmul(climate_data, weights)

array([56.8, 76.9, 81.9, 57.7, 74.9])

In [29]:
climate_data @ weights

array([56.8, 76.9, 81.9, 57.7, 74.9])
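
Multiplying the 5x3 `climate_data` matrix by the length-3 `weights` vector computes one dot product per row, which is why the result has one entry per region. A short sketch to check this, where `row_by_row` is a name introduced only for this example:

```python
import numpy as np

climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])
weights = np.array([0.3, 0.2, 0.5])

# One dot product per row (region), collected into a new array
row_by_row = np.array([np.dot(row, weights) for row in climate_data])

# The matrix product gives the same values in a single call
matrix_product = climate_data @ weights

print(np.allclose(matrix_product, row_by_row))
print(matrix_product.shape)
```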