### Working with numerical data

> Suppose we want to use climate data like the temperature, rainfall, and humidity to determine if a region is well suited for growing apples. A simple approach would be to formulate the relationship between the annual yield of apples (tons per hectare) and climatic conditions such as the average temperature (in degrees Fahrenheit), rainfall (in millimeters), and average relative humidity (in percentage) as a linear equation.
>
> `yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity`

We're expressing the yield of apples as a weighted sum of the temperature, rainfall, and humidity. This equation is an approximation since the actual relationship may not necessarily be linear, and there may be other factors involved. But a simple linear model like this often works well in practice.

Based on some statistical analysis of historical data, we might come up with reasonable values for the weights `w1`, `w2`, and `w3`. Here's an example set of values:

In [3]:
w1, w2, w3 = 0.3, 0.2, 0.5

Given some climate data for a region, we can now predict the yield of apples. Here's some sample data:

<img src="https://i.imgur.com/TXPBiqv.png" style="width:360px;">

To begin, we can define some variables to record climate data for a region.

In [1]:
kanto_temp = 73
kanto_rainfall = 67
kanto_humidity = 43

In [5]:
kanto_yield_apples = kanto_temp*w1 + kanto_rainfall*w2 + kanto_humidity*w3
print("The expected yield of apples in Kanto region is {} tons per hectare.".format(kanto_yield_apples))

The expected yield of apples in Kanto region is 56.8 tons per hectare.


We can represent the climate data for each region as a vector, i.e. a list of numbers.

In [6]:
kanto = [73, 67, 43]
johto = [91, 88, 64]
hoenn = [87, 134, 58]
sinnoh = [102, 43, 37]
unova = [69, 96, 70]

The three numbers in each vector represent the temperature, rainfall, and humidity data respectively.

In [7]:
weights = [w1, w2, w3]

Let's write a function `crop_yield` to calculate the yield of apples given the climate data and the respective weights.

In [8]:
def crop_yield(region, weights):
    result = 0
    for x, w in zip(region, weights):
        result += x * w
    return result

In [9]:
crop_yield(kanto, weights)

56.8

In [10]:
crop_yield(johto,weights)

76.9

In [11]:
crop_yield(unova, weights)

74.9

### Going from Python lists to Numpy arrays

In [15]:
import numpy as np

In [17]:
kanto = np.array([73, 67, 43])

In [18]:
kanto

array([73, 67, 43])

In [19]:
weights = np.array([w1,w2,w3])
weights

array([0.3, 0.2, 0.5])

In [21]:
type(kanto), type(weights)

(numpy.ndarray, numpy.ndarray)

### Operating on Numpy Arrays
We can now compute the dot product of the two vectors using the `np.dot` function.

In [22]:
np.dot(kanto, weights)

56.8
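As a quick sanity check, the sketch below compares `np.dot` against an explicit weighted-sum loop like the one in `crop_yield`. The values are copied from the cells above; the names `loop_result` and `dot_result` are introduced here only for illustration.

```python
import numpy as np

# Values copied from the earlier cells.
kanto = np.array([73, 67, 43])
weights = np.array([0.3, 0.2, 0.5])

# Explicit weighted sum, mirroring the loop inside crop_yield.
loop_result = 0.0
for x, w in zip(kanto, weights):
    loop_result += x * w

# The same computation as a single dot product.
dot_result = np.dot(kanto, weights)

# Compare with a tolerance, since floating-point sums can differ in the last digits.
print(np.isclose(loop_result, dot_result))
```

Comparing with `np.isclose` rather than `==` avoids false mismatches caused by floating-point rounding.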

We can achieve the same result with low-level operations supported by Numpy arrays: performing an element-wise multiplication and calculating the sum of the resulting numbers.

In [23]:
(kanto * weights).sum()

56.8

The `*` operator performs an element-wise multiplication of two arrays if they have the same size. The `sum` method calculates the sum of numbers in an array.

In [24]:
arr1 = np.array([1,2,3])
arr2 = np.array([4,5,6])

In [25]:
arr1 * arr2

array([ 4, 10, 18])

In [27]:
(arr1 * arr2).sum()

32

### Benefits of using Numpy Arrays

* Ease of use
* Performance

In [79]:
# python lists
arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))

# numpy arrays
# Try the operations below without specifying the dtype. On platforms where the default
# integer type is 32-bit, you'll notice negative values in numpy's result due to overflow.
arr1_np = np.array(arr1, dtype=np.int64)
arr2_np = np.array(arr2, dtype=np.int64)

In [80]:
%%time
result = 0
for x1, x2 in zip(arr1, arr2):
    result += x1*x2
result

Wall time: 287 ms


833332333333500000

In [81]:
%%time
np.dot(arr1_np, arr2_np)

Wall time: 3.99 ms


833332333333500000
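The explicit `dtype=np.int64` in the setup cell matters because fixed-width integers wrap around when a result does not fit. The sketch below is a minimal illustration with made-up values (the names `a32`, `a64`, `wrapped`, and `exact` are not part of the notebook): the same product is computed in 32-bit and 64-bit arithmetic.

```python
import numpy as np

# 50000 * 50000 = 2_500_000_000, which exceeds the largest int32 (2_147_483_647).
a32 = np.array([50000], dtype=np.int32)
a64 = np.array([50000], dtype=np.int64)

wrapped = a32 * a32  # 32-bit arithmetic wraps around to a negative number
exact = a64 * a64    # 64-bit arithmetic holds the true product

print(wrapped)
print(exact)
```

Python's built-in `int` has arbitrary precision, which is why the pure-Python loop never shows this problem.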

### Multi-dimensional Numpy Arrays

In [82]:
climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])

In [89]:
climate_data

array([[ 73,  67,  43],
       [ 91,  88,  64],
       [ 87, 134,  58],
       [102,  43,  37],
       [ 69,  96,  70]])

The array above is a 2-d array with 5 rows and 3 columns.

In [90]:
climate_data.shape

(5, 3)

In [91]:
weights

array([0.3, 0.2, 0.5])

In [93]:
# 1D array (vector)
weights.shape

(3,)

In [94]:
# 3d array
arr3 = np.array([[[11, 12, 13],
                  [13, 14, 15]],
                 [[15, 16, 17],
                  [17, 18, 19.5]]])

In [95]:
arr3.shape

(2, 2, 3)

In [96]:
weights.dtype

dtype('float64')

In [97]:
arr3.dtype

dtype('float64')

In [98]:
climate_data.dtype

dtype('int32')
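The `dtype` outputs above follow from how Numpy infers a single element type for the whole array. The sketch below uses small hypothetical arrays (`ints`, `mixed`, `as_float`) to show the inference and an explicit conversion with `astype`. The width of the default integer type depends on the platform and Numpy version, so the check here is only that it is some integer type.

```python
import numpy as np

ints = np.array([1, 2, 3])          # all integers: an integer dtype (width varies by platform)
mixed = np.array([1, 2, 3.5])       # a single float promotes every element to float64
as_float = ints.astype(np.float64)  # explicit conversion; returns a new array

print(ints.dtype, mixed.dtype, as_float.dtype)
print(np.issubdtype(ints.dtype, np.integer))
```

This is also why `arr3` above has dtype `float64`: one of its elements is `19.5`.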

We can now compute the predicted yields of apples in all the regions, using a single matrix multiplication between `climate_data` (a 5x3 matrix) and `weights` (a vector of length 3). Here's what it looks like visually:

<img src="https://i.imgur.com/LJ2WKSI.png" width="240">

We can use the `np.matmul` function or the `@` operator to perform matrix multiplication.

In [99]:
np.matmul(climate_data, weights)

array([56.8, 76.9, 81.9, 57.7, 74.9])

In [100]:
climate_data @ weights

array([56.8, 76.9, 81.9, 57.7, 74.9])
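One way to read this result: multiplying a `(5, 3)` matrix by a length-3 vector yields one dot product per row. The sketch below checks that interpretation using the same data; `row_by_row` and `all_at_once` are names introduced here for illustration.

```python
import numpy as np

climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])
weights = np.array([0.3, 0.2, 0.5])

# One dot product per region, computed in a Python loop.
row_by_row = np.array([np.dot(row, weights) for row in climate_data])

# The same five numbers from a single matrix-vector product.
all_at_once = climate_data @ weights

print(all_at_once.shape)
print(np.allclose(row_by_row, all_at_once))
```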

### Working with CSV data files

In [101]:
from urllib.request import urlretrieve

In [103]:
url = 'https://gist.github.com/BirajCoder/a4ffcb76fd6fb221d76ac2ee2b8584e9/raw/4054f90adfd361b7aa4255e99c2e874664094cea/climate.csv'
urlretrieve(url, "climate.txt")

('climate.txt', <http.client.HTTPMessage at 0x149a76887f0>)

In [106]:
climate_data = np.genfromtxt("climate.txt", delimiter=',', skip_header=1)
climate_data

array([[25., 76., 99.],
       [39., 65., 70.],
       [59., 45., 77.],
       ...,
       [99., 62., 58.],
       [70., 71., 91.],
       [92., 39., 76.]])
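To experiment with `np.genfromtxt` without downloading anything, a file-like object can stand in for the file. The sketch below uses a small made-up CSV string (`csv_text` is hypothetical, not the real dataset) and shows two behaviours: `skip_header=1` drops the column names, and an empty field is read as `nan`.

```python
import io
import numpy as np

# A tiny stand-in for climate.txt; the second data row has a missing rainfall value.
csv_text = """temperature,rainfall,humidity
25,76,99
39,,70
59,45,77
"""

sample = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)

print(sample.shape)
print(sample.dtype)
print(np.isnan(sample[1, 1]))
```

The values come back as floats even though the file contains whole numbers, which matches the `25., 76., 99.` output above.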

In [107]:
climate_data.shape

(10000, 3)

In [108]:
weights

array([0.3, 0.2, 0.5])

In [109]:
yields = climate_data @ weights
yields

array([72.2, 59.7, 65.2, ..., 71.1, 80.7, 73.4])

In [111]:
yields.shape

(10000,)

Let's add the `yields` to `climate_data` as a fourth column using the `np.concatenate` function.

In [112]:
climate_results = np.concatenate((climate_data, yields.reshape(10000,1)), axis=1)

In [113]:
climate_results

array([[25. , 76. , 99. , 72.2],
       [39. , 65. , 70. , 59.7],
       [59. , 45. , 77. , 65.2],
       ...,
       [99. , 62. , 58. , 71.1],
       [70. , 71. , 91. , 80.7],
       [92. , 39. , 76. , 73.4]])
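The call above hard-codes the row count in `reshape(10000,1)`. A possible alternative, sketched below on the first three rows shown in the output, is `reshape(-1, 1)`, which lets Numpy infer the number of rows. The names `data`, `column`, and `results` are illustrative.

```python
import numpy as np

# First three rows of the dataset, as printed above.
data = np.array([[25., 76., 99.],
                 [39., 65., 70.],
                 [59., 45., 77.]])
weights = np.array([0.3, 0.2, 0.5])

yields = data @ weights          # shape (3,)
column = yields.reshape(-1, 1)   # shape (3, 1); -1 means "infer this dimension"

# Both arrays must be 2-d with the same number of rows to join along axis=1.
results = np.concatenate((data, column), axis=1)

print(results.shape)
```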

Let's save the final results from our computation above back to a file using the `np.savetxt` function.

In [123]:
np.savetxt('climate_results.txt', climate_results,
           fmt='%.2f', delimiter=',',
           header='temperature,rainfall,humidity, yield_apples',
           comments='')

In [125]:
with open("./climate_results.txt", "r") as f:
    print(f.read())

temperature,rainfall,humidity, yield_apples
25.00,76.00,99.00,72.20
39.00,65.00,70.00,59.70
59.00,45.00,77.00,65.20
84.00,63.00,38.00,56.80
66.00,50.00,52.00,55.80
41.00,94.00,77.00,69.60
91.00,57.00,96.00,86.70
49.00,96.00,99.00,83.40
67.00,20.00,28.00,38.10
85.00,31.00,95.00,79.20
78.00,46.00,34.00,49.60
31.00,40.00,63.00,48.80
52.00,77.00,85.00,73.50
28.00,66.00,77.00,60.10
32.00,50.00,57.00,48.10
31.00,79.00,53.00,51.60
45.00,76.00,48.00,52.70
80.00,52.00,27.00,47.90
24.00,45.00,90.00,61.20
20.00,89.00,84.00,65.80
32.00,20.00,96.00,61.60
56.00,44.00,74.00,62.60
93.00,70.00,80.00,81.90
62.00,87.00,73.00,72.50
85.00,48.00,90.00,80.10
87.00,86.00,93.00,89.80
43.00,57.00,69.00,58.80
28.00,71.00,44.00,44.60
44.00,95.00,58.00,61.20
88.00,48.00,46.00,59.00
80.00,50.00,38.00,53.00
27.00,75.00,89.00,67.60
93.00,61.00,73.00,76.60
31.00,87.00,62.00,57.70
38.00,94.00,44.00,52.20
55.00,93.00,55.00,62.60
56.00,22.00,47.00,44.70
38.00,64.00,79.00,63.70
26.00,22.00,40.00,32.20
27.00,98.00,24.00,39
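A round trip is a simple way to confirm that the saved file can be read back. The sketch below writes two rows to a temporary file and reloads them with `np.genfromtxt`. The file name `results_demo.txt` and the header without the extra space are choices made for this example, not part of the notebook.

```python
import os
import tempfile
import numpy as np

results = np.array([[25., 76., 99., 72.2],
                    [39., 65., 70., 59.7]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results_demo.txt")

    # comments='' stops savetxt from prefixing the header line with '# '.
    np.savetxt(path, results, fmt='%.2f', delimiter=',',
               header='temperature,rainfall,humidity,yield_apples',
               comments='')

    with open(path) as f:
        header_line = f.readline().strip()

    # skip_header=1 skips the column names when loading.
    loaded = np.genfromtxt(path, delimiter=',', skip_header=1)

print(header_line)
print(np.allclose(loaded, results))
```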

### Arithmetic Operations, Broadcasting and Comparison

In [128]:
arr2 = np.array([[1, 2, 3, 4], 
                 [5, 6, 7, 8], 
                 [9, 1, 2, 3]])

In [129]:
arr3 = np.array([[11, 12, 13, 14], 
                 [15, 16, 17, 18], 
                 [19, 11, 12, 13]])

In [130]:
# adding a scalar
arr2 + 3

array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12,  4,  5,  6]])

In [131]:
# element-wise subtraction
arr3 - arr2

array([[10, 10, 10, 10],
       [10, 10, 10, 10],
       [10, 10, 10, 10]])

In [132]:
# division by scalar
arr2/2

array([[0.5, 1. , 1.5, 2. ],
       [2.5, 3. , 3.5, 4. ],
       [4.5, 0.5, 1. , 1.5]])

In [133]:
# element-wise multiplication
arr2 * arr3

array([[ 11,  24,  39,  56],
       [ 75,  96, 119, 144],
       [171,  11,  24,  39]])

In [134]:
# modulus with scalar
arr2 % 4

array([[1, 2, 3, 0],
       [1, 2, 3, 0],
       [1, 1, 2, 3]], dtype=int32)

### Array Broadcasting

In [135]:
arr2 = np.array([[1, 2, 3, 4], 
                 [5, 6, 7, 8], 
                 [9, 1, 2, 3]])

In [136]:
arr2.shape

(3, 4)

In [137]:
arr4 = np.array([4, 5, 6, 7])
arr4.shape

(4,)

In [139]:
# (3,4) + (4,) works
arr2 + arr4

array([[ 5,  7,  9, 11],
       [ 9, 11, 13, 15],
       [13,  6,  8, 10]])

In [140]:
arr5 = np.array([7,8])
arr5.shape

(2,)

In [141]:
# (3,4) + (2,) will not work
arr2 + arr5

ValueError: operands could not be broadcast together with shapes (3,4) (2,) 
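Broadcasting compares shapes from the trailing dimension backwards: two dimensions are compatible when they are equal or when one of them is 1. The sketch below illustrates this with the same `arr2`; the arrays `row` and `col` and the flag `failed` are introduced here for illustration.

```python
import numpy as np

arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])

row = np.array([4, 5, 6, 7])        # shape (4,): matches the last dimension of (3, 4)
col = np.array([[10], [20], [30]])  # shape (3, 1): the 1 is stretched across 4 columns

plus_row = arr2 + row  # row is added to every row of arr2
plus_col = arr2 + col  # col is added to every column of arr2

# Shape (2,) matches neither 4 nor 1, so this raises ValueError.
try:
    arr2 + np.array([7, 8])
    failed = False
except ValueError:
    failed = True

print(plus_col)
print(failed)
```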

### Array Comparison

In [144]:
arr1 = np.array([[1, 2, 3], 
                 [3, 4, 5]])
arr2 = np.array([[2, 2, 3], 
                 [1, 2, 5]])

In [143]:
arr1 == arr2

array([[False,  True,  True],
       [False, False,  True]])

In [145]:
arr1 != arr2

array([[ True, False, False],
       [ True,  True, False]])

In [146]:
arr1 >= arr2

array([[False,  True,  True],
       [ True,  True,  True]])

In [147]:
arr1 < arr2

array([[ True, False, False],
       [False, False, False]])

In [149]:
# True evaluates to 1 and False evaluates to 0 when used in arithmetic operations
(arr1 == arr2).sum()

3
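Because booleans behave like 0 and 1 in arithmetic, the same idea extends to counting matches per row or computing the fraction of matching elements. A minimal sketch with the arrays from above (the names `matches`, `total`, `per_row`, and `fraction` are illustrative):

```python
import numpy as np

arr1 = np.array([[1, 2, 3],
                 [3, 4, 5]])
arr2 = np.array([[2, 2, 3],
                 [1, 2, 5]])

matches = arr1 == arr2         # boolean array, same shape as the inputs

total = matches.sum()          # number of equal elements overall
per_row = matches.sum(axis=1)  # number of equal elements in each row
fraction = matches.mean()      # share of elements that are equal

print(total, per_row, fraction)
```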

### Array Indexing and Slicing

In [150]:
arr3 = np.array([[[11, 12, 13, 14], 
                  [13, 14, 15, 19]], 
                 
                 [[15, 16, 17, 21], 
                  [63, 92, 36, 18]], 
                 
                 [[98, 32, 81, 23],      
                  [17, 18, 19.5, 43]]])

In [151]:
arr3.shape

(3, 2, 4)

In [153]:
# single element
arr3[1,1,2]

36.0

In [154]:
# subarray using ranges
arr3[1:, 0:1, :2]

array([[[15., 16.]],

       [[98., 32.]]])

In [155]:
# mixing indices and ranges
arr3[1:, 1, 3]

array([18., 43.])

In [156]:
arr3[1]

array([[15., 16., 17., 21.],
       [63., 92., 36., 18.]])

In [157]:
arr3[:2, 1]

array([[13., 14., 15., 19.],
       [63., 92., 36., 18.]])

The notation and its results can seem confusing at first, so take your time to experiment and become comfortable with it. Use the cells below to try out some examples of array indexing and slicing, with different combinations of indices and ranges. Here are some more examples demonstrated visually:

<img src="https://scipy-lectures.org/_images/numpy_indexing.png" width="360">
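For more practice, the sketch below builds a hypothetical 6x6 array named `grid`, in which each value encodes its own position (row * 10 + column), so the effect of every slice is easy to verify by eye.

```python
import numpy as np

# Each element equals 10 * row_index + column_index.
grid = np.arange(6) + 10 * np.arange(6)[:, np.newaxis]

first_row_part = grid[0, 3:5]  # row 0, columns 3 and 4
bottom_right = grid[4:, 4:]    # last two rows, last two columns
third_column = grid[:, 2]      # every row, column 2
strided = grid[2::2, ::2]      # rows 2 and 4, every second column

print(first_row_part)
print(bottom_right)
print(third_column)
print(strided)
```

An integer index removes a dimension from the result, while a range keeps it, which is why `grid[:, 2]` is 1-d and `grid[4:, 4:]` stays 2-d.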

### Other Ways of Creating Numpy Arrays

In [158]:
# all zeros
np.zeros((3,2))

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

In [159]:
# all ones
np.ones([2,2,3])

array([[[1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.]]])

In [160]:
# identity matrix
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [163]:
# random vector, sampled uniformly from [0, 1)
np.random.rand(5)

array([0.29975601, 0.79330394, 0.93226509, 0.16050769, 0.96239309])

In [168]:
# random vector, sampled from the standard normal distribution
np.random.randn(5)

array([ 0.01612597,  0.19681081,  0.28189275,  0.89152782, -0.84872531])

In [169]:
# range with start, end and step
np.arange(10,90,3)

array([10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58,
       61, 64, 67, 70, 73, 76, 79, 82, 85, 88])

In [170]:
# equally spaced numbers in a range
np.linspace(3,27,9)

array([ 3.,  6.,  9., 12., 15., 18., 21., 24., 27.])
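`np.arange` and `np.linspace` look similar but differ in two ways: `arange` takes a step and excludes the end value, while `linspace` takes a count and includes the end value by default. A short sketch of the difference, using the names `stepped` and `spaced` for illustration:

```python
import numpy as np

stepped = np.arange(3, 27, 3)   # step of 3; the end value 27 is excluded
spaced = np.linspace(3, 27, 9)  # 9 points; the end value 27 is included

print(stepped)
print(spaced)
print(spaced[1] - spaced[0])    # spacing implied by the requested count
```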