# Working with numerical data

The "data" in *Data Analysis* typically refers to numerical data, e.g., stock prices, sales figures, sensor measurements, sports scores, and database tables. The [Numpy](https://numpy.org) library provides specialized data structures, functions, and other tools for numerical computing in Python. Let's work through an example to see why and how to use Numpy for working with numerical data.


> Suppose we want to use climate data like the temperature, rainfall, and humidity to determine if a region is well suited for growing apples. A simple approach for doing this would be to formulate the relationship between the annual yield of apples (tons per hectare) and the climatic conditions like the average temperature (in degrees Fahrenheit), rainfall (in millimeters), and average relative humidity (in percentage) as a linear equation.
>
> `yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity`

We're expressing the yield of apples as a weighted sum of the temperature, rainfall, and humidity. This equation is an approximation since the actual relationship may not necessarily be linear, and there may be other factors involved. But a simple linear model like this often works well in practice.

Based on some statistical analysis of historical data, we might come up with reasonable values for the weights `w1`, `w2`, and `w3`. Here's an example set of values:

In [124]:
w1, w2, w3 = 0.3, 0.2, 0.5

Given some climate data for a region, we can now predict the yield of apples. Here's some sample data:

<img src="https://i.imgur.com/TXPBiqv.png" style="width:360px;">

To begin, we can define some variables to record climate data for a region.

In [125]:
kantoTemp = 73
kantoRain = 67
kantoHum = 43

In [126]:
kantoYield = kantoTemp*w1 + kantoRain*w2 + kantoHum*w3
kantoYield

56.8

In [127]:
print(f"The expected yield of apples in Kanto is {kantoYield} tons per hectare.")

The expected yield of apples in Kanto is 56.8 tons per hectare.


To make it slightly easier to perform the above computation for multiple regions, we can represent the climate data for each region as a vector, i.e., a list of numbers.

In [128]:
kanto = [73, 67, 43]
johto = [91, 88, 64]
hoenn = [87, 134, 58]
sinnoh = [102, 43, 37]
unova = [69, 96, 70]

In [129]:
weights = [w1, w2, w3]

In [130]:
for item in zip(kanto, weights):
    print(item)

(73, 0.3)
(67, 0.2)
(43, 0.5)


In [131]:
def CropYield(region, weights):
    result = 0
    for x, w in zip(region, weights):  # zip pairs each value with its weight
        result += x*w
    return result 

In [132]:
CropYield(kanto, weights)

56.8

In [133]:
CropYield(unova, weights)

74.9

## Going from Python lists to Numpy arrays


The calculation performed by the `CropYield` function (element-wise multiplication of two vectors, followed by a sum of the results) is also called the *dot product*. Learn more about the dot product here: [Khan Academy](https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/dot-cross-products/v/vector-dot-product-and-vector-length).

The Numpy library provides a built-in function to compute the dot product of two vectors. However, we must first convert the lists into Numpy arrays.

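If Numpy isn't already installed, you can install it using the `pip` package manager (`pip install numpy`). Once it's available, import it using the conventional alias `np`.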

In [134]:
import numpy as np

In [135]:
kanto = np.array([73,67,43])

In [136]:
type(kanto)

numpy.ndarray

In [137]:
weights = np.array([0.3,0.2,0.5])

In [138]:
type(weights)

numpy.ndarray

In [139]:
kantoYield = np.dot(kanto, weights)
kantoYield

56.8

We can achieve the same result with low-level operations supported by Numpy arrays: performing an element-wise multiplication and then summing the resulting numbers.

In [140]:
(kanto*weights).sum()

56.8
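Both `np.dot` and element-wise multiplication expect the two vectors to have the same length. As a small illustration (the arrays `region` and `short_weights` are made up for this example), Numpy raises a `ValueError` when the lengths differ, whereas a loop built on `zip` quietly stops at the shorter input:

```python
import numpy as np

region = np.array([73, 67, 43])
short_weights = np.array([0.3, 0.2])  # deliberately one element short

errors = []
for name, op in [("dot", lambda: np.dot(region, short_weights)),
                 ("multiply", lambda: region * short_weights)]:
    try:
        op()
    except ValueError as exc:
        # Numpy refuses to combine vectors of different lengths.
        errors.append(name)
        print(f"{name} failed: {exc}")

# zip stops at the shorter input, so the loop version does not complain
# and silently ignores the third measurement.
loop_result = sum(x * w for x, w in zip(region, short_weights))
print(loop_result)
```

This is one reason to prefer the Numpy versions: a shape mismatch surfaces immediately as an error instead of producing a plausible-looking but wrong number.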

## Benefits of using Numpy arrays

Numpy arrays offer the following benefits over Python lists for operating on numerical data:

- **Ease of use**: You can write small, concise, and intuitive mathematical expressions like `(kanto * weights).sum()` rather than using loops & custom functions like `CropYield`.
- **Performance**: Numpy operations and functions are implemented internally in compiled code (largely C), which makes them much faster than using Python statements and loops that are interpreted at runtime.

Here's a comparison of dot products performed using Python loops vs. Numpy arrays on two vectors with a million elements each.

In [141]:
# Python lists
arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))

# Numpy arrays
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)

In [142]:
%%time
result = 0
for x1, x2 in zip(arr1, arr2):
    result += x1*x2
result

Wall time: 146 ms


833332333333500000

In [143]:
%%time
print(np.dot(arr1_np,arr2_np))

-1942957984
Wall time: 1.25 ms
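Note that the two results above do not match: the Python loop returns 833332333333500000, while `np.dot` prints -1942957984. The second number is exactly what the correct sum becomes when it wraps around in a 32-bit integer, which suggests the arrays in this run were created with a 32-bit integer `dtype` (probably the platform default, although the notebook does not say). A minimal sketch of one way to avoid this, assuming 64-bit integers are wide enough for the data, is to request `np.int64` explicitly:

```python
import numpy as np

# Build the same two vectors directly as 64-bit integer arrays.
a = np.arange(1_000_000, dtype=np.int64)
b = np.arange(1_000_000, 2_000_000, dtype=np.int64)

# The true dot product (about 8.3e17) fits in int64, whose maximum is about 9.2e18.
result = np.dot(a, b)
print(result)  # 833332333333500000
```

Checking `arr1_np.dtype` before running large integer computations is a quick way to spot this kind of silent overflow.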


## Multi-dimensional Numpy arrays 

We can now go one step further and represent the climate data for all the regions using a single 2-dimensional Numpy array.

In [144]:
climateData = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])

In [145]:
climateData[1,0]

91

In [146]:
climateData

array([[ 73,  67,  43],
       [ 91,  88,  64],
       [ 87, 134,  58],
       [102,  43,  37],
       [ 69,  96,  70]])

In [147]:
# 2D array (matrix)
climateData.shape

(5, 3)

In [148]:
weights.dtype

dtype('float64')

In [149]:
np.matmul(climateData,weights)

array([56.8, 76.9, 81.9, 57.7, 74.9])

In [150]:
climateData @ weights

array([56.8, 76.9, 81.9, 57.7, 74.9])
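The `@` operator (equivalently `np.matmul`) used above multiplies the `(5, 3)` matrix by the length-3 weights vector and produces one value per row. A quick sketch to check that each entry is simply the dot product of that region's row with the weights (the names `climate` and `per_row` are introduced only for this illustration):

```python
import numpy as np

climate = np.array([[73, 67, 43],
                    [91, 88, 64],
                    [87, 134, 58],
                    [102, 43, 37],
                    [69, 96, 70]])
weights = np.array([0.3, 0.2, 0.5])

# Matrix-vector product: (5, 3) @ (3,) -> (5,)
yields = climate @ weights

# The same values computed one region at a time.
per_row = np.array([np.dot(row, weights) for row in climate])

print(yields.shape)                  # (5,)
print(np.allclose(yields, per_row))  # True
```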

In [151]:
import urllib.request

In [152]:
urllib.request.urlretrieve('https://hub.jovian.ml/wp-content/uploads/2020/08/climate.csv', 
    'climate.txt')

('climate.txt', <http.client.HTTPMessage at 0x1cc39972460>)

In [153]:
climateData = np.genfromtxt('climate.txt', delimiter=',', skip_header=1)

In [154]:
climateData

array([[25., 76., 99.],
       [39., 65., 70.],
       [59., 45., 77.],
       ...,
       [99., 62., 58.],
       [70., 71., 91.],
       [92., 39., 76.]])

In [155]:
climateData.shape

(10000, 3)

In [156]:
CropYield = climateData @ weights

In [157]:
CropYield

array([72.2, 59.7, 65.2, ..., 71.1, 80.7, 73.4])

In [158]:
CropYield.shape

(10000,)

In [159]:
ClimateRes = np.concatenate((climateData, CropYield.reshape(10000,1)), axis=1)

In [160]:
ClimateRes

array([[25. , 76. , 99. , 72.2],
       [39. , 65. , 70. , 59.7],
       [59. , 45. , 77. , 65.2],
       ...,
       [99. , 62. , 58. , 71.1],
       [70. , 71. , 91. , 80.7],
       [92. , 39. , 76. , 73.4]])

In [161]:
np.savetxt('climate_results.txt', 
           ClimateRes, 
           fmt='%.2f', 
           delimiter=',',
           header='temperature,rainfall,humidity,yield_apples', 
           comments='')
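To confirm that a file written this way can be read back, here is a small self-contained round trip. It uses a temporary directory and a two-row stand-in array (`sample`) rather than the downloaded dataset, so the file name and values are placeholders:

```python
import os
import tempfile

import numpy as np

# Two rows of temperature, rainfall, humidity, and predicted yield.
sample = np.array([[25.0, 76.0, 99.0, 72.2],
                   [39.0, 65.0, 70.0, 59.7]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_results.txt")
    # Two decimals, comma-separated, with a plain (uncommented) header line.
    np.savetxt(path, sample, fmt="%.2f", delimiter=",",
               header="temperature,rainfall,humidity,yield_apples", comments="")
    # skip_header=1 drops the header line when loading.
    loaded = np.genfromtxt(path, delimiter=",", skip_header=1)

print(loaded.shape)  # (2, 4)
```

Because `fmt='%.2f'` rounds to two decimals, values with more precision would not survive the round trip exactly.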

In [162]:
arr1 = np.array([[1,2,3,4],[5,6,7,8],[9,1,2,3]])
arr2 = np.array([[2,4,3,1],[3,5,2,3],[1,5,3,4]])

In [163]:
arr = np.array([[1,2,3],[2,3,4],[3,4,5]])

In [164]:
arr + 3

array([[4, 5, 6],
       [5, 6, 7],
       [6, 7, 8]])

In [165]:
arr1 + arr2

array([[ 3,  6,  6,  5],
       [ 8, 11,  9, 11],
       [10,  6,  5,  7]])

In [166]:
arr1 == arr2

array([[False, False,  True, False],
       [False, False, False, False],
       [False, False, False, False]])

In [167]:
arr1 != arr2

array([[ True,  True, False,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]])

In [168]:
arr1>= arr2

array([[False, False,  True,  True],
       [ True,  True,  True,  True],
       [ True, False, False, False]])

In [169]:
(arr1 == arr2).sum()

1

## Array indexing and slicing

Numpy extends Python's list indexing notation using `[]` to multiple dimensions in an intuitive fashion. You can provide a comma-separated list of indices or ranges to select a specific element or a subarray (also called a slice) from a Numpy array.

In [170]:
arr3 = np.array([
    [[11, 12, 13, 14], 
     [13, 14, 15, 19]], 
    
    [[15, 16, 17, 21], 
     [63, 92, 36, 18]], 
    
    [[98, 32, 81, 23],      
     [17, 18, 19.5, 43]]])

In [171]:
arr3

array([[[11. , 12. , 13. , 14. ],
        [13. , 14. , 15. , 19. ]],

       [[15. , 16. , 17. , 21. ],
        [63. , 92. , 36. , 18. ]],

       [[98. , 32. , 81. , 23. ],
        [17. , 18. , 19.5, 43. ]]])

In [172]:
arr3.shape

(3, 2, 4)

In [173]:
arr3[0,0,1]

12.0

In [174]:
arr3[0,0,0]

11.0

In [175]:
# Subarray using ranges
arr3[1:, 0:1, :2]

array([[[15., 16.]],

       [[98., 32.]]])
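Indices and ranges can be mixed in the same expression. As a rough illustration on a small array built with `np.arange` (the array `cube` is made up for this example), a plain index removes that dimension from the result, while a range keeps it:

```python
import numpy as np

# 3 blocks x 2 rows x 4 columns, filled with 0..23.
cube = np.arange(24).reshape(3, 2, 4)

row = cube[1, 0]         # two indices -> 1-D array of 4 columns
kept = cube[1:2, 0:1]    # two ranges -> still 3-D
column = cube[:, :, -1]  # last column of every row in every block

print(row)           # [ 8  9 10 11]
print(kept.shape)    # (1, 1, 4)
print(column.shape)  # (3, 2)
```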

## Other ways of creating Numpy arrays

Numpy also provides some handy functions to create arrays of desired shapes with fixed or random values. Check out the [official documentation](https://numpy.org/doc/stable/reference/routines.array-creation.html) or use the `help` function to learn more.

In [176]:
# All zeros
np.zeros((3, 2))

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

In [177]:
# All ones
np.ones([2, 2, 3])

array([[[1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.]]])

In [178]:
# Identity matrix
np.eye(4)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [179]:
# Random matrix with values drawn uniformly from [0, 1)
np.random.rand(3,3)

array([[0.05788967, 0.72673436, 0.88149628],
       [0.48298257, 0.55840493, 0.56931984],
       [0.41832103, 0.20842273, 0.70369309]])

In [180]:
# Fixed value
np.full([2, 3], 20)

array([[20, 20, 20],
       [20, 20, 20]])

In [181]:
# Range with start, end and step
np.arange(10, 90, 2)

array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42,
       44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76,
       78, 80, 82, 84, 86, 88])

In [182]:
# Equally spaced numbers in a range
np.linspace(3, 27, 3)

array([ 3., 15., 27.])
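`np.arange` and `np.linspace` differ in what the third argument means: a step size for `arange` (with the end value excluded) versus a number of points for `linspace` (with the end value included by default). A small side-by-side sketch:

```python
import numpy as np

stepped = np.arange(3, 27, 12)  # step of 12, stops before 27
spaced = np.linspace(3, 27, 3)  # exactly 3 points, 27 included

print(stepped)  # [ 3 15]
print(spaced)   # [ 3. 15. 27.]
```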

## Questions for Revision

Try answering the following questions to test your understanding of the topics covered in this notebook:

1. What is a vector?
2. How do you represent vectors using a Python list? Give an example.
3. What is a dot product of two vectors?
4. Write a function to compute the dot product of two vectors.
5. What is Numpy?
6. How do you install Numpy?
7. How do you import the `numpy` module?
8. What does it mean to import a module with an alias? Give an example.
9. What is the commonly used alias for `numpy`?
10. What is a Numpy array?
11. How do you create a Numpy array? Give an example.
12. What is the type of Numpy arrays?
13. How do you access the elements of a Numpy array?
14. How do you compute the dot product of two vectors using Numpy?
15. What happens if you try to compute the dot product of two vectors which have different sizes?
16. How do you compute the element-wise product of two Numpy arrays?
17. How do you compute the sum of all the elements in a Numpy array?
18. What are the benefits of using Numpy arrays over Python lists for operating on numerical data?
19. Why do Numpy array operations have better performance compared to Python functions and loops?
20. Illustrate the performance difference between Numpy array operations and Python loops using an example.
21. What are multi-dimensional Numpy arrays? 
22. Illustrate the creation of Numpy arrays with 2, 3, and 4 dimensions.
23. How do you inspect the number of dimensions and the length along each dimension in a Numpy array?
24. Can the elements of a Numpy array have different data types?
25. How do you check the data type of the elements of a Numpy array?
26. What is the data type of a Numpy array?
27. What is the difference between a matrix and a 2D Numpy array?
28. How do you perform matrix multiplication using Numpy?
29. What is the `@` operator used for in Numpy?
30. What is the CSV file format?
31. How do you read data from a CSV file using Numpy?
32. How do you concatenate two Numpy arrays?
33. What is the purpose of the `axis` argument of `np.concatenate`?
34. When are two Numpy arrays compatible for concatenation?
35. Give an example of two Numpy arrays that can be concatenated.
36. Give an example of two Numpy arrays that cannot be concatenated.
37. What is the purpose of the `np.reshape` function?
38. What does it mean to “reshape” a Numpy array?
39. How do you write a Numpy array into a CSV file?
40. Give some examples of Numpy functions for performing mathematical operations.
41. Give some examples of Numpy functions for performing array manipulation.
42. Give some examples of Numpy functions for performing linear algebra.
43. Give some examples of Numpy functions for performing statistical operations.
44. How do you find the right Numpy function for a specific operation or use case?
45. Where can you see a list of all the Numpy array functions and operations?
46. What are the arithmetic operators supported by Numpy arrays? Illustrate with examples.
47. What is array broadcasting? How is it useful? Illustrate with an example.
48. Give some examples of arrays that are compatible for broadcasting.
49. Give some examples of arrays that are not compatible for broadcasting.
50. What are the comparison operators supported by Numpy arrays? Illustrate with examples.
51. How do you access a specific subarray or slice from a Numpy array?
52. Illustrate array indexing and slicing in multi-dimensional Numpy arrays with some examples.
53. How do you create a Numpy array with a given shape containing all zeros?
54. How do you create a Numpy array with a given shape containing all ones?
55. How do you create an identity matrix of a given shape?
56. How do you create a random vector of a given length?
57. How do you create a Numpy array with a given shape with a fixed value for each element?
58. How do you create a Numpy array with a given shape containing randomly initialized elements?
59. What is the difference between `np.random.rand` and `np.random.randn`? Illustrate with examples.
60. What is the difference between `np.arange` and `np.linspace`? Illustrate with examples.


Ans 1. <br>
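In mathematics, a vector is a quantity that has both magnitude and direction, often used to describe the position of one point in space relative to another. In this notebook, a vector is simply an ordered list of numbers, such as the climate measurements for a region.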

Ans 2. <br>
A vector can be represented as a plain Python list of numbers, e.g., `[2, 3, 4]`. To use Numpy operations on it, convert the list to a Numpy array using the `np.array()` function.
```python
vector = np.array([2,3,4])
```

In [183]:
vector = np.array([2,3,4])

In [184]:
vector

array([2, 3, 4])

In [185]:
# Ans. 3 

ar1 = np.array([2,3])
ar2 = np.array([3,5])

In [187]:
def Dot(arr1, arr2):
    result = 0
    for x, w in zip(arr1, arr2):  # zip pairs up corresponding elements
        result += x*w
    return result 

Dot(ar1,ar2)

21