# Working with numerical data

The "data" in *Data Analysis* typically refers to numerical data, e.g., stock prices, sales figures, sensor measurements, sports scores, database tables, etc. The [Numpy](https://numpy.org) library provides specialized data structures, functions, and other tools for numerical computing in Python. Let's work through an example to see why & how to use Numpy for working with numerical data.


> Suppose we want to use climate data like the temperature, rainfall, and humidity to determine if a region is well suited for growing apples. A simple approach would be to express the relationship between the annual yield of apples (tons per hectare) and climatic conditions such as the average temperature (in degrees Fahrenheit), rainfall (in millimeters) & average relative humidity (as a percentage) as a linear equation.
>
> `yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity`

We're expressing the yield of apples as a weighted sum of the temperature, rainfall, and humidity. This equation is an approximation since the actual relationship may not necessarily be linear, and there may be other factors involved. But a simple linear model like this often works well in practice.

Based on some statistical analysis of historical data, we might come up with reasonable values for the weights `w1`, `w2`, and `w3`. Here's an example set of values:

In [7]:
w1, w2, w3 = 0.3, 0.2, 0.5

Given some climate data for a region, we can now predict the yield of apples. Here's some sample data:

<img src="https://i.imgur.com/TXPBiqv.png" style="width:360px;">

To begin, we can define some variables to record climate data for a region.

In [8]:
kantoTemp = 73
kantoRain = 67
kantoHum = 43

In [9]:
kantoYield = kantoTemp*w1 + kantoRain*w2 + kantoHum*w3
kantoYield

56.8

In [10]:
print(f"The expected yield of apples in Kanto is {kantoYield} tons per hectare.")

The expected yield of apples in Kanto is 56.8 tons per hectare.


To make it slightly easier to perform the above computation for multiple regions, we can represent the climate data for each region as a vector, i.e., a list of numbers.

In [11]:
kanto = [73, 67, 43]
johto = [91, 88, 64]
hoenn = [87, 134, 58]
sinnoh = [102, 43, 37]
unova = [69, 96, 70]

In [12]:
weights = [w1, w2, w3]

In [13]:
for item in zip(kanto, weights):
    print(item)

(73, 0.3)
(67, 0.2)
(43, 0.5)


In [14]:
def CropYield(region, weights):
    result = 0
    for x, w in zip(region, weights):  # zip pairs each measurement with its weight
        result += x * w
    return result

In [15]:
CropYield(kanto, weights)

56.8

In [16]:
CropYield(unova, weights)

74.9
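The loop inside `CropYield` can also be written as a single expression. The sketch below is one possible variant, not part of the original notebook: the name `weighted_sum` and the length check are illustrative assumptions, added because `zip` silently ignores unmatched items when the two lists differ in length.

```python
# Hypothetical helper; computes the same weighted sum as the explicit loop.
def weighted_sum(values, weights):
    """Return the sum of values[i] * weights[i] over all positions."""
    if len(values) != len(weights):
        # zip would silently drop the extra items, so fail loudly instead.
        raise ValueError("values and weights must have the same length")
    return sum(x * w for x, w in zip(values, weights))

kanto = [73, 67, 43]
weights = [0.3, 0.2, 0.5]
kanto_yield = weighted_sum(kanto, weights)
print(round(kanto_yield, 1))  # 56.8
```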

# Going from Python lists to Numpy arrays


The calculation performed by `CropYield` (element-wise multiplication of two vectors followed by a sum of the results) is called the *dot product*. Learn more about the dot product here: [Khan Academy](https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/dot-cross-products/v/vector-dot-product-and-vector-length).
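Written out as a formula, with $x$ standing for a region's measurements and $w$ for the weights (symbols chosen here for illustration), the dot product of two vectors of length $n$ is:

$$
x \cdot w = \sum_{i=1}^{n} x_i w_i = x_1 w_1 + x_2 w_2 + \dots + x_n w_n
$$

For Kanto this gives $73 \times 0.3 + 67 \times 0.2 + 43 \times 0.5 = 21.9 + 13.4 + 21.5 = 56.8$, matching the value computed earlier.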

The Numpy library provides a built-in function to compute the dot product of two vectors. However, we must first convert the lists into Numpy arrays.

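Numpy does not ship with Python, so install it first if needed using the `pip` package manager (`pip install numpy`). Then import it, conventionally under the alias `np`.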

In [17]:
import numpy as np

In [18]:
kanto = np.array([73,67,43])

In [19]:
type(kanto)

numpy.ndarray

In [20]:
weights = np.array([0.3,0.2,0.5])

In [21]:
type(weights)

numpy.ndarray
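Unlike a Python list, every element of a Numpy array shares one data type, which `np.array` infers from the values it is given. The sketch below illustrates that inference; the variable names are illustrative and not from the original notebook.

```python
import numpy as np

ints = np.array([73, 67, 43])        # all integers -> an integer dtype
floats = np.array([0.3, 0.2, 0.5])   # all floats -> float64
mixed = np.array([73, 0.2, 43])      # one float promotes every element to float

print(ints.dtype.kind)   # 'i' (signed integer; the exact width depends on the platform)
print(floats.dtype)      # float64
print(mixed.dtype)       # float64
```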

In [22]:
kantoYield = np.dot(kanto, weights)  # avoid overwriting the CropYield function
kantoYield

56.8

We can achieve the same result with lower-level operations supported by Numpy arrays: performing an element-wise multiplication and then summing the resulting numbers.

In [23]:
(kanto*weights).sum()

56.8
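To see what the one-liner does, it may help to split it into its two steps. This is a small sketch with illustrative variable names (`products`, `total`):

```python
import numpy as np

kanto = np.array([73, 67, 43])
weights = np.array([0.3, 0.2, 0.5])

# Step 1: multiply matching elements; the result is a new array of the same shape.
products = kanto * weights

# Step 2: add up the per-measurement contributions.
total = products.sum()

print(products.shape)                              # (3,)
print(np.isclose(total, np.dot(kanto, weights)))   # True
```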

# Benefits of using Numpy arrays

Numpy arrays offer the following benefits over Python lists for operating on numerical data:

- **Ease of use**: You can write small, concise, and intuitive mathematical expressions like `(kanto * weights).sum()` rather than using loops & custom functions like `CropYield`.
- **Performance**: Numpy operations and functions are implemented internally in compiled C code, which makes them much faster than equivalent Python statements & loops that are interpreted at runtime.

Here's a comparison of dot products performed using Python loops vs. Numpy arrays on two vectors with a million elements each.

In [24]:
# Python lists
arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))

# Numpy arrays (64-bit integers, so the large dot product below cannot overflow)
arr1_np = np.array(arr1, dtype=np.int64)
arr2_np = np.array(arr2, dtype=np.int64)

In [25]:
%%time
result = 0
for x1, x2 in zip(arr1, arr2):
    result += x1*x2
result

Wall time: 161 ms


833332333333500000

In [26]:
%%time
print(np.dot(arr1_np,arr2_np))

833332333333500000
Wall time: 992 µs
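Integer dot products this large are sensitive to the array's data type. Python integers grow as needed, but Numpy integers have a fixed width, and on platforms where Numpy defaults to 32-bit integers (an assumption about the machine; older Numpy versions on Windows are one such case) the total silently wraps around. The sketch below forces both widths explicitly to show the difference:

```python
import numpy as np

n = 1_000_000
# Reference value computed with Python integers, which never overflow.
exact = sum(x * y for x, y in zip(range(n), range(n, 2 * n)))

# 32-bit integers: the total exceeds the int32 range and wraps around.
a32 = np.arange(n, dtype=np.int32)
b32 = np.arange(n, 2 * n, dtype=np.int32)
wrapped = np.dot(a32, b32)

# 64-bit integers: the total fits, so the result matches the reference.
a64 = a32.astype(np.int64)
b64 = b32.astype(np.int64)
correct = np.dot(a64, b64)

print(int(wrapped) == exact)   # False
print(int(correct) == exact)   # True
```

Passing `dtype=np.int64` when creating the arrays is one way to keep the result independent of the platform default.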


# Multi-dimensional Numpy arrays

We can now go one step further and represent the climate data for all the regions using a single 2-dimensional Numpy array.

In [27]:
climateData = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])

In [28]:
climateData[1,0]

91

In [29]:
climateData

array([[ 73,  67,  43],
       [ 91,  88,  64],
       [ 87, 134,  58],
       [102,  43,  37],
       [ 69,  96,  70]])

In [30]:
# 2D array (matrix)
climateData.shape

(5, 3)
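A 2-dimensional array is indexed as `[row, column]`, and a `:` selects everything along that axis. Here is a brief sketch, assuming (as in the yield equation) that the columns hold temperature, rainfall, and humidity in that order; the variable names are illustrative.

```python
import numpy as np

climate = np.array([[73, 67, 43],
                    [91, 88, 64],
                    [87, 134, 58],
                    [102, 43, 37],
                    [69, 96, 70]])

johto = climate[1]             # second row: all measurements for one region
temperatures = climate[:, 0]   # first column: one measurement for every region
n_regions, n_features = climate.shape

print(johto)
print(temperatures)
print(climate.ndim, n_regions, n_features)
```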

In [31]:
weights.dtype

dtype('float64')

In [32]:
np.matmul(climateData,weights)

array([56.8, 76.9, 81.9, 57.7, 74.9])

In [33]:
climateData @ weights

array([56.8, 76.9, 81.9, 57.7, 74.9])
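Multiplying a `(5, 3)` matrix by a vector of length 3 computes one dot product per row, which is why the result has five entries, one predicted yield per region. The following sketch checks that equivalence explicitly (variable names are illustrative):

```python
import numpy as np

climate = np.array([[73, 67, 43],
                    [91, 88, 64],
                    [87, 134, 58],
                    [102, 43, 37],
                    [69, 96, 70]])
weights = np.array([0.3, 0.2, 0.5])

# One dot product per row, written out explicitly.
row_by_row = np.array([np.dot(row, weights) for row in climate])

# The same computation as a single matrix-vector product.
all_at_once = climate @ weights

print(np.allclose(row_by_row, all_at_once))   # True
print(all_at_once.shape)                      # (5,)
```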

In [34]:
import urllib.request

In [35]:
urllib.request.urlretrieve('https://hub.jovian.ml/wp-content/uploads/2020/08/climate.csv', 
    'climate.txt')

('climate.txt', <http.client.HTTPMessage at 0x209ae4b7640>)

In [40]:
climateData = np.genfromtxt('climate.txt', delimiter=',', skip_header=1)  # skip the one-line header

In [41]:
climateData

array([[25., 76., 99.],
       [39., 65., 70.],
       [59., 45., 77.],
       ...,
       [99., 62., 58.],
       [70., 71., 91.],
       [92., 39., 76.]])

In [42]:
climateData.shape

(10000, 3)
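`np.genfromtxt` parses comma-separated text into a floating-point array, skipping as many leading lines as `skip_header` specifies. To try it without downloading anything, the sketch below reads from an in-memory string. The sample text, including its header names and number formatting, is a hypothetical stand-in for the real file.

```python
import io
import numpy as np

# Hypothetical three-row sample; the real file's exact layout is assumed.
sample_csv = """temperature,rainfall,humidity
25.00,76.00,99.00
39.00,65.00,70.00
59.00,45.00,77.00
"""

# genfromtxt accepts file-like objects as well as file names.
data = np.genfromtxt(io.StringIO(sample_csv), delimiter=',', skip_header=1)

print(data.shape)   # (3, 3)
print(data.dtype)   # float64
```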

In [43]:
CropYield = climateData @ weights

In [44]:
CropYield

array([72.2, 59.7, 65.2, ..., 71.1, 80.7, 73.4])

In [46]:
CropYield.shape

(10000,)

In [48]:
ClimateRes = np.concatenate((climateData, CropYield.reshape(-1, 1)), axis=1)  # -1 infers the row count

In [49]:
ClimateRes

array([[25. , 76. , 99. , 72.2],
       [39. , 65. , 70. , 59.7],
       [59. , 45. , 77. , 65.2],
       ...,
       [99. , 62. , 58. , 71.1],
       [70. , 71. , 91. , 80.7],
       [92. , 39. , 76. , 73.4]])
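`np.concatenate` requires both arrays to have the same number of dimensions, so the 1-dimensional yields are first reshaped into a single column. Below is a small sketch on three rows, also showing `np.column_stack` as an alternative that performs the conversion itself. Variable names are illustrative.

```python
import numpy as np

climate = np.array([[25., 76., 99.],
                    [39., 65., 70.],
                    [59., 45., 77.]])
weights = np.array([0.3, 0.2, 0.5])
yields = climate @ weights             # shape (3,)

# reshape(-1, 1) lets Numpy infer the row count, turning (3,) into (3, 1).
as_column = yields.reshape(-1, 1)
combined = np.concatenate((climate, as_column), axis=1)

# column_stack treats a 1-D array as a column automatically.
stacked = np.column_stack((climate, yields))

print(combined.shape)                      # (3, 4)
print(np.array_equal(combined, stacked))   # True
```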

In [51]:
np.savetxt('climate_results.txt', 
           ClimateRes, 
           fmt='%.2f', 
           delimiter=',',
           header='temperature,rainfall,humidity,yield_apples', 
           comments='')
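Passing `comments=''` writes the header as a plain first line rather than prefixing it with `#`. One way to confirm the file is written as intended is to read it back. The sketch below does this on a two-row sample in a temporary directory; the file name and sample values are illustrative.

```python
import os
import tempfile
import numpy as np

results = np.array([[25., 76., 99., 72.2],
                    [39., 65., 70., 59.7]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'results_sample.txt')  # hypothetical file name
    np.savetxt(path, results, fmt='%.2f', delimiter=',',
               header='temperature,rainfall,humidity,yield_apples', comments='')

    with open(path) as f:
        first_line = f.readline().strip()

    # skiprows=1 skips the header line, which has no '#' prefix to mark it.
    reloaded = np.loadtxt(path, delimiter=',', skiprows=1)

print(first_line)
print(np.allclose(reloaded, results))   # True
```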