
* Working with numerical data in Python
* Going from Python lists to Numpy arrays
* Multi-dimensional Numpy arrays and their benefits
* Array operations, broadcasting, indexing, and slicing
* Working with CSV data files using Numpy


#Working with numerical data
The "data" in Data Analysis typically refers to numerical data, e.g., stock prices, sales figures, sensor measurements, sports scores, database tables, etc. The Numpy library provides specialized data structures, functions, and other tools for numerical computing in Python. Let's work through an example to see why & how to use Numpy for working with numerical data.
> Suppose we want to use climate data like the temperature, rainfall, and humidity to determine if a region is well suited for growing apples. A simple approach for doing this would be to formulate the relationship between the annual yield of apples (tons per hectare) and the climatic conditions like the average temperature (in degrees Fahrenheit), rainfall (in millimeters) & average relative humidity (in percentage) as a linear equation.

`yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity`

We're expressing the yield of apples as a weighted sum of the temperature, rainfall and humidity. This equation is an approximation since the actual relationship may not necessarily be linear, and there may be other factors involved. But a simple linear model like this often works well in practice.

Based on some statistical analysis of historical data, we might come up with reasonable values for the weights `w1`, `w2` and `w3`. Here's an example set of values:

In [2]:
w1, w2, w3 = 0.3, 0.2, 0.5

Given some climate data for a region, we can now predict the yield of apples. To begin, let's define some variables to record sample climate data for a region.

In [3]:
kanto_temp = 73
kanto_rainfall = 67
kanto_humidity = 43

We can now substitute these variables into the linear equation to predict the yield of apples.

In [4]:
kanto_yield_apples = kanto_temp * w1 + kanto_rainfall * w2 + kanto_humidity * w3

In [5]:
kanto_yield_apples

56.8

In [6]:
print("The expected yield of apples in Kanto region is {} tons per hectare.".format(kanto_yield_apples))

The expected yield of apples in Kanto region is 56.8 tons per hectare.


To make it slightly easier to perform the above computation for multiple regions, we can represent the climate data for each region as a vector, i.e., a list of numbers.

In [7]:
kanto = [73, 67, 43]
johto = [91, 88, 64]
hoenn = [87, 134, 58]
sinnoh = [102, 43, 37]
unova = [69, 96, 70]

The three numbers in each vector represent the temperature, rainfall, and humidity data, respectively. We can also represent the set of weights used in the formula as a vector.

In [8]:
weights = [w1, w2, w3]

We can now write a function `crop_yield` to calculate the yield of apples (or any other crop) given the climate data and the respective weights.

In [9]:
def crop_yield(region, weights):
  result = 0
  for x, w in zip(region, weights):
    result += x * w
  return result
  

In [10]:
crop_yield(kanto, weights)

56.8

In [11]:
crop_yield(johto, weights)

76.9

In [12]:
crop_yield(unova, weights)

74.9

#Going from Python lists to Numpy arrays
The calculation performed by the `crop_yield` function (element-wise multiplication of two vectors and taking a sum of the results) is also called the dot product.

The Numpy library provides a built-in function to compute the dot product of two vectors. However, we must first convert lists into Numpy arrays.

In [13]:
import numpy as np

We can now use the `np.array` function to create Numpy arrays.

In [14]:
a = [1, 2, 3]
print(type(a))

<class 'list'>


In [15]:
b = np.array(a)
print(type(b))

<class 'numpy.ndarray'>


In [16]:
# Note: these sample values differ from the earlier `kanto` list ([73, 67, 43])
kanto = np.array([73, 65, 34])

In [17]:
kanto

array([73, 65, 34])

In [18]:
weights = np.array([w1, w2, w3])

In [19]:
weights

array([0.3, 0.2, 0.5])

Numpy arrays have the type `ndarray`.

In [20]:
type(kanto)

numpy.ndarray

In [21]:
type(weights)

numpy.ndarray

Just like lists, Numpy arrays support the indexing notation `[]`.

In [22]:
weights[0]

0.3

In [23]:
kanto[2]

34

#Operating on Numpy arrays
We can now compute the dot product of the two vectors using the `np.dot` function.

In [24]:
np.dot(kanto, weights)

51.9

We can achieve the same result with low-level operations supported by Numpy arrays: performing an element-wise multiplication and calculating the sum of the resulting numbers.

In [25]:
(kanto * weights).sum()

51.9

The `*` operator performs an element-wise multiplication of two arrays if they have the same size. The `sum` method calculates the sum of numbers in an array.

In [26]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

In [27]:
arr1 * arr2

array([ 4, 10, 18])

In [28]:
arr2.sum()

15
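
For contrast, plain Python lists give `*` a different meaning, which is one reason the conversion to arrays matters. The short sketch below (the variable names are only for illustration) places list repetition next to element-wise array multiplication.

```python
import numpy as np

nums = [1, 2, 3]

# On a list, `*` with an integer repeats the list
repeated = nums * 2            # [1, 2, 3, 1, 2, 3]

# Multiplying two lists is not defined and raises a TypeError
try:
    nums * [4, 5, 6]
    list_product_ok = True
except TypeError:
    list_product_ok = False

# On Numpy arrays, `*` multiplies element by element
elementwise = np.array(nums) * np.array([4, 5, 6])   # array([ 4, 10, 18])
total = elementwise.sum()                            # 32

print(repeated, list_product_ok, elementwise, total)
```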

#Benefits of using Numpy arrays
Numpy arrays offer the following benefits over Python lists for operating on numerical data:
- **Ease of use**: You can write small, concise, and intuitive mathematical expressions like `(kanto * weights).sum()` rather than using loops & custom functions like `crop_yield`.

- **Performance**: Numpy operations and functions are implemented internally in compiled C code, which makes them much faster than using Python statements & loops that are interpreted at runtime.

Here's a comparison of dot products performed using Python loops vs. Numpy arrays on two vectors with a million elements each.

In [29]:
# Python lists
arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))

# Numpy arrays
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)

In [30]:
%%time
result = 0
for x1, x2 in zip(arr1, arr2):
  result += x1*x2
result

CPU times: user 370 ms, sys: 0 ns, total: 370 ms
Wall time: 467 ms


833332333333500000

In [31]:
%%time
np.dot(arr1_np, arr2_np)

CPU times: user 4.35 ms, sys: 0 ns, total: 4.35 ms
Wall time: 11.1 ms


833332333333500000

As you can see, using `np.dot` is far faster than using a `for` loop: about 40 times faster by wall time and about 85 times faster by CPU time in this run. This makes Numpy especially useful while working with really large datasets with tens of thousands or millions of data points.
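
The `%%time` magic only works inside a notebook. As a rough sketch of how the same comparison could be repeated in a plain Python script, the standard `timeit` module can be used. The helper `loop_dot`, the variable names, and the smaller size `n` are assumptions made for this illustration, and the measured times will vary from machine to machine.

```python
import timeit
import numpy as np

n = 100_000  # smaller than the million-element example so it runs quickly

# Python lists
list1 = list(range(n))
list2 = list(range(n, 2 * n))

# Numpy arrays; int64 is requested explicitly so the sum cannot overflow
# on platforms where the default integer type is 32-bit
arr1_np = np.array(list1, dtype=np.int64)
arr2_np = np.array(list2, dtype=np.int64)

def loop_dot(xs, ys):
    """Dot product using a plain Python loop."""
    result = 0
    for x, y in zip(xs, ys):
        result += x * y
    return result

# Run each version several times and keep the best time per call
loop_time = min(timeit.repeat(lambda: loop_dot(list1, list2), number=5, repeat=3)) / 5
numpy_time = min(timeit.repeat(lambda: np.dot(arr1_np, arr2_np), number=5, repeat=3)) / 5

print(f"loop:  {loop_time * 1000:.3f} ms per call")
print(f"numpy: {numpy_time * 1000:.3f} ms per call")
print(f"speedup: {loop_time / numpy_time:.0f}x")
```

Taking the minimum over several repeats reduces the effect of background activity on the measurement.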

#Multi-dimensional Numpy arrays
We can now go one step further and represent the climate data for all the regions using a single 2-dimensional Numpy array.

In [32]:
climate_data = np.array([[73, 67, 43],
                        [91, 88, 64],
                        [87, 134, 58],
                        [102, 43, 37],
                        [69, 96, 70]])

In [33]:
climate_data

array([[ 73,  67,  43],
       [ 91,  88,  64],
       [ 87, 134,  58],
       [102,  43,  37],
       [ 69,  96,  70]])

If you've taken a linear algebra class in high school, you may recognize the above 2-d array as a matrix with five rows and three columns. Each row represents one region, and the columns represent temperature, rainfall, and humidity, respectively.

Numpy arrays can have any number of dimensions and different lengths along each dimension. We can inspect the length along each dimension using the `.shape` property of an array.

In [34]:
# 2D array (matrix)
climate_data.shape

(5, 3)

In [35]:
weights

array([0.3, 0.2, 0.5])

In [36]:
#3D array
arr3 = np.array([
    [[11, 12, 13],
     [13, 14, 15]],
    [[15, 16, 17],
     [17, 18, 19.5]]])

In [37]:
arr3.shape

(2, 2, 3)
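
Alongside `.shape`, arrays expose a few related properties. The sketch below reuses the regional climate values from above to show how `.ndim`, `.size`, and `len` relate to the shape.

```python
import numpy as np

climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])

print(climate_data.shape)   # (5, 3): five rows, three columns
print(climate_data.ndim)    # 2: number of dimensions
print(climate_data.size)    # 15: total number of elements
print(len(climate_data))    # 5: length along the first dimension only
```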

All the elements in a numpy array have the same data type. You can check the data type of an array using the `.dtype` property.

In [38]:
weights.dtype

dtype('float64')

In [39]:
climate_data.dtype

dtype('int64')

If an array contains even a single floating point number, all the other elements are also converted to floats.

In [40]:
arr3.dtype

dtype('float64')
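
A small sketch of this conversion rule, using made-up arrays `ints` and `mixed`. The exact integer type can depend on the platform and Numpy version, so the code does not assume `int64`.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2, 3.5])    # one float is enough to convert everything

print(ints.dtype)    # an integer type, e.g. int64 (platform dependent)
print(mixed.dtype)   # float64
print(mixed)         # [1.  2.  3.5]

# An explicit conversion is also possible with `astype`
as_float = ints.astype(np.float64)
print(as_float.dtype)   # float64
```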

We can now compute the predicted yields of apples in all the regions, using a single matrix multiplication between `climate_data` (a 5x3 matrix) and `weights` (a vector of length 3).

We can use the `np.matmul` function or the `@` operator to perform matrix multiplication.

In [41]:
np.matmul(climate_data, weights)

array([56.8, 76.9, 81.9, 57.7, 74.9])

In [42]:
climate_data @ weights

array([56.8, 76.9, 81.9, 57.7, 74.9])
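
To see what the matrix multiplication is doing, the sketch below computes one dot product per row and compares the result with `@`. The name `row_by_row` is only for illustration.

```python
import numpy as np

climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])
weights = np.array([0.3, 0.2, 0.5])

# One dot product per region (row), collected into an array
row_by_row = np.array([np.dot(row, weights) for row in climate_data])

# The same computation as a single matrix-vector product
all_at_once = climate_data @ weights

print(row_by_row)
print(all_at_once)
print(np.allclose(row_by_row, all_at_once))   # True
```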

#Working with CSV data files
Numpy also provides helper functions for reading from & writing to files. Let's download a file `climate.txt`, which contains 10,000 climate measurements (temperature, rainfall & humidity), stored one record per line with the values separated by commas.

This format of storing data is known as comma-separated values or CSV.
>**CSVs**: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields.

To read this file into a numpy array, we can use the `genfromtxt` function.

In [43]:
import urllib.request

urllib.request.urlretrieve(
    'https://gist.github.com/BirajCoder/a4ffcb76fd6fb221d76ac2ee2b8584e9/raw/4054f90adfd361b7aa4255e99c2e874664094cea/climate.csv', 
    'climate.txt')

('climate.txt', <http.client.HTTPMessage at 0x7f30fb073b10>)

In [44]:
climate_data = np.genfromtxt('climate.txt', delimiter=',', skip_header=1)

In [45]:
climate_data

array([[25., 76., 99.],
       [39., 65., 70.],
       [59., 45., 77.],
       ...,
       [99., 62., 58.],
       [70., 71., 91.],
       [92., 39., 76.]])

In [46]:
climate_data.shape

(10000, 3)
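
If the download is not available, the behavior of `genfromtxt` can still be tried on a few lines of in-memory text. In the sketch below, the contents of `csv_text` (including the header names) are an assumption made for illustration, not the actual contents of the downloaded file.

```python
import io
import numpy as np

# Hypothetical file contents: a header line followed by three records
csv_text = """temperature,rainfall,humidity
25.0,76.0,99.0
39.0,65.0,70.0
59.0,45.0,77.0
"""

# `genfromtxt` accepts a file-like object as well as a file name
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)

print(sample.shape)   # (3, 3)
print(sample.dtype)   # float64
```

`skip_header=1` drops the line of column names, so every remaining field can be parsed as a number.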

We can now perform a matrix multiplication using the `@` operator to predict the yield of apples for the entire dataset using a given set of weights.

In [47]:
weights = np.array([0.3, 0.2, 0.5])

In [48]:
yields = climate_data @ weights

In [49]:
yields

array([72.2, 59.7, 65.2, ..., 71.1, 80.7, 73.4])

In [50]:
yields.shape

(10000,)

Let's add the `yields` to `climate_data` as a fourth column using the `np.concatenate` function.

In [51]:
climate_results = np.concatenate((climate_data, yields.reshape(10000, 1)), axis=1)

In [52]:
climate_results

array([[25. , 76. , 99. , 72.2],
       [39. , 65. , 70. , 59.7],
       [59. , 45. , 77. , 65.2],
       ...,
       [99. , 62. , 58. , 71.1],
       [70. , 71. , 91. , 80.7],
       [92. , 39. , 76. , 73.4]])
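
The reshape above hard-codes the number of rows. As a sketch of a variant that does not depend on the dataset length, `-1` can be passed to `reshape`, or `np.column_stack` can be used instead. The three-row `data` array here is a small stand-in for the full dataset.

```python
import numpy as np

data = np.array([[25., 76., 99.],
                 [39., 65., 70.],
                 [59., 45., 77.]])
yields = data @ np.array([0.3, 0.2, 0.5])   # shape (3,)

# -1 asks Numpy to infer the number of rows, so the code does not
# depend on a hard-coded length such as 10000
column = yields.reshape(-1, 1)               # shape (3, 1)
results = np.concatenate((data, column), axis=1)

# np.column_stack gives the same result without the explicit reshape
results_alt = np.column_stack((data, yields))

print(results.shape)                          # (3, 4)
print(np.array_equal(results, results_alt))   # True
```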

In [53]:
np.savetxt('climate_results.txt',
           climate_results,
           fmt='%.2f',
           delimiter=',',
           header='temperature,rainfall,humidity,yield_apples',
           comments='')

The results are written back in the CSV format to the file `climate_results.txt`.
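
One way to check the written file is to read it back. The sketch below does a small round trip in a temporary directory; the file name `results_sample.txt` and the two-row `results` array are made up for this example.

```python
import os
import tempfile
import numpy as np

results = np.array([[25., 76., 99., 72.2],
                    [39., 65., 70., 59.7]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'results_sample.txt')   # hypothetical file name

    # comments='' stops Numpy from prefixing the header line with '# '
    np.savetxt(path, results, fmt='%.2f', delimiter=',',
               header='temperature,rainfall,humidity,yield_apples',
               comments='')

    # Look at the first line of raw text that was written
    with open(path) as f:
        first_line = f.readline().strip()

    # Read the numbers back, skipping the header line
    loaded = np.genfromtxt(path, delimiter=',', skip_header=1)

print(first_line)   # temperature,rainfall,humidity,yield_apples
print(loaded)
```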

Numpy provides hundreds of functions for performing operations on arrays. Here are some commonly used functions:
* Mathematics: `np.sum`, `np.exp`, `np.round`, arithmetic operators
* Array manipulation: `np.reshape`, `np.stack`, `np.concatenate`, `np.split`
* Linear Algebra: `np.matmul`, `np.dot`, `np.transpose`, `np.linalg.eigvals`
* Statistics: `np.mean`, `np.median`, `np.std`, `np.max`
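
A brief sketch that exercises a few of these functions on a small, made-up matrix `m`. The values were chosen so the results are easy to check by hand.

```python
import numpy as np

m = np.array([[2.0, 0.0],
              [0.0, 3.0]])

print(np.sum(m))                 # 5.0
print(np.round(np.exp(1.0), 3))  # 2.718
print(np.transpose(m).shape)     # (2, 2)
print(np.mean(m), np.max(m))     # 1.25 3.0

# Eigenvalue routines live in the `np.linalg` submodule
eigenvalues = np.linalg.eigvals(m)   # 2.0 and 3.0 for this diagonal matrix

# np.split cuts an array into equal parts; np.stack joins arrays along a new axis
left, right = np.split(np.arange(6), 2)   # [0 1 2] and [3 4 5]
stacked = np.stack((left, right))         # shape (2, 3)
print(eigenvalues, stacked.shape)
```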

# Arithmetic operations, broadcasting and comparison

Numpy arrays support arithmetic operators like `+`, `-`, `*`, etc. You can perform an arithmetic operation with a single number (also called a scalar) or with another array of the same shape. Operators make it easy to write mathematical expressions with multi-dimensional arrays.


In [54]:
arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])

In [55]:
arr3 = np.array([[11, 12, 13, 14],
                 [15, 16, 17, 18],
                 [19, 11, 12, 13]])

In [56]:
#Adding a scalar
arr2 + 3

array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12,  4,  5,  6]])

In [57]:
#Element-wise subtraction
arr3 - arr2

array([[10, 10, 10, 10],
       [10, 10, 10, 10],
       [10, 10, 10, 10]])

In [58]:
#Division by scalar
arr2/2

array([[0.5, 1. , 1.5, 2. ],
       [2.5, 3. , 3.5, 4. ],
       [4.5, 0.5, 1. , 1.5]])

In [59]:
#Element-wise multiplication
arr2 * arr3

array([[ 11,  24,  39,  56],
       [ 75,  96, 119, 144],
       [171,  11,  24,  39]])

In [60]:
#Modulus with scalar
arr2 % 4

array([[1, 2, 3, 0],
       [1, 2, 3, 0],
       [1, 1, 2, 3]])

#Array Broadcasting
Numpy arrays also support broadcasting, allowing arithmetic operations between two arrays with different numbers of dimensions but compatible shapes. Let's look at an example to see how it works.

In [61]:
arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])

In [62]:
arr2.shape

(3, 4)

In [63]:
arr4 = np.array([4, 5, 6, 7])

In [64]:
arr4.shape

(4,)

In [65]:
arr2 + arr4

array([[ 5,  7,  9, 11],
       [ 9, 11, 13, 15],
       [13,  6,  8, 10]])

When the expression `arr2 + arr4` is evaluated, `arr4` (which has the shape `(4,)`) is replicated three times to match the shape `(3, 4)` of `arr2`. Numpy performs the replication without actually creating three copies of the smaller array, thus improving performance and using less memory.
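
The replication can be made visible with `np.broadcast_to`, which returns a read-only view of the expanded array. This is a sketch for illustration only; normal arithmetic does not require calling it.

```python
import numpy as np

arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])
arr4 = np.array([4, 5, 6, 7])

# A read-only view of arr4 that behaves like a (3, 4) array
expanded = np.broadcast_to(arr4, arr2.shape)
print(expanded)
print(expanded.strides[0])   # 0: moving to the next row re-reads the same memory

# Adding the expanded view gives the same result as relying on broadcasting
print(np.array_equal(arr2 + expanded, arr2 + arr4))   # True
```

A stride of 0 along the first dimension means every row points at the same four numbers, so no extra copies are stored.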

Broadcasting only works if one of the arrays can be replicated to match the other array's shape.

In [66]:
arr5 = np.array([7, 8])

In [67]:
arr5.shape

(2,)

In [68]:
arr2 + arr5

ValueError: operands could not be broadcast together with shapes (3,4) (2,)

In the above example, even if `arr5` is replicated three times, it will not match the shape of `arr2`. Hence `arr2+arr5` cannot be evaluated successfully.
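
More precisely, Numpy compares the shapes starting from the last dimension, and two lengths are compatible when they are equal or when one of them is 1. The sketch below reproduces the failure and then shows one compatible alternative. The array `per_row` and its values are made up for this example.

```python
import numpy as np

arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])    # shape (3, 4)
arr5 = np.array([7, 8])            # shape (2,)

# Last dimensions are 4 and 2: not equal, and neither is 1, so this fails
try:
    arr2 + arr5
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)

# A length-3 array does not line up with the last dimension either, but
# reshaping it to (3, 1) lets it broadcast across the columns
per_row = np.array([10, 20, 30])
shifted = arr2 + per_row.reshape(3, 1)
print(shifted)
```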

#Array Comparison
Numpy arrays also support comparison operations like `==`, `!=`, `>` etc. The result is an array of booleans.

In [69]:
arr1 = np.array([[1, 2, 3], [3, 4, 5]])
arr2 = np.array([[2, 2, 3], [1, 2, 5]])

In [70]:
arr1 == arr2

array([[False,  True,  True],
       [False, False,  True]])

In [71]:
arr1 != arr2

array([[ True, False, False],
       [ True,  True, False]])

In [72]:
arr1 >= arr2

array([[False,  True,  True],
       [ True,  True,  True]])

In [73]:
arr1 < arr2

array([[ True, False, False],
       [False, False, False]])

Array comparison is frequently used to count the number of equal elements in two arrays using the `sum` method. Remember that `True` evaluates to `1` and `False` evaluates to `0` when booleans are used in arithmetic operations.

In [74]:
(arr1 == arr2).sum()

3
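
A few closely related ways of summarizing a comparison, sketched with the same two arrays. The name `matches` is only for illustration.

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [3, 4, 5]])
arr2 = np.array([[2, 2, 3], [1, 2, 5]])

matches = (arr1 == arr2)

print(matches.sum())               # 3: number of equal elements
print(np.count_nonzero(matches))   # 3: same count, stated more explicitly
print(matches.mean())              # 0.5: fraction of elements that are equal
print(np.array_equal(arr1, arr2))  # False: the arrays are not identical
```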

#Array indexing and slicing
Numpy extends Python's list indexing notation using `[]` to multiple dimensions in an intuitive fashion. You can provide a comma-separated list of indices or ranges to select a specific element or a subarray (also called a slice) from a Numpy array.

In [75]:
arr3 = np.array([
    [[11, 12, 13, 14],
     [13, 14, 15, 19]],

    [[15, 16, 17, 21],
     [63, 92, 36, 18]],

    [[98, 32, 81, 23],
     [17, 18, 19.5, 43]]])

In [76]:
arr3.shape

(3, 2, 4)

In [77]:
#Single element
arr3[1, 1, 2]

36.0

In [78]:
x = arr3[1:]
y = x[0:1]
y[:2]

array([[[15., 16., 17., 21.],
        [63., 92., 36., 18.]]])

In [79]:
#Subarray using ranges
arr3[1:, 0:1, :2]

array([[[15., 16.]],

       [[98., 32.]]])

In [80]:
#Mixing indices and ranges
arr3[1:, 1, 3]

array([18., 43.])

In [81]:
#Mixing indices and ranges
arr3[1:, 1, :3]

array([[63. , 92. , 36. ],
       [17. , 18. , 19.5]])

In [82]:
#Using fewer indices
arr3[1]

array([[15., 16., 17., 21.],
       [63., 92., 36., 18.]])

In [83]:
#Using too many indices
arr3[1, 3, 2, 1]

IndexError: too many indices for array: array is 3-dimensional, but 4 were indexed
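
A useful way to read the examples above is to look at the shape of each result: a single index removes a dimension, while a range keeps it. The sketch below checks this on the same `arr3` and catches the error from using too many indices. The flag `too_many_ok` is only for illustration.

```python
import numpy as np

arr3 = np.array([
    [[11, 12, 13, 14],
     [13, 14, 15, 19]],
    [[15, 16, 17, 21],
     [63, 92, 36, 18]],
    [[98, 32, 81, 23],
     [17, 18, 19.5, 43]]])

print(arr3[1].shape)              # (2, 4): an index removes that dimension
print(arr3[1:2].shape)            # (1, 2, 4): a range keeps it, with length 1
print(arr3[1:, 0:1, :2].shape)    # (2, 1, 2)
print(arr3[1:, 1, 3].shape)       # (2,)

# Asking for more indices than the array has dimensions raises an IndexError
try:
    arr3[1, 3, 2, 1]
    too_many_ok = True
except IndexError:
    too_many_ok = False
print(too_many_ok)                # False
```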

#Other ways of creating Numpy arrays
Numpy also provides some handy functions to create arrays of desired shapes with fixed or random values.

In [84]:
#All zeros
np.zeros((3, 2))

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

In [85]:
#All ones
np.ones([2, 2, 3])

array([[[1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.]]])

In [86]:
#Identity matrix
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [87]:
#Random vector
np.random.rand(5)

array([0.29779455, 0.85604679, 0.32273969, 0.84487326, 0.70939869])

In [88]:
#Random matrix
np.random.randn(2, 3) #rand vs. randn - what's the difference?

array([[ 1.37602845, -0.55110215, -0.21404079],
       [-0.48364917, -0.8151598 ,  0.37739866]])
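
On the question in the comment above: `np.random.rand` draws from a uniform distribution over `[0, 1)`, while `np.random.randn` draws from the standard normal distribution (mean 0, standard deviation 1). The sketch below checks this empirically on large samples. The seed and the sample size are arbitrary choices made for this example.

```python
import numpy as np

np.random.seed(0)   # arbitrary seed, only to make the run repeatable

uniform = np.random.rand(100_000)    # uniform over [0, 1)
normal = np.random.randn(100_000)    # standard normal: mean 0, std 1

print(uniform.min() >= 0, uniform.max() < 1)             # True True
print(round(uniform.mean(), 2))                          # close to 0.5
print(round(normal.mean(), 2), round(normal.std(), 2))   # close to 0 and 1
print((normal < 0).any())                                # True: negatives do occur
```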

In [89]:
#Fixed value
np.full([2, 3], 42)

array([[42, 42, 42],
       [42, 42, 42]])

In [90]:
#Range with start, end and step
np.arange(10, 90, 3)

array([10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58,
       61, 64, 67, 70, 73, 76, 79, 82, 85, 88])

In [91]:
#Equally spaced numbers in a range
np.linspace(3, 27, 9)

array([ 3.,  6.,  9., 12., 15., 18., 21., 24., 27.])
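
The last two functions differ in how the range is specified: `np.arange` takes a step and excludes the end value, while `np.linspace` takes a count of points and includes the end value by default. A short sketch comparing the two, with variable names chosen for illustration:

```python
import numpy as np

stepped = np.arange(10, 90, 3)     # step of 3; the end value 90 is excluded
spaced = np.linspace(3, 27, 9)     # 9 points; the end value 27 is included

print(len(stepped), stepped[-1])   # 27 88
print(len(spaced), spaced[-1])     # 9 27.0

# The spacing produced by linspace is (end - start) / (count - 1)
print(np.diff(spaced))             # all 3.0

# linspace can also leave the end value out
print(np.linspace(3, 27, 8, endpoint=False))   # 3, 6, ..., 24
```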