# Data Science Fundamentals - Numpy

<center>**emaiconference@2022**</center>

## 2.0 NumPy Basics

NumPy is the fundamental package for scientific computing with Python. It provides high-performance
vector, matrix, and higher-dimensional data
structures and offers MATLAB-like capabilities within Python.

It contains among other things:

* a powerful N-dimensional array/vector/matrix object
* sophisticated (broadcasting) functions
* functions implemented in C/Fortran, which give good performance when code is vectorized
* tools for integrating C/C++ and Fortran code
* useful linear algebra, Fourier transform, and random number capabilities

This style of programming is also known as *array-oriented computing*. The recommended convention is to import NumPy under the alias `np` (`import numpy as np`).

**Why Numpy?**

In Python we have lists that serve the purpose of arrays, but they are slow to process.

NumPy aims to provide an array object that is much faster than traditional Python lists. A figure of up to 50x is often quoted, but the actual speedup depends on the operation and the array size.

The array object in NumPy is called `ndarray`. It comes with many supporting functions that make working with arrays easy.

Arrays are very frequently used in data science, where speed and resources are very important.


**Why is NumPy Faster Than Lists?**

NumPy arrays are stored in one contiguous block of memory, unlike lists, so processes can access and manipulate them very efficiently.

This behavior is called locality of reference in computer science.

This is the main reason why NumPy is faster than lists. It is also optimized to work with the latest CPU architectures.
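
The speed difference can be checked with a rough timing sketch like the one below. The array size, the sum-of-squares task, and the helper names (`list_sum_squares`, `numpy_sum_squares`) are assumptions chosen for illustration; the measured ratio will vary with the machine and the operation.

```python
import timeit

import numpy as np

n = 200_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)  # int64 avoids overflow when squaring


def list_sum_squares():
    # Pure-Python loop over a list
    return sum(v * v for v in py_list)


def numpy_sum_squares():
    # The same computation as one vectorized NumPy expression
    return int(np.sum(np_arr * np_arr))


t_list = timeit.timeit(list_sum_squares, number=5)
t_numpy = timeit.timeit(numpy_sum_squares, number=5)

print(f"list : {t_list:.4f} s")
print(f"numpy: {t_numpy:.4f} s")
```

Both functions return the same value; only the time taken differs.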

In [1]:
import numpy as np

In [2]:
print(np.__version__)

1.20.3


## 2.1 Creating NumPy arrays

There are a number of ways to initialize new NumPy arrays, for example:

* from a Python list or tuple, or
* using functions dedicated to generating NumPy arrays, such as `arange`, `linspace`, `empty`, `zeros`, etc.

#### array from list

In [3]:
# 0-D array
n = np.array(23)
print(n)

23


In [4]:
# a vector[1-D array]
v = np.array([0.5,0.8,2,1])
print(v)

[0.5 0.8 2.  1. ]


In [5]:
# a matrix[2-D array]
M = np.array([[1, 2], [3, 4]])
print(M)

[[1 2]
 [3 4]]


Create a 3-D array from two 2-D arrays, each containing two rows with the values 1,2,3 and 4,5,6:



In [6]:
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

print(arr)

[[[1 2 3]
  [4 5 6]]

 [[1 2 3]
  [4 5 6]]]


In [7]:
# a 2-D array (3x3 matrix) of floats
N = np.array([[0.2,0.4,2],[0.1,2,5],[3,0.4,0.1]])
print(N)

[[0.2 0.4 2. ]
 [0.1 2.  5. ]
 [3.  0.4 0.1]]


**Check the Number of Dimensions**

NumPy arrays provide the ``ndim`` attribute, an integer that tells us how many dimensions the array has.

In [8]:
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)

0
1
2
3


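**Higher-Dimensional Arrays**

An array can have any number of dimensions. The ``ndmin`` argument sets the minimum number of dimensions when the array is created: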

In [9]:
arr = np.array([1, 2, 3, 4], ndmin=5)

print(arr)
print('number of dimensions :', arr.ndim)

[[[[[1 2 3 4]]]]]
number of dimensions : 5
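
To see what `ndmin=5` does to the layout, it helps to look at the shape. The short sketch below suggests that the extra dimensions are added in front, each with length 1; the variable name `same` is only illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr.shape)  # (1, 1, 1, 1, 4)

# The same layout built by reshaping a 1-D array
same = np.array([1, 2, 3, 4]).reshape(1, 1, 1, 1, 4)
print(np.array_equal(arr, same))  # True
```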


#### array from a dedicated function

In [10]:
#Evenly spaced array (arange)
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [11]:
# create a range
x = np.arange(-1, 2.1, 0.1) # arguments: start, stop, step
print(x)

[-1.00000000e+00 -9.00000000e-01 -8.00000000e-01 -7.00000000e-01
 -6.00000000e-01 -5.00000000e-01 -4.00000000e-01 -3.00000000e-01
 -2.00000000e-01 -1.00000000e-01 -2.22044605e-16  1.00000000e-01
  2.00000000e-01  3.00000000e-01  4.00000000e-01  5.00000000e-01
  6.00000000e-01  7.00000000e-01  8.00000000e-01  9.00000000e-01
  1.00000000e+00  1.10000000e+00  1.20000000e+00  1.30000000e+00
  1.40000000e+00  1.50000000e+00  1.60000000e+00  1.70000000e+00
  1.80000000e+00  1.90000000e+00  2.00000000e+00]
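
The value `-2.22044605e-16` in the output above is floating-point rounding: repeatedly adding `0.1` never lands exactly on zero. With a non-integer step it is often easier to use `linspace`, or to build the range from integers and scale it. A minimal sketch, assuming the same interval from -1 to 2 with a spacing of 0.1:

```python
import numpy as np

# 31 points from -1 to 2 inclusive, i.e. a spacing of 0.1
x = np.linspace(-1, 2, 31)
print(x.size)   # 31
print(x[-1])    # 2.0

# Alternative: build the range from integers, then scale
x_int = np.arange(-10, 21) / 10
print(x_int[10])  # 0.0
```

With `linspace` the number of points is fixed up front, so the length of the result does not depend on rounding.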


In [12]:
# using linspace, both end points ARE included
np.linspace(0, 2, 25)

array([0.        , 0.08333333, 0.16666667, 0.25      , 0.33333333,
       0.41666667, 0.5       , 0.58333333, 0.66666667, 0.75      ,
       0.83333333, 0.91666667, 1.        , 1.08333333, 1.16666667,
       1.25      , 1.33333333, 1.41666667, 1.5       , 1.58333333,
       1.66666667, 1.75      , 1.83333333, 1.91666667, 2.        ])

In [13]:
#zeros array
np.zeros((2,3))

array([[0., 0., 0.],
       [0., 0., 0.]])

In [14]:
np.ones((4,3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [15]:
# uninitialized array: the values are arbitrary leftovers from memory
np.empty((4,1))

array([[0.5],
       [0.8],
       [2. ],
       [1. ]])
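
Because `np.empty` does not set the values, the array should be filled before it is read. The sketch below shows one way to do that, with `np.full` as an alternative that allocates and fills in one call; the fill value `7.5` is arbitrary.

```python
import numpy as np

buf = np.empty((4, 1))   # contents are arbitrary until assigned
buf[:] = 7.5             # fill every element before reading
print(buf.ravel())       # [7.5 7.5 7.5 7.5]

filled = np.full((4, 1), 7.5)  # allocate and fill in one call
print(np.array_equal(buf, filled))  # True
```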

#### Random numbers and seeds

In [16]:
# uniform random numbers in [0, 1)
np.random.rand(5,1)

array([[0.96725163],
       [0.9784884 ],
       [0.00902875],
       [0.65626057],
       [0.25188849]])

In [17]:
# standard normal distributed random numbers
np.random.randn(5,5)

array([[-0.44197343,  2.03946025, -0.54288464, -0.48744742,  0.30357771],
       [ 0.45569849,  1.11405161,  0.18163868, -0.03906072, -1.35077666],
       [ 0.11981237, -1.714845  , -1.18604458,  0.07924776, -0.517105  ],
       [-0.28946641,  0.16270074, -0.15105976, -0.78520158,  0.90906173],
       [-0.27621585, -1.25345501,  0.55402728, -0.96609411,  0.05609882]])

#### Random seed

Setting a seed makes the random numbers repeatable (reproducible).

In [18]:
np.random.seed(4)
x=np.random.rand(8,2)
print(x)

[[0.96702984 0.54723225]
 [0.97268436 0.71481599]
 [0.69772882 0.2160895 ]
 [0.97627445 0.00623026]
 [0.25298236 0.43479153]
 [0.77938292 0.19768507]
 [0.86299324 0.98340068]
 [0.16384224 0.59733394]]
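
Newer NumPy code often uses the `Generator` interface instead of the global seed. The sketch below assumes NumPy 1.17 or later, where `np.random.default_rng` is available; the numbers it produces differ from those of `np.random.seed(4)` above, but two generators created with the same seed agree with each other.

```python
import numpy as np

rng_a = np.random.default_rng(4)
rng_b = np.random.default_rng(4)

a = rng_a.random((8, 2))  # uniform samples in [0, 1)
b = rng_b.random((8, 2))

print(np.array_equal(a, b))  # True
```

Each generator carries its own state, so seeding one part of a program does not affect random numbers drawn elsewhere.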


### Shape, size, dimension and dtype

In [19]:
y= np.arange(3,78,7)
print(y)

[ 3 10 17 24 31 38 45 52 59 66 73]


In [20]:
y.shape , y.size

((11,), 11)

In [21]:
# row,column
x.shape

(8, 2)

In [22]:
# elements
x.size

16

In [23]:
x.ndim

2

In [24]:
x.dtype

dtype('float64')

In [25]:
y.dtype

dtype('int64')

####  Shape Manipulation
The shape of an array can be changed with various commands, such as ``reshape`` and ``flatten``:

In [26]:
x = np.random.rand(30)
print(x)

[0.0089861  0.38657128 0.04416006 0.95665297 0.43614665 0.94897731
 0.78630599 0.8662893  0.17316542 0.07494859 0.60074272 0.16797218
 0.73338017 0.40844386 0.52790882 0.93757158 0.52169612 0.10819338
 0.15822341 0.54520265 0.52440408 0.63761024 0.40149544 0.64980511
 0.3969     0.62391611 0.76740497 0.17897391 0.37557577 0.50253306]


In [27]:
x.shape

(30,)

In [28]:
x_new=x.reshape(-1,1)

In [29]:
x_new.shape

(30, 1)

In [30]:
x_new

array([[0.0089861 ],
       [0.38657128],
       [0.04416006],
       [0.95665297],
       [0.43614665],
       [0.94897731],
       [0.78630599],
       [0.8662893 ],
       [0.17316542],
       [0.07494859],
       [0.60074272],
       [0.16797218],
       [0.73338017],
       [0.40844386],
       [0.52790882],
       [0.93757158],
       [0.52169612],
       [0.10819338],
       [0.15822341],
       [0.54520265],
       [0.52440408],
       [0.63761024],
       [0.40149544],
       [0.64980511],
       [0.3969    ],
       [0.62391611],
       [0.76740497],
       [0.17897391],
       [0.37557577],
       [0.50253306]])
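
In `reshape(-1, 1)`, the `-1` asks NumPy to infer that dimension from the array size. The sketch below uses `np.arange(30)` as stand-in data so the shapes are easy to follow.

```python
import numpy as np

x = np.arange(30)

col = x.reshape(-1, 1)    # 30 rows, 1 column
grid = x.reshape(5, -1)   # 5 rows, columns inferred as 30 / 5 = 6

print(col.shape)   # (30, 1)
print(grid.shape)  # (5, 6)

# A shape that does not divide the size evenly raises ValueError
try:
    x.reshape(7, -1)
except ValueError as err:
    print("cannot reshape:", err)
```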

In [31]:
x = np.random.rand(10, 2)
print(x)

[[0.68666708 0.25367965]
 [0.55474086 0.62493084]
 [0.89550117 0.36285359]
 [0.63755707 0.1914464 ]
 [0.49779411 0.1824454 ]
 [0.91838304 0.43182207]
 [0.8301881  0.4167763 ]
 [0.90466759 0.40482522]
 [0.3311745  0.57213877]
 [0.84544365 0.86101431]]


In [32]:
x.flatten()

array([0.68666708, 0.25367965, 0.55474086, 0.62493084, 0.89550117,
       0.36285359, 0.63755707, 0.1914464 , 0.49779411, 0.1824454 ,
       0.91838304, 0.43182207, 0.8301881 , 0.4167763 , 0.90466759,
       0.40482522, 0.3311745 , 0.57213877, 0.84544365, 0.86101431])
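
`flatten` always returns a new array. The related method `ravel` returns a view when it can, which avoids copying but means changes show up in the original. A small sketch with stand-in data:

```python
import numpy as np

x = np.arange(6).reshape(3, 2)

flat = x.flatten()  # always a new array
rav = x.ravel()     # a view here, because x is contiguous

print(np.shares_memory(x, flat))  # False
print(np.shares_memory(x, rav))   # True

rav[0] = 99
print(x[0, 0])   # 99
print(flat[0])   # 0
```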

#### vstack and hstack

In [33]:
x = np.ones((5, 2))
print(x)

[[1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]]


In [34]:
y = np.zeros((5, 2))
print(y)

[[0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]]


In [35]:
z = np.hstack((x,y))
print(z)

[[1. 1. 0. 0.]
 [1. 1. 0. 0.]
 [1. 1. 0. 0.]
 [1. 1. 0. 0.]
 [1. 1. 0. 0.]]


In [36]:
z = np.vstack((x,y))
print(z)

[[1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]]
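
For 2-D arrays such as these, `hstack` and `vstack` give the same results as `np.concatenate` along axis 1 and axis 0 respectively. The sketch below checks this on the same `x` and `y`:

```python
import numpy as np

x = np.ones((5, 2))
y = np.zeros((5, 2))

h = np.hstack((x, y))
v = np.vstack((x, y))

print(h.shape)  # (5, 4)
print(v.shape)  # (10, 2)

print(np.array_equal(h, np.concatenate((x, y), axis=1)))  # True
print(np.array_equal(v, np.concatenate((x, y), axis=0)))  # True
```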


### Indexing and slicing

In [37]:
data = np.random.randint(25,37, size=10)
print(data)

[31 31 33 25 28 36 30 31 35 26]


In [38]:
# print the first sensor reading
print(data[0])

31


In [39]:
# print the data at indices 3 to 6 (the stop index 7 is excluded)
print(data[3:7])

[25 28 36 30]


In [40]:
# print the last three elements (indices 7 to 9)
print(data[7:])

[31 35 26]


In [41]:
# We can also use negative index
print(data[-1])

26
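
Slices also accept negative bounds and a step, written `start:stop:step`. The sketch below reuses the values printed above as a fixed array so the results are reproducible.

```python
import numpy as np

data = np.array([31, 31, 33, 25, 28, 36, 30, 31, 35, 26])

print(data[-3:])    # [31 35 26]
print(data[::2])    # [31 33 28 30 35]
print(data[::-1])   # [26 35 31 30 36 28 25 33 31 31]
```

`data[-3:]` selects the last three elements regardless of the array length, which `data[7:]` only does for an array of ten elements.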


A multidimensional array behaves like a dataframe or matrix (i.e. it has rows and columns). Consider the following 2-D array.

In [42]:
data = np.random.randint(25,37, size=(10,3))
print(data)

[[36 26 36]
 [31 25 35]
 [33 32 30]
 [32 31 33]
 [36 29 29]
 [25 32 27]
 [26 35 33]
 [33 26 28]
 [26 28 31]
 [32 35 29]]


In [43]:
# View the first column of the array
data[:,0]

array([36, 31, 33, 32, 36, 25, 26, 33, 26, 32])

In [44]:
# View the third column of the array
data[:,2]

array([36, 35, 30, 33, 29, 27, 33, 28, 31, 29])

In [45]:
# View the first row of the array
data[0,]

array([36, 26, 36])

In [46]:
# View the first two rows
data[:2,]

array([[36, 26, 36],
       [31, 25, 35]])

In [47]:
# View the first element (row 0, column 0)
data[0,0]

36
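
Row and column slices can be combined to pick out a sub-block. The sketch below uses the first three rows of the array above as fixed data.

```python
import numpy as np

data = np.array([[36, 26, 36],
                 [31, 25, 35],
                 [33, 32, 30]])

# Rows 1 and 2, columns 0 and 1
block = data[1:3, 0:2]
print(block)
# [[31 25]
#  [33 32]]

# Last row and last column
print(data[-1])      # [33 32 30]
print(data[:, -1])   # [36 35 30]
```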

#### Fancy indexing with boolean masks

In [48]:
# view all values that are less than 30
mask = data<30
data[mask]

array([26, 25, 29, 29, 25, 27, 26, 26, 28, 26, 28, 29])

In [49]:
if (data > 30).any():
    print("at least one element in data is larger than 30")
else:
    print("no element in data is larger than 30")

at least one element in data is larger than 30
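
Masks can be combined, counted, and tested with `all()` as well as `any()`. The sketch below uses a small fixed array (the first three rows from above) and also shows indexing with a list of integers, which selects rows in the order given.

```python
import numpy as np

data = np.array([[36, 26, 36],
                 [31, 25, 35],
                 [33, 32, 30]])

# Combine conditions with & (and) or | (or); the parentheses are required
in_range = (data >= 30) & (data < 35)
print(data[in_range])       # [31 33 32 30]

# Count matches and test every element
print((data < 30).sum())    # 2
print((data >= 25).all())   # True

# Integer-array indexing: pick rows 2 and 0, in that order
picked = data[[2, 0]]
print(picked)
# [[33 32 30]
#  [36 26 36]]
```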


## Save and load NumPy data to/from file

In [50]:
np.save("data/sensor_data.npy",data)

In [51]:
sensor_data = np.load("data/sensor_data.npy")
print(sensor_data)

[[36 26 36]
 [31 25 35]
 [33 32 30]
 [32 31 33]
 [36 29 29]
 [25 32 27]
 [26 35 33]
 [33 26 28]
 [26 28 31]
 [32 35 29]]
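
`np.save` does not create missing folders, so the `data/` directory has to exist before the cell above runs. The sketch below shows one way to handle that; it writes into a temporary directory (an assumption made so the example cleans up after itself) and also demonstrates a plain-text round trip with `np.savetxt` and `np.loadtxt`. The array contents and file names are illustrative.

```python
import os
import tempfile

import numpy as np

data = np.arange(12).reshape(4, 3)

# A temporary directory stands in for the "data/" folder used above
with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "data")
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing

    # Binary .npy format: keeps dtype and shape
    npy_path = os.path.join(out_dir, "sensor_data.npy")
    np.save(npy_path, data)
    loaded = np.load(npy_path)

    # Plain-text alternative that other tools can read
    csv_path = os.path.join(out_dir, "sensor_data.csv")
    np.savetxt(csv_path, data, delimiter=",", fmt="%d")
    from_csv = np.loadtxt(csv_path, delimiter=",", dtype=int)

print(np.array_equal(data, loaded))    # True
print(np.array_equal(data, from_csv))  # True
```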


### Calculations

It is often useful to store datasets in NumPy arrays. NumPy provides a number of functions to calculate statistics of datasets in arrays.

In [52]:
#mean
sensor_data.mean()

30.666666666666668

In [53]:
#std
sensor_data.std()

3.457680661303984

In [54]:
#min
sensor_data.min()

25

In [55]:
#max
sensor_data.max()

36
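
The calls above reduce the whole array to a single number. Passing `axis` computes the statistic per column (`axis=0`) or per row (`axis=1`) instead. A minimal sketch on a small fixed array, standing in for the sensor data:

```python
import numpy as np

sensor_data = np.array([[36, 26, 36],
                        [31, 25, 35],
                        [33, 32, 30]])

col_means = sensor_data.mean(axis=0)  # one mean per column
row_max = sensor_data.max(axis=1)     # one maximum per row

print(col_means)
print(row_max)                # [36 35 33]
print(sensor_data.argmax())   # 0 (position of the first maximum in the flattened array)
```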

### NumPy calculations are element-wise

In [56]:
x = np.arange(1,10)
print(x)

[1 2 3 4 5 6 7 8 9]


In [57]:
print(x+2)

[ 3  4  5  6  7  8  9 10 11]


In [58]:
# element-wise square (same result as x**2)
np.square(x)

array([ 1,  4,  9, 16, 25, 36, 49, 64, 81])

In [59]:
np.log(x)

array([0.        , 0.69314718, 1.09861229, 1.38629436, 1.60943791,
       1.79175947, 1.94591015, 2.07944154, 2.19722458])
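
Element-wise behaviour also applies between two arrays of the same shape, and NumPy broadcasts a smaller array across a larger one when the shapes are compatible. The arrays `y`, `m`, and `row` below are illustrative.

```python
import numpy as np

x = np.arange(1, 10)
y = np.arange(9, 0, -1)

print(x + y)   # [10 10 10 10 10 10 10 10 10]
print(x * y)   # [ 9 16 21 24 25 24 21 16  9]

# Broadcasting: the 1-D row is added to every row of the 2-D array
m = np.arange(6).reshape(2, 3)
row = np.array([10, 20, 30])
print(m + row)
# [[10 21 32]
#  [13 24 35]]
```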

## NumPy Array Copy vs View

**The Difference Between Copy and View**

The main difference between a copy and a view of an array is that the copy is a new array, and the view is just a view of the original array.

The copy owns the data, so any changes made to the copy will not affect the original array, and any changes made to the original array will not affect the copy.

The view does not own the data and any changes made to the view will affect the original array, and any changes made to the original array will affect the view.



**Copy**

In [60]:
arr = np.array([1, 2, 3, 4, 5])
x = arr.copy()

arr[0] = 42

print(arr)
print(x)

[42  2  3  4  5]
[1 2 3 4 5]


**View**

In [61]:
arr = np.array([1, 2, 3, 4, 5])
x = arr.view()
arr[0] = 42

print(arr)
print(x)

[42  2  3  4  5]
[42  2  3  4  5]


In [62]:
arr = np.array([1, 2, 3, 4, 5])
x = arr.view()
x[0] = 31

print(arr)
print(x)

[31  2  3  4  5]
[31  2  3  4  5]
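
Whether an array owns its data can be checked with the `base` attribute: it is `None` for an array that owns its data and refers to the original for a view. The sketch below also shows that basic slicing returns a view; the names `c`, `v`, and `s` are illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

c = arr.copy()
v = arr.view()
s = arr[1:4]        # basic slicing also returns a view

print(c.base is None)  # True
print(v.base is arr)   # True
print(s.base is arr)   # True

s[0] = 99
print(arr)  # [ 1 99  3  4  5]
print(c)    # [1 2 3 4 5]
```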


**Iteration**

In [63]:
arr = np.array([1, 2, 3])

for x in arr:
  print(x)

1
2
3


In [64]:
arr = np.array([[1, 2, 3], [4, 5, 6]])

for x in arr:
  print(x)

[1 2 3]
[4 5 6]


In [65]:

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

for x in arr:
  print(x)

[[1 2 3]
 [4 5 6]]
[[ 7  8  9]
 [10 11 12]]


In [66]:
arr = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

for x in np.nditer(arr):
  print(x)

1
2
3
4
5
6
7
8
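
When the position of each element is needed as well as its value, `np.ndenumerate` yields both. The sketch below collects the pairs in a list called `pairs` (an illustrative name) and ends with a reminder that many loops can be replaced by a single vectorized call.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

pairs = []
for idx, value in np.ndenumerate(arr):
    print(idx, value)
    pairs.append((idx, int(value)))

# Many loops can be replaced by one vectorized call
print(arr.sum())  # 10
```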


**Join**

In [67]:
# 1-D array

arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

arr = np.concatenate((arr1, arr2))

print(arr)

[1 2 3 4 5 6]


In [68]:
# 2 -D array

arr1 = np.array([[1, 2], [3, 4]])

arr2 = np.array([[5, 6], [7, 8]])

arr = np.concatenate((arr1, arr2), axis = 1)

print(arr)

[[1 2 5 6]
 [3 4 7 8]]
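
For comparison, the sketch below joins the same two arrays along `axis=0` (the default) and with `np.stack`, which creates a new dimension instead of extending an existing one.

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])

rows = np.concatenate((arr1, arr2), axis=0)  # rows appended below
print(rows.shape)     # (4, 2)

stacked = np.stack((arr1, arr2))             # new leading axis
print(stacked.shape)  # (2, 2, 2)
```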


**Splitting NumPy Arrays**

Splitting is the reverse operation of joining.

Joining merges multiple arrays into one, and splitting breaks one array into multiple.

We use `array_split()` for splitting arrays; we pass it the array we want to split and the number of splits.

In [69]:
arr = np.array([1, 2, 3, 4, 5, 6])

newarr = np.array_split(arr, 2)

print(newarr)

[array([1, 2, 3]), array([4, 5, 6])]
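
`array_split` also works when the array cannot be divided evenly, in which case the earlier pieces get the extra elements. The stricter `np.split` raises an error instead. A short sketch with a 7-element array:

```python
import numpy as np

arr = np.arange(1, 8)  # 7 elements

parts = np.array_split(arr, 3)
print([p.tolist() for p in parts])  # [[1, 2, 3], [4, 5], [6, 7]]

# np.split requires an equal division
try:
    np.split(arr, 3)
except ValueError as err:
    print("np.split failed:", err)
```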


**Search**

In [70]:
arr = np.array([1, 2, 3, 4, 5, 4, 4])

x = np.where(arr == 4)

print(x)  # the positions (indices) of the matching elements are printed

(array([3, 5, 6]),)
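
`np.where` returns a tuple with one index array per dimension, which is why the output above is wrapped in parentheses. With three arguments it instead chooses between two values element by element. A sketch of both uses:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 4, 4])

# Take the first (and only) index array for a 1-D input
idx = np.where(arr == 4)[0]
print(idx)  # [3 5 6]

# Three-argument form: value where the condition holds, else another value
labels = np.where(arr % 2 == 0, "even", "odd")
print(labels)

replaced = np.where(arr == 4, 0, arr)
print(replaced)  # [1 2 3 0 5 0 0]
```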


**Search Sorted**

There is a method called ``searchsorted()`` which performs a binary search in a sorted array and returns the index where the specified value would be inserted to maintain the sort order.

In [71]:
arr = np.array([6, 7, 8, 9])

x = np.searchsorted(arr, 9)

print(x)

3


In [72]:

arr = np.array([6, 7, 8, 9])

x = np.searchsorted(arr, 7, side='right')

print(x)

2
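
The `side` argument only matters when the value is already present: `'left'` (the default) returns the position before the existing value, `'right'` the position after it. `searchsorted` assumes sorted input, so unsorted data should go through `np.sort` first. A sketch with illustrative values:

```python
import numpy as np

arr = np.array([6, 7, 8, 9])

print(np.searchsorted(arr, 7))                # 1
print(np.searchsorted(arr, 7, side='right'))  # 2

# Several values can be looked up in one call
positions = np.searchsorted(arr, [5, 8, 10])
print(positions)  # [0 2 4]

# Sort first when the input is not already ordered
unsorted = np.array([9, 6, 8, 7])
sorted_arr = np.sort(unsorted)
print(sorted_arr)  # [6 7 8 9]
```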


## References

- [python4datascience-atc](https://github.com/pythontz/python4datascience-atc)
- [PythonDataScienceHandbook](https://github.com/jakevdp/PythonDataScienceHandbook)
- [DS-python-data-analysis](https://github.com/jorisvandenbossche/DS-python-data-analysis)