# `numpy` Introduction

**`ndarray`** is the core data structure of `numpy`. An `ndarray` object stores a multi-dimensional array whose elements all share a single data type.

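An `ndarray` consists of:

- A pointer to the data, i.e. a block of memory
- A data type (`dtype`) describing the type and size of each element
- A shape tuple giving the size of the array along each dimension
- A `strides` tuple giving the number of bytes to step along each dimension to reach the next element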

The array is stored as follows:

![](https://i.loli.net/2021/07/05/9A7VZTGEczpQxoq.jpg)
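
A small sketch of how these fields can be inspected on a live array. The array `a` is an illustrative example, not part of the original notes, and the stride values assume 8-byte integers in the default C (row-major) layout.

```python
import numpy as np

# Illustrative array: 3 rows x 4 columns of 8-byte integers
a = np.arange(12, dtype=np.int64).reshape(3, 4)

print(a.dtype)     # int64: each element occupies 8 bytes
print(a.shape)     # (3, 4)
print(a.strides)   # (32, 8): 32 bytes to the next row, 8 bytes to the next column

# Transposing swaps shape and strides without copying the data
t = a.T
print(t.shape, t.strides)        # (4, 3) (8, 32)
print(np.shares_memory(a, t))    # True
```

Because only the shape and strides change, a transpose is cheap no matter how large the array is.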

In [1]:
import numpy as np

In [2]:
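# Ragged (non-rectangular) input needs dtype=object; NumPy >= 1.24 raises ValueError otherwise
print(np.array([[1, 1, 4], [5, 1], [4]], dtype=object))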
print(np.array([[1, 2], [3, 4]]))

[list([1, 1, 4]) list([5, 1]) list([4])]
[[1 2]
 [3 4]]


In [3]:
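# np.float and np.int were removed in NumPy 1.24; use the builtins instead
print(np.empty([3, 2], dtype=float))  # uninitialized memory: the values are arbitrary
print(np.zeros((2, 3), dtype=int))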

[[0. 0.]
 [0. 0.]
 [0. 0.]]
[[0 0 0]
 [0 0 0]]


## Practice 1

Create an arithmetic sequence as a `numpy` array from 0 to 100 (both inclusive) with 51 elements.

In [4]:
step = (100 - 0) / (51 - 1)
arr = np.arange(0, 100 + step, step, dtype=int)  # np.int was removed in NumPy 1.24
print(*arr, sep=", ")

0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100


In [5]:
print(np.linspace(0, 100, 51))
# print(np.geomspace(1, 64, 7))
print(np.arange(100)[2:7:2])

[  0.   2.   4.   6.   8.  10.  12.  14.  16.  18.  20.  22.  24.  26.
  28.  30.  32.  34.  36.  38.  40.  42.  44.  46.  48.  50.  52.  54.
  56.  58.  60.  62.  64.  66.  68.  70.  72.  74.  76.  78.  80.  82.
  84.  86.  88.  90.  92.  94.  96.  98. 100.]
[2 4 6]
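
As a cross-check, the two solutions above can be compared directly. This is a minimal sketch: `np.linspace` takes the number of points, so the step does not have to be computed by hand, and it avoids the end-point surprises that `np.arange` can show with non-integer steps.

```python
import numpy as np

via_arange = np.arange(0, 101, 2)         # integer step; the stop value is exclusive
via_linspace = np.linspace(0, 100, 51)    # 51 points; both ends are inclusive

print(len(via_arange), len(via_linspace))         # 51 51
print(np.array_equal(via_arange, via_linspace))   # True
```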


## Insertion and Stacking

More information can be found in the [official NumPy docs](https://numpy.org/doc/stable/reference/generated/numpy.vstack.html).

In [6]:
a = np.array([[1], [2], [3]])
b = np.array([[4], [5], [6]])
print(np.vstack((a,b)))

[[1]
 [2]
 [3]
 [4]
 [5]
 [6]]


In [7]:
a = np.array([[10, 20], [30, 40], [50, 60]])
print('First array')
print(a, end='\n\n')
print('Without the axis argument, the array is flattened before insertion:')
b = np.insert(a, 3, [100, 200])
print(b, end='\n\n')
print('With Axis = 0')
c = np.insert(a, 2, 11, axis=0)
print(c, end='\n\n')
print('With Axis = 1')
c = np.insert(a, 2, 11, axis=1)
print(c, end='\n\n')

First array
[[10 20]
 [30 40]
 [50 60]]

Without the axis argument, the array is flattened before insertion:
[ 10  20  30 100 200  40  50  60]

With Axis = 0
[[10 20]
 [30 40]
 [11 11]
 [50 60]]

With Axis = 1
[[10 20 11]
 [30 40 11]
 [50 60 11]]



## Broadcasting

Before we begin, we need one definition: the **rank** of a NumPy array is its number of dimensions, available as `ndarray.ndim`. For example, an array of shape $(3, 4)$ has rank $2$, and an array of shape $(3, 4, 3)$ has rank $3$. Now onto the rules.

1. To decide whether two arrays are compatible for an operation, NumPy compares their shapes dimension by dimension, starting from the trailing dimension and working its way forward (from right to left).

2. Two dimensions are said to be compatible if both of them are **equal, or either one of them is 1**.

3. If two dimensions are unequal and neither of them is 1, NumPy raises a `ValueError` and the operation fails.

![Screenshot_20210705_151321_edit_1568224231025810.jpg](https://i.loli.net/2021/07/05/z9UCY5fgSKiqB4x.jpg)
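
The three rules above can be sketched as a small function. `broadcast_shape` is a hypothetical helper written for illustration, not a NumPy API; the final line compares it against `np.broadcast_shapes`, which assumes NumPy 1.20 or later.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError (hypothetical helper)."""
    result = []
    # Rule 1: walk both shapes from the trailing dimension towards the front
    for dim_a, dim_b in zip(reversed(shape_a), reversed(shape_b)):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)          # Rule 2: equal, or b is stretched
        elif dim_a == 1:
            result.append(dim_b)          # Rule 2: a is stretched
        else:
            # Rule 3: unequal and neither is 1
            raise ValueError(f"incompatible dimensions: {dim_a} and {dim_b}")
    # Leading dimensions present in only one shape are kept unchanged
    longer = shape_a if len(shape_a) > len(shape_b) else shape_b
    extra = tuple(longer[: abs(len(shape_a) - len(shape_b))])
    return extra + tuple(reversed(result))

print(broadcast_shape((3, 4, 6, 2), (3, 4, 1, 2)))       # (3, 4, 6, 2)
print(np.broadcast_shapes((3, 4, 6, 2), (3, 4, 1, 2)))   # (3, 4, 6, 2)
```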

In [8]:
arr_a = np.random.rand(3, 4, 6, 2)
arr_b = np.random.rand(3, 5, 1, 2)

try:
    arr_a + arr_b   # op throws an error: axis 1 has sizes 4 and 5
except ValueError:
    print("Can't perform")

arr_a = np.random.rand(3, 4, 6, 2)
arr_b = np.random.rand(3, 4, 1, 2)

print((arr_a + arr_b).shape)   # Does not throw an error

Can't perform
(3, 4, 6, 2)

![](https://blog.paperspace.com/content/images/2020/07/image-14.png)

Arrays with different ranks follow the same broadcasting rules, except that the shape of the lower-rank array is first padded with 1s on the left:

![](https://blog.paperspace.com/content/images/2020/07/image-16.png)

![](https://blog.paperspace.com/content/images/2020/07/image-18.png)
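
A short sketch of broadcasting between arrays of different rank. The arrays `matrix`, `row`, and `col` are illustrative names chosen for this example.

```python
import numpy as np

matrix = np.arange(12).reshape(3, 4)    # shape (3, 4)
row = np.array([10, 20, 30, 40])        # shape (4,), treated as (1, 4)
col = np.array([[100], [200], [300]])   # shape (3, 1)

print((matrix + row).shape)   # (3, 4): the row is added to every row
print((matrix + col).shape)   # (3, 4): the column is added to every column
print((row + col).shape)      # (3, 4): (1, 4) and (3, 1) both stretch to (3, 4)
```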

## Axes

Every NumPy array has a number of axes.

![](https://cdn.shortpixel.ai/spai/w_1375+q_glossy+ret_img+to_auto/https://www.sharpsightlabs.com/wp-content/uploads/2018/11/numpy-arrays-have-axes.png)

To understand how axes work in `np.sum()`, we need to know what the `axis` parameter actually controls.

In `np.sum()`, the `axis` parameter controls which axis will be *aggregated*.

Said differently, the `axis` parameter controls which axis will be *collapsed*.

Remember, functions like `sum()`, `mean()`, `min()`, `median()`, and other statistical functions aggregate your data.

Imagine you have a set of 5 numbers. If you sum up those 5 numbers, the result will be a single number. Summation effectively aggregates your data. It collapses a large number of values into a single value.

Similarly, when you use `np.sum()` on a 2-d array with the `axis` parameter, it is going to collapse your 2-d array down to a 1-d array. It will collapse the data and reduce the number of dimensions.

Let's begin with a matrix.

In [9]:
np_array_2d = np.arange(0, 6).reshape([2,3])
print(np_array_2d)

[[0 1 2]
 [3 4 5]]


Summing the matrix without an `axis` argument aggregates over both axes.

In [10]:
print(np.sum(np_array_2d))

15


With an `axis` argument, the aggregation will only be performed along one axis.

In [11]:
print(np.sum(np_array_2d, axis = 0))
#
# 3  << 0 1 2
# 12 << 3 4 5
#       v v v
#       v v v
#       3 5 7
# Top to bottom is axis 0, and thus aggregated
print(np.sum(np_array_2d, axis = 1))

[3 5 7]
[ 3 12]
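
The same rule extends to higher dimensions: the axis passed to `np.sum()` is the one that disappears from the shape. A minimal sketch on an illustrative 3-d array `x`:

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)

print(np.sum(x, axis=0).shape)   # (3, 4): axis 0 is collapsed
print(np.sum(x, axis=1).shape)   # (2, 4): axis 1 is collapsed
print(np.sum(x, axis=2).shape)   # (2, 3): axis 2 is collapsed

# keepdims=True keeps the collapsed axis with length 1, which helps with broadcasting
print(np.sum(x, axis=1, keepdims=True).shape)   # (2, 1, 4)
```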


In addition to `np.sum()`, another important usage of array axes is concatenation.

In [12]:
np_array_1s = np.array([[1,1,1],[1,1,1]])
np_array_9s = np.array([[9,9,9],[9,9,9]])

If we concatenate arrays along `axis = 0`, we can use:

In [13]:
print(np.concatenate([np_array_1s, np_array_9s], axis = 0))
print(np.vstack([np_array_1s, np_array_9s]))

[[1 1 1]
 [1 1 1]
 [9 9 9]
 [9 9 9]]
[[1 1 1]
 [1 1 1]
 [9 9 9]
 [9 9 9]]


And similarly, you can expect identical results from the following:

In [14]:
print(np.concatenate([np_array_1s, np_array_9s], axis = 1))
print(np.hstack([np_array_1s, np_array_9s]))

[[1 1 1 9 9 9]
 [1 1 1 9 9 9]]
[[1 1 1 9 9 9]
 [1 1 1 9 9 9]]


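`np.split()` also works along an axis (`axis=0` by default). An index beyond the end of the axis, such as `114514` here, yields an empty sub-array: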

In [15]:
a, b, c = np.split(np_array_2d, [1, 114514])
print(a, b)

[[0 1 2]] [[3 4 5]]
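
For comparison, a small sketch of splitting along `axis=1`, plus the shape of the empty piece produced by an out-of-range index. The array `arr` and the index `99` are illustrative.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

# Split into three equal parts along the columns (axis 1)
left, middle, right = np.split(arr, 3, axis=1)
print(left.ravel(), middle.ravel(), right.ravel())   # [0 3] [1 4] [2 5]

# An out-of-range index produces an empty trailing piece
head, tail, empty = np.split(arr, [1, 99])
print(empty.shape)   # (0, 3)
```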


## Matrices

Next, we use `np.asmatrix` (or its alias `np.mat`, which was removed in NumPy 2.0) to convert arrays to matrices, and then compute Fibonacci numbers using fast matrix exponentiation (exponentiation by squaring).

In [16]:
a = np.asmatrix([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(a[..., :2])
print(a[2, ...])
print(a[1:, 1:])
print(a[[0,2,1], [0,2,1]])
print(a.T) # or np.transpose(a)
print(a.reshape(-1, 9))  # -1 lets NumPy infer this dimension

[[1 2]
 [4 5]
 [7 8]]
[[7 8 9]]
[[5 6]
 [8 9]]
[[1 9 5]]
[[1 4 7]
 [2 5 8]
 [3 6 9]]
[[1 2 3 4 5 6 7 8 9]]


In [17]:
def fib(n):
    if n==0:
        return 0
    elif n==1 or n==2:
        return 1
    base = np.asmatrix([[1,1],[1,0]])  # np.mat was removed in NumPy 2.0
    n = n - 2
    ans = np.asmatrix([[1,0],[0,1]])
    while n:
        if n&1:
            ans = base @ ans
        base = base @ base
        n = n>>1
 
    return ans[0,0] + ans[0,1]
 
if __name__=='__main__':
    for i in range(21):
        print(fib(i),end='\n')

0
1
1
2
3
5
8
13
21
34
55
89
144
233
377
610
987
1597
2584
4181
6765
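
As an alternative sketch, the same idea can be written with plain arrays and `np.linalg.matrix_power`, which also uses repeated squaring. `fib_matrix_power` is a hypothetical helper name, and with `int64` entries the result overflows beyond the 92nd Fibonacci number.

```python
import numpy as np

def fib_matrix_power(n):
    """n-th Fibonacci number via [[1, 1], [1, 0]] ** n (hypothetical helper)."""
    base = np.array([[1, 1], [1, 0]], dtype=np.int64)
    # Entry [0, 1] of base**n is F(n); base**0 is the identity, so F(0) = 0
    return int(np.linalg.matrix_power(base, n)[0, 1])

print([fib_matrix_power(i) for i in range(11)])
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```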


The following code shows a simple use of `LA.eig`, which returns the eigenvalues `w` and the corresponding right eigenvectors as the columns of `v`.

In [18]:
from numpy import linalg as LA
a = np.array([1, 2, 3, 1]).reshape(2, 2)
w, v = LA.eig(a)
print("w=", w)
a = np.array([[1, -2j], [2j, 5]])
w, v = LA.eig(a)
print("w=", w)
print("v=", v)

w= [ 3.44948974 -1.44948974]
w= [0.17157288+0.j 5.82842712+0.j]
v= [[ 0.92387953+0.j         -0.        -0.38268343j]
 [-0.        -0.38268343j  0.92387953+0.j        ]]
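
A quick sanity check of the result, sketched for the first matrix above: each column `v[:, i]` should satisfy `a @ v[:, i] == w[i] * v[:, i]`, and the eigenvalues of `[[1, 2], [3, 1]]` are $1 \pm \sqrt{6}$.

```python
import numpy as np
from numpy import linalg as LA

a = np.array([[1, 2], [3, 1]])
w, v = LA.eig(a)

# Column v[:, i] pairs with eigenvalue w[i], so a @ v equals v scaled column-wise by w
print(np.allclose(a @ v, v * w))   # True

# Compare against the closed-form eigenvalues 1 - sqrt(6) and 1 + sqrt(6)
print(np.allclose(np.sort(w), [1 - np.sqrt(6), 1 + np.sqrt(6)]))   # True
```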


## Saving

In [19]:
import numpy as np
a = np.array([1, 1, 4, 5, 1, 4])
np.save('outfile.npy', a)
b = np.load('outfile.npy')
print(b)

[1 1 4 5 1 4]
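
When several arrays belong together, `np.savez` can store them in a single `.npz` archive. A minimal sketch; the file name and the keys `first` and `second` are made up for this example, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile
import numpy as np

a = np.array([1, 1, 4, 5, 1, 4])
b = np.linspace(0.0, 1.0, 5)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'arrays.npz')   # hypothetical file name
    np.savez(path, first=a, second=b)            # keyword names become the keys
    with np.load(path) as data:
        loaded_a = data['first']
        loaded_b = data['second']

print(loaded_a)
print(loaded_b)
```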


## References

1. [NumPy Axes Explained](https://www.sharpsightlabs.com/blog/numpy-axes-explained/)

2. [Nuts and Bolts of NumPy Optimization Part 1: Understanding Vectorization and Broadcasting](https://blog.paperspace.com/numpy-optimization-vectorization-and-broadcasting/)

# Pandas

Pandas is a popular library for working with tabular data, including reading and writing `.csv` files.

`DataFrame` (2-d) and `Series` (1-d) are the two most important data types provided by Pandas.


In [20]:
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 40, 50])
print(s)

df1 = pd.DataFrame(
    {
        'A': [1, 2, 3],
        'B': pd.Series([3, 4, 5], index=[2,0,1]),
        'C': pd.Series([3, 4, 5], index=[0,1,2])
    }
)
df1

0     1.0
1     3.0
2     5.0
3     NaN
4    40.0
5    50.0
dtype: float64


|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 4 | 3 |
| 1 | 2 | 5 | 4 |
| 2 | 3 | 3 | 5 |
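
Note that column `B` comes out as 4, 5, 3 rather than 3, 4, 5: a `Series` is aligned by its index labels, not by position. A minimal sketch of the same effect, passing an explicit `index` to make the target labels unambiguous:

```python
import pandas as pd

b = pd.Series([3, 4, 5], index=[2, 0, 1])   # value 3 belongs to label 2, and so on
df = pd.DataFrame({'A': [1, 2, 3], 'B': b}, index=[0, 1, 2])

print(df['A'].tolist())   # [1, 2, 3]: a plain list is placed by position
print(df['B'].tolist())   # [4, 5, 3]: the Series is reordered to match labels 0, 1, 2
```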


Pandas automatically fills missing entries with `NaN`.

In [21]:
data2 = [
    {
        'a' : 1,
        'b' : 2
    },
    {
        'a' : 5,
        'b' : 10,
        'c' : 20
    }
]

df2 = pd.DataFrame(data2, index=['first', 'second'])
print(df2.to_csv(sep='\t'))

	a	b	c
first	1	2	
second	5	10	20.0
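
A short sketch of how the filled-in `NaN` can be detected and replaced. The fill value `0` is an arbitrary choice for illustration.

```python
import pandas as pd

records = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(records, index=['first', 'second'])

print(df['c'].isna().tolist())   # [True, False]
print(df['c'].dtype)             # float64, because NaN is a float

filled = df.fillna(0)            # returns a new frame; df itself is unchanged
print(filled)
```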



In [22]:
weather = pd.read_csv('LECTURE_3_FILES/seattle-weather.csv')
# It has lots of parameters to play with
weather

|   | date | precipitation | temp_max | temp_min | wind | weather |
|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle |
| 1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain |
| 2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain |
| 3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain |
| 4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain |
| ... | ... | ... | ... | ... | ... | ... |
| 1456 | 2015-12-27 | 8.6 | 4.4 | 1.7 | 2.9 | rain |
| 1457 | 2015-12-28 | 1.5 | 5.0 | 1.7 | 1.3 | rain |
| 1458 | 2015-12-29 | 0.0 | 7.2 | 0.6 | 2.6 | fog |
| 1459 | 2015-12-30 | 0.0 | 5.6 | -1.0 | 3.4 | sun |


In [24]:
print(weather.iloc[1])
print("------------------")
print(weather.loc[:10,'wind'])

date             2012-01-02
precipitation          10.9
temp_max               10.6
temp_min                2.8
wind                    4.5
weather                rain
Name: 1, dtype: object
------------------
0     4.7
1     4.5
2     2.3
3     4.7
4     6.1
5     2.2
6     2.3
7     2.0
8     3.4
9     3.4
10    5.1
Name: wind, dtype: float64
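
Notice that `weather.loc[:10, 'wind']` returns 11 rows: `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end. A minimal sketch on a small frame built from the first five `wind` values:

```python
import pandas as pd

df = pd.DataFrame({'wind': [4.7, 4.5, 2.3, 4.7, 6.1]})

by_label = df.loc[:3, 'wind']    # labels 0..3, end label included
by_position = df.iloc[:3, 0]     # positions 0..2, end position excluded

print(len(by_label), len(by_position))   # 4 3
```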


In [26]:
import time
np.random.seed(int(time.time()))
dates = pd.date_range('1/1/2020', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df

|   | A | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | -1.980533 | 1.466391 | -0.175762 | 1.008728 |
| 2020-01-02 | 0.115376 | -2.017288 | 0.199258 | -0.520119 |
| 2020-01-03 | -0.78483 | -2.133081 | -0.012582 | 0.642979 |
| 2020-01-04 | -2.086748 | -0.549511 | -0.021671 | -0.043258 |
| 2020-01-05 | 0.330752 | 1.130809 | -1.215201 | 0.727957 |
| 2020-01-06 | -0.861518 | 0.142694 | 1.11802 | -0.206136 |
| 2020-01-07 | -1.829993 | 0.007281 | 0.111569 | -2.026113 |
| 2020-01-08 | -1.012935 | 0.439106 | -1.439476 | 1.093754 |


In [27]:
df[['B', 'A']] = df[['A', 'B']]
df

|   | A | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | 1.466391 | -1.980533 | -0.175762 | 1.008728 |
| 2020-01-02 | -2.017288 | 0.115376 | 0.199258 | -0.520119 |
| 2020-01-03 | -2.133081 | -0.78483 | -0.012582 | 0.642979 |
| 2020-01-04 | -0.549511 | -2.086748 | -0.021671 | -0.043258 |
| 2020-01-05 | 1.130809 | 0.330752 | -1.215201 | 0.727957 |
| 2020-01-06 | 0.142694 | -0.861518 | 1.11802 | -0.206136 |
| 2020-01-07 | 0.007281 | -1.829993 | 0.111569 | -2.026113 |
| 2020-01-08 | 0.439106 | -1.012935 | -1.439476 | 1.093754 |
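
The assignment `df[['B', 'A']] = df[['A', 'B']]` swaps the two columns because plain `[]` assignment pairs the columns by position. A small sketch on a made-up frame; the `.loc` variant reflects my reading of the pandas indexing docs, where `.loc` aligns on column labels unless the right-hand side is converted to a raw array first.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

# Plain [] assignment pairs the columns by position, so this swaps A and B
df[['B', 'A']] = df[['A', 'B']]
after_swap = df.copy()
print(after_swap)

# With .loc, pass a raw array so that column labels are not re-aligned
df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()
print(df)   # swapped back to the original layout
```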


## Concatenate, Merge, Append

(Omitted, learn as we use)

## Stats

In [34]:
np.mean(weather['precipitation'])

3.02943189596167

In [36]:
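# numeric_only=True skips the string 'date' column, which pandas >= 2.0 would otherwise reject
weather.groupby(['weather']).mean(numeric_only=True)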

| weather | precipitation | temp_max | temp_min | wind |
|---|---|---|---|---|
| drizzle | 0.0 | 15.926415 | 7.111321 | 2.367925 |
| fog | 0.0 | 16.757426 | 7.979208 | 2.481188 |
| rain | 6.557878 | 13.454602 | 7.588768 | 3.669891 |
| snow | 8.553846 | 5.573077 | 0.146154 | 4.411538 |
| sun | 0.0 | 19.861875 | 9.34375 | 2.956406 |
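
A self-contained sketch of the same grouping on a made-up miniature table (the values in `mini` are invented for illustration, not taken from the dataset). It also shows `agg`, which computes several statistics per group at once.

```python
import pandas as pd

# Made-up miniature of the weather table, for illustration only
mini = pd.DataFrame({
    'date': ['2012-01-01', '2012-01-02', '2012-01-03', '2012-01-04'],
    'weather': ['sun', 'rain', 'rain', 'sun'],
    'wind': [2.0, 4.0, 6.0, 3.0],
})

# numeric_only=True skips the string 'date' column
print(mini.groupby('weather').mean(numeric_only=True))

# agg computes several statistics per group in one call
summary = mini.groupby('weather')['wind'].agg(['mean', 'max'])
print(summary)
```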


In [31]:
print(weather['precipitation'].isna().sum())  # number of missing values

0


## Mini-project

### Problem Description

Suppose we want to keep only the weekend rows of `weather` and, among them, find the days with `precipitation < 1.5`.
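
One possible approach, sketched on a made-up sample rather than the real file: parse the dates, build a boolean mask for Saturday and Sunday, and combine it with the precipitation condition. The frame `sample` and its values are invented for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Made-up sample; 2012-01-01 was a Sunday
sample = pd.DataFrame({
    'date': pd.date_range('2012-01-01', periods=8).strftime('%Y-%m-%d'),
    'precipitation': [0.0, 10.9, 0.8, 20.3, 1.3, 2.5, 0.0, 4.3],
})

dates = pd.to_datetime(sample['date'])   # parse strings into timestamps
is_weekend = dates.dt.dayofweek >= 5     # Monday=0 ... Saturday=5, Sunday=6
is_dry = sample['precipitation'] < 1.5

result = sample[is_weekend & is_dry]
print(result)
```

Here the two surviving rows are 2012-01-01 (Sunday) and 2012-01-07 (Saturday); 2012-01-08 is a weekend day but has too much precipitation.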

### Preprocessing

Usually we use `pickle` to store data.

In [37]:
import pickle