# Numpy

**Note:** From here on I will not link to the documentation for every piece of code unless something is noteworthy or is not covered by the built-in docstrings, which you can view by evaluating the function/object/value followed by "?". I will assume that you check the documentation of any function you see to get fully acquainted with it.

In [1]:
import numpy as np
from scipy import stats

## Creating arrays

NumPy arrays are a lot like a list, or a list of lists, but they have a number of powerful features and a couple of extra requirements. For one, all elements must share the same data type. In return, arrays can be much faster to process and do math on, if used correctly, because internally they are stored in contiguous blocks of memory.
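As a minimal sketch of the two points above (a single data type, and math on whole arrays at once), consider the following. The names `mixed`, `values`, and `doubled` are only illustrative.

```python
import numpy as np

# Mixing ints and a float: NumPy upcasts everything to one common dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)   # float64

# Math is applied to the whole array at once, with no explicit Python loop.
values = np.arange(5)
doubled = values * 2
print(doubled)       # [0 2 4 6 8]
```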

In [2]:
np.array([[1, 2], [3, 4]])

array([[1, 2],
       [3, 4]])

In [13]:
a = np.array(range(16), dtype=int)  # remember range(n) yields the integers 0..n-1
a

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

In [4]:
np.ones((2,2))

array([[1., 1.],
       [1., 1.]])

In [5]:
np.zeros((2, 2))

array([[0., 0.],
       [0., 0.]])

In [6]:
np.eye(4)  # Identity matrix of 4x4

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [2]:
ints = np.random.randint(0, 10, 30)    # generate 30 random integers between [0, 10)
ints

array([2, 8, 5, 4, 9, 6, 5, 3, 3, 5, 9, 6, 4, 3, 0, 4, 1, 5, 7, 5, 7, 6,
       6, 0, 9, 4, 2, 7, 1, 6])

In [5]:
np.random.normal?

In [49]:
normal = np.random.normal(0, 1, size=30)  # generate 30 random numbers from a normal distribution
normal

array([-0.78861389, -1.65467964,  0.92868347, -0.6898438 , -0.53691372,
        0.57141896,  0.89548532,  1.30527108,  1.57255966,  1.31249156,
       -0.44565146,  0.77366304, -1.29741191,  0.70353754,  0.41418583,
        0.42227107, -1.55318628, -0.14520119,  2.35831915,  0.77193308,
        0.50328883,  0.15983298, -0.65246706, -0.01850371, -0.31776145,
       -1.14796604,  0.6685199 ,  0.30669552, -2.06015986, -0.18829945])

## Loading from CSV

The first argument to `np.loadtxt` is the name of the file. The `delimiter` option defines how the elements are delimited, and `skiprows` lets you skip pesky header rows.
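To try these options without a file on disk, here is a small sketch that feeds `np.loadtxt` an in-memory text buffer. The `csv_text` string is a made-up stand-in with the same header layout as the file used here, not the real data.

```python
import io
import numpy as np

# Hypothetical stand-in for a small CSV file with one header row.
csv_text = "id,ct,secs,cl\n1000025,5,2,2\n1002945,5,7,2\n"

# np.loadtxt accepts a file name or a file-like object.
sample = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, dtype=int)
print(sample.shape)  # (2, 4)
print(sample[0])
```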

In [11]:
print(open("data.csv").read(100))

id,ct,secs,cl
1000025,5,2,2
1002945,5,7,2
1015425,3,2,2
1016277,6,3,2
1017023,4,2,2
1017122,8,7,4
10


In [7]:
np.loadtxt?

In [10]:
data=np.loadtxt("data.csv", delimiter=",", skiprows=1, dtype=int)
data

array([[1000025,       5,       2,       2],
       [1002945,       5,       7,       2],
       [1015425,       3,       2,       2],
       ...,
       [ 841769,       2,       2,       2],
       [ 888820,       5,       7,       4],
       [ 897471,       4,       3,       4]])

## Accessing arrays

In [8]:
data.shape   #tells us how big the array is

(645, 4)

In [9]:
data[0,0]   # access the first element in the first row

1000025

In [10]:
data[:,0]  # access the first column, all rows

array([ 1000025,  1002945,  1015425,  1016277,  1017023,  1017122,
        1018099,  1018561,  1033078,  1035283,  1036172,  1041801,
        1043999,  1044572,  1047630,  1048672,  1049815,  1050670,
        1050718,  1054590,  1054593,  1056784,  1057013,  1059552,
        1065726,  1066373,  1066979,  1067444,  1070935,  1071760,
        1072179,  1074610,  1075123,  1079304,  1080185,  1081791,
        1084584,  1091262,  1096800,  1099510,  1100524,  1102573,
        1103608,  1103722,  1105257,  1105524,  1106095,  1106829,
        1108370,  1108449,  1110102,  1110503,  1110524,  1111249,
        1112209,  1113038,  1113483,  1113906,  1115282,  1115293,
        1116116,  1116132,  1116192,  1116998,  1117152,  1118039,
        1120559,  1121732,  1121919,  1123061,  1124651,  1125035,
        1126417,  1131294,  1132347,  1133041,  1133136,  1136142,
        1137156,  1143978,  1147044,  1147699,  1147748,  1148278,
        1148873,  1152331,  1155546,  1156272,  1156948,  1157

In [11]:
data[0, :]   # access the first row, all columns

array([1000025,       5,       2,       2])

In [12]:
data[1, 1:-1]  # access the second row, starting from the second element to the next to last element

array([5, 7])

### Changing shape

Arrays can be quickly changed from one shape to another, provided both shapes hold the same number of elements.
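The element-count requirement can be seen in a short sketch: a reshape that keeps 16 elements works, while one that would need 15 raises an error. The names `arr` and `reshape_failed` are illustrative.

```python
import numpy as np

arr = np.arange(16)

# 16 elements fit a 2x8 shape...
print(arr.reshape((2, 8)).shape)  # (2, 8)

# ...but not a 3x5 shape (15 slots), so NumPy raises ValueError.
try:
    arr.reshape((3, 5))
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```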

In [15]:
a.reshape((8, 2))   # reshape into 8 rows and 2 columns, 8x2=16

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15]])

In [None]:
a


In [18]:
a = a.reshape((-1,4))  # -1 means use as many rows/columns as needed to
                        # make the defined dimensions correct; in short, that dimension is inferred from the others
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [21]:
a.reshape(-1)  # turn 2d array into 1d

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

In [22]:
a.reshape(4,2,-1)

array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11]],

       [[12, 13],
        [14, 15]]])

In [16]:
a.reshape((4,2,2))  # turn the array into a 3d array

array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11]],

       [[12, 13],
        [14, 15]]])

## Doing math with numpy

In [25]:
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [28]:
a+1

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16]])

In [24]:
b = np.array(range(4))
bCol = b.reshape((-1,1))
bCol

array([[0],
       [1],
       [2],
       [3]])

In [31]:
bRow = b.reshape((1, -1))
bRow

array([[0, 1, 2, 3]])

In [29]:
a + 5

array([[ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16],
       [17, 18, 19, 20]])

In [34]:
print(a)
print(bRow)
a + bRow

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
[[0 1 2 3]]


array([[ 0,  2,  4,  6],
       [ 4,  6,  8, 10],
       [ 8, 10, 12, 14],
       [12, 14, 16, 18]])

In [35]:
print(a)
print(bCol)
a + bCol

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
[[0]
 [1]
 [2]
 [3]]


array([[ 0,  1,  2,  3],
       [ 5,  6,  7,  8],
       [10, 11, 12, 13],
       [15, 16, 17, 18]])

In [36]:
a * bCol  # element-wise multiplication

array([[ 0,  0,  0,  0],
       [ 4,  5,  6,  7],
       [16, 18, 20, 22],
       [36, 39, 42, 45]])

In [37]:
a @ bCol # matrix multiplication (dot product of each row of a with bCol)

array([[14],
       [38],
       [62],
       [86]])

In [39]:
a*b


array([[ 0,  1,  4,  9],
       [ 0,  5, 12, 21],
       [ 0,  9, 20, 33],
       [ 0, 13, 28, 45]])
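The cells above rely on NumPy's broadcasting: when shapes differ, a dimension of size 1 is stretched to match the other operand. A small sketch, using illustrative names (`grid`, `row`, `col`, `flat`) rather than the notebook's own variables:

```python
import numpy as np

grid = np.arange(16).reshape((4, 4))
row = np.arange(4).reshape((1, 4))   # stretched down the rows
col = np.arange(4).reshape((4, 1))   # stretched across the columns

print((grid + row).shape)  # (4, 4)
print((row + col).shape)   # (4, 4): both operands are stretched

# A 1-D array of length 4 is treated like a (1, 4) row.
flat = np.arange(4)
print(np.array_equal(grid * flat, grid * row))  # True
```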

In [41]:
# Array masks
print(a)
print(a>5)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
[[False False False False]
 [False False  True  True]
 [ True  True  True  True]
 [ True  True  True  True]]


In [42]:
mask = a>5
a[mask] # the result is a 1-D array

array([ 6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

### Using booleans as selectors (array masks)

Just like we can do int/float math on whole arrays at once, we can also do boolean math on those same arrays. The results of those operations can then be used to mask or select individual elements in the array.
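As a rough illustration of boolean math on arrays, the sketch below combines two conditions into one mask and then uses a mask for assignment. The names `grid`, `in_range`, and `clipped` are illustrative.

```python
import numpy as np

grid = np.arange(16).reshape((4, 4))

# Combine conditions with & (and), | (or), ~ (not); parentheses are required
# because these operators bind more tightly than comparisons.
in_range = (grid > 5) & (grid < 10)
print(grid[in_range])        # [6 7 8 9]

# A mask can also be used to assign to the selected elements.
clipped = grid.copy()
clipped[clipped > 10] = 10
print(clipped.max())         # 10
```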

In [44]:
print(data)
print(data[:,2])  # third column (index 2), all rows

[[1000025       5       2       2]
 [1002945       5       7       2]
 [1015425       3       2       2]
 ...
 [ 841769       2       2       2]
 [ 888820       5       7       4]
 [ 897471       4       3       4]]
[ 2  7  2  3  2  7  2  2  2  1  2  2  2  7  6  2  2  4  2  5  6  2  2  2
  2  1  2  2  1  2  8  2  2  2  6  1  2  6  6  3  8 10  8  2  4  2  2  4
  2  2  3 10  8  4  3  5  6  2  3  2 10  5  2  3  2  8  4  2  2 10  2  6
  3  2  2  2  2  3  1  2  2  8 10  5  5  2  3  2  2  2  2  2  2  2  2 10
  5 10  2  2  6 10  3  2  8  2  2  5  2 10  2  2  4  4  2  2 10  8  9  2
  4  2  5 10  2  2  8  2  3  2  2  2  2  1  2  2  4  2  2  2  6  3  8 10
  1  6  4  2  2  3  2  2  3  6  5  2  2  1  2  2  8  6  2  1  2  2  2  8
  3 10  2  8  2  6  1  2  2  5  4  1  5  6  5  2  6 10  2  2  2  4  2  2
  2  5 10  2  2  2  6  7  1  1  5  4  2  7  3  5  2  2  6  2  2 10  1  3
  3  2  4  3  1 10  3  6  3  5  3  2  8  6  6  3  2  1  2  2  2  2  5  2
  2  2  2  2  5  3 10  5  6 10  3  3  2  3  2  3  2  2

In [46]:
evens = (data[:, 2] % 2 == 0)  # n % 2 == 0 iff n is even
print(evens[:5])      # just look at the first couple elements to see the data type
print(np.sum(evens))  # when summing booleans True=1 False=0, so we count the # of trues here

[ True False  True False  True]
485


In [49]:
even_data = data[evens]
print(even_data.shape)

(485, 4)


In [27]:
print(data.shape)       # original size was this
print(even_data.shape)  # this matches the number of evens

(645, 4)
(485, 4)


In [28]:
even_data[:5]  # what does the data look like?

array([[1000025,       5,       2,       2],
       [1015425,       3,       2,       2],
       [1017023,       4,       2,       2],
       [1018099,       1,       2,       2],
       [1018561,       2,       2,       2]])

### Statistics

In [29]:
np.min(a), np.max(a)

(0, 15)

In [30]:
a.mean()

7.5

In [31]:
a.std()

4.6097722286464435

In [54]:
ints = np.random.randint(0, 10, 30)

In [52]:
stats.mode?

In [55]:
ints

array([8, 8, 0, 2, 2, 6, 4, 3, 0, 1, 0, 6, 1, 6, 5, 1, 2, 0, 2, 6, 3, 1,
       7, 6, 3, 3, 7, 0, 1, 2])

In [56]:
print(stats.mode(ints).mode) # the mode; stats.mode returns two values, mode and count
print(stats.mode(ints).count)

[0]
[6]


Most of the functions provided in numpy can either operate on an entire array or a certain axis.  To engage the latter behaviour we use the axis argument:
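A minimal sketch of the `axis` argument on a small, made-up array (`table` is an illustrative name): `axis=0` collapses the rows and gives one result per column, while `axis=1` collapses the columns and gives one result per row.

```python
import numpy as np

table = np.array([[1, 9, 3],
                  [7, 2, 8]])

print(table.min())        # 1, over the whole array
print(table.min(axis=0))  # [1 2 3], one value per column
print(table.min(axis=1))  # [1 2], one value per row
```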

In [33]:
data.min(axis=0) # minimum of each column

array([61634,     1,     1,     2])

In [58]:
data.max(axis=1)

array([ 1000025,  1002945,  1015425,  1016277,  1017023,  1017122,
        1018099,  1018561,  1033078,  1035283,  1036172,  1041801,
        1043999,  1044572,  1047630,  1048672,  1049815,  1050670,
        1050718,  1054590,  1054593,  1056784,  1057013,  1059552,
        1065726,  1066373,  1066979,  1067444,  1070935,  1071760,
        1072179,  1074610,  1075123,  1079304,  1080185,  1081791,
        1084584,  1091262,  1096800,  1099510,  1100524,  1102573,
        1103608,  1103722,  1105257,  1105524,  1106095,  1106829,
        1108370,  1108449,  1110102,  1110503,  1110524,  1111249,
        1112209,  1113038,  1113483,  1113906,  1115282,  1115293,
        1116116,  1116132,  1116192,  1116998,  1117152,  1118039,
        1120559,  1121732,  1121919,  1123061,  1124651,  1125035,
        1126417,  1131294,  1132347,  1133041,  1133136,  1136142,
        1137156,  1143978,  1147044,  1147699,  1147748,  1148278,
        1148873,  1152331,  1155546,  1156272,  1156948,  1157

External References:
* [Numpy reference](https://docs.scipy.org/doc/numpy/reference/)
* [Scipy reference](https://docs.scipy.org/doc/scipy/reference/)
* [Numpy cheat sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)