# NumPy

## NumPy

### NumPy

NumPy is a Python package for numerical computing. It provides an alternative to the Python list: the NumPy array.

In [6]:
import numpy as np

### NumPy Array

The NumPy array is a data structure very similar to a Python list: it is an ordered sequence of values, each accessed by its index.

In [7]:
# Use np.array() to create a numpy array from baseball 
# Name this array np_baseball
baseball = [180, 215, 210, 210, 188, 176, 209, 200]
np_baseball = np.array(baseball)

print(np_baseball, type(np_baseball))

[180 215 210 210 188 176 209 200] <class 'numpy.ndarray'>


### Element-wise operations

NumPy lets you apply an operation to every value in an array at once, without writing a loop.

In [14]:
import pandas as pd
baseball_df = pd.read_csv('./data/baseball.csv')

np_height_in = np.array(baseball_df['Height'])
np_height_in[0:5]

array([74, 74, 72, 72, 73])

In [15]:
# Multiply np_height_in by 0.0254 to convert all height measurements from inches to meters
np_height_m = np_height_in * 0.0254
np_height_m[0:5]

array([1.8796, 1.8796, 1.8288, 1.8288, 1.8542])

In [16]:
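# Multiply np_weight_lb by 0.453592 to go from pounds to kilograms
# Note: baseball_df['Weight'] is a pandas Series, not a NumPy array, so the results below are Series too
np_weight_lb = baseball_df['Weight']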
np_weight_kg = np_weight_lb * 0.453592

# Calculate the BMI of each player
bmi = np_weight_kg/np_height_m**2
bmi[0:5]

0    23.110376
1    27.604061
2    28.480805
3    28.480805
4    24.803335
Name: Weight, dtype: float64
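
The same calculation can be sketched on plain NumPy arrays, with no CSV file required. The five heights are the first values shown above, and the weights are the ones consistent with the BMI output; the variable names are illustrative only.

```python
import numpy as np

# First five players (heights in inches, weights in pounds)
height_in = np.array([74, 74, 72, 72, 73])
weight_lb = np.array([180, 215, 210, 210, 188])

# Each operator is applied to every element; no explicit loop is needed
height_m = height_in * 0.0254
weight_kg = weight_lb * 0.453592
bmi_sample = weight_kg / height_m ** 2

print(bmi_sample)
```

Because both operands of the division have the same length, NumPy pairs them up position by position.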

### Array filtering

With NumPy arrays, you can select elements based on a condition. This involves creating a boolean mask and passing it inside square brackets.

In [17]:
# Create a boolean mask: each element is True if the corresponding baseball player's BMI is below 21
light = bmi < 21
light

0       False
1       False
2       False
3       False
4       False
        ...  
1010    False
1011    False
1012    False
1013    False
1014    False
Name: Weight, Length: 1015, dtype: bool

In [18]:
# Print out the BMIs of all baseball players whose BMI is below 21
bmi[light]

13     20.542557
89     20.542557
159    20.692820
271    20.692820
279    20.343432
357    20.343432
499    20.692820
658    20.158835
742    19.498447
793    20.692820
906    20.920522
Name: Weight, dtype: float64
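
A minimal sketch of the same masking pattern on a pure NumPy array follows. The BMI values are made up for illustration and do not come from the dataset.

```python
import numpy as np

bmi_values = np.array([23.1, 20.5, 27.6, 19.5, 24.8])  # made-up sample values

# Comparing an array with a scalar yields a boolean array of the same shape
is_light = bmi_values < 21

# Indexing with the mask keeps only the positions where the mask is True
light_bmis = bmi_values[is_light]

print(is_light)
print(light_bmis)
```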

### Type coercion
A NumPy array can hold only a single data type at a time. This restriction optimizes memory usage and computational performance, and it makes element-wise operations possible. If you try to build an array from values of more than one data type, some of the elements are converted so that the resulting array is homogeneous. This is known as type coercion.
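
As a small illustration, the sketch below builds three arrays from mixed inputs and inspects the resulting `dtype`. The variable names are illustrative, and the exact integer and string widths printed depend on the platform.

```python
import numpy as np

ints_and_floats = np.array([1, 2, 3.5])         # integers are promoted to floats
bools_and_ints = np.array([True, 1, 2])         # booleans are promoted to integers
numbers_and_text = np.array([1.0, "is", True])  # everything becomes a string

print(ints_and_floats.dtype)
print(bools_and_ints.dtype)
print(numbers_and_text.dtype)
```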

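### Operator behavior

The typical arithmetic operators, such as `+`, `-`, `*` and `/`, behave differently on regular Python lists and NumPy arrays. On lists, `+` concatenates and `*` repeats, while `-` and `/` are not defined; on NumPy arrays, all four operate element-wise.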

In [20]:
[True, 1, 2] + [3, 4, False]

[True, 1, 2, 3, 4, False]

In [19]:
np.array([True, 1, 2]) + np.array([3, 4, False])

array([4, 5, 2])
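
The contrast can be seen side by side in the short sketch below, which uses illustrative names and also shows `*`.

```python
import numpy as np

a_list = [1, 2, 3]
an_array = np.array(a_list)

print(a_list + a_list)      # list: concatenation
print(an_array + an_array)  # array: element-wise sum
print(a_list * 2)           # list: repetition
print(an_array * 2)         # array: element-wise product
```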

### Subsetting NumPy arrays

For one-dimensional data, subsetting with square brackets uses the same indexing and slicing syntax on lists and arrays.

In [21]:
# Subset np_weight_lb by printing out the element at index 50
np_weight_lb[50]

200

In [22]:
# Print out a sub-array of np_height_in that contains the elements at index 100 up to and including index 110
np_height_in[100:111]

array([73, 74, 72, 73, 69, 72, 73, 75, 75, 73, 72])
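
The syntax matches, but one behavioral difference is worth knowing: slicing a list creates a new list, whereas slicing a NumPy array returns a view onto the same data. The sketch below uses made-up values to show the effect.

```python
import numpy as np

values_list = [10, 20, 30, 40]
values_array = np.array(values_list)

list_slice = values_list[1:3]    # a new, independent list
array_slice = values_array[1:3]  # a view onto values_array

list_slice[0] = -1
array_slice[0] = -1

print(values_list)   # the original list is unchanged
print(values_array)  # the original array now holds -1 at index 1
```

Calling `.copy()` on the slice is the usual way to get an independent array.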

## 2D NumPy Arrays

### 2D NumPy array

A 2D NumPy array can be built from a list of lists. It is a rectangular data structure: each sublist corresponds to a row, and all rows must have the same length.

In [24]:
# Use np.array() to create a 2D numpy array from baseball 
# Name it np_baseball
baseball = [[180, 78.4],
            [215, 102.7],
            [210, 98.5],
            [188, 75.2]]

np_baseball = np.array(baseball)
np_baseball

array([[180. ,  78.4],
       [215. , 102.7],
       [210. ,  98.5],
       [188. ,  75.2]])

### Shape

In [26]:
# Print out the shape attribute of np_baseball
np_baseball.shape

(4, 2)

### Subsetting 2D arrays

Because a 2D NumPy array is rectangular, i.e. every row has the same number of values and so does every column, selections that would be awkward with nested lists become very easy.

The indexes before the comma refer to the rows, while those after the comma refer to the columns.

In [49]:
baseball_2d = np.array([np_height_in, np_weight_lb]).T
baseball_2d.shape

(1015, 2)

In [54]:
# Print out the 50th row of baseball_2d
baseball_2d[49]

array([ 70, 195])

In [51]:
# Print the second column of baseball_2d
baseball_2d[:, 1]

array([180, 215, 210, ..., 205, 190, 195])

In [53]:
# Select the height (first column) of the 124th baseball player in baseball_2d and print it out
baseball_2d[123, 0]

75
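
The `[rows, columns]` pattern can be tried on a small array without loading the dataset. The three rows below are illustrative height and weight pairs, and the names are hypothetical.

```python
import numpy as np

grid = np.array([[74, 180],
                 [72, 210],
                 [73, 188]])

first_row = grid[0]       # shorthand for grid[0, :]
second_col = grid[:, 1]   # every row, column 1
single_value = grid[2, 0] # row 2, column 0
sub_block = grid[0:2, :]  # rows 0 and 1, all columns

print(first_row, second_col, single_value)
print(sub_block)
```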

### 2D arithmetic

In [55]:
# Convert the units of height and weight in baseball_2d to metric system units
conversion = [0.0254, 0.453592]
baseball_2d_metric = baseball_2d * conversion
baseball_2d_metric 

array([[ 1.8796 , 81.64656],
       [ 1.8796 , 97.52228],
       [ 1.8288 , 95.25432],
       ...,
       [ 1.905  , 92.98636],
       [ 1.905  , 86.18248],
       [ 1.8542 , 88.45044]])
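
The multiplication above works because of broadcasting: the two-element `conversion` list is treated as an array of shape `(2,)` and applied to every row of the `(1015, 2)` array. A minimal sketch on a two-row array, with illustrative names, looks like this:

```python
import numpy as np

players = np.array([[74, 180],
                    [72, 210]])          # shape (2, 2): inches, pounds
factors = np.array([0.0254, 0.453592])  # shape (2,): one factor per column

# factors is stretched across both rows, so each column gets its own factor
metric = players * factors

print(metric)
```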

## NumPy: Basic statistics

### Mean

In [56]:
# Print out the mean of np_height_in
np.mean(np_height_in)

73.6896551724138

### Median

In [57]:
# Print out the median of np_height_in
np.median(np_height_in)

74.0

### Standard deviation

In [58]:
# Print out the standard deviation of np_height_in
np.std(np_height_in)

2.312791881046546
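
Note that `np.std` computes the population standard deviation by default (it divides by N). Passing `ddof=1` gives the sample standard deviation, which divides by N - 1. The sketch below uses five illustrative heights.

```python
import numpy as np

sample = np.array([74, 74, 72, 72, 73])

population_std = np.std(sample)      # default ddof=0: divide by N
sample_std = np.std(sample, ddof=1)  # divide by N - 1

print(population_std, sample_std)
```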

### Correlation coefficient 

In [59]:
# Print out correlation between first and second column of baseball_2d
np.corrcoef(baseball_2d[:, 0], baseball_2d[:, 1])

array([[1.        , 0.53153932],
       [0.53153932, 1.        ]])
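
`np.corrcoef` returns a symmetric matrix: the diagonal holds each variable's correlation with itself (always 1), and the off-diagonal entries hold the Pearson correlation between the two variables. As a rough check, the off-diagonal value can be reproduced by hand; the data below is a small illustrative sample, not the full dataset.

```python
import numpy as np

x = np.array([74, 74, 72, 72, 73])       # heights (illustrative)
y = np.array([180, 215, 210, 210, 188])  # weights (illustrative)

r_matrix = np.corrcoef(x, y)

# Pearson correlation computed directly from deviations around the mean
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

print(r_matrix[0, 1], r_manual)
```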