# Programming for Business Analytics (Python)

In [None]:
# This code appears in every demonstration Notebook.
# By default, when you run a cell, only the output of its last expression is shown.
# This setting makes a cell show the output of every expression.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

<p style="font-family: Arial; font-size:3.75em;color:purple; font-style:bold"><br>
Introduction to numpy:
</p><br>

<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold"><br>
Package for scientific computing with Python
</p><br>

Numerical Python, or "NumPy" for short, is a foundational package on which many of the most common data science packages are built.  NumPy provides high-performance multi-dimensional arrays, which we can use as vectors or matrices.  

The key features of numpy are:

- ndarrays: n-dimensional arrays of the same data type which are fast and space-efficient.  There are a number of built-in methods for ndarrays which allow for rapid processing of data without using loops (e.g., compute the mean).
- Vectorization: enables element-wise numeric operations on whole ndarrays without writing explicit Python loops.
- Broadcasting: a useful tool which defines implicit behavior between multi-dimensional arrays of different sizes.
- Input/Output: simplifies reading and writing of data from/to file
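
The four features listed above can be sketched in a few lines. This is only an illustration, not part of the original demonstration: the variable names and the sales figures are made up, and an in-memory `io.StringIO` buffer stands in for a file on disk.

```python
import io
import numpy as np

# ndarray: a single data type, with built-in methods that avoid explicit loops
sales = np.array([120.0, 95.5, 130.25, 101.0])   # hypothetical values
average = sales.mean()

# Vectorization: one expression is applied to every element
with_tax = sales * 1.1

# Broadcasting: a (2, 3) array and a (3,) array combine without a loop
grid = np.ones((2, 3))
shifted = grid + np.array([10, 20, 30])

# Input/Output: write an array as text and read it back
buffer = io.StringIO()        # stands in for a file on disk
np.savetxt(buffer, shifted)
buffer.seek(0)
restored = np.loadtxt(buffer)

print(average)
print(with_tax)
print(shifted)
print(restored)
```

Each of these ideas is covered in more detail later in this notebook.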

<b>Additional Recommended Resources:</b><br>
<a href="https://docs.scipy.org/doc/numpy/reference/">Numpy Documentation</a><br>
<i>Python for Data Analysis</i> by Wes McKinney<br>
<i>Python Data Science Handbook</i> by Jake VanderPlas


<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>

Getting started with ndarray<br><br></p>

**ndarrays** are time and space-efficient multidimensional arrays at the core of numpy.  As with the data structures in Week 2, let's get started by creating ndarrays using the numpy package.

In [None]:
import numpy as np # To simplify package name as 'np'

# The function array() creates arrays. It belongs to the numpy package,
# so you call it as np.array(). It turns lists into n-dimensional arrays.
# Let's first create a rank 1 array from a 1-dimensional list.
an_array = np.array([3,33,333,3333])
an_array
type(an_array)

In [None]:
# We can use shape to describe the dimensions of an array.
# shape is an attribute of an array object. A one-dimensional array with n elements has the shape (n,).
an_array.shape

In [None]:
# Accessing array elements is similar to lists. Use index.
an_array[0]
an_array[1]

In [None]:
# Like lists, ndarrays are mutable. We can change the value of any element easily.
an_array[0] = 888
an_array

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

A rank 2 **ndarray** is one with two dimensions.  Notice the format below of [ [row] , [row] ].  Two-dimensional arrays are well suited to representing matrices, which are common in data science.
</p>

In [None]:
# A rank 2 array can be created from a 2-dimensional list. Each sublist is a row.
# The sublists must all have the same length.
another = np.array([[1,2,3],[11,12,13]])
another.shape
print(another)

In [None]:
# An index has two components: row and column. The row is the index of the sublist in the main list;
# the column is the index of the element within that sublist.
another[0, 1]
another[1, 1]

# arr[i, j] is equivalent to arr[i][j], but more efficient.
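
The note on `arr[i, j]` versus `arr[i][j]` can be checked with a small sketch. The array name `matrix` is made up for this illustration. Both forms return the same element, but the chained form first builds an intermediate one-dimensional row and then indexes into it, which is the extra work the comment refers to.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [11, 12, 13]])

# One indexing operation with a (row, column) pair
direct = matrix[1, 2]

# Two steps: matrix[1] first produces the whole row, then [2] picks from it
intermediate_row = matrix[1]
chained = intermediate_row[2]

print(direct, chained)
print(intermediate_row.shape)
```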

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

There are many ways to create numpy arrays:
</p>

Here we create several arrays with different shapes and different pre-filled values.  numpy has a number of built-in functions which help us quickly and easily create multidimensional arrays.

In [None]:
# Create arrays with predefined values. The shape of the array is passed as the argument.
# zeros
arr1 = np.zeros((2,2))
# ones
arr2 = np.ones((1,2))
# Filled with any specific value
arr3 = np.full((2,2), 9.0)
# create an array of random floats between 0 and 1
arr4 = np.random.random((2,2))
# random.random() returns random floats in the half-open interval [0.0, 1.0).
arr1
arr2
arr3
arr4

In [None]:
# Notice that arr2 above has only one row, but it is still a rank 2 array of shape (1, 2).
# That is different from a rank 1 array of shape (2,).
arr2.shape
# We need to use two indices to access an element
arr2[0, 1]
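
As a hedged sketch of the difference between shape `(2,)` and shape `(1, 2)`, the names `flat` and `row` below are made up for illustration. The `ndim` attribute reports the rank, and `reshape` converts between the two layouts without changing the values.

```python
import numpy as np

flat = np.ones(2)        # shape (2,): rank 1, accessed with one index
row = np.ones((1, 2))    # shape (1, 2): rank 2, accessed with two indices

print(flat.ndim, flat.shape)
print(row.ndim, row.shape)

# reshape converts between the two layouts; the values stay the same
row_from_flat = flat.reshape(1, 2)
flat_from_row = row.reshape(2)
print(row_from_flat.shape, flat_from_row.shape)
```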

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Datatypes: All items in an array must have the same data type. Because NumPy is implemented in C, its arrays use fixed-size numeric types (for example, int64 and float64) rather than Python's built-in int and float. Please see the slides for details.  
</p>
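
A minimal sketch of how data types are chosen and converted follows. The array names are made up for illustration. The exact integer type NumPy picks by default can depend on the platform, so the example prints it rather than assuming a specific one.

```python
import numpy as np

ints = np.array([1, 2, 3])      # all integers -> an integer dtype
floats = np.array([1.0, 2.0])   # all floats -> float64
mixed = np.array([1, 2.5])      # mixed input is upcast to float64

print(ints.dtype, floats.dtype, mixed.dtype)

# astype returns a converted copy; float -> int drops the fractional part
values = np.array([3.7, -3.7])
truncated = values.astype(np.int64)   # rounds toward zero
floored = np.floor(values)            # rounds toward negative infinity
print(truncated, floored)
```

Note that the conversion to an integer type and `np.floor` agree for positive numbers but differ for negative ones.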

In [None]:
# NumPy infers the data type from the values when you create an array
arr1.dtype

In [None]:
an_array.dtype

In [None]:
# You can also specify the data type when you generate arrays
an_array = np.array([3,33,333,3333], dtype = np.int64)
an_array.dtype

In [None]:
# You can force floats into integers; the fractional part is dropped (truncation toward zero)
an_array = np.array([3.3,33.5,333.6,3333], dtype = np.int64)
an_array

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>
Slice indexing
</p>

Similar to the use of slice indexing with lists and strings, we can use slice indexing to pull out sub-regions of ndarrays.

In [None]:
# Rank 2 array of shape (3, 4)
an_array = np.array([[11,12,13,14], [21,22,23,24], [31,32,33,34]])
print(an_array)
# Use array slicing to get a subarray consisting of 2 rows x 2 columns. Slicing each dimension
# is similar to slicing a list.
a_slice = an_array[:2, 1:3]
a_slice

In [None]:
# An array slice differs from a list slice: it is a view of the original data, not a copy.
# When you modify an array slice, the original array is also modified.
a_slice[0, 0] = 1000
a_slice

In [None]:
an_array

In [None]:
# By contrast, a list slice is a copy: changing listb leaves lista unchanged.
lista = [1,2,3,4]
listb = lista[1:3]
listb[0] = 88

In [None]:
lista
listb

In [None]:
# To maintain the original array, we can make copies using copy()
another_slice = an_array[:2, 1:3].copy()

print("Before:", an_array)
another_slice[0,0] = 1000
print("After:", an_array)
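
One way to confirm whether a slice is a view or a copy is `np.shares_memory`. The sketch below uses made-up names (`base`, `view`, `independent`) and assumes nothing beyond the slicing and `copy()` behaviour shown above.

```python
import numpy as np

base = np.array([[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]])
view = base[:2, 1:3]                 # a slice is a view onto base
independent = base[:2, 1:3].copy()   # copy() allocates new memory

print(np.shares_memory(base, view))
print(np.shares_memory(base, independent))

independent[0, 0] = -1   # base is unaffected
view[0, 0] = 1000        # base[0, 1] changes as well
print(base)
```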

In [None]:
# You can generate an array of lower rank by indexing one dimension with a single integer
row = an_array[1, :]
print(row, row.shape)
# To maintain the rank, slice with a range instead
row2 = an_array[1:2, :]
print(row2, row2.shape)
# Try with the other dimension:
col = an_array[:, 1]
col

Sometimes it's useful to index with arrays of indices to access or change several elements at once.

In [None]:
# Create a new array
an_array = np.array([[11,12,13], [21,22,23], [31,32,33], [41,42,43]])

print('Original Array:')
print(an_array)

In [None]:
# Create arrays of indices
col_indice = np.array([0,1,2,0])
row_indice = np.arange(4)

In [None]:
for i in zip(row_indice, col_indice):
    print(i)

In [None]:
# Examine the pairings of row_indice and col_indice.  These are the elements we'll change next.
# The zip function creates a zip object, an iterator of tuples, which you can loop over.
for row, col in zip(row_indice, col_indice):
    print(row, ',', col)
    print(an_array[row, col])

In [None]:
# Select one element from each row
an_array[row_indice, col_indice]

In [None]:
# Change one element from each row using the indices selected
an_array[row_indice, col_indice] += 10000
print('Changed array:', an_array)

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>
Boolean Indexing
</p>
<p> We can select items based on conditions. A condition creates a boolean array (mask) that selects elements from the array.
</p>
<br>

In [None]:
# A condition on an array creates a True/False array.
arrfilter = (an_array >15)
arrfilter

Notice that the filter is an ndarray of the same shape as an_array. It holds True wherever the corresponding element of an_array is greater than 15 and False wherever it is 15 or less.

In [None]:
# The filter array can be used to select just those elements which meet that criteria
print(an_array[arrfilter])

In [None]:
# You can use condition expressions directly as the index.
an_array[an_array>15]
an_array[an_array % 2 == 0]

In [None]:
# It is useful to select elements and make changes
an_array[an_array % 2 == 0] += 1

In [None]:
an_array
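
Conditions can also be combined. The following is a small sketch on a made-up array named `data`. It assumes element-wise `&` (and) and `|` (or) rather than Python's `and`/`or`, which do not work element by element on arrays. `np.where` is shown as one way to replace values without modifying the original.

```python
import numpy as np

data = np.array([[11, 12, 13], [21, 22, 23], [31, 32, 33]])

# Combine conditions with & and |; each condition needs its own parentheses
mask = (data > 15) & (data % 2 == 0)
selected = data[mask]
print(selected)

# np.where builds a new array: keep values above 15, replace the rest with 0
relabelled = np.where(data > 15, data, 0)
print(relabelled)
```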

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>
Array Universal Functions
</p>
<p>    
Universal functions (ufuncs) implement vectorized operations on arrays. This vectorized approach is designed to push the loop into the compiled layer that underlies NumPy, leading to much faster execution.

</p>

In [None]:
# np.int was removed in NumPy 1.24; use an explicit type such as np.int64
x = np.array([[111,112],[121,122]], dtype=np.int64)
y = np.array([[211.1,212.1],[221.1,222.1]], dtype=np.float64)

print(x)
print()
print(y)

In [None]:
# To sum the two arrays element by element
print(x + y)         # The plus sign works
print()
print(np.add(x, y))  # so does the numpy function "add"

In [None]:
# subtract
print(x - y)
print()
print(np.subtract(x, y))

In [None]:
# multiply
print(x * y)
print()
print(np.multiply(x, y))

In [None]:
# divide
print(x / y)
print()
print(np.divide(x, y))

In [None]:
# square root
print(np.sqrt(x))

In [None]:
# exponent (e ** x)
print(np.exp(x))

There are many other universal functions: taking absolute value, logarithms and others. For details, please visit https://numpy.org/doc/stable/reference/ufuncs.html
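
As a brief illustration of the other universal functions mentioned above, the sketch below applies `np.abs`, `np.log`, and `np.maximum` to a made-up array. Each call works element by element and returns a new array.

```python
import numpy as np

values = np.array([-4.0, 1.0, 9.0])

absolute = np.abs(values)            # absolute value of each element
logs = np.log(absolute)              # natural logarithm of each element
clipped = np.maximum(values, 0.0)    # element-wise maximum against 0

print(absolute)
print(logs)
print(clipped)
```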

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">

Let's explore the efficiency of universal functions

</p>

In [None]:
# Use a loop to compute the reciprocal of each element of an array
def compute_reciporcals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0/values[i]
    return output

values = np.random.randint(1, 10, size = 5)
compute_reciporcals(values)

In [None]:
big_array = np.random.randint(1, 100, size = 1000000)
%time compute_reciporcals(big_array)

In [None]:
%time (1/big_array)
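
The `%time` magic only works inside IPython or Jupyter. As an alternative sketch, the standard-library `timeit` module can make the same comparison in plain Python. The function name `reciprocals_loop` and the smaller array size are choices made for this illustration, and the measured times will vary from machine to machine.

```python
import timeit
import numpy as np

def reciprocals_loop(values):
    """Compute 1/x for each element with an explicit Python loop."""
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

sample = np.random.randint(1, 100, size=100_000)

# Run each version three times and report the total elapsed seconds
loop_seconds = timeit.timeit(lambda: reciprocals_loop(sample), number=3)
vector_seconds = timeit.timeit(lambda: 1.0 / sample, number=3)
print("loop:      ", loop_seconds)
print("vectorized:", vector_seconds)

# Both approaches produce the same values
same_result = np.allclose(reciprocals_loop(sample), 1.0 / sample)
print(same_result)
```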

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Aggregation functions</p>
<br>
<p>    
We can get summary statistics for arrays. Note how the axis reference works. <br><br>
</p>

In [None]:
# setup a random 2 x 5 matrix
arr = 10 * np.random.randn(2,5)
# randn returns a sample (or samples) from the “standard normal” distribution.
print(arr)

In [None]:
# compute the mean for all elements
print(arr.mean())

In [None]:
# We can aggregate along one dimension. axis = 0 runs down the rows; axis = 1 runs across the columns.
# The axis you pass is the one that gets collapsed.
# Mean of each row: average across the columns within that row, so axis = 1.
print(arr.mean(axis = 1))

In [None]:
# Mean of each column: average down the rows within that column.
# The rows are collapsed, so axis = 0.
print(arr.mean(axis = 0))
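
Because the random matrix above changes on every run, a small fixed array makes the axis rule easier to verify by hand. The name `table` and its values are made up for this sketch.

```python
import numpy as np

table = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])   # shape (2, 3)

# axis = 0 collapses the rows: one result per column
column_means = table.mean(axis=0)

# axis = 1 collapses the columns: one result per row
row_means = table.mean(axis=1)

print(column_means, column_means.shape)
print(row_means, row_means.shape)
```

The result always loses the axis that was passed: a `(2, 3)` array gives shape `(3,)` for `axis=0` and shape `(2,)` for `axis=1`.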

In [None]:
# sum all the elements
print(arr.sum())

In [None]:
# What does the following sum do?
print(arr.sum(axis = 1))

In [None]:
# Compute the medians. ndarrays have no median() method, so use the function np.median() instead.
print(np.median(arr, axis = 1))

In [None]:
# Sorting
arr.sort()
print(arr)
# sort() works in place, along the last axis (axis = 1 in this case)

In [31]:
#Find unique elements
arrUnique = np.array([1,2,1,4,2,1,4,2])

print(np.unique(arrUnique))

[1 2 4]


<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Broadcasting</p>
<br>
<p>    
Broadcasting is how numpy arrays operate together when their shapes are different. <br>
    Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.<br>
Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.<br>
Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.<br>
    For more details, please see: <br>
https://docs.scipy.org/doc/numpy-1.10.1/user/basics.broadcasting.html
<br><br>
</p>
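
The three rules can be traced on concrete shapes. This is a sketch with made-up array names (`a`, `b`, `c`). It uses `np.broadcast_to` only to display what the stretched array looks like; NumPy does not need you to call it before adding.

```python
import numpy as np

a = np.zeros((4, 3))
b = np.array([1, 0, 2])   # shape (3,)

# Rule 1: (3,) is padded with a 1 on the left -> (1, 3)
# Rule 2: the size-1 dimension is stretched to match -> (4, 3)
result_shape = (a + b).shape
print(result_shape)
print(np.broadcast_to(b, (4, 3)))

# Rule 3: (4, 3) and (4,) disagree in the last dimension (3 vs 4)
# and neither size is 1, so an error is raised
c = np.arange(4)          # shape (4,)
failed = False
try:
    a + c
except ValueError as error:
    failed = True
    print("Broadcast failed:", error)

# Adding an explicit axis turns (4,) into (4, 1), which does broadcast
fixed_shape = (a + c[:, np.newaxis]).shape
print(fixed_shape)
```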

In [33]:
# Create a 4X3 array
start = np.zeros((4,3))
print(start)

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]


In [34]:
# create a rank 1 ndarray with 3 values
add_rows = np.array([1, 0, 2])
add_rows.shape
print(add_rows)

(3,)

[1 0 2]


In [35]:
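# To add shapes (4,3) and (3,) together, broadcasting works as follows:
# 1. The shapes have different numbers of dimensions, so (3,) is padded with a 1 on the left --> (1,3)
# 2. The size-1 dimension is stretched to match the other array --> (4,3)
# 3. In effect, the single row is repeated 4 times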

array([[1., 0., 2.],
       [1., 0., 2.],
       [1., 0., 2.],
       [1., 0., 2.]])

In [None]:
#Add together: how two arrays of different dimensions (4,3) vs (3,) are added together.
y = start + add_rows
y

In [None]:
# create an ndarray which is 4 x 1 to broadcast across columns
add_cols = np.array([[0,1,2,3]])
add_cols = add_cols.transpose()
add_cols

In [None]:
# Add together: (4,3) vs (4,1).
# Expand (4, 1) to (4, 3) by repeating the one column.
z = start + add_cols
# Similarly, (4, 3) vs (1, 3) can work too.
# Expand (1, 3) to (4, 3) by repeating the one row.

In [None]:
# Broadcast in both dimensions - adding a single number
add_number = np.array([5])
x = start + add_number
x

In [None]:
# Try another example
# create our 3x4 matrix
arrA = np.ones((3,4))
arrA

In [None]:
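# add_cols is a 4x1 array. Shapes (3,4) and (4,1) disagree in both dimensions (3 vs 4, 4 vs 1 is fine),
# and in the first dimension neither size is 1, so this line raises a ValueError.
arrA + add_cols # cannot broadcast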

In [None]:
# We need to transpose the 4X1 to 1X4 so it will work
arrB = arrA + add_cols.transpose()
arrB

In [None]:
#Application of broadcasting - centering an array
X = np.random.random((10,3))
Xmean = X.mean(axis = 0)
Xmean
X_centered = X - Xmean
X_centered
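
A quick way to check the centering is to confirm that every column of the centered array has a mean of approximately zero. The sketch below repeats the steps above with a seeded random generator so the run is repeatable. The seed and the name `rng` are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)     # seeded so the result is repeatable
X = rng.random((10, 3))

# (10, 3) minus (3,): the column means are broadcast across all 10 rows
X_centered = X - X.mean(axis=0)
column_means = X_centered.mean(axis=0)
print(np.allclose(column_means, 0))

# Centering each row instead needs shape (10, 1), which keepdims provides
X_row_centered = X - X.mean(axis=1, keepdims=True)
print(np.allclose(X_row_centered.mean(axis=1), 0))
```

Without `keepdims=True`, the row means would have shape `(10,)`, which cannot be broadcast against `(10, 3)`.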

In [None]:
X = np.random.random((10,3))
X

In [None]:
Xmean = X.mean(axis = 0)
Xmean