<h1><center> PPOL 5203 Data Science I: Foundations <br><br> 
<font color='grey'>Data as Nested Lists -> Introduction to Numpy <br><br>
Tiago Ventura </font></center></h1>

---

## Data as nested lists

Before introducing Numpy, we will learn how to process and analyse data stored as nested lists. 

We will cover:

- Opening a csv as nested lists
- Working with rectangular data as nested lists

This will work as a motivation for the use of `numpy` and later for `pandas`. 

In [1]:
# Batteries included Functions
import csv # convert a .csv to a nested list
import os  # library for managing our operating system. 

In [2]:
## looking at your working directory
os.getcwd()

'/Users/tb186/Dropbox/courses/ppol_5203_fall_2023/lecture_notes/week-05'

In [3]:
## as magic line
%pwd

'/Users/tb186/Dropbox/courses/ppol_5203_fall_2023/lecture_notes/week-05'

In [6]:
## change working directory
os.chdir('/Users/tb186/Dropbox/courses/ppol_5203_fall_2023/lecture_notes/week-04')

In [7]:
### see what exists in my working directory
os.listdir()

['.DS_Store',
 'student_data.csv',
 '_week_4_comprehension_generators.ipynb',
 '_week_4_nested_lists.ipynb',
 '_week_4_numpy.html',
 'student_data_write.csv',
 '_week_4_comprehension_generators.html',
 '_week_4_file_management.ipynb',
 'data_week4',
 '_week_4_nested_lists.html',
 '_week_4_file_management.html',
 'text_file.txt',
 '.ipynb_checkpoints',
 'redrising.txt',
 'data_week4.zip',
 'gapminder.csv']

In [8]:
# Read in the gapminder data 
with open("gapminder.csv",mode="rt") as file:
    data = [row for row in csv.reader(file)]

### What does the data look like?

In [9]:
# it is a nested list. 
data[:10]

[['country', 'lifeExp', 'gdpPercap'],
 ['Guinea_Bissau', '39.21', '652.157'],
 ['Bolivia', '52.505', '2961.229'],
 ['Austria', '73.103', '20411.916'],
 ['Malawi', '43.352', '575.447'],
 ['Finland', '72.992', '17473.723'],
 ['North_Korea', '63.607', '2591.853'],
 ['Malaysia', '64.28', '5406.038'],
 ['Hungary', '69.393', '10888.176'],
 ['Congo', '52.502', '3312.788']]

## Indexing Nested Lists

Notice something important here: because we read the data with an iterator (`csv.reader`), the code doesn't know that the first row is the header of the csv. The header is stored as just another row of the nested list.

In [10]:
# accessing the header
print(data[0])

['country', 'lifeExp', 'gdpPercap']


### Indexing Rows

In [11]:
# Data rows start at index 1; index 0 holds the column names. 
print(data[100])

['Burundi', '44.817', '471.663']


### Indexing by columns

Accessing column values becomes much trickier, because every row is a separate entry in your nested list.

In [12]:
# Referencing a column data value
d = data[100] # First select the row
d[1] # Then reference the column

'44.817'

In [13]:
# doing the above all in one step
data[100][1]

'44.817'

In [14]:
# The key is to keep in mind the column names
# Note: pop(0) removes the header from `data`, so every row index shifts down by one
cnames = data.pop(0)
cnames

['country', 'lifeExp', 'gdpPercap']

In [15]:
# We can now reference this column name list to pull out the columns we're interested in.
ind = cnames.index("lifeExp") # Index allows us to "look up" the location of a data value. 
ind

1

In [16]:
data[99][ind]

'44.817'
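
The index shift caused by `pop(0)` can be avoided by pulling the header off the reader before collecting the rows. The following is a minimal sketch under stated assumptions: `sample_csv` is a hypothetical in-memory stand-in for `gapminder.csv`, built from the first rows printed earlier, so the block runs without the file.

```python
import csv
import io

# Hypothetical stand-in for gapminder.csv (first rows shown above)
sample_csv = io.StringIO(
    "country,lifeExp,gdpPercap\n"
    "Guinea_Bissau,39.21,652.157\n"
    "Bolivia,52.505,2961.229\n"
)

reader = csv.reader(sample_csv)
header = next(reader)            # consume the first row as the header
rows = [row for row in reader]   # the remaining rows are the data

# Look up a column position by name, then index a row with it
ind = header.index("lifeExp")
print(header)        # ['country', 'lifeExp', 'gdpPercap']
print(rows[0][ind])  # 39.21 (still a string, not a float)
```

With this layout, `rows[0]` is the first data row, so no offset has to be remembered.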

## Accessing an entire column

If I want to extract all the values of a particular column, I need to loop through the rows and pull the *j*-th element of each sublist. 

In [17]:
# get your column index again
ind = cnames.index("lifeExp") 

# Looping through each row pulling out the relevant data value

#create a container
life_exp = []

# loop
for row in data:
    life_exp.append(row[ind])
print(life_exp)  

['39.21', '52.505', '73.103', '43.352', '72.992', '63.607', '64.28', '69.393', '52.502', '57.609', '73.444', '62.817', '68.922', '73.989', '52.302', '47.619', '42.96', '70.056', '54.336', '74.903', '52.382', '70.299', '71.601', '66.828', '70.177', '50.007', '74.014', '60.721', '52.681', '44.401', '67.708', '59.304', '73.733', '52.341', '58.859', '59.696', '66.644', '66.526', '47.903', '69.744', '65.866', '51.499', '46.78', '68.749', '49.002', '67.431', '73.646', '59.03', '71.511', '46.381', '71.22', '43.581', '49.834', '44.544', '71.045', '53.491', '48.401', '61.346', '41.482', '72.739', '68.433', '57.48', '40.38', '43.413', '58.679', '42.476', '47.771', '46.774', '51.221', '64.953', '45.996', '68.291', '61.554', '56.243', '50.626', '58.443', '52.663', '54.598', '48.436', '37.479', '65.409', '57.896', '53.322', '75.565', '73.923', '74.827', '59.633', '53.166', '62.2', '65.606', '74.663', '55.89', '48.986', '58.637', '57.921', '43.24', '66.581', '76.511', '40.989', '44.817', '67.802', '

In [18]:
# Same idea, but as a list comprehension 
life_exp = [row[ind] for row in data]
print(life_exp)

['39.21', '52.505', '73.103', '43.352', '72.992', '63.607', '64.28', '69.393', '52.502', '57.609', '73.444', '62.817', '68.922', '73.989', '52.302', '47.619', '42.96', '70.056', '54.336', '74.903', '52.382', '70.299', '71.601', '66.828', '70.177', '50.007', '74.014', '60.721', '52.681', '44.401', '67.708', '59.304', '73.733', '52.341', '58.859', '59.696', '66.644', '66.526', '47.903', '69.744', '65.866', '51.499', '46.78', '68.749', '49.002', '67.431', '73.646', '59.03', '71.511', '46.381', '71.22', '43.581', '49.834', '44.544', '71.045', '53.491', '48.401', '61.346', '41.482', '72.739', '68.433', '57.48', '40.38', '43.413', '58.679', '42.476', '47.771', '46.774', '51.221', '64.953', '45.996', '68.291', '61.554', '56.243', '50.626', '58.443', '52.663', '54.598', '48.436', '37.479', '65.409', '57.896', '53.322', '75.565', '73.923', '74.827', '59.633', '53.166', '62.2', '65.606', '74.663', '55.89', '48.986', '58.637', '57.921', '43.24', '66.581', '76.511', '40.989', '44.817', '67.802', '

In [19]:
# Make this code more flexible with list comprehensions
var_name = "gdpPercap"
out = [row[cnames.index(var_name)] for row in data]
print(out)

['652.157', '2961.229', '20411.916', '575.447', '17473.723', '2591.853', '5406.038', '10888.176', '3312.788', '2447.909', '20556.684', '5733.625', '65332.91', '17262.623', '1356.671', '810.384', '2469.167', '9331.712', '1741.365', '22410.746', '1314.38', '7208.065', '14074.582', '7866.872', '8416.554', '780.553', '16245.209', '3477.21', '1200.416', '680.133', '3484.779', '12013.579', '13969.037', '1044.582', '5613.844', '4469.453', '4898.398', '1854.731', '675.368', '6384.055', '7269.216', '1153.82', '1569.275', '6197.645', '3163.352', '6703.289', '14160.936', '4426.026', '13920.011', '2697.833', '17425.382', '1488.309', '817.559', '648.343', '6283.259', '3675.582', '1835.01', '3009.288', '675.669', '10863.164', '3255.367', '1017.713', '542.278', '673.093', '20261.744', '604.814', '1335.595', '1165.454', '11529.865', '4768.942', '1358.199', '7300.17', '2844.856', '3074.031', '1533.122', '12138.562', '635.858', '5031.504', '1912.825', '802.675', '7724.113', '1382.782', '439.333', '27074
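
Every value extracted above is still a string, so numeric work needs a conversion step. The sketch below wraps the column lookup and the conversion in one function. `get_column` is a hypothetical helper written for illustration (it is not part of `csv` or any library), and the three rows are copied from the output shown earlier.

```python
def get_column(rows, column_names, name, convert=float):
    """Return one column of a nested list, converting each value.

    Hypothetical helper for illustration only.
    """
    j = column_names.index(name)               # look the position up once
    return [convert(row[j]) for row in rows]   # pull the j-th element of each row

# Small sample in the same layout as the gapminder rows above
cnames = ["country", "lifeExp", "gdpPercap"]
data = [
    ["Guinea_Bissau", "39.21", "652.157"],
    ["Bolivia", "52.505", "2961.229"],
    ["Austria", "73.103", "20411.916"],
]

life_exp = get_column(data, cnames, "lifeExp")
countries = get_column(data, cnames, "country", convert=str)
print(life_exp)       # [39.21, 52.505, 73.103]
print(max(life_exp))  # 73.103
```

Looking the index up once, outside the comprehension, also avoids repeating `cnames.index(...)` for every row.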

## Motivating Numpy

All of the above seems like a lot of work for handling rectangular data in Python. And it is. Fortunately, there are more modern and convenient tools for working with data frames in Python. 

A first step toward easier work with data frames in Python is to use `numpy` to convert nested lists into `arrays`. 

**If you are coming from R, think about numpy arrays as matrices.**

We will see more of numpy soon. But let's briefly see how numpy works and the speed boost it gives when accessing data in Python.


In [20]:
# %% -----------------------------------------
# Numpy offers an efficiency boost, especially when indexing
import numpy as np


# Convert to a numpy array
data_np = np.array(data)
data_np

array([['Guinea_Bissau', '39.21', '652.157'],
       ['Bolivia', '52.505', '2961.229'],
       ['Austria', '73.103', '20411.916'],
       ['Malawi', '43.352', '575.447'],
       ['Finland', '72.992', '17473.723'],
       ['North_Korea', '63.607', '2591.853'],
       ['Malaysia', '64.28', '5406.038'],
       ['Hungary', '69.393', '10888.176'],
       ['Congo', '52.502', '3312.788'],
       ['Morocco', '57.609', '2447.909'],
       ['Germany', '73.444', '20556.684'],
       ['Ecuador', '62.817', '5733.625'],
       ['Kuwait', '68.922', '65332.91'],
       ['New_Zealand', '73.989', '17262.623'],
       ['Mauritania', '52.302', '1356.671'],
       ['Uganda', '47.619', '810.384'],
       ['Equatorial Guinea', '42.96', '2469.167'],
       ['Croatia', '70.056', '9331.712'],
       ['Indonesia', '54.336', '1741.365'],
       ['Canada', '74.903', '22410.746'],
       ['Comoros', '52.382', '1314.38'],
       ['Montenegro', '70.299', '7208.065'],
       ['Slovenia', '71.601', '14074.582'],
      

### slicing data with numpy

Numpy allows slicing along both the row and the column index:

```
        array[rows, columns]
```

In [21]:
# simple slicing of rows and columns of your 2d array
data_np[:,1]

array(['39.21', '52.505', '73.103', '43.352', '72.992', '63.607', '64.28',
       '69.393', '52.502', '57.609', '73.444', '62.817', '68.922',
       '73.989', '52.302', '47.619', '42.96', '70.056', '54.336',
       '74.903', '52.382', '70.299', '71.601', '66.828', '70.177',
       '50.007', '74.014', '60.721', '52.681', '44.401', '67.708',
       '59.304', '73.733', '52.341', '58.859', '59.696', '66.644',
       '66.526', '47.903', '69.744', '65.866', '51.499', '46.78',
       '68.749', '49.002', '67.431', '73.646', '59.03', '71.511',
       '46.381', '71.22', '43.581', '49.834', '44.544', '71.045',
       '53.491', '48.401', '61.346', '41.482', '72.739', '68.433',
       '57.48', '40.38', '43.413', '58.679', '42.476', '47.771', '46.774',
       '51.221', '64.953', '45.996', '68.291', '61.554', '56.243',
       '50.626', '58.443', '52.663', '54.598', '48.436', '37.479',
       '65.409', '57.896', '53.322', '75.565', '73.923', '74.827',
       '59.633', '53.166', '62.2', '65.606', '74.6

## Indexing with Numpy is much faster!!

In [22]:
%%timeit -r 10 -n 100000
out1 = []
for row in data:
    out1.append(row[ind])

2.35 µs ± 98.3 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)


In [23]:
%%timeit -r 10 -n 100000
out2 = [row[ind] for row in data]

1.88 µs ± 120 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)


In [24]:
%%timeit -r 10 -n 100000
out3 = data_np[:,ind]

122 ns ± 56 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
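
The `%%timeit` magic only works inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard `timeit` module can be used. The table below is synthetic (an assumption for illustration, not the gapminder file), and the absolute timings will differ from machine to machine.

```python
import timeit
import numpy as np

# Small synthetic table of strings, mimicking the csv layout
data = [[f"country_{i}", str(40 + i % 40), str(500 + i)] for i in range(1000)]
data_np = np.array(data)
ind = 1

def with_loop():
    # list comprehension over the nested list
    return [row[ind] for row in data]

def with_numpy():
    # column slice on the numpy array
    return data_np[:, ind]

# Both approaches return the same values
same = with_loop() == with_numpy().tolist()
print(same)   # True

# Time each approach over 1000 calls
t_loop = timeit.timeit(with_loop, number=1000)
t_numpy = timeit.timeit(with_numpy, number=1000)
print(f"list comprehension: {t_loop:.4f} s, numpy slice: {t_numpy:.4f} s")
```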


## Introduction to Numpy

`numpy` is at the core of many important Python libraries. This is an extensive notebook, with a great deal of information about numpy. Primarily, we will cover: 

- Introduction to numpy, creating and manipulating arrays. 
- Efficiency with numpy arrays
- Vectorization
- Broadcasting

For more, go through your required readings. Both textbooks provide comprehensive coverage of Numpy. 


### Numpy

NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data buffers. In some ways, NumPy arrays are like Python’s built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size and in dimensions. 

NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you.

## Basics of Numpy

In [134]:
# import numpy library
import numpy as np

#### Creating array from Python lists

`np.array()` is the core building block for creating arrays.

- input: list
- output: numpy array

In [25]:
# from a list of int elements
np.array([1, 4, 8, 5, 3])

array([1, 4, 8, 5, 3])

In [26]:
# you can specify the type
np.array([1, 2, 3, 4], dtype='float32')

array([1., 2., 3., 4.], dtype=float32)

In [28]:
# unlike lists, NumPy arrays can explicitly be multidimensional
nested_list = [range(i, i+3) for i in [2, 4, 6]]
multi_array = np.array(nested_list)
multi_array

array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])

### Numpy Arrays vs Built-in Lists

We will see throughout this notebook why Numpy's arrays are more efficient than Python's built-in data structures, like lists. 

A primary difference to keep in mind is how elements are stored:  

- Numpy leans toward less flexibility and more efficiency. 
- Lists give you more flexibility and less efficiency. 

This is a trade-off between allowing a container to store **heterogenous** data types, which lists allow you to do, compared to **homogenous** data storage provided by numpy. 

In [29]:
# lists support heterogeneous data types. Each element has to store its own type information!
list_ =  ["beep", "false", False, 1, 1.2]
list_

['beep', 'false', False, 1, 1.2]

In [30]:
# numpy arrays only support homogeneous data types. The type information is stored once for the whole array!
numpy_boolean = np.array([[True, 0], [True, "TRUE"], [False, True]], dtype=bool)
numpy_boolean

array([[ True, False],
       [ True,  True],
       [False,  True]])

In [31]:
# see with string
numpy_boolean = np.array([[True, 0], [True, "TRUE"], [False, True]], dtype=str)
numpy_boolean

array([['True', '0'],
       ['True', 'TRUE'],
       ['False', 'True']], dtype='<U5')

See this paragraph from your PDS textbook: 

<div class="alert alert-block alert-info"> 

At the implementation level, the array essentially contains a single pointer to one contiguous block of data. The Python list, on the other hand, contains a pointer to a block of pointers, each of which in turn points to a full Python object like the Python integer we saw earlier. Again, the advantage of the list is flexibility: because each list element is a full structure containing both data and type information, the list can be filled with data of any desired type. Fixed-type NumPy-style arrays lack this flexibility, but are much more efficient for storing and manipulating data.

</div>    
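
One practical consequence of homogeneous storage is that numpy silently converts mixed inputs to a single common type. The short sketch below (illustrative values only) shows that coercion and the fact that one item size describes every element.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_num = np.array([1, 2.5, True])   # int, float and bool together
mixed_str = np.array([1, "a", 2.5])    # a string forces a string dtype

print(ints.dtype.kind)       # i (integer)
print(mixed_num.dtype)       # float64
print(mixed_num.tolist())    # [1.0, 2.5, 1.0]
print(mixed_str.dtype.kind)  # U (unicode string)

# One dtype and one item size describe every element of the array
print(ints.itemsize * ints.size == ints.nbytes)  # True
```

Checking `.dtype` after building an array from mixed data is a cheap way to catch an unintended conversion to strings.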

### Creating Arrays from scratch

Numpy also offers a set of functions to create arrays from scratch, instead of converting from a list. Some options: 

- `numpy.arange()` will create arrays with regularly incrementing values
- `numpy.linspace()` will create arrays with a specified number of elements, and spaced equally between the specified beginning and end values.
- `numpy.zeros()` will create an array filled with 0 values with the specified shape.
- `numpy.ones()` will create an array filled with 1 values


In [32]:
# arange: an incremental sequence
np.arange(0, 10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [33]:
# array with equally spaced intervals
np.linspace(1,5,10) 

array([1.        , 1.44444444, 1.88888889, 2.33333333, 2.77777778,
       3.22222222, 3.66666667, 4.11111111, 4.55555556, 5.        ])

In [34]:
# array filled with 0 values with the specified shape.
np.zeros((3,3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [35]:
# an array filled with 1 values
np.ones((3,3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

#### Random numbers with numpy.random

Numpy allows for the generation of random numbers from several well-known probability distributions. Some examples below: 

- `numpy.random.random()` will create an array of uniformly distributed random values between 0 and 1
- `numpy.random.normal()` will create an array of normally distributed random values (mean 0 and standard deviation 1 by default)
- `numpy.random.randint()` will create an array of random integers from a pre-defined interval

Other options that should be self-explanatory: 

- `numpy.random.poisson()`
- `numpy.random.binomial()`
- `numpy.random.uniform()`

In [38]:
# from a random sequence between 0 and 1
help(np.random.random)

Help on built-in function random:

random(...) method of numpy.random.mtrand.RandomState instance
    random(size=None)
    
    Return random floats in the half-open interval [0.0, 1.0). Alias for
    `random_sample` to ease forward-porting to the new random API.



In [42]:
# see how it works
np.random.random(10)

array([0.79101286, 0.85447581, 0.90415383, 0.89677082, 0.99745146,
       0.45545068, 0.12237682, 0.65271547, 0.08551359, 0.46130489])

In [43]:
# from a normal distribution
np.random.normal(0, 1, (3, 3))

array([[-1.57734582, -2.49535403, -1.85371445],
       [-0.03921928,  1.66423104,  0.57046832],
       [-0.38373116, -0.43965859, -1.12595718]])

In [46]:
#help(np.random.normal)

In [47]:
# random integers from a pre-defined interval
np.random.randint(0, 10, (3, 3))

array([[2, 9, 3],
       [0, 0, 3],
       [7, 5, 6]])
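
The draws above change on every run. A minimal sketch of how to make them repeatable uses a seeded generator from `np.random.default_rng`, the same interface used later in this notebook. The seed value 123 is arbitrary.

```python
import numpy as np

# A seeded Generator makes the "random" draws repeatable
rng = np.random.default_rng(seed=123)

uniform = rng.random(5)                            # floats in [0.0, 1.0)
normal = rng.normal(loc=0, scale=1, size=(3, 3))   # mean 0, standard deviation 1
integers = rng.integers(0, 10, size=(3, 3))        # upper bound is excluded

# Re-creating the generator with the same seed reproduces the draws
rng_again = np.random.default_rng(seed=123)
print(np.array_equal(uniform, rng_again.random(5)))  # True
```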

### Retrieving attributes from Arrays

- `array.ndim`: the number of dimensions
- `array.shape`: the size of each dimension
- `array.size`: the total number of elements in the array

In [49]:
# generate a 3-d array from a nested list
array_3d = np.array([ # first element of 1 dimension
                    [ 
                    [1,2,3,4],
                    [2,3,4,1],
                    [-1,1,2,1]],
                    [# second element of 1 dimension
                    [1,2,3,4],
                    [2,3,4,1],
                    [-1,1,2,1]]])

# information
print("ndim: ", array_3d.ndim)
print("shape:", array_3d.shape)
print("size: ", array_3d.size)

ndim:  3
shape: (2, 3, 4)
size:  24


In [54]:
array_3d[0][:]

array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  1],
       [-1,  1,  2,  1]])

### Reshaping Arrays

You can reshape an array, as long as the new dimensions are compatible with its total number of elements!

In [150]:
# new 2d array
array_3d.reshape(4, 6)

array([[ 1,  2,  3,  4,  2,  3],
       [ 4,  1, -1,  1,  2,  1],
       [ 1,  2,  3,  4,  2,  3],
       [ 4,  1, -1,  1,  2,  1]])

In [151]:
# or a 2d array with 6 rows and 4 columns
array_3d.reshape(6, 4)

array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  1],
       [-1,  1,  2,  1],
       [ 1,  2,  3,  4],
       [ 2,  3,  4,  1],
       [-1,  1,  2,  1]])

In [153]:
## but you need to provide the proper dimension
array_3d.reshape(6, 6)

ValueError: cannot reshape array of size 24 into shape (6,6)
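
A reshape succeeds only when the product of the new dimensions equals the array's `size`. The sketch below illustrates that check, together with the `-1` placeholder that asks numpy to infer one dimension. The array is a stand-in with 24 elements, like `array_3d`.

```python
import numpy as np

arr = np.arange(24).reshape(2, 3, 4)   # 24 elements in total

# -1 asks numpy to infer that dimension from the total size
print(arr.reshape(-1, 6).shape)   # (4, 6)
print(arr.reshape(8, -1).shape)   # (8, 3)
print(arr.reshape(-1).shape)      # (24,)

# A quick compatibility check before reshaping
new_shape = (6, 6)
fits = arr.size == np.prod(new_shape)
print(fits)                       # False, because 24 != 36

try:
    arr.reshape(new_shape)
except ValueError as err:
    print("reshape failed:", err)
```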

In [154]:
## transpose an array. Very common property in matrix operations. 
array_3d.reshape(6, 4).transpose()

array([[ 1,  2, -1,  1,  2, -1],
       [ 2,  3,  1,  2,  3,  1],
       [ 3,  4,  2,  3,  4,  2],
       [ 4,  1,  1,  4,  1,  1]])

### Array Indexing and Slicing

Numpy indexing is quite similar to list indexing in Python. And we covered lists and indexing last week.

In a one-dimensional array, you can access the ith value (**counting from zero**) by specifying the desired numerical index. 

```
M[element_index]
```

For two-dimensional arrays, you can access elements with a comma-separated pair of row and column indices (higher-dimensional arrays take one index per dimension). 

```
M[row, column]
```

You can use the `:` shortcut for slicing.

In [55]:
# create a 5 x 5 array
X = np.random.randint(0, 100, (5, 5))
X

array([[77, 73, 65, 58, 74],
       [ 1, 63, 42, 77, 91],
       [78,  2, 20,  3, 78],
       [98, 97, 80, 82, 51],
       [35, 97, 32, 61, 77]])

In [56]:
# index first row 
X[0] 

array([77, 73, 65, 58, 74])

In [57]:
# index first column
X[:,0]

array([77,  1, 78, 98, 35])

In [58]:
# index a specific cell 
X[0,0] 

77

In [59]:
# slice rows and columns
X[0:3,0:3] 

array([[77, 73, 65],
       [ 1, 63, 42],
       [78,  2, 20]])

In [60]:
# last row
X[-1,:] 

array([35, 97, 32, 61, 77])
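
Because `X` above is random, its values change between runs. The sketch below repeats the same indexing patterns on a deterministic array, so each result can be checked by hand.

```python
import numpy as np

X = np.arange(25).reshape(5, 5)   # rows are 0-4, 5-9, ..., 20-24

print(X[1, 2])        # 7: row 1, column 2
print(X[:, -1])       # [ 4  9 14 19 24]: last column
print(X[::2, 0])      # [ 0 10 20]: every second row, first column
print(X[1:3, 1:3])    # 2 x 2 block with rows [6 7] and [11 12]

# X[1][2] gives the same value but indexes twice (row first, then element)
print(X[1][2] == X[1, 2])   # True
```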

### Reassignment

As we just saw, `numpy` makes it easier to access elements of rectangular data, compared to nested lists. 

In the same vein, `numpy` uses its easy indexing scheme to facilitate the reassignment of values. 

<div class="alert alert-block alert-danger"> 

**Important:**
    
Using numpy for reassignment will be at the core of your data wrangling work with pandas!
    
</div>

In [61]:
# Start by creating an array
X = np.zeros(50).reshape(10,5)
X

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [62]:
# Reassign data values by referencing positions
X[0,0] = 999
X

array([[999.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.]])

In [63]:
# Reassign whole ranges of values: first an entire row
X[0,:] = 999
X

# then an entire column
X[:,0] = 999
X


array([[999., 999., 999., 999., 999.],
       [999.,   0.,   0.,   0.,   0.],
       [999.,   0.,   0.,   0.,   0.],
       [999.,   0.,   0.,   0.,   0.],
       [999.,   0.,   0.,   0.,   0.],
       [999.,   0.,   0.,   0.,   0.],
       [999.,   0.,   0.,   0.,   0.],
       [999.,   0.,   0.,   0.,   0.],
       [999.,   0.,   0.,   0.,   0.],
       [999.,   0.,   0.,   0.,   0.]])

In [64]:
# Reassignment using boolean values. 
D = np.random.randn(50).reshape(10,5).round(1)
D

array([[-0.9, -0.3,  0.5,  0.6,  1.2],
       [ 0.2,  0.4,  0.9,  1.1,  1.3],
       [-0.9, -2.2,  0.7, -1. ,  0.1],
       [ 2. , -1.5, -0.5, -1.5, -0.5],
       [-0. , -0.4,  0.6,  1.1, -1. ],
       [-1.2,  0.5,  1.2, -2.7,  1.2],
       [ 1.2,  0.4,  0.6,  0.7, -0.9],
       [ 1.6,  0.2, -0.3, -0.4, -1.3],
       [-1. ,  0. ,  0.4,  0.8,  0.1],
       [ 0.1,  1.1, -0.5,  1.6, -0.9]])

In [65]:
# reassignment
D[D > 0] = 1
D[D <= 0] = 0
D

array([[0., 0., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [0., 0., 1., 0., 1.],
       [1., 0., 0., 0., 0.],
       [0., 0., 1., 1., 0.],
       [0., 1., 1., 0., 1.],
       [1., 1., 1., 1., 0.],
       [1., 1., 0., 0., 0.],
       [0., 0., 1., 1., 1.],
       [1., 1., 0., 1., 0.]])

#### np.where

`np.where()` can check whether elements meet a condition and then pull one element if the condition is met and another if not. It uses this syntax:

`np.where(condition, replacement_if_true, replacement_if_false)`

In [67]:
# Using where "ifelse()-like" method
D = np.random.randn(50).reshape(10,5).round(1) # Generate some random numbers again
D # Before 

array([[-0.1, -0.6,  1.1, -0.3,  0.8],
       [-0.3,  0.5,  0.1, -0.2, -2.3],
       [ 2. ,  0.3,  1.8,  2.5, -0.4],
       [-0.5,  0.2, -0.8,  0.5,  0.9],
       [ 0.9, -0.7,  0.7, -0.5,  1.1],
       [-0.3,  0.2, -1.3, -0.7, -1.1],
       [ 0.4, -0.9,  1. ,  0.2,  0.2],
       [ 0.1, -0.8,  0.6,  0.5, -0.2],
       [ 0.7, -0.2,  0.2,  0.8, -0.8],
       [ 1.4,  0.4, -0.7,  1.1, -0.6]])

In [70]:
np.where(D>0,1,0) # After

array([[0, 0, 1, 0, 1],
       [0, 1, 1, 0, 0],
       [1, 1, 1, 1, 0],
       [0, 1, 0, 1, 1],
       [1, 0, 1, 0, 1],
       [0, 1, 0, 0, 0],
       [1, 0, 1, 1, 1],
       [1, 0, 1, 1, 0],
       [1, 0, 1, 1, 0],
       [1, 1, 0, 1, 0]])

#### np.select

`np.select` allows for element-wise reassignment based on multiple conditions, just like `case_when` from R.


In [71]:
# basic usage: np.select(conditions, choices, default=0)

# create conditions
conditions = [D < 0, D == 0, D > 0]

# element wise reassignment
choices = [-1, 0, 1]

# run np.select
np.select(conditions, choices, default='unknown')

array([['-1', '-1', '1', '-1', '1'],
       ['-1', '1', '1', '-1', '-1'],
       ['1', '1', '1', '1', '-1'],
       ['-1', '1', '-1', '1', '1'],
       ['1', '-1', '1', '-1', '1'],
       ['-1', '1', '-1', '-1', '-1'],
       ['1', '-1', '1', '1', '1'],
       ['1', '-1', '1', '1', '-1'],
       ['1', '-1', '1', '1', '-1'],
       ['1', '1', '-1', '1', '-1']], dtype='<U21')
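
Note that the output above is an array of strings: the string `default` forces every choice to be converted to text. The sketch below, on a small hand-made array, contrasts that with a numeric default. The sentinel value `-99` is an arbitrary choice for illustration.

```python
import numpy as np

D = np.array([[-0.5, 0.0, 1.2],
              [2.0, -1.1, 0.0]])

conditions = [D < 0, D == 0, D > 0]
choices = [-1, 0, 1]

# String default: the whole result becomes an array of strings
as_text = np.select(conditions, choices, default="unknown")
print(as_text.dtype.kind)   # U

# Numeric default: the result stays numeric
as_int = np.select(conditions, choices, default=-99)
print(as_int.tolist())      # [[-1, 0, 1], [1, -1, 0]]
```

Keeping the default the same type as the choices avoids an accidental conversion that would break later arithmetic.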

### Concatenating and Splitting Arrays

We can easily stack and grow numpy arrays. These are the main functions for concatenating arrays: 

- `np.concatenate([array,array],axis=0)`: concatenate by rows
- `np.concatenate([array,array],axis=1)`: concatenate by columns 

The same behavior can be achieved with `np.vstack([array,array])` or `np.hstack([array,array])`.

In [72]:
# create arrays
X = np.random.randint(0, 100, (5, 2))
Y = np.random.randint(0, 100, (5, 2))

In [73]:
# row bind
np.concatenate([X,Y],axis=0)

array([[28, 82],
       [52,  1],
       [29, 67],
       [10, 97],
       [33, 49],
       [55,  5],
       [89, 52],
       [ 4, 91],
       [52, 60],
       [62, 98]])

In [75]:
# column bind
np.concatenate([X,Y],axis=1)

array([[28, 82, 55,  5],
       [52,  1, 89, 52],
       [29, 67,  4, 91],
       [10, 97, 52, 60],
       [33, 49, 62, 98]])
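
As a small sketch of the stacking shortcuts and of the reverse operation, the block below uses two fixed arrays so the results are easy to verify. `np.split` cuts an array into equal pieces along an axis.

```python
import numpy as np

X = np.array([[1, 2], [3, 4]])
Y = np.array([[5, 6], [7, 8]])

rows = np.vstack([X, Y])   # same as np.concatenate([X, Y], axis=0)
cols = np.hstack([X, Y])   # same as np.concatenate([X, Y], axis=1)

print(rows.shape, cols.shape)   # (4, 2) (2, 4)

# np.split undoes the stacking: two equal pieces along the given axis
top, bottom = np.split(rows, 2, axis=0)
print(np.array_equal(top, X), np.array_equal(bottom, Y))   # True True
```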

### View vs Copy in Array

A subtle but important point about numpy arrays is the default behavior of slicing. When we slice an array,
we **_do not copy the array_**; rather, we get a "**view**" of the array.

**Why does this matter?** Any change to the view will also change the original array.
    
**Solution**: Make a copy. 

As noted in the [reading for this week](https://jakevdp.github.io/PythonDataScienceHandbook/02.02-the-basics-of-numpy-arrays.html): 

> One important—and extremely useful—thing to know about array slices is that they return views rather than copies of the array data. **This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices will be copies**

We need to use the `.copy()` method from numpy to create a new array

In [77]:
# from lists
x = [1, 2, 3]

# full slice is enough on lists
y=x[:] 

# modify
y[0]=100

#print
print(y, x)

[100, 2, 3] [1, 2, 3]


In [79]:
# for arrays
X = np.random.randint(0, 100, (1, 5))

# slice
X_sub = X[:]
print(X, X_sub)

[[58 17 34 91 46]] [[58 17 34 91 46]]


In [80]:
# modify
X_sub[0][0] = 1000

# print
print(X, X_sub)

[[1000   17   34   91   46]] [[1000   17   34   91   46]]


In [81]:
# need to copy
# for arrays
X = np.random.randint(0, 100, (1, 5))

# slice.copy()
X_sub = X[:3].copy()

# modify
X_sub[0][0] = 1000

# print
print(X, X_sub)

[[78 70 57 52 86]] [[1000   70   57   52   86]]
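
One way to check whether two arrays are a view and its source is `np.shares_memory`. The sketch below uses a small deterministic array in place of the random ones above.

```python
import numpy as np

X = np.arange(5)          # [0 1 2 3 4]
view = X[:3]              # slicing returns a view
copy = X[:3].copy()       # .copy() allocates new memory

print(np.shares_memory(X, view))   # True
print(np.shares_memory(X, copy))   # False

view[0] = 1000            # writes through to X
copy[1] = -1              # leaves X untouched
print(X.tolist())         # [1000, 1, 2, 3, 4]
```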


## Vectorization (or ufunc in Numpy)

A critical reason for `numpy`'s popularity among data scientists is its efficiency. NumPy provides an easy-to-implement and flexible interface to optimized computation with arrays of data. The key to making it fast is to use built-in (or easy to implement) vectorized operations. 


**What are vectorized functions?** A vectorized function allows for efficient processing of entire arrays or collections of data elements in a **single operation**. In plain English, it applies a particular operation in one shot over a sequence of objects. Vectorized functions are efficient because they let us avoid writing explicit Python loops over entire collections of data. 

Let's compare the performance of a vectorized function and a loop, using an example from your [reading for this week](https://jakevdp.github.io/PythonDataScienceHandbook/02.02-the-basics-of-numpy-arrays.html)

In [82]:
import numpy as np
rng = np.random.default_rng(seed=1701)

In [83]:
# define a function
def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        # notice the loop
        output[i] = 1.0 / values[i]
    return output

In [95]:
# create a list and apply
values = list(rng.integers(1, 10, size=5))

# error because lists do not support vectorized operations
1.0/values

TypeError: unsupported operand type(s) for /: 'float' and 'list'

In [97]:
# try with the function
compute_reciprocals(values)

array([0.11111111, 0.25      , 1.        , 0.11111111, 0.2       ])

In [99]:
# simple implementation
big_array = list(rng.integers(1, 100, size=1000))

In [100]:
%timeit -n 1000 compute_reciprocals(big_array)

673 µs ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [101]:
# vectorized implementation
%timeit -n 1000 np.divide(1.0 , big_array) 

32.5 µs ± 8.18 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


 ### <span style="color:red"> ALERT: What just happened?</span>

NumPy provides built-in vectorized routines that operate on `np.array` objects. This vectorized approach is designed to push the loop into the compiled layer that underlies NumPy, leading to much faster execution.


These built-in vectorized functions are called `ufuncs` (or "universal functions"). Numpy comes baked in with a large number of those vectorized operations. [See here for a detailed list.](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.math.html) 

The [google colab notebook](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.03-Computation-on-arrays-ufuncs.ipynb) from your reading also provides in-depth coverage of universal functions in `numpy`. Check it out!
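
To make the equivalence concrete, the sketch below computes reciprocals three ways on a small fixed array: an explicit loop, the `/` operator, and the `np.divide` ufunc. The input values are illustrative.

```python
import numpy as np

values = np.array([1, 2, 4, 5, 8])

# Explicit Python loop, element by element
looped = np.empty(len(values))
for i in range(len(values)):
    looped[i] = 1.0 / values[i]

# The same computation as single vectorized expressions
with_operator = 1.0 / values            # on arrays, / dispatches to np.divide
with_ufunc = np.divide(1.0, values)

print(with_operator.tolist())                      # [1.0, 0.5, 0.25, 0.2, 0.125]
print(np.allclose(looped, with_operator))          # True
print(np.array_equal(with_operator, with_ufunc))   # True
```

The arithmetic operators on arrays are thin wrappers around ufuncs, so writing `1.0 / values` is usually the most readable form.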

### Building vectorized functions

We can take advantage of numpy's vectorized approach and very easily vectorize our own user-defined functions. 

Consider the following function that yields a different string when input `a` is larger/smaller than input `b`.

In [180]:
def bigsmall(a,b):
    if a > b:
        return "A is larger"
    else:
        return "B is larger"

In [181]:
bigsmall(5,6)

'B is larger'

In [182]:
# Create a vectorized version of the function
vec_bigsmall = np.vectorize(bigsmall)
vec_bigsmall 

<numpy.vectorize at 0x1078e6c90>

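The vectorization here brings two main advantages: 

- Advantage 1: it allows us to apply the function to a collection without writing loops. 
- Advantage 2: it does so in a vectorized manner, which was faster in the timings below. Keep in mind that NumPy's documentation describes `np.vectorize` as a convenience wrapper that is essentially a loop, so a speedup is not guaranteed.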

In [183]:
# Advantage 1. Avoid the loops
bigsmall([0,2,5,7,0],4)

TypeError: '>' not supported between instances of 'list' and 'int'

In [184]:
# vectorize
vec_bigsmall([0,2,5,7,0],4)

array(['B is larger', 'B is larger', 'A is larger', 'A is larger',
       'B is larger'], dtype='<U11')

In [185]:
# Advantage 2: vectorized means faster
# write a function to run element-wise
def bigsmall_el_wise(a_collection, b):
    container = []
    for a in a_collection:
        if a > b:
            container.append("A is larger")
        else:
            container.append("B is larger")
    return container

In [None]:
# Generating some random data
a_collection = np.random.rand(1000000)
b = 0.5

In [191]:
%timeit -n 1000 vec_bigsmall(big_array, b)

119 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [192]:
%timeit -n 1000 bigsmall_el_wise(big_array, b)

577 µs ± 13.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
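
When the logic is a simple comparison, a built-in routine such as `np.where` can often replace `np.vectorize` entirely and keep the whole computation in compiled code. The following is a sketch of that alternative, not the approach used in the cells above. It re-defines `bigsmall` in a compact form so the block is self-contained.

```python
import numpy as np

def bigsmall(a, b):
    # same logic as the function defined earlier, written compactly
    return "A is larger" if a > b else "B is larger"

vec_bigsmall = np.vectorize(bigsmall)

a_values = np.array([0, 2, 5, 7, 0])
b = 4

# Comparison and selection both run as array operations
with_where = np.where(a_values > b, "A is larger", "B is larger")

print(with_where.tolist())
# ['B is larger', 'B is larger', 'A is larger', 'A is larger', 'B is larger']
print(np.array_equal(with_where, vec_bigsmall(a_values, b)))   # True
```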


## Broadcasting

**Broadcasting** makes it possible for operations to be performed on arrays of mismatched shapes.

Broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. 

The idea is to wrangle data so that operations can occur element-wise.


**Overall:** subject to certain constraints, the smaller array is "broadcast" across the larger array so that they have compatible shapes.




For example, say we have a numpy array of dimensions (5,1)

$$ 
\begin{bmatrix} 1\\2\\3\\4\\5\end{bmatrix}
$$

Now say we wanted to add 5 to every value in this array

$$ 
\begin{bmatrix} 1\\2\\3\\4\\5\end{bmatrix} + 5
$$

Broadcasting "pads" the scalar 5 (treated here as an array of shape 1,1) and stretches it so that it has the same dimensions as the larger array involved in the computation.

$$ 
\begin{bmatrix} 1\\2\\3\\4\\5\end{bmatrix} + \begin{bmatrix} 5\\\color{lightgrey}{5}\\\color{lightgrey}{5}\\\color{lightgrey}{5}\\\color{lightgrey}{5}\end{bmatrix}
$$

$$ 
\begin{bmatrix} 1 + 5\\2 + 5\\3 + 5\\4 + 5\\5 + 5\end{bmatrix} 
$$

$$ 
\begin{bmatrix} 6\\7\\8\\9\\10\end{bmatrix} 
$$

In [193]:
A = np.array([1,2,3,4,5])
A + 5

array([ 6,  7,  8,  9, 10])

By 'broadcast', we mean that the smaller array is made to match the size of the larger array in order to allow for element-wise manipulations.

### How it works:

- Shapes of the two arrays are compared _element-wise_. 
- Dimensions are considered in reverse order, starting with the trailing dimensions, and working forward 
- We are stretching the smaller array by making copies of its elements. However, and this is key, no actual copies are made, making the method computationally and memory efficient.

A general **rule of thumb**: for every pair of corresponding dimensions, the sizes must either be equal or one of them must be 1.
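
The claim that no actual copies are made can be inspected with `np.broadcast_to`, which returns the stretched array explicitly. In this sketch, a stride of 0 along the first axis means every "row" of the result reads the same memory.

```python
import numpy as np

row = np.arange(3)                        # shape (3,)
stretched = np.broadcast_to(row, (3, 3))  # read-only view with shape (3, 3)

print(stretched.tolist())                 # [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
print(np.shares_memory(row, stretched))   # True: no data was copied
print(stretched.strides[0])               # 0: moving down a row does not move in memory
```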

## Rules of Broadcasting

Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays (from [reading](https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html)):


### Rule 1
> If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.

### Rule 2

> If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.

### Rule 3 

> If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
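
To make the three rules concrete, here is a minimal sketch that applies them to a pair of shape tuples. The helper `broadcast_shape` is hypothetical and written only for illustration; it is not part of NumPy. The last line cross-checks the answer against NumPy's own `np.broadcast` object.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast shape of two shapes, or raise ValueError (illustrative helper)."""
    # Rule 1: pad the shorter shape with ones on its leading (left) side
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for dim_a, dim_b in zip(a, b):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)   # Rule 2: b is stretched (or the sizes already match)
        elif dim_a == 1:
            result.append(dim_b)   # Rule 2: a is stretched
        else:
            # Rule 3: the sizes disagree and neither is 1
            raise ValueError(f"shapes {shape_a} and {shape_b} cannot be broadcast")
    return tuple(result)

print(broadcast_shape((3, 3), (3,)))   # (3, 3)
print(broadcast_shape((3, 1), (3,)))   # (3, 3)

# Cross-check against NumPy itself
print(np.broadcast(np.empty((3, 1)), np.empty(3)).shape)   # (3, 3)
```

Calling the helper with `(4, 7)` and `(7, 9)`, or with `(3, 2)` and `(3,)`, raises a `ValueError` under Rule 3.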

#### Example 1

In [194]:
np.arange(3) + 5

array([5, 6, 7])

$$
\texttt{np.arange(3)} = \begin{bmatrix} 0&1&2\end{bmatrix}
$$

<br> 

$$
\texttt{5}  = \begin{bmatrix} 5 \end{bmatrix}
$$

<br> 

$$
\begin{bmatrix} 0&1&2\end{bmatrix} + \begin{bmatrix} 5 & \color{lightgrey}{5} & \color{lightgrey}{5}\end{bmatrix} = \begin{bmatrix} 5 & 6 & 7\end{bmatrix} 
$$

#### Example 2

In [195]:
np.ones((3,3)) + np.arange(3)

array([[1., 2., 3.],
       [1., 2., 3.],
       [1., 2., 3.]])

$$
\texttt{np.ones((3,3)) = }\begin{bmatrix} 1 & 1 & 1\\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}
$$

<br>

$$
\texttt{np.arange(3)} = \begin{bmatrix} 0 & 1 & 2\end{bmatrix} 
$$

<br>

$$
\begin{bmatrix} 1 & 1 & 1\\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} + 
\begin{bmatrix} 0 & 1 & 2\\ \color{lightgrey}{0} & \color{lightgrey}{1} & \color{lightgrey}{2} \\  \color{lightgrey}{0} & \color{lightgrey}{1} & \color{lightgrey}{2}\end{bmatrix}  = 
\begin{bmatrix} 1 & 2 & 3\\ 1 & 2 & 3 \\ 1 & 2 & 3 \end{bmatrix} 
$$

#### Example 3

In [196]:
np.arange(3).reshape(3,1) + np.arange(3)

array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])

$$
\texttt{np.arange(3).reshape(3,1)} = \begin{bmatrix} 0 \\ 1 \\ 2\end{bmatrix} 
$$

<br>

$$
\texttt{np.arange(3)} = \begin{bmatrix} 0 & 1 & 2\end{bmatrix} 
$$

<br>

$$
\begin{bmatrix} 0 & \color{lightgrey}{0} & \color{lightgrey}{0} \\ 1 & \color{lightgrey}{1} & \color{lightgrey}{1} \\  2 & \color{lightgrey}{2} & \color{lightgrey}{2}\end{bmatrix} +
\begin{bmatrix} 0 & 1 & 2\\ \color{lightgrey}{0} & \color{lightgrey}{1} & \color{lightgrey}{2} \\  \color{lightgrey}{0} & \color{lightgrey}{1} & \color{lightgrey}{2}\end{bmatrix}  =
\begin{bmatrix} 0 & 1 & 2\\ 1 &2&3 \\ 2& 3 & 4\end{bmatrix} 
$$


#### Example 4

Example of dimensional disagreement.

In [197]:
np.ones((4,7)) 

array([[1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1.]])

In [198]:
np.ones((4,7))  + np.zeros( (7,9) )

ValueError: operands could not be broadcast together with shapes (4,7) (7,9) 

#### Example 5

In [199]:
M = np.ones((3, 2))
M.shape

(3, 2)

In [200]:
a = np.arange(3)
a.shape

(3,)

In [201]:
M + a

ValueError: operands could not be broadcast together with shapes (3,2) (3,) 
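
The error follows from Rule 1: `a` has shape `(3,)`, which is padded on the left to `(1, 3)`, and the trailing dimensions 2 and 3 then disagree. One possible fix, assuming the intent was to add `a` down each column of `M`, is to give `a` a trailing axis so that its shape becomes `(3, 1)`. The names `a_col` and `result` are illustrative.

```python
import numpy as np

M = np.ones((3, 2))
a = np.arange(3)

# Add a trailing axis: shape (3,) becomes (3, 1),
# which can be stretched across the 2 columns of M
a_col = a[:, np.newaxis]
result = M + a_col

print(a_col.shape)   # (3, 1)
print(result)
```

`a.reshape(3, 1)` would give the same column shape; `np.newaxis` simply avoids hard-coding the length.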

#### In practice: Mean Centering an array

In [202]:
X = np.random.random((10, 3))
X

array([[0.21539496, 0.14646488, 0.30907838],
       [0.60371727, 0.21537589, 0.68194277],
       [0.63833831, 0.72524305, 0.28385588],
       [0.19813106, 0.89265894, 0.77541369],
       [0.59011527, 0.16661737, 0.42753666],
       [0.99338948, 0.73362274, 0.06922053],
       [0.72288873, 0.03344708, 0.61698078],
       [0.87205533, 0.43981373, 0.19324654],
       [0.00129213, 0.41930223, 0.68320364],
       [0.77421512, 0.08673353, 0.73662498]])

In [203]:
Xmean = X.mean(0)
Xmean

array([0.56095377, 0.38592794, 0.47771038])

In [204]:
## Shape
print(X.shape)
print(Xmean.shape)

(10, 3)
(3,)


In [205]:
# element-wise operation
Xcentered = X - Xmean

In [206]:
Xcentered

array([[-0.34555881, -0.23946306, -0.16863201],
       [ 0.0427635 , -0.17055206,  0.20423238],
       [ 0.07738454,  0.33931511, -0.19385451],
       [-0.36282271,  0.50673099,  0.29770331],
       [ 0.0291615 , -0.21931057, -0.05017372],
       [ 0.43243571,  0.3476948 , -0.40848985],
       [ 0.16193497, -0.35248087,  0.1392704 ],
       [ 0.31110156,  0.05388578, -0.28446385],
       [-0.55966164,  0.03337429,  0.20549325],
       [ 0.21326135, -0.29919441,  0.2589146 ]])
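
As a quick sanity check, the column means of a centered array should be zero up to floating-point error. The sketch below regenerates a random array with an arbitrarily chosen seed, so the numbers differ from the output above. It also shows row centering, where `keepdims=True` is needed so that the row means keep shape `(10, 1)` and broadcast across the columns.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
X = rng.random((10, 3))

# Column centering: X.mean(axis=0) has shape (3,) and is broadcast across the 10 rows
X_col_centered = X - X.mean(axis=0)
print(np.allclose(X_col_centered.mean(axis=0), 0))   # True

# Row centering: keepdims=True keeps shape (10, 1), which broadcasts across the 3 columns
X_row_centered = X - X.mean(axis=1, keepdims=True)
print(np.allclose(X_row_centered.mean(axis=1), 0))   # True
```

Without `keepdims=True`, the row means would have shape `(10,)`, and subtracting them from a `(10, 3)` array would fail under Rule 3.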

# The rest of this notebook will not be covered in class. Go through it on your own!

### Missing Values

NumPy provides a special floating-point value for missing data: `nan` ("Not a Number", see [here](https://en.wikipedia.org/wiki/NaN)).

In [207]:
Y = np.random.randint(1,10,25).reshape(5,5) + .0
Y

array([[4., 9., 1., 7., 8.],
       [6., 2., 5., 6., 6.],
       [4., 1., 9., 8., 2.],
       [8., 5., 4., 9., 6.],
       [8., 8., 2., 9., 5.]])

In [208]:
Y[Y > 5] = np.nan
Y

array([[ 4., nan,  1., nan, nan],
       [nan,  2.,  5., nan, nan],
       [ 4.,  1., nan, nan,  2.],
       [nan,  5.,  4., nan, nan],
       [nan, nan,  2., nan,  5.]])

In [209]:
type(np.nan)

float
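
Because `nan` is a float, it behaves differently from ordinary values in two ways worth knowing. The sketch below, with illustrative variable names, shows that `nan` is not equal to itself (which is why `np.isnan` is used rather than `==`) and that an integer array cannot store it (which is why `Y` was converted to float with `+ .0`).

```python
import numpy as np

# NaN never compares equal to anything, including itself
print(np.nan == np.nan)   # False
print(np.isnan(np.nan))   # True

# An integer array has no way to store NaN
ints = np.array([1, 2, 3])
try:
    ints[0] = np.nan
except ValueError as err:
    print("ValueError:", err)

# Converting to float first makes the assignment valid
floats = ints.astype(float)
floats[0] = np.nan
print(floats)             # [nan  2.  3.]
```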

In [210]:
# scan for missing values
np.isnan(Y)

array([[False,  True, False,  True,  True],
       [ True, False, False,  True,  True],
       [False, False,  True,  True, False],
       [ True, False, False,  True,  True],
       [ True,  True, False,  True, False]])

In [211]:
~np.isnan(Y) # are not NAs

array([[ True, False,  True, False, False],
       [False,  True,  True, False, False],
       [ True,  True, False, False,  True],
       [False,  True,  True, False, False],
       [False, False,  True, False,  True]])

When we have missing values, we'll run into issues when computing across the data matrix.

In [212]:
np.mean(Y)

nan

To get around this, we need to use special versions of these functions that skip over `nan` values.

In [213]:
np.nanmean(Y)

3.1818181818181817

In [214]:
np.nanmean(Y,axis=0)

array([4.        , 2.66666667, 3.        ,        nan, 3.5       ])

In [215]:
# Mean impute the missing values
Y[np.where(np.isnan(Y))] = np.nanmean(Y)
Y

array([[4.        , 3.18181818, 1.        , 3.18181818, 3.18181818],
       [3.18181818, 2.        , 5.        , 3.18181818, 3.18181818],
       [4.        , 1.        , 3.18181818, 3.18181818, 2.        ],
       [3.18181818, 5.        , 4.        , 3.18181818, 3.18181818],
       [3.18181818, 3.18181818, 2.        , 3.18181818, 5.        ]])
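
The cell above fills every gap with the overall mean. A common alternative is to impute each column with its own mean, which is a direct use of broadcasting. The sketch below uses a small made-up array `Z` in which every column has at least one observed value; the names `col_means` and `Z_imputed` are illustrative.

```python
import numpy as np

# Small illustrative array (values made up); every column has at least one number
Z = np.array([[1.0, np.nan, 3.0],
              [np.nan, 4.0, 5.0],
              [3.0, 6.0, np.nan]])

col_means = np.nanmean(Z, axis=0)   # shape (3,): [2., 5., 4.]

# np.where broadcasts col_means across the rows and uses it wherever Z is NaN
Z_imputed = np.where(np.isnan(Z), col_means, Z)
print(Z_imputed)
```

If a column were entirely `nan`, its column mean would also be `nan`, so those entries would remain missing.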

### Structured Data: NumPy’s Structured Arrays

Out of the box, numpy arrays can only handle one data type at a time. Most of the time, however, we work with heterogeneous data: think of a spreadsheet with name, age, gender, address, etc.

This short section shows you how to use NumPy's structured arrays to get around this limitation. 

Let's start by creating some lists. Imagine these are the columns of your dataframe.

In [None]:
# lists
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

In [None]:
# nest these lists
nested_list = [name, age, weight]
nested_list

In [None]:
# convert to a numpy array
array_nested_list = np.array(nested_list).T
array_nested_list

In [None]:
# see data type - all data treated as strings. 
array_nested_list.dtype


If you wish to preserve the data type of each variable, you can use structured arrays. These behave almost like a less flexible dictionary. 

You need to follow three steps: 

- Create an empty structure with a pre-defined size
- Provide names for the 'columns'
- Provide types for the columns

In [None]:
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                             'formats':('U10', 'i', 'f')})

In [None]:
# see the skeleton of the structure
data

In [None]:
# add information
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)

In [None]:
# then you can access fields pretty much like dictionary keys
data["name"]
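
As a minimal sketch of what structured arrays allow beyond field lookup, the example below builds the same records row by row from tuples and then filters them with a boolean mask. The array name `people` and the explicit type codes `'i4'` and `'f8'` are choices made here for illustration.

```python
import numpy as np

# Same illustrative records as in the lists above, written row by row as tuples
people = np.array(
    [('Alice', 25, 55.0), ('Bob', 45, 85.5), ('Cathy', 37, 68.0), ('Doug', 19, 61.5)],
    dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f8')],
)

print(people[0])               # first record, with all three fields
print(people['age'].mean())    # 31.5

# Boolean masks work on fields just as they do on ordinary arrays
print(people[people['age'] < 30]['name'])   # ['Alice' 'Doug']
```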

Though it is possible to handle heterogeneous data with numpy, constructing such a data object involves a lot of overhead. 

**As such, we'll use Pandas series and DataFrames to deal with heterogeneous data.**
