### Python for Data Analysis: Notes for Scientific Computing

Working through "Python for data analysis"

**Please note that the contents of this workbook are by no means my own work; these are my notes on the textbook I was working through.**

## Chapter 4 
### NumPy Basics: Arrays and Vectorized Computation

Here are some of the things you’ll find in NumPy:
- **ndarray**, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities.
- Mathematical functions for fast operations on entire arrays of data without having to write loops.
- Tools for reading/writing array data to disk and working with memory-mapped files.
- Linear algebra, random number generation, and Fourier transform capabilities.
- A C API for connecting NumPy with libraries written in C, C++, or FORTRAN.

While NumPy provides a computational foundation for general numerical data processing,
many readers will want to use pandas as the basis for most kinds of statistics
or analytics, especially on tabular data. pandas also provides some more domain-specific
functionality like time series manipulation, which is not present in NumPy.

Example 1:
Consider a NumPy array of ten million integers, and the equivalent Python list:

In [1]:
# Import NumPy under its conventional alias
import numpy as np

# One could have used "from numpy import *" in the code
# to avoid writing np.
# This, however, is not advisable, as NumPy defines functions whose names
# clash with some built-in Python functions (such as min and max)
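
To make the name-clash concern concrete, here is a small sketch of my own (not from the book). It uses `sum` rather than `min` or `max`, because the difference is easy to see: the built-in and the NumPy function interpret their second positional argument differently, which is one reason to keep the `np.` prefix.

```python
import numpy as np

values = range(5)  # 0, 1, 2, 3, 4

# Built-in sum: the second argument is a start value added to the total
builtin_total = sum(values, -1)     # 0 + 1 + 2 + 3 + 4 - 1

# np.sum: the second argument is the axis to sum over
numpy_total = np.sum(values, -1)    # sums along the last axis

print(builtin_total, numpy_total)   # 9 10
```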

In [2]:

my_arr = np.arange(10000000)
# now multiply every element by 2, ten times over
%time for _ in range(10): my_arr2 = my_arr * 2

Wall time: 313 ms


In [3]:
# create the equivalent list
my_list = list(range(10000000))
# now multiply every element by 2, ten times over
%time for _ in range(10): my_list2 = [x * 2 for x in my_list]

Wall time: 10.2 s


- NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory.
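
The memory claim can be checked roughly with a sketch like the one below. It is only an estimate: `sys.getsizeof` measures CPython objects, the exact list figure depends on the Python build, and the array size of 1000 is an arbitrary choice of mine.

```python
import sys
import numpy as np

n = 1000
arr = np.arange(n, dtype=np.int64)
lst = list(range(n))

# Bytes used by the array's data buffer: n elements * 8 bytes each
array_bytes = arr.nbytes

# Rough size of the list: the list object plus every int object it references
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)

print(array_bytes)               # 8000
print(list_bytes > array_bytes)  # True
```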

#### 4.1 The NumPy ndarray: A Multidimensional Array Object

One of the key features of NumPy is its N-dimensional array object, or ndarray,
which is a fast, flexible container for large datasets in Python. Arrays enable you to
perform mathematical operations on whole blocks of data using similar syntax to the
equivalent operations between scalar elements.

In [4]:
# import numpy as np

# generate some random data (row,col)
data = np.random.randn(2,3) # 2 rows, 3 columns
data

array([[ 2.08505581, -0.42038397,  0.37927262],
       [-0.48541921,  0.70473105,  0.12322058]])

In [5]:
# use operators on the matrices (arrays)
data + data

array([[ 4.17011162, -0.84076794,  0.75854525],
       [-0.97083843,  1.40946209,  0.24644116]])

In [6]:
data * 10

array([[20.85055812, -4.20383971,  3.79272625],
       [-4.85419214,  7.04731047,  1.23220579]])

- An ndarray is a generic multidimensional container for homogeneous data; that is, all the elements must be of the same type. This is similar to the matrix() function in R.
- Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array.
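
A quick way to see the "homogeneous" rule in action is to build arrays from mixed inputs. This is a minimal sketch with made-up values; NumPy picks one common dtype for all elements rather than keeping the original types.

```python
import numpy as np

mixed_numbers = np.array([1, 2.5])   # int and float -> stored as floats
mixed_text = np.array([1, "a"])      # int and str -> stored as strings

print(mixed_numbers.dtype)           # float64
print(mixed_text.dtype.kind)         # U (a Unicode string dtype)
```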

In [7]:
# remember we called our 2 by 3 matrix (array) "data"
# Add .shape to see the shape
data.shape

(2, 3)

In [8]:
# to see the type add .dtype to the name of the array
data.dtype

dtype('float64')

Note: Whenever you see **“array,” “NumPy array,” or “ndarray”** in the text,
with few exceptions they all refer to the **same thing: the ndarray
object**.

#### Creating ndarrays

The easiest way to create an array is to use the array function. This accepts any
sequence-like object (including other arrays) and produces a new NumPy array containing
the passed data.

In [9]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1 

array([6. , 7.5, 8. , 0. , 1. ])

In [10]:
# Notice that the shape of a 1-D array is a one-element tuple: there is no column dimension
arr1.shape

(5,)

Nested sequences, like a list of equal-length lists, will be converted into a **multidimensional
array**:

In [11]:
data2 = [[1,2,3,4],[5,6,7,8]]
data2 # A nested list
      # Note that it is not yet a multidimensional array; it still needs to
      # be converted to one by passing it to the np.array() function

[[1, 2, 3, 4], [5, 6, 7, 8]]

In [12]:
# Change a list to a multidimensional array
arr2 = np.array(data2)
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

In [13]:
# data2 is a plain list, so it has no ndim or shape attributes;
# uncommenting either line below raises an AttributeError

# data2.shape
# data2.ndim

In [14]:
# We can be sure arr2 is an array by looking
# at its ndim and shape attributes
arr2.shape

(2, 4)

In [15]:
# ndim
arr2.ndim

2
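
The attributes above are related to each other, which the following sketch illustrates (it rebuilds `arr2` so that it runs on its own):

```python
import numpy as np

arr2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr2.ndim == len(arr2.shape))   # True: ndim is the length of the shape tuple
print(arr2.size)                      # 8 elements in total (2 * 4)
print(len(arr2))                      # 2: len() counts along the first axis only
```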

The data type is stored in a special dtype metadata object:

In [16]:
arr1.dtype

dtype('float64')

In [17]:
arr2.dtype

dtype('int32')

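There are many other ways to create an array besides np.array, for example:
- **np.zeros()**: creates an array of zeros of the given length or shape.
- **np.ones()**: creates an array of ones of the given length or shape.
- **np.empty()**: creates an array without initializing its values to any particular value.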
    
To create a **higher dimensional array** with these methods, **pass a tuple
for the shape**:

In [18]:
np.zeros(3)

array([0., 0., 0.])

In [19]:
np.ones(3)

array([1., 1., 1.])

In [20]:
np.empty(3)

array([1., 1., 1.])

- It’s not safe to assume that **np.empty** will return an array of all zeros. In some cases, it may return uninitialized “garbage” values.
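
Given that warning, a safe pattern is to write to every element of an `np.empty` array before reading it, or to use a function that initializes the values. A minimal sketch (the fill values are arbitrary):

```python
import numpy as np

# Allocate without initializing: contents are arbitrary until written
buf = np.empty(4)

# Fill every element before reading anything back
buf[:] = 0.0

# When a known starting value is needed, np.full initializes in one step
filled = np.full(4, 7.0)

print(buf)     # [0. 0. 0. 0.]
print(filled)  # [7. 7. 7. 7.]
```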

In [21]:
np.zeros((2,4)) # 2 rows, 4 columns

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [22]:
# create a higher dimensional array with these methods, 
# pass a tuple for the shape

np.ones((2, 3, 4)) # (blocks, rows, columns): 2 blocks of 3 x 4

array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]])

- arange is an array-valued version of the built-in Python range function:

In [23]:
np.arange(5)

array([0, 1, 2, 3, 4])
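
Like `range`, `np.arange` also accepts a start, stop, and step. The values below are arbitrary examples of mine:

```python
import numpy as np

stepped = np.arange(2, 10, 3)              # start at 2, stop before 10, step 3
countdown = np.arange(5, 0, -1)            # negative step counts down
as_float = np.arange(3, dtype=np.float64)  # the dtype can be set explicitly

print(stepped)         # [2 5 8]
print(countdown)       # [5 4 3 2 1]
print(as_float.dtype)  # float64
```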

Table 4-1 Array creating functions
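
The table itself is not reproduced in these notes. As a stand-in, here is a short sketch of a few creation functions that exist in NumPy (`zeros_like`, `ones_like`, `full`, `eye`); the variable names and values are my own.

```python
import numpy as np

template = np.array([[1, 2, 3], [4, 5, 6]])

zeros_like_template = np.zeros_like(template)  # same shape and dtype, all zeros
ones_like_template = np.ones_like(template)    # same shape and dtype, all ones
sevens = np.full((2, 2), 7)                    # 2 x 2 array filled with 7
identity = np.eye(3)                           # 3 x 3 identity matrix

print(zeros_like_template.shape)               # (2, 3)
print(identity)
```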

### Data Types for ndarrays

The data type or dtype is a special object containing the information (or metadata,
data about data) the ndarray needs to interpret a chunk of memory as a particular
type of data:

In [24]:
arr1 = np.array([1, 2, 3], dtype=np.float64)
arr1.dtype

dtype('float64')

In [25]:
arr2 = np.array([1, 2, 3], dtype=np.int32)
arr2.dtype

dtype('int32')

- It’s often only necessary to care about the general kind of data you’re dealing with, whether floating point, complex, integer, boolean, string, or general Python object.
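
One practical consequence of the dtype is how much memory each element takes. A small sketch using the two arrays from above:

```python
import numpy as np

floats = np.array([1, 2, 3], dtype=np.float64)
ints = np.array([1, 2, 3], dtype=np.int32)

# itemsize is bytes per element; nbytes is itemsize * number of elements
print(floats.dtype.itemsize, floats.nbytes)  # 8 24
print(ints.dtype.itemsize, ints.nbytes)      # 4 12
```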

Table 4-2 NumPy data types

To convert or cast an array from one dtype to another, use ndarray's astype method:

 Example 1: Integer dtype cast to floating point

In [26]:
arr = np.array([1, 2, 3, 4, 5])
arr.dtype

dtype('int32')

In [27]:
# To convert the "arr" array to floating point,
# call .astype() with np.float64
float_arr = arr.astype(np.float64)
float_arr

array([1., 2., 3., 4., 5.])

In [28]:
float_arr.dtype

dtype('float64')

 Example 2: floating-point number to integer dtype
 
If I cast some floating-point
numbers to be of integer dtype, the decimal part will be truncated:

In [29]:
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr

array([ 3.7, -1.2, -2.6,  0.5, 12.9, 10.1])

In [30]:
# Note .astype(np.int32) is added to "arr" in order to change
# the  floating-point number to integer dtype
arr.astype(np.int32) 

array([ 3, -1, -2,  0, 12, 10])
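
Truncation drops the fractional part, so negative values move toward zero. If rounding is what you want instead, one option is to round first and cast afterwards. A sketch using the same values:

```python
import numpy as np

arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

truncated = arr.astype(np.int32)             # drops the fractional part
rounded = np.rint(arr).astype(np.int32)      # round to the nearest integer first
floored = np.floor(arr).astype(np.int32)     # round down first

print(truncated)  # [ 3 -1 -2  0 12 10]
print(rounded)    # [ 4 -1 -3  0 13 10]
print(floored)    # [ 3 -2 -3  0 12 10]
```

Note that `np.rint` rounds halves to the nearest even integer, which is why 0.5 becomes 0.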

 Example 3: Convert an array of strings representing numbers to numeric form:
 
 - It’s important to be **cautious when using the numpy.string_ type** (removed in NumPy 2.0 in favor of numpy.bytes_), as **string data in NumPy is fixed size and may truncate input without warning**. 
 - **pandas has more intuitive out-of-the-box behavior on non-numeric data.**

In [37]:
# np.bytes_ is the same fixed-width byte-string type as np.string_;
# the np.string_ alias no longer exists in NumPy 2.0 and later
numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.bytes_)
numeric_strings.dtype

dtype('S4')

In [46]:
numeric_float = numeric_strings.astype(np.float64)
# a shorthand is to pass the Python type instead:
# numeric_float = numeric_strings.astype(float)
# NumPy aliases the Python types to its own equivalent dtypes.
numeric_float

array([ 1.25, -9.6 , 42.  ])

In [47]:
numeric_float.dtype

dtype('float64')
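
The truncation warning for fixed-size strings is easy to reproduce. In this sketch the width of 3 bytes (`'S3'`) is deliberately too small, an assumption chosen to trigger the problem; the `names` array is made up.

```python
import numpy as np

# 'S3' is a fixed-width byte string holding at most 3 bytes per element
short = np.array(['1.25', '-9.6', '42'], dtype='S3')
print(short)  # [b'1.2' b'-9.' b'42']

# The numeric conversion still succeeds, but on the truncated text
converted = short.astype(np.float64)

# Assigning a longer string into an existing array is truncated the same way
names = np.array(['ab', 'cd'])   # inferred dtype holds 2 characters
names[0] = 'xyz123'
print(names[0])  # xy
```

No error or warning is raised in either case, so the loss only shows up later in the results.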

#####  Use another array's dtype attribute:

In [55]:
# Say we create an array holding the integers from
# 0 to 9 : array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
int_array = np.arange(10)
int_array.dtype

dtype('int32')

In [62]:
# vector of float values
calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)
calibers

array([0.22 , 0.27 , 0.357, 0.38 , 0.44 , 0.5  ])

In [63]:
# Cast int_array to the dtype of calibers (float64);
# the values are unchanged but are now stored as floats
int_array.astype(calibers.dtype)

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

Example : There are **shorthand type code strings** you can use to refer to a dtype

In [65]:
empty_uint32 = np.empty(8,dtype='u4')
empty_uint32 

array([0, 0, 0, 0, 0, 0, 0, 0], dtype=uint32)

In [66]:
empty_uint32.dtype

dtype('uint32')
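
The type codes combine a kind letter with a size in bytes. The following sketch checks a few of them against the named dtypes:

```python
import numpy as np

is_uint32 = np.dtype('u4') == np.uint32     # unsigned integer, 4 bytes
is_float64 = np.dtype('f8') == np.float64   # floating point, 8 bytes
is_int16 = np.dtype('i2') == np.int16       # signed integer, 2 bytes

print(is_uint32, is_float64, is_int16)      # True True True
print(np.dtype('f8').itemsize)              # 8
```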

**Calling astype always creates a new array (a copy of the data), even
if the new dtype is the same as the old dtype.**
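
The copy behaviour can be verified with `np.shares_memory`. The sketch also shows the `copy=False` argument of `astype`, which lets NumPy return the input array itself when no conversion is needed; the array contents are arbitrary.

```python
import numpy as np

original = np.arange(5, dtype=np.float64)

copied = original.astype(np.float64)   # same dtype, still a new array
copied[0] = 99.0

print(original[0])                          # 0.0, the original is untouched
print(np.shares_memory(original, copied))   # False

# Optional: copy=False returns the input itself when the dtype already matches
same = original.astype(np.float64, copy=False)
print(same is original)                     # True
```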

-----------------------------------------------------------------------

### Arithmetic with NumPy Arrays

NumPy uses vectorization: arrays enable you to express batch operations on data without writing any for loops.

In [70]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr 

array([[1., 2., 3.],
       [4., 5., 6.]])

- addition

In [73]:
arr + arr

array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]])

- subtraction

In [74]:
arr - arr

array([[0., 0., 0.],
       [0., 0., 0.]])

- multiplication

In [80]:
# Notice that each element is multiplied
# by its counterpart, element by element (this is not matrix multiplication)
arr * arr

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

- division

In [81]:
# Note that the division occurs elementwise; it is not
# matrix division (multiplication by an inverse)
arr / arr 

array([[1., 1., 1.],
       [1., 1., 1.]])
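
To contrast the elementwise operators with true matrix operations, here is a short sketch using the `@` operator for the matrix product. The transpose is my own choice so that the shapes are compatible.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

elementwise = arr * arr    # shape (2, 3): each element times its counterpart
matrix_prod = arr @ arr.T  # shape (2, 2): rows of arr dotted with rows of arr

print(elementwise.shape, matrix_prod.shape)  # (2, 3) (2, 2)
print(matrix_prod)  # [[14. 32.]
                    #  [32. 77.]]
```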

**Arithmetic operations with scalars propagate the scalar argument to each element in
the array**:

In [83]:
1/arr # take the reciprocal of each element in the matrix

array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])

In [86]:
arr ** 0.5 # raise each element in the matrix to the power of 0.5,
           # i.e. take the square root

array([[1.        , 1.41421356, 1.73205081],
       [2.        , 2.23606798, 2.44948974]])

In [85]:
arr * 0.5 # multiply each element by 0.5

array([[0.5, 1. , 1.5],
       [2. , 2.5, 3. ]])

Comparisons between arrays of the same size yield boolean arrays:

In [89]:
Mugabe_arr1 = np.array([[1, 2], [2, 4]])
Ramaposa_arr2 = np.array([[5, 6], [7, 8]])

In [90]:
Mugabe_arr1

array([[1, 2],
       [2, 4]])

In [91]:
Ramaposa_arr2

array([[5, 6],
       [7, 8]])

In [95]:
# Main point here is Comparisons between 
# arrays of the same size yield boolean arrays
Mugabe_arr1 < Ramaposa_arr2

array([[ True,  True],
       [ True,  True]])
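
Boolean arrays are usually reduced to a single answer or a count. A small sketch, using shorter hypothetical names `a` and `b` for the same values:

```python
import numpy as np

a = np.array([[1, 2], [2, 4]])
b = np.array([[5, 6], [7, 8]])

less = a < b
all_less = less.all()        # is every element of a smaller than b's?
count_twos = (a == 2).sum()  # True counts as 1 when summed
any_above_3 = (a > 3).any()  # does at least one element exceed 3?

print(all_less, count_twos, any_above_3)  # True 2 True
```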

Operations between differently sized arrays are called **broadcasting**; this will be discussed in the Advanced section later on.

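As a brief preview of broadcasting, the sketch below adds a one-dimensional array to each row of a two-dimensional one. The values are arbitrary, and the full rules are left to the later section.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])   # shape (2, 3)
row = np.array([10., 20., 30.])                # shape (3,)

result = arr + row  # row is added to each row of arr
print(result)  # [[11. 22. 33.]
               #  [14. 25. 36.]]
```
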
-----------------------------------------------

###  Basic Indexing and Slicing

- There are many ways you may want to select a subset of your data or individual elements.
- One-dimensional arrays are simple; on the surface they act similarly to Python lists.
- In a slice such as arr[0:3], the element at the stop index is not included.

In [99]:
arr = np.arange(10) # Remember Python starts counting at zero,
                    # so indexing also starts at zero
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [101]:
# Indexing notation
arr[4]  # 5th element of the vector (one-dimensional array)

4

In [102]:
arr[0:3] # from the first element up to the third; the element at index 3 is excluded

array([0, 1, 2])

In [112]:
arr[7:10]

array([7, 8, 9])
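
The other list-style slice forms work on one-dimensional arrays as well. A few examples of my own on the same `arr`:

```python
import numpy as np

arr = np.arange(10)

last_three = arr[-3:]    # negative indices count from the end
every_other = arr[::2]   # a step of 2 takes every second element
reversed_arr = arr[::-1] # a step of -1 walks backwards

print(last_three)    # [7 8 9]
print(every_other)   # [0 2 4 6 8]
print(reversed_arr)  # [9 8 7 6 5 4 3 2 1 0]
```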

If you assign a scalar value to a slice, as in arr[7:10] = 12, the value is
propagated (or broadcast, henceforth) to the entire selection:

In [115]:
arr[7:10] = 12
arr

array([ 0,  1,  2,  3,  4,  5,  6, 12, 12, 12])
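
A slice can also be assigned a sequence, as long as its length matches the slice. The sketch below shows both cases plus the error raised on a mismatch; the values and the `mismatch_raised` flag are my own.

```python
import numpy as np

arr = np.arange(10)

arr[5:8] = 12            # scalar is broadcast to all three positions
arr[0:3] = [7, 8, 9]     # a sequence of matching length is copied element by element

print(arr)  # [ 7  8  9  3  4 12 12 12  8  9]

mismatch_raised = False
try:
    arr[5:8] = [1, 2]    # length 2 cannot fill a slice of length 3
except ValueError as exc:
    mismatch_raised = True
    print("ValueError:", exc)
```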