# Data Processing

#### Data is the most important element of machine learning. Given a large and diverse set of training data, a good deep learning model will often significantly outperform non-deep learning algorithms.


# Getting Started with NumPy

#### The majority of neural networks use input data that is either numeric or has been converted to a numeric form. When we deal with numeric data, the standard Python library to use is NumPy. It lets us perform many operations on numeric data and convert the data to more usable forms.
#### NumPy aims to provide an array object that is up to **50x faster** than traditional Python lists.
### It is common practice to import NumPy under the **alias** **np**, like this:
> `import numpy as np`

# NumPy Arrays

#### NumPy arrays behave much like Python lists with added features. You can convert a Python list to a NumPy array using the **np.array** function, which takes a Python list as its required argument.
#### The function also has quite a few keyword arguments, but the main one to know is **dtype**.
#### The function **np.array** performs **upcasting**: if the list contains elements of different data types, all the elements are cast to a common type that can represent every value (for example, mixing integers and floats yields a float array).
#### The **dtype** keyword argument takes a NumPy type and manually casts the array to the specified type.
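#### As a minimal sketch of upcasting and of the **dtype** argument (the variable names `mixed` and `forced` are illustrative, not from the original notebook):

```python
import numpy as np

# Mixing integers with one float: every element is upcast to float64.
mixed = np.array([0, 1, 2.5])
print(mixed.dtype)

# Passing dtype overrides the inferred type. Casting floats to an
# integer type drops the fractional part.
forced = np.array([0.9, 1.1, 2.5], dtype=np.int32)
print(repr(forced))
```

#### Note that forcing a narrower type can silently lose information, as the second array shows.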

In [1]:
import numpy as np

arr = np.array([[0, 1, 2], [3, 4, 5]],
               dtype=np.float32)
print(repr(arr))

array([[0., 1., 2.],
       [3., 4., 5.]], dtype=float32)


## Copying

In [2]:
a = np.array([0, 1])
b = np.array([9, 8])
c = a
print('Array a: {}'.format(repr(a)))
c[0] = 5
print('Array a: {}'.format(repr(a)))

d = b.copy()
d[0] = 6
print('Array b: {}'.format(repr(b)))

Array a: array([0, 1])
Array a: array([5, 1])
Array b: array([9, 8])


## Casting

In [3]:
arr = np.array([0, 1, 2])
print(arr.dtype)
arr = arr.astype(np.float32)
print(arr.dtype)

int64
float32


## NaN & Infinity

* Note that np.nan cannot take on an integer type.
* Note that np.inf cannot take on an integer type.
### Try it by uncommenting the code chunks below one at a time

In [4]:
# NaN
# arr = np.array([np.nan, 1, 2])
# print(repr(arr))

# arr = np.array([np.nan, 'abc'])
# print(repr(arr))

# Will result in a ValueError
# np.array([np.nan, 1, 2], dtype=np.int32)


# Infinity
# print(np.inf > 1000000)

# arr = np.array([np.inf, 5])
# print(repr(arr))

# arr = np.array([-np.inf, 1])
# print(repr(arr))

# # Will result in an OverflowError
# np.array([np.inf, 3], dtype=np.int32)

# Dimensions in Arrays

#### A dimension in an array is one level of array depth (nested arrays).
### 0-D Arrays
#### 0-D arrays, or scalars, are the elements in an array. Each value in an array is a 0-D array.
### 1-D Arrays
#### An array that has 0-D arrays as its elements is called a uni-dimensional or 1-D array.
**These are the most common and basic arrays.**
### 2-D Arrays
#### An array that has 1-D arrays as its elements is called a 2-D array.
**These are often used to represent matrices or 2nd-order tensors.**
>NumPy has a whole submodule dedicated to linear algebra operations on such arrays, called numpy.linalg
### 3-D Arrays
#### An array that has 2-D arrays (matrices) as its elements is called a 3-D array.
**These are often used to represent 3rd-order tensors.**
### Check the Number of Dimensions
#### NumPy arrays provide the **ndim** attribute, which returns an integer telling us how many dimensions the array has.

In [5]:
# importing numpy and alias as np
import numpy as np

# 0-D
d = np.array(42)
print("0-D numpy array {}\n".format(d))

# 1-D
D = np.array([1, 2, 3])
print("1-D numpy array {}\n".format(D))

# 2-D
DD = np.array([[1, 2, 4], [3, 4, 5]])
print("2-D numpy array {}\n".format(DD))

# 3-D
DDD = np.array([[[1, 2, 3], [3, 4, 5]], [[6, 7, 8], [9, 10, 11]]])
print("3-D numpy array {}\n".format(DDD))

# Check Number of Dimensions?
print("The dimensions of DDD is {}\n".format(DDD.ndim))

0-D numpy array 42

1-D numpy array [1 2 3]

2-D numpy array [[1 2 4]
 [3 4 5]]

3-D numpy array [[[ 1  2  3]
  [ 3  4  5]]

 [[ 6  7  8]
  [ 9 10 11]]]

The dimensions of DDD is 3



# NumPy Ranged data

#### To get a fixed number of evenly spaced values between two endpoints, we can use the **np.linspace** function. Its **num** keyword argument specifies the number of elements in the returned array, rather than the step size.
#### In other words, **np.linspace** returns an array running from a start value to a stop value, and the spacing is set automatically from the number of elements requested. For example, if we ask for 100 numbers between 1000 and 2500, NumPy computes the spacing for us.
#### The stop value is **inclusive** by default; the argument **endpoint=False** makes it **exclusive**.

In [6]:
arr = np.linspace(5, 11, num=4)
print(repr(arr))

arr = np.linspace(5, 11, num=4, endpoint=False)
print(repr(arr))

arr = np.linspace(5, 11, num=4, dtype=np.int32)
print(repr(arr))

array([ 5.,  7.,  9., 11.])
array([5. , 6.5, 8. , 9.5])
array([ 5,  7,  9, 11], dtype=int32)


#### **np.arange** works much like Python's **range()** function: it returns **evenly spaced** values within a given interval, with the start value inclusive and the stop value exclusive. Unlike **range()**, it returns a NumPy array and accepts non-integer arguments.
#### Like **np.array**, **np.arange** performs upcasting. It also has the **dtype** keyword argument to manually cast the array.
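#### The notebook has no cell for **np.arange**, so here is a small sketch of the behaviour described above (the variable names are illustrative):

```python
import numpy as np

# One argument: integers from 0 up to (but not including) 5.
ints = np.arange(5)
print(repr(ints))

# A float argument upcasts the whole result to a float array.
floats = np.arange(5.1)
print(repr(floats))

# start, stop, step: values -1.5, 0.5, 2.5 (the stop value 4 is excluded).
stepped = np.arange(-1.5, 4, 2)
print(repr(stepped))

# dtype manually casts the result.
cast = np.arange(3, dtype=np.float32)
print(repr(cast))
```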

# Reshaping data

#### You have probably worked with the **shape** attribute. The shape of an array is the number of elements in each dimension.
#### **np.reshape** lets you reshape an array to a new shape, such as (2, 4).
#### The product of the dimensions in the new shape must equal the number of elements in the array itself.
#### In the code below the new shapes are (2, 4) and (2, 2, 2). The product of each must equal **arr.size**, which is 8.
#### 2x4=8 and 2x2x2=8 are valid, **but** 2x2x2x2=16 is not and raises a ValueError.
#### One dimension may be given as **-1**, in which case NumPy infers it from the others: here (-1, 2, 2) becomes (2, 2, 2).

In [7]:
arr = np.arange(8)
print(arr.shape)

reshaped_arr = np.reshape(arr, (2, 4))
reshaped_arr_one = np.reshape(arr, (2, 2, 2))
# reshaped_arr_one = np.reshape(arr, (2, 2, 2, 2))

print(repr(reshaped_arr_one))
print(repr(reshaped_arr))
print('New shape: {}'.format(reshaped_arr.shape))

reshaped_arr = np.reshape(arr, (-1, 2, 2))
print(repr(reshaped_arr))
print('New shape: {}'.format(reshaped_arr.shape))

(8,)
array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])
New shape: (2, 4)
array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])
New shape: (2, 2, 2)


#### Since we need to flatten data quite often, NumPy provides a built-in method for it. Flattening an array reshapes it into a 1-D array.
#### In other words, **arr.flatten()** returns a new 1-D array containing all the elements of **arr** in row-major order, whatever shape **arr** had. The original array is left unchanged.
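#### One detail worth checking for yourself is that **flatten** returns a copy. A minimal sketch (the name `flat` is illustrative):

```python
import numpy as np

arr = np.arange(8).reshape(2, 4)

# flatten() returns a 1-D copy of the data.
flat = arr.flatten()
flat[0] = 100  # modifies only the copy

print(arr[0, 0])   # the original element is untouched
print(flat[0])
```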

In [8]:
arr = np.arange(8)
arr = np.reshape(arr, (2, 4))
flattened = arr.flatten()
print(repr(arr))
print('arr shape: {}'.format(arr.shape))
print(repr(flattened))
print('flattened shape: {}'.format(flattened.shape))

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])
arr shape: (2, 4)
array([0, 1, 2, 3, 4, 5, 6, 7])
flattened shape: (8,)


# Transposing

#### Similar to how it is common to reshape data, it is also common to **transpose** data. Perhaps we have data that's supposed to be in a particular format, but some new data we get is rearranged. We can just transpose the data, using the **np.transpose** function, to convert it to the proper format.
#### **np.transpose** takes a keyword argument called **axes**, for example np.transpose(**the array**, axes=(1, 0, 2)). When **axes** is omitted, the order of the dimensions is reversed.
#### **NOTE:** Each entry in **axes** names the old dimension that moves to that position. If the shape of a 3-D array is (2, 3, 2), then axes=(1, 0, 2) puts old dimension 1 (size 3) first, old dimension 0 (size 2) second, and leaves old dimension 2 (size 2) last, giving the shape (3, 2, 2).
#### In other words, the **axes** argument lists the dimensions of the original shape in their new order.
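#### The cell below only transposes a 2-D array, so here is a short sketch of the **axes** argument on a 3-D array. The array contents and the name `swapped` are illustrative:

```python
import numpy as np

arr = np.arange(12).reshape(2, 3, 2)

# Swap the first two dimensions and keep the last one in place.
swapped = np.transpose(arr, axes=(1, 0, 2))

print(arr.shape)
print(swapped.shape)

# The element at [i, j, k] in arr sits at [j, i, k] in swapped.
print(arr[1, 2, 0] == swapped[2, 1, 0])
```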


In [9]:
arr = np.arange(8)
print(repr(arr))
arr = np.reshape(arr, (4, 2))
transposed = np.transpose(arr)
print(repr(arr))
print('arr shape: {}'.format(arr.shape))
print(repr(transposed))
print('transposed shape: {}'.format(transposed.shape))

array([0, 1, 2, 3, 4, 5, 6, 7])
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])
arr shape: (4, 2)
array([[0, 2, 4, 6],
       [1, 3, 5, 7]])
transposed shape: (2, 4)


# Math

## Arithmetic

#### Using NumPy arithmetic, we can easily modify large amounts of numeric data with only a few operations. For example, we could convert a dataset of Fahrenheit temperatures to their equivalent Celsius form.

In [10]:
def f2c(temps):
  return (5/9)*(temps-32)

fahrenheits = np.array([32, -4, 14, -40])
celsius = f2c(fahrenheits)
print('Celsius: {}'.format(repr(celsius)))

Celsius: array([  0., -20., -10., -40.])


#### It is important to note that performing arithmetic on NumPy arrays does not change the original array, and instead produces a new array that is the result of the arithmetic operation.
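#### A quick way to convince yourself of this is sketched below. It also contrasts ordinary operators with in-place operators such as **-=**, which do modify the array; the variable names are illustrative:

```python
import numpy as np

temps = np.array([32, -4, 14, -40])

# Ordinary arithmetic builds a new array and leaves temps alone.
shifted = temps - 32
print(repr(temps))
print(repr(shifted))

# An in-place operator changes the array it is applied to.
in_place = np.array([32, -4, 14, -40])
in_place -= 32
print(repr(in_place))
```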

## Non-linear functions

> For a function to be linear, the rate of change of y with respect to x has to be constant. In other words, its graph is a straight line.
> A non-linear function, on the other hand, gives a curved line instead of a straight line. In a non-linear function the rate of change of y with respect to x is not constant.

> Difference among **log**, **ln** (natural log), and **lb** in common mathematical notation:

| Notation | Base |
|----------|----------|
|log|10|
|ln|e|
|lb|2|

#### The function **np.exp** performs a base **e** exponential on an array, while the function **np.exp2** performs a base 2 exponential. Likewise, **np.log**, **np.log2**, and **np.log10** all perform logarithms on an input array, using base **e**, base **2**, and base **10**, respectively.

In [11]:
arr = np.array([[1, 2], [3, 4]])
# e raised to the power of each element
print(repr(np.exp(arr)))
# 2 raised to the power of each element
print(repr(np.exp2(arr)))

arr2 = np.array([[1, 10], [np.e, np.pi]])
# Natural logarithm
print(repr(np.log(arr2)))
# Base 10 logarithm
print(repr(np.log10(arr2)))

array([[ 2.71828183,  7.3890561 ],
       [20.08553692, 54.59815003]])
array([[ 2.,  4.],
       [ 8., 16.]])
array([[0.        , 2.30258509],
       [1.        , 1.14472989]])
array([[0.        , 1.        ],
       [0.43429448, 0.49714987]])


#### NumPy has various other mathematical functions, which are listed [here](https://numpy.org/doc/stable/reference/routines.math.html)

## Matrix multiplication

#### NumPy arrays can represent vectors and matrices. The main function to use is **np.matmul**, which takes two vector/matrix arrays as input and produces a dot product or matrix multiplication.
#### The code below shows various examples of matrix multiplication. When both inputs are 1-D, the output is the dot product.
#### The basic rules of matrix multiplication apply here too: the second dimension of the first matrix must equal the first dimension of the second matrix, otherwise np.matmul will result in a **ValueError**.
#### **As a reminder**, two matrices A (2 rows, 3 columns) and B (2 rows, 3 columns) cannot be multiplied. The shapes have to be, for example, A = 2x3 and B = 3x2, so that the number of columns in the first matrix equals the number of rows in the second.
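#### The shape rule can be sketched with arrays of ones, catching the error instead of letting it stop the cell. The names `a`, `b`, and `raised` are illustrative:

```python
import numpy as np

a = np.ones((2, 3))
b = np.ones((3, 2))

# (2, 3) x (3, 2): inner dimensions match, result is (2, 2).
print(np.matmul(a, b).shape)
# (3, 2) x (2, 3): inner dimensions match, result is (3, 3).
print(np.matmul(b, a).shape)

# (2, 3) x (2, 3): inner dimensions 3 and 2 differ, so this fails.
raised = False
try:
    np.matmul(a, a)
except ValueError as err:
    raised = True
    print('ValueError:', err)
```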

In [12]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([-3, 0, 10])
print(np.matmul(arr1, arr2))

arr3 = np.array([[1, 2], [3, 4], [5, 6]])
arr4 = np.array([[-1, 0, 1], [3, 2, -4]])
print(repr(np.matmul(arr3, arr4)))
print(repr(np.matmul(arr4, arr3)))
# Uncommenting the next line results in a ValueError: a (3, 2) array cannot be multiplied by a (3, 2) array.
#print(repr(np.matmul(arr3, arr3)))

27
array([[  5,   4,  -7],
       [  9,   8, -13],
       [ 13,  12, -19]])
array([[  4,   4],
       [-11, -10]])


# Random

#### A random number does NOT mean a different number every time. Random means something that cannot be predicted logically.


## Generate Random Number

#### Similar to the Python **random** module, NumPy has its own submodule for pseudo-random number generation called **np.random**. It provides all the necessary randomized operations and extends it to multi-dimensional arrays. To generate pseudo-random integers, we use the **np.random.randint** function.
#### To generate pseudo-random floats in the range [0, 1), we use the **np.random.rand** function.

In [13]:
print(np.random.randint(5))
print(np.random.randint(5))
print(np.random.rand())
print(np.random.rand())
print(np.random.randint(5, high=6))

random_arr = np.random.randint(-3, high=14,
                               size=(2, 2))
print(repr(random_arr))


0
4
0.33886351916658164
0.30955570365438634
5
array([[13,  8],
       [11, -2]])


#### By default, **np.random.randint** returns a single integer.
#### The **np.random.randint** function takes in a single required argument, which actually depends on the **high** keyword argument. If **high=None** (which is the default value), then the required argument represents the upper (exclusive) end of the range, with the lower end being 0. Specifically, if the required argument is n, then the random integer is chosen uniformly from the range [0, n).
#### If **high** is not **None**, then the required argument will represent the lower (inclusive) end of the range, while high represents the upper (exclusive) end.
#### The **size** keyword argument specifies the size of the output array, where each integer in the array is randomly drawn from the specified range.
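#### The inclusive and exclusive ends of the range can be checked empirically by drawing many samples. This is only a sketch; the sample size of 1000 and the variable names are arbitrary choices:

```python
import numpy as np

# Only the required argument: integers drawn from [0, 5).
samples = np.random.randint(5, size=1000)
print(samples.min(), samples.max())  # never below 0, never above 4

# With high set: integers drawn from [-3, 14).
bounded = np.random.randint(-3, high=14, size=1000)
print(bounded.min(), bounded.max())  # never below -3, never above 13
```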

## Utility functions

#### Some fundamental utility functions from the **np.random** module are **np.random.seed** and **np.random.shuffle**. We use the **np.random.seed** function to set the **random seed**, *which allows us to control the outputs of the pseudo-random functions*. The function takes in a single integer as an argument, representing the random seed.

In [14]:
np.random.seed(1)
print(np.random.randint(10))
random_arr = np.random.randint(3, high=100,
                               size=(2, 2))
print(repr(random_arr))

# New seed
np.random.seed(2)
print(np.random.randint(10))
random_arr = np.random.randint(3, high=100,
                               size=(2, 2))
print(repr(random_arr))

# Original seed
np.random.seed(1)
print(np.random.randint(10))
random_arr = np.random.randint(3, high=100,
                               size=(2, 2))
print(repr(random_arr))

5
array([[15, 75],
       [12, 78]])
8
array([[18, 75],
       [25, 46]])
5
array([[15, 75],
       [12, 78]])


#### The **np.random.shuffle** function allows us to randomly shuffle an array. Note that the shuffling happens in place (i.e. no return value), and shuffling multi-dimensional arrays only shuffles the first dimension.

In [15]:
vec = np.array([1, 2, 3, 4, 5])
np.random.shuffle(vec)
print(repr(vec))
np.random.shuffle(vec)
print(repr(vec))

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
np.random.shuffle(matrix)
print(repr(matrix))

array([3, 4, 2, 5, 1])
array([5, 3, 4, 2, 1])
array([[4, 5, 6],
       [7, 8, 9],
       [1, 2, 3]])


## Distributions

#### Using **np.random** we can also draw samples from probability distributions. For example, we can use **np.random.uniform** to draw pseudo-random real numbers from a uniform distribution.
#### **Note:** a uniform distribution means ***every value in the range [low, high) is equally likely.*** With no arguments, the range is [0, 1).

In [16]:
print(np.random.uniform())
print(np.random.uniform(low=-1.5, high=2.2))
print(repr(np.random.uniform(size=3)))
print(repr(np.random.uniform(low=-3.4, high=5.9,
                             size=(2, 2))))

0.3132735169322751
0.4408281904196243
array([0.44345289, 0.22957721, 0.53441391])
array([[5.09984683, 0.85200471],
       [0.60549667, 5.33388844]])


#### Another popular distribution we can sample from is the normal (Gaussian) distribution. The function we use is **np.random.normal**.
#### The Gaussian distribution (also known as the normal distribution) is a bell-shaped curve that is symmetric about its mean, so equal proportions of values fall above and below the mean. Many measurements are assumed to follow it. To understand the normal distribution, it is important to know the definitions of “mean,” “median,” and “mode.” The “mean” is the calculated average of all values, the “median” is the value at the center point (mid-point) of the distribution, and the “mode” is the value that was observed most frequently. If a distribution is normal, then the mean, median, and mode are the same. They may differ if the distribution is skewed (not Gaussian). Other characteristics of Gaussian distributions are as follows:

* Mean ± 1 SD contains 68.2% of all values.

* Mean ± 2 SD contains 95.5% of all values.

* Mean ± 3 SD contains 99.7% of all values.


![Gaussian Distribution](./imageNotes/gaussainDistribution)


#### A Gaussian distribution is shown in the figure above. As an example from clinical practice, a reference range is usually determined by measuring the value of an analyte in a large number of normal subjects (at least 100 healthy people, but preferably 200–300). The mean and standard deviation are then determined.
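#### The percentages listed above can be checked approximately by sampling. The sketch below assumes a standard normal distribution (mean 0, standard deviation 1); the seed and sample size are arbitrary choices:

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, used only for repeatability
samples = np.random.normal(loc=0.0, scale=1.0, size=100_000)

fractions = {}
for k in (1, 2, 3):
    # Fraction of samples lying within k standard deviations of the mean
    fractions[k] = np.mean(np.abs(samples - 0.0) <= k * 1.0)
    print('within {} SD: {:.3f}'.format(k, fractions[k]))
```

#### The printed fractions should land close to 0.68, 0.95, and 0.997, with small differences from run to run if the seed is changed.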

In [17]:
print(np.random.normal())
print(np.random.normal(loc=1.5, scale=3.5))
print(repr(np.random.normal(loc=-2.4, scale=4.0,
                            size=(2, 2))))

0.7252740646272712
4.772112039383628
array([[ 2.07318791, -2.17754724],
       [-0.89337346, -0.89545991]])


#### Like **np.random.uniform**, **np.random.normal** has no required arguments. The **loc** and **scale** keyword arguments represent the **mean** and **standard deviation**, respectively, of the normal distribution we sample from.
#### For more information on standard deviation, see the [np.std documentation](https://numpy.org/doc/stable/reference/generated/numpy.std.html).

#### While NumPy provides built-in distributions to sample from, we can also sample from a custom distribution with the **np.random.choice** function.

In [18]:
colors = ['red', 'blue', 'green']
print(np.random.choice(colors))
print(repr(np.random.choice(colors, size=2)))
print(repr(np.random.choice(colors, size=(2, 2),
                            p=[0.8, 0.19, 0.01])))

green
array(['blue', 'red'], dtype='<U5')
array([['red', 'red'],
       ['blue', 'red']], dtype='<U5')


In [19]:
colors = ['red', 'blue', 'green']
print(np.random.choice(colors, size=(3, 4, 5), p=[.25, .50, .25]))

[[['blue' 'blue' 'blue' 'green' 'red']
  ['green' 'green' 'blue' 'blue' 'green']
  ['red' 'blue' 'green' 'blue' 'blue']
  ['red' 'red' 'blue' 'red' 'blue']]

 [['blue' 'red' 'blue' 'red' 'blue']
  ['blue' 'red' 'blue' 'blue' 'blue']
  ['red' 'blue' 'blue' 'blue' 'green']
  ['blue' 'green' 'red' 'red' 'green']]

 [['blue' 'red' 'green' 'blue' 'green']
  ['blue' 'green' 'blue' 'green' 'blue']
  ['blue' 'green' 'blue' 'green' 'blue']
  ['blue' 'red' 'green' 'blue' 'blue']]]


#### The required argument for **np.random.choice** is the custom distribution we sample from. The **p** keyword argument denotes the **probabilities** given to each element in the input distribution. **Note** that the list of probabilities for **p** must **sum** to **1**.
#### In the first example above, we set **p** such that **'red'** has a probability of **0.8** of being chosen, **'blue'** has a probability of **0.19**, and **'green'** has a probability of **0.01**. When **p** is not set, the probabilities are equal for each element in the distribution (and sum to 1).
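#### To see the effect of **p**, we can draw many samples and compare the observed frequencies with the probabilities we passed in. This is a sketch only; the sample size of 10,000 and the name `draws` are arbitrary:

```python
import numpy as np

colors = ['red', 'blue', 'green']
probs = [0.8, 0.19, 0.01]  # must sum to 1

draws = np.random.choice(colors, size=10_000, p=probs)

# Count how often each color was drawn.
values, counts = np.unique(draws, return_counts=True)
for value, count in zip(values, counts):
    print(value, count / draws.size)
```

#### The observed frequencies should be close to, but not exactly, the values in **p**.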

# Indexing

## Array accessing

#### Accessing NumPy arrays works like accessing Python lists. For multi-dimensional arrays, it is equivalent to accessing Python lists of lists. You can access an element of a Python list or NumPy array by referring to its index number.
#### Indexes in NumPy arrays start at 0, meaning that the first element has index 0, the second has index 1, and so on.

In [20]:
arr = np.array([1, 2, 3, 4, 5])
print(arr[0])
print(arr[4])

arr = np.array([[6, 3], [0, 2]])
# Subarray
print(repr(arr[0]))

1
5
array([6, 3])


## Slicing

#### Similar to Python, we use the **colon operator** **[:]** for slicing. We can also use negative indexing to slice in the backwards direction.

In [21]:
arr = np.array([1, 2, 3, 4, 5])
print(repr(arr[:]))
print(repr(arr[1:]))
print(repr(arr[2:4]))
print(repr(arr[:-1]))
print(repr(arr[-2:]))

array([1, 2, 3, 4, 5])
array([2, 3, 4, 5])
array([3, 4])
array([1, 2, 3, 4])
array([4, 5])


#### For multi-dimensional arrays, we can use a comma to separate slices across each dimension.

In [22]:
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
print(repr(arr[:]))
print(repr(arr[1:]))
print(repr(arr[:, -1]))
print(repr(arr[:, 1:]))
print(repr(arr[0:1, 1:]))
print(repr(arr[0, 1:]))

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
array([[4, 5, 6],
       [7, 8, 9]])
array([3, 6, 9])
array([[2, 3],
       [5, 6],
       [8, 9]])
array([[2, 3]])
array([2, 3])


## Argmin and argmax

#### In addition to accessing and slicing arrays, it is useful to figure out the actual indexes of the minimum and maximum elements. To do this, we use the **np.argmin** and **np.argmax** functions.
#### **Note:** these functions return the **index** of the minimum or maximum element, not the element itself. For a multi-dimensional array, when no **axis** is given, **np.argmin** and **np.argmax** treat the whole array as a flattened 1-D array.
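#### Because the index returned for a multi-dimensional array refers to the flattened array, it often needs converting back to a row and column. One way to do this, not covered in the original notebook, is **np.unravel_index**; a minimal sketch:

```python
import numpy as np

arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [-3, 9, 1]])

# Index of the minimum in the flattened array.
flat_index = np.argmin(arr)

# Convert the flat index back to (row, column) for this shape.
row, col = np.unravel_index(flat_index, arr.shape)
print(flat_index, row, col)
print(arr[row, col])  # the minimum element itself
```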

In [23]:
arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [-3, 9, 1]])
print(np.argmin(arr[0]))
print(np.argmax(arr[2]))
print(np.argmin(arr))

2
1
5


#### The **np.argmin** and **np.argmax** functions take the same arguments. The required argument is the input array and the **axis** keyword argument specifies which dimension to apply the operation on.

In [24]:
arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [-3, 9, 1]])
print(repr(np.argmin(arr, axis=0)))
print(repr(np.argmin(arr, axis=1)))
print(repr(np.argmax(arr, axis=-1)))

array([2, 0, 1])
array([2, 2, 0])
array([1, 1, 1])


## Argsort

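#### The **np.argsort** function returns the **indexes** that would sort an array, rather than the sorted values. By default it sorts along the last axis; passing **axis=None** sorts the flattened array.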
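#### This section has no code cell, so here is a small sketch of **np.argsort** on a 1-D and a 2-D array (the example values and names are illustrative):

```python
import numpy as np

arr = np.array([30, 10, 20])
order = np.argsort(arr)
print(repr(order))       # indexes 1, 2, 0
print(repr(arr[order]))  # indexing with them yields the sorted values

matrix = np.array([[3, 1, 2],
                   [9, 7, 8]])
# Default axis=-1: each row is sorted independently.
by_row = np.argsort(matrix)
print(repr(by_row))
# axis=None: indexes into the flattened array.
flat_order = np.argsort(matrix, axis=None)
print(repr(flat_order))
```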

# Filtering

## Filtering data

#### The key to filtering data is basic relational operations, e.g. ==, >, etc. In NumPy, we can apply basic relational operations element-wise on arrays.
#### The **~** operation represents a boolean negation, i.e. it flips each truth value in the array.

In [25]:
arr = np.array([[0, 2, 3],
                [1, 3, -6],
                [-3, -2, 1]])
print(repr(arr == 3))
print(repr(arr > 0))
print(repr(arr != 1))
# Negated from the previous step
print(repr(~(arr != 1)))

array([[False, False,  True],
       [False,  True, False],
       [False, False, False]])
array([[False,  True,  True],
       [ True,  True, False],
       [False, False,  True]])
array([[ True,  True,  True],
       [False,  True,  True],
       [ True,  True, False]])
array([[False, False, False],
       [ True, False, False],
       [False, False,  True]])


#### Something to note is that **np.nan** can't be located with a relational operation, because any equality comparison with NaN evaluates to False (even np.nan == np.nan). Instead, we use **np.isnan** to filter for the location of np.nan.

In [26]:
arr = np.array([[0, 2, np.nan],
                [1, np.nan, -6],
                [np.nan, -2, 1]])
print(repr(np.isnan(arr)))

array([[False, False,  True],
       [False,  True, False],
       [ True, False, False]])
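
#### The boolean arrays above can be used as masks to pull out the matching elements. The following sketch shows one common way to do this with boolean indexing; the names `valid` and `positives` are illustrative:

```python
import numpy as np

arr = np.array([[0, 2, np.nan],
                [1, np.nan, -6],
                [np.nan, -2, 1]])

# NaN never compares equal to anything, including itself.
print(np.nan == np.nan)

# Keep only the entries that are not NaN. Boolean indexing
# returns a 1-D array holding the values 0, 2, 1, -6, -2, 1.
valid = arr[~np.isnan(arr)]
print(repr(valid))

# Masks can be applied again to filter further (values 2, 1, 1).
positives = valid[valid > 0]
print(repr(positives))
```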
