<h1 align="center">Python for DATA SCIENCE</h1><Br/>
<img src="https://goo.gl/ZKX5FF" style="width:15%; float:centre"><Br/>
<h2 align="center">Dr Mazen Gabriel Alhrishy</h2>
<h5 align="center"><i>MAZEN.ALHRISHY@GMAIL.COM</i></h5><Br/>

<table width=25%>
    <tr>
        <td>
            <a href="https://goo.gl/BTtR3C"><img src="https://goo.gl/rMsKok"></a>
        </td>
        <td>
            <a href="https://goo.gl/XaRDbH"><img src="https://goo.gl/KyMZcj"></a>
        </td>
        <td>
            <a href="https://goo.gl/9uCqS6"><img src="https://goo.gl/a8gcDK"></a>
        </td>
        <td>
            <a href="https://goo.gl/bnt2EL"><img src="https://goo.gl/1rT18x"></a>
        </td>
        <td>
            <a href="https://goo.gl/VmfU3S"><img src="https://goo.gl/WFFkxn"></a>
        </td>
    </tr>
</table>

# 6- NumPy for Numerical Computing

> ## [I- Introduction](#I)
> ## [II- Array creation](#II)
> ## [III- Basic arithmetic](#III)
> ## [IV- Indexing, slicing and iterating](#IV)
> ## [V- Shape Manipulation](#V)
> ## [VI- Universal functions (ufunc)](#VI)

### [- Exercise](#E)
### [- Solutions](#S)

***

## I- Introduction <a id='I'></a>

> ## [1. History](#I-1)
> ## [2. Installation](#I-2)
> ## [3. NumPy arrays](#I-3)
> ## [4. Motivation](#I-4)

### 1- History <a id='I-1'></a>

* In early 2005, **Travis Oliphant*** wanted to unify the community around a single array package and ported Numarray's features to Numeric, releasing the result as the **Num**erical **Py**thon extension or **NumPy 1.0** in 2006

<img src="https://goo.gl/d6Ddwm" style="width:30%; border-radius:30%; float:left; padding:10px 30px 10px 30px;"/>

* Travis Oliphant is an American data scientist and businessman who is also the founder of Anaconda


* NumPy is the fundamental package for scientific computing with Python. It contains among other things:<br>
     * A powerful N-dimensional array object<br>
     * Sophisticated (broadcasting) functions<br>
     * Useful linear algebra, Fourier transform, and random number capabilities


* NumPy is mostly written in C and wrapped in Python to provide good performance for large arrays


* __[NumPy website](http://www.numpy.org)__

### 2- Installation <a id='I-2'></a>

* The Anaconda Python distribution includes all the key packages needed to use NumPy. However, if you've created a basic virtual environment, you can install NumPy using conda:

In [None]:
! conda install numpy --yes

* To verify the package was installed

In [None]:
! conda list

* To import into a Python script and check installed version number

In [None]:
import numpy as np
print(np.__version__)

### 3- NumPy arrays <a id='I-3'></a>

> "Similar to a Python list, a NumPy array is a data structure which can store a fixed-size collection of elements. However, NumPy arrays can only store a collection of the **same data type**. Moreover, it provides much more **advanced handling of N-dimensional arrays*** for mathematical operations"

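* *Each dimension in an array is represented by an axis: 
    * Axis-0: represents the first dimension (the only axis of a 1D array; in a 2D array it runs down the rows, so reducing along it gives one result per column)
    * Axis-1: represents the second dimension (in a 2D array it runs across the columns, so reducing along it gives one result per row)
    * Axis-2: represents the third dimension (depth-wise)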

<img src="https://goo.gl/wfFHUF" width="70%"/>
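As a small sketch of the axis idea, the cell below builds arrays with 1, 2 and 3 axes and prints their **ndim** (number of axes) and **shape** (length of each axis). The variable names and values are made up for illustration only.

```python
import numpy as np

v = np.array([1, 2, 3])                   # 1 axis
m = np.array([[1, 2, 3], [4, 5, 6]])      # 2 axes: 2 rows, 3 columns
t = np.zeros((2, 3, 4))                   # 3 axes

# ndim is the number of axes, shape holds the length of each axis
for arr in (v, m, t):
    print(arr.ndim, arr.shape)
```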

### 4- Motivation <a id='I-4'></a>

* Lists are very powerful objects in Python. A list can hold a sequence of elements with different types if needed. Elements can be also added, removed or changed. However, one crucial feature is missing!

* Let’s consider the task of calculating the BMI from these 2 lists, where each weight/height pair is for the same person

In [None]:
weight = [65.4, 59.2, 63.6, 88.4, 68.7]
height = [1.73, 1.68, 1.71, 1.89, 1.79]

In [None]:
bmi = weight / height ** 2

* It seems that you can't perform mathematical operations, elementwise, over the entire list at once! The only option left is to loop over the list to calculate the BMI one at a time:

In [None]:
bmi = list()

for w, h in zip(weight, height):  # the built-in zip() function allows you to loop over 2 lists at once (or more)
    bmi.append(w / h ** 2)
    
print(bmi)

* What about rounding every element in the bmi list to 2 decimal digits?

In [None]:
round(bmi, 2)

* We need to loop again! This is clearly inefficient, especially for very big lists. Solution: NumPy!

In [None]:
import numpy as np

# convert height list into a numpy array 
np_height = np.array(height)
print(np_height, type(np_height))

# convert weight list into a numpy array 
np_weight = np.array(weight)
print(np_weight, type(np_weight))

* Now we can calculate all the BMI values for all weight/height pairs at once!

In [None]:
bmi = np_weight / np_height ** 2
print(bmi, type(bmi))

* We can also round all elements in the array to 2 decimal digits at once using the NumPy **round()** function

In [None]:
bmi = np.round(bmi, 2)
print(bmi)

***
## II- Array creation <a id="II"></a>

* There are several ways to create NumPy arrays. Here are the most common ones:

> ### [1- Using array()](#II-1)
> ### [2- Using zeros(), and ones()](#II-2)
> ### [3- Using arange(), and linspace()](#II-3)
> ### [4- Using the random module](#II-4)

### 1- Using array() <a id='II-1'></a>

* When the elements of an array are known, the **array()** function is used

* **array()** expects a regular Python sequence, such as a list or a tuple of the known elements. The type of the resulting array is deduced from the type of the elements in the sequence

In [None]:
# create a 1D array from Python list
a = np.array([2, 3, 4])
print('a =', a, type(a))

* To get the type and the shape of the created array, the **dtype** and **shape** attributes* can be used

*An object has methods that do tasks, and also attributes that save values. Similar to methods, they are also accessible through the dot operator

In [None]:
print(a.dtype, a.shape)

In [None]:
# create a 1D array from Python tuple
b = np.array((1.2, 3.5, 5.1))
print('b =', b, b.dtype, b.shape)

* **array()** transforms sequences of sequences into 2D arrays, sequences of sequences of sequences into 3D arrays, and so on (i.e. N-dimensional arrays)

In [None]:
# create a 2D array from list of 2 tuples
c = np.array([(1.5, 2, 3), (4, 5, 6)])
print('c =\n', c, c.dtype, c.shape)

* The type of the array can also be explicitly specified at creation time

In [None]:
d = np.array([[1, 2], [3, 4]], dtype=np.float64)  # np.float was removed in NumPy 1.24
print('d =\n', d, d.dtype, d.shape)

### 2- Using zeros(), and ones() <a id='II-2'></a>

* When the elements of an array are originally unknown, but its shape is known, either **zeros()**, or **ones()** is usually used to create an array with initial placeholder content. By default, the **dtype** of the created array is 'float64'

In [None]:
# create a 3*4 array full of zeros
a = np.zeros((3, 4))
print('a =\n', a, a.dtype, a.shape)

In [None]:
# create a 2*3*4 array full of ones
b = np.ones((2, 3, 4), dtype=np.int16)
print('b =\n', b, b.dtype, b.shape)

### 3- Using arange(), and linspace() <a id='II-3'></a>

* When we need to create an array with an evenly spaced sequence of numbers, **arange()** and **linspace()** can be used

* The **arange()** function is similar to the built-in function **range()** but returns an array instead of a range object, and it also accepts non-integer steps

In [None]:
# create an array starting at 10, stopping at 30 exclusive, with a step of 5
np.arange(10, 30, 5)

* The **linspace()** function expects the number of elements that we want instead of the step

In [None]:
# create an array starting at 0, stopping at 2 inclusive, with 9 elements
np.linspace(0, 2, 9)
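The difference between the two functions can be sketched side by side. The values below are chosen only for illustration; **retstep=True** is an optional **linspace()** argument that also returns the spacing it computed.

```python
import numpy as np

# arange() takes a step; the stop value is excluded
steps = np.arange(0, 2, 0.25)
print(steps)

# linspace() takes the number of points; the stop value is included by default
points = np.linspace(0, 2, 9)
print(points)

# retstep=True also returns the spacing between consecutive points
_, spacing = np.linspace(0, 2, 9, retstep=True)
print(spacing)
```

When the step is not an integer, **linspace()** is usually the safer choice because the number of elements is fixed in advance.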

### 4- Using the random module <a id='II-4'></a>

* When we need to create an array with random sampling from a specific distribution, the **random** module can be used

* **rand()** creates an array of the given shape and populates it with random samples from a __[continuous uniform distribution](https://en.wikipedia.org/wiki/Uniform_distribution_(continuous))__ over [0, 1) with 'float64' dtype

In [None]:
# create a 3*2 array full of random content 
# sampled from a uniform distribution over [0, 1)
np.random.rand(3, 2)

* **randint()** creates an array of the given shape and populates it with random samples from a __[discrete uniform distribution](https://en.wikipedia.org/wiki/Discrete_uniform_distribution)__ of the specified dtype (default is int)

In [None]:
# create a 2*4 array full of random content between [1, 5) 
# sampled from a discrete uniform distribution
np.random.randint(1, 5, size=(2, 4))  # 5 is exclusive

* **randn()** creates an array of the given shape and populates it with random samples from a standard __[normal (Gaussian) distribution](https://en.wikipedia.org/wiki/Normal_distribution)__ (mean=0, variance=1), with 'float64' dtype

In [None]:
# create a 2*4 array full of random content 
# sampled from a normal distribution
np.random.randn(2, 4)
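A quick way to convince yourself of the ranges described above is to draw many samples and inspect them. This is only a sketch: the seed value and the sample size of 1000 are arbitrary choices, not part of the original material.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the sketch is repeatable

u = np.random.rand(1000)             # uniform samples over [0, 1)
k = np.random.randint(1, 5, 1000)    # integers from 1 to 4 (5 is excluded)
z = np.random.randn(1000)            # standard normal samples

print(u.min() >= 0, u.max() < 1)     # both bounds hold
print(np.unique(k))                  # the distinct integers that were drawn
print(z.mean(), z.std())             # close to 0 and 1, but not exact
```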

***
## III- Basic arithmetic <a id="III"></a>

> ### [1- With scalars (numbers)](#III-1)
> ### [2- Between arrays](#III-2)
> ### [3- Array reductions](#III-3)


* All arithmetic operators are applied **elementwise**
* The original arrays are not modified (a new array is created and filled with the results)
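The second point can be checked with a short sketch (the array and the factor are arbitrary). For contrast, it also shows that augmented assignment such as **+=** does write into the existing array.

```python
import numpy as np

a = np.arange(1, 6)
result = a * 10      # builds a new array; 'a' is left untouched

print(a)
print(result)
print(result is a)   # two different objects

# augmented assignment, by contrast, updates the existing array in place
a += 1
print(a)
```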

### 1- With scalars (numbers) <a id='III-1'></a>

In [None]:
a = np.arange(1, 6)
print('a =', a)

b = 2
print('b =', b)

print('a + b =', a + b)
print('a - b =', a - b)
print('a * b =', a * b)
print('a / b =', a / b)
print('a // b =', a // b)
print('a % b =', a % b)
print('a ** b =', a ** b)

### 2- Between arrays <a id='III-2'></a>

* When operating with arrays of different types, the type of the resulting array corresponds to the more general or precise one (a behavior known as upcasting)

In [None]:
a = np.array([20, 30, 40, 50])
print('a =', a)

b = np.array([2.5, 3.5, 4.5, 5.5])
print('b =', b)

print('a + b =', a + b)
print('a - b =', a - b)
print('a * b =', a * b)
print('a / b =', a / b)
print('a // b =', a // b)
print('a % b =', a % b)
print('a ** b =', a ** b)
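Upcasting can be observed by inspecting the **dtype** of each result. A minimal sketch, using hypothetical variable names:

```python
import numpy as np

ints = np.array([20, 30, 40, 50])
floats = np.array([2.5, 3.5, 4.5, 5.5])

mixed = ints + floats
halved = ints / 2

print(ints.dtype)    # an integer type (the exact width depends on the platform)
print(floats.dtype)  # float64
print(mixed.dtype)   # the result takes the more general float type
print(halved.dtype)  # true division yields floats even for integer arrays
```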

### 3- Array reductions <a id='III-3'></a>

*  Extrema reductions

In [None]:
a = np.array([[1,2], [3,4]])

print('a =\n', a)

print('min = ', np.min(a))
print('max = ', np.max(a))

*  Statistical reductions

In [None]:
print('a =\n', a)

print('mean = ', np.mean(a))
print('median = ', np.median(a))
print('sum = ', np.sum(a))
print('std = ', np.std(a))

By default, all the above reduction operators apply to the array as though it were a flat list of numbers, regardless of its shape. However, by specifying the **axis** parameter you can apply an operation along the specified axis of an array
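A minimal sketch of the **axis** parameter, reusing the same 2*2 values as the cells above:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

total = np.sum(a)             # whole array
per_column = np.sum(a, axis=0)  # collapses the rows: one sum per column
per_row = np.sum(a, axis=1)     # collapses the columns: one sum per row
col_max = np.max(a, axis=0)     # largest value in each column

print(total)
print(per_column)
print(per_row)
print(col_max)
```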

*  Logical reductions

In [None]:
print('a =\n', a)

print(np.all([True, True, False]))
print(np.any([True, True, False]))
print((a != 0))
print(np.any(a != 0))
print((a == 0))
print(np.all(a == 0))

***
## IV- Indexing, slicing and iterating <a id="IV"></a>

* Elements in 1d arrays can be accessed just like lists and other Python sequences

In [None]:
a = np.arange(10)
print(a)

In [None]:
# indexing
print(a[2])
print(a[-1])

In [None]:
# slicing
print(a[2:5])
print(a[2:8:2])

In [None]:
# iterating
for i in a:
    print(i)

* Elements in N-dimensional arrays are accessed by using one index per axis, separated by commas

In [None]:
b = np.arange(12).reshape(3, 4)
print(b)

In [None]:
# indexing
print(b[1, 3])  # 2nd row, 4th col
print(b[2, :])  # 3rd row, all cols (b[2] for short)
print(b[:, -1])  # all rows, last col

In [None]:
# slicing
print(b[0:2, 1])
print(b[:, 1])
print(b[1:3, :])

Iterating is done row-wise

In [None]:
for row in b:
    print(row)

To access each element in the array, you can use nested loops, the first loops row-wise, and the second column-wise

In [None]:
for row in b:
    for col in row:
        print(col)

Alternatively, the **flat** attribute can be used; it is an iterator* over all the elements of the array

*An iterator is an object that returns the next value of a sequence each time the next value is requested (for example by a for loop)

In [None]:
for element in b.flat:
    print(element)
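To see what "returns the next value" means in practice, the sketch below pulls values out of the **flat** iterator by hand with the built-in **next()** function. The small array is made up for illustration.

```python
import numpy as np

b = np.arange(6).reshape(2, 3)

it = b.flat          # an iterator over every element, row by row
first = next(it)     # request the first value
second = next(it)    # request the following one
rest = list(it)      # consume whatever is left

print(first, second, rest)
```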

***
## V- Shape Manipulation <a id="V"></a>

> ### [1- Changing array's shape](#V-1)
> ### [2- Stacking arrays](#V-2)
> ### [3- Splitting an array](#V-3)

### 1- Changing array's shape <a id='V-1'></a>
* We can change the shape of an array in several ways. The most common ones are **reshape()** and **ravel()**

In [None]:
# create a 3*4 array full of random content between [1, 10) 
a = np.random.randint(1, 10, size=(3, 4))
print('a =\n', a)

In [None]:
# reshape 'a' into 2*6 array
b = a.reshape(2, 6)
print('b =\n', b)

In [None]:
# reshape 'a' into a 1D array
c = a.ravel()
print('c =\n', c)

* Both methods return a modified array, but do not change the original array. To change the original array we can use **resize()**

In [None]:
print('original a =\n', a)

# resize a into 2*6 array
a.resize(2, 6)
print('resized a =\n', a)
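The difference between returning a new array and changing the array itself can be sketched as follows. The arrays **x** and **z** are hypothetical examples, not part of the cells above.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)   # example array (values are arbitrary)

y = x.reshape(2, 6)               # a new array object with a different shape
flat = x.ravel()                  # a new 1D array object
print(x.shape, y.shape, flat.shape)   # x keeps its (3, 4) shape

z = np.arange(12)                 # a separate array, so the in-place change is easy to see
returned = z.resize(3, 4)         # resize() reshapes z itself ...
print(z.shape, returned)          # ... and returns None
```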

* The **transpose()** function swaps the row and column axes of a 2D array

In [None]:
np.transpose(a)  # or a.T for short

### 2- Stacking arrays <a id='V-2'></a>

* Arrays can be stacked together along different axes using various functions. Here are some:

In [None]:
a = np.arange(1, 5).reshape(2, 2)
print('a =\n', a)

b = np.arange(5, 9).reshape(2, 2)
print('b =\n', b)

* **hstack()** is used to stack along the horizontal axis (column-wise)

In [None]:
# stack horizontally
np.hstack((a, b))

* **vstack()** is used to stack along the vertical axis (row-wise) 

In [None]:
# stack vertically
np.vstack((a, b))

* Alternatively, **concatenate()** can be used to join arrays along an existing axis

In [None]:
# concatenate along axis=1 (column-wise, same result as hstack)
np.concatenate((a, b), axis=1)

In [None]:
# concatenate along axis=0 (row-wise, same result as vstack)
np.concatenate((a, b), axis=0)
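A short check, under the same 2*2 example arrays, that the stacking helpers agree with **concatenate()** and how the resulting shapes differ:

```python
import numpy as np

a = np.arange(1, 5).reshape(2, 2)
b = np.arange(5, 9).reshape(2, 2)

h = np.hstack((a, b))   # side by side: 2 rows, 4 columns
v = np.vstack((a, b))   # on top of each other: 4 rows, 2 columns
print(h.shape, v.shape)

same_h = np.array_equal(h, np.concatenate((a, b), axis=1))
same_v = np.array_equal(v, np.concatenate((a, b), axis=0))
print(same_h, same_v)
```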

### 3- Splitting an array <a id='V-3'></a>

* Arrays can also be split along different axes using various functions. Some of them are:

In [None]:
a = np.linspace(1, 10, 16).reshape(4, 4)
print('a =\n', a)

* **hsplit()** is used to split along the horizontal axis (column-wise)

In [None]:
# split horizontally into 2 arrays
a1, a2 = np.hsplit(a, 2)
print('a1 =\n', a1)
print('a2 =\n', a2)

* **vsplit()** is used to split along the vertical axis (row-wise)

In [None]:
# split vertically into 2 arrays
a1, a2 = np.vsplit(a, 2)
print('a1 =\n', a1)
print('a2 =\n', a2)

* Alternatively, **split()** can be used to split arrays along an existing axis

In [None]:
# split along axis=1 (column-wise)
a1, a2 = np.split(a, 2, axis=1)
print('a1 =\n', a1)
print('a2 =\n', a2)

In [None]:
# split along axis=0 (row-wise)
a1, a2 = np.split(a, 2, axis=0)
print('a1 =\n', a1)
print('a2 =\n', a2)
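Splitting and stacking are inverse operations, which the sketch below checks on a made-up 4*4 array. It also shows that **split()** accepts a list of indices instead of a number of parts, in case the pieces should not be equal in size.

```python
import numpy as np

a = np.arange(16).reshape(4, 4)

left, right = np.hsplit(a, 2)    # two 4*2 blocks
top, bottom = np.vsplit(a, 2)    # two 2*4 blocks
print(left.shape, top.shape)

# stacking the pieces back together restores the original array
restored_h = np.hstack((left, right))
restored_v = np.vstack((top, bottom))
print(np.array_equal(restored_h, a), np.array_equal(restored_v, a))

# split before column 1 and before column 3: blocks of 1, 2 and 1 columns
parts = np.split(a, [1, 3], axis=1)
print([p.shape for p in parts])
```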

***
## VI- Universal functions (ufunc) <a id="VI"></a>

* A universal function (or ufunc for short) is a function that operates on nd-arrays in an **element-by-element** fashion

* There are currently more than 60 universal functions defined in numpy __[here](https://docs.scipy.org/doc/numpy-1.14.0/reference/ufuncs.html#available-ufuncs)__ 

These include:
> ### [1. Math operations](#VI-1)
> ### [2. Trigonometric function](#VI-2)
> ### [3. Floating functions](#VI-3)
> ### [4. Comparison functions](#VI-4)

### 1. Math operations <a id='VI-1'></a>

* All basic arithmetic we saw actually calls a defined math ufunc. For example when adding 2 arrays, the **add()** ufunc is called under the hood!

In [None]:
a = np.array([20, 30, 40, 50])
print('a =', a)

b = np.array([2.5, 3.5, 4.5, 5.5])
print('b =', b)

print('a + b =', np.add(a, b))
print('a - b =', np.subtract(a, b))
print('a * b =', np.multiply(a, b))
print('a / b =', np.divide(a, b))
print('a // b =', np.floor_divide(a, b))
print('a % b =', np.mod(a, b))
print('a ** b =', np.power(a, b))
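As a quick sanity check of the "under the hood" claim, the sketch below compares the operator result with the explicit ufunc call and confirms that **np.add** is a ufunc object.

```python
import numpy as np

a = np.array([20, 30, 40, 50])
b = np.array([2.5, 3.5, 4.5, 5.5])

same_sum = np.array_equal(a + b, np.add(a, b))
same_power = np.array_equal(a ** b, np.power(a, b))
is_ufunc = isinstance(np.add, np.ufunc)

print(same_sum, same_power, is_ufunc)
```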

* Some other useful ones:

In [None]:
print(np.sqrt([1, 4, 9]))  # returns the positive square-root of the input
print(np.square([1, 4, 9]))  # returns the square of the input
print(np.fabs([-1.2, 1.2]))  # returns the absolute value of the input

### 2. Trigonometric function <a id='VI-2'></a>

* All trigonometric functions use radians as an input. The ratio of degrees to radians is $180^{\circ}/\pi$
* To convert from degrees > radians use **deg2rad()**
* To convert from radians > degrees use **rad2deg()**

* An example of calculating the sine

In [None]:
a = np.array([0, 30, 45, 60, 90])  # in degrees
np.sin(np.deg2rad(a))  # sine (degrees are converted to radians first)

* An example of calculating the arcsine

In [None]:
b = np.array([0, 0.5, 0.707, 0.866, 1])
np.rad2deg(np.arcsin(b))  # inverse sine (the answer is converted from radians to degrees)
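The conversion functions simply apply the ratio stated above. A minimal sketch that does the conversion by hand and compares it with **deg2rad()** and **rad2deg()**:

```python
import numpy as np

degrees = np.array([0, 30, 45, 60, 90])

# manual conversion: multiply by pi / 180
manual = degrees * np.pi / 180

matches_deg2rad = np.allclose(manual, np.deg2rad(degrees))
round_trip = np.allclose(np.rad2deg(manual), degrees)

print(matches_deg2rad, round_trip)
```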

### 3. Floating functions <a id='VI-3'></a>

* Some useful ones:

In [None]:
a = np.array([-1.7, -1.5, -0.2, 0.2, 1.5, 1.7, 2.0])
print('a =', a)

print(np.floor(a))  # returns the floor of the input
print(np.ceil(a))  # returns the ceiling of the input

In [None]:
print(np.isnan(np.nan))  # returns True where the input is NaN (not a number)
print(np.isinf(np.inf))  # returns True where the input is positive or negative infinity

### 4. Comparison functions <a id='VI-4'></a>

* Python logical operators **and** and **or** can't be used with NumPy arrays to combine logical array expressions elementwise!

* The following example will throw an error

In [None]:
a = np.arange(5)
print('a =', a)

print(a > 1 and a < 4)

* For this the **logical_and()** ufunc should be used

In [None]:
print(np.logical_and(a > 1, a < 4))  # elements with both conditions true will have True value
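On boolean arrays, the operators **&**, **|** and **~** give the same elementwise results as **logical_and()**, **logical_or()** and **logical_not()**. The sketch below assumes the same array as above; note that the parentheses around each comparison are required because **&** and **|** bind more tightly than **>** and **<**.

```python
import numpy as np

a = np.arange(5)

both = (a > 1) & (a < 4)     # elementwise "and"
either = (a < 1) | (a > 3)   # elementwise "or"
neither = ~either            # elementwise "not"

print(both)
print(either)
print(neither)
print(np.array_equal(both, np.logical_and(a > 1, a < 4)))
```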

*  Python comparison operators can be used directly (e.g. **>**, **==**). These also have equivalent ufuncs in NumPy (which are called under the hood anyway!)

In [None]:
a = np.arange(5)
print('a =', a)

b = np.array([1, 1, 2, 2, 3])
print('b =', b)

print(a >= b)
print(np.greater_equal(a, b))
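The boolean array produced by a comparison can be placed inside square brackets to select only the elements where it is True, which is the technique the exercise asks for. A minimal sketch with the same arrays (the name **mask** is just an illustrative choice):

```python
import numpy as np

a = np.arange(5)
b = np.array([1, 1, 2, 2, 3])

mask = a >= b          # boolean array, one entry per element
selected = a[mask]     # keeps only the elements where mask is True
count = np.sum(mask)   # True counts as 1, so this is the number of matches

print(mask)
print(selected)
print(count)
```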

***

### Exercise <a id='E'></a>
> Modified from dataCamp.com

You are a huge baseball fan. You decide to call the MLB (Major League Baseball) and ask around for some more statistics on the height of the main players. They pass along data on a thousand players, which is stored as a text file: height.txt with the heights expressed in inches.

1.	Extract the heights from the text file into a Python list ‘height’
2.	Create a numpy array ‘np_height’ from the extracted list ‘height’
3.	Convert ‘np_height’ so that the units are now in meters, and save the result in ‘np_height_m’ (1 inch = 0.0254 meter)

The MLB also offers to let you analyse their weight data. Again, they pass along weight data for the same thousand players as a text file: weight.txt, with the weights expressed in pounds.

4.	Extract the weights from the text file into a Python list ‘weight’
5.	Create a numpy array ‘np_weight’ from the extracted list ‘weight’
6.	Convert ‘np_weight’ so that the units are now in kg, and save the result in ‘np_weight_kg’ (1 pound = 0.453592 kg)
7.	Create a numpy array 'bmi' using ‘np_height_m’ and ‘np_weight_kg’
8.	Print out BMIs of all baseball players whose BMI is below 21. For this you will need to :
    1.	Create a boolean numpy array ‘light’. The element of the array should be True if bmi < 21
    2.	Use ‘light’ inside square brackets to do a selection on the bmi array

You have another look at the MLB data and realize that it makes more sense to restructure all this information into a 2D numpy array. This array should have 1000 rows, corresponding to the 1000 baseball players you have information on, and 2 columns for height and weight.
9.	Create a 2d numpy array ‘np_baseball’ from 'np_height_m' and 'np_weight_kg'
10.	Print the shape of ‘np_baseball’
11. Run a quick summary statistic on each column to calculate the min, max, mean, median, and standard deviation
12. Check if taller players tend to be heavier. Use np.corrcoef() to store the correlation between the first and second column of np_baseball in 'corr'

Extra: write a function that handles opening the given text files and reading the values into a list. Use the function to extract ‘height’ and ‘weight’

***

### Solutions <a id='S'></a>

In [None]:
import numpy as np
import os

# read data from height.txt as one string
filename = os.path.join('Examples', 'height.txt')
with open(filename) as f:
    data = f.read()

# split heights into a list (still as strings)
height = data.split(', ')

# convert into numpy array and cast strings into float
np_height = np.array(height, dtype=np.float64)
print(np_height.shape)

# convert inches into meters
np_height_m = np_height * 0.0254

# read data from weight.txt as one string
filename = os.path.join('Examples', 'weight.txt')
with open(filename) as f:
    data = f.read()

# split weights into a list (still as strings)
weight = data.split(', ')

# convert into numpy array and cast strings into float
np_weight = np.array(weight, dtype=np.float64)
print(np_weight.shape)

# convert pounds into kg
np_weight_kg = np_weight * 0.453592

# calculate BMI
bmi = np_weight_kg / (np_height_m ** 2)

# print BMI < 21
light = bmi < 21
print(bmi[light])

# join 2 arrays into one
np_baseball = np.stack((np_height_m, np_weight_kg), axis=1)

# check shape
print(np_baseball.shape)

# run quick stats
print('min = ', np.min(np_baseball, axis=0))
print('max = ', np.max(np_baseball, axis=0))
print('mean = ', np.mean(np_baseball, axis=0))
print('median = ', np.median(np_baseball, axis=0 ))
print('std = ', np.std(np_baseball, axis=0))

# check correlation between the first and second column (heights in m, weights in kg)
corr = np.corrcoef(np_baseball[:, 0], np_baseball[:, 1])

# corrcoef() returns a 2*2 matrix with one coefficient for each pair of inputs (including each input with itself)
print(corr)
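To make the layout of the matrix returned by **corrcoef()** concrete, here is a sketch on four made-up height/weight pairs (not the MLB data). The diagonal holds each variable's correlation with itself, and the two off-diagonal entries hold the same height/weight coefficient.

```python
import numpy as np

# made-up values for illustration only
heights = np.array([1.70, 1.80, 1.90, 2.00])
weights = np.array([65.0, 80.0, 85.0, 100.0])

corr = np.corrcoef(heights, weights)
print(corr.shape)

# the coefficient of interest sits off the diagonal
height_weight_corr = corr[0, 1]
print(height_weight_corr)
```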

In [None]:
# Extra

def read_file_into_np(full_file_path):
    # reads data from txt as one string
    with open(full_file_path) as f:
        data = f.read() 

    # split data into a list (still as strings)
    data_list = data.split(', ')
    
    # convert into numpy array and cast strings into float
    np_data_list = np.array(data_list, dtype=np.float64)

    return np_data_list


# read height.txt into a numpy array
filename1 = os.path.join('Examples', 'height.txt')
np_height = read_file_into_np(filename1)

# read weight.txt into a numpy array
filename2 = os.path.join('Examples', 'weight.txt')
np_weight = read_file_into_np(filename2)