## MG-GY 8401: Programming for Business Intelligence and Analytics
### Lab 3

We want to get some experience with the `numpy` package for operations on arrays. Along the way, we will learn about reading data from files and writing data to files. 

---


### Importing NumPy

It is customary to import NumPy as `np`:

In [1]:
import numpy as np

NumPy arrays have two main properties you need to keep in mind:

1. Shape
2. Data type (or `dtype`) of the data they contain

In particular, a NumPy `array` can only contain data of a single type.
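
To illustrate the single-type rule, here is a minimal sketch of what happens when a list mixes types. The names `mixed_numbers` and `mixed_text` are chosen for this example only.

```python
import numpy as np

# Integers and floats together: NumPy upcasts every entry to float.
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)    # float64

# Numbers and strings together: every entry becomes a string.
mixed_text = np.array([1, "two"])
print(mixed_text.dtype.kind)  # 'U' (Unicode string)
```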

### Properties

What is `x1`'s `dtype`?

In [2]:
x1 = np.array([0,1,2,3,4])
print(x1)

[0 1 2 3 4]


What is `x2`'s `shape`?

In [31]:
x2 = np.array([[1.0, 2.0], [3.0, 4.0]])
print(x2)

[[1. 2.]
 [3. 4.]]
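
One way to check the answers to the two questions above is to print the attributes directly. This is a small sketch that rebuilds `x1` and `x2` so it can run on its own.

```python
import numpy as np

x1 = np.array([0, 1, 2, 3, 4])
x2 = np.array([[1.0, 2.0], [3.0, 4.0]])

# The integer width is platform dependent (commonly int64).
print(x1.dtype)
# Two rows and two columns.
print(x2.shape)   # (2, 2)
```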


Arrays have a shape, which gives the number of rows, columns, fibers, etc. along their different dimensions.

In [114]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
print(arr.ndim)
print(arr.shape)

2
(2, 3)


Arrays have a type, which corresponds to the type of data they contain.

In [115]:
print(arr.dtype)

float64


Note that we can change the type of an array with `astype`, which returns a new array and leaves the original unchanged.

In [6]:
arr.astype(int)

array([[1, 2, 3],
       [4, 5, 6]])
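
The following sketch shows two details of `astype` that are easy to miss: the result is a new array, and converting floats to integers drops the fractional part. The array `measurements` is a made-up example.

```python
import numpy as np

measurements = np.array([[1.7, -1.7], [2.0, 3.9]])
as_int = measurements.astype(int)

print(as_int)               # fractional parts are dropped (truncation toward zero)
print(measurements.dtype)   # float64: the original array is unchanged
```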

Arrays don't have to contain numbers. What are `x3`'s `shape` and `dtype`?

In [80]:
x3 = np.array([["A", "matrix"], ["of", "words."]])
print(x3.dtype) 
print(x3.shape)

<U6
(2, 2)


What does `<U6` mean?

- `<`: [Little Endian](https://en.wikipedia.org/wiki/Endianness) byte order
- `U`: [Unicode](https://en.wikipedia.org/wiki/Unicode) string
- `6`: length of the longest string, in characters
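
A practical consequence of the fixed length is sketched below: a string array reserves room for 6 characters per entry, so longer strings are cut off when assigned. The replacement string is arbitrary.

```python
import numpy as np

x3 = np.array([["A", "matrix"], ["of", "words."]])
print(x3.dtype.kind)      # 'U' for Unicode
print(x3.dtype.itemsize)  # 24 bytes: 6 characters at 4 bytes each

# Strings longer than 6 characters are silently truncated on assignment.
x3[0, 0] = "something longer"
print(x3[0, 0])           # 'someth'
```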

We can change the data type.

In [116]:
x2.astype("int32")

array([[1, 2],
       [3, 4]])

### Creating Arrays 

We can create arrays containing all 0's or all 1's. This can be helpful to initialize an array with the right shape.

In [117]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [118]:
np.ones([3, 2])

array([[1., 1.],
       [1., 1.],
       [1., 1.]])
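
Both functions produce floats by default. As a small sketch, the `dtype` argument changes that, and `np.full` initializes an array with a value other than 0 or 1. The names `counts` and `filled` are illustrative.

```python
import numpy as np

counts = np.zeros((2, 3), dtype=int)   # integer zeros instead of the default floats
filled = np.full((2, 2), 7.5)          # every entry set to 7.5

print(counts.dtype)
print(filled)
```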

Just like the built-in Python function `range`, the function `np.arange(start, stop, step)` makes a container of numbers that differ by a fixed increment. As with `range`, the `stop` value is excluded.

In [119]:
np.arange(0, 10, 2)

array([0, 2, 4, 6, 8])
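
The comparison with `range` can be sketched as follows. One difference is that `np.arange` also accepts a non-integer step.

```python
import numpy as np

# Same start, stop, and step: range and np.arange agree.
print(list(range(0, 10, 2)))           # [0, 2, 4, 6, 8]
print(np.arange(0, 10, 2).tolist())    # [0, 2, 4, 6, 8]

# Unlike range, np.arange accepts a non-integer step.
quarters = np.arange(0, 1, 0.25)
print(quarters.tolist())               # [0.0, 0.25, 0.5, 0.75]
```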

We can also specify the number of points instead of the increment. The `np.linspace(start, stop, num)` function generates `num` evenly spaced numbers from `start` to `stop`, with both endpoints included by default.

In [120]:
np.linspace(0, 5, 10)

array([0.        , 0.55555556, 1.11111111, 1.66666667, 2.22222222,
       2.77777778, 3.33333333, 3.88888889, 4.44444444, 5.        ])
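
The spacing in the output above is 5/9 rather than 0.5 because the endpoint is included. A short sketch using the `retstep` and `endpoint` arguments makes this visible:

```python
import numpy as np

points, step = np.linspace(0, 5, 10, retstep=True)
print(step)          # spacing is (5 - 0) / (10 - 1)
print(points[-1])    # 5.0, the endpoint is included

# With endpoint=False the stop value is left out, as in np.arange.
half_open = np.linspace(0, 5, 10, endpoint=False)
print(half_open)
```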

### Accessing 

Since arrays can have many dimensions, we need to specify a tuple containing all of the indices.

In [121]:
arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
arr

array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])

For example, the tuple `(1, 0)` selects the row at index 1 and the column at index 0. Because indexing starts at 0, this is the second row and the first column.

In [122]:
arr[(1,0)]

4.0

We are allowed to omit the parentheses in the tuple.

In [123]:
arr[1,0]

4.0

If we compare arrays and lists, then a 2-dimensional array is like a list of lists, so `numpy` also supports repeated use of brackets to access entries.

In [124]:
arr[1][0]

4.0

Just like lists, we can use negative indices and slicing.

In [125]:
arr[-1,-2]

8.0

In [126]:
arr[:2, 1:]

array([[2., 3.],
       [5., 6.]])
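
Unlike list slices, array slices are views of the original data. The sketch below uses a hypothetical array `grid` with the same values as `arr` to show that writing to a slice changes the original, while an explicit copy does not.

```python
import numpy as np

grid = np.arange(1.0, 10.0).reshape(3, 3)   # same values as arr above
corner = grid[:2, 1:]                       # a view, not a copy

corner[0, 0] = -1.0                         # writes through to grid
print(grid[0, 1])                           # -1.0

independent = grid[:2, 1:].copy()           # an explicit copy is independent
independent[0, 0] = 100.0
print(grid[0, 1])                           # still -1.0
```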


### Jagged Arrays

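Is the following valid?

```python
arr = np.array([[1, 2, 3], [4, 5], [6]])
```

It depends on the NumPy version. Older versions accept it with a deprecation warning and build a one-dimensional array of Python lists. Since NumPy 1.24 it raises a `ValueError` unless we pass `dtype=object` explicitly.

In [127]:
arr = np.array([[1, 2, 3], [4, 5], [6]], dtype=object)
arr

array([list([1, 2, 3]), list([4, 5]), list([6])], dtype=object)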

Note that
```python
arr[0, 1]
```
raises an `IndexError`, because `arr` is a one-dimensional array whose entries are lists. Instead we need to use

```python
arr[0][1]  # 2
```

In [128]:
arr[0][1]

2

### Reshaping

Flattening a matrix (higher dimensional array) produces a one dimensional array.

Often you will need to reshape matrices.  Suppose you have the following array:

In [131]:
np.arange(1, 13)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

What will the following produce?

```python
np.arange(1, 13).reshape(4,3)
```

**Option A:**

```python
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
```

**Option B:**

```python
array([[ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11],
       [ 4,  8, 12]])
```

In [132]:
arr = np.arange(1, 13).reshape(4,3)
arr

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

We can change back to a one dimensional array.

In [133]:
arr = arr.flatten()
arr

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])
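
By default `reshape` fills the new array row by row, which is why Option A is the result. As a sketch of the alternatives, the `order` argument switches to column-by-column filling, and `-1` lets NumPy work out one dimension from the total size.

```python
import numpy as np

values = np.arange(1, 13)

# order="F" fills column by column, which gives Option B.
column_major = values.reshape(4, 3, order="F")
print(column_major)

# -1 asks NumPy to infer that dimension from the array's size.
inferred = values.reshape(-1, 3)
print(inferred.shape)    # (4, 3)
```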

### Files

We can store data in files such as text files, which usually have the file extension `.txt`. We have a file called `dna.txt` that contains sequences of the nucleotides `A`, `T`, `C`, `G`.

#### open

The command
```python
open(filename, mode)
``` 
returns a file object. The `mode` can be 
- `'r'` for reading
- `'w'` for writing (an existing file is overwritten)
- `'a'` for appending
- `'r+'` for reading and writing

If the file cannot be opened, `open` raises an `OSError` (`IOError` is an alias for it in Python 3). Methods on the file object are used for reading and writing.
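
A minimal sketch of handling a failed `open` is shown below. The file name `no_such_file.txt` is hypothetical and is placed in a fresh temporary directory so that it is guaranteed not to exist.

```python
from pathlib import Path
import tempfile

# Hypothetical file name inside a fresh, empty temporary directory.
missing = Path(tempfile.mkdtemp()) / "no_such_file.txt"

try:
    handle = open(missing, "r")
except FileNotFoundError as error:     # a subclass of OSError
    message = f"could not open file: {error.filename}"
    print(message)
else:
    handle.close()
```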

In [134]:
file_handle = open("dna.txt", "r")

#### seek

The file object maintains a position in the file. The method `tell` indicates the position in the file.

In [135]:
file_handle.tell()

0

The method `seek` allows us to move the position.

In [136]:
file_handle.seek(0)

0
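
To see `tell` and `seek` do something visible, the position has to move first. The sketch below assumes a small scratch file, `sample.txt`, as a stand-in for `dna.txt`.

```python
import os
import tempfile

# Hypothetical stand-in for dna.txt so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as handle:
    handle.write("GTCAGGAC")

handle = open(path, "r")
first = handle.read(4)         # reading advances the position
moved = handle.tell()          # no longer at the start
handle.seek(0)                 # jump back to the beginning
again = handle.read(4)         # same four characters as before
handle.close()

print(first, moved, again)
```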

#### read
```python
file_handle.read(size)
``` 
We use the method `read` to load the content of a file. The `size` argument is measured in

- characters in text mode
- bytes in binary mode 

If we do not pass a value, then we get everything from the current position to the end of the file.

In [137]:
file_handle.read(10)

'GTCAGGACAA'

We can read a single line with `readline`:
```python
file_handle.readline() 
``` 

In [138]:
file_handle.readline()

'GAAAGACAANTCCAATTNACATTATG\n'

Note that `\n` is a special character indicating a new line.

We can read all remaining lines with `readlines`:
```python
file_handle.readlines() 
```
We obtain a list containing all the lines from the current position to the end of the file.

In [None]:
lines = file_handle.readlines()
print(lines)
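
Each line returned by `readlines` keeps its trailing `\n`, and a second call returns an empty list because the position is already at the end. A sketch with a hypothetical two-line file:

```python
import os
import tempfile

# Hypothetical two-line file standing in for dna.txt.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as handle:
    handle.write("GTCA\nAATT\n")

handle = open(path, "r")
raw_lines = handle.readlines()     # ['GTCA\n', 'AATT\n']
leftover = handle.readlines()      # [] because we are at the end
handle.close()

clean_lines = [line.rstrip("\n") for line in raw_lines]
print(clean_lines)                 # ['GTCA', 'AATT']
```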

#### close

We can terminate the connection to the file with `close`.

In [140]:
file_handle.close()

If we pass `'a'` for the mode, then we can write to the file. Everything we write is appended to the end, and the existing content is kept.

In [141]:
file_handle = open("dna.txt", "a")

#### write
```python
file_handle.write(string)
```
writes a string to the file.
```python
file_handle.writelines(list_of_strings)
```
writes a list of strings to the file. Neither method adds line breaks, so each string needs its own `\n`.

In [142]:
file_handle.write('GCTA\n')

file_handle.writelines(['ATCG\n','GCTA\n'])

In [143]:
file_handle.close() 
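
The sketch below writes to a hypothetical scratch file, so `dna.txt` is not modified, and reads the result back. It shows that `write` returns the number of characters written and that `writelines` does not insert line breaks between items.

```python
import os
import tempfile

# Hypothetical scratch file so that dna.txt itself is not modified.
path = os.path.join(tempfile.mkdtemp(), "scratch.txt")

handle = open(path, "a")
written = handle.write("GCTA\n")        # returns the number of characters written
handle.writelines(["ATCG", "GCTA\n"])   # no newline is inserted between items
handle.close()

with open(path, "r") as handle:
    content = handle.read()
print(written)          # 5
print(repr(content))    # 'GCTA\nATCGGCTA\n'
```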

#### with 

Instead of `open` and `close` we can use a `with` statement. 

In [144]:
with open('dna.txt', 'r') as file_handle:  
    data = file_handle.read()

Using the `with` statement, we ensure that the file is closed correctly, even if an error occurs inside the block.

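
As a sketch of that guarantee, the example below raises an error on purpose inside the `with` block and then checks the `closed` attribute of the file object. The scratch file and the simulated error are assumptions made for the demonstration.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "scratch.txt")
with open(path, "w") as handle:
    handle.write("GCTA\n")

try:
    with open(path, "r") as handle:
        text = handle.read()
        raise ValueError("simulated failure while processing")
except ValueError:
    pass

print(handle.closed)    # True: the file was closed despite the exception
```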
### Calculating

We want to use `numpy` to count the nucleotides. First we convert the string to an array, keeping only the characters `A`, `T`, `C`, `G` and dropping everything else, such as `\n` and `N`.

In [154]:
dna_list = []
for character in data:
    if character in ["A","T","C","G"]:
        dna_list.append(character)

dna_array = np.array(dna_list) 

Now we can use the `numpy` function `unique` with `return_counts=True` to count how often each entry occurs.

In [156]:
np.unique(dna_array, return_counts=True)

(array(['A', 'C', 'G', 'T'], dtype='<U1'),
 array([10223,  7368,  7373, 10259], dtype=int64))
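
The filtering loop can also be written with array operations. The following is an alternative sketch, not the lab's method: it uses `np.isin` to build a boolean mask and runs on a short hypothetical string `sample` instead of the contents of `dna.txt`.

```python
import numpy as np

# Hypothetical sample; in the lab, `data` holds the contents of dna.txt.
sample = "GATTACA\nGNCA"

characters = np.array(list(sample))                      # one character per entry
is_nucleotide = np.isin(characters, ["A", "T", "C", "G"])
values, counts = np.unique(characters[is_nucleotide], return_counts=True)

totals = dict(zip(values.tolist(), counts.tolist()))
print(totals)    # {'A': 4, 'C': 2, 'G': 2, 'T': 2}
```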