# Lesson 6 - 2024/11/14

## NumPy
[NumPy](https://numpy.org/) (short for *Numerical Python*) is a numerical library for Python which provides an efficient interface to store and operate on data.

### A Python Integer Is More Than Just an Integer
In CPython, a Python integer is not a raw machine value: the variable is a pointer to a position in memory containing all the Python object information (such as the reference count and the type), including the bytes that contain the integer value.

![Integer.jpeg](../images/Integer.jpeg)

### A Python List Is More Than Just a List

Because of Python's dynamic typing, we can create heterogeneous lists:

In [2]:
my_list = [True, "2", 3.0, 4]

[type(item) for item in my_list]

[bool, str, float, int]

But this flexibility comes at a <strong>cost</strong>.

![List.jpeg](../images/List.jpeg)

In the special case that all elements are of the <strong>same type</strong>, much of this information is redundant: it can be much more efficient to store the data in a <strong>fixed-type</strong> array.
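The cost can be estimated with a rough sketch like the one below. It assumes the CPython interpreter, where `sys.getsizeof` reports the size of a single object; the exact byte counts depend on the Python version and platform, so only the comparison matters. The variable names are illustrative.

```python
import sys

import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# The list stores n pointers, and each pointer refers to a separate int object,
# so we add the size of the list itself to the size of every element.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(item) for item in py_list)

# The fixed-type array stores the n values back to back, 8 bytes each.
array_bytes = np_array.nbytes

print("list (pointers + int objects):", list_bytes, "bytes")
print("fixed-type array data:        ", array_bytes, "bytes")
```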

### Fixed-Type Arrays in Python
The built-in ``array`` module can be used to create arrays of a uniform type:

In [3]:
import array

L = list(range(10))
A = array.array('i', L) # i indicates integer values

A

array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
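As a small sketch of what "uniform type" means in practice: the type code is fixed when the array is created, and values of another type are rejected. The flag `appended` is only there for illustration.

```python
import array

A = array.array('i', range(10))

print(A.typecode)  # the type code chosen at creation
print(A.itemsize)  # bytes used per element (platform dependent)

# A float cannot be stored in an array of integers.
try:
    A.append(1.5)
    appended = True
except TypeError as error:
    appended = False
    print("rejected:", error)
```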

Much more useful, however, is the ``numpy.ndarray`` object of the NumPy package, which adds efficient *operations* on that data to this efficient storage.

In [5]:
import numpy as np

x_np = np.random.randint(10, size=9)  # One-dimensional array

type(x_np), x_np

(numpy.ndarray, array([9, 4, 4, 2, 5, 8, 0, 7, 2]))

In [4]:
print('x_np[3]:', x_np[3])     # Array Indexing
print('x_np[2:5]:', x_np[2:5]) # Array Slicing

x_np[3]: 2
x_np[2:5]: [4 2 5]


In [5]:
# Iteration
for element in x_np:
    print(element)

9
4
4
2
5
8
0
7
2


``numpy.ndarray`` stands for <strong>N-dimensional array</strong> which means that this object is built to be multi-dimensional, with attributes and methods specifically designed for this feature.

In [6]:
grid = x_np.reshape((3, 3))  # Two-dimensional array

print(x_np)
print(grid)

[9 4 4 2 5 8 0 7 2]
[[9 4 4]
 [2 5 8]
 [0 7 2]]
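One detail worth checking when reshaping: for a contiguous array like this one, ``reshape`` returns a view on the same data rather than a copy, so a change made through one name is visible through the other. The sketch below uses a fresh array (`x`, an illustrative name) so that it does not modify `x_np`.

```python
import numpy as np

x = np.arange(9)           # [0 1 2 3 4 5 6 7 8]
grid = x.reshape((3, 3))

# Both arrays point at the same underlying buffer.
shares = np.shares_memory(x, grid)
print("shares memory:", shares)

# Writing through the 2D view changes the original 1D array too.
grid[0, 0] = 100
print(x[0])

# An explicit copy is independent of the original.
independent = x.reshape((3, 3)).copy()
independent[0, 0] = -1
print(x[0])
```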


In [7]:
print("grid.ndim: ", grid.ndim)
print("grid.shape:", grid.shape)
print("grid.size: ", grid.size)
print("grid.dtype:", grid.dtype)

grid.ndim:  2
grid.shape: (3, 3)
grid.size:  9
grid.dtype: int64
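Indexing and slicing extend to several dimensions by separating the indices with commas. A short sketch, rebuilding the same 3×3 grid from the values printed above:

```python
import numpy as np

grid = np.array([9, 4, 4, 2, 5, 8, 0, 7, 2]).reshape((3, 3))

element = grid[1, 2]       # row 1, column 2
first_row = grid[0]        # a single index selects a whole row
first_column = grid[:, 0]  # ':' means "every row"
sub_grid = grid[:2, 1:]    # first two rows, last two columns

print(element)
print(first_row)
print(first_column)
print(sub_grid)
```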


### Boolean indexing
NumPy arrays can be indexed with a boolean mask (a ``list`` or another ``ndarray`` of booleans) of the same shape: only the elements where the mask is ``True`` are returned.

In [8]:
print('x_np:', x_np)

x_np: [9 4 4 2 5 8 0 7 2]


In [9]:
boolean_np = x_np > 3

print('boolean_np:', boolean_np) # True where the element in the same position is > 3.

boolean_np: [ True  True  True False  True  True False  True False]


In [10]:
[x for x in x_np if x > 3]

[9, 4, 4, 5, 8, 7]

In [11]:
x_np[boolean_np]

array([9, 4, 4, 5, 8, 7])
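Masks can also be combined and used for counting or assignment. The sketch below reuses the same values; note that conditions are combined with ``&`` and ``|`` (not ``and``/``or``) and each condition needs its own parentheses.

```python
import numpy as np

x = np.array([9, 4, 4, 2, 5, 8, 0, 7, 2])

# Combine two conditions element by element.
between = x[(x > 3) & (x < 8)]
print(between)

# True counts as 1, so summing a mask counts the matching elements.
n_large = int(np.sum(x > 3))
print(n_large)

# A mask on the left-hand side assigns only to the selected positions.
clipped = x.copy()
clipped[clipped <= 3] = 0
print(clipped)
```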

In [12]:
# rand() draws values in [0, 1), so "> 3" matches nothing here; only the timings matter.
big_array = np.random.rand(10000000)

%timeit [x for x in big_array if x > 3]
%timeit big_array[big_array > 3]

1.15 s ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.97 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
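``%timeit`` is an IPython magic and only works inside a notebook or IPython shell. Outside of them, a comparable measurement can be sketched with the standard ``timeit`` module. The array size, the 0.5 threshold, and the function names below are arbitrary choices for illustration, and the measured times will vary from machine to machine.

```python
import timeit

import numpy as np

rng = np.random.default_rng(seed=0)
values = rng.random(100_000)  # floats in [0, 1)


def with_loop():
    # Pure-Python filtering: one comparison per element in the interpreter.
    return [x for x in values if x > 0.5]


def with_mask():
    # Boolean indexing: the comparison and selection run in compiled code.
    return values[values > 0.5]


loop_seconds = timeit.timeit(with_loop, number=3)
mask_seconds = timeit.timeit(with_mask, number=3)

# Both approaches select exactly the same elements.
same_result = np.array_equal(np.array(with_loop()), with_mask())

print(f"loop: {loop_seconds:.4f} s, mask: {mask_seconds:.4f} s, same result: {same_result}")
```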


### Vectorized Operations
Operations between arrays follow a different logic than operations between standard lists: they are applied element by element.

In [13]:
x_list = list(x_np)

print('x_list + x_list:', x_list + x_list)
print('x_np + x_np:', x_np + x_np)

x_list + x_list: [9, 4, 4, 2, 5, 8, 0, 7, 2, 9, 4, 4, 2, 5, 8, 0, 7, 2]
x_np + x_np: [18  8  8  4 10 16  0 14  4]


| Operator      | Equivalent function | Description                           |
|---------------|---------------------|---------------------------------------|
|``+``          |``np.add``           |Addition (e.g., ``1 + 1 = 2``)         |
|``-``          |``np.subtract``      |Subtraction (e.g., ``3 - 2 = 1``)      |
|``-``          |``np.negative``      |Unary negation (e.g., ``-2``)          |
|``*``          |``np.multiply``      |Multiplication (e.g., ``2 * 3 = 6``)   |
|``/``          |``np.divide``        |Division (e.g., ``3 / 2 = 1.5``)       |
|``//``         |``np.floor_divide``  |Floor division (e.g., ``3 // 2 = 1``)  |
|``**``         |``np.power``         |Exponentiation (e.g., ``2 ** 3 = 8``)  |
|``%``          |``np.mod``           |Modulus/remainder (e.g., ``9 % 4 = 1``)|
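As a quick check of the table, the sketch below applies each operator and its equivalent function to the same values and confirms that the results match. The dictionary `checks` is only a convenience for printing.

```python
import numpy as np

x = np.array([9, 4, 4, 2, 5, 8, 0, 7, 2])

# Each entry compares an operator with the function listed next to it.
checks = {
    "+": np.array_equal(x + x, np.add(x, x)),
    "-": np.array_equal(x - 1, np.subtract(x, 1)),
    "unary -": np.array_equal(-x, np.negative(x)),
    "*": np.array_equal(x * 3, np.multiply(x, 3)),
    "/": np.array_equal(x / 2, np.divide(x, 2)),
    "//": np.array_equal(x // 2, np.floor_divide(x, 2)),
    "**": np.array_equal(x ** 2, np.power(x, 2)),
    "%": np.array_equal(x % 4, np.mod(x, 4)),
}

for operator, matches in checks.items():
    print(operator, matches)
```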

In [14]:
%timeit sum(big_array)
%timeit np.sum(big_array) # or big_array.sum()

525 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.81 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


<strong>Important</strong>: whenever possible, make sure that you are using the NumPy version of these operations when operating on NumPy arrays.
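Speed is not the only difference. On multi-dimensional arrays the built-in ``sum`` and ``np.sum`` do not even return the same thing, as this small sketch on the 3×3 grid from earlier suggests:

```python
import numpy as np

grid = np.array([9, 4, 4, 2, 5, 8, 0, 7, 2]).reshape((3, 3))

# The built-in sum iterates over the rows and adds them together,
# which gives one total per column.
builtin_result = sum(grid)

# np.sum adds up every element unless an axis is requested.
total = np.sum(grid)
per_column = np.sum(grid, axis=0)
per_row = np.sum(grid, axis=1)

print(builtin_result)
print(total)
print(per_column)
print(per_row)
```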

## Pandas

[Pandas](https://pandas.pydata.org/) is a library built on top of NumPy, which provides an efficient implementation of a ``DataFrame``.

``DataFrame``s can be seen as multidimensional arrays with attached row and column labels, which can hold heterogeneous types and/or missing data.

### The Pandas Series Object
A Pandas ``Series`` is a one-dimensional array of indexed data.

In [2]:
import pandas as pd

data = pd.Series(['RNA', 'gene', 'protein'])
data

0        RNA
1       gene
2    protein
dtype: object

In [16]:
data.values

array(['RNA', 'gene', 'protein'], dtype=object)

In [17]:
data.index

RangeIndex(start=0, stop=3, step=1)
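For numeric data, the values of a ``Series`` are held in a NumPy array, so the vectorized operations and boolean indexing from the NumPy section carry over. A minimal sketch with made-up variable names:

```python
import numpy as np
import pandas as pd

counts = pd.Series([3300, 18435, 12034])

# The underlying values are a NumPy array.
values_are_ndarray = isinstance(counts.values, np.ndarray)
print(values_are_ndarray)

# Arithmetic is applied element by element, and the index is preserved.
doubled = counts * 2
print(doubled)

# Boolean indexing keeps the index labels of the selected elements.
large = counts[counts > 10000]
print(large)

total = counts.sum()
print(total)
```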

The index need not be an integer, but can consist of values of any type:

In [18]:
data = pd.Series(
    ['RNA', 'gene', 'protein'],
    index=['ENST', 'ENSG', 'ENSP']
)
data

ENST        RNA
ENSG       gene
ENSP    protein
dtype: object
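With a custom index, an element can be reached either by its label or by its position. One way to make the intent explicit is to use ``.loc`` for labels and ``.iloc`` for integer positions, as in this sketch:

```python
import pandas as pd

data = pd.Series(
    ['RNA', 'gene', 'protein'],
    index=['ENST', 'ENSG', 'ENSP']
)

by_label = data.loc['ENSG']   # look up by index label
by_position = data.iloc[1]    # look up by integer position

print(by_label, by_position)
```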

We can construct a ``Series`` from a dictionary, and the way we access its values is similar to a dictionary:

In [20]:
map_dict = {'ENST': 'RNA', 'ENSG': 'gene', 'ENSP': 'protein'}
data = pd.Series(map_dict)
data

ENST        RNA
ENSG       gene
ENSP    protein
dtype: object

In [19]:
data['ENSG']

'gene'

In [21]:
data['ENSG':]

ENSG       gene
ENSP    protein
dtype: object
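The dictionary analogy goes a bit further, with one difference worth remembering: a slice by label includes its end point, while a slice by position does not. A short sketch:

```python
import pandas as pd

data = pd.Series({'ENST': 'RNA', 'ENSG': 'gene', 'ENSP': 'protein'})

# Dictionary-like membership test and keys.
has_gene_id = 'ENSG' in data
keys = list(data.keys())
print(has_gene_id, keys)

# Label-based slice: the end label 'ENSG' is included.
label_slice = data['ENST':'ENSG']
print(label_slice)

# Position-based slice: the end position 1 is excluded.
position_slice = data.iloc[0:1]
print(position_slice)
```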

### The Pandas DataFrame Object

A ``DataFrame`` can be constructed from two or more dictionaries with the same keys (or from two or more ``Series`` with the same index).

In [22]:
map_dict   = {'ENST': 'RNA', 'ENSG': 'gene', 'ENSP': 'protein'}
count_dict = {'ENST': 3300,  'ENSG': 18435,  'ENSP': 12034}
 
df = pd.DataFrame({'mapping type': map_dict, 'counts': count_dict})
df

      mapping type  counts
ENST           RNA    3300
ENSG          gene   18435
ENSP       protein   12034
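When the inputs do not share exactly the same labels, Pandas aligns them on the index and fills the gaps with missing values. The sketch below drops the ``ENST`` count on purpose to show this; the ``Series`` names are illustrative.

```python
import pandas as pd

mapping = pd.Series({'ENST': 'RNA', 'ENSG': 'gene', 'ENSP': 'protein'})
counts = pd.Series({'ENSG': 18435, 'ENSP': 12034})  # no entry for 'ENST'

df = pd.DataFrame({'mapping type': mapping, 'counts': counts})
print(df)

# The missing count is stored as NaN.
n_missing = int(df['counts'].isna().sum())
print("missing counts:", n_missing)
```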


In [23]:
df.index

Index(['ENST', 'ENSG', 'ENSP'], dtype='object')

In [24]:
df.columns

Index(['mapping type', 'counts'], dtype='object')
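The row and column labels can be used together to select data. As a sketch, ``.loc`` takes a row label (and optionally a column label), and a boolean mask built from a column filters rows just like boolean indexing does for NumPy arrays. The frame is rebuilt here from lists with an explicit index so that the example is self-contained.

```python
import pandas as pd

df = pd.DataFrame(
    {'mapping type': ['RNA', 'gene', 'protein'],
     'counts': [3300, 18435, 12034]},
    index=['ENST', 'ENSG', 'ENSP']
)

# One row, returned as a Series indexed by the column names.
gene_row = df.loc['ENSG']
print(gene_row)

# One cell, addressed by row label and column label.
gene_count = df.loc['ENSG', 'counts']
print(gene_count)

# Rows where the condition on the 'counts' column holds.
frequent = df[df['counts'] > 10000]
print(frequent)
```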

We can access a column with dictionary-style indexing or, the Pandas way, as an attribute:

In [25]:
df['counts']  # like a dictionary

ENST     3300
ENSG    18435
ENSP    12034
Name: counts, dtype: int64

In [26]:
df.counts     # The Pandas way

ENST     3300
ENSG    18435
ENSP    12034
Name: counts, dtype: int64

In [None]:
df['mapping type']
# df.mapping type  # SyntaxError: attribute access needs a column name that is a valid Python identifier

ENST        RNA
ENSG       gene
ENSP    protein
Name: mapping type, dtype: object
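Attribute access has a second limitation besides names with spaces: a column whose name matches an existing ``DataFrame`` method is hidden by that method. The sketch below uses a hypothetical column called ``count`` to show this; dictionary-style indexing works in every case.

```python
import pandas as pd

df = pd.DataFrame(
    {'mapping type': ['RNA', 'gene', 'protein'],
     'count': [3300, 18435, 12034]},
    index=['ENST', 'ENSG', 'ENSP']
)

# A name with a space is not a valid Python identifier,
# so it cannot appear after the dot.
space_is_identifier = 'mapping type'.isidentifier()
print(space_is_identifier)

# 'count' is a valid identifier, but df.count is the DataFrame method,
# not the column.
attribute_is_method = callable(df.count)
print(attribute_is_method)

# Dictionary-style access always returns the column.
count_column = df['count']
print(count_column)
```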