# NumPy

## Understanding Data Types in Python


```C
/* C code */
int result = 0;
for(int i=0; i<100; i++){
    result += i;
}
```

In Python, the equivalent operation could be written this way:
```python
# Python code
result = 0
for i in range(100):
    result += i
```


```C
/* C code */
int x = 4;
x = "four";  // FAILS
```

A Python variable carries a lot of extra information alongside its value, so it is considerably larger than, for example, a variable in C.

### A Python Integer Is More Than Just an Integer

```C
struct _longobject {
    long ob_refcnt;
    PyTypeObject *ob_type;
    size_t ob_size;
    long ob_digit[1];
};
```

A single integer in Python 3.4 actually contains four pieces:
- ob_refcnt, a reference count that helps Python silently handle memory allocation and deallocation
- ob_type, which encodes the type of the variable
- ob_size, which specifies the size of the following data members
- ob_digit, which contains the actual integer value that we expect the Python variable to represent.

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/cint_vs_pyint.png" alt="Integer Memory Layout">
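The overhead described above can be observed directly. The following is a minimal sketch using the standard library's `sys.getsizeof`; the exact byte counts depend on the Python version and platform, so no specific numbers are shown.

```python
import sys

# Size in bytes of a small Python integer object (header plus digits).
small_int_size = sys.getsizeof(1)

# A much larger integer needs more digits, so the object grows.
big_int_size = sys.getsizeof(10**100)

print(small_int_size, big_int_size)
```

On a typical CPython build, even the small integer takes noticeably more than the 4 or 8 bytes a C integer would need.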

### A Python List Is More Than Just a List


In [2]:
L = list(range(10))
L

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [3]:
type(L[0])

int

In [5]:
L2 = [str(c) for c in L]
L2

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

Even if we know that a list will contain only integers, Python still stores the full type information for every element. NumPy addresses this: it records the type once for the whole array, which reduces the size of the data and makes operations faster. The limitation is that a single array can hold only one type of element.


<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/array_vs_list.png" alt="Array Memory Layout">
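A rough way to see the difference in memory use is to compare a list of integers with a NumPy array holding the same values. This is a sketch only: the list figure is an approximation (the list's own size plus each integer object), and the variable names are illustrative.

```python
import sys
import numpy as np

values = list(range(1000))

# A list stores pointers to full Python objects; count the list itself
# plus every integer object it refers to.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# A NumPy array stores the raw values in one contiguous buffer.
arr = np.array(values, dtype=np.int64)
array_bytes = arr.nbytes  # 1000 elements * 8 bytes each

print(list_bytes, array_bytes)
```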

### Fixed-Type Arrays in Python


In [7]:
import array
L=list(range(10))
A=array.array('i',L)
A

array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

The example above is a step in the direction of NumPy (without using NumPy): the built-in `array` module also stores values of a single fixed type.
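The fixed type of an `array.array` can be checked with a small experiment. In this sketch, appending a float to an array created with type code `'i'` is expected to raise a `TypeError`:

```python
import array

A = array.array('i', range(10))  # 'i' is the type code for a signed C int

# Every element shares the same fixed type, so a float is rejected.
try:
    A.append(1.5)
    rejected = False
except TypeError:
    rejected = True

print(A.typecode, A.itemsize, rejected)
```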

## How Vectorization Makes Code Faster



<p><img alt="Translating Python code to bytecode" src="https://s3.amazonaws.com/dq-content/289/bytecode.svg"></p>


<table>
<thead>
<tr>
<th>Language Type</th>
<th>Example</th>
<th>Time taken to write program</th>
<th>Control over program performance</th>
</tr>
</thead>
<tbody>
<tr>
<td>High-Level</td>
<td>Python</td>
<td>Low</td>
<td>Low</td>
</tr>
<tr>
<td>Low-Level</td>
<td>C</td>
<td>High</td>
<td>High</td>
</tr>
</tbody>
</table>



<p><img alt="For loop to sum rows" src="https://s3.amazonaws.com/dq-content/289/for_loop.svg"></p>

NumPy combines the best of Python's syntax with C-level speed optimization. It achieves this through vectorization.

Below is the task written in plain Python syntax:

In [8]:
my_numbers = [[6,5],[1,3],[5,6]]
sums = []
for row in my_numbers:
    row_sum = row[0]+row[1]
    sums.append(row_sum)
print(sums)

[11, 4, 11]



<p><img alt="Unvectorized operation" src="https://s3.amazonaws.com/dq-content/289/unvectorized.svg"></p>

With vectorization, both operations are performed at the same time (SIMD: single instruction, multiple data). Instead of summing row by row, several loop iterations are combined into one.

<p><img alt="Vectorized operation" src="https://s3.amazonaws.com/dq-content/289/vectorized.svg"></p>
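The same row sums can be written without an explicit Python loop. This is a minimal sketch using the `my_numbers` data from the loop example; whether actual SIMD instructions are used underneath depends on the NumPy build and the CPU.

```python
import numpy as np

my_numbers = np.array([[6, 5], [1, 3], [5, 6]])

# Add the first column to the second column in a single vectorized step,
# with no explicit Python loop over the rows.
sums = my_numbers[:, 0] + my_numbers[:, 1]
print(sums)  # [11  4 11]
```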



## Numpy = numerical Python

Pandas is also built on NumPy. NumPy-based algorithms are often 10x to 100x faster than equivalent pure-Python code. It also has an extensive library of functions and good documentation.

ndarray: enables fast arithmetic operations

In [10]:
# import the library
import numpy as np

### NumPy ndarrays

Multidimensional arrays.



<p><img alt="Dimensional Arrays" src="https://s3.amazonaws.com/dq-content/289/dimensional_arrays.svg"></p>



#### Create an array



In [16]:
# to create an array, use the np.array function
list1 = [4,5,6,78,12]
# convert the list to a NumPy array
arr1 = np.array(list1)
print(arr1)
# if any element is a float, all elements are converted to float
type(arr1)

[ 4  5  6 78 12]


numpy.ndarray

In [18]:
data2 = [[1.1,2,3],[3,2,11]]
arr2 = np.array(data2)
print(arr2)

[[ 1.1  2.   3. ]
 [ 3.   2.  11. ]]
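The output above shows every value as a float because one input element was `1.1`. A short sketch of this type promotion (the array names are illustrative):

```python
import numpy as np

ints_only = np.array([4, 5, 6])
with_float = np.array([4, 5.5, 6])   # one float among integers

print(ints_only.dtype)    # an integer dtype (width is platform dependent)
print(with_float.dtype)   # float64
```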


### Functions for Creating Common Arrays

In [19]:
#ones
np.ones((3,5))

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [21]:
# arange (similar to Python's range function): (start, stop, step)
np.arange(0,20,2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [34]:
#zeros
Aa=np.zeros((3,5))
Bb=np.zeros((5))
print(Aa, '\n')
print(Bb)

[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]] 

[0. 0. 0. 0. 0.]


In [30]:
# linspace (start, stop, number of evenly spaced points, both endpoints included)
np.linspace(0,1,5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [35]:
# randint (lowest value, upper bound (exclusive), (shape of the array))
np.random.randint(0,10,(4,4))

array([[7, 2, 9, 2],
       [5, 1, 0, 7],
       [3, 7, 9, 8],
       [4, 5, 0, 3]])

In [38]:
# random (random floats from 0 up to, but not including, 1; (shape of the array))
np.random.random((3,3))

array([[0.3480086 , 0.49263108, 0.21096055],
       [0.54280082, 0.73469633, 0.89570865],
       [0.91811405, 0.04376255, 0.55394838]])

In [40]:
# eye (square matrix with ones on the diagonal)
np.eye(5)

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [42]:
# full (array of shape (m, n) filled with a constant value)
np.full((5,6),8)

array([[8, 8, 8, 8, 8, 8],
       [8, 8, 8, 8, 8, 8],
       [8, 8, 8, 8, 8, 8],
       [8, 8, 8, 8, 8, 8],
       [8, 8, 8, 8, 8, 8]])

In [48]:
# empty (allocates the array without initializing it, so it shows whatever values were already in that memory)
np.empty((2,1))

array([[-5.73021895e-300],
       [ 1.06098609e-153]])
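`np.empty` only reserves memory and does not set any values, which is why the output above shows arbitrary numbers. A sketch of its typical use, assuming every element is overwritten before it is read:

```python
import numpy as np

buffer = np.empty((2, 3))      # contents are arbitrary at this point
buffer[:] = 7.0                # overwrite every element before using it
print(buffer)
```

If defined starting values are needed, `np.zeros` or `np.full` are the safer choice.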

#### Understanding NumPy ndarrays

In [50]:
data3 = np.random.randint(0,10,(4,7))
data3

array([[4, 8, 5, 0, 5, 3, 3],
       [9, 9, 2, 5, 4, 9, 1],
       [8, 7, 6, 1, 6, 0, 7],
       [9, 4, 3, 8, 6, 4, 7]])

In [51]:
# ndim - number of dimensions of the ndarray
data3.ndim

2

In [53]:
# shape - size along each dimension; for this 2D example (number of rows, number of columns)
data3.shape

(4, 7)

In [56]:
# size - total number of elements across all dimensions
data3.size

28

In [58]:
# itemsize - size of one array element in bytes
data3.itemsize

8

In [60]:
# nbytes - size of the whole array in bytes (size * itemsize)
data3.nbytes

224
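These attributes are related to one another. The sketch below uses a hypothetical array with the same shape as `data3` and an explicit `int64` dtype, so the numbers do not depend on the platform default:

```python
import numpy as np

data = np.zeros((4, 7), dtype=np.int64)

# The total element count is the product of the shape,
# and nbytes is the element count times the size of one element.
assert data.size == data.shape[0] * data.shape[1]
assert data.nbytes == data.size * data.itemsize
print(data.ndim, data.shape, data.size, data.itemsize, data.nbytes)  # 2 (4, 7) 28 8 224
```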

#### Selecting and Slicing Rows and Items from ndarrays

<p><img alt="Selecting rows from a 2D ndarray" src="https://s3.amazonaws.com/dq-content/289/selection_rows.svg"></p>



This is how we select a single item from a 2D ndarray:

<p><img alt="Selecting a single item from a 2D ndarray" src="https://s3.amazonaws.com/dq-content/289/selection_item.svg"></p>


`ndarray[row, column]`: this is the key difference from a nested Python list, which is indexed as `list[row][column]`.

Options for selecting elements:
- int 5
- slice 0:5, 5
- :
- [1,5,8]
- boolean array
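Each of the selection options listed above can be tried on a small 1D array. This is an illustrative sketch; the array and variable names are made up for the example.

```python
import numpy as np

arr = np.arange(10) * 10       # [ 0 10 20 30 40 50 60 70 80 90]

single = arr[5]                # integer index -> 50
part = arr[0:5]                # slice -> first five elements
everything = arr[:]            # bare colon -> all elements
picked = arr[[1, 5, 8]]        # list of indices -> elements 1, 5 and 8
masked = arr[arr > 60]         # boolean array -> elements where the mask is True

print(single, part, picked, masked)
```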

### Rows

In [63]:
test_arr = np.random.randint(10,size=(5,5))
test_arr

array([[2, 3, 1, 4, 0],
       [4, 7, 7, 7, 3],
       [3, 7, 2, 1, 1],
       [1, 5, 1, 3, 3],
       [2, 5, 1, 8, 3]])

In [68]:
# first row
first_row = test_arr[0]
first_row

array([2, 3, 1, 4, 0])

In [71]:
# rows 2 and 3
row2_3 = test_arr[[1,2]] # or row2_3 = test_arr[1:3]
row2_3

array([[4, 7, 7, 7, 3],
       [3, 7, 2, 1, 1]])

In [73]:
# from row 2 to the end
row2_do_konca = test_arr[1:]
row2_do_konca

array([[4, 7, 7, 7, 3],
       [3, 7, 2, 1, 1],
       [1, 5, 1, 3, 3],
       [2, 5, 1, 8, 3]])

#### Selecting Columns and Custom Slicing ndarrays

Let's continue by learning how to select one or more columns of data:

<p><img alt="Selecting columns from a 2D ndarray" src="https://s3.amazonaws.com/dq-content/289/selection_columns.svg"></p>



If we wanted to select a partial 1D slice of a row or column, we can combine a single value for one dimension with a slice for the other dimension:

<p><img alt="Selecting partial 1D slices from a 2D ndarray" src="https://s3.amazonaws.com/dq-content/289/selection_1darray.svg"></p>

Lastly, if we wanted to select a 2D slice, we can use slices for both dimensions:

<p><img alt="Selecting a 2D slice from a 2D ndarray" src="https://s3.amazonaws.com/dq-content/289/selection_2darray.svg"></p>



### Columns

`[rows, columns]`

In [74]:
test_arr2 = np.random.randint(10,size=(5,5))
test_arr2

array([[6, 3, 3, 3, 2],
       [4, 8, 7, 2, 6],
       [7, 6, 3, 1, 3],
       [4, 2, 4, 0, 8],
       [6, 3, 4, 5, 8]])

In [77]:
# column 2
stolp2 = test_arr2[:,1]
print(stolp2)

[3 8 6 2 3]


In [79]:
# columns 1 and 2
stolp1_2 = test_arr2[:,0:2] # or test_arr2[:,:2]
stolp1_2

array([[6, 3],
       [4, 8],
       [7, 6],
       [4, 2],
       [6, 3]])

In [81]:
# columns 2, 4 and 5
stolp2_4_5 = test_arr2[:,[1,3,4]]
stolp2_4_5

array([[3, 3, 2],
       [8, 2, 6],
       [6, 1, 3],
       [2, 0, 8],
       [3, 5, 8]])

In [85]:
# row 3, columns 2 through 4
stolp_vrstica = test_arr2[2, 1:4]
stolp_vrstica

array([6, 3, 1])

In [87]:
# rows 2 through 4, columns 1 through 3
test_arr2[1:4,:3]

array([[4, 8, 7],
       [7, 6, 3],
       [4, 2, 4]])

#### Modify values in ndarray



In [88]:
test_arr2

array([[6, 3, 3, 3, 2],
       [4, 8, 7, 2, 6],
       [7, 6, 3, 1, 3],
       [4, 2, 4, 0, 8],
       [6, 3, 4, 5, 8]])

In [93]:
test_arr2[0,0] = 124
test_arr2
# NOTE: assigning a float into this array would cut off the decimal part, because the array's dtype is integer!

array([[124,   3,   3,   3,   2],
       [  4,   8,   7,   2,   6],
       [  7,   6,   3,   1,   3],
       [  4,   2,   4,   0,   8],
       [  6,   3,   4,   5,   8]])
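The truncation mentioned in the note above can be demonstrated on a small array. This is a sketch with illustrative names; converting with `astype(float)` is one possible way to keep the decimals.

```python
import numpy as np

int_arr = np.array([1, 2, 3])
int_arr[0] = 3.99              # the fractional part is dropped, not rounded
print(int_arr)                 # [3 2 3]

# Converting to a float dtype first keeps the decimals.
float_arr = int_arr.astype(float)
float_arr[0] = 3.99
print(float_arr[0])            # 3.99
```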

#### Datatypes

[More about data types](https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html)

[List of scalars](https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html#arrays-scalars-built-in)

An unsigned 8-bit integer (`uint8`) covers the numbers from 0 to 255. If an array only ever holds values in that range, it makes sense to set its dtype to `uint8`, so that each element takes a single byte.

In [95]:
# check the array's data type
x=np.array([1,2])
print(x.dtype)

int64


In [98]:
# setting the data type
np.zeros(10, dtype=np.int16) # or np.zeros(10, dtype='int16')

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int16)
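The value range of an integer dtype can be looked up with `np.iinfo`, and the effect of a smaller dtype on memory can be checked with `nbytes`. A minimal sketch with made-up values:

```python
import numpy as np

info = np.iinfo(np.uint8)
print(info.min, info.max)          # 0 255

small = np.array([0, 17, 255], dtype=np.uint8)
large = small.astype(np.int64)

# Same values, but one byte per element instead of eight.
print(small.nbytes, large.nbytes)  # 3 24
```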

<div class="text_cell_render border-box-sizing rendered_html">
<table>
<thead><tr>
<th>Data type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>bool_</code></td>
<td>Boolean (True or False) stored as a byte</td>
</tr>
<tr>
<td><code>int_</code></td>
<td>Default integer type (same as C <code>long</code>; normally either <code>int64</code> or <code>int32</code>)</td>
</tr>
<tr>
<td><code>intc</code></td>
<td>Identical to C <code>int</code> (normally <code>int32</code> or <code>int64</code>)</td>
</tr>
<tr>
<td><code>intp</code></td>
<td>Integer used for indexing (same as C <code>ssize_t</code>; normally either <code>int32</code> or <code>int64</code>)</td>
</tr>
<tr>
<td><code>int8</code></td>
<td>Byte (-128 to 127)</td>
</tr>
<tr>
<td><code>int16</code></td>
<td>Integer (-32768 to 32767)</td>
</tr>
<tr>
<td><code>int32</code></td>
<td>Integer (-2147483648 to 2147483647)</td>
</tr>
<tr>
<td><code>int64</code></td>
<td>Integer (-9223372036854775808 to 9223372036854775807)</td>
</tr>
<tr>
<td><code>uint8</code></td>
<td>Unsigned integer (0 to 255)</td>
</tr>
<tr>
<td><code>uint16</code></td>
<td>Unsigned integer (0 to 65535)</td>
</tr>
<tr>
<td><code>uint32</code></td>
<td>Unsigned integer (0 to 4294967295)</td>
</tr>
<tr>
<td><code>uint64</code></td>
<td>Unsigned integer (0 to 18446744073709551615)</td>
</tr>
<tr>
<td><code>float_</code></td>
<td>Shorthand for <code>float64</code>.</td>
</tr>
<tr>
<td><code>float16</code></td>
<td>Half precision float: sign bit, 5 bits exponent, 10 bits mantissa</td>
</tr>
<tr>
<td><code>float32</code></td>
<td>Single precision float: sign bit, 8 bits exponent, 23 bits mantissa</td>
</tr>
<tr>
<td><code>float64</code></td>
<td>Double precision float: sign bit, 11 bits exponent, 52 bits mantissa</td>
</tr>
<tr>
<td><code>complex_</code></td>
<td>Shorthand for <code>complex128</code>.</td>
</tr>
<tr>
<td><code>complex64</code></td>
<td>Complex number, represented by two 32-bit floats</td>
</tr>
<tr>
<td><code>complex128</code></td>
<td>Complex number, represented by two 64-bit floats</td>
</tr>
</tbody>
</table>

</div>

### Computation on NumPy Arrays: Universal Functions


#### The Slowness of Loops



#### Introducing UFuncs (Universal functions)

[Docs](https://docs.scipy.org/doc/numpy/reference/ufuncs.html)



### Importing Real Data


- Row 1 is RatecodeID
- Row 2 is PULocationID
- Row 3 is DOLocationID
- Row 4 is passenger_count
- Row 5 is trip_distance
- Row 6 is fare_amount
- Row 7 is extra
- Row 8 is mta_tax
- Row 9 is tip_amount
- Row 10 is tolls_amount
- Row 11 is improvement_surcharge
- Row 12 is total_amount
- Row 13 is payment_type
- Row 14 is trip_type

### Vector Math




Here's what happened behind the scenes:

<p><img alt="Vectorized Addition" src="https://s3.amazonaws.com/dq-content/289/vectorized_addition.svg"></p>


- `vector_a + vector_b` - Addition
- `vector_a - vector_b` - Subtraction
- `vector_a * vector_b` - Multiplication (this is unrelated to the vector multiplication used in linear algebra).
- `vector_a / vector_b` - Division
- `vector_a % vector_b` - Modulus (find the remainder when vector_a is divided by vector_b)
- `vector_a ** vector_b` - Exponent (raise vector_a to the power of vector_b)
- `vector_a // vector_b` - Floor Division (divide vector_a by vector_b, rounding down to the nearest integer)
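The operators listed above all work element by element. Below is a small sketch with two made-up vectors; the values are chosen only for illustration.

```python
import numpy as np

vector_a = np.array([7, 8, 9])
vector_b = np.array([2, 3, 4])

print(vector_a + vector_b)    # [ 9 11 13]
print(vector_a - vector_b)    # [5 5 5]
print(vector_a * vector_b)    # [14 24 36]
print(vector_a / vector_b)    # true division, returns floats
print(vector_a % vector_b)    # [1 2 1]
print(vector_a ** vector_b)   # [  49  512 6561]
print(vector_a // vector_b)   # [3 2 2]
```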


<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The following table lists the arithmetic operators implemented in NumPy:</p>
<table>
<thead><tr>
<th>Operator</th>
<th>Equivalent ufunc</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>+</code></td>
<td><code>np.add</code></td>
<td>Addition (e.g., <code>1 + 1 = 2</code>)</td>
</tr>
<tr>
<td><code>-</code></td>
<td><code>np.subtract</code></td>
<td>Subtraction (e.g., <code>3 - 2 = 1</code>)</td>
</tr>
<tr>
<td><code>-</code></td>
<td><code>np.negative</code></td>
<td>Unary negation (e.g., <code>-2</code>)</td>
</tr>
<tr>
<td><code>*</code></td>
<td><code>np.multiply</code></td>
<td>Multiplication (e.g., <code>2 * 3 = 6</code>)</td>
</tr>
<tr>
<td><code>/</code></td>
<td><code>np.divide</code></td>
<td>Division (e.g., <code>3 / 2 = 1.5</code>)</td>
</tr>
<tr>
<td><code>//</code></td>
<td><code>np.floor_divide</code></td>
<td>Floor division (e.g., <code>3 // 2 = 1</code>)</td>
</tr>
<tr>
<td><code>**</code></td>
<td><code>np.power</code></td>
<td>Exponentiation (e.g., <code>2 ** 3 = 8</code>)</td>
</tr>
<tr>
<td><code>%</code></td>
<td><code>np.mod</code></td>
<td>Modulus/remainder (e.g., <code>9 % 4 = 1</code>)</td>
</tr>
</tbody>
</table>
<p>Additionally there are Boolean/bitwise operators; we will explore these in <a href="02.06-boolean-arrays-and-masks.html">Comparisons, Masks, and Boolean Logic</a>.</p>

</div>
</div>

[Mathematical expressions](https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.math.html#arithmetic-operations)

### Calculating Statistics For 1D ndarrays



### Calculating Statistics For 2D ndarrays

For now, we're going to look at how we can calculate statistics for two-dimensional ndarrays. If we use the array methods without additional parameters, they will return a single value, just like they do with a 1D array:

<p><img alt="Array method without axis parameter" src="https://s3.amazonaws.com/dq-content/289/array_method_axis_none.svg"></p>

But what if we wanted to find the maximum value of each row? For that, we need to use the axis parameter, and specify a value of 1, which indicates we want to calculate values for each row.

<p><img alt="Array method with axis 1" src="https://s3.amazonaws.com/dq-content/289/array_method_axis_1.svg"></p>

If we want to find the maximum value of each column, we use an axis value of 0:

<p><img alt="Array method with axis 0" src="https://s3.amazonaws.com/dq-content/289/array_method_axis_0.svg"></p>

To help you remember which is which, you can think of the first axis as rows, and the second axis as columns, just in the same way as when we're indexing a 2D NumPy array we use ndarray[row,column]. Then you think about which axis you want to apply the method along. The tricky part is to remember that when you apply the method along one axis, you get results in the other axis. Here is an illustration of that:

<p><img alt="The axis parameter" src="https://s3.amazonaws.com/dq-content/289/axis_param.svg"></p>
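The effect of the `axis` parameter described above can be checked on a small 2D array. The array `grid` is made up for this sketch:

```python
import numpy as np

grid = np.array([[1, 9, 3],
                 [7, 2, 8]])

print(grid.max())          # 9       -> one value for the whole array
print(grid.max(axis=1))    # [9 8]   -> one value per row
print(grid.max(axis=0))    # [7 9 8] -> one value per column
```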



### Adding Rows and Columns to ndarrays


### Sorting ndarrays


### Reading CSV files with NumPy

### Boolean Arrays





A similar pattern occurs: the 'less than five' operation is applied to each value in the array. The diagram below shows this step by step:

<p><img alt="Vectorized boolean operation" src="https://s3.amazonaws.com/dq-content/290/vectorized_bool.svg"></p>
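The 'less than five' comparison can be reproduced in a couple of lines. This is a sketch with a made-up array; the result is a boolean array of the same length as the input.

```python
import numpy as np

values = np.array([2, 4, 6, 8])

# The comparison is applied to every element and returns a boolean array.
less_than_five = values < 5
print(less_than_five)          # [ True  True False False]
```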

### Boolean Indexing with 1D ndarrays




<p><img alt="Boolean indexing 1D ndarrays 1" src="https://s3.amazonaws.com/dq-content/290/1d_bool_1.svg"></p>



<p><img alt="Boolean indexing 1D ndarrays 2" src="https://s3.amazonaws.com/dq-content/290/1d_bool_2.svg"></p>




### Boolean Indexing with 2D ndarrays


<p><img alt="Boolean indexing 1D ndarrays 2" src="https://s3.amazonaws.com/dq-content/290/bool_dims.svg"></p>


### Assigning Values in ndarrays

### Subarrays as no-copy views



### Copying Data
