# Using pandas and numpy effectively
Both pandas and numpy have been written in Compiled languages, mostly C: 
### NumPy

* The core array operations (like vectorised math, broadcasting, slicing, etc.) are implemented in C (and some Fortran, especially for linear algebra via BLAS/LAPACK).

* The Python layer is mostly a wrapper that calls into these fast C routines.

That’s why NumPy is so much faster than plain Python loops.

### pandas

* Built on top of NumPy, so it inherits a lot of that C speed indirectly.

* Performance-critical parts (like groupby, joins, parsing CSVs) are written in C or Cython (a Python-like language that compiles to C).

* The higher-level DataFrame/Series API is written in Python.

# Using Numpy (efficiently) :
1. ## Numpy arrays are static:
* They are fixed in size (unlike Python lists).

* They don’t automatically resize when you add elements.

* To append, you must call `resize()` explicitly.

* Resizing on every append causes `many extra copies and memory allocations`.

* This makes repeated appends much slower than using Python lists.

In [1]:
# show casing the drawback of resizing a numpy array
from timeit import timeit
import numpy

N = 100000  # Number of elements in list/array

def list_append():
    ls = []
    for i in range(N):
        ls.append(i)

def array_resize():
    ar = numpy.zeros(1)
    for i in range(1, N):
        ar.resize(i+1)
        ar[i] = i
        
repeats = 1000
print(f"list_append: {timeit(list_append, number=repeats):.2f}ms")
print(f"array_resize: {timeit(array_resize, number=repeats):.2f}ms")

list_append: 1.22ms
array_resize: 12.60ms


### Conclusion 1: Avoid trying to resize Numpy arrays

## 2. NumPy arrays typically require all data to be the `same type` (and a NumPy type)

If you don't respect this, you lose the advantages from using numpy

In [2]:
import numpy as np
a = np.array([0.5, 5])
print(type(a[0]))
print(type(a[1]))

b = np.array([0.5, 5,{"foo":5}])
print(type(b[0]),type(b[1]),type(b[2]))

<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'float'> <class 'int'> <class 'dict'>


## We will demonstrate now the overhead from mixing Python lists and NumPy functions.

In [3]:
import timeit
import numpy as np

#python list
ls = list(range(10000))
time = timeit.timeit(lambda: np.random.choice(ls), number=1000)
print("List: Time for 1000 runs:", time, "seconds")

# NumPy array, numpy.random.choice()
ar = numpy.arange(10000)
time = timeit.timeit(lambda: np.random.choice(ar), number=1000)
print("Numpy: Time for 1000 runs:", time, "seconds")

List: Time for 1000 runs: 0.5479201669950271 seconds
Numpy: Time for 1000 runs: 0.0032690829975763336 seconds


### Conclusion: Passing a Python list to numpy.random.choice() is 65.6x slower than passing a NumPy array. 
This is the additional overhead of converting the list to an array.

# Numpy and array broadcasting
* NumPy arrays support `broadcasting` many mathematical operations or functions. 

* This is a shorthand notation, where the operation/function is applied `element-wise` without having to loop over the array explicitly

* No need to loop over array, thus optimising the code!

* This means also that multiple `operations could be applied simultaneously`, rather than sequentially --> significant performance boost

In [12]:
import numpy as np
ar = np.arange(6)
print(ar)
print(ar + 10)
print(ar * 2)
print(ar**2)
print(np.exp(ar))
print(ar*2+1)

[0 1 2 3 4 5]
[10 11 12 13 14 15]
[ 0  2  4  6  8 10]
[ 0  1  4  9 16 25]
[  1.           2.71828183   7.3890561   20.08553692  54.59815003
 148.4131591 ]
[ 1  3  5  7  9 11]


In [11]:
# lists do not allow broadcasting
lt = [0, 1, 2, 3, 4, 5]
print(lt*2)
print(lt+2)

[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]


TypeError: can only concatenate list (not "int") to list

In [14]:
# speed of applying broadcasting is independent from the size of the numpy arrays (comparing to using lists)
import timeit
import numpy as np

for n in [1, 10, 100]:
    ar = np.arange(n)
    time = timeit.timeit(lambda: ar + 10, number=1_000_000)
    print(f"Array size {n}: {time:.6f} seconds total")


for n in [1, 10, 100]:
    ls = list(range(n))
    time = timeit.timeit(lambda: [x + 10 for x in ls], number=1_000_000)
    print(f"List size {n}: {time:.6f} seconds total")

Array size 1: 0.418775 seconds total
Array size 10: 0.381499 seconds total
Array size 100: 0.396316 seconds total
List size 1: 0.090674 seconds total
List size 10: 0.193455 seconds total
List size 100: 1.267181 seconds total


In [16]:
# A final summary code

In [15]:
from timeit import timeit

N = 1000000  # Number of elements in list

gen_list = f"ls = list(range({N}))"
gen_array = f"import numpy; ar = numpy.arange({N}, dtype=numpy.int64)"

py_sum_ls = "sum([i*i for i in ls])"
py_sum_ar = "sum(ar*ar)"
np_sum_ar = "numpy.sum(ar*ar)"
np_dot_ar = "numpy.dot(ar, ar)"

repeats = 1000
print(f"python_sum_list: {timeit(py_sum_ls, setup=gen_list, number=repeats):.2f}ms")
print(f"python_sum_array: {timeit(py_sum_ar, setup=gen_array, number=repeats):.2f}ms")

print(f"numpy_sum_array: {timeit(np_sum_ar, setup=gen_array, number=repeats):.2f}ms")
print(f"numpy_dot_array: {timeit(np_dot_ar, setup=gen_array, number=repeats):.2f}ms")

python_sum_list: 21.83ms
python_sum_array: 30.91ms
numpy_sum_array: 0.57ms
numpy_dot_array: 0.27ms


# Pandas :
* Pandas is the most common Python package when working with tabular data (`csv`, `tsv` files).
* Pandas enhances performance if used well, otherwise it can harm performance



## Pandas methods by default operate on `columns`
* This means that elements of a same column are saved in the same memory space.
* Think of the column as a numpy array, highly suitable for vectorisation.

* Iterating over DataFrame rows using a Python for loop is slow and not recommended!

* Prefer Pandas built-in methods that support axis=1 for row-wise operations.

* If no built-in method exists, use `apply()` to apply a custom function to rows (similar to `map()`).

In [17]:
from timeit import timeit
import pandas
import numpy

N = 100000  # Number of rows in DataFrame

def genDataFrame():
    numpy.random.seed(12)  # Ensure each dataframe is identical
    return pandas.DataFrame(
    {
        "f_vertical": numpy.random.random(size=N),
        "f_horizontal": numpy.random.random(size=N),
        # todo some spurious columns
    })

def pythagoras(row):
    return (row["f_vertical"]**2 + row["f_horizontal"]**2)**0.5
    
def for_range():
    rtn = []
    df = genDataFrame()
    for row_idx in range(df.shape[0]):
        row = df.iloc[row_idx]
        rtn.append(pythagoras(row))
    return pandas.Series(rtn)

def for_iterrows():
    rtn = []
    df = genDataFrame()
    for row_idx, row in df.iterrows():
        rtn.append(pythagoras(row))
    return pandas.Series(rtn)
    
def pandas_apply():
    df = genDataFrame()
    return df.apply(pythagoras, axis=1)# axis=1 means apply to rows

repeats = 100
gentime = timeit(genDataFrame, number=repeats)

#Subtract gentime from each and multiply by 10 (for scaling to milliseconds) to get the time spent only on row iteration.
print(f"for_range: {timeit(for_range, number=repeats)*10-gentime:.2f}ms")
print(f"for_iterrows: {timeit(for_iterrows, number=repeats)*10-gentime:.2f}ms")
print(f"pandas_apply: {timeit(pandas_apply, number=repeats)*10-gentime:.2f}ms")

for_range: 747.98ms
for_iterrows: 852.20ms
pandas_apply: 235.65ms


## `apply()` is 4x faster than the two for approaches, as it avoids the Python for loop.
#### But, we can do better by profiting from `vectorisation` and `broadcasting` and applying the mathematical operations on columns instead!


In [18]:
# calculating pythagoras directly on the columns using numpy operations
def vectorize():
    df = genDataFrame()
    return pandas.Series(numpy.sqrt(numpy.square(df["f_vertical"]) + numpy.square(df["f_horizontal"])))
    
print(f"vectorize: {timeit(vectorize, number=repeats)-gentime:.2f}ms")

vectorize: 0.02ms


### Ok, we were lucky here and we could profit from vectorisarion but it's not always the case!
* An alternative approach is `converting your DataFrame` to a `Python dictionary` using `to_dict(orient='index')`.
* This creates a nested dictionary, where each row of the outer dictionary is an internal dictionary.
* This can then be processed via list-comprehension 


In [20]:
def to_dict():
    df = genDataFrame()
    df_as_dict = df.to_dict(orient='index')
    return pandas.Series([(r['f_vertical']**2 + r['f_horizontal']**2)**0.5 for r in df_as_dict.values()])

print(f"to_dict: {timeit(to_dict, number=repeats)*10-gentime:.2f}ms")

to_dict: 80.74ms


In [25]:
df = pandas.DataFrame({
        "f_vertical": [1,2,23,3,5],
        "f_horizontal": [10,20,3,30,50],
        # todo some spurious columns
    })
df_as_dict = df.to_dict(orient='index')
print(df, df_as_dict)

   f_vertical  f_horizontal
0           1            10
1           2            20
2          23             3
3           3            30
4           5            50 {0: {'f_vertical': 1, 'f_horizontal': 10}, 1: {'f_vertical': 2, 'f_horizontal': 20}, 2: {'f_vertical': 23, 'f_horizontal': 3}, 3: {'f_vertical': 3, 'f_horizontal': 30}, 4: {'f_vertical': 5, 'f_horizontal': 50}}


## Using dictionary (80ms) is slower than using vectorisation but is twice as fast as using `apply`.
Note that indexing into `Pandas Series` (rows) is significantly slower than a `Python dictionary` as we will demonstrate below:

In [26]:
from timeit import timeit
import pandas as pandas

N = 100000  # Number of rows in DataFrame

def genInput():
    s = pandas.Series({'a' : 1, 'b' : 2})
    d = {'a' : 1, 'b' : 2}
    return s, d

def series():
    s, _ = genInput()
    for i in range(N):
        y = s['a'] * s['b']

def dictionary():
    _, d = genInput()
    for i in range(N):
        y = d['a'] * d['b']

repeats = 1000
print(f"series: {timeit(series, number=repeats):.2f}ms")
print(f"dictionary: {timeit(dictionary, number=repeats):.2f}ms")

series: 128.98ms
dictionary: 2.12ms
