### Writing Efficient Python Code

Python is a very popular language among data scientists thanks to the extensive range of libraries available to us. Pandas and NumPy in particular are extremely useful for solving machine learning problems.

However, Python also comes with its own set of disadvantages. As Python is an interpreted language (code is executed by an interpreter rather than compiled to machine code ahead of time), it is typically significantly slower than languages such as C and Java. With the rise of big data, more data is becoming available for data scientists to work with, so it is important to keep code as efficient as possible and avoid any unnecessary computation. One way of doing this is by avoiding for-loops. The built-in optimised routines apply an operation to a whole array at once in compiled code, rather than one element at a time in the interpreter, which helps us write highly efficient code. Not only does this significantly reduce the computation time, it also reduces the amount of code, making it faster and much easier to check for bugs. Throughout this blog we will see different examples of Python code that execute the same task, but with considerably different run times.


In [1]:
# importing the libraries
import pandas as pd
import numpy as np
import random
import warnings

from timeit import Timer

warnings.filterwarnings("ignore")

Throughout this blog, we will investigate four ways of calculating the total cost, when given a dataframe consisting of the following two columns:
- the number of units 
- the price per unit

We will randomly generate 10,000 integer values for each column using NumPy's `np.random.randint` for the purpose of this example. It is important to remember that in practice, data scientists tend to work with much larger datasets, often comprising millions of rows, which would make the following code even slower.

In [76]:
# creating a dataframe with two columns and 10,000 rows, where the values have been randomly generated
df = pd.DataFrame({'number_of_units':np.random.randint(0,100,size=10000),
                    'price_per_unit':np.random.randint(0,1000,size=10000)})

# printing out the top 5 rows of our dataframe
df.head()

|   | number_of_units | price_per_unit |
|---|-----------------|----------------|
| 0 | 61              | 832            |
| 1 | 56              | 808            |
| 2 | 62              | 6              |
| 3 | 93              | 495            |
| 4 | 3               | 828            |


We will write four different functions which all carry out the same task of calculating the total cost. These functions are as follows:
1. calculating the total cost using a __for-loop__
2. calculating the total cost using a __list comprehension__
3. calculating the total cost making use of __vectorization__
4. calculating the total cost by using the __inbuilt `numpy` dot product__



In [80]:
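def for_loop():
    # Element-by-element version: read, multiply and write one row at a time.
    # Note: `df[col].iloc[i] = ...` is chained assignment, which pandas
    # discourages and which may not update `df` in newer pandas versions.
    # It is kept as-is because this is the code that was timed below.
    df['cost_of_items'] = pd.Series()
    for i in range(len(df)):
        df['cost_of_items'].iloc[i] = df['price_per_unit'].iloc[i] * df['number_of_units'].iloc[i]
        
    total_cost = sum(df['cost_of_items'])
    return total_cost


def list_comprehension():
    # Iterate over both columns in lockstep and build a list of row costs
    cost_of_items = [price*num for price, num in zip(df['price_per_unit'], df['number_of_units'])]
    total_cost = sum(cost_of_items)
    return total_cost


def vectorized():
    # Multiply the two columns in one vectorized operation, then add up the result
    total_cost = sum(df['price_per_unit'] * df['number_of_units'])
    return total_cost


def dot_product():
    # The dot product multiplies element-wise and sums in a single NumPy call
    total_cost = np.dot(df['price_per_unit'], df['number_of_units'])
    return total_cost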
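Before comparing run times, it is worth confirming that the four approaches really do return the same answer. The following is a minimal sketch on a tiny hand-made DataFrame; `small_df` and the `total_*` function names are illustrative and not part of the cells above, and each function takes the DataFrame as an argument (an assumption made here so the check is self-contained).

```python
import numpy as np
import pandas as pd

# Tiny illustrative DataFrame (hand-picked values, not the article's random data)
small_df = pd.DataFrame({
    "number_of_units": [2, 5, 1],
    "price_per_unit": [10, 3, 7],
})


def total_for_loop(frame):
    # Explicit loop over row positions, accumulating the total as we go
    total = 0
    for i in range(len(frame)):
        total += frame["price_per_unit"].iloc[i] * frame["number_of_units"].iloc[i]
    return total


def total_list_comprehension(frame):
    # Build a list of per-row costs, then sum it
    costs = [p * n for p, n in zip(frame["price_per_unit"], frame["number_of_units"])]
    return sum(costs)


def total_vectorized(frame):
    # Column-wise multiplication followed by a sum
    return (frame["price_per_unit"] * frame["number_of_units"]).sum()


def total_dot_product(frame):
    # Dot product of the two columns
    return np.dot(frame["price_per_unit"], frame["number_of_units"])


results = [
    int(f(small_df))
    for f in (total_for_loop, total_list_comprehension, total_vectorized, total_dot_product)
]

# 2*10 + 5*3 + 1*7 = 42 for every approach
print(results)
```

A check like this on a small input makes it easier to trust that any timing differences come from how the work is done, not from the functions computing different things.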

We can measure the computation time of each function using the `timeit` module. Here, each function is executed once and the elapsed time is reported in seconds.

In [81]:
computation_time_for_loop = Timer(for_loop).timeit(1)
computation_time_list_comprehension = Timer(list_comprehension).timeit(1)
computation_time_vectorized = Timer(vectorized).timeit(1)
computation_time_dot_product = Timer(dot_product).timeit(1)
 
print("Computation time is %0.9f using for-loop"%computation_time_for_loop)
print("Computation time is %0.9f using comprehension for-loop"%computation_time_list_comprehension)
print("Computation time is %0.9f using vectorization"%computation_time_vectorized)
print("Computation time is %0.9f using numpy"%computation_time_dot_product)



Computation time is 167.691518600 using for-loop
Computation time is 0.001933900 using comprehension for-loop
Computation time is 0.000811400 using vectorization
Computation time is 0.000053000 using numpy
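A single run can be noisy, especially for the functions that finish in well under a millisecond. One possible refinement, sketched here rather than used for the numbers above, is to repeat the measurement with `Timer.repeat` and keep the best result. The arrays `units` and `prices`, the function `dot_product_arrays`, the fixed seed, and the repeat counts are all arbitrary choices made for this illustration.

```python
import numpy as np
from timeit import Timer

# Illustrative data: same shape and value ranges as the article's DataFrame,
# generated with a fixed seed (an assumption, for reproducibility only)
rng = np.random.default_rng(0)
units = rng.integers(0, 100, size=10_000)
prices = rng.integers(0, 1000, size=10_000)


def dot_product_arrays():
    # Same calculation as dot_product(), but on plain NumPy arrays
    return np.dot(prices, units)


# Run 5 batches of 100 calls each; both numbers are arbitrary choices
calls_per_batch = 100
timings = Timer(dot_product_arrays).repeat(repeat=5, number=calls_per_batch)

# The fastest batch is usually the least disturbed by other activity on the machine
best_per_call = min(timings) / calls_per_batch
print("Best time per call: %0.9f seconds" % best_per_call)
```

The same pattern would not be practical for the for-loop version, since a single call already takes minutes, but it should give steadier numbers for the three fast functions.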


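As expected, the for-loop took by far the most time, at approximately 168 seconds! A for-loop can be, and often is, replaced with a list comprehension, and this example shows how valuable that can be: the computation time drops from 168 seconds to roughly 0.002 seconds. Note that much of the for-loop's cost here comes from reading and writing individual DataFrame elements with `.iloc` on every iteration, which the list comprehension avoids by iterating over the columns directly.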
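To get a feel for how much of the for-loop's time is spent on per-element DataFrame access, here is a minimal sketch that keeps an explicit loop but first extracts each column as a NumPy array with `to_numpy()`. The names `demo_df` and `loop_over_arrays` are hypothetical, and the data is a fresh random sample rather than the DataFrame timed above. No timing is claimed here; it is worth measuring on your own machine.

```python
import numpy as np
import pandas as pd

# Illustrative data only: a smaller random DataFrame with the same two columns
rng = np.random.default_rng(1)
demo_df = pd.DataFrame({
    "number_of_units": rng.integers(0, 100, size=1000),
    "price_per_unit": rng.integers(0, 1000, size=1000),
})


def loop_over_arrays(frame):
    # Pull each column out once, so the loop never touches the DataFrame itself
    prices = frame["price_per_unit"].to_numpy()
    units = frame["number_of_units"].to_numpy()

    total = 0
    for i in range(len(prices)):
        # Convert to plain Python ints before multiplying and accumulating
        total += int(prices[i]) * int(units[i])
    return total


loop_total = loop_over_arrays(demo_df)

# Cross-check against the vectorized calculation
vector_total = int((demo_df["price_per_unit"] * demo_df["number_of_units"]).sum())
print(loop_total == vector_total)
```

This is still an interpreted loop, so it should not be expected to match the vectorized versions, but it separates the cost of looping from the cost of indexing into a DataFrame row by row.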

For-loops are a good place to start, especially as a beginner, as they can be more intuitive and readable. However, refactoring them into list comprehensions can make your code more efficient, shorter to write, and easier to check for bugs.

This example also shows that it is important to use vectorization when possible. Here it runs in less than half the time of the list comprehension, which may seem minor, but when working with considerably more data and performing multiple operations, it can make a big difference.
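The `vectorized` function multiplies the columns in one vectorized step but then adds up the result with Python's built-in `sum`, which iterates over the Series one element at a time. A possible refinement, sketched here on the assumption that only the total is needed, is to call the Series' own `.sum()` method so that the reduction also stays in compiled code. `demo_df` and both function names are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data only: a random DataFrame with the same two columns
rng = np.random.default_rng(42)
demo_df = pd.DataFrame({
    "number_of_units": rng.integers(0, 100, size=1000),
    "price_per_unit": rng.integers(0, 1000, size=1000),
})


def vectorized_builtin_sum(frame):
    # Vectorized multiplication, but the built-in sum walks the result in Python
    return sum(frame["price_per_unit"] * frame["number_of_units"])


def vectorized_series_sum(frame):
    # Vectorized multiplication and a vectorized reduction via Series.sum()
    return (frame["price_per_unit"] * frame["number_of_units"]).sum()


builtin_total = int(vectorized_builtin_sum(demo_df))
series_total = int(vectorized_series_sum(demo_df))
print(builtin_total == series_total)
```

Both functions return the same total; how much the second one helps will depend on the data size, so it is best confirmed with `timeit` on your own data.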

Using NumPy's built-in dot product is quicker still: roughly 15 times faster than the vectorized version!

### Conclusion

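In conclusion, it is worth investing time in refactoring code to make it as efficient as possible. Making use of the built-in operations in the `NumPy` library can improve the speed of the code dramatically: in this example, the dot product ran in a fraction of a millisecond, compared with nearly three minutes for the for-loop.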