<a target="_blank" href="https://colab.research.google.com/github/JLDC/Data-Science-Fundamentals/blob/master/notebooks/011_vectorization.ipynb">
    <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Open this notebook in Google Colab
</a>

___

# Vectorization
___

In this notebook, we will have a look at **vectorization**. Vectorization is one of the most important concepts to understand when working with large amounts of data. Replacing plain Python loops with vectorized operations can speed up your code dramatically and save you a lot of time. Using plain Python loops without vectorizing them is one of the most common mistakes that beginner data scientists make when starting to work with pandas and NumPy.

To make a long story short, vectorization refers to applying the same function to every element of a `numpy` array, or of a column in a `pandas` dataframe, in a single call rather than in an explicit Python loop, **and it will substantially decrease the runtime your code needs!**

Every time you write a loop that iterates over your full data (or large parts of it), you should think: **"Is there a way to vectorize this loop?"**

Let's have a look at vectorization in a data cleaning context. Say that your client has a large dataset with several thousand observations of stock returns for multiple companies in long format.

In [None]:
import pandas as pd

# Define the path where the data is stored
DATA_PATH = "https://raw.githubusercontent.com/JLDC/Data-Science-Fundamentals/master/data"

In [None]:
# Read in the stock data
stocks = pd.read_csv(f"{DATA_PATH}/data/stock_returns.csv")
stocks.head(10) # Display the first few lines

In [None]:
# Display the unique stocks we have in our dataframe
stocks["Stock"].unique()

Say, you have this dataset of stock returns; however, you have another dataset where there is only the ticker for each company (the acronym in the parentheses, e.g., NOK, AAPL, etc.). How can we add a column `ticker` which contains only the ticker of each observation such that we would be able to merge both dataframes together?

First, let's define a function that extracts the ticker for a single string. The company name always follows the pattern **company name (ticker)**. Remember what you learned about string manipulation in the previous notebook...

The `.split()` method splits a string into a list of substrings at every occurrence of a delimiter, e.g.,

In [None]:
# Split the first company name at the whitespace ' '
stocks.loc[0, "Stock"].split(" ")
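For a version that runs without the dataset, here is the same split on a made-up company name. The string `"Apple Inc. (AAPL)"` is an assumption for illustration; only the **company name (ticker)** pattern comes from the data.

```python
# Hypothetical company name following the "company name (ticker)" pattern
example_name = "Apple Inc. (AAPL)"

# Splitting at whitespace returns a list of the pieces between the spaces
parts = example_name.split(" ")
print(parts)       # ['Apple', 'Inc.', '(AAPL)']

# The ticker, still wrapped in parentheses, is the last element
print(parts[-1])   # (AAPL)
```

Note that the company name itself may contain spaces, which is why we index from the end of the list rather than from the start.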

Hence, to recover the ticker, we can simply apply this split, grab the last item of the list, i.e., the ticker in parentheses, and strip the parentheses from it. Let's write a function for this!

In [None]:
# Define a function to recover the ticker from the full company name
def recover_ticker(company_name):
    # Split the name at the white spaces
    name_split = company_name.split(" ")
    # Grab the last element of this split
    ticker = name_split[-1]
    # Strip the parentheses and return
    return ticker.strip("()")

# 🤓 Oh and if you want to get fancy, this is the way you'd do it
# you can verify yourself that it also works, much nicer don't you think?
recover_ticker2 = lambda x: x.split(" ")[-1].strip("()")

In [None]:
# ... does it work?
[recover_ticker(company) for company in stocks["Stock"].unique()]

Neat! We can now use it on our dataframe. Intuitively, we can just loop over the dataframe and apply this function to each company, right? Well, we could, but we'll show that this is not efficient.

#### ⛔ The *wrong* way
Let's start by doing it the *wrong* way, i.e., with a loop. I won't go into too many details of how to loop over the observations properly **because you should never do it! If you are looping over the observations in your dataframe, there is a very high chance you are doing something wrong and you should use vectorization instead!**

Now, to clarify: there is generally nothing wrong with loops in computer science. But **loops in Python are not very efficient**. In fact, *vectorization* still performs a loop; it is so much faster because the underlying package (`pandas` or `numpy`) runs that loop in compiled code written in faster programming languages (e.g., C or C++) and can, in some cases, process several elements at once.

So to summarize, the main concept is the following: loops written in pure Python are slow. With vectorization we are still writing loops, which can seem confusing because we write them in a *different manner*. Instead of being interpreted step by step in pure Python, the loop is handed over to the package, which uses these tricks to massively increase its speed.
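As a rough sketch of this idea outside of pandas, the snippet below computes the same result twice: once with a pure Python loop and once with a single NumPy expression. The array `values` and its size are arbitrary choices for illustration, and the timings will vary from machine to machine.

```python
import time

import numpy as np

# Hypothetical data: 200,000 evenly spaced numbers between 0 and 1
values = np.linspace(0.0, 1.0, 200_000)

# Pure Python loop: each multiplication is interpreted one element at a time
start = time.perf_counter()
squared_loop = np.empty_like(values)
for i in range(len(values)):
    squared_loop[i] = values[i] * values[i]
loop_seconds = time.perf_counter() - start

# Vectorized: the same loop runs inside NumPy's compiled code
start = time.perf_counter()
squared_vec = values * values
vec_seconds = time.perf_counter() - start

print(f"Python loop: {loop_seconds:.4f} s")
print(f"Vectorized:  {vec_seconds:.4f} s")
print("Same result:", np.array_equal(squared_loop, squared_vec))
```

Both versions visit every element exactly once; the only difference is where the loop runs.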

In [None]:
# Start by creating a column with the ticker, but put an empty string in there for now
stocks["Ticker"] = ""
stocks # Display the data, our new column is there

In [None]:
%%time 
# This %%time outputs the time needed for the cell to run

# Iterate over every observation
for i in stocks.index: 
    stocks.at[i, "Ticker"] = recover_ticker(stocks.loc[i, "Stock"])

Okay, roughly **170 milliseconds** (on my machine) for ~15'000 observations. Come on, it's not that bad, is it? Well, let's see what vectorization has to say about it.

#### ✅ The *right* way
So, how does vectorization work anyway? Loops are simple enough, right? But is vectorization also easy? Kind of.

Remember how we can use methods on `pandas` dataframes, e.g., `.mean()` for the mean or `.std()` for the standard deviation? In the same way, we can use `.apply` and pass a function as an argument, which applies the function to every value in the column. Not only is this much faster, it is also cleaner than first initializing an empty column and then writing a cumbersome loop to fill it. Have a look!

In [None]:
%%time
# A one-liner instead of what we did above...
stocks["Ticker2"] = stocks["Stock"].apply(recover_ticker)

___
Roughly **15 ms** on the same machine, that's **about 11 times faster than the pure Python loop, i.e., a runtime reduction of roughly 91%! 🤯**
___

And, to be honest, this example is kind to loops: the loop runs in 170 milliseconds, and we could live with that. However, beginner data scientists often let their loops run for multiple hours, whereas a simple vectorization could have done the job in a few minutes. **I'm not exaggerating, I have seen this more often than I can count, and I've also done it myself when I didn't know better.**
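A caveat worth knowing: `.apply` with a Python function still calls that function once per element, so much of the gain above likely comes from avoiding the row-by-row indexing. pandas also offers string methods through the `.str` accessor, which let you express the same steps on the whole column without writing a function. Here is a minimal sketch on a small made-up dataframe (`demo` and its company names are assumptions, not the course data). Whether this is faster than `.apply` depends on the data and the pandas version, so it is worth timing yourself.

```python
import pandas as pd

# Hypothetical stand-in for the `stocks` dataframe
demo = pd.DataFrame({
    "Stock": ["Nokia (NOK)", "Apple Inc. (AAPL)", "Nokia (NOK)"],
})

# Same three steps as recover_ticker, applied to the whole column:
# split at whitespace, take the last piece, strip the parentheses
demo["Ticker"] = demo["Stock"].str.split(" ").str[-1].str.strip("()")

print(demo["Ticker"].tolist())  # ['NOK', 'AAPL', 'NOK']
```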

In [None]:
stocks # Display the final data to make sure everything matches

In [None]:
# If you really want to make sure
all(stocks["Ticker"] == stocks["Ticker2"])
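With the ticker in its own column, the merge that motivated this exercise becomes straightforward. The sketch below uses two small made-up dataframes; `returns_demo`, `info_demo`, the column names `Return` and `Exchange`, and all values are assumptions, since the second dataset is not shown in this notebook.

```python
import pandas as pd

# Hypothetical returns data that already has a ticker column
returns_demo = pd.DataFrame({
    "Ticker": ["NOK", "AAPL", "NOK"],
    "Return": [0.5, -1.2, 2.1],
})

# Hypothetical second dataset that only identifies companies by ticker
info_demo = pd.DataFrame({
    "Ticker": ["NOK", "AAPL"],
    "Exchange": ["NYSE", "NASDAQ"],
})

# Left merge: keep every return observation and attach the matching info
merged = returns_demo.merge(info_demo, on="Ticker", how="left")
print(merged)
```

A left merge is used here so that no return observation is dropped if its ticker happens to be missing from the second dataset.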

#### ➡️ ✏️ Task 1

Time to try your own hand at vectorization. Say that you want to create a column `returns_binned`, where you group the return percentages into the following bins:

+ $(-\infty; -5] \implies$ `"extreme negative returns"`
+ $(-5;  -2] \implies$ `"large negative returns"`
+ $(-2; 0] \implies$ `"negative returns"`
+ $(0; +2] \implies$ `"positive returns"`
+ $(+2; +5] \implies$ `"large positive returns"`
+ $(+5; +\infty) \implies$ `"extreme positive returns"`

Start by creating a function, `bin_returns`, which takes a single input and returns a **string** depending on the value of the input. Here is an input/output table, compare your outputs to it to make sure your function is right:

|`input`|`output`|
|---:|--:|
|`1.5`|`"positive returns"`|
|`3.6`|`"large positive returns"`|
|`-6.52`|`"extreme negative returns"`|
|`-0.7`|`"negative returns"`|
|`8.34`|`"extreme positive returns"`|

Once you are convinced that your function works, go ahead and add a column to the dataframe `stocks`. Do it with vectorization, but also with a loop, and compare the time difference using the `%%time` magic at the beginning of the cell.

In [None]:
# Enter your code here ➡️ ✏️
def bin_returns(x):
    
    return 