In [118]:
import pandas as pd

# Doing Things in Bulk

We are, hopefully, pretty comfortable with the idea of looping through some data structure if we need to "do something" to every item in the structure. In addition to loops, there are other ways to accomplish this. These other ways are often more efficient, and they can also be more expressive - more readable and easier to understand. Again, efficiency is not a massive concern in most cases, but acting on datasets that may be millions of items long can be one of those cases. Expressiveness, on the other hand, is always a concern. The more readable and understandable our code is, the easier it is to maintain and the less likely it is to contain bugs. I personally have a lower opinion of the readability of these condensed shortcuts than most people, but that's really just an opinion.

I also think that this is another good example of a scenario where we want to focus on what to do, then figure out how to implement it. If we know we want to take the square root of every item in a list, we can look up how to do that; if we write a loop and find it is inefficient or hard to debug, we can look for another solution then.

One benefit of the type of programming that we do in data science is that we almost always test and observe our code in chunks as we write it, so if some chunk is taking an inordinate amount of time we can investigate as the need occurs - it is very unlikely that we will write things that take ages to run without noticing.

## Contents

This notebook has several parts and the functionality in each overlaps, but is largely independent. Hopefully this helps with keeping things straight:
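<ul>
<li> List comprehension - create a list filled with values generated from some expression. </li>
<li> Functions as objects and lambda functions - pass functions around like any other value, and write small one-off functions. </li>
<li> Apply - run a function over a Pandas dataframe or series. </li>
<li> Map and filter - run a function over any iterable. </li>
<li> Iterators - what map and filter actually return. </li>
</ul>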

## List Comprehensions

List comprehensions are a way to create lists in Python, generally a bit faster than using a for loop. List comprehension is basically a combined loop, expression, and list constructor, that will perform some calculation/action and return the results in a list. List comprehension can sometimes be more readable than looping, but that is a matter of opinion. The syntax for a list comprehension is:

[output for item in iterable if condition]

<ul>
<li> output is the expression you want to evaluate for every item in the iterable. </li>
<li> item is the item in the iterable that you want to evaluate the expression for. </li>
<li> iterable is the iterable you want to apply the expression to. </li>
<li> condition is an optional condition that you can use to filter the output. </li>
</ul>

This results in the expression being executed, or applied, to every item that comes up within the for loop. Most commonly, this means that some expression is applied to every item in a list or other data structure. 
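As a rough sketch of how the pieces of the syntax line up, the comprehension below is paired with an equivalent for loop. The names `numbers`, `doubled_loop`, and `doubled_comp` and the data are made up for illustration.

```python
# Hypothetical example data, not from the notebook
numbers = [3, 8, 5, 12]

# Loop version: build the list one item at a time
doubled_loop = []
for n in numbers:                   # item and iterable
    if n > 4:                       # condition
        doubled_loop.append(n * 2)  # output expression

# Comprehension version: same output, item, iterable, and condition
doubled_comp = [n * 2 for n in numbers if n > 4]

print(doubled_comp)                  # [16, 10, 24]
print(doubled_loop == doubled_comp)  # True
```

Both versions do the same work; the comprehension just packs the loop, the test, and the append into one line.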

![List Comprehension](../../images/list_comprehension.png "List Comprehension")

We can use list comprehension to create a list of the square of every number in an existing list. 

In [119]:
# Create List 
list1 = [1,2,3,4,5,6,7,8,9,10]
list2 = [x**2 for x in list1]
list2

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

A common use of list comprehension is to generate values over a range up to some limit and collect the results in a list. For example, if we want the squares of all numbers below some limit. 

In [120]:
a = 20

# Create List
list3 = [x**2 for x in range(a)]
list3

[0,
 1,
 4,
 9,
 16,
 25,
 36,
 49,
 64,
 81,
 100,
 121,
 144,
 169,
 196,
 225,
 256,
 289,
 324,
 361]

We can also add a condition to the list comprehension, so that only items that meet the condition are added to the list. For example, let's square the even numbers. 

In [121]:
list4 = [x**2 for x in range(a) if x%2==0]
list4

[0, 4, 16, 36, 64, 100, 144, 196, 256, 324]

## Exercise

Use list comprehension to do the following:

<ol>
<li>Create a list of the square of every number from 1 to 100, but only if the number is divisible by 3.</li>
<li>Create a list of the square of every number from 1 to 100, but only if the number is divisible by 3 and 5.</li>
<li>Create a list of the letters in the string "I'm sure I have at least 5 words left.", with each letter converted to upper case. Only place letters in the list, ignore numbers or punctuation. </li>
</ol>

In [122]:
limit = 100

# range(1, limit + 1) covers 1 through 100 inclusive, as the exercise asks
list_resp_1 = [x**2 for x in range(1, limit + 1) if x%3==0]
list_resp_1

[9,
 36,
 81,
 144,
 225,
 324,
 441,
 576,
 729,
 900,
 1089,
 1296,
 1521,
 1764,
 2025,
 2304,
 2601,
 2916,
 3249,
 3600,
 3969,
 4356,
 4761,
 5184,
 5625,
 6084,
 6561,
 7056,
 7569,
 8100,
 8649,
 9216,
 9801]

In [123]:
# range(1, limit + 1) covers 1 through 100 inclusive, as the exercise asks
list_resp_2 = [x**2 for x in range(1, limit + 1) if (x%3==0 and x%5==0)]
list_resp_2

[225, 900, 2025, 3600, 5625, 8100]

In [124]:
capString = "I'm sure I have at least 5 words left."

list_resp_3 = [x.upper() for x in capString if x.isalpha()]
list_resp_3

['I',
 'M',
 'S',
 'U',
 'R',
 'E',
 'I',
 'H',
 'A',
 'V',
 'E',
 'A',
 'T',
 'L',
 'E',
 'A',
 'S',
 'T',
 'W',
 'O',
 'R',
 'D',
 'S',
 'L',
 'E',
 'F',
 'T']

### Functions as Objects

In Python, functions are first class objects. This means that they can be assigned to variables, put into lists, and passed around and used as arguments just like any other object (string, int, float, list, and so on). This allows us to do some things that can be more efficient, more readable, but also potentially more confusing.

We'll use this property here to allow these bulk actions to work efficiently. In the sections that follow, we generally pass in the function we want executed along with the data to execute it "on", rather than calling the function directly. This is a bit more abstract, and can be confusing at first, but it is a very powerful concept.

![Map Inputs](../../images/map_inputs.png "Map Inputs")

The mental leap comes from the fact that we don't really care about or refer to the individual items in the data structure at all; we just specify that we want "this" done to "all of them". This setup can allow the computer to do a better job of optimizing the process and making full use of the hardware. We'll look into this a bit more when we examine vectorization, another similar concept. 
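A small sketch of what "functions are objects" looks like in practice. The functions `shout` and `whisper` are invented for this illustration; they are not part of the notebook.

```python
# Two hypothetical functions to pass around
def shout(text):
    return text.upper() + "!"

def whisper(text):
    return text.lower() + "..."

# Assign a function to a variable (no parentheses, so it is not called here)
speak = shout
print(speak("hello"))   # HELLO!

# Store functions in a list and call each one in turn
styles = [shout, whisper]
results = [style("Hello") for style in styles]
print(results)          # ['HELLO!', 'hello...']
```

Note that `shout` with no parentheses refers to the function itself, while `shout("hello")` calls it.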

### Orders - High and Low

Higher order functions are functions that take other functions as arguments. The `map` function that we'll look at in a moment is an example of a higher order function. It takes a function and a sequence and returns a new sequence with the function applied to each element in the sequence. In the example below, the "applier" function is a simple example - it takes in two numbers and a function, and "applies" that function to the two numbers. We could swap the function we are plugging in to something different, like multiplication here, and get different results - this isn't all that useful with basic math, but it can be helpful to make more modular code as things get complex. 

Our examples here rely on a simple concept that underlies the way Python is built, that everything is an object. 

In [125]:
def add(x, y):
    return x + y
def mult(x, y):
    return x * y
def applier(num1, num2, func):
    return func(num1, num2)

In [126]:
applier(2, 3, add)

5

Swap functions...

In [127]:
applier(2, 3, mult)

6

#### Function Composition 

Now we can pass functions as arguments to other functions. This allows us to abstract away, or hide, the details of a particular operation. This can make our code more readable and more reusable... in theory.

In this example, take a look at the details of the arguments. The outermost function is add, which takes in two arguments. The first argument is the function "mult", called with the two columns of the dataframe as arguments. The second argument is another call to "add", with the two columns of the dataframe as arguments again. So the end result of what we are doing is:

<ul>
<li> Argument 1 - Take the first column of the dataframe and multiply it by the second column of the dataframe. </li>
<li> Argument 2 - Take the first column of the dataframe and add it to the second column of the dataframe. </li>
<li> Take the results of Argument 1 and Argument 2 and add them together. </li>
</ul>

This could be stacked infinitely deep with function calls as arguments to function calls as arguments to function calls. This is a powerful technique, but it can also be confusing. So use it with caution. If you find yourself getting confused, it's probably better to break it up into smaller steps.
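To illustrate the "break it up into smaller steps" advice, here is a sketch using two plain numbers rather than dataframe columns. The values and the names `num1`, `num2`, `product`, and `total` are assumptions for the example; `add` and `mult` are repeated so the block runs on its own.

```python
def add(x, y):
    return x + y

def mult(x, y):
    return x * y

num1, num2 = 5, 6  # hypothetical stand-ins for one row of two columns

# Nested version: everything in one expression
nested = add(mult(num1, num2), add(num1, num2))

# Stepwise version: name each intermediate result
product = mult(num1, num2)      # Argument 1
total = add(num1, num2)         # Argument 2
stepwise = add(product, total)  # combine the two

print(nested, stepwise)  # 41 41
```

The stepwise version is longer, but each intermediate value can be printed and checked on its own.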

In [128]:

number_cols = ["Num1", "Num2"]
# Create DataFrame
number_table = pd.DataFrame({"Num1":[1, 2, 5, 7, 8, 9, 23, 5], "Num2":[3, 4, 6, 8, 1, 3, 5, 45]})
number_table.head()

Unnamed: 0,Num1,Num2
0,1,3
1,2,4
2,5,6
3,7,8
4,8,1


In [129]:
add( mult(number_table["Num1"], number_table["Num2"]), add(number_table["Num1"], number_table["Num2"]) )

0      7
1     14
2     41
3     71
4     17
5     39
6    143
7    275
dtype: int64

## Exercise

Write a function that extends the applier function above. This function should take in two arguments, a list of numbers and a function of two arguments, and it should use that function to combine the items in the list from left to right into a single result. For example, inputs of [1,2,3] and "add" should return 6, and inputs of [1,2,3] and "mult" should return 6. Inputs of [20,2,2] and "divide" should return 5 - 20 divided by 2, divided by 2.

<b>Note:</b> this one is a good candidate for recursion, but that's not required. 

In [130]:
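def bigApplier(inputNums, func):
    # Base case: a single number needs no combining
    if len(inputNums) == 1:
        return inputNums[0]
    else:
        # Combine everything before the last item first, then apply func
        # with the last item, so the work runs left to right:
        # [20, 2, 2] with divide gives (20 / 2) / 2 = 5
        return func(bigApplier(inputNums[:-1], func), inputNums[-1])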

In [131]:
bigApplier([1, 2, 3, 4, 5], add)

15

In [150]:
bigApplier([1, 2, 3, 4, 5], mult)

120
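For comparison, the standard library already offers this kind of left-to-right combining through `functools.reduce`. The sketch below uses functions from the `operator` module in place of the notebook's `add` and `mult`; that swap is a choice made here for brevity, not something the exercise requires.

```python
from functools import reduce
import operator

numbers = [1, 2, 3, 4, 5]

# reduce combines the first two items, then that result with the third, and so on
print(reduce(operator.add, numbers))  # 15
print(reduce(operator.mul, numbers))  # 120

# Left to right: (20 / 2) / 2
quotient = reduce(operator.truediv, [20, 2, 2])
print(quotient)                       # 5.0
```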

### Lambda Functions

Another common tool, one especially useful for simple actions that we want to apply to a large number of items, is the lambda function. Lambda functions are single line functions that are anonymous, meaning they are not bound to a name. Lambda functions can accept arguments, but they can only contain one expression. Unlike functions created with the def keyword, lambda functions have no return statement; the value of the single expression is returned automatically. They are useful for creating quick functions that aren't needed later in your code. In data science, this often comes up for things like preparing data that is largely unstructured - we may have several specific steps that we need to do once on some column of data, like splitting full names into first, last, and middle(s). 

![Lambda Functions](../../images/lambda.png "Lambda Functions")

#### Defining Lambda Functions

Lambda functions can be created with a simple statement starting with the keyword lambda. The general syntax of a lambda function is quite simple: 

lambda argument(s) : expression

<ul>
<li> lambda is a keyword in Python for defining the anonymous function.</li>
<li> argument(s) is a placeholder, that is a variable that will be used to hold the value you want to pass into the function expression. A lambda function can have multiple variables depending on what you want to achieve. </li>
<li> expression is the code you want to execute in the lambda function.</li>
</ul>

When a lambda function is handed to something like apply, we can think of what happens as:
<ul>
<li> Treat each item of the dataframe as x. </li>
<li> Do the expression to x. </li>
<li> Return the result. </li>
<li> Repeat for each item in the data</li>
</ul>

The lambda functions that we create work just like any other function; they are just more constrained and smaller. We can use them in the same way that we use any other function, including passing them as arguments to other functions. Lambda functions are most useful in conjunction with some higher order functions, which allows them to be applied directly to an entire set of data...
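As a minimal sketch, here is the same small function written with def and as a lambda, followed by a lambda passed straight into a higher order function. The names and the sample data are made up for the example.

```python
# The same function written two ways
def square_def(x):
    return x ** 2

square_lambda = lambda x: x ** 2

print(square_def(4), square_lambda(4))  # 16 16

# A lambda passed directly as an argument, never given a name:
# sort full names by the last word in each name
names = ["Alan Turing", "Grace Hopper"]  # hypothetical data
by_last_name = sorted(names, key=lambda full: full.split()[-1])
print(by_last_name)  # ['Grace Hopper', 'Alan Turing']
```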

### Doing Pandas in Bulk

![Pandas](../../images/pandas.png "Pandas")

In data science, we generally have large sets of data, such as our dataframes. It is pretty common to want to "do something" to every item in a column - such as convert all the values to a different type, remove some stray characters, or some other cleanup step. We can do this with a for loop, but there are some other ways that are more efficient and more readable. In effect, rather than focusing on moving through the items in the dataset and doing something to each, these methods are oriented around defining what is to be done, and where that is to be applied. If we picture that, we can visualize it as a function with two inputs - the function that does what we want, and the dataset that we want it applied to.

#### Apply

The apply function is used to run a function across a Pandas object. Called on a Series (a single column), apply runs the function on every value. Called on a dataframe, apply runs the function on every column by default, or on every row if we pass axis=1. In these cases, we will call apply with a lambda function as the argument, and that lambda function will be run on every value, row, or column of the object we called it on. 
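A short sketch of what the function receives in each case, using a tiny made-up dataframe (`toy` and its columns are assumptions for illustration, not the notebook's data).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Series.apply: the function receives one value at a time
doubled = toy["a"].apply(lambda value: value * 2)

# DataFrame.apply, default axis=0: the function receives each column as a Series
col_max = toy.apply(lambda col: col.max())

# DataFrame.apply with axis=1: the function receives each row as a Series
row_sum = toy.apply(lambda row: row["a"] + row["b"], axis=1)

print(doubled.tolist())  # [2, 4, 6]
print(col_max.tolist())  # [3, 30]
print(row_sum.tolist())  # [11, 22, 33]
```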


Let's make a lambda function with apply to check if something is an even number:

In [132]:
#df = pd.read_csv('../data/threads_reviews.csv')
df = pd.read_csv('../../data/threads_reviews.csv')
df.head()

Unnamed: 0,source,review_description,rating,review_date
0,Google Play,Very good app for Android phone and me,5,27-08-2023 10:31
1,Google Play,Sl👍👍👍👍,5,27-08-2023 10:28
2,Google Play,Best app,5,27-08-2023 9:47
3,Google Play,Gatiya app,1,27-08-2023 9:13
4,Google Play,Lit bruv,5,27-08-2023 9:00


In [133]:
# Using it
df["rating"].apply(lambda x: x%2==0)

0        False
1        False
2        False
3        False
4        False
         ...  
40430    False
40431    False
40432    False
40433     True
40434    False
Name: rating, Length: 40435, dtype: bool

We can also assign a lambda function to a variable, just like we can with a regular function. This is not something we will do often, but it is possible.

In [134]:
# Lambda functions
# even numbers
even = lambda x: x % 2 == 0

# Using it
df["rating"].apply(even)



0        False
1        False
2        False
3        False
4        False
         ...  
40430    False
40431    False
40432    False
40433     True
40434    False
Name: rating, Length: 40435, dtype: bool

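We can do pretty much anything we want in a lambda function, as long as it is a single expression. Here we call apply on the whole dataframe with axis=1, so x is a full row and we can pull more than one column out of it. 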

In [135]:
df.apply(lambda x: str(x["rating"]) + "-" + x["source"], axis=1)

0        5-Google Play
1        5-Google Play
2        5-Google Play
3        1-Google Play
4        5-Google Play
             ...      
40430      1-App Store
40431      1-App Store
40432      1-App Store
40433      2-App Store
40434      1-App Store
Length: 40435, dtype: object

We can also have more than one argument in a lambda function.

In [136]:
Max = lambda a, b: a if a > b else b
 
print(Max(1, 2))

2


## Map and Filter - Generalized Bulk Actions

The apply function above is part of the Pandas library, so it works well for dataframes or series, and not at all for other objects. For other data structures we can use the map and filter functions. These are built-in Python functions, so they work on any iterable object. Each of these does something very similar to apply, but they are a bit more flexible. Map will apply a function to every item in an iterable, and return a new iterable with the results. Filter will apply a function to every item in an iterable, and return a new iterable with only the items for which the function returned True.

![Map](../../images/map.png "Map")

### Map

The map function allows you to apply a function over a collection of elements. It is a higher-order function, meaning it takes another function as an argument. The map function is useful when you need to apply the same function to every item in an iterable (such as a list or dictionary) and collect the results in a new iterable. Map differs from apply in that it is not limited to dataframes, or other Pandas objects. It can be used on any iterable, including lists, tuples, and dictionaries. The syntax for map is:

map(function, iterable)

<ul>
<li> function is the function you want to apply to every item in the iterable. </li>
<li> iterable is the iterable you want to apply the function to. </li>
</ul>

<b>Note:</b> we are converting the results to lists here to look at them. The map function returns a map object, which is an iterable. We can convert it to a list to see the results, more on this below. 

In [137]:
list(map(lambda x: x%2==0, list1))

[False, True, False, True, False, True, False, True, False, True]

We can also store the results in a variable. 

In [138]:
tmp = map(lambda x: x%2==0, list1)
tmp

<map at 0x11fe39a50>

The return value is a "map" object, more on this later. For now, we can convert it to a list or loop through it to see the results.

In [139]:
for x in tmp:
    print(x)

False
True
False
True
False
True
False
True
False
True
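One consequence worth knowing: a map object can only be walked through once. After a loop or a `list()` call has consumed it, nothing is left. A minimal sketch, with made-up names and values:

```python
values = [1, 2, 3, 4]  # hypothetical data
evens = map(lambda x: x % 2 == 0, values)

first_pass = list(evens)   # consumes every item
second_pass = list(evens)  # nothing left to consume

print(first_pass)   # [False, True, False, True]
print(second_pass)  # []
```

If the results are needed more than once, store the list rather than the map object.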


<b>Note:</b> Pandas also has its own map method, which is similar to apply. It is not the same as the built-in map function we are discussing here, but it does the equivalent for Pandas objects. See below. 

In [140]:
# Pandas Map
print("I'm Pandas - ", list(df["rating"].map(lambda x: x%2==0).head()), "\n")

# Map map - note the different syntax
print("I'm Map - ", list(map(lambda x: x%2==0, df["rating"]))[:5])

I'm Pandas -  [False, False, False, False, False] 

I'm Map -  [False, False, False, False, False]


The built-in map works on the Pandas column above and on a plain list alike. Apply, on the other hand, comes from Pandas and is only available on Pandas objects.

In [141]:
# I'm gonna fail!
print(list1)
try:
    list1.apply(lambda x: x%2==0)
except AttributeError:
    # Plain lists have no apply method
    print("Failed - I knew it!")

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Failed - I knew it!


### Filter

The filter function runs a test function over a collection of elements and returns only the elements for which that function returns True. The function we provide should return a boolean value (or something that can be treated as one). The syntax for filter is just like map:

filter(function, iterable)

<ul>
<li> function is the test you want to run on every item in the iterable; items for which it returns True are kept. </li>
<li> iterable is the iterable you want to apply the function to. </li>
</ul>


In [142]:
# Filter ratings column for even numbers
list(filter(lambda x: x%2==0, df["rating"]))[:5]

[2, 4, 2, 2, 2]
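To make the difference between map and filter concrete, the sketch below runs the same test through both, along with the list comprehension that matches the filter. The `ratings` list is made up for the example.

```python
ratings = [5, 2, 4, 1, 3, 2]  # hypothetical data

# map returns the result of the function for every item
mapped = list(map(lambda x: x % 2 == 0, ratings))

# filter returns the original items for which the function is True
filtered = list(filter(lambda x: x % 2 == 0, ratings))

# The list comprehension equivalent of the filter
comprehension = [x for x in ratings if x % 2 == 0]

print(mapped)         # [False, True, True, False, False, True]
print(filtered)       # [2, 4, 2]
print(comprehension)  # [2, 4, 2]
```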

### Mapping and Applying Real Functions

These shortcuts are commonly used with lambda functions, but we can map, filter, or apply pretty much any function to a data structure if we need to. This is identical in syntax to a lambda function, but I'd argue it is more readable since the body of the function is elsewhere. When using normal functions we don't have the "one expression" limitation that we have with lambda functions, so we can do pretty much anything. 

In [143]:
def isEven(x):
    return x%2==0

In [144]:
list(map(isEven, list1))

[False, True, False, True, False, True, False, True, False, True]

In [145]:
df["rating"].apply(isEven)

0        False
1        False
2        False
3        False
4        False
         ...  
40430    False
40431    False
40432    False
40433     True
40434    False
Name: rating, Length: 40435, dtype: bool

### What's Returned? - Intro to Iterators

Map doesn't return a list or another data structure of the results, it returns a map object - a type of iterable. Iterables as a concept are something that we'll explore in more detail later. For now, we can think of this as something that we can use like any other data structure, but we first must convert it to a list or other data structure if we want to use it like one. When we want to see the results above we wrapped the map in a list() function, which converted it to a list.

The object itself is actually a type of iterator, or a variation on a normal data structure where the items inside are not stored in memory, but are instead generated as needed. This is something that we will see more when using large datasets for neural network models, where the amount of data is more than we can fit in the memory of the system, so we leave it on disk and only pull in bits at a time. The key concept here is that the iterator will return the next item when asked, no matter if that item is in memory or not. In a "normal" data structure all the data is in memory.

This is also an example of "lazy" evaluation, where we don't do anything until we need to. Lazy evaluation is a common technique that we see in Python when working with large datasets; we can avoid doing large and slow operations up front in favor of distributing that work over time.

These iterators also open the possibility of effectively infinite data structures, where we can generate the next item in the data structure as needed. Take an example of training a large language model such as ChatGPT, where it will need to take in a massive amount of text as its training data. We can't fit "the internet" into memory, but we can create an iterator (specifically a <i>generator</i>) that will produce the next item that we expect, without forcing us to care about the data as a whole. 
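As a minimal sketch of the generator idea, the function below produces square numbers forever, one at a time, and `itertools.islice` takes just the first few. The function name `squares` is made up for the example, and a real data pipeline would be far more involved.

```python
from itertools import islice

def squares():
    """Yield 0, 1, 4, 9, ... forever, one value at a time."""
    n = 0
    while True:
        yield n ** 2  # hand back one value, then pause until asked again
        n += 1

# Nothing is computed until we ask; islice stops after five items
first_five = list(islice(squares(), 5))
print(first_five)  # [0, 1, 4, 9, 16]
```

The full sequence never exists in memory; only the current value of `n` is stored.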

![Iterators](../../images/iterator.png "Iterators")

We can generate one of these iterators for any iterable, meaning that we can use this technique with almost any data structure.

In [146]:
type(map(isEven, list1))

map

In [147]:
a = map(isEven, list1)

All iterators have a __next__() method, which returns the next item in the iterator; the built-in next() function calls it for us. When we use a for loop to iterate over an iterator, the for loop automatically asks for the next item on each pass. This is also what allows it to be interoperable with lists and other data structures that we're used to. Each run of this block will progress one item further through the dataset. 

In [148]:
next(a)

False
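A short sketch of what happens when an iterator runs out, using the built-in `iter()` to get an iterator from a plain list. The variable names are made up for the example.

```python
letters = iter(["a", "b"])  # iter() builds an iterator from any iterable

first = next(letters)
second = next(letters)
print(first, second)  # a b

# Asking again after the last item raises StopIteration
ran_out = False
try:
    next(letters)
except StopIteration:
    ran_out = True
    print("No items left")
```

A for loop catches this StopIteration behind the scenes, which is how it knows when to stop.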

## Map, Filter, Apply, Oh My!

These different functions can be a little confusing to keep straight - which one applies where, and what is the syntax. The important point is that we understand that if we want to do something to a large set of items, we can use one of these actions to do it all in one step. We can use a for loop, but that is generally less efficient and less readable, so having these additional tools in your toolbox is a good thing.

If you're not going to remember anything else, remember the use of map(). Map is usable on any iterable (unlike apply), so if you understand how map is used you can accomplish most tasks using it. The idea and syntax is also in line with 

## Exercise

Write a function that will take in a function, a data structure, and a boolean value as arguments, then return a new data structure with the function run on every item in the original data structure. The function used to do the "running" should be dynamic:
<ul>
<li> If the iterable is a Pandas object, use apply. </li>
<li> If the iterable is a different iterable, use map. </li>
<li> If the flag to_filter is True, use filter. </li>
</ul>

<b>Note:</b> for (at least) one of the cases here, it might not work out directly with a map/apply/filter function. I used list comprehension for that case, but there are probably several ways. 

In [163]:
# I think this is right, I didn't test a lot of edge cases though. Might be a mistake somewhere. 
def map_filter_or_apply(func, iterable, to_filter=False):
    # isinstance also accepts subclasses of pd.Series
    is_series = isinstance(iterable, pd.Series)
    if to_filter is True:
        if is_series:
            # apply would return True/False values rather than the kept items,
            # so use a list comprehension here
            return [x for x in iterable if func(x)]
        else:
            return list(filter(func, iterable))
    elif to_filter is False:
        if is_series:
            return iterable.apply(func)
        else:
            return list(map(func, iterable))
    else:
        raise ValueError("to_filter must be True or False")

Some tests I used to check my work:

In [164]:
# Tests

map_filter_or_apply(isEven, pd.Series(list1), to_filter=True)

[2, 4, 6, 8, 10]

In [165]:
map_filter_or_apply(isEven, list1, to_filter=True)

[2, 4, 6, 8, 10]
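One of the "several ways" to filter a Pandas Series is boolean masking: apply the test to get a Series of True/False values, then use that to select. This is a sketch of an alternative, not the approach used in the solution above; `is_even` and `numbers` are defined here so the block runs on its own.

```python
import pandas as pd

def is_even(x):
    return x % 2 == 0

numbers = pd.Series([1, 2, 3, 4, 5, 6])  # hypothetical data

mask = numbers.apply(is_even)  # Series of True/False values
kept = numbers[mask]           # keeps only the items where mask is True

print(kept.tolist())        # [2, 4, 6]
print(kept.index.tolist())  # [1, 3, 5]
```

Unlike the list comprehension, this returns a Series and keeps the original index labels.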

In [166]:
type(map_filter_or_apply)

function