# Applying functions to a DataFrame
This [interesting post](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6) compares several ways of applying functions to a DataFrame and how they perform.

In [1]:
import numpy as np
import pandas as pd
# Limit the number of displayed rows, just for convenience
pd.set_option('display.max_rows', 8)

## Load data

In [3]:
movie = pd.read_csv('data/movie.csv')
movie.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


## Ways to apply functions
There are several ways to apply functions to your data, and they differ in speed and readability. We will go through some of them from worst to best; in general, prefer the options listed last.
1. iteration: a plain Python `for` loop. Not a good idea: Python is an interpreted language and, as in R, looping is slow compared with languages like C.
1. `iterrows()` and `iteritems()`: DataFrame methods that return generators over rows and columns respectively. Faster than the previous option, but still slow. (`iteritems()` was removed in pandas 2.0; use `items()` there.)
1. `apply()` and `applymap()`: apply the passed function row-wise/column-wise or element-wise. Better. (`applymap()` is deprecated since pandas 2.1 in favor of `DataFrame.map()`.)
1. built-in pandas methods such as `min()` and `max()`: usually one of the best options, because these methods are optimized and easy to read.
1. operating on arrays instead of DataFrames: extracting the underlying values and applying NumPy functions directly can improve performance.
1. writing functions in C: to speed up a function further, you can rewrite it in C and load it as an extension.
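
To make the list more concrete, here is a minimal sketch of the same calculation (adding two columns) written in the first five styles. The toy DataFrame and its column names `a` and `b` are made up for illustration and are not part of the movie dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# 1. Plain Python loop over row positions (typically the slowest).
loop_sum = [df["a"].iloc[i] + df["b"].iloc[i] for i in range(len(df))]

# 2. iterrows(): yields (index, row-as-Series) pairs.
iterrows_sum = [row["a"] + row["b"] for _, row in df.iterrows()]

# 3. apply() with axis=1: the function receives one row at a time.
apply_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)

# 4. Built-in vectorized operation on whole columns.
vector_sum = df["a"] + df["b"]

# 5. The same operation on the underlying NumPy arrays.
array_sum = df["a"].to_numpy() + df["b"].to_numpy()

print(vector_sum.tolist())
```

All five variants produce the same numbers; they differ only in how much work is done in the Python interpreter versus in optimized library code.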

### Iteration over rows

In [31]:
for i, row in enumerate(movie.iterrows()):
    print(i, row, sep='\n\n')
    if i > 2:
        break

0

(0, color                             Color
director_name             James Cameron
num_critic_for_reviews              723
duration                            178
                              ...      
actor_2_facebook_likes              936
imdb_score                          7.9
aspect_ratio                       1.78
movie_facebook_likes              33000
Name: 0, Length: 28, dtype: object)
1

(1, color                              Color
director_name             Gore Verbinski
num_critic_for_reviews               302
duration                             169
                               ...      
actor_2_facebook_likes              5000
imdb_score                           7.1
aspect_ratio                        2.35
movie_facebook_likes                   0
Name: 1, Length: 28, dtype: object)
2

(2, color                          Color
director_name             Sam Mendes
num_critic_for_reviews           602
duration                         148
                             .
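
The output above is printed as tuples because `iterrows()` yields `(index, row)` pairs, where each row is a Series. A small sketch on a made-up two-row DataFrame (the `title` and `score` columns are hypothetical) shows how the pair is usually unpacked, and how `itertuples()` offers a similar loop that returns namedtuples, which the pandas documentation describes as generally faster than `iterrows()`.

```python
import pandas as pd

# Hypothetical toy data, not taken from the movie dataset.
df = pd.DataFrame({"title": ["A", "B"], "score": [7.9, 7.1]})

# iterrows() yields (index, Series) pairs, so unpack them in the loop header.
rows = []
for idx, row in df.iterrows():
    rows.append((idx, row["title"], row["score"]))

# itertuples() yields one namedtuple per row; the index is available as .Index.
records = []
for record in df.itertuples(index=True):
    records.append((record.Index, record.title, record.score))

print(rows)
print(records)
```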

### apply

In [44]:
# Define a z-score (standardization) function and apply it to some numeric columns
def norm(column):
    return (column - np.mean(column)) / np.std(column)

movie[['budget', 'movie_facebook_likes', 'num_critic_for_reviews']].apply(norm, axis=0)

Unnamed: 0,budget,movie_facebook_likes,num_critic_for_reviews
0,1.999898,1.335744,4.865887
1,2.628444,-0.382643,1.364178
2,2.079713,4.043504,3.859457
3,2.129598,8.157217,5.614471
...,...,...,...
4912,,1.283671,-0.790079
4913,-0.364617,-0.381810,-1.039607
4914,,-0.348275,-1.031290
4915,-0.364620,-0.358898,-0.790079
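
The `axis` argument decides what the function receives: with `axis=0` it is called once per column, with `axis=1` once per row. The following sketch illustrates the difference on a small made-up DataFrame (columns `x` and `y` are hypothetical).

```python
import pandas as pd

# Hypothetical toy data for illustrating the axis argument.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 6.0, 8.0]})

def value_range(values):
    """Return the spread (max minus min) of a Series."""
    return values.max() - values.min()

# axis=0: the function sees each column; the result is indexed by column name.
col_range = df.apply(value_range, axis=0)

# axis=1: the function sees each row; the result is indexed by row label.
row_range = df.apply(value_range, axis=1)

print(col_range)
print(row_range)
```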


### applymap

In [41]:
# Apply a rather contrived function to every cell: append the length of the cell's string form
movie.applymap(lambda x: str(x) + str(len(str(x))))

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color5,James Cameron13,723.05,178.05,0.03,855.05,Joel David Moore16,1000.06,760505847.011,Action|Adventure|Fantasy|Sci-Fi31,...,3054.06,English7,USA3,PG-135,237000000.011,2009.06,936.05,7.93,1.784,330005
1,Color5,Gore Verbinski14,302.05,169.05,563.05,1000.06,Orlando Bloom13,40000.07,309404152.011,Action|Adventure|Fantasy24,...,1238.06,English7,USA3,PG-135,300000000.011,2007.06,5000.06,7.13,2.354,01
2,Color5,Sam Mendes10,602.05,148.05,0.03,161.05,Rory Kinnear12,11000.07,200074175.011,Action|Adventure|Thriller25,...,994.05,English7,UK2,PG-135,245000000.011,2015.06,393.05,6.83,2.354,850005
3,Color5,Christopher Nolan17,813.05,164.05,22000.07,23000.07,Christian Bale14,27000.07,448130642.011,Action|Thriller15,...,2701.06,English7,USA3,PG-135,250000000.011,2012.06,23000.07,8.53,2.354,1640006
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4912,Color5,nan3,43.04,43.04,nan3,319.05,Valorie Curry13,841.05,nan3,Crime|Drama|Mystery|Thriller28,...,359.05,English7,USA3,TV-145,nan3,nan3,593.05,7.53,16.04,320005
4913,Color5,Benjamin Roberds16,13.04,76.04,0.03,0.03,Maxwell Moody13,0.03,nan3,Drama|Horror|Thriller21,...,3.03,English7,USA3,nan3,1400.06,2013.06,0.03,6.33,nan3,162
4914,Color5,Daniel Hsia11,14.04,100.05,0.03,489.05,Daniel Henney13,946.05,10443.07,Comedy|Drama|Romance20,...,9.03,English7,USA3,PG-135,nan3,2012.06,719.05,6.33,2.354,6603
4915,Color5,Jon Gunn8,43.04,90.04,16.04,16.04,Brian Herzlinger16,86.04,85222.07,Documentary11,...,84.04,English7,USA3,PG2,1100.06,2004.06,23.04,6.63,1.854,4563
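
Since pandas 2.1 the element-wise method is called `DataFrame.map()`, and `applymap()` is deprecated. The sketch below, on a made-up DataFrame, picks whichever of the two is available; the helper name `tag_with_length` and the sample data are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the movie dataset.
df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [7.5, 10.0]})

def tag_with_length(value):
    """Append the length of the value's string form, as in the cell above."""
    text = str(value)
    return text + str(len(text))

# Prefer DataFrame.map (pandas >= 2.1); fall back to applymap on older versions.
elementwise = df.map if hasattr(df, "map") else df.applymap
tagged = elementwise(tag_with_length)

print(tagged)
```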


### Direct functions usage

In [43]:
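# Same z-score standardization as in the apply example, written with built-in methods.
# The values differ slightly because DataFrame.std() defaults to the sample estimate (ddof=1),
# whereas np.std defaults to ddof=0.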
budget_likes_reviews = movie[['budget', 'movie_facebook_likes', 'num_critic_for_reviews']]

(budget_likes_reviews - budget_likes_reviews.mean()) / budget_likes_reviews.std()

Unnamed: 0,budget,movie_facebook_likes,num_critic_for_reviews
0,1.999672,1.335608,4.865387
1,2.628147,-0.382604,1.364038
2,2.079479,4.043093,3.859061
3,2.129358,8.156387,5.613894
...,...,...,...
4912,,1.283541,-0.789998
4913,-0.364576,-0.381771,-1.039501
4914,,-0.348240,-1.031184
4915,-0.364579,-0.358861,-0.789998
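
The small differences between this table and the `apply` result come from the default degrees of freedom: `np.std` divides by `n` (`ddof=0`), while the pandas `std()` method divides by `n - 1` (`ddof=1`). A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: mean is 2.5, squared deviations sum to 5.0.
values = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_std = values.std()                 # pandas default, ddof=1: sqrt(5 / 3)
population_std = values.std(ddof=0)       # same convention as NumPy: sqrt(5 / 4)
numpy_std = np.std(values.to_numpy())     # NumPy default, ddof=0

# Passing ddof=0 makes the pandas z-scores match the NumPy-based ones.
z_pandas = (values - values.mean()) / values.std(ddof=0)
z_numpy = (values.to_numpy() - values.to_numpy().mean()) / numpy_std

print(sample_std, population_std, numpy_std)
```

If the two approaches need to agree exactly, pass the same `ddof` to both.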


### Built-in methods: passing them to apply vs. calling them directly
If you can call a built-in method directly, do so instead of passing it to `apply()`. In the timings below, the direct call is roughly twice as fast.

In [23]:
%timeit movie[['duration', 'budget']].apply(np.cumsum)

2.36 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [24]:
%timeit movie[['duration', 'budget']].cumsum()

1.04 ms ± 77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
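
Outside a notebook, the same comparison can be sketched with the standard `timeit` module. The example below uses randomly generated data in place of `data/movie.csv` (an assumption, so the absolute numbers will not match those above) and also times the array-based variant from the list at the top.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the two movie columns; values are random, not real data.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5000, 2)), columns=["duration", "budget"])

# The three variants compute the same cumulative sums.
via_apply = df.apply(np.cumsum)
via_method = df.cumsum()
via_numpy = np.cumsum(df.to_numpy(), axis=0)

runs = 20
timings = {
    "apply(np.cumsum)": timeit.timeit(lambda: df.apply(np.cumsum), number=runs),
    "DataFrame.cumsum()": timeit.timeit(lambda: df.cumsum(), number=runs),
    "np.cumsum on array": timeit.timeit(
        lambda: np.cumsum(df.to_numpy(), axis=0), number=runs
    ),
}

for label, seconds in timings.items():
    print(f"{label}: {seconds / runs * 1e3:.3f} ms per call")
```

Timings depend on the machine, the pandas version, and the data size, so treat the printed numbers as indicative only.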
