### Pandas Apply Lambda

Always remember the Zen of Python!!!

In [1]:
import pandas as pd

In [5]:
df_movie=pd.read_csv("../data/input/IMDB-Movie-Data.csv")
df_movie.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


In [2]:
# import this

# **Challenge 1. Using a single argument**

We want to group movies into **bins** according to the number of votes they have received. To do that, we will create a new column named **'Bin'** that tags every movie as follows:
- From 0 to 999 ==> 'cat_1'
- From 1000 to 9999 ==> 'cat_2'
- From 10000 to 99999 ==> 'cat_3'
- From 100000 to 999999 ==> 'cat_4'
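- 1000000 or more ==> 'cat_5'

In [17]:
def categorize(votes):
    # Map a vote count to its bin label.
    # Anything outside the ranges (negative or missing values) gets "non_cat".
    if 0 <= votes <= 999:
        return "cat_1"
    elif 1000 <= votes <= 9999:
        return "cat_2"
    elif 10000 <= votes <= 99999:
        return "cat_3"
    elif 100000 <= votes <= 999999:
        return "cat_4"
    elif votes >= 1000000:  # >= so that exactly 1000000 is not left uncategorized
        return "cat_5"
    else:
        return "non_cat"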

In [18]:
categorize(-1)

'non_cat'

In [21]:
df_movie["Votes"].apply(categorize).unique()

array(['cat_4', 'cat_3', 'cat_2', 'cat_1', 'cat_5'], dtype=object)
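As an alternative to a hand-written function, the same binning can be sketched with `pd.cut`, which assigns each value to an interval. This is only a minimal sketch: the `votes` Series below is a small made-up sample (not read from the CSV), and it assumes that exactly 1000000 votes belongs in 'cat_5'.

```python
import pandas as pd

# Illustrative vote counts (hypothetical sample, not loaded from the CSV)
votes = pd.Series([0, 999, 1000, 60545, 757074, 1000000, 1791916])

# Left-closed bin edges: [0, 1000), [1000, 10000), ..., [1000000, inf)
edges = [0, 1_000, 10_000, 100_000, 1_000_000, float("inf")]
labels = ["cat_1", "cat_2", "cat_3", "cat_4", "cat_5"]

# right=False makes each interval include its left edge and exclude its right edge
bins = pd.cut(votes, bins=edges, labels=labels, right=False)
print(bins.tolist())
```

Unlike `categorize`, values that fall outside every interval (for example, negative counts) come back as `NaN` rather than 'non_cat', and the result is a categorical column.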

In [47]:
df_movie["Bin"]=df_movie["Votes"].apply(categorize)
df_movie.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,Bin
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,cat_4
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,cat_4
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,cat_4
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0,cat_3
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,cat_4


# **Challenge 2. Using two arguments**

We want to compute the revenue per minute of runtime for every movie.

In [37]:
# Check whether any movie title is repeated in the DataFrame
n_unique_titles = len(df_movie["Title"].unique())  # unique() returns an array
print(n_unique_titles)
print(len(df_movie))

999
1000


In [41]:
# It looks like there is one duplicate; let's locate it to decide what to do:
serie_check = df_movie["Title"].duplicated()
# The duplicate is in row 632. We stop here because we will use the simplest method, without grouping.
serie_check[serie_check].index[0]

632
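`duplicated()` flags only the later occurrence of a repeated title. To inspect every row that shares a title, one option is `keep=False`. The sketch below uses a tiny hypothetical DataFrame (`df` and its values are made up for illustration), not the movie data.

```python
import pandas as pd

# Tiny hypothetical frame with one repeated title
df = pd.DataFrame({
    "Title": ["Alpha", "Beta", "Alpha", "Gamma"],
    "Year": [2010, 2012, 2016, 2014],
})

# keep=False marks every occurrence of a repeated title, not just the later ones
dupes = df[df["Title"].duplicated(keep=False)]
print(dupes)
```

Seeing both rows side by side makes it easier to tell whether they are true duplicates or two different movies that share a title.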

In [44]:
df_movie.apply(lambda row: row["Revenue (Millions)"]/row["Runtime (Minutes)"], axis=1)

0      2.753140
1      1.019839
2      1.180513
3      2.502963
4      2.642439
         ...   
995         NaN
996    0.186596
997    0.591939
998         NaN
999    0.225747
Length: 1000, dtype: float64
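The `NaN` values appear where one of the two inputs is missing. For a simple ratio like this, dividing the two columns directly should give the same result as the row-wise `apply`, and is usually faster. A minimal sketch on a small made-up frame (the third row is a hypothetical movie with no revenue figure):

```python
import pandas as pd

# Small illustrative frame; None stands in for a missing revenue value
df = pd.DataFrame({
    "Revenue (Millions)": [333.13, 126.46, None],
    "Runtime (Minutes)": [121, 124, 100],
})

# Vectorized division: operates on whole columns, NaN propagates automatically
per_minute = df["Revenue (Millions)"] / df["Runtime (Minutes)"]

# Row-wise version, as in the cell above, for comparison
via_apply = df.apply(
    lambda row: row["Revenue (Millions)"] / row["Runtime (Minutes)"], axis=1
)
print(per_minute)
```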

# **Challenge 3. A bit more complicated**

We want to create a __new rating__ that adds 1 point if the movie's genres include Thriller and subtracts 1 point if they include Comedy. The code below checks Thriller first, so a movie tagged with both genres gets +1.

In [53]:
df_movie.apply(
    lambda row: row["Rating"] + 1 if "Thriller" in row["Genre"]
    else (row["Rating"] - 1 if "Comedy" in row["Genre"] else row["Rating"]),
    axis=1,
)

0      8.1
1      7.0
2      8.3
3      6.2
4      6.2
      ... 
995    6.2
996    5.5
997    6.2
998    4.6
999    4.3
Length: 1000, dtype: float64
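Nested conditional expressions inside a lambda get hard to read quickly. One possible alternative is `numpy.select`, which takes a list of conditions and a matching list of results and uses the first condition that is true. The sketch below runs on a small made-up frame; the "Comedy,Thriller" row and the `New Rating` column name are hypothetical, added only to show the precedence.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the last row is a hypothetical movie with both genres
df = pd.DataFrame({
    "Genre": [
        "Action,Adventure,Sci-Fi",
        "Horror,Thriller",
        "Animation,Comedy,Family",
        "Comedy,Thriller",
    ],
    "Rating": [8.1, 7.3, 7.2, 6.0],
})

is_thriller = df["Genre"].str.contains("Thriller")
is_comedy = df["Genre"].str.contains("Comedy")

# np.select picks the first matching condition, so Thriller wins over Comedy
df["New Rating"] = np.select(
    [is_thriller, is_comedy],
    [df["Rating"] + 1, df["Rating"] - 1],
    default=df["Rating"],
)
print(df)
```

The order of the condition list plays the same role as the order of the `if`/`else` branches in the lambda.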