## Chaining DataFrame methods together

- Whether you believe chaining is a good practice or not, it is quite common to encounter it during data analysis with pandas
- One of the keys to method chaining is to know the exact object being returned during each step of the chain
- In pandas, this will nearly always be a DataFrame, Series, or scalar value
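
- As a minimal sketch of tracking the returned object at each step, the snippet below splits a chain into named intermediates. The `toy` DataFrame and the `step1` to `step3` names are hypothetical and only stand in for the movie dataset used in the rest of this section.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the movie dataset
toy = pd.DataFrame({
    'title': ['A', 'B', None],
    'gross': [1.5, np.nan, np.nan],
})

step1 = toy.isnull()   # DataFrame of booleans, same shape as toy
step2 = step1.sum()    # Series: one count per column
step3 = step2.sum()    # scalar: total count for the whole DataFrame

# Collect the type name of each intermediate object
chain_types = [type(obj).__name__ for obj in (step1, step2, step3)]
print(chain_types)
```

- Checking `type` on each intermediate result like this is a quick way to find out which methods are available for the next link in the chain.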

In [2]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 40

- To get a count of the missing values, the `isnull` method must first be called to change each DataFrame value to a boolean

In [3]:
movie = pd.read_csv('data/movie.csv')
movie.isnull().head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,True,False,True,True,False,True,False,False,True,False,False,False,False,False,True,False,True,False,True,True,True,True,True,True,False,False,True,False


- Next, chain the `sum` method, which interprets `True`/`False` as 1/0 and therefore counts the missing values in each column

In [4]:
movie.isnull().sum().head()

color                       19
director_name              102
num_critic_for_reviews      49
duration                    15
director_facebook_likes    102
dtype: int64
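
- The 1/0 interpretation of booleans can be seen in isolation on a small, hypothetical boolean DataFrame (`mask` is an illustrative name, not part of the movie dataset):

```python
import pandas as pd

# Hypothetical boolean DataFrame, shaped like the result of isnull()
mask = pd.DataFrame({
    'a': [True, False, True],
    'b': [False, False, True],
})

# sum works down each column by default; True counts as 1, False as 0
per_column = mask.sum()
print(per_column)
```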

- Going one step further, summing this Series returns the total number of missing values in the entire DataFrame as a scalar value

In [5]:
movie.isnull().sum().sum()

2654

- A slight variation is to check whether the DataFrame contains any missing values at all by chaining the `any` method twice

In [7]:
movie.isnull().any().any()

True
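
- A small sketch, using a hypothetical `toy` DataFrame, suggests that `any().any()` gives the same answer as testing whether the total count from `sum().sum()` is greater than zero:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column x has one missing value, column y has none
toy = pd.DataFrame({'x': [1.0, np.nan], 'y': [3.0, 4.0]})

has_missing = toy.isnull().any().any()        # True: at least one NaN
same_answer = toy.isnull().sum().sum() > 0    # the count-based equivalent

# Restricting to the complete column flips the answer
y_has_missing = toy[['y']].isnull().any().any()
print(has_missing, same_answer, y_has_missing)
```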

## How it works...

- The `isnull` method returns a DataFrame the same size as the calling DataFrame but with all values transformed to booleans
- As booleans evaluate numerically as 0/1, they can be summed column by column, as `movie.isnull().sum()` does
- In `movie.isnull().any().any()`, the first `any` returns a Series of booleans indicating whether each column contains at least one `True`; the second `any` reduces that Series to a single boolean

In [8]:
# get_dtype_counts was removed in pandas 1.0; dtypes.value_counts() is its replacement
movie.isnull().dtypes.value_counts()

bool    28
dtype: int64
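
- The same points can be checked on a small, hypothetical `toy` DataFrame: the result of `isnull` keeps the original shape, every column becomes boolean, and `any` yields one boolean per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with mixed column types
toy = pd.DataFrame({
    'title': ['A', None],
    'gross': [np.nan, 2.0],
    'year': [2001, 2002],
})

mask = toy.isnull()

same_shape = mask.shape == toy.shape     # isnull preserves the shape
all_bool = (mask.dtypes == bool).all()   # every column is now boolean
per_column_any = mask.any()              # Series: one boolean per column
print(per_column_any)
```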

## There's more...

- Most of the columns in the movie dataset with object data type contain missing values
- In the pandas version used here, the aggregation methods `min`, `max`, and `sum` silently return nothing for such columns by default, as seen in the following code snippet; pandas 2.0 and later raise a `TypeError` instead

In [9]:
movie[['color', 'movie_title', 'color']].max()

Series([], dtype: float64)

- To force pandas to return something for each column, we must fill in the missing values

In [10]:
movie.select_dtypes(['object']).fillna('').min()

color                                                               
director_name                                                       
actor_2_name                                                        
genres                                                        Action
actor_1_name                                                        
movie_title                                                  #Horror
actor_3_name                                                        
plot_keywords                                                       
movie_imdb_link    http://www.imdb.com/title/tt0006864/?ref_=fn_t...
language                                                            
country                                                             
content_rating                                                      
dtype: object
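
- The blank entries in this output are empty strings: once missing values are filled with `''`, that empty string sorts before every other string and becomes the column minimum. A minimal sketch on hypothetical data (the `toy` DataFrame is illustrative, with `dtype=object` set explicitly so that `select_dtypes` picks up both columns):

```python
import pandas as pd

# Hypothetical toy data; color has a missing value, genres does not
toy = pd.DataFrame(
    {
        'color': ['Color', None, 'Black and White'],
        'genres': ['Drama', 'Action', 'Comedy'],
    },
    dtype=object,
)

mins = toy.select_dtypes(['object']).fillna('').min()

# color had a missing value, so its minimum is the empty string;
# genres is complete, so its minimum is a real value
print(repr(mins['color']), repr(mins['genres']))
```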

- For readability, method chains are often written with one method call per line, using a backslash at the end of each line to escape the newline
- This makes it easier to see what is returned at each step of the chain; note that Python does not allow a comment after a backslash, so per-step comments require wrapping the chain in parentheses instead

In [11]:
# rewrite the above chain on multiple lines
movie.select_dtypes(['object']) \
     .fillna('') \
     .min()

color                                                               
director_name                                                       
actor_2_name                                                        
genres                                                        Action
actor_1_name                                                        
movie_title                                                  #Horror
actor_3_name                                                        
plot_keywords                                                       
movie_imdb_link    http://www.imdb.com/title/tt0006864/?ref_=fn_t...
language                                                            
country                                                             
content_rating                                                      
dtype: object
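
- An alternative to backslashes is to wrap the whole chain in parentheses, which lets each line carry its own comment. The sketch below uses a hypothetical `toy` DataFrame rather than the movie dataset, and `smallest` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy data; dtype=object keeps both columns as object dtype
toy = pd.DataFrame(
    {'color': ['Color', None], 'language': ['English', 'French']},
    dtype=object,
)

smallest = (
    toy.select_dtypes(['object'])  # DataFrame: object columns only
       .fillna('')                 # DataFrame: missing values become ''
       .min()                      # Series: smallest string per column
)
print(smallest)
```

- The parenthesized form returns the same result as the single-line chain; only the layout differs.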