## Selecting columns with methods

- Column selection is usually done directly with the indexing operator, but some DataFrame methods offer an alternative way to select columns
- `select_dtypes` and `filter` are two useful methods for this

In [2]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 40

- Read in the movie dataset
- Use the title of the movie to label each row
- Use the `get_dtype_counts` method (available in pandas versions before 1.0) to output the number of columns with each specific data type:

In [3]:
movie = pd.read_csv('data/movie.csv', index_col='movie_title')
movie.get_dtype_counts()

float64    13
int64       3
object     11
dtype: int64
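
- On pandas 1.0 and later, where `get_dtype_counts` is no longer available, the same tally can be sketched by counting the values of the `dtypes` attribute. The small DataFrame `df` below is a hypothetical stand-in for the movie data, built from a few values shown in this recipe:

```python
import pandas as pd

# Hypothetical stand-in for the movie dataset (not the real file).
df = pd.DataFrame({
    'duration': [178.0, 169.0],
    'imdb_score': [7.9, 7.1],
    'num_voted_users': [886204, 471220],
    'actor_1_name': ['CCH Pounder', 'Johnny Depp'],
})

# dtypes is a Series mapping each column to its dtype;
# value_counts tallies how many columns share each dtype.
dtype_counts = df.dtypes.value_counts()
print(dtype_counts)
```

- Unlike `get_dtype_counts`, `value_counts` orders the result by frequency rather than by data type name.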

- Use the `select_dtypes` method to select only the integer columns:

In [8]:
movie.select_dtypes(include=['int64']).head()

| movie_title | num_voted_users | cast_total_facebook_likes | movie_facebook_likes |
|---|---|---|---|
| Avatar | 886204 | 4834 | 33000 |
| Pirates of the Caribbean: At World's End | 471220 | 48350 | 0 |
| Spectre | 275868 | 11700 | 85000 |
| The Dark Knight Rises | 1144337 | 106759 | 164000 |
| Star Wars: Episode VII - The Force Awakens | 8 | 143 | 0 |


- To select all the numeric columns, pass the string *number* to the `include` parameter:

In [9]:
movie.select_dtypes(include=['number']).head()

| movie_title | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_1_facebook_likes | gross | num_voted_users | cast_total_facebook_likes | facenumber_in_poster | num_user_for_reviews | budget | title_year | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Avatar | 723.0 | 178.0 | 0.0 | 855.0 | 1000.0 | 760505847.0 | 886204 | 4834 | 0.0 | 3054.0 | 237000000.0 | 2009.0 | 936.0 | 7.9 | 1.78 | 33000 |
| Pirates of the Caribbean: At World's End | 302.0 | 169.0 | 563.0 | 1000.0 | 40000.0 | 309404152.0 | 471220 | 48350 | 0.0 | 1238.0 | 300000000.0 | 2007.0 | 5000.0 | 7.1 | 2.35 | 0 |
| Spectre | 602.0 | 148.0 | 0.0 | 161.0 | 11000.0 | 200074175.0 | 275868 | 11700 | 1.0 | 994.0 | 245000000.0 | 2015.0 | 393.0 | 6.8 | 2.35 | 85000 |
| The Dark Knight Rises | 813.0 | 164.0 | 22000.0 | 23000.0 | 27000.0 | 448130642.0 | 1144337 | 106759 | 0.0 | 2701.0 | 250000000.0 | 2012.0 | 23000.0 | 8.5 | 2.35 | 164000 |
| Star Wars: Episode VII - The Force Awakens | NaN | NaN | 131.0 | NaN | 131.0 | NaN | 8 | 143 | 0.0 | NaN | NaN | NaN | 12.0 | 7.1 | NaN | 0 |
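
- `select_dtypes` also has an `exclude` parameter, which can be used alone or together with `include`. The following is a minimal sketch on a hypothetical three-column DataFrame `df` (a stand-in for the movie data, using values shown above):

```python
import pandas as pd

# Hypothetical stand-in for the movie dataset (not the real file).
df = pd.DataFrame({
    'duration': [178.0, 169.0],
    'num_voted_users': [886204, 471220],
    'actor_1_name': ['CCH Pounder', 'Johnny Depp'],
})

# Keep every numeric column (floats and integers alike).
numeric = df.select_dtypes(include='number')

# Keep everything that is NOT numeric.
non_numeric = df.select_dtypes(exclude='number')

# Combine both: numeric columns, minus the float64 ones.
ints_only = df.select_dtypes(include='number', exclude='float64')

print(list(numeric.columns))
print(list(non_numeric.columns))
print(list(ints_only.columns))
```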


- Another way to select columns is the `filter` method
- It searches the column names by default (or the index labels when `axis=0` is passed), using whichever of its search parameters is supplied
- Here, we use the `like` parameter to search for all column names that contain the exact string, *facebook*:

In [10]:
movie.filter(like='facebook').head()

| movie_title | director_facebook_likes | actor_3_facebook_likes | actor_1_facebook_likes | cast_total_facebook_likes | actor_2_facebook_likes | movie_facebook_likes |
|---|---|---|---|---|---|---|
| Avatar | 0.0 | 855.0 | 1000.0 | 4834 | 936.0 | 33000 |
| Pirates of the Caribbean: At World's End | 563.0 | 1000.0 | 40000.0 | 48350 | 5000.0 | 0 |
| Spectre | 0.0 | 161.0 | 11000.0 | 11700 | 393.0 | 85000 |
| The Dark Knight Rises | 22000.0 | 23000.0 | 27000.0 | 106759 | 23000.0 | 164000 |
| Star Wars: Episode VII - The Force Awakens | 131.0 | NaN | 131.0 | 143 | 12.0 | 0 |
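
- The `like` search is a plain substring match on the labels, and the `axis` parameter decides which labels are searched. A small sketch, assuming a hypothetical two-row DataFrame `df` indexed by movie title:

```python
import pandas as pd

# Hypothetical stand-in for the movie dataset (not the real file).
df = pd.DataFrame(
    {
        'director_facebook_likes': [0.0, 131.0],
        'actor_1_name': ['CCH Pounder', 'Doug Walker'],
        'movie_facebook_likes': [33000, 0],
    },
    index=['Avatar', 'Star Wars: Episode VII - The Force Awakens'],
)

# Default axis for a DataFrame: search the column names.
fb_cols = df.filter(like='facebook')

# axis=0: search the index labels (movie titles) instead.
star_rows = df.filter(like='Star', axis=0)

print(list(fb_cols.columns))
print(list(star_rows.index))
```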


- The `filter` method allows columns to be searched through regular expressions with the `regex` parameter
- Here, we search for all columns that have a digit somewhere in their name:

In [11]:
movie.filter(regex=r'\d').head()

| movie_title | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | actor_1_name | actor_3_name | actor_2_facebook_likes |
|---|---|---|---|---|---|---|
| Avatar | 855.0 | Joel David Moore | 1000.0 | CCH Pounder | Wes Studi | 936.0 |
| Pirates of the Caribbean: At World's End | 1000.0 | Orlando Bloom | 40000.0 | Johnny Depp | Jack Davenport | 5000.0 |
| Spectre | 161.0 | Rory Kinnear | 11000.0 | Christoph Waltz | Stephanie Sigman | 393.0 |
| The Dark Knight Rises | 23000.0 | Christian Bale | 27000.0 | Tom Hardy | Joseph Gordon-Levitt | 23000.0 |
| Star Wars: Episode VII - The Force Awakens | NaN | Rob Walker | 131.0 | Doug Walker | NaN | 12.0 |
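
- Because the pattern is matched anywhere in the name, anchors such as `^` and `$` are needed to pin it to the start or end. A minimal sketch on a hypothetical DataFrame `df` (a stand-in for the movie data):

```python
import pandas as pd

# Hypothetical stand-in for the movie dataset (not the real file).
df = pd.DataFrame({
    'actor_1_name': ['CCH Pounder', 'Christoph Waltz'],
    'actor_1_facebook_likes': [1000.0, 11000.0],
    'movie_facebook_likes': [33000, 85000],
})

# Unanchored: any column with a digit somewhere in its name.
has_digit = df.filter(regex=r'\d')

# Anchored: names that start with "actor_" AND end with "_likes".
actor_likes = df.filter(regex=r'^actor_.*_likes$')

print(list(has_digit.columns))
print(list(actor_likes.columns))
```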


## How it works...

- In step 1, the `get_dtype_counts` method lists how many columns have each data type
- Alternatively, you may use the `dtypes` attribute to get the exact data type for each column.
- The `select_dtypes` method takes a list of data types in its `include` parameter and returns a DataFrame with columns of just those given data types

- The `filter` method selects columns by only inspecting the column names and not the actual data values
- It has three mutually exclusive parameters, `items`, `like`, and `regex`; only one of them can be used in a single call
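
- The mutual exclusivity is enforced at call time: supplying more than one of the three parameters raises a `TypeError`. A short sketch, using a hypothetical one-row DataFrame `df`:

```python
import pandas as pd

# Hypothetical stand-in for the movie dataset (not the real file).
df = pd.DataFrame({
    'actor_1_name': ['CCH Pounder'],
    'movie_facebook_likes': [33000],
})

error_message = None
try:
    # Two search parameters at once is not allowed.
    df.filter(like='facebook', regex=r'\d')
except TypeError as err:
    error_message = str(err)

print(error_message)
```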

## There's more...

- The `filter` method comes with another parameter, `items`, which takes a list of exact column names
- This behaves almost exactly like the indexing operator, except that no `KeyError` is raised when one of the strings does not match a column name; unmatched names are silently dropped

In [12]:
movie.filter(items=['actor_1_name', 'asdf']).head()

| movie_title | actor_1_name |
|---|---|
| Avatar | CCH Pounder |
| Pirates of the Caribbean: At World's End | Johnny Depp |
| Spectre | Christoph Waltz |
| The Dark Knight Rises | Tom Hardy |
| Star Wars: Episode VII - The Force Awakens | Doug Walker |
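
- The contrast with the indexing operator can be sketched side by side. The DataFrame `df` and the list `wanted` below are hypothetical stand-ins, and the `KeyError` from list indexing assumes pandas 1.0 or later:

```python
import pandas as pd

# Hypothetical stand-in for the movie dataset (not the real file).
df = pd.DataFrame({'actor_1_name': ['CCH Pounder', 'Johnny Depp']})
wanted = ['actor_1_name', 'asdf']  # 'asdf' is not a column

# filter keeps the names that exist and ignores the rest.
kept = df.filter(items=wanted)

# The indexing operator refuses the same list.
try:
    df[wanted]
    raised = False
except KeyError:
    raised = True

print(list(kept.columns), raised)
```

- This makes `filter(items=...)` convenient when the list of names comes from elsewhere and some columns may be absent, at the cost of hiding misspelled names.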
