# How to select column types in Pandas DataFrame

---

In this notebook, I want to show you a few convenient ways to check and select column types in a Pandas DataFrame.

In [1]:
import pandas as pd

Let's create some artificial data as a dictionary and then convert it to a Pandas DataFrame object.

In [2]:
data = {"Id":[1, 2, 3, 4, 5, 6], "Name":["John", "Alex", "Barbara", "Jane", "James", "Emma"],
        "Age":[25, 33, 52, 41, 30, 40], "Date_of_Birth":["1995/10/25", "1987/8/31", "1968/5/6", "1979/12/12", "1990/4/20", "1980/1/1"],
        "Salary":[15500.65, 95420.6, 254287.5, 55000.0, 78942.47, 122500.2]}

In [3]:
df = pd.DataFrame(data)

To check the column types, we can use the Pandas ```.dtypes``` attribute.

In [4]:
df.dtypes

Id                 int64
Name              object
Age                int64
Date_of_Birth     object
Salary           float64
dtype: object

We see three different column types. They are:

* ```int64``` - Integer numbers
* ```float64``` - Floating-point numbers
* ```object``` - Strings, or any column with mixed types
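
The ```object``` case can be checked directly. Here is a minimal sketch, assuming a small made-up Series (```mixed``` is a hypothetical name, not part of the data above), showing that a column holding several Python types is stored as ```object```:

```python
import pandas as pd

# Hypothetical column mixing an integer, a string, and a float
mixed = pd.Series([1, "two", 3.0])

# No single numeric dtype fits these values, so pandas stores them as object
print(mixed.dtype)

# Each element keeps its original Python type
element_types = mixed.map(type).tolist()
print(element_types)
```

This is why an ```object``` column deserves a second look: it may hold clean strings, or it may hide values of several different types.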

Also, note that ```Date_of_Birth``` holds dates, yet it is stored as ```object```, as shown above. We need to convert it to a ```datetime``` type.

In [5]:
# Convert "Date_of_Birth" into datetime object

df['Date_of_Birth'] = pd.to_datetime(df['Date_of_Birth'])

In [6]:
df.dtypes

Id                        int64
Name                     object
Age                       int64
Date_of_Birth    datetime64[ns]
Salary                  float64
dtype: object

From the above output, we see that we now have one new type: ```datetime64[ns]```.
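
Real data often contains values that are not valid dates. As a hedged sketch (the ```raw``` Series and its bad entry are made up for illustration), passing an explicit ```format``` together with ```errors="coerce"``` to ```pd.to_datetime``` turns unparseable values into ```NaT``` instead of raising an error:

```python
import pandas as pd

# Hypothetical input: one valid date and one value that cannot be parsed
raw = pd.Series(["1995/10/25", "not a date"])

# format states the expected layout; errors="coerce" maps failures to NaT
parsed = pd.to_datetime(raw, format="%Y/%m/%d", errors="coerce")
print(parsed)

# Flag the rows that failed to parse so they can be inspected later
is_missing = parsed.isna().tolist()
print(is_missing)
```

Whether to coerce or to let the conversion fail depends on the data: coercing keeps the pipeline running, but the resulting ```NaT``` values should be counted and reviewed.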

The ```.info()``` method gives much richer output about our DataFrame. We can also use this method to check the column types.

In [7]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
Id               6 non-null int64
Name             6 non-null object
Age              6 non-null int64
Date_of_Birth    6 non-null datetime64[ns]
Salary           6 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 368.0+ bytes


Now, imagine a situation where you have thousands of columns. Scanning the output of the ```.dtypes``` attribute or the ```.info()``` method for every column type is hard and tedious. To ease the search, we can combine the Pandas ```.select_dtypes()``` method with the ```.columns``` attribute and the ```.to_list()``` method.

In [8]:
print(df.select_dtypes(include='object').columns.to_list())

print(df.select_dtypes(include="int64").columns.to_list())

print(df.select_dtypes(include="float64").columns.to_list())

print(df.select_dtypes(include="datetime64[ns]").columns.to_list())

['Name']
['Id', 'Age']
['Salary']
['Date_of_Birth']
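
Listing each exact dtype is not always necessary. The sketch below rebuilds a smaller version of the DataFrame so that it runs on its own (the ISO-formatted dates are a simplification), and uses the generic selectors ```"number"``` and ```"datetime"``` as well as the ```exclude``` argument of ```.select_dtypes()```:

```python
import pandas as pd

# Reduced copy of the example data so this block is self-contained
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Name": ["John", "Alex", "Barbara"],
    "Age": [25, 33, 52],
    "Date_of_Birth": pd.to_datetime(["1995-10-25", "1987-08-31", "1968-05-06"]),
    "Salary": [15500.65, 95420.6, 254287.5],
})

# "number" matches every numeric dtype (integers and floats alike)
numeric_cols = df.select_dtypes(include="number").columns.to_list()

# "datetime" matches datetime64 columns
datetime_cols = df.select_dtypes(include="datetime").columns.to_list()

# exclude keeps everything except the listed dtypes
non_numeric_cols = df.select_dtypes(exclude="number").columns.to_list()

print(numeric_cols)
print(datetime_cols)
print(non_numeric_cols)
```

The generic selectors are handy when a wide table mixes, say, ```int32``` and ```int64``` columns and you want all of them at once.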


We can even write a small function that performs the column type selection.

In [9]:
def select_col_dtypes(df: pd.DataFrame, col_type=None) -> list:
    """
    Args:
        df: Pandas DataFrame
        col_type: Column type or list of column types. Defaults to None,
            which selects every column.

    Returns:
        list of names of the columns that have the selected types
    """
    # Compare against None with "is", not "=="
    if col_type is None:
        return df.columns.to_list()
    return df.select_dtypes(include=col_type).columns.to_list()

In [10]:
print(select_col_dtypes(df, ["float64", "int64"]))

print(select_col_dtypes(df, "datetime64[ns]"))

['Id', 'Age', 'Salary']
['Date_of_Birth']


Let's go even further: create a Pandas Series of the column names, group this Series by column type, and extract the groups.

In [11]:
col_type_dict = df.columns.to_series().groupby(df.dtypes).groups

In [12]:
# Pretty Print
from pprint import pprint as pp


pp(col_type_dict)

{dtype('int64'): Index(['Id', 'Age'], dtype='object'),
 dtype('<M8[ns]'): Index(['Date_of_Birth'], dtype='object'),
 dtype('float64'): Index(['Salary'], dtype='object'),
 dtype('O'): Index(['Name'], dtype='object')}
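
The keys of this mapping are dtype objects, which can be awkward when you want to look columns up by name. One possible alternative, sketched here with a hypothetical ```cols_by_dtype``` dictionary and a reduced copy of the data, is to loop over ```df.dtypes``` and key the column lists by each dtype's string name:

```python
from collections import defaultdict

import pandas as pd

# Reduced copy of the example data so this block is self-contained
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Name": ["John", "Alex", "Barbara"],
    "Age": [25, 33, 52],
    "Salary": [15500.65, 95420.6, 254287.5],
})

# Collect column names under the string name of their dtype
cols_by_dtype = defaultdict(list)
for col, dtype in df.dtypes.items():
    cols_by_dtype[str(dtype)].append(col)

# Convert to a plain dict for printing and lookups
cols_by_dtype = dict(cols_by_dtype)
print(cols_by_dtype)
```

A lookup such as ```cols_by_dtype["int64"]``` then returns an ordinary list of column names. The key under which string columns appear depends on the installed Pandas version, so it is safer to inspect the printed keys than to hard-code them.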
