# Data Types

When transitioning from R to pandas, understanding data types is crucial because pandas handles types differently than R. While R has a relatively simple type system with automatic coercion, pandas requires more explicit type management but offers finer control over memory usage and performance.


In [1]:
import pandas as pd
import numpy as np

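## Overview

The table below maps each common R type to its closest pandas dtype and, where one exists, the NumPy type underneath it.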



| R type | pandas dtype | NumPy type | Example |
|--------|--------------|------------|---------|
| numeric/double | float64 | np.float64 | 3.14 |
| integer | int64 | np.int64 | 42 |
| character | object / string | np.object_ | "text" |
| factor | category | none (pandas extension type) | pd.Categorical(["a", "b"]) |
| logical | bool | np.bool_ | True/False |
| Date | datetime64[ns] | np.datetime64 | 2024-01-01 |
| POSIXct/POSIXlt | datetime64[ns] | np.datetime64 | 2024-01-01 10:30:00 |
| NA | NaN / NaT / pd.NA / None | np.nan, np.datetime64("NaT") | Missing values |

## Key Differences Summary

Here's a quick reference table summarizing the main differences between R and pandas data types:

| Aspect | R | Pandas |
|--------|---|--------|
| Numeric precision | `numeric` (double) only | `float16`, `float32`, `float64` |
| Integer types | `integer` (32-bit) | `int8`, `int16`, `int32`, `int64` |
| String handling | `character` | `object` or `string` dtype |
| Categorical data | `factor` | `category` dtype |
| Missing values | Universal `NA` | `np.nan`, `pd.NA`, `pd.NaT`, `None` |
| Type coercion | Automatic | More explicit control |
| Memory optimization | Limited options | Fine-grained control |

## Numeric Types

In [2]:
# R numeric (double) equivalent
float_series = pd.Series([1.5, 2.7, 3.14])
# Default float type
float_series.dtype

dtype('float64')

In [3]:
# R integer equivalent  
int_series = pd.Series([1, 2, 3])
int_series.dtype

dtype('int64')

In [4]:
# Pandas allows explicit type control for memory efficiency
int8_series = pd.Series([1, 2, 3], dtype='int8')  # Uses less memory
float32_series = pd.Series([1.5, 2.7, 3.14], dtype='float32')
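
To make the memory claim concrete, here is a minimal sketch that stores the same values at two integer widths and compares the size of the underlying buffer. The 1,000-element data set is an arbitrary choice for illustration, and `np.iinfo` is used only to show the range that `int8` can hold.

```python
import numpy as np
import pandas as pd

values = list(range(100)) * 10  # 1,000 small integers (illustrative data)

as_int64 = pd.Series(values, dtype="int64")
as_int8 = pd.Series(values, dtype="int8")

# .nbytes reports the size of the underlying data buffer
print(as_int64.nbytes)  # 8000: 8 bytes per value
print(as_int8.nbytes)   # 1000: 1 byte per value

# Narrower types cover a smaller range, so check it before downcasting
int8_limits = np.iinfo("int8")
print(int8_limits.min, int8_limits.max)  # -128 127
```

The saving only matters for large columns, and it comes at the cost of a narrower value range, so it is worth checking the data's minimum and maximum first.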

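As in R, true division (`/`) always returns floating-point values. The difference lies in how integers arise: in R, `c(1, 2, 3)` is already `numeric` (double) unless you write `c(1L, 2L, 3L)`, whereas pandas infers `int64` from integer literals and keeps that dtype through operations such as floor division (`//`):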

In [5]:
# In R: c(1, 2, 3) / 2 gives numeric (double)
# In pandas, / also converts the int64 Series to float64 automatically
r_style = pd.Series([1, 2, 3])
result = r_style / 2
print(f"Division result type: {result.dtype}")  # float64 - automatically converted

# To maintain integer division when appropriate
int_division = pd.Series([2, 4, 6]) // 2
print(f"Integer division type: {int_division.dtype}")  # int64 - stays integer

Division result type: float64
Integer division type: int64


## String Types

R uses the `character` type for text. pandas traditionally stored text in the generic `object` dtype, which can hold any Python object. The dedicated `string` dtype is more explicit: it accepts only text and represents missing entries as `pd.NA`.

In [9]:
# Traditional pandas way (like R's character)
text_object = pd.Series(['apple', 'banana', 'cherry'])
text_object.dtype 

dtype('O')

In [10]:
# Modern pandas way (more explicit)
text_string = pd.Series(['apple', 'banana', 'cherry'], dtype='string')
text_string.dtype

string[python]

In [13]:
# Practical difference: string dtype shows missing entries as <NA> (pd.NA)
mixed_text = pd.Series(['apple', None, 'cherry'], dtype='string')
print(mixed_text) 

0     apple
1      <NA>
2    cherry
dtype: string
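
As a rough sketch of how the two dtypes differ in practice, the example below applies vectorized `.str` methods to the same data stored both ways. The variable names are illustrative, and `dtype=object` is passed explicitly so the comparison does not depend on the default string inference of the installed pandas version.

```python
import pandas as pd

text_string = pd.Series(["apple", None, "cherry"], dtype="string")
text_object = pd.Series(["apple", None, "cherry"], dtype=object)

# Vectorized string methods live under .str (compare toupper() and nchar() in R)
upper = text_string.str.upper()
len_string = text_string.str.len()
len_object = text_object.str.len()

print(upper)
print(len_string.dtype)  # Int64: nullable integers, the gap stays <NA>
print(len_object.dtype)  # float64: the gap becomes NaN, forcing floats
```

With the `string` dtype the lengths remain integers even though one value is missing, which is closer to how `nchar()` behaves on a character vector containing `NA` in R.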


## Categorical Data (Factors in R)

R's `factor` type maps directly to pandas' `category` dtype. Both are used for efficient storage of repeated values and maintaining order in categorical variables.

In [15]:
# Creating categories (like R factors)
categories = pd.Series(["low", "medium", "high", "low", "high"], dtype="category")

# Or convert existing series
education = pd.Series(["HS", "BS", "MS", "BS", "HS", "PhD", "MS"])
education_cat = education.astype("category")

# Ordered categories (like ordered factors in R)
education_ordered = pd.Categorical(
    education, 
    categories=["HS", "BS", "MS", "PhD"], 
    ordered=True
)

education_ordered

['HS', 'BS', 'MS', 'BS', 'HS', 'PhD', 'MS']
Categories (4, object): ['HS' < 'BS' < 'MS' < 'PhD']
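
The practical payoff of an ordered categorical is that comparisons and reductions respect the declared order. The sketch below rebuilds the education example as a Series (the names `level_type` and `education_series` are illustrative) and inspects the levels, the integer codes, and an order-aware comparison.

```python
import pandas as pd

education = pd.Series(["HS", "BS", "MS", "BS", "HS", "PhD", "MS"])
level_type = pd.CategoricalDtype(categories=["HS", "BS", "MS", "PhD"], ordered=True)
education_series = education.astype(level_type)

# Category labels, like levels() in R
print(list(education_series.cat.categories))

# Integer codes are 0-based here, while R's factor codes are 1-based
print(education_series.cat.codes.tolist())  # [0, 1, 2, 1, 0, 3, 2]

# Comparisons use the declared order, not alphabetical order
at_least_ms = education_series >= "MS"
print(at_least_ms.tolist())

print(education_series.max())  # PhD
```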

## Boolean Types

Boolean types work similarly in both languages: `True` and `False` play the role of R's `TRUE` and `FALSE`, and reductions such as `.sum()` count the `True` values.

In [16]:
# R logical equivalent
bool_series = pd.Series([True, False, True, False])
bool_series.dtype

dtype('bool')

In [18]:
# R: TRUE + TRUE gives 2, and sum() counts the TRUEs in a logical vector
# pandas: bool_series.sum() also counts True values directly;
# the explicit cast below just makes the integer conversion visible
print(bool_series.astype(int).sum())

2


In [19]:
# Boolean indexing works the same
data = pd.Series([1, 2, 3, 4])
mask = pd.Series([True, False, True, False])
print(data[mask])  # Works like R

0    1
2    3
dtype: int64
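
Masks are usually built from comparisons and then combined. The following minimal sketch shows the pandas operators that correspond to R's `&`, `|`, and `!`. The variable names are illustrative.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4])

# Combine conditions with & (and), | (or), ~ (not). Parentheses are required
# because these operators bind more tightly than comparisons in Python.
in_range = (data > 1) & (data < 4)
outside = ~in_range

print(data[in_range].tolist())  # [2, 3]
print(data[outside].tolist())   # [1, 4]
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, so the symbolic operators are the ones to reach for.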


## DateTime Types

pandas uses a single `datetime64` dtype for both dates and timestamps, whereas R separates `Date` from `POSIXct`/`POSIXlt`:

In [20]:
# Date equivalent (R's Date class)
dates = pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01'])
dates.dtype

dtype('<M8[ns]')

In [None]:
# DateTime with time (R's POSIXct)
timestamps = pd.to_datetime(['2024-01-01 10:30:00', '2024-01-01 14:45:30'])
timestamps

DatetimeIndex(['2024-01-01 10:30:00', '2024-01-01 14:45:30'], dtype='datetime64[ns]', freq=None)

In [23]:
# Extracting components (like lubridate)
df_dates = pd.DataFrame({'date': dates})
df_dates['year'] = df_dates['date'].dt.year
df_dates['month'] = df_dates['date'].dt.month
df_dates['day'] = df_dates['date'].dt.day
df_dates

|   | date | year | month | day |
|---|------|------|-------|-----|
| 0 | 2024-01-01 | 2024 | 1 | 1 |
| 1 | 2024-02-01 | 2024 | 2 | 1 |
| 2 | 2024-03-01 | 2024 | 3 | 1 |
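
Beyond extracting components, a few other datetime operations come up often when porting R code. The sketch below shows date arithmetic, weekday names, and lenient parsing. The `"not a date"` string is made-up input used only to demonstrate `errors="coerce"`.

```python
import pandas as pd

dates = pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"])

# Subtracting timestamps yields a Timedelta (compare difftime() in R)
gap = dates[1] - dates[0]
print(gap.days)  # 31

# Adding an offset shifts every element
print(dates + pd.Timedelta(days=7))

# Weekday names, similar to weekdays() in R
print(dates.day_name().tolist())

# With errors="coerce", unparseable strings become NaT instead of raising
parsed = pd.to_datetime(pd.Series(["2024-01-01", "not a date"]), errors="coerce")
print(parsed.isna().tolist())  # [False, True]
```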


## Missing Values

Perhaps the biggest adjustment from R is how pandas handles missing values. R uses `NA` universally, while pandas has different missing value representations for different types:

In [25]:
# Different missing value types in pandas
df_missing = pd.DataFrame({
    'float_col': [1.5, np.nan, 3.5],  # np.nan for floats
    'int_col': pd.array([1, pd.NA, 3], dtype='Int64'),  # pd.NA for nullable integers
    'string_col': pd.array(['a', pd.NA, 'c'], dtype='string'),  # pd.NA for strings
    'datetime_col': pd.to_datetime(['2024-01-01', pd.NaT, '2024-01-03'])  # pd.NaT for datetimes
})

df_missing

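|   | float_col | int_col | string_col | datetime_col |
|---|-----------|---------|------------|--------------|
| 0 | 1.5 | 1 | a | 2024-01-01 |
| 1 | NaN | `<NA>` | `<NA>` | NaT |
| 2 | 3.5 | 3 | c | 2024-01-03 |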


In [26]:
df_missing.dtypes

float_col              float64
int_col                  Int64
string_col      string[python]
datetime_col    datetime64[ns]
dtype: object
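
Although the markers differ by dtype, they can all be detected and skipped in the same way. Here is a minimal sketch, using a reduced version of the frame above, of detection with `isna()`, the float upcast that happens without a nullable dtype, and the default skipping of missing values in reductions.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    "float_col": [1.5, np.nan, 3.5],
    "int_col": pd.array([1, pd.NA, 3], dtype="Int64"),
    "datetime_col": pd.to_datetime(["2024-01-01", pd.NaT, "2024-01-03"]),
})

# isna() detects every missing-value marker (compare is.na() in R)
print(df_missing.isna().sum())

# Without a nullable dtype, a missing value forces integers to float64
print(pd.Series([1, None, 3]).dtype)  # float64

# Reductions skip missing values by default, the opposite of R's na.rm = FALSE
print(df_missing["float_col"].mean())              # 2.5
print(df_missing["float_col"].mean(skipna=False))  # nan
```

The last two lines are the ones most likely to surprise R users: `mean()` in R returns `NA` unless `na.rm = TRUE` is given, while pandas silently drops the gap unless `skipna=False` is passed.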

## Type Conversion

pandas covers the conversions that R performs with its `as.*()` functions through `astype()` and the `pd.to_*` helpers:

In [27]:
# Create a sample DataFrame
df = pd.DataFrame({
    'numbers_as_text': ['1', '2', '3'],
    'mixed_numbers': [1, 2.5, 3],
    'categories': ['A', 'B', 'A'],
    'dates_as_text': ['2024-01-01', '2024-02-01', '2024-03-01']
})

# Check initial types
df.dtypes

numbers_as_text     object
mixed_numbers      float64
categories          object
dates_as_text       object
dtype: object

In [None]:
# Convert types (like R's as.numeric, as.character, etc.)
df['numbers_as_text'] = pd.to_numeric(df['numbers_as_text'])  # Like as.numeric()
df['mixed_numbers'] = df['mixed_numbers'].astype('int64')  # Like as.integer(); truncates 2.5 to 2
df['categories'] = df['categories'].astype('category')  # Like as.factor()
df['dates_as_text'] = pd.to_datetime(df['dates_as_text'])  # Like as.Date()

df.dtypes

numbers_as_text             int64
mixed_numbers               int64
categories               category
dates_as_text      datetime64[ns]
dtype: object
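
Real data often contains entries that cannot be converted. The sketch below, using made-up input with one bad value, contrasts the strict behaviour of `astype()` with the lenient `errors="coerce"` option of `pd.to_numeric()`.

```python
import pandas as pd

raw = pd.Series(["1", "2", "three"])  # illustrative data with one bad entry

# astype() is strict: a value that cannot be parsed raises ValueError
try:
    raw.astype("int64")
except ValueError as exc:
    print(f"astype failed: {exc}")

# errors="coerce" turns unparseable entries into NaN instead
# (similar to as.numeric() in R, which returns NA with a warning)
cleaned = pd.to_numeric(raw, errors="coerce")
print(cleaned)

# Cast to a nullable integer dtype to keep whole numbers alongside the gap
as_nullable_int = cleaned.astype("Int64")
print(as_nullable_int)
```

Unlike R, pandas gives no warning when it coerces, so it can be worth counting the missing values before and after the conversion.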

## Type Checking and `.info()`

When working with data, you'll often need to inspect types:

In [31]:
df_example = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [95.5, 87.2, 91.8],
    'passed': [True, True, False],
    'grade': pd.Categorical(['A', 'B', 'A']),
    'test_date': pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-17'])
})

df_example.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   id         3 non-null      int64         
 1   name       3 non-null      object        
 2   score      3 non-null      float64       
 3   passed     3 non-null      bool          
 4   grade      3 non-null      category      
 5   test_date  3 non-null      datetime64[ns]
dtypes: bool(1), category(1), datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 358.0+ bytes
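
For checking a single column programmatically, the helpers in `pandas.api.types` play a role roughly similar to R's `is.numeric()` family. The following sketch uses a reduced version of the frame above; the alias `ptypes` is just a local naming choice.

```python
import pandas as pd
from pandas.api import types as ptypes

df_example = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [95.5, 87.2, 91.8],
    "grade": pd.Categorical(["A", "B", "A"]),
    "test_date": pd.to_datetime(["2024-01-15", "2024-01-16", "2024-01-17"]),
})

# Predicate helpers accept a Series or a dtype
print(ptypes.is_numeric_dtype(df_example["score"]))             # True
print(ptypes.is_numeric_dtype(df_example["name"]))              # False
print(ptypes.is_datetime64_any_dtype(df_example["test_date"]))  # True

# Categorical columns can be recognised by their dtype class
print(isinstance(df_example["grade"].dtype, pd.CategoricalDtype))  # True
```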


## Select Columns by Type

Use `df.select_dtypes()` to keep or drop columns based on their dtype:

In [32]:
numeric_cols = df_example.select_dtypes(include=['number']).columns
numeric_cols

Index(['id', 'score'], dtype='object')

In [34]:
df_example.select_dtypes(include=['category'])

|   | grade |
|---|-------|
| 0 | A |
| 1 | B |
| 2 | A |
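
`select_dtypes()` also accepts several dtypes at once and an `exclude` argument. This sketch, on a reduced version of the frame above, shows both along with the `"datetime"` selector.

```python
import pandas as pd

df_example = pd.DataFrame({
    "id": [1, 2, 3],
    "score": [95.5, 87.2, 91.8],
    "passed": [True, True, False],
    "grade": pd.Categorical(["A", "B", "A"]),
    "test_date": pd.to_datetime(["2024-01-15", "2024-01-16", "2024-01-17"]),
})

# "number" does not include booleans, so list both to get them together
numeric_or_bool = df_example.select_dtypes(include=["number", "bool"])
print(list(numeric_or_bool.columns))  # ['id', 'score', 'passed']

# exclude keeps everything except the listed dtypes
non_numeric = df_example.select_dtypes(exclude=["number"])
print(list(non_numeric.columns))  # ['passed', 'grade', 'test_date']

dates_only = df_example.select_dtypes(include=["datetime"])
print(list(dates_only.columns))  # ['test_date']
```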


## Memory Efficiency Considerations

Choosing narrower dtypes reduces a DataFrame's memory footprint. The comparison below builds the same data twice, once with default dtypes and once with explicitly chosen compact ones:

In [35]:
df_inefficient = pd.DataFrame({
    'small_ints': [1, 2, 3, 4, 5],  # Default int64
    'categories': ['A', 'B', 'A', 'B', 'A'],  # Object type
    'small_floats': [0.1, 0.2, 0.3, 0.4, 0.5]  # Default float64
})

df_efficient = pd.DataFrame({
    'small_ints': pd.array([1, 2, 3, 4, 5], dtype='int8'),
    'categories': pd.Categorical(['A', 'B', 'A', 'B', 'A']),
    'small_floats': pd.array([0.1, 0.2, 0.3, 0.4, 0.5], dtype='float32')
})

print("Memory usage comparison:")
print(f"Inefficient: {df_inefficient.memory_usage(deep=True).sum()} bytes")
print(f"Efficient: {df_efficient.memory_usage(deep=True).sum()} bytes")

Memory usage comparison:
Inefficient: 462 bytes
Efficient: 370 bytes
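
When the data already exists with default dtypes, `pd.to_numeric()` with the `downcast` argument can pick a smaller type automatically. The following is a minimal sketch; the float values were chosen to be exactly representable in `float32`, which is an assumption made to keep the example unambiguous.

```python
import pandas as pd

small_ints = pd.Series([1, 2, 3, 4, 5])              # int64 by default
small_floats = pd.Series([0.5, 1.5, 2.5, 3.5, 4.5])  # float64 by default

# downcast picks the smallest type that can hold the values
ints_down = pd.to_numeric(small_ints, downcast="integer")
floats_down = pd.to_numeric(small_floats, downcast="float")

print(ints_down.dtype, floats_down.dtype)   # int8 float32
print(small_ints.nbytes, ints_down.nbytes)  # 40 5
```

Downcasting floats trades precision for space, so it is best reserved for columns where roughly seven significant digits are enough.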
