# DataFrame

A pandas DataFrame is the primary data structure for data analysis in Python, comparable to R's data.frame or tibble. The two serve the same purpose, but pandas DataFrames differ in a few ways, most notably the explicit row index, that make them more flexible and sometimes more complex than their R counterparts. Understanding these differences will help you use pandas effectively.


In [1]:
import pandas as pd
import numpy as np

## Creating DataFrames

Just like R's `data.frame()` or `tibble()`, pandas offers multiple ways to create DataFrames:

### From a dictionary

In [2]:
df_dict = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'score': [85.5, 92.0, 78.5]
})
df_dict

      name  age  score
0    Alice   25   85.5
1      Bob   30   92.0
2  Charlie   35   78.5


### From a list of lists

In [4]:
# R (roughly): setNames(as.data.frame(matrix(data, ncol=3, byrow=TRUE)), c(...))
df_list = pd.DataFrame(
    [['Alice', 25, 85.5],
     ['Bob', 30, 92.0],
     ['Charlie', 35, 78.5]],
    columns=['name', 'age', 'score']
)
df_list

      name  age  score
0    Alice   25   85.5
1      Bob   30   92.0
2  Charlie   35   78.5


### From a NumPy array

In [5]:
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df_array = pd.DataFrame(arr, columns=['A', 'B', 'C'])
df_array

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9


## Understanding DataFrame Structure

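Every pandas DataFrame carries two explicit label objects: a row index and a column index. R's data.frame has row names too, but in pandas the row index is a first-class object that drives selection and alignment: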

In [6]:
# Create a DataFrame with custom index
df = pd.DataFrame({
    'math': [90, 85, 92, 78],
    'english': [88, 92, 85, 90],
    'science': [95, 88, 90, 85]
}, index=['Alice', 'Bob', 'Charlie', 'David'])

df

         math  english  science
Alice      90       88       95
Bob        85       92       88
Charlie    92       85       90
David      78       90       85


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, Alice to David
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   math     4 non-null      int64
 1   english  4 non-null      int64
 2   science  4 non-null      int64
dtypes: int64(3)
memory usage: 128.0+ bytes


In [8]:
# Basic attributes
print(f"Shape: {df.shape}")           # R: dim(df)
print(f"Columns: {df.columns.tolist()}")  # R: names(df) or colnames(df)
print(f"Index: {df.index.tolist()}")      # R: rownames(df)
print(f"Data types:\n{df.dtypes}")        # R: sapply(df, class)

Shape: (4, 3)
Columns: ['math', 'english', 'science']
Index: ['Alice', 'Bob', 'Charlie', 'David']
Data types:
math       int64
english    int64
science    int64
dtype: object


## Basic DataFrame Inspection

Pandas provides several methods to inspect your data, similar to R:

In [11]:
# Create a larger sample DataFrame
np.random.seed(42)
df_large = pd.DataFrame({
    'id': range(1, 101),
    'age': np.random.randint(18, 65, 100),
    'salary': np.random.normal(50000, 15000, 100).round(2),
    'department': np.random.choice(['Sales', 'IT', 'HR', 'Finance'], 100),
    'performance': np.random.choice(['A', 'B', 'C'], 100)
})

In [12]:
df_large.head()

   id  age    salary department performance
0   1   56  59544.58    Finance           C
1   2   46  36399.19         IT           C
2   3   32  57140.64    Finance           A
3   4   60  69554.92      Sales           A
4   5   25  53173.81    Finance           B


In [13]:
df_large.tail(3)

     id  age    salary department performance
97   98   41  28347.75      Sales           C
98   99   18  34584.34    Finance           C
99  100   61  61118.61      Sales           A


In [14]:
# Summary statistics
print(df_large.describe())    # R: summary(df)

               id        age        salary
count  100.000000  100.00000    100.000000
mean    50.500000   40.88000  51027.643500
std     29.011492   13.99082  14965.841917
min      1.000000   18.00000  20901.330000
25%     25.750000   30.50000  38517.790000
50%     50.500000   41.00000  49870.225000
75%     75.250000   53.25000  61300.230000
max    100.000000   64.00000  94154.950000


In [15]:
# Include non-numeric columns as well
print(df_large.describe(include='all'))

                id        age        salary department performance
count   100.000000  100.00000    100.000000        100         100
unique         NaN        NaN           NaN          4           3
top            NaN        NaN           NaN    Finance           B
freq           NaN        NaN           NaN         35          36
mean     50.500000   40.88000  51027.643500        NaN         NaN
std      29.011492   13.99082  14965.841917        NaN         NaN
min       1.000000   18.00000  20901.330000        NaN         NaN
25%      25.750000   30.50000  38517.790000        NaN         NaN
50%      50.500000   41.00000  49870.225000        NaN         NaN
75%      75.250000   53.25000  61300.230000        NaN         NaN
max     100.000000   64.00000  94154.950000        NaN         NaN


## Selecting Columns

pandas offers several equivalent ways to select columns:

In [16]:
# Sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'Chicago', 'Houston'],
    'salary': [70000, 80000, 75000, 72000]
})

### Single column (returns Series)

In [17]:
print(df['name'])              # R: df$name or df[['name']]
print(df.name)                 # R: df$name (dot notation)
print(df.loc[:, 'name'])       # More explicit

0      Alice
1        Bob
2    Charlie
3      David
Name: name, dtype: object
0      Alice
1        Bob
2    Charlie
3      David
Name: name, dtype: object
0      Alice
1        Bob
2    Charlie
3      David
Name: name, dtype: object
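
Attribute access (`df.name`) is convenient, but it only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method. The sketch below uses a hypothetical frame `df_attr` (not part of the examples above) to show where bracket access is the safer choice:

```python
import pandas as pd

# Hypothetical frame: one column shadows a DataFrame method, one contains a space
df_attr = pd.DataFrame({
    'count': [3, 1, 2],
    'unit price': [9.5, 20.0, 4.25],
})

# Bracket access always returns the column
count_col = df_attr['count']
price_col = df_attr['unit price']

# Attribute access resolves to the DataFrame.count method, not the column
attr_is_method = callable(df_attr.count)

print(type(count_col).__name__)
print(attr_is_method)
```

A column such as `'unit price'` cannot be reached with dot notation at all, so `df['col']` is the form that works for any column name.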


### Multiple columns (returns DataFrame)

In [18]:
print(df[['name', 'salary']])  # R: df[c('name', 'salary')]

      name  salary
0    Alice   70000
1      Bob   80000
2  Charlie   75000
3    David   72000


In [19]:
# Using loc for columns
print(df.loc[:, ['name', 'age']])  # R: df[, c('name', 'age')]

      name  age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   28


In [20]:
# Column slicing by label (unlike iloc, both endpoints are included)
print(df.loc[:, 'name':'city'])    # All columns from 'name' through 'city'

      name  age     city
0    Alice   25      NYC
1      Bob   30       LA
2  Charlie   35  Chicago
3    David   28  Houston
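
One detail that often surprises R users: label slices with `loc` include the end label, while positional slices with `iloc` follow Python's half-open convention. A small sketch with a hypothetical frame `df_slice`:

```python
import pandas as pd

# Hypothetical frame with the same column layout as the example above
df_slice = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'city': ['NYC', 'LA'],
    'salary': [70000, 80000],
})

by_label = df_slice.loc[:, 'name':'city']   # end label 'city' is included
by_position = df_slice.iloc[:, 0:2]         # stops before position 2

print(by_label.columns.tolist())
print(by_position.columns.tolist())
```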


## Selecting Rows

Row selection has multiple approaches, each with specific use cases:

In [21]:
# By integer position (iloc)
print(df.iloc[0])              # R: df[1, ] (R is 1-indexed!)

name      Alice
age          25
city        NYC
salary    70000
Name: 0, dtype: object


In [22]:
print(df.iloc[0:2])           # R: df[1:2, ]

    name  age city  salary
0  Alice   25  NYC   70000
1    Bob   30   LA   80000


In [23]:
# By index label (loc)
df_indexed = df.set_index('name')
print("DataFrame with name as index:")
print(df_indexed)

DataFrame with name as index:
         age     city  salary
name                         
Alice     25      NYC   70000
Bob       30       LA   80000
Charlie   35  Chicago   75000
David     28  Houston   72000


In [24]:
print("Row by label:")
print(df_indexed.loc['Alice'])  # R: df['Alice', ] if rownames set

Row by label:
age          25
city        NYC
salary    70000
Name: Alice, dtype: object


In [25]:
# Boolean indexing
print("Boolean indexing (age > 28):")
print(df[df['age'] > 28])      # R: df[df$age > 28, ]

Boolean indexing (age > 28):
      name  age     city  salary
1      Bob   30       LA   80000
2  Charlie   35  Chicago   75000


## Selecting Subsets (Rows and Columns)

Combining row and column selection:

In [26]:
# Using loc (label-based)
print(df.loc[df['age'] > 28, ['name', 'salary']])  
# R: df[df$age > 28, c('name', 'salary')]

      name  salary
1      Bob   80000
2  Charlie   75000


In [27]:
# Using iloc (position-based)
print(df.iloc[0:2, 1:3])       # R: df[1:2, 2:3]

   age city
0   25  NYC
1   30   LA


In [28]:
# Boolean mask for rows, labels for columns
mask = df['city'].isin(['NYC', 'LA'])
print(df.loc[mask, ['name', 'city', 'salary']])
# R: df[df$city %in% c('NYC', 'LA'), c('name', 'city', 'salary')]

    name city  salary
0  Alice  NYC   70000
1    Bob   LA   80000


## Adding and Modifying Columns

Adding new columns is straightforward, similar to R:

In [29]:
# Copy DataFrame
df_mod = df.copy()

# Add new column (like R's df$new_col <- ...)
df_mod['bonus'] = df_mod['salary'] * 0.1  # R: df$bonus <- df$salary * 0.1
df_mod

      name  age     city  salary   bonus
0    Alice   25      NYC   70000  7000.0
1      Bob   30       LA   80000  8000.0
2  Charlie   35  Chicago   75000  7500.0
3    David   28  Houston   72000  7200.0


### Conditional column

In [31]:
df_mod['level'] = np.where(df_mod['salary'] > 75000, 'Senior', 'Junior')
# R: df$level <- ifelse(df$salary > 75000, 'Senior', 'Junior')
df_mod

      name  age     city  salary   bonus   level
0    Alice   25      NYC   70000  7000.0  Junior
1      Bob   30       LA   80000  8000.0  Senior
2  Charlie   35  Chicago   75000  7500.0  Junior
3    David   28  Houston   72000  7200.0  Junior


### `.assign`

In [35]:
df_mod = df_mod.assign(
    total_comp=lambda x: x['salary'] + x['bonus'],
    age_group=lambda x: pd.cut(x['age'], bins=[0, 30, 40, 100], 
                               labels=['Young', 'Middle', 'Senior'])
)
# R: df %>% mutate(total_comp = salary + bonus, ...)
df_mod

      name  age     city  salary   bonus   level  total_comp age_group
0    Alice   25      NYC   70000  7000.0  Junior     77000.0     Young
1      Bob   30       LA   80000  8000.0  Senior     88000.0     Young
2  Charlie   35  Chicago   75000  7500.0  Junior     82500.0    Middle
3    David   28  Houston   72000  7200.0  Junior     79200.0     Young


## `.drop` Rows and Columns

Removing data from DataFrames:

In [36]:
# Drop columns
df_dropped = df_mod.drop(columns=['bonus', 'level'])  
# R: df[, !names(df) %in% c('bonus', 'level')]
df_dropped

      name  age     city  salary  total_comp age_group
0    Alice   25      NYC   70000     77000.0     Young
1      Bob   30       LA   80000     88000.0     Young
2  Charlie   35  Chicago   75000     82500.0    Middle
3    David   28  Houston   72000     79200.0     Young


In [37]:
# Drop rows by index label
df_dropped = df.drop(index=[0, 2])  # Labels 0 and 2: the first and third rows here
# R: df[-c(1, 3), ]
df_dropped

    name  age     city  salary
1    Bob   30       LA   80000
3  David   28  Houston   72000
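
Note that `drop(index=...)` matches index labels, not positions. With the default integer index the two coincide, but with a custom index they do not. A short sketch, assuming a hypothetical name-indexed frame `df_lab`:

```python
import pandas as pd

# Hypothetical frame with string labels as the index
df_lab = pd.DataFrame(
    {'age': [25, 30, 35, 28]},
    index=['Alice', 'Bob', 'Charlie', 'David'],
)

# Drop by label
dropped_by_label = df_lab.drop(index=['Alice', 'Charlie'])

# Drop by position: translate positions into labels first
dropped_by_position = df_lab.drop(index=df_lab.index[[0, 2]])

# Integer "positions" are not labels here, so this raises KeyError
try:
    df_lab.drop(index=[0, 2])
    raised = False
except KeyError:
    raised = True

print(dropped_by_label.index.tolist())
print(dropped_by_position.index.tolist())
print(raised)
```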


In [38]:
# Drop rows by condition
df_filtered = df[df['age'] <= 30]  # Keep only age <= 30
# R: df[df$age <= 30, ]
df_filtered

    name  age     city  salary
0  Alice   25      NYC   70000
1    Bob   30       LA   80000
3  David   28  Houston   72000


## Sorting DataFrames

`sort_values()` sorts by one or more columns, with a separate direction for each, and `sort_index()` sorts by the index:

In [39]:
# Create sample data
df_sort = pd.DataFrame({
    'name': ['Eve', 'Alice', 'Charlie', 'Bob', 'David'],
    'age': [28, 25, 35, 30, 28],
    'salary': [72000, 70000, 75000, 80000, 68000]
})

In [40]:
# Sort by single column
print(df_sort.sort_values('age'))  # R: df[order(df$age), ]

      name  age  salary
1    Alice   25   70000
0      Eve   28   72000
4    David   28   68000
3      Bob   30   80000
2  Charlie   35   75000


In [41]:
# Sort by multiple columns
print("Sort by age, then salary (descending):")
print(df_sort.sort_values(['age', 'salary'], 
                         ascending=[True, False]))
# R: df[order(df$age, -df$salary), ]

Sort by age, then salary (descending):
      name  age  salary
1    Alice   25   70000
0      Eve   28   72000
4    David   28   68000
3      Bob   30   80000
2  Charlie   35   75000


In [42]:
# Sort by index
print(df_sort.sort_index())  # R: df[order(rownames(df)), ]

      name  age  salary
0      Eve   28   72000
1    Alice   25   70000
2  Charlie   35   75000
3      Bob   30   80000
4    David   28   68000


## Handling Missing Data

pandas has built-in methods for detecting, dropping, and filling missing values:

In [43]:
# Create DataFrame with missing values
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})
df_missing

     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
2  NaN  NaN  11
3  4.0  8.0  12


In [44]:
# Check for missing values
print(df_missing.isna().sum())  # R: colSums(is.na(df))

A    1
B    2
C    0
dtype: int64


In [46]:
# Drop rows with any missing
print(df_missing.dropna())  # R: na.omit(df)

     A    B   C
0  1.0  5.0   9
3  4.0  8.0  12


In [47]:
# Drop columns with any missing
print(df_missing.dropna(axis=1))  # R: df[, colSums(is.na(df)) == 0]

    C
0   9
1  10
2  11
3  12


In [48]:
# Fill missing values
print(df_missing.fillna(0))  # R: df[is.na(df)] <- 0

     A    B   C
0  1.0  5.0   9
1  2.0  0.0  10
2  0.0  0.0  11
3  4.0  8.0  12
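
Filling every gap with a single constant is rarely what an analysis needs. `fillna` also accepts a dict or a Series keyed by column name, so each column can get its own fill value. A minimal sketch, with `df_gaps` as a hypothetical copy of the frame above:

```python
import numpy as np
import pandas as pd

# Hypothetical copy of the missing-value example
df_gaps = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12],
})

# A different constant per column
filled_per_col = df_gaps.fillna({'A': 0, 'B': -1})

# Each column's own mean (df_gaps.mean() is a Series indexed by column name)
filled_mean = df_gaps.fillna(df_gaps.mean())

print(filled_per_col)
print(filled_mean)
```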


## Basic Operations

Applying operations across the DataFrame:

In [49]:
# Numeric DataFrame
df_num = pd.DataFrame({
    'x': [10, 20, 30, 40],
    'y': [15, 25, 35, 45],
    'z': [12, 22, 32, 42]
})

In [50]:
# Element-wise operations
print(df_num + 5)  # R: df + 5

    x   y   z
0  15  20  17
1  25  30  27
2  35  40  37
3  45  50  47


In [51]:
# Column-wise operations
print(df_num.mean())  # R: colMeans(df)

x    25.0
y    30.0
z    27.0
dtype: float64


In [58]:
print(df_num.sum(axis=0))  # Column sums (R: colSums(df))
print(df_num.sum(axis=1))  # Row sums (R: rowSums(df))

x    100
y    120
z    108
dtype: int64
0     37
1     67
2     97
3    127
dtype: int64


In [59]:
# Apply custom function
print(df_num.apply(lambda x: x.max() - x.min()))  
# R: apply(df, 2, function(x) max(x) - min(x))

x    30
y    30
z    30
dtype: int64
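
By default `apply` passes each column to the function (`axis=0`). Passing `axis=1` hands over each row instead, which corresponds to R's `apply(df, 1, ...)`. The sketch below uses a hypothetical frame `df_vals` with the same values as above, and also shows a vectorized alternative that avoids `apply` entirely:

```python
import pandas as pd

# Hypothetical frame with the same values as the numeric example
df_vals = pd.DataFrame({
    'x': [10, 20, 30, 40],
    'y': [15, 25, 35, 45],
    'z': [12, 22, 32, 42],
})

# Column-wise (default): one result per column
col_range = df_vals.apply(lambda col: col.max() - col.min())

# Row-wise: one result per row
row_range = df_vals.apply(lambda row: row.max() - row.min(), axis=1)

# Same row-wise result using built-in reductions
row_range_vec = df_vals.max(axis=1) - df_vals.min(axis=1)

print(col_range.tolist())
print(row_range.tolist())
```

Where a built-in reduction exists, it is usually the faster option on large frames, since `apply` with `axis=1` calls the Python function once per row.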


## DataFrame Information Methods

Quick ways to understand your DataFrame:

In [60]:
# Sample DataFrame
df_info = pd.DataFrame({
    'int_col': [1, 2, 3, 4, 5],
    'float_col': [1.1, 2.2, 3.3, 4.4, 5.5],
    'str_col': ['a', 'b', 'c', 'd', 'e'],
    'bool_col': [True, False, True, False, True],
    'cat_col': pd.Categorical(['A', 'B', 'A', 'B', 'C'])
})

In [61]:
# Memory usage
print(df_info.memory_usage(deep=True))  # R: object.size(df)

Index        132
int_col       40
float_col     40
str_col      250
bool_col       5
cat_col      263
dtype: int64


In [62]:
# Get numeric columns only
print(df_info.select_dtypes(include='number'))  
# R: df[sapply(df, is.numeric)]

   int_col  float_col
0        1        1.1
1        2        2.2
2        3        3.3
3        4        4.4
4        5        5.5


### Value counts for non-numeric columns

In [64]:
for col in df_info.select_dtypes(exclude='number').columns:
    print(f"\n{col}:")
    print(df_info[col].value_counts())
# R: lapply(df[sapply(df, is.factor)], table)


str_col:
str_col
a    1
b    1
c    1
d    1
e    1
Name: count, dtype: int64

bool_col:
bool_col
True     3
False    2
Name: count, dtype: int64

cat_col:
cat_col
A    2
B    2
C    1
Name: count, dtype: int64


## The Power of Index

Understanding the index is crucial for effective pandas use:

In [65]:
# DataFrame with meaningful index
sales_df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'A', 'B', 'C'],
    'quarter': ['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
    'sales': [100, 150, 200, 120, 160, 220]
})

# Set multi-index
sales_indexed = sales_df.set_index(['quarter', 'product'])
sales_indexed

                 sales
quarter product       
Q1      A          100
        B          150
        C          200
Q2      A          120
        B          160
        C          220


In [66]:
# Access with multi-index
print(sales_indexed.loc['Q1'])  # All Q1 sales

         sales
product       
A          100
B          150
C          200


In [67]:
# Reset index (back to default)
print(sales_indexed.reset_index())  # R: rownames(df) <- NULL

  quarter product  sales
0      Q1       A    100
1      Q1       B    150
2      Q1       C    200
3      Q2       A    120
4      Q2       B    160
5      Q2       C    220
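
A multi-index can also be addressed by a full key or by an inner level. As a rough sketch (the frame is rebuilt here under the hypothetical name `sales` so the block runs on its own), a tuple selects one exact row, and `xs` takes a cross-section on a named level:

```python
import pandas as pd

# Rebuild the multi-indexed sales data
sales = pd.DataFrame({
    'product': ['A', 'B', 'C', 'A', 'B', 'C'],
    'quarter': ['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
    'sales': [100, 150, 200, 120, 160, 220],
}).set_index(['quarter', 'product'])

# Full key as a tuple: a single value
q1_a = sales.loc[('Q1', 'A'), 'sales']

# Cross-section on the inner level: product A in every quarter
product_a = sales.xs('A', level='product')

print(q1_a)
print(product_a)
```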


## Key Differences from R data.frames

Here's a summary of key differences:

| Feature | R data.frame | pandas DataFrame |
|---------|--------------|------------------|
| Row names | Always present (default `1..n`), rarely used | Always has an index, often meaningful |
| Column access | `df$col` or `df[['col']]` | `df['col']` or `df.col` |
| Subsetting | `df[rows, cols]` | `df.loc[rows, cols]` or `df.iloc[rows, cols]` |
| Missing values | `NA` | `NaN`, `None`, `pd.NA` |
| Column types | One type per column; columns can differ | One dtype per column; columns can differ |
| Memory | Copy-on-modify | Some operations return views |
| Method chaining | Pipe operator (`%>%` or native `\|>`) | Built-in with dot notation |
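
The missing-value row deserves a short illustration. Which marker pandas uses depends on the dtype: with the default NumPy-backed dtypes a missing integer forces the column to `float64` and is stored as `NaN`, while the nullable `Int64` extension dtype keeps integers and stores `pd.NA`. A minimal sketch with two hypothetical Series:

```python
import pandas as pd

# Default dtype inference: None becomes NaN and the column becomes float64
s_float = pd.Series([1, None, 3])

# Nullable integer dtype: None becomes pd.NA and the values stay integers
s_nullable = pd.Series([1, None, 3], dtype='Int64')

print(s_float.dtype, s_nullable.dtype)
print(s_float.isna().tolist(), s_nullable.isna().tolist())
```

Either way, `isna()` detects the gap, so code that checks for missing values does not need to know which marker is in use.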

## Practical Tips

1. **Use `.copy()`** when you want an independent copy; plain assignment (`df_new = df`) only creates a second name for the same object:

In [68]:
df_new = df.copy()  # R: df_new <- df

2. **Prefer `loc` and `iloc`** for explicit selection:

In [69]:
# Clear and unambiguous
df.loc[df['age'] > 30, 'salary']  # Rows where age > 30, salary column

2    75000
Name: salary, dtype: int64

3. **Method chaining** for cleaner code:

In [70]:
result = (df
         .query('age > 25')
         .sort_values('salary', ascending=False)
         .head(10))

4. **Use `assign()` for new columns** in chains:

In [71]:
df_new = (df
         .assign(age_group=lambda x: pd.cut(x['age'], bins=3))
         .groupby('age_group', observed=True)
         .mean(numeric_only=True))  # skip the string columns 'name' and 'city'

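
The `query` call used in the chaining tip can also reference Python variables by prefixing them with `@`, which keeps thresholds out of the query string. A small sketch, where `df_q` and `min_age` are hypothetical names introduced for illustration:

```python
import pandas as pd

# Hypothetical frame mirroring the example data
df_q = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 28],
    'salary': [70000, 80000, 75000, 72000],
})

min_age = 25  # hypothetical threshold

# query() with a Python variable referenced via @
chained = (df_q
           .query('age > @min_age')
           .sort_values('salary', ascending=False)
           .head(10))

# Equivalent boolean-mask version
masked = (df_q[df_q['age'] > min_age]
          .sort_values('salary', ascending=False)
          .head(10))

print(chained['name'].tolist())
```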

DataFrames are the workhorses of pandas. Master these basics, and you'll be able to handle most data manipulation tasks. The key idea is that a DataFrame is a collection of Series sharing one index, so operations between columns line up row by row automatically.
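
To make the shared-index idea concrete, here is a minimal sketch with three hypothetical Series. Arithmetic matches values by label rather than by position, and labels present on only one side produce `NaN`:

```python
import pandas as pd

# Same labels, different order
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['c', 'b', 'a'])

# Values are matched by label: a = 1 + 30, b = 2 + 20, c = 3 + 10
total = s1 + s2

# Partial overlap: labels missing from one side become NaN
s3 = pd.Series([100], index=['a'])
partial = s1 + s3

print(total)
print(partial)
```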