# Indexing Fundamentals: Index vs Columns, Setting Custom Indexes

### What Is an Index in Pandas?

In Pandas, an **index** is a fundamental part of our DataFrame or Series. It acts like a label or "name tag" for each row, allowing us to identify, select, and align data efficiently. Think of the index as the *row labels*, much like the row numbers in a spreadsheet, but far more flexible.

Unlike traditional row numbers, an index can be anything meaningful: strings, dates, or even hierarchical labels (MultiIndex). This flexibility helps us organize and access data easily without relying solely on positional integer locations.

### Index vs Columns: What's the Difference?

- **Index**: Labels the rows. Index labels identify rows and are usually chosen to be unique, although Pandas does not require uniqueness. The index is not part of the data itself but acts as metadata. By default, Pandas assigns an integer index starting from 0, but we can customize it.
- **Columns**: These are the actual data fields or features in our dataset. Each column contains values of a specific type (numeric, categorical, string, etc.) and represents an attribute or variable.

**For example, in the Titanic dataset:**

| Index | PassengerId | Name | Sex | Age |
| --- | --- | --- | --- | --- |
| 0 | 1 | Braund, Mr. Owen | male | 22 |
| 1 | 2 | Cumings, Mrs. John | female | 38 |

- The leftmost number (0, 1, ...) is the **index**
- Columns like `PassengerId`, `Name`, `Sex`, `Age` hold the actual data
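
To make the distinction concrete, here is a minimal sketch that rebuilds the two rows from the table above as a small DataFrame. The name `df_small` is just an illustrative choice, not part of the Titanic dataset.

```python
import pandas as pd

# Two rows copied from the table above; df_small is a hypothetical name.
# No index is passed, so Pandas creates the default integer index.
df_small = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John"],
    "Sex": ["male", "female"],
    "Age": [22, 38],
})

print(list(df_small.index))    # row labels
print(list(df_small.columns))  # column labels
print(df_small.shape)          # counts data columns only, not the index
```

The shape reports four columns, which shows that the index sits alongside the data rather than inside it.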

### Why Does the Index Matter?

Indexes let us:

- **Identify rows:** Essential when merging, joining, or aligning data from different sources. This works best when the labels are unique.
- **Quickly select and slice data:** We can use labels (via `.loc`) instead of integer positions to retrieve rows.
- **Enable fast lookups:** Pandas backs index labels with a hash table, so looking up a label is typically faster than scanning a column for a value.
- **Align data automatically:** When combining DataFrames, indexes help match rows correctly.

In short, indexes act as the backbone of our DataFrames, helping us work smarter with tabular data.
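
The sketch below illustrates two of these points, label-based selection and automatic alignment, on two tiny Series. The values and the names `fares`, `ages`, and `combined` are made up for illustration.

```python
import pandas as pd

# Two hypothetical Series keyed by passenger ID, deliberately in different orders
fares = pd.Series([7.25, 71.2833, 7.925], index=[1, 2, 3], name="Fare")
ages = pd.Series([26.0, 22.0], index=[3, 1], name="Age")

# Label-based selection: .loc looks up the label 2, not the position 2
print(fares.loc[2])

# Alignment: concat along columns matches rows by index label, not by position.
# Passenger 2 has no age, so that cell becomes NaN.
combined = pd.concat([fares, ages], axis=1)
print(combined)
```

Even though `ages` lists passenger 3 first, each age lands in the row with the matching label.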

### Viewing the Index and Columns

We can easily check the index and columns of our DataFrame:

In [1]:
import pandas as pd

df = pd.read_csv("data/train.csv")

print("Index:", df.index)
print("Columns:", df.columns)

Index: RangeIndex(start=0, stop=891, step=1)
Columns: Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


By default, the index is a simple integer range from 0 to the number of rows minus 1 (here, 0 to 890).

### Setting Custom Indexes: Why and How?

Sometimes, the default integer index isn't meaningful for our data analysis. We often want to use one or more columns as the index to reflect unique identifiers or hierarchical relationships.

For example, using `PassengerId` as the index makes sense because it uniquely identifies each passenger.

We can set a custom index using `.set_index()`:


In [2]:
df_custom = df.set_index('PassengerId')

print(df_custom.head())
print(df_custom.index)

             Survived  Pclass  \
PassengerId                     
1                   0       3   
2                   1       1   
3                   1       3   
4                   1       1   
5                   0       3   

                                                          Name     Sex   Age  \
PassengerId                                                                    
1                                      Braund, Mr. Owen Harris    male  22.0   
2            Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0   
3                                       Heikkinen, Miss. Laina  female  26.0   
4                 Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0   
5                                     Allen, Mr. William Henry    male  35.0   

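             SibSp  Parch            Ticket     Fare Cabin Embarked  
PassengerId                                                          
1                1      0         A/5 21171   7.2500   NaN        S  
2                1      0          PC 17599  71.2833   C85        C  
3                0      0  STON/O2. 3101282   7.9250   NaN        S  
4                1      0            113803  53.1000  C123        S  
5                0      0            373450   8.0500   NaN        S  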

This changes the index from 0,1,2,... to actual passenger IDs like 1, 2, 3, …
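
Once a column becomes the index, `.loc` selects by its labels while `.iloc` still selects by position. The following is a small sketch on a three-row stand-in for the Titanic data; `df_small` and `by_id` are illustrative names, and `verify_integrity=True` is an optional safeguard not used in the cell above.

```python
import pandas as pd

# A three-row stand-in for the Titanic data (hypothetical name: df_small)
df_small = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina"],
    "Age": [22.0, 38.0, 26.0],
})

# verify_integrity=True makes set_index raise ValueError on duplicate labels
by_id = df_small.set_index("PassengerId", verify_integrity=True)

print(by_id.index.is_unique)           # are all labels distinct?
print(by_id.loc[3, "Name"])            # label 3 -> the passenger whose ID is 3
print(by_id.iloc[0]["Name"])           # position 0 -> the first row
print("PassengerId" in by_id.columns)  # the column moved into the index
```

Note that `set_index` returns a new DataFrame by default; `df_small` itself is unchanged.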

### Setting Multi-Level Indexes (MultiIndex)

We can also create indexes with multiple levels by passing a list of columns:

In [3]:
df_multi = df.set_index(['Pclass', 'Sex'])

print(df_multi.head())
print(df_multi.index)

               PassengerId  Survived  \
Pclass Sex                             
3      male              1         0   
1      female            2         1   
3      female            3         1   
1      female            4         1   
3      male              5         0   

                                                            Name   Age  SibSp  \
Pclass Sex                                                                      
3      male                              Braund, Mr. Owen Harris  22.0      1   
1      female  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0      1   
3      female                             Heikkinen, Miss. Laina  26.0      0   
1      female       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0      1   
3      male                             Allen, Mr. William Henry  35.0      0   

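               Parch            Ticket     Fare Cabin Embarked  
Pclass Sex                                                      
3      male        0         A/5 21171   7.2500   NaN        S  
1      female      0          PC 17599  71.2833   C85        C  
3      female      0  STON/O2. 3101282   7.9250   NaN        S  
1      female      0            113803  53.1000  C123        S  
3      male        0            373450   8.0500   NaN        S  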

This creates a hierarchical index with levels `Pclass` and `Sex`, allowing us to group and slice data more flexibly.
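
As a rough illustration of that flexibility, the sketch below selects rows from a MultiIndex in three ways. It uses a five-row stand-in whose `Pclass`, `Sex`, and `Age` values match the first rows shown above; the names `df_small` and `by_class_sex` are hypothetical, and sorting the index first is a common precaution rather than a requirement.

```python
import pandas as pd

# Five rows mirroring the head of the Titanic data (hypothetical name: df_small)
df_small = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

# Sorting the MultiIndex helps label-based slicing work efficiently
by_class_sex = df_small.set_index(["Pclass", "Sex"]).sort_index()

first_class = by_class_sex.loc[1]                    # outer level only
third_men = by_class_sex.loc[(3, "male"), :]         # both levels as a tuple
women = by_class_sex.xs("female", level="Sex")       # inner level only

print(first_class)
print(third_men)
print(women)
```

Selecting on the outer level alone returns a frame indexed by the remaining level, while `.xs` lets us select on an inner level without naming the outer one.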

### Resetting the Index

If we want to return to the default integer index and turn the index levels back into columns, we use `.reset_index()`:

In [4]:
df_reset = df_multi.reset_index()

print(df_reset.head())
print(df_reset.index)

   Pclass     Sex  PassengerId  Survived  \
0       3    male            1         0   
1       1  female            2         1   
2       3  female            3         1   
3       1  female            4         1   
4       3    male            5         0   

                                                Name   Age  SibSp  Parch  \
0                            Braund, Mr. Owen Harris  22.0      1      0   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0      1      0   
2                             Heikkinen, Miss. Laina  26.0      0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0      1      0   
4                           Allen, Mr. William Henry  35.0      0      0   

             Ticket     Fare Cabin Embarked  
0         A/5 21171   7.2500   NaN        S  
1          PC 17599  71.2833   C85        C  
2  STON/O2. 3101282   7.9250   NaN        S  
3            113803  53.1000  C123        S  
4            373450   8.0500   NaN        S  
RangeIndex(start=0, stop=891, step=1)
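
`.reset_index()` also accepts options that the cell above does not use. The sketch below, on a hypothetical three-row frame named `df_small`, shows `level=` for moving back only one level and `drop=True` for discarding the index labels altogether.

```python
import pandas as pd

# A three-row stand-in (hypothetical name: df_small)
df_small = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})
indexed = df_small.set_index(["Pclass", "Sex"])

# level=... moves only the named level back into the columns
only_sex_reset = indexed.reset_index(level="Sex")
print(only_sex_reset.index.name)
print(list(only_sex_reset.columns))

# drop=True throws the index labels away instead of restoring them as columns
dropped = indexed.reset_index(drop=True)
print(list(dropped.columns))
print(dropped.index)
```

Use `drop=True` with care: the `Pclass` and `Sex` values are lost unless they also exist elsewhere.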

### Common Index Types in Pandas

- **RangeIndex** (default integer index)
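- **Index** with an integer dtype (integer labels; a separate `Int64Index` class before pandas 2.0)
- **Index** with a float dtype (floating-point labels; a separate `Float64Index` class before pandas 2.0)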
- **DatetimeIndex** (dates and times)
- **CategoricalIndex** (categorical labels)
- **MultiIndex** (hierarchical/multi-level indexes)

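Each type suits different use cases and has different performance characteristics. For example, a `RangeIndex` stores only its start, stop, and step rather than every label, and a `DatetimeIndex` supports date-based slicing.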
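
A quick way to get a feel for these types is to construct a few directly and inspect them. The values below are arbitrary and the variable names are illustrative only.

```python
import pandas as pd

# Small, arbitrary examples of several index types
range_idx = pd.RangeIndex(3)
date_idx = pd.date_range("2024-01-01", periods=3, freq="D")
cat_idx = pd.CategoricalIndex(["S", "C", "S"])
multi_idx = pd.MultiIndex.from_tuples(
    [(3, "male"), (1, "female")], names=["Pclass", "Sex"]
)

# Print each index's class name and dtype
for idx in (range_idx, date_idx, cat_idx, multi_idx):
    print(type(idx).__name__, idx.dtype)
```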

### Exercises

Q1. Load the Titanic dataset and print its index and columns.

In [5]:
import pandas as pd

df = pd.read_csv("data/train.csv")

print("Index:", df.index)
print("Columns:", df.columns)

Index: RangeIndex(start=0, stop=891, step=1)
Columns: Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


Q2. Set `PassengerId` as the index and display the first 5 rows.

In [6]:
df_PID = df.set_index('PassengerId')
print(df_PID.head())

             Survived  Pclass  \
PassengerId                     
1                   0       3   
2                   1       1   
3                   1       3   
4                   1       1   
5                   0       3   

                                                          Name     Sex   Age  \
PassengerId                                                                    
1                                      Braund, Mr. Owen Harris    male  22.0   
2            Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0   
3                                       Heikkinen, Miss. Laina  female  26.0   
4                 Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0   
5                                     Allen, Mr. William Henry    male  35.0   

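             SibSp  Parch            Ticket     Fare Cabin Embarked  
PassengerId                                                          
1                1      0         A/5 21171   7.2500   NaN        S  
2                1      0          PC 17599  71.2833   C85        C  
3                0      0  STON/O2. 3101282   7.9250   NaN        S  
4                1      0            113803  53.1000  C123        S  
5                0      0            373450   8.0500   NaN        S  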

Q3. Create a MultiIndex with columns `Pclass` and `Embarked`.

In [8]:
df_multi = df.set_index(['Pclass', 'Embarked'])
print(df_multi.head())
print(df_multi.index)

                 PassengerId  Survived  \
Pclass Embarked                          
3      S                   1         0   
1      C                   2         1   
3      S                   3         1   
1      S                   4         1   
3      S                   5         0   

                                                              Name     Sex  \
Pclass Embarked                                                              
3      S                                   Braund, Mr. Owen Harris    male   
1      C         Cumings, Mrs. John Bradley (Florence Briggs Th...  female   
3      S                                    Heikkinen, Miss. Laina  female   
1      S              Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   
3      S                                  Allen, Mr. William Henry    male   

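                  Age  SibSp  Parch            Ticket     Fare Cabin  
Pclass Embarked                                                       
3      S         22.0      1      0         A/5 21171   7.2500   NaN  
1      C         38.0      1      0          PC 17599  71.2833   C85  
3      S         26.0      0      0  STON/O2. 3101282   7.9250   NaN  
1      S         35.0      1      0            113803  53.1000  C123  
3      S         35.0      0      0            373450   8.0500   NaN  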

Q4. Reset the MultiIndex to return to default indexing.

In [9]:
df_multi_reset = df_multi.reset_index()
print(df_multi_reset.head())
print(df_multi_reset.index)

   Pclass Embarked  PassengerId  Survived  \
0       3        S            1         0   
1       1        C            2         1   
2       3        S            3         1   
3       1        S            4         1   
4       3        S            5         0   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin  
0      0         A/5 21171   7.2500   NaN  
1      0          PC 17599  71.2833   C85  
2      0  STON/O2. 3101282   7.9250   NaN  
3      0            113803  53.1000  C123  
4      0            373450   8.0500   NaN  
RangeIndex(start=0, stop=891, step=1)

Q5. Check the index type of the DataFrame after each operation.

In [10]:
print("Original df index type:", type(df.index))
print("df_PID index type:", type(df_PID.index))
print("df_multi index type:", type(df_multi.index))
print("df_multi_reset index type:", type(df_multi_reset.index))

Original df index type: <class 'pandas.core.indexes.range.RangeIndex'>
df_PID index type: <class 'pandas.core.indexes.base.Index'>
df_multi index type: <class 'pandas.core.indexes.multi.MultiIndex'>
df_multi_reset index type: <class 'pandas.core.indexes.range.RangeIndex'>


### Summary

In Pandas, the index is central to how we organize and access data. Unlike columns, which hold the actual data values, the index labels each row of a DataFrame. By default, Pandas assigns an integer index starting from zero, but we can replace it with more meaningful labels taken from one or more columns.

Setting a custom index lets us work with data more intuitively. For example, using `PassengerId` as the index in the Titanic dataset uniquely identifies each passenger and lets us retrieve or align data by that identifier. Beyond single-level indexes, Pandas supports MultiIndexes: hierarchical indexes with multiple levels, such as `Pclass` and `Sex`, that make slicing and grouping more flexible and expressive.

Equally important is knowing how to reset an index with `.reset_index()` when we want to return to the default integer index or convert index labels back into regular columns. Different index types (for example integer, datetime, and categorical) have their own performance characteristics and use cases, so knowing when to use each helps us optimize our workflows. Mastering these fundamentals lays a strong foundation for more advanced data analysis in Pandas.