# Sorting and Ordering Data in Pandas

### What Is Sorting in Pandas?

When we explore or prepare data for analysis or modeling, the first thing we often need is **to bring order to our dataset**. Whether we’re identifying the top performers, sorting passengers by age or fare, or simply preparing data for reports or dashboards — **sorting** gives structure to our work.

In Pandas, we have two primary tools for sorting:

- `sort_values()` — for sorting by actual **column values**
- `sort_index()` — for sorting by **row or column index**

These methods let us organize our DataFrame in meaningful ways. With `sort_values()`, we can easily sort by one or multiple columns. With `sort_index()`, we can manage the order of indexes, which is especially helpful when we’re working with custom indexes or time series.
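To make the contrast concrete, here is a minimal sketch on a made-up three-row DataFrame. The `toy` frame, its column names, and its index labels are illustrative assumptions, not part of the Titanic data.

```python
import pandas as pd

# Hypothetical toy data with a deliberately unordered index
toy = pd.DataFrame(
    {"name": ["Cara", "Abe", "Bo"], "fare": [30.0, 10.0, 20.0]},
    index=[1, 2, 0],
)

by_value = toy.sort_values(by="fare")  # rows ordered by the values in 'fare'
by_index = toy.sort_index()            # rows ordered by the index labels

print(by_value["name"].tolist())  # ['Abe', 'Bo', 'Cara']
print(by_index["name"].tolist())  # ['Bo', 'Cara', 'Abe']
```

Both calls return a new DataFrame and leave `toy` untouched, which is the default behavior of both methods.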

Sorting isn’t just about looking neat — it’s often the first step in identifying **outliers**, **top-N records**, or **patterns** that matter. When we’re building ML pipelines, visualizing data, or creating reports, mastering sorting allows us to bring clarity and direction to our insights.

Let's walk through sorting in Pandas step by step, using the Titanic dataset.

### `sort_values()` — Sorting by Column Values

When we want to organize our dataset by the values **inside the columns**, we use `sort_values()`. This is our go-to method when ranking passengers by age, fare, or survival status.

**Syntax:**

```python
df.sort_values(by='ColumnName', ascending=True)
```

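**Parameters:**

- `by`: Column name or list of column names to sort by
- `ascending`: `True` for ascending, `False` for descending; pass a list to set the direction per column
- `na_position`: Where `NaN` values go, `'first'` or `'last'` (default `'last'`)
- `inplace`: If `True`, update the DataFrame itself and return `None` instead of a new DataFrame
- `key`: Optional function applied to each sort column before sorting; it receives a Series and must return a Series of the same length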
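The parameters above can be tried out on a small, made-up DataFrame before moving to the real data. This is only a sketch: the `toy` frame and its `city` and `score` columns are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: mixed-case strings and one missing score
toy = pd.DataFrame({
    "city": ["paris", "Oslo", "berlin"],
    "score": [3.0, np.nan, 2.0],
})

# Plain string sorting is case-sensitive, so uppercase 'O' comes before lowercase letters
default_order = toy.sort_values(by="city")["city"].tolist()

# key receives the whole column as a Series and returns a transformed Series
folded_order = toy.sort_values(by="city", key=lambda s: s.str.lower())["city"].tolist()

# Descending sort with missing values moved to the front
nan_first = toy.sort_values(by="score", ascending=False, na_position="first")["city"].tolist()

print(default_order)  # ['Oslo', 'berlin', 'paris']
print(folded_order)   # ['berlin', 'Oslo', 'paris']
print(nan_first)      # ['Oslo', 'paris', 'berlin']
```

Note that `key` only changes the order: the values stored in the result are the original, untransformed ones.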

**Examples**

In [1]:
import pandas as pd
df = pd.read_csv("data/train.csv")

# Sort by Fare
print(df.sort_values(by='Fare')[['Name', 'Fare']].head())

# Sort by Age
print(df.sort_values(by='Age')[['Name', 'Age']].head())

# Sort by Pclass and Fare (multi-level)
print(df.sort_values(by=['Pclass', 'Fare'], ascending=[True, False])[['Pclass', 'Fare', 'Name']].head())

                                 Name  Fare
815                  Fry, Mr. Richard   0.0
806            Andrews, Mr. Thomas Jr   0.0
413    Cunningham, Mr. Alfred Fleming   0.0
481  Frost, Mr. Anthony Wood "Archie"   0.0
302   Johnson, Mr. William Cahoone Jr   0.0
                                Name   Age
803  Thomas, Master. Assad Alexander  0.42
755        Hamalainen, Master. Viljo  0.67
644           Baclini, Miss. Eugenie  0.75
469    Baclini, Miss. Helene Barbara  0.75
78     Caldwell, Master. Alden Gates  0.83
     Pclass      Fare                                Name
258       1  512.3292                    Ward, Miss. Anna
679       1  512.3292  Cardeza, Mr. Thomas Drake Martinez
737       1  512.3292              Lesurer, Mr. Gustave J
27        1  263.0000      Fortune, Mr. Charles Alexander
88        1  263.0000          Fortune, Miss. Mabel Helen


Using `sort_values()` helps us organize messy data, prepare meaningful summaries, and support decision-making in our analysis process.

### `sort_index()` — Sorting by Row Index

Sometimes, the structure of our data matters more than the values. When we need to sort based on the **index** (like row numbers or a custom ID), we use `sort_index()`.

**Syntax:**

```python
df.sort_index(ascending=True)
```

**Why and When We Use It:**

- To **restore order** after shuffling or merging data
- To **sort time-series data** (e.g., by dates)
- To **organize by custom index** like `PassengerId`
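The first two use cases in this list can be sketched on synthetic data. The frames below (`ts`, `base`) and their column names are hypothetical and stand in for real time-series or shuffled data.

```python
import pandas as pd

# Use case: time-series rows that arrived out of order
dates = pd.to_datetime(["2024-03-01", "2024-01-01", "2024-02-01"])
ts = pd.DataFrame({"sales": [30, 10, 20]}, index=dates)
ordered = ts.sort_index()
print(ordered.index.is_monotonic_increasing)

# Use case: restoring the original row order after a shuffle
base = pd.DataFrame({"value": list("abcde")})
shuffled = base.sample(frac=1, random_state=0)  # random row order, original labels kept
restored = shuffled.sort_index()
print(restored["value"].tolist())
```

Restoring order this way works because shuffling keeps each row's original index label; if the index had been reset after shuffling, `sort_index()` could not recover the original order.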

**Example**

In [2]:
# Sort in reverse index order
print(df.sort_index(ascending=False).head())

# Set a custom index and sort
df_indexed = df.set_index("PassengerId")
print(df_indexed.sort_index().head())

     PassengerId  Survived  Pclass                                      Name  \
890          891         0       3                       Dooley, Mr. Patrick   
889          890         1       1                     Behr, Mr. Karl Howell   
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"   
887          888         1       1              Graham, Miss. Margaret Edith   
886          887         0       2                     Montvila, Rev. Juozas   

        Sex   Age  SibSp  Parch      Ticket   Fare Cabin Embarked  
890    male  32.0      0      0      370376   7.75   NaN        Q  
889    male  26.0      0      0      111369  30.00  C148        C  
888  female   NaN      1      2  W./C. 6607  23.45   NaN        S  
887  female  19.0      0      0      112053  30.00   B42        S  
886    male  27.0      0      0      211536  13.00   NaN        S  
             Survived  Pclass  \
PassengerId                     
1                   0       3   
2           

Sorting by index keeps our structure clean — especially in time-based or grouped data where order matters. It ensures consistency when we’re aligning data across steps.

### Handling Null Values During Sorting (`na_position`)

Real datasets, like Titanic, often have missing values. When we sort a column with nulls, it's important to **decide where those NaNs should appear**. By default, Pandas places them **last**, in both ascending and descending sorts, but we can change that using the `na_position` parameter.

**Example**

In [3]:
# Default (NaNs at end)
print(df.sort_values(by='Age', na_position='last')[['Name', 'Age']].tail())

# NaNs at start
print(df.sort_values(by='Age', na_position='first')[['Name', 'Age']].head())

# Combined with multiple columns
print(df.sort_values(by=['Age', 'Fare'], na_position='first')[['Name', 'Age', 'Fare']].head())

                                         Name  Age
859                          Razi, Mr. Raihed  NaN
863         Sage, Miss. Dorothy Edith "Dolly"  NaN
868               van Melkebeke, Mr. Philemon  NaN
878                        Laleff, Mr. Kristo  NaN
888  Johnston, Miss. Catherine Helen "Carrie"  NaN
                             Name  Age
5                Moran, Mr. James  NaN
17   Williams, Mr. Charles Eugene  NaN
19        Masselmani, Mrs. Fatima  NaN
26        Emir, Mr. Farred Chehab  NaN
28  O'Dwyer, Miss. Ellen "Nellie"  NaN
                                 Name  Age  Fare
277       Parkes, Mr. Francis "Frank"  NaN   0.0
413    Cunningham, Mr. Alfred Fleming  NaN   0.0
466             Campbell, Mr. William  NaN   0.0
481  Frost, Mr. Anthony Wood "Archie"  NaN   0.0
633     Parr, Mr. William Henry Marsh  NaN   0.0


As data scientists, we must decide whether missing values should be prioritized, ignored, or filled before sorting. This decision impacts our modeling, filtering, and how insights are communicated.
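The three options (keep missing values at the end, move them, or drop them before sorting) can be compared side by side. The sketch below uses a made-up `ages` frame rather than the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age
ages = pd.DataFrame({"name": ["A", "B", "C", "D"], "age": [40.0, np.nan, 5.0, 22.0]})

# NaN stays last by default, whichever direction we sort in
asc = ages.sort_values(by="age")["name"].tolist()
desc = ages.sort_values(by="age", ascending=False)["name"].tolist()

# Dropping rows with a missing age before sorting removes them from the ranking
known_only = ages.dropna(subset=["age"]).sort_values(by="age")["name"].tolist()

print(asc)         # ['C', 'D', 'A', 'B']
print(desc)        # ['A', 'D', 'C', 'B']
print(known_only)  # ['C', 'D', 'A']
```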

### Sorting by Multiple Criteria with Complex Conditions

There are many times when we want to **sort by more than one condition** — for example, sort passengers by class, and within each class, by fare paid.

**Example**

In [4]:
# First by class (ascending), then fare (descending)
print(df.sort_values(by=['Pclass', 'Fare'], ascending=[True, False])[['Pclass', 'Fare', 'Name']].head())

# Sort by Age, treating NaNs as the median age for ordering only (the Age values themselves are not modified)
print(df.sort_values(by='Age', key=lambda x: x.fillna(x.median()))[['Name', 'Age']].head())

     Pclass      Fare                                Name
258       1  512.3292                    Ward, Miss. Anna
679       1  512.3292  Cardeza, Mr. Thomas Drake Martinez
737       1  512.3292              Lesurer, Mr. Gustave J
27        1  263.0000      Fortune, Mr. Charles Alexander
88        1  263.0000          Fortune, Miss. Mabel Helen
                                Name   Age
803  Thomas, Master. Assad Alexander  0.42
755        Hamalainen, Master. Viljo  0.67
469    Baclini, Miss. Helene Barbara  0.75
644           Baclini, Miss. Eugenie  0.75
78     Caldwell, Master. Alden Gates  0.83


This is useful when we’re doing **grouped analysis**, segmenting our audience, or ranking based on business rules. It allows us to maintain logical order even when the data is messy or partial.
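One way to build intuition for a multi-column sort is to reproduce it in two passes: sort by the secondary column first, then apply a stable sort on the primary column. The sketch below assumes a small made-up frame and uses `kind="mergesort"` because it is one of the stable algorithms pandas offers; it is meant as an illustration, not a recommended replacement for passing a list to `by`.

```python
import pandas as pd

# Hypothetical class/fare data without ties in 'fare'
toy = pd.DataFrame({
    "pclass": [3, 1, 2, 1, 3],
    "fare": [8.0, 80.0, 13.0, 120.0, 7.0],
})

# Single call: class ascending, fare descending within each class
multi = toy.sort_values(by=["pclass", "fare"], ascending=[True, False])

# Two passes: secondary key first, then a stable sort on the primary key
two_pass = (
    toy.sort_values(by="fare", ascending=False)
       .sort_values(by="pclass", kind="mergesort")
)

print(multi.index.tolist())     # [3, 1, 2, 0, 4]
print(two_pass.index.tolist())  # [3, 1, 2, 0, 4]
```

The single call is shorter and easier to read; the two-pass version mainly shows why the order of names in `by` determines which column acts as the tie-breaker.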

### Sorting String and Categorical Data

String and categorical sorting can be tricky — especially when we care about **specific orders** (like 1st class before 2nd, or sorting by last name).

**Example**

In [5]:
# Alphabetical name sort
print(df.sort_values(by='Name')[['Name']].head())

# Sort by last name
print(df.sort_values(by='Name', key=lambda x: x.str.split(',').str[0])[['Name']].head())

# Sort Pclass as an ordered category
df['Pclass_cat'] = pd.Categorical(df['Pclass'], categories=[1, 2, 3], ordered=True)
print(df.sort_values(by='Pclass_cat')[['Name', 'Pclass_cat']].head())

                                      Name
845                    Abbing, Mr. Anthony
746            Abbott, Mr. Rossmore Edward
279       Abbott, Mrs. Stanton (Rosa Hunt)
308                    Abelson, Mr. Samuel
874  Abelson, Mrs. Samuel (Hannah Wizosky)
                                      Name
845                    Abbing, Mr. Anthony
279       Abbott, Mrs. Stanton (Rosa Hunt)
746            Abbott, Mr. Rossmore Edward
874  Abelson, Mrs. Samuel (Hannah Wizosky)
308                    Abelson, Mr. Samuel
                                                  Name Pclass_cat
445                          Dodge, Master. Washington          1
310                     Hays, Miss. Margaret Bechstein          1
309                     Francatelli, Miss. Laura Mabel          1
307  Penasco y Castellana, Mrs. Victor de Satode (M...          1
306                            Fleming, Miss. Margaret          1


We often use this when working with **labels**, **product tiers**, or **text-based fields**. Custom sorting lets us reflect real-world logic in our analysis.
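Ordered categories matter most when the logical order differs from the alphabetical one. As a minimal sketch, assume a hypothetical `tier` column with the labels `low`, `medium`, and `high`:

```python
import pandas as pd

# Hypothetical product tiers stored as plain strings
tickets = pd.DataFrame({"tier": ["medium", "high", "low", "medium"]})

# Plain strings sort alphabetically: high < low < medium
alphabetical = tickets.sort_values(by="tier")["tier"].tolist()

# An ordered categorical sorts in the order we declare
tickets["tier_cat"] = pd.Categorical(
    tickets["tier"], categories=["low", "medium", "high"], ordered=True
)
logical = tickets.sort_values(by="tier_cat")["tier"].tolist()

print(alphabetical)  # ['high', 'low', 'medium', 'medium']
print(logical)       # ['low', 'medium', 'medium', 'high']
```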

### Performance Considerations for Large Datasets

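Sorting cost grows faster than linearly with the number of rows (roughly O(n log n)), so it becomes noticeable on datasets with millions of rows. To keep it under control, we can:

- Use `nlargest()` or `nsmallest()` when we only need the top-N or bottom-N records
- Use `ignore_index=True` to get a fresh 0..n-1 index directly, instead of a separate `reset_index()` call
- Treat `inplace=True` as a convenience rather than an optimization: pandas still builds the sorted result internally before replacing the original, so it rarely saves much time or memory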

**Example**

In [6]:
import time
import numpy as np

large_df = pd.DataFrame({
    'A': np.random.randint(1, 1_000_000, 100_000),
    'B': np.random.randn(100_000)
})

start = time.time()
large_df.sort_values(by='A', inplace=True)
print(f"In-place sort time: {time.time() - start:.2f}s")

In-place sort time: 0.01s


As our projects grow (e.g., in ML training data), these optimizations can save **time, memory, and cost** — especially in production environments.
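For top-N queries, `nlargest()` gives the same values as a full sort followed by `head()`, and its `keep` parameter controls what happens with ties. The sketch below uses a made-up `fares` frame; it illustrates behavior only and makes no timing claims.

```python
import pandas as pd

# Hypothetical fares with a three-way tie at the top
fares = pd.DataFrame({
    "name": list("ABCDEF"),
    "fare": [10.0, 50.0, 50.0, 5.0, 30.0, 50.0],
})

top2_sorted = fares.sort_values(by="fare", ascending=False).head(2)
top2_nl = fares.nlargest(2, "fare")              # default keep='first'
top_all = fares.nlargest(2, "fare", keep="all")  # keeps every row tied with the last selected value

# ignore_index=True renumbers the result 0..n-1 in the same call
renumbered = fares.sort_values(by="fare", ignore_index=True)

print(top2_sorted["fare"].tolist(), top2_nl["fare"].tolist())
print(sorted(top_all["name"].tolist()))
print(renumbered.index.tolist())
```

With `keep="all"` the result can contain more than the requested number of rows, which is often what we want when reporting "top fares" without arbitrarily cutting a tie.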

### Sorting with Custom Functions / Keys

Sometimes, we want to sort based on **derived values** — like absolute difference from median fare or extracted title from a name.

**Example**

In [7]:
# Sort by how close fare is to median
median_fare = df['Fare'].median()
print(df.sort_values(by='Fare', key=lambda x: abs(x - median_fare))[['Name', 'Fare']].head())

# Sort by extracted title
df['Title'] = df['Name'].apply(lambda name: name.split(',')[1].split('.')[0].strip())
print(df.sort_values(by='Title')[['Name', 'Title']].head())

                                        Name     Fare
830  Yasbeck, Mrs. Antoni (Selini Alexander)  14.4542
73               Chronopoulos, Mr. Apostolos  14.4542
111                     Zabour, Miss. Hileni  14.4542
702                    Barbara, Miss. Saiide  14.4542
620                      Yasbeck, Mr. Antoni  14.4542
                                    Name Title
745         Crosby, Capt. Edward Gifford  Capt
647  Simonius-Blumer, Col. Oberst Alfons   Col
694                      Weir, Col. John   Col
30              Uruchurtu, Don. Manuel E   Don
796          Leader, Dr. Alice (Farnham)    Dr


This is especially powerful during **feature engineering**, where we need to create new metrics or use logic-based ranking.
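A derived value can be used for sorting either through `key` or through a helper column, and both routes give the same order. The sketch below assumes three made-up names and a hypothetical helper called `extract_title`; the regular expression is one possible way to pull out the title, not the only one.

```python
import pandas as pd

# Hypothetical names in the same "Last, Title. First" shape as the Titanic data
people = pd.DataFrame({"name": ["Smith, Dr. Ann", "Lee, Mr. Bo", "Ng, Capt. Cy"]})

def extract_title(names: pd.Series) -> pd.Series:
    """Return the text between the comma and the first period (hypothetical helper)."""
    return names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Route 1: derive the sort key on the fly; no new column is stored
by_key = people.sort_values(by="name", key=extract_title)

# Route 2: materialize the derived value as a column, then sort by it
by_column = people.assign(title=extract_title(people["name"])).sort_values(by="title")

print(by_key["name"].tolist())
print(by_column["title"].tolist())
```

The `key` route keeps the DataFrame unchanged, while the helper column is the better choice when the derived value is also needed later, for example as a model feature.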

### Sorting Columns with `axis=1`

Pandas also lets us **sort columns** by their labels using `sort_index(axis=1)`. This helps when we want to display or export datasets in a cleaner format.

**Example**

In [8]:
# Sort columns alphabetically
print(df.sort_index(axis=1).head())

# Group numeric columns before object columns (columns of other dtypes, such as the categorical Pclass_cat, are left out)
numeric_cols = df.select_dtypes(include='number').columns.tolist()
object_cols = df.select_dtypes(include='object').columns.tolist()
print(df[numeric_cols + object_cols].head())

    Age Cabin Embarked     Fare  \
0  22.0   NaN        S   7.2500   
1  38.0   C85        C  71.2833   
2  26.0   NaN        S   7.9250   
3  35.0  C123        S  53.1000   
4  35.0   NaN        S   8.0500   

                                                Name  Parch  PassengerId  \
0                            Braund, Mr. Owen Harris      0            1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...      0            2   
2                             Heikkinen, Miss. Laina      0            3   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)      0            4   
4                           Allen, Mr. William Henry      0            5   

   Pclass Pclass_cat     Sex  SibSp  Survived            Ticket Title  
0       3          3    male      1         0         A/5 21171    Mr  
1       1          1  female      1         1          PC 17599   Mrs  
2       3          3  female      0         1  STON/O2. 3101282  Miss  
3       1          1  female      1         

This is useful for preparing reports, formatting outputs, or exporting CSVs for external tools.
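Column labels are sorted as strings, so capitalization affects the result. As a small sketch on a made-up one-row frame, the `key` and `ascending` parameters of `sort_index()` can adjust that:

```python
import pandas as pd

# Hypothetical frame with mixed-case column labels
toy = pd.DataFrame({"Name": ["x"], "age": [1], "Fare": [2.5]})

# Default: case-sensitive, so capitalized labels come first
alpha = toy.sort_index(axis=1).columns.tolist()

# Case-insensitive ordering via key (the key receives the column Index)
folded = toy.sort_index(axis=1, key=lambda idx: idx.str.lower()).columns.tolist()

# Reverse alphabetical order
reverse = toy.sort_index(axis=1, ascending=False).columns.tolist()

print(alpha)    # ['Fare', 'Name', 'age']
print(folded)   # ['age', 'Fare', 'Name']
print(reverse)  # ['age', 'Name', 'Fare']
```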

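### In-Place Sorting with `inplace=True`

Passing `inplace=True` updates the existing DataFrame instead of returning a new one. This saves us a reassignment, but pandas still builds the sorted data internally before swapping it in, so we should not expect large memory savings from it.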

**Example**

In [9]:
df_copy = df.copy()
df_copy.sort_values(by='Fare', inplace=True)
df_copy.reset_index(drop=True, inplace=True)
print(df_copy.head())

   PassengerId  Survived  Pclass                              Name   Sex  \
0          816         0       1                  Fry, Mr. Richard  male   
1          807         0       1            Andrews, Mr. Thomas Jr  male   
2          414         0       2    Cunningham, Mr. Alfred Fleming  male   
3          482         0       2  Frost, Mr. Anthony Wood "Archie"  male   
4          303         0       3   Johnson, Mr. William Cahoone Jr  male   

    Age  SibSp  Parch  Ticket  Fare Cabin Embarked Pclass_cat Title  
0   NaN      0      0  112058   0.0  B102        S          1    Mr  
1  39.0      0      0  112050   0.0   A36        S          1    Mr  
2   NaN      0      0  239853   0.0   NaN        S          2    Mr  
3   NaN      0      0  239854   0.0   NaN        S          2    Mr  
4  19.0      0      0    LINE   0.0   NaN        S          3    Mr  


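This pattern is handy in notebooks on Colab or Kaggle, where we often want to keep a single working DataFrame. When memory is genuinely tight, selecting only the columns we need before sorting, or using `nlargest()` / `nsmallest()`, usually helps more.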
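One practical detail is easy to miss: with `inplace=True` the method returns `None`, so its result must not be assigned or chained. The sketch below, on a made-up three-row frame, also shows an equivalent form that reassigns instead.

```python
import pandas as pd

# Hypothetical fares in unsorted order
toy = pd.DataFrame({"fare": [30.0, 10.0, 20.0]})

# In-place version: the call itself returns None
in_place = toy.copy()
returned = in_place.sort_values(by="fare", inplace=True)
in_place.reset_index(drop=True, inplace=True)

# Reassignment version: sort and renumber in a single call
reassigned = toy.sort_values(by="fare", ignore_index=True)

print(returned)                      # None
print(in_place.equals(reassigned))
```

A common bug is writing `df = df.sort_values(..., inplace=True)`, which replaces the DataFrame with `None`.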

### Exercise

Q1. Sort all passengers by `Age` in descending order. Place `NaN` values (missing ages) at the beginning. Display only `Name`, `Age`, and `Pclass`.

In [10]:
df_age_sorted = df.sort_values(by='Age', ascending=False, na_position='first')
print(df_age_sorted[['Name', 'Age', 'Pclass']].head(10))

                                              Name  Age  Pclass
5                                 Moran, Mr. James  NaN       3
17                    Williams, Mr. Charles Eugene  NaN       2
19                         Masselmani, Mrs. Fatima  NaN       3
26                         Emir, Mr. Farred Chehab  NaN       3
28                   O'Dwyer, Miss. Ellen "Nellie"  NaN       3
29                             Todoroff, Mr. Lalio  NaN       3
31  Spencer, Mrs. William Augustus (Marie Eugenie)  NaN       1
32                        Glynn, Miss. Mary Agatha  NaN       3
36                                Mamee, Mr. Hanna  NaN       3
42                             Kraeff, Mr. Theodor  NaN       3


Q2. Sort the dataset first by `Sex` (alphabetically), and then by `Fare` in descending order. Display the top 5 results.

In [11]:
df_sorted_sex_fare = df.sort_values(by=['Sex', 'Fare'], ascending=[True, False])
print(df_sorted_sex_fare[['Name', 'Sex', 'Fare']].head(5))

                                      Name     Sex      Fare
258                       Ward, Miss. Anna  female  512.3292
88              Fortune, Miss. Mabel Helen  female  263.0000
341         Fortune, Miss. Alice Elizabeth  female  263.0000
311             Ryerson, Miss. Emily Borie  female  262.3750
742  Ryerson, Miss. Susan Parker "Suzette"  female  262.3750


Q3. Extract the title (Mr, Mrs, Miss, etc.) from the `Name` column and sort all passengers by title. Then show the title distribution.

In [12]:
df['Title'] = df['Name'].apply(lambda name: name.split(',')[1].split('.')[0].strip())
df_title_sorted = df.sort_values(by='Title')
print(df_title_sorted[['Name', 'Title']].head(10))

# Optional: Show title frequency
print("\nTitle counts:")
print(df['Title'].value_counts())

                                    Name Title
745         Crosby, Capt. Edward Gifford  Capt
647  Simonius-Blumer, Col. Oberst Alfons   Col
694                      Weir, Col. John   Col
30              Uruchurtu, Don. Manuel E   Don
796          Leader, Dr. Alice (Farnham)    Dr
632            Stahelin-Maeglin, Dr. Max    Dr
398                     Pain, Dr. Alfred    Dr
317                 Moraweck, Dr. Ernest    Dr
660        Frauenthal, Dr. Henry William    Dr
245          Minahan, Dr. William Edward    Dr

Title counts:
Title
Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Col               2
Mlle              2
Major             2
Ms                1
Mme               1
Don               1
Lady              1
Sir               1
Capt              1
the Countess      1
Jonkheer          1
Name: count, dtype: int64


Q4. Use `nlargest()` and `nsmallest()` to get the top 5 and bottom 5 fares paid. Compare the results with `sort_values()`, paying attention to how rows with tied fares are ordered.

In [13]:
top_fares_nl = df.nlargest(5, 'Fare')[['Name', 'Fare']]
bottom_fares_ns = df.nsmallest(5, 'Fare')[['Name', 'Fare']]

top_fares_sort = df.sort_values(by='Fare', ascending=False)[['Name', 'Fare']].head(5)
bottom_fares_sort = df.sort_values(by='Fare')[['Name', 'Fare']].head(5)

print("Top 5 - Using nlargest:")
print(top_fares_nl)

print("\nTop 5 - Using sort_values:")
print(top_fares_sort)

print("\nBottom 5 - Using nsmallest:")
print(bottom_fares_ns)

print("\nBottom 5 - Using sort_values:")
print(bottom_fares_sort)

Top 5 - Using nlargest:
                                   Name      Fare
258                    Ward, Miss. Anna  512.3292
679  Cardeza, Mr. Thomas Drake Martinez  512.3292
737              Lesurer, Mr. Gustave J  512.3292
27       Fortune, Mr. Charles Alexander  263.0000
88           Fortune, Miss. Mabel Helen  263.0000

Top 5 - Using sort_values:
                                   Name      Fare
679  Cardeza, Mr. Thomas Drake Martinez  512.3292
258                    Ward, Miss. Anna  512.3292
737              Lesurer, Mr. Gustave J  512.3292
88           Fortune, Miss. Mabel Helen  263.0000
438                   Fortune, Mr. Mark  263.0000

Bottom 5 - Using nsmallest:
                                Name  Fare
179              Leonard, Mr. Lionel   0.0
263            Harrison, Mr. William   0.0
271     Tornquist, Mr. William Henry   0.0
277      Parkes, Mr. Francis "Frank"   0.0
302  Johnson, Mr. William Cahoone Jr   0.0

Bottom 5 - Using sort_values:
                              

Q5. Sort all rows by the absolute difference of their `Fare` from the median fare. Display the 5 passengers whose fare was closest to the median.

In [14]:
median_fare = df['Fare'].median()
df_sorted_median = df.sort_values(by='Fare', key=lambda x: abs(x - median_fare))
print("Passengers closest to median fare:")
print(df_sorted_median[['Name', 'Fare']].head(5))

Passengers closest to median fare:
                                        Name     Fare
830  Yasbeck, Mrs. Antoni (Selini Alexander)  14.4542
73               Chronopoulos, Mr. Apostolos  14.4542
111                     Zabour, Miss. Hileni  14.4542
702                    Barbara, Miss. Saiide  14.4542
620                      Yasbeck, Mr. Antoni  14.4542


Q6. What is the difference between `sort_values()` and `sort_index()`?

- `sort_values()` sorts based on column data values.
- `sort_index()` sorts based on the row or column **index**.
    
Use `sort_values()` when we care about the actual data. Use `sort_index()` when we care about the order of the index labels (rows or columns).

## Summary

Sorting is a foundational operation in data analysis that helps bring structure and clarity to datasets. In Pandas, we primarily use two methods to perform sorting: `sort_values()` and `sort_index()`.

- `sort_values()` allows sorting based on one or multiple column values (e.g., sorting passengers by age or fare).
- `sort_index()` arranges the rows or columns based on the index, which is especially helpful when working with time series or restoring original order after a shuffle.

We can sort in ascending or descending order, handle missing values using `na_position`, and apply custom sorting logic using the `key` parameter. Sorting can also be performed on multiple criteria (like class and fare), or even on derived values such as the absolute difference from the median. It supports sorting of strings, categorical data, and even DataFrame columns (`axis=1`).

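For large datasets, `nlargest()` and `nsmallest()` avoid a full sort when we only need the top or bottom records, and `ignore_index=True` renumbers the result in the same call. `inplace=True` is a convenience for updating a DataFrame without reassignment rather than a reliable memory optimization. Sorting is widely used in feature engineering, exploratory data analysis (EDA), data cleaning, and machine learning pipelines.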

Mastering sorting helps us identify top performers, clean inconsistencies, and prepare data for meaningful insights, which makes it an essential skill for any data-driven project.