<h1 align="center">Python Data Science Guides</h1>
<h2 align="center">Pandas - Sorting Data</h2>

&nbsp;

### Contents

Section 1 - Sorting DataFrames and Series

Section 2 - Finding the *n* Largest and Smallest Values in Data

Conclusion

In [2]:
import pandas as pd
df = pd.read_csv('airline_passenger_satisfaction.csv')
df.head()

Unnamed: 0,ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,...,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
0,1,Male,48,First-time,Business,Business,821,2,5.0,3,...,3,5,2,5,5,5,3,5,5,Neutral or Dissatisfied
1,2,Female,35,Returning,Business,Business,821,26,39.0,2,...,5,4,5,5,3,5,2,5,5,Satisfied
2,3,Male,41,Returning,Business,Business,853,0,0.0,4,...,3,5,3,5,5,3,4,3,3,Satisfied
3,4,Male,50,Returning,Business,Business,1905,0,0.0,2,...,5,5,5,4,4,5,2,5,5,Satisfied
4,5,Female,49,Returning,Business,Business,3470,0,1.0,3,...,3,4,4,5,4,3,3,3,3,Satisfied


<h2 align="center">Section 1 - Sorting DataFrames and Series</h2>

### 1.1 - Sorting DataFrames and Series using `sort_values()`

The `sort_values` method provides a simple way to sort data based on the values in a single column or across multiple columns. The parameters are detailed below, and more information can be found in the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html).

|                 |                                                                                                                                                                                                             |
|-----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| *by* (required) | A column name or list of column names to sort by. If a list is given, the data is sorted by the first column, then ties are broken by the second column, and so on. |
| *axis*          | The axis to be sorted. Valid arguments are 0 or 'index' to sort by rows, and 1 or 'columns' to sort by columns. Defaults to 0. |
| *ascending*     | A Boolean or list of Booleans (one per column, if sorting by multiple columns) to specify whether sorted values should be in ascending order (`True`) or descending order (`False`). Defaults to `True`. |
| *inplace*       | A Boolean to specify whether the changes made should overwrite the current object. Defaults to `False`, so the original object is left unchanged and a sorted copy is returned. |
| *kind*          | A string naming the sorting algorithm to use. Valid options are 'quicksort', 'mergesort', 'heapsort' and 'stable'. Defaults to 'quicksort'. For DataFrames, this option only applies when sorting on a single column. |
| *na_position*   | A string stating where `NaN` values should be positioned after the sort. Valid options are 'first' and 'last'. Defaults to 'last'. |
| *ignore_index*  | A Boolean which if `True` will overwrite the current index and re-number the rows with integers starting from 0. If `False` the original row labels are kept. Defaults to `False`. |
| *key*           | A function to apply to the values before sorting. Similar to the `key` argument of the built-in `sorted()` function, except this function should expect a Series object and return a Series of the same shape. |

<style>
table,td,tr,th {border:none!important}
</style>
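
The `key` and `na_position` parameters are not used in the cells that follow, so here is a minimal sketch of both. It uses a small made-up DataFrame (`toy`, with hypothetical `Name` and `Delay` columns) rather than the airline dataset, so the effect is easy to see.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the airline dataset used in this guide
toy = pd.DataFrame({
    'Name': ['bob', 'Alice', 'carol', 'Dave'],
    'Delay': [5.0, np.nan, 1.0, 3.0],
})

# Default string sort is case-sensitive: uppercase letters come first
default_order = toy.sort_values(by='Name')['Name'].tolist()

# key receives the whole column as a Series and must return a Series
case_insensitive = toy.sort_values(
    by='Name', key=lambda s: s.str.lower()
)['Name'].tolist()

# na_position='first' moves missing values to the top of the result
nan_first = toy.sort_values(by='Delay', na_position='first')['Delay'].tolist()

print(default_order)     # ['Alice', 'Dave', 'bob', 'carol']
print(case_insensitive)  # ['Alice', 'bob', 'carol', 'Dave']
print(nan_first)         # NaN first, then 1.0, 3.0, 5.0
```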

In [3]:
# Sort DataFrame by the Age column in ascending order
df.sort_values(by='Age').head(3)

Unnamed: 0,ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,...,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
91221,91222,Male,7,Returning,Personal,Economy,271,0,0.0,0,...,5,5,5,5,5,5,2,5,4,Neutral or Dissatisfied
107196,107197,Female,7,Returning,Personal,Economy,402,0,0.0,4,...,4,1,5,1,1,5,2,1,4,Neutral or Dissatisfied
61292,61293,Female,7,Returning,Personal,Economy,967,1,0.0,5,...,4,5,3,5,5,5,2,5,5,Neutral or Dissatisfied


In [4]:
# Sort DataFrame by the Age column in descending order
df.sort_values(by='Age', ascending=False).head(3)

Unnamed: 0,ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,...,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
89351,89352,Male,85,First-time,Business,Business,1187,0,7.0,3,...,4,5,3,5,3,1,3,3,3,Neutral or Dissatisfied
22898,22899,Female,85,Returning,Business,Economy Plus,147,3,0.0,4,...,1,4,3,4,4,1,4,2,4,Neutral or Dissatisfied
81344,81345,Female,85,First-time,Business,Business,899,0,0.0,2,...,3,3,4,3,2,3,2,2,4,Neutral or Dissatisfied


In [5]:
# Sort DataFrame on multiple columns: first Flight Distance, then Departure Delay, then Age
sort_columns = ['Flight Distance', 'Departure Delay', 'Age']
df.sort_values(by=sort_columns).head(3)

# Note how the data is sorted by Flight Distance first, but the first 3 individuals have identical flight distances
# of 31. Next the data is sorted by Departure Delay, but these individuals also have an identical delay of 0.
# Finally the data is sorted by Age, which separates the passengers with ages of 22, 23 and 26.

Unnamed: 0,ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,...,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
30077,30078,Female,22,Returning,Personal,Economy,31,0,0.0,5,...,3,5,2,5,5,4,2,5,4,Neutral or Dissatisfied
29823,29824,Female,23,First-time,Business,Economy,31,0,0.0,0,...,1,1,3,1,1,2,4,1,4,Satisfied
29991,29992,Female,26,First-time,Business,Economy,31,0,0.0,0,...,5,3,3,3,3,2,4,3,5,Neutral or Dissatisfied


In [6]:
# Sort DataFrame on multiple columns with some ascending and some descending
df.sort_values(by=sort_columns, ascending=[True, True, False]).head(3)

Unnamed: 0,ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,...,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
30132,30133,Male,70,Returning,Personal,Economy,31,0,0.0,5,...,3,4,5,4,4,4,2,4,5,Neutral or Dissatisfied
29862,29863,Female,53,Returning,Business,Economy,31,0,0.0,3,...,5,4,5,3,4,5,5,5,5,Satisfied
30183,30184,Male,43,Returning,Business,Economy Plus,31,0,0.0,5,...,4,4,3,4,4,2,5,4,5,Satisfied


In [7]:
# Sort DataFrame and overwrite the original index
df.sort_values(by=sort_columns, ascending=[True, True, False], ignore_index=True).head(3)

Unnamed: 0,ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,...,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
0,30133,Male,70,Returning,Personal,Economy,31,0,0.0,5,...,3,4,5,4,4,4,2,4,5,Neutral or Dissatisfied
1,29863,Female,53,Returning,Business,Economy,31,0,0.0,3,...,5,4,5,3,4,5,5,5,5,Satisfied
2,30184,Male,43,Returning,Business,Economy Plus,31,0,0.0,5,...,4,4,3,4,4,2,5,4,5,Satisfied
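
The cells above all sort a DataFrame. As a small sketch of the Series case, the example below sorts a made-up Series of ages (`ages` is a hypothetical stand-in for a column such as `df['Age']`). A Series holds a single set of values, so `sort_values` takes no `by` argument here.

```python
import pandas as pd

# Hypothetical ages, standing in for a column such as df['Age']
ages = pd.Series([48, 35, 41, 50], name='Age')

# A Series has only one set of values, so no `by` argument is needed
oldest_first = ages.sort_values(ascending=False)

print(oldest_first.tolist())        # [50, 48, 41, 35]
print(oldest_first.index.tolist())  # [3, 0, 2, 1]  (original row labels are kept)
```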


### 1.2 - Reset a Sorted DataFrame/Series using `sort_index()`

The `sort_index` method sorts a DataFrame or Series by its row labels. Because `sort_values` keeps each row's original label by default, calling `sort_index` with no arguments returns a sorted object to its unsorted state, provided the original labels were themselves in ascending order. This will not work on an object that was sorted with `ignore_index=True`, since that overwrites the original row labels. Optional arguments can be passed to modify this behaviour, and these are described below. More can be found in the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html).

|                  |                                                                                                                                                                                                             |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| *axis*           | The axis to be sorted. Valid arguments are 0 or 'index' to sort by rows, and 1 or 'columns' to sort by columns. Defaults to 0. |
| *level*          | An integer, level name, or list of integers or level names to sort on if the data is multi-indexed. If not `None`, the sort is performed on the values in the specified index level(s). |
| *ascending*      | A Boolean or list of Booleans (one per level, if sorting on multiple index levels) to specify whether sorted values should be in ascending order (`True`) or descending order (`False`). Defaults to `True`. |
| *inplace*        | A Boolean to specify whether the changes made should overwrite the current object. Defaults to `False`, so the original object is left unchanged and a sorted copy is returned. |
| *kind*           | A string naming the sorting algorithm to use. Valid options are 'quicksort', 'mergesort', 'heapsort' and 'stable'. Defaults to 'quicksort'. |
| *na_position*    | A string stating where `NaN` values should be positioned after the sort. Valid options are 'first' and 'last'. Defaults to 'last'. |
| *sort_remaining* | A Boolean which, if `True` when *level* is not `None` and the index is multi-level, will also sort by the other levels in order after sorting by the specified level. Defaults to `True`. |
| *ignore_index*   | A Boolean which if `True` will overwrite the current index and re-number the rows with integers starting from 0. If `False` the original row labels are kept. Defaults to `False`. |
| *key*            | A function to apply to the index values before sorting. Similar to the `key` argument of the built-in `sorted()` function, except this function should expect an Index object and return an Index of the same shape. |

<style>
table,td,tr,th {border:none!important}
</style>
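
To illustrate the point about `ignore_index`, here is a minimal sketch on a three-row made-up DataFrame (`toy` is hypothetical, not the airline data). It shows `sort_index` undoing a sort when the row labels were kept, and having nothing to undo when they were overwritten.

```python
import pandas as pd

toy = pd.DataFrame({'Age': [48, 35, 41]})   # hypothetical data, labels 0, 1, 2

# Row labels travel with their rows, so sort_index can undo the sort
by_age = toy.sort_values(by='Age')
restored = by_age.sort_index()
print(by_age.index.tolist())     # [1, 2, 0]
print(restored['Age'].tolist())  # [48, 35, 41]  (original order)

# With ignore_index=True the original labels are gone, so nothing is restored
renumbered = toy.sort_values(by='Age', ignore_index=True)
print(renumbered.sort_index()['Age'].tolist())  # [35, 41, 48]  (still sorted by Age)
```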

In [8]:
# Sort the DataFrame by the index (row labels)
df.sort_index().head(3)

Unnamed: 0,ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,...,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
0,1,Male,48,First-time,Business,Business,821,2,5.0,3,...,3,5,2,5,5,5,3,5,5,Neutral or Dissatisfied
1,2,Female,35,Returning,Business,Business,821,26,39.0,2,...,5,4,5,5,3,5,2,5,5,Satisfied
2,3,Male,41,Returning,Business,Business,853,0,0.0,4,...,3,5,3,5,5,3,4,3,3,Satisfied
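
The airline DataFrame has a single-level index, so the `level` parameter has no effect on it. As a rough sketch of how `level` behaves on multi-indexed data, the example below builds a small made-up Series (`fares`, with hypothetical index levels `Class` and `ID`) and sorts it on one level.

```python
import pandas as pd

# Hypothetical two-level index: travel class, then passenger ID
idx = pd.MultiIndex.from_tuples(
    [('Economy', 2), ('Business', 2), ('Economy', 1), ('Business', 1)],
    names=['Class', 'ID'],
)
fares = pd.Series([100, 400, 120, 380], index=idx)

# Sort on the 'ID' level; by default (sort_remaining=True) ties on 'ID'
# are then broken using the remaining level, 'Class'
by_id_then_class = fares.sort_index(level='ID')

print(by_id_then_class.index.tolist())
# [('Business', 1), ('Economy', 1), ('Business', 2), ('Economy', 2)]
print(by_id_then_class.tolist())  # [380, 120, 400, 100]
```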


### 1.3 - Sorting Index Objects with `sort_values()`

The `sort_values` method can also be used on Index objects to arrange their labels in ascending order. This can be useful if a DataFrame has already been sorted and the `ignore_index` argument was not set to `True`. In cases like this, the `index` attribute of the object can be accessed directly and overwritten with the sorted Index object as shown below. Note that this relabels the rows in their current order; it does not move the rows back to their original positions.

In [21]:
# Sort a DataFrame in-place without setting the ignore_index parameter.
df.sort_values('Age', inplace=True)
df.head(3)

Unnamed: 0,ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,...,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
0,91222,Male,7,Returning,Personal,Economy,271,0,0.0,0,...,5,5,5,5,5,5,2,5,4,Neutral or Dissatisfied
452,32859,Female,7,Returning,Personal,Economy,102,1,0.0,4,...,1,1,3,1,1,3,2,1,4,Neutral or Dissatisfied
453,106552,Male,7,Returning,Personal,Economy,727,0,0.0,5,...,4,4,1,4,4,4,1,4,1,Neutral or Dissatisfied


In [20]:
# Overwrite the index attribute with a sorted index
df.index = df.index.sort_values()
df.head(3)

Unnamed: 0,ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,...,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
0,91222,Male,7,Returning,Personal,Economy,271,0,0.0,0,...,5,5,5,5,5,5,2,5,4,Neutral or Dissatisfied
1,107197,Female,7,Returning,Personal,Economy,402,0,0.0,4,...,4,1,5,1,1,5,2,1,4,Neutral or Dissatisfied
2,61293,Female,7,Returning,Personal,Economy,967,1,0.0,5,...,4,5,3,5,5,5,2,5,5,Neutral or Dissatisfied
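
Overwriting the index and calling `sort_index` look similar but do different things. The sketch below compares them on a small made-up DataFrame (`toy` is hypothetical): assigning a sorted index only changes the labels, whereas `sort_index` reorders the rows.

```python
import pandas as pd

toy = pd.DataFrame({'Age': [48, 35, 41]})   # hypothetical data, labels 0, 1, 2
by_age = toy.sort_values(by='Age')          # labels are now 1, 2, 0

# Assigning the sorted index relabels the rows where they stand
relabelled = by_age.copy()
relabelled.index = relabelled.index.sort_values()
print(relabelled.index.tolist())    # [0, 1, 2]
print(relabelled['Age'].tolist())   # [35, 41, 48]  (still ordered by Age)

# sort_index instead moves the rows back to their original positions
print(by_age.sort_index()['Age'].tolist())  # [48, 35, 41]
```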


<h2 align="center">Section 2 - Finding the n Largest and Smallest Values in Data</h2>

### 2.1 - Finding the *n* Largest Values in a DataFrame or Series using `nlargest()`

The `nlargest` method can be applied to DataFrame or Series objects to find the largest values in some data. When applied to a Series, the *n* largest values are returned in a Series. When applied to a DataFrame, a DataFrame is returned containing the entire rows where the largest values for the given column(s) are found. For a Series, both arguments are optional (*n* defaults to 5); for a DataFrame, *n* and the columns to search are both required. The parameters are summarised below and more information can be found [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nlargest.html).

|                                     |                                                                                                                                                                                                                                                                                                                           |
|-------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| *n* (required for DataFrames)       | The integer number of largest values to find. For a Series this is optional and defaults to 5. |
| *columns* (required for DataFrames) | For DataFrames only, a column name string or list of column names to order the largest values by. |
| *keep*                              | A string to specify how to handle duplicate values. Valid options are 'first', 'last' and 'all'. Passing 'first' will prioritise the first occurrence, 'last' will prioritise the last occurrence, and 'all' will keep every row tied with the last value selected, even if the number of results then exceeds *n*. Defaults to 'first'. |


<style>
table,td,tr,th {border:none!important}
</style>
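
The effect of `keep` only shows up when values are tied at the cut-off. As a minimal sketch, the example below uses a made-up Series (`dist`, a hypothetical stand-in for `df['Flight Distance']`) with a three-way tie for the largest value and asks for the top 2 under each option.

```python
import pandas as pd

# Hypothetical distances with a three-way tie for the largest value
dist = pd.Series([4983, 31, 4983, 853, 4983], name='Flight Distance')

first_labels = dist.nlargest(2).index.tolist()
last_labels = sorted(dist.nlargest(2, keep='last').index)
all_labels = sorted(dist.nlargest(2, keep='all').index)

print(first_labels)  # [0, 2]     (the first two occurrences)
print(last_labels)   # [2, 4]     (the last two occurrences)
print(all_labels)    # [0, 2, 4]  (every tied row, so more than n results)
```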

In [10]:
# Find the 5 largest Flight Distance values from a DataFrame
df.nlargest(5, 'Flight Distance')

Unnamed: 0,ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,...,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
31811,31812,Male,42,Returning,Business,Business,4983,0,0.0,3,...,5,4,4,4,5,2,3,4,3,Satisfied
31812,31813,Female,37,Returning,Business,Business,4983,0,2.0,3,...,1,4,5,4,4,3,2,3,1,Satisfied
31814,31815,Female,49,Returning,Business,Business,4983,0,14.0,1,...,5,4,2,4,4,4,4,2,3,Neutral or Dissatisfied
31815,31816,Female,45,Returning,Personal,Economy,4983,2,0.0,1,...,5,1,2,1,3,4,3,2,3,Neutral or Dissatisfied
31816,31817,Male,38,Returning,Business,Economy Plus,4983,2,0.0,4,...,4,4,3,4,4,5,4,3,5,Satisfied


In [11]:
# Find the 5 largest values from the Flight Distance column (Series)
df['Flight Distance'].nlargest(5)

31811    4983
31812    4983
31814    4983
31815    4983
31816    4983
Name: Flight Distance, dtype: int64

In [12]:
# Find the 5 largest values sorting first by Age and then by ID
df.nlargest(5, ['Age', 'ID'])

Unnamed: 0,ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,...,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
127819,127820,Female,85,First-time,Business,Business,862,57,51.0,1,...,2,5,1,5,1,1,1,4,1,Neutral or Dissatisfied
125070,125071,Female,85,Returning,Business,Business,3156,0,0.0,5,...,1,3,4,3,4,3,3,3,4,Neutral or Dissatisfied
120410,120411,Female,85,First-time,Business,Economy,366,0,0.0,2,...,1,3,2,3,3,1,3,3,2,Neutral or Dissatisfied
117305,117306,Female,85,Returning,Business,Business,325,0,0.0,1,...,5,4,4,4,4,5,1,5,5,Satisfied
109730,109731,Female,85,Returning,Business,Business,187,17,11.0,3,...,5,4,4,4,4,1,2,3,4,Neutral or Dissatisfied


### 2.2 - Finding the *n* Smallest Values in a DataFrame or Series using `nsmallest()`

The `nsmallest` method can be applied to DataFrame or Series objects to find the smallest values in some data. When applied to a Series, the *n* smallest values are returned in a Series. When applied to a DataFrame, a DataFrame is returned containing the entire rows where the smallest values for the given column(s) are found. For a Series, both arguments are optional (*n* defaults to 5); for a DataFrame, *n* and the columns to search are both required. The parameters are summarised below and more information can be found [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nsmallest.html).

|                                     |                                                                                                                                                                                                                                                                                                                           |
|-------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| *n* (required for DataFrames)       | The integer number of smallest values to find. For a Series this is optional and defaults to 5. |
| *columns* (required for DataFrames) | For DataFrames only, a column name string or list of column names to order the smallest values by. |
| *keep*                              | A string to specify how to handle duplicate values. Valid options are 'first', 'last' and 'all'. Passing 'first' will prioritise the first occurrence, 'last' will prioritise the last occurrence, and 'all' will keep every row tied with the last value selected, even if the number of results then exceeds *n*. Defaults to 'first'. |


<style>
table,td,tr,th {border:none!important}
</style>
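
`nsmallest` can be thought of as a shortcut for sorting in ascending order and taking the first *n* rows. The sketch below checks this on a small made-up DataFrame (`toy` is hypothetical, with no tied distances so that both approaches return the rows in the same order).

```python
import pandas as pd

# Hypothetical data with no tied distances, so both approaches agree exactly
toy = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Flight Distance': [821, 31, 1905, 853],
})

via_nsmallest = toy.nsmallest(2, 'Flight Distance')
via_sort = toy.sort_values(by='Flight Distance').head(2)

print(via_nsmallest['ID'].tolist())  # [2, 1]
print(via_sort['ID'].tolist())       # [2, 1]
```

When the column contains ties, the two approaches may order the tied rows differently, so the comparison is only exact for distinct values.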

In [13]:
# Find the 5 smallest Flight Distance values from a DataFrame
df.nsmallest(5, 'Flight Distance')

Unnamed: 0,ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,...,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
29815,29816,Female,38,Returning,Business,Economy,31,0,0.0,2,...,4,4,4,4,4,2,4,4,3,Satisfied
29823,29824,Female,23,First-time,Business,Economy,31,0,0.0,0,...,1,1,3,1,1,2,4,1,4,Satisfied
29862,29863,Female,53,Returning,Business,Economy,31,0,0.0,3,...,5,4,5,3,4,5,5,5,5,Satisfied
29991,29992,Female,26,First-time,Business,Economy,31,0,0.0,0,...,5,3,3,3,3,2,4,3,5,Neutral or Dissatisfied
30077,30078,Female,22,Returning,Personal,Economy,31,0,0.0,5,...,3,5,2,5,5,4,2,5,4,Neutral or Dissatisfied


In [14]:
# Find the 5 smallest values from the Flight Distance column (Series)
df['Flight Distance'].nsmallest(5)

29815    31
29823    31
29862    31
29991    31
30077    31
Name: Flight Distance, dtype: int64

In [15]:
# Find the 5 smallest values sorting first by Age and then by ID
df.nsmallest(5, ['Age', 'ID'])

Unnamed: 0,ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,...,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
94,95,Female,7,Returning,Personal,Economy Plus,173,18,35.0,5,...,5,2,2,2,2,5,2,2,5,Neutral or Dissatisfied
123,124,Female,7,Returning,Personal,Economy,562,0,2.0,5,...,5,5,2,2,2,5,1,2,5,Neutral or Dissatisfied
236,237,Male,7,Returning,Personal,Business,719,0,2.0,5,...,3,2,4,2,2,5,3,2,5,Neutral or Dissatisfied
351,352,Female,7,Returning,Personal,Economy Plus,853,0,0.0,5,...,3,5,2,5,5,4,2,5,5,Neutral or Dissatisfied
1127,1128,Female,7,Returning,Personal,Economy,158,0,0.0,4,...,5,1,2,1,1,5,4,1,5,Neutral or Dissatisfied


<h2 align="center">Conclusion</h2>

Sorting data in Pandas is straightforward with the `sort_values` method, which is available on DataFrames, Series, and Index objects. If duplicate values are present in a column, a list of columns can be passed so that ties are broken by each subsequent column. Further parameters control how the index and `NaN` values are handled, and whether the order is ascending or descending. The `nlargest` and `nsmallest` methods provide a simple way to find the *n* largest or smallest values in a column.

&nbsp;
