# Sorting and Reshaping
We’ll explore sorting and reshaping data using Pandas.

By the end of this activity, you’ll be able to:
- Sort data by column values and indices.
- Perform hierarchical sorting using multiple columns.
- Reshape data with `pivot()` and summarise it using `pivot_table()`.


## Sorting Data

Sorting is an essential part of data analysis. 

It helps us make sense of data by organising it in a meaningful way. 

In Pandas, we can sort data using two main methods:
- `sort_values()`: Sort rows based on column values.
- `sort_index()`: Sort rows or columns based on their index labels.


---
## Load the Data

In [1]:
# Import Pandas
import pandas as pd

In [2]:
# Sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'Age': [24, 19, 22, 23, 24, 19, 22, 23],
    'Score': [85, 92, 88, 76, 90, 85, 88, 76],
    'Department': ['Sales', 'HR', 'IT', 'Finance', 'Sales', 'HR', 'IT', 'Finance']
}
df = pd.DataFrame(data)
print(df)

      Name  Age  Score Department
0    Alice   24     85      Sales
1      Bob   19     92         HR
2  Charlie   22     88         IT
3    Diana   23     76    Finance
4      Eve   24     90      Sales
5    Frank   19     85         HR
6    Grace   22     88         IT
7   Hannah   23     76    Finance


---
## Sort by Age (Ascending)
Sort the data by `Age` in ascending order—smallest to largest.

In [3]:
sorted_by_age = df.sort_values(by='Age')
print(sorted_by_age)

      Name  Age  Score Department
1      Bob   19     92         HR
5    Frank   19     85         HR
6    Grace   22     88         IT
2  Charlie   22     88         IT
7   Hannah   23     76    Finance
3    Diana   23     76    Finance
0    Alice   24     85      Sales
4      Eve   24     90      Sales


The method `sort_values(by='Age')` organises the rows in ascending order of the `Age` column. It returns a new DataFrame, so `df` itself is left unchanged.

You can observe that the youngest learners (age 19) appear at the top, while the oldest (age 24) are listed at the bottom.

All other columns (`Name`, `Score`, `Department`) remain aligned with the newly sorted `Age` values.
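
Several learners share the same age, and the order of tied rows is not guaranteed by the default sorting algorithm. The following is a minimal sketch, assuming you want ties to keep their original relative order: it passes `kind='stable'` to `sort_values()`. The variable name `stable_by_age` is only illustrative.

```python
import pandas as pd

# Rebuild the relevant columns of the sample data so the snippet runs on its own
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'Age': [24, 19, 22, 23, 24, 19, 22, 23],
})

# kind='stable' keeps rows with equal Age in their original relative order,
# e.g. Charlie (index 2) stays ahead of Grace (index 6)
stable_by_age = df.sort_values(by='Age', kind='stable')
print(stable_by_age)
```

With a stable sort, repeated runs on the same data always give the same row order, which makes results easier to compare.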

---
## Sort by Score (Descending)
Sort by `Score` but in descending order—highest to lowest.

In [4]:
sorted_by_score = df.sort_values(by='Score', ascending=False)
print(sorted_by_score)

      Name  Age  Score Department
1      Bob   19     92         HR
4      Eve   24     90      Sales
6    Grace   22     88         IT
2  Charlie   22     88         IT
5    Frank   19     85         HR
0    Alice   24     85      Sales
3    Diana   23     76    Finance
7   Hannah   23     76    Finance


We included the parameter `ascending=False` to sort the values in descending order.  

As a result, Bob, with the highest score of 92, is now at the top, while Diana and Hannah, both with the lowest scores of 76, are at the bottom.
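
A descending sort is often combined with `head()` to keep only the top rows. The sketch below assumes we want the two highest scorers; the name `top_two` is hypothetical.

```python
import pandas as pd

# Rebuild the relevant columns of the sample data so the snippet runs on its own
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'Score': [85, 92, 88, 76, 90, 85, 88, 76],
})

# Sort from highest to lowest score, then keep the first two rows
top_two = df.sort_values(by='Score', ascending=False).head(2)
print(top_two)
```

Bob (92) and Eve (90) have unique scores, so these two rows do not depend on how ties are ordered.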

---
## Sort by Multiple Columns (Age Ascending, Score Descending)

What if we need to sort by multiple columns? For example, we can start by sorting by `Age` and then, for rows with the same age, sort by `Score` in descending order.  

To achieve this, we can provide a list of column names to the `by` parameter and a corresponding list of booleans to the `ascending` parameter to specify the sorting order for each column.

In [5]:
sorted_by_age_score = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
print(sorted_by_age_score)

      Name  Age  Score Department
1      Bob   19     92         HR
5    Frank   19     85         HR
2  Charlie   22     88         IT
6    Grace   22     88         IT
3    Diana   23     76    Finance
7   Hannah   23     76    Finance
4      Eve   24     90      Sales
0    Alice   24     85      Sales


### Hierarchical Sorting
This type of sorting is called hierarchical sorting.

Hierarchical sorting is the process of sorting data by multiple criteria in a specific order of priority.

- The **first** criterion is applied to group or organise the data.
- The **second (or subsequent)** criterion is used to sort rows within each group formed by the first criterion.

For example, if we sort by `Age` (ascending) and then by `Score` (descending), all rows with the same `Age` will be grouped together first.

Within each `Age` group, rows will then be ordered based on their `Score`.
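
One way to convince yourself that both priorities hold is to check them directly. This is a rough sketch, not part of the original activity; the name `ranked` is illustrative.

```python
import pandas as pd

# Rebuild the relevant columns of the sample data so the snippet runs on its own
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'Age': [24, 19, 22, 23, 24, 19, 22, 23],
    'Score': [85, 92, 88, 76, 90, 85, 88, 76],
})

ranked = df.sort_values(by=['Age', 'Score'], ascending=[True, False])

# Priority 1: Age never decreases from one row to the next
print(ranked['Age'].is_monotonic_increasing)

# Priority 2: within each Age group, Score never increases
for age, group in ranked.groupby('Age'):
    print(age, group['Score'].tolist())
```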

---
## Sort by Index (Descending)
Let’s see the original DataFrame.

Notice the numbers on the left side. These are the index labels.

In [6]:
print("Original Dataframe:\n", df)

Original Dataframe:
       Name  Age  Score Department
0    Alice   24     85      Sales
1      Bob   19     92         HR
2  Charlie   22     88         IT
3    Diana   23     76    Finance
4      Eve   24     90      Sales
5    Frank   19     85         HR
6    Grace   22     88         IT
7   Hannah   23     76    Finance


Now, let's sort by index instead of column values. 

In [7]:
print("\nSorted by Index")
sorted_by_index = df.sort_index(ascending=False)
print(sorted_by_index)


Sorted by Index
      Name  Age  Score Department
7   Hannah   23     76    Finance
6    Grace   22     88         IT
5    Frank   19     85         HR
4      Eve   24     90      Sales
3    Diana   23     76    Finance
2  Charlie   22     88         IT
1      Bob   19     92         HR
0    Alice   24     85      Sales


`sort_index()` rearranges rows based on their index labels, sorting in ascending order by default.  

To sort in descending order, set `ascending=False`.  

The index refers to the labels on the left side of the table, which are separate from the columns.  

A DataFrame usually has a single row index, which can be numeric or string-based, though custom or multi-level indices can also be used.  

Here, the index labels are now sorted in reverse order.
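
A common use of `sort_index()` is to restore the original row order after sorting by values. The sketch below assumes the default integer index created by `pd.DataFrame`; `shuffled` and `restored` are illustrative names.

```python
import pandas as pd

# Rebuild the relevant columns of the sample data so the snippet runs on its own
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'Age': [24, 19, 22, 23, 24, 19, 22, 23],
})

# Sorting by Age reorders the rows, but each row keeps its index label
shuffled = df.sort_values(by='Age')

# Sorting by index (ascending by default) puts the rows back in label order
restored = shuffled.sort_index()
print(restored)
```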

---
## Sort Columns Alphabetically by Name
Sort the columns themselves instead of the rows.

In [8]:
print("Original Dataframe:\n", df)

print("\nSorted Columns Alphabetically by Name")
sorted_columns = df.sort_index(axis=1)
print(sorted_columns)

Original Dataframe:
       Name  Age  Score Department
0    Alice   24     85      Sales
1      Bob   19     92         HR
2  Charlie   22     88         IT
3    Diana   23     76    Finance
4      Eve   24     90      Sales
5    Frank   19     85         HR
6    Grace   22     88         IT
7   Hannah   23     76    Finance

Sorted Columns Alphabetically by Name
   Age Department     Name  Score
0   24      Sales    Alice     85
1   19         HR      Bob     92
2   22         IT  Charlie     88
3   23    Finance    Diana     76
4   24      Sales      Eve     90
5   19         HR    Frank     85
6   22         IT    Grace     88
7   23    Finance   Hannah     76


In `sort_index()`, the `axis` parameter controls which set of labels is sorted:  
- `axis=0` (default): Sorts by the row labels (the index), so rows are reordered.  
- `axis=1`: Sorts by the column labels, so columns are reordered.  

Setting `axis=1` sorts the DataFrame's columns alphabetically by their names.  

Now, the columns are arranged alphabetically as `Age`, `Department`, `Name`, and `Score`.
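
The `axis` and `ascending` parameters can be combined. As a small sketch (the name `reverse_columns` is illustrative), this sorts the column labels in reverse alphabetical order:

```python
import pandas as pd

# Rebuild the sample data so the snippet runs on its own
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'Age': [24, 19, 22, 23, 24, 19, 22, 23],
    'Score': [85, 92, 88, 76, 90, 85, 88, 76],
    'Department': ['Sales', 'HR', 'IT', 'Finance', 'Sales', 'HR', 'IT', 'Finance'],
})

# axis=1 targets the column labels; ascending=False reverses the order
reverse_columns = df.sort_index(axis=1, ascending=False)
print(reverse_columns.columns.tolist())
# ['Score', 'Name', 'Department', 'Age']
```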

---
## Reshaping
In Pandas, reshaping data means transforming a DataFrame into a different structure to enhance its usability for data visualisation and analysis.

Pandas provides multiple methods like `pivot()`, `pivot_table()`, `stack()`, `unstack()` and `melt()` to reshape data. 

Here we’re going to see `pivot()` and `pivot_table()`.

---
## Use pivot() to Reshape the DataFrame
Let's see the original DataFrame:

In [9]:
print("Original Dataframe:\n", df)

Original Dataframe:
       Name  Age  Score Department
0    Alice   24     85      Sales
1      Bob   19     92         HR
2  Charlie   22     88         IT
3    Diana   23     76    Finance
4      Eve   24     90      Sales
5    Frank   19     85         HR
6    Grace   22     88         IT
7   Hannah   23     76    Finance


The `pivot()` method is used to reorganise data by specifying three key elements:

- `index`: Defines the rows of the new table. In this case, it will be Name.
- `columns`: Defines the columns. Here, it will be Department.
- `values`: Specifies the data to fill the cells. We'll use Score.

Let's create a table showing scores of individuals for different departments.

In [10]:
# Pivot the DataFrame
pivot_df = df.pivot(index='Name', columns='Department', values='Score')

# Replace NaN with an empty string
pivot_df = pivot_df.fillna('')

print("\nPivoted DataFrame with empty strings:\n", pivot_df)



Pivoted DataFrame with empty strings:
 Department Finance    HR    IT Sales
Name                                
Alice                           85.0
Bob                 92.0            
Charlie                   88.0      
Diana         76.0                  
Eve                             90.0
Frank               85.0            
Grace                     88.0      
Hannah        76.0                  


The `Name` column from the original DataFrame is now used as the row labels in the new table, as specified by `index='Name'`.  

Each distinct value in the `Department` column has been transformed into a column header, as defined by `columns='Department'`.  

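The cells are populated with the corresponding `Score` values. Because each person belongs to only one department, every other cell in that row is missing (`NaN`); here those values have been replaced with empty strings for better readability. Note that filling with strings turns the columns into `object` dtype, so keep the `NaN` values if you plan to do further calculations.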
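
`pivot()` only rearranges values, so each `index`/`columns` pair must occur at most once. The sketch below deliberately picks a combination that repeats (`Department` with `Age`) to show what happens; the flag `pivot_failed` is a hypothetical helper for the demonstration.

```python
import pandas as pd

# Rebuild the sample data so the snippet runs on its own
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'Age': [24, 19, 22, 23, 24, 19, 22, 23],
    'Score': [85, 92, 88, 76, 90, 85, 88, 76],
    'Department': ['Sales', 'HR', 'IT', 'Finance', 'Sales', 'HR', 'IT', 'Finance'],
})

pivot_failed = False
try:
    # Alice and Eve are both (Sales, 24), so this pair is not unique
    df.pivot(index='Department', columns='Age', values='Score')
except ValueError as err:
    pivot_failed = True
    print("pivot() failed:", err)
```

When pairs repeat, the values have to be aggregated rather than just rearranged, which is what `pivot_table()` is for.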

---
## Summarise Data Using pivot_table()
How can we calculate metrics, like averages, for specific groups?  
We’ll use Pandas’ `pivot_table()` to summarise data by grouping and calculating aggregate metrics.  
To calculate the average score for each department, we pass:  
- `index`: Column to group by (e.g., Department).  
- `values`: Column to calculate metrics for (e.g., Score).  
- `aggfunc`: Aggregation function (e.g., `mean`, `sum`, or `count`). If omitted, it defaults to `mean`.  

In [11]:
print("Original Dataframe:\n", df)

pivot_table_df = df.pivot_table(index='Department', values='Score', aggfunc='mean')
print("\nPivot Table (Average Score by Department):\n", pivot_table_df)

Original Dataframe:
       Name  Age  Score Department
0    Alice   24     85      Sales
1      Bob   19     92         HR
2  Charlie   22     88         IT
3    Diana   23     76    Finance
4      Eve   24     90      Sales
5    Frank   19     85         HR
6    Grace   22     88         IT
7   Hannah   23     76    Finance

Pivot Table (Average Score by Department):
             Score
Department       
Finance      76.0
HR           88.5
IT           88.0
Sales        87.5


The `Department` values are now the row labels (index), and the `Score` column holds the average score for each department.  

The `mean` function computes the average `Score` for each department, e.g., HR's average is 88.5, based on Bob (92) and Frank (85).
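
The other aggregation functions mentioned above work the same way. As a rough sketch (the names `total` and `count` are illustrative), the following computes the sum and the count per department and checks that dividing one by the other gives the mean:

```python
import pandas as pd

# Rebuild the relevant columns of the sample data so the snippet runs on its own
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'Score': [85, 92, 88, 76, 90, 85, 88, 76],
    'Department': ['Sales', 'HR', 'IT', 'Finance', 'Sales', 'HR', 'IT', 'Finance'],
})

# Total score and number of learners per department
total = df.pivot_table(index='Department', values='Score', aggfunc='sum')
count = df.pivot_table(index='Department', values='Score', aggfunc='count')

# sum / count reproduces the mean, e.g. HR: 177 / 2 = 88.5
print(total['Score'] / count['Score'])
```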