# Chapter 2 Subsetting and Sorting

In [None]:
import pandas as pd

# Read titanic dataset
tnc = pd.read_csv("./datasets/titanic.csv")

# Print dataframe
tnc.head()

In [None]:
# Info about columns, datatypes, non-null columns, and total size
tnc.info()

In [None]:
# Shape of the dataframe
tnc.shape

## Selecting column

We can select a single column from a Dataframe using either of the following two syntaxes:
    
1) **Dataframe.column_name**

2) **Dataframe["column_name"]** or **Dataframe['column_name']**

In [None]:
# Extracting age column from the data
tnc.age

In [None]:
# Extracting name column from the dataframe
tnc["name"]

### Why is the second syntax better than the first?

First syntax: **Dataframe.column_name**
    
Second syntax: **Dataframe["column_name"]**
    
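The dot operator syntax only works when the column name is a valid Python identifier that does not clash with an existing Dataframe attribute or method. Column names containing spaces or dots break it, whereas the square bracket syntax works for any column name.

For example, we cannot access the column "home.dest" using the dot operator.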

In [None]:
# Using first syntax
# tnc.home.dest

# Python reads this as the attribute 'dest' of 'tnc.home'.
# 'tnc' has no attribute named 'home', so the lookup raises an AttributeError

In [None]:
# Extracting "home.dest" column using the second syntax
tnc["home.dest"]
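The difference between the two syntaxes can be reproduced on a tiny frame. This is a minimal sketch: the rows below are made up for illustration and are not taken from titanic.csv.

```python
import pandas as pd

# Illustrative data only; these rows are not taken from titanic.csv
df = pd.DataFrame({
    "name": ["Passenger A", "Passenger B"],
    "age": [29.0, 35.0],
    "home.dest": ["City X", "City Y"],
})

# Square bracket syntax works for any column name
dest = df["home.dest"]

# Dot syntax fails: Python evaluates df.home first, and no such attribute exists
try:
    df.home.dest
    dot_access_error = None
except AttributeError as err:
    dot_access_error = err

print(dest.tolist())
print(type(dot_access_error).__name__)
```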

## Selecting multiple columns

We can select multiple columns using the square bracket syntax:

**Dataframe[["col1_name", "col2_name", ...., "colN_name"]]**

In [None]:
# Describe the statistical information about the columns name, age and home.dest of first 25 members in the titanic dataframe

# Extract first 25 members
tnc_25 = tnc.head(25)

# Select the columns name, age and home.dest using square bracket syntax
tnc_cols = tnc_25[["name", "age", "home.dest"]]

# Describe the columns
tnc_cols.describe()

# This can also be achieved in a single line using Method Chaining
# tnc.head(25)[["name", "age", "home.dest"]].describe()
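One detail worth checking when describing a mix of text and numeric columns: by default, describe() on such a frame summarises only the numeric columns. The sketch below uses made-up rows (not the titanic data) to show this, along with the `include="all"` option that also summarises text columns.

```python
import pandas as pd

# Illustrative data only; these rows are not taken from titanic.csv
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "age": [22.0, 38.0, 26.0],
    "home.dest": ["X", "Y", "Z"],
})

subset = df[["name", "age", "home.dest"]]  # a list of names returns a DataFrame
single = df["age"]                         # a single name returns a Series

numeric_summary = subset.describe()            # numeric columns only by default
full_summary = subset.describe(include="all")  # text columns are summarised too

print(list(numeric_summary.columns))
print(sorted(full_summary.columns))
```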

## Index (or) Label

An index (or label) identifies each record in a Dataframe. By default, Pandas assigns integer labels starting from 0, so that every record can be uniquely identified.

![Index](http://res.cloudinary.com/dtwgxcqkr/image/upload/v1710049645/Data%20Wrangling/indexes.png)

To see the current index of a Dataframe, we can use its **index property**.

Ex: df.index

In [None]:
# Print titanic dataframe
tnc

In [None]:
# Print index of titanic
tnc.index

## Set index

We can create our own index using the method:
    
**Dataframe.set_index(col_name)**

Ex: df.set_index("name")

**Note that when a column is made the index, it is no longer a column of the dataframe and instead acts as the index of the dataframe.**

**Also note that this is not an in place operation: it returns a new Dataframe and leaves the original unchanged.**

In [None]:
# Setting an index
tnc_new = tnc.set_index("name")

In [None]:
# Print new titanic dataframe
tnc_new

In [None]:
# Print index of new titanic
tnc_new.index
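The two notes above can be verified on a small frame. This is a minimal sketch with made-up rows, not the titanic data.

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame({"name": ["A", "B"], "age": [22.0, 38.0]})

indexed = df.set_index("name")  # returns a new DataFrame

# The original still has 'name' as a column and keeps its default index
original_has_name = "name" in df.columns
original_index = df.index.tolist()

# In the new frame 'name' has moved from the columns to the index
new_has_name = "name" in indexed.columns
new_index = indexed.index.tolist()

print(original_has_name, original_index)
print(new_has_name, new_index)
```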

## To set an index in place

To set an index in place we need to **set the argument 'inplace' of set_index() method to True.**

Ex: df.set_index(col, inplace=True)

In [None]:
# Read a new dataset 'world happiness report 2021'
countries = pd.read_csv("./datasets/world-happiness-report-2021.csv")

# Print the dataframe
countries.head()

In [None]:
# Select 'Healthy life expectancy' column of dataframe
countries["Healthy life expectancy"]

# This prints the 'Healthy life expectancy' values, but not which country each value belongs to.
# Setting the country name as the index solves this.

In [None]:
# Set Country name column as index in place
countries.set_index("Country name", inplace=True)

# Print the dataframe
countries

In [None]:
# Select 'Healthy life expectancy' column of dataframe
countries["Healthy life expectancy"]

In [None]:
# Plot the 'Healthy life expectancy' of first seven countries
countries.head(7)["Healthy life expectancy"].plot()
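Two consequences of setting the index in place are easy to miss: the call itself returns None, and a column selected afterwards is keyed by the new index, so a single value can be looked up by its label. A minimal sketch, using invented country names and values rather than the world happiness data:

```python
import pandas as pd

# Illustrative values only; not taken from the world happiness dataset
scores = pd.DataFrame({
    "Country name": ["Aland", "Borduria", "Cascadia"],
    "Healthy life expectancy": [70.0, 65.5, 72.3],
})

# With inplace=True the frame itself is modified and the call returns None
result = scores.set_index("Country name", inplace=True)

# The selected column is now a Series keyed by country name
life = scores["Healthy life expectancy"]
borduria_value = life["Borduria"]

print(result)
print(borduria_value)
```

Because the call returns None, writing `scores = scores.set_index(..., inplace=True)` would replace the frame with None.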

## Sorting

We can sort the records of a dataframe in ascending or descending order of a specified column. To sort the records in ascending order we use the method:

**Dataframe.sort_values(col_name)**

Ex: df.sort_values("name")

To sort the records in descending order we need to set the argument 'ascending' of sort_values() method to False.

Ex: df.sort_values("name", ascending=False)

**Note that this is not an in place operation and hence returns a copy of the resultant Dataframe.**

To sort the records in ascending/descending order in place we need to set the argument 'inplace' of sort_values() method to True.

Ex: df.sort_values("name", ascending=False, inplace=True)

In [None]:
# Sort the countries dataframe in ascending order of Social support
countries.sort_values("Social support")

In [None]:
# Sort the countries dataframe in descending order of Freedom to make life choices
countries.sort_values("Freedom to make life choices", ascending=False)

In [None]:
# Sort the countries dataframe in descending order of Generosity in place
countries.sort_values("Generosity", ascending=False, inplace=True)

# Print the dataframe
countries
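The three variants of sort_values() described above can be compared side by side. This is a minimal sketch on made-up numbers, not the world happiness data.

```python
import pandas as pd

# Illustrative values only
df = pd.DataFrame({"Generosity": [0.10, -0.05, 0.30]}, index=["A", "B", "C"])

asc = df.sort_values("Generosity")                    # ascending is the default
desc = df.sort_values("Generosity", ascending=False)  # descending order

# Neither call above changed df itself
order_before = df.index.tolist()

# inplace=True reorders df directly
df.sort_values("Generosity", ascending=False, inplace=True)
order_after = df.index.tolist()

print(asc.index.tolist(), desc.index.tolist())
print(order_before, order_after)
```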

## Sorting by index

The index is not a column of the dataframe, so to sort by it we use a dedicated method:
    
**Dataframe.sort_index()**

Ex: df.sort_index()
    
**Note: By default the argument 'inplace' is False and the argument 'ascending' is True, just as in sort_values()**

In [None]:
# Sort the countries by its index
countries.sort_index()
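The defaults of sort_index() can be checked with a short sketch. The frame below is invented for illustration.

```python
import pandas as pd

# Illustrative values only
df = pd.DataFrame({"score": [1, 2, 3]}, index=["India", "Chile", "Kenya"])

by_label = df.sort_index()                 # ascending by default
reverse = df.sort_index(ascending=False)   # descending order of labels

# sort_index() is not in place by default, so df keeps its original order
unchanged = df.index.tolist()

print(by_label.index.tolist())
print(reverse.index.tolist())
print(unchanged)
```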

## Sorting by multiple columns

We can sort the dataframe by multiple columns using the same syntax. Instead of passing one column, we pass a list of columns.

**Syntax: Dataframe.sort_values([col1_name, col2_name, ..., colN_name])**

Ex: df.sort_values(["name", "age"])

In [None]:
# Read houses dataset
houses = pd.read_csv("./datasets/kc_house_data.csv")

# Print the dataframe
houses.head()

In [None]:
# Set id as the index of the dataframe in place
houses.set_index("id", inplace=True)

# Print the dataframe
houses

In [None]:
# Sort the dataframe in descending order of price
houses.sort_values("price", ascending=False)

In [None]:
# Sort the dataframe in descending order of bedrooms
houses.sort_values("bedrooms", ascending=False)

In [None]:
# Sort the dataframe in descending order of bedrooms and if there is a tie sort them in the descending order of their price
houses.sort_values(["bedrooms", "price"], ascending=False)
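When sorting by several columns, 'ascending' can also be given as a list with one value per column, so each column gets its own direction. A minimal sketch on made-up rows, not the kc_house_data values:

```python
import pandas as pd

# Illustrative values only
homes = pd.DataFrame(
    {"bedrooms": [3, 4, 3, 4], "price": [300000, 450000, 350000, 400000]},
    index=[101, 102, 103, 104],
)

# A single bool applies to every column: bedrooms descending, ties by price descending
both_desc = homes.sort_values(["bedrooms", "price"], ascending=False)

# A list sets the direction per column: bedrooms descending, ties by price ascending
mixed = homes.sort_values(["bedrooms", "price"], ascending=[False, True])

print(both_desc.index.tolist())
print(mixed.index.tolist())
```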

## Selecting rows

We can select rows from a Dataframe by their label or by their position using the following indexers:

1) **Dataframe.loc["label"]**: Access a group of records based on their label. 

Ex: df.loc["group1"], returns all the records with label "group1".

2) **Dataframe.iloc[position]**: Access a group of records based on their position.

Ex: df.iloc[0], returns the first row (position 0).

**Note: loc looks up labels whereas iloc looks up integer positions**

We can also make use of Python slicing to extract records within a range:

3) **Dataframe.loc[start:end] (or) Dataframe.iloc[start:end]**

Ex: df.loc["group1":"group3"], returns all the records of the dataframe from label "group1" up to and including label "group3".

Ex: df.iloc[0:9], returns the records at positions 0 to 8. Unlike a loc slice, an iloc slice excludes the end position.
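The difference in how loc and iloc treat the end of a slice is easiest to see on a small frame. This is a minimal sketch with made-up values.

```python
import pandas as pd

# Illustrative values only
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_label = df.loc["a":"c"]   # label slice: the end label "c" is included
by_position = df.iloc[0:2]   # position slice: the end position 2 is excluded

# With the default integer index, the same numbers give different results
nums = pd.DataFrame({"value": [10, 20, 30, 40]})
loc_rows = nums.loc[0:2]     # labels 0, 1 and 2
iloc_rows = nums.iloc[0:2]   # positions 0 and 1

print(by_label.index.tolist(), by_position.index.tolist())
print(len(loc_rows), len(iloc_rows))
```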

In [None]:
# Extract details about country 'India'
countries.loc["India"]

In [None]:
# Extract details about country 'India' as a one-row dataframe (displayed horizontally)
countries.loc[["India"]]

In [None]:
# Extract details about any 5 Asian countries
countries.loc[["India", "Pakistan", "Bangladesh", "Sri Lanka", "Nepal"]]
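The vertical and horizontal displays seen above come from the type that loc returns: a single label gives a Series, while a list of labels gives a Dataframe, even when the list holds only one label. A minimal sketch with invented values, not the world happiness data:

```python
import pandas as pd

# Illustrative values only
df = pd.DataFrame(
    {"score": [1.0, 2.0], "rank": [2, 1]},
    index=["India", "Nepal"],
)

row_series = df.loc["India"]    # single label: a Series indexed by column name
row_frame = df.loc[["India"]]   # list of labels: a DataFrame with one row

print(type(row_series).__name__, row_series.index.tolist())
print(type(row_frame).__name__, row_frame.shape)
```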

In [None]:
# Extract the rows labelled 0 through 5 (a loc slice includes the end label)
tnc.loc[0:5]

In [None]:
# Extract details of countries from 'America' to 'India' when sorted alphabetically
countries.sort_index().loc["America":"India"]

In [None]:
# Extract details about the 20th country from the last when sorted alphabetically
# Positions start at 0, so the 20th record is at position 19
countries.sort_index(ascending=False).iloc[19]

In [None]:
# Extract details about the 20th country from the last when sorted alphabetically, as a one-row dataframe
countries.sort_index(ascending=False).iloc[[19]]

In [None]:
# Extract details about the countries ranked 30 to 40 (inclusive) when sorted by 'Logged GDP per capita'
# Ranks 30 to 40 are positions 29 to 39, and an iloc slice excludes its end position
countries.sort_values("Logged GDP per capita").iloc[29:40]