<img src="https://raw.githubusercontent.com/AkulAshray/Pandas-tutorial/main/Corndel_Logos_RGB.png" style="float: left; margin: 20px; height: 55px"><h1 style=" font-size:1.5em; font-family:Verdana"> Pandas Fundamentals I </h1>

<hr style="border: 0.5px solid #504845;">

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Pandas is a powerful data manipulation library in Python. It provides versatile data structures and functions that make it easy to work with structured data. Pandas offers much of the same functionality as SQL or Excel, allowing you to work with various data sources, ranging from CSV and text files to Microsoft Excel files, SQL databases, and more. One of its core advantages is the ability to load this data into a DataFrame, which represents tabular data with rows and columns, similar to how data is presented in Excel, but within Python.
</p>

![image.png](attachment:image.png)

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Additionally, we can:
</p>

<ul style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
    <li>Perform operations on the data to uncover insights</li>
    <li>Use functions from other libraries (like NumPy) to further process data</li>
</ul>


<h1 style=" font-size:1.4em; font-family:Verdana"> Fundamental Data Structures in Pandas</h1>

<hr style="border: 0.5px solid #504845;">


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
We will first import the Pandas library into our Python environment.
</p>


In [1]:
# 'pd' is the conventional alias for Pandas
import pandas as pd

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
When we speak of data structures, we are essentially referring to how data is organised and stored in our system to allow for efficient access and manipulation. Pandas provides three key data structures for organising and working with data:
</p>

<ul style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
    <li><strong>Series:</strong> A one-dimensional labeled array, which you can think of as a single column of data.</li>
    <li><strong>DataFrame:</strong> A two-dimensional structure that organises data into rows and columns, much like a table.</li>
    <li><strong>Index:</strong> A collection of labels that identify the rows or columns within a DataFrame or Series.</li>
</ul>
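
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
As a minimal sketch of how the three structures relate, the cell below builds a tiny DataFrame and pulls a Series and two Index objects out of it. The data and the names <code>df_demo</code>, <code>cost_column</code>, <code>row_labels</code> and <code>column_labels</code> are made up for illustration and are not part of the expenses dataset.
</p>

```python
import pandas as pd

# Hypothetical data, invented purely for illustration
df_demo = pd.DataFrame(
    {"Item": ["Train", "Hotel", "Flight"], "Cost": [33.9, 83.0, 360.57]}
)

cost_column = df_demo["Cost"]      # a single column of a DataFrame is a Series
row_labels = df_demo.index         # the row labels are stored in an Index
column_labels = df_demo.columns    # the column labels are stored in an Index too

print(type(df_demo).__name__)
print(type(cost_column).__name__)
print(isinstance(row_labels, pd.Index), isinstance(column_labels, pd.Index))
```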

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The image below shows the first few rows of the expenses dataset we encountered in the previous tutorial.
</p>

![image.png](attachment:image.png)


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Notice how the DataFrame is a two-dimensional object, containing both rows and columns. The Series shown here is a single column of this DataFrame, specifically the <code>Travel Type</code> column. Both structures share an index: in this case, the row labels ranging from 0 to 4.
</p>

<h2 style=" font-size:1.2em; font-family:Verdana"> Pandas Data Structure: Series </h2>

---

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
One way Python represents data is through <a href="https://www.tomasbeuzen.com/python-programming-for-data-science/chapters/chapter5-numpy.html?highlight=arrays#numpy-arrays"> arrays</a>, which are collections of elements, typically of the same data type, arranged in a structured format.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
In the context of Pandas, a Series can be thought of as a labeled one-dimensional array-like structure where each element is indexed. It contains both:
</p>

<ul style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
    <li>A sequence of values of the same type.</li>
    <li>A sequence of data labels called the index.</li>
</ul>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
You can read more about Series <a href="https://www.tomasbeuzen.com/python-programming-for-data-science/chapters/chapter7-pandas.html#pandas-series">here</a>.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
In the cell below, we create a Series named <code>s</code>.
</p>

In [2]:
s = pd.Series(['Welcome', 'To', 'The', 'Data Analyst', 'Level 4', 'Apprenticeship'])
s

0           Welcome
1                To
2               The
3      Data Analyst
4           Level 4
5    Apprenticeship
dtype: object

In [3]:
# Accessing data values within the Series
s.values

array(['Welcome', 'To', 'The', 'Data Analyst', 'Level 4',
       'Apprenticeship'], dtype=object)

In [4]:
# Accessing the index of the Series
s.index

RangeIndex(start=0, stop=6, step=1)

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
By default, the index of a Series is a sequence of integers beginning from 0. Optionally, a list of custom labels can be passed to the <code>index</code> argument.
</p>

In [5]:
s = pd.Series(['Alex', 'John', 'Bruce'], index=['a','b','c'])
s

a     Alex
b     John
c    Bruce
dtype: object

In [6]:
s.index

Index(['a', 'b', 'c'], dtype='object')

<h2 style=" font-size:1.2em; font-family:Verdana"> Selection in Series </h2>


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
There are three primary methods we can use when we want to select a single value or a set of values from a Series.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
These are:
</p>

<ul style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
    <li><strong>Single label selection:</strong> Selects a single value using a specific label.</li>
    <li><strong>Multiple-label selection:</strong> Selects multiple values using a list of labels.</li>
    <li><strong>Conditional Selection:</strong> Selects values that meet a specific condition.</li>
</ul>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
To demonstrate this, we will define the Series <code>ser</code>, which has values <code>[10, 20, 30, 40]</code> and set the index labels as <code>['a', 'b', 'c', 'd']</code>.
</p>


In [7]:
# Creating a Series
ser = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
ser

a    10
b    20
c    30
d    40
dtype: int64

<h2 style=" font-size:1.2em; font-family:Verdana"> Selection in Series: Using a single label </h2>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
To select a value by its index label, such as the value at index 'a', we do the following:
</p>

In [8]:
# We return the value stored at the index label "a"
ser['a']

10

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Similarly, we can retrieve the value for index label 'd':
</p>

In [9]:
# We return the value stored at the index label "d"
ser["d"]

40

<h2 style=" font-size:1.2em; font-family:Verdana"> Selection in Series: Using a list of labels </h2>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
To retrieve multiple values using a list of labels, like 'a' and 'c', we can do this:
</p>

In [10]:
# We return a Series of the values stored at the index labels "a" and "c"
ser[["a", "c"]] 

a    10
c    30
dtype: int64

<h2 style=" font-size:1.2em; font-family:Verdana"> Selection in Series: Using a filtering condition</h2>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
One of the most useful ways to select data is by applying a condition. We first apply a boolean (True/False) condition to the Series, which returns a new Series of boolean values.
</p>



In [11]:
# Applying a condition to the Series
ser > 20

a    False
b    False
c     True
d     True
dtype: bool

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
This boolean Series shows whether each element meets the condition. We can then use it to filter the original Series and retrieve only the values that satisfy the condition.
</p>


```python
series_data[condition]  # the condition we want to apply goes inside the brackets
```

In [12]:
ser[ser > 20]

c    30
d    40
dtype: int64

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
This approach lets us easily extract specific subsets of data based on conditions.
</p>
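
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The same pattern extends to more than one condition. The sketch below rebuilds <code>ser</code> so that it runs on its own; the names <code>mask</code>, <code>between</code> and <code>either</code> are illustrative. Conditions are combined with <code>&amp;</code> (and) or <code>|</code> (or), and each one is wrapped in its own parentheses.
</p>

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# The boolean Series can be stored in a variable and reused
mask = ser > 20
print(ser[mask])

# Each condition gets its own parentheses when combined with & or |
between = ser[(ser > 10) & (ser < 40)]   # values strictly between 10 and 40
either = ser[(ser < 20) | (ser > 30)]    # values below 20 or above 30

print(between)
print(either)
```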



<h2 style=" font-size:1.2em; font-family:Verdana"> Pandas Data Structure: DataFrames </h2>

---

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
A DataFrame can be thought of as a collection of Series that all share the same index. The syntax for creating a DataFrame is:
</p>


```python
pandas.DataFrame(data, index, columns)
```
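
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
As a small sketch of that signature, the cell below passes all three arguments by name. The values and the name <code>df_args</code> are invented for illustration.
</p>

```python
import pandas as pd

df_args = pd.DataFrame(
    data=[[1, "one"], [2, "two"], [3, "three"]],   # one nested list per row
    index=["x", "y", "z"],                         # row labels
    columns=["Number", "Description"],             # column labels
)
print(df_args)
```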

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
There are several ways to create a DataFrame. Below, we cover some of the most common methods.
</p>

<h2 style=" font-size:1.2em; font-family:Verdana"> Creating a DataFrame: From a CSV or Excel file </h2>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
In many cases, data is stored in CSV or Excel files. We can import a CSV file into a DataFrame by specifying the file path as an argument.
</p>

```python
pd.read_csv('path_to_file/filename.csv')
```
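
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
To try <code>pd.read_csv</code> without a file on disk, the sketch below reads CSV text from an in-memory buffer instead of a path. The three-column data and the names <code>csv_text</code> and <code>df_csv</code> are hypothetical, chosen only to resemble the expenses data.
</p>

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Itinerary ID,Travel Type,Total
1001,Train,33.90
1002,Hotel,83.00
"""

# read_csv accepts a file path or a file-like object such as StringIO
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
print(df_csv.dtypes)
```

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The first line of the text is used as the column names, and Pandas infers a data type for each column. Excel files can be loaded in a similar way with <code>pd.read_excel</code>, which typically needs an extra engine package installed.
</p>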

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Let's revisit the expenses dataset we used in the previous lesson.
</p>

In [13]:
expenses = pd.read_csv('2019-20.csv')
expenses

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
0,103533993,09/03/2020,Abigail Marshall Katung,Train,,33.90,33.90,0.00,Leeds - Coventry (Advance Single),Conference - as an attendee
1,103533626,11/03/2020,Abigail Marshall Katung,Train,,66.20,66.20,0.00,Coventry - Leeds (Off-Peak Single),Conference - as an attendee
2,102425812,24/10/2019,Andrew Scopes,Hotel,1.0,83.00,69.17,13.83,"ibis Cambridge Central Station, 24/10/2019, 1 ...",Meeting with other public sector organisations
3,102425835,24/10/2019,Andrew Scopes,Train,,26.00,26.00,0.00,Leeds - Cambridge (Advance Single),Meeting with other public sector organisations
4,102425836,25/10/2019,Andrew Scopes,Train,,50.80,50.80,0.00,Cambridge - Leeds (Super Off-Peak Single (Onli...,Meeting with other public sector organisations
...,...,...,...,...,...,...,...,...,...,...
106,101486374,13/06/2019,Rebecca Charlwood,Train,,23.85,23.85,0.00,Leeds - Newcastle (Advance Single),Member portfolio/Council business (Elected Mem...
107,101486375,13/06/2019,Rebecca Charlwood,Train,,38.50,38.50,0.00,Newcastle - Leeds (Advance Single),Member portfolio/Council business (Elected Mem...
108,102144703,03/09/2019,Rebecca Charlwood,Train,,6.50,6.50,0.00,Leeds - Wakefield Westgate (Anytime Day Return),Member portfolio/Council business (Elected Mem...
109,101470428,04/06/2019,Stewart Golton,Train,,69.00,69.00,0.00,London Kings Cross - Leeds (Advance Single),Member portfolio/Council business (Elected Mem...


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The code snippet above assigns the DataFrame to the <code>expenses</code> variable. When we examine this DataFrame, we can see that it contains 111 rows and 10 columns, with the columns capturing different details about councillors' travel expenses.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
These columns include attributes such as Itinerary ID, Travel Date, Traveller Name, Travel Type, Trip Length (Days), Total £, Net £, Tax £, Detail, and Reason for Travel. Each row corresponds to a specific record, providing detailed information about a particular trip.
</p>

<h2 style=" font-size:1.2em; font-family:Verdana"> Creating a DataFrame: Using a List and Column Name(s) </h2>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Next, we'll explore how to create a DataFrame from our own data.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
    In the first example, we'll create a DataFrame with just one column named <strong>"Numbers"</strong>:
</p>

In [14]:
# Creating a DataFrame with a single column
df_single_column = pd.DataFrame([1,2,3], columns=['Numbers'])
print(df_single_column)

   Numbers
0        1
1        2
2        3


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
In the second example, we'll create a DataFrame with two columns: "Number" and "Description". Here, we use a 2D list to represent the data, where each nested list forms a row:
</p>

In [15]:
# Creating a DataFrame with two columns
df_list = pd.DataFrame([[1, "one"], [2, "two"]], columns = ["Number", "Description"])
df_list

Unnamed: 0,Number,Description
0,1,one
1,2,two


<h2 style=" font-size:1.2em; font-family:Verdana"> Creating a DataFrame: Using a Dictionary </h2>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Another common way to create a DataFrame is by using a <a href="https://www.w3schools.com/PYTHON/python_dictionaries.asp"> dictionary</a>. In this approach, the dictionary's keys represent the column names, and the values hold the data for each column.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Below is an example of how to create a DataFrame using a dictionary:
</p>


In [16]:
data_columns = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df_from_dict = pd.DataFrame(data_columns)
df_from_dict

Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Charlie,35,Chicago


<h2 style=" font-size:1.2em; font-family:Verdana"> Pandas Data Structure: Indices </h2>

---

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The index of a DataFrame does not need to consist of integers, nor does it have to be unique. For example, you can set the index of the <strong>expenses</strong> DataFrame to the <strong><i>Itinerary ID</i></strong> column:
</p>

In [17]:
# Creating a DataFrame from a CSV file and specifying the index column
expenses = pd.read_csv("2019-20.csv", index_col = "Itinerary ID")
expenses.head()

Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
103533993,09/03/2020,Abigail Marshall Katung,Train,,33.9,33.9,0.0,Leeds - Coventry (Advance Single),Conference - as an attendee
103533626,11/03/2020,Abigail Marshall Katung,Train,,66.2,66.2,0.0,Coventry - Leeds (Off-Peak Single),Conference - as an attendee
102425812,24/10/2019,Andrew Scopes,Hotel,1.0,83.0,69.17,13.83,"ibis Cambridge Central Station, 24/10/2019, 1 ...",Meeting with other public sector organisations
102425835,24/10/2019,Andrew Scopes,Train,,26.0,26.0,0.0,Leeds - Cambridge (Advance Single),Meeting with other public sector organisations
102425836,25/10/2019,Andrew Scopes,Train,,50.8,50.8,0.0,Cambridge - Leeds (Super Off-Peak Single (Onli...,Meeting with other public sector organisations


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
    This displays a DataFrame where each row is identified by its <i>Itinerary ID</i>. Each column provides details about the councillors' trips.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
We can also change the index to another column. For example, to set the <i>Travel Date</i> as the index, we first reset the index to the default numeric values:
</p>
    
 
```python
expenses.reset_index(inplace = True)
```
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
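This resets the index to the default integer labels, here 0 to 110. Then we set the index using the <code>set_index()</code> method. Note that <code>set_index()</code> returns a new DataFrame by default; <code>expenses</code> itself is unchanged unless the result is assigned back.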
</p>


```python
expenses.set_index('Travel Date')
```
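
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The two steps can be tried on a small frame first. In this sketch the data and the names <code>df_idx</code> and <code>by_date</code> are made up; it shows that <code>reset_index()</code> turns the old index back into an ordinary column and that <code>set_index()</code> hands back a new DataFrame.
</p>

```python
import pandas as pd

# Hypothetical data: start with "ID" as the index
df_idx = pd.DataFrame(
    {"ID": [101, 102, 103], "Date": ["01/07/2019", "02/07/2019", "02/07/2019"]}
).set_index("ID")

# Step 1: move "ID" back into the columns and restore the default 0, 1, 2 index
df_idx = df_idx.reset_index()

# Step 2: set_index returns a new DataFrame; df_idx keeps its default index
by_date = df_idx.set_index("Date")

print(df_idx)
print(by_date)
```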

In [18]:
expenses.reset_index(inplace = True)
expenses.set_index('Travel Date')

Travel Date,Itinerary ID,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
09/03/2020,103533993,Abigail Marshall Katung,Train,,33.90,33.90,0.00,Leeds - Coventry (Advance Single),Conference - as an attendee
11/03/2020,103533626,Abigail Marshall Katung,Train,,66.20,66.20,0.00,Coventry - Leeds (Off-Peak Single),Conference - as an attendee
24/10/2019,102425812,Andrew Scopes,Hotel,1.0,83.00,69.17,13.83,"ibis Cambridge Central Station, 24/10/2019, 1 ...",Meeting with other public sector organisations
24/10/2019,102425835,Andrew Scopes,Train,,26.00,26.00,0.00,Leeds - Cambridge (Advance Single),Meeting with other public sector organisations
25/10/2019,102425836,Andrew Scopes,Train,,50.80,50.80,0.00,Cambridge - Leeds (Super Off-Peak Single (Onli...,Meeting with other public sector organisations
...,...,...,...,...,...,...,...,...,...
13/06/2019,101486374,Rebecca Charlwood,Train,,23.85,23.85,0.00,Leeds - Newcastle (Advance Single),Member portfolio/Council business (Elected Mem...
13/06/2019,101486375,Rebecca Charlwood,Train,,38.50,38.50,0.00,Newcastle - Leeds (Advance Single),Member portfolio/Council business (Elected Mem...
03/09/2019,102144703,Rebecca Charlwood,Train,,6.50,6.50,0.00,Leeds - Wakefield Westgate (Anytime Day Return),Member portfolio/Council business (Elected Mem...
04/06/2019,101470428,Stewart Golton,Train,,69.00,69.00,0.00,London Kings Cross - Leeds (Advance Single),Member portfolio/Council business (Elected Mem...


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Now, the rows are labeled by their Travel Date, making it easier to access data based on the date of the trip. </p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
It's important to note that an index doesn't have to be unique or numeric. You can use meaningful labels like names or dates, and it's also possible for the same index value to appear more than once in the DataFrame. This flexibility allows for more customized data handling.</p>
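
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
A short sketch of what a repeated index label means in practice, using invented data and the illustrative names <code>trips</code>, <code>same_day</code> and <code>one_day</code>. It relies on the <code>.loc</code> accessor, which is covered in the section on label-based indexing.
</p>

```python
import pandas as pd

# Hypothetical trips: two of them share the same date label
trips = pd.DataFrame(
    {"Traveller": ["Ann", "Ben", "Cat"], "Total": [26.0, 83.0, 50.8]},
    index=["24/10/2019", "24/10/2019", "25/10/2019"],
)

print(trips.index.is_unique)          # reports whether every label is distinct

same_day = trips.loc["24/10/2019"]    # repeated label: every matching row comes back
one_day = trips.loc["25/10/2019"]     # label appears once: a single row comes back

print(same_day)
print(one_day)
```

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Looking up a repeated label gives a DataFrame with one row per match, while a label that appears once gives a single row as a Series.
</p>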

![image-2.png](attachment:image-2.png)


<h1 style=" font-size:1.4em; font-family:Verdana"> Slicing in DataFrames </h1>

<hr style="border: 0.5px solid #504845;">

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Now that we've explored DataFrames, let's take a closer look at how to work with them efficiently. The DataFrame class has a wide range of methods, but one of the most useful tasks is extracting subsets of data. This process is called slicing.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
There are several ways to create subsets of, or slice, a DataFrame to get specific data. Some of these include:
</p>

<ul style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
    <li>Retrieving the first or last <code>n</code> rows in the DataFrame.</li>
    <li>Selecting data based on a certain label.</li>
    <li>Accessing data at a particular position.</li>
</ul>

<h2 style=" font-size:1.2em; font-family:Verdana"> Extracting data with <code>.head</code> and <code>.tail</code> </h2>

---

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
One of the simplest ways to slice a DataFrame is by using the <code>.head()</code> and <code>.tail()</code> methods to grab the first or last few rows.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
    To get the first <i>n</i> rows of a DataFrame, use <code>df.head(n)</code>.
</p>

```python
# Extracting the first n rows of a DataFrame
df_first_n_rows = df.head(n)
print(df_first_n_rows)
```



In [19]:
expenses.head(10)

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
0,103533993,09/03/2020,Abigail Marshall Katung,Train,,33.9,33.9,0.0,Leeds - Coventry (Advance Single),Conference - as an attendee
1,103533626,11/03/2020,Abigail Marshall Katung,Train,,66.2,66.2,0.0,Coventry - Leeds (Off-Peak Single),Conference - as an attendee
2,102425812,24/10/2019,Andrew Scopes,Hotel,1.0,83.0,69.17,13.83,"ibis Cambridge Central Station, 24/10/2019, 1 ...",Meeting with other public sector organisations
3,102425835,24/10/2019,Andrew Scopes,Train,,26.0,26.0,0.0,Leeds - Cambridge (Advance Single),Meeting with other public sector organisations
4,102425836,25/10/2019,Andrew Scopes,Train,,50.8,50.8,0.0,Cambridge - Leeds (Super Off-Peak Single (Onli...,Meeting with other public sector organisations
5,101476745,01/07/2019,Barry Anderson,Hotel,3.0,255.0,212.5,42.5,"The Hop Inn, 01/07/2019, 3 nights",Conference - as an attendee
6,101571656,10/06/2019,Debra Coupar,Train,,265.0,265.0,0.0,Leeds - London Kings Cross (Anytime Return),Member portfolio/Council business (Elected Mem...
7,101477138,02/07/2019,Debra Coupar,Hotel,2.0,191.35,159.46,31.89,"Travelodge Bournemouth Seafront, 02/07/2019, 2...",Conference - as an attendee
8,101516923,02/07/2019,Debra Coupar,Train,,14.3,14.3,0.0,Southampton Airport Parkway - Bournemouth (Any...,Conference - as an attendee
9,101516922,02/07/2019,Debra Coupar,Flight,2.0,360.57,360.57,0.0,02-07-2019 - Leeds Bradford International Airp...,Conference - as an attendee


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Similarly, calling <code>df.tail(n)</code> allows us to extract the last <code>n</code> rows in the <code>DataFrame</code>.
</p>
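
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The two methods mirror each other, and both fall back to 5 rows when <code>n</code> is omitted. A minimal sketch on an invented eight-row frame (the names <code>df_rows</code>, <code>first_default</code>, <code>first_two</code> and <code>last_three</code> are illustrative):
</p>

```python
import pandas as pd

# Hypothetical frame with eight rows, labelled 0 to 7
df_rows = pd.DataFrame({"value": range(8)})

first_default = df_rows.head()   # no argument: n defaults to 5
first_two = df_rows.head(2)      # first two rows
last_three = df_rows.tail(3)     # last three rows, original labels kept

print(first_default)
print(first_two)
print(last_three)
```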

In [20]:
# Extract the last 5 rows of the DataFrame

expenses.tail(5)

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
106,101486374,13/06/2019,Rebecca Charlwood,Train,,23.85,23.85,0.0,Leeds - Newcastle (Advance Single),Member portfolio/Council business (Elected Mem...
107,101486375,13/06/2019,Rebecca Charlwood,Train,,38.5,38.5,0.0,Newcastle - Leeds (Advance Single),Member portfolio/Council business (Elected Mem...
108,102144703,03/09/2019,Rebecca Charlwood,Train,,6.5,6.5,0.0,Leeds - Wakefield Westgate (Anytime Day Return),Member portfolio/Council business (Elected Mem...
109,101470428,04/06/2019,Stewart Golton,Train,,69.0,69.0,0.0,London Kings Cross - Leeds (Advance Single),Member portfolio/Council business (Elected Mem...
110,101476748,01/07/2019,Stewart Golton,Hotel,3.0,255.0,212.5,42.5,"The Hop Inn, 01/07/2019, 3 nights",Conference - as an attendee


<h2 style=" font-size:1.2em; font-family:Verdana"> Label-based Indexing with <code>loc</code> </h2>

---

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
For the more complex task of extracting data with specific column or index labels, we can use <code>.loc</code>. The <code>.loc</code> accessor allows us to specify the labels of the rows and columns we wish to extract. The row labels (commonly referred to as the index) are the bold text on the far left of a DataFrame, while the column labels are the column names found at the top of a DataFrame.
</p>

![image.png](attachment:image.png)

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
To grab data with <code>.loc</code>, we must specify the row and column label(s) where the data exists. Inside the square brackets, the row labels are the first argument to <code>.loc</code>; the column labels are the second.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Arguments to <code>.loc</code> can be:
</p>

<ul style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
    <li>A single value.</li>
    <li>A slice.</li>
    <li>A list.</li>
</ul>
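
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The three kinds of argument can be compared side by side on a small frame. This is only a sketch with invented data; <code>df_loc</code>, <code>single</code>, <code>sliced</code> and <code>listed</code> are illustrative names.
</p>

```python
import pandas as pd

# Hypothetical data with the default 0..3 row labels
df_loc = pd.DataFrame(
    {
        "Name": ["Ann", "Ben", "Cat", "Dan"],
        "Type": ["Train", "Hotel", "Train", "Flight"],
        "Total": [26.0, 83.0, 50.8, 360.57],
    }
)

single = df_loc.loc[1, "Name"]                    # single values: one cell
sliced = df_loc.loc[1:2, "Name":"Type"]           # slices: both endpoints are included
listed = df_loc.loc[[0, 3], ["Name", "Total"]]    # lists: exactly the labels named

print(single)
print(sliced)
print(listed)
```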

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
For example, to select a single value, we can select the row labeled 0 and the column labeled <strong>Traveller Name</strong> from the expenses DataFrame.
</p>

In [21]:
expenses.loc[0, 'Traveller Name']

'Abigail Marshall Katung'

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Keep in mind that passing a single value for just one of the two arguments will produce a Series. Below, we've extracted a subset of the <strong>"Total £"</strong> column as a Series by passing a list of row labels and a single column label.
</p>


In [22]:
expenses.loc[[1, 35, 43], 'Total £']

1      66.2
35    104.0
43    132.5
Name: Total £, dtype: float64

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
To select multiple rows and columns, we can use Python slice notation <code><strong>:</strong></code>. Here, we select the rows from labels 25 to 30 and the columns from labels <strong>"Trip Length (Days)"</strong> to <strong>"Tax £"</strong>. Unlike ordinary Python slices, slices in <code>.loc</code> include both endpoints.
</p>

In [23]:
expenses.loc[25:30, 'Trip Length (Days)':'Tax £']

Unnamed: 0,Trip Length (Days),Total £,Net £,Tax £
25,,13.1,13.1,0.0
26,,65.4,65.4,0.0
27,,13.1,13.1,0.0
28,,29.0,29.0,0.0
29,1.0,116.0,96.67,19.33
30,,13.1,13.1,0.0


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Suppose that instead, we want to extract all column values for the first four rows in the expenses DataFrame. The shorthand <code><strong>:</strong></code> is useful for this.
    </p>

In [24]:
expenses.loc[0:3, :]

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
0,103533993,09/03/2020,Abigail Marshall Katung,Train,,33.9,33.9,0.0,Leeds - Coventry (Advance Single),Conference - as an attendee
1,103533626,11/03/2020,Abigail Marshall Katung,Train,,66.2,66.2,0.0,Coventry - Leeds (Off-Peak Single),Conference - as an attendee
2,102425812,24/10/2019,Andrew Scopes,Hotel,1.0,83.0,69.17,13.83,"ibis Cambridge Central Station, 24/10/2019, 1 ...",Meeting with other public sector organisations
3,102425835,24/10/2019,Andrew Scopes,Train,,26.0,26.0,0.0,Leeds - Cambridge (Advance Single),Meeting with other public sector organisations


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Pandas also offers another way of extracting data using the <code>.iloc</code> accessor. You can learn more about it <a href="https://www.geeksforgeeks.org/python-extracting-rows-using-pandas-iloc/" target="_blank">here</a>.
</p>
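
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
As a rough sketch of the difference, <code>.loc</code> selects by label while <code>.iloc</code> selects by integer position. The frame below uses invented data and letter labels so the two are easy to tell apart; <code>df_pos</code>, <code>by_label</code> and <code>by_position</code> are illustrative names.
</p>

```python
import pandas as pd

# Hypothetical data with letter row labels
df_pos = pd.DataFrame(
    {"Total": [26.0, 83.0, 50.8, 360.57]}, index=["w", "x", "y", "z"]
)

by_label = df_pos.loc["w":"y", "Total"]   # label slice: "y" is included
by_position = df_pos.iloc[0:2, 0]         # position slice: stops before position 2

print(by_label)
print(by_position)
```

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The label slice returns three rows and the position slice returns two, because <code>.iloc</code> follows the usual Python convention of excluding the end of a slice.
</p>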


<h2 style=" font-size:1.2em; font-family:Verdana"> Context-dependent Extraction: Indexing with <code>[]</code> </h2>

---

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The <code>[]</code> selection operator is the most baffling of all, yet the most commonly used. It only takes a single argument, which may be one of the following:
</p>

<ul style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
    <li>A slice of row numbers.</li>
    <li>A list of column labels.</li>
    <li>A single-column label.</li>
</ul>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
That is, <code>[]</code> is context-dependent. Let's see some examples.
</p>

<h2 style="font-size:1.3em; font-family:Verdana">A slice of row numbers</h2>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Say we wanted the first four rows of our expenses DataFrame.
</p>


In [25]:
expenses[0:4]

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
0,103533993,09/03/2020,Abigail Marshall Katung,Train,,33.9,33.9,0.0,Leeds - Coventry (Advance Single),Conference - as an attendee
1,103533626,11/03/2020,Abigail Marshall Katung,Train,,66.2,66.2,0.0,Coventry - Leeds (Off-Peak Single),Conference - as an attendee
2,102425812,24/10/2019,Andrew Scopes,Hotel,1.0,83.0,69.17,13.83,"ibis Cambridge Central Station, 24/10/2019, 1 ...",Meeting with other public sector organisations
3,102425835,24/10/2019,Andrew Scopes,Train,,26.0,26.0,0.0,Leeds - Cambridge (Advance Single),Meeting with other public sector organisations


<h2 style="font-size:1.3em; font-family:Verdana">A list of column labels</h2>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Suppose we now want four specific columns.
</p>

In [26]:
expenses[['Travel Date', 'Traveller Name', 'Travel Type', 'Total £']]

Unnamed: 0,Travel Date,Traveller Name,Travel Type,Total £
0,09/03/2020,Abigail Marshall Katung,Train,33.90
1,11/03/2020,Abigail Marshall Katung,Train,66.20
2,24/10/2019,Andrew Scopes,Hotel,83.00
3,24/10/2019,Andrew Scopes,Train,26.00
4,25/10/2019,Andrew Scopes,Train,50.80
...,...,...,...,...
106,13/06/2019,Rebecca Charlwood,Train,23.85
107,13/06/2019,Rebecca Charlwood,Train,38.50
108,03/09/2019,Rebecca Charlwood,Train,6.50
109,04/06/2019,Stewart Golton,Train,69.00


<h2 style="font-size:1.3em; font-family:Verdana">A single column label</h2>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
    Lastly, <code>[]</code> allows us to extract only a single column.
</p>

In [27]:
expenses['Traveller Name']

0      Abigail Marshall Katung
1      Abigail Marshall Katung
2                Andrew Scopes
3                Andrew Scopes
4                Andrew Scopes
                ...           
106          Rebecca Charlwood
107          Rebecca Charlwood
108          Rebecca Charlwood
109             Stewart Golton
110             Stewart Golton
Name: Traveller Name, Length: 111, dtype: object

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
    The output is a Series! We'll become very comfortable with <code>[]</code>, especially for selecting columns. In practice, <code>[]</code> is much more common than <code>.loc</code> for this, since it is far more concise.
</p>
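
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
One detail worth checking for yourself: a single label inside <code>[]</code> gives a Series, whereas a one-item list of labels gives a DataFrame. A minimal sketch with invented data (<code>df_sel</code>, <code>as_series</code> and <code>as_frame</code> are illustrative names):
</p>

```python
import pandas as pd

# Hypothetical two-row frame
df_sel = pd.DataFrame({"Name": ["Ann", "Ben"], "Total": [26.0, 83.0]})

as_series = df_sel["Name"]     # single column label
as_frame = df_sel[["Name"]]    # list containing one column label

print(type(as_series).__name__)
print(type(as_frame).__name__)
```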

<h1 style=" font-size:1.4em; font-family:Verdana"> Conditional Selection </h1>

<hr style="border: 0.5px solid #504845;">

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Conditional selection allows us to select a subset of rows in a DataFrame that satisfy some specified condition.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
To understand how to use conditional selection, we must look at another possible input to <code>.loc</code> and <code>[]</code>: a boolean array, which is simply an array or Series where each element is either <code>True</code> or <code>False</code>. This boolean array must have a length equal to the number of rows in the DataFrame. The selection returns all rows that correspond to a value of <code>True</code> in the array. We used a very similar technique when performing conditional extraction from a Series earlier.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
To see this in action, let's select all even-indexed rows in the first 10 rows of our expenses DataFrame.
</p>



In [28]:
# Ask yourself: why is :9 the correct slice to select the first 10 rows?
expenses_first_10_rows = expenses.loc[:9, :]

# Notice how we have exactly 10 elements in our boolean array argument
expenses_first_10_rows[[True, False, True, False, True, False, True, False, True, False]]

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
0,103533993,09/03/2020,Abigail Marshall Katung,Train,,33.9,33.9,0.0,Leeds - Coventry (Advance Single),Conference - as an attendee
2,102425812,24/10/2019,Andrew Scopes,Hotel,1.0,83.0,69.17,13.83,"ibis Cambridge Central Station, 24/10/2019, 1 ...",Meeting with other public sector organisations
4,102425836,25/10/2019,Andrew Scopes,Train,,50.8,50.8,0.0,Cambridge - Leeds (Super Off-Peak Single (Onli...,Meeting with other public sector organisations
6,101571656,10/06/2019,Debra Coupar,Train,,265.0,265.0,0.0,Leeds - London Kings Cross (Anytime Return),Member portfolio/Council business (Elected Mem...
8,101516923,02/07/2019,Debra Coupar,Train,,14.3,14.3,0.0,Southampton Airport Parkway - Bournemouth (Any...,Conference - as an attendee


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
    We can perform the same operation using <code>.loc</code>:
</p>

In [29]:
expenses_first_10_rows.loc[[True, False, True, False, True, False, True, False, True, False], :]

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
0,103533993,09/03/2020,Abigail Marshall Katung,Train,,33.9,33.9,0.0,Leeds - Coventry (Advance Single),Conference - as an attendee
2,102425812,24/10/2019,Andrew Scopes,Hotel,1.0,83.0,69.17,13.83,"ibis Cambridge Central Station, 24/10/2019, 1 ...",Meeting with other public sector organisations
4,102425836,25/10/2019,Andrew Scopes,Train,,50.8,50.8,0.0,Cambridge - Leeds (Super Off-Peak Single (Onli...,Meeting with other public sector organisations
6,101571656,10/06/2019,Debra Coupar,Train,,265.0,265.0,0.0,Leeds - London Kings Cross (Anytime Return),Member portfolio/Council business (Elected Mem...
8,101516923,02/07/2019,Debra Coupar,Train,,14.3,14.3,0.0,Southampton Airport Parkway - Bournemouth (Any...,Conference - as an attendee
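
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The requirement that the boolean list has one entry per row can be checked on a small example. The sketch below uses a hypothetical three-row DataFrame (<code>toy</code>, with made-up values) rather than the real expenses data, and simply records whichever exception pandas raises for a list of the wrong length.
</p>

```python
import pandas as pd

# Hypothetical stand-in for the expenses data (values are illustrative only)
toy = pd.DataFrame({
    "Travel Type": ["Train", "Hotel", "Train"],
    "Total £": [33.9, 83.0, 50.8],
})

# One True/False entry per row: rows 0 and 2 are kept
kept = toy[[True, False, True]]
print(kept)

# A list that is too short is rejected rather than silently padded
try:
    toy[[True, False]]
    mismatch_error = None
except (ValueError, IndexError) as err:
    mismatch_error = err

print(type(mismatch_error).__name__, mismatch_error)
```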


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
These techniques worked well in this example, but you can imagine how tedious it might be to list out <code>True</code> and <code>False</code> for every row in a larger DataFrame. To make things easier, we can instead provide a logical condition as an input to <code>.loc</code> or <code>[]</code> that returns a boolean array with the necessary length.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
For example, to return all rows with a total expense greater than £200:
</p>


In [30]:
logical_operator = (expenses['Total £'] > 200)

expenses[logical_operator]

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
5,101476745,01/07/2019,Barry Anderson,Hotel,3.0,255.0,212.5,42.5,"The Hop Inn, 01/07/2019, 3 nights",Conference - as an attendee
6,101571656,10/06/2019,Debra Coupar,Train,,265.0,265.0,0.0,Leeds - London Kings Cross (Anytime Return),Member portfolio/Council business (Elected Mem...
9,101516922,02/07/2019,Debra Coupar,Flight,2.0,360.57,360.57,0.0,02-07-2019 - Leeds Bradford International Airp...,Conference - as an attendee
40,102717352,15/11/2019,James Lewis,Train,,279.0,279.0,0.0,Leeds - London Kings Cross (Anytime Return),Member portfolio/Council business (Elected Mem...
82,102785179,20/11/2019,Judith Blake,Train,,229.7,229.7,0.0,Leeds - Bournemouth (Anytime Return),Member portfolio/Council business (Elected Mem...
86,103175749,22/01/2020,Judith Blake,Hotel,1.0,235.2,196.0,39.2,"DoubleTree by Hilton London - Westminster, 22/...",Member portfolio/Council business (Elected Mem...
110,101476748,01/07/2019,Stewart Golton,Hotel,3.0,255.0,212.5,42.5,"The Hop Inn, 01/07/2019, 3 nights",Conference - as an attendee


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Passing a Series as an argument to the <code>[]</code> selection operator has the same effect as using a boolean array. In fact, the <code>[]</code> selection operator can take a boolean Series, array, or list as arguments. These three are used interchangeably throughout the course.
</p>
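
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
As a quick check of this interchangeability, the sketch below builds the same filter as a Series, a NumPy array, and a plain list, and confirms that all three select the same rows. The DataFrame <code>toy</code> and its values are hypothetical.
</p>

```python
import pandas as pd

# Hypothetical data: four travellers with made-up totals
toy = pd.DataFrame({
    "Traveller Name": ["A", "B", "C", "D"],
    "Total £": [250.0, 40.0, 310.5, 12.0],
})

mask_series = toy["Total £"] > 200     # boolean Series
mask_array = mask_series.to_numpy()    # boolean NumPy array
mask_list = mask_series.tolist()       # plain Python list of bools

from_series = toy[mask_series]
from_array = toy[mask_array]
from_list = toy[mask_list]

# All three selections contain the same rows
print(from_series.equals(from_array) and from_series.equals(from_list))
```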

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
We can also use <code>.loc</code> to achieve similar results.
</p>


In [31]:
expenses.loc[logical_operator]

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
5,101476745,01/07/2019,Barry Anderson,Hotel,3.0,255.0,212.5,42.5,"The Hop Inn, 01/07/2019, 3 nights",Conference - as an attendee
6,101571656,10/06/2019,Debra Coupar,Train,,265.0,265.0,0.0,Leeds - London Kings Cross (Anytime Return),Member portfolio/Council business (Elected Mem...
9,101516922,02/07/2019,Debra Coupar,Flight,2.0,360.57,360.57,0.0,02-07-2019 - Leeds Bradford International Airp...,Conference - as an attendee
40,102717352,15/11/2019,James Lewis,Train,,279.0,279.0,0.0,Leeds - London Kings Cross (Anytime Return),Member portfolio/Council business (Elected Mem...
82,102785179,20/11/2019,Judith Blake,Train,,229.7,229.7,0.0,Leeds - Bournemouth (Anytime Return),Member portfolio/Council business (Elected Mem...
86,103175749,22/01/2020,Judith Blake,Hotel,1.0,235.2,196.0,39.2,"DoubleTree by Hilton London - Westminster, 22/...",Member portfolio/Council business (Elected Mem...
110,101476748,01/07/2019,Stewart Golton,Hotel,3.0,255.0,212.5,42.5,"The Hop Inn, 01/07/2019, 3 nights",Conference - as an attendee
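
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
One practical reason to prefer <code>.loc</code> is that it accepts a column selection alongside the boolean filter. A minimal sketch on a hypothetical DataFrame (<code>toy</code>, with illustrative values):
</p>

```python
import pandas as pd

# Hypothetical data mirroring three of the expenses columns
toy = pd.DataFrame({
    "Traveller Name": ["A", "B", "C"],
    "Travel Type": ["Hotel", "Train", "Flight"],
    "Total £": [255.0, 33.9, 360.57],
})

over_200 = toy["Total £"] > 200

# Keep rows where the filter is True, and only two of the columns
subset = toy.loc[over_200, ["Traveller Name", "Total £"]]
print(subset)
```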


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Boolean conditions can be combined using various bitwise operators, allowing us to filter results by multiple conditions. In the table below, p and q are boolean arrays or Series.
</p>

<table style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em; border-collapse: collapse;">
  <thead>
    <tr>
      <th style="border: 1px solid black; padding: 20px;">Symbol</th>
      <th style="border: 1px solid black; padding: 20px;">Usage</th>
      <th style="border: 1px solid black; padding: 20px;">Meaning</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid black; padding: 8px;">~</td>
      <td style="border: 1px solid black; padding: 8px;">~p</td>
      <td style="border: 1px solid black; padding: 8px;">Returns negation of p</td>
    </tr>
    <tr>
      <td style="border: 1px solid black; padding: 8px;">|</td>
      <td style="border: 1px solid black; padding: 8px;">p | q</td>
      <td style="border: 1px solid black; padding: 8px;">p OR q</td>
    </tr>
    <tr>
      <td style="border: 1px solid black; padding: 8px;">&amp;</td>
      <td style="border: 1px solid black; padding: 8px;">p &amp; q</td>
      <td style="border: 1px solid black; padding: 8px;">p AND q</td>
    </tr>
    <tr>
      <td style="border: 1px solid black; padding: 8px;">^</td>
      <td style="border: 1px solid black; padding: 8px;">p ^ q</td>
      <td style="border: 1px solid black; padding: 8px;">p XOR q (exclusive or)</td>
    </tr>
  </tbody>
</table>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
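When combining multiple conditions, we surround each individual condition with a set of parentheses <code>()</code>. This is necessary because <code>&amp;</code>, <code>|</code>, and <code>^</code> bind more tightly than comparison operators such as <code>==</code> and <code>&gt;</code> in Python, so without parentheses the conditions would be grouped incorrectly and the code would usually raise an error.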
</p>
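
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The effect of the parentheses can be seen on a small example. The sketch below uses a hypothetical DataFrame (<code>toy</code>) and runs the same filter with and without parentheses. The exact exception type may vary between pandas versions, so the code catches the two likely candidates rather than assuming one.
</p>

```python
import pandas as pd

# Hypothetical data (values are illustrative only)
toy = pd.DataFrame({
    "Travel Type": ["Train", "Hotel", "Train"],
    "Total £": [33.9, 83.0, 66.2],
})

# With parentheses: each comparison is evaluated first, then combined with &
good = toy[(toy["Travel Type"] == "Train") & (toy["Total £"] > 50)]
print(good)

# Without parentheses, Python groups the expression as
#   toy["Travel Type"] == ("Train" & toy["Total £"]) > 50
# which is not a meaningful operation, so an exception is raised
try:
    toy[toy["Travel Type"] == "Train" & toy["Total £"] > 50]
    precedence_error = None
except (TypeError, ValueError) as err:
    precedence_error = err

print(type(precedence_error).__name__)
```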

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
For example, if we want to return all rows with a Travel Type of "Train" and an expense greater than £50, we can write:
</p>


In [32]:
logical_operator = (expenses['Travel Type'] == 'Train') & (expenses['Total £'] > 50)

expenses[logical_operator].head()

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
1,103533626,11/03/2020,Abigail Marshall Katung,Train,,66.2,66.2,0.0,Coventry - Leeds (Off-Peak Single),Conference - as an attendee
4,102425836,25/10/2019,Andrew Scopes,Train,,50.8,50.8,0.0,Cambridge - Leeds (Super Off-Peak Single (Onli...,Meeting with other public sector organisations
6,101571656,10/06/2019,Debra Coupar,Train,,265.0,265.0,0.0,Leeds - London Kings Cross (Anytime Return),Member portfolio/Council business (Elected Mem...
11,102226608,26/09/2019,Debra Coupar,Train,,59.5,59.5,0.0,London Kings Cross - Leeds (Advance Single),Meeting with other external bodies
12,102226607,26/09/2019,Debra Coupar,Train,,139.5,139.5,0.0,Leeds - London Kings Cross (Anytime Single),Meeting with other external bodies


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
You may have noticed that the code above uses <code>expenses['Travel Type'] == 'Train'</code>. The operator <code>==</code> asks "is it equal to?" and is one of several comparison operators. The table below lists the common comparison operators, followed by the element-wise logical operators introduced earlier:
</p>

<table style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em; border-collapse: collapse; text-align: left;">
  <thead>
    <tr>
      <th style="border: 1px solid black; padding: 20px;">Operator</th>
      <th style="border: 1px solid black; padding: 20px;">Usage</th>
      <th style="border: 1px solid black; padding: 20px;">Meaning</th>
    </tr>
  </thead>
  <tbody>
    <tr>
        <td style="border: 1px solid black; padding: 8px;">==</td>
      <td style="border: 1px solid black; padding: 8px;">p == q</td>
      <td style="border: 1px solid black; padding: 8px;">Is equal to</td>
    </tr>
    <tr>
      <td style="border: 1px solid black; padding: 8px;">!=</td>
      <td style="border: 1px solid black; padding: 8px;">p != q</td>
      <td style="border: 1px solid black; padding: 8px;">Is not equal to</td>
    </tr>
    <tr>
      <td style="border: 1px solid black; padding: 8px;">&gt;</td>
      <td style="border: 1px solid black; padding: 8px;">p &gt; q</td>
      <td style="border: 1px solid black; padding: 8px;">Greater than</td>
    </tr>
    <tr>
      <td style="border: 1px solid black; padding: 8px;">&lt;</td>
      <td style="border: 1px solid black; padding: 8px;">p &lt; q</td>
      <td style="border: 1px solid black; padding: 8px;">Less than</td>
    </tr>
    <tr>
      <td style="border: 1px solid black; padding: 8px;">&gt;=</td>
      <td style="border: 1px solid black; padding: 8px;">p &gt;= q</td>
      <td style="border: 1px solid black; padding: 8px;">Greater than or equal to</td>
    </tr>
    <tr>
      <td style="border: 1px solid black; padding: 8px;">&lt;=</td>
      <td style="border: 1px solid black; padding: 8px;">p &lt;= q</td>
      <td style="border: 1px solid black; padding: 8px;">Less than or equal to</td>
    </tr>
    <tr>
      <td style="border: 1px solid black; padding: 8px;">&amp;</td>
      <td style="border: 1px solid black; padding: 8px;">p &amp; q</td>
      <td style="border: 1px solid black; padding: 8px;">Logical AND</td>
    </tr>
    <tr>
      <td style="border: 1px solid black; padding: 8px;">|</td>
      <td style="border: 1px solid black; padding: 8px;">p | q</td>
      <td style="border: 1px solid black; padding: 8px;">Logical OR</td>
    </tr>
    <tr>
      <td style="border: 1px solid black; padding: 8px;">~</td>
      <td style="border: 1px solid black; padding: 8px;">~p</td>
      <td style="border: 1px solid black; padding: 8px;">Logical NOT</td>
    </tr>
    <tr>
      <td style="border: 1px solid black; padding: 8px;">^</td>
      <td style="border: 1px solid black; padding: 8px;">p ^ q</td>
      <td style="border: 1px solid black; padding: 8px;">Logical XOR (exclusive OR)</td>
    </tr>
  </tbody>
</table>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
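Note that using <code>and</code> in place of <code>&amp;</code>, or <code>or</code> in place of <code>|</code>, will raise an error. Python's <code>and</code> and <code>or</code> need a single <code>True</code> or <code>False</code> value for each operand, and a Series containing many values has no single truth value.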
</p>



In [33]:
# This line of code will raise a ValueError
expenses[(expenses["Travel Type"] == "Train") and (expenses["Total £"] > 50)]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
If we want to return all rows where the Travel Type is "Train" <code>or</code> the total expense is greater than £50, we use <code>|</code>:
</p>




In [34]:
logical_operator = (expenses['Travel Type'] == 'Train') | (expenses['Total £'] > 50)

expenses[logical_operator].head()

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
0,103533993,09/03/2020,Abigail Marshall Katung,Train,,33.9,33.9,0.0,Leeds - Coventry (Advance Single),Conference - as an attendee
1,103533626,11/03/2020,Abigail Marshall Katung,Train,,66.2,66.2,0.0,Coventry - Leeds (Off-Peak Single),Conference - as an attendee
2,102425812,24/10/2019,Andrew Scopes,Hotel,1.0,83.0,69.17,13.83,"ibis Cambridge Central Station, 24/10/2019, 1 ...",Meeting with other public sector organisations
3,102425835,24/10/2019,Andrew Scopes,Train,,26.0,26.0,0.0,Leeds - Cambridge (Advance Single),Meeting with other public sector organisations
4,102425836,25/10/2019,Andrew Scopes,Train,,50.8,50.8,0.0,Cambridge - Leeds (Super Off-Peak Single (Onli...,Meeting with other public sector organisations
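
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The two remaining operators from the table, <code>~</code> and <code>^</code>, follow the same pattern. A minimal sketch on a hypothetical DataFrame (<code>toy</code>, with illustrative values):
</p>

```python
import pandas as pd

# Hypothetical data (values are illustrative only)
toy = pd.DataFrame({
    "Travel Type": ["Train", "Hotel", "Train", "Flight"],
    "Total £": [33.9, 83.0, 66.2, 20.0],
})

is_train = toy["Travel Type"] == "Train"
over_50 = toy["Total £"] > 50

# ~ flips every True/False value: rows that are NOT train journeys
not_train = toy[~is_train]

# ^ keeps rows where exactly one condition holds:
# a train journey or an expense over 50, but not both
exactly_one = toy[is_train ^ over_50]

print(not_train)
print(exactly_one)
```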


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Boolean array selection is a useful tool, but it can lead to overly verbose code for complex conditions. In the example below, our boolean condition is long enough to extend over several lines of code.
</p>
<br>

<div style="
           display:fill;
           border-radius:5px;
           background-color:grey;
           font-size:1.2em;
           font-family:Helvetica;
           letter-spacing:0.5px;
            line-height:1.7em">
<p> 📝 The parentheses surrounding the expression make it possible to break the code onto multiple lines for readability. </p>
</div>

In [35]:
( expenses[(expenses['Traveller Name'] == 'Barry Anderson') |
           (expenses['Traveller Name'] == 'Debra Coupar') |
           (expenses['Traveller Name'] == 'Judith Blake') |
           (expenses['Traveller Name'] == 'Andrew Scopes')]
).head()

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
2,102425812,24/10/2019,Andrew Scopes,Hotel,1.0,83.0,69.17,13.83,"ibis Cambridge Central Station, 24/10/2019, 1 ...",Meeting with other public sector organisations
3,102425835,24/10/2019,Andrew Scopes,Train,,26.0,26.0,0.0,Leeds - Cambridge (Advance Single),Meeting with other public sector organisations
4,102425836,25/10/2019,Andrew Scopes,Train,,50.8,50.8,0.0,Cambridge - Leeds (Super Off-Peak Single (Onli...,Meeting with other public sector organisations
5,101476745,01/07/2019,Barry Anderson,Hotel,3.0,255.0,212.5,42.5,"The Hop Inn, 01/07/2019, 3 nights",Conference - as an attendee
6,101571656,10/06/2019,Debra Coupar,Train,,265.0,265.0,0.0,Leeds - London Kings Cross (Anytime Return),Member portfolio/Council business (Elected Mem...


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Fortunately, pandas provides many alternative methods for constructing boolean filters.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The <code>.isin()</code> method is one such example. It evaluates whether each value in a Series is contained in a different sequence (list, array, or Series) of values. In the cell below, we achieve the same result as the DataFrame above with far more concise code.
</p>

In [36]:
names = ['Barry Anderson', 'Debra Coupar', 'Judith Blake', 'Andrew Scopes']
# The code below builds a boolean (True/False) Series: True where the name is in `names`
expenses['Traveller Name'].isin(names)

0      False
1      False
2       True
3       True
4       True
       ...  
106    False
107    False
108    False
109    False
110    False
Name: Traveller Name, Length: 111, dtype: bool

In [37]:
logical_operator = expenses['Traveller Name'].isin(names)

expenses[logical_operator].head()

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
2,102425812,24/10/2019,Andrew Scopes,Hotel,1.0,83.0,69.17,13.83,"ibis Cambridge Central Station, 24/10/2019, 1 ...",Meeting with other public sector organisations
3,102425835,24/10/2019,Andrew Scopes,Train,,26.0,26.0,0.0,Leeds - Cambridge (Advance Single),Meeting with other public sector organisations
4,102425836,25/10/2019,Andrew Scopes,Train,,50.8,50.8,0.0,Cambridge - Leeds (Super Off-Peak Single (Onli...,Meeting with other public sector organisations
5,101476745,01/07/2019,Barry Anderson,Hotel,3.0,255.0,212.5,42.5,"The Hop Inn, 01/07/2019, 3 nights",Conference - as an attendee
6,101571656,10/06/2019,Debra Coupar,Train,,265.0,265.0,0.0,Leeds - London Kings Cross (Anytime Return),Member portfolio/Council business (Elected Mem...
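
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
To confirm that <code>.isin()</code> and a chain of <code>|</code> comparisons produce the same filter, the sketch below compares the two on a hypothetical DataFrame (<code>toy</code>; the totals are illustrative). It also shows how <code>~</code> turns the filter into "not in the list".
</p>

```python
import pandas as pd

# Hypothetical data (totals are illustrative only)
toy = pd.DataFrame({
    "Traveller Name": ["Andrew Scopes", "Fiona Venner", "Judith Blake", "Neil Walshaw"],
    "Total £": [83.0, 24.1, 235.2, 56.5],
})
names = ["Andrew Scopes", "Judith Blake"]

with_isin = toy["Traveller Name"].isin(names)
with_or = (
    (toy["Traveller Name"] == "Andrew Scopes")
    | (toy["Traveller Name"] == "Judith Blake")
)

# Both approaches flag the same rows
print((with_isin == with_or).all())

# Negating the filter selects everyone who is not in `names`
everyone_else = toy[~with_isin]
print(everyone_else)
```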


<h1 style="font-size:1.4em; font-family:Verdana">Adding, Removing, and Modifying Columns</h1> 

<hr style="border: 0.5px solid #504845;"> 

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em"> In many data science tasks, we may need to change the columns contained in our DataFrame in some way. Below, we will look at an example where we want to create a new column called <strong>"Detail_lengths"</strong>, which will tell us the number of characters contained in the "Detail" column. </p> 

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em"> We can find out the number of characters contained in the "Detail" column by using the following code: </p>

```python
expenses["Detail"].str.len()
```
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em"> The code above does the following: </p> 

<ul style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
    <li><strong>expenses["Detail"]</strong>: This accesses the "Detail" column from the <strong>expenses</strong> DataFrame.</li>
    <li><strong>.str</strong>: This accessor lets us apply string methods to each element in the "Detail" column. Because <code>.len()</code> is a string operation, it must be reached through <code>.str</code>.</li>
    <li><strong>.len()</strong>: This method returns the length (i.e., the number of characters) of each string in the column.</li>
</ul>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em"> To add this new information to our DataFrame, we can create a new column called "Detail_lengths" as shown in the code below: </p>

In [38]:
expenses["Detail_lengths"] = expenses["Detail"].str.len()
expenses.head(5)

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel,Detail_lengths
0,103533993,09/03/2020,Abigail Marshall Katung,Train,,33.9,33.9,0.0,Leeds - Coventry (Advance Single),Conference - as an attendee,33
1,103533626,11/03/2020,Abigail Marshall Katung,Train,,66.2,66.2,0.0,Coventry - Leeds (Off-Peak Single),Conference - as an attendee,34
2,102425812,24/10/2019,Andrew Scopes,Hotel,1.0,83.0,69.17,13.83,"ibis Cambridge Central Station, 24/10/2019, 1 ...",Meeting with other public sector organisations,51
3,102425835,24/10/2019,Andrew Scopes,Train,,26.0,26.0,0.0,Leeds - Cambridge (Advance Single),Meeting with other public sector organisations,34
4,102425836,25/10/2019,Andrew Scopes,Train,,50.8,50.8,0.0,Cambridge - Leeds (Super Off-Peak Single (Onli...,Meeting with other public sector organisations,50


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
If we need to later modify an existing column, we can do so by referencing this column again with the syntax <code>df["column"]</code>, then re-assigning it to a new Series or array of the appropriate length.
</p>


In [39]:
# Modify the "Detail_lengths" column to be one less than its original value
expenses["Detail_lengths"] = expenses["Detail_lengths"] - 1
expenses.head()

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel,Detail_lengths
0,103533993,09/03/2020,Abigail Marshall Katung,Train,,33.9,33.9,0.0,Leeds - Coventry (Advance Single),Conference - as an attendee,32
1,103533626,11/03/2020,Abigail Marshall Katung,Train,,66.2,66.2,0.0,Coventry - Leeds (Off-Peak Single),Conference - as an attendee,33
2,102425812,24/10/2019,Andrew Scopes,Hotel,1.0,83.0,69.17,13.83,"ibis Cambridge Central Station, 24/10/2019, 1 ...",Meeting with other public sector organisations,50
3,102425835,24/10/2019,Andrew Scopes,Train,,26.0,26.0,0.0,Leeds - Cambridge (Advance Single),Meeting with other public sector organisations,33
4,102425836,25/10/2019,Andrew Scopes,Train,,50.8,50.8,0.0,Cambridge - Leeds (Super Off-Peak Single (Onli...,Meeting with other public sector organisations,49


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
We can rename a column using the <code>.rename()</code> method. This method takes in a dictionary that maps old column names to their new ones.
</p>


In [40]:
# Rename "Detail_lengths" to "Detail_Length"
expenses = expenses.rename(columns={"Detail_lengths":"Detail_Length"})
expenses.head()

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel,Detail_Length
0,103533993,09/03/2020,Abigail Marshall Katung,Train,,33.9,33.9,0.0,Leeds - Coventry (Advance Single),Conference - as an attendee,32
1,103533626,11/03/2020,Abigail Marshall Katung,Train,,66.2,66.2,0.0,Coventry - Leeds (Off-Peak Single),Conference - as an attendee,33
2,102425812,24/10/2019,Andrew Scopes,Hotel,1.0,83.0,69.17,13.83,"ibis Cambridge Central Station, 24/10/2019, 1 ...",Meeting with other public sector organisations,50
3,102425835,24/10/2019,Andrew Scopes,Train,,26.0,26.0,0.0,Leeds - Cambridge (Advance Single),Meeting with other public sector organisations,33
4,102425836,25/10/2019,Andrew Scopes,Train,,50.8,50.8,0.0,Cambridge - Leeds (Super Off-Peak Single (Onli...,Meeting with other public sector organisations,49


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
If we want to remove a column or row of a DataFrame, we can call the <code>.drop()</code> method. Use the <code>axis</code> parameter to specify whether a column or row should be dropped. By default, pandas assumes that we are dropping a row.
</p>

In [41]:
# Drop our new "Detail_Length" column from the DataFrame
expenses = expenses.drop("Detail_Length", axis="columns")
expenses.head(5)

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
0,103533993,09/03/2020,Abigail Marshall Katung,Train,,33.9,33.9,0.0,Leeds - Coventry (Advance Single),Conference - as an attendee
1,103533626,11/03/2020,Abigail Marshall Katung,Train,,66.2,66.2,0.0,Coventry - Leeds (Off-Peak Single),Conference - as an attendee
2,102425812,24/10/2019,Andrew Scopes,Hotel,1.0,83.0,69.17,13.83,"ibis Cambridge Central Station, 24/10/2019, 1 ...",Meeting with other public sector organisations
3,102425835,24/10/2019,Andrew Scopes,Train,,26.0,26.0,0.0,Leeds - Cambridge (Advance Single),Meeting with other public sector organisations
4,102425836,25/10/2019,Andrew Scopes,Train,,50.8,50.8,0.0,Cambridge - Leeds (Super Off-Peak Single (Onli...,Meeting with other public sector organisations
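
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The row-dropping default is easiest to see next to the column case. The sketch below uses a hypothetical three-row DataFrame (<code>toy</code>) and also shows the <code>columns=</code> keyword, an alternative spelling of <code>axis="columns"</code>.
</p>

```python
import pandas as pd

# Hypothetical data (values are illustrative only)
toy = pd.DataFrame({
    "Traveller Name": ["A", "B", "C"],
    "Total £": [33.9, 83.0, 50.8],
})

# No axis given: pandas looks for a ROW labelled 1 and drops it
without_row_1 = toy.drop(1)

# Two equivalent ways to drop a column
without_total = toy.drop("Total £", axis="columns")
same_as_above = toy.drop(columns=["Total £"])

print(without_row_1)
print(without_total.equals(same_as_above))

# None of the calls above modified `toy` itself
print(toy.shape)
```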


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Notice that we re-assigned <code>expenses</code> to the result of <code>expenses.drop(...)</code>. This is a subtle but important point: pandas table operations do not occur in-place by default. Calling <code>df.drop(...)</code> will output a copy of <code>df</code> with the row/column of interest removed without modifying the original <code>df</code> table.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
In other words, if we simply call:
</p>

In [42]:
# This returns a copy of `expenses` with the "Traveller Name" column removed...
expenses.drop("Traveller Name", axis="columns")

# ...but the original `expenses` is unchanged! 
# Notice that the "Traveller Name" column is still present
expenses.head(5)

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
0,103533993,09/03/2020,Abigail Marshall Katung,Train,,33.9,33.9,0.0,Leeds - Coventry (Advance Single),Conference - as an attendee
1,103533626,11/03/2020,Abigail Marshall Katung,Train,,66.2,66.2,0.0,Coventry - Leeds (Off-Peak Single),Conference - as an attendee
2,102425812,24/10/2019,Andrew Scopes,Hotel,1.0,83.0,69.17,13.83,"ibis Cambridge Central Station, 24/10/2019, 1 ...",Meeting with other public sector organisations
3,102425835,24/10/2019,Andrew Scopes,Train,,26.0,26.0,0.0,Leeds - Cambridge (Advance Single),Meeting with other public sector organisations
4,102425836,25/10/2019,Andrew Scopes,Train,,50.8,50.8,0.0,Cambridge - Leeds (Super Off-Peak Single (Onli...,Meeting with other public sector organisations


<h1 style=" font-size:1.4em; font-family:Verdana"> Utility Functions </h1>

<hr style="border: 0.5px solid #504845;">

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Pandas contains an extensive library of functions that can help shorten the process of setting and getting information from its data structures. In the following section, we will give overviews of each of the main utility functions that will help us in our data science projects.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Discussing all functionality offered by Pandas could take an entire semester! We will walk you through the most commonly-used functions and encourage you to explore and experiment on your own.
</p>


<ul style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
    <li><code>NumPy</code> and Built-in Function Support</li>
    <li><code>.shape</code></li>
    <li><code>.size</code></li>
    <li><code>.describe()</code></li>
    <li><code>.sample()</code></li>
    <li><code>.value_counts()</code></li>
    <li><code>.unique()</code></li>
    <li><code>.sort_values()</code></li>
</ul>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The Pandas documentation will be a valuable resource during your apprenticeship and beyond.
</p>

<h3 style="font-size:1.4em; font-family:Verdana">NumPy</h3>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Pandas is designed to work well with NumPy, the framework for array computations. Just about any NumPy function can be applied to Pandas DataFrames and Series.
</p>


In [43]:
# Select the "Total £" values for every expense recorded for Andrew Scopes
andrew_expense = expenses.loc[expenses["Traveller Name"] == "Andrew Scopes", "Total £"]
andrew_expense.head()

2    83.0
3    26.0
4    50.8
Name: Total £, dtype: float64

In [44]:
import numpy as np

# Average of expenses recorded for Andrew
np.mean(andrew_expense)

53.26666666666667

In [45]:
# What is Andrew's largest expense?
np.max(andrew_expense)

83.0

In [46]:
# Sum of all of Andrew's expenses
np.sum(andrew_expense)

159.8
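
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
For these common aggregations, a Series also offers methods of its own that give the same answers. The sketch below rebuilds Andrew's three totals by hand (so it runs without the expenses file) and compares the NumPy functions with the corresponding Series methods. It also applies an element-wise NumPy function, which returns a Series.
</p>

```python
import numpy as np
import pandas as pd

# Andrew's three totals from the output above, rebuilt by hand
andrew = pd.Series([83.0, 26.0, 50.8], name="Total £")

# NumPy function vs. the equivalent Series method
print(np.mean(andrew), andrew.mean())
print(np.max(andrew), andrew.max())
print(np.sum(andrew), andrew.sum())

# Element-wise NumPy functions keep the Series index
roots = np.sqrt(andrew)
print(roots)
```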

<h3 style="font-size:1.4em; font-family:Verdana"><code>.shape</code> and <code>.size</code></h3>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
<code>.shape</code> and <code>.size</code> are attributes of Series and DataFrames that measure the "amount" of data stored in the structure. For a DataFrame, <code>.shape</code> returns a tuple containing the number of rows and columns; for a Series, it returns a one-element tuple containing the number of values. <code>.size</code> gives the total number of elements in a structure, which for a DataFrame equals the number of rows times the number of columns.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Many functions strictly require the dimensions of the arguments along certain axes to match. Checking these attributes is much faster than counting all of the items by hand.
</p>

In [47]:
expenses.shape

(111, 10)

In [48]:
expenses.size

1110
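
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The difference between a DataFrame and a Series is easier to see on a small example. A minimal sketch with a hypothetical 3-row, 2-column DataFrame (<code>toy</code>):
</p>

```python
import pandas as pd

# Hypothetical data: 3 rows, 2 columns
toy = pd.DataFrame({
    "Traveller Name": ["A", "B", "C"],
    "Total £": [33.9, 83.0, 50.8],
})

# DataFrame: (rows, columns) and rows * columns
print(toy.shape, toy.size)

# Series: a one-element tuple and the number of values
print(toy["Total £"].shape, toy["Total £"].size)

# len() gives the number of rows
print(len(toy))
```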

<h3 style="font-size:1.4em; font-family:Verdana"><code>.describe()</code></h3>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
If many statistics are required from a DataFrame (minimum value, maximum value, mean value, etc.), then <code>.describe()</code> can be used to compute all of them at once. You can find more details in the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html" target="_blank">documentation</a>.
</p>

In [49]:
expenses.describe()

Unnamed: 0,Itinerary ID,Trip Length (Days),Total £,Net £,Tax £
count,111.0,16.0,111.0,111.0,111.0
mean,102153600.0,1.375,81.587207,78.631171,2.956036
std,699910.1,0.806226,74.080696,70.6881,8.903756
min,101179100.0,0.0,6.5,6.5,0.0
25%,101517100.0,1.0,24.1,24.1,0.0
50%,102009100.0,1.0,56.75,54.17,0.0
75%,102579900.0,2.0,132.5,128.075,0.0
max,103610600.0,3.0,360.57,360.57,42.5


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
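Calling <code>.describe()</code> on a Series summarises just that one column. For a numeric Series such as the one below, the statistics match the corresponding column above; for a non-numeric Series, pandas instead reports <code>count</code>, <code>unique</code>, <code>top</code>, and <code>freq</code>.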
</p>

In [50]:
expenses['Total £'].describe()

count    111.000000
mean      81.587207
std       74.080696
min        6.500000
25%       24.100000
50%       56.750000
75%      132.500000
max      360.570000
Name: Total £, dtype: float64
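
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
For a column of text, <code>.describe()</code> produces a different kind of summary. The sketch below uses a short, hypothetical Series of travel types rather than the real data.
</p>

```python
import pandas as pd

# Hypothetical column of text values
types = pd.Series(["Train", "Hotel", "Train", "Flight", "Train"], name="Travel Type")

# count = number of values, unique = distinct values,
# top = most common value, freq = how often `top` appears
summary = types.describe()
print(summary)
```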

<h3 style="font-size:1.4em; font-family:Verdana"><code>.sample()</code></h3>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
As we will see later in Unit 5, random processes are at the heart of many data science techniques (for example, train-test splits and cross-validation). The <code>.sample()</code> method lets us quickly select random entries (a row if called from a DataFrame, or a value if called from a Series). You can find more details in the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html" target="_blank">documentation</a>.
</p>

In [51]:
expenses.sample()

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
10,101516924,04/07/2019,Debra Coupar,Train,,14.3,14.3,0.0,Bournemouth - Southampton Airport Parkway (Any...,Conference - as an attendee


In [52]:
# Using .sample() to get a random sample of 5 rows from the DataFrame
expenses.sample(5)

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
68,101753473,09/07/2019,Judith Blake,Train,,174.9,174.9,0.0,Leeds - London Kings Cross (Anytime Return),Member portfolio/Council business (Elected Mem...
2,102425812,24/10/2019,Andrew Scopes,Hotel,1.0,83.0,69.17,13.83,"ibis Cambridge Central Station, 24/10/2019, 1 ...",Meeting with other public sector organisations
46,101704533,03/07/2019,Jonathan David Pryor,Hotel,1.0,159.0,132.5,26.5,"Premier Inn London County Hall, 03/07/2019, 1 ...",Member portfolio/Council business (Elected Mem...
108,102144703,03/09/2019,Rebecca Charlwood,Train,,6.5,6.5,0.0,Leeds - Wakefield Westgate (Anytime Day Return),Member portfolio/Council business (Elected Mem...
103,101729200,03/07/2019,Neil Walshaw,Train,,56.5,56.5,0.0,London Kings Cross - Leeds (Advance Single),Member portfolio/Council business (Elected Mem...


In [53]:
# Using .sample() to get a random sample of values from a Series
expenses['Traveller Name'].sample(10)

79     Judith Blake
103    Neil Walshaw
42      Jane Dowson
91     Judith Blake
21     Fiona Venner
40      James Lewis
31     Fiona Venner
7      Debra Coupar
90     Judith Blake
60     Judith Blake
Name: Traveller Name, dtype: object
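
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Because the rows are chosen at random, re-running the cells above will usually give different results. When a repeatable sample is needed, <code>.sample()</code> accepts a <code>random_state</code> seed, and <code>frac</code> requests a fraction of the rows instead of a fixed number. A minimal sketch on a hypothetical eight-row DataFrame (<code>toy</code>); the seed values are arbitrary:
</p>

```python
import pandas as pd

# Hypothetical data: eight made-up totals
toy = pd.DataFrame({"Total £": [33.9, 66.2, 83.0, 26.0, 50.8, 255.0, 265.0, 14.3]})

# The same seed gives the same sample every time
first = toy.sample(3, random_state=42)
second = toy.sample(3, random_state=42)
print(first.equals(second))

# frac=0.25 asks for a quarter of the rows (2 of 8 here)
quarter = toy.sample(frac=0.25, random_state=0)
print(len(quarter))
```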

<h3 style="font-size:1.4em; font-family:Verdana"><code>.value_counts()</code></h3>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The <code>Series.value_counts()</code> method counts the number of times each unique value appears in a Series. This is often useful for determining the most or least common entries in a Series.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
In the example below, we determine how many expense records each traveller has by counting the number of times each name appears. Note that the return value is also a Series, sorted from most to least frequent.
</p>


In [54]:
expenses['Traveller Name'].value_counts()

Judith Blake               44
Fiona Venner               16
Debra Coupar               10
Jonathan David Pryor        6
Neil Walshaw                5
James Lewis                 5
Lisa Mulherin               4
Rebecca Charlwood           3
Andrew Scopes               3
Abigail Marshall Katung     2
Mohammed Rafique            2
Stewart Golton              2
Graham Latty                2
Eileen Taylor               2
Denise Ragan                1
Patricia Latty              1
Peter Carlill               1
Barry Anderson              1
Jane Dowson                 1
Name: Traveller Name, dtype: int64
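
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Two common follow-ups are turning the counts into proportions and reading off the most frequent value. The sketch below does both on a short, hypothetical Series of names, using the <code>normalize=True</code> option and the <code>.idxmax()</code> method.
</p>

```python
import pandas as pd

# Hypothetical sample of names (not the full expenses data)
names = pd.Series(
    ["Judith Blake", "Fiona Venner", "Judith Blake",
     "Debra Coupar", "Judith Blake", "Fiona Venner"],
    name="Traveller Name",
)

counts = names.value_counts()                # raw counts, most frequent first
shares = names.value_counts(normalize=True)  # proportions that sum to 1

print(counts)
print(shares)

# The index label with the highest count
print(counts.idxmax())
```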

<h3 style="font-size:1.4em; font-family:Verdana"><code>.unique()</code></h3>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The <code>.unique()</code> method returns an array of the distinct values in a Series, in the order they first appear, with duplicates removed. Unlike <code>.value_counts()</code>, which returns the frequency of each value, <code>.unique()</code> simply shows which values are present, without counting how often they occur. In the example below, we return an array of all unique traveller names from the <code>expenses</code> dataset.
</p>


In [55]:
expenses['Traveller Name'].unique()

array(['Abigail Marshall Katung', 'Andrew Scopes', 'Barry Anderson',
       'Debra Coupar', 'Denise Ragan', 'Eileen Taylor', 'Fiona Venner',
       'Graham Latty', 'James Lewis', 'Jane Dowson',
       'Jonathan David Pryor', 'Judith Blake', 'Lisa Mulherin',
       'Mohammed Rafique', 'Neil Walshaw', 'Patricia Latty',
       'Peter Carlill', 'Rebecca Charlwood', 'Stewart Golton'],
      dtype=object)
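
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The sketch below illustrates two related points on a hypothetical <code>names</code> Series (illustrative values only): <code>.unique()</code> keeps values in order of first appearance rather than sorting them, and <code>.nunique()</code> returns just the number of distinct values.
</p>

```python
import pandas as pd

# Hypothetical stand-in for expenses['Traveller Name']; values are illustrative only
names = pd.Series(["Neil Walshaw", "Judith Blake", "Neil Walshaw", "Jane Dowson"])

# Distinct values, in order of first appearance (not sorted)
distinct = names.unique()

# Number of distinct values
n_distinct = names.nunique()

# Sort explicitly if an alphabetical list is needed
sorted_distinct = sorted(distinct)

print(distinct)
print(n_distinct)
print(sorted_distinct)
```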

<h3 style="font-size:1.4em; font-family:Verdana"><code>.sort_values()</code></h3>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
The <code>sort_values()</code> method sorts the values in a Series or DataFrame. When applied to a Series, it returns a new Series with the same values in ascending or descending order. When applied to a DataFrame, the <code>by</code> argument names the column to sort on. This is useful when you want to rank or organise data by value. For example, if we wanted to sort the names in the <code>expenses</code> dataset alphabetically, or by how often they appear, we would use <code>sort_values()</code>.
</p>


In [56]:
expenses.sort_values(by='Total £', ascending=False).head()

Unnamed: 0,Itinerary ID,Travel Date,Traveller Name,Travel Type,Trip Length (Days),Total £,Net £,Tax £,Detail,Reason for Travel
9,101516922,02/07/2019,Debra Coupar,Flight,2.0,360.57,360.57,0.0,02-07-2019 - Leeds Bradford International Airp...,Conference - as an attendee
40,102717352,15/11/2019,James Lewis,Train,,279.0,279.0,0.0,Leeds - London Kings Cross (Anytime Return),Member portfolio/Council business (Elected Mem...
6,101571656,10/06/2019,Debra Coupar,Train,,265.0,265.0,0.0,Leeds - London Kings Cross (Anytime Return),Member portfolio/Council business (Elected Mem...
5,101476745,01/07/2019,Barry Anderson,Hotel,3.0,255.0,212.5,42.5,"The Hop Inn, 01/07/2019, 3 nights",Conference - as an attendee
110,101476748,01/07/2019,Stewart Golton,Hotel,3.0,255.0,212.5,42.5,"The Hop Inn, 01/07/2019, 3 nights",Conference - as an attendee


<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
In the example above, we use the <code>sort_values()</code> method to sort the rows of the <code>expenses</code> DataFrame by the values in the "Total £" column. Setting <code>ascending=False</code> sorts the rows in descending order, so the largest values appear first. Finally, <code>head()</code> returns only the first five rows of the sorted DataFrame, showing us the highest expenses.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
This helps us answer the question of who had the most expensive travel by quickly identifying the individuals or entries with the highest total expenses.
</p>
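
<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
<code>sort_values()</code> works the same way on a Series. The sketch below, using a hypothetical <code>names</code> Series with illustrative values, sorts names alphabetically and then sorts the output of <code>.value_counts()</code> so that the least frequent names come first.
</p>

```python
import pandas as pd

# Hypothetical stand-in for expenses['Traveller Name']; values are illustrative only
names = pd.Series(
    ["Neil Walshaw", "Judith Blake", "Neil Walshaw",
     "Jane Dowson", "Judith Blake", "Judith Blake"]
)

# Alphabetical order (A to Z); each value keeps its original index label
alphabetical = names.sort_values()

# Frequency of each name, re-sorted so the least frequent comes first
least_frequent_first = names.value_counts().sort_values(ascending=True)

print(alphabetical)
print(least_frequent_first)
```

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Note that no <code>by</code> argument is needed here, because a Series has only one set of values to sort.
</p>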

<h1 style=" font-size:1.4em; font-family:Verdana"> Final Thoughts </h1>

<hr style="border: 0.5px solid #504845;">

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Manipulating DataFrames is not a skill that can be mastered overnight. Pandas offers a vast array of functions and methods, so there are often several ways to achieve the same result. This flexibility can feel overwhelming, but it is also an opportunity to strengthen your understanding by exploring different approaches to a problem. We highly recommend experimenting with different solutions to the same task. Doing so will reinforce your learning and give you a deeper insight into how Pandas operates, helping you reach mastery sooner.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Remember, the key to becoming proficient in data manipulation lies in practice and patience. As you continue working with DataFrames, you will develop an intuitive sense of how to approach complex problems more efficiently. By embracing this iterative process, you'll build confidence and competence in handling a wide variety of data challenges.
</p>

<p style="font-size:1.2em; font-family:Helvetica; line-height: 1.7em">
Next, we will start digging deeper into the mechanics behind grouping data. This will allow us to break down large datasets into more manageable parts and analyze them based on specific criteria, providing new insights and a better understanding of the data structure.
</p>

