::: {.callout-note collapse="false"}
## Learning Outcomes

- Build familiarity with `pandas` and `pandas` syntax.
- Learn key data structures: `DataFrame`, `Series`, and `Index`.
- Understand methods for extracting data: `.loc`, `.iloc`, and `[]`.
:::

In this sequence of lectures, we will dive right into things by having you explore and manipulate real-world data. We'll first introduce `pandas`, a popular Python library for interacting with **tabular data**.

## Tabular Data

Data scientists work with data stored in a variety of formats. The primary focus of this class is understanding *tabular data* — data that is stored in a table.

Tabular data is one of the most common systems that data scientists use to organize data. This is in large part due to the simplicity and flexibility of tables. Tables allow us to represent each **observation**, or instance of collecting data from an individual, as its own *row*. We can record each observation's distinct characteristics, or **features**, in separate *columns*.

To see this in action, we'll explore the `elections` dataset, which stores information about political candidates who ran for president of the United States in previous years.


In [None]:
#| code-fold: true
import pandas as pd
pd.read_csv("data/elections.csv")

In the `elections` dataset, each row represents one instance of a candidate running for president in a particular year. For example, the first row represents Andrew Jackson running for president in the year 1824. Each column represents one characteristic piece of information about each presidential candidate. For example, the column named "Result" stores whether or not the candidate won the election. 

Your work in Data 8 helped you grow very familiar with using and interpreting data stored in a tabular format. Back then, you used the `Table` class of the `datascience` library, a special programming library created specifically for Data 8 students.

In Data 100, we will be working with the programming library `pandas`, which is generally accepted in the data science community as the industry- and academia-standard tool for manipulating tabular data (as well as the inspiration for Petey, our panda bear mascot).

Using `pandas`, we can

- Arrange data in a tabular format.
- Extract useful information filtered by specific conditions.
- Operate on data to gain new insights.
- Apply `NumPy` functions to our data (our friends from Data 8).
- Perform vectorized computations to speed up our analysis (Lab 1).

## `Series`, `DataFrame`s, and Indices

To begin our work in `pandas`, we must first import the library into our Python environment. This will allow us to use `pandas` data structures and methods in our code.


In [None]:
# `pd` is the conventional alias for Pandas, as `np` is for NumPy
import pandas as pd

There are three fundamental data structures in `pandas`:

1. **Series**: 1D labeled array data; best thought of as columnar data.
2. **DataFrame**: 2D tabular data with rows and columns.
3. **Index**: A sequence of row/column labels.

`DataFrame`s, `Series`, and Indices can be represented visually in the following diagram, which considers the first few rows of the `elections` dataset.

![](images/df_elections.png)

Notice how the **DataFrame** is a two-dimensional object: it contains both rows and columns. The **Series** above is a single column of this `DataFrame`, namely the `Result` column. Both contain an **Index**, or a shared list of row labels (the integers from 0 to 4, inclusive).

### Series

A Series represents a column of a `DataFrame`; more generally, it can be any 1-dimensional array-like object. It contains:

- A sequence of **values** of the same type.
- A sequence of data labels called the **index**.

In the cell below, we create a `Series` named `s`.


In [None]:
s = pd.Series(["welcome", "to", "data 100"])
s

In [None]:
s.values # Data values contained within the Series

In [None]:
s.index # The Index of the Series

By default, the Index of a Series is a sequential list of integers beginning from 0. Optionally, a manually specified list of desired indices can be passed to the `index` argument.


In [None]:
s = pd.Series([-1, 10, 2], index = ["a", "b", "c"])
s

In [None]:
s.index

Indices can also be changed after initialization.


In [None]:
s.index = ["first", "second", "third"]
s

In [None]:
s.index

#### Selection in `Series`

Much like when working with `NumPy` arrays, we can select a single value or a set of values from a `Series`. There are three primary ways to specify which values we want:

1. A single label.
2. A list of labels.
3. A filtering condition.

To demonstrate this, let's define the Series `ser`.


In [None]:
ser = pd.Series([4, -2, 0, 6], index = ["a", "b", "c", "d"])
ser

##### A Single Label


In [None]:
ser["a"] # We return the value stored at the Index label "a"

##### A List of Labels


In [None]:
ser[["a", "c"]] # We return a *Series* of the values stored at the Index labels "a" and "c"

##### A Filtering Condition

Perhaps the most interesting (and useful) method of selecting data from a Series is by using a filtering condition. 

First, we apply a boolean operation to the `Series`. This creates **a new Series of boolean values**.


In [None]:
ser > 0 # Filter condition: select all elements greater than 0

We then use this boolean condition to index into our original `Series`. `pandas` will select only the entries in the original `Series` that satisfy the condition.


In [None]:
ser[ser > 0] 
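
The two steps above can also be written out separately. The sketch below stores the boolean `Series` in a variable before using it to index; the names `mask` and `positives` are introduced here only for illustration.

```python
import pandas as pd

ser = pd.Series([4, -2, 0, 6], index=["a", "b", "c", "d"])

# Step 1: the comparison produces a boolean Series with the same index as `ser`
mask = ser > 0

# Step 2: indexing with the mask keeps only the entries where the mask is True
positives = ser[mask]

print(mask.tolist())        # [True, False, False, True]
print(positives.to_dict())  # {'a': 4, 'd': 6}
```

Note that the selected entries keep their original index labels (`"a"` and `"d"`).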

### DataFrames

Typically, we will work with `Series` using the perspective that they are columns in a `DataFrame`. We can think of a **DataFrame** as a collection of **Series** that all share the same **Index**. 

In Data 8, you encountered the `Table` class of the `datascience` library, which represented tabular data. In Data 100, we'll be using the `DataFrame` class of the `pandas` library.

#### Creating a `DataFrame`

There are many ways to create a `DataFrame`. Here, we will cover the most popular approaches:

1. From a CSV file.
2. Using a list and column name(s).
3. From a dictionary.
4. From a `Series`.

More generally, the syntax for creating a `DataFrame` is: `pandas.DataFrame(data, index, columns)`.
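
As a minimal sketch of this general syntax, the cell below passes all three arguments explicitly. The row labels `"r1"` and `"r2"` and the variable name `df_example` are made up for illustration.

```python
import pandas as pd

df_example = pd.DataFrame(
    data=[[1, "one"], [2, "two"]],      # each nested list is one row
    index=["r1", "r2"],                 # row labels
    columns=["Number", "Description"],  # column labels
)

print(df_example.shape)          # (2, 2)
print(list(df_example.index))    # ['r1', 'r2']
print(list(df_example.columns))  # ['Number', 'Description']
```

If `index` is omitted, `pandas` falls back to the default integer labels starting from 0.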

##### From a CSV file
In Data 100, our data are typically stored in a CSV (comma-separated values) file format. We can import a CSV file into a `DataFrame` by passing the file path as an argument to the `pandas` function `pd.read_csv`, as in `pd.read_csv("filename.csv")`.

With our new understanding of `pandas` in hand, let's return to the `elections` dataset from before. Now, we can recognize that it is represented as a `pandas` DataFrame.


In [None]:
elections = pd.read_csv("data/elections.csv")
elections

This code stores our `DataFrame` object in the `elections` variable. Upon inspection, our `elections` DataFrame has 182 rows and 6 columns (`Year`, `Candidate`, `Party`, `Popular vote`, `Result`, `%`). Each row represents a single record: in our example, a presidential candidate from some particular year. Each column represents a single attribute or feature of the record.

##### Using a List and Column Name(s)

We'll now explore creating a `DataFrame` with data of our own.

Consider the following examples. The first code cell creates a `DataFrame` with a single column `Numbers`. The second creates a `DataFrame` with the columns `Number` and `Description`. Notice how a 2D list of values is required to initialize the second `DataFrame`: each nested list represents a single row of data.


In [None]:
df_list = pd.DataFrame([1, 2, 3], columns=["Numbers"])
df_list

In [None]:
df_list = pd.DataFrame([[1, "one"], [2, "two"]], columns = ["Number", "Description"])
df_list

##### From a Dictionary

A third (and more common) way to create a `DataFrame` is with a dictionary. The dictionary keys represent the column names, and the dictionary values represent the column values.

Below are two ways of implementing this approach. The first is based on specifying the columns of the `DataFrame`, whereas the second is based on specifying the rows of the `DataFrame`.


In [None]:
df_dict = pd.DataFrame({"Fruit": ["Strawberry", "Orange"], "Price": [5.49, 3.99]})
df_dict

In [None]:
df_dict = pd.DataFrame([{"Fruit":"Strawberry", "Price":5.49}, {"Fruit": "Orange", "Price":3.99}])
df_dict

##### From a `Series`

Earlier, we explained that a `Series` can be thought of as a column in a `DataFrame`. It follows, then, that a `DataFrame` is equivalent to a collection of `Series`, which all share the same `Index`.

In fact, we can initialize a `DataFrame` by merging two or more `Series`.


In [None]:
# Notice how our indices, or row labels, are the same

s_a = pd.Series(["a1", "a2", "a3"], index = ["r1", "r2", "r3"])
s_b = pd.Series(["b1", "b2", "b3"], index = ["r1", "r2", "r3"])

pd.DataFrame({"A-column": s_a, "B-column": s_b})

In [None]:
pd.DataFrame(s_a)

In [None]:
s_a.to_frame()
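
The examples above use `Series` whose indices match exactly. As a rough sketch of what happens otherwise, the cell below combines two hypothetical `Series` (`s_x` and `s_y`, not part of the lecture data) whose row labels only partly overlap. `pandas` aligns the values by label and fills the gaps with missing values (`NaN`).

```python
import pandas as pd

# Hypothetical Series: "r1" appears only in s_x, "r4" only in s_y
s_x = pd.Series(["x1", "x2", "x3"], index=["r1", "r2", "r3"])
s_y = pd.Series(["y2", "y3", "y4"], index=["r2", "r3", "r4"])

# Rows are matched by label, not by position
combined = pd.DataFrame({"X-column": s_x, "Y-column": s_y})
print(combined)

# Each column has one missing value, where its Series lacked the label
print(combined.isna().sum().to_dict())  # {'X-column': 1, 'Y-column': 1}
```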

### Indices

On a more technical note, the labels in an `Index` don't have to be integers, nor do they have to be unique. For example, we can set the index of the `elections` DataFrame to be the name of presidential candidates.


In [None]:
# Creating a DataFrame from a CSV file and specifying the Index column
elections = pd.read_csv("data/elections.csv", index_col = "Candidate")
elections

We can also select a new column and set it as the index of the DataFrame. For example, we can set the index of the `elections` DataFrame to represent the candidate's party.


In [None]:
elections.reset_index(inplace = True) # Resetting the index so we can set the Index again
# This sets the index to the "Party" column; inplace = True modifies `elections` directly
elections.set_index("Party", inplace = True)
elections

And, if we'd like, we can revert the index back to the default list of integers.


In [None]:
# This resets the index to be the default list of integers
elections.reset_index(inplace=True) 
elections.index

It is also important to note that the row labels that constitute an index don't have to be unique. While index values can be unique and numeric, acting as a row number, they can also be named and non-unique. 

Here we see unique and numeric index values.
![](images/uniqueindex.png)

However, here the index values are non-unique.
![](images/non-uniqueindex.png)
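
To check for repeated labels in code, one option is the `is_unique` attribute of an `Index`. The sketch below uses a small made-up DataFrame (`votes`, with placeholder party and candidate names) rather than the real `elections` data.

```python
import pandas as pd

# Hypothetical data: the label "Party X" appears twice in the index
votes = pd.DataFrame(
    {"Candidate": ["A", "B", "C"], "Year": [2000, 2000, 2004]},
    index=["Party X", "Party Y", "Party X"],
)

print(votes.index.is_unique)              # False
print(votes.index.duplicated().tolist())  # [False, False, True]
```

Here `duplicated()` marks every repeat of a label after its first occurrence.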

## `DataFrame` Attributes: Index, Columns, and Shape

On the other hand, column names in a `DataFrame` are almost always unique. Looking back to the `elections` dataset, it wouldn't make sense to have two columns named "Candidate".

Sometimes, you'll want to extract these different values, in particular, the list of row and column labels.

For index/row labels, use `DataFrame.index`:


In [None]:
elections.set_index("Party", inplace = True)
elections.index

For column labels, use `DataFrame.columns`:


In [None]:
elections.columns

And for the shape of the DataFrame, we can use `DataFrame.shape`:


In [None]:
elections.shape

## Slicing in `DataFrame`s

Now that we've learned more about `DataFrame`s, let's dive deeper into their capabilities. 

The API (Application Programming Interface) for the `DataFrame` class is enormous. In this section, we'll discuss several methods of the `DataFrame` API that allow us to extract subsets of data.

The simplest way to manipulate a `DataFrame` is to extract a subset of rows and columns, known as **slicing**. 

Common ways we may want to extract data are grabbing:

- The first or last `n` rows in the `DataFrame`.
- Data with a certain label.
- Data at a certain position.

We will do so with four primary methods of the DataFrame class:

1. `.head` and `.tail`
2. `.loc`
3. `.iloc`
4. `[]`

### Extracting data with `.head` and `.tail`

The simplest scenario in which we want to extract data is when we simply want to select the first or last few rows of the `DataFrame`.

To extract the first `n` rows of a DataFrame `df`, we use the syntax `df.head(n)`.


In [None]:
elections = pd.read_csv("data/elections.csv")

# Extract the first 5 rows of the DataFrame
elections.head(5)

Similarly, calling `df.tail(n)` allows us to extract the last `n` rows of the DataFrame.  


In [None]:
# Extract the last 5 rows of the DataFrame
elections.tail(5)

### Label-based Extraction: Indexing with `.loc`

For the more complex task of extracting data with specific column or index labels, we can use `.loc`. The `.loc` accessor allows us to specify the ***labels*** of rows and columns we wish to extract. The **row labels** (commonly referred to as the **index**) are the bold text on the far *left* of a DataFrame, while the **column labels** are the column names found at the *top* of a DataFrame.

![](images/locgraphic.png)

To grab data with `.loc`, we must specify the row and column label(s) where the data exist. The row labels are the first argument to the `.loc` accessor; the column labels are the second.

Arguments to `.loc` can be:

- A single value.
- A slice.
- A list.

For example, to select a single value, we can select the row labeled `0` and the column labeled `Candidate` from the `elections` `DataFrame`.


In [None]:
elections.loc[0, 'Candidate']

Keep in mind that if exactly one of the two arguments is a single value, the result is a `Series`. Below, we pass a list of row labels and a single column label to extract a subset of the `"Popular vote"` column as a `Series`.


In [None]:
elections.loc[[87, 25, 179], "Popular vote"]

To select *multiple* rows and columns, we can use Python slice notation. Here, we select the rows from labels `0` to `3` and the columns from labels `"Year"` to `"Popular vote"`. 


In [None]:
elections.loc[0:3, 'Year':'Popular vote']

Suppose that instead, we want to extract *all* column values for the first four rows in the `elections` DataFrame. The shorthand `:` is useful for this.


In [None]:
elections.loc[0:3, :]

We can use the same shorthand to extract all rows. 


In [None]:
elections.loc[:, ["Year", "Candidate", "Result"]]

There are a couple of things we should note. Firstly, unlike conventional Python, `pandas` allows us to slice string values (in our example, the column labels). Secondly, slicing with `.loc` is *inclusive*. Notice how our resulting `DataFrame` includes every row and column between and including the slice labels we specified.
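
Both points can be checked on a small toy DataFrame. The column names `w` through `z` and the variable names below are invented for this sketch.

```python
import pandas as pd

# Hypothetical toy data with default integer row labels 0..3
toy = pd.DataFrame(
    {"w": [1, 2, 3, 4], "x": [5, 6, 7, 8], "y": [9, 10, 11, 12], "z": [13, 14, 15, 16]}
)

# Both endpoints of each slice are included: rows 1 and 2, columns "x" and "y"
subset = toy.loc[1:2, "x":"y"]

print(list(subset.index))    # [1, 2]
print(list(subset.columns))  # ['x', 'y']

# For contrast, a plain Python list slice excludes the right endpoint
print([10, 20, 30, 40][1:2])  # [20]
```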

Equivalently, we can use a list to obtain multiple rows and columns in our `elections` DataFrame. 


In [None]:
elections.loc[[0, 1, 2, 3], ['Year', 'Candidate', 'Party', 'Popular vote']]

Lastly, we can interchange list and slicing notation.


In [None]:
elections.loc[[0, 1, 2, 3], :]

### Integer-based Extraction: Indexing with `.iloc`

Slicing with `.iloc` works similarly to `.loc`. However, `.iloc` uses the *integer positions* of rows and columns rather than the labels (think to yourself: **l**oc uses **l**abels; **i**loc uses **i**nteger positions). The arguments to `.iloc` also behave similarly: single values, lists, slices, and any combination of these are permitted.

Let's begin reproducing our results from above. We'll begin by selecting the first presidential candidate in our `elections` DataFrame:


In [None]:
# elections.loc[0, "Candidate"] - Previous approach
elections.iloc[0, 1]

Notice how the first argument to both `.loc` and `.iloc` is the same. This is because the row with a label of 0 is conveniently in the $0^{th}$ position (equivalently, the first row) of the `elections` DataFrame. Generally, this is true of any DataFrame where the row labels are incremented in ascending order from 0.
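
When the row labels are not in ascending order from 0, labels and positions no longer agree. The sketch below illustrates this with a made-up `scores` DataFrame; it uses `sort_values`, a `pandas` method not yet covered in these notes, only to reorder the rows.

```python
import pandas as pd

# Hypothetical data with default labels 0, 1, 2
scores = pd.DataFrame({"Name": ["Ana", "Ben", "Cy"], "Score": [70, 95, 85]})

# Sorting reorders the rows, but each row keeps its original label
ranked = scores.sort_values("Score", ascending=False)
print(list(ranked.index))  # [1, 2, 0]

print(ranked.loc[0, "Name"])  # Ana: the row whose *label* is 0
print(ranked.iloc[0, 0])      # Ben: the row in *position* 0
```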

And, as before, if exactly one of the two arguments is a single value, our result is a `Series`.


In [None]:
elections.iloc[[1,2,3],1]

However, when we select the first four rows and columns using `.iloc`, we notice something.


In [None]:
# elections.loc[0:3, 'Year':'Popular vote'] - Previous approach
elections.iloc[0:4, 0:4]

Slicing is no longer inclusive in `.iloc` — it's *exclusive*. In other words, the right end of a slice is not included when using `.iloc`. This is one of the subtleties of `pandas` syntax; you will get used to it with practice.
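
A quick way to see the difference is to apply the same slice with both accessors. The toy DataFrame below is made up for this sketch.

```python
import pandas as pd

# Hypothetical toy data with default integer row labels 0..4
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

by_label = toy.loc[0:3]      # labels 0, 1, 2, 3 (right end included)
by_position = toy.iloc[0:3]  # positions 0, 1, 2 (right end excluded)

print(len(by_label), len(by_position))  # 4 3
```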

List behavior works just as expected.


In [None]:
#elections.loc[[0, 1, 2, 3], ['Year', 'Candidate', 'Party', 'Popular vote']] - Previous Approach
elections.iloc[[0, 1, 2, 3], [0, 1, 2, 3]]

And just like with `.loc`, we can use a colon with `.iloc` to extract all rows or columns.


In [None]:
elections.iloc[:, 0:3]

This discussion raises the question: when should we use `.loc` vs. `.iloc`? In most cases, `.loc` is safer to use. `.iloc` may return unintended values when applied to a dataset where the ordering of the rows can change. However, `.iloc` can still be useful. For example, if you are looking at a `DataFrame` of sorted movie earnings and want to get the median earnings for a given year, you can use `.iloc` to index into the middle.
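
The movie example might look something like the sketch below. The titles and earnings are invented, the number of rows is assumed to be odd so that there is a single middle row, and `sort_values` is used only to put the rows in order.

```python
import pandas as pd

# Hypothetical earnings (in millions) for one year
movies = pd.DataFrame(
    {"Title": ["A", "B", "C", "D", "E"], "Earnings": [120, 45, 300, 80, 210]}
)

# Sort by earnings, then use a position (not a label) to find the middle row
sorted_movies = movies.sort_values("Earnings")
middle = len(sorted_movies) // 2
median_row = sorted_movies.iloc[middle]

print(median_row["Title"], median_row["Earnings"])  # A 120
```

Using `.loc[middle]` here would instead return the row *labeled* 2, which is movie C, not the median.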

Overall, it is important to remember that:

- `.loc` performs **l**abel-based extraction.
- `.iloc` performs **i**nteger-based extraction.

### Context-dependent Extraction: Indexing with `[]`

The `[]` selection operator is the most baffling of all, yet the most commonly used. It only takes a single argument, which may be one of the following:

1. A slice of row numbers.
2. A list of column labels.
3. A single column label.

That is, `[]` is *context-dependent*. Let's see some examples.

#### A slice of row numbers

Say we wanted the first four rows of our `elections` DataFrame.


In [None]:
elections[0:4]

#### A list of column labels

Suppose we now want the first four columns.


In [None]:
elections[["Year", "Candidate", "Party", "Popular vote"]]

#### A single column label

Lastly, `[]` allows us to extract only the `Candidate` column.


In [None]:
elections["Candidate"]

The output is a `Series`! In this course, we'll become very comfortable with `[]`, especially for selecting columns. In practice, `[]` is much more common than `.loc`, especially since it is far more concise.
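
To make the context-dependence concrete, the sketch below applies all three forms of `[]` to a small toy DataFrame (the values are placeholders, not the real `elections` data) and prints the type and shape of each result.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"Year": [1824, 1828, 1832], "Result": ["loss", "win", "win"]})

rows = toy[0:2]       # slice of row numbers -> DataFrame with the first 2 rows
cols = toy[["Year"]]  # list of column labels -> DataFrame, even with one label
col = toy["Year"]     # single column label -> Series

print(type(rows).__name__, rows.shape)  # DataFrame (2, 2)
print(type(cols).__name__, cols.shape)  # DataFrame (3, 1)
print(type(col).__name__, col.shape)    # Series (3,)
```

The extra pair of brackets in `toy[["Year"]]` is what turns the argument into a list, so the result stays a `DataFrame`.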

## Parting Note

The `pandas` library is enormous and contains many useful functions. Here is a link to the [documentation](https://pandas.pydata.org/docs/). We certainly don't expect you to memorize each and every method of the library.

The introductory Data 100 `pandas` lectures will provide a high-level view of the key data structures and methods that will form the foundation of your `pandas` knowledge. A goal of this course is to help you build your familiarity with the real-world programming practice of ...Googling! Answers to your questions can be found in documentation, Stack Overflow, etc. Being able to search for, read, and implement documentation is an important life skill for any data scientist. 

With that, we will move on to Pandas II.
