::: {.callout-note collapse="true"}
## Learning Outcomes

- Build familiarity with basic `pandas` syntax
- Learn the methods of selecting and filtering data from a DataFrame.
- Understand the differences between DataFrames and Series
:::

Data scientists work with data stored in a variety of formats. The primary focus of this class is on understanding tabular data, one of the most widely used formats in data science. This note introduces DataFrames, which are among the most popular representations of tabular data. We'll also introduce `pandas`, the standard Python package for manipulating data in DataFrames.

## DataFrames

In Data 8, you encountered the `Table` class of the `datascience` library. In Data 100, we'll be using the `DataFrame` class of the `pandas` library.

Here is an example of a DataFrame containing election data.

In [None]:
import pandas as pd

elections = pd.read_csv("data/elections.csv")
elections

Let's dissect the code above. 

1. We first import the `pandas` library into our Python environment, using the alias `pd`. <br> &emsp;`import pandas as pd`

2. There are a number of ways to read data into a DataFrame. In Data 100, our data are typically stored in a CSV (comma-separated values) file format. We can import a CSV file into a DataFrame by passing the data path as an argument to the following `pandas` function.
<br> &emsp;`pd.read_csv("data/elections.csv")`

This code stores our DataFrame object in the `elections` variable. Upon inspection, our `elections` DataFrame has 182 rows and 6 columns. Each row represents a single record - in our example, a presidential candidate from some particular year. Each column represents a single attribute, or feature, of the record.
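
As a minimal, self-contained sketch of the same workflow, the snippet below reads CSV text from an in-memory buffer instead of a file on disk. The tiny table (`toy_csv`, with placeholder candidates and parties) is made up for illustration and is not the course's `elections.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk (all values are made up).
toy_csv = """Year,Candidate,Party
2000,Candidate A,Party X
2000,Candidate B,Party Y
2004,Candidate C,Party X
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(toy_csv))

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column labels, taken from the header line
```

Each line after the header becomes one row (record), and each comma-separated field becomes one column (attribute).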

The API (application programming interface) for the DataFrame class is enormous. In the next section, we'll discuss several methods of the DataFrame API that allow us to extract subsets of data.

## Slicing in DataFrames

The most fundamental way to manipulate a DataFrame is to extract a subset of rows and columns. This is called **slicing**. We will do so with three primary tools of the DataFrame class:

1. `.loc`
2. `.iloc`
3. `[]`

### Indexing with .loc

The `.loc` operator selects rows and columns in a DataFrame by their row and column label(s), respectively. The **row label** (commonly referred to as the **index**) is the bold text on the far *left* of a DataFrame, while the **column label** is the text found at the *top* of a DataFrame. By default, row labels in `pandas` are the sequential list of integers beginning from 0. The column labels in our `elections` DataFrame are the column names themselves: `Year`, `Candidate`, `Party`, `Popular vote`, `Result`, and `%`.

`.loc` lets us grab data by specifying the row and column label(s) where the data exists. The row labels are the first argument to `.loc`; the column labels are the second. For example, to select the row labeled `0` and the column labeled `Candidate` from our `elections` DataFrame, we can write:

In [None]:
elections.loc[0, 'Candidate']

To select *multiple* rows and columns, we can use Python slice notation. Here, we select the first four rows and the first four columns.

In [None]:
elections.loc[0:3, 'Year':'Popular vote']

Suppose that instead, we wanted *every* column value for the first four rows in the `elections` DataFrame. The shorthand `:` is useful for this.

In [None]:
elections.loc[0:3, :]

There are a couple of things we should note. First, unlike conventional Python, `pandas` allows us to slice with string labels (in our example, the column labels). Second, slicing with `.loc` is *inclusive*. Notice how our resulting DataFrame includes every row and column between and including the slice labels we specified.
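
The inclusive behavior can be checked on a small example. The sketch below uses a hypothetical `toy` DataFrame with placeholder values (not the real `elections` data) and compares `.loc` slicing with ordinary Python list slicing.

```python
import pandas as pd

# Hypothetical toy table; the values are placeholders, not real election data.
toy = pd.DataFrame({
    "Year": [2000, 2000, 2004, 2004],
    "Candidate": ["Candidate A", "Candidate B", "Candidate C", "Candidate D"],
    "Party": ["Party X", "Party Y", "Party X", "Party Y"],
})

# Label-based slicing with .loc includes BOTH endpoints:
# rows labeled 0, 1, 2 and columns "Year" through "Candidate".
by_label = toy.loc[0:2, "Year":"Candidate"]
print(by_label.shape)

# Ordinary Python list slicing excludes the stop position.
plain_slice = [10, 20, 30, 40][0:2]
print(plain_slice)
```

The `.loc` slice `0:2` keeps three rows, while the list slice `0:2` keeps only two elements.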

Equivalently, we can use a list to obtain multiple rows and columns in our `elections` DataFrame. 

In [None]:
elections.loc[[0, 1, 2, 3], ['Year', 'Candidate', 'Party', 'Popular vote']]

Lastly, we can mix list and slice notation in the same call.

In [None]:
elections.loc[[0, 1, 2, 3], :]

### Indexing with .iloc

Slicing with `.iloc` works similarly to `.loc`, although `.iloc` uses the integer positions of rows and columns rather than the labels. The arguments to `.iloc` also behave similarly - single values, lists, slices, and any combination of these are permitted.

Let's begin reproducing our results from above. We'll begin by selecting the first presidential candidate in our `elections` DataFrame:

In [None]:
# elections.loc[0, "Candidate"] - Previous approach
elections.iloc[0, 1]

Notice how the first argument to both `.loc` and `.iloc` is the same. This is because the row with a label of 0 is conveniently in the 0^th^ (or first) position of the `elections` DataFrame. Generally, this is true of any DataFrame where the row labels are incremented in ascending order from 0.

However, when we select the first four rows and columns using `.iloc`, we notice something different.

In [None]:
# elections.loc[0:3, 'Year':'Popular vote'] - Previous approach
elections.iloc[0:4, 0:4]

Slicing is no longer inclusive in `.iloc` - it's *exclusive*. This is one of the syntactical subtleties of `pandas`; you'll get used to it with practice.

List behavior works just as expected.

In [None]:
#elections.loc[[0, 1, 2, 3], ['Year', 'Candidate', 'Party', 'Popular vote']] - Previous Approach
elections.iloc[[0, 1, 2, 3], [0, 1, 2, 3]]

This discussion raises the question: when should we use `.loc` vs. `.iloc`? In most cases, `.loc` is safer to use. `.iloc` may return unintended values when applied to a dataset where the ordering of the data can change, because a given position may then hold a different row or column.
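
To make the difference concrete, here is a small sketch on a hypothetical `toy` DataFrame with placeholder values. The rows are reordered with the DataFrame version of `.sort_values` (used here only to shuffle the row order), after which label `0` and position `0` no longer refer to the same row.

```python
import pandas as pd

# Hypothetical toy table; the values are placeholders.
toy = pd.DataFrame({
    "Year": [2004, 2000, 2008],
    "Candidate": ["Candidate C", "Candidate A", "Candidate E"],
})

# Sorting reorders the rows, but each row keeps its original label.
by_year = toy.sort_values("Year")
print(list(by_year.index))  # the labels are no longer in ascending order

label_based = by_year.loc[0, "Candidate"]  # the row whose LABEL is 0
position_based = by_year.iloc[0, 1]        # the row in the first POSITION

print(label_based)
print(position_based)
```

`.loc` still finds the row that was originally labeled `0`, while `.iloc` returns whichever row happens to sit first after sorting.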

### Indexing with []

The `[]` selection operator is the most baffling of all, yet it is the most commonly used. It only takes a single argument, which may be one of the following:

1. A slice of row numbers
2. A list of column labels
3. A single column label

That is, `[]` is *context dependent*. Let's see some examples.

#### A slice of row numbers

Say we wanted the first four rows of our `elections` DataFrame.

In [None]:
elections[0:4]

#### A list of column labels

Suppose we now want the first four columns.

In [None]:
elections[["Year", "Candidate", "Party", "Popular vote"]]

#### A single column label

Lastly, suppose we only want the `Candidate` column.

In [None]:
elections["Candidate"]

The output looks quite different - it's no longer a DataFrame! This is a *Series*. We'll talk about what a Series is in the next section.
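
To recap the three cases side by side, the sketch below applies each form of `[]` to a hypothetical `toy` DataFrame (placeholder values) and records the type and shape of each result.

```python
import pandas as pd

# Hypothetical toy table; the values are placeholders.
toy = pd.DataFrame({
    "Year": [2000, 2000, 2004, 2004],
    "Candidate": ["Candidate A", "Candidate B", "Candidate C", "Candidate D"],
    "Party": ["Party X", "Party Y", "Party X", "Party Y"],
})

rows = toy[0:2]                # a slice of row numbers   -> DataFrame
cols = toy[["Year", "Party"]]  # a list of column labels  -> DataFrame
one_col = toy["Candidate"]     # a single column label    -> Series

for result in (rows, cols, one_col):
    print(type(result).__name__, result.shape)
```

Note that the row slice `0:2` keeps two rows: like `.iloc`, row slicing with `[]` excludes the stop position.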

## DataFrames, Series, and Indices

We saw that selecting a single column from a DataFrame using the `[]` operator returned a new data structure, called a Series. Let's verify this claim.

In [None]:
type(elections)

In [None]:
type(elections['Candidate'])

A **Series** is a one-dimensional object that represents a single column of data. It has two components - an index and the data. A DataFrame is equivalent to a collection of multiple Series, which all share the same index. Notice how the index of a Series is equivalent to the index (or row labels) of a DataFrame.

![](images/index_comparison_1.png)

However, a DataFrame index doesn't have to be an integer, nor does it have to be unique. For example, we can set our index to be the name of our presidential candidates. Selecting a new Series from this modified DataFrame yields the following:

In [None]:
elections.set_index("Candidate", inplace=True) # This sets the index to the "Candidate" column

![](images/index_comparison_2.png)

To retrieve the indices of a DataFrame, simply use the `.index` attribute of the DataFrame class.

In [None]:
elections.index

In [None]:
elections.reset_index(inplace=True) # This resets the index
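
The claim that an index need not be unique can be checked on a small example. This sketch uses a hypothetical `toy` DataFrame in which the placeholder name "Candidate A" appears twice, and it calls `set_index` and `reset_index` without `inplace=True`, so each call returns a new DataFrame instead of modifying the original.

```python
import pandas as pd

# Hypothetical toy table; "Candidate A" deliberately appears twice.
toy = pd.DataFrame({
    "Year": [2000, 2004, 2004],
    "Candidate": ["Candidate A", "Candidate A", "Candidate B"],
})

# Use the "Candidate" column as the row labels.
by_candidate = toy.set_index("Candidate")
print(by_candidate.index.is_unique)  # False, because of the repeated label

# With a repeated label, .loc returns every matching row.
repeated = by_candidate.loc["Candidate A"]
print(repeated)

# reset_index moves the labels back into an ordinary column.
restored = by_candidate.reset_index()
print(list(restored.columns))
```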

Earlier, we mentioned that a Series was just a column of data. What if we wanted a single column as a DataFrame? To obtain this, we can pass in a list containing a single column to the `[]` selection operator.

In [None]:
elections[["Party"]] # ["Party"] is the argument - a list with a single element
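
As a quick check of the distinction, the sketch below selects the same column both ways from a hypothetical `toy` DataFrame (placeholder values) and compares the results.

```python
import pandas as pd

# Hypothetical toy table; the values are placeholders.
toy = pd.DataFrame({
    "Candidate": ["Candidate A", "Candidate B", "Candidate C"],
    "Party": ["Party X", "Party Y", "Party X"],
})

as_series = toy["Party"]   # single label       -> Series (one-dimensional)
as_frame = toy[["Party"]]  # one-element list   -> DataFrame (two-dimensional)

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)

# Both results carry the same index as the DataFrame they came from.
shares_index = as_series.index.equals(toy.index) and as_frame.index.equals(toy.index)
print(shares_index)
```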

## Conditional Selection

Conditional selection allows us to select a subset of rows in a DataFrame if they follow some specified condition.

To understand how to use conditional selection, we must look at another kind of input accepted by `.loc` and `[]`: a boolean array. This boolean array must have a length equal to the number of rows in the DataFrame. The selection returns every row whose position corresponds to a `True` value in the array.

Here, we will select all *even-indexed* rows in the first 10 rows of our DataFrame.

In [None]:
# Why is :9 the correct slice to select the first 10 rows?
elections_first_10_rows = elections.loc[:9, :]

# Notice how we have exactly 10 elements in our boolean array argument
elections_first_10_rows[[True, False, True, False, True, \
                         False, True, False, True, False]]

Unfortunately, writing out a boolean array by hand is infeasible for a large DataFrame. Instead, we can provide a logical condition as an input to `.loc` or `[]`; the condition evaluates to a boolean array of the required length.

For example, to return all candidates affiliated with the Independent party:

In [None]:
logical_operator = elections['Party'] == "Independent"
elections[logical_operator]

Here, `logical_operator` evaluates to a Series of boolean values with length 182.

In [None]:
logical_operator

Rows 121, 130, 143, 161, 167, and 174 evaluate to `True` and are thus returned in the DataFrame.

In [None]:
#| code-fold: true
print(logical_operator.loc[[121, 130, 143, 161, 167, 174]])

Passing a Series as an argument to `elections[]` has the same effect as using a boolean array. In fact, the `[]` selection operator can take a boolean Series, array, or list as its argument. These three are used interchangeably throughout the course.

We can achieve the same result with `.loc`.
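
The interchangeability of the three mask types can be verified directly. The sketch below builds one boolean mask on a hypothetical `toy` DataFrame (placeholder values), converts it to a NumPy array and to a plain Python list, and confirms that all three select the same rows.

```python
import pandas as pd

# Hypothetical toy table; the values are placeholders.
toy = pd.DataFrame({
    "Candidate": ["Candidate A", "Candidate B", "Candidate C", "Candidate D"],
    "Party": ["Party X", "Party Y", "Party X", "Party Y"],
})

mask_series = toy["Party"] == "Party X"  # boolean Series
mask_array = mask_series.to_numpy()      # NumPy boolean array
mask_list = mask_series.tolist()         # plain Python list of booleans

from_series = toy[mask_series]
from_array = toy[mask_array]
from_list = toy[mask_list]

same_rows = from_series.equals(from_array) and from_series.equals(from_list)
print(same_rows)
```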

In [None]:
elections.loc[elections['Party'] == "Independent"]

Boolean conditions can be combined using various operators that allow us to filter results by multiple conditions. Some examples include the `&` (and) operator and `|` (or) operator.

**Note**: When combining multiple conditions with logical operators, be sure to surround each condition with a set of parentheses `()`. The `&` and `|` operators bind more tightly than comparisons such as `==` and `<`, so if you forget the parentheses, your code will throw an error.

For example, if we want to return data on all presidential candidates affiliated with the Independent Party who ran before the year 2000, we can do so:

In [None]:
elections[(elections['Party'] == "Independent") \
          & (elections['Year'] < 2000)]
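
The note about parentheses can be checked on a small example. The sketch below uses a hypothetical `toy` DataFrame (placeholder values) and, for the failing case, a purely numeric pair of conditions rather than the `Party`/`Year` filter above, since that variant makes the operator-precedence problem easy to trace.

```python
import pandas as pd

# Hypothetical toy table; the values are placeholders.
toy = pd.DataFrame({
    "Year": [1992, 1996, 2000, 2004],
    "Party": ["Party X", "Party Y", "Party X", "Party Y"],
})

# Correct: each condition is wrapped in its own parentheses.
both = toy[(toy["Party"] == "Party X") & (toy["Year"] < 2000)]
either = toy[(toy["Party"] == "Party X") | (toy["Year"] < 2000)]
print(len(both), len(either))

# Without parentheses, & is applied before the comparisons, so Python reads
# the condition as the chained comparison  Year > (1995 & Year) < 2001.
# Evaluating that chain raises an error instead of filtering rows.
try:
    toy[toy["Year"] > 1995 & toy["Year"] < 2001]
    raised = False
except (TypeError, ValueError) as err:
    raised = True
    print(type(err).__name__)
```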

## Handy Utility Functions

There are a large number of operations supported by `pandas` that allow us to efficiently manipulate data. In this section, we'll cover a few.

1. `.head` and `.tail`
2. `.shape` and `.size`
3. `.describe`
4. `.sample`
5. `.value_counts`
6. `.unique`
7. `.sort_values`

#### .head / .tail

`.head(n)` and `.tail(n)` display the first `n` and last `n` rows of a DataFrame, respectively.

In [None]:
elections.head(3)

In [None]:
elections.tail(3)

#### .shape / .size

`.shape` returns a tuple with the number of rows and columns in a DataFrame. <br>
`.size` returns the total number of data entries. This is the product of the number of rows and columns.

In [None]:
elections.shape

In [None]:
num_rows, num_cols = elections.shape
assert(elections.size == num_rows * num_cols)
elections.size

#### .describe

`.describe()` returns a DataFrame of useful summary statistics for each numerical column.

In [None]:
elections.describe()

#### .sample

`.sample(n)` returns a random sample of `n` rows from the given DataFrame.

In [None]:
elections.sample(3)
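
Because `.sample` is random, repeated calls generally return different rows. As a small sketch on a hypothetical `toy` DataFrame, the optional `random_state` parameter of `.sample` (not covered above) fixes the random seed so that the same rows come back each time.

```python
import pandas as pd

# Hypothetical toy table with ten rows.
toy = pd.DataFrame({"Year": range(2000, 2010)})

# The same seed gives the same sample on every call.
first = toy.sample(3, random_state=0)
second = toy.sample(3, random_state=0)

print(len(first))
print(first.equals(second))
```

Sampling is done without replacement by default, so the three rows are distinct.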

#### .value_counts

`.value_counts()` is called on a column and returns a Series containing the total count of each unique value.

In [None]:
#| fig-cap: This code tells us how many times each candidate ran for president of the United States.
elections['Candidate'].value_counts()

#### .unique

`.unique()` is called on a Series and returns an array with its unique values.

In [None]:
# For brevity, we have limited the results to 5 candidates 
elections['Candidate'].unique()[:5]
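
The difference between `.value_counts()` and `.unique()` is easiest to see on a short Series. The `parties` Series below is a hypothetical example with placeholder values.

```python
import pandas as pd

# Hypothetical Series; the values are placeholders.
parties = pd.Series(["Party X", "Party Y", "Party X", "Party X"], name="Party")

# value_counts: a Series mapping each distinct value to how often it occurs,
# ordered from most to least frequent.
counts = parties.value_counts()
print(counts)

# unique: just the distinct values, in order of first appearance.
distinct = parties.unique()
print(distinct)
```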

#### .sort_values

`.sort_values()` returns a sorted version of the Series it was called on. Numerical values are sorted by magnitude, while text is sorted in alphabetical order. By default, values are sorted in ascending order; pass the optional argument `ascending=False` to sort in descending order.

In [None]:
elections['Candidate'].sort_values()
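
As a short sketch of both sort directions, the hypothetical `years` Series below is given letter labels so that it is easy to see each value keeping its original index label after sorting.

```python
import pandas as pd

# Hypothetical Series with letter labels as its index.
years = pd.Series([2004, 1996, 2012], index=["a", "b", "c"])

smallest_first = years.sort_values()                # default: ascending order
largest_first = years.sort_values(ascending=False)  # descending order

print(smallest_first)
print(largest_first)

# The original Series is left unchanged; sort_values returns a new Series.
print(list(years))
```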

### Parting Note

The `pandas` library is enormous and contains many useful functions. Here is a link to the [documentation](https://pandas.pydata.org/docs/).

This lecture and the next will cover important methods you should be fluent in. However, we want you to get familiar with the real-world programming practice of... Googling! Answers to your questions can be found in documentation, Stack Overflow, etc.

With that, let's move on to Pandas II.
