::: {.callout-note collapse="true"}
## Learning Outcomes

- Build familiarity with basic `pandas` syntax
- Learn the methods for selecting and filtering data from a DataFrame
- Understand the differences between DataFrames and Series
:::

Data scientists work with data stored in a variety of formats. The primary focus of this class is understanding tabular data, one of the most widely used formats in data science. This note introduces DataFrames, which are among the most popular representations of tabular data. We’ll also introduce `pandas`, the standard Python package for manipulating tabular data.


## DataFrames, Series, and Indices
There are three fundamental data structures in `pandas`:

* Series: 1D labeled array data, best thought of as a single column of data.
* DataFrame: 2D tabular data with both row and column labels.
* Index: A sequence of row or column labels.

Here is an example of a DataFrame containing election data.





In [3]:
import pandas as pd

elections = pd.read_csv("data/elections.csv")
elections

|     | Year | Candidate | Party | Popular vote | Result | % |
|-----|------|-----------|-------|--------------|--------|---|
| 0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 |
| 1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 |
| 2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 |
| 3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 |
| 4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 |
| ... | ... | ... | ... | ... | ... | ... |
| 177 | 2016 | Jill Stein | Green | 1457226 | loss | 1.073699 |
| 178 | 2020 | Joseph Biden | Democratic | 81268924 | win | 51.311515 |
| 179 | 2020 | Donald Trump | Republican | 74216154 | loss | 46.858542 |
| 180 | 2020 | Jo Jorgensen | Libertarian | 1865724 | loss | 1.177979 |


Let's dissect the code above. 

1. We first import the `pandas` library into our Python environment, using the alias `pd`. <br> &emsp;`import pandas as pd`

2. There are a number of ways to read data into a DataFrame. In Data 100, our data are typically stored in a CSV (comma-separated values) file format. We can import a CSV file into a DataFrame by passing the data path as an argument to the following `pandas` function. 
<br> &emsp;`pd.read_csv("data/elections.csv")`

This code stores our DataFrame object in the `elections` variable. Upon inspection, our `elections` DataFrame has 182 rows and 6 columns. Each row represents a single record - in our example, a presidential candidate from some particular year. Each column represents a single attribute, or feature, of the record.
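To make the reading step concrete without needing the course file, here is a minimal sketch that reads a few of the rows shown above from an in-memory string. The `csv_text` and `sample` names are illustrative assumptions, and `io.StringIO` simply stands in for a file on disk.

```python
import io
import pandas as pd

# Three rows copied from the elections table above (not the full dataset).
csv_text = """Year,Candidate,Party,Popular vote,Result,%
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
"""

# io.StringIO lets read_csv treat the string like an open file.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # column names are taken from the first line
```

The `.shape` attribute is a quick way to check the row and column counts of any DataFrame after loading it.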

The API (application programming interface) for the DataFrame class is enormous. Let's dig deeper into what exactly a DataFrame is.

`elections` is an example of a DataFrame. Using the built-in `type` function, we can check that it is indeed a DataFrame object. 

In [4]:
type(elections)

pandas.core.frame.DataFrame

### Series 

A `Series` is a 1-dimensional array-like object containing a sequence of values of the *same type* and an associated array of data labels, called its `index`. Think of this as the fundamental building block of `pandas`. 

### Creating a Series

Passing a list into the constructor will initialize a `Series` containing the same elements as the list, with a default numeric `Index`.



In [5]:
# This creates a Series containing the numbers 1, 2, and 3.
s = pd.Series([1, 2, 3])
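As a small sketch of what "default numeric `Index`" means in practice, the following inspects the labels and the underlying values of the same Series. Nothing here goes beyond the standard `.index` and `.values` attributes.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

print(s.index)        # the default labels, counting up from 0
print(list(s.index))  # the same labels as a plain Python list
print(s.values)       # the underlying data as a NumPy array
```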

### Setting an Index
The `Index` serves as a way of identifying the items in a Series. Sometimes, integers are not the most natural labels. For example, if you are working with data that corresponds to people, then you can set the `Index` to be the name of each person. This can help make your code more interpretable.

Setting an `Index` can be done through the constructor's `index = ...` argument **or** by setting the `.index` attribute to be a list. 

In [24]:
# Let's imagine we have a Series that contains the heights of 3 people: 
# Alice, Bob, and Candice. 
# Giving the heights Series an Index tells us which person each value belongs to. 

heights = pd.Series([150, 160, 155], index = ["Alice","Bob","Candice"])
heights



Alice      150
Bob        160
Candice    155
dtype: int64
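The cell above uses the constructor argument. The second route mentioned earlier, assigning to the `.index` attribute, might look like the following sketch, which reuses the same heights and names.

```python
import pandas as pd

# Start with the default integer labels 0, 1, 2 ...
heights = pd.Series([150, 160, 155])

# ... then replace them by assigning a list to the .index attribute.
heights.index = ["Alice", "Bob", "Candice"]

print(heights)
print(heights["Bob"])  # look up a value by its new label
```

The assigned list must have the same length as the Series; otherwise `pandas` raises an error.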

### Selection in Series

We can select a single value or a set of values from a Series using a variety of methods. All of these methods require the name of the `Series` followed by square brackets `[]` containing an argument:

1. A single label. 
    - This will return the value(s) associated with the label in the `Index`. Each label in the `Index` will have at least one value associated with it at all times.  

2. A list of labels
    - This will return all values associated with each label in the list. This is similar to the first method.
    
3. A filtering condition
    - How this method works will be clarified in a future section. In essence, it compares every value in the `Series` against another value using a comparison operator. Every value for which the comparison evaluates to `True` will be returned. 



In [29]:
s = pd.Series([5, 10, 15, 20], index=["A", "B", "C", "D"])
# Selecting a single value using a single label
# Note: this returns an integer because each value is an integer
s["A"]

# Selecting a list of values using a list of labels
# Note: this returns a Series
s[["B", "D"]]

# Selecting all values greater than 11
s[s > 11]




5
B    10
D    20
dtype: int64
C    15
D    20
dtype: int64
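To see what the filtering condition does under the hood, it can help to look at the intermediate result of the comparison on its own. This is a minimal sketch using the same Series; the `mask` name is just an illustrative choice.

```python
import pandas as pd

s = pd.Series([5, 10, 15, 20], index=["A", "B", "C", "D"])

# The comparison is applied to every value and yields a Series of booleans
# that carries the same labels as s.
mask = s > 11
print(mask)

# Passing that boolean Series into [] keeps only the entries marked True.
print(s[mask])
```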


### DataFrames

The `DataFrame` object is specialized for storing **2D tabular** data. The two dimensions are called the rows and columns. **Rows** contain separate units of observation (people, days, etc.) and run **horizontally**, stacked one beneath another. **Columns** represent features of the data (a person's height, a day's temperature) and run **vertically**. Each column has a name and each row has an index value. These identifying values **should** be unique. While it's possible to have duplicate row indices and column names, it is highly discouraged as it can lead to confusion.



### Creating a DataFrame

There are three common approaches:

1. Using a collection of `list` objects for the values, plus a list of column names
    - Each `list` in the collection is used as a row in the resulting DataFrame. Each `list` should have the same number of items.

2. From a dictionary of {column name : `list`}
3. From a dictionary of {column name : `Series`}

    - In contrast to the first method, the second and third methods are helpful if we want to define the data by *column* instead of by row. Each key in our dictionary will become a column name in our DataFrame, with the corresponding value becoming the Series that the DataFrame uses to represent that column. The value can be defined as either a `list` or a `Series`; either way, it ends up as a column.

In [30]:
#Using a collection of list objects, one for each row
ex1 = pd.DataFrame([[1,'one'],[2,'two'],[3,'three']],columns = ["Numbers","Words"])

#Using a dictionary of list objects
ex2 = pd.DataFrame({"Numbers" : [1,2,3], "Words":['one','two','three']})

#Using a dictionary of Series objects
number_series = pd.Series([1,2,3])
word_series = pd.Series(['one','two','three'])
ex3 = pd.DataFrame({"Numbers" : number_series, "Words":word_series })
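One way to convince yourself that the three approaches describe the same table is to compare the results directly. The sketch below rebuilds the three examples under the illustrative names `by_rows`, `by_lists`, and `by_series`, and checks them with `DataFrame.equals`, which compares both values and labels.

```python
import pandas as pd

# Row-wise: each inner list is one row; column names are given separately.
by_rows = pd.DataFrame([[1, "one"], [2, "two"], [3, "three"]],
                       columns=["Numbers", "Words"])

# Column-wise: each dictionary key is a column name.
by_lists = pd.DataFrame({"Numbers": [1, 2, 3],
                         "Words": ["one", "two", "three"]})
by_series = pd.DataFrame({"Numbers": pd.Series([1, 2, 3]),
                          "Words": pd.Series(["one", "two", "three"])})

print(by_rows.equals(by_lists))
print(by_lists.equals(by_series))
```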


### The Relationship Between DataFrames, Series, and Indices

A DataFrame is equivalent to a collection of multiple Series that all share the same `Index`, the object `pandas` uses to hold the row labels. Notice how the `Index` of a Series is equivalent to the `Index` (or row labels) of the DataFrame it comes from (this will come up again later). The row labels of a DataFrame don't have to be integers, nor do they have to be unique. For example, we can set our index to be the names of our presidential candidates:

![](images/index_comparison_1.png)

To retrieve the row indices of a DataFrame, simply use the `.index` attribute, and to get the column names of a DataFrame, simply use the `.columns` attribute.

In [7]:
elections.columns

Index(['Year', 'Candidate', 'Party', 'Popular vote', 'Result', '%'], dtype='object')

In [8]:
elections.index

RangeIndex(start=0, stop=182, step=1)
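The relationship between a DataFrame, its columns, and its index can be checked on a small example. The sketch below uses a hypothetical three-row table `df`, built from rows shown earlier, with candidate names as row labels to show that labels need not be integers or unique.

```python
import pandas as pd

df = pd.DataFrame(
    {"Year": [1824, 1824, 1828],
     "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic"]},
    # Row labels: strings, with "Andrew Jackson" appearing twice.
    index=["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
)

year = df["Year"]  # a single column, returned as a Series

print(year.index.equals(df.index))  # the column carries the DataFrame's row labels
print(df.index.is_unique)           # reports whether any row label repeats
print(list(df.columns))             # the column labels
```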

Knowing the column names of a DataFrame is important because this is the primary way of accessing a column. This will be further elaborated during the next section, but a key fact to know is that column names should be **unique**!

In [33]:
# What happens when a column name is not unique? Selecting "A" returns both columns;
# you can't get one without the other. Trust that this can get annoying.

bad_df = pd.DataFrame([[1,'one'],[2,'two'],[3,'three']],columns = ["A","A"])
bad_df["A"]

|   | A | A |
|---|---|---|
| 0 | 1 | one |
| 1 | 2 | two |
| 2 | 3 | three |
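A quick way to see why duplicate column names are troublesome is to compare the type of object that `[]` returns in each case. In this sketch, `good_df` is a hypothetical counterpart to `bad_df` with unique names.

```python
import pandas as pd

rows = [[1, "one"], [2, "two"], [3, "three"]]
bad_df = pd.DataFrame(rows, columns=["A", "A"])   # duplicated column name
good_df = pd.DataFrame(rows, columns=["A", "B"])  # unique column names

# With unique names, a single label selects one column: a Series.
print(type(good_df["A"]).__name__)

# With a duplicated name, the same syntax selects every matching column,
# so the result is a DataFrame instead.
print(type(bad_df["A"]).__name__)
print(bad_df["A"].shape)
```

Code written with a Series in mind can break when it silently receives a DataFrame, which is one reason to keep column names unique.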


We can change the `index` of our DataFrame to be any `Series` object we choose. Here is a demonstration of how to turn the `Candidate` column into the `Index` of our `elections` DataFrame.

In [9]:
elections['Candidate']

0         Andrew Jackson
1      John Quincy Adams
2         Andrew Jackson
3      John Quincy Adams
4         Andrew Jackson
             ...        
177           Jill Stein
178         Joseph Biden
179         Donald Trump
180         Jo Jorgensen
181       Howard Hawkins
Name: Candidate, Length: 182, dtype: object

In [10]:
elections.set_index("Candidate", inplace=True) # This sets the index to the "Candidate" column

![](images/index_comparison_2.png)

In [11]:
elections.reset_index(inplace=True) # This resets the index
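Both calls above modify `elections` in place. As a sketch of the alternative, leaving out `inplace=True` returns a new DataFrame and leaves the original untouched. The small `df` below is an illustrative stand-in for `elections`. Note where `Candidate` ends up after the round trip: `reset_index` puts the old index back as the *first* column, which is why `Candidate` appears before `Year` in the outputs that follow.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1824, 1824],
    "Candidate": ["Andrew Jackson", "John Quincy Adams"],
    "Result": ["loss", "win"],
})

by_candidate = df.set_index("Candidate")  # new DataFrame; df is unchanged
restored = by_candidate.reset_index()     # old index comes back as column 0

print(list(df.columns))        # original column order
print(list(by_candidate.index))
print(list(restored.columns))  # Candidate is now first
```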

Earlier, we mentioned that a Series was just a column of data. What if we wanted a single column as a DataFrame? To obtain this, we can pass in a list containing a single column to the `[]` selection operator.

In [12]:
elections[["Party"]] # ["Party"] is the argument - a list with a single element

|     | Party |
|-----|-------|
| 0 | Democratic-Republican |
| 1 | Democratic-Republican |
| 2 | Democratic |
| 3 | National Republican |
| 4 | Democratic |
| ... | ... |
| 177 | Green |
| 178 | Democratic |
| 179 | Republican |
| 180 | Libertarian |
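The difference between the two forms is easiest to see by comparing types and shapes side by side. This sketch uses a hypothetical three-row `df` in place of `elections`.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1824, 1824, 1828],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic"],
})

as_series = df["Party"]   # a single label gives a 1D Series
as_frame = df[["Party"]]  # a one-element list gives a 2D DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```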


## Slicing in DataFrames

The most fundamental way to manipulate a DataFrame is to extract a subset of rows and columns. This is called **slicing**. We will do so with three primary tools of the DataFrame class:

1. `.loc`
2. `.iloc`
3. `[]`

### Indexing with .loc

The `.loc` operator selects rows and columns in a DataFrame by their row and column label(s), respectively. The **row label** (commonly referred to as the **index**) is the bold text on the far *left* of a DataFrame, while the **column label** is the text found at the *top* of a DataFrame. By default, row labels in `pandas` are sequential integers beginning from 0. The column labels in our `elections` DataFrame are the column names themselves: `Year`, `Candidate`, `Party`, `Popular vote`, `Result`, and `%`.

`.loc` lets us grab data by specifying the appropriate row and column label(s) where the data exists. The row labels come first inside the square brackets; the column labels come second. For example, to select the row labeled `0` and the column labeled `Candidate` from our `elections` DataFrame we can write:

In [13]:
elections.loc[0, 'Candidate']

'Andrew Jackson'

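To select *multiple* rows and columns, we can use Python slice notation. Here, we select the first four rows and every column from `Year` through `Popular vote`. (`Candidate` is not included because resetting the index earlier moved it in front of `Year`.)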

In [14]:
elections.loc[0:3, 'Year':'Popular vote']

|   | Year | Party | Popular vote |
|---|------|-------|--------------|
| 0 | 1824 | Democratic-Republican | 151271 |
| 1 | 1824 | Democratic-Republican | 113142 |
| 2 | 1828 | Democratic | 642806 |
| 3 | 1828 | National Republican | 500897 |


Suppose that instead, we wanted *every* column value for the first four rows in the `elections` DataFrame. The shorthand `:` is useful for this.

In [15]:
elections.loc[0:3, :]

|   | Candidate | Year | Party | Popular vote | Result | % |
|---|-----------|------|-------|--------------|--------|---|
| 0 | Andrew Jackson | 1824 | Democratic-Republican | 151271 | loss | 57.210122 |
| 1 | John Quincy Adams | 1824 | Democratic-Republican | 113142 | win | 42.789878 |
| 2 | Andrew Jackson | 1828 | Democratic | 642806 | win | 56.203927 |
| 3 | John Quincy Adams | 1828 | National Republican | 500897 | loss | 43.796073 |


There are a couple of things we should note. First, unlike conventional Python, `pandas` allows us to slice with string labels (in our example, the column labels). Second, slicing with `.loc` is *inclusive*. Notice how our resulting DataFrame includes every row and column between and including the slice labels we specified.
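Both points can be seen at once on a table whose row labels are strings. The sketch below is an illustration only: it builds a hypothetical `df` from the three 2020 rows shown earlier and uses the candidate names as row labels.

```python
import pandas as pd

df = pd.DataFrame(
    {"Year": [2020, 2020, 2020],
     "Party": ["Democratic", "Republican", "Libertarian"],
     "Popular vote": [81268924, 74216154, 1865724],
     "Result": ["win", "loss", "loss"]},
    index=["Joseph Biden", "Donald Trump", "Jo Jorgensen"],
)

# Slice rows and columns by label; both endpoints are included.
subset = df.loc["Joseph Biden":"Donald Trump", "Party":"Result"]

print(subset)
print(subset.shape)

# Ordinary Python slicing, by contrast, leaves the end position out.
print([10, 20, 30][0:2])
```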

Alternatively, we can use lists of labels to obtain multiple rows and columns in our `elections` DataFrame. 

In [16]:
elections.loc[[0, 1, 2, 3], ['Year', 'Candidate', 'Party', 'Popular vote']]

|   | Year | Candidate | Party | Popular vote |
|---|------|-----------|-------|--------------|
| 0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 |
| 1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 |
| 2 | 1828 | Andrew Jackson | Democratic | 642806 |
| 3 | 1828 | John Quincy Adams | National Republican | 500897 |


Lastly, we can mix list and slice notation.

In [17]:
elections.loc[[0, 1, 2, 3], :]

|   | Candidate | Year | Party | Popular vote | Result | % |
|---|-----------|------|-------|--------------|--------|---|
| 0 | Andrew Jackson | 1824 | Democratic-Republican | 151271 | loss | 57.210122 |
| 1 | John Quincy Adams | 1824 | Democratic-Republican | 113142 | win | 42.789878 |
| 2 | Andrew Jackson | 1828 | Democratic | 642806 | win | 56.203927 |
| 3 | John Quincy Adams | 1828 | National Republican | 500897 | loss | 43.796073 |


### Indexing with .iloc

Slicing with `.iloc` works similarly to `.loc`, although `.iloc` uses the integer positions of rows and columns rather than their labels. The arguments to `.iloc` also behave similarly - single values, lists, slices, and any combination of these are permitted. 

Let's begin reproducing our results from above. We'll begin by selecting the first presidential candidate in our `elections` DataFrame:

In [18]:
# elections.loc[0, "Candidate"] - Previous approach
# After resetting the index, Candidate sits at column position 0
elections.iloc[0, 0]

'Andrew Jackson'

Notice how the first argument to both `.loc` and `.iloc` is the same. This is because the row with a label of 0 is conveniently in the 0^th^ (or first) position of the `elections` DataFrame. Generally, this is true of any DataFrame where the row labels are incremented in ascending order from 0.

However, when we select the first four rows and columns using `.iloc`, we notice something different.

In [19]:
# elections.loc[0:3, 'Candidate':'Popular vote'] - Equivalent .loc approach
elections.iloc[0:4, 0:4]

|   | Candidate | Year | Party | Popular vote |
|---|-----------|------|-------|--------------|
| 0 | Andrew Jackson | 1824 | Democratic-Republican | 151271 |
| 1 | John Quincy Adams | 1824 | Democratic-Republican | 113142 |
| 2 | Andrew Jackson | 1828 | Democratic | 642806 |
| 3 | John Quincy Adams | 1828 | National Republican | 500897 |


Slicing is no longer inclusive in `.iloc` - it's *exclusive*: the end position is left out, just as in ordinary Python slicing. This is one of the syntactic subtleties of `pandas`; you'll get used to it with practice.
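The difference in endpoints is easy to check on a small table with the default integer labels. In this sketch, the five-row `df` is a hypothetical stand-in for `elections`.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828, 1832],
    "Result": ["loss", "win", "win", "loss", "win"],
})

by_label = df.loc[0:3]      # labels 0 through 3, endpoint included: 4 rows
by_position = df.iloc[0:4]  # positions 0 up to (not including) 4: 4 rows

print(by_label.equals(by_position))  # both select the same rows
print(len(df.loc[0:3]))              # label slice includes the endpoint
print(len(df.iloc[0:3]))             # position slice stops one row earlier
```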

List behavior works just as expected.

In [20]:
# elections.loc[[0, 1, 2, 3], ['Candidate', 'Year', 'Party', 'Popular vote']] - Equivalent .loc approach
elections.iloc[[0, 1, 2, 3], [0, 1, 2, 3]]

|   | Candidate | Year | Party | Popular vote |
|---|-----------|------|-------|--------------|
| 0 | Andrew Jackson | 1824 | Democratic-Republican | 151271 |
| 1 | John Quincy Adams | 1824 | Democratic-Republican | 113142 |
| 2 | Andrew Jackson | 1828 | Democratic | 642806 |
| 3 | John Quincy Adams | 1828 | National Republican | 500897 |


This discussion raises the question: when should we use `.loc` vs `.iloc`? In most cases, `.loc` is safer to use. `.iloc` can return unintended values when applied to a dataset where the ordering of rows or columns can change. 
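As a hedged illustration of that risk, consider what happens after sorting. The sketch below uses a hypothetical three-row `df` and the standard `sort_values` method: sorting moves rows to new positions, but each row keeps its original label.

```python
import pandas as pd

df = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Joseph Biden"],
    "Year": [1824, 1824, 2020],
})

# Sort so that the most recent year comes first; row labels travel with rows.
newest_first = df.sort_values("Year", ascending=False)

print(newest_first.loc[0, "Candidate"])  # label 0 still names the same row
print(newest_first.iloc[0, 0])           # position 0 is now the 2020 row
```

The same `.iloc` expression would also change meaning if the columns were reordered, whereas a column label keeps pointing at the same column.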

### Indexing with []

The `[]` selection operator is the most baffling of all, yet it is the most commonly used. It only takes a single argument, which may be one of the following:

1. A slice of row numbers
2. A list of column labels
3. A single column label

That is, `[]` is *context dependent*. Let's see some examples.

#### A slice of row numbers

Say we wanted the first four rows of our `elections` DataFrame.

In [21]:
elections[0:4]

|   | Candidate | Year | Party | Popular vote | Result | % |
|---|-----------|------|-------|--------------|--------|---|
| 0 | Andrew Jackson | 1824 | Democratic-Republican | 151271 | loss | 57.210122 |
| 1 | John Quincy Adams | 1824 | Democratic-Republican | 113142 | win | 42.789878 |
| 2 | Andrew Jackson | 1828 | Democratic | 642806 | win | 56.203927 |
| 3 | John Quincy Adams | 1828 | National Republican | 500897 | loss | 43.796073 |


#### A list of column labels

Suppose we now want the first four columns.

In [22]:
elections[["Year", "Candidate", "Party", "Popular vote"]]

|     | Year | Candidate | Party | Popular vote |
|-----|------|-----------|-------|--------------|
| 0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 |
| 1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 |
| 2 | 1828 | Andrew Jackson | Democratic | 642806 |
| 3 | 1828 | John Quincy Adams | National Republican | 500897 |
| 4 | 1832 | Andrew Jackson | Democratic | 702735 |
| ... | ... | ... | ... | ... |
| 177 | 2016 | Jill Stein | Green | 1457226 |
| 178 | 2020 | Joseph Biden | Democratic | 81268924 |
| 179 | 2020 | Donald Trump | Republican | 74216154 |
| 180 | 2020 | Jo Jorgensen | Libertarian | 1865724 |


#### A single column label

Lastly, suppose we only want the `Candidate` column.

In [23]:
elections["Candidate"]

0         Andrew Jackson
1      John Quincy Adams
2         Andrew Jackson
3      John Quincy Adams
4         Andrew Jackson
             ...        
177           Jill Stein
178         Joseph Biden
179         Donald Trump
180         Jo Jorgensen
181       Howard Hawkins
Name: Candidate, Length: 182, dtype: object
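To summarize the three behaviors of `[]` in one place, the sketch below applies each form to a hypothetical three-row `df` and reports what comes back. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "Year": [1824, 1824, 1828],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic"],
})

first_rows = df[0:2]              # slice of row numbers  -> DataFrame of rows
some_cols = df[["Year", "Party"]] # list of column labels -> DataFrame of columns
one_col = df["Candidate"]         # single column label   -> Series

print(type(first_rows).__name__, first_rows.shape)
print(type(some_cols).__name__, some_cols.shape)
print(type(one_col).__name__, one_col.shape)
```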

### Parting Note

The `pandas` library is enormous and contains many useful functions. Here is a link to the [documentation](https://pandas.pydata.org/docs/).

This lecture and the next will cover important methods you should be fluent in. However, we want you to get familiar with the real world programming practice of ...Googling! Answers to your questions can be found in documentation, Stack Overflow, etc. 

With that, we will move on to learning how to manipulate and access the data stored in the `pandas` data structures.
