# Lecture 2 – Data 100, Spring 2026

[Acknowledgments Page](https://ds100.org/sp26/acks/)

A high-level overview of the [`pandas`](https://pandas.pydata.org) library to accompany Lecture 2.

In [None]:
# `pd` is the conventional alias for pandas, just as `np` is for NumPy
import pandas as pd

## Series, DataFrames, and Indices 

Series, DataFrames, and Indices are fundamental `pandas` data structures for storing tabular data and processing the data using vectorized operations.

### Series

A `Series` is a 1-D labeled array of data. We can think of it as columnar data. 

Let's create a `Series` object and look at its two components: 1) values and 2) index.

In [None]:
s = pd.Series(["welcome", "to", "data 100"])
s

In [None]:
s.values

In [None]:
# The default Index is a RangeIndex: a memory-efficient way of representing the labels [0, 1, 2]
s.index

In the example above, `pandas` automatically generated an `Index` of integer labels. We can also create a `Series` object by providing a custom `Index`.

In [None]:
s = pd.Series([-1, 10, 2], index=["a", "b", "c"])
s

In [None]:
s.values

In [None]:
s.index

After it has been created, we can reassign the Index of a `Series` to a new Index.

In [None]:
s.index = ["first", "second", "third"]
s
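
As a minimal sketch of what reassigning the Index does and does not change, the cell below reuses the values from above. The name `relabeled` is ours, not part of the lecture. The new Index must have the same length as the `Series`; otherwise `pandas` raises a `ValueError`.

```python
import pandas as pd

relabeled = pd.Series([-1, 10, 2], index=["a", "b", "c"])

# Assigning to .index swaps the labels; the values are untouched
relabeled.index = ["first", "second", "third"]
print(relabeled["second"])

# A new Index of the wrong length is rejected
try:
    relabeled.index = ["only", "two"]
    length_mismatch_rejected = False
except ValueError:
    length_mismatch_rejected = True
print(length_mismatch_rejected)
```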

#### Selection in Series
We can select a single value or a set of values in a `Series` using:
- A single label
- A list of labels
- A filtering condition using a **Boolean mask**

In [None]:
s = pd.Series([4, -2, 0, 6], index=["a", "b", "c", "d"])
s

**Selection using one or more label(s)**

In [None]:
# Selection using a single label
# Notice how the return value is a single array element
s["a"]

In [None]:
# Selection using a list of labels
# Notice how the return value is another Series
s[["a", "c"]]

**Selection using a filter condition**

In [None]:
# Boolean mask: true for all elements greater than 0.
s > 0

In [None]:
# Use the Boolean mask to select data from the original Series: this picks all entries greater than 0
s[s > 0]
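
To make the two steps above explicit, here is a small sketch that stores the mask in a variable before using it. The names `mask` and `positives` are ours, chosen for illustration.

```python
import pandas as pd

s = pd.Series([4, -2, 0, 6], index=["a", "b", "c", "d"])

# The mask is itself a Series of booleans that shares s's Index
mask = s > 0
print(mask.dtype)
print(mask.index.tolist())

# Indexing with the mask keeps only the entries where the mask is True
positives = s[mask]
print(positives)
```

Because the mask carries the same Index as `s`, the selected entries keep their original labels.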

<br><br><br><br><br><br><br>
**Instructor Note: Return to slides!**
<br><br><br><br><br><br><br>

#### DataFrame

A `DataFrame` is a 2-D tabular data structure with both row and column labels. In this lecture, we will see how a `DataFrame` can be created from scratch or loaded from a file. We'll cover the following:
1. From a CSV file
2. From a list of rows
3. From a dictionary of columns
4. From a `Series`

(but there are many more!)

##### Creating a `DataFrame` from a CSV file
For loading data into a `DataFrame`, `pandas` has a number of very useful file reading tools. We'll be using `read_csv` today to load data from a CSV file into a `DataFrame` object. 

In [None]:
elections = pd.read_csv("data/elections.csv")
elections

By passing a column name to the `index_col` parameter, we can choose which column becomes the `Index` when the file is loaded.

In [None]:
elections = pd.read_csv("data/elections.csv", index_col="Candidate")
elections

In [None]:
elections = pd.read_csv("data/elections.csv", index_col="Year")
elections

##### Creating a `DataFrame` from a list of rows

In [None]:
# Creating a DataFrame from a list of rows
# Each row is a list, and we specify column names
df_list_1 = pd.DataFrame(
    [["Kiwi", 5.49],
     ["Orange", 3.99]],
    columns=["Fruit", "Price"]
)
df_list_1

In [None]:
# Creating a DataFrame from a list of rows
# Each row is a dictionary
df_list_2 = pd.DataFrame(
    [{"Fruit": "Kiwi", "Price": 5.49}, 
     {"Fruit": "Orange", "Price": 3.99}]
)
df_list_2

##### Creating a `DataFrame` from a dictionary of columns

In [None]:
# Creating a DataFrame from a dictionary of columns
# where each column is a *list*:
df_dict_1 = pd.DataFrame(
    {"Fruit": ["Kiwi", "Orange"], 
     "Price": [5.49, 3.99]}
)
df_dict_1
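
The constructions shown so far (a list of lists, a list of dictionaries, and a dictionary of columns) should all describe the same table. The sketch below checks this with `DataFrame.equals`; the variable names are ours, chosen for illustration.

```python
import pandas as pd

from_lists = pd.DataFrame(
    [["Kiwi", 5.49], ["Orange", 3.99]],
    columns=["Fruit", "Price"],
)
from_dicts = pd.DataFrame(
    [{"Fruit": "Kiwi", "Price": 5.49},
     {"Fruit": "Orange", "Price": 3.99}]
)
from_columns = pd.DataFrame(
    {"Fruit": ["Kiwi", "Orange"], "Price": [5.49, 3.99]}
)

# DataFrame.equals compares labels, dtypes, and values
print(from_lists.equals(from_dicts))
print(from_lists.equals(from_columns))
```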

In [None]:
# first, make some Series
ser_a = pd.Series(["a1", "a2", "a3"], index=["r1", "r2", "r3"])
ser_b = pd.Series(["b1", "b2", "b3"], index=["r1", "r2", "r3"])
ser_a

In [None]:

# Creating a DataFrame from a dictionary of columns
# where each column is a *Series* (rows are aligned by Index label, so the Indexes should match)
df_dict_ser = pd.DataFrame(
    {"ColumnA": ser_a, "ColumnB": ser_b}
)
df_dict_ser
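
What happens when the Indexes do not match? The sketch below assumes a hypothetical `ser_c` (not part of the lecture) whose Index only partly overlaps with `ser_a`. `pandas` matches rows by label rather than by position and fills the gaps with missing values.

```python
import pandas as pd

ser_a = pd.Series(["a1", "a2", "a3"], index=["r1", "r2", "r3"])
# Hypothetical Series whose Index only partly overlaps with ser_a
ser_c = pd.Series(["c2", "c3", "c4"], index=["r2", "r3", "r4"])

# Rows are matched by label; the resulting Index is the union of both
aligned = pd.DataFrame({"ColumnA": ser_a, "ColumnC": ser_c})
print(aligned)

# Labels missing from one Series show up as missing values (NaN)
print(aligned.isna())
```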

##### Creating a `DataFrame` from a `Series`

In [None]:
# Passing a Series to the DataFrame constructor to make a one-column DataFrame
df_ser = pd.DataFrame(ser_a)
df_ser

In [None]:
# Using to_frame() to convert a Series to DataFrame
ser_to_df = ser_a.to_frame()
ser_to_df
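
One detail worth noticing: `ser_a` has no name, so the resulting column is labeled with the integer `0`. As a small sketch, `to_frame` accepts a `name` argument to pick a more useful column label; the label `"ColumnA"` is just an illustrative choice.

```python
import pandas as pd

ser_a = pd.Series(["a1", "a2", "a3"], index=["r1", "r2", "r3"])

# An unnamed Series becomes a column labeled with the integer 0
unnamed = pd.DataFrame(ser_a)
print(unnamed.columns.tolist())

# to_frame accepts a name for the new column
named = ser_a.to_frame(name="ColumnA")
print(named.columns.tolist())
```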

#### `DataFrame` attributes: `index`, `columns`, and `shape`

In [None]:
# Creating a DataFrame from a CSV file and specifying the Index column
elections = pd.read_csv("data/elections.csv", index_col="Candidate")
elections

In [None]:
elections.index

In [None]:
elections.columns

In [None]:
elections.shape
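
The same three attributes can be inspected on a table small enough to check by eye. This sketch rebuilds the two-column `DataFrame` from the earlier `Series` example; note that the Index is not counted as a column in `shape`.

```python
import pandas as pd

ser_a = pd.Series(["a1", "a2", "a3"], index=["r1", "r2", "r3"])
ser_b = pd.Series(["b1", "b2", "b3"], index=["r1", "r2", "r3"])
df = pd.DataFrame({"ColumnA": ser_a, "ColumnB": ser_b})

# shape is a (number of rows, number of columns) tuple
n_rows, n_cols = df.shape
print(n_rows, n_cols)

print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column labels
```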

<br><br><br><br><br><br><br>
**Instructor Note: Return to slides!**
<br><br><br><br><br><br><br>

### Extracting data from `DataFrame`s

In [None]:
elections = pd.read_csv('data/elections.csv')
elections

We can use `.head` to return only the first few rows of a `DataFrame`.

In [None]:
# By default, calling .head with no argument will show the first 5 rows
elections.head() 

In [None]:
elections.head(3)

We can also use `.tail` to get the last few rows.

In [None]:
elections.tail(5)

*What might be some issues with using `.head` or `.tail` to check a DataFrame?*

#### Label-Based Extraction Using `.loc`

`.loc` takes two arguments, one for row labels and one for column labels. Each argument to `.loc` can be:
1. A single value.
2. A list.
3. A slice (syntax is **inclusive** of the right-hand side of the slice, unlike normal Python indexing).
4. A boolean mask.


`loc` selects items by row and column *label*.

In [None]:
# Selection by a row label and a column label
elections.loc[0, "Candidate"]

In [None]:
# Selection by a list
elections.loc[[87, 25, 179], ["Year", "Party", "%"]]

In [None]:
# Selection by a list and a slice of columns
elections.loc[[87, 25, 179], "Popular vote":"%"]

In [None]:
# Extracting all rows using a colon
elections.loc[:, ["Year", "Candidate", "Result"]]

In [None]:
# Extracting all columns using a colon
elections.loc[[87, 25, 179], :]

In [None]:
# Selection by a list and a single-column label
elections.loc[[87, 25, 179], "Popular vote"]

In [None]:
# Note that if we pass "Popular vote" in a list, the output will be a DataFrame
elections.loc[[87, 25, 179], ["Popular vote"]]

In [None]:
# Selection with a boolean mask
elections.loc[elections["Year"] == 2008, ["Year", "Candidate"]]

In [None]:
# If we only use one argument for .loc, pandas uses it
#  for the rows and returns all columns
elections.loc[[180, 181]]
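
Since `elections` uses the default integer Index, it can be hard to see that `.loc` really works with labels. The sketch below uses a made-up toy table (`toy`, not part of the lecture data) with string row labels to show the inclusive slice and the difference between passing a single label and a one-element list.

```python
import pandas as pd

# Made-up toy data; string row labels make the label-based behavior visible
toy = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40], "C": [100, 200, 300, 400]},
    index=["r1", "r2", "r3", "r4"],
)

# Label slices include BOTH endpoints: rows r2 through r3, columns A through B
block = toy.loc["r2":"r3", "A":"B"]
print(block)

# A single column label returns a Series; a one-element list returns a DataFrame
as_series = toy.loc[["r1", "r4"], "C"]
as_frame = toy.loc[["r1", "r4"], ["C"]]
print(type(as_series).__name__, type(as_frame).__name__)
```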

<br><br><br><br><br><br><br>
**Instructor Note: Return to slides!**
<br><br><br><br><br><br><br>

#### Integer-Based Extraction Using `iloc`

`iloc` selects items by row and column *integer* position.

Arguments to `.iloc` can be:
1. A single value.
2. A list.
3. A slice (syntax is **exclusive** of the right hand side of the slice, just like with lists/arrays/etc.).

In [None]:
# Extracting the value at the first row (row 0) and the second column.
# Remember that Python indexing begins at position 0!
elections.iloc[0,1]

In [None]:
# Extracting the second, third, and fourth rows of the second column
# (returns a series since we used a single column)
elections.iloc[[1, 2, 3], 1]

In [None]:
# Select the rows at positions 1, 2, and 3.
# Select the columns at positions 0, 1, and 2.
elections.iloc[[1, 2, 3], [0, 1, 2]]

In [None]:
# Integer-based extraction using a list of row positions and a slice of column positions
elections.iloc[[1, 2, 3], 0:3]

In [None]:
# Selecting all rows using a colon
elections.iloc[:, 0:3]

In [None]:
# If we only use one argument for .iloc, pandas uses it
# for the rows, and returns all columns
elections.iloc[138:144]
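
With the default Index, labels and positions coincide, which hides the difference between `.loc` and `.iloc`. As a sketch, the made-up table below (`toy`, not part of the lecture data) has integer labels that start at 1, so the two accessors disagree.

```python
import pandas as pd

# Made-up toy data whose integer labels do NOT match the row positions
toy = pd.DataFrame({"letter": ["w", "x", "y", "z"]}, index=[1, 2, 3, 4])

# .loc finds the row LABELED 1, which is the first row
by_label = toy.loc[1, "letter"]
# .iloc finds the row at POSITION 1, which is the second row
by_position = toy.iloc[1, 0]
print(by_label, by_position)

# .loc includes the stop label; .iloc excludes the stop position
label_slice = toy.loc[1:2]
position_slice = toy.iloc[1:2]
print(label_slice.index.tolist(), position_slice.index.tolist())
```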

<br><br><br><br><br><br><br>
**Instructor Note: Return to slides!**
<br><br><br><br><br><br><br>

#### Context-dependent Extraction using `[]`

We can do anything we want using `loc` or `iloc`. However, in practice, the `[]` operator is often used instead to yield more concise code for the most common `DataFrame` manipulations.

`[]` is a bit trickier to understand than `loc` or `iloc`, but it achieves some of the same functionality. The most important difference is that `[]` is *context-dependent*.

Remember that `.loc` and `.iloc` take up to two arguments: a row selector and a column selector (and if you only pass one argument, it's always used for the rows).

`[]` only takes one argument, which could be used for rows **or** columns, depending on what the argument is:
1. If it's a **single label**, it's treated as a **column** label (like with `loc`)
2. If it's a **list**, it's treated as **column** labels (like with `loc`)
3. If it's a **slice**, it's treated as **integer row** positions (like with `iloc`)
4. If it's a **boolean mask**, it selects the **rows** where the mask is `True` (like with `loc`)



In [None]:
# If we provide a single column name, we get back that column as a Series
elections["Candidate"]

In [None]:
# If we provide a list of column names, we get the listed columns.
elections[["Year", "Candidate", "Result"]]

In [None]:
# If we provide a slice of row numbers, we get the numbered rows.
elections[3:7]

In [None]:
# If we provide a boolean mask, we get the corresponding rows
elections[elections["Popular vote"] > 60000000]

<br><br><br><br><br><br><br>
**Instructor Note: Return to slides!**
<br><br><br><br><br><br><br>

#### Boolean Operators

To filter on multiple conditions, we combine boolean masks using **bitwise operators**.

Symbol | Usage      | Meaning 
------ | ---------- | -------------------------------------
~    | ~p       | Returns negation of p
&#124; | p &#124; q | p OR q
&    | p & q    | p AND q
^  | p ^ q | p XOR q (exclusive or)

**Always** make sure you wrap each condition inside `()` when combining boolean masks with bitwise operators!

In [None]:
# Grab rows from 2008 OR candidates winning at least 60% of the vote (or both)
elections[(elections["Year"] == 2008) | (elections["%"] >= 60)]

In [None]:
# Don't do this! Python's precedence rules (like PEMDAS for other operators) will do the wrong thing:
# | binds more tightly than == and >=, so 2008 | elections["%"] is evaluated first
# elections[elections["Year"] == 2008 | elections["%"] >= 60]
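
To see the failure without touching `elections`, here is a sketch on a made-up toy table (`toy`, with a hypothetical `Pct` column of integers). The exact exception depends on the column dtype and the `pandas` version, so the code catches both `TypeError` and `ValueError`.

```python
import pandas as pd

# Made-up toy data with integer columns
toy = pd.DataFrame({"Year": [2004, 2008, 2012], "Pct": [61, 53, 47]})

# With parentheses, each comparison is evaluated before | combines the masks
good = toy[(toy["Year"] == 2008) | (toy["Pct"] >= 60)]
print(good)

# Without parentheses, | is applied first, to 2008 and the "Pct" column,
# and the leftover chained comparison cannot be reduced to a single mask
try:
    toy[toy["Year"] == 2008 | toy["Pct"] >= 60]
    raised = False
except (TypeError, ValueError) as err:
    raised = True
    print(type(err).__name__)
```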

In [None]:
# Grab post-2000 winners: rows where year is after 2000 AND result is win
elections[(elections["Year"] > 2000) & (elections["Result"] == "win")]

In [None]:
# To make code more readable, use multiple lines
elections[(
    (elections["Year"] < 2000) &
    (elections["Year"] > 1941) &
    (elections["Result"] == "win") &
    (elections["%"] >= 55)
)]
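
Another way to keep long filters readable is to give each mask a name before combining them. The sketch below uses made-up numbers (not taken from `elections.csv`) in a toy table with the same column names; the mask names are ours.

```python
import pandas as pd

# Made-up toy data; only the column names mirror the elections table
toy = pd.DataFrame({
    "Year": [1936, 1964, 1972, 2008],
    "Result": ["win", "win", "loss", "win"],
    "%": [60.0, 61.0, 37.0, 53.0],
})

# Each condition gets its own named mask
in_range = (toy["Year"] > 1941) & (toy["Year"] < 2000)
won = toy["Result"] == "win"
big_margin = toy["%"] >= 55

selected = toy[in_range & won & big_margin]
# ~ negates a mask: rows that fail at least one of the conditions
rest = toy[~(in_range & won & big_margin)]
print(selected)
print(rest)
```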

#### Preparing the data for our graph

![](images/graph_and_df.png)

We need rows that are either Democratic OR Republican, AND after 1900. We also need only the Year, Party, and % columns.

In [None]:
# Notice the parens: it's important to use ((dem | rep) & year) so that we
# get the boolean operations right
elections.loc[
    (
        ((elections['Party'] == 'Democratic') | (elections['Party'] == 'Republican')) &
        (elections['Year'] > 1900)
    ),
    ['Year', 'Party', '%']
]

##### Investigating 1992

In [None]:
# What happened in 1992?
elections[elections['Year'] == 1992]

In [None]:
# How do we put that in context? How does 19% compare to other third-party candidates?
third_party = elections[
    (elections['Party'] != "Democratic") & 
    (elections['Party'] != "Republican") & 
    (elections['Year'] > 1900)
]
third_party.sort_values('%').iloc[-15:, :]
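
The last line relies on `sort_values` sorting in ascending order by default, so the largest values sit at the end and `iloc[-15:, :]` picks them up. A small sketch of the same pattern on a made-up toy table (`toy`, not part of the lecture data):

```python
import pandas as pd

# Made-up toy data
toy = pd.DataFrame({"name": ["p", "q", "r", "s"], "%": [3.0, 19.0, 8.0, 27.0]})

# sort_values sorts ascending by default, so the largest values end up last
top_two = toy.sort_values("%").iloc[-2:, :]
print(top_two)

# The same rows, largest first, using ascending=False
top_two_desc = toy.sort_values("%", ascending=False).iloc[:2, :]
print(top_two_desc)
```

Note that the rows keep their original Index labels after sorting.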

#### Working with the `Index`

We can manipulate the index of a `DataFrame` by using `set_index` and `reset_index`:

In [None]:
# Creating a DataFrame from a CSV file and specifying the Index column
elections = pd.read_csv("data/elections.csv", index_col="Candidate")
elections.head(3)

The `Index` can be reset to the default integer labels by calling `reset_index()` on a `DataFrame`.

In [None]:
# reset_index moves the old index to a column and changes the index to the default
elections.reset_index()

In [None]:
# Notice that reset_index, just like most DataFrame methods, returns a new DataFrame (instead of modifying the original)
elections

In [None]:
# This sets the Index to the "Party" column (and deletes the old index!)
elections.set_index("Party")
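
As a minimal sketch of how the two methods relate, the cell below applies `set_index` and then `reset_index` to the small fruit table from earlier. The variable names `by_fruit` and `round_trip` are ours.

```python
import pandas as pd

fruit = pd.DataFrame({"Fruit": ["Kiwi", "Orange"], "Price": [5.49, 3.99]})

# set_index returns a NEW DataFrame; fruit itself is unchanged
by_fruit = fruit.set_index("Fruit")
print(by_fruit.index.tolist())
print(fruit.columns.tolist())

# reset_index moves the Index back into a regular column
round_trip = by_fruit.reset_index()
print(round_trip.columns.tolist())
print(round_trip.index.tolist())
```

To keep either result, assign it to a name (for example, `fruit = fruit.set_index("Fruit")`).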

## Slido Exercises

**Question 1**

What's the output of the following code?

In [None]:
example = pd.Series(
    [4, 5, 6], 
    index=["one", "two", "three"]
)
example[example > 4].values

**Questions 2, 3, and 4**

Which of the following statements correctly return the value "blue fish" from the "weird" DataFrame?

In [None]:
weird = pd.DataFrame({"a":["one fish", "two fish"], 
                      "b":["red fish", "blue fish"]})
weird

In [None]:
# weird.loc[1, 1]

In [None]:
weird

In [None]:
# weird.loc[1, 'b']

In [None]:
weird

In [None]:
# weird.loc[[1,1]]

In [None]:
weird

In [None]:
# weird.loc[[1,'b']]

In [None]:
weird

In [None]:
# weird.iloc[1, 1]

In [None]:
weird

In [None]:
# weird.iloc[1, :]

In [None]:
weird

In [None]:
# weird.iloc[2,2]

In [None]:
weird

In [None]:
# weird.iloc[0,'b']

In [None]:
weird

In [None]:
# weird[1]['b']

In [None]:
weird

In [None]:
# weird['b'][1]

In [None]:
weird

In [None]:
# weird[1,'b']

In [None]:
weird

In [None]:
# weird[[1,'b']]