# Lecture 2 – Data 100, Summer 2025

[Acknowledgments Page](https://ds100.org/su25/acks/)

A high-level overview of the [`pandas`](https://pandas.pydata.org) library to accompany Lecture 2.

In [None]:
# `pd` is the conventional alias for Pandas, as `np` is for NumPy
import pandas as pd

## Series, DataFrames, and Indices 

Series, DataFrames, and Indices are fundamental `pandas` data structures for storing tabular data and processing the data using vectorized operations.

### Series

A `Series` is a 1-D labeled array of data. We can think of it as columnar data. 

#### Creating a new `Series` object
Below, we create a `Series` object and examine its two components: 1) values and 2) index.

In [None]:
s = pd.Series(["welcome", "to", "data 100"])

s

In [None]:
s.values

In [None]:
s.index

In the example above, `pandas` automatically generated an `Index` of integer labels. We can also create a `Series` object by providing a custom `Index`.

In [None]:
s = pd.Series([-1, 10, 2], index=["a", "b", "c"])
s

In [None]:
s.values

In [None]:
s.index
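
To make the two components concrete, here is a minimal sketch that checks what kind of object each one is. It reuses the numeric `Series` from above; the exact array type returned by `.values` can differ for non-numeric data, so treat this as an illustration for the numeric case only.

```python
import numpy as np
import pandas as pd

s = pd.Series([-1, 10, 2], index=["a", "b", "c"])

# For numeric data, the values are stored as a NumPy array
print(type(s.values))

# The labels are stored in a pandas Index object
print(type(s.index))
print(list(s.index))
```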

After a `Series` has been created, we can replace its `Index` by assigning a new one of the same length.

In [None]:
s.index = ["first", "second", "third"]
s

#### Selection in Series
We can select a single value or a set of values in a `Series` using:
- A single label
- A list of labels
- A filtering condition

In [None]:
s = pd.Series([4, -2, 0, 6], index=["a", "b", "c", "d"])
s

**Selection using one or more label(s)**

In [None]:
# Selection using a single label
# Notice how the return value is a single array element
s["a"]

In [None]:
# Selection using a list of labels
# Notice how the return value is another Series
s[["a", "c"]]
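
The difference between the two return types is easy to miss, so here is a small sketch that checks it directly. The variable names (`single`, `subset`, `one_item`) are ours, chosen only for illustration.

```python
import pandas as pd

s = pd.Series([4, -2, 0, 6], index=["a", "b", "c", "d"])

single = s["a"]          # one label -> a single value
subset = s[["a", "c"]]   # list of labels -> a Series
one_item = s[["a"]]      # one-element list -> still a Series

print(single)
print(type(subset))
print(type(one_item))
```

Note that wrapping even a single label in a list keeps the result a `Series`.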

**Selection using a filter condition**

In [None]:
# Filter condition: select all elements greater than 0
s>0

In [None]:
# Use the Boolean filter to select data from the original Series
s[s>0]
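
Filter conditions can also be combined. The sketch below is an extension of the example above rather than part of the lecture: it uses `&` (and) and `|` (or) on Boolean `Series`, with parentheses around each condition because these operators bind more tightly than comparisons.

```python
import pandas as pd

s = pd.Series([4, -2, 0, 6], index=["a", "b", "c", "d"])

# Elements strictly between 0 and 5
mask = (s > 0) & (s < 5)
selected = s[mask]

# Elements that are negative OR greater than 5
either = s[(s < 0) | (s > 5)]

print(selected)
print(either)
```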

<br><br><br><br><br><br><br>
**Instructor Note: Return to slides!**
<br><br><br><br><br><br><br>

### DataFrame

A `DataFrame` is a 2-D tabular data structure with both row and column labels. In this lecture, we will see how a `DataFrame` can be created from scratch or loaded from a file. 

#### Creating a new `DataFrame` object
We can create a `DataFrame` in a variety of ways. Here, we cover the following:
1. From a CSV file
2. Using a list and column names
3. From a dictionary
4. From a `Series`


##### Creating a `DataFrame` from a CSV file
`pandas` provides several file-reading functions for loading data into a `DataFrame`. Today we'll use `read_csv` to load data from a CSV file into a `DataFrame` object.

In [None]:
elections = pd.read_csv("data/elections.csv")
elections

By passing a column name to the `index_col` parameter, we can set the `Index` when the `DataFrame` is loaded.

In [None]:
elections = pd.read_csv("data/elections.csv", index_col="Candidate")
elections

In [None]:
elections = pd.read_csv("data/elections.csv", index_col="Year")
elections
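
To see the effect of `index_col` without needing the data file, here is a minimal sketch that reads CSV text from memory. The rows in `csv_text` are made up for illustration and are not the contents of `data/elections.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file (not the lecture's actual data)
csv_text = "Year,Candidate,Party\n2000,Alice,Red\n2004,Bob,Blue\n"

# Without index_col, pandas generates default integer labels
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the named column becomes the Index and is no longer a regular column
by_year = pd.read_csv(io.StringIO(csv_text), index_col="Year")

print(default_index.index.tolist(), default_index.columns.tolist())
print(by_year.index.tolist(), by_year.columns.tolist())
```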

##### Creating a `DataFrame` using a list and column names

In [None]:
# Creating a single-column DataFrame using a list
df_list_1 = pd.DataFrame([1, 2, 3], 
                         columns=["Number"])
display(df_list_1)

In [None]:
# Creating a multi-column DataFrame using a list of lists
df_list_2 = pd.DataFrame([[1, "one"], [2, "two"]], 
                         columns=["Number", "Description"])
df_list_2

##### Creating a `DataFrame` from a dictionary

In [None]:
# Creating a DataFrame from a dictionary of columns
df_dict_1 = pd.DataFrame({"Fruit":["Strawberry", "Orange"], 
                          "Price":[5.49, 3.99]})
df_dict_1

In [None]:
# Creating a DataFrame from a list of row dictionaries
df_dict_2 = pd.DataFrame([{"Fruit":"Strawberry", "Price":5.49}, 
                          {"Fruit":"Orange", "Price":3.99}])
df_dict_2
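
The two dictionary-based approaches describe the same table from different directions: one column at a time, or one row at a time. As a quick check, the sketch below rebuilds both and compares them with `DataFrame.equals`; the names `from_columns` and `from_rows` are ours.

```python
import pandas as pd

# Column-oriented: each key is a column name, each value is that column's data
from_columns = pd.DataFrame({"Fruit": ["Strawberry", "Orange"],
                             "Price": [5.49, 3.99]})

# Row-oriented: each dictionary is one row
from_rows = pd.DataFrame([{"Fruit": "Strawberry", "Price": 5.49},
                          {"Fruit": "Orange", "Price": 3.99}])

# Compare labels and contents of the two DataFrames
print(from_columns.equals(from_rows))
```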

##### Creating a `DataFrame` from a `Series`

In [None]:
# In the examples below, we create a DataFrame from one or more Series

s_a = pd.Series(["a1", "a2", "a3"], index=["r1", "r2", "r3"])
s_b = pd.Series(["b1", "b2", "b3"], index=["r1", "r2", "r3"])

In [None]:
# Passing Series objects for columns
df_ser = pd.DataFrame({"A-column":s_a, "B-column":s_b})
df_ser

In [None]:
# Passing a Series to the DataFrame constructor to make a one-column DataFrame
df_ser = pd.DataFrame(s_a)
df_ser

In [None]:
# Using to_frame() to convert a Series to DataFrame
ser_to_df = s_a.to_frame()
ser_to_df
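
When several `Series` are combined into a `DataFrame`, rows are matched by index label, not by position. The sketch below illustrates this with a hypothetical `s_c` whose labels only partly overlap with `s_a`; labels missing from one `Series` show up as `NaN`. It also shows the optional `name` argument of `to_frame`, which sets the column label.

```python
import pandas as pd

s_a = pd.Series(["a1", "a2", "a3"], index=["r1", "r2", "r3"])
# Hypothetical Series with a partly different index (not from the lecture)
s_c = pd.Series(["c2", "c3", "c4"], index=["r2", "r3", "r4"])

# Rows are aligned on index labels; missing entries become NaN
df_aligned = pd.DataFrame({"A-column": s_a, "C-column": s_c})
print(df_aligned)

# to_frame can name the resulting column
named = s_a.to_frame(name="A-column")
print(named.columns.tolist())
```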

In [None]:
# Creating a DataFrame from a CSV file and specifying the Index column
elections = pd.read_csv("data/elections.csv", index_col="Candidate")
elections.head(5) # Using `.head` shows only the first 5 rows to save space

In [None]:
elections.reset_index(inplace=True) # Reset the Index first so that 'Candidate' stays as one of the DataFrame columns
elections.set_index("Party", inplace=True) # This sets the Index to the "Party" column
elections

#### `DataFrame` attributes: `index`, `columns`, and `shape`

In [None]:
elections.index

In [None]:
elections.columns

The `Index` can be reset to the default integer labels by calling `reset_index()` on a `DataFrame`; the old index then becomes a regular column.

In [None]:
elections.reset_index(inplace=True) # Revert the Index back to its default numeric labeling
elections
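
The cells above use `inplace=True`, which modifies `elections` directly. As an alternative sketch, both `set_index` and `reset_index` return a new `DataFrame` when called without `inplace`, leaving the original untouched. The small `df` below is hypothetical data, not the elections file.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"Candidate": ["Alice", "Bob"],
                   "Party": ["Red", "Blue"],
                   "Year": [2000, 2004]})

# Returns a new DataFrame indexed by "Party"; df itself is unchanged
by_party = df.set_index("Party")

# Moves "Party" back into the columns and restores default integer labels
restored = by_party.reset_index()

print(by_party.columns.tolist())
print(restored.columns.tolist())
```

After the round trip, `"Party"` comes back as the first column, so the column order can differ from the original.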

In [None]:
elections.shape
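
As a quick reference for reading `shape`, here is a minimal sketch on a small hypothetical `DataFrame`: the result is a `(rows, columns)` tuple, so it can be unpacked into two numbers.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"c1": [1, 2, 3], "c2": [4, 5, 6]})

n_rows, n_cols = df.shape
print(n_rows, n_cols)

# The same counts are available from the index and columns
print(len(df.index), len(df.columns))
```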

<br><br><br><br><br><br><br>
**Instructor Note: Return to slides!**
<br><br><br><br><br><br><br>

### Slicing in `DataFrame`s

We can use `.head` to return only the first few rows of a `DataFrame`.

In [None]:
# Loading DataFrame again to keep the original ordering of columns
elections = pd.read_csv("data/elections.csv")

elections.head() # By default, calling .head with no argument will show the first 5 rows

In [None]:
elections.head(3)

We can also use `.tail` to get the last few rows.

In [None]:
elections.tail(5)
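
Here is a small self-contained sketch of `.head` and `.tail` on a hypothetical ten-row `DataFrame`, which makes it easy to see exactly which rows come back. Note that the returned rows keep their original index labels.

```python
import pandas as pd

# Hypothetical ten-row DataFrame with values 0 through 9
df = pd.DataFrame({"n": range(10)})

first_default = df.head()    # first 5 rows by default
first_three = df.head(3)     # first 3 rows
last_two = df.tail(2)        # last 2 rows, still labeled 8 and 9

print(len(first_default))
print(first_three["n"].tolist())
print(last_two.index.tolist())
```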

#### Label-Based Extraction Using `loc`

Arguments to `.loc` can be:
1. A list.
2. A slice (syntax is **inclusive** of the right-hand side of the slice).
3. A single value.


`loc` selects items by row and column *label*.

In [None]:
# Selection by a list
elections.loc[[87, 25, 179], ["Year", "Candidate", "Result"]]

In [None]:
# Selection by a list and a slice of columns
elections.loc[[87, 25, 179], "Popular vote":"%"]

In [None]:
# Extracting all rows using a colon
elections.loc[:, ["Year", "Candidate", "Result"]]

In [None]:
# Extracting all columns using a colon
elections.loc[[87, 25, 179], :]

In [None]:
# Selection by a list and a single-column label
elections.loc[[87, 25, 179], "Popular vote"]

In [None]:
# Note that if we pass "Popular vote" in a list, the output will be a DataFrame
elections.loc[[87, 25, 179], ["Popular vote"]]

In [None]:
# Selection by a row label and a column label
elections.loc[0, "Candidate"]
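
The inclusive right-hand side of a `loc` slice applies to both rows and columns. The sketch below demonstrates this on a small hypothetical `DataFrame` (the values are made up, not taken from the elections data).

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"Year": [2000, 2004, 2008, 2012],
                   "Candidate": ["A", "B", "C", "D"],
                   "Result": ["win", "loss", "win", "loss"]})

# Row labels 1 through 2 and columns "Year" through "Candidate", endpoints included
rows = df.loc[1:2, "Year":"Candidate"]

print(rows.index.tolist())
print(rows.columns.tolist())
```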

#### Integer-Based Extraction Using `iloc`

`iloc` selects items by row and column *integer* position.

Arguments to `.iloc` can be:
1. A list.
2. A slice (syntax is **exclusive** of the right-hand side of the slice).
3. A single value.


In [None]:
# Select the rows at positions 1, 2, and 3.
# Select the columns at positions 0, 1, and 2.
# Remember that Python indexing begins at position 0!
elections.iloc[[1, 2, 3], [0, 1, 2]]

In [None]:
# Integer-based extraction using a list of row positions and a slice of column positions
elections.iloc[[1, 2, 3], 0:3]

In [None]:
# Selecting all rows using a colon
elections.iloc[:, 0:3]

In [None]:
# Selection by a list of row positions and a single column position returns a Series
elections.iloc[[1, 2, 3], 1]

In [None]:
# Extracting the value at row position 0 and column position 1 (the second column)
elections.iloc[0,1]
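
With the default integer index, row labels and row positions happen to coincide, which can hide the difference between `loc` and `iloc`. The sketch below uses a hypothetical `DataFrame` whose row labels are 10, 20, and 30 so that the two are clearly distinct.

```python
import pandas as pd

# Hypothetical data whose row labels differ from row positions
df = pd.DataFrame({"Candidate": ["Alice", "Bob", "Carol"]}, index=[10, 20, 30])

by_label = df.loc[10, "Candidate"]   # the row labeled 10
by_position = df.iloc[0, 0]          # the first row, first column

label_slice = df.loc[10:20]          # labels 10 through 20, inclusive
position_slice = df.iloc[0:2]        # positions 0 and 1; position 2 is excluded

# There is no row labeled 0, so a label-based lookup of 0 fails
try:
    df.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False

print(by_label, by_position, label_zero_exists)
```

Here `label_slice` and `position_slice` select the same two rows, even though one slice is inclusive and the other exclusive.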

#### Context-dependent Extraction using `[]`

Technically, `loc` and `iloc` can handle any selection we need. In practice, however, the `[]` operator is often used instead because it yields more concise code.

`[]` is a bit trickier to understand than `loc` or `iloc`, but it achieves essentially the same functionality. The difference is that `[]` is *context-dependent*.

`[]` only takes one argument, which may be:
1. A slice of row integers.
2. A list of column labels.
3. A single column label.


If we provide a slice of row numbers, we get the numbered rows.

In [None]:
elections[3:7]

If we provide a list of column names, we get the listed columns.

In [None]:
elections[["Year", "Candidate", "Result"]]

And if we provide a single column name we get back just that column, stored as a `Series`.

In [None]:
elections["Candidate"]
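
To summarize the three cases side by side, here is a minimal sketch on a small hypothetical `DataFrame` with the default integer index. It only checks what kind of object each form of `[]` returns.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"Year": [2000, 2004, 2008, 2012],
                   "Candidate": ["A", "B", "C", "D"],
                   "Result": ["win", "loss", "win", "loss"]})

rows = df[1:3]                    # slice -> rows at positions 1 and 2
cols = df[["Year", "Candidate"]]  # list of labels -> DataFrame with those columns
col = df["Candidate"]             # single label -> Series

print(rows.index.tolist())
print(type(cols))
print(type(col))
```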

### Slido Exercises

**Question 1**

What's the output of the following code?

In [None]:
example = pd.Series([4, 5, 6], index=["one", "two", "three"])
example[example > 4].values

**Question 2**

We are expecting to get the following output:

<img src="images/medium.jpeg" width="200px" />


In [None]:
df1 = pd.DataFrame([["A", "B"], [84, 79]], columns=["Group", "Score"])
df1

In [None]:
df2 = pd.DataFrame([["A", 84], ["B", 79]], columns=["Group", "Score"])
df2

In [None]:
df3 = pd.DataFrame({"A": 84, "B": 79}, columns=["Group", "Score"])
df3

In [None]:
df4 = pd.DataFrame({"Group": ["A", "B"], "Score": [84, 79]})
df4

In [None]:
df5 = pd.DataFrame([{"Group": "A", "Score": 84}, {"Group": "B", "Score": 79}])
df5

In [None]:
df = pd.DataFrame({"c1":[1, 2, 3, 4], "c2":[2, 4, 6, 8]})
df.columns

**Questions 3, 4, and 5**

Which of the following statements correctly return the value "blue fish" from the "weird" DataFrame?

In [None]:
weird = pd.DataFrame({"a":["one fish", "two fish"], 
                      "b":["red fish", "blue fish"]})
weird

In [None]:
weird.loc[1, 1]

In [None]:
weird

In [None]:
weird.loc[1, 'b']

In [None]:
weird

In [None]:
weird.loc[[1,1]]

In [None]:
weird

In [None]:
weird.loc[[1,'b']]

In [None]:
weird

In [None]:
weird.iloc[1, 1]

In [None]:
weird

In [None]:
weird.iloc[1, :]

In [None]:
weird

In [None]:
weird.iloc[2,2]

In [None]:
weird

In [None]:
weird.iloc[0,'b']

In [None]:
weird

In [None]:
weird[1]['b']

In [None]:
weird

In [None]:
weird['b'][1]

In [None]:
weird

In [None]:
weird[1,'b']

In [None]:
weird

In [None]:
weird[[1,'b']]