<a href="https://colab.research.google.com/github/LarrySnyder/ASJ/blob/main/intro_notebooks/Baby_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Baby Pandas

This file is read-only. To work with it, you first need to **save a copy to your Google Drive:**

1. Go to the *File* menu. (The *File* menu inside the notebook, right below the filename—not the *File* menu in your browser, at the top of your screen.)
2. Choose *Save a copy in Drive*. (Log in to your Google account, if necessary.) Feel free to move it to a different folder in your Drive, if you want.
3. Colab should open up a new browser tab with your copy of the notebook. Double-click the filename at the top of the window and rename it `Baby Pandas [your name(s)]`. 
4. Close the original read-only notebook in your browser.


---
> 👓 **Note:** This notebook is part of the *Algorithms and Social Justice* course at Lehigh University, Profs. Larry Snyder and Suzanne Edwards.
---


## Baby Pandas?

This notebook gives a short overview of the basics of the Python package `pandas`—so, "baby pandas." 

Let's just set the mood:

![](https://media.tenor.com/EWPjzf41UAgAAAAd/gfg.gif)

Along the way, we'll also point out some tidbits about Python itself that will be useful in our notebooks. We'll call those **Python Bites.**

We're really on a roll here.

OK. First, let's import the `pandas` package. It's customary to abbreviate it as `pd`.

In [None]:
import pandas as pd

Now we can reference anything within the `pandas` package using the prefix `pd`.

## `pandas` DataFrames

The main object in `pandas` is a **DataFrame.** A DataFrame is just a collection of data, in table form. You can think of it like a spreadsheet. DataFrames are useful for storing and manipulating datasets.

There are lots of ways to build and fill a DataFrame in code. For example:

In [None]:
my_df = pd.DataFrame(
    [ ['Maginnes Hall', 1971, 'Academic'],                      # this part is the data itself
    ['STEPS', 2010, 'Academic'], 
    ['Taylor House', 1907, 'Residential'],
    ['Alumni Memorial Building', 1925, 'Administrative'] ], 
    columns=['Building', 'Year_Built', 'Type']                  # this part specifies the column names
)

To view a DataFrame, just put its name alone on a line:

In [None]:
my_df

Each **row** of a DataFrame is an entry in the database. (It's sometimes called a **record.**) Each **column** of a DataFrame is one piece of information for those rows. (It's sometimes called a **field.**)
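
If you ever want to check how many rows and columns a DataFrame has, here's a small sketch. It rebuilds the same buildings table under a different name (`buildings_df`, made up for this example) so that it runs on its own, and it uses the `shape` and `columns` attributes of the DataFrame.

```python
import pandas as pd

# Same data as my_df above, under a made-up name so this cell stands alone.
buildings_df = pd.DataFrame(
    [['Maginnes Hall', 1971, 'Academic'],
     ['STEPS', 2010, 'Academic'],
     ['Taylor House', 1907, 'Residential'],
     ['Alumni Memorial Building', 1925, 'Administrative']],
    columns=['Building', 'Year_Built', 'Type']
)

# shape is a pair: (number of rows, number of columns)
n_rows, n_cols = buildings_df.shape

# columns holds the column names; list() turns them into a plain Python list
column_names = list(buildings_df.columns)

print(n_rows, n_cols)
print(column_names)
```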

---
> 🐍 **Python Bite: Lists** 
> 
> A sequence of items, separated by commas and surrounded by square brackets, is a Python **list.** Here are some lists:
>
> ```
[1, 5, 7, 12, 1735]
['a', 'c', 'r']
['Maginnes Hall', 1971, 'Academic']
```
>
> Lists are very flexible; they can even contain other lists! In fact, what we passed to `pd.DataFrame()` to create `my_df` is a **list of lists:**
>
> ```
[ ['Maginnes Hall', 1971, 'Academic'], 
['STEPS', 2010, 'Academic'], 
['Taylor House', 1907, 'Residential'],
['Alumni Memorial Building', 1925, 'Administrative'] ]
```
>
> We have a sequence of 4 items separated by commas and surrounded by brackets, and each of the 4 items is itself a sequence of 3 items separated by commas and surrounded by brackets. 
---
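
To get a feel for a list of lists, here's a quick sketch that pulls pieces out of one using square brackets. The variable name `buildings` is just made up for this example. (Python counts positions starting at 0, so position 2 is the third item.)

```python
buildings = [
    ['Maginnes Hall', 1971, 'Academic'],
    ['STEPS', 2010, 'Academic'],
    ['Taylor House', 1907, 'Residential'],
    ['Alumni Memorial Building', 1925, 'Administrative'],
]

n_buildings = len(buildings)     # how many inner lists there are
third_building = buildings[2]    # the whole inner list at position 2
third_name = buildings[2][0]     # the item at position 0 of that inner list

print(n_buildings)
print(third_building)
print(third_name)
```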

Although we can build DataFrames in code, most of the time we'll be loading them from files. (Obviously, the files we load have to already have some database-like structure that `pandas` can interpret as a DataFrame. We couldn't just load a gif or a PowerPoint file, for example.) 

Let's load a file from the internet into a DataFrame. This is a database of episodes of *The Office.*

In [None]:
office_df = pd.read_csv("https://raw.githubusercontent.com/aidano-7/the-office-dataset/main/DataFiles/Episodes.csv")

In [None]:
office_df

---
> 🐍 **Python Bite: Printing Variables** 
> 
> When we put the name of a variable (like `office_df`) by itself on a line and then run the cell, we're asking Python to print the value of that variable. For example, try typing the following into the code cell below and running it:
>
> ```
a = 17
a
```
>
> So when you type 
> ```
office_df
```
>
> you're just asking Python to print the value of the variable `office_df`. Now, `office_df` is a much more complicated variable than `a` because `office_df` is a whole DataFrame while `a` is just an integer, but the concept is the same.
>
> *Note*: This only works in interactive environments like Jupyter notebooks, and only for the last line of a cell. In a regular Python script, putting a variable on a line by itself has no visible effect. To print it, you'd use the `print()` function:
> ```
print(a)
print(office_df)
```
> 
> That works in a Jupyter notebook too, although the formatting of DataFrames isn't as nice that way.
---

In [None]:
[...]

## Accessing Parts of a DataFrame 

Often we want to view or manipulate parts of a DataFrame, like:

* row 3
* the column called "EpisodeTitle"
* the entry in the "SeasonID" column in row 37
* the "EpisodeID" and "SeasonID" columns for rows 10–20

Examples of all of these are given below. 

To be perfectly honest, I constantly forget the syntax for doing these things, and I have to look it up in another notebook or online. (Here's a handy [cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf); see the section called "Subsets - rows and columns".)

#### Accessing rows:

In [None]:
# Get row 3.
office_df.iloc[3]

In [None]:
# Get rows 3-7.
office_df.iloc[3:8]

---
> 🐍 **Python Bite: Numbering Things** 
> 
> In Python, things tend to be numbered starting at 0, not 1. So "row 3" in a DataFrame is really the 4th row, because the first row is "row 0". (Look back at the *Office* DataFrame and you'll see what I mean.)
> 
> Also, when we specify a range of integers like `3:8`, the second number refers to the integer *after* the last one we want. In other words, if we want 3 through 7, we ask for `3:8`. It's weird and kind of confusing, but you'll get used to it.
---
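
The same numbering rules apply to plain Python lists, which makes them a handy place to practice. Here's a small sketch with a made-up list called `letters`:

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

item_3 = letters[3]       # position 3 is the 4th item, because we start at 0
items_3_to_7 = letters[3:8]   # positions 3, 4, 5, 6, 7 (8 is NOT included)

print(item_3)
print(items_3_to_7)
print(len(items_3_to_7))  # a slice a:b always has b - a items
```

One way to remember the rule: the number of items in `3:8` is just 8 minus 3.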

#### Accessing columns:

Remember, "column 2" is really the third column because we start at 0. And the column that's printed in bold, the one with no header? That one doesn't count *at all*, so "column 2" is really the fourth column in the printout! 🤦


In [None]:
# Get column 2. 
office_df.iloc[:, 2]
# (":, 2" means "all rows, column 2")

In [None]:
# Get the "Rating" column.
office_df["Rating"]

In [None]:
# Get the "EpisodeID" and "Rating" columns.
office_df[["EpisodeID", "Rating"]]
# Note: Since we're asking for multiple columns, we need to give a *list* of columns,
# hence the double-brackets!

#### Accessing rows and columns:

In [None]:
# Get row 3, column 2.
office_df.iloc[3, 2]

In [None]:
# Get rows 3-7, column 2.
office_df.iloc[3:8, 2]

In [None]:
# Get rows 3-7, columns 2-3.
office_df.iloc[3:8, 2:4]

In [None]:
# Get rows 3-7, "Rating" column.
office_df.loc[3:7, "Rating"]
# Note: loc, not iloc -- because we're specifying the column as a name, not an integer.
# Also note 3:7, not 3:8 -- unlike iloc, a loc slice includes its last label.
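
The way `.iloc` and `.loc` treat the end of a range is easy to trip over, so here's a small sketch that compares them side by side. It uses a tiny made-up DataFrame (`demo_df`, with invented ratings) rather than the real *Office* data, so it runs on its own. With `.iloc`, the range `3:8` stops *before* 8; with `.loc`, it *includes* 8.

```python
import pandas as pd

# Made-up ratings, just for illustration; rows are labeled 0 through 9.
demo_df = pd.DataFrame(
    {"Rating": [7.5, 8.3, 7.8, 8.1, 8.4, 7.7, 8.8, 8.2, 8.0, 8.6]}
)

by_position = demo_df.iloc[3:8, 0]         # positions 3, 4, 5, 6, 7
by_label_8 = demo_df.loc[3:8, "Rating"]    # labels 3, 4, 5, 6, 7, AND 8
by_label_7 = demo_df.loc[3:7, "Rating"]    # labels 3, 4, 5, 6, 7

print(len(by_position))   # 5 rows
print(len(by_label_8))    # 6 rows
print(len(by_label_7))    # 5 rows
```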

### Your Turn

In [None]:
# Get row 30.
[...]

In [None]:
# Get column 0.
[...]

In [None]:
# Get the "SeasonID" column.
[...]

In [None]:
# Get rows 35–40.
[...]

In [None]:
# Get rows 15-20, "EpisodeTitle" column.
[...]

In [None]:
# Get rows 0-7, "EpisodeID" and "EpisodeTitle" columns.
[...]

---
> 🐍 **Python Bite: `=` and `==`** 
> 
> There are two kinds of equal signs in Python:
>
> * `=` means "set the thing on the left equal to the thing on the right"
> * `==` means "is the thing on the left equal to the thing on the right?"
>
> `=` is a command, an instruction, while `==` is a question. So:
>
> ```
b = 7
```
> means "take the variable `b` (create it if it doesn't already exist) and set it equal to 7." Whereas
>
> ```
b == 7
```
>
> means "does the variable `b` equal 7?" Note that `b` has to already exist in this case; if it doesn't already exist, Python will give an error.
---

## Taking Subsets of a DataFrame

One thing we do quite often is take *subsets* of a DataFrame, that is, pieces of the DataFrame that satisfy some condition.

For example, we might want to create a new DataFrame, called `season_3_df`, which contains only the rows of `office_df` corresponding to episodes in season 3.

The command to do this is as follows:

In [None]:
season_3_df = office_df[office_df["SeasonID"] == 3]

But this command is a little inscrutable, so let's build up to it, piece by piece.

First, remember that `office_df["SeasonID"]` gives us the whole "SeasonID" column:

In [None]:
office_df["SeasonID"]

If we say `office_df["SeasonID"] == 3`, we are creating a new column which, for each row, equals `True` if the "SeasonID" for that row is 3 and equals `False` otherwise. (Note the `==` here: We're testing the condition, "does 'SeasonID' equal 3?")

In [None]:
office_df["SeasonID"] == 3

(All the entries look `False`, but that's because only the first few and last few rows are displayed, and the `True` entries are in the middle.)
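
If you don't want to take that on faith, one way to check is to count the `True` entries. Here's a sketch on a small made-up column (`seasons`, not the real data) so it runs on its own. It relies on the fact that `True` counts as 1 and `False` counts as 0 when you add them up.

```python
import pandas as pd

# A made-up "SeasonID" column, just for illustration.
seasons = pd.Series([1, 1, 2, 3, 3, 3, 4, 4], name="SeasonID")

is_season_3 = seasons == 3      # a column of Trues and Falses

n_matches = is_season_3.sum()   # number of True entries
any_match = is_season_3.any()   # True if at least one entry is True

print(n_matches)
print(any_match)
```

The same two calls should work on `office_df["SeasonID"] == 3`.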

Now, if you say `office_df[*a list of Trues and Falses*]`, where `*a list of Trues and Falses*` has the same length as the number of rows in `office_df`, you will get the rows of `office_df` that correspond to the `True` entries only.

For example: `office_df` has 186 rows. The code `[True, False] * 93` is a shortcut to get a list with 186 elements that has `True, False` repeated 93 times: `[True, False, True, False, ...]` etc. 

Take a guess: What will `office_df[[True, False] * 93]` look like?

Now try it:

In [None]:
office_df[[True, False] * 93]

Now, remember that `office_df["SeasonID"] == 3` is *also* just a list of `True`s and `False`s. (Technically, it's not a Python list; it's a column of a `pandas` DataFrame, which `pandas` calls a Series, but that's OK.) In particular, it's a list that says, for each row, whether the "SeasonID" is 3.

So, we can say `office_df[office_df["SeasonID"] == 3]`, and it will give us the rows of `office_df` for which the "SeasonID" is 3.

Try it:

In [None]:
office_df[office_df["SeasonID"] == 3]

We're almost done: We've *created* the DataFrame that we want (i.e., a DataFrame consisting of rows of `office_df` for which "SeasonID" is 3), but we haven't stored it anywhere. Remember that we want to store it in a variable called `season_3_df`. So we just assign it. (Note the single `=` for assignment.)

In [None]:
season_3_df = office_df[office_df["SeasonID"] == 3]

In [None]:
season_3_df

We can also test things other than equality. For example, we can get all rows in which the rating is greater than or equal to 9:

In [None]:
office_df[office_df["Rating"] >= 9]

We can test multiple conditions at once, using `&` to represent "and" and `|` to represent "or". Each condition must be put inside parentheses.

For example, we can get all rows in which the rating is less than 7 and the season is greater than or equal to 4:

In [None]:
office_df[(office_df["Rating"] < 7) & (office_df["SeasonID"] >= 4)]
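
Here's a sketch that puts `&` and `|` side by side on a tiny made-up DataFrame (`episodes`, with invented numbers), so you can check the results by eye. The parentheses matter because Python would otherwise try to evaluate `&` or `|` before the comparisons.

```python
import pandas as pd

# Made-up data, just for illustration.
episodes = pd.DataFrame({
    "SeasonID": [1, 2, 4, 5, 8],
    "Rating":   [7.5, 9.1, 6.8, 9.3, 6.5],
})

# "and": BOTH conditions must be true for a row to be kept
low_and_late = episodes[(episodes["Rating"] < 7) & (episodes["SeasonID"] >= 4)]

# "or": AT LEAST ONE condition must be true for a row to be kept
low_or_late = episodes[(episodes["Rating"] < 7) | (episodes["SeasonID"] >= 4)]

print(low_and_late)
print(low_or_late)
```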

Instead of comparing a column to a single number, we can even compare it to another column! For example, this code gives us all rows for which the "SeasonID" and the "EpisodeID" are the same. (There's only one such episode.)

(Pay special attention to this type of command; it will come up a bunch of times in the COMPAS notebook.)

In [None]:
office_df[office_df["SeasonID"] == office_df["EpisodeID"]]
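
Since this pattern matters later, here's a sketch of it on a tiny made-up DataFrame (`small_df`, with invented numbers) where you can check each row by eye. The comparison is done row by row: row 0 of one column against row 0 of the other, and so on.

```python
import pandas as pd

# Made-up data, just for illustration.
small_df = pd.DataFrame({
    "SeasonID":  [1, 1, 2, 2, 3],
    "EpisodeID": [1, 2, 3, 2, 5],
})

columns_match = small_df["SeasonID"] == small_df["EpisodeID"]   # one True/False per row
matching_rows = small_df[columns_match]                         # keep only the True rows

print(columns_match.tolist())
print(matching_rows)
```

Notice that the rows that are kept hold on to their original row labels.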

### Your Turn

In [None]:
# Get all rows for which "EpisodeID" is less than 50.
[...]

In [None]:
# Get all rows for which "Rating" is less than 7, and store it in a new 
# variable called bad_episodes.
[...]

In [None]:
# Get all rows for which the "SeasonID" equals the "Rating". 
# (This seems like a silly thing to want to do, but oh well.)
[...]

In [None]:
# Get all rows for which "EpisodeID" is less than 100 or
# "Rating" is greater than or equal to 9.
[...]

## Adding Columns

We often want to add new columns to DataFrames. (You can add rows, too, but this is less common.) We can do this just by setting the new column equal to a list of values.

For example, go back to the small `my_df` DataFrame, which contains data about Lehigh buildings. Let's add a column called "Address":

In [None]:
my_df["Address"] = ["9 West Packer Ave.", "1 West Packer Ave.", "68 University Dr.", "27 Memorial Dr. West"]

In [None]:
my_df
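
One thing to watch out for: the list has to contain exactly one value per row. Here's a sketch of what happens otherwise, using a small made-up DataFrame (`small_df`) so that `my_df` isn't disturbed. As far as I know, `pandas` raises a `ValueError` when the lengths don't match.

```python
import pandas as pd

# A 3-row DataFrame, made up for this example.
small_df = pd.DataFrame({"Building": ["Maginnes Hall", "STEPS", "Taylor House"]})

try:
    small_df["Year_Built"] = [1971, 2010]    # only 2 values for 3 rows
    error_raised = False
except ValueError as err:
    error_raised = True
    print("pandas refused:", err)

# With one value per row, the assignment goes through.
small_df["Year_Built"] = [1971, 2010, 1907]
print(small_df)
```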

Or, suppose I want to add a new column to `office_df` called "MyRating" which has my own rating for each episode.

(I don't feel like typing 186 ratings, so I'll just generate them randomly for the purposes of this example. We'll need `numpy` to do that.) 

In [None]:
import numpy as np

In [None]:
office_df["MyRating"] = np.random.randint(11, size=186) # random integers between 0 and 10

In [None]:
office_df
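
Typing `186` by hand works, but it breaks if the number of rows ever changes. Here's a sketch of an alternative that asks the DataFrame how many rows it has, using `len()`. It also shows that a new column can be computed from existing columns. The DataFrame `ratings_df` and the column name "Difference" are made up for this example.

```python
import numpy as np
import pandas as pd

# Made-up ratings, just for illustration.
ratings_df = pd.DataFrame({"Rating": [7.5, 8.3, 9.1, 6.8]})

n_rows = len(ratings_df)    # number of rows in the DataFrame

# One random integer between 0 and 10 for each row
ratings_df["MyRating"] = np.random.randint(11, size=n_rows)

# Arithmetic on columns is done row by row
ratings_df["Difference"] = ratings_df["MyRating"] - ratings_df["Rating"]

print(ratings_df)
```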

### Your Turn

In [None]:
# Add a column to the my_df DataFrame called "Visit" which contains,
# for each building, True if you visit that building at least once per week
# and False otherwise.
[...]