# Catch Up Section 1 ~ Pandas III & Data Wrangling and EDA

## Author: Sammie Smith, Summer 2025

In [None]:
import pandas as pd
import numpy as np

In [None]:
# This cell makes a DataFrame for us to play with
course_preferences = pd.DataFrame({ "Staff":["Sammie", "Jake", "Milena", "Wesley", "Xiaorui", "Ella", "Hannah"], "Favorite Course":["D100", "CS70", "CS61B", "D100", "CS61B", "D100", "CS61B"], "Position":["TA", "Head TA", "TA", "TA", "Ta", "Tutor", "Tutor"], "Data Sci Major": [1,1,1,0,1,0,1], "Graduated" : [0,0,0,0,1,0,1]})
display(course_preferences)

## Q1) Provide a breakdown of favorite courses per position.

Guiding Question: What needs to go on the index? What should the column(s) represent? What should the values within the columns represent?

Hint: What is a Pandas method that will change the index and granularity of the DataFrame?

In [None]:
# We need to use a groupby
course_preferences.groupby("Position")

In [None]:
# Oh no! Something went wrong... calling groupby on its own returns a DataFrameGroupBy object, not a DataFrame!
# That's because we need to call an aggregation function after the groupby.
course_preferences.groupby("Position").agg(list)
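
To see why the bare groupby looked "wrong", here is a minimal sketch on a hypothetical toy DataFrame (`toy` and its values are made up for illustration, not part of the section data). It suggests that the groupby object only records which rows belong to which group, and that a table appears once we aggregate.

```python
import pandas as pd

# Hypothetical toy frame, much smaller than course_preferences
toy = pd.DataFrame({
    "Position": ["TA", "TA", "Tutor"],
    "Staff": ["A", "B", "C"],
})

grouped = toy.groupby("Position")       # no result table yet
print(type(grouped).__name__)           # name of the intermediate object

# .groups maps each group key to the row labels that belong to it
group_labels = {key: list(idx) for key, idx in grouped.groups.items()}
print(group_labels)

# Aggregating collapses each group into a single row
aggregated = grouped.agg(list)
print(aggregated)
```

Notice that the aggregated result has one row per distinct position, which is the change in granularity the hint is pointing at.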

In [None]:
# Something's still not right!! We have two rows that represent TAs.
# .groupby is case sensitive, so "TA" and "Ta" are treated as different groups. Our string data needs a consistent format.
# Let's just make everything upper case for convenience.
course_preferences["Position"] = course_preferences["Position"].str.upper()
display(course_preferences)
course_preferences.groupby("Position").agg(list)
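
A quick way to spot this kind of problem before grouping is to count the distinct labels. The sketch below uses a hypothetical `positions` Series (not the section data) and compares the number of distinct values before and after upper-casing.

```python
import pandas as pd

# Hypothetical column with the same kind of inconsistency as "Position"
positions = pd.Series(["TA", "Head TA", "Ta", "Tutor"])

# value_counts shows "TA" and "Ta" as separate labels
print(positions.value_counts())
distinct_before = positions.nunique()

# After normalizing the case, the two spellings collapse into one label
normalized = positions.str.upper()
print(normalized.value_counts())
distinct_after = normalized.nunique()
```

If the distinct count drops after normalizing, some groups were being split by formatting alone.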

In [None]:
# Ok, cool. But what if I want just the 'Favorite Course' column? Well I could drop the 'Staff' column.. but let's try something fancier!
course_preferences.groupby("Position")[["Favorite Course"]].agg(list) 
# The brackets tell pandas to apply the aggregation function only to the columns named inside them.
# Single brackets (["Favorite Course"]) return a Series; double brackets ([["Favorite Course"]]) return a DataFrame with one column per name listed.
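
The single versus double bracket difference is easy to check by looking at the type of the result. This is a small sketch on a hypothetical `toy` DataFrame (values invented for illustration).

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "Position": ["TA", "TA", "TUTOR"],
    "Favorite Course": ["D100", "CS61B", "D100"],
    "Staff": ["A", "B", "C"],
})

# Single brackets: select one column, get a Series back
as_series = toy.groupby("Position")["Favorite Course"].agg(list)

# Double brackets: select a list of columns, get a DataFrame back
as_frame = toy.groupby("Position")[["Favorite Course"]].agg(list)

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

Both hold the same lists; only the container differs, which matters if you later want to merge or add more columns.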

In [None]:
# But you know what's really powerful?? I can pass a dictionary to .agg to apply different aggregation functions to different columns!
course_preferences.groupby("Position").agg({"Favorite Course": list, "Staff": "first"})
# Why didn't I write list as "list" when I wrote first as "first"?
# That's because list is a built-in Python function, so we pass the function itself.
# "first" is the name of a built-in pandas aggregation, which pandas looks up from the string.
# Look at the reference sheet for more built-ins!
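
To make the string versus function distinction concrete, here is a sketch on a hypothetical `toy` DataFrame. It compares the string `"first"` with a hand-written lambda that takes the first element of each group; the lambda is only an illustration, and the two agree here because the toy data has no missing values.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "Position": ["TA", "TA", "TUTOR"],
    "Favorite Course": ["D100", "CS61B", "D100"],
    "Staff": ["A", "B", "C"],
})

# A string names a pandas aggregation
by_name = toy.groupby("Position").agg({"Staff": "first"})

# A callable receives each group's column as a Series
by_callable = toy.groupby("Position").agg({"Staff": lambda s: s.iloc[0]})

# Strings and callables can be mixed in the same dictionary
mixed = toy.groupby("Position").agg({"Favorite Course": list, "Staff": "count"})
print(mixed)
```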

In [None]:
# Want each group's course names to be unique? Use set instead of list.
course_preferences.groupby("Position").agg({"Favorite Course": set, "Staff": "first"})

## Q2) Which staff positions have at least two staff members who are Data Science majors?

Use .groupby().filter() to return only the rows belonging to those positions.

In [None]:
# First, we group by "Position", then we can filter groups where the number of Data Science majors ("Data Sci Major" == 1) is at least 2.
course_preferences.groupby("Position").filter(lambda group: (group["Data Sci Major"] == 1).sum() >= 2)

In [None]:
# The TUTOR group does not have at least two staff who are Data Science majors... Ella is a Computer Science major!
# But Hannah is a Data Science major and a tutor... so why isn't her row included?
# This is because filter works on whole groups: if a group does not pass the condition, then NO rows from that group are returned.
# Hence, there are only TA rows in the result.
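
The all-or-nothing behavior of filter can be sketched on a hypothetical `toy` DataFrame (values invented for illustration). The second approach, using `transform` to build a boolean mask, is an alternative shown only for comparison and is not part of the original walkthrough.

```python
import pandas as pd

# Hypothetical toy frame: 2 majors among TAs, 1 major among tutors
toy = pd.DataFrame({
    "Staff": ["A", "B", "C", "D", "E"],
    "Position": ["TA", "TA", "TA", "TUTOR", "TUTOR"],
    "Data Sci Major": [1, 1, 0, 0, 1],
})

# filter keeps or drops entire groups based on one True/False per group
kept = toy.groupby("Position").filter(
    lambda group: (group["Data Sci Major"] == 1).sum() >= 2
)

# Assumed alternative: broadcast each group's total back to its rows, then mask
mask = toy.groupby("Position")["Data Sci Major"].transform("sum") >= 2
kept_via_mask = toy[mask]

print(kept)
```

Staff member C is not a major but stays because the TA group passes, while E is a major but is dropped along with the rest of the TUTOR group.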

### PAUSE Concept Check: Does groupby change granularity? Does filter change granularity? 
(Concept Check questions are meant to help your intuition & educated guessing on exams)

## Q3) How many staff in each position preferred each course? Only include those who have not graduated.

Guiding Question: What needs to go on the index? What should the columns represent? What should the values within the columns represent?

In [None]:
# We need grouped counts of TWO categorical variables AND we need to filter for graduation status before we aggregate.
# Let's use a pivot table!
# Step 1) Filter to staff who haven't graduated.
active_staff = course_preferences[course_preferences["Graduated"] == 0]
# Step 2) Count how many staff per position per favorite course.
active_staff.pivot_table(
    index="Position",
    columns="Favorite Course",
    values="Staff",
    aggfunc="count",
)

In [None]:
# These NaNs are ugly! Fortunately, it's super easy to fill them in with another number... which number makes the most sense?
active_staff.pivot_table(
    index="Position",
    columns="Favorite Course",
    values="Staff",
    aggfunc="count",
    fill_value=0
)
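
Where do the NaNs come from? A minimal sketch on a hypothetical `toy` DataFrame suggests that a cell is NaN exactly when no row has that position and course combination, which is why a count of 0 is a natural fill value.

```python
import pandas as pd

# Hypothetical toy frame: no tutor picked CS61B
toy = pd.DataFrame({
    "Staff": ["A", "B", "C", "D"],
    "Position": ["TA", "TA", "TA", "TUTOR"],
    "Favorite Course": ["D100", "D100", "CS61B", "D100"],
})

without_fill = toy.pivot_table(
    index="Position", columns="Favorite Course", values="Staff", aggfunc="count"
)
with_fill = toy.pivot_table(
    index="Position", columns="Favorite Course", values="Staff",
    aggfunc="count", fill_value=0,
)

print(without_fill)   # the (TUTOR, CS61B) cell is missing
print(with_fill)      # the same cell now reads 0
```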

## Q4) Solve Q3 without a pivot table.

In [None]:
# We can group on multiple features!
active_staff.groupby(["Position", "Favorite Course"]).count() # How many index levels will this produce? How many regular columns?
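
If you want the grouped counts in the same wide shape as the pivot table, one possible route is `.size()` followed by `.unstack()`. The sketch below uses a hypothetical `toy` DataFrame and also shows `pd.crosstab`, another pandas function for counting two categorical variables; both are offered as alternatives, not as the intended solution.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "Staff": ["A", "B", "C", "D"],
    "Position": ["TA", "TA", "TA", "TUTOR"],
    "Favorite Course": ["D100", "D100", "CS61B", "D100"],
})

pivot = toy.pivot_table(
    index="Position", columns="Favorite Course", values="Staff",
    aggfunc="count", fill_value=0,
)

# size() counts rows per (Position, Favorite Course) pair as one Series;
# unstack() moves the inner index level into the columns
via_groupby = (
    toy.groupby(["Position", "Favorite Course"]).size().unstack(fill_value=0)
)

# crosstab counts co-occurrences of two columns directly
via_crosstab = pd.crosstab(toy["Position"], toy["Favorite Course"])

print(via_groupby)
```

Unlike `.count()`, `.size()` returns a single count per group rather than one count per remaining column.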

## Q5) We want to include the mascot of each staff member's favorite course in the course_preferences DataFrame. Use the mascots DataFrame to add this information.

In [None]:
# Run this cell to build the mascots DataFrame
mascots = pd.DataFrame({"Course":["CS70", "CS61B", "D100"], "Mascot": ["penguin", "bee", "panda"]})
print("course_preferences")
display(course_preferences)
print("mascots")
display(mascots)

### PAUSE Concept Check: What Pandas methods exist to combine data from multiple DataFrames? Circle them on your reference sheet.

In [None]:
# Option 1: pd.merge(df1, df2)
pd.merge(left=course_preferences, right=mascots, left_on="Favorite Course", right_on="Course", how="inner")
# You can optionally drop the 'Course' or 'Favorite Course' column to get rid of the duplicate column.

In [None]:
# Option 2: df1.merge(df2)
course_preferences.merge(right=mascots, left_on="Favorite Course", right_on="Course", how="inner")
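
As a follow-up sketch, here is how dropping the duplicate key column might look, along with what changes if `how="left"` is used instead of `how="inner"`. The `prefs` DataFrame and the course "CS188" are hypothetical, added only so that one row has no match in the mascot table.

```python
import pandas as pd

# Hypothetical data: "CS188" has no entry in the mascot table
prefs = pd.DataFrame({
    "Staff": ["A", "B", "C"],
    "Favorite Course": ["D100", "CS70", "CS188"],
})
mascot_table = pd.DataFrame({
    "Course": ["CS70", "D100"],
    "Mascot": ["penguin", "panda"],
})

# Inner join: rows without a match are dropped
inner = prefs.merge(
    mascot_table, left_on="Favorite Course", right_on="Course", how="inner"
).drop(columns="Course")

# Left join: every row of prefs is kept; missing mascots become NaN
left = prefs.merge(
    mascot_table, left_on="Favorite Course", right_on="Course", how="left"
).drop(columns="Course")

print(inner)
print(left)
```

In the section's data every favorite course has a mascot, so inner and left joins would return the same rows there.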

## Whew, that was a lot! Reference this during your homework, labs, and exam practice. This section is ungraded and optional!