# Welcome to Lab_Plots! 🕵🏻‍♂️ 📊 🕵🏻‍♀️

Fun fact: Every employee at the University of Illinois is a "public employee" and all public employee salaries are [publicly available online](https://www.bot.uillinois.edu/resources/gray_book) -- we have curated this data in a cleaned dataset for you to explore!  The public records include data about every professor, administrator, and football coach!

The goal of this lab is to work with **real UIUC salary data** to explore its properties, answer important questions, and think about the implications of collecting and analyzing this data.  Throughout the lab, it is important to be a critical consumer of data: someone who can not only use statistics and programming to analyze data, but can also think about the **"why"** part of data science, both in the classroom and in the world.

In this lab, you will:
- Work with real UIUC salary data to explore some of the statistics that we talked about in lecture: mean, median, standard deviation, etc.
- Practice creating plots to **visualize quantitative data**: boxplots and histograms.
- See how data science can be used in the real world to think about important issues through written individual reflections and discussions with your group.

A few tips to remember:

- **You are not alone on your journey in learning programming!**  You have your lab TA, the CAs, your lab group, and the professors (Prof. Wade and Prof. Karle), who are all here to help you out!
- If you find yourself stuck for more than a few minutes, ask a neighbor or course staff for help!  When you are giving help to your neighbor, explain the **idea and approach** to the problem without sharing the answer itself so they can have the same **<i>ah-hah</i>** moment!
- We are here to help you!  Don't feel embarrassed or shy to ask us for help!

Let's get started!


<hr style="color: #DD3403;">

In [0]:
# Meet your CAs and TA if you haven't already!
# First name is enough, we'll know who they are! :)
ta_name = ""
ca1_name = ""
ca2_name = ""
ca3_name = ""

# Work with your group again this week! 
#
# QOTD to Ask Your Group: "What was your favorite childhood game?"
partner1_name = ""
partner1_netid = ""
partner1_fav_game = ""

partner2_name = ""
partner2_netid = ""
partner2_fav_game = ""

partner3_name = ""
partner3_netid = ""
partner3_fav_game = ""

<hr style="color: #DD3403;">

## Setup: Import the Graybook Dataset

The "Gray Book" is a historical term for the book of "Academic and Administrative Appointments".  As a public university, all positions (including job title, tenure status, and salary) at UIUC are publicly approved by the Board of Trustees.  After approval, they are published publicly at [https://www.bot.uillinois.edu/resources/gray_book](https://www.bot.uillinois.edu/resources/gray_book).

We have parsed the HTML tables and done a little data cleaning for you. The "Graybook Dataset" provided here includes all faculty (except for the Division of Collegiate Athletics, for salary outlier reasons) at the University of Illinois, based on the **2023-2024 Graybook report**.  A CSV version of this dataset is available at the following URL:

```
https://waf.cs.illinois.edu/discovery/graybook.csv
```
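
For orientation, `pandas.read_csv` accepts a local file path, a URL string, or a file-like object, and returns a DataFrame. The sketch below reads a tiny made-up CSV from an in-memory string so that it runs without network access. The names and salaries in it are invented for illustration and are not Graybook data.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real file or URL.
toy_csv = io.StringIO(
    "Employee Name,Present Salary\n"
    "Ada,50000\n"
    "Grace,60000\n"
    "Alan,70000\n"
)

# read_csv turns the header row into column names and each later row into a record.
toy_df = pd.read_csv(toy_csv)

print(len(toy_df))            # number of rows
print(list(toy_df.columns))   # column names taken from the header row
```

Passing a path or URL string in place of `toy_csv` follows the same pattern.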

Import `pandas` and load this dataset into a DataFrame, `df`:

In [0]:
...

### 🔬 Test Case Checkpoint 🔬

In [0]:
## == TEST CASES for Loading in Graybook ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs with a message (with the emoji showing), you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert( 'df' in vars()), "Load the dataset into the variable named `df`."
assert ( len(df) == 6288 ), "This is not the Graybook dataset you're looking for."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")   

<hr style="color: #DD3403;">

## Part 1: Exploratory Data Analysis (EDA)

As discussed in lecture, the first step of any data analysis is to **get familiar** with your dataset.  Think about what this data can tell you and what variables are included.  Data scientists always start with this step.

Let’s do some general **exploratory data analysis** to feel out our dataset.  Before you do any calculations, ponder this question:

**Q: What do you estimate the average salary of all UIUC Faculty to be?**

*(✏️ Edit this cell to replace this text with your answer. ✏️)*

### Puzzle 1.1: Descriptive Statistics

Our Graybook Dataset contains both the `Present Salary` and `Proposed Salary` of employees at U of I. For now, we're only interested in the present salary.

Using `df`, find the following information:

1. The number of faculty at UIUC, stored in the variable `num_employee` (Hint: each row is an employee!)
2. The **mean** present salary, stored in the variable `mean_sal`
3. The **median** present salary, stored in the variable `median_sal`
4. The **standard deviation** of present salary, stored in the variable `std_sal`

Remember, present salary is found in the `"Present Salary"` column! 
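
As a reminder of the pandas calls involved, here is a minimal sketch on a toy DataFrame. The salaries are invented for illustration (not Graybook data), and the variable names are hypothetical so they do not collide with the ones the test cases expect.

```python
import pandas as pd

# Invented salaries for illustration only (not Graybook data).
toy_df = pd.DataFrame({"Present Salary": [50000, 60000, 70000, 80000, 240000]})

n_rows = len(toy_df)                             # one row per employee
toy_mean = toy_df["Present Salary"].mean()       # arithmetic average
toy_median = toy_df["Present Salary"].median()   # middle value when sorted
toy_std = toy_df["Present Salary"].std()         # sample standard deviation (divides by n - 1)

print(n_rows, toy_mean, toy_median, round(toy_std, 2))
```

In this toy data the single large value pulls the mean above the median, which is one reason to look at both.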

In [0]:
# Find the number of employees at UIUC:
num_employee = ...
num_employee

In [0]:
# Find the average (mean) salary at UIUC:
mean_sal = ...
mean_sal

In [0]:
# Find the median salary at UIUC:
median_sal = ...
median_sal

In [0]:
# Find the standard deviation of the salary at UIUC:
std_sal = ...
std_sal

### 🔬 Test Case Checkpoint 🔬

In [0]:
## == TEST CASES for Puzzle 1.1 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs with a message (with the emoji showing), you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
import math

x = mean_sal + median_sal + std_sal
assert( num_employee == 6288 ), "Your calculation of the number of employees is incorrect."
assert( math.isclose(x, 303895.7881426731) ), "Your calculation of the mean, median, or standard deviation is incorrect."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

### Visual Displays of Data: A Key Part of EDA!

Now, we are a bit more familiar with the dataset through summary statistics. Looking at overall descriptive statistics helps us summarize all of the observations in a column, rather than having to scroll through all of the observations!  However, descriptive statistics alone often don’t tell the whole story. This is where having tools for **visualizing statistics** comes in handy.

### Puzzle 1.2: Boxplots

Next, let’s look at a simple, yet powerful visualization: the **boxplot**! 
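
As a reminder of what a boxplot draws: the box spans the first quartile (Q1) to the third quartile (Q3) with a line at the median, and points beyond the whiskers are drawn individually as outliers. The sketch below computes those numbers on invented salaries, assuming the common rule that the fences sit 1.5 times the IQR beyond the box (the matplotlib default, which pandas uses when no `whis` option is given). All variable names are hypothetical.

```python
import pandas as pd

# Invented salaries for illustration only (not Graybook data).
toy_salaries = pd.Series([50000, 60000, 70000, 80000, 240000])

q1 = toy_salaries.quantile(0.25)   # first quartile
q3 = toy_salaries.quantile(0.75)   # third quartile
iqr = q3 - q1                      # interquartile range (length of the box)

# Fences under the 1.5 * IQR convention; values outside them count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = toy_salaries[(toy_salaries < lower_fence) | (toy_salaries > upper_fence)]
print(q1, q3, iqr, upper_fence)
print(outliers.tolist())
```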

Generate a **boxplot** of the `Present Salary` column by using:

> ```py
> df["Present Salary"].plot.box()
> ```

We can draw the boxplot horizontally by using the `vert=False` option in the `box` function:

> ```py
> df["Present Salary"].plot.box(vert=False)
> ```


Even better, inside of the `box` function options, include `figsize=(14, 5)` to control the size of the figure:

> ```py
> df['Present Salary'].plot.box(vert=False, figsize=(14, 5))
> ```


In [0]:
...

### Analysis

**Question**: This visualization is different from any boxplot you saw in lecture!  **With your group**, (1) discuss why the box is almost unreadable and (2) predict what the extreme outliers in this dataset might be.

*(✏️ Edit this cell to replace this text with your group's answer. Write at least 2 complete sentences. ✏️)*

### Puzzle 1.3: The $500,000 Club

Create a new DataFrame, `df_over_500k`, that includes all employees who make more than $500,000.
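
Selecting rows by a condition follows the pattern `df[df[column] > value]`: the inner comparison builds a True/False Series, and indexing with it keeps only the rows marked True. Here is a minimal sketch on invented data with a hypothetical threshold of 100,000. None of the names or numbers come from the Graybook.

```python
import pandas as pd

# Invented rows for illustration only (not Graybook data).
toy_df = pd.DataFrame({
    "Employee Name": ["Ada", "Grace", "Alan", "Edsger"],
    "Present Salary": [50000, 120000, 70000, 240000],
})

# The comparison yields one boolean per row.
is_high = toy_df["Present Salary"] > 100000

# Indexing with the boolean Series keeps only the rows marked True.
toy_high = toy_df[is_high]
print(toy_high)
```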

In [0]:
df_over_500k = ...
df_over_500k

Create a second DataFrame, `df_under_500k`, that includes all employees who make less than $500,000.

In [0]:
df_under_500k = ...
df_under_500k

### Puzzle 1.4: Boxplots (Round 2)

Create a boxplot of all employees who make less than $500,000:

In [0]:
...

### Puzzle 1.5: Histograms
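
A histogram splits the range of values into equal-width bins and draws one bar per bin, with the bar height equal to the number of values in that bin. The sketch below uses `numpy.histogram` to show those counts as plain numbers on invented salaries. The choice of 4 bins is arbitrary and only for illustration. In pandas, `Series.plot.hist()` draws the corresponding bars.

```python
import numpy as np
import pandas as pd

# Invented salaries for illustration only (not Graybook data).
toy_salaries = pd.Series([50000, 55000, 60000, 70000, 80000, 90000, 130000])

# counts[i] is the number of values between edges[i] and edges[i + 1].
counts, edges = np.histogram(toy_salaries, bins=4)

print(counts.tolist())
print(edges.tolist())
```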

Create a histogram of all employees who make less than $500,000:

In [0]:
...

**Analysis: Using the data visualizations above, estimate what the lowest salary is that is a HIGH outlier.  Write your answer in the cell below as a complete sentence.**

*(✏️ Edit this cell to replace this text with your answer. Write your answer as a complete sentence. ✏️)*

<hr style="color: #DD3403;">

## Part 2: Gender and Salaries

Data can reveal **systemic problems or discrimination**. For example, in many companies, men and women are promoted at **different rates**.  Let’s look at a subset of the salary dataset to investigate whether or not there is a **difference in salaries** between faculty who identify as men and women in two departments: **STAT** and **CS** (Karle and Wade’s home departments). 

We've compiled data from these departments, added a `Gender` column, and placed it in a dataset called `STAT_CS_gender.csv`. 

This data is also located in the **same directory as this lab**. To load it in, just specify the **local file path** (`"STAT_CS_gender.csv"`)!

### Puzzle 2.1: Loading A Second Dataset
Using the cell below, import `STAT_CS_gender.csv`, store it in a variable called `STAT_CS_df`, and display it to see what it looks like!

In [0]:
STAT_CS_df = ...
STAT_CS_df

Now, let's create **two subsets** of our `STAT_CS_df`. 

Using conditionals in the cells below, create:
- `STAT_CS_M`, a `DataFrame` of the staff and faculty who identify as Male (**"M"**) under the `Gender` column 
- `STAT_CS_F`, a `DataFrame` of the staff and faculty who identify as Female (**"F"**) under the `Gender` column 

In [0]:
STAT_CS_M = ...
STAT_CS_M

In [0]:
STAT_CS_F = ...
STAT_CS_F

### 🔬 Test Case Checkpoint 🔬

In [0]:
## == TEST CASES for Puzzle 2.1 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert("STAT_CS_df" in vars()), "Ensure you've named your original DataFrame `STAT_CS_df`."
assert(len(STAT_CS_df) == 145), "This is not the STAT_CS_df you are looking for."
assert("STAT_CS_M" in vars()), "Ensure your male subset of STAT_CS_df is named `STAT_CS_M`."
assert(len(STAT_CS_M) == 106), "Double check your conditional to generate STAT_CS_M - the number of rows is incorrect."
assert("STAT_CS_F" in vars()), "Ensure your female subset of STAT_CS_df is named `STAT_CS_F`."
assert(len(STAT_CS_F) == 39), "Double check your conditional to generate STAT_CS_F - the number of rows is incorrect."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

### Puzzle 2.2: Merging Two Columns into One DataFrame

The following provided code creates a new DataFrame, `df_salary_by_gender`, that contains only the `Present Salary` values from the two subsets you created above:

In [0]:
df_salary_by_gender = pd.DataFrame({
    "female": STAT_CS_F["Present Salary"],
    "male": STAT_CS_M["Present Salary"],
})

Take a look at the DataFrame `df_salary_by_gender`.  Every row will either have data in the `female` column or the `male` column, meaning every row represents a male or female employee's present salary:
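
The either/or pattern comes from index alignment: when `pd.DataFrame` is built from several Series, pandas lines them up by row label and fills in `NaN` wherever a Series has no value for that label. A small sketch with invented salaries and hypothetical row labels:

```python
import pandas as pd

# Two invented Series whose row labels do not overlap, like two filtered subsets.
toy_f = pd.Series([100000, 120000], index=[0, 3])
toy_m = pd.Series([90000, 110000, 130000], index=[1, 2, 4])

# pandas aligns on the union of the labels; missing entries become NaN.
toy_by_gender = pd.DataFrame({"female": toy_f, "male": toy_m})
print(toy_by_gender)

# Statistics such as mean() skip NaN by default, so each column
# is summarized over its own rows only.
print(toy_by_gender.mean())
```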

In [0]:
df_salary_by_gender

### Puzzle 2.3: Visualization

Let's create a visualization! Using the next cell, create a **boxplot** of `df_salary_by_gender` that includes both genders:

In [0]:
...

### Puzzle 2.4: EDA

Now that we've visualized the data, let's explore some basic statistics once more to gain further insight. 

In the following cells, calculate:
- The **mean** `Present Salary` for **Male** STAT/CS Faculty, stored in the variable `mean_m`
- The **median** `Present Salary` for **Male** STAT/CS Faculty, stored in the variable `median_m`
- The **standard deviation** of `Present Salary` for **Male** STAT/CS Faculty, stored in the variable `std_m`

In [0]:
mean_m = ...
mean_m

In [0]:
median_m = ...
median_m

In [0]:
std_m = ...
std_m

Now, in the cells below, calculate:

- The **mean** `Present Salary` for **Female** STAT/CS Faculty, stored in the variable `mean_f`
- The **median** `Present Salary` for **Female** STAT/CS Faculty, stored in the variable `median_f`
- The **standard deviation** of `Present Salary` for **Female** STAT/CS Faculty, stored in the variable `std_f`

In [0]:
mean_f = ...
mean_f

In [0]:
median_f = ...
median_f

In [0]:
std_f = ...
std_f

Run the following cell to make a summary table of your previously calculated data:

In [0]:
pd.DataFrame([
  {"Gender": "F", "Mean ($)": round(mean_f), "Median ($)": round(median_f), "Standard Deviation ($)": round(std_f)},
  {"Gender": "M", "Mean ($)": round(mean_m), "Median ($)": round(median_m), "Standard Deviation ($)": round(std_m)},
])

### 🔬 Test Case Checkpoint 🔬

In [0]:
## == TEST CASES for Puzzle 2.4 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs with a message (with the emoji showing), you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
import math
assert( math.isclose( mean_f, 157847.477949 ) )
assert( math.isclose( median_f, 138000.0 ) )
assert( math.isclose( std_f, 50583.799487 ) )
assert( math.isclose( mean_m, 157679.508208 ) )
assert( math.isclose( median_m, 152500.0 ) )
assert( math.isclose( std_m, 49368.327380 ) )

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

### Analysis: EDA Takeaways

**Q: Now that you've calculated descriptive statistics of the `Present Salary` of Male and Female STAT/CS Faculty, how do the numbers support or counter the boxplot observations you made earlier? Can we draw any conclusions about gender-based salary discrimination from our data? Use the analysis you did to support your answer.**

*(✏️ Edit this cell to replace this text with your answer. Write at least 3 complete sentences. ✏️)*

<hr style="color: #DD3403;">

## Part 3: Exploring Your Own Interests


At this point of the lab, we have investigated a lot of questions.

However, these have been questions that **we told you to answer**. As a data scientist, it is important to be able to use the data science skills that you learn in the classroom to answer questions that **you have**.

Think about **two questions** that you have about the **Graybook** or **STAT_CS_Gender** datasets that have not been answered. These can be simple questions. Record them below.  Then, answer at least one of these questions using Python and either dataset.  

*(✏️ Edit this cell to replace this text with your two questions. ✏️)*


Now, use the cell below to **find the answer** to (at least) **one** of your questions! Remember, it can be something simple. 

In [0]:
...

**Summarize your findings to your TA, Karle, and Wade in at least two complete sentences. In this summary, explain what question you had and the answer you found in the data.**

*(✏️ Write the question you answered here and briefly describe the results. Write at least 2 complete sentences. ✏️)*

<hr style="color: #DD3403;">

## **Submission** 


You're almost done! All you need to do is to commit your lab to GitHub:

1. Make certain to save your work. To do this, go to **File => Save All**

2. After you have saved, exit this notebook and follow the Canvas instructions to commit this lab to your Git repository!

3. Your TA will grade your submission and provide you feedback after the lab is due. :)