In [4]:
# :)

Because the previous engineer left in a hurry, the model provided to you is pre-trained, and you have no direct information on how it was trained. You are tasked with evaluating the model's performance and fairness.

In training the model, the ML engineer who worked on the project before you made the following assumptions:

The model predicts whether a student will be a good candidate for a software engineering job. It is a binary classifier, where 1 means the student is a good candidate and 0 means the student is not.
The model was trained on a dataset of students who graduated from CMU and have worked in industry for at least 1 year.
To prevent bias, they assumed that removing Gender (M, F) and Student ID from the training dataset would be sufficient. They claimed that this makes the model Group Unaware, and that the model is therefore fair.
The model specification is as follows:

In [5]:
# X variable (input parameters)
# - Age (18 - 25)
# - Major (Computer Science, Information Systems, Business, Math,
#          Electrical and Computer Engineering, Statistics and Machine Learning)
# - GPA (0 - 4.0)
# - Extra Curricular Activities (Student Theatre, Buggy, Teaching Assistant, Student Government,
#     Society of Women Engineers, Women in CS, Volleyball, Sorority, Men's Basketball,
#     American Football, Men's Golf, Fraternity)
# - Number of Programming Languages (1, 2, 3, 4, 5)
# - Number of Past Internships (0, 1, 2, 3, 4)

# Y variable (output)
# - Good Candidate (0, 1)
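
Since nothing is known about how the model was trained, it may help to check that each row handed to it stays inside the ranges listed in the specification. The following is a minimal sketch under stated assumptions: the dictionary keys (`age`, `major`, `gpa`, `activity`, `num_languages`, `num_internships`) and the helper `validate_row` are hypothetical names chosen for illustration, and each student is assumed to have a single extracurricular activity. Only the allowed values come from the specification.

```python
# Allowed values copied from the model specification.
MAJORS = {
    "Computer Science", "Information Systems", "Business", "Math",
    "Electrical and Computer Engineering", "Statistics and Machine Learning",
}
ACTIVITIES = {
    "Student Theatre", "Buggy", "Teaching Assistant", "Student Government",
    "Society of Women Engineers", "Women in CS", "Volleyball", "Sorority",
    "Men's Basketball", "American Football", "Men's Golf", "Fraternity",
}


def validate_row(row):
    """Return a list of problems found in one input row (empty if none).

    The key names are hypothetical; adapt them to the real column names.
    """
    problems = []
    if not 18 <= row["age"] <= 25:
        problems.append("age out of range")
    if row["major"] not in MAJORS:
        problems.append("unknown major")
    if not 0.0 <= row["gpa"] <= 4.0:
        problems.append("gpa out of range")
    if row["activity"] not in ACTIVITIES:
        problems.append("unknown activity")
    if row["num_languages"] not in range(1, 6):
        problems.append("num_languages out of range")
    if row["num_internships"] not in range(0, 5):
        problems.append("num_internships out of range")
    return problems


# Two made-up rows, not taken from any real dataset.
good_row = {"age": 21, "major": "Math", "gpa": 3.4, "activity": "Buggy",
            "num_languages": 3, "num_internships": 1}
bad_row = dict(good_row, age=30, gpa=4.5)

print(validate_row(good_row))
print(validate_row(bad_row))
```

Returning a list of problems rather than raising on the first one makes it easier to summarize all issues in a dataset at once.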

The previous engineer provided some examples of how to use the model in the draft pull request.
You are provided with a test dataset, which contains a similar set of features and the output (whether the student is a good candidate or not). The students in this test dataset are different from those in the training dataset, and each student was evaluated by a fair panel of recruiters, so the labels can be considered unbiased. The panel of recruiters has also provided additional context on the extracurricular activities in comments (marked with #).

Your test dataset is provided to you in the following format:

In [6]:
# X variable
# - Student ID
# - Gender (M, F)
# - Age (18 - 25)
# - Major (Computer Science, Information Systems, Business, Math,
#          Electrical and Computer Engineering, Statistics and Machine Learning)
# - GPA (0 - 4.0)
# - Extra Curricular Activities (Student Theatre, Buggy, Teaching Assistant, Student Government,
#     Society of Women Engineers, Women in CS, Volleyball, Sorority, Men's Basketball,
#     American Football, Men's Golf, Fraternity)
#   # Likely Co-Ed (Student Theatre, Buggy, Teaching Assistant, Student Government)
#   # Likely Majority Female (Society of Women Engineers, Women in CS, Volleyball, Sorority)
#   # Likely Majority Male (Men's Basketball, American Football, Men's Golf, Fraternity)
# - Number of Programming Languages (1, 2, 3, 4, 5)
# - Number of Past Internships (0, 1, 2, 3, 4)

# Y variable
# - Good Candidate (0, 1)
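
The recruiters' grouping of extracurricular activities suggests one way to probe the Group Unaware claim: if activity membership lines up closely with Gender, the activity feature could act as a proxy for Gender even though the Gender column was dropped before training. The sketch below counts (activity group, gender) pairs. The group labels `co_ed`, `majority_female` and `majority_male`, the helper `activity_group_by_gender`, and the toy rows are hypothetical; only the activity-to-group mapping comes from the comments in the dataset description.

```python
from collections import Counter

# Mapping taken from the recruiters' comments; the group labels are made up.
ACTIVITY_GROUP = {
    "Student Theatre": "co_ed",
    "Buggy": "co_ed",
    "Teaching Assistant": "co_ed",
    "Student Government": "co_ed",
    "Society of Women Engineers": "majority_female",
    "Women in CS": "majority_female",
    "Volleyball": "majority_female",
    "Sorority": "majority_female",
    "Men's Basketball": "majority_male",
    "American Football": "majority_male",
    "Men's Golf": "majority_male",
    "Fraternity": "majority_male",
}


def activity_group_by_gender(rows):
    """Count (activity group, gender) pairs from (gender, activity) tuples."""
    counts = Counter()
    for gender, activity in rows:
        counts[(ACTIVITY_GROUP[activity], gender)] += 1
    return counts


# Toy rows for illustration only; replace with the real test dataset.
toy_rows = [
    ("F", "Women in CS"), ("F", "Sorority"),
    ("M", "Fraternity"), ("M", "Men's Golf"),
    ("F", "Buggy"), ("M", "Buggy"),
]

counts = activity_group_by_gender(toy_rows)
for (group, gender), n in sorted(counts.items()):
    print(f"{group:16s} {gender} {n}")
```

A table in which the majority-female and majority-male groups are each dominated by one gender would indicate that the model can still infer gender indirectly. Whether that holds for the real test dataset has to be checked on the data itself.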

Before doing a thorough evaluation of the fairness of the model, you will start with a preliminary analysis of the test dataset and run the model on it to measure the model's accuracy. To do so, you will need to set up a Jupyter notebook. You can either:

Use Google Colab to run the notebook in the cloud. (Recommended if you do not have experience with Jupyter notebooks)
Alternatively, set up JupyterLab on your local machine. You can also use VS Code to run the notebook. (Recommended if you are experienced)
It is recommended that you use Python 3.9 or above when setting up the notebook.

After you have set up the notebook, you should:

Load the model and test dataset
Plot the distribution of the test dataset across all features (except Student ID) using any visualization library of your choice (e.g. pandas, matplotlib, seaborn, plotly, etc.). You should choose the appropriate visualization for each feature.
Predict the output of the test dataset using the model
Report the accuracy of the model, and the confusion matrix
Refer to the Resources & Documentation section if you need help with any of the above steps.
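
To make the reporting step concrete, here is a minimal sketch of how accuracy and the confusion matrix follow from the true labels and the model's predictions. It is written in plain Python so the arithmetic is visible. The function names and the label lists are hypothetical, and loading the model and dataset is left out because their file formats are not specified here. The matrix layout `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predicted labels) matches the convention used by scikit-learn's `confusion_matrix` for labels 0 and 1.

```python
def confusion_matrix_binary(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:  # truth == 1 and pred == 0
            fn += 1
    return [[tn, fp], [fn, tp]]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    (tn, fp), (fn, tp) = confusion_matrix_binary(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


# Made-up labels for illustration; in the notebook, y_true would come from
# the Good Candidate column and y_pred from the model's predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix_binary(y_true, y_pred)
print("confusion matrix:", matrix)
print("accuracy:", accuracy(y_true, y_pred))
```

In the notebook itself, `sklearn.metrics.accuracy_score` and `sklearn.metrics.confusion_matrix` compute the same quantities directly from the label arrays.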

By the checkpoint deadline, your team should commit the Jupyter notebook to your repository and submit a link to it. The notebook should contain the basic analysis and a working example of using the model.

In [7]:
# ⠀⢀⡤⢚⡭⣿⡟⠉⠉⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠁⠉⠁⠀⠀⠀⠉⠉⠁⠀⠉⣿⠀⠀⠀⠀⠀
# ⠀⠀⠀⢁⢶⡟⠀⠀⠀⠘⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡀⠸⠀⠀⠀⠀⡆⠁⢀⠀⡇⠠⣼⡄⠀⠀⠀⠀
# ⠀⠀⢠⠏⣾⠁⡄⠀⠀⡃⠀⢡⠀⠀⠀⠃⠀⠀⠀⠀⠀⠀⠀⡁⠀⠀⢠⠂⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⣇⠇⠀⢀⡄⢸⢀⡆⢸⡰⡆⢸⠿⣧⠀⠀⠀⠀
# ⠀⠀⡜⣰⡏⠀⠧⠂⠰⠇⢰⣸⡄⠀⣼⡀⠀⠄⢠⠀⡀⠀⢠⡇⠀⢀⢾⠀⠀⢀⠇⠀⠀⠀⠀⠀⠀⡀⣸⠀⢀⢸⡟⠀⠀⡼⠀⡇⣼⡁⠆⡇⣧⢸⠀⠈⠀⠀⠀⠀
# ⠀⢠⣷⠇⠀⢸⢸⠀⢰⠄⣸⡟⡇⢀⠽⡇⢀⡇⠘⠰⠁⠀⡞⡇⠀⡜⢸⡀⠀⣸⡄⠀⢀⠀⠀⢀⠴⢃⡇⢀⣇⠞⣓⠲⣴⠁⢸⢰⡇⢧⠀⣸⢿⣼⠀⠀⠀⠀⠀⠀
# ⠀⣼⡏⢀⡆⠘⢸⠀⢸⠀⣿⣤⡇⣸⠀⡇⢸⡆⠂⢸⠀⢰⠇⡇⢰⠗⢻⡃⢸⣿⠀⠀⢸⠀⢀⡏⠀⣋⠅⣼⠋⣼⣿⡇⣇⠀⣿⠏⣧⢸⢠⣿⠀⠻⠀⠀⠀⠀⠀⠀
# ⢠⣿⢀⡞⠀⡀⢸⢀⠸⣾⣿⣹⣇⡇⣄⡇⣼⡇⠀⣼⠀⡾⠉⢱⡞⠀⢀⡇⡞⢹⠀⠀⣾⠀⣼⡇⢠⣿⢠⠇⠘⣿⡇⢣⣸⡄⡞⠀⣿⠈⣿⡏⠀⠀⠀⠀⠀⠀⠀⠀
# ⢸⣿⢻⠇⣠⡇⣹⢸⡆⣿⠿⣿⣿⣄⢸⡇⡏⢿⡮⢹⢰⡇⣠⣼⡷⣶⣿⣿⣁⢘⠠⠇⣿⢰⣱⡁⣸⣿⡼⠀⣾⠿⠏⢸⢋⣿⣷⣤⠙⣇⣿⣇⠀⠀⠀⠀⠀⠀⠀⠀
# ⠾⠁⠘⡰⢋⣧⣇⣸⣷⡞⠓⠺⢧⣘⣆⢻⠇⢸⡇⣸⣞⣹⡿⠛⠻⣏⣀⡭⠙⢛⡶⠦⣟⢫⣿⡿⠉⣿⠇⠀⡿⠁⠀⢧⣞⣿⡇⢻⣧⡘⣿⣿⡆⠀⠀⠀⠀⠀⠀⠀
# ⠀⢠⡿⠁⢸⠇⣼⠋⡿⣧⡀⠀⠀⠙⣿⡿⠀⢹⣶⠟⠋⠛⠻⠶⣤⣀⠀⠠⠔⠋⣣⣾⡏⡞⢸⠃⠀⡿⠀⠀⠁⠀⡰⠋⠘⠾⡇⠀⠙⠳⣿⣿⣿⡀⠀⠀⠀⠀⠀⠀
# ⠀⡾⠁⢀⡏⢰⠃⠀⡇⢿⠿⣿⣿⣿⣿⠷⠶⠾⢿⣦⣀⣀⣀⣀⣈⣿⣇⢀⣴⠟⠉⢸⡿⠁⠊⠀⠀⠃⣀⣄⣠⠞⠁⠀⠀⠀⠀⠀⠀⣰⣿⣿⣿⣷⣦⣀⠀⠀⠀⠀
# ⠈⠁⠀⠼⠁⠁⠀⠀⠃⢸⠀⢹⡉⠛⠋⠀⠀⠀⠀⠻⢿⣿⣿⣿⣿⣿⣿⣛⡟⠀⠀⠛⠃⠀⠀⠀⠀⢠⡿⠏⣠⡆⠀⠀⠀⠀⠀⣠⣾⣿⣿⣿⣿⣿⣿⣿⣿⣦⡀⠀
# ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠀⠀⢷⣤⡀⠀⠀⠀⠀⠀⠀⠈⠙⠛⠛⠋⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⢾⠁⣰⣿⡇⠀⠀⠀⢀⣼⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣄
# ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠹⣿⢦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢟⣹⠀⠀⠀⣠⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
# ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⢾⣯⣄⣀⡀⠀⠀⠀⠀⣀⡤⠖⠋⠀⠀⠀⠀⠀⢀⣠⡤⠀⠀⠀⢺⣿⡀⢀⣼⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
# ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⢀⣩⡗⠒⠒⠉⠀⠀⠀⠀⠀⠀⠀⣠⣴⣿⠏⠀⠀⠀⠀⣾⠏⣰⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
# ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠧⠴⢿⡋⠁⠀⠀⠀⠀⠀⠀⠀⠀⣠⣞⢻⠟⠁⠀⠀⠀⠀⠀⣡⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
# ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢳⣄⡀⠀⠀⠀⠀⣀⣴⣮⣹⡿⠃⠀⠀⠀⠀⣀⣴⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿