In [34]:
# Necessary imports for the notebook
import requests
import pandas as pd
import altair as alt
import random as r
import datetime as dt

#Premise of Notebook
This exploratory notebook works through increasingly granular levels of visualization using Altair, chosen for its customizability and for interactivity that is retained when a chart is embedded in a website. The exercise focuses on mentor/mentee data and high-level categorical metrics, but the same concepts apply to many other areas of data visualization: attendance time series, financial and resource management (tracking usage, supply/demand growth, etc.), mentor availability for mentees by day and time, and staff metric overviews like the one we're about to dive into.

###Data Retrieval
To start, we will compare counts of mentees and mentors by category, namely experience level and subject (tech stack). This could be a useful tool for staff administration (superadmins).

In [35]:
# These are requests to the live database, showcasing how data could be retrieved for usage.
mentees_df = pd.DataFrame(requests.post("http://underdog-devs-ds-a-dev.us-east-1.elasticbeanstalk.com/Mentees/read").json()["result"])
mentors_df = pd.DataFrame(requests.post("http://underdog-devs-ds-a-dev.us-east-1.elasticbeanstalk.com/Mentors/read").json()["result"])
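
The two requests above assume the service is reachable and that each response is JSON with a `result` key. A slightly more defensive sketch is shown here. `records_to_frame` and `fetch_frame` are hypothetical helper names, and the 10-second timeout is an arbitrary assumption rather than a documented requirement of the API.

```python
import pandas as pd
import requests

# Base URL taken from the cell above
BASE_URL = "http://underdog-devs-ds-a-dev.us-east-1.elasticbeanstalk.com"


def records_to_frame(payload: dict) -> pd.DataFrame:
    """Turn a {"result": [...]} payload into a DataFrame (hypothetical helper)."""
    if "result" not in payload:
        raise KeyError("response JSON has no 'result' key")
    return pd.DataFrame(payload["result"])


def fetch_frame(collection: str, timeout: float = 10.0) -> pd.DataFrame:
    """POST to <BASE_URL>/<collection>/read and return the records as a DataFrame."""
    response = requests.post(f"{BASE_URL}/{collection}/read", timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return records_to_frame(response.json())
```

With these helpers, `fetch_frame("Mentees")` and `fetch_frame("Mentors")` would stand in for the two one-liners, assuming the endpoints keep the response shape used above.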

In [36]:
#verifying data retrieval
print("Mentees\n")
mentees_df.info()
print("------------------------------------------------------------------\nMentors\n")
mentors_df.info()

Mentees

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 23 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   profile_id              100 non-null    object
 1   mentee_intake_id        0 non-null      object
 2   name                    100 non-null    object
 3   email                   100 non-null    object
 4   location                100 non-null    object
 5   in_US                   100 non-null    bool  
 6   formerly_incarcerated   100 non-null    bool  
 7   underrepresented_group  100 non-null    bool  
 8   low_income              100 non-null    bool  
 9   convictions             100 non-null    object
 10  list_convictions        100 non-null    object
 11  tech_stack              100 non-null    object
 12  experience_level        100 non-null    object
 13  job_help                100 non-null    bool  
 14  industry_knowledge      100 non-null    bool  
 15

In [37]:
mentees_df['tech_stack'].unique()

array(['Career Development', 'General Programming', 'iOS: Swift',
       'Android: Java', 'Data Science: Python',
       'Web: HTML, CSS, JavaScript'], dtype=object)

In [38]:
# Using static CSV files hosted on GitHub
mentees_df = pd.read_csv("https://raw.githubusercontent.com/BakerJr1904/Altair-visualization-for-underdogsDevs/main/mentees.csv")
mentors_df = pd.read_csv("https://raw.githubusercontent.com/BakerJr1904/Altair-visualization-for-underdogsDevs/main/mentors.csv")

In [39]:
#verifying data retrieval
print("Mentees\n")
mentees_df.info()
print("------------------------------------------------------------------\nMentors\n")
mentors_df.info()

Mentees

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Unnamed: 0             250 non-null    int64 
 1   mentee_intake_id       250 non-null    int64 
 2   first_name             250 non-null    object
 3   last_name              250 non-null    object
 4   email                  250 non-null    object
 5   profile_id             250 non-null    int64 
 6   gender                 250 non-null    object
 7   formerly_incarcerated  250 non-null    bool  
 8   convictions_list       250 non-null    object
 9   veteran_status         250 non-null    bool  
 10  intention              250 non-null    object
 11  tech_stack             250 non-null    object
 12  career_development     250 non-null    object
 13  known_languages        250 non-null    object
 14  date_submitted         250 non-null    object
dtypes: bool(2), in

In [40]:
# Using static CSV files hosted on GitHub
mentees_df = pd.read_csv("https://raw.githubusercontent.com/BakerJr1904/Altair-visualization-for-underdogsDevs/main/mentees.csv")
mentors_df = pd.read_csv("https://raw.githubusercontent.com/BakerJr1904/Altair-visualization-for-underdogsDevs/main/mentors.csv")
# Adding role column to distinguish mentee vs mentor
mentees_df["role"] = "Mentee"
mentors_df["role"] = "Mentor"
# Generating random skill levels
levels = ['Beginner', 'Intermediate', 'Advanced', 'Master']
mentees_df['experience_level'] = [r.choice(levels) for _ in range(len(mentees_df))]
mentors_df['experience_level'] = [r.choice(levels) for _ in range(len(mentors_df))]
# Filtering for relevant columns
mentees_df = mentees_df[["role", "profile_id", "first_name", "last_name", "tech_stack", "experience_level"]]
mentors_df = mentors_df[["role", "profile_id", "first_name", "last_name", "tech_stack", "experience_level"]]
# Concatenating (stacking the mentee rows on top of the mentor rows)
df = pd.concat([mentees_df, mentors_df], ignore_index=True)

In [41]:
# There shouldn't be nulls, but there are "None" strings in tech_stack that shouldn't exist, so we'll remove them
# (.copy() avoids a SettingWithCopyWarning when columns are added to df later)
df = df.loc[df['tech_stack'] != "None"].copy()
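
The filter above removes the literal string "None", which is not the same thing as a missing value. As a small illustration on made-up data (the `sample` frame is not from the real dataset), the two cases can be told apart like this:

```python
import pandas as pd

# Made-up sample: one placeholder string "None" and one genuinely missing value
sample = pd.DataFrame({"tech_stack": ["Backend", "None", None, "Web3"]})

is_placeholder = sample["tech_stack"] == "None"  # the text "None"
is_missing = sample["tech_stack"].isna()         # real nulls (None / NaN)

# Keep only rows that are neither placeholders nor missing
cleaned = sample.loc[~is_placeholder & ~is_missing].copy()
```

Checking both masks before filtering shows whether a column has placeholder strings, real nulls, or a mix of the two.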

In [42]:
# Checking each categorical column for redundant or unexpected values
print(df["role"].unique())
print(df["tech_stack"].unique())
print(df["experience_level"].unique())

['Mentee' 'Mentor']
['Data Science' 'Web3' 'Backend' 'Web Development' 'Android' 'IOS']
['Master' 'Advanced' 'Beginner' 'Intermediate']


In [43]:
# Generating full_name column
df["full_name"] = df["first_name"] + " " + df["last_name"]

In [44]:
# High level data overview
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 177 entries, 0 to 277
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   role              177 non-null    object
 1   profile_id        177 non-null    int64 
 2   first_name        177 non-null    object
 3   last_name         177 non-null    object
 4   tech_stack        177 non-null    object
 5   experience_level  177 non-null    object
 6   full_name         177 non-null    object
dtypes: int64(1), object(6)
memory usage: 11.1+ KB


In [45]:
# actual dataframe
df.head(5)

     role           profile_id  first_name  last_name       tech_stack  experience_level      full_name
0  Mentee  2781024714661041477     Winston       Reed     Data Science            Master   Winston Reed
2  Mentee   275887507625650233     Kathryn      Kelly             Web3          Advanced  Kathryn Kelly
4  Mentee  4387769754891819622        Iker   Robinson          Backend          Advanced  Iker Robinson
5  Mentee  5533477046168175917        Cory     Morgan     Data Science          Advanced    Cory Morgan
6  Mentee  7054492486990369460        Leon       Ross  Web Development            Master      Leon Ross


###Building Graphs Part 1: Mentors and Mentees
Next, we'll build some graphs. Let's start with a lot of information in a single graph, something we really don't want to use, but that showcases information density.

In [46]:
graph = alt.Chart(df).mark_bar().encode(
    x="role",
    y=alt.Y(
        "count()",
        title="Head Count"
        ),
    color="tech_stack",
    column="experience_level"
).properties(width=200).configure_axisX(
    title=None,
    labelFontSize=15
).configure_header(
    labelFontSize=15
).configure_axisY(
    labelFontSize=12
)

graph
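
The bar heights in the chart above come from Altair's `count()` aggregate, computed once per combination of experience level, role, and tech stack. As a sanity check, the same numbers can be reproduced in pandas. The sketch below uses a tiny made-up frame (`toy`) rather than the real data, and `head_count` is just an illustrative column name.

```python
import pandas as pd

# Made-up rows with the same three categorical columns the chart encodes
toy = pd.DataFrame({
    "role": ["Mentee", "Mentee", "Mentor", "Mentee", "Mentor", "Mentee"],
    "tech_stack": ["Backend", "Web3", "Backend", "Backend", "Web3", "Backend"],
    "experience_level": ["Beginner", "Beginner", "Beginner", "Master", "Master", "Beginner"],
})

# One row per bar segment: facet column, x position, and color
head_counts = (
    toy.groupby(["experience_level", "role", "tech_stack"])
    .size()
    .reset_index(name="head_count")
)
print(head_counts)
```

Running the same `groupby` on `df` would give a table to compare against the rendered chart.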

This is obviously a lot to look at. It's hard to compare mentors and mentees across disciplines, even if their skill levels are clustered together. But what if we could choose what subject(s) to compare? With the power of Altair, we can! One way is to use Altair's included checkboxes, but a prettier way is to make our own selection panels, so let's go for it!

In [47]:
# Create selection panel with selection functionality
selection = alt.selection_multi(fields=['tech_stack'])
color_select = alt.condition(selection, alt.Color('tech_stack:N'), alt.value('lightgray'))
selector = alt.Chart().mark_rect().encode(y='tech_stack', color=color_select).add_selection(selection)

In [48]:
# Create main graph
main_graph = alt.Chart().mark_bar().encode(
    x="role",
    y=alt.Y(
        "count()",
        title="Head Count"
        ),
    color="tech_stack",
    column="experience_level"
).transform_filter(selection).properties(height=400, width=150)

In [49]:
# Concatenate with data
full_graph = alt.hconcat(selector, main_graph, data=df)
full_graph.configure_axisX(labelFontSize=15, title=None).configure_header(labelFontSize=15, titleFontSize=20).configure_legend(disable=True)

Ta-da! Now we can filter at will: click one subject to select it, or hold Shift while clicking to select several. Using the same process, we could add a secondary filter for experience level that would condense the graph down to two bars, if we wanted to get really granular!

In [50]:
# Create secondary selection panel with selection functionality
selection2 = alt.selection_multi(fields=['experience_level'])
color_select2 = alt.condition(selection2, alt.Color('experience_level:N'), alt.value('lightgray'))
selector2 = alt.Chart(df).mark_rect().encode(y='experience_level', color=color_select2).add_selection(selection2)

In [51]:
# Add secondary filter to main_graph
main_graph = main_graph.transform_filter(selection2)

In [52]:
granular_graph = alt.hconcat(selector, main_graph, selector2, data=df).configure_legend(disable=True)
granular_graph.configure_axisX(labelFontSize=15).configure_header(labelFontSize=15)

Now we have two working filter panels! But it might make more sense to stack the selected experience levels into a single pair of bars for easier comparison, so let's do that. And while we're at it, since the segments may get very small, let's show each person's full name when we hover over the segmented bar graph. Though this may not seem useful here, the same method applies to point graphs. For instance, in a time series graph that plots mentors' meetings with mentees and the resulting attendance, hovering over a marked absence could show the mentee's name, ID, or other chosen identifier for quick reference.

In [53]:
# Recreate main graph without column clustering
main_graph = alt.Chart().mark_bar().encode(
    x="role",
    y=alt.Y(
        "count()",
        title="Head Count"
        ),
    color="experience_level",
    tooltip="full_name"
).transform_filter(selection).transform_filter(selection2).properties(width=100, height=600)

In [54]:
granular_graph = alt.hconcat(selector, main_graph, selector2, data=df).configure_legend(disable=True).configure_axisY(labelFontSize=15, titleFontSize=15, tickMinStep=1).configure_axisX(labelFontSize=15, titleFontSize=15)
granular_graph

Though this amount of control may not be useful, and may even be detrimental to the user experience, we wanted to go further than necessary in order to showcase the filtering possibilities within Altair. These filters can be added to graphs of any type, and they don't have to change only the data shown; they could also change colors, opacity, and so forth! Graphs can be concatenated and have cross-dependencies ad nauseam, though we should probably rein in their scope and focus them *only* on specific use cases, since more complex visualizations and options will very likely overwhelm those not trained in data science.