# Supplemental Notebook: Exploring Team Size

In the UMETRICS semester-level data, some of the decisions about measurement and aggregation have already been made for you. However, you might want to carry out this process yourself, choosing different measures as you consider the assumptions you are making and the phenomena you want to study.

This notebook covers some of the basics of how that might be done using the individual grant-level and person-level UMETRICS data.

In [None]:
# Switching off warnings
options(warn = -1)

# Database interaction imports
suppressMessages(library(odbc))

# for data manipulation/visualization
suppressMessages(library(tidyverse))

# scaling data, calculating percentages, overriding default graphing
suppressMessages(library(scales))

# add weights to data
suppressMessages(library(survey))

# Switching warnings back on
options(warn = 0)

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

Let's start with a very basic measure of team size. We can count how many people were on a grant in a given time period by counting the distinct employee numbers associated with each unique award number.

Note that we are only going to be looking at the Fall semester of 2014 as an example.

In [None]:
qry <- "
SELECT unique_award_number, count(distinct emp_number) as team_size
FROM ds_iris_umetrics.dbo.core_employee
WHERE period_end_date BETWEEN '2014-09-01' AND '2015-01-01'
GROUP BY unique_award_number
"

team_size_by_award <- dbGetQuery(con, qry)
head(team_size_by_award)
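To see what this query computes, the same aggregation can be sketched in R on a small made-up table. The rows below are hypothetical and only mimic the shape of `core_employee`; `toy_employee` and `toy_team_size` are illustrative names, not part of the UMETRICS data.

```r
library(dplyr)

# Hypothetical rows: an employee can appear several times on the same award
toy_employee <- tibble(
  unique_award_number = c("A1", "A1", "A1", "A2", "A2"),
  emp_number          = c("e1", "e1", "e2", "e3", "e3")
)

# Rough equivalent of COUNT(DISTINCT emp_number) ... GROUP BY unique_award_number.
# na.rm = TRUE mirrors SQL, where COUNT(DISTINCT ...) ignores NULL values.
toy_team_size <- toy_employee %>%
  group_by(unique_award_number) %>%
  summarise(team_size = n_distinct(emp_number, na.rm = TRUE), .groups = "drop")

toy_team_size
```

Repeated rows for the same employee on the same award are counted once, which is why the count is of distinct employee numbers rather than of rows.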

Next, we can get every employee who appears in the data during this time period, keeping only the distinct combinations of employee number and award number.

In [None]:
qry <- "
SELECT distinct emp_number, unique_award_number
FROM ds_iris_umetrics.dbo.core_employee
WHERE period_end_date BETWEEN '2014-09-01' AND '2015-01-01'
"

core_employee <- dbGetQuery(con, qry)
head(core_employee)

Now we can see the team size for each employee by the award they were on. 

In [None]:
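# Join on the award number explicitly rather than relying on the default
core_employee %>% 
    left_join(team_size_by_award, by = "unique_award_number") %>% 
    head()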

Because the data contain NA values, some cleaning would be needed before using this in an analysis. For now, consider what we found: for each employee and each award that employee was on during Fall 2014, we can see how many total employees were on the same award.
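As a starting point for that cleaning, here is a minimal sketch of how you might check where the missing values are. It runs on a hypothetical table, `toy_joined`, shaped like the joined result above. Dropping rows with missing identifiers is only one possible choice and may not suit every analysis.

```r
library(dplyr)

# Hypothetical joined result with some missing values
toy_joined <- tibble(
  emp_number          = c("e1", "e2", NA, "e4"),
  unique_award_number = c("A1", "A1", "A2", NA),
  team_size           = c(2L, 2L, 1L, NA)
)

# Count the missing values in each column
na_counts <- toy_joined %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# One option: keep only rows where both identifiers are present
toy_clean <- toy_joined %>%
  filter(!is.na(emp_number), !is.na(unique_award_number))

na_counts
toy_clean
```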

What if we only wanted to see how many graduate students were on the team, and use that as the measure of team size? We can do this too by adjusting the SQL query from before.

In [None]:
qry <- "
SELECT unique_award_number, count(distinct emp_number) as team_size
FROM ds_iris_umetrics.dbo.core_employee
WHERE period_end_date BETWEEN '2014-09-01' AND '2015-01-01'
    AND umetrics_occupational_class = 'Graduate Student'
GROUP BY unique_award_number
"

grad_team_by_award <- dbGetQuery(con, qry)
head(grad_team_by_award)

In [None]:
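# Awards with no graduate students have no row in grad_team_by_award,
# so team_size is NA for employees on those awards
core_employee %>% 
    left_join(grad_team_by_award, by = "unique_award_number") %>% 
    head()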
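A left join leaves `team_size` as NA for any award that has no matching row in the graduate student counts. If you would rather treat those awards as having zero graduate students, one option is to replace the NA values after the join. The sketch below uses hypothetical tables (`toy_core`, `toy_grad_team`) to show the idea; whether zero is the right value is an assumption you would need to check against your own data.

```r
library(dplyr)

# Hypothetical employee-award pairs
toy_core <- tibble(
  emp_number          = c("e1", "e2", "e3"),
  unique_award_number = c("A1", "A1", "A2")
)

# Hypothetical graduate student counts: award A2 has no graduate students,
# so it does not appear here at all
toy_grad_team <- tibble(
  unique_award_number = "A1",
  team_size           = 1L
)

# Join, then treat a missing count as zero graduate students
toy_result <- toy_core %>%
  left_join(toy_grad_team, by = "unique_award_number") %>%
  mutate(team_size = coalesce(team_size, 0L))

toy_result
```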

How might you refine this further? What other ways are there to define team size?
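As one possible direction, you could count team members separately for each occupational class instead of filtering to a single class. The sketch below does this on a hypothetical table. Only `'Graduate Student'` is taken from the query above; the `"Faculty"` label and all of the rows are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows; the "Faculty" class label is an assumption
toy_employee <- tibble(
  unique_award_number         = c("A1", "A1", "A1", "A2"),
  emp_number                  = c("e1", "e2", "e3", "e4"),
  umetrics_occupational_class = c("Faculty", "Graduate Student",
                                  "Graduate Student", "Faculty")
)

# Count distinct people per award and class, then spread the classes
# into columns, filling in 0 where an award has nobody in a class
team_by_class <- toy_employee %>%
  distinct(unique_award_number, emp_number, umetrics_occupational_class) %>%
  count(unique_award_number, umetrics_occupational_class, name = "n_people") %>%
  pivot_wider(names_from = umetrics_occupational_class,
              values_from = n_people,
              values_fill = 0L)

team_by_class
```

This gives one row per award and one column per class, so several definitions of team size can be compared side by side.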