## Determining college campus diversity

- Chapter 2 Wrap-Up
- college_diversity dataset

In [2]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 40

In [4]:
college_div = pd.read_csv('data/college_diversity.csv', index_col='School')

In [6]:
college_div.shape

(10, 1)

In [7]:
college_div

School,Diversity Index
"Rutgers University--Newark Newark, NJ",0.76
"Andrews University Berrien Springs, MI",0.74
"Stanford University Stanford, CA",0.74
"University of Houston Houston, TX",0.74
"University of Nevada--Las Vegas Las Vegas, NV",0.74
"University of San Francisco San Francisco, CA",0.74
"San Francisco State University San Francisco, CA",0.73
"University of Illinois--Chicago Chicago, IL",0.73
"New Jersey Institute of Technology Newark, NJ",0.72
"Texas Woman's University Denton, TX",0.72


- Read in the college dataset
- Filter for just the undergraduate race columns

In [8]:
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college_ugds_ = college.filter(like='UGDS_')
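
As a minimal sketch of what `filter(like='UGDS_')` does, the toy frame below uses made-up values and hypothetical school names; only the column-name pattern mirrors the college dataset. The `like` parameter keeps every column whose label contains the given substring, so a column named plain `UGDS` would be left out.

```python
import pandas as pd

# Toy frame with invented values; school names are placeholders
toy = pd.DataFrame(
    {'CITY': ['Town A', 'Town B'],
     'UGDS': [4000, 11000],
     'UGDS_WHITE': [0.03, 0.59],
     'UGDS_BLACK': [0.94, 0.26]},
    index=pd.Index(['School A', 'School B'], name='INSTNM'))

# like= matches on a substring of the column label
toy_ugds = toy.filter(like='UGDS_')
print(toy_ugds.columns.tolist())
```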

- Many of these colleges have missing values for all their race columns
- We can count all the missing values for each row and sort the resulting Series from the highest to lowest

In [11]:
college_ugds_.isnull().sum(axis=1).sort_values(ascending=False).head()

INSTNM
Excel Learning Center-San Antonio South         9
Philadelphia College of Osteopathic Medicine    9
Assemblies of God Theological Seminary          9
Episcopal Divinity School                       9
Phillips Graduate Institute                     9
dtype: int64
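
The same chain can be traced on a small frame. This is only an illustrative sketch with invented values and placeholder school names: `isnull` marks each missing cell as `True`, `sum(axis=1)` adds those flags across each row, and `sort_values` puts the rows with the most missing values first.

```python
import numpy as np
import pandas as pd

# Invented values; School C is missing every column
toy = pd.DataFrame(
    {'UGDS_WHITE': [0.5, np.nan, np.nan],
     'UGDS_BLACK': [0.3, 0.2, np.nan],
     'UGDS_HISP': [0.2, np.nan, np.nan]},
    index=['School A', 'School B', 'School C'])

# Count missing cells per row, then sort from most to fewest
missing_per_row = toy.isnull().sum(axis=1).sort_values(ascending=False)
print(missing_per_row)
```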

- Now that we have seen the colleges that are missing all their race columns, we can use the `dropna` method to drop every row in which all nine race percentages are missing
- We can then count the remaining missing values

In [12]:
college_ugds_ = college_ugds_.dropna(how='all')
college_ugds_.isnull().sum()

UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64

- There are no missing values left in the dataset
- We will use the greater-than-or-equal DataFrame method, `ge`, to convert each value to a boolean that indicates whether it is at least 15%

In [13]:
college_ugds_.ge(.15)

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,False,True,False,False,False,False,False,False,False
University of Alabama at Birmingham,True,True,False,False,False,False,False,False,False
Amridge University,True,True,False,False,False,False,False,False,True
University of Alabama in Huntsville,True,False,False,False,False,False,False,False,False
Alabama State University,False,True,False,False,False,False,False,False,False
The University of Alabama,True,False,False,False,False,False,False,False,False
Central Alabama Community College,True,True,False,False,False,False,False,False,False
Athens State University,True,False,False,False,False,False,False,False,False
Auburn University at Montgomery,True,True,False,False,False,False,False,False,False
Auburn University,True,False,False,False,False,False,False,False,False


- From here, we can use the `sum` method to count the `True` values for each college

In [14]:
diversity_metric = college_ugds_.ge(.15).sum(axis='columns')
diversity_metric.head()

INSTNM
Alabama A & M University               1
University of Alabama at Birmingham    2
Amridge University                     3
University of Alabama in Huntsville    1
Alabama State University               1
dtype: int64
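
To make the two steps concrete, here is a small sketch on invented data (the school names and percentages are placeholders). `ge(.15)` gives the same result as the `>=` operator, and summing across the columns counts the `True` values because pandas treats `True` as 1 and `False` as 0.

```python
import pandas as pd

# Invented percentages for three placeholder schools
toy = pd.DataFrame(
    {'UGDS_WHITE': [0.60, 0.20, 0.90],
     'UGDS_BLACK': [0.25, 0.20, 0.05],
     'UGDS_HISP': [0.15, 0.60, 0.05]},
    index=['School A', 'School B', 'School C'])

# Boolean frame: True where the share is at least 15%
flags = toy.ge(.15)

# Count the True values in each row
toy_metric = flags.sum(axis='columns')
print(toy_metric)
```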

- To get an idea of the distribution, let's use the `value_counts` method on this Series:

In [15]:
diversity_metric.value_counts()

1    3042
2    2884
3     876
4      63
0       7
5       2
dtype: int64

- Two schools have at least 15% in five different race categories
- Let's sort the `diversity_metric` Series to find out which ones they are:

In [16]:
diversity_metric.sort_values(ascending=False).head()

INSTNM
Regency Beauty Institute-Austin          5
Central Texas Beauty College-Temple      5
Sullivan and Cogliano Training Center    4
Ambria College of Nursing                4
Berkeley College-New York                4
dtype: int64

- Let's look at the raw percentages from these top two schools
- The `.loc` indexer selects rows by their index labels:

In [17]:
college_ugds_.loc[['Regency Beauty Institute-Austin',
                   'Central Texas Beauty College-Temple']]

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Regency Beauty Institute-Austin,0.1867,0.2133,0.16,0.0,0.0,0.0,0.1733,0.0,0.2667
Central Texas Beauty College-Temple,0.1616,0.2323,0.2626,0.0202,0.0,0.0,0.1717,0.0,0.1515


- We can see how the top five schools from the US News list fared with this basic diversity metric:

In [18]:
us_news_top = ['Rutgers University-Newark',
               'Andrews University',
               'Stanford University',
               'University of Houston',
               'University of Nevada-Las Vegas']
diversity_metric.loc[us_news_top]

INSTNM
Rutgers University-Newark         4
Andrews University                3
Stanford University               3
University of Houston             3
University of Nevada-Las Vegas    3
dtype: int64
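
Note that the labels passed to `.loc` must match the index exactly. The school names in `college_diversity` carry a city and state suffix, which is why the list above spells them as they appear in the `college` index. The sketch below uses a two-row stand-in Series (an assumption for illustration, not the full `diversity_metric`) to show that recent pandas versions raise a `KeyError` when any label in the list is missing, while `reindex` returns a missing value for that label instead.

```python
import pandas as pd

# Two-row stand-in for diversity_metric
toy_metric = pd.Series(
    [4, 3],
    index=pd.Index(['Rutgers University-Newark', 'Stanford University'],
                   name='INSTNM'))

# The second label keeps the city suffix, so it is not in the index
wanted = ['Rutgers University-Newark', 'Stanford University Stanford, CA']

try:
    toy_metric.loc[wanted]
    raised = False
except KeyError:
    raised = True

# reindex tolerates unknown labels and fills them with NaN
looked_up = toy_metric.reindex(wanted)
print(raised)
print(looked_up)
```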

## How it works...

- The `dropna` method used above has a `how` parameter, which defaults to the string `any` but can also be set to `all`
    - When set to `any`, it drops rows that contain one or more missing values
    - When set to `all`, it only drops rows where all values are missing
    

- In this case, we conservatively drop rows that are missing all values
- This is because it's possible that some missing values simply represent 0 percent
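
The difference between the two settings can be seen on a small frame. This is a sketch with invented values and placeholder school names: School B is missing one value and School C is missing both.

```python
import numpy as np
import pandas as pd

# Invented values for three placeholder schools
toy = pd.DataFrame(
    {'UGDS_WHITE': [0.5, 0.8, np.nan],
     'UGDS_BLACK': [0.5, np.nan, np.nan]},
    index=['School A', 'School B', 'School C'])

# how='any' (the default) drops rows with at least one missing value
dropped_any = toy.dropna(how='any')

# how='all' drops only rows where every value is missing
dropped_all = toy.dropna(how='all')

print(dropped_any.index.tolist())
print(dropped_all.index.tolist())
```

With `how='all'`, School B is kept even though one of its values is missing, which matches the conservative choice described above.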

## There's more...

- Alternatively, we can find the schools that are least diverse by ordering them by their maximum race percentage:

In [20]:
college_ugds_.max(axis=1).sort_values(ascending=False).head(10)

INSTNM
Dewey University-Manati                               1.0
Yeshiva and Kollel Harbotzas Torah                    1.0
Mr Leon's School of Hair Design-Lewiston              1.0
Dewey University-Bayamon                              1.0
Shepherds Theological Seminary                        1.0
Yeshiva Gedolah Kesser Torah                          1.0
Monteclaro Escuela de Hoteleria y Artes Culinarias    1.0
Yeshiva Shaar Hatorah                                 1.0
Bais Medrash Elyon                                    1.0
Yeshiva of Nitra Rabbinical College                   1.0
dtype: float64
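
As an optional extension that is not part of the original recipe, `idxmax(axis=1)` reports which category holds that maximum for each school. The sketch below uses invented values and placeholder school names.

```python
import pandas as pd

# Invented percentages for three placeholder schools
toy = pd.DataFrame(
    {'UGDS_WHITE': [0.60, 0.20, 0.00],
     'UGDS_BLACK': [0.25, 0.10, 0.00],
     'UGDS_HISP': [0.15, 0.70, 1.00]},
    index=['School A', 'School B', 'School C'])

# Largest single share per school, highest first
top_share = toy.max(axis=1).sort_values(ascending=False)

# Column label of the largest share for each school
dominant = toy.idxmax(axis=1)

print(top_share)
print(dominant)
```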

- We can also determine if any school has all nine race categories exceeding 1%:

In [22]:
(college_ugds_ > .01).all(axis=1).any()

True
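
The chained expression can be read in two stages, sketched here on invented data with placeholder school names: `all(axis=1)` checks whether every column in a row passes the test, and `any` then checks whether at least one row did. Keeping the intermediate boolean Series also makes it possible to list the matching schools.

```python
import pandas as pd

# Invented percentages; only School A exceeds 1% in every column
toy = pd.DataFrame(
    {'UGDS_WHITE': [0.50, 0.99, 0.40],
     'UGDS_BLACK': [0.30, 0.01, 0.60],
     'UGDS_HISP': [0.20, 0.00, 0.00]},
    index=['School A', 'School B', 'School C'])

# True for rows where every category is above 1%
above = (toy > .01).all(axis=1)

# Does any school qualify, and which ones?
print(above.any())
print(above[above].index.tolist())
```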