In [144]:
import glob

import yaml
import pandas as pd
import numpy as np

# Field of Science Report

This report shows the current state of the Fields of Science assigned to Topology projects and informs next steps.

## Current State

Fields of Science are taken from the project files in Topology and mapped with a [mapping file](https://github.com/opensciencegrid/topology/blob/5248f5e2322f711b28ce767a0348564e65907aec/mappings/nsfscience.yaml).

The mapping file on Topology maps the field given in each project YAML file to an NSF-defined Field of Science. These NSF Fields of Science originate in this [pdf](../nsf_fos.pdf). The mapping file does not adhere strictly to this PDF: it has some additional fields and casing inconsistencies.
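
One way to surface those inconsistencies is to compare the mapping's target values against the valid list, first exactly and then ignoring case. The following is a minimal sketch under stated assumptions: `toy_mapping` and `valid_fields` are small hypothetical stand-ins, not the real contents of `nsfscience.yaml` or the PDF, and `audit_mapping` is an illustrative helper name.

```python
# Hypothetical subset of valid fields and a toy mapping (not the real file contents)
valid_fields = ["Computer sciences", "Physics", "Biological sciences"]
toy_mapping = {
    "High Energy Physics": "Physics",
    "Bioinformatics": "Biological Sciences",  # casing differs from the valid list
    "Training": "Training",                   # not a valid field at all
}


def audit_mapping(mapping, valid):
    """Split mapping sources by whether their target is exact, casing-only, or unknown."""
    valid_set = set(valid)
    valid_lower = {v.lower() for v in valid}
    report = {"exact": [], "casing_only": [], "unknown": []}
    for source, target in mapping.items():
        if target in valid_set:
            report["exact"].append(source)
        elif isinstance(target, str) and target.lower() in valid_lower:
            report["casing_only"].append(source)
        else:
            report["unknown"].append(source)
    return report


report = audit_mapping(toy_mapping, valid_fields)
print(report)
```

Entries under `casing_only` can be fixed mechanically, while entries under `unknown` need a human decision.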

### Update all the Fields of Science to adhere to the original NSF PDF

The steps below correct the Field of Science in every project YAML file so that it matches one of the fields specified in the original NSF PDF.

The result of the code below is a [PR](https://github.com/opensciencegrid/topology/pull/3500).

### Update

In addition to the original PDF, the valid fields below take one entry from the newest version of the [NSF Field of Study](https://ncses.nsf.gov/pubs/nsf24300/assets/technical-notes/tables/nsf24300-taba-005.xlsx).


In [2]:
PROJECT_PATH = "/Users/clock/PycharmProjects/topology/projects"
NSF_MAPPING_PATH = "/Users/clock/PycharmProjects/topology/mappings/nsfscience.yaml"

VALID_FIELDS_FROM_PDF = [
    "Agricultural sciences",
    "Biological sciences",
    "Atmospheric sciences",
    "Earth and ocean sciences",
    "Computer sciences",
    "Mathematics",
    "Astronomy",
    "Chemistry",
    "Physics",
    "Other physical sciences",
    "Psychology",
    "Economics",
    "Political science",
    "Sociology",
    "Other social sciences",
    "Aeronautical/Astronautical engineering",
    "Chemical engineering",
    "Civil engineering",
    "Electrical, electronics and communications engineering",
    "Industrial and manufacturing engineering",
    "Materials science engineering",
    "Mechanical engineering",
    "Other engineering",
    "Engineering technology",
    "Health",
    "Education",
    "Humanities",
    "Business management/administration",
    "Communication",
    "Other non-science and engineering/unknown",
    "Multidisciplinary/ interdisciplinary sciences" # Not from original PDF but from the newest NSF Field of Study
]

In [146]:
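def strip_extension(file_name):
    # Drop the final extension; a name without a "." yields an empty string
    return ".".join(file_name.split(".")[:-1])

def file_path_to_file_name(file_path):
    # Keep only the last path component
    return file_path.split("/")[-1]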
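
The two helpers above do by hand what the standard library already offers. The following is a minimal sketch using `pathlib`, shown as an alternative rather than the notebook's method; `project_stem` is a hypothetical helper name and the paths are made-up examples. Unlike the split-based version, it keeps the name of a file that has no extension.

```python
from pathlib import PurePosixPath


def project_stem(file_path):
    """Return the file name without its final extension (hypothetical helper)."""
    return PurePosixPath(file_path).stem


# Illustrative paths only
stem_yaml = project_stem("/topology/projects/MSU_Kerzendorf.yaml")
stem_dotted = project_stem("/topology/projects/cms.org.ksu.yaml")
stem_no_ext = project_stem("/topology/projects/NCSU_2023_Hall")

print(stem_yaml, stem_dotted, stem_no_ext)
```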

In [147]:
project_paths = glob.glob(f"{PROJECT_PATH}/*")

# Build the project DataFrame
PROJECT_VALUE_KEYS = ["ID", "Name", "PIName", "Department", "FieldOfScience"]
project_list = [["File", *PROJECT_VALUE_KEYS]]
for project_path in project_paths:
    
    project_file_name = strip_extension(file_path_to_file_name(project_path))
    
    try:
        with open(project_path, "r") as project_file:
            project = yaml.load(project_file, Loader=yaml.FullLoader)
        project_list.append([project_file_name, *map(lambda x: project.get(x, None), PROJECT_VALUE_KEYS)])

    except Exception as e:
        print(f"Error in {project_file_name}: {e}")

project_df = pd.DataFrame(project_list[1:], columns=project_list[0])


Error in : mapping values are not allowed here
  in "/Users/clock/PycharmProjects/topology/projects/NCSU_2023_Hall", line 1, column 282


In [148]:
# Start the final Field of Science column
project_df["FinalFieldOfScience"] = project_df["FieldOfScience"].apply(lambda x: x if x in VALID_FIELDS_FROM_PDF else np.nan)

print(f"Final Field of Science is still missing for {project_df['FinalFieldOfScience'].isna().sum()}/{len(project_df['FinalFieldOfScience'])} projects.")

Final Field of Science is still missing for 893/1093 projects.


In [149]:
# Add the rows whose field matches a valid field except for casing
VALID_FIELDS_FROM_PDF_LOWER_MAP = {v.lower(): v for v in VALID_FIELDS_FROM_PDF}
project_df["FieldOfScienceIsCorrectCasingIndependent"] = project_df["FieldOfScience"].str.lower().isin(VALID_FIELDS_FROM_PDF_LOWER_MAP) & project_df["FinalFieldOfScience"].isna()
project_df["FinalFieldOfScience"] = project_df["FinalFieldOfScience"].fillna(project_df["FieldOfScience"].apply(lambda x: VALID_FIELDS_FROM_PDF_LOWER_MAP[x.lower()] if isinstance(x, str) and x.lower() in VALID_FIELDS_FROM_PDF_LOWER_MAP else np.nan))

# Matching Stats
print(f"Mapped an additional {project_df['FieldOfScienceIsCorrectCasingIndependent'].sum()} Fields of Science with incorrect casing.")
print(f"Final Field of Science is still missing for {project_df['FinalFieldOfScience'].isna().sum()}/{len(project_df['FinalFieldOfScience'])} projects.")


Mapped an additional 107 Fields of Science with incorrect casing.
Final Field of Science is still missing for 786/1093 projects.
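
The casing step above boils down to a lookup keyed on the lowercased name. The following is a minimal sketch of that idea on plain Python values, assuming a hypothetical three-entry `valid_fields` list; `canonicalize` is an illustrative helper, not part of the notebook.

```python
# Hypothetical subset of the valid fields
valid_fields = ["Computer sciences", "Physics", "Earth and ocean sciences"]
canonical_by_lower = {v.lower(): v for v in valid_fields}


def canonicalize(value):
    """Return the canonical spelling of a field, or None when nothing matches."""
    if not isinstance(value, str):
        return None  # covers missing values such as None
    return canonical_by_lower.get(value.lower())


fixed_casing = canonicalize("Computer Sciences")
already_ok = canonicalize("Physics")
no_match = canonicalize("High Energy Physics")
missing = canonicalize(None)

print(fixed_casing, already_ok, no_match, missing)
```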


In [150]:
# Add the rows that can be mapped via the mapping file
with open(NSF_MAPPING_PATH, "r") as mapping_file:
    NSF_MAPPING = yaml.load(mapping_file, Loader=yaml.FullLoader)

# This flag is computed over all rows, including ones already resolved above,
# so its sum can exceed the number of newly filled rows
project_df["FieldOfScienceViaMappingFile"] = project_df["FieldOfScience"].map(NSF_MAPPING).str.lower().map(VALID_FIELDS_FROM_PDF_LOWER_MAP).isin(VALID_FIELDS_FROM_PDF)
project_df["FinalFieldOfScience"] = project_df["FinalFieldOfScience"].fillna(project_df["FieldOfScience"].map(NSF_MAPPING).str.lower().map(VALID_FIELDS_FROM_PDF_LOWER_MAP))

# Matching Stats
print(f"Mapped an additional {project_df['FieldOfScienceViaMappingFile'].sum()} Fields of Science that were mapped correctly in the original mapping file.")
print(f"Final Field of Science is still missing for {project_df['FinalFieldOfScience'].isna().sum()}/{len(project_df['FinalFieldOfScience'])} projects.")


Mapped an additional 853 Fields of Science that were mapped correctly in the original mapping file.
Final Field of Science is still missing for 182/1093 projects.
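
The mapping-file step chains two lookups: raw value to mapped value, then mapped value to its canonical spelling. The following is a minimal sketch of that chain on plain Python values; `toy_mapping` and the two-entry `valid_fields` list are hypothetical stand-ins for the real mapping file and field list.

```python
# Hypothetical stand-ins for the valid list and the mapping file
valid_fields = ["Physics", "Biological sciences"]
canonical_by_lower = {v.lower(): v for v in valid_fields}
toy_mapping = {
    "High Energy Physics": "Physics",
    "Neuroscience": "Biological Sciences",  # wrong casing in the mapping target
    "Training": "Training",                 # maps to something that is not valid
}


def resolve_via_mapping(raw_value):
    """Look the raw value up in the mapping, then canonicalize the result."""
    mapped = toy_mapping.get(raw_value)
    if not isinstance(mapped, str):
        return None  # raw value is absent from the mapping
    return canonical_by_lower.get(mapped.lower())


hep = resolve_via_mapping("High Energy Physics")
neuro = resolve_via_mapping("Neuroscience")
training = resolve_via_mapping("Training")
unknown = resolve_via_mapping("Data Science")

print(hep, neuro, training, unknown)
```

A raw value can fail at either lookup, which is why some projects remain unresolved after this step.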


In [151]:
print(f"Remaining Fields of Science that are not valid: {len(project_df[project_df['FinalFieldOfScience'].isna()]['FieldOfScience'].unique())}")

Remaining Fields of Science that are not valid: 64
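
Before writing a hand mapping, it can help to see how many projects share each unresolved value. The following is a minimal sketch with `collections.Counter`, assuming hypothetical `(raw field, resolved field)` pairs in place of the DataFrame rows.

```python
from collections import Counter

# Hypothetical (raw field, resolved field) pairs standing in for DataFrame rows
rows = [
    ("Biology", None),
    ("Biology", None),
    ("Training", None),
    ("High Energy Physics", "Physics"),  # already resolved, so not counted
    (None, None),                        # project with no field at all
]

# Count only the raw values that still lack a resolved field
unresolved = Counter(raw for raw, final in rows if final is None)

for raw, count in unresolved.most_common():
    print(f"{count:>3}  {raw!r}")
```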


In [152]:
HAND_MAPPING = {
    'NSF RAPTOR Project / Computer and Information Systems': "Computer sciences", 
    'Training': None, 
    'Cell Biology': "Biological sciences", 
    'Materials Science': "Materials science engineering", 
    'Computer and Information Sciences': "Computer sciences", 
    'Visual Arts': "Humanities", 
    'Electrical, Electronic, and Communications': "Electrical, electronics and communications engineering", 
    'Research Computing': "Computer sciences", 
    'Natural Resources and Conservation': "Agricultural sciences", 
    'Life Sciences - Biological and Biomedical': "Biological sciences", 
    'Other': None, 
    'Multi-Science Community': None, 
    'Data Science': "Computer sciences", 
    'Engineering': None, 
    'Elementary Particle Physics': "Physics", 
    'Appleid Mathematics': "Mathematics", 
    'Astronomy & Astrophysics': "Astronomy", 
    'Computer and Computation Research': "Computer sciences", 
    'Health Sciences': "Health", 
    'Multidisciplinary': None, 
    'Business': "Business management/administration", 
    'Engineering Systems': "Electrical, electronics and communications engineering", 
    'Atmospheric Science and Meteorology': "Atmospheric sciences", 
    'Social, Behavioral & Economic Sciences': "Psychology", # Looked up their project and they are in the Psychology Department
    'Computer and Information Science': "Computer sciences", 
    'Robotics': "Engineering technology", 
    'Integrative Activities': None, 
    'Applied Computer Science': "Computer sciences", 
    'Computer and information services': "Computer sciences", 
    'Physics, Particle Physics, High Energy Physics': "Physics", 
    'Genetics': "Biological sciences", 
    'Neuroscience, biomechanics, microscopy': "Biological sciences", 
    'Biomedical Engineering': "Other engineering", 
    'Earth Science': "Earth and ocean sciences", 
    'Electrical, Electronic, and Communications Engineering': "Electrical, electronics and communications engineering", 
    'Aerospace, Aeronautical, and Astronautical Engineering': "Aeronautical/Astronautical engineering", 
    'Science and Engineering Education': "Education", 
    'Radiological Science': "Health", 
    'Biological and Biomedical Sciences/Biophysics': "Biological sciences",
    'Mathmatics and Statistics': "Mathematics",
    'Physics and radiation therapy': "Health",
    'Social Sciences': "Political science", # Googled the PI: https://polisci.northwestern.edu/people/core-faculty/brian-libgober.html
    'Materials Engineering': "Materials science engineering",
    'Geosciences': "Earth and ocean sciences", 
    'Bioengineering & Biomedical Engineering': "Other engineering", # Direct match, but more descriptive is "Biological sciences"
    'Psychology and Life Sciences': "Psychology",
    'Structural Biology/Biophysics': "Biological sciences",
    'Computer Architecture/Computer Engineering': "Computer sciences", 
    'Mathematics and Statistics': "Mathematics",
    'Condensed Matter and Materials Physics': "Physics", 
    'Life Sciences. Other Sciences': "Biological sciences", # Project is in the Biological Sciences department 
    'Atomic Physics': "Physics", 
    'Behavioral Science': "Other social sciences", 
    'Computer & Information Science & Engineering': "Computer sciences",
    'Statistics and Probability': "Mathematics", 
    'Biology': "Biological sciences", 
    'Electrical Engineering': "Electrical, electronics and communications engineering", 
    None: None, 
    'Machine Learning/AI': "Computer sciences",
    'Materials Research': "Materials science engineering", 
    'Ocean Sciences and Marine Sciences': "Earth and ocean sciences", 
    'Medical (NIH)': "Health", 
    'Other Engineering and Technologies': "Other engineering", 
    'Agricultural Sciences specifically Poultry Science': "Agricultural sciences"
}

In [153]:
# Apply the hand mapping
project_df["FieldOfScienceViaHandMapping"] = project_df["FieldOfScience"].map(HAND_MAPPING).isin(VALID_FIELDS_FROM_PDF)
project_df["FinalFieldOfScience"] = project_df["FinalFieldOfScience"].fillna(project_df["FieldOfScience"].map(HAND_MAPPING))

# Matching Stats
print(f"Mapped an additional {project_df['FieldOfScienceViaHandMapping'].sum()} Fields of Science via the hand mapping.")
print(f"Final Field of Science is still missing for {project_df['FinalFieldOfScience'].isna().sum()}/{len(project_df['FinalFieldOfScience'])} projects.")

Mapped an additional 109 Fields of Science via the hand mapping.
Final Field of Science is still missing for 73/1093 projects.


### Remaining Issues

Copied from the output below, there are 7 remaining invalid Fields of Science to handle:
```
['Training' 'Other' 'Engineering' 'Multi-Science Community' 'Multidisciplinary' 'Integrative Activities' None]
```

To handle "Multi-Science Community" and "Multidisciplinary", we map them to "Multidisciplinary/ interdisciplinary sciences", a field that comes from the most recent NSF Field of Study spreadsheet linked above.

In [154]:
# Handle Multi-Science Community and Multidisciplinary
MULTI_SCIENCE_MAPPING = {
    "Multi-Science Community": "Multidisciplinary/ interdisciplinary sciences",
    "Multidisciplinary": "Multidisciplinary/ interdisciplinary sciences",
}
project_df["FieldOfScienceViaMultiScienceCommunity"] = project_df["FieldOfScience"].isin(list(MULTI_SCIENCE_MAPPING))
project_df["FinalFieldOfScience"] = project_df["FinalFieldOfScience"].fillna(project_df["FieldOfScience"].map(MULTI_SCIENCE_MAPPING))

# Matching Stats
print(f"Mapped an additional {project_df['FieldOfScienceViaMultiScienceCommunity'].sum()} Fields of Science to the multidisciplinary field.")
print(f"Final Field of Science is still missing for {project_df['FinalFieldOfScience'].isna().sum()}/{len(project_df['FinalFieldOfScience'])} projects.")

Mapped an additional 21 Fields of Science to the multidisciplinary field.
Final Field of Science is still missing for 52/1093 projects.


In [155]:
# Handle the remaining projects that have to be mapped by hand

# Mapping from project file to the correct Field of Science
PROJECT_MAPPING = {
    'TG-TRA170047': None, # This is an RCF Staff account
    'Training-ACE-NIAID': None, # This is an RCF Staff account
    'TG-STA110011S': None, # Nothing in project about Field of Science
    'TG-TRA150015': None, # This is an RCF Staff account
    'TG-TRA180032': None, # This is an RCF Staff account
    'TG-TRA120004': None, # This is an RCF Staff account
    'KnowledgeLab': "Computer sciences",
    'TG-TRA220011': None, # This is an RCF Staff account
    'TG-TRA160027': None, # This is an RCF Staff account
    'TG-TRA150018': None, # This is an RCF Staff account
    'NCSU_Staff': None, # This is an RCF Staff account
    'TG-TRA140029': None, # This is an RCF Staff account
    'Syracuse_ITSRC': None, # This is an RCF Staff account
    'TG-STA110014S': None, # This is an RCF Staff account
    'TG-TRA220017': None, # This is an RCF Staff account
    'TG-TRA130003': None, # This is an RCF Staff account
    'TG-TRA090005': None, # This is an RCF Staff account
    'PegasusTraining': None, # This is an RCF Staff account
    'TG-TRA220014': None, # This is an RCF Staff account
    'Rochester_Liu': None, # This is an RCF Staff account
    'CampusWorkshop_Feb2021': None, # This is an RCF Staff account
    'ConnectTrain': None, # This is an RCF Staff account
    'UNI_Staff': None, # This is an RCF Staff account
    'TG-TRA100004': None, # This is an RCF Staff account
    'OSGUserTrainingPilot': None, # This is an RCF Staff account
    'TG-CCR130001': None, # This is an RCF Staff account
    'TG-TRA210040': None, # This is an RCF Staff account
    'Workshop-RMACC21': None, # This is an RCF Staff account
    'UCSD_ResearchIT': None, # This is an RCF Staff account
    'UserSchool2016': None, # This is an RCF Staff account
    'TG-TRA130007': None, # This is an RCF Staff account
    'TG-TRA130011': None, # This is an RCF Staff account
    'UChicago-RCC': None, # Unsure what this is but it shouldn't have a FOS
    'TG-TRA140036': None, # This is an RCF Staff account
    'UWMadison_Negrut': "Engineering technology", 
    'Stanford_Fletcher': "Engineering technology", 
    'GanForAuto': "Electrical, electronics and communications engineering", 
    'ND_Chen': "Multidisciplinary/ interdisciplinary sciences", # They do machine learning on health data 
    'UCSDEngEarthquake': "Physics", # Using the department of the project 
    'UWMadison_Li': "Mechanical engineering", 
    'Clemson_Sarupria': "Chemical engineering", 
    'Stanford_Zia': "Chemical engineering", 
    'UCSD_Shah': "Other engineering", 
    'NDSU_Yellavajjala': "Engineering technology", 
    'FECliu': "Electrical, electronics and communications engineering",  
    'UCDenver_Gaffney': "Mechanical engineering", 
    'SWITCHHawaii': "Electrical, electronics and communications engineering",  
    'Perchlorate': "Engineering technology", 
    'UNL_Stolle': "Mechanical engineering",
    'TAMU_Rathinam': "Mechanical engineering",
    'SourceCoding': "Electrical, electronics and communications engineering",  
}

# Apply the project mapping
project_df["FieldOfScienceViaProjectMapping"] = project_df["File"].isin(PROJECT_MAPPING.keys())
project_df["FinalFieldOfScience"] = project_df["FinalFieldOfScience"].fillna(project_df["File"].map(PROJECT_MAPPING))

# Matching Stats
print(f"Mapped an additional {project_df['FieldOfScienceViaProjectMapping'].sum()} Fields of Science via the per-project mapping.")
print(f"Final Field of Science is still missing for {len(project_df[project_df['FinalFieldOfScience'].isna() & ~(project_df['File'].isin(PROJECT_MAPPING))])}/{len(project_df['FinalFieldOfScience'])} projects.")

Mapped an additional 51 Fields of Science via the per-project mapping.
Final Field of Science is still missing for 1/1093 projects.
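
Each step above only fills rows that earlier steps left empty, so the steps act as an ordered fallback chain. The following is a minimal sketch of that pattern on plain Python values, recording which step resolved each project. `resolve`, the step labels, and the two-entry `valid` set are hypothetical; the `Data Science` and `KnowledgeLab` entries are taken from the mappings above.

```python
def resolve(raw_value, file_name, steps):
    """Try each (label, resolver) in order; return the first non-None result and its label."""
    for label, resolver in steps:
        result = resolver(raw_value, file_name)
        if result is not None:
            return result, label
    return None, None


valid = {"Physics", "Computer sciences"}  # hypothetical subset of the valid fields
hand_mapping = {"Data Science": "Computer sciences"}
project_mapping = {"KnowledgeLab": "Computer sciences"}

steps = [
    ("exact", lambda raw, _file: raw if raw in valid else None),
    ("hand", lambda raw, _file: hand_mapping.get(raw)),
    ("project", lambda _raw, file: project_mapping.get(file)),
]

by_exact = resolve("Physics", "SOLID", steps)
by_hand = resolve("Data Science", "Example_Project", steps)
by_project = resolve("Other", "KnowledgeLab", steps)
unresolved = resolve("Training", "TG-TRA170047", steps)

print(by_exact, by_hand, by_project, unresolved)
```

In this sketch a project deliberately mapped to `None` is indistinguishable from one that no step resolved; the notebook handles that case by excluding files listed in `PROJECT_MAPPING` from the missing count.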


### Do final check

In [156]:
ALL_VALID_FIELDS_OF_SCIENCE = [*VALID_FIELDS_FROM_PDF, None, "Multidisciplinary/ interdisciplinary sciences"]

# Exactly one row is expected to fail this check: _CAMPUS_GRIDS, which is not a project file
if len(project_df[~project_df["FinalFieldOfScience"].isin(ALL_VALID_FIELDS_OF_SCIENCE)]) == 1:
    print("Passed: All Projects have been assigned a valid FOS")
else: 
    print("Failed")

Passed: All Projects have been assigned a valid FOS
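
The check above passes when exactly one row fails, which works but does not say which row. The following is a minimal sketch of a stricter variant that names the offenders and treats both `None` and NaN as an intentionally missing field. The `assignments` dict and both helper names are hypothetical.

```python
import math


def is_missing(value):
    """True for None and for float NaN, the two ways a missing field can appear."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def invalid_assignments(assignments, valid_fields):
    """Return the names whose assigned field is neither missing nor valid."""
    valid = set(valid_fields)
    return sorted(
        name
        for name, field in assignments.items()
        if not is_missing(field) and field not in valid
    )


# Hypothetical final assignments keyed by project file name
assignments = {
    "SOLID": "Physics",
    "TG-TRA170047": None,          # intentionally has no field
    "mars": float("nan"),          # missing value as pandas would store it
    "Example_Project": "physics",  # wrong casing, so not valid
}

offenders = invalid_assignments(assignments, ["Physics", "Computer sciences"])
print(offenders)
```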


In [157]:
project_df[~project_df["FinalFieldOfScience"].isin(ALL_VALID_FIELDS_OF_SCIENCE)]

| Unnamed: 0 | File | ID | Name | PIName | Department | FieldOfScience | FinalFieldOfScience | FieldOfScienceIsCorrectCasingIndependent | FieldOfScienceViaMappingFile | FieldOfScienceViaHandMapping | FieldOfScienceViaMultiScienceCommunity | FieldOfScienceViaProjectMapping |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 834 | _CAMPUS_GRIDS | | | | | | | False | False | False | False | False |


In [158]:
project_df

| Unnamed: 0 | File | ID | Name | PIName | Department | FieldOfScience | FinalFieldOfScience | FieldOfScienceIsCorrectCasingIndependent | FieldOfScienceViaMappingFile | FieldOfScienceViaHandMapping | FieldOfScienceViaMultiScienceCommunity | FieldOfScienceViaProjectMapping |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | FIU_DCunha | | | Cassian D’Cunha | Department of Information Technology | NSF RAPTOR Project / Computer and Information ... | Computer sciences | False | False | True | False | False |
| 1 | MSU_Kerzendorf | | | Wolfgang Kerzendorf | Department of Physics and Astronomy | Astronomy | Astronomy | False | True | False | False | False |
| 2 | cms.org.ksu | 212 | | Yurii Maravin | Physics | High Energy Physics | Physics | False | True | False | False | False |
| 3 | DTWclassifier | 423 | | Luke Remage-Healey | Psychological and Brain Sciences | Neuroscience | Biological sciences | False | True | False | False | False |
| 4 | SOLID | 649 | | Thomas Britton | Physics | Nuclear Physics | Physics | False | True | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1088 | mars | 500 | | Joe Boyd | | High Energy Physics | Physics | False | True | False | False | False |
| 1089 | cms.org.uic | 232 | | Nikos Varelas | Physics | High Energy Physics | Physics | False | True | False | False | False |
| 1090 | EpiBrain | 452 | | David Mogul | Biomedical Engineering | Neuroscience | Biological sciences | False | True | False | False | False |
| 1091 | Auburn_Hauck | | | Ruediger Hauck | Pathobiology | Agricultural Sciences specifically Poultry Sci... | Agricultural sciences | False | False | True | False | False |


In [159]:
# Export
project_df[project_df["FinalFieldOfScience"].isin(ALL_VALID_FIELDS_OF_SCIENCE)].to_excel("ProjectMapping.xlsx", index=False)

In [160]:
import re

# Apply the new mapping 

for project_path in project_paths:

    # Get the project file name, filter out non-YAML files and non-project files
    project_file = file_path_to_file_name(project_path)
    if not project_file.endswith(".yaml") or project_file == "_CAMPUS_GRIDS.yaml":
        continue

    project_file_name = strip_extension(project_file)

    # Grab a dict of this project's row in the project DataFrame
    project_row = project_df[project_df['File'] == project_file_name].to_dict(orient='records')[0]

    # Missing values may be stored as NaN rather than None, so normalize them
    final_fos = project_row["FinalFieldOfScience"]
    if pd.isna(final_fos):
        final_fos = None

    # Grab the raw text and the parsed data before the change
    with open(project_path, "r") as f:
        text = f.read()
    project_data = yaml.load(text, Loader=yaml.FullLoader)

    # Replace the FieldOfScience line, or blank it out when there is no valid field
    pattern = r'FieldOfScience:.*'
    if final_fos is not None:
        replaced_text = re.sub(pattern, f"FieldOfScience: {final_fos}", text)
    else:
        replaced_text = re.sub(pattern, "", text)
    with open(project_path, "w") as f:
        f.write(replaced_text)

    with open(project_path, "r") as f:
        updated_project_data = yaml.load(f, Loader=yaml.FullLoader)

    # Check that the Field of Science has been appropriately updated
    if updated_project_data.get('FieldOfScience', None) != final_fos:
        raise Exception("Field of Science not updated correctly")

    # Check that the content other than the Field of Science is the same
    updated_project_data.pop("FieldOfScience", None)
    project_data.pop("FieldOfScience", None)
    if updated_project_data != project_data:
        raise Exception("Other contents have been changed as a result of the FieldOfScience update")
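
The cell above edits the raw text rather than re-dumping the YAML, which leaves comments and key order untouched. The following is a minimal sketch of the same text-level replacement on an in-memory string, with two assumed variations that are not in the notebook: the pattern is anchored to the start of a line, and a removed field deletes the whole line instead of leaving it blank. `set_field_of_science` and the sample text are hypothetical.

```python
import re

# Match a FieldOfScience key only when it starts a line
FOS_LINE = re.compile(r"^FieldOfScience:.*$", flags=re.MULTILINE)
FOS_LINE_WITH_NEWLINE = re.compile(r"^FieldOfScience:.*\n?", flags=re.MULTILINE)


def set_field_of_science(text, new_value):
    """Replace the FieldOfScience line, or drop it entirely when new_value is None."""
    if new_value is None:
        return FOS_LINE_WITH_NEWLINE.sub("", text)
    # A function replacement avoids backslash interpretation in the new value
    return FOS_LINE.sub(lambda _match: f"FieldOfScience: {new_value}", text)


# Hypothetical project file contents
text = "Department: Physics\nFieldOfScience: High Energy Physics\nPIName: Example PI\n"

updated = set_field_of_science(text, "Physics")
removed = set_field_of_science(text, None)

print(updated)
print(removed)
```

As in the notebook, the result should still be parsed and compared against the original data to confirm that nothing else changed.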
        


In [161]:
# One final check
for project_path in project_paths:
    
    project_file = file_path_to_file_name(project_path)
    if not project_path.endswith(".yaml") or project_file == "_CAMPUS_GRIDS.yaml":
        continue
    
    with open(project_path, "r") as f:
        project_data = yaml.load(f, Loader=yaml.FullLoader)
    if project_data.get("FieldOfScience", None) not in ALL_VALID_FIELDS_OF_SCIENCE:
        raise Exception("Invalid Field of Science in projects yaml")

In [162]:
ALL_VALID_FIELDS_OF_SCIENCE

['Agricultural sciences',
 'Biological sciences',
 'Atmospheric sciences',
 'Earth and ocean sciences',
 'Computer sciences',
 'Mathematics',
 'Astronomy',
 'Chemistry',
 'Physics',
 'Other physical sciences',
 'Psychology',
 'Economics',
 'Political science',
 'Sociology',
 'Other social sciences',
 'Aeronautical/Astronautical engineering',
 'Chemical engineering',
 'Civil engineering',
 'Electrical, electronics and communications engineering',
 'Industrial and manufacturing engineering',
 'Materials science engineering',
 'Mechanical engineering',
 'Other engineering',
 'Engineering technology',
 'Health',
 'Education',
 'Humanities',
 'Business management/administration',
 'Communication',
 'Other non-science and engineering/unknown',
 'Multidisciplinary/ interdisciplinary sciences',
 None,
 'Multidisciplinary/ interdisciplinary sciences']

In [163]:
df = pd.read_excel("nsf24300-taba-005.xlsx", skiprows=3)
df['New major field'].unique()

  warn("Workbook contains no default style, apply openpyxl's default")


array(['Agricultural, animal, plant, and veterinary sciences',
       'Natural resources and conservation',
       'Biochemistry, biophysics, and molecular biology',
       'Bioinformatics, biostatistics, and computational biology',
       'Cell/ cellular biology and anatomy',
       'Ecology, evolutionary biology, and epidemiology',
       'Genetics and genomics', 'Microbiology and immunology',
       'Neurobiology and neurosciences', 'Pharmacology and toxicology',
       'Physiology, oncology, and cancer biology',
       'Biological and biomedical sciences, other', 'Computer science',
       'Computer and information sciences, other',
       'Biological, biomedical, and biosystems engineering',
       'Chemical and petroleum engineering',
       'Civil, environmental, and transportation engineering',
       'Electrical and computer engineering', 'Engineering technologies',
       'Industrial engineering and operations research',
       'Materials and mining engineering', 'Mechanical 