# Project Title Normalization

Project titles in the original EC data did not always match those in the NSF data, mostly because titles had been altered for administrative convenience. A manual crosswalk was therefore needed to normalize the titles before the two datasets could be joined.

* [the crosswalk file `inputs/081922_title_crosswalk.csv`](./081922_title_crosswalk.csv) was created **manually** to normalize titles for joining (project ids stay the same)
* the corresponding [output file `outputs/nsf/nsfid_project_title_normed.csv`](../outputs/nsf/nsfid_project_title_normed.csv) provides the crosswalk for the normalized titles used in further analysis

In [1]:
import pandas as pd
from datetime import datetime
from datetime import timedelta

In [2]:
df_nsf = pd.read_json("../outputs/nsf/data_full_dump.json").T

In [3]:
df_nsf.columns

Index(['abstractText', 'estimatedTotalAmt', 'fundsObligatedAmt',
       'fundProgramName', 'id', 'projectOutComesReport', 'publicationResearch',
       'startDate', 'expDate', 'title', 'awardee'],
      dtype='object')

In [4]:
df_nsf.startDate = pd.to_datetime(df_nsf.startDate)
df_nsf.expDate = pd.to_datetime(df_nsf.expDate)

In [5]:
df_nsf.columns

Index(['abstractText', 'estimatedTotalAmt', 'fundsObligatedAmt',
       'fundProgramName', 'id', 'projectOutComesReport', 'publicationResearch',
       'startDate', 'expDate', 'title', 'awardee'],
      dtype='object')

In [6]:
print(df_nsf[df_nsf.title.str.lower().str.contains("office")][['title', 'expDate']].to_markdown())

|         | title                                   | expDate             |
|--------:|:----------------------------------------|:--------------------|
| 1623751 | EarthCube Science Support Office (ESSO) | 2019-10-31 00:00:00 |
| 1928208 | Geosciences EarthCube Community Office  | 2023-01-31 00:00:00 |


In [7]:
df_nsf['title_normed'] = df_nsf.title.str.replace(r'\s\s+', ' ', regex=True).str.lower()
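
The normalization in the cell above collapses runs of two or more whitespace characters into a single space and lowercases the result. A minimal sketch on made-up titles (the sample strings are illustrative, not taken from the NSF dump):

```python
import pandas as pd

# Hypothetical sample titles with irregular spacing and mixed case
titles = pd.Series([
    "Sample  Project:   Phase One",
    "Another Sample Project",
])

# Same expression as the notebook cell: squeeze whitespace runs, then lowercase
normed = titles.str.replace(r"\s\s+", " ", regex=True).str.lower()
print(normed.tolist())
# ['sample project: phase one', 'another sample project']
```

Note that this pattern leaves leading and trailing whitespace in place; only runs of two or more whitespace characters are collapsed.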

In [8]:
df_nsf[['title', 'title_normed']]

|         | title                                              | title_normed                                       |
|--------:|:---------------------------------------------------|:---------------------------------------------------|
| 1324760 | RCN: Building a Sediment Experimentalist Net...    | rcn: building a sediment experimentalist netwo...  |
| 1340233 | EarthCube Test Enterprise Governance: An Agi...    | earthcube test enterprise governance: an agile...  |
| 1340265 | EC3 - Earth-Centered Communication for Cyberin...  | ec3 - earth-centered communication for cyberin...  |
| 1340301 | EarthCube RCN: C4P: Collaboration and Cybe...      | earthcube rcn: c4p: collaboration and cyberinf...  |
| 1343661 | EarthCube Conceptual Design: A Scalable Commu...   | earthcube conceptual design: a scalable commun...  |
| ...     | ...                                                | ...                                                |
| 2126449 | Collaborative Research: EarthCube Capabilities...  | collaborative research: earthcube capabilities...  |
| 2126503 | Collaborative Research: EarthCube Capabilities...  | collaborative research: earthcube capabilities...  |
| 2127606 | Collaborative Research: EarthCube Capabilities...  | collaborative research: earthcube capabilities...  |
| 2126468 | Collaborative Research: EarthCube Capabilities...  | collaborative research: earthcube capabilities...  |


In [9]:
df_crosswalk = pd.read_csv("../inputs/081922_title_crosswalk.csv")
df_crosswalk.columns

Index(['old_name', 'new_name'], dtype='object')
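
Because the crosswalk is maintained by hand, it may be worth checking that every `old_name` actually matches a normalized NSF title before relying on it: an inner merge silently drops crosswalk rows that find no match. The sketch below uses a left merge with `indicator=True` to list such rows. The two small frames are hypothetical stand-ins for `df_crosswalk` and `df_nsf`, and the check itself is a suggestion rather than part of the original workflow.

```python
import pandas as pd

# Hypothetical stand-ins for df_crosswalk and df_nsf
crosswalk = pd.DataFrame({
    "old_name": ["project alpha phase 1", "project beta (renewal)", "project gamma"],
    "new_name": ["project alpha", "project beta", "project gamma"],
})
nsf = pd.DataFrame({
    "id": ["1000001", "1000002"],
    "title_normed": ["project alpha phase 1", "project beta (renewal)"],
})

# A left merge keeps every crosswalk row; the _merge column records match status
checked = crosswalk.merge(
    nsf, left_on="old_name", right_on="title_normed", how="left", indicator=True
)

# Crosswalk rows with no NSF counterpart (likely typos or stale titles)
unmatched = checked.loc[checked["_merge"] == "left_only", "old_name"].tolist()
print(unmatched)
# ['project gamma']
```

An empty list would indicate that every crosswalk entry found a matching NSF title.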

In [10]:
# df_nsf_normed_titles = df_nsf.title.str.replace(r'\s\s+', ' ', regex=True).str.lower().reset_index()

In [11]:
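# Attach the NSF award data to each crosswalk row whose old_name matches a
# normalized NSF title, then replace that title with the manually chosen new_name
df_crosswalk_tmp = \
    df_crosswalk\
        .merge(df_nsf, left_on='old_name', right_on='title_normed')\
        .drop(['old_name', 'title_normed'], axis=1)\
        .rename(columns={'new_name': 'title_normed'})

# index the crosswalk rows by award id
df_crosswalk_tmp.index = df_crosswalk_tmp.id
df_crosswalk_tmp.index.name = None
# prefix crosswalk titles with "*" to mark them as manually normalized
df_crosswalk_tmp.title_normed = df_crosswalk_tmp.title_normed.apply(lambda d: "*"+d)

# crosswalk rows are concatenated last, so keep='last' lets them override
# the original row for the same id
df_nsf_normed = pd.concat([df_nsf, df_crosswalk_tmp]) \
    .drop_duplicates('id', keep='last')

# cleanup
del df_crosswalk_tmp

# strip the "*" marker where present and title-case every normalized title
df_nsf_normed.title_normed = df_nsf_normed.title_normed.apply(lambda d: d[1:].title() if d[0] == '*' else d.title())

# write the id-to-normalized-title crosswalk to disk
df_nsf_normed.sort_values('title_normed')[['title_normed']].to_csv("../outputs/nsf/nsfid_project_title_normed.csv")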

The resulting file is used in all subsequent steps that map data by title.
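
As a rough illustration of how a downstream step might consume the output, the sketch below reads a CSV with the same layout (an unnamed index column of award ids plus `title_normed`) and joins it onto another table by id. The inline CSV text, the `other` frame, and its `amount` column are all hypothetical; in practice the id dtypes on both sides would need to agree.

```python
import io
import pandas as pd

# Hypothetical CSV text mimicking the layout of nsfid_project_title_normed.csv
csv_text = ",title_normed\n1000001,Project Alpha\n1000002,Project Beta\n"
titles = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Hypothetical downstream table keyed by award id
other = pd.DataFrame({"id": [1000002, 1000001], "amount": [5, 7]})

# Look up the normalized title for each id via the crosswalk's index
joined = other.join(titles, on="id")
print(joined["title_normed"].tolist())
# ['Project Beta', 'Project Alpha']
```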