## Step 1: Set Up the Dimensions API and Required Imports

### Step 1.1: Install the Dimensions API Client
First, install the `dimcli` package to interact with the Dimensions API.

In [None]:
!pip install dimcli -U --quiet

This command installs the `dimcli` package, which provides the necessary tools to interact with the Dimensions database. The `-U` flag ensures that you install or update to the latest version, and the `--quiet` flag suppresses unnecessary output during installation.

### Step 1.2: Import Required Libraries
In this step, we will import the libraries and modules necessary for using the Dimensions API.

In [2]:
import dimcli
from dimcli.utils import *
import json
import sys
import pandas as pd
import re

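- `dimcli`: The library used to interact with the Dimensions API.
- `dimcli.utils`: Helper functions that ship with `dimcli`.
- `json`: A library for parsing and producing JSON, used here to format lists for queries.
- `sys`: Used for handling system-specific parameters, like checking if the notebook is running in Google Colab.
- `pandas`: A powerful data manipulation library, useful for handling and analyzing datasets.
- `re`: Regular-expression support, used later to match researcher names against the faculty roster.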

### Step 1.3: Log In to the Dimensions API
To use the Dimensions API, you need to authenticate with your personal API key. In this step, you'll input your API key to log in.

In [None]:
print("==\nLogging in..")
# https://digital-science.github.io/dimcli/getting-started.html#authentication
ENDPOINT = "https://app.dimensions.ai"
if 'google.colab' in sys.modules:
  # In Colab, prompt for the key without echoing it to the notebook output
  import getpass
  KEY = getpass.getpass("Input your API key for this session: ")
  dimcli.login(key=KEY, endpoint=ENDPOINT)
else:
  # Outside Colab, paste your API key here before running the cell
  KEY = ""
  dimcli.login(key=KEY, endpoint=ENDPOINT)
dsl = dimcli.Dsl()

- The script checks whether it is running in Google Colab by looking for `google.colab` in `sys.modules`.
- In Colab, `getpass.getpass()` prompts for your API key without displaying it in the notebook output.
- Outside Colab, set `KEY` to your API key before running the cell.
- The `dimcli.login()` function authenticates with the API using the provided key and endpoint.
- `dimcli.Dsl()` creates a query object for the Dimensions Search Language (DSL), which will be used to query the database.
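
One way to avoid typing or hard-coding the key is to read it from an environment variable. The sketch below is not part of the original notebook: the variable name `DIMENSIONS_API_KEY` and the helper `resolve_api_key` are hypothetical, and the returned key would still need to be passed to `dimcli.login(key=..., endpoint=ENDPOINT)`.

```python
import os

# Hypothetical environment variable name; choose whatever fits your setup.
KEY_ENV_VAR = "DIMENSIONS_API_KEY"

def resolve_api_key(env=None):
    """Return the API key from the environment, or raise if it is missing."""
    env = os.environ if env is None else env
    key = env.get(KEY_ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"Set {KEY_ENV_VAR} before logging in.")
    return key

# Example with an explicit mapping instead of the real environment
print(resolve_api_key({KEY_ENV_VAR: "my-secret-key"}))
```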

## Step 2: Load Faculty Data and Prepare Lists of Faculty Members
In this step, we will read an Excel file containing the faculty roster and organize the data into dictionaries and lists that will be used for further analysis.



### Step 2.1: Create a Dictionary of DataFrame Objects
We start by reading the Excel file that contains the faculty data. This file includes data for various years across different sheets.

In [None]:
!wget 'https://raw.githubusercontent.com/The-CEAS-Library/Dimensions-API-Querying/master/Faculty%20Roster_Pillay%20Request_11.06.2024.xlsx' -O Faculty_Roster.xlsx

In [5]:
faculty_data = pd.read_excel('Faculty_Roster.xlsx', sheet_name=['11.01.2018', '11.01.2019', '11.01.2020', '11.01.2021', '11.01.2022', '11.01.2023', '11.01.2024'])

- The `pd.read_excel()` function loads the Excel file, which the `wget` command above saved as `'Faculty_Roster.xlsx'`.
- The `['11.01.2018', '11.01.2019', ...]` argument specifies the list of sheet names to read from the Excel file, which correspond to different years.
- The result is stored in `faculty_data`, a dictionary where each key is a sheet name (e.g., `'11.01.2018'`), and the corresponding value is a DataFrame holding that year's data.
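
Because the sheet names follow a regular pattern, the list can also be generated instead of typed out. This is a small sketch that assumes every sheet is named `11.01.<year>`; the variable `sheet_names` is illustrative and could be passed to `pd.read_excel()` as the list of sheets.

```python
# Build the sheet names from a range of years (assumed pattern: "11.01.<year>")
sheet_names = [f"11.01.{year}" for year in range(2018, 2025)]
print(sheet_names)
```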

### Step 2.2: Initialize Lists for Names
Next, we initialize empty lists to store the faculty members' names in different formats.

In [63]:
# Initialize lists for names
fullName, fName, lName, empNameAgg = [], [], [], []

- `fullName`: A list for storing the full name of each faculty member (last name followed by first name).
- `fName`: A list for storing the first names of faculty members.
- `lName`: A list for storing the last names of faculty members.
- `empNameAgg`: An aggregated list of all unique names across years.

### Step 2.3: Gather and Clean Employee Names by Year
We now extract and clean the employee names for each year and store them in a dictionary. The names are cleaned by removing any periods, and the resulting list is sorted.

In [64]:
# List of years for each date
years = ['2018', '2019', '2020', '2021', '2022', '2023', '2024']

# Gather and clean employee names by year
empNames = {f'{year}': sorted(name.replace('.', '') for name in faculty_data[f'11.01.{year}']['Employee']) for year in years }

- `empNames`: A dictionary where each key is a year (e.g., `'2018'`), and the value is a sorted list of cleaned employee names. Periods are removed with `name.replace('.', '')` because names in the Dimensions database do not contain them.
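
If a sheet contains empty cells, `name.replace()` fails on the resulting non-string values. The following is a minimal sketch of a more defensive cleaning step; the helper `clean_names` and the sample names are hypothetical and not part of the roster.

```python
def clean_names(names):
    """Remove periods and surrounding whitespace, skip non-string entries, and sort."""
    cleaned = []
    for name in names:
        if not isinstance(name, str):
            continue  # e.g. NaN from an empty spreadsheet cell
        cleaned.append(name.replace(".", "").strip())
    return sorted(cleaned)

# Made-up sample values, including a missing entry
sample = ["Smith John A.", float("nan"), "Doe Jane"]
print(clean_names(sample))  # ['Doe Jane', 'Smith John A']
```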

### Step 2.4: Aggregate All Names Across Years
Next, we aggregate all unique faculty names from each year into one sorted list.

In [65]:
# Aggregate all cleaned names and sort unique entries
empNameAgg = sorted(set(name for names in empNames.values() for name in names))

- This step flattens all the name lists across years and removes duplicates using `set()`.
- The result is a sorted list of unique faculty member names stored in `empNameAgg`.
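
The flatten-and-deduplicate step can be seen on a toy dictionary. The names below are made up for illustration.

```python
# Toy stand-in for empNames: two years with one overlapping name
toy_names = {
    "2018": ["Doe Jane", "Smith John A"],
    "2019": ["Doe Jane", "Roe Sam"],
}

# Flatten every yearly list into one set, then sort the unique names
unique_names = sorted({name for names in toy_names.values() for name in names})
print(unique_names)  # ['Doe Jane', 'Roe Sam', 'Smith John A']
```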

### Step 2.5: Extract and Organize Name Components
We will now extract and organize the first names, last names, and full names from all the sheets in `faculty_data`.

In [66]:
# Extract and organize name components across all sheets in faculty_data
for sheet_data in faculty_data.values():
    fName.extend(sheet_data['Preferred First Name'])
    lName.extend(sheet_data['Preferred Last Name'])
    fullName.extend(f"{last} {first}" for first, last in zip(sheet_data['Preferred First Name'], sheet_data['Preferred Last Name']))

- For each sheet in `faculty_data`, we extract the `'Preferred First Name'` and `'Preferred Last Name'` columns.
- We then build the `fullName` entries by joining the last name and the first name, in that order, into one string (e.g., `"Last First"`).
- The `fName`, `lName`, and `fullName` lists are extended with the respective data from each sheet.

### Step 2.6: Remove Duplicates
Finally, we remove duplicates from the `fullName` list and build a separate list of unique last names from `lName`.

In [68]:
# Remove duplicates from fullName and lName
fullName = list(set(fullName))
uniqueLN = list(set(lName))

- `fullName` is converted into a set to remove duplicates, and then converted back to a list. The order of the names is not preserved.
- `uniqueLN` stores a list of unique last names; the original `lName` list is left unchanged.
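
Since `set()` does not keep the original order, an order-preserving alternative may be useful when reproducible ordering matters. This sketch uses `dict.fromkeys()` on made-up names; it is an optional variation, not what the notebook does.

```python
# Made-up names with one duplicate
names = ["Smith John", "Doe Jane", "Smith John", "Roe Sam"]

# dict keys are unique and keep insertion order, so this removes
# duplicates while preserving the first occurrence of each name
unique_in_order = list(dict.fromkeys(names))
print(unique_in_order)  # ['Smith John', 'Doe Jane', 'Roe Sam']
```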

## Step 3: Retrieve Researcher IDs from the Dimensions Database
In this step, we will query the Dimensions database to retrieve information about researchers based on their last names and the research organization they belong to (in this case, the University of Cincinnati).

### Step 3.1: Define the GRID ID for the University of Cincinnati
To query researchers at the University of Cincinnati, we first define the GRID ID for the university.

In [11]:
GRIDID = 'grid.24827.3b'

- `GRIDID` holds the GRID identifier for the University of Cincinnati, which the Dimensions database uses to identify the organization. You will use this ID in the query to filter results by the university.

### Step 3.2: Create the Query to Search for Researchers
Next, we define the query that will search for researchers based on their last names and the university's GRID ID.

In [12]:
q = """search researchers
        where research_orgs = "{}"
        and last_name in {}
        return researchers"""

- This query template searches for researchers who belong to a given research organization and whose last names appear in a given list. The two `{}` placeholders are filled in the next step with `GRIDID` and the `uniqueLN` list of unique faculty last names.

### Step 3.3: Execute the Query and Retrieve the Results
We execute the query and retrieve the results from the Dimensions API. The `query_iterative` method fetches the data in successive batches, because a single query returns at most 1000 records.

In [None]:
# run our query
researchers_json = dsl.query_iterative(q.format(GRIDID, json.dumps(uniqueLN)))

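- `q.format(GRIDID, json.dumps(uniqueLN))`: This fills in the query template with `GRIDID` and the list of last names. `json.dumps(uniqueLN)` renders the Python list as a double-quoted list, such as `["Doe", "Smith"]`, which is the form used in the query.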
- The query results are stored in the `researchers_json` variable.
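
To check a query before sending it, it can help to print the formatted string. The sketch below repeats the template from Step 3.2 and fills it with two made-up last names; `query_template`, `sample_last_names`, and `formatted_query` are illustrative names.

```python
import json

query_template = """search researchers
        where research_orgs = "{}"
        and last_name in {}
        return researchers"""

# Illustrative last names, not taken from the roster
sample_last_names = ["Doe", "Smith"]

# json.dumps turns the Python list into a double-quoted list literal
formatted_query = query_template.format("grid.24827.3b", json.dumps(sample_last_names))
print(formatted_query)
```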

### Step 3.4: Convert the Results to a DataFrame
The results returned by the query are in JSON format. To make them easier to work with, we convert them into a DataFrame using the `as_dataframe()` method.

In [106]:
#convert the information we retrieved into a DataFrame object
researchers = researchers_json.as_dataframe()

- This creates a DataFrame called `researchers` that contains the details of the researchers, including their names, IDs, and other basic information.

### Step 3.5: Create a Full Name Column
To facilitate the matching of faculty members, we create a `full_name` column by combining the `last_name` and `first_name` columns. This allows us to directly compare full names between the faculty data and the researchers in the Dimensions database.

In [None]:
# Create the full_name column
researchers.insert(researchers.columns.get_loc('last_name') + 1,
                   'full_name',
                   researchers['last_name'] + ' ' + researchers['first_name'])

- The `full_name` column is inserted after the `last_name` column.
- The `full_name` is constructed by concatenating the `last_name` and `first_name` for each researcher, creating a single string that represents their full name.

### Step 3.6: Apply the Filter to Match Faculty Members
After creating the `full_name` column, we proceed to filter the `researchers` DataFrame by comparing each researcher’s full name to the faculty members' names. This step ensures that only the relevant faculty members are retained, accounting for potential variations in how names may be stored.

In [108]:
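# Combine both name lists once so the set is not rebuilt for every row
candidate_names = set(empNameAgg + fullName)

# Create a mask to filter the DataFrame
mask = researchers.apply(
    lambda row: (
        # Skip rows whose full_name is missing (NaN), then look for a roster
        # name that starts with "last_name first_name"
        isinstance(row['full_name'], str)
        and any(
            re.search(rf"^{re.escape(row['full_name'])}\b", name)
            for name in candidate_names
        )
    ),
    axis=1
)

- `candidate_names` merges `empNameAgg` and `fullName` into a single set, which removes duplicates.
- The `apply()` function with `axis=1` runs the filter logic on each row of the `researchers` DataFrame.
- Rows without a valid `full_name` string (for example, when `first_name` is missing) are excluded.
- `re.search()` checks whether any candidate name starts with the researcher's `full_name` (last name followed by first name). The `\b` word boundary prevents partial matches such as "John" matching "Johnny", and `re.escape()` keeps special characters in a name from being read as regex syntax.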
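
The matching rule can be tried in isolation on a few made-up names. This sketch assumes roster names are stored as "Last First" with an optional middle initial; the helper `matches_roster` is hypothetical.

```python
import re

def matches_roster(full_name, roster_names):
    """True if any roster name starts with full_name followed by a word boundary."""
    pattern = re.compile(rf"^{re.escape(full_name)}\b")
    return any(pattern.search(name) for name in roster_names)

# Made-up roster entries
roster = ["Doe Jane", "Smith John A"]

print(matches_roster("Smith John", roster))    # True: a middle initial may follow
print(matches_roster("Smith Jo", roster))      # False: "Jo" is only part of "John"
print(matches_roster("Smith Johnny", roster))  # False: no roster name starts with it
```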

### Step 3.7: Apply the Filter and Remove Duplicates
Finally, we apply the filter to the DataFrame and drop any duplicate entries based on the researcher ID, last name, and first name.

In [109]:
# Filter the DataFrame
filtered_df = researchers[mask].drop_duplicates(subset=['id', 'last_name', 'first_name'], ignore_index=True)

- `researchers[mask]` filters the rows that match the condition specified in the mask.
- `drop_duplicates()` removes rows that repeat the same combination of researcher ID, last name, and first name, so each such combination is listed only once.
- `ignore_index=True` resets the index after dropping duplicates.

In [None]:
filtered_df

## Step 4: Retrieve and Sort Information from the Filtered DataFrame

### Step 4.1: Group IDs by Name
Grouping the IDs by the last-name, first-name, and full-name columns combines most cases where one person has multiple IDs.

In [None]:
researchers_combined = filtered_df.groupby(['last_name', 'first_name', 'full_name'], as_index=False).agg({
                                            'id': list  # Combine ids into a list
                                          })

In [None]:
# Display the result
researchers_combined
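
The grouping can be illustrated on a toy DataFrame. The sketch below requires `pandas`; the names and IDs are made up, and the `multi_id` filter is an optional extra that lists people who ended up with more than one ID.

```python
import pandas as pd

# Toy stand-in for filtered_df: one person appears with two IDs
toy = pd.DataFrame({
    "last_name": ["Doe", "Doe", "Roe"],
    "first_name": ["Jane", "Jane", "Sam"],
    "full_name": ["Doe Jane", "Doe Jane", "Roe Sam"],
    "id": ["ur.001", "ur.002", "ur.003"],
})

# Collect the IDs of each person into a list, one row per person
combined = toy.groupby(
    ["last_name", "first_name", "full_name"], as_index=False
).agg({"id": list})

# Keep only the people who have more than one ID, for manual review
multi_id = combined[combined["id"].apply(len) > 1]
print(combined)
print(multi_id)
```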

By indexing or slicing with `.loc` and `.iloc`, we can manually combine rows that the grouping missed, or select only the information associated with a single person (including alternate names).

In [None]:
# Example: combine two rows that are the same person listed under different first names
# Select one row (index label 4) as a one-row DataFrame
single_researcher = researchers_combined.loc[[4], ['first_name', 'last_name', 'id']]

# Append the IDs from the row at position 5 to the list in 'id'
# (the indices 4 and 5 are examples; adjust them to the rows you want to merge)
single_researcher['id'].iloc[0].extend(researchers_combined['id'].iloc[5])

In [None]:
# Display the updated single_researcher
single_researcher

Some people have multiple researcher IDs, so you need to verify manually whether those IDs belong to the same person or to different people.


## Step 5: Retrieve Publications by the Researchers in `filtered_df`

To search for every researcher in `filtered_df`, collect all researcher IDs:

In [None]:
# Convert the column to a plain list so it can be passed to json.dumps() in a query
researchers_ids = filtered_df['id'].tolist()
researchers_ids

Alternatively, create a list of the specific IDs you wish to search:

In [None]:
import itertools
r_id = list(itertools.chain.from_iterable(single_researcher['id']))
r_id

In [None]:
# Adjust the years here if you want to look at years other than the earlier list (2018-2024)
#years = []

# A single query returns at most 1000 records, so keep the list of researcher IDs
# short enough that the results stay under that limit
q = """search publications
            where researchers.id in {}
            and year in {}
            return publications
            limit 1000"""

# Change r_id, or pass a different list of IDs, to choose whose publications are retrieved.
# `years` holds strings, so they are converted to integers for the numeric `year` field.
data = dsl.query(q.format(json.dumps(r_id), json.dumps([int(y) for y in years])))
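
When the list of researcher IDs is long, one option is to split it into smaller batches and run one query per batch, so that each result set is more likely to stay under the 1000-record limit. The following is only a sketch: the helper `chunk_ids`, the batch size of 2, and the IDs are made up, and each batch would still need to be inserted into the query with `q.format(...)` and sent with `dsl.query(...)`.

```python
import json

def chunk_ids(ids, size):
    """Yield consecutive slices of ids with at most `size` items each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# Made-up researcher IDs for illustration
sample_ids = ["ur.001", "ur.002", "ur.003", "ur.004", "ur.005"]

# Print each batch in the form it would take inside the query
for batch in chunk_ids(sample_ids, 2):
    print(json.dumps(batch))
```

A suitable batch size depends on how many publications each researcher has, so it would need to be tuned against real results.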

In [None]:
data_df = data.as_dataframe()

In [None]:
data_df

In [None]:
data_df[["id", "title", "type", "year"]]