**Prepare Institution data**

The list of Scopus institutions that produced the publications in the main dataframe was matched against institution data from Wikipedia. The result is a clean institution table suitable for extracting the Nigerian institutions' records from the main data and for exploring institutional productivity metrics.
The matching was done manually in Google Sheets. The cleaned name variants are stored as aliases.

In [None]:
# Load the CSV file including all Nigerian institutions and Scopus participating institutions
institutions_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/scopusWiki_affil.csv')

institutions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 310 entries, 0 to 309
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Institution name  310 non-null    object
 1   Abbreviation      310 non-null    object
 2   Funding           310 non-null    object
 3   Institution_type  310 non-null    object
 4   Alias             116 non-null    object
dtypes: object(5)
memory usage: 12.2+ KB


In [None]:
institutions_df.sample(1)

Index,Institution name,Abbreviation,Funding,Institution_type,Alias
43,Adeyemi Federal University of Education,AFUED,Federal,University,Adeyemi Federal University


Now, we proceed to extract the institutions involved in each publication.

In [None]:
# Split the 'Alias' column based on ';' and create a new list column
institutions_df['Alias_Split'] = institutions_df['Alias'].str.split(';')

# Explode the new list column to separate records into rows
all_institutions = institutions_df.explode('Alias_Split')

# Drop the original 'Alias' column and rename 'Alias_Split' to 'Alias'
all_institutions = all_institutions.drop(columns=['Alias']).rename(columns={'Alias_Split': 'Alias'})

# Combine the 'Institution name' and 'Alias' columns into a single list
all_institutions = institutions_df['Institution name'].tolist() + all_institutions['Alias'].dropna().tolist()

In [None]:
print(all_institutions)
len(all_institutions)

['Centre for Management Development', 'Cocoa Research Institute of Nigeria', 'Federal Institute of Industrial Research', 'Forestry Research Institute of Nigeria', 'Institute of Agricultural Research and Training', 'Institute of Archaeology and Museum Studies', 'Institute of Human Virology - Nigeria', 'International Institute of Tropical Agriculture', 'International Livestock Research Institute', 'Lake Chad Research Institute', 'National Agricultural Extension and Research Liaison Services', 'National Animal Production Research Institute', 'National Centre for Agricultural Mechanization', 'National Centre for Energy Research and Development', 'National Centre for Genetic Resources and Biotechnology', 'National Cereals Research Institute', 'National Horticultural Research Institute', 'National Institute for Freshwater Fisheries Research', 'National Institute of Pharmaceutical Research and Development', 'National Research Institute for Chemical Technology', 'National Root Crops Research I

505

In [None]:
'University of Ilorin' in all_institutions

True
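
The split-and-explode step above can be sketched on a toy table. This is a minimal illustration, not the real data: `toy_df` and its rows are assumptions made up for the example, and the extra `.str.strip()` is an optional addition that removes spaces left around `;` separators.

```python
import pandas as pd

# Toy stand-in for institutions_df; the rows are illustrative, not read from the real CSV.
toy_df = pd.DataFrame({
    "Institution name": ["University of Nigeria", "University of Ibadan", "Covenant University"],
    "Alias": [
        "University of Nigeria, Nsukka",
        "College of Medicine, University of Ibadan; University College Hospital, Ibadan",
        None,  # institutions without aliases stay in the list through their official name
    ],
})

# Split on ';', explode to one alias per row, drop missing values, strip stray whitespace.
toy_aliases = (
    toy_df["Alias"]
    .str.split(";")
    .explode()
    .dropna()
    .str.strip()
    .tolist()
)

# Official names first, then every alias.
toy_all = toy_df["Institution name"].tolist() + toy_aliases
print(toy_all)
```

Without the strip, the second Ibadan alias would keep its leading space, which matters for any exact comparison done later.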

**Extraction technique A**

In [None]:
"""
# Disabled alternative, kept for reference only (the enclosing string is never executed)
import re

# Compile a regex pattern to match any of these affiliations
affiliations_pattern = r'\b(' + '|'.join(map(re.escape, all_institutions)) + r')\b'

# Ensure that all entries in 'Authors with affiliations' are strings
df_scopus_main['Authors with affiliations'] = df_scopus_main['Authors with affiliations'].astype(str)

def extract_institutions(affiliations):
    affiliations_involved = re.findall(affiliations_pattern, affiliations, re.IGNORECASE)
    return list(set(affiliations_involved))
"""

**Extraction technique B** (preferred)

In [None]:
# Function to extract the institutions from the 'Authors with affiliations' column
def extract_institutions(affiliations):
    if not isinstance(affiliations, str):  # Non-string input (e.g. NaN) yields no matches
        return []
    affiliations_lower = affiliations.lower()  # Lower-case once instead of once per institution
    found_institutions = {
        institution.strip()  # Strip surrounding whitespace from matched names
        for institution in all_institutions
        if institution.lower() in affiliations_lower
    }
    return list(found_institutions)  # Unique institutions only
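
A small sketch of how this substring matching behaves on one affiliation string. The helper `find_institutions`, the candidate list, and the author "Doe J." are hypothetical and exist only for this illustration. It shows that a short official name also matches inside a longer alias, so both are returned.

```python
def find_institutions(affiliations, candidates):
    """Case-insensitive substring match, mirroring the approach above on toy inputs."""
    if not isinstance(affiliations, str):
        return []
    haystack = affiliations.lower()
    # Sorted only to make the printed order deterministic.
    return sorted({name.strip() for name in candidates if name.lower() in haystack})

# Hypothetical inputs for illustration.
toy_candidates = [
    "University of Nigeria",
    "University of Nigeria, Nsukka",
    "University of Ibadan",
]
toy_affiliation = "Doe J., Department of Physics, University of Nigeria, Nsukka, Nigeria"

matches = find_institutions(toy_affiliation, toy_candidates)
print(matches)

# Missing values (NaN floats) produce an empty list rather than an error.
print(find_institutions(float("nan"), toy_candidates))
```

Because overlapping names are all reported, a later clean-up step is needed to collapse aliases onto one name per institution.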

In [None]:
# Apply the function to the 'Authors with affiliations' column and create a new column named 'Institutions involved'
df_scopus_main['Institutions involved'] = df_scopus_main['Authors with affiliations'].apply(extract_institutions)

In [None]:
df_scopus_main['Institutions involved'].sample(3)

Index,Institutions involved
107139,[]
85,[University of Nigeria]
89695,"[University of Ibadan, University of Maiduguri..."


In [None]:
# Filter records where the 'Institutions involved' column is an empty list
no_match_records = df_scopus_main[df_scopus_main['Institutions involved'].apply(lambda x: isinstance(x, list) and len(x) == 0)]

In [None]:
no_match_records.sample(3)

Index,EID,Authors with affiliations,Year,DOI,Document Type,Open Access,Discipline,Countries involved,Region,Collaboration Type,Institutions involved
66184,2-s2.0-85083234053,"Abdullahi H., Department of Urban and Regional...",2020,10.1088/1755-1315/450/1/012011,Conference paper,All Open Access; Gold Open Access,ENV,"[Nigeria, China, Malaysia]","[Outside Africa, Africa]",International,[]
84869,2-s2.0-85110391637,"Goni M.D., Department of Microbiology and Para...",2021,10.3389/fpubh.2021.594204,Article,All Open Access; Gold Open Access; Green Open ...,MED 1,"[Nigeria, Malaysia]","[Outside Africa, Africa]",International,[]
81199,2-s2.0-85114687376,"Danbatta S.J., Kano State Institute for Inform...",2021,10.1109/ISDFS52919.2021.9486325,Conference paper,,ENGI,"[Nigeria, Turkey]","[Outside Africa, Africa]",International,[]


In [None]:
# Get the number of records without a match
num_no_match = len(no_match_records)

# Print the result
print(f"Number of records without a match: {num_no_match}")

Number of records without a match: 15752
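
To put that count in context, the share of unmatched records can be computed from the totals reported in this notebook. A minimal sketch: both numbers are copied from the printed outputs (15752 unmatched, 122893 entries in the dataframe) rather than recomputed from the data.

```python
# Counts copied from the notebook outputs, not recomputed here.
num_no_match = 15752   # records with an empty 'Institutions involved' list
num_records = 122893   # total entries reported by .info() for the Scopus dataframe

share_no_match = num_no_match / num_records
print(f"Share of records without a match: {share_no_match:.1%}")
```

In the live notebook the same ratio would be `len(no_match_records) / len(df_scopus_main)`.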


In [None]:
# Work on a copy of df_scopus_main
df_scopus_inst = df_scopus_main.copy()

df_scopus_inst.info()

<class 'pandas.core.frame.DataFrame'>
Index: 122893 entries, 0 to 122897
Data columns (total 11 columns):
 #   Column                     Non-Null Count   Dtype 
---  ------                     --------------   ----- 
 0   EID                        122893 non-null  object
 1   Authors with affiliations  122893 non-null  object
 2   Year                       122893 non-null  int64 
 3   DOI                        110074 non-null  object
 4   Document Type              122893 non-null  object
 5   Open Access                58743 non-null   object
 6   Discipline                 122031 non-null  object
 7   Countries involved         122893 non-null  object
 8   Region                     122893 non-null  object
 9   Collaboration Type         122893 non-null  object
 10  Institutions involved      122893 non-null  object
dtypes: int64(1), object(10)
memory usage: 11.3+ MB


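Next, we clean the extracted institution names so that records matched through an alias (for example "University of Nigeria, Nsukka") are counted under the institution's preferred name.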

In [None]:
# explode by institution and group to inspect
df_scopus_inst.explode('Institutions involved').groupby('Institutions involved').size().sort_values(ascending=False)

Institutions involved,Count
University of Nigeria,13078
University of Ibadan,12550
Covenant University,7783
Obafemi Awolowo University,6772
"University of Nigeria, Nsukka",6709
...,...
European University of Nigeria,1
Baba Ahmed University,1
Arthur Javis University,1
Anan University,1
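
The explode-and-count inspection above can be reproduced on a toy frame. The rows in `toy_records` are assumptions for illustration only. The sketch shows why an alias and its official name appear as separate entries before cleaning, and that records with an empty list drop out of the counts.

```python
import pandas as pd

# Toy records; each cell holds the list of institutions matched for one publication.
toy_records = pd.DataFrame({
    "Institutions involved": [
        ["University of Nigeria", "University of Nigeria, Nsukka"],
        ["University of Nigeria"],
        ["University of Ibadan"],
        [],  # no match: explode turns this into NaN, which groupby ignores
    ]
})

toy_counts = (
    toy_records.explode("Institutions involved")
    .groupby("Institutions involved")
    .size()
    .sort_values(ascending=False)
)
print(toy_counts)
```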


In [None]:
# Sample five records whose matched institutions include a 'College' entry and keep their indices for later inspection
college = df_scopus_main[df_scopus_main['Institutions involved'].apply(lambda x: any('College' in inst for inst in x))].sample(5)
college_indices = college.index.tolist()
for institutions in college['Institutions involved']:
  print(institutions)

['University of Ibadan', 'College of Medicine, University of Ibadan']
['University of Ibadan', 'College of Medicine, University of Ibadan']
['University of Ibadan', 'University College Hospital, Ibadan']
['University of Ibadan', 'College of Medicine, University of Ibadan']
['College of Medicine, University of Lagos', 'University of Lagos']


In [None]:
# Map extracted aliases back to each institution's preferred name

# Build a dictionary mapping each preferred institution name to its list of aliases
alias_mapping = {}
for index, row in institutions_df[institutions_df['Alias'].notna()].iterrows():  # Filter for rows with aliases
    preferred_name = row['Institution name'].lower()

    alias_mapping[preferred_name] = []

    aliases = row['Alias'].split(';')
    for alias in aliases:
        alias_mapping[preferred_name].append(alias.strip().lower())

In [None]:
print(alias_mapping)

{'institute of agricultural research and training': ['institute of agricultural research & training'], 'institute of archaeology and museum studies': ['institute of archaeology & museum studies'], 'national agricultural extension and research liaison services': ['national agricultural extension & research liaison services'], 'national centre for energy research and development': ['national centre for energy research & development'], 'national centre for genetic resources and biotechnology': ['national centre for genetic resources & biotechnology'], 'national institute of pharmaceutical research and development': ['national institute of pharmaceutical research & development'], 'national space research and development agency': ['national space research and development', 'national space research & development', 'national space research & development agency'], 'nigerian building and road research institute': ['nigerian building & road research institute'], 'nigerian institute for oceanogra

In [None]:
# Replace aliases with their preferred institution names (case-insensitive) and drop duplicates
def clean_institutions(institutions_list):
    cleaned_list = []
    seen_institutions = set()
    for institution in institutions_list:
        for preferred_name, aliases in alias_mapping.items():
            if institution.lower() in aliases:
                if preferred_name not in seen_institutions:
                    cleaned_list.append(preferred_name)
                    seen_institutions.add(preferred_name)
                break  # Move to the next institution once a match is found
        else:  # No alias matched: keep the institution name itself, lower-cased
            if institution.lower() not in seen_institutions:
                cleaned_list.append(institution.lower())
                seen_institutions.add(institution.lower())
    return cleaned_list
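
An alternative way to express the same clean-up is to invert the mapping once, so each alias points directly at its preferred name and the inner loop becomes a dictionary lookup. This is a sketch under assumptions, not the notebook's method: `toy_alias_mapping`, `alias_to_preferred`, and `clean_institutions_lookup` are hypothetical names, and the toy entries only mimic the shape of `alias_mapping`.

```python
# Toy mapping in the same shape as alias_mapping: preferred name -> list of aliases (lower-case).
toy_alias_mapping = {
    "university of ibadan": [
        "college of medicine, university of ibadan",
        "university college hospital, ibadan",
    ],
    "university of lagos": ["college of medicine, university of lagos"],
}

# Invert once: alias -> preferred name. setdefault keeps the first preferred name
# if an alias is listed under several institutions, like the 'break' in the nested loop.
alias_to_preferred = {}
for preferred, aliases in toy_alias_mapping.items():
    for alias in aliases:
        alias_to_preferred.setdefault(alias, preferred)

def clean_institutions_lookup(institutions_list):
    """Map aliases to preferred names, lower-case the rest, keep first-seen order."""
    cleaned = []
    seen = set()
    for institution in institutions_list:
        key = institution.lower()
        name = alias_to_preferred.get(key, key)
        if name not in seen:
            cleaned.append(name)
            seen.add(name)
    return cleaned

print(clean_institutions_lookup(["University of Ibadan", "College of Medicine, University of Ibadan"]))
```

The lookup does one dictionary access per name instead of scanning every alias list, which can matter when the function is applied to every row of a large dataframe.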

In [None]:
# Apply the clean_institutions function to the 'Institutions involved' column
df_scopus_inst['Institutions involved'] = df_scopus_inst['Institutions involved'].apply(clean_institutions)

In [None]:
# explode by institution and group to inspect
df_scopus_inst.explode('Institutions involved').groupby('Institutions involved').size().sort_values(ascending=False)

Institutions involved,Count
university of ibadan,13122
university of nigeria,13078
covenant university,7799
university of lagos,6843
obafemi awolowo university,6772
...,...
peter university,1
state university of medical and applied sciences,1
ojaja university,1
philomath university,1


In [None]:
# inspect the exact records that matched colleges earlier

cleaned_college_records = df_scopus_inst['Institutions involved'].loc[college_indices]
for institutions in cleaned_college_records:
  print(institutions)

['university of ibadan']
['university of ibadan']
['university of ibadan']
['university of ibadan']
['university of lagos']


In [None]:
# save the scopus_inst back to the Scopus main data
df_scopus_main = df_scopus_inst.copy()

df_scopus_main.info()

<class 'pandas.core.frame.DataFrame'>
Index: 122893 entries, 0 to 122897
Data columns (total 11 columns):
 #   Column                     Non-Null Count   Dtype 
---  ------                     --------------   ----- 
 0   EID                        122893 non-null  object
 1   Authors with affiliations  122893 non-null  object
 2   Year                       122893 non-null  int64 
 3   DOI                        110074 non-null  object
 4   Document Type              122893 non-null  object
 5   Open Access                58743 non-null   object
 6   Discipline                 122031 non-null  object
 7   Countries involved         122893 non-null  object
 8   Region                     122893 non-null  object
 9   Collaboration Type         122893 non-null  object
 10  Institutions involved      122893 non-null  object
dtypes: int64(1), object(10)
memory usage: 15.3+ MB
