**Data Wrangling - Extraction & Cleaning**

The full dataset contains 122,898 documents, but not all of the authors who published them are affiliated with a Nigerian institution.

Since the Scopus data does not specifically provide a single field for authors' country of affiliation, which would help separate authors affiliated with Nigerian institutions from others, this study explored the "Authors with affiliations" field in the Scopus data. Each record in this field references the country in which the authors' institutions are located, making a delineation possible. A regular expression was written to extract records containing a particular country name as a string (e.g. 'nigeria') within the "Authors with affiliations" field.
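As a minimal sketch of this kind of string match, the example below filters a handful of made-up affiliation strings (not taken from the dataset) for the word "nigeria". The variable names are illustrative assumptions. The word boundaries matter here because "Niger" is a prefix of "Nigeria".

```python
import re

# Made-up affiliation strings standing in for the "Authors with affiliations" field
affiliations = [
    "Doe J., Department of Physics, Example University, Lagos, Nigeria",
    "Roe R., Example Institute, Accra, Ghana",
    "Poe P., Example College, Niamey, Niger",
    "Moe M., Example University, Kano, NIGERIA; Loe L., Example Lab, London, United Kingdom",
]

# \b anchors the match to whole words, so "Niger" alone is not counted as "Nigeria"
nigeria_pattern = re.compile(r"\bnigeria\b", re.IGNORECASE)

# Keep only the records that mention Nigeria somewhere in the affiliation string
nigerian_records = [text for text in affiliations if nigeria_pattern.search(text)]

for text in nigerian_records:
    print(text)
```

Only the first and last strings are kept; the Niamey record is not, because "Niger" does not satisfy the whole-word match for "nigeria".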

This approach alone proved complex and limited, since it does not allow collaboration to be explored accurately. The limitation was mitigated, and the approach simplified, using one of the lists provided alongside the search results, which gives the 160 countries/regions in which all the authors' affiliated institutions are located. This list was copied into a CSV file and then used to match the countries involved correctly.

To enrich the data for collaboration analysis, the countries were categorized into two groups based on their regions, which were later used to determine local, regional and international collaboration.
- Africa
- Outside Africa

**Prepare country data for Collaboration Analysis**

**Country Affiliation** - Data Extraction

In [None]:
import re

import pandas as pd

# Load the CSV file with countries and their regions
country_continent_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/scopus-country-region.csv')

# Keep only the columns needed for the affiliation analysis
df_scopus_main = df_scopus_merged[['EID', 'Authors with affiliations', 'Year', 'DOI', 'Document Type', 'Open Access', 'Discipline']].copy()


In [None]:
country_continent_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160 entries, 0 to 159
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Country  160 non-null    object
 1   Region   160 non-null    object
dtypes: object(2)
memory usage: 2.6+ KB


In [None]:
# Convert the 'Country' column to a list
countries = country_continent_df['Country'].tolist()

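# Exclude "Benin" and "Niger" from the main list: both names are ambiguous
# (e.g. University of Benin, Niger State) and are matched separately with context
countries = [country for country in countries if country not in ["Benin", "Niger"]]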

# Compile a regex pattern to match any of these countries
country_pattern = r'\b(' + '|'.join(map(re.escape, countries)) + r')\b'

# Ensure that all entries in 'Authors with affiliations' are strings
df_scopus_main['Authors with affiliations'] = df_scopus_main['Authors with affiliations'].astype(str)
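One property of an alternation pattern like this is worth checking against the actual country list: Python's `re` tries alternatives from left to right, so a shorter name listed before a longer name that starts with it can win. The sketch below uses a hypothetical three-name list (whether the real CSV contains both "Guinea" and "Guinea-Bissau" is an assumption) and shows one possible remedy, sorting names longest first. `build_pattern` is an illustrative helper, not part of the notebook.

```python
import re

# Hypothetical subset of country names, in alphabetical order
names = ["Guinea", "Guinea-Bissau", "Nigeria"]

def build_pattern(country_names):
    """Build a whole-word alternation with longer names tried first."""
    ordered = sorted(country_names, key=len, reverse=True)
    return r"\b(" + "|".join(map(re.escape, ordered)) + r")\b"

# Made-up affiliation string
text = ("Doe J., Example Institute, Bissau, Guinea-Bissau; "
        "Roe R., Example University, Lagos, Nigeria")

# Alphabetical order: "Guinea" is tried first and matches inside "Guinea-Bissau"
naive_pattern = r"\b(" + "|".join(map(re.escape, names)) + r")\b"
naive_hits = re.findall(naive_pattern, text)

# Longest-first order: the full name "Guinea-Bissau" is matched
ordered_hits = re.findall(build_pattern(names), text)

print(naive_hits)    # ['Guinea', 'Nigeria']
print(ordered_hits)  # ['Guinea-Bissau', 'Nigeria']
```

Names that merely end with another name (for example "South Sudan" and "Sudan") are not affected, because the longer name starts earlier in the text and is therefore found first.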

In [None]:
# create an exclusion list to provide better context for matching Niger as a country
notNigerCountry = ["Niger Delta", "Niger State", "Niger-State", "Abuja, Niger",
                   "River Niger", "+234, Niger", "Minnam, Niger", "Niger-state",
                   "Niger River","Niger delta","Central Bank of Nigeria, Niger"]

# create an inclusion list to provide better context for matching Niger as a country
isNigerCountry = ["Niamey"]

# create an inclusion list to provide better context for matching Benin as a country
isBeninCountry = ["Cotonou", "Porto-Novo", "Parakou", "Abomey", "Djougou", "Bohicon",
                  "Kandi", "Natitingou", "Ouidah", "Lokossa"]

# create an exclusion list to provide better context for matching Benin as a country
notBeninCountry = ["University of Benin", "Edo State", "Ekpoma"]

**Simple RegEX**

In [None]:
# Function to extract countries using regex
# (simple version, kept for reference; superseded by the advanced version below)
# def extract_countries_from_address(address):
#     countries_involved = re.findall(country_pattern, address, re.IGNORECASE)
#     return list(set(countries_involved))

**Advanced RegEx that disambiguates Benin and Niger Republic correctly**

In [None]:
def extract_country(text, target_word, exclusion_list, inclusion_list):
    """Return whole-word matches of target_word in text, subject to context checks.

    No match is returned if the text contains any exclusion phrase, or if an
    inclusion list is given and the text contains none of its phrases.
    """
    def contains_any(phrases):
        return any(re.search(r"\b" + re.escape(phrase) + r"\b", text) for phrase in phrases)

    # Check for exclusion patterns
    if contains_any(exclusion_list):
        return []

    # Check for inclusion patterns (optional)
    if inclusion_list and not contains_any(inclusion_list):
        return []

    return re.findall(r"\b" + re.escape(target_word) + r"\b", text)

In [None]:
def extract_countries_from_address(address):
    # Pre-filter Benin and Niger
    benin_matches = extract_country(address, "Benin", notBeninCountry, isBeninCountry)
    niger_matches = extract_country(address, "Niger", notNigerCountry, isNigerCountry)

    # General regex extraction for other countries
    other_countries = re.findall(country_pattern, address, re.IGNORECASE)

    # Combine results
    countries_involved = benin_matches + niger_matches + other_countries
    return list(set(countries_involved))  # Ensure unique countries
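To illustrate how the context lists behave, here is a condensed, self-contained restatement of the logic above applied to made-up affiliation strings. The shortened lists, the helper names (`has_any`, `match_with_context`, `countries_in`) and the sample strings are assumptions for illustration only; the notebook's own functions and full lists remain the reference.

```python
import re

# Shortened context lists, for illustration only
not_niger = ["Niger State", "Niger Delta"]
is_niger = ["Niamey"]
not_benin = ["University of Benin", "Edo State"]
is_benin = ["Cotonou", "Porto-Novo"]
other_countries = ["Nigeria", "Ghana", "South Africa"]
other_pattern = r"\b(" + "|".join(map(re.escape, other_countries)) + r")\b"

def has_any(text, phrases):
    """True if any phrase occurs in text as whole words."""
    return any(re.search(r"\b" + re.escape(p) + r"\b", text) for p in phrases)

def match_with_context(text, word, exclude, include):
    """Match word only if no exclusion phrase and at least one inclusion phrase is present."""
    if has_any(text, exclude) or (include and not has_any(text, include)):
        return []
    return re.findall(r"\b" + re.escape(word) + r"\b", text)

def countries_in(address):
    found = (match_with_context(address, "Benin", not_benin, is_benin)
             + match_with_context(address, "Niger", not_niger, is_niger)
             + re.findall(other_pattern, address, re.IGNORECASE))
    return sorted(set(found))

samples = {
    "nigerian_benin": "Doe J., University of Benin, Benin City, Edo State, Nigeria",
    "benin_republic": "Roe R., Example Universite, Cotonou, Benin; Poe P., Example University, Ibadan, Nigeria",
    "niger_state": "Moe M., Example Polytechnic, Minna, Niger State, Nigeria",
    "niger_republic": "Loe L., Example Universite, Niamey, Niger",
    "mixed": "Doe J., University of Benin, Benin City, Nigeria; Roe R., Example Universite, Cotonou, Benin",
}

results = {label: countries_in(text) for label, text in samples.items()}
for label, countries_found in results.items():
    print(label, countries_found)
```

The first four cases resolve as intended. The "mixed" case returns only Nigeria: because the exclusion check is applied to the whole affiliation string rather than to each author's segment, a record that combines a University of Benin author with a Cotonou-based author loses the Benin match. Whether such records occur in the dataset is not established here.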

In [None]:
# Apply the extraction function to each row
df_scopus_main['Countries involved'] = df_scopus_main['Authors with affiliations'].apply(extract_countries_from_address)

In [None]:
df_scopus_main.sample(3)

Unnamed: 0,EID,Authors with affiliations,Year,DOI,Document Type,Open Access,Discipline,Countries involved
34065,2-s2.0-85079700724,"Emetere M.E., Department of Physics, Covenant ...",2018,10.1007/s41810-018-0027-3,Article,,ENV,"[Nigeria, South Africa]"
27661,2-s2.0-85020496254,"Owolade S.O., National Horticultural Research ...",2017,,Article,,AGRI,[Nigeria]
57338,2-s2.0-85085129870,"Abdul-Hammed M., Department of Pure and Applie...",2020,10.22036/pcr.2020.221177.1737,Article,,MAT SCI,[Nigeria]


In [None]:
# check the number of records where the countries involved include "Benin" and "Niger", respectively
[len(df_scopus_main[df_scopus_main['Countries involved'].apply(lambda x: 'Benin' in x)]),
len(df_scopus_main[df_scopus_main['Countries involved'].apply(lambda x: 'Niger' in x)])]

[526, 186]

In [None]:
# let us see the frequency per country
df_scopus_main['Countries involved'] = df_scopus_main['Countries involved'].apply(lambda x: [country.title() for country in x])
df_scopus_main.explode('Countries involved').groupby('Countries involved')['EID'].size().sort_values(ascending=False)

Countries involved,EID
Nigeria,122883
South Africa,13519
United States,11603
United Kingdom,10609
Malaysia,8734
...,...
Honduras,39
Nicaragua,37
Central African Republic,36
Dominican Republic,36
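One caveat with `str.title()` is that it capitalizes every word, which changes names containing lowercase words or apostrophes. As a hedged alternative, the sketch below maps each match back to the spelling used in a reference list through a case-insensitive lookup. The four-name list and the helper `normalize_countries` are hypothetical; in the notebook the reference spellings would come from the `Country` column of the CSV.

```python
# Hypothetical subset of reference spellings
canonical_names = ["Nigeria", "Cote d'Ivoire", "Bosnia and Herzegovina", "United States"]

# Case-insensitive lookup from a folded name to its reference spelling
canonical_by_key = {name.casefold(): name for name in canonical_names}

def normalize_countries(matches):
    """Map matched names to their reference spelling, dropping duplicates in order."""
    normalized = []
    for match in matches:
        name = canonical_by_key.get(match.casefold(), match)
        if name not in normalized:
            normalized.append(name)
    return normalized

print("Bosnia and Herzegovina".title())  # Bosnia And Herzegovina
print("Cote d'Ivoire".title())           # Cote D'Ivoire
print(normalize_countries(["NIGERIA", "Nigeria", "bosnia and herzegovina"]))
# ['Nigeria', 'Bosnia and Herzegovina']
```

Keeping the reference spelling means a later join on the country name does not silently miss rows whose casing was altered.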


In [None]:
# Explode the list of countries into separate rows
df_country_exploded = df_scopus_main.explode('Countries involved')

# Merge with the continent DataFrame to get the continent for each country
df_country_exploded = df_country_exploded.merge(country_continent_df, left_on='Countries involved', right_on='Country', how='right')

# Group the continents back into lists and retain only unique continents per EID
continents_involved = df_country_exploded[df_country_exploded['Region'].notna()].groupby('EID')['Region'].apply(lambda x: list(set(x)))

# Merge back the continents involved to the original df
df_scopus_countries = df_scopus_main.merge(continents_involved, left_on='EID', right_index=True, how='left')
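Because unmatched country names simply end up without a region, it can be useful to audit the join before relying on it. The sketch below is one possible check, not the notebook's method: it uses a left merge with `indicator=True` on toy data (all values made up, including the deliberately unmatched "Atlantis") to list names that found no region and documents with no country at all.

```python
import pandas as pd

# Toy stand-ins for df_scopus_main and country_continent_df
docs = pd.DataFrame({
    "EID": ["d1", "d2", "d3"],
    "Countries involved": [["Nigeria", "Ghana"], ["Nigeria", "Atlantis"], []],
})
regions = pd.DataFrame({
    "Country": ["Nigeria", "Ghana", "France"],
    "Region": ["Africa", "Africa", "Outside Africa"],
})

# One row per (document, country); an empty list becomes a single NaN row
exploded = docs.explode("Countries involved")

# The _merge column records whether each row found a partner in the region table
audited = exploded.merge(
    regions, left_on="Countries involved", right_on="Country",
    how="left", indicator=True,
)
unmatched = audited[audited["_merge"] == "left_only"]

# Country names with no region, and documents with no extracted country
unmatched_names = sorted(unmatched["Countries involved"].dropna().unique())
docs_without_country = unmatched.loc[unmatched["Countries involved"].isna(), "EID"].tolist()

print(unmatched_names)        # ['Atlantis']
print(docs_without_country)   # ['d3']
```

An empty `unmatched_names` list would indicate that every extracted name has a counterpart in the region table.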

In [None]:
# Preview a few rows with the extracted countries
df_scopus_main.sample(3)

Unnamed: 0,EID,Authors with affiliations,Year,DOI,Document Type,Open Access,Discipline,Countries involved
41236,2-s2.0-85076364570,"Babalola O.J., Liberia Field Epidemiology and ...",2019,10.1186/s12936-019-3046-x,Article,All Open Access; Gold Open Access; Green Open ...,MED 1,"[Nigeria, Liberia, Congo]"
86308,2-s2.0-85132047044,"Iheagwam F.N., Department of Biochemistry and ...",2022,10.26538/tjnpr/v6i5.1,Review,,MED 1,[Nigeria]
90710,2-s2.0-85108991511,"Asogwa F.O., Department of Economics, Universi...",2022,10.1002/pa.2485,Article,All Open Access; Green Open Access; Hybrid Gol...,SOCI,"[Nigeria, United Kingdom]"


Once the country data had been extracted into a field, each document was further categorized as local, regional, or international collaboration. Local collaboration represents documents affiliated with Nigeria only. Regional collaboration indicates joint publications by Nigerian authors and authors from other African countries. Finally, international collaboration denotes publications attributed to Nigeria and any country outside Africa.

In [None]:
# Function to categorize the collaboration type based on the countries and regions involved
def categorize_collaboration(countries_involved, continents_involved):
    # Only Nigeria is involved
    if len(countries_involved) == 1 and countries_involved[0] == "Nigeria":
        return "Local"
    # Every country involved is in Africa
    if len(continents_involved) == 1 and continents_involved[0] == "Africa":
        return "Regional"
    # Otherwise at least one region outside Africa is involved
    return "International"
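A quick way to see how the rules play out is to run them on a few hand-made inputs. The sketch below restates the function compactly so that it runs on its own; the example records are made up, and the region labels follow the two categories listed earlier ("Africa" and "Outside Africa").

```python
def categorize_collaboration(countries_involved, regions_involved):
    """Compact restatement of the rules above, for illustration."""
    if len(countries_involved) == 1 and countries_involved[0] == "Nigeria":
        return "Local"
    if len(regions_involved) == 1 and regions_involved[0] == "Africa":
        return "Regional"
    return "International"

# Made-up (countries, regions) pairs covering each outcome
examples = [
    (["Nigeria"], ["Africa"]),
    (["Nigeria", "South Africa"], ["Africa"]),
    (["Nigeria", "United Kingdom"], ["Africa", "Outside Africa"]),
    (["Ghana"], ["Africa"]),
]

labels = [categorize_collaboration(countries, regions) for countries, regions in examples]
for (countries, _), label in zip(examples, labels):
    print(countries, "->", label)
```

The last example is an edge case: a record in which "Nigeria" was not matched but a single African country was is labeled "Regional", since the rules do not check explicitly that Nigeria is among the countries involved.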

In [None]:
nan_records = df_scopus_countries[df_scopus_countries['Region'].isnull()]
display(nan_records)

Unnamed: 0,EID,Authors with affiliations,Year,DOI,Document Type,Open Access,Discipline,Countries involved,Region
13839,2-s2.0-84996536618,Abidin S.; Ishaya I.V.; M-Nor M.N.,2016,,Article,,ECON,[],
15332,2-s2.0-85013031925,Ogunleye A.O.; Carlson S.,2016,,Article,,PHAR,[],
22571,2-s2.0-85062977201,,2018,,Note,,ENGI,[],
22619,2-s2.0-85062976276,,2018,,Note,,ENGI,[],
24715,2-s2.0-85063012955,,2018,,Note,,ENGI,[],


In [None]:
# Drop the records for which no country (and hence no region) could be determined
df_scopus_countries = df_scopus_countries.dropna(subset=['Region'])

df_scopus_countries.info()

<class 'pandas.core.frame.DataFrame'>
Index: 122893 entries, 0 to 122897
Data columns (total 9 columns):
 #   Column                     Non-Null Count   Dtype 
---  ------                     --------------   ----- 
 0   EID                        122893 non-null  object
 1   Authors with affiliations  122893 non-null  object
 2   Year                       122893 non-null  int64 
 3   DOI                        110074 non-null  object
 4   Document Type              122893 non-null  object
 5   Open Access                58743 non-null   object
 6   Discipline                 122031 non-null  object
 7   Countries involved         122893 non-null  object
 8   Region                     122893 non-null  object
dtypes: int64(1), object(8)
memory usage: 9.4+ MB


In [None]:
# Apply the collaboration categorization function
df_scopus_countries['Collaboration Type'] = df_scopus_countries.apply(
    lambda row: categorize_collaboration(row['Countries involved'], row['Region']), axis=1
)

In [None]:
df_scopus_countries.sample(3)

Unnamed: 0,EID,Authors with affiliations,Year,DOI,Document Type,Open Access,Discipline,Countries involved,Region,Collaboration Type
22976,2-s2.0-85090204767,"Abiola T., Department of Medical Services, Fed...",2017,10.4103/jfsm.jfsm_47_17,Article,All Open Access; Gold Open Access,SOCI,[Nigeria],[Africa],Local
1323,2-s2.0-85050629913,"Esere M.O., Department of Counsellor Education...",2015,,Article,,ECON,[Nigeria],[Africa],Local
8708,2-s2.0-84904317276,"Fiebai B., Department of Ophthalmology, Univer...",2014,10.4103/1119-3077.134040,Article,,MED 2,[Nigeria],[Africa],Local


In [None]:
# copy df_scopus_countries back into df_scopus_main
df_scopus_main = df_scopus_countries.copy()
df_scopus_main.info()

<class 'pandas.core.frame.DataFrame'>
Index: 122893 entries, 0 to 122897
Data columns (total 10 columns):
 #   Column                     Non-Null Count   Dtype 
---  ------                     --------------   ----- 
 0   EID                        122893 non-null  object
 1   Authors with affiliations  122893 non-null  object
 2   Year                       122893 non-null  int64 
 3   DOI                        110074 non-null  object
 4   Document Type              122893 non-null  object
 5   Open Access                58743 non-null   object
 6   Discipline                 122031 non-null  object
 7   Countries involved         122893 non-null  object
 8   Region                     122893 non-null  object
 9   Collaboration Type         122893 non-null  object
dtypes: int64(1), object(9)
memory usage: 10.3+ MB
