# BiblioAnalysis

### Version: 4.2.0

### Aims
- This Jupyter notebook results from an analysis of the use of the BiblioTools2jupyter notebook and from a new implementation of the following parts:
    - Parsing: replaced and tested 
    - Corpus description: replaced and tested
    - Filtering: replaced and tested, integrating the "EXCLUSION" mode and the recursive filtering
    - Co-occurrence analysis: replaced and tested, integrating graph plot and countries GPS coordinates
    - Coupling analysis: replaced and tested
    
### Created modules in the package BiblioAnalysis_Utils
    - BiblioCooc.py
    - BiblioCoupling.py
    - BiblioDescription.py
    - BiblioFilter.py    
    - BiblioGeneralGlobals.py
    - BiblioGraphPlot.py
    - BiblioGui.py
    - BiblioNltk.py
    - BiblioParsingConcat.py
    - BiblioParsingInstitutions.py
    - BiblioParsingScopus.py
    - BiblioParsingUtils.py
    - BiblioParsingWos.py  
    - BibloRefs.py
    - BiblioSpecificGlobals.py
    - BiblioSys.py
    - BiblioTempDev.py

### BiblioTool3.2 source
http://www.sebastian-grauwin.com/bibliomaps/download.html 

### List of initial Python packages extracted from  BiblioTool3.2
- biblio_parser.py	⇒ pre-processes WOS / Scopus data files,
- corpus_description.py	⇒ performs a frequency analysis of the items in corpus,
- filter.py	⇒ filters the corpus according to a range of possible queries (still too specific),
- biblio_coupling.py	⇒ performs a BC analysis of the corpus,
- cooc_graphs.py	⇒ produces various co-occurrence graphs based on the corpus (call parameters changed)

### Specifically required list of pip install 
(to be integrated in the setup.py of BiblioAnalysis_Utils)
- !pip3 install fuzzywuzzy
- !pip3 install squarify 
- !pip3 install python-louvain
- !pip3 install python-Levenshtein
- !pip3 install pyvis
- !pip3 install screeninfo
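
Before running the notebook, it can help to check which of these packages are already importable. The following is a minimal sketch, assuming the usual import names of the distributions listed above (python-louvain imported as `community`, python-Levenshtein as `Levenshtein`); the `missing_packages` helper is hypothetical and is not part of BiblioAnalysis_Utils.

```python
from importlib.util import find_spec

# pip distribution name -> import name (assumed mapping, to be adapted if needed)
REQUIRED = {
    'fuzzywuzzy': 'fuzzywuzzy',
    'squarify': 'squarify',
    'python-louvain': 'community',
    'python-Levenshtein': 'Levenshtein',
    'pyvis': 'pyvis',
    'screeninfo': 'screeninfo',
}

def missing_packages(required=REQUIRED):
    """Return the sorted pip names whose import name cannot be found."""
    return sorted(pip_name for pip_name, module in required.items()
                  if find_spec(module) is None)

print('Packages still to install:', missing_packages())
```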

### Specifically required nltk downloads 
(integrated in BiblioNltk.py of BiblioAnalysis_Utils)
- import nltk
    - nltk.download('punkt')
    - nltk.download('averaged_perceptron_tagger')
    - nltk.download('wordnet')

## Preliminary instructions
#### These actions will be interactively performed in the next version of the Jupyter notebook
- Create the 'BiblioAnalysis_Files/' folder in your 'Users/' folder
<br>
<br>
- Create in this 'BiblioAnalysis_Files/' folder, the 'Configuration_Files/' folder
<br>
- Store the following configuration files in the 'Configuration_Files/' folder:
    - 'config_filter.json', used for filtering a corpus
    - 'config_temporal.json', used for the temporal development of item values in a set of annual corpuses 
<br>
<br>
- Create, in the 'Configuration_Files/' folder, your additional_files folder to be named 'Selection_Files/' 
<br>
- Store your files (free names) of selected item values in this additional_files folder together with:
    - 'TempDevK_full.txt' used to select the words to search in the description files of the corpuses for the temporal development of item values in the set of annual corpuses
<br>
<br>
- Create, in the 'BiblioAnalysis_Files/' folder, your project folder
<br>
- Create the 'rawdata/' folder in your project folder
<br>
- Store your corpus file (either wos or scopus extraction) in the 'rawdata/' folder of your project folder
<br>
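
The folder layout described above can be sketched as follows. This is only an illustration under stated assumptions: the `create_biblio_folders` helper and the project name 'My_Project' are hypothetical, and the demonstration runs in a temporary directory rather than in the real 'Users/' folder.

```python
from pathlib import Path
import tempfile

def create_biblio_folders(root, project_name):
    """Create the folder tree described in the preliminary instructions."""
    base = Path(root) / 'BiblioAnalysis_Files'
    folders = [base / 'Configuration_Files' / 'Selection_Files',
               base / project_name / 'rawdata']
    for folder in folders:
        # parents=True also creates 'BiblioAnalysis_Files/' and the intermediate folders
        folder.mkdir(parents=True, exist_ok=True)
    return folders

# Demonstration in a temporary directory; Path.home() would be the root for a real setup
with tempfile.TemporaryDirectory() as tmp:
    for folder in create_biblio_folders(tmp, 'My_Project'):
        print(folder.relative_to(tmp))
```

The configuration files and the corpus file still have to be copied by hand into 'Configuration_Files/' and 'rawdata/' respectively.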


# 0- User environment setting

In [None]:
# Standard library imports
import platform
import os
from IPython.display import clear_output
from pathlib import Path

# Local imports
import BiblioAnalysis_Utils as bau

clear_output(wait=True)
bold_text = bau.BOLD_TEXT
light_text = bau.LIGHT_TEXT

# Set the venv use status
venv = False
print('Virtual environment: ', venv)

# Get the information of current operating system
os_name = platform.uname().system
print('Operating system:    ', os_name)
if os_name=='Darwin':bau.add_site_packages_path(venv)

# User identification
user_root = Path.home()
user_id = user_root.name
print('User:                ', user_id)
expert =  False

# Getting the corpuses folder
 # Setting the GUI titles
gui_titles = {'main':   'Corpuses folder selection window',
              'result': 'Selected folder'}
gui_buttons = ['SELECTION','HELP']

corpuses_folder = bau.select_folder_gui_new(user_root, gui_titles, gui_buttons, bau.GUI_DISP)
print('\nCorpuses folder:', corpuses_folder)
print('\n' + bold_text + 'Cell-run completed' + light_text)

# The following cells, until chapter 1 (parsings merging), are used to test the parsing of addresses to identify the institutions of the publication authors
## These cells will be removed after the useful functions are integrated into the BiblioParsingInstitutions.py module

In [None]:
def special_symbol_remove(text, only_ascii = True, strip = True):
    '''The function `special_symbol_remove` removes accented characters from the string 'text'
    and drops non-ASCII characters if 'only_ascii' is true. Finally, spaces at the ends of 'text'
    are removed if 'strip' is true.
    
    Args:
        text (str): the text where to remove special symbols.
        only_ascii (boolean): True to remove non-ascii characters from 'text' (default: True).
        strip (boolean): True to remove spaces at the ends of 'text' (default: True).
        
    Returns:
        (str): the modified string 'text'.
    
    '''
    # Standard library imports
    import functools
    import unicodedata

    if only_ascii:
        # Decomposing accented characters (NFD form), then dropping every non-ASCII character
        nfd = functools.partial(unicodedata.normalize,'NFD')
        text = nfd(text). \
                   encode('ascii', 'ignore'). \
                   decode('ascii')
    else:
        nfkd_form = unicodedata.normalize('NFKD',text)
        text = ''.join([c for c in nfkd_form if not unicodedata.combining(c)])

    if strip:
        text = text.strip()
    
    return text
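
The two branches of `special_symbol_remove` behave differently on characters that have no ASCII base letter. The sketch below reproduces both strategies with the standard `unicodedata` module only; the two helper names are hypothetical and chosen for this illustration.

```python
import unicodedata

def strip_accents_ascii(text):
    """NFD decomposition, then every non-ASCII character is dropped."""
    return unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode('ascii')

def strip_accents_keep_unicode(text):
    """NFKD decomposition, then only the combining marks (accents) are dropped."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

for word in ['Besançon', 'Nîmes', 'straße']:
    print(word, '->', strip_accents_ascii(word), '|', strip_accents_keep_unicode(word))
```

With the ASCII-only strategy, 'ß' disappears altogether, since it has no decomposition into an ASCII letter plus an accent; the second strategy keeps it.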

In [None]:
def town_names_uniformization(text):
    '''The `town_names_uniformization` function replaces in the string 'text'
    symbols and words defined by the keys of the dictionaries 'DIC_TOWN_SYMBOLS'
    and 'DIC_TOWN_WORDS' by their corresponding values in these dictionaries.
    
    Args:
        text (str): The string where changes will be done.
        
    Returns:
        (str): the modified string.
        
    Notes:
        The globals 'DIC_TOWN_SYMBOLS' and 'DIC_TOWN_WORDS' are imported from
        `BiblioSpecificGlobals` module of `BiblioAnalysis_Utils` package.
    
    '''
    # Local imports
    from BiblioAnalysis_Utils.BiblioSpecificGlobals import DIC_TOWN_SYMBOLS
    from BiblioAnalysis_Utils.BiblioSpecificGlobals import DIC_TOWN_WORDS
    
    # Uniformizing symbols in town names using the dict 'DIC_TOWN_SYMBOLS'
    for town_symb in DIC_TOWN_SYMBOLS.keys():
        text = text.replace(town_symb, DIC_TOWN_SYMBOLS[town_symb])

    # Uniformizing words in town names using the dict 'DIC_TOWN_WORDS'
    for town_word in DIC_TOWN_WORDS.keys():
        text = text.replace(town_word, DIC_TOWN_WORDS[town_word])
    
    return text
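
As an illustration of the two replacement passes, here is a minimal self-contained sketch. It copies the two small dictionaries defined in the globals cell of this notebook under local names; the `uniformize_town` name is hypothetical.

```python
# Local copies of the notebook dictionaries 'DIC_TOWN_SYMBOLS' and 'DIC_TOWN_WORDS'
TOWN_SYMBOLS = {"-": " "}
TOWN_WORDS = {" lez ": " les ", "saint ": "st "}

def uniformize_town(text):
    """Apply the symbol replacements first, then the word replacements."""
    for mapping in (TOWN_SYMBOLS, TOWN_WORDS):
        for old, new in mapping.items():
            text = text.replace(old, new)
    return text

print(uniformize_town('saint-paul-lez-durance'))
print(uniformize_town('saint-nazaire'))
```

The order matters: the hyphens must be turned into spaces before the keys " lez " and "saint " can match.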

In [None]:
# Globals for institutions

#############################################
# Specific globals for institutions parsing #
#############################################

# Standard library imports
import re

# Local imports 
#from BiblioAnalysis_Utils.BiblioParsingUtils import special_symbol_remove
#from BiblioAnalysis_Utils.BiblioParsingUtils import town_names_uniformization

# For replacing symbols in town names
DIC_TOWN_SYMBOLS = {"-": " ",
                     }

# For replacing names in town names
DIC_TOWN_WORDS = {" lez ":  " les ",
                   "saint ": "st ",
                   } 

# For replacing aliases of a word by a word (case sensitive)
DIC_WORD_RE_PATTERN = {}
DIC_WORD_RE_PATTERN['University'] = re.compile(r'\bUniv[aàädeéirstyz]{0,8}\b\.?')
DIC_WORD_RE_PATTERN['Laboratory'] = re.compile(r"'?\bLab\b\.?" \
                                                    +  "|" \
                                                    + r"'?\bLabor[aeimorstuy]{0,7}\b\.?")
DIC_WORD_RE_PATTERN['Center'] = re.compile(r"\b[CZ]ent[erum]{1,3}\b\.?")
DIC_WORD_RE_PATTERN['Department'] = re.compile(r"\bD[eé]{1}p[artemnot]{0,9}\b\.?")
DIC_WORD_RE_PATTERN['Institute'] = re.compile(r"\bInst[ituteosky]{0,7}\b\.?")
DIC_WORD_RE_PATTERN['Faculty'] = re.compile(r"\bFac[lutey]{0,4}\b\.?")
DIC_WORD_RE_PATTERN['School'] = re.compile(r"\bSch[ol]{0,3}\b\.?")


# For keeping chunks of addresses (without accents and in lower case)
    # Setting a list of keeping words
        # Setting a basic list of keeping words
_BASIC_KEEPING_WORDS = list(DIC_WORD_RE_PATTERN.keys())
        # Setting a user list of keeping words
_USER_KEEPING_WORDS = ['Beamline', 'CEA', 'CNRS', 'EA', 'ED', 'FR', 'IMEC', 'INES', 'IRCELYON', \
                      'LaMCoS', 'LEPMI', 'LITEN', 'LOCIE', 'STMicroelectronics', \
                       'TNO', 'ULR', 'UMR', 'VTT']
_KEEPING_WORDS = _BASIC_KEEPING_WORDS + _USER_KEEPING_WORDS
        # Removing accents (keeping non-ASCII characters) and converting the keeping words to lower case
KEEPING_WORDS =[special_symbol_remove(x, only_ascii = False, strip = False).lower() for x in _KEEPING_WORDS]


# For droping chunks of addresses (without accents and in lower case)
    # Setting a list of droping suffixes
_DROPING_SUFFIX = ["platz", "strae", "strasse", "straße", "vej"] # added "ring" but drops chunks containing "Engineering"
        # Removing accents (keeping non-ASCII characters) and converting the droping suffixes to lower case
DROPING_SUFFIX = [special_symbol_remove(x, only_ascii = False, strip = False).lower() for x in _DROPING_SUFFIX]

    # Setting a list of droping words
_DROPING_WORDS = ["allee", "av", "avda", "ave", "avenue", "bat", "batiment", "boulevard", "blv.", "box", "bp", "calle", 
                 "campus", "carrera", "cedex", "cesta", "chemin", "ch.", "city", "ciudad", "cours", "cs", "district", 
                 "lane", "mall", "no.", "po", "p.", "rd", "route", "rue", "road", "sec.", "st.", "strada",
                 "street", "str.", "via", "viale", "villa"]
        # Removing accents (keeping non-ASCII characters) and converting the droping words to lower case
_DROPING_WORDS = [special_symbol_remove(x, only_ascii = False, strip = False).lower() for x in _DROPING_WORDS]
        # Escaping the regex meta-character "." from the droping words, by default
DROPING_WORDS = [x.replace(".", r"\.") for x in _DROPING_WORDS]


# For droping towns in addresses 
    # Setting string listing raw french-town names 
_FR_UNIVERSITY_TOWNS = '''Aix-Marseille,Aix-en-Provence,Amiens,Angers,Arras,Aulnay-sous-bois,Avignon,Aulnoye-Aymeries,
                         Besançon,Bordeaux,Bouguenais,Brest,Caen,Chambéry,Clermont-Ferrand,Dijon,
                         Fraisses,Gif-sur-Yvette,Grenoble,La Rochelle,Le Bourget-du-Lac,
                         Le Havre,Le Mans,Lille,Limoges,Lyon,Marseille,Metz,Montbonnot,Montpellier,Mulhouse,Moret-Sur-Loing,
                         Nancy,Nantes,Nice,Nîmes,Orléans,Paris,Pau,Palaiseau,Perpignan,Pointe-à-Pitre,
                         Poitiers,Reims,Rennes,Rouen,Saint-Denis de La Réunion,Saint-Étienne,
                         Saint-Paul-lez-Durance,Saint-Nazaire,Strasbourg,Toulon,Toulouse,Tours,Troyes,
                         Valenciennes,Villeurbanne'''

    # Converting to lower case
_FR_UNIVERSITY_TOWNS = _FR_UNIVERSITY_TOWNS.lower() 

    # Uniformizing town names 
_FR_UNIVERSITY_TOWNS = town_names_uniformization(_FR_UNIVERSITY_TOWNS)

    # Removing accents keeping non-ASCII characters
_FR_UNIVERSITY_TOWNS = special_symbol_remove(_FR_UNIVERSITY_TOWNS, only_ascii = False, strip = False)

    # Converting to list of lower-case stripped names of towns 
FR_UNIVERSITY_TOWNS = [x.strip() for x in _FR_UNIVERSITY_TOWNS.split(',')]


###################
# General globals #
###################

# For changing particularly encoded symbols
DIC_CHANGE_APOST = {"”": "'",
                    "’": "'",   
                    '"': "'",
                    "“": "'",   
                    "'": "'",   
                    } 

APOSTROPHE_CHANGE = str.maketrans(DIC_CHANGE_APOST)

# For replacing dashes by hyphen-minus
DIC_CHANGE_DASHES = {"‐": "-",   # Non-Breaking Hyphen to hyphen-minus
                     "—": "-",   # Em-dash to hyphen-minus
                     "–": "-",   # En-dash to hyphen-minus
                     "–": "-",
                     }

DASHES_CHANGE = str.maketrans(DIC_CHANGE_DASHES)
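
The translation tables built with `str.maketrans` are applied with `str.translate`. The following sketch uses a reduced, explicitly escaped set of characters to make the code points unambiguous; the table contents and the `normalize_punctuation` name are assumptions made for this illustration, not the exact notebook globals.

```python
# Reduced tables with explicit code points (illustrative subset)
DASHES = str.maketrans({"\u2010": "-",   # hyphen
                        "\u2013": "-",   # en dash
                        "\u2014": "-"})  # em dash
APOSTROPHES = str.maketrans({"\u2019": "'",   # right single quotation mark
                             "\u201c": "'",   # left double quotation mark
                             "\u201d": "'",   # right double quotation mark
                             '"': "'"})

def normalize_punctuation(text):
    """Replace typographic dashes and quotes by their plain ASCII versions."""
    return text.translate(DASHES).translate(APOSTROPHES)

print(normalize_punctuation("Saint\u2010Martin d\u2019H\u00e8res \u2013 CEA"))
```

Each key of the dictionary passed to `str.maketrans` must be a single character, which is why multi-character replacements such as those of 'DIC_TOWN_WORDS' rely on `str.replace` instead.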





In [None]:
def address_standardization(raw_address):
    
    '''The `address_standardization` function standardizes the string 'raw_address' by replacing
    all aliases of a word, such as 'University', 'Institute', 'Center' and 'Department', 
    by a standardized version.
    The aliases of a given word are captured using a specific case-sensitive regex defined 
    by the global 'DIC_WORD_RE_PATTERN'.
    The aliases may contain symbols from a given list of any language, including accented ones. 
    The length of the aliases is limited to a maximum according to the longest alias known.
        ex: The longest alias known for the word 'University' is 'Universidade'. 
            Thus, 'University' aliases are limited to 12 symbols beginning with the base 'Univ' 
            + up to 8 symbols from the list '[aàädeéirstyz]' and possibly finishing with a dot.
            
    Then, dashes are replaced by a hyphen-minus using the 'DASHES_CHANGE' global and apostrophes are replaced 
    by the standard quote using the 'APOSTROPHE_CHANGE' global.
    Finally, the country, taken as the last chunk of the address, is normalized using 
    the `country_normalization` function.
    
    Args:
        raw_address (str): the full address to be standardized.

        
    Returns:
        (str): the full standardized address.
        
    Notes:
        The global 'DIC_WORD_RE_PATTERN' and 'UNKNOWN' are imported from `BiblioSpecificGlobals` module 
        of `BiblioAnalysis_Utils` package.
        The globals 'DASHES_CHANGE' and 'APOSTROPHE_CHANGE' are imported from `BiblioGeneralGlobals` module 
        of `BiblioAnalysis_Utils` package.
        The function `country_normalization` is imported from `BiblioParsingUtils` module 
        of `BiblioAnalysis_Utils` package.
        
    '''
    
    # Standard library imports
    import re
    
    # Local imports
    from BiblioAnalysis_Utils.BiblioParsingUtils import country_normalization
    #from BiblioAnalysis_Utils.BiblioGeneralGlobals import APOSTROPHE_CHANGE
    #from BiblioAnalysis_Utils.BiblioGeneralGlobals import DASHES_CHANGE
    #from BiblioAnalysis_Utils.BiblioSpecificGlobals import DIC_WORD_RE_PATTERN
    from BiblioAnalysis_Utils.BiblioSpecificGlobals import UNKNOWN
    
    # Uniformizing words
    standard_address = raw_address
    for word_to_substitute, re_pattern in DIC_WORD_RE_PATTERN.items():
        standard_address = re.sub(re_pattern,word_to_substitute + ' ',standard_address)
    standard_address = re.sub(r'\s+',' ',standard_address)
    standard_address = re.sub(r'\s,',',',standard_address)
    
    # Uniformizing dashes
    standard_address = standard_address.translate(DASHES_CHANGE)
    
    # Uniformizing apostrophes
    standard_address = standard_address.translate(APOSTROPHE_CHANGE)
    
    # Uniformizing countries
    country_pos = -1
    first_raw_affiliations_list = standard_address.split(',')
    raw_affiliations_list = sum([x.split(' - ') for x in first_raw_affiliations_list],[])
    country = country_normalization(raw_affiliations_list[country_pos].strip())
    space = " "
    if country != UNKNOWN:
        standard_address = ','.join(raw_affiliations_list[:-1] + [space + country])
    else:
        standard_address = ','.join(raw_affiliations_list + [space + country])

    return standard_address
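
The word-uniformization step of `address_standardization` can be tried in isolation. The sketch below uses two of the patterns of 'DIC_WORD_RE_PATTERN' (the 'Laboratory' one is slightly simplified) and leaves out the country normalization, which depends on the package; the `standardize_words` name is hypothetical.

```python
import re

# Two alias patterns taken from the notebook globals (the second one simplified)
ALIAS_PATTERNS = {
    'University': re.compile(r'\bUniv[aàädeéirstyz]{0,8}\b\.?'),
    'Laboratory': re.compile(r"\bLab\b\.?|\bLabor[aeimorstuy]{0,7}\b\.?"),
}

def standardize_words(address):
    """Replace each alias by its standard word, then clean up the spaces."""
    for word, pattern in ALIAS_PATTERNS.items():
        # A trailing space is added in case the alias ended with a dot glued to the next word
        address = pattern.sub(word + ' ', address)
    address = re.sub(r'\s+', ' ', address)   # collapsing repeated spaces
    return re.sub(r'\s,', ',', address)      # removing the space left before a comma

print(standardize_words('Univ. Grenoble Alpes, Lab. Mat, France'))
print(standardize_words('Universidade de Lisboa'))
```

Adding a space after each substituted word and collapsing the spaces afterwards avoids having to handle separately the aliases that end with a dot.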


In [None]:
raw_address = " STMicroelect Crolles 2 SAS, 850 Rue Jean Monnet, F-38926 Crolles, France"
address_standardization(raw_address)

In [None]:
def get_affiliations_list(std_address, drop_to_end = False, verbose = False):
    
    '''The `get_affiliations_list` function extracts first the country and then the list 
    of institutions from a standardized address. It splits the address into a list of chunks 
    separated by commas or isolated hyphen-minus.
    The country is the last chunk of the splitting.
    The other chunks are kept as institutions if they contain at least one word among 
    those listed in the 'KEEPING_WORDS' global or if they do not contain any item 
    searched by the `search_droping_items` function.
    The first chunk is always kept in the final institutions list.
    The spaces at the ends of the items of the final institutions list are removed.
    
    Args:
        std_address (str): the full address to be parsed into a list of institutions and a country.
        drop_to_end (boolean): if true, all chunks are dropped after the first one found to drop
                               (default: False).
        verbose (boolean): if true, prints are run (default: False). 
        
    Returns:
        (tuple): the tuple with the list of kept chunks, the country and the list of dropped chunks.
        
    Notes:
        The function `search_droping_items` is imported from `BiblioParsingInstitutions` module 
        of `BiblioAnalysis_Utils` package.
        The function `country_normalization` is imported from `BiblioParsingInstitutions` module 
        of `BiblioAnalysis_Utils` package.
        The globals 'KEEPING_WORDS' and 'UNKNOWN' are imported from `BiblioSpecificGlobals` module 
        of `BiblioAnalysis_Utils` package.        
        
    '''
    
    # Standard library imports
    import re
    from string import Template
    
    # Local imports
    #from BiblioAnalysis_Utils.BiblioParsingInstitutions import search_droping_items 
    #from BiblioAnalysis_Utils.BiblioSpecificGlobals import KEEPING_WORDS
    
    def _search_keaping_words(text):
        '''The `_search_keaping_words` internal function searches in 'text' for isolated words 
        given by the 'KEEPING_WORDS' global using a templated regex.
        
        Args:
            text (str): the string where the words are searched after being converted to lower case.
            
        Returns:
            (boolean): True if a word given by the 'KEEPING_WORDS' global is found.
            
        Notes:
            The global 'KEEPING_WORDS' is imported from `BiblioSpecificGlobals` module 
            of `BiblioAnalysis_Utils` package.               

        '''
        keeping_words_template = Template(r'\b$word\b')

        keeping_word_found = False
        for word_to_keep in KEEPING_WORDS:
            re_keep_words = re.compile(keeping_words_template.substitute({"word":word_to_keep}))
            result = re.search(re_keep_words,text.lower())
            if result is not None:
                keeping_word_found = True
                break
        return keeping_word_found

    # Splitting the standard address into chunks set in a raw-affiliations list
    first_raw_affiliations_list = std_address.split(',')
    raw_affiliations_list = sum([x.split(' - ') for x in first_raw_affiliations_list],[])
    if verbose:
        print('Full standard address:',std_address)
        print('first_raw_affiliations_list:',first_raw_affiliations_list)
        print('raw_affiliations_list flattened:',raw_affiliations_list)
        print()
    
    # Setting country index in raw-affiliations list
    country_pos = -1
    country = raw_affiliations_list[country_pos].strip()
    if verbose:
        print('country:', country)
    
    # Initializing the affiliations list by systematically keeping the first chunk of the full address
    affiliations_list = [raw_affiliations_list[0]]
    
    # Initializing the list of chunks to drop from the raw-affiliations list
    affiliation_drop = []
    
    # Searching for chunks to keep and chunks to drop in the raw-affiliations list, the first chunk and the country excepted
    if len(raw_affiliations_list)>2:   
        for affiliation in raw_affiliations_list[1:country_pos]:
            
            keeping_word_found = _search_keaping_words(affiliation)            
            if keeping_word_found: 
                affiliations_list.append(affiliation)
                if verbose: 
                    print('Keeping word found in:',affiliation)
                    print()

            else:
                droping_item_found = search_droping_items(affiliation, country, verbose = verbose)

                if verbose:
                    print('No keeping word found in:',affiliation)
                    print()

                if not droping_item_found:
                    if verbose:
                        print('  No droping item found in:',affiliation)
                        print()
                    affiliations_list.append(affiliation)

                else:
                    affiliation_drop.append(affiliation)
                    if verbose:
                        print('  Droping item found in:',affiliation)
                        print()
                    if drop_to_end: break 
                
    # Removing spaces from the affiliations kept 
    affiliations_list = [x.strip() for x in affiliations_list]
    if verbose:
        print('affiliations_list stripped:',affiliations_list)
        print()
    
    return (affiliations_list,country,affiliation_drop)
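
The splitting and the keep-word test performed by `get_affiliations_list` can be sketched on their own. The keep-word list below is a small hypothetical subset of 'KEEPING_WORDS', and the helper names are chosen for this illustration; `itertools.chain` is used here instead of `sum(..., [])` to flatten the nested lists.

```python
import re
from itertools import chain

KEEP_WORDS = ['university', 'cea', 'liten']   # hypothetical subset, in lower case

def split_chunks(std_address):
    """Split on commas, then on isolated hyphen-minus, and strip each chunk."""
    parts = (part.split(' - ') for part in std_address.split(','))
    return [chunk.strip() for chunk in chain.from_iterable(parts)]

def has_keep_word(chunk):
    """True if the chunk contains one of the keep words as an isolated word."""
    text = chunk.lower()
    return any(re.search(rf'\b{re.escape(word)}\b', text) for word in KEEP_WORDS)

chunks = split_chunks('University Grenoble Alpes, CEA - LITEN, 17 rue des Martyrs, France')
country = chunks[-1]
for chunk in chunks[:-1]:
    print(f'{chunk!r}: keep word found = {has_keep_word(chunk)}')
print('country:', country)
```

A chunk without any keep word is not dropped immediately: in the function above it is still kept unless `search_droping_items` finds a reason to drop it.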


In [None]:
def search_droping_items(affiliation, country, verbose = False):
    
    '''The `search_droping_items` function searches for several item types in 'affiliation' after accent removal 
    and conversion to lower case (the regexes themselves are case sensitive).
    It uses the following internal functions:
        - The `_search_droping_words` function searches for words given by the 'DROPING_WORDS' global 
        such as 'Avenue'.
        - The `_search_droping_suffix` function searches for words ending by a suffix among 
        those given by the 'DROPING_SUFFIX' global such as 'platz'.
        - The `_search_droping_bp` function searches for words that are postal-box numbers such as 'BP54'.
        - The `_search_droping_zip` function searches for words that are zip codes such as 'F-38000'.
        - The `_search_droping_town` function searches for words that are french towns 
        listed in the 'FR_UNIVERSITY_TOWNS' global.
    
    As a reminder, in a regex:
        - '\\b' matches the boundary between a non-alphanumeric symbol and an alphanumeric symbol, 
        and vice versa.
        - '\\B' matches any position that is not such a boundary, for instance between two alphanumeric symbols.
    
    Args:
        affiliation (str): a chunk of a standardized address where droping items are searched.
        country (str): the string that contains the country.
        verbose (boolean): if true, prints are run (default: False).
       
    Returns:
        (boolean): True if at least one droping item is found.
    
    Notes:
        The function `special_symbol_remove` is imported from `BiblioParsingUtils` module 
        of `BiblioAnalysis_Utils` package.
        The globals 'DROPING_SUFFIX', 'DROPING_WORDS', 'KEEPING_PREFIX' are imported from `BiblioSpecificGlobals` module 
        of `BiblioAnalysis_Utils` package.
        The global 'FR_UNIVERSITY_TOWNS' is imported from `BiblioGeneralGlobals` module 
        of `BiblioAnalysis_Utils` package.
    
    '''
    
    # Standard library imports
    import re
    from string import Template
    
    # Local imports 
    #from BiblioAnalysis_Utils.BiblioParsingUtils import special_symbol_remove
    #from BiblioAnalysis_Utils.BiblioParsingUtils import town_names_uniformization
    #from BiblioAnalysis_Utils.BiblioSpecificGlobals import DROPING_SUFFIX
    #from BiblioAnalysis_Utils.BiblioSpecificGlobals import DROPING_WORDS
    from BiblioAnalysis_Utils.BiblioSpecificGlobals import KEEPING_PREFIX  # not defined in this notebook
    #from BiblioAnalysis_Utils.BiblioSpecificGlobals import FR_UNIVERSITY_TOWNS
    from BiblioAnalysis_Utils.BiblioSpecificGlobals import ZIP_CODES       # not defined in this notebook

    def _search_droping_words():
        '''The `_search_droping_words` internal function searches in 'affiliation_mod' for isolated words 
        given by the 'DROPING_WORDS' global using a templated regex.
        
        Args:
            affiliation_mod (str): the string where the words are searched.
            
        Returns:
            (boolean): True if a word given by the 'DROPING_WORDS' global is found.
            
        Notes:
            The global 'DROPING_WORDS' is imported from `BiblioSpecificGlobals` module 
            of `BiblioAnalysis_Utils` package.               

        '''
        
        droping_words_template = Template(  r'[\s(]$word[\s)]'     # For instance capturing " avenue " in "12 avenue azerty"
                                          + '|'
                                          + r'[\s]$word$$')        # or " cedex" at the end of "azert cedex".
                                                               

        flag = False
        for word_to_drop in DROPING_WORDS:
            re_drop_words = re.compile(droping_words_template.substitute({"word":word_to_drop}))
            result = re.search(re_drop_words,affiliation_mod)
            if result is not None:
                flag = True
                if verbose:
                    print('Droping word (full word):', word_to_drop)
                break
        return flag

    def _search_droping_suffix():
        '''The `_search_droping_suffix` internal function searches in 'affiliation_mod' for words 
        ending by a suffix among those given by the 'DROPING_SUFFIX' global 
        using a templated regex.
        
        Args:
            affiliation_mod (str): the string where the suffixes given by the 'DROPING_SUFFIX' global
                                   are searched.
            
        Returns:
            (boolean): True if a suffix given by the 'DROPING_SUFFIX' global is found.
            
        Notes:
            The global 'DROPING_SUFFIX' is imported from `BiblioSpecificGlobals` module 
            of `BiblioAnalysis_Utils` package.               

        '''
        
        droping_suffix_template = Template(r'\B$word\b')    # For instance capturing "platz" 
                                                            # in "azertyplatz uiops12".

        flag = False
        for word_to_drop in DROPING_SUFFIX:
            re_drop_words = re.compile(droping_suffix_template.substitute({"word":word_to_drop}))
            result = re.search(re_drop_words,affiliation_mod)
            if result is not None:
                flag = True
                if verbose:
                    print('Droping word (suffix):', word_to_drop)
                break
        return flag


    def  _search_droping_bp():
        '''The `_search_droping_bp` internal function searches in 'affiliation_mod' for words 
        beginning with 'bp' followed by digits using a case-sensitive regex on the lower-case text.
        
        Args:
            affiliation_mod (str): the string where the words are searched.
            
        Returns:
            (boolean): True if a word is found.
            
        '''

        re_bp = re.compile(r'\bbp\d+\b')     # For instance capturing "bp12" in "azert bp12 yui_op".

        flag = False
        result = re.search(re_bp,affiliation_mod)
        if result is not None:
            if verbose:
                print('Droping word: postal box')
            flag = True
        return flag 

    def _search_droping_zip():
        '''The `_search_droping_zip` internal function searches in 'affiliation_mod' for words 
        similar to zip codes except those beginning with a prefix from the 'KEEPING_PREFIX' global
        followed by 4 digits using case-sensitive regexes. 
        Regex for zip-codes search uses the 'ZIP_CODES' dict global for countries from 'ZIP_CODES.keys()'.
        Specific regexes are set for 'United Kingdom', 'United States' and 'Canada'.

        Args:
            affiliation_mod (str): the string where the words are searched.
            country (str): the string that contains the country.

        Returns:
            (boolean): True if a zip-like word is found and no word beginning with 
                       a keeping prefix followed by 4 digits is present.

        '''

        # Setting regex for zip-codes search
        if country in ZIP_CODES.keys():
            zip_template = Template(r'\b($zip_letters)[\s-]?(\d{$zip_digits})\b')
            letters_list, digits_list = ZIP_CODES[country]['letters'], ZIP_CODES[country]['digits']
            letters_join = '|'.join(letters_list) if len(letters_list) else ''
            pattern_zip_list = [zip_template.substitute({"zip_letters": letters_join, "zip_digits":digits})
                                for digits in digits_list]     
            re_zip = re.compile('|'.join(pattern_zip_list))

        elif country == 'United Kingdom':
            # Capturing: for instance, " BT7 1NN" or " WC1E 6BT" or " G128QQ"
            #            " a# #a", " a# #az", " a# ##a", " a# ##az",
            #            " a##a", " a##az", " a###a", " a###az",
            #
            #            " a#a #a", " a#a #az", " a#a ##a", " a#a ##az",
            #            " a#a#a", " a#a#az", " a#a##a", " a#a##az",
            #
            #            " a## #a", " a## #az", " a## ##a", " a## ##az",
            #            " a###a", " a###az", " a####a", " a####az",
            #            
            #            " a##a #a", " a##a #az", " a##a ##a", " a##a ##az",
            #            " a##a#a", " a##a#az", " a##a##a", " a##a##az",
            #            
            #            " az# #a", " az# #az", " az# ##a", " az# ##az",
            #            " az##a", " az##az", " az###a", " az###az",
            #
            #            " az#a #a", " az#a #az", " az#a ##a", " az#a ##az",
            #            " az#a#a", " az#a#az", " az#a##a", " az#a##az",
            #
            #            " az## #a", " az## #az", " az## ##a", " az## ##az",
            #            " az###a", " az###az", " az####a", " az####az",
            #
            #            " az##a #a", " az##a #az", " az##a ##a", " az##a ##az",
            #            " az##a#a", " az##a#az", " az##a##a", " az##a##az",

            re_zip = re.compile(r'^\s?[a-z]{1,2}\d{1,2}[a-z]{0,1}\s?\d{1,2}[a-z]{1,2}$')

        elif country == 'United States' or country == 'Canada':
            # Capturing: for instance, " NY" or ' NI BT48 0SG' or " ON K1N 6N5" 
            #            " az" or " az " + 6 or 7 characters in 2 parts separated by spaces

            re_zip = re.compile(r'^\s?[a-z]{2}$' + '|' + r'^\s?[a-z]{2}\s[a-z0-9]{3,4}\s[a-z0-9]{2,3}$')
        else:
            print('country not found:', country)
            return False

        # Setting search regex of embedding digits
        re_digits = re.compile(r'\s\d+(-\d+)?\b'      # For instance capturing " 1234" in "azert 1234-yui_OP"
                                                      # or " 1" in "azert 1-yui_OP" or " 1-23" in "azert 1-23-yui".                            
                               + '|'
                               + r'\b[a-z]+(-)?\d{2,}\b') # For instance capturing "azert12" in "azert12 UI_OPq" 
                                                      # or "azerty1234567" in "azerty1234567 ui_OPq".

        # Setting search regex of keeping-prefix
        # for instance, capturing "umr1234" in "azert UMR1234 YUI_OP" or "fr1234" in "azert-fr1234 Yui_OP".
        prefix_template = Template(r'\b$prefix[-]?\d{4}\b')
        pattern_prefix_list = [prefix_template.substitute({"prefix": prefix})
                               for prefix in KEEPING_PREFIX]   
        re_prefix = re.compile('|'.join(pattern_prefix_list))
        #print(re_prefix)              

        flag = False
        prefix_result = False if (re.search(re_prefix,affiliation_mod) is None) else True
        if prefix_result and verbose: print('Keeping prefix: True')
        zip_result = False if (re.search(re_zip,affiliation_mod) is None) else True
        digits_result = False if (re.search(re_digits,affiliation_mod) is None) else True
        if not prefix_result and (zip_result or digits_result):
            if verbose:
                print('Dropping word: zip code' if zip_result else 'Dropping word: digits code')
            flag = True
        return flag
    
    def _search_droping_town():
        '''The `_search_droping_town` internal function checks whether 'affiliation_mod',
        after town-names uniformization, is a French town listed in the 'FR_UNIVERSITY_TOWNS' global.
        
        Returns:
            (boolean): True if a word listed in the 'FR_UNIVERSITY_TOWNS' global is equal to 'affiliation_mod' 
                       after spaces removal at ends.
        
        Notes:
            The variables 'affiliation_mod' and 'verbose' are not passed as arguments;
            they are read from the enclosing scope.
            
        '''
        flag = False
        text_mod = town_names_uniformization(affiliation_mod)
        for word_to_drop in FR_UNIVERSITY_TOWNS:
            if word_to_drop == text_mod.strip():
                if verbose:
                    print('Dropping word: French town')
                flag = True
                break
        return flag        

    funct_list = [_search_droping_words, _search_droping_bp, _search_droping_zip, _search_droping_suffix, _search_droping_town]
    affiliation_mod  = special_symbol_remove(affiliation, only_ascii = False, strip = False).lower()
    droping_word_found = any([funct() for funct in funct_list])

    return droping_word_found
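
The two country-specific patterns used in `_search_droping_zip` can be checked in isolation. The sketch below recompiles them outside the function and applies them to the examples quoted in the comments, once lowercased. The names `RE_ZIP_UK`, `RE_ZIP_US_CA` and `looks_like_zip` are illustrative assumptions and are not part of `BiblioAnalysis_Utils`.

```python
import re

# Same patterns as in `_search_droping_zip`, compiled standalone for illustration.
RE_ZIP_UK = re.compile(r'^\s?[a-z]{1,2}\d{1,2}[a-z]{0,1}\s?\d{1,2}[a-z]{1,2}$')
RE_ZIP_US_CA = re.compile(r'^\s?[a-z]{2}$' + '|' + r'^\s?[a-z]{2}\s[a-z0-9]{3,4}\s[a-z0-9]{2,3}$')

def looks_like_zip(chunk, regex):
    """Hypothetical helper: True if the lowercased chunk matches the regex."""
    return regex.search(chunk.lower()) is not None

uk_samples = [" BT7 1NN", " WC1E 6BT", " G128QQ", " London"]
us_ca_samples = [" NY", " ON K1N 6N5", " Boston"]

uk_results = [looks_like_zip(s, RE_ZIP_UK) for s in uk_samples]
us_ca_results = [looks_like_zip(s, RE_ZIP_US_CA) for s in us_ca_samples]

print(dict(zip(uk_samples, uk_results)))
print(dict(zip(us_ca_samples, us_ca_results)))
```

Note that the first alternative of the United States / Canada pattern accepts any chunk made of exactly two letters, so a chunk such as " of" would also be flagged.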
    

## The following cells prepare the zip code capture using a specific templated regex

In [None]:
def _search_droping_zip_old(text):
    '''The `_search_droping_zip_old` function searches in 'text' for words 
    similar to zip codes except those beginning with 'umr' or 'fr' followed by 4 digits.
    The regexes are case-sensitive and written in lower case, 
    so 'text' is expected to be already in lower case.
    
    Args:
        text (str): the string where the words are searched.
        
    Returns:
        (boolean): True if a word similar to a zip code or embedding digits is found
                   and no word beginning with 'umr' or 'fr' followed by 4 digits is found.
        
    '''
    #To Do : CH-5232 A-1060 '1721 PW' D-48149  'CV4 7AL'

    re_zip = re.compile(r'\bb-?\d{4}\b'         # For instance capturing "b-1234" in "azert B-1234 yui_OP"
                                                # or "b1234" in "azert B1234 yui_OP".
                        + '|'
                        + r'\bbe-?\d{4}\b'      # For instance capturing "be-1234" in "azert Be-1234 yui_OP"
                                                # or "be1234" in "azert BE1234 yui_OP".
                        + '|'
                        + r'\bf-?\d{5}\b')      # For instance capturing "f-12345" in "azert F-12345 yui_OP"
                                                # or "f12345" in "azert F12345 yui_OP". 

    re_digits = re.compile(r'\s\d+(-\d+)?\b'      # For instance capturing " 1234" in "azert 1234-yui_OP"
                                                  # or " 1" in "azert 1-yui_OP" or " 1-23" in "azert 1-23-yui".
                           + '|'
                           + r'\b[a-z]+\d{2,}\b') # For instance capturing "azert12" in "azert12 UI_OPq" 
                                                  # or "azerty1234567" in "azerty1234567 ui_OPq".

    re_umr = re.compile(r'\bumr\d{4}\b'         # For instance capturing "umr1234" in "azert UMR1234 YUI_OP" 
                                                # or "umr1234" in "azert-umr1234 Yui_OP".
                        + '|'
                        + r'\bfr\d{4}\b')       # For instance capturing "fr1234" in "azert fr1234 YUI_OP" 
                                                # or "fr1234" in "azert-fr1234 Yui_OP".

    flag = False
    umr_result = False if (re.search(re_umr,text) is None) else True
    zip_result = False if (re.search(re_zip,text) is None) else True
    digits_result = False if (re.search(re_digits,text) is None) else True
    if not umr_result and (zip_result or digits_result):
        if verbose:
            print('Dropping word: zip or digits code')
        flag = True
    return flag

In [None]:
# Globals for '_search_droping_zip' internal function of 'search_droping_items' function 

# ' xxxx' is dropped but " xxxx" is not dropped
ZIP_CODES = {'Algeria':            {'letters':[],            'digits': [5]},
             'Austria':            {'letters':['A'],         'digits': [4]},
             'Belgium':            {'letters':['B','Be'],    'digits': [4]},
             'Bulgaria':           {'letters':[],            'digits': [4]},
             'Brazil':             {'letters':[],            'digits': [5]},   # ' ddddd-ddd'
             'Chile':              {'letters':[],            'digits': [6]},
             'China':              {'letters':[],            'digits': [6]},
             'Cuba':               {'letters':[],            'digits': [5]},
             'Denmark':            {'letters':['DK'],        'digits': [4]},
             'Ecuador':            {'letters':[],            'digits': [6]},
             'Estonia':            {'letters':['EE'],        'digits': [5]},
             'Finland':            {'letters':['FI'],        'digits': [5]},
             'France':             {'letters':['F','FR'],    'digits': [5,6]},   # ' dd ddd'
             'Germany':            {'letters':['D','DE'],    'digits': [5]},
             'Greece':             {'letters':['GR'],        'digits': [5]},
             'Hungary':            {'letters':[],            'digits': [4]},
             'India':              {'letters':[],            'digits': [6]},
             'Indonesia':          {'letters':[],            'digits': [5]},
             'Israel':             {'letters':[],            'digits': [7]},
             'Italy':              {'letters':['I'],         'digits': [5]},
             'Japan':              {'letters':[],            'digits': [3]},   # ' ddd-dddd'
             'Latvia':             {'letters':[],            'digits': [4]},
             'Lebanon':            {'letters':[],            'digits': [4]},
             'Luxembourg':         {'letters':['L'],         'digits': [4]},
             'Mexico':             {'letters':['C.P.'],      'digits': [5]},
             'Morocco':            {'letters':[],            'digits': [5]},
             'Netherlands':        {'letters':[],            'digits': [4]},   # ' dddd az' | " ddddaz" | ' az dddd'
             'Norway':             {'letters':['N','NO'],    'digits': [4]},
             'Pakistan':           {'letters':[],            'digits': [5]},
             'Poland':             {'letters':[],            'digits': [2]},   # ' dd-ddd'
             'Portugal':           {'letters':['P'],         'digits': [4,7]},   # ' dddd-ddd'
             'Romania':            {'letters':[],            'digits': [6]},
             'Russian Federation': {'letters':[],            'digits': [6]},
             'Singapore':          {'letters':['Singapore'], 'digits': [6]},
             'Slovenia':           {'letters':['Sl','SI'],   'digits': [4]},
             'Spain':              {'letters':['E'],         'digits': [5]},
             'Sri Lanka':          {'letters':[],            'digits': [5]},
             'Sweden':             {'letters':['SE','S'],    'digits': [5]},
             'Switzerland':        {'letters':['CH'],        'digits': [4]},
             'Thailand':           {'letters':[],            'digits': [5]},
             'Togo':               {'letters':[],            'digits': [5]},
             'Tunisia':            {'letters':[],            'digits': [4]},
             'Turkey':             {'letters':[],            'digits': [5]},
             'Viet Nam':           {'letters':[],            'digits': [5,6]},
             }
# Escaping the dots and lowering the case of the letters to be used in the regexes
for country in ZIP_CODES.keys():
    ZIP_CODES[country]['letters'] = [x.replace(".", r"\.").lower() for x in ZIP_CODES[country]['letters']]


_KEEPING_PREFIX = ['UMR','ULR','FR']
KEEPING_PREFIX = [x.lower() for x in _KEEPING_PREFIX]
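
The way an entry of `ZIP_CODES` becomes a regex can be sketched as follows: one templated pattern is produced per allowed number of digits, and the patterns are joined into a single alternation. `ZIP_CODES_DEMO` is a hypothetical three-country subset of the global above (already lowercased and escaped) and `build_zip_regex` is an illustrative helper name, not a function of the package.

```python
import re
from string import Template

# Hypothetical subset of the 'ZIP_CODES' global, already lowercased and escaped.
ZIP_CODES_DEMO = {
    'France':  {'letters': ['f', 'fr'],  'digits': [5, 6]},
    'Mexico':  {'letters': [r'c\.p\.'],  'digits': [5]},
    'Romania': {'letters': [],           'digits': [6]},
}

ZIP_TEMPLATE = Template(r'\b($zip_letters)[\s-]?(\d{$zip_digits})\b')

def build_zip_regex(country, zip_codes=ZIP_CODES_DEMO):
    """Build one templated pattern per allowed digit count and join them with '|'."""
    letters_join = '|'.join(zip_codes[country]['letters'])
    patterns = [ZIP_TEMPLATE.substitute(zip_letters=letters_join, zip_digits=digits)
                for digits in zip_codes[country]['digits']]
    return re.compile('|'.join(patterns))

print(build_zip_regex('France').pattern)
print(bool(build_zip_regex('France').search(' f-38926 crolles')))
print(bool(build_zip_regex('Mexico').search(' c.p.31125')))
print(bool(build_zip_regex('Romania').search(' 077125')))
```

When the list of letters is empty, the first group matches the empty string, so any isolated word made of the expected number of digits is captured.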



In [None]:

import pandas as pd
from pathlib import Path

path_home = Path.home()

df = pd.DataFrame.from_dict(ZIP_CODES, orient = 'columns').T

file = path_home / Path('Temp/ZipCodes.xlsx')
df.to_excel(file)

In [None]:
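import ast

import pandas as pd
from pathlib import Path

path_home = Path.home()

file = path_home / Path('Temp/ZipCodes.xlsx')
# The country names are in the first column. The 'letters' and 'digits' lists
# are assumed to be read back from Excel as strings, hence the use of literal_eval.
df = pd.read_excel(file, index_col = 0)
ZIP_CODES = {country: {'letters': ast.literal_eval(row['letters']),
                       'digits': ast.literal_eval(row['digits'])}
             for country, row in df.iterrows()}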

In [None]:
# Testing regex for '_search_droping_zip' internal function of 'search_droping_items' function  

import re
from string import Template

country = 'Mexico'
text = " C.P.38000".lower()

zip_template = Template(r'\b($zip_letters)[\s-]?(\d{$zip_digits})\b')

#letters_list, digits_list = ZIP_CODES[country]['letters'], ZIP_CODES[country]['digits']
letters_list = ['c\\.p\\.']
#letters_list = ['cp']
digits_list = [5]
letters_join = '|'.join(letters_list) if len(letters_list) else ''
pattern_zip_list = [zip_template.substitute({"zip_letters": letters_join, "zip_digits":digits})
                    for digits in digits_list]     
re_zip = re.compile('|'.join(pattern_zip_list))
print('re_zip:',re_zip)

zip_result = False if (re.search(re_zip,text) is None) else True
print('zip_result:',zip_result)

In [None]:
# Testing '_search_droping_zip' internal function of 'search_droping_items' function  

def _search_droping_zip_test(text, country):
    '''The `_search_droping_zip_test` function searches in 'text' for words 
    similar to zip codes except those beginning with a prefix from the 'KEEPING_PREFIX' global
    followed by 4 digits using case-sensitive regexes. 
    Regex for zip-codes search uses the 'ZIP_CODES' dict global for countries from 'ZIP_CODES.keys()'.
    Specific regexes are set for 'United Kingdom' and 'United States'.

    Args:
        text (str): the string where the words are searched.
        country (str): the string that contains the country.

    Returns:
        (boolean): True if a word similar to a zip code or embedding digits is found
                   and no word beginning with a 'KEEPING_PREFIX' prefix followed by 4 digits is found.

    '''
    
    # Setting regex for zip-codes search
    if country in ZIP_CODES.keys():
        print(country)
        zip_template = Template(r'\b($zip_letters)[\s-]?(\d{$zip_digits})\b')
        letters_list, digits_list = ZIP_CODES[country]['letters'], ZIP_CODES[country]['digits']
        letters_join = '|'.join(letters_list) if len(letters_list) else ''
        pattern_zip_list = [zip_template.substitute({"zip_letters": letters_join, "zip_digits":digits})
                            for digits in digits_list]     
        re_zip = re.compile('|'.join(pattern_zip_list))
        
    elif country == 'United Kingdom':
        print('United Kingdom')
        # Capturing: for instance, " BT7 1NN" or " WC1E 6BT" or " G128QQ"
        #            " a# #a", " a# #az", " a# ##a", " a# ##az",
        #            " a##a", " a##az", " a###a", " a###az",
        #
        #            " a#a #a", " a#a #az", " a#a ##a", " a#a ##az",
        #            " a#a#a", " a#a#az", " a#a##a", " a#a##az",
        #
        #            " a## #a", " a## #az", " a## ##a", " a## ##az",
        #            " a###a", " a###az", " a####a", " a####az",
        #            
        #            " a##a #a", " a##a #az", " a##a ##a", " a##a ##az",
        #            " a##a#a", " a##a#az", " a##a##a", " a##a##az",
        #            
        #            " az# #a", " az# #az", " az# ##a", " az# ##az",
        #            " az##a", " az##az", " az###a", " az###az",
        #
        #            " az#a #a", " az#a #az", " az#a ##a", " az#a ##az",
        #            " az#a#a", " az#a#az", " az#a##a", " az#a##az",
        #
        #            " az## #a", " az## #az", " az## ##a", " az## ##az",
        #            " az###a", " az###az", " az####a", " az####az",
        #
        #            " az##a #a", " az##a #az", " az##a ##a", " az##a ##az",
        #            " az##a#a", " az##a#az", " az##a##a", " az##a##az",
        
        re_zip = re.compile(r'^\s?[a-z]{1,2}\d{1,2}[a-z]{0,1}\s?\d{1,2}[a-z]{1,2}$')
        
    elif country == 'United States':
        print('United States')
        # Capturing: for instance, " NY" or ' NI BT48 0SG' or " ON K1N 6N5" 
        #            " az" or " az " + 6 or 7 characters in 2 parts separated by spaces
        
        re_zip = re.compile(r'^\s?[a-z]{2}$' + '|' + r'^\s?[a-z]{2}\s[a-z0-9]{3,4}\s[a-z0-9]{2,3}$')
        
    else:
        # Without this branch 're_zip' would be undefined for an unknown country
        print('country not found:', country)
        return False
        
    #print(re_zip)
    
    # Setting search regex of embedding digits
    re_digits = re.compile(r'\s\d+(-\d+)?\b'      # For instance capturing " 1234" in "azert 1234-yui_OP"
                                                  # or " 1" in "azert 1-yui_OP" or " 1-23" in "azert 1-23-yui".
                           + '|'
                           + r'\b[a-z]+\d{2,}\b') # For instance capturing "azert12" in "azert12 UI_OPq" 
                                                  # or "azerty1234567" in "azerty1234567 ui_OPq".

    # Setting search regex of keeping-prefix
    # for instance, capturing "umr1234" in "azert UMR1234 YUI_OP" or "fr1234" in "azert-fr1234 Yui_OP".
    prefix_template = Template(r'\b$prefix[-]?\d{4}\b')
    pattern_prefix_list = [prefix_template.substitute({"prefix": prefix})
                           for prefix in KEEPING_PREFIX]   
    re_prefix = re.compile('|'.join(pattern_prefix_list))
    #print(re_prefix)              

    flag = False
    prefix_result = False if (re.search(re_prefix,text) is None) else True
    if prefix_result and verbose: print('Keeping prefix: True')
    zip_result = False if (re.search(re_zip,text) is None) else True
    digits_result = False if (re.search(re_digits,text) is None) else True
    if not prefix_result and (zip_result or digits_result):
        if verbose:
            print('Dropping word: zip code' if zip_result else 'Dropping word: digits code')
        flag = True
    return flag

#########################################################################

# Standard library import
import re
from string import Template

# Local imports
#from BiblioAnalysis_Utils.BiblioSpecificGlobals import KEEPING_PREFIX
#from BiblioAnalysis_Utils.BiblioSpecificGlobals import ZIP_CODES

verbose = True

country = 'France'

text = ' FR-38900'.lower()

print('Droping zip result:', _search_droping_zip_test(text, country))
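
The role of the keeping-prefix guard can be illustrated separately from the zip-code search: a chunk that embeds digits is dropped unless it contains a laboratory-like identifier such as "umr5271". The sketch below uses a hypothetical copy of the prefix list (`KEEPING_PREFIX_DEMO`) and an illustrative helper `drop_for_digits`. It only reproduces the prefix and digits tests and leaves the zip-code test aside.

```python
import re
from string import Template

KEEPING_PREFIX_DEMO = ['umr', 'ulr', 'fr']   # hypothetical lowercased copy of the global

PREFIX_TEMPLATE = Template(r'\b$prefix[-]?\d{4}\b')
RE_PREFIX = re.compile('|'.join(PREFIX_TEMPLATE.substitute(prefix=p)
                                for p in KEEPING_PREFIX_DEMO))
RE_DIGITS = re.compile(r'\s\d+(-\d+)?\b' + '|' + r'\b[a-z]+\d{2,}\b')

def drop_for_digits(chunk):
    """Return True if the chunk embeds digits and holds no keeping prefix."""
    chunk = chunk.lower()
    if RE_PREFIX.search(chunk):
        return False                     # e.g. "umr5271" or "fr3344" are kept
    return RE_DIGITS.search(chunk) is not None

samples = [' LOCIE UMR5271', ' FedESol FR3344', ' FR38900', ' Batiment 12', ' Crolles']
for sample in samples:
    print(repr(sample), '->', drop_for_digits(sample))
```

Here " FR38900" is dropped because the prefix pattern requires exactly 4 digits followed by a word boundary, whereas " FedESol FR3344" is kept.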


In [None]:
# For testing the affiliations list extraction on specific addresses

# " Facultad de Ciencias Químicas, University Autónoma de Chihuahua, \
#Campus Universitario #2, Circuito Universitario, Chih, Chihuahua, C.P.31125, Mexico"

# " Laboratory of Multifunctional Materials and Structures, \
#National Institute of Materials Physics, Atomistilor Str. 405A, Magurele, 077125, Romania"

#" STMicroelectronics (Crolles 2) SAS"

#" University Savoie Mont Blanc, LOCIE UMR CNRS/USMB 5271, FédESol FR3344, Bâtiment Hélios, \
#Avenue du Lac Léman, Le Bourget-du-LacF-73376, France"

#" University Autónoma de Chihuahua, Campus Universitario #2, Mexico"

#" SOLIDpower S.p.A, Mezzolombardo38017, Italy"

#" STMicroelect Crolles 2 SAS, 850 Rue Jean Monnet, F-38926 Crolles, France"

#" Faculty of Electrical Engineering, University of Ljubljana, Ljubljana, SI-1000, Slovenia"


raw_address = " Faculty of Electrical Engineering, University of Ljubljana, Ljubljana, SI-1000, Slovenia"
std_address = address_standardization(raw_address)
(affiliations_list,country,affiliation_drop) = get_affiliations_list(std_address, verbose = True)
print('country:',country)
print('affiliations_list:',affiliations_list)
print('affiliation_drop:',affiliation_drop)


In [None]:
# Snippet to check the robustness of the affiliations list extraction
# from all addresses of a wos or scopus corpus

# Standard library import
from pathlib import Path

# 3rd party imports
import pandas as pd

# Internal imports
import BiblioAnalysis_Utils as bau

find_address = lambda  affiliation : bau.RE_ADDRESS.findall(affiliation)

for type_corpus in ["wos","scopus"]:
    for year in ['2018','2019','2020','2021']:

        path_home = Path.home()

        # read the 'Affiliations' columns of the df corpus with one row per address
        if type_corpus == "wos":
            file = path_home / Path('BiblioMeter_Files/' + year + '/Corpus/wos/rawdata/savedrecs.txt')
            df_corpus = bau.read_database_wos(file)
            df_affiliations_raw = df_corpus['C1'].apply(find_address).explode()
        elif type_corpus == "scopus":
            file = path_home / Path('BiblioMeter_Files/' + year + '/Corpus/scopus/rawdata/scopus.csv')
            df_corpus = bau.read_database_scopus(file)
            df_affiliations_raw = df_corpus['Affiliations'].apply(lambda x: x.split(';')).explode()
        else:
            raise Exception(f"unknown corpus type :{type_corpus}. Must be wos or scopus")

        df_affiliations_std = df_affiliations_raw.apply(address_standardization)
        df_affiliations_list = df_affiliations_std.apply(get_affiliations_list)

        df_affiliations = pd.concat([df_affiliations_raw,df_affiliations_std,df_affiliations_list], axis = 1)
        df_affiliations.columns = ['Raw address','Standard address', 'Affiliations tuple']

        df_affiliations[['Affiliations', 'Country', 'Affiliation_drop']] = pd.DataFrame(df_affiliations['Affiliations tuple'].to_list(), \
                                                                                        index=df_affiliations.index)

        # Save the results as an Excel file
        file = path_home / Path('Temp/affiliations_' + type_corpus + '_' + year + '.xlsx')
        df_affiliations.to_excel(file,index=False)
        df_affiliations

# Save the results as csv files
#file_raw = r'c:\Temp\affiliations_raw.csv'
#df_affiliations_raw.to_csv(file_raw,index=False)
#file_uniform = r'c:\Temp\affiliations_uniform.csv'
#df_affiliations_std.to_csv(file_uniform,index=False)

### Following cells to be deeply updated

In [None]:
#Testing "address_inst_full_list" function for "_build_authors_countries_institutions_scopus" function

# Standard library imports
import itertools
import re
from collections import namedtuple
from pathlib import Path
from string import Template

# 3rd party imports
import pandas as pd
from colorama import Fore
from fuzzywuzzy import process

# Local imports
import BiblioAnalysis_Utils as bau
# from BiblioAnalysis_Utils.BiblioParsingInstitutions import affiliation_uniformization            !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# from BiblioAnalysis_Utils.BiblioParsingInstitutions import address_inst_full_list                !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
from BiblioAnalysis_Utils.BiblioParsingInstitutions import build_institutions_dic
from BiblioAnalysis_Utils.BiblioParsingUtils import country_normalization

# Globals, namedtuples and templates  used in "_build_authors_countries_institutions_scopus" function

rep_utils = Path(bau.REP_UTILS)

author_address_tup = namedtuple('author_address','author address')
    
template_inst = Template('[$symbol1]?($inst)[$symbol2].*($country)(?:$$|;)')

addr_country_inst = namedtuple('address',['Pub_id',
                                 'Idx_author',
                                 'Address',
                                 'Country',
                                 'Norm_institutions',
                                 'Raw_institutions',])


# End of globals, namedtuples and templates  used in "_build_authors_countries_institutions_scopus" function

# scopus
pub_id = 6 # 1, 3, 6, 7, 46, 55, 164, 177

affiliations = affiliations_dic[pub_id]
authors_affiliations = authors_affiliations_dic[pub_id]

# Initializations for test
inst_dic = build_institutions_dic(rep_utils, dic_inst_filename = None)

# Part of "_build_authors_countries_institutions_scopus" function
list_addr_country_inst = [] 

idx_author, last_author = -1, '' # Initialization for the author and address counter

list_affiliations = affiliations.split(';')
list_authors_affiliations = authors_affiliations.split(';')

for x in list_authors_affiliations:
    author = (','.join(x.split(',')[0:2])).strip()
    if last_author != author:
        idx_author += 1
    last_author = author
    
    author_list_addresses = ','.join(x.split(',')[2:])
    author_address_list_raw = []
    for affiliation_raw in list_affiliations:
        if affiliation_raw in author_list_addresses:
            affiliation = affiliation_uniformization_test(affiliation_raw)
            author_address_list_raw.append(affiliation)

    # Parsing the addresses once all the affiliations of the author have been collected
    for address in author_address_list_raw:
        print('address',address)
        author_country_raw = address.split(',')[-1].strip()
        author_country = country_normalization(author_country_raw)

        author_institutions_tup = address_inst_full_list_test(address, inst_dic)               # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
        list_addr_country_inst.append(addr_country_inst(pub_id,
                                                        idx_author,
                                                        address,
                                                        author_country,
                                                        author_institutions_tup.norm_inst_list,
                                                        author_institutions_tup.raw_inst_list,))
# End of part of "_build_authors_countries_institutions_scopus" function

for tup in list_addr_country_inst:
    print(tup.Pub_id)
    print(tup.Idx_author)
    print(tup.Address)
    print(tup.Country)
    print(tup.Norm_institutions)
    print(tup.Raw_institutions)
    print()
    print()
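
The author counter used in the loop above can be sketched on its own. Each ';'-separated item starts with "last name, initials" followed by an address, and the index is incremented only when the author changes between consecutive items. The function name `index_authors` and the sample string are assumptions made for illustration. The sample is not taken from a real corpus and, unlike the cell above, the address is stripped for readability.

```python
def index_authors(authors_affiliations):
    """Return a list of (author_index, author, address) tuples, one per ';'-separated item."""
    rows = []
    idx_author, last_author = -1, ''
    for item in authors_affiliations.split(';'):
        fields = item.split(',')
        author = ','.join(fields[0:2]).strip()      # "last name, initials"
        if author != last_author:
            idx_author += 1                         # new author met
        last_author = author
        address = ','.join(fields[2:]).strip()      # remaining fields form the address
        rows.append((idx_author, author, address))
    return rows

# Hypothetical sample string
sample = ("Doe, J., Lab A, Grenoble, France; "
          "Doe, J., Lab B, Lyon, France; "
          "Smith, A., Lab A, Grenoble, France")

for row in index_authors(sample):
    print(row)
```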

In [None]:
# For changing particularly encoded symbols
DIC_CHANGE_SYMB_test = {"&": "and",
                        "’": "'",   # Particular quote to standard quote
                        #".": "",
                        #"-": " ",   # To Do: to be tested from the point of view of the effect on raw institutions
                        #"§": " ",
                        #"(": " ",
                        #")": " ",
                        #"/": " ",
                        #"@": " "
                       } 

SYMB_CHANGE_test = str.maketrans(DIC_CHANGE_SYMB_test)

DIC_AMB_WORDS_test = {' des ': ' ', # Conflict with DES institution
                 ' @ ': ' ', # Management conflict with '@' between texts
                }

def affiliation_uniformization_test(affiliation_raw):    # Being reworked
    
    '''The `affiliation_uniformization` function aims at getting rid 
    of heterogeneous typing of affiliations. 
    It first replaces particular characters by standard ones 
    using 'DASHES_CHANGE' and 'SYMB_CHANGE' globals.
    Then, it substitutes by 'University' its aliases using specific 
    regular expressions set in 'RE_SUB' and 'RE_SUB_FIRST' globals.
    Finally, it removes accents using `special_symbol_remove` function.
    
    Args:
        affiliation_raw (str): the raw affiliation to be normalized.

    Returns:
        (str): the normalized affiliation.
        
    Notes:
        The globals 'DASHES_CHANGE' and 'SYMB_CHANGE' are imported
        from `BiblioGeneralGlobals` module of `BiblioAnalysis_Utils` package.
        The globals 'RE_SUB' and 'RE_SUB_FIRST' are imported
        from `BiblioSpecificGlobals` module of `BiblioAnalysis_Utils` package.
        The function `special_symbol_remove` is used from `BiblioParsingUtils` of `BiblioAnalysis_Utils` package.
        
    '''
    
    # Standard library imports
    import re
    
    # Local imports
    from BiblioAnalysis_Utils.BiblioParsingUtils import special_symbol_remove
    from BiblioAnalysis_Utils.BiblioGeneralGlobals import DASHES_CHANGE
    from BiblioAnalysis_Utils.BiblioGeneralGlobals import SYMB_CHANGE
    from BiblioAnalysis_Utils.BiblioSpecificGlobals import DIC_AMB_WORDS

    
    def _normalize_amb_words(text):  
        for amb_word in DIC_AMB_WORDS_test.keys():                                       #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
            text = text.replace(amb_word, DIC_AMB_WORDS_test[amb_word]).strip()          #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
        text = " ".join(text.split())
        return text
    
    affiliation_raw = _normalize_amb_words(affiliation_raw)
    affiliation_raw = affiliation_raw.translate(DASHES_CHANGE)
    affiliation_raw = affiliation_raw.translate(SYMB_CHANGE_test)                       #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    affiliation_uniform = special_symbol_remove(affiliation_raw, only_ascii=True, strip=True)  #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    #affiliation_uniform = standard_words_uniformization(affiliation_uniform)
    
    
    
    return affiliation_uniform
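
The two first steps of the function above (ambiguous-words removal, then symbol translation with `str.maketrans`) can be tried without the package. In this sketch the `_DEMO` dicts are hypothetical copies of the test globals, `normalize_affiliation` is an illustrative name, and the dash translation and the accent removal done by `special_symbol_remove` are left out.

```python
# Hypothetical copies of the test globals of the notebook
DIC_CHANGE_SYMB_DEMO = {"&": "and", "\u2019": "'"}   # "\u2019" is the particular quote
SYMB_CHANGE_DEMO = str.maketrans(DIC_CHANGE_SYMB_DEMO)
DIC_AMB_WORDS_DEMO = {' des ': ' ', ' @ ': ' '}

def normalize_affiliation(text):
    """Remove the ambiguous words, collapse the spaces, then translate the symbols."""
    for amb_word, replacement in DIC_AMB_WORDS_DEMO.items():
        text = text.replace(amb_word, replacement)
    text = " ".join(text.split())
    return text.translate(SYMB_CHANGE_DEMO)

print(normalize_affiliation(" Institut des Sciences & Techniques"))
print(normalize_affiliation("Ecole d\u2019Ingenieurs @ Grenoble"))
```

A translation table built from a dict accepts single characters as keys and strings of any length as values, which is why "&" can be expanded to "and" in a single pass.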

In [None]:
def address_inst_full_list_test(full_address, inst_dic):    # Being reworked

    '''The `address_inst_full_list` function allows building the affiliations list of a full address
    using the internal function `_check_institute` of `BiblioParsingUtils` module.
    
    Args:
       full_address (str): the full address to be parsed in institutions and country.
       inst_dic (dict): a dict used for the normalization of the institutions names, 
                        with the raw names as keys and the normalized names as values.
        
    Returns:
        (namedtuple): tuple of two strings. 
                      - The first is the joined list of normalized institutions names 
                      found in the full address.
                      - The second is the joined list of raw institutions names of the full address 
                      with no fully corresponding normalized names.
        
    Notes:
        The globals 'RE_ZIP_CODE' and 'EMPTY' are imported from `BiblioSpecificGlobals` module 
        of `BiblioAnalysis_Utils` package.
        The function `country_normalization` is imported from `BiblioParsingUtils` module
        of `BiblioAnalysis_Utils` package.
        
    '''
    
    # Standard library imports
    import re
    from collections import namedtuple
    from string import Template

    # 3rd party imports
    import pandas as pd
    from fuzzywuzzy import process
    
    # Local imports
    from BiblioAnalysis_Utils.BiblioParsingUtils import country_normalization
    from BiblioAnalysis_Utils.BiblioSpecificGlobals import INST_BASE_LIST
    from BiblioAnalysis_Utils.BiblioSpecificGlobals import RE_ZIP_CODE
    from BiblioAnalysis_Utils.BiblioSpecificGlobals import EMPTY

    inst_full_list_ntup = namedtuple('inst_full_list_ntup',['norm_inst_list','raw_inst_list'])
    
    country_raw = full_address.split(",")[-1].strip()
    country = country_normalization(country_raw)
    add_country = "_" + country
    
    if RE_ZIP_CODE.findall(full_address):
        address_to_keep = re.sub(RE_ZIP_CODE,"",full_address) + ","
    else:
        address_to_keep = ", ".join(full_address.split(",")[:-1])    
    address_to_keep = address_to_keep.lower()
    print('address_to_keep:',address_to_keep)
    
    # Building the list of normalized institutions which raw institutions are found in 'address_to_keep' 
    # and building the corresponding list of raw institutions found in 'address_to_keep'
    norm_inst_full_list = [] 
    raw_inst_found_list = []
    for raw_inst, norm_inst in inst_dic.items():        
        raw_inst_lower = raw_inst.lower()
        raw_inst_split = raw_inst_lower.split()                     
        if _check_institute_test(address_to_keep,raw_inst_split):          #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!  
            norm_inst_full_list.append(norm_inst + add_country)            
            raw_inst_found_list.append(raw_inst_lower)

    # Cleaning 'raw_inst_found_list' from partial institution names
    for inst_base in INST_BASE_LIST:
        if (inst_base.lower() in raw_inst_found_list) and (inst_base.lower()+"," not in address_to_keep) : 
            raw_inst_found_list.remove(inst_base.lower())
    for raw_inst_found in raw_inst_found_list:
        other_raw_inst_found_list = raw_inst_found_list.copy()
        other_raw_inst_found_list.remove(raw_inst_found)        
        for other_raw_inst_found in other_raw_inst_found_list:
            if raw_inst_found in other_raw_inst_found:
                raw_inst_found_list = other_raw_inst_found_list

    # Removing 'raw_inst_found_list' items from 'address_to_keep'             
    for raw_inst_found in raw_inst_found_list:            
        if raw_inst_found in address_to_keep: 
            address_to_keep = address_to_keep.replace(raw_inst_found,"")  
            address_to_keep = ", ".join(x.strip() for x in address_to_keep.split(","))

    while (address_to_keep and address_to_keep[0] == "-"): address_to_keep = address_to_keep[1:]
    address_to_keep = address_to_keep.replace("-", " ")
    address_to_keep = " ".join(x.strip() for x in address_to_keep.split(" ") if x!="")
    print('address_to_keep:',address_to_keep)        
            
    # Building the list of raw institutions remaining in 'address_to_keep'
    raw_inst_full_list = [x.strip() + add_country for x in address_to_keep.split(",") if (x!="" and x!=" ")]

    # Building a string from the final list of raw institutions  
    if raw_inst_full_list:
        raw_inst_full_list_str = ";".join(raw_inst_full_list)       
    else:
        raw_inst_full_list_str = EMPTY 
    print('raw_inst_full_list_str:',raw_inst_full_list_str)
    
    # Building a string from the final list of normalized institutions without duplicates
    norm_inst_full_list = list(set(norm_inst_full_list))
    if norm_inst_full_list:
        norm_inst_full_list_str = ";".join(norm_inst_full_list)
    else:
        norm_inst_full_list_str = EMPTY 
    print('norm_inst_full_list_str:', norm_inst_full_list_str)
    print()
    
    # Setting the namedtuple to return
    inst_full_list_tup =  inst_full_list_ntup(norm_inst_full_list_str,raw_inst_full_list_str) 
    
    return inst_full_list_tup
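
The cleaning of partial institution names performed in the function above aims at discarding a raw name when it is contained in a longer raw name that was also found (for instance 'cea' when 'cea liten' is present). A minimal sketch of that intent is given below. It uses a list comprehension instead of the copy-and-remove loop of the notebook, so it is an alternative formulation and not the code of the package. The helper name `drop_partial_names` and the sample list are assumptions.

```python
def drop_partial_names(names):
    """Keep only the names that are not contained in another, different name of the list."""
    return [name for name in names
            if not any(name != other and name in other for other in names)]

# Hypothetical list of raw institution names found in an address
found = ['cea', 'cea liten', 'cnrs', 'grenoble inp']
print(drop_partial_names(found))
```

The comparison is a plain substring test, so a short name is also discarded when it appears inside an unrelated longer word; the word-boundary rules of `_check_institute` are not applied here.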

In [None]:
def _check_institute_test(address,raw_inst_split):    # Being reworked

    r'''The function `_check_institute` checks if all the words contained in the list 'raw_inst_split'
    are part of the string 'address'.
    
    A word is defined as a string beginning and ending with a letter.
    For instance, 'cea-leti' or 'Laue-Langevin' are words but not 'Kern-' or '&aaZ)'.
    
    The regexp used is based on the following rules:
        - Alphanumerical (AM) characters are {a…z,A…Z,0…9,_}.
        - Non-alphanumerical (NAM) characters are all other characters.
        - '\b' detects transition between NAM and AM such as '@a', '<space>a', 'a-', '(a', 'a.', etc.
        - '\B' detects transition between AM and AM such as '1a', 'za', 'a_', '_a', etc.
    
    Matches are found between a word 'WORD' of the list `raw_inst_split` and a substring 'WORDS'
    of the string 'address' in 4 cases:

        - 'WORD' matches in '...dWORD' or '...dWORDd...' by the regexp "'\d+\BWORD\B\d+'". 
        ex: 'UMR' matches in 'a6UMR7040.', '@6UMR', etc. 
                  doesn't match in 'aUMR', '6UMR-CNRS', '6@UMR7040', etc.

        - 'WORD' matches '...dWORD*...'  by the regexp "'\d+\BWORD\b'".  
        ex: 'UMR' matches in 'a6UMR-CNRS', etc.
                  doesn't match in '@6UMRCNRS', '@UMR-CNRS', etc.

        - 'WORD' matches in '...*WORD*...' by the regexp "'\bWORD\b'". 
        ex: 'UMR' matches in '(UMR)', '@UMR-CNRS', etc. where 'UMR' is between NAM characters
                  doesn't match in 'UMR7040', '@6UMR_CNRS', etc. where at least one NAM character is missing around it.

        - 'WORD' matches in '...*WORDd...', by the regexp "'\bWORD\B\d+'"
        ex: 'UMR' matches in '%(UMR43)+#', '6@UMR7040', 'CNRS-UMR7040', etc.
                  doesn't match in 'CNRS_UMR7040', '#UMR_CNRS', etc.

    where '...' stands for any characters, 'd' stands for a digit and '*' stands for an NAM character. 
    
    The match is case insensitive.
    
    According the mentionned 4 rules an isolated '&' such as in 'Art & Metiers' 
    and words ending by a minus (ex: Kern-) are not catched. 
    A specific pretreatment of `raw_inst_split` and `address` should be done before calling this function.
    
    Examples:
        - _check_institute(['den'],'dept. of energy conversion, university of denmark,') is False.
        - _check_institute(['den','dept'],'DEN dept. of energy conversion, university of denmark,') is True.
    
    Args:
        address (str): the address where to check the matching of words.
        raw_inst_split (list): list of words to be found to match in the string 'address'.
    
    Returns:
        (boolean): 'True' for a full match.
                   'False' otherwise.
    '''
    # Standard library imports
    import re
    from string import Template
    
    raw_inst_split = list(set(raw_inst_split))
    
    # Taking care of the potential isolated special characters '&' and '-'   
    raw_inst_split = [x.replace("&","and") for x in raw_inst_split]    
    raw_inst_split_init = raw_inst_split.copy()

    # Removing small words
    small_words_list = ['a','et','de','and','for','of','the']
    for word in small_words_list:
        if word in raw_inst_split_init:
            raw_inst_split.remove(word)
    
    raw_inst_split_init = raw_inst_split.copy()
    for word in raw_inst_split_init:
        if len(word)==1:
            raw_inst_split.remove(word)
    
    # Adding \ to escape regexp reserved char
    for char in ["$",'(',')','[',']','^','-']: 
        escaped_char = '\\'+ char
        raw_inst_split = [x.replace(char,escaped_char) for x in raw_inst_split]
        
    items_number = len(raw_inst_split)

    # Building the 're_inst' regexp searching for 'raw_inst_split' items in 'address' 
    dic = {"inst"+str(i):inst for i,inst in enumerate(raw_inst_split)}

    template_inst = Template(r'|'.join([r'\d+\B$inst'+ str(i) + r'\B\d+'
                                      + '|'
                                      + r'\d+\B$inst' + str(i) + r'\b'
                                      + '|'
                                      + r'\b$inst' + str(i) + r'\b'
                                      + '|'
                                      + r'\b$inst' + str(i) + r'\B\d+'
                                      for i in range(items_number)]))

    re_inst = re.compile(template_inst.substitute(dic), re.IGNORECASE)

    # Checking match of 'raw_inst_split' items in 'address' using 're_inst' regexp
    items_set = set(re_inst.findall(address))
    return len(items_set) == items_number
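
The four regexp cases described in the docstring above can be tried in isolation with the standard `re` module. This is only a minimal sketch: the helper `word_matches` is a hypothetical name that is not part of BiblioAnalysis_Utils, it handles a single word, and it uses `re.escape` in place of the manual escaping done in the function.

```python
import re

def word_matches(word, address):
    """Hypothetical helper: True if `word` is found in `address` by one of the four patterns."""
    w = re.escape(word)  # escapes regexp special characters such as '(' or '-'
    patterns = [
        rf'\d+\B{w}\B\d+',  # case 1: digits directly before and after the word
        rf'\d+\B{w}\b',     # case 2: digits before, word boundary after
        rf'\b{w}\b',        # case 3: word boundary on both sides
        rf'\b{w}\B\d+',     # case 4: word boundary before, digits after
    ]
    regex = re.compile('|'.join(patterns), re.IGNORECASE)
    return regex.search(address) is not None

for address in ['a6UMR7040.', 'a6UMR-CNRS', '@UMR-CNRS', 'CNRS-UMR7040', 'aUMR', 'CNRS_UMR7040']:
    print(f'{address!r:16} -> {word_matches("UMR", address)}')
```

The first four addresses each exercise one of the four cases; the last two are rejected because a letter or an underscore directly precedes 'UMR'.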

# I- Merging corpuses from different databases for single year corpus analysis
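
The cell of this section creates the nested output folders level by level with `os.mkdir`. As a possible alternative, `Path.mkdir` can create a whole folder chain in one call. The sketch below assumes made-up folder names ('concat', 'parsing') standing in for the values of `bau.FOLDER_NAMES`, and the helper `ensure_folder` is hypothetical.

```python
import tempfile
from pathlib import Path

def ensure_folder(path):
    """Hypothetical helper: create `path` and its missing parents, do nothing if it exists."""
    path = Path(path)
    path.mkdir(parents=True, exist_ok=True)
    return path

# Demonstration in a temporary directory so that nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_folder(Path(tmp) / 'concat' / 'parsing')  # made-up folder names
    created = target.is_dir()
    ensure_folder(target)  # a second call on an existing folder is harmless

print('Folder created:', created)
```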

In [None]:
# Standard libraries import
import json
import os
import re
from pathlib import Path

# Local imports
import BiblioAnalysis_Utils as bau

# Setting global aliases
concat_folder_alias = bau.FOLDER_NAMES['concat']
corpus_folder_alias = bau.FOLDER_NAMES['corpus']
rational_folder_alias = bau.FOLDER_NAMES['dedup']
parsing_folder_alias = bau.FOLDER_NAMES['parsing']
rawdata_folder_alias = bau.FOLDER_NAMES['rawdata']
scopus_folder_alias = bau.FOLDER_NAMES['scopus']
wos_folder_alias = bau.FOLDER_NAMES['wos']

# Setting global regular expressions
re_year = bau.RE_YEAR

# Setting default values of inputs
ref_files_check ='y' 

# Selecting corpus folder (year)
corpusfiles_list =[file for file in os.listdir(corpuses_folder) if re_year.findall(file)]
corpusfiles_list.sort()
print('Please select the corpus via the tk window')
myprojectname = bau.Select_multi_items(corpusfiles_list,'single',2)[0]+'/'
project_folder = corpuses_folder / Path(myprojectname) / Path(corpus_folder_alias) 
print(bold_text + f'\nThe corpus folder selected is: {project_folder}' + light_text)

# Setting the useful paths
path_scopus_parsing = project_folder / Path(scopus_folder_alias) / Path(parsing_folder_alias)
path_scopus_rawdata = project_folder / Path(scopus_folder_alias) / Path(rawdata_folder_alias) 
path_wos_parsing = project_folder / Path(wos_folder_alias) / Path(parsing_folder_alias)
path_wos_rawdata = project_folder / Path(wos_folder_alias) / Path(rawdata_folder_alias) 
path_concat = project_folder / Path(concat_folder_alias)
path_concat_parsing = path_concat / Path(parsing_folder_alias)
if not os.path.exists(path_concat_parsing):
    if not os.path.exists(path_concat): os.mkdir(path_concat)
    os.mkdir(path_concat_parsing)
path_rational = project_folder / Path(rational_folder_alias)
path_rational_parsing = path_rational / Path(parsing_folder_alias)
if not os.path.exists(path_rational_parsing):
    if not os.path.exists(path_rational): os.mkdir(path_rational)
    os.mkdir(path_rational_parsing)

# Setting the useful-paths lists     
database_list = [(bau.WOS, path_wos_parsing, path_wos_rawdata), (bau.SCOPUS, path_scopus_parsing, path_scopus_rawdata)]
useful_path_list = [path_scopus_parsing,path_wos_parsing,path_concat_parsing,path_rational_parsing] 

# Checking availability of parsing for each of the corpuses to be merged
parsing_files_check = input('\n Parsings available for each corpus (y/n, default: n) ?')
if parsing_files_check == 'n' or parsing_files_check == '':
    # Get the folder for the general files
    # and specific files for scopus type database in this folder
    rep_package = Path('BiblioAnalysis_Utils')
    rep_utils = Path(bau.REP_UTILS) 
    actual_folder = Path.cwd()
    print('\nWorking directory: ', actual_folder)
    print('Default folder for the reference files: ', actual_folder / rep_package / rep_utils)  
    
    ref_files_check = input(' Are all reference files in this folder (y/n, default: y) ?')
    if ref_files_check == '': ref_files_check = 'y'
    if ref_files_check == 'y':
        for database_tup in database_list:
            database_type = database_tup[0]
            database_parsing_path = database_tup[1]
            database_rawdata_path = database_tup[2]
                # Folder containing the wos or scopus file to process
            in_dir_parsing = database_rawdata_path

                # Folder containing the output files of the data parsing 
            out_dir_parsing = database_parsing_path 
            if not os.path.exists(out_dir_parsing):
                os.mkdir(out_dir_parsing)

            ## Running function biblio_parser
            inst_filter_list_init = None
            bau.biblio_parser(in_dir_parsing, out_dir_parsing, database_type, expert, rep_utils, inst_filter_list_init)

            # Useful printings
            with open(Path(out_dir_parsing) / Path(bau.PARSING_PERF), 'r') as failed_json:
                    data_failed=failed_json.read()
            dic_failed = json.loads(data_failed)
            articles_number = dic_failed["number of article"]
            print('\n' + bold_text + f'Parsing processed on full {database_type} corpus' + light_text)
            print("\n Success rates")
            del dic_failed['number of article']
            for item, value in dic_failed.items():
                print(f'    {item}: {value["success (%)"]:.2f}%')
    
    else:
        print('\n' + bold_text + 'Please put all reference files in the specified folder and run the cell again' + light_text)
        
if ref_files_check == 'y': 
    second_inst = input("\n Secondary institutions to be parsed (y/n, default = y)? ")
    if second_inst == '': second_inst = 'y' 
        
    # Setting the specific affiliations filter     
    if second_inst == 'y':   
        inst_filter_list = bau.INST_FILTER_LIST
        print(f' Default secondary institutions filter is: {inst_filter_list}' )
        change_inst_filter = input(" Do you want to change it (y/n, default = n)? ")
        if change_inst_filter == '': change_inst_filter = 'n'
        if change_inst_filter == 'y':
            # Setting the specific affiliations filter 
            inst_filter_list = bau.setting_secondary_inst_filter(path_concat_parsing)
    else:
        inst_filter_list = None
    
    # Concatenating and deduplicating parsing saved in 'project_folder' folder
    bau.parsing_concatenate_deduplicate(useful_path_list, inst_filter_list )
    print(bold_text + '\nCell-run completed' + light_text)


# II- Single year corpus analysis

## &emsp;&emsp;II-1 Selection of the corpus file for BiblioAnalysis

In [None]:
# Standard library imports
import os
from pathlib import Path

# Local imports
import BiblioAnalysis_Utils as bau

## Selection of corpus file
corpusfiles_list = os.listdir(corpuses_folder)
corpusfiles_list.sort()
print('Please select the corpus via the tk window')
myprojectname = bau.Select_multi_items(corpusfiles_list,'single',2)[0]+'/'
project_folder = corpuses_folder /Path(myprojectname)
database_type = input('Corpus file type (scopus, wos - default: "wos")? ')
if not database_type: database_type = 'wos' 
project_folder = corpuses_folder / Path(myprojectname) / Path(database_type)

 # Get the folder for the general files
 # and specific files for scopus type database in this folder
rep_package = Path('BiblioAnalysis_Utils')
rep_utils = Path(bau.REP_UTILS) 
actual_folder = Path.cwd()
print('\nWorking directory: ', actual_folder)
print('Default folder for the reference files: ', actual_folder / rep_package / rep_utils)   
ref_files_check = input(' Are all reference files in this folder (y/n, default: y) ?')
if ref_files_check == '': ref_files_check = 'y'
if ref_files_check == 'y':
    ## Setting the  graph main heading
    digits_list = list(filter(str.isdigit, myprojectname))
    corpus_year = ''.join(digits_list)
    init = str(corpuses_folder).rfind("_")+1
    corpus_state = str(corpuses_folder)[init:]
    main_heading = corpus_year + ' Corpus:' + corpus_state

    ## Printing useful information
    dict_print = {'Specific-paths set for user:': user_id,
                  'Project folder:': project_folder,
                  'Corpus year:': corpus_year,
                  'Corpus status:': corpus_state,
                  'Project name:': myprojectname,
                  'Corpus file type:':database_type}

    pad = 3
    max_len_str = max( [len(str(x)) for x in dict_print.keys()]) + pad
    print('\n')
    for key,val in dict_print.items():
        print(key.ljust(max_len_str),val)
    print('\n' + bold_text + 'Cell-run completed' + light_text)
    
else:
    print('\n' + bold_text +'Please put all reference files in the specified folder' + light_text)


## &emsp;&emsp;II-2 Data parsing

In [None]:
# Standard libraries import
import json
import os
from pathlib import Path

# Local imports
import BiblioAnalysis_Utils as bau

## Building the names of the useful folders

    # Folder containing the wos or scopus file to process
in_dir_parsing = project_folder / Path(bau.FOLDER_NAMES['rawdata'])

    # Folder containing the output files of the data parsing 
out_dir_parsing = project_folder / Path(bau.FOLDER_NAMES['parsing'])
if not os.path.exists(out_dir_parsing):
    os.mkdir(out_dir_parsing)

## Running function biblio_parser
parser_done = input("Parsing available (y/n)? ")
if parser_done == "n":
    inst_filter_list_init = None
    bau.biblio_parser(in_dir_parsing, out_dir_parsing, database_type, expert, rep_utils, inst_filter_list_init)
         
    second_inst = input("Secondary institutions to be parsed (y/n)? ")
    if second_inst=='y': 
        inst_filter_list = bau.INST_FILTER_LIST
        print(f' Default secondary institutions filter is: {inst_filter_list}' )
        change_inst_filter = input(" Do you want to change it (y/n, default = n)? ")
        if change_inst_filter == '': change_inst_filter = 'n'
        if change_inst_filter == 'y':
            # Setting the specific affiliations filter 
            inst_filter_list = bau.setting_secondary_inst_filter(out_dir_parsing)
        # Extending the author with institutions parsing file
        bau.extend_author_institutions(out_dir_parsing, inst_filter_list)
    
    # Useful printings
    with open(Path(out_dir_parsing) / Path(bau.PARSING_PERF), 'r') as failed_json:
            data_failed=failed_json.read()
    dic_failed = json.loads(data_failed)
    articles_number = dic_failed["number of article"]
    print("Parsing processed on full corpus")
    print("\n\nSuccess rates")
    del dic_failed['number of article']
    for item, value in dic_failed.items():
        print(f'    {item}: {value["success (%)"]:.2f}%')
else:
    second_inst = input("Secondary institutions to be parsed (y/n)? ")
    if second_inst=='y' : 
        # Setting the specific affiliations filter 
        inst_filter_list = bau.setting_secondary_inst_filter(out_dir_parsing)
        # Extending the author with institutions parsing file
        bau.extend_author_institutions(out_dir_parsing, inst_filter_list)
        
    parser_filt = input("Parsing available without rawdata -from filtering- (y/n)? ")
    if parser_filt == "n": 
        # Reading json file of parsing performances
        with open(Path(out_dir_parsing) / Path(bau.PARSING_PERF), 'r') as failed_json:
            data_failed=failed_json.read()
        dic_failed = json.loads(data_failed)
        articles_number = dic_failed["number of article"]
        # Useful printings
        print("Parsing available from full corpus")
        print("\n\nSuccess rates")
        del dic_failed['number of article']
        for item, value in dic_failed.items():
            print(f'    {item}: {value["success (%)"]:.2f}%')
    else:
        #clear_output(wait=True)
        print("Parsing available from filtered corpus without rawdata")
        file = out_dir_parsing / Path('articles.dat')
        with open(file) as f:
            lines = f.readlines()
        articles_number = len(lines)

print("\n\nCorpus parsing saved in folder:\n", str(out_dir_parsing))
print('\nNumber of articles in the corpus : ', articles_number)
print('\n' + bold_text + 'Cell-run completed' + light_text)


###  &emsp;&emsp;II-2.1 Data parsing / Corpus description

In [None]:
# Standard libraries import
import os
import json
from pathlib import Path
from IPython.display import clear_output

# Local imports
import BiblioAnalysis_Utils as bau

## Building the names of the useful folders

    # Folder containing the wos or scopus parsed files
in_dir_corpus = out_dir_parsing

    # Folder containing the wos or scopus parsed and analysed files
out_dir_corpus = project_folder / Path(bau.FOLDER_NAMES['description'])
if not os.path.exists(out_dir_corpus):
    os.mkdir(out_dir_corpus)    

## Running describe_corpus
description_done = input("Description available (y/n)? ")
#clear_output(wait=True)
if description_done == "n":
    verbose = False
    bau.describe_corpus(in_dir_corpus, out_dir_corpus, database_type, verbose)
    print("Corpus description saved in folder:", str(out_dir_corpus))
else:
    print("Corpus description available in folder:", str(out_dir_corpus))

# Building the name of file for histogram plot of an item
fullpath_distrib_item = out_dir_corpus / Path(bau.DISTRIBS_ITEM_FILE)

## Running plot of treemap, scatter plot and histogram for a selected item_treemap
do_treemap = input("Treemap for an item of the corpus description (y/n)? ")
if do_treemap == 'y':
    renew_treemap = 'y'
    while renew_treemap == 'y' :
        print("Choose the item for treemap in the tk window")
        item_treemap = bau.treemap_item_selection()
        fullpath_file_treemap = out_dir_corpus / Path('freq_'+ item_treemap +'.dat')
        print("Item selected:",item_treemap)
        bau.treemap_item(item_treemap, fullpath_file_treemap)
        do_scatter = input("Scatter plot for the item (y/n)? ")
        if do_scatter == 'y':
            bau.plot_counts(item_treemap, fullpath_file_treemap)
        do_histo = input("Histogram plot for the item (y/n)? ")
        if do_histo == 'y':
            bau.plot_histo(item_treemap, fullpath_distrib_item)
        renew_treemap = input("\n\nTreemap for a new item (y/n)? ")

# Initialize the variable G_coupl that will receive the biblioanalysis coupling graphs
try: G_coupl
except NameError: G_coupl = None

print('\n' + bold_text + 'Cell-run completed' + light_text)


#### &emsp;&emsp;II-2.1.1 Data parsing / Corpus description / Filtering the data and filtered corpus description
To be run after the corpus description so that the following functions can be used: describe_corpus(), treemap_item()
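
The cell of this section prints a `combine` status, an `exclusion` status and a dictionary of item values before filtering. The sketch below shows one way such settings could interact. It rests on assumptions: `combine` is taken to choose between the union and the intersection of the per-item matches, `exclusion` is taken to invert the selection, and the helper `select_articles`, the item labels and the article IDs are all made up. The actual behaviour is defined in BiblioFilter.py.

```python
def select_articles(all_ids, matches_per_item, combine='intersection', exclusion=False):
    """Hypothetical helper combining per-item matches into one selection of article IDs."""
    match_sets = [set(ids) for ids in matches_per_item.values()]
    if not match_sets:
        selected = set()
    elif combine == 'union':
        selected = set.union(*match_sets)
    else:
        selected = set.intersection(*match_sets)
    # Exclusion mode keeps the articles that were NOT selected
    return set(all_ids) - selected if exclusion else selected

all_ids = {0, 1, 2, 3, 4}
matches = {'CU': {0, 1, 2}, 'S': {1, 2, 3}}   # made-up article IDs per filtered item
step_1 = select_articles(all_ids, matches, combine='union')
# Recursive filtering: the output of a step is the input corpus of the next one
step_2 = select_articles(step_1, {'Y': {2, 3, 4}}, exclusion=True)
print(step_1, step_2)
```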

In [None]:
# Standard libraries import
import glob
import json
import os
from pathlib import Path
import shutil                      

# Local imports
import BiblioAnalysis_Utils as bau

## Recursive filtering

# Set to True to allow prints in filter_corpus_new function
verbose = False

# Initialization of parameters for recursive filtering
filtering_step = 1
while True:

    ## Building the names of the useful folders and creating the output folder if not found
    if filtering_step == 1:
        in_dir_filter = out_dir_parsing
        ### Get the folder for the filter configuration file         
        gui_titles = {'main':   'Folder selection GUI for config_filters.json file ',
                      'result': 'Selected folder'}
        gui_buttons = ['SELECTION','HELP']
        filter_config_folder = bau.select_folder_gui_new(user_root, gui_titles, gui_buttons, bau.GUI_DISP,
                                                         widget_ratio=1, button_ratio=1, 
                                                         max_lines_nb=3)
        
        print('Filter configuration folder:', filter_config_folder)
        file_config_filters = filter_config_folder / Path('config_filters.json')
        print('Filter configuration file:',file_config_filters)
        modif_filtering = input("Modification of item-values list from a predefined file (y/n)? ")
        if modif_filtering == "y":
            bau.filters_modification(filter_config_folder,file_config_filters)    
    else:
        renew_filtering = input("Apply a new filtering process (y/n)? ") 
        if renew_filtering == "n": break
        in_dir_filter = project_folder / Path(bau.FOLDER_NAMES['filtering'] + '_' + str(filtering_step-1))
        file_config_filters = in_dir_filter / Path('save_config_filters.json')
        print('Filter configuration file:',file_config_filters) 
        
    out_dir_filter = project_folder / Path(bau.FOLDER_NAMES['filtering'] + '_' + str(filtering_step))
            
    if not os.path.exists(out_dir_filter):
        os.mkdir(out_dir_filter)
    else:
        print('out_dir_filter exists')
        files = glob.glob(str(out_dir_filter) + '/*.*')
        for f in files:
            os.remove(f)

    # Building the absolute file name of filter configuration file to save for the filtering step
    save_config_filters = out_dir_filter / Path(bau.SAVE_CONFIG_FILTERS)
    print('\nSaving filter configuration file:',save_config_filters)
    
    # Configurating the filtering through a dedicated GUI or getting it from the existing file
    bau.filters_selection(file_config_filters,save_config_filters,in_dir_filter,
                         fact=3, win_widthmm=85, win_heightmm=115, font_size=16)
    shutil.copyfile(save_config_filters, file_config_filters)

    # Read the filtering status
    combine,exclusion,filter_param = bau.read_config_filters(file_config_filters) 
    print("\nFiltering status:")
    print("   Combine   :",combine)
    print("   Exclusion :",exclusion)
    for key,value in filter_param.items():
        print(f"   Item      : {key}\n   Values    : {value}\n")

    # Running function filter_corpus_new
    bau.filter_corpus_new(in_dir_filter, out_dir_filter, verbose, file_config_filters)
    file = out_dir_filter /Path('articles.dat')
    with open(file) as f:
        lines = f.readlines()
        articles_number = len(lines)
    if articles_number == 0:
        print('Filtered corpus empty !')
        break
    print("Filtered-corpus parsing saved in folder ", 
            str(out_dir_filter),
            " with the corresponding filters configuration")

        # Folder containing the wos or scopus parsed and filtered files
    in_dir_freq_filt = out_dir_filter

        # Folder containing the wos or scopus parsed, filtered and analysed files
    out_dir_freq_filt = project_folder / Path(bau.FOLDER_NAMES['description'] + '_' + str(filtering_step))
    if not os.path.exists(out_dir_freq_filt): os.mkdir(out_dir_freq_filt)

        # Running describe_corpus 
    verbose = False
    bau.describe_corpus(in_dir_freq_filt, out_dir_freq_filt, database_type, verbose)
    print("Filtered corpus description saved in folder:", str(out_dir_freq_filt))

    # Treemap plot by a corpus item after filtering
    make_treemap = 'n'
    make_treemap = input("\n\nDraw treemap (y/n)?")
    if make_treemap == 'y' :

            # Running plot of treemap for selected item_treemap
        renew_treemap = 'y'    
        while renew_treemap == 'y' :
            print('\n\nChoose the item for treemap of the filtered corpus description in the tk window')
            item_treemap = bau.treemap_item_selection()
            file_name_treemap = out_dir_freq_filt / Path('freq_' + item_treemap + '.dat')
            print("Item selected:",item_treemap)
            bau.treemap_item(item_treemap, file_name_treemap)
            renew_treemap = input("\n\nTreemap for a new item (y/n)? ") 

    filtering_step += 1

print('\n' + bold_text + 'Cell-run completed' + light_text)


#### &emsp;&emsp;II-2.1.2 Data parsing / Corpus Description / Bibliographic Coupling analysis
To be run after the corpus description, since it uses the frequency analysis. You may execute the bibliographic coupling script several times in succession, on the unfiltered corpus and on any available filtering step of the corpus.
The result files are saved in independent folders.
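
As a reminder of the principle, two articles are bibliographically coupled when their reference lists overlap. The sketch below is illustrative only: the helper `coupling_weights` is hypothetical, the normalisation by the square root of the product of the list sizes is an assumption, and the weight actually used by `bau.build_coupling_graph` may differ.

```python
from itertools import combinations
from math import sqrt

def coupling_weights(references_by_article):
    """Hypothetical helper: normalised number of shared references for each pair of articles."""
    weights = {}
    for art_a, art_b in combinations(sorted(references_by_article), 2):
        refs_a = set(references_by_article[art_a])
        refs_b = set(references_by_article[art_b])
        shared = len(refs_a & refs_b)
        if shared:  # only coupled articles get an edge
            weights[(art_a, art_b)] = shared / sqrt(len(refs_a) * len(refs_b))
    return weights

# Made-up corpus: article label -> list of cited references
corpus = {'A': ['r1', 'r2', 'r3', 'r4'], 'B': ['r3', 'r4', 'r5', 'r6'], 'C': ['r6']}
print(coupling_weights(corpus))
```

In this toy corpus, 'A' and 'C' share no reference and therefore get no edge.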

In [None]:
# Standard libraries import
import glob
import os
from pathlib import Path
from IPython.display import clear_output

# Local imports
import BiblioAnalysis_Utils as bau

# Building the names of the useful folders and creating the output folder if not found
filtering = input(
                  "Corpus filtered (y/n)? "
                 )   
if filtering == "y":
    filtering_step = input(
                            "Enter filtering step : "
                          ) 
    in_dir_coupling = project_folder / Path(bau.FOLDER_NAMES['filtering'] + '_' + str(filtering_step))
    in_dir_freq= project_folder / Path(bau.FOLDER_NAMES['description'] + '_' + str(filtering_step))
    out_dir_coupling = project_folder / Path(bau.FOLDER_NAMES['coupling'] + '_' + str(filtering_step))
else:
    in_dir_coupling = out_dir_parsing
    in_dir_freq= out_dir_corpus    
    out_dir_coupling = project_folder / Path(bau.FOLDER_NAMES['coupling'])

if not os.path.exists(out_dir_coupling):
    os.mkdir(out_dir_coupling)
else:
    print('out_dir_coupling exists')
    files = glob.glob(str(out_dir_coupling) + '/*.html')
    for f in files:
        os.remove(f)
    
# Building the coupling graph of the corpus
print('Building the coupling graph of the corpus, please wait...')
G_coupl = bau.build_coupling_graph(in_dir_coupling)

# Building the partition of the corpus
print('Building the partition of the corpus, please wait...')
G_coupl,partition = bau.build_louvain_partition(G_coupl)
print()

# Adding attributes to the coupling graph nodes
attr_dic = {}
add_attrib = input("Add attributes to the coupling graph nodes (y/n)? ")
if add_attrib == 'y':
    while True:
        print('\n\nChoose the item for the attributes to add in the tk window')
        item, m_max_attrs = bau.coupling_attr_selection(fact=2, win_widthmm=80, win_heightmm=60, font_size=16)
        attr_dic[item] = m_max_attrs
        print("Item selected:",item," with ",m_max_attrs, " attributes" )
        G_coupl = bau.add_item_attribute(G_coupl, item, m_max_attrs, in_dir_freq, in_dir_coupling)
        renew_attrib = input("\nAdd attributes for a new item (y/n)?") 
        if renew_attrib == 'n' : break      

# Plot control of the coupling graph before using Gephi
NODES_NUMBER_MAX = 1
bau.plot_coupling_graph(G_coupl,partition,nodes_number_max=NODES_NUMBER_MAX)

# Creating a Gephi file of the coupling graph  
bau.save_graph_gexf(G_coupl,out_dir_coupling)
print("\nCoupling analysis of the corpus saved as Gephi file in folder:\n", str(out_dir_coupling))

# Creating an EXCEL file of the coupling analysis results
bau.save_communities_xls(partition,in_dir_coupling,out_dir_coupling)
print("\nCoupling analysis of the corpus saved as EXCEL file in folder:\n", str(out_dir_coupling))

print('\n' + bold_text + 'Cell-run completed' + light_text)

#### &emsp;&emsp;II-2.1.3  HTML graph of coupling analysis 
##### after Data parsing / Corpus Description / Coupling analysis  
You may execute the HTML graph construction script several times in succession on the available coupling graph of the corpus. The result files are saved in the corresponding coupling folder.
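
One HTML file is built per community, and each file name carries the community size. Assuming that `partition` maps each node of the graph to its community ID, as the script below uses it, the sizes can be obtained in a single pass with `collections.Counter`. The node labels and community IDs in this sketch are made up.

```python
from collections import Counter

# Made-up partition: node label -> community ID
partition = {'art1': 0, 'art2': 0, 'art3': 1, 'art4': 2, 'art5': 0}

# Counting how many nodes fall in each community
communities_size = Counter(partition.values())
community_number = len(communities_size)

for community_id, community_size in sorted(communities_size.items()):
    print(f'Community {community_id}: {community_size} node(s)')
```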

In [None]:
'''Creating html file of graph G_coupl using pyvis
   This script uses the results of the Biblioanalysis coupling analysis:
   - out_dir_coupling (Path): path for saving the coupling analysis results;
   - G_coupl (networkx object): coupling graph with added attributes;
   - partition (dict):  partition of graph G_coupl;
   - attr_dic (dict): dict of added attributes with number of added values. 
   
'''

# Standard library imports
from pathlib import Path

# Local imports
import BiblioAnalysis_Utils as bau

# Checking the availability of the corpus coupling graph G with all attributes and its partition
assert(G_coupl is not None),'''Please run first the "Bibliographic coupling analysis" 
                                script to build the coupling graph'''

# Setting the item label among the added attribute to be colored
colored_attr = input('Please enter the item label among the added attributes to be colored (default: S)')
if colored_attr == '':colored_attr = 'S'
print('Attribute to be colored:',colored_attr)
if colored_attr == 'S': 
    heading3 = 'Colored by main discipline (grey: without filtering subjects as main discipline).'
else:
    heading3 = 'Colored by main attribute values (grey: without filtering attribute values as main discipline).'
assert(colored_attr in attr_dic.keys()),\
    f'''Selected colored attribute should be among the added attributes: {list(attr_dic.keys())}.
Please run this script again to select an attribute effectively added to the coupling graph nodes
or run the "Bibliographic coupling analysis" script again to add the targeted attribute to the coupling graph.'''

# Setting the colors for the values of the attribute to be colored
# default: values of 'S' item from a particular corpus
# TO DO: define the list of the attribute values through a GUI
colored_attr_values = {'Neurosciences & Neurology':'0',
                  'Psychology':'1',
                  'Computer Science':'2',
                  'Robotics,Automation & Control Systems':'3',
                  'Life Sciences & Biomedicine - Other Topics':'4',
                  'Biochemistry & Molecular Biology':'4',
                  'Cell Biology':'4',
                  'Evolutionary Biology':'4',
                  'Biomedical Social Sciences':'4',
                  'Biotechnology & Applied Microbiology':'4',
                  'Developmental Biology':'4',
                  'Microbiology':'4',
                  'Marine & Freshwater Biology':'4',
                  'Reproductive Biology':'4',
                  'Genetics & Heredity':'4',
                  'Philosophy':'5',
                  'History & Philosophy of Science':'5',
                  'Social Sciences - Other Topics':'6',
                  'Mathematical Methods In Social Sciences':'6',
                  'Linguistics':'7',
                  'Anthropology':'8',
                 }

# Setting the attribute value to be specifically shaped
shaped_attr = input('Please enter the added attribute value to be specifically shaped (default: Psychology)')
if shaped_attr == '':shaped_attr = 'Psychology'
print('Attribute value to be specifically shaped (triangle):',shaped_attr)
heading4 = 'Triangles for "' + shaped_attr + '" in disciplines.'

# Computing the number of communities
community_number = len(set(partition.values()))
print('Number of communities:',community_number)

# Computing the size of the communities
communities_size = {}
for value in partition.values():
    communities_size[value] = communities_size.get(value, 0) + 1
            
# Building the html graphs per community
for community_id in range(community_number):
    community_size = communities_size[community_id] 
    heading2 = 'Coupling graph for community ID: ' + str(community_id) + ' Size: ' + str(community_size)
    heading = '<h1>' + main_heading + '</h1>' + '<h2>' + heading2 + '</h2>' \
                  + '<h3 align=left nowrap>' + heading3 + '<br>'  + heading4 + '</h3>'
    html_file= str(out_dir_coupling /Path('coupling_' + 'com' + str(community_id) \
                                          + '_size' + str(community_size) + '.html'))
    #bau.coupling_graph_html_plot(G_coupl,html_file,community_id,attr_dic,colored_attr,
    #                             colored_attr_values,shaped_attr,nodes_colors,edges_color,
    #                             background_color,font_color,heading)
    bau.coupling_graph_html_nwplt(G_coupl,html_file,community_id,attr_dic,colored_attr,
                                  colored_attr_values,shaped_attr,heading)
# Building the html graph for the full corpus
heading2  = ' All ' + str(community_number) + ' communities'
heading = '<h1>' + main_heading + '</h1>' + '<h2>' + heading2 + '</h2>' \
          + '<h3 align=left nowrap>' + heading3 + '<br>'  + heading4 + '</h3>'
html_file= str(out_dir_coupling /Path('coupling_' + 'all.html'))
#bau.coupling_graph_html_plot(G_coupl,html_file,'all',attr_dic,colored_attr,
#                         colored_attr_values,shaped_attr,nodes_colors,edges_color,
#                         background_color,font_color,heading)
bau.coupling_graph_html_nwplt(G_coupl,html_file,'all',attr_dic,colored_attr,
                              colored_attr_values,shaped_attr,heading)

print("\nCreated html files of graph G_coupl using pyvis for the corpus in folder:\n", str(out_dir_coupling))

print('\n' + bold_text + 'Cell-run completed' + light_text)

### &emsp;&emsp;II-2.2 Data parsing / Co-occurrence Maps
You may execute the co-occurrence script several times in succession, on the unfiltered corpus and on any available filtering step of the corpus.
The result files are saved in independent folders.
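
The principle of a co-occurrence graph with a minimum node size can be sketched with the standard library only. In this sketch the helper `build_cooc` is hypothetical and the articles are made up. Returning `None` when no node reaches the threshold is an assumption based on the `G_cooc is None` check in the cell below; the actual graph is built by `bau.build_item_cooc`.

```python
from collections import Counter
from itertools import combinations

def build_cooc(items_by_article, size_min=1):
    """Hypothetical helper returning (node sizes, edge weights), or None if no node is kept."""
    node_size = Counter()
    for items in items_by_article.values():
        node_size.update(set(items))          # one count per article mentioning the item value
    kept = {item for item, size in node_size.items() if size >= size_min}
    if not kept:
        return None
    edge_weight = Counter()
    for items in items_by_article.values():
        for pair in combinations(sorted(set(items) & kept), 2):
            edge_weight[pair] += 1            # the two values co-occur in this article
    return {item: node_size[item] for item in kept}, dict(edge_weight)

# Made-up corpus: article ID -> countries of the authors
articles = {1: ['France', 'Italy'], 2: ['France', 'Italy', 'Spain'], 3: ['France']}
print(build_cooc(articles, size_min=2))
print(build_cooc(articles, size_min=5))
```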

In [None]:
# Standard library imports
import os
from pathlib import Path

# Local imports
import BiblioAnalysis_Utils as bau

# Building the names of the useful folders and creating the output folder if not found
filtering = input(
                  "Corpus filtered (y/n)? "
                 )   
if filtering == "y":
    filtering_step = input(
                            "Enter filtering step : "
                          ) 
    in_dir_cooc = project_folder / Path(bau.FOLDER_NAMES['filtering'] + '_' + str(filtering_step))
    out_dir_cooc = project_folder / Path(bau.FOLDER_NAMES['cooccurrence'] + '_' + str(filtering_step))
else:
    in_dir_cooc = out_dir_parsing   
    out_dir_cooc = project_folder / Path(bau.FOLDER_NAMES['cooccurrence']) 

if not os.path.exists(out_dir_cooc):
    os.mkdir(out_dir_cooc)
else:
    print('out_dir_cooc available')

## Building the co-occurrence graph
size_min = 1
node_size_ref=300
while True :
    print('\n\nChoose the item for co-occurrence analysis in the tk window')
    cooc_item, size_min = bau.cooc_selection(fact=3, win_widthmm=80, win_heightmm=100, font_size=16) 
    print("Item selected:",cooc_item," at minimum size ",size_min)
    out_dir_cooc_item = out_dir_cooc / Path('cooc_' + cooc_item + \
                                            '_thr' + str(size_min))
    if not os.path.exists(out_dir_cooc_item):
        os.mkdir(out_dir_cooc_item)
    else:
        print('out_dir_cooc_item available')
    G_cooc = bau.build_item_cooc(cooc_item,in_dir_cooc, out_dir_cooc_item, size_min = size_min)
    if G_cooc is None:
        print(f'The minimum node size ({size_min}) is too large. Relax this constraint.')
    else:
        print("Co-occurrence analysis of the corpus for item " + cooc_item + \
          " saved in folder:", str(out_dir_cooc_item))
        heading2 = 'Co-occurrence graph for item ' + cooc_item + ' with minimum node size ' + str(size_min)
        heading3 = 'Bold node title: Node attributes[number of item value occurrences-item value (total number of edges)]'
        heading4 = 'Light node titles: Neighbors attributes[number of item value occurrences-item value (number of edges with node)]'
        heading = '<h1>' + main_heading + '</h1>' + '<h2>' + heading2 + '</h2>' \
                  + '<h3 align=left nowrap>' + heading3 + '<br>'  + heading4 + '</h3>'
    
        bau.plot_cooc_graph(G_cooc,cooc_item,size_min=size_min,node_size_ref=node_size_ref)
        # Creating html file of graph G_cooc using pyvis
        html_file= str(out_dir_cooc_item /Path('cooc_' + cooc_item + '_thr' + str(size_min) + '.html'))
        bau.cooc_graph_html_plot(G_cooc,html_file,heading)
        print("Created html file of",cooc_item,"co-occurrence graph using pyvis in folder:\n",\
              str(out_dir_cooc_item))
        
    renew_cooc = input("\n\nCo-occurrence analysis for a new item (y/n)?") 
    if renew_cooc == 'n' : break

print('\n' + bold_text + 'Cell-run completed' + light_text)

# III- Temporal development of item values weight
To run this cell, a set of annual corpuses with their descriptions should be available.
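
The search configuration used in this section holds two keyword lists, `'is_in'` and `'is_equal'`. The sketch below shows how such filters could be applied to yearly frequency tables. It assumes that `'is_in'` means "the keyword is a substring of the item value" and `'is_equal'` means "the keyword equals the item value"; `temporal_counts` and the sample data are hypothetical and do not reproduce `bau.temporaldev_itemvalues_freq`.

```python
def temporal_counts(freq_by_year, keyword_filters):
    """Sum, per year, the frequencies of the item values matching each keyword.

    freq_by_year: {year: {item_value: number_of_occurrences}}
    keyword_filters: {'is_in': [...], 'is_equal': [...]}
    Returns {(mode, keyword): {year: total}}.
    """
    results = {}
    for mode, keywords in keyword_filters.items():
        for keyword in keywords:
            per_year = {}
            for year in sorted(freq_by_year):
                freqs = freq_by_year[year]
                if mode == "is_equal":
                    # Assumed semantics: exact match of the item value
                    total = sum(n for value, n in freqs.items() if value == keyword)
                else:
                    # Assumed semantics: keyword contained in the item value
                    total = sum(n for value, n in freqs.items() if keyword in value)
                per_year[year] = total
            results[(mode, keyword)] = per_year
    return results


# Hypothetical yearly keyword frequencies
freq_by_year = {
    2019: {"graph theory": 4, "network": 2, "neural network": 5},
    2020: {"graph theory": 6, "network": 3, "knowledge graph": 1},
}
keyword_filters = {"is_in": ["graph"], "is_equal": ["network"]}

for key, per_year in temporal_counts(freq_by_year, keyword_filters).items():
    print(key, per_year)
```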

In [None]:
# Standard library imports
import json
import os
from pathlib import Path
from IPython.display import clear_output

# Local imports
import BiblioAnalysis_Utils as bau

# Initialize the search configuration dict 
keyword_filters = {
    'is_in':[],    
    'is_equal':[]}

## Get the folder for the configuration file for the temporal development analysis 
# To Do : use the new gui
temporaldev_config_folder = bau.select_folder_gui(user_root,'Select the folder for config_temporal.json file')
print('Item_values selection folder:', temporaldev_config_folder )

## Building the search configuration:
#### - either by reading the 'config_temporal.json' file without modification
#### - or by an interactive modification of the configuration and storing it in this file for further use
TemporalDev_file = temporaldev_config_folder / Path('config_temporal.json')

keywords_modif = input('Modification of the keywords list (y/n)?')
if keywords_modif == 'y':
    
    # Selection of items
    items_full_list = ['IK','AK','TK','S','S2']
    print('\nPlease select the items to be analyzed via the tk window')
    items = bau.Select_multi_items(items_full_list,'multiple')

    # Selection of the folder of item-values full-list file
    # To Do : use the new gui
    select_folder = bau.select_folder_gui(user_root,'Select the folder of the item-values list files')

    # Setting the file of the item-values full list  
    keywords_full_list_file = select_folder / Path('TempDevK_full.txt')
    
    # Building the item-values full list
    keywords_full_list = bau.item_values_list(keywords_full_list_file)
    
    # Selection of the item-values list to be put in the temporal development configuration file 
    search_modes = ['is_in','is_equal']
    for search_mode in search_modes:
        print('\nPlease select the keywords for ',search_mode, ' via the tk window')
        keyword_filters[search_mode] = bau.Select_multi_items(keywords_full_list,mode = 'multiple')
        
    # Saving the new configuration in the 'config_temporal.json' file   
    bau.write_config_temporaldev(TemporalDev_file,items,keyword_filters)
    print('\n New temporal development configuration saved in: \n', TemporalDev_file)    
else:
    # Reading the search configuration from the 'config_temporal.json' file  
    items,keywords_param = bau.read_config_temporaldev(TemporalDev_file)
    print('Selection of items:\n',items)    
    keyword_filters['is_in'] = keywords_param['is_in']
    keyword_filters['is_equal'] = keywords_param['is_equal']

## Selection of annual corpus files
corpusfiles_list = os.listdir(corpuses_folder)
corpusfiles_list.sort()
print('\nPlease select the corpuses to be analyzed via the tk window')
years = bau.Select_multi_items(corpusfiles_list,'multiple')

# Print configuration
print('Search items:', items)
print('\nSearch Words:\n' + json.dumps(keyword_filters, indent=2))
print('\n Selection of annual corpus files:\n',years, '\n')

# Performing the search using the keyword_filters dict
keyword_filter_list = bau.temporaldev_itemvalues_freq(keyword_filters ,items, years, corpuses_folder)

# Saving the search results in an EXCEL file
store_file = corpuses_folder / Path('Results_Files/TempDev_synthesis.xlsx')
bau.temporaldev_result_toxlsx(keyword_filter_list,store_file)
print('\nTemporal development results saved in:\n', store_file) 

print('\n' + bold_text + 'Cell-run completed' + light_text)

# Annexe 1- Databases merging

In [None]:
# Local imports
import BiblioAnalysis_Utils as bau

database, filename, in_dir, out_dir = bau.merge_database_gui()
bau.merge_database(database,filename,in_dir,out_dir)

print('\n' + bold_text + 'Cell-run completed' + light_text)

# Annexe 2- Item values selection to list for filters configuration

In [None]:
# Standard library imports
from pathlib import Path

# Local imports
import BiblioAnalysis_Utils as bau

# Get the folder for the filter configuration file 
# To Do : use the new gui
filter_config_folder = bau.select_folder_gui(user_root,'Select the folder for the config_filters.json file')
print('Filter configuration folder:', filter_config_folder) 

file_config_filters = filter_config_folder/ Path('config_filters.json')    
bau.filters_modification(filter_config_folder,file_config_filters)

print('\n' + bold_text + 'Cell-run completed' + light_text)

# Annexe 3- Upgrade of parsing files with column names

In [None]:
# Local imports
import BiblioAnalysis_Utils as bau

# Get the corpus folder to upgrade
# To Do : use the new gui
corpus_folder_to_upgrade = bau.select_folder_gui(user_root,'Select the corpus folder to upgrade')
print('Corpus folder to upgrade:', corpus_folder_to_upgrade) 
bau.upgrade_col_names(corpus_folder_to_upgrade)

print('\n' + bold_text + 'Cell-run completed' + light_text)

# Documentation

## Data parsing
- articles.dat is the central file, listing all the publications within the corpus. It contains information such as the document type (article, letter, review, conference proceeding, etc.), title, year of publication, publication source, DOI, number of citations (given by WOS or Scopus at the time of the extraction) and a unique identifier used in all the other files to identify a precise publication.
- database.dat keeps track of the origin of the data, some parts of the analysis being specific to WOS or Scopus data.
- authors.dat lists all author names associated with each publication ID.
- addresses.dat lists all addresses associated with each publication ID, along with a specific ID for each address line. These addresses are reported as they appear in the raw data, without any further processing.
- countries.dat lists all countries associated with each publication ID and address line ID. The countries are extracted from the address fields of the raw data, with some cleaning (mentions of US states and UK countries are changed to USA and UK, respectively).
- institutions.dat lists all the comma-separated entities appearing in the address field associated with each publication ID and address line ID, except those referring to a physical address. These entities correspond to various name variants of universities, organizations, hospitals, labs, services, departments, etc. as they appear in the raw data. No processing is applied, e.g. to filter out the entities corresponding to a given hierarchy level.
- keywords.dat lists various types of keywords associated with each publication ID. "AK" keywords correspond to authors' keywords. "IK" keywords correspond to either WOS or Scopus keywords, which are built from the authors' keywords, the title and the abstract. "TK" keywords correspond to title words (from which we simply remove common words and stop words; no stemming is performed). TK are especially useful when studying pre-1990s publications, when the use of keywords was not yet standard.
- references.dat lists all the references associated with each publication ID. The raw data is parsed to store the first author name, title, source, volume and page of each reference of the raw "references" field.
- subjects.dat lists all subject categories associated with each publication ID (a journal may be associated with several subject categories). WOS classifies the sources it indexes into ∼ 250 categories, which are reported in the extracted data. Scopus classifies its sources into 27 major categories and ∼ 300 sub-categories, none of which are reported in the extracted data. We use the Elsevier Source Title List (October 2017 version) to retrieve that information. The "subjects.dat" file contains the information relative to the major categories.
- subjects2.dat lists Scopus's sub-categories, if the database used is Scopus.
- AA_log.txt keeps track of the date/time the script was executed and of all the messages displayed on the terminal (number of publications extracted, % of references rejected, etc.).
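
Because every file refers to publications through the same unique identifier, the files can be joined on that identifier. The sketch below illustrates the idea with in-memory data. The tab-separated layout and the column order are assumptions made for the example; they are not taken from the package, and `read_dat` and `countries_by_publication` are hypothetical helpers.

```python
import csv
import io
from collections import defaultdict

# Hypothetical file contents: tab-separated, publication ID in the first column
ARTICLES_DAT = (
    "0\tSmith J\t2015\tJ INFORMETR\tArticle\n"
    "1\tDoe A\t2018\tSCIENTOMETRICS\tReview\n"
)
COUNTRIES_DAT = (
    "0\t0\tFrance\n"
    "0\t1\tUSA\n"
    "1\t0\tFrance\n"
)


def read_dat(text):
    """Parse tab-separated text into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


def countries_by_publication(country_rows):
    """Group country names by publication ID (assumed to be the first column)."""
    grouped = defaultdict(set)
    for pub_id, _address_idx, country in country_rows:
        grouped[pub_id].add(country)
    return dict(grouped)


articles = read_dat(ARTICLES_DAT)
countries = countries_by_publication(read_dat(COUNTRIES_DAT))

# Join the two tables on the publication ID
for pub_id, first_author, year, source, doc_type in articles:
    print(pub_id, year, source, sorted(countries.get(pub_id, set())))
```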

##  Corpus description
Before doing anything else, you should get a general idea of the content of your database. This script performs several basic tasks:
- it performs a series of frequency analyses, computing the number of occurrences of each item (authors, keywords, references, etc.) within the publications of the corpus. These frequencies are automatically stored in several "freq_xxx.dat" files within a newly created "freq" folder.
- it performs a series of generic statistical analyses, storing the number of distinct items of each type (e.g. there are x distinct keywords in the corpus), the distribution of the number of occurrences of each item (e.g. there are x keywords appearing in at least y publications) and the distribution of the number of items per publication (e.g. x% of publications have y keywords). All these statistics are stored in the "DISTRIBS_itemuse.json" file.
- it also performs a co-occurrence analysis, computing the number of co-occurrences of pairs of items among the top 100 most frequent items of each type (e.g. computing how often the two most used keywords appear together in the same publications). The results of this analysis are stored in the "coocnetworks.json" file. A more systematic co-occurrence analysis can also be performed with another script, cf. the Co-occurrence Maps section below.

All the generated files can be opened and read with a simple text editor. The freq_xxx.dat files, which list items by order of frequency, can also be read in spreadsheet software such as Excel. All the files are however primarily made to be read in the BiblioMaps interface.
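
The three analyses described above can be sketched with the standard library. This is a minimal illustration on hypothetical data, not the code of the description script; the function names are invented for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical corpus: publication ID -> keywords of that publication
keywords_by_pub = {
    0: ["model", "network", "energy"],
    1: ["model", "network"],
    2: ["model"],
    3: [],
}


def item_frequencies(items_by_pub):
    """Number of publications in which each item appears."""
    freq = Counter()
    for items in items_by_pub.values():
        freq.update(set(items))  # count an item once per publication
    return freq


def items_per_publication(items_by_pub):
    """Distribution of the number of distinct items per publication."""
    return Counter(len(set(items)) for items in items_by_pub.values())


def top_cooccurrences(items_by_pub, freq, top_n=100):
    """Co-occurrence counts of item pairs restricted to the top_n most frequent items."""
    top = {item for item, _ in freq.most_common(top_n)}
    cooc = Counter()
    for items in items_by_pub.values():
        kept = sorted(set(items) & top)  # sorted so that each pair has one canonical order
        cooc.update(combinations(kept, 2))
    return cooc


freq = item_frequencies(keywords_by_pub)
print("Frequencies:", freq.most_common())
print("Keywords in at least 2 publications:", sum(1 for n in freq.values() if n >= 2))
print("Keywords per publication:", dict(items_per_publication(keywords_by_pub)))
print("Co-occurrences:", dict(top_cooccurrences(keywords_by_pub, freq)))
```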

## Filtering the data
#### To be run after corpus description to allow using the following functions: describe_corpus(), treemap_item()

If, upon exploring the nature of the data, you realize that you'd prefer to filter your corpus based on some characteristic before going further (keeping only the publications from certain years, using some keywords or references, written by some authors from some countries, etc.), you can filter the initial corpus with the script:

- python BiblioTools3.2/filter.py -i myprojectname/ -o myprojectname_filtered -v <br>

Edit the 'filter.py' file to specify your filters. You'll also need to create a new "myprojectname_filtered" main folder before running the script.
- The script creates the files articles.dat, addresses.dat, authors.dat, countries.dat, institutions.dat, keywords.dat, references.dat, subjects.dat and subjects2.dat in the output folder.
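
The principle of the filtering step can be sketched as follows: select a set of publication IDs with some criterion, then keep (or drop) the matching rows in every file. This is a hedged illustration on hypothetical data; `select_ids_by_year` and `filter_tables` are invented names, and the `exclusion` flag is only one plausible reading of the "EXCLUSION" mode mentioned in the Aims.

```python
# Hypothetical parsed corpus: table name -> rows, publication ID in the first column
tables = {
    "articles": [("0", "2012"), ("1", "2016"), ("2", "2019")],
    "authors": [("0", "Smith J"), ("1", "Doe A"), ("1", "Lee K"), ("2", "Smith J")],
    "keywords": [("0", "AK", "model"), ("2", "AK", "energy")],
}


def select_ids_by_year(article_rows, year_min, year_max):
    """Return the IDs of the publications published within [year_min, year_max]."""
    return {pub_id for pub_id, year in article_rows if year_min <= int(year) <= year_max}


def filter_tables(tables, selected_ids, exclusion=False):
    """Propagate a selection of publication IDs to every table.

    exclusion=False keeps the selected publications,
    exclusion=True drops them (assumed meaning of an exclusion mode).
    """
    filtered = {}
    for name, rows in tables.items():
        filtered[name] = [
            row for row in rows
            if (row[0] in selected_ids) != exclusion
        ]
    return filtered


selected = select_ids_by_year(tables["articles"], 2015, 2020)
kept = filter_tables(tables, selected)
dropped = filter_tables(tables, selected, exclusion=True)

print("Selected IDs:", sorted(selected))
print("Kept authors:", kept["authors"])
print("Articles left in exclusion mode:", dropped["articles"])
```

Applying `filter_tables` again to its own output with a new criterion gives a simple form of successive filtering steps.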

### Co-occurrence Maps
You may execute the co-occurrence script several times in succession, on the unfiltered corpus and on any available filtering step of the corpus.
The result files are saved in independent folders.

Figure: example of a heterogeneous network generated with BiblioAnalysis and visualized in Gephi.

The script creates multiple co-occurrence networks, all stored in gdf and gexf files that can be opened in Gephi, among which:

- a co-citation network, linking references that are cited in the same publications.
- a co-refsources network, linking reference sources that are cited in the same publications.
- a co-author network, linking authors that collaborated in some publications.
- a co-country network, linking countries with researchers that collaborated in some publications.
- a co-institution network, linking institutions with researchers that collaborated in some publications. For this network to be fully useful, you may want to spend some time cleaning the "institutions.dat" file, e.g. by keeping only the big institutions (university level) or by replacing minor name variants with the dominant name variant ("Ecole Normale Supérieure de Lyon" → "ENS Lyon").
- a co-keyword network, linking keywords that are used together in some publications. Be careful about the interpretation: keywords can be polysemic, their meaning differing from one field to another (e.g. "model", "energy", "evolution", etc.).