# Name Screening

## Data Loader

## Table of Contents <a class="anchor" id="toc"></a>

1. [Function Definitions - Data Loading](#func-defs)
    1. [OFAC NS-PLC](#first-func-def)
    2. [BIS Denied Persons](#second-func-def)
    3. [EU FSF](#third-func-def)
    4. [Random Names](#fourth-func-def)
    5. [Consolidated Loader](#fifth-func-def)
2. [Loading names from all files](#data-load)
3. [Save the final name list](#save-data)

## Libraries

In [None]:
from platform import python_version
print("Python Version:", python_version())

import warnings
#warnings.filterwarnings(action='once')
warnings.filterwarnings('ignore')

# pip install abydos

import re
import os
import time
import random
import numpy as np
import pandas as pd
from datetime import datetime

from abydos import phonetic, distance

# 1. Function Definitions <a class="anchor" id="func-defs"></a>

Go to [Table of Contents](#toc)

## 1.1. Function Definition - Load OFAC NS-PLC list <a class="anchor" id="first-func-def"></a>

Go to [Table of Contents](#toc)

https://home.treasury.gov/policy-issues/financial-sanctions/consolidated-sanctions-list/non-sdn-palestinian-legislative-council-ns-plc-list

Non-SDN Palestinian Legislative Council (NS-PLC) List

Section (b) of General License 4 issued pursuant to the Global Terrorism Sanctions Regulations (31 C.F.R. Part 594), the Terrorism Sanctions Regulations (31 C.F.R. Part 595), and the Foreign Terrorist Organizations Sanctions Regulations (31 C.F.R. Part 597) authorizes U.S. financial institutions to reject transactions with members of the Palestinian Legislative Council (PLC) who were elected to the PLC on the party slate of Hamas, or any other Foreign Terrorist Organization (FTO), Specially Designated Terrorist (SDT), or Specially Designated Global Terrorist (SDGT), provided that any such individuals are not named on OFAC's list of Specially Designated Nationals and Blocked Persons (SDN List).

In order to uniquely identify these names, OFAC has created the program code (NS-PLC). The prefix "NS" stands for "non-SDN".
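
As a rough illustration of what the loader has to do with each entry, the sketch below takes one NS-PLC-style record, drops everything from `DOB` onward, separates the `a.k.a.` alias, and turns `Last, First` into `First Last`. The sample record and the helper name `split_entry` are invented for illustration; the layout of the real file may differ in its details.

```python
import re

def split_entry(entry):
    """Return the primary name and any aliases of one record as 'First Last' strings."""
    # Keep only the part before the date of birth, if there is one
    name_part = entry.split('DOB')[0]
    cleaned = []
    for piece in name_part.split('a.k.a.'):
        # Keep letters (including accented ones), spaces, commas and hyphens
        piece = re.sub(r'[^a-zA-ZÀ-ÿ ,-]+', '', piece).strip()
        if not piece:
            continue
        # 'LAST, First' -> 'First LAST'
        parts = [p.strip() for p in piece.split(',')]
        cleaned.append(' '.join(reversed(parts)).strip())
    return cleaned

# Hypothetical record, not taken from the actual list
sample = "DOE, John (a.k.a. DOE, Jon) DOB 01 Jan 1970; POB Somewhere."
print(split_entry(sample))  # ['John DOE', 'Jon DOE']
```

The loader in the next cell follows the same idea but works on the whole file at once and also builds a phonetic key for each name part.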

In [None]:
#############################################################################################
##################          Function to check if the file is valid          #################
#############################################################################################

def load_NSPLC(fil):
    if '.txt' in fil:
        try:
            with open(fil) as ifile:
                lines = ifile.readlines()
        except:
            print('Error in decoding file. Exiting...')
            return None
        
        intro = []

        for line in lines:
            if '__________' in line:
                break
            intro.append(line.strip())

        intro = ' '.join(intro)
        
        # Keywords to validate that the file is NS-PLC list
        ofac_strings = ['Office of Foreign Assets Control', 'OFAC', 'NS-PLC', 'non-SDN', 
                           'Palestinian Legislative Council']
        
        if all(x in intro for x in ofac_strings):
            names = load_from_NSPLC(fil, lower = False, namesplit = False)
            return names
        else:
            return None
    else:
        print('OFAC NS-PLC list in .TXT format not found...')
        return None
    
#############################################################################################


#############################################################################################
##################         Function to load the names from the file         #################
#############################################################################################


def load_from_NSPLC(filname, lower = True, namesplit = False):
    
    print()
    timestamp = datetime.now().strftime("%d-%m-%Y, %H:%M:%S")
    
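    # Check the last extension only, so names with extra dots (or a leading './') still work
    if os.path.splitext(filname)[1].lower() != '.txt':
        print("Not OFAC NS-PLC list OR list is not in recommended .txt format...")
        return None

    try:
        with open(filname) as ifile:
            lines = ifile.readlines()
    except (OSError, UnicodeDecodeError):
        print('Error in decoding file. Exiting...')
        return None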


    print("OFAC NS-PLC List detected!")
    names = []
    flag = 0
    for line in lines:
        if '___________' in line:
            flag += 1
            continue

        if flag == 1:
            names.append(line.strip())


    # Extract the names from the row of personal information
    names1 = ' '.join(names)
    names1 = names1.replace('  ', '\n')
    names1 = names1.split('\n')
    names1 = [name.split('DOB')[0] for name in names1]


    # Split the name aliases
    names2 = []
    for name in names1:
        if "(a.k.a" in name:
            names2.append(name.split(')')[0])
        elif " a.k.a" in name:
            names2.append(name.split('DOB')[0])
        else:
            names2.append(','.join(name.replace(';', ',').split(',')[:2]))

    # Strip characters that fall outside the accepted (anglicized) character set
    names3 = []
    for name in names2:
        nms = name.strip().split('a.k.a.')
        new_nms = []
        for nm in nms:
            # Keep a-z, A-Z, accented Latin letters (À-ÿ), '-', ' ' and ','
            nm = re.sub('[^a-zA-ZÀ-ÿ- ,]+', '', nm).strip()
            new_nms.append(nm)
        names3.append(new_nms)
        #names3.append(name.strip().split('a.k.a.'))

    
    
    if lower:
        names3 = [[n.lower() for n in nm] for nm in names3]
    
    # All names are provided as (Last-Name, First-Name). This block rearranges them to (First-Name Last-Name)
    if not namesplit:
        new_names = []
        for name in names3:
            nn = []
            for n in name:
                #print(' '.join(n.split(',')[::-1]))
                nn.append(' '.join(n.split(',')[::-1]).strip())
            new_names.append(nn)
        
        names3 = new_names
    
    print(f"{len(names3):,} names detected!")
    
    # Remove duplicate names
    new_names = []
    for name in names3:
        flag = True
        for nn in new_names:
            if sorted(name) == sorted(nn):
                flag = False
                break
            
        if flag:
            new_names.append(name)
        flag = True
            
    names3 = new_names
    
    print(f"{len(names3):,} unique names found!")
        
    
    # Get final list of names that are usable and non-usable based on letters in names
    pho_list = []
    fin_names = []
    exc_list = []
    for nam in names3:
        pho = []
        max_len = 0
        max_nam = ""
        for n in nam:
            if len(n.split()) > max_len:
                max_len = len(n.split())
                max_nam = n
                
            n1 = n.replace('-', ' ').replace('.', ' ')
            for sn in n1.split():
                pho.append(phonetic.DoubleMetaphone().encode(sn)[0])
        
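        # Note: inside the character class, ' -.' is a range from space to '.', so it also
        # admits characters such as ',' and '&' in addition to space, hyphen and period
        if not re.match("^[A-Za-z0-9À-ÿ -.()/]*$", max_nam):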
            exc_list.append(nam)
            continue
        fin_names.append(max_nam)
        pho = list(dict.fromkeys(pho))
        pho_list.append(pho)
        
    
    print(f"{len(fin_names):,} unique anglosized names extracted!")

    df = pd.DataFrame(data={"List": "NS-PLC", "Name": fin_names, "Phonemes": pho_list, "Timestamp": timestamp})
    exc_df = pd.DataFrame(data={"List": "NS-PLC", "Name": exc_list, "Reason": "Non-anglosized letters in name", "Timestamp": timestamp})
    
    return df, exc_df

########################################################################################


## 1.2. Function Definition - Load Bureau of Industry and Security: Denied Persons List <a class="anchor" id="second-func-def"></a>

Go to [Table of Contents](#toc)

https://www.bis.doc.gov/index.php/policy-guidance/lists-of-parties-of-concern/denied-persons-list


The Denied Persons List is a list of people and companies whose export privileges have been denied by the Department of Commerce's Bureau of Industry and Security (BIS). An American company or individual may not participate in an export transaction with an individual or company on the Denied Persons List.
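
The loader below separates persons from companies with a list of stopwords such as 'inc' and 'corp'. One thing to watch for with plain substring tests is that short stopwords can also match inside personal names. The sketch compares a substring test with a whole-word test; the helper names and sample names are made up, and the stopword list is a shortened copy of the one used in this notebook.

```python
import re

STOPWORDS = ['inc', 'corp', 'ltd', 'trading', 'group']

def looks_like_company_substring(name):
    # True if any stopword occurs anywhere in the name
    lowered = name.lower()
    return any(word in lowered for word in STOPWORDS)

# One alternation of all stopwords, wrapped in word boundaries
_WORD_RE = re.compile(r'\b(?:' + '|'.join(map(re.escape, STOPWORDS)) + r')\b')

def looks_like_company_word(name):
    # True only if a stopword occurs as a whole word
    return _WORD_RE.search(name.lower()) is not None

for name in ['Vincent Example', 'Example Trading Inc.', 'Sample Group Ltd']:
    print(name, looks_like_company_substring(name), looks_like_company_word(name))
# Vincent Example True False
# Example Trading Inc. True True
# Sample Group Ltd True True
```

Whole-word matching is only one possible refinement. Entries that begin with punctuation, such as '.net', would need separate handling because `\b` does not match before a leading period.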

In [None]:

#############################################################################################
##################          Function to check if the file is valid          #################
#############################################################################################

def load_DPL(fil):
    if '.txt' in fil:
        try:
            with open(fil) as ifile:
                lines = ifile.readlines()
        except:
            print('Error in decoding file. Exiting...')
            return None
        
        intro = []

        for line in lines:
            if '__________' in line:
                break
            intro.append(line.strip())

        intro = ' '.join(intro)
        
        # keywords that validate that the file is the Denied Persons list
        bis_strings = ["Name", "Street_Address", "City", "State", "Country", "Postal_Code",
                            "Effective_Date", "Expiration_Date", "Standard_Order", "Last_Update", 
                            "Action", "FR_Citation"]
        
        if all(x in intro for x in bis_strings):
            names = load_from_BIS_Denied_Persons(fil, lower=False)
            return names
        else:
            return None
        
    else:
        print('BIS Denied Persons list in .TXT format not found...')
        return None

#############################################################################################


#############################################################################################
##################         Function to load the names from the file         #################
#############################################################################################


def load_from_BIS_Denied_Persons(filname, lower=True):
    
    print()
    timestamp = datetime.now().strftime("%d-%m-%Y, %H:%M:%S")
    
    # Remove words that are not associated with human names
    stopwords = ['advanced', 'airlines', 'technology', 'ltd', 'limited', 'corporation', 'corp', 'llc', 'trade', 
                 'international', 'trading', 'globe', 'systems', 'products', 'computers', 'inc', '.net', 'blue bird', 
                 'group', 'aviation', 'gmbh', 'enterprises', 'results', 'commercio', 'engineering']
    
    # Check the last extension only, so names with extra dots (or a leading './') still work
    if os.path.splitext(filname)[1].lower() != '.txt':
        print("Not Bureau of Industry and Security Denied Persons List OR list is not in recommended .txt format...")
        return None

    try:
        db = pd.read_csv(filname, delimiter='\t')
    except (OSError, ValueError):
        print('Error in decoding file. Exiting...')
        return None

    print("BIS Denied Persons List detected!")
    
    print(f"{len(db):,} names detected!")
    
    db.drop_duplicates(subset='Name', inplace=True)
    
    print(f"{len(db):,} unique names found!")
    
    names = db['Name'].tolist()
    if lower:
        names = [n.lower() for n in names]
    
    # Choose person names instead of companies, corporations, etc.
    nn = []
    exc_list = []
    for name in names:
        if not any(x in name.lower() for x in stopwords):
            nn.append(name)
        else:
            exc_list.append(name)
    
    exc_df = pd.DataFrame({"List": "BIS Denied Persons List", "Name": exc_list, "Reason": "Not a person's name", "Timestamp": timestamp})
    
    names = nn
    print(f"{len(names):,} unique persons names found!")
    
    # Keep only names written with the accepted Latin character set
    pho_list = []
    fin_names = []
    for nam in names:
        pho = []
        if not re.match("^[A-Za-z0-9À-ÿ -.()/]*$", nam):
            exc_df.loc[len(exc_df.index)] = ["BIS Denied Persons List", nam, "Non-anglosized letters in name", timestamp]
            continue
        
        nam1 = nam.replace('-', ' ').replace('.', ' ')
        for sn in nam1.split():
            pho.append(phonetic.DoubleMetaphone().encode(sn)[0])
        
        
        fin_names.append(nam)
        pho = list(dict.fromkeys(pho))
        pho_list.append(pho)
        
    
    print(f"{len(fin_names):,} unique anglosized names extracted!")

    df = pd.DataFrame(data={"List": "BIS Denied Persons List", "Name": fin_names, "Phonemes": pho_list, "Timestamp": timestamp})
    
    return df, exc_df

################################################################################################


## 1.3. Function Definition - European Union: Financial Sanctions List <a class="anchor" id="third-func-def"></a>

Go to [Table of Contents](#toc)

https://eeas.europa.eu/headquarters/headquarters-homepage_en/8442/Consolidated%20list%20of%20sanctions

In order to facilitate the application of financial sanctions, the European Banking Federation, the European Savings Banks Group, the European Association of Co-operative Banks and the European Association of Public Banks ("the EU Credit Sector Federations") and the Commission recognised the need for an EU consolidated list of persons, groups and entities subject to CFSP related financial sanctions. It was therefore agreed that the Credit Sector Federations would set up a database containing the consolidated list for the Commission, which would host and maintain the database and keep it up-to-date. This database was developed first and foremost to assist the members of the EU Credit Sector Federations in their compliance with financial sanctions.
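
The EU file comes in two column layouts, which the loader below tells apart by the name column: `NameAlias_WholeName` (paired with `Entity_SubjectType`) or `Naal_wholename` (paired with `Subject_type`). Here is a minimal sketch of that detection using only the standard library, run on an invented three-row sample. `person_names` is a hypothetical helper, not part of the notebook, and it assumes that subject type 'P' marks a person, as the filter in the loader suggests.

```python
import csv
import io

# (name column, subject-type column) for the two known layouts
LAYOUTS = [('NameAlias_WholeName', 'Entity_SubjectType'),
           ('Naal_wholename', 'Subject_type')]

def person_names(text):
    """Return unique person names from semicolon-separated text, or None if the layout is unknown."""
    reader = csv.DictReader(io.StringIO(text), delimiter=';')
    columns = reader.fieldnames or []
    for name_col, type_col in LAYOUTS:
        if name_col in columns:
            break
    else:
        return None  # neither layout matched
    names = []
    for row in reader:
        name = (row.get(name_col) or '').strip()
        # Keep non-empty person rows, skipping repeats while preserving order
        if name and row.get(type_col) == 'P' and name not in names:
            names.append(name)
    return names

sample = ("Entity_SubjectType;NameAlias_WholeName\n"
          "P;Jane Example\n"
          "E;Example Holdings\n"
          "P;Jane Example\n")
print(person_names(sample))  # ['Jane Example']
```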

In [None]:

#############################################################################################
##################          Function to check if the file is valid          #################
#############################################################################################

def load_FSF(fil):
    if '.csv' in fil:
        
        # Keywords to validate the FSF list
        fsf_strings = ['NameAlias_WholeName', 'Naal_wholename']
        
        try:
            db = pd.read_csv(fil, delimiter=';')
        except:
            print('Error in decoding file. Exiting...')
            return None

        if len([i for i in fsf_strings if i in db.columns]) > 0:
            names = load_from_EU_FSF(fil)
            return names
        else:
            return None
        
    else:
        print('EU FSF list in .CSV format not found...')
        return None
    
#############################################################################################


#############################################################################################
##################         Function to load the names from the file         #################
#############################################################################################


def load_from_EU_FSF(filname):
    
    print()
    timestamp = datetime.now().strftime("%d-%m-%Y, %H:%M:%S")
    
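    # Check the last extension only, so names with extra dots (or a leading './') still work
    if os.path.splitext(filname)[1].lower() != '.csv':
        print("Not EU FSF Consolidated list OR list is not in recommended .csv format...")
        return None

    try:
        db = pd.read_csv(filname, delimiter=';')
    except (OSError, ValueError):
        print('Error in decoding file. Exiting...')
        return None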
    
    print("EU-FSF List detected!")
    
    # There are 2 versions of FSF lists; the condition checks which version is available
    names = []
    if 'NameAlias_WholeName' in db.columns:
        namecol = 'NameAlias_WholeName'
        db.dropna(subset=[namecol], inplace=True)
        db.drop_duplicates(subset=namecol, inplace=True)
        print(f"{len(db):,} names detected!")
        names = db[db['Entity_SubjectType']=='P'][namecol].tolist()
        
    elif 'Naal_wholename' in db.columns:
        namecol = 'Naal_wholename'
        db.dropna(subset=[namecol], inplace=True)
        db.drop_duplicates(subset=namecol, inplace=True)
        print(f"{len(db):,} names detected!")
        names = db[db['Subject_type']=='P'][namecol].tolist()
    
    names = [name.lower() for name in names]
    names = list(set(names))
    print(f"{len(names):,} unique persons names found!")
        
    # Keep only names written with the accepted Latin character set
    pho_list = []
    fin_names = []
    exc_list = []
    for nam in names:
        pho = []
        if not re.match("^[A-Za-z0-9À-ÿ -.()/]*$", nam):
            exc_list.append(nam)
            continue
        
        nam1 = nam.replace('-', ' ').replace('.', ' ')
        for sn in nam1.split():
            pho.append(phonetic.DoubleMetaphone().encode(sn)[0])
        
        
        fin_names.append(nam.title())
        pho = list(dict.fromkeys(pho))
        pho_list.append(pho)
        
    
    print(f"{len(fin_names):,} unique anglosized names extracted!")

    df = pd.DataFrame(data={"List": "EU-FSF", "Name": fin_names, "Phonemes": pho_list, "Timestamp": timestamp})
    exc_df = pd.DataFrame(data={"List": "EU-FSF", "Name": exc_list, "Reason": "Non-anglosized letters in name", "Timestamp": timestamp})
    
    return df, exc_df

#############################################################################################


In [None]:
# Scratch check: hyphens and periods become token boundaries before phonetic encoding
val = "hello-ther.ee you. "

val = val.replace('-', ' ').replace('.', ' ')

val.split()

## 1.4. Function Definition - Random Name List <a class="anchor" id="fourth-func-def"></a>

Go to [Table of Contents](#toc)

https://fossbytes.com/tools/random-name-generator

An online tool (link provided) has been used to generate around 100,000 English, Spanish, Italian and German names to increase the size of the final name dataset.
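
The loader below draws a fixed-size random subset from each generated file. Here is a minimal sketch of that sampling step on an in-memory list, assuming one name per row and no header row; `sample_names` and the toy data are illustrative only. Capping the sample size at the number of available names avoids the `ValueError` that `random.sample` raises when asked for more items than exist.

```python
import random

def sample_names(names, size, seed=None):
    """Return up to `size` distinct entries chosen at random from `names`."""
    rng = random.Random(seed)      # a seed makes the draw repeatable
    k = min(size, len(names))      # never ask for more than is available
    return rng.sample(names, k)

toy = [f"Name {i}" for i in range(10)]
subset = sample_names(toy, 4, seed=0)
print(len(subset), len(set(subset)))  # 4 4
print(len(sample_names(toy, 50)))     # 10
```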

In [None]:

#############################################################################################
##################             Function to load the random names            #################
#############################################################################################


def load_random():
    # Samples up to 5,000 names from each of the four language files (up to 20,000 in total).
    # Names were generated using https://fossbytes.com/tools/random-name-generator
    
    print()
    timestamp = datetime.now().strftime("%d-%m-%Y, %H:%M:%S")
    
    files = ['random_names_fossbytes_english.csv',
             'random_names_fossbytes_spanish.csv',
             'random_names_fossbytes_italian.csv',
             'random_names_fossbytes_german.csv']
    s = 5000  # maximum number of names sampled per file
    
    try:
        dbs = []
        for filename in files:
            # Read with header=None as before: every row, including the first, is treated as a name
            df = pd.read_csv(filename, header=None)
            # Cap the sample size so files with fewer than `s` rows do not raise an error
            dbs.append(df.sample(n=min(s, len(df))))

        db = pd.concat(dbs, ignore_index=True, sort=False)
    except (OSError, ValueError):
        print('Error in decoding file. Exiting...')
        return None
        
    print("Random name list loaded!")
    
    print(f"{db.shape[0]:,} names detected!")
    names = db[0].tolist()
    names = list(set(names))
    print(f"{len(names):,} unique persons names found!")
        
    
    pho_list = []
    fin_names = []
    exc_list = []
    for nam in names:
        pho = []
        if not re.match("^[A-Za-z0-9À-ÿ -.()/]*$", nam):
            exc_list.append(nam)
            continue
        
        nam1 = nam.replace('-', ' ').replace('.', ' ')
        for sn in nam1.split():
            pho.append(phonetic.DoubleMetaphone().encode(sn)[0])
        
        
        fin_names.append(nam.title())
        pho = list(dict.fromkeys(pho))
        pho_list.append(pho)
        
    
    print(f"{len(fin_names):,} unique anglosized names extracted!")

    df = pd.DataFrame(data={"List": "RandomGen", "Name": fin_names, "Phonemes": pho_list, "Timestamp": timestamp})
    exc_df = pd.DataFrame(data={"List": "RandomGen", "Name": exc_list, "Reason": "Non-anglosized letters in name", "Timestamp": timestamp})
    
    return df, exc_df

#############################################################################################

## 1.5. Function Definition - Load Files <a class="anchor" id="fifth-func-def"></a>

Go to [Table of Contents](#toc)

Consolidation function that allows for loading of the different name lists with appropriate checks in place.
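
One way to picture the consolidation step is as a dispatch table from file extension to the validators that should be tried on that file. The sketch uses stand-in validators that only return a label, and the file names are hypothetical. In the notebook the corresponding functions are `load_NSPLC`, `load_DPL` and `load_FSF`, which return `None` when a file is not theirs. `os.path.splitext` is used because it returns the last extension even when the file name contains several dots.

```python
import os

# Stand-in validators: each returns a label if it accepts the file, else None
def try_nsplc(path):
    return 'NS-PLC' if 'plc' in path.lower() else None

def try_dpl(path):
    return 'DPL' if 'dpl' in path.lower() else None

def try_fsf(path):
    return 'EU-FSF' if 'fsf' in path.lower() else None

# Which validators to try for each extension
LOADERS = {'.txt': [try_nsplc, try_dpl], '.csv': [try_fsf]}

def dispatch(paths):
    """Return (path, label) pairs for every file accepted by some validator."""
    accepted = []
    for path in paths:
        ext = os.path.splitext(path)[1].lower()
        for loader in LOADERS.get(ext, []):
            result = loader(path)
            if result is not None:
                accepted.append((path, result))
    return accepted

print(dispatch(['ns-plc.v2.txt', 'dpl.txt', 'eu_fsf.csv', 'notes.md']))
# [('ns-plc.v2.txt', 'NS-PLC'), ('dpl.txt', 'DPL'), ('eu_fsf.csv', 'EU-FSF')]
```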

In [None]:

#############################################################################################
##################      Function to load the files based on user input      #################
#############################################################################################

def load_files(disable_random=False):
    
    print('''Hello user! This script allows you to load 3 types of restricted persons files:
            1. OFAC NS-PLC list (in TXT format)
            2. BIS Denied Persons List (in TXT format)
            3. EU FSF list (in CSV format)''')
    ftype = input('''Following are the instructions to load the files from the current path (please make sure that the files exist in the same path as this script):
        1. (Default) Type 'all' for loading all the files in the current path
        2. Type 'ofac' for loading the OFAC NS-PLC list in TXT format
        3. Type 'dpl' for loading the Bureau of Industry and Security - Denied Persons List in TXT format
        4. Type 'fsf' for loading the EU FSF list in CSV format
        
        Choice: ''')

    start_time = time.time()

    if ftype == '':
        print('\tNo choice entered. Default choice \'all\' selected!')
        ftype='all'
    
    # Collect per-file results and concatenate once at the end
    # (DataFrame.append was removed in pandas 2.0)
    name_frames = []
    exc_frames = []
    
    def collect(result):
        # Each loader returns (names_df, excluded_df), or None if the file was rejected
        if result is not None:
            name_frames.append(result[0])
            exc_frames.append(result[1])
    
    for fil in os.listdir():
        if fil.lower().endswith('.txt'):
            if ftype in ('all', 'ofac'):
                collect(load_NSPLC(fil))
            if ftype in ('all', 'dpl'):
                collect(load_DPL(fil))
        elif fil.lower().endswith('.csv'):
            if ftype in ('all', 'fsf'):
                collect(load_FSF(fil))
    
    if not disable_random:
        collect(load_random())
    
    # pd.concat raises on an empty list, so fall back to empty frames
    global_names = pd.concat(name_frames, ignore_index=True) if name_frames else pd.DataFrame()
    global_exc_names = pd.concat(exc_frames, ignore_index=True) if exc_frames else pd.DataFrame()
    
    
    print(f"\n--- Execution Time: {np.round((time.time() - start_time)*1000, 2):,} ms ---")
    inc_shape = global_names.shape[0]
    exc_shape = global_exc_names.shape[0]
    print(f'\n\nTotal {inc_shape:,} names fetched!')
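    # Guard against division by zero when no names were loaded at all
    total_shape = inc_shape + exc_shape
    exc_pct = np.round(exc_shape*100/total_shape, 2) if total_shape else 0.0
    print(f'\n\nTotal {exc_shape:,} ({exc_pct} %) names excluded!')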
    return global_names, global_exc_names

#############################################################################################

# 2. Loading names <a class="anchor" id="data-load"></a>

Go to [Table of Contents](#toc)

This script allows you to load 3 types of restricted persons files:
1. OFAC NS-PLC list (in TXT format)
2. BIS Denied Persons List (in TXT format)
3. EU FSF list (in CSV format)

To select the files, type one of the following options when prompted:
- all - (Default) load all the files in the current path
- ofac - load the OFAC NS-PLC list in TXT format
- dpl - load the Bureau of Industry and Security Denied Persons List in TXT format
- fsf - load the EU FSF list in CSV format
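
The prompt only acts on the exact option strings listed above and on an empty answer; anything else falls through without loading a sanctions file. A small, optional sketch of normalizing the typed answer before it is compared is shown here. `normalize_choice` is a hypothetical helper, not something the notebook defines.

```python
VALID_CHOICES = ('all', 'ofac', 'dpl', 'fsf')

def normalize_choice(raw):
    """Map the typed answer to one of the valid options, defaulting to 'all'."""
    choice = raw.strip().lower()
    if choice == '':
        return 'all'  # same default as the prompt
    if choice not in VALID_CHOICES:
        raise ValueError(f"Unknown choice {raw!r}; expected one of {VALID_CHOICES}")
    return choice

print(normalize_choice(''))        # all
print(normalize_choice(' OFAC '))  # ofac
```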

In [None]:
disable_random = True # ignores random names
#disable_random = False # includes random names


names, exc = load_files(disable_random=disable_random)

In [None]:
# Final list of acceptable names
names

In [None]:
names.groupby('List').count()

In [None]:
# Final list of rejected names with reasons
exc

# 3. Save the final name list <a class="anchor" id="save-data"></a>

Go to [Table of Contents](#toc)

In [None]:
if disable_random:
    ### SAVE NAMES without RANDOM when generated
    names.to_pickle('Final_Names_wo_Random.pkl')
    exc.to_pickle('Excluded_Names_wo_Random.pkl')
else:
    ### SAVE NAMES with RANDOM when generated
    names.to_pickle('Final_Names_w_Random.pkl')
    exc.to_pickle('Excluded_Names_w_Random.pkl')
    