Objective: given the `I94_SAS_Labels_Descriptions.SAS` text file, programmatically create the CSV lookup tables listed below. The file is small, so Pandas alone is sufficient (no Spark required).

* lookup_i94cntyl.csv
* lookup_i94port.csv
* lookup_i94mode.csv
* lookup_i94addr.csv
* lookup_i94visa.csv


In [1]:
import numpy as np
import pandas as pd

In [2]:
INPUT_PATH = 'raw_input_data/i94_proc_format_code_sas/I94_SAS_Labels_Descriptions.SAS'
OUTPUT_DIR = 'raw_input_data/i94_lookup_csv'

OUTPUT_PATH_LKUP_I94CNTYL = f"{OUTPUT_DIR}/lookup_i94cntyl.csv"
OUTPUT_PATH_LKUP_I94PORT = f"{OUTPUT_DIR}/lookup_i94port.csv"
OUTPUT_PATH_LKUP_I94MODE = f"{OUTPUT_DIR}/lookup_i94mode.csv"
OUTPUT_PATH_LKUP_I94ADDR = f"{OUTPUT_DIR}/lookup_i94addr.csv"
OUTPUT_PATH_LKUP_I94VISA = f"{OUTPUT_DIR}/lookup_i94visa.csv"

# Utility Functions

In [3]:
def txt_to_list(txt_path):
    """Read a text file and return its lines (newlines kept) as a NumPy array."""
    with open(txt_path) as f:
        lines = f.readlines()
    return np.asarray(lines)


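def subset_lookup_lines(
    np_array,
    lookup_title: str,
    start_str: str,
    start_pos_offset: int,
    end_str: str,
    end_pos_offset: int):
    """Return the lines between the first `start_str` line and the next `end_str` line."""
    start_pos, end_pos = None, None
    for idx, line in enumerate(np_array):
        if line == start_str:
            start_pos = idx + start_pos_offset
            break
    # Fail early with a clear message instead of scanning from the top of the file.
    if start_pos is None:
        raise ValueError(f"Start marker not found for {lookup_title}: {start_str!r}")

    # Look for the end marker only from the start position onwards.
    for idx, line in enumerate(np_array[start_pos:]):
        if line == end_str:
            end_pos = start_pos + idx + end_pos_offset
            break

    return {
        "lookup_title": lookup_title,
        "start_pos": start_pos,
        "end_pos": end_pos,
        "subset": np_array[start_pos:end_pos].copy()
    }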


def array_to_df(np_array, numeric_key=True, sort_by='key'):
    """Parse `key = 'value'` lines into a DataFrame with `key` and `value` columns."""
    lookup_list = []
    for line in np_array:
        # Split on the first "=" only, so a label containing "=" stays intact.
        left, right = line.split("=", 1)
        if numeric_key:
            key = int(left.strip())
        else:
            key = left.strip().replace("'", "").strip()
        value = right.strip().replace("'", "").replace(";", "").replace("\n", "").strip()
        lookup_list.append({
            "key": key,
            "value": value
        })

    df = pd.DataFrame(lookup_list).sort_values(by=sort_by)
    return df
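
To make the parsing rule concrete, here is a minimal, self-contained sketch that runs on a few in-memory lines rather than the real file. The sample lines and the helper name `parse_lookup_line` are illustrative assumptions, not part of the notebook; the rule itself (locate the header line, stop at the end marker, split each line on `=`, strip quotes and whitespace) mirrors `subset_lookup_lines` and `array_to_df`.

```python
# Illustrative sample only: a tiny stand-in for the SAS file contents.
sample_lines = [
    "/* comment */\n",
    "  value i94cntyl\n",
    "   236 =  'AFGHANISTAN'\n",
    "   101 =  'ALBANIA'\n",
    "\n",
    "value i94model\n",
]


def parse_lookup_line(line, numeric_key=True):
    """Hypothetical helper: turn one `key = 'value'` line into a (key, value) pair."""
    left, right = line.split("=", 1)
    key = left.strip().replace("'", "").strip()
    value = right.strip().replace("'", "").replace(";", "").strip()
    return (int(key) if numeric_key else key), value


# The block starts one line after the header and ends at the next blank line.
start = sample_lines.index("  value i94cntyl\n") + 1
end = sample_lines.index("\n", start)

# Sorting the (key, value) pairs plays the role of sort_values(by='key').
pairs = sorted(parse_lookup_line(line) for line in sample_lines[start:end])
print(pairs)
```

Passing `numeric_key=False` keeps the key as a string, which is what the port and address lookups need.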


# Read SAS Proc Format text file into a list of text lines

In [4]:
lines = txt_to_list(INPUT_PATH)

In [5]:
list(lines)

["libname library 'Your file location' ;\n",
 'proc format library=library ;\n',
 '\n',
 '/* I94YR - 4 digit year */\n',
 '\n',
 '/* I94MON - Numeric month */\n',
 '\n',
 '/* I94CIT & I94RES - This format shows all the valid and invalid codes for processing */\n',
 '  value i94cntyl\n',
 "   582 =  'MEXICO Air Sea, and Not Reported (I-94, no land arrivals)'\n",
 "   236 =  'AFGHANISTAN'\n",
 "   101 =  'ALBANIA'\n",
 "   316 =  'ALGERIA'\n",
 "   102 =  'ANDORRA'\n",
 "   324 =  'ANGOLA'\n",
 "   529 =  'ANGUILLA'\n",
 "   518 =  'ANTIGUA-BARBUDA'\n",
 "   687 =  'ARGENTINA '\n",
 "   151 =  'ARMENIA'\n",
 "   532 =  'ARUBA'\n",
 "   438 =  'AUSTRALIA'\n",
 "   103 =  'AUSTRIA'\n",
 "   152 =  'AZERBAIJAN'\n",
 "   512 =  'BAHAMAS'\n",
 "   298 =  'BAHRAIN'\n",
 "   274 =  'BANGLADESH'\n",
 "   513 =  'BARBADOS'\n",
 "   104 =  'BELGIUM'\n",
 "   581 =  'BELIZE'\n",
 "   386 =  'BENIN'\n",
 "   509 =  'BERMUDA'\n",
 "   153 =  'BELARUS'\n",
 "   242 =  'BHUTAN'\n",
 "   688 =  'BOLIVIA

# Create lookup_i94cntyl.csv

In [6]:
df_i94cntyl = array_to_df(
    subset_lookup_lines(
        lines,
        'i94cntyl',
        '  value i94cntyl\n',
        1,
        '\n',
        0
    )["subset"],
    numeric_key=True,
    sort_by='key'
)
df_i94cntyl

Unnamed: 0,key,value
255,0,INVALID: STATELESS
266,54,No Country Code (54)
267,100,No Country Code (100)
2,101,ALBANIA
4,102,ANDORRA
12,103,AUSTRIA
18,104,BELGIUM
31,105,BULGARIA
242,106,INVALID: CZECHOSLOVAKIA
167,107,POLAND


In [7]:
df_i94cntyl.to_csv(
    OUTPUT_PATH_LKUP_I94CNTYL,
    sep=',',
    index=False
)

# Create lookup_i94port.csv

In [8]:
df_i94port = array_to_df(
    subset_lookup_lines(
        lines,
        'i94port',
        '  value $i94prtl\n',
        1,
        ';\n',
        0
    )["subset"],
    numeric_key=False,
    sort_by='key'
)
df_i94port

Unnamed: 0,key,value
650,.GA,No PORT Code (.GA)
618,060,No PORT Code (60)
212,48Y,"PINECREEK BORDER ARPT, MN"
11,5KE,"KETCHIKAN, AK"
617,5T6,No PORT Code (5T6)
625,74S,No PORT Code (74S)
517,888,UNIDENTIFED AIR / SEAPORT
658,A2A,No PORT Code (A2A)
475,ABE,"ABERDEEN, WA"
457,ABG,"ALBURG, VT"


In [9]:
df_i94port.to_csv(
    OUTPUT_PATH_LKUP_I94PORT,
    sep=',',
    index=False
)
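
A note for whoever reads this file back: some port codes look numeric (for example `060`). In the full file the key column also contains letters, so `pd.read_csv` should keep it as text, but a subset made up only of numeric-looking codes would be inferred as integers and lose the leading zero. The sketch below uses a small in-memory CSV as a stand-in for the real file and shows `dtype` as a defensive choice; it is an illustration, not a step the notebook performs.

```python
import io

import pandas as pd

# In-memory stand-in for a slice of lookup_i94port.csv (rows from the output above).
csv_text = "key,value\n060,No PORT Code (60)\n888,UNIDENTIFED AIR / SEAPORT\n"

# Default type inference parses these all-numeric keys as integers.
df_inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing the key column to str keeps the codes exactly as written.
df_str = pd.read_csv(io.StringIO(csv_text), dtype={"key": str})

print(df_inferred["key"].tolist())
print(df_str["key"].tolist())
```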

# Create lookup_i94mode.csv

In [10]:
df_i94mode = array_to_df(
    subset_lookup_lines(
        lines,
        'i94mode',
        'value i94model\n',
        1,
        '\t\n',
        0
    )["subset"],
    numeric_key=True,
    sort_by='key'
)
df_i94mode

Unnamed: 0,key,value
0,1,Air
1,2,Sea
2,3,Land
3,9,Not reported


In [11]:
df_i94mode.to_csv(
    OUTPUT_PATH_LKUP_I94MODE,
    sep=',',
    index=False
)

# Create lookup_i94addr.csv

In [12]:
df_i94addr = array_to_df(
    subset_lookup_lines(
        lines,
        'i94addr',
        'value i94addrl\n',
        1,
        '\n',
        0
    )["subset"],
    numeric_key=False,
    sort_by='key'
)
df_i94addr

Unnamed: 0,key,value
54,99,All Other Codes
1,AK,ALASKA
0,AL,ALABAMA
3,AR,ARKANSAS
2,AZ,ARIZONA
4,CA,CALIFORNIA
5,CO,COLORADO
6,CT,CONNECTICUT
8,DC,DIST. OF COLUMBIA
7,DE,DELAWARE


In [13]:
df_i94addr.to_csv(
    OUTPUT_PATH_LKUP_I94ADDR,
    sep=',',
    index=False
)

# Create lookup_i94visa.csv

In [14]:
df_i94visa = array_to_df(
    subset_lookup_lines(
        lines,
        'i94visa',
        '/* I94VISA - Visa codes collapsed into three categories:\n',
        1,
        '*/\n',
        0
    )["subset"],
    numeric_key=True,
    sort_by='key'
)
df_i94visa

Unnamed: 0,key,value
0,1,Business
1,2,Pleasure
2,3,Student


In [15]:
df_i94visa.to_csv(
    OUTPUT_PATH_LKUP_I94VISA,
    sep=',',
    index=False
)

# Conclusion

We have developed a working text parser that extracts the unstructured lookup information from a SAS code (text) file into structured lookup tables in CSV format. These CSV files can be used downstream to enrich the consolidated analytical dataset, for instance by adding text labels alongside the cryptic numeric or letter codes. Analytical teams often find categorical labels like these useful when conducting analysis.
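
As a sketch of that downstream use, a lookup table can be left-joined onto a table of coded records. The `df_records` frame, its `record_id` and `i94mode` columns, and the `i94mode_label` name are assumptions made for illustration; only the lookup rows come from the `lookup_i94mode.csv` output above.

```python
import pandas as pd

# Lookup table with the same rows as lookup_i94mode.csv.
df_lookup = pd.DataFrame({
    "key": [1, 2, 3, 9],
    "value": ["Air", "Sea", "Land", "Not reported"],
})

# Hypothetical downstream records that carry only the numeric code.
df_records = pd.DataFrame({"record_id": [10, 11, 12], "i94mode": [1, 3, 9]})

# A left join keeps every record and attaches the label where the code matches.
df_labeled = (
    df_records
    .merge(df_lookup, how="left", left_on="i94mode", right_on="key")
    .drop(columns="key")
    .rename(columns={"value": "i94mode_label"})
)
print(df_labeled)
```

A left join is used here so that records with a code missing from the lookup are kept (with an empty label) rather than silently dropped.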

We may convert the Jupyter Notebook logic into Python modules or scripts to make this process more repeatable.
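
One possible shape for such a script is a table of lookup specifications driving a single loop. The sketch below is an assumption about how the refactor could look, not an existing module: `LOOKUP_SPECS`, `extract_block`, `block_to_df`, and `build_lookups` are hypothetical names, and it runs on a short in-memory sample (with made-up whitespace) instead of the real SAS file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical spec table: (name, start marker, end marker, numeric key?).
LOOKUP_SPECS = [
    ("i94mode", "value i94model\n", "\t\n", True),
    ("i94addr", "value i94addrl\n", "\n", False),
]


def extract_block(lines, start_str, end_str):
    """Return the lines strictly between the start marker and the next end marker."""
    start = lines.index(start_str) + 1
    end = lines.index(end_str, start)
    return lines[start:end]


def block_to_df(block, numeric_key):
    """Parse `key = 'value'` lines into a DataFrame sorted by key."""
    rows = []
    for line in block:
        left, right = line.split("=", 1)
        key = left.strip().replace("'", "").strip()
        value = right.strip().replace("'", "").replace(";", "").strip()
        rows.append({"key": int(key) if numeric_key else key, "value": value})
    return pd.DataFrame(rows).sort_values(by="key")


def build_lookups(lines, output_dir):
    """Write one lookup_<name>.csv per spec and return the written paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, start_str, end_str, numeric_key in LOOKUP_SPECS:
        df = block_to_df(extract_block(lines, start_str, end_str), numeric_key)
        path = output_dir / f"lookup_{name}.csv"
        df.to_csv(path, index=False)
        paths.append(path)
    return paths


# Illustrative sample standing in for the SAS file contents.
sample = [
    "value i94model\n",
    "\t1 = 'Air'\n",
    "\t2 = 'Sea'\n",
    "\t\n",
    "value i94addrl\n",
    "\t'AL'='ALABAMA'\n",
    "\t'AK'='ALASKA'\n",
    "\n",
]

with tempfile.TemporaryDirectory() as tmp:
    written = build_lookups(sample, tmp)
    written_names = [p.name for p in written]
    addr_keys = pd.read_csv(written[1])["key"].tolist()
print(written_names, addr_keys)
```

Keeping the markers in one spec table means adding a new lookup is a one-line change, and the same loop can be called from a command-line entry point or a scheduled job.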