# VA Data tools
## Tools for navigating VA Datasets

## History
The VA open data portal (https://www.data.va.gov/) contains several thousand data files in Excel, CSV, and PDF formats.  Brent Brewington, using R, flattened out the dataset inventory from JSON to a CSV file.  Both the CSV file and the R code used to do this can be found at his repository:
https://github.com/bbrewington/VA-open-data-mysandbox



In [None]:
import csv, re, requests, io
import pandas as pd

We will download Brent's CSV file (the version dated 2017-12-02 in the URL below) and save it locally. I only need to rerun this step if the inventory changes.

In [None]:
VADataInventory = requests.get('https://raw.githubusercontent.com/bbrewington/VA-open-data-mysandbox/master/va_data_inventory_links_checked_20171202.csv')
VADataInventory.raise_for_status()

# Write the response to disk in chunks; the with-block closes the file for us
with open('va_data_inventory.csv', 'wb') as megadata:
    for chunk in VADataInventory.iter_content(100000):
        megadata.write(chunk)

Did it work?

In [None]:
!dir
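
Since the download only needs to be repeated when the inventory changes, the cell could skip the request when the file is already on disk, and the check could be done from Python rather than with a shell command. The following is only a sketch: `fetch_if_missing` and `fake_fetch` are hypothetical names, and the network call is replaced by a stand-in so the example runs offline.

```python
import os
import tempfile

def fetch_if_missing(path, fetch):
    """Write the bytes returned by fetch() to path unless path already exists.

    Returns True if the file was written, False if it was already there.
    """
    if os.path.exists(path):
        return False
    with open(path, 'wb') as handle:
        handle.write(fetch())
    return True

def fake_fetch():
    # Stand-in for the real download (made-up content, not the VA inventory)
    return b'title,url\nExample dataset,http://example.invalid/data.csv\n'

demo_path = os.path.join(tempfile.mkdtemp(), 'va_data_inventory.csv')
first = fetch_if_missing(demo_path, fake_fetch)   # file is missing, so it is written
second = fetch_if_missing(demo_path, fake_fetch)  # file exists, so nothing happens
print(first, second, os.path.getsize(demo_path), 'bytes')
```

For the real file, the `fetch` argument could be something like `lambda: requests.get(url).content`, with `url` set to the address used above.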

Cool! Once the file has been downloaded, I can start from the following cell on later runs. The purpose of each variable will become clearer as we dig through the various functions.

In [None]:
with open('va_data_inventory.csv', encoding="utf8", newline='') as MegaData:
    megasheet = csv.reader(MegaData)  # read this using the csv module
    alldata = list(megasheet)         # which is easier to work with in a list format
headers = alldata[0]                  # easily grab the headers this way
file_extension_regex = re.compile(r'\.((pdf|csv|xlsx?|zip|asp))', re.IGNORECASE)
states = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut',
          'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa',
          'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan',
          'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
          'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina',
          'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island',
          'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia',
          'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']
territories = ['American Samoa', 'Guam', 'Northern Mariana Islands', 'Puerto Rico']
states_territories = states + territories



Great! First, note that Brent's CSV file contains the following headers:

In [None]:
headers

These tools will help us browse the CSV file without losing too much by 'overcleaning' it. Remember that we are now treating the file as a list of lists: each row is a list, and each element of a row lines up with the corresponding entry in `headers`. For example:

In [None]:
alldata[1]

The first two tools are basic searches: `vsearch` searches a single line, and `searchall` searches the entire file.

In [None]:
def vsearch(line, term): # search one line of the inventory for a term (treated as a regex)
    search_term = re.compile(term, re.IGNORECASE)
    fields = filter(search_term.search, alldata[line])
    return list(fields)

def searchall(term): # search entire CSV file. The output will tell you what line it's on.
    searchresults = []
    for line in range(len(alldata)):
        matches = vsearch(line, term)  # search each line only once
        if matches:
            # report the alldata index so it can be passed straight to the other tools
            searchresults = searchresults + [matches, 'line:' + str(line)]
    return searchresults


In [None]:
vsearch(1, 'state')
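
Because `vsearch` treats the search term as a regular expression, characters such as `(` or `+` in a term are read as regex syntax rather than literal text. A minimal sketch of a literal search is shown here. It uses a small made-up table (`sample_rows`) in place of `alldata`, and `literal_search` is a hypothetical helper, not part of the tools above.

```python
import re

# Made-up rows standing in for alldata; row 0 plays the role of the headers
sample_rows = [
    ['title', 'description'],
    ['Expenditures (FY2015)', 'Spending by state'],
    ['Population', 'Veteran counts by county'],
]

def literal_search(rows, term):
    """Return (row index, matching fields) for every row containing term."""
    # re.escape makes characters like '(' match themselves
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    results = []
    for index, row in enumerate(rows):
        hits = [field for field in row if pattern.search(field)]
        if hits:
            results.append((index, hits))
    return results

print(literal_search(sample_rows, '(fy2015)'))
# prints [(1, ['Expenditures (FY2015)'])]
```

Returning `(index, fields)` tuples keeps each match paired with its row number, which may be easier to scan than a flat list.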

What if we wanted all the 'by state' datasets?

In [None]:
searchall('by state')

Note that the line number is given, so I can pull up that dataset. Line 468 looks interesting. The raw output is a little hard to read, though, so here are a few more tools.

In [None]:
def keywords(line): # What keywords describe this file?
    return alldata[line][7]

def whatformat(line): # This will tell you what format a datafile is in
    return alldata[line][31]

def whaturl(line): # Where can I find the file for download?
    return alldata[line][-1]

def getheaders(line): # get headers of a CSV or Excel file, assuming they are on the first line
    earl = whaturl(line)
    fmt = whatformat(line).lower()
    print('Getting file...')
    # for Excel files
    if fmt in ('xlsx', 'xls'):
        print('Excel file. Getting headers...')
        df = pd.read_excel(earl)
        return df.columns.tolist()
    # for CSV files
    elif fmt == 'csv':
        print('CSV file. Getting headers...')
        res = requests.get(earl)
        res.raise_for_status()  # stop here if the download failed
        df = pd.read_csv(io.StringIO(res.text))
        return df.columns.tolist()
    else:
        print('Oh dear! This does not appear to be an Excel or CSV file.')

In [None]:
keywords(468)

In [None]:
whatformat(468)

In [None]:
whaturl(468)

This function loads the file into a pandas DataFrame and returns its column headers as a list. It could use some help: it pulls down the entire file just to give you the headers. There are also a number of reasons why it might not always work, including the assumption that the headers are always on the first row.
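
One possible way to relax the first-row assumption is to scan the first few rows for one that looks like a header. The sketch below uses only the standard library and a made-up sample. `find_header` is a hypothetical helper, and its rule (more than one cell, none of them empty) is an assumption that will not suit every file.

```python
import csv
import io
from itertools import islice

def find_header(lines, max_rows=10):
    """Return (row_index, cells) for the first row with no empty cells.

    Only the first max_rows rows are examined; None means nothing qualified.
    """
    for index, row in enumerate(islice(csv.reader(lines), max_rows)):
        if len(row) > 1 and all(cell.strip() for cell in row):
            return index, row
    return None

# Made-up file with a title line and a blank line above the real headers
sample = io.StringIO(
    'Veteran Population Report,,\n'
    ',,\n'
    'State,Year,Total\n'
    'Ohio,2015,1000\n'
)
print(find_header(sample))
# prints (2, ['State', 'Year', 'Total'])
```

Because `csv.reader` is lazy, only the rows up to the header are parsed. With pandas, the detected index could then be passed as `skiprows`, and `nrows=0` asks `pd.read_csv` or `pd.read_excel` for the column names without the data rows.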

In [None]:
getheaders(468)

Yes, like that.  How about:

In [None]:
getheaders(6)

Better.  Want to pull the entire dataframe?

In [15]:
def VAPandas(line): # Pull down CSV or Excel files into a pandas DataFrame, assuming the headers are on the first line
    earl = whaturl(line)
    print('Getting file...') 
    # for excel files
    if whatformat(line) in ('xlsx', 'xls'): 
        print('Excel file. Retrieving into Pandas Dataframe...')
        df = pd.read_excel(earl)
        return(df)
    # for CSV files
    elif whatformat(line) == 'csv':
        print('CSV file. Retrieving into Pandas Dataframe...')
        res = requests.get(whaturl(line))
        df = pd.read_csv(io.StringIO(res.text))
        return(df)
    else:
        print('Oh dear! This does not appear to be an excel or CSV file.')
        

In [16]:
df = VAPandas(6)
df, type(df)

Getting file...
Excel file. Retrieving into Pandas Dataframe...


(           Date Gender                                                POS  \
 0    2015-09-30      F                          (a)\nAll Veterans\n (b+c)   
 1    2015-09-30      F    (b)\nWartime Veterans\n (i+j+k+m+n+p+q+r+t+u+v)   
 2    2015-09-30      F             (c )\nPeacetime Veterans\n (h+l+o+s+w)   
 3    2015-09-30      F                               (d)\n WWII\n (i+j+k)   
 4    2015-09-30      F                   (e)\nKorean Conflict \n(j+k+m+n)   
 5    2015-09-30      F                      (f)\nVietnam Era\n(k+n+p+q+r)   
 6    2015-09-30      F                     (g)\nGulf War Era\n(q+r+t+u+v)   
 7    2015-09-30      F                                      (h)\nPre-WWII   
 8    2015-09-30      F                                   (i)\nWWII\n only   
 9    2015-09-30      F                              (j)\nWWII &\n KC only   
 10   2015-09-30      F                                (k)\nWWII, KC,\nVNE   
 11   2015-09-30      F                          (l)\n Between\n

The following function could also be improved. It is supposed to tell you which states or territories, if any, are mentioned anywhere in a given line of the inventory.

In [None]:
def whatstate(line):
    which_states = []
    for x in alldata[line]:
        for state in states_territories:
            # plain substring test, so 'Kansas' also matches inside 'Arkansas'
            if state.lower() in x.lower():
                which_states.append(state)
    if len(which_states) == 0:
        return "No state names found"
    else:
        return sorted(set(which_states))  # set removes dups; sorted gives a stable order

So, for example, which states are covered in lines 564, 565, and 1? Note that we do include territories, and will have ways of sorting out territories and states.

In [17]:
whatstate(564), whatstate(565), whatstate(1)

(['Delaware'], ['Florida'], 'No state names found')
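
One thing to improve: `whatstate` uses a plain substring test, so a field that mentions only 'Arkansas' also reports 'Kansas', and 'West Virginia' also reports 'Virginia'. A sketch of one possible fix follows, using a single regular expression with word boundaries and the longest names tried first. The names `STATE_NAMES`, `build_state_pattern`, and `find_states` are hypothetical, and the name list is shortened for illustration.

```python
import re

# Shortened list for illustration; the full states_territories list would be used in practice
STATE_NAMES = ['Arkansas', 'Kansas', 'Virginia', 'West Virginia', 'Guam']

def build_state_pattern(names):
    # Longest names first so 'West Virginia' is tried before 'Virginia'
    ordered = sorted(names, key=len, reverse=True)
    alternation = '|'.join(re.escape(name) for name in ordered)
    # \b keeps 'Kansas' from matching inside 'Arkansas'
    return re.compile(r'\b(?:%s)\b' % alternation, re.IGNORECASE)

def find_states(fields, names=STATE_NAMES):
    """Return the sorted list of names that appear as whole words in fields."""
    pattern = build_state_pattern(names)
    canonical = {name.lower(): name for name in names}
    found = set()
    for field in fields:
        for match in pattern.findall(field):
            found.add(canonical[match.lower()])
    return sorted(found)

print(find_states(['Veterans in WEST VIRGINIA', 'Arkansas summary']))
# prints ['Arkansas', 'West Virginia']
```

This version returns an empty list when nothing is found, which avoids mixing a string and a list as return types.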

So, what files have state data?

In [None]:
statedata = []
for i in range(len(alldata)):
    found = whatstate(i)  # call whatstate once per line
    if found != 'No state names found':
        statedata.append([found[0], i])  # keep only the first state in the returned list

In [None]:
sorted(statedata)
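
The loop above keeps only one state per line. If a line mentions several states, a dictionary from state name to line numbers keeps them all. This is a rough sketch on made-up rows: `index_by_state` is a hypothetical helper, and it uses the same simple substring matching as `whatstate`.

```python
from collections import defaultdict

def index_by_state(rows, state_names):
    """Map each state name to the list of row indices that mention it."""
    by_state = defaultdict(list)
    for index, row in enumerate(rows):
        text = ' '.join(row).lower()
        for state in state_names:
            if state.lower() in text:
                by_state[state].append(index)
    return dict(by_state)

# Made-up rows standing in for alldata
demo_rows = [
    ['title', 'format'],
    ['Expenditures for Ohio', 'csv'],
    ['Ohio and Texas summary', 'xls'],
    ['National totals', 'pdf'],
]
print(index_by_state(demo_rows, ['Ohio', 'Texas']))
# prints {'Ohio': [1, 2], 'Texas': [2]}
```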

# Other projects and future uses:
* Find all files with the same or similar headers
* Find data per county?
* Make the states/territories dataset into a dictionary format using state abbreviations
* Brent's CSV file is great, but maybe we could pull from the JSON directly?
* Fix whatever newbie coding mistakes I have made
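
For the first idea in the list above, one possible starting point is to compare two header lists with Jaccard similarity (shared headers divided by all distinct headers). This is only a sketch: `normalize` and `header_similarity` are hypothetical names, and the lowercasing and whitespace stripping are assumptions about what should count as "the same" header.

```python
def normalize(header_list):
    # Compare headers case-insensitively and ignore surrounding whitespace
    return {str(h).strip().lower() for h in header_list if str(h).strip()}

def header_similarity(a, b):
    """Jaccard similarity of two header lists, from 0.0 to 1.0."""
    set_a, set_b = normalize(a), normalize(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

print(header_similarity(['State', 'Year', 'Total'], ['state ', 'YEAR', 'Count']))
# prints 0.5
```

Two of the four distinct headers are shared here, which gives 0.5. Files scoring above some chosen threshold could then be grouped as having similar headers.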