# Module 1 Assignment
#### By: Steven Hunt


#### 1. The file to be imported should be passed as a parameter to a function which evaluates what type the imported file is based on its extension and invokes the proper function to import the file.

#### 2. There should be four functions, each of which imports either a CSV, JSON, XML, or Excel file.
#### 3. There should be function(s) to process data from the imported file. You may show some statistics or output some sample data.

#### 4. You should use a RELATIVE file path (not an ABSOLUTE/FULL PATH) when importing a file.

For this module I used the first 100 data points of each of the sample data sets provided in chapters 3 and 4 of the 'Data Wrangling with Python' book.

In [22]:
# Import Libraries
import os
import xml.etree.ElementTree as ET
import json
import csv
import xlrd

# Identify Data Files in referenced Directory and return list of files with referenced extension
def directory_walker(extension, directory):
    return_list = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            if name.endswith(extension):
                filename = os.path.join(root, name)
                return_list.append(filename)
    
    return return_list

# Function to parse all JSON files in the referenced list and add them to the dictionary with an iterative key
def json_parser(data_dict, files):
    counter = 0
    for item in files:
        counter += 1
        with open(item) as f:
            data = json.load(f)
        data_dict['json file ' + str(counter)] = data
    return data_dict

# Function to parse all CSV files in the referenced list and add them to the dictionary with an iterative key
def csv_parser(data_dict, files):
    counter = 0
    for item in files:
        counter += 1
        # Read every row into a list so the file can be closed and the data reused
        with open(item, 'r', newline='') as csvfile:
            data = list(csv.DictReader(csvfile))

        data_dict['csv file ' + str(counter)] = data
    return data_dict

# Function to parse all XML files in list referenced and add to the dictionary with an iterative key
def xml_parser(data_dict, files):
    counter = 0
    for item in files:
        counter += 1
        tree = ET.parse(item)
        root = tree.getroot()

        data = root.find('Data')

        data_dict['xml file ' + str(counter)] = data
    return data_dict

# Function to parse all Excel files in list referenced and add to the dictionary with an iterative key
def excel_parser(data_dict, files):
    counter = 0
    for item in files:
        counter += 1
        book = xlrd.open_workbook(item)
        sheet = book.sheet_by_name('Table 9 ')

        data = {}
        # Start at row index 14, because that is where the country data begins,
        # and read the first 100 rows of country data
        for i in range(14, 114):
            row = sheet.row_values(i)
            country = row[1]
            data[country] = {
                'child_labor': {
                    'total': [row[4], row[5]],
                    'male': [row[6], row[7]],
                    'female': [row[8], row[9]],
                },
                'child_marriage': {
                    'married_by_15': [row[10], row[11]],
                    'married_by_18': [row[12], row[13]],
                }
            }

        data_dict['excel file ' + str(counter)] = data
    
    return data_dict

The section above contains all of the functions that work together to identify the data files in the provided directory and parse their data. The output is a dictionary in which each key is a string combining the file type and an iteration number, for ease of understanding.

There are 5 functions:

directory_walker()
json_parser()
csv_parser()
xml_parser()
excel_parser()

The directory_walker() function identifies all of the files in a directory and can be used to return a list of all files of a specified file type.
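
As a quick illustration of how directory_walker() behaves, the sketch below repeats the function and runs it against a throwaway directory. The file names are hypothetical and are not the assignment data. Note that `endswith` is case-sensitive, so a file named `d.CSV` is not matched by `'.csv'`.

```python
import os
import tempfile


def directory_walker(extension, directory):
    """Return paths of all files under `directory` whose names end with `extension`."""
    matches = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith(extension):
                matches.append(os.path.join(root, name))
    return matches


# Build a temporary directory tree (hypothetical file names) to search.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "nested"))
    for rel_path in ("a.csv", "b.json", os.path.join("nested", "c.csv"), "d.CSV"):
        with open(os.path.join(tmp, rel_path), "w") as f:
            f.write("")

    # Report matches relative to the temporary root so the output is stable.
    found = sorted(os.path.relpath(p, tmp) for p in directory_walker(".csv", tmp))
    print(found)
```

Because the search uses os.walk, files in subdirectories are included as well as those at the top level.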

The four remaining parser functions each process the list of files matching their file type. The JSON and CSV functions are designed to handle any file containing data in those formats. The XML and Excel parsers are specific to the data files used here: both formats allow highly varied layouts, so the parser would need to be redesigned for each data file depending on its structure.
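
To show how layout-specific the XML handling is, here is a minimal sketch that flattens each observation into a plain dictionary. It assumes the Dim/Value layout seen in the XML example output of this notebook. The inline sample, the root element name `GHO`, the element name `Observation`, and the helper `flatten_observation` are assumptions made for illustration, not part of the notebook's code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the Dim/Value layout of the XML data;
# the element names "GHO" and "Observation" are assumptions.
SAMPLE_XML = """
<GHO>
  <Data>
    <Observation>
      <Dim Category="YEAR" Code="1990"/>
      <Dim Category="SEX" Code="BTSX"/>
      <Dim Category="COUNTRY" Code="AND"/>
      <Value Numeric="77.00000"/>
    </Observation>
    <Observation>
      <Dim Category="YEAR" Code="2000"/>
      <Dim Category="SEX" Code="BTSX"/>
      <Dim Category="COUNTRY" Code="AND"/>
      <Value Numeric="80.00000"/>
    </Observation>
  </Data>
</GHO>
"""


def flatten_observation(observation):
    """Turn one observation's Dim/Value children into a flat dictionary."""
    record = {}
    for child in observation:
        if child.tag == "Dim":
            # Each Dim carries a category name and its code.
            record[child.attrib["Category"]] = child.attrib["Code"]
        elif child.tag == "Value":
            record["Numeric"] = float(child.attrib["Numeric"])
    return record


root = ET.fromstring(SAMPLE_XML.strip())
records = [flatten_observation(obs) for obs in root.find("Data")]
print(records)
```

A different XML file with other tag or attribute names would need its own version of this function, which is the limitation described above.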

In [23]:
# Function that finds the data files in a directory and parses each one according to its file type
def get_data(directory):

    # Initialize Data Dictionary and Filetypes List
    data_dict = {}
    filetypes = ['.csv', '.json', '.xml', '.xls']

    # Create Lists of Each file Type in the Directory
    csv_files = directory_walker(filetypes[0], directory)
    json_files = directory_walker(filetypes[1], directory)
    xml_files = directory_walker(filetypes[2], directory)
    excel_files = directory_walker(filetypes[3], directory)

    #Build Data Dictionary for housing all collected data from each file
    data_dict = json_parser(data_dict, json_files)
    data_dict = csv_parser(data_dict, csv_files)
    data_dict = xml_parser(data_dict, xml_files)
    data_dict = excel_parser(data_dict, excel_files)
    
    return data_dict

The above function takes a directory argument, runs all of the files it finds through the parser functions, and returns a dictionary with all of the data collected.
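
get_data() works on a whole directory. For comparison, the first assignment requirement describes passing a single file to a function that picks the importer from the file extension. The sketch below shows one possible way to do that with a dictionary lookup. The names `import_file`, `LOADERS`, `load_csv`, and `load_json` are hypothetical and are not used elsewhere in this notebook, and only the CSV and JSON loaders are included to keep it short.

```python
import csv
import json
import os
import tempfile


def load_csv(path):
    """Read a CSV file into a list of row dictionaries."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def load_json(path):
    """Read a JSON file into Python objects."""
    with open(path) as f:
        return json.load(f)


# Hypothetical dispatch table; XML and Excel loaders could be registered the same way.
LOADERS = {".csv": load_csv, ".json": load_json}


def import_file(path):
    """Choose a loader based on the file extension and return the parsed data."""
    extension = os.path.splitext(path)[1].lower()
    try:
        loader = LOADERS[extension]
    except KeyError:
        raise ValueError(f"Unsupported file type: {extension!r}") from None
    return loader(path)


# Write two small sample files (made-up content) and import them.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    with open(csv_path, "w", newline="") as f:
        f.write("Country,Year\nCanada,2000\n")
    json_path = os.path.join(tmp, "sample.json")
    with open(json_path, "w") as f:
        json.dump({"Country": "Canada"}, f)

    csv_rows = import_file(csv_path)
    json_data = import_file(json_path)

print(csv_rows)
print(json_data)
```

An unsupported extension raises a ValueError rather than being skipped silently, which makes a misnamed file easier to notice.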

In [24]:
def main():
    directory = 'StevenHunt_Mod1_Data'
    data_dict = get_data(directory)
    
    for key, value in data_dict.items():
        print(key)


main()

json file 1
csv file 1
xml file 1
excel file 1


The main function executes all of the above code and assigns the data dictionary to the data_dict variable.  From here all of the data can be accessed directly by manipulating the data_dict dictionary. Some examples are below:

### CSV Example

In [25]:
def main_with_csv_file_print():
    directory = 'StevenHunt_Mod1_Data'
    data_dict = get_data(directory)
    for row in data_dict['csv file 1']:
        print(row)

main_with_csv_file_print()

{'Indicator': 'Life expectancy at birth (years)', 'PUBLISH STATES': 'Published', 'Year': '1920', 'WHO region': 'Americas', 'Country': 'Canada', 'Sex': 'Both sexes', 'Display Value': '82.8', 'Numeric': '82.80972', 'Low': '', 'High': '', 'Comments': 'WHO life table method: Vital registration'}
{'Indicator': 'Life expectancy at birth (years)', 'PUBLISH STATES': 'Published', 'Year': '2000', 'WHO region': 'Eastern Mediterranean', 'Country': 'Afghanistan', 'Sex': 'Male', 'Display Value': '54.6', 'Numeric': '54.57449', 'Low': '', 'High': '', 'Comments': ''}
{'Indicator': 'Healthy life expectancy (HALE) at birth (years)', 'PUBLISH STATES': 'Published', 'Year': '2000', 'WHO region': 'Eastern Mediterranean', 'Country': 'Afghanistan', 'Sex': 'Male', 'Display Value': '46.9', 'Numeric': '46.93113', 'Low': '', 'High': '', 'Comments': ''}
{'Indicator': 'Life expectancy at age 60 (years)', 'PUBLISH STATES': 'Published', 'Year': '2000', 'WHO region': 'Eastern Mediterranean', 'Country': 'Afghanistan', '

### XML Example

In [26]:
def main_with_xml_file_print():
    directory = 'StevenHunt_Mod1_Data'
    data_dict = get_data(directory)
    for observation in data_dict['xml file 1']:        
        for child in observation:
            print(child.tag, child.attrib)

main_with_xml_file_print()

Dim {'Category': 'PUBLISHSTATE', 'Code': 'PUBLISHED'}
Dim {'Category': 'YEAR', 'Code': '1990'}
Dim {'Category': 'SEX', 'Code': 'BTSX'}
Dim {'Category': 'GHO', 'Code': 'WHOSIS_000001'}
Dim {'Category': 'REGION', 'Code': 'EUR'}
Dim {'Category': 'COUNTRY', 'Code': 'AND'}
Dim {'Category': 'WORLDBANKINCOMEGROUP', 'Code': 'WB_HI'}
Value {'Numeric': '77.00000'}
Dim {'Category': 'WORLDBANKINCOMEGROUP', 'Code': 'WB_HI'}
Dim {'Category': 'YEAR', 'Code': '2000'}
Dim {'Category': 'SEX', 'Code': 'BTSX'}
Dim {'Category': 'COUNTRY', 'Code': 'AND'}
Dim {'Category': 'REGION', 'Code': 'EUR'}
Dim {'Category': 'GHO', 'Code': 'WHOSIS_000001'}
Dim {'Category': 'PUBLISHSTATE', 'Code': 'PUBLISHED'}
Value {'Numeric': '80.00000'}
Dim {'Category': 'GHO', 'Code': 'WHOSIS_000015'}
Dim {'Category': 'COUNTRY', 'Code': 'AND'}
Dim {'Category': 'YEAR', 'Code': '2012'}
Dim {'Category': 'REGION', 'Code': 'EUR'}
Dim {'Category': 'SEX', 'Code': 'FMLE'}
Dim {'Category': 'WORLDBANKINCOMEGROUP', 'Code': 'WB_HI'}
Dim {'Categor

Using the key values, it is possible to isolate and access each data set from the dictionary.
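
Once a data set has been isolated, simple statistics can be computed from it. The sketch below averages the `Numeric` column per `Indicator` for CSV-style rows. The three rows are abbreviated from the CSV example output in this notebook, and the helper name `mean_numeric_by_indicator` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Three rows abbreviated from the CSV example output (only the fields used here).
rows = [
    {"Indicator": "Life expectancy at birth (years)",
     "Country": "Canada", "Numeric": "82.80972"},
    {"Indicator": "Life expectancy at birth (years)",
     "Country": "Afghanistan", "Numeric": "54.57449"},
    {"Indicator": "Healthy life expectancy (HALE) at birth (years)",
     "Country": "Afghanistan", "Numeric": "46.93113"},
]


def mean_numeric_by_indicator(rows):
    """Group rows by indicator and average their numeric values."""
    grouped = defaultdict(list)
    for row in rows:
        value = row.get("Numeric", "")
        if value:  # skip rows where the numeric field is blank
            grouped[row["Indicator"]].append(float(value))
    return {indicator: mean(values) for indicator, values in grouped.items()}


summary = mean_numeric_by_indicator(rows)
for indicator, average in summary.items():
    print(f"{indicator}: {average:.2f}")
```

The same function could be applied to the rows stored under `data_dict['csv file 1']`, assuming they carry the `Indicator` and `Numeric` columns shown in the sample output.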