# Overview

The script below extracts data from multiple sources and file types (csv, json, xml), performs a simple cleaning task, and outputs a single csv file. The output file collates all of the ingested data after the cleaning transformation has been applied.

This was just a fun evening project, and the code has not been designed for any kind of commercial application.

The dataset is very simple: a handful of people, with a log of their names, heights and weights. The data folder holding the datasets contains three duplicates of the same data. Again, the data itself is not important; the point was to have a simple collection of multiple data sources to play around with.
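
As a rough picture of the overall flow, here is a minimal sketch that runs extract, collate, clean and output on two tiny in-memory sources. The sample rows, the lowercase column names and the choice of `drop_duplicates` as the cleaning step are all assumptions made for illustration; they are not taken from the real data files or the final script.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for a csv source and a json source.
csv_source = io.StringIO(
    "name,height,weight\n"
    "alex,65.78,112.99\n"
    "ajay,71.52,136.49\n"
)
json_source = io.StringIO(
    '{"name": "alex", "height": 65.78, "weight": 112.99}\n'
    '{"name": "jack", "height": 68.7, "weight": 123.3}\n'
)

# Extract: one dataframe per source.
frames = [pd.read_csv(csv_source), pd.read_json(json_source, lines=True)]

# Collate, then clean. Dropping exact duplicate rows is an assumed cleaning step.
combined = pd.concat(frames, ignore_index=True)
cleaned = combined.drop_duplicates().reset_index(drop=True)

# Output: serialise to csv (returned as a string here rather than written to disk).
csv_output = cleaned.to_csv(index=False)
```

Passing a file path to `to_csv` instead of leaving it empty would write the collated data to disk.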

# Set Up

In [1]:
import glob
import pandas as pd
import xml.etree.ElementTree as ET
from datetime import datetime

# Building The Functions

In [33]:
# Function to read in a csv file
def extract_from_csv(file_to_process):
    '''
    This function will read in a csv file and output a pandas dataframe object
    Input: file path
    Output: pandas object
    Params: file_to_process
    '''
    dataframe = pd.read_csv(file_to_process)
    return dataframe





# Function to read in multiple csv files
def multiple_csv(path_list):
    '''
    This function will read in a list of csv files and output a pandas dataframe object
    Input: iterable list of file paths
    Output: pandas object
    Params: path_list
    '''
    df_list = []
    for path in path_list:
        dataframe = extract_from_csv(path)
        df_list.append(dataframe)

    # Stack the per-file dataframes and renumber the rows from 0
    return pd.concat(df_list, ignore_index=True)
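
The `glob` module imported in the set up cell is a natural way to build the `path_list` argument. The sketch below writes two throwaway csv files into a temporary folder and collects them with a wildcard pattern; the folder, file names and sample row are hypothetical, and `multiple_csv` is redefined in compact form only so the cell runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd


def multiple_csv(path_list):
    # Compact stand-in for the function above: read each file, then stack the results.
    return pd.concat([pd.read_csv(path) for path in path_list], ignore_index=True)


with tempfile.TemporaryDirectory() as data_dir:  # hypothetical data folder
    for i in range(2):
        with open(os.path.join(data_dir, f"source{i}.csv"), "w") as handle:
            handle.write("name,height,weight\nalex,65.78,112.99\n")

    # sorted() makes the read order deterministic; glob itself does not guarantee an order.
    csv_paths = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
    csv_data = multiple_csv(csv_paths)
```

One thing to watch for: `pd.concat` raises a `ValueError` when it is given an empty list, so a pattern that matches no files will fail at the concatenation step.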


In [34]:
# Function to read in a json file
def extract_from_json(file_to_process):
    '''
    This function will read in a json file and output a pandas dataframe object.
    The file is expected to hold one json object per line (lines=True).
    Input: file path
    Output: pandas object
    Params: file_to_process
    '''
    dataframe = pd.read_json(file_to_process, lines=True)
    return dataframe





# Function to read in multiple json files
def multiple_json(path_list):
    '''
    This function will read in a list of json files and output a pandas dataframe object
    Input: iterable list of file paths
    Output: pandas object
    Params: path_list
    '''
    df_list = []
    for path in path_list:
        dataframe = extract_from_json(path)
        df_list.append(dataframe)

    # Stack the per-file dataframes and renumber the rows from 0
    return pd.concat(df_list, ignore_index=True)
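
The `lines=True` argument matters because it tells pandas to treat every line of the input as its own json object, rather than expecting a single json document. The short sketch below shows both layouts side by side, using made-up records held in memory instead of the real data files.

```python
import io

import pandas as pd

# Line-delimited json: one object per line, which is what lines=True expects.
json_lines = io.StringIO(
    '{"name": "alex", "height": 65.78, "weight": 112.99}\n'
    '{"name": "ajay", "height": 71.52, "weight": 136.49}\n'
)
people_from_lines = pd.read_json(json_lines, lines=True)

# A single json array of records is read with the default lines=False instead.
json_array = io.StringIO(
    '[{"name": "alex", "height": 65.78, "weight": 112.99},'
    ' {"name": "ajay", "height": 71.52, "weight": 136.49}]'
)
people_from_array = pd.read_json(json_array)
```

Both calls produce the same two-row dataframe, so the choice depends only on how the source file is laid out.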

In [35]:
# Function to read in an xml file
def extract_from_xml(file_to_process):
    '''
    This function will read in an xml file and output a pandas dataframe object. This assumes a very specific xml tree schema and is not broadly
    applicable to any other xml schemas.
    Input: file path
    Output: pandas object
    Params: file_to_process
    '''
    # Create an empty dataframe
    dataframe = pd.DataFrame(columns = ['Name', 'Height', 'Weight'])

    # Create tree object and access the root element
    tree = ET.parse(file_to_process)
    root_element = tree.getroot()

    # Iterate through each child of the root element (in this case this is an individual person)
    for child in root_element:
        # Accessing name, height and weight of each person (element text is always a string)
        name = child.find('name').text
        height = child.find('height').text
        weight = child.find('weight').text

        # Appending new row of data to existing dataframe
        dataframe.loc[len(dataframe), :] = [name, height, weight]

    return dataframe
    
    



# Function to read in multiple xml files
def multiple_xml(path_list):
    '''
    This function will read in a list of xml files and output a pandas dataframe object
    Input: iterable list of file paths
    Output: pandas object
    Params: path_list
    '''
    df_list = []
    for path in path_list:
        dataframe = extract_from_xml(path)
        df_list.append(dataframe)

    # Stack the per-file dataframes and renumber the rows from 0
    return pd.concat(df_list, ignore_index=True)
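
To make the assumed schema concrete, the sketch below parses a small xml string with the layout `extract_from_xml` relies on: a root element whose children each hold `name`, `height` and `weight` tags. The `data` and `person` tag names and the sample values are hypothetical. It also shows one possible variation on the function above: collecting the rows in a list, building the dataframe once, and converting the measurements to numbers, since element text always comes back as a string.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical document; only the name/height/weight tags come from the function above.
xml_text = """
<data>
    <person><name>alex</name><height>65.78</height><weight>112.99</weight></person>
    <person><name>ajay</name><height>71.52</height><weight>136.49</weight></person>
</data>
"""

root_element = ET.fromstring(xml_text.strip())

# Collect one dict per person, then build the dataframe in a single step.
rows = [
    {
        "Name": child.findtext("name"),
        "Height": child.findtext("height"),
        "Weight": child.findtext("weight"),
    }
    for child in root_element
]
people = pd.DataFrame(rows, columns=["Name", "Height", "Weight"])

# Convert the string measurements to numeric columns.
people[["Height", "Weight"]] = people[["Height", "Weight"]].apply(pd.to_numeric)
```

Building the dataframe from a list avoids growing it one row at a time, which tends to get slow on larger files, although for a handful of people either approach is fine.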

In [None]:
# Function to batch ingest multiple files
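
The batch ingest cell is empty in the notebook, so what follows is only one possible sketch of how it could look, not the author's implementation. It maps a wildcard pattern for each file type to a reader and stacks everything it finds. The names `batch_ingest`, `READERS` and `read_xml_people`, the lowercase column names, and the sample files are all hypothetical.

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET

import pandas as pd


def read_xml_people(path):
    # Compact stand-in for extract_from_xml, with lowercase columns (an assumption).
    rows = [
        [person.findtext("name"), person.findtext("height"), person.findtext("weight")]
        for person in ET.parse(path).getroot()
    ]
    frame = pd.DataFrame(rows, columns=["name", "height", "weight"])
    frame[["height", "weight"]] = frame[["height", "weight"]].astype(float)
    return frame


# One reader per wildcard pattern.
READERS = {
    "*.csv": pd.read_csv,
    "*.json": lambda path: pd.read_json(path, lines=True),
    "*.xml": read_xml_people,
}


def batch_ingest(data_dir):
    # Read every matching file in data_dir and stack the results into one dataframe.
    frames = []
    for pattern, reader in READERS.items():
        for path in sorted(glob.glob(os.path.join(data_dir, pattern))):
            frames.append(reader(path))
    return pd.concat(frames, ignore_index=True)


# Try it on one throwaway file of each type.
with tempfile.TemporaryDirectory() as data_dir:
    samples = {
        "source1.csv": "name,height,weight\nalex,65.78,112.99\n",
        "source1.json": '{"name": "ajay", "height": 71.52, "weight": 136.49}\n',
        "source1.xml": (
            "<data><person><name>jack</name><height>68.7</height>"
            "<weight>123.3</weight></person></data>"
        ),
    }
    for filename, text in samples.items():
        with open(os.path.join(data_dir, filename), "w") as handle:
            handle.write(text)

    combined = batch_ingest(data_dir)
```

The sketch only lines up cleanly because every reader returns the same column names. With the functions defined earlier, the xml columns are capitalised, so they would need to be renamed to match the csv and json columns before concatenating.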

# Final Script