- **Module Name**: WorkbookFunctions
- **Author**: Rory Creedon (rcreedon@poverty-action.org)
- **Purpose**: Library of classes and functions for consolidating worksheets from various Excel workbooks into one single workbook and performing a variety of checks on those worksheets
- **Use**: Intended for use in the specialised DataNitro iPython shell. Functions will not load in a regular iPython terminal due to the use of DataNitro functions
# Introduction
This library of functions was developed to meet the needs of the RMG Data Team working primarily with Quality and Absence data generated by the RMG projects in Bangladesh.
The aim of the library is to assist in the following stages of creating data sets from data provided as part of the RMG projects:
- Consolidating data on Excel worksheets from a number of Excel files in a number of directories into a single workbook
- Renaming sheets in a consistent manner
- Performing various checks on that workbook, including:
  - Making various checks with regard to dates on each sheet
  - Identifying the 'headers' on each sheet
  - Identifying the end point of the data on each sheet
  - Checking that 'headers' have the same meaning and consistent names on each sheet
  - Unmerging cells
- Creating an object that describes the data structure of each worksheet, which can then be passed to a pandas program that uses that information to create a DataFrame including all relevant data on each sheet
In an ideal world I would create a set of proper documentation for this library; perhaps one day this will happen. In the meantime, this readme follows the aims above in the order they appear, so that the use of the library can be explained to its intended users. Explaining the module in the order of the tasks makes sense, as the workflow will generally follow the same pattern. Any additional useful tips are provided along the way.
# Using this Module
In order to use this module, download the WorkbookFunctions.py file. You will need to add this file to a folder in the Python path in order to be able to import it when using the DataNitro iPython shell. To determine which folders are in the Python path, open up a DataNitro iPython shell and do the following:

```python
import sys
sys.path
```

This will return a list of folders in the Python path (i.e. the folders which Python searches when import calls are made). I believe you can add the WorkbookFunctions.py file to any folder that appears in that path, although on Windows it is typical to add the file to the Python27/Lib/site-packages folder.
I always import the module in the following manner:

```python
import WorkbookFunctions as wf
```

This is the convention that the remainder of this document follows.
## Consolidate Sheets

### Purpose and Information
Many times the data that we work with are in folders containing a number of Excel workbooks, each of which contains a sheet of data that we wish to read into a pandas DataFrame. Before doing that, however, we need to make a number of checks on the data and work out what structure the data are in before we can pass the data to a pandas program for conversion to a DataFrame. It is far simpler to make these checks, and to define the structure of each relevant worksheet, if all the worksheets are in the same workbook.
The WorkbookFunctions module provides the `sheet_compiler` class for compiling sheets from a number of workbooks into a single (new) workbook. Here are some facts about the `sheet_compiler` class:

- The compile operations the class undertakes can copy sheets from Excel workbooks stored in any number of directories.
- The compile operations can only select and copy one sheet per workbook to the new workbook created.
- The compile operations will only copy a sheet if the sheet is uniquely identified in the workbook according to the arguments provided by the programmer; more on this below.
### Workflow and Syntax
The syntax to create a `sheet_compiler` object is:

```python
wf.sheet_compiler(top_folderpath, **kwargs)
```

- `top_folderpath`: string of the path to the directory where the objects created by WorkbookFunctions will be stored
- `**kwargs`: any number of keyword arguments (at least one must be provided) whose values are strings of file paths pointing to the directories containing the Excel workbooks from which sheets are to be compiled
An example call might look like this:

```python
compiler = wf.sheet_compiler(r'C:\DataFolder',
                             folder1=r'C:\DataFolder\January_2014',
                             folder2=r'C:\DataFolder\February_2014',
                             folder3=r'C:\DataFolder\March_2014')
```
Please note the use of the raw string literal (`r''`) when providing the path name. This is needed when paths contain backslashes (which they nearly always do), as the backslash is a special character in Python.
Additionally, note that the `**kwargs` should be named so as to be sortable into a logical order. For example, if working with folders that contain data for May, June and July, the kwargs should be named `folder1`, `folder2` and `folder3`, rather than named by month. Named this way, the folders are sortable into an order that matches the reality of the data they actually contain. This will become important later on when working with the `wf.Dates.compare_cell_file_date()` method.
#### Attributes

The `sheet_compiler` class has only two attributes:

- `wf.sheet_compiler.file_dict`: dictionary of the `**kwargs` passed to the constructor
- `wf.sheet_compiler.top_folderpath`: string of the `top_folderpath` argument passed to the class constructor
#### Getting a File List Dict
Before compiling any sheets we need to know what files are contained in the folders we are going to work with. A dict of the files in the directories can be obtained by calling the `get_file_list_dict()` method. The syntax for this method is as follows:

```python
wf.sheet_compiler.get_file_list_dict()
```

**Returns**: a dict whose keys are the folder names passed to the constructor as `**kwargs` and whose values are lists of the files found in the associated folders.
So an example call might look like:

```python
file_list_dict = compiler.get_file_list_dict()
file_list_dict
```
This file dict should be checked by the user. If there are non-Excel files, these should be manually removed, as this dict will eventually be passed to the compiler. The compiler does not care what order the lists within the dict are in, but the user must. In the workbook of compiled sheets, for example, the user will want the sheets to appear in ascending date order. Typically the files we work with are named according to date; however, the `get_file_list_dict()` method uses the `os.listdir()` function, which does not guarantee order. The order of the lists within the dict should therefore be manually adjusted until the user is satisfied that the workbooks will be processed in the order in which the worksheets should appear in the compiled workbook.
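Outside of DataNitro, the underlying logic can be sketched with the standard library alone (this is a sketch, not the module's actual implementation, and the function name is hypothetical):

```python
import os

def get_file_list_dict(**folders):
    """Map each folder keyword to a sorted list of the files it contains.

    os.listdir() makes no ordering guarantee, so the lists are sorted
    explicitly here; the user should still inspect and reorder them by
    hand if the filenames do not sort into date order.
    """
    return {name: sorted(os.listdir(path)) for name, path in folders.items()}
```

Sorting the filenames is only a starting point: names such as '1.02.13' and '10.01.13' do not sort chronologically, which is why the module leaves the final ordering to the user.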
#### Compiling the Sheets
Once the `file_list_dict` is as the user wishes it to be, it can be passed to the `compile_sheets()` method. This method does the actual work of compiling the sheets and returns a string with a report about the success of the compile. If some sheets could not be moved, the user is notified by the returned string.

The syntax for the `compile_sheets()` method is:

```python
wf.sheet_compiler.compile_sheets(file_list_dict, new_wkbk_name, sub_string1[, sub_string2])
```

- `file_list_dict`: a dict of folder keys and list values containing the workbook filenames from which sheets will be compiled
- `new_wkbk_name`: the name of the new workbook to which the individual worksheets will be compiled, including the file extension (e.g. "compiled_workbook.xls")
- `sub_string1`: a string that uniquely identifies a sheet in the workbooks from which a sheet will be copied
- `sub_string2`: (optional) a second string that, together with `sub_string1`, uniquely identifies the sheet to be copied

**Returns**: string indicating to the user the success of the compile operation.
As all the workbooks from which sheets will be copied tend to be of the same type, they tend to have the same sheet names. The `sub_string1` and `sub_string2` arguments are passed to the method, which, upon opening each relevant workbook, searches for the sheet to be moved by building a list of that workbook's sheet names. The sheet names are strings. The user should therefore look at the sheet names used in the source workbooks and identify up to two substrings that will uniquely identify the sheet that is to be copied to the new workbook.
By way of example suppose that each workbook contained three sheets:
- "Daily Efficiency Summary"
- "Daily Sewing Summary"
- "Daily Sewing Quality"
If we were interested in obtaining the second of those sheets we could pass `sub_string1 = "Sewing"` and `sub_string2 = "Summary"` as arguments to the `compile_sheets()` method. The full syntax might look like this:
```python
compile_result = compiler.compile_sheets(file_list_dict, 'compiled_workbook.xls', 'Sewing', 'Summary')
compile_result
```
The variable `compile_result` is the string returned by the method: a report telling the user how successful the operation was for each file.

Incidentally, during the compile process the `file_list_dict` is saved in `.json` format in the folder pointed to by the `wf.sheet_compiler.top_folderpath` attribute.
The user should examine the output and manually move sheets as necessary from any files where the compile operation was not successful. Sheets should be moved so as to preserve the logic of the order in the workbook.
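The sheet-selection rule can be illustrated independently of Excel: given a workbook's sheet names, a sheet qualifies only if exactly one name contains all of the supplied substrings. This is a sketch of that rule, not the library's actual code, and the function name is hypothetical:

```python
def select_sheet(sheet_names, sub_string1, sub_string2=None):
    """Return the single sheet name containing all given substrings,
    or None if the substrings do not identify exactly one sheet."""
    subs = [s for s in (sub_string1, sub_string2) if s is not None]
    matches = [name for name in sheet_names
               if all(sub in name for sub in subs)]
    return matches[0] if len(matches) == 1 else None
```

With the three sheet names from the example above, `select_sheet(names, "Sewing", "Summary")` picks out "Daily Sewing Summary", whereas a substring common to several sheets (such as "Daily") identifies nothing.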
## Renaming Sheets

### Purpose and Information
When testing the structure of worksheets in a workbook, and their contents, it is useful if the sheets are named consistently, so that they can be easily identified by the user as needing attention.
The WorkbookFunctions module provides the `rename_sheets()` function for this purpose. This function will rename up to 180 sheets according to a prefix plus a three-digit serial. Practically speaking, it is not recommended to work with files of more than 180 sheets, as memory use can exceed that available and tasks become difficult to execute without error.
### Workflow and Syntax

The syntax for the function is as follows:

```python
wf.rename_sheets(prefix)
```

- `prefix`: string to prepend to each three-digit serial

**Returns**: None

An example function call might look like this:

```python
wf.rename_sheets('P')
```
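The naming scheme itself is simple enough to sketch: each sheet receives the prefix plus a zero-padded three-digit serial. The helper below is hypothetical, for illustration only:

```python
def serial_names(prefix, n_sheets):
    """Generate names like 'P001', 'P002', ... for n_sheets sheets."""
    return ['{}{:03d}'.format(prefix, i) for i in range(1, n_sheets + 1)]

serial_names('P', 3)  # ['P001', 'P002', 'P003']
```

A useful side effect of this scheme is that the sheet names sort alphabetically into their workbook order, which several of the checks below rely on.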
## Dates

### Purpose and Information

Dates are incredibly important. We need to be able to accurately identify which date all data come from, and this is a job not frequently made easy by the people supplying the data. For one thing, the date may be part of some other complex string, dates may be accidentally repeated, and so on. For that reason the WorkbookFunctions module provides the `Dates` class. Generally the cell that contains the date will be the same on each sheet of the workbook. The `Dates` class will do a number of things for the user, as seen below.
### Workflow and Syntax

#### Creating a wf.Dates object

The syntax to create a `Dates` object is:

```python
wf.Dates(date_cell_ref[, strp_format[, separator, index_pos]])
```

- `date_cell_ref`: tuple of two integers referencing the cell in which the date representation is located
- `strp_format`: (optional) string representation of the date format, to be parsed by the `datetime.datetime.strptime()` function
- `separator`: (optional) separator used to split the date representation string into a list using the Python `string.split()` method
- `index_pos`: (optional) index position of the relevant date string in the list that results from splitting the date representation using the separator
Some explanation is needed here. All dates on every worksheet will be converted into `datetime.date` objects in order for the other `wf.Dates` methods to be executable. Sometimes Excel dates can be read directly by DataNitro as `datetime.date` objects. In other cases they will be read as strings that need to be manipulated before they can be converted to `datetime.date` objects.
To determine which universe you are in, do a preliminary test. Say the cell with the date representation is referenced by the tuple (2, 19) (row 2, column 19); then simply enter the following in the DataNitro iPython shell:

```python
Cell(2, 19).value
```

If the output looks like this:

```python
datetime.datetime(2013, 2, 4, 0, 0)
```

then DataNitro is reading the date directly as a datetime object. This is good. In this situation the only argument necessary to create the `Dates` object is the tuple referencing the date cell:

```python
dates = wf.Dates((2, 19))
```
The next simplest case is when the output is simply a basic string representation of the date, such as '3.12.2013'. In this case the user should supply the `strp_format` argument, which will enable the date string to be read as a datetime object. In this situation (and given the date string just written), the syntax for creating a `Dates` object is as follows:

```python
dates = wf.Dates((2, 19), "%d.%m.%Y")
```

If you are unclear about how to use `datetime.datetime.strptime()` formats, please see: https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior
The least simple case is when the output looks something like this: "Production Report for Date: 2/11/13".

To convert this type of string to something that can be formatted using `datetime.strptime()` we need to access the final part of the string, i.e. the part that actually contains the string representation of the date, without all the additional text. If we were to do this outside the context of WorkbookFunctions we might do the following:

```python
complex_string = "Production Report for Date: 2/11/13"
string_list = complex_string.split(':')
date_string = string_list[-1]
date_string
```

This returns `' 2/11/13'` (note the leading space left by the split).
And indeed the arguments for creating the `Dates` object mirror that exact process. The string is split according to a particular separator (in the example ':'), and then a particular index position of the resulting list is accessed (in the example [-1]). Therefore, the syntax for creating a `Dates` object when faced with such string representations would be as follows:

```python
dates = wf.Dates((2, 19), "%d/%m/%y", ':', -1)
```
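Putting the split and the strptime format together, the conversion performed on such strings can be sketched as follows (a sketch only; the function name is hypothetical and the real class works on DataNitro cells):

```python
from datetime import datetime

def cell_string_to_date(value, strp_format, separator=None, index_pos=None):
    """Convert a cell's string value to a datetime.date, optionally
    splitting it first to isolate the date portion of the string."""
    if separator is not None:
        value = value.split(separator)[index_pos]
    # strip() removes any whitespace left around the split-off date part
    return datetime.strptime(value.strip(), strp_format).date()
```

For example, `cell_string_to_date("Production Report for Date: 2/11/13", "%d/%m/%y", ':', -1)` yields `datetime.date(2013, 11, 2)`.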
#### Attributes

The attributes of the `wf.Dates` class are the arguments passed to the constructor.
#### Getting a Single Date Value

To get a single date value from the active worksheet use the `wf.Dates.cell_to_date()` method. The syntax for this is simply:

```python
wf.Dates.cell_to_date()
```

**Returns**: `datetime.date` object, or None if conversion is not possible given the arguments passed to the `Dates` class constructor.

An example call might look like:

```python
single_date = dates.cell_to_date()
single_date
```

This method can be used to check the parameters provided to the `wf.Dates` class constructor.
#### Getting All Dates in Workbook

To get a dictionary of all the dates found in the workbook (including sheets where date conversion was not possible given the arguments to the `Dates` class constructor) use the `check_all_dates()` method. The syntax for calling this method is as follows:

```python
wf.Dates.check_all_dates()
```

**Returns**: a dict whose keys are the names of every sheet in the workbook, and whose values are either the `datetime.date` objects found on those sheets, or a string indicating that conversion to a `datetime.date` object was not possible given the arguments provided to the `wf.Dates` class constructor.

An example call might look as follows:

```python
dates_dict = dates.check_all_dates()
dates_dict
```

The user should examine this dictionary to check that every sheet has a date that is convertible given the arguments supplied.
The point of the WorkbookFunctions module is not to deal with every possible alternative, but rather to show the user where the assumed standard format is not applicable. If the standard format is not applicable on certain sheets, changes must be made, either manually or by using general DataNitro data manipulation techniques, to ensure that the format is the same on every sheet, so that the methods can run as intended on every sheet in the workbook. Therefore, once the dictionary is created and errors identified, changes should be made to the worksheets themselves and the method re-called until a `datetime.date` is available for every worksheet. The techniques used to standardise the formatting will differ according to the workbook being worked with. Every time a data technician encounters an issue, the issue and its solution should be logged so that the group can learn from the techniques developed.
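A quick way to isolate the sheets that still need attention is to filter the returned dict for values that are not `datetime.date` objects (a sketch; the helper name is hypothetical):

```python
import datetime

def unconverted_sheets(dates_dict):
    """Return the sheet names whose values are not datetime.date objects,
    i.e. the sheets where date conversion failed.

    datetime.datetime is a subclass of datetime.date, so sheets read
    directly as datetimes also count as converted.
    """
    return sorted(name for name, value in dates_dict.items()
                  if not isinstance(value, datetime.date))
```

Re-running this after each round of manual fixes gives a shrinking to-do list until it is empty.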
#### Getting All Date Cell Types in Workbook

If the date dictionary has many date values that are not found using the above, it can be useful to do a quick check of the type of value found at the date cell on each sheet. To do this use the `get_types()` method. The syntax for calling this method is as follows:

```python
wf.Dates.get_types()
```

An example call might look as follows:

```python
dates_type_dict = dates.get_types()
dates_type_dict
```

The user should examine carefully the dict returned.
Some manual manipulation may be necessary on certain sheets. If the cell that contains the date moves around within the sheets of the workbook, or the format changes, then the user can either manually manipulate the cells or write a small program outside of the WorkbookFunctions library to ensure the values referenced by the `date_cell_ref` are consistent. It is not the job of WorkbookFunctions to do this for the user, but only to highlight where more attention is needed.
#### Checking for Duplicate Dates

Sometimes the persons providing the data forget to change the date on the worksheets, and this can lead to duplicate dates. In order to identify duplicate dates in the workbook use the `duplicates()` method. The syntax for this method is as follows:

```python
wf.Dates.duplicates()
```

**Returns**: dict where the keys are dates found more than once in the values of the date_dict, and the values are the keys of the date_dict at which the duplicate dates are found.

The method creates a dictionary of the dates by calling the `wf.Dates.check_all_dates()` method. This unfortunately reduces the performance of the method (versus the user being able to provide an already-created dict), but it ensures that date_dicts that are out of date due to subsequent modification of the worksheets are not used, which could lead to perverse results.

The method will raise an exception if it is not the case that every value in the date_dict created within the method is a `datetime.date` object.
An example call might be as follows:

```python
duplicates_dict = dates.duplicates()
duplicates_dict
```
The user should examine the output and may have to make manual changes.
Typically if a date is repeated, the 'true' date can be found by looking at the filename from which that data sheet was extracted.
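The duplicate detection itself amounts to inverting the date dict and keeping only dates that map to more than one sheet. A sketch (assuming, as the method enforces, that every value is already a `datetime.date`):

```python
from collections import defaultdict

def find_duplicates(date_dict):
    """Return {date: [sheet names]} for dates appearing on more than one sheet."""
    by_date = defaultdict(list)
    for sheet, d in date_dict.items():
        by_date[d].append(sheet)
    return {d: sorted(sheets) for d, sheets in by_date.items()
            if len(sheets) > 1}
```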
#### Checking the Relative Order

The extent to which the sheets in the workbook need to be in the order implied by the dates on the worksheets is not totally clear. On the one hand, when we read the data from each sheet into a pandas DataFrame the data can simply be sorted by date. On the other hand, if the order of the sheets in the workbook does not match the order implied by the dates found on the sheets, this can be indicative of other problems with the data. Therefore, at a minimum, the user should check that the order of the sheets matches the order implied by the dates using the `relative_order()` method. The syntax for this method is as follows:

```python
wf.Dates.relative_order()
```

**Returns**: a dictionary that shows the order of sheets implied by the dates in the date_dict and the actual order of the sheets, where different.

Again, a date_dict is created inside the method by calling the `wf.Dates.check_all_dates()` method.

The method will raise an exception if it is not the case that every value in the date_dict created inside the method is a `datetime.date` object.
An example call might look like this:

```python
relative_dict = dates.relative_order()
relative_dict
```
The output dictionary should at least be looked at by the user; action may or may not be necessary. If simple changes can be made to ensure the correct order, they should be made, but any changes should be verified against the files from which the data sheets are drawn. If changes to the sheet order are made, it is useful to then use the `wf.rename_sheets()` function to rename the sheets according to the new order.
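The underlying comparison is between the sheet order as it stands and the order implied by sorting the dates. A sketch of that idea (assuming all values are `datetime.date` objects and that sheet names sort into their current workbook order, as they do after `wf.rename_sheets()`):

```python
def order_mismatches(date_dict):
    """Compare the current sheet order with the order implied by the dates.

    Returns {sheet: implied_position} for each sheet whose date-sorted
    position differs from its current position; empty if the orders agree.
    """
    current = sorted(date_dict)                        # e.g. ['P001', 'P002', ...]
    implied = sorted(current, key=lambda s: date_dict[s])
    return {sheet: implied.index(sheet)
            for pos, sheet in enumerate(current)
            if implied[pos] != sheet}
```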
#### Comparing Dates on Sheets to Dates in Filenames

Sometimes, if the `wf.Dates.relative_order()` and `wf.Dates.duplicates()` methods are bringing errors to the attention of the user, it can be helpful to compare the dates found on the worksheets with those found in the filenames of the workbooks from which those sheets were compiled, in order to see where those errors have been generated. For this the `wf.Dates.compare_cell_file_date()` method is available. The syntax is as follows:

```python
wf.Dates.compare_cell_file_date(file_list_dict, regex[, strp_format])
```

- `file_list_dict`: a dict of folder keys and list values containing the workbook filenames from which sheets were compiled
- `regex`: string to pass to `re.compile()` that will identify the date component of the filenames in the file_list_dict
- `strp_format`: (optional) a string used to convert the date component of the filenames in the file_list_dict to `datetime.date()` objects

**Returns**: a dictionary whose keys are sheet names and whose values are tuples, where the first element of each tuple is the date as per the filename taken from the file_list_dict, and the second is the date as per the date cell taken from a date_dict created upon the method call. There are only keys for those tuples whose values are not equal. If the regex provided does not match a date, or the date is not convertible, the user is notified by a string inside the dict.
Some explanation is needed here:

The `file_list_dict` should be the one created when compiling the sheets. If that object is still in memory, pass it directly; otherwise open the 'file_list_dict.json' that was created when the sheets were originally compiled and pass that.

The user must supply a regular expression (`regex`) argument that will identify the date component of the file names in the `file_list_dict`. This regular expression is passed to `re.compile()`, each file name is searched using `re.search()`, and the matched group is returned.

If you need to revise regular expressions, please see: https://docs.python.org/2/library/re.html
By way of quick example, we may need to identify the date component of the following filenames:

- 'Production Report 1.02.13 Nishat Fabrics.xls'
- 'Production Report 01..2.13 Nishat Fabrics.xls'
- 'Production Report 03.1.2013 Nishat Fabrics.xls'

These filenames are consecutive days, but the format is not consistent: sometimes there is zero padding, sometimes not, and in the second example there is more than one full stop.

A regular expression that would identify the date component of all of these strings is:

```python
"\d+.+\d+.\d+"
```

If you are unclear as to why, then please follow the above link and revise regular expressions.
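The behaviour of this pattern can be checked directly with the `re` module:

```python
import re

# In this pattern '.' is the regex "any character", which happens to
# also match literal full stops in the filenames.
date_part = re.compile(r"\d+.+\d+.\d+")

filenames = [
    'Production Report 1.02.13 Nishat Fabrics.xls',
    'Production Report 01..2.13 Nishat Fabrics.xls',
    'Production Report 03.1.2013 Nishat Fabrics.xls',
]

matches = [date_part.search(name).group() for name in filenames]
# matches == ['1.02.13', '01..2.13', '03.1.2013']
```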
The date portion of each filename, once identified by the regular expression search, is then converted to a `datetime.date()` object using the `dateutil` date parser, which assumes that the day comes first (rather than the month) and has the `fuzzy` option set to `True` (see source code). This means the date will be converted as best it can be. If this automated method is creating bizarre results, then you probably have a date string that the `dateutil` parser cannot deal with. In such situations you can pass the `strp_format` argument to have more control over how the conversion to `datetime.date()` is made.
The output dict should be examined carefully by the user. It may help the user decide how and why errors identified when looking for duplicates and at the relative order arose, and therefore how to correct them.

It should be noted that the method assumes that the order in which the filenames are approached in the `file_list_dict` matches the order of the worksheets. If this is not the case, many false mismatches will be generated. Therefore, if extensive re-ordering of worksheets has been undertaken since compiling, the method will produce perverse results.
#### Checking the Date Discontinuities

In theory we should have six workbooks per week from which sheets have been extracted and compiled. Therefore, if there are any discontinuities in the dates greater than one day, it is possible that some files were missing from the folder from which sheets were compiled, or data might otherwise be missing. To check whether there are such discontinuities use the `wf.Dates.discontinuities()` method. The syntax for this method is as follows:

```python
wf.Dates.discontinuities(discontinuity_value)
```

**Returns**: list of tuples, where each tuple is a pair of contiguous sheets whose dates indicate a discontinuity of more than the number of days specified as the discontinuity_value.

The method creates a date_dict by calling the `wf.Dates.check_all_dates()` method upon the call, and will raise an exception if it is not the case that every value in that date_dict is a `datetime.date` object.
An example call may look as follows:

```python
discontinuity_list = dates.discontinuities(2)
discontinuity_list
```
The resulting list should be examined, and any causes of missing data sheets discussed with the project management.
Users should log any discontinuities found. If extra data are 'found' upon further investigation, the data from these sheets should be added to the folder directory and the processes described above should be repeated.
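Stripped of the workbook machinery, the discontinuity check is a pairwise comparison of consecutive dates. A sketch (assuming sheet names sort into workbook order and all values are `datetime.date` objects):

```python
def discontinuities(date_dict, discontinuity_value):
    """Return (sheet, next_sheet) pairs whose dates are more than
    discontinuity_value days apart."""
    sheets = sorted(date_dict)
    return [(a, b) for a, b in zip(sheets, sheets[1:])
            if (date_dict[b] - date_dict[a]).days > discontinuity_value]
```

With a `discontinuity_value` of 2, a jump from 5 February to 9 February between two contiguous sheets would be reported, while a one-day step would not.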
## FindPoints

### Purpose and Information

Before passing the workbook to a pandas program that creates a DataFrame of all the data on all worksheets, we need to be able to describe the structure of the worksheets in the following manner:

- For every sheet, a row value on which the column 'headers' are found.
- For every sheet, a row value that indicates the last row of data that we are interested in.

In locating these points (or rather finding the places where they are not locatable) the user is also performing useful checks on the worksheets, which, if dealt with before the data are passed to pandas, will reduce error and confusion for the persons using those DataFrames.

To assist in this process WorkbookFunctions provides the `FindPoints` class.
### Workflow and Syntax

#### Creating a wf.FindPoints object

The syntax for creating a `FindPoints` class object is as follows:

```python
wf.FindPoints(col, start_row, end_value[, adjustments])
```

- `col`: integer value of the column to be searched
- `start_row`: integer value of the first row of the column to be searched
- `end_value`: string value that, when found, indicates that the 'point' has been located
- `adjustments`: (optional) negative or positive integer by which the final row value will be adjusted

Some explanation is needed here:

Suppose that the user wishes to identify the point (meaning the row) in which the 'column headers' are found. The user should look at the structure of a worksheet and identify a value that, when found, indicates that the point has been located. In general the 'headers' found in the worksheets will be the same or very similar. For example, if in identifying the 'header columns' the first column header is 'Line' and this is found in column 2, then the user would specify the end_value as 'Line', the column value as 2, and the row at which the `FindPoints` class should begin to search for the end_value. Unless there are special reasons not to, this will generally be row 1. So the syntax for creating a `wf.FindPoints` class object in such circumstances would be as follows:

```python
headers = wf.FindPoints(2, 1, 'Line')
```
#### Attributes

The attributes of the class are the arguments passed to the constructor.
#### Finding Points on a Single Active Worksheet

To find a point (such as the headers) on a single worksheet use the `find_point()` method. The syntax for this method is as follows:

```python
wf.FindPoints.find_point()
```

**Returns**: integer value of the row in which the point is located.

If the point is not found within the first 300 rows after the start_row, an exception is raised.

An example call might look like this:

```python
header_row = headers.find_point()
```
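Abstracting away from DataNitro's `Cell` objects, the search can be sketched over a plain grid of cell values (the grid representation and function name are hypothetical; the real method reads the active worksheet):

```python
def find_point(grid, col, start_row, end_value, adjustment=0, max_rows=300):
    """Scan a column downwards from start_row for end_value.

    grid maps (row, col) tuples to cell values.  Returns the matching
    row plus any adjustment; raises ValueError if end_value is not
    found within max_rows rows of start_row.
    """
    for row in range(start_row, start_row + max_rows + 1):
        if grid.get((row, col)) == end_value:
            return row + adjustment
    raise ValueError('end_value {!r} not found in column {}'.format(end_value, col))
```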
#### Finding All Points in a Workbook

To find all points on every sheet in a workbook use the `wf.FindPoints.find_all_points()` method. The syntax for this method is as follows:

```python
wf.FindPoints.find_all_points()
```

**Returns**: dict with one key for each sheet in the workbook, with the point row as the value. If no point row is found, the key maps to a string value notifying the user of the absence of the point row.

An example call might look as follows:

```python
headers_dict = headers.find_all_points()
headers_dict
```

The user should examine the dict, pay attention to any sheets where the point row was not located, determine the reason for this, and make any necessary changes to the data.
The exact same process can be used to find the 'end points' of the data. Typically there will be a 'Total' row at the end of the data. We are not interested in that row, but it is a useful placeholder for identifying the end of the data we are interested in. Perhaps the word 'Total' is located in the fourth column, and the Total row occurs two rows after the end of the data we are interested in; we would then want to adjust the end row values by -2. The process for doing this might look as follows:

```python
end_points = wf.FindPoints(4, 1, 'Total', -2)
end_point_dict = end_points.find_all_points()
end_point_dict
```
With regard to finding end points, it is better to be safe than sorry. If the user is not totally convinced that the end row needs to be adjusted, they can simply omit the adjustment argument: it is better to capture some unnecessary data when passing the data to pandas than to lose important data. Alternatively, the user can verify the need for an adjustment using techniques outside of WorkbookFunctions.
## Columns

### Purpose and Information

When the data from a workbook are passed to a pandas program to create a DataFrame, the user can specify which columns of data to retain. Not every column will be preserved, as we are only interested in some of the data kept by the entities providing it. The argument passed to the `pandas.ExcelFile` parser is a list of integer values of the columns to be retained. Therefore, to be sure that the same data are taken from every worksheet, the user must be satisfied that the same data are in the same columns on every worksheet. Effectively this means checking that the column 'headers' are the same on every worksheet. In practice, whether the column headers are the same and in the same order might be of little significance, as the entire worksheet could be parsed by pandas, and when the individual worksheet DataFrames are concatenated, columns with the same header values will be automatically aligned. However, there will be cases where certain worksheets have different 'header' values due to typos, the insertion of a new column, or a change to a column header name. This will cause the pandas program to create a DataFrame that is not hugely useful. It is therefore useful to check that the column 'header' values are the same for every worksheet in the workbook being worked with. To this end WorkbookFunctions provides the `wf.Columns` class.
###Workflow and Syntax
####Creating a wf.Columns
Object
The syntax to create a wf.Columns
class object is as follows:
wf.Columns(column_values)
column_values : |
list of integers representing the locations of the columns of interest. |
So if the user is ultimately interested in preserving the data from columns 1, 2, 3, 4, 6, 9, 10 and 11, then the syntax for creating the `Columns` object would be as follows:

`col_vals = wf.Columns([1, 2, 3, 4, 6, 9, 10, 11])`
An exception will be raised if any element of the list passed is not an integer.
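As a rough illustration (not the library source), the constructor's integer check might look something like the following sketch; the real class may raise a different exception type:

```python
# Hypothetical sketch of the type check described above; the actual
# wf.Columns implementation in WorkbookFunctions may differ.
class Columns(object):
    def __init__(self, column_values):
        # Reject any list that contains a non-integer element.
        if not all(isinstance(v, int) for v in column_values):
            raise TypeError("column_values must be a list of integers")
        self.column_values = list(column_values)
```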
####Attributes
The class has only one attribute: the `column_values` list passed as argument.
####Getting Values for a Specific Row
To get the column values at a specific row on the active worksheet, use the `wf.Columns.get_values()` method. The syntax of this method is as follows:

`wf.Columns.get_values(row)`
row : | An integer representing the row from which values will be taken |
Returns | List of lowered, stripped string values found in each cell referenced by the row and each column value passed to the `Columns` object creator. |
An example call might look like:

```python
values = col_vals.get_values(4)
values
```

The above will return a list of the values found in row 4 at the columns referenced by the list passed to the `Columns` object creator.
####Comparing all Column Values
To check whether all of the sheets have the same values at specific points, use the `wf.Columns.compare_all_columns()` method. The syntax for this method is as follows:

`wf.Columns.compare_all_columns(start_row_dict)`

start_row_dict : | dict with one key per sheet in the workbook, with integer values representing the rows in which the column values to be compared are found. |
Returns : | dict that has a key for each sheet in the workbook and values that are lists containing those master-sheet column values which were not located on that sheet. |
Some explanation is needed here:
The general strategy of the method is to look at each sheet, take an integer row value for that sheet from the start_row dict, and compare the values found at that row, in the columns passed to the `Columns` object creator, against the values found on the first worksheet (the 'master sheet'). This is therefore a general method: it will work with any integers provided in the start_row_dict.
However, the method was specifically designed to identify differences between sheets with regard to the column 'headers'!
Bearing this in mind, the start_row_dict should be an object created using the `wf.FindPoints` object methods to locate the row on each sheet where the column 'headers' are found. The method sets the values found on the first sheet as the master values and compares the values on all other sheets against them.
So, an example call might look like this:

```python
headers = wf.FindPoints(2, 1, 'Line')
headers_dict = headers.find_all_points()
cols = wf.Columns([1, 2, 3, 4, 6, 9, 10, 11])
columns_discrepancy_dict = cols.compare_all_columns(headers_dict)
columns_discrepancy_dict
```
The user should examine the output and determine the reasons for any discrepancies.
The data should be manipulated such that all columns do represent the same data. It is envisaged that this will happen initially outside of the WorkbookFunctions library, but in the near future such functionality may be added to this library to automate the process.
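The master-sheet comparison strategy described above can be sketched in plain Python, independent of DataNitro. Assume each sheet's header row has already been read into a list of cleaned strings; the function name and sheet names below are illustrative, not part of the library:

```python
def compare_against_master(headers_by_sheet):
    # headers_by_sheet: dict mapping sheet name -> list of header strings.
    # The first sheet is treated as the 'master sheet'; for every other
    # sheet we record which master values could not be located on it.
    sheets = list(headers_by_sheet)
    master = headers_by_sheet[sheets[0]]
    discrepancies = {}
    for sheet in sheets[1:]:
        found = set(headers_by_sheet[sheet])
        discrepancies[sheet] = [h for h in master if h not in found]
    return discrepancies
```

An empty list for a sheet means its headers matched the master sheet's.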
Incidentally, the method calls a hidden function `__rename_headers`, which strips the value found at each point referenced by the relevant start_row and column value, lowers it, and, in cases where the value has more than one word, joins the words with a '_' character. This makes using the column names easier when the workbook is ultimately passed to a program that reads it as a `pandas.ExcelFile` object.
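The normalisation just described (strip, lower, join words with underscores) can be reproduced in a few lines; this is a sketch of the behaviour, not the library's actual `__rename_headers` code:

```python
def normalize_header(value):
    # Strip surrounding whitespace, lower-case the text, and join
    # multi-word headers with underscores, e.g. ' Line No ' -> 'line_no'.
    return '_'.join(str(value).strip().lower().split())
```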
##Workbook Structure
###Purpose and Information
Once the above checks have been made, a dictionary can be created that contains all the information needed by a pandas program to create a single DataFrame from the entire workbook. To assist in this process the `wf.workbook_structure` Class is provided.
Why provide a class that deals with this rather than just allowing the user to create a dictionary of the objects already created? There are several excellent reasons:

- By providing a class, the resulting object can be indexed with values that are not modifiable by the user, which allows the next stage of development to be standardised. In the `ExceltoPandas` module, a companion library to WorkbookFunctions, standardized index names reduce the amount of work the user has to do.
- The `wf.workbook_structure` Class will recreate a `date_dict` by calling `wf.Dates.check_all_dates()`. This ensures that the most up-to-date version of the dict is used, which will prevent accidental error on the part of the user.
- pandas thinks about indexing in a different way to Excel: column 1 in Excel is in fact column 0 in pandas. The `wf.workbook_structure` Class will therefore make the necessary adjustments to the dictionaries passed as arguments to the constructor.
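The Excel-to-pandas index adjustment mentioned in the last point amounts to shifting 1-based Excel column numbers to 0-based pandas positions; a minimal sketch (the helper name is illustrative, not part of the library):

```python
def excel_to_pandas_cols(cols_list):
    # Excel columns are 1-based, pandas positions are 0-based,
    # so subtract one from every column number.
    return [c - 1 for c in cols_list]
```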
###Workflow and Syntax
####Creating a `wf.workbook_structure` Object
The syntax to create a `wf.workbook_structure` class object is as follows:

`wf.workbook_structure(Dates_class_object, start_row_dict, end_row_dict, cols_list)`
Dates_class_object : | `wf.Dates` object |
start_row_dict : | dictionary of start rows as created by `wf.FindPoints` |
end_row_dict : | dictionary of end rows as created by `wf.FindPoints` |
cols_list : | list of integers representing the columns to be retained |
The constructor will check all the inputs to ensure they are of the type envisaged. If they are not then an InputError is raised.
An example call might look like this:

```python
workbook_structure = wf.workbook_structure(dates, start_row_dict, end_row_dict, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```
####Attributes
The class has only one (visible) attribute: the dictionary of objects that have been modified by the `__init__` method called when the class is initialized. To access this dictionary do the following:

```python
ws_dict = workbook_structure.workbook_structure
ws_dict
```
####Saving the `wf.workbook_structure` Object
The workbook structure created should be saved as a .json file for use in the next stage of work, which will utilize the `ExceltoPandas` module. For this purpose the `wf.workbook_structure.save_structure()` method is available to the user. The syntax of this method is as follows:

`wf.workbook_structure.save_structure(top_folderpath)`
top_folderpath : | string path to the folder where the .json file will be saved |
Returns | string |
Function saves the workbook_structure attribute of the class to json in the folder passed as top_folderpath.
An example call might look as follows:

```python
workbook_structure.save_structure(r'path\to\folder')
```
##Unmerging Cells
Merged cells are a massive headache. The value in a group of merged cells exists only in the upper-left-most cell, which means that all other cells in the range will be read by `pandas.ExcelFile` as missing values. This is hugely problematic if key bits of data exist in merged cells. As an example, if multiple rows exist for each 'line' in the Excel file, then it will be very hard to merge data from other sources based on 'line'. In fact, if there is merged data in any of the columns that are to be retained, then those cells need to be unmerged, and the value that previously existed only in the upper-left-most cell needs to be propagated to all cells in the merged range.
Mercifully, DataNitro provides a function to do exactly this, and it has been incorporated into the WorkbookFunctions function `unmerge_data`. The syntax for using this function is as follows:

`wf.unmerge_data(start_row_dict, end_row_dict, cols_list[, headers_only])`
start_row_dict : | dictionary of start rows as created by `wf.FindPoints` |
end_row_dict : | dictionary of end rows as created by `wf.FindPoints` |
cols_list : | list of integers representing the columns to be unmerged |
headers_only : | (Optional) bool. Default is True |
Returns | None |
The function unmerges cells in the columns represented by the integers in `cols_list`. By default `headers_only` is `True`, which means that on each sheet only cells in the row represented by the integer in `start_row_dict` will be unmerged (this should be the headers row if the dict was created as described above in relation to `wf.FindPoints`). If `headers_only` is `False`, then all cells in each column down to the row represented by the integer in `end_row_dict` will be unmerged (where they are in fact merged). The value of each merged cell is propagated to all cells in its merged range.
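The propagation behaviour can be illustrated on a plain list-of-lists 'sheet' without DataNitro (the function below is a demonstration of the idea, not the library's `unmerge_data`):

```python
def propagate_merged_value(grid, top, left, bottom, right):
    # Simulate unmerging: copy the value from the upper-left cell of a
    # merged range (0-based, inclusive coordinates) into every cell of
    # that range, so no cell in the range is left empty.
    value = grid[top][left]
    for r in range(top, bottom + 1):
        for c in range(left, right + 1):
            grid[r][c] = value
    return grid
```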
Use this function sparingly (but comprehensively) as it is quite slow to execute.
THE END