# Data Conversion Challenge
A challenge to automate the conversion of raw data into a specified format that is easier to use.

**Important note**: The data used in this notebook has been randomised and all names have been masked so they can be used for training purposes. No data is committed to the project repo. This notebook is for development purposes only.

This notebook is available in the following locations. These versions are kept in sync *manually*, so there should not be discrepancies, but they are possible.
- On Kaggle: <https://www.kaggle.com/btw78jt/data-conversion-challenge-202004>
- In the GitHub project repo: <https://github.com/A-Breeze/premierconverter>. See the `README.md` for further instructions, and the associated `simulate_dummy_data` notebook to generate the dummy data that is used for this notebook.

<!-- This table of contents is updated *manually* -->
# Contents
1. [Setup](#Setup): Import packages, Config variables
1. [Variables](#Variables): Raw data structure, Parameters
1. [Workflow](#Workflow): Load raw data, Stem section, Factor sets, Output to CSV, Load expected output to check it is as expected
1. [Using the functions](#Using-the-functions): Default arguments, Limited rows, Limited rows with included factors

<div align="right" style="text-align: right"><a href="#Contents">Back to Contents</a></div>

# Setup

In [1]:
# Set warning messages
import warnings
# Show all warnings in IPython
warnings.filterwarnings('always')
# Ignore specific numpy warnings (as per <https://github.com/numpy/numpy/issues/11788#issuecomment-422846396>)
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
# Other warnings that sometimes occur
warnings.filterwarnings("ignore", message="unclosed file <_io.Buffered")

In [2]:
# Determine whether this notebook is running on Kaggle
from pathlib import Path

ON_KAGGLE = False
print("Current working directory: " + str(Path('.').absolute()))
if str(Path('.').absolute()) == '/kaggle/working':
    ON_KAGGLE = True

Current working directory: H:\My Documents\05_Repos\premierconverter\development\compiled


In [3]:
# Import built-in modules
import sys
import platform
import os
import io

# Import external modules
from IPython import __version__ as IPy_version
import numpy as np
import pandas as pd
from click import __version__ as click_version

# Import project modules
if not ON_KAGGLE:
    from pyprojroot import here
    root_dir_path = here()
    # Allow modules to be imported relative to the project root directory
    if sys.path[0] != str(root_dir_path):  # Compare as strings: sys.path holds str, not Path
        sys.path.insert(0, str(root_dir_path))
import premierconverter as PCon

# Re-load the project module that we are working on
%load_ext autoreload
%aimport premierconverter
%autoreload 1

# Check they have loaded and the versions are as expected
assert platform.python_version_tuple() == ('3', '6', '6')
print(f"Python version:\t\t\t{sys.version}")
assert IPy_version == '7.13.0'
print(f'IPython version:\t\t{IPy_version}')
assert np.__version__ == '1.18.2'
print(f'numpy version:\t\t\t{np.__version__}')
assert pd.__version__ == '0.25.3'
print(f'pandas version:\t\t\t{pd.__version__}')
assert click_version == '7.1.1'
print(f'click version:\t\t\t{click_version}')
print(f'premierconverter version:\t{PCon.__version__}')

Python version:			3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]
IPython version:		7.13.0
numpy version:			1.18.2
pandas version:			0.25.3
click version:			7.1.1
premierconverter version:	0.3.4


In [4]:
# Output exact environment specification, in case it is needed later
if ON_KAGGLE:
    print("Capturing full package environment spec")
    print("(But note that not all these packages are required)")
    !pip freeze > requirements_snapshot.txt
    !jupyter --version > jupyter_versions_snapshot.txt

In [5]:
# Configuration variables
if ON_KAGGLE:
    raw_data_folder_path = Path('/kaggle/input') / 'dummy-premier-data-raw'
else:
    import proj_config
    raw_data_folder_path = proj_config.example_data_dir_path
assert raw_data_folder_path.is_dir()
print("Correct: All locations are available as expected")

Correct: All locations are available as expected


<div align="right" style="text-align: right"><a href="#Contents">Back to Contents</a></div>

# Variables

## Raw data structure

In [6]:
# Configuration variables for the expected format and structure of the data
ACCEPTED_FILE_EXTENSIONS = ['.csv', '', '.txt']
INPUT_FILE_ENCODINGS = ['utf-8', 'latin-1', 'ISO-8859-1']
INPUT_SEPARATOR = ","

RAW_STRUCT = {
    'stop_row_at': 'Total Peril Premium',
    'stem': {
        'ncols': 5,
        'chosen_cols': [0,1],
        'col_names': ['Premier_Test_Status', 'Total_Premium'],
        'col_types': [np.dtype('object'), np.dtype('float')],
    },
    'f_set': {
        'include_Test_Status': ['Ok'],
        'ncols': 4,
        'col_names': ['Peril_Factor', 'Relativity', 'Premium_increment', 'Premium_cumulative'],
        'col_types': [np.dtype('object')] + [np.dtype('float')] * 3,
    },
    'bp_name': 'Base Premium',
}
TRUNC_AFTER_REGEX = r",\s*{}.*".format(RAW_STRUCT['stop_row_at'])

# Output variables, considered to be constants
# Column name of the row IDs
ROW_ID_NAME = "Ref_num"

OUTPUT_DEFAULTS = {
    'pf_sep': ' ',
    'file_delimiter': ','
}
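
To illustrate what `TRUNC_AFTER_REGEX` is intended to do, here is a minimal sketch using only the standard `re` module. The sample line is a shortened, made-up variant of the raw lines shown later, and `re.sub` stands in for the `sep=` argument of `pd.read_csv`, so this approximates the effect rather than reproducing the exact mechanism.

```python
import re

# Same pattern as TRUNC_AFTER_REGEX, rebuilt here so the sketch is self-contained
stop_row_at = 'Total Peril Premium'
trunc_after_regex = r",\s*{}.*".format(stop_row_at)

# Shortened example line (illustrative only)
raw_line = (
    '1,"Ok",96.95,,,9,Peril1 Base Premium,0.0,91.95,91.95,'
    'Total Peril Premium,[some more text]'
)

# Drop the stop marker and everything that follows it on the line
truncated = re.sub(trunc_after_regex, "", raw_line)
print(truncated)
```

Everything from the comma before `Total Peril Premium` to the end of the line is removed, which leaves only the stem and factor set values.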

## Parameters

In [7]:
# Include Factors which are not found in the data
include_factors = None
if include_factors is None:
    include_factors = []

# Maximum number of rows to read in
nrows = None

In [8]:
# Input file location
in_filepath = raw_data_folder_path / 'minimal_input_adj.csv'

# Checks the file exists and has a recognised extension
in_filepath = Path(in_filepath)
if not in_filepath.is_file():
    raise FileNotFoundError(
        "\n\tin_filepath: There is no file at the input location:"
        f"\n\t'{in_filepath.absolute()}'"
        "\n\tCannot read the input data"
    )
if in_filepath.suffix.lower() not in ACCEPTED_FILE_EXTENSIONS:
    warnings.warn(
        f"in_filepath: The input file extension '{in_filepath.suffix}' "
        f"is not one of the recognised file extensions {ACCEPTED_FILE_EXTENSIONS}"
    )
print("Correct: Input file exists and has a recognised extension")

Correct: Input file exists and has a recognised extension


In [9]:
# View the first n raw CSV lines (without loading into a DataFrame)
nlines = 2
lines = []
with in_filepath.open() as f: 
    for line_num in range(nlines):
        lines.append(f.readline())
print(''.join(lines))

1,"Ok",96.95,,,9,Peril1 Base Premium,0.0,91.95,91.95,AnotherPrlBase Premium,0.0,5.17,5.17,Peril1Factor1,0.99818,-0.17,91.78,Total Peril Premium,[some more text]
2,"Ok",170.73,,,11,AnotherPrlBase Premium,0.0,101.56,101.56,AnotherPrlFactor1,1.064887,6.59,108.15,Peril1 Base Premium,0.0,100.55,100.55,AnotherPrlSomeFact,0.648875,-37.97,70.18,Total Peril Premium,2,extra text and figures



In [10]:
# Output file location
out_filepath = 'formatted_dummy_data1.csv'
force_overwrite = False

# Checks
out_filepath = Path(out_filepath)

if not out_filepath.parent.is_dir():
    raise FileNotFoundError(
        "\n\tout_filepath: The folder of the output file does not exist"
        f"\n\tFolder path: '{out_filepath.parent}'"
        "\n\tCreate the output folder before running this command"
    )

if out_filepath.is_file() and not force_overwrite:
    raise FileExistsError(
        "\n\tOutput options: File already exists at the output location:"
        f"\n\t'{out_filepath.absolute()}'"
        "\n\tIf you want to overwrite it, re-run with `force_overwrite = True`"
    )
else:
    if out_filepath.suffix.lower() not in ACCEPTED_FILE_EXTENSIONS:
        warnings.warn(
            f"out_filepath: The output file extension '{out_filepath.suffix}' "
            f"is not one of the recognised file extensions {ACCEPTED_FILE_EXTENSIONS}",
        )

print("Correct: A suitable location for output has been chosen")

Correct: A suitable location for output has been chosen
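
The output location checks above could be wrapped in a small function. The following is a sketch only: `check_out_path` is a hypothetical name and is not claimed to match the validation inside `premierconverter`. It uses a temporary directory so that it can be run without touching the workspace.

```python
import tempfile
from pathlib import Path

def check_out_path(out_filepath, force_overwrite=False):
    """Hypothetical helper: raise if the output location is unsuitable."""
    out_filepath = Path(out_filepath)
    if not out_filepath.parent.is_dir():
        raise FileNotFoundError(
            f"Output folder does not exist: '{out_filepath.parent}'"
        )
    if out_filepath.is_file() and not force_overwrite:
        raise FileExistsError(f"File already exists: '{out_filepath}'")
    return out_filepath

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / 'formatted.csv'
    # A new file in an existing folder is accepted
    new_file_ok = check_out_path(target) == target
    # Once the file exists, it is rejected unless overwriting is forced
    target.write_text('Ref_num\n')
    try:
        check_out_path(target)
        blocked_without_force = False
    except FileExistsError:
        blocked_without_force = True
    overwrite_ok = check_out_path(target, force_overwrite=True) == target
```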


<div align="right" style="text-align: right"><a href="#Contents">Back to Contents</a></div>

# Workflow

## Load raw data

In [11]:
# Load the CSV lines truncated as required
in_lines_trunc_df = None
for encoding in INPUT_FILE_ENCODINGS:
    try:
        in_lines_trunc_df = pd.read_csv(
            in_filepath, header=None, index_col=False,
            nrows=nrows, sep=TRUNC_AFTER_REGEX,
            engine='python', encoding=encoding,
        )
        # print(f"'{encoding}': Success")  # Used for debugging only
        break
    except UnicodeDecodeError:
        # print(f"'{encoding}': Fail")  # Used for debugging only
        pass
if in_lines_trunc_df is None:
    raise IOError(
        "\n\tread_input_lines: pandas.read_csv() failed."
        f"\n\tFile cannot be read with any of the encodings: {INPUT_FILE_ENCODINGS}"
    )

in_lines_trunc_df.head()

Unnamed: 0,0,1
0,"1,""Ok"",96.95,,,9,Peril1 Base Premium,0.0,91.95...",
1,"2,""Ok"",170.73,,,11,AnotherPrlBase Premium,0.0,...",
2,"3,""Error: Some text, that indicates an error.""...",
3,"4,""Ok"",161.68,,,5,Peril1NewFact,0.999998,0.0,1...",
4,"5,""Declined"",,,,4,Some more text on a declined...",
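
The encoding fallback used above can be shown in isolation. This sketch decodes raw bytes rather than reading a file, and `decode_with_fallback` is a hypothetical helper introduced only for illustration.

```python
def decode_with_fallback(raw_bytes, encodings=('utf-8', 'latin-1')):
    """Hypothetical helper: return (text, encoding) for the first encoding that works."""
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # Try the next candidate encoding
    raise IOError(f"Cannot decode with any of the encodings: {encodings}")

# Bytes written as UTF-8 are decoded at the first attempt
utf8_text, utf8_used = decode_with_fallback('Café'.encode('utf-8'))
# Bytes written as Latin-1 fail as UTF-8, so the second encoding is used
latin_text, latin_used = decode_with_fallback('Café'.encode('latin-1'))
print(utf8_used, latin_used)
```

One point to be aware of: in Python, `'latin-1'` and `'ISO-8859-1'` are names for the same codec, and that codec maps every byte to a character. It is therefore likely that the third entry of `INPUT_FILE_ENCODINGS` is never reached, and that a file in some other encoding is read without error but with incorrect characters.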


In [12]:
# Check it worked and is not malformed
if in_lines_trunc_df.shape[0] <= 1:
    warnings.warn(
        "Raw data lines: At most one row of data has been read. "
        "Are you sure you have specified the correct file? "
        "Are rows of data split into lines of the file?"
    )
if not ((
    in_lines_trunc_df.shape[1] == 1
) or (
    in_lines_trunc_df.iloc[:, 1].isna().sum() == in_lines_trunc_df.shape[0]
)):
    warnings.warn(
        "Raw data lines: A line in the input data has more than one match "
        f"to the regex pattern \"{TRUNC_AFTER_REGEX}\". "
        "Are you sure you have specified the correct file?"
    )

In [13]:
# Convert to DataFrame
with warnings.catch_warnings():
    # Ignore dtype warnings at this point, because we check them later on (after casting)
    warnings.filterwarnings(
        "ignore", message='.*Specify dtype option on import or set low_memory=False',
        category=pd.errors.DtypeWarning,
    )
    with io.StringIO('\n'.join(in_lines_trunc_df[0])) as in_lines_trunc_stream:
        df_trimmed = pd.read_csv(
            in_lines_trunc_stream, header=None, index_col=0, sep=INPUT_SEPARATOR,
            names=range(in_lines_trunc_df[0].str.count(INPUT_SEPARATOR).max() + 1),
        ).rename_axis(index=PCon.ROW_ID_NAME)

df_trimmed.head()

Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10,...,16,17,18,19,20,21,22,23,24,25
Ref_num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,Ok,96.95,,,9,Peril1 Base Premium,0.0,91.95,91.95,AnotherPrlBase Premium,...,-0.17,91.78,,,,,,,,
2,Ok,170.73,,,11,AnotherPrlBase Premium,0.0,101.56,101.56,AnotherPrlFactor1,...,100.55,100.55,AnotherPrlSomeFact,0.648875,-37.97,70.18,,,,
3,"Error: Some text, that indicates an error.",0.0,,,4,,,,,,...,,,,,,,,,,
4,Ok,161.68,,,5,Peril1NewFact,0.999998,0.0,110.34,Peril1Factor1,...,,,AnotherPrlBase Premium,0.0,51.34,51.34,Peril1 Base Premium,0.0,91.95,91.95
5,Declined,,,,4,Some more text on a declined row,"even, more. text",,0.0,0,...,,,,,,,,,,
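
The `names=range(...)` argument above matters because the rows have different numbers of fields. The following sketch, on two made-up lines, shows the idea: supplying as many column names as the widest row lets shorter rows be padded with missing values. Without `names`, my understanding is that pandas takes the width from the first line and may fail on a longer later line.

```python
import io
import pandas as pd

# Two illustrative lines with different numbers of fields
lines = ['1,Ok,96.95', '2,Ok,170.73,extra,5.0']

# Number of fields in the widest line = max number of separators + 1
n_cols = max(line.count(',') for line in lines) + 1

df_ragged = pd.read_csv(
    io.StringIO('\n'.join(lines)),
    header=None, index_col=0,
    names=list(range(n_cols)),  # Column labels 0..n_cols-1; column 0 becomes the index
).rename_axis(index='Ref_num')
print(df_ragged)
```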


In [14]:
# Check it is as expected and not malformed
if not df_trimmed.index.is_unique:
    warnings.warn(
        f"Trimmed data: Row identifiers '{ROW_ID_NAME}' are not unique. "
        "This may lead to unexpected results."
    )
if not (
    # At least the stem columns and one factor set column
    df_trimmed.shape[1] >= 
    RAW_STRUCT['stem']['ncols'] + 1 * RAW_STRUCT['f_set']['ncols']
) or not (
    # Stem columns plus a multiple of factor set columns
    (df_trimmed.shape[1] - RAW_STRUCT['stem']['ncols']) 
    % RAW_STRUCT['f_set']['ncols'] == 0
):
    warnings.warn(
        "Trimmed data: Incorrect number of columns with relevant data: "
        f"{df_trimmed.shape[1] + 1}"
        "\n\tThere should be: 1 for index, "
        f"{RAW_STRUCT['stem']['ncols']} for stem section, "
        f"and a multiple of {RAW_STRUCT['f_set']['ncols']} for factor sets"
    )
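
The column count rule checked above is simple arithmetic, sketched here as a standalone function. `has_valid_width` is a hypothetical name, and the defaults are copied from `RAW_STRUCT` (5 stem columns, 4 columns per factor set). The count excludes the index column, as `df_trimmed.shape[1]` does.

```python
def has_valid_width(n_data_cols, stem_ncols=5, fset_ncols=4):
    """Hypothetical helper: True if the width is stem + a positive multiple of factor sets."""
    n_fset_cols = n_data_cols - stem_ncols
    return n_fset_cols >= fset_ncols and n_fset_cols % fset_ncols == 0

# 25 = 5 stem columns + 5 factor sets of 4 columns, as in the example data
print(has_valid_width(25))
# 10 = 5 stem columns + 5 left over, which is not a multiple of 4
print(has_valid_width(10))
```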

## Stem section

In [15]:
# Get the stem section of columns
df_stem = df_trimmed.iloc[
    :, RAW_STRUCT['stem']['chosen_cols']
].pipe(  # Rename the columns
    lambda df: df.rename(columns=dict(zip(
        df.columns, 
        RAW_STRUCT['stem']['col_names']
    )))
)

df_stem.head()

Unnamed: 0_level_0,Premier_Test_Status,Total_Premium
Ref_num,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Ok,96.95
2,Ok,170.73
3,"Error: Some text, that indicates an error.",0.0
4,Ok,161.68
5,Declined,


In [16]:
# Checks
if not (
    df_stem.dtypes == RAW_STRUCT['stem']['col_types']
).all():
    warnings.warn(
        "Stem columns: Unexpected column data types"
        f"\n\tExpected: {RAW_STRUCT['stem']['col_types']}"
        f"\n\tActual:   {df_stem.dtypes.tolist()}"
    )

## Factor sets

In [17]:
# Combine the rest of the DataFrame into one
df_fsets = pd.concat([
    # For each of the factor sets of columns
    df_trimmed.loc[  # Filter to only the valid rows
        df_trimmed[1].isin(RAW_STRUCT['f_set']['include_Test_Status'])
    ].iloc[  # Select the columns
        :, fset_start_col:(fset_start_col + RAW_STRUCT['f_set']['ncols'])
    ].dropna(  # Remove rows that have all missing values
        how="all"
    ).pipe(lambda df: df.rename(columns=dict(zip(  # Rename columns
        df.columns, RAW_STRUCT['f_set']['col_names']
    )))).reset_index()  # Get row_ID as a column

    for fset_start_col in range(
        RAW_STRUCT['stem']['ncols'], df_trimmed.shape[1], RAW_STRUCT['f_set']['ncols']
    )
], sort=False).apply(  # Where possible, convert object columns to numeric dtype
    pd.to_numeric, errors='ignore'
).reset_index(drop=True)  # Best practice to ensure a unique index

df_fsets.head()

Unnamed: 0,Ref_num,Peril_Factor,Relativity,Premium_increment,Premium_cumulative
0,1,Peril1 Base Premium,0.0,91.95,91.95
1,2,AnotherPrlBase Premium,0.0,101.56,101.56
2,4,Peril1NewFact,0.999998,0.0,110.34
3,1,AnotherPrlBase Premium,0.0,5.17,5.17
4,2,AnotherPrlFactor1,1.064887,6.59,108.15


In [18]:
# Checks
if not (
    df_fsets[RAW_STRUCT['f_set']['col_names']].dtypes == 
    RAW_STRUCT['f_set']['col_types']
).all():
    warnings.warn(
        "Factor sets columns: Unexpected column data types"
        f"\n\tExpected: {RAW_STRUCT['f_set']['col_types']}"
        f"\n\tActual:   {df_fsets[RAW_STRUCT['f_set']['col_names']].dtypes.tolist()}"
    )

In [19]:
perils_implied = df_fsets.Peril_Factor.drop_duplicates(  # Get only unique 'Peril_Factor' combinations
).to_frame().pipe(lambda df: df.loc[  # Filter to leave only 'Base Premium' occurrences
    df.Peril_Factor.str.contains(RAW_STRUCT['bp_name']), :
]).assign(
    # Get the 'Peril' part of 'Peril_Factor'
    Peril=lambda df: df.Peril_Factor.str.replace(RAW_STRUCT['bp_name'], "").str.strip()
).Peril.sort_values().to_list()

perils_implied

['AnotherPrl', 'Peril1']

In [20]:
# Check that every 'Peril_Factor' starts with a Peril
if not df_fsets.Peril_Factor.str.startswith(
    tuple(perils_implied)
).all():
    warnings.warn(
        "Implied perils: Not every Peril_Factor starts with a Peril. "
        "Suggests the raw data format is not as expected."
    )
if '' in perils_implied:
    warnings.warn(
        "Implied perils: Empty string has been implied. "
        "Suggests the raw data format is not as expected."
    )

In [21]:
# Split out Peril_Factor
df_fsets_split = df_fsets.assign(
    # Split the Peril_Factor column into two
    Factor=lambda df: df.Peril_Factor.str.replace(
            '|'.join(perils_implied), ""
    ).str.strip(),
    Peril=lambda df: df.apply(
        lambda row: row.Peril_Factor.replace(row.Factor, "").strip()
        , axis=1
    )
).drop(columns='Peril_Factor')

df_fsets_split.head()

Unnamed: 0,Ref_num,Relativity,Premium_increment,Premium_cumulative,Factor,Peril
0,1,0.0,91.95,91.95,Base Premium,Peril1
1,2,0.0,101.56,101.56,Base Premium,AnotherPrl
2,4,0.999998,0.0,110.34,NewFact,Peril1
3,1,0.0,5.17,5.17,Base Premium,AnotherPrl
4,2,1.064887,6.59,108.15,Factor1,AnotherPrl
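
For comparison, the split of `Peril_Factor` can also be written without a regular expression. This is an alternative sketch, not the notebook's method: `split_peril_factor` is a hypothetical helper that matches on the start of the string and prefers the longest matching peril. That avoids two possible pitfalls of joining the peril names into a regex pattern: names containing special regex characters, and one peril name being a prefix of another.

```python
def split_peril_factor(peril_factor, perils):
    """Hypothetical helper: split a 'Peril_Factor' string into (Peril, Factor)."""
    # Prefer the longest matching peril so that 'Peril10' is not read as 'Peril1'
    matches = [peril for peril in perils if peril_factor.startswith(peril)]
    if not matches:
        raise ValueError(f"No known peril at the start of '{peril_factor}'")
    peril = max(matches, key=len)
    return peril, peril_factor[len(peril):].strip()

perils = ['AnotherPrl', 'Peril1']
examples = [
    split_peril_factor(pf, perils)
    for pf in ['Peril1 Base Premium', 'AnotherPrlFactor1', 'Peril1NewFact']
]
print(examples)
```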


In [22]:
# Get the Base Premiums for all row_IDs and Perils
df_base_prems = df_fsets_split.query(
    # Get only the Base Premium rows
    f"Factor == '{RAW_STRUCT['bp_name']}'"
).assign(
    # Create Peril_Factor combination for column names
    Peril_Factor=lambda df: df.Peril + OUTPUT_DEFAULTS['pf_sep'] + df.Factor,
    Custom_order=0,  # Will be used later to ensure desired column order
).pivot_table(
    # Pivot to 'Peril_Factor' columns and one row per row_ID
    index=ROW_ID_NAME,
    columns=['Peril', 'Custom_order', 'Peril_Factor'],
    values='Premium_cumulative'
)

df_base_prems.head()

Peril,AnotherPrl,Peril1
Custom_order,0,0
Peril_Factor,AnotherPrl Base Premium,Peril1 Base Premium
Ref_num,Unnamed: 1_level_3,Unnamed: 2_level_3
1,5.17,91.95
2,101.56,100.55
4,51.34,91.95


In [23]:
# Warning if the data set is not complete
if df_base_prems.isna().sum().sum() > 0:
    warnings.warn(
        "Base Premiums: Base Premium is missing for some rows and Perils. "
        "Suggests the raw data format is not as expected."
    )

In [24]:
# Ensure every row_ID has a row for every Peril, Factor combination
# Get the Relativity for all row_ID, Perils and Factors
df_factors = df_fsets_split.query(
    # Get only the Factor rows
    f"Factor != '{RAW_STRUCT['bp_name']}'"
).drop(
    columns=['Premium_increment', 'Premium_cumulative']
).set_index(
    # Ensure there is one row for every combination of row_ID, Peril, Factor
    [ROW_ID_NAME, 'Peril', 'Factor']
).pipe(lambda df: df.reindex(index=pd.MultiIndex.from_product([
    df.index.get_level_values(ROW_ID_NAME).unique(),
    df.index.get_level_values('Peril').unique(),
    # Include additional factors if desired from the inputs
    set(df.index.get_level_values('Factor').tolist() + include_factors),
], names = df.index.names
))).sort_index().fillna({  # Any new rows need to have Relativity of 1
    'Relativity': 1.,
}).reset_index().assign(
    # Create Peril_Factor combination for column names
    Peril_Factor=lambda df: df.Peril + OUTPUT_DEFAULTS['pf_sep'] + df.Factor,
    Custom_order=1
).pivot_table(
    # Pivot to 'Peril_Factor' columns and one row per row_ID
    index=ROW_ID_NAME,
    columns=['Peril', 'Custom_order', 'Peril_Factor'],
    values='Relativity'
)

df_factors.head()

Peril,AnotherPrl,AnotherPrl,AnotherPrl,Peril1,Peril1,Peril1
Custom_order,1,1,1,1,1,1
Peril_Factor,AnotherPrl Factor1,AnotherPrl NewFact,AnotherPrl SomeFact,Peril1 Factor1,Peril1 NewFact,Peril1 SomeFact
Ref_num,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3
1,1.0,1.0,1.0,0.99818,1.0,1.0
2,1.064887,1.0,0.648875,1.0,1.0,1.0
4,1.0,1.0,1.0,1.2,0.999998,1.0
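
The key step above is the reindex against the full product of row IDs, Perils and Factors. Here is a reduced sketch on a made-up two-row frame, assuming (as in the cell above) that a missing combination means a Relativity of 1.

```python
import pandas as pd

# Two observed (Ref_num, Peril, Factor) combinations, illustrative values only
rel = pd.DataFrame({
    'Ref_num': [1, 2],
    'Peril': ['Peril1', 'AnotherPrl'],
    'Factor': ['Factor1', 'SomeFact'],
    'Relativity': [0.99818, 0.648875],
}).set_index(['Ref_num', 'Peril', 'Factor'])

# Every combination of the values seen in each index level
full_index = pd.MultiIndex.from_product([
    rel.index.get_level_values('Ref_num').unique(),
    rel.index.get_level_values('Peril').unique(),
    rel.index.get_level_values('Factor').unique(),
], names=list(rel.index.names))

# Rows that did not exist before get NaN, which is then filled with 1
full = rel.reindex(full_index).fillna({'Relativity': 1.0}).sort_index()
print(full)
```

Two row IDs, two Perils and two Factors give eight rows, of which only the two original rows keep a Relativity other than 1.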


In [25]:
# Checks
if not df_factors.apply(lambda col: (col > 0)).all().all():
    warnings.warn(
        "Factor relativities: At least one relativity is zero or negative."
    )

In [26]:
# Combine Base Premium and Factors columns
df_base_factors = df_base_prems.merge(
    df_factors,
    how='inner', left_index=True, right_index=True
).pipe(
    # Sort columns (uses 'Custom_order')
    lambda df: df[df.columns.sort_values()]
)

# Drop unwanted levels of the column MultiIndex
# It is possible to do the following in a chain, but it is much too complicated
# See 'Chained drop a column MultiIndex level' in 'Unused rough work'
df_base_factors.columns = df_base_factors.columns.get_level_values('Peril_Factor')

df_base_factors.head()

Peril_Factor,AnotherPrl Base Premium,AnotherPrl Factor1,AnotherPrl NewFact,AnotherPrl SomeFact,Peril1 Base Premium,Peril1 Factor1,Peril1 NewFact,Peril1 SomeFact
Ref_num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,5.17,1.0,1.0,1.0,91.95,0.99818,1.0,1.0
2,101.56,1.064887,1.0,0.648875,100.55,1.0,1.0,1.0
4,51.34,1.0,1.0,1.0,91.95,1.2,0.999998,1.0
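
The role of the `Custom_order` level can be seen with plain tuples, which sort in the same lexicographic way as the column MultiIndex is assumed to. The factor name `Age` below is made up to show a case where sorting by name alone would put a factor before the Base Premium.

```python
# (Peril, Custom_order, Peril_Factor) tuples, mimicking the column MultiIndex
columns = [
    ('Peril1', 1, 'Peril1 Age'),  # 'Age' is a made-up factor name
    ('AnotherPrl', 0, 'AnotherPrl Base Premium'),
    ('Peril1', 0, 'Peril1 Base Premium'),
    ('AnotherPrl', 1, 'AnotherPrl Factor1'),
]

# Sorting on all three levels keeps each Base Premium first within its Peril
ordered = [name for _, _, name in sorted(columns)]

# Sorting on the name alone puts 'Peril1 Age' before 'Peril1 Base Premium'
by_name_only = sorted(name for _, _, name in columns)

print(ordered)
print(by_name_only)
```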


In [27]:
# Join back on to stem section
df_formatted = df_stem.merge(
    df_base_factors,
    how='left', left_index=True, right_index=True
).fillna(0.)  # The only missing values are from 'error' rows

df_formatted.iloc[:10,:20]

Unnamed: 0_level_0,Premier_Test_Status,Total_Premium,AnotherPrl Base Premium,AnotherPrl Factor1,AnotherPrl NewFact,AnotherPrl SomeFact,Peril1 Base Premium,Peril1 Factor1,Peril1 NewFact,Peril1 SomeFact
Ref_num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,Ok,96.95,5.17,1.0,1.0,1.0,91.95,0.99818,1.0,1.0
2,Ok,170.73,101.56,1.064887,1.0,0.648875,100.55,1.0,1.0,1.0
3,"Error: Some text, that indicates an error.",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Ok,161.68,51.34,1.0,1.0,1.0,91.95,1.2,0.999998,1.0
5,Declined,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
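
The join above relies on a left merge to keep rows that have no factor sets. A reduced sketch with made-up frames shows the effect: a row that appears only in the stem section survives the merge and its missing values are filled with zero.

```python
import pandas as pd

# Stem section: one 'Ok' row and one 'Declined' row (illustrative values)
stem = pd.DataFrame(
    {'Premier_Test_Status': ['Ok', 'Declined'], 'Total_Premium': [96.95, None]},
    index=pd.Index([1, 5], name='Ref_num'),
)
# Factor columns exist only for the 'Ok' row
factors = pd.DataFrame(
    {'Peril1 Base Premium': [91.95]},
    index=pd.Index([1], name='Ref_num'),
)

# A left merge keeps every stem row; unmatched rows get NaN, then 0
merged = stem.merge(
    factors, how='left', left_index=True, right_index=True
).fillna(0.)
print(merged)
```

An inner merge would drop row 5 entirely, which is why `how='left'` is used for this step.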


## Output to CSV

In [28]:
# Save it
df_formatted.to_csv(
    out_filepath, sep=OUTPUT_DEFAULTS['file_delimiter'], index=True
)
print("Output saved")

Output saved


### Reload the spreadsheet to check it worked

In [29]:
# Check it worked
df_reload = pd.read_csv(
    out_filepath, index_col=0, sep=OUTPUT_DEFAULTS['file_delimiter'],
)

df_reload.head()

Unnamed: 0_level_0,Premier_Test_Status,Total_Premium,AnotherPrl Base Premium,AnotherPrl Factor1,AnotherPrl NewFact,AnotherPrl SomeFact,Peril1 Base Premium,Peril1 Factor1,Peril1 NewFact,Peril1 SomeFact
Ref_num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,Ok,96.95,5.17,1.0,1.0,1.0,91.95,0.99818,1.0,1.0
2,Ok,170.73,101.56,1.064887,1.0,0.648875,100.55,1.0,1.0,1.0
3,"Error: Some text, that indicates an error.",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Ok,161.68,51.34,1.0,1.0,1.0,91.95,1.2,0.999998,1.0
5,Declined,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [30]:
assert (df_formatted.dtypes == df_reload.dtypes).all()
assert df_reload.shape == df_formatted.shape
assert (df_formatted.index == df_reload.index).all()
assert df_formatted.iloc[:,1:].apply(
    lambda col: np.abs(col - df_reload[col.name]) < 1e-10
).all().all()
print("Correct: The reloaded values are equal, up to floating point tolerance")

Correct: The reloaded values are equal, up to floating point tolerance
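
The comparison above uses an absolute tolerance rather than exact equality. A short standard-library sketch shows why: two floats that are equal on paper can differ in their last bits.

```python
import math

computed = 0.1 + 0.2
expected = 0.3

# Exact equality fails because of binary floating point rounding
exactly_equal = computed == expected

# An absolute tolerance, as used in the assertion above, accepts the tiny difference
within_tolerance = abs(computed - expected) < 1e-10

# The standard library offers the same check via math.isclose
isclose_result = math.isclose(computed, expected, rel_tol=0.0, abs_tol=1e-10)

print(exactly_equal, within_tolerance, isclose_result)
```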


## Load expected output to check it is as expected

In [31]:
# Location of sheet of expected results
expected_filepath = raw_data_folder_path / 'minimal_expected_output_5.csv'

In [32]:
df_expected = None
for encoding in INPUT_FILE_ENCODINGS:
    try:
        df_expected = pd.read_csv(
            expected_filepath,
            index_col=0, sep=OUTPUT_DEFAULTS['file_delimiter'],
            encoding=encoding
        ).apply(lambda col: (
            col.astype('float') 
            if np.issubdtype(col.dtype, np.number)
            else col
        ))
        # print(f"'{encoding}': Success")  # Used for debugging only
        break
    except UnicodeDecodeError:
        # print(f"'{encoding}': Fail")  # Used for debugging only
        pass
if df_expected is None:
    raise IOError(
        "\n\tload_formatted_file: pandas.read_csv() failed."
        f"\n\tFile cannot be read with any of the encodings: {INPUT_FILE_ENCODINGS}"
    )

df_expected.head()

Unnamed: 0_level_0,Premier_Test_Status,Total_Premium,AnotherPrl Base Premium,AnotherPrl Factor1,AnotherPrl NewFact,AnotherPrl SomeFact,Peril1 Base Premium,Peril1 Factor1,Peril1 NewFact,Peril1 SomeFact
Ref_num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,Ok,96.95,5.17,1.0,1.0,1.0,91.95,0.99818,1.0,1.0
2,Ok,170.73,101.56,1.064887,1.0,0.648875,100.55,1.0,1.0,1.0
3,"Error: Some text, that indicates an error.",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Ok,161.68,51.34,1.0,1.0,1.0,91.95,1.2,0.999998,1.0
5,Declined,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [33]:
assert (df_formatted.dtypes == df_expected.dtypes).all()
assert df_expected.shape == df_formatted.shape
assert (df_formatted.index == df_expected.index).all()
assert df_formatted.iloc[:,1:].apply(
    lambda col: np.abs(col - df_expected[col.name]) < 1e-10
).all().all()
print("Correct: The formatted values match the expected output, up to floating point tolerance")

Correct: The formatted values match the expected output, up to floating point tolerance


<div align="right" style="text-align: right"><a href="#Contents">Back to Contents</a></div>

# Using the functions
## Default arguments

In [34]:
help(PCon.convert)

Help on function convert in module premierconverter:

convert(in_filepath, out_filepath, nrows=None, file_delimiter=',', verbose=True, *, force_overwrite=False, **kwargs)
    Load raw data, convert to specified format, and save result
    
    in_filepath: Path to file containing the input data
    out_filepath: Path of a file to save the formatted data
        If it does not exist, a new file will be created.
        The directory must already exist.
    nrows: Maximum number of rows to read
        If None (default), then attempt to read all rows.
    file_delimiter: Seperator character for columns in the output file
        Default is a comma ",".
    verbose: Whether to show progress messages
    force_overwrite: Set to True if you want to overwrite an existing file
    **kwargs: Other arguments to pass to convert_df
    
    Returns: out_filepath, if it completes successfully



In [35]:
#in_filepath = raw_data_folder_path / 'minimal_input_adj.csv'
out_filepath = 'formatted_data.csv'
res_filepath = PCon.convert(in_filepath, out_filepath)

Step 1:	Validate filepath arguments
Step 2:	Load and truncate input file lines
Step 3:	Split lines into a DataFrame
Step 4:	Reshape DataFrame into desired format
Step 5:	Save resulting data format
Output saved here: H:\My Documents\05_Repos\premierconverter\development\compiled\formatted_data.csv


In [36]:
# Run the pipeline manually to check
# Load raw data
in_lines_trunc_df = PCon.read_input_lines(in_filepath)
PCon.validate_input_lines_trunc(in_lines_trunc_df)
df_trimmed = PCon.split_lines_to_df(in_lines_trunc_df)
# Get converted DataFrame
df_formatted = PCon.convert_df(df_trimmed)

df_formatted.head()

Unnamed: 0_level_0,Premier_Test_Status,Total_Premium,AnotherPrl Base Premium,AnotherPrl Factor1,AnotherPrl NewFact,AnotherPrl SomeFact,Peril1 Base Premium,Peril1 Factor1,Peril1 NewFact,Peril1 SomeFact
Ref_num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,Ok,96.95,5.17,1.0,1.0,1.0,91.95,0.99818,1.0,1.0
2,Ok,170.73,101.56,1.064887,1.0,0.648875,100.55,1.0,1.0,1.0
3,"Error: Some text, that indicates an error.",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Ok,161.68,51.34,1.0,1.0,1.0,91.95,1.2,0.999998,1.0
5,Declined,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [37]:
# Reload resulting data from workbook
df_reload = PCon.load_formatted_file(res_filepath)

# Check it matches expectations
if PCon.formatted_dfs_are_equal(df_formatted, df_reload):
    print("Correct: The reloaded values are equal, up to floating point tolerance")

Correct: The reloaded values are equal, up to floating point tolerance


In [38]:
# Check against expected output from manually created worksheet
expected_filepath = raw_data_folder_path / 'minimal_expected_output_5.csv'
df_expected = PCon.load_formatted_file(expected_filepath)

# Check it matches expectations
if PCon.formatted_dfs_are_equal(df_reload, df_expected):
    print("Correct: The reloaded values are equal, up to floating point tolerance")

Correct: The reloaded values are equal, up to floating point tolerance


In [39]:
# Delete the results file
res_filepath.unlink()
print("Workspace restored")

Workspace restored


## Limited rows

In [40]:
nrows = 2  # Choose a specific number for which the expected results have been created: 2, 4 or 5
in_filepath = raw_data_folder_path / 'minimal_input_adj.csv'
out_filepath = f'formatted_data_{nrows}.csv'
res_filepath = PCon.convert(in_filepath, out_filepath, nrows = nrows)

# Check against expected output from manually created worksheet
expected_filepath = raw_data_folder_path / f'minimal_expected_output_{nrows}.csv'
df_expected = PCon.load_formatted_file(expected_filepath)
df_reload = PCon.load_formatted_file(res_filepath)

# Check it matches expectations
if PCon.formatted_dfs_are_equal(df_reload, df_expected):
    print("Correct: The reloaded values are equal, up to floating point tolerance")

# Delete the results file
res_filepath.unlink()
print("Workspace restored")

Step 1:	Validate filepath arguments
Step 2:	Load and truncate input file lines
Step 3:	Split lines into a DataFrame
Step 4:	Reshape DataFrame into desired format
Step 5:	Save resulting data format
Output saved here: H:\My Documents\05_Repos\premierconverter\development\compiled\formatted_data_2.csv
Correct: The reloaded values are equal, up to floating point tolerance
Workspace restored


## Limited rows with included factors

In [41]:
nrows = 2
include_factors = ['NewFact', 'SomeFact']
in_filepath = raw_data_folder_path / 'minimal_input_adj.csv'
out_filepath = 'formatted_data_2_all_facts.csv'
res_filepath = PCon.convert(in_filepath, out_filepath, nrows=nrows, include_factors=include_factors)

# Check against expected output from manually created worksheet
expected_filepath = raw_data_folder_path / 'minimal_expected_output_2_all_facts.csv'  # Specifically created for this test
df_expected = PCon.load_formatted_file(expected_filepath)
df_reload = PCon.load_formatted_file(res_filepath)

# Check it matches expectations
if PCon.formatted_dfs_are_equal(df_reload, df_expected):
    print("Correct: The reloaded values are equal, up to floating point tolerance")

# Delete the results file
res_filepath.unlink()
print("Workspace restored")

Step 1:	Validate filepath arguments
Step 2:	Load and truncate input file lines
Step 3:	Split lines into a DataFrame
Step 4:	Reshape DataFrame into desired format
Step 5:	Save resulting data format
Output saved here: H:\My Documents\05_Repos\premierconverter\development\compiled\formatted_data_2_all_facts.csv
Correct: The reloaded values are equal, up to floating point tolerance
Workspace restored


Further combinations are tested in the package's automated test suite.

<div align="right" style="text-align: right"><a href="#Contents">Back to Contents</a></div>