# Software Engineering Assessment

**Problem Statement**

A Power Automate flow currently takes data from multiple sources and merges it into a single output used by the finance teams. The flow is slow to run and prone to failure, so a Python ETL process is to be built to replace it.

## Setup Instructions

Run the code cell below and, when prompted, upload the following files from the Zip archive:

* PayRates.csv
* Staff.csv
* Teams.csv
* requirements.txt

In [1]:
from google.colab import files
uploaded = files.upload()

Saving PayRates.csv to PayRates.csv
Saving requirements.txt to requirements.txt
Saving Staff.csv to Staff.csv
Saving Teams.csv to Teams.csv


In [2]:
pip install -r requirements.txt

Collecting faker (from -r requirements.txt (line 2))
  Downloading faker-37.5.3-py3-none-any.whl.metadata (15 kB)
Downloading faker-37.5.3-py3-none-any.whl (1.9 MB)
Installing collected packages: faker
Successfully installed faker-37.5.3


In [3]:
import pandas as pd
import unittest

In [4]:
df_staff = pd.read_csv('Staff.csv')
df_teams = pd.read_csv('Teams.csv')
df_payrates = pd.read_csv('PayRates.csv')

print(df_staff.head())
print(df_teams.head())
print(df_payrates.head())

   Unnamed: 0       ID              Full Name    Location  Pay Rate
0           0  9327790             Megan West  Manchester        68
1           1  2785284  Deborah McDonald-Hall       Leeds        99
2           2  4641741   Gary Martin-Crawford     Bristol       124
3           3  2838717            Jasmine Fox    Brighton        74
4           4   705387      Mr Damian Gregory    Aberdeen        70
   Unnamed: 0     ID       Department                    Team
0           0  42962    Adminstration    Adminstration Team 1
1           1  99703  Human Resources  Human Resources Team 2
2           2  86354       Investment       Investment Team 1
3           3  93505      Engineering      Engineering Team 1
4           4  77026  Human Resources  Human Resources Team 2
   Unnamed: 0     ID  Pay Rate
0           0  42962     63.69
1           1  99703     84.76
2           2  86354    134.80
3           3  93505    117.14
4           4  77026    102.47
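
The `Unnamed: 0` column in each preview is the row index that was written out when the CSVs were saved, and the Teams and PayRates previews appear to share the same `ID` values. A minimal sketch of loading without that extra column and joining the two tables follows. It assumes `ID` is the join key and that each `ID` appears once per file, neither of which is confirmed by the data shown; the inline CSV text is a stand-in built from the first rows of the preview so the example runs without the uploaded files.

```python
import io

import pandas as pd

# Stand-ins for Teams.csv and PayRates.csv, copied from the preview rows above.
# The empty first header is what pandas displays as "Unnamed: 0".
teams_csv = io.StringIO(
    ",ID,Department,Team\n"
    "0,42962,Adminstration,Adminstration Team 1\n"
    "1,99703,Human Resources,Human Resources Team 2\n"
    "2,86354,Investment,Investment Team 1\n"
)
payrates_csv = io.StringIO(
    ",ID,Pay Rate\n"
    "0,42962,63.69\n"
    "1,99703,84.76\n"
    "2,86354,134.80\n"
)

# index_col=0 treats the saved index as the index instead of a data column.
df_teams = pd.read_csv(teams_csv, index_col=0)
df_payrates = pd.read_csv(payrates_csv, index_col=0)

# Assumption: 'ID' is the shared key. validate="one_to_one" makes pandas raise
# a MergeError if either side has duplicate IDs, rather than silently
# multiplying rows.
df_team_rates = df_teams.merge(
    df_payrates, on="ID", how="left", validate="one_to_one"
)
print(df_team_rates)
```

With the real files, the `io.StringIO` objects would be replaced by the file names. If the one-to-one assumption does not hold, the `validate` argument would need to be relaxed or the duplicates investigated first.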


In [8]:
def add_leading_zeros(df, column_name, total_digits=8):
    """
    Left-pad the values in a DataFrame column with zeros to a fixed width.

    Values are converted to strings first. Values already at or beyond
    ``total_digits`` characters are left unchanged (``zfill`` never truncates).

    Args:
        df: pandas DataFrame. Modified in place.
        column_name: Name of the column to pad.
        total_digits: Minimum width of each value after padding.

    Returns:
        The same DataFrame, with the column converted to zero-padded strings.
    """
    df[column_name] = df[column_name].astype(str).str.zfill(total_digits)
    return df

df_staff = add_leading_zeros(df_staff, 'ID')
print(df_staff['ID'].head())

0    09327790
1    02785284
2    04641741
3    02838717
4    00705387
Name: ID, dtype: object
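
Note that `add_leading_zeros` assigns to the DataFrame it receives, so the caller's frame is changed as well, and IDs longer than `total_digits` pass through unpadded without any warning. A possible alternative is sketched below. `pad_ids` is a hypothetical name, not part of the original notebook, and raising on over-length IDs is an assumed requirement rather than one stated in the assessment.

```python
import pandas as pd


def pad_ids(df, column_name, total_digits=8):
    """Hypothetical variant: return a padded copy and reject over-length IDs."""
    padded = df[column_name].astype(str).str.zfill(total_digits)

    # zfill never truncates, so anything longer than total_digits is a data
    # problem worth surfacing instead of passing through silently.
    too_long = padded.str.len() > total_digits
    if too_long.any():
        raise ValueError(
            f"{int(too_long.sum())} value(s) in '{column_name}' "
            f"exceed {total_digits} digits"
        )

    # assign() returns a new DataFrame, leaving the input untouched.
    return df.assign(**{column_name: padded})


original = pd.DataFrame({"ID": [9327790, 705387]})
result = pad_ids(original, "ID")

print(result["ID"].tolist())    # ['09327790', '00705387']
print(original["ID"].tolist())  # [9327790, 705387]
```

Returning a copy keeps the raw frame available for later comparison, at the cost of some extra memory on large inputs.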


In [9]:
class TestAddLeadingZeros(unittest.TestCase):

    def test_add_zeros_to_short_id(self):
        df = pd.DataFrame({'ID': [123, 4567, 89]})
        df_modified = add_leading_zeros(df, 'ID', total_digits=8)
        expected_ids = ['00000123', '00004567', '00000089']
        self.assertEqual(list(df_modified['ID']), expected_ids)

    def test_id_already_correct_length(self):
        df = pd.DataFrame({'ID': [12345678, 98765432]})
        df_modified = add_leading_zeros(df, 'ID', total_digits=8)
        expected_ids = ['12345678', '98765432']
        self.assertEqual(list(df_modified['ID']), expected_ids)

    def test_id_longer_than_total_digits(self):
        df = pd.DataFrame({'ID': [123456789, 9876543210]})
        df_modified = add_leading_zeros(df, 'ID', total_digits=8)
        # zfill does not truncate, it only adds zeros
        expected_ids = ['123456789', '9876543210']
        self.assertEqual(list(df_modified['ID']), expected_ids)

    def test_empty_dataframe(self):
        df = pd.DataFrame({'ID': []})
        df_modified = add_leading_zeros(df, 'ID', total_digits=8)
        expected_ids = []
        self.assertEqual(list(df_modified['ID']), expected_ids)

    def test_column_with_strings(self):
        df = pd.DataFrame({'ID': ['abc', 'defg']})
        df_modified = add_leading_zeros(df, 'ID', total_digits=8)
        expected_ids = ['00000abc', '0000defg']
        self.assertEqual(list(df_modified['ID']), expected_ids)

if __name__ == '__main__':
    unittest.main(argv=['first-arg-is-ignored'], exit=False)

.....
----------------------------------------------------------------------
Ran 5 tests in 0.011s

OK
