# Software Engineering Assessment

**Problem Statement**

A Power Automate flow currently takes data from multiple sources and merges it into a single output used by finance teams. The flow is slow to run and prone to failure, so a Python ETL process is to be spun up to replace it.

## Setup Instructions

Run the code block below and, when prompted, upload these files from the Zip archive:

* PayRates.csv
* Staff.csv
* Teams.csv
* requirements.txt

In [1]:
from google.colab import files
uploaded = files.upload()

Saving PayRates.csv to PayRates.csv
Saving requirements.txt to requirements.txt
Saving Staff.csv to Staff.csv
Saving Teams.csv to Teams.csv


In [2]:
%pip install -r requirements.txt

Collecting faker (from -r requirements.txt (line 2))
  Downloading faker-37.5.3-py3-none-any.whl.metadata (15 kB)
Collecting jupyter-contrib-nbextensions (from -r requirements.txt (line 4))
  Downloading jupyter_contrib_nbextensions-0.7.0.tar.gz (23.5 MB)
  Preparing metadata (setup.py) ... done
Collecting ydata-profiling[notebook] (from -r requirements.txt (line 5))
  Downloading ydata_profiling-4.16.1-py2.py3-none-any.whl.metadata (22 kB)
Collecting jupyter_contrib_core>=0.3.3 (from jupyter-contrib-nbextensions->-r requirements.txt (line 4))
  Downloading jupyter_contrib_core-0.4.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting jupyter_highlight_selected_word>=0.1.1 (from jupyter-contrib-nbextensions->-r requirements.txt (line 4))
  Downloading jupyter_highlight_selected_word-0.2.0-py2.py3-none-any.whl.metadata (730

In [42]:
import pandas as pd
import numpy as np
import unittest
from ydata_profiling import ProfileReport

## Import Data

In [34]:
df_staff = pd.read_csv('Staff.csv')
df_teams = pd.read_csv('Teams.csv')
df_payrates = pd.read_csv('PayRates.csv')

print(df_staff.head())
print(df_teams.head())
print(df_payrates.head())

   Unnamed: 0     ID         Full Name    Location       Department
0           0  47695    Pauline Parkin    Aberdeen         Sciences
1           1  18846      Dale Fleming     Glasgow        Marketing
2           2  48690  Dr Alison Taylor    Brighton  Human Resources
3           3  21227   Frances Roberts  Manchester  Human Resources
4           4  71734        Ian Harris   Edinburgh  Risk Management
   Unnamed: 0     ID       Department                    Team
0           0  47695         Sciences         Sciences Team 2
1           1  18846        Marketing        Marketing Team 1
2           2  48690  Human Resources  Human Resources Team 1
3           3  21227  Human Resources  Human Resources Team 1
4           4  71734  Risk Management  Risk Management Team 1
   Unnamed: 0     ID  Pay Rate
0           0  47695    117.79
1           1  18846    116.13
2           2  48690     47.79
3           3  21227    103.54
4           4  71734    103.33
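The `Unnamed: 0` column in each preview is the row index that was written into the CSV files. A minimal sketch of one way to avoid loading it as data, assuming the first column of each file is always that saved index; the two-row `sample_csv` below is a hypothetical stand-in for `Staff.csv`:

```python
import io
import pandas as pd

# Hypothetical sample mimicking the layout of Staff.csv (unnamed first column)
sample_csv = io.StringIO(
    ",ID,Full Name,Location,Department\n"
    "0,47695,Pauline Parkin,Aberdeen,Sciences\n"
    "1,18846,Dale Fleming,Glasgow,Marketing\n"
)

# index_col=0 treats the unnamed first column as the row index, not a data column
df_sample = pd.read_csv(sample_csv, index_col=0)
print(df_sample.columns.tolist())
```

The notebook instead drops the column later by selecting the wanted columns explicitly, which works equally well.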


## YData Auto EDA

In [12]:
staff_profile = ProfileReport(df_staff, title="Staff Profiling Report")
teams_profile = ProfileReport(df_teams, title="Teams Profiling Report")
pay_profile = ProfileReport(df_payrates, title="Pay Profiling Report")
staff_profile.to_file("staff_report.html")
teams_profile.to_file("teams_report.html")
pay_profile.to_file("pay_report.html")
files.download("staff_report.html")
files.download("teams_report.html")
files.download("pay_report.html")


## Data Prep

In [35]:
df_staff = df_staff[['ID','Full Name', 'Location']].drop_duplicates(subset=['ID'])
df_staff.sort_values(by='ID', inplace=True)
df_staff.reset_index(drop=True, inplace=True)
print(df_staff.head())

      ID         Full Name    Location
0  10007  Heather Bartlett      London
1  10026     Hayley Sutton  Birmingham
2  10045      Eileen Smith    Brighton
3  10052     Roger Edwards       Leeds
4  10061     Andrew Martin  Manchester


In [36]:
df_teams = df_teams[['ID','Department', 'Team']].drop_duplicates(subset=['ID'])
df_teams.sort_values(by='ID', inplace=True)
df_teams.reset_index(drop=True, inplace=True)
print(df_teams.head())

      ID     Department                  Team
0  10007       Sciences       Sciences Team 1
1  10026       Sciences       Sciences Team 1
2  10045    Engineering    Engineering Team 1
3  10052    Engineering    Engineering Team 1
4  10061  Adminstration  Adminstration Team 1


In [37]:
df_payrates = df_payrates[['ID','Pay Rate']].drop_duplicates(subset=['ID'])
df_payrates.sort_values(by='ID', inplace=True)
df_payrates.reset_index(drop=True, inplace=True)
print(df_payrates.head())

      ID  Pay Rate
0  10007     75.72
1  10026     63.07
2  10045     87.74
3  10052    137.14
4  10061    147.31
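The three preparation cells above apply the same steps to each frame: select columns, drop duplicate IDs, sort by ID, and reset the index. A minimal sketch of how that could be factored into one helper; `prepare_frame` and the `demo` data are hypothetical names used only for illustration, not part of the original notebook:

```python
import pandas as pd

def prepare_frame(df, columns, key="ID"):
    """Keep the selected columns, drop duplicate keys (first wins), sort, reset index."""
    return (
        df[columns]
        .drop_duplicates(subset=[key])
        .sort_values(by=key)
        .reset_index(drop=True)
    )

# Hypothetical data with a duplicated ID and a stray index column
demo = pd.DataFrame(
    {"Unnamed: 0": [0, 1, 2], "ID": [3, 1, 3], "Pay Rate": [10.0, 20.0, 30.0]}
)
clean = prepare_frame(demo, ["ID", "Pay Rate"])
print(clean)
```

Because the helper returns a new frame instead of using `inplace=True`, the input frame is left unchanged.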


## Data Merging, Final Cleanup And Export


In [38]:
df_merged = pd.merge(df_staff, df_teams, on='ID', how='left')
df_merged = pd.merge(df_merged, df_payrates, on='ID', how='left')
print(df_merged.head())
df_merged.info()

      ID         Full Name    Location     Department                  Team  \
0  10007  Heather Bartlett      London       Sciences       Sciences Team 1   
1  10026     Hayley Sutton  Birmingham       Sciences       Sciences Team 1   
2  10045      Eileen Smith    Brighton    Engineering    Engineering Team 1   
3  10052     Roger Edwards       Leeds    Engineering    Engineering Team 1   
4  10061     Andrew Martin  Manchester  Adminstration  Adminstration Team 1   

   Pay Rate  
0     75.72  
1     63.07  
2     87.74  
3    137.14  
4    147.31  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ID          10000 non-null  int64  
 1   Full Name   10000 non-null  object 
 2   Location    10000 non-null  object 
 3   Department  10000 non-null  object 
 4   Team        10000 non-null  object 
 5   Pay Rate    10000 non-null  float64
dtypes
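The `info()` output shows no nulls after the left joins, so every staff ID found a match in this data set. A hedged sketch of how the merge itself could guard against duplicate keys and report unmatched rows on future inputs, using the `validate` and `indicator` parameters of `pd.merge`; the small `staff` and `pay` frames are hypothetical:

```python
import pandas as pd

# Hypothetical frames: ID 3 has no pay rate
staff = pd.DataFrame({"ID": [1, 2, 3], "Full Name": ["A", "B", "C"]})
pay = pd.DataFrame({"ID": [1, 2], "Pay Rate": [10.0, 20.0]})

# validate raises MergeError if either side has duplicate IDs;
# indicator adds a "_merge" column recording where each row came from
merged = pd.merge(
    staff, pay, on="ID", how="left", validate="one_to_one", indicator=True
)

# IDs present in staff but missing from pay
unmatched = merged.loc[merged["_merge"] == "left_only", "ID"].tolist()
print(unmatched)
```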

In [39]:
def add_leading_zeros(df, column_name, total_digits=8):
  """
  Adds leading zeros to a specified column in a DataFrame to reach a total number of digits.

  Args:
    df: pandas DataFrame.
    column_name: Name of the column to modify.
    total_digits: The desired total number of digits.

  Returns:
    DataFrame with the modified column.
  """
  df[column_name] = df[column_name].astype(str).str.zfill(total_digits)
  return df

df_merged = add_leading_zeros(df_merged, 'ID')
print(df_merged['ID'].head())

0    00010007
1    00010026
2    00010045
3    00010052
4    00010061
Name: ID, dtype: object
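Note that `add_leading_zeros` assigns to the column of the frame it receives, so the caller's DataFrame is modified as well as returned. If that side effect is unwanted, a copy-based variant is one option; `add_leading_zeros_copy` is a hypothetical name for this sketch:

```python
import pandas as pd

def add_leading_zeros_copy(df, column_name, total_digits=8):
    """Return a new DataFrame with the column zero-padded; the input is untouched."""
    out = df.copy()
    out[column_name] = out[column_name].astype(str).str.zfill(total_digits)
    return out

original = pd.DataFrame({"ID": [123, 4567]})
padded = add_leading_zeros_copy(original, "ID")
print(padded["ID"].tolist())
print(original["ID"].tolist())
```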


In [52]:
def remove_titles(df, column_name):
  """
  Removes common titles (Mr, Mrs, Ms, Dr) with or without periods and their non-capitalized versions from a specified column in a DataFrame.

  Args:
    df: pandas DataFrame.
    column_name: Name of the column to modify.

  Returns:
    DataFrame with the modified column.
  """
  if df.empty or column_name not in df.columns or not pd.api.types.is_string_dtype(df[column_name]):
      return df

  titles = ['Mr', 'Mrs', 'Ms', 'Dr', 'mr', 'mrs', 'ms', 'dr']
  df[column_name] = df[column_name].str.replace(r'\b(' + '|'.join(titles) + r')\b\.?', '', regex=True).str.strip()
  return df

df_merged = remove_titles(df_merged, 'Full Name')
print(df_merged['Full Name'].head())

0    Heather Bartlett
1       Hayley Sutton
2        Eileen Smith
3       Roger Edwards
4       Andrew Martin
Name: Full Name, dtype: object
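The title list above enumerates capitalized and lowercase spellings, so a value such as `DR.` would not be matched. A minimal sketch of a case-insensitive alternative, assuming that matching any capitalization is desirable; the `names` Series is hypothetical sample data:

```python
import re
import pandas as pd

names = pd.Series(
    ["DR. Alison Taylor", "Mr John Smith", "Drew Barrymore", "J.R.R. Tolkien"]
)

# Word boundaries stop "Dr" from matching inside "Drew";
# re.IGNORECASE covers Mr/MR/mr and so on with one short list
pattern = r"\b(?:mr|mrs|ms|dr)\b\.?\s*"
cleaned = names.str.replace(pattern, "", regex=True, flags=re.IGNORECASE).str.strip()
print(cleaned.tolist())
```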


In [61]:
df_merged.to_csv('final_output.csv', index=False)
files.download('final_output.csv')
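One thing to watch with the exported file: the zero-padded IDs are strings, and a consumer that reads the CSV back with pandas defaults will infer an integer column and lose the padding. A small sketch of the round trip, using an in-memory buffer and hypothetical two-row data in place of `final_output.csv`:

```python
import io
import pandas as pd

df_out = pd.DataFrame({"ID": ["00010007", "00010026"], "Pay Rate": [75.72, 63.07]})

# Write to an in-memory buffer instead of a file for this illustration
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)

# Default read: ID is inferred as an integer, so the leading zeros disappear
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Reading ID as a string keeps the padding
buffer.seek(0)
string_read = pd.read_csv(buffer, dtype={"ID": str})

print(default_read["ID"].tolist())
print(string_read["ID"].tolist())
```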


## Unit Testing

### Add Leading Zeros Testing

In [60]:
class TestAddLeadingZeros(unittest.TestCase):

    def test_add_zeros_to_short_id(self):
        df = pd.DataFrame({'ID': [123, 4567, 89]})
        df_modified = add_leading_zeros(df, 'ID', total_digits=8)
        expected_ids = ['00000123', '00004567', '00000089']
        self.assertEqual(list(df_modified['ID']), expected_ids)

    def test_id_already_correct_length(self):
        df = pd.DataFrame({'ID': [12345678, 98765432]})
        df_modified = add_leading_zeros(df, 'ID', total_digits=8)
        expected_ids = ['12345678', '98765432']
        self.assertEqual(list(df_modified['ID']), expected_ids)

    def test_id_longer_than_total_digits(self):
        df = pd.DataFrame({'ID': [123456789, 9876543210]})
        df_modified = add_leading_zeros(df, 'ID', total_digits=8)
        # zfill does not truncate, it only adds zeros
        expected_ids = ['123456789', '9876543210']
        self.assertEqual(list(df_modified['ID']), expected_ids)

    def test_empty_dataframe(self):
        df = pd.DataFrame({'ID': []})
        df_modified = add_leading_zeros(df, 'ID', total_digits=8)
        expected_ids = []
        self.assertEqual(list(df_modified['ID']), expected_ids)

    def test_column_with_strings(self):
        df = pd.DataFrame({'ID': ['abc', 'defg']})
        df_modified = add_leading_zeros(df, 'ID', total_digits=8)
        expected_ids = ['00000abc', '0000defg']
        self.assertEqual(list(df_modified['ID']), expected_ids)

if __name__ == '__main__':
    unittest.main(argv=['first-arg-is-ignored'], exit=False)

............
----------------------------------------------------------------------
Ran 12 tests in 0.524s

OK
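Both test cells report 12 tests, which is consistent with `unittest.main` collecting every `TestCase` defined in the notebook session (5 leading-zero tests plus 7 title tests) and not only the class in the current cell. A hedged sketch of running a single class instead; `TestExample` is a hypothetical stand-in for `TestAddLeadingZeros` or `TestRemoveTitles`:

```python
import unittest

class TestExample(unittest.TestCase):
    def test_zfill_pads(self):
        self.assertEqual("123".zfill(8), "00000123")

# Load only the tests from the named class, then run them
suite = unittest.TestLoader().loadTestsFromTestCase(TestExample)
result = unittest.TextTestRunner(verbosity=2).run(suite)
print(result.testsRun)
```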


### Remove Titles Testing

In [59]:
class TestRemoveTitles(unittest.TestCase):

    def test_remove_mr(self):
        df = pd.DataFrame({'Full Name': ['Mr. John Smith','mr. John Smith','Mr John Smith','mr John Smith', 'John Smith']})
        df_modified = remove_titles(df, 'Full Name')
        expected_names = ['John Smith', 'John Smith', 'John Smith', 'John Smith', 'John Smith']
        self.assertEqual(list(df_modified['Full Name']), expected_names)

    def test_remove_mrs(self):
        df = pd.DataFrame({'Full Name': ['Mrs. Jane Doe','Mrs Jane Doe','mrs. Jane Doe','mrs Jane Doe', 'Jane Doe']})
        df_modified = remove_titles(df, 'Full Name')
        expected_names = ['Jane Doe', 'Jane Doe', 'Jane Doe', 'Jane Doe', 'Jane Doe']
        self.assertEqual(list(df_modified['Full Name']), expected_names)

    def test_remove_ms(self):
        df = pd.DataFrame({'Full Name': ['Ms. Fonda Lee', 'Ms Fonda Lee','ms. Fonda Lee','ms Fonda Lee','Fonda Lee']})
        df_modified = remove_titles(df, 'Full Name')
        expected_names = ['Fonda Lee','Fonda Lee','Fonda Lee','Fonda Lee','Fonda Lee']
        self.assertEqual(list(df_modified['Full Name']), expected_names)

    def test_remove_dr(self):
        df = pd.DataFrame({'Full Name': ['Dr. Gregory House','Dr Gregory House','dr. Gregory House','dr Gregory House','Gregory House']})
        df_modified = remove_titles(df, 'Full Name')
        expected_names = ['Gregory House','Gregory House','Gregory House','Gregory House','Gregory House']
        self.assertEqual(list(df_modified['Full Name']), expected_names)

    def test_multiple_titles_not_present(self):
        df = pd.DataFrame({'Full Name': ['John Smith', 'Jane Doe']})
        df_modified = remove_titles(df, 'Full Name')
        expected_names = ['John Smith', 'Jane Doe']
        self.assertEqual(list(df_modified['Full Name']), expected_names)

    def test_empty_dataframe(self):
        df = pd.DataFrame({'Full Name': []})
        df_modified = remove_titles(df, 'Full Name')
        expected_names = []
        self.assertEqual(list(df_modified['Full Name']), expected_names)

    def test_names_with_dots_not_titles(self):
        df = pd.DataFrame({'Full Name': ['J.R.R. Tolkien', 'G.R.R. Martin']})
        df_modified = remove_titles(df, 'Full Name')
        expected_names = ['J.R.R. Tolkien', 'G.R.R. Martin']
        self.assertEqual(list(df_modified['Full Name']), expected_names)

if __name__ == '__main__':
    unittest.main(argv=['first-arg-is-ignored'], exit=False)

............
----------------------------------------------------------------------
Ran 12 tests in 0.026s

OK
