In [1]:
# from modules.answerkey import validate_inputs_df as vid
# from modules.answerkey import insert_df_to_db as idb

import numpy as np
import pandas as pd
import sqlite3
import re

from numpy.testing import assert_equal

## Scenario:
***

At a startup where data practices are still immature and no data-encoding standard exists, you have been hired as a Data Management Associate. As a new hire, you have been tasked with writing a function that verifies the input data in an Excel file that a data encoder populates manually. The file contains Name, Age, Date of Birth, Social Security Number, Tax Identification Number, and Driver's License.

`DID YOU KNOW?` ***Even the most forgiving statistics in data entry work reveal an average human error rate of 1%. In the context of sales operations, that means that if your CSR is tasked with processing 1000 orders, you can expect 10 to be incorrect*** - [Conexiom](https://conexiom.com/blog/understanding-impact-100-percent-order-accuracy/#:~:text=Even%20the%20most%20forgiving%20statistics,expect%2010%20to%20be%20incorrect).

With this in mind, you drafted format and data-governance conditions to make sure the input from the Excel file is correct before it is ingested into your new database. Below are the specifications:
```
- Name (Condition: No special characters | ,.!@#$%^&*() are not allowed)
- Age (Must be a valid Age)
- Date_of_Birth (Format: DD/MM/YY)
- Social_Security_Number (Format: XX-XXXXXXX-X)
- Tax_Identification_Number (Format: XXX-XXX-XXX-XXX)
- Drivers_License (Format: CXX-XX-XXXXXX)

X = digit
C = character
```
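One caveat worth keeping in mind with the two-digit year in `DD/MM/YY`: Python's `%y` directive follows the POSIX convention, mapping 69-99 to 1969-1999 and 00-68 to 2000-2068. The sketch below illustrates this with the standard library's `datetime.strptime` and made-up dates; the notebook itself uses `pd.to_datetime`, which is assumed here (not verified) to apply the same pivot.

```python
from datetime import datetime

# Made-up sample dates on either side of the two-digit-year pivot.
for text in ("15/06/68", "15/06/69"):
    parsed = datetime.strptime(text, "%d/%m/%y")
    print(text, "->", parsed.year)
# 15/06/68 -> 2068
# 15/06/69 -> 1969
```

A date of birth that parses to a future year passes the format check, so comparing the parsed year against the current date may be a useful extra rule.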

#### Your task:

Create a Python function called `validate_inputs_df` to verify the input DataFrame and another function called `insert_df_to_db` to save its contents to an SQLite database if they conform to your drafted specifications. For verification, the input file holds 25 account-opening records, of which only 20 conform to your standards. GOOD LUCK!

In [2]:
df_account_open = pd.read_csv('account_open_data.csv')

print(f"There are {len(df_account_open)} accounts for database ingestion.")

There are 25 accounts for database ingestion.


In [3]:
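def validate_inputs_df(df):
    """Check each row against the format rules; return (index, valid, message) tuples."""
    validation_results = []
    for index, row in df.iterrows():
        valid = True
        message = 'Inputs are valid'
        # Note: a later failed check overwrites the message of an earlier one.

        # Name: letters, digits, and spaces only
        if not re.match(r"^[a-zA-Z0-9 ]*$", row['Name']):
            valid = False
            message = "Invalid name format"

        # Age: a positive integer (np.integer covers NumPy integer scalars)
        if not isinstance(row['Age'], (int, np.integer)) or row['Age'] <= 0:
            valid = False
            message = "Invalid age format"

        # Date of birth: DD/MM/YY
        try:
            pd.to_datetime(row['Date_of_Birth'], format='%d/%m/%y')
        except ValueError:
            valid = False
            message = "Invalid date of birth format"

        # Social Security Number: XX-XXXXXXX-X (raw strings keep \d from being read as an escape)
        if not re.match(r"^\d{2}-\d{7}-\d$", row['Social_Security_Number']):
            valid = False
            message = "Invalid social security number format"

        # Tax Identification Number: XXX-XXX-XXX-XXX
        if not re.match(r"^\d{3}-\d{3}-\d{3}-\d{3}$", row['Tax_Identification_Number']):
            valid = False
            message = "Invalid tax identification number format"

        # Driver's license: CXX-XX-XXXXXX
        if not re.match(r"^[a-zA-Z]\d{2}-\d{2}-\d{6}$", row['Drivers_License']):
            valid = False
            message = "Invalid driver's license format"

        validation_results.append((index, valid, message))

    return validation_results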

In [4]:
validation_results = validate_inputs_df(df_account_open)

validation_results

[(0, True, 'Inputs are valid'),
 (1, True, 'Inputs are valid'),
 (2, True, 'Inputs are valid'),
 (3, True, 'Inputs are valid'),
 (4, True, 'Inputs are valid'),
 (5, True, 'Inputs are valid'),
 (6, True, 'Inputs are valid'),
 (7, True, 'Inputs are valid'),
 (8, True, 'Inputs are valid'),
 (9, True, 'Inputs are valid'),
 (10, True, 'Inputs are valid'),
 (11, True, 'Inputs are valid'),
 (12, True, 'Inputs are valid'),
 (13, True, 'Inputs are valid'),
 (14, True, 'Inputs are valid'),
 (15, False, 'Invalid name format'),
 (16, True, 'Inputs are valid'),
 (17, True, 'Inputs are valid'),
 (18, True, 'Inputs are valid'),
 (19, True, 'Inputs are valid'),
 (20, False, "Invalid driver's license format"),
 (21, False, 'Invalid age format'),
 (22, True, 'Inputs are valid'),
 (23, False, 'Invalid social security number format'),
 (24, False, 'Invalid tax identification number format')]
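
Each failing row above reports a single message because `validate_inputs_df` keeps only the last failed check, so a row with several problems would show just one of them. Below is a minimal sketch of one alternative that collects every failure, assuming the same column names. `CHECKS` and `collect_row_errors` are hypothetical names that are not part of the solution above, and only the regex-checked fields are covered.

```python
import re

# Hypothetical pattern table; the patterns mirror the ones in validate_inputs_df.
CHECKS = {
    "Name": (r"^[a-zA-Z0-9 ]*$", "Invalid name format"),
    "Social_Security_Number": (r"^\d{2}-\d{7}-\d$", "Invalid social security number format"),
    "Tax_Identification_Number": (r"^\d{3}-\d{3}-\d{3}-\d{3}$", "Invalid tax identification number format"),
    "Drivers_License": (r"^[a-zA-Z]\d{2}-\d{2}-\d{6}$", "Invalid driver's license format"),
}

def collect_row_errors(row):
    """Return every failed check for one row (a dict or Series) as a list of messages."""
    errors = []
    for column, (pattern, message) in CHECKS.items():
        value = row[column]
        # Non-string cells (e.g. NaN from an empty cell) count as failures.
        if not isinstance(value, str) or not re.match(pattern, value):
            errors.append(message)
    return errors

# Made-up sample row with two problems: a special character and a short TIN.
sample = {
    "Name": "Jane D@e",
    "Social_Security_Number": "12-3456789-0",
    "Tax_Identification_Number": "123-456-789",
    "Drivers_License": "A12-34-567890",
}
print(collect_row_errors(sample))
# ['Invalid name format', 'Invalid tax identification number format']
```

Returning a list keeps all the feedback for the encoder in one pass instead of surfacing one error per correction cycle.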

In [5]:
def insert_df_to_db(df, validation_results):
    conn = sqlite3.connect('df_account_open.db')
    c = conn.cursor()

    c.execute('''CREATE TABLE IF NOT EXISTS accounts
                 (id INTEGER PRIMARY KEY, Name TEXT, Age INTEGER, Date_of_Birth TEXT, 
                 Social_Security_Number TEXT, Tax_Identification_Number TEXT, Drivers_License TEXT)''')

    for index, valid, message in validation_results:
        if valid:
            row = df.iloc[index]
            c.execute('''INSERT INTO accounts (Name, Age, Date_of_Birth, Social_Security_Number, 
                         Tax_Identification_Number, Drivers_License) VALUES (?, ?, ?, ?, ?, ?)''',
                      (row['Name'], row['Age'], row['Date_of_Birth'], row['Social_Security_Number'], 
                       row['Tax_Identification_Number'], row['Drivers_License']))
            conn.commit()
        else:
            print(f"Row {index}: Validation failed - {message}")

    conn.close()

In [6]:
insert_df_to_db(df_account_open, validation_results)

Row 15: Validation failed - Invalid name format
Row 20: Validation failed - Invalid driver's license format
Row 21: Validation failed - Invalid age format
Row 23: Validation failed - Invalid social security number format
Row 24: Validation failed - Invalid tax identification number format


<center><b>Based on the results above, the code prints validation failures, but it still does not handle errors that may occur during the database insertion process.</b></center>

In [7]:
def insert_df_to_db(df, validation_results):
    conn = sqlite3.connect('df_account_open.db')
    c = conn.cursor()

    c.execute('''CREATE TABLE IF NOT EXISTS accounts
                 (id INTEGER PRIMARY KEY, Name TEXT, Age INTEGER, Date_of_Birth TEXT,
                 Social_Security_Number TEXT, Tax_Identification_Number TEXT, Drivers_License TEXT)''')

    for index, valid, message in validation_results:
        if valid:
            try:
                row = df.iloc[index]
                c.execute('''INSERT INTO accounts (Name, Age, Date_of_Birth, Social_Security_Number,
                             Tax_Identification_Number, Drivers_License) VALUES (?, ?, ?, ?, ?, ?)''',
                          (row['Name'], row['Age'], row['Date_of_Birth'], row['Social_Security_Number'],
                           row['Tax_Identification_Number'], row['Drivers_License']))
                conn.commit()
            except sqlite3.Error as e:
                print(f"Row {index} insertion failed: {e}")
    conn.close()

The ***insert_df_to_db*** function now wraps each row's insertion in a ***try-except*** block that catches ***sqlite3.Error*** exceptions, so a single failing row no longer stops the remaining rows from being processed. Inside the ***except*** block, the code prints the row index together with the error message from the database (***e***), which identifies the exact row that caused the issue and the nature of the problem. Note that the ***try*** block covers only the per-row insert and commit: errors raised while opening the connection or creating the table are still not caught, and the connection is not closed if any other kind of exception occurs. This version also no longer prints the rows that failed validation. Overall, the revised code avoids crashes caused by insertion errors and leaves enough information for debugging and corrective action when an insert fails.
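
As a further hedge, the connection itself can be managed so that it is always closed and a failed batch is rolled back. Below is a minimal sketch, assuming the same `accounts` table. `insert_valid_rows` is a hypothetical name, the rows are passed as plain tuples rather than a DataFrame, and treating the batch as one all-or-nothing transaction is a design assumption that differs from the per-row commit used above.

```python
import sqlite3
from contextlib import closing

CREATE_SQL = """CREATE TABLE IF NOT EXISTS accounts
    (id INTEGER PRIMARY KEY, Name TEXT, Age INTEGER, Date_of_Birth TEXT,
     Social_Security_Number TEXT, Tax_Identification_Number TEXT, Drivers_License TEXT)"""

INSERT_SQL = """INSERT INTO accounts (Name, Age, Date_of_Birth, Social_Security_Number,
    Tax_Identification_Number, Drivers_License) VALUES (?, ?, ?, ?, ?, ?)"""

def insert_valid_rows(rows, db_path=":memory:"):
    """Insert an iterable of 6-tuples; return the number of rows stored, or 0 on error."""
    rows = list(rows)
    # closing() guarantees conn.close() even if an exception escapes.
    with closing(sqlite3.connect(db_path)) as conn:
        try:
            # "with conn" commits on success and rolls back if an exception is raised.
            with conn:
                conn.execute(CREATE_SQL)
                conn.executemany(INSERT_SQL, rows)
        except sqlite3.Error as e:
            print(f"Batch insertion failed and was rolled back: {e}")
            return 0
        return len(rows)

# Made-up sample row; the default in-memory database is discarded on close.
print(insert_valid_rows([("Jane Doe", 30, "01/02/93", "12-3456789-0",
                          "123-456-789-012", "A12-34-567890")]))
```

Whether a whole batch should fail together or row by row depends on how the encoder's corrections are fed back in; the per-row approach above keeps good rows even when one insert fails.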

In [8]:
insert_df_to_db(df_account_open, validation_results)
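
No output is shown for this call, which is what one would expect when every valid row inserts without error. One thing to watch: the `accounts` table has no uniqueness constraint, so running the insertion again against the same database file appends the same rows a second time. A small sketch of checking for that with row counts, using an in-memory database and made-up rows as stand-ins; `count_accounts` is a hypothetical helper and the table is trimmed to two columns for brevity.

```python
import sqlite3

def count_accounts(conn):
    """Return the number of rows currently in the accounts table."""
    return conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]

conn = sqlite3.connect(":memory:")  # stand-in for df_account_open.db
conn.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY, Name TEXT, Age INTEGER)")

rows = [("Jane Doe", 30), ("John Roe", 41)]  # made-up sample rows
for _ in range(2):  # simulate running the insertion cell twice
    conn.executemany("INSERT INTO accounts (Name, Age) VALUES (?, ?)", rows)
    conn.commit()

total = count_accounts(conn)
distinct = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT Name, Age FROM accounts)"
).fetchone()[0]
print(total, distinct)  # 4 2
conn.close()
```

A total larger than the distinct count signals duplicated rows from a repeated run.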

*Follow-up question: How do you address the invalid data inputs? Do you hire a new data encoder? If yes, why? If no, how are you going to talk to the data encoder? Provide your answer in this markdown cell.*

It's evident that the data encoder is responsible for inputting the data into the Excel file, and the validation process has identified several discrepancies from the specified format. However, hiring a new data encoder might not be necessary right away. Instead, it's essential to approach the situation methodically to understand the root causes of the validation errors and work collaboratively with the current data encoder to address them. 

**[1]** Provide the encoder with specific details about the errors identified during data validation. Share the `validation_results`, highlighting which rows have invalid data and the corresponding error messages, such as "Invalid name format" or "Invalid social security number format". This focused feedback allows them to pinpoint areas for improvement and learn from their mistakes.

**[2]** Discuss data entry procedures with the encoder. Emphasize the importance of following the specified format for each field (Name, Age, Date of Birth, etc.). Create a clear reference sheet or "cheat sheet" that summarizes the format requirements for use during data entry.

**[3]**  If the errors are consistent or indicate a knowledge gap, provide additional training. This could be a short training session on the specific data requirements or providing relevant resources to solidify their understanding. 

**[4]** Continued monitoring of validation results for future data uploads is crucial. This ongoing assessment helps gauge the effectiveness of implemented solutions and identifies any further areas for improvement or support needed by the data encoder.

Overall, collaboration with the current data encoder can lead to improvements in data quality without the immediate need for replacement. Providing clear feedback, improving data entry procedures, and offering additional training as necessary are key strategies to enhance accuracy and ensure data integrity over time.
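
The feedback described in step [1] could be prepared as a short report that lists only the failing rows. Below is a minimal sketch using the standard library, assuming the `(index, valid, message)` tuples produced by `validate_inputs_df`; `build_feedback_csv` is a hypothetical helper and the sample results are made up.

```python
import csv
import io

def build_feedback_csv(validation_results):
    """Return CSV text listing only the rows that failed validation."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["row", "message"])
    for index, valid, message in validation_results:
        if not valid:
            writer.writerow([index, message])
    return buffer.getvalue()

# Made-up sample in the same shape as validation_results.
sample_results = [
    (0, True, "Inputs are valid"),
    (15, False, "Invalid name format"),
    (21, False, "Invalid age format"),
]
print(build_feedback_csv(sample_results))
```

The returned text can be written to a file and shared with the encoder alongside the format cheat sheet.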

***
***