# 01 - Data Loading and Initial Inspection

## Notebook Objective:

In this notebook, we will lay the foundation for our "Consolidated Invoicing, Engine, and Exception Analytics" project by:

1.  **Loading the Raw, Messy Dataset:** We will import the `mock_dataset.csv`, which simulates real-world enterprise data with intentional data quality issues.
2.  **Exploring the Dataset:** We will conduct an initial review of the data's structure, dimensions, and contents using various Pandas functions.
3.  **Identifying Data Quality Issues:** This is a critical step for any analyst. We will systematically identify and document the inconsistencies, missing values, incorrect data types, and outliers present in the raw data. These findings will guide our subsequent cleaning efforts.

## Loading the dataset

Import necessary libraries.

In [4]:
# Import Libraries
import pandas as pd
import numpy as np

Define file path

In [46]:
file_path = r"C:\Users\Prashanth\OneDrive\ドキュメント\Portfolio_Projects_DA\Consolidated Invoicing, Engine, and Exception Analytics\data\mock_dataset.csv"

Load the dataset

In [47]:
try:
    df = pd.read_csv(file_path)
    print("dataset loaded successfully.")
    print(f"Initial rows: {len(df)}")
except FileNotFoundError:
    print(f"Error: The file '{file_path}' was not found.")
    df = pd.DataFrame() 

dataset loaded successfully.
Initial rows: 10113
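
The hard-coded Windows path above resolves on one machine only. The following is a minimal sketch of a more portable loading step. The helper name `load_csv_or_empty` and the throwaway demo file are assumptions made for illustration and are not part of the notebook. The helper keeps the notebook's fallback of returning an empty DataFrame when the file is missing.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_csv_or_empty(path):
    """Return the CSV at `path` as a DataFrame, or an empty DataFrame if it is missing."""
    path = Path(path)
    if not path.is_file():
        print(f"Error: The file '{path}' was not found.")
        return pd.DataFrame()
    loaded = pd.read_csv(path)
    print(f"Loaded {len(loaded)} rows and {loaded.shape[1]} columns.")
    return loaded


# Demo with a temporary file standing in for mock_dataset.csv (hypothetical content)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.csv"
    demo_path.write_text("REQUEST ID,Status of Request\n1,COMPLETED\n2, PENDING\n")
    demo_df = load_csv_or_empty(demo_path)
    missing_df = load_csv_or_empty(Path(tmp_dir) / "does_not_exist.csv")
```

In the notebook itself, the same helper could be called with a path built relative to the project folder, for example `Path("data") / "mock_dataset.csv"`, assuming the notebook runs from the project root.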


## Initial Data Inspection 

Display the First 5 rows of the dataset

In [21]:
print("First 5 Rows of Raw Data ---")
df.head()

First 5 Rows of Raw Data ---


| | Count of Invoice | Type of Request | REQUEST ID | CREATED BY | CREATED ON | BRANCH NAME | SUPPLIER CODE | NAME | COUNTRY | CURRENCY | ... | QC Status | Audited By | Auditor Comments | Ageing-SLA-FPY | Ageing-SLA-2 | Ageing of Re-work Days | Ageing of Re-work | Month | Unnamed: 28 | Amount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | Check Request | 960900 | Allison Hill | 2023-10-03 00:00:00 | 1224 - Romerochester | Zs84Pp-281 | Brown PLC | United States | USD | ... | Not audited | | | 0.0 | | | | Oct-23 | 1.0 | |
| 1 | 1.0 | Check Request | 906015 | Noah Rhodes | 2023-10-04 00:00:00 | 7027 - Websterton | NK51DM-941 | Frye-Welch | Canada | CAD | ... | Not audited | | | 24.0 | | | | Oct-23 | 1.0 | |
| 2 | 1.0 | Check Request | 172579 | Angie Henderson | 2023-10-04 00:00:00 | 2305 - Emilyfort | bQ81gM-763 | Harris, Ruiz and Suarez | Canada | CAD | ... | Not audited | | | 24.0 | | | | Oct-23 | 1.0 | |
| 3 | 1.0 | Check Request | 831527 | Daniel Wagner | 2023-10-04 00:00:00 | 9392 - West Vincent | Uh11BS-543 | Duncan-Elliott | Canada | CAD | ... | Not audited | | | 24.0 | | | | Oct-23 | 1.0 | |
| 4 | 1.0 | Check Request | 393640 | Cristian Santos | 2023-10-04 00:00:00 | 7433 - Atkinsshire | at70RJ-141 | Kelley-Hernandez | United States | USD | ... | Pass | Prashanth | Passed | 24.0 | | | | Oct-23 | 1.0 | |


Display the Last 5 rows of the dataset

In [22]:
print("Last 5 Rows of Raw Data ---")
df.tail()

Last 5 Rows of Raw Data ---


| | Count of Invoice | Type of Request | REQUEST ID | CREATED BY | CREATED ON | BRANCH NAME | SUPPLIER CODE | NAME | COUNTRY | CURRENCY | ... | QC Status | Audited By | Auditor Comments | Ageing-SLA-FPY | Ageing-SLA-2 | Ageing of Re-work Days | Ageing of Re-work | Month | Unnamed: 28 | Amount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10108 | 1.0 | Union | 906366 | Francisco Johns | 2024-03-21 00:00:00 | 8340 - wilsontown | Mw16Rx-242 | Chavez-Snow | United States | USD | ... | Not audited | | | 0.0 | | | | Mar-24 | 1.0 | |
| 10109 | 1.0 | Union | 657182 | Kathleen Garcia | 2024-12-17 00:00:00 | 5664 - lake gloriashire | dp86An-899 | Anderson-Newton | United States dollar | USD | ... | Not Audited | | | 0.0 | | | | Dec-24 | 1.0 | |
| 10110 | 1.0 | Union | 246772 | Richard Yoder | 2024-09-30 00:00:00 | 7899 - Mckenziemouth | Tr54Yf-635 | Scott, Hancock and Gordon | United States | USD | ... | Not Audited | | | 0.0 | | | | Sep-24 | 1.0 | |
| 10111 | 1.0 | Check Request | 676695 | Charles Dougherty | 2025-05-28 00:00:00 | 5114 - Rosariostad | tx17xp-597 | Gomez Group | United States | USD | ... | | | | 0.0 | | | | May-25 | 1.0 | |
| 10112 | 1.0 | ENGIE | 783785 | Joshua Fowler | 2024-12-03 00:00:00 | 8539 - Port Michaelville | XE46JN-456 | Johnson, Graham and Fuentes | United States | USD | ... | Not Audited | | | 0.0 | | | | Dec-24 | 1.0 | |


Display the data overview

In [25]:
print("Dataset Overview (df.info()) ---")
df.info()

Dataset Overview (df.info()) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10113 entries, 0 to 10112
Data columns (total 30 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Count of Invoice                 10110 non-null  float64
 1   Type of Request                  10113 non-null  object 
 2   REQUEST ID                       10113 non-null  int64  
 3   CREATED BY                       10113 non-null  object 
 4   CREATED ON                       10113 non-null  object 
 5   BRANCH NAME                      10113 non-null  object 
 6   SUPPLIER CODE                    10113 non-null  object 
 7   NAME                             10113 non-null  object 
 8   COUNTRY                          10113 non-null  object 
 9   CURRENCY                         10113 non-null  object 
 10  TYPE OF REQUEST                  10113 non-null  object 
 11  Amount                           10105 non-null

Display the Basic Statistics of Data

In [26]:
print("Basic Statistics of Data (df.describe(include='all')) ---")
df.describe(include='all')

Basic Statistics of Data (df.describe(include='all')) ---


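| | Count of Invoice | Type of Request | REQUEST ID | CREATED BY | CREATED ON | BRANCH NAME | SUPPLIER CODE | NAME | COUNTRY | CURRENCY | ... | QC Status | Audited By | Auditor Comments | Ageing-SLA-FPY | Ageing-SLA-2 | Ageing of Re-work Days | Ageing of Re-work | Month | Unnamed: 28 | Amount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 10110.0 | 10113 | 10113.0 | 10113 | 10113 | 10113 | 10113 | 10113 | 10113 | 10113 | ... | 8913 | 3892 | 3896 | 9781.0 | 0.0 | 0.0 | 0.0 | 9917 | 9629.0 | 0.0 |
| unique | | 7 | | 9376 | 556 | 10013 | 10013 | 8697 | 17 | 8 | ... | 3 | 1 | 2 | | | | | 22 | | |
| top | | Check Request | | Jennifer Smith | 2025-04-10 00:00:00 | 8004 - Krystalfort | KS24Oy-304 | Smith Group | United States | USD | ... | Pass | Prashanth | Passed | | | | | Jul-24 | | |
| freq | | 4730 | | 7 | 63 | 2 | 2 | 18 | 8326 | 8354 | ... | 3892 | 3892 | 3892 | | | | | 580 | | |
| mean | 1.183482 | | 552474.529714 | | | | | | | | ... | | | | 0.528167 | | | | | 0.946827 | |
| std | 2.38231 | | 260185.794419 | | | | | | | | ... | | | | 5.564104 | | | | | 0.224389 | |
| min | 1.0 | | 100144.0 | | | | | | | | ... | | | | -263.0 | | | | | 0.0 | |
| 25% | 1.0 | | 328217.0 | | | | | | | | ... | | | | 0.0 | | | | | 1.0 | |
| 50% | 1.0 | | 551007.0 | | | | | | | | ... | | | | 0.0 | | | | | 1.0 | |
| 75% | 1.0 | | 777251.0 | | | | | | | | ... | | | | 0.0 | | | | | 1.0 | |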


Display the Missing Values in the Dataset

In [28]:
print("Missing Values in Raw Data ---")
df.isnull().sum()

Missing Values in Raw Data ---


Count of Invoice                       3
Type of Request                        0
REQUEST ID                             0
CREATED BY                             0
CREATED ON                             0
BRANCH NAME                            0
SUPPLIER CODE                          0
NAME                                   0
COUNTRY                                0
CURRENCY                               0
TYPE OF REQUEST                        0
Amount                                 8
TASK START                             0
Actioned Date                          0
Request Received Stage               303
Request received date                  0
Completed Date                        64
Status of Request                     54
Pending Reason                      8272
Pending With Approver/Requestor     8905
QC Status                           1200
Audited By                          6221
Auditor Comments                    6217
Ageing-SLA-FPY                       332
Ageing-SLA-2    
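
Raw counts are easier to prioritise when shown next to each column's share of missing rows. The sketch below is one possible way to tabulate both. The helper name `missing_summary` and the toy frame are illustrative assumptions and do not come from the notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        # max(..., 1) avoids dividing by zero on an empty frame
        "pct_missing": (counts / max(len(frame), 1) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data: column names borrowed from the dataset, values invented for the demo
toy = pd.DataFrame({
    "Pending Reason": [np.nan, np.nan, np.nan, np.nan],
    "Completed Date": ["2024-01-02", np.nan, "2024-01-05", "2024-01-09"],
    "REQUEST ID": [1, 2, 3, 4],
})
toy_summary = missing_summary(toy)
print(toy_summary)
```

On the real data the call would be `missing_summary(df)`.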

Check for whitespace issues in 'Status of Request'

In [29]:
if 'Status of Request' in df.columns:
    print("Unique 'Status of Request' values (checking for whitespace) ---")
    # Use a lambda to check for leading/trailing spaces
    whitespace_issues = df['Status of Request'].dropna().apply(lambda x: x != x.strip()).sum()
    print(f"Number of 'Status of Request' values with leading/trailing whitespace: {whitespace_issues}")
    print(df['Status of Request'].value_counts(dropna=False))


Unique 'Status of Request' values (checking for whitespace) ---
Number of 'Status of Request' values with leading/trailing whitespace: 1003
Status of Request
COMPLETED             8988
 COMPLETED             997
NaN                     54
Completed               22
rejected                13
PENDING                 12
pending                 11
Pending                  6
Back to originator       3
 PENDING                 2
 pending                 2
 Completed               2
cOMPLETED                1
Name: count, dtype: int64
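
The output mixes two separate problems: stray whitespace and inconsistent casing. As a rough sketch, the check can be written with vectorised string methods, and a preview can show how many categories would remain once both are normalised. The toy Series below mirrors spellings seen in the output, and upper-casing is only one possible normalisation, not a decision the notebook has made.

```python
import pandas as pd

# Toy values mirroring the spellings seen in the output above
status = pd.Series(
    ["COMPLETED", " COMPLETED", "Completed", "cOMPLETED", "pending", " PENDING", None],
    name="Status of Request",
)

non_null = status.dropna()
# Vectorised equivalent of the lambda: True where stripping changes the value
has_whitespace = non_null.ne(non_null.str.strip())
print(f"Values with leading/trailing whitespace: {has_whitespace.sum()}")

# Preview only: categories that would remain after trimming and upper-casing
normalized = status.str.strip().str.upper()
print(normalized.value_counts(dropna=False))
```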


Check for case inconsistency in 'BRANCH NAME'

In [31]:
if 'BRANCH NAME' in df.columns:
    print("Unique 'BRANCH NAME' values (checking for casing) ---")
    # Show the 5 most frequent values; inconsistent casing (e.g. lowercase 'new kellyview') is visible in the entries
    print(df['BRANCH NAME'].value_counts(dropna=False).head(5))


Unique 'BRANCH NAME' values (checking for casing) ---
BRANCH NAME
8004 - Krystalfort      2
2158 - North Scott      2
2984 - Annaview         2
3390 - Henrytown        2
2714 - new kellyview    2
Name: count, dtype: int64
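
The five most frequent values give only a hint of casing problems. One way to test for them directly is to group the raw values by a trimmed, case-folded key and count the distinct spellings in each group. This is a sketch under assumptions: `casing_variants` is a hypothetical helper, and the capitalised counterparts in the toy data are invented for the demo.

```python
import pandas as pd


def casing_variants(values):
    """Number of distinct raw spellings per case-folded value, where more than one exists."""
    values = values.dropna().astype(str)
    key = values.str.strip().str.casefold()
    spellings = values.groupby(key).nunique()
    return spellings[spellings > 1]


# Toy data: lowercase entries resemble the output above, the capitalised twins are made up
branches = pd.Series(
    [
        "8340 - wilsontown",
        "8340 - Wilsontown",
        "2714 - new kellyview",
        "2714 - New Kellyview",
        "8004 - Krystalfort",
    ],
    name="BRANCH NAME",
)
variants = casing_variants(branches)
print(variants)
```

An empty result from `casing_variants(df['BRANCH NAME'])` would suggest that casing alone does not split any branch into several entries.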


Check for date format corruption in 'CREATED ON'


In [35]:
if 'CREATED ON' in df.columns:
    print("Sample of 'CREATED ON' values (checking for mixed formats) ---")
    # Print a few samples to manually inspect formats
    print(df['CREATED ON'].sample(min(10, len(df))).tolist())

Sample of 'CREATED ON' values (checking for mixed formats) ---
['2024-11-21 00:00:00', '2024-01-26 00:00:00', '2025-01-03 00:00:00', '2025-01-13 00:00:00', '2024-12-16 00:00:00', '2025-01-28 00:00:00', '2024-06-05 00:00:00', '2025-02-20 00:00:00', '2025-03-13 00:00:00', '2024-05-09 00:00:00']
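
A random sample of 10 values cannot rule out mixed formats across more than 10,000 rows. A fuller check, sketched here, parses every value against the format seen in the sample and counts those that fail. The toy Series and the choice of `%Y-%m-%d %H:%M:%S` as the expected format are assumptions based on the sampled values only.

```python
import pandas as pd

# Toy data: one value in the sampled format, two deliberately malformed, one missing
created_on = pd.Series(
    ["2024-11-21 00:00:00", "21/11/2024", "not a date", None],
    name="CREATED ON",
)

expected_format = "%Y-%m-%d %H:%M:%S"
parsed = pd.to_datetime(created_on, format=expected_format, errors="coerce")

# Values that were present but did not parse (already-missing values are excluded)
unparseable = parsed.isna() & created_on.notna()
print(f"Values not matching {expected_format}: {unparseable.sum()}")
print(created_on[unparseable].tolist())
```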


Check the ' Amount ' column (missing values and potential non-numeric strings). The name includes a leading and a trailing space, so it is a different column from 'Amount'.

In [42]:
# The column name contains a leading and a trailing space; store it once for consistent reuse
amount_col = ' Amount '

if amount_col in df.columns: 
    print(f"'{amount_col}' column details ---")
    print(f"Data type: {df[amount_col].dtype}")
    print(f"Number of NaN values: {df[amount_col].isnull().sum()}")

    # Attempt to convert to numeric, coercing errors to NaN
    s_numeric_attempt = pd.to_numeric(df[amount_col], errors='coerce')

    # Identify values that are non-numeric AND were not already NaN
    non_numeric_but_not_nan = s_numeric_attempt.isnull() & df[amount_col].notnull()
    num_non_numeric_values = non_numeric_but_not_nan.sum()

    if num_non_numeric_values > 0:
        print(f"Found {num_non_numeric_values} values in '{amount_col}' that are not numbers.")
        print("Sample of non-numeric values (will become NaN on conversion):")
        print(df.loc[non_numeric_but_not_nan, amount_col].sample(min(5, num_non_numeric_values)).tolist())

' Amount ' column details ---
Data type: float64
Number of NaN values: 10113


Check for 'Ageing-SLA-FPY' (negative values)

In [43]:
if 'Ageing-SLA-FPY' in df.columns:
    print(f"\nMin 'Ageing-SLA-FPY': {df['Ageing-SLA-FPY'].min()}")
    print(f"Max 'Ageing-SLA-FPY': {df['Ageing-SLA-FPY'].max()}")
    print(f"Count of negative 'Ageing-SLA-FPY' values: {(df['Ageing-SLA-FPY'] < 0).sum()}")


Min 'Ageing-SLA-FPY': -263.0
Max 'Ageing-SLA-FPY': 96.0
Count of negative 'Ageing-SLA-FPY' values: 75
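
To judge how far the negative values spread, it can help to look at their share of the non-null rows and at the most extreme cases. The following is a small sketch on invented values; only the column name and the minimum and maximum come from the output above.

```python
import numpy as np
import pandas as pd

# Toy values: -263.0 and 96.0 match the reported min and max, the rest are invented
ageing = pd.Series([0.0, 24.0, -263.0, 96.0, -5.0, np.nan], name="Ageing-SLA-FPY")

negative_mask = ageing < 0
share_negative = negative_mask.sum() / ageing.notna().sum()
print(f"Negative values: {negative_mask.sum()} ({share_negative:.1%} of non-null rows)")

# The most extreme negative values, smallest first
most_negative = ageing[negative_mask].nsmallest(5)
print(most_negative)
```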


Check the columns that appeared entirely empty or otherwise problematic in the earlier describe() output

In [44]:
cols_to_check_empty = ['Ageing-SLA-2', 'Ageing of Re-work Days', 'Ageing of Re-work', 'Unnamed: 28']
for col in cols_to_check_empty:
    if col in df.columns:
        if df[col].isnull().all():
            print(f"\nColumn '{col}' appears to be entirely empty (all NaN values).")
        else:
            # Not entirely empty: show the most frequent values, including NaN
            print(f"\nColumn '{col}' (not entirely empty/meaningless):")
            print(df[col].value_counts(dropna=False).head())


Column 'Ageing-SLA-2' appears to be entirely empty (all NaN values).

Column 'Ageing of Re-work Days' appears to be entirely empty (all NaN values).

Column 'Ageing of Re-work' appears to be entirely empty (all NaN values).

Column 'Unnamed: 28' (not entirely empty/meaningless):
Unnamed: 28
1.0    9117
0.0     512
NaN     484
Name: count, dtype: int64
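
The list of columns to check was assembled by hand from the describe() output. As a sketch, the same question can be asked of every column at once, so that no fully empty column is overlooked. The toy frame reuses column names from the dataset with invented values.

```python
import numpy as np
import pandas as pd

# Toy frame: one fully empty column, two partially filled ones
toy = pd.DataFrame({
    "Ageing-SLA-2": [np.nan, np.nan, np.nan],
    "Unnamed: 28": [1.0, 0.0, np.nan],
    "Month": ["Oct-23", "Oct-23", None],
})

# True for each column in which every value is missing
all_null = toy.isnull().all()
empty_columns = all_null[all_null].index.tolist()
print(f"Entirely empty columns: {empty_columns}")
```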


Display the Columns

In [45]:
df.columns

Index(['Count of Invoice', 'Type of Request', 'REQUEST ID', 'CREATED BY',
       'CREATED ON', 'BRANCH NAME', 'SUPPLIER CODE', 'NAME', 'COUNTRY',
       'CURRENCY', 'TYPE OF REQUEST', 'Amount', 'TASK START', 'Actioned Date',
       'Request Received Stage', 'Request received date', 'Completed Date',
       'Status of Request', 'Pending Reason',
       'Pending With Approver/Requestor', 'QC Status', 'Audited By',
       'Auditor Comments', 'Ageing-SLA-FPY', 'Ageing-SLA-2',
       'Ageing of Re-work Days', 'Ageing of Re-work', 'Month', 'Unnamed: 28',
       ' Amount '],
      dtype='object')
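
The column list contains names that differ only in casing or surrounding whitespace ('Type of Request' and 'TYPE OF REQUEST', 'Amount' and ' Amount '). A short sketch for detecting such near-duplicates follows. The helper name `colliding_column_names` is an assumption, and the demo uses a subset of the column names printed above.

```python
import pandas as pd


def colliding_column_names(columns):
    """Group column names that become identical after trimming and case-folding."""
    names = pd.Series(list(columns))
    key = names.str.strip().str.casefold()
    groups = names.groupby(key).agg(list)
    return {k: v for k, v in groups.items() if len(v) > 1}


# Subset of the column names shown in the output above
demo_columns = ["Type of Request", "REQUEST ID", "TYPE OF REQUEST", "Amount", " Amount "]
collisions = colliding_column_names(demo_columns)
print(collisions)
```

On the real frame the call would be `colliding_column_names(df.columns)`.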

___