# Supplier Data Update and Merging Script

## Overview
This script processes two CSV files containing supplier data. It merges them on the shared keys `FILE_NAME` and `SUPPLIER_NORMALIZED`, keeps only the records whose normalized supplier name differs between the two files, and exports the result to a new CSV file.

## Process Goal
The goal of this script is to:
1. Read two source CSV files into DataFrames, specifying data types for selected columns.
2. Print the row counts for both DataFrames to monitor data size.
3. Clean whitespace from specified columns in both DataFrames.
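4. Left-merge the two DataFrames on the common keys `FILE_NAME` and `SUPPLIER_NORMALIZED`.
5. Filter the merged DataFrame to retain only rows where `SUPPLIER_NORMALIZED` differs from `SUPPLIER_NORMALIZED_AFT`.
6. Select the columns that match the original update table's structure and rename `SUPPLIER_NORMALIZED_AFT` to `SUPPLIER_NORMALIZED`.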
7. Export the final cleaned DataFrame to a new CSV file.
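
Steps 3 to 5 can be sketched on a pair of small in-memory tables. This is a minimal illustration, not the script itself: the supplier names and file names below are invented, and only the column names come from the notebook. It also shows a side effect of the left merge worth keeping in mind: rows without a match in the right table end up with a missing `SUPPLIER_NORMALIZED_AFT` and still pass the "differs" filter.

```python
import pandas as pd

# Toy stand-ins for the two source files (all values are invented for illustration).
update_table = pd.DataFrame({
    "FILE_NAME": ["a.csv", "a.csv", "b.csv"],
    "SUPPLIER_ERP": ["ACME INC ", "Globex", "Initech"],
    "SUPPLIER_NORMALIZED": ["Acme ", "Globex", "Initech"],
})
right_table = pd.DataFrame({
    "FILE_NAME": ["a.csv", "a.csv"],
    "SUPPLIER_NORMALIZED": ["Acme", "Globex"],
    "SUPPLIER_NORMALIZED_AFT": ["Acme Corp", "Globex"],
})

# Step 3: strip whitespace so the merge keys compare equal.
for col in ["SUPPLIER_ERP", "SUPPLIER_NORMALIZED"]:
    update_table[col] = update_table[col].str.strip()

# Step 4: left merge on the shared keys.
merged = update_table.merge(
    right_table, on=["FILE_NAME", "SUPPLIER_NORMALIZED"], how="left"
)

# Step 5: keep rows whose normalized name differs from the "after" name.
changed = merged[merged["SUPPLIER_NORMALIZED"] != merged["SUPPLIER_NORMALIZED_AFT"]]

# Rows with no match in right_table have a missing SUPPLIER_NORMALIZED_AFT.
# A missing value never compares equal, so those rows pass the filter too.
unmatched = changed["SUPPLIER_NORMALIZED_AFT"].isna()
matched_changes = changed[~unmatched]

print(len(changed), int(unmatched.sum()), len(matched_changes))
```

If only genuine renames are wanted in the output, filtering on `matched_changes` instead of `changed` is one option; whether unmatched rows should be exported is a decision the original script leaves open.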

## Considerations for Improvement
- **Dynamic File Paths**: Consider parameterizing file paths or using a configuration file to facilitate easier updates and changes to file locations.
- **Error Handling**: Implement error handling for file reading and writing operations to manage potential issues (e.g., missing files).
- **Logging**: Add logging capabilities to track script execution, especially for data processing steps, making it easier to debug or analyze later.
- **Data Validation**: Include validation checks to ensure the integrity of merged data (e.g., confirming no missing values in critical columns before filtering).
- **Documentation**: Enhance the code comments for clarity on the purpose and function of each section, making it easier for future reference or collaboration.
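
Several of these suggestions can be combined in one small loading helper. The sketch below is an assumption about how that might look, not part of the existing script: `load_table`, its parameters, and the logger name `supplier_update` are hypothetical. It takes the file path as an argument, logs the row count, re-raises a missing-file error after logging it, and checks the required columns for absence and missing values.

```python
import logging
from pathlib import Path

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("supplier_update")  # hypothetical logger name


def load_table(path, dtype, required_columns):
    """Read a CSV, log its row count, and check the required columns.

    Hypothetical helper: the name and signature are illustrative only.
    """
    path = Path(path)
    try:
        table = pd.read_csv(path, encoding="UTF-8-SIG", dtype=dtype)
    except FileNotFoundError:
        # Log the problem, then let the caller decide how to recover.
        logger.error("Input file not found: %s", path)
        raise

    # Fail early if a column needed for the merge is absent.
    missing = [col for col in required_columns if col not in table.columns]
    if missing:
        raise ValueError(f"{path.name} is missing columns: {missing}")

    # Warn (without failing) about missing values in the required columns.
    null_counts = table[required_columns].isna().sum()
    for column, count in null_counts[null_counts > 0].items():
        logger.warning("%s: %d missing values in %s", path.name, count, column)

    logger.info("Loaded %d rows from %s", len(table), path)
    return table
```

A call such as `load_table('right_table 1.10.24.csv', dtype1, ['FILE_NAME', 'SUPPLIER_NORMALIZED'])` would then replace the bare `pd.read_csv` call, and the file names could be read from a configuration file or command-line arguments rather than hard-coded.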

In [None]:
import pandas as pd

dtype = {
    'FILE_NAME': 'str',
    'SUPPLIER_ERP': 'str',
    'SUPPLIER_NORMALIZED': 'str'
}

dtype1 = {
    'FILE_NAME': 'str',
    'SUPPLIER_NORMALIZED': 'str',
    'SUPPLIER_NORMALIZED_AFT': 'str'
}

# Read the source CSV files into DataFrames
update_table = pd.read_csv('update_table_b4 1.10.24.csv', encoding='UTF-8-SIG', dtype=dtype)
right_table = pd.read_csv('right_table 1.10.24.csv', encoding='UTF-8-SIG', dtype=dtype1)

# Record the row counts to monitor data size
update_table_row_count = update_table.shape[0]
right_table_row_count = right_table.shape[0]

print(f"The row count for the update table: {update_table_row_count}, and right table: {right_table_row_count}.")

# Strip leading and trailing whitespace from the supplier name columns so the merge keys match
update_table.loc[:, 'SUPPLIER_ERP'] = update_table['SUPPLIER_ERP'].str.strip()
update_table.loc[:, 'SUPPLIER_NORMALIZED'] = update_table['SUPPLIER_NORMALIZED'].str.strip()
right_table.loc[:, 'SUPPLIER_NORMALIZED'] = right_table['SUPPLIER_NORMALIZED'].str.strip()
right_table.loc[:, 'SUPPLIER_NORMALIZED_AFT'] = right_table['SUPPLIER_NORMALIZED_AFT'].str.strip()

The row count for the update table: 4491, and right table: 728.


In [None]:
# Create a new DataFrame by merging the tables
merge_keys = ['FILE_NAME', 'SUPPLIER_NORMALIZED']

merged_df = update_table.merge(right_table[['FILE_NAME', 'SUPPLIER_NORMALIZED', 'SUPPLIER_NORMALIZED_AFT']],
                               on=merge_keys, how='left', suffixes=('', '_right'))


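# Step 1: Keep rows where 'SUPPLIER_NORMALIZED' differs from 'SUPPLIER_NORMALIZED_AFT'.
# Note: after the left merge, rows with no match in right_table have NaN in
# 'SUPPLIER_NORMALIZED_AFT'; NaN != value evaluates to True, so those rows are kept as well.
filtered_df = merged_df[merged_df['SUPPLIER_NORMALIZED'] != merged_df['SUPPLIER_NORMALIZED_AFT']]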

# Step 2: Drop unnecessary columns so that filtered_df has the same structure as update_table
# Select only the relevant columns from filtered_df and rename as needed
filtered_update_table = filtered_df[['FILE_NAME', 'SUPPLIER_ERP', 'SUPPLIER_NORMALIZED_AFT', 'TOTAL_SPEND_USD', 'ROW_COUNT']].copy()

# Step 3: Rename 'SUPPLIER_NORMALIZED_AFT' to 'SUPPLIER_NORMALIZED' to preserve the cleaner version
filtered_update_table.rename(columns={'SUPPLIER_NORMALIZED_AFT': 'SUPPLIER_NORMALIZED'}, inplace=True)

shape = filtered_update_table.shape
print(shape)

(1591, 5)


In [None]:
# Export the filtered update table to a new CSV file

filtered_update_table.to_csv('update_table_aft 1.10.24.csv', encoding='UTF-8-SIG', index=False)