# Data Merging and Cleaning Script

## Overview
This script merges several invoice CSV files into one consolidated table and performs basic data cleaning along the way.

## Process Goal
The goal of this script is to:
1. Read multiple CSV files containing invoice data.
2. Append the rows from each "missing files" CSV to its corresponding main file.
3. Combine the two sources (NONPO and GNFR) into a final table.
4. Clean the final dataset by replacing NaN values with a single space and selecting the relevant columns.
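
The steps above can be sketched in miniature as follows. This is only an illustration under stated assumptions: the in-memory CSV text and its values are made up, and only the column names are borrowed from the notebook.

```python
import io

import pandas as pd

# Hypothetical miniature inputs standing in for the real invoice CSVs.
main_csv = io.StringIO("INVOICE_NBR,SUPPLIER_NBR,PO_NBR\n1001,200,\n1002,201,4500\n")
missing_csv = io.StringIO("INVOICE_NBR,SUPPLIER_NBR,PO_NBR\n1003,202,4501\n")

# Read identifiers as strings so they are not converted to numbers.
main_df = pd.read_csv(main_csv, dtype=str)
missing_df = pd.read_csv(missing_csv, dtype=str)

# Steps 2 and 3: stack the rows of both frames into one table.
combined = pd.concat([main_df, missing_df], ignore_index=True)

# Step 4: replace NaN with a single space and keep the relevant columns.
cleaned = combined.fillna(" ")[["INVOICE_NBR", "PO_NBR"]]
print(cleaned)
```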


## Considerations for Improvement
- **Path Management**: Use relative paths or configuration files for better portability.
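- **Method Update**: Use `pd.concat()` rather than `DataFrame.append()`, which was deprecated in pandas 1.4 and removed in pandas 2.0.
- **Error Handling**: Validate inputs before processing (for example, that the files exist and their columns match) and report failures clearly.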
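
One possible way to combine these three suggestions is sketched below. The helper `append_missing_rows` is hypothetical and not part of the original notebook; reading every column as a string (`dtype=str`) and requiring identical column order are assumptions made for the sketch.

```python
from pathlib import Path

import pandas as pd


def append_missing_rows(main_path: Path, missing_path: Path) -> pd.DataFrame:
    """Append rows from missing_path to main_path and return the result.

    Hypothetical helper, not part of the original notebook.
    """
    # Fail early with a clear message if either input is absent.
    for path in (main_path, missing_path):
        if not path.is_file():
            raise FileNotFoundError(f"Input file not found: {path}")

    # Assumption: read everything as text so identifiers keep their format.
    main_df = pd.read_csv(main_path, dtype=str)
    missing_df = pd.read_csv(missing_path, dtype=str)

    # Refuse to merge files whose columns differ, which would introduce NaNs.
    if list(main_df.columns) != list(missing_df.columns):
        raise ValueError("Column mismatch between main and missing files")

    # pd.concat takes the place of the removed DataFrame.append method.
    combined = pd.concat([main_df, missing_df], ignore_index=True)
    combined.to_csv(main_path, index=False)
    return combined
```

With a relative base folder (here a hypothetical `data` directory), a call might look like `append_missing_rows(Path("data") / "GNFR I2P Data.csv", Path("data") / "missing_files_gnfr.csv")`.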

## <span style="color: #777777;">MISSING FILES</span>

In [2]:
import pandas as pd

# Specify paths
file1 = r"C:\Users\LOGICSOUERCE02\Downloads\TBC RELATED\TBC Invoice Update\0 - GNFR\GNFR I2P Data.csv"
file2 = r"C:\Users\LOGICSOUERCE02\Downloads\TBC RELATED\TBC Invoice Update\0 - GNFR\missing_files_gnfr.csv"

# Read the files
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)

# Append data from second file to first dataframe
df1 = pd.concat([df1, df2], ignore_index=True)

# Save the dataframe back to the first file, omitting the index
df1.to_csv(file1, index=False)



In [1]:
import pandas as pd

# Specify paths
file1 = r"C:\Users\LOGICSOUERCE02\Downloads\TBC RELATED\TBC Invoice Update\1 - NONPO\Monthly I2P Data\TO CONCAT\NONPO FY22 - Q1.csv"
file2 = r"C:\Users\LOGICSOUERCE02\Downloads\TBC RELATED\TBC Invoice Update\1 - NONPO\Monthly I2P Data\missing_files_nonpo.csv"

# Read the files
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)

# Append data from second file to first dataframe
df1 = pd.concat([df1, df2], ignore_index=True)

# Save the dataframe back to the first file, omitting the index
df1.to_csv(file1, index=False)



## <span style="color: #777777;">FINAL TABLE</span>

In [5]:
import pandas as pd
import numpy as np

# Specify paths
file1 = r"C:\Users\LOGICSOUERCE02\Downloads\TBC RELATED\TBC Invoice Update\1 - NONPO\nonpo_final_table.csv"
file2 = r"C:\Users\LOGICSOUERCE02\Downloads\TBC RELATED\TBC Invoice Update\0 - GNFR\gnfrpo_final_table.csv"

# Specify data types
dtype={'INVOICE_NBR': str, 'SUPPLIER_NBR': str, 'INVOICE_REFERENCE': str, 'PO_NBR': str}

# Read the files
df_nonpo = pd.read_csv(file1, dtype=dtype)
df_gnfr = pd.read_csv(file2, dtype=dtype)

# Append data from second file to first dataframe
final_table = pd.concat([df_nonpo, df_gnfr], ignore_index=True)

# Convert NaNs to ' '
final_table = final_table.replace(np.nan, ' ', regex=True)

print("Row count:", len(final_table))


Row count: 2526033
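
The `dtype` mapping in the cell above reads the identifier columns as strings. A small sketch of why that matters, using made-up values (only the column name comes from the notebook): without an explicit string type, pandas infers integers and drops leading zeros.

```python
import io

import pandas as pd

# Made-up sample; only the column name is taken from the notebook.
sample = "INVOICE_REFERENCE\n000301\n05102021\n"

# Default inference treats the values as integers.
as_default = pd.read_csv(io.StringIO(sample))

# An explicit string dtype keeps the text exactly as written.
as_text = pd.read_csv(io.StringIO(sample), dtype={"INVOICE_REFERENCE": str})

print(as_default["INVOICE_REFERENCE"].tolist())  # leading zeros are lost
print(as_text["INVOICE_REFERENCE"].tolist())     # leading zeros are kept
```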


In [6]:
filtered_df = final_table[['INVOICE_NBR', 'SUPPLIER_NBR', 'PO_NBR', 'INVOICE_REFERENCE']]
print(filtered_df)

        INVOICE_NBR SUPPLIER_NBR      PO_NBR INVOICE_REFERENCE
0        1900826760       100092                  1034783930-9
1        1700031033       100092                  1060187314-3
2        1900827480       100092                  1060187314-3
3        1900830756       100310                      05102021
4        1900828585       100638                  667460153694
...             ...          ...         ...               ...
2526028  5107390975       171216  4500086698              4950
2526029  5107305448       180168  4500085027         202204710
2526030  5107205343       180276  4500066154            323493
2526031  5107372171       182045  4500071710           000301C
2526032  5107242787       188496  4500083186              3360

[2526033 rows x 4 columns]


In [7]:
# Save the final table to final_table.csv, omitting the index
final_table.to_csv('final_table.csv', index=False, encoding='UTF-8-SIG')
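
The `UTF-8-SIG` encoding used above prefixes the file with a byte order mark (BOM), which some spreadsheet applications, such as Excel, rely on to recognise UTF-8. The sketch below checks for that prefix on a tiny stand-in frame written to a temporary folder; the frame and the temporary path are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in frame; the real table has millions of rows.
df = pd.DataFrame({"INVOICE_NBR": ["1900826760"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "final_table.csv"
    df.to_csv(out_path, index=False, encoding="utf-8-sig")
    raw = out_path.read_bytes()

# The UTF-8 BOM is the three bytes EF BB BF at the start of the file.
has_bom = raw.startswith(b"\xef\xbb\xbf")
print("BOM present:", has_bom)
```

When reading such a file back, passing `encoding="utf-8-sig"` to `pd.read_csv` strips the BOM again.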