
## The following program formats Daily Invoice Reports so that they need only minor edits in Notepad++ before being uploaded to Hadoop. 

### The program will:
#### 1. Import the reports and read them into a pandas dataframe
#### 2. Add the "As of Date" column to the report. This column is used in SQL queries.
#### 3. Move the "Description" column to the end of the table, so that any special characters it contains do not interfere with Hadoop. 
#### 4. Replace 'Unclassified', which Ariba uses in place of null, with NaN. 'Unclassified' values in date fields block the date formatting in the following step. 
#### 5. Format date fields as YYYY-MM-DD
#### 6. Write the modified dataframe(s) to a .txt file on the O: drive
#### 7. Convert the EOL delimiter to Unix (LF)

### After the program runs, open the files in Notepad++ to remove double quotes ("). 
### This program is set up to run 5 files at a time. Copy and paste, or remove, lines of code to accommodate more or fewer files. 
### Note: It takes approximately 5 minutes to clean and write 1 file. When you are ready, select "Cell" at the top, then "Run All". 
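
The copy-and-paste approach for handling several files can also be expressed as a loop. The following is a minimal sketch, assuming steps 2 to 5 are wrapped in one helper. The names `clean_report`, `run_all`, `jobs`, and the `reader` parameter are hypothetical and not part of the original notebook, and the paths in the usage note are placeholders.

```python
import numpy as np
import pandas as pd

# Date columns named in the notebook's date-formatting cell
DATE_COLUMNS = [
    "Invoice Date Created - Date",
    "Invoice Date - Date",
    "Invoice Submit Date - Date",
    "Approved Date - Date",
    "Reconciled Date - Date",
]


def clean_report(df, as_of_date):
    """Apply steps 2 to 5 to one report and return a new dataframe."""
    # Step 2: drop any existing "as of" column, then add 'As of Date'
    out = df.drop(columns=["As of", "As Of", "as of"], errors="ignore")
    out["As of Date"] = as_of_date
    # Step 3: move 'Description' to the end (skipped if the column is absent)
    if "Description" in out.columns:
        others = [c for c in out.columns if c != "Description"]
        out = out[others + ["Description"]]
    # Step 4: replace Ariba's 'Unclassified' filler with NaN
    out = out.replace("Unclassified", np.nan)
    # Step 5: format the date columns as YYYY-MM-DD
    for col in DATE_COLUMNS:
        if col in out.columns:
            out[col] = pd.to_datetime(out[col]).dt.strftime("%Y-%m-%d")
    return out


def run_all(jobs, as_of_date, reader=pd.read_excel):
    """Clean each (input path, output path) pair and write a tab-delimited file."""
    for source_path, target_path in jobs:
        cleaned = clean_report(reader(source_path), as_of_date)
        cleaned.to_csv(target_path, sep="\t", index=False)
```

With this in place, adding or removing a file means editing one list, for example `run_all([("filepath_in_1", "filepath_out_1"), ("filepath_in_2", "filepath_out_2")], "2020-10-29")`.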

In [None]:
# Print the start time

from datetime import datetime

now = datetime.now()

current_time = now.strftime("%H:%M:%S")
print("Start Time =", current_time)

In [None]:
# Import the pandas and numpy libraries, then read the report(s) into dataframes

import pandas as pd
import numpy as np

df1 = pd.read_excel("filepath")
print("All files read into dataframes")


In [None]:
# Drop any existing "as of" column (if present), then add the 'As of Date' column to the dataframe(s)

df1.drop(columns=['As of', 'As Of', 'as of'], inplace=True, errors='ignore')
df1["As of Date"] = "2020-10-29"

print("As of Date added")
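
The three `drop` variants above cover only specific capitalizations, and the date is typed in by hand. A hedged sketch of an alternative follows: it matches the "as of" column name case-insensitively and defaults to today's date. The helper name `set_as_of_date` is hypothetical, and defaulting to today is an assumption that may not match how the report's as-of date is actually chosen.

```python
from datetime import date

import pandas as pd


def set_as_of_date(df, as_of_date=None):
    """Drop any 'as of' column regardless of capitalization, then add 'As of Date'."""
    if as_of_date is None:
        # Assumption: the report's as-of date is the day the notebook is run
        as_of_date = date.today().isoformat()  # YYYY-MM-DD
    stale = [c for c in df.columns if str(c).strip().lower() == "as of"]
    out = df.drop(columns=stale)  # returns a new dataframe; df is unchanged
    out["As of Date"] = as_of_date
    return out
```

Usage would look like `df1 = set_as_of_date(df1, "2020-10-29")`.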

In [None]:
# Move the Description column to the end of the dataframe. Repeat the code as many times as needed.

cols = list(df1.columns.values)  # Make a list of all of the columns in the df
cols.pop(cols.index('Description'))  # Remove 'Description' from the list
df1 = df1[cols + ['Description']]  # Reorder the columns with 'Description' last

print("Description column moved to end")
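
The cell above raises a `ValueError` if a report has no 'Description' column. A minimal sketch of a more forgiving version is shown here; the helper name `move_column_to_end` is hypothetical, and returning the dataframe unchanged when the column is missing is an assumption about the desired behavior.

```python
import pandas as pd


def move_column_to_end(df, column):
    """Return df with `column` as the last column; unchanged if it is absent."""
    if column not in df.columns:
        return df
    remaining = [c for c in df.columns if c != column]
    return df[remaining + [column]]
```

Usage would look like `df1 = move_column_to_end(df1, 'Description')`.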

In [None]:
# Replace Ariba's 'Unclassified' filler with np.nan

df1.replace('Unclassified', np.nan, inplace=True)

print("replace unclassified complete")

In [None]:
# Reformat the date columns to YYYY-MM-DD format

date_columns = [
    'Invoice Date Created - Date',
    'Invoice Date - Date',
    'Invoice Submit Date - Date',
    'Approved Date - Date',
    'Reconciled Date - Date',
]
for col in date_columns:
    df1[col] = pd.to_datetime(df1[col]).dt.strftime('%Y-%m-%d')


print("Date formatting complete")
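
If a date field contains a value other than 'Unclassified' that cannot be parsed, `pd.to_datetime` raises an error by default. The sketch below shows one hedged way to handle that: `errors="coerce"` turns unparseable values into missing values instead. The helper name `format_date_columns` is hypothetical, and silently blanking bad dates is an assumption that may not suit every report.

```python
import pandas as pd

# Date columns named in the cell above
DATE_COLUMNS = [
    "Invoice Date Created - Date",
    "Invoice Date - Date",
    "Invoice Submit Date - Date",
    "Approved Date - Date",
    "Reconciled Date - Date",
]


def format_date_columns(df, columns=DATE_COLUMNS):
    """Return a copy of df with the given date columns formatted as YYYY-MM-DD."""
    out = df.copy()
    for col in columns:
        if col in out.columns:  # skip columns the report does not have
            parsed = pd.to_datetime(out[col], errors="coerce")  # bad values become NaT
            out[col] = parsed.dt.strftime("%Y-%m-%d")  # NaT becomes a missing value
    return out
```

Usage would look like `df1 = format_date_columns(df1)`.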


In [None]:
# Write the dataframe(s) to tab-delimited .txt files.

df1.to_csv(r"filepath", sep='\t', index=False)

print("")
print("Files written to .txt")

#### The next block of code performs the formatting that would otherwise be done manually in Notepad++. Copy and paste the code to repeat as needed. It replaces each Windows line ending (\r\n) with a Unix line ending (\n), which converts the EOL to Unix (LF). Notepad++ is still needed to remove double quotes ("), but the remaining manual steps are not. 

In [None]:
# replacement strings

WINDOWS_LINE_ENDING = b'\r\n'
UNIX_LINE_ENDING = b'\n'

# file path
file_path = r"filepath"
with open(file_path, 'rb') as open_file:
    content = open_file.read()

content = content.replace(WINDOWS_LINE_ENDING, UNIX_LINE_ENDING)

with open(file_path, 'wb') as open_file:
    open_file.write(content)
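
The double-quote removal that is otherwise done in Notepad++ could also be folded into the same byte-level pass. This is a minimal sketch, assuming no field needs its double quotes preserved; `normalize_text_file` and `strip_double_quotes` are hypothetical names, not part of the original notebook.

```python
WINDOWS_LINE_ENDING = b"\r\n"
UNIX_LINE_ENDING = b"\n"


def normalize_text_file(file_path, strip_double_quotes=True):
    """Convert CRLF line endings to LF in place and optionally remove double quotes."""
    with open(file_path, "rb") as open_file:
        content = open_file.read()

    content = content.replace(WINDOWS_LINE_ENDING, UNIX_LINE_ENDING)
    if strip_double_quotes:
        # Assumption: every double quote in the file is unwanted
        content = content.replace(b'"', b"")

    with open(file_path, "wb") as open_file:
        open_file.write(content)
```

Usage would look like `normalize_text_file(r"filepath")`, repeated once per output file.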
    


In [None]:
# Print the finish time

now = datetime.now()

current_time = now.strftime("%H:%M:%S")
print("Finish Time =", current_time)
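
Since the notebook prints a start and a finish time, the elapsed time can be computed directly instead of by subtracting the two printouts by hand. A minimal sketch follows; the helper name `format_elapsed` is hypothetical, and it assumes the start time was kept in its own variable rather than overwritten by the second `now`.

```python
from datetime import datetime


def format_elapsed(start, end):
    """Return the time between two datetimes as HH:MM:SS."""
    total_seconds = int((end - start).total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
```

Usage would look like `print("Elapsed =", format_elapsed(start_time, datetime.now()))`, where `start_time` is a hypothetical variable set in the first cell.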