# Final Cleaning

By this stage every CSV file is much cleaner to work with, so this final pass removes the remaining bad values.
Once those values are gone, our CSVs will be SQL ready!

### Importing libraries

In [4]:
import pandas as pd

### Lookup tables, and header file cleaning
After creating our lookup table CSVs, we can go through them one final time to check for any unnecessary data.

### File paths for header and lookup tables

In [1]:
# Port of lading and unlading lookup tables; purely numerical port values will be removed from these
# Numerical data is unneeded in this case
port_lading_fps = [
    "./data/cleaned_/2019/lookup_table_files/port_of_lading_lookup.csv",
    "./data/cleaned_/2019/lookup_table_files/port_of_unlading.csv",
    "./data/cleaned_/2020/lookup_table_files/port_of_lading_lookup.csv",
    "./data/cleaned_/2020/lookup_table_files/port_of_unlading.csv"
]


In [None]:
header_2019_fps = [
    "./data/cleaned_/2019/header_table_files/header_0.csv",
    "./data/cleaned_/2019/header_table_files/header_1.csv",
    "./data/cleaned_/2019/header_table_files/header_2.csv",
    "./data/cleaned_/2019/header_table_files/header_3.csv"
]

In [3]:
header_2020_fps = [
    "./data/cleaned_/2020/header_table_files/header_0.csv",
    "./data/cleaned_/2020/header_table_files/header_1.csv",
    "./data/cleaned_/2020/header_table_files/header_2.csv",
]

In [None]:
# After reviewing the date lookup tables, there was a lot of data from earlier years, not pertaining to 2019 or 2020, that we don't need
# We correct that by removing those dates.
date_fps = [
    "./data/cleaned_/2019/lookup_table_files/estimated_arrival_lookup.csv",
    "./data/cleaned_/2019/lookup_table_files/arrival_date_lookup.csv",
    "./data/cleaned_/2020/lookup_table_files/estimated_arrival_lookup.csv",
    "./data/cleaned_/2020/lookup_table_files/arrival_date_lookup.csv"
]

In [None]:
# Now we go through the lookup files one at a time and collect the ids to remove
# (change the index to switch files)

df = pd.read_csv(port_lading_fps[0])

# Convert the column to numeric; values that can't be converted become NaN
# Any port value that is numerical rather than alphabetical converts successfully
# We use this to find the ids to remove
numeric_mask = pd.to_numeric(df['port_of_lading'],errors='coerce').notna()

# With the mask, we can get the ids we need to remove
removed_ids = df.loc[numeric_mask,'port_lading_id'].tolist()

# Now we have a cleaned df we can convert to CSV so the lookup table is SQL ready
df_clean = df.loc[~numeric_mask]
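
To make the masking step concrete, here is a minimal sketch on a tiny in-memory table. The sample rows are hypothetical; only the column names `port_of_lading` and `port_lading_id` come from the cell above.

```python
import pandas as pd

# Hypothetical sample rows; the real lookup table is read from CSV.
sample = pd.DataFrame({
    "port_lading_id": [1, 2, 3, 4],
    "port_of_lading": ["Shanghai", "57035", "Busan", "12.5"],
})

# Values that parse as numbers stay non-null after coercion; names become NaN.
numeric_mask = pd.to_numeric(sample["port_of_lading"], errors="coerce").notna()

# Ids of the numeric rows, which later get dropped from the header files.
removed_ids = sample.loc[numeric_mask, "port_lading_id"].tolist()

# The lookup rows that survive the cleaning.
df_clean = sample.loc[~numeric_mask]
```

Note that decimal strings such as `"12.5"` also count as numeric here, so they are removed along with the integer-like values.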

In [None]:
# Save the cleaned lookup table, which no longer contains numerical port values
df_clean.to_csv('port_of_lading_lookup.csv',index=False)

In [None]:

# Now, with the ids we know we need to remove, we go over the header files and remove the matching rows
# Rows that point at a removed port are junk; every record in our data must be valid.
# Swap header_2019_fps for header_2020_fps to process the other year
for idx, fp in enumerate(header_2019_fps):
    df = pd.read_csv(fp)
    df_filtered = df[~df['port_lading_id'].isin(removed_ids)]

    df_filtered.to_csv(f'header_{idx}.csv',index=False)
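
One thing to watch is that the loop writes `header_{idx}.csv` into the working directory, so running it for 2019 and then 2020 would overwrite the first set of outputs. A possible sketch of a reusable helper that writes into a separate folder is shown below; the function name `drop_rows_by_id`, the `out_dir` parameter, and the `weight` column are hypothetical and used only for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def drop_rows_by_id(header_fps, id_column, bad_ids, out_dir):
    """Write a filtered copy of each header file into out_dir; return the new paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for fp in header_fps:
        df = pd.read_csv(fp)
        # Keep only the rows whose id is not in the removal list.
        df_filtered = df[~df[id_column].isin(bad_ids)]
        out_fp = out_dir / Path(fp).name
        df_filtered.to_csv(out_fp, index=False)
        written.append(out_fp)
    return written


# Small demonstration on a temporary file with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "header_0.csv"
    pd.DataFrame(
        {"port_lading_id": [1, 2, 3], "weight": [10, 20, 30]}
    ).to_csv(src, index=False)

    out = drop_rows_by_id([src], "port_lading_id", [2], Path(tmp) / "filtered")
    kept_ids = pd.read_csv(out[0])["port_lading_id"].tolist()
```

Keeping the original file name and changing only the output folder means each year's files can go to their own directory without colliding.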


In [None]:
# There's a lot of garbage date data from years before the ones we want to use, so we need to remove it!

# Read in one of the date lookup CSVs (index 3 is the 2020 arrival date table)
df = pd.read_csv(date_fps[3])

# Boolean mask that is True where the date starts with the target year
mask_2020 = df['arrival_date'].str[:4] == "2020"

# Ids of every row from a different year; these will be removed
ids_to_remove = df.loc[~mask_2020,'arrival_id'].tolist()

# A clean dataframe that keeps only the rows from the target year
df_clean = df.loc[mask_2020]

# This dataframe is our lookup table, so we save the updated version
df_clean.to_csv('arrival_date_lookup.csv',index=False)
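
As a small illustration of the year mask, the sketch below runs the same comparison on made-up rows. The date format (`YYYY-MM-DD`) is an assumption; all the cell above relies on is that the first four characters are the year. The `astype(str)` call is an optional guard added here in case the column was not read in as strings.

```python
import pandas as pd

# Hypothetical sample rows, including one missing date.
sample = pd.DataFrame({
    "arrival_id": [10, 11, 12, 13],
    "arrival_date": ["2020-01-15", "2018-12-30", "2020-06-01", None],
})

# Compare the first four characters of each date with the target year.
mask_2020 = sample["arrival_date"].astype(str).str[:4] == "2020"

# Rows from other years, and rows with no date at all, are marked for removal.
ids_to_remove = sample.loc[~mask_2020, "arrival_id"].tolist()
df_clean = sample.loc[mask_2020]
```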

In [None]:
# As with the ports, we remove the header records that reference the bad years, because they describe
# shipments from years our use case is not focusing on.
for idx, fp in enumerate(header_2020_fps):
    df = pd.read_csv(fp)
    df_filtered = df[~df['arrival_id'].isin(ids_to_remove)]
    df_filtered.to_csv(f'header_{idx}.csv',index=False)

# Conclusion

It's important to note that these steps can be automated by switching just a few variables: the file list, the column names, and the target year.
Now that our data has been thoroughly cleaned, it's ready for SQL!
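
As a rough sketch of what "switching a few variables" could look like, the year filter can be wrapped in a function that takes the column names and the year as parameters. The name `filter_lookup_by_year` and the sample rows are hypothetical, and this is only one way to parameterize the step.

```python
import pandas as pd


def filter_lookup_by_year(df, date_column, id_column, year):
    """Return (rows whose date starts with `year`, ids of all other rows)."""
    keep = df[date_column].astype(str).str[:4] == str(year)
    return df.loc[keep], df.loc[~keep, id_column].tolist()


# Hypothetical rows standing in for a date lookup table.
lookup = pd.DataFrame({
    "arrival_id": [1, 2, 3],
    "arrival_date": ["2019-03-01", "2020-07-04", "2017-11-20"],
})

# The same call handles either year; only the last argument changes.
clean_2019, bad_2019 = filter_lookup_by_year(lookup, "arrival_date", "arrival_id", 2019)
clean_2020, bad_2020 = filter_lookup_by_year(lookup, "arrival_date", "arrival_id", 2020)
```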

Along the way, I realized I wouldn't need the container or cargodesc files, as my use case changed. That's why you don't see them after a file or two.