# PoC Guidelines:

- What is the original format of the data?
- How large is the entire dataset?
- Describe the features of the data that you are working with.
    - Provide a statistical summary of numerical columns
    - Provide a detailed breakdown of how you choose to clean the data
        - What is your methodology for dealing with errors, nulls, and outliers?
        - What quality standards are enforced for your cleaned data?
- Does the data need to be augmented with any external data?


![ERD Diagram](./erd.svg)
Example ERD

# My Findings (Answering guideline questions)

#### Original Format:
In the original format, the data is organized by year. Each year has its own folder with separate contents:
- 2018
- 2019
- 2020


Inside each year folder, there is a separate folder for each type of import data.
Each of these folders holds csv files, and the number of files varies from folder to folder.
- billgen
- cargodesc
- consignee
- container
- hazmat
- hazmatclass
- header
- marksnumbers
- notifyparty


Each csv file holds rows and columns: each row is one record, and each column is one attribute of that record. The column names and data types (as inferred by pandas) for each type of import data are as follows:

- **Billgen:**
    - identifier -> int64
    - master_bol_number -> object
    - house_bol_number -> object
    - sub_house_bol_number -> float64
    - voyage_number -> object
    - bill_type_code -> object
    - manifest_number -> float64
    - trade_update_date -> object
    - run_date -> object
- **Cargo Desc:**
    - identifier -> int64
    - container_number -> object
    - description_sequence_number -> int64
    - piece_count -> float64
    - description_text -> object
- **Consignee:**
    - identifier -> int64
    - consignee_name -> object
    - consignee_address_1 -> object
    - consignee_address_2 -> object
    - consignee_address_3 -> object
    - consignee_address_4 -> object
    - city -> object
    - state_province -> object
    - zip_code -> object
    - country_code -> object
    - contact_name -> object
    - comm_number_qualifier -> object
    - comm_number -> object
- **Container:**
    - identifier -> int64
    - container_number -> object
    - seal_number_1 -> object
    - seal_number_2 -> object
    - equipment_description_code -> object
    - container_length -> int64
    - container_height -> int64
    - container_width -> int64
    - container_type -> object
    - load_status -> object
    - type_of_service -> object
- **Hazmat:**
    - identifier -> int64
    - container_number -> object
    - hazmat_sequence_number -> int64
    - hazmat_code -> object
    - hazmat_class -> object
    - hazmat_code_qualifier -> object
    - hazmat_contact -> object
    - hazmat_page_number -> object
    - hazmat_flash_point_temperature -> int64
    - hazmat_flash_point_temperature_negative_ind -> object
    - hazmat_flash_point_temperature_unit -> object
    - hazmat_description -> object
- **Hazmat Class:**
    - identifier -> int64
    - container_number -> object
    - hazmat_sequence_number -> int64
    - hazmat_classification -> object
- **Header:**
    - identifier -> int64
    - carrier_code -> object
    - vessel_country_code -> object
    - vessel_name -> object
    - port_of_unlading -> object
    - estimated_arrival_date -> object
    - foreign_port_of_lading_qualifier -> object
    - foreign_port_of_lading -> object
    - manifest_quantity -> int64
    - manifest_unit -> object
    - weight -> int64
    - weight_unit -> object
    - measurement -> int64
    - measurement_unit -> object
    - record_status_indicator -> object
    - place_of_receipt -> object
    - port_of_destination -> object
    - foreign_port_of_destination_qualifier -> float64
    - foreign_port_of_destination -> float64
    - conveyance_id_qualifier -> object
    - conveyance_id -> object
    - in_bond_entry_type -> object
    - mode_of_transportation -> object
    - secondary_notify_party_1 -> object
    - secondary_notify_party_2 -> object
    - secondary_notify_party_3 -> float64
    - secondary_notify_party_4 -> float64
    - secondary_notify_party_5 -> float64
    - secondary_notify_party_6 -> float64
    - secondary_notify_party_7 -> float64
    - secondary_notify_party_8 -> float64
    - secondary_notify_party_9 -> float64
    - secondary_notify_party_10 -> float64
    - actual_arrival_date -> object
- **Marks Numbers:**
    - identifier -> int64
    - container_number -> object
    - marks_and_numbers_1 -> object
    - marks_and_numbers_2 -> object
    - marks_and_numbers_3 -> object
    - marks_and_numbers_4 -> object
    - marks_and_numbers_5 -> object
    - marks_and_numbers_6 -> object
    - marks_and_numbers_7 -> object
    - marks_and_numbers_8 -> object
- **Notify Party:**
    - identifier -> int64
    - notify_party_name -> object
    - notify_party_address_1 -> object
    - notify_party_address_2 -> object
    - notify_party_address_3 -> object
    - notify_party_address_4 -> object
    - city -> object
    - state_province -> object
    - zip_code -> object
    - country_code -> object
    - contact_name -> object
    - comm_number_qualifier -> object
    - comm_number -> object


While this is a lot of information, it's important to know the format and structure of the original raw data so that you, the Data Engineer, know it inside out.
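A listing like the one above can be produced with pandas' `DataFrame.dtypes`. The sketch below is only an illustration: it reads a tiny in-memory CSV whose column names come from the Cargo Desc list, while the rows themselves are made up. It also shows why a count-like column such as `piece_count` can show up as `float64`: with default settings, `read_csv` stores an integer column that has missing values as floats.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for one raw file. The column names come from the
# Cargo Desc listing; the two rows are made up for illustration only.
sample_csv = io.StringIO(
    "identifier,container_number,description_sequence_number,piece_count,description_text\n"
    "1,ABCU1234567,1,10,FURNITURE\n"
    "2,ABCU7654321,1,,TOYS\n"
)
cargodesc = pd.read_csv(sample_csv)

# Map each column name to the dtype pandas inferred for it
dtype_report = {column: str(dtype) for column, dtype in cargodesc.dtypes.items()}
for column, dtype in dtype_report.items():
    print(f"{column} -> {dtype}")
```

The missing `piece_count` in the second row is what turns that column into `float64`, which may also explain some of the `float64` columns in the lists above.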

# Size of data set
The size of the data set was calculated by adding up the file sizes unit by unit (KB -> MB -> GB), converting each subtotal into the next larger unit:
- Total KB: 2152 KB -> converted to 2.2 MB
- Total MB including the converted KB: 26,105.3 MB -> converted to 26.11 GB
- Total GB including the converted MB: 64.08 GB


Overall, the size of the entire dataset is 64.08 GB!
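The conversions above can be reproduced in a few lines. This is a minimal sketch, assuming decimal units (1 MB = 1000 KB, 1 GB = 1000 MB), which is what the 26,105.3 MB -> 26.11 GB step implies. The `dataset_size_gb` helper is a hypothetical alternative that sums file sizes directly from disk instead of adding them up by hand.

```python
from pathlib import Path

# Decimal units, matching the 26,105.3 MB -> 26.11 GB conversion above
KB_PER_MB = 1000
MB_PER_GB = 1000


def kb_to_mb(kb: float) -> float:
    """Convert kilobytes to megabytes."""
    return kb / KB_PER_MB


def mb_to_gb(mb: float) -> float:
    """Convert megabytes to gigabytes."""
    return mb / MB_PER_GB


def dataset_size_gb(root: str) -> float:
    """Sum the size of every .csv file under root, in decimal GB."""
    total_bytes = sum(path.stat().st_size for path in Path(root).rglob("*.csv"))
    return total_bytes / 1000**3


print(round(kb_to_mb(2152), 1))     # 2.2
print(round(mb_to_gb(26105.3), 2))  # 26.11
```

Calling `dataset_size_gb` on the folder that holds the year folders would give the total without any manual bookkeeping.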

In [1]:
import pandas as pd

# Sample files for 2018, one per type of import data
file_paths = [
    "./2018/ams__billgen_2018__202001290000.csv",
    "./2018/ams__cargodesc_2018__202001290000.csv",
    "./2018/ams__consignee_2018__202001290000.csv",
    "./2018/ams__container_2018__202001290000.csv",
    "./2018/ams__hazmat_2018__202001290000.csv",
    "./2018/ams__hazmatclass_2018__202001290000.csv",
    "./2018/ams__header_2018__202001290000.csv",
    "./2018/ams__marksnumbers_2018__202001290000.csv",
    "./2018/ams__notifyparty_2018__202001290000.csv",
    "./2018/ams__shipper_2018__202001290000.csv",
    "./2018/ams__tariff_2018__202001290000.csv"
]
dataframe_list = []

# Load every file into its own DataFrame; list positions follow file_paths
# (index 3 is container, index 6 is header)
for fp in file_paths:
    df = pd.read_csv(fp)
    dataframe_list.append(df)
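Positional indexes such as `dataframe_list[6]` are easy to mix up. As an optional alternative, the sketch below keys each DataFrame by its table name, parsed from the file name. It assumes the `ams__<table>_<year>__<timestamp>.csv` naming seen in the paths above and one file per table in the folder; `table_name` and `load_year` are hypothetical helper names, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def table_name(path: Path) -> str:
    """Extract the table name, e.g. 'ams__billgen_2018__202001290000.csv' -> 'billgen'."""
    middle = path.name.split("__")[1]  # 'billgen_2018'
    return middle.rsplit("_", 1)[0]    # 'billgen'


def load_year(folder: str) -> dict[str, pd.DataFrame]:
    """Read every ams__*.csv file in folder into a dict keyed by table name."""
    return {
        table_name(path): pd.read_csv(path)
        for path in sorted(Path(folder).glob("ams__*.csv"))
    }
```

With this in place, `frames = load_year("./2018")` followed by `frames["header"]` would replace `dataframe_list[6]`. If a folder held several files for the same table, later files would overwrite earlier ones, so that case would need concatenation instead.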


# Describing the features of the Data set:
Using the sample data, we'll run some analyses on the numerical columns to get a feel for the data and the ways it can be used.

In [2]:
dataframe_list[6].head()

Unnamed: 0,identifier,carrier_code,vessel_country_code,vessel_name,port_of_unlading,estimated_arrival_date,foreign_port_of_lading_qualifier,foreign_port_of_lading,manifest_quantity,manifest_unit,...,secondary_notify_party_2,secondary_notify_party_3,secondary_notify_party_4,secondary_notify_party_5,secondary_notify_party_6,secondary_notify_party_7,secondary_notify_party_8,secondary_notify_party_9,secondary_notify_party_10,actual_arrival_date
0,201801010,DFDS,GB,EVER SIGMA,"Los Angeles, California",2017-02-14,Schedule K Foreign Port,"Kaohsiung,China (Taiwan)",21,CTN,...,,,,,,,,,,2017-02-15
1,201801011,DFDS,GB,EVER SIGMA,"Los Angeles, California",2017-02-14,Schedule K Foreign Port,"Kaohsiung,China (Taiwan)",3,CAS,...,,,,,,,,,,2017-02-15
2,201801012,DFDS,GB,EVER SIGMA,"Los Angeles, California",2017-02-14,Schedule K Foreign Port,"Kaohsiung,China (Taiwan)",59,CTN,...,,,,,,,,,,2017-02-15
3,201801013,DFDS,GB,EVER SIGMA,"Los Angeles, California",2017-02-14,Schedule K Foreign Port,"Kaohsiung,China (Taiwan)",168,CTN,...,,,,,,,,,,2017-02-15
4,201801014,DFDS,GB,EVER SIGMA,"Los Angeles, California",2017-02-14,Schedule K Foreign Port,"Kaohsiung,China (Taiwan)",9,CTN,...,,,,,,,,,,2017-02-15


In [None]:
# Index 6 in file_paths is the header file
header = dataframe_list[6]

# Getting the top 10 foreign ports of lading (where the cargo was loaded)
sr1 = header['foreign_port_of_lading'].value_counts()
#print(sr1[:10])
# Getting the top 10 ports of unlading (where the cargo was unloaded)
sr2 = header['port_of_unlading'].value_counts()
#print(sr2[:10])

# Quick conclusion: we can see that China is a big exporter, and the US is a big importer


# Checking how accurate estimated_arrival_date is compared to actual_arrival_date
header['estimated_arrival_date'] = pd.to_datetime(header['estimated_arrival_date'])
header['actual_arrival_date'] = pd.to_datetime(header['actual_arrival_date'])

# Positive = arrived before the estimate, negative = arrived after it
header['arrival_diff_days'] = (header['estimated_arrival_date'] - header['actual_arrival_date']).dt.days


# Counting how many were early, this is if the difference is greater than 0
print((header['arrival_diff_days'] > 0).sum())

# Counting how many were late, this is if the difference is less than 0
print((header['arrival_diff_days'] < 0).sum())

# Counting how many were on time
print((header['arrival_diff_days'] == 0).sum())

# Based on these results we can conclude that shipments usually arrive later than their estimated date


# These are just some quick calculations to get a feel for the data!

7
354
138
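For the statistical summary of numerical columns that the guidelines ask for, pandas' `describe()` is a natural starting point. The sketch below is an illustration only: it uses the five `manifest_quantity` values visible in the header rows printed earlier, not the full file, so the numbers it prints say nothing about the whole dataset.

```python
import pandas as pd

# manifest_quantity values copied from the five header rows shown earlier
header_sample = pd.DataFrame({"manifest_quantity": [21, 3, 59, 168, 9]})

# Keep only numeric columns, then compute count, mean, std, min, quartiles, max
summary = header_sample.select_dtypes("number").describe()
print(summary)
```

On the real header DataFrame the same call would also cover columns such as `weight` and `measurement`.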


# Cleaning Data
The basic data cleaning practices I follow are:
- Dropping duplicates
- Ensuring data types for columns are correct


Apart from duplicates, we wouldn't drop any records, since they contain information that matters for our calculations, but we can drop columns that serve no purpose in our analytical process.
For example, in the header csv, there are a ton of columns that can be dropped.


In [5]:
# Notice the secondary_notify_party columns: they seem to contain a lot of NaN values
dataframe_list[6].head(10)

Unnamed: 0,identifier,carrier_code,vessel_country_code,vessel_name,port_of_unlading,estimated_arrival_date,foreign_port_of_lading_qualifier,foreign_port_of_lading,manifest_quantity,manifest_unit,...,secondary_notify_party_2,secondary_notify_party_3,secondary_notify_party_4,secondary_notify_party_5,secondary_notify_party_6,secondary_notify_party_7,secondary_notify_party_8,secondary_notify_party_9,secondary_notify_party_10,actual_arrival_date
0,201801010,DFDS,GB,EVER SIGMA,"Los Angeles, California",2017-02-14,Schedule K Foreign Port,"Kaohsiung,China (Taiwan)",21,CTN,...,,,,,,,,,,2017-02-15
1,201801011,DFDS,GB,EVER SIGMA,"Los Angeles, California",2017-02-14,Schedule K Foreign Port,"Kaohsiung,China (Taiwan)",3,CAS,...,,,,,,,,,,2017-02-15
2,201801012,DFDS,GB,EVER SIGMA,"Los Angeles, California",2017-02-14,Schedule K Foreign Port,"Kaohsiung,China (Taiwan)",59,CTN,...,,,,,,,,,,2017-02-15
3,201801013,DFDS,GB,EVER SIGMA,"Los Angeles, California",2017-02-14,Schedule K Foreign Port,"Kaohsiung,China (Taiwan)",168,CTN,...,,,,,,,,,,2017-02-15
4,201801014,DFDS,GB,EVER SIGMA,"Los Angeles, California",2017-02-14,Schedule K Foreign Port,"Kaohsiung,China (Taiwan)",9,CTN,...,,,,,,,,,,2017-02-15
5,201801015,DFDS,GB,EVER SIGMA,"Los Angeles, California",2017-02-14,Schedule K Foreign Port,"Kaohsiung,China (Taiwan)",224,CTN,...,,,,,,,,,,2017-02-15
6,201801016,DFDS,GB,EVER SIGMA,"Los Angeles, California",2017-02-14,Schedule K Foreign Port,"Kaohsiung,China (Taiwan)",95,CTN,...,,,,,,,,,,2017-02-15
7,201801017,DFDS,GB,EVER SIGMA,"Los Angeles, California",2017-02-14,Schedule K Foreign Port,"Kaohsiung,China (Taiwan)",349,CTN,...,,,,,,,,,,2017-02-15
8,201801018,DFDS,GB,EVER SIGMA,"Los Angeles, California",2017-02-14,Schedule K Foreign Port,"Kaohsiung,China (Taiwan)",20,CTN,...,,,,,,,,,,2017-02-15
9,201801019,DFDS,GB,EVER SIGMA,"Los Angeles, California",2017-02-14,Schedule K Foreign Port,"Kaohsiung,China (Taiwan)",2,CTN,...,,,,,,,,,,2017-02-15


In [None]:
# We can check whether any of those columns hold values by calling the .unique() method on each of them
dataframe_list[6]['secondary_notify_party_1'].unique()

# After running this command on all those columns, we observe that only 'secondary_notify_party_1' and 'secondary_notify_party_2' have any values
# Knowing this, we can drop the rest of the columns, as they carry no information in this sample

list_to_drop = [
    'secondary_notify_party_3',
    'secondary_notify_party_4',
    'secondary_notify_party_5',
    'secondary_notify_party_6',
    'secondary_notify_party_7',
    'secondary_notify_party_8',
    'secondary_notify_party_9',
    'secondary_notify_party_10'
]

# drop() returns a new DataFrame, so the result is assigned back to keep the change
dataframe_list[6] = dataframe_list[6].drop(columns=list_to_drop)

# The header csv files are among the largest files in the actual dataset.
# If we were to use this technique to clean up those files, it would save us a lot of space!
# One thing to keep in mind is that since this is the sample data, there could be more records with actual values in the other columns
# For now, this is how I would clean columns that have no values at all.
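Checking each column by hand with `.unique()` works for a sample, but the same check can be automated. The sketch below is a minimal illustration on a made-up miniature of the header table (the values are hypothetical): it finds every column that is entirely null and drops it.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the header table; all values are made up
header_sample = pd.DataFrame({
    "identifier": [1, 2, 3],
    "secondary_notify_party_1": ["ACME", None, None],
    "secondary_notify_party_3": [np.nan, np.nan, np.nan],
    "secondary_notify_party_4": [np.nan, np.nan, np.nan],
})

# Columns in which every single value is missing
empty_columns = header_sample.columns[header_sample.isna().all()].tolist()
print(empty_columns)

# Drop them; header_sample.dropna(axis="columns", how="all") does the same in one call
trimmed = header_sample.drop(columns=empty_columns)
```

Because the list is computed from the data, a column that turns out to hold values in the full dataset would be kept automatically.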

In [None]:
# Null values can be a statistic in their own right when we want to keep a column that carries statistical weight.
# For example, for my first QA Presentation, I was missing a lot of values for a specific column.
# I turned those null values into data themselves! That way, they became another value that could be used in the data visualization.

# For example, container_type is null for some records. Inputting a default value would
# unfairly skew the data one way, as that is not its real value.
# So we could possibly do some data analysis on the records with no container type!
dataframe_list[3]['container_type'].unique()


# In conclusion, you don't always have to remove null values; I would use them as a statistic themselves!

array([nan, '4EB0', '45R1', '45G0', '42R0', '22G0', '42G0', '4510',
       '4FG0', '2200', '2210', '4300', '2CB0', '2CT0', 'L5G0', '4500',
       '2CG0', '4CG0', '4200', '45G1', '4EG0', '22G1', '4EU1', '4CU1',
       '4CB0', '40G0', '4ER0', '4FR0', '4310'], dtype=object)
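One way to treat nulls as a statistic is to count them next to the real values, or to give them an explicit label. The sketch below is an illustration under assumptions: the two codes come from the `unique()` output above, but the rows are made up, and `"UNKNOWN"` is an arbitrary placeholder label.

```python
import numpy as np
import pandas as pd

# Codes taken from the unique() output above; the rows themselves are made up
container_sample = pd.DataFrame(
    {"container_type": ["45G0", np.nan, "22G0", np.nan, "45G0"]}
)

# Option 1: count missing values alongside the real codes
counts_with_nulls = container_sample["container_type"].value_counts(dropna=False)
print(counts_with_nulls)

# Option 2: label the gap explicitly so it shows up as its own category in charts
labelled = container_sample["container_type"].fillna("UNKNOWN")
label_counts = labelled.value_counts()
print(label_counts)
```

Writing the label into a separate Series keeps the original column untouched, so the nulls remain distinguishable from genuine codes.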

#### **Outliers in data**
My first step in dealing with outliers is to check the size of the csv file.
The larger the file, the more inclined I am to remove outliers, since a few extreme records are insignificant in the grand scheme of things. For a smaller csv file I would keep all values, as removing even one value can drastically change the statistical outputs.

Another tool for dealing with outliers is a box plot. It shows the distribution of a column and makes it easy to spot the extreme values that skew a visualization, which helps us make informed decisions about removing outliers.

For example, when we were presenting the movie data notebooks, one of the teams had a single outlier that made all the other data look bunched up while the outlier sat alone. A box plot would have exposed that outlier so that it could be removed.
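A box plot conventionally draws its whiskers at 1.5 times the interquartile range (IQR), and the same rule can be applied in code. The sketch below is illustrative: the first ten quantities are the `manifest_quantity` values from the header rows printed earlier, and the final value of 50000 is a made-up extreme added to show the effect.

```python
import pandas as pd

# First ten values come from the header rows shown earlier;
# 50000 is a made-up extreme value added for illustration
quantities = pd.Series(
    [21, 3, 59, 168, 9, 224, 95, 349, 20, 2, 50000],
    name="manifest_quantity",
)

# Quartiles and interquartile range
q1, q3 = quantities.quantile([0.25, 0.75])
iqr = q3 - q1

# Standard box plot fences: anything outside them is flagged as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (quantities < lower) | (quantities > upper)
outliers = quantities[is_outlier]
print(outliers)
```

With matplotlib installed, `quantities.plot.box()` would draw the corresponding chart. Whether a flagged value is actually removed is still a judgment call, as discussed above.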

#### **Quality Standards:**
In my data cleaning, the quality standards enforced are:

- **Uniqueness**: Absolutely no duplicates!
- **Validity**: Data types, column values, and format of values are what they're supposed to be
- **Null value handling**: If null values are kept, how they will be managed
- **Outlier handling**: As described in the outliers section
- **Unnecessary Columns**: Removing any column that doesn't serve any purpose in my analytical process
- **Syntax Checks**: Ensure there are little to no typos (I already saw a misspelling in a column name)
- **Documentation**: Thorough documentation explaining the thought process behind each decision
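Some of these standards can be checked automatically. The sketch below is one possible shape for such a check, not a definitive implementation: `quality_report` is a hypothetical helper, the sample rows are made up, and treating `identifier` as a unique key is an assumption that may not hold for every table.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, key: str, expected_dtypes: dict[str, str]) -> dict[str, object]:
    """Collect simple uniqueness, validity, and null metrics for one table."""
    # Validity: columns whose actual dtype differs from the expected one
    wrong_dtypes = {
        column: str(df[column].dtype)
        for column, expected in expected_dtypes.items()
        if str(df[column].dtype) != expected
    }
    return {
        "duplicate_rows": int(df.duplicated().sum()),       # uniqueness of whole rows
        "duplicate_keys": int(df[key].duplicated().sum()),  # uniqueness of the key column
        "wrong_dtypes": wrong_dtypes,
        "null_counts": df.isna().sum().to_dict(),           # null value handling
    }


# Made-up rows: one exact duplicate and one missing piece_count
sample = pd.DataFrame({
    "identifier": [1, 2, 2, 3],
    "piece_count": [10.0, 5.0, 5.0, None],
})
report = quality_report(
    sample,
    key="identifier",
    expected_dtypes={"identifier": "int64", "piece_count": "int64"},
)
print(report)
```

A cleaned file would pass when both duplicate counts are zero and `wrong_dtypes` is empty.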


# Propose a solution:

- A complete Entity Relationship Diagram of your final normalized schema
- A standardized file structure to store the raw and transformed data
- You can choose your own case for the data or use the default
    - Default: Your db will be used to provide aggregated analysis on the breakdown of shipping activity for major US ports by type of goods

# Proposing a solution:
My database will be used to analyze shipping activity and the types of goods being shipped in 2019 and 2020.

The reasoning is that 2020 was the year the pandemic happened, so it'll be interesting to look into how COVID-19 affected the ports and what kinds of goods were shipped during the pandemic year.


## ERD

See attached file!

!['myerd'](./ERD.drawio.png)

## Standardized file structure

The files will be structured as follows:

- /data -> data folder that contains subfolders
    - /raw -> raw csv files are stored here
    - /cleaned -> errors removed, formatting fixed
    - /transformed -> calculations added, joins done if necessary

By dividing our files this way, we can have the data ready for any purpose we need it for.
Having these separate folders helps us understand what stage the data is in. This keeps the data we work with organized and consistent!
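The layout above can be created with a few lines of `pathlib`. This is a minimal sketch: `create_data_layout` is a hypothetical helper name, and the `root` argument (the folder that should contain `data/`) is an assumption about where the project lives.

```python
from pathlib import Path

# Stage folders from the structure described above
STAGES = ("raw", "cleaned", "transformed")


def create_data_layout(root: str) -> list[Path]:
    """Create data/raw, data/cleaned, and data/transformed under root."""
    created = []
    for stage in STAGES:
        stage_dir = Path(root) / "data" / stage
        # exist_ok=True makes the call safe to repeat on an existing layout
        stage_dir.mkdir(parents=True, exist_ok=True)
        created.append(stage_dir)
    return created
```

Calling `create_data_layout(".")` from the project root would set up all three folders in one step.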