<a href="https://colab.research.google.com/github/NIP-Data-Computation/show-and-tell/blob/master/reinar_consolidate_customs_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook generates a consolidated customs dataset with import data from 2012 to 2019 filtered to only the top 9 HS codes. 

**Author**: Reina Reyes <br>
**Date Created**: August 1, 2020 <br>
**Last Updated**: August 3, 2020 <br> 
**Description**: Consolidates import data from 8 files (**boc_lite_YYYY.csv**, *2012 to 2019*), filtered to the top 9 HS codes. The dataset is saved to the output file **boc_lite_2012_2019_top9_hscode.csv**. All files are in the shared Google Drive: *NIP-Data-Computation-Group-Drive > Datasets > PHL Customs Open Data > clean > csv*.

# Mount Google Drive and load input files

*Mount Google Drive*

In [1]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


*Test if mount is successful*


In [2]:
!ls drive/My\ Drive

 Application  'Colab Notebooks'   Misc	     Teaching
 CBT	       Engagements	  Research


*Define directory path of input and output files*

In [3]:
fn_dir = "/content/drive/My Drive/Research/NIP-Data-Computation/NIP-Data-Computation-Group-Drive/Datasets/PHL Customs Open Data/clean/csv/"
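
Before looping over the input files, it can help to confirm that the path above resolves to the expected files. The following is a minimal sketch and not part of the original notebook: the helper names `expected_input_files` and `missing_input_files` are hypothetical, and the default arguments assume the 8 yearly files (2012 to 2019) named in the description.

```python
from pathlib import Path


def expected_input_files(directory, start_year=2012, n_year=8):
    """Return the expected boc_lite_YYYY.csv paths for the given year range."""
    return [Path(directory) / f"boc_lite_{start_year + i}.csv" for i in range(n_year)]


def missing_input_files(directory, start_year=2012, n_year=8):
    """Return the expected input paths that do not exist on disk."""
    return [p for p in expected_input_files(directory, start_year, n_year)
            if not p.is_file()]
```

Calling `missing_input_files(fn_dir)` should return an empty list when the Drive is mounted and the directory path is correct.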

*Import libraries*

In [4]:
import pandas as pd
import numpy as np
import gc

*Loop over input files and filter to top HS codes*

We do this in 2 batches (2012 to 2015, then 2016 to 2019) to avoid exceeding Colab's memory allocation:

In [5]:
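start_year = 2012    # First year of input data
n_year = 8           # Number of yearly files (2012 to 2019)
n_batch = 4          # Number of files read in the first batch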
hscodes = [84733090000, 84799040000, 85429000000, 85415000000, 73269099000,
            84799030000, 85411000000, 64059000000, 40169390000]

# Initialize empty output dataframe
df_out = pd.DataFrame()

for i in np.arange(n_batch):
    fn_i = fn_dir + "boc_lite_%d.csv" % (start_year + i)
    df_i = pd.read_csv(fn_i, encoding = "ISO-8859-1")    # Specified encoding is required to avoid UnicodeError in Colab
    df_i_hs = df_i[df_i["hscode"].isin(hscodes)]
    df_out = pd.concat([df_out, df_i_hs]) 
    print("Read %d rows for year %d; filtered rows: %d; total rows: %d" % (len(df_i), start_year + i, len(df_i_hs), len(df_out)))

Read 1193628 rows for year 2012; filtered rows: 24498; total rows: 24498
Read 1225431 rows for year 2013; filtered rows: 28766; total rows: 53264


Read 1421241 rows for year 2014; filtered rows: 32550; total rows: 85814


Read 2236612 rows for year 2015; filtered rows: 132110; total rows: 217924
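
An alternative way to limit peak memory is to read each file in chunks and filter each chunk before keeping it, so that a full year's rows are never held in memory at once. This is a sketch under assumptions, not the approach used in this notebook: `read_filtered` is a hypothetical helper, the default `chunksize` is an arbitrary choice, and it has not been checked whether this removes the need for two batches in Colab.

```python
import pandas as pd


def read_filtered(path, hscodes, chunksize=200_000, encoding="ISO-8859-1"):
    """Read a CSV in chunks, keeping only rows whose hscode is in hscodes.

    Returns the filtered dataframe and the total number of rows read.
    """
    kept = []      # Filtered pieces, concatenated once at the end
    n_read = 0     # Running count of rows read from the file
    for chunk in pd.read_csv(path, encoding=encoding, chunksize=chunksize):
        n_read += len(chunk)
        kept.append(chunk[chunk["hscode"].isin(hscodes)])
    filtered = pd.concat(kept, ignore_index=True) if kept else pd.DataFrame()
    return filtered, n_read
```

Collecting the filtered pieces in a list and calling `pd.concat` once also avoids copying the growing output dataframe on every iteration.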


*Clear RAM*

In [6]:
gc.collect()

11

In [7]:
for i in np.arange(n_batch, n_year):    # Remaining years: 2016 to 2019
    fn_i = fn_dir + "boc_lite_%d.csv" % (start_year + i)
    df_i = pd.read_csv(fn_i, encoding = "ISO-8859-1")    # Specified encoding is required to avoid UnicodeError in Colab
    df_i_hs = df_i[df_i["hscode"].isin(hscodes)]
    df_out = pd.concat([df_out, df_i_hs]) 
    print("Read %d rows for year %d; filtered rows: %d; total rows: %d" % (len(df_i), start_year + i, len(df_i_hs), len(df_out)))

Read 3140436 rows for year 2016; filtered rows: 232092; total rows: 450016


Read 3490131 rows for year 2017; filtered rows: 240926; total rows: 690942
Read 3753118 rows for year 2018; filtered rows: 260909; total rows: 951851
Read 3794763 rows for year 2019; filtered rows: 239027; total rows: 1190878


In [8]:
gc.collect()

0

*Check total no. of rows in output dataframe*


In [9]:
len(df_out) 

1190878
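
Beyond the row count, one could also confirm that the output contains only the selected HS codes and see how the rows are distributed among them. The following is a minimal sketch, not part of the original notebook; `check_hscodes` is a hypothetical helper name.

```python
import pandas as pd


def check_hscodes(df, hscodes):
    """Return (rows per HS code, listed HS codes that never appear in df).

    Raises AssertionError if df contains an HS code outside the list.
    """
    counts = df["hscode"].value_counts()
    unexpected = set(counts.index) - set(hscodes)
    assert not unexpected, f"Unexpected HS codes in output: {sorted(unexpected)}"
    absent = sorted(set(hscodes) - set(counts.index))
    return counts, absent
```

With the variables defined above, `counts, absent = check_hscodes(df_out, hscodes)` would give the number of rows per HS code and any listed code with no matching rows.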

# Save and test output file


*Save output dataframe to CSV file (also in Google Drive)*

In [13]:
fn_out = fn_dir + "boc_lite_2012_2019_top9_hscode.csv"
df_out.to_csv(fn_out, encoding = "ISO-8859-1")

*Test and time loading of consolidated file*

In [15]:
%time df = pd.read_csv(fn_out, encoding = "ISO-8859-1")



CPU times: user 4.68 s, sys: 245 ms, total: 4.92 s
Wall time: 5.15 s
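
Note that `to_csv` writes the dataframe index as an unnamed first column by default, so a file saved as above reloads with an extra `Unnamed: 0` column. The sketch below shows a round-trip check that passes `index=False` instead. This is a suggested variation, not what the save cell above does, and `roundtrip_matches` is a hypothetical helper name.

```python
import pandas as pd


def roundtrip_matches(df, path, encoding="ISO-8859-1"):
    """Write df without its index, reload it, and compare shape and column names."""
    df.to_csv(path, index=False, encoding=encoding)
    reloaded = pd.read_csv(path, encoding=encoding)
    return reloaded.shape == df.shape and list(reloaded.columns) == list(df.columns)
```

Because the output dataframe is built by concatenating yearly frames, its index values repeat across years, so dropping the index on save loses no information about the rows.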
