# The Canada/US Border Crossing Dataset: Transborder Trucking


*I suspect that the 'trucks' in the border crossing dataset do not reflect trucking activity*

## Load Data

**Import Libraries**

In [1]:
import pandas as pd #working with dataframes

**Load Data**

In [4]:
trucking_df_000 = pd.read_csv('23100219.csv')

**Dataset Versions**

* 000 - raw dataset
* 001 - reduce column 'Shipment type' to a single class 'Transborder shipments'
* 002 - reduce column 'Trucking industry activity' to a single class 'Shipments'
* 003 - remove columns with one or fewer unique values
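The version steps listed above can be sketched as one chain. This is a minimal illustration, not the notebook's actual cells: `build_versions` is a hypothetical helper, and `toy_raw` is made-up data that borrows only the column names and class labels from the real file.

```python
import pandas as pd

def build_versions(raw: pd.DataFrame) -> dict:
    """Hypothetical helper: derive versions 001-003 from the raw frame (000)."""
    # 001: keep only the 'Transborder shipments' class
    v001 = raw[raw["Shipment type"] == "Transborder shipments"]
    # 002: keep only the 'Shipments' activity
    v002 = v001[v001["Trucking industry activity"] == "Shipments"]
    # 003: drop columns with one or fewer unique values
    counts = v002.nunique()
    v003 = v002.drop(columns=counts[counts <= 1].index)
    return {"000": raw, "001": v001, "002": v002, "003": v003}

# Toy stand-in for the raw dataset (values invented for illustration only)
toy_raw = pd.DataFrame({
    "REF_DATE": [2004, 2004, 2005, 2005, 2005],
    "GEO": ["Canada"] * 5,
    "Shipment type": [
        "Transborder shipments", "Domestic shipments", "Transborder shipments",
        "Transborder shipments", "All shipments",
    ],
    "Trucking industry activity": [
        "Shipments", "Shipments", "Shipments", "Weight", "Shipments",
    ],
    "VALUE": [10.0, 20.0, 11.0, 5.0, 30.0],
})

versions = build_versions(toy_raw)
for name, frame in versions.items():
    print(name, frame.shape)
```

Keeping every intermediate version in a dictionary mirrors the numbered naming scheme, so any step can be inspected without re-running the earlier ones.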

## Data Exploration

In [19]:
trucking_df_003.head()

Unnamed: 0,REF_DATE,VALUE,STATUS
18,2004,9824132.0,B
63,2005,9937695.0,B
108,2006,9679201.0,B
153,2007,9614467.0,A
198,2008,8956462.0,A


In [6]:
trucking_df_000.shape

(630, 16)

### How many NaNs are in each column?

In [7]:
trucking_df_000.isna().sum()

REF_DATE                        0
GEO                             0
DGUID                           0
Shipment type                   0
Trucking industry activity      0
UOM                             0
UOM_ID                          0
SCALAR_FACTOR                   0
SCALAR_ID                       0
VECTOR                          0
COORDINATE                      0
VALUE                           0
STATUS                          0
SYMBOL                        630
TERMINATED                    630
DECIMALS                        0
dtype: int64

*Only two columns, SYMBOL and TERMINATED, contain NaNs, and both are empty in all 630 rows*
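Columns that are missing in every row can also be picked out programmatically instead of by reading the counts. The sketch below assumes a small made-up frame, `toy_df`, rather than the real dataset; only the column names are taken from the output above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw dataset (values invented for illustration only)
toy_df = pd.DataFrame({
    "REF_DATE": [2004, 2005, 2006],
    "VALUE": [1.0, 2.0, 3.0],
    "SYMBOL": [np.nan, np.nan, np.nan],
    "TERMINATED": [np.nan, np.nan, np.nan],
})

# Fraction of missing values per column (1.0 means the column is entirely NaN)
nan_share = toy_df.isna().mean()

# Names of the columns that carry no data at all
all_nan_cols = nan_share[nan_share == 1.0].index.tolist()
print(all_nan_cols)
```

Working with the share rather than the raw count makes the check independent of the number of rows.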

### How many unique values are in each column?

In [16]:
trucking_df_002.nunique()

REF_DATE                      14
GEO                            1
DGUID                          1
Shipment type                  1
Trucking industry activity     1
UOM                            1
UOM_ID                         1
SCALAR_FACTOR                  1
SCALAR_ID                      1
VECTOR                         1
COORDINATE                     1
VALUE                         14
STATUS                         2
SYMBOL                         0
TERMINATED                     0
DECIMALS                       1
dtype: int64

*Any column with one or fewer unique values can be removed, since it has no variance to explore*

### How many shipments went across the border?

In [20]:
trucking_df_003['VALUE'].sum()

140108198.0

In [21]:
trucking_df_003

Unnamed: 0,REF_DATE,VALUE,STATUS
18,2004,9824132.0,B
63,2005,9937695.0,B
108,2006,9679201.0,B
153,2007,9614467.0,A
198,2008,8956462.0,A
243,2009,7823969.0,A
288,2010,9273881.0,B
333,2011,9301133.0,B
378,2012,10849871.0,A
423,2013,11328379.0,A
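The listing above shows 10 of the 14 rows in `trucking_df_003`, so the displayed values alone do not add up to the reported total. As a quick check, the sketch below re-enters the displayed rows by hand (the frame name `shown` is an assumption for illustration) and compares their sum with the total printed earlier.

```python
import pandas as pd

# The ten rows displayed above, re-entered by hand
shown = pd.DataFrame({
    "REF_DATE": [2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013],
    "VALUE": [
        9824132.0, 9937695.0, 9679201.0, 9614467.0, 8956462.0,
        7823969.0, 9273881.0, 9301133.0, 10849871.0, 11328379.0,
    ],
})

reported_total = 140108198.0          # output of trucking_df_003['VALUE'].sum()
shown_total = shown["VALUE"].sum()    # sum of the displayed rows only
remainder = reported_total - shown_total

print(shown_total)   # 96589190.0
print(remainder)     # 43519008.0, carried by the four rows not displayed
```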


## Data Munging


**Reduce 'Shipment type' to a single class 'Transborder shipments'**

In [13]:
# Let's look at the unique values for 'Shipment type'
print(trucking_df_000['Shipment type'].unique())

# For this study we are only interested in the 'Transborder shipments'
trucking_df_001 = trucking_df_000[trucking_df_000['Shipment type'] == 'Transborder shipments']
print(trucking_df_000.shape)
print(trucking_df_001.shape)

['All shipments' 'Domestic shipments' 'Transborder shipments'
 'Local shipments' 'Long distance shipments']
(630, 16)
(126, 16)


*This reduced the number of rows by 80%*

**Reduce 'Trucking industry activity' to a single class 'Shipments'**

In [15]:
# Let's look at the unique values for 'Trucking industry activity'
print(trucking_df_001['Trucking industry activity'].unique())

# For this study we are only interested in the 'Shipments'
trucking_df_002 = trucking_df_001[trucking_df_001['Trucking industry activity'] == 'Shipments']
print(trucking_df_000.shape)
print(trucking_df_002.shape)

['Shipments' 'Weight' 'Distance' 'Tonne-kilometres' 'Revenue'
 'Weight per shipment' 'Distance per shipment' 'Revenue per shipment'
 'Revenue per tonne-kilometre']
(630, 16)
(14, 16)


*This reduced the number of rows by 98% relative to the raw dataset*

**Drop columns that have one or fewer unique values**

In [17]:
column_unique = trucking_df_002.nunique() # number of unique values in each column

# collect the names of columns with one or fewer unique values;
# a boolean mask avoids positional indexing into a label-indexed Series
drop_nonunique = column_unique[column_unique <= 1].index.tolist()

drop_nonunique

['GEO',
 'DGUID',
 'Shipment type',
 'Trucking industry activity',
 'UOM',
 'UOM_ID',
 'SCALAR_FACTOR',
 'SCALAR_ID',
 'VECTOR',
 'COORDINATE',
 'SYMBOL',
 'TERMINATED',
 'DECIMALS']

In [18]:
trucking_df_003 = trucking_df_002.drop(drop_nonunique, axis = 1)
print(trucking_df_000.shape)
print(trucking_df_003.shape)

(630, 16)
(14, 3)


*This reduced the number of columns by 81%*
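The percentages quoted in this section can be reproduced from the printed shapes. The helper below, `pct_reduction`, is a hypothetical name introduced here for illustration; the inputs are the row and column counts shown in the outputs above.

```python
def pct_reduction(before: int, after: int) -> float:
    """Percentage by which a count shrank, relative to the starting count."""
    return 100.0 * (before - after) / before

# Rows: raw dataset -> transborder shipments only
rows_001 = round(pct_reduction(630, 126), 2)
# Rows: raw dataset -> transborder 'Shipments' only
rows_002 = round(pct_reduction(630, 14), 2)
# Columns: 16 -> 3 after dropping columns with one or fewer unique values
cols_003 = round(pct_reduction(16, 3), 2)

print(rows_001, rows_002, cols_003)  # 80.0 97.78 81.25
```

All three figures are measured against the raw dataset, which is why the second step reads as 98% rather than the 89% drop from 126 rows to 14.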