Introduction

The US border crossing dataset contains information about inbound crossings at the US-Canada and US-Mexico borders. It records the number of containers, vehicles, and passengers entering the United States.

Data Exploration

In [1]:
your_local_path = "D:/Premy/UPX/ML/Practice/"

In [2]:
# importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
# Load the dataset into a DataFrame
border_data_df = pd.read_csv(your_local_path + 'Border_Crossing_Entry_Data.csv')

In [4]:
border_data_df.head()

Unnamed: 0,Port Name,State,Port Code,Border,Date,Measure,Value,Location
0,Calexico East,California,2507,US-Mexico Border,03/01/2019 12:00:00 AM,Trucks,34447,POINT (-115.48433000000001 32.67524)
1,Van Buren,Maine,108,US-Canada Border,03/01/2019 12:00:00 AM,Rail Containers Full,428,POINT (-67.94271 47.16207)
2,Otay Mesa,California,2506,US-Mexico Border,03/01/2019 12:00:00 AM,Trucks,81217,POINT (-117.05333 32.57333)
3,Nogales,Arizona,2604,US-Mexico Border,03/01/2019 12:00:00 AM,Trains,62,POINT (-110.93361 31.340279999999996)
4,Trout River,New York,715,US-Canada Border,03/01/2019 12:00:00 AM,Personal Vehicle Passengers,16377,POINT (-73.44253 44.990010000000005)


In [4]:
border_data_df.isnull().any()

Port Name    False
State        False
Port Code    False
Border       False
Date         False
Measure      False
Value        False
Location     False
dtype: bool

None of the columns contain null values.
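
`isnull().any()` only says whether a column has at least one missing value. A per-column count is often more informative when something is missing. The sketch below uses a small made-up frame (`sample_df` and its values are illustrative, not rows from the dataset) to show the pattern:

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame; the values are illustrative only
sample_df = pd.DataFrame({
    "Port Name": ["Calexico East", "Van Buren", None],
    "Value": [34447, np.nan, 62],
})

# Number of missing values in each column
null_counts = sample_df.isnull().sum()
print(null_counts)

# Total number of missing values across the whole frame
total_nulls = int(null_counts.sum())
print(total_nulls)
```

Applied to `border_data_df`, the same call should report zero for every column, in line with the check above.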

In [5]:
border_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 346733 entries, 0 to 346732
Data columns (total 8 columns):
Port Name    346733 non-null object
State        346733 non-null object
Port Code    346733 non-null int64
Border       346733 non-null object
Date         346733 non-null object
Measure      346733 non-null object
Value        346733 non-null int64
Location     346733 non-null object
dtypes: int64(2), object(6)
memory usage: 21.2+ MB


The Date column is stored as strings (dtype object) and needs to be converted to a datetime type.

In [6]:
border_data_df['Date'] = pd.to_datetime(border_data_df['Date'])
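
Without a `format` argument, `pd.to_datetime` has to infer the layout of the strings. Passing the format explicitly is usually faster on a frame of this size and fails loudly if a value does not match. The sketch below assumes every date follows the layout seen in the `head()` output (`03/01/2019 12:00:00 AM`); `raw_dates` is a made-up sample series:

```python
import pandas as pd

# Sample strings in the same layout as the Date column shown by head()
raw_dates = pd.Series(["03/01/2019 12:00:00 AM", "02/01/2019 12:00:00 AM"])

# Explicit format: month/day/year, 12-hour clock with an AM/PM marker
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y %I:%M:%S %p")
print(parsed.dt.year.tolist(), parsed.dt.month.tolist())
```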

In [9]:
border_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 346733 entries, 0 to 346732
Data columns (total 8 columns):
Port Name    346733 non-null object
State        346733 non-null object
Port Code    346733 non-null int64
Border       346733 non-null object
Date         346733 non-null datetime64[ns]
Measure      346733 non-null object
Value        346733 non-null int64
Location     346733 non-null object
dtypes: datetime64[ns](1), int64(2), object(5)
memory usage: 21.2+ MB


In [10]:
# Get unique borders from the dataset
print(border_data_df['Border'].unique())

['US-Mexico Border' 'US-Canada Border']


In [12]:
# Get unique years from the dataset
print(border_data_df['Date'].dt.year.unique())

[2019 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006
 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996]


In [14]:
# Number of distinct port names
print(border_data_df['Port Name'].nunique())

116


In [15]:
# Number of distinct port codes
print(border_data_df['Port Code'].nunique())

117


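The Port Name and Port Code counts differ (116 names versus 117 codes), so at least one port name is shared by more than one port code. This discrepancy needs to be resolved.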
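
One way to locate such a mismatch is to count the distinct codes attached to each name and keep the names with more than one. This is a minimal sketch on a made-up frame (`sample_df` is illustrative), not necessarily the approach the notebook takes:

```python
import pandas as pd

# Hypothetical sample: one name shared by two codes, one name with a single code
sample_df = pd.DataFrame({
    "Port Name": ["Eastport", "Eastport", "Nogales", "Nogales"],
    "Port Code": [3302, 103, 2604, 2604],
})

# Distinct port codes per port name
codes_per_name = sample_df.groupby("Port Name")["Port Code"].nunique()

# Names that map to more than one code
ambiguous_names = codes_per_name[codes_per_name > 1]
print(ambiguous_names)
```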

In [16]:
# Number of distinct location strings
print(border_data_df['Location'].nunique())

224


In [18]:
# Unique (Port Name, Port Code) pairs; the original row index is kept
ports = border_data_df[['Port Name', 'Port Code']].drop_duplicates()

In [19]:
# Port names that appear with more than one port code
ports[ports['Port Name'].duplicated(keep=False)]

Unnamed: 0,Port Name,Port Code
29,Eastport,3302
217,Eastport,103


In [21]:
# 29 and 217 are index labels taken from `ports`, so select by label
border_data_df.loc[[29, 217]]

Unnamed: 0,Port Name,State,Port Code,Border,Date,Measure,Value,Location
29,Eastport,Idaho,3302,US-Canada Border,2019-03-01,Trains,101,POINT (-116.18027999999998 48.99944)
217,Eastport,Maine,103,US-Canada Border,2019-03-01,Trucks,165,POINT (-66.99387 44.90357)


In [22]:
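# Rename the Idaho port so that each Port Name maps to exactly one Port Code
eastport_idaho = (border_data_df['Port Name'] == 'Eastport') & (border_data_df['State'] == 'Idaho')
border_data_df.loc[eastport_idaho, 'Port Name'] = 'Eastport, ID'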

In [23]:
# Confirm the rename on the same two rows (selected by index label)
border_data_df.loc[[29, 217]]

Unnamed: 0,Port Name,State,Port Code,Border,Date,Measure,Value,Location
29,"Eastport, ID",Idaho,3302,US-Canada Border,2019-03-01,Trains,101,POINT (-116.18027999999998 48.99944)
217,Eastport,Maine,103,US-Canada Border,2019-03-01,Trucks,165,POINT (-66.99387 44.90357)
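
The rename above is hard-coded for one port. If more shared names turned up, the same idea could be applied to every ambiguous name at once. The sketch below is one possible generalization under stated assumptions: `sample_df` is made up, and `state_abbrev` is a hypothetical lookup that would need an entry for each affected state. Unlike the cell above, it suffixes every port that shares a name, not just the Idaho one.

```python
import pandas as pd

# Hypothetical sample with one shared port name
sample_df = pd.DataFrame({
    "Port Name": ["Eastport", "Eastport", "Nogales"],
    "State": ["Idaho", "Maine", "Arizona"],
    "Port Code": [3302, 103, 2604],
})

# Hypothetical lookup; extend it to cover whichever states need it
state_abbrev = {"Idaho": "ID", "Maine": "ME", "Arizona": "AZ"}

# For each row, the number of distinct codes that share its port name
codes_per_name = sample_df.groupby("Port Name")["Port Code"].transform("nunique")
is_ambiguous = codes_per_name > 1

# Append the state abbreviation only to the ambiguous rows
sample_df.loc[is_ambiguous, "Port Name"] = (
    sample_df.loc[is_ambiguous, "Port Name"]
    + ", "
    + sample_df.loc[is_ambiguous, "State"].map(state_abbrev)
)
print(sample_df["Port Name"].tolist())
```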


In [24]:
# Distinct port names after the rename
print(border_data_df['Port Name'].nunique())

117


In [25]:
# Distinct port codes, for comparison
print(border_data_df['Port Code'].nunique())

117
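
Matching counts are consistent with a one-to-one mapping between names and codes, but they do not prove it, because two independent mismatches could cancel out. A stricter check looks at the unique (name, code) pairs directly. This is a minimal sketch on a made-up frame (`sample_df` is illustrative and mirrors the data after the rename):

```python
import pandas as pd

# Hypothetical sample mirroring the state of the data after the rename
sample_df = pd.DataFrame({
    "Port Name": ["Eastport, ID", "Eastport", "Nogales", "Nogales"],
    "Port Code": [3302, 103, 2604, 2604],
})

# Unique (name, code) pairs
pairs = sample_df[["Port Name", "Port Code"]].drop_duplicates()

# One-to-one only if no name and no code appears in more than one pair
one_to_one = pairs["Port Name"].is_unique and pairs["Port Code"].is_unique
print(one_to_one)
```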
