### 🔍 Dataset Features Breakdown

**1. Shipment & Trade Information**
| Feature     | Description |
|-------------|-------------|
| `YEAR`      | Year of shipment |
| `MONTH`     | Shipment month |
| `TRDTYPE`   | Trade type: 1 = Export, 2 = Import |
| `COMMODITY2`| 2-digit commodity classification |

**2. Geographic Origins & Destinations**
| Feature     | Description |
|-------------|-------------|
| `USASTATE`  | U.S. state code for shipment |
| `MEXSTATE`  | Mexican state code (if applicable) |
| `CANPROV`   | Canadian province code (if applicable) |
| `COUNTRY`   | Country code (e.g., Canada = 1220, Mexico = 2010) |
| `DEPE`      | Code for port/district where shipment is processed |

**3. Transportation Logistics**
| Feature     | Description |
|-------------|-------------|
| `DISAGMOT`  | Mode of transport (1 = Vessel, 3 = Air, 4 = Mail, 5 = Truck, 6 = Rail, 7 = Pipeline, 8 = Other, 9 = Foreign Trade Zones) |
| `CONTCODE`  | Indicates if shipment is containerized (X = Yes, 0 = No) |

**4. Economic & Cost Metrics**
| Feature         | Description |
|------------------|-------------|
| `VALUE`          | Monetary value of shipment (USD) |
| `SHIPWT`         | Shipping weight (kg) |
| `FREIGHT_CHARGES`| Freight costs (USD) |
| `DF`             | 1 = Domestic product, 2 = Foreign product |

--- 

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

# to suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Directory holding the monthly CSV files; forward slashes throughout avoid the invalid "\A" escape
csv_dir = "C:/Users/Aisha Kanyiti/Desktop/Azubi Frieght Project/2020/2020"
files = [f for f in os.listdir(csv_dir) if f.endswith(".csv")]

print(files)


['dot1_0120.csv', 'dot1_0220.csv', 'dot1_0320.csv', 'dot1_0420.csv', 'dot1_0520.csv', 'dot1_0620.csv', 'dot1_0720.csv', 'dot1_0820.csv', 'dot1_0920.csv', 'dot2_0120.csv', 'dot2_0220.csv', 'dot2_0320.csv', 'dot2_0420.csv', 'dot2_0520.csv', 'dot2_0620.csv', 'dot2_0720.csv', 'dot2_0820.csv', 'dot2_0920.csv', 'dot3_0120.csv', 'dot3_0220.csv', 'dot3_0320.csv', 'dot3_0420.csv', 'dot3_0520.csv', 'dot3_0620.csv', 'dot3_0720.csv', 'dot3_0820.csv', 'dot3_0920.csv']


In [3]:
csv_dir = "C:/Users/Aisha Kanyiti/Desktop/Azubi Frieght Project/2020/2020"

# Get a list of all .csv files in the directory
csv_files = [file for file in os.listdir(csv_dir) if file.endswith(".csv")]

# Create an empty list to hold DataFrames
all_dataframes = []

# Loop through each CSV file and read it into a DataFrame
for file in csv_files:
    file_path = os.path.join(csv_dir, file)
    
    try:
        df = pd.read_csv(file_path)
        df["source_file"] = file  # Add filename for tracking origin
        all_dataframes.append(df)
    except Exception as e:
        print(f"⚠️ Could not read {file}: {e}")

# Combine all DataFrames into one
df = pd.concat(all_dataframes, ignore_index=True)
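
The file names listed earlier appear to follow a `dot<N>_<MMYY>.csv` pattern with three prefixes (`dot1`, `dot2`, `dot3`). Before `source_file` is dropped in the next cell, it could be parsed to record which prefix each row came from. The sketch below assumes that naming pattern, which is inferred only from the listing; `parse_source_file` is a hypothetical helper, not part of the original notebook.

```python
import re

# Assumed naming pattern, inferred only from the file listing: dot<table>_<MM><YY>.csv
FILENAME_PATTERN = re.compile(r"^dot(\d)_(\d{2})(\d{2})\.csv$")


def parse_source_file(name):
    """Return (table, month, year) for a name like 'dot1_0120.csv', or None if it does not match."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    table, month, year = match.groups()
    return int(table), int(month), 2000 + int(year)


print(parse_source_file("dot1_0120.csv"))
print(parse_source_file("dot3_0920.csv"))
print(parse_source_file("notes.txt"))
```

Applied to the `source_file` column, the first element of the returned tuple could be stored as a new column so the three file groups remain distinguishable after concatenation.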



In [4]:
df.drop(['source_file'],inplace=True,axis=1)

In [5]:
df.head()

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,1,AK,07XX,3,,XA,1220,3302,378,125,1.0,X,1,2020,
1,1,AK,20XX,3,,XA,1220,133362,137,1563,1.0,X,1,2020,
2,1,AK,20XX,3,,XA,1220,49960,66,2631,2.0,X,1,2020,
3,1,AK,20XX,3,,XC,1220,21184,3418,795,1.0,X,1,2020,
4,1,AK,20XX,3,,XM,1220,4253,2,75,1.0,X,1,2020,


In [6]:
df.isna().sum()

TRDTYPE                 0
USASTATE           148522
DEPE               634881
DISAGMOT                0
MEXSTATE           738672
CANPROV            482158
COUNTRY                 0
VALUE                   0
SHIPWT                  0
FREIGHT_CHARGES         0
DF                 344964
CONTCODE                0
MONTH                   0
YEAR                    0
COMMODITY2         232029
dtype: int64

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1015432 entries, 0 to 1015431
Data columns (total 15 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   TRDTYPE          1015432 non-null  int64  
 1   USASTATE         866910 non-null   object 
 2   DEPE             380551 non-null   object 
 3   DISAGMOT         1015432 non-null  int64  
 4   MEXSTATE         276760 non-null   object 
 5   CANPROV          533274 non-null   object 
 6   COUNTRY          1015432 non-null  int64  
 7   VALUE            1015432 non-null  int64  
 8   SHIPWT           1015432 non-null  int64  
 9   FREIGHT_CHARGES  1015432 non-null  int64  
 10  DF               670468 non-null   float64
 11  CONTCODE         1015432 non-null  object 
 12  MONTH            1015432 non-null  int64  
 13  YEAR             1015432 non-null  int64  
 14  COMMODITY2       783403 non-null   float64
dtypes: float64(2), int64(8), object(5)
memory usage: 116.2+ MB


In [8]:
df.describe()

Unnamed: 0,TRDTYPE,DISAGMOT,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,MONTH,YEAR,COMMODITY2
count,1015432.0,1015432.0,1015432.0,1015432.0,1015432.0,1015432.0,670468.0,1015432.0,1015432.0,783403.0
mean,1.339721,4.754051,1526.534,2271701.0,1226584.0,32761.61,1.3255,4.98801,2020.0,56.407174
std,0.4736148,1.246482,384.9662,26968990.0,36893000.0,777787.3,0.468562,2.613486,0.0,27.945309
min,1.0,1.0,1220.0,0.0,0.0,0.0,1.0,1.0,2020.0,1.0
25%,1.0,5.0,1220.0,13368.0,0.0,0.0,1.0,3.0,2020.0,33.0
50%,1.0,5.0,1220.0,63187.0,0.0,293.0,1.0,5.0,2020.0,59.0
75%,2.0,5.0,2010.0,366050.0,2805.0,2867.0,2.0,7.0,2020.0,84.0
max,2.0,9.0,2010.0,2661210000.0,7222164000.0,155846500.0,2.0,9.0,2020.0,99.0


In [9]:
df.duplicated().sum()

np.int64(0)

In [10]:
# Drop duplicate rows; .copy() makes `data` an independent frame for the column assignments below
data = df.drop_duplicates().copy()
data

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,1,AK,07XX,3,,XA,1220,3302,378,125,1.0,X,1,2020,
1,1,AK,20XX,3,,XA,1220,133362,137,1563,1.0,X,1,2020,
2,1,AK,20XX,3,,XA,1220,49960,66,2631,2.0,X,1,2020,
3,1,AK,20XX,3,,XC,1220,21184,3418,795,1.0,X,1,2020,
4,1,AK,20XX,3,,XM,1220,4253,2,75,1.0,X,1,2020,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1015427,2,,55XX,5,,,1220,7848059,7709,2818,,0,9,2020,98.0
1015428,2,,55XX,8,,,1220,30822,11982,1250,,0,9,2020,98.0
1015429,2,,60XX,8,,,1220,111628,2732,513,,0,9,2020,89.0
1015430,2,,70XX,8,,,1220,224120328,0,0,,0,9,2020,99.0


In [11]:
data.duplicated().sum()

np.int64(0)

In [12]:
data.isnull().sum()

TRDTYPE                 0
USASTATE           148522
DEPE               634881
DISAGMOT                0
MEXSTATE           738672
CANPROV            482158
COUNTRY                 0
VALUE                   0
SHIPWT                  0
FREIGHT_CHARGES         0
DF                 344964
CONTCODE                0
MONTH                   0
YEAR                    0
COMMODITY2         232029
dtype: int64

In [13]:
# Mapping of U.S. State Codes to State Names
us_state_mapping = {
    "AL": "Alabama", "AK": "Alaska", "AS": "American Samoa", "AZ": "Arizona",
    "AR": "Arkansas", "CA": "California", "CO": "Colorado", "CT": "Connecticut",
    "DE": "Delaware", "DC": "District of Columbia", "FL": "Florida", "GA": "Georgia",
    "HI": "Hawaii", "ID": "Idaho", "IL": "Illinois", "IN": "Indiana", "IA": "Iowa",
    "KS": "Kansas", "KY": "Kentucky", "LA": "Louisiana", "ME": "Maine", "MD": "Maryland",
    "MA": "Massachusetts", "MI": "Michigan", "MN": "Minnesota", "MS": "Mississippi",
    "MO": "Missouri", "MT": "Montana", "NE": "Nebraska", "NV": "Nevada",
    "NH": "New Hampshire", "NJ": "New Jersey", "NM": "New Mexico", "NY": "New York",
    "NC": "North Carolina", "ND": "North Dakota", "OH": "Ohio", "OK": "Oklahoma",
    "OR": "Oregon", "PA": "Pennsylvania", "RI": "Rhode Island", "SC": "South Carolina",
    "SD": "South Dakota", "TN": "Tennessee", "TX": "Texas", "UT": "Utah",
    "VT": "Vermont", "VA": "Virginia", "WA": "Washington", "WV": "West Virginia",
    "WI": "Wisconsin", "WY": "Wyoming", "DU": "Unknown"
}

    
# Replace state codes with state names
data["USASTATE"] = data["USASTATE"].map(us_state_mapping)

data.head(10)

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,1,Alaska,07XX,3,,XA,1220,3302,378,125,1.0,X,1,2020,
1,1,Alaska,20XX,3,,XA,1220,133362,137,1563,1.0,X,1,2020,
2,1,Alaska,20XX,3,,XA,1220,49960,66,2631,2.0,X,1,2020,
3,1,Alaska,20XX,3,,XC,1220,21184,3418,795,1.0,X,1,2020,
4,1,Alaska,20XX,3,,XM,1220,4253,2,75,1.0,X,1,2020,
5,1,Alaska,20XX,3,,XO,1220,26587,413,555,1.0,X,1,2020,
6,1,Alaska,20XX,3,,XO,1220,128192,5193,266,2.0,X,1,2020,
7,1,Alaska,20XX,3,,XQ,1220,4597,526,482,1.0,X,1,2020,
8,1,Alaska,20XX,3,XX,,2010,2970,1,0,1.0,0,1,2020,
9,1,Alaska,2304,5,CM,,2010,125986,0,0,1.0,0,1,2020,


In [14]:
# Mapping of Trade Type Codes to Trade Type Names
trade_type_mapping = {
    1: "Export",
    2: "Import"
}

# Replace trade type codes with trade type names
data["TRDTYPE"] = data["TRDTYPE"].map(trade_type_mapping)


data.head(5)

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,07XX,3,,XA,1220,3302,378,125,1.0,X,1,2020,
1,Export,Alaska,20XX,3,,XA,1220,133362,137,1563,1.0,X,1,2020,
2,Export,Alaska,20XX,3,,XA,1220,49960,66,2631,2.0,X,1,2020,
3,Export,Alaska,20XX,3,,XC,1220,21184,3418,795,1.0,X,1,2020,
4,Export,Alaska,20XX,3,,XM,1220,4253,2,75,1.0,X,1,2020,


In [15]:
# Create a mapping for the DISAGMOT column
disagmot_mapping = {
    1: "Vessel",
    3: "Air",
    4: "Mail (U.S. Postal Service)",
    5: "Truck",
    6: "Rail",
    7: "Pipeline",
    8: "Other",
    9: "Foreign Trade Zones (FTZs)"
}

# Apply the mapping to the 'DISAGMOT' column
data['DISAGMOT'] = data['DISAGMOT'].map(disagmot_mapping)


data.head(5)

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,07XX,Air,,XA,1220,3302,378,125,1.0,X,1,2020,
1,Export,Alaska,20XX,Air,,XA,1220,133362,137,1563,1.0,X,1,2020,
2,Export,Alaska,20XX,Air,,XA,1220,49960,66,2631,2.0,X,1,2020,
3,Export,Alaska,20XX,Air,,XC,1220,21184,3418,795,1.0,X,1,2020,
4,Export,Alaska,20XX,Air,,XM,1220,4253,2,75,1.0,X,1,2020,


In [16]:
# Create a mapping for the CANPROV column
canprov_mapping = {
    'XA': 'Alberta',
    'XC': 'British Columbia',
    'XM': 'Manitoba',
    'XB': 'New Brunswick',
    'XW': 'Newfoundland',
    'XT': 'Northwest Territories',
    'XN': 'Nova Scotia',
    'XO': 'Ontario',
    'XP': 'Prince Edward Island',
    'XQ': 'Quebec',
    'XS': 'Saskatchewan',
    'XV': 'Nunavut',
    'XY': 'Yukon Territory',
    'OT': 'Province Unknown'
}

# Apply the mapping to the 'CANPROV' column
data['CANPROV'] = data['CANPROV'].map(canprov_mapping)

data.head(5)

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,07XX,Air,,Alberta,1220,3302,378,125,1.0,X,1,2020,
1,Export,Alaska,20XX,Air,,Alberta,1220,133362,137,1563,1.0,X,1,2020,
2,Export,Alaska,20XX,Air,,Alberta,1220,49960,66,2631,2.0,X,1,2020,
3,Export,Alaska,20XX,Air,,British Columbia,1220,21184,3418,795,1.0,X,1,2020,
4,Export,Alaska,20XX,Air,,Manitoba,1220,4253,2,75,1.0,X,1,2020,


In [17]:
# Create a mapping for the MEXSTATE column
mexstate_mapping = {
    'AG': 'Aguascalientes',
    'BC': 'Baja California',
    'BN': 'Baja California Norte',
    'BS': 'Baja California Sur',
    'CH': 'Chihuahua',
    'CL': 'Colima',
    'CM': 'Campeche',
    'CO': 'Coahuila',
    'CS': 'Chiapas',
    'DF': 'Distrito Federal',
    'DG': 'Durango',
    'GR': 'Guerrero',
    'GT': 'Guanajuato',
    'HG': 'Hidalgo',
    'JA': 'Jalisco',
    'MI': 'Michoacán',
    'MO': 'Morelos',
    'MX': 'Estado de Mexico',
    'NA': 'Nayarit',
    'NL': 'Nuevo Leon',
    'OA': 'Oaxaca',
    'PU': 'Puebla',
    'QR': 'Quintana Roo',
    'QT': 'Queretaro',
    'SI': 'Sinaloa',
    'SL': 'San Luis Potosi',
    'SO': 'Sonora',
    'TB': 'Tabasco',
    'TL': 'Tlaxcala',
    'TM': 'Tamaulipas',
    'VE': 'Veracruz',
    'YU': 'Yucatan',
    'ZA': 'Zacatecas',
    'OT': 'State Unknown'
}

# Apply the mapping to the 'MEXSTATE' column
data['MEXSTATE'] = data['MEXSTATE'].map(mexstate_mapping)


data.head(20)

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,07XX,Air,,Alberta,1220,3302,378,125,1.0,X,1,2020,
1,Export,Alaska,20XX,Air,,Alberta,1220,133362,137,1563,1.0,X,1,2020,
2,Export,Alaska,20XX,Air,,Alberta,1220,49960,66,2631,2.0,X,1,2020,
3,Export,Alaska,20XX,Air,,British Columbia,1220,21184,3418,795,1.0,X,1,2020,
4,Export,Alaska,20XX,Air,,Manitoba,1220,4253,2,75,1.0,X,1,2020,
5,Export,Alaska,20XX,Air,,Ontario,1220,26587,413,555,1.0,X,1,2020,
6,Export,Alaska,20XX,Air,,Ontario,1220,128192,5193,266,2.0,X,1,2020,
7,Export,Alaska,20XX,Air,,Quebec,1220,4597,526,482,1.0,X,1,2020,
8,Export,Alaska,20XX,Air,,,2010,2970,1,0,1.0,0,1,2020,
9,Export,Alaska,2304,Truck,Campeche,,2010,125986,0,0,1.0,0,1,2020,


In [18]:
# Create a mapping for the COUNTRY column
country_mapping = {
    1220: 'Canada',
    2010: 'Mexico'
}

# Apply the mapping to the 'COUNTRY' column
data['COUNTRY'] = data['COUNTRY'].map(country_mapping)

data.head(10)

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,07XX,Air,,Alberta,Canada,3302,378,125,1.0,X,1,2020,
1,Export,Alaska,20XX,Air,,Alberta,Canada,133362,137,1563,1.0,X,1,2020,
2,Export,Alaska,20XX,Air,,Alberta,Canada,49960,66,2631,2.0,X,1,2020,
3,Export,Alaska,20XX,Air,,British Columbia,Canada,21184,3418,795,1.0,X,1,2020,
4,Export,Alaska,20XX,Air,,Manitoba,Canada,4253,2,75,1.0,X,1,2020,
5,Export,Alaska,20XX,Air,,Ontario,Canada,26587,413,555,1.0,X,1,2020,
6,Export,Alaska,20XX,Air,,Ontario,Canada,128192,5193,266,2.0,X,1,2020,
7,Export,Alaska,20XX,Air,,Quebec,Canada,4597,526,482,1.0,X,1,2020,
8,Export,Alaska,20XX,Air,,,Mexico,2970,1,0,1.0,0,1,2020,
9,Export,Alaska,2304,Truck,Campeche,,Mexico,125986,0,0,1.0,0,1,2020,


In [19]:
# Create a mapping for the DF column
df_mapping = {
    1: 'domestically produced merchandise',
    2: 'foreign produced merchandise'
}

# Apply the mapping to the 'DF' column
data['DF'] = data['DF'].map(df_mapping)


data.head()

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,07XX,Air,,Alberta,Canada,3302,378,125,domestically produced merchandise,X,1,2020,
1,Export,Alaska,20XX,Air,,Alberta,Canada,133362,137,1563,domestically produced merchandise,X,1,2020,
2,Export,Alaska,20XX,Air,,Alberta,Canada,49960,66,2631,foreign produced merchandise,X,1,2020,
3,Export,Alaska,20XX,Air,,British Columbia,Canada,21184,3418,795,domestically produced merchandise,X,1,2020,
4,Export,Alaska,20XX,Air,,Manitoba,Canada,4253,2,75,domestically produced merchandise,X,1,2020,


In [20]:
# Create a mapping for the CONTCODE (container code) column
# CONTCODE is read as object dtype, so zero may appear as the string '0' or the integer 0
contcode_mapping = {
    'X': 'Containerized',
    '0': 'Non-Containerized',
    0: 'Non-Containerized'
}

# Apply the mapping to the 'CONTCODE' column
data['CONTCODE'] = data['CONTCODE'].map(contcode_mapping)

# Any code not covered by the mapping becomes NaN; treat those rows as 'Non-Containerized'
data['CONTCODE'] = data['CONTCODE'].fillna('Non-Containerized')


data.head()

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,07XX,Air,,Alberta,Canada,3302,378,125,domestically produced merchandise,Containerized,1,2020,
1,Export,Alaska,20XX,Air,,Alberta,Canada,133362,137,1563,domestically produced merchandise,Containerized,1,2020,
2,Export,Alaska,20XX,Air,,Alberta,Canada,49960,66,2631,foreign produced merchandise,Containerized,1,2020,
3,Export,Alaska,20XX,Air,,British Columbia,Canada,21184,3418,795,domestically produced merchandise,Containerized,1,2020,
4,Export,Alaska,20XX,Air,,Manitoba,Canada,4253,2,75,domestically produced merchandise,Containerized,1,2020,
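
`CONTCODE` is stored as an object column (see the `df.info()` output), so its zero values may be the string `'0'` rather than the integer `0`. `Series.map` silently returns NaN for any value whose key is missing or of a different type. The toy example below illustrates that behaviour; the series and both dictionaries are made up for the illustration.

```python
import pandas as pd

# Toy column stored as strings, as an object-dtype CSV column often is
codes = pd.Series(["X", "0", "X", "0"])

int_key_mapping = {"X": "Containerized", 0: "Non-Containerized"}
str_key_mapping = {"X": "Containerized", "0": "Non-Containerized"}

with_int_key = codes.map(int_key_mapping)
with_str_key = codes.map(str_key_mapping)

# The string "0" never matches the integer key 0, so those rows become NaN
print(with_int_key.isna().sum())
print(with_str_key.isna().sum())
```

Counting NaN values immediately after a `map` call is a quick way to see whether every code was actually translated.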


In [21]:
# Create a mapping for the MONTH column
month_mapping = {
    1: 'January',
    2: 'February',
    3: 'March',
    4: 'April',
    5: 'May',
    6: 'June',
    7: 'July',
    8: 'August',
    9: 'September',
    10: 'October',
    11: 'November',
    12: 'December'
}


# Apply the mapping to the 'MONTH' column
data['MONTH'] = data['MONTH'].map(month_mapping)


data.head()

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,07XX,Air,,Alberta,Canada,3302,378,125,domestically produced merchandise,Containerized,January,2020,
1,Export,Alaska,20XX,Air,,Alberta,Canada,133362,137,1563,domestically produced merchandise,Containerized,January,2020,
2,Export,Alaska,20XX,Air,,Alberta,Canada,49960,66,2631,foreign produced merchandise,Containerized,January,2020,
3,Export,Alaska,20XX,Air,,British Columbia,Canada,21184,3418,795,domestically produced merchandise,Containerized,January,2020,
4,Export,Alaska,20XX,Air,,Manitoba,Canada,4253,2,75,domestically produced merchandise,Containerized,January,2020,
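
Once `MONTH` holds names instead of numbers, sorting or grouping orders the months alphabetically. One possible way to keep calendar order is an ordered categorical. This is a sketch on toy data, not a step from the original notebook; `month_order` is an illustrative name.

```python
import pandas as pd

month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Toy column of month names in arbitrary order
months = pd.Series(["March", "January", "February", "January"])

# An ordered categorical sorts by position in `month_order`, not alphabetically
ordered = pd.Series(pd.Categorical(months, categories=month_order, ordered=True))

print(ordered.sort_values().tolist())
```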


In [22]:
data.isnull().sum()

TRDTYPE                 0
USASTATE           148522
DEPE               634881
DISAGMOT                0
MEXSTATE           761658
CANPROV            482158
COUNTRY                 0
VALUE                   0
SHIPWT                  0
FREIGHT_CHARGES         0
DF                 344964
CONTCODE                0
MONTH                   0
YEAR                    0
COMMODITY2         232029
dtype: int64
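
The `MEXSTATE` null count rose from 738672 before the mappings to 761658 here, which suggests that some raw codes (for example the `XX` visible in row 8 of the earlier output) have no key in `mexstate_mapping` and were turned into NaN. A check like the following, run on the raw column before mapping, would list such codes. It is a sketch on toy data; `unmapped_codes` is a hypothetical helper.

```python
import pandas as pd


def unmapped_codes(raw, mapping):
    """Return the distinct non-null values of `raw` that are not keys of `mapping`."""
    present = raw.dropna()
    return sorted(present[~present.isin(list(mapping))].unique())


# Toy data; 'XX' stands in for a code that is missing from the dictionary
raw_states = pd.Series(["CM", "XX", None, "NL", "XX"])
toy_mapping = {"CM": "Campeche", "NL": "Nuevo Leon", "OT": "State Unknown"}

print(unmapped_codes(raw_states, toy_mapping))
```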

In [23]:
# Impute categorical columns with the most frequent value (mode)
data['DEPE'] = data['DEPE'].fillna(data['DEPE'].mode()[0])

data['DF'] = data['DF'].fillna(data['DF'].mode()[0])
data['CONTCODE'] = data['CONTCODE'].fillna(data['CONTCODE'].mode()[0])
data['COMMODITY2'] = data['COMMODITY2'].fillna(data['COMMODITY2'].mode()[0])

In [24]:
# First, convert empty strings ("") to NaN if necessary
data['USASTATE'] = data['USASTATE'].replace('', np.nan)

# Now fill NaN with 'State Unknown'
data['USASTATE'] = data['USASTATE'].fillna('State Unknown')


In [25]:
# Compute mode for MEXSTATE and CANPROV based on COUNTRY
mexstate_mode_per_country = data.groupby('COUNTRY')['MEXSTATE'].agg(lambda x: x.mode()[0] if not x.isnull().all() else np.nan)
canprov_mode_per_country = data.groupby('COUNTRY')['CANPROV'].agg(lambda x: x.mode()[0] if not x.isnull().all() else np.nan)

# Fill missing MEXSTATE based on COUNTRY
data['MEXSTATE'] = data.apply(
    lambda row: mexstate_mode_per_country.get(row['COUNTRY'], row['MEXSTATE']) 
    if pd.isna(row['MEXSTATE']) and row['COUNTRY'] == 'Mexico' else row['MEXSTATE'], axis=1
)

# Fill missing CANPROV based on COUNTRY
data['CANPROV'] = data.apply(
    lambda row: canprov_mode_per_country.get(row['COUNTRY'], row['CANPROV']) 
    if pd.isna(row['CANPROV']) and row['COUNTRY'] == 'Canada' else row['CANPROV'], axis=1
)
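
Row-wise `apply` over roughly a million rows can be slow. The same fill can probably be expressed with a boolean mask, shown here for `MEXSTATE` on toy data. This is a sketch of an alternative, not the method the notebook ran, and it assumes at least one Mexico row has a non-null state (otherwise `mode()` is empty and `[0]` raises).

```python
import pandas as pd

toy = pd.DataFrame({
    "COUNTRY": ["Mexico", "Mexico", "Mexico", "Canada", "Canada"],
    "MEXSTATE": ["Sonora", "Sonora", None, None, None],
})

# Most frequent non-null state among the Mexico rows
is_mexico = toy["COUNTRY"] == "Mexico"
mexico_mode = toy.loc[is_mexico, "MEXSTATE"].mode()[0]

# Fill only the Mexico rows that are missing a state; Canada rows are left untouched
needs_fill = is_mexico & toy["MEXSTATE"].isna()
toy.loc[needs_fill, "MEXSTATE"] = mexico_mode

print(toy)
```

The `CANPROV` column would follow the same pattern with `COUNTRY == "Canada"`.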

In [26]:
data.isnull().sum()

TRDTYPE                 0
USASTATE                0
DEPE                    0
DISAGMOT                0
MEXSTATE           621426
CANPROV            394006
COUNTRY                 0
VALUE                   0
SHIPWT                  0
FREIGHT_CHARGES         0
DF                      0
CONTCODE                0
MONTH                   0
YEAR                    0
COMMODITY2              0
dtype: int64

In [27]:
# Check where values are either NaN (missing) or the string "NaN"
nan_summary = data.isna().sum() + (data == "NaN").sum()

# Show only columns with NaN values
nan_summary[nan_summary > 0]

MEXSTATE    621426
CANPROV     394006
dtype: int64

In [28]:
# Get total number of rows in the DataFrame
total_rows = len(data)

# Compute percentage of NaN values
nan_percentage = (nan_summary / total_rows) * 100

# Display columns with missing values and their percentages
nan_percentage[nan_percentage > 0]

MEXSTATE    61.19819
CANPROV     38.80181
dtype: float64
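
The missing-value percentage can also be computed in one step, because the mean of a boolean column is the fraction of `True` values. A minimal sketch on toy data (it counts real NaN values only, not the string `"NaN"`):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "MEXSTATE": ["Sonora", np.nan, np.nan, np.nan],
    "CANPROV": [np.nan, "Ontario", "Quebec", "Alberta"],
    "VALUE": [10, 20, 30, 40],
})

# isna() gives booleans, so the column mean is the fraction of missing rows
missing_pct = toy.isna().mean() * 100

# Show only the columns that have missing values
print(missing_pct[missing_pct > 0])
```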

In [29]:
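# Each row has a single partner country, so the other country's region column does not apply.
# Fill it with a placeholder label. The labels read crosswise here: CANPROV (a province column)
# receives "State Unknown" and MEXSTATE (a state column) receives "Province Unknown".
data.loc[data['COUNTRY'] == 'Mexico', 'CANPROV'] = "State Unknown"
data.loc[data['COUNTRY'] == 'Canada', 'MEXSTATE'] = "Province Unknown"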

In [30]:
#Check for missing values
data.isnull().sum()

TRDTYPE            0
USASTATE           0
DEPE               0
DISAGMOT           0
MEXSTATE           0
CANPROV            0
COUNTRY            0
VALUE              0
SHIPWT             0
FREIGHT_CHARGES    0
DF                 0
CONTCODE           0
MONTH              0
YEAR               0
COMMODITY2         0
dtype: int64

In [31]:
data.head()

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,07XX,Air,Province Unknown,Alberta,Canada,3302,378,125,domestically produced merchandise,Containerized,January,2020,84.0
1,Export,Alaska,20XX,Air,Province Unknown,Alberta,Canada,133362,137,1563,domestically produced merchandise,Containerized,January,2020,84.0
2,Export,Alaska,20XX,Air,Province Unknown,Alberta,Canada,49960,66,2631,foreign produced merchandise,Containerized,January,2020,84.0
3,Export,Alaska,20XX,Air,Province Unknown,British Columbia,Canada,21184,3418,795,domestically produced merchandise,Containerized,January,2020,84.0
4,Export,Alaska,20XX,Air,Province Unknown,Manitoba,Canada,4253,2,75,domestically produced merchandise,Containerized,January,2020,84.0


In [32]:
# Create a new DataFrame with the cleaned data
dataset = data.copy()

In [33]:
# Save the cleaned data to a CSV file
dataset.to_csv('cleaned_transborder_freight_data_2020.csv', index=False)

In [34]:
# pip install ydata-profiling


from ydata_profiling import ProfileReport

# Create the profile report
profile = ProfileReport(data, title="TransBorderFreight Data Profiling Report", explorative=True)

# Save it to an HTML file
profile.to_file("freight_data_report.html")
