### 🔍 Dataset Features Breakdown

**1. Shipment & Trade Information**
| Feature     | Description |
|-------------|-------------|
| `YEAR`      | Year of shipment |
| `MONTH`     | Shipment month |
| `TRDTYPE`   | Trade type: 1 = Export, 2 = Import |
| `COMMODITY2`| 2-digit commodity classification |

**2. Geographic Origins & Destinations**
| Feature     | Description |
|-------------|-------------|
| `USASTATE`  | U.S. state code for shipment |
| `MEXSTATE`  | Mexican state code (if applicable) |
| `CANPROV`   | Canadian province code (if applicable) |
| `COUNTRY`   | Country code (e.g., Canada = 1220, Mexico = 2010) |
| `DEPE`      | Code for port/district where shipment is processed |

**3. Transportation Logistics**
| Feature     | Description |
|-------------|-------------|
| `DISAGMOT`  | Mode of transport (1 = Vessel, 3 = Air, 4 = Mail, 5 = Truck, 6 = Rail, 7 = Pipeline, 8 = Other, 9 = Foreign Trade Zones) |
| `CONTCODE`  | Indicates if shipment is containerized (X = Yes, 0 = No) |

**4. Economic & Cost Metrics**
| Feature         | Description |
|------------------|-------------|
| `VALUE`          | Monetary value of shipment (USD) |
| `SHIPWT`         | Shipping weight (kg) |
| `FREIGHT_CHARGES`| Freight costs (USD) |
| `DF`             | 1 = Domestic product, 2 = Foreign product |
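
The feature tables above can double as a simple schema check before any cleaning starts. The sketch below is illustrative only: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, not part of the notebook, and the sample frame is made up.

```python
import pandas as pd

# Column names taken from the feature tables above
EXPECTED_COLUMNS = {
    "YEAR", "MONTH", "TRDTYPE", "COMMODITY2",
    "USASTATE", "MEXSTATE", "CANPROV", "COUNTRY", "DEPE",
    "DISAGMOT", "CONTCODE",
    "VALUE", "SHIPWT", "FREIGHT_CHARGES", "DF",
}

def missing_columns(frame):
    """Return the expected columns that are absent from the frame (hypothetical helper)."""
    return sorted(EXPECTED_COLUMNS - set(frame.columns))

# Made-up frame that only carries three of the fifteen columns
sample = pd.DataFrame(columns=["YEAR", "MONTH", "TRDTYPE"])
missing = missing_columns(sample)
print(missing)
```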

--- 

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

# to suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
import os

csv_dir = "C:/Users/Aisha Kanyiti/Desktop/Azubi Frieght Project/2021"
files = [f for f in os.listdir(csv_dir) if f.endswith(".csv")]

print(files)


['dot1_0121.csv', 'dot1_0221.csv', 'dot1_0321.csv', 'dot1_0421.csv', 'dot1_0521.csv', 'dot1_0621.csv', 'dot1_0721.csv', 'dot1_0821.csv', 'dot1_0921.csv', 'dot1_1021.csv', 'dot1_1121.csv', 'dot1_1221.csv', 'dot2_0121.csv', 'dot2_0221.csv', 'dot2_0321.csv', 'dot2_0421.csv', 'dot2_0521.csv', 'dot2_0621.csv', 'dot2_0721.csv', 'dot2_0821.csv', 'dot2_0921.csv', 'dot2_1021.csv', 'dot2_1121.csv', 'dot2_1221.csv', 'dot3_0121.csv', 'dot3_0221.csv', 'dot3_0321.csv', 'dot3_0421.csv', 'dot3_0521.csv', 'dot3_0621.csv', 'dot3_0721.csv', 'dot3_0821.csv', 'dot3_0921.csv', 'dot3_1021.csv', 'dot3_1121.csv', 'dot3_1221.csv']
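
The file names follow a visible `dot<table>_<MM><YY>.csv` pattern. As a rough sketch, assuming that reading of the pattern is right, the table number and month can be pulled out of each name and kept as metadata. `FILENAME_PATTERN` and `parse_name` are hypothetical names introduced for illustration.

```python
import re

# Assumed pattern: dot<table digit>_<2-digit month><2-digit year>.csv
FILENAME_PATTERN = re.compile(r"^dot(?P<table>\d)_(?P<month>\d{2})(?P<year>\d{2})\.csv$")

def parse_name(name):
    """Split a file name into table number, month, and year; None if it does not match."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    return {
        "table": int(match.group("table")),
        "month": int(match.group("month")),
        "year": 2000 + int(match.group("year")),
    }

parsed = parse_name("dot2_0921.csv")
print(parsed)
```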


In [3]:
csv_dir =  "C:/Users/Aisha Kanyiti/Desktop/Azubi Frieght Project/2021"

# Get a list of all .csv files in the directory
csv_files = [file for file in os.listdir(csv_dir) if file.endswith(".csv")]

# Create an empty list to hold DataFrames
all_dataframes = []

# Loop through each CSV file and read it into a DataFrame
for file in csv_files:
    file_path = os.path.join(csv_dir, file)
    
    try:
        df = pd.read_csv(file_path)
        df["source_file"] = file  # Add filename for tracking origin
        all_dataframes.append(df)
    except Exception as e:
        print(f"⚠️ Could not read {file}: {e}")

# Combine all DataFrames into one
df = pd.concat(all_dataframes, ignore_index=True)



In [4]:
df = df.drop(columns=['source_file'])

In [5]:
df.head()

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,1,AK,18XX,1,XX,,2010,5940,1136,0,1.0,1,1,2021,
1,1,AK,20XX,3,,XA,1220,7490,26,155,1.0,X,1,2021,
2,1,AK,20XX,3,,XA,1220,24885,13,78,2.0,X,1,2021,
3,1,AK,20XX,3,,XC,1220,16415,139,355,1.0,X,1,2021,
4,1,AK,20XX,3,,XC,1220,9025,5,35,2.0,X,1,2021,


In [6]:
df.isna().sum()

TRDTYPE                  0
USASTATE            207778
DEPE                902616
DISAGMOT                 0
MEXSTATE           1036096
CANPROV             690072
COUNTRY                  0
VALUE                    0
SHIPWT                   0
FREIGHT_CHARGES          0
DF                  485725
CONTCODE                 0
MONTH                    0
YEAR                     0
COMMODITY2          327584
dtype: int64
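
One plausible reason for the large NaN counts, offered as an assumption rather than something the notebook confirms, is that the three file families (`dot1`, `dot2`, `dot3`) do not all carry the same columns. `pd.concat` fills any column a file lacks with NaN, as this small made-up example shows.

```python
import pandas as pd

# Two made-up frames standing in for files with different column sets
with_state = pd.DataFrame({"TRDTYPE": [1], "USASTATE": ["AK"], "DEPE": ["20XX"]})
with_commodity = pd.DataFrame({"TRDTYPE": [2], "DEPE": ["55XX"], "COMMODITY2": [98]})

combined = pd.concat([with_state, with_commodity], ignore_index=True)

# Columns missing from one input are NaN for that input's rows
na_counts = combined.isna().sum()
print(na_counts)
print(combined.dtypes)
```

The integer `COMMODITY2` column becomes `float64` once NaN is introduced, which matches the dtype reported by `df.info()` in this notebook.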

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1437978 entries, 0 to 1437977
Data columns (total 15 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   TRDTYPE          1437978 non-null  int64  
 1   USASTATE         1230200 non-null  object 
 2   DEPE             535362 non-null   object 
 3   DISAGMOT         1437978 non-null  int64  
 4   MEXSTATE         401882 non-null   object 
 5   CANPROV          747906 non-null   object 
 6   COUNTRY          1437978 non-null  int64  
 7   VALUE            1437978 non-null  int64  
 8   SHIPWT           1437978 non-null  int64  
 9   FREIGHT_CHARGES  1437978 non-null  int64  
 10  DF               952253 non-null   float64
 11  CONTCODE         1437978 non-null  object 
 12  MONTH            1437978 non-null  int64  
 13  YEAR             1437978 non-null  int64  
 14  COMMODITY2       1110394 non-null  float64
dtypes: float64(2), int64(8), object(5)
memory usage: 164.6+ MB


In [8]:
df.describe()

Unnamed: 0,TRDTYPE,DISAGMOT,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,MONTH,YEAR,COMMODITY2
count,1437978.0,1437978.0,1437978.0,1437978.0,1437978.0,1437978.0,952253.0,1437978.0,1437978.0,1110394.0
mean,1.337783,4.75859,1532.126,2764976.0,1219452.0,36546.05,1.330785,6.532569,2021.0,56.56374
std,0.4729544,1.248287,386.2085,32802850.0,38786790.0,909815.7,0.470496,3.431261,0.0,27.84372
min,1.0,1.0,1220.0,0.0,0.0,0.0,1.0,1.0,2021.0,1.0
25%,1.0,5.0,1220.0,13979.0,0.0,0.0,1.0,4.0,2021.0,33.0
50%,1.0,5.0,1220.0,69331.5,0.0,234.0,1.0,7.0,2021.0,59.0
75%,2.0,5.0,2010.0,418148.8,2625.75,2736.75,2.0,10.0,2021.0,84.0
max,2.0,9.0,2010.0,3921320000.0,8450848000.0,209324700.0,2.0,12.0,2021.0,99.0
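
The median `SHIPWT` of 0 in the summary above means at least half of the rows report no shipping weight. A quick way to quantify that is the share of zeros per column; the sketch below uses a made-up four-row sample rather than the real data.

```python
import pandas as pd

# Made-up sample; the real frame would be df[["SHIPWT", "FREIGHT_CHARGES"]]
sample = pd.DataFrame({
    "SHIPWT": [1136, 26, 0, 0],
    "FREIGHT_CHARGES": [0, 155, 0, 0],
})

# The mean of a boolean mask is the fraction of True values
zero_share = (sample == 0).mean()
print(zero_share)
```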


In [9]:
df.duplicated().sum()

np.int64(0)

In [10]:
data = df.drop_duplicates().copy()  # Drop duplicates; copy so later column assignments act on an independent DataFrame
data

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,1,AK,18XX,1,XX,,2010,5940,1136,0,1.0,1,1,2021,
1,1,AK,20XX,3,,XA,1220,7490,26,155,1.0,X,1,2021,
2,1,AK,20XX,3,,XA,1220,24885,13,78,2.0,X,1,2021,
3,1,AK,20XX,3,,XC,1220,16415,139,355,1.0,X,1,2021,
4,1,AK,20XX,3,,XC,1220,9025,5,35,2.0,X,1,2021,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1437973,2,,55XX,5,,,2010,10000,2766,1,,0,12,2021,98.0
1437974,2,,60XX,8,,,1220,15100,1500,106,,0,12,2021,89.0
1437975,2,,60XX,8,,,1220,16746,1004,300,,0,12,2021,98.0
1437976,2,,70XX,8,,,1220,196576164,0,0,,0,12,2021,99.0


In [11]:
data.duplicated().sum()

np.int64(0)

In [12]:
data.isnull().sum()

TRDTYPE                  0
USASTATE            207778
DEPE                902616
DISAGMOT                 0
MEXSTATE           1036096
CANPROV             690072
COUNTRY                  0
VALUE                    0
SHIPWT                   0
FREIGHT_CHARGES          0
DF                  485725
CONTCODE                 0
MONTH                    0
YEAR                     0
COMMODITY2          327584
dtype: int64

In [13]:
# Mapping of U.S. State Codes to State Names
us_state_mapping = {
    "AL": "Alabama", "AK": "Alaska", "AS": "American Samoa", "AZ": "Arizona",
    "AR": "Arkansas", "CA": "California", "CO": "Colorado", "CT": "Connecticut",
    "DE": "Delaware", "DC": "District of Columbia", "FL": "Florida", "GA": "Georgia",
    "HI": "Hawaii", "ID": "Idaho", "IL": "Illinois", "IN": "Indiana", "IA": "Iowa",
    "KS": "Kansas", "KY": "Kentucky", "LA": "Louisiana", "ME": "Maine", "MD": "Maryland",
    "MA": "Massachusetts", "MI": "Michigan", "MN": "Minnesota", "MS": "Mississippi",
    "MO": "Missouri", "MT": "Montana", "NE": "Nebraska", "NV": "Nevada",
    "NH": "New Hampshire", "NJ": "New Jersey", "NM": "New Mexico", "NY": "New York",
    "NC": "North Carolina", "ND": "North Dakota", "OH": "Ohio", "OK": "Oklahoma",
    "OR": "Oregon", "PA": "Pennsylvania", "RI": "Rhode Island", "SC": "South Carolina",
    "SD": "South Dakota", "TN": "Tennessee", "TX": "Texas", "UT": "Utah",
    "VT": "Vermont", "VA": "Virginia", "WA": "Washington", "WV": "West Virginia",
    "WI": "Wisconsin", "WY": "Wyoming", "DU": "Unknown"
}

    
# Replace state codes with state names
data["USASTATE"] = data["USASTATE"].map(us_state_mapping)

data.head(10)

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,1,Alaska,18XX,1,XX,,2010,5940,1136,0,1.0,1,1,2021,
1,1,Alaska,20XX,3,,XA,1220,7490,26,155,1.0,X,1,2021,
2,1,Alaska,20XX,3,,XA,1220,24885,13,78,2.0,X,1,2021,
3,1,Alaska,20XX,3,,XC,1220,16415,139,355,1.0,X,1,2021,
4,1,Alaska,20XX,3,,XC,1220,9025,5,35,2.0,X,1,2021,
5,1,Alaska,20XX,3,,XM,1220,18901,1069,649,1.0,X,1,2021,
6,1,Alaska,20XX,3,,XM,1220,7936,273,78,2.0,X,1,2021,
7,1,Alaska,20XX,3,,XO,1220,31745,28,39,1.0,X,1,2021,
8,1,Alaska,20XX,3,,XO,1220,13074,35,256,2.0,X,1,2021,
9,1,Alaska,2304,5,AG,,2010,56649,0,0,1.0,0,1,2021,
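
`Series.map` with a dictionary returns NaN for any value that is not a key, so a code missing from the mapping is silently lost. A minimal sketch of one way to keep such codes visible, using made-up values (`ZZ` is a hypothetical unmapped code):

```python
import pandas as pd

codes = pd.Series(["AK", "TX", "ZZ", None])
names = {"AK": "Alaska", "TX": "Texas"}

# Map known codes; fall back to the original code where the mapping has no entry
labelled = codes.map(names).fillna(codes)
print(labelled.tolist())
```

Values that were missing in the input stay missing, because the fallback is also missing for those rows.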


In [14]:
# Mapping of Trade Type Codes to Trade Type Names
trade_type_mapping = {
    1: "Export",
    2: "Import"
}

# Replace trade type codes with trade type names
data["TRDTYPE"] = data["TRDTYPE"].map(trade_type_mapping)


data.head(5)

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,18XX,1,XX,,2010,5940,1136,0,1.0,1,1,2021,
1,Export,Alaska,20XX,3,,XA,1220,7490,26,155,1.0,X,1,2021,
2,Export,Alaska,20XX,3,,XA,1220,24885,13,78,2.0,X,1,2021,
3,Export,Alaska,20XX,3,,XC,1220,16415,139,355,1.0,X,1,2021,
4,Export,Alaska,20XX,3,,XC,1220,9025,5,35,2.0,X,1,2021,


In [15]:
# Create a mapping for the DISAGMOT column
disagmot_mapping = {
    1: "Vessel",
    3: "Air",
    4: "Mail (U.S. Postal Service)",
    5: "Truck",
    6: "Rail",
    7: "Pipeline",
    8: "Other",
    9: "Foreign Trade Zones (FTZs)"
}

# Apply the mapping to the 'DISAGMOT' column
data['DISAGMOT'] = data['DISAGMOT'].map(disagmot_mapping)


data.head(5)

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,18XX,Vessel,XX,,2010,5940,1136,0,1.0,1,1,2021,
1,Export,Alaska,20XX,Air,,XA,1220,7490,26,155,1.0,X,1,2021,
2,Export,Alaska,20XX,Air,,XA,1220,24885,13,78,2.0,X,1,2021,
3,Export,Alaska,20XX,Air,,XC,1220,16415,139,355,1.0,X,1,2021,
4,Export,Alaska,20XX,Air,,XC,1220,9025,5,35,2.0,X,1,2021,


In [16]:
# Create a mapping for the CANPROV column
canprov_mapping = {
    'XA': 'Alberta',
    'XC': 'British Columbia',
    'XM': 'Manitoba',
    'XB': 'New Brunswick',
    'XW': 'Newfoundland',
    'XT': 'Northwest Territories',
    'XN': 'Nova Scotia',
    'XO': 'Ontario',
    'XP': 'Prince Edward Island',
    'XQ': 'Quebec',
    'XS': 'Saskatchewan',
    'XV': 'Nunavut',
    'XY': 'Yukon Territory',
    'OT': 'Province Unknown'
}

# Apply the mapping to the 'CANPROV' column
data['CANPROV'] = data['CANPROV'].map(canprov_mapping)

data.head(5)

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,18XX,Vessel,XX,,2010,5940,1136,0,1.0,1,1,2021,
1,Export,Alaska,20XX,Air,,Alberta,1220,7490,26,155,1.0,X,1,2021,
2,Export,Alaska,20XX,Air,,Alberta,1220,24885,13,78,2.0,X,1,2021,
3,Export,Alaska,20XX,Air,,British Columbia,1220,16415,139,355,1.0,X,1,2021,
4,Export,Alaska,20XX,Air,,British Columbia,1220,9025,5,35,2.0,X,1,2021,


In [17]:
# Create a mapping for the MEXSTATE column
mexstate_mapping = {
    'AG': 'Aguascalientes',
    'BC': 'Baja California',
    'BN': 'Baja California Norte',
    'BS': 'Baja California Sur',
    'CH': 'Chihuahua',
    'CL': 'Colima',
    'CM': 'Campeche',
    'CO': 'Coahuila',
    'CS': 'Chiapas',
    'DF': 'Distrito Federal',
    'DG': 'Durango',
    'GR': 'Guerrero',
    'GT': 'Guanajuato',
    'HG': 'Hidalgo',
    'JA': 'Jalisco',
    'MI': 'Michoacán',
    'MO': 'Morelos',
    'MX': 'Estado de Mexico',
    'NA': 'Nayarit',
    'NL': 'Nuevo Leon',
    'OA': 'Oaxaca',
    'PU': 'Puebla',
    'QR': 'Quintana Roo',
    'QT': 'Queretaro',
    'SI': 'Sinaloa',
    'SL': 'San Luis Potosi',
    'SO': 'Sonora',
    'TB': 'Tabasco',
    'TL': 'Tlaxcala',
    'TM': 'Tamaulipas',
    'VE': 'Veracruz',
    'YU': 'Yucatan',
    'ZA': 'Zacatecas',
    'OT': 'State Unknown'
}

# Apply the mapping to the 'MEXSTATE' column
data['MEXSTATE'] = data['MEXSTATE'].map(mexstate_mapping)


data.head(20)

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,18XX,Vessel,,,2010,5940,1136,0,1.0,1,1,2021,
1,Export,Alaska,20XX,Air,,Alberta,1220,7490,26,155,1.0,X,1,2021,
2,Export,Alaska,20XX,Air,,Alberta,1220,24885,13,78,2.0,X,1,2021,
3,Export,Alaska,20XX,Air,,British Columbia,1220,16415,139,355,1.0,X,1,2021,
4,Export,Alaska,20XX,Air,,British Columbia,1220,9025,5,35,2.0,X,1,2021,
5,Export,Alaska,20XX,Air,,Manitoba,1220,18901,1069,649,1.0,X,1,2021,
6,Export,Alaska,20XX,Air,,Manitoba,1220,7936,273,78,2.0,X,1,2021,
7,Export,Alaska,20XX,Air,,Ontario,1220,31745,28,39,1.0,X,1,2021,
8,Export,Alaska,20XX,Air,,Ontario,1220,13074,35,256,2.0,X,1,2021,
9,Export,Alaska,2304,Truck,Aguascalientes,,2010,56649,0,0,1.0,0,1,2021,


In [18]:
# Create a mapping for the COUNTRY column
country_mapping = {
    1220: 'Canada',
    2010: 'Mexico'
}

# Apply the mapping to the 'COUNTRY' column
data['COUNTRY'] = data['COUNTRY'].map(country_mapping)

data.head(10)

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,18XX,Vessel,,,Mexico,5940,1136,0,1.0,1,1,2021,
1,Export,Alaska,20XX,Air,,Alberta,Canada,7490,26,155,1.0,X,1,2021,
2,Export,Alaska,20XX,Air,,Alberta,Canada,24885,13,78,2.0,X,1,2021,
3,Export,Alaska,20XX,Air,,British Columbia,Canada,16415,139,355,1.0,X,1,2021,
4,Export,Alaska,20XX,Air,,British Columbia,Canada,9025,5,35,2.0,X,1,2021,
5,Export,Alaska,20XX,Air,,Manitoba,Canada,18901,1069,649,1.0,X,1,2021,
6,Export,Alaska,20XX,Air,,Manitoba,Canada,7936,273,78,2.0,X,1,2021,
7,Export,Alaska,20XX,Air,,Ontario,Canada,31745,28,39,1.0,X,1,2021,
8,Export,Alaska,20XX,Air,,Ontario,Canada,13074,35,256,2.0,X,1,2021,
9,Export,Alaska,2304,Truck,Aguascalientes,,Mexico,56649,0,0,1.0,0,1,2021,


In [19]:
# Create a mapping for the DF column
df_mapping = {
    1: 'domestically produced merchandise',
    2: 'foreign produced merchandise'
}

# Apply the mapping to the 'DF' column
data['DF'] = data['DF'].map(df_mapping)


data.head()

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,18XX,Vessel,,,Mexico,5940,1136,0,domestically produced merchandise,1,1,2021,
1,Export,Alaska,20XX,Air,,Alberta,Canada,7490,26,155,domestically produced merchandise,X,1,2021,
2,Export,Alaska,20XX,Air,,Alberta,Canada,24885,13,78,foreign produced merchandise,X,1,2021,
3,Export,Alaska,20XX,Air,,British Columbia,Canada,16415,139,355,domestically produced merchandise,X,1,2021,
4,Export,Alaska,20XX,Air,,British Columbia,Canada,9025,5,35,foreign produced merchandise,X,1,2021,


In [20]:
# Create a mapping for the CONTCODE (container code) column.
# The column has object dtype, so the values are compared as strings.
contcode_mapping = {
    'X': 'Containerized',
    '0': 'Non-Containerized'
}

# Apply the mapping to the 'CONTCODE' column
data['CONTCODE'] = data['CONTCODE'].astype(str).map(contcode_mapping)

# Any other code (for example '1') is not in the mapping and becomes NaN;
# treat those rows as 'Non-Containerized'
data['CONTCODE'] = data['CONTCODE'].fillna('Non-Containerized')


data.head()

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,18XX,Vessel,,,Mexico,5940,1136,0,domestically produced merchandise,Non-Containerized,1,2021,
1,Export,Alaska,20XX,Air,,Alberta,Canada,7490,26,155,domestically produced merchandise,Containerized,1,2021,
2,Export,Alaska,20XX,Air,,Alberta,Canada,24885,13,78,foreign produced merchandise,Containerized,1,2021,
3,Export,Alaska,20XX,Air,,British Columbia,Canada,16415,139,355,domestically produced merchandise,Containerized,1,2021,
4,Export,Alaska,20XX,Air,,British Columbia,Canada,9025,5,35,foreign produced merchandise,Containerized,1,2021,
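
Dictionary keys only match values of the same type, so a string `'0'` in the column is not matched by an integer key `0`. The sketch below, on made-up values, shows the difference; it is an illustration of the pitfall, not a statement about how every row in the real data is typed.

```python
import pandas as pd

raw = pd.Series(["X", "0", "1"])  # codes stored as text

# Integer key 0 does not match the string "0"
int_key = raw.map({"X": "Containerized", 0: "Non-Containerized"})

# String key "0" matches
str_key = raw.map({"X": "Containerized", "0": "Non-Containerized"})

print(int_key.tolist())
print(str_key.tolist())
```

Checking `raw.value_counts(dropna=False)` before mapping is a cheap way to see which codes and types are actually present.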


In [21]:
# Create a mapping for the MONTH column
month_mapping = {
    1: 'January',
    2: 'February',
    3: 'March',
    4: 'April',
    5: 'May',
    6: 'June',
    7: 'July',
    8: 'August',
    9: 'September',
    10: 'October',
    11: 'November',
    12: 'December'
}


# Apply the mapping to the 'MONTH' column
data['MONTH'] = data['MONTH'].map(month_mapping)


data.head()

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,18XX,Vessel,,,Mexico,5940,1136,0,domestically produced merchandise,Non-Containerized,January,2021,
1,Export,Alaska,20XX,Air,,Alberta,Canada,7490,26,155,domestically produced merchandise,Containerized,January,2021,
2,Export,Alaska,20XX,Air,,Alberta,Canada,24885,13,78,foreign produced merchandise,Containerized,January,2021,
3,Export,Alaska,20XX,Air,,British Columbia,Canada,16415,139,355,domestically produced merchandise,Containerized,January,2021,
4,Export,Alaska,20XX,Air,,British Columbia,Canada,9025,5,35,foreign produced merchandise,Containerized,January,2021,
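
Once months are stored as names, plain sorting and most plotting defaults order them alphabetically. One possible remedy, sketched here on a made-up series, is an ordered categorical dtype. The month names come from the standard `calendar` module and are English under the default locale.

```python
import calendar
import pandas as pd

# calendar.month_name[0] is an empty string, so skip it
month_names = list(calendar.month_name)[1:]

months = pd.Series(["March", "January", "December", "February"])
month_dtype = pd.CategoricalDtype(categories=month_names, ordered=True)
ordered = months.astype(month_dtype)

# Sorting now follows calendar order instead of alphabetical order
sorted_months = ordered.sort_values().tolist()
print(sorted_months)
```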


In [22]:
data.isnull().sum()

TRDTYPE                  0
USASTATE            207778
DEPE                902616
DISAGMOT                 0
MEXSTATE           1069585
CANPROV             690072
COUNTRY                  0
VALUE                    0
SHIPWT                   0
FREIGHT_CHARGES          0
DF                  485725
CONTCODE                 0
MONTH                    0
YEAR                     0
COMMODITY2          327584
dtype: int64
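
`MEXSTATE` had 1,036,096 missing values before the mappings and has 1,069,585 here, so 33,489 non-null codes were absent from `mexstate_mapping` and were turned into NaN by `map` (the `XX` in the first row is one such code). A small sketch for listing such codes before mapping; `unmapped_codes` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

def unmapped_codes(series, mapping):
    """Count non-null values that have no key in the mapping (hypothetical helper)."""
    present = series.dropna()
    return present[~present.isin(list(mapping))].value_counts()

mex = pd.Series(["XX", "AG", None, "XX", "NL"])
partial_mapping = {"AG": "Aguascalientes", "NL": "Nuevo Leon"}

leftover = unmapped_codes(mex, partial_mapping)
print(leftover)
```

Running a check like this on the raw column, before the `map` call, shows which codes would need an entry in the mapping to avoid being imputed later.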

In [23]:
# Impute categorical columns with the most frequent value (mode)
data['DEPE'] = data['DEPE'].fillna(data['DEPE'].mode()[0])

data['DF'] = data['DF'].fillna(data['DF'].mode()[0])
data['CONTCODE'] = data['CONTCODE'].fillna(data['CONTCODE'].mode()[0])
data['COMMODITY2'] = data['COMMODITY2'].fillna(data['COMMODITY2'].mode()[0])
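
Mode imputation assigns every missing row to the most frequent category, which inflates that category's share. The sketch below measures the effect on a made-up series; with hundreds of thousands of missing values, as in `DEPE` or `COMMODITY2` here, the shift can be large.

```python
import pandas as pd

# Made-up codes: two known 84s, one 27, three missing
commodity = pd.Series([84.0, 84.0, 27.0, None, None, None])

mode_value = commodity.mode()[0]
filled = commodity.fillna(mode_value)

# Share of the modal code among known values, then among all values after filling
share_before = (commodity == mode_value).sum() / commodity.notna().sum()
share_after = (filled == mode_value).mean()
print(round(share_before, 3), round(share_after, 3))
```

An explicit placeholder category is one alternative when the analysis depends on category shares.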

In [24]:
# First, convert empty strings ("") to NaN if necessary
data['USASTATE'] = data['USASTATE'].replace('', np.nan)

# Now fill NaN with 'State Unknown'
data['USASTATE'] = data['USASTATE'].fillna('State Unknown')


In [25]:
# Compute mode for MEXSTATE and CANPROV based on COUNTRY
mexstate_mode_per_country = data.groupby('COUNTRY')['MEXSTATE'].agg(lambda x: x.mode()[0] if not x.isnull().all() else np.nan)
canprov_mode_per_country = data.groupby('COUNTRY')['CANPROV'].agg(lambda x: x.mode()[0] if not x.isnull().all() else np.nan)

# Fill missing MEXSTATE based on COUNTRY
data['MEXSTATE'] = data.apply(
    lambda row: mexstate_mode_per_country.get(row['COUNTRY'], row['MEXSTATE']) 
    if pd.isna(row['MEXSTATE']) and row['COUNTRY'] == 'Mexico' else row['MEXSTATE'], axis=1
)

# Fill missing CANPROV based on COUNTRY
data['CANPROV'] = data.apply(
    lambda row: canprov_mode_per_country.get(row['COUNTRY'], row['CANPROV']) 
    if pd.isna(row['CANPROV']) and row['COUNTRY'] == 'Canada' else row['CANPROV'], axis=1
)
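
Row-wise `apply` over roughly 1.4 million rows is slow. The same conditional fill can be written with boolean masks; the following is a sketch on a made-up frame for `MEXSTATE` only, not a drop-in replacement for the cell above.

```python
import pandas as pd

frame = pd.DataFrame({
    "COUNTRY": ["Mexico", "Mexico", "Mexico", "Canada"],
    "MEXSTATE": ["Nuevo Leon", "Nuevo Leon", None, None],
})

# Most frequent Mexican state among Mexico rows
mexico_rows = frame["COUNTRY"] == "Mexico"
mexico_mode = frame.loc[mexico_rows, "MEXSTATE"].mode()[0]

# Fill only the Mexico rows whose state is missing
needs_fill = mexico_rows & frame["MEXSTATE"].isna()
frame.loc[needs_fill, "MEXSTATE"] = mexico_mode
print(frame)
```

The Canada row keeps its missing `MEXSTATE`, mirroring the behaviour of the row-wise version.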

In [26]:
data.isnull().sum()

TRDTYPE                 0
USASTATE                0
DEPE                    0
DISAGMOT                0
MEXSTATE           869838
CANPROV            568140
COUNTRY                 0
VALUE                   0
SHIPWT                  0
FREIGHT_CHARGES         0
DF                      0
CONTCODE                0
MONTH                   0
YEAR                    0
COMMODITY2              0
dtype: int64

In [27]:
# Check where values are either NaN (missing) or the string "NaN"
nan_summary = data.isna().sum() + (data == "NaN").sum()

# Show only columns with NaN values
nan_summary[nan_summary > 0]

MEXSTATE    869838
CANPROV     568140
dtype: int64

In [28]:
# Get total number of rows in the DataFrame
total_rows = len(data)

# Compute percentage of NaN values
nan_percentage = (nan_summary / total_rows) * 100

# Display columns with missing values and their percentages
nan_percentage[nan_percentage > 0]

MEXSTATE    60.490355
CANPROV     39.509645
dtype: float64
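
The same percentages can be obtained in one step, since the mean of a boolean mask is the fraction of True values. A minimal sketch on a made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({
    "MEXSTATE": ["Sonora", None, None, None],
    "CANPROV": [None, "Ontario", "Quebec", "Alberta"],
    "VALUE": [1, 2, 3, 4],
})

# Percentage of missing values per column, keeping only columns with gaps
pct = frame.isna().mean() * 100
pct_missing = pct[pct > 0]
print(pct_missing)
```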

In [29]:
# Rows for one partner country have no value in the other country's column,
# so fill those cells with an explicit placeholder
data.loc[data['COUNTRY'] == 'Mexico', 'CANPROV'] = "Province Unknown"
data.loc[data['COUNTRY'] == 'Canada', 'MEXSTATE'] = "State Unknown"

In [30]:
#Check for missing values
data.isnull().sum()

TRDTYPE            0
USASTATE           0
DEPE               0
DISAGMOT           0
MEXSTATE           0
CANPROV            0
COUNTRY            0
VALUE              0
SHIPWT             0
FREIGHT_CHARGES    0
DF                 0
CONTCODE           0
MONTH              0
YEAR               0
COMMODITY2         0
dtype: int64

In [31]:
data.head()

Unnamed: 0,TRDTYPE,USASTATE,DEPE,DISAGMOT,MEXSTATE,CANPROV,COUNTRY,VALUE,SHIPWT,FREIGHT_CHARGES,DF,CONTCODE,MONTH,YEAR,COMMODITY2
0,Export,Alaska,18XX,Vessel,Estado de Mexico,Province Unknown,Mexico,5940,1136,0,domestically produced merchandise,Non-Containerized,January,2021,84.0
1,Export,Alaska,20XX,Air,State Unknown,Alberta,Canada,7490,26,155,domestically produced merchandise,Containerized,January,2021,84.0
2,Export,Alaska,20XX,Air,State Unknown,Alberta,Canada,24885,13,78,foreign produced merchandise,Containerized,January,2021,84.0
3,Export,Alaska,20XX,Air,State Unknown,British Columbia,Canada,16415,139,355,domestically produced merchandise,Containerized,January,2021,84.0
4,Export,Alaska,20XX,Air,State Unknown,British Columbia,Canada,9025,5,35,foreign produced merchandise,Containerized,January,2021,84.0


In [32]:
# Create a new DataFrame with the cleaned data
dataset = data.copy()

In [33]:
# Save the cleaned data to a CSV file
dataset.to_csv('cleaned_transborder_freight_data_2021.csv', index=False)
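
When the cleaned file is read back, pandas infers dtypes again, so an all-digit code column can come back as integers. The sketch below round-trips a made-up frame through an in-memory buffer; `0115` is a hypothetical port code with a leading zero, used only to make the effect visible.

```python
import io
import pandas as pd

cleaned = pd.DataFrame({"DEPE": ["0115", "2304"], "VALUE": [5940, 56649]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Default inference parses the codes as integers and drops the leading zero
buffer.seek(0)
naive = pd.read_csv(buffer)

# Declaring the column as text keeps the codes intact
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"DEPE": str})

print(naive["DEPE"].tolist(), typed["DEPE"].tolist())
```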

In [34]:
# pip install ydata-profiling


In [35]:
# from ydata_profiling import ProfileReport

# Create the profile report
# profile = ProfileReport(data, title="TransBorderFreight Data Profiling Report", explorative=True)

# Save it to an HTML file
# profile.to_file("freight_data_report.html")
