# Container Freight Charges Data Processing

## Overview
This notebook converts container freight charge tables from a PDF file into a clean, analysis-ready CSV file.

### Input Data Structure
- Source: PDF file containing freight charge tables
- Multiple pages with different table formats
- Japanese text with mixed date formats
- Complex nested structure with city pairs and container sizes

### Output Format
- CSV file with standardized structure
- Index column '年月' (year-month) in yyyymm format
- Columns showing routes with container sizes
- All prices as numeric values
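
The cells in this notebook do not cast prices explicitly, so each column's dtype depends on what the PDF extraction produced. If a uniform float dtype is required, one option is a small helper like the sketch below. The name `to_float_prices` and the two-route sample frame are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical two-route frame standing in for the notebook's output
sample = pd.DataFrame(
    {
        "Shanghai (China) -> Los Angeles (U.S.A.) (20ft)": [1390, 1380],
        "Yokohama (Japan) -> Nhava Sheva (India) (20ft)": [1170.0, None],
    },
    index=pd.Index(["202001", "202002"], name="年月"),
)


def to_float_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Cast every column to float64; cells that cannot be parsed become NaN."""
    return df.apply(pd.to_numeric, errors="coerce").astype("float64")


prices = to_float_prices(sample)
print(prices.dtypes)
```

Using `errors="coerce"` is a design choice: it turns unparseable cells into NaN instead of raising, so it is worth checking the NaN count afterwards.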

## 1. Setup and Data Loading

### Mount Drive

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Install Required Packages

In [3]:
!pip install tabula-py jpype1

Collecting tabula-py
  Downloading tabula_py-2.10.0-py3-none-any.whl.metadata (7.6 kB)
Collecting jpype1
  Downloading jpype1-1.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)
Downloading tabula_py-2.10.0-py3-none-any.whl (12.0 MB)
Downloading jpype1-1.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (493 kB)
Installing collected packages: jpype1, tabula-py
Successfully installed jpype1-1.5.1 tabula-py-2.10.0


### Import Libraries

In [4]:
import pandas as pd
import tabula
from google.colab import files

### Load Data

In [5]:
# Define file path
path_to_pdf = "/content/drive/Shareddrives/125-2日本ハム-業務委託共有/前処理作業ファイル for Joseph-san/Before Preprocessing/2024-06断面データ/輸送費関連-コンテナ代.pdf"

# Read all pages from PDF
data = tabula.read_pdf(path_to_pdf, pages="all")
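
By default `tabula.read_pdf` returns a list of DataFrames, so it can help to look at each table's shape and leading column names before processing. The helper below is a minimal sketch. `summarize_tables` and the stand-in frames are hypothetical, and in the notebook the call would be `summarize_tables(data)`.

```python
import pandas as pd


def summarize_tables(tables):
    """Return one row per extracted table: position, shape, first column names."""
    rows = []
    for position, table in enumerate(tables):
        rows.append(
            {
                "table": position,
                "rows": table.shape[0],
                "columns": table.shape[1],
                "first_columns": list(table.columns[:4]),
            }
        )
    return pd.DataFrame(rows)


# Made-up stand-in frames, not the real PDF content
fake_tables = [
    pd.DataFrame({"積地": ["Shanghai"], "向け地": ["Los Angeles"], "1月": [1390]}),
    pd.DataFrame({"積地 向け地": ["Yokohama Busan"], "1月": [620]}),
]
summary = summarize_tables(fake_tables)
print(summary)
```

A table whose leading columns differ from the others (such as a combined '積地 向け地' column) is a candidate for special handling later.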

## 2. Data Processing Function
### Define Processing Function

In [6]:
def process_dataframe(df):
    """
    Process a single dataframe of freight charges data.

    Args:
        df: Input dataframe with raw freight charge data

    Returns:
        Processed dataframe with standardized format
    """
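    # '積地' = port of loading, '向け地' = destination
    if '積地' in df.columns and '向け地' in df.columns:
        # Work on a copy so the assignments below do not write into a slice of the caller's data
        df = df.copy()

        # Extract and combine city and country information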
        cities = df[["積地", "向け地"]].iloc[0].tolist()
        countries = df[["積地", "向け地"]].iloc[1].tolist()
        df.loc[df.index[0], "積地"] = f"{cities[0]} {countries[0]}"
        df.loc[df.index[0], "向け地"] = f"{cities[1]} {countries[1]}"

        # Propagate route information
        if len(df) > 1:
            df.loc[df.index[1]:, ['積地', '向け地']] = df.loc[df.index[0], ['積地', '向け地']].values

        # Rename and restructure columns
        df = df.rename(columns={'Unnamed: 0': 'Year', 'Unnamed: 1': 'ft'})
        df = df.assign(
            route=df.apply(lambda row: f"{row['積地']} -> {row['向け地']} ({row['ft']})", axis=1)
        ).drop(columns=["積地", "向け地", "ft"])

        # Transform to long format and standardize dates
        df_melted = pd.melt(df, id_vars=["Year", "route"], var_name="Month", value_name="Price")
        df_melted["Year"] = df_melted["Year"].str.replace("年", "")
        df_melted["Month"] = df_melted["Month"].str.replace("月", "")
        df_melted["年月"] = df_melted['Year'].astype(str) + df_melted['Month'].str.zfill(2)
        df_melted = df_melted.drop(['Year', 'Month'], axis=1)

        # Create final pivot format
        df_pivoted = df_melted.pivot(
            index='年月',
            columns='route',
            values='Price'
        ).rename_axis(columns=None)

        return df_pivoted

    return pd.DataFrame()
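
The melt-then-pivot step in the function above can be traced on a toy table. This is a sketch with made-up routes and prices, not data from the PDF. It assumes the same 'Year', 'route', and month-column layout that `process_dataframe` builds.

```python
import pandas as pd

# Toy wide table: one row per year and container size, one column per month
wide = pd.DataFrame(
    {
        "Year": ["2020年", "2020年"],
        "route": ["A -> B (20ft)", "A -> B (40ft)"],
        "1月": [100, 200],
        "2月": [110, 210],
    }
)

# Wide to long: one row per (year, route, month)
long = pd.melt(wide, id_vars=["Year", "route"], var_name="Month", value_name="Price")

# Build the yyyymm key by stripping the 年/月 suffixes and zero-padding the month
long["年月"] = (
    long["Year"].str.replace("年", "", regex=False)
    + long["Month"].str.replace("月", "", regex=False).str.zfill(2)
)

# Long to wide again: one row per month, one column per route
result = long.pivot(index="年月", columns="route", values="Price").rename_axis(columns=None)
print(result)
```

Note that `pivot` raises an error if the same (年月, route) pair appears twice, which acts as a useful check that each block holds exactly one route pair.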

## 3. Processing Pipeline

### Process All Pages

In [7]:
# Dictionary to store processed dataframes
dataframes = {}

# Process each page
for i in range(len(data)):
    data_i = data[i]

    # Special handling for page 4 (index 3)
    if i == 3:
        # Split combined city column
        data_i[["積地", "向け地"]] = data_i["積地 向け地"].str.split(expand=True)
        data_i = data_i.drop(columns=["積地 向け地"])
        # Realign columns
        data_i.drop('Unnamed: 0', axis=1, inplace=True)
        data_i = data_i.rename(columns={'Unnamed: 1': 'Unnamed: 0', 'Unnamed: 2': 'Unnamed: 1'})
        data_i = data_i.reindex(columns=data[1].columns)

    # Standard processing steps
    # Forward-fill the year label, then drop the year-over-year ('前年比') rows
    data_i['Unnamed: 0'] = data_i['Unnamed: 0'].ffill()
    data_i = data_i[data_i['Unnamed: 0'] != '前年比']

    # Process in chunks of 10 rows
    num_rows = len(data_i)
    num_dfs = (num_rows + 9) // 10

    for j in range(num_dfs):
        start_index = j * 10
        end_index = min((j + 1) * 10, num_rows)
        df_subset = data_i.iloc[start_index:end_index]
        key = f"df_{i+1}_{j+1}"
        dataframes[key] = process_dataframe(df_subset)
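
The forward-fill, filter, and fixed-size chunking steps above can be tried in isolation. The sketch below uses a made-up 25-row frame. `split_into_blocks` is a hypothetical helper name, and the block size of 10 mirrors the loop above.

```python
import pandas as pd


def split_into_blocks(df: pd.DataFrame, block_size: int = 10):
    """Return consecutive row blocks; the last one may be shorter."""
    return [df.iloc[start:start + block_size] for start in range(0, len(df), block_size)]


# Made-up frame: the year label appears on every other row, plus one '前年比' row
raw = pd.DataFrame(
    {
        "Unnamed: 0": ["2020年", None] * 12 + ["前年比"],
        "value": list(range(25)),
    }
)

raw["Unnamed: 0"] = raw["Unnamed: 0"].ffill()   # fill the blank year labels
cleaned = raw[raw["Unnamed: 0"] != "前年比"]    # drop year-over-year rows
blocks = split_into_blocks(cleaned)
print([len(block) for block in blocks])
```

Fixed-size chunking only lines up with route pairs when every pair contributes exactly the same number of rows, so a short final block is worth investigating.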



### Inspect Results

In [8]:
# Print info about processed dataframes
for key in dataframes.keys():
    print(f"Shape of {key}: {dataframes[key].shape}")
    print(f"Columns in {key}: {dataframes[key].columns.tolist()}")
    print()

Shape of df_1_1: (60, 2)
Columns in df_1_1: ['Shanghai (China) -> Los Angeles (U.S.A.) (20ft)', 'Shanghai (China) -> Los Angeles (U.S.A.) (40ft)']

Shape of df_1_2: (60, 2)
Columns in df_1_2: ['Shanghai (China) -> New York (U.S.A.) (20ft)', 'Shanghai (China) -> New York (U.S.A.) (40ft)']

Shape of df_1_3: (60, 2)
Columns in df_1_3: ['Yokohama (Japan) -> Los Angeles (U.S.A.) (20ft)', 'Yokohama (Japan) -> Los Angeles (U.S.A.) (40ft)']

Shape of df_1_4: (60, 2)
Columns in df_1_4: ['Yokohama (Japan) -> New York (U.S.A.) (20ft)', 'Yokohama (Japan) -> New York (U.S.A.) (40ft)']

Shape of df_2_1: (60, 2)
Columns in df_2_1: ['Los Angeles (U.S.A.) -> Shanghai (China) (20ft)', 'Los Angeles (U.S.A.) -> Shanghai (China) (40ft)']

Shape of df_2_2: (60, 2)
Columns in df_2_2: ['New York (U.S.A.) -> Shanghai (China) (20ft)', 'New York (U.S.A.) -> Shanghai (China) (40ft)']

Shape of df_2_3: (60, 2)
Columns in df_2_3: ['Los Angeles (U.S.A.) -> Yokohama (Japan) (20ft)', 'Los Angeles (U.S.A.) -> Yokohama 

## 4. Combining and Exporting Results


### Combine and Export

In [9]:
# Combine all processed dataframes
df_combined = pd.concat(dataframes.values(), axis=1)

# Export to CSV
csv_filename = 'container_freight_charges202404.csv'
df_combined.to_csv(csv_filename, index=True)

# Download the result
files.download(csv_filename)
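
The combine-and-export step can be checked without Colab by writing to an in-memory buffer. This is a sketch with made-up frames. The keys mimic the `df_{i}_{j}` naming above, and the routes and prices are hypothetical.

```python
import io

import pandas as pd

index = pd.Index(["202001", "202002"], name="年月")
frames = {
    "df_1_1": pd.DataFrame({"A -> B (20ft)": [100, 110]}, index=index),
    "df_1_2": pd.DataFrame({"A -> C (20ft)": [300.0, None]}, index=index),
}

# Align on the 年月 index and place the route columns side by side
combined = pd.concat(frames.values(), axis=1)

# Route names should be unique, otherwise lookups by column are ambiguous
assert combined.columns.is_unique

buffer = io.StringIO()
combined.to_csv(buffer, index=True)
csv_text = buffer.getvalue()
print(csv_text)
```

Missing prices are written as empty cells. If the CSV is meant to be opened in Excel, passing `encoding="utf-8-sig"` when writing to a file path is one way to keep the Japanese index header readable.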


### Final Inspection

In [10]:
# Display final dataframe info
print("Final DataFrame Info:")
print("Shape:", df_combined.shape)
print("\nColumn names:")
for col in df_combined.columns:
    print(f"- {col}")
print("\nSample of final data:")
df_combined.head()

Final DataFrame Info:
Shape: (60, 48)

Column names:
- Shanghai (China) -> Los Angeles (U.S.A.) (20ft)
- Shanghai (China) -> Los Angeles (U.S.A.) (40ft)
- Shanghai (China) -> New York (U.S.A.) (20ft)
- Shanghai (China) -> New York (U.S.A.) (40ft)
- Yokohama (Japan) -> Los Angeles (U.S.A.) (20ft)
- Yokohama (Japan) -> Los Angeles (U.S.A.) (40ft)
- Yokohama (Japan) -> New York (U.S.A.) (20ft)
- Yokohama (Japan) -> New York (U.S.A.) (40ft)
- Los Angeles (U.S.A.) -> Shanghai (China) (20ft)
- Los Angeles (U.S.A.) -> Shanghai (China) (40ft)
- New York (U.S.A.) -> Shanghai (China) (20ft)
- New York (U.S.A.) -> Shanghai (China) (40ft)
- Los Angeles (U.S.A.) -> Yokohama (Japan) (20ft)
- Los Angeles (U.S.A.) -> Yokohama (Japan) (40ft)
- New York (U.S.A.) -> Yokohama (Japan) (20ft)
- New York (U.S.A.) -> Yokohama (Japan) (40ft)
- Shanghai (China) -> Rotterdam (Netherlands) (20ft)
- Shanghai (China) -> Rotterdam (Netherlands) (40ft)
- Shanghai (China) -> Genoa (Italy) (20ft)
- Shanghai (China) -> 

| 年月 | Shanghai (China) -> Los Angeles (U.S.A.) (20ft) | Shanghai (China) -> Los Angeles (U.S.A.) (40ft) | Shanghai (China) -> New York (U.S.A.) (20ft) | Shanghai (China) -> New York (U.S.A.) (40ft) | Yokohama (Japan) -> Los Angeles (U.S.A.) (20ft) | Yokohama (Japan) -> Los Angeles (U.S.A.) (40ft) | Yokohama (Japan) -> New York (U.S.A.) (20ft) | Yokohama (Japan) -> New York (U.S.A.) (40ft) | Los Angeles (U.S.A.) -> Shanghai (China) (20ft) | Los Angeles (U.S.A.) -> Shanghai (China) (40ft) | ... | Busan (Korea) -> Yokohama (Japan) (20ft) | Busan (Korea) -> Yokohama (Japan) (40ft) | Yokohama (Japan) -> Laem Chabang (Thailand) (20ft) | Yokohama (Japan) -> Laem Chabang (Thailand) (40ft) | Laem Chabang (Thailand) -> Yokohama (Japan) (20ft) | Laem Chabang (Thailand) -> Yokohama (Japan) (40ft) | Yokohama (Japan) -> Nhava Sheva (India) (20ft) | Yokohama (Japan) -> Nhava Sheva (India) (40ft) | Nhava Sheva (India) -> Yokohama (Japan) (20ft) | Nhava Sheva (India) -> Yokohama (Japan) (40ft) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 202001 | 1390 | 1800 | 2420 | 3020 | 1330 | 1510 | 2270 | 2540 | 750 | 840 | ... | 620 | 1030 | 570 | 990 | 640 | 1040 | 1170.0 | 2160.0 | 1040.0 | 1590.0 |
| 202002 | 1380 | 1730 | 2230 | 2850 | 1520 | 1760 | 2060 | 2920 | 750 | 840 | ... | 630 | 1030 | 570 | 990 | 620 | 1020 |  |  |  |  |
| 202003 | 1410 | 1740 | 2330 | 2930 | 1420 | 1750 | 2270 | 3090 | 700 | 800 | ... | 630 | 1040 | 580 | 970 | 630 | 1050 | 1280.0 | 2360.0 | 1090.0 | 1680.0 |
| 202004 | 1430 | 1840 | 2300 | 2920 | 1600 | 1970 | 2740 | 3330 | 710 | 830 | ... | 630 | 1040 | 560 | 940 | 610 | 1000 |  |  |  |  |
| 202005 | 1670 | 2080 | 2540 | 3080 | 1690 | 2130 | 2630 | 3210 | 680 | 840 | ... | 640 | 1060 | 550 | 920 | 600 | 990 | 1180.0 | 1930.0 | 1090.0 | 1670.0 |


## Results
- Successfully processed multi-page PDF input
- Handled the differently laid-out fourth table (combined '積地 向け地' column) and removed the '前年比' (year-over-year) rows
- Created standardized CSV with:
  - Consistent date format (yyyymm)
  - Clear route naming with container sizes
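  - Numeric price values (the sample output shows integers for fully populated routes and floats for routes with gaps)
  - No missing values in the 年月 index; some price cells are empty in the sample output (for example, the Nhava Sheva routes in 202002 and 202004)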
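
The date-format and missing-value properties listed above can be checked programmatically. The function below is a minimal sketch. `check_output` and the sample frame are hypothetical, and in the notebook the call would be `check_output(df_combined)`.

```python
import pandas as pd


def check_output(df: pd.DataFrame) -> dict:
    """Summarize index format and missing values for a month-by-route frame."""
    index_as_text = pd.Series(df.index.astype(str), index=df.index)
    # yyyymm: four digits followed by a month from 01 to 12
    valid_index = index_as_text.str.fullmatch(r"\d{4}(?:0[1-9]|1[0-2])")
    missing = df.isna().sum()
    return {
        "bad_index_labels": list(index_as_text[~valid_index]),
        "columns_with_gaps": missing[missing > 0].to_dict(),
    }


# Made-up frame with one invalid month label and one missing price
sample = pd.DataFrame(
    {"A -> B (20ft)": [100, 110, 120], "A -> C (20ft)": [300.0, None, 310.0]},
    index=pd.Index(["202001", "202002", "202013"], name="年月"),
)
report = check_output(sample)
print(report)
```

An empty `bad_index_labels` list and an empty `columns_with_gaps` dictionary would indicate that every label matches yyyymm and no price is missing.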