<a href="https://colab.research.google.com/github/Tiru-Kaggundi/Trade_AI/blob/main/data_key_partners_30_30_30_40.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd

df = pd.read_parquet('/content/drive/MyDrive/ai4trade/data/interim/merged_trade_data.parquet')

display(df.head())

Unnamed: 0,origin,destination,hs6,hs4,trade_flow,month,value
0,CHN,\N,30359,303,Exports,2023-01-01,45000
1,CHN,\N,70310,703,Exports,2023-01-01,2200
2,CHN,\N,70320,703,Exports,2023-01-01,600
3,CHN,\N,190230,1902,Exports,2023-01-01,28917
4,CHN,\N,200870,2008,Exports,2023-01-01,11520
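
The preview above shows `\N` in the `destination` column. This looks like a null placeholder carried over from a database export rather than a partner code, although that reading is an assumption to verify against the data source. A minimal sketch, on a made-up toy frame (all values illustrative), of how one might measure how many rows carry the placeholder before ranking partners:

```python
import pandas as pd

# Toy stand-in for the merged trade data (values are illustrative, not real)
toy = pd.DataFrame({
    "origin": ["CHN", "CHN", "USA", "USA"],
    "destination": ["\\N", "ARE", "\\N", "AUS"],
    "trade_flow": ["Exports", "Exports", "Imports", "Imports"],
    "value": [100, 50, 70, 30],
})

# Flag rows whose destination is the literal two-character string \N
is_placeholder = toy["destination"].eq("\\N")

# Count placeholder rows per (origin, trade_flow) group
placeholder_counts = toy[is_placeholder].groupby(["origin", "trade_flow"]).size()
print(placeholder_counts)

# Optionally exclude them so the placeholder cannot be ranked as a partner
toy_known = toy[~is_placeholder]
print(len(toy_known))
```

Whether such rows should be dropped or kept (for example, if they represent unallocated totals) depends on how the source encodes them.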


# Task
Filter the dataset "merged_trade_data.parquet" to keep the top partners based on total value between 2023-01 and 2024-12, specifically keeping the top 30 partners for USA Exports, top 30 for USA Imports, top 30 for CHN Exports, and top 40 for CHN Imports. Save the filtered data to "filtered_30_30_30_40.parquet", preserving all columns and rows for the selected partners across all available months. Include sanity prints for the number of partners kept per group, total rows retained, and the min/max month in the filtered data.

## Prepare for ranking

### Subtask:
Filter the data to the ranking window (2023-01 to 2024-12) and calculate the total value for each (origin, trade_flow, destination) group.


**Reasoning**:
Convert the 'month' column to datetime objects, filter the DataFrame to the ranking window, and then group by 'origin', 'trade_flow', and 'destination' to calculate the sum of 'value' for each group.



In [3]:
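# Parse the month column so it can be compared as dates
df['month'] = pd.to_datetime(df['month'])
# Ranking window; months appear to be stored as first-of-month dates, so <= 2024-12-01 includes December 2024
start_date = pd.to_datetime('2023-01-01')
end_date = pd.to_datetime('2024-12-01')
df_ranking_window = df[(df['month'] >= start_date) & (df['month'] <= end_date)].copy()
# Total value per (origin, trade_flow, destination) within the window
df_ranked = df_ranking_window.groupby(['origin', 'trade_flow', 'destination'])['value'].sum().reset_index()
display(df_ranked.head())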

Unnamed: 0,origin,trade_flow,destination,value
0,CHN,Exports,ABW,155448256
1,CHN,Exports,AFG,2813031586
2,CHN,Exports,AGO,7621043773
3,CHN,Exports,AIA,9961277
4,CHN,Exports,ALA,453513


## Rank partners

### Subtask:
Rank partners within each (origin, trade_flow) group based on their total value.


**Reasoning**:
Apply a rank transformation to the 'value' column within each group defined by 'origin' and 'trade_flow' in the `df_ranked` DataFrame, assigning the ranks in descending order of value and storing the result in a new column named 'rank'.



In [4]:
# df_ranked already holds the per-partner totals from the previous cell; only the rank is added here
df_ranked['rank'] = df_ranked.groupby(['origin', 'trade_flow'])['value'].rank(method='first', ascending=False)
display(df_ranked.head())

Unnamed: 0,origin,trade_flow,destination,value,rank
0,CHN,Exports,ABW,155448256,186.0
1,CHN,Exports,AFG,2813031586,115.0
2,CHN,Exports,AGO,7621043773,83.0
3,CHN,Exports,AIA,9961277,223.0
4,CHN,Exports,ALA,453513,236.0
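
The choice of `method='first'` matters when two partners have identical totals: it gives each tied partner a distinct rank, so a cutoff such as "rank <= 30" keeps exactly 30 partners. A small sketch on hypothetical partner codes and values, comparing it with `method='min'`, where tied partners share a rank:

```python
import pandas as pd

# Toy totals with a tie between two hypothetical partners (BBB and CCC)
totals = pd.DataFrame({
    "origin": ["CHN"] * 4,
    "trade_flow": ["Exports"] * 4,
    "destination": ["AAA", "BBB", "CCC", "DDD"],
    "value": [500, 300, 300, 100],
})

grouped = totals.groupby(["origin", "trade_flow"])["value"]

# 'first' breaks ties by order of appearance, so every rank is distinct
totals["rank_first"] = grouped.rank(method="first", ascending=False)

# 'min' gives tied rows the same (lowest) rank, so a cutoff can keep extra rows
totals["rank_min"] = grouped.rank(method="min", ascending=False)

print(totals)
```

With `method='min'`, a cutoff of 2 would keep three partners in this toy example, which is why `'first'` suits a fixed-size selection.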


## Select top partners

### Subtask:
Select the top N partners for each (origin, trade_flow) group based on the specified counts (30/30/30/40).


**Reasoning**:
Filter the ranked DataFrame to keep only the top N partners for each group as specified in the `top_n_partners` dictionary.



In [5]:
top_n_partners = {
    ('USA', 'Exports'): 30,
    ('USA', 'Imports'): 30,
    ('CHN', 'Exports'): 30,
    ('CHN', 'Imports'): 40
}

df_top_partners = df_ranked[
    df_ranked.apply(lambda row: row['rank'] <= top_n_partners.get((row['origin'], row['trade_flow']), 0), axis=1)
].copy()

display(df_top_partners.head())

Unnamed: 0,origin,trade_flow,destination,value,rank
8,CHN,Exports,ARE,122001396051,19.0
15,CHN,Exports,AUS,145521623010,15.0
19,CHN,Exports,BEL,65612425433,28.0
23,CHN,Exports,BGD,46151949990,30.0
33,CHN,Exports,BRA,131400856138,18.0
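
The row-wise `apply` above is fine for a small ranking table. As an alternative, the same selection can be sketched as a join against a small cutoff table. The frame and cutoffs below are toy values, and the `cutoffs` and `top_n` names are introduced only for illustration:

```python
import pandas as pd

# Toy cutoffs per (origin, trade_flow); not the notebook's 30/30/30/40
top_n_toy = {("USA", "Exports"): 1, ("CHN", "Imports"): 2}

# Toy ranked table with hypothetical partner codes
ranked = pd.DataFrame({
    "origin": ["USA", "USA", "CHN", "CHN", "CHN", "CHN"],
    "trade_flow": ["Exports", "Exports", "Imports", "Imports", "Imports", "Exports"],
    "destination": ["AAA", "BBB", "AAA", "BBB", "CCC", "AAA"],
    "rank": [1.0, 2.0, 1.0, 2.0, 3.0, 1.0],
})

# Turn the dictionary into a table so the cutoff can be joined onto each row
cutoffs = pd.DataFrame(
    [(origin, flow, n) for (origin, flow), n in top_n_toy.items()],
    columns=["origin", "trade_flow", "top_n"],
)

# Inner join drops groups with no cutoff, mirroring the .get(..., 0) default
merged = ranked.merge(cutoffs, on=["origin", "trade_flow"], how="inner")
selected = merged[merged["rank"] <= merged["top_n"]].drop(columns="top_n")

print(selected)
```

Note that `merge` resets the index, so this variant does not preserve the original row labels.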


## Filter full dataset

### Subtask:
Filter the full dataset to keep only the rows corresponding to the selected top partners.


**Reasoning**:
Build the list of selected (origin, trade_flow, destination) triples, then keep every row of the original DataFrame whose triple appears in that list, across all available months.



In [6]:
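# Selected (origin, trade_flow, destination) triples
top_partners_list = list(zip(df_top_partners['origin'], df_top_partners['trade_flow'], df_top_partners['destination']))

# Vectorized membership test; a row-wise apply over millions of rows is very slow
row_keys = pd.MultiIndex.from_frame(df[['origin', 'trade_flow', 'destination']])
df_filtered = df[row_keys.isin(top_partners_list)].copy()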

display(df_filtered.head())

Unnamed: 0,origin,destination,hs6,hs4,trade_flow,month,value
748,CHN,AGO,250610,2506,Imports,2023-01-01,94339
749,CHN,AGO,250620,2506,Imports,2023-01-01,356
751,CHN,AGO,251611,2516,Imports,2023-01-01,1117614
752,CHN,AGO,251612,2516,Imports,2023-01-01,2575491
755,CHN,AGO,270900,2709,Imports,2023-01-01,1539349098


## Sanity checks

### Subtask:
Print the number of partners kept by group, the total number of rows retained, and the min/max month in the filtered data.


**Reasoning**:
Print the number of unique partners per group, the total number of rows, and the min/max month in the filtered data as requested.



In [7]:
partner_counts = df_filtered.groupby(['origin', 'trade_flow'])['destination'].nunique()
print("Number of partners kept by group:")
print(partner_counts)

total_rows = df_filtered.shape[0]
print(f"\nTotal number of rows retained: {total_rows}")

min_month = df_filtered['month'].min()
max_month = df_filtered['month'].max()
print(f"\nMin month in filtered data: {min_month}")
print(f"Max month in filtered data: {max_month}")

Number of partners kept by group:
origin  trade_flow
CHN     Exports       30
        Imports       40
USA     Exports       30
        Imports       30
Name: destination, dtype: int64

Total number of rows retained: 16770599

Min month in filtered data: 2023-01-01 00:00:00
Max month in filtered data: 2025-03-01 00:00:00
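
The printed counts above can also be checked programmatically, so a mismatch fails loudly instead of relying on visual inspection. A minimal sketch on toy data; the `expected_counts` dictionary and partner codes are hypothetical stand-ins for the notebook's `top_n_partners` and real data:

```python
import pandas as pd

# Toy cutoffs and toy filtered data (hypothetical partner codes)
expected_counts = {("USA", "Exports"): 2, ("CHN", "Imports"): 1}
filtered = pd.DataFrame({
    "origin": ["USA", "USA", "USA", "CHN"],
    "trade_flow": ["Exports", "Exports", "Exports", "Imports"],
    "destination": ["AAA", "BBB", "AAA", "CCC"],
})

# Distinct partners actually present per (origin, trade_flow)
actual_counts = (
    filtered.groupby(["origin", "trade_flow"])["destination"].nunique().to_dict()
)

# Collect every group whose partner count differs from the requested cutoff
mismatches = {
    group: (actual_counts.get(group, 0), wanted)
    for group, wanted in expected_counts.items()
    if actual_counts.get(group, 0) != wanted
}
assert not mismatches, f"Unexpected partner counts (actual, wanted): {mismatches}"
print("Partner counts match the requested cutoffs.")
```

A count below the cutoff is not necessarily a bug: it can also occur if a group has fewer partners than the requested number.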


In [9]:
# Check for duplicate rows and display both instances
duplicate_rows = df_filtered[df_filtered.duplicated(keep=False)]
print(f"\nNumber of duplicate rows: {duplicate_rows.shape[0]}")

# Function to display examples of duplicate rows for a given origin and trade flow
def display_duplicate_examples(df, origin, trade_flow, num_examples=4):
    group_duplicates = df[(df['origin'] == origin) & (df['trade_flow'] == trade_flow)]
    if not group_duplicates.empty:
        print(f"\nExample duplicate rows for {origin} {trade_flow}:")
        # Sort to ensure duplicate pairs are next to each other for better viewing
        group_duplicates_sorted = group_duplicates.sort_values(by=['origin', 'trade_flow', 'destination', 'hs6', 'hs4', 'month', 'value'])
        # Display only the duplicate rows, showing both instances
        display(group_duplicates_sorted[group_duplicates_sorted.duplicated(keep=False)].head(num_examples * 2)) # Display twice the number of examples to show both duplicates
    else:
        print(f"\nNo duplicate rows found for {origin} {trade_flow}.")

# Display examples for each group
display_duplicate_examples(duplicate_rows, 'CHN', 'Imports', 4)
display_duplicate_examples(duplicate_rows, 'USA', 'Imports', 4)
display_duplicate_examples(duplicate_rows, 'CHN', 'Exports', 4)
display_duplicate_examples(duplicate_rows, 'USA', 'Exports', 4)


Number of duplicate rows: 2755830

Example duplicate rows for CHN Imports:


Unnamed: 0,origin,destination,hs6,hs4,trade_flow,month,value
4098649,CHN,ARE,220190,2201,Imports,2023-10-01,11916
4503658,CHN,ARE,220190,2201,Imports,2023-10-01,11916
10757086,CHN,ARE,391910,3919,Imports,2024-12-01,5
10883985,CHN,ARE,391910,3919,Imports,2024-12-01,5
3150929,CHN,ARE,482040,4820,Imports,2023-08-01,5
3514299,CHN,ARE,482040,4820,Imports,2023-08-01,5
4101578,CHN,ARE,940421,9404,Imports,2023-10-01,613
4505607,CHN,ARE,940421,9404,Imports,2023-10-01,613



Example duplicate rows for USA Imports:


Unnamed: 0,origin,destination,hs6,hs4,trade_flow,month,value
13819138,USA,AUS,10129,101,Imports,2023-04-01,0
14038217,USA,AUS,10129,101,Imports,2023-04-01,0
15993257,USA,AUS,10129,101,Imports,2023-07-01,0
16341608,USA,AUS,10129,101,Imports,2023-07-01,0
20243153,USA,AUS,10129,101,Imports,2023-11-01,0
20666785,USA,AUS,10129,101,Imports,2023-11-01,0
21140820,USA,AUS,10129,101,Imports,2023-12-01,0
21349620,USA,AUS,10129,101,Imports,2023-12-01,0



Example duplicate rows for CHN Exports:


Unnamed: 0,origin,destination,hs6,hs4,trade_flow,month,value
4099009,CHN,ARE,330741,3307,Exports,2023-10-01,2
4503882,CHN,ARE,330741,3307,Exports,2023-10-01,2
443318,CHN,ARE,540742,5407,Exports,2023-02-01,5
591498,CHN,ARE,540742,5407,Exports,2023-02-01,5
443485,CHN,ARE,610210,6102,Exports,2023-02-01,4
591543,CHN,ARE,610210,6102,Exports,2023-02-01,4
4548774,CHN,ARE,842641,8426,Exports,2023-11-01,80000
4653883,CHN,ARE,842641,8426,Exports,2023-11-01,80000



Example duplicate rows for USA Exports:


Unnamed: 0,origin,destination,hs6,hs4,trade_flow,month,value
13857909,USA,ARE,10121,101,Exports,2023-04-01,0
14224351,USA,ARE,10121,101,Exports,2023-04-01,0
15272627,USA,ARE,10121,101,Exports,2023-06-01,0
15465364,USA,ARE,10121,101,Exports,2023-06-01,0
15989668,USA,ARE,10121,101,Exports,2023-07-01,0
16336968,USA,ARE,10121,101,Exports,2023-07-01,0
17229145,USA,ARE,10121,101,Exports,2023-08-01,0
17631199,USA,ARE,10121,101,Exports,2023-08-01,0


In [10]:
df_filtered.drop_duplicates(inplace=True)
print(f"\nNumber of rows after removing duplicates: {df_filtered.shape[0]}")


Number of rows after removing duplicates: 15131660
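
The duplicate count printed earlier (2,755,830) is larger than the number of rows actually removed (16,770,599 - 15,131,660 = 1,638,939). This is expected: `duplicated(keep=False)` flags every copy in a duplicate group, while `drop_duplicates()` keeps one copy per group. A toy illustration with made-up rows:

```python
import pandas as pd

# Toy frame: one pair, one triple, and one unique row
toy = pd.DataFrame({
    "hs6": [220190, 220190, 391910, 391910, 391910, 482040],
    "value": [11916, 11916, 5, 5, 5, 7],
})

# keep=False marks all members of every duplicate group
flagged = int(toy.duplicated(keep=False).sum())

# keep='first' marks only the extra copies, which is what drop_duplicates removes
removed = int(toy.duplicated(keep="first").sum())

remaining = len(toy.drop_duplicates())

print(flagged, removed, remaining)
```

Here five rows are flagged but only three are removed, leaving three distinct rows. Treating identical rows as true duplicates (rather than separate records that happen to match) is an assumption the deduplication step makes.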


## Save filtered data

### Subtask:
Save the filtered data to a new parquet file.


**Reasoning**:
Save the filtered data to a parquet file.



In [11]:
df_filtered.to_parquet('/content/drive/MyDrive/ai4trade/data/interim/filtered_30_30_30_40.parquet', index=False)

## Summary

### Data Analysis Key Findings

*   The top partners were determined based on total trade value within the 2023-01 to 2024-12 ranking window.
*   The filtering successfully retained exactly 30 partners for CHN Exports, 40 for CHN Imports, 30 for USA Exports, and 30 for USA Imports.
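*   The filtered dataset contained 16,770,599 rows before deduplication. Dropping exact duplicate rows removed 1,638,939 of them, leaving 15,131,660 rows.
*   The months in the filtered data range from 2023-01 to 2025-03. This extends past the ranking window because the window is used only to select partners; all available months are kept for the selected partners.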

### Insights or Next Steps

*   The filtered dataset is now ready for further analysis focused on the key trade relationships identified.
*   Investigate the trade dynamics (trends, seasonality, etc.) for these top partners over the full time range of the filtered data (2023-01 to 2025-03).
