<a href="https://colab.research.google.com/github/Tiru-Kaggundi/Trade_AI/blob/main/data_key_partners_30_30_30_40.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import pandas as pd

df = pd.read_parquet('/content/drive/MyDrive/ai4trade/merged_trade_data.parquet')

display(df.head())

Unnamed: 0,origin,destination,hs6,hs4,trade_flow,month,value
0,CHN,\N,30359,303,Exports,2023-01-01,45000
1,CHN,\N,70310,703,Exports,2023-01-01,2200
2,CHN,\N,70320,703,Exports,2023-01-01,600
3,CHN,\N,190230,1902,Exports,2023-01-01,28917
4,CHN,\N,200870,2008,Exports,2023-01-01,11520
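
The preview above shows `\N` in the `destination` column. The notebook does not say whether this is a literal string or a rendered missing value, so it may be worth counting such rows before ranking partners. The sketch below runs on a toy frame with made-up values and assumes `\N` is a literal placeholder string; `sample`, `placeholder`, and `named_only` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy frame mirroring the columns shown above; the values are illustrative only.
sample = pd.DataFrame({
    "origin": ["CHN", "CHN", "CHN"],
    "destination": ["\\N", "USA", None],
    "trade_flow": ["Exports", "Exports", "Exports"],
    "value": [45000, 2200, 600],
})

# Assumption: "\N" is a literal string standing in for a missing partner.
placeholder = "\\N"

# Flag both true missing values and the placeholder string.
is_unknown = sample["destination"].isna() | (sample["destination"] == placeholder)

print(f"Rows without a named partner: {int(is_unknown.sum())} of {len(sample)}")

# Rows that name a concrete partner.
named_only = sample[~is_unknown]
```

If the placeholder turns out to be a real string in the full data, it would be grouped and ranked like any other destination, so deciding whether to exclude it is a choice to make before the ranking step.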


# Task
Filter the dataset "merged_trade_data.parquet" to keep the top partners based on total value between 2023-01 and 2024-12, specifically keeping the top 30 partners for USA Exports, top 30 for USA Imports, top 30 for CHN Exports, and top 40 for CHN Imports. Save the filtered data to "filtered_30_30_30_40.parquet", preserving all columns and rows for the selected partners across all available months. Include sanity prints for the number of partners kept per group, total rows retained, and the min/max month in the filtered data.

## Prepare for ranking

### Subtask:
Filter the data to the ranking window (2023-01 to 2024-12) and calculate the total value for each (origin, trade_flow, destination) group.


**Reasoning**:
Convert the 'month' column to datetime objects, filter the DataFrame to the ranking window, and then group by 'origin', 'trade_flow', and 'destination' to calculate the sum of 'value' for each group.



In [4]:
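# Parse month values into timestamps so the window comparison is chronological.
df['month'] = pd.to_datetime(df['month'])
# Ranking window: 2023-01 through 2024-12. Month values appear as first-of-month
# dates, so `<= 2024-12-01` includes December 2024.
start_date = pd.to_datetime('2023-01-01')
end_date = pd.to_datetime('2024-12-01')
df_ranking_window = df[(df['month'] >= start_date) & (df['month'] <= end_date)].copy()
# Total value per (origin, trade_flow, destination) within the window.
df_ranked = df_ranking_window.groupby(['origin', 'trade_flow', 'destination'])['value'].sum().reset_index()
display(df_ranked.head())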

Unnamed: 0,origin,trade_flow,destination,value
0,CHN,Exports,ABW,155448256
1,CHN,Exports,AFG,2813031586
2,CHN,Exports,AGO,7621043773
3,CHN,Exports,AIA,9961277
4,CHN,Exports,ALA,453513


## Rank partners

### Subtask:
Rank partners within each (origin, trade_flow) group based on their total value.


**Reasoning**:
Apply a rank transformation to the 'value' column within each group defined by 'origin' and 'trade_flow' in the `df_ranked` DataFrame, assigning the ranks in descending order of value and storing the result in a new column named 'rank'.



In [6]:
# df_ranked already holds the per-partner totals from the previous step.
# Rank partners within each (origin, trade_flow) group, highest total first.
df_ranked['rank'] = df_ranked.groupby(['origin', 'trade_flow'])['value'].rank(method='first', ascending=False)
display(df_ranked.head())

Unnamed: 0,origin,trade_flow,destination,value,rank
0,CHN,Exports,ABW,155448256,186.0
1,CHN,Exports,AFG,2813031586,115.0
2,CHN,Exports,AGO,7621043773,83.0
3,CHN,Exports,AIA,9961277,223.0
4,CHN,Exports,ALA,453513,236.0
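
The ranking step uses `method='first'`, which gives every partner a distinct rank even when totals are equal, so a cutoff such as `rank <= 30` never keeps more than 30 rows per group. The toy example below (invented values) contrasts it with `method='min'`, where tied partners share a rank and a cutoff could admit extra rows.

```python
import pandas as pd

# Toy totals with a deliberate tie between BBB and CCC.
toy = pd.DataFrame({
    "origin": ["CHN"] * 4,
    "trade_flow": ["Exports"] * 4,
    "destination": ["AAA", "BBB", "CCC", "DDD"],
    "value": [100, 300, 300, 50],
})

by_group = toy.groupby(["origin", "trade_flow"])["value"]

# method="first": ties receive distinct consecutive ranks.
toy["rank_first"] = by_group.rank(method="first", ascending=False)

# method="min": ties share the lowest rank of their block.
toy["rank_min"] = by_group.rank(method="min", ascending=False)

print(toy)
```

Ranks come back as floats, which matches the `186.0`-style values in the output above.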


## Select top partners

### Subtask:
Select the top N partners for each (origin, trade_flow) group based on the specified counts (30/30/30/40).


**Reasoning**:
Filter the ranked DataFrame to keep only the top N partners for each group as specified in the `top_n_partners` dictionary.



In [7]:
top_n_partners = {
    ('USA', 'Exports'): 30,
    ('USA', 'Imports'): 30,
    ('CHN', 'Exports'): 30,
    ('CHN', 'Imports'): 40
}

df_top_partners = df_ranked[
    df_ranked.apply(lambda row: row['rank'] <= top_n_partners.get((row['origin'], row['trade_flow']), 0), axis=1)
].copy()

display(df_top_partners.head())

Unnamed: 0,origin,trade_flow,destination,value,rank
8,CHN,Exports,ARE,122001396051,19.0
15,CHN,Exports,AUS,145521623010,15.0
19,CHN,Exports,BEL,65612425433,28.0
23,CHN,Exports,BGD,46151949990,30.0
33,CHN,Exports,BRA,131400856138,18.0
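
The row-wise `apply` is fine for a ranked table of this size. For reference, the same selection can be written without a Python-level loop by joining the per-group limits onto the ranked table. This is a sketch on toy data, not the notebook's method; `ranked`, `limits`, and `selected` are illustrative names and the limits are shrunk to keep the example small.

```python
import pandas as pd

# Hypothetical small limits, standing in for the 30/30/30/40 dictionary.
top_n = {("USA", "Exports"): 1, ("CHN", "Imports"): 2}

ranked = pd.DataFrame({
    "origin": ["USA", "USA", "CHN", "CHN", "CHN", "DEU"],
    "trade_flow": ["Exports", "Exports", "Imports", "Imports", "Imports", "Exports"],
    "destination": ["CAN", "MEX", "AUS", "BRA", "KOR", "FRA"],
    "rank": [1.0, 2.0, 1.0, 2.0, 3.0, 1.0],
})

# Turn the dictionary into a table with one row per (origin, trade_flow) limit.
limits = pd.DataFrame(
    [(origin, flow, n) for (origin, flow), n in top_n.items()],
    columns=["origin", "trade_flow", "limit"],
)

# The inner join drops groups with no configured limit (DEU here),
# then the comparison keeps rows at or above each group's cutoff.
merged = ranked.merge(limits, on=["origin", "trade_flow"], how="inner")
selected = merged[merged["rank"] <= merged["limit"]].drop(columns="limit")

print(selected)
```

Note that `merge` returns a fresh integer index, so the original row labels are not preserved in `selected`.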


## Filter full dataset

### Subtask:
Filter the full dataset to keep only the rows corresponding to the selected top partners.


**Reasoning**:
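Collect the selected (origin, trade_flow, destination) keys into a set, then keep every row of the original dataframe whose key is in that set. The filter is applied to all months, not only to the ranking window.



In [8]:
# A set makes each membership test O(1) instead of a scan over all selected keys.
top_partners = set(zip(df_top_partners['origin'], df_top_partners['trade_flow'], df_top_partners['destination']))

# Row-wise check against the selected keys; keeps the original index and row order.
df_filtered = df[df.apply(lambda row: (row['origin'], row['trade_flow'], row['destination']) in top_partners, axis=1)].copy()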

display(df_filtered.head())

Unnamed: 0,origin,destination,hs6,hs4,trade_flow,month,value
748,CHN,AGO,250610,2506,Imports,2023-01-01,94339
749,CHN,AGO,250620,2506,Imports,2023-01-01,356
751,CHN,AGO,251611,2516,Imports,2023-01-01,1117614
752,CHN,AGO,251612,2516,Imports,2023-01-01,2575491
755,CHN,AGO,270900,2709,Imports,2023-01-01,1539349098
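
Calling `apply` with `axis=1` invokes a Python function once per row, which can be slow on a dataset with millions of rows. One possible alternative is a bulk membership test on a `MultiIndex` built from the three key columns. The sketch below uses toy data with invented values; `full`, `top`, `wanted`, and `filtered` are illustrative names rather than the notebook's variables.

```python
import pandas as pd

keys = ["origin", "trade_flow", "destination"]

# Toy stand-in for the full dataset.
full = pd.DataFrame({
    "origin": ["CHN", "CHN", "USA", "USA"],
    "trade_flow": ["Imports", "Imports", "Exports", "Exports"],
    "destination": ["AGO", "ZZZ", "CAN", "CAN"],
    "month": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-01-01", "2025-03-01"]),
    "value": [94339, 10, 500, 700],
})

# Toy stand-in for the selected top partners.
top = pd.DataFrame({
    "origin": ["CHN", "USA"],
    "trade_flow": ["Imports", "Exports"],
    "destination": ["AGO", "CAN"],
})

# One composite key per row on each side, then a single vectorized membership test.
wanted = pd.MultiIndex.from_frame(top[keys])
mask = pd.MultiIndex.from_frame(full[keys]).isin(wanted)

# Boolean indexing keeps the original index, row order, and all columns.
filtered = full[mask].copy()

print(filtered)
```

Because the mask is applied to the unfiltered frame, rows outside the ranking window (such as the 2025-03 row here) are retained for selected partners.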


## Sanity checks

### Subtask:
Print the number of partners kept by group, the total number of rows retained, and the min/max month in the filtered data.


**Reasoning**:
Print the number of unique partners per group, the total number of rows, and the min/max month in the filtered data as requested.



In [10]:
partner_counts = df_filtered.groupby(['origin', 'trade_flow'])['destination'].nunique()
print("Number of partners kept by group:")
print(partner_counts)

total_rows = df_filtered.shape[0]
print(f"\nTotal number of rows retained: {total_rows}")

min_month = df_filtered['month'].min()
max_month = df_filtered['month'].max()
print(f"\nMin month in filtered data: {min_month}")
print(f"Max month in filtered data: {max_month}")

Number of partners kept by group:
origin  trade_flow
CHN     Exports       30
        Imports       40
USA     Exports       30
        Imports       30
Name: destination, dtype: int64

Total number of rows retained: 16770599

Min month in filtered data: 2023-01-01 00:00:00
Max month in filtered data: 2025-03-01 00:00:00
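
The printed counts can also be compared against the requested limits programmatically, so that a mismatch is reported instead of relying on visual inspection. This is a minimal sketch; `check_partner_counts` is a hypothetical helper, and the toy frame and limits are invented for illustration.

```python
import pandas as pd


def check_partner_counts(frame, expected):
    """Return {group: found_count} for every group whose count differs from expected."""
    counts = (
        frame.groupby(["origin", "trade_flow"])["destination"]
        .nunique()
        .to_dict()  # keys become (origin, trade_flow) tuples
    )
    mismatches = {}
    for group, limit in expected.items():
        found = int(counts.get(group, 0))
        if found != limit:
            mismatches[group] = found
    return mismatches


# Toy data: two distinct partners for USA Exports, none for CHN Imports.
toy = pd.DataFrame({
    "origin": ["USA", "USA", "USA"],
    "trade_flow": ["Exports", "Exports", "Exports"],
    "destination": ["CAN", "MEX", "CAN"],
})

result = check_partner_counts(toy, {("USA", "Exports"): 2, ("CHN", "Imports"): 1})
print(result)
```

A group can legitimately hold fewer partners than its limit if the data contains fewer destinations, so a reported mismatch is a prompt to investigate rather than proof of an error.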


## Save filtered data

### Subtask:
Save the filtered data to a new parquet file.


**Reasoning**:
Save the filtered data to a parquet file.



In [11]:
df_filtered.to_parquet('/content/drive/MyDrive/ai4trade/filtered_30_30_30_40.parquet', index=False)

## Summary:

### Data Analysis Key Findings

*   The top partners were determined based on total trade value within the 2023-01 to 2024-12 ranking window.
*   The filtering successfully retained exactly 30 partners for CHN Exports, 40 for CHN Imports, 30 for USA Exports, and 30 for USA Imports.
*   The filtered dataset contains 16,770,599 rows.
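*   The months included in the filtered data range from 2023-01 to 2025-03. This extends past the 2023-01 to 2024-12 ranking window because the partner filter was applied to all available months, as the task required.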

### Insights or Next Steps

*   The filtered dataset, saved as `filtered_30_30_30_40.parquet`, contains only the selected partners and can be used directly for partner-level analysis.
*   Investigate the trade dynamics (trends, seasonality, etc.) for these top partners over the full time range of the filtered data (2023-01 to 2025-03).
