In [1]:
import pandas as pd

**The OECD BIMTS bulk data at the 2-digit HS code level comes in six files split by time period, each covering up to five consecutive years. Here I consider 2000 to 2023, which spans five CSV files. Cleaning and preparing them takes the following steps:**  

1. Reading the data for each five-year period and keeping only the HS codes that are relevant to this study
2. Removing 'world', which is coded as 'W', from both the reference and counterpart areas
3. Making five dataframes for every five-year file: one for **total** exports, coded in the product HS column as the category *_T*, and four for the cultural products, named **unique** for HS code *97*, **cinema** for HS code *37*, **books** for HS code *49* and **tapes** for HS code *85*. Each dataframe is subset on the adjustment column value **B_ADJ_RX**, which represents the reconciled bilateral trade flow, adjusted for re-exports.
4. *For unique cultural goods*: Concatenating the five period dataframes for HS code 97 to make a dyad-level time series dataset ranging from 2000 to 2023.
5. Cleaning this data takes the following steps:
   > - a. Renaming the exporter and importer columns
   > - b. Removing self loops
   > - c. Removing duplicate rows and keeping a single instance of each
   > - d. Sorting values by country pair and time period
   > - e. Resetting the index
   > - f. Saving the data in the cleaned folder
   
6. *For reproducible cultural goods*: Making three dyad-level time series datasets for books, tapes and cinemas, each ranging from 2000 to 2023 (in the same way as for unique cultural products). Cleaning each dataset goes through the steps mentioned in point 5.
7. Aggregating the three reproducible cultural goods datasets on country pair and time, while tracking the HS codes and the number of products traded by each pair in a particular year in the columns 'PRODUCT_HS' and 'product_count'
8. The notebook also computes a hysteresis column and appends it to all the datasets.
9. In total, this notebook saves six CSV files in the 'cleaned' subfolder of the 'data' folder: four for the individual HS codes, one for the aggregated reproducible goods and one for the aggregated non-cultural goods.
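
Steps 1 to 3 repeat for every five-year file, so they can be sketched as a single helper. This is only a minimal sketch, not the code used in the cells below: `prepare_period`, `KEEP_COLS` and `CULTURAL_CODES` are hypothetical names, and the small inline frame with made-up values stands in for a raw CSV.

```python
import pandas as pd

KEEP_COLS = ['REF_AREA', 'COUNTERPART_AREA', 'PRODUCT_HS',
             'TIME_PERIOD', 'OBS_VALUE', 'ADJUSTMENT']
# Hypothetical lookup: dataset label -> HS code used in this notebook
CULTURAL_CODES = {'total': '_T', 'unique': 'HS17_97', 'cinema': 'HS17_37',
                  'books': 'HS17_49', 'tapes': 'HS17_85'}

def prepare_period(raw):
    """Return one dataframe per label for a single five-year file."""
    df = raw[KEEP_COLS]
    # Drop the 'W' (world) aggregate on both sides of the dyad
    df = df[(df['REF_AREA'] != 'W') & (df['COUNTERPART_AREA'] != 'W')]
    # Keep only the reconciled flows adjusted for re-exports
    df = df[df['ADJUSTMENT'] == 'B_ADJ_RX']
    return {label: df[df['PRODUCT_HS'] == code]
            for label, code in CULTURAL_CODES.items()}

# Toy stand-in for a frame returned by pd.read_csv on a raw file
raw = pd.DataFrame({
    'REF_AREA': ['HUN', 'HUN', 'W', 'USA'],
    'COUNTERPART_AREA': ['MYS', 'MYS', 'HUN', 'ROU'],
    'PRODUCT_HS': ['HS17_97', 'HS17_97', 'HS17_97', 'HS17_49'],
    'TIME_PERIOD': [2006, 2006, 2006, 2006],
    'OBS_VALUE': [1.2, 1.2, 5.0, 0.004],
    'ADJUSTMENT': ['B', 'B_ADJ_RX', 'B_ADJ_RX', 'B_ADJ_RX'],
})
frames = prepare_period(raw)
print({label: len(f) for label, f in frames.items()})
```

Returning a dictionary keyed by label keeps the five subsets of one period together, which would make the later concatenation across periods a loop over labels.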

#### From 2000 to 2004

In [2]:
c00_04 = pd.read_csv('../data/raw/HS17_2D_DE_2000_To_2004.csv')

In [3]:
c00_04 = c00_04[['REF_AREA', 'COUNTERPART_AREA', 'PRODUCT_HS', 'TIME_PERIOD', 'OBS_VALUE', 'ADJUSTMENT']]

In [4]:
c00_04 = c00_04[c00_04['COUNTERPART_AREA'] != 'W']
c00_04 = c00_04[c00_04['REF_AREA'] != 'W']

In [5]:
c00_04_total = c00_04[(c00_04['PRODUCT_HS'] == '_T') & (c00_04['ADJUSTMENT'] == 'B_ADJ_RX') ]

In [6]:
print(c00_04_total['REF_AREA'].nunique())
print(c00_04_total['COUNTERPART_AREA'].nunique())
print(c00_04_total['PRODUCT_HS'].nunique())

197
197
1


In [7]:
c00_04_unique = c00_04[(c00_04['PRODUCT_HS'] == 'HS17_97') & (c00_04['ADJUSTMENT'] == 'B_ADJ_RX') ]

In [8]:
print(c00_04_unique['REF_AREA'].nunique())
print(c00_04_unique['COUNTERPART_AREA'].nunique())
print(c00_04_unique['PRODUCT_HS'].nunique())

197
197
1


In [9]:
c00_04_49 =  c00_04[(c00_04['PRODUCT_HS'] == 'HS17_49') & (c00_04['ADJUSTMENT'] == 'B_ADJ_RX')]
c00_04_37 =  c00_04[(c00_04['PRODUCT_HS'] == 'HS17_37') & (c00_04['ADJUSTMENT'] == 'B_ADJ_RX')]
c00_04_85 =  c00_04[(c00_04['PRODUCT_HS'] == 'HS17_85') & (c00_04['ADJUSTMENT'] == 'B_ADJ_RX')]

In [10]:
print(c00_04_49['REF_AREA'].nunique())
print(c00_04_49['COUNTERPART_AREA'].nunique())
print(c00_04_37['REF_AREA'].nunique())
print(c00_04_37['COUNTERPART_AREA'].nunique())
print(c00_04_85['REF_AREA'].nunique())
print(c00_04_85['COUNTERPART_AREA'].nunique())
print(c00_04_49['PRODUCT_HS'].nunique())
print(c00_04_37['PRODUCT_HS'].nunique())
print(c00_04_85['PRODUCT_HS'].nunique())

197
197
193
197
197
197
1
1
1


#### From 2005 to 2009

In [11]:
c05_09 = pd.read_csv('../data/raw/HS17_2D_DE_2005_To_2009.csv')

In [12]:
c05_09.head(3)

Unnamed: 0,DATAFLOW,REF_AREA,COUNTERPART_AREA,TRADE_FLOW,PRODUCT_TYPE,PRODUCT_CPA,PRODUCT_HS,FREQ,TIME_PERIOD,OBS_VALUE,OBS_STATUS,METHODOLOGY_TYPE,UNIT_MULT,UNIT_MEASURE,ADJUSTMENT,DECIMALS
0,OECD.SDD.TPS:DSD_BIMTS@DF_BIMTS_HS2017_2D(1.0),HUN,MYS,X,C,_Z,HS17_90,A,2006,1.176961,E,AG,6,USD_EXC,B,2
1,OECD.SDD.TPS:DSD_BIMTS@DF_BIMTS_HS2017_2D(1.0),HUN,MYS,X,C,_Z,HS17_90,A,2006,1.176961,E,AG,6,USD_EXC,B_ADJ_RX,2
2,OECD.SDD.TPS:DSD_BIMTS@DF_BIMTS_HS2017_2D(1.0),USA,ROU,X,C,_Z,HS17_45,A,2006,0.004019,E,AG,6,USD_EXC,B,2


In [13]:
c05_09 = c05_09[['REF_AREA', 'COUNTERPART_AREA', 'PRODUCT_HS', 'TIME_PERIOD', 'OBS_VALUE','ADJUSTMENT']]

In [14]:
c05_09 = c05_09[c05_09['COUNTERPART_AREA'] != 'W']
c05_09 = c05_09[c05_09['REF_AREA'] != 'W']

Kosovo and Montenegro gained independence from Serbia in 2008 and 2006, respectively. I am checking their availability in the data.

In [15]:
#c05_09 [c05_09['REF_AREA'] == 'MNE']
#c05_09[c05_09['COUNTERPART_AREA'] == 'MNE']

In [16]:
#c05_09 [c05_09['REF_AREA'] == 'XKX']
#c05_09[c05_09['COUNTERPART_AREA'] == 'XKX']

In [17]:
c05_09_total = c05_09[(c05_09['PRODUCT_HS'] == '_T') & (c05_09['ADJUSTMENT'] == 'B_ADJ_RX') ]

In [18]:
print(c05_09_total['REF_AREA'].nunique())
print(c05_09_total['COUNTERPART_AREA'].nunique())
print(c05_09_total['PRODUCT_HS'].nunique())

199
199
1


In [19]:
c05_09_unique = c05_09[(c05_09['PRODUCT_HS'] == 'HS17_97') & (c05_09['ADJUSTMENT'] == 'B_ADJ_RX')]

In [20]:
print(c05_09_unique['REF_AREA'].nunique())
print(c05_09_unique['COUNTERPART_AREA'].nunique())
print(c05_09_unique['PRODUCT_HS'].nunique())

199
199
1


In [21]:
c05_09_49 =  c05_09[(c05_09['PRODUCT_HS'] == 'HS17_49') & (c05_09['ADJUSTMENT'] == 'B_ADJ_RX')]
c05_09_37 =  c05_09[(c05_09['PRODUCT_HS'] == 'HS17_37') & (c05_09['ADJUSTMENT'] == 'B_ADJ_RX')]
c05_09_85 =  c05_09[(c05_09['PRODUCT_HS'] == 'HS17_85') & (c05_09['ADJUSTMENT'] == 'B_ADJ_RX')]

In [22]:
print(c05_09_37['REF_AREA'].nunique())
print(c05_09_37['COUNTERPART_AREA'].nunique())
print(c05_09_49['REF_AREA'].nunique())
print(c05_09_49['COUNTERPART_AREA'].nunique())
print(c05_09_85['REF_AREA'].nunique())
print(c05_09_85['COUNTERPART_AREA'].nunique())
print(c05_09_49['PRODUCT_HS'].nunique())
print(c05_09_37['PRODUCT_HS'].nunique())
print(c05_09_85['PRODUCT_HS'].nunique())

195
199
199
199
199
199
1
1
1


#### From 2010 to 2014

In [23]:
c10_14 = pd.read_csv('../data/raw/HS17_2D_DE_2010_To_2014.csv')

In [24]:
c10_14 = c10_14[['REF_AREA', 'COUNTERPART_AREA', 'PRODUCT_HS', 'TIME_PERIOD', 'OBS_VALUE','ADJUSTMENT']]

In [25]:
c10_14 = c10_14[c10_14['COUNTERPART_AREA'] != 'W']
c10_14 = c10_14[c10_14['REF_AREA'] != 'W']

Kosovo and Montenegro gained independence from Serbia in 2008 and 2006, respectively. In the 2005 to 2009 data Montenegro was present but Kosovo was not. South Sudan also gained independence from Sudan in 2011. I will check the availability of Kosovo and South Sudan here.
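
The availability checks in the commented cells could also be written as a small helper that reports presence directly instead of displaying filtered rows. A minimal sketch, assuming a frame with the `REF_AREA` and `COUNTERPART_AREA` columns used above; `area_present` is a hypothetical name and the toy frame is made up for illustration.

```python
import pandas as pd

def area_present(df, code):
    """Report whether an area code appears as reporter and/or partner."""
    return {'as_reporter': bool((df['REF_AREA'] == code).any()),
            'as_partner': bool((df['COUNTERPART_AREA'] == code).any())}

# Toy data, not taken from the OECD files
toy = pd.DataFrame({'REF_AREA': ['SRB', 'MNE'],
                    'COUNTERPART_AREA': ['MNE', 'SSD']})
for code in ['MNE', 'XKX', 'SSD']:
    print(code, area_present(toy, code))
```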

In [26]:
#c10_14[c10_14['REF_AREA'] == 'MNE']
#c10_14[c10_14['COUNTERPART_AREA'] == 'MNE']

In [27]:
#c10_14[c10_14['REF_AREA'] == 'XKX']
#c10_14[c10_14['COUNTERPART_AREA'] == 'XKX']

In [28]:
#c10_14[c10_14['REF_AREA'] == 'SSD']
#c10_14[c10_14['COUNTERPART_AREA'] == 'SSD']

In [29]:
c10_14_total = c10_14[(c10_14['PRODUCT_HS'] == '_T') & (c10_14['ADJUSTMENT'] == 'B_ADJ_RX')]

In [30]:
print(c10_14_total['REF_AREA'].nunique())
print(c10_14_total['COUNTERPART_AREA'].nunique())
print(c10_14_total['PRODUCT_HS'].nunique())

200
200
1


In [31]:
c10_14_unique = c10_14[(c10_14['PRODUCT_HS'] == 'HS17_97') & (c10_14['ADJUSTMENT'] == 'B_ADJ_RX')]

In [32]:
print(c10_14_unique['REF_AREA'].nunique())
print(c10_14_unique['COUNTERPART_AREA'].nunique())
print(c10_14_unique['PRODUCT_HS'].nunique())

199
200
1


In [33]:
c10_14_49 =  c10_14[(c10_14['PRODUCT_HS'] == 'HS17_49') & (c10_14['ADJUSTMENT'] == 'B_ADJ_RX')]
c10_14_37 =  c10_14[(c10_14['PRODUCT_HS'] == 'HS17_37') & (c10_14['ADJUSTMENT'] == 'B_ADJ_RX')]
c10_14_85 =  c10_14[(c10_14['PRODUCT_HS'] == 'HS17_85') & (c10_14['ADJUSTMENT'] == 'B_ADJ_RX')]

In [34]:
print(c10_14_49['REF_AREA'].nunique())
print(c10_14_49['COUNTERPART_AREA'].nunique())
print(c10_14_37['REF_AREA'].nunique())
print(c10_14_37['COUNTERPART_AREA'].nunique())
print(c10_14_85['REF_AREA'].nunique())
print(c10_14_85['COUNTERPART_AREA'].nunique())
print(c10_14_49['PRODUCT_HS'].nunique())
print(c10_14_37['PRODUCT_HS'].nunique())
print(c10_14_85['PRODUCT_HS'].nunique())

200
200
195
200
200
200
1
1
1


#### From 2015 to 2019

In [35]:
c15_19 = pd.read_csv('../data/raw/HS17_2D_DE_2015_To_2019.csv')

In [36]:
c15_19 = c15_19[['REF_AREA', 'COUNTERPART_AREA', 'PRODUCT_HS', 'TIME_PERIOD', 'OBS_VALUE', 'ADJUSTMENT']]

In [37]:
c15_19 = c15_19[c15_19['COUNTERPART_AREA'] != 'W']
c15_19 = c15_19[c15_19['REF_AREA'] != 'W']

In [38]:
#c15_19[c15_19['REF_AREA'] == 'MNE']
#c15_19[c15_19['COUNTERPART_AREA'] == 'MNE']

In [39]:
#c15_19[c15_19['REF_AREA'] == 'XKX']
#c15_19[c15_19['COUNTERPART_AREA'] == 'XKX']

In [40]:
#c15_19[c15_19['REF_AREA'] == 'SSD']
#c15_19[c15_19['COUNTERPART_AREA'] == 'SSD']

In [41]:
c15_19_total = c15_19[(c15_19['PRODUCT_HS'] == '_T') & (c15_19['ADJUSTMENT'] == 'B_ADJ_RX')]

In [42]:
print(c15_19_total['REF_AREA'].nunique())
print(c15_19_total['COUNTERPART_AREA'].nunique())
print(c15_19_total['PRODUCT_HS'].nunique())

200
200
1


In [43]:
c15_19_unique = c15_19[(c15_19['PRODUCT_HS'] == 'HS17_97') & (c15_19['ADJUSTMENT'] == 'B_ADJ_RX') ]

In [44]:
print(c15_19_unique['REF_AREA'].nunique())
print(c15_19_unique['COUNTERPART_AREA'].nunique())
print(c15_19_unique['PRODUCT_HS'].nunique())

200
200
1


In [45]:
c15_19_49 =  c15_19[(c15_19['PRODUCT_HS'] == 'HS17_49') & (c15_19['ADJUSTMENT'] == 'B_ADJ_RX') ]
c15_19_37 =  c15_19[(c15_19['PRODUCT_HS'] == 'HS17_37') & (c15_19['ADJUSTMENT'] == 'B_ADJ_RX')]
c15_19_85 =  c15_19[(c15_19['PRODUCT_HS'] == 'HS17_85') & (c15_19['ADJUSTMENT'] == 'B_ADJ_RX')]

In [46]:
print(c15_19_49['REF_AREA'].nunique())
print(c15_19_49['COUNTERPART_AREA'].nunique())
print(c15_19_37['REF_AREA'].nunique())
print(c15_19_37['COUNTERPART_AREA'].nunique())
print(c15_19_85['REF_AREA'].nunique())
print(c15_19_85['COUNTERPART_AREA'].nunique())
print(c15_19_49['PRODUCT_HS'].nunique())
print(c15_19_37['PRODUCT_HS'].nunique())
print(c15_19_85['PRODUCT_HS'].nunique())

200
200
189
200
200
200
1
1
1


#### From 2020 to 2023

In [47]:
c020_23 = pd.read_csv('../data/raw/HS17_2D_DE_2020_To_2023.csv')

In [48]:
c020_23 = c020_23[['REF_AREA', 'COUNTERPART_AREA', 'PRODUCT_HS', 'TIME_PERIOD', 'OBS_VALUE', 'ADJUSTMENT']]

In [49]:
c020_23 = c020_23[c020_23['COUNTERPART_AREA'] != 'W']
c020_23 = c020_23[c020_23['REF_AREA'] != 'W']

In [50]:
#c020_23[c020_23['REF_AREA'] == 'MNE']
#c020_23[c020_23['COUNTERPART_AREA'] == 'MNE']

In [51]:
#c020_23[c020_23['REF_AREA'] == 'XKX']
#c020_23[c020_23['COUNTERPART_AREA'] == 'XKX']

In [52]:
#c020_23[c020_23['REF_AREA'] == 'SSD']
#c020_23[c020_23['COUNTERPART_AREA'] == 'SSD']

In [53]:
c020_23_total = c020_23[(c020_23['PRODUCT_HS'] == '_T') & (c020_23['ADJUSTMENT'] == 'B_ADJ_RX')]

In [54]:
print(c020_23_total['REF_AREA'].nunique())
print(c020_23_total['COUNTERPART_AREA'].nunique())
print(c020_23_total['PRODUCT_HS'].nunique())

200
200
1


In [55]:
c020_23_unique = c020_23[(c020_23['PRODUCT_HS'] == 'HS17_97') & (c020_23['ADJUSTMENT'] == 'B_ADJ_RX')]

In [56]:
print(c020_23_unique['REF_AREA'].nunique())
print(c020_23_unique['COUNTERPART_AREA'].nunique())
print(c020_23_unique['PRODUCT_HS'].nunique())

200
200
1


In [57]:
c020_23_49 =  c020_23[(c020_23['PRODUCT_HS'] == 'HS17_49')  & (c020_23['ADJUSTMENT'] == 'B_ADJ_RX')]
c020_23_37 =  c020_23[(c020_23['PRODUCT_HS'] == 'HS17_37')  & (c020_23['ADJUSTMENT'] == 'B_ADJ_RX')]
c020_23_85 =  c020_23[(c020_23['PRODUCT_HS'] == 'HS17_85')  & (c020_23['ADJUSTMENT'] == 'B_ADJ_RX')]

In [58]:
print(c020_23_49['REF_AREA'].nunique())
print(c020_23_49['COUNTERPART_AREA'].nunique())
print(c020_23_37['REF_AREA'].nunique())
print(c020_23_37['COUNTERPART_AREA'].nunique())
print(c020_23_85['REF_AREA'].nunique())
print(c020_23_85['COUNTERPART_AREA'].nunique())
print(c020_23_49['PRODUCT_HS'].nunique())
print(c020_23_37['PRODUCT_HS'].nunique())
print(c020_23_85['PRODUCT_HS'].nunique())

200
200
186
200
200
200
1
1
1


##### For all datasets I will do the following steps:
1. Concatenating the 5 dataframes for each HS code category to make a time series from 2000 to 2023
2. Renaming the exporter and importer columns
3. Removing self loops
4. Checking for duplicates and removing any that are found
5. Sorting values by country pair and time period
6. Resetting the index
7. Saving to the cleaned folder in csv format
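
Since the same steps are applied to every HS code category, they can be sketched as one reusable function. This is a hedged sketch rather than the code the notebook runs: `clean_dyads` is a hypothetical name, the toy frames are made up, and the final saving step is left out so the block has no side effects.

```python
import pandas as pd

def clean_dyads(frames):
    """Concatenate period frames and apply steps 1 to 6 above."""
    df = pd.concat(frames, axis=0)
    df = df.rename(columns={'REF_AREA': 'iso_o', 'COUNTERPART_AREA': 'iso_d'})
    df = df[df['iso_o'] != df['iso_d']]       # remove self loops
    df = df.drop_duplicates(keep='last')      # keep one copy of repeated rows
    df = df.sort_values(['iso_o', 'iso_d', 'TIME_PERIOD'])
    return df.reset_index(drop=True)

# Two toy "period" frames with one self loop and one repeated row
a = pd.DataFrame({'REF_AREA': ['USA', 'USA', 'BGD'],
                  'COUNTERPART_AREA': ['USA', 'IND', 'IND'],
                  'TIME_PERIOD': [2004, 2004, 2003],
                  'OBS_VALUE': [9.0, 1.0, 2.0]})
b = pd.DataFrame({'REF_AREA': ['BGD', 'USA'],
                  'COUNTERPART_AREA': ['IND', 'IND'],
                  'TIME_PERIOD': [2005, 2004],
                  'OBS_VALUE': [3.0, 1.0]})
cleaned = clean_dyads([a, b])
print(cleaned)
```

The result could then be written out with `to_csv(..., index=False)` as in the cells that follow.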

#### For unique goods

In [59]:
# appending the five dataframes for HS code 97
unique = pd.concat([c00_04_unique, c05_09_unique, c10_14_unique, c15_19_unique, c020_23_unique], axis = 0)

In [60]:
# renaming the exporter and importer column to match the fixed variable dataset
unique = unique.rename(columns = {'REF_AREA': 'iso_o'})
unique = unique.rename(columns = {'COUNTERPART_AREA': 'iso_d'})

In [61]:
unique.shape

(174733, 6)

In [62]:
print(unique['TIME_PERIOD'].nunique())
print(unique['ADJUSTMENT'].nunique())

24
1


In [63]:
# removing self_loops
unique = unique[unique['iso_o'] != unique['iso_d']]

In [64]:
unique.shape

(174305, 6)

In [65]:
# removing duplicates
unique_duplicate = unique.duplicated(keep = 'last')
#unique[unique_duplicate].shape[0]
unique_unique = unique[~unique_duplicate]
# There is no duplicate for unique cultural trade. 

In [66]:
unique_unique.shape

(174305, 6)

In [67]:
unique_unique = unique_unique.sort_values(['iso_o', 'iso_d', 'TIME_PERIOD'])

In [68]:
unique_unique = unique_unique.reset_index(drop= True)

In [69]:
unique_unique.head()

Unnamed: 0,iso_o,iso_d,PRODUCT_HS,TIME_PERIOD,OBS_VALUE,ADJUSTMENT
0,ABW,AGO,HS17_97,2014,2.8e-05,B_ADJ_RX
1,ABW,ARE,HS17_97,2005,0.000641,B_ADJ_RX
2,ABW,ARE,HS17_97,2014,5.3e-05,B_ADJ_RX
3,ABW,ARE,HS17_97,2020,1.9e-05,B_ADJ_RX
4,ABW,ARE,HS17_97,2022,0.000132,B_ADJ_RX


In [70]:
print(unique_unique['iso_o'].nunique())
print(unique_unique['iso_d'].nunique())
print(unique_unique.groupby(['iso_o','iso_d']).ngroups)

201
201
18915


In [71]:
memory = 0.3
def compute_hysteresis(series):
    '''Takes a numeric pandas Series and returns a list of hysteresis values of the same length.
    Each value is memory times the previous hysteresis value plus the previous observation's trade volume; both start at 0.'''
    hyst = []
    prev_cum = 0
    prev_val = 0
    for val in series:
        hyst_val = memory * prev_cum + prev_val
        hyst.append(hyst_val)
        prev_cum = hyst_val
        prev_val = val
    return hyst
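
The loop above amounts to the recurrence h_t = memory * h_(t-1) + x_(t-1) with h_0 = 0, so the current observation never enters its own hysteresis value. The sketch below restates the function compactly so that it runs on its own and compares it with the unrolled sum h_t = sum over k = 1..t of memory^(k-1) * x_(t-k). The name `closed_form` and the toy values are illustrative assumptions. Because dyads only have rows for some years (the `ABW` to `ARE` rows shown earlier jump from 2005 to 2014), "previous" presumably means the previous row of the pair rather than the previous calendar year.

```python
memory = 0.3

def compute_hysteresis(series):
    # Same recurrence as the notebook function, restated compactly
    hyst, prev_cum, prev_val = [], 0, 0
    for val in series:
        hyst_val = memory * prev_cum + prev_val
        hyst.append(hyst_val)
        prev_cum, prev_val = hyst_val, val
    return hyst

def closed_form(values):
    # h_t = sum over k = 1..t of memory**(k - 1) * x_{t-k}
    return [sum(memory ** (k - 1) * values[t - k] for k in range(1, t + 1))
            for t in range(len(values))]

values = [1.0, 2.0, 3.0]  # toy trade volumes for one country pair
print(compute_hysteresis(values))
print(closed_form(values))
```

For these toy values the third entry is 0.3 * 1.0 + 2.0, that is, the previous hysteresis value discounted by `memory` plus the previous observation.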

In [72]:
unique_unique['hysteresis_unique'] = (unique_unique.groupby(['iso_o', 'iso_d'])['OBS_VALUE'].transform(compute_hysteresis))

In [73]:
#unique_unique[(unique_unique['iso_o'] == 'BGD') & (unique_unique['iso_d'] == 'IND')]

In [74]:
unique_unique.to_csv("../data/cleaned/unique2000_2023.csv", encoding='utf-8', index=False)

In [75]:
#unique_unique[(unique_unique['iso_o'] == 'BGD') & (unique_unique['iso_d'] == 'IND')]

#### Preparing the dataset for reproducible goods

* Cinema 

In [76]:
cinema = pd.concat([c00_04_37, c05_09_37, c10_14_37, c15_19_37, c020_23_37], axis = 0)

In [77]:
print(cinema['TIME_PERIOD'].nunique())
print(cinema['ADJUSTMENT'].nunique())

24
1


In [78]:
# renaming the exporter and importer column to match the fixed variable dataset
cinema = cinema.rename(columns = {'REF_AREA': 'iso_o'})
cinema = cinema.rename(columns = {'COUNTERPART_AREA': 'iso_d'})

In [79]:
cinema.shape

(138854, 6)

In [80]:
# removing self_loops
cinema = cinema[cinema['iso_o'] != cinema['iso_d']]

In [81]:
cinema.shape

(138525, 6)

In [82]:
# removing duplicates
cinema_duplicate = cinema.duplicated(keep = 'last')
cinema_unique = cinema[~cinema_duplicate]

In [83]:
cinema_unique.shape
# No duplicate in cinema dataframe

(138525, 6)

In [84]:
cinema_unique = cinema_unique.sort_values(['iso_o', 'iso_d', 'TIME_PERIOD'])

In [85]:
cinema_unique = cinema_unique.reset_index(drop= True)

In [86]:
cinema_unique['hysteresis_cinema'] = (cinema_unique.groupby(['iso_o', 'iso_d'])['OBS_VALUE'].transform(compute_hysteresis))

In [87]:
cinema_unique.to_csv("../data/cleaned/cinema2000_2023.csv", encoding='utf-8', index=False)

In [88]:
#cinema_unique[(cinema_unique['iso_o'] == 'BGD') & (cinema_unique['iso_d'] == 'IND')]

* Books

In [89]:
books = pd.concat([c00_04_49, c05_09_49, c10_14_49, c15_19_49, c020_23_49], axis = 0)

In [90]:
books.shape

(339664, 6)

In [91]:
print(books['TIME_PERIOD'].nunique())
print(books['ADJUSTMENT'].nunique())

24
1


In [92]:
# renaming the exporter and importer column to match the fixed variable dataset
books = books.rename(columns = {'REF_AREA': 'iso_o'})
books = books.rename(columns = {'COUNTERPART_AREA': 'iso_d'})

In [93]:
# removing self_loops
books = books[books['iso_o'] != books['iso_d']]

In [94]:
books.shape

(339113, 6)

In [95]:
# removing duplicates
books_duplicate = books.duplicated(keep = 'last')
books_unique = books[~books_duplicate]
print(books_unique.shape)

(339113, 6)


In [96]:
books_unique = books_unique.sort_values(['iso_o', 'iso_d', 'TIME_PERIOD'])

In [97]:
books_unique = books_unique.reset_index(drop= True)

In [98]:
books_unique['hysteresis_books'] = (books_unique.groupby(['iso_o', 'iso_d'])['OBS_VALUE'].transform(compute_hysteresis))

In [99]:
books_unique.to_csv("../data/cleaned/books2000_2023.csv", encoding='utf-8', index=False)

In [100]:
#books_unique[(books_unique['iso_o'] == 'BGD') & (books_unique['iso_d'] == 'IND')]

* Tapes

In [101]:
tapes = pd.concat([c00_04_85, c05_09_85, c10_14_85, c15_19_85, c020_23_85], axis = 0)

In [102]:
print(tapes['TIME_PERIOD'].nunique())
print(tapes['ADJUSTMENT'].nunique())

24
1


In [103]:
# renaming the exporter and importer column to match the fixed variable dataset
tapes = tapes.rename(columns = {'REF_AREA': 'iso_o'})
tapes = tapes.rename(columns = {'COUNTERPART_AREA': 'iso_d'})

In [104]:
tapes.shape

(471045, 6)

In [105]:
# removing self_loops
tapes = tapes[tapes['iso_o'] != tapes['iso_d']]

In [106]:
tapes.shape

(470377, 6)

In [107]:
# removing duplicates
tapes_duplicate = tapes.duplicated(keep = 'last')
tapes_unique = tapes[~tapes_duplicate]

In [108]:
tapes_unique.shape
# no duplicate for tapes as well

(470377, 6)

In [109]:
tapes_unique = tapes_unique.sort_values(['iso_o', 'iso_d', 'TIME_PERIOD'])
tapes_unique = tapes_unique.reset_index(drop= True)

In [110]:
tapes_unique['hysteresis_tapes'] = (tapes_unique.groupby(['iso_o', 'iso_d'])['OBS_VALUE'].transform(compute_hysteresis))

In [111]:
tapes_unique.to_csv("../data/cleaned/tapes2000_2023.csv", encoding='utf-8', index=False)

In [112]:
#tapes_unique[(tapes_unique['iso_o'] == 'BGD') & (tapes_unique['iso_d'] == 'IND')]

In [113]:
print(unique_unique['iso_o'].nunique())
print(books_unique['iso_o'].nunique())
print(cinema_unique['iso_o'].nunique())
print(tapes_unique['iso_o'].nunique())
print(unique_unique['iso_d'].nunique())
print(books_unique['iso_d'].nunique())
print(cinema_unique['iso_d'].nunique())
print(tapes_unique['iso_d'].nunique())
print(unique_unique.groupby(['iso_o','iso_d']).ngroups)
print(books_unique.groupby(['iso_o','iso_d']).ngroups)
print(cinema_unique.groupby(['iso_o','iso_d']).ngroups)
print(tapes_unique.groupby(['iso_o','iso_d']).ngroups)

201
201
200
201
201
201
201
201
18915
27050
14037
32709


### Aggregating cinemas, books and tapes

##### Steps to aggregate three different reproducible cultural products datasets:
1. Concatenating the tapes, books and cinemas dataframes row-wise (axis = 0).
2. Grouping the concatenated data by country pair and year
3. Summing OBS_VALUE, which is the trade value for that year in US dollars converted at exchange rates
4. Adding a column that records the HS codes of the products traded in that year

In [114]:
tapes_unique = tapes_unique.drop('hysteresis_tapes', axis = 1)

In [115]:
books_unique = books_unique.drop('hysteresis_books', axis = 1)

In [116]:
cinema_unique = cinema_unique.drop('hysteresis_cinema', axis = 1)

In [117]:
rp = pd.concat([tapes_unique, books_unique, cinema_unique], axis = 0)

In [118]:
print(rp['iso_o'].nunique())
print(rp['iso_d'].nunique())
print(rp['TIME_PERIOD'].nunique())
print(rp['PRODUCT_HS'].nunique())
print(rp.groupby(['iso_o','iso_d']).ngroups)

201
201
24
3
33443


In [119]:
rp.columns

Index(['iso_o', 'iso_d', 'PRODUCT_HS', 'TIME_PERIOD', 'OBS_VALUE',
       'ADJUSTMENT'],
      dtype='object')

In [120]:
rp_agg = rp.groupby(['iso_o', 'iso_d', 'TIME_PERIOD']).agg({'OBS_VALUE': 'sum', 'PRODUCT_HS': lambda x :tuple(sorted(set(x)))}).reset_index()

In [121]:
rp_agg['product_count'] = rp_agg['PRODUCT_HS'].apply(len)

In [122]:
rp_agg.head()

Unnamed: 0,iso_o,iso_d,TIME_PERIOD,OBS_VALUE,PRODUCT_HS,product_count
0,ABW,AFG,2019,0.012061,"(HS17_49, HS17_85)",2
1,ABW,AGO,2008,0.000165,"(HS17_85,)",1
2,ABW,AGO,2011,7.6e-05,"(HS17_85,)",1
3,ABW,ARE,2013,0.000953,"(HS17_49,)",1
4,ABW,ARE,2016,0.018888,"(HS17_85,)",1


In [123]:
print(rp_agg['iso_o'].nunique())
print(rp_agg['iso_d'].nunique())
print(rp_agg['TIME_PERIOD'].nunique())
print(rp_agg['PRODUCT_HS'].nunique())
print(rp_agg.groupby(['iso_o','iso_d']).ngroups)

201
201
24
7
33443


In [124]:
print(rp_agg['PRODUCT_HS'].value_counts())

PRODUCT_HS
(HS17_49, HS17_85)             185011
(HS17_85,)                     148323
(HS17_37, HS17_49, HS17_85)    132380
(HS17_49,)                      21362
(HS17_37, HS17_85)               4663
(HS17_37,)                       1122
(HS17_37, HS17_49)                360
Name: count, dtype: int64


In [125]:
185011 + 4663 + 360

190034

In [126]:
148323 + 21362 + 1122

170807

In [127]:
print(rp_agg['product_count'].value_counts())

product_count
2    190034
1    170807
3    132380
Name: count, dtype: int64
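
The two manual sums above check that the `product_count` totals agree with the `PRODUCT_HS` combinations. The same cross-check can be sketched programmatically. The counts below are copied from the `value_counts` output above; `combo_counts` and `by_size` are hypothetical names.

```python
from collections import Counter

# Row counts per HS code combination, copied from the output above
combo_counts = {
    ('HS17_49', 'HS17_85'): 185011,
    ('HS17_85',): 148323,
    ('HS17_37', 'HS17_49', 'HS17_85'): 132380,
    ('HS17_49',): 21362,
    ('HS17_37', 'HS17_85'): 4663,
    ('HS17_37',): 1122,
    ('HS17_37', 'HS17_49'): 360,
}

# Add up the rows by the number of products in each combination
by_size = Counter()
for combo, n in combo_counts.items():
    by_size[len(combo)] += n
print(dict(by_size))
```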


In [128]:
rp_agg['hysteresis_repro'] = (rp_agg.groupby(['iso_o', 'iso_d'])['OBS_VALUE'].transform(compute_hysteresis))

In [129]:
rp_agg.to_csv("../data/cleaned/reproducible2000_2023.csv", encoding='utf-8', index=False)

In [130]:
print(unique_unique.groupby(['iso_o','iso_d']).ngroups)
print(rp_agg.groupby(['iso_o','iso_d']).ngroups)

18915
33443


In [131]:
assert unique_unique['iso_o'].nunique() == rp_agg['iso_o'].nunique() 

In [132]:
assert unique_unique['iso_d'].nunique() == rp_agg['iso_d'].nunique()

### non-cultural goods

In [133]:
c00_04 = c00_04[(c00_04['PRODUCT_HS'] != '_T') & (c00_04['PRODUCT_HS'] != 'HS17_97') & (c00_04['PRODUCT_HS'] != 'HS17_49') & (c00_04['PRODUCT_HS'] != 'HS17_37') & (c00_04['PRODUCT_HS'] != 'HS17_85')]

In [134]:
c05_09 = c05_09[(c05_09['PRODUCT_HS'] != '_T') & (c05_09['PRODUCT_HS'] != 'HS17_97') & (c05_09['PRODUCT_HS'] != 'HS17_49') & (c05_09['PRODUCT_HS'] != 'HS17_37') & (c05_09['PRODUCT_HS'] != 'HS17_85')]

In [135]:
c10_14 = c10_14[(c10_14['PRODUCT_HS'] != '_T') & (c10_14['PRODUCT_HS'] != 'HS17_97') & (c10_14['PRODUCT_HS'] != 'HS17_49') & (c10_14['PRODUCT_HS'] != 'HS17_37') & (c10_14['PRODUCT_HS'] != 'HS17_85')]

In [136]:
c15_19 = c15_19[(c15_19['PRODUCT_HS'] != '_T') & (c15_19['PRODUCT_HS'] != 'HS17_97') & (c15_19['PRODUCT_HS'] != 'HS17_49') & (c15_19['PRODUCT_HS'] != 'HS17_37') & (c15_19['PRODUCT_HS'] != 'HS17_85')]

In [137]:
c020_23 = c020_23[(c020_23['PRODUCT_HS'] != '_T') & (c020_23['PRODUCT_HS'] != 'HS17_97') & (c020_23['PRODUCT_HS'] != 'HS17_49') & (c020_23['PRODUCT_HS'] != 'HS17_37') & (c020_23['PRODUCT_HS'] != 'HS17_85')]

In [138]:
non_cultural = pd.concat([c00_04, c05_09, c10_14, c15_19, c020_23], axis = 0)

In [139]:
non_cultural = non_cultural[non_cultural['ADJUSTMENT'] == "B_ADJ_RX"]
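
The five exclusion filters above each chain several `!=` comparisons. As a hedged alternative, the same selection can be sketched with `Series.isin`; `EXCLUDED` and `drop_cultural` are hypothetical names and the toy frame is made up.

```python
import pandas as pd

# The total category plus the four cultural HS codes used in this notebook
EXCLUDED = ['_T', 'HS17_97', 'HS17_49', 'HS17_37', 'HS17_85']

def drop_cultural(df):
    """Keep rows whose PRODUCT_HS is neither the total nor a cultural code."""
    return df[~df['PRODUCT_HS'].isin(EXCLUDED)]

toy = pd.DataFrame({'PRODUCT_HS': ['_T', 'HS17_90', 'HS17_97', 'HS17_45'],
                    'OBS_VALUE': [10.0, 1.0, 2.0, 3.0]})
kept = drop_cultural(toy)
print(kept['PRODUCT_HS'].tolist())
```

Keeping the excluded codes in one list means a code only has to be added or removed in a single place.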

In [140]:
non_cultural['ADJUSTMENT'].nunique()

1

In [141]:
# renaming the exporter and importer column to match the fixed variable dataset
non_cultural = non_cultural.rename(columns = {'REF_AREA': 'iso_o'})
non_cultural = non_cultural.rename(columns = {'COUNTERPART_AREA': 'iso_d'})

In [142]:
non_cultural.shape

(19117913, 6)

In [143]:
# removing duplicates
non_cultural_duplicate = non_cultural.duplicated(keep = 'last')
non_cultural_unique = non_cultural[~non_cultural_duplicate]

In [144]:
non_cultural_unique.shape  #no duplicate

(19117913, 6)

In [145]:
non_cultural_unique = non_cultural_unique.sort_values(['iso_o', 'iso_d', 'TIME_PERIOD'])
non_cultural_unique = non_cultural_unique.reset_index(drop = True)

In [146]:
#non_cultural_unique.head(10)

In [147]:
0.031152 + 0.003771 + 0.000121 + 0.000045 + 0.030341 + 0.000079 + 0.000059 + 0.000282

0.06585

In [148]:
print(non_cultural_unique['PRODUCT_HS'].nunique())
print(non_cultural_unique['ADJUSTMENT'].nunique())
print(non_cultural_unique['TIME_PERIOD'].nunique())

93
1
24


In [149]:
non_cultural_unique_grouped = non_cultural_unique.groupby(['iso_o', 'iso_d', 'TIME_PERIOD'])['OBS_VALUE'].sum().reset_index()

In [150]:
print(non_cultural_unique_grouped['TIME_PERIOD'].nunique())
print(non_cultural_unique_grouped['iso_o'].nunique())
print(non_cultural_unique_grouped['iso_d'].nunique())

24
201
201


In [151]:
non_cultural_unique_grouped['hysteresis_total'] = (non_cultural_unique_grouped.groupby(['iso_o', 'iso_d'])['OBS_VALUE'].transform(compute_hysteresis))

In [152]:
non_cultural_unique_grouped.to_csv("../data/cleaned/total2000_2023.csv", encoding = 'utf-8', index = False)

In [153]:
assert unique_unique['iso_o'].nunique() == rp_agg['iso_o'].nunique() == non_cultural_unique_grouped['iso_o'].nunique()

In [154]:
assert unique_unique['iso_d'].nunique() == rp_agg['iso_d'].nunique() == non_cultural_unique_grouped['iso_d'].nunique()