In [None]:
# required packages
import pandas as pd
import numpy as np

In [2]:
# Loading data
raw_data = pd.read_csv('Datasets_MS_Project/Government_Expenditure/Investment_GovernmentExpenditure_E_All_Data_(Normalized)/Investment_GovernmentExpenditure_E_All_Data_(Normalized).csv')
raw_data.head()

Unnamed: 0,Area Code,Area Code (M49),Area,Item Code,Item,Element Code,Element,Year Code,Year,Unit,Value,Flag,Note
0,2,'004,Afghanistan,23130,Total Expenditure (General Government),6224,Value Standard Local Currency,2006,2006,million SLC,111274.57,X,consolidated General Government
1,2,'004,Afghanistan,23130,Total Expenditure (General Government),6224,Value Standard Local Currency,2007,2007,million SLC,165029.87,X,consolidated General Government
2,2,'004,Afghanistan,23130,Total Expenditure (General Government),6224,Value Standard Local Currency,2008,2008,million SLC,466732.04,X,consolidated General Government
3,2,'004,Afghanistan,23130,Total Expenditure (General Government),6224,Value Standard Local Currency,2009,2009,million SLC,449927.62,X,consolidated General Government
4,2,'004,Afghanistan,23130,Total Expenditure (General Government),6224,Value Standard Local Currency,2010,2010,million SLC,599141.98,X,consolidated General Government


In [3]:
raw_data['Item'].unique()

array(['Total Expenditure (General Government)',
       'Agriculture, forestry, fishing (General Government)',
       'Environmental protection (General Government)',
       'Protection of Biodiversity and Landscape (General Government)',
       'Total Expenditure (Central Government)',
       'R&D Environmental Protection (General Government)',
       'Agriculture, forestry, fishing (Central Government)',
       'Agriculture, forestry, fishing, Recurrent (Central Government)',
       'Agriculture, forestry, fishing, Capital (Central Government)',
       'Environmental protection (Central Government)',
       'Protection of Biodiversity and Landscape (Central Government)',
       'R&D Environmental Protection (Central Government)',
       'SDG 2.a.1: Highest Government level',
       'Agriculture, forestry, fishing, Recurrent (General Government)',
       'Agriculture, forestry, fishing, Capital (General Government)',
       'Agriculture (General Government)',
       'Agriculture, Recu

In [4]:
raw_data['Element'].unique()

array(['Value Standard Local Currency', 'Value US$',
       'Value US$, 2015 prices', 'Share of Total Expenditure',
       'SDG 2.a.1: Agriculture share of Government Expenditure',
       'SDG 2.a.1: Agriculture Orientation Index (AOI) for Government Expenditure',
       'SDG 2.a.1: Agriculture value added share of GDP'], dtype=object)

This dataset contains a very detailed representation of government expenditure. 
There are several items in the "Item" column and "Element" column include 
different metrics to measure government expenditure. We would choose just the ones 
that are most relevant to our project, i.e., the ones that directly affects 
farmer producer prices. 

We would choose following items:
- 'Agriculture (General Government)',
- 'Agriculture, Recurrent (General Government)',
- 'Agriculture, Capital (General Government)', 
- 'R&D Agriculture, forestry, fishing (General Government)',
- 'Agriculture, forestry, fishing (General Government)',
- 'Agriculture (Central Government)',
- 'Agriculture, Recurrent (Central Government)',
- 'Agriculture, Capital (Central Government)',
- 'Agriculture, forestry, fishing (Highest Government level)'
- 'Agriculture, forestry, fishing (Central Government)',
- 'R&D Agriculture, forestry, fishing (Central Government)'

We would use data for general government whenever available. If not available, 
we would fall back to central government, and then if both general and central 
are not available, we could use highest government data. 

**General Government:** includes all levels: central + subnational 
(states, provinces, municipalities). Broader scope.

**Central Government:** Only national-level institutions 
(e.g., Ministry of Agriculture). Narrower scope.

**Highest Government Level:** Often similar to Central Government, but used in 
countries where data is available only at one level.

**Recurrent expenditure:** Salaries, fertilizer subsidies, operational costs.

**Capital expenditure:** Infrastructure (e.g., dams, market yards), 
land development.

We would consider separating recurrent and capital expenditures as separate 
features to see their different lag effects on prices.


From the elements, we would consider following: 

**Value US$, 2015 prices** – for the actual real expenditure trend

**Share of Total Expenditure** – to show relative political priority

**Agriculture Orientation Index (AOI)** – to capture policy bias toward agriculture
Compares share of agri expenditure to agriculture’s share in GDP. Captures policy emphasis. 
Useful for modeling how much priority agriculture gets.

In [8]:
# filter the dataset for desired data
items_to_keep = [
    'Agriculture (General Government)',
    'Agriculture, Recurrent (General Government)',
    'Agriculture, Capital (General Government)', 
    'R&D Agriculture, forestry, fishing (General Government)',
    'Agriculture, forestry, fishing (General Government)',
    'Agriculture (Central Government)',
    'Agriculture, Recurrent (Central Government)',
    'Agriculture, Capital (Central Government)',
    'Agriculture, forestry, fishing (Highest Government level)'
    'Agriculture, forestry, fishing (Central Government)',
    'R&D Agriculture, forestry, fishing (Central Government)',
    'SDG 2.a.1: Highest Government level'
]

elements_to_keep = [
    'Value US$, 2015 prices',
    'Share of Total Expenditure',
    'SDG 2.a.1: Agriculture Orientation Index (AOI) for Government Expenditure'
]

filtered_data = raw_data.loc[
    (raw_data['Item'].isin(items_to_keep)) &
    (raw_data['Element'].isin(elements_to_keep))
]

filtered_data.head()

Unnamed: 0,Area Code,Area Code (M49),Area,Item Code,Item,Element Code,Element,Year Code,Year,Unit,Value,Flag,Note
52,2,'004,Afghanistan,23131,"Agriculture, forestry, fishing (General Govern...",6184,"Value US$, 2015 prices",2006,2006,million USD,153.3,X,consolidated General Government
53,2,'004,Afghanistan,23131,"Agriculture, forestry, fishing (General Govern...",6184,"Value US$, 2015 prices",2007,2007,million USD,332.82,X,consolidated General Government
54,2,'004,Afghanistan,23131,"Agriculture, forestry, fishing (General Govern...",6184,"Value US$, 2015 prices",2008,2008,million USD,335.91,X,consolidated General Government
55,2,'004,Afghanistan,23131,"Agriculture, forestry, fishing (General Govern...",6184,"Value US$, 2015 prices",2009,2009,million USD,465.1,X,consolidated General Government
56,2,'004,Afghanistan,23131,"Agriculture, forestry, fishing (General Govern...",6184,"Value US$, 2015 prices",2010,2010,million USD,595.72,X,consolidated General Government


In [9]:
filtered_data['Item'].value_counts()

Item
SDG 2.a.1: Highest Government level                        4713
Agriculture, forestry, fishing (General Government)        3660
Agriculture (Central Government)                           1064
Agriculture, Recurrent (Central Government)                 739
Agriculture, Capital (Central Government)                   708
R&D Agriculture, forestry, fishing (Central Government)     605
Agriculture (General Government)                            542
Agriculture, Recurrent (General Government)                 406
R&D Agriculture, forestry, fishing (General Government)     376
Agriculture, Capital (General Government)                   374
Name: count, dtype: int64

In [10]:
filtered_data['Element'].value_counts()

Element
Value US$, 2015 prices                                                       6648
SDG 2.a.1: Agriculture Orientation Index (AOI) for Government Expenditure    4713
Share of Total Expenditure                                                   1826
Name: count, dtype: int64

## Govt expenditure in USD, 2015 prices

In [15]:
filtered_data_1 = filtered_data.loc[
    filtered_data['Element']=='Value US$, 2015 prices'
]
filtered_data_1.shape

(6648, 13)

In [21]:
filtered_data_1['Item'].value_counts()

Item
Agriculture, forestry, fishing (General Government)        1834
Agriculture (Central Government)                           1064
Agriculture, Recurrent (Central Government)                 739
Agriculture, Capital (Central Government)                   708
R&D Agriculture, forestry, fishing (Central Government)     605
Agriculture (General Government)                            542
Agriculture, Recurrent (General Government)                 406
R&D Agriculture, forestry, fishing (General Government)     376
Agriculture, Capital (General Government)                   374
Name: count, dtype: int64

In [22]:
filtered_data['Item'].value_counts()

Item
SDG 2.a.1: Highest Government level                        4713
Agriculture, forestry, fishing (General Government)        3660
Agriculture (Central Government)                           1064
Agriculture, Recurrent (Central Government)                 739
Agriculture, Capital (Central Government)                   708
R&D Agriculture, forestry, fishing (Central Government)     605
Agriculture (General Government)                            542
Agriculture, Recurrent (General Government)                 406
R&D Agriculture, forestry, fishing (General Government)     376
Agriculture, Capital (General Government)                   374
Name: count, dtype: int64

In [128]:
filtered_data.loc[filtered_data['Item']=='Agriculture, forestry, fishing (General Government)']['Element'].unique()

array(['Value US$, 2015 prices', 'Share of Total Expenditure'],
      dtype=object)

The data for 'Agriculture, forestry, fishing (General Government)' varies between 
filtered_data and filtered_data_1 because the data is shared among the two elements- 
'Value US$, 2015 prices', and 'Share of Total Expenditure'.

In [None]:
# restructuring data from long to wide format
pivoted_data_1 = filtered_data_1.pivot_table(
    index = ['Area Code', 'Area', 'Year Code', 'Year'],
    columns = 'Item',
    values = 'Value'
)

# resetting row index
pivoted_data_1.reset_index(inplace=True)

# setting column index name to None
pivoted_data_1.columns.name = None

pivoted_data_1.head()

Unnamed: 0,Area Code,Area,Year Code,Year,Agriculture (Central Government),Agriculture (General Government),"Agriculture, Capital (Central Government)","Agriculture, Capital (General Government)","Agriculture, Recurrent (Central Government)","Agriculture, Recurrent (General Government)","Agriculture, forestry, fishing (General Government)","R&D Agriculture, forestry, fishing (Central Government)","R&D Agriculture, forestry, fishing (General Government)"
0,1,Armenia,2009,2009,86.12,86.8,48.38,48.68,37.74,38.12,90.51,0.0,0.01
1,1,Armenia,2010,2010,90.44,91.83,62.34,63.27,28.1,28.56,93.82,0.0,0.01
2,1,Armenia,2011,2011,91.94,93.72,61.73,63.21,30.21,30.52,95.55,0.0,0.01
3,1,Armenia,2012,2012,47.32,49.31,18.65,20.21,28.67,29.1,51.02,0.0,0.0
4,1,Armenia,2013,2013,37.64,39.13,9.09,10.26,28.55,28.87,40.84,0.0,0.0


In [38]:
pivoted_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2435 entries, 0 to 2434
Data columns (total 13 columns):
 #   Column                                                   Non-Null Count  Dtype  
---  ------                                                   --------------  -----  
 0   Area Code                                                2435 non-null   int64  
 1   Area                                                     2435 non-null   object 
 2   Year Code                                                2435 non-null   int64  
 3   Year                                                     2435 non-null   int64  
 4   Agriculture (Central Government)                         1064 non-null   float64
 5   Agriculture (General Government)                         542 non-null    float64
 6   Agriculture, Capital (Central Government)                708 non-null    float64
 7   Agriculture, Capital (General Government)                374 non-null    float64
 8   Agriculture, Recurrent (Cent

In [84]:
pivoted_data_1.loc[pivoted_data_1['Area']=='United States of America'].head()

Unnamed: 0,Area Code,Area,Year Code,Year,Agriculture (Central Government),Agriculture (General Government),"Agriculture, Capital (Central Government)","Agriculture, Capital (General Government)","Agriculture, Recurrent (Central Government)","Agriculture, Recurrent (General Government)","Agriculture, forestry, fishing (General Government)","R&D Agriculture, forestry, fishing (Central Government)","R&D Agriculture, forestry, fishing (General Government)"
2232,231,United States of America,2001,2001,,,,,41756.2,48562.86,48562.86,,
2233,231,United States of America,2002,2002,,,,,29128.8,35702.11,35702.11,,
2234,231,United States of America,2003,2003,,,,,32736.0,39182.08,39182.08,,
2235,231,United States of America,2004,2004,,,,,27448.99,33603.47,33603.47,,
2236,231,United States of America,2005,2005,,,,,39981.31,45829.32,45829.32,,


In [None]:
# play with the code below to understand the data distribution
pivoted_data_1.loc[
    (pivoted_data_1['Agriculture, Recurrent (General Government)'].isna()) &
    (pivoted_data_1['Agriculture (General Government)'].isna()) &
    (~pivoted_data_1['Agriculture, Capital (General Government)'].isna())
].shape

(0, 13)

In [None]:
# play with the code below to understand the data distribution
pivoted_data_1.loc[
    (pivoted_data_1['Agriculture (Central Government)'].isna()) &
    (pivoted_data_1['Agriculture (General Government)'].isna()) &
    (pivoted_data_1['Agriculture, Capital (General Government)'].isna()) &
    (pivoted_data_1['Agriculture, Recurrent (General Government)'].isna()) &
    (~pivoted_data_1['Agriculture, Capital (Central Government)'].isna()) &
    (~pivoted_data_1['Agriculture, Recurrent (Central Government)'].isna()) 
].shape

(18, 13)

After data exploration, it became evident that there is little consistency in 
terms of how different countries gather and represent data related to government 
expenditure. 

For example, there are certain countries which report the data on government 
expenditure through all the categories mentioned above- 'Agriculture (Central Government)', 
'Agriculture (General Government)', 'Agriculture, Capital (General Government)', 
'Agriculture, Recurrent (General Government)', 'Agriculture, Capital (Central Government)', 
etc. On the other hand, there are certain countries which don't have multi-level 
governance systems, they only report central government level expenditure. 

Certain countries, like USA, report government expenditure as recurrent expenditure 
at general government level and central government level and it is same as government 
expenditure for agriculture, forestry and fishery. 

Clearly, there is a lot of variablity in terms of how the data on government expenditure 
is reported by the countries around the world. And, because of this reason, we 
have to find alternative ways to document the data. 

The plan is to use 'Agriculture (General Government)' as government expenditure. 
Wherever data for 'Agriculture (General Government)' is not available, we will 
combine the data for 'Agriculture, Recurrent (General Government)' and 
'Agriculture, Capital (General Government)' and treat it as 
'Agriculture (General Government)'. We would do the same for 
'Agriculture (Central Government)'. However, we would only use 
'Agriculture (Central Government)' as the data for government expenditure if the data 
for 'Agriculture (General Government)' is not available. 

Similarly, we would use 'R&D Agriculture, forestry, fishing (General Government)' 
as the data for government expenditure on R&D in agriculture, forestry, and 
fishing. Whenever it is not available, we would use data for 'R&D Agriculture, forestry, 
fishing (Central Government)'. 

In [101]:
pivoted_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2435 entries, 0 to 2434
Data columns (total 13 columns):
 #   Column                                                   Non-Null Count  Dtype  
---  ------                                                   --------------  -----  
 0   Area Code                                                2435 non-null   int64  
 1   Area                                                     2435 non-null   object 
 2   Year Code                                                2435 non-null   int64  
 3   Year                                                     2435 non-null   int64  
 4   Agriculture (Central Government)                         1064 non-null   float64
 5   Agriculture (General Government)                         542 non-null    float64
 6   Agriculture, Capital (Central Government)                708 non-null    float64
 7   Agriculture, Capital (General Government)                374 non-null    float64
 8   Agriculture, Recurrent (Cent

In [None]:
# Filling NaN values in the column with values from other columns (Capital & Recurrent)
pivoted_data_1['Agriculture (General Government)'] = pivoted_data_1['Agriculture (General Government)'].fillna(
    pivoted_data_1['Agriculture, Capital (General Government)'] + 
    pivoted_data_1['Agriculture, Recurrent (General Government)']
)

pivoted_data_1['Agriculture (Central Government)'] = pivoted_data_1['Agriculture (Central Government)'].fillna(
    pivoted_data_1['Agriculture, Capital (Central Government)'] + 
    pivoted_data_1['Agriculture, Recurrent (Central Government)']
)

pivoted_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2435 entries, 0 to 2434
Data columns (total 13 columns):
 #   Column                                                   Non-Null Count  Dtype  
---  ------                                                   --------------  -----  
 0   Area Code                                                2435 non-null   int64  
 1   Area                                                     2435 non-null   object 
 2   Year Code                                                2435 non-null   int64  
 3   Year                                                     2435 non-null   int64  
 4   Agriculture (Central Government)                         1082 non-null   float64
 5   Agriculture (General Government)                         542 non-null    float64
 6   Agriculture, Capital (Central Government)                708 non-null    float64
 7   Agriculture, Capital (General Government)                374 non-null    float64
 8   Agriculture, Recurrent (Cent

In [None]:
# filling NaN values in the column (General Govt) with values from other column (Central Govt)
pivoted_data_1['Agriculture (General Government)'] = pivoted_data_1['Agriculture (General Government)'].fillna(
    pivoted_data_1['Agriculture (Central Government)'] 
)

pivoted_data_1['R&D Agriculture, forestry, fishing (General Government)'] = pivoted_data_1['R&D Agriculture, forestry, fishing (General Government)'].fillna(
    pivoted_data_1['R&D Agriculture, forestry, fishing (Central Government)'] 
)

# dropping redundant/undesired columns
pivoted_data_1 = pivoted_data_1.drop(
    ['Agriculture (Central Government)',
     'Agriculture, Capital (Central Government)',
     'Agriculture, Capital (General Government)',
     'Agriculture, Recurrent (Central Government)',
     'Agriculture, Recurrent (General Government)',
     'R&D Agriculture, forestry, fishing (Central Government)'], axis=1
)

pivoted_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2435 entries, 0 to 2434
Data columns (total 7 columns):
 #   Column                                                   Non-Null Count  Dtype  
---  ------                                                   --------------  -----  
 0   Area Code                                                2435 non-null   int64  
 1   Area                                                     2435 non-null   object 
 2   Year Code                                                2435 non-null   int64  
 3   Year                                                     2435 non-null   int64  
 4   Agriculture (General Government)                         1126 non-null   float64
 5   Agriculture, forestry, fishing (General Government)      1834 non-null   float64
 6   R&D Agriculture, forestry, fishing (General Government)  636 non-null    float64
dtypes: float64(3), int64(3), object(1)
memory usage: 133.3+ KB


Looks like we don't have enough data (lots of missing values) for 'R&D Agriculture, 
forestry, fishing (General Government)', we can get rid of this feature from our dataset. 

In [None]:
# dropping the R&D column
pivoted_data_1 = pivoted_data_1.drop(
     'R&D Agriculture, forestry, fishing (General Government)', axis=1
)

pivoted_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2435 entries, 0 to 2434
Data columns (total 6 columns):
 #   Column                                               Non-Null Count  Dtype  
---  ------                                               --------------  -----  
 0   Area Code                                            2435 non-null   int64  
 1   Area                                                 2435 non-null   object 
 2   Year Code                                            2435 non-null   int64  
 3   Year                                                 2435 non-null   int64  
 4   Agriculture (General Government)                     1126 non-null   float64
 5   Agriculture, forestry, fishing (General Government)  1834 non-null   float64
dtypes: float64(2), int64(3), object(1)
memory usage: 114.3+ KB


In [117]:
# renaming columns
cleaned_data_1 = pivoted_data_1.rename(
    columns = {
        'Area Code': 'area_code',
        'Area': 'area',
        'Year Code': 'year_code',
        'Year': 'year',
        'Agriculture (General Government)': 'Govt_expenditure_on_Ag',
        'Agriculture, forestry, fishing (General Government)': 'Govt_expenditure_on_Ag_forest_fish'
    }
)

## Government expenditure on agriculture in terms of share of total expenditure

In [None]:
# filter for desired element
filtered_data_2 = filtered_data.loc[
    filtered_data['Element']=='Share of Total Expenditure'
    ]
filtered_data_2.shape

(1826, 13)

In [107]:
filtered_data_2['Item'].value_counts()

Item
Agriculture, forestry, fishing (General Government)    1826
Name: count, dtype: int64

In [None]:
# unit for the value
filtered_data_2['Unit'].unique()

array(['%'], dtype=object)

In [108]:
# restructuring data from long to wide format
pivoted_data_2 = filtered_data_2.pivot_table(
    index = ['Area Code', 'Area', 'Year Code', 'Year'],
    columns = 'Item',
    values = 'Value'
)

# resetting row index
pivoted_data_2.reset_index(inplace=True)

# setting column index name to None
pivoted_data_2.columns.name = None

pivoted_data_2.head()

Unnamed: 0,Area Code,Area,Year Code,Year,"Agriculture, forestry, fishing (General Government)"
0,1,Armenia,2009,2009,3.84
1,1,Armenia,2010,2010,4.17
2,1,Armenia,2011,2011,4.26
3,1,Armenia,2012,2012,2.2
4,1,Armenia,2013,2013,1.61


In [120]:
# renaming columns
cleaned_data_2 = pivoted_data_2.rename(
    columns = {
        'Area Code': 'area_code',
        'Area': 'area',
        'Year Code': 'year_code',
        'Year': 'year',
        'Agriculture, forestry, fishing (General Government)': 'Ag_forest_fish_as_share_of_total_expenditure'
    }
)

In [None]:
# Combining the two datasets with left-join
merged_data_1 = pd.merge(
    cleaned_data_1, cleaned_data_2,
    on = ['area_code', 'area', 'year_code', 'year'],
    how = 'left'
)

merged_data_1.head()

Unnamed: 0,area_code,area,year_code,year,Govt_expenditure_on_Ag,Govt_expenditure_on_Ag_forest_fish,Ag_forest_fish_as_share_of_total_expenditure
0,1,Armenia,2009,2009,86.8,90.51,3.84
1,1,Armenia,2010,2010,91.83,93.82,4.17
2,1,Armenia,2011,2011,93.72,95.55,4.26
3,1,Armenia,2012,2012,49.31,51.02,2.2
4,1,Armenia,2013,2013,39.13,40.84,1.61


## Agriculture Orientation Index (AOI) for Government Expenditure

In [None]:
# filter for desired element
filtered_data_3 = filtered_data.loc[
    filtered_data['Element']=='SDG 2.a.1: Agriculture Orientation Index (AOI) for Government Expenditure'
]

filtered_data_3.shape

(4713, 13)

In [113]:
filtered_data_3['Unit'].unique()

array([nan], dtype=object)

In [114]:
filtered_data_3['Item'].unique()

array(['SDG 2.a.1: Highest Government level'], dtype=object)

In [122]:
# restructuring data from long to wide format
pivoted_data_3 = filtered_data_3.pivot_table(
    index = ['Area Code', 'Area', 'Year Code', 'Year'],
    columns = 'Element',
    values = 'Value'
)

# resetting row index
pivoted_data_3.reset_index(inplace=True)

# setting column index name to None
pivoted_data_3.columns.name = None

pivoted_data_3.head()

Unnamed: 0,Area Code,Area,Year Code,Year,SDG 2.a.1: Agriculture Orientation Index (AOI) for Government Expenditure
0,1,Armenia,2003,2003,0.2
1,1,Armenia,2004,2004,0.19
2,1,Armenia,2005,2005,0.23
3,1,Armenia,2006,2006,0.23
4,1,Armenia,2007,2007,0.24


In [123]:
# renaming columns
cleaned_data_3 = pivoted_data_3.rename(
    columns = {
        'Area Code': 'area_code',
        'Area': 'area',
        'Year Code': 'year_code',
        'Year': 'year',
        'SDG 2.a.1: Agriculture Orientation Index (AOI) for Government Expenditure': 'AOI_for_govt_expenditure'
    }
)

In [124]:
cleaned_data_3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4713 entries, 0 to 4712
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   area_code                 4713 non-null   int64  
 1   area                      4713 non-null   object 
 2   year_code                 4713 non-null   int64  
 3   year                      4713 non-null   int64  
 4   AOI_for_govt_expenditure  4713 non-null   float64
dtypes: float64(1), int64(3), object(1)
memory usage: 184.2+ KB


In [None]:
# Combining the two datasets with left-join
merged_data_2 = pd.merge(
    cleaned_data_3, merged_data_1,
    on = ['area_code', 'area', 'year_code', 'year'],
    how = 'left'
)

merged_data_2.head(10)

Unnamed: 0,area_code,area,year_code,year,AOI_for_govt_expenditure,Govt_expenditure_on_Ag,Govt_expenditure_on_Ag_forest_fish,Ag_forest_fish_as_share_of_total_expenditure
0,1,Armenia,2003,2003,0.2,,,
1,1,Armenia,2004,2004,0.19,,,
2,1,Armenia,2005,2005,0.23,,,
3,1,Armenia,2006,2006,0.23,,,
4,1,Armenia,2007,2007,0.24,,,
5,1,Armenia,2008,2008,0.27,,,
6,1,Armenia,2009,2009,0.24,86.8,90.51,3.84
7,1,Armenia,2010,2010,0.26,91.83,93.82,4.17
8,1,Armenia,2011,2011,0.22,93.72,95.55,4.26
9,1,Armenia,2012,2012,0.12,49.31,51.02,2.2


In [127]:
merged_data_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4713 entries, 0 to 4712
Data columns (total 8 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   area_code                                     4713 non-null   int64  
 1   area                                          4713 non-null   object 
 2   year_code                                     4713 non-null   int64  
 3   year                                          4713 non-null   int64  
 4   AOI_for_govt_expenditure                      4713 non-null   float64
 5   Govt_expenditure_on_Ag                        1125 non-null   float64
 6   Govt_expenditure_on_Ag_forest_fish            1827 non-null   float64
 7   Ag_forest_fish_as_share_of_total_expenditure  1819 non-null   float64
dtypes: float64(4), int64(3), object(1)
memory usage: 294.7+ KB


There are a lots of missing values in the columns- 
'Govt_expenditure_on_Ag', 'Govt_expenditure_on_Ag_forest_fish', 
'Ag_forest_fish_as_share_of_total_expenditure'. However, the good news is we have 
complete data regarding the Agricultural Orientation Index for government expenditure. 
This is important as it measure the emphasis of government on agriculture. Even if 
we are missing the raw numbers, AOI will cover those effects anyways. 

In [129]:
# exporting cleaned data as csv file
merged_data_2.to_csv('cleaned_datasets/government_investment_cleaned.csv', index='False')