In [1]:
# required packages
import pandas as pd
import numpy as np

In [63]:
# Loading data
raw_data = pd.read_csv('/Users/gurjitsingh/Desktop/MS Data Science/MS_Project_Python/raw_datasets/Government_Expenditure/Investment_GovernmentExpenditure_E_All_Data_(Normalized)/Investment_GovernmentExpenditure_E_All_Data_(Normalized).csv')
raw_data.head()

Unnamed: 0,Area Code,Area Code (M49),Area,Item Code,Item,Element Code,Element,Year Code,Year,Unit,Value,Flag,Note
0,2,'004,Afghanistan,23130,Total Expenditure (General Government),6224,Value Standard Local Currency,2006,2006,million SLC,111274.57,X,consolidated General Government
1,2,'004,Afghanistan,23130,Total Expenditure (General Government),6224,Value Standard Local Currency,2007,2007,million SLC,165029.87,X,consolidated General Government
2,2,'004,Afghanistan,23130,Total Expenditure (General Government),6224,Value Standard Local Currency,2008,2008,million SLC,466732.04,X,consolidated General Government
3,2,'004,Afghanistan,23130,Total Expenditure (General Government),6224,Value Standard Local Currency,2009,2009,million SLC,449927.62,X,consolidated General Government
4,2,'004,Afghanistan,23130,Total Expenditure (General Government),6224,Value Standard Local Currency,2010,2010,million SLC,599141.98,X,consolidated General Government


In [3]:
raw_data['Item'].unique()

array(['Total Expenditure (General Government)',
       'Agriculture, forestry, fishing (General Government)',
       'Environmental protection (General Government)',
       'Protection of Biodiversity and Landscape (General Government)',
       'Total Expenditure (Central Government)',
       'R&D Environmental Protection (General Government)',
       'Agriculture, forestry, fishing (Central Government)',
       'Agriculture, forestry, fishing, Recurrent (Central Government)',
       'Agriculture, forestry, fishing, Capital (Central Government)',
       'Environmental protection (Central Government)',
       'Protection of Biodiversity and Landscape (Central Government)',
       'R&D Environmental Protection (Central Government)',
       'SDG 2.a.1: Highest Government level',
       'Agriculture, forestry, fishing, Recurrent (General Government)',
       'Agriculture, forestry, fishing, Capital (General Government)',
       'Agriculture (General Government)',
       'Agriculture, Recu

In [4]:
raw_data['Element'].unique()

array(['Value Standard Local Currency', 'Value US$',
       'Value US$, 2015 prices', 'Share of Total Expenditure',
       'SDG 2.a.1: Agriculture share of Government Expenditure',
       'SDG 2.a.1: Agriculture Orientation Index (AOI) for Government Expenditure',
       'SDG 2.a.1: Agriculture value added share of GDP'], dtype=object)

This dataset contains a very detailed representation of government expenditure. 
There are several items in the "Item" column and "Element" column include 
different metrics to measure government expenditure. We would choose just the ones 
that are most relevant to our project, i.e., the ones that directly affects 
farmer producer prices. 

We would choose following items:
- 'Total Expenditure (General Government)',
- 'Total Expenditure (Central Government)',
- 'Total Expenditure (Highest Government level)',
- 'Agriculture (General Government)',
- 'Agriculture, Recurrent (General Government)',
- 'Agriculture, Capital (General Government)', 
- 'R&D Agriculture, forestry, fishing (General Government)',
- 'Agriculture, forestry, fishing (General Government)',
- 'Agriculture, forestry, fishing, Recurrent (General Government)',
- 'Agriculture, forestry, fishing, Capital (General Government)',
- 'Agriculture (Central Government)',
- 'Agriculture, Recurrent (Central Government)',
- 'Agriculture, Capital (Central Government)',
- 'Agriculture, forestry, fishing (Highest Government level)'
- 'Agriculture, forestry, fishing (Central Government)',
- 'Agriculture, forestry, fishing, Recurrent (Central Government)',
- 'Agriculture, forestry, fishing, Capital (Central Government)',
- 'R&D Agriculture, forestry, fishing (Central Government)',
- 'SDG 2.a.1: Highest Government level'

We would use data for general government whenever available. If not available, 
we would fall back to central government, and then if both general and central 
are not available, we could use highest government data. 

**General Government:** includes all levels: central + subnational 
(states, provinces, municipalities). Broader scope.

**Central Government:** Only national-level institutions 
(e.g., Ministry of Agriculture). Narrower scope.

**Highest Government Level:** Often similar to Central Government, but used in 
countries where data is available only at one level.

**Recurrent expenditure:** Salaries, fertilizer subsidies, operational costs.

**Capital expenditure:** Infrastructure (e.g., dams, market yards), 
land development.

We would consider separating recurrent and capital expenditures as separate 
features to see their different lag effects on prices.


From the elements, we would consider following: 

**Value US$, 2015 prices** – for the actual real expenditure trend

**Agriculture Orientation Index (AOI)** – to capture policy bias toward agriculture
Compares share of agri expenditure to agriculture’s share in GDP. Captures policy emphasis. 
Useful for modeling how much priority agriculture gets.

In [64]:
# filter the dataset for desired data
items_to_keep = [
    'Total Expenditure (General Government)',
    'Total Expenditure (Central Government)',
    'Total Expenditure (Highest Government level)',
    'Agriculture (General Government)',
    'Agriculture, Recurrent (General Government)',
    'Agriculture, Capital (General Government)', 
    'R&D Agriculture, forestry, fishing (General Government)',
    'Agriculture, forestry, fishing (General Government)',
    'Agriculture, forestry, fishing, Recurrent (General Government)',
    'Agriculture, forestry, fishing, Capital (General Government)',
    'Agriculture (Central Government)',
    'Agriculture, Recurrent (Central Government)',
    'Agriculture, Capital (Central Government)',
    'Agriculture, forestry, fishing (Highest Government level)'
    'Agriculture, forestry, fishing (Central Government)',
    'Agriculture, forestry, fishing, Recurrent (Central Government)',
    'Agriculture, forestry, fishing, Capital (Central Government)',
    'R&D Agriculture, forestry, fishing (Central Government)',
    'SDG 2.a.1: Highest Government level',
    
]

elements_to_keep = [
    'Value US$, 2015 prices',
    'SDG 2.a.1: Agriculture Orientation Index (AOI) for Government Expenditure'
]

filtered_data = raw_data.loc[
    (raw_data['Item'].isin(items_to_keep)) &
    (raw_data['Element'].isin(elements_to_keep))
]

filtered_data.head()

Unnamed: 0,Area Code,Area Code (M49),Area,Item Code,Item,Element Code,Element,Year Code,Year,Unit,Value,Flag,Note
24,2,'004,Afghanistan,23130,Total Expenditure (General Government),6184,"Value US$, 2015 prices",2006,2006,million USD,3272.2,X,consolidated General Government
25,2,'004,Afghanistan,23130,Total Expenditure (General Government),6184,"Value US$, 2015 prices",2007,2007,million USD,4466.81,X,consolidated General Government
26,2,'004,Afghanistan,23130,Total Expenditure (General Government),6184,"Value US$, 2015 prices",2008,2008,million USD,11304.58,X,consolidated General Government
27,2,'004,Afghanistan,23130,Total Expenditure (General Government),6184,"Value US$, 2015 prices",2009,2009,million USD,10925.17,X,consolidated General Government
28,2,'004,Afghanistan,23130,Total Expenditure (General Government),6184,"Value US$, 2015 prices",2010,2010,million USD,12723.8,X,consolidated General Government


In [49]:
filtered_data['Item'].value_counts()

Item
SDG 2.a.1: Highest Government level                               4713
Total Expenditure (Central Government)                            3676
Total Expenditure (General Government)                            2120
Agriculture, forestry, fishing (General Government)               1834
Agriculture, forestry, fishing, Recurrent (Central Government)    1237
Agriculture, forestry, fishing, Capital (Central Government)      1199
Agriculture (Central Government)                                  1064
Total Expenditure (Highest Government level)                       943
Agriculture, Recurrent (Central Government)                        739
Agriculture, forestry, fishing, Recurrent (General Government)     730
Agriculture, forestry, fishing, Capital (General Government)       717
Agriculture, Capital (Central Government)                          708
R&D Agriculture, forestry, fishing (Central Government)            605
Agriculture (General Government)                                   542
A

In [50]:
filtered_data['Element'].value_counts()

Element
Value US$, 2015 prices                                                       17270
SDG 2.a.1: Agriculture Orientation Index (AOI) for Government Expenditure     4713
Name: count, dtype: int64

## Govt expenditure in USD, 2015 prices

In [65]:
filtered_data_1 = filtered_data.loc[
    filtered_data['Element']=='Value US$, 2015 prices'
]
filtered_data_1.shape

(17270, 13)

In [21]:
filtered_data_1['Item'].value_counts()

Item
Total Expenditure (Central Government)                            3676
Total Expenditure (General Government)                            2120
Agriculture, forestry, fishing (General Government)               1834
Agriculture, forestry, fishing, Recurrent (Central Government)    1237
Agriculture, forestry, fishing, Capital (Central Government)      1199
Agriculture (Central Government)                                  1064
Total Expenditure (Highest Government level)                       943
Agriculture, Recurrent (Central Government)                        739
Agriculture, forestry, fishing, Recurrent (General Government)     730
Agriculture, forestry, fishing, Capital (General Government)       717
Agriculture, Capital (Central Government)                          708
R&D Agriculture, forestry, fishing (Central Government)            605
Agriculture (General Government)                                   542
Agriculture, Recurrent (General Government)                        406
R

In [22]:
filtered_data['Item'].value_counts()

Item
SDG 2.a.1: Highest Government level                               9426
Total Expenditure (Central Government)                            3676
Agriculture, forestry, fishing (General Government)               3660
Total Expenditure (General Government)                            2120
Agriculture, forestry, fishing, Recurrent (Central Government)    1237
Agriculture, forestry, fishing, Capital (Central Government)      1199
Agriculture (Central Government)                                  1064
Total Expenditure (Highest Government level)                       943
Agriculture, Recurrent (Central Government)                        739
Agriculture, forestry, fishing, Recurrent (General Government)     730
Agriculture, forestry, fishing, Capital (General Government)       717
Agriculture, Capital (Central Government)                          708
R&D Agriculture, forestry, fishing (Central Government)            605
Agriculture (General Government)                                   542
A

In [23]:
filtered_data.loc[filtered_data['Item']=='Agriculture, forestry, fishing (General Government)']['Element'].unique()

array(['Value US$, 2015 prices', 'Share of Total Expenditure'],
      dtype=object)

The data for 'Agriculture, forestry, fishing (General Government)' varies between 
filtered_data and filtered_data_1 because the data is shared among the two elements- 
'Value US$, 2015 prices', and 'Share of Total Expenditure'.

In [66]:
# restructuring data from long to wide format
pivoted_data_1 = filtered_data_1.pivot_table(
    index = ['Area Code', 'Area', 'Year Code', 'Year'],
    columns = 'Item',
    values = 'Value'
)

# resetting row index
pivoted_data_1.reset_index(inplace=True)

# setting column index name to None
pivoted_data_1.columns.name = None

pivoted_data_1.head(20)

Unnamed: 0,Area Code,Area,Year Code,Year,Agriculture (Central Government),Agriculture (General Government),"Agriculture, Capital (Central Government)","Agriculture, Capital (General Government)","Agriculture, Recurrent (Central Government)","Agriculture, Recurrent (General Government)","Agriculture, forestry, fishing (General Government)","Agriculture, forestry, fishing, Capital (Central Government)","Agriculture, forestry, fishing, Capital (General Government)","Agriculture, forestry, fishing, Recurrent (Central Government)","Agriculture, forestry, fishing, Recurrent (General Government)","R&D Agriculture, forestry, fishing (Central Government)","R&D Agriculture, forestry, fishing (General Government)",Total Expenditure (Central Government),Total Expenditure (General Government),Total Expenditure (Highest Government level)
0,1,Armenia,2003,2003,,,,,,,,,,,,,,990.01,,
1,1,Armenia,2004,2004,,,,,,,,,,,,,,996.17,1181.94,
2,1,Armenia,2005,2005,,,,,,,,,,,,,,1211.24,1403.58,
3,1,Armenia,2006,2006,,,,,,,,,,,,,,1344.82,1590.91,
4,1,Armenia,2007,2007,,,,,,,,,,,,,,1705.81,1998.6,
5,1,Armenia,2008,2008,,,,,,,,,,,,,,2067.57,2226.36,
6,1,Armenia,2009,2009,86.12,86.8,48.38,48.68,37.74,38.12,90.51,49.54,49.85,40.27,40.66,0.0,0.01,2310.84,2356.81,
7,1,Armenia,2010,2010,90.44,91.83,62.34,63.27,28.1,28.56,93.82,62.34,63.28,30.07,30.54,0.0,0.01,2202.76,2252.47,
8,1,Armenia,2011,2011,91.94,93.72,61.73,63.21,30.21,30.52,95.55,61.73,63.21,32.03,32.34,0.0,0.01,2182.55,2242.21,
9,1,Armenia,2012,2012,47.32,49.31,18.65,20.21,28.67,29.1,51.02,18.65,20.21,30.38,30.81,0.0,0.0,2252.56,2322.04,


In [27]:
pivoted_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4777 entries, 0 to 4776
Data columns (total 20 columns):
 #   Column                                                          Non-Null Count  Dtype  
---  ------                                                          --------------  -----  
 0   Area Code                                                       4777 non-null   int64  
 1   Area                                                            4777 non-null   object 
 2   Year Code                                                       4777 non-null   int64  
 3   Year                                                            4777 non-null   int64  
 4   Agriculture (Central Government)                                1064 non-null   float64
 5   Agriculture (General Government)                                542 non-null    float64
 6   Agriculture, Capital (Central Government)                       708 non-null    float64
 7   Agriculture, Capital (General Government)          

In [68]:
pivoted_data_1.loc[pivoted_data_1['Area']=='Kenya'].head(10)

Unnamed: 0,Area Code,Area,Year Code,Year,Agriculture (Central Government),Agriculture (General Government),"Agriculture, Capital (Central Government)","Agriculture, Capital (General Government)","Agriculture, Recurrent (Central Government)","Agriculture, Recurrent (General Government)","Agriculture, forestry, fishing (General Government)","Agriculture, forestry, fishing, Capital (Central Government)","Agriculture, forestry, fishing, Capital (General Government)","Agriculture, forestry, fishing, Recurrent (Central Government)","Agriculture, forestry, fishing, Recurrent (General Government)","R&D Agriculture, forestry, fishing (Central Government)","R&D Agriculture, forestry, fishing (General Government)",Total Expenditure (Central Government),Total Expenditure (General Government),Total Expenditure (Highest Government level)
1743,114,Kenya,2001,2001,,,,,,,,,,,,,,5634.49,,
1744,114,Kenya,2002,2002,,,,,,,,,,,,,,6026.04,,
1745,114,Kenya,2003,2003,,,,,,,,,,,,,,6074.28,,
1746,114,Kenya,2004,2004,,,,,,,,,,,,,,7174.62,,
1747,114,Kenya,2005,2005,,,,,,,,,,,,,,6616.02,,
1748,114,Kenya,2006,2006,201.1,,66.19,,134.91,,,67.44,,149.92,,53.6,,7792.86,,
1749,114,Kenya,2007,2007,267.06,,89.04,,178.0,,,90.66,,195.91,,60.49,,8382.62,,
1750,114,Kenya,2008,2008,277.23,,102.21,,175.02,,,104.51,,190.99,,59.37,,9293.32,,
1751,114,Kenya,2009,2009,326.22,,164.91,,161.3,,,174.59,,201.59,,39.42,,9622.13,,
1752,114,Kenya,2010,2010,320.96,,160.22,,256.63,,,171.77,,198.34,,40.52,,9451.19,,


In [32]:
# play with the code below to understand the data distribution
pivoted_data_1.loc[
    (pivoted_data_1['Agriculture, Recurrent (General Government)'].isna()) &
    (pivoted_data_1['Agriculture (General Government)'].isna()) &
    (~pivoted_data_1['Agriculture, Capital (General Government)'].isna())
].shape

(0, 20)

In [33]:
# play with the code below to understand the data distribution
pivoted_data_1.loc[
    (pivoted_data_1['Agriculture (Central Government)'].isna()) &
    (pivoted_data_1['Agriculture (General Government)'].isna()) &
    (pivoted_data_1['Agriculture, Capital (General Government)'].isna()) &
    (pivoted_data_1['Agriculture, Recurrent (General Government)'].isna()) &
    (~pivoted_data_1['Agriculture, Capital (Central Government)'].isna()) &
    (~pivoted_data_1['Agriculture, Recurrent (Central Government)'].isna()) 
].shape

(18, 20)

After data exploration, it became evident that there is little consistency in 
terms of how different countries gather and represent data related to government 
expenditure. 

For example, there are certain countries which report the data on government 
expenditure through all the categories mentioned above- 'Agriculture (Central Government)', 
'Agriculture (General Government)', 'Agriculture, Capital (General Government)', 
'Agriculture, Recurrent (General Government)', 'Agriculture, Capital (Central Government)', 
etc. On the other hand, there are certain countries which don't have multi-level 
governance systems, they only report central government level expenditure. These 
are mostly the developing countries or small economies.

Certain countries, like USA, report government expenditure as recurrent expenditure 
at general government level and central government level and it is same as government 
expenditure for agriculture, forestry and fishery. 

Clearly, there is a lot of variablity in terms of how the data on government expenditure 
is reported by the countries around the world. And, because of this reason, we 
have to find alternative ways to document the data. 

The plan is to use 'Agriculture (General Government)' as government expenditure. 
Wherever data for 'Agriculture (General Government)' is not available, we will 
fill in the data from corressponding 'Agriculture (Central Government)' column. 
We will do the same for 'Agriculture Capital (General Government)', 
'Agriculture Recurrent (General Government)', 'Agriculture, forestry, fishing (General Government)',
'Agriculture, forestry, fishing Capital (General Government)', and
'Agriculture, forestry, fishing Recurrent (General Government)'. 

However, we would only combine the data for 'Agriculture Capital (General Government)' and 
'Agriculture Recurrent (General Government)' only when the data is missing for 
'Agriculture (General Government)'.
 
Similarly, we would use 'R&D Agriculture, forestry, fishing (General Government)' 
as the data for government expenditure on R&D in agriculture, forestry, and 
fishing. Whenever it is not available, we would use data for 'R&D Agriculture, forestry, 
fishing (Central Government)'. 

In [36]:
pivoted_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4777 entries, 0 to 4776
Data columns (total 20 columns):
 #   Column                                                          Non-Null Count  Dtype  
---  ------                                                          --------------  -----  
 0   Area Code                                                       4777 non-null   int64  
 1   Area                                                            4777 non-null   object 
 2   Year Code                                                       4777 non-null   int64  
 3   Year                                                            4777 non-null   int64  
 4   Agriculture (Central Government)                                1064 non-null   float64
 5   Agriculture (General Government)                                542 non-null    float64
 6   Agriculture, Capital (Central Government)                       708 non-null    float64
 7   Agriculture, Capital (General Government)          

In [72]:
# filling NaN values in the column (General Govt) with values from other column (Central Govt)
pivoted_data_1['Total Expenditure (General Government)'] = pivoted_data_1['Total Expenditure (General Government)'].fillna(
    pivoted_data_1['Total Expenditure (Central Government)'] 
)

pivoted_data_1['Agriculture (General Government)'] = pivoted_data_1['Agriculture (General Government)'].fillna(
    pivoted_data_1['Agriculture (Central Government)'] 
)

pivoted_data_1['Agriculture, Capital (General Government)'] = pivoted_data_1['Agriculture, Capital (General Government)'].fillna(
    pivoted_data_1['Agriculture, Capital (Central Government)'] 
)

pivoted_data_1['Agriculture, Recurrent (General Government)'] = pivoted_data_1['Agriculture, Recurrent (General Government)'].fillna(
    pivoted_data_1['Agriculture, Recurrent (Central Government)'] 
)

pivoted_data_1['Agriculture, forestry, fishing, Capital (General Government)'] = pivoted_data_1['Agriculture, forestry, fishing, Capital (General Government)'].fillna(
    pivoted_data_1['Agriculture, forestry, fishing, Capital (Central Government)'] 
)

pivoted_data_1['Agriculture, forestry, fishing, Recurrent (General Government)'] = pivoted_data_1['Agriculture, forestry, fishing, Recurrent (General Government)'].fillna(
    pivoted_data_1['Agriculture, forestry, fishing, Recurrent (Central Government)'] 
)

pivoted_data_1['R&D Agriculture, forestry, fishing (General Government)'] = pivoted_data_1['R&D Agriculture, forestry, fishing (General Government)'].fillna(
    pivoted_data_1['R&D Agriculture, forestry, fishing (Central Government)'] 
)

# if there are still missing values in total expenditure (general govt), we will 
# fill it using total expenditure (highest govt)
pivoted_data_1['Total Expenditure (General Government)'] = pivoted_data_1['Total Expenditure (General Government)'].fillna(
    pivoted_data_1['Total Expenditure (Highest Government level)'] 
)

In [38]:
pivoted_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4777 entries, 0 to 4776
Data columns (total 20 columns):
 #   Column                                                          Non-Null Count  Dtype  
---  ------                                                          --------------  -----  
 0   Area Code                                                       4777 non-null   int64  
 1   Area                                                            4777 non-null   object 
 2   Year Code                                                       4777 non-null   int64  
 3   Year                                                            4777 non-null   int64  
 4   Agriculture (Central Government)                                1064 non-null   float64
 5   Agriculture (General Government)                                1108 non-null   float64
 6   Agriculture, Capital (Central Government)                       708 non-null    float64
 7   Agriculture, Capital (General Government)          

In [74]:
# dropping redundant/undesired columns
pivoted_data_1 = pivoted_data_1.drop(
    ['Agriculture (Central Government)',
     'Agriculture, Capital (Central Government)',
     'Agriculture, Recurrent (Central Government)',
     'Agriculture, forestry, fishing, Capital (Central Government)',
     'Agriculture, forestry, fishing, Recurrent (Central Government)',
     'R&D Agriculture, forestry, fishing (Central Government)',
     'Total Expenditure (Central Government)',
     'Total Expenditure (Highest Government level)'], axis=1
)

pivoted_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4777 entries, 0 to 4776
Data columns (total 12 columns):
 #   Column                                                          Non-Null Count  Dtype  
---  ------                                                          --------------  -----  
 0   Area Code                                                       4777 non-null   int64  
 1   Area                                                            4777 non-null   object 
 2   Year Code                                                       4777 non-null   int64  
 3   Year                                                            4777 non-null   int64  
 4   Agriculture (General Government)                                1108 non-null   float64
 5   Agriculture, Capital (General Government)                       721 non-null    float64
 6   Agriculture, Recurrent (General Government)                     754 non-null    float64
 7   Agriculture, forestry, fishing (General Government)

In [75]:
pivoted_data_1.head(10)

Unnamed: 0,Area Code,Area,Year Code,Year,Agriculture (General Government),"Agriculture, Capital (General Government)","Agriculture, Recurrent (General Government)","Agriculture, forestry, fishing (General Government)","Agriculture, forestry, fishing, Capital (General Government)","Agriculture, forestry, fishing, Recurrent (General Government)","R&D Agriculture, forestry, fishing (General Government)",Total Expenditure (General Government)
0,1,Armenia,2003,2003,,,,,,,,990.01
1,1,Armenia,2004,2004,,,,,,,,1181.94
2,1,Armenia,2005,2005,,,,,,,,1403.58
3,1,Armenia,2006,2006,,,,,,,,1590.91
4,1,Armenia,2007,2007,,,,,,,,1998.6
5,1,Armenia,2008,2008,,,,,,,,2226.36
6,1,Armenia,2009,2009,86.8,48.68,38.12,90.51,49.85,40.66,0.01,2356.81
7,1,Armenia,2010,2010,91.83,63.27,28.56,93.82,63.28,30.54,0.01,2252.47
8,1,Armenia,2011,2011,93.72,63.21,30.52,95.55,63.21,32.34,0.01,2242.21
9,1,Armenia,2012,2012,49.31,20.21,29.1,51.02,20.21,30.81,0.0,2322.04


Clearly, we have filled quite abunch of missing values in the desired columns. 
Now, we will try to fill the remaining missing values by combining the data from 
Capital and Recurrent columns.

In [76]:
# Filling NaN values in the column with values from other columns (Capital & Recurrent)
pivoted_data_1['Agriculture (General Government)'] = pivoted_data_1['Agriculture (General Government)'].fillna(
    pivoted_data_1[
        ['Agriculture, Capital (General Government)', 'Agriculture, Recurrent (General Government)']
        ].sum(axis=1, skipna=True, min_count=1)
)

pivoted_data_1['Agriculture, forestry, fishing (General Government)'] = pivoted_data_1['Agriculture, forestry, fishing (General Government)'].fillna(
    pivoted_data_1[
        ['Agriculture, forestry, fishing, Capital (General Government)', 'Agriculture, forestry, fishing, Recurrent (General Government)']
        ].sum(axis=1, skipna=True, min_count=1)
)

pivoted_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4777 entries, 0 to 4776
Data columns (total 12 columns):
 #   Column                                                          Non-Null Count  Dtype  
---  ------                                                          --------------  -----  
 0   Area Code                                                       4777 non-null   int64  
 1   Area                                                            4777 non-null   object 
 2   Year Code                                                       4777 non-null   int64  
 3   Year                                                            4777 non-null   int64  
 4   Agriculture (General Government)                                1162 non-null   float64
 5   Agriculture, Capital (General Government)                       721 non-null    float64
 6   Agriculture, Recurrent (General Government)                     754 non-null    float64
 7   Agriculture, forestry, fishing (General Government)

Looks like we don't have enough data (lots of missing values) for 'R&D Agriculture, 
forestry, fishing (General Government)', we can get rid of this feature from our dataset. 

In [77]:
# dropping redundant/undesired columns
pivoted_data_1 = pivoted_data_1.drop(
    ['Agriculture, Capital (General Government)',
     'Agriculture, Recurrent (General Government)',
     'Agriculture, forestry, fishing, Capital (General Government)',
     'Agriculture, forestry, fishing, Recurrent (General Government)',
     'R&D Agriculture, forestry, fishing (General Government)'], axis=1
)

pivoted_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4777 entries, 0 to 4776
Data columns (total 7 columns):
 #   Column                                               Non-Null Count  Dtype  
---  ------                                               --------------  -----  
 0   Area Code                                            4777 non-null   int64  
 1   Area                                                 4777 non-null   object 
 2   Year Code                                            4777 non-null   int64  
 3   Year                                                 4777 non-null   int64  
 4   Agriculture (General Government)                     1162 non-null   float64
 5   Agriculture, forestry, fishing (General Government)  2335 non-null   float64
 6   Total Expenditure (General Government)               4770 non-null   float64
dtypes: float64(3), int64(3), object(1)
memory usage: 261.4+ KB


In [79]:
# filling NaN values in the column (Agriculture, forestry, fishing (General Government)) 
# with values from other column (Agriculture (General Government))
pivoted_data_1['Agriculture, forestry, fishing (General Government)'] = pivoted_data_1['Agriculture, forestry, fishing (General Government)'].fillna(
    pivoted_data_1['Agriculture (General Government)'] 
)

pivoted_data_1 = pivoted_data_1.drop('Agriculture (General Government)', axis=1)

pivoted_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4777 entries, 0 to 4776
Data columns (total 6 columns):
 #   Column                                               Non-Null Count  Dtype  
---  ------                                               --------------  -----  
 0   Area Code                                            4777 non-null   int64  
 1   Area                                                 4777 non-null   object 
 2   Year Code                                            4777 non-null   int64  
 3   Year                                                 4777 non-null   int64  
 4   Agriculture, forestry, fishing (General Government)  2600 non-null   float64
 5   Total Expenditure (General Government)               4770 non-null   float64
dtypes: float64(2), int64(3), object(1)
memory usage: 224.1+ KB


In [80]:
# renaming columns
cleaned_data_1 = pivoted_data_1.rename(
    columns = {
        'Area Code': 'area_code',
        'Area': 'area',
        'Year Code': 'year_code',
        'Year': 'year',
        'Agriculture, forestry, fishing (General Government)': 'Govt_expenditure_on_Ag_forest_fish',
        'Total Expenditure (General Government)': 'total_govt_expenditure'
    }
)

## Agriculture Orientation Index (AOI) for Government Expenditure

In [51]:
# filter for desired element
filtered_data_2 = filtered_data.loc[
    filtered_data['Element']=='SDG 2.a.1: Agriculture Orientation Index (AOI) for Government Expenditure'
]

filtered_data_2.shape

(4713, 13)

In [52]:
filtered_data_2['Unit'].unique()

array([nan], dtype=object)

In [53]:
filtered_data_2['Item'].unique()

array(['SDG 2.a.1: Highest Government level'], dtype=object)

In [54]:
# restructuring data from long to wide format
pivoted_data_2 = filtered_data_2.pivot_table(
    index = ['Area Code', 'Area', 'Year Code', 'Year'],
    columns = 'Element',
    values = 'Value'
)

# resetting row index
pivoted_data_2.reset_index(inplace=True)

# setting column index name to None
pivoted_data_2.columns.name = None

pivoted_data_2.head()

Unnamed: 0,Area Code,Area,Year Code,Year,SDG 2.a.1: Agriculture Orientation Index (AOI) for Government Expenditure
0,1,Armenia,2003,2003,0.2
1,1,Armenia,2004,2004,0.19
2,1,Armenia,2005,2005,0.23
3,1,Armenia,2006,2006,0.23
4,1,Armenia,2007,2007,0.24


In [55]:
# renaming columns
cleaned_data_2 = pivoted_data_2.rename(
    columns = {
        'Area Code': 'area_code',
        'Area': 'area',
        'Year Code': 'year_code',
        'Year': 'year',
        'SDG 2.a.1: Agriculture Orientation Index (AOI) for Government Expenditure': 'AOI_for_govt_expenditure'
    }
)

In [56]:
cleaned_data_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4713 entries, 0 to 4712
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   area_code                 4713 non-null   int64  
 1   area                      4713 non-null   object 
 2   year_code                 4713 non-null   int64  
 3   year                      4713 non-null   int64  
 4   AOI_for_govt_expenditure  4713 non-null   float64
dtypes: float64(1), int64(3), object(1)
memory usage: 184.2+ KB


In [81]:
# Combining the two datasets with left-join
merged_data = pd.merge(
    cleaned_data_1, cleaned_data_2,
    on = ['area_code', 'area', 'year_code', 'year'],
    how = 'left'
)

merged_data.head(10)

Unnamed: 0,area_code,area,year_code,year,Govt_expenditure_on_Ag_forest_fish,total_govt_expenditure,AOI_for_govt_expenditure
0,1,Armenia,2003,2003,,990.01,0.2
1,1,Armenia,2004,2004,,1181.94,0.19
2,1,Armenia,2005,2005,,1403.58,0.23
3,1,Armenia,2006,2006,,1590.91,0.23
4,1,Armenia,2007,2007,,1998.6,0.24
5,1,Armenia,2008,2008,,2226.36,0.27
6,1,Armenia,2009,2009,90.51,2356.81,0.24
7,1,Armenia,2010,2010,93.82,2252.47,0.26
8,1,Armenia,2011,2011,95.55,2242.21,0.22
9,1,Armenia,2012,2012,51.02,2322.04,0.12


In [82]:
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4777 entries, 0 to 4776
Data columns (total 7 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   area_code                           4777 non-null   int64  
 1   area                                4777 non-null   object 
 2   year_code                           4777 non-null   int64  
 3   year                                4777 non-null   int64  
 4   Govt_expenditure_on_Ag_forest_fish  2600 non-null   float64
 5   total_govt_expenditure              4770 non-null   float64
 6   AOI_for_govt_expenditure            4706 non-null   float64
dtypes: float64(3), int64(3), object(1)
memory usage: 261.4+ KB


The good news is we have complete data regarding the Agricultural Orientation 
Index for government expenditure. This is important as it measure the emphasis 
of government on agriculture. It tells you whether agriculture receives a 
proportionally larger or smaller share of government spending relative to its 
contribution to the economy.

The AOI of 1 means the government's spending on agriculture is exactly proportional 
to agriculture's contribution to the national economy.

In [83]:
# exporting cleaned data as csv file
merged_data.to_csv('/Users/gurjitsingh/Desktop/MS Data Science/MS_Project_Python/cleaned_datasets/government_investment_cleaned.csv', index='False')