# INFO 2950 Phase II

## Research Questions

We aim to investigate the question(s):

How does government expenditure on education impact long-term economic growth and unemployment in developing versus developed countries?

## Data Description

### What are the observations (rows) and the attributes (columns)?

Observations (Rows): Each row corresponds to one country and one economic indicator (e.g., GDP growth, government expenditure on education, or unemployment), with values recorded across multiple years (1965-2023). Together, the rows show country-level trends in these indicators over time.

Attributes (Columns):

* Country Name: The name of the country.
* Country Code: A unique identifier for each country (ISO or similar).
* Series Name: The economic indicator (e.g., GDP growth, unemployment, government expenditure on education).
* Year Columns: Yearly data points from 1965 to 2023, where each column holds the recorded value of the indicator for that year.
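The wide layout described here (one column per year) can be reshaped into a long layout with one row per country, indicator, and year, which is often easier to plot or model. The following is a minimal sketch on made-up values; the country `Exampleland`, its code `EXL`, and the numbers are hypothetical, and the year-column naming (`1965 [YR1965]`) is assumed to match the export format used in the cleaning code.

```python
import pandas as pd

# Toy frame in the assumed wide layout; all values are invented for illustration.
wide = pd.DataFrame({
    "Country Name": ["Exampleland", "Exampleland"],
    "Country Code": ["EXL", "EXL"],
    "Series Name": [
        "GDP growth (annual %)",
        "Government expenditure on education, total (% of GDP)",
    ],
    "1965 [YR1965]": [2.0, 3.5],
    "1966 [YR1966]": [2.5, 3.6],
})

id_cols = ["Country Name", "Country Code", "Series Name"]

# One row per (country, indicator, year) instead of one column per year.
long = wide.melt(id_vars=id_cols, var_name="year_label", value_name="value")

# Turn a label such as "1965 [YR1965]" into the integer 1965.
long["year"] = long["year_label"].str.split().str[0].astype(int)
long = long.drop(columns="year_label")

print(long)
```

Whether to work in the wide or the long layout is a design choice; the wide form is convenient for row-wise interpolation, the long form for grouping and plotting.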

### Why was this dataset created?

The dataset was created to track economic performance and trends in different countries over time. The data helps governments, policymakers, economists, and international organizations understand key economic metrics such as GDP growth, government expenditure on education, and unemployment rates, which can guide economic development strategies and interventions in different regions of the world. 

### Who funded the creation of the dataset?

The dataset was funded and compiled by the World Bank, a global financial institution that supports development efforts in countries worldwide. Because its mission is to reduce poverty and promote sustainable development, the World Bank regularly collects and publishes data on a wide range of economic indicators to aid policy-making and development planning.

### What processes might have influenced what data was observed and recorded and what was not?

Data availability and reliability: Some countries may not have consistent or complete data records due to political instability, inadequate infrastructure, or limited data collection capabilities. This can lead to missing or incomplete data for certain years.

Selection of indicators: The choice of economic indicators included in the dataset (e.g., GDP growth, unemployment, government expenditure) reflects the focus areas of development agencies and governments. We believe the chosen indicators are regarded as essential for assessing economic health.

Reporting standards: Differences in national reporting standards and data collection methods could lead to variations in data quality and coverage.

Bias in data recording: Some data points may reflect self-reported figures from governments, which could be subject to political influence or reporting biases, particularly in countries where transparency or accurate reporting is less enforced.

### What preprocessing was done, and how did the data come to be in the form that you are using?

The original data consisted of four datasets grouped by income level (low_income, lower_middle, upper_middle, high_income) to distinguish between developing and developed countries. Preprocessing was done to standardize economic indicators and ensure consistency across countries and years.

First, only the relevant economic indicators (GDP growth, government expenditure on education, and unemployment) were retained. Rows missing essential information such as Country Name, Country Code, or Series Name were removed. Year columns from 1960 to 1964 were excluded due to data quality issues, and non-numeric values like ".." were replaced with NaN.
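As a small illustration of the placeholder handling, `pd.to_numeric` with `errors="coerce"` converts any value that cannot be parsed as a number, including "..", into NaN. The sketch below uses invented values and assumes the year columns are read in as strings.

```python
import pandas as pd

# Toy year columns as they might look straight after reading the CSV.
raw = pd.DataFrame({
    "1965 [YR1965]": ["1.2", "..", "3.4"],
    "1966 [YR1966]": ["..", "2.0", "3.6"],
})

# errors="coerce" maps unparseable entries such as ".." to NaN.
numeric = raw.apply(pd.to_numeric, errors="coerce")

# Count how many values are missing in each year column.
missing_per_year = numeric.isna().sum()
print(missing_per_year)
```

Counting the missing values right after conversion gives a baseline for judging how much of the final dataset was later estimated.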

To handle missing values, group-based imputation with country-specific means was considered, but that step is commented out in the cleaning code and is not applied. Linear interpolation was used to estimate missing values from adjacent data points. Rows that still contained NaN were dropped because they had too many missing values for linear interpolation to work well, and a new column, country_type, was added to label each country by its income group. Finally, all datasets were merged into a single dataset for analysis.

### If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?

Data collection for this type of dataset is typically done through government reporting mechanisms and international data collection efforts, not directly involving individuals. Governments or institutions contributing data likely understood that the information would be used for development assessments, economic planning, and research purposes. However, the individuals whose economic data are aggregated (e.g., unemployment figures) may not be aware of the specific data collection.


### Where can your raw source data be found, if applicable? Provide a link to the raw data.

https://databank.worldbank.org/source/world-development-indicators#


## Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import seaborn as sns

## Data Cleaning

In [None]:

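def clean_dataset(data, country_type_label):
    """Filter, convert, and interpolate one income-group export."""
    # Keep only the three indicators used in the analysis
    indicators = [
        'GDP growth (annual %)',
        'Government expenditure on education, total (% of GDP)',
        'Unemployment, total (% of total labor force) (modeled ILO estimate)'
    ]
    # .copy() avoids pandas' SettingWithCopyWarning on the assignments below
    cleaned_data = data[data['Series Name'].isin(indicators)].copy()

    # Drop rows missing any identifying field
    cleaned_data = cleaned_data.dropna(subset=['Country Name', 'Country Code', 'Series Name'])

    # Drop 1960-1964; errors='ignore' skips columns that a file does not contain
    columns_to_drop = ['1960 [YR1960]', '1961 [YR1961]', '1962 [YR1962]', '1963 [YR1963]', '1964 [YR1964]']
    cleaned_data = cleaned_data.drop(columns=columns_to_drop, errors='ignore')

    # Year columns look like '1965 [YR1965]'; coerce placeholders such as ".." to NaN
    year_columns = [col for col in cleaned_data.columns if col.split()[0].isdigit()]
    cleaned_data[year_columns] = cleaned_data[year_columns].apply(pd.to_numeric, errors='coerce')

    # Country-mean imputation, currently disabled
    #cleaned_data[year_columns] = cleaned_data.groupby('Country Name')[year_columns].transform(lambda x: x.fillna(x.mean()))

    # Interpolate along the years within each row (axis=1); axis=0 would
    # interpolate down each year column, mixing different countries and indicators
    cleaned_data[year_columns] = cleaned_data[year_columns].interpolate(method='linear', axis=1)
    # Drop rows that still have gaps (e.g., early years with no preceding value)
    cleaned_data = cleaned_data.dropna()

    # Label every row with the income group of its source file
    cleaned_data['country_type'] = country_type_label

    return cleaned_data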


# Exported data files, listed in the same order as country_types below
file_paths = [
    '94086144-6ad8-4b75-ac26-1b60a764018a_Data.csv',
    '3fd493b6-dfe0-4afd-b296-1b5892e64ba8_Data.csv',
    '579772b1-602f-4cf6-ac0b-1bb1289918f8_Data.csv',
    '5026bb4a-f8ac-4ee9-860c-f83d47a7aded_Data.csv'
]

# Income-group label applied to every row of the matching file
country_types = ['low_income', 'lower_middle', 'upper_middle', 'high_income']

cleaned_datasets = []

for file_path, country_type in zip(file_paths, country_types):
    data = pd.read_csv(file_path)
    cleaned_data = clean_dataset(data, country_type)
    cleaned_datasets.append(cleaned_data)

merged_data = pd.concat(cleaned_datasets, ignore_index=True)
merged_file_path = 'final_cleaned_data.csv'
#merged_data.to_csv(merged_file_path, index=False)

merged_data.head()

The data cleaning process involved multiple structured steps to ensure consistency, completeness, and proper formatting of the data for analysis. First, the datasets were filtered to include only relevant economic indicators: GDP growth (annual %), Government expenditure on education (total % of GDP), and Unemployment (total % of labor force, modeled ILO estimate). Any rows missing essential identifying information, such as Country Name, Country Code, or Series Name, were removed to maintain data integrity.

Next, columns for the years 1960 to 1964 were removed as these years were not part of the analysis scope. The year columns were then converted to numeric values, replacing non-numeric placeholders like ".." with NaN, allowing for proper numeric operations.

Following this, group-based imputation (filling each country's gaps with that country's mean) was considered, but that line is commented out in the cleaning function, so it is not applied. For the time-series data, linear interpolation was used to estimate missing values based on the trends of adjacent data points. After this, any remaining rows with NaN values were dropped to ensure a clean dataset.
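To make the two gap-filling strategies concrete, here is a minimal sketch on invented numbers. It assumes each row is one country's series and each column is a year, and that interpolation is meant to follow a single country's own trend over time (`axis=1`). The labels `Country A` and `Country B` and the plain year column names are hypothetical.

```python
import numpy as np
import pandas as pd

years = ["1965", "1966", "1967", "1968"]
df = pd.DataFrame(
    [[1.0, np.nan, 3.0, 4.0],
     [10.0, 20.0, np.nan, 40.0]],
    index=["Country A", "Country B"],
    columns=years,
)

# Linear interpolation along the years within each row.
by_year = df.interpolate(method="linear", axis=1)

# Alternative: fill each gap with the mean of that row's observed values.
row_mean = df.apply(lambda row: row.fillna(row.mean()), axis=1)

# For contrast, axis=0 interpolates down each year column, so a gap in one
# row is filled from neighbouring rows, which belong to other countries.
by_column = df.interpolate(method="linear", axis=0)

print(by_year)
print(row_mean)
```

Interpolation keeps the filled value between its neighbours in time, while mean filling ignores when the gap occurs, so the two can differ noticeably for series with a strong trend.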

Finally, a new column, country_type, was added to each dataset, labeling countries based on the file they came from (e.g., low_income, lower_middle, upper_middle, and high_income). After cleaning each individual dataset, the datasets were concatenated into a single merged dataset.
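A quick check after merging is to count rows per income group and indicator, which shows how much data survived cleaning in each group. The sketch below uses two tiny stand-in frames; the frames `low` and `high` and their contents are hypothetical.

```python
import pandas as pd

# Stand-ins for two cleaned income-group datasets.
low = pd.DataFrame({
    "Country Name": ["A", "B"],
    "Series Name": ["GDP growth (annual %)"] * 2,
})
high = pd.DataFrame({
    "Country Name": ["C"],
    "Series Name": ["GDP growth (annual %)"],
})

# Label each frame with its income group, then stack them.
frames = []
for frame, label in [(low, "low_income"), (high, "high_income")]:
    frames.append(frame.assign(country_type=label))

merged = pd.concat(frames, ignore_index=True)

# Rows per (income group, indicator) after merging.
counts = merged.groupby(["country_type", "Series Name"]).size()
print(counts)
```

Large differences in these counts between income groups would suggest that dropping incomplete rows affected some groups more than others.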

## Data Analysis

In [None]:
# data analysis code here




## Data Limitations

Missing Data: Some countries have incomplete data for certain years due to gaps in reporting, which may result in missing or estimated values that do not fully reflect actual conditions.

Bias in Self-Reporting: Data from governments may be subject to reporting biases, especially in politically sensitive areas such as unemployment or economic growth.

Limited Detail: The dataset contains national-level data, which may overlook regional disparities within countries.

Imputation and Interpolation: Filling in missing data by interpolation (or by mean imputation, which is currently commented out in the cleaning code) may introduce trends that do not accurately reflect economic realities.
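One way to gauge this limitation is to record which cells were estimated rather than observed, so the share of filled values can be reported alongside any result. This is a minimal sketch on invented numbers, assuming a wide layout with one row per series and one column per year; the names `observed`, `filled`, and `was_filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: one gap in Country A, none in Country B.
observed = pd.DataFrame(
    {"1965": [1.0, 4.0], "1966": [np.nan, 5.0], "1967": [3.0, 6.0]},
    index=["Country A", "Country B"],
)

# Fill gaps along the years within each row.
filled = observed.interpolate(method="linear", axis=1)

# True where a value was missing originally and is present after filling.
was_filled = observed.isna() & filled.notna()

# Fraction of all cells that are estimates rather than observations.
share_filled = was_filled.to_numpy().mean()
print(share_filled)
```

Keeping the mask makes it possible to rerun an analysis on observed values only and compare the outcome.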

## Questions for Reviewers

* Good question #1
* Good question #2

## Resources

Dataset: https://databank.worldbank.org/source/world-development-indicators#