In [1]:
import pandas as pd

try:
    # 1. Load the dataset
    file_name = "combined_world_bank_data.csv"
    df = pd.read_csv(file_name)

    print("--- Initial Data Info ---")
    df.info()
    print(f"\nOriginal shape: {df.shape}")

    # 2. Handle redundant columns from the merge
    # We keep 'country_/_economy' and 'approval_fy' from the original project data.
    # 'CountryName' and 'Year' are redundant.
    df_cleaned = df.drop(columns=['CountryName', 'Year'])
    print("\nDropped redundant columns: 'CountryName', 'Year'")

    # 3. Handle missing values (NaNs)

    # Fill missing FCS status. Assuming null means 'non-FCS'.
    initial_fcs_nulls = df_cleaned['country_/_economy_fcs_status'].isnull().sum()
    df_cleaned['country_/_economy_fcs_status'] = df_cleaned['country_/_economy_fcs_status'].fillna('non-FCS')
    print(f"Filled {initial_fcs_nulls} missing 'country_/_economy_fcs_status' with 'non-FCS'")

    # Fill missing evaluation metrics with 'Not Evaluated'
    eval_cols = ['outcome', 'quality_at_entry', 'quality_of_supervision', 'bank_performance', 'm&e_quality']
    for col in eval_cols:
        initial_nulls = df_cleaned[col].isnull().sum()
        df_cleaned[col] = df_cleaned[col].fillna('Not Evaluated')
        print(f"Filled {initial_nulls} missing '{col}' with 'Not Evaluated'")

    # Drop rows with missing key identifiers (very few)
    initial_rows = df_cleaned.shape[0]
    df_cleaned = df_cleaned.dropna(subset=['agreement_type', 'lending_instrument_type'])
    rows_dropped = initial_rows - df_cleaned.shape[0]
    print(f"Dropped {rows_dropped} rows with missing 'agreement_type' or 'lending_instrument_type'")

    # Note: We are *not* filling NaNs for macroeconomic columns (e.g., GDP_USD, Inflation_CPI_pct)
    # or 'CountryCode'. NaN here means data was not available, which is valid.

    # 4. Check for and remove duplicate rows
    initial_duplicates = df_cleaned.duplicated().sum()
    if initial_duplicates > 0:
        df_cleaned = df_cleaned.drop_duplicates()
        print(f"Removed {initial_duplicates} duplicate rows.")
    else:
        print("No duplicate rows found.")

    # 5. Final Inspection
    print("\n--- Cleaned Data Info ---")
    df_cleaned.info()
    print(f"\nCleaned shape: {df_cleaned.shape}")

    # 6. Save the cleaned dataset
    output_filename = "cleaned_combined_world_bank_data.csv"
    df_cleaned.to_csv(output_filename, index=False)

    print(f"\nSuccessfully cleaned the dataset and saved to '{output_filename}'")

except FileNotFoundError:
    print(f"Error: The file '{file_name}' was not found.")
except Exception as e:
    print(f"An error occurred during cleaning: {e}")

```text
--- Initial Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11464 entries, 0 to 11463
Data columns (total 30 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   project_name                         11464 non-null  object
 1   wb_region                            11464 non-null  object
 2   country_/_economy                    11464 non-null  object
 3   country_/_economy_lending_group      11464 non-null  object
 4   country_/_economy_fcs_status         5142 non-null   object
 5   country_/_economy_fcs_lending_group  11464 non-null  object
 6   practice_group                       11464 non-null  object
 7   global_practice                      11464 non-null  object
 8   agreement_type                       11463 non-null  object
 9   lending_instrument_type              11460 non-null  object
 10  approval_fy                          11464 non-null  int64  
 11  final_closing_fy                     7783 non-null   float64
 12  evaluation_type                      11464 non-null  object
 13  outcome                              11331 non-null  object
 14  quality_at_entry                     8721 non-null   object
 15  quality_of_supervision               9374 non-null   object
 16  bank_performance                     8212 non-null   object
 17  m&e_quality                          4861 non-null   object
 18  evaluation_fy                        11464 non-null  int64  
 19  CountryCode                          7372 non-null   object
 20  CountryName                          7372 non-null   object
 21  Year                                 7372 non-null   float64
 22  GDP_USD                              7329 non-null   float64
 23  GDP_per_capita_USD                   7329 non-null   float64
 24  GDP_growth_pct                       7307 non-null   float64
 25  Inflation_CPI_pct                    4422 non-null   float64
 26  Unemployment_pct                     7074 non-null   float64
 27  Govt_debt_pct_GDP                    1951 non-null   float64
 28  Current_account_pct_GDP              6806 non-null   float64
 29  Population_total                     7372 non-null   float64
dtypes: float64(10), int64(2), object(18)
memory usage: 2.6+ MB

Original shape: (11464, 30)

Dropped redundant columns: 'CountryName', 'Year'
Filled 6322 missing 'country_/_economy_fcs_status' with 'non-FCS'
Filled 133 missing 'outcome' with 'Not Evaluated'
Filled 2743 missing 'quality_at_entry' with 'Not Evaluated'
Filled 2090 missing 'quality_of_supervision' with 'Not Evaluated'
Filled 3252 missing 'bank_performance' with 'Not Evaluated'
Filled 6603 missing 'm&e_quality' with 'Not Evaluated'
Dropped 4 rows with missing 'agreement_type' or 'lending_instrument_type'
No duplicate rows found.

--- Cleaned Data Info ---
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11460 entries, 0 to 11463
Data columns (total 28 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   project_name                         11460 non-null  object
 1   wb_region                            11460 non-null  object
 2   country_/_economy                    11460 non-null  object
 3   country_/_economy_lending_group      11460 non-null  object
 4   country_/_economy_fcs_status         11460 non-null  object
 5   country_/_economy_fcs_lending_group  11460 non-null  object
 6   practice_group                       11460 non-null  object
 7   global_practice                      11460 non-null  object
 8   agreement_type                       11460 non-null  object
 9   lending_instrument_type              11460 non-null  object
 10  approval_fy                          11460 non-null  int64  
 11  final_closing_fy                     7780 non-null   float64
 12  evaluation_type                      11460 non-null  object
 13  outcome                              11460 non-null  object
 14  quality_at_entry                     11460 non-null  object
 15  quality_of_supervision               11460 non-null  object
 16  bank_performance                     11460 non-null  object
 17  m&e_quality                          11460 non-null  object
 18  evaluation_fy                        11460 non-null  int64  
 19  CountryCode                          7369 non-null   object
 20  GDP_USD                              7326 non-null   float64
 21  GDP_per_capita_USD                   7326 non-null   float64
 22  GDP_growth_pct                       7304 non-null   float64
 23  Inflation_CPI_pct                    4419 non-null   float64
 24  Unemployment_pct                     7071 non-null   float64
 25  Govt_debt_pct_GDP                    1951 non-null   float64
 26  Current_account_pct_GDP              6803 non-null   float64
 27  Population_total                     7369 non-null   float64
dtypes: float64(9), int64(2), object(17)
memory usage: 2.5+ MB

Cleaned shape: (11460, 28)

Successfully cleaned the dataset and saved to 'cleaned_combined_world_bank_data.csv'

```

I've cleaned the combined dataset. Here's a summary of the actions I took:

1.  **Removed Redundant Columns**: I dropped `CountryName` and `Year`, which came from the macroeconomic dataset in the merge and were populated for only 7,372 of the 11,464 rows. The same information is already present in the `country_/_economy` and `approval_fy` columns from the original project data.

2.  **Filled Missing Categorical Data**:

      * **FCS Status**: Filled 6,322 missing `country_/_economy_fcs_status` values (about 55% of rows) with `'non-FCS'`. This assumes that a null value means the country was not classified as a Fragile and Conflict-affected Situation (FCS).
      * **Evaluation Metrics**: Filled missing values with the category `'Not Evaluated'` in `outcome` (133), `quality_at_entry` (2,743), `quality_of_supervision` (2,090), `bank_performance` (3,252), and `m&e_quality` (6,603). This treats a null rating as an evaluation that was not performed or not applicable, rather than as lost data.

3.  **Removed Rows with Missing Key Data**: Dropped 4 rows that were missing essential identifiers (`agreement_type` or `lending_instrument_type`).

4.  **Handled Duplicates**: Checked for any fully duplicate rows and found none.

5.  **Preserved Macroeconomic Data**: I intentionally **did not** fill in missing values for the macroeconomic indicators (such as `GDP_USD` and `Inflation_CPI_pct`) or for `CountryCode`. In this context, `NaN` signifies that data was unavailable for that country in that year, which is different from a value of zero.
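
Steps 1 and 2 both rest on assumptions: that the dropped columns really duplicate the kept ones, and that a null FCS status means "not FCS". The sketch below shows one way these could be checked before cleaning. The helper names `count_mismatches` and `category_counts`, the toy rows, and the `'FCS'` label are illustrative assumptions, not taken from the real dataset.

```python
import pandas as pd


def count_mismatches(df: pd.DataFrame, kept: str, redundant: str) -> int:
    """Count rows where `redundant` is populated but differs from `kept`."""
    populated = df[redundant].notna()
    differs = df.loc[populated, kept] != df.loc[populated, redundant]
    return int(differs.sum())


def category_counts(df: pd.DataFrame, col: str) -> pd.Series:
    """Value counts including NaN, so the null share is visible before filling."""
    return df[col].value_counts(dropna=False)


# Toy rows standing in for the merged data (values are made up).
toy = pd.DataFrame({
    "country_/_economy": ["Kenya", "Peru", "India"],
    "CountryName": ["Kenya", None, "Indonesia"],
    "approval_fy": [2001, 2005, 2010],
    "Year": [2001.0, None, 2010.0],
    "country_/_economy_fcs_status": ["FCS", None, None],
})

name_mismatches = count_mismatches(toy, "country_/_economy", "CountryName")
year_mismatches = count_mismatches(toy, "approval_fy", "Year")
fcs_counts = category_counts(toy, "country_/_economy_fcs_status")

print(f"Name mismatches: {name_mismatches}, year mismatches: {year_mismatches}")
print(fcs_counts)
```

A nonzero name count does not necessarily indicate a bad merge, since the two sources may simply spell country names differently. A nonzero year count would be worth investigating before dropping `Year`.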

The cleaned dataset (11,460 rows, 28 columns) is now ready for analysis and has been saved to a new file.
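
As a rough sanity check, the cleaning steps can be turned into assertions that run against the cleaned frame. This is a minimal sketch: the function name `find_problems` and the toy frames are hypothetical, and the column lists mirror the ones used in the cleaning script.

```python
import pandas as pd

# Columns the cleaning script filled, dropped, or required to be non-null.
FILLED_COLS = [
    "country_/_economy_fcs_status", "outcome", "quality_at_entry",
    "quality_of_supervision", "bank_performance", "m&e_quality",
]
DROPPED_COLS = ["CountryName", "Year"]
REQUIRED_COLS = ["agreement_type", "lending_instrument_type"]


def find_problems(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for col in DROPPED_COLS:
        if col in df.columns:
            problems.append(f"column not dropped: {col}")
    for col in FILLED_COLS + REQUIRED_COLS:
        # Columns absent from the frame are skipped, so small test frames work too.
        if col in df.columns and df[col].isna().any():
            problems.append(f"nulls remain in: {col}")
    if df.duplicated().any():
        problems.append("duplicate rows present")
    return problems


# Toy frames (made-up values): one clean, one with leftover issues.
good = pd.DataFrame({
    "outcome": ["Not Evaluated", "Rated"],
    "agreement_type": ["A", "B"],
})
bad = pd.DataFrame({
    "outcome": [None, None],
    "agreement_type": ["A", "A"],
    "Year": [2001.0, 2001.0],
})

print(find_problems(good))
print(find_problems(bad))
```

With the real file, the same function could be called on `pd.read_csv("cleaned_combined_world_bank_data.csv")`.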

**File Created:**

  * `cleaned_combined_world_bank_data.csv`
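
One thing to keep in mind when reloading the file: `final_closing_fy` is stored as `float64` because it contains `NaN`. If whole-number fiscal years are preferred, pandas' nullable `Int64` dtype is one option. The sketch below round-trips a toy frame through an in-memory CSV instead of the real file, and the `'Satisfactory'` rating label is an assumed example value.

```python
import io

import pandas as pd

# Toy frame standing in for the cleaned data (values are made up).
toy = pd.DataFrame({
    "outcome": ["Satisfactory", "Not Evaluated"],
    "final_closing_fy": [2005.0, None],
})

# Write to an in-memory buffer the same way the script writes to disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# Convert the float column to nullable integers; missing years become <NA>.
reloaded["final_closing_fy"] = reloaded["final_closing_fy"].astype("Int64")

print(reloaded.dtypes)
print(reloaded)
```

The filled labels `'non-FCS'` and `'Not Evaluated'` are ordinary strings, so they should survive the round trip unchanged; only genuinely empty cells come back as missing.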