# Data Wrangling: Population By Country

The `population_by_country_AUG.2024.csv` file needs some wrangling before it can be used. Specifically, it contains regions that do not correspond to the subregions we want to report. This notebook walks through the steps required to extract and format the correct output from this CSV file.

## Step 1: Read in CSV File

The following package is required to run the code in this notebook:

In [1]:
import pandas as pd

Next, we read in the CSV file:

In [7]:
population_by_country = pd.read_csv("../../data/population_by_country_AUG.2024.csv")

## Step 2: Preserve Specific Sub Regions

Here is a manually curated list of the unspecific, higher-level regions that appear in the CSV file. We first remove these rows from the DataFrame.

In [12]:
unspecific_region_list = ["WORLD", "AFRICA", "LATIN AMERICA AND THE CARIBBEAN", "ASIA", "EUROPE"]

In [16]:
population_by_country = population_by_country[~population_by_country.Geography.isin(unspecific_region_list)]
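As a minimal sketch of what this filter does, the toy frame below stands in for the real data. Its rows and the `drop_labels` list are illustrative assumptions, not values read from the CSV; only the `Geography` column name comes from the notebook.

```python
import pandas as pd

# Toy stand-in for the real data; these rows are illustrative only.
toy = pd.DataFrame(
    {"Geography": ["WORLD", "AFRICA", "EASTERN AFRICA", "Kenya", "ASIA", "Japan"]}
)

# Aggregate labels to drop (a hypothetical subset of the curated list).
drop_labels = ["WORLD", "AFRICA", "ASIA"]

# isin() builds a boolean mask of matching rows; ~ inverts it,
# so only rows whose Geography is NOT in drop_labels are kept.
filtered = toy[~toy["Geography"].isin(drop_labels)]

remaining = filtered["Geography"].tolist()
print(remaining)
```

Note that `isin` matches whole values exactly, so a more specific label such as "EASTERN AFRICA" survives even though "AFRICA" is in the drop list.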

## Step 3: Match Countries to Curated Sub Regions

Because the region names in the file are uppercase, we can use pandas to assign a "Region" field to each country: keep the `Geography` value only where it is uppercase, then forward-fill so that each country inherits the nearest preceding uppercase region name.

In [19]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when a column is set directly on the filtered frame from Step 2.
population_by_country = population_by_country.assign(Region=population_by_country['Geography'].where(population_by_country['Geography'].str.isupper()).ffill())
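To make the forward-fill step concrete, here is a small sketch on a toy frame. The rows are illustrative assumptions about the file's layout (an uppercase region row followed by its countries), not data taken from the CSV.

```python
import pandas as pd

# Toy layout: each uppercase region row is followed by its countries.
toy = pd.DataFrame(
    {"Geography": ["EASTERN AFRICA", "Kenya", "Uganda", "WESTERN EUROPE", "France"]}
)

# True for rows whose name is entirely uppercase (the region header rows).
is_region = toy["Geography"].str.isupper()

# where() keeps the value on region rows and puts NaN on country rows.
region_only = toy["Geography"].where(is_region)

# ffill() propagates the last seen region name down to the rows below it.
toy = toy.assign(Region=region_only.ffill())

regions = toy["Region"].tolist()
print(toy)
```

This relies on row order: a country is labeled with whichever uppercase row most recently precedes it, so the approach only holds if the file lists each region immediately above its own countries.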


## Step 4: Remove Sub Regions From the Population by Country DataFrame

Now that we have identified the relevant sub region for each country, we remove the sub region rows from the DataFrame and write the curated DataFrame to a CSV file.

In [25]:
population_by_country = population_by_country[~population_by_country.Geography.str.isupper()]
population_by_country.to_csv("../../data/population_by_country_region_AUG.2024.csv")
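Before writing the file, a couple of quick checks can confirm the result has the expected shape. The sketch below uses a toy frame with illustrative rows and writes to an in-memory buffer rather than the real path. Passing `index=False` is an optional choice shown here as an assumption about the desired output, since the cell above writes the row index as an extra column.

```python
import io

import pandas as pd

# Toy frame as it might look after Step 3; rows are illustrative only.
toy = pd.DataFrame(
    {
        "Geography": ["EASTERN AFRICA", "Kenya", "WESTERN EUROPE", "France"],
        "Region": ["EASTERN AFRICA", "EASTERN AFRICA", "WESTERN EUROPE", "WESTERN EUROPE"],
    }
)

# Drop the uppercase region header rows, keeping only countries.
countries = toy[~toy["Geography"].str.isupper()]

# Every remaining row should have a region, and none should be a header row.
all_labeled = bool(countries["Region"].notna().all())
no_headers = not bool(countries["Geography"].str.isupper().any())

# Write to an in-memory buffer; index=False omits the row index column.
buffer = io.StringIO()
countries.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

One caveat of filtering on `str.isupper()`: any country whose name happens to be written entirely in uppercase in the source file would be treated as a region row and dropped, so it is worth comparing row counts before and after.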