# About Dataset
### Description
This dataset describes all countries worldwide across a wide range of indicators: demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and more. With every country represented, it offers a global perspective and supports in-depth analyses and cross-country comparisons.

### Key Features
* Country: Name of the country.
* Density (P/Km2): Population density measured in persons per square kilometer.
* Abbreviation: Abbreviation or code representing the country.
* Agricultural Land (%): Percentage of land area used for agricultural purposes.
* Land Area (Km2): Total land area of the country in square kilometers.
* Armed Forces Size: Size of the armed forces in the country.
* Birth Rate: Number of births per 1,000 population per year.
* Calling Code: International calling code for the country.
* Capital/Major City: Name of the capital or major city.
* CO2 Emissions: Carbon dioxide emissions in tons.
* CPI: Consumer Price Index, a measure of inflation and purchasing power.
* CPI Change (%): Percentage change in the Consumer Price Index compared to the previous year.
* Currency_Code: Currency code used in the country.
* Fertility Rate: Average number of children born to a woman during her lifetime.
* Forested Area (%): Percentage of land area covered by forests.
* Gasoline_Price: Price of gasoline per liter in local currency.
* GDP: Gross Domestic Product, the total value of goods and services produced in the country.
* Gross Primary Education Enrollment (%): Gross enrollment ratio for primary education.
* Gross Tertiary Education Enrollment (%): Gross enrollment ratio for tertiary education.
* Infant Mortality: Number of deaths per 1,000 live births before reaching one year of age.
* Largest City: Name of the country's largest city.
* Life Expectancy: Average number of years a newborn is expected to live.
* Maternal Mortality Ratio: Number of maternal deaths per 100,000 live births.
* Minimum Wage: Minimum wage level in local currency.
* Official Language: Official language(s) spoken in the country.
* Out of Pocket Health Expenditure (%): Percentage of total health expenditure paid out-of-pocket by individuals.
* Physicians per Thousand: Number of physicians per thousand people.
* Population: Total population of the country.
* Population: Labor Force Participation (%): Percentage of the population that is part of the labor force.
* Tax Revenue (%): Tax revenue as a percentage of GDP.
* Total Tax Rate: Overall tax burden as a percentage of commercial profits.
* Unemployment Rate: Percentage of the labor force that is unemployed.
* Urban Population: Percentage of the population living in urban areas.
* Latitude: Latitude coordinate of the country's location.
* Longitude: Longitude coordinate of the country's location.

### Potential Use Cases
* Analyze population density and land area to study spatial distribution patterns.
* Investigate the relationship between agricultural land and food security.
* Examine carbon dioxide emissions and their impact on climate change.
* Explore correlations between economic indicators such as GDP and various socio-economic factors.
* Investigate educational enrollment rates and their implications for human capital development.
* Analyze healthcare metrics such as infant mortality and life expectancy to assess overall well-being.
* Study labor market dynamics through indicators such as labor force participation and unemployment rates.
* Investigate the role of taxation and its impact on economic development.
* Explore urbanization trends and their social and environmental consequences.

### Data Source
This dataset was compiled from multiple data sources.

### Purpose of This Script
This script prepares the data set for analysis and looks for relationships among the variables through visualization and statistical testing.

In [1]:
import pandas as pd

# Set the maximum number of columns to display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
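
Setting both options to `None` applies for the rest of the session, so every later display prints all rows and columns. As a minimal sketch of a scoped alternative, assuming pandas' `option_context` context manager suits the workflow, the limits can be lifted only where a full view is wanted. The frame `small` is a hypothetical stand-in, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data
small = pd.DataFrame({"a": range(3), "b": range(3)})

default_rows = pd.get_option("display.max_rows")

# The display limits are lifted only inside this block and restored on exit
with pd.option_context("display.max_columns", None, "display.max_rows", None):
    rows_inside = pd.get_option("display.max_rows")
    print(small)

rows_after = pd.get_option("display.max_rows")
```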

Delete the zip file if it exists.

In [2]:
import os

zip_file_path = "countries-of-the-world-2023.zip"  # Specify the file path

if os.path.exists(zip_file_path):  # Check if the file exists
    os.remove(zip_file_path)  # Remove the file
    print(f"File '{zip_file_path}' has been successfully removed.")
else:
    print(f"File '{zip_file_path}' does not exist.")

File 'countries-of-the-world-2023.zip' has been successfully removed.


Retrieve the zipped data set from Kaggle and save it into the default directory.

In [3]:
os.environ['KAGGLE_USERNAME'] = 'your_kaggle_username'  # Replace with your Kaggle username
os.environ['KAGGLE_KEY'] = 'your_kaggle_api_key'  # Replace with your Kaggle API key; never commit a real key

!kaggle datasets download -d nelgiriyewithana/countries-of-the-world-2023

Downloading countries-of-the-world-2023.zip to C:\Users\rsmcd\OneDrive\Desktop\Github Showcase




  0%|          | 0.00/23.5k [00:00<?, ?B/s]
100%|##########| 23.5k/23.5k [00:00<00:00, 845kB/s]
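
Writing an API key directly into a notebook exposes it to anyone who can read the file. The Kaggle CLI can also pick up credentials from a `kaggle.json` file (by default `~/.kaggle/kaggle.json`), so one option is to keep the key there. The sketch below uses a hypothetical helper, `load_kaggle_credentials`, which is not part of the original notebook; it copies the `username` and `key` fields from such a file into the two environment variables used above.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(path, environ=None):
    """Copy 'username' and 'key' from a kaggle.json-style file into environ.

    Hypothetical helper: `path` and `environ` are illustrative parameters.
    """
    environ = os.environ if environ is None else environ
    creds = json.loads(Path(path).read_text(encoding="utf-8"))
    environ["KAGGLE_USERNAME"] = creds["username"]
    environ["KAGGLE_KEY"] = creds["key"]
    return environ
```

A call such as `load_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json")` before the download step would then replace the two hard-coded assignments.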


Delete the extracted subdirectory if it exists.

In [4]:
import shutil

subdirectory_path = "countries-of-the-world-2023"  # Specify the subdirectory path

if os.path.exists(subdirectory_path):  # Check if the subdirectory exists
    shutil.rmtree(subdirectory_path)  # Remove the subdirectory and its contents
    print(f"Subdirectory '{subdirectory_path}' has been successfully deleted.")
else:
    print(f"Subdirectory '{subdirectory_path}' does not exist.")

Subdirectory 'countries-of-the-world-2023' has been successfully deleted.
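
The zip-file cleanup and the subdirectory cleanup follow the same check-then-delete pattern. A minimal sketch of one helper covering both cases is shown here; the name `remove_if_present` is hypothetical and not part of the original notebook.

```python
import shutil
from pathlib import Path


def remove_if_present(path):
    """Delete a file or a directory tree; return True if something was removed."""
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)  # removes the directory and everything inside it
        return True
    if target.exists():
        target.unlink()  # removes a single file
        return True
    return False
```

Calling `remove_if_present("countries-of-the-world-2023.zip")` and `remove_if_present("countries-of-the-world-2023")` would make reruns start from a clean state.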


Extract the zip file into a subdirectory.

In [5]:
import zipfile

zip_file_path = "countries-of-the-world-2023.zip"  # Update with the actual path to your ZIP file

with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall("countries-of-the-world-2023")  # Destination directory for the extracted files

# The ZIP file is automatically closed after exiting the `with` block
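
Before extracting, it can help to confirm what the archive contains. The sketch below builds a tiny stand-in archive in a temporary directory (hypothetical content, not the real data) and uses `ZipFile.namelist` and `ZipFile.testzip` to inspect it first.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in archive (hypothetical content, not the real data)
    archive = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("world-data-2023.csv", "Country,GDP\nA,1\n")

    with zipfile.ZipFile(archive, "r") as zf:
        members = zf.namelist()    # file names stored in the archive
        bad_member = zf.testzip()  # None when every member passes its CRC check
        zf.extractall(Path(tmp) / "out")

    extracted = sorted(p.name for p in (Path(tmp) / "out").iterdir())
```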

Build the file path needed to read the data set into a DataFrame.

In [6]:
import pandas as pd

folder_path = "countries-of-the-world-2023"  # Specify the folder path
file_name = "world-data-2023.csv"  # Specify the file name

file_path = os.path.join(folder_path, file_name)  # Combine the folder path and file name

# Read the CSV file into a pandas DataFrame
wd23 = pd.read_csv(file_path)

# Remove the symbols '%', '$', and ',' from every string cell
# (raw string avoids an invalid-escape warning for '\$')
wd23 = wd23.replace(['%', r'\$', ','], '', regex=True)

# You can now use the 'wd23' DataFrame to work with the data from the CSV file
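
Because `replace` with `regex=True` runs over every string cell, it also strips commas, dollar signs, and percent signs from text columns such as country or city names. A sketch of a narrower variant follows, using a hypothetical toy frame and a hypothetical list `numeric_like`; it cleans only the columns expected to hold numbers.

```python
import pandas as pd

# Hypothetical toy data mimicking the raw CSV's formatting
toy = pd.DataFrame({
    "Country": ["Exampleland, Republic of", "Albania"],
    "GDP": ["$1,234", "$56"],
    "Tax revenue (%)": ["9.30%", None],
})

numeric_like = ["GDP", "Tax revenue (%)"]  # columns assumed to hold numbers

cleaned = toy.copy()
cleaned[numeric_like] = cleaned[numeric_like].replace(
    [r"%", r"\$", r","], "", regex=True
)
```

The `Country` value keeps its comma, while the numeric-looking columns are left as plain digit strings ready for conversion.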

Take a first look at the data.

In [7]:
print(file_path)
wd23

countries-of-the-world-2023\world-data-2023.csv


Unnamed: 0,Country,Density\n(P/Km2),Abbreviation,Agricultural Land( %),Land Area(Km2),Armed Forces size,Birth Rate,Calling Code,Capital/Major City,Co2-Emissions,CPI,CPI Change (%),Currency-Code,Fertility Rate,Forested Area (%),Gasoline Price,GDP,Gross primary education enrollment (%),Gross tertiary education enrollment (%),Infant mortality,Largest city,Life expectancy,Maternal mortality ratio,Minimum wage,Official language,Out of pocket health expenditure,Physicians per thousand,Population,Population: Labor force participation (%),Tax revenue (%),Total tax rate,Unemployment rate,Urban_population,Latitude,Longitude
0,Afghanistan,60,AF,58.1,652230.0,323000.0,32.49,93.0,Kabul,8672.0,149.9,2.3,AFN,4.47,2.1,0.7,19101353833.0,104.0,9.7,47.9,Kabul,64.5,638.0,0.43,Pashto,78.4,0.28,38041754.0,48.9,9.3,71.4,11.12,9797273.0,33.93911,67.709953
1,Albania,105,AL,43.1,28748.0,9000.0,11.78,355.0,Tirana,4536.0,119.05,1.4,ALL,1.62,28.1,1.36,15278077447.0,107.0,55.0,7.8,Tirana,78.5,15.0,1.12,Albanian,56.9,1.2,2854191.0,55.7,18.6,36.6,12.33,1747593.0,41.153332,20.168331
2,Algeria,18,DZ,17.4,2381741.0,317000.0,24.28,213.0,Algiers,150006.0,151.36,2.0,DZD,3.02,0.8,0.28,169988236398.0,109.9,51.4,20.1,Algiers,76.7,112.0,0.95,Arabic,28.1,1.72,43053054.0,41.2,37.2,66.1,11.7,31510100.0,28.033886,1.659626
3,Andorra,164,AD,40.0,468.0,,7.2,376.0,Andorra la Vella,469.0,,,EUR,1.27,34.0,1.51,3154057987.0,106.4,,2.7,Andorra la Vella,,,6.63,Catalan,36.4,3.33,77142.0,,,,,67873.0,42.506285,1.521801
4,Angola,26,AO,47.5,1246700.0,117000.0,40.73,244.0,Luanda,34693.0,261.73,17.1,AOA,5.52,46.3,0.97,94635415870.0,113.5,9.3,51.6,Luanda,60.8,241.0,0.71,Portuguese,33.4,0.21,31825295.0,77.5,9.2,49.1,6.89,21061025.0,-11.202692,17.873887
5,Antigua and Barbuda,223,AG,20.5,443.0,0.0,15.33,1.0,St. John's Saint John,557.0,113.81,1.2,XCD,1.99,22.3,0.99,1727759259.0,105.0,24.8,5.0,St. John's Saint John,76.9,42.0,3.04,English,24.3,2.76,97118.0,,16.5,43.0,,23800.0,17.060816,-61.796428
6,Argentina,17,AR,54.3,2780400.0,105000.0,17.02,54.0,Buenos Aires,201348.0,232.75,53.5,ARS,2.26,9.8,1.1,449663446954.0,109.7,90.0,8.8,Buenos Aires,76.5,39.0,3.35,Spanish,17.6,3.96,44938712.0,61.3,10.1,106.3,9.79,41339571.0,-38.416097,-63.616672
7,Armenia,104,AM,58.9,29743.0,49000.0,13.99,374.0,Yerevan,5156.0,129.18,1.4,AMD,1.76,11.7,0.77,13672802158.0,92.7,54.6,11.0,Yerevan,74.9,26.0,0.66,Armenian,81.6,4.4,2957731.0,55.6,20.9,22.6,16.99,1869848.0,40.069099,45.038189
8,Australia,3,AU,48.2,7741220.0,58000.0,12.6,61.0,Canberra,375908.0,119.8,1.6,AUD,1.74,16.3,0.93,1392680589329.0,100.3,113.1,3.1,Sydney,82.7,6.0,13.59,,19.6,3.68,25766605.0,65.5,23.0,47.4,5.27,21844756.0,-25.274398,133.775136
9,Austria,109,AT,32.4,83871.0,21000.0,9.7,43.0,Vienna,61448.0,118.06,1.5,EUR,1.47,46.9,1.2,446314739528.0,103.1,85.1,2.9,Vienna,81.6,5.0,,German,17.9,5.17,8877067.0,60.7,25.4,51.4,4.67,5194416.0,47.516231,14.550072


Check which columns are strings and which are numerical.

In [8]:
column_info = wd23.dtypes.reset_index()
column_info.columns = ['Column_Name', 'Data_Type']
column_info = column_info.sort_values(by='Data_Type').set_index('Column_Name')
column_info

Column_Name,Data_Type
Longitude,float64
Life expectancy,float64
Maternal mortality ratio,float64
Fertility Rate,float64
Latitude,float64
Calling Code,float64
Physicians per thousand,float64
Birth Rate,float64
Infant mortality,float64
Population: Labor force participation (%),object


Getting a list of columns from 'column_info'.

In [9]:
# Print the index of 'column_info' as a list
index_list = column_info.index.tolist()
print(index_list)

['Longitude', 'Life expectancy', 'Maternal mortality ratio', 'Fertility Rate', 'Latitude', 'Calling Code', 'Physicians per thousand', 'Birth Rate', 'Infant mortality', 'Population: Labor force participation (%)', 'Population', 'Tax revenue (%)', 'Out of pocket health expenditure', 'Official language', 'Minimum wage', 'Total tax rate', 'Unemployment rate', 'Urban_population', 'Largest city', 'Country', 'GDP', 'Gasoline Price', 'Forested Area (%)', 'Currency-Code', 'CPI Change (%)', 'CPI', 'Co2-Emissions', 'Capital/Major City', 'Armed Forces size', 'Land Area(Km2)', 'Agricultural Land( %)', 'Abbreviation', 'Density\n(P/Km2)', 'Gross tertiary education enrollment (%)', 'Gross primary education enrollment (%)']
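
The list shows that one header, `'Density\n(P/Km2)'`, contains an embedded newline. If simpler names are wanted, a sketch like the following collapses whitespace runs in each header. The helper `tidy_column_name` is hypothetical and not part of the original notebook, and the other cells refer to the original names, so they would need the same renaming.

```python
import re

import pandas as pd


def tidy_column_name(name):
    """Collapse any run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", name).strip()


# Hypothetical empty frame carrying a few header styles like those listed above
toy = pd.DataFrame(columns=["Density\n(P/Km2)", "Agricultural Land( %)", " GDP "])
toy = toy.rename(columns=tidy_column_name)
tidy_names = list(toy.columns)
```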


Decide which columns should be strings and which should be numeric, then convert them.

In [10]:
# Columns to convert to string format
string_columns = ['Official language', 'Largest city', 'Country', 'Currency-Code', 'Capital/Major City', 'Abbreviation']

# Convert specified columns to string format
wd23[string_columns] = wd23[string_columns].astype(str)

# Determine the columns to convert to numerical format
numerical_columns = [col for col in wd23.columns if col not in string_columns]

# Convert remaining columns to numerical format
wd23[numerical_columns] = wd23[numerical_columns].apply(pd.to_numeric, errors='coerce')

wd23

Unnamed: 0,Country,Density\n(P/Km2),Abbreviation,Agricultural Land( %),Land Area(Km2),Armed Forces size,Birth Rate,Calling Code,Capital/Major City,Co2-Emissions,CPI,CPI Change (%),Currency-Code,Fertility Rate,Forested Area (%),Gasoline Price,GDP,Gross primary education enrollment (%),Gross tertiary education enrollment (%),Infant mortality,Largest city,Life expectancy,Maternal mortality ratio,Minimum wage,Official language,Out of pocket health expenditure,Physicians per thousand,Population,Population: Labor force participation (%),Tax revenue (%),Total tax rate,Unemployment rate,Urban_population,Latitude,Longitude
0,Afghanistan,60,AF,58.1,652230.0,323000.0,32.49,93.0,Kabul,8672.0,149.9,2.3,AFN,4.47,2.1,0.7,19101350000.0,104.0,9.7,47.9,Kabul,64.5,638.0,0.43,Pashto,78.4,0.28,38041750.0,48.9,9.3,71.4,11.12,9797273.0,33.93911,67.709953
1,Albania,105,AL,43.1,28748.0,9000.0,11.78,355.0,Tirana,4536.0,119.05,1.4,ALL,1.62,28.1,1.36,15278080000.0,107.0,55.0,7.8,Tirana,78.5,15.0,1.12,Albanian,56.9,1.2,2854191.0,55.7,18.6,36.6,12.33,1747593.0,41.153332,20.168331
2,Algeria,18,DZ,17.4,2381741.0,317000.0,24.28,213.0,Algiers,150006.0,151.36,2.0,DZD,3.02,0.8,0.28,169988200000.0,109.9,51.4,20.1,Algiers,76.7,112.0,0.95,Arabic,28.1,1.72,43053050.0,41.2,37.2,66.1,11.7,31510100.0,28.033886,1.659626
3,Andorra,164,AD,40.0,468.0,,7.2,376.0,Andorra la Vella,469.0,,,EUR,1.27,34.0,1.51,3154058000.0,106.4,,2.7,Andorra la Vella,,,6.63,Catalan,36.4,3.33,77142.0,,,,,67873.0,42.506285,1.521801
4,Angola,26,AO,47.5,1246700.0,117000.0,40.73,244.0,Luanda,34693.0,261.73,17.1,AOA,5.52,46.3,0.97,94635420000.0,113.5,9.3,51.6,Luanda,60.8,241.0,0.71,Portuguese,33.4,0.21,31825300.0,77.5,9.2,49.1,6.89,21061025.0,-11.202692,17.873887
5,Antigua and Barbuda,223,AG,20.5,443.0,0.0,15.33,1.0,St. John's Saint John,557.0,113.81,1.2,XCD,1.99,22.3,0.99,1727759000.0,105.0,24.8,5.0,St. John's Saint John,76.9,42.0,3.04,English,24.3,2.76,97118.0,,16.5,43.0,,23800.0,17.060816,-61.796428
6,Argentina,17,AR,54.3,2780400.0,105000.0,17.02,54.0,Buenos Aires,201348.0,232.75,53.5,ARS,2.26,9.8,1.1,449663400000.0,109.7,90.0,8.8,Buenos Aires,76.5,39.0,3.35,Spanish,17.6,3.96,44938710.0,61.3,10.1,106.3,9.79,41339571.0,-38.416097,-63.616672
7,Armenia,104,AM,58.9,29743.0,49000.0,13.99,374.0,Yerevan,5156.0,129.18,1.4,AMD,1.76,11.7,0.77,13672800000.0,92.7,54.6,11.0,Yerevan,74.9,26.0,0.66,Armenian,81.6,4.4,2957731.0,55.6,20.9,22.6,16.99,1869848.0,40.069099,45.038189
8,Australia,3,AU,48.2,7741220.0,58000.0,12.6,61.0,Canberra,375908.0,119.8,1.6,AUD,1.74,16.3,0.93,1392681000000.0,100.3,113.1,3.1,Sydney,82.7,6.0,13.59,,19.6,3.68,25766600.0,65.5,23.0,47.4,5.27,21844756.0,-25.274398,133.775136
9,Austria,109,AT,32.4,83871.0,21000.0,9.7,43.0,Vienna,61448.0,118.06,1.5,EUR,1.47,46.9,1.2,446314700000.0,103.1,85.1,2.9,Vienna,81.6,5.0,,German,17.9,5.17,8877067.0,60.7,25.4,51.4,4.67,5194416.0,47.516231,14.550072
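
With `errors='coerce'`, any value that cannot be parsed becomes `NaN` without a warning, so malformed entries are indistinguishable from values that were already missing. A minimal sketch of an audit step, on a hypothetical toy column, flags the cells that held something before conversion but are missing afterwards.

```python
import pandas as pd

# Hypothetical raw column: two parseable values, one missing, one malformed
raw = pd.Series(["12.5", None, "n/a", "7"], dtype=object)

converted = pd.to_numeric(raw, errors="coerce")

# Cells that held a value before conversion but are NaN afterwards
newly_missing = raw.notna() & converted.isna()
lost_values = raw[newly_missing].tolist()
print(lost_values)
```

Running the same comparison per column on the real frame, before overwriting it, would show whether the coercion discarded any data.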


Re-checking the data type of each column.

In [11]:
column_info = wd23.dtypes.reset_index()
column_info.columns = ['Column_Name', 'Data_Type']
column_info = column_info.sort_values(by='Data_Type').set_index('Column_Name')
column_info

Column_Name,Data_Type
Density\n(P/Km2),int64
Gross primary education enrollment (%),float64
Urban_population,float64
Unemployment rate,float64
Total tax rate,float64
Tax revenue (%),float64
Population: Labor force participation (%),float64
Population,float64
Physicians per thousand,float64
Out of pocket health expenditure,float64


Checking to see which columns can be aggregated after string/numerical transformations.

In [12]:
# Filter numerical columns
numerical_columns = wd23.select_dtypes(include=[float, int]).columns

# Calculate the average for each numerical column
averages = wd23[numerical_columns].mean()

# Create a new DataFrame with the averages
averages_df = pd.DataFrame({'Column': averages.index, 'Average': averages.values})

# Print the resulting DataFrame
print(averages_df)

                                       Column       Average
0                            Density\n(P/Km2)  3.567641e+02
1                       Agricultural Land( %)  3.911755e+01
2                              Land Area(Km2)  6.896244e+05
3                           Armed Forces size  1.592749e+05
4                                  Birth Rate  2.021497e+01
5                                Calling Code  3.605464e+02
6                               Co2-Emissions  1.777992e+05
7                                         CPI  1.904610e+02
8                              CPI Change (%)  6.722346e+00
9                              Fertility Rate  2.698138e+00
10                          Forested Area (%)  3.201543e+01
11                             Gasoline Price  1.002457e+00
12                                        GDP  4.772959e+11
13     Gross primary education enrollment (%)  1.024702e+02
14    Gross tertiary education enrollment (%)  3.796339e+01
15                           Infant mort