Welcome to this project. We will here explore a dataset accessed from Kaggle.com, containing global country information from 2023 (https://www.kaggle.com/datasets/nelgiriyewithana/countries-of-the-world-2023/data)

In [111]:
import pandas as pd
import numpy as np
import matplotlib as plt

df = pd.read_csv('world-data-2023.csv')

After having installed the for the data exploration most essential modules and reading the CSV file, we proceed by getting a brief overview of the data, making sure that everything was imported properly:

In [112]:
df.shape

(195, 35)

In [113]:
df

Unnamed: 0,Country,Density\n(P/Km2),Abbreviation,Agricultural Land( %),Land Area(Km2),Armed Forces size,Birth Rate,Calling Code,Capital/Major City,Co2-Emissions,...,Out of pocket health expenditure,Physicians per thousand,Population,Population: Labor force participation (%),Tax revenue (%),Total tax rate,Unemployment rate,Urban_population,Latitude,Longitude
0,Afghanistan,60,AF,58.10%,652230,323000,32.49,93.0,Kabul,8672,...,78.40%,0.28,38041754,48.90%,9.30%,71.40%,11.12%,9797273,33.939110,67.709953
1,Albania,105,AL,43.10%,28748,9000,11.78,355.0,Tirana,4536,...,56.90%,1.20,2854191,55.70%,18.60%,36.60%,12.33%,1747593,41.153332,20.168331
2,Algeria,18,DZ,17.40%,2381741,317000,24.28,213.0,Algiers,150006,...,28.10%,1.72,43053054,41.20%,37.20%,66.10%,11.70%,31510100,28.033886,1.659626
3,Andorra,164,AD,40.00%,468,,7.20,376.0,Andorra la Vella,469,...,36.40%,3.33,77142,,,,,67873,42.506285,1.521801
4,Angola,26,AO,47.50%,1246700,117000,40.73,244.0,Luanda,34693,...,33.40%,0.21,31825295,77.50%,9.20%,49.10%,6.89%,21061025,-11.202692,17.873887
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,Venezuela,32,VE,24.50%,912050,343000,17.88,58.0,Caracas,164175,...,45.80%,1.92,28515829,59.70%,,73.30%,8.80%,25162368,6.423750,-66.589730
191,Vietnam,314,VN,39.30%,331210,522000,16.75,84.0,Hanoi,192668,...,43.50%,0.82,96462106,77.40%,19.10%,37.60%,2.01%,35332140,14.058324,108.277199
192,Yemen,56,YE,44.60%,527968,40000,30.45,967.0,Sanaa,10609,...,81.00%,0.31,29161922,38.00%,,26.60%,12.91%,10869523,15.552727,48.516388
193,Zambia,25,ZM,32.10%,752618,16000,36.19,260.0,Lusaka,5141,...,27.50%,1.19,17861030,74.60%,16.20%,15.60%,11.43%,7871713,-13.133897,27.849332


A brief overview by accessing the .shape attribute and printing the dataset reveals that it consists of 195 rows and 35 columns. In each row, a different country, from Afghanistan to Zimbabwe, has been recorded together with their values for 33 factors of high social and economical importance, from population density to unemployment rate. The two final columns also include the latitudes and longitudes for the countries, allowing an eventual geospatial analysis. 

Lets have a closer look at the columns:

In [114]:
df.columns

Index(['Country', 'Density\n(P/Km2)', 'Abbreviation', 'Agricultural Land( %)',
       'Land Area(Km2)', 'Armed Forces size', 'Birth Rate', 'Calling Code',
       'Capital/Major City', 'Co2-Emissions', 'CPI', 'CPI Change (%)',
       'Currency-Code', 'Fertility Rate', 'Forested Area (%)',
       'Gasoline Price', 'GDP', 'Gross primary education enrollment (%)',
       'Gross tertiary education enrollment (%)', 'Infant mortality',
       'Largest city', 'Life expectancy', 'Maternal mortality ratio',
       'Minimum wage', 'Official language', 'Out of pocket health expenditure',
       'Physicians per thousand', 'Population',
       'Population: Labor force participation (%)', 'Tax revenue (%)',
       'Total tax rate', 'Unemployment rate', 'Urban_population', 'Latitude',
       'Longitude'],
      dtype='object')

Looking at the columns, most of them are pretty self explanatory so lets move forward and see what type of data they contain

In [115]:
df.dtypes

Country                                       object
Density\n(P/Km2)                              object
Abbreviation                                  object
Agricultural Land( %)                         object
Land Area(Km2)                                object
Armed Forces size                             object
Birth Rate                                   float64
Calling Code                                 float64
Capital/Major City                            object
Co2-Emissions                                 object
CPI                                           object
CPI Change (%)                                object
Currency-Code                                 object
Fertility Rate                               float64
Forested Area (%)                             object
Gasoline Price                                object
GDP                                           object
Gross primary education enrollment (%)        object
Gross tertiary education enrollment (%)       

As wee see there are lot of datatypes marked as objects, where we would have expected the data type to be numeric. For most of the columns this is most likely due to there being NaN-objects for missing data. 
So lets check for missing values. 

In [116]:
df.isnull().sum()

Country                                       0
Density\n(P/Km2)                              0
Abbreviation                                  7
Agricultural Land( %)                         7
Land Area(Km2)                                1
Armed Forces size                            24
Birth Rate                                    6
Calling Code                                  1
Capital/Major City                            3
Co2-Emissions                                 7
CPI                                          17
CPI Change (%)                               16
Currency-Code                                15
Fertility Rate                                7
Forested Area (%)                             7
Gasoline Price                               20
GDP                                           2
Gross primary education enrollment (%)        7
Gross tertiary education enrollment (%)      12
Infant mortality                              6
Largest city                            

As we can see, quite many variables are missing values, the most prominent being "minimum wage" with 45 missing values. Now as we do not want these missing data points to skew our data, we can make the pragmatical decision of dropping the variables with more than 10 missing values.

In [117]:
for x in df.columns:
    if df.isnull()[x].sum() > 10:
       df = df.drop(x,axis='columns')

In [118]:
df.isnull().sum()

Country                                   0
Density\n(P/Km2)                          0
Abbreviation                              7
Agricultural Land( %)                     7
Land Area(Km2)                            1
Birth Rate                                6
Calling Code                              1
Capital/Major City                        3
Co2-Emissions                             7
Fertility Rate                            7
Forested Area (%)                         7
GDP                                       2
Gross primary education enrollment (%)    7
Infant mortality                          6
Largest city                              6
Life expectancy                           8
Official language                         1
Out of pocket health expenditure          7
Physicians per thousand                   7
Population                                1
Urban_population                          5
Latitude                                  1
Longitude                       

Now it looks a bit better. The remaining missing values are in the range 5-7, so one could suspect that there are a few countries, missing values for the same categories. If that is the case, it would be deseriable to remove these countries from the data set.

In [119]:
for x in range(0,len(df.index.values)):
    if df.iloc[x,:].isnull().sum() > 0:
        print(df.iloc[x,0])

Andorra
Bosnia and Herzegovina
Brunei
Republic of the Congo
Cuba
Eswatini
Vatican City
Republic of Ireland
Libya
Liechtenstein
Monaco
Namibia
Nauru
North Korea
North Macedonia
Palestinian National Authority
San Marino
S�����������
Singapore
Somalia
South Sudan
Tuvalu


We find that most of the missing values can be attributed to a few countries. Lots of them are small micro states, or states with weak local authorities such as the republic of the congo or somalia. 

Lets go ahead and drop all countries with more than 1 missing value

In [124]:
c_to_be_dropped = []
for x in df.index.values:
    if df.iloc[x,:].isnull().sum() > 1:
        c_to_be_dropped.append(x)

df = df.drop(c_to_be_dropped, axis='index')

In [125]:
for x in range(0,len(df.index.values)):
    if df.iloc[x,:].isnull().sum() > 0:
        print(df.iloc[x,0],df.iloc[x,:].isnull().sum() )

Andorra 1
Bosnia and Herzegovina 1
Brunei 1
Republic of the Congo 1
Cuba 1
Republic of Ireland 1
Namibia 1
North Korea 1
San Marino 1
Somalia 1


In [126]:
df.isnull().sum()

Country                                   0
Density\n(P/Km2)                          0
Abbreviation                              3
Agricultural Land( %)                     0
Land Area(Km2)                            0
Birth Rate                                0
Calling Code                              0
Capital/Major City                        0
Co2-Emissions                             1
Fertility Rate                            0
Forested Area (%)                         0
GDP                                       0
Gross primary education enrollment (%)    1
Infant mortality                          0
Largest city                              1
Life expectancy                           1
Official language                         0
Out of pocket health expenditure          3
Physicians per thousand                   0
Population                                0
Urban_population                          0
Latitude                                  0
Longitude                       

Finally, we see that there are two columns with more than one missing value, we will thus go ahead and drop these two columns, as they are relatively uninteresting, and the remaining countries with missing values will be dropped as well. 

In [None]:
df = df.drop()


In [128]:
df

Unnamed: 0,Country,Density\n(P/Km2),Abbreviation,Agricultural Land( %),Land Area(Km2),Birth Rate,Calling Code,Capital/Major City,Co2-Emissions,Fertility Rate,...,Infant mortality,Largest city,Life expectancy,Official language,Out of pocket health expenditure,Physicians per thousand,Population,Urban_population,Latitude,Longitude
0,Afghanistan,60,AF,58.10%,652230,32.49,93.0,Kabul,8672,4.47,...,47.9,Kabul,64.5,Pashto,78.40%,0.28,38041754,9797273,33.939110,67.709953
1,Albania,105,AL,43.10%,28748,11.78,355.0,Tirana,4536,1.62,...,7.8,Tirana,78.5,Albanian,56.90%,1.20,2854191,1747593,41.153332,20.168331
2,Algeria,18,DZ,17.40%,2381741,24.28,213.0,Algiers,150006,3.02,...,20.1,Algiers,76.7,Arabic,28.10%,1.72,43053054,31510100,28.033886,1.659626
3,Andorra,164,AD,40.00%,468,7.20,376.0,Andorra la Vella,469,1.27,...,2.7,Andorra la Vella,,Catalan,36.40%,3.33,77142,67873,42.506285,1.521801
4,Angola,26,AO,47.50%,1246700,40.73,244.0,Luanda,34693,5.52,...,51.6,Luanda,60.8,Portuguese,33.40%,0.21,31825295,21061025,-11.202692,17.873887
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,Venezuela,32,VE,24.50%,912050,17.88,58.0,Caracas,164175,2.27,...,21.4,Caracas,72.1,Spanish,45.80%,1.92,28515829,25162368,6.423750,-66.589730
191,Vietnam,314,VN,39.30%,331210,16.75,84.0,Hanoi,192668,2.05,...,16.5,Ho Chi Minh City,75.3,Vietnamese,43.50%,0.82,96462106,35332140,14.058324,108.277199
192,Yemen,56,YE,44.60%,527968,30.45,967.0,Sanaa,10609,3.79,...,42.9,Sanaa,66.1,Arabic,81.00%,0.31,29161922,10869523,15.552727,48.516388
193,Zambia,25,ZM,32.10%,752618,36.19,260.0,Lusaka,5141,4.63,...,40.4,Lusaka,63.5,English,27.50%,1.19,17861030,7871713,-13.133897,27.849332
