# Global Country Analysis I

Welcome to our comprehensive data analysis of global statistics by country. This report explores a wide range of socio-economic indicators, revealing insightful global trends and patterns. Through data visualization using charts and graphs, we aim to present an objective view of the global landscape.

Our analysis delves into population demographics, economic growth, health, and environmental factors. By identifying correlations and outliers, we seek to provide valuable insights for decision-makers, policymakers, and researchers.

The goal of this data-driven exploration is to support international cooperation on global challenges with concrete evidence.

The table below lists the available attributes with a brief description of the information each one conveys.

| Attribute                                | Description                     |
|-----------------------------|-----------------|
| **Density (P/Km2)** | Population density measured in persons per square kilometer |
| **Abbreviation**               | Abbreviation or code representing the country |
| **Agricultural Land (%)**  | Percentage of land area used for agricultural purposes |
| **Land Area (Km2)**        | Total land area of the country in square kilometers |
| **Armed Forces Size**     | Size of the armed forces in the country |
| **Birth Rate**                  | Number of births per 1,000 population per year |
| **Calling Code**              | International calling code for the country |
| **Capital/Major City**     | Name of the capital or major city |
| **CO2 Emissions**          | Carbon dioxide emissions in tons |
| **CPI**                             | Consumer Price Index, a measure of inflation and purchasing power |
| **CPI Change (%)**        | Percentage change in the Consumer Price Index compared to the previous year |
| **Currency_Code**          | Currency code used in the country |
| **Fertility Rate**             | Average number of children born to a woman during her lifetime |
| **Forested Area (%)**   | Percentage of land area covered by forests |
| **Gasoline_Price**        | Price of gasoline per liter in local currency |
| **GDP**                          | Gross Domestic Product, the total value of goods and services produced in the country |
| **Gross Primary Education Enrollment (%)** | Gross enrollment ratio for primary education |
| **Gross Tertiary Education Enrollment (%)** | Gross enrollment ratio for tertiary education |
| **Infant Mortality**           | Number of deaths per 1,000 live births before reaching one year of age |
| **Largest City**                | Name of the country's largest city |
| **Life Expectancy**            | Average number of years a newborn is expected to live |
| **Maternal Mortality Ratio** | Number of maternal deaths per 100,000 live births |
| **Minimum Wage**           | Minimum wage level in local currency |
| **Official Language**         | Official language(s) spoken in the country |
| **Out of Pocket Health Expenditure (%)** | Percentage of total health expenditure paid out-of-pocket by individuals |
| **Physicians per Thousand** | Number of physicians per thousand people |
| **Population**                   | Total population of the country |
| **Population: Labor Force Participation (%)** | Percentage of the population that is part of the labor force |
| **Tax Revenue (%)**          | Tax revenue as a percentage of GDP |
| **Total Tax Rate**             | Overall tax burden as a percentage of commercial profits |
| **Unemployment Rate**     | Percentage of the labor force that is unemployed |
| **Urban Population**       | Number of people living in urban areas |
| **Latitude**                      | Latitude coordinate of the country's location |
| **Longitude**                  | Longitude coordinate of the country's location |


Below, we import the libraries used in this analysis and define a few helper functions that are used later on.

In [12]:
# Importing Libraries

# Data Handling
import pandas as pd
import numpy as np


# Data Visualization
import plotly.express as px
import plotly.graph_objs as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
from IPython.display import display
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

# Statistics & Mathematics
import scipy.stats as stats
import math


# Machine Learning Pipeline & process
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin

# Preprocessing data
from sklearn.preprocessing import RobustScaler, StandardScaler

# Model Selection for Cross Validation
from sklearn.model_selection import StratifiedKFold, KFold, train_test_split

# Machine Learning metrics
from sklearn.metrics import mean_squared_error, r2_score

# ML algorithms
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor, StackingRegressor, AdaBoostRegressor


from sklearn.cluster import KMeans



# Encoder of categorical variables
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

from sklearn.preprocessing import MinMaxScaler

In [13]:
pip install catboost


Note: you may need to restart the kernel to use updated packages.


In [14]:
pip install xgboost

Note: you may need to restart the kernel to use updated packages.


In [15]:
pip install lightgbm

Note: you may need to restart the kernel to use updated packages.


In [16]:
pip install optuna

Note: you may need to restart the kernel to use updated packages.


In [17]:
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import optuna
# Hiding warnings 
import warnings
warnings.filterwarnings("ignore")

In [18]:
df = pd.read_csv("world-data-2023.csv")

### Defining Functions


In [19]:
def dataframe_description(df):
    """
    This function prints some basic info on the dataset.
    """
    categorical_features = []
    continuous_features = []
    binary_features = []
    
    for col in df.columns:
        if df[col].dtype == object:
            categorical_features.append(col)
        else:
            if df[col].nunique() <= 2:
                binary_features.append(col)
            else:
                continuous_features.append(col)
    
    print("\n{} shape: {}".format(type(df).__name__, df.shape))
    print("\n{:,.0f} samples".format(df.shape[0]))
    print("\n{:,.0f} attributes".format(df.shape[1]))
    print(f'\nMissing Data: \n')
    print(df.isnull().sum())
    print(f'\nDuplicates: {df.duplicated().sum()}')
    print(f'\nData types: \n')
    print(df.dtypes)
    print(f'\nCategorical features: \n')
    if len(categorical_features) == 0:
        print('No Categorical Features')
    else:
        for feature in categorical_features:
            print(feature)
    print(f'\nContinuous features: \n')
    if len(continuous_features) == 0:
        print('No Continuous Features')
    else:
        for feature in continuous_features:
            print(feature)
    print(f'\nBinary features: \n')
    if len(binary_features) == 0:
        print('No Binary Features')
    else:
        for feature in binary_features:
            print(feature)
    print(f'\n{type(df).__name__} Head: \n')
    display(df.head(5))
    print(f'\n{type(df).__name__} Tail: \n')
    display(df.tail(5))

In [20]:
def remove_symbols(cell_data):
    if isinstance(cell_data, str):
        # Check for "$" and "," in the cell_data
        if "$" in cell_data:
            cell_data = cell_data.replace("$", "")
        if "," in cell_data:
            cell_data = cell_data.replace(",", "")
        if "%" in cell_data:
            cell_data = cell_data.replace("%", "")
    return cell_data

In [21]:
## function to convert numerical cells from strings to floats

def num_to_float(cell):
    if isinstance(cell, (int, float)):
        return float(cell)
    try:
        return float(cell)
    except (ValueError, TypeError):
        return cell
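
As a quick illustration of how the two helpers above are meant to work together, here is a minimal sketch. The helpers are restated in condensed form so the snippet runs on its own, and the input strings are made-up examples written in the same style as the raw dataset, not values read from the CSV.

```python
def remove_symbols(cell_data):
    """Strip '$', ',' and '%' from string cells; leave other types untouched."""
    if isinstance(cell_data, str):
        for symbol in ("$", ",", "%"):
            cell_data = cell_data.replace(symbol, "")
    return cell_data


def num_to_float(cell):
    """Convert a cell to float when possible; otherwise return it unchanged."""
    try:
        return float(cell)
    except (ValueError, TypeError):
        return cell


# Illustrative inputs (assumed formats): a percentage, a currency amount,
# a number with thousands separators, plain text, and a missing value.
raw_cells = ["58.10%", "$1,234,567", "652,230", "Kabul", None]

# Clean first, then convert; text and missing values pass through unchanged.
cleaned = [num_to_float(remove_symbols(cell)) for cell in raw_cells]
print(cleaned)
```

Chaining the helpers in this order matters: `float("58.10%")` would fail, so the symbols have to be removed before the conversion is attempted.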

In [22]:
def top10_barplot(df, feat):    
    
    '''
    Sort the dataframe by `feat` in descending order and plot
    the 10 highest values as a bar plot (one bar per index label).
    '''
    df_sorted = df.sort_values(by=f'{feat}', ascending=False)
    top10 = df_sorted[feat].head(10)
    fig = px.bar(y=top10.values, 
                 x=top10.index, 
                 color = top10.index,
                 text=top10.values)

    fig.update_layout(title=f'<b>Top 10 highest {feat}<br> <sup> Barplot</sup></b>',
                      xaxis=dict(title=top10.index.name),
                      yaxis=dict(title=f'{feat}'),
                      legend=dict(title=top10.index.name),
                      showlegend=True,
                      height=600,
                      width=1000,
                      margin=dict(t=100, l=80),
                      template='plotly_white')
    fig.show()
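
The sort-then-`head(10)` step inside `top10_barplot` can also be written with `Series.nlargest`. The sketch below is only an alternative for that selection step, shown on a made-up Series with hypothetical labels; it is not a replacement for the plotting function.

```python
import pandas as pd

# Hypothetical values indexed by label; one entry is missing on purpose.
values = pd.Series({"A": 3.0, "B": None, "C": 10.0, "D": 7.0}, name="score")

# nlargest returns the n highest values in descending order and skips NaN.
top2 = values.nlargest(2)
print(top2)
```

Both approaches give the same ranking here; `nlargest` simply avoids sorting the entire column when only a few rows are needed.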

## Exploratory Data Analysis

In [23]:
dataframe_description(df)


DataFrame shape: (195, 35)

195 samples

35 attributes

Missing Data: 

Country                                       0
Density\n(P/Km2)                              0
Abbreviation                                  7
Agricultural Land( %)                         7
Land Area(Km2)                                1
Armed Forces size                            24
Birth Rate                                    6
Calling Code                                  1
Capital/Major City                            3
Co2-Emissions                                 7
CPI                                          17
CPI Change (%)                               16
Currency-Code                                15
Fertility Rate                                7
Forested Area (%)                             7
Gasoline Price                               20
GDP                                           2
Gross primary education enrollment (%)        7
Gross tertiary education enrollment (%)      12
Infant mortalit

Unnamed: 0,Country,Density\n(P/Km2),Abbreviation,Agricultural Land( %),Land Area(Km2),Armed Forces size,Birth Rate,Calling Code,Capital/Major City,Co2-Emissions,...,Out of pocket health expenditure,Physicians per thousand,Population,Population: Labor force participation (%),Tax revenue (%),Total tax rate,Unemployment rate,Urban_population,Latitude,Longitude
0,Afghanistan,60,AF,58.10%,652230,323000.0,32.49,93.0,Kabul,8672,...,78.40%,0.28,38041754,48.90%,9.30%,71.40%,11.12%,9797273,33.93911,67.709953
1,Albania,105,AL,43.10%,28748,9000.0,11.78,355.0,Tirana,4536,...,56.90%,1.2,2854191,55.70%,18.60%,36.60%,12.33%,1747593,41.153332,20.168331
2,Algeria,18,DZ,17.40%,2381741,317000.0,24.28,213.0,Algiers,150006,...,28.10%,1.72,43053054,41.20%,37.20%,66.10%,11.70%,31510100,28.033886,1.659626
3,Andorra,164,AD,40.00%,468,,7.2,376.0,Andorra la Vella,469,...,36.40%,3.33,77142,,,,,67873,42.506285,1.521801
4,Angola,26,AO,47.50%,1246700,117000.0,40.73,244.0,Luanda,34693,...,33.40%,0.21,31825295,77.50%,9.20%,49.10%,6.89%,21061025,-11.202692,17.873887



DataFrame Tail: 



Unnamed: 0,Country,Density\n(P/Km2),Abbreviation,Agricultural Land( %),Land Area(Km2),Armed Forces size,Birth Rate,Calling Code,Capital/Major City,Co2-Emissions,...,Out of pocket health expenditure,Physicians per thousand,Population,Population: Labor force participation (%),Tax revenue (%),Total tax rate,Unemployment rate,Urban_population,Latitude,Longitude
190,Venezuela,32,VE,24.50%,912050,343000,17.88,58.0,Caracas,164175,...,45.80%,1.92,28515829,59.70%,,73.30%,8.80%,25162368,6.42375,-66.58973
191,Vietnam,314,VN,39.30%,331210,522000,16.75,84.0,Hanoi,192668,...,43.50%,0.82,96462106,77.40%,19.10%,37.60%,2.01%,35332140,14.058324,108.277199
192,Yemen,56,YE,44.60%,527968,40000,30.45,967.0,Sanaa,10609,...,81.00%,0.31,29161922,38.00%,,26.60%,12.91%,10869523,15.552727,48.516388
193,Zambia,25,ZM,32.10%,752618,16000,36.19,260.0,Lusaka,5141,...,27.50%,1.19,17861030,74.60%,16.20%,15.60%,11.43%,7871713,-13.133897,27.849332
194,Zimbabwe,38,ZW,41.90%,390757,51000,30.68,263.0,Harare,10983,...,25.80%,0.21,14645468,83.10%,20.70%,31.60%,4.95%,4717305,-19.015438,29.154857


## Data Cleaning

#### To conduct a thorough exploratory analysis, we first clean the data by stripping the `$`, `,`, and `%` symbols from the numeric columns and converting them to floats:


In [24]:
# Changing data type from str to float
column_to_float=['Density\n(P/Km2)', 'Agricultural Land( %)','Land Area(Km2)',
                 'Birth Rate', 'Co2-Emissions', 'Forested Area (%)',
                 'CPI', 'CPI Change (%)', 'Fertility Rate', 'Gasoline Price','GDP',
                 'Gross primary education enrollment (%)', "Armed Forces size",
                 'Gross tertiary education enrollment (%)', 'Infant mortality',
                 'Life expectancy', 'Maternal mortality ratio','Minimum wage', 
                 'Out of pocket health expenditure','Physicians per thousand', 
                 'Population','Population: Labor force participation (%)', 
                 'Tax revenue (%)','Total tax rate', 'Unemployment rate', 'Urban_population']

for column in column_to_float:
    # convert to string
    df[column]=df[column].astype(str)
    # remove commas
    df[column]=df[column].str.replace(",","")
    # remove "$" (regex=False so "$" is treated literally, not as an end-of-string anchor)
    df[column]=df[column].str.replace("$","",regex=False)
    # Remove "%"
    df[column]=df[column].str.replace("%","").astype(float)
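
The same cleaning can be expressed as one reusable function that strips all three symbols with a single character class and lets `pd.to_numeric` handle the conversion. This is a sketch under stated assumptions: the function name `to_float_column` is hypothetical, and the miniature frame below uses made-up strings in the same format as the raw columns.

```python
import pandas as pd

# Hypothetical miniature frame imitating the raw string formats.
sample = pd.DataFrame({
    "Agricultural Land( %)": ["58.10%", "43.10%", None],
    "GDP": ["$1,234,567", "$89,000", "$5,000,000,000"],
})


def to_float_column(series):
    """Strip '$', ',' and '%' and convert to float; unparseable cells become NaN."""
    text = series.astype(str).str.replace(r"[$,%]", "", regex=True)
    return pd.to_numeric(text, errors="coerce").astype(float)


# Apply the conversion to every column of the sample frame.
cleaned = sample.apply(to_float_column)
print(cleaned)
```

With `errors="coerce"`, any cell that still cannot be parsed after stripping becomes `NaN` instead of raising, which is convenient for exploration but can hide data-entry problems, so it is worth checking the missing-value counts afterwards.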

In [25]:
df.dtypes

Country                                       object
Density\n(P/Km2)                             float64
Abbreviation                                  object
Agricultural Land( %)                        float64
Land Area(Km2)                               float64
Armed Forces size                            float64
Birth Rate                                   float64
Calling Code                                 float64
Capital/Major City                            object
Co2-Emissions                                float64
CPI                                          float64
CPI Change (%)                               float64
Currency-Code                                 object
Fertility Rate                               float64
Forested Area (%)                            float64
Gasoline Price                               float64
GDP                                          float64
Gross primary education enrollment (%)       float64
Gross tertiary education enrollment (%)      f

## Feature Engineering


In [26]:
df['GDP_per_capita'] = df['GDP'] / df["Population"]
df['Armed_forces_per_capita'] = df["Armed Forces size"]/df['Population']
df['Urban_population_ratio'] = df['Urban_population']/df['Population']
## C02 per capita
df['C02_per_capita'] = df['Co2-Emissions']/df['Population']
df['C02_per_urban_population'] = df['Co2-Emissions']/df['Urban_population']
df['Agricultural Land(Km2)']=df['Agricultural Land( %)']*df['Land Area(Km2)']/100
df['Forested Area(Km2)'] = df['Forested Area (%)']*df['Land Area(Km2)']/100


In [27]:
df

Unnamed: 0,Country,Density\n(P/Km2),Abbreviation,Agricultural Land( %),Land Area(Km2),Armed Forces size,Birth Rate,Calling Code,Capital/Major City,Co2-Emissions,...,Urban_population,Latitude,Longitude,GDP_per_capita,Armed_forces_per_capita,Urban_population_ratio,C02_per_capita,C02_per_urban_population,Agricultural Land(Km2),Forested Area(Km2)
0,Afghanistan,60.0,AF,58.1,652230.0,323000.0,32.49,93.0,Kabul,8672.0,...,9797273.0,33.939110,67.709953,502.115487,0.008491,0.257540,0.000228,0.000885,378945.630,13696.830
1,Albania,105.0,AL,43.1,28748.0,9000.0,11.78,355.0,Tirana,4536.0,...,1747593.0,41.153332,20.168331,5352.857411,0.003153,0.612290,0.001589,0.002596,12390.388,8078.188
2,Algeria,18.0,DZ,17.4,2381741.0,317000.0,24.28,213.0,Algiers,150006.0,...,31510100.0,28.033886,1.659626,3948.343279,0.007363,0.731890,0.003484,0.004761,414422.934,19053.928
3,Andorra,164.0,AD,40.0,468.0,,7.20,376.0,Andorra la Vella,469.0,...,67873.0,42.506285,1.521801,40886.391162,,0.879845,0.006080,0.006910,187.200,159.120
4,Angola,26.0,AO,47.5,1246700.0,117000.0,40.73,244.0,Luanda,34693.0,...,21061025.0,-11.202692,17.873887,2973.591160,0.003676,0.661770,0.001090,0.001647,592182.500,577222.100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,Venezuela,32.0,VE,24.5,912050.0,343000.0,17.88,58.0,Caracas,164175.0,...,25162368.0,6.423750,-66.589730,16915.493453,0.012028,0.882400,0.005757,0.006525,223452.250,480650.350
191,Vietnam,314.0,VN,39.3,331210.0,522000.0,16.75,84.0,Hanoi,192668.0,...,35332140.0,14.058324,108.277199,2715.276036,0.005411,0.366280,0.001997,0.005453,130165.530,159312.010
192,Yemen,56.0,YE,44.6,527968.0,40000.0,30.45,967.0,Sanaa,10609.0,...,10869523.0,15.552727,48.516388,922.929642,0.001372,0.372730,0.000364,0.000976,235473.728,5279.680
193,Zambia,25.0,ZM,32.1,752618.0,16000.0,36.19,260.0,Lusaka,5141.0,...,7871713.0,-13.133897,27.849332,1291.343357,0.000896,0.440720,0.000288,0.000653,241590.378,490706.936


## Correlation Analysis

In [28]:
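# numeric_only=True skips text columns such as Country, which pandas 2.x no longer drops silently
df_corr = df.corr(numeric_only=True)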

In [29]:
df_corr

Unnamed: 0,Density\n(P/Km2),Agricultural Land( %),Land Area(Km2),Armed Forces size,Birth Rate,Calling Code,Co2-Emissions,CPI,CPI Change (%),Fertility Rate,...,Urban_population,Latitude,Longitude,GDP_per_capita,Armed_forces_per_capita,Urban_population_ratio,C02_per_capita,C02_per_urban_population,Agricultural Land(Km2),Forested Area(Km2)
Density\n(P/Km2),1.0,-0.108478,-0.053279,0.000526,-0.147024,0.006554,-0.014262,-0.043356,-0.052459,-0.149302,...,-0.02242,0.06464,0.03194,0.509388,0.128609,0.163681,0.077699,0.044822,-0.078108,-0.064134
Agricultural Land( %),-0.108478,1.0,-0.031189,0.030547,0.199269,-0.067008,0.061191,-0.007684,-0.010878,0.180507,...,0.101679,-0.034725,-0.016716,-0.18461,-0.114,-0.131475,-0.259911,-0.263387,0.186975,-0.111622
Land Area(Km2),-0.053279,-0.031189,1.0,0.55173,-0.066269,-0.230872,0.590746,0.020893,0.03421,-0.064764,...,0.546836,0.049251,0.010435,0.042286,-0.049293,0.131915,0.204046,0.167334,0.806959,0.923257
Armed Forces size,0.000526,0.030547,0.55173,1.0,-0.144801,-0.18433,0.740868,0.040839,0.061519,-0.146258,...,0.851101,0.120165,0.208265,-0.031141,0.22669,-0.003246,0.036841,0.045749,0.631675,0.454573
Birth Rate,-0.147024,0.199269,-0.066269,-0.144801,1.0,0.072071,-0.155796,0.143298,0.10716,0.981069,...,-0.107609,-0.502965,0.039247,-0.517063,-0.220185,-0.549081,-0.499486,-0.499357,-0.040281,-0.09322
Calling Code,0.006554,-0.067008,-0.230872,-0.18433,0.072071,1.0,-0.155946,-0.069684,-0.082451,0.028379,...,-0.191676,0.012973,0.194236,-0.118386,0.216663,-0.064436,0.074199,0.029172,-0.202962,-0.202168
Co2-Emissions,-0.014262,0.061191,0.590746,0.740868,-0.155796,-0.155946,1.0,-0.017479,-0.004215,-0.141534,...,0.926263,0.118715,0.074345,0.094225,-0.045806,0.092414,0.165296,0.156803,0.778458,0.417436
CPI,-0.043356,-0.007684,0.020893,0.040839,0.143298,-0.069684,-0.017479,1.0,0.89471,0.141352,...,-0.009298,-0.075549,-0.022809,-0.092499,0.224205,-0.095935,-0.071926,-0.083757,0.029184,0.027841
CPI Change (%),-0.052459,-0.010878,0.03421,0.061519,0.10716,-0.082451,-0.004215,0.89471,1.0,0.097405,...,0.007353,-0.08957,-0.076524,-0.085904,0.199855,-0.012757,-0.063803,-0.084091,0.042509,0.028466
Fertility Rate,-0.149302,0.180507,-0.064764,-0.146258,0.981069,0.028379,-0.141534,0.141352,0.097405,1.0,...,-0.104058,-0.457999,0.034218,-0.479901,-0.222319,-0.515546,-0.457236,-0.46722,-0.037783,-0.089239


In [19]:
## Extract the strongest correlations from the correlation table
key_corr = []
for col in df_corr.columns:
    for row in df_corr.index:
        if (df_corr[row][col] > 0.5 or df_corr[row][col] < -0.5) and (df_corr[row][col] < 1.0):
            key_corr.append({"x": row, "y": col, "correlation": df_corr[row][col]})

In [20]:
key_corr

[{'x': 'GDP_per_capita',
  'y': 'Density\n(P/Km2)',
  'correlation': 0.5093882161768769},
 {'x': 'Armed Forces size',
  'y': 'Land Area(Km2)',
  'correlation': 0.5517303327979796},
 {'x': 'Co2-Emissions',
  'y': 'Land Area(Km2)',
  'correlation': 0.5907461890743969},
 {'x': 'GDP', 'y': 'Land Area(Km2)', 'correlation': 0.5505920485677658},
 {'x': 'Urban_population',
  'y': 'Land Area(Km2)',
  'correlation': 0.546836346860166},
 {'x': 'Agricultural Land(Km2)',
  'y': 'Land Area(Km2)',
  'correlation': 0.8069594004324024},
 {'x': 'Forested Area(Km2)',
  'y': 'Land Area(Km2)',
  'correlation': 0.9232570018571457},
 {'x': 'Land Area(Km2)',
  'y': 'Armed Forces size',
  'correlation': 0.5517303327979796},
 {'x': 'Co2-Emissions',
  'y': 'Armed Forces size',
  'correlation': 0.7408682021582728},
 {'x': 'GDP', 'y': 'Armed Forces size', 'correlation': 0.6054864405942135},
 {'x': 'Population',
  'y': 'Armed Forces size',
  'correlation': 0.8759115892124134},
 {'x': 'Urban_population',
  'y': 'Arm

In [21]:
key_corr_df = pd.DataFrame(key_corr)

In [22]:
# Inspect the pairs: each one appears twice, once as (x, y) and once as (y, x)
key_corr_df

Unnamed: 0,x,y,correlation
0,GDP_per_capita,Density\n(P/Km2),0.509388
1,Armed Forces size,Land Area(Km2),0.551730
2,Co2-Emissions,Land Area(Km2),0.590746
3,GDP,Land Area(Km2),0.550592
4,Urban_population,Land Area(Km2),0.546836
...,...,...,...
123,Population,Agricultural Land(Km2),0.646121
124,Urban_population,Agricultural Land(Km2),0.750044
125,Forested Area(Km2),Agricultural Land(Km2),0.598796
126,Land Area(Km2),Forested Area(Km2),0.923257


In [23]:
# Remove mirrored duplicates: (x, y) and (y, x) share the same correlation value
key_corr_df.drop_duplicates(subset=['correlation'], inplace=True)

In [24]:
key_corr_df

Unnamed: 0,x,y,correlation
0,GDP_per_capita,Density\n(P/Km2),0.509388
1,Armed Forces size,Land Area(Km2),0.551730
2,Co2-Emissions,Land Area(Km2),0.590746
3,GDP,Land Area(Km2),0.550592
4,Urban_population,Land Area(Km2),0.546836
...,...,...,...
93,Agricultural Land(Km2),Population,0.646121
99,Agricultural Land(Km2),Urban_population,0.750044
114,C02_per_capita,Urban_population_ratio,0.515211
117,C02_per_urban_population,C02_per_capita,0.921530


# Environmental Analysis

## CO2 Emission Analysis

### Top 10 CO2 Emissions

In [30]:
df = df.set_index('Country')
top10_barplot(df, 'Co2-Emissions')

### Top 10 CO2 Emissions per capita

In [31]:
top10_barplot(df, 'C02_per_capita')

### Top 10 CO2 Emissions per Urban population

In [32]:
top10_barplot(df, 'C02_per_urban_population')

### Analysis
- #### China and the United States have the highest overall CO2 emissions
- #### Trinidad and Tobago and Qatar have high CO2 emissions per capita, likely because of the large-scale energy production that fuels their economies

## Land Usability Analysis

### Agricultural Land Area (Km2)

In [33]:
df1 = df.reset_index()
scaler = MinMaxScaler(feature_range=(0, 1000))
column_data = np.array(df1["Agricultural Land(Km2)"]).reshape(-1, 1)
scaled_data = scaler.fit_transform(column_data)
df1["Agricultural Land(Km2)X"] = scaled_data
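# Countries with no area data get a bubble size of 0 (assignment avoids chained inplace fillna)
df1["Agricultural Land(Km2)X"] = df1["Agricultural Land(Km2)X"].fillna(0)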

# Create a bubble map
fig = go.Figure(data=go.Scattermapbox(
    lat=df1['Latitude'],
    lon=df1['Longitude'],
    mode='markers',
    marker=dict(
        size=df1["Agricultural Land(Km2)X"],
        colorscale='Viridis',
        sizemode='area',
        sizeref=0.1,
        showscale=False
    ),
    text=df1['Country'] + '<br>: ' + df1['Agricultural Land(Km2)'].astype(str),
))

# Set the map layout
fig.update_layout(
    mapbox=dict(
        style='carto-positron',
    ),
    title='World Agricultural Area Map'
)

# Show the plot
fig.show()
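
To make the scaling step above more concrete: `MinMaxScaler(feature_range=(0, 1000))` maps the smallest area to 0 and the largest to 1000, with everything else placed linearly in between. The sketch below reproduces that formula with NumPy only. The function name `min_max_scale` is hypothetical, and the three areas are the agricultural land values shown earlier for Andorra, Albania, and Angola, plus one missing entry.

```python
import numpy as np


def min_max_scale(values, low=0.0, high=1000.0):
    """Linearly rescale values to [low, high]; missing values become 0 afterwards."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = np.nanmin(values), np.nanmax(values)  # ignore NaN when fitting
    scaled = low + (values - v_min) / (v_max - v_min) * (high - low)
    return np.nan_to_num(scaled, nan=0.0)


# Agricultural land in Km2: Andorra, Albania, a missing value, Angola.
areas_km2 = [187.2, 12390.388, np.nan, 592182.5]
sizes = min_max_scale(areas_km2)
print(sizes)
```

Note that this sketch assumes at least two distinct values; if every area were identical, the denominator would be zero. The linear mapping also means that a handful of very large countries dominate the bubble sizes, which is why most markers on the map appear small.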

### Forested Area (Km2)

In [34]:
scaler = MinMaxScaler(feature_range=(0, 1000))
column_data = np.array(df1["Forested Area(Km2)"]).reshape(-1, 1)
scaled_data = scaler.fit_transform(column_data)
df1["Forested Area(Km2)X"] = scaled_data
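# Countries with no area data get a bubble size of 0 (assignment avoids chained inplace fillna)
df1["Forested Area(Km2)X"] = df1["Forested Area(Km2)X"].fillna(0)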

# Create a bubble map
fig = go.Figure(data=go.Scattermapbox(
    lat=df1['Latitude'],
    lon=df1['Longitude'],
    mode='markers',
    marker=dict(
        size=df1["Forested Area(Km2)X"],
        colorscale='Viridis',
        sizemode='area',
        sizeref=0.1,
        showscale=False
    ),
    text=df1['Country'] + '<br>: ' + df1['Forested Area(Km2)'].astype(str),
))

# Set the map layout
fig.update_layout(
    mapbox=dict(
        style='carto-positron',
    ),
    title='World Forested Land Area Map'
)

# Show the plot
fig.show()

### Analysis
- #### China and the United States have the largest agricultural land area (Km2)
- #### Russia, Brazil, and Canada have the largest forested area (Km2)

# Health Analysis
Now let's analyze the health metrics of the 20 countries with the highest GDP.

In [35]:
country_by_gdp = df1.sort_values(by='GDP', ascending=False).head(20)

# Create a 5-row grid of subplots that share the x-axis (one row per metric)
fig = make_subplots(
    rows=5,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.05,
    subplot_titles=("Infant mortality", "Maternal mortality ratio", "Fertility Rate", "Life expectancy", "GDP per capita"),
    row_heights=[0.2, 0.2, 0.2, 0.2, 0.2],
    print_grid=True,
)

# (legend label, dataframe column) for each subplot row, top to bottom
metrics = [
    ("Infant mortality", "Infant mortality"),
    ("Maternal mortality ratio", "Maternal mortality ratio"),
    ("Fertility Rate", "Fertility Rate"),
    ("Life expectancy", "Life expectancy"),
    ("GDP per capita", "GDP_per_capita"),
]

# Add one line trace per metric, each in its own row
for row, (label, column) in enumerate(metrics, start=1):
    fig.add_trace(
        go.Scatter(
            name=label,
            x=country_by_gdp["Country"],
            y=country_by_gdp[column],
            mode='lines',
            showlegend=True,
            marker=dict(
                opacity=0.5,
                showscale=False,
            ),
        ),
        row=row,
        col=1,
    )

# Update x-axis labels for the last subplot
fig.update_xaxes(showticklabels=True, row=5, col=1)

# Update layout properties for better readability
fig.update_layout(
    height=800,    
    title_text="Top 20 Countries by GDP",
    title_x=0.5,
    font=dict(size=12),
)

fig.show()

This is the format of your plot grid:
[ (1,1) x,y   ]
[ (2,1) x2,y2 ]
[ (3,1) x3,y3 ]
[ (4,1) x4,y4 ]
[ (5,1) x5,y5 ]



### Analysis
- #### Fertility Rate is negatively correlated with life expectancy and GDP per Capita

# Economic Analysis



### GDP of Top 10 Countries

In [38]:
top10_barplot(df, 'GDP')

### GDP per capita of Top 10 Countries

In [40]:
# Top10['GDP_per_capita'].plot(kind="bar")

top10_barplot(df, 'GDP_per_capita')

## Armed Forces Size

In [41]:
top10_barplot(df, 'Armed Forces size')

## Inflation



In [42]:
top10_barplot(df, 'CPI')

## Unemployment Rate

In [43]:
top10_barplot(df, 'Unemployment rate')

### Analysis
- #### United States and China have the highest GDP worldwide
- #### Monaco and Liechtenstein have the highest GDP per capita
- #### India and China have the largest Armed Forces
- #### South Sudan and Venezuela have the highest Inflation
- #### South Africa and Lesotho have the highest unemployment rate

# Demographic Analysis

### Categorical Variable Analysis

In [44]:
df = df.reset_index()

In [45]:
df

Unnamed: 0,Country,Density\n(P/Km2),Abbreviation,Agricultural Land( %),Land Area(Km2),Armed Forces size,Birth Rate,Calling Code,Capital/Major City,Co2-Emissions,...,Urban_population,Latitude,Longitude,GDP_per_capita,Armed_forces_per_capita,Urban_population_ratio,C02_per_capita,C02_per_urban_population,Agricultural Land(Km2),Forested Area(Km2)
0,Afghanistan,60.0,AF,58.1,652230.0,323000.0,32.49,93.0,Kabul,8672.0,...,9797273.0,33.939110,67.709953,502.115487,0.008491,0.257540,0.000228,0.000885,378945.630,13696.830
1,Albania,105.0,AL,43.1,28748.0,9000.0,11.78,355.0,Tirana,4536.0,...,1747593.0,41.153332,20.168331,5352.857411,0.003153,0.612290,0.001589,0.002596,12390.388,8078.188
2,Algeria,18.0,DZ,17.4,2381741.0,317000.0,24.28,213.0,Algiers,150006.0,...,31510100.0,28.033886,1.659626,3948.343279,0.007363,0.731890,0.003484,0.004761,414422.934,19053.928
3,Andorra,164.0,AD,40.0,468.0,,7.20,376.0,Andorra la Vella,469.0,...,67873.0,42.506285,1.521801,40886.391162,,0.879845,0.006080,0.006910,187.200,159.120
4,Angola,26.0,AO,47.5,1246700.0,117000.0,40.73,244.0,Luanda,34693.0,...,21061025.0,-11.202692,17.873887,2973.591160,0.003676,0.661770,0.001090,0.001647,592182.500,577222.100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,Venezuela,32.0,VE,24.5,912050.0,343000.0,17.88,58.0,Caracas,164175.0,...,25162368.0,6.423750,-66.589730,16915.493453,0.012028,0.882400,0.005757,0.006525,223452.250,480650.350
191,Vietnam,314.0,VN,39.3,331210.0,522000.0,16.75,84.0,Hanoi,192668.0,...,35332140.0,14.058324,108.277199,2715.276036,0.005411,0.366280,0.001997,0.005453,130165.530,159312.010
192,Yemen,56.0,YE,44.6,527968.0,40000.0,30.45,967.0,Sanaa,10609.0,...,10869523.0,15.552727,48.516388,922.929642,0.001372,0.372730,0.000364,0.000976,235473.728,5279.680
193,Zambia,25.0,ZM,32.1,752618.0,16000.0,36.19,260.0,Lusaka,5141.0,...,7871713.0,-13.133897,27.849332,1291.343357,0.000896,0.440720,0.000288,0.000653,241590.378,490706.936


In [46]:
df_lang = df[['Country', 'Official language']]

# df_lang.groupby(['Country', 'Official language'], as_index=False).count()

In [47]:
df_lang 

Unnamed: 0,Country,Official language
0,Afghanistan,Pashto
1,Albania,Albanian
2,Algeria,Arabic
3,Andorra,Catalan
4,Angola,Portuguese
...,...,...
190,Venezuela,Spanish
191,Vietnam,Vietnamese
192,Yemen,Arabic
193,Zambia,English


In [48]:
df_lang_grp = df_lang.groupby(['Official language']).count()

In [49]:
df_lang_grp

Unnamed: 0_level_0,Country
Official language,Unnamed: 1_level_1
Afrikaans,1
Albanian,1
Amharic,1
Arabic,18
Armenian,1
...,...
Tuvaluan Language,1
Ukrainian,1
Urdu,1
Uzbek,1


In [50]:
df_lang_grp_sorted = df_lang_grp.sort_values(by='Country', ascending=False)

In [51]:
df_lang_grp_sorted

Unnamed: 0_level_0,Country
Official language,Unnamed: 1_level_1
English,31
French,25
Spanish,19
Arabic,18
Portuguese,7
...,...
Latvian,1
Lithuanian,1
Albanian,1
Macedonian,1


### Number of Countries that Speak the Top Languages

In [52]:
# top10_barplot(df_lang_grp_sorted, 'Country')
df_lang_grp_sorted = df_lang_grp_sorted.reset_index()

df_lang_grp_sorted = df_lang_grp_sorted.head(10)

In [53]:

# Bar height is the number of countries; bars are labeled and colored by language
fig = px.bar(y=df_lang_grp_sorted['Country'].values, 
             x=df_lang_grp_sorted['Official language'].values, 
             color = df_lang_grp_sorted['Official language'].values,
             text=df_lang_grp_sorted['Country'].values)

fig.update_layout(title='<b>Number of countries that speak the top 10 languages</b>',
                  xaxis=dict(title='Official language'),
                  yaxis=dict(title='Number of countries'),
                  legend=dict(title='Official language'),
                  showlegend=True,
                  height=600,
                  width=1000,
                  margin=dict(t=100, l=80),
                  template='plotly_white')
fig.show()

In [54]:
df_lang = df[['Official language', 'Population']]

In [55]:
df_lang_grp = df_lang.groupby(['Official language']).sum()

In [56]:
df_lang_grp = df_lang_grp.sort_values(by='Population', ascending=False)

In [57]:
df_lang_grp

Unnamed: 0_level_0,Population
Official language,Unnamed: 1_level_1
Standard Chinese,1.397715e+09
Hindi,1.366418e+09
,6.062467e+08
English,4.870488e+08
French,4.361648e+08
...,...
Samoan,2.025060e+05
Tongan Language,1.002090e+05
Catalan,7.714200e+04
Marshallese,5.879100e+04


### Number of People Who Speak the Top Languages

In [58]:
top10_barplot(df_lang_grp, 'Population')

### Analysis
- #### English, French, and Spanish are the official languages of the most countries
- #### Standard Chinese and Hindi are the official languages of the most populous countries, so they cover the most people in this dataset
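
The two groupby passes used in this section (a count of countries and a sum of population per official language) can be combined into a single call with named aggregation. The sketch below is a minimal illustration on five rows taken from the tables shown earlier; the output column names `countries` and `population` are arbitrary choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# Five rows copied from the outputs above (country, official language, population).
rows = pd.DataFrame({
    "Country": ["Algeria", "Yemen", "Zambia", "Venezuela", "Andorra"],
    "Official language": ["Arabic", "Arabic", "English", "Spanish", "Catalan"],
    "Population": [43053054, 29161922, 17861030, 28515829, 77142],
})

# One pass: count countries and sum population for each official language.
summary = (
    rows.groupby("Official language")
        .agg(countries=("Country", "count"), population=("Population", "sum"))
        .sort_values("population", ascending=False)
)
print(summary)
```

Keep in mind that summing national populations by official language measures the population of the countries that list that language, not the number of people who actually speak it, and that rows with an empty language label form their own group.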