<div style="background-color: #333; padding: 40px; border: 2px solid #ffd700; border-radius: 10px; color: #ffd700; text-align: center; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);">

<h1 style="font-size: 48px; font-weight: bold; color: #ffd700;">Global country information</h1>

<img src="https://www.brookings.edu/wp-content/uploads/2021/09/09152021_shutterstock_579322279.jpg" alt="Global country information banner" style="width: 500px; margin: 20px auto; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);">
    
</div>

<div style="border-radius: 10px; border: 2px solid #ffd700; padding: 15px; background-color: #333; font-size: 180%; text-align: center; color: #ffd700; font-weight: bold;"> Table of Contents 
</div>

<ul class="list-group" id="list-tab" role="tablist">
    <li><a href="#1.-Import-Libraries">1. Import Libraries</a></li><br>
    <li><a href="#2.-Load-data">2. Load data</a></li><br>
    <li><a href="#3.-Exploratory-Data-Analysis">3. Exploratory Data Analysis</a></li><br>
    <li><a href="#3.1-Data-quality">3.1 Data quality</a></li><br>
    <li><a href="#3.2-Univariative-Analysis">3.2 Univariative Analysis</a></li><br>
    <li><a href="#3.3-Bivariative-Analysis">3.3 Bivariative Analysis</a></li><br>
</ul>

## <div style="border-radius: 10px; border: 2px solid #ffd700; padding: 15px; background-color: #333; font-size: 120%; text-align: center; color: #ffd700; font-weight: bold;">1. Import Libraries</div>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import plotly.express as px
import plotly.io as pio

## <div style="border-radius: 10px; border: 2px solid #ffd700; padding: 15px; background-color: #333; font-size: 120%; text-align: center; color: #ffd700; font-weight: bold;">2. Load data</div>

In [None]:
df = pd.read_csv('/kaggle/input/countries-of-the-world-2023/world-data-2023.csv')
df

**III | Describe the Data**

In [None]:
df.info()

In [None]:
df.describe().style.format("{:.2f}")

## <div style="border-radius: 10px; border: 2px solid #ffd700; padding: 15px; background-color: #333; font-size: 120%; text-align: center; color: #ffd700; font-weight: bold;">3. Exploratory Data Analysis</div>

## <div style="border-radius: 10px; border: 2px solid #333; padding: 15px; background-color: #ffd700; font-size: 120%; text-align: left; color: #333; font-weight: bold;">3.1 Data quality</div>

### I | Check duplicates

In [None]:
duplicates = df.duplicated().sum()
print(duplicates)

The dataset contains 0 duplicated rows (`df.duplicated()` compares entire rows, not columns).
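
A row-level check only catches rows that are identical in every column. As a minimal sketch on hypothetical toy data (not the real dataset), the `subset` argument of `duplicated` can also flag repeated values in a single key column such as `Country`:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the two checks
toy = pd.DataFrame({
    "Country": ["A", "B", "B", "C", "C"],
    "Population": [10, 20, 20, 30, 35],
})

# Rows that are exact copies of an earlier row
full_row_duplicates = int(toy.duplicated().sum())

# Rows whose 'Country' value already appeared earlier, whatever the other columns hold
country_duplicates = int(toy.duplicated(subset=["Country"]).sum())

print(full_row_duplicates, country_duplicates)
```

On the real data, the second check would show whether any country is listed more than once.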

### II | Check null and missing values

In [None]:
missing_values = df.isnull().sum()
total_missing_values = (missing_values).sum()
total_cells = np.prod(df.shape)  # np.product is deprecated and removed in NumPy 2.0
percent_missing_values = (total_missing_values / total_cells)*100
print("Percent of data that is missing", percent_missing_values)
print(missing_values)
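
The raw counts are easier to compare as percentages per column. The following is a sketch on a hypothetical toy frame; the column names are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing cells
toy = pd.DataFrame({
    "GDP": [1.0, np.nan, 3.0, np.nan],
    "Capital": ["X", "Y", None, "Z"],
    "Country": ["A", "B", "C", "D"],
})

# isnull().mean() gives the fraction of missing cells in each column
percent_missing_by_column = (toy.isnull().mean() * 100).sort_values(ascending=False)

# Overall share of missing cells: total missing / total cells
overall_percent_missing = toy.isnull().to_numpy().mean() * 100

print(percent_missing_by_column)
print(f"Overall: {overall_percent_missing:.2f}%")
```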

In [None]:
columns_with_null = df.columns[df.isnull().any()]

# Numerical columns
numerical_columns = df.select_dtypes(include=['float64'])
numerical_columns = numerical_columns.columns[numerical_columns.isnull().any()]
df[numerical_columns] = df[numerical_columns].fillna(df[numerical_columns].mean())

# Categorical columns: replace NaN with the mode (the most frequently occurring value)
# Note: numeric columns that are still stored as strings are treated as categorical here
categorical_columns = df.select_dtypes(include=['object'])
categorical_columns = categorical_columns.columns[categorical_columns.isnull().any()]
df[categorical_columns] = df[categorical_columns].fillna(df[categorical_columns].mode().iloc[0])

# Check that no NaN values remain
missing_values = df.isnull().sum()
print(missing_values)
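
To make the effect of these fills concrete, here is a small sketch on hypothetical toy data. It also shows the median as a possible alternative to the mean for heavily skewed columns; that alternative is an assumption of this sketch, not something the notebook uses:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one skewed numeric column and one categorical column
toy = pd.DataFrame({
    "Population": [1.0, 2.0, 3.0, 1000.0, np.nan],
    "Currency": ["EUR", "EUR", "USD", None, "EUR"],
})

# Mean fill, as in the cell above: a single large value pulls the fill value up
mean_filled = toy["Population"].fillna(toy["Population"].mean())

# Median fill: less sensitive to the outlier
median_filled = toy["Population"].fillna(toy["Population"].median())

# Mode fill for the categorical column: mode() returns a Series, so take its first entry
mode_filled = toy["Currency"].fillna(toy["Currency"].mode().iloc[0])

print(mean_filled.iloc[-1], median_filled.iloc[-1], mode_filled.iloc[3])
```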

### III | Check unique values in each column

In [None]:
for column in df.columns:
    num_distinct_values = len(df[column].unique())
    print(f"{column}: {num_distinct_values} distinct values")

### IV | Correlation Analysis

**Convert object columns to numeric**

In [None]:
column_to_float=['Density\n(P/Km2)', 'Agricultural Land( %)','Land Area(Km2)',
                 'Birth Rate', 'Co2-Emissions', 'Forested Area (%)',
                 'CPI', 'CPI Change (%)', 'Fertility Rate', 'Gasoline Price','GDP',
                 'Gross primary education enrollment (%)', "Armed Forces size",
                 'Gross tertiary education enrollment (%)', 'Infant mortality',
                 'Life expectancy', 'Maternal mortality ratio','Minimum wage', 
                 'Out of pocket health expenditure','Physicians per thousand', 
                 'Population','Population: Labor force participation (%)', 
                 'Tax revenue (%)','Total tax rate', 'Unemployment rate', 'Urban_population']

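for column in column_to_float:
    # Strip thousands separators and currency/percent symbols, then cast to float.
    # regex=False treats "$" as a literal character rather than a regex end-of-string anchor.
    df[column] = (
        df[column].astype(str)
        .str.replace(",", "", regex=False)
        .str.replace("$", "", regex=False)
        .str.replace("%", "", regex=False)
        .astype(float)
    )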
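
The same cleaning can be wrapped in a small helper. This is a sketch only: `to_float` is a hypothetical name, and using `pd.to_numeric(..., errors="coerce")` is an assumption that unparseable entries should become NaN instead of raising an error:

```python
import pandas as pd

def to_float(series: pd.Series) -> pd.Series:
    """Strip ',', '$' and '%' characters, then convert the values to float."""
    # Inside a character class, '$' is a literal dollar sign
    cleaned = series.astype(str).str.replace(r"[,$%]", "", regex=True).str.strip()
    # errors="coerce" turns anything that is still not a number into NaN
    return pd.to_numeric(cleaned, errors="coerce")

# Hypothetical raw values in the style of the dataset's string columns
raw = pd.Series(["$1,234.50", "45.2%", "1,000", None, "n/a"])
converted = to_float(raw)
print(converted.tolist())
```

Coercing to NaN keeps the conversion from failing on unexpected strings, at the cost of silently hiding them, so counting the resulting NaN values afterwards is a sensible follow-up.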

In [None]:
df.info()

Because the dataset has many features, a full correlation matrix would be unreadable. The analysis therefore focuses on a small set of features that I find interesting. They span economic, social, environmental, and demographic characteristics of countries, which makes them good candidates for analysis and interpretation.

In [None]:
interesting_features = [
    'GDP', 'Life expectancy','Population','Urban_population','Infant mortality',
    'Unemployment rate','Tax revenue (%)','Co2-Emissions','Agricultural Land( %)','Fertility Rate'
    ]

numeric_columns = df[interesting_features]
correlation_matrix = numeric_columns.corr()
correlation_matrix

In [None]:
fig, ax = plt.subplots() 
fig.set_size_inches(15,10)
sns.heatmap(correlation_matrix, vmax =.8, square = True, annot = True,cmap='YlGn' )
plt.title('Correlation Matrix - interesting_features',fontsize=15);

Most strongly correlated features:
 - Co2-Emissions / GDP: 0.92 (strong positive correlation)
 - Co2-Emissions / Urban_population: 0.93 (strong positive correlation)
 - Urban_population / Population: 0.95 (strong positive correlation)
 - Infant mortality / Life expectancy: -0.93 (strong negative correlation)
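
The strongest pairs can also be extracted from a correlation matrix programmatically instead of being read off the heatmap. A minimal sketch on hypothetical toy data (the column names `a` to `d` are placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: b rises with a, c falls with a, d is unrelated to a
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.9],
    "c": [5.0, 3.9, 3.1, 2.0, 1.1],
    "d": [1.0, -1.0, 1.0, -1.0, 1.0],
})
corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.head(3))
```

Applied to `correlation_matrix`, the same steps would list the feature pairs in order of correlation strength.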

## <div style="border-radius: 10px; border: 2px solid #333; padding: 15px; background-color: #ffd700; font-size: 120%; text-align: left; color: #333; font-weight: bold;">3.2 Univariative Analysis</div>

In [None]:
plt.figure(figsize=(15, 20))
# Iterating over the DataFrame yields its column names; 10 features fit in a 4 x 3 grid
for i, feature in enumerate(numeric_columns, start=1):
    plt.subplot(4, 3, i)
    sns.histplot(df[feature], kde=True)
    plt.title(feature)
plt.tight_layout()
plt.show()

## <div style="border-radius: 10px; border: 2px solid #333; padding: 15px; background-color: #ffd700; font-size: 120%; text-align: left; color: #333; font-weight: bold;">3.3 Bivariative Analysis</div>

# Population Analysis

In [None]:
# Sort the DataFrame by population in descending order
top_10_population = df.sort_values(by='Population', ascending=False).head(10)
# Sort the DataFrame by fertility rate in descending order
top_10_fertility_rate = df.sort_values(by='Fertility Rate', ascending=False).head(10)


plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
sns.barplot(data = top_10_population, x='Country', y='Population', edgecolor='black')
plt.ylabel('Population')
plt.xlabel('Country')
plt.title('Top 10 Countries with Highest Population')
plt.xticks(rotation=45, ha='right')

plt.subplot(1, 2, 2)
sns.barplot(data = top_10_fertility_rate, y='Fertility Rate', x='Country', edgecolor='black')
plt.xlabel('Country')
plt.ylabel('Fertility Rate')
plt.title('Top 10 Countries with highest fertility rate')
plt.xticks(rotation=45, ha='right')

plt.tight_layout()
plt.show()

In [None]:
# Apply logarithmic transformation to the data
log_urban_population = np.log10(df['Urban_population'])
log_total_population = np.log10(df['Population'])

# Create scatter plot with logarithmic scales
plt.figure(figsize=(8, 6))
plt.scatter(log_urban_population, log_total_population, alpha=0.5)
plt.xlabel('Urban Population (log scale)')
plt.ylabel('Total Population (log scale)')
plt.title('Relation between Urban Population and Total Population (log scale)')
plt.grid(True)
plt.show()

The points cluster tightly around an upward-sloping line, which indicates a strong positive and approximately linear relationship between urban population and total population on the log-log scale.
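
One way to quantify such a relationship is to fit a straight line in log-log space and compute the Pearson correlation. The sketch below uses synthetic data, assuming urban population is roughly a fixed share of total population with multiplicative noise; it does not reproduce values from the dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: urban population is about 55% of the total, with noise
total_population = 10 ** rng.uniform(4, 9, size=200)
urban_population = 0.55 * total_population * 10 ** rng.normal(0, 0.1, size=200)

log_total = np.log10(total_population)
log_urban = np.log10(urban_population)

# Least-squares line in log-log space: log_total ~ slope * log_urban + intercept
slope, intercept = np.polyfit(log_urban, log_total, deg=1)

# Pearson correlation between the two log-transformed variables
r = np.corrcoef(log_urban, log_total)[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.3f}")
```

A slope close to 1 in log-log space means the two quantities are roughly proportional to each other.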

# Life expectancy

In [None]:
# Sort the DataFrame by infant mortality in descending order
top_10_infant_mortality = df.sort_values(by='Infant mortality', ascending=False).head(10)
# Sort the DataFrame by life expectancy in descending order and keep the last 10 rows (the lowest values)
bottom_10_Life_expectancy = df.sort_values(by='Life expectancy', ascending=False).tail(10)


plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
sns.barplot(data = top_10_infant_mortality, y='Infant mortality', x='Country', edgecolor='black')
plt.ylabel('Infant mortality')
plt.xlabel('Country')
plt.title('Top 10 Countries with Highest Infant mortality')
plt.xticks(rotation=45, ha='right')

plt.subplot(1, 2, 2)
sns.barplot(data = bottom_10_Life_expectancy, y='Life expectancy', x='Country', edgecolor='black')
plt.ylabel('Life expectancy')
plt.xlabel('Country')
plt.title('Top 10 Countries with lowest Life expectancy')
plt.xticks(rotation=45, ha='right')

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(data = df, y='Life expectancy',x='Infant mortality', alpha=0.5)
plt.ylabel('Life expectancy')
plt.xlabel('Infant mortality')
plt.title('Relation between Life expectancy and Infant mortality')
plt.grid(True)
plt.show()

# GDP Analysis

In [None]:
# Define non-overlapping latitude bands so that each country falls into exactly one category
# (with overlapping ranges, the first match would absorb most of the Equatorial band)
latitude_ranges = {
    "Northern Hemisphere": (23.5, 90),    # north of the Tropic of Cancer
    "Equatorial": (-23.5, 23.5),          # between the Tropic of Capricorn and the Tropic of Cancer
    "Southern Hemisphere": (-90, -23.5),  # south of the Tropic of Capricorn
}

# Categorize a country by its latitude; the first matching band is returned
def categorize_latitude(latitude):
    for region, (min_lat, max_lat) in latitude_ranges.items():
        if min_lat <= latitude <= max_lat:
            return region
    return "Unknown"


# Assign each country to a latitude band
df['Latitude Category'] = df['Latitude'].apply(categorize_latitude)


plt.figure(figsize=(10, 6))
plt.boxplot([df[df['Latitude Category'] == region]['GDP'] for region in latitude_ranges.keys()],
            labels=list(latitude_ranges.keys()), patch_artist=True)
plt.title('Boxplot of GDP by Latitude Range')
plt.xlabel('Latitude Range')
plt.ylabel('GDP (log scale)')
plt.yscale('log')  # Log scale on the y-axis because GDP spans several orders of magnitude
plt.grid(True)
plt.show()
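
Latitude bands can also be assigned without a Python-level loop by using `pd.cut`. This is a sketch that assumes three non-overlapping bands split at the tropics (±23.5°); the band labels and sample latitudes are hypothetical:

```python
import pandas as pd

# Hypothetical latitudes in degrees
latitudes = pd.Series([60.0, 10.0, -5.0, -40.0, 23.5])

# Bins are right-closed: (-90, -23.5], (-23.5, 23.5], (23.5, 90];
# include_lowest=True also admits exactly -90 into the first bin
bands = pd.cut(
    latitudes,
    bins=[-90, -23.5, 23.5, 90],
    labels=["South of the tropics", "Tropical", "North of the tropics"],
    include_lowest=True,
)

print(bands.tolist())
```

Because the bins share edges and never overlap, every latitude in [-90, 90] receives exactly one label.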


In [None]:
fig = px.choropleth(df,
                    locationmode="country names",
                    locations=df["Country"],
                    color="GDP",
                    title="GDP for each country")
fig.show()

# CO2 Analysis

In [None]:
fig = px.choropleth(df,
                    locationmode="country names",
                    locations=df["Country"],
                    color="Co2-Emissions",
                    color_continuous_scale ="peach",
                    title="Co2 emissions for each country")
fig.show()

In [None]:
# Apply logarithmic transformation to the data
log_GDP = np.log10(df['GDP'])
log_Co2_emissions = np.log10(df['Co2-Emissions'])

# Create scatter plot with logarithmic scales
plt.figure(figsize=(8, 6))
plt.scatter(log_GDP, log_Co2_emissions, alpha=0.5)
plt.xlabel('GDP (log scale)')
plt.ylabel('Co2 emissions (log scale)')
plt.title('Relation between GDP and Co2 emissions (log scale)')
plt.grid(True)
plt.show()

# Agricultural Land

In [None]:
# Ensure Plot Display Mode
pio.renderers.default = 'notebook'

fig1 = px.choropleth(df, 
                    locations="Country", 
                    locationmode="country names",  
                    color="Agricultural Land( %)",  
                    color_continuous_scale="Viridis",  
                    title="Percentage of Agricultural Land by Country")  

fig1.show()

<div style="border-radius: 10px; border: 2px solid #ffd700; padding: 15px; background-color: #001f3f; font-size: 120%; text-align: center; color: #ffd700; font-weight: bold;">If you found this work helpful or valuable, I would greatly appreciate an upvote.</div>