# Analyse World Population Data

## Task

Find out what might make countries better.
1. Have a look at the variables, understand what they are.
2. Pick a variable which interests you in a country.
3. Which other variables are associated with your target variable? Formulate several hypotheses.
4. Explore each hypothesis.
    - Make plots and/or compute statistics.
    - Write a short conclusion, refer to the justifications you found in the data.

## Dataset description

_Source: https://www.kaggle.com/datasets/madhurpant/world-population-data
  (See the bottom of this notebook for the steps to re-create it)_

This dataset describes 192 countries and their populations. Here are the columns, grouped by topic:

1. Height and weight:
    - male_height
    - female_height
    - male_weight
    - female_weight
    - male_bmi
    - female_bmi

2. Life expectancy:
    - male_life_expectancy
    - female_life_expectancy
    - birth_rate
    - death_rate

3. Population density:
    - area
    - population
    - pop_per_km_sq

4. Quality of life:
    - stability
    - rights
    - health
    - safety
    - climate
    - costs
    - popularity

5. Other:
    - iq
    - education_expenditure_per_inhabitant
    - daily_max_temp


## Analysis

In [None]:
import pandas as pd
import seaborn as sns

# From https://drive.google.com/file/d/181fFa4h4EigLpMlyu3DXaptm41tXVrNS/view
df = pd.read_csv(
    "https://drive.google.com/uc?id=181fFa4h4EigLpMlyu3DXaptm41tXVrNS",
    index_col=0,
)
df.shape

In [None]:
df.columns

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(25,10))
df_subset_non_string = df.select_dtypes(exclude="object")
sns.heatmap(df_subset_non_string.corr(method="kendall"), annot=True, fmt=".3f", ax=ax);

In [None]:
df[df.columns[::-1]].head()

Variables in the dataset:

- Height and weight: the average height and weight of men and women, plus the average body mass index (BMI) for both sexes.
- Life expectancy: average life expectancy for men and women, plus birth and death rates.
- Population density: area, population, and population per square kilometre.
- Quality of life: indicators of stability, rights, health, safety, climate, costs, and popularity.
- Other: average IQ, education expenditure per inhabitant, and daily maximum temperature.

The topic I chose is quality of life, because it is an important indicator of a country's development. Factors that could plausibly improve it include political and social stability, human rights, the quality of healthcare and education, safety, a favourable climate, economic stability, and the country's popularity with residents and tourists.

Since we want to understand what might make countries better, we can start by exploring the variables related to quality of life.
According to the dataset description, these are "stability", "rights", "health", "safety", "climate", "costs", and "popularity".

We are interested in exploring which factors might be associated with a country's stability.
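
One way to shortlist candidate variables before formulating hypotheses is to rank every numeric column by its correlation with the target. The sketch below is only an illustration and not part of the original analysis: the helper name `rank_associations` and the toy dataframe are hypothetical. With the real data one would pass `df` and `"stability"` instead, assuming that column is numeric.

```python
import pandas as pd


def rank_associations(data: pd.DataFrame, target: str) -> pd.Series:
    """Return Pearson correlations with `target`, strongest (by absolute value) first."""
    numeric = data.select_dtypes(include="number")  # ignore text columns
    correlations = numeric.corr()[target].drop(target)
    order = correlations.abs().sort_values(ascending=False).index
    return correlations.loc[order]


# Toy data, made up for illustration only (not taken from the dataset).
toy = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "b": [4, 3, 2, 1],
        "c": [1, 3, 2, 4],
        "name": ["w", "x", "y", "z"],
        "target": [2, 4, 6, 8],
    }
)
ranked = rank_associations(toy, "target")
print(ranked)
```

Sorting by absolute value keeps strong negative associations near the top, which a plain descending sort would push to the bottom.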

Health hypothesis: Countries with better health indicators (such as higher life expectancy, lower death rates, and lower BMI) may also exhibit greater stability.

In [None]:
# Defining the variables we have
health_variables = ['male_life_expectancy', 'female_life_expectancy', 'birth_rate', 'death_rate', 'male_bmi', 'female_bmi']

# Review the correlation matrix between the health indicators and stability
health_stability_correlation = df[health_variables + ['stability']].corr()
print(health_stability_correlation)


In [None]:
# Defining male health variables
male_health_variables = ['male_life_expectancy', 'male_bmi']

# Creating a correlation matrix
male_health_stability_correlation = df[male_health_variables + ['stability']].corr()

# Print the correlation matrix
print(male_health_stability_correlation)

# Plotting scatter plots for each male health variable against stability
for variable in male_health_variables:
    sns.scatterplot(x=variable, y='stability', data=df)
    plt.xlabel(variable)
    plt.ylabel('Stability')
    plt.title(f'Correlation between {variable} and Stability')
    plt.show()

In [None]:
# Defining female health variables
female_health_variables = ['female_life_expectancy', 'female_bmi']
# Creating a correlation matrix
female_health_stability_correlation = df[female_health_variables + ['stability']].corr()

# Print the correlation matrix
print(female_health_stability_correlation)

# Plotting scatter plots for each female health variable against stability
for variable in female_health_variables:
    sns.scatterplot(x=variable, y='stability', data=df)
    plt.xlabel(variable)
    plt.ylabel('Stability')
    plt.title(f'Correlation between {variable} and Stability')
    plt.show()

- Male life expectancy and stability: The scatter plot shows a positive relationship between male life expectancy and country stability. Countries where men live longer tend to have higher levels of stability.
- Female life expectancy and stability: As with male life expectancy, there is a positive relationship between female life expectancy and country stability.
- BMI: The relationship between BMI and country stability is weaker than for life expectancy, and it is weaker for women than for men.

For men, the correlation matrix shows:

- A strong positive correlation between male life expectancy and country stability (correlation coefficient 0.724741).
- A moderate positive correlation between male body mass index and male life expectancy (correlation coefficient 0.587547).
- A moderate positive correlation between male body mass index and country stability (correlation coefficient 0.424205).

Thus, these data confirm the positive correlation between male life expectancy and country stability. They also indicate a relationship between male body mass index and these two factors, but that correlation is only moderate.

For women, the correlation matrix shows:

- A strong positive correlation between female life expectancy and country stability (correlation coefficient 0.730485).
- A weak positive correlation between female body mass index and female life expectancy (correlation coefficient 0.273751).
- A very weak positive correlation between female body mass index and country stability (correlation coefficient 0.068683).

Thus, these data support the conclusion that there is a positive correlation between female life expectancy and country stability, but they indicate a much lower correlation between female body mass index and these two factors.
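
To keep labels such as "weak", "moderate", and "strong" consistent across conclusions, it can help to derive them from the coefficient with a fixed rule. The sketch below is a hypothetical helper: the function name `describe_strength` and the cut-offs (0.1, 0.3, 0.6) are assumptions chosen for illustration, not a standard. The coefficients passed to it are the ones reported above.

```python
def describe_strength(r: float) -> str:
    """Label a correlation coefficient using assumed, illustrative cut-offs."""
    if r == 0:
        return "no correlation"
    size = abs(r)
    if size < 0.1:
        label = "very weak"
    elif size < 0.3:
        label = "weak"
    elif size < 0.6:
        label = "moderate"
    else:
        label = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction}"


# Coefficients reported in the correlation matrices above.
for value in (0.724741, 0.587547, 0.424205, 0.273751, 0.068683):
    print(f"{value:.3f}: {describe_strength(value)}")
```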

To investigate the hypothesis that more stable countries have higher birth rates and lower death rates, we can analyse the correlation between stability and the birth and death rates.

In [None]:
# Plotting scatter plot between birth rate, death rate, and stability
sns.scatterplot(x='birth_rate', y='stability', data=df, label='Birth Rate')
sns.scatterplot(x='death_rate', y='stability', data=df, color='red', label='Death Rate')
plt.xlabel('Rate')
plt.ylabel('Stability')
plt.title('Correlation between Birth Rate, Death Rate, and Stability')
plt.legend()
plt.show()

# Calculating correlation coefficient between birth rate and stability
correlation = df['birth_rate'].corr(df['stability'])
print(f"Correlation coefficient between birth rate and stability: {correlation}")

In the diagram, the birth rate data are marked with blue dots and the death rate data with red dots.

Based on the results of the scatter plot, the following conclusions can be drawn:
Birth rate and stability: The chart shows that most countries with high birth rates also have low levels of stability. However, there is no clear linear relationship between the two, so a higher birth rate does not always go together with lower stability, or vice versa.

Death rate and stability: The chart shows that most countries with low death rates have higher levels of stability. This may indicate that a low death rate can be a positive factor in maintaining stability in a country.

Birth and death rates overall: Neither rate shows a clear linear relationship with a country's stability in the scatter plot. This may mean that these factors are not crucial to the overall stability of the country.
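
When a scatter plot suggests a trend that is not a straight line, comparing a linear (Pearson) coefficient with a rank-based (Spearman) one can help. The sketch below is a minimal illustration under stated assumptions: the helper name `linear_and_rank_corr` and the toy data are hypothetical. With the real data one would call it with `df`, `["birth_rate", "death_rate"]`, and `"stability"`.

```python
import pandas as pd


def linear_and_rank_corr(data: pd.DataFrame, columns: list, target: str) -> pd.DataFrame:
    """Compute Pearson and Spearman correlations of each column with `target`."""
    rows = {}
    for column in columns:
        pair = data[[column, target]].dropna()  # use complete pairs only
        pearson = pair[column].corr(pair[target])
        # Spearman's coefficient is the Pearson correlation of the ranks.
        spearman = pair[column].rank().corr(pair[target].rank())
        rows[column] = {"pearson": pearson, "spearman": spearman}
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy data, made up for illustration: y grows with x, but not linearly.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 25]})
result = linear_and_rank_corr(toy, ["x"], "y")
print(result)
```

A Spearman value clearly larger in magnitude than the Pearson value would hint at a monotonic but non-linear relationship.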


In [None]:
plt.figure(figsize=(14, 6))
# Box plot
sns.boxplot(x='stability', y='male_life_expectancy', data=df)
plt.xlabel("stability")
plt.ylabel("male_life_expectancy")
plt.title("Distribution of men's life expectancy by levels of stability")
plt.xticks(rotation=90)
plt.show()


Climate hypothesis:
We can examine the correlation between stability and climate-related variables, such as the daily maximum temperature.
Scatter plots or correlation matrices can reveal any potential relationship between stability and climate.

Hypothesis about the effect of temperature on stability: Countries with a moderate climate may tend to be more stable than countries with extreme temperatures.
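
A "moderate versus extreme" hypothesis is not monotonic, so a single correlation coefficient may miss it. Comparing group means across temperature bands is one alternative. The sketch below is an illustration under stated assumptions: the helper name `mean_by_band`, the three equal-width bands, their labels, and the toy data are all hypothetical. With the real data one would pass `df`, `"daily_max_temp"`, and `"stability"`.

```python
import pandas as pd


def mean_by_band(data, value_col, target, labels=("cool", "moderate", "hot")):
    """Split `value_col` into equal-width bands and average `target` per band."""
    # Non-numeric entries become NaN and are left out of every band.
    values = pd.to_numeric(data[value_col], errors="coerce")
    bands = pd.cut(values, bins=len(labels), labels=list(labels))
    return data[target].groupby(bands, observed=True).mean()


# Toy data, made up for illustration only (not taken from the dataset).
toy = pd.DataFrame(
    {
        "daily_max_temp": [5, 10, 18, 20, 22, 33, 35, "n/a"],
        "stability": [50, 60, 80, 70, 90, 40, 30, 99],
    }
)
band_means = mean_by_band(toy, "daily_max_temp", "stability")
print(band_means)
```

Equal-width bands are the simplest choice. Explicit bin edges could be passed to `pd.cut` instead if particular temperature thresholds are of interest.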

In [None]:
# Print the first rows of the climate and stability data
print(df[['climate', 'stability']].head())

In [None]:
# Print the full climate and stability columns
print(df[['climate', 'stability']])

Analysis of the relationship between the climate and stability:

In [None]:
df['climate'] = pd.to_numeric(df['climate'], errors='coerce')

df_cleaned = df.dropna(subset=['climate', 'stability'])

sns.scatterplot(x='climate', y='stability', data=df_cleaned)
plt.xlabel('Climate')
plt.ylabel('Stability')
plt.title('Correlation between Climate and Stability')
plt.show()

correlation_climate_stability = df_cleaned['climate'].corr(df_cleaned['stability'])
print(f"Correlation coefficient between climate and stability: {correlation_climate_stability}")

Relationship with the climate indicator: The scatterplot and correlation coefficient show a weak positive relationship between the `climate` quality-of-life indicator and country stability. This means that countries with a higher climate value may have a slightly higher level of stability. Note that this cell uses the `climate` column, not a temperature measurement such as `daily_max_temp`.

In [None]:
# Convert 'climate' column to numeric
df['climate'] = pd.to_numeric(df['climate'], errors='coerce')

# Drop rows with missing values in 'climate' and 'stability' columns
df_cleaned = df.dropna(subset=['climate', 'stability'])

# Creating a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df_cleaned[['climate', 'stability']].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap between Climate and Stability')
plt.show()

Economic hypothesis: Countries with a lower cost of living and higher education expenditure per inhabitant may have better economic conditions, which may contribute to greater stability.

Analysis of the relationship between living expenses and stability:

In [None]:
sns.scatterplot(x='costs', y='stability', data=df)
plt.xlabel('Cost of Living')
plt.ylabel('Stability')
plt.title('Correlation between Cost of Living and Stability')
plt.show()


correlation_costs_stability = df['costs'].corr(df['stability'])
print(f"Correlation coefficient between costs and stability: {correlation_costs_stability}")


The scatterplot and correlation coefficient show a weak positive relationship between cost of living and country stability. This may indicate that countries with a higher cost of living have a slightly higher level of stability.


Analysis of the relationship between education expenditure per capita and stability:

In [None]:
# Work on a copy so that the original dataframe is left unchanged
df_cleaned_education = df.copy()

# Convert the 'education_expenditure_per_inhabitant' column to numeric; non-numeric strings become NaN
df_cleaned_education['education_expenditure_per_inhabitant'] = pd.to_numeric(df_cleaned_education['education_expenditure_per_inhabitant'], errors='coerce')

# Drop rows with missing values in either column (including strings that could not be converted)
df_cleaned_education = df_cleaned_education.dropna(subset=['education_expenditure_per_inhabitant', 'stability'])

# Plot the scatter plot
sns.scatterplot(x='education_expenditure_per_inhabitant', y='stability', data=df_cleaned_education)
plt.xlabel('Education Expenditure per Inhabitant')
plt.ylabel('Stability')
plt.title('Correlation between Education Expenditure per Inhabitant and Stability')
plt.show()

# Calculate the correlation between education expenditure and stability
correlation_education_stability = df_cleaned_education['education_expenditure_per_inhabitant'].corr(df_cleaned_education['stability'])
print(f"Correlation coefficient between education expenditure and stability: {correlation_education_stability}")


Relationship with education expenditure: The scatterplot and correlation coefficient also show a weak positive relationship between education expenditure per capita and country stability. This may suggest that countries that spend more on education per inhabitant are slightly more stable.

# How the dataframe was created


This section is not relevant for doing the project; you can ignore it.

In case the dataset needs to be recreated, or if you are a very curious student, this is how it was done (on a local machine, _not_ in Colab):

```python
import pandas as pd
import glob
from functools import reduce

# 1. Download and extract data from
#    https://www.kaggle.com/datasets/madhurpant/world-population-data

# 2. Merge all dataframes
joint_df = reduce(
    lambda df1, df2: df1.merge(df2, on="country", how="outer"),
    [pd.read_csv(path) for path in glob.glob("world-population-data/*")],
)

# 3. Make "country" the index
joint_df.set_index(["country"], inplace=True)

# 4. Save the result
joint_df.to_csv("world-population-data.csv")
```