#  <span style="color:green">Clustering Global Development</span>

In [None]:
import pandas as pd

In [None]:
c = pd.read_csv('World_development_mesurement.csv')

# Data Preview

In [None]:
#Displaying first 5 records
c.head()

In [None]:
#Displaying last 5 records
c.tail()

# Data Size

In [None]:
c.shape

# Columns

In [None]:
c.columns

# Data Types

In [None]:
c.info()

- The dataset has 2,704 rows and 25 columns.
- By data type: 17 float64 columns, 2 int64 columns, and 6 object (text) columns, so it mixes numerical and categorical data.
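
Some of the object columns appear to hold numbers stored as text: later cells strip `$` and `,` from `GDP` and `%` from `Business Tax Rate`. A minimal sketch of converting such columns to floats is shown below on toy data. The helper name `to_number` and the sample values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def to_number(series: pd.Series) -> pd.Series:
    """Strip '$', ',' and '%' from text values and convert to float (hypothetical helper)."""
    cleaned = series.astype(str).str.replace(r'[\$,%]', '', regex=True)
    # errors='coerce' turns anything that cannot be parsed into NaN
    return pd.to_numeric(cleaned, errors='coerce')

# Toy values for illustration only
demo = pd.DataFrame({
    'GDP': ['$1,200', '$3,400,000', None],
    'Business Tax Rate': ['45.3%', None, '12%'],
})
demo_clean = demo.apply(to_number)
print(demo_clean)
print(demo_clean.dtypes)
```

If such a conversion were applied before imputation, these columns would be picked up by the numerical (mean) step rather than the categorical (mode) step.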

# Summary Statistics

In [None]:
# Summary statistics for numerical columns
print("\nSummary statistics for numerical columns:\n", c.describe())

# Missing Values

1. First, we visualize the missing values.

In [None]:
c.isnull().sum()[c.isnull().sum()>0]

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap to visualize missing data
plt.figure(figsize=(12, 6))
sns.heatmap(c.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

# Bar plot of missing values count for each column
missing_values = c.isnull().sum()
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)

plt.figure(figsize=(10, 6))
missing_values.plot(kind='bar', color='skyblue')
plt.title('Count of Missing Values per Column')
plt.ylabel('Missing Value Count')
plt.show()


2. After visualizing them, we handle the missing values in three steps:
    1. Dropping columns with a high share of missing values
    2. Filling numerical columns with the mean
    3. Filling categorical columns with the mode
    

<h3 style="color:#00A36C">A. Dropping Columns with High Missing Values</h3>
- If a column has a large proportion of missing values (e.g., Ease of Business has 2,519 missing values out of 2,704), it is practical to drop it, because it contributes little to the analysis.
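
The `thresh` argument of `dropna` is easy to misread: it is the minimum number of non-missing values a column needs in order to be kept. The sketch below illustrates this on a hypothetical toy frame (the column names are made up for the example).

```python
import numpy as np
import pandas as pd

# Toy frame: 'mostly_missing' has 3 of 4 values missing, 'half' has exactly 2 of 4
toy = pd.DataFrame({
    'full': [1.0, 2.0, 3.0, 4.0],
    'half': [1.0, np.nan, 3.0, np.nan],
    'mostly_missing': [np.nan, np.nan, np.nan, 4.0],
})

missing_share = toy.isnull().mean()      # fraction of missing values per column
min_non_missing = int(len(toy) * 0.5)    # a 50% rule: keep columns with at least 2 values
kept = toy.dropna(thresh=min_non_missing, axis=1)

print(missing_share)
print(list(kept.columns))
```

A column that is exactly half missing is kept under this rule; only columns with more than 50% missing values are dropped.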

In [None]:
# Drop columns with more than 50% missing values
# (thresh is the minimum number of non-missing values a column needs to be kept)
threshold = int(len(c) * 0.5)
c.dropna(thresh=threshold, axis=1, inplace=True)
print("Columns remaining after dropping high missing value columns:\n", c.columns)
print(c.shape)

<h3 style="color:#00A36C">B. Filling Numerical Columns with Mean</h3>


In [None]:
# Fill missing values in numerical columns with mean
numerical_columns = c.select_dtypes(include=['float64', 'int64']).columns

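for col in numerical_columns:
    # Assign back instead of calling fillna(inplace=True) on a column selection,
    # which is deprecated in recent pandas and may not update the DataFrame
    c[col] = c[col].fillna(c[col].mean())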

# Verify if missing values in numerical columns are handled
print("Missing values after mean imputation:\n", c[numerical_columns].isnull().sum())


In [None]:
c.isnull().sum()[c.isnull().sum()>0]

<h3 style="color:#00A36C">C. Filling Categorical Columns with Mode</h3>


In [None]:
# Fill missing values in categorical columns with mode
categorical_columns = c.select_dtypes(include=['object']).columns

for col in categorical_columns:
    # Assign back instead of calling fillna(inplace=True) on a column selection
    c[col] = c[col].fillna(c[col].mode()[0])

# Verify if missing values in categorical columns are handled
print("Missing values after mode imputation:\n", c[categorical_columns].isnull().sum())


In [None]:
# Finally checking for any remaining missing values 
print("Remaining missing values = \n", c.isnull().sum()[c.isnull().sum() > 0])


- No missing values now.

# EDA

1. Histogram for Numerical Columns
- Histograms show the distribution of individual numerical columns, which is useful for understanding data spread, skewness, and common values.

In [None]:
import matplotlib.pyplot as plt

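# Histogram for the 'GDP' column
# GDP may be stored as text such as "$1,234" (a later cell strips these symbols), so convert it first
gdp = pd.to_numeric(c['GDP'].astype(str).str.replace(r'[\$,]', '', regex=True), errors='coerce')
plt.figure(figsize=(8, 5))
plt.hist(gdp.dropna(), bins=30, color='skyblue', edgecolor='black')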
plt.title('GDP Distribution')
plt.xlabel('GDP')
plt.ylabel('Frequency')
plt.show()


2. Box Plot
- Box plots help detect outliers in numerical columns by showing the spread and identifying extreme values.
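
The points a box plot draws as outliers are, by default, those lying more than 1.5 × IQR beyond the quartiles. A minimal sketch of counting them numerically is shown below; the helper `iqr_outlier_count` and the toy values are hypothetical.

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Toy values for illustration: one value is far from the rest
toy = pd.Series([10, 12, 11, 13, 12, 11, 95])
n_outliers = iqr_outlier_count(toy)
print(n_outliers)
```

Applied per column, for example with `c[numerical_columns].apply(iqr_outlier_count)`, this would give a count to accompany each box plot.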


In [None]:
# Box plots for numerical features (to detect outliers)
plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical_columns, 1):
    plt.subplot(5, 4, i)
    sns.boxplot(x=c[col])
    plt.title(col)
plt.tight_layout()
plt.show()

In [None]:
# Several columns show outliers, e.g. CO2 Emissions, Days to Start Business, and Energy Usage

3. Correlation Heatmap
- A heatmap of correlations helps reveal relationships among numerical features, indicating how they might interact with each other.
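
With many features, an annotated heatmap can be hard to read. One option is to list the strongest pairs as a table instead. The sketch below does this on a hypothetical toy frame; the column names `a`, `b`, `c` are placeholders, not dataset columns.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],   # perfectly correlated with 'a'
    'c': [5, 3, 4, 1, 2],
})

corr = toy.corr()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Flatten to (feature, feature) -> correlation and sort by absolute strength
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top alongside strong positive ones.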


In [None]:
# Correlation heatmap
numerical_cols = c.select_dtypes(include=['float64', 'int64']).columns
plt.figure(figsize=(12, 8))
sns.heatmap(c[numerical_cols].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()


4. Pair plot
- Pair plots allow you to visualize the relationships between several numerical features at once, showing distributions on the diagonal and scatter plots on the off-diagonals.

In [None]:
# Pair plot for selected numerical features
selected_features = ['GDP', 'CO2 Emissions', 'Energy Usage', 'Internet Usage']
sns.pairplot(c[selected_features], diag_kind='kde')
plt.suptitle('Pair Plot of Selected Features', y=1.02)
plt.show()


5. Average GDP of Top 10 Countries by Frequency

- The bar chart reveals the average GDP of the top 10 most frequently listed countries in the dataset. Countries with higher average GDP stand out, highlighting significant economic strength relative to the others.
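
Note that "top 10 by frequency" selects countries by how many rows they have, not by GDP. The sketch below shows the same `value_counts` / `groupby` pattern on hypothetical toy data, where the country with the highest GDP is left out because it has the fewest rows.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'C'],
    'GDP': [10.0, 20.0, 30.0, 5.0, 15.0, 100.0],
})

# Two most frequent countries: A (3 rows) and B (2 rows)
top = toy['Country'].value_counts().nlargest(2).index

# Average GDP for those countries only, highest first
avg_gdp = (
    toy[toy['Country'].isin(top)]
    .groupby('Country')['GDP']
    .mean()
    .sort_values(ascending=False)
)
print(avg_gdp)
```

If every country has the same number of rows, the selection depends only on how ties are ordered, so it may be worth inspecting `c['Country'].value_counts()` before reading much into which ten countries appear.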

In [None]:
import pandas as pd
import matplotlib.pyplot as plt


top_countries = c['Country'].value_counts().nlargest(10).index
top_countries_data = c[c['Country'].isin(top_countries)].copy()

# Clean and convert GDP column to float
# (raw string avoids the invalid-escape warning for '\$')
top_countries_data['GDP'] = top_countries_data['GDP'].replace(r'[\$,]', '', regex=True).astype(float)

# Calculate the average GDP for the top 10 countries
avg_gdp_top_countries = top_countries_data.groupby('Country')['GDP'].mean().sort_values(ascending=False)

# Plotting the bar chart
plt.figure(figsize=(10, 6))
avg_gdp_top_countries.plot(kind='bar', color= '#9DC209')
plt.title('Average GDP of Top 10 Countries by Frequency')
plt.xlabel('Country')
plt.ylabel('Average GDP (in $)')
plt.xticks(rotation=45)
plt.tight_layout()  # Adjust layout to prevent label cutoff
plt.show()


6. Average Life Expectancy (Male and Female) by Top Countries
- This bar plot compares the average life expectancy for males and females across the top countries in the dataset by record count.


In [None]:
# Filter top countries by frequency
top_countries = c['Country'].value_counts().nlargest(10).index
# .copy() so that later column assignments do not act on a view of c
top_countries_data = c[c['Country'].isin(top_countries)].copy()

# Calculate the average life expectancy for males and females in each country
avg_life_expectancy = top_countries_data.groupby('Country')[['Life Expectancy Male', 'Life Expectancy Female']].mean()

# Plotting
avg_life_expectancy.plot(kind='bar', figsize=(12, 6), color=['skyblue', 'salmon'])
plt.title('Average Life Expectancy (Male & Female) by Top Countries')
plt.xlabel('Country')
plt.ylabel('Average Life Expectancy')
plt.legend(['Male', 'Female'])
plt.show()


7. Average CO₂ Emissions by Top Countries
- This bar plot shows the average CO₂ emissions for the same top countries, making it easy to see which of them emit the most.


In [None]:
import matplotlib.pyplot as plt

# Sorting the average CO2 emissions from highest to lowest
avg_co2_emissions = top_countries_data.groupby('Country')['CO2 Emissions'].mean().sort_values(ascending=False)

# Plotting with multiple colors (one color for each bar)
colors = plt.get_cmap('viridis', len(avg_co2_emissions))  # plt.cm.get_cmap was removed in recent matplotlib; any other colormap name works too

# Plotting
avg_co2_emissions.plot(kind='bar', figsize=(10, 6), color=colors(range(len(avg_co2_emissions))))
plt.title('Average CO₂ Emissions by Top Countries')
plt.xlabel('Country')
plt.ylabel('Average CO₂ Emissions')
plt.show()



8. Average Business Tax Rate by Top Countries
- This bar plot displays the average business tax rate across a selection of top countries, giving insights into the tax environment for businesses.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Remove the '%' sign and convert to float
# (astype(str) keeps this cell safe to re-run if the column is already numeric)
top_countries_data['Business Tax Rate'] = (
    top_countries_data['Business Tax Rate'].astype(str).str.replace('%', '', regex=False).astype(float)
)

# Group by country and calculate mean business tax rate
avg_business_tax = top_countries_data.groupby('Country')['Business Tax Rate'].mean()

# Plotting
avg_business_tax.plot(kind='bar', figsize=(10, 6), color='purple')
plt.title('Average Business Tax Rate by Top Countries')
plt.xlabel('Country')
plt.ylabel('Average Business Tax Rate')
plt.show()


9. Average Internet Usage by Top Countries
- This plot compares average internet usage across the top countries, highlighting digital access levels.


In [None]:
import matplotlib.pyplot as plt

# Sorting the average internet usage from highest to lowest
avg_internet_usage = top_countries_data.groupby('Country')['Internet Usage'].mean().sort_values(ascending=False)

# Plotting with multiple colors (one color for each bar)
colors = plt.get_cmap('coolwarm', len(avg_internet_usage))  # plt.cm.get_cmap was removed in recent matplotlib; any other colormap name works too

# Plotting
avg_internet_usage.plot(kind='bar', figsize=(10, 6), color=colors(range(len(avg_internet_usage))))
plt.title('Average Internet Usage by Top Countries')
plt.xlabel('Country')
plt.ylabel('Average Internet Usage (%)')
plt.show()



10. Average Energy Usage by Top Countries
- This plot shows the average energy usage for the top countries, revealing insights about energy consumption.


In [None]:
import matplotlib.pyplot as plt

# Sorting the average energy usage from highest to lowest
avg_energy_usage = top_countries_data.groupby('Country')['Energy Usage'].mean().sort_values(ascending=False)

# Plotting with multiple colors (one color for each bar)
colors = plt.get_cmap('plasma', len(avg_energy_usage))  # plt.cm.get_cmap was removed in recent matplotlib; any other colormap name works too

# Plotting
avg_energy_usage.plot(kind='bar', figsize=(10, 6), color=colors(range(len(avg_energy_usage))))
plt.title('Average Energy Usage by Top Countries')
plt.xlabel('Country')
plt.ylabel('Average Energy Usage')
plt.show()


11. Average Infant Mortality Rate by Top Countries
- This bar plot displays the average infant mortality rate for the same top countries, highlighting differences in healthcare quality.

In [None]:
# Average Infant Mortality Rate for top countries
avg_infant_mortality = top_countries_data.groupby('Country')['Infant Mortality Rate'].mean().sort_values(ascending=False)

# Plotting
avg_infant_mortality.plot(kind='bar', figsize=(10, 6), color='red')
plt.title('Average Infant Mortality Rate by Top Countries')
plt.xlabel('Country')
plt.ylabel('Average Infant Mortality Rate')
plt.show()


12. Average Mobile Phone Usage by Top Countries
- This bar plot displays the average mobile phone usage for the same top countries, indicating the level of mobile technology adoption.

In [None]:
# Calculate the average Mobile Phone Usage for top countries and sort in descending order
avg_mobile_phone_usage = top_countries_data.groupby('Country')['Mobile Phone Usage'].mean().sort_values(ascending=False)

# Plotting with multiple colors (one color for each bar)
colors = plt.get_cmap('magma', len(avg_mobile_phone_usage))  # plt.cm.get_cmap was removed in recent matplotlib; any other colormap name works too


# Plotting
avg_mobile_phone_usage.plot(kind='bar', figsize=(10, 6), color=colors(range(len(avg_mobile_phone_usage))))
plt.title('Average Mobile Phone Usage by Top Countries (Highest to Lowest)')
plt.xlabel('Country')
plt.ylabel('Average Mobile Phone Usage')
plt.show()


13. Scatter Plot
- Scatter plots help visualize the relationship between two numerical columns.

In [None]:
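# Scatter plot between 'GDP' and 'CO2 Emissions'
# GDP may be stored as text such as "$1,234", so convert it to numbers before plotting
gdp = pd.to_numeric(c['GDP'].astype(str).str.replace(r'[\$,]', '', regex=True), errors='coerce')
plt.figure(figsize=(8, 5))
plt.scatter(gdp, c['CO2 Emissions'], color='blue', alpha=0.5)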
plt.title('GDP vs CO2 Emissions')
plt.xlabel('GDP')
plt.ylabel('CO2 Emissions')
plt.show()


# Conclusion

The analysis of socio-economic and environmental indicators across countries provided meaningful insights into global trends, disparities, and relationships across various dimensions. Key findings from this analysis include:

1) Economic Strength: 

Countries with higher average GDP emerged as economically influential, while others lagged behind, illustrating global economic disparities. This variation highlights the need for balanced economic policies to support growth in less prosperous regions.

2) Health and Demographics: 

Differences in life expectancy showed a gender pattern favoring females, consistent with global health trends. Infant mortality rates varied significantly, serving as a key indicator of healthcare quality and resource allocation gaps across nations.

3) Environmental Impact: 

Analysis of CO₂ emissions pointed to certain countries as major contributors to global emissions, raising concerns about environmental responsibility and underscoring the importance of sustainable practices.

4) Digital and Energy Usage: 

Variations in internet usage and energy consumption reflected disparities in technological access and energy demands. Higher internet usage was generally associated with more developed countries, while high energy consumption pointed to greater industrial activity or energy dependence, signaling development levels.

5) Healthcare Quality: 

The infant mortality rate served as a crucial indicator of healthcare quality, with lower rates observed in countries with robust healthcare systems. This finding emphasizes the role of healthcare investment in improving population health outcomes.

6) Business Environment: 

Analysis of business tax rates demonstrated differences in tax environments, which can influence business operations and investment decisions, ultimately affecting economic growth and stability.

7) Mobile Technology Adoption: 

Mobile phone usage highlighted digital adoption rates, with higher usage in economically strong countries, reflecting access to technology and connectivity infrastructure.

Summary:

This analysis underscored significant economic, environmental, and social differences across countries, providing a comprehensive view of global development priorities and challenges. These findings serve as a strong foundation for further analysis, such as clustering, which can categorize countries with similar socio-economic profiles. Such insights can guide targeted policies and strategic initiatives aimed at fostering balanced and sustainable global development.