# Exploratory Data Analysis on **Population and Net Migration Dataset World Bank**

<h2 style="font-family: 'poppins'; font-weight: bold;">👨‍💻 Author: Muhammad Hassan Saboor</h2>

[![GitHub](https://img.shields.io/badge/GitHub-Profile-blue?style=for-the-badge&logo=github)](https://github.com/MuhammadHassanSaboor) 
[![Kaggle](https://img.shields.io/badge/Kaggle-Profile-blue?style=for-the-badge&logo=kaggle)](https://www.kaggle.com/mhassansaboor) 
[![LinkedIn](https://img.shields.io/badge/LinkedIn-Profile-blue?style=for-the-badge&logo=linkedin)](https://www.linkedin.com/in/muhammad-hassan-saboor/)  
[![Facebook](https://img.shields.io/badge/Facebook-Profile-blue?style=for-the-badge&logo=facebook)](https://www.facebook.com/profile.php?id=61555194218257) 
[![Twitter/X](https://img.shields.io/badge/Twitter-Profile-blue?style=for-the-badge&logo=twitter)](https://twitter.com/MUHAMMA84929767) 
[![Instagram](https://img.shields.io/badge/Instagram-Profile-blue?style=for-the-badge&logo=instagram)](https://www.instagram.com/m_hassan_saboor/) 

## 🌍 Understanding Key Terms: Migration Dynamics

### 🧳 **Immigrants**  
Individuals who move **into** a country from another to reside there.  
*Example*: A person moving from Country A to Country B becomes an immigrant in Country B.

### 🚶‍♂️ **Emigrants**  
Individuals who move **out of** a country to reside in another.  
*Example*: A person moving from Country A to Country B is an emigrant for Country A.

### ⚖️ **Net Migration**  
The difference between the number of **immigrants** and **emigrants** for a given country and time period.  

The formula for **Net Migration** is:  
$$
\text{Net Migration} = \text{Number of Immigrants} - \text{Number of Emigrants}
$$

### 📈 **Total Population with Net Migration**  
The population of a country, including the effects of migration, can be calculated as:  
$$
\text{Total Population}_{\text{current year}} = \text{Total Population}_{\text{previous year}} + \text{Net Migration} + \text{Natural Increase}
$$

Where:  
$$
\text{Natural Increase} = \text{Births} - \text{Deaths}
$$  
$$
\text{Net Migration} = \text{Immigrants} - \text{Emigrants}
$$

### 📊 Key Insights:  
- **Positive Net Migration**: More people entering than leaving.  
- **Negative Net Migration**: More people leaving than entering.  
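
The two formulas above can be checked with a few lines of arithmetic. The sketch below uses made-up figures for a hypothetical country; the function names `net_migration` and `next_population` are illustrative and are not part of the dataset or of any library.

```python
def net_migration(immigrants: int, emigrants: int) -> int:
    """Net Migration = Immigrants - Emigrants."""
    return immigrants - emigrants


def next_population(previous_population: int, births: int, deaths: int,
                    immigrants: int, emigrants: int) -> int:
    """Population balance: previous population + natural increase + net migration."""
    natural_increase = births - deaths
    return previous_population + natural_increase + net_migration(immigrants, emigrants)


# Hypothetical figures, for illustration only
example_population = next_population(
    previous_population=1_000_000,
    births=25_000,
    deaths=8_000,
    immigrants=3_000,
    emigrants=12_000,
)
print(example_population)  # 1008000
```

Here the natural increase (+17,000) outweighs the negative net migration (-9,000), so the population still grows.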


# Metadata

This dataset provides a comprehensive look at population and migration trends in five South Asian countries: Afghanistan, Bangladesh, India, Pakistan, and Sri Lanka, covering the years 1960 to 2023. The data is sourced directly from the World Bank API and contains detailed statistics on total population and net migration for each year.

**This dataset is ideal for:**

- Time-series analysis to study population trends over six decades.
- Migration studies to assess policy impacts and demographic shifts.
- Data visualization for dashboards and presentations.
- Machine learning applications in predictive analytics.
  
**Columns:**

- **Country**: Name of the country.
- **Year**: Year of the recorded data.
- **Total Population**: The total population of the country.
- **Net Migration**: Net migration balance (positive for immigration surplus, negative for emigration surplus).

**Key Insights:**

- Afghanistan: Significant migration shifts due to conflicts and crises.
- India: Continuous population growth with varying migration trends.
- Bangladesh: A history of large emigration and its impact on demographics.
- Pakistan: Migration surpluses in some years and large outflows in others.
- Sri Lanka: Gradual population growth and consistent emigration patterns.

## Dataset Credit

This dataset is provided by **Dr. Muhammad Aammar Tufail**, the Founder of [Codanics.com](https://codanics.com).

We appreciate the valuable contributions made by **Dr. Muhammad Aammar Tufail** in the field of data science and technology.


## 📚 Importing Libraries

In [1]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import warnings

## ⚙️ Basic Important Settings

In [2]:
warnings.filterwarnings("ignore")

This line suppresses warning messages so they do not clutter the output during the analysis.
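
Silencing every warning can also hide useful deprecation notices. As an alternative sketch, warnings can be suppressed for a single category and only inside one block. The function `noisy_step` below is a hypothetical stand-in for any call that emits a `FutureWarning`.

```python
import warnings


def noisy_step() -> int:
    """Hypothetical stand-in for a library call that emits a FutureWarning."""
    warnings.warn("this behaviour will change", FutureWarning)
    return 42


# The filter applies only inside the `with` block; the previous
# warning configuration is restored when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy_step()

print(result)  # 42
```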

## 📥 Loading the Dataset

In [3]:
df = pd.read_csv("/kaggle/input/population-and-net-migration-dataset-world-bank/pop_and_net_migration.csv")

## 📊 Exploring the Dataset

In [4]:
df.sample(10)

|     | Country     | Year | total_population | net_migration |
| --- | ----------- | ---- | ---------------- | ------------- |
| 47  | Afghanistan | 1976 | 12425270.0       | -85430.0      |
| 132 | India       | 2019 | 1383112000.0     | -593495.0     |
| 252 | Sri Lanka   | 1963 | 10517530.0       | -9202.0       |
| 22  | Afghanistan | 2001 | 19688630.0       | -192286.0     |
| 199 | Sri Lanka   | 2016 | 21425490.0       | -98896.0      |
| 102 | Bangladesh  | 1985 | 95959100.0       | -204954.0     |
| 160 | India       | 1991 | 888941800.0      | -158007.0     |
| 179 | India       | 1972 | 582838000.0      | -60291.0      |
| 75  | Bangladesh  | 2012 | 152090600.0      | -242022.0     |
| 151 | India       | 2000 | 1059634000.0     | -149966.0     |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 320 entries, 0 to 319
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Country           320 non-null    object 
 1   Year              320 non-null    int64  
 2   total_population  320 non-null    float64
 3   net_migration     320 non-null    float64
dtypes: float64(2), int64(1), object(1)
memory usage: 10.1+ KB


As we can see, the dataset has 4 columns:
- Country
- Year
- total_population
- net_migration
  
There are no null values in the dataset.
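
The same conclusion can be confirmed programmatically rather than read off the `df.info()` output. The sketch below runs on a hypothetical miniature frame named `toy` that reuses the dataset's column names; it also checks for duplicated country-year pairs, which `df.info()` does not reveal.

```python
import pandas as pd

# Hypothetical miniature of the dataset, using the same column names
toy = pd.DataFrame(
    {
        "Country": ["A", "A", "B", "B"],
        "Year": [1960, 1961, 1960, 1961],
        "total_population": [100.0, 110.0, 50.0, 55.0],
        "net_migration": [-5.0, -3.0, 2.0, 1.0],
    }
)

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of rows that repeat an earlier (Country, Year) combination
duplicate_country_years = int(toy.duplicated(subset=["Country", "Year"]).sum())

print(missing_per_column)
print("Duplicate (Country, Year) rows:", duplicate_country_years)
```

On the real data, replacing `toy` with `df` should run the same checks.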

In [6]:
df["Country"].value_counts()

Country
Afghanistan    64
Bangladesh     64
India          64
Sri Lanka      64
Pakistan       64
Name: count, dtype: int64

In [7]:
start_year = df["Year"].min()
end_year = df["Year"].max()

In [8]:
print(f"The dataset shows statistics of migration from {start_year} to {end_year}")

The dataset shows statistics of migration from 1960 to 2023
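
Knowing the first and last year does not by itself show that every country has a row for every year in between. A minimal sketch of such a coverage check follows, on hypothetical toy data in which country `B` deliberately lacks one year.

```python
import pandas as pd

# Hypothetical toy data: country B has no row for 2001
toy = pd.DataFrame(
    {
        "Country": ["A", "A", "A", "B", "B"],
        "Year": [2000, 2001, 2002, 2000, 2002],
    }
)

# Every year between the overall minimum and maximum, inclusive
expected_years = set(range(int(toy["Year"].min()), int(toy["Year"].max()) + 1))

# Years that are expected but absent, per country
missing_years = {
    country: sorted(expected_years - set(group["Year"]))
    for country, group in toy.groupby("Country")
}
print(missing_years)  # {'A': [], 'B': [2001]}
```

With 64 rows per country and a 1960 to 2023 range, the real dataset would be expected to return an empty list for every country.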


## 🔍 Exploratory Data Analysis (EDA)

Now let's explore the data and draw out insights with interactive plots.

In [9]:
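# pct_change compares each row with the row before it, so order rows by year within each country first
# (the sampled rows above suggest the file is stored newest-year-first)
df = df.sort_values(["Country", "Year"]).reset_index(drop=True)

# 1. Population Growth Rate (percentage change from previous year)
df['population_growth_rate'] = df.groupby('Country')['total_population'].pct_change() * 100

# 2. Migration Rate (percentage change in net migration from previous year)
df['migration_rate'] = df.groupby('Country')['net_migration'].pct_change() * 100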

# 3. Population Migration Ratio (net migration / total population)
df['population_migration_ratio'] = df['net_migration'] / df['total_population']

# 4. Total Migration by Country (sum of net migration for each country)
migration_by_country = df.groupby('Country')['net_migration'].sum().reset_index()
migration_by_country = migration_by_country.rename(columns={'net_migration': 'total_migration'})

# Merge the total migration by country back into the main dataframe
df = df.merge(migration_by_country[['Country', 'total_migration']], on='Country', how='left')
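
Because `pct_change` compares each row with the row immediately before it, the result is only a "change from the previous year" when rows are in ascending year order within each country. The sketch below illustrates the difference on a hypothetical frame `toy` stored newest-year-first, which the index values in the sampled rows suggest may also be the case for this file.

```python
import pandas as pd

# Hypothetical toy data stored newest-year-first
toy = pd.DataFrame(
    {
        "Country": ["A", "A", "A"],
        "Year": [2002, 2001, 2000],
        "total_population": [121.0, 110.0, 100.0],
    }
)

# Rows as stored: each year is compared with the FOLLOWING year
growth_as_stored = toy.groupby("Country")["total_population"].pct_change() * 100

# Rows ordered by year: each year is compared with the PREVIOUS year
ordered = toy.sort_values(["Country", "Year"])
growth_ordered = ordered.groupby("Country")["total_population"].pct_change() * 100

print(growth_as_stored.round(2).tolist())  # [nan, -9.09, -9.09]
print(growth_ordered.round(2).tolist())    # [nan, 10.0, 10.0]
```

A population that grows 10% each year shows up as shrinking when the rows are in descending order, so sorting before `pct_change` matters.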

## ⏳ Time-Series Analysis

#### Population

In [10]:
fig_population = px.line(
    df,
    x="Year",
    y="total_population",
    color="Country",
    title="Total Population Over the Years",
    labels={"total_population": "Total Population"},
    markers=True,
)

fig_population.update_layout(
    plot_bgcolor="black",    
    paper_bgcolor="black",   
    font_color="white"      
)

fig_population.update_traces(line_shape="spline")  
fig_population.show()

- **India** has a far larger population than the other four countries.
- If we hide **India** (for example, by clicking its entry in the legend), a sharp rise in **Pakistan**'s population after **1987** becomes easy to see.
- The populations of **Sri Lanka** and **Afghanistan** are almost the same until **2002**, after which **Afghanistan**'s population grows much faster than **Sri Lanka**'s.


#### With Trend Lines

In [11]:
fig_population_trend = px.scatter(
    df,
    x="Year",
    y="total_population",
    color="Country",
    title="Population Trends with Trendlines",
    trendline="ols", 
    labels={"total_population": "Total Population"},
)

fig_population_trend.update_layout(
    plot_bgcolor="black",    
    paper_bgcolor="black",   
    font_color="white"       
)

fig_population_trend.show()

#### Net Migration

In [12]:
fig_migration = px.line(
    df,
    x="Year",
    y="net_migration",
    color="Country",
    title="Net Migration Trends Over the Years",
    labels={"net_migration": "Net Migration"},
    markers=True,
)

fig_migration.update_layout(
    plot_bgcolor="black",    
    paper_bgcolor="black",   
    font_color="white"       
)

fig_migration.update_traces(line_shape="spline") 
fig_migration.show()

- **Afghanistan**'s net migration dips into negative territory around **1981** and **1990**, and again after **2000**; war is a likely major cause.
- **Pakistan**'s net migration becomes highly negative in **2016**, which may be linked to terrorism and political instability.
- Some other, smaller trends are also visible in the plot.
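
Years like these can also be located programmatically instead of by eye. The sketch below finds, for each country, the year with the lowest net migration using `idxmin`; the frame `toy` and its numbers are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with the dataset's column names
toy = pd.DataFrame(
    {
        "Country": ["A", "A", "A", "B", "B", "B"],
        "Year": [2000, 2001, 2002, 2000, 2001, 2002],
        "net_migration": [-10.0, -50.0, -20.0, 5.0, -1.0, -30.0],
    }
)

# idxmin returns, per country, the index label of the smallest net_migration value
lowest_rows = toy.loc[toy.groupby("Country")["net_migration"].idxmin()]
print(lowest_rows[["Country", "Year", "net_migration"]])
```

For the toy data this selects 2001 for country `A` and 2002 for country `B`.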

#### With Trend Lines

In [13]:
fig_migration_trend = px.scatter(
    df,
    x="Year",
    y="net_migration",
    color="Country",
    title="Net Migration Trends with Trendlines",
    trendline="ols",
    labels={"net_migration": "Net Migration"},
)

fig_migration_trend.update_layout(
    plot_bgcolor="black",    
    paper_bgcolor="black",   
    font_color="white"       
)

fig_migration_trend.show()

#### Total Migration by Country

In [14]:
fig_total_migration = px.bar(
    migration_by_country,
    x="Country",
    y="total_migration",
    title="Total Migration by Country",
    labels={"total_migration": "Total Migration"},
)

fig_total_migration.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_total_migration.show()

This graph shows that **Bangladesh** has the largest cumulative negative net migration, while **Afghanistan** has the smallest.

## 🌍 Geographic Analysis

In [15]:
df_sorted_by_years = df.sort_values(by="Year")

#### Choropleth Map on Population

In [16]:
fig_population_map = px.choropleth(
    df_sorted_by_years,
    locations="Country",
    locationmode="country names",
    color="total_population",
    hover_name="Country",
    animation_frame="Year",
    color_continuous_scale="Viridis", 
    title="Total Population by Country",
    labels={"total_population": "Total Population"}
)

fig_population_map.update_layout(
    geo=dict(bgcolor="black"), 
    paper_bgcolor="black",      
    font_color="white",        
)

fig_population_map.show()

#### Choropleth Map on Net Migration

In [17]:
fig_migration_map = px.choropleth(
    df_sorted_by_years,
    locations="Country",
    locationmode="country names",
    color="net_migration",
    hover_name="Country",
    animation_frame="Year",
    color_continuous_scale="RdBu",  
    title="Net Migration by Country",
    labels={"net_migration": "Net Migration"}
)

fig_migration_map.update_layout(
    geo=dict(bgcolor="black"),  
    paper_bgcolor="black",     
    font_color="white",        
)

fig_migration_map.show()

This map shows the net migration of the five countries over time.

## ⚖️ Comparative Analysis

#### Total Population vs. Net Migration (over years)

In [18]:
fig_line = px.line(
    df,
    x="Year",
    y=["total_population", "net_migration"],
    title="Comparative Line Plot: Total Population vs Net Migration",
    labels={"total_population": "Total Population", "net_migration": "Net Migration"},
    line_shape="linear"
)

fig_line.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_line.show()
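
Total population and net migration differ by several orders of magnitude, so on a shared axis the migration series appears almost flat. One option, sketched here, is to express net migration per 1,000 inhabitants before comparing. The three rows are copied from the `df.sample(10)` output shown earlier; the column name `net_migration_per_1000` is an illustrative choice.

```python
import pandas as pd

# Three rows copied from the df.sample(10) output earlier in the notebook
rows = pd.DataFrame(
    {
        "Country": ["Afghanistan", "India", "Sri Lanka"],
        "Year": [2001, 2019, 2016],
        "total_population": [19688630.0, 1383112000.0, 21425490.0],
        "net_migration": [-192286.0, -593495.0, -98896.0],
    }
)

# Net migration relative to population size, per 1,000 inhabitants
rows["net_migration_per_1000"] = rows["net_migration"] / rows["total_population"] * 1000

print(rows[["Country", "Year", "net_migration_per_1000"]].round(2))
```

India's outflow is the largest of the three in absolute terms but the smallest relative to its population (roughly -0.43 per 1,000, against roughly -9.77 for Afghanistan in 2001).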

In [19]:
fig_bar = px.bar(
    df,
    x="Year",
    y=["total_population", "net_migration"],
    title="Comparative Bar Plot: Total Population vs Net Migration by Year",
    labels={"total_population": "Total Population", "net_migration": "Net Migration"},
)

fig_bar.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_bar.show()

#### Distribution of Total Population vs Net Migration (by year)

In [20]:
fig_box = px.box(
    df,
    x="Year",
    y="total_population",
    title="Box Plot: Distribution of Total Population by Year",
    labels={"total_population": "Total Population"},
)

fig_box.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_box.show()

fig_box_migration = px.box(
    df,
    x="Year",
    y="net_migration",
    title="Box Plot: Distribution of Net Migration by Year",
    labels={"net_migration": "Net Migration"},
)

fig_box_migration.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_box_migration.show()

#### Relationship between Total Population and Net Migration according to Time

In [21]:
fig_scatter = px.scatter(
    df,
    x="total_population",
    y="net_migration",
    color="Year",
    title="Scatter Plot: Total Population vs Net Migration",
    labels={"total_population": "Total Population", "net_migration": "Net Migration"},
    color_continuous_scale="Viridis"
)

fig_scatter.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_scatter.show()

#### Relationship between Total Population and Net Migration according to Country

In [22]:
fig_scatter = px.scatter(
    df,
    x="total_population",
    y="net_migration",
    color="Country",
    title="Scatter Plot: Total Population vs Net Migration",
    labels={"total_population": "Total Population", "net_migration": "Net Migration"},
)

fig_scatter.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_scatter.show()

## 📊 Distribution Analysis

#### Distribution of Population

In [23]:
fig_hist_population = px.histogram(
    df,
    x="total_population",
    title="Histogram: Distribution of Total Population",
    labels={"total_population": "Total Population"},
    nbins=40, 
)

fig_hist_population.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_hist_population.show()

In [24]:
fig_kde_population = px.density_contour(
    df,
    x="total_population",
    title="KDE Plot: Distribution of Total Population",
    labels={"total_population": "Total Population"},
)

fig_kde_population.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_kde_population.show()

In [25]:
fig_violin_population = px.violin(
    df,
    x="Year",
    y="total_population",
    box=True,
    points="all",
    color="Country",
    title="Violin Plot: Distribution of Total Population by Year",
    labels={"total_population": "Total Population", "Year": "Year"},
)

fig_violin_population.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_violin_population.show()

#### Distribution of Net Migration

In [26]:
fig_hist_migration = px.histogram(
    df,
    x="net_migration",
    title="Histogram: Distribution of Net Migration",
    labels={"net_migration": "Net Migration"},
    nbins=40,
)

fig_hist_migration.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_hist_migration.show()

In [27]:
fig_kde_migration = px.density_contour(
    df,
    x="net_migration",
    title="KDE Plot: Distribution of Net Migration",
    labels={"net_migration": "Net Migration"},
)

fig_kde_migration.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_kde_migration.show()

In [28]:
fig_violin_migration = px.violin(
    df,
    x="Year",
    y="net_migration",
    box=True,
    points="all",
    color="Country",
    title="Violin Plot: Distribution of Net Migration by Year",
    labels={"net_migration": "Net Migration", "Year": "Year"},
)

fig_violin_migration.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_violin_migration.show()

In [29]:
fig_violin = px.violin(
    df,
    y="net_migration",
    x="Country",
    color="Country",
    box=True,
    points="all",
    title="Distribution of Net Migration across Countries",
    labels={"net_migration": "Net Migration"},
)

fig_violin.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_violin.show()

## 📅 Year-on-Year Changes

In [30]:
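# Order rows by year within each country so pct_change compares each year with the previous one
df = df.sort_values(["Country", "Year"])
df["yoy_change_population"] = df.groupby("Country")["total_population"].pct_change() * 100
df["yoy_change_migration"] = df.groupby("Country")["net_migration"].pct_change() * 100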
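
Percentage change is hard to interpret for a series such as net migration that can be negative or cross zero, because the sign of the denominator flips the sign of the result. The sketch below, on hypothetical toy values, compares `pct_change` with the absolute year-on-year difference from `diff`.

```python
import pandas as pd

# Hypothetical toy data: net migration rises by 150, then falls by 75
toy = pd.DataFrame(
    {
        "Country": ["A", "A", "A"],
        "Year": [2000, 2001, 2002],
        "net_migration": [-100.0, 50.0, -25.0],
    }
).sort_values(["Country", "Year"])

grouped = toy.groupby("Country")["net_migration"]
toy["yoy_pct"] = grouped.pct_change() * 100  # relative change, in percent
toy["yoy_diff"] = grouped.diff()             # absolute change, in persons

print(toy[["Year", "net_migration", "yoy_pct", "yoy_diff"]])
```

Both steps report -150% even though the first is an increase and the second a decrease; the `yoy_diff` column (+150, then -75) keeps the direction readable.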

#### Year-on-Year Change in Total Population

In [31]:
fig_line_population = px.line(
    df,
    x="Year",
    y="yoy_change_population",
    color="Country",
    title="Year-on-Year Change in Total Population",
    labels={"yoy_change_population": "YoY Change in Total Population (%)", "Year": "Year"},
)

fig_line_population.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_line_population.show()

In [32]:
fig_bar_population = px.bar(
    df,
    x="Year",
    y="yoy_change_population",
    color="Country",
    title="Year-on-Year Change in Total Population",
    labels={"yoy_change_population": "YoY Change in Total Population (%)", "Year": "Year"},
)

fig_bar_population.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_bar_population.show()

#### Year-on-Year Change in Net Migration

In [33]:
fig_line_migration = px.line(
    df,
    x="Year",
    y="yoy_change_migration",
    color="Country",
    title="Year-on-Year Change in Net Migration",
    labels={"yoy_change_migration": "YoY Change in Net Migration (%)", "Year": "Year"},
)

fig_line_migration.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_line_migration.show()

In [34]:
fig_bar_migration = px.bar(
    df,
    x="Year",
    y="yoy_change_migration",
    color="Country",
    title="Year-on-Year Change in Net Migration",
    labels={"yoy_change_migration": "YoY Change in Net Migration (%)", "Year": "Year"},
)

fig_bar_migration.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_bar_migration.show()

## 🔗 Relationship Between Variables

In [35]:
fig_scatter = px.scatter(
    df,
    x="total_population",
    y="net_migration",
    color="Country",
    title="Relationship between Total Population and Net Migration",
    labels={"total_population": "Total Population", "net_migration": "Net Migration"},
)

fig_scatter.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_scatter.show()

In [36]:
fig_bubble = px.scatter(
    df,
    x="total_population",
    y="net_migration",
    size="Year", 
    color="Country",
    title="Relationship between Total Population, Net Migration, and Year (Bubble Plot)",
    labels={"total_population": "Total Population", "net_migration": "Net Migration", "Year": "Year"},
)

fig_bubble.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_bubble.show()

In [37]:
corr_matrix = df[["total_population", "net_migration", "Year"]].corr()

fig_heatmap = go.Figure(data=go.Heatmap(
    z=corr_matrix.values,
    x=corr_matrix.columns,
    y=corr_matrix.columns,
    colorscale="Viridis"
))

fig_heatmap.update_layout(
    title="Correlation Matrix between Total Population, Net Migration, and Year",
    xaxis=dict(title="Variables"),
    yaxis=dict(title="Variables"),
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_heatmap.show()
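
A correlation computed over all countries pooled together mixes differences between countries with trends within each country. As a hedged sketch, the same correlation can be computed per country; the frame `toy` is hypothetical and built so that the two views disagree.

```python
import pandas as pd

# Hypothetical toy data: opposite relationships in the two countries
toy = pd.DataFrame(
    {
        "Country": ["A", "A", "A", "B", "B", "B"],
        "total_population": [1.0, 2.0, 3.0, 100.0, 200.0, 300.0],
        "net_migration": [-1.0, -2.0, -3.0, 1.0, 2.0, 3.0],
    }
)

# Pearson correlation over all rows at once
pooled = toy["total_population"].corr(toy["net_migration"])

# Pearson correlation computed separately for each country
per_country = {
    country: group["total_population"].corr(group["net_migration"])
    for country, group in toy.groupby("Country")
}

print("Pooled:", round(pooled, 3))
print("Per country:", {c: round(v, 3) for c, v in per_country.items()})
```

In the toy data country `A` is perfectly negatively correlated and country `B` perfectly positively correlated, which a single pooled coefficient cannot show.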

In [38]:
corr_matrix = df[['total_population', 'net_migration']].corr()

fig_heatmap = px.imshow(
    corr_matrix,
    text_auto=True,
    title="Correlation Heatmap between Total Population and Net Migration",
    labels={"x": "Variables", "y": "Variables"}
)

fig_heatmap.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_heatmap.show()

In [39]:
fig_3d = go.Figure(data=[go.Scatter3d(
    x=df["total_population"],
    y=df["net_migration"],
    z=df["Year"],
    mode="markers",
    marker=dict(
        size=12,
        color=df["Year"],
        colorscale="Viridis",
        opacity=0.8
    ),
    text=df["Country"],
)])

fig_3d.update_layout(
    title="3D Scatter Plot: Total Population, Net Migration, and Year",
    scene=dict(
        xaxis_title="Total Population",
        yaxis_title="Net Migration",
        zaxis_title="Year"
    ),
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_3d.show()

## 🚨 Anomalies and Outliers

#### Outliers in Total Population 

In [40]:
fig_box = px.box(
    df,
    y="total_population",
    color="Country",
    title="Box Plot of Total Population (Identifying Outliers)",
    labels={"total_population": "Total Population"}
)

fig_box.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_box.show()

#### Outliers in Net Migration

In [41]:
fig_box_migration = px.box(
    df,
    y="net_migration",
    color="Country",
    title="Box Plot of Net Migration (Identifying Outliers)",
    labels={"net_migration": "Net Migration"}
)

fig_box_migration.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_box_migration.show()

In [42]:
from scipy.stats import zscore

df['z_total_population'] = zscore(df['total_population'])
df['z_net_migration'] = zscore(df['net_migration'])

outliers_population = df[df['z_total_population'].abs() > 3]
outliers_migration = df[df['z_net_migration'].abs() > 3]

fig_scatter_outliers = px.scatter(
    df,
    x="total_population",
    y="net_migration",
    color="Country",
    title="Scatter Plot with Anomalies (Z-Score > 3)",
    labels={"total_population": "Total Population", "net_migration": "Net Migration"}
)

fig_scatter_outliers.add_scatter(
    x=outliers_population['total_population'],
    y=outliers_population['net_migration'],
    mode='markers',
    marker=dict(color='red', size=12, symbol='x'),
    name='Outliers (Population)'
)

fig_scatter_outliers.add_scatter(
    x=outliers_migration['total_population'],
    y=outliers_migration['net_migration'],
    mode='markers',
    marker=dict(color='yellow', size=12, symbol='x'),
    name='Outliers (Migration)'
)

fig_scatter_outliers.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_scatter_outliers.show()
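
The z-scores above are computed over all five countries together, so values from the largest country can dominate. A minimal sketch of the alternative, standardising within each country via `groupby(...).transform`, is shown on hypothetical toy data; the helper `zscore_within_group` is illustrative and uses the population standard deviation (`ddof=0`), matching the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Hypothetical toy data: one large-scale and one small-scale country
toy = pd.DataFrame(
    {
        "Country": ["Big"] * 4 + ["Small"] * 4,
        "net_migration": [-1000.0, -1100.0, -900.0, -1000.0, -10.0, -12.0, -8.0, -90.0],
    }
)


def zscore_within_group(values: pd.Series) -> pd.Series:
    """Standardise a series with its own mean and population standard deviation."""
    return (values - values.mean()) / values.std(ddof=0)


# Each value is compared only with other values from the same country
toy["z_within_country"] = toy.groupby("Country")["net_migration"].transform(zscore_within_group)

print(toy.round(2))
```

Here the unusual value for `Small` (-90) gets the largest absolute z-score, even though it is tiny next to every value from `Big`.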

In [43]:
Q1_population = df['total_population'].quantile(0.25)
Q3_population = df['total_population'].quantile(0.75)
IQR_population = Q3_population - Q1_population

lower_bound_population = Q1_population - 1.5 * IQR_population
upper_bound_population = Q3_population + 1.5 * IQR_population

outliers_iqr_population = df[(df['total_population'] < lower_bound_population) | (df['total_population'] > upper_bound_population)]

Q1_migration = df['net_migration'].quantile(0.25)
Q3_migration = df['net_migration'].quantile(0.75)
IQR_migration = Q3_migration - Q1_migration

lower_bound_migration = Q1_migration - 1.5 * IQR_migration
upper_bound_migration = Q3_migration + 1.5 * IQR_migration

outliers_iqr_migration = df[(df['net_migration'] < lower_bound_migration) | (df['net_migration'] > upper_bound_migration)]

fig_scatter_iqr = px.scatter(
    df,
    x="total_population",
    y="net_migration",
    color="Country",
    title="Scatter Plot with IQR Outliers",
    labels={"total_population": "Total Population", "net_migration": "Net Migration"}
)

fig_scatter_iqr.add_scatter(
    x=outliers_iqr_population['total_population'],
    y=outliers_iqr_population['net_migration'],
    mode='markers',
    marker=dict(color='red', size=12, symbol='x'),
    name='IQR Outliers (Population)'
)

fig_scatter_iqr.add_scatter(
    x=outliers_iqr_migration['total_population'],
    y=outliers_iqr_migration['net_migration'],
    mode='markers',
    marker=dict(color='orange', size=12, symbol='x'),
    name='IQR Outliers (Migration)'
)

fig_scatter_iqr.update_layout(
    plot_bgcolor="black",
    paper_bgcolor="black",
    font_color="white",
)

fig_scatter_iqr.show()
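
The IQR rule is applied twice above with the same steps, so it can be wrapped in a small helper. The function name `iqr_outlier_mask` and the toy series are hypothetical; the logic mirrors the 1.5 × IQR bounds used in the cell above.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical toy series with one obvious outlier
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], name="net_migration")
mask = iqr_outlier_mask(toy)

print(toy[mask].tolist())  # [100.0]
```

On the real data, `df[iqr_outlier_mask(df["net_migration"])]` would select the same kind of rows as `outliers_iqr_migration`.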

# 💬 Thank You for Exploring!

I hope this notebook provided valuable insights into the dynamics of population and migration through advanced visualizations and analysis. Your journey here reflects a shared passion for uncovering stories hidden within data.

If you found this work helpful or have suggestions for improvement, feel free to leave feedback. Together, we can make data exploration even more impactful. 🌟

Happy Analyzing! 🚀

### Muhammad Hassan Saboor