# Example: calculating the countries with the highest population growth between 2020 and 2021

Preparation to run this analysis:
* Upload data to Google Drive
* Download this notebook from GitHub Classroom to your computer and upload it to Colab
* Select a compute resource

You are now all set to run the analysis!



#### Problem statement

I am a United Nations policy maker. Now that I have basic statistics about the world population for 2021, I want to know which countries had the highest population growth between 2020 and 2021. This information will help me provide recommendations for resource allocation.

Notes:
When the objectives of the task are clear, the analyst can do an efficient job of extracting the relevant data. 




**Guidance:**

In this workshop, the analysis that we provide below is not exhaustive. Please use it as a guide to learn the concepts of data analysis that were taught in the course (problem statement, data collection, data cleaning and analysis). Here we have used the latest data available to us. In a real scenario, the analyst would want to look at the most recent data. In addition, the analyst may choose a variation of the method shown that would also answer the question.

Let us assume that the year is 2021

*Instructions*
* We will be using Python to run the code
* **We want to focus on the concepts of analyzing data**
    * If you have prior experience in Python, please feel free to try additional questions as we walk through the course.
    * If you are new to Python, you will not be asked to write Python code. The code is provided. A novice usually finds it difficult to analyze the data and follow the code at the same time. We reiterate that you should focus on the concepts (and the outcome of the analysis) rather than the code. At your leisure, we encourage you to implement the analysis in your tool of choice (such as Excel).
* For those interested in pursuing a career in data analysis, the code we present below is more advanced than that of an introductory Python course. We present popular tools and code patterns in the hope that they will help you accelerate your learning journey.

*About this course*
* The purpose of this data analysis is to understand basic trends in World Population. We want to explore the data that will help answer the questions - What, Where or When. In this course, we will focus on introductory data analysis.
* Advanced data analysis (not covered here) would ask questions about the "Why" and would also help produce forecasts for the future.
* Any data analysis always starts with the basic questions. 

Data sources:
United Nations, World Population Prospects (2022)


#### Select data

import packages

In [4]:
import pandas as pd
import plotly.express as px 
import plotly.offline as py
import plotly.graph_objects as go
import numpy as np


In [5]:
# Import data into Colab by clicking on the "Files" icon and then clicking the "Upload" button. Select the CSV from your computer.

# pd.read_csv("/content/drive/MyDrive/population-and-demography.csv")

In [None]:
# population data contains population for countries,  several groups of countries and the world
pg_data = pd.read_csv("/mnt/c/Users/pamaz/Downloads/population-and-demography.csv")
pg_data.head()

*In the real world, the raw data are ingested into the raw layer or the "Bronze layer"*
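
Before cleaning, it can help to confirm that the raw table has the columns the rest of the notebook relies on. The sketch below uses a tiny hand-made table in place of the real CSV. The entity names and numbers are invented for illustration; only the column names `Country name`, `Year` and `Population` are taken from the code in this notebook.

```python
import pandas as pd

# Toy stand-in for the raw CSV; all values are made up for illustration.
raw = pd.DataFrame({
    "Country name": ["Country A", "Country A", "World", "World"],
    "Year": [2020, 2021, 2020, 2021],
    "Population": [100, 110, 1000, 1050],
})

# Columns that the later cells in this notebook expect to find.
expected_columns = ["Country name", "Year", "Population"]
missing_columns = [c for c in expected_columns if c not in raw.columns]

# Count missing values per column and list the years that are covered.
missing_values = raw[expected_columns].isna().sum()
years_covered = sorted(raw["Year"].unique().tolist())

print("Missing columns:", missing_columns)
print("Missing values per column:")
print(missing_values)
print("Years covered:", years_covered)
```

With the real data, the same checks could be run on `pg_data` right after it is loaded.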

#### Clean the data

##### Transform the datasets to meet our needs
* Normally, the data are merged with other data or are grouped to meet the analysis needs. 
* In this exercise, we only need the population table. As a result, we will not use any merge operations.
* We will do additional cleaning. 



Questions that we will answer in this exercise:
* What are the minimum, maximum and mean population in 2021?
    * We will re-calculate using only the population for countries.
* What is the population growth of each country between 2020 and 2021?
* Find the countries with the highest population growth.




*In the real world, the transformed (and cleaned) data are stored in the intermediate layer or the "Silver layer". These data are now ready for analytics use.*
*   Additional data cleaning and formatting are done to ensure that the data type of each column makes sense.
    * This can include checking that country names are consistent.
    * It can also include data aggregation, e.g., representing minute data as hourly data. For this data, we do not need to do aggregation.
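
The checks described above can be sketched on a toy table. This is a minimal illustration, not the cleaning used in this notebook: the table `df_raw` and its values are invented, and it assumes the only problems are stray whitespace in names and numbers stored as text.

```python
import pandas as pd

# Toy data with two typical problems: stray whitespace in names
# and numbers stored as text. All values are invented.
df_raw = pd.DataFrame({
    "Country name": ["Country A", " Country A", "Country B "],
    "Year": ["2020", "2021", "2021"],
    "Population": ["100", "110", "50"],
})

df_clean = df_raw.copy()
# Make names consistent by trimming surrounding whitespace.
df_clean["Country name"] = df_clean["Country name"].str.strip()
# Convert text to numbers; an entry that is not a number raises an error.
df_clean["Year"] = pd.to_numeric(df_clean["Year"])
df_clean["Population"] = pd.to_numeric(df_clean["Population"])

names_before = df_raw["Country name"].nunique()
names_after = df_clean["Country name"].nunique()
print(f"Distinct names before cleaning: {names_before}, after: {names_after}")
print(df_clean.dtypes)
```

A drop in the number of distinct names after trimming is a sign that the same country was spelled in more than one way.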

In [None]:
# We will use the result from the previous exercise to clean our data.
# We will remove entities that do not correspond to a country.

remove_words = ['regions', 'countries', 'World', 'UN']

def remove_non_countries(remove_words, data):
    """Remove rows whose "Country name" contains any of the words in remove_words."""
    for word in remove_words:
        # keep only the rows whose name does not contain the word (case-sensitive)
        mask = ~(data["Country name"].str.contains(word))
        data = data[mask]

        # number of unique entities left after removing this word
        num_of_countries = len(data["Country name"].unique())
        print("Number of countries in the data after removing '{}' is {}".format(word, num_of_countries))
    return data

# work on a copy so that the raw data stay unchanged
pg_data_nona_clean = remove_non_countries(remove_words, pg_data.copy())
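
As a possible variation, the same filter can be applied in a single pass by joining the words into one pattern. The sketch below runs on a toy table; the entity names are invented to resemble group labels and are not taken from the dataset.

```python
import pandas as pd

# Toy entity list; names and numbers are invented for illustration.
toy = pd.DataFrame({
    "Country name": [
        "Country A",
        "World",
        "Less developed regions",
        "High-income countries",
        "Country B",
        "Europe (UN)",
    ],
    "Population": [100, 1000, 600, 400, 50, 300],
})

remove_words = ["regions", "countries", "World", "UN"]

# Join the words into one pattern: "regions|countries|World|UN".
# The match is case-sensitive, so "UN" does not match "United".
pattern = "|".join(remove_words)
is_group = toy["Country name"].str.contains(pattern)

# Keep only the rows that did not match any of the words.
countries_only = toy[~is_group]
print(countries_only["Country name"].tolist())
```

This works here because none of the words contains characters with a special meaning in regular expressions; otherwise each word would need to be escaped first (for example with `re.escape`).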


#### Analyze the data

##### Discover patterns

In [None]:
yearmax = 2021 #Define year of analysis and previous year
previous_year = yearmax - 1 # we will keep the previous year because we want to measure the trend
print(f"The most recent year is {yearmax}")

In [None]:
# Number of unique countries in 2021 and in 2020
num_of_countries_current = len(pg_data_nona_clean[pg_data_nona_clean["Year"]==yearmax]["Country name"].unique())
print("There are {} countries present in the data for {}".format(num_of_countries_current, yearmax))

num_of_countries_p = len(pg_data_nona_clean[pg_data_nona_clean["Year"]==previous_year]["Country name"].unique())
print("There are {} countries present in the data for {}".format(num_of_countries_p, previous_year))

Notes on the above result:

* As before, we have more entities than expected after cleaning, and some of them may not be independent countries.
* The maximum population was high because of the entity "World".
    * Removing it gave a realistic answer for the maximum population.
* We now have 237 countries.
    * Most of these countries are recognized by the international community, and about 42 of them are not. Most of the non-recognized entities have small populations.
  



Please note the coding pattern
* We will define a condition that we need (see variable `condition`)
* Then we apply that condition on the dataframe

*Let us look at the data for 2021, starting with the five least populated countries.*

In [None]:
condition = pg_data_nona_clean["Year"]==2021
df = pg_data_nona_clean[condition].sort_values("Population").head(5)
df
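
The condition-then-filter pattern used in the cell above can be seen step by step on a toy table. The table and its values are invented for illustration.

```python
import pandas as pd

# Toy data; all values are invented.
toy = pd.DataFrame({
    "Country name": ["Country A", "Country A", "Country B", "Country B"],
    "Year": [2020, 2021, 2020, 2021],
    "Population": [100, 110, 50, 45],
})

# Step 1: define the condition. This is a Series of True/False values,
# one per row.
condition = toy["Year"] == 2021

# Step 2: apply the condition. Only the rows marked True are kept.
toy_2021 = toy[condition]

# Conditions can be combined with & (and) and | (or); each part needs
# its own parentheses.
condition_both = (toy["Year"] == 2021) & (toy["Population"] > 100)
large_2021 = toy[condition_both]

print(condition.tolist())
print(toy_2021)
print(large_2021)
```

Keeping the condition in its own variable makes it easy to inspect, reuse, or combine with other conditions.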

In [None]:
# Statistics on population
# mean, minimum, maximum in the year 2021
condition = pg_data_nona_clean["Year"]==2021
df = pg_data_nona_clean[condition]

condition = (df["Population"]==df["Population"].min())
min_country,min_population = df[condition][["Country name","Population"]].values[0]

condition = (df["Population"]==df["Population"].max())
max_country,max_population = df[condition][["Country name","Population"]].values[0]

mean_population = df["Population"].mean()
print(f"The minimum population is {min_population} for country {min_country}")
print(f"The maximum population is {max_population} for country {max_country}")
print(f"The mean population of a country is {mean_population}.")



We obtained the same answers that we obtained in exercise 1. No surprises here!
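
As an alternative sketch, the rows holding the minimum and maximum can be looked up with `idxmin` and `idxmax` instead of building a condition. The toy table below is invented for illustration; if several rows tie, these methods return the first one.

```python
import pandas as pd

# Toy 2021 data; all values are invented.
toy_2021 = pd.DataFrame({
    "Country name": ["Country A", "Country B", "Country C"],
    "Population": [110, 45, 300],
})

# idxmin/idxmax return the row label of the smallest/largest value.
row_min = toy_2021.loc[toy_2021["Population"].idxmin()]
row_max = toy_2021.loc[toy_2021["Population"].idxmax()]
mean_population = toy_2021["Population"].mean()

print(f"Minimum: {row_min['Population']} ({row_min['Country name']})")
print(f"Maximum: {row_max['Population']} ({row_max['Country name']})")
print(f"Mean: {mean_population:.1f}")
```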

*Which country had the highest population growth in 2021?*
*   Here we need to compare each country's 2021 population with that of the previous year. We calculate the increase in the number of people and report the highest increase and the corresponding country.


In [None]:
# Calculate the country with the highest population growth
df = pg_data_nona_clean[(pg_data_nona_clean["Year"]==yearmax) | (pg_data_nona_clean["Year"]==previous_year)]
# for each country calculate change in population from 2020 to 2021
## formula: (2021 population minus 2020 population) 
# declare a function to calculate the formula
def growth(data,yearmax,previous_year):
    p_2020 = data[data["Year"]==previous_year]["Population"].values[0]# get 2020 population
    p_2021 = data[data["Year"]==yearmax]["Population"].values[0]# get 2021 population
    change = p_2021-p_2020
    return change
df_sel = df.groupby("Country name")[["Year","Population"]]\
    .apply(lambda x: growth(x,yearmax,previous_year)).reset_index()
df_sel.rename(columns={0:"growth"},inplace=True)
# get country with the highest growth
country_name = df_sel[df_sel['growth']==df_sel["growth"].max()]\
                    ["Country name"].values[0]
# get its population growth
growth_val = df_sel[df_sel['growth']==df_sel["growth"].max()]\
                    ["growth"].values[0]

print("{} had the highest population growth in {} at {} people.".format(country_name,yearmax,growth_val))
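
The same per-country calculation can also be sketched with a pivot, which places the two years side by side so that the subtraction is a single column operation. This is an alternative under stated assumptions, not the method used in the cell above: the toy table is invented, and the percent column `growth_pct` is an extra that the notebook does not compute.

```python
import pandas as pd

# Toy data; all values are invented.
toy = pd.DataFrame({
    "Country name": ["Country A", "Country A", "Country B", "Country B"],
    "Year": [2020, 2021, 2020, 2021],
    "Population": [100, 110, 50, 45],
})
yearmax = 2021
previous_year = yearmax - 1

# One row per country, one column per year.
wide = toy.pivot(index="Country name", columns="Year", values="Population")

# Absolute growth: later population minus earlier population.
wide["growth"] = wide[yearmax] - wide[previous_year]
# Percent growth: absolute growth relative to the earlier population.
wide["growth_pct"] = 100 * wide["growth"] / wide[previous_year]

# The index holds the country names, so idxmax returns a name.
top_country = wide["growth"].idxmax()
print(wide)
print(f"{top_country} had the highest growth: {wide.loc[top_country, 'growth']}")
```

Absolute and percent growth can rank countries differently: a small country may grow quickly in percent while adding few people.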



We will look at the 10 countries that had the most growth

In [None]:
# Get countries with the 10 highest population growth
df_sel.sort_values("growth",ascending=False).head(10)
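
For reference, pandas also offers `nlargest`, which gives the same top rows as sorting followed by `head`. The toy table below is invented for illustration.

```python
import pandas as pd

# Toy growth table; all values are invented.
toy_growth = pd.DataFrame({
    "Country name": ["Country A", "Country B", "Country C", "Country D"],
    "growth": [10, -5, 25, 3],
})

# nlargest keeps the n rows with the largest values, ordered from
# highest to lowest.
top_2 = toy_growth.nlargest(2, "growth")

# Equivalent result with sort_values followed by head.
top_2_sorted = toy_growth.sort_values("growth", ascending=False).head(2)

print(top_2)
```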


In [None]:
# Calculate the world's total growth in 2021
total_growth = df_sel["growth"].sum()
print(f"The world's population grew by {total_growth} in {yearmax}")
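
One way to sanity-check a summed total is to compare it with the "World" entity in the raw data, which was removed during cleaning. The sketch below shows the idea on an invented toy table in which the two numbers agree by construction; with real data they may differ, for example if some entities are counted twice or are missing.

```python
import pandas as pd

# Toy raw data that still contains the "World" entity; values are invented.
toy_raw = pd.DataFrame({
    "Country name": ["Country A", "Country A", "Country B", "Country B", "World", "World"],
    "Year": [2020, 2021, 2020, 2021, 2020, 2021],
    "Population": [100, 110, 50, 45, 150, 155],
})

# Growth reported directly by the "World" rows.
world = toy_raw[toy_raw["Country name"] == "World"].set_index("Year")["Population"]
world_growth = world.loc[2021] - world.loc[2020]

# Growth obtained by summing the individual countries for each year.
countries = toy_raw[toy_raw["Country name"] != "World"]
totals = countries.groupby("Year")["Population"].sum()
summed_growth = totals.loc[2021] - totals.loc[2020]

print(f"World rows: {world_growth}, sum of countries: {summed_growth}")
```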

*In the real world, the data fit for analysis is stored in the Gold layer*

In this case, the data contained in the variable `df_sel` in the above cell is stored.
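
As a minimal sketch of "storing" the analysis-ready table, the result can be written to a file and read back. The file name `population_growth_2021.csv` and the temporary folder are hypothetical choices for this illustration, and the toy table stands in for `df_sel`; a real Gold layer would typically be a database or a shared storage location.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the growth table; all values are invented.
toy_growth = pd.DataFrame({
    "Country name": ["Country A", "Country B"],
    "growth": [10, -5],
})

# Hypothetical output location: a temporary folder on this machine.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "population_growth_2021.csv")

# index=False leaves the row numbers out of the file.
toy_growth.to_csv(out_path, index=False)

# Read the file back to confirm that the stored data match.
reloaded = pd.read_csv(out_path)
print(reloaded)
```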

*Further, in the real world, the analyzed data are typically sent to a dashboard for consumption. The results are shown and discussed with the person that initiated the request.*



Let us plot the results because pictures are easier to understand

In [None]:
# Plot 20 countries with the most population growth and their corresponding growth
df1 = df_sel.sort_values(by = ["growth"],ascending = False).reset_index().head(20)

fig = px.bar(df1,
             x = 'Country name',
             y = "growth",
             color = "growth",
             color_continuous_scale = 'balance',
             labels = {"growth":"Growth"}
            )
fig.update_layout(title = 'Top 20 countries with the highest growth in 2021',
                  title_x = 0.5,
                  title_font = dict(size = 16, color = 'DarkBlue'),
                  xaxis = dict(title = 'Country'),
                  yaxis = dict(title = 'Growth ')
                 
                 )
fig.show()

In [None]:
# Show the 10 countries with the most population growth on a map
df = df_sel.copy()

df = df[["growth","Country name"]].\
    sort_values("growth",ascending=False).head(10)

fig = px.choropleth(df,
                    locations = 'Country name',
                    locationmode = 'country names',
                    color = 'growth',
                    hover_name = 'Country name',
                    color_continuous_scale = 'twilight',
                    labels = {"growth":"growth"}
                   
                   )

fig.update_layout(title = '2021',
                  title_font = dict(size = 30, color = 'DarkSlateBlue'),
                  title_x = 0.5,
                  geo = dict(showframe = False,
                             showcoastlines = False,
                             projection_type = 'equirectangular'))

fig.show()

##### Interpret results
* Of the 237 countries analyzed, the countries with the most population growth in 2021 were ... (students are encouraged to provide the answer)
* The population growth ranged from ... 
* About 42 of the 237 countries analyzed are islands or may not be independent or internationally recognized. These countries have small populations and therefore low growth compared with the highest-growth countries.
* India's population grew the most in 2021, by about 11 million people. Nigeria's population growth was the second highest, at about 5 million people. India's population growth was more than twice that of Nigeria.
* The world's population grew by 68 million in 2021.
* The 20 countries with the highest population growth are in Asia or Africa.




##### Report (use the previous exercise as a guide to write your report.)
* Talk about the problem statement
* Discuss the data selected and its source
* Discuss the steps used to clean the data and why
* Discuss the analysis steps
    * What patterns did you discover?
    * Share the interpretation of your results
    * <Add any figures>
* Have a meeting with the requestor and discuss your findings. Provide the final data and the report to them. 

#### Summary of this exercise

*Analytics Workflow*
* We saw how to ingest raw data and transform it (the E and T of ETL).
* We then further cleaned the data and explored it to help answer the questions. 
* We discovered the countries with the most growth in 2021. 
* We also calculated the world's population growth in 2021. 


#### Next Steps


*Write a report (1 or 2 pages) in Microsoft Word on what you analyzed today. Please email your report in PDF format to aifeatures2000@gmail.com by May 8, 2024. This is an optional exercise.
As you write your report, please draw on the concepts you learnt in the class.*