# Example to gather basic statistics on world population

Preparation to run this analysis
* Upload data to Google Drive
* Download this notebook from GitHub Classroom to your computer and upload it to Colab
* Select a compute resource

YOU ARE NOW ALL SET TO RUN THE ANALYSIS!



#### Define problem statement

**Problem statement**

I am a United Nations policy maker who wants basic statistics about the world population from the latest data we have available. This information will help me make recommendations for resource allocation.

Notes:
When the objectives of the task are clear, the analyst can do an efficient job of extracting the relevant data. 




**Disclaimer:**

The analysis provided in this workshop is not exhaustive. Please use it as a guide to the data analysis concepts taught in the course (problem statement, data collection, cleaning and analysis). Here we have used data from the latest available date. In a real scenario, the analyst would want to look at more recent data and/or apply other mathematical or statistical methods to help answer the question.

Let us assume that the year is 2021

*Instructions*
* We will be using Python to run the code
* **We want to focus on the concepts of analyzing data**
    * If you have prior experience in Python, please feel free to try additional questions - as we walk through the course
    * If you are new to Python, you will not be asked to write Python code. The code is provided. A novice usually finds it difficult to analyze the data and follow the code at the same time. We reiterate that you should focus on the concepts (and the outcome of the analysis) rather than the code. At your leisure, we encourage you to implement the analysis in your tool of choice (such as Excel).
* For those interested in pursuing a career in data analysis: the code presented below goes beyond an introductory Python course. We present popular tools and code patterns in the hope that they will help you accelerate your learning journey.

*About this course*
* The purpose of this data analysis is to understand basic trends in world population. We want to explore the data that will help answer the questions What, Where or When. In this course, we will focus on introductory data analysis.
* Advanced data analysis (not covered here) would ask "Why" questions and also help produce forecasts for the future.
* Every data analysis starts with the basic questions.

Data sources:
United Nations, World Population Prospects (2022)


import packages

In [3]:
import pandas as pd
import plotly.express as px 
import plotly.offline as py
import plotly.graph_objects as go
import numpy as np


#### Select data

In [4]:
# Import data into Colab by clicking on the "Files" icon and then clicking the "Upload" button. Select the CSV from your computer.


# pd.read_csv("/content/drive/MyDrive/population-and-demography.csv")

In [None]:
# population data contains population for countries,  several groups of countries and the world
pg_data = pd.read_csv("/mnt/c/Users/pamaz/Downloads/population-and-demography.csv")
pg_data.head()

*In the real world, the raw data are ingested into the raw layer or the "Bronze layer"*

#### Clean data

##### Transform the datasets to meet our needs
* Normally, the data are merged with other data or grouped to meet the analysis needs. In this first exercise, we will work with a single dataset.
* The population data that we need is present in the ingested data. Further, we will be analyzing data by country.

##### Clean data and understand a summary of the data
*   Remove null values 
*   Show a summary of the data
 

In [None]:
# The number of rows before removing null values
print(f"Number of rows is {pg_data.shape[0]}")

In [None]:
pg_data = pg_data.dropna()
# The number of rows after removing null values
print(f"Number of rows is {pg_data.shape[0]}")
pg_data.head()

Looks like no null values were present
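
Comparing row counts before and after `dropna()` shows whether rows were lost, but not which columns had gaps. The following is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not the UN data) showing how missing values could be counted per column first.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing population
toy = pd.DataFrame({
    "Country name": ["A", "B", "C"],
    "Year": [2021, 2021, 2021],
    "Population": [100.0, np.nan, 300.0],
})

# Count missing values per column before deciding to drop anything
null_counts = toy.isna().sum()
print(null_counts)

# dropna() removes every row that has at least one missing value
toy_clean = toy.dropna()
print(f"Rows before: {len(toy)}, rows after: {len(toy_clean)}")
```

If every count is zero, as appears to be the case for the population data, `dropna()` leaves the data unchanged.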

*Tell me about this data*

In [None]:
pg_data.describe()

Notes on the data:
* Maximum population is about 7.9 billion. **This seems odd**
* Data is for each country and given by year.
* Total population and population by age groups are provided

Question:
* What are the minimum, maximum and mean population in 2021?




*In the real world, the transformed (and cleaned) data are stored into the intermediate layer or the "Silver layer". This data is now ready for analytics use.*
*   Additional data cleaning and formatting are done to ensure that the data type of each column makes sense.
    * This can include checking that country names are consistent.
    * It can also include data aggregation, e.g., representing minute data as hourly. For this data, we do not need to do aggregation.
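
The type and name checks described above might look like the sketch below. It runs on a small made-up frame (the values, the stray spaces and the text-typed `Year` column are all invented for illustration), so treat it as one possible approach rather than a required step for the population data.

```python
import pandas as pd

# Toy data (hypothetical): years stored as text, names with stray spaces
toy = pd.DataFrame({
    "Country name": [" India", "India ", "Niue"],
    "Year": ["2020", "2021", "2021"],
    "Population": [1000, 1100, 20],
})

# Inspect the data type of each column
print(toy.dtypes)

# Convert the year to an integer so that comparisons like Year == 2021 work
toy["Year"] = toy["Year"].astype(int)

# Strip leading/trailing spaces so the same country is not counted twice
toy["Country name"] = toy["Country name"].str.strip()

names = sorted(toy["Country name"].unique())
print(names)
```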

##### Perform additional checks to make sure the data looks alright and do additional cleaning if necessary

*   *Show entries with a population greater than China's 2021 population*

Please note the coding pattern
* We will define a condition that we need (see variable `condition`)
* Then we apply that condition on the dataframe

In [None]:
condition = pg_data["Population"]>1425893500 
pg_data[condition].head() # find me data whose population is more than china
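
The threshold above is typed in by hand. As a hedged alternative, the value could be looked up from the data itself using the same condition pattern. The sketch below uses a made-up frame with invented numbers, and it assumes the entry is labelled exactly `"China"` in the `Country name` column.

```python
import pandas as pd

# Toy data (hypothetical values, not the real populations)
toy = pd.DataFrame({
    "Country name": ["China", "India", "World", "China"],
    "Year": [2021, 2021, 2021, 2020],
    "Population": [1400, 1390, 7900, 1395],
})

# Step 1: define a condition that picks out China's 2021 row
is_china_2021 = (toy["Country name"] == "China") & (toy["Year"] == 2021)
china_2021 = toy.loc[is_china_2021, "Population"].iloc[0]

# Step 2: apply a second condition that uses the looked-up value
condition = toy["Population"] > china_2021
larger = toy[condition]
print(larger)
```

Looking the value up avoids copy errors and keeps working if the data is refreshed.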

In [None]:
# Some values do not correspond to country. Let us remove these entries from the data.

# get all country names whose population is more than China's 2021 population (about 1.43 billion)
condition = pg_data["Population"]>1425893500
remove_words = list(pg_data[condition]["Country name"].unique())
remove_words

*   *Remove "World" and other non-country entities from the Country name column*
    * Count number of countries

In [11]:
# We can generalize the above words to maximize cleaning
remove_words = [ 'regions',
 'countries',
 'World','UN']

In [None]:
# remove regions and other non-country entries
pg_data_nona_clean = pg_data.copy()
def remove_non_countries(remove_words, pg_data_nona_clean):
    """Drop rows whose "Country name" contains any of the given words."""
    for word in remove_words:
        # True for rows whose name does NOT contain the word
        # (plain, case-sensitive substring match)
        mask = ~pg_data_nona_clean["Country name"].str.contains(word, regex=False)
        pg_data_nona_clean = pg_data_nona_clean[mask]  # keep rows where the mask is True

        # number of unique entities left after removing this word
        num_of_countries = pg_data_nona_clean["Country name"].nunique()
        print(f"After removing '{word}': {num_of_countries} countries remain in the data")
    return pg_data_nona_clean
pg_data_nona_clean = remove_non_countries(remove_words, pg_data_nona_clean)
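
The same filtering can also be done in a single pass by joining the key words into one pattern. This is a sketch on made-up rows (the entity names and numbers below are illustrative and may not match the labels in the UN file exactly), not a replacement for the cell above.

```python
import re
import pandas as pd

# Toy data (hypothetical names and values)
toy = pd.DataFrame({
    "Country name": [
        "World",
        "Less developed regions",
        "High-income countries",
        "Asia (UN)",
        "India",
        "Niue",
        "United States",
    ],
    "Population": [7900, 6600, 1200, 4700, 1400, 2, 330],
})

remove_words = ["regions", "countries", "World", "UN"]

# Build one pattern: "regions|countries|World|UN"
# re.escape guards against words that contain regex special characters
pattern = "|".join(re.escape(word) for word in remove_words)

# The match is case-sensitive, so "UN" does not match "United States"
is_aggregate = toy["Country name"].str.contains(pattern, regex=True)

countries_only = toy[~is_aggregate]
removed = toy[is_aggregate]
print(countries_only["Country name"].tolist())
```

Keeping the removed rows in their own variable makes it easy to review what was dropped and to catch a key word that matches a real country by accident.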


There are still too many countries. Let us check whether non-countries exist among the entries with small populations.

#### Analyze data 

* Data cleaning may also be required during analysis.
* As we discover patterns, we may need to perform additional cleaning.

In [None]:
# countries with a population of 100,000 or less in 2021
condition1 = (pg_data_nona_clean["Population"]<=100000)
condition2 = (pg_data_nona_clean["Year"]==2021)
pg_data_nona_clean[condition1 & condition2]["Country name"].unique()

It looks like islands are included in the data as countries. 

*Let us find the entities with the smallest populations and confirm whether they are countries*

In [None]:
condition = pg_data_nona_clean["Year"]==2021
df = pg_data_nona_clean[condition].sort_values("Population").head(15)
df
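
Sorting and then taking the first rows works well. pandas also offers `nsmallest`, which expresses the same idea in one call. The sketch below uses a made-up frame with invented names and values.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Country name": ["A", "B", "C", "B"],
    "Year": [2021, 2021, 2021, 2020],
    "Population": [50, 10, 30, 5],
})

# Keep only the year of interest, then take the 2 smallest populations
df_2021 = toy[toy["Year"] == 2021]
smallest = df_2021.nsmallest(2, "Population")
print(smallest)
```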


Tokelau is a territory, and Niue is considered a country (as per Australian Government sources). Even though some entities may or may not be recognized as countries, we can still calculate basic statistics such as the mean, minimum and maximum population. We will use all 237 entities to calculate the basic statistics.


In [None]:
# Statistics on population
# mean, minimum, maximum in the year 2021
condition = pg_data_nona_clean["Year"]==2021
df = pg_data_nona_clean[condition]

# what is the minimum population?
condition = (df["Population"]==df["Population"].min())
min_country,min_population = df[condition][["Country name","Population"]].values[0]

# what is the maximum population?
condition = (df["Population"]==df["Population"].max())
max_country,max_population = df[condition][["Country name","Population"]].values[0]

mean_population = df["Population"].mean()

total_population = df["Population"].sum()

print(f"The minimum population is {min_population} for {min_country}")
print(f"The maximum population is {max_population} for {max_country}")
print(f"The mean population per country is {mean_population}.")
print(f"The total world population is {total_population}.")
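
As a hedged alternative to matching rows against the minimum and maximum values, `idxmin` and `idxmax` return the index label of the smallest and largest entry directly. The sketch below uses a made-up frame (names and values are invented) and formats the numbers with thousands separators for readability.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for the 2021 rows
df = pd.DataFrame({
    "Country name": ["A", "B", "C"],
    "Population": [10, 200, 90],
})

# idxmin / idxmax give the index label of the smallest / largest value
min_row = df.loc[df["Population"].idxmin()]
max_row = df.loc[df["Population"].idxmax()]

mean_population = df["Population"].mean()
total_population = df["Population"].sum()

print(f"Minimum: {int(min_row['Population']):,} ({min_row['Country name']})")
print(f"Maximum: {int(max_row['Population']):,} ({max_row['Country name']})")
print(f"Mean: {mean_population:,.0f}")
print(f"Total: {int(total_population):,}")
```

If several rows tie for the minimum or maximum, `idxmin` and `idxmax` return the first one, which matches the `.values[0]` behavior of the cell above.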



##### Interpret results

* Of the 237 countries analyzed, Tokelau had the lowest population in 2021: 1,869 people.
* In the same year, China had the highest population, about 1.4 billion people.
    * Country populations therefore ranged from 1,869 to about 1.4 billion people.
* About 42 of the 237 countries may not be independent. These countries were also included in the analysis.
    * Exercise: is it 42 countries that are not independent?
* The average country population in 2021 was about 33 million people.
* The total world population in 2021 was about 7.9 billion people.


##### Report
* The problem statement
    * A UN policy maker wants to obtain the basic statistics about the world population. 
* Discuss the data selected and its source
    * Population data from the United Nations, World Population Prospects (2022) was used.
    * The data lists countries, regions (or group of countries) and their population. 
    * The data also contains population across various age groups. 
    * The data contained information for various years ranging from about 1950 till 2021. 
* Discuss the steps used to clean the data and why
    * In this analysis, the total population across all age groups was considered. 
    * The data for regions was removed from the analysis. 
        * We only keep the data for various countries.
        * This resulted in 237 countries.
        * The regions we removed contain key words 'regions', 'countries', 'World' and 'UN'. 
            * Including region data would cause the world population to be inaccurate.
    * A step to remove null values in the data was applied. This step did not result in any data loss. No null values were found in the data. 
    * Population data discussed above was used in the analysis.  
* Discuss the analysis steps
    * What patterns were discovered?
        * Data for the year 2021 was analyzed. 
        * The data consisted of countries that were islands. 
            * The data pertaining to islands had the lowest population.
    * The lowest, highest, average and total population were calculated. 
    * Interpretation of the results
        * Of the 237 countries analyzed, Tokelau had the lowest population in 2021: 1,869 people.
        * In the same year, China had the highest population, about 1.4 billion people.
            * Country populations therefore ranged from 1,869 to about 1.4 billion people.
        * About 42 of the 237 countries may not be independent. These countries were also included in the analysis.
        * The average country population in 2021 was about 33 million people.
        * The total world population in 2021 was about 7.9 billion people.
        * *Any relevant figures are added to the report and explained.*

* The analyst meets with the requestor to discuss the findings. The final data and the report are provided to the requestor.

Question:
Do you think the large gap between the minimum and the mean (the skew in the data) affects the policy maker's decision?
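
To reason about this question, it can help to compare the mean with the median on a small skewed sample. The numbers below are made up for illustration and are not taken from the population data.

```python
import pandas as pd

# Hypothetical populations (in thousands): four small values and one very large one
populations = pd.Series([2, 5, 8, 10, 975])

mean_value = populations.mean()      # pulled upward by the single large value
median_value = populations.median()  # the middle value, robust to outliers

print(f"Mean:   {mean_value}")
print(f"Median: {median_value}")
```

Here the mean is 200 while the median is 8, so the mean alone would suggest a "typical" entry far larger than four of the five values. When a few very large countries dominate, reporting the median alongside the mean gives the requestor a fuller picture.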

*In the real world, the data fit for analysis is stored in the Gold layer*


In this case, the data contained in the variable `df` in the above cell is stored.
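
One simple way to store such a result is to write it to a file. The sketch below is only an illustration: it uses a made-up frame, a temporary folder and a hypothetical file name (`population_2021_gold.csv`), whereas a real Gold layer would usually live in a database or data lake.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data (hypothetical values) standing in for the final 2021 data
df = pd.DataFrame({
    "Country name": ["A", "B"],
    "Year": [2021, 2021],
    "Population": [10, 200],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name for the stored, analysis-ready data
    gold_path = Path(tmp_dir) / "population_2021_gold.csv"

    # index=False avoids writing the row numbers as an extra column
    df.to_csv(gold_path, index=False)

    # Read the file back to confirm that the stored data matches
    reloaded = pd.read_csv(gold_path)

print(reloaded)
```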

*Further, in the real world, the analyzed data are typically sent to a dashboard for consumption. The results are shown and discussed with the person that initiated the request.*



#### Summary

*Analytics Workflow*
* We saw how to ingest (extract) raw data and transform it (the E and T of ETL).
* We then further cleaned the data and explored it to help answer the questions.
* We found the minimum, maximum, total and mean population of countries in 2021.

