# Example to gather basic statistics on world population

Preparation to run this analysis
* Upload data to Google Drive
* Download the notebook from GitHub Classroom to your computer and upload it to Colab
* Select a compute resource

YOU ARE NOW ALL SET TO RUN THE ANALYSIS!



#### Define problem statement

**Problem statement**

I am a United Nations policy maker who wants basic statistics about the world population from the latest data available. This information will help me make recommendations for resource allocation.

Notes:
When the objectives of the task are clear, the analyst can do an efficient job of extracting the relevant data.




**Disclaimer:**

The analysis provided in this workshop is not exhaustive. Please use it as a guide to the data analysis concepts taught in the course (problem statement, data collection, cleaning, and analysis). Here we have used data from the latest available date. In a real scenario, the analyst would want to look at more recent data and/or apply other mathematical or statistical methods to help answer the question.

For the purposes of this course, let us assume that the year is 2021.

*Instructions*
* We will be using Python to run the code
* **We want to focus on the concepts of analyzing data**
    * If you have prior experience in Python, please feel free to try additional questions as we walk through the course.
    * If you are new to Python, you will not be asked to write Python code. The code is provided. A novice usually finds it difficult to analyze the data and follow the code at the same time. We encourage you to focus on the concepts (and the outcome of the analysis) rather than the code. At your leisure, we encourage you to implement the analysis in your tool of choice (such as Excel).
* For those interested in pursuing a career in data analysis, the code we present below is more advanced than an introductory Python course. We present popular tools and code patterns in the hope that they will help you accelerate your learning journey.

*About this course*
* The purpose of this data analysis is to understand basic trends in world population. We want to explore data that help answer the questions "What", "Where", or "When". In this course, we focus on introductory data analysis.
* Advanced data analysis (not covered here) would ask "Why" and would also help produce forecasts for the future.
* Any data analysis starts with the basic questions.

Data sources:
United Nations, World Population Prospects (2022)


Import packages

In [None]:
import pandas as pd
import numpy as np


#### Select data

In [None]:
# Import data into Colab by clicking on the "Files" icon and then clicking the "Upload" button. Select the CSV from your computer.

# If the file is in Google Drive instead, the call would look like this:
# pd.read_csv("/content/drive/MyDrive/population-and-demography.csv")

In [None]:
# The population data contain rows for countries, several groups of countries, and the world
pg_data = pd.read_csv("/content/population-and-demography.csv")
pg_data.head()

Unnamed: 0,Country name,Year,Population,Population of children under the age of 1,Population of children under the age of 5,Population of children under the age of 15,Population under the age of 25,Population aged 15 to 64 years,Population older than 15 years,Population older than 18 years,...,Population aged 15 to 19 years,Population aged 20 to 29 years,Population aged 30 to 39 years,Population aged 40 to 49 years,Population aged 50 to 59 years,Population aged 60 to 69 years,Population aged 70 to 79 years,Population aged 80 to 89 years,Population aged 90 to 99 years,Population older than 100 years
0,Afghanistan,1950,7480464,301735.0,1248282,3068855,4494349,4198587,4411609,3946595,...,757113,1241348,909953,661807,467170,271905,92691,9499,123,0.0
1,Afghanistan,1951,7571542,299368.0,1246857,3105444,4552138,4250002,4466098,3993640,...,768616,1260904,922765,667015,468881,273286,94358,10155,118,0.0
2,Afghanistan,1952,7667534,305393.0,1248220,3145070,4613604,4303436,4522464,4041439,...,781411,1280288,935638,672491,470898,274852,96026,10721,139,0.0
3,Afghanistan,1953,7764549,311574.0,1254725,3186382,4676232,4356242,4578167,4088379,...,794308,1298803,948321,678064,472969,276577,97705,11254,166,0.0
4,Afghanistan,1954,7864289,317584.0,1267817,3231060,4741371,4408474,4633229,4136116,...,806216,1316768,961484,684153,475117,278210,99298,11793,190,0.0


*In the real world, the raw data are ingested into the raw layer or the "Bronze layer"*

#### Clean data

##### Transform the datasets to meet our needs
* Normally, the data are merged with other data or grouped to meet the needs of the analysis. In this first exercise, we work with a single dataset.
* The population data that we need are present in the ingested data. We will analyze the data by country.

##### Clean data and understand a summary of the data
*   Remove null values
*   Show a summary of the data


In [None]:
# The number of rows before removing null values
print(f"Number of rows is {pg_data.shape[0]}")

Number of rows is 18288


In [None]:
pg_data = pg_data.dropna()
# The number of rows after removing null values
print(f"Number of rows is {pg_data.shape[0]}")
pg_data.head()

Number of rows is 18288


Unnamed: 0,Country name,Year,Population,Population of children under the age of 1,Population of children under the age of 5,Population of children under the age of 15,Population under the age of 25,Population aged 15 to 64 years,Population older than 15 years,Population older than 18 years,...,Population aged 15 to 19 years,Population aged 20 to 29 years,Population aged 30 to 39 years,Population aged 40 to 49 years,Population aged 50 to 59 years,Population aged 60 to 69 years,Population aged 70 to 79 years,Population aged 80 to 89 years,Population aged 90 to 99 years,Population older than 100 years
0,Afghanistan,1950,7480464,301735.0,1248282,3068855,4494349,4198587,4411609,3946595,...,757113,1241348,909953,661807,467170,271905,92691,9499,123,0.0
1,Afghanistan,1951,7571542,299368.0,1246857,3105444,4552138,4250002,4466098,3993640,...,768616,1260904,922765,667015,468881,273286,94358,10155,118,0.0
2,Afghanistan,1952,7667534,305393.0,1248220,3145070,4613604,4303436,4522464,4041439,...,781411,1280288,935638,672491,470898,274852,96026,10721,139,0.0
3,Afghanistan,1953,7764549,311574.0,1254725,3186382,4676232,4356242,4578167,4088379,...,794308,1298803,948321,678064,472969,276577,97705,11254,166,0.0
4,Afghanistan,1954,7864289,317584.0,1267817,3231060,4741371,4408474,4633229,4136116,...,806216,1316768,961484,684153,475117,278210,99298,11793,190,0.0


The row count is unchanged, so no null values were present.
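Comparing row counts before and after `dropna()` shows whether anything was removed, but not where the gaps were. A minimal sketch of a more direct check follows, assuming a small hypothetical frame named `toy` with made-up values and one planted `NaN`. It is an illustration only, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with made-up values; the NaN is planted on purpose.
toy = pd.DataFrame(
    {
        "Country name": ["A", "B", "C"],
        "Year": [2021, 2021, 2021],
        "Population": [100.0, np.nan, 300.0],
    }
)

# Count missing values per column before deciding to drop anything.
null_counts = toy.isna().sum()
print(null_counts)

# Show exactly which rows dropna() would remove.
rows_with_nulls = toy[toy.isna().any(axis=1)]
print(rows_with_nulls)
```

Running the same two checks on `pg_data` would be expected to report zero missing values in every column, which is consistent with the unchanged row count above.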

*Tell me about this data*

In [None]:
pg_data.describe()

Unnamed: 0,Year,Population,Population of children under the age of 1,Population of children under the age of 5,Population of children under the age of 15,Population under the age of 25,Population aged 15 to 64 years,Population older than 15 years,Population older than 18 years,Population at age 1,...,Population aged 15 to 19 years,Population aged 20 to 29 years,Population aged 30 to 39 years,Population aged 40 to 49 years,Population aged 50 to 59 years,Population aged 60 to 69 years,Population aged 70 to 79 years,Population aged 80 to 89 years,Population aged 90 to 99 years,Population older than 100 years
count,18288.0,18288.0,18288.0,18288.0,18288.0,18288.0,18288.0,18288.0,18288.0,18288.0,...,18288.0,18288.0,18288.0,18288.0,18288.0,18288.0,18288.0,18288.0,18288.0,18288.0
mean,1985.5,126470400.0,3133497.0,14825710.0,41095230.0,63762600.0,77429510.0,85372100.0,78196240.0,3011213.0,...,11782260.0,20872880.0,17158700.0,13622140.0,10177070.0,6801757.0,3618710.0,1195799.0,142784.4,3107.718068
std,20.783173,588851200.0,14167010.0,67384370.0,188417000.0,294251900.0,367651900.0,404866900.0,372017000.0,13662000.0,...,55126040.0,98860990.0,82404600.0,66008220.0,49288480.0,32712920.0,17491540.0,6238308.0,853350.4,20951.566812
min,1950.0,1363.0,25.0,136.0,416.0,623.0,748.0,849.0,752.0,26.0,...,110.0,158.0,137.0,119.0,95.0,64.0,31.0,6.0,0.0,0.0
25%,1967.75,291591.5,6663.75,31995.25,89541.5,139541.5,170263.5,186716.0,166417.5,6473.75,...,26296.5,45050.75,36608.25,27440.25,19649.75,12603.0,6221.0,1818.75,154.75,0.0
50%,1985.5,3833998.0,88352.0,423784.5,1186122.0,1843100.0,2246772.0,2482104.0,2238130.0,85824.0,...,336969.5,609723.5,486290.5,364712.5,264781.5,168417.5,81824.0,20269.5,1468.5,13.0
75%,2003.25,16785460.0,463000.5,2160046.0,5905945.0,9025130.0,9641250.0,10354350.0,9239904.0,440787.5,...,1626211.0,2758738.0,2113149.0,1556334.0,1203386.0,845242.8,436710.0,133380.5,12499.0,163.0
max,2021.0,7909295000.0,139783700.0,690360700.0,2015023000.0,3239281000.0,5132999000.0,5893679000.0,5516283000.0,138478700.0,...,623576100.0,1210493000.0,1165207000.0,976407200.0,851356900.0,598067100.0,330491200.0,131835600.0,22223970.0,593166.0


Notes on the data:
* The maximum population is about 7.9 billion. **This seems odd** for a single country.
* Data are given for each country by year.
* Total population and population by age group are provided.
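When a summary value looks odd, one quick follow-up is to ask which row produced it. The sketch below assumes a hypothetical three-row stand-in named `toy`; the Tokelau and China figures are the 2021 values shown elsewhere in this notebook, and the "World" figure is the rounded maximum from the `describe()` output above.

```python
import pandas as pd

# Hypothetical stand-in for pg_data with three rows only.
toy = pd.DataFrame(
    {
        "Country name": ["Tokelau", "China", "World"],
        "Year": [2021, 2021, 2021],
        # The World value is rounded, as displayed by describe().
        "Population": [1869, 1425893500, 7909295000],
    }
)

# idxmax() returns the index label of the largest value;
# loc[] then retrieves the full row for inspection.
max_row = toy.loc[toy["Population"].idxmax()]
print(max_row["Country name"], max_row["Population"])
```

In this toy frame the largest value belongs to an aggregate entry rather than to a country, which is the kind of finding that motivates the additional cleaning later in the notebook.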

Question:
* What are the minimum, maximum and mean population in 2021?




*In the real world, the transformed (and cleaned) data are stored in the intermediate layer or the "Silver layer". These data are now ready for analytics use.*
*   Additional cleaning and formatting ensure that the data type of each column makes sense.
    * This can include checking that country names are consistent.
    * It can also include data aggregation, e.g., representing minute-level data as hourly. For this data, we do not need to aggregate.
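The checks listed above can be sketched in a few lines. The example assumes a hypothetical frame named `toy` whose values are made up and deliberately stored as strings, plus a made-up minute-level series named `readings`. None of these names or values come from the population dataset.

```python
import pandas as pd

# Hypothetical toy frame: made-up values, stored as strings on purpose.
toy = pd.DataFrame(
    {
        "Country name": [" Niue", "Niue ", "Tokelau"],
        "Year": ["2020", "2021", "2021"],
        "Population": ["1900", "1957", "1869"],
    }
)

# Consistent names: strip stray whitespace so " Niue" and "Niue " match.
toy["Country name"] = toy["Country name"].str.strip()

# Sensible data types: years and populations should be numeric.
toy["Year"] = toy["Year"].astype(int)
toy["Population"] = pd.to_numeric(toy["Population"])
print(toy.dtypes)
print(toy["Country name"].unique())

# Aggregation: two hours of made-up minute-level readings, averaged per hour.
minutes = pd.date_range("2021-01-01 00:00", periods=120, freq="min")
readings = pd.Series(range(120), index=minutes, dtype="float64")
hourly = readings.resample("60min").mean()
print(hourly)
```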

##### Perform additional checks to make sure the data looks alright and do additional cleaning if necessary

*   *Show countries that have greater than China's 2021 population*

Please note the coding pattern
* We will define a condition that we need (see variable `condition`)
* Then we apply that condition on the dataframe

In [None]:
condition = pg_data["Population"]>1425893500
pg_data[condition].head()  # rows whose population exceeds China's 2021 population

Unnamed: 0,Country name,Year,Population,Population of children under the age of 1,Population of children under the age of 5,Population of children under the age of 15,Population under the age of 25,Population aged 15 to 64 years,Population older than 15 years,Population older than 18 years,...,Population aged 15 to 19 years,Population aged 20 to 29 years,Population aged 30 to 39 years,Population aged 40 to 49 years,Population aged 50 to 59 years,Population aged 60 to 69 years,Population aged 70 to 79 years,Population aged 80 to 89 years,Population aged 90 to 99 years,Population older than 100 years
866,Asia (UN),1952,1435002800,53441104.0,224022800,538876300,807658750,838080200,896121900,809027500,...,142534500,236827220,181836320,142478320,99356310,60423012,26108144,6141401,416713,4511.0
867,Asia (UN),1953,1466384300,55347424.0,234208110,555507650,828502800,852104060,910872000,823001860,...,144235760,241922620,184647220,144652830,101525256,61161500,26221600,6093195,412023,4633.0
868,Asia (UN),1954,1498736600,56284908.0,243142940,573178050,849711040,865980900,925554240,837504260,...,145128240,246899860,187728160,147067150,103757150,62098856,26404860,6062275,407669,4325.0
869,Asia (UN),1955,1532902800,57757990.0,251023420,592322200,872123260,880300600,940576830,852483800,...,145856960,251657020,191298400,149476320,105985730,63177212,26682296,6038162,404737,3804.0
870,Asia (UN),1956,1567477800,58110908.0,257773180,611292300,894578940,895319230,956182000,867582140,...,146785280,256619490,195036580,151727950,108235304,64280350,27076954,6021905,398161,3528.0
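The threshold above is typed in by hand. A possible alternative, sketched here on a hypothetical stand-in frame named `toy`, looks up China's 2021 population from the data and reuses it in the condition. The China and Tokelau figures are the ones shown in this notebook, the World figure is the rounded `describe()` maximum, and the Asia (UN) figure is a placeholder rather than the real value.

```python
import pandas as pd

# Hypothetical stand-in for pg_data.
toy = pd.DataFrame(
    {
        "Country name": ["China", "Asia (UN)", "World", "Tokelau"],
        "Year": [2021, 2021, 2021, 2021],
        # Asia (UN) is a placeholder value, not the real figure.
        "Population": [1425893500, 4600000000, 7909295000, 1869],
    }
)

# Step 1: look up the reference value instead of hard-coding it.
is_china_2021 = (toy["Country name"] == "China") & (toy["Year"] == 2021)
china_2021 = toy.loc[is_china_2021, "Population"].iloc[0]

# Step 2: define the condition, then apply it to the dataframe.
condition = toy["Population"] > china_2021
larger = toy.loc[condition, "Country name"].unique().tolist()
print(larger)
```

Deriving the threshold this way keeps the condition correct if the underlying data file is later updated.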


In [None]:
# Some values do not correspond to a country. Let us remove these entries from the data.

# Get all country names whose population exceeds China's 2021 population (about 1.43 billion)
condition = pg_data["Population"]>1425893500
remove_words = list(pg_data[condition]["Country name"].unique())
remove_words

['Asia (UN)',
 'Less developed regions',
 'Less developed regions, excluding China',
 'Less developed regions, excluding least developed countries',
 'Lower-middle-income countries',
 'Upper-middle-income countries',
 'World']

*   *Remove "World" and other non-country entities from the Country name column*
    * Count number of countries

In [None]:
# We can generalize the above words to maximize cleaning
remove_words = [ 'regions',
 'countries',
 'World','UN','SIDS']

In [None]:
# Remove regions and other aggregate entries
pg_data_nona_clean = pg_data.copy()


def remove_non_countries(remove_words, data):
    """Drop rows whose "Country name" contains any of the given words."""
    for word in remove_words:
        # Keep only the rows whose name does not contain the current word
        mask = ~data["Country name"].str.contains(word, regex=False)
        data = data[mask]

        # Number of unique names left after removing this word
        num_of_countries = data["Country name"].nunique()
        print(f"Number of countries in the data is {num_of_countries}")
    return data


pg_data_nona_clean = remove_non_countries(remove_words, pg_data_nona_clean)


Number of countries in the data is 250
Number of countries in the data is 244
Number of countries in the data is 243
Number of countries in the data is 237
Number of countries in the data is 236
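The loop above filters the data once per word. A possible variation, sketched on a hypothetical frame named `toy` that holds a handful of names taken from this notebook, joins the words into a single regular expression and filters in one pass. This is an alternative pattern for illustration, not the method used in the notebook.

```python
import re

import pandas as pd

remove_words = ["regions", "countries", "World", "UN", "SIDS"]

# Hypothetical stand-in with a few of the names seen in the data.
toy = pd.DataFrame(
    {
        "Country name": [
            "Asia (UN)",
            "Less developed regions",
            "Upper-middle-income countries",
            "World",
            "China",
            "Tokelau",
        ]
    }
)

# Escape each word and join with "|" so the pattern means "any of these".
pattern = "|".join(re.escape(word) for word in remove_words)

# The match is case-sensitive, so "UN" does not match a name such as "United ...".
is_aggregate = toy["Country name"].str.contains(pattern, regex=True)
countries_only = toy[~is_aggregate]
print(countries_only["Country name"].tolist())
```

The single-pass version gives up the per-word count printed by the loop, which is useful when you want to see how much each word removes.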


There are still too many countries. Let us check whether non-country entities remain among the entries with small populations.

#### Analyze data

* Data cleaning may also be required during analysis.
* As we discover patterns, we may need to perform additional cleaning.

In [None]:
# countries with a population of at most 100,000 in 2021
condition1 = (pg_data_nona_clean["Population"]<=100000)
condition2 = (pg_data_nona_clean["Year"]==2021)
pg_data_nona_clean[condition1 & condition2]["Country name"].unique()

array(['American Samoa', 'Andorra', 'Anguilla', 'Antigua and Barbuda',
       'Bermuda', 'Bonaire Sint Eustatius and Saba',
       'British Virgin Islands', 'Cayman Islands', 'Cook Islands',
       'Dominica', 'Falkland Islands', 'Faroe Islands', 'Gibraltar',
       'Greenland', 'Guernsey', 'Isle of Man', 'Liechtenstein',
       'Marshall Islands', 'Monaco', 'Montserrat', 'Nauru', 'Niue',
       'Northern Mariana Islands', 'Palau', 'Saint Barthelemy',
       'Saint Helena', 'Saint Kitts and Nevis',
       'Saint Martin (French part)', 'Saint Pierre and Miquelon',
       'San Marino', 'Sint Maarten (Dutch part)', 'Tokelau',
       'Turks and Caicos Islands', 'Tuvalu', 'Wallis and Futuna'],
      dtype=object)

It looks like islands are included in the data as countries.
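The cell above combines two conditions with `&`. Each condition sits in its own parentheses because `&` binds more tightly than comparison operators in Python. The sketch below repeats that pattern on a hypothetical frame named `toy` and compares it with `DataFrame.query`, which expresses the same filter as a string. The 1950 row for Tokelau is a made-up placeholder; the other figures appear in this notebook.

```python
import pandas as pd

# Hypothetical stand-in; the 1950 value is a placeholder, not real data.
toy = pd.DataFrame(
    {
        "Country name": ["Tokelau", "Niue", "China", "Tokelau"],
        "Year": [2021, 2021, 2021, 1950],
        "Population": [1869, 1957, 1425893500, 1500],
    }
)

# Boolean-mask pattern: one parenthesized condition per test, joined with &.
small_2021 = toy[(toy["Population"] <= 100000) & (toy["Year"] == 2021)]

# The same filter written as a query string.
same_result = toy.query("Population <= 100000 and Year == 2021")

print(small_2021["Country name"].tolist())
print(small_2021.equals(same_result))
```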

*Let us find the entities with the smallest populations and confirm whether they are countries.*

In [None]:
condition = pg_data_nona_clean["Year"]==2021
df = pg_data_nona_clean[condition].sort_values("Population").head(15)
df


Unnamed: 0,Country name,Year,Population,Population of children under the age of 1,Population of children under the age of 5,Population of children under the age of 15,Population under the age of 25,Population aged 15 to 64 years,Population older than 15 years,Population older than 18 years,...,Population aged 15 to 19 years,Population aged 20 to 29 years,Population aged 30 to 39 years,Population aged 40 to 49 years,Population aged 50 to 59 years,Population aged 60 to 69 years,Population aged 70 to 79 years,Population aged 80 to 89 years,Population aged 90 to 99 years,Population older than 100 years
16487,Tokelau,2021,1869,35.0,170,545,858,1165,1324,1215,...,171,277,223,207,208,130,67,38,3,0.0
12095,Niue,2021,1957,27.0,145,514,720,1148,1443,1350,...,131,176,222,286,208,235,114,70,1,0.0
5183,Falkland Islands,2021,3786,43.0,212,672,1083,2703,3114,2977,...,217,416,556,658,633,348,174,101,11,0.0
11087,Montserrat,2021,4438,41.0,196,595,1243,3077,3843,3668,...,299,633,490,610,756,574,364,108,9,0.0
13895,Saint Helena,2021,5428,42.0,221,758,1124,3219,4669,4535,...,198,421,532,629,941,898,756,247,47,1.0
14183,Saint Pierre and Miquelon,2021,5905,45.0,267,1015,1561,3880,4889,4658,...,323,481,744,970,979,697,424,227,44,1.0
13823,Saint Barthelemy,2021,10888,87.0,472,1454,2550,8317,9432,9196,...,400,1714,1949,1798,1838,1015,464,224,30,2.0
16991,Tuvalu,2021,11229,259.0,1250,3548,5428,6977,7681,7106,...,947,1876,1549,994,1097,830,289,92,7,0.0
17927,Wallis and Futuna,2021,11654,138.0,753,2714,4366,7419,8940,8299,...,1010,1241,1287,1748,1496,1140,704,283,31,0.0
11519,Nauru,2021,12533,341.0,1711,4863,7026,7378,7670,6963,...,1130,2073,1897,1203,807,424,115,21,0,0.0


Tokelau is a territory, and Niue is considered a country (according to Australian Government sources). Some of the remaining entities may or may not be recognized as countries, but we can still calculate basic statistics such as the mean, minimum, and maximum population. We will use the 236 entities to calculate these statistics.


In [None]:
# Statistics on population
# mean, minimum, maximum in the year 2021
condition = pg_data_nona_clean["Year"]==2021
df = pg_data_nona_clean[condition]

condition = (df["Population"]==df["Population"].min())
min_country,min_population = df[condition][["Country name","Population"]].values[0]

condition = (df["Population"]==df["Population"].max())
max_country,max_population = df[condition][["Country name","Population"]].values[0]

mean_population = df["Population"].mean()

total_population = df["Population"].sum()

print(f"The minimum country population is {min_population} for {min_country}")
print(f"The maximum country population is {max_population} for {max_country}")
print(f"The mean country population is {mean_population}.")
print(f"The total world population is {total_population}.")



The minimum country population is 1869 for Tokelau
The maximum country population is 1425893500 for China
The mean country population is 33513966.868644066.
The total world population is 7909296181.
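The same statistics can also be obtained with `idxmin`, `idxmax`, and `agg`, which avoids building a separate condition for each value. The sketch below assumes a hypothetical four-row frame named `df_2021` built from 2021 figures shown in this notebook, so its mean and total differ from the full 236-entity results.

```python
import pandas as pd

# Hypothetical four-row stand-in for the 2021 data.
df_2021 = pd.DataFrame(
    {
        "Country name": ["Tokelau", "Niue", "Nauru", "China"],
        "Year": [2021, 2021, 2021, 2021],
        "Population": [1869, 1957, 12533, 1425893500],
    }
)

# idxmin()/idxmax() give the row labels of the extremes.
smallest = df_2021.loc[df_2021["Population"].idxmin()]
largest = df_2021.loc[df_2021["Population"].idxmax()]

# agg() computes several statistics in one call.
summary = df_2021["Population"].agg(["min", "max", "mean", "sum"])

print(f"Smallest: {smallest['Country name']} ({int(smallest['Population']):,})")
print(f"Largest: {largest['Country name']} ({int(largest['Population']):,})")
print(summary)
```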


*In the real world, the data that is used for analysis is stored in the Gold layer*


In this case, the data contained in the variable `df` in the above cell is stored.
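As a rough illustration of what storing the data might look like, the sketch below writes a small frame to a CSV file and reads it back. The directory name `gold`, the file name `population_2021.csv`, and the use of a temporary directory are assumptions made for this example. Real Gold layers usually live in a database or data lake rather than in local files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the analysis-ready 2021 data.
df = pd.DataFrame(
    {
        "Country name": ["Tokelau", "Niue", "China"],
        "Year": [2021, 2021, 2021],
        "Population": [1869, 1957, 1425893500],
    }
)

# A temporary directory stands in for the storage location.
with tempfile.TemporaryDirectory() as tmp:
    gold_dir = Path(tmp) / "gold"  # hypothetical layer directory
    gold_dir.mkdir()
    gold_path = gold_dir / "population_2021.csv"  # hypothetical file name

    # index=False keeps the row index out of the stored file.
    df.to_csv(gold_path, index=False)

    # Read the file back to confirm that nothing was lost.
    restored = pd.read_csv(gold_path)

print(restored.equals(df))
```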

##### Interpret results

* Of the 236 countries analyzed, in 2021, Tokelau had the lowest population, at 1,869 people.
* In the same year, China had the highest population, at about 1.4 billion people.
    * Country populations ranged from about 1,900 to 1.4 billion people.
* Some of the entities may not be independent countries. They were still included in the analysis.
* The average country population in 2021 was about 33 million people.
* The total world population in 2021 was about 7.9 billion people.


##### Report
* The problem statement
    * A UN policy maker wants to obtain the basic statistics about the world population.
* Discuss the data selected and its source
    * Population data was used from the source United Nations, World Population Prospects (2022).
    * The data lists countries, regions (or group of countries) and their population.
    * The data also contains population across various age groups.
    * The data contained information for the years 1950 to 2021.
* Discuss the steps used to clean the data and why
    * In this analysis, the total population across all age groups was considered.
    * The data for regions was removed from the analysis.
        * We only keep the data for various countries.
        * This resulted in 236 countries.
        * The entries we removed contain the key words 'regions', 'countries', 'World', 'UN' and 'SIDS'.
            * Including region data would cause the world population to be inaccurate.
    * A step to remove null values in the data was applied. This step did not result in any data loss. No null values were found in the data.
    * Population data discussed above was used in the analysis.
* Discuss the analysis steps
    * What patterns did you discover?
        * Data for the year 2021 was analyzed.
        * The data consisted of countries that were islands.
            * The data pertaining to islands had the lowest population.
    * The lowest, highest, average and total population were calculated.
    * Share the interpretation of your results from above in the report
        * Of the 236 countries analyzed, in 2021, Tokelau had the lowest population, at 1,869 people.
        * In the same year, China had the highest population, at about 1.4 billion people.
            * Country populations ranged from about 1,900 to 1.4 billion people.
        * Some of the entities may not be independent countries. They were still included in the analysis.
        * The average country population in 2021 was about 33 million people.
        * The total world population in 2021 was about 7.9 billion people.
        * <*Add any figures* to your report and explain the figure.>

* Have a meeting with the requestor and discuss your findings. Provide the final data and the report to them.

Question:
Do you think the skew between the minimum and mean values affects the decision of the policy maker?
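One way to explore this question is to compare the mean with the median, since a few very large values pull the mean upward while the median stays near the typical entry. The sketch below assumes a hypothetical five-entity series named `populations` built from 2021 figures shown in this notebook. It illustrates the effect and is not a result for the full dataset.

```python
import pandas as pd

# Hypothetical five-entity sample using 2021 figures from this notebook.
populations = pd.Series(
    [1869, 1957, 11229, 12533, 1425893500],
    index=["Tokelau", "Niue", "Tuvalu", "Nauru", "China"],
)

mean_pop = populations.mean()
median_pop = populations.median()

# A mean far above the median signals a strongly right-skewed distribution.
print(f"Mean:   {mean_pop:,.0f}")
print(f"Median: {median_pop:,.0f}")
print(f"Mean is {mean_pop / median_pop:,.0f} times the median")
```

In this small sample, a single very large country dominates the mean, so the mean says little about a typical entity. The same comparison on the full 2021 data would show how strongly the reported average of about 33 million is shaped by the largest countries.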

*In the real world, the analyzed data are typically sent to a dashboard for consumption.*



#### Summary

*Analytics Workflow*
* We saw how to ingest and transform raw data (the "E" and "T" of ETL).
* We then further cleaned the data and explored it to help answer the questions.
* We discovered the minimum, maximum, total and mean population of countries in 2021.

