Lab 1: Data analysis with numpy, review Python

In [None]:
# Name: Shamita Goyal

In this lab you will write code to analyze the population data of countries of the world.<br>

Note:<br>
- <u>Do not use pandas</u> for this lab<br>
- Take advantage of numpy's functions instead of writing loops to access data<br>
If your code contains no loops to access data in the population numpy array, you earn 1 point of extra credit (EC).
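
As a rough illustration of what "no loops" can look like, the sketch below compares a loop with a vectorized column sum. The array `toy` and its column meanings are made up for this example and are not the lab data.

```python
import numpy as np

# Hypothetical 3-country table: column 0 = population, column 1 = growth rate
toy = np.array([
    [1000.0, 1.5],
    [2500.0, -0.2],
    [400.0, 0.7],
])

# Loop version (what the lab asks you to avoid)
loop_total = 0.0
for row in toy:
    loop_total += row[0]

# Vectorized version: slice the whole column, then reduce it in one call
vector_total = toy[:, 0].sum()

print(loop_total, vector_total)
```

Both approaches give the same total; the vectorized form pushes the iteration into numpy's compiled code and is usually shorter to read.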

There are 3 input files for this lab ([source](https://www.census.gov/data-tools/demo/idb/#/table?COUNTRY_YEAR=2024&COUNTRY_YR_ANIM=2024&menu=tableViz&TABLE_YEARS=2023&TABLE_USE_RANGE=N&TABLE_USE_YEARS=Y&TABLE_STEP=1&TABLE_ADD_YEARS=2023)):<br>
- `countries.txt`: contains a list of all countries of the world, one country name per line<br>
- `population.csv`: contains a table of population data. Each row is for one country, and each row has 11 columns: population, growth rate, population density, fertility rate, life expectancy, under-5 mortality rate, ratio of males to females, ratio of youth to working age people, ratio of seniors to working age people, median male age, median female age.<br>
- `headers.csv`: contains the column headers for the 11 columns of `population.csv`.
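
The three files line up by position: row `i` of the population table belongs to country `i`, and column `j` belongs to header `j`. A minimal sketch of that lookup, using invented country names, header names, and values (the real header text is not shown here, so these are assumptions):

```python
import numpy as np

countries = np.array(["Aland", "Borduria", "Cascadia"])  # hypothetical names
headers = np.array(["Population", "Growth Rate"])        # hypothetical headers
table = np.array([
    [1000.0, 1.5],
    [2500.0, -0.2],
    [400.0, 0.7],
])

# np.where returns a tuple of index arrays; [0][0] takes the first match
row = np.where(countries == "Borduria")[0][0]   # row index of the country
col = np.where(headers == "Growth Rate")[0][0]  # column index of the field

value = table[row, col]
print(value)
```

Looking indexes up by name this way avoids hard-coding row or column numbers that would break if the file order changed.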

1. Import modules

In [3]:
import numpy as np
import csv

2. Read in data from the input files.

From `countries.txt` file, read and store all the country names.<br> 
Then __print the number of countries__, with a text explanation of your choice.<br>

From `population.csv` file, read and store all the population data.<br>
Then __print the number of rows and columns of the population table__, with a text explanation of your choice.<br>

From `headers.csv` file, read and store all the column headers.<br>
Then __print the number of headers__, with a text explanation of your choice.<br>

Sample print output:<br>
`Number of countries: 227`<br>
`(Rows,columns) of population data: (227, 11)`<br>
`Number of column headers: 11`

In [4]:
directory = "/Users/shamitagoyal/Desktop/data_science_files/data_files/"

# Read countries.txt: one country name per line
with open(directory + "countries.txt") as file:
    # splitlines() avoids an extra empty entry if the file ends with a newline
    countries_file = np.array(file.read().splitlines())
print("Number of countries:", len(countries_file))


#opening population.csv
population_file = np.loadtxt(directory + "population.csv", delimiter=",", dtype=float)
print("(Rows,columns) of population data:", population_file.shape)

#opening headers.csv
with open(directory + "headers.csv") as file:
    headers_file = csv.reader(file)
    for line in headers_file:
        print("Number of column headers:", len(line))

Number of countries: 227
(Rows,columns) of population data: (227, 11)
Number of column headers: 11


In [None]:
# improved way for headers
headers = np.genfromtxt(directory + "headers.csv", delimiter=",", dtype="str")
print("number of the column headers:", len(headers))

In [16]:
display(population_file)
index = np.where(countries_file == "United States")
population_file[214][1]

array([[3.9232003e+07, 2.2600000e+00, 6.0200000e+01, ..., 5.0000000e+00,
        2.0000000e+01, 1.9800000e+01],
       [3.1016210e+06, 1.9000000e-01, 1.1320000e+02, ..., 2.1600000e+01,
        3.7200000e+01, 3.4300000e+01],
       [4.6286076e+07, 1.6200000e+00, 1.9400000e+01, ..., 1.0700000e+01,
        2.9200000e+01, 2.8600000e+01],
       ...,
       [3.1565602e+07, 1.8300000e+00, 5.9800000e+01, ..., 5.4000000e+00,
        2.1800000e+01, 2.1500000e+01],
       [2.0216029e+07, 2.8600000e+00, 2.7200000e+01, ..., 5.0000000e+00,
        1.8400000e+01, 1.8000000e+01],
       [1.6819805e+07, 1.9900000e+00, 4.3500000e+01, ..., 6.7000000e+00,
        2.1800000e+01, 2.0200000e+01]])

0.68

3. Find the total world population by adding all the country population numbers.<br>
Then __print the total world population__, with your choice of text explanation and with commas separating the thousands in the number.

To print a large number with commas, use the f-string format:  `f'{largeNum:,}'`  where largeNum is the large value
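
A short sketch of how the `,` format option behaves. Summing a float array gives a float, so the plain `:,` format keeps a trailing `.0`; the `large_num` value below is just a sample float chosen for illustration.

```python
large_num = 7982019198.0  # sample float, like the result of summing a float array

as_float = f'{large_num:,}'         # keeps the decimal part
as_rounded = f'{large_num:,.0f}'    # fixed-point with 0 decimal places
as_int = f'{int(large_num):,}'      # convert to int before formatting

print(as_float)
print(as_rounded)
print(as_int)
```

Either `:,.0f` or an `int` conversion drops the `.0` when the value is a whole number of people.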


In [271]:
total_world_pop = population_file[:, 0].sum()  # sum only the population column

print(f'The total world population is: {total_world_pop:,.0f}')

The total world population is: 7,982,019,198


4. Find and __print the number of countries with a positive growth rate__<br>
Then find and __print the number of countries with a negative growth rate__.<br>
Make sure to print a text explanation with each number.
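
One way to count without a loop is to build a boolean mask and count its `True` entries. A minimal sketch with made-up growth rates (not the lab data):

```python
import numpy as np

rates = np.array([2.26, 0.19, -0.35, 1.62, -1.10])  # made-up growth rates

positive_mask = rates > 0  # elementwise comparison gives a boolean array
n_positive = np.count_nonzero(positive_mask)
n_negative = np.count_nonzero(rates < 0)

print("Positive:", n_positive, "Negative:", n_negative)
```

`np.count_nonzero` treats `True` as nonzero, so it counts the matching entries directly. A rate of exactly 0 would fall in neither group.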

In [277]:
grow_rate_col = population_file[:,1]
print("The number of countries with a negative growth rate:", np.count_nonzero(grow_rate_col < 0))
print("The number of countries with a positive growth rate:", np.count_nonzero(grow_rate_col > 0))

The number of countries with a negative growth rate: 37
The number of countries with a positive growth rate: 190


5a. Given your output of step 4, would you expect the median growth rate to be positive or negative?<br>
__Create a RawNB Convert cell to <u>explain</u> your answer__.

5b. Prove that your answer of 5a is correct by __printing the median growth rate__, with a text explanation.
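
The reasoning can be checked on a small example: the median is the middle value of the sorted data, so if more than half of the values are positive, the middle value must be positive too. The rates below are invented for illustration.

```python
import numpy as np

rates = np.array([1.1, -0.5, 2.3, 0.2, 0.8])  # made-up: 4 positive, 1 negative

n_positive = np.count_nonzero(rates > 0)
majority_positive = n_positive > rates.size / 2

median_rate = np.median(rates)  # middle value of [-0.5, 0.2, 0.8, 1.1, 2.3]

print("Majority positive:", majority_positive)
print("Median:", median_rate)
```

Note that the count alone fixes only the sign of the median, not its size.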

In [265]:
print("The median growth rate is:", np.median(grow_rate_col))

The median growth rate is: 0.79


6a. Find the top 12 countries with the highest life expectancy.<br>
Then __print the top 12 country names and life expectancy rate__, sorted in order of lowest to highest rate, in 2-column format.<br>

You should only use the numpy methods discussed in class. Hint: sort the life expectancy and use the index of the sorted array to find the country names in the country array.
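
A small sketch of the hint, using hypothetical names and values. `np.argsort` returns the indexes that would sort the array in ascending order, so the highest values sit at the end of that index array and the lowest at the start.

```python
import numpy as np

names = np.array(["A", "B", "C", "D", "E"])          # hypothetical country names
life_exp = np.array([71.2, 83.4, 59.9, 85.0, 77.3])  # made-up values

order = np.argsort(life_exp)  # indexes that sort life_exp in ascending order
top_2 = order[-2:]            # last entries = highest values, still ascending
bottom_2 = order[:2]          # first entries = lowest values

print(names[top_2], life_exp[top_2])
print(names[bottom_2], life_exp[bottom_2])
```

Because the same `order` array indexes both `names` and `life_exp`, the names and values stay paired.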

In [280]:
# 1. Extract the life expectancy column
life_expec_rate_col = population_file[:,4]

# 2. Sort ascending and keep the 12 highest life expectancies (end of the sorted array)
top_12_rates = np.sort(life_expec_rate_col)[-12:]

# 3. Find the indexes of those 12 highest values
top_12_indexes = np.argsort(life_expec_rate_col)[-12:]

# 4. Find the 12 countries from the indexes
top_12_countries = countries_file[top_12_indexes]

# 5. Make the data values and country names into 2-column format
top_life_expec_table = np.array([top_12_countries, top_12_rates]).T

print("The top 12 country names and life expectancy rate:\n", top_life_expec_table)

The top 12 country names and life expectancy rate:
 [['Malta' '83.4']
 ['Guernsey' '83.4']
 ['Andorra' '83.6']
 ['Iceland' '83.8']
 ['Switzerland' '83.8']
 ['Hong Kong' '83.8']
 ['Canada' '84.0']
 ['San Marino' '84.1']
 ['Japan' '85.0']
 ['Macau' '85.2']
 ['Singapore' '86.5']
 ['Monaco' '89.6']]


In [6]:
indices = np.argsort(population_file[:,4])[-12:]
print("Top 12 life expectancy")
for i in indices:
    print(f'{countries_file[i]:23s}\t{population_file[i,4]}')

Top 12 life expectancy
Malta                  	83.4
Guernsey               	83.4
Andorra                	83.6
Iceland                	83.8
Switzerland            	83.8
Hong Kong              	83.8
Canada                 	84.0
San Marino             	84.1
Japan                  	85.0
Macau                  	85.2
Singapore              	86.5
Monaco                 	89.6


6b. Using the same sorted array in step 6a, find the bottom 12 countries with the lowest life expectancy.<br>
Then __print the bottom 12 country names and life expectancy rate__, sorted in order of lowest to highest rate, in 2-column format.<br>

In [267]:
# 1. Sort the life expectancies from least to greatest and keep the 12 lowest (start of the sorted array)
bottom_12_rates = np.sort(life_expec_rate_col)[:12]

# 2. Do the same with argsort to find the indexes of those 12 lowest values
bottom_12_indexes = np.argsort(life_expec_rate_col)[:12]

# 3. Look up the countries from the indexes
bottom_12_countries = countries_file[bottom_12_indexes]

# 4. Make a 2-column table and store it in a variable
bottom_life_expec_table = np.array([bottom_12_countries, bottom_12_rates]).T

print("The bottom 12 country names and life expectancy rate:\n", bottom_life_expec_table)

The bottom 12 country names and life expectancy rate:
 [['Afghanistan' '54.1']
 ['Central African Republic' '56.0']
 ['Somalia' '56.1']
 ['Mozambique' '57.7']
 ['Sierra Leone' '59.1']
 ['Chad' '59.6']
 ['South Sudan' '59.7']
 ['Lesotho' '59.9']
 ['Eswatini' '60.2']
 ['Niger' '60.5']
 ['Liberia' '61.3']
 ['Nigeria' '61.8']]


7a. Write code to:
- find the 75th, 50th, and 25th percentile of the Median Male and Female Ages (last 2 columns)
- use if statements to __print whether the median US Male age and median US Female age is in the top, 2nd, 3rd, or bottom quartile__.
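
A minimal sketch of the two pieces on made-up ages: `np.percentile` accepts a list of percentiles and returns one value per entry, and an if/elif chain then places a single value relative to those cut points. The sample value `40.0` is an arbitrary choice for illustration.

```python
import numpy as np

ages = np.array([18.0, 22.0, 30.0, 35.0, 44.0])  # made-up median ages

# One call returns the 25th, 50th, and 75th percentiles in that order
q1, q2, q3 = np.percentile(ages, [25, 50, 75])

sample_age = 40.0  # arbitrary value to classify

# Check from the top down so each branch only needs a lower bound
if sample_age >= q3:
    label = "top"
elif sample_age >= q2:
    label = "2nd"
elif sample_age >= q1:
    label = "3rd"
else:
    label = "bottom"

print(q1, q2, q3)
print(f"{sample_age} is in the {label} quartile")
```

When two values need classifying (for example male and female ages), each one needs its own chain, since they may land in different quartiles.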

In [290]:
# 1. Extract the median male age column
male_median_ages_col = population_file[:,9]

# 2. Extract the median female age column
female_median_ages_col = population_file[:,10]

# 3. find the 25th, 50th, & 75th percentile of male and females
q1_male, q2_male, q3_male = np.percentile(male_median_ages_col,[25, 50, 75])
print(f"Male percentiles: 25th: {q1_male}, 50th: {q2_male}, 75th: {q3_male}")
                        
q1_female, q2_female, q3_female = np.percentile(female_median_ages_col,[25, 50, 75])
print(f"Female percentiles: 25th: {q1_female}, 50th: {q2_female}, 75th: {q3_female}\n")

# 4. Look up the US median ages (take the first match as a scalar index)
US_median_index = np.where(countries_file == "United States")[0][0]
US_male = male_median_ages_col[US_median_index]
US_female = female_median_ages_col[US_median_index]

# print(US_male, US_female) # 37.2 39.8


# 5. Determine each quartile separately: the male and female values can fall
#    in different quartiles, so they must not share one if/elif chain
def quartile_of(value, q1, q2, q3):
    if value >= q3:
        return "top"
    elif value >= q2:
        return "2nd"
    elif value >= q1:
        return "3rd"
    else:
        return "bottom"

print(f"US male median age is in the {quartile_of(US_male, q1_male, q2_male, q3_male)} quartile.")
print(f"US female median age is in the {quartile_of(US_female, q1_female, q2_female, q3_female)} quartile.")

Male percentiles: 25th: 25.0, 50th: 32.3, 75th: 41.3
Female percentiles: 25th: 24.25, 50th: 31.3, 75th: 39.25

US male median age is in the 2nd quartile.
US female median age is in the top quartile.


7b. Based on the quartile of your output of 7a, __create a RawNB Convert cell to explain whether the US population is older, younger, or about average compared to the rest of the world__.