### Descriptive Statistics: SQL

Location: 'scripts' folder

Lucas Lobo

In [1]:
# Including code to connect to the database:
import sqlite3
import pandas as pd

# Set up connection (already established in separate file, located in the 'data' folder).
conn = sqlite3.connect(r"C:\Users\lcsrl\Downloads\qtm350_project.db")
cursor = conn.cursor()

Here are some insights from the data:

1. Top 10 Countries by Average Life Expectancy across various time periods:

- 1975-2024 (all years in dataset).

- 1980-1989 (80s).

- 2010-2019 (2010s).

In [2]:
q1_a = """
SELECT country_name, AVG(value) AS avg_life_exp
FROM wdi_long_clean
WHERE indicator = 'Life expectancy at birth, total (years)'
GROUP BY country_name
ORDER BY avg_life_exp DESC
LIMIT 10
"""
q1_a_table = pd.read_sql(q1_a, conn)
q1_a_table

| | country_name | avg_life_exp |
|---|---|---|
| 0 | Costa Rica | 76.953 |
| 1 | St. Martin (French part) | 76.319592 |
| 2 | Puerto Rico | 75.974449 |
| 3 | Chile | 75.545469 |
| 4 | Cuba | 75.323898 |
| 5 | Uruguay | 74.259551 |
| 6 | Panama | 73.476163 |
| 7 | Argentina | 73.002898 |
| 8 | Venezuela, RB | 71.32249 |
| 9 | Colombia | 71.205878 |


In [3]:
q1_b = """
SELECT country_name, AVG(value) AS avg_life_exp
FROM wdi_long_clean
WHERE indicator = 'Life expectancy at birth, total (years)'
  AND year IN ('1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989')
GROUP BY country_name
ORDER BY avg_life_exp DESC
LIMIT 10
"""
q1_b_table = pd.read_sql(q1_b, conn)
q1_b_table

| | country_name | avg_life_exp |
|---|---|---|
| 0 | Costa Rica | 74.55 |
| 1 | Cuba | 73.4186 |
| 2 | Puerto Rico | 73.0772 |
| 3 | St. Martin (French part) | 73.016 |
| 4 | Uruguay | 71.7568 |
| 5 | Chile | 71.3751 |
| 6 | Panama | 70.2783 |
| 7 | Argentina | 69.8508 |
| 8 | Venezuela, RB | 69.7803 |
| 9 | Belize | 68.2004 |


In [4]:
q1_c = """
SELECT country_name, AVG(value) AS avg_life_exp
FROM wdi_long_clean
WHERE indicator = 'Life expectancy at birth, total (years)'
  AND year IN ('2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019')
GROUP BY country_name
ORDER BY avg_life_exp DESC
LIMIT 10
"""
q1_c_table = pd.read_sql(q1_c, conn)
q1_c_table

| | country_name | avg_life_exp |
|---|---|---|
| 0 | Costa Rica | 80.0098 |
| 1 | Puerto Rico | 79.9724 |
| 2 | Chile | 79.8613 |
| 3 | St. Martin (French part) | 79.4908 |
| 4 | Cuba | 77.7397 |
| 5 | Panama | 77.4483 |
| 6 | Uruguay | 77.1427 |
| 7 | Argentina | 76.2544 |
| 8 | Ecuador | 76.0896 |
| 9 | Colombia | 75.9536 |
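
The three life-expectancy queries differ only in their year range, so one option is to pass the range as query parameters instead of repeating the SQL. The following is a minimal sketch under stated assumptions: it runs against an in-memory toy copy of `wdi_long_clean` with invented rows, the helper name `top_life_expectancy` is hypothetical, and it assumes `year` is stored as text (hence the `CAST`).

```python
import sqlite3

# Toy stand-in for wdi_long_clean; every row below is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wdi_long_clean "
    "(country_name TEXT, indicator TEXT, year TEXT, value REAL)"
)
LIFE_EXP = "Life expectancy at birth, total (years)"
rows = [
    ("Country A", LIFE_EXP, "1980", 70.0),
    ("Country A", LIFE_EXP, "1985", 72.0),
    ("Country A", LIFE_EXP, "2015", 80.0),
    ("Country B", LIFE_EXP, "1980", 74.0),
    ("Country B", LIFE_EXP, "1985", 76.0),
    ("Country B", LIFE_EXP, "2015", 78.0),
]
conn.executemany("INSERT INTO wdi_long_clean VALUES (?, ?, ?, ?)", rows)


def top_life_expectancy(conn, start_year, end_year, limit=10):
    """Return (country_name, avg_life_exp) pairs for an inclusive year range."""
    query = """
    SELECT country_name, AVG(value) AS avg_life_exp
    FROM wdi_long_clean
    WHERE indicator = ?
      AND CAST(year AS INTEGER) BETWEEN ? AND ?
    GROUP BY country_name
    ORDER BY avg_life_exp DESC
    LIMIT ?
    """
    params = (LIFE_EXP, start_year, end_year, limit)
    return conn.execute(query, params).fetchall()


all_years = top_life_expectancy(conn, 1975, 2024)
eighties = top_life_expectancy(conn, 1980, 1989)
twenty_tens = top_life_expectancy(conn, 2010, 2019)
```

With the real database, the same parameterized string could be handed to `pd.read_sql(query, conn, params=...)` to get a DataFrame rather than a list of tuples.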


2. We'll create a table that shows, for each country, the highest immunization rate recorded for DPT, HepB3, and measles across the years in the dataset. Each cell is therefore that country's maximum rate for the given vaccine. We also include 'agg_immunization_rate', computed as the average of the three maximum rates. Because the rates are stored as integers, the division by 3 in the query truncates the result to a whole number (for example, maxima of 99, 98, and 99 yield 98 rather than 98.67).

In [5]:
q2 = """
SELECT country_name,
       MAX(CASE WHEN indicator = 'Immunization, DPT (% of children ages 12-23 months)' THEN value END) AS max_dpt,
       MAX(CASE WHEN indicator = 'Immunization, HepB3 (% of one-year-old children)' THEN value END) AS max_hepb3,
       MAX(CASE WHEN indicator = 'Immunization, measles (% of children ages 12-23 months)' THEN value END) AS max_measles,
       (MAX(CASE WHEN indicator = 'Immunization, DPT (% of children ages 12-23 months)' THEN value END) +
        MAX(CASE WHEN indicator = 'Immunization, HepB3 (% of one-year-old children)' THEN value END) +
        MAX(CASE WHEN indicator = 'Immunization, measles (% of children ages 12-23 months)' THEN value END)) / 3 AS agg_immunization_rate
FROM wdi_long_clean
WHERE indicator IN (
    'Immunization, DPT (% of children ages 12-23 months)',
    'Immunization, HepB3 (% of one-year-old children)',
    'Immunization, measles (% of children ages 12-23 months)'
)
GROUP BY country_name
ORDER BY agg_immunization_rate DESC
LIMIT 23;
"""
q2_table = pd.read_sql(q2, conn)
q2_table

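| | country_name | max_dpt | max_hepb3 | max_measles | agg_immunization_rate |
|---|---|---|---|---|---|
| 0 | Mexico | 99 | 99 | 99 | 99 |
| 1 | Honduras | 99 | 99 | 99 | 99 |
| 2 | Guyana | 99 | 99 | 99 | 99 |
| 3 | El Salvador | 99 | 99 | 99 | 99 |
| 4 | Cuba | 99 | 99 | 99 | 99 |
| 5 | Brazil | 99 | 99 | 99 | 99 |
| 6 | Panama | 99 | 98 | 99 | 98 |
| 7 | Nicaragua | 98 | 98 | 99 | 98 |
| 8 | Costa Rica | 99 | 98 | 99 | 98 |
| 9 | Chile | 99 | 97 | 99 | 98 |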
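
The whole-number values in 'agg_immunization_rate' are consistent with SQLite's integer division: when both operands are integers, `/` discards the fractional part. The sketch below illustrates this on an invented in-memory table (the table name `rates`, the short indicator labels, and the rows are all hypothetical). It also wraps the pivot in a subquery so the three `MAX(CASE ...)` expressions are written only once.

```python
import sqlite3

# Invented data: one country, integer rates, short indicator labels.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rates "
    "(country_name TEXT, indicator TEXT, year TEXT, value INTEGER)"
)
rows = [
    ("Country A", "dpt", "2000", 95),
    ("Country A", "dpt", "2001", 99),
    ("Country A", "hepb3", "2000", 98),
    ("Country A", "measles", "2000", 99),
]
conn.executemany("INSERT INTO rates VALUES (?, ?, ?, ?)", rows)

query = """
SELECT country_name,
       max_dpt, max_hepb3, max_measles,
       (max_dpt + max_hepb3 + max_measles) / 3   AS avg_truncated,  -- integer division
       (max_dpt + max_hepb3 + max_measles) / 3.0 AS avg_real        -- real division
FROM (
    -- Pivot the long table: one column of per-country maxima per indicator.
    SELECT country_name,
           MAX(CASE WHEN indicator = 'dpt' THEN value END) AS max_dpt,
           MAX(CASE WHEN indicator = 'hepb3' THEN value END) AS max_hepb3,
           MAX(CASE WHEN indicator = 'measles' THEN value END) AS max_measles
    FROM rates
    GROUP BY country_name
)
"""
row = conn.execute(query).fetchone()
```

Here `avg_truncated` comes back as the integer 98, while `avg_real` keeps the fractional part (about 98.67). Dividing by `3.0` in the original query would give the fractional average, which may also change the ordering among countries that currently tie.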


3. Yearly change in urban population for Mexico and Brazil (two countries with generally high urban populations) from 1975 to 2000.

We use SQL's LAG() window function, which reads a value from a previous row of the same result set without a self-join, to compute the difference between consecutive years. The first year in the range has no previous row, so its yearly_growth is empty (NULL).

In [6]:
q3_a = """
SELECT year, 
  value AS urban_population,
  value - LAG(value) OVER (ORDER BY year) AS yearly_growth
FROM wdi_long_clean
WHERE country_name = 'Mexico'
  AND indicator = 'Urban population'
  AND year BETWEEN '1975' AND '2000'
ORDER BY year;
"""
q3_a_table = pd.read_sql(q3_a, conn)
q3_a_table

| | year | urban_population | yearly_growth |
|---|---|---|---|
| 0 | 1975 | 37016764 | |
| 1 | 1976 | 38504442 | 1487678.0 |
| 2 | 1977 | 40011399 | 1506957.0 |
| 3 | 1978 | 41544154 | 1532755.0 |
| 4 | 1979 | 43095854 | 1551700.0 |
| 5 | 1980 | 44646369 | 1550515.0 |
| 6 | 1981 | 46068153 | 1421784.0 |
| 7 | 1982 | 47469200 | 1401047.0 |
| 8 | 1983 | 48882146 | 1412946.0 |
| 9 | 1984 | 50305880 | 1423734.0 |


In [7]:
q3_b = """
SELECT year, 
  value AS urban_population,
  value - LAG(value) OVER (ORDER BY year) AS yearly_growth
FROM wdi_long_clean
WHERE country_name = 'Brazil'
  AND indicator = 'Urban population'
  AND year BETWEEN '1975' AND '2000'
ORDER BY year;
"""
q3_b_table = pd.read_sql(q3_b, conn)
q3_b_table

| | year | urban_population | yearly_growth |
|---|---|---|---|
| 0 | 1975 | 65420857 | |
| 1 | 1976 | 68051232 | 2630375.0 |
| 2 | 1977 | 70760392 | 2709160.0 |
| 3 | 1978 | 73551099 | 2790707.0 |
| 4 | 1979 | 76416004 | 2864905.0 |
| 5 | 1980 | 79352101 | 2936097.0 |
| 6 | 1981 | 82340685 | 2988584.0 |
| 7 | 1982 | 85371053 | 3030368.0 |
| 8 | 1983 | 88441554 | 3070501.0 |
| 9 | 1984 | 91547882 | 3106328.0 |
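
The Mexico and Brazil queries are identical apart from the country filter. One possible variation, sketched below on invented data, computes both in a single query by adding `PARTITION BY country_name` to the window, so that `LAG()` restarts for each country instead of carrying the last value of one country into the first row of the next. The country labels and population figures are hypothetical, and window functions require SQLite 3.25 or newer.

```python
import sqlite3

# Toy stand-in for wdi_long_clean with invented urban-population figures.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wdi_long_clean "
    "(country_name TEXT, indicator TEXT, year TEXT, value INTEGER)"
)
rows = [
    ("Country A", "Urban population", "1975", 100),
    ("Country A", "Urban population", "1976", 110),
    ("Country A", "Urban population", "1977", 125),
    ("Country B", "Urban population", "1975", 200),
    ("Country B", "Urban population", "1976", 230),
]
conn.executemany("INSERT INTO wdi_long_clean VALUES (?, ?, ?, ?)", rows)

query = """
SELECT country_name,
       year,
       value AS urban_population,
       -- LAG() looks one row back within the same country only.
       value - LAG(value) OVER (
           PARTITION BY country_name ORDER BY year
       ) AS yearly_growth
FROM wdi_long_clean
WHERE indicator = 'Urban population'
  AND country_name IN ('Country A', 'Country B')
ORDER BY country_name, year
"""
result = conn.execute(query).fetchall()
```

The first row of each partition has no predecessor, so its `yearly_growth` is `None` in Python (shown as NaN by pandas).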


4. We compute the average trade (as % of GDP) over 1975-2024, first by region and then by country.


In [8]:
q4_a = """
SELECT region,
       AVG(value) AS avg_trade_gdp
FROM wdi_long_clean
WHERE indicator = 'Trade (% of GDP)'
  AND CAST(year AS INTEGER) BETWEEN 1975 AND 2024
GROUP BY region
ORDER BY avg_trade_gdp DESC;
"""
q4_a_table = pd.read_sql(q4_a, conn)
q4_a_table

| | region | avg_trade_gdp |
|---|---|---|
| 0 | Central America | 75.001154 |
| 1 | Caribbean | 65.423034 |
| 2 | South America | 53.059413 |
| 3 | Other | 0.0 |


In [9]:
q4_b = """
SELECT country_name, region,
       AVG(value) AS avg_trade_gdp
FROM wdi_long_clean
WHERE indicator = 'Trade (% of GDP)'
  AND CAST(year AS INTEGER) BETWEEN 1975 AND 2024
GROUP BY country_name, region
ORDER BY avg_trade_gdp DESC
LIMIT 24;
"""
q4_b_table = pd.read_sql(q4_b, conn)
q4_b_table

| | country_name | region | avg_trade_gdp |
|---|---|---|---|
| 0 | Guyana | South America | 172.706599 |
| 1 | Panama | Central America | 123.698223 |
| 2 | Puerto Rico | Caribbean | 104.956422 |
| 3 | Suriname | South America | 97.518818 |
| 4 | Honduras | Central America | 95.767372 |
| 5 | Belize | Central America | 89.613579 |
| 6 | Nicaragua | Central America | 74.432156 |
| 7 | Costa Rica | Central America | 74.18346 |
| 8 | Paraguay | South America | 71.780297 |
| 9 | El Salvador | Central America | 65.621117 |
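
When reading the regional figures, it helps to remember that `AVG(value) ... GROUP BY region` pools every country-year row in a region, so countries with more recorded years carry more weight. If each country should count equally, one alternative is to average the per-country averages. The sketch below contrasts the two on an invented table (the name `trade`, the region label, and all values are hypothetical). Whether coverage actually differs across countries in the real data is not something this example establishes.

```python
import sqlite3

# Invented data: Country A has three years of data, Country B only one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trade "
    "(country_name TEXT, region TEXT, year TEXT, value REAL)"
)
rows = [
    ("Country A", "Region X", "2000", 60.0),
    ("Country A", "Region X", "2001", 60.0),
    ("Country A", "Region X", "2002", 60.0),
    ("Country B", "Region X", "2000", 100.0),
]
conn.executemany("INSERT INTO trade VALUES (?, ?, ?, ?)", rows)

# Pooled mean: every country-year row counts once.
pooled = conn.execute(
    "SELECT region, AVG(value) FROM trade GROUP BY region"
).fetchall()

# Mean of country means: every country counts once.
by_country = conn.execute(
    """
    SELECT region, AVG(country_avg)
    FROM (
        SELECT region, country_name, AVG(value) AS country_avg
        FROM trade
        GROUP BY region, country_name
    )
    GROUP BY region
    """
).fetchall()
```

In this toy case the pooled mean is 70.0 and the mean of country means is 80.0, because Country A contributes three rows to the pooled figure.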
