Individual Assignment 1

Sean Anselmo

March 21st 2024

Queries on the European Health Database. This database documents health outcomes across several European countries. For this set of queries, we are working with the percentage of the population aged 25+ with postsecondary education, referred to below as the graduation rate. You can find this particular table here:

https://gateway.euro.who.int/en/indicators/hfa_39-0410-of-population-with-postsecondary-education-aged-25plus-years/#id=18846

In [10]:
#Import the dataset into pandas
import pandas as pd

college = pd.read_csv("HFA_39_EN.csv")
college.head()

Unnamed: 0,COUNTRY,COUNTRY_GRP,SEX,YEAR,VALUE
0,ALB,,ALL,2001.0,7.43
1,ALB,,ALL,2008.0,9.82
2,ALB,,ALL,2011.0,12.0
3,AND,,ALL,2003.0,27.81
4,AND,,ALL,2004.0,30.51


Next we are going to import SQLAlchemy, since that is the library we will be using in this assessment.

In [11]:
import sqlalchemy as sq

In [12]:
engine = sq.create_engine('mysql+mysqlconnector://sean_anselmo:4i1tawVQFvTUd@datasciencedb.ucalgary.ca/sean_anselmo')
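The connection string above embeds the password directly in the notebook. As a minimal sketch of one alternative, the URL can be assembled from environment variables so that no credentials are stored with the code. The variable names DB_USER, DB_PASSWORD, DB_HOST and DB_NAME and the helper build_mysql_url are hypothetical, and the values in the demonstration are placeholders, not real credentials.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(env=os.environ):
    """Assemble a mysql+mysqlconnector URL from environment variables.

    DB_USER, DB_PASSWORD, DB_HOST and DB_NAME are hypothetical names
    chosen for this sketch.
    """
    user = quote_plus(env["DB_USER"])
    # quote_plus escapes characters such as '@' or '/' that would
    # otherwise be misread as URL delimiters
    password = quote_plus(env["DB_PASSWORD"])
    host = env["DB_HOST"]
    database = env["DB_NAME"]
    return f"mysql+mysqlconnector://{user}:{password}@{host}/{database}"


# Demonstration with placeholder values instead of real credentials.
example_url = build_mysql_url({
    "DB_USER": "student",
    "DB_PASSWORD": "p@ss/word",
    "DB_HOST": "db.example.org",
    "DB_NAME": "student",
})
print(example_url)
```

The resulting string could then be passed to sq.create_engine in place of the literal URL.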

In [14]:
college.to_sql('college', engine)

545

In [16]:
college = pd.read_sql_table("college", engine)
college.tail()

Unnamed: 0,index,COUNTRY,COUNTRY_GRP,SEX,YEAR,VALUE
540,540,Description,% of population with postsecondary education a...,,,
541,541,Reference link,https://dw.euro.who.int/api/v3/measures/HFA_39...,,,
542,542,,,,,
543,543,Copyright,,,,
544,544,© WHO Regional Office for Europe 2024. All rig...,https://www.who.int/about/policies/publishing/...,,,


The first thing we are going to do is remove the identifier rows at the end of the table. These are metadata rows (description, reference link, copyright) that have no entry for the graduation rate, so they are not useful to us. The command below deletes every row where the value is NULL, which will help in our further analysis when we join tables.

This code was adapted from the SQL Connector example and from:
https://docs.sqlalchemy.org/en/20/core/connections.html

In [38]:
from sqlalchemy import text

# engine.begin() opens a transaction and commits it when the block exits without an error
with engine.begin() as conn:
    conn.execute(text("DELETE FROM college WHERE value IS NULL"))



In [33]:
nullvalues = pd.read_sql_query('SELECT * FROM college WHERE value IS NULL', engine)
print(nullvalues)

Empty DataFrame
Columns: [index, COUNTRY, COUNTRY_GRP, SEX, YEAR, VALUE]
Index: []
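Before deleting, it can help to count how many rows the condition matches and then confirm that the same number was removed. The following is a minimal sketch of that check. It assumes an in-memory SQLite database as a stand-in for the MySQL server, and the rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL table; the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE college (country TEXT, sex TEXT, year REAL, value REAL)")
conn.executemany(
    "INSERT INTO college VALUES (?, ?, ?, ?)",
    [
        ("AAA", "ALL", 2001, 10.0),
        ("BBB", "ALL", 2001, 20.0),
        ("Description", None, None, None),  # stand-in for a metadata row
        ("Copyright", None, None, None),    # stand-in for a metadata row
    ],
)

# Count the rows that would be removed before deleting anything.
to_delete = conn.execute(
    "SELECT COUNT(*) FROM college WHERE value IS NULL"
).fetchone()[0]

# Used as a context manager, the connection commits on success
# and rolls back if an exception is raised.
with conn:
    deleted = conn.execute("DELETE FROM college WHERE value IS NULL").rowcount

remaining = conn.execute("SELECT COUNT(*) FROM college").fetchone()[0]
print(to_delete, deleted, remaining)
```

If the number of deleted rows differs from the count taken beforehand, the table changed between the two statements and is worth inspecting.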


The first query finds the country and year with the highest postsecondary graduation rate. This is a good query to start with, as it gives us an idea of which period and which countries have the highest rates.

In [21]:
#Search for highest post secondary rate for age 25+
highest_gradrate = pd.read_sql_query('SELECT * FROM college ORDER BY value DESC LIMIT 1;', engine)
print(highest_gradrate)

   index COUNTRY COUNTRY_GRP  SEX    YEAR  VALUE
0     80     BLR        None  ALL  2019.0  73.77
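ORDER BY with LIMIT 1 returns exactly one row, even if several rows share the maximum. As a sketch of a tie-aware variant, the query below compares each row against MAX(value) in a subquery. It assumes an in-memory SQLite stand-in and made-up rows in which two countries tie.

```python
import sqlite3

# In-memory SQLite stand-in; the rows are illustrative, with a deliberate tie.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE college (country TEXT, year REAL, value REAL)")
conn.executemany(
    "INSERT INTO college VALUES (?, ?, ?)",
    [("AAA", 2018, 40.0), ("BBB", 2019, 55.0), ("CCC", 2020, 55.0)],
)

# LIMIT 1 keeps a single row, so one of the tied rows is dropped.
top_one = conn.execute(
    "SELECT country, year, value FROM college ORDER BY value DESC LIMIT 1"
).fetchall()

# Comparing against the maximum keeps every row that reaches it.
top_all = conn.execute(
    """
    SELECT country, year, value
    FROM college
    WHERE value = (SELECT MAX(value) FROM college)
    ORDER BY country
    """
).fetchall()

print(len(top_one), top_all)
```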


The next query computes the average graduation rate by year. This shows whether what we noticed in the last query reflects a trend. We are trying to figure out which time period had the highest graduation rates, without controlling for country. This can help us in our next assessment by informing us of the trends in graduation.

In [22]:
#avg grad rate by year

gradrate_by_year = pd.read_sql_query('SELECT year, AVG(value) AS avg_value FROM college GROUP BY year ORDER BY year;', engine)
print(gradrate_by_year)

      year  avg_value
0      NaN        NaN
1   1970.0   5.033333
2   1971.0   3.840000
3   1972.0  14.800000
4   1975.0   6.975000
5   1976.0   6.700000
6   1977.0   4.600000
7   1978.0   5.700000
8   1979.0  15.400000
9   1980.0   8.256667
10  1981.0   6.009000
11  1982.0  15.580000
12  1983.0  10.060000
13  1984.0   8.755000
14  1985.0   9.327500
15  1986.0  14.366667
16  1987.0   8.940000
17  1988.0  14.210000
18  1989.0  14.218889
19  1990.0  15.028571
20  1991.0  11.988824
21  1992.0  15.405000
22  1993.0  16.345000
23  1994.0  19.724286
24  1995.0  21.333333
25  1998.0   7.900000
26  1999.0  30.410000
27  2000.0  16.462500
28  2001.0  19.453333
29  2002.0  19.590000
30  2003.0  21.167500
31  2004.0  24.545385
32  2005.0  23.505652
33  2006.0  26.078000
34  2007.0  26.161600
35  2008.0  25.138519
36  2009.0  27.538667
37  2010.0  26.674242
38  2011.0  26.468947
39  2012.0  28.875200
40  2013.0  28.796364
41  2014.0  29.341379
42  2015.0  31.605333
43  2016.0  30.744444
44  2017.0
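The first row of the output above is the group of rows with no year, and some yearly averages jump sharply (7.90 for 1998, then 30.41 for 1999). One possible reason is that a year's average rests on only a few reporting countries. As a sketch, the query below excludes rows without a year and reports how many values feed each average. It assumes an in-memory SQLite stand-in with made-up rows.

```python
import sqlite3

# In-memory SQLite stand-in; the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE college (country TEXT, year REAL, value REAL)")
conn.executemany(
    "INSERT INTO college VALUES (?, ?, ?)",
    [
        ("AAA", 1998, 7.9),
        ("AAA", 2005, 20.0),
        ("BBB", 2005, 30.0),
        ("CCC", 2005, 25.0),
        ("Description", None, None),  # stand-in for a metadata row
    ],
)

query = """
SELECT year, COUNT(value) AS n_values, AVG(value) AS avg_value
FROM college
WHERE year IS NOT NULL
GROUP BY year
ORDER BY year
"""
yearly = conn.execute(query).fetchall()
for year, n_values, avg_value in yearly:
    print(year, n_values, avg_value)
```

A year backed by a single value deserves less weight when reading the trend than one backed by many.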

The next query identifies the country with the highest average graduation rate from 2000 onward. We restrict the data to 2000 and later to get a better picture of recent rates.

In [23]:
#highest grad rate since 2000

highest_gradrate_2000 = pd.read_sql_query('SELECT country, AVG(value) AS avg_value FROM college WHERE year >= 2000 GROUP BY country ORDER BY avg_value DESC LIMIT 1;', engine)
print(highest_gradrate_2000)


  country  avg_value
0     BLR     68.575
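To try other cutoff years without editing the SQL text, the year can be passed as a bound parameter. The sketch below assumes an in-memory SQLite stand-in with made-up rows, and top_country_since is a hypothetical helper. The "?" placeholder is sqlite3's style; other drivers use different placeholder styles, and pandas.read_sql_query accepts a params argument for the same purpose.

```python
import sqlite3

# In-memory SQLite stand-in; the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE college (country TEXT, year REAL, value REAL)")
conn.executemany(
    "INSERT INTO college VALUES (?, ?, ?)",
    [
        ("AAA", 1995, 90.0),
        ("AAA", 2005, 20.0),
        ("AAA", 2010, 30.0),
        ("BBB", 2000, 40.0),
        ("BBB", 2010, 50.0),
    ],
)


def top_country_since(connection, cutoff_year):
    """Return (country, avg_value) for the highest average from cutoff_year onward."""
    return connection.execute(
        """
        SELECT country, AVG(value) AS avg_value
        FROM college
        WHERE year >= ?
        GROUP BY country
        ORDER BY avg_value DESC
        LIMIT 1
        """,
        (cutoff_year,),
    ).fetchone()


since_2000 = top_country_since(conn, 2000)
since_1990 = top_country_since(conn, 1990)
print(since_2000, since_1990)
```

In this made-up data the leading country changes with the cutoff, which shows how sensitive the answer can be to the chosen period.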


The next two queries find the five years with the highest average graduation rate, followed by the five countries with the highest average. The yearly averages show the general trend over time more accurately, since they take into account every country that reported in that year. The same can be said for the averages by country. These country averages include outlier years like 1970, where many countries posted lows.

In [24]:
# Highest average value by year
highest_avg_by_year = pd.read_sql_query('SELECT year, AVG(value) AS avg_value FROM college GROUP BY year ORDER BY avg_value DESC LIMIT 5;', engine)
print("Highest Average Value by Year:")
print(highest_avg_by_year)

# Highest average value by country
highest_avg_by_country = pd.read_sql_query('SELECT country, AVG(value) AS avg_value FROM college GROUP BY country ORDER BY avg_value DESC LIMIT 5;', engine)
print("\nHighest Average Value by Country:")
print(highest_avg_by_country)

Highest Average Value by Year:
     year  avg_value
0  2019.0  37.677500
1  2017.0  34.972778
2  2020.0  34.452222
3  2015.0  31.605333
4  2016.0  30.744444

Highest Average Value by Country:
  country  avg_value
0     GEO  52.448333
1     LTU  47.735333
2     ARM  46.576000
3     ISL  41.126667
4     BLR  38.937500


The next query combines what we learned from the other queries. It is a nested query: the inner query computes the average for each year, and the outer query keeps only the years whose average is below 90% of the overall average. The result contains every year in the data up to 2003 except 1999, and no year after 2003, so years after 2003 have consistently higher average graduation rates. We also stored the query in its own variable to improve readability for long queries.

In [25]:
query = """
SELECT year, avg_value
FROM (
  SELECT year, AVG(value) AS avg_value
  FROM college
  GROUP BY year
) AS yearly_averages
WHERE avg_value < (SELECT AVG(value) FROM college) * 0.9;
"""

underperforming_years = pd.read_sql_query(query, engine)
print(underperforming_years)



      year  avg_value
0   1970.0   5.033333
1   1971.0   3.840000
2   1972.0  14.800000
3   1975.0   6.975000
4   1976.0   6.700000
5   1977.0   4.600000
6   1978.0   5.700000
7   1979.0  15.400000
8   1980.0   8.256667
9   1981.0   6.009000
10  1982.0  15.580000
11  1983.0  10.060000
12  1984.0   8.755000
13  1985.0   9.327500
14  1986.0  14.366667
15  1987.0   8.940000
16  1988.0  14.210000
17  1989.0  14.218889
18  1990.0  15.028571
19  1991.0  11.988824
20  1992.0  15.405000
21  1993.0  16.345000
22  1994.0  19.724286
23  1995.0  21.333333
24  1998.0   7.900000
25  2000.0  16.462500
26  2001.0  19.453333
27  2002.0  19.590000
28  2003.0  21.167500
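The same nested logic can be sketched with common table expressions, which give each step a name and also return the threshold next to each year. This assumes a server that supports WITH (MySQL 8.0 or later); the example runs against an in-memory SQLite stand-in with made-up rows. Note that the threshold is the mean over all rows, so years with more reporting countries weigh more heavily in it.

```python
import sqlite3

# In-memory SQLite stand-in; the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE college (country TEXT, year REAL, value REAL)")
conn.executemany(
    "INSERT INTO college VALUES (?, ?, ?)",
    [
        ("AAA", 1990, 10.0),
        ("AAA", 2000, 20.0),
        ("AAA", 2010, 30.0),
        ("BBB", 2010, 30.0),
    ],
)

query = """
WITH yearly_averages AS (
    SELECT year, AVG(value) AS avg_value
    FROM college
    GROUP BY year
),
overall AS (
    -- 90% of the mean over all rows, as in the nested query
    SELECT AVG(value) * 0.9 AS threshold FROM college
)
SELECT y.year, y.avg_value, o.threshold
FROM yearly_averages AS y
CROSS JOIN overall AS o
WHERE y.avg_value < o.threshold
ORDER BY y.year
"""
underperforming = conn.execute(query).fetchall()
for row in underperforming:
    print(row)
```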


This last query is another nested query, similar to the previous one. It uses HAVING, which filters the per-country groups created by GROUP BY, and returns the countries whose 2019 value is lower than the overall 2019 average. We compare against 2019 since it was the year with the highest average graduation rate.

In [53]:
query_2019 = """
SELECT country, value
FROM college
WHERE year = 2019
GROUP BY country
HAVING value < (
  SELECT AVG(value) 
  FROM college 
  WHERE year = 2019
);

"""

underperform_2019 = pd.read_sql_query(query_2019, engine)
print(underperform_2019)

  country  value
0     AUT  32.12
1     AZE  30.02
2     BEL  37.07
3     FIN  37.57
4     FRA  32.30
5     PRT  21.44
6     ROU  18.37
7     SRB  23.25
8     SVK  25.24
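The query above selects the bare value column while grouping by country, which a MySQL server may reject when the ONLY_FULL_GROUP_BY mode is enabled. A more portable sketch aggregates the value inside each group and uses the same aggregate in HAVING. It assumes an in-memory SQLite stand-in with made-up rows. If each country has a single 2019 row, the per-country average equals that row's value.

```python
import sqlite3

# In-memory SQLite stand-in; the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE college (country TEXT, year REAL, value REAL)")
conn.executemany(
    "INSERT INTO college VALUES (?, ?, ?)",
    [
        ("AAA", 2019, 20.0),
        ("BBB", 2019, 40.0),
        ("CCC", 2019, 60.0),
        ("AAA", 2018, 90.0),  # ignored by the year filter
    ],
)

query_2019 = """
SELECT country, AVG(value) AS avg_value
FROM college
WHERE year = 2019
GROUP BY country
HAVING AVG(value) < (
    SELECT AVG(value)
    FROM college
    WHERE year = 2019
)
ORDER BY country
"""
below_average_2019 = conn.execute(query_2019).fetchall()
print(below_average_2019)
```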
