# A comparison of CAO points between 2019, 2020, and 2021.
***

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import requests as rq
import re
import datetime as dt
import urllib.request as urlrq

- How to load CAO points information from the CAO website.
- A detailed comparison of CAO points in 2019, 2020, and 2021 using pandas

## Pulling down and saving the 2021 data from an HTML webpage.

In [2]:
# Pulling raw data from CAO - level 8 (2021) [1]
CAO_2021 = rq.get("http://www2.cao.ie/points/l8.php")


In [3]:
# Checking that the data was pulled correctly (Response [200] means the request succeeded).
CAO_2021

<Response [200]>

In [4]:
# Compiling the regular expression to find the course lines. [2] There should be 949 of them.
# The third group is built with the 'OR' operator, i.e. #123* OR #123 OR # OR 123* etc. [3]
# The RE is almost there; however, it is not capturing courses with no points assigned. Needs amendment.

group1 = r"([A-Z]{2}[0-9]{3})  "  # 2 uppercase letters and 3 digits
group2 = r"(.*[^[0-9]{3}|[A-Z]{3}|[#]|[\*]])  "  # everything except the following items
group3 = r"*([#]|[#][0-9]{3}[\*]|[#][0-9]{3}|[0-9]{3}|[0-9]{3}[\*]|[\*]|[ *])  "  # 3rd pattern powered by OR
group4 = r"(.*)"

re_course = re.compile(group1 + group2 + group3 + group4)  # compiling the regex

I initially used a different expression; however, it did not return courses with no points assigned (e.g. DB503), nor did it return courses with round-two offers only (e.g. GC301).
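The 'OR' idea behind the third group can be illustrated on its own. The snippet below is a minimal sketch, not the expression used in this notebook: `points_re` is a hypothetical, simplified pattern for the points column only, and the sample strings are made up to cover the cases listed in the comment above (`#123*`, `#123`, `#`, `123*`, and so on).

```python
import re

# Hypothetical, simplified pattern for the points column only (an assumption,
# not the notebook's group3): an optional '#', three digits, an optional '*',
# OR a lone '#', OR a lone '*'.
points_re = re.compile(r"#?[0-9]{3}\*?|#|\*")

# Made-up sample values, not taken from the CAO file.
samples = ["300", "#700", "#700*", "451*", "#", "*", "AQA", ""]

# fullmatch only succeeds when the whole string fits one of the alternatives.
results = {s: bool(points_re.fullmatch(s)) for s in samples}

for value, matched in results.items():
    print(repr(value), matched)
```

Testing the points alternatives in isolation like this makes it easier to see which cases a change to the full expression gains or loses.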

In [5]:
# Getting the current date and time as a datetime object.
now = dt.datetime.now()

# Formatting the date and time as a string.
strnow = now.strftime('%Y_%m_%d_%H%M')

In [6]:
# Creating a path to back up the original HTML data with the time/date.
original_data = "data/cao2021_" + strnow + ".html"

# Defining a path to output the 2021 data with the time/date.
path = "data/cao2021_" + strnow + ".csv"
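The timestamped file names can be sketched as a small helper. `timestamped_path` is a hypothetical function introduced only for illustration; the fixed date is chosen so the result is reproducible rather than depending on the current time.

```python
import datetime as dt


def timestamped_path(prefix, extension, when=None):
    """Hypothetical helper: append a date/time stamp to a path prefix."""
    # Fall back to the current time when no datetime is supplied.
    when = when or dt.datetime.now()
    return prefix + when.strftime('%Y_%m_%d_%H%M') + extension


# A fixed datetime (13 November 2021, 20:20) keeps the example deterministic.
fixed = dt.datetime(2021, 11, 13, 20, 20)
csv_path = timestamped_path("data/cao2021_", ".csv", fixed)
html_path = timestamped_path("data/cao2021_", ".html", fixed)

print(csv_path)   # → data/cao2021_2021_11_13_2020.csv
print(html_path)  # → data/cao2021_2021_11_13_2020.html
```

Because the stamp runs from year down to minute, the files sort chronologically by name.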

In [7]:
# Backing up the original HTML file.

# The encoding needs to be changed to cp1252.
CAO_2021.encoding = "cp1252"

# Writing the raw data to the path defined above ("original_data").
with open(original_data, "w") as f:
    f.write(CAO_2021.text)
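The reason for switching to cp1252 can be shown with a single byte. The sketch below decodes the byte `0x96` under both encodings: cp1252 maps it to an en dash, while ISO-8859-1 maps it to an invisible control character. The sample title is made up for illustration.

```python
# The byte 0x96 decoded under the two encodings.
raw = b"\x96"

cp1252_char = raw.decode("cp1252")       # en dash, U+2013
latin1_char = raw.decode("iso-8859-1")   # control character, U+0096

print(repr(cp1252_char), repr(latin1_char))

# A made-up course title containing that byte, decoded with cp1252.
title = b"Education \x96 Primary Teaching".decode("cp1252")
print(title)
```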

In [8]:
counter = 0

# Writing the lines matched by the RE to a CSV file.
with open(path, "w") as f:
    # Looping over the response line by line to find lines matching the RE.
    for line in CAO_2021.iter_lines():
        # The ISO standard didn't recognise character '\x96' on CM002, so cp1252 is used.
        decoded = line.decode('cp1252')
        if re_course.fullmatch(decoded):
            CSV_2021 = re_course.sub(r"\1,\2,\3,\4", decoded)
            f.write(CSV_2021 + "\n")
            counter += 1

print("\nNumber of level 8 courses in 2021: {}\n".format(counter))


Number of level 8 courses in 2021: 949
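One caveat with joining the groups by commas is that a course title that itself contains a comma would add an extra column. A possible alternative, sketched below under the assumption that the four groups are available as separate strings, is to let the standard `csv` module handle the quoting. The field values are made up and are not real CAO data.

```python
import csv
import io

# Made-up fields standing in for the four regex groups (not real CAO data).
fields = ("XX123", "Arts, Humanities and Languages", "300", "310")

# Writing to an in-memory buffer instead of a file keeps the sketch self-contained.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(fields)

csv_line = buffer.getvalue()
print(csv_line)

# Reading the line back recovers the four original fields.
parsed = next(csv.reader(io.StringIO(csv_line)))
print(parsed)
```

The title is wrapped in double quotes on output, so the row still parses as four columns.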



## Pulling down and saving the 2020 data from an Excel file.

The 2020 CAO data is in an Excel spreadsheet on the CAO website. After backing up the original data, I will read it in using pandas.

In [9]:
original_data = "data/cao2020_"+strnow+".xlsx"
urlrq.urlretrieve("http://www2.cao.ie/points/CAOPointsCharts2020.xlsx", original_data)

('data/cao2020_2021_11_13_2020.xlsx',
 <http.client.HTTPMessage at 0x1b8a192cd00>)

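The tuple printed above is the return value of `urlretrieve`: the local file name and the response headers. The sketch below shows the same call working offline, assuming a local `file://` URL as a stand-in for the CAO address; the file names and placeholder bytes are invented for illustration.

```python
import pathlib
import tempfile
import urllib.request as urlrq

with tempfile.TemporaryDirectory() as tmp:
    # A local stand-in for the remote spreadsheet (placeholder bytes only).
    source = pathlib.Path(tmp) / "source.xlsx"
    source.write_bytes(b"placeholder spreadsheet bytes")

    # urlretrieve copies the resource to the given file name and
    # returns a (filename, headers) tuple.
    backup = pathlib.Path(tmp) / "backup.xlsx"
    saved_path, headers = urlrq.urlretrieve(source.as_uri(), str(backup))

    copied = backup.read_bytes()
    same_path = saved_path == str(backup)

print(same_path, copied == b"placeholder spreadsheet bytes")
```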
# References
- [1] http://www.cao.ie/index.php?page=points&p=2021
- [2] https://www.w3schools.com/python/python_regex.asp
- [2] https://realpython.com/regex-python/
- [3] https://www.ocpsoft.org/tutorials/regular-expressions/or-in-regex/
- https://www.w3schools.com/python/python_functions.asp
- https://stackoverflow.com/questions/54496411/python-errortypeerror-findall-missing-1-required-positional-argument-stri
- https://stackoverflow.com/questions/2013124/regex-matching-up-to-the-first-occurrence-of-a-character
- https://developers.google.com/edu/python/regular-expressions

***
# End