In [14]:
# ![cao](https://www.cao.ie/icons/fblogo.png)

# [Official CAO Website](https://www.cao.ie/)

<br>

# CAO Points Analysis

http://www.cao.ie/index.php?page=points&p=2021

***

In [2]:
# Regular expressions - ref 1.
import re

# Convenient HTTP requests - ref 4.
import requests as rq

# Dates and times - ref 8. 
import datetime as dt

In [3]:
# Fetch the CAO points URL - ref 4.
resp = rq.get('http://www2.cao.ie/points/l8.php')

# Check the response: 200 means OK; 4xx and 5xx status codes indicate an error.
resp

<Response [200]>

<br>

## Save original data set

***

In [4]:
# Get the current date and time.
now = dt.datetime.now()

# Format as a string.
nowstr = now.strftime('%Y%m%d_%H%M%S')

In [5]:
# Create a file path for the original data.
path = 'data/cao2021_' + nowstr + '.html'
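As a minimal sketch of how the timestamped file name is built, the example below uses a fixed date in place of `dt.datetime.now()` so that the result is reproducible. The date and the names `example_time`, `example_stamp` and `example_path` are illustrative assumptions, not part of the notebook.

```python
import datetime as dt

# A fixed, made-up moment in time (assumption, for a reproducible example).
example_time = dt.datetime(2021, 11, 3, 9, 5, 7)

# Same format string as above: year, month, day, underscore, hour, minute, second.
example_stamp = example_time.strftime('%Y%m%d_%H%M%S')

# Same path pattern as above.
example_path = 'data/cao2021_' + example_stamp + '.html'

print(example_path)
```

Because the fields run from largest (year) to smallest (second), file names built this way sort chronologically.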

<br>

## Error on server

***

Technically, the server says we should decode as per:

    Content-Type: text/html; charset=iso-8859-1

However, one line contains the byte 0x96, which is not a printable character in iso-8859-1.

Therefore we decode with cp1252 instead, which is very similar but maps 0x96 to an en dash.
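The difference between the two encodings can be checked directly. The short sketch below decodes the single byte 0x96 both ways; the variable names are illustrative and do not appear elsewhere in the notebook.

```python
# The byte that causes the problem.
raw = b'\x96'

# Python's iso-8859-1 codec maps it to U+0096, an invisible control character.
as_latin1 = raw.decode('iso-8859-1')

# cp1252 maps the same byte to U+2013, the en dash.
as_cp1252 = raw.decode('cp1252')

print(repr(as_latin1), repr(as_cp1252))
```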

In [6]:
# The server declares the wrong encoding; keep a note of it before fixing it.
original_encoding = resp.encoding

# Change to cp1252.
resp.encoding = 'cp1252'

In [7]:
# Save the original html file.
with open(path, 'w') as f:
    f.write(resp.text)

<br>

## Use regular expressions to select lines we want

***

In [8]:
# Compile the regular expression for matching lines.
re_course = re.compile(r'([A-Z]{2}[0-9]{3})  (.*)([0-9]{3})(\*?) *')

In [9]:
# The regular expression filters the web page down to the lines that describe courses.

In [10]:
# The r prefix makes the pattern a raw string. The pattern matches two uppercase letters and three digits
# (the course code), two spaces, any characters (the course title), three digits (the points),
# an optional asterisk, and any trailing spaces up to the end of the line.
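To make the pattern concrete, here is a small sketch that applies it to two made-up lines. The sample strings are assumptions for illustration and are not taken from the CAO page.

```python
import re

# Same pattern as above.
re_course = re.compile(r'([A-Z]{2}[0-9]{3})  (.*)([0-9]{3})(\*?) *')

# Made-up lines for illustration (assumption: not real CAO data).
course_line = 'XX123  Example Course Title            300*   '
other_line = '<p>Some other text on the page</p>'

match = re_course.fullmatch(course_line)

# Group 2 keeps the padding spaces before the points, so strip them.
code = match.group(1)
title = match.group(2).rstrip()
points = match.group(3)
star = match.group(4)

print(code, title, points, star)

# A line that does not look like a course gives None.
print(re_course.fullmatch(other_line))
```

Because `fullmatch` must consume the whole line, any line that does not end in three digits (plus an optional asterisk and spaces) is skipped.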

<br>

## Loop through the lines of the response

***

In [11]:
# iter_lines yields bytes, so each line must be decoded with line.decode. The web page uses a
# different encoding from my system's default, so the encoding is given explicitly - ref 7.

In [12]:
# The file path for the csv file.
path = 'data/cao2021_csv_' + nowstr + '.csv'

# Keeps track of how many courses we process.
no_lines = 0

# Open the csv file for writing.
with open(path, 'w') as f:
    # Loop through lines of the response - ref 6.
    for line in resp.iter_lines():
        # Decode the line as cp1252, not the encoding the server declares.
        dline = line.decode('cp1252')
        # Match only the lines representing courses.
        if re_course.fullmatch(dline):
            # Add one to the lines counter.
            no_lines = no_lines + 1
            # Split the line on two or more spaces.
            linesplit = re.split('  +', dline)
            # Rejoin the substrings with commas in between.
            f.write(','.join(linesplit) + '\n')
               
# Print the total number of processed lines.
print(f"Total number of lines is {no_lines}.")
   
        # Pick out the relevant parts of the matched line - ref 5.
        #csv_version = re_course.sub(r'\1,\2,\3,\4', line.decode('iso-8859-1'))
        # Print the CSV-style line
        #print(csv_version)


Total number of lines is 922.
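Splitting on two or more spaces and joining with commas works as long as no field contains a comma and the line has no trailing spaces. As a hedged alternative sketch, the standard `csv` module can quote such fields automatically. The helper name `line_to_fields` and the sample line are assumptions for illustration, not part of the notebook.

```python
import csv
import io
import re

def line_to_fields(dline):
    # Strip trailing whitespace first so no empty field appears at the end,
    # then split on two or more spaces, as in the loop above.
    return re.split('  +', dline.rstrip())

# Made-up line whose title contains a comma (assumption: not real CAO data).
sample = 'XX123  Example Course, With Comma            300*   '
fields = line_to_fields(sample)

# Write to an in-memory buffer here; a real file opened with newline='' works the same way.
buffer = io.StringIO()
csv.writer(buffer, lineterminator='\n').writerow(fields)
csv_line = buffer.getvalue()

print(fields)
print(csv_line, end='')
```

With a plain `','.join(...)`, the comma in the title would be read back as an extra column; `csv.writer` wraps that field in quotes instead.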


<br>

## Comparison of CAO points in 2019, 2020 and 2021

***

<br>

# References:

***

1. https://docs.python.org/3/library/re.html

2. https://realpython.com/regex-python/

3. https://realpython.com/regex-python-part-2/

4. https://docs.python-requests.org/en/latest/user/quickstart/#make-a-request

5. https://www.mygreatlearning.com/blog/regular-expression-in-python/

6. https://stackoverflow.com/questions/16870648/python-read-website-data-line-by-line-when-available

7. https://sites.pitt.edu/~naraehan/python3/mbb12.html

8. https://docs.python.org/3/library/datetime.html



***

# End