<br>

# CAO Points Analysis

[Official CAO Website](https://www.cao.ie/)

***

In [1]:
# Regular expressions - ref 1.
import re

# Convenient HTTP requests - ref 4.
import requests as rq

# Dates and times - ref 8. 
import datetime as dt

# Pandas for data frames - ref 9.
import pandas as pd

# For downloading urls - ref 10.
import urllib.request as urlrq

<br>

## 2021 CAO Points

http://www.cao.ie/index.php?page=points&p=2021

In [2]:
# Fetch the CAO points URL - ref 4.
resp = rq.get('http://www2.cao.ie/points/l8.php')

# Check response. 200 means OK; 4xx and 5xx codes indicate an error.
resp

<Response [200]>
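As a quick reference for reading status codes like the one above, the standard library's `http.HTTPStatus` enum gives the official phrase for each code. This is only an illustrative sketch: `describe_status` is a hypothetical helper, not part of the notebook or of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        outcome = 'success'
    elif 400 <= code < 500:
        outcome = 'client error'
    elif code >= 500:
        outcome = 'server error'
    else:
        outcome = 'other'
    return f'{code} {phrase}: {outcome}'


for code in (200, 401, 404, 500):
    print(describe_status(code))
```

In particular, 401 specifically means "Unauthorized", while 404 is "Not Found".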

<br>

### Save original data set

***

In [3]:
# Get the current date and time.
now = dt.datetime.now()

# Format as a string.
nowstr = now.strftime('%Y%m%d_%H%M%S')

In [4]:
# Create a file path for the original data.
path = 'data/cao2021_' + nowstr + '.html'
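To make the naming scheme concrete, here is a small sketch of the same formatting applied to a fixed date and time instead of `datetime.now()`. The fixed value is chosen only for illustration so that the result is predictable.

```python
import datetime as dt

# A fixed moment in time, used here only so the result is reproducible.
fixed = dt.datetime(2021, 11, 3, 20, 35, 44)

# Same format string as above: year, month, day, underscore, hours, minutes, seconds.
stamp = fixed.strftime('%Y%m%d_%H%M%S')

# Build the file path in the same way as the notebook does.
example_path = 'data/cao2021_' + stamp + '.html'
print(example_path)
```

Because the fields run from most to least significant, file names built this way sort chronologically.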

<br>

### Error on server

***

Technically, the server says we should decode as per:

    Content-Type: text/html; charset=iso-8859-1

However, one line contains the byte \x96, which is not a printable character in iso-8859-1.

Therefore we decode with cp1252 instead, which is nearly identical to iso-8859-1 but maps \x96 to an en dash.
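The difference between the two encodings can be checked directly. The sketch below decodes a made-up byte string (not taken from the CAO page) both ways and inspects the character that \x96 turns into.

```python
import unicodedata

# Hypothetical bytes containing \x96, standing in for the problem line.
raw = b'Arts \x96 Joint Honours'

# Decode with the encoding the server declares, and with cp1252.
as_latin1 = raw.decode('iso-8859-1')
as_cp1252 = raw.decode('cp1252')

# Under iso-8859-1 the byte becomes a control character (category 'Cc').
print(repr(as_latin1[5]), unicodedata.category(as_latin1[5]))

# Under cp1252 the same byte becomes an en dash.
print(repr(as_cp1252[5]), unicodedata.name(as_cp1252[5]))
```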

In [5]:
# The server declares the wrong encoding; record it before overriding.
original_encoding = resp.encoding

# Change to cp1252.
resp.encoding = 'cp1252'

In [6]:
# Save the original html file, stating the output encoding explicitly.
with open(path, 'w', encoding='utf-8') as f:
    f.write(resp.text)

<br>

### Use regular expressions to select lines we want

***

In [7]:
# Compile the regular expression for matching lines.
re_course = re.compile(r'([A-Z]{2}[0-9]{3})  (.*)([0-9]{3})(\*?) *')

In [8]:
# The regular expression filters the web page down to the course lines we want.

In [9]:
# The r prefix makes the pattern a raw string. It matches two uppercase letters and three digits (the course code),
# then two spaces, then any characters up to a final three-digit number, an optional asterisk and trailing spaces.
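A short sketch of the pattern in action may help. The two sample lines below are invented for illustration and are not copied from the CAO page.

```python
import re

# Same pattern as in the notebook.
re_course = re.compile(r'([A-Z]{2}[0-9]{3})  (.*)([0-9]{3})(\*?) *')

# Hypothetical lines: one course line and one institution heading.
course_line = 'XX101  Example Course Title                    451*   '
heading_line = 'Example Institute of Technology'

match = re_course.fullmatch(course_line)
print(match.group(1))          # course code
print(match.group(2).strip())  # course title, with padding removed
print(match.group(3))          # points
print(match.group(4))          # asterisk flag, or an empty string

# The heading does not start with a course code, so it is rejected.
print(re_course.fullmatch(heading_line))
```

Note that `(.*)` is greedy, so when a line contains several three-digit numbers, group 3 captures the last one.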

<br>

### Loop through the lines of the response

***

In [10]:
# Each line must be decoded with line.decode because iter_lines yields bytes, and the page's encoding
# differs from my system's default encoding - ref 7.

In [11]:
# The file path for the csv file.
path = 'data/cao2021_csv_' + nowstr + '.csv'

# Keeps track of how many courses we process.
no_lines = 0

# Open the csv file for writing.
with open(path, 'w') as f:
    # Loop through lines of the response - ref 6.
    for line in resp.iter_lines():
        # Decode the line using cp1252 rather than the server's declared iso-8859-1.
        dline = line.decode('cp1252')
        # Match only the lines representing courses.
        if re_course.fullmatch(dline):
            # Add one to the lines counter.
            no_lines = no_lines + 1
            # Split the line on two or more spaces.
            linesplit = re.split('  +', dline)
            # Rejoin the substrings with commas in between.
            f.write(','.join(linesplit) + '\n')
               
# Print the total number of processed lines.
print(f"Total number of lines is {no_lines}.")
   
        # Pick out the relevant parts of the matched line - ref 5.
        #csv_version = re_course.sub(r'\1,\2,\3,\4', line.decode('iso-8859-1'))
        # Print the CSV-style line
        #print(csv_version)


Total number of lines is 922.
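One thing to watch when joining fields with commas is that a course title containing a comma would produce an extra column. The sketch below shows the split step on a made-up line and, as one possible alternative rather than the notebook's method, lets the standard library `csv` module quote the title.

```python
import csv
import io
import re

# Hypothetical course line whose title contains a comma.
sample_line = 'XX101  Example Course, with a Comma  451  500'

# Split on two or more spaces, as in the loop above.
fields = re.split('  +', sample_line.strip())

# Plain join: the comma in the title looks like a field separator.
naive_row = ','.join(fields)

# csv.writer quotes any field that contains the delimiter.
buffer = io.StringIO()
csv.writer(buffer, lineterminator='\n').writerow(fields)
quoted_row = buffer.getvalue()

print(naive_row)
print(quoted_row, end='')
```

Stripping the line before splitting also avoids an empty final field when a line ends in trailing spaces.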


<br>

## 2020 CAO Points

http://www.cao.ie/index.php?page=points&p=2020

***

<br>

### Save original data set

***

In [12]:
# Create a file path for the original data.
path = 'data/cao2020_' + nowstr + '.xlsx'

In [13]:
urlrq.urlretrieve('http://www2.cao.ie/points/CAOPointsCharts2020.xlsx', path)

('data/cao2020_20211103_203544.xlsx',
 <http.client.HTTPMessage at 0x2601e6623d0>)

<br>

### Load Excel Spreadsheet using Pandas

***

In [14]:
# Download and parse the Excel spreadsheet.
df = pd.read_excel('http://www2.cao.ie/points/CAOPointsCharts2020.xlsx', skiprows=10)

In [15]:
df

Unnamed: 0,CATEGORY (i.e.ISCED description),COURSE TITLE,COURSE CODE2,R1 POINTS,R1 Random *,R2 POINTS,R2 Random*,EOS,EOS Random *,EOS Mid-point,...,avp,v,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8
0,Business and administration,International Business,AC120,209,,,,209,,280,...,,,,,,,,,,
1,Humanities (except languages),Liberal Arts,AC137,252,,,,252,,270,...,,,,,,,,,,
2,Arts,"First Year Art & Design (Common Entry,portfolio)",AD101,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
3,Arts,Graphic Design and Moving Image Design (portfo...,AD102,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
4,Arts,Textile & Surface Design and Jewellery & Objec...,AD103,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1459,Manufacturing and processing,Manufacturing Engineering,WD208,188,,,,188,,339,...,,,,,,,,,,
1460,Information and Communication Technologies (ICTs),Software Systems Development,WD210,279,,,,279,,337,...,,,,,,,,,,
1461,Information and Communication Technologies (ICTs),Creative Computing,WD211,271,,,,271,,318,...,,,,,,,,,,
1462,Personal services,Recreation and Sport Management,WD212,270,,,,270,,349,...,,,,,,,,,,


In [16]:
# Spot check a random row.
df.iloc[753]

CATEGORY (i.e.ISCED description)          Engineering and engineering trades
COURSE TITLE                        Road Transport Technology and Management
COURSE CODE2                                                           LC286
R1 POINTS                                                                264
R1 Random *                                                              NaN
R2 POINTS                                                                NaN
R2 Random*                                                               NaN
EOS                                                                      264
EOS Random *                                                             NaN
EOS Mid-point                                                            360
LEVEL                                                                      7
HEI                                         Limerick Institute of Technology
Test/Interview #                                                         NaN

In [17]:
# Spot check the last row.
df.iloc[-1]

CATEGORY (i.e.ISCED description)          Engineering and engineering trades
COURSE TITLE                        Mechanical and Manufacturing Engineering
COURSE CODE2                                                           WD230
R1 POINTS                                                                253
R1 Random *                                                              NaN
R2 POINTS                                                                NaN
R2 Random*                                                               NaN
EOS                                                                      253
EOS Random *                                                             NaN
EOS Mid-point                                                            369
LEVEL                                                                      8
HEI                                        Waterford Institute of Technology
Test/Interview #                                                         NaN

In [18]:
# Create a file path for the pandas data.
path = 'data/cao2020_' + nowstr + '.csv'

In [19]:
# Save pandas data frame to disk.
df.to_csv(path)

<br>

## 2019 CAO Points

http://www.cao.ie/index.php?page=points&p=2019

***

**Steps to reproduce**

1.  Download original pdf file.
2.  Open original pdf file in Microsoft Word.
3.  Save Microsoft Word's converted PDF in docx format.
4.  Re-save Word document for editing.
5.  Delete headers and footers.
6.  Delete preamble on page 1.
7.  Select all and copy.
8.  Paste into Visual Studio Code.
9.  Remove HEI name lines and blank lines.
10. Change column heading "COURSE AND INSTITUTION" to "Course".
11. Change backticks to apostrophes.
12. Replace the double tab character on line 28 with a single tab.
13. Delete tabs at end of lines 604, 670, 700, 701, 793, and 830.
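The steps above were carried out by hand. As a rough sketch only, a few of the text clean-up steps (removing blank lines, changing backticks to apostrophes, collapsing double tabs and deleting trailing tabs) could be scripted along these lines. The function name `clean_lines` and the sample rows are hypothetical, and the sketch applies each rule to every line, whereas the manual steps targeted specific lines.

```python
import re


def clean_lines(lines):
    """Apply a few of the manual clean-up steps to tab-separated text lines."""
    cleaned = []
    for line in lines:
        # Drop the newline plus any trailing tabs or spaces.
        line = line.rstrip('\t \n')
        # Skip lines that are now empty.
        if not line:
            continue
        # Change backticks to apostrophes.
        line = line.replace('`', "'")
        # Collapse runs of two or more tabs into a single tab.
        line = re.sub(r'\t{2,}', '\t', line)
        cleaned.append(line)
    return cleaned


# Hypothetical rows, not taken from the real 2019 file.
sample = [
    'XX101\tExample Course\t304\t328\n',
    '\n',
    'XX102\tDriver`s Example Course\t\t300\t310\t\n',
]
result = clean_lines(sample)
for row in result:
    print(row)
```

Collapsing every double tab assumes no row has a genuinely empty field in the middle, so the output should still be spot checked against the original PDF.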

In [20]:
df2019 = pd.read_csv('data/cao2019_20211103_202410_edited.csv', sep='\t')

In [21]:
df2019

Unnamed: 0,Course Code,Course,EOS,Mid
0,AL801,Software Design with Virtual Reality and Gaming,304,328.0
1,AL802,Software Design with Cloud Computing,301,306.0
2,AL803,Software Design with Mobile Apps and Connected...,309,337.0
3,AL805,Network Management and Cloud Infrastructure,329,442.0
4,AL810,Quantity Surveying,307,349.0
...,...,...,...,...
925,WD200,Arts (options),221,296.0
926,WD210,Software Systems Development,271,329.0
927,WD211,Creative Computing,275,322.0
928,WD212,Recreation and Sport Management,274,311.0


<br>

## Comparison of CAO points in 2019, 2020 and 2021

***

<br>

# References:

***
 
1.  https://docs.python.org/3/library/re.html

2.  https://realpython.com/regex-python/

3.  https://realpython.com/regex-python-part-2/

4.  https://docs.python-requests.org/en/latest/user/quickstart/#make-a-request

5.  https://www.mygreatlearning.com/blog/regular-expression-in-python/

6.  https://stackoverflow.com/questions/16870648/python-read-website-data-line-by-line-when-available

7.  https://sites.pitt.edu/~naraehan/python3/mbb12.html

8.  https://docs.python.org/3/library/datetime.html

9.  https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html

10. https://docs.python.org/3/library/urllib.request.html



***

# End