## A demonstration of pandas dataframes used to investigate CAO points

Author: Jon Ishaque
Commenced: 29th September 2021
GMIT SID: G00398244

This notebook extracts CAO points from the CAO website for 2019, 2020 and 2021. It loads the data into pandas dataframes and uses pandas and Python to compare points across the years.

 Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

In [1]:
# package for making http requests

import requests as rq
# Dates and time package
import datetime as dt

#dataframes
import pandas as pd

#import regex package for searching strings
import re

## 2021 points
#http://www.cao.ie/index.php?page=points&p=2021
The 2021 CAO points are presented in a web page. This part of the notebook loads the web page content. A loop reads each line of the web page, determines whether its content is relevant and, if so, writes it to a csv file.

In [2]:
#Get the web page
#get the headers and determine the content type [3]

resp = rq.get('http://www2.cao.ie/points/l8.php', headers={"content-type":"text"})
resp.headers['content-type']

'text/html; charset=iso-8859-1'

In [3]:
#check the response
resp

<Response [200]>

The page header from the server says the content should be decoded as: *Content-Type: text/html; charset=iso-8859-1*
However, one line uses \x96, which isn't defined in iso-8859-1. Therefore we use cp1252, a very similar decoding standard that does include \x96. The character in question, which had an Irish fada, was on a level 8 course.
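As a small illustration of the difference between the two codecs (a sketch added for clarity, not part of the original notebook run), the byte \x96 can be decoded both ways. In Python, cp1252 maps it to the printable character U+2013, while iso-8859-1 maps it to the control code U+0096. An accented letter such as á decodes identically under both.

```python
# The single byte that caused the problem on the CAO page
raw = b'\x96'

# cp1252 maps 0x96 to a printable dash character (U+2013)
as_cp1252 = raw.decode('cp1252')

# iso-8859-1 maps 0x96 to the control code U+0096
as_latin1 = raw.decode('iso-8859-1')

print(repr(as_cp1252), repr(as_latin1))

# An accented letter is encoded and decoded the same way in both codecs
accented = 'á'.encode('iso-8859-1')
same = accented.decode('cp1252') == accented.decode('iso-8859-1')
print(same)  # True
```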

In [4]:
# Save the original encoding
original_encoding = resp.encoding

# Change to cp1252, which handles accented characters
resp.encoding = 'cp1252'

Create a string variable, *nowstr*, holding the current date and time. It is used in the file names of the backup copies of the CAO points.

In [5]:
# Get the current date and time.

now = dt.datetime.now()

# Format as a string.
nowstr = now.strftime('%Y%m%d_%H%M%S')
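To show what the format string produces, here is a minimal sketch with a fixed date and time instead of the current one. The timestamp is chosen only for illustration.

```python
import datetime as dt

# A fixed moment, used here so the result is reproducible
example = dt.datetime(2021, 11, 10, 18, 44, 38)

# Same format string as in the notebook: year, month, day, underscore, time
stamp = example.strftime('%Y%m%d_%H%M%S')

# Build a backup file name the same way the notebook does
example_path = 'data/cao2021_' + stamp + '.html'
print(example_path)  # data/cao2021_20211110_184438.html
```

Because the fields run from year down to second, the file names sort chronologically when sorted alphabetically.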



In [6]:
# Create a file path for the original data. 2021
path = 'data/cao2021_' + nowstr + '.html'

In [7]:
# Save the original html file.
with open(path, 'w') as f:
    f.write(resp.text)

###### Compile the regular expression once so it is not recompiled at each iteration of the loop reading the web page

###### Explanation of the regular expression [4][5]:
[A-Z]{2}[0-9]{3}  (.*)([0-9]{3})

[A-Z]{2}        Any two upper-case letters

[0-9]{3}        Any three digits 0-9

'  '            Two spaces

(.*)([0-9]{3})   Any amount of text before three numeric characters

In [28]:
#set reg ex
re_courses = re.compile('[A-Z]{2}[0-9]{3} (.*)') #[4]
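The behaviour of the compiled pattern can be checked on a couple of sample strings. This is a sketch only: both sample lines are made up for illustration and are not copied from the CAO page.

```python
import re

# Same pattern as compiled in the notebook, written as a raw string
re_courses = re.compile(r'[A-Z]{2}[0-9]{3} (.*)')

# Hypothetical lines: one shaped like a course row, one like page markup
course_line = 'AL801  Software Design for Virtual Reality and Gaming   300    313'
other_line = '<h3>Level 8 courses</h3>'

# fullmatch succeeds only if the whole line fits the pattern,
# so the line must begin with a course code
is_course = re_courses.fullmatch(course_line) is not None
is_other = re_courses.fullmatch(other_line) is not None
print(is_course, is_other)  # True False
```

Using fullmatch rather than search means a course code appearing in the middle of a line of markup does not count as a course row.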




In [85]:
#helper
def points_to_arr(s):
    """Split a raw points string into [points, portfolio flag, random flag]."""
    #check 1st char for #
    portfolio = '#' if s[0] == '#' else ''
    #check final char for *
    random = '*' if s[-1] == '*' else ''
    #keep only the digits, which drops # and * from the start and end of s
    points = ''.join(ch for ch in s if ch.isdigit())
    return [points, portfolio, random]
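A short usage sketch of the helper follows. The function is repeated here in compact form so the block runs on its own, and the input strings are hypothetical examples of the formats it handles.

```python
def points_to_arr(s):
    """Split a raw points string into [points, portfolio flag, random flag]."""
    portfolio = '#' if s[0] == '#' else ''   # leading '#' is kept as a flag
    random = '*' if s[-1] == '*' else ''     # trailing '*' is kept as a flag
    points = ''.join(ch for ch in s if ch.isdigit())
    return [points, portfolio, random]

print(points_to_arr('300'))    # ['300', '', '']
print(points_to_arr('#700'))   # ['700', '#', '']
print(points_to_arr('451*'))   # ['451', '', '*']
print(points_to_arr('#700*'))  # ['700', '#', '*']
```

The helper indexes s[0] and s[-1], so it must not be called with an empty string. The loop that uses it checks the length first.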

The following block of code iterates through each line of the web page response and writes the lines that match the regular expression to a csv file.

In [86]:
# The file path for the csv file.
path = 'data/cao2021_csv_' + nowstr + '.csv'

In [88]:



#loop through the lines of the response text
#iter_lines() yields bytes, so each line is decoded to str before matching

#set var to count lines for cross check with webpage
no_lines = 0
with open(path, 'w') as f:
    for line in resp.iter_lines():
        #decode the bytes line using cp1252
        dline = line.decode('cp1252')
        #check if line matches reg exp pattern. If so, process it.
        if re_courses.fullmatch(dline):
            no_lines +=1
            #get first five chars - course code
            course_code = dline[:5]
            #course title
            course_title = dline[7:57]
            #r1 points
            round_1 = dline[60:65].rstrip() # get five chars, remove white space
            #if round 1 not blank call fn points_to_arr
            if len(round_1) > 0:
                round_1= points_to_arr(round_1)
                #assign vals from returned array
                pts1 = round_1[0]
                plo1 = round_1[1]
                rnd1 = round_1[2]
            else: 
                pts1 = ''
                plo1 = ''
                rnd1 = ''
            #r2 points
            round_2 = dline[67:].rstrip() # get the rest of the line, remove trailing white space
            #if round 2 not blank call fn points_to_arr
            if len(round_2) > 0:
                round_2= points_to_arr(round_2)
                #assign vals from returned array
                pts2 = round_2[0]
                plo2 = round_2[1]
                rnd2 = round_2[2]
            else: 
                pts2 = ''
                plo2 = ''
                rnd2 = ''
                
                
            # Collect the fields for this course in column order
            linesplit = [course_code,course_title,pts1,plo1,rnd1,pts2,plo2,rnd2]
            # Join the values with commas in between, i.e. comma separated, and write the row
            f.write(','.join(linesplit) + '\n')
print(f"number of lines is {no_lines}")
#check this number is correct

number of lines is 949


#### NB: 949 courses on CAO website verified on 10 November 2021
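The loop relies on fixed character positions and joins the fields with commas by hand. One thing to watch is a course title that itself contains a comma, which would add an extra column to that row. The sketch below, which is an alternative and not what the notebook does, uses the csv module from the standard library to quote such fields. The sample line is hypothetical and is built to follow the same offsets as the slicing in the loop.

```python
import csv
import io

# Hypothetical line laid out with the offsets the loop assumes:
# code at 0-4, title at 7-56, round 1 at 60-64, round 2 from 67 onwards
title = 'Music, Media and Performance Technology'
line = 'AB123' + '  ' + title.ljust(50) + '   ' + '#450'.ljust(5) + '  ' + '460*'

# Slice the fixed-width fields and trim the padding
fields = [line[:5], line[7:57].rstrip(), line[60:65].rstrip(), line[67:].rstrip()]

# Joining by hand splits the title at its comma when read back
naive_columns = ','.join(fields).split(',')

# csv.writer quotes the title, so the row reads back as four fields
buffer = io.StringIO()
csv.writer(buffer).writerow(fields)
round_trip = next(csv.reader(io.StringIO(buffer.getvalue())))

print(len(naive_columns), len(round_trip))  # 5 4
```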

In [10]:
[6] 

[6]

*** 

## 2020 CAO points
###### http://www.cao.ie/index.php?page=points&p=2020 CAO points in 2020 include level 6,7 & 8

In [11]:
#use urllib to retrieve url as file
import urllib.request as urlrq

In [12]:
# Create a file path for the original data, kept as a backup
path = 'data/cao2020_' + nowstr + '.xlsx'

#download to path
urlrq.urlretrieve('http://www2.cao.ie/points/CAOPointsCharts2020.xlsx', path)

('data/cao2020_20211110_184438.xlsx',
 <http.client.HTTPMessage at 0x22313aa4dc0>)

In [13]:
###### Read the Excel file into a pandas dataframe 

In [14]:
#download and parse the excel spreadsheet
#skip first 10 rows
df=pd.read_excel('http://www2.cao.ie/points/CAOPointsCharts2020.xlsx',skiprows=10)
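The effect of skiprows can be seen on a tiny in-memory example. This sketch uses pd.read_csv on a made-up string so that it runs without the spreadsheet. pd.read_excel accepts skiprows in the same way, and the column names and values here are purely illustrative.

```python
import io
import pandas as pd

# Hypothetical file content: two lines of notes before the real header row
raw = (
    "Some title line\n"
    "Some notes line\n"
    "COURSE CODE,R1 POINTS\n"
    "AB123,300\n"
    "CD456,451\n"
)

# Skip the two leading lines so the third line becomes the header
df_demo = pd.read_csv(io.StringIO(raw), skiprows=2)

print(df_demo.columns.tolist())  # ['COURSE CODE', 'R1 POINTS']
print(len(df_demo))              # 2
```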


In [15]:
#df.iloc[123]

#check final row
df.iloc[-1]

CATEGORY (i.e.ISCED description)          Engineering and engineering trades
COURSE TITLE                        Mechanical and Manufacturing Engineering
COURSE CODE2                                                           WD230
R1 POINTS                                                                253
R1 Random *                                                              NaN
R2 POINTS                                                                NaN
R2 Random*                                                               NaN
EOS                                                                      253
EOS Random *                                                             NaN
EOS Mid-point                                                            369
LEVEL                                                                      8
HEI                                        Waterford Institute of Technology
Test/Interview #                                                         NaN

***

## 2019 CAO
#http://www2.cao.ie/points/lvl8_19.pdf

In [16]:
import camelot #use camelot package to extract tables from pdf files [7]

In [17]:
#create a path to back up the file as a csv
path = 'data/cao2019_' + nowstr + '.csv'


In [18]:
tables = camelot.read_pdf('http://www2.cao.ie/points/lvl8_19.pdf',pages='1-end',flavor='stream')
#read all pages [8]

tables
tbl_cnt = len(tables)

#export all tables - not what we really want
#tables.export(path, f='csv', compress=False) # json, excel, html, markdown, sqlite
#tables[0]


Print the parsing report of the first table.

In [19]:
tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}

{'accuracy': 99.02, 'whitespace': 12.24, 'order': 1, 'page': 1}

In [20]:
tbl_cnt

18

In [21]:
# The file path for the csv file.
path = 'data/cao2019_csv_' + nowstr + '.csv'

In [22]:
#iterate through the list of tables [9]
#every table is kept; rows without a course code are filtered out in a later step
data2019 = [] # empty list of dataframes

for t in tables:
    #append the table as a dataframe to the list data2019
    data2019.append(t.df)
    
#combine all the dataframes in the list into one dataframe
dfcombined = pd.concat(data2019)

#add column headers
dfcombined.columns = ['Course Code', 'COURSE', 'EOS', 'Mid']

#write to csv to store as back up.
dfcombined.to_csv(path)


Filter df so only rows with course codes remain. [10]

In [23]:
def regex_filter(val):
    #True only when val is non-empty and contains a course code
    regex = '[A-Z]{2}[0-9]{3}'
    return bool(val) and re.search(regex, val) is not None

df_filtered = dfcombined[dfcombined['Course Code'].apply(regex_filter)]
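An alternative way to build the same kind of filter, shown here only as a sketch on made-up data, is the pandas string method Series.str.contains, which takes a regular expression and can treat missing values as non-matches with na=False.

```python
import pandas as pd

# Hypothetical rows: a repeated header row, blanks and two real courses
demo = pd.DataFrame({
    'Course Code': ['Course Code', 'AL801', '', None, 'WD212'],
    'EOS': ['EOS', '304', '', None, '274'],
})

# True where the value contains two capital letters followed by three digits
mask = demo['Course Code'].str.contains(r'[A-Z]{2}[0-9]{3}', na=False)

filtered = demo[mask]
print(filtered['Course Code'].tolist())  # ['AL801', 'WD212']
```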





###### Reset the index to remove the indexes carried over from the appended dataframes.
reset_index is used because reindex will not work with duplicate index values [11]


In [24]:
df = df_filtered.reset_index(drop=True)
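The duplicate indexes come from concatenating dataframes that each start at 0. A minimal sketch with two made-up frames shows the problem and the effect of reset_index(drop=True). It also shows ignore_index=True on pd.concat, which is an alternative and not what the notebook uses.

```python
import pandas as pd

# Two small hypothetical tables, each with its own 0-based index
a = pd.DataFrame({'code': ['AL801', 'AL802']})
b = pd.DataFrame({'code': ['WD211', 'WD212']})

# Concatenation keeps the original index labels, so they repeat
combined = pd.concat([a, b])
print(combined.index.tolist())  # [0, 1, 0, 1]

# reset_index(drop=True) renumbers the rows and discards the old labels
tidy = combined.reset_index(drop=True)
print(tidy.index.tolist())  # [0, 1, 2, 3]

# Alternative: renumber while concatenating
alt = pd.concat([a, b], ignore_index=True)
print(alt.index.tolist())  # [0, 1, 2, 3]
```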

In [25]:
df

Unnamed: 0,Course Code,COURSE,EOS,Mid
0,AL801,Software Design with Virtual Reality and Gaming,304,328
1,AL802,Software Design with Cloud Computing,301,306
2,AL803,Software Design with Mobile Apps and Connected...,309,337
3,AL805,Network Management and Cloud Infrastructure,329,442
4,AL810,Quantity Surveying,307,349
...,...,...,...,...
925,WD200,Arts (options),221,296
926,WD210,Software Systems Development,271,329
927,WD211,Creative Computing,275,322
928,WD212,Recreation and Sport Management,274,311


---
## References:
[1]

[2]

[3] https://docs.python-requests.org/en/latest/index.html

[4] https://docs.python.org/3/library/re.html

[5] https://docs.python.org/3/library/re.html?highlight=re%20match#re.match

[6] https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html?highlight=read_excel#pandas.read_excel

[7] https://camelot-py.readthedocs.io/en/master/

[8] https://github.com/atlanhq/camelot/issues/278

[9] https://stackoverflow.com/questions/55052989/how-to-iterate-through-a-list-of-data-frames-and-drop-all-data-if-a-specific-str

[10] https://stackoverflow.com/questions/15325182/how-to-filter-rows-in-pandas-by-regex/48884429

[11] https://stackoverflow.com/questions/68261366/right-way-to-reindex-a-dataframe


## End