## A demonstration of pandas DataFrames used to investigate CAO points

Author: Jon Ishaque
Commenced: 29th September 2021
GMIT SID: G00398244

This notebook extracts CAO points from the CAO website for 2019, 2020 and 2021. It loads the data into pandas DataFrames and uses pandas and Python to compare points across the years.

 Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

In [1]:
# package for making http requests

import requests as rq
# Dates and time package
import datetime as dt

#dataframes
import pandas as pd

#import regex package for searching strings
import re

## 2021 points
Source: http://www.cao.ie/index.php?page=points&p=2021

The 2021 CAO points are presented in a web page. This part of the notebook loads the web page content. A loop then reads each line of the page, determines whether its content is relevant, and writes the relevant lines to a CSV file.

In [2]:
# Get the web page.
# Inspect the response headers to determine the content type [3].

resp = rq.get('http://www2.cao.ie/points/l8.php', headers={"content-type":"text"})
resp.headers['content-type']

'text/html; charset=iso-8859-1'

In [3]:
#check the response
resp

<Response [200]>

The server says we should decode as per: *Content-Type: text/html; charset=iso-8859-1*
However, one line uses the byte \x96, which iso-8859-1 does not define as a printable character. We therefore decode with cp1252, which is very similar but does define \x96. The character in question was on a level 8 course with an Irish fada in its name. (In cp1252, \x96 itself decodes to an en dash.)
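The difference between the two encodings can be checked on a short byte string. This is a minimal sketch: the sample bytes are made up for illustration and are not taken from the CAO page.

```python
import unicodedata

# Made-up sample bytes containing \x96 (not real CAO data).
raw = b'Course A \x96 Part 1'

as_latin1 = raw.decode('iso-8859-1')
as_cp1252 = raw.decode('cp1252')

# Under iso-8859-1 the byte becomes U+0096, an invisible control character.
print(unicodedata.category(as_latin1[9]))  # Cc (control)

# Under cp1252 the same byte becomes U+2013, an en dash.
print(unicodedata.category(as_cp1252[9]))  # Pd (dash punctuation)

# Accented letters such as e-acute (byte \xe9) decode identically in both.
fada_latin1 = b'\xe9'.decode('iso-8859-1')
fada_cp1252 = b'\xe9'.decode('cp1252')
print(fada_latin1 == fada_cp1252)  # True
```

Because the two encodings agree on the accented letters, switching to cp1252 only changes how bytes in the range \x80 to \x9f are interpreted.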

In [4]:
# Save the original encoding
original_encoding = resp.encoding

# Change to cp1252, which handles accented characters
resp.encoding = 'cp1252'

Create a timestamp string, *nowstr*, from the current date and time. It is used in the file names of backup copies of the CAO points.

In [5]:
# Get the current date and time.

now = dt.datetime.now()

# Format as a string.
nowstr = now.strftime('%Y%m%d_%H%M%S')



In [6]:
# Create a file path for the original data. 2021
path = 'data/cao2021_' + nowstr + '.html'

In [7]:
# Save the original html file.
with open(path, 'w') as f:
    f.write(resp.text)
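The backup step assumes the `data` folder already exists and relies on the platform's default file encoding. As a hedged sketch, the same idea can be wrapped in a small helper that creates the folder and writes with an explicit encoding. The function name `save_backup` and the choice of UTF-8 are assumptions for illustration, not part of the notebook.

```python
import datetime as dt
import os
import tempfile


def save_backup(text, directory, prefix, when, ext='.html'):
    """Write text to <directory>/<prefix><timestamp><ext> and return the path.

    Hypothetical helper: the name and the UTF-8 encoding are assumptions.
    """
    # Create the target folder if it is missing.
    os.makedirs(directory, exist_ok=True)
    # Same timestamp format as the notebook uses for nowstr.
    stamp = when.strftime('%Y%m%d_%H%M%S')
    path = os.path.join(directory, prefix + stamp + ext)
    # An explicit encoding avoids depending on the platform default.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return path


# Usage with a temporary folder and a fixed date, so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_backup('<html>sample</html>', tmp, 'cao2021_',
                        dt.datetime(2021, 9, 29, 14, 5, 9))
    print(os.path.basename(saved))  # cao2021_20210929_140509.html
```

In the notebook itself the call would pass `resp.text`, the `data` folder and `dt.datetime.now()`.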

###### Compile the regular expression once so it is not recompiled at each iteration of the loop reading the web page

###### Explanation of the regular expression [4][5]:
('[A-Z]{2}[0-9]{3} (.*)([0-9]{3})(\*)? *')

[A-Z]{2}        Any two uppercase letters

[0-9]{3}        Any three digits 0-9

' '             The space separating the course code from the title; any further padding spaces are absorbed by (.*)

(.*)([0-9]{3})   Any amount of text (the course title), followed by three digits (the points)

(\*)?               An optional asterisk after the points

' *'                Zero or more trailing spaces
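To see how the pattern behaves, it can be tried against a few sample lines. This is a minimal sketch: the lines below are invented to resemble the layout of the page and are not real CAO data.

```python
import re

# Same pattern as in the notebook, written as a raw string.
re_courses = re.compile(r'[A-Z]{2}[0-9]{3} (.*)([0-9]{3})(\*)? *')

# Invented sample lines (not real CAO data).
sample_lines = [
    'XX123  Example Course Title  300*',  # code, title, points, asterisk
    'Course Code  Title  Points',         # header-like line, no course code
    'XX124  Portfolio Course  #+matric',  # no three-digit points value
]

for line in sample_lines:
    m = re_courses.fullmatch(line)
    if m:
        # group(1) keeps the padding spaces around the title, so strip them.
        title, points, flag = m.group(1).strip(), m.group(2), m.group(3)
        print('match:', title, points, flag)
    else:
        print('no match:', line)
```

Only the first line matches. Because `fullmatch` must consume the whole line, lines without a three-digit points value at the end are skipped.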

In [8]:
# Compile the regular expression once, as a raw string so \* is not treated as a string escape [4]
re_courses = re.compile(r'[A-Z]{2}[0-9]{3} (.*)([0-9]{3})(\*)? *')




The following block of code iterates through each line of the web page response and writes the lines that match the regular expression to a CSV file.

In [15]:


# The file path for the csv file.
path = 'data/cao2021_csv_' + nowstr + '.csv'

# Count matching lines for a cross-check against the web page.
no_lines = 0
with open(path, 'w') as f:
    # Loop through the lines of the response.
    for line in resp.iter_lines():
        # iter_lines() yields bytes, so decode each line to str.
        dline = line.decode('cp1252')
        # Check if the line matches the regular expression. If so, write it out.
        if re_courses.fullmatch(dline):
            no_lines += 1
            # Split the line on two or more spaces.
            linesplit = re.split('  +', dline)
            # Rejoin the substrings with commas in between, i.e. comma separated.
            f.write(','.join(linesplit) + '\n')
print(f"number of lines is {no_lines}")
# Check this number against the web page.

number of lines is 922
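Joining the fields with plain commas works as long as no course title contains a comma. As a hedged alternative, not what the notebook does, the standard library `csv` module can quote such fields. The sample line below is invented for illustration.

```python
import csv
import io
import re

# Invented sample line whose title contains a comma (not real CAO data).
line = 'XX123  Example Course, Part A  300'

# Split on two or more spaces, as in the notebook.
fields = re.split('  +', line)
print(fields)

# csv.writer quotes any field that contains the delimiter.
buffer = io.StringIO()
csv.writer(buffer).writerow(fields)
print(buffer.getvalue())
```

With a plain `','.join(...)` this line would produce four columns instead of three. When writing to a real file with `csv.writer`, the file should be opened with `newline=''`, as the `csv` documentation recommends.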


---
## References:
[1]

[2]

[3] https://docs.python-requests.org/en/latest/index.html

[4] https://docs.python.org/3/library/re.html

[5] https://docs.python.org/3/library/re.html?highlight=re%20match#re.match

## End