# CAO POINTS ANALYSIS
### AUTHOR: ANTE DUJIC
<hr style="border:2px solid black"> </hr>

## INTRODUCTION
<hr style="border:2px solid gray"> </hr>

This notebook shows how to load CAO points data from the CAO website into a pandas data frame, and compares the CAO points for 2019, 2020 and 2021.

[<center><img src="http://www.cao.ie/images/cao.png" width="100"/></center> ](http://www.cao.ie/index.php)

***
### CONTENTS

1. [WHAT IS CAO](#CAO)
2. [LOADING AND SAVING THE DATA](#DATA)
    - 2.1. [LEVEL 8 POINTS - R1 and R2](#R1R2)
        - 2.1.1. [2019, 2020, 2021 (html)](#HTML)
    - 2.2. [LEVEL 8 POINTS - EOS and MID](#EOSMID)
        - 2.2.1. [2020 (xlsx)](#XLSX)
        - 2.2.2. [2019 (pdf)](#PDF)
3. [CONCATENATING THE DATA](#CONCATENATE)

## 1. WHAT IS CAO <a id='CAO'></a>
<hr style="border:2px solid gray"> </hr>

The purpose of the Central Applications Office (CAO) is to process centrally applications for undergraduate courses in Irish Higher Education Institutions (HEIs), and to deal with them in an efficient and fair manner. [1]

Students applying for admission to third level education courses in Ireland apply to the CAO rather than to individual educational institutions such as colleges and universities. The CAO then offers places to students who meet the minimum requirements for a course for which they have applied. If for a particular course there are more qualified applicants than available places, the CAO makes offers to those applicants with the highest score in the CAO points system. If students do not accept offers, or later decline them because they receive an offer for another course, the CAO makes further offers until all of the places have been filled or until the offer season closes. [2]

In [1]:
# HTTP request
import requests as rq

# Regular expressions
import re

# Dates and time
import datetime as dt

# Data frames
import pandas as pd

# For downloading
import urllib.request as urlrq

# PDF
import camelot

# To use .unescape
import html

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

## 2. LEVEL 8 - ROUND 1 and ROUND 2
<hr style="border:2px solid gray"> </hr>

The data for Round 1 and Round 2 for the three given years (2019, 2020 and 2021) has been pulled from the CAO website. As mentioned in the README, CAO practice is to first upload the data with only Round 1 and Round 2 points and then overwrite that data with EOS and MID points after the academic year starts. To obtain the Round 1 and Round 2 data for 2019 and 2020 I have used [The Internet Archive](https://web.archive.org/). This not only gave me access to the archived versions of the website, but also gave me the data in the same format and structure for all three years.


2019: https://web.archive.org/web/20191019135815/http://www2.cao.ie:80/points/l8.php <br>
2020: https://web.archive.org/web/20201108133105/http://www2.cao.ie/points/l8.php <br>
2021: http://www2.cao.ie/points/l8.php

### THE WEBSITE STRUCTURE
***

The data on the CAO website has the following structure:
- Title
- Information on how to read the data
- List of colleges
- Course Level Title
- Points table
    - Course code
    - Course title
    - Round 1 points
    - Round 2 points

The data of interest is contained under the table section of the website. It is explained below how the data was scraped, cleaned and saved for the later analysis.

### SCRAPING AND CLEANING THE DATA
***

<br>
The current date and time are used in the names of the files that store the original and the cleaned data. This makes the data files easier to find and organize, and it also prevents existing files from being overwritten.

In [2]:
# Current date and time
now = dt.datetime.now()
# Format as a string
nowstr = (now.strftime("%Y%m%d_%H%M%S"))

<br>
To filter out only the relevant data from the website I've used the regular expression below. A regular expression is a sequence of characters that specifies a search pattern. Such patterns are usually used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. [4] The regular expression used here matches only the lines starting with a course code (e.g. AL801).

In [3]:
# Regular expression
re_course = re.compile(r"([A-Z]{2}[0-9]{3})(.*)")
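
As a quick illustration of what the pattern accepts, the sketch below runs it against three lines. The sample lines are made up for this example and are not copied from the CAO website; only the pattern itself comes from the cell above.

```python
import re

# Same pattern as in the notebook: two capital letters, three digits, then anything
re_course = re.compile(r"([A-Z]{2}[0-9]{3})(.*)")

# Hypothetical lines, made up for illustration only
sample_lines = [
    "AL801  Software Design with Virtual Reality and Gaming     300",
    "Athlone Institute of Technology",
    "<h3>Level 8</h3>",
]

# Keep only the lines that match the pattern from start to end
matched = [line for line in sample_lines if re_course.fullmatch(line)]
# The first group of the pattern is the course code
codes = [re_course.fullmatch(line).group(1) for line in matched]
print(codes)
```

Only the first line survives, because it is the only one that starts with two capital letters followed by three digits.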

<br>
To loop through the years and run multiple lines of code for each year in one go, I've created the dictionary below.

In [4]:
# Defining a dictionary
years_dict = {
    "2019": [("data/cao2019" +  nowstr), "https://web.archive.org/web/20191019135815/http://www2.cao.ie:80/points/l8.php"],
    "2020": [("data/cao2020" +  nowstr), "https://web.archive.org/web/20201108133105/http://www2.cao.ie/points/l8.php"],
    "2021": [("data/cao2021" +  nowstr), "http://www2.cao.ie/points/l8.php"]
}

<br>
There are 3 main parts to the loop below:
<br>
1. Save the original html
<br>
This keeps a copy of the website in the format from which the data was scraped, in case the CAO website itself changes in the future. The data was fetched from the given URLs and decoded as cp1252, because symbols in certain course names were otherwise not supported. NOTE: The encoding declared on the website is wrong.
<br>
2. Read the data and filter out only the relevant data
<br>
The fetched data was iterated through line by line and filtered with the regular expression above to catch only the relevant data. Certain 2019 and 2020 course names use the symbol "&" instead of "and", and it was being written as "&amp;" when saving to csv. To avoid this, the *html.unescape* function was used. Each line was then split into fields and joined with commas to create the 4 csv columns "CODE", "TITLE", "R1_POINTS", "R2_POINTS".
<br>
3. Save the cleaned data as csv files
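
The effect of the encoding choice and of *html.unescape* can be seen on a single byte string. This is a minimal sketch: the bytes are made up for illustration, and ISO-8859-1 is only an assumed example of a wrong encoding, since the notebook does not say which encoding the server declares.

```python
import html

# Hypothetical raw bytes, made up for illustration:
# 0x92 is a curly apostrophe in cp1252
raw = b"Children\x92s Nursing &amp; Midwifery"

# Assumed wrong encoding: 0x92 becomes an invisible control character
as_latin1 = raw.decode("iso-8859-1")
# cp1252: 0x92 becomes the right single quotation mark
as_cp1252 = raw.decode("cp1252")

# Convert the HTML entity &amp; back to a plain &
clean = html.unescape(as_cp1252)
print(repr(as_latin1))
print(clean)
```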

In [None]:
# Loop through the (dict) years
for year, content in years_dict.items():
    # Fetch the CAO points URL
    resp = rq.get(content[1])
    # The server declares the wrong encoding, so set it to "cp1252"
    resp.encoding = "cp1252"
    # Check if OK: <Response [200]> means OK
    print(year, resp)

    # Save the original html file
    with open(content[0] + ".html", "w") as f:
        f.write(resp.text)

    # Keep track of how many courses we process
    no_lines = 0
    # Iterator over the lines (as bytes) of the response
    resps = resp.iter_lines()

    # Open the csv file for writing (saving)
    with open(content[0] + ".csv", "w") as f:
        # Write a header row
        f.write(",".join(["CODE", "TITLE", "R1_POINTS", "R2_POINTS"]) + "\n")
        # Loop through lines of the response
        for line in resps:
            # Decode the line as cp1252 (not the encoding the server declares)
            dline = line.decode("cp1252")
            # Convert HTML entities such as &amp; to &
            dline = html.unescape(dline)
            # Match only the lines representing courses
            if re_course.fullmatch(dline):
                # Add one to the lines counter
                no_lines = no_lines + 1
                # The course code
                course_code = dline[:5]
                # The course title
                course_title = dline[7:57]
                # Round 1 and Round 2 points, separated by runs of spaces
                course_points = re.split(' +', dline[60:])
                # Keep only the first two values
                if len(course_points) != 2:
                    course_points = course_points[:2]
                # Collect the four fields of the row
                linesplit = [course_code, course_title, course_points[0], course_points[1]]
                # Join the fields with commas and write the row
                f.write(",".join(linesplit) + "\n")
    # Print the total number of processed lines
    print("Total number of lines in CAO", year, "database is", no_lines)
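
The slicing used in the loop can be checked on one line in isolation. The sketch below builds a hypothetical course line (made-up code, title and points) in the fixed-width layout the loop assumes. Stripping the padding from the title and writing the row with *csv.writer* are additions made for this illustration and are not part of the loop above; they show one possible way to handle a title that contains a comma.

```python
import csv
import io
import re

# Hypothetical course line laid out in the fixed-width format the loop
# assumes: code, two spaces, title padded to 50 characters, then the points
sample = ("XX123" + "  "
          + "Business, Economic and Social Studies".ljust(50)
          + "   " + "300   290")

course_code = sample[:5]                          # characters 0-4
course_title = sample[7:57].strip()               # characters 7-56, padding removed
course_points = re.split(" +", sample[60:])[:2]   # first two space-separated values

# csv.writer quotes any field that contains a comma
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["CODE", "TITLE", "R1_POINTS", "R2_POINTS"])
writer.writerow([course_code, course_title] + course_points)
print(buffer.getvalue())
```

Joining the fields with a plain `",".join(...)` would split this title into two columns, whereas the quoted form keeps it in one.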

## 2. LEVEL 8 - EOS and MID
<hr style="border:2px solid gray"> </hr>

As mentioned in the README, EOS and MID points data becomes available after the start of the academic year and, at the time of making this project, is only available for 2019 and 2020. However, the project could be updated in the future when the data becomes available for 2021.

2019: http://www2.cao.ie/points/lvl8_19.pdf <br>
2020: http://www2.cao.ie/points/CAOPointsCharts2020.xlsx

### 2019 - THE PDF FILE STRUCTURE
***

Data containing EOS and MID points for 2019 is in pdf format.
The file has the following structure:
- Title and the Subtitles of the file
- Information on the data in the file
- Information on how to read the data
- Points table
    - Course code
    - Institution and Course
    - EOS points
    - MID points

The data of interest is contained under the table section of the file. It is explained below how the data was scraped, cleaned and saved for the later analysis.

### SCRAPING AND CLEANING THE DATA
***

To obtain the data from the pdf file I've used the *camelot* library. The steps are similar to those taken when scraping the R1 and R2 points:
<br>
1. Save the original file
2. Read the pdf file and filter out the table only
3. Concatenate the tables read
4. Save the data as csv file

In [None]:
# Creating a file path for the original data
path2019pdf = 'data/cao2019_eos' + nowstr + '.pdf'

In [None]:
# Fetch the CAO points URL
resp_pdf = rq.get("http://www2.cao.ie/points/lvl8_19.pdf")
resp_pdf # <Response [200]> means OK

In [None]:
# Save the original file
with open(path2019pdf, 'wb') as f:
    f.write(resp_pdf.content)

In [None]:
# Read the pdf file
tables = camelot.read_pdf(path2019pdf, pages = "all", flavor = "lattice")

In [None]:
# Check the total number of tables read
print ("Tables:", tables.n)

In [None]:
# Collect the data frame of every table camelot found (18 for this file)
table_total = [tables[x].df for x in range(tables.n)]

# Concatenate all tables
table = pd.concat(table_total)
# Remove the first row (the old column names read from the pdf)
table = table.iloc[1:, :]
# Name the columns
table.columns = ["CODE", "TITLE", "EOS_2019", "MID_2019"]
# Sort table by "CODE" column
table = table.sort_values("CODE")
# Remove first 35 rows (name of the college)
table = table.iloc[35:, :]
# Save .csv file
table.to_csv("data/cao2019_eos" + nowstr + ".csv", index = False)
#table

### 2020 - THE XLSX FILE STRUCTURE
***

Data containing EOS and MID points for 2020 is in xlsx format. The file has the following structure:
- Title and the Subtitles of the file
- Information on the data in the file
- Information on how to read the data
- Points table


The data of interest is contained under the columns "". It is explained below how the data was scraped, cleaned and saved for the later analysis.

### SCRAPING AND CLEANING THE DATA
***

To obtain the xlsx file I've used the *urllib.request* library. Steps taken:
<br>
1. Save the original file
2. Read the xlsx file and clear the data
3. Save the data as csv file

In [None]:
# Create a file path for the original data
path = ("data/cao2020_eos" +  nowstr + ".xlsx")

# Copying network xlsx file
urlrq.urlretrieve('http://www2.cao.ie/points/CAOPointsCharts2020.xlsx', path)

In [None]:
# Parse the excel spreadsheet saved above (skipping the 10 rows above the table)
df2020_eos = pd.read_excel(path, skiprows = 10)

In [None]:
# Filter out only level 8 courses
df2020_eos = df2020_eos.loc[df2020_eos["LEVEL"] == 8]
# Remove last 12 columns
df2020_eos = df2020_eos.iloc[: , :-12]
# Save pandas data frame to disk
df2020_eos.to_csv(("data/cao2020_eos" +  nowstr + ".csv"))
#df2020_eos

## 3. CONCATENATING THE DATA <a id='CONCATENATE'></a>
<hr style="border:2px solid gray"> </hr>

In [None]:
# Defining a dictionary
df_dict = {
    "2019": [("data/cao2019" +  nowstr)],
    "2019_eos": [("data/cao2019_eos" +  nowstr)],
    "2020": [("data/cao2020" +  nowstr)],
    "2020_eos": [("data/cao2020_eos" +  nowstr)],
    "2021": [("data/cao2021" +  nowstr)]
}

In [None]:
# Creating an empty list for adding dataframes
dataframe = []
# Loop - reading the csv files and appending to list
for year, path in df_dict.items():
    data = pd.read_csv ((path[0] + ".csv"), encoding='cp1252')
    dataframe.append (data)
# Concatenating all dataframes into one
allcourses = pd.concat (dataframe)
# Filtering out columns
allcourses = allcourses [["CODE", "TITLE"]]
# Remove duplicates created by concatenating
allcourses.drop_duplicates(subset=["CODE"], inplace=True, ignore_index=False)
# Sort the table by "CODE" column
allcourses.sort_values("CODE", inplace = True)
#allcourses

In [None]:
#2019 df
dataframe[0].columns = ["CODE","TITLE", "R1_POINTS_2019", "R2_POINTS_2019"]
#2019_eos is #dataframe[1]
#2020 df
dataframe[2].columns = ["CODE","TITLE", "R1_POINTS_2020", "R2_POINTS_2020"]
#2020_eos df
dataframe[3] = dataframe[3][["COURSE CODE2","EOS", "EOS Mid-point"]]
dataframe[3].columns = ["CODE","EOS_2020", "MID_2020"]
#2021 df
dataframe[4].columns = ["CODE","TITLE", "R1_POINTS_2021", "R2_POINTS_2021"]

In [None]:
# Loop - set "CODE" column as index for all df
for i in dataframe:
    i.set_index("CODE", inplace=True)

In [None]:
allcourses.set_index("CODE", inplace=True)
allcourses = allcourses.join(dataframe[0][["R1_POINTS_2019", "R2_POINTS_2019"]])
#allcourses

In [None]:
allcourses = allcourses.join(dataframe[1][["EOS_2019", "MID_2019"]])
allcourses = allcourses.join(dataframe[2][["R1_POINTS_2020", "R2_POINTS_2020"]])
allcourses = allcourses.join(dataframe[3][["EOS_2020", "MID_2020"]])
allcourses = allcourses.join(dataframe[4][["R1_POINTS_2021", "R2_POINTS_2021"]])
#allcourses
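
The joins above align rows on the "CODE" index, so a course that is missing from one year's table simply gets an empty value in that year's columns. A minimal sketch with made-up course codes and points (none of these values come from the CAO data):

```python
import pandas as pd

# Hypothetical course list and points tables, made up for illustration
allcourses = pd.DataFrame(
    {"CODE": ["AA001", "AA002", "AA003"],
     "TITLE": ["Course A", "Course B", "Course C"]}
).set_index("CODE")
points_2019 = pd.DataFrame(
    {"CODE": ["AA001", "AA002"], "R1_POINTS_2019": [300, 410]}
).set_index("CODE")
points_2021 = pd.DataFrame(
    {"CODE": ["AA002", "AA003"], "R1_POINTS_2021": [420, 350]}
).set_index("CODE")

# join() is a left join on the index: every course in allcourses is kept
combined = allcourses.join(points_2019).join(points_2021)
print(combined)
```

Here AA003 has no 2019 entry and AA001 has no 2021 entry, so those cells are NaN in the combined table.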

In [None]:
#allcourses.sort_values("CODE", inplace = True)
allcourses.to_csv ("data/Final_table.csv")
allcourses

## 4. DATA ANALYSIS
<hr style="border:2px solid gray"> </hr>

In [None]:
df = pd.read_csv ("data/Final_table.csv", index_col = ["CODE", "TITLE"])
df

In [None]:
df.dtypes

In [None]:
# Filtering out the non numeric values in df
    # Remove all non-digit characters from string values
df_numeric = df.replace(r'\D', '', regex=True)
    # Change columns type from object to numeric (unparseable values become NaN)
df_numeric = df_numeric.apply(pd.to_numeric, errors='coerce')
df_numeric
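
The idea of stripping non-digit characters before converting can be tried on a small column first. This is a sketch under stated assumptions: the values below are made up to resemble points entries with markers and are not taken from the CAO files.

```python
import pandas as pd

# Hypothetical points values, made up for illustration
raw = pd.Series(["300", "451*", "#+matric", "AQA"], dtype=object)

# Strip every non-digit character; entries with no digits become empty strings
digits_only = raw.str.replace(r"\D", "", regex=True)
# Convert to numbers; anything that cannot be parsed becomes NaN
numeric = pd.to_numeric(digits_only, errors="coerce")
print(numeric)
```

Entries that contain a number plus a marker keep the number, while entries with no digits at all end up as missing values. Note that this approach would also remove a decimal point, so it is only suitable for whole-number points.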

In [None]:
df_numeric.to_csv ("Numeric_table.csv")

In [None]:
df_numeric.dtypes

In [None]:
# 2019
r1_2019 = df_numeric ["R1_POINTS_2019"]
r2_2019 = df_numeric ["R2_POINTS_2019"]
eos_2019 = df_numeric ["EOS_2019"]
mid_2019 = df_numeric ["MID_2019"]
# 2020
r1_2020 = df_numeric ["R1_POINTS_2020"]
r2_2020 = df_numeric ["R2_POINTS_2020"]
eos_2020 = df_numeric ["EOS_2020"]
mid_2020 = df_numeric ["MID_2020"]
# 2021
r1_2021 = df_numeric ["R1_POINTS_2021"]
r2_2021 = df_numeric ["R2_POINTS_2021"]

In [None]:
r1_2019

In [None]:
df_numeric.describe()

In [None]:
df_points = df_numeric [["R1_POINTS_2019", "R1_POINTS_2020", "R1_POINTS_2021",
                        "R2_POINTS_2019", "R2_POINTS_2020", "R2_POINTS_2021"]]
df_points.corr()

In [None]:
sns.heatmap (data = df_points.corr(), square = True, annot = True, cmap = "mako")
plt.xticks (rotation = 45)
plt.yticks (rotation = 45)
plt.show()

In [None]:
import numpy as np
plt.rcParams['figure.figsize'] = [25, 10]
plt.xticks (ticks = (100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100))
plt.hist ([r1_2019, r1_2020, r1_2021], alpha = 0.7)
plt.show()

In [None]:
new = pd.read_csv ("Numeric_table.csv")
new

In [None]:
sns.scatterplot (data = new)
plt.show()

In [None]:
sns.boxplot (data = [r1_2019, r1_2020, r1_2021])
plt.show()

In [None]:
sns.kdeplot (data = [r1_2019, r1_2020, r1_2021])
plt.show()

### DATA COMPARISON

# CONCLUSION

***

## REFERENCES
***

- [1] http://www2.cao.ie/handbook/handbook2022/hb.pdf
- [2] https://en.wikipedia.org/wiki/Central_Applications_Office
- [3] https://www.independent.ie/life/family/learning/understanding-your-cao-course-guide-26505318.html
- [4] https://en.wikipedia.org/wiki/Regular_expression

## TESTING

In [None]:
df_numeric.dropna(axis = 0, how = 'all', inplace = True)
df_numeric

In [None]:
a = df[df['R1_POINTS_2019'].str.contains(r"\*", na = False)]
a

In [None]:
a['R1_POINTS_2019'].str.split('[0-9]').str[-1]
