# Fundamentals of Data Analysis Assignment
## Autumn 2021

---
<br>

## Part 2 - CAO points

![CAO_logo](./Images/cao.png)

---

The brief was to analyse the CAO points for 2019, 2020, and 2021. The main tasks were to download the data from the CAO website, analyse it using pandas, and use visualisations to help explain the analysis.

### Downloading the data

# A detailed comparison of CAO points in 2019, 2020, and 2021

The data can be found at the following links:
- 2021 - http://www2.cao.ie/points/l8.php
- 2020 - http://www2.cao.ie/points/CAOPointsCharts2020.xlsx
- 2019 - http://www2.cao.ie/points/lvl8_19.pdf

A quick glance at the file extensions tells us we are dealing with three different file types (a web page, an Excel spreadsheet, and a PDF), so retrieving the data is going to involve a few different methods.

In [1]:
# libraries required for the entire analysis
import pandas as pd # basic data analysis
import matplotlib.pyplot as plt # plots
import datetime as dt # Dates and times
import re # regular expressions
import requests as rq # HTTP requests
import urllib.request as urlrq # for downloading and saving excel file
import csv

## 2021 Points

The url - http://www2.cao.ie/points/l8.php - returns a plain text file. For analysis we need to download it, extract only the information we need, and then convert it to a csv file. The first part of the code below uses the **requests** library to fetch the data. As the data is still being updated (at time of writing), we use a timestamp to name the downloaded file. The **datetime** library supplies the current date and time, which is converted into a string using the **strftime** method. The file path is then built from the **data** folder, the prefix CAO2021, and this timestamp string.

In [2]:
# fetch the cao url
resp = rq.get('http://www2.cao.ie/points/l8.php')

# get the current date and time
now = dt.datetime.now()

# format as a string
nowstr = now.strftime('%Y%m%d_%H%M%S')

# Create a filepath for the original data using the datetime
path2021 = 'data/CAO2021_' + nowstr + '.html'
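
To make the naming scheme concrete, here is a minimal sketch that uses a fixed date and time instead of `now()` so the result is repeatable. The names `example_moment`, `example_stamp` and `example_path` are made up for this illustration and are not part of the notebook.

```python
import datetime as dt

# a fixed moment, chosen only so that the result is reproducible
example_moment = dt.datetime(2021, 11, 8, 11, 9, 20)

# year, month, day, underscore, hours, minutes, seconds
example_stamp = example_moment.strftime('%Y%m%d_%H%M%S')

# build the path in the same way as path2021
example_path = 'data/CAO2021_' + example_stamp + '.html'
print(example_path)
```

Because the string runs from year down to second, the saved files sort chronologically when listed by name.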

During the lectures it was discovered that some characters weren't being displayed properly. The issues lay with the fadas in the Irish language words, plus a hyphen. The server stated that the page should be decoded with **iso-8859-1** but this didn't allow for some characters. We changed the decoding to **cp1252** which solved the issue.

In [3]:
# the server declares the wrong encoding; keep a note of what it declared
original_encoding = resp.encoding

# change to cp1252
resp.encoding = 'cp1252'
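
The two encodings agree on most bytes and only differ in the range 0x80 to 0x9F, where iso-8859-1 has control characters and cp1252 has printable ones. The sketch below shows this for a single byte, 0x96. The sample bytes are made up for illustration and are not taken from the CAO page.

```python
# hypothetical sample: ordinary text with the byte 0x96 in the middle
sample_bytes = b'Arts \x96 Joint Honours'

# decode the same bytes with each encoding
as_latin1 = sample_bytes.decode('iso-8859-1')
as_cp1252 = sample_bytes.decode('cp1252')

# iso-8859-1 maps 0x96 to an invisible control character (U+0096),
# cp1252 maps it to an en dash (U+2013)
print(repr(as_latin1))
print(repr(as_cp1252))
```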

We then save the original file with the following code.

In [4]:
# save the original html file
with open(path2021, 'w') as f:
    f.write(resp.text)

#### Using regular expressions to isolate the data we want

On inspection the file contains a lot of information we don't need for the analysis such as headings, links, college names, etc., so the next challenge was to isolate only what was needed, i.e. 1) course code 2) course name 3) points.
The following code uses a **regular expression** to identify only the lines that match the expression.

In [5]:
# compile the regular expression for matching lines

# ([A-Z]{2}[0-9]{3}) = course code: 2 capital letters then 3 digits - i.e. CW078
# (.*) = the rest of the line - dot(.) = wildcard, * = zero or more of
# the title and points are pulled out of the rest of the line later, by position
re_course = re.compile(r'([A-Z]{2}[0-9]{3})(.*)')
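
As a quick check of what this pattern accepts, the sketch below runs it against three sample lines. The lines are made up for illustration (the real page may be spaced differently), and `course_pattern` and `sample_lines` are names used only here.

```python
import re

# same pattern as above: a course code followed by anything
course_pattern = re.compile(r'([A-Z]{2}[0-9]{3})(.*)')

# hypothetical lines: a course line, a college heading and a blank line
sample_lines = [
    'AL801  Software Design for Virtual Reality and Gaming      300',
    'Athlone Institute of Technology',
    '',
]

for text in sample_lines:
    match = course_pattern.fullmatch(text)
    if match:
        # group(1) is the course code, group(2) is the rest of the line
        print('course:', match.group(1), '| rest:', match.group(2).strip())
    else:
        print('skipped:', repr(text))
```

Only the first line starts with two capital letters followed by three digits, so it is the only one kept.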

We then loop through these lines and save them to a csv file, also stored in the **data** folder.

In [6]:
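# function to separate the points from the # and * markers
def points_to_array(s):
    # '#' at the start is recorded as the portfolio marker
    portfolio = ''
    if s and s[0] == '#':
        portfolio = '#'
    # '*' at the end is recorded as the random marker
    random = ''
    if s and s[-1] == '*':
        random = '*'
    # keep only the digits as the points
    points = ''
    for i in s:
        if i.isdigit():
            points = points + i
    return [points, portfolio, random]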

In [15]:
# Loop through lines and save what we want to a csv file

# path for csv file
path2021b = 'data/CAO2021_csv_' + nowstr + '.csv'

# keep track of courses
no_lines = 0

# write with the same encoding that is used to read the file back in
with open(path2021b, 'w', encoding='cp1252') as f:
    # write a header row
    f.write(','.join(["Course_Code", "Course_Title", "PointsR1", "PointsR2"]) + '\n')
    # loop through lines of response
    for line in resp.iter_lines():
        dline = line.decode('cp1252')
        # match only the lines we want - ones representing courses
        if re_course.fullmatch(dline):
            # add to line counter
            no_lines = no_lines + 1
            # course code (first 5 characters)
            course_code = dline[:5]
            # course_title
            course_title = dline[7:57]
            # round 1 and round 2 points: split the rest of the line on one or more spaces
            course_points = re.split(' +', dline[60:])
            # keep exactly two entries, padding with blanks when one is missing
            course_points = (course_points + ['', ''])[:2]
            # join the fields using a comma
            linesplit = (course_code, course_title, course_points[0], course_points[1])
            f.write(','.join(linesplit) + '\n')
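
The loop above joins the fields with commas by hand, so a course title that itself contained a comma would shift the columns. One possible alternative, sketched here under that assumption, is the standard library's `csv.writer`, which quotes such fields automatically. The rows are hypothetical and an in-memory buffer stands in for the real file.

```python
import csv
import io

# hypothetical rows; the second title contains a comma
sample_rows = [
    ('AL801', 'Software Design for Virtual Reality and Gaming', '300', ''),
    ('AD101', 'First Year Art & Design (Common Entry,portfolio)', '#+matric', ''),
]

# an in-memory buffer stands in for the open csv file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Course_Code', 'Course_Title', 'PointsR1', 'PointsR2'])
writer.writerows(sample_rows)

# read the text back to confirm that every row still has four fields
buffer.seek(0)
rows_read = list(csv.reader(buffer))
print(rows_read[2])
```

When writing to a real file with `csv.writer`, the Python documentation recommends opening it with `newline=''`.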

Just to confirm we have every line we need, we print out the total number of lines and then try to verify that online. 

In [8]:
# prints the total number of lines    
print(f"Total number of lines is {no_lines}.")

Total number of lines is 949.


**N.B.** It was verified on 8/11 against the original data and we're good!

In [18]:
# read the csv file back in as a dataframe
# the header row was already written to the file above
df2021 = pd.read_csv(path2021b, encoding='cp1252')

In [19]:
df2021

Unnamed: 0,Course_Code,Course_Title,PointsR1,PointsR2
0,AL801,Software Design for Virtual Reality and Gaming...,300,
1,AL802,Software Design in Artificial Intelligence for...,313,
2,AL803,Software Design for Mobile Apps and Connected ...,350,
3,AL805,Computer Engineering for Network Infrastructur...,321,
4,AL810,Quantity Surveying ...,328,
...,...,...,...,...
944,WD211,Creative Computing ...,270,
945,WD212,Recreation and Sport Management ...,262,
946,WD230,Mechanical and Manufacturing Engineering ...,230,230
947,WD231,Early Childhood Care and Education ...,266,


## 2020 Points

The 2020 data is already in an Excel spreadsheet, so a different approach is needed!
We first save the original data.

In [11]:
# save the original file to disk
# create the path again using the datetime
path2020 = 'data/CAO2020_' + nowstr + '.xlsx'

urlrq.urlretrieve("http://www2.cao.ie/points/CAOPointsCharts2020.xlsx", path2020)

('data/CAO2020_20211108_110920.xlsx',
 <http.client.HTTPMessage at 0x7fa5ca767e50>)

In [12]:
# Fundamentals-of-Data-Analysis-Assignment/data/CAOPointsCharts2020.xlsx

# Read the Excel file straight from the URL into a dataframe
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
# skip the first 10 rows
df2020 = pd.read_excel("http://www2.cao.ie/points/CAOPointsCharts2020.xlsx", skiprows=10)
df2020
# did a few spot checks to make sure it was read in correctly i.e. df2020.iloc[254]

# Write the dataframe object
# into csv file
# read_file.to_csv ("test.csv",
#                   index = None,
#                   header=True)
   
# # read csv file and convert
# # into a dataframe object
# df = pd.DataFrame(pd.read_csv("test.csv"))


Unnamed: 0,CATEGORY (i.e.ISCED description),COURSE TITLE,COURSE CODE2,R1 POINTS,R1 Random *,R2 POINTS,R2 Random*,EOS,EOS Random *,EOS Mid-point,...,avp,v,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8
0,Business and administration,International Business,AC120,209,,,,209,,280,...,,,,,,,,,,
1,Humanities (except languages),Liberal Arts,AC137,252,,,,252,,270,...,,,,,,,,,,
2,Arts,"First Year Art & Design (Common Entry,portfolio)",AD101,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
3,Arts,Graphic Design and Moving Image Design (portfo...,AD102,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
4,Arts,Textile & Surface Design and Jewellery & Objec...,AD103,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1459,Manufacturing and processing,Manufacturing Engineering,WD208,188,,,,188,,339,...,,,,,,,,,,
1460,Information and Communication Technologies (ICTs),Software Systems Development,WD210,279,,,,279,,337,...,,,,,,,,,,
1461,Information and Communication Technologies (ICTs),Creative Computing,WD211,271,,,,271,,318,...,,,,,,,,,,
1462,Personal services,Recreation and Sport Management,WD212,270,,,,270,,349,...,,,,,,,,,,
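
To show what `skiprows` does, here is a small sketch on made-up text. It uses `pd.read_csv` rather than `pd.read_excel` so that it runs without a spreadsheet file; both functions accept `skiprows`. The preamble lines and the name `sample_df` are assumptions for illustration only.

```python
import io
import pandas as pd

# hypothetical file: two lines of notes before the real header row
sample_text = (
    'Some title line\n'
    'A note about the data\n'
    'COURSE TITLE,COURSE CODE2,R1 POINTS\n'
    'International Business,AC120,209\n'
    'Liberal Arts,AC137,252\n'
)

# skip the two note lines so that the third line becomes the header
sample_df = pd.read_csv(io.StringIO(sample_text), skiprows=2)

# spot check a single row by position
print(sample_df.iloc[1])
```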


In [14]:
# saving updated pandas dataframe to disk as csv file
# creating path
path2020b = 'data/CAO2020_' + nowstr + '.csv'
# writing to csv
df2020.to_csv(path2020b)

## 2019 Points

These are in PDF format! A bit of googling found the tabula library (tabula-py), which seems to deal with PDFs very easily. Quicker than Ian's lecture anyway! :)

In [None]:
# just found this by searching - https://github.com/chezou/tabula-py
import tabula
pdf_path = "http://www2.cao.ie/points/lvl8_19.pdf"
df = tabula.read_pdf(pdf_path, stream=True, pages="all")
df

In [None]:
# convert PDF into CSV file
tabula.convert_into("http://www2.cao.ie/points/lvl8_19.pdf", 'data/CAO2019_' + nowstr + '.csv', output_format="csv", pages='all')

There is an issue with the HEI (higher education institution) entries: we need to be able to delete those rows.

In [None]:
# reads in csv file from data folder
df2019 = pd.read_csv("data/CAO2019_20211101_122559.csv")

# deletes rows where there are blanks - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
df2019 = df2019.dropna()
df2019
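
The sketch below shows how `dropna` behaves on a small made-up dataframe in which institution names appear as rows with blanks in the other columns. The column names and values are hypothetical and are not taken from the 2019 file.

```python
import pandas as pd

# hypothetical data: institution heading rows have no course code or points
sample_2019 = pd.DataFrame({
    'Course Code': [None, 'AL801', 'AL802', None, 'CW078'],
    'Course': ['Athlone Institute of Technology', 'Course A', 'Course B',
               'Institute of Technology, Carlow', 'Course C'],
    'EOS': [None, '304', '301', None, '290'],
})

# drop every row that has at least one missing value
cleaned = sample_2019.dropna()
print(cleaned)
```

Note that `dropna()` with no arguments removes a row if any column is blank, so a genuine course with a blank points entry would be dropped as well. Passing `subset=['Course Code']` (using whatever the real column is called) limits the check to that column.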

In [None]:
path2019b = 'data/CAO2019_' + nowstr + '.csv'
# saving amended file to folder
df2019.to_csv(path2019b)

Next, each csv file needs to be edited so that they are all formatted the same way.

### Concat and join

In [22]:
courses2021 = df2021[['Course_Code', 'Course_Title']]
courses2021

Unnamed: 0,Course_Code,Course_Title
0,AL801,Software Design for Virtual Reality and Gaming...
1,AL802,Software Design in Artificial Intelligence for...
2,AL803,Software Design for Mobile Apps and Connected ...
3,AL805,Computer Engineering for Network Infrastructur...
4,AL810,Quantity Surveying ...
...,...,...
944,WD211,Creative Computing ...
945,WD212,Recreation and Sport Management ...
946,WD230,Mechanical and Manufacturing Engineering ...
947,WD231,Early Childhood Care and Education ...


In [26]:
courses2020 = df2020[['COURSE CODE2', 'COURSE TITLE']]
courses2020.columns = ['Course_Code', 'Course_Title']
courses2020

Unnamed: 0,Course_Code,Course_Title
0,AC120,International Business
1,AC137,Liberal Arts
2,AD101,"First Year Art & Design (Common Entry,portfolio)"
3,AD102,Graphic Design and Moving Image Design (portfo...
4,AD103,Textile & Surface Design and Jewellery & Objec...
...,...,...
1459,WD208,Manufacturing Engineering
1460,WD210,Software Systems Development
1461,WD211,Creative Computing
1462,WD212,Recreation and Sport Management


In [28]:
allCourses = pd.concat([courses2021, courses2020])
allCourses

Unnamed: 0,Course_Code,Course_Title
0,AL801,Software Design for Virtual Reality and Gaming...
1,AL802,Software Design in Artificial Intelligence for...
2,AL803,Software Design for Mobile Apps and Connected ...
3,AL805,Computer Engineering for Network Infrastructur...
4,AL810,Quantity Surveying ...
...,...,...
1459,WD208,Manufacturing Engineering
1460,WD210,Software Systems Development
1461,WD211,Creative Computing
1462,WD212,Recreation and Sport Management


In [34]:
allCourses[allCourses.duplicated(keep=False)]

Unnamed: 0,Course_Code,Course_Title
35,CW068,Applied Social Studies in Professional Social ...
80,CR220,Fine Art at CIT Crawford College of Art and De...
312,AD102,Graphic Design and Moving Image Design (portfo...
455,TR034,Management Science and Information Systems Stu...
459,TR040,Middle Eastern and European Languages and Cult...
777,LM076,Product Design and Technology (portfolio requi...
3,AD102,Graphic Design and Moving Image Design (portfo...
196,CR220,Fine Art at CIT Crawford College of Art and De...
246,CW068,Applied Social Studies in Professional Social ...
813,LM076,Product Design and Technology (portfolio requi...


In [36]:
# copy of dataframe with duplicates removed
allCourses.drop_duplicates()

Unnamed: 0,Course_Code,Course_Title
0,AL801,Software Design for Virtual Reality and Gaming...
1,AL802,Software Design in Artificial Intelligence for...
2,AL803,Software Design for Mobile Apps and Connected ...
3,AL805,Computer Engineering for Network Infrastructur...
4,AL810,Quantity Surveying ...
...,...,...
1459,WD208,Manufacturing Engineering
1460,WD210,Software Systems Development
1461,WD211,Creative Computing
1462,WD212,Recreation and Sport Management


In [37]:
allCourses[allCourses.duplicated(subset=['Course_Code'])]

Unnamed: 0,Course_Code,Course_Title
0,AC120,International Business
1,AC137,Liberal Arts
2,AD101,"First Year Art & Design (Common Entry,portfolio)"
3,AD102,Graphic Design and Moving Image Design (portfo...
4,AD103,Textile & Surface Design and Jewellery & Objec...
...,...,...
1455,WD200,Arts (options)
1460,WD210,Software Systems Development
1461,WD211,Creative Computing
1462,WD212,Recreation and Sport Management


In [41]:
# drop duplicates based on course code
# inplace=True changes the original dataframe - the default returns a new one
allCourses.drop_duplicates(subset=['Course_Code'], inplace=True)
allCourses

Unnamed: 0,Course_Code,Course_Title
0,AL801,Software Design for Virtual Reality and Gaming...
1,AL802,Software Design in Artificial Intelligence for...
2,AL803,Software Design for Mobile Apps and Connected ...
3,AL805,Computer Engineering for Network Infrastructur...
4,AL810,Quantity Surveying ...
...,...,...
1449,WD188,Applied Health Care
1456,WD205,Molecular Biology with Biopharmaceutical Science
1457,WD206,Electronic Engineering
1458,WD207,Mechanical Engineering
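
The concat and duplicate steps above can be tried on a toy example. The two small dataframes below are made up. One title is padded with trailing spaces, on the assumption that padding like this is one reason a whole-row comparison finds fewer duplicates than a comparison on course code alone.

```python
import pandas as pd

# hypothetical course lists for two years
first_year = pd.DataFrame({
    'Course_Code': ['AL801', 'AD102', 'WD211'],
    'Course_Title': ['Software Design', 'Graphic Design', 'Creative Computing   '],
})
second_year = pd.DataFrame({
    'Course_Code': ['AD102', 'WD211', 'AC120'],
    'Course_Title': ['Graphic Design', 'Creative Computing', 'International Business'],
})

# stack the two lists on top of each other
combined = pd.concat([first_year, second_year])

# rows that match another row in every column (keep=False flags all copies)
exact_matches = combined[combined.duplicated(keep=False)]

# one row per course code, keeping the first occurrence
unique_codes = combined.drop_duplicates(subset=['Course_Code'])

print(exact_matches)
print(unique_codes)
```

Here WD211 is not flagged as a whole-row duplicate because the padded and unpadded titles differ, but it is still reduced to a single row when duplicates are dropped by course code. Passing `ignore_index=True` to `pd.concat` would renumber the combined rows from zero.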


# THE END