# A Comparative Analysis of CAO Points for 2019, 2020 and 2021
Source: [CAO Webpage](http://www.cao.ie/index.php?page=mediastats)
***

## Table of Contents

#### 1. Introduction

#### 2. Retrieving the Data

#### 3. Concatenating & Joining the Data Sets

<br>

### Import Libraries

***



In [195]:
# Import libraries
import pandas as pd

# Create plots.
import matplotlib.pyplot as plt

# Nice plot style. 
import seaborn as sns

# Numerical operations. 
import numpy as np

# Regular expressions.
import re

# HTTP requests.
import requests as rq

# To get dates and times.
import datetime as dt

# Opening URLs.
import urllib.request as urlrq

# Engine to read in excel file.
import openpyxl as oxl

# Ensures plots are shown.
%matplotlib inline

<br>

### Function to retrieve current date and time
***

In [196]:
def time():
    # Gets the current date and time
    cur_time = dt.datetime.now()
    # Format as a string
    current_time = cur_time.strftime('%Y%m%d_%H%M%S')
    return current_time
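
As a small illustration of the format string used above, the following sketch formats a fixed, made-up moment instead of the current time, so the result is reproducible. The variable names `example_moment` and `stamp` are only for illustration.

```python
import datetime as dt

# A fixed, illustrative moment (an assumption, used instead of datetime.now()).
example_moment = dt.datetime(2021, 11, 15, 19, 9, 5)

# Same format string as in time(): year, month, day, underscore, hours, minutes, seconds.
stamp = example_moment.strftime('%Y%m%d_%H%M%S')
print(stamp)
```

Because the fields run from largest (year) to smallest (second) and are zero-padded, file names built this way sort chronologically when sorted alphabetically.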

<br>

# 2021 CAO data

http://www.cao.ie/index.php?page=points&p=2021&bb=points

***


<br>

When I first started this report, the 2021 points were only available in HTML format, as round 1 and round 2 data. Later, an Excel file became available with data specifying the Interview/Portfolio and EOS midpoints. Because of this, the first part of this section demonstrates the extraction of the round 1 and round 2 data from the HTML, while the latter part demonstrates the retrieval of the Interview/Portfolio and EOS Midpoints columns from the Excel file.

<br>

### Retrieve data from webserver
***

In [197]:
# Retrieves CAO points from the webserver.
response = rq.get('http://www2.cao.ie/points/l8.php')

# Response 200 signifies a successful request/response.
response

<Response [200]>

<br>

### Save the original data
***

In [198]:
# Creates a file path for the original data.
# The timestamp in the filename makes each download easy to find in folders and keeps the files sorted.
pathHTML = 'data/cao2021' + time() + '.html'

In [199]:
# Saves the original html file.
with open(pathHTML, 'w') as f:
    f.write(response.text)

<br>

### Charset error on server

***


Technically, the server states the encoding as:

```
    Content-Type: text/html; charset=iso-8859-1
```

However, one line uses the byte \x96, which has no printable character in iso-8859-1.

Therefore, we decode with cp1252, which is very similar but maps \x96 to an en dash.

In [200]:
# Store the encoding reported by the server, which is incorrect for this page.
orig_encoding = response.encoding

# Override it with cp1252 so that the text decodes correctly.
response.encoding = 'cp1252'
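
To make the charset issue concrete, here is a minimal sketch that decodes the same bytes with both encodings. The byte string is a made-up example, not a line taken from the CAO page.

```python
# A made-up line containing the problematic byte 0x96.
raw = b'Arts \x96 Joint Honours'

# iso-8859-1 maps 0x96 to an invisible control character (U+0096).
as_latin1 = raw.decode('iso-8859-1')

# cp1252 maps 0x96 to an en dash (U+2013), which is what the page intends.
as_cp1252 = raw.decode('cp1252')

print(repr(as_latin1))
print(repr(as_cp1252))
```

Both calls succeed without raising an error, which is why the problem is easy to miss: the wrong encoding silently produces a non-printing character.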

<br>

### Using regular expressions to extract desired data
***

To do: Explain what the regular expression is doing. Consider a step-by-step explanation.

In [201]:
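# Compile the regular expression once so it is not recompiled for every line.
# The r prefix makes this a raw string, so Python does not interpret backslashes.
# [A-Z]{2}[0-9]{3} matches a course code: two capital letters followed by three digits.
# Two spaces separate the code from the rest of the line, which (.*) captures.
# General syntax notes: \{character} matches the literal character (e.g. \*),
# ? means 0 or 1 of the preceding item and + means 1 or more of it.
re_course = re.compile(r'([A-Z]{2}[0-9]{3})  (.*)')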
<br>

### Iterating through the response line by line
***

To do:

Here we are dealing with the hash and asterisk symbols, which denote portfolio and random selection respectively.

Make sure to spell out what # and * mean.

Include an image as an example.

In [202]:
# Function to separate the *, # and digits in the points string.

def points_to_list(string):
    # A leading '#' marks a portfolio requirement.
    # startswith() is safe on an empty string, unlike string[0].
    portfolio = '#' if string.startswith('#') else ''
    # A trailing '*' marks random selection.
    random = '*' if string.endswith('*') else ''
    # Keep only the digits from the string.
    points = ''.join(ch for ch in string if ch.isdigit())
    return [points, portfolio, random]
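
As an alternative sketch of the same idea, the three parts can be pulled apart with a single regular expression. The helper name `split_points` is hypothetical, and the function assumes the points string has the shape "optional #, digits, optional *"; anything else (for example a text code such as AQA) yields three empty strings.

```python
import re

def split_points(text):
    """Split a points string into [points, portfolio_marker, random_marker]."""
    # Optional '#', optional digits, optional '*', with optional spaces between them.
    match = re.fullmatch(r'(#?)\s*(\d*)\s*(\*?)', text.strip())
    if match is None:
        # The string does not have the assumed shape.
        return ['', '', '']
    portfolio, points, random = match.groups()
    return [points, portfolio, random]

print(split_points('#700*'))
print(split_points('300'))
print(split_points('AQA'))
```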

In [203]:
# Create a path for the CSV file.
path2021 = 'data/cao2021' + time() + '.csv'

# Keeping count of the courses we are processing.
course_count = 0

# Open the CSV file for writing.
with open(path2021, 'w') as f:
    # Loop through the response line by line.
    for line in response.iter_lines():
        # iter_lines() yields bytes, so decode each line into a string using cp1252.
        d_line = line.decode('cp1252')
        # Keep only the lines that match re_course, i.e. the course lines.
        if re_course.fullmatch(d_line):
            # Adds one to the course count
            course_count += 1
            # Extract course code and strip of any white space.
            course_code = d_line[:5].strip()
            # Extract course title and strip it of any white space.
            course_title = d_line[7:57].strip()
            # Points.
            course_points = re.split(' +' , d_line[60:])
            # Some lines split into more than two substrings (the last course produced three),
            # so keep only the first two: the round 1 and round 2 points.
            if len(course_points) != 2:
                course_points = course_points[:2]
            # Rejoin the substrings with commas. Because course points is a list we need to specify both items. 
            line_split = [course_code, course_title, course_points[0], course_points[1]]
            f.write(','.join(line_split) + '\n')

            # https://web.microsoftstream.com/video/89e78275-d944-444e-bed7-d964a3bd2c35 30.00
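
One caveat with joining fields by hand using commas: a course title that itself contains a comma would be split across two columns when the file is read back. The sketch below shows how Python's built-in `csv` module quotes such fields automatically. It writes to an in-memory buffer rather than to `path2021`, and the two rows are illustrative values copied from tables in this notebook.

```python
import csv
import io

# Illustrative rows; the first title contains a comma.
rows = [
    ['AD101', 'First Year Art & Design (Common Entry,portfolio)', '#+matric', ''],
    ['AL801', 'Software Design for Virtual Reality and Gaming', '300', ''],
]

# Write the rows with csv.writer, which quotes any field containing a comma.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers exactly four fields per row.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

In the loop above, the same effect could be achieved by creating the writer once after opening the file and calling `writer.writerow(line_split)` in place of the manual join.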


<br>

After writing the data to file, I went into the CSV and created a row with headings for each column. 

In [269]:
# Read in 2021 csv file and add header row.
df_2021 = pd.read_csv("data/cao202120211115_190905.csv", 
                names=["course_code", "course_title", "rnd_1", "rnd_2"])    

df_2021['rnd_1'] = df_2021['rnd_1'].to_numpy()

df_2021

Unnamed: 0,course_code,course_title,rnd_1,rnd_2
0,AL801,Software Design for Virtual Reality and Gaming...,300,
1,AL802,Software Design in Artificial Intelligence for...,313,
2,AL803,Software Design for Mobile Apps and Connected ...,350,
3,AL805,Computer Engineering for Network Infrastructur...,321,
4,AL810,Quantity Surveying ...,328,
...,...,...,...,...
944,WD211,Creative Computing ...,270,
945,WD212,Recreation and Sport Management ...,262,
946,WD230,Mechanical and Manufacturing Engineering ...,230,230
947,WD231,Early Childhood Care and Education ...,266,


<br>

## Get EOS points & portfolio/interview data
***

[read_excel() documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)

In [270]:
# Save file path to a variable.
eos_excel = 'data/CAOPointsCharts2021.xlsx'

In [271]:
# Read in data.
eos_2021 = pd.read_excel(eos_excel, skiprows=11, engine='openpyxl') # Use openpyxl to open xlsx spreadsheet.

# Check data.
eos_2021.head()

Unnamed: 0,CATEGORY (ISCED Description),Course Title,Course Code,R1 Points,R1 Random,R2 Points,R2 Random,EOS Points,EOS Random,EOS Midpoints,Course Level,HEI,Test/Interview,AVP,v
0,Engineering and engineering trades,Music and Instrument Technology,AL605,211,,,,211,,319,6,Athlone Institute of Technology,,,
1,Health,Pharmacy Technician,AL630,308,,,,308,,409,6,Athlone Institute of Technology,,,
2,Health,Dental Nursing,AL631,311,,,,311,,400,6,Athlone Institute of Technology,,,
3,Biological and related sciences,Applied Science,AL632,297,,,,297,,454,6,Athlone Institute of Technology,,,
4,Business and administration,Business,AL650,AQA,,AQA,,AQA,,351,6,Athlone Institute of Technology,,avp,


<br>

### Drop columns
***

In [272]:
# Delete irrelevant columns.
eos_2021 = eos_2021.drop(columns=['Course Title', 'CATEGORY (ISCED Description)', 'R1 Points', 'R1 Random', 'R2 Points ', 'R2 Random', 'EOS Points', 'EOS Random', 'Course Level', 'HEI', 'AVP', 'v'])

eos_2021


Unnamed: 0,Course Code,EOS Midpoints,Test/Interview
0,AL605,319,
1,AL630,409,
2,AL631,400,
3,AL632,454,
4,AL650,351,
...,...,...,...
1446,WD211,392,
1447,WD212,304,
1448,WD230,361,
1449,WD231,366,


<br>

### Concatenate and join tables
***

In [273]:
# Set the index to be the course code as this is what we will join the dataframes on.
eos_2021.set_index('Course Code', inplace=True)
df_2021.set_index('course_code', inplace=True)

In [274]:
# Join dataframes.
df_2021 = df_2021.join(eos_2021)

In [275]:
# Check the dataframe.
df_2021

Unnamed: 0_level_0,course_title,rnd_1,rnd_2,EOS Midpoints,Test/Interview
course_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AL801,Software Design for Virtual Reality and Gaming...,300,,359,
AL802,Software Design in Artificial Intelligence for...,313,,381,
AL803,Software Design for Mobile Apps and Connected ...,350,,398,
AL805,Computer Engineering for Network Infrastructur...,321,,381,
AL810,Quantity Surveying ...,328,,377,
...,...,...,...,...,...
WD211,Creative Computing ...,270,,392,
WD212,Recreation and Sport Management ...,262,,304,
WD230,Mechanical and Manufacturing Engineering ...,230,230,361,
WD231,Early Childhood Care and Education ...,266,,366,


In [276]:
# Count the rows in the dataframe.
len(df_2021)

949

On 12-11-2021, it was verified that 949 courses were listed in the 2021 CAO data online at the following webpage: http://www2.cao.ie/points/l8.php. This matches the number of rows in our dataset, indicating that no data was lost.
***

<br>

# 2020 CAO data

[2020 data from CAO website](http://www.cao.ie/index.php?page=points&p=2020)
***

In [212]:
# Save the URL in a variable.
url_2020 = 'http://www2.cao.ie/points/CAOPointsCharts2020.xlsx'

<br>

### Save the original file
***

In [213]:
# Create a file path for the original data set.
pathxlsx = 'data/CAO2020' + time() + '.xlsx'

# Opening URL
urlrq.urlretrieve(url_2020, pathxlsx)

('data/CAO202020211208_185217.xlsx', <http.client.HTTPMessage at 0x12535d940>)

<br>

### Load data with pandas
[Pandas documentation for reading in excel data](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html)
***

The function from the pandas library to read in an Excel file is `pandas.read_excel()`.

One of this function's parameters is `engine`. If we set the value to `'openpyxl'`, pandas uses this library to open newer Excel file formats (.xlsx).

In [214]:
# Load and parse spread sheet. 
df_2020 = pd.read_excel(pathxlsx, skiprows=10, engine='openpyxl') # Use openpyxl to open xlsx spreadsheet.
df_2020

Unnamed: 0,CATEGORY (i.e.ISCED description),COURSE TITLE,COURSE CODE2,R1 POINTS,R1 Random *,R2 POINTS,R2 Random*,EOS,EOS Random *,EOS Mid-point,...,avp,v,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8
0,Business and administration,International Business,AC120,209,,,,209,,280,...,,,,,,,,,,
1,Humanities (except languages),Liberal Arts,AC137,252,,,,252,,270,...,,,,,,,,,,
2,Arts,"First Year Art & Design (Common Entry,portfolio)",AD101,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
3,Arts,Graphic Design and Moving Image Design (portfo...,AD102,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
4,Arts,Textile & Surface Design and Jewellery & Objec...,AD103,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1459,Manufacturing and processing,Manufacturing Engineering,WD208,188,,,,188,,339,...,,,,,,,,,,
1460,Information and Communication Technologies (ICTs),Software Systems Development,WD210,279,,,,279,,337,...,,,,,,,,,,
1461,Information and Communication Technologies (ICTs),Creative Computing,WD211,271,,,,271,,318,...,,,,,,,,,,
1462,Personal services,Recreation and Sport Management,WD212,270,,,,270,,349,...,,,,,,,,,,


In [215]:
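# Preview the dataframe without the irrelevant columns. drop() returns a new dataframe,
# so df_2020 itself keeps all of its columns for the steps further down.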
df_2020.drop(columns=['COURSE TITLE', 
                    'CATEGORY (i.e.ISCED description)', 
                    'R1 Random *', 
                    'R2 Random*',
                    'LEVEL',
                    'EOS', 
                    'EOS Random *',  
                    'HEI', 
                    'avp', 
                    'v', 
                    'Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6', 
                    'Column7', 'Column8'])


Unnamed: 0,COURSE CODE2,R1 POINTS,R2 POINTS,EOS Mid-point,Test/Interview #
0,AC120,209,,280,
1,AC137,252,,270,
2,AD101,#+matric,,#+matric,#
3,AD102,#+matric,,#+matric,#
4,AD103,#+matric,,#+matric,#
...,...,...,...,...,...
1459,WD208,188,,339,
1460,WD210,279,,337,
1461,WD211,271,,318,
1462,WD212,270,,349,
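
Note that the table above is only a preview, since the result of `drop()` was not assigned to a variable. A minimal sketch with a made-up frame shows the difference:

```python
import pandas as pd

# Illustrative frame; the values are made up.
df = pd.DataFrame({'code': ['XX001'], 'title': ['Course A'], 'extra': [None]})

# drop() returns a new dataframe and leaves the original untouched.
preview = df.drop(columns=['extra'])

print(list(preview.columns))  # without 'extra'
print(list(df.columns))       # still contains 'extra'
```

Leaving `df_2020` unchanged is useful here, because later cells still select the `COURSE TITLE` column from it.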


In [216]:
# Checking random row to ensure data integrity.
df_2020.iloc[650]

CATEGORY (i.e.ISCED description)                                             Arts
COURSE TITLE                        Arts (Drama, Theatre and Performance Studies)
COURSE CODE2                                                                GY118
R1 POINTS                                                                     451
R1 Random *                                                                   NaN
R2 POINTS                                                                     NaN
R2 Random*                                                                    NaN
EOS                                                                           451
EOS Random *                                                                  NaN
EOS Mid-point                                                                 492
LEVEL                                                                           8
HEI                                        National University of Ireland, Galway
Test/Interview #

In [217]:
# Checking the last row.
df_2020.iloc[-1]


CATEGORY (i.e.ISCED description)          Engineering and engineering trades
COURSE TITLE                        Mechanical and Manufacturing Engineering
COURSE CODE2                                                           WD230
R1 POINTS                                                                253
R1 Random *                                                              NaN
R2 POINTS                                                                NaN
R2 Random*                                                               NaN
EOS                                                                      253
EOS Random *                                                             NaN
EOS Mid-point                                                            369
LEVEL                                                                      8
HEI                                        Waterford Institute of Technology
Test/Interview #                                                         NaN

In [218]:
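# Create a file path for saving the data to a CSV file.
path2020 = 'data/cao2020_' + time() + '.csv'
# Saving is currently disabled. Note that to_csv() has no `names` parameter, so the intended
# column names ("area", "course_title", "course_code", "Mid-point") would need to be set by
# renaming the columns first or by passing them through `header=`.
#df_2020.to_csv(path2020)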

df_2020

Unnamed: 0,CATEGORY (i.e.ISCED description),COURSE TITLE,COURSE CODE2,R1 POINTS,R1 Random *,R2 POINTS,R2 Random*,EOS,EOS Random *,EOS Mid-point,...,avp,v,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8
0,Business and administration,International Business,AC120,209,,,,209,,280,...,,,,,,,,,,
1,Humanities (except languages),Liberal Arts,AC137,252,,,,252,,270,...,,,,,,,,,,
2,Arts,"First Year Art & Design (Common Entry,portfolio)",AD101,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
3,Arts,Graphic Design and Moving Image Design (portfo...,AD102,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
4,Arts,Textile & Surface Design and Jewellery & Objec...,AD103,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1459,Manufacturing and processing,Manufacturing Engineering,WD208,188,,,,188,,339,...,,,,,,,,,,
1460,Information and Communication Technologies (ICTs),Software Systems Development,WD210,279,,,,279,,337,...,,,,,,,,,,
1461,Information and Communication Technologies (ICTs),Creative Computing,WD211,271,,,,271,,318,...,,,,,,,,,,
1462,Personal services,Recreation and Sport Management,WD212,270,,,,270,,349,...,,,,,,,,,,


On 8 December 2021, it was verified that there were 1464 courses listed on the CAO webpage for the year 2020. This corresponds with the number of courses in our data set. 

<br>

# 2019 CAO data
 
http://www2.cao.ie/points/lvl8_19.pdf

***


## Convert pdf to csv
***

How was the data prepared before reading in the csv file below?

I copied the contents of the pdf in Preview and pasted them into a Word document so that they formatted nicely.

I then copied the data from the Word document to a csv file, deleting unnecessary data such as the preamble, page numbers and the full names of the Higher Education Institutions, while keeping the course codes, points etc.

In [219]:
# Read in 2019 cao csv file & use the tab character as the delimiter.
df_2019 = pd.read_csv('data/cao2019_20211101_213010.csv', sep='\t')
df_2019

Unnamed: 0,Course Code,INSTITUTION and COURSE,EOS,Mid
0,AL801,Software Design with Virtual Reality and Gaming,304,328
1,AL802,Software Design with Cloud Computing,301,306
2,AL803,Software Design with Mobile Apps and Connected...,309,337
3,AL805,Network Management and Cloud Infrastructure,329,442
4,AL810,Quantity Surveying,307,349
...,...,...,...,...
925,WD200,Arts (options),221,296
926,WD210,Software Systems Development,271,329
927,WD211,Creative Computing,275,322
928,WD212,Recreation and Sport Management,274,311


In [220]:
# Check columns to find white space. 
df_2019.columns

Index(['Course Code ', 'INSTITUTION and COURSE ', 'EOS ', 'Mid '], dtype='object')

In [221]:
# Remove white space in column titles.
df_2019.columns = df_2019.columns.str.strip()

As seen above, there is white space at the end of the course code and course title strings. It is important to remove any white space now, as it will impede access to the data later on.


In [222]:
# Strip data of any white space.
df_2019['Course Code'] = df_2019['Course Code'].str.strip()
df_2019['INSTITUTION and COURSE'] = df_2019['INSTITUTION and COURSE'].str.strip()
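
To see why the trailing white space matters, here is a small sketch with a made-up frame whose column names and codes end in a space. The trailing space in the values is an assumption made for illustration.

```python
import pandas as pd

# Illustrative frame: note the trailing space in the column names and in the codes.
df = pd.DataFrame({'Course Code ': ['AL801 ', 'AL802 '], 'EOS ': [304, 301]})

# Looking up the column by its clean name fails while the space is still there.
try:
    df['Course Code']
    found = True
except KeyError:
    found = False
print(found)

# Strip the column names first, then the string values.
df.columns = df.columns.str.strip()
df['Course Code'] = df['Course Code'].str.strip()
print(df['Course Code'].tolist())
```

Unstripped values would also cause silent mismatches later, because `'AL801 '` and `'AL801'` are different index labels when joining on course code.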

## Concatenate & Join Data Sets
***

In [305]:
courses2021 = df_2021

In [306]:
courses2021

Unnamed: 0_level_0,course_title,points_r1_2021,points_r2_2021,eos_mid_2021,test/interview_2021
course_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AL801,Software Design for Virtual Reality and Gaming...,300,,359,
AL802,Software Design in Artificial Intelligence for...,313,,381,
AL803,Software Design for Mobile Apps and Connected ...,350,,398,
AL805,Computer Engineering for Network Infrastructur...,321,,381,
AL810,Quantity Surveying ...,328,,377,
...,...,...,...,...,...
WD211,Creative Computing ...,270,,392,
WD212,Recreation and Sport Management ...,262,,304,
WD230,Mechanical and Manufacturing Engineering ...,230,230,361,
WD231,Early Childhood Care and Education ...,266,,366,


In [307]:
# Create new data frame with columns that are applicable to our analysis.
courses2020 = df_2020[['COURSE CODE2', 'COURSE TITLE']]
courses2020.columns = ['course_code', 'course_title']

In [308]:
courses2020

Unnamed: 0,course_code,course_title
0,AC120,International Business
1,AC137,Liberal Arts
2,AD101,"First Year Art & Design (Common Entry,portfolio)"
3,AD102,Graphic Design and Moving Image Design (portfo...
4,AD103,Textile & Surface Design and Jewellery & Objec...
...,...,...
1459,WD208,Manufacturing Engineering
1460,WD210,Software Systems Development
1461,WD211,Creative Computing
1462,WD212,Recreation and Sport Management


In [309]:
df_2019.columns

Index(['Course Code', 'INSTITUTION and COURSE', 'EOS', 'Mid'], dtype='object')

In [310]:
courses2019 = df_2019[['Course Code', 'INSTITUTION and COURSE', ]]
courses2019.columns = ['course_code', 'course_title', ]
courses2019

Unnamed: 0,course_code,course_title
0,AL801,Software Design with Virtual Reality and Gaming
1,AL802,Software Design with Cloud Computing
2,AL803,Software Design with Mobile Apps and Connected...
3,AL805,Network Management and Cloud Infrastructure
4,AL810,Quantity Surveying
...,...,...
925,WD200,Arts (options)
926,WD210,Software Systems Development
927,WD211,Creative Computing
928,WD212,Recreation and Sport Management


#### Concatenate Data Frames
***

Pandas function to concatenate data frames: 

[Documentation](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
`pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None,` 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`names=None, verify_integrity=False, sort=False, copy=True)`&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

Setting `ignore_index` to True ensures that the indices from the old data frames are not brought into the new data frame. This is to ensure no duplication of indices. For example, we do not want multiple rows all with 0 as their index.

In [311]:
# Concatenating 2021, 2020 & 2019 courses.
all_courses = pd.concat([courses2021, courses2020, courses2019], ignore_index=True)
all_courses

Unnamed: 0,course_title,points_r1_2021,points_r2_2021,eos_mid_2021,test/interview_2021,course_code
0,Software Design for Virtual Reality and Gaming...,300,,359,,
1,Software Design in Artificial Intelligence for...,313,,381,,
2,Software Design for Mobile Apps and Connected ...,350,,398,,
3,Computer Engineering for Network Infrastructur...,321,,381,,
4,Quantity Surveying ...,328,,377,,
...,...,...,...,...,...,...
3338,Arts (options),,,,,WD200
3339,Software Systems Development,,,,,WD210
3340,Creative Computing,,,,,WD211
3341,Recreation and Sport Management,,,,,WD212
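
In the output above, `course_code` is empty for the 2021 rows. A likely explanation is that the 2021 dataframe carries the course code as its index, and `ignore_index=True` discards the index. The sketch below reproduces this with two tiny frames (titles shortened to placeholders) and shows one possible remedy, `reset_index()`, which moves the index back into a column first.

```python
import pandas as pd

# Illustrative frames: the first keeps course_code in the index, the second in a column.
courses_a = pd.DataFrame({'course_title': ['Title A']},
                         index=pd.Index(['AL801'], name='course_code'))
courses_b = pd.DataFrame({'course_code': ['AC120'], 'course_title': ['Title B']})

# With ignore_index=True the index of courses_a (the course code) is thrown away.
lost = pd.concat([courses_a, courses_b], ignore_index=True)
print(lost)

# Moving the index into a column before concatenating keeps the code.
kept = pd.concat([courses_a.reset_index(), courses_b], ignore_index=True)
print(kept)
```

Selecting only `course_code` and `course_title` after `reset_index()` would also keep the 2021 points columns out of the combined course list.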


In [312]:
# Reset index.
all_courses.reset_index()

Unnamed: 0,index,course_title,points_r1_2021,points_r2_2021,eos_mid_2021,test/interview_2021,course_code
0,0,Software Design for Virtual Reality and Gaming...,300,,359,,
1,1,Software Design in Artificial Intelligence for...,313,,381,,
2,2,Software Design for Mobile Apps and Connected ...,350,,398,,
3,3,Computer Engineering for Network Infrastructur...,321,,381,,
4,4,Quantity Surveying ...,328,,377,,
...,...,...,...,...,...,...,...
3338,3338,Arts (options),,,,,WD200
3339,3339,Software Systems Development,,,,,WD210
3340,3340,Creative Computing,,,,,WD211
3341,3341,Recreation and Sport Management,,,,,WD212


## Manage Duplicate Rows
***

To deal with duplicate rows, the following functions from the pandas library can be used: 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`pandas.DataFrame.duplicated()` [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`pandas.DataFrame.drop_duplicates()` [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html)
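
A minimal sketch of how these two functions behave, using a made-up frame. The titles are placeholders, and the missing codes are included to show that missing values count as duplicates of each other.

```python
import pandas as pd

# Illustrative frame: AL801 appears twice and two rows have no course code.
df = pd.DataFrame({
    'course_code': ['AL801', 'AL801', 'AC120', None, None],
    'course_title': ['Title 2021', 'Title 2019', 'Title B', 'Title C', 'Title D'],
})

# duplicated() marks every repeat of a code after its first occurrence.
mask = df.duplicated(subset=['course_code'])
print(mask.tolist())

# drop_duplicates() keeps the first occurrence of each code by default (keep='first').
deduped = df.drop_duplicates(subset=['course_code'], ignore_index=True)
print(deduped)
```

Because the first occurrence wins, the order in which the yearly frames are concatenated decides which year's title is kept for a course.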

In [313]:
# Find extra duplicates based on course code. 
all_courses[all_courses.duplicated(subset=['course_code'])]

Unnamed: 0,course_title,points_r1_2021,points_r2_2021,eos_mid_2021,test/interview_2021,course_code
1,Software Design in Artificial Intelligence for...,313,,381,,
2,Software Design for Mobile Apps and Connected ...,350,,398,,
3,Computer Engineering for Network Infrastructur...,321,,381,,
4,Quantity Surveying ...,328,,377,,
5,Civil Engineering ...,,,0,,
...,...,...,...,...,...,...
3338,Arts (options),,,,,WD200
3339,Software Systems Development,,,,,WD210
3340,Creative Computing,,,,,WD211
3341,Recreation and Sport Management,,,,,WD212


In [314]:
# Remove duplicate rows, implementing changes in-place, while ignoring indices. 
all_courses.drop_duplicates(subset=['course_code'], inplace=True, ignore_index=True)

In [315]:
# Take a look at the data. 
all_courses

Unnamed: 0,course_title,points_r1_2021,points_r2_2021,eos_mid_2021,test/interview_2021,course_code
0,Software Design for Virtual Reality and Gaming...,300,,359,,
1,International Business,,,,,AC120
2,Liberal Arts,,,,,AC137
3,"First Year Art & Design (Common Entry,portfolio)",,,,,AD101
4,Graphic Design and Moving Image Design (portfo...,,,,,AD102
...,...,...,...,...,...,...
1599,Applied Archaeology,,,,,SG446
1600,Music Technology,,,,,TL803
1601,Computing with Digital Media,,,,,TL812
1602,Construction Management,,,,,TL842


In [316]:
# Compare two rows.
all_courses.loc[175] == all_courses.loc[176]

course_title           False
points_r1_2021         False
points_r2_2021         False
eos_mid_2021           False
test/interview_2021    False
course_code            False
dtype: bool

<br>

### Join the points
***

This code joins the 2021 points to the all_courses dataframe. 

In [317]:
df_2021

Unnamed: 0_level_0,course_title,points_r1_2021,points_r2_2021,eos_mid_2021,test/interview_2021
course_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AL801,Software Design for Virtual Reality and Gaming...,300,,359,
AL802,Software Design in Artificial Intelligence for...,313,,381,
AL803,Software Design for Mobile Apps and Connected ...,350,,398,
AL805,Computer Engineering for Network Infrastructur...,321,,381,
AL810,Quantity Surveying ...,328,,377,
...,...,...,...,...,...
WD211,Creative Computing ...,270,,392,
WD212,Recreation and Sport Management ...,262,,304,
WD230,Mechanical and Manufacturing Engineering ...,230,230,361,
WD231,Early Childhood Care and Education ...,266,,366,


### Why are we setting the index to be the course code? 

So that we can join the rows based on common course_code and not on index.

In [318]:
# Change df_2021 column names to include the year.
df_2021.columns = ['course_title', 'points_r1_2021', 'points_r2_2021', 'eos_mid_2021', 'test/interview_2021']

# Check dataframe.
df_2021

Unnamed: 0_level_0,course_title,points_r1_2021,points_r2_2021,eos_mid_2021,test/interview_2021
course_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AL801,Software Design for Virtual Reality and Gaming...,300,,359,
AL802,Software Design in Artificial Intelligence for...,313,,381,
AL803,Software Design for Mobile Apps and Connected ...,350,,398,
AL805,Computer Engineering for Network Infrastructur...,321,,381,
AL810,Quantity Surveying ...,328,,377,
...,...,...,...,...,...
WD211,Creative Computing ...,270,,392,
WD212,Recreation and Sport Management ...,262,,304,
WD230,Mechanical and Manufacturing Engineering ...,230,230,361,
WD231,Early Childhood Care and Education ...,266,,366,


In [321]:
# Set the all_courses index to be the course_code column.
all_courses.set_index('course_code', inplace=True)

Unnamed: 0,course_code,course_title,points_r1_2021,points_r2_2021,eos_mid_2021,test/interview_2021
0,,Software Design for Virtual Reality and Gaming...,300,,359,
1,AC120,International Business,,,,
2,AC137,Liberal Arts,,,,
3,AD101,"First Year Art & Design (Common Entry,portfolio)",,,,
4,AD102,Graphic Design and Moving Image Design (portfo...,,,,
...,...,...,...,...,...,...
1599,SG446,Applied Archaeology,,,,
1600,TL803,Music Technology,,,,
1601,TL812,Computing with Digital Media,,,,
1602,TL842,Construction Management,,,,


In [322]:
# Join the 2021 points to the all_courses dataframe.
all_courses = all_courses.join(df_2021)
all_courses

ValueError: columns overlap but no suffix specified: Index(['course_title', 'points_r1_2021', 'points_r2_2021', 'eos_mid_2021',
       'test/interview_2021'],
      dtype='object')
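
The ValueError above is raised because both dataframes contain columns with the same names, and `join()` does not know which to keep. The sketch below reproduces the error with two tiny frames and shows one possible way around it: keep only the title on the left and only the non-overlapping columns on the right. The frames and variable names are illustrative assumptions, not the notebook's actual data.

```python
import pandas as pd

# Illustrative frames that share the same column names.
all_courses = pd.DataFrame({'course_title': ['Title A'], 'points_r1_2021': [None]},
                           index=pd.Index(['AL801'], name='course_code'))
points = pd.DataFrame({'course_title': ['Title A'], 'points_r1_2021': [300]},
                      index=pd.Index(['AL801'], name='course_code'))

# Joining frames with overlapping column names raises a ValueError.
try:
    all_courses.join(points)
    error_message = ''
except ValueError as err:
    error_message = str(err)
print(error_message)

# Remove the overlap before joining: titles from the left, points from the right.
left = all_courses[['course_title']]
right = points.drop(columns=['course_title'])
joined = left.join(right)
print(joined)
```

Passing `lsuffix=` or `rsuffix=` to `join()` is another option, although it keeps both copies of each overlapping column under different names.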

Join the 2020 points to the all_courses dataframe.

In [None]:
df_2020.columns

In [None]:
# Extract columns. 
df_2020 = df_2020[['COURSE CODE2', 'R1 POINTS', 'R2 POINTS']]

# Change column names.
df_2020.columns = ['course_code', 'points_r1_2020', 'points_r2_2020']
df_2020

In [None]:
# Set index to be course_code column. 
df_2020.set_index('course_code', inplace=True)

In [None]:
# Join 2020 points to all_courses dataframe.
all_courses = all_courses.join(df_2020)

# Check that the points were added.
all_courses

In [None]:
df_2019.columns

In [None]:
# Extract columns. 
df_2019 = df_2019[['Course Code', 'Mid', 'EOS']]

# Change column names.
df_2019.columns = ['course_code', 'mid_2019', 'eos_2019']
df_2019

In [None]:
# Set the index for df_2019 to be the course_code column. 
df_2019.set_index('course_code', inplace=True)

In [None]:
# Join 2019 points to all_courses dataframe.
all_courses = all_courses.join(df_2019)

# Check.
all_courses

At this stage it would be a good idea to do a spot check to ensure that the joins took place correctly and the data maintained its integrity.
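
One way such a spot check could look is sketched below. The helper `spot_check` is hypothetical, and the two frames are small stand-ins built from values in the 2019 table above rather than the real dataframes.

```python
import pandas as pd

def spot_check(combined, source, column, codes):
    """Return the course codes whose value in `column` differs between the two frames."""
    mismatches = []
    for code in codes:
        # Note: a missing value (NaN) never equals itself, so it is reported as a mismatch.
        if combined.loc[code, column] != source.loc[code, column]:
            mismatches.append(code)
    return mismatches

# Stand-in for a yearly dataframe indexed by course code.
source = pd.DataFrame({'mid_2019': [328, 306]},
                      index=pd.Index(['AL801', 'AL802'], name='course_code'))

# Stand-in for the combined dataframe after the join.
titles = pd.DataFrame({'course_title': ['Title A', 'Title B']},
                      index=pd.Index(['AL801', 'AL802'], name='course_code'))
combined = titles.join(source)

# An empty list means every checked course kept its value through the join.
problems = spot_check(combined, source, 'mid_2019', ['AL801', 'AL802'])
print(problems)
```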

In [None]:
df_2019['mid_2019']

***
# End