# Fundamentals of Data Analysis Assignment
## Autumn 2021

---
<br>

## Part 2 - CAO points

![CAO_logo](./Images/cao.png)

---

<br>

# A detailed comparison of CAO points in 2019, 2020, and 2021

The brief was to analyse the CAO points for the years 2019, 2020, and 2021. The main tasks were to download the data from the CAO website, analyse it using pandas, and use visualisations to help explain the analysis.

<br>

### Downloading the data

The data can be found at the following links which we save as variables to be used later.

A quick glance at the file extensions tells us we are dealing with three different file types so scraping the data is going to involve a few different methods.

In [1]:
url2021 = 'http://www2.cao.ie/points/l8.php'
url2020 = 'http://www2.cao.ie/points/CAOPointsCharts2020.xlsx'
url2019 = 'http://www2.cao.ie/points/lvl8_19.pdf'

To begin we import all the necessary libraries for the analysis and visualisations. These are shortened as per convention and economy of space.

In [2]:
# basic data analysis
import pandas as pd

# plotting
import matplotlib.pyplot as plt

# Dates and times
import datetime as dt

# Regular expressions
import re

# HTTP requests
import requests as rq

# for downloading and saving excel file
import urllib.request as urlrq

# working with csv files
import csv

# for pdf files
import tabula

We're going to use a timestamp to name the different updates of the downloaded and saved files.

In [3]:
# get the current date and time
now = dt.datetime.now()

# format as a string
nowstr = now.strftime('%Y%m%d_%H%M%S')

<br>

## 2021 Points

The url - http://www2.cao.ie/points/l8.php - returns a plain text file. For analysis we need to download that, extract only the information we need, and then convert it to a csv file. The first part of the code below uses the **requests** library to fetch the data. As the data is still being updated (at time of writing), we use the timestamp created above to name the downloaded file. The **datetime** library gives the current date and time, which is converted into a string using the **strftime** method. The path is then built from the **data** folder, the prefix CAO2021_ and this timestamp string.

In [4]:
# fetch the cao url
resp = rq.get(url2021)

# Create a filepath for the original data using the datetime
path2021 = 'data/CAO2021_' + nowstr + '.html'

# confirm it's working (if we get a '200' response message)
resp

<Response [200]>

We then save the original file with the following code.

In [5]:
# save the original html file
with open(path2021, 'w') as f:
    f.write(resp.text)

During the lectures it was discovered that some characters weren't being displayed properly. The issues lay with the fadas in the Irish language words, plus a stray em dash. The server stated that the page should be decoded with **iso-8859-1**, but this didn't allow for these particular characters. We changed the decoding to **cp1252**, which solved the issue.

In [6]:
# the server uses the wrong encoding, fix it
original_encoding = resp.encoding

# change to cp1252
resp.encoding = 'cp1252'
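
As a rough illustration of why the choice of encoding matters, the sketch below decodes the same byte both ways. The byte 0x97 is picked for the example and is an assumption, not something copied from the CAO page.

```python
# 0x97 is chosen for illustration; it is not copied from the CAO page
raw = b'\x97'

as_latin1 = raw.decode('iso-8859-1')   # decodes to a non-printing control character
as_cp1252 = raw.decode('cp1252')       # decodes to a printable dash

print(repr(as_latin1), repr(as_cp1252))
```

Both calls succeed, so a wrong encoding gives no error here, only the wrong character.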

<br>

#### Using regular expressions to isolate the data we want

On inspection, the file contains a lot of information we don't need for the analysis, such as headings, links, and college names, so the next challenge was to isolate only what was needed: 1) course code, 2) course name, and 3) points.
The following code uses a **regular expression** to pick out only the lines that match it.
- `([A-Z]{2}[0-9]{3})` matches the course code, e.g. CW078: two capital letters followed by three digits
- `(.*)` matches the rest of the line: the dot (`.`) is a wildcard and `*` means zero or more of it
- in the data the course code is followed by 2 spaces, which is why the title is later read from position 7 onwards
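
To see the pattern at work, here is a small sketch that tries it on two lines. Both lines are made up in the style of the page (they are assumptions, not real rows), one a course line and one an institution heading.

```python
import re

re_course = re.compile(r'([A-Z]{2}[0-9]{3})(.*)')

# hypothetical lines in the style of the CAO page, not copied from it
lines = [
    'AL801  Software Design for Virtual Reality and Gaming   300',
    'Athlone Institute of Technology',
]

for line in lines:
    m = re_course.fullmatch(line)
    # m is None when the line does not start with a course code
    print(bool(m), m.groups() if m else None)
```

Only the first line matches, because the heading does not begin with two capital letters followed by three digits.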

In [7]:
# compile the regular expression for matching lines
re_course = re.compile(r'([A-Z]{2}[0-9]{3})(.*)')

A function created in the lecture to separate the # and * markers from the points. **Still need to work out how to use it!**

In [8]:
# function to isolate # and * 
# def points_to_array(s):
#     portfolio = ''
#     if s[0] == '#':
#         portfolio = '#'
#     random = ''
#     if s[-1] == '*':
#         random = '*'
#     points = ''
#     for i in s:
#         if i.isdigit():
#             points = points + i
#     return [points, portfolio, random]
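
One possible way to use the function is sketched below. This is my own tidied version, not the lecture code, and it assumes that a trailing `*` is meant to set `random` while a leading `#` sets `portfolio`. The sample strings are made up.

```python
def points_to_array(s):
    """Split a raw points string into [points, portfolio, random]."""
    # '#' at the start marks a portfolio or test requirement
    portfolio = '#' if s.startswith('#') else ''
    # '*' at the end marks random selection (assumed meaning)
    random = '*' if s.endswith('*') else ''
    # keep only the digits as the points value
    points = ''.join(ch for ch in s if ch.isdigit())
    return [points, portfolio, random]

# made-up examples of raw points strings
for raw in ['#554', '300*', '271']:
    print(raw, points_to_array(raw))
```

The three returned values could then be written to separate columns of the csv file instead of a single points column.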

We create a new path for saving the extracted data as a csv file.

In [9]:
# path for csv file
path2021b = 'data/CAO2021_csv_' + nowstr + '.csv'

We then loop through these lines and save them to a csv file, which is also stored in the **data** folder.

In [10]:
# keep track of courses
no_lines = 0

with open(path2021b, 'w') as f:
    # write a header row
    f.write(','.join(["Course_Code", "Course_Title", "PointsR1_2021", "PointsR2_2021"]) + '\n')
    # loop through lines of response
    for line in resp.iter_lines():
        # decode the bytes using cp1252
        dline = line.decode('cp1252')
        # match only the lines we want - ones representing courses
        if re_course.fullmatch(dline):
            # add to line counter
            no_lines = no_lines + 1
            # course code (first 5 characters)
            course_code = dline[:5]
            # course_title
            course_title = dline[7:57].strip()
            # round 1 and round 2 points: split the rest of the line on one or more spaces
            course_points = re.split(' +', dline[60:])
            if len(course_points) != 2:
                course_points = course_points[:2]
            # join the fields using a comma and write the row
            linesplit = (course_code, course_title, course_points[0], course_points[1])
            f.write(','.join(linesplit) + '\n')
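
The slicing above can also be sketched as a small function, with the row written by the **csv** module so that a comma inside a title gets quoted. This is an alternative I am suggesting, not what the notebook runs. `parse_course_line` is a hypothetical helper and the sample line is made up to fit the assumed layout (code, 2 spaces, title padded to 50 characters, 3 spaces, points).

```python
import csv
import io
import re

def parse_course_line(dline):
    """Return [code, title, round1, round2] from one fixed-width line."""
    code = dline[:5]
    title = dline[7:57].strip()
    points = re.split(' +', dline[60:].strip())[:2]
    # pad so there are always exactly two points fields
    points += [''] * (2 - len(points))
    return [code, title] + points

# hypothetical line built to the assumed layout
sample = 'AL801  ' + 'Arts, Design and Media'.ljust(50) + '   ' + '300'

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(parse_course_line(sample))
print(buffer.getvalue())
```

Padding the points list means a course with only a round 1 value still produces four fields.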

Just to confirm we have every line we need, we print out the total number of lines and then try to verify that online. 

In [11]:
# prints the total number of lines    
print(f"Total number of lines is {no_lines}.")

Total number of lines is 949.


**N.B.** It was verified on 8/11 against the original data and we're good!

We open the new csv file with **pandas** to inspect.

In [12]:
# open the csv file and save to variable df2021
df2021 = pd.read_csv(path2021b, encoding='cp1252') 

# have a look
df2021

Unnamed: 0,Course_Code,Course_Title,PointsR1_2021,PointsR2_2021
0,AL801,Software Design for Virtual Reality and Gaming,300,
1,AL802,Software Design in Artificial Intelligence for...,313,
2,AL803,Software Design for Mobile Apps and Connected ...,350,
3,AL805,Computer Engineering for Network Infrastructure,321,
4,AL810,Quantity Surveying,328,
...,...,...,...,...
944,WD211,Creative Computing,270,
945,WD212,Recreation and Sport Management,262,
946,WD230,Mechanical and Manufacturing Engineering,230,230
947,WD231,Early Childhood Care and Education,266,


All looks good!

## 2020 Points

We move on to the 2020 points.
The 2020 data is already in an Excel spreadsheet, so a different approach is needed!
We first save the original data.

In [13]:
# save the original file to disk
# create the path again using the timestamp string
path2020 = 'data/CAO2020_' + nowstr + '.xlsx'

urlrq.urlretrieve(url2020, path2020)

('data/CAO2020_20211124_120115.xlsx',
 <http.client.HTTPMessage at 0x7fac6f877310>)

In [14]:
# Read and store the content of an Excel file from a URL - 
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
# skip the first 10 rows
df2020 = pd.read_excel(url2020, skiprows=10)

# have a look
df2020

Unnamed: 0,CATEGORY (i.e.ISCED description),COURSE TITLE,COURSE CODE2,R1 POINTS,R1 Random *,R2 POINTS,R2 Random*,EOS,EOS Random *,EOS Mid-point,...,avp,v,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8
0,Business and administration,International Business,AC120,209,,,,209,,280,...,,,,,,,,,,
1,Humanities (except languages),Liberal Arts,AC137,252,,,,252,,270,...,,,,,,,,,,
2,Arts,"First Year Art & Design (Common Entry,portfolio)",AD101,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
3,Arts,Graphic Design and Moving Image Design (portfo...,AD102,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
4,Arts,Textile & Surface Design and Jewellery & Objec...,AD103,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1459,Manufacturing and processing,Manufacturing Engineering,WD208,188,,,,188,,339,...,,,,,,,,,,
1460,Information and Communication Technologies (ICTs),Software Systems Development,WD210,279,,,,279,,337,...,,,,,,,,,,
1461,Information and Communication Technologies (ICTs),Creative Computing,WD211,271,,,,271,,318,...,,,,,,,,,,
1462,Personal services,Recreation and Sport Management,WD212,270,,,,270,,349,...,,,,,,,,,,


We can see that we have a lot more data than in the 2021 version, plus different headers. So straight away I'd like to rename the matching headers so we can look at both files together.

In [15]:
# change the necessary headers - Course_Code,Course_Title,PointsR1,PointsR2
df2020 = df2020.rename(columns={'COURSE TITLE': 'Course_Title', 'COURSE CODE2': 'Course_Code', 
                                'R1 POINTS': 'PointsR1_2020', 'R2 POINTS': 'PointsR2_2020'})
df2020

Unnamed: 0,CATEGORY (i.e.ISCED description),Course_Title,Course_Code,PointsR1_2020,R1 Random *,PointsR2_2020,R2 Random*,EOS,EOS Random *,EOS Mid-point,...,avp,v,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8
0,Business and administration,International Business,AC120,209,,,,209,,280,...,,,,,,,,,,
1,Humanities (except languages),Liberal Arts,AC137,252,,,,252,,270,...,,,,,,,,,,
2,Arts,"First Year Art & Design (Common Entry,portfolio)",AD101,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
3,Arts,Graphic Design and Moving Image Design (portfo...,AD102,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
4,Arts,Textile & Surface Design and Jewellery & Objec...,AD103,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1459,Manufacturing and processing,Manufacturing Engineering,WD208,188,,,,188,,339,...,,,,,,,,,,
1460,Information and Communication Technologies (ICTs),Software Systems Development,WD210,279,,,,279,,337,...,,,,,,,,,,
1461,Information and Communication Technologies (ICTs),Creative Computing,WD211,271,,,,271,,318,...,,,,,,,,,,
1462,Personal services,Recreation and Sport Management,WD212,270,,,,270,,349,...,,,,,,,,,,


Now to isolate the relevant columns and save them to a new csv file.

In [16]:
# extract the relevant columns and reorder
df2020 = df2020[['Course_Code','Course_Title','PointsR1_2020','PointsR2_2020']]

# have a look
df2020

Unnamed: 0,Course_Code,Course_Title,PointsR1_2020,PointsR2_2020
0,AC120,International Business,209,
1,AC137,Liberal Arts,252,
2,AD101,"First Year Art & Design (Common Entry,portfolio)",#+matric,
3,AD102,Graphic Design and Moving Image Design (portfo...,#+matric,
4,AD103,Textile & Surface Design and Jewellery & Objec...,#+matric,
...,...,...,...,...
1459,WD208,Manufacturing Engineering,188,
1460,WD210,Software Systems Development,279,
1461,WD211,Creative Computing,271,
1462,WD212,Recreation and Sport Management,270,


The new files look similar now, but we can see straight away that 2020 has considerably more rows than 2021! Something we'll have to address later! In the meantime we save to a new file, again using the timestamp to name the file.

In [17]:
# saving updated pandas dataframe as csv file
# creating path
path2020b = 'data/CAO2020_' + nowstr + '.csv'
# writing to csv
df2020.to_csv(path2020b)

<br>

## 2019 Points

These are in PDF format! A bit of googling found the tabula library which seems to deal with PDFs very easily. Quicker than Ian's lecture anyway! :)
Link to tabula - https://github.com/chezou/tabula-py

In [18]:
# access the pdf file using parameters in referenced link above
df = tabula.read_pdf(url2019, stream=True, pages="all")

# have a look
df

[   Course Code                             INSTITUTION and COURSE   EOS    Mid
 0          NaN                    Athlone Institute of Technology   NaN    NaN
 1        AL801    Software Design with Virtual Reality and Gaming   304  328.0
 2        AL802               Software Design with Cloud Computing   301  306.0
 3        AL803  Software Design with Mobile Apps and Connected...   309  337.0
 4        AL805        Network Management and Cloud Infrastructure   329  442.0
 5        AL810                                 Quantity Surveying   307  349.0
 6        AL820                 Mechanical and Polymer Engineering   300  358.0
 7        AL830                                    General Nursing   410  429.0
 8        AL832                                Psychiatric Nursing   387  403.0
 9        AL836                       Nutrition and Health Science   352  383.0
 10       AL837            Sports Science with Exercise Physiology   351  392.0
 11       AL838                         

Seems to work!  
Next, convert it to a csv file, name it using the timestamp method again, and save it to the data folder.

In [19]:
# convert PDF into CSV file
tabula.convert_into(url2019, 'data/CAO2019_' + nowstr + '.csv', output_format="csv", pages='all')

Have a look

In [20]:
# read the csv file created above back in
df2019 = pd.read_csv('data/CAO2019_' + nowstr + '.csv')

# have a look
df2019

Unnamed: 0,Course Code,INSTITUTION and COURSE,EOS,Mid
0,,Athlone Institute of Technology,,
1,AL801,Software Design with Virtual Reality and Gaming,304,328
2,AL802,Software Design with Cloud Computing,301,306
3,AL803,Software Design with Mobile Apps and Connected...,309,337
4,AL805,Network Management and Cloud Infrastructure,329,442
...,...,...,...,...
960,WD200,Arts (options),221,296
961,WD210,Software Systems Development,271,329
962,WD211,Creative Computing,275,322
963,WD212,Recreation and Sport Management,274,311


We have an issue with the institution names: each one takes up a row of its own, with blanks in the other columns. We need to delete those rows. A bit of googling found the following solution - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

In [21]:
# deletes rows where there are blanks
df2019 = df2019.dropna()

# have a look
df2019

Unnamed: 0,Course Code,INSTITUTION and COURSE,EOS,Mid
1,AL801,Software Design with Virtual Reality and Gaming,304,328
2,AL802,Software Design with Cloud Computing,301,306
3,AL803,Software Design with Mobile Apps and Connected...,309,337
4,AL805,Network Management and Cloud Infrastructure,329,442
5,AL810,Quantity Surveying,307,349
...,...,...,...,...
960,WD200,Arts (options),221,296
961,WD210,Software Systems Development,271,329
962,WD211,Creative Computing,275,322
963,WD212,Recreation and Sport Management,274,311
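
One thing to be aware of is that `dropna()` with no arguments drops a row if *any* column is blank, not just the institution rows. The sketch below uses a made-up toy table in the same shape to compare that with `dropna(subset=...)`, which only looks at the course code column.

```python
import pandas as pd

# toy rows in the shape tabula produced; the values are made up
toy = pd.DataFrame({
    'Course Code': [None, 'AL801', 'AL802'],
    'INSTITUTION and COURSE': ['Athlone Institute of Technology', 'Course A', 'Course B'],
    'EOS': [None, '304', '301'],
    'Mid': [None, 328.0, None],
})

any_blank = toy.dropna()                        # drops rows with a blank anywhere
code_only = toy.dropna(subset=['Course Code'])  # drops only rows with no course code

print(len(any_blank), len(code_only))
```

In the toy table the second course has no Mid value, so the plain `dropna()` loses it while the `subset` version keeps it.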


We also need to rename the headers to match the other years' files, but I think we have an issue: we don't have the round 1 and round 2 points! According to https://www.independent.ie/life/family/learning/understanding-your-cao-course-guide-26505318.html, EOS is the *".. FINAL CUT-OFF points, in other words, the points score achieved by the last applicant being offered a place on that course in 2008. The second column gives the MID figure, that is, the points score of the applicant midway between the highest and the lowest applicant being offered a place."*   
EOS roughly compares to Round 2 offers, so I will concentrate on that for the moment.

In [22]:
df2019 = df2019.rename(columns={'INSTITUTION and COURSE': 'Course_Title', 
                                'Course Code': 'Course_Code', 'EOS': 'EOS_2019'})
df2019

Unnamed: 0,Course_Code,Course_Title,EOS_2019,Mid
1,AL801,Software Design with Virtual Reality and Gaming,304,328
2,AL802,Software Design with Cloud Computing,301,306
3,AL803,Software Design with Mobile Apps and Connected...,309,337
4,AL805,Network Management and Cloud Infrastructure,329,442
5,AL810,Quantity Surveying,307,349
...,...,...,...,...
960,WD200,Arts (options),221,296
961,WD210,Software Systems Development,271,329
962,WD211,Creative Computing,275,322
963,WD212,Recreation and Sport Management,274,311


In [23]:
path2019b = 'data/CAO2019_' + nowstr + '.csv'
# saving amended file to folder
df2019.to_csv(path2019b)

<br>

### Concat and join

*The following is all from the week08 lectures.*

In [24]:
courses2021 = df2021[['Course_Code', 'Course_Title']]
courses2021

Unnamed: 0,Course_Code,Course_Title
0,AL801,Software Design for Virtual Reality and Gaming
1,AL802,Software Design in Artificial Intelligence for...
2,AL803,Software Design for Mobile Apps and Connected ...
3,AL805,Computer Engineering for Network Infrastructure
4,AL810,Quantity Surveying
...,...,...
944,WD211,Creative Computing
945,WD212,Recreation and Sport Management
946,WD230,Mechanical and Manufacturing Engineering
947,WD231,Early Childhood Care and Education


In [25]:
courses2020 = df2020[['Course_Code', 'Course_Title']]
courses2020

Unnamed: 0,Course_Code,Course_Title
0,AC120,International Business
1,AC137,Liberal Arts
2,AD101,"First Year Art & Design (Common Entry,portfolio)"
3,AD102,Graphic Design and Moving Image Design (portfo...
4,AD103,Textile & Surface Design and Jewellery & Objec...
...,...,...
1459,WD208,Manufacturing Engineering
1460,WD210,Software Systems Development
1461,WD211,Creative Computing
1462,WD212,Recreation and Sport Management


In [26]:
courses2019 = df2019[['Course_Code', 'Course_Title']]
courses2019

Unnamed: 0,Course_Code,Course_Title
1,AL801,Software Design with Virtual Reality and Gaming
2,AL802,Software Design with Cloud Computing
3,AL803,Software Design with Mobile Apps and Connected...
4,AL805,Network Management and Cloud Infrastructure
5,AL810,Quantity Surveying
...,...,...
960,WD200,Arts (options)
961,WD210,Software Systems Development
962,WD211,Creative Computing
963,WD212,Recreation and Sport Management


In [27]:
allCourses = pd.concat([courses2021, courses2020, courses2019], ignore_index=True)
allCourses

Unnamed: 0,Course_Code,Course_Title
0,AL801,Software Design for Virtual Reality and Gaming
1,AL802,Software Design in Artificial Intelligence for...
2,AL803,Software Design for Mobile Apps and Connected ...
3,AL805,Computer Engineering for Network Infrastructure
4,AL810,Quantity Surveying
...,...,...
3323,WD200,Arts (options)
3324,WD210,Software Systems Development
3325,WD211,Creative Computing
3326,WD212,Recreation and Sport Management


In [28]:
allCourses.sort_values('Course_Code')

Unnamed: 0,Course_Code,Course_Title
175,AC120,International Business
949,AC120,International Business
2579,AC120,International Business
950,AC137,Liberal Arts
2580,AC137,Liberal Arts
...,...,...
2412,WD230,Mechanical and Manufacturing Engineering
946,WD230,Mechanical and Manufacturing Engineering
3327,WD230,Mechanical and Manufacturing Engineering
947,WD231,Early Childhood Care and Education


In [29]:
allCourses[allCourses.duplicated(keep=False)]

Unnamed: 0,Course_Code,Course_Title
4,AL810,Quantity Surveying
5,AL811,Civil Engineering
6,AL820,Mechanical and Polymer Engineering
7,AL830,General Nursing
10,AL836,Nutrition and Health Science
...,...,...
3323,WD200,Arts (options)
3324,WD210,Software Systems Development
3325,WD211,Creative Computing
3326,WD212,Recreation and Sport Management


In [30]:
# copy of dataframe with duplicates removed
allCourses.drop_duplicates()

Unnamed: 0,Course_Code,Course_Title
0,AL801,Software Design for Virtual Reality and Gaming
1,AL802,Software Design in Artificial Intelligence for...
2,AL803,Software Design for Mobile Apps and Connected ...
3,AL805,Computer Engineering for Network Infrastructure
4,AL810,Quantity Surveying
...,...,...
3269,TL801,Animation Visual Effects and Motion Design
3270,TL802,"TV, Radio and New Media"
3271,TL803,Music Technology
3274,TL812,Computing with Digital Media


In [31]:
allCourses[allCourses.duplicated(subset=['Course_Code'])]

Unnamed: 0,Course_Code,Course_Title
949,AC120,International Business
950,AC137,Liberal Arts
951,AD101,"First Year Art & Design (Common Entry,portfolio)"
952,AD102,Graphic Design and Moving Image Design (portfo...
953,AD103,Textile & Surface Design and Jewellery & Objec...
...,...,...
3323,WD200,Arts (options)
3324,WD210,Software Systems Development
3325,WD211,Creative Computing
3326,WD212,Recreation and Sport Management


In [32]:
# drop duplicates based on code
# inplace changes original dataframe - default is opposite
allCourses.drop_duplicates(subset=['Course_Code'], inplace=True, ignore_index=True)
allCourses

Unnamed: 0,Course_Code,Course_Title
0,AL801,Software Design for Virtual Reality and Gaming
1,AL802,Software Design in Artificial Intelligence for...
2,AL803,Software Design for Mobile Apps and Connected ...
3,AL805,Computer Engineering for Network Infrastructure
4,AL810,Quantity Surveying
...,...,...
1644,SG441,Environmental Science
1645,SG446,Applied Archaeology
1646,TL803,Music Technology
1647,TL812,Computing with Digital Media
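
To make the difference between the two kinds of de-duplication clearer, here is a small sketch on made-up data. The course codes and titles are invented for the example.

```python
import pandas as pd

# two invented course lists; AA001 has been renamed between them
a = pd.DataFrame({'Course_Code': ['AA001', 'AA002'],
                  'Course_Title': ['Maths', 'Physics']})
b = pd.DataFrame({'Course_Code': ['AA001', 'AA002'],
                  'Course_Title': ['Mathematics', 'Physics']})

both = pd.concat([a, b], ignore_index=True)

# rows must match in every column to count as duplicates
by_row = both.drop_duplicates()
# rows with a repeated code are dropped, keeping the first title seen
by_code = both.drop_duplicates(subset=['Course_Code'], ignore_index=True)

print(len(both), len(by_row), len(by_code))
```

Because the first occurrence is kept, the order of the dataframes passed to `concat` decides which year's title survives for a course that was renamed.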


### Join to the points

In [33]:
df2021.set_index('Course_Code', inplace=True)
df2021.columns = ['Course_Title', 'PointsR1_2021', 'PointsR2_2021']
df2021

Unnamed: 0_level_0,Course_Title,PointsR1_2021,PointsR2_2021
Course_Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AL801,Software Design for Virtual Reality and Gaming,300,
AL802,Software Design in Artificial Intelligence for...,313,
AL803,Software Design for Mobile Apps and Connected ...,350,
AL805,Computer Engineering for Network Infrastructure,321,
AL810,Quantity Surveying,328,
...,...,...,...
WD211,Creative Computing,270,
WD212,Recreation and Sport Management,262,
WD230,Mechanical and Manufacturing Engineering,230,230
WD231,Early Childhood Care and Education,266,


In [34]:
allCourses.set_index('Course_Code', inplace=True)

In [35]:
allCourses = allCourses.join(df2021[['PointsR1_2021', 'PointsR2_2021']])
allCourses

Unnamed: 0_level_0,Course_Title,PointsR1_2021,PointsR2_2021
Course_Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AL801,Software Design for Virtual Reality and Gaming,300,
AL802,Software Design in Artificial Intelligence for...,313,
AL803,Software Design for Mobile Apps and Connected ...,350,
AL805,Computer Engineering for Network Infrastructure,321,
AL810,Quantity Surveying,328,
...,...,...,...
SG441,Environmental Science,,
SG446,Applied Archaeology,,
TL803,Music Technology,,
TL812,Computing with Digital Media,,


In [36]:
df2020_r1 = df2020[['Course_Code', 'PointsR1_2020', 'PointsR2_2020']]
df2020_r1

Unnamed: 0,Course_Code,PointsR1_2020,PointsR2_2020
0,AC120,209,
1,AC137,252,
2,AD101,#+matric,
3,AD102,#+matric,
4,AD103,#+matric,
...,...,...,...
1459,WD208,188,
1460,WD210,279,
1461,WD211,271,
1462,WD212,270,


In [37]:
df2020_r1.set_index('Course_Code', inplace=True)
df2020_r1

Unnamed: 0_level_0,PointsR1_2020,PointsR2_2020
Course_Code,Unnamed: 1_level_1,Unnamed: 2_level_1
AC120,209,
AC137,252,
AD101,#+matric,
AD102,#+matric,
AD103,#+matric,
...,...,...
WD208,188,
WD210,279,
WD211,271,
WD212,270,


In [38]:
allCourses = allCourses.join(df2020_r1)
allCourses

Unnamed: 0_level_0,Course_Title,PointsR1_2021,PointsR2_2021,PointsR1_2020,PointsR2_2020
Course_Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AL801,Software Design for Virtual Reality and Gaming,300,,303,
AL802,Software Design in Artificial Intelligence for...,313,,332,
AL803,Software Design for Mobile Apps and Connected ...,350,,337,
AL805,Computer Engineering for Network Infrastructure,321,,333,
AL810,Quantity Surveying,328,,319,
...,...,...,...,...,...
SG441,Environmental Science,,,,
SG446,Applied Archaeology,,,,
TL803,Music Technology,,,,
TL812,Computing with Digital Media,,,,


In [39]:
df2019_EOS = df2019[['Course_Code', 'EOS_2019']]
df2019_EOS

Unnamed: 0,Course_Code,EOS_2019
1,AL801,304
2,AL802,301
3,AL803,309
4,AL805,329
5,AL810,307
...,...,...
960,WD200,221
961,WD210,271
962,WD211,275
963,WD212,274


In [40]:
df2019_EOS.set_index('Course_Code', inplace=True)
df2019_EOS

Unnamed: 0_level_0,EOS_2019
Course_Code,Unnamed: 1_level_1
AL801,304
AL802,301
AL803,309
AL805,329
AL810,307
...,...
WD200,221
WD210,271
WD211,275
WD212,274


In [41]:
allCourses = allCourses.join(df2019_EOS)
allCourses

Unnamed: 0_level_0,Course_Title,PointsR1_2021,PointsR2_2021,PointsR1_2020,PointsR2_2020,EOS_2019
Course_Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
AL801,Software Design for Virtual Reality and Gaming,300,,303,,304
AL802,Software Design in Artificial Intelligence for...,313,,332,,301
AL803,Software Design for Mobile Apps and Connected ...,350,,337,,309
AL805,Computer Engineering for Network Infrastructure,321,,333,,329
AL810,Quantity Surveying,328,,319,,307
...,...,...,...,...,...,...
SG441,Environmental Science,,,,,297
SG446,Applied Archaeology,,,,,289
TL803,Music Technology,,,,,264
TL812,Computing with Digital Media,,,,,369
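
The joins above all work the same way, which the following sketch tries to show on made-up data: `join` lines rows up on the index, and any course with no match in the right-hand dataframe gets a blank (NaN).

```python
import pandas as pd

# invented data, indexed by course code like allCourses
courses = pd.DataFrame({'Course_Title': ['Maths', 'Physics']},
                       index=pd.Index(['AA001', 'AA002'], name='Course_Code'))
points = pd.DataFrame({'Points': [400]},
                      index=pd.Index(['AA001'], name='Course_Code'))

# default is a left join: every row of courses is kept
joined = courses.join(points)
print(joined)
```

This is why courses that only existed in one year still appear in allCourses, with blanks in the other years' columns.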


Reorder again

In [42]:
allCourses = allCourses.sort_values('Course_Code')

Next I think I need to create an equivalent of EOS for both 2020 and 2021, so I presume I need the lower of the two scores (though some courses obviously have only the one score).

In [43]:
allCourses

Unnamed: 0_level_0,Course_Title,PointsR1_2021,PointsR2_2021,PointsR1_2020,PointsR2_2020,EOS_2019
Course_Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
AC120,International Business,294,294,209,,234
AC137,Liberal Arts,271,270,252,,252
AD101,First Year Art and Design (Common Entry portfo...,#554,,#+matric,,# +mat
AD102,Graphic Design and Moving Image Design (portfo...,#538,,#+matric,,# +mat
AD103,Textile and Surface Design and Jewellery and O...,#505,,#+matric,,# +mat
...,...,...,...,...,...,...
WD211,Creative Computing,270,,271,,275
WD212,Recreation and Sport Management,262,,270,,274
WD230,Mechanical and Manufacturing Engineering,230,230,253,,273
WD231,Early Childhood Care and Education,266,,,,
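
A possible way of getting that lower score, once the columns are numeric, is a row-wise minimum. This is only a sketch on made-up numbers, and the column name `Lowest` is my own invention.

```python
import pandas as pd

nan = float('nan')

# invented points; nan stands for a round with no value
toy = pd.DataFrame({'PointsR1': [300.0, 271.0, nan],
                    'PointsR2': [nan, 270.0, nan]},
                   index=['AA001', 'AA002', 'AA003'])

# min across the two columns, skipping blanks by default
toy['Lowest'] = toy[['PointsR1', 'PointsR2']].min(axis=1)
print(toy)
```

Blanks are skipped, so a course with only a round 1 score keeps that score, and a course with neither stays blank.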


So I want to get rid of all of the unwanted characters and just leave numbers. I found a solution here - https://pretagteam.com/question/remove-characters-from-pandas-column
By looking through the dataset I've identified all of the things I want to remove, and I replace each of them with an empty string.

In [44]:
# only clean the points columns so the course titles are left untouched
points_cols = ['PointsR1_2021', 'PointsR2_2021', 'PointsR1_2020', 'PointsR2_2020', 'EOS_2019']
# remove the markers (#, AQA, *, +matric, +mat and a stray 'e)'), using a raw string for the regex
allCourses[points_cols] = allCourses[points_cols].replace(r'#|AQA|\*|\+matric|\+mat|e\)', '', regex=True)
allCourses

Unnamed: 0_level_0,Course_Title,PointsR1_2021,PointsR2_2021,PointsR1_2020,PointsR2_2020,EOS_2019
Course_Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
AC120,International Business,294,294,209,,234
AC137,Liberal Arts,271,270,252,,252
AD101,First Year Art and Design (Common Entry portfo...,554,,,,
AD102,Graphic Design and Moving Image Design (portfo...,538,,,,
AD103,Textile and Surface Design and Jewellery and O...,505,,,,
...,...,...,...,...,...,...
WD211,Creative Computing,270,,271,,275
WD212,Recreation and Sport Management,262,,270,,274
WD230,Mechanical and Manufacturing Engineering,230,230,253,,273
WD231,Early Childhood Care and Education,266,,,,
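
Another approach, instead of listing everything to remove, would be to keep only the digits. The sketch below tries this on a few made-up values using `str.extract`; it is an alternative idea, not what was run above.

```python
import pandas as pd

# made-up raw values of the kinds seen in the points columns
raw = pd.Series(['#554', '300*', '# +mat', 'AQA', '271'])

# pull out the first run of digits; values with no digits become missing
digits = raw.str.extract(r'(\d+)', expand=False)
numeric = pd.to_numeric(digits)

print(numeric)
```

Values with no digits at all end up as missing, so nothing has to be anticipated in advance.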


Save this version to a csv file.

In [45]:
# create a new path to save file to
pathAllCourses = 'data/CAOAllYears_' + nowstr + '.csv'

# saving amended file to folder
allCourses.to_csv(pathAllCourses)

<br>

### Some analysis

To perform any kind of analysis, I first had to convert the columns to a numeric type. There is probably a much easier way of doing this.  
I use the **describe()** function to do some exploring.

In [46]:
allCourses['PointsR1_2021'] = pd.to_numeric(allCourses['PointsR1_2021'])
x = allCourses['PointsR1_2021']
x.describe()

count     923.000000
mean      407.666306
std       128.706224
min        57.000000
25%       303.000000
50%       391.000000
75%       499.000000
max      1028.000000
Name: PointsR1_2021, dtype: float64

In [47]:
allCourses['PointsR2_2021'] = pd.to_numeric(allCourses['PointsR2_2021'])
y = allCourses['PointsR2_2021']
y.describe()

count    255.000000
mean     414.749020
std      141.693386
min       60.000000
25%      293.500000
50%      424.000000
75%      521.500000
max      904.000000
Name: PointsR2_2021, dtype: float64

In [48]:
allCourses['PointsR1_2020'] = pd.to_numeric(allCourses['PointsR1_2020'])
z = allCourses['PointsR1_2020']
z.describe()

count    1394.000000
mean      350.995696
std       134.433752
min        55.000000
25%       252.250000
50%       316.500000
75%       433.000000
max      1088.000000
Name: PointsR1_2020, dtype: float64

In [49]:
allCourses['PointsR2_2020'] = pd.to_numeric(allCourses['PointsR2_2020'])
a = allCourses['PointsR2_2020']
a.describe()

count    316.000000
mean     334.329114
std      142.016943
min      100.000000
25%      212.000000
50%      305.000000
75%      462.750000
max      768.000000
Name: PointsR2_2020, dtype: float64

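The same method doesn't work for the 2019 points, presumably because some values in that column are still not plain numbers. I relied on a different answer from the same Stack Overflow page - https://stackoverflow.com/questions/39173813/pandas-convert-dtype-object-to-int. This version turns anything that can't be converted into a missing value and then fills those with 0, so the zeros are included in the summary below and pull the mean and the lower quartiles down.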


In [50]:
# allCourses['EOS_2019'] = allCourses['EOS_2019'].astype(int)
allCourses['EOS_2019'] = pd.to_numeric(allCourses['EOS_2019'], errors='coerce').fillna(0, downcast='infer')
b = allCourses['EOS_2019']
b.describe()

count    1649.000000
mean      205.523347
std       207.988613
min         0.000000
25%         0.000000
50%       251.000000
75%       360.000000
max       979.000000
Name: EOS_2019, dtype: float64
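
A shorter route might be to convert all the points columns in one go and leave the blanks as missing values rather than zeros. The sketch below shows the idea on a made-up table; the column names `R1` and `EOS` are placeholders.

```python
import pandas as pd

# made-up values, including blanks and text that cannot be converted
toy = pd.DataFrame({'R1': ['300', '', '271'],
                    'EOS': ['304', ' ', 'abc']})

# apply to_numeric to every column; unconvertible values become NaN
toy = toy.apply(pd.to_numeric, errors='coerce')

print(toy.describe())
```

Because missing values are ignored by `describe()`, the count shows how many courses really have a score and the mean is not dragged down by zeros.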

# THE END