![cao](img/cao.png)

<br>

# CAO Points Analysis

[Official CAO Website](https://www.cao.ie/)

***

## Overview
***

This notebook analyses the CAO points from the last three years by converting the following:


- 2021 points from an HTML web page into a dataframe.

- 2020 points from an Excel spreadsheet into another dataframe.

- 2019 points from a PDF into a dataframe.



Then I combine the points from all three years into one dataframe for analysis.

<br>

## Importing modules
***

### Regular Expressions
Regular expressions, also known as [regexes](https://realpython.com/regex-python/), are special sequences of characters that form a search pattern. In other words, a regular expression can be used to search through text in order to find every occurrence of a particular pattern. Python has a built-in package for regular expressions called "[re](https://docs.python.org/3/library/re.html)".


### Requests
Unlike `re`, this is a third-party package that has to be installed separately. [Requests](https://www.pythonforbeginners.com/requests/using-requests-in-python) is imported to allow a user to send HTTP/1.1 requests. To put it simply, this module contains various functions that allow the user to retrieve data from a web server over HTTP.


### Datetime
The [datetime module](https://www.geeksforgeeks.org/python-datetime-module/) is imported when working with dates and times. There are six main classes in this module:

1. date - used for year, month and day.


2. time - used for hours, minutes, seconds, microseconds, and tzinfo.


3. datetime - is a combination of both date and time. 


4. timedelta - used to represent duration.


5. tzinfo - gives time zone information objects.


6. timezone - implements tzinfo as a fixed offset from UTC.
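
As a minimal sketch of how these classes fit together, the cell below builds a datetime from a date and a time, shifts it with a timedelta and attaches a time zone. The date and time values are made up for illustration.

```python
import datetime as dt

# A calendar date and a time of day, kept separately.
d = dt.date(2021, 12, 2)
t = dt.time(15, 44, 34)

# Combine them into a single datetime.
stamp = dt.datetime.combine(d, t)

# A timedelta represents a duration and can be added to a datetime.
later = stamp + dt.timedelta(days=1, hours=2)

# timezone is a concrete tzinfo with a fixed offset from UTC.
utc_stamp = stamp.replace(tzinfo=dt.timezone.utc)

print(stamp)      # 2021-12-02 15:44:34
print(later)      # 2021-12-03 17:44:34
print(utc_stamp)  # 2021-12-02 15:44:34+00:00
```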


### Pandas
[Pandas](https://mode.com/python-tutorial/libraries/pandas/#:~:text=Pandas%20is%20a%20Python%20library%20for%20data%20analysis.&text=Pandas%20is%20built%20on%20top,NumPy%27s%20methods%20with%20less%20code.) is a Python library used for data analysis. It provides dataframes and operations for manipulating tabular data and time series. Pandas will be used in this notebook to store and compare the CAO points from 2021, 2020 and 2019.


### Urllib request
This [module](https://docs.python.org/3/library/urllib.request.html) is part of Python's standard library and is used for opening URLs. Unlike the requests library, it needs no installation, and its `urlretrieve` function offers a quick way to download a file from a URL straight to disk.

<br>

In [1]:
# For regular expressions. - ref 1.
import re

# Convenient HTTP requests - ref 4.
import requests as rq

# Dates and times. - ref 8. 
import datetime as dt

# Pandas for data frames. - ref 9.
import pandas as pd

# For downloading urls. - ref 10.
import urllib.request as urlrq

# references to use at end!!!!!


- https://realpython.com/regex-python/


- https://realpython.com/regex-python-part-2/


- https://www.mygreatlearning.com/blog/regular-expression-in-python/


- https://docs.python-requests.org/en/latest/user/quickstart/#make-a-request


- https://howchoo.com/g/ywi5m2vkodk/working-with-datetime-objects-and-timezones-in-python


- https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html


- https://stackoverflow.com/questions/16870648/python-read-website-data-line-by-line-when-available


- https://www.geeksforgeeks.org/python-urllib-module/

<br>

## Time stamp
***

Here I create a time stamp by getting the current date and time using the datetime module. The strftime method then converts the current date and time into a string. There are numerous format codes that can be used; a list of them can be found [here](https://strftime.org/).


In the code cell below, I take the current date and time and convert it into a string using the following format codes. This string is then used as the time stamp going forward.

- %Y stands for the year as a decimal number.


- %m represents the month as a decimal number.


- %d is for the day of the month as a decimal number.


- %H gives the hour in a 24 hour clock format.


- %M is used to get the minute as a decimal number.


- %S produces the second as a decimal number.


In [2]:
# Get the current date and time.
now = dt.datetime.now()

# Format as a string.
nowstr = now.strftime('%Y%m%d_%H%M%S')
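
To make the format codes concrete, here is a small sketch that applies the same format string to a fixed moment instead of the current time, so the result is reproducible. The date used is an arbitrary example value.

```python
import datetime as dt

# A fixed, made-up moment so the result is reproducible.
fixed = dt.datetime(2021, 12, 2, 15, 44, 34)

# Same format string as the notebook's time stamp.
fixedstr = fixed.strftime('%Y%m%d_%H%M%S')
print(fixedstr)  # 20211202_154434

# strptime parses the string back into a datetime using the same format.
parsed = dt.datetime.strptime(fixedstr, '%Y%m%d_%H%M%S')
print(parsed == fixed)  # True
```

Because the fields run from year down to second, file names that carry this time stamp sort chronologically when sorted alphabetically.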

<br>

# 2021 CAO Points

http://www.cao.ie/index.php?page=points&p=2021
***

This section converts the 2021 CAO points from a web page into a pandas dataframe.

<br>

### Request the http link
***

In the code cell below, [requests.get](https://docs.python-requests.org/en/latest/user/quickstart/#make-a-request) is used to fetch the data from the selected http link. 


It is also good practice to check the HTTP response. This is done by evaluating `resp` at the end of the cell. As you can see, the returned status code is 200, which means the request was successful. A list of the different response codes can be found [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).


In [3]:
# Fetch the 2021 CAO points URL - ref 4.
resp = rq.get('http://www2.cao.ie/points/l8.php')

# Check response. 200 means OK. Codes in the 400s and 500s indicate errors.
resp

<Response [200]>

<br>

### Save original data set
***

Here I use the time stamp created above to build a file path for the original HTML data. The file is stored in the data folder of this repository, and the time stamp is included in the filename so that a separate copy of the data is kept each time the code is run, in case an error ever occurs.


In [4]:
# Create a file path for the original data.
pathhtml = 'data/cao2021_' + nowstr + '.html'

<br>

### Error on server
***

[Encoding](https://stackoverflow.com/questions/4657416/difference-between-encoding-and-encryption) is the way text is converted to bytes so that it can be transferred and used on different systems. The default encoding on my machine may differ from another's, which is why we need to decode the data using the encoding that the server actually used.


The server says we should decode as per:

```Content-Type: text/html; charset=iso-8859-1```


However, this caused a problem. One of the lines uses the byte \x96, which is not a printable character in iso-8859-1. In cp1252 this byte is an en dash, so the character in question is a dash rather than a letter with a fada.


Therefore, we need to use the similar encoding [cp1252](https://en.wikipedia.org/wiki/Windows-1252#:~:text=Windows%2D1252%20or%20CP%2D1252,Spanish%2C%20French%2C%20and%20German.) instead. It agrees with iso-8859-1 for ordinary Latin letters, including those with fadas, and additionally defines printable characters such as dashes and curly quotes for most bytes in the range \x80 to \x9f.

In [5]:
# The server uses the wrong encoding.
original_encoding = resp.encoding

# Change encoding here. 
resp.encoding = 'cp1252'
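
A quick way to see how the two encodings treat the byte \x96 is to decode it with each of them. This is a standalone sketch that does not use the CAO response; the letter 'é' is just an example of a letter with a fada.

```python
# The byte that is handled differently by the two encodings.
raw = b'\x96'

# iso-8859-1 maps it to an invisible C1 control character.
as_latin1 = raw.decode('iso-8859-1')
# cp1252 maps it to an en dash.
as_cp1252 = raw.decode('cp1252')

print(repr(as_latin1))  # '\x96'
print(repr(as_cp1252))  # '–'

# A letter with a fada is encoded identically by both.
print('é'.encode('iso-8859-1') == 'é'.encode('cp1252'))  # True
```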

<br>

A [with statement](https://www.pythonforbeginners.com/files/with-statement-in-python) is used here to open the file path created above and write the text of the response to that file.

In [6]:
# Open the file and write the original HTML data to it.
with open(pathhtml, 'w') as f:
    f.write(resp.text)

<br>

### Regular Expressions
***

A regular expression is used here to select the desired lines within the HTML file. This [blog](https://www.mygreatlearning.com/blog/regular-expression-in-python/) gives a very clear explanation of the characters used when working with regular expressions. In the code cell below:


- [re.compile](https://www.tutorialspoint.com/Why-do-we-use-re-compile-method-in-Python-regular-expression) compiles a regular expression pattern into a pattern object that can be reused.


- The [r](https://developers.google.com/edu/python/regular-expressions#:~:text=The%20%27r%27%20at%20the%20start,needs%20this%20feature%20badly!) before the opening quote marks the pattern as a raw string, so backslashes are passed to the regular expression engine unchanged.


- The letters or numbers inside the square brackets set what you are searching for; here, capital letters and digits.


- The number within the braces sets how many of those characters to match.


- An important note here is the use of two blank spaces as part of this regular expression. They match the two spaces that separate the course code from the rest of the line. If these characters were not added then the expression would not work correctly.


- Finally, the full stop matches any character except a new line, and the asterisk matches the preceding character zero or more times.


In [7]:
# Compile the regular expression for matching lines.
re_course = re.compile(r'([A-Z]{2}[0-9]{3})  (.*)')
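
The behaviour of this pattern can be checked on a few sample strings. The lines below are hypothetical and only imitate the shape of the real page; the pattern itself is the one compiled above.

```python
import re

# Same pattern as in the notebook: a course code, two spaces, then the rest.
re_course = re.compile(r'([A-Z]{2}[0-9]{3})  (.*)')

# Hypothetical lines; only the first has the shape of a course line.
lines = [
    'AL801  Software Design for Virtual Reality and Gaming     300',
    'Athlone Institute of Technology',
    'AL801 Software Design (only one space after the code)',
]

for line in lines:
    m = re_course.fullmatch(line)
    if m:
        # group(1) is the course code, group(2) is everything after the spaces.
        print(m.group(1), '|', m.group(2))
    else:
        print('no match:', line)
```

The third line fails only because a single space follows the code, which shows why the two spaces in the pattern matter.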

<br>

### Loop through the lines of the response
***

In [8]:
# TODO: come back to this again.

# Helper function to separate the # and * markers from the points.

def points_to_array(s):
    # A leading '#' marks a course with a portfolio, test or interview.
    portfolio = '#' if s.startswith('#') else ''

    # A trailing '*' marks random selection at this points score.
    random = '*' if s.endswith('*') else ''

    # Keep only the digits; this is string concatenation, not addition.
    points = ''.join(ch for ch in s if ch.isdigit())

    return [points, portfolio, random]
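
An alternative sketch of the same idea uses a regular expression to pull the three parts apart in one step. The name `split_points` and the choice to return empty strings for text that does not fit the expected shape are assumptions made for this example, not part of the notebook.

```python
import re

# Optional '#', optional digits, optional '*', with optional spaces between.
POINTS_RE = re.compile(r'(#?)\s*(\d*)\s*(\*?)')

def split_points(s):
    """Return [points, portfolio, random] for a points string (hypothetical helper)."""
    m = POINTS_RE.fullmatch(s.strip())
    if m is None:
        # Anything unexpected, such as plain text, yields three empty fields.
        return ['', '', '']
    portfolio, points, random = m.groups()
    return [points, portfolio, random]

print(split_points('#700*'))  # ['700', '#', '*']
print(split_points('451*'))   # ['451', '', '*']
print(split_points('300'))    # ['300', '', '']
print(split_points(''))       # ['', '', '']
```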


<br>

The line.decode method must be used when reading in the web page, because the page uses a different encoding from my system's default encoding, so each line has to be decoded explicitly. ref 7.

<br>

In [9]:
# Creating file path.
path2021 = 'data/cao2021_csv_' + nowstr + '.csv'

In [10]:
# Keeps track of how many courses we process.
no_lines = 0

# Open the csv file for writing.
with open(path2021, 'w') as f:
    # Write a header row.
    f.write(','.join(['code', 'title', 'pointsR1', 'pointsR2']) + '\n')
    # Loop through lines of the response - ref 6.
    for line in resp.iter_lines():
        # Decode the line using cp1252 instead of the encoding the server declares.
        dline = line.decode('cp1252')
        # Match only the lines representing courses.
        if re_course.fullmatch(dline):
            # Add one to the lines counter.
            no_lines = no_lines + 1
            
            #extract first 5 characters for course code
            course_code = dline[:5]
            
            # extract course title from 7 up to 57 
            # strip everything after the title
            course_title = dline[7:57].strip()
            
            #course_round1 = dline[60:]
            #print(f"'{course_code} {len(dline)}'")
            
            course_points = re.split(' +', dline[60:])
            #print(f"'{course_code} {course_points}'")
            
            # If there are not exactly two points fields, keep only the first two.
            if len(course_points) != 2:
                
                # The last line contains 3 fields; uncomment to print such lines.
                # print(f"'{course_code} {course_points}'")
                
                course_points = course_points[:2]
                
            # Collect the four fields for this course.
            linesplit = [course_code, course_title, course_points[0], course_points[1]]
            
            # Rejoin the substrings with commas in between.
            f.write(','.join(linesplit) + '\n')
               
# Print the total number of processed lines.
print(f"Total number of lines is {no_lines}.")
   
        # Pick out the relevant parts of the matched line - ref 5.
        #csv_version = re_course.sub(r'\1,\2,\3,\4', line.decode('iso-8859-1'))
        # Print the CSV-style line
        #print(csv_version)

Total number of lines is 949.
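
The slicing used in the loop above can be tried out on its own. The sketch below builds made-up lines with the same column layout the loop assumes (code in columns 0 to 4, title in columns 7 to 56, points from column 60) and parses them. The helper names `make_line` and `parse_line` are hypothetical, and padding the points to two fields is an addition of this sketch so that a missing second round becomes an empty string.

```python
import re

def make_line(code, title, points):
    # Build a made-up fixed-width line: code, two spaces, padded title, points.
    return code + '  ' + title.ljust(50) + '   ' + points

def parse_line(dline):
    code = dline[:5]
    title = dline[7:57].strip()
    # Split the points on runs of spaces, then pad to exactly two fields.
    points = re.split(' +', dline[60:].strip())
    points = (points + ['', ''])[:2]
    return [code, title] + points

both = parse_line(make_line('WD230', 'Mechanical and Manufacturing Engineering', '230  230'))
one = parse_line(make_line('AL810', 'Quantity Surveying', '328'))

print(both)  # ['WD230', 'Mechanical and Manufacturing Engineering', '230', '230']
print(one)   # ['AL810', 'Quantity Surveying', '328', '']
```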


# TO DO HERE

- tidy up 2021 points 


- write a sub function to deal with the * and # components to separate them from the points


- more comments and explanations

<br>

# VERIFY THIS MANUALLY!!!! 


**NB**: It was verified as of //21 that there were 949 courses exactly on the CAO 2021 points list

***

In [11]:
df2021 = pd.read_csv(path2021, encoding='cp1252')

In [12]:
df2021

Unnamed: 0,code,title,pointsR1,pointsR2
0,AL801,Software Design for Virtual Reality and Gaming,300,
1,AL802,Software Design in Artificial Intelligence for...,313,
2,AL803,Software Design for Mobile Apps and Connected ...,350,
3,AL805,Computer Engineering for Network Infrastructure,321,
4,AL810,Quantity Surveying,328,
...,...,...,...,...
944,WD211,Creative Computing,270,
945,WD212,Recreation and Sport Management,262,
946,WD230,Mechanical and Manufacturing Engineering,230,230
947,WD231,Early Childhood Care and Education,266,


***

<br>

# 2020 CAO Points

http://www.cao.ie/index.php?page=points&p=2020

***

In [13]:
url2020 = 'http://www2.cao.ie/points/CAOPointsCharts2020.xlsx'

<br>

### Save original data set

***

In [14]:
# Create a file path for the original data.
pathxlsx = 'data/cao2020_' + nowstr + '.xlsx'

In [15]:
urlrq.urlretrieve(url2020, pathxlsx)

('data/cao2020_20211202_154434.xlsx',
 <http.client.HTTPMessage at 0x225782ab340>)

<br>

### Load Excel Spreadsheet using Pandas

***

In [16]:
# Download and parse the excel spreadsheet.
df2020 = pd.read_excel(url2020, skiprows=10)

In [17]:
df2020

Unnamed: 0,CATEGORY (i.e.ISCED description),COURSE TITLE,COURSE CODE2,R1 POINTS,R1 Random *,R2 POINTS,R2 Random*,EOS,EOS Random *,EOS Mid-point,...,avp,v,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8
0,Business and administration,International Business,AC120,209,,,,209,,280,...,,,,,,,,,,
1,Humanities (except languages),Liberal Arts,AC137,252,,,,252,,270,...,,,,,,,,,,
2,Arts,"First Year Art & Design (Common Entry,portfolio)",AD101,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
3,Arts,Graphic Design and Moving Image Design (portfo...,AD102,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
4,Arts,Textile & Surface Design and Jewellery & Objec...,AD103,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1459,Manufacturing and processing,Manufacturing Engineering,WD208,188,,,,188,,339,...,,,,,,,,,,
1460,Information and Communication Technologies (ICTs),Software Systems Development,WD210,279,,,,279,,337,...,,,,,,,,,,
1461,Information and Communication Technologies (ICTs),Creative Computing,WD211,271,,,,271,,318,...,,,,,,,,,,
1462,Personal services,Recreation and Sport Management,WD212,270,,,,270,,349,...,,,,,,,,,,


In [18]:
# Spot check a random row.
df2020.iloc[753]

CATEGORY (i.e.ISCED description)          Engineering and engineering trades
COURSE TITLE                        Road Transport Technology and Management
COURSE CODE2                                                           LC286
R1 POINTS                                                                264
R1 Random *                                                              NaN
R2 POINTS                                                                NaN
R2 Random*                                                               NaN
EOS                                                                      264
EOS Random *                                                             NaN
EOS Mid-point                                                            360
LEVEL                                                                      7
HEI                                         Limerick Institute of Technology
Test/Interview #                                                         NaN

In [19]:
# Spot check the last row.
df2020.iloc[-1]

CATEGORY (i.e.ISCED description)          Engineering and engineering trades
COURSE TITLE                        Mechanical and Manufacturing Engineering
COURSE CODE2                                                           WD230
R1 POINTS                                                                253
R1 Random *                                                              NaN
R2 POINTS                                                                NaN
R2 Random*                                                               NaN
EOS                                                                      253
EOS Random *                                                             NaN
EOS Mid-point                                                            369
LEVEL                                                                      8
HEI                                        Waterford Institute of Technology
Test/Interview #                                                         NaN

In [20]:
# Create a file path for the pandas data.
path2020 = 'data/cao2020_' + nowstr + '.csv'

In [21]:
# Save pandas data frame to disk.
df2020.to_csv(path2020)

 <br>

# 2019 CAO Points

http://www.cao.ie/index.php?page=points&p=2019

***   

**Steps to reproduce**

1.  Download original pdf file.
2.  Open original pdf file in Microsoft Word.
3.  Save Microsoft Word's converted PDF in docx format.
4.  Re-save Word document for editing.
5.  Delete headers and footers.
6.  Delete preamble on page 1.
7.  Select all and copy.
8.  Paste into Visual Studio Code.
9.  Remove HEI name lines and blank lines.
10. Change column heading "COURSE AND INSTITUTION" to "Course".
11. Change backticks to apostrophes.
12. Replace double tab character on line 28 with single tab.
13. Delete tabs at end of lines 604, 670, 700, 701, 793, and 830.

In [22]:
df2019 = pd.read_csv('data/cao2019_20211103_202410_edited.csv', sep='\t')

In [23]:
df2019

Unnamed: 0,Course Code,Course,EOS,Mid
0,AL801,Software Design with Virtual Reality and Gaming,304,328.0
1,AL802,Software Design with Cloud Computing,301,306.0
2,AL803,Software Design with Mobile Apps and Connected...,309,337.0
3,AL805,Network Management and Cloud Infrastructure,329,442.0
4,AL810,Quantity Surveying,307,349.0
...,...,...,...,...
925,WD200,Arts (options),221,296.0
926,WD210,Software Systems Development,271,329.0
927,WD211,Creative Computing,275,322.0
928,WD212,Recreation and Sport Management,274,311.0
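
To illustrate why the tab clean-up in the steps above matters, here is a small sketch that reads tab-separated text with `sep='\t'`. The text is a two-row excerpt in the same shape as the edited file, held in memory with `io.StringIO` so that no file on disk is needed.

```python
import io
import pandas as pd

# Tab-separated text in the same shape as the edited 2019 file.
text = (
    'Course Code\tCourse\tEOS\tMid\n'
    'AL801\tSoftware Design with Virtual Reality and Gaming\t304\t328\n'
    'AL810\tQuantity Surveying\t307\t349\n'
)

# Each tab starts a new column, so stray tabs would shift or add columns.
sample = pd.read_csv(io.StringIO(text), sep='\t')

print(sample.shape)          # (2, 4)
print(list(sample.columns))  # ['Course Code', 'Course', 'EOS', 'Mid']
```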


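EOS means end of season points, i.e. the final points cut-off for the course.

Mid means the mid-point: the points of the applicant in the middle of the list of those offered a place on the course.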

***

# Concat and join
***

In [24]:
# CREATED DF CALLED COURSES2021

In [25]:
courses2021 = df2021[['code', 'title']]
courses2021

Unnamed: 0,code,title
0,AL801,Software Design for Virtual Reality and Gaming
1,AL802,Software Design in Artificial Intelligence for...
2,AL803,Software Design for Mobile Apps and Connected ...
3,AL805,Computer Engineering for Network Infrastructure
4,AL810,Quantity Surveying
...,...,...
944,WD211,Creative Computing
945,WD212,Recreation and Sport Management
946,WD230,Mechanical and Manufacturing Engineering
947,WD231,Early Childhood Care and Education


In [26]:
# CREATED DF CALLED COURSES2020

In [27]:
# Take a copy so that renaming the columns does not act on a view of df2020.
courses2020 = df2020[['COURSE CODE2','COURSE TITLE']].copy()
# set column heading to be the same as 2021
courses2020.columns = ['code', 'title']
courses2020

Unnamed: 0,code,title
0,AC120,International Business
1,AC137,Liberal Arts
2,AD101,"First Year Art & Design (Common Entry,portfolio)"
3,AD102,Graphic Design and Moving Image Design (portfo...
4,AD103,Textile & Surface Design and Jewellery & Objec...
...,...,...
1459,WD208,Manufacturing Engineering
1460,WD210,Software Systems Development
1461,WD211,Creative Computing
1462,WD212,Recreation and Sport Management


In [28]:
# CONCATENATE COURSES2021 AND COURSES2020 
# PUT CODE AND TITLE ON TOP OF ONE ANOTHER

In [29]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

allcourses = pd.concat([courses2021, courses2020], ignore_index=True)
allcourses

Unnamed: 0,code,title
0,AL801,Software Design for Virtual Reality and Gaming
1,AL802,Software Design in Artificial Intelligence for...
2,AL803,Software Design for Mobile Apps and Connected ...
3,AL805,Computer Engineering for Network Infrastructure
4,AL810,Quantity Surveying
...,...,...
2408,WD208,Manufacturing Engineering
2409,WD210,Software Systems Development
2410,WD211,Creative Computing
2411,WD212,Recreation and Sport Management


In [30]:
# SORT THE VALUES TO SHOW THERE ARE DUPLICATES IN THE CONCATENATED DF

In [31]:
allcourses.sort_values('code')

Unnamed: 0,code,title
175,AC120,International Business
949,AC120,International Business
950,AC137,Liberal Arts
176,AC137,Liberal Arts
951,AD101,"First Year Art & Design (Common Entry,portfolio)"
...,...,...
2411,WD212,Recreation and Sport Management
2412,WD230,Mechanical and Manufacturing Engineering
946,WD230,Mechanical and Manufacturing Engineering
947,WD231,Early Childhood Care and Education


In [32]:
# DISPLAY THE DUPLICATE COURSE AT TWO INDEXES

In [33]:
allcourses.loc[175]['title']

'International Business'

In [34]:
allcourses.loc[949]['title']

'International Business'

In [35]:
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html

# Finds all extra copies of duplicated rows.
allcourses[allcourses.duplicated()]

Unnamed: 0,code,title
949,AC120,International Business
950,AC137,Liberal Arts
952,AD102,Graphic Design and Moving Image Design (portfo...
955,AD204,Fine Art (portfolio)
956,AD211,Fashion Design (portfolio)
...,...,...
2404,WD200,Arts (options)
2409,WD210,Software Systems Development
2410,WD211,Creative Computing
2411,WD212,Recreation and Sport Management


In [36]:
# Returns a copy of the data frame with duplicates removed.
allcourses.drop_duplicates()

Unnamed: 0,code,title
0,AL801,Software Design for Virtual Reality and Gaming
1,AL802,Software Design in Artificial Intelligence for...
2,AL803,Software Design for Mobile Apps and Connected ...
3,AL805,Computer Engineering for Network Infrastructure
4,AL810,Quantity Surveying
...,...,...
2403,WD195,Architectural and Building Information Modelli...
2405,WD205,Molecular Biology with Biopharmaceutical Science
2406,WD206,Electronic Engineering
2407,WD207,Mechanical Engineering


In [37]:
# REMOVE DUPLICATES BASED ON CODE ALONE

In [38]:
# Finds all extra copies of duplicated rows.
allcourses[allcourses.duplicated(subset=['code'])]

Unnamed: 0,code,title
949,AC120,International Business
950,AC137,Liberal Arts
951,AD101,"First Year Art & Design (Common Entry,portfolio)"
952,AD102,Graphic Design and Moving Image Design (portfo...
953,AD103,Textile & Surface Design and Jewellery & Objec...
...,...,...
2404,WD200,Arts (options)
2409,WD210,Software Systems Development
2410,WD211,Creative Computing
2411,WD212,Recreation and Sport Management


In [39]:
# DF OF ALLCOURSES
# CONTAINS FULL LIST OF COURSES AVAILABLE IN 2021, 2020 OR BOTH

In [40]:
# INPLACE=TRUE MEANS IT MAKES THE CHANGE IN THE DF AS OPPOSED TO RETURNING A NEW ONE

# IGNORE_INDEX=TRUE IGNORES THE INDEX OF THE ORIGINAL DF AND RESETS THE INDEX ON THE NEW DF

In [41]:
# Removes duplicates in place (no copy is returned) - based only on code.
allcourses.drop_duplicates(subset=['code'], inplace=True, ignore_index=True)

In [42]:
allcourses

Unnamed: 0,code,title
0,AL801,Software Design for Virtual Reality and Gaming
1,AL802,Software Design in Artificial Intelligence for...
2,AL803,Software Design for Mobile Apps and Connected ...
3,AL805,Computer Engineering for Network Infrastructure
4,AL810,Quantity Surveying
...,...,...
1512,WD188,Applied Health Care
1513,WD205,Molecular Biology with Biopharmaceutical Science
1514,WD206,Electronic Engineering
1515,WD207,Mechanical Engineering
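
The difference between comparing whole rows and comparing on the code alone is easier to see on a tiny example. The course codes and titles below are made up; only the pandas calls mirror the cells above.

```python
import pandas as pd

# Two small made-up course lists that share one course code.
a = pd.DataFrame({'code': ['AA001', 'AA002'],
                  'title': ['Course One', 'Course Two']})
b = pd.DataFrame({'code': ['AA002', 'AA003'],
                  'title': ['Course Two (renamed)', 'Course Three']})

# Stack them and renumber the rows from 0.
both = pd.concat([a, b], ignore_index=True)

# Comparing whole rows finds no duplicates because the titles differ...
print(both.duplicated().sum())                 # 0
# ...but comparing on code alone flags the second AA002.
print(both.duplicated(subset=['code']).sum())  # 1

# Keep the first row for each code and renumber again.
unique = both.drop_duplicates(subset=['code'], ignore_index=True)
print(list(unique['code']))   # ['AA001', 'AA002', 'AA003']
print(list(unique['title']))  # ['Course One', 'Course Two', 'Course Three']
```

Since the first occurrence is kept by default, the title that survives for a duplicated code is the one from whichever dataframe was listed first in `pd.concat`.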


<br>

# Join to points

In [43]:
# INPLACE=TRUE AGAIN PERMANENTLY CHANGES THE INDEX OF DF2021 AND SETS IT AS THE CODE

In [44]:
# Set the index to the code column.
df2021.set_index('code', inplace=True)
df2021.columns = ['title', 'points_r1_2021', 'points_r2_2021']
df2021

Unnamed: 0_level_0,title,points_r1_2021,points_r2_2021
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AL801,Software Design for Virtual Reality and Gaming,300,
AL802,Software Design in Artificial Intelligence for...,313,
AL803,Software Design for Mobile Apps and Connected ...,350,
AL805,Computer Engineering for Network Infrastructure,321,
AL810,Quantity Surveying,328,
...,...,...,...
WD211,Creative Computing,270,
WD212,Recreation and Sport Management,262,
WD230,Mechanical and Manufacturing Engineering,230,230
WD231,Early Childhood Care and Education,266,


In [45]:
# INPLACE=TRUE AGAIN PERMANENTLY CHANGES THE INDEX OF ALLCOURSES AND SETS IT AS THE CODE

In [46]:
# Set the index to the code column.
allcourses.set_index('code', inplace=True)

In [47]:
# NOW JOINING POINTS FROM THE DF2021 DATAFRAME TO THE ALLCOURSES DATAFRAME

In [48]:
allcourses = allcourses.join(df2021[['points_r1_2021']])
allcourses

Unnamed: 0_level_0,title,points_r1_2021
code,Unnamed: 1_level_1,Unnamed: 2_level_1
AL801,Software Design for Virtual Reality and Gaming,300
AL802,Software Design in Artificial Intelligence for...,313
AL803,Software Design for Mobile Apps and Connected ...,350
AL805,Computer Engineering for Network Infrastructure,321
AL810,Quantity Surveying,328
...,...,...
WD188,Applied Health Care,
WD205,Molecular Biology with Biopharmaceutical Science,
WD206,Electronic Engineering,
WD207,Mechanical Engineering,


In [49]:
# DOING THE SAME THING FOR DF2020

In [50]:
# Take a copy so the later in-place change does not act on a view of df2020.
df2020_r1 = df2020[['COURSE CODE2', 'R1 POINTS']].copy()
df2020_r1.columns = ['code', 'points_r1_2020']
df2020_r1

Unnamed: 0,code,points_r1_2020
0,AC120,209
1,AC137,252
2,AD101,#+matric
3,AD102,#+matric
4,AD103,#+matric
...,...,...
1459,WD208,188
1460,WD210,279
1461,WD211,271
1462,WD212,270


In [51]:
# PERMANENTLY CHANGING INDEX OF DF2020_R1 TO THE CODE COLUMN

In [52]:
# Set the index to the code column.
df2020_r1.set_index('code', inplace=True)
df2020_r1

Unnamed: 0_level_0,points_r1_2020
code,Unnamed: 1_level_1
AC120,209
AC137,252
AD101,#+matric
AD102,#+matric
AD103,#+matric
...,...
WD208,188
WD210,279
WD211,271
WD212,270


In [53]:
# FINALLY JOINING DF2020 TO ALL COURSES DF
# CREATES A TABLE CONTAINING A COLUMN FOR 2021 POINTS AND A COLUMN FOR 2020 POINTS
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html

In [54]:
# Join 2020 points to allcourses.
allcourses = allcourses.join(df2020_r1)
allcourses

Unnamed: 0_level_0,title,points_r1_2021,points_r1_2020
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AL801,Software Design for Virtual Reality and Gaming,300,303
AL802,Software Design in Artificial Intelligence for...,313,332
AL803,Software Design for Mobile Apps and Connected ...,350,337
AL805,Computer Engineering for Network Infrastructure,321,333
AL810,Quantity Surveying,328,319
...,...,...,...
WD188,Applied Health Care,,201
WD205,Molecular Biology with Biopharmaceutical Science,,228
WD206,Electronic Engineering,,179
WD207,Mechanical Engineering,,198
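
The joins above rely on both dataframes sharing the course code as their index. As a minimal sketch with made-up codes and points, `join` keeps every row of the left dataframe and fills in NaN where the right one has no matching code.

```python
import pandas as pd

# Made-up course list, indexed by code.
courses = pd.DataFrame({'code': ['AA001', 'AA002', 'AA003'],
                        'title': ['Course One', 'Course Two', 'Course Three']}
                       ).set_index('code')

# Made-up points for only two of the courses, also indexed by code.
points = pd.DataFrame({'code': ['AA001', 'AA003'],
                       'points_r1': ['300', '#+matric']}
                      ).set_index('code')

# join matches rows on the index and keeps all rows of the left dataframe.
joined = courses.join(points)
print(joined)

# The course with no points entry gets a missing value.
print(joined['points_r1'].isna().tolist())  # [False, True, False]
```

Note that the points stay as text here, because values such as '#+matric' are not numbers; they would need cleaning before any numerical comparison.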


# TO DO


***

- ADD EXPLANATIONS ABOUT WHAT IS HAPPENING


- MAYBE ADD PLOTS TO COMPARE DATA


- ADD CONTENT TO THE README


- SAVE NEW DF OF ALLCOURSES TO A CSV IN DATA FOLDER


- JOIN IN 2019 POINTS TO ALLCOURSES DF AND COMPARE


<br>


# This notebook should have:
***

1. original data files for 2021, 2020 and 2019 from the CAO website


2. cleaned data files x 3 again


3. merged data file with all 3 - analyse this one with plots

<br>

# References:

***
 
All references and code used in these notebooks have been sourced in Oct/Nov/Dec 2021 from the following webpages:

 
- https://docs.python.org/3/library/re.html


- https://realpython.com/regex-python/


- https://realpython.com/regex-python-part-2/


- https://docs.python-requests.org/en/latest/user/quickstart/#make-a-request


- https://www.mygreatlearning.com/blog/regular-expression-in-python/


- https://stackoverflow.com/questions/16870648/python-read-website-data-line-by-line-when-available


- https://sites.pitt.edu/~naraehan/python3/mbb12.html


- https://docs.python.org/3/library/datetime.html


- https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html


- https://docs.python.org/3/library/urllib.request.html






<br>

***

# End