<a href="https://colab.research.google.com/github/Giraud-Pierre/DeepLearning_FineTuneLLama2Project/blob/main/src/WebScrappingUQACCourses.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**How to use this notebook:**

This notebook gathers data on UQAC programs and courses by scraping the UQAC website.

To use it:

- Run the Setup section, which installs and imports all required libraries.
- Run the Web scraping section, which defines the scraping functions and runs them on the two target pages ("https://cours.uqac.ca/premier-cycle" and "https://cours.uqac.ca/cycles-superieurs").
- You can then
  - either save the results in a Google Sheet after authenticating with your Google account (remember to delete the sheets afterwards to free space on your Google Drive)
  - or save the variables in a pickle file and download it with the Colab API, so you can reload them in a different Colab session

# **Setup**

## Install All the Required Packages

In [1]:
!pip install selenium
!apt-get update
!apt-get install chromium-driver
!pip install gspread
!pip install --upgrade google-auth

Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:4 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:6 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Hit:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:9 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done


## Import All the Required Libraries

In [2]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import numpy as np
from google.auth import default
from google.colab import auth
import gspread

# **Web scraping**

In [3]:
def web_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("--verbose")
    options.add_argument('--no-sandbox')
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    options.add_argument("--window-size=1920,1200")
    options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome(options=options)
    return driver
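
Each call to `web_driver()` starts a new headless Chrome process, and nothing in the notebook closes it. The sketch below shows one possible way to guarantee cleanup with a context manager. `managed_driver` and `FakeDriver` are hypothetical names introduced for illustration; `FakeDriver` only stands in for a real Selenium driver so the pattern can run without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with `factory` and always call quit() on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs whether the block finished normally or raised
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver (illustration only)."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


with managed_driver(FakeDriver) as demo_driver:
    pass  # scraping calls would go here

print(demo_driver.closed)  # True
```

In the notebook, the equivalent would be `with managed_driver(web_driver) as driver:` wrapped around the scraping call. Because the scraping functions read the global `driver`, this only works when the `with` statement sits at the top level of a cell.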

In [4]:
def GetCourseContent(courseURL):
  # Get the content from a course URL
  driver.get(courseURL)
  return driver.find_elements(By.XPATH, '//div[@id="texte"]')[0].text

def GetAllCoursesURLFromProgram(programContainer):
  # Get all the course URLs available from a program
  coursesContainer = programContainer.find_elements(By.XPATH, './/table/tbody/tr/td/a')
  coursesURL = []
  for course in coursesContainer:
    coursesURL.append(course.get_attribute('href'))
  return coursesURL

def GetInfoFromProgram(programURL):
  # Get the info on a program URL, as well as all the courses
  # available in the program and their contents
  driver.get(programURL)

  try:
    # Skip pages whose main heading mentions "Décanat des études"
    testContainer = driver.find_element(By.XPATH,'//h1')
    if "Décanat des études" in testContainer.text:
      return None,None,None,None
  except Exception:
    # No <h1> found: let the checks below decide
    pass

  try:
    programContainer = driver.find_elements(By.XPATH, '//div[@id="texte"]')[0]
    programContent = programContainer.text

    title = programContainer.find_element(By.XPATH,'.//h1').text
    noProg = programContainer.find_element(By.XPATH,'.//span[@class="noprog"]').text

    # Collect the course URLs before navigating away from the program page
    coursesURL = GetAllCoursesURLFromProgram(programContainer)

    coursesContent = []

    for course in coursesURL:
      coursesContent.append(GetCourseContent(course))
  except Exception:
    # Unexpected page layout: skip this program
    return None,None,None,None

  return title, noProg, programContent, coursesContent


def getProgramsURLFromProgamListPage(programListPageURL):
  # Get all the program URLs from a program list page
  # e.g. "https://cours.uqac.ca/premier-cycle"
  driver.get(programListPageURL)

  allProgramsContainer = driver.find_elements(By.XPATH, '//div[@id="texte"]/div[@id="liste_prog"]/table/tbody/tr/td')
  allProgramsURL = []

  for container in allProgramsContainer:
    program = container.find_element(By.XPATH,'.//a').get_attribute('href')
    if('cours-offerts' not in program):
      allProgramsURL.append(program)

  return allProgramsURL

def GetAllInfosOnAllPrograms(URL):
  # Get all programs from a program list page, as well as the corresponding course info
  # e.g. "https://cours.uqac.ca/premier-cycle"

  programsURL = getProgramsURLFromProgamListPage(URL)

  programsContent = []

  index = 0

  print("Progress : 0 / " + str(len(programsURL)) + " programs")

  for program in programsURL:
    title, noProg, programContent, coursesContent = GetInfoFromProgram(program)
    index += 1
    # Programs that could not be scraped come back as None and are skipped
    if title is not None:
      print(str(index) + " / " + str(len(programsURL)) + " programs ==> ")
      print(title + " : " + program)
      programsContent.append([title, noProg, programContent, coursesContent])

  return programsContent
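
The loop above drops any program for which `GetInfoFromProgram` returns `None`, without recording which URLs were dropped (the logged progress below starts at 3 / 237, for instance). As a minimal sketch, assuming you want to keep that list for a later retry, the variant below returns both the collected programs and the skipped URLs. `collect_programs`, `fake_fetch`, and the `example.invalid` URLs are hypothetical and used only for illustration.

```python
def collect_programs(urls, fetch):
    """Call `fetch` on each URL; return (collected programs, skipped URLs)."""
    collected, skipped = [], []
    for index, url in enumerate(urls, start=1):
        title, no_prog, content, courses = fetch(url)
        if title is None:
            # Keep track of what was dropped instead of ignoring it
            skipped.append(url)
            continue
        print(f"{index} / {len(urls)} programs ==> {title} : {url}")
        collected.append([title, no_prog, content, courses])
    return collected, skipped


def fake_fetch(url):
    """Hypothetical fetcher: fails on URLs ending in '/bad' (illustration only)."""
    if url.endswith("/bad"):
        return None, None, None, None
    return "Title for " + url, "0000", "program text", ["course text"]


demo_collected, demo_skipped = collect_programs(
    ["http://example.invalid/1", "http://example.invalid/bad"], fake_fetch
)
print(len(demo_collected), demo_skipped)
```

In the notebook, `fetch` would be `GetInfoFromProgram`, and the skipped list could be inspected or passed through the same function again.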

In [5]:
driver = web_driver()
premierCycleURL = "https://cours.uqac.ca/premier-cycle"

premierCyclePrograms = GetAllInfosOnAllPrograms(premierCycleURL)


Progress : 0 / 237 programs
3 / 237 programs ==> 
Certificat en français langue seconde ou étrangère : culture, études et travail : http://programmes.uqac.ca/4447
4 / 237 programs ==> 
Programme court d'apprentissage du français parlé et écrit pour non-francophones : http://programmes.uqac.ca/9983
5 / 237 programs ==> 
Baccalauréat en éducation préscolaire et en enseignement primaire : http://programmes.uqac.ca/7992
6 / 237 programs ==> 
Certificat de perfectionnement en transmission d'une langue autochtone : http://programmes.uqac.ca/4661
7 / 237 programs ==> 
Certificat en études pluridisciplinaires : http://programmes.uqac.ca/4385
8 / 237 programs ==> 
Certificat en formation d'aides-enseignants en milieu autochtone : http://programmes.uqac.ca/4659
9 / 237 programs ==> 
Certificat en formation de suppléants en milieu scolaire autochtone : http://programmes.uqac.ca/4660
10 / 237 programs ==> 
Certificat en formation de suppléants en milieu scolaire autochtone : http://programmes.uqac

In [None]:
driver = web_driver()
cyclesSuperieursURL = "https://cours.uqac.ca/cycles-superieurs"

CyclesSuperieurs = GetAllInfosOnAllPrograms(cyclesSuperieursURL)

Doctorat en lettres : http://programmes.uqac.ca/2056
Doctorat en lettres : http://programmes.uqac.ca/3136
Maîtrise en art : http://programmes.uqac.ca/3848
Maîtrise en lettres : http://programmes.uqac.ca/2036
Maîtrise en lettres : http://programmes.uqac.ca/3073
Maîtrise en linguistique : http://programmes.uqac.ca/3637
Programme court de deuxième cycle en pratiques artistiques en théâtre jeunesse : http://programmes.uqac.ca/9032
Diplôme d'études supérieures spécialisées en gestion publique en contexte autochtone : http://programmes.uqac.ca/1803
Diplôme d'études supérieures spécialisées multidisciplinaires de recherche en contexte autochtone : http://programmes.uqac.ca/1806
Microprogramme de deuxième cycle en gestion publique en contexte autochtone : http://programmes.uqac.ca/0884
Programme court de deuxième cycle en gestion publique en contexte autochtone : http://programmes.uqac.ca/0885
Diplôme d'études supérieures spécialisées en design de jeu vidéo narratif : http://programmes.uqac.ca

# **Saving results in a Google Sheet**

In [14]:
auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)

In [None]:
gs = gc.create('PremierCycles')
sh = gs.sheet1
# Header row, so this sheet has the same layout as the CyclesSupérieurs sheet
sh.append_row(["title","program number","program content","courses content"])

for program in premierCyclePrograms:
  programData = []
  programData.append(program[0])
  programData.append(program[1])
  programData.append(program[2])
  for course in program[3]:
    programData.append(course)
  sh.append_row(programData)

In [None]:
gs = gc.create('CyclesSupérieurs')
sh = gs.sheet1
sh.append_row(["title","program number","program content","courses content"])

for program in CyclesSuperieurs:
  programData = []
  programData.append(program[0])
  programData.append(program[1])
  programData.append(program[2])
  for course in program[3]:
    programData.append(course)
  sh.append_row(programData)
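
Both cells above flatten each program into one row: title, program number, program content, then one cell per course. A minimal sketch of that flattening as a standalone function follows. `program_to_row` and the sample data are hypothetical, and the 50,000-character cap is an assumption about Google Sheets' per-cell limit that you should check before relying on it.

```python
MAX_CELL_CHARS = 50_000  # assumption: Google Sheets' per-cell character limit


def program_to_row(program, max_chars=MAX_CELL_CHARS):
    """Flatten [title, number, content, [courses]] into one sheet row."""
    title, no_prog, content, courses = program
    row = [title, no_prog, content, *courses]
    # Truncate long cells so a single oversized page does not fail the upload
    return [cell[:max_chars] for cell in row]


# Made-up sample with the same shape as the scraped data
demo_programs = [
    ["Program A", "1234", "Program text", ["Course 1", "Course 2"]],
    ["Program B", "5678", "Other text", []],
]
demo_rows = [program_to_row(p) for p in demo_programs]
print(demo_rows[0])
```

With the rows built up front, they could be sent in one request with gspread's `append_rows` (for example `sh.append_rows(demo_rows)`) rather than one `append_row` call per program, which may help with API rate limits on long lists.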

# **Saving variables in a pickle**

Pickling stores Colab variables in a file so they can be reloaded in a later session. Because the web scraping can take a long time, saving the results here lets you reload them later without scraping again.

In [None]:
import pickle

In [None]:
filePathCyclesSuperieurs = "CyclesSuperieurs.pickle"
filePathPremierCycles = "PremiersCycles.pickle"

# Open the file in binary mode
with open(filePathPremierCycles, 'wb') as file:
    # Serialize and write the variable to the file
    pickle.dump(premierCyclePrograms, file)

# Open the file in binary mode
with open(filePathCyclesSuperieurs, 'wb') as file:
    # Serialize and write the variable to the file
    pickle.dump(CyclesSuperieurs, file)

In [None]:
# Open the file in binary mode
with open(filePathCyclesSuperieurs, 'rb') as file:
    # Deserialize and retrieve the variable from the file
    loaded_data = pickle.load(file)

In [None]:
loaded_data
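
The usage notes mention downloading the pickle through the Colab API, but no cell above does so. The sketch below wraps the save and load steps in small helpers and adds an optional download step. `save_pickle`, `load_pickle`, `download_if_in_colab`, and the sample data are hypothetical names introduced for illustration; the download relies on `google.colab.files.download`, which is only available inside Colab.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(obj, path):
    """Serialize `obj` to `path` in binary mode."""
    with open(path, "wb") as file:
        pickle.dump(obj, file)


def load_pickle(path):
    """Deserialize and return the object stored at `path`."""
    with open(path, "rb") as file:
        return pickle.load(file)


def download_if_in_colab(path):
    """Trigger a browser download in Colab; return False anywhere else."""
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        return False
    files.download(str(path))
    return True


# Made-up sample with the same shape as the scraped data:
# [title, program number, program content, [course contents]]
sample_programs = [["Sample program", "0000", "Program text", ["Course text"]]]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.pickle"
    save_pickle(sample_programs, sample_path)
    restored_programs = load_pickle(sample_path)

print(restored_programs == sample_programs)  # True
```

In the notebook, calling `download_if_in_colab(filePathPremierCycles)` after the dump would fetch the file to your machine. Only load pickle files you created yourself, since unpickling can execute arbitrary code.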