# **1. Corpus**

## Crawl URLs to extract all internal links 

**XENU Link Sleuth**  
https://home.snafu.de/tilman/xenulink.html 

*XENU Link Sleuth was ultimately chosen for this task.*

## Scrape textual data from crawled URLs
**BeautifulSoup HTML Parser**  
Ref: https://realpython.com/python-web-scraping-practical-introduction/

In [1]:
path = '../03-corpus/1-crawler/'
acteur = "ramq"
pdfs = False

The list of URLs to scrape for each corpus is stored in a CSV file.  
We therefore start by reading the CSV to extract the URLs.

In [2]:
from pandas import read_csv

regex = '.*png.*|.*jpeg.*|.*jpg.*|.*docx.*|.*js.*|.*font.*|.*gif.*|.*formulaire.*|.*?f%5B0%5D.*|.*img.*|.*%5Bfilter%.*|.*css.*|.*scripts.*|.*zip.*|.*xlsx.*|.*cms.*|.*/images/.*|.*sondage.*|.*/depenses/.*|.*demandes-acces.*'
if not pdfs:
    regex += r'|.*\.pdf'

# Possible encodings: 'ISO-8859-1' or 'UTF-8', depending on the crawler export
with open(path + acteur + '.csv', encoding='UTF-8') as f:
    csv = read_csv(f, sep=';')
csv
    

Unnamed: 0,Address,Type,Title,Charset,Description
0,https://www.ramq.gouv.qc.ca/fr,text/html,Accueil | Régie de l’assurance maladie du Québ...,utf-8,La Régie de l’assurance maladie du Québec admi...
1,https://www.ramq.gouv.qc.ca/en,text/html,Home | Régie de l’assurance maladie du Québec ...,utf-8,The Régie de l&#039;assurance maladie du Québe...
2,https://www.ramq.gouv.qc.ca/en/citizens,,English,utf-8,
3,https://www.ramq.gouv.qc.ca/fr/nous-joindre,text/html,Nous joindre | Régie de l’assurance maladie du...,utf-8,Vous avez des questions ou besoin d’informatio...
4,https://www.ramq.gouv.qc.ca/fr/citoyens/assura...,text/html,Assurance maladie | Régie de l’assurance malad...,utf-8,La RAMQ administre le régime d’assurance malad...
...,...,...,...,...,...
5417,https://www.ramq.gouv.qc.ca/sites/default/file...,application/pdf,"<div class=""file""><span class=""file_name"">Télé...",utf-8,
5418,https://www.ramq.gouv.qc.ca/sites/default/file...,application/pdf,"<div class=""file""><span class=""file_name"">Télé...",utf-8,
5419,https://www.ramq.gouv.qc.ca/en/media/7336,text/html,English,utf-8,
5420,https://www.ramq.gouv.qc.ca/sites/default/file...,application/pdf,"<div class=""file""><span class=""file_name"">Télé...",utf-8,


In [3]:
# Remove the URLs that should not be there
csv = csv[~csv["Address"].str.contains(regex, case=False)]
csv

Unnamed: 0,Address,Type,Title,Charset,Description
0,https://www.ramq.gouv.qc.ca/fr,text/html,Accueil | Régie de l’assurance maladie du Québ...,utf-8,La Régie de l’assurance maladie du Québec admi...
1,https://www.ramq.gouv.qc.ca/en,text/html,Home | Régie de l’assurance maladie du Québec ...,utf-8,The Régie de l&#039;assurance maladie du Québe...
2,https://www.ramq.gouv.qc.ca/en/citizens,,English,utf-8,
3,https://www.ramq.gouv.qc.ca/fr/nous-joindre,text/html,Nous joindre | Régie de l’assurance maladie du...,utf-8,Vous avez des questions ou besoin d’informatio...
4,https://www.ramq.gouv.qc.ca/fr/citoyens/assura...,text/html,Assurance maladie | Régie de l’assurance malad...,utf-8,La RAMQ administre le régime d’assurance malad...
...,...,...,...,...,...
5394,https://www.ramq.gouv.qc.ca/en/media/6921,text/html,English,utf-8,
5402,https://www.ramq.gouv.qc.ca/en/media/2141,text/html,English,utf-8,
5406,https://www.ramq.gouv.qc.ca/en/media/13046,text/html,English,utf-8,
5412,https://www.ramq.gouv.qc.ca/en/media/5786,text/html,English,utf-8,
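
The filtering step above can be tried on a handful of URLs without loading the full CSV. The following is a minimal sketch using only the standard library `re` module; `EXCLUDE`, `keep_url` and the two file URLs (`logo.PNG`, `guide.pdf`) are hypothetical names chosen for illustration, and the pattern is a shortened stand-in for the notebook's `regex`, not a copy of it.

```python
import re

# Shortened, hypothetical exclusion pattern (illustration only).
# re.IGNORECASE plays the role of case=False in Series.str.contains.
EXCLUDE = re.compile(r"\.png|\.jpe?g|\.css|/images/|\.pdf$", re.IGNORECASE)

def keep_url(url: str) -> bool:
    """Return True when the URL does not match the exclusion pattern."""
    return EXCLUDE.search(url) is None

sample_urls = [
    "https://www.ramq.gouv.qc.ca/fr",
    "https://www.ramq.gouv.qc.ca/sites/default/files/logo.PNG",   # hypothetical
    "https://www.ramq.gouv.qc.ca/sites/default/files/guide.pdf",  # hypothetical
    "https://www.ramq.gouv.qc.ca/fr/nous-joindre",
]

kept = [u for u in sample_urls if keep_url(u)]
print(kept)
```

Anchoring fragments with a dot or a slash (`\.css`, `/images/`) keeps the filter from matching unrelated URLs; an unanchored fragment such as `js` matches any address that merely contains those two letters.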


In [4]:
# liste = csv[csv['Type'] != 'application/pdf']  # PDFs are scraped with a library other than BeautifulSoup

# liste = csv['Address'].tolist()

liste = csv[['Address', 'Type']]
is_en = csv["Address"].str.contains(r'/en(?:/|$)')  # also matches the English home page, whose URL ends in /en
fr = csv[~is_en][['Address', 'Type']]  # French data
en = csv[is_en][['Address', 'Type']]  # English data
print("Attempting to scrape {} web pages".format(len(liste)))

liste

Attempting to scrape 4050 web pages


Unnamed: 0,Address,Type
0,https://www.ramq.gouv.qc.ca/fr,text/html
1,https://www.ramq.gouv.qc.ca/en,text/html
2,https://www.ramq.gouv.qc.ca/en/citizens,
3,https://www.ramq.gouv.qc.ca/fr/nous-joindre,text/html
4,https://www.ramq.gouv.qc.ca/fr/citoyens/assura...,text/html
...,...,...
5394,https://www.ramq.gouv.qc.ca/en/media/6921,text/html
5402,https://www.ramq.gouv.qc.ca/en/media/2141,text/html
5406,https://www.ramq.gouv.qc.ca/en/media/13046,text/html
5412,https://www.ramq.gouv.qc.ca/en/media/5786,text/html
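
A plain substring test on `'/en/'` does not match a URL that ends in `/en`, such as the English home page in the listing above. One way to make the language split explicit is to look at the first path segment instead. This is only a sketch under the assumption that English pages always live under a leading `en` segment and that everything else is French; `url_language` and `by_lang` are hypothetical names.

```python
from urllib.parse import urlparse

def url_language(url: str) -> str:
    """Classify a URL as 'en' or 'fr' from its first path segment.

    Assumption: English pages sit under /en, everything else is French.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    return "en" if segments and segments[0] == "en" else "fr"

urls = [
    "https://www.ramq.gouv.qc.ca/fr",
    "https://www.ramq.gouv.qc.ca/en",
    "https://www.ramq.gouv.qc.ca/en/citizens",
    "https://www.ramq.gouv.qc.ca/fr/nous-joindre",
]

by_lang = {"fr": [], "en": []}
for u in urls:
    by_lang[url_language(u)].append(u)

print(by_lang)
```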


In [5]:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
#from requests.packages.urllib3.util.retry import Retry

def getTextHTML(url):
    # verify=False skips TLS certificate verification; timeout avoids hanging on an unresponsive server
    html = requests.get(url, headers={'User-Agent': 'My User Agent 1.0'}, verify=False, timeout=30)
    html.encoding = 'utf-8'
    html = html.text

    soup = BeautifulSoup(html, "html.parser")
    tags_to_remove = ['head', 'header', 'script', 'footer', 'nav', 'form']  # Remove 'form' for the CHU Ste Justine site

    # CSS classes specific to the different sites
    attr_to_remove = ['div[class="contenu-fluide piv"]', 'div[role="navigation"]', 'div[class="section__wrapper section__wrapper--padding tac grid--inline-block"]', 'div[id="slidebox"]',
    'div[class="col-md-12 mise-a-jour"]', 'h1[class="sr-only"]', 'a[class="cd-top js-cd-top"]', 'div[class="nocontent"]', 'p[class="footer-lien-resonances"]', 'p[class="suivre"]',
    'section[class="field field-name-field-date-de-mise-jour field-type-datetime field-label-inline clearfix view-mode-full"]', 'a[class="active"]', 'p[class="footer-resonances"]',
    'div[class="item-list item-list-pager"]', 'ul[class="pub-solr-sub-menu"]', 'a[href="#main-content"]', 'div[id="block-sociauxcrchum"]', 'div[class="visually-hidden"]',
    'div[id="Breadcrumb"]', 'div[id="pageInfo"]', 'div[id="breadcrumb"]', 'div[class="pagesCreation"]', 'a[href="#contenu"]', 'div[class="bandeau"]',
    'div[id="seeAlso"]', 'a[href="/nous-ecrire.aspx"]', 'li[class="CMSListMenuLI"]', 'li[class="CMSListMenuLI navFirst"]', 'li[class="CMSListMenuLI navLast"]',
    'div[class="alert alert-danger"]', 'span[class="alertoverflow"]', 'div[class="alert alert-warning alert-dismissible"]', 'ul[class="menu"]', 'div[id="letters-filter"]',
    'ul[class="pager"]', 'a[href="#main-menu"]', 'ul[class="custom_menu"]', 'h2[class="element-invisible"]', 'div[class="breadcrumb"]',
    'a[class="all-cta"]', 'div[class="sub-menu-inner container"]', 'div[class="fixed-dk-nav"]', 'div[class="fixed-dk-nav-container"]', 'div[class="container-inner"]',
    'div[class="socials"]', 'div[class="breadcrumbs"]', 'a[class="btn-print"]', 'ul[class="list-buttons"]', 'p[class="visually-hidden"]', 
    'a[class="back-to-top"]', 'a[class="sr-only sr-only-focusable"]', 'ol[class="breadcrumb"]', 'div[class="container-fluid piv_bas"]', 'div[class="col-12 formBasPage"]',
    'div[class="container-fluid rangee-footer"]', 'a[class="visuallyHidden passerContenu"]', 'div[id="bandeau-alerte"]', 'div[class="menu-sec-wrapper col-12 col-lg-12"]',
    'a[href="#layout-content"]', 'div[class="paragraph feedback"]', 'p[class="last-update"]', 'ul[class="footer__menu--list"]', 'div[class="footer__info"]',
    'section[class="hello-bar"]', 'section[class="breadcrumb"]', 'div[class="menu-page"]', 'div[class="no-print menu-non-voyant"]', 'div[class="navigation"]',
    'div[class="pure-bloc pure-u-1 pure-u-md-1-3 pure-u-lg-1-4 side-menu"]', 'div[class="pied"]', 'div[class="social"]', 'div[class="piv-bas"]',
    'div[class="partage"]', 'div[class="pied-print no-screen"]', 'div[class="carte dynamic-carte-interactive-display ui-carte-panel"]',
    'div[id="piv"]', 'ul[class="social-nav top-bar-social"]', 'div[class="sidebar"]', 'ul[class="side-menu"]', 'div[class="zoom-button-wrapper"]',
    'a[href="#maincontent"]', 'a[href="#content"]', 'p[id="breadcrumbs"]', 'div[class="mega-menu-wrap"]', 'div[class="menu_2"]', 'div[class="welcome"]',
    'div[class="header_two"]', 'div[class="footer"]', 'div[class="footer-wrapper"]', 'div[class="custom-accessibility-tools js-only"]', 'section[role="navigation"]',
    'div[class="container-fluid container-blue container-dl-menu"]', 'div[class="col-xs-12 dl-menuwrapper menu-mobile visible-x"]']
    
    for t in tags_to_remove:
        tags = soup.find_all(t)
        for tag in tags:
            tag.decompose()

    for t in attr_to_remove:
        attr = soup.select(t)
        for a in attr:
            a.decompose()


    data = soup.get_text(separator=' ').replace("\n", " ").replace("\r", " ") 
    data = re.compile(r"\s+").sub(" ", data).strip()
    
    return data
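
# The two core steps of getTextHTML (dropping boilerplate tags, then collapsing whitespace)
# can be illustrated offline. The sketch below is not the notebook's method: it swaps
# BeautifulSoup for the standard library `html.parser` so that it runs without any
# third-party package, and it ignores the site-specific CSS selectors.
# `TextExtractor`, `extract_text` and `sample` are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Same tag list as tags_to_remove in getTextHTML.
SKIPPED_TAGS = {"head", "header", "script", "footer", "nav", "form"}

class TextExtractor(HTMLParser):
    """Collect the text found outside of SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of skipped tags currently open
        self.chunks = []  # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)

def extract_text(html: str) -> str:
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Same normalization as in getTextHTML: any run of whitespace becomes one space.
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()

sample = """<html><head><title>T</title></head><body>
<nav><a href="/">Home</a></nav>
<main><h1>Health   insurance</h1>
<p>Eligibility
rules</p></main>
<footer>Contact</footer></body></html>"""

print(extract_text(sample))  # Health insurance Eligibility rules
```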



In [6]:
import io
from PyPDF2 import PdfReader

def getTextPDF(url):
    """Return the lowercased text of a PDF of at most 30 pages, or None if no text was extracted."""
    pdf_link = requests.get(url)
    with io.BytesIO(pdf_link.content) as f:
        reader = PdfReader(f)
        number_of_pages = len(reader.pages)
        text = ''
        if number_of_pages <= 30:  # longer documents are skipped
            for page in reader.pages:
                # Guard against a page for which no text is returned
                page_text = page.extract_text() or ''
                text += page_text.lower().replace('\n', '').replace('\x84', '').replace('\xa010', '').replace('\xa0', '')
        return text or None


In [7]:
def scrape_list(x):
    """Scrape every URL of the DataFrame x (columns 'Address' and 'Type')."""
    output = []
    pdfs = x[x['Type'] == 'application/pdf']['Address'].tolist()
    htmls = x[x['Type'] == 'text/html']['Address'].tolist()
    nans = x[x["Type"].isnull()]['Address'].tolist()  # Pages for which the crawler could not record a type

    htmls = htmls + nans  # Pages without a recorded type are treated as HTML

    for site in htmls:
        try:
            text = getTextHTML(site)
            if '���' not in text:  # skip pages whose text could not be decoded properly
                output.append({'Address': site, 'text': text})
        except Exception as e:
            print("ERROR " + " - " + site)
            print(e)

    for site in pdfs:
        try:
            text = getTextPDF(site)
            if text and '���' not in text:  # getTextPDF returns None when no text was extracted
                output.append({'Address': site, 'text': text})
        except Exception as e:
            print("ERROR " + " - " + site)
            print(e)

    return output
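
# The routing logic of scrape_list (choose an extractor from the recorded MIME type,
# log failures, keep going) can be exercised without network access by passing the
# extractors in as arguments. This is a sketch under stated assumptions, not the
# notebook's implementation: `scrape_rows`, `fake_html`, `fake_pdf` and the
# example.org URLs are hypothetical, and rows are plain dicts rather than a DataFrame.

```python
def scrape_rows(rows, extractors):
    """Apply the extractor registered for each row's MIME type.

    Returns (output, errors): scraped records and (url, message) pairs.
    """
    output, errors = [], []
    for row in rows:
        url = row["Address"]
        # A missing type (None here; a pandas NaN would need pd.isna) is treated as HTML.
        mime = row.get("Type") or "text/html"
        extractor = extractors.get(mime)
        if extractor is None:
            continue  # no extractor registered for this type
        try:
            text = extractor(url)
        except Exception as exc:
            errors.append((url, str(exc)))
            continue
        # Assumption: any U+FFFD replacement character signals badly decoded text.
        if text and "\ufffd" not in text:
            output.append({"Address": url, "text": text})
    return output, errors

def fake_html(url):
    """Stand-in for getTextHTML."""
    return "Some page text"

def fake_pdf(url):
    """Stand-in for getTextPDF that always fails."""
    raise ValueError("unreadable PDF")

rows = [
    {"Address": "https://example.org/a", "Type": "text/html"},
    {"Address": "https://example.org/b", "Type": None},
    {"Address": "https://example.org/c.pdf", "Type": "application/pdf"},
    {"Address": "https://example.org/d.zip", "Type": "application/zip"},
]

output, errors = scrape_rows(rows, {"text/html": fake_html, "application/pdf": fake_pdf})
print(output)
print(errors)
```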

In [8]:
sites_fr = scrape_list(fr)
if len(en) > 0:
    sites_en = scrape_list(en)



In [None]:
output_path = '../03-corpus/2-data/'

sites_fr = pd.DataFrame(sites_fr)
sites_fr = csv.merge(sites_fr, how='right', on='Address')
sites_fr = sites_fr[['Address', 'Title', 'Type', 'text']]

sites_fr

Unnamed: 0,Address,Title,Type,text
0,https://pinel.qc.ca/,Accueil - Institut national de psychiatrie lég...,text/html,L’Institut national de psychiatrie légale Phil...
1,https://pinel.qc.ca/evenements/,Événements - Institut national de psychiatrie ...,text/html,Événements L'Institut national de psychiatrie ...
2,https://pinel.qc.ca/nouvelles/,Nouvelles - Institut national de psychiatrie l...,text/html,Nouvelles Les journalistes qui désirent réalis...
3,https://pinel.qc.ca/covid-19/,COVID-19 - Institut national de psychiatrie lé...,text/html,COVID-19 Quelques outils et ressources pour vo...
4,https://pinel.qc.ca/carriere/,Carrières - Institut national de psychiatrie l...,text/html,Carrières Vous pouvez faire une différence. Co...
...,...,...,...,...
1005,https://pinel.qc.ca/?p=19757,,,Ordres du jour et procès-verbaux 2022 Réunion ...
1006,https://pinel.qc.ca/?p=3192,,,"Auditorium Lionel-Béliveau Le 20 avril 2006, l..."
1007,https://pinel.qc.ca/?p=13292,,,"Professeure adjointe, École de travail social ..."
1008,https://pinel.qc.ca/?p=18296,,,"Professeure agrégée Julie Carpentier, Ph. D. D..."


In [None]:
sites_fr.to_csv(output_path + '1-fr/' + acteur + '.csv', escapechar='/')
if len(en) > 0:
    sites_en = pd.DataFrame(sites_en)
    sites_en = csv.merge(sites_en, how='right', on='Address')
    sites_en = sites_en[['Address', 'Title', 'Type', 'text']]
    sites_en.to_csv(output_path + '1-en/' + acteur + '_en.csv', escapechar='/')