# MODS203 Project
## Ryan Borhani, Mathilde Froger, Apolline Isaia, Solal Urien

### Load libraries

In [2]:
import pandas as pd
import urllib.request
import re
import requests

from bs4 import BeautifulSoup as bs
from urllib.request import Request, urlopen

### Make the request and extract the HTML code

In [4]:
hdr = {'User-Agent': 'Mozilla/6.0'}
url = 'https://www.doctolib.fr/dentiste/paris-75019'
req = Request(url,headers=hdr)
page = urlopen(req)
soup = bs(page,'html.parser')
soup

<!DOCTYPE html>
<!--Looking at our code ?
Come have a closer look, we're hiring :)
https://careers.doctolib.fr/?origin=home-footer&amp;utm_button=footer&amp;utm_content-group=homepage&amp;utm_page-url={page-url}&amp;utm_website=doctolib_patients--><html lang="fr"><head data-country="fr" data-env="production"><meta charset="utf-8"/><meta content="authenticity_token" name="csrf-param"/>
<meta content="wwhK7TZBq0EoXBv4scqXViCFwzKGfeTeF9hVj+e477ynD43p7S3F8hi9PNmawUBvA7qXiNiFFkYFKQTz86tVfA==" name="csrf-token"/><meta content="origin-when-cross-origin" name="referrer"/><meta content="width=device-width, initial-scale=1.0, user-scalable=0, minimum-scale=1.0, maximum-scale=1.0, , viewport-fit=cover" name="viewport"/><meta content="Trouvez rapidement un chirurgien-dentiste à Paris - Paris 19e Arrondissement ou un praticien pratiquant des actes de chirurgie dentaire et prenez rendez-vous gratuitement en ligne en quelques clics" name="description"/><meta content="app-id=fr.doctolib.www" name="goo
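
The request above has no timeout or error handling, so a slow or refused connection would stop the notebook. The following is a minimal sketch of one way to guard the download using only the standard library. The helper names `build_request` and `fetch_html` and the 10-second timeout are assumptions made for illustration; they are not part of the original notebook. Only the request object is built at the end, so no network call is made.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/6.0"):
    """Return a Request carrying the same User-Agent header as the cell above."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML as text, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print(f"Request to {url} failed: {exc}")
        return None


# Building the request does not touch the network.
req = build_request("https://www.doctolib.fr/dentiste/paris-75019")
print(req.full_url)
```

The returned text could then be passed to `bs(html, 'html.parser')` in the same way as `page` above.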

### Extract the doctors
We do some simple operations on the extracted information.

In [5]:
soup.title.string #Extract the page title as a string

'Chirurgien-dentiste à Paris - Paris 19e Arrondissement 75019 : Rendez-vous par Internet sous 24h - Doctolib'

In [6]:
soup.a.get('href') #Extract the first link of the page

'/dentiste/paris-75019?page=2'
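
The link returned above is relative to the site root. If it were to be requested later (for example to reach the second page of results), it would first need to be turned into an absolute URL. A small sketch using `urllib.parse.urljoin` from the standard library; the variable names are chosen here for illustration only.

```python
from urllib.parse import urljoin

base_url = "https://www.doctolib.fr/dentiste/paris-75019"
relative_link = "/dentiste/paris-75019?page=2"  # value returned by soup.a.get('href')

# Combine the page URL with the relative href to get an absolute URL.
next_page_url = urljoin(base_url, relative_link)
print(next_page_url)
```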

We then locate the main information. By inspecting the page, we identified that the doctors' details are stored in the tags whose type attribute is "application/ld+json".

In [7]:
doct = soup.find_all(type="application/ld+json") #find_all is the current name of findAll
len(doct)

3

The list contains 3 elements, but we only want the second one (index 1), which holds the information about the doctors.

In [8]:
info_doc = str(doct[1])
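
Picking the block by its position (`doct[1]`) depends on the page layout staying the same. As a hedged alternative, the block could be chosen by its content. The sketch below parses each candidate with the standard `json` module and keeps the first one that is a list of `Physician` or `Hospital` entries. The sample strings are hypothetical stand-ins for the text of the three tags, and the assumption that the doctors' block is a top-level JSON array is not confirmed by the notebook.

```python
import json

# Hypothetical stand-ins for the text of the three application/ld+json tags.
sample_blocks = [
    '{"@context":"http://schema.org/","@type":"WebSite","name":"Doctolib"}',
    '[{"@context":"http://schema.org/","@type":"Physician","name":"Gabriel KOSKAS"},'
    ' {"@context":"http://schema.org/","@type":"Hospital","name":"Centre Dentaire Buttes Chaumont "}]',
    '{"@context":"http://schema.org/","@type":"BreadcrumbList"}',
]

PRACTITIONER_TYPES = {"Physician", "Hospital"}


def find_practitioner_block(blocks):
    """Return the first parsed block that is a non-empty list of practitioner entries."""
    for text in blocks:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        if isinstance(parsed, list) and parsed and all(
            isinstance(item, dict) and item.get("@type") in PRACTITIONER_TYPES
            for item in parsed
        ):
            return parsed
    return None


practitioners = find_practitioner_block(sample_blocks)
print(len(practitioners), practitioners[0]["name"])
```

With the real page, the text of each tag (for instance `tag.string` for each `tag` in `doct`) would take the place of `sample_blocks`.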

One clear drawback is that this string has to be heavily processed before its contents can be added to a DataFrame.

## Data Processing
We first split the string into parts, each containing all the information about a single doctor.

In [9]:
cur_nb = info_doc.find('{') #index of the current opening brace
cur_parent = 1 #number of braces still to close
list_inf = []

for k in range (cur_nb+1, len(info_doc)):
    if (info_doc[k]=='{'):
        cur_parent +=1
    elif (info_doc[k] == '}'):
        cur_parent -=1
        if (cur_parent == 0): #we found the brace that closes the current entry
            list_inf.append(info_doc[cur_nb+1:k])
            cur_nb = info_doc.find('{',k)
list_inf

['"@context":"http://schema.org/","@type":"Physician","name":"Gabriel KOSKAS","medicalSpecialty":"Chirurgien-dentiste","legalName":"","url":"/dentiste/paris/dr-gabriel-koskas","address":{"@type":"PostalAddress","name":"","streetAddress":"41 Avenue Simon Bolivar","postalCode":"75019","addressLocality":"Paris"},"paymentAccepted":"Cash, Check, Credit card"',
 '"@context":"http://schema.org/","@type":"Hospital","name":"Centre Dentaire Buttes Chaumont ","medicalSpecialty":"Centre dentaire","legalName":null,"url":"/centre-dentaire/paris/centre-dentaire-buttes-chaumont","address":{"@type":"PostalAddress","name":"Centre dentaire des Buttes Chaumont","streetAddress":"63 Rue Manin","postalCode":"75019","addressLocality":"Paris"},"paymentAccepted":"Cash, Check, Credit card"',
 '"@context":"http://schema.org/","@type":"Hospital","name":"Centre médical et dentaire Stalingrad (CRAMIF)","medicalSpecialty":"Centre de santé","legalName":null,"url":"/centre-de-sante/paris/centre-medical-stalingrad","add

In [10]:
list_inf[0]

'"@context":"http://schema.org/","@type":"Physician","name":"Gabriel KOSKAS","medicalSpecialty":"Chirurgien-dentiste","legalName":"","url":"/dentiste/paris/dr-gabriel-koskas","address":{"@type":"PostalAddress","name":"","streetAddress":"41 Avenue Simon Bolivar","postalCode":"75019","addressLocality":"Paris"},"paymentAccepted":"Cash, Check, Credit card"'

As we can observe, we now have several strings, each describing a doctor or a health centre.
We are going to process each of these strings. The method is rough, since it relies on fixed character offsets and on the position of the next comma, but it works on this page and yields exactly the DataFrame we need.

In [11]:
list_name, list_type, list_spe,list_adr,list_post_cod, list_cit = [],[],[],[],[],[]

# Each loop finds a key, scans forward to the next comma, and slices out the quoted value.
for doc in list_inf: #loop over the names of the doctors or hospitals
    i=0
    cur_nb = doc.find('name')
    while (doc[cur_nb+6+i] != ','):
        i+=1
    list_name.append(doc[cur_nb+7:cur_nb+5+i])

for doc in list_inf: #loop over the types (doctor or hospital)
    i=0
    cur_nb = doc.find('type')
    while (doc[cur_nb+6+i] != ','):
        i+=1
    list_type.append(doc[cur_nb+7:cur_nb+5+i])

for doc in list_inf: #loop over the specialities
    i=0
    cur_nb = doc.find('medicalSpecialty')
    while (doc[cur_nb+18+i] != ','):
        i+=1
    list_spe.append(doc[cur_nb+19:cur_nb+17+i])

for doc in list_inf: #loop over the cities
    i=0
    cur_nb = doc.find('addressLocality')
    while (doc[cur_nb+16+i] != ','):
        i+=1
    list_cit.append(doc[cur_nb+18:cur_nb+14+i])

for doc in list_inf: #loop over the addresses (street and number)
    i=0
    cur_nb = doc.find('streetAddress')
    while (doc[cur_nb+6+i] != ','):
        i+=1
    list_adr.append(doc[cur_nb+16:cur_nb+5+i])

for doc in list_inf: #loop over the postal codes
    i=0
    cur_nb = doc.find('postalCode')
    while (doc[cur_nb+11+i] != ','):
        i+=1
    list_post_cod.append(int(doc[cur_nb+13:cur_nb+10+i])) #the postal code is stored as an integer

data = {'Name':list_name,'Type':list_type, 'Speciality':list_spe,'Address':list_adr,'City':list_cit,'Postal_Code':list_post_cod}
df = pd.DataFrame(data)
display(df)

| | Name | Type | Speciality | Address | City | Postal_Code |
|---|---|---|---|---|---|---|
| 0 | Gabriel KOSKAS | Physician | Chirurgien-dentiste | 41 Avenue Simon Bolivar | Paris | 75019 |
| 1 | Centre Dentaire Buttes Chaumont | Hospital | Centre dentaire | 63 Rue Manin | Paris | 75019 |
| 2 | Centre médical et dentaire Stalingrad (CRAMIF) | Hospital | Centre de santé | 3 Rue du Maroc | Paris | 75019 |
| 3 | Jeanine Serouya | Physician | Chirurgien-dentiste | 136 Avenue de Flandre | Paris | 75019 |
| 4 | Caroline Chenneveau | Physician | Chirurgien-dentiste | 97 Rue de Belleville | Paris | 75019 |
| 5 | Charles CREHANGE | Physician | Chirurgien-dentiste | 118 Avenue Jean Jaurès | Paris | 75019 |
| 6 | Apolline TYBURCZY | Physician | Chirurgien-dentiste | 22 Avenue de Laumière | Paris | 75019 |
| 7 | Mahmoud BENAZZOUK | Physician | Chirurgien-dentiste | 145 Boulevard Sérurier | Paris | 75019 |
| 8 | Cyrille Candet | Physician | Chirurgien-dentiste | 20 Rue Manin | Paris | 75019 |
| 9 | Anne-Elisabeth PIEUS | Physician | Chirurgien-dentiste | 22 Avenue de Laumière | Paris | 75019 |
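
The comma-scanning loops work here, but they would cut short any value that itself contains a comma. Since each string in `list_inf` is the body of a JSON object with its outer braces removed, a possible alternative is to restore the braces and let the standard `json` module do the parsing. The sketch below is not the method used in this notebook. It uses two entries copied from the `list_inf` output, and `to_row` is a hypothetical helper name.

```python
import json

# Two entries copied from the list_inf output; outer braces were removed by the splitting loop.
sample_list_inf = [
    '"@context":"http://schema.org/","@type":"Physician","name":"Gabriel KOSKAS",'
    '"medicalSpecialty":"Chirurgien-dentiste","legalName":"",'
    '"url":"/dentiste/paris/dr-gabriel-koskas",'
    '"address":{"@type":"PostalAddress","name":"","streetAddress":"41 Avenue Simon Bolivar",'
    '"postalCode":"75019","addressLocality":"Paris"},'
    '"paymentAccepted":"Cash, Check, Credit card"',
    '"@context":"http://schema.org/","@type":"Hospital",'
    '"name":"Centre Dentaire Buttes Chaumont ","medicalSpecialty":"Centre dentaire",'
    '"legalName":null,"url":"/centre-dentaire/paris/centre-dentaire-buttes-chaumont",'
    '"address":{"@type":"PostalAddress","name":"Centre dentaire des Buttes Chaumont",'
    '"streetAddress":"63 Rue Manin","postalCode":"75019","addressLocality":"Paris"},'
    '"paymentAccepted":"Cash, Check, Credit card"',
]


def to_row(entry_text):
    """Parse one brace-stripped entry and flatten it into a table row."""
    record = json.loads("{" + entry_text + "}")  # restore the outer braces
    address = record.get("address") or {}
    postal_code = address.get("postalCode")
    return {
        "Name": (record.get("name") or "").strip(),
        "Type": record.get("@type"),
        "Speciality": record.get("medicalSpecialty"),
        "Address": address.get("streetAddress"),
        "City": address.get("addressLocality"),
        "Postal_Code": int(postal_code) if postal_code else None,
    }


rows = [to_row(text) for text in sample_list_inf]
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)` to obtain the same columns as the table above.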
