# Real-estate scraping

In this section, I set up a scraper to identify available office space.

For reference: https://github.com/snehaashrihari/Web-Scraping-using-BeautifulSoup/

## Available websites

* Se Loger: https://www.seloger.com/immobilier/locations/immo-montrouge-92/bien-bureaux/
    which is protected by a puzzle challenge and in fact aggregates several sources, namely:
* JLL: https://immobilier.jll.fr/search?tenureType=rent&propertyType=office&city=MONTROUGE&postcode=92120&page=1
* CBRE
* Evolis
* BNP Paribas Real Estate: https://www.bnppre.fr/a-louer/bureau/hauts-de-seine-92/montrouge-92120/
    * Caution: this page does not list only addresses in Montrouge!
* Axeaven
* Jean Louis Thouard Immobilier (?!?)
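
Because the BNP Paribas Real Estate page also returns listings outside Montrouge, the scraped items probably need a location filter. A minimal sketch, assuming each item exposes a free-text address that contains the postcode (the `address` key and the sample rows are hypothetical, not taken from the parsers used in this notebook):

```python
import re

# Hypothetical scraped rows; the real parsers may expose different fields.
listings = [
    {"address": "98-100 RUE MAURICE ARNOUX 92120 MONTROUGE", "surface_m2": 450},
    {"address": "12 AVENUE DE LA REPUBLIQUE 92320 CHATILLON", "surface_m2": 300},
    {"address": "5 rue Gabriel Péri, 92120 Montrouge", "surface_m2": 120},
]

# Match the postcode as a whole token so that e.g. "192120" is not accepted.
MONTROUGE_POSTCODE = re.compile(r"\b92120\b")


def is_in_montrouge(listing: dict) -> bool:
    """Return True when the address mentions the Montrouge postcode."""
    return bool(MONTROUGE_POSTCODE.search(listing["address"]))


montrouge_only = [item for item in listings if is_in_montrouge(item)]
print(len(montrouge_only))
```

Filtering on the postcode rather than on the city name avoids having to handle case and accent variants of "Montrouge".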

Alternative:

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# import sys
# !{sys.executable} -m pip install sqlalchemy

In [3]:
# import sys
# !{sys.executable} -m pip install psycopg2-binary

In [4]:
#import sys
#!{sys.executable} -m pip install chart_studio

## Scraping - data retrieval

In [5]:
from explo_scraping import scrape_save

scrape_save()

WebDriverException: Message: Process unexpectedly closed with status 0


## Exploring the results

In [None]:
from explo_scraping import load_data

df = load_data()

### Chart - surface area

In [None]:
import matplotlib.pyplot as plt
import datetime as dt

today = dt.date.today()

In [None]:
mask = df["date"] > dt.date(year=2023, month=11, day=1)

aux = df[mask].groupby(['source', 'date'])[["surface_m2"]].sum().unstack(level="source")
ax = aux.plot.bar()
ax.legend(loc='lower left')
ax.set_ylabel("Available surface area [m2]")

plt.savefig(f"surface_dispo_{today}", bbox_inches="tight")

In [None]:
aux

### Chart - number of listings

In [None]:
aux2 = df.groupby(['source', 'date'])[["surface_m2"]].count().unstack(level="source")
ax2 = aux2.plot.bar()

ax2.legend(loc='lower left')
ax2.set_ylabel("Number of available listings")

plt.savefig(f"offres_dispo_{today}", bbox_inches="tight")

In [None]:
aux2

### Computing the average price and its evolution

In [None]:
# Annual rent of each listing (surface x price per m2 per year)
df["total_price"] = df["surface_m2"] * df["price_eur_per_year_per_m2"]
# Totals per source and date
total_value_series = df.groupby(['source', 'date'])[["total_price"]].sum()
total_surface_series = df.groupby(['source', 'date'])[["surface_m2"]].sum()
# Surface-weighted average price = total rent / total surface
aux_series = total_value_series["total_price"].divide(total_surface_series["surface_m2"])

aux_series.unstack(level="source").plot.bar()

In [None]:
aux_series.unstack(level="source")
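
The average computed above is weighted by surface: for each source and date, total annual rent is divided by total surface, so a large listing counts more than a small one. A small worked sketch on made-up numbers (the rows below are illustrative, not scraped data) shows how this differs from a plain mean of the per-m2 prices:

```python
import datetime as dt

import pandas as pd

# Illustrative rows only; column names follow the DataFrame used above.
day = dt.date(2023, 11, 3)
toy = pd.DataFrame(
    {
        "source": ["JLL", "JLL", "BNP"],
        "date": [day, day, day],
        "surface_m2": [100.0, 300.0, 200.0],
        "price_eur_per_year_per_m2": [400.0, 200.0, 350.0],
    }
)

# Annual rent of each listing, then totals per (source, date).
toy["total_price"] = toy["surface_m2"] * toy["price_eur_per_year_per_m2"]
totals = toy.groupby(["source", "date"])[["total_price", "surface_m2"]].sum()

# Surface-weighted mean price: total rent / total surface.
weighted = totals["total_price"] / totals["surface_m2"]

# Plain mean of the listed prices, for comparison.
simple = toy.groupby(["source", "date"])["price_eur_per_year_per_m2"].mean()

print(weighted.loc[("JLL", day)])  # (100*400 + 300*200) / 400 = 250.0
print(simple.loc[("JLL", day)])    # (400 + 200) / 2 = 300.0
```

In this toy case the large, cheaper listing pulls the weighted price below the plain mean.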

In [None]:
dict()[0]  # deliberately raises KeyError, presumably to stop a "Run All" here

## Other explorations

### Specific date

In [None]:
import datetime as dt

date = dt.date(2023, 11, 3)

In [None]:
df[df["date"]==date]

### Full dataset

In [None]:
df

## Outlook

### Consolidated indicator

If data is collected from several sources, it will have to be consolidated and de-duplicated.

Would address + surface be a good starting point? (Plus price, which is probably the same across agencies.)

Note: on this criterion, there are already duplicates between JLL and BNP as of 2023-11-01 (the "98-100 RUE MAURICE ARNOUX" listing, for example).
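
One way to test this criterion is to build a key from a normalised address plus the surface, and list the keys that appear under more than one source. A minimal sketch, assuming each record carries `source`, `address` and `surface_m2` (the normalisation rules, the second address and all surfaces are assumptions made up for the example):

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical records; surfaces are invented for the example.
records = [
    {"source": "JLL", "address": "98-100 RUE MAURICE ARNOUX", "surface_m2": 512.0},
    {"source": "BNP", "address": "98-100 rue Maurice Arnoux ", "surface_m2": 512.4},
    {"source": "BNP", "address": "21 RUE BARBES", "surface_m2": 80.0},
]


def normalise_address(address: str) -> str:
    """Strip accents, upper-case and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", address)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", no_accents).strip().upper()


def dedup_key(record: dict) -> tuple:
    # Round the surface so small rounding differences between agencies still match.
    return (normalise_address(record["address"]), round(record["surface_m2"]))


sources_by_key = defaultdict(set)
for record in records:
    sources_by_key[dedup_key(record)].add(record["source"])

# Keys listed by at least two different sources are duplicate candidates.
duplicates = {key: srcs for key, srcs in sources_by_key.items() if len(srcs) > 1}
print(duplicates)
```

Rounding to the nearest square metre is an arbitrary tolerance; a wider one would catch more duplicates at the cost of more false matches.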

### "Market momentum" indicator

If the goal is not to measure the actual available surface but just to "take the pulse" of the market, then using two different sources is worthwhile even if they are not de-duplicated.
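
For such a "pulse" indicator, absolute levels matter less than the trend, so one possible approach is to rebase each source to its first observation. A minimal sketch on invented counts, assuming a table of listing counts indexed by date with one column per source (this is one option among others, not a method used elsewhere in this notebook):

```python
import datetime as dt

import pandas as pd

# Invented listing counts per date and source, for illustration only.
counts = pd.DataFrame(
    {"BNP": [20, 22, 25], "JLL": [10, 9, 12]},
    index=[dt.date(2023, 11, 1), dt.date(2023, 11, 8), dt.date(2023, 11, 15)],
)

# Rebase every source to 100 at its first date: duplicates across sources
# no longer matter, only the relative change within each source.
index_100 = counts / counts.iloc[0] * 100

# Average the rebased series to get a single "pulse" of the market.
pulse = index_100.mean(axis=1)
print(pulse.round(1))
```

Because each source is compared only with itself, a listing that appears in both sources does not inflate the indicator the way it would inflate a summed surface.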

### Building identification and consolidation

One approach to consolidation would be to rely on the address and any internal references (JLL and BNP). A good candidate for the internal reference would be the associated "href".

Then, could the de-duplication be done manually and extended whenever new references appear?

Obviously, this approach would take much more time, so the question is whether it is worth it!
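
A minimal sketch of that workflow, assuming a hand-maintained mapping from (source, href) to a building identifier; the hrefs and identifiers below are placeholders, not real references. New references that are not yet in the mapping are flagged for manual review:

```python
# Hand-maintained mapping: (source, href) -> building identifier.
# All hrefs and identifiers here are placeholders.
building_by_ref = {
    ("JLL", "/offer/aaa"): "building-001",
    ("BNP", "/a-louer/bbb"): "building-001",  # same building, other agency
}

# Hypothetical references seen in today's scrape.
scraped_refs = [
    ("JLL", "/offer/aaa"),
    ("BNP", "/a-louer/bbb"),
    ("BNP", "/a-louer/ccc"),  # not mapped yet
]

known_buildings = set()
to_review = []
for ref in scraped_refs:
    building = building_by_ref.get(ref)
    if building is None:
        to_review.append(ref)  # needs a manual decision
    else:
        known_buildings.add(building)

print(f"{len(known_buildings)} known building(s), {len(to_review)} reference(s) to review")
```

The manual effort then scales with the number of new references per scrape rather than with the total number of listings.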

# Sandbox

### Plotly exploration (not working)

In [None]:
dict()[0]  # deliberately raises KeyError, presumably to stop a "Run All" here

In [None]:
import chart_studio.plotly as py
import plotly.graph_objects as go

In [None]:
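# After unstack, aux is indexed by date with ("surface_m2", source) columns,
# so each source is selected by column, not with .loc on the rows.
bnp_aux = aux["surface_m2"]["BNP"]
jll_aux = aux["surface_m2"]["JLL"]

In [None]:

trace_bnp = go.Bar(x=bnp_aux.index,
                   y=bnp_aux)

In [None]:

trace_jll = go.Bar(x=jll_aux.index,
                   y=jll_aux)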

In [None]:
data = [trace_bnp, trace_jll]


layout = go.Layout(title="Evolution of the listed surface area by source",
                xaxis=dict(title='Date'),
                yaxis=dict(title='Available surface area'))

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='jupyter-styled_bar')

### Another sandbox

In [None]:
import json
from bs4 import BeautifulSoup
import requests

from typing import List, Tuple, Optional

from commons import RentalItem

In [None]:
url="https://immobilier.jll.fr/search?tenureType=rent&propertyType=office&city=MONTROUGE&postcode=92120"

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

raw_items = list(soup.find_all(
            "div", {"class": "SRPPropertyCard SRPPropertyCard--default col-sm-6"}))

In [None]:
aux = raw_items[0]

In [None]:
aux

In [None]:
aux.find("a", href=True).get("href")

In [None]:
aux

In [None]:
aux.attrs.get("href")

## Problem with JLL

In [None]:
from bs4 import BeautifulSoup
import requests

from typing import List, Tuple, Optional

from commons import RentalItem

In [None]:
url =  "https://immobilier.jll.fr/search?tenureType=rent&propertyType=office&city=MONTROUGE&postcode=92120"

In [None]:
import cloudscraper

scraper = cloudscraper.create_scraper()

In [None]:
#headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0'}
page = scraper.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
from jll_parser import get_nbr_items

In [None]:
get_nbr_items(soup)

In [None]:
from jll_parser import get_rental_items

In [None]:
rental_items = get_rental_items(soup)

In [None]:
rental_items

In [None]:
from lxml import html

In [None]:
with open(r'D:\Ecologie - EELV\data\surface_bureaux\Resultats JLL France_2023-11-18.htm','r') as f:
    bla = html.fromstring(f.read())

In [None]:
with open(r'D:\Ecologie - EELV\data\surface_bureaux\Resultats JLL France_2023-11-18.htm','r') as f:
    soup = BeautifulSoup(f, 'html.parser')

In [None]:
soup.find_all("h3", class_="SRPOffersSearchSummary")

In [None]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Cookie': 'OptanonConsent=isGpcEnabled=0&datestamp=Sat+Nov+18+2023+10%3A44%3A47+GMT%2B0100+(heure+normale+d%E2%80%99Europe+centrale)&version=202212.1.0&isIABGlobal=false&hosts=&consentId=e0927f4b-fa23-4eef-9ae7-28a9593cf2e5&interactionCount=1&landingPath=NotLandingPage&groups=C0001%3A1%2CC0004%3A0%2CC0003%3A0%2CC0002%3A0&geolocation=FR%3BIDF&AwaitingReconsent=false; __fs_dncs_trackingid_jllfrance=95038d2d-eedd-4fc2-b0cb-1f8264ffa1a8; __fs_dncs_exttrack=1; OptanonAlertBoxClosed=2023-10-28T16:23:59.024Z; ARRAffinity=b3ab359c4ca6aa3ecdd6e61996ab677eba545d81029a2f999191c2aecff3687a; ARRAffinitySameSite=b3ab359c4ca6aa3ecdd6e61996ab677eba545d81029a2f999191c2aecff3687a; SEMItem=LandingPage%3Dhttps%253A%252F%252Fimmobilier.jll.fr%252Fsearch%253FtenureType%253Drent%2526propertyType%253Doffice%2526city%253DMONTROUGE%2526postcode%253D92120%26ReferralString%3D%26ReferralOrigin%3DDirect%20-%20v3%26Language%3Dfr%26gclid%3D%26UtmCampaign%3D%26UtmSource%3D%26UtmMedium%3D%26UtmContent%3D%26UtmTerm%3D; RT="z=1&dm=jll.fr&si=tucojz5w2v&ss=lp3v4lma&sl=0&tt=0"',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'cross-site',
    'TE': 'trailers',
    'Referer': 'https://google.fr',
}

In [None]:
page = scraper.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
rental_items = get_rental_items(soup)

In [None]:
rental_items

In [None]:
len(rental_items)

In [None]:
from jll_parser import parser

In [None]:
bla = parser()

In [None]:
import pandas as pd
import datetime as dt

In [None]:
list_items = rental_items

aux_df = pd.DataFrame.from_records([item.__dict__ for item in list_items])
today_dt = dt.date.today()
aux_df["date"] = today_dt
aux_df["source"] = "JLL"  # rental_items were parsed from the JLL page above

In [None]:
from explo_scraping import save_data

In [None]:
save_data(aux_df)

In [None]:
from bnp_re_parser import parser as bnp_parser

In [None]:
nbr_items, rental_items = bnp_parser()

In [None]:
nbr_items

In [None]:
len(rental_items)

## Selenium exploration

In [None]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

In [None]:
options = Options()
options.add_argument("--headless")

In [None]:
driver = webdriver.Firefox(options=options)
bla = driver.get("https://immobilier.jll.fr/search?tenureType=rent&propertyType=office&city=MONTROUGE&postcode=92120")

In [None]:
html_source = driver.page_source

In [None]:
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(html_source, 'html.parser')

In [None]:
from jll_parser import full_parser

In [None]:
bla = full_parser()

In [None]:
from jll_parser import get_page_content

In [None]:
yo = get_page_content()

In [None]:
from jll_parser import parser

In [None]:
output = parser(yo)

In [None]:
len(output[1])