In [1]:
%load_ext autoreload
%autoreload 2

# Property Scraper

## [Portale delle Vendite Pubbliche](https://pvp.giustizia.it/pvp/it/homepage.page)

Author : Emanuele Cannizzaro

# Introduction

This notebook supports a set of tasks required to create the property databases stored at [Italia Estate Database](http://italia.estate-database.com/) and [real estate database](https://docs.google.com/spreadsheets/d/1BRloTbcVOFAL9up2wIsvaAjFuJep9f3TWQwp_f02ntw/edit#gid=1748870476).

The project also contains a [Google Lookerstudio dashboard](https://lookerstudio.google.com/u/0/reporting/2da6484e-c407-40aa-88e0-3ca0f6ddd31d/page/p_0pjh8udt4c).

![PVP](../docs/images/pvp.png)


In [2]:
from bs4 import BeautifulSoup
# Reactor restart
#from crochet import setup, wait_for
from datetime import date, datetime
from glob import glob
from gspread import service_account
import json
#import linkcheck
from multiprocessing import Pool
import os
import pandas as pd
import scrapy
from scrapy.crawler import CrawlerProcess
#from scrapy.crawler import CrawlerRunner
#from scrapy.utils.project import get_project_settings
import subprocess
from subprocess import run
from tqdm.auto import tqdm

from property_scraper import PVP_URL_ROOTNAME, ROOT_FOLDER
from property_scraper import fix_hyperlinks, get_basename, get_filename_from_identifier, get_parameters
from property_scraper.astegiudiziarie.spiders.search import AsteGiudiziarieSearchSpider
from property_scraper.process import Checker, Configurator, Extractor, Fixer
from property_scraper.store import SearchStore, SearchPropertyStore, PropertyStore

In [39]:
#setup()

In [3]:
pd.set_option('display.max_rows', None)

# Description

The following usecases are considered:

1. Prepare Scrapy Search Configuration;
1. Create or Update the search, search_property and property tables of the database;
1. Repair the downloaded HTML pages for the localhost;
1. Check the localhost server links;
1. Check the cloud web server links;
1. ...


## To do

1. To perform a new incremental search there is a need to understand when to stop downloading the search pages. 

1. Look for missing values 
if I am working on a Pandas dataframe, how can I identify whether there are missing rows in a column of continuous integer values? An elegant solution would use min, max and range(min, max+1).

A. ChatGPT
You can identify missing rows in a column of continuous integer values in a Pandas DataFrame by comparing the expected range of values to the actual values in the column. Here's an example:

In this example, we first create an example DataFrame with a column id containing continuous integer values. We then get the expected range of values using the range() function and convert it to a set. Next, we get the actual values in the id column of the DataFrame and convert it to a set. We then subtract the actual set from the expected set to get a set of missing values.

Finally, we check whether the missing set is non-empty and print out the missing values if any, or a message indicating that there are no missing values.

# Setup

Let's define the runtime options.

In [4]:
usecase = 'scrape'
#usecase = 'update_results'
#usecase = 'update_search_configuration'
#usecase = 'get_links'
#usecase = 'fix_links'
#usecase = 'check_links'

search_engine = 'pvp'
#search_engine = 'blinks'
#search_engine = 'guber'
#search_engine = 'blinks'

run_in_parallel = True
no_processes = 5


### Running a Scrapy Spider within Jupyter Notebook

To run a Scrapy spider within Jupyter Notebook, follow these steps:

1. **Install the necessary dependencies**: Make sure you have Scrapy and Jupyter Notebook installed in your Python environment. You can install them using pip:

In [13]:
#!python -m pip install scrapy jupyter

2. **Create a new Jupyter Notebook**: Launch Jupyter Notebook by running the command `jupyter notebook` in your terminal. Create a new notebook by clicking on "New" and selecting "Python 3" or use an existing notebook.

3. **Import the necessary modules**: In the first cell of your Jupyter Notebook, import the required modules.

4. **Define your Scrapy spider**: In a new cell, define your Scrapy spider as you normally would. Here's an example spider:

In [7]:
if usecase == 'scrape':
    spider = AsteGiudiziarieSearchSpider()

5. **Run the Scrapy spider**: In the next cell, run the Scrapy spider using the CrawlerProcess class and the project settings.

In [43]:
if usecase == 'scrape':
    #get_project_settings()
    #crawler = CrawlerProcess()#get_project_settings())
    crawler = CrawlerRunner()

In [44]:
if usecase == 'scrape':
    crawler.crawl(AsteGiudiziarieSearchSpider)

2023-05-16 03:53:03 [scrapy.crawler] INFO: Overridden settings:
{}
2023-05-16 03:53:03 [scrapy.extensions.telnet] INFO: Telnet Password: e1bcee9111781742
2023-05-16 03:53:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2023-05-16 03:53:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrap

<Deferred at 0x7ff29dde2d50>

In [46]:
if usecase == 'scrape':
    pass
    #crawler.start()

6. **Execute the cells**: Run the cells in your Jupyter Notebook by clicking the "Run" button or using the keyboard shortcut. The Scrapy spider will start crawling the URLs specified in the **start_urls** attribute.

You should see the Scrapy spider's logs in the Jupyter Notebook output, indicating the progress of the crawling process. Once the spider has finished crawling, you can access and process the scraped data within the same notebook.

Note: Remember that Scrapy is primarily designed to be run from the command line, and running it within Jupyter Notebook may have some limitations and differences in behavior compared to running it in a traditional terminal environment.

Copy code
To run the code cells with Python code, you can use the Jupyter Notebook interface and execute the cells one by one by clicking the "Run" button or using the keyboard shortcut. This will start the Scrapy spider and display the crawling progress and any logs or errors in the output area.

After the spider has finished crawling, you can continue working in subsequent cells to process the scraped data or perform any additional tasks.

In [None]:
!scrapy runspider /home/git/property_scraper/property_scraper/astegiudiziarie/spiders/search.py --output astegiudiziarie_search_default.csv

2023-05-16 05:15:26 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: scrapybot)
2023-05-16 05:15:26 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0, Python 3.11.0 (main, Mar  1 2023, 18:26:19) [GCC 11.2.0], pyOpenSSL 23.0.0 (OpenSSL 3.0.8 7 Feb 2023), cryptography 39.0.2, Platform Linux-5.19.0-35-generic-x86_64-with-glibc2.36
2023-05-16 05:15:27 [numexpr.utils] INFO: NumExpr defaulting to 8 threads.
2023-05-16 05:15:29 [root] DEBUG: This message should go to the log file
2023-05-16 05:15:29 [root] INFO: So should this
2023-05-16 05:15:29 [root] ERROR: And non-ASCII stuff, too, like Øresund and Malmö
2023-05-16 05:15:30 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}


See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2023-05-16 05:15:30 [scrapy.utils.log] DEBUG: Using reactor: tw

In [5]:
rc = {
    'name' : 'PVP Italia',
    'credentials' : os.path.expanduser('~/gspreadscraper.json'),
    'workbook_id' : '1BRloTbcVOFAL9up2wIsvaAjFuJep9f3TWQwp_f02ntw',
    
    'folder' : '/home/data/property_scraper/demos',
    'JSON filename' : '/home/data/property_scraper/demos/{search_engine}_search_property_localhost.json',

    'search excel filename' : '/home/data/property_scraper/demos/pvp_search_default.enriched.xlsx',
    'search CSV filename' : '/home/data/property_scraper/demos/pvp_search_default.enriched.csv',
    'search parquet filename' : '/home/data/property_scraper/demos/pvp_search_default.enriched.parquet',
    'search by location CSV filename' : '/home/data/property_scraper/demos/pvp_search_default_bylocation.csv',
    'search by location parquet filename' : '/home/data/property_scraper/demos/pvp_search_default_bylocation.parquet',
    'search pattern' : '/home/data/property_scraper/demos/pvp_search_default.csv*',

    'search property excel filename' : '/home/data/property_scraper/demos/pvp_search_property_localhost.enriched.xlsx',
    'search property CSV filename' : '/home/data/property_scraper/demos/pvp_search_property_localhost.enriched.csv',
    'search property parquet filename' : '/home/data/property_scraper/demos/pvp_search_property_localhost.enriched.parquet',
    'search property pattern' : '/home/data/property_scraper/demos/pvp_search_property_localhost.csv*',

    'property excel filename' : '/home/data/property_scraper/demos/pvp_property_localhost.enriched.xlsx',
    'property CSV filename' : '/home/data/property_scraper/demos/pvp_property_localhost.enriched.csv',
    'property parquet filename' : '/home/data/property_scraper/demos/pvp_property_localhost.enriched.parquet',
    'property pattern' : '/home/data/property_scraper/demos/pvp_property_localhost.csv*',
}

print(json.dumps(rc, indent=4))

{
    "name": "PVP Italia",
    "credentials": "/home/emanuele/gspreadscraper.json",
    "workbook_id": "1BRloTbcVOFAL9up2wIsvaAjFuJep9f3TWQwp_f02ntw",
    "folder": "/home/data/property_scraper/demos",
    "JSON filename": "/home/data/property_scraper/demos/{search_engine}_search_property_localhost.json",
    "search excel filename": "/home/data/property_scraper/demos/pvp_search_default.enriched.xlsx",
    "search CSV filename": "/home/data/property_scraper/demos/pvp_search_default.enriched.csv",
    "search parquet filename": "/home/data/property_scraper/demos/pvp_search_default.enriched.parquet",
    "search by location CSV filename": "/home/data/property_scraper/demos/pvp_search_default_bylocation.csv",
    "search by location parquet filename": "/home/data/property_scraper/demos/pvp_search_default_bylocation.parquet",
    "search pattern": "/home/data/property_scraper/demos/pvp_search_default.csv*",
    "search property excel filename": "/home/data/property_scraper/demos/pvp_searc

In [6]:
if search_engine == 'pvp':
    original_url_rootname = PVP_URL_ROOTNAME
else:
    raise ValueError(f'Plese specify the correct value for {original_url_rootname}!')
original_url_rootname

'https://pvp.giustizia.it'

# Update Search Configuration

In [7]:
if usecase == 'update_search_configuration':
    with open(json_filename, 'r') as f:
        data = json.load(f)

In [8]:
if usecase == 'update_search_configuration':
    url_rootname = data['root']
    print(url_rootname)

In [9]:
if usecase == 'update_search_configuration':
    rootname = ROOT_FOLDER
    print(rootname)

In [10]:
if usecase == 'update_search_configuration':
    data['searches'] = sorted(set(data['searches']))
    print(len(data['searches']))
    print('\n'.join(data['searches'][:10]))

In [11]:
if usecase == 'update_search_configuration':
    data['properties'] = sorted(set(data['properties']))
    print(len(data['properties']))
    print('\n'.join(data['properties'][:10]))

## Search

In [12]:
if usecase == 'update_search_configuration':
    filenames = glob(f'/home/git/property_scraper/www/{search_engine}/{search_engine}_search_*.html')
    filenames = sorted([os.path.relpath(x, '/home/git/property_scraper/www') for x in filenames])
    print(len(filenames))
    print('\n'.join(filenames[:10]))

In [13]:
if usecase == 'update_search_configuration':
    data['searches'] = sorted(set([x if x.startswith(f'{search_engine}/{search_engine}_search_') else f'{x[:2*len(search_engine) + 1]}_search{x[2*len(search_engine) + 1:]}' for x in filenames]))
    print(len(data['searches']))
    print('\n'.join(data['searches'][:10]))

## Property

In [14]:
if usecase == 'update_search_configuration':
    filenames = glob(f'/home/git/property_scraper/www/{search_engine}/{search_engine}_property_LTT*.html')
    filenames = sorted([os.path.relpath(x, '/home/git/property_scraper/www') for x in filenames])
    print(len(filenames))
    print('\n'.join(filenames[:10]))

In [15]:
if usecase == 'update_search_configuration':
    if 0 == 1:
        filenames_new = sorted(set([x if x.startswith(f'{search_engine}/{search_engine}_search_') else f'{x[:2*len(search_engine) + 1]}_property{x[2*len(search_engine) + 1:]}' for x in filenames]))
        print(len(filenames_new))
        filenames_new[:10]

In [16]:
if usecase == 'update_search_configuration':
    if 0 == 1:
        pbar = tqdm(range(len(filenames)))
        for ix in pbar:
            os.rename(os.path.join('/home/git/property_scraper/99_backup', filenames[ix]), 
                      os.path.join('/home/git/property_scraper/99_backup', filenames_new[ix]))

In [17]:
if usecase == 'update_search_configuration':
    data['properties'] = sorted(set([x if x.startswith(f'{search_engine}/{search_engine}_property_') else f'{x[:2*len(search_engine) + 1]}_search{x[2*len(search_engine) + 1:]}' for x in filenames]))
    print(len(data['properties']))
    print('\n'.join(data['properties'][:10]))

In [18]:
if usecase == 'update_search_configuration':
    with open(json_filename, 'w') as f:
        json.dump(data, f, indent=4)

In [19]:
if usecase == 'update_search_configuration':
    if 0 == 1:
        filename = f'/home/git/property_scraper/demos/{search_engine}_search_localhost.json'
        with open(filename, 'r') as f:
            data = json.load(f)
        #print(json.dumps(data, indent=4))

# Update Results

## Search

In [20]:
if usecase == 'update_results':
    store = SearchStore(name=rc['name'], credentials=rc['credentials'], workbook_id=rc['workbook_id'])

In [21]:
if usecase == 'update_results':
    store.results = store.update_results(rc['search parquet filename'], rc['search pattern'])

(13432, 23)


id,pvp_search_20230428071707_000000,pvp_search_20230428071707_000001,pvp_search_20230428071707_000002
Basename,pvp_search_20230428071707_000000.html,pvp_search_20230428071707_000001.html,pvp_search_20230428071707_000002.html
Filename,/home/git/property_scraper/www/pvp/pvp_search_...,/home/git/property_scraper/www/pvp/pvp_search_...,/home/git/property_scraper/www/pvp/pvp_search_...
Scaricato?,True,True,True
Hyperlink relativo riparato?,False,False,False
Localita' estratta dal nome del file,,,
Numero di Pagine,10726,10726,10726
Numero di Risultati,536292,536292,536292
Numero di Risultati per Pagina,50,50,50
Identificativo della Pagina,3693,3694,3695
Codice dello Stato della Risposta,200,200,200


/home/data/property_scraper/demos/pvp_search_default.csv*


In [22]:
if usecase == 'update_results':
    store.fix_results()

store.results.dtypes

Basename                                object
Filename                                object
Scaricato?                                bool
Hyperlink relativo riparato?              bool
Localita' estratta dal nome del file    object
Numero di Pagine                         int64
Numero di Risultati                      int64
Numero di Risultati per Pagina           int64
Identificativo della Pagina              int64
Codice dello Stato della Risposta        int64
Nome del Ragno                          object
URL                                     object
Localhost URL                           object
Data della Ricerca                      object
tipo_bene                               object
geo                                     object
raggio                                   int64
ordinamento                             object
ordine_localita                         object
view                                    object
elementiPerPagina                        int64
frame4_item  

In [23]:
n = 5
store.show(n)

Basename -> 
	pvp_search_20230428071707_000000.html
	pvp_search_20230428071707_000001.html
	pvp_search_20230428071707_000002.html
	pvp_search_20230428071707_000003.html
	pvp_search_20230428071707_000004.html
Filename -> 
	/home/git/property_scraper/www/pvp/pvp_search_20230428071707_000000.html
	/home/git/property_scraper/www/pvp/pvp_search_20230428071707_000001.html
	/home/git/property_scraper/www/pvp/pvp_search_20230428071707_000002.html
	/home/git/property_scraper/www/pvp/pvp_search_20230428071707_000003.html
	/home/git/property_scraper/www/pvp/pvp_search_20230428071707_000004.html
Scaricato? -> 
	True
	False
Hyperlink relativo riparato? -> 
	False
	True
Localita' estratta dal nome del file -> 
	None
	bologna
	genova
	palermo
	ravenna
Numero di Pagine -> 
	10726
	10727
	10728
	10729
	10730
Numero di Risultati -> 
	536292
	536293
	536294
	536295
	536296
Numero di Risultati per Pagina -> 
	50
	35
	44
	19
	8
Identificativo della Pagina -> 
	3693
	3694
	3695
	3696
	3697
Codice dello Stat

Let's remove the incomplete searches.

In [24]:
if usecase == 'update_results':
    store.results_bylocation = store.create_results_bylocation_table()

(13432, 23) != (13432, 23)


Unnamed: 0_level_0,Data della Ricerca,2023-02-04,2023-04-25,2023-04-26,2023-04-27,2023-04-28,2023-04-29,2023-04-30,2023-11-04
Localita' estratta dal nome del file,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
bari,Numero di Pagine,0,52,0,0,0,0,0,0
bari,Prima Pagina,0,1,0,0,0,0,0,0
bari,Ultima Pagina,0,52,0,0,0,0,0,0
bergamo,Numero di Pagine,0,11,0,0,0,0,0,10
bergamo,Prima Pagina,0,1,0,0,0,0,0,1
bergamo,Ultima Pagina,0,11,0,0,0,0,0,10
bologna,Numero di Pagine,31,32,0,0,0,0,0,32
bologna,Prima Pagina,1,1,0,0,0,0,0,1
bologna,Ultima Pagina,31,11,0,0,0,0,0,1
busto+arsizio,Numero di Pagine,0,11,0,0,0,0,0,11


In [25]:
if usecase == 'update_results':
    store.backup_results(rc['search parquet filename'], rc['search by location parquet filename'])

Renaming file from /home/data/property_scraper/demos/pvp_search_default.enriched.parquet to /home/data/property_scraper/demos/pvp_search_default.enriched.20230508.parquet...
Saving file /home/data/property_scraper/demos/pvp_search_default.enriched.parquet...
Saving file /home/data/property_scraper/demos/pvp_search_default_bylocation.parquet...


In [26]:
if usecase == 'update_results':
    store.backup_results(rc['search CSV filename'], rc['search by location CSV filename'])

Renaming file from /home/data/property_scraper/demos/pvp_search_default.enriched.csv to /home/data/property_scraper/demos/pvp_search_default.enriched.20230508.csv...
Saving file /home/data/property_scraper/demos/pvp_search_default.enriched.csv...
Renaming file from /home/data/property_scraper/demos/pvp_search_default.enriched.csv to /home/data/property_scraper/demos/pvp_search_default_bylocation.20230508.csv...
Saving file /home/data/property_scraper/demos/pvp_search_default_bylocation.csv...


In [27]:
if usecase == 'update_results':
    store.backup_results(rc['search excel filename'], '')

Renaming file from /home/data/property_scraper/demos/pvp_search_default.enriched.xlsx to /home/data/property_scraper/demos/pvp_search_default.enriched.20230508.xlsx...
Saving file /home/data/property_scraper/demos/pvp_search_default.enriched.xlsx...


In [28]:
if usecase == 'update_results':
    store.to_gsheet(sheetname='search', bylocation_sheetname='search_bylocation')

KeyError: "None of ['Data della Ricerca'] are in the columns"

Check whether the search page was downloaded.

In [None]:
if usecase == 'update_results':
    if 0 == 1:
        search['Scaricato?'] = search['Filename'].apply(lambda x: is_downloaded(x))
        display(search[~search['Scaricato?']].head(3).T)

Check whether the search page hyperlinks were fixed.

In [None]:
%%time

if usecase == 'update_results':
    if 0 == 1:
        pbar = tqdm(search[search['Scaricato?']].index)
        for ix in pbar:
            search.loc[ix, 'Hyperlink relativo riparato?'] = is_fixed(search.loc[ix, 'Filename'])
        print(search[~search['Hyperlink relativo riparato?']].shape)
        display(search[~search['Hyperlink relativo riparato?']].head(3).T)

In [None]:
if usecase == 'update_results':
    if 0 == 1:
        print(search[~search['Hyperlink relativo riparato?']].shape)
        print(search.shape)

In [None]:
if usecase == 'update_results':
    if 0 == 1:
        basenames = [os.path.basename(x) for x in data['searches']]
        indices = search[search['Basename'].isin(basenames)].index
        search.loc[indices, 'Hyperlink relativo riparato?'] = True
        print(search[~search['Hyperlink relativo riparato?']].shape)
        display(search.head(3).T)

In [None]:
if usecase == 'update_results':
    if 0 == 1:
        df = search[search['Scaricato?'] & ~search['Hyperlink relativo riparato?']]
        display(df.head(3).T)

Let's enrich the search results.

In [None]:
if usecase == 'update_results':
    if 0 == 1:
        search['Data della Ricerca'] = search[['Basename']].apply(lambda x: get_datetime(x[0]), axis=1)
        #searches['data_della_ricerca'] = pd.to_datetime(searches['data_della_ricerca'])
        search['Data della Ricerca'] = search['Data della Ricerca'].map(lambda x: x.strftime('%d/%m/%Y'))

        url = search.iloc[0]['URL']
        #'https://pvp.giustizia.it/pvp/it/risultati_ricerca.page?localita=Palermo&geo=geografica&ordinamento=data_vendita_decre&view=list&frame4_item=63&categoria=&nazione=ITA&elementiPerPagina=50&ordine_localita=a_z&tipo_bene=immobili'
        parameters = get_parameters(url)
        print(parameters)

        for key in parameters.keys():
            search[key] = search[['URL']].apply(lambda x: get_parameter(key, x[0]), axis=1)

        search = search.reset_index()
        search = search.rename(columns=scraper.SEARCH_NAMES)
        search = search.set_index(['Identificativo'])

        #search["Localita' estratta dal nome del file"] = search["Localita' estratta dal nome del file"].astype(str)
        indices = search[pd.isnull(search["Localita' estratta dal nome del file"])].index
        search.loc[indices, "Localita' estratta dal nome del file"] = None

        print(search.shape)
        display(search.head(3).T)

Let's update the google sheet...

## Search Property

In [29]:
if usecase == 'update_results':
    store = SearchPropertyStore(name=rc['name'], credentials=rc['credentials'], workbook_id=rc['workbook_id'])

In [30]:
if usecase == 'update_results':
    store.results = store.update_results(rc['search property parquet filename'], rc['search property pattern'])

/home/data/property_scraper/demos/pvp_search_property_localhost.csv*
/home/data/property_scraper/demos/pvp_search_property_localhost.csv


  0%|          | 0/1 [00:00<?, ?it/s]

(50, 18)
(50, 18)


Unnamed: 0,0,1,2
Basename,pvp_search_property_LTT2336958.html,pvp_search_property_LTT2362187.html,pvp_search_property_LTT2459324.html
Data di vendita,20/09/2019 10:00,20/09/2019 10:00,20/09/2019 10:00
Filename,/home/git/property_scraper/demos/downloads/pvp...,/home/git/property_scraper/demos/downloads/pvp...,/home/git/property_scraper/demos/downloads/pvp...
Identificativo,pvp_search_property_LTT2336958,pvp_search_property_LTT2362187,pvp_search_property_LTT2459324
Indirizzo,"Roncalceci, Via Sauro Babini, 239 Ravenna (RA)","Fornace Zarattini, vVa Dei Granatieri, 28 Rave...","Marina di Ravenna, Viale Fratelli Bandiera, 2 ..."
Scaricato?,False,False,False
Hyperlink relativo riparato?,False,False,False
Lotto,LOTTO UNICO,LOTTO UNICO,Lotto nr. 1
Modalità consegna,Presso il venditore,Presso il venditore,Presso il venditore
Numero della procedura,153/2018,153/2018,153/2018


(50, 18)
(50, 18)


Unnamed: 0,0,1,2
Basename,pvp_search_property_LTT2336958.html,pvp_search_property_LTT2362187.html,pvp_search_property_LTT2459324.html
Data di vendita,20/09/2019 10:00,20/09/2019 10:00,20/09/2019 10:00
Filename,/home/git/property_scraper/demos/downloads/pvp...,/home/git/property_scraper/demos/downloads/pvp...,/home/git/property_scraper/demos/downloads/pvp...
Identificativo,pvp_search_property_LTT2336958,pvp_search_property_LTT2362187,pvp_search_property_LTT2459324
Indirizzo,"Roncalceci, Via Sauro Babini, 239 Ravenna (RA)","Fornace Zarattini, vVa Dei Granatieri, 28 Rave...","Marina di Ravenna, Viale Fratelli Bandiera, 2 ..."
Scaricato?,False,False,False
Hyperlink relativo riparato?,False,False,False
Lotto,LOTTO UNICO,LOTTO UNICO,Lotto nr. 1
Modalità consegna,Presso il venditore,Presso il venditore,Presso il venditore
Numero della procedura,153/2018,153/2018,153/2018


Unnamed: 0,0,1,2
Basename,pvp_search_property_LTT2336958.html,pvp_search_property_LTT2362187.html,pvp_search_property_LTT2459324.html
Data di vendita,20/09/2019 10:00,20/09/2019 10:00,20/09/2019 10:00
Filename,/home/git/property_scraper/demos/downloads/pvp...,/home/git/property_scraper/demos/downloads/pvp...,/home/git/property_scraper/demos/downloads/pvp...
Identificativo,pvp_search_property_LTT2336958,pvp_search_property_LTT2362187,pvp_search_property_LTT2459324
Indirizzo,"Roncalceci, Via Sauro Babini, 239 Ravenna (RA)","Fornace Zarattini, vVa Dei Granatieri, 28 Rave...","Marina di Ravenna, Viale Fratelli Bandiera, 2 ..."
Scaricato?,False,False,False
Hyperlink relativo riparato?,False,False,False
Lotto,LOTTO UNICO,LOTTO UNICO,Lotto nr. 1
Modalità consegna,Presso il venditore,Presso il venditore,Presso il venditore
Numero della procedura,153/2018,153/2018,153/2018


In [31]:
if usecase == 'update_results':
    store.fix_results()

store.results.dtypes

Basename                              object
Data di vendita                       object
Filename                              object
Identificativo                        object
Indirizzo                             object
Scaricato?                              bool
Hyperlink relativo riparato?            bool
Lotto                                 object
Modalità consegna                     object
Numero della procedura                object
Offerta minima                       float64
Prezzo Base                          float64
Codice dello Stato della Risposta      int64
Rialzo minimo                        float64
Identificativo della ricerca          object
Nome del Ragno                        object
URL                                   object
Localhost URL                         object
Batch                                 object
dtype: object

In [32]:
n = 5
store.show(n)

Basename -> 
	pvp_search_property_LTT2336958.html
	pvp_search_property_LTT2362187.html
	pvp_search_property_LTT2459324.html
	pvp_search_property_LTT2427653.html
	pvp_search_property_LTT2396728.html
Data di vendita -> 
	20/09/2019 10:00
Filename -> 
	/home/git/property_scraper/demos/downloads/pvp/pvp_search_property_LTT2336958.html
	/home/git/property_scraper/demos/downloads/pvp/pvp_search_property_LTT2362187.html
	/home/git/property_scraper/demos/downloads/pvp/pvp_search_property_LTT2459324.html
	/home/git/property_scraper/demos/downloads/pvp/pvp_search_property_LTT2427653.html
	/home/git/property_scraper/demos/downloads/pvp/pvp_search_property_LTT2396728.html
Identificativo -> 
	pvp_search_property_LTT2336958
	pvp_search_property_LTT2362187
	pvp_search_property_LTT2459324
	pvp_search_property_LTT2427653
	pvp_search_property_LTT2396728
Indirizzo -> 
	Roncalceci, Via Sauro Babini, 239 Ravenna (RA)
	Fornace Zarattini, vVa Dei Granatieri, 28 Ravenna (RA)
	Marina di Ravenna, Viale Fratelli

In [33]:
if usecase == 'update_results':
    store.backup_results(rc['search property parquet filename'], None)

Saving file /home/data/property_scraper/demos/pvp_search_property_localhost.enriched.parquet...


In [34]:
if usecase == 'update_results':
    store.backup_results(rc['search property CSV filename'], None)

Saving file /home/data/property_scraper/demos/pvp_search_property_localhost.enriched.csv...


In [35]:
if usecase == 'update_results':
    store.backup_results(rc['search excel filename'], None)

Renaming file from /home/data/property_scraper/demos/pvp_search_default.enriched.xlsx to /home/data/property_scraper/demos/pvp_search_default.enriched.20230508.xlsx...
Saving file /home/data/property_scraper/demos/pvp_search_default.enriched.xlsx...


AttributeError: 'SearchPropertyStore' object has no attribute 'results_bylocation'

In [36]:
if usecase == 'update_results':
    store.to_gsheet(sheetname='search_property', bylocation_sheetname=None)

In [None]:
if usecase == 'update_results':
    if 0 == 1:
        for cid in ['indirizzo', 'numero_di_procedura']:
            search_property[cid] = search_property[cid].astype(str)
        for cid in ['is_downloaded']:
            search_property[cid] = search_property[cid].astype(bool)
        for cid in ['offerta_minima', 'prezzo_base']:
            search_property[cid] = search_property[cid].astype(float)
        for cid in ['response_status_code']:
            search_property[cid] = search_property[cid].astype(int)

## Property

In [37]:
if usecase == 'update_results':
    store = PropertyStore(name=rc['name'], credentials=rc['credentials'], workbook_id=rc['workbook_id'])

In [40]:
if usecase == 'update_results':
    store.results = store.update_results(rc['property parquet filename'], rc['property pattern'])

/home/data/property_scraper/demos/pvp_property_localhost.csv*
/home/data/property_scraper/demos/pvp_property_localhost.csv


  0%|          | 0/1 [00:00<?, ?it/s]

(247838, 47)
(247838, 47)


Unnamed: 0,0,1,2
allegati,Relazione generale Mantova_Omissis.pdf (292 Kb...,Avviso di vendita Spada 4 GEN 19 A.pdf (109 Kb...,"avviso di vendita.pdf (24 Kb),ordinanza.pdf (7..."
allegati_url,/pvp-resources/cms/documents/2018/20181106/af7...,https://pvp.giustizia.it/pvp-resources/cms/doc...,https://pvp.giustizia.it/pvp-resources/cms/doc...
Basename,,,
bene_descrizione,LOTTO B9Area da urbanizzare di circa 3.286 mq ...,Quota di proprietà del 50% di Appartamento al ...,Appartamento ad uso abitativo al piano terreno...
bene_disponibilita,,,
bene_foglio,,,
bene_particella,,,
bene_piano,,,
bene_sub,,,
bene_tipologia,Senza incanto,Senza incanto,Senza incanto


(247838, 47)
(247838, 47)


Unnamed: 0,0,1,2
allegati,Relazione generale Mantova_Omissis.pdf (292 Kb...,Avviso di vendita Spada 4 GEN 19 A.pdf (109 Kb...,"avviso di vendita.pdf (24 Kb),ordinanza.pdf (7..."
allegati_url,/pvp-resources/cms/documents/2018/20181106/af7...,https://pvp.giustizia.it/pvp-resources/cms/doc...,https://pvp.giustizia.it/pvp-resources/cms/doc...
Basename,,,
bene_descrizione,LOTTO B9Area da urbanizzare di circa 3.286 mq ...,Quota di proprietà del 50% di Appartamento al ...,Appartamento ad uso abitativo al piano terreno...
bene_disponibilita,,,
bene_foglio,,,
bene_particella,,,
bene_piano,,,
bene_sub,,,
bene_tipologia,Senza incanto,Senza incanto,Senza incanto


Unnamed: 0,0,1,2
allegati,Relazione generale Mantova_Omissis.pdf (292 Kb...,Avviso di vendita Spada 4 GEN 19 A.pdf (109 Kb...,"avviso di vendita.pdf (24 Kb),ordinanza.pdf (7..."
allegati_url,/pvp-resources/cms/documents/2018/20181106/af7...,https://pvp.giustizia.it/pvp-resources/cms/doc...,https://pvp.giustizia.it/pvp-resources/cms/doc...
Basename,,,
bene_descrizione,LOTTO B9Area da urbanizzare di circa 3.286 mq ...,Quota di proprietà del 50% di Appartamento al ...,Appartamento ad uso abitativo al piano terreno...
bene_disponibilita,,,
bene_foglio,,,
bene_particella,,,
bene_piano,,,
bene_sub,,,
bene_tipologia,Senza incanto,Senza incanto,Senza incanto


In [41]:
if usecase == 'update_results':
    store.fix_results()

store.results.dtypes

allegati                              object
allegati_url                          object
Basename                              object
bene_descrizione                      object
bene_disponibilita                   float64
bene_foglio                          float64
bene_particella                      float64
bene_piano                           float64
bene_sub                             float64
bene_tipologia                        object
bene_vani                            float64
custode                               object
custode_email                         object
custode_telefono                      object
custode_telefono_mobile               object
Data di vendita                       object
data_pubblicazione_sul_portale        object
delegato_alla_vendita                 object
descrizione                           object
dettaglio                            float64
dettaglio_lotto_descrizione           object
dettaglio_lotto_titolo                object
esecuzione

In [42]:
n = 5
store.show(n)

allegati -> 
	Relazione generale Mantova_Omissis.pdf (292 Kb),bando XII vendita 14.12.2018.pdf (411 Kb)
	Avviso di vendita Spada 4 GEN 19 A.pdf (109 Kb),ESTR_SENT_FALL.pdf (128 Kb),PERIZIA SPADA SEGREGATA.pdf (1727 Kb)
	avviso di vendita.pdf (24 Kb),ordinanza.pdf (729 Kb),Perizia.pdf (384 Kb),foto.pdf (2992 Kb),planimetria.pdf (318 Kb)
	avviso12 161.07.pdf (235 Kb),ordinanza 161.07.pdf (167 Kb)
	795821_perizia.pdf (3677 Kb),795820_ordinanza.pdf (343 Kb)
allegati_url -> 
	/pvp-resources/cms/documents/2018/20181106/af71de477c3709141b9d7d691974ce2b.pdf,/pvp-resources/cms/documents/2018/20181106/22574d7c3264406812ebc2655e0e7c73.pdf
	https://pvp.giustizia.it/pvp-resources/cms/documents/2018/20181106/92e21937c5c301dff6a41e452e09bc62.pdf,https://pvp.giustizia.it/pvp-resources/cms/documents/2018/20181106/e87642c6b376b3da150ffcc28ea2d2d6.pdf,https://pvp.giustizia.it/pvp-resources/cms/documents/2018/20181106/5a64d43edead47867022e20e9951ed45.pdf
	https://pvp.giustizia.it/pvp-resources/cms/documen

In [43]:
if usecase == 'update_results':
    store.backup_results(rc['property parquet filename'], None)

Saving file /home/data/property_scraper/demos/pvp_property_localhost.enriched.parquet...


In [44]:
if usecase == 'update_results':
    store.backup_results(rc['property CSV filename'], None)

Saving file /home/data/property_scraper/demos/pvp_property_localhost.enriched.csv...


In [46]:
if usecase == 'update_results':
    pass
    #store.backup_results(rc['property excel filename'], None)

In [49]:
if usecase == 'update_results':
    table_name = 'property'
    sql_statement = pd.io.sql.get_schema(store.results, table_name).replace('"', '`')#.replace('\n', '').replace('"', "")
    print(sql_statement)

CREATE TABLE `property` (
`allegati` TEXT,
  `allegati_url` TEXT,
  `Basename` TEXT,
  `bene_descrizione` TEXT,
  `bene_disponibilita` REAL,
  `bene_foglio` REAL,
  `bene_particella` REAL,
  `bene_piano` REAL,
  `bene_sub` REAL,
  `bene_tipologia` TEXT,
  `bene_vani` REAL,
  `custode` TEXT,
  `custode_email` TEXT,
  `custode_telefono` TEXT,
  `custode_telefono_mobile` TEXT,
  `Data di vendita` TEXT,
  `data_pubblicazione_sul_portale` TEXT,
  `delegato_alla_vendita` TEXT,
  `descrizione` TEXT,
  `dettaglio` REAL,
  `dettaglio_lotto_descrizione` TEXT,
  `dettaglio_lotto_titolo` TEXT,
  `esecuzione` REAL,
  `Filename` REAL,
  `giudice` TEXT,
  `Identificativo` REAL,
  `Indirizzo` TEXT,
  `inserzione` REAL,
  `Scaricato?` INTEGER,
  `Hyperlink relativo riparato?` INTEGER,
  `luogo_vendita` TEXT,
  `modalita_di_vendita` TEXT,
  `numero_lotto` TEXT,
  `numero_procedura` TEXT,
  `Offerta minima` REAL,
  `Prezzo Base` REAL,
  `pubblicato_in` TEXT,
  `pubblicato_in_url` TEXT,
  `Codice dello St

In [51]:
filename = '/home/data/property_scraper/demos/pvp_property_localhost.enriched.sql'

from sqlalchemy import create_engine
engine = create_engine(f'sqlite://{filename}', echo=False)
store.results.to_sql('property', con=engine, if_exists='append')

OperationalError: (sqlite3.OperationalError) unable to open database file
(Background on this error at: https://sqlalche.me/e/14/e3q8)

In [None]:
stop here

Let's remove the incomplete searches.

In [None]:
if usecase == 'update_results':
    assert(len(sorted(store.results[store.results['indirizzo'] == 'False'].index)) == 0)

In [None]:
if usecase == 'update_results':
    assert(len(sorted(store.results[store.results['response_status_code'] != 200].index)) == 0)

In [None]:
if usecase == 'update_results':
    if 0 == 1:
        for cid in ['indirizzo']: #, 'numero_di_procedura']:
            property[cid] = property[cid].astype(str)
        for cid in ['is_downloaded']:
            property[cid] = property[cid].astype(bool)
        for cid in ['offerta_minima', 'prezzo_base']:
            property[cid] = property[cid].astype(float)
        for cid in ['response_status_code']:
            property[cid] = property[cid].astype(int)

In [None]:
#search.to_parquet(parquet_filename)

# Link Extractor

In [None]:
import pandas as pd

dataframe = pd.DataFrame(worksheet.get_all_records())

## Search

In [None]:
worksheet = workbook.worksheet('search')

In [None]:
search = pd.DataFrame(worksheet.get_all_records())
search = search.set_index('Identificativo')
print(search.shape)
display(search.head())

In [None]:
filenames = glob(os.path.join(rootname, 'pvp', 'pvp_search_*.html'))
basenames = [os.path.basename(x) for x in filenames]
print(f"{len(filenames)} -> \n\t" + '\n\t'.join(filenames[:10]))

In [None]:
search_missing = set(basenames) - set(search['Basename'].unique())
print(len(search_missing))
print('\n'.join(sorted(search_missing)[:5]))
print('\n'.join(sorted(search_missing)[-5:]))

In [None]:
if 0 == 1:
    repair_url('http://localhost/pvp/pvp_property_LTT4199300.html')

In [None]:
urls = searches_to_be_downloaded(search_missing)
print(f"{len(urls)} -> " + '\n'.join(sorted(urls)))

In [None]:
if 0 == 1:
    pbar = tqdm(search.index)
    for idx in pbar:
        filename = search.loc[idx, 'Filename']
        links = get_links(filename)
        pbar.set_description(f"{filename:>80} -> {len(links):>10}")
        print('\n'.join(links[:10]))
        # http://localhost/pvp/pvp_property_LTT5906692.html -> 
        break

## Search Property

In [None]:
worksheet = workbook.worksheet('search_property')

In [None]:
search_property = pd.DataFrame(worksheet.get_all_records())
search_property = search_property.set_index('Identificativo')
print(search_property.shape)
display(search_property.head())

In [None]:
filenames = glob(os.path.join(rootname, 'pvp', 'pvp_property_*.html'))
basenames = [os.path.basename(x) for x in filenames]
print(f"{len(filenames)} -> \n\t" + '\n\t'.join(filenames[:10]))

In [None]:
search_property_missing = set(basenames) - set(search_property['Basename'].unique())
print(len(search_property_missing))
print('\n'.join(sorted(search_property_missing)[:5]))
print('\n'.join(sorted(search_property_missing)[-5:]))

In [None]:
if 0 == 1:
    repair_url('http://localhost/pvp/pvp_property_LTT4199300.html')

In [None]:
urls = searches_to_be_downloaded(search_property_missing)
print(f"{len(urls)} -> " + '\n'.join(sorted(urls)))

## Property

In [None]:
worksheet = workbook.worksheet('property')

In [None]:
property = pd.DataFrame(worksheet.get_all_records())
property['Basename'] = property['url'].apply(lambda x: os.path.basename(x))
property['Identificativo'] = property['Basename'].apply(lambda x: os.path.splitext(x)[0].split('_')[-1].replace('LTT', ''))
property = property.set_index('Identificativo')
print(property.shape)
display(property.head())

In [None]:
filenames = glob(os.path.join(rootname, 'pvp', 'pvp_property_*.html'))
basenames = [os.path.basename(x) for x in filenames]
print(f"{len(filenames)} -> \n\t" + '\n\t'.join(filenames[:10]))

In [None]:
property_missing = set(basenames) - set(property['Basename'].unique())
print(len(property_missing))
print('\n'.join(sorted(property_missing)[:5]))
print('\n'.join(sorted(property_missing)[-5:]))

In [None]:
if 0 == 1:
    repair_url('http://localhost/pvp/pvp_property_LTT4199300.html')

In [None]:
urls = searches_to_be_downloaded(search_property_missing)
print(f"{len(urls)} -> " + '\n'.join(sorted(urls)))

# Link Fixer

## Search

In [None]:
if usecase == 'fix_links':
    parquet_filename = '/home/git/property_scraper/demos/pvp_search_default.enriched.parquet'
    search = pd.read_parquet(parquet_filename)
    print(search.shape)
    display(search.head(3).T)

In [None]:
if usecase == 'fix_links':
    #_filenames = [x for x in sorted(data['searches'])]
    #print(f"{len(_filenames):10} -> {_filenames[:1]}")
    indices = search[~search['Scaricato?']].index
    filenames_to_be_downloaded = [os.path.join(search_engine, x) for x in sorted(search.loc[indices, 'Basename'].unique())]
    print(f"Number of Search Files to be downloaded: {len(filenames_to_be_downloaded):36} -> {filenames_to_be_downloaded[:1]}")
    indices = search[search['Scaricato?'] & search['Hyperlink relativo riparato?']].index
    filenames_fixed = [os.path.join(search_engine, x) for x in sorted(search.loc[indices, 'Basename'].unique())]
    print(f"Number of Search Files downloaded and with Hyperlinks fixed: {len(filenames_fixed):16} -> {filenames_fixed[:1]}")
    indices = search[search['Scaricato?'] & ~search['Hyperlink relativo riparato?']].index
    filenames_to_be_fixed = [os.path.join(search_engine, x) for x in sorted(search.loc[indices, 'Basename'].unique())]
    print(f"Number of Search Files downloaded but with Hyperlinks to be fixed: {len(filenames_to_be_fixed):10} -> {filenames_to_be_fixed[:1]}")
    if 0 == 1:
        #filenames = [x for x in sorted(data['searches']) if 'pvp/pvp_search_20230429' in x]
        #filenames = filenames[10:11]
        filenames_to_be_fixed = ['pvp/pvp_search_20230429072621_000069.html']
        print(filenames_to_be_fixed)

In [None]:
def fix_search_hyperlinks(filename:str):
    prefixes = {
        '/': f'{PVP_URL_ROOTNAME}/',
        'prenotazione_visita.page?': f'{PVP_URL_ROOTNAME}/pvp/it/prenotazione_visita.page?',
        f'{PVP_URL_ROOTNAME}/pvp/en/': f'{PVP_URL_ROOTNAME}/pvp/it/',
        'http://localhost/pvp-resources/': f'{PVP_URL_ROOTNAME}/pvp-resources/',
        'http://localhost/pvp/it/': f'{PVP_URL_ROOTNAME}/pvp/it/'
    }
    
    func = {
        f'{PVP_URL_ROOTNAME}/pvp/it/dettaglio_annuncio.page?': get_filename_from_identifier,
        f'{PVP_URL_ROOTNAME}/pvp/en/dettaglio_annuncio.page?': get_filename_from_identifier
    }
    keys = ['contentId']
    root = f'http://localhost/{search_engine}/{search_engine}_property'
    
    path = os.path.join('/home/git/property_scraper/demos/downloads', filename)
    subfolder = f'../../../www/{search_engine}'
    text = None
    with open(path, 'r') as f:
        text = f.read()
        text = text.replace("'/pvp-resources/", f"'{PVP_URL_ROOTNAME}/pvp-resources/")
        text = text.replace('"/pvp-resources/', f'"{PVP_URL_ROOTNAME}/pvp-resources/')
        text = text.replace('http://localhost/pvp/de/', f'{PVP_URL_ROOTNAME}/pvp/de/')
        text = text.replace('http://localhost/pvp/en/', f'{PVP_URL_ROOTNAME}/pvp/en/')
        text = text.replace('http://localhost/pvp/it/', f'{PVP_URL_ROOTNAME}/pvp/it/')
        text = text.replace('http://localhost/pvp/en/risultati_ricerca.page?', f'{PVP_URL_ROOTNAME}/pvp/it/risultati_ricerca.page?')
        text = text.replace('http://localhost/pvp/it/risultati_ricerca.page?', f'{PVP_URL_ROOTNAME}/pvp/it/risultati_ricerca.page?')
        text = text.replace('http://localhost/pvp/it/dettaglio_annuncio.page?', f'{PVP_URL_ROOTNAME}/pvp/it/dettaglio_annuncio.page?')
        text = text.replace('http://localhost:8000/pvp/de/', f'{PVP_URL_ROOTNAME}/pvp/de/')
        text = text.replace('http://localhost:8000/pvp/en/', f'{PVP_URL_ROOTNAME}/pvp/en/')
        text = text.replace('http://localhost:8000/pvp/it/', f'{PVP_URL_ROOTNAME}/pvp/it/')
        text = text.replace('http://localhost:8000/pvp/en/risultati_ricerca.page?', f'{PVP_URL_ROOTNAME}/pvp/it/risultati_ricerca.page?')
        text = text.replace('http://localhost:8000/pvp/it/risultati_ricerca.page?', f'{PVP_URL_ROOTNAME}/pvp/it/risultati_ricerca.page?')
    if text:
        #print(f"File {path} was modified!")
        opath = os.path.realpath(os.path.join(os.path.dirname(path), subfolder, os.path.basename(path)))
        #print(f'Copying filename {path} to {opath} ... ')
        with open(opath, 'w') as of:
            of.write(text)
        fix_hyperlinks(opath, filters=func, keys=keys, root=root, url_rootname=url_rootname, prefixes=prefixes, overwrite=True)#, subfolder=subfolder)
        #break

In [None]:
%%time

# Define the inputs to your function
n0 = 0
filenames = filenames_to_be_fixed[n0:]

if usecase == 'fix_links':
    if run_in_parallel:
        # Create a multiprocessing pool
        with Pool(processes=no_processes) as mp_pool:
            # Use the pool to apply the function to each input in parallel
            results = []
            for result in tqdm(mp_pool.imap_unordered(fix_property_hyperlinks, filenames), total=len(filenames)):
                results.append(result)
    else:
        pbar = tqdm(filenames, position=0)
        #ix = 0
        for filename in pbar:
            pbar.set_description(f'{filename: >60}')
            fix_search_hyperlinks(filename)

## Property

In [None]:
if usecase == 'fix_links':
    #filenames = [x for x in sorted(data['properties'])]
    filenames_to_be_fixed = glob('/home/git/property_scraper/demos/downloads/pvp/pvp_property_LTT*.html')
    filenames_to_be_fixed = [os.path.relpath(x, '/home/git/property_scraper/demos/downloads') for x in filenames_to_be_fixed]
    print(f"Number of Search Files downloaded but with Hyperlinks to be fixed: {len(filenames_to_be_fixed):10} -> {filenames_to_be_fixed[:1]}")    
    if 0 == 1:
        #filenames = [x for x in sorted(data['searches']) if 'pvp/pvp_search_20230429' in x]
        filenames = filenames[-10:]
        print(filenames)    

In [None]:
def fix_property_hyperlinks(filename:str):
    prefixes = {
        '/': f'{PVP_URL_ROOTNAME}/',
        'prenotazione_visita.page?': f'{PVP_URL_ROOTNAME}/pvp/it/prenotazione_visita.page?',
        f'{PVP_URL_ROOTNAME}/pvp/en/': f'{PVP_URL_ROOTNAME}/pvp/it/',
        'http://localhost/pvp-resources/': f'{PVP_URL_ROOTNAME}/pvp-resources/',
        'http://localhost/pvp/it/': f'{PVP_URL_ROOTNAME}/pvp/it/',
        'http://localhost:8000/pvp/documents/': './documents/'
    }

    func = {
        f'{PVP_URL_ROOTNAME}/pvp-resources/cms/documents/': get_basename,

    }
    #https://pvp.giustizia.it/pvp-resources/cms/documents/2018/20181003/c54d1e6f718d6a7275bf317b0ebcc4b2.pdf
    keys = [None]
    root = f'http://localhost:8000/{search_engine}/documents/'
    
    path = os.path.join('/home/git/property_scraper/demos/downloads', filename)
    subfolder = f'../../../www/{search_engine}'
    text = None
    with open(path, 'r') as f:
        text = f.read()
        text = text.replace("'/pvp-resources/", f"'{PVP_URL_ROOTNAME}/pvp-resources/")
        text = text.replace('"/pvp-resources/', f'"{PVP_URL_ROOTNAME}/pvp-resources/')
        text = text.replace('http://localhost/pvp/de/', f'{PVP_URL_ROOTNAME}/pvp/de/')
        text = text.replace('http://localhost/pvp/en/', f'{PVP_URL_ROOTNAME}/pvp/en/')
        text = text.replace('http://localhost/pvp/it/', f'{PVP_URL_ROOTNAME}/pvp/it/')
        text = text.replace('http://localhost/pvp/en/risultati_ricerca.page?', f'{PVP_URL_ROOTNAME}/pvp/it/risultati_ricerca.page?')
        text = text.replace('http://localhost/pvp/it/risultati_ricerca.page?', f'{PVP_URL_ROOTNAME}/pvp/it/risultati_ricerca.page?')
        text = text.replace('http://localhost/pvp/it/dettaglio_annuncio.page?', f'{PVP_URL_ROOTNAME}/pvp/it/dettaglio_annuncio.page?')
        text = text.replace('http://localhost:8000/pvp/de/', f'{PVP_URL_ROOTNAME}/pvp/de/')
        text = text.replace('http://localhost:8000/pvp/en/', f'{PVP_URL_ROOTNAME}/pvp/en/')
        text = text.replace('http://localhost:8000/pvp/it/', f'{PVP_URL_ROOTNAME}/pvp/it/')
        text = text.replace('http://localhost:8000/pvp/en/risultati_ricerca.page?', f'{PVP_URL_ROOTNAME}/pvp/it/risultati_ricerca.page?')
        text = text.replace('http://localhost:8000/pvp/it/risultati_ricerca.page?', f'{PVP_URL_ROOTNAME}/pvp/it/risultati_ricerca.page?')
    if text:
        #print(f"File {path} was modified!")
        opath = os.path.realpath(os.path.join(os.path.dirname(path), subfolder, os.path.basename(path)))
        #print(f'Copying filename {path} to {opath} ... ')
        with open(opath, 'w') as of:
            of.write(text)
        fix_hyperlinks(opath, filters=func, keys=keys, root=root, url_rootname=url_rootname, prefixes=prefixes, overwrite=True) #, subfolder=subfolder')
    #break

In [None]:
%%time

# Define the inputs to your function
n0 = 0 #231054
filenames = filenames_to_be_fixed[n0:]

if usecase == 'fix_links':
    if run_in_parallel:
        # Create a multiprocessing pool
        with Pool(processes=no_processes) as mp_pool:
            # Use the pool to apply the function to each input in parallel
            results = []
            for result in tqdm(mp_pool.imap_unordered(fix_property_hyperlinks, filenames), total=len(filenames)):
                results.append(result)
    else:
        pbar = tqdm(filenames_to_be_fixed[n0:], position=0)
        #ix = 0
        for filename in pbar:
            pbar.set_description(f'{filename: <50}')
            fix_property_hyperlinks(filename)

# Link Checker

## Search

In [None]:
%%time 

if usecase == 'check_links':
    urls = data['searches']
    linkchecker(search_engine, urls, url_rootname, f'{search_engine}_search')

## Property

In [None]:
%%time 

if usecase == 'check_links':
    urls = data['properties']
    linkchecker(search_engine, urls, url_rootname, f'{search_engine}_property')

In [None]:
#!linkchecker --verbose --file-output=csv/../demos/linkchecker-test.csv http://localhost:8001/pvp/pvp_LTT1002377.html
#!linkchecker --check-extern --verbose --ignore-url=https://pvp.giustizia.it/pvp/it/ultimi_annunci.page --ignore-url=https://pvp.giustizia.it//analiticositi.giustizia.it/piwik.php?idsite=17 --output=csv http://localhost:8001/pvp/pvp_LTT1002377.html

In [None]:
if usecase == 'check_links':
    filename = f'../demos/{search_engine}linkchecker-output.txt'

In [None]:
if usecase == 'check_links':
    with open(filename, 'r') as f:
        for rowid, line in enumerate(f.readlines()):
            if rowid == 10:
                break
            print(line.strip())

In [None]:
if usecase == 'check_links':
    links = pd.read_csv(filename, sep=';', comment='#', skiprows=[0, 1, 2, -1])
    display(links.head(3).T)