Generic software for finding websites of enterprises using a search engine and machine learning.
This repo is still a work in progress; for example, the training function will be documented later.
This repository contains the software that was used for research on finding enterprise websites. For a detailed description of the methodology implemented we refer to the discussion paper on this subject from Statistics Netherlands:
Searching for business websites by Arnout van Delden, Dick Windmeijer and Olav ten Bosch
In short, the software operates as follows:
- Train and test phase: a model for finding websites is trained and tested, using Google search results as input for the predictions.
- Predict phase: the trained model is applied to a dataset with unknown URLs, again using Google search.
It is possible to skip the train and test phase and use the pre-trained model that is provided in this repository.
Google Search IDs
This software uses the Google Custom Search JSON API, which offers 100 search queries per day for free. Use the paid version if you need more.
To get started, configure a custom search engine and get your API key from here.
Make sure to enable the search engine feature 'Search whole internet'.
Then add the API key and the search engine ID to the
Assuming an up-to-date Python Anaconda distribution, use the following commands to install urlfinding from your anaconda prompt:
    git clone https://github.com/SNStatComp/urlfinding.git
    # or download and unzip this repository
    cd urlfinding
    python setup.py install
Quick start: finding websites of NSIs
The examples folder contains a Python notebook examples/nsis.ipynb showing how to search for websites of National Statistical Institutes (NSIs) using the pre-trained model provided in this repo.
Import the urlfinding module as follows:
import urlfinding as uf
The module then provides the following functions:
uf.search(base_file, googleconfig, blacklist, nrows)
This function starts a Google search.
base_file: A .csv file with a list of enterprises for which you want to find the web address. If you want to use the pre-trained ML model provided (data/model.pkl), the file must include at least the following columns: id, tradename, legalname, address, postalcode and municipality. The column names can be specified in a mapping file (see config/mappings.yml for an example). The legal name can be the same as the trade name if you have only one name.
googleconfig: This file contains your credentials for using the Google custom search engine API
blacklist: A file containing urls you want to exclude from your search
nrows: Number of records to process. Google provides 100 queries per day for free. The urlfinding software issues 6 queries per record (see the methodology paper referenced above). Thus, for example, processing 10 enterprises fires 6 * 10 = 60 queries. Every query returns at most 10 search results.
This function creates a file (<YYYYMMDD>_searchResult.csv) in the data folder containing the search results, where YYYYMMDD is the current date.
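The quota arithmetic described for nrows can be sketched as follows. This is only an illustration of the numbers quoted above (6 queries per record, 100 free queries per day); the function names are hypothetical and not part of the urlfinding package.

```python
QUERIES_PER_RECORD = 6      # queries issued per enterprise, as stated above
FREE_QUERIES_PER_DAY = 100  # free daily quota, as stated above


def queries_needed(nrows):
    """Number of Google queries fired when processing nrows records."""
    return nrows * QUERIES_PER_RECORD


def max_free_records(daily_quota=FREE_QUERIES_PER_DAY):
    """Largest nrows that still fits within the given daily quota."""
    return daily_quota // QUERIES_PER_RECORD


print(queries_needed(10))   # 60
print(max_free_records())   # 16
```

Under these numbers, an nrows value of at most 16 stays within the free quota for a single day.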
To facilitate splitting up multiple search sessions on bigger data files, the search function creates a file maxrownum in the project folder which contains the id of the record that was processed last. The search function will read this file upon startup and start on the next record. Hence, if you want to start again from the beginning of a file, either remove the maxrownum file or replace its content with 0.
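Resetting or inspecting the maxrownum file could look roughly like the sketch below. It assumes that the file holds a single integer as plain text, which is not confirmed by the description above; the helper names read_progress and reset_progress are hypothetical and not part of the urlfinding package.

```python
from pathlib import Path


def read_progress(project_dir="."):
    """Return the id stored in maxrownum, or 0 if the file is missing or empty.

    Assumption: the file contains one integer written as plain text.
    """
    marker = Path(project_dir) / "maxrownum"
    if not marker.exists():
        return 0
    content = marker.read_text().strip()
    return int(content) if content else 0


def reset_progress(project_dir="."):
    """Replace the content of maxrownum with 0 so the next search starts over."""
    marker = Path(project_dir) / "maxrownum"
    marker.write_text("0")
    return marker
```

Calling reset_progress() in the project folder before uf.search corresponds to the "replace its content with 0" option; deleting the file by hand works just as well.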
uf.extract(date, data_files, blacklist)
This function extracts a feature file to be used for training your Machine Learning model or for predicting with an already trained model.
date: Used for adding a 'timestamp' to the name of the created feature file
data_files: list of files containing the search results
blacklist: see above
This function creates the feature file <YYYYMMDD>_features_agg.csv in the data folder.
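Preparing the date and data_files arguments might look like the sketch below. It assumes that date is passed as a YYYYMMDD string and that the search results sit in the data folder under names ending in _searchResult.csv, as described for the search function; both helper names are hypothetical and not part of the urlfinding package.

```python
from datetime import date
from pathlib import Path


def date_stamp(day):
    """Format a date as YYYYMMDD, the pattern used in the file names."""
    return day.strftime("%Y%m%d")


def find_search_results(data_dir="data"):
    """List the search result files in data_dir, sorted by file name."""
    return sorted(str(p) for p in Path(data_dir).glob("*_searchResult.csv"))


print(date_stamp(date(2024, 1, 31)))  # 20240131
```

The two values could then be handed to uf.extract together with the blacklist file, provided extract accepts the stamp in this form.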
uf.predict(feature_file, model_file, base_file)
This function predicts urls using a previously trained ML model.
feature_file: file containing the features
model_file: Pickle file containing the ML model (created with our package)
base_file: See the description of base_file above
This function creates the file <base_file>_url.csv in the data folder containing the predicted urls. This file contains all data from the base file with 3 columns added:
host: the predicted url
eqPred: An indicator showing whether the predicted url is considered to be the right one
pTrue: The confidence of the prediction, a number between 0 and 1, where 0 means almost certainly not the right url and 1 means almost certainly the right url. eqPred is derived from pTrue: if pTrue > 0.5 then eqPred = True, else eqPred = False
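The relation between pTrue and eqPred can be illustrated with a few lines of plain Python. The rows below are made-up example data and the function names are hypothetical; only the 0.5 cut-off comes from the description above.

```python
THRESHOLD = 0.5  # cut-off stated in the description of eqPred


def eq_pred(p_true, threshold=THRESHOLD):
    """Return True when the confidence is strictly above the threshold."""
    return p_true > threshold


def accepted(rows, threshold=THRESHOLD):
    """Keep only the rows whose pTrue exceeds the threshold."""
    return [row for row in rows if eq_pred(row["pTrue"], threshold)]


# Made-up rows, for illustration only
rows = [
    {"host": "example.com", "pTrue": 0.91},
    {"host": "example.org", "pTrue": 0.50},
    {"host": "example.net", "pTrue": 0.12},
]
print([row["host"] for row in accepted(rows)])  # ['example.com']
```

Note that a pTrue of exactly 0.5 gives eqPred = False, because the comparison is strict. Passing a higher threshold to accepted keeps only the more confident predictions.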
Documentation will follow later.
- The urlfinding software returns domains of length 2 only (for example cbs.nl), i.e. no subdomains of cbs.nl
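What "domains of length 2" means can be sketched as follows. This is a naive illustration, not the package's actual implementation: it simply keeps the last two labels of the host name, so it would mishandle suffixes such as co.uk. The function name and the example URLs are assumptions.

```python
from urllib.parse import urlsplit


def two_label_domain(url):
    """Reduce a URL or host name to its last two labels, e.g. www.cbs.nl -> cbs.nl."""
    if "//" not in url:
        url = "//" + url  # lets urlsplit recognise a bare host name
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


print(two_label_domain("https://www.cbs.nl/en-gb"))  # cbs.nl
print(two_label_domain("cbs.nl"))                    # cbs.nl
```

A consequence of this limitation is that an enterprise whose website lives on a subdomain can at best be matched to the parent domain.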