In [1]:
import urlfinding as uf

## Get data from the web

To start scraping, the following information is needed:
- **base_file**: A .csv file listing the enterprises for which you want to find the web address. If you want to use the provided pretrained ML model (_data/model.pkl_), the file must include at least the following columns: _id, tradename, legalname, address, postalcode_ and _locality_. The column names can be specified in a mapping file (see _config/mappings.yml_ for an example).
- **googleconfig**: A file containing your credentials for the [Google custom search engine API](https://developers.google.com/custom-search/v1/overview).
- **blacklist**: A file containing URLs you want to exclude from your search.
- **nrows**: The number of enterprises you want to search for. Google provides 100 queries per day for free. In this example, 6 queries are performed for every enterprise, so 10 enterprises require 6 * 10 = 60 queries. Every query returns at most 10 search results.

This function creates a file (<_YYYYMMDD_>_searchResult.csv_) in the _data_ folder containing the search results, where _YYYYMMDD_ is the current date.
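
The query arithmetic above can be checked before spending any quota. The following is a minimal sketch, not part of the `urlfinding` package: the helper names `queries_needed` and `max_free_rows` are hypothetical, and the constants simply restate the numbers from the text (6 queries per enterprise, 100 free queries per day), which may differ for your setup.

```python
# Assumptions taken from the text above: 6 queries per enterprise and a free
# quota of 100 queries per day. Adjust both if your configuration differs.
QUERIES_PER_ENTERPRISE = 6
FREE_QUERIES_PER_DAY = 100


def queries_needed(nrows: int, per_enterprise: int = QUERIES_PER_ENTERPRISE) -> int:
    """Total number of search queries issued for `nrows` enterprises."""
    return nrows * per_enterprise


def max_free_rows(quota: int = FREE_QUERIES_PER_DAY,
                  per_enterprise: int = QUERIES_PER_ENTERPRISE) -> int:
    """Largest `nrows` that still fits within the daily quota."""
    return quota // per_enterprise


print(queries_needed(10))  # 60
print(max_free_rows())     # 16
```

Under these assumptions, `nrows = 10` stays comfortably within the free quota, while anything above 16 enterprises per day would exceed it.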

In [None]:
base_file    = './data/NSIs.csv'
googleconfig = './config/config.yml'
blacklist    = './data/blacklist.txt'
nrows        = 10

uf.scrape.start(base_file, googleconfig, blacklist, nrows)

## Create the feature file

Next, create the features used for training your machine learning model or for making predictions with it.
- **date**: Used as a timestamp in the name of the created feature file
- **data_files**: A list of files containing the search results
- **blacklist**: See above

This function creates the feature file <_YYYYMMDD_>_features_\__agg.csv_ in the _data_ folder.
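
Because both the search result file and the feature file carry a date prefix, it can help to build their paths from a date instead of typing them by hand. The sketch below assumes the naming patterns described in this notebook; `dated_path` is a hypothetical helper, not a function of the `urlfinding` package.

```python
from datetime import date


def dated_path(day: date, suffix: str, folder: str = "./data") -> str:
    """Return '<folder>/<YYYYMMDD><suffix>' for the given day."""
    stamp = day.strftime("%Y%m%d")  # e.g. 2020-01-09 -> '20200109'
    return f"{folder}/{stamp}{suffix}"


# File written by the scraping step on 9 January 2020
search_file = dated_path(date(2020, 1, 9), "searchResult.csv")
# Feature file created with date = '20200110'
feature_file = dated_path(date(2020, 1, 10), "features_agg.csv")

print(search_file)   # ./data/20200109searchResult.csv
print(feature_file)  # ./data/20200110features_agg.csv
```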

In [None]:
date       = '20200110'
data_files = ['./data/20200109searchResult.csv'] # change this to the search result file(s) you created
blacklist  = './data/blacklist.txt'

uf.process.start(date, data_files, blacklist)

## Predict urls

Finally, predict URLs using the provided ML model or one you trained yourself.
- **model_file**: A pickle file containing the ML model (created with our package)
- **feature_file**: The file containing the features
- **base_file**: The same file that was used for scraping (see above)

This function creates the file <_base_\_file>_\_url.csv_ in the _data_ folder containing the predicted URLs. This file contains all data from the base file with 3 columns added:
- **host**: The predicted URL
- **eqPred**: An indicator showing whether the model considers the predicted URL to be the right one
- **pTrue**: The confidence of the prediction, a number between 0 and 1, where 0 means almost certainly not the right URL and 1 means almost certainly the right URL. **eqPred** is derived from **pTrue**: eqPred is True if pTrue > 0.5 and False otherwise.
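
The relation between **pTrue** and **eqPred** can be illustrated with a small sketch. The rows below are made-up sample data in the shape described above, not real output of the package, and `eq_pred` is a hypothetical helper that only restates the 0.5 threshold rule.

```python
import csv
import io

# Hypothetical sample rows; real output files contain all base file columns too.
sample = io.StringIO(
    "id,host,pTrue\n"
    "1,example.org,0.91\n"
    "2,example.com,0.50\n"
    "3,example.net,0.12\n"
)


def eq_pred(p_true: float, threshold: float = 0.5) -> bool:
    """True only if the confidence is strictly above the threshold."""
    return p_true > threshold


rows = list(csv.DictReader(sample))
# Keep only the hosts the rule accepts; 0.50 is not strictly above 0.5.
accepted = [row["host"] for row in rows if eq_pred(float(row["pTrue"]))]

print(accepted)  # ['example.org']
```

Raising the `threshold` argument trades coverage for precision: fewer URLs are accepted, but those that remain carry a higher confidence score.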

In [None]:
model_file   = './data/model.pkl'
feature_file = './data/20200110features_agg.csv'
base_file    = './data/NSIs.csv'

uf.predict.start(feature_file, model_file, base_file)

## Train a model

### TODO
