> **Note:** In most sessions you will be solving exercises posed in a Jupyter notebook that looks like this one. Because you are cloning a GitHub repository that only we can push to, you should **NEVER EDIT** any of the files you pull from GitHub. Instead, either make a new notebook and write your solutions in there, or **make a copy of this notebook and save it somewhere else** on your computer, not inside the `sds` folder that you cloned, so you can write your answers in there. If you edit the notebook you pulled from GitHub, those edits (possibly your solutions to the exercises) may be overwritten and lost the next time you pull from GitHub. This is important, so don't hesitate to ask if it is unclear.

## Use Connector class for accessing the internet
Even if logging is not important for the below exercises, get in the habit of using this class for connecting to the internet, to practice logging your activity. This will be expected in the final exam.

In [13]:
import requests,os,time
def ratelimit():
    "A function that handles the rate of your calls."
    time.sleep(1) # sleep one second.

class Connector():
  def __init__(self,logfile,overwrite_log=False,connector_type='requests',session=False,path2selenium='',n_tries = 5,timeout=30):
    """This Class implements a method for reliable connection to the internet and monitoring. 
    It handles simple errors due to connection problems, and logs a range of information for basic quality assessments
    
    Keyword arguments:
    logfile -- path to the logfile
    overwrite_log -- bool, defining if logfile should be cleared (rarely the case). 
    connector_type -- use the 'requests' module or 'selenium'. The two behave differently, since the selenium webdriver does not return a similar response object from its get method, and monitoring its behavior cannot be automated in the same way.
    session -- requests.session object. For defining custom headers and proxies.
    path2selenium -- str, sets the path to the geckodriver needed when using selenium.
    n_tries -- int, defines the number of retries the *get* method will try to avoid random connection errors.
    timeout -- int, seconds the get request will wait for the server to respond, again to avoid connection errors.
    """
    
    ## Initialization function defining parameters. 
    self.n_tries = n_tries # To ride out trivial errors (e.g. connection errors), this defines how many times a call is attempted.
    self.timeout = timeout # Maximum time (in seconds) to wait for a server to respond.
    ## NOTE: retries and timeout are not applied if you use selenium.
    if connector_type=='selenium':
      assert path2selenium!='', "You need to specify the path to your geckodriver if you want to use Selenium"
      from selenium import webdriver 
      ## HINT: download the latest geckodriver here: https://github.com/mozilla/geckodriver/releases

      assert os.path.isfile(path2selenium),'You need to insert a valid path2selenium the path to your geckodriver. You can download the latest geckodriver here: https://github.com/mozilla/geckodriver/releases'
      self.browser = webdriver.Firefox(executable_path=path2selenium) # start the browser with a path to the geckodriver.

    self.connector_type = connector_type # set the connector_type
    
    if session: # set the custom session
      self.session = session
    else:
      self.session = requests.session()
    self.logfilename = logfile # set the logfile path
    ## define header for the logfile
    header = ['id','project','connector_type','t', 'delta_t', 'url', 'redirect_url','response_size', 'response_code','success','error']
    if os.path.isfile(logfile):        
      if overwrite_log==True:
        self.log = open(logfile,'w')
        self.log.write(';'.join(header))
      else:
        self.log = open(logfile,'a')
    else:
      self.log = open(logfile,'w')
      self.log.write(';'.join(header))
    ## load log 
    with open(logfile,'r') as f: # open file
        
      l = f.read().split('\n') # read and split file by newlines.
      ## set id
      if len(l)<=1:
        self.id = 0
      else:
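        self.id = int(l[-1].split(';')[0])+1 # the id is the first ';'-separated field of the last row, not just its first character.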
            
  def get(self,url,project_name):
    """Method for connecting reliably to the internet, with multiple tries and simple error handling, as well as a default logging function.
    Input url and the project name for the log (i.e. is it part of mapping the domain, or is it the part of the final stage in the data collection).
    
    Keyword arguments:
    url -- str, url
    project_name -- str, Name used for analyzing the log. Use case could be the 'Mapping of domain','Meta_data_collection','main data collection'. 
    """
     
    project_name = project_name.replace(';','-') # make sure the default csv separator is not in the project_name.
    if self.connector_type=='requests': # Determine connector method.
      for _ in range(self.n_tries): # for loop defining number of retries with the requests method.
        ratelimit()
        t = time.time()
        try: # error handling 
          response = self.session.get(url,timeout = self.timeout) # make get call

          err = '' # define python error variable as empty assuming success.
          success = True # define success variable
          redirect_url = response.url # log current url, after potential redirects 
          dt = time.time() - t # elapsed time (positive) waiting for the server and downloading content.
          size = len(response.text) # define variable for size of html content of the response.
          response_code = response.status_code # log status code.
          ## log...
          call_id = self.id # get current unique identifier for the call
          self.id+=1 # increment call id
          #['id','project_name','connector_type','t', 'delta_t', 'url', 'redirect_url','response_size', 'response_code','success','error']
          row = [call_id,project_name,self.connector_type,t,dt,url,redirect_url,size,response_code,success,err] # define row to be written in the log.
          self.log.write('\n'+';'.join(map(str,row))) # write log.
          return response,call_id # return response and unique identifier.

        except Exception as e: # define error condition
          err = str(e) # python error
          response_code = '' # blank response code 
          success = False # call success = False
          size = 0 # content is empty.
          redirect_url = '' # redirect url empty 
          dt = time.time() - t # elapsed time (positive) before the call failed.

          ## log...
          call_id = self.id # define unique identifier
          self.id+=1 # increment call_id

          row = [call_id,project_name,self.connector_type,t,dt,url,redirect_url,size,response_code,success,err] # define row
          self.log.write('\n'+';'.join(map(str,row))) # write row to log.
    else:
      t = time.time()
      ratelimit()
      self.browser.get(url) # use selenium get method
      ## log
      call_id = self.id # define unique identifier for the call. 
      self.id+=1 # increment the call_id
      err = '' # blank error message
      success = '' # success blank
      redirect_url = self.browser.current_url # redirect url.
      dt = time.time() - t # elapsed time (positive) for the get method ... NOTE: not necessarily the complete load time.
      size = len(self.browser.page_source) # get size of content ... NOTE: not necessarily correct, since selenium works in the background, and could still be loading.
      response_code = '' # empty response code.
      row = [call_id,project_name,self.connector_type,t,dt,url,redirect_url,size,response_code,success,err] # define row 
      self.log.write('\n'+';'.join(map(str,row))) # write row to log file.
    # Using selenium it will not return a response object, instead you should call the browser object of the connector.
    ## connector.browser.page_source will give you the html.
      return call_id
logfile = 'filename'## name your log file.
connector = Connector(logfile)
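
Since the log is a `;`-separated text file with the header defined in the class above, it can be inspected after a scraping run. The following is only a minimal sketch of such a quality check: the helper name `summarize_log` and the demo rows are made up for illustration, and it assumes that no logged field (for instance an error message) itself contains a `;` or a newline.

```python
import csv
import os
import tempfile

# Column names copied from the header written by the Connector class.
HEADER = ['id', 'project', 'connector_type', 't', 'delta_t', 'url',
          'redirect_url', 'response_size', 'response_code', 'success', 'error']


def summarize_log(path):
    """Count logged calls and how many of them were marked as successful.

    Hypothetical helper, not part of the Connector class.
    """
    with open(path, newline='') as f:
        rows = list(csv.DictReader(f, delimiter=';'))
    n_ok = sum(row['success'] == 'True' for row in rows)
    # Selenium rows log an empty 'success' field, so they count as failures here.
    return {'calls': len(rows), 'successes': n_ok, 'failures': len(rows) - n_ok}


# Demo on a small, made-up log written in the same format as the Connector uses.
demo_rows = [
    [0, 'demo', 'requests', 1.0, 0.5, 'https://example.com', 'https://example.com', 120, 200, True, ''],
    [1, 'demo', 'requests', 2.0, 30.0, 'https://example.com/x', '', 0, '', False, 'timeout'],
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo_log.csv')
    with open(demo_path, 'w') as f:
        f.write(';'.join(HEADER))
        for row in demo_rows:
            f.write('\n' + ';'.join(map(str, row)))
    summary = summarize_log(demo_path)

print(summary)
```

The same file can also be loaded with `pd.read_csv(logfile, sep=';')` if you prefer to analyze the log as a DataFrame.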

In [69]:
import pandas as pd
import time

# Exercise Set 8: Introduction to Web Scraping

*Afternoon, August 16, 2018*

In this Exercise Set we shall practice our web scraping skills utilizing only basic Python. We shall cover the differences between static and dynamic pages.

## Exercise Section 8.1: Scraping Jobnet.dk

In this exercise you get to practice locating the request that the JavaScript sends to get the job data that it builds the job listings from. You should use the **>Network Monitor<** tool in your browser.

Furthermore you practice spotting how the pagination is done, without clicking on the next page button, but instead changing a small parameter in the URL.

> **Ex. 8.1.1:** Hit the joblisting webpage here: https://job.jobnet.dk/CV and locate the request that gets the joblisting data using the **>Network Monitor<**. *(Hint: Filter by XHR files)*  

> **Ex. 8.1.2.:** Use the `requests` module to collect the first 20 results and unpack the relevant `json` data into a `pandas` DataFrame.

> **Ex. 8.1.3.:** Store the 'TotalResultCount' value for later use.

In [95]:
import json
import requests
import pandas as pd

URL = 'https://job.jobnet.dk/CV/FindWork/Search'

# Request the first page once to learn how many results exist in total.
TotalResultCount = json.loads(requests.get(URL, params={'Offset': 0}).text)['TotalResultCount']

data_list = []

# Each call returns 20 results, so step the offset by 20.
for i in range(0, TotalResultCount, 20):
    response = requests.get(URL, params={'Offset': i})
    data_list.append(pd.DataFrame(json.loads(response.text)['JobPositionPostings']))

In [45]:
import json
json_data = json.loads(response.text)

[job['JobHeadline'] for job in json_data['JobPositionPostings']]

['Manager, Procure to Pay (BPI)',
 'Business Development Manager',
 'Studentermedhjælpere til AV/IT Support på RUC',
 'Vice President for Portfolio, Project & Resource Management',
 'Frontend App Developer',
 'Head of Site Investigations Procurement (f/m/d)',
 "Global Category Manager, Direct Materials and CMO's",
 'Økonomimedarbejder til Borgercenter Voksne - Staben i Socialforvaltningen',
 'Rybners Handelsgymnasium (HHX) søger underviser til dansk og psykolog (vikariat)',
 'iOS Developer',
 'Android Developer',
 'AWS Lead with DevOps',
 'Lærer til de praktisk/musiske fag på Katrinedals Skole (vikariat)',
 'Energiske og kreative sosu-hjælpere til dagvagter på Egebo',
 'Uddannelseschef til KEA Digital',
 'Pædagogmedhjælper til Brolopperne',
 'Senior Regional Talent Manager',
 'Business Process Manager Commercial',
 'Medarbejder til rengøring og kantineservice søges',
 'Pædagog R3']

In [97]:
TotalResultCount = json_data['TotalResultCount']
print(TotalResultCount)
pd.DataFrame(json_data['JobPositionPostings']).head()

15001


| | AnonymousEmployer | AssignmentStartDate | AutomatchType | Country | DetailsUrl | EmploymentType | HasLocationValues | HiringOrgCVR | HiringOrgName | ID | ... | UseWorkPlaceAddressForJoblog | Weight | WorkHours | WorkPlaceAbroad | WorkPlaceAddress | WorkPlaceCity | WorkPlaceNotStatic | WorkPlaceOtherAddress | WorkPlacePostalCode | WorkplaceID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 0001-01-01T00:00:00 | 0 | Danmark | https://job.jobnet.dk/CV/FindWork/Details/5030832 | | True | 47458714 | LEGO SYSTEM A/S | 5030832 | ... | True | 1.0 | Fuldtid | False | Åstvej 1 | Billund | False | False | 7190 | 11393 |
| 1 | False | 0001-01-01T00:00:00 | 0 | Danmark | https://job.jobnet.dk/CV/FindWork/Details/5030827 | | True | 30799968 | NORDIC BIOSCIENCE A/S | 5030827 | ... | True | 1.0 | Fuldtid | False | Herlev Hovedgade 205 | Herlev | False | False | 2730 | 91803 |
| 2 | False | 0001-01-01T00:00:00 | 0 | Danmark | https://job.jobnet.dk/CV/FindWork/Details/5030824 | | True | 29057559 | Roskilde Universitet | 5030824 | ... | True | 1.0 | Deltid | False | Universitetsvej 1 | Roskilde | False | False | 4000 | 66024 |
| 3 | False | 0001-01-01T00:00:00 | 0 | Danmark | https://job.jobnet.dk/CV/FindWork/Details/5030822 | | True | 12516479 | CHR HANSEN A/S | 5030822 | ... | True | 1.0 | Fuldtid | False | Bøge Alle 10 | Hørsholm | False | False | 2970 | 8958 |
| 4 | False | 0001-01-01T00:00:00 | 0 | Danmark | https://job.jobnet.dk/CV/FindWork/Details/5030819 | | True | 37073512 | Be My Eyes IVS | 5030819 | ... | True | 1.0 | Fuldtid | False | Sletvej 2F | Tranbjerg J | False | False | 8310 | 0 |


> **Ex. 8.1.4:** This exercise is about paging the results. We need to understand the website's pagination scheme. 

> Now scroll down the webpage and press the next page button. See how the parameters of the URL change as you turn the pages.

> **Ex. 8.1.5:** Design a `for` loop using the `range` function that changes this paging parameter in the URL. Use the `TotalResultCount` value from before to define the limits of the range function. Store these URLs in a container. 

> **Extra:** Change the `SortValue` parameter from `BestMatch` to `CreationDate`, to make the sorting amenable to updating results daily.

*(HINT: See that the parameter is an offset and that this relates to the number of results per call made.)*
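
The offset idea in the hint can be sketched as follows. This is only an illustration under assumptions: the function name `build_page_urls` is made up, the page size of 20 is taken from the number of results each call returned earlier in this notebook, and the `Offset` and `SortValue` parameter names should be verified against the request you find in the Network Monitor.

```python
from urllib.parse import urlencode

BASE_URL = 'https://job.jobnet.dk/CV/FindWork/Search'


def build_page_urls(total_results, page_size=20, sort_value='CreationDate'):
    """Build one URL per result page by stepping the offset in page_size steps.

    Hypothetical helper; parameter names are assumed from the exercise text.
    """
    urls = []
    for offset in range(0, total_results, page_size):
        query = urlencode({'Offset': offset, 'SortValue': sort_value})
        urls.append(BASE_URL + '?' + query)
    return urls


# With 45 results and 20 results per call, the offsets are 0, 20 and 40.
urls = build_page_urls(45)
for url in urls:
    print(url)
```

Building the URLs up front, rather than inside the download loop, makes it easy to sample from them or to resume a partially finished collection.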

In [None]:
# [Answer to Ex. 8.1.4-5 here]

> **Ex.8.1.6:** Pick 20 random links using the `random.sample()` function and collect them using the `Connector` class. Also use the `time.sleep()` function to limit the rate of your calls. Make sure to save the links already collected in a `set()` container to avoid having to reload links already collected. ***extra***: monitor the time left to completing the loop by using `tqdm.tqdm()` function.

> **Ex.8.1.7:** Load all the results into a DataFrame.
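
One possible shape for the sampling and bookkeeping part is sketched below. It deliberately takes the download step as an argument (`fetch`), so the example runs without touching the network; in the exercise this would be a small wrapper around `connector.get`. The names `collect_sample`, `fetch`, and `seen` are illustrative assumptions, not part of the course code.

```python
import random


def collect_sample(links, fetch, k=20, seen=None, seed=None):
    """Fetch up to k randomly chosen links that have not been collected before.

    `seen` is a set of links already collected; it is updated and returned so
    that a later call can skip everything fetched so far.
    """
    seen = set() if seen is None else seen
    rng = random.Random(seed)  # seeded generator, only to make the demo repeatable
    remaining = [link for link in links if link not in seen]
    sample = rng.sample(remaining, min(k, len(remaining)))
    results = {}
    for link in sample:
        results[link] = fetch(link)  # e.g. lambda u: connector.get(u, 'jobnet')[0]
        seen.add(link)
    return results, seen


# Demo with made-up links and a stand-in fetch function.
demo_links = ['https://example.com/page/%d' % i for i in range(50)]
first, seen = collect_sample(demo_links, fetch=len, k=20, seed=1)
second, seen = collect_sample(demo_links, fetch=len, k=20, seen=seen, seed=1)
print(len(first), len(second), len(seen))
```

Because the second call filters on `seen` before sampling, it never repeats a link from the first call.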

In [17]:
# [Answer to Ex. 8.1.6-7 here]

## Exercise Section 8.2: Scraping Trustpilot.com
Now for a slightly more elaborate, yet still simple scraping problem. Here we want to scrape trustpilot for user reviews. This data is very nice since it provides free labeled data (rating) to train a machine learning model to understand positive and negative sentiment. 

Here you will practice crawling a website collecting the links to each company review page, and finally locate another behind the scenes JavaScript request that gets the review data in a neat json format.

> **Ex. 8.2.1:** Visit the https://www.trustpilot.com/ website and locate the categories page.
From this page you find links to company listings.

> **Ex. 8.2.2:**
Get the category page using the `requests` module and extract each link to a specific category page from the HTML. This can be done using the basic python `.split()` string method. Make sure only links within the ***/categories/*** section are kept, checking each string using the ```if 'pattern' in string``` condition. 

*(Hint: The links are relative. You need to add the domain name)*
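
As a rough illustration of the `.split()` approach described in the exercise, the sketch below runs on a small made-up HTML string rather than the live page. The function name `extract_category_links` and the sample markup are assumptions; the real page may structure its links differently.

```python
def extract_category_links(html, domain='https://www.trustpilot.com'):
    """Extract unique absolute links to /categories/ pages using only str.split."""
    links = []
    # Everything after each 'href="' starts with the link target.
    for chunk in html.split('href="')[1:]:
        link = chunk.split('"')[0]  # the target ends at the closing quote
        if '/categories/' in link and link.startswith('/'):
            full_link = domain + link  # the links are relative, so add the domain
            if full_link not in links:
                links.append(full_link)
    return links


# Made-up HTML snippet for demonstration purposes.
sample_html = ('<a href="/categories/pet_store">Pets</a>'
               '<a href="/about">About</a>'
               '<a href="/categories/zoo">Zoo</a>'
               '<a href="/categories/pet_store">Pets again</a>')
category_links = extract_category_links(sample_html)
print(category_links)
```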


In [55]:
import re
link_regex = re.compile(r'href="/categories/([^"]+)')
trustpilot_text = requests.get('https://www.trustpilot.com/categories').text
link_m = link_regex.findall(trustpilot_text)
link_m

['agistment_service',
 'feed_store',
 'aquarium',
 'aquarium_shop',
 'bird_shop',
 'dog_breeder',
 'dog_day_care_center',
 'dog_trainer',
 'dog_walker',
 'horseback_riding_service',
 'pet_adoption_service',
 'pet_boarding_service',
 'pet_groomer',
 'pet_sitter',
 'pet_store',
 'pet_supply_store',
 'pet_trainer',
 'tack_shop',
 'veterinarian',
 'zoo',
 'aromatherapy_supply_store',
 'barber_shop',
 'barber_supply_store',
 'beauty_product_supplier',
 'beauty_products_wholesaler',
 'beauty_salon',
 'beauty_supply_store',
 'body_piercing_shop',
 'cosmetic_products_manufacturer',
 'cosmetics_industry',
 'cosmetics_store',
 'cosmetics_wholesaler',
 'cosmetics_and_parfumes_supplier',
 'day_spa',
 'fitness_and_nutrition_service',
 'flavours_fragrances_and_aroma_supplier',
 'foot_care',
 'hair_extensions_supplier',
 'hair_product_store',
 'hair_removal_service',
 'hair_replacement_service',
 'hair_salon',
 'health_spa',
 'health_and_beauty_shop',
 'herb_shop',
 'herbalist',
 'makeup_artist',
 'm

> **Ex. 8.2.3:** Get one of the category section links. Write a function to extract the links to the company review page from the HTML.

> **Ex. 8.2.4:** Figure out how the pagination is done, by following how the url changes when pressing the **next page**-button to obtain more company listings. Write a function that builds links to paging all the company listing results of each category. This includes parsing the number of subpages of each category and changing the correct parameter in the url.

(Hint: Find the maximum number of result pages, right before the next page button, and make a loop that changes the page parameter of the URL.)


In [68]:
params = {'numberofreviews': '0', 
          'page': '1',
          'status': 'collecting', 
          'timeperiod': '0'}
url = 'https://www.trustpilot.com/categories/business_to_business_service/businesses?'
json.loads(requests.get(url, params = params).text)


{'filteredBusinessUnitList': [{'businessUnitId': '4bdc9828000064000505dc60',
   'displayName': 'Flashbay',
   'identifyingName': 'www.flashbay.com',
   'numberOfReviews': 14922.0,
   'stars': 5.0,
   'trustscore': '9.9'},
  {'businessUnitId': '50b8769100006400051f0dbb',
   'displayName': 'DoItYourselfLettering.com',
   'identifyingName': 'doityourselflettering.com',
   'numberOfReviews': 7866.0,
   'stars': 5.0,
   'trustscore': '9.9'},
  {'businessUnitId': '51b3db64000064000539d4c7',
   'displayName': 'Nevada Corporate Headquarters, Inc.',
   'identifyingName': 'nchinc.com',
   'numberOfReviews': 3335.0,
   'stars': 5.0,
   'trustscore': '9.9'},
  {'businessUnitId': '4bec568700006400050b477b',
   'displayName': 'Trusted Translations, Inc.',
   'identifyingName': 'www.trustedtranslations.com',
   'numberOfReviews': 1072.0,
   'stars': 5.0,
   'trustscore': '9.9'},
  {'businessUnitId': '51f836c200006400056d0a2b',
   'displayName': 'TheFlyerLab.com',
   'identifyingName': 'theflyerlab.co

In [98]:
params = {'numberofreviews': '0', 
              'page': '1',
              'status': 'collecting', 
              'timeperiod': '0'}

In [100]:
'&'.join(['{0}={1}'.format(k,v) for (k, v) in params.items()])

'numberofreviews=0&page=1&status=collecting&timeperiod=0'
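
Joining the keys and values by hand works for the simple parameters above, but it does not escape characters such as spaces or `&`. A small sketch of the alternative from the standard library, `urllib.parse.urlencode`, is shown here; the `risky` dictionary is a made-up example.

```python
from urllib.parse import urlencode

params = {'numberofreviews': '0',
          'page': '1',
          'status': 'collecting',
          'timeperiod': '0'}

# For plain values, the manual join and urlencode give the same query string.
manual = '&'.join('{0}={1}'.format(k, v) for k, v in params.items())
encoded = urlencode(params)
print(manual == encoded)  # True

# urlencode also escapes characters that would otherwise break the query string.
risky = {'q': 'pet store & more'}
print(urlencode(risky))  # q=pet+store+%26+more
```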

In [93]:
def get_business_urls(category):
    """
    Returns a list of urls for the businesses in the supplied category.
    """
    params = {'numberofreviews': '0', 
              'page': '1',
              'status': 'collecting', 
              'timeperiod': '0'} #parameters to be used in the request
    url = 'https://www.trustpilot.com/categories/{}/businesses'.format(category)
    business_root_url = 'https://www.trustpilot.com/review/' #root url for business links
    
    try: #Error handling
        response = requests.get(url, params = params)
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        print(err)
        return None
    
    total_page_number = json.loads(response.text)['totalPageNumber']
    
    business_urls = []
    for i in range(1, total_page_number+1):
        params['page'] = i #change page
        business_unit_list = json.loads(requests.get(url, params = params).text)['filteredBusinessUnitList']
        business_urls += [business_root_url + unit['identifyingName'] 
                          for unit in business_unit_list] #append new urls
        time.sleep(1)

    return business_urls

In [94]:
get_business_urls('blablakljk')

500 Server Error: Internal Server Error for url: https://www.trustpilot.com/categories/blablakljk/businesses?numberofreviews=0&page=1&status=collecting&timeperiod=0


> **Ex. 8.2.5:** Loop through all categories and build the paging links using the above defined function.

> **Ex. 8.2.6:** Randomly pick one of the category listing links you have generated, and get the links to the companies listed using the other function defined. 

> **Ex. 8.2.7:** Visit one of these links and inspect the **>Network Monitor<** to locate the request that loads the review data. Use the requests module to retrieve this link and unpack the json results to a pandas DataFrame.


In [26]:
#[Answer to Ex.8.2.5-7]

Congratulations on coming this far. You are now almost ready to deploy a scraper collecting all reviews on Trustpilot; what remains is figuring out how to page through the reviews and how to find the company ID in the HTML. 
If you want to see just how valuable such data can be, visit the following blog post: https://blog.openai.com/unsupervised-sentiment-neuron/

In [102]:

import requests
import os
import time


def ratelimit(duration=0.5):
    "A function that handles the rate of your calls."
    time.sleep(duration)  # sleep for `duration` seconds (default: half a second).



class LogFile:

    def __init__(self, file, mode):
        self.file = file 
        self.mode = mode
        self.f = None

    def __str__(self):
        return f'Logfile {self.file}'

    def __repr__(self):
        return f'Logfile {self.file} in mode {self.mode}'

    def write(self, content):
        with open(self.file, self.mode) as f:
            f.write(content)

    def flush(self):
        # Deprecated
        return 

    def read(self):
        with open(self.file, 'r') as f:
            content = f.read()
        return content



class Connector():
    def __init__(self, 
                 logfile, 
                 overwrite_log=False, 
                 connector_type='requests', 
                 session=False, 
                 path2selenium='', 
                 n_tries=5, 
                 timeout=30
                 ):
        """This Class implements a method for reliable connection to the internet 
            and monitoring. It handles simple errors due to connection problems, and logs 
            a range of information for basic quality assessments

        Keyword arguments:
        logfile -- path to the logfile
        overwrite_log -- bool, defining if logfile should be cleared (rarely the case). 
        connector_type -- use the 'requests' module or 'selenium'. 
                          The two behave differently, since the selenium webdriver does not 
                          return a similar response object from its get method, and 
                          monitoring its behavior cannot be automated in the same way.
        session -- requests.session object. For defining custom headers and proxies.
        path2selenium -- str, sets the path to the geckodriver needed when using selenium.
        n_tries -- int, defines the number of retries the *get* method will try to avoid 
                   random connection errors.
        timeout -- int, seconds the get request will wait for the server to respond, 
                   again to avoid connection errors.
        """

        # Initialization function defining parameters.
        # To ride out trivial errors (e.g. connection errors), this defines how many times a call is attempted.
        self.n_tries = n_tries
        # Maximum time (in seconds) to wait for a server to respond.
        self.timeout = timeout
        # NOTE: retries and timeout are not applied if you use selenium.
        if connector_type == 'selenium':
            assert path2selenium != '', "You need to specify the path to your geckodriver if you want to use Selenium"
            from selenium import webdriver
            # HINT: download the latest geckodriver here: https://github.com/mozilla/geckodriver/releases

            assert os.path.isfile(
                path2selenium), 'You need to insert a valid path2selenium the path to your geckodriver. You can download the latest geckodriver here: https://github.com/mozilla/geckodriver/releases'
            # start the browser with a path to the geckodriver.
            self.browser = webdriver.Firefox(executable_path=path2selenium)

        self.connector_type = connector_type  # set the connector_type

        if session:  # set the custom session
            self.session = session
        else:
            self.session = requests.session()
        self.logfilename = logfile  # set the logfile path
        # define header for the logfile
        header = ['id', 'project', 'connector_type', 't', 'delta_t', 'url',
                  'redirect_url', 'response_size', 'response_code', 'success', 'error']

        if os.path.isfile(logfile):
            if overwrite_log == True:
                self.log = LogFile(logfile, 'w')
                self.log.write(';'.join(header))
                self.log.mode = 'a'
            else:
                self.log = LogFile(logfile, 'a')
        else:
            self.log = LogFile(logfile, 'w')
            self.log.write(';'.join(header))

        # load log
        l = self.log.read().split('\n')
        if len(l) <= 1:
            self.id = 0
        else:
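            # the id is the first ';'-separated field of the last row, not just its first character.
            self.id = int(l[-1].split(';')[0])+1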

        # with open(logfile, 'r') as f:  # open file

        #     l = f.read().split('\n')  # read and split file by newlines.
        #     # set id
        #     if len(l) <= 1:
        #         self.id = 0
        #     else:
        #         self.id = int(l[-1][0])+1

    def get(self, url, project_name, params = None):
        """Method for connecting reliably to the internet, with multiple tries and simple 
            error handling, as well as a default logging function.
        Input url and the project name for the log (i.e. is it part of mapping the domain, 
        or is it the part of the final stage in the data collection).

        Keyword arguments:
        url -- str, url
        project_name -- str, Name used for analyzing the log. Use case could be the 
                            'Mapping of domain','Meta_data_collection','main data collection'. 
        params -- dict, Mapping of parameters to be used in the url.
        """

        # make sure the default csv separator is not in the project_name.
        project_name = project_name.replace(';', '-')
        if self.connector_type == 'requests':  # Determine connector method.
            # for loop defining number of retries with the requests method.
            for _ in range(self.n_tries):
                ratelimit()
                t = time.time()
                try:  # error handling
                    response = self.session.get(
                        url, params = params, timeout=self.timeout)  # make get call

                    # define python error variable as empty assuming success.
                    err = ''
                    success = True  # define success variable
                    redirect_url = response.url  # log current url, after potential redirects
                    # elapsed time (positive) waiting for the server and downloading content.
                    dt = time.time() - t
                    # define variable for size of html content of the response.
                    size = len(response.text)
                    response_code = response.status_code  # log status code.
                    # log...
                    call_id = self.id  # get current unique identifier for the call
                    self.id += 1  # increment call id
                    #['id','project_name','connector_type','t', 'delta_t', 'url', 'redirect_url','response_size', 'response_code','success','error']
                    # define row to be written in the log.
                    row = [call_id, project_name, self.connector_type, t, dt,
                           url, redirect_url, size, response_code, success, err]
                    self.log.write('\n'+';'.join(map(str, row)))  # write log.
                    self.log.flush()
                    # return response and unique identifier.
                    return response, call_id

                except Exception as e:  # define error condition
                    err = str(e)  # python error
                    response_code = ''  # blank response code
                    success = False  # call success = False
                    size = 0  # content is empty.
                    redirect_url = ''  # redirect url empty
                    dt = time.time() - t  # elapsed time (positive) before the call failed

                    # log...
                    call_id = self.id  # define unique identifier
                    self.id += 1  # increment call_id

                    row = [call_id, project_name, self.connector_type, t, dt, url,
                           redirect_url, size, response_code, success, err]  # define row
                    # write row to log.
                    self.log.write('\n'+';'.join(map(str, row)))
                    self.log.flush()
        else:
            t = time.time()
            ratelimit()
            if params:  # only append a query string when parameters were supplied
                url = url +'?{}'.format('&'.join(['{0}={1}'.format(k,v) for (k, v) in params.items()]))
            self.browser.get(url)  # use selenium get method
            # log
            call_id = self.id  # define unique identifier for the call.
            self.id += 1  # increment the call_id
            err = ''  # blank error message
            success = ''  # success blank
            redirect_url = self.browser.current_url  # redirect url.
            # get time for get method ... NOTE: not necessarily the complete load time.
            dt = time.time() - t
            # get size of content ... NOTE: not necessarily correct, since selenium works in the background, and could still be loading.
            size = len(self.browser.page_source)
            response_code = ''  # empty response code.
            row = [call_id, project_name, self.connector_type, t, dt, url,
                   redirect_url, size, response_code, success, err]  # define row
            # write row to log file.
            self.log.write('\n'+';'.join(map(str, row)))
            self.log.flush()
        # Using selenium it will not return a response object, instead you should call the browser object of the connector.
        # connector.browser.page_source will give you the html.
            return None, call_id


In [107]:
connector = Connector('logfile.txt')
params = {'numberofreviews': '0', 
          'page': '1',
          'status': 'collecting', 
          'timeperiod': '0'} #parameters to be used in the request
url = 'https://www.trustpilot.com/categories/pet_store/businesses'
response, id = connector.get(url, 'Test project', params = params)

In [110]:
response.url

'https://www.trustpilot.com/categories/pet_store/businesses?numberofreviews=0&page=1&status=collecting&timeperiod=0'

In [113]:
import urllib.parse  # import the submodule explicitly; a bare `import urllib` does not guarantee it is loaded
#from urllib.parse import urlparse
#scheme://netloc/path;parameters?query#fragment
result = urllib.parse.urlparse('https://www.trustpilot.com/categories/pet_store/businesses')
#'{0}://{1}/{2};{3}?{4}#{5}' % result
result

ParseResult(scheme='https', netloc='www.trustpilot.com', path='/categories/pet_store/businesses', params='', query='', fragment='')
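
The parsed result can also be used to change a single query parameter, for example the `page` value, without assembling the URL string by hand. The following is a minimal sketch using only the standard library; the helper name `set_query_param` is made up for illustration.

```python
from urllib.parse import urlparse, parse_qs, urlencode


def set_query_param(url, key, value):
    """Return a copy of url with one query parameter set to a new value."""
    parts = urlparse(url)
    query = parse_qs(parts.query)  # maps each key to a list of values
    query[key] = [str(value)]
    # doseq=True writes each list element as its own key=value pair.
    return parts._replace(query=urlencode(query, doseq=True)).geturl()


url = ('https://www.trustpilot.com/categories/pet_store/businesses'
       '?numberofreviews=0&page=1&status=collecting&timeperiod=0')
next_url = set_query_param(url, 'page', 2)
print(next_url)
```

Note that `parse_qs` drops parameters with blank values by default, so this sketch is only suitable for URLs where every parameter has a value.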