In [1]:
import pandas as pd
import requests
from collections import Counter
import time
import random
import pickle

### Clean amsterdam.nl reference URLs

In [3]:
%cd '/Users/natalipeeva/Desktop'

/Users/natalipeeva/Desktop


In [4]:
questions = pd.read_csv('URL Analysis/questions.csv')

In [5]:
# Extract amsterdam.nl only (regex=False so the dots are matched literally)
mask = questions['URLs'].astype(str).str.contains('www.amsterdam.nl', regex=False)
amsterdam_questions = questions[mask]

In [6]:
amsterdam_questions.head()

Unnamed: 0,Year,Month,Question,Answer,Document,URLs
296,2020,6,\n7.\nKan het college de reeds bestaande zwemp...,\nVoor het vinden van de officiële zwemplekken...,https://amsterdam.raadsinformatie.nl/document/...,https://www.zwemwater.nl/.Hier\nhttps://maps.a...
615,2021,8,\n \n3. Huisartsen geven aan meer informatie n...,"\nDe uitvoerder van de regeling, het CAK, lij...",https://amsterdam.raadsinformatie.nl/document/...,https://www.amsterdam.nl/zorg-ondersteuning/on...
620,2021,8,\n \n8. Weten ongedocumenteerden de weg naar m...,"\nDe Kruispost wordt goed bezocht, maar het c...",https://amsterdam.raadsinformatie.nl/document/...,https://www.amsterdam.nl/zorg-ondersteuning/on...
856,2015,11,\n \n6. Het spuug zal door een speciaal daarto...,\nEr zijn 50 BOA’s van GVB opgeleid om speeks...,https://amsterdam.raadsinformatie.nl/document/...,https://www.amsterdam.nl/wonen-leefomgeving/ve...
1084,2013,7,\n \n \n \n2. Wat gaat de wethouder doen om d...,\nEr is op dit moment nog geen sprake van ont...,https://amsterdam.raadsinformatie.nl/document/...,http://www.amsterdam.nl/publish/pages/419505/a...


### Clean URLs

In [27]:
# Split each cell on newlines to get one URL per entry
urls = [url for cell in amsterdam_questions['URLs'].dropna() for url in cell.split('\n')]
# Prefix URLs that lack a scheme with http://
urls = [url if url.startswith('http') else f'http://{url}' for url in urls]

In [31]:
# TO DO:

# input http://www.amsterdam.nl/internationaal  -> this is the old version of the link so it needs to be changed 

# update link to its current version by using a request: 

# result should be that: https://www.amsterdam.nl/bestuur-organisatie/organisatie/bestuur-organisatie/bureau-internationale-betrekkingen/?vkurl=internationaal

# clean link to: www.amsterdam.nl/bestuur-organisatie/organisatie/bestuur-organisatie/bureau-internationale-betrekkingen/?vkurl=internationaal

# add a function that adds back https:// in the front if necessary
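
The last two TO DO steps (cleaning the link and adding the scheme back) could be sketched roughly as follows. This is only an illustration: `strip_scheme` and `add_scheme` are hypothetical helper names, and defaulting to `https` is an assumption.

```python
def strip_scheme(url):
    """Drop a leading http:// or https:// (hypothetical helper)."""
    for prefix in ('https://', 'http://'):
        if url.startswith(prefix):
            return url[len(prefix):]
    return url


def add_scheme(url, scheme='https'):
    """Add a scheme back if the URL has none (https is an assumed default)."""
    if url.startswith(('http://', 'https://')):
        return url
    return f'{scheme}://{url}'


current = ('https://www.amsterdam.nl/bestuur-organisatie/organisatie/'
           'bestuur-organisatie/bureau-internationale-betrekkingen/'
           '?vkurl=internationaal')
bare = strip_scheme(current)       # scheme removed, rest unchanged
restored = add_scheme(bare)        # scheme added back
print(bare)
print(restored == current)
```

Keeping the two helpers separate means the scheme-free form can be used for matching, while the full form is still available for requests.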

### Update the URL path with the current path

In [36]:
# Example
r = requests.get('http://www.amsterdam.nl/internationaal')
print('Statuscode:', r.status_code) # status code is 200, but the final URL differs from the requested one

The URL in the reference answer is 'http://www.amsterdam.nl/internationaal', but the current URL of the same page is 'https://www.amsterdam.nl/bestuur-organisatie/organisatie/bestuur-organisatie/bureau-internationale-betrekkingen/?vkurl=internationaal'.

**The following implementation deals with such cases and ensures that the URL path of the page is replaced with its most recent version.**

*Date 04.05.2023*

In [79]:
### Example ###

url = 'http://www.amsterdam.nl/internationaal'

response = requests.get(url, allow_redirects=True) # allow_redirects=True to follow redirects

final_url = response.url # get URL

print(final_url)


https://www.amsterdam.nl/bestuur-organisatie/organisatie/bestuur-organisatie/bureau-internationale-betrekkingen/?vkurl=internationaal


*The example in the previous cell returns the current URL of the web page. Next, we create a function that will be applied to every URL in our collection of reference URLs.*

In [50]:
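def update_url(url):
    """
    Returns the URL that the given URL currently resolves to.
    """
    # allow_redirects=True to follow redirects; timeout so a dead host cannot hang the loop
    response = requests.get(url, allow_redirects=True, timeout=10)

    # URL of the final response after all redirects
    final_url = response.url

    return final_url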

### Check for any invalid URLs

What are the most common characters, besides letters, that a URL ends with?

The objective is to remove the '/' character in order to normalize the URLs and also to see if there are any potentially problematic characters. 

In [80]:
# collect the last character of every URL that does not end in a letter
invalid_chars = [url[-1] for url in urls if not url[-1].isalpha()]


In [81]:
Counter(invalid_chars).most_common()

[('/', 47),
 ('.', 32),
 (')', 7),
 ('=', 2),
 ('9', 2),
 ('2', 1),
 ('8', 1),
 ('4', 1),
 ('1', 1)]

Some of these trailing characters may be breaking the URLs. First, we try to get the most recent version of each URL, and then we check again how many of these characters are left. The assumption is that there will be fewer, since following redirects might correct the URL if it contains an error.

### Apply URL update to all URLs

In [61]:
updated_urls = []
for url in urls:
    try:
        updated_urls.append(update_url(url))
        time.sleep(random.uniform(2, 5)) # make sure not to overload the server
    except requests.RequestException: # only catch request failures, not e.g. KeyboardInterrupt
        updated_urls.append(('error',url)) # keep track of errors to check later

*Checking the 'error' URLs*

In [69]:
for url in updated_urls:
    if isinstance(url, tuple):
        print(url)

('error', 'http://www.Amsterdam.nl)')
('error', 'http://www.voedselpoortamsterdam.nl')
('error', 'http://www.voedselpoortamsterdam.nl')
('error', 'http://www.voedselpoortamsterdam.nl')
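
The same failing URL is listed three times because it occurs three times in `urls`. One possible way to avoid repeated requests, sketched here under assumptions, is to resolve each distinct URL once and reuse the result. `resolve_unique` is a hypothetical helper, and `fake_resolver` stands in for `update_url` so that the example runs without network access.

```python
def resolve_unique(urls, resolver):
    """Call resolver once per distinct URL; return results in input order."""
    cache = {}
    for url in urls:
        if url not in cache:
            try:
                cache[url] = resolver(url)
            except Exception:
                # broad on purpose in this sketch: the resolver is arbitrary
                cache[url] = ('error', url)
    return [cache[url] for url in urls]


calls = []  # records which URLs were actually requested


def fake_resolver(url):
    """Stand-in for update_url (assumption): fails for one host, upgrades the rest."""
    calls.append(url)
    if 'voedselpoort' in url:
        raise ConnectionError(url)
    return url.replace('http://', 'https://')


sample = [
    'http://www.voedselpoortamsterdam.nl',
    'http://www.amsterdam.nl/privacy/loket/',
    'http://www.voedselpoortamsterdam.nl',
]
resolved = resolve_unique(sample, fake_resolver)
print(resolved)
print(len(calls))  # number of resolver calls actually made
```

The output list keeps one entry per input URL, so it stays aligned with `urls` in the same way as `updated_urls`.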


*Check the 'faulty' chars again*

In [70]:
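# Error entries are ('error', url) tuples, so url[-1] is the whole URL string for them;
# that is why full URLs show up in the counts below
invalid_chars = [url[-1] for url in updated_urls if not url[-1].isalpha()]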


In [86]:
valid_chars = [url[-1] for url in updated_urls if url[-1].isalpha()]

In [88]:
Counter(valid_chars).most_common()

[('D', 5),
 ('e', 5),
 ('f', 4),
 ('s', 4),
 ('t', 4),
 ('n', 4),
 ('w', 3),
 ('l', 2),
 ('p', 2),
 ('r', 1),
 ('d', 1),
 ('k', 1),
 ('g', 1),
 ('i', 1),
 ('c', 1)]

In [72]:
Counter(invalid_chars).most_common()

[('/', 68),
 ('.', 20),
 (')', 6),
 ('http://www.voedselpoortamsterdam.nl', 3),
 ('=', 2),
 ('9', 2),
 ('http://www.Amsterdam.nl)', 1),
 ('2', 1),
 ('8', 1),
 ('#', 1),
 ('4', 1),
 ('1', 1),
 ('5', 1)]

*Storing the URLs which might have an error*

In [74]:
urls_to_check = []
for url in updated_urls:
    if not url[-1].isalpha() and url[-1] != '/':
        urls_to_check.append(url)

In [78]:
print(urls_to_check[:3]) # 3 examples

['https://www.amsterdam.nl/vragenondernemers)', 'https://www.amsterdam.nl/vragenondernemers.', ('error', 'http://www.Amsterdam.nl)')]


In [85]:
with open('urls_to_check.pickle', 'wb') as f:
    pickle.dump(urls_to_check, f) # save the list

### Findings

- Remove the trailing '.' in all cases
- A trailing ')' might be part of the URL. Example:
-> https://www.amsterdam.nl/veelgevraagd/?productid=%7B87FAD1C9-60E9-4CEA-B9AE-6D5594A0E841%7D#case_%7B63E55F58-F93C-4A68-BEAA-896C8F8FBBB1%7D)
-> the result changes with and without the ')'
- A trailing '=' looks odd, but the page opens
- Remove the trailing '/', since it doesn't affect the web page
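
The findings above could be turned into a small helper along these lines. This is a sketch, not the final cleaning step: `clean_trailing` is a hypothetical name, and flagging ')' for manual review instead of removing it is an assumption based on the second finding.

```python
def clean_trailing(url):
    """Return (cleaned_url, needs_review) following the findings above."""
    # '.' and '/' at the end are dropped; '=' is left alone because the page opens
    cleaned = url.rstrip('./')
    # a trailing ')' may or may not belong to the URL, so only flag it
    needs_review = cleaned.endswith(')')
    return cleaned, needs_review


examples = [
    'https://www.amsterdam.nl/vragenondernemers.',
    'https://www.amsterdam.nl/vragenondernemers)',
    'https://maps.amsterdam.nl/zwemwater/',
    'https://www.amsterdam.nl/veelgevraagd/?productid=',
]
for example in examples:
    print(clean_trailing(example))
```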



In [None]:
# Updates the URL 
# Removes the last char if necessary 
# Normalizes it

Class Reference_URL 

- get the most recent version
- check for invalid characters
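
A first outline of the planned class might look like the sketch below. Only the class name and its two responsibilities come from the notes above; the method names, the `resolver` parameter, and the identity default are assumptions made for illustration.

```python
class Reference_URL:
    """Sketch of the planned class; method names are hypothetical."""

    def __init__(self, url, resolver=None):
        self.original = url
        # resolver: any callable mapping a URL to its current version,
        # for example the notebook's update_url (identity default is an assumption)
        self.resolver = resolver or (lambda u: u)

    def most_recent(self):
        """Get the most recent version, or None if the resolver fails."""
        try:
            return self.resolver(self.original)
        except Exception:
            return None

    def suspicious_ending(self):
        """Return the last character if it is not a letter or '/', else None."""
        last = self.original[-1:]
        if last and not last.isalpha() and last != '/':
            return last
        return None


# Offline usage example with a stand-in resolver that strips a trailing '.'
ref = Reference_URL('https://www.amsterdam.nl/vragenondernemers.',
                    resolver=lambda u: u.rstrip('.'))
print(ref.suspicious_ending())
print(ref.most_recent())
```

Passing the resolver in as an argument keeps the class testable without network access; in the notebook, `update_url` could be supplied instead.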



### Check how URLs are structured

The objective is to check how the URLs are structured so that the scraped content can be formatted to match them.

In [95]:
for url in updated_urls:
    if 'www.amsterdam.nl/veelgevraagd' in url:
        print(url)

https://www.amsterdam.nl/veelgevraagd/?caseid=%7BD6E280FB-4A76-40A0-9B88-12B87E446FA6%7D
https://www.amsterdam.nl/veelgevraagd/?productid=
https://www.amsterdam.nl/veelgevraagd/?productid=%7B249D3A8E-ED07-4E4C-BFAD-49F174342FD5%7D
https://www.amsterdam.nl/veelgevraagd/?caseid=%7B2A574844-AA85-4A2C-8CD3-8CB494F4997E%7D
https://www.amsterdam.nl/veelgevraagd/?caseid=%7B81530DE0-8A69-4085-A9A8-739EF202B595%7D&_ga=2.153284539.280737159.1528702078-235779210.1528702078
https://www.amsterdam.nl/veelgevraagd/?caseid=%7B0509871D-A851-40C4-8C1A-E79B5E121D67%7D
https://www.amsterdam.nl/veelgevraagd/?productid=
https://www.amsterdam.nl/veelgevraagd/?productid=%7BD5F9EF09-0F3A-4E59-8435-4873EB7CD609%7D#case_%7B33F0B504-EDEB-42EE-A8C5-7EF394F65D3A%7D#
https://www.amsterdam.nl/veelgevraagd/?productid=%7B87FAD1C9-60E9-4CEA-B9AE-6D5594A0E841%7D#case_%7B63E55F58-F93C-4A68-BEAA-896C8F8FBBB1%7D)
https://www.amsterdam.nl/veelgevraagd/?productid=%7B0497C2EC-D574-42FC-BB56-140DD7641EC5%7D


In the collected documents I don't have any '%' symbols; instead I have '{' and '}'. I think that is why there might be a mismatch between the two versions of the URLs (supporting docs and reference docs).
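
`%7B` and `%7D` are the percent-encoded forms of `{` and `}`. If the mismatch is indeed caused by encoding, decoding both sides before comparing should make the two versions line up. A minimal sketch using the standard library follows; `same_page` is a hypothetical helper name.

```python
from urllib.parse import unquote

encoded = ('https://www.amsterdam.nl/veelgevraagd/'
           '?caseid=%7BD6E280FB-4A76-40A0-9B88-12B87E446FA6%7D')

# percent-decoding turns %7B into '{' and %7D into '}'
decoded = unquote(encoded)
print(decoded)


def same_page(reference_url, collected_url):
    """Compare two URLs after percent-decoding both (hypothetical helper)."""
    return unquote(reference_url) == unquote(collected_url)


print(same_page(encoded, decoded))
```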

In [92]:
updated_urls[:]

['https://www.zwemwater.nl/.Hier',
 'https://maps.amsterdam.nl/zwemwater/',
 'https://www.amsterdam.nl/veelgevraagd/?caseid=%7BD6E280FB-4A76-40A0-9B88-12B87E446FA6%7D',
 'https://www.ggd.amsterdam.nl/gezond-wonen/zwemmen-open-water/',
 'https://www.amsterdam.nl/zorg-ondersteuning/ondersteuning/vluchtelingen/ongedocumenteerden/',
 'https://www.amsterdam.nl/zorg-ondersteuning/ondersteuning/vluchtelingen/ongedocumenteerden/',
 'https://www.amsterdam.nl/wonen-leefomgeving/veiligheid/openbare-orde/overlastgebieden/agressie-geweld/agressie-geweld/',
 'https://www.amsterdam.nl/publish/pages/419505/amsterdamse_zorgambitie.pdf',
 'https://www.amsterdam.nl/vragenondernemers)',
 'https://www.amsterdam.nl/vragenondernemers.',
 ('error', 'http://www.Amsterdam.nl)'),
 'https://www.amsterdam.nl/wonen-leefomgeving/groene-stad',
 'https://www.amsterdam.nl/privacy/loket/',
 'https://www.amsterdam.nl/veelgevraagd/?productid=',
 'https://www.amsterdam.nl/publish/pages/858023/brief_wethouder_choho_hotspots

In [97]:
with open('updated_urls.pickle', 'wb') as f:
    pickle.dump(updated_urls, f)