# URL Analysis

This notebook performs an exploratory analysis of the URLs that are cited as references in some of the answers in the MunicipalQA dataset.

The following analyses are performed:

1. Most frequent URL domains
2. Most frequent main paths of the corresponding domain
3. Topics in the referenced URLs

In [None]:
import os

# Get the notebook directory
notebook_dir = os.getcwd()

# Get the root directory: two levels up via '../../', then one more via dirname
root_dir = os.path.dirname(os.path.abspath(os.path.join(notebook_dir, '../../')))

# Change the current working directory to the root directory
os.chdir(root_dir)

In [None]:
import pandas as pd
from yarl import URL
import os
from collections import Counter

In [None]:
data_path = 'data/question_answer/questions.csv'
questions = pd.read_csv(data_path)
# A 'URLs' cell can hold several newline-separated URLs; flatten them into one list
urls = [url for cell in questions['URLs'].dropna() for url in cell.split('\n')]
# Prepend a scheme to URLs that lack one so that they can be parsed
urls = [url if url.startswith('http') else f'https://{url}' for url in urls]
len(urls)

In [None]:
questions['Year'] = pd.to_numeric(questions['Year'], errors='coerce')

In [None]:
questions['Year'].dropna().min()

In [None]:
questions['Year'].dropna().max()

In [None]:
len(questions)

### Parse the URLs with yarl

In [None]:
sample = pd.DataFrame()
sample['url'] = urls
sample['url'] = sample['url'].apply(lambda url: URL(url))
sample['path'] = sample['url'].apply(lambda url: url.path)
sample['host'] = sample['url'].apply(lambda url: url.host)

In [None]:
print('The most common domains are:')
print()
print(sample['host'].value_counts().head(20))

In [None]:
import matplotlib.pyplot as plt

# Get the top 5 most common domains and sort them
top_domains = sample['host'].value_counts().head(5).sort_values(ascending=True)

# Create a horizontal bar plot
plt.figure(figsize=(10, 6))  # Set the figure size
plt.barh(top_domains.index, top_domains.values)  # Create the horizontal bar plot
plt.xlabel('Frequency', fontsize=20)  # Set the x-axis label with fontsize
plt.ylabel('Domains', fontsize=20)  # Set the y-axis label with fontsize
plt.title('Most Common Domains', fontsize=20)  # Set the title with fontsize
plt.xticks(fontsize=20)  # Set the font size of the x-axis ticks
plt.yticks(fontsize=20)  # Set the font size of the y-axis ticks

# Save the plot in high quality
plt.savefig('bar_plot.png', dpi=300, bbox_inches='tight')

# Display the plot
plt.show()


### Check URL (file) extensions

In [None]:
extensions = []

for url in urls:
    # Extract the file extension from the URL path only, so that the host
    # (e.g. '.nl') and any query string are not mistaken for an extension
    file_ext = os.path.splitext(URL(url).path)[1]
    extensions.append(file_ext)

In [None]:
Counter(extensions).most_common(5)

Some of the URLs point to .pdf files, which should be accounted for during collection.
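
One way to account for the PDFs during collection is to separate them from the other URLs up front. The snippet below is a minimal sketch, not part of the original analysis: it uses the standard library's `urlsplit` instead of yarl, and the helper names `url_extension` and `split_pdfs` as well as the example URLs are hypothetical.

```python
import os
from urllib.parse import urlsplit


def url_extension(url: str) -> str:
    """Return the lower-cased file extension of the URL path ('' if there is none)."""
    path = urlsplit(url).path
    return os.path.splitext(path)[1].lower()


def split_pdfs(urls):
    """Partition URLs into (pdf_urls, other_urls) based on the path extension."""
    pdf_urls, other_urls = [], []
    for url in urls:
        if url_extension(url) == '.pdf':
            pdf_urls.append(url)
        else:
            other_urls.append(url)
    return pdf_urls, other_urls


# Made-up URLs, for illustration only
example_urls = [
    'https://example.org/reports/annual.pdf?download=1',
    'https://example.org/news/',
    'https://example.org',
]
pdf_urls, other_urls = split_pdfs(example_urls)
print(len(pdf_urls), 'PDF URL(s),', len(other_urls), 'other URL(s)')
```

Looking only at the path keeps a query string or a bare host name from being counted as an extension.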

### Inspect the main paths per domain

In [None]:
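def extract_string(url, domain):
    """
    Extracts the string between '<domain>/' and the next '/' in a URL,
    i.e. the first path segment. Returns '' if '<domain>/' does not occur.
    """
    prefix = domain + '/'
    found_at = url.find(prefix)
    if found_at == -1:
        # e.g. 'https://www.amsterdam.nl' has no path after the domain
        return ''
    start_index = found_at + len(prefix)
    end_index = url.find('/', start_index)
    if end_index == -1:
        end_index = len(url)
    return url[start_index:end_index]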

In [None]:
######### Amsterdam.nl ###########
paths = []
for url in urls:
    if 'www.amsterdam.nl' in url:
        paths.append(extract_string(url, 'amsterdam.nl'))

Counter(paths).most_common(5)

In [None]:
######### Rijksoverheid ###########
paths = []
for url in urls:
    if 'www.rijksoverheid.nl' in url:
        paths.append(extract_string(url, 'rijksoverheid.nl'))

Counter(paths).most_common()

In [None]:
######### amsterdam.raadsinformatie.nl ###########
paths = []
for url in urls:
    if 'amsterdam.raadsinformatie.nl' in url:
        paths.append(extract_string(url, 'amsterdam.raadsinformatie.nl'))

Counter(paths).most_common()

In [None]:
######### www.parool.nl  ###########
paths = []
for url in urls:
    if 'www.parool.nl' in url:
        paths.append(extract_string(url, 'parool.nl'))

Counter(paths).most_common()

In [None]:
######### www.rivm.nl  ###########
paths = []
for url in urls:
    if 'www.rivm.nl' in url:
        paths.append(extract_string(url, 'rivm.nl'))

Counter(paths).most_common()

In [None]:
######### www.ggd.amsterdam.nl  ###########
paths = []
for url in urls:
    if 'www.ggd.amsterdam.nl' in url:
        paths.append(extract_string(url, 'ggd.amsterdam.nl'))

Counter(paths).most_common()

In [None]:
######### www.tweedekamer.nl  ###########
paths = []
for url in urls:
    if 'www.tweedekamer.nl' in url:
        paths.append(extract_string(url, 'tweedekamer.nl'))

Counter(paths).most_common()

In [None]:
######### www.infomil.nl  ###########
paths = []
for url in urls:
    if 'www.infomil.nl' in url:
        paths.append(extract_string(url, 'infomil.nl'))

Counter(paths).most_common()

In [None]:
######### data.amsterdam.nl  ###########
paths = []
for url in urls:
    if 'data.amsterdam.nl' in url:
        paths.append(extract_string(url, 'data.amsterdam.nl'))

Counter(paths).most_common()
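
The per-domain cells above all follow the same pattern, so they could be folded into a single helper. The snippet below is a sketch under assumptions: it uses the standard library's `urlsplit` rather than yarl, the function name `main_path_counts` is hypothetical, and the example URLs are made up. It compares the host exactly, which avoids counting a URL that merely contains the domain somewhere else in its text.

```python
from collections import Counter
from urllib.parse import urlsplit


def main_path_counts(urls, host):
    """Count the first path segment of every URL whose host equals `host`."""
    segments = []
    for url in urls:
        parts = urlsplit(url)
        if parts.hostname != host:
            continue
        # '/a/b/' -> ['a', 'b']; empty strings from leading/trailing slashes are dropped
        path_segments = [segment for segment in parts.path.split('/') if segment]
        # URLs without a path are counted under the empty string
        segments.append(path_segments[0] if path_segments else '')
    return Counter(segments)


# Made-up URLs, for illustration only
example_urls = [
    'https://www.example.org/parking/permits/',
    'https://www.example.org/parking/rates',
    'https://www.example.org/waste/',
    'https://data.example.org/parking/',
    'https://www.example.org',
]
for host in ['www.example.org', 'data.example.org']:
    print(host, main_path_counts(example_urls, host).most_common(5))
```

In the notebook, the list of hosts could be taken from the most common values of `sample['host']`.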

In order to be sure of the factual validity of our corpus we are going to collect supporting documents only from 

# Conclusions

1. The domains most frequently used as references are amsterdam.nl, rijksoverheid.nl, rivm.nl, etc.
2. The most frequent main paths within these domains were analyzed and will be taken into account during collection.



**Additional findings after manual exploration**:

1. Some URL paths have changed since the time they were referenced.
2. Some URLs appear to point to a non-existent page.
3. A common cause of a broken URL is a wrong ending, which is either a "." or a ")."
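
The third finding suggests a simple cleaning step. The function below is a minimal sketch; the name `clean_url` and the example URLs are hypothetical. It assumes that a closing parenthesis belongs to the surrounding sentence only when it has no matching opening parenthesis in the URL.

```python
def clean_url(url: str) -> str:
    """Strip a trailing '.' or ').' picked up from the surrounding sentence."""
    url = url.strip()
    while url:
        if url.endswith('.'):
            url = url[:-1]
        elif url.endswith(')') and url.count(')') > url.count('('):
            # Unbalanced closing parenthesis: it belonged to the text, not the URL
            url = url[:-1]
        else:
            break
    return url


# Made-up URLs, for illustration only
for raw in ['https://example.org/page.', 'https://example.org/page).', 'https://example.org/a_(b)']:
    print(repr(raw), '->', repr(clean_url(raw)))
```

URLs with balanced parentheses and URLs that end in a file extension such as `.pdf` are left unchanged.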

# Actions to take
1. Build a collection of supporting documents based on the most common domains and URL paths (url, html_content)
2. Update the reference URLs to their most current versions
3. Clean the URLs if needed (e.g. if they end with a '.')
4. Collect the HTML content of the reference URLs as well
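
Actions 1 and 4 could start from a loop like the one below. This is a sketch under stated assumptions, not the collection pipeline itself: the names `fetch_html` and `collect_documents` are hypothetical, the download uses the standard library's `urlopen`, and the fetch function is passed in as a parameter so that the example can run offline with a stub.

```python
import http.client
from typing import Callable, Dict, Iterable, List, Optional
from urllib.request import urlopen


def fetch_html(url: str, timeout: float = 10.0) -> Optional[str]:
    """Download a page and return its text, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or 'utf-8'
            return response.read().decode(charset, errors='replace')
    except (OSError, ValueError, LookupError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses; ValueError covers
        # malformed URLs and LookupError an unknown charset
        return None


def collect_documents(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]] = fetch_html,
) -> List[Dict[str, Optional[str]]]:
    """Build (url, html_content) records, fetching each distinct URL once."""
    documents = []
    seen = set()
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        documents.append({'url': url, 'html_content': fetch(url)})
    return documents


# Offline usage with a stub fetcher and made-up URLs
stub_fetch = lambda url: f'<html>{url}</html>'
documents = collect_documents(
    ['https://example.org/a', 'https://example.org/b', 'https://example.org/a'],
    fetch=stub_fetch,
)
print(len(documents), 'documents collected')
```

Records whose `html_content` is `None` mark URLs that could not be fetched, which ties in with the finding that some references point to non-existent pages.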