# CFC Insight Technical Challenge
### Adam Richmond - December 2022


I wrote this challenge as a Jupyter notebook in order to maximise the presentability and readability of the code. I think this will also make it easier both to discuss in a follow-up interview and to see the logical flow of how I approached the problem. 

## Section 1
_Scrape the index webpage hosted at `cfcunderwriting.com`_

I'll use the python `requests` package to get the data from the website, and the `BeautifulSoup` package for parsing HTML

In [1]:
# import necessary libraries
import requests
from bs4 import BeautifulSoup

# get page data
index_page = requests.get("https://www.cfcunderwriting.com/en-gb/")

# parse HTML
index_soup = BeautifulSoup(index_page.content, "html.parser")

Hence the HTML content of the index web page is displayed in a readable format. Additionally, using a Jupyter notebook means the page's content is cached in the `index_soup` variable and available in other cells, so that unnecessary GET requests to the site can be avoided.

## Section 2
_Write a list of all externally loaded resources (e.g. images/scripts/fonts not hosted
on cfcunderwriting.com) to a JSON output file_

My strategy for this is as follows:
1. Find all of the HTML tags typically associated with external resources, such as img or script
2. Find the "src" attribute of these tags and check it's external - if so, add it to a dictionary
3. Convert the dictionary to JSON and write it to a file.

By inspecting the above there are a number of resources of various types used in the site, but not all are externally hosted. 

In [6]:
# import libraries: regex to search text and json handling for output
import re
import json


# define a list of HTML tags usually associated with external resources
external_tag_types = ["img", "script", "font", "iframe"]


def find_all_sources(soup, external_tags):
    """
    Given a list of tag types, finds all externally sourced instances of each tag. 
    """
    output = {key: [] for key in external_tags}
    # for each type of tag, find all instances, then for each instance, get its source, and keep if external
    for tag_type in external_tag_types:
        tags = soup.find_all(tag_type)
        for tag in tags:
            source = get_source_of_external_tag(tag)
            if source is not None:
                output[tag_type].append(source)
    return output


                
def get_source_of_external_tag(tag):
    """
    Given a tag, returns its source. \
    Either by finding ("src") attribute, or a "src" declaration in innerHTML. \
    Returns None if tag's source is internal
    """
    source = None
    # check which attributes exist in the tag that indicate its source
    attributes = list(tag.attrs)
    if "src" in attributes:
        source = tag["src"]
    # if a script has a "src" declaration in its innerHTML, find that too (using regex for "src=")
    elif "src=" in tag.text:
        regex_search = re.search(r"(src=\")([^\"]*)", tag.text)
        if regex_search:
            source = regex_search.group(2)
    if source:
        if not source_is_external(source):  # return None if source is internal
            source = None
    return source


def source_is_external(source):
    """
    Given a source, returns a boolean representing if the tag is internal or external based on its source
    """
    is_external = False
    if source.startswith("//"):  # external resources are addressed by double slash, internal by single
        is_external = True
    elif (source.startswith("http")) and ("www.cfcunderwriting.com" not in source):  # if http, check not internal
        is_external = True
    return is_external

    
def write_file(output):
    """
    Writes the external source data to a file
    """
    with open("external_sources.json", "w") as f:
        f.write(output)
    
sources = find_all_sources(index_soup, external_tag_types)
write_file(json.dumps(sources))
print(json.dumps(sources))

{"img": [], "script": ["//js.hsforms.net/forms/v2.js", "https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js", "https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/4.1.3/js/bootstrap.min.js", "//js.hs-scripts.com/6072523.js", "//js.hsforms.net/forms/v2.js", "https://www.google.com/recaptcha/api.js?render=6LemiyEaAAAAAGwb4nR8oX38fxyM36xjIGbwz6d4"], "font": [], "iframe": ["https://www.googletagmanager.com/ns.html?id=GTM-NGGN5FB"]}


The output is written in the `external_sources.json` file.

## Section 3
_Enumerate the page's hyperlinks and identify the location of the "Privacy Policy" page._

To do this I simply find all `a` tags and check the tag's text to see if "privacy policy" is mentioned.

In [3]:
# find all links - denoted by HTML a-tag
links = index_soup.find_all("a")

# print if the tag's contents mention privacy policy
print([link for link in links if "privacy policy" in link.text.lower()])

[<a href="/en-gb/support/privacy-policy/">Privacy policy</a>, <a data-udi="umb://document/b8c2aeeef41e4e73812afab207e45b89" href="/en-gb/support/privacy-policy/" title="Privacy policy">Privacy Policy</a>]


The link is therefore `base_url/en-gb/support/privacy-policy`, where `base_url` is www.cfcunderwriting.com.

## Section 4
_Use the privacy policy URL identified in step 3 and scrape the pages content.
Produce a case-insentitive word frequency count for all of the visible text on the page.
Your frequency count should also be written to a JSON output file_

I'll scrape and parse the data from this page as before.

In [4]:
# get page data
privacy_page = requests.get("https://www.cfcunderwriting.com/en-gb/support/privacy-policy")

# parse HTML
privacy_soup = BeautifulSoup(index_page.content, "html.parser")

My strategy for finding all visible words on the page is as follows:
1. Find all child tags of the main "body" tag
2. Check these tags don't have their "style.visibility" attribute set to "hidden"
3. Find all of the words in the tag's innerHTML, cleaning up newlines as I go, to produce a space-separated list of words
4. Split this list of words by the space character and count its length to get the word count

In [5]:
def get_words_from_tag(tag):
    """
    Finds all of the words in a div - separated by a space - and returns them in a list. \
    Returns "" if tag is invisible
    """
    # first check visibility is not hidden
    try:
        invisible = "visibility: hidden" in tag["style"]
    except Exception:
        invisible = False
    # find words in div and reformat
    if invisible:
        return ""
    else:
        words = replace_punctuation(tag.text)
        return words

def replace_punctuation(text):
    """ 
    Replaces newlines and punctuations
    """
    if "\n" in text:  # replace newlines
        text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text


# find all tags inside <body> and get their innerHTML contents unless they are hidden
text_list = []
body_tag = privacy_soup.find("body")
for child_tag in body_tag:
    words = get_words_from_tag(child_tag)
    if words != "" and words != " ":  # don't add if empty
        text_list.append(words)

# remove any duplicates
text_list = list(dict.fromkeys(text_list))

# flatten list of entries into list of words
word_list = []
for text in text_list:
    word_list += text.split(" ")
    
# finally, drop all empty spaces
word_list = list(filter(lambda entry: entry != '', word_list))

# find the word count by splitting by space character and counting list length
result = {"word_count": len(word_list)}
with open("word_count.json", "w") as f:
    f.write(json.dumps(result))
print(result)

{'word_count': 1092}


Hence, the word count of visible words on the Privacy Policy page is 1,092.