Alex Jones (alexander.g.jones.23@dartmouth.edu) <br>
March 15, 2022 <br>
LING 28 (Rolando Coto-Solano), Winter 2022 <br>
Final Project


---

This notebook contains code for performing the web crawling procedure: extracting monolingual Kalaallisut and Danish sentences from multilingual websites.

In [5]:
!pip install clean-text unidecode
from bs4 import BeautifulSoup
import requests
import re
from cleantext import clean
from tqdm import tqdm

In [8]:
# Maps domains to language-specific subdomains, as well as the regex keys for finding relative links associated with those subdomains
domain_dict = {'https://greenland-travel.gl':         {'lang_codes': ['', '/kl/'], 
                                                       'regex_keys': [r'/.*', r'\/kl\/.*'],
                                                       'file_names': ['gtav_da.txt', 'gtav_kl.txt']},
               
                'http://www.ral.gl':                  {'lang_codes': [''], 
                                                       'regex_keys': [r'\/.*'],
                                                       'file_names': ['ral_kl.txt']},
               
                'http://www.ral.dk':                  {'lang_codes': [''], 
                                                       'regex_keys': [r'\/.*'],
                                                       'file_names': ['ral_da.txt']},
               
                'http://www.royalarcticline.com':     {'lang_codes': [''], 
                                                       'regex_keys': [r'\/.*'],
                                                       'file_names': ['ral_en.txt']},
               
                'https://ina.gl':                     {'lang_codes': ['/?lang=kl', '/?lang=da', '/?lang=en'], 
                                                       'regex_keys': [r'\/\?lang=kl\/.*', r'\/\?lang=da\/.*', r'\/\?lang=en\/.*'],
                                                       'file_names': ['ina_kl.txt', 'ina_da.txt', 'ina_en.txt']},
               
                'https://www.banken.gl':              {'lang_codes': ['/gl', '/da/', '/en/'], 
                                                       'regex_keys': [r'\/gl\/.*', r'\/da\/.*', r'\/en\/.*'],
                                                       'file_names': ['banken_kl.txt', 'banken_da.txt', 'banken_en.txt']},
               
                'https://brugseni.gl':                {'lang_codes': ['', '/gl/'], 
                                                       'regex_keys': [r'\/.*', r'\/gl\/.*'],
                                                       'file_names': ['brug_da.txt', 'brug_kl.txt']},
               
                'https://diskolineexplorer.com':      {'lang_codes': ['/kl/', '', '/en/'], 
                                                       'regex_keys': [r'\/kl\/.*', r'\/.*', r'\/en\/.*'],
                                                       'file_names': ['disko_kl.txt', 'disko_da.txt', 'disko_en.txt']},
               
                'https://www.mit.gl':                 {'lang_codes': ['', '/en/', '/gl/'], 
                                                       'regex_keys': [r'\/.*', r'\/en\/.*', r'\/gl\/.*'],
                                                       'file_names': ['mit_da.txt', 'mit_en.txt', 'mit_kl.txt']},
               
                'https://aul.gl':                     {'lang_codes': ['/kl/in-gr-oplev-kalaallit-nunaat/', '/da/oplev-groenland/', '/en/experience-greenland/'],
                                                       'regex_keys': [r'\/kl\/in-gr-oplev-kalaallit-nunaat\/.*', r'\/da\/oplev-groenland\/.*', r'\/en\/experience-greenland\/.*'],
                                                       'file_names': ['aul_kl.txt', 'aul_da.txt', 'aul_en.txt']},
               
                'https://www.kni.gl':                 {'lang_codes': ['/kl/', '/da/', '/en/'], 
                                                       'regex_keys': [r'\/kl\/.*', r'\/da\/.*', r'\/en\/.*'],
                                                       'file_names': ['kni_kl.txt', 'kni_da.txt', 'kni_en.txt']},
               
                'https://nukissiorfiit.gl':           {'lang_codes': ['/kl/', '/da/'], 
                                                       'regex_keys': [r'\/kl\/.*', r'\/da\/.*'],
                                                       'file_names': ['nuki_kl.txt', 'nuki_da.txt']},
               
                'https://nunaoil.gl':                 {'lang_codes': ['/kl/', '', '/en/'],
                                                       'regex_keys': [r'\/kl\/.*', r'\/.*', r'\/en\/.*'],
                                                       'file_names': ['nuna_kl.txt', 'nuna_da.txt', 'nuna_en.txt']},
               
                'https://nanoqmedia.gl':              {'lang_codes': ['/kl/', '/da/'], 
                                                       'regex_keys': [r'\/kl\/.*', r'\/da\/.*'],
                                                       'file_names': ['nanoq_kl.txt', 'nanoq_da.txt']},
               
                'https://bus.gl':                     {'lang_codes': ['/kl/', '/da/', '/en/'], 
                                                       'regex_keys': [r'\/kl\/.*', r'\/da\/.*', r'\/en\/.*'],
                                                       'file_names': ['bus_kl.txt', 'bus_da.txt', 'bus_en.txt']},
               
                'https://www.pilersuisoq.gl':         {'lang_codes': ['/kl/', '/da/'], 
                                                       'regex_keys': [r'\/kl\/.*', r'\/da\/.*'],
                                                       'file_names': ['piler_kl.txt', 'piler_da.txt']},
               
                'https://www.pisiffik.gl':            {'lang_codes': ['/gl/', '/da/'],
                                                       'regex_keys': [r'\/gl\/.*', r'\/da\/.*'],
                                                       'file_names': ['pisiff_kl.txt', 'pisiff_da.txt']},
               
                'https://hotelarctic.com':            {'lang_codes': ['/gl/', '', '/en/'], 
                                                       'regex_keys': [r'\/gl\/.*', r'\/.*', r'\/en\/.*'],
                                                       'file_names': ['arctic_kl.txt', 'arctic_da.txt', 'arctic_en.txt']},
               
                'https://hhe.gl':                     {'lang_codes': ['/?lang=gl', '', '/?lang=en'], 
                                                       'regex_keys': [r'\/\?lang=gl\/.*', r'\/.*', r'\/\?lang=en\/.*'],
                                                       'file_names': ['hhe_kl.txt', 'hhe_da.txt', 'hhe_en.txt']},
               
                'https://www.banknordik.gl':          {'lang_codes': ['/kl', '/da/'], 
                                                       'regex_keys': [r'\/kl\/.*', r'\/da\/.*'],
                                                       'file_names': ['banknord_kl.txt', 'banknord_da.txt']},
               
                'https://www.polarseafood.gl':        {'lang_codes': [''], 
                                                       'regex_keys': [r'\/.*'],
                                                       'file_names': ['psf_kl.txt']},
               
                'https://www.polarseafood.dk':        {'lang_codes': [''], 
                                                       'regex_keys': [r'\/.*'],
                                                       'file_names': ['psf_da.txt']},
               
                'http://climategreenland.gl':         {'lang_codes': ['', '/da/', '/en/'], 
                                                       'regex_keys': [r'\/.*', r'\/da\/.*', r'\/en\/.*'],
                                                       'file_names': ['climate_kl.txt', 'climate_da.txt', 'climate_en.txt']},
               
                'https://www.businessingreenland.gl': {'lang_codes': ['/kl-GL', '/da', '/en'], 
                                                       'regex_keys': [r'\/kl-GL\/.*', r'\/da\/.*', r'\/en\/.*'],
                                                       'file_names': ['big_kl.txt', 'big_da.txt', 'big_en.txt']},
               
                'https://dk.usembassy.gov':           {'lang_codes': ['/kal/', '/da/', ''], 
                                                       'regex_keys': [r'\/kal\/.*', r'\/da\/.*', r'\/.*'],
                                                       'file_names': ['embassy_kl.txt', 'embassy_da.txt', 'embassy_en.txt']},
               
                'https://nka.gl':                     {'lang_codes': [''], 
                                                       'regex_keys': [r'\/.*'],
                                                       'file_names': ['nka_kl.txt']},
               
                'https://da.nka.gl':                  {'lang_codes': [''], 
                                                       'regex_keys': [r'\/.*'],
                                                       'file_names': ['nka_da.txt']},
               
                'https://en.nka.gl':                  {'lang_codes': [''], 
                                                       'regex_keys': [r'\/.*'],
                                                       'file_names': ['nka_en.txt']},
               
                'https://natur.gl':                   {'lang_codes': ['/?lang=kl', '', '/?lang=en'], 
                                                       'regex_keys': [r'\/\?lang=kl\/.*', r'\/.*', r'\/\?lang=en\/.*'],
                                                       'file_names': ['natur_kl.txt', 'natur_da.txt', 'natur_en.txt']},
               
                'https://oqaasileriffik.gl':          {'lang_codes': ['', '/da/', '/en/'], 
                                                       'regex_keys': [r'\/.*', r'\/da\/.*', r'\/en\/.*'],
                                                       'file_names': ['oqaas_kl.txt', 'oqaas_da.txt', 'oqaas_en.txt']},
               
                'https://www.katak.gl':               {'lang_codes': ['/kl', '/da', '/en'], 
                                                       'regex_keys': [r'\/kl\/.*', r'\/da\/.*', r'\/en\/.*'],
                                                       'file_names': ['katak_kl.txt', 'katak_da.txt', 'katak_en.txt']}
               
               }
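
Because the crawl loop pairs `lang_codes`, `regex_keys`, and `file_names` with `zip`, a list that is one entry short silently drops a language rather than raising an error. The following is a minimal sketch of a sanity check that could be run before crawling; `check_domain_dict` and the `example` dictionary are hypothetical names introduced only for illustration, not part of the original notebook.

```python
def check_domain_dict(domains):
    """Return a list of (domain, problem) pairs for entries that look malformed."""
    problems = []
    keys = ('lang_codes', 'regex_keys', 'file_names')
    for domain, entry in domains.items():
        # The three lists are zipped together later, so they must have equal length
        lengths = {key: len(entry.get(key, [])) for key in keys}
        if len(set(lengths.values())) != 1:
            problems.append((domain, f'list lengths differ: {lengths}'))
        # Output files are expected to be plain-text files
        for name in entry.get('file_names', []):
            if not name.endswith('.txt'):
                problems.append((domain, f'unexpected file name: {name!r}'))
    return problems

# Small illustrative dictionary (an assumption, not the real crawl list)
example = {
    'https://example.gl': {'lang_codes': ['/kl/', '', '/en/'],
                           'regex_keys': [r'\/kl\/.*', r'\/en\/.*'],
                           'file_names': ['ex_kl.txt', 'ex_da.txt', 'ex_en,txt']},
}

for domain, problem in check_domain_dict(example):
    print(domain, '->', problem)
```

Calling `check_domain_dict(domain_dict)` on the real dictionary should return an empty list when every entry is well formed.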

In [11]:
NUM_SUBDOM = sum([len(domain_dict[dom]['lang_codes']) for dom in domain_dict])
print(f'We have {NUM_SUBDOM}/70 subdomains left to scrape')

We have 25/70 subdomains left to scrape


In [15]:
def getHTML(url):
  '''
  Purpose: Get HTML from input URL
  
  Args: 
  url -- web link
  
  Returns:
  htmlstr (str) -- page HTML
  soup (bs4 object) -- a BeautifulSoup object containing HTML properties
  '''
  my_session = requests.session()
  for_cookies = my_session.get(DOMAIN_NAME)
  cookies = for_cookies.cookies
  headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}
  response = my_session.get(url, headers=headers, cookies=cookies)
  soup = BeautifulSoup(response.content, 'html.parser')
  htmlstr = soup.prettify()
  return htmlstr, soup

def getPageLinks(soup, html):
  '''
  Purpose: Gets page links from HTML

  Args: 
  soup (bs4 object) -- a BeautifulSoup object containing HTML properties
  html (str) -- an HTML string

  Returns:
  page_links (List[str]) -- a list of links found on page
  '''
  # Uncomment the following two lines and comment out the rest of the body (except the return) if crawling from sermitsiaq.ag
  # page_links = re.findall(REL_LINK_REGEX, html)
  # page_links = [re.sub(r' .*', '', page.replace('>', '').replace('"', '')) for page in page_links]
  # Keep relative links (starting with '/') whose first path segment is not in BAD_SUBDOMS
  hrefs = [str(link.get('href')) for link in soup.find_all('a') if link.get('href')]
  page_links = [href for href in hrefs if href[0] == '/' and href.split('/')[1] not in BAD_SUBDOMS]
  return page_links

def getPageText(soup):
  '''
  Purpose: Get plaintext from page
  
  Args:
  soup (bs4 object) -- a BeautifulSoup object containing HTML properties
  
  Returns:
  cleanTexts (List[str]) -- a list of cleaned texts from page
  '''
  global seen_sentences
  # Extract plaintext from html
  for script in soup(["script", "style"]):
      script.extract()
  text = soup.get_text()
  lines = (line.strip() for line in text.splitlines())
  chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
  text = '\n'.join(chunk for chunk in chunks if chunk)
  sentences = text.split('.')
  # Clean texts
  cleanTexts = [clean(text) for text in sentences if text]
  def customTextCheck(text: str) -> bool:
    return (('\n' not in set(text)) and not (re.search(r'https', text)) and (len(text.split())>1))
  cleanTexts = [text+'\n' for text in cleanTexts if customTextCheck(text)]
  cleanTexts = [text for text in cleanTexts if text not in seen_sentences]
  seen_sentences = seen_sentences.union(set(cleanTexts))
  return cleanTexts

def write_to_file(sentences, path_to_write):
  with open(path_to_write, 'a') as f:
    f.writelines(sentences)
  
def scrapeText(url, path_to_write):
  '''
  Wrapper function for the entire crawling process. Feed it a homepage URL and the path of the file
  you want the crawled sentences appended to. Then be patient!
  '''
  html, soup = getHTML(url)
  text = getPageText(soup)
  write_to_file(text, path_to_write)
  page_links = getPageLinks(soup, html)
  if (len(visited) % 10 == 0) and (len(visited) > 1):
    print(f'Parsed {len(visited)} articles')
  for rel_link in page_links:
    if rel_link not in visited:
      full_link = DOMAIN_NAME+rel_link
      visited.add(rel_link)
      #print(f'Currently scraping {full_link}')
      scrapeText(full_link, path_to_write)
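
`scrapeText` calls itself once for every new link it follows, so a long chain of pages could approach Python's default recursion limit (1000 frames). The following is a minimal sketch of the same "visit each link once" idea using an explicit stack instead of recursion. `crawl_iterative`, `fetch_links`, and the toy `site` graph are hypothetical and stand in for the real fetching and parsing functions; the visiting order is not guaranteed to match the recursive version.

```python
def crawl_iterative(start_link, fetch_links, max_pages=None):
    """Visit links depth-first with an explicit stack instead of recursion.

    fetch_links is a hypothetical callable mapping a relative link to the
    list of relative links found on that page.
    """
    visited = {start_link}
    order = []
    stack = [start_link]
    while stack:
        link = stack.pop()
        order.append(link)
        # Optional cap on the number of pages, useful for test runs
        if max_pages is not None and len(order) >= max_pages:
            break
        # Push in reverse so links are explored in the order they appear on the page
        for rel_link in reversed(fetch_links(link)):
            if rel_link not in visited:
                visited.add(rel_link)
                stack.append(rel_link)
    return order

# Toy site graph standing in for real pages (an assumption for illustration)
site = {'/': ['/a', '/b'], '/a': ['/b', '/c'], '/b': ['/'], '/c': []}
print(crawl_iterative('/', lambda link: site.get(link, [])))
```

In a real crawl, `fetch_links` would wrap `getHTML`, `getPageText`, `write_to_file`, and `getPageLinks` for a single page.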

In [13]:
# Crawl on Greenlandic domains!!
# BE WARNED: This'll take a LONG TIME if you're looking to crawl every domain in the domain dict

for domain in tqdm(domain_dict):
    visited = set() # visited links
    seen_sentences = set()
    DOMAIN_NAME = domain
    subdict = domain_dict[domain]
    subdoms, regexes, file_names = subdict['lang_codes'], subdict['regex_keys'], subdict['file_names']
    
    for subdom, regex, file_name in zip(subdoms, 
                                        regexes, 
                                        file_names):
        HOMEPAGE = DOMAIN_NAME + subdom
        REL_LINK_REGEX = regex
        FILE_TO_WRITE = '../data/' + file_name
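        # First path segments of the other languages (e.g. 'da', 'en'); getPageLinks compares against href.split('/')[1]
        BAD_SUBDOMS = {s.strip('/') for s in subdoms if s != subdom} - {''}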
        try:
            scrapeText(HOMEPAGE, FILE_TO_WRITE)
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            print(f'Could not crawl {HOMEPAGE}')
            continue

  0%|          | 0/12 [00:00<?, ?it/s]

Parsed 10 articles
Parsed 20 articles
Parsed 30 articles
Parsed 40 articles
Parsed 50 articles
Parsed 60 articles
Parsed 70 articles


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Parsed 80 articles


Parsed 90 articles


Could not crawl https://www.banknordik.gl/kl
Parsed 100 articles


Parsed 110 articles
Parsed 120 articles


  8%|▊         | 1/12 [3:03:43<33:40:54, 11023.16s/it]

Could not crawl https://www.banknordik.gl/da/
Parsed 10 articles
Parsed 20 articles
Parsed 30 articles


Parsed 40 articles


Parsed 50 articles


 17%|█▋        | 2/12 [3:05:28<21:31:18, 7747.89s/it] 

Could not crawl https://www.polarseafood.gl
Parsed 10 articles
Parsed 20 articles
Parsed 30 articles


Parsed 40 articles


 25%|██▌       | 3/12 [3:06:38<13:36:40, 5444.45s/it]

Could not crawl https://www.polarseafood.dk


Parsed 10 articles


Could not crawl http://climategreenland.gl/da/
Parsed 20 articles


 33%|███▎      | 4/12 [3:08:05<8:31:37, 3837.19s/it]


Parsed 10 articles


Parsed 20 articles
Parsed 30 articles


Parsed 40 articles
Could not crawl https://www.businessingreenland.gl/kl-GL
Parsed 50 articles


Parsed 60 articles


Parsed 70 articles


Parsed 80 articles


Could not crawl https://www.businessingreenland.gl/da
Parsed 90 articles
Parsed 100 articles


 42%|████▏     | 5/12 [3:11:15<5:20:01, 2743.06s/it]

Could not crawl https://www.businessingreenland.gl/en
Parsed 10 articles


 83%|████████▎ | 10/12 [3:11:57<15:31, 465.82s/it]
 92%|█████████▏| 11/12 [3:12:18<05:32, 332.22s/it]

Parsed 10 articles
Parsed 20 articles
Parsed 30 articles
Parsed 40 articles
Parsed 50 articles
Parsed 60 articles
Could not crawl https://www.katak.gl/kl
Parsed 70 articles
Parsed 80 articles
Parsed 90 articles
Parsed 100 articles
Parsed 110 articles
Parsed 120 articles
Could not crawl https://www.katak.gl/da
Parsed 130 articles
Parsed 140 articles


100%|██████████| 12/12 [3:15:49<00:00, 979.13s/it]
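
The crawl loop sets `BAD_SUBDOMS` so that `getPageLinks` can skip links leading into the other language versions of a site, which keeps each output file monolingual. The following is a minimal sketch of that filter, runnable without network access. `other_language_segments` and `keep_link` are hypothetical helper names, and the example paths are made up for illustration.

```python
def other_language_segments(lang_codes, current):
    """First path segments of every language prefix except the current one."""
    # '/da/' -> 'da'; the empty default prefix '' is dropped so that '/' stays crawlable
    return {code.strip('/') for code in lang_codes if code != current} - {''}

def keep_link(href, bad_segments):
    """True for relative links whose first path segment is not excluded."""
    if not href or not href.startswith('/'):
        return False
    return href.split('/')[1] not in bad_segments

# Crawling the Kalaallisut version of a site with three language prefixes
bad = other_language_segments(['/kl/', '/da/', '/en/'], '/kl/')
links = ['/kl/about', '/da/about', '/en/about', 'https://example.com/kl/', '/']
print([href for href in links if keep_link(href, bad)])
```

Sites whose default language has an empty prefix (`''`) cannot be separated this way, because links in that language have no distinguishing first segment; those would need a different check, such as language identification on the extracted text.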
