
# Task

Scrape text data from "Om Norsk Landbruksrådgiving" (NLR), a Norwegian website on agriculture








In [None]:
OM_NORSK_LANDBRUKS_RADGIVING_URL = 'https://www.nlr.no/fagartikler'

# Import Libraries

In [None]:
from bs4 import BeautifulSoup
from bs4 import NavigableString
from bs4 import Tag

import requests
import json
from google.colab import files
import io

import ssl

import time

import re
import string

import warnings
warnings.filterwarnings("ignore")

# Home Page

The pages are written in Norwegian, so we set the response encoding explicitly to make sure the letters æ, ø and å are decoded correctly

In [None]:
page = requests.get(OM_NORSK_LANDBRUKS_RADGIVING_URL, verify = False)
page.encoding = page.apparent_encoding
home_webpage = BeautifulSoup(page.text, 'html')
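
To see why the encoding step matters, here is a minimal offline sketch. It assumes the server sends UTF-8 bytes and that a client falls back to ISO-8859-1, which `requests` may do for text responses whose headers carry no charset. The sample string is taken from the page metadata; the variable names are illustrative only.

```python
# Bytes as a UTF-8 server would send them.
raw = 'Norsk Landbruksrådgiving'.encode('utf-8')

# Decoding with the wrong codec garbles the Norwegian letters.
wrong = raw.decode('iso-8859-1')

# Decoding with the codec the bytes were written in restores them.
right = raw.decode('utf-8')

print(wrong)
print(right)
```

Assigning `apparent_encoding` to `encoding` asks `requests` to use the codec detected from the bytes themselves rather than the header default.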



In [None]:
home_webpage

<!DOCTYPE html>
<html class="no-js" lang="no">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<link href="/assets/favicon/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="/assets/favicon/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="/assets/favicon/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<link href="/assets/favicon/site.webmanifest" rel="manifest"/>
<link color="#418459" href="/assets/favicon/safari-pinned-tab.svg" rel="mask-icon"/>
<meta content="#418459" name="msapplication-TileColor"/>
<meta content="/assets/favicon/browserconfig.xml" name="msapplication-config"/>
<meta content="#ffffff" name="theme-color"/>
<meta content="Norsk Landbruksrådgiving" name="apple-mobile-web-app-title"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="yes" name="mobile-web-app-

On the home page, the article list is injected dynamically through entries.html, so the articles are not present in the downloaded HTML and we cannot parse them from it



```
<div class="p-entry-filter__entries" v-html="entries.html"></div>
```



# URLs of Articles

We found that the URL of each article has the following structure:



```
https://www.nlr.no/fagartikler/kategori/region/title
```
where title consists of lowercase words joined by "-", for example:

```
https://www.nlr.no/fagartikler/grovfor/ostlandet/musa-odelegger-for-store-belop
```





In [None]:
BASE_URL = OM_NORSK_LANDBRUKS_RADGIVING_URL

In [None]:
categories = [
    'fornybar-energi',
    'frukt-og-baer',
    'froavl',
    'grovfor',
    'gronnsaker',
    'hms',
    'hydroteknikk',
    'jord',
    'klima',
    'korn',
    'kulturlandskap',
    'landbruksbygg',
    'maskinteknikk',
    'olje-og-belgvekster',
    'plantevern',
    'potet',
    'veksthus',
    'okologisk',
    'okonomi'
]

In [None]:
regions = [
    'innlandet',
    'midt',
    'nord',
    'sor',
    'vest',
    'ostlandet'
    ]

# Create a File for Collecting URLs

We will build a list of URL prefixes, one per combination of category and region, store them in a file **"nlr urls.json"** and download it. On the website, we select a region and a category and manually copy the titles of the filtered articles into the file. Then we upload the completed file and use it to build the URL of each article



In [None]:
# a flag to decide if we need to create a new file
is_file_already_created = True

In [None]:
NLR_ARTICLES_URLS_FILE_NAME = 'nlr urls.json'

In [None]:
articles_url_prefix_options = []

for category in categories:
    for region in regions:

        article_url = BASE_URL + '/' + category + '/' + region
        articles_url_prefix_options.append(article_url)

In [None]:
len(articles_url_prefix_options)

114

We have 114 combinations of categories and regions (19 categories × 6 regions) that form URL prefixes. We will transform the list into a dictionary whose keys are the URL prefixes and whose values are empty lists for the titles

In [None]:
articles_url_prefix_options_dict = {}

for url_prefix in articles_url_prefix_options:
    articles_url_prefix_options_dict[url_prefix] = []
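
The two steps above (building the prefix list, then turning it into a dictionary) can also be written in one pass. This is only a sketch: `base_url`, `sample_categories` and `sample_regions` are shortened, hypothetical stand-ins for the full lists defined earlier.

```python
from itertools import product

# Shortened, hypothetical stand-ins for the full lists defined above.
base_url = 'https://www.nlr.no/fagartikler'
sample_categories = ['grovfor', 'klima', 'korn']
sample_regions = ['midt', 'nord']

# One dictionary entry per (category, region) pair, each with an empty title list.
prefix_dict = {
    f'{base_url}/{category}/{region}': []
    for category, region in product(sample_categories, sample_regions)
}

print(len(prefix_dict))
```

With the full lists, the same expression would produce one entry for each of the 19 × 6 combinations.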

In [None]:
if not is_file_already_created:
    output_articles_urls = json.dumps(
        articles_url_prefix_options_dict,
        ensure_ascii = False,
        indent = 4
        )

    # write as UTF-8 because ensure_ascii = False keeps Norwegian letters as-is
    with open(NLR_ARTICLES_URLS_FILE_NAME, 'w', encoding = 'utf-8') as f:
        f.write(output_articles_urls)

    files.download(NLR_ARTICLES_URLS_FILE_NAME)
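
As a small check of the file format, the sketch below writes a dictionary with a Norwegian title to a temporary file and reads it back. The sample entry is hypothetical and the temporary directory only stands in for the Colab working directory.

```python
import json
import os
import tempfile

# Hypothetical sample with one prefix and one Norwegian title.
sample = {
    'https://www.nlr.no/fagartikler/fornybar-energi/sor': ['Bioenergi er lønnsomt']
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'nlr urls.json')

    # ensure_ascii=False keeps 'ø' readable in the file, so the file
    # itself must be written with an encoding that can represent it.
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sample, f, ensure_ascii=False, indent=4)

    # Read the file back with the same encoding.
    with open(path, encoding='utf-8') as f:
        restored = json.load(f)

print(restored == sample)
```

Keeping the titles unescaped makes the manual copy-and-paste step easier, since the file shows the titles exactly as they appear on the website.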

# Upload the File with URLs

In [None]:
uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name = fn, length = len(uploaded[fn])))

Saving nlr urls.json to nlr urls.json
User uploaded file "nlr urls.json" with length 49635 bytes


In [None]:
NLR_ARTICLES_URLS_FILE_NAME = "nlr urls.json"

In [None]:
# read the uploaded file and close it automatically
with open(NLR_ARTICLES_URLS_FILE_NAME, encoding = 'utf-8') as nlr_articles_urls_file:
    nlr_articles_urls_content = nlr_articles_urls_file.read()

Parse the JSON text into a Python dictionary

In [None]:
nlr_articles_urls_json = json.loads(nlr_articles_urls_content)

Print all the URL prefixes together with the titles collected for each of them

In [None]:
print(json.dumps(nlr_articles_urls_json, ensure_ascii = False, indent = 4))

{
    "https://www.nlr.no/fagartikler/fornybar-energi/nord": [
        "Fossilfri maskinpark – muligheter for landbruket og status",
        "Strømproduksjon fra husdyrgjødsel – enkel betraktning"
    ],
    "https://www.nlr.no/fagartikler/fornybar-energi/sor": [
        "Høgaktuelt med energiproduksjon på garden",
        "Bioenergi er lønnsomt"
    ],
    "https://www.nlr.no/fagartikler/fornybar-energi/vest": [
        "Garden som energiprodusent?",
        "BIOGJØDSEL AV HUSDYRGJØDSEL OG MATAVFALL HAR HØG GJØDSELVERDI"
    ],
    "https://www.nlr.no/fagartikler/fornybar-energi/ostlandet": [
        "Vindturbiner: Et framtidig skue på norske gårdstun?",
        "Status solceller: Er det fortsatt lurt?",
        "Biogassproduksjon på gården - den nye oljen?"
    ],
    "https://www.nlr.no/fagartikler/frukt-og-baer/innlandet": [
        "Fellefangst av jordbærsnutebille -hvor og hvor mange?",
        "Jordbærsnutebille - hvor mye skade og i hvilke sorter?",
        "Forekomst av sjukdo

# Method to Transform Titles

In URLs, all titles appear in lowercase with words separated by '-', for example:

**Title**: Regenerativt landbruk- kva betyr det eigentleg?

**URL**: regenerativt-landbruk-kva-betyr-det-eigentleg

Special cases:
1. Norwegian letters are replaced with letters from the basic Latin alphabet:

    *   Frislepp av mjølkeproduksjon ut året: frislepp-av-mjolkeproduksjon-ut-aret
    *   Næringsforsyning i økologisk åkerbruk: naeringsforsyning-i-okologisk-akerbruk
    *   Tørråte – settepotet og primærsmitte: status-om-torrate-settepotet-primaersmitte-dronebilder
    *   Pass på fôropptaket om du gir frossent fôr: pass-pa-foropptaket-om-du-gir-frossent-for

2. Punctuation is removed:

    *   Bekjemping av høymole etter 1., 2. eller 3. slått?: bekjemping-av-hoymole-etter-1-2-eller-3-slatt
    *   Det «mystiske» syreløselige kaliumet: det-mystiske-syreloselige-kaliumet

3. Some URLs end with the version suffix "-2":
    *   Fakta om potetsorter: fakta-om-potetsorter-2
    *   Organisk gjødsel til korn: organisk-gjodsel-til-korn-2
    *   Dekkvekster av belgvekster og korn: dekkvekster-av-belgvekster-og-korn-2



In [None]:
def transform_title(title):

    # transform to lowercase
    title = title.lower()

    # replace spaces with '-'
    title = title.replace(' ', '-')

    # replace special symbols of Norwegian language with symbols from Latin alphabet
    norwegian_symbols = {
        ord('ø'): 'o',
        ord('å'): 'a',
        ord('æ'): 'ae',
        ord('ô'): 'o',
        ord('ê'): 'e',
        ord('è'): 'e'
    }

    title = title.translate(norwegian_symbols)

    # remove punctuation
    title = '-'.join(
        [
            word.translate(str.maketrans('','', string.punctuation)) for word in title.split('-')
            ]
        )

    title = '-'.join(
        [
            word for word in title.split('-') if word
            ]
        )

    # a dash used as a punctuation sign (" – ") is not part of string.punctuation,
    # so after the steps above it is still present as "-–-" between words
    # we replace it with a single "-" sign
    title = title.replace('-–-', '-')

    return title
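
`string.punctuation` covers ASCII punctuation only, so characters such as « and » are not removed by the function above. The sketch below is an alternative, not the notebook's method: `slugify_title` and `REPLACEMENTS` are hypothetical names, and it assumes that every run of characters other than a-z and 0-9 acts as a single separator. Whether the website splits or joins words around characters such as "/" is not confirmed here.

```python
import re

# Hypothetical alternative to transform_title; the mapping mirrors the one above.
REPLACEMENTS = str.maketrans(
    {'ø': 'o', 'å': 'a', 'æ': 'ae', 'ô': 'o', 'ê': 'e', 'è': 'e'}
)

def slugify_title(title):
    # Lowercase and map Norwegian letters to their Latin replacements.
    title = title.lower().translate(REPLACEMENTS)
    # Treat every run of characters other than a-z and 0-9 as one separator.
    title = re.sub(r'[^a-z0-9]+', '-', title)
    # Drop separators left at either end.
    return title.strip('-')

print(slugify_title('Det «mystiske» syreløselige kaliumet'))
print(slugify_title('Bekjemping av høymole etter 1., 2. eller 3. slått?'))
```

Because any letter outside the mapping also becomes a separator, the mapping would need extending if titles contain other accented letters.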

## Combine URL Prefixes and Titles

We combine each URL prefix with its titles in a loop and store the resulting URLs in a list. Each title is transformed first

In [None]:
articles_urls = []

for url_prefix, titles in nlr_articles_urls_json.items():
    for title in titles:
        if title:
            title = transform_title(title)
            article_url = url_prefix + '/' + title
            articles_urls.append(article_url)

In [None]:
len(articles_urls)

1024

Print URLs

In [None]:
articles_urls

['https://www.nlr.no/fagartikler/fornybar-energi/nord/fossilfri-maskinpark-muligheter-for-landbruket-og-status',
 'https://www.nlr.no/fagartikler/fornybar-energi/nord/stromproduksjon-fra-husdyrgjodsel-enkel-betraktning',
 'https://www.nlr.no/fagartikler/fornybar-energi/sor/hogaktuelt-med-energiproduksjon-pa-garden',
 'https://www.nlr.no/fagartikler/fornybar-energi/sor/bioenergi-er-lonnsomt',
 'https://www.nlr.no/fagartikler/fornybar-energi/vest/garden-som-energiprodusent',
 'https://www.nlr.no/fagartikler/fornybar-energi/vest/biogjodsel-av-husdyrgjodsel-og-matavfall-har-hog-gjodselverdi',
 'https://www.nlr.no/fagartikler/fornybar-energi/ostlandet/vindturbiner-et-framtidig-skue-pa-norske-gardstun',
 'https://www.nlr.no/fagartikler/fornybar-energi/ostlandet/status-solceller-er-det-fortsatt-lurt',
 'https://www.nlr.no/fagartikler/fornybar-energi/ostlandet/biogassproduksjon-pa-garden-den-nye-oljen',
 'https://www.nlr.no/fagartikler/frukt-og-baer/innlandet/fellefangst-av-jordbaersnutebille-

# Download Webpages

In [None]:
webpages = {}

start_time = time.time()
for article_url in articles_urls:

    article_page = requests.get(article_url, verify = False)
    article_page.encoding = article_page.apparent_encoding
    article_webpage = BeautifulSoup(article_page.text, 'html')

    webpages[article_url] = article_webpage

    # sometimes articles have version 2
    article_url = article_url + '-2'
    article_page = requests.get(article_url, verify = False)
    article_page.encoding = article_page.apparent_encoding
    article_webpage = BeautifulSoup(article_page.text, 'html')

    webpages[article_url] = article_webpage

print(f'execution time: {(time.time() - start_time) / 60} minutes')
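
The loop above repeats the same three download lines for the plain URL and for its "-2" variant. A possible refactoring is sketched below. `download_all` and `fake_fetch` are hypothetical names, the example URL is made up, and the fetcher is passed in as a parameter so that the sketch runs offline. In the notebook the fetcher could wrap `requests.get` (for example with its `timeout` argument) and `BeautifulSoup`.

```python
import time

def download_all(urls, fetch, delay_seconds=0.0):
    """Fetch every URL and its '-2' variant, pausing between requests."""
    pages = {}
    for url in urls:
        for candidate in (url, url + '-2'):
            pages[candidate] = fetch(candidate)
            # Optional pause to avoid sending requests back to back.
            time.sleep(delay_seconds)
    return pages

# Hypothetical stand-in for a real fetcher, so the sketch runs offline.
def fake_fetch(url):
    return f'<html>{url}</html>'

# Made-up article URL, used only to show the shape of the result.
pages = download_all(
    ['https://www.nlr.no/fagartikler/korn/midt/example-title'],
    fake_fetch,
)
print(list(pages))
```

Passing the fetcher as an argument also makes it easy to try the logic on a handful of URLs before starting a run that takes most of an hour.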



execution time: 45.52338623205821 minutes


Collect the URLs of pages that were not found. Such pages contain the following title

```
<title>Intern serverfeil har oppstått - NLR</title>
```



In [None]:
server_error_message = 'serverfeil'

active_webpages = {}
empty_webpages_urls = []

for url, webpage in webpages.items():

    webpage_text = webpage.getText()

    if server_error_message in webpage_text:
        empty_webpages_urls.append(url)

    else:
        active_webpages[url] = webpage

In [None]:
len(empty_webpages_urls)

1199

In [None]:
empty_webpages_urls

['https://www.nlr.no/fagartikler/fornybar-energi/nord/fossilfri-maskinpark-muligheter-for-landbruket-og-status-2',
 'https://www.nlr.no/fagartikler/fornybar-energi/nord/stromproduksjon-fra-husdyrgjodsel-enkel-betraktning-2',
 'https://www.nlr.no/fagartikler/fornybar-energi/sor/hogaktuelt-med-energiproduksjon-pa-garden-2',
 'https://www.nlr.no/fagartikler/fornybar-energi/sor/bioenergi-er-lonnsomt-2',
 'https://www.nlr.no/fagartikler/fornybar-energi/vest/garden-som-energiprodusent-2',
 'https://www.nlr.no/fagartikler/fornybar-energi/vest/biogjodsel-av-husdyrgjodsel-og-matavfall-har-hog-gjodselverdi-2',
 'https://www.nlr.no/fagartikler/fornybar-energi/ostlandet/vindturbiner-et-framtidig-skue-pa-norske-gardstun-2',
 'https://www.nlr.no/fagartikler/fornybar-energi/ostlandet/status-solceller-er-det-fortsatt-lurt-2',
 'https://www.nlr.no/fagartikler/fornybar-energi/ostlandet/biogassproduksjon-pa-garden-den-nye-oljen-2',
 'https://www.nlr.no/fagartikler/frukt-og-baer/innlandet/fellefangst-av-j

A page can be empty because version 2 of the article does not exist. It can also be empty because some URLs contain the value default in place of the region. We can check this case:

In [None]:
webpages_default_region = []

for empty_webpage_url in empty_webpages_urls:
    url = '/'.join([
        part if part not in regions else 'default' for part in empty_webpage_url.split('/')
        ])
    webpages_default_region.append(url)
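
The cell above replaces any path segment that matches a region name. A slightly stricter variant, sketched below, touches only the segment where the region is expected. `with_default_region` is a hypothetical helper and it assumes the URL structure described earlier, with the region as the second-to-last segment.

```python
def with_default_region(url, regions):
    parts = url.split('/')
    # In https://www.nlr.no/fagartikler/<category>/<region>/<title>
    # the region is the second-to-last segment.
    if parts[-2] in regions:
        parts[-2] = 'default'
    return '/'.join(parts)

regions = ['innlandet', 'midt', 'nord', 'sor', 'vest', 'ostlandet']
url = 'https://www.nlr.no/fagartikler/fornybar-energi/sor/bioenergi-er-lonnsomt-2'
print(with_default_region(url, regions))
```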

In [None]:
webpages_default_region

['https://www.nlr.no/fagartikler/fornybar-energi/default/fossilfri-maskinpark-muligheter-for-landbruket-og-status-2',
 'https://www.nlr.no/fagartikler/fornybar-energi/default/stromproduksjon-fra-husdyrgjodsel-enkel-betraktning-2',
 'https://www.nlr.no/fagartikler/fornybar-energi/default/hogaktuelt-med-energiproduksjon-pa-garden-2',
 'https://www.nlr.no/fagartikler/fornybar-energi/default/bioenergi-er-lonnsomt-2',
 'https://www.nlr.no/fagartikler/fornybar-energi/default/garden-som-energiprodusent-2',
 'https://www.nlr.no/fagartikler/fornybar-energi/default/biogjodsel-av-husdyrgjodsel-og-matavfall-har-hog-gjodselverdi-2',
 'https://www.nlr.no/fagartikler/fornybar-energi/default/vindturbiner-et-framtidig-skue-pa-norske-gardstun-2',
 'https://www.nlr.no/fagartikler/fornybar-energi/default/status-solceller-er-det-fortsatt-lurt-2',
 'https://www.nlr.no/fagartikler/fornybar-energi/default/biogassproduksjon-pa-garden-den-nye-oljen-2',
 'https://www.nlr.no/fagartikler/frukt-og-baer/default/fell

In [None]:
n_webpages_before = len(active_webpages)

In [None]:
for article_url in webpages_default_region:

    article_page = requests.get(article_url, verify = False)
    article_page.encoding = article_page.apparent_encoding
    article_webpage = BeautifulSoup(article_page.text, 'html')

    if server_error_message not in article_webpage.getText():
        active_webpages[article_url] = article_webpage



In [None]:
n_webpages_after = len(active_webpages)
n = n_webpages_after - n_webpages_before

print(f'{n} webpages had default value as region in url')

34 webpages had default value as region in url


# Parse Webpages

Print the first webpage

In [None]:
for url, webpage in active_webpages.items():
    print(url)
    print(webpage)
    break


https://www.nlr.no/fagartikler/fornybar-energi/nord/fossilfri-maskinpark-muligheter-for-landbruket-og-status
<!DOCTYPE html>
<html class="no-js" lang="no">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<link href="/assets/favicon/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="/assets/favicon/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="/assets/favicon/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<link href="/assets/favicon/site.webmanifest" rel="manifest"/>
<link color="#418459" href="/assets/favicon/safari-pinned-tab.svg" rel="mask-icon"/>
<meta content="#418459" name="msapplication-TileColor"/>
<meta content="/assets/favicon/browserconfig.xml" name="msapplication-config"/>
<meta content="#ffffff" name="theme-color"/>
<meta content="Norsk Landbruksrådgiving" name="apple-mobile-web-app

In [None]:
nlr_text = {}
for url, webpage in active_webpages.items():

    title = webpage.find(attrs = {'class' : 'p-hero__title'})
    description = webpage.find(attrs = {'class' : 'p-hero__text'})

    if title:
        title_text = title.get_text()

        main_text = []

        if description:
            description = description.get_text()

            main_text.append(description)

        main_content = webpage.find(attrs = {'id' : 'mainContent'})

        for paragraph in main_content.find_all(['p']):

            paragraph_text = paragraph.get_text()

            if paragraph_text:
                main_text.append(paragraph_text)

        nlr_text[title_text] = ' '.join(main_text)

In [None]:
nlr_text['Bioenergi er lønnsomt']

'\nDet har vært skrevet metervis av artikler og fagskrifter om det grønne skifte. Alt er grønt, og et av de viktigste tiltakene for Norge skal komme fra den grønne skogen. Det er sant, men det skjer også i dag, halvparten av CO2 utslippet på fastlands Norge blir tatt opp av skogen. Videre mål blir å redusere CO2 utslippet. Den enkleste og mest lønnsomme måten man gjør det på er å velge bioenergi som oppvarmingskilde. Nå har alle gårdeiere i Sør-Norge mulighet til å få gratis befaring og hjelp til å se på mulighetene på egen gård.\n\nEiere av landbrukseiendommer har et kjempefortrinn. De kan være selvforsynt med energi. En gård kan hente varmen fra bioenergi hentet i egen skog, og låvetaket er velegnet for solcellepaneler. Og med dette spare energikostnader. Du blir ikke lenger påvirket av høye nettleier eller høye strømpriser på gitte tider av døgnet. I snitt betaler du kanskje bare 50 øre/kWh for biovarme. Merkostnaden ved å bruke litt mer flis altså øke varmebehovet er ikke mer enn 2

In [None]:
len(nlr_text)

707

# Empty URL Analyses

We can see that out of 1024 URLs, 707 were not empty. But what about the rest of the URLs? We can check why the other URLs led to an empty page.

* I printed the URLs of the empty pages
* I opened a few of these URLs and noted the category and the region
* I selected that category and region in the filter on the fagartikler page and found the corresponding title

Here are my observations:

In the following case the article is listed under the ostlandet region, but the working URL references the midt region

```
# working url
https://www.nlr.no/fagartikler/klima/midt/kva-er-eigentleg-nlr-klima-forsterad

# our url
https://www.nlr.no/fagartikler/klima/ostlandet/kva-er-eigentleg-nlr-klima-forsterad
```

A similar case: the article is listed under innlandet, but the working URL references the ostlandet region

```
# working url
https://www.nlr.no/fagartikler/frukt-og-baer/ostlandet/6-fordeler-med-fangvekster-og-noen-utfordringer

# our url
https://www.nlr.no/fagartikler/frukt-og-baer/innlandet/6-fordeler-med-fangvekster-og-noen-utfordringer
```


In the following few cases the title is modified in the URL

```
# working url
https://www.nlr.no/fagartikler/klima/nord/klimapavirkning-for-vekstsesongen-2020-notat

# title
Klimapåvirkning for vekstsesongen 2020
```


```
# working url
https://www.nlr.no/fagartikler/klima/sor/klimadrypp-nr-7-karbonkretslopet

# title
Karbonkretsløpet
```



```
# working url
https://www.nlr.no/fagartikler/klima/ostlandet/poteten-selv-en-klimavinner-kan-bli-bedre

# title
Selv en klimavinner kan bli bedre!
```






We can see that these cases are exceptions to the rules we observed earlier. Therefore it is hard to generate working URLs automatically
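
The region mismatches, at least, could be probed by trying the same category and title under every other region. The sketch below only generates the candidate URLs and leaves the requests out; `region_candidates` is a hypothetical helper. It does not cover the cases where the title itself is modified.

```python
REGIONS = ['innlandet', 'midt', 'nord', 'sor', 'vest', 'ostlandet']

def region_candidates(url, regions=REGIONS):
    """Return the same article URL under every other region, plus 'default'."""
    parts = url.split('/')
    original_region = parts[-2]
    candidates = []
    for region in regions + ['default']:
        if region != original_region:
            candidates.append('/'.join(parts[:-2] + [region, parts[-1]]))
    return candidates

failing = 'https://www.nlr.no/fagartikler/klima/ostlandet/kva-er-eigentleg-nlr-klima-forsterad'
for candidate in region_candidates(failing):
    print(candidate)
```

Each failing URL yields six candidates, so probing all of them would add a noticeable number of requests to an already long run.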

# Output

In [None]:
output_text_data = json.dumps(nlr_text, ensure_ascii = False, indent = 4)

# write as UTF-8 because ensure_ascii = False keeps Norwegian letters as-is
with open('nlr_text_data.json', 'w', encoding = 'utf-8') as f:
    f.write(output_text_data)

files.download('nlr_text_data.json')
