# Scraping the web to get texts in Souletin Basque

This humble project aims to collect present-day texts written in Souletin Basque. 

Souletin is a minority variety of Basque with a rich literary tradition. Today it can be considered endangered. Xiberoko Botza is a free radio station whose website publishes news written in Souletin Basque.


Let us import the `requests` module, which is needed to send HTTP requests:

In [1]:
import requests
xb = 'https://xiberokobotza.org'
session = requests.Session()  # one session, reused for every request below
x = session.get(xb)
x.status_code

200

First of all, I need to be sure about the encoding of the website. 

NB: Modern Souletin spelling uses the grapheme <ü>, which causes many problems if the wrong encoding is chosen. UTF-8 allows us to work with Souletin spelling.

In [2]:
x.encoding

'utf-8'
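To illustrate why the encoding matters, here is a minimal sketch that decodes the same bytes with the right and with the wrong codec. The word is taken from the site's own text; Latin-1 is only an assumed example of a wrong encoding, not something the site actually uses.

```python
# A word from the site that contains the Souletin grapheme <ü>.
word = 'Üngüramena'

# Encode to bytes, as a server declaring UTF-8 would send them.
raw = word.encode('utf-8')

# Decoding with the right codec restores the original spelling.
decoded_ok = raw.decode('utf-8')

# Decoding with the wrong codec (Latin-1 here) raises no error,
# but every <ü> turns into two unrelated characters.
decoded_bad = raw.decode('latin-1')

print(decoded_ok)
print(repr(decoded_bad))
```

The wrong codec fails silently, so it is worth checking `x.encoding` before saving any text.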

Below I want to know the HTML content of the page:

In [3]:
x.content;  # The trailing semicolon suppresses the very long output.

Now I want to obtain the headers of the site:

In [4]:
x.headers

{'Server': 'nginx', 'Date': 'Tue, 24 Nov 2020 09:16:45 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Powered-By': 'PHP/7.0.33, PleskLin', 'Expires': 'Wed, 17 Aug 2005 00:00:00 GMT', 'Cache-Control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Pragma': 'no-cache', 'Set-Cookie': '6a56f77a457edbe1f83039fafdb0f05b=8cunlvlv2tv357hps4lqph6en1; path=/; secure; HttpOnly', 'X-Content-Type-Options': 'nosniff', 'Last-Modified': 'Tue, 24 Nov 2020 09:16:45 GMT'}
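The `Content-Type` header above also declares the charset, so it can be used to cross-check the encoding. The following is a small sketch using only the standard library; the header value is copied by hand from the output above rather than read from a live response.

```python
from email.message import Message

# Header value copied from the response headers shown above.
content_type = 'text/html; charset=utf-8'

# The standard library can parse the parameters of a Content-Type header.
msg = Message()
msg['Content-Type'] = content_type
declared_charset = msg.get_content_charset()

print(declared_charset)
```

With a live response, the same value could be read from `x.headers['Content-Type']`.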

Before downloading anything from the website, it is worth taking some precautions so as not to get blocked.

I introduce a pause between requests:


In [5]:
import time

for _ in range(5):
    response = session.get(xb)
    print(response.headers['Date'])
    time.sleep(3)

Tue, 24 Nov 2020 09:16:46 GMT
Tue, 24 Nov 2020 09:16:50 GMT
Tue, 24 Nov 2020 09:16:54 GMT
Tue, 24 Nov 2020 09:16:58 GMT
Tue, 24 Nov 2020 09:17:02 GMT


I also present myself as an ordinary browser by sending a random User-Agent header:

In [6]:
from fake_useragent import UserAgent
ua = UserAgent(verify_ssl=False)

headers = {'User-Agent': ua.random}
print(headers)
response = session.get(xb, headers=headers)

{'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.0; ja-JP) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27'}


Again, a pause between requests (random time):


In [7]:
import random
for _ in range(5):
    response = session.get(xb)
    print(response.headers['Date'])
    # Sleep for a random interval so the requests are not evenly spaced.
    time.sleep(random.uniform(1.1, 5.2))

Tue, 24 Nov 2020 09:17:07 GMT
Tue, 24 Nov 2020 09:17:12 GMT
Tue, 24 Nov 2020 09:17:16 GMT
Tue, 24 Nov 2020 09:17:19 GMT
Tue, 24 Nov 2020 09:17:23 GMT
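The two precautions above (a random pause and a browser-like header) can be wrapped in one helper. This is only a sketch: `fetch_politely` and its parameters are hypothetical names, and the download itself is passed in as a function so that the helper makes no assumption about how each page is fetched.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=1.1, max_delay=5.2, sleep=time.sleep):
    """Call fetch(url) for each URL, sleeping a random time between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Pause only between requests, not before the first one.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

With the objects defined earlier, a call could look like `fetch_politely([xb], lambda u: session.get(u, headers=headers))`.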


Here I wanted to work through a proxy, but it did not work:

In [8]:
# known_proxy_ip = 'http://180.246.205.208:57648'
# proxy = {'http': known_proxy_ip, 'https': known_proxy_ip}
# response = requests.get(xb, proxies=proxy)
# print(response.headers)

# Parsing the web

This code imports the BeautifulSoup library, parses the page, and can display its content as a nested tree of tags:

In [18]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(x.content, 'html.parser')
# print(soup.prettify())

In [10]:
list(soup.children);

In [11]:
# The number in brackets is the position of an element among the top-level children of the soup.
# Once we know the position of the element that holds the text we want, we can get it directly.
list(soup.children)[22]

<div aria-label="next arrow" class="n2-ss-widget n2-ss-widget-display-desktop n2-ss-widget-display-tablet n2-ss-widget-display-mobile n2-style-0ef3c74dd87a58a89eb483c67447d002-heading nextend-arrow n2-ow nextend-arrow-next nextend-arrow-animated-fade n2-ib" data-ssright="0+15" data-sstop="height/2-nextheight/2" id="n2-ss-2-arrow-next" role="button" style="position: absolute;" tabindex="0"><img alt="next arrow" class="n2-ow" data-hack="data-lazy-src" data-no-lazy="1" src="

In [12]:
# And here we get text. 
some_text = list(soup.children)[14]
some_text.get_text()

'AZKEN BERRIAKPAC berri baten egiteko bidean dira 2023ko!2020-11-16 - Laborantxa / Üngüramena\nPAC berria, Europako laborantxa lagüntzak, sekülan beno berdeago\xa0: komünikazione faltsüa ote da\xa0?\nHori dü salhatzen ELB sindikatüak, ber zentzüan Eüskal Herriko FDSEA sindikatüa inkiet da bortüko laborantxa süstengatzen düen lagüntzen geroarentako.\nPAC lagüntzen parte handi bat lür eremüen arabera eta kabale heinaren arabera kalkültaürik da, sos laguntzen %\xa080 a laborarien % 20ak hunkitzen dü.\n2023 ko PAC-a mementoan eztabidan da, erreformaren\xa0herroka handiak urtarilan elkiko dira.\nBenoit Tauzin laborariak erraiten deikü bere etxaltean zertangainen PAC-a hunkitzen düan eta zer dion hortaz:\n\xa0\n\n<p class="n2-font-84674749537f8c78f9216705fcfdb018-paragraph  n2-style-2e69df52557e65e0b420ac781d38f41a-heading  n2-ow">.mejs-container {max-width: 300px;}</p>\n<p class="n2-font-84674749537f8c78f9216705fcfdb018-paragraph  n2-style-2e69df52557e65e0b420ac781d38f41a-heading  n2-ow">.m

## Finding all instances of a tag at once

What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple. If we want to extract a single tag, we can instead use the find_all method, which will find all the instances of a tag on a page.


In [13]:
# The 'p' tag marks a paragraph of text. The result still contains the tags, so it needs cleaning.
soup.find_all('p');

In [14]:
soup.find_all('p')[15].get_text()

'Begiatürik den hitz ordü bakotxa azaroaren 27an izanen da xiberoko botzan'

In [15]:
# And here we add a condition:
soup.find_all('p', class_='n2-font-84674749537f8c78f9216705fcfdb018-paragraph');

The code above gives me the text I am interested in, but still wrapped in tags with very long class names.

I decided to save the output of the code above to a .txt file and then clean it with regular expressions (RE):

In [16]:
dirty_text = soup.find_all('p', class_='n2-font-84674749537f8c78f9216705fcfdb018-paragraph')
%store dirty_text >xb_dirty.txt

Writing 'dirty_text' (ResultSet) to file 'xb_dirty.txt'.


Now I clean the previously saved text with a plain string replacement and write the result back to the same file:

In [17]:
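# Read the saved output back; this assumes the file was written as UTF-8, so that <ü> survives.
with open('xb_dirty.txt', 'r', encoding='utf-8') as file:
    filedata = file.read()

# Remove the opening <p ...> tag; the closing </p> tags are left in place.
filedata = filedata.replace('<p class="n2-font-84674749537f8c78f9216705fcfdb018-paragraph n2-style-2e69df52557e65e0b420ac781d38f41a-heading n2-ow">', '')

# Write the file out again
with open('xb_dirty.txt', 'w', encoding='utf-8') as file:
    file.write(filedata)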
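A regular expression can remove every tag at once, whatever its class names are. The following is a minimal sketch rather than a robust HTML cleaner: `strip_tags` is a hypothetical helper, the sample string is made up (with a shortened class name), and the pattern would misbehave on attributes that contain a `>` character.

```python
import html
import re

# Matches any HTML tag, whatever its attributes are.
TAG_RE = re.compile(r'<[^>]+>')


def strip_tags(markup):
    """Remove HTML tags, decode entities and collapse whitespace."""
    text = TAG_RE.sub(' ', markup)
    text = html.unescape(text)
    # \s also matches the non-breaking space (\xa0) seen in the site's text.
    return re.sub(r'\s+', ' ', text).strip()


# Hypothetical sample shaped like the saved paragraphs (class name shortened).
sample = '<p class="n2-font-paragraph n2-ow">Hori dü salhatzen ELB sindikatüak</p>'
print(strip_tags(sample))
```

Applied to the contents of `xb_dirty.txt`, this would also drop the closing `</p>` tags.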

# My apologies.

I wanted to do some preprocessing of the Souletin text, but I did not manage to. On the web I found a good lemmatizer for Basque (https://github.com/ixa-ehu/ixa-pipe-pos), but it was not developed to work with Python.

Then I found a lemmatizer (https://nlp.johnsnowlabs.com/2020/07/29/lemma_eu.html) that does work with Python, but I ran into problems when downloading it.
