Here, we prepare a sample from MaCoCu-is, to which the genre classifiers will be applied. To prepare the sample, I first discard all texts that are 75 words long or shorter and collect the domain and URL of every remaining text. I then count the texts per domain and discard domains with fewer than 10 texts (otherwise, the median would be 6 texts per domain). Finally, I calculate the median, randomly take 500 domains above it and 500 domains below it, and sample 10 texts from each selected domain.

In [1]:
import gzip
import wget
import regex as re
import pandas as pd
import numpy as np
import json
import random
from tqdm import tqdm

In [2]:
# Compile regex for url and domain
url_re = re.compile('url="(.*?)"')
domain_re = re.compile('domain="(.*?)"')
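
As a quick check, the two patterns can be tried on a single header line. The `<doc ...>` line below is a made-up example (the attribute values and their order are assumptions, not copied from the corpus), and the sketch uses the standard `re` module, which behaves the same as `regex` for these simple patterns.

```python
import re  # the notebook itself imports the API-compatible `regex` package

url_re = re.compile('url="(.*?)"')
domain_re = re.compile('domain="(.*?)"')

# Hypothetical header line; all attribute values are invented for illustration.
header = '<doc id="example.1" title="Example" domain="example.is" url="https://example.is/page">'

url_match = url_re.search(header)
domain_match = domain_re.search(header)

# search() returns None when an attribute is missing, so guard before .group()
current_url = url_match.group(1) if url_match else None
current_domain = domain_match.group(1) if domain_match else None

print(current_domain, current_url)
```

The loops in this notebook call `.group(1)` directly, which works as long as every `<doc` line carries both attributes.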

## Download and open the XML file

In [3]:
# Download the corpus

# Define the URL of the gzipped corpus file
url = "https://www.clarin.si/repository/xmlui/bitstream/handle/11356/1518/MaCoCu-is.xml.gz"

# Downloading the file by sending the request to the URL
corpus_file = wget.download(url)
print('Downloading Completed')

Downloading Completed


In [3]:
file = gzip.open('MaCoCu-is.xml.gz', 'rt', encoding='utf-8')

## Create a dataframe with the most frequent domains and a sample of 10 URLs from each domain

First, get a list of the domains and URLs of all texts that are more than 75 words long.

In [5]:
text_counter = 0

texts = []

for line in tqdm(file):
	if line.startswith("<doc"):
		current_text = []
		pure_text = ""
		text_length = 0
		current_url = ""
		current_domain = ""
		current_url = url_re.search(line).group(1)
		current_domain = domain_re.search(line).group(1)
	elif line.startswith("<p"):
		continue
	elif line.startswith("</p"):
		continue
	elif line.startswith("</doc"):
		text_length = len(pure_text.split())
		if text_length > 75:
			current_text = [current_domain, current_url]
			texts.append(current_text)
			text_counter += 1
	elif line.startswith("<corpus"):
		continue
	elif line.startswith("</corpus"):
		continue
	else:
		pure_text += line

51972334it [02:43, 316944.24it/s]
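
The filtering logic of the pass above can be tried on a few in-memory lines before running it on the full corpus. The sketch below wraps the same branching in a generator; the function name `iter_long_docs`, its `min_words` parameter, and the toy lines are assumptions made for illustration. It counts words line by line instead of concatenating the text, which gives the same word count.

```python
import re

url_re = re.compile('url="(.*?)"')
domain_re = re.compile('domain="(.*?)"')

# Markup lines that carry no running text (same prefixes as in the loop above)
SKIP_PREFIXES = ("<p", "</p", "<corpus", "</corpus")


def iter_long_docs(lines, min_words=75):
    """Yield (domain, url, n_words) for documents with more than min_words words."""
    domain = url = None
    n_words = 0
    for line in lines:
        if line.startswith("<doc"):
            # New document: read the header attributes and reset the counter
            url = url_re.search(line).group(1)
            domain = domain_re.search(line).group(1)
            n_words = 0
        elif line.startswith("</doc"):
            if n_words > min_words:
                yield domain, url, n_words
        elif line.startswith(SKIP_PREFIXES):
            continue
        else:
            n_words += len(line.split())


# Toy input: one document with 80 words and one with a single word
toy_lines = [
    '<corpus id="toy">\n',
    '<doc id="1" domain="a.is" url="https://a.is/long">\n',
    '<p id="1">\n',
    "orð " * 80 + "\n",
    "</p>\n",
    "</doc>\n",
    '<doc id="2" domain="b.is" url="https://b.is/short">\n',
    '<p id="1">\n',
    "stutt\n",
    "</p>\n",
    "</doc>\n",
    "</corpus>\n",
]

long_docs = list(iter_long_docs(toy_lines))
print(long_docs)
```

Only the 80-word document passes the filter, because the comparison is strictly greater than 75.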


In [6]:
# Create a dataframe
df = pd.DataFrame({"domain": [x[0] for x in texts], "url": [x[1] for x in texts]})

df.head()

Unnamed: 0,domain,url
0,1.is,https://1.is/
1,7.is,https://7.is/
2,8.is,https://www.8.is/
3,vb.is,https://www.vb.is/
4,fa.is,https://www.fa.is/


In [7]:
# Sort the df based on the domain
df = df.sort_values("domain")

df.head()

Unnamed: 0,domain,url
1038994,002galleri.blogspot.com,http://002galleri.blogspot.com/2014/07/stasetn...
113353,002galleri.blogspot.com,http://002galleri.blogspot.com/
151324,020262.blog.is,https://020262.blog.is/blog/020262/
507738,02johann.wixsite.com,https://02johann.wixsite.com/lokaverkefni/heim...
766658,02johann.wixsite.com,https://02johann.wixsite.com/lokaverkefni/ad-b...


In [8]:
df.describe(include="all")

Unnamed: 0,domain,url
count,1207915,1207915
unique,14025,1207744
top,visir.is,https://www.dalvikurbyggd.is/is/frettir?page=295
freq,34272,2


We got 1,207,915 texts (1,207,744 unique URLs) in 14,025 domains.

In [9]:
df.domain.value_counts().to_dict()

{'visir.is': 34272,
 'hugi.is': 26494,
 'vb.is': 22817,
 'sarpur.is': 17137,
 'austurfrett.is': 14261,
 'baekur.is': 14028,
 'timarit.is': 13375,
 'hlad.is': 13257,
 'kjarninn.is': 12847,
 'eyjar.net': 12104,
 'visindavefur.is': 11644,
 'baggalutur.is': 9516,
 'fiskifrettir.is': 9484,
 'stundin.is': 9327,
 'eldri.reykjavik.is': 8598,
 'dv.is': 8297,
 'kvikmyndir.is': 7841,
 'ruv.is': 7630,
 'skemman.is': 7612,
 'ogmundur.is': 7447,
 'frettabladid.is': 7309,
 'spjall.kvartmila.is': 7185,
 'pjatt.is': 7137,
 'althingi.is': 6579,
 'midjan.is': 6369,
 'omarragnarsson.blog.is': 6133,
 'fararheill.is': 5948,
 'nature.is': 5858,
 'sykur.is': 5769,
 'liverpool.is': 5727,
 'sudurnes.net': 5710,
 'heimaslod.is': 5590,
 'jeppaspjall.is': 5528,
 'blog.dv.is': 5143,
 'fib.is': 5052,
 'stjornarradid.is': 4814,
 'raudikrossinn.is': 4645,
 'biblian.is': 4551,
 'bjorn.is': 4438,
 'spjall.vaktin.is': 4202,
 'reglugerd.is': 4095,
 'gamli.rvk.is': 4042,
 'laeknabladid.is': 4003,
 'hedinsfjordur.is': 3934,

In [10]:
# Calculate domain distribution
domain_distribution = df.domain.value_counts().rename_axis("domain").reset_index(name="frequency")
domain_distribution

Unnamed: 0,domain,frequency
0,visir.is,34272
1,hugi.is,26494
2,vb.is,22817
3,sarpur.is,17137
4,austurfrett.is,14261
...,...,...
14020,hfh.is,1
14021,songbok.web.is,1
14022,nmm.is,1
14023,hfp.kopavogur.is,1


In [11]:
# Discard domains with fewer than 10 texts
domain_distribution = domain_distribution[domain_distribution["frequency"] > 9]

domain_distribution.shape

(5431, 2)

I first discarded domains with fewer than 10 texts and only then calculated the median. 5,431 domains remained.

In [12]:
# Find the median
domain_distribution.frequency.describe()

count     5431.000000
mean       217.582213
std       1038.590171
min         10.000000
25%         17.000000
50%         35.000000
75%        100.000000
max      34272.000000
Name: frequency, dtype: float64

The median number of texts per domain is 35. I'll split the domains into two groups based on the median and take 500 domains with a frequency below it and 500 domains with a frequency above it. Because both comparisons are strict, domains with exactly 35 texts fall into neither group.
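
The effect of splitting with strict inequalities can be illustrated on toy data: rows whose frequency equals the median end up in neither half. The frame and its values below are invented for illustration.

```python
import pandas as pd

# Toy frequency table; the values are invented for illustration
toy = pd.DataFrame({
    "domain": list("abcdefg"),
    "frequency": [10, 12, 20, 20, 20, 50, 90],
})

median = toy["frequency"].median()

above = toy[toy["frequency"] > median]   # strictly above the median
below = toy[toy["frequency"] < median]   # strictly below the median
tied = toy[toy["frequency"] == median]   # excluded from both halves

print(median, len(above), len(below), len(tied))
```

Here three of the seven rows sit exactly on the median, so the two halves together cover only four rows.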

In [13]:
domain_distribution.head()

Unnamed: 0,domain,frequency
0,visir.is,34272
1,hugi.is,26494
2,vb.is,22817
3,sarpur.is,17137
4,austurfrett.is,14261


In [14]:
top_domain_distribution = domain_distribution[domain_distribution["frequency"] > 35]
top_domain_distribution.shape

(2698, 2)

In [15]:
bottom_domain_distribution = domain_distribution[domain_distribution["frequency"] < 35]
bottom_domain_distribution.shape

(2697, 2)

In [16]:
# Take a random sample from top and random sample from bottom
top_domain_distribution = top_domain_distribution.sample(n=500)
top_domain_distribution.shape

(500, 2)
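
The `sample` calls in this section draw a different set of domains on every run. If a repeatable draw is wanted, `DataFrame.sample` accepts a `random_state` argument. The seed value 42 and the toy frame below are assumptions for illustration; they were not used to produce the outputs shown in this notebook.

```python
import pandas as pd

# Toy stand-in for the table of domains above the median
toy = pd.DataFrame({
    "domain": [f"d{i}.is" for i in range(20)],
    "frequency": range(36, 56),
})

# The same seed gives the same rows on every call
first = toy.sample(n=5, random_state=42)
second = toy.sample(n=5, random_state=42)

print(first.equals(second))
```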

In [17]:
top_domain_distribution["frequency"].describe()

count      500.000000
mean       481.548000
std       1826.161286
min         36.000000
25%         59.000000
50%        109.000000
75%        252.500000
max      26494.000000
Name: frequency, dtype: float64

In [18]:
bottom_domain_distribution = bottom_domain_distribution.sample(n=500)
bottom_domain_distribution.head()

Unnamed: 0,domain,frequency
3647,old.eldsveitir.is,21
5159,skeljungur.is,11
3199,harabanar.com,26
3431,uniqueart.is,23
3387,lysi.is,24


In [19]:
# Join the dataframes
domain_sample = pd.concat([top_domain_distribution, bottom_domain_distribution])
domain_sample.describe(include="all")

Unnamed: 0,domain,frequency
count,1000,1000.0
unique,1000,
top,motivm.is,
freq,1,
mean,,249.906
std,,1311.296635
min,,10.0
25%,,17.0
50%,,35.0
75%,,109.0


In [20]:
# Create a list of domains
domain_sample_list = domain_sample.domain.to_list()
domain_sample_list[:10]

['motivm.is',
 'hox.is',
 'yehi.is',
 'emobi.is',
 'gildi.is',
 'sigsig.blog.is',
 'barky.blog.is',
 'logos.blog.is',
 'heilsugaeslan.is',
 'rustikus.blog.is']

In [21]:
# For each domain, sample 10 texts from the initial dataframe

# First create the initial df to which all others in the loop will be added
final_sample = df[df["domain"] == domain_sample_list[0]].sample(n=10)

final_sample

Unnamed: 0,domain,url
64094,motivm.is,http://www.motivm.is/pictures/104
249612,motivm.is,http://motivm.is/blog/yearmonth/2006/09
55017,motivm.is,http://motivm.is/pictures/13
55920,motivm.is,http://motivm.is/pictures/49
52964,motivm.is,http://www.motivm.is/pictures/21
52980,motivm.is,http://www.motivm.is/pictures/57
51748,motivm.is,http://www.motivm.is/pictures/74
48987,motivm.is,http://www.motivm.is/pictures/14
51738,motivm.is,http://motivm.is/pictures/73
66213,motivm.is,http://www.motivm.is/pictures/114


In [22]:
# Add all other domains
remaining_list = domain_sample_list[1:]

for i in remaining_list:
	added_instances = df[df["domain"] == i].sample(n=10)
	final_sample = pd.concat([final_sample, added_instances])

final_sample.shape

(10000, 2)
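
As an alternative to concatenating inside a loop, the per-domain draw can be written as a single grouped operation. This is a sketch on toy data, assuming a pandas version that provides `DataFrameGroupBy.sample` (1.1 or later); the frame, the domain list, and the seed are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the full dataframe of domains and URLs
toy_df = pd.DataFrame({
    "domain": ["a.is"] * 5 + ["b.is"] * 4 + ["c.is"] * 6,
    "url": [f"https://example.is/{i}" for i in range(15)],
})

# Toy stand-in for the list of sampled domains
toy_domains = ["a.is", "c.is"]

# Keep only the selected domains, then draw 3 rows from each of them
toy_sample = (
    toy_df[toy_df["domain"].isin(toy_domains)]
    .groupby("domain")
    .sample(n=3, random_state=0)
)

print(toy_sample["domain"].value_counts().to_dict())
```

Like the loop, this raises an error if a domain has fewer rows than requested, which cannot happen here because every retained domain has at least 10 texts.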

In [23]:
final_sample.describe()

Unnamed: 0,domain,url
count,10000,10000
unique,1000,10000
top,motivm.is,http://www.motivm.is/pictures/104
freq,10,1


The final sample has 10,000 instances from 1,000 domains.

In [24]:
# Save the df
final_sample.to_csv("MaCoCu-is-sample-domains-and-urls.csv", sep="\t")

## Extract the text from the XML based on the URL list

In [4]:
# Open the df with domains and urls for the sample
final_sample = pd.read_csv("MaCoCu-is-sample-domains-and-urls.csv", sep="\t", index_col = 0)

final_sample.head(2)

Unnamed: 0,domain,url
64094,motivm.is,http://www.motivm.is/pictures/104
249612,motivm.is,http://motivm.is/blog/yearmonth/2006/09


In [5]:
final_sample.shape

(10000, 2)

In [6]:
# Create a list of urls in the sample
url_list = list(final_sample["url"].unique())
url_list[:10]

['http://www.motivm.is/pictures/104',
 'http://motivm.is/blog/yearmonth/2006/09',
 'http://motivm.is/pictures/13',
 'http://motivm.is/pictures/49',
 'http://www.motivm.is/pictures/21',
 'http://www.motivm.is/pictures/57',
 'http://www.motivm.is/pictures/74',
 'http://www.motivm.is/pictures/14',
 'http://motivm.is/pictures/73',
 'http://www.motivm.is/pictures/114']

In [27]:
len(url_list)

10000

In [7]:
# Now that I have the URL list, I will extract the sampled texts from MaCoCu-is.xml.gz based on it.

# A set makes each membership test constant-time instead of scanning the whole list
url_set = set(url_list)

text_all_counter = 0

texts_all = []

for line in tqdm(file):
	if line.startswith("<doc"):
		# New document: reset the text buffer and read the header attributes
		pure_text = ""
		current_url = url_re.search(line).group(1)
		current_domain = domain_re.search(line).group(1)
		text_string = line
	elif line.startswith("<p"):
		text_string += line
	elif line.startswith("</p"):
		text_string += line
	elif line.startswith("</doc"):
		text_string += line
		if current_url in url_set:
			# Keep the plain text and the full document markup
			texts_all.append([current_domain, current_url, pure_text, text_string])
			text_all_counter += 1
	elif line.startswith("<corpus"):
		continue
	elif line.startswith("</corpus"):
		continue
	else:
		text_string += line
		pure_text += line

51972334it [04:54, 176291.14it/s]


In [9]:
text_all_counter

10000
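
A matching count suggests that every sampled URL was found, but it would also match if one URL were extracted twice and another missed, which is possible in principle because a few URLs occur more than once in the full list (1,207,915 rows, 1,207,744 unique URLs). A more direct check compares the two sets of URLs. The sketch below uses small invented stand-ins for `url_list` and `texts_all`.

```python
# Toy stand-ins for the notebook variables; all values are invented
url_list = ["https://a.is/1", "https://a.is/2", "https://b.is/1"]
texts_all = [
    ["a.is", "https://a.is/1", "text", "<doc ...>"],
    ["b.is", "https://b.is/1", "text", "<doc ...>"],
]

# The URL is the second element of each extracted record
found_urls = {row[1] for row in texts_all}

# URLs that were requested but never extracted
missing_urls = sorted(set(url_list) - found_urls)

print(missing_urls)
```

An empty list, together with a record count equal to the number of URLs, means each sampled URL was extracted exactly once.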

In [10]:
# Create a dataframe out of the text file

df_long_texts = pd.DataFrame({"domain": [x[0] for x in texts_all], "url": [x[1] for x in texts_all], "text": [x[2] for x in texts_all],"doc": [x[3] for x in texts_all]})

df_long_texts.head()

Unnamed: 0,domain,url,text,doc
0,ft.is,https://ft.is/,Félag Tölvunarfræðinga\nMarkmið félagsins er a...,"<doc id=""macocu.is.97"" title=""Félag Tölvunarfr..."
1,pfi.is,http://www.pfi.is/,Stefnt er að því að halda þessa fundi í septem...,"<doc id=""macocu.is.142"" title=""Póstmannafélag ..."
2,sfa.is,https://sfa.is/,Samtök forstöðumanna almenningsbókasafna\nSamt...,"<doc id=""macocu.is.143"" title=""Samtök forstöðu..."
3,hrt.is,https://hrt.is/,Hreinsun gatna og gangstétta\nHreinsitækni er ...,"<doc id=""macocu.is.179"" title=""Hreinsitækni eh..."
4,kfh.is,https://kfh.is/,"Kæru félagar, við viljum vekja athygli á að sa...","<doc id=""macocu.is.202"" title=""KFH | Kristileg..."


In [11]:
df_long_texts.describe(include="all")

Unnamed: 0,domain,url,text,doc
count,10000,10000,10000,10000
unique,1000,10000,10000,10000
top,ft.is,https://ft.is/,Félag Tölvunarfræðinga\nMarkmið félagsins er a...,"<doc id=""macocu.is.97"" title=""Félag Tölvunarfr..."
freq,10,1,1,1


In [12]:
df_long_texts.domain.value_counts()

ft.is                   10
grodrarstod.is          10
gudfinnur.blog.is       10
skagafrettir.is         10
hreysti.is              10
                        ..
unnurkaren.com          10
steingerdur.com         10
salud68.blogspot.com    10
al.is                   10
bilasolur.is            10
Name: domain, Length: 1000, dtype: int64

In [13]:
# Add information on length
df_long_texts["length"] = df_long_texts["text"].str.split().str.len()

df_long_texts.head()

Unnamed: 0,domain,url,text,doc,length
0,ft.is,https://ft.is/,Félag Tölvunarfræðinga\nMarkmið félagsins er a...,"<doc id=""macocu.is.97"" title=""Félag Tölvunarfr...",255
1,pfi.is,http://www.pfi.is/,Stefnt er að því að halda þessa fundi í septem...,"<doc id=""macocu.is.142"" title=""Póstmannafélag ...",304
2,sfa.is,https://sfa.is/,Samtök forstöðumanna almenningsbókasafna\nSamt...,"<doc id=""macocu.is.143"" title=""Samtök forstöðu...",421
3,hrt.is,https://hrt.is/,Hreinsun gatna og gangstétta\nHreinsitækni er ...,"<doc id=""macocu.is.179"" title=""Hreinsitækni eh...",244
4,kfh.is,https://kfh.is/,"Kæru félagar, við viljum vekja athygli á að sa...","<doc id=""macocu.is.202"" title=""KFH | Kristileg...",3156


In [14]:
df_long_texts["length"].describe()

count    10000.000000
mean       582.432400
std       1707.816705
min         76.000000
25%        126.000000
50%        225.000000
75%        485.000000
max      79573.000000
Name: length, dtype: float64

In [15]:
# Save the final sample
df_long_texts.to_csv("MaCoCu-is-sample.csv", sep="\t")