Here, I prepare a sample from MaCoCu-sl to which the genre classifiers will be applied. First, I discard all texts that are 75 words long or shorter and build a list of the domains and URLs of the remaining texts. Then I calculate the number of texts per domain and discard domains with fewer than 10 texts (without this step, the median would be only 6 texts per domain). Finally, I calculate the median of the remaining domain frequencies and randomly sample 500 domains above the median and 500 domains below it.

In [1]:
import gzip
import wget
import regex as re
import pandas as pd
import numpy as np
import json
import random
from tqdm import tqdm

In [2]:
# Compile regex for url and domain
url_re = re.compile('url="(.*?)"')
domain_re = re.compile('domain="(.*?)"')
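To illustrate what the two patterns capture, here is a minimal sketch applied to a single header line. The `<doc>` line below is made up for illustration and is not taken from the corpus; the names `demo_url_re`, `demo_domain_re` and `sample_line` are hypothetical, and the standard-library `re` module is used instead of `regex` so that the example runs without extra dependencies.

```python
import re

# Same patterns as in the cell above, compiled under separate (hypothetical) names
demo_url_re = re.compile('url="(.*?)"')
demo_domain_re = re.compile('domain="(.*?)"')

# Made-up <doc> header line; real corpus headers may carry more attributes
sample_line = '<doc id="example.1" title="Example" url="https://www.example.si/page" domain="example.si">'

# The non-greedy group (.*?) stops at the first closing quote after the attribute name
sample_url = demo_url_re.search(sample_line).group(1)
sample_domain = demo_domain_re.search(sample_line).group(1)
print(sample_domain, sample_url)
```

Note that `search` returns `None` when an attribute is missing, so `.group(1)` would raise an `AttributeError` on a header without a `url` or `domain` attribute.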

## Download and open the XML file

In [3]:
# Download the corpus file

# Define the URL of the gzipped corpus file
url = "https://www.clarin.si/repository/xmlui/bitstream/handle/11356/1517/MaCoCu-sl.xml.gz"

# Downloading the file by sending the request to the URL
corpus_file = wget.download(url)
print('Downloading Completed')

Downloading Completed


In [32]:
file = gzip.open('MaCoCu-sl.xml.gz', 'rt', encoding='utf-8')

## Create a dataframe with the most frequent domains and a sample of 10 URLs from each domain

First, get a list of the domains and URLs of all texts that are more than 75 words long.

In [5]:
text_counter = 0

texts = []

for line in tqdm(file):
	if line.startswith("<doc"):
		# New document: reset the text buffer and read the URL and domain from the <doc> header
		pure_text = ""
		current_url = url_re.search(line).group(1)
		current_domain = domain_re.search(line).group(1)
	elif line.startswith("<p"):
		continue
	elif line.startswith("</p"):
		continue
	elif line.startswith("</doc"):
		text_length = len(pure_text.split())
		if text_length > 75:
			current_text = [current_domain, current_url]
			texts.append(current_text)
			text_counter += 1
	elif line.startswith("<corpus"):
		continue
	elif line.startswith("</corpus"):
		continue
	else:
		pure_text += line

151967742it [07:43, 327820.79it/s]
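The same filtering logic can be tried out on a few in-memory lines before running it over the full corpus. The following is only a sketch under assumptions: the function name `collect_long_docs`, the `min_words` parameter and the toy lines are all hypothetical, and the toy markup merely imitates the one-tag-per-line layout that the loop above relies on.

```python
import re

def collect_long_docs(lines, min_words=75):
    """Return [domain, url] pairs for documents with more than `min_words` words."""
    url_pat = re.compile('url="(.*?)"')
    domain_pat = re.compile('domain="(.*?)"')
    kept = []
    url = domain = None
    words = 0
    for line in lines:
        if line.startswith("<doc"):
            # New document: read the header attributes and reset the word count
            url = url_pat.search(line).group(1)
            domain = domain_pat.search(line).group(1)
            words = 0
        elif line.startswith("</doc"):
            # End of document: keep it only if it is long enough
            if words > min_words:
                kept.append([domain, url])
        elif line.startswith(("<p", "</p", "<corpus", "</corpus")):
            continue  # structural lines carry no running text
        else:
            words += len(line.split())
    return kept

# Toy input (made up): one document with 80 words and one with 2 words
toy_lines = [
    '<corpus id="toy">\n',
    '<doc id="1" url="https://a.si/" domain="a.si">\n',
    '<p id="1">\n',
    "beseda " * 80 + "\n",
    "</p>\n",
    "</doc>\n",
    '<doc id="2" url="https://b.si/" domain="b.si">\n',
    '<p id="2">\n',
    "kratko besedilo\n",
    "</p>\n",
    "</doc>\n",
    "</corpus>\n",
]
result = collect_long_docs(toy_lines)
print(result)
```

Counting words line by line gives the same total as splitting the concatenated text, because lines are separated by newlines and `split()` splits on any whitespace.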


In [6]:
# Create a dataframe
df = pd.DataFrame({"domain": [x[0] for x in texts], "url": [x[1] for x in texts]})

df.head()

Unnamed: 0,domain,url
0,e3.si,https://www.e3.si/
1,x5.si,http://www.x5.si/
2,a1.si,https://www.a1.si/
3,op.si,https://www.op.si/
4,fd.si,https://fd.si/


In [7]:
# Sort the df based on the domain
df = df.sort_values("domain")

df.head()

Unnamed: 0,domain,url
337411,007.com.hr,http://www.007.com.hr/bugdetector.html
177449,090linije.si,http://www.090linije.si/pogoji.htm
501444,090linije.si,http://www.090linije.si/simobil/pogoji.htm
550412,090vedezevanje.com,https://090vedezevanje.com/tarot-karte/
96050,090vedezevanje.com,https://090vedezevanje.com/


In [8]:
df.describe(include="all")

Unnamed: 0,domain,url
count,3787272,3787272
unique,49096,3779253
top,najdi.si,https://www.rtvslo.si/sport
freq,41255,8


We got 3,787,272 texts from 49,096 domains, with 3,779,253 unique URLs.

In [9]:
df.domain.value_counts().to_dict()

{'najdi.si': 41255,
 'rtvslo.si': 39012,
 'regionalobala.si': 35803,
 'primorske.si': 31999,
 'zurnal24.si': 29880,
 '1zavse.si': 28335,
 'sodnapraksa.si': 27388,
 'slo-tech.com': 26412,
 'mladina.si': 26191,
 'uradni-list.si': 25820,
 'radiostudent.si': 24408,
 'novice.svet24.si': 20773,
 'govorise.metropolitan.si': 19664,
 'dk.um.si': 18882,
 'radio1.si': 18151,
 'mojmojster.net': 18087,
 'dnevnik.si': 18001,
 'muziker.si': 17008,
 'tax-fin-lex.si': 16593,
 'siol.net': 16535,
 'ringaraja.net': 16507,
 'monitor.si': 16082,
 'moj-letak.si': 15774,
 'mojaobcina.si': 15596,
 'moja-lekarna.com': 14498,
 'vsi.si': 14387,
 'instore.si': 14279,
 'publishwall.si': 14159,
 'citymagazine.si': 14128,
 'sloski.si': 12958,
 'elektronik.si': 12233,
 'nogomania.com': 12215,
 'deloindom.delo.si': 11834,
 'politikis.si': 11517,
 'viva.si': 11371,
 'joker.muzej.si': 11280,
 'nova24tv.si': 10936,
 'pesem.si': 10301,
 'gov.si': 10134,
 'blog.uporabnastran.si': 9510,
 'sodisce.si': 9321,
 'mladipodjetnik.

In [10]:
# Calculate domain distribution
domain_distribution = df.domain.value_counts().rename_axis("domain").reset_index(name="frequency")
domain_distribution

Unnamed: 0,domain,frequency
0,najdi.si,41255
1,rtvslo.si,39012
2,regionalobala.si,35803
3,primorske.si,31999
4,zurnal24.si,29880
...,...,...
49091,tapetnistvo-damjan.si,1
49092,eurograf.si,1
49093,tapetnistvo-kolar.si,1
49094,tapetnistvo-kopac.si,1


In [11]:
# Discard domains with fewer than 10 texts
domain_distribution = domain_distribution[domain_distribution["frequency"] > 9]

domain_distribution.shape

(20400, 2)

If domains with fewer than 10 texts are kept, the median is only 6 texts per domain, which is too low, given that 10 texts are later sampled from each domain. So I first discarded domains with fewer than 10 texts and then calculated the median. After this step, 20,400 domains remain.
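The effect of the frequency threshold on the median can be seen on a small example. This is a sketch with made-up frequencies, not the corpus values; `toy_freq` is a hypothetical name.

```python
import pandas as pd

# Made-up domain frequencies: many tiny domains and a few large ones
toy_freq = pd.Series([1, 1, 2, 3, 6, 12, 40, 500], name="frequency")

# Median over all domains is dominated by the long tail of tiny domains
median_all = toy_freq.median()

# Median after keeping only domains with at least 10 texts
median_filtered = toy_freq[toy_freq > 9].median()

print(median_all, median_filtered)
```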

In [12]:
# Find the median
domain_distribution.frequency.describe()

count    20400.000000
mean       181.051225
std       1044.961694
min         10.000000
25%         16.000000
50%         32.000000
75%         84.000000
max      41255.000000
Name: frequency, dtype: float64

The median number of texts per domain is 32. I split the domains into two groups based on the median and take 500 random domains with a frequency above 32 and 500 random domains with a frequency below 32. The 219 domains with exactly 32 texts fall into neither group.
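The split around the median can be sketched on a toy frame as follows. The data and the names `toy_domains` and `n_per_half` are hypothetical, and `random_state` is an addition for reproducibility that the cells in this notebook do not use, so their samples differ from run to run.

```python
import pandas as pd

# Made-up domain frequencies for illustration
toy_domains = pd.DataFrame({
    "domain": [f"d{i}.si" for i in range(9)],
    "frequency": [10, 12, 15, 20, 32, 40, 55, 90, 300],
})

median = toy_domains["frequency"].median()

# Strict comparisons: domains exactly at the median end up in neither half
above = toy_domains[toy_domains["frequency"] > median]
below = toy_domains[toy_domains["frequency"] < median]
at_median = toy_domains[toy_domains["frequency"] == median]

n_per_half = 2  # 500 in the notebook
toy_domain_sample = pd.concat([
    above.sample(n=n_per_half, random_state=42),
    below.sample(n=n_per_half, random_state=42),
])
print(toy_domain_sample)
```

Using `>=` in one of the two comparisons would assign the median domains to that half instead of leaving them out.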

In [13]:
domain_distribution.head()

Unnamed: 0,domain,frequency
0,najdi.si,41255
1,rtvslo.si,39012
2,regionalobala.si,35803
3,primorske.si,31999
4,zurnal24.si,29880


In [14]:
top_domain_distribution = domain_distribution[domain_distribution["frequency"] > 32]
top_domain_distribution.shape

(9983, 2)

In [15]:
bottom_domain_distribution = domain_distribution[domain_distribution["frequency"] < 32]
bottom_domain_distribution.shape

(10198, 2)

In [16]:
# Take a random sample from top and random sample from bottom
top_domain_distribution = top_domain_distribution.sample(n=500)
top_domain_distribution.shape

(500, 2)

In [18]:
top_domain_distribution["frequency"].describe()

count     500.00000
mean      262.04000
std       601.91954
min        33.00000
25%        51.75000
50%        85.00000
75%       217.00000
max      5444.00000
Name: frequency, dtype: float64

In [19]:
bottom_domain_distribution = bottom_domain_distribution.sample(n=500)
bottom_domain_distribution.head()

Unnamed: 0,domain,frequency
15516,obcina-krizevci.si,16
11700,cotondetulear.si,25
16633,modulninja.shop,14
12423,migajznami.si,23
12247,mladi-gasilec.si,24


In [20]:
# Join the dataframes
domain_sample = pd.concat([top_domain_distribution, bottom_domain_distribution])
domain_sample.describe(include="all")

Unnamed: 0,domain,frequency
count,1000,1000.0
unique,1000,
top,nepremagljiva.si,
freq,1,
mean,,139.932
std,,442.624319
min,,10.0
25%,,16.0
50%,,32.0
75%,,85.0


In [21]:
# Create a list of domains
domain_sample_list = domain_sample.domain.to_list()
domain_sample_list[:10]

['nepremagljiva.si',
 'althea.si',
 'odvetnik-kutnjak.si',
 'kshuje.si',
 'kmetija.mojforum.si',
 'slaj-anic.si',
 'agrarne.si',
 'yogalishesana.com',
 'jeziki-stejejo.si',
 'donaulab.si']

In [22]:
# For each domain, sample out 10 texts from the initial dataframe

# First create the initial df to which all others in the loop will be added
final_sample = df[df["domain"] == domain_sample_list[0]].sample(n=10)

final_sample

Unnamed: 0,domain,url
375762,nepremagljiva.si,https://nepremagljiva.si/tag/dopust/
1682289,nepremagljiva.si,https://nepremagljiva.si/tri-noro-dobre-sladic...
1661643,nepremagljiva.si,https://nepremagljiva.si/vse-o-moji-vecni-bitk...
2234934,nepremagljiva.si,https://nepremagljiva.si/spoznajte-terapijo-ki...
677331,nepremagljiva.si,https://nepremagljiva.si/category/zdravje/
2536790,nepremagljiva.si,https://nepremagljiva.si/fredagsmys-prezivljaj...
429620,nepremagljiva.si,https://nepremagljiva.si/tag/uporaba/
1243342,nepremagljiva.si,https://nepremagljiva.si/slastni-cokoladni-rec...
2411809,nepremagljiva.si,https://nepremagljiva.si/oktobrski-astroloski-...
2496389,nepremagljiva.si,https://nepremagljiva.si/iz-vsake-krizne-situa...


In [23]:
# Add all other domains
remaining_list = domain_sample_list[1:]

for i in remaining_list:
	added_instances = df[df["domain"] == i].sample(n=10)
	final_sample = pd.concat([final_sample, added_instances])

final_sample.shape

(10000, 2)
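As an alternative to concatenating inside a loop, the per-domain sampling could be written with `groupby(...).sample(...)`, which is available in pandas 1.1 and later. This is a sketch on toy data, not the code that produced the sample above; `toy_df`, `wanted_domains` and `n_per_domain` are hypothetical names, and `random_state` is added only to make the toy result repeatable.

```python
import pandas as pd

# Toy stand-in for the domain/url dataframe
toy_df = pd.DataFrame({
    "domain": ["a.si"] * 5 + ["b.si"] * 4 + ["c.si"] * 3,
    "url": [f"https://example.si/{i}" for i in range(12)],
})

wanted_domains = ["a.si", "b.si"]  # plays the role of domain_sample_list
n_per_domain = 3                   # 10 in the notebook

# Restrict to the sampled domains, then draw n rows from each domain in one call
toy_sample = (
    toy_df[toy_df["domain"].isin(wanted_domains)]
    .groupby("domain")
    .sample(n=n_per_domain, random_state=0)
)
print(toy_sample["domain"].value_counts())
```

Like the loop, this raises an error if a domain has fewer rows than `n_per_domain`, which cannot happen here because domains with fewer than 10 texts were discarded earlier.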

In [24]:
final_sample.describe()

Unnamed: 0,domain,url
count,10000,10000
unique,1000,9996
top,nepremagljiva.si,https://www.izobrazevanjarfr.si/
freq,10,2


The final sample has 10,000 instances from 1,000 domains, covering 9,996 unique URLs.

In [26]:
# Save the df
final_sample.to_csv("MaCoCu-sample-domains-and-urls-sample2.csv", sep="\t")

## Extract the text from the XML based on the URL list

In [4]:
# Open the df with domains and urls for the sample
final_sample = pd.read_csv("MaCoCu-sample-domains-and-urls-sample2.csv", sep="\t", index_col = 0)

final_sample.head(2)

Unnamed: 0,domain,url
1461252,lkrv.fri.uni-lj.si,http://lkrv.fri.uni-lj.si/~ajurisic/seminar/in...
2129748,lkrv.fri.uni-lj.si,http://lkrv.fri.uni-lj.si/~ajurisic/tec_ac/np_...


In [27]:
final_sample.shape

(10000, 2)

In [28]:
# Create a list of urls in the sample
url_list = list(final_sample["url"].unique())
url_list[:10]

['https://nepremagljiva.si/tag/dopust/',
 'https://nepremagljiva.si/tri-noro-dobre-sladice-v-kozarcu/',
 'https://nepremagljiva.si/vse-o-moji-vecni-bitki-s-hormoni/',
 'https://nepremagljiva.si/spoznajte-terapijo-ki-odklene-vas-potencial/',
 'https://nepremagljiva.si/category/zdravje/',
 'https://nepremagljiva.si/fredagsmys-prezivljajte-petke-kot-to-pocnejo-svedi/',
 'https://nepremagljiva.si/tag/uporaba/',
 'https://nepremagljiva.si/slastni-cokoladni-recepti/',
 'https://nepremagljiva.si/oktobrski-astroloski-koledar-pripnite-si-pasove/',
 'https://nepremagljiva.si/iz-vsake-krizne-situacije-izstopim-kot-zmagovalka/']

In [29]:
len(url_list)

9996

In [33]:
# Now that I have the URL list, I will extract texts from the MaCoCu-sl.xml.gz for the sample based on the URL list.

text_all_counter = 0

texts_all = []

for line in tqdm(file):
	if line.startswith("<doc"):
		current_text = []
		pure_text = ""
		text_string = ""
		current_url = ""
		current_domain = ""
		current_url = url_re.search(line).group(1)
		current_domain = domain_re.search(line).group(1)
		text_string += line
	elif line.startswith("<p"):
		text_string += line
	elif line.startswith("</p"):
		text_string += line
	elif line.startswith("</doc"):
		text_string += line
		if current_url in url_list:
			current_text = [current_domain, current_url, pure_text, text_string]
			texts_all.append(current_text)
			text_all_counter += 1
	elif line.startswith("<corpus"):
		continue
	elif line.startswith("</corpus"):
		continue
	else:
		text_string += line
		pure_text += line

151967742it [46:34, 54375.89it/s] 
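This pass is several times slower than the first one. One likely contributor is the `current_url in url_list` test, which scans a Python list of almost 10,000 URLs for every document; converting the list to a `set` makes each lookup a hash-based check instead. The sketch below shows the idea on toy data; `keep_wanted` and the toy URLs are hypothetical, and no timing is claimed for the real corpus.

```python
def keep_wanted(urls_in_corpus, wanted_urls):
    """Return the corpus URLs that are in the wanted collection, in corpus order."""
    wanted = set(wanted_urls)  # built once; membership tests no longer scan a list
    return [u for u in urls_in_corpus if u in wanted]

# Toy data: the corpus contains a repeated URL, as the real corpus does
toy_corpus_urls = [
    "https://a.si/1",
    "https://b.si/1",
    "https://a.si/1",
    "https://c.si/1",
]
toy_wanted = ["https://a.si/1", "https://c.si/1"]

matched = keep_wanted(toy_corpus_urls, toy_wanted)
print(matched)
```

The result is the same as with a list; only the cost of each membership test changes.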


In [34]:
text_all_counter

10043

In [35]:
# Create a dataframe out of the text file

df_long_texts = pd.DataFrame({"domain": [x[0] for x in texts_all], "url": [x[1] for x in texts_all], "text": [x[2] for x in texts_all],"doc": [x[3] for x in texts_all]})

df_long_texts.head()

Unnamed: 0,domain,url,text,doc
0,zsc.si,http://www.zsc.si/,"Nacionalna varnost je preresna stvar, da bi jo...","<doc id=""macocu.si.283"" title=""ZSC - Zveza slo..."
1,pas.si,https://www.pas.si/,Prezračevanje prostorov je nujno zaradi najman...,"<doc id=""macocu.si.390"" title=""Preverjeno - Ak..."
2,fsp.si,http://fsp.si/,"Vsaka talna obloga ima svoje značilnosti, zara...","<doc id=""macocu.si.446"" title=""FSP Poslovne No..."
3,gzl.si,https://www.gzl.si/,Gasilska zveza Ljubljana\nJavna gasilska služb...,"<doc id=""macocu.si.783"" title=""Gasilska zveza ..."
4,zsc.si,http://www.zsc.si/,Mednarodna organizacija častnikov rezerve (CIO...,"<doc id=""macocu.si.869"" title=""ZSC - Zveza slo..."


In [36]:
df_long_texts.describe(include="all")

Unnamed: 0,domain,url,text,doc
count,10043,10043,10043,10043
unique,1000,9996,10043,10043
top,zsc.si,http://www.zsc.si/,"Nacionalna varnost je preresna stvar, da bi jo...","<doc id=""macocu.si.283"" title=""ZSC - Zveza slo..."
freq,13,4,1,1


In [37]:
df_long_texts.domain.value_counts()

zsc.si                         13
jvvz.org                       13
zupnijabarje.si                12
jeziki-stejejo.si              12
os-miren.si                    12
                               ..
zlati-ghee.si                  10
izvir-klub.si                  10
davorin.net                    10
obala.net                      10
otroska-kosarkarska-sola.si    10
Name: domain, Length: 1000, dtype: int64

Some URLs appear multiple times in the corpus with different texts, so the extracted sample consists of 10,043 texts. This causes two problems: a) some domains have more instances than others, and b) texts under some of the URLs might not be longer than 75 words. That is why I calculated the length of the texts again and kept only those longer than 75 words. Then I randomly removed surplus texts from the domains with more than 10 texts, so that in the end every domain has 10 instances.
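The two clean-up steps, length filtering and capping the number of texts per domain, can be sketched generically as follows. This is an illustration on made-up data rather than the exact procedure used in this notebook; `toy_texts` and `max_per_domain` are hypothetical names, and `random_state` is an added assumption for repeatability.

```python
import pandas as pd

# Made-up texts: "w " repeated n times gives a text of n words
toy_texts = pd.DataFrame({
    "domain": ["a.si"] * 4 + ["b.si"] * 3,
    "text": ["w " * 100, "w " * 80, "w " * 90, "w " * 10,
             "w " * 76, "w " * 75, "w " * 200],
})

# Step 1: compute the length in words and keep only texts longer than 75 words
toy_texts["length"] = toy_texts["text"].str.split().str.len()
toy_texts = toy_texts[toy_texts["length"] > 75]

# Step 2: shuffle, then keep at most max_per_domain rows per domain
max_per_domain = 2  # 10 in the notebook
balanced = (
    toy_texts.sample(frac=1, random_state=0)
    .groupby("domain")
    .head(max_per_domain)
)
print(balanced["domain"].value_counts())
```

Shuffling before `head` makes the retained rows a random subset of each domain, and domains that already have `max_per_domain` rows or fewer are left unchanged.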

In [38]:
# Add information on length
df_long_texts["length"] = df_long_texts["text"].str.split().str.len()

df_long_texts.head()

Unnamed: 0,domain,url,text,doc,length
0,zsc.si,http://www.zsc.si/,"Nacionalna varnost je preresna stvar, da bi jo...","<doc id=""macocu.si.283"" title=""ZSC - Zveza slo...",200
1,pas.si,https://www.pas.si/,Prezračevanje prostorov je nujno zaradi najman...,"<doc id=""macocu.si.390"" title=""Preverjeno - Ak...",124
2,fsp.si,http://fsp.si/,"Vsaka talna obloga ima svoje značilnosti, zara...","<doc id=""macocu.si.446"" title=""FSP Poslovne No...",1951
3,gzl.si,https://www.gzl.si/,Gasilska zveza Ljubljana\nJavna gasilska služb...,"<doc id=""macocu.si.783"" title=""Gasilska zveza ...",440
4,zsc.si,http://www.zsc.si/,Mednarodna organizacija častnikov rezerve (CIO...,"<doc id=""macocu.si.869"" title=""ZSC - Zveza slo...",112


In [39]:
df_long_texts["length"].describe()

count    10043.000000
mean       430.642139
std       1186.604180
min          9.000000
25%        125.000000
50%        220.000000
75%        424.000000
max      56091.000000
Name: length, dtype: float64

In [40]:
# Keep only texts that are longer than 75 words
df_long_texts = df_long_texts[df_long_texts["length"] > 75]
df_long_texts.shape

(10026, 5)

In [30]:
# Save the intermediate dataframe
df_long_texts.to_csv("MaCoCu-sl-sample.csv", sep="\t")

In [41]:
all_domains_frequency = df_long_texts.domain.value_counts().to_dict()

all_domains_frequency

{'zsc.si': 12,
 'os-miren.si': 12,
 'jeziki-stejejo.si': 12,
 'mipi.si': 12,
 'zupnijabarje.si': 12,
 'mzpm-ljubljana.si': 12,
 'zmos.si': 12,
 'easistent.com': 12,
 'ospivka6.splet.arnes.si': 11,
 '2020.cityofwomen.org': 11,
 'pohistvo123.si': 11,
 'dars.si': 11,
 'izo.sik.si': 11,
 'karra.si': 11,
 'jvvz.org': 11,
 'sportnazvezavelenje.si': 11,
 'digitalpartner.si': 11,
 'modulninja.shop': 11,
 'muzej-ptuj-ormoz.si': 10,
 'imk.si': 10,
 'matejgrudnik.si': 10,
 'netfork-akademija.si': 10,
 'mojstr.si': 10,
 'purityherbsslovenija.si': 10,
 'os-dobravlje.si': 10,
 'tehnozvezdje.si': 10,
 'zag.si': 10,
 'o-md.mb.edus.si': 10,
 'alteka.si': 10,
 'hippocampus.si': 10,
 'kemoplast.si': 10,
 'lifelong.blogspot.com': 10,
 'veganskakuharija.blogspot.com': 10,
 'riedingcompetition.com': 10,
 'vecna-optimistka.blogspot.com': 10,
 'mozirskigaj.com': 10,
 'mikrobiolog.blogspot.com': 10,
 'maticpise.com': 10,
 'hr.drazbe123.com': 10,
 'congress-ambassador.si': 10,
 'center-iris.si': 10,
 'pionirski

In [42]:
# Randomly drop surplus texts from domains with more than 10 texts (2 from domains with 12 texts, 1 from domains with 11 texts)
for item in ['zsc.si', 'os-miren.si', 'jeziki-stejejo.si', 'mipi.si', 'zupnijabarje.si', 'mzpm-ljubljana.si', 'zmos.si', 'easistent.com']:
	df_long_texts = df_long_texts.drop(df_long_texts[df_long_texts['domain'] == item].sample(n=2).index)

for item in ['ospivka6.splet.arnes.si', '2020.cityofwomen.org', 'pohistvo123.si', 'dars.si',
 'izo.sik.si', 'karra.si', 'jvvz.org', 'sportnazvezavelenje.si', 'digitalpartner.si', 'modulninja.shop']:
	df_long_texts = df_long_texts.drop(df_long_texts[df_long_texts['domain'] == item].sample(n=1).index)

In [43]:
df_long_texts.describe(include="all")

Unnamed: 0,domain,url,text,doc,length
count,10000,10000,10000,10000,10000.0
unique,1000,9974,10000,10000,
top,zsc.si,http://www.zsc.si/,"Nacionalna varnost je preresna stvar, da bi jo...","<doc id=""macocu.si.283"" title=""ZSC - Zveza slo...",
freq,10,3,1,1,
mean,,,,,431.5055
std,,,,,1188.840319
min,,,,,76.0
25%,,,,,126.0
50%,,,,,221.0
75%,,,,,424.0


In [44]:
# Check if all domains have the same number of instances
df_long_texts.domain.value_counts()

zsc.si                         10
o-md.mb.edus.si                10
dieta.si                       10
zupnija-gomilsko.rkc.si        10
kmetijskaoprema.si             10
                               ..
smarcan.si                     10
napisigovor.si                 10
infohit.si                     10
sadmavrica.si                  10
otroska-kosarkarska-sola.si    10
Name: domain, Length: 1000, dtype: int64

In [45]:
# Save the final sample
df_long_texts.to_csv("MaCoCu-sl-sample2.csv", sep="\t")