Here, I prepare a sample from MaCoCu-mk, to which the genre classifiers will be applied. To prepare the sample, I first discard all texts that are 75 words long or shorter and build a list of the domains and URLs of the texts that are long enough. Then I calculate the frequency of the domains and discard domains that have fewer than 10 texts (without this step, the median would be 6 texts per domain). Finally, I calculate the median and take a random sample of 500 domains above the median and 500 domains below the median.

In [1]:
import gzip
import wget
import regex as re
import pandas as pd
import numpy as np
import json
import random
from tqdm import tqdm

In [2]:
# Compile regex for url and domain
url_re = re.compile('url="(.*?)"')
domain_re = re.compile('domain="(.*?)"')
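
To illustrate what these two patterns capture, here is a small check against a made-up `<doc>` header line. The sample line and its attribute order are hypothetical, modelled loosely on the `<doc ...>` headers shown later in this notebook. The standard-library `re` module is used so that the snippet runs without the third-party `regex` package.

```python
import re

url_re = re.compile('url="(.*?)"')
domain_re = re.compile('domain="(.*?)"')

# Hypothetical header line (not copied from the corpus)
sample_line = '<doc id="macocu.mk.0" title="Example" domain="example.mk" url="https://example.mk/page">'

# The non-greedy group stops at the first closing quote after the attribute name
sample_url = url_re.search(sample_line).group(1)
sample_domain = domain_re.search(sample_line).group(1)
print(sample_domain, sample_url)
```

Note that `search` returns `None` when an attribute is missing, so a header line without `url` or `domain` would raise an `AttributeError` on `.group(1)`.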

## Download and open the XML file

In [3]:
# Download the corpus file

# Define the URL of the gzipped corpus file
url = "https://www.clarin.si/repository/xmlui/bitstream/handle/11356/1512/MaCoCu-mk.xml.gz"

# Downloading the file by sending the request to the URL
corpus_file = wget.download(url)
print('Downloading Completed')

Downloading Completed


In [3]:
file = gzip.open('MaCoCu-mk.xml.gz', 'rt', encoding='utf-8')

## Create a dataframe with the most frequent domains and a sample of 10 URLs from each domain

First, get a list of all domains and URLs of texts that are more than 75 words long.

In [5]:
text_counter = 0

texts = []

for line in tqdm(file):
	if line.startswith("<doc"):
		current_text = []
		pure_text = ""
		text_length = 0
		current_url = ""
		current_domain = ""
		current_url = url_re.search(line).group(1)
		current_domain = domain_re.search(line).group(1)
	elif line.startswith("<p"):
		continue
	elif line.startswith("</p"):
		continue
	elif line.startswith("</doc"):
		text_length = len(pure_text.split())
		if text_length > 75:
			current_text = [current_domain, current_url]
			texts.append(current_text)
			text_counter += 1
	elif line.startswith("<corpus"):
		continue
	elif line.startswith("</corpus"):
		continue
	else:
		pure_text += line

47208386it [02:36, 302411.61it/s]
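
The same length filter can be sketched as a small generator, which makes it easy to try on a few lines before running it over the whole corpus. This is only an illustration under assumptions: `iter_docs` and the toy lines below are hypothetical and not part of the original notebook.

```python
def iter_docs(lines):
    """Yield (header_line, body_text) for each <doc>...</doc> block."""
    header, body = None, []
    for line in lines:
        if line.startswith("<doc"):
            header, body = line, []
        elif line.startswith("</doc"):
            yield header, "".join(body)
        elif line.startswith(("<p", "</p", "<corpus", "</corpus")):
            continue  # skip structural tags, keep only the running text
        else:
            body.append(line)

# Made-up miniature corpus: one long document and one short document
toy_lines = [
    '<corpus id="toy">\n',
    '<doc domain="a.mk" url="https://a.mk/1">\n',
    '<p id="1">\n',
    "word " * 80 + "\n",
    "</p>\n",
    "</doc>\n",
    '<doc domain="a.mk" url="https://a.mk/2">\n',
    '<p id="1">\n',
    "too short\n",
    "</p>\n",
    "</doc>\n",
    "</corpus>\n",
]

# Keep only headers of documents with more than 75 whitespace-separated words
long_enough = [header for header, text in iter_docs(toy_lines)
               if len(text.split()) > 75]
print(len(long_enough))
```

Because `gzip.open(..., 'rt')` returns an iterable of lines, the opened corpus file could be passed to such a generator in place of `toy_lines`.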


In [6]:
# Create a dataframe
df = pd.DataFrame({"domain": [x[0] for x in texts], "url": [x[1] for x in texts]})

df.head()

Unnamed: 0,domain,url
0,ia.mk,https://www.ia.mk/
1,dd.mk,https://www.dd.mk/
2,gg.mk,https://gg.mk/
3,qs.mk,https://www.qs.mk/
4,fn.mk,https://fn.mk/


In [7]:
# Sort the df based on the domain
df = df.sort_values("domain")

df.head()

Unnamed: 0,domain,url
300689,1000knigi.mon.gov.mk,http://1000knigi.mon.gov.mk/book.php?id=252
316176,1000knigi.mon.gov.mk,http://1000knigi.mon.gov.mk/book.php?id=1522
311898,1000knigi.mon.gov.mk,http://1000knigi.mon.gov.mk/book.php?id=1000
311899,1000knigi.mon.gov.mk,http://1000knigi.mon.gov.mk/book.php?id=1001
302153,1000knigi.mon.gov.mk,http://www.1000knigi.mon.gov.mk/book.php?id=107


In [8]:
df.describe(include="all")

Unnamed: 0,domain,url
count,1475266,1475266
unique,6257,1475261
top,gol.mk,https://forum.kajgana.com/threads/%D0%93%D0%BB...
freq,65658,4


We got 1,475,266 texts (1,475,261 unique URLs) in 6,257 domains.

In [9]:
df.domain.value_counts().to_dict()

{'gol.mk': 65658,
 'denar.mk': 34908,
 'kafepauza.mk': 34305,
 'press24.mk': 33496,
 'sitel.com.mk': 32750,
 'off.net.mk': 26432,
 'plusinfo.mk': 26229,
 'slobodnaevropa.mk': 25271,
 'daily.mk': 25080,
 'forum.carclub.mk': 21873,
 'ekran.mk': 21340,
 'sdk.mk': 20331,
 'okno.mk': 19441,
 'novatv.mk': 19390,
 'fashionel.mk': 19198,
 'utrinski.com.mk': 18518,
 'it.mk': 18360,
 '365.com.mk': 17775,
 'vardarfans.mk': 17252,
 'tvm.mk': 16547,
 'usb.mk': 16155,
 'moirecepti.mk': 14786,
 'ringeraja.mk': 13522,
 'mkd.mk': 11824,
 'forum.idividi.com.mk': 11802,
 'skopjeinfo.mk': 11590,
 'katolici.mk': 10512,
 'mn.mk': 10211,
 'lokalno.mk': 9757,
 'radar.mk': 9732,
 'arhiva.sdsm.org.mk': 9585,
 'forum.it.mk': 9517,
 'puls24.mk': 9463,
 'civilmedia.mk': 9396,
 'pravdiko.mk': 9347,
 'arhiva.sportmedia.mk': 9175,
 'telefoni.mk': 8928,
 'kukuriku.com.mk': 8604,
 'zdravstvo.mk': 8398,
 'dubai-portal.com.mk': 8209,
 'reporter.mk': 8156,
 'publicitet.mk': 8136,
 'forum.femina.mk': 8042,
 'g-sport.mk': 7

In [10]:
# Calculate domain distribution
domain_distribution = df.domain.value_counts().rename_axis("domain").reset_index(name="frequency")
domain_distribution

Unnamed: 0,domain,frequency
0,gol.mk,65658
1,denar.mk,34908
2,kafepauza.mk,34305
3,press24.mk,33496
4,sitel.com.mk,32750
...,...,...
6252,elektrobojan.com.mk,1
6253,myprint.mk,1
6254,elektroluks.com.mk,1
6255,semering.mk,1


In [11]:
# Discard domains with fewer than 10 texts
domain_distribution = domain_distribution[domain_distribution["frequency"] > 9]

domain_distribution.shape

(2606, 2)

I first discarded domains with fewer than 10 texts, which left 2,606 domains. The median is calculated on these remaining domains.

In [12]:
# Find the median
domain_distribution.frequency.describe()

count     2606.000000
mean       561.796239
std       2715.305274
min         10.000000
25%         18.000000
50%         43.000000
75%        146.750000
max      65658.000000
Name: frequency, dtype: float64

The median number of texts per domain is 43. I'll split the domains into two groups based on the median, then take 500 domains with a frequency above this number and 500 domains with a frequency below it. Domains with exactly 43 texts fall into neither group.

In [13]:
domain_distribution.head()

Unnamed: 0,domain,frequency
0,gol.mk,65658
1,denar.mk,34908
2,kafepauza.mk,34305
3,press24.mk,33496
4,sitel.com.mk,32750


In [14]:
top_domain_distribution = domain_distribution[domain_distribution["frequency"] > 43]
top_domain_distribution.shape

(1288, 2)

In [15]:
bottom_domain_distribution = domain_distribution[domain_distribution["frequency"] < 43]
bottom_domain_distribution.shape

(1300, 2)

In [16]:
# Take a random sample from top and random sample from bottom
top_domain_distribution = top_domain_distribution.sample(n=500)
top_domain_distribution.shape

(500, 2)

In [17]:
top_domain_distribution["frequency"].describe()

count      500.00000
mean       995.12600
std       2733.36359
min         44.00000
25%         72.00000
50%        156.50000
75%        527.25000
max      21340.00000
Name: frequency, dtype: float64

In [18]:
bottom_domain_distribution = bottom_domain_distribution.sample(n=500)
bottom_domain_distribution.head()

Unnamed: 0,domain,frequency
1470,msu.mk,33
1507,povrzani.mk,31
1671,fmmm.utms.edu.mk,26
1461,mbdp.com.mk,33
1518,adriamed.mk,31
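
The two `sample` calls above draw a different set of domains on every run. If a repeatable sample is wanted, one option is to pass `random_state` to `DataFrame.sample`. The sketch below uses made-up frequencies and an arbitrary seed; it also shows that a strict split on the median leaves out domains that sit exactly on it.

```python
import pandas as pd

# Toy domain frequencies, made up for illustration
toy = pd.DataFrame({
    "domain": [f"site{i}.mk" for i in range(9)],
    "frequency": [10, 12, 15, 20, 30, 40, 50, 60, 70],
})

median = toy["frequency"].median()
top = toy[toy["frequency"] > median]      # strictly above the median
bottom = toy[toy["frequency"] < median]   # strictly below the median

SEED = 42  # arbitrary value, an assumption of this sketch
top_sample = top.sample(n=2, random_state=SEED)
top_sample_again = top.sample(n=2, random_state=SEED)

# The same seed gives the same rows on every call
print(top_sample.equals(top_sample_again))
```

In this toy data the domain with a frequency equal to the median belongs to neither `top` nor `bottom`.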


In [19]:
# Join the dataframes
domain_sample = pd.concat([top_domain_distribution, bottom_domain_distribution])
domain_sample.describe(include="all")

Unnamed: 0,domain,frequency
count,1000,1000.0
unique,1000,
top,newbalkanpolitics.org.mk,
freq,1,
mean,,507.911
std,,1992.37393
min,,10.0
25%,,18.0
50%,,43.0
75%,,156.25


In [20]:
# Create a list of domains
domain_sample_list = domain_sample.domain.to_list()
domain_sample_list[:10]

['newbalkanpolitics.org.mk',
 'mk.voanews.com',
 'denesen.mk',
 'krivogastani.com',
 'kfsm.mk',
 'marxists.org',
 'suteren.mk',
 'aitonix.mk',
 'kulturnariznica.mk',
 'forum.gsm.mk']

In [21]:
# For each domain, sample out 10 texts from the initial dataframe

# First create the initial df to which all others in the loop will be added
final_sample = df[df["domain"] == domain_sample_list[0]].sample(n=10)

final_sample

Unnamed: 0,domain,url
476371,newbalkanpolitics.org.mk,http://newbalkanpolitics.org.mk/item/Fear-and-...
583732,newbalkanpolitics.org.mk,http://newbalkanpolitics.org.mk/item/THE-BORDE...
1284056,newbalkanpolitics.org.mk,http://www.newbalkanpolitics.org.mk/item/Oднос...
1297315,newbalkanpolitics.org.mk,http://www.newbalkanpolitics.org.mk/item/Recur...
1215721,newbalkanpolitics.org.mk,http://newbalkanpolitics.org.mk/item/SHAMEFUL-...
539857,newbalkanpolitics.org.mk,http://www.newbalkanpolitics.org.mk/item/Inter...
1071795,newbalkanpolitics.org.mk,http://newbalkanpolitics.org.mk/item/The%20Ody...
605494,newbalkanpolitics.org.mk,http://www.newbalkanpolitics.org.mk/item/Кога-...
578966,newbalkanpolitics.org.mk,http://www.newbalkanpolitics.org.mk/item/Каков...
1028097,newbalkanpolitics.org.mk,http://newbalkanpolitics.org.mk/item/The-Posit...


In [22]:
# Add all other domains
remaining_list = domain_sample_list[1:]

for i in remaining_list:
	added_instances = df[df["domain"] == i].sample(n=10)
	final_sample = pd.concat([final_sample, added_instances])

final_sample.shape

(10000, 2)
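
As an alternative to concatenating inside a loop, the per-domain sampling could probably be done in one step with `groupby(...).sample(...)`, which is available in pandas 1.1 and newer. This is a sketch on toy data, not the code used to produce the sample above; the dataframe, the domain list and the seed are made up.

```python
import pandas as pd

# Toy dataframe standing in for df (domains and URLs are made up)
toy_df = pd.DataFrame({
    "domain": ["a.mk"] * 5 + ["b.mk"] * 4 + ["c.mk"] * 6,
    "url": [f"https://example.mk/{i}" for i in range(15)],
})
chosen_domains = ["a.mk", "c.mk"]  # stands in for domain_sample_list

# Keep only the chosen domains, then draw 3 rows from each of them
toy_sample = (
    toy_df[toy_df["domain"].isin(chosen_domains)]
    .groupby("domain")
    .sample(n=3, random_state=0)
)
print(toy_sample.shape)
```

Like the loop, this raises an error if a domain has fewer rows than `n`, which cannot happen here because every sampled domain has at least 10 texts.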

In [23]:
final_sample.describe()

Unnamed: 0,domain,url
count,10000,10000
unique,1000,10000
top,newbalkanpolitics.org.mk,http://newbalkanpolitics.org.mk/item/Fear-and-...
freq,10,1


The final sample has 10,000 instances from 1,000 domains.

In [24]:
# Save the df
final_sample.to_csv("MaCoCu-mk-sample-domains-and-urls.csv", sep="\t")

## Extract the text from the XML based on the URL list

In [4]:
# Open the df with domains and urls for the sample
final_sample = pd.read_csv("MaCoCu-mk-sample-domains-and-urls.csv", sep="\t", index_col = 0)

final_sample.head(2)

Unnamed: 0,domain,url
476371,newbalkanpolitics.org.mk,http://newbalkanpolitics.org.mk/item/Fear-and-...
583732,newbalkanpolitics.org.mk,http://newbalkanpolitics.org.mk/item/THE-BORDE...


In [5]:
final_sample.shape

(10000, 2)

In [6]:
# Create a list of urls in the sample
url_list = list(final_sample["url"].unique())
url_list[:10]

['http://newbalkanpolitics.org.mk/item/Fear-and-Politics/mk',
 'http://newbalkanpolitics.org.mk/item/THE-BORDERS-OF-OUR-MINDS/mk',
 'http://www.newbalkanpolitics.org.mk/item/Oдносите-меѓу-црквата-и-државата,-националниот-идентитет-и-безбедноста-во-Грција-по-студената-војна/mk',
 'http://www.newbalkanpolitics.org.mk/item/Recurrent-Challenges-to-the-Implementation-of-Intrastate-Peace-Agreements:The-Resistance-of-State-Authorities/mk',
 'http://newbalkanpolitics.org.mk/item/SHAMEFUL-BEHAVIOR-OF-THE-OFFICIALS-OF-THE-EUROPEAN-AGENCY-FOR-RECONSTRUCTION/mk',
 'http://www.newbalkanpolitics.org.mk/item/International-politics/mk',
 'http://newbalkanpolitics.org.mk/item/The%20Odyssey%20of%20the%20Roma%20Refugees%20from%20Kosovo/mk',
 'http://www.newbalkanpolitics.org.mk/item/Кога-ништо-друго-не-успева/mk',
 'http://www.newbalkanpolitics.org.mk/item/Каков-однос-кон-минатото/mk',
 'http://newbalkanpolitics.org.mk/item/The-Position-of-the-Turks-in-the-Republic-of-Macedonia/mk']

In [7]:
len(url_list)

10000

In [8]:
# Now that I have the URL list, I will extract texts from the MaCoCu-mk.xml.gz for the sample based on the URL list.

text_all_counter = 0

texts_all = []

for line in tqdm(file):
	if line.startswith("<doc"):
		current_text = []
		pure_text = ""
		text_string = ""
		current_url = ""
		current_domain = ""
		current_url = url_re.search(line).group(1)
		current_domain = domain_re.search(line).group(1)
		text_string += line
	elif line.startswith("<p"):
		text_string += line
	elif line.startswith("</p"):
		text_string += line
	elif line.startswith("</doc"):
		text_string += line
		if current_url in url_list:
			current_text = [current_domain, current_url, pure_text, text_string]
			texts_all.append(current_text)
			text_all_counter += 1
	elif line.startswith("<corpus"):
		continue
	elif line.startswith("</corpus"):
		continue
	else:
		text_string += line
		pure_text += line

47208386it [05:54, 133165.23it/s]
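
The check `current_url in url_list` scans a Python list, which takes time proportional to the list length for every document. Converting the URLs to a `set` first gives the same answers with constant-time lookups on average. The snippet below is a small illustration with made-up URLs; the names ending in `_toy` are hypothetical, and the timings depend on the machine.

```python
import timeit

# Made-up URLs standing in for url_list
url_list_toy = [f"https://example.mk/{i}" for i in range(10_000)]
url_set_toy = set(url_list_toy)

probe_hit = "https://example.mk/9999"   # last element: worst case for the list
probe_miss = "https://example.mk/none"  # not in the sample at all

# Both containers agree on membership
same_hit = (probe_hit in url_list_toy) == (probe_hit in url_set_toy)
same_miss = (probe_miss in url_list_toy) == (probe_miss in url_set_toy)

# Rough timing comparison
list_time = timeit.timeit(lambda: probe_miss in url_list_toy, number=200)
set_time = timeit.timeit(lambda: probe_miss in url_set_toy, number=200)
print(f"list: {list_time:.4f}s, set: {set_time:.4f}s")
```

In the extraction loop this would amount to building `url_set = set(url_list)` once and testing `current_url in url_set`.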


In [9]:
text_all_counter

10000

In [10]:
# Create a dataframe out of the extracted texts

df_long_texts = pd.DataFrame({"domain": [x[0] for x in texts_all], "url": [x[1] for x in texts_all], "text": [x[2] for x in texts_all],"doc": [x[3] for x in texts_all]})

df_long_texts.head()

Unnamed: 0,domain,url,text,doc
0,gg.mk,https://gg.mk/,"Тоа е тоа, екипа. Претпоследната епизода ја на...","<doc id=""macocu.mk.4"" title=""GG.MK - е-спорт &..."
1,qs.mk,https://www.qs.mk/,"Парадата на гордоста се враќа под слоганот ""Во...","<doc id=""macocu.mk.5"" title=""КВИР СКВЕР Скопје..."
2,in2.mk,http://in2.mk/,"Од крајот на месец Ноември 2009 година, своето...","<doc id=""macocu.mk.25"" title=""IN2 - ПOЧЕTHA CT..."
3,ldp.mk,https://ldp.mk/,Што не прави либрали?\nЕманципација од стравов...,"<doc id=""macocu.mk.35"" title=""Либерално демокр..."
4,iep.mk,https://iep.mk/,ИНИЦИЈАТИВА ЗА ЕВРОПСКА ПЕРСПЕКТИВА\nЗдружение...,"<doc id=""macocu.mk.37"" title=""Иницијатива за Е..."


In [11]:
df_long_texts.describe(include="all")

Unnamed: 0,domain,url,text,doc
count,10000,10000,10000,10000
unique,1000,10000,10000,10000
top,gg.mk,https://gg.mk/,"Тоа е тоа, екипа. Претпоследната епизода ја на...","<doc id=""macocu.mk.4"" title=""GG.MK - е-спорт &..."
freq,10,1,1,1


In [12]:
df_long_texts.domain.value_counts()

gg.mk                         10
respublica.edu.mk             10
vinodonia.mk                  10
dekra.mk                      10
investinpelagoniaregion.mk    10
                              ..
zenica.mk                     10
tesoridimare.mk               10
grafopres.mk                  10
insumak.mk                    10
sud.mk                        10
Name: domain, Length: 1000, dtype: int64

In [13]:
# Add information on length
df_long_texts["length"] = df_long_texts["text"].str.split().str.len()

df_long_texts.head()

Unnamed: 0,domain,url,text,doc,length
0,gg.mk,https://gg.mk/,"Тоа е тоа, екипа. Претпоследната епизода ја на...","<doc id=""macocu.mk.4"" title=""GG.MK - е-спорт &...",864
1,qs.mk,https://www.qs.mk/,"Парадата на гордоста се враќа под слоганот ""Во...","<doc id=""macocu.mk.5"" title=""КВИР СКВЕР Скопје...",1054
2,in2.mk,http://in2.mk/,"Од крајот на месец Ноември 2009 година, своето...","<doc id=""macocu.mk.25"" title=""IN2 - ПOЧЕTHA CT...",162
3,ldp.mk,https://ldp.mk/,Што не прави либрали?\nЕманципација од стравов...,"<doc id=""macocu.mk.35"" title=""Либерално демокр...",343
4,iep.mk,https://iep.mk/,ИНИЦИЈАТИВА ЗА ЕВРОПСКА ПЕРСПЕКТИВА\nЗдружение...,"<doc id=""macocu.mk.37"" title=""Иницијатива за Е...",182


In [14]:
df_long_texts["length"].describe()

count    10000.000000
mean       407.222600
std        867.246509
min         76.000000
25%        128.000000
50%        213.000000
75%        408.000000
max      38281.000000
Name: length, dtype: float64

In [17]:
# Save the final sample
df_long_texts.to_csv("MaCoCu-mk-sample.csv", sep="\t")