# Preparing data

<a target="_blank" href="https://colab.research.google.com/github/Koffair/colab_pipelines/blob/main/notebooks/prepare_data.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# TODOs
- explain why a Google account is needed: it provides the Drive storage, which is what lets us use Colab


The aim of this Notebook is to prepare the data for:
- training a huggingsound model
- training various output correction tools (e.g. an n-gram language model) that will help us increase the accuracy of our transcripts

The magic will happen elsewhere; here we just get some data and prepare it for further processing.

# Data sources
 - [OSCAR 2019](https://oscar-project.org/post/oscar-2019/) Hungarian sub-corpus
 - [nyest.hu](https://www.nyest.hu/): a corpus containing all the articles from nyest.hu (closed, copyrighted material)

## Prerequisites
- a Google account
- a Colab Pro+ subscription, another cloud-based Jupyter Notebook service with a GPU (e.g. Datalore), or a decent local machine with a GPU
- the datasets listed above, downloaded to your Google Drive

## WARNINGS
- If you are not familiar with Jupyter Notebooks, take some time to get used to them; e.g. [this resource](https://www.manning.com/liveproject/getting-started-with-jupyter-notebook) explains the very basics.
- If you are new to Colab, take some time to familiarize yourself with it. You may find [this course](https://www.manning.com/liveproject/getting-started-with-Google-Colab-using-PyTorch) helpful.
- You can run the cells of this notebook on Colab. Click on the "Open in Colab" badge at the top of the page.
- Don't work in this notebook directly! Click File > "Save a copy in Drive" before you run or modify anything.
- Check the paths to your data. You will probably have to adapt them to the folder structure of your Google Drive.
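
The path check in the last item can be scripted. Below is a minimal sketch, assuming the folder layout used by the cells in this notebook (`corpora/OSCAR2019_hu`, `corpora/nyest` and `interim` under `Colab Notebooks`). `missing_paths` is a hypothetical helper written for this example, not part of Colab or any library. Run it only after Google Drive has been mounted.

```python
import os

# Folder layout assumed from the cells in this notebook; adjust to your own Drive.
DRIVE_ROOT = "/content/gdrive/My Drive/Colab Notebooks"
EXPECTED_DIRS = ["corpora/OSCAR2019_hu", "corpora/nyest", "interim"]


def missing_paths(root, relative_dirs):
    """Return the sub-directories of `root` that do not exist."""
    return [d for d in relative_dirs if not os.path.isdir(os.path.join(root, d))]


# Report every expected folder that cannot be found.
for d in missing_paths(DRIVE_ROOT, EXPECTED_DIRS):
    print("missing:", os.path.join(DRIVE_ROOT, d))
```

If nothing is printed, all three folders were found.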

## Uncompressing the data
The following steps uncompress the data files in the appropriate directories.
Note that `gzip -d` deletes the original `.gz` files once they are decompressed, while `unzip` leaves the archive in place.

### Connecting to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
!ls "/content/gdrive/My Drive/Colab Notebooks/"

corpora  interim  mcc_langmods	models	prepare_data.ipynb


### Uncompress OSCAR txt files





In [None]:
!cd "/content/gdrive/My Drive/Colab Notebooks/corpora/OSCAR2019_hu"; gzip -d *.gz
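
If you would rather keep the `.gz` archives, which the `gzip -d` call above removes, the decompression can also be done from Python with the standard library. This is only a sketch, and `gunzip_keep` is a hypothetical helper name introduced here:

```python
import glob
import gzip
import os
import shutil


def gunzip_keep(directory):
    """Decompress every *.gz file in `directory`, leaving the archives in place."""
    written = []
    for gz_path in sorted(glob.glob(os.path.join(directory, "*.gz"))):
        out_path = gz_path[:-3]  # strip the ".gz" suffix
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)  # stream the data instead of loading it all
        written.append(out_path)
    return written
```

A call would look like `gunzip_keep("/content/gdrive/My Drive/Colab Notebooks/corpora/OSCAR2019_hu")`. Be aware that the OSCAR clean-up cell further down reads every file in that folder, so archives left there would have to be moved away or filtered out first.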

### Uncompress nyest corpus

In [None]:
!cd "/content/gdrive/My Drive/Colab Notebooks/corpora/nyest"; unzip contents.zip 

Archive:  contents.zip
  inflating: contents.csv            


## Clean up text corpora


In [None]:
# getting nltk punkt tokenizer
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### OSCAR 2019

In [None]:
!pip install blingfire

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting blingfire
  Downloading blingfire-0.1.8-py3-none-any.whl (42.1 MB)
Installing collected packages: blingfire
Successfully installed blingfire-0.1.8


In [None]:
import os
import concurrent.futures

from blingfire import text_to_words

data_root = "/content/gdrive/My Drive/Colab Notebooks/corpora/OSCAR2019_hu"
# Every regular file in the OSCAR folder is treated as a plain-text input.
text_files = [
    e for e in os.listdir(data_root) if os.path.isfile(os.path.join(data_root, e))
]

with open("/content/gdrive/My Drive/Colab Notebooks/interim/oscar.txt", "w") as outfile:
  for text_file in text_files:
    print(text_file)
    with open(os.path.join(data_root, text_file), "r") as infile:
      with concurrent.futures.ProcessPoolExecutor() as executor:
        # Tokenize each line in a worker process. Results are consumed in
        # completion order, so output lines do not follow the input order.
        res = {executor.submit(text_to_words, line) for line in infile}
        for future in concurrent.futures.as_completed(res):
          data = future.result()
          wds = data.split()
          # Keep only purely alphanumeric tokens, lowercased.
          wds = [wd.lower() for wd in wds if wd.isalnum()]
          wds = " ".join(wds)
          outfile.write(wds + "\n")
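
The filtering applied to each tokenized line can be tried in isolation. Here is a minimal sketch, assuming the input is already a whitespace-separated token string of the kind `text_to_words` returns. `normalize_tokens` is a hypothetical name used only for this example, and the sample sentence is made up:

```python
def normalize_tokens(tokenized_line):
    """Lowercase the tokens and keep only the purely alphanumeric ones."""
    words = tokenized_line.split()
    return " ".join(w.lower() for w in words if w.isalnum())


# Punctuation tokens are dropped; letters (including accented ones) and digits stay.
print(normalize_tokens("A kutya ugat , a macska 2 - szer nyávog ."))
```

Because the test is `str.isalnum()` on the whole token, any token that still contains punctuation (for example `e-mail`) is dropped entirely rather than split.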


### nyest.hu

In [None]:
import html
import re

import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

data_root = "/content/gdrive/My Drive/Colab Notebooks/corpora/nyest/contents.csv"
df = pd.read_csv(data_root, sep=";")
df.fillna('', inplace=True)

# Raw strings keep the regex backslashes intact and avoid invalid-escape warnings.
CLEANR = re.compile(r'<.*?>')
CDATA = re.compile(r'\/\/\s&lt;!\[CDATA\[\n.*\n\/\/\s*\]\]&gt;')


def clean_txt(txt):
    """Postprocess txt, removes unescaped html entities"""
    txt = txt.replace("&amp;gt;", " ").replace("&amp;nbsp;", " ").replace("&quot;", " ")
    txt = txt.replace("&#x27;", " ").replace("::adbox::7::", "").replace("&amp;lt;", " ")
    txt = txt.replace("&amp;amp;", " ")
    return txt


def cleanhtml(raw_html):
    """Clean raw html page"""
    cleaned_txt = clean_txt(html.escape(re.sub(CLEANR, ' ', raw_html)))
    return re.sub(CDATA, ' ', cleaned_txt)


with open("/content/gdrive/My Drive/Colab Notebooks/interim/nyest.txt", "w") as outfile:
    for _, row in df.iterrows():
        # Columns are addressed by position: 0 = title, 3 = lead, 4 = body text.
        title = cleanhtml(row.iloc[0])
        lead = cleanhtml(row.iloc[3])
        text = cleanhtml(row.iloc[4])
        full_text = " ".join([title, lead, text])
        sentences = sent_tokenize(full_text)
        for sentence in sentences:
            if sentence:
                words = word_tokenize(sentence)
                words = [word.lower() for word in words if word.isalnum()]
                if words:
                    s = " ".join(words)
                    outfile.write(s + "\n")
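
The entity strings that `clean_txt` looks for are easier to follow once you see what `html.escape` does to a fragment whose tags have already been stripped. Below is a small sketch on a made-up fragment. The sample string and the `strip_and_escape` name are illustrative only and are not taken from the corpus or the cell above:

```python
import html
import re

TAG = re.compile(r"<.*?>")


def strip_and_escape(raw_html):
    """Replace each tag with a space, then HTML-escape what is left."""
    return html.escape(TAG.sub(" ", raw_html))


sample = '<p>Az "A &gt; B" jel</p>'
# The entity "&gt;" becomes "&amp;gt;" and each double quote becomes "&quot;".
print(repr(strip_and_escape(sample)))
```

This is why `clean_txt` replaces doubly escaped forms such as `&amp;gt;` and `&amp;nbsp;`: entities already present in the source text gain an extra `&amp;` when the text is escaped.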

### Concatenate corpora

In [None]:
!cd "/content/gdrive/My Drive/Colab Notebooks/interim/"; cat oscar.txt nyest.txt > merged_corpus.txt
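
Before handing the merged file to later steps, it may be worth checking that it has the shape the cells above aim for: one sentence or line per row, lowercased, alphanumeric tokens only. A minimal sketch, where `corpus_stats` is a hypothetical helper and UTF-8 encoding is assumed:

```python
def corpus_stats(path):
    """Count lines and tokens, plus the lines that deviate from the expected format."""
    n_lines = n_tokens = n_bad = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            n_lines += 1
            n_tokens += len(tokens)
            # A clean line is non-empty and holds only lowercased alphanumeric tokens.
            if not tokens or any(not t.isalnum() or t != t.lower() for t in tokens):
                n_bad += 1
    return n_lines, n_tokens, n_bad
```

It could be called as `corpus_stats("/content/gdrive/My Drive/Colab Notebooks/interim/merged_corpus.txt")`. Flagged lines are not necessarily errors. For instance, the OSCAR cell writes an empty line whenever no token of an input line survives the filter, and those are counted here as well.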