## Baseline pipeline

In [1]:
from nemo_curator.download import download_common_crawl

# Download and extract Common Crawl snapshots to JSONL; url_limit caps the number of WARC files fetched
dataset = download_common_crawl(
    "/code/datasets/common_crawl/", "2022-04", "2023-5", output_type="jsonl", url_limit=5
)

In [2]:
from nemo_curator.modifiers.unicode_reformatter import UnicodeReformatter
from nemo_curator.filters import WordCountFilter
from nemo_curator.modules.modify import Modify
from nemo_curator import ScoreFilter, Sequential

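# Build the pipeline: stages run in order on each document
curation_pipeline = Sequential([
    # Fix broken or inconsistent Unicode in the text
    Modify(UnicodeReformatter()),
    # Discard records with fewer than 100 words
    ScoreFilter(WordCountFilter(min_words=100)),
])
# Execute the pipeline on the dataset
curated_dataset = curation_pipeline(dataset)
# Materialize the result as an in-memory pandas DataFrame
df = curated_dataset.to_pandas()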

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('text', 'int64'))
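
To make the two stages of the pipeline in cell [2] concrete, here is a minimal standard-library sketch of what happens to each record. It is an illustration only, not NeMo Curator's implementation: `normalize_unicode`, `has_min_words`, and `curate` are hypothetical names, NFC normalization is a simplified stand-in for `UnicodeReformatter`, and whitespace splitting with an inclusive threshold is an assumption about how `WordCountFilter` counts words.

```python
import unicodedata


def normalize_unicode(text: str) -> str:
    # Stand-in for the Unicode-fixing stage: NFC normalization only.
    # (Assumption: the real modifier does more than this.)
    return unicodedata.normalize("NFC", text)


def has_min_words(text: str, min_words: int = 100) -> bool:
    # Count whitespace-delimited tokens; the threshold is treated as inclusive.
    return len(text.split()) >= min_words


def curate(records, min_words=100):
    # Apply the stages in order: modify first, then filter.
    for record in records:
        text = normalize_unicode(record["text"])
        if has_min_words(text, min_words):
            yield {**record, "text": text}


# Made-up records: one far below the threshold, one with 120 words.
records = [
    {"text": "za krótki tekst"},
    {"text": "słowo " * 120},
]
kept = list(curate(records))
print(len(kept), "of", len(records), "records kept")
```

Running the modifier before the filter mirrors the order of the stages in `Sequential`, so the word count is computed on the cleaned text.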



In [4]:
len(df[df.language == "POLISH"])

1646

In [8]:
df.head(5)

,filename,language,source_id,text,url,warc_id
0,CC-MAIN-20221126080725-20221126110725-00000.wa...,KOREAN,crawl-data-CC-MAIN-2022-49-segments-1669446706...,"대한제일 천안출장안마 BOBO출장안마\n\n제주출장안마 꿈과 사람, 청녀, 여러 복...",http://0130.com.cn/news/shownews.php?id=188,230340ee-4f5c-4c5a-b1d6-d98a986f47d9
1,CC-MAIN-20221126080725-20221126110725-00000.wa...,KOREAN,crawl-data-CC-MAIN-2022-49-segments-1669446706...,대한제일 천안출장안마 BOBO출장안마\n\n전주출장샵 토요일 샤오 얀 공격\n\n2...,http://0130.com.cn/news/shownews.php?id=43,d79e1ca3-04f5-4268-b8aa-12dc3582dfef
2,CC-MAIN-20221126080725-20221126110725-00000.wa...,POLISH,crawl-data-CC-MAIN-2022-49-segments-1669446706...,"Obchody ""Czerwca 76"" z rekomendacją Komisji Ku...",http://095160170158.vectranet.pl/wiadomosci/it...,7bc391ef-9e15-46f4-a3c5-f66f6472f98f
3,CC-MAIN-20221126080725-20221126110725-00000.wa...,ENGLISH,crawl-data-CC-MAIN-2022-49-segments-1669446706...,Opret ny bruger\n\nIf you agree with the follo...,http://123nu.dk/lystfiskeri/forum/registration...,605120ef-1935-4607-80d8-daa8e9591aeb
5,CC-MAIN-20221126080725-20221126110725-00000.wa...,VIETNAMESE,crawl-data-CC-MAIN-2022-49-segments-1669446706...,"Xin chào tất cả anh em , hôm nay diễn đàn XSMB...",http://1368.info/soi-cau-3-cang/,c789125f-a204-42f6-b97b-73815ff257f2


In [7]:
# Keep only the text of the Polish records and renumber the rows from 0
df1 = df[df.language == "POLISH"]['text'].reset_index(drop=True)

In [6]:
df1

0       Obchody "Czerwca 76" z rekomendacją Komisji Ku...
1       Tytuł artykułu\n\n1988\n\nCzasopismo\n\nWydawc...
2       awx2 architekci Tag\n\nWłaściciele DOMU MK pod...
3       W artykule "Tabele i wykresy na start!"Tabele ...
4       Studenci z sercem pełnym Caritas\n\nŻycie stud...
                              ...                        
1641    Alkohol tylko od piątku do niedzieli? Sprawdź,...
1642    Usługi wdrożeniowe\n\nUsługi wdrożeniowe\n\nUs...
1643    Kalendarz\n\nKonkurs Biblijny dla klas 4-5\n\n...
1644    Konkurs plastyczny dla kl. 6a i 6b\n\n9 maja 2...
1645    Tutaj jesteś:\n\nNordic walking to idealny kom...
Name: text, Length: 1646, dtype: object

In [14]:
# Write the Polish texts to text.txt without a header row or index column
df1.to_csv('text.txt', header=False, index=False)
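
One thing to keep in mind about the resulting `text.txt`: `to_csv` uses minimal CSV quoting by default, so a document containing line breaks or double quotes is written as a quoted field that spans several physical lines. The file is therefore not "one document per line" and should be read back with a CSV parser. The sketch below shows the same behaviour with Python's `csv` module, whose default quoting is also `QUOTE_MINIMAL`; the sample strings are made up and only modeled on the output above.

```python
import csv
import io

# Illustrative documents (not taken from the dataset).
docs = [
    'Obchody "Czerwca 76"',     # contains double quotes
    "Tytuł artykułu\n\n1988",   # contains line breaks
    "Zwykły tekst",             # needs no quoting
]

# Write one single-column CSV record per document.
buffer = io.StringIO()
writer = csv.writer(buffer)  # default quoting: csv.QUOTE_MINIMAL
for doc in docs:
    writer.writerow([doc])

# Splitting on line breaks yields more lines than there are documents.
physical_lines = buffer.getvalue().splitlines()
print(len(docs), "documents,", len(physical_lines), "physical lines")

# A CSV reader restores the original documents exactly.
buffer.seek(0)
restored = [row[0] for row in csv.reader(buffer)]
print(restored == docs)
```

If a downstream tool expects exactly one document per line, the line breaks inside each text would need to be escaped or replaced before writing.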