### Preprocessing new datasets

TopMost can preprocess raw datasets into a standard format for topic modeling.
A raw dataset must include two files, `train.jsonlist` and `test.jsonlist`. Each file holds one JSON object per line, for example:

```json
{"label": "rec.autos", "text": "WHAT car is this!?..."}
{"label": "comp.sys.mac.hardware", "text": "A fair number of brave souls who upgraded their..."}
```

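The `label` field is optional, and its key does not have to be named `label`: the 20NG files used in the example below store it under `group`, which is passed to `preprocess_jsonlist` as `label_name="group"`.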
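If your corpus is not yet in this format, the two files can be produced with the standard library alone. The following is a minimal sketch, not part of TopMost: the helper `write_jsonlist`, the toy documents, and the directory `./datasets/my_dataset` are all hypothetical names chosen for illustration.

```python
import json
from pathlib import Path


def write_jsonlist(records, path):
    """Write an iterable of dicts to `path`, one JSON object per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes newlines inside strings, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical toy corpus; replace with your own documents.
train_docs = [
    {"label": "rec.autos", "text": "WHAT car is this!?..."},
    {"label": "comp.sys.mac.hardware", "text": "A fair number of brave souls who upgraded their..."},
]
test_docs = [
    {"label": "rec.autos", "text": "Looking for the model name of a late 60s sports car."},
]

dataset_dir = Path("./datasets/my_dataset")  # hypothetical output directory
write_jsonlist(train_docs, dataset_dir / "train.jsonlist")
write_jsonlist(test_docs, dataset_dir / "test.jsonlist")
```

Writing one object per line, rather than a single JSON array, keeps the files streamable and lets tools such as `head` show complete records.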

**The following example downloads and preprocesses the 20 Newsgroups (20NG) dataset.**

In [1]:
from topmost.data import download_20ng
from topmost.data import download_dataset
from topmost.preprocessing import Preprocessing

# download raw data
download_20ng.download_save(output_dir='./datasets/20NG')


{'20ng_all': ['talk.religion.misc', 'comp.windows.x', 'rec.sport.baseball', 'talk.politics.mideast', 'comp.sys.mac.hardware', 'sci.space', 'talk.politics.guns', 'comp.graphics', 'comp.os.ms-windows.misc', 'soc.religion.christian', 'talk.politics.misc', 'rec.motorcycles', 'comp.sys.ibm.pc.hardware', 'rec.sport.hockey', 'misc.forsale', 'sci.crypt', 'rec.autos', 'sci.med', 'sci.electronics', 'alt.atheism']}
===>name:  20ng_all
===>categories:  ['talk.religion.misc', 'comp.windows.x', 'rec.sport.baseball', 'talk.politics.mideast', 'comp.sys.mac.hardware', 'sci.space', 'talk.politics.guns', 'comp.graphics', 'comp.os.ms-windows.misc', 'soc.religion.christian', 'talk.politics.misc', 'rec.motorcycles', 'comp.sys.ibm.pc.hardware', 'rec.sport.hockey', 'misc.forsale', 'sci.crypt', 'rec.autos', 'sci.med', 'sci.electronics', 'alt.atheism']
===>subset:  train
Downloading articles
data size:  11314
Saving to ./datasets/20NG
===>name:  20ng_all
===>categories:  ['talk.religion.misc', 'comp.windows.x',

In [2]:
! head -2 ./datasets/20NG/train.jsonlist

{"group": "rec.autos", "text": "From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"}
{"group": "comp.sys.mac.hardware", "text": "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D
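Before running the preprocessing step, it can help to confirm that every line parses and carries the expected keys. The sketch below is one way to do that in plain Python; `check_jsonlist` is a hypothetical helper, not a TopMost function, and the assumption that each record needs a `text` key is taken from the examples shown above.

```python
import json
from pathlib import Path


def check_jsonlist(path, text_key="text", label_key=None):
    """Return the number of records in `path`; raise ValueError on the first bad line."""
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from err
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_no}: expected a JSON object")
            if text_key not in record:
                raise ValueError(f"{path}:{line_no}: missing key {text_key!r}")
            if label_key is not None and label_key not in record:
                raise ValueError(f"{path}:{line_no}: missing key {label_key!r}")
            count += 1
    return count


# Only runs if the downloaded file from the step above is present.
train_path = Path("./datasets/20NG/train.jsonlist")
if train_path.exists():
    print("training records:", check_jsonlist(train_path, label_key="group"))
```

Passing `label_key="group"` mirrors the key seen in the downloaded 20NG files; leave it as `None` for unlabeled data.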

In [3]:
# download stopwords
download_dataset('stopwords', cache_path='./datasets')

# preprocess raw data
preprocessing = Preprocessing(vocab_size=5000, stopwords='./datasets/stopwords/snowball_stopwords.txt')

rst = preprocessing.preprocess_jsonlist(dataset_dir='./datasets/20NG', label_name="group")

preprocessing.save('./datasets/20NG', **rst)

https://raw.githubusercontent.com/BobXWu/TopMost/master/data/stopwords.zip
Using downloaded and verified file: ./datasets/stopwords.zip
Found training documents 11314 testing documents 7532
label2id:  {'alt.atheism': 0, 'comp.graphics': 1, 'comp.os.ms-windows.misc': 2, 'comp.sys.ibm.pc.hardware': 3, 'comp.sys.mac.hardware': 4, 'comp.windows.x': 5, 'misc.forsale': 6, 'rec.autos': 7, 'rec.motorcycles': 8, 'rec.sport.baseball': 9, 'rec.sport.hockey': 10, 'sci.crypt': 11, 'sci.electronics': 12, 'sci.med': 13, 'sci.space': 14, 'soc.religion.christian': 15, 'talk.politics.guns': 16, 'talk.politics.mideast': 17, 'talk.politics.misc': 18, 'talk.religion.misc': 19}


===>parse train texts: 100%|██████████| 11314/11314 [00:19<00:00, 583.18it/s]
===>parse test texts: 100%|██████████| 7532/7532 [00:10<00:00, 705.35it/s]
===>parse texts: 100%|██████████| 11314/11314 [02:47<00:00, 67.69it/s] 
===>making word embeddings: 100%|██████████| 5000/5000 [00:02<00:00, 1868.27it/s]


===> number of found embeddings: 4957/5000
Real vocab size: 5000
Real training size: 11314 	 avg length: 110.543


===>parse texts: 100%|██████████| 7532/7532 [01:35<00:00, 78.61it/s] 


Real testing size: 7532 	 avg length: 106.663
