manipulative_news_methodology/data_processing at master · texty/manipulative_news_methodology

README.md
classify_tech_tags.pkl
dvect_tech_tags.pkl
itos_ru.pkl
itos_uk.pkl
langdetect.py
tokenize_and_numericalize.ipynb

README.md

Data preprocessing

This folder contains a notebook that turns readability HTML into tokenized text and then into numerical ids for the classifier.

The notebook is written to work either with a PostgreSQL database holding all the HTML pages (that code is commented out) or with the ../htmls_sample.jl.bz2 sample of 100k articles. The final output, a sequence of token ids for each article, is the input to the LM classifier.
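The two steps described above (HTML to tokens, tokens to ids) can be sketched with the standard library alone. This is a simplified illustration, not the notebook's code: the regex tokenizer, the function names `html_to_tokens` and `numericalize`, the toy `itos` list, and the choice of id 0 for unknown tokens are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_tokens(html):
    """Strip tags, lowercase, and split into word and punctuation tokens."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # \w matches Unicode letters, so Cyrillic words stay whole;
    # every other non-space character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)


def numericalize(tokens, itos, unk_id=0):
    """Map tokens to their index in itos; unseen tokens get unk_id (assumed 0)."""
    stoi = {tok: i for i, tok in enumerate(itos)}
    return [stoi.get(tok, unk_id) for tok in tokens]


# Toy dictionary standing in for itos_uk.pkl (hypothetical contents).
itos = ["_unk_", "_pad_", ".", "новини", "дня"]
tokens = html_to_tokens("<div><p>Новини дня.</p><p>Читайте також</p></div>")
ids = numericalize(tokens, itos)
print(tokens)
print(ids)
```

Since each line of a .jl file is one JSON object, the sample could be streamed with `bz2.open(path, "rt")` and `json.loads`; the name of the field that holds the HTML is not documented here, so it is left out of the sketch.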

itos_ru.pkl, itos_uk.pkl - token dictionaries (Russian and Ukrainian), composed during fine-tuning of the language model. Each is a list of up to 60k most common words with a minimum frequency of 15.

classify_tech_tags.pkl - scikit-learn LogisticRegression classifier that detects technical HTML elements, such as datetime, "read also" paragraphs, social media buttons etc.
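A dictionary with the limits described above (at most 60k words, minimum frequency 15) could be built roughly as follows. This is a sketch under assumptions: the function name `build_itos`, the special tokens `_unk_` and `_pad_`, and the decision not to count them toward the 60k limit are illustrative and not confirmed by the repository.

```python
from collections import Counter


def build_itos(tokenized_docs, max_vocab=60000, min_freq=15,
               specials=("_unk_", "_pad_")):
    """Return an index-to-string list: special tokens first, then the
    most frequent words that occur at least min_freq times."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    itos = list(specials)
    # most_common() yields tokens in descending frequency, so we can stop
    # at the first token that falls below the threshold.
    for tok, freq in counts.most_common(max_vocab):
        if freq < min_freq:
            break
        itos.append(tok)
    return itos


# Tiny demonstration with lowered limits so the effect is visible.
docs = [["a", "b", "a"], ["a", "b", "c"]]
small_itos = build_itos(docs, max_vocab=2, min_freq=2)
print(small_itos)
```

The position of a word in the list is its numerical id, which is why the same file has to be used both when fine-tuning the language model and when numericalizing new articles.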

langdetect.py - a simple script that detects language with cld2 or the Python langid package.
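One way to combine the two detectors is to try cld2 first and fall back to langid. The sketch below is not the contents of langdetect.py; the function name `detect_language`, the fallback order, and the use of the `pycld2` binding are assumptions. Both packages are optional here, and the function returns None if neither is installed.

```python
def detect_language(text):
    """Return a language code such as 'uk' or 'ru', or None if no
    detector is available."""
    try:
        import pycld2 as cld2  # assumed binding for cld2
        # detect() returns (is_reliable, bytes_found, details), where each
        # entry of details is (language_name, language_code, percent, score).
        is_reliable, _, details = cld2.detect(text)
        if is_reliable:
            return details[0][1]
    except Exception:
        # pycld2 is missing or rejected the input; fall through to langid.
        pass
    try:
        import langid
        # classify() returns a (language_code, score) pair.
        lang, _score = langid.classify(text)
        return lang
    except ImportError:
        return None


result = detect_language("Це український текст про новини дня.")
print(result)
```

Trying cld2 first is only a design choice for this example; the order could just as well be reversed or made configurable.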
