# Unsupervised categorization of news articles
- Install PyTorch:

  ```
  pip install torch torchvision
  ```

  if a CUDA GPU is available (recommended), or

  ```
  pip install torch==1.5.0+cpu torchvision==0.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
  ```

  for CPU only. Alternatively, go to https://pytorch.org to get the best version for your setup.
- Run

  ```
  pip install -r requirements.txt
  ```

  or, if conda is installed,

  ```
  conda env create -f environment.yml
  ```

- Open a Python interpreter and run

  ```python
  import nltk
  nltk.download('punkt')
  nltk.download('stopwords')
  nltk.download('wordnet')
  ```
- Download the following and extract them inside `data/`:
  - https://download.geonames.org/export/dump/IN.zip
  - https://download.geonames.org/export/dump/admin1CodesASCII.txt
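For reference, the extracted `IN.txt` follows GeoNames' 19-column tab-separated dump format. Below is a minimal loader sketch; the column names follow the GeoNames export readme, but `load_geonames` is a hypothetical helper — how this project's scripts actually consume the file is not shown here.

```python
import pandas as pd

# Column layout of GeoNames country dumps (per the GeoNames export readme).
GEONAMES_COLS = [
    "geonameid", "name", "asciiname", "alternatenames",
    "latitude", "longitude", "feature_class", "feature_code",
    "country_code", "cc2", "admin1", "admin2", "admin3", "admin4",
    "population", "elevation", "dem", "timezone", "modified",
]

def load_geonames(path):
    """Load a GeoNames dump (e.g. data/IN.txt) into a DataFrame.

    dtype=str keeps admin codes like "07" intact; empty fields stay "".
    """
    return pd.read_csv(path, sep="\t", header=None, names=GEONAMES_COLS,
                       dtype=str, keep_default_na=False)
```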
- Install Docker and, if needed, run the following commands to start Elasticsearch and Kibana (for development):

  ```
  docker pull docker.elastic.co/elasticsearch/elasticsearch-oss:7.8.1
  docker pull docker.elastic.co/kibana/kibana-oss:7.8.1
  docker run -d --name elastic -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch-oss:7.8.1
  docker run -d --name kibana --link elastic:elasticsearch -p 5601:5601 docker.elastic.co/kibana/kibana-oss:7.8.1
  ```

  Then run the curl command in `create_index.txt` to create the Elasticsearch index.
- Put the dataset inside the `data/` folder and run `data_clean.py`.
- Run `get_langs_fast.py`.
- Run `ultimate_destruction.py`; when asked for input, enter the `start` and `end` values for DF subpart processing.
- Run `get_locs_people.py`, `get_time.py`, and `get_coords.py` to extract locations, people, and times.
- Results will be saved in `data/cleaned/submission_*.csv`.
- Run `populate_elastic.py` (or `populate_elastic_notime.py` if the web-scraped HTML files are not available). The default Elasticsearch port on `localhost` is used.
- `data_clean.py` builds a pandas DataFrame from the dataset CSV.
- `get_langs_fast.py` builds a DataFrame of detected languages for the title, desc, and long_desc columns of the processed DataFrame.
- `get_locs_people.py`, `get_time.py`, and `get_coords.py` extract locations, times, and coordinates from the articles and some scraped URLs.
- `preprocess.py` applies lemmatization and simple preprocessing to the text.
- `ultimate_destruction.py` combines both of the above DataFrames with the complete implementation to generate a DataFrame of keywords.
- `csv.reader` is used to read the given dataset.
- The read lines are fed directly to a `csv.writer` with a different separator to avoid collisions.
- Lines that don't have exactly 5 fields are passed to a `lineCleaner` function.
- `lineCleaner` uses regex to find the id, url, and long_description; the remaining fields are recovered with separator logic based on unbalanced `quotechar`s.
- If even `lineCleaner` fails, the row is written to the `badrows.csv` file, with None values for everything except `id` in the main DF. This run produced only 693 bad rows (<0.05%), so we didn't spend more time on them.
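The routing logic above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the regex, the 5-column layout, and the `clean_rows` helper are all assumptions, and the real `lineCleaner` also applies the unbalanced-`quotechar` separator logic, which is omitted here.

```python
import csv
import re

# Hypothetical fallback regex: pull id and url off the front, keep the rest
# as long_description. The real patterns in the project may differ.
ID_URL_RE = re.compile(r'^(\d+),(https?://[^,]+),(.*)$', re.S)

def clean_rows(raw_lines, n_fields=5):
    """Yield (row, bad_line) pairs: well-formed rows pass through, broken
    ones are repaired with the regex, hopeless ones go to the bad side."""
    for row in csv.reader(raw_lines):
        if len(row) == n_fields:
            yield row, None            # well-formed: re-written with a new separator
            continue
        m = ID_URL_RE.match(','.join(row))
        if m:
            art_id, url, rest = m.groups()
            yield [art_id, url, rest, None, None], None
        else:
            yield None, ','.join(row)  # destined for badrows.csv
```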
- The `langdetect` library was used to detect the language of each piece of content; detection was parallelized for faster results.
- `RoBERTa` requires minimal to no preprocessing.
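The parallelized detection can be sketched like this. The detector is injected so the sketch stays library-agnostic; `detect_languages` is a hypothetical helper, and `get_langs_fast.py` may organize this differently (e.g. with processes rather than threads).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def _detect_one(text, detector):
    # langdetect raises on empty or undetectable input, so guard each call.
    try:
        return detector(text)
    except Exception:
        return None

def detect_languages(texts, detector, workers=8):
    """Map a language detector (e.g. langdetect.detect) over texts in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(partial(_detect_one, detector=detector), texts))
```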
- A `the_destructor` function is applied to each row and does the following:
  - Feed `long_description` to `RoBERTa`, or `title`/`description` (when the content is null) to multilingual BERT.
  - Tokenize the content into sentences.
  - Select candidate keywords from the tokenized content using a `RegexpParser` tree.
  - Process all candidate keywords generated from the tree with `RoBERTa` or `BERT` to get their embeddings, along with the main content embeddings.
  - Pass the candidate keyword and main content embeddings to the `get_topk` function, which applies the MMR algorithm to return the top k keywords. MMR automatically eliminates duplicates (i.e. keywords whose similarity exceeds a given threshold).
  - Return the id and keywords for each row.
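The candidate-selection step can be sketched with NLTK's `RegexpParser`. The grammar below (optional adjectives followed by nouns) is an assumption — the actual chunk grammar used in `ultimate_destruction.py` may differ.

```python
from nltk import RegexpParser

# Assumed noun-phrase grammar; the project's actual RegexpParser tree may differ.
GRAMMAR = "NP: {<JJ>*<NN.*>+}"

def candidate_keywords(tagged_tokens):
    """tagged_tokens: (word, POS) pairs, e.g. from nltk.pos_tag(word_tokenize(s))."""
    tree = RegexpParser(GRAMMAR).parse(tagged_tokens)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == "NP"]
```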
- Embeddings are generated for all categories in the tree.
- Category embeddings and article embeddings are fed into MMR to get the top 2 matching categories.
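The MMR step used for both keyword selection and category matching can be sketched as below. This is a generic MMR implementation, not the project's `get_topk`; the function signature, diversity weight, and cosine-similarity choice are assumptions.

```python
import numpy as np

def _cosine(a, b):
    # Row-wise cosine similarity between two 2-D arrays.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mmr_topk(doc_emb, cand_embs, candidates, k=5, diversity=0.7):
    """Maximal Marginal Relevance: balance relevance to the document against
    similarity to already-selected candidates, which suppresses duplicates."""
    doc_sim = _cosine(cand_embs, doc_emb.reshape(1, -1)).ravel()
    cand_sim = _cosine(cand_embs, cand_embs)
    selected = [int(np.argmax(doc_sim))]          # most relevant candidate first
    while len(selected) < min(k, len(candidates)):
        rest = [i for i in range(len(candidates)) if i not in selected]
        scores = [(1 - diversity) * doc_sim[i]
                  - diversity * cand_sim[i, selected].max() for i in rest]
        selected.append(rest[int(np.argmax(scores))])
    return [candidates[i] for i in selected]
```

With `k=2` this mirrors the top-2 category matching above: a near-duplicate of an already-selected item is penalized, so the second pick is the most relevant *distinct* candidate.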
- Bennani-Smires K., Musat C., Hossman A., Baeriswyl M., Jaggi M., 2018. Simple Unsupervised Keyphrase Extraction using Sentence Embeddings. arXiv:1801.04470
- Liu Y., Ott M., Goyal N., Du J., Joshi M., Chen D., Levy O., Lewis M., Zettlemoyer L., Stoyanov V., 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692