Telugu-Newspaper-Article-Dataset

This project scrapes articles from the archives of the Telugu newspaper website Andhra Jyoti. A set of queries was created, and the corresponding ground-truth answers were retrieved using a combination of two popular ranking functions, BM25 and tf-idf.
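The README does not say how the two ranking functions were combined, so the following is only a minimal sketch of one plausible approach: score every document with both tf-idf and BM25, then keep the documents that appear in the top k of both rankings. The function names, the whitespace tokenizer, the parameters `k1`, `b` and `k`, and the intersection rule are all assumptions made for illustration, not the project's actual method. The sample documents use placeholder ASCII tokens instead of Telugu text.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization; real Telugu text may need more care.
    return text.split()


def document_frequencies(tokenized_docs):
    # Number of documents in which each term occurs at least once.
    return Counter(term for toks in tokenized_docs for term in set(toks))


def tfidf_scores(query, docs):
    """Sum of tf-idf weights of the query terms, one score per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = document_frequencies(tokenized)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if tf[term]:
                score += (tf[term] / len(toks)) * math.log(n / df[term])
        scores.append(score)
    return scores


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of the query against each document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n
    df = document_frequencies(tokenized)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if not tf[term]:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def top_k(scores, k):
    # Indices of the k best-scoring documents, ignoring documents with score 0.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:k] if scores[i] > 0]


def relevant_docs(query, docs, k=2):
    # Assumed combination rule: keep documents ranked in the top k by BOTH functions.
    by_tfidf = set(top_k(tfidf_scores(query, docs), k))
    by_bm25 = set(top_k(bm25_scores(query, docs), k))
    return sorted(by_tfidf & by_bm25)


if __name__ == "__main__":
    sample_docs = [
        "cricket match today",
        "election results today",
        "cricket team wins cricket cup",
        "weather report",
    ]
    print(relevant_docs("cricket", sample_docs))
```

Taking the intersection favors precision: a document counts as a ground-truth answer only when both functions agree on it. A union or a weighted sum of normalized scores would be equally reasonable alternatives.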

Dataset

The complete dataset can be downloaded from here.

Alternatively, to build the dataset yourself using the code, follow the steps below.

Requirements

Python 3
pip3
Telugu must be enabled in your machine's language settings so that Telugu text displays correctly.

Execution Steps

Open the terminal and change current working directory to the location where you want to clone the project.

git clone https://github.com/AnushaMotamarri/Telugu-Newspaper-Article-Dataset
cd Telugu-Newspaper-Article-Dataset
python3 makedirs.py
pip3 install bs4
pip3 install requests
python3 scrapeTelugu.py

Text files should now start appearing in subfolders of the telugudata directory.
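To confirm that the scraper is producing output, the files can be counted per subfolder. This is a minimal sketch that assumes the articles are saved with a `.txt` extension, encoded as UTF-8, and placed one level below `telugudata`. The helper names `count_articles` and `read_article` are hypothetical and not part of the repository.

```python
from pathlib import Path


def count_articles(root="telugudata"):
    """Return {subfolder name: number of .txt files} for each subfolder of root."""
    root_path = Path(root)
    counts = {}
    for sub in sorted(p for p in root_path.iterdir() if p.is_dir()):
        # Assumption: each article is a file ending in .txt directly inside the subfolder.
        counts[sub.name] = sum(1 for f in sub.glob("*.txt") if f.is_file())
    return counts


def read_article(path):
    # Assumption: the scraped Telugu text is stored as UTF-8.
    return Path(path).read_text(encoding="utf-8")


if __name__ == "__main__":
    # Only report counts if the scraper has already created the directory.
    if Path("telugudata").is_dir():
        for name, n in count_articles().items():
            print(name, n)
```

Run it from the project root while or after `scrapeTelugu.py` executes. Counts that grow between runs indicate the scraper is still writing files.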

This scraper is specific to the Andhra Jyoti website and does not work with other websites.

Related Works

A similar work on a Malayalam dataset can be found here.
