karwanbazaar

a Scrapy-based crawler that collects articles from দৈনিক মতিকণ্ঠ, a satirical news site in Bangla

running locally

git clone https://github.com/ShawonAshraf/karwanbazaar.git
cd karwanbazaar

# conda env
conda env create -f ghochu.yml
conda activate karwanbazaar

# run
python main.py

spiders

the pipeline has four spiders. they must run sequentially, because each spider depends on the output of the spiders that ran before it.

karwanbazaar/spiders
├── archives.py
├── article_content.py
├── article_urls.py
└── index.py

running order of the spiders:

{
    0: index,
    1: archives,
    2: article_urls,
    3: article_content
}

every spider writes its output files (jsonl, txt, html) to the output directory.
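the running order above can be sketched in a few lines of Python. this is only an illustration, not necessarily how main.py does it: it assumes each spider's Scrapy name matches its module name (index, archives, article_urls, article_content) and that the commands are run from the project root, where scrapy.cfg lives. the helper names ordered_spiders and run_pipeline are made up for this example.

```python
import subprocess

# Running order taken from the list above. The spider names are an
# assumption: they are taken to match the module names under
# karwanbazaar/spiders.
RUN_ORDER = {
    0: "index",
    1: "archives",
    2: "article_urls",
    3: "article_content",
}


def ordered_spiders(order=RUN_ORDER):
    """Return the spider names sorted by their position in the pipeline."""
    return [order[step] for step in sorted(order)]


def run_pipeline(run=subprocess.run):
    """Run each spider to completion before starting the next one.

    `run` is injectable so the sequencing can be checked without Scrapy.
    """
    for name in ordered_spiders():
        # check=True raises if a spider exits with an error, so later
        # spiders never start from incomplete output.
        run(["scrapy", "crawl", name], check=True)
```

calling run_pipeline() would invoke `scrapy crawl` once per spider, in order. running each spider as its own process keeps the example simple; Scrapy also supports chaining spiders inside one process, which may be closer to what main.py does.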

final output

articles.jsonl is the final output. it holds all the articles in jsonl format, one JSON object per line.

each line has the following fields (shown here across several lines for readability):

{
  "article_id": "article id",
  "title": "title of the article",
  "content": "content of the article"
}
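a minimal sketch of reading the file back in Python, using only the standard library. the default path output/articles.jsonl is an assumption based on the output directory mentioned above, and read_articles is a hypothetical helper name, not part of the project.

```python
import json
from pathlib import Path


def read_articles(path="output/articles.jsonl"):
    """Yield one dict per non-empty line of a JSONL file.

    The default path is an assumption; pass the real location if it differs.
    """
    # utf-8 is given explicitly because the articles are in Bangla.
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

for example, `titles = [a["title"] for a in read_articles()]` would collect every title. the function is a generator, so the whole file never has to sit in memory at once.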
