Automatic Scraping project for extracting FAQs and Help center articles

Introduction

This project automatically scrapes FAQs from help-center articles using the Scrapy framework. By "automatic" I mean that you provide a list of companies and their help-center URLs, and the scraper automatically follows all the internal articles and extracts FAQs as (Question, Answer) pairs.

How it works

Currently, the scraper supports two types of operations:

  • general: scraping general HelpCenter content.
  • zendesk: scraping Zendesk companies.

The scraper reads a list of companies and starts scraping; it writes the results to JSON files, one folder per company.
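The exact format of the company list isn't spelled out in this README; as a purely hypothetical illustration, it could be as simple as a CSV mapping each company to its help-center URL (the names and URLs below are placeholders):

company,help_center_url
acme,https://help.acme.com
globex,https://support.globex.com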

Zendesk:

This one is straightforward: Zendesk help centers share a common URL pattern,

f'{company_domain}/api/v2/help_center/en-us/sections.json',
f'{company_domain}/api/v2/help_center/en-us/articles.json'

Simply telling the spider to follow those endpoints gets you all the articles and their sections, which is what zendesk_spider does.
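As a minimal sketch of that idea (not the repository's actual zendesk_spider, and assuming the standard Zendesk Help Center API, which paginates via a next_page field), a Scrapy spider following those endpoints could look like this:

import json

import scrapy


class ZendeskFaqSpider(scrapy.Spider):
    # Hypothetical name; the repo's real spider is zendesk_spider.
    name = "zendesk_faq_sketch"

    def __init__(self, company_domain, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [
            f"{company_domain}/api/v2/help_center/en-us/sections.json",
            f"{company_domain}/api/v2/help_center/en-us/articles.json",
        ]

    def parse(self, response):
        data = json.loads(response.text)
        # Yield every article/section object on this page of results.
        for key in ("articles", "sections"):
            for item in data.get(key, []):
                yield item
        # The Zendesk API paginates; keep following next_page until it is null.
        if data.get("next_page"):
            yield response.follow(data["next_page"], callback=self.parse)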

General:

This is the tricky one. The objective here is to scrape any other help-center URL by following the tree pattern, if one exists; the tree pattern is simply start_url >> start_url/categories >> category >> article. To do that, the spider keeps recursively following the pattern while being careful to avoid hitting URLs that are not help-center articles. After that, it stores all the HTML content it fetched and processes it as a tree, looking for the leaf HTML pages that contain the articles and extracting FAQs from them.
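A minimal sketch of that recursive crawl, assuming (hypothetically) that same-domain links whose paths contain segments like /categories or /articles are the ones worth following; the repository's real heuristics may differ:

from urllib.parse import urlparse

import scrapy


class GeneralHelpCenterSpider(scrapy.Spider):
    # Hypothetical sketch, not the repo's actual general spider.
    name = "general_faq_sketch"

    # Path hints used to guess that a link belongs to the help-center tree.
    FOLLOW_HINTS = ("/categories", "/sections", "/articles", "/hc/")

    def __init__(self, start_url, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url]
        self.allowed_domain = urlparse(start_url).netloc

    def parse(self, response):
        # Store the raw HTML; a later processing step walks the saved pages
        # as a tree and extracts FAQs from the leaf article pages.
        yield {"url": response.url, "html": response.text}
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            parsed = urlparse(url)
            # Stay on the help-center domain and only follow URLs that
            # look like part of the category/article tree.
            if parsed.netloc == self.allowed_domain and any(
                hint in parsed.path for hint in self.FOLLOW_HINTS
            ):
                yield response.follow(url, callback=self.parse)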

How to use

To build and run the scraper:

docker build -f scraping_docker -t scraper .
docker run scraper -f filename -t operation_type

To build and run the processor:

docker build -f processing_docker -t processor .
docker run processor -f filename -t operation_type
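For example, assuming the company list is saved as companies.csv (a hypothetical filename) and you are scraping Zendesk-hosted help centers:

docker run scraper -f companies.csv -t zendesk
docker run processor -f companies.csv -t zendesk

Depending on how the Docker images handle input files, you may also need to mount the file into the container (for example with docker run -v).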

Extras:

You can use AWS and MongoDB with this project. It requires more configuration, but I have added writers and pipelines that can help with that.
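As an illustration of what such a pipeline might look like, here is a generic Scrapy-to-MongoDB pipeline in the style of the Scrapy documentation (not this project's actual pipeline; the MONGO_URI and MONGO_DATABASE settings are hypothetical):

import pymongo
from itemadapter import ItemAdapter


class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read connection settings from the Scrapy settings module.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "faqs"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Each scraped (Question, Answer) item lands in a per-spider collection.
        self.db[spider.name].insert_one(ItemAdapter(item).asdict())
        return item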
