
A news articles search engine with complete frontend and backend hosted on Amazon Web Services (AWS) using CloudWatch, Lambda, SQS, Elasticsearch, Elastic Beanstalk, and S3

Albert-Z-Guo/News-Articles-Search-Engine


News Articles Search Engine

Project Description

This project periodically imports news articles crawled from the internet by Common Crawl. The imported news articles are searchable by keyword (matches are highlighted in the paginated results), publishing date, and language (English or non-English).

This project consists of 4 parts:

  1. An AWS Lambda function that periodically imports (checking every hour) the latest crawled news articles and sends parsed news to AWS Simple Queue Service (SQS)
  2. An AWS Lambda function that retrieves parsed news from AWS SQS and posts it to AWS Elasticsearch Service
  3. Backend Search API
  4. Frontend Search Website
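
The hand-off between the two Lambda functions in parts 1 and 2 can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `to_sqs_batches` and `articles_from_sqs_event` are hypothetical, and the sketch assumes each article travels as one JSON-encoded SQS message. It only prepares and decodes payloads; the real calls to SQS and Elasticsearch are left out.

```python
import json
from typing import Dict, Iterator, List

# SQS SendMessageBatch accepts at most 10 entries per call.
SQS_MAX_BATCH_ENTRIES = 10


def to_sqs_batches(articles: List[Dict]) -> Iterator[List[Dict]]:
    """Group parsed articles into entry lists sized for SendMessageBatch.

    Hypothetical helper for the importing Lambda (part 1).
    """
    for start in range(0, len(articles), SQS_MAX_BATCH_ENTRIES):
        chunk = articles[start:start + SQS_MAX_BATCH_ENTRIES]
        # Each entry needs an Id that is unique within its batch.
        yield [
            {"Id": str(start + offset), "MessageBody": json.dumps(article)}
            for offset, article in enumerate(chunk)
        ]


def articles_from_sqs_event(event: Dict) -> List[Dict]:
    """Decode the article JSON carried by each record of an SQS-triggered event.

    Hypothetical helper for the indexing Lambda (part 2).
    """
    return [json.loads(record["body"]) for record in event.get("Records", [])]
```

With boto3, each batch could then be passed as the `Entries` argument of `send_message_batch`, and the decoded articles could be posted to the Elasticsearch domain by the second function.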

The news HTML pages are stored together in a .warc.gz file. Before being posted to AWS Elasticsearch Service, each page is parsed into a JSON object with the following fields:

  • URL
  • Title
  • Text
  • Language
  • Publishing Date
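
As a rough illustration of the JSON object listed above, the fields could be modelled like this. The key names (`url`, `title`, `text`, `language`, `publishing_date`) and the ISO 8601 date format are assumptions made for the example; the project's actual key names and formats are not documented here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class NewsArticle:
    """One parsed news page; field names are assumed, not taken from the project."""

    url: str
    title: str
    text: str
    language: str          # assumed to be a language code such as "en"
    publishing_date: str   # assumed to be an ISO 8601 date string

    def to_json(self) -> str:
        # ensure_ascii=False keeps non-English article text readable.
        return json.dumps(asdict(self), ensure_ascii=False)


article = NewsArticle(
    url="https://example.com/story",
    title="Example headline",
    text="Example body text.",
    language="en",
    publishing_date="2019-01-01",
)
print(article.to_json())
```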

The backend Search API is deployed to an Apache Tomcat environment managed by AWS Elastic Beanstalk. The frontend Search website is hosted in an AWS S3 bucket.
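
The search features described earlier (keywords with highlighting, publishing date, English or non-English, paged results) could map onto an Elasticsearch request body roughly as below. This is a sketch under assumptions: the function `build_search_body`, the field names, and the `"en"` language code are hypothetical, and Python is used only for illustration since the backend itself runs on Tomcat.

```python
from typing import Dict, Optional


def build_search_body(
    keywords: str,
    date: Optional[str] = None,
    english: Optional[bool] = None,
    page: int = 0,
    page_size: int = 10,
) -> Dict:
    """Build an Elasticsearch query DSL body for a news search (hypothetical helper)."""
    filters = []
    must_not = []
    if date is not None:
        # Restrict results to a single publishing date.
        filters.append({"range": {"publishing_date": {"gte": date, "lte": date}}})
    if english is True:
        filters.append({"term": {"language": "en"}})
    elif english is False:
        # "Non-English" is expressed as everything that is not tagged "en".
        must_not.append({"term": {"language": "en"}})
    return {
        "from": page * page_size,  # offset of the first hit on this page
        "size": page_size,
        "query": {
            "bool": {
                "must": [
                    {"multi_match": {"query": keywords, "fields": ["title", "text"]}}
                ],
                "filter": filters,
                "must_not": must_not,
            }
        },
        # Ask Elasticsearch to return highlighted fragments for matched keywords.
        "highlight": {"fields": {"title": {}, "text": {}}},
    }
```

For example, `build_search_body("election", date="2019-01-01", english=False, page=2)` would request the third page of non-English articles published on that date.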

The complete code base is available only upon request, due to academic integrity policy. A video demo that describes this project is available here.
