Skip to content

This project builds a functional web search engine that incorporates what we have learned in CIS555. To achieve this aim, we have the following goals in mind: a) Our system is able to crawl a large corpus of web documents. (4*10^5 urls,200GB AWS S3) b) Our system can process large amounts of data efficiently. c) Our system can return accurate an…

Notifications You must be signed in to change notification settings

Kyriezxc/cis555-search-engine

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 

Repository files navigation

This project builds a functional web search engine that incorporates what we have learned in CIS555. To achieve this aim, we have the following goals in mind:

a) Our system is able to crawl a large corpus of web documents. (4*10^5 urls,200GB AWS S3)

b) Our system can process large amounts of data efficiently.

c) Our system can return accurate and meaningful results based on user search query.

d) Our system is robust and can be deployed on the cloud as a real product.


While details of scalability (of the project) can be found in our final report,

(1) we crawled 4*10^5 urls with 200GB data stored on AWS S3.

(2) indexer is on AWS RDS mysql wih 35-70GB invertedIndex (as well as tf-idf scores, summary statistics etc.).


Full name: Feng Xiang, Yezheng Li, Xinyu Ma, Shenqi Hu

SEAS login: fxiang, yezheng, xinyuma, hshenqi

Which features did you implement?

(list features, or write 'Entire assignment')

Entire assignment

Did you complete any extra-credit tasks? If so, which ones?

(list extra-credit tasks)

Process pagerank using Apache Spark

Any special instructions for building and running your solution?

(include detailed instructions, or write 'None')

None

Did you personally write all the code you are submitting (other than code from the course web page)?

[x] Yes

[ ] No

Did you copy any code from the Internet, or from classmates?

[ ] Yes

[x] No

Did you collaborate with anyone on this assignment?

[ ] Yes

[x] No


One of the screenshots (all saved screenshots are in report as well) is shown as following picture

About

This project builds a functional web search engine that incorporates what we have learned in CIS555. To achieve this aim, we have the following goals in mind: a) Our system is able to crawl a large corpus of web documents. (4*10^5 urls,200GB AWS S3) b) Our system can process large amounts of data efficiently. c) Our system can return accurate an…

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Java 98.5%
  • HTML 1.5%