GitHub - ShaishavJogani/Web-Crawler-for-Recommendation: Crawls the web pages of Java tutorial sites and indexes them using Apache Lucene, then recommends crawled content based on what the user is currently reading.

Crawler: (Crawler folder)
1> To crawl the Java Wikibooks pages, run:
	python crwal.py
2> To crawl the Oracle Java Tutorial pages, run:
	python crwalOracle.py
3> To apply stemming and stop-word removal to the crawled Wikibooks data:
	-> Set raw_file_path in CleanData.py to 'Documents\\Raw\\'
	-> Run: python CleanData.py
4> To apply stemming and stop-word removal to the crawled Oracle Java Tutorial data:
	-> Set raw_file_path in CleanData.py to 'Documents\\Oracle\\'
	-> Run: python CleanData.py
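
For readers unfamiliar with the cleaning step, here is a minimal sketch of stop-word removal followed by stemming. It is not the code in CleanData.py: the stop-word list, the suffix list, and the names naive_stem and clean_text are assumptions made for illustration, and the stemmer is a naive suffix stripper rather than a full algorithm such as Porter's.

```python
import re

# Assumed stop-word list for illustration only; the list used by
# CleanData.py is not shown in this README.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Suffixes removed by this naive stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_text(text):
    """Lowercase, split into letter-only tokens, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

Calling clean_text on the contents of each crawled file would yield the list of normalized terms that the indexing step can then work with.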


Indexing with Lucene: (Server folder)
1> Open the 'Server' project in Eclipse.
2> Add all the necessary jar files from the jars folder to the project's build path.
3> In SimpleLuceneIndexing.java, change IndexDataPath, CodeDatePath, RawDataPath and OracleDataPath so that they point to the corresponding Documents folders created by the crawler.
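
As a rough conceptual illustration of what indexing the cleaned documents achieves, the sketch below builds a tiny inverted index in Python. It is not Lucene's API and not the code in SimpleLuceneIndexing.java; the function names and the sample file names in the usage note are hypothetical. Lucene's index is far richer (it also scores and ranks results), but the core idea of mapping terms to the documents that contain them is the same.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document names that contain it.

    docs is a dict of {document name: list of cleaned tokens}.
    """
    index = defaultdict(set)
    for name, tokens in docs.items():
        for token in tokens:
            index[token].add(name)
    return index


def search(index, query_tokens):
    """Return the names of documents that contain every query token."""
    if not query_tokens:
        return set()
    postings = [index.get(token, set()) for token in query_tokens]
    return set.intersection(*postings)
```

For example, with docs = {"a.txt": ["class", "extend"], "b.txt": ["class", "interfac"]}, searching for ["class", "extend"] matches only "a.txt".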


Web Application: (WebUI folder)
1> Open the index.html page in a web browser.