A Distributed Search Engine Instance based on Nutch & Solr.

Campus Search Engine

An open-source distributed search engine instance based on Apache Nutch and Apache Solr.

Introduction

This project grew out of a course project for the Parallel Algorithm course at SZU. We retrieve and integrate information for the benefit of students and faculty at SZU. We also apply machine learning and data mining techniques, especially recommender system techniques, to improve the user experience and, more importantly, to Hack Data For Fun! Feel free to contact us and contribute.

Contributors

License

Apache License, Version 2.0

Features

  1. Full-text retrieval on all public websites on campus.
  2. E-mail reminder.
  3. News filter and recommender.
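
As a rough illustration of the full-text retrieval feature, the sketch below builds a query URL for Solr's standard `/select` handler. It is not code from this project: the host, the core name `collection1`, the helper name `build_select_url`, and the `content` field are assumptions, so adjust them to match your own Solr setup and schema.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, rows=10, start=0):
    """Build a Solr /select URL that asks for JSON results.

    base_url is the URL of a Solr core (hypothetical here);
    query uses Solr query syntax, e.g. "field:term".
    """
    params = {"q": query, "rows": rows, "start": start, "wt": "json"}
    # urlencode percent-escapes characters such as ':' in the query string.
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Assumed local Solr instance and core name; "content" is an assumed field name.
url = build_select_url("http://localhost:8983/solr/collection1", "content:library")
print(url)
```

Fetching the resulting URL with any HTTP client should return the matching documents as JSON, provided a Solr core is running at that address and has been populated by a crawl.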

Reference

  1. NutchTutorial
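    The official Nutch tutorial, which is quite detailed.
    Note: this tutorial covers Nutch 1.x. Nutch 2.x has been released, but it has little documentation, so we recommend sticking with Nutch 1.7 or 1.8.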

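  2. Python 爬虫如何入门学习? (How do I get started with web crawlers in Python?)
    A very good explanation of web crawlers on Zhihu. It also covers crawling on a cluster, although it uses Python.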

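  3. Git 教學(1): Git 的基本使用 (Git Tutorial (1): Basic Git Usage)
    An introductory Git tutorial, detailed and well written. The second part, Git 教學(2): Git Branch 的操作與基本工作流程 (Git Tutorial (2): Git Branch Operations and Basic Workflow), explains the workflow for collaborating with multiple people in Git.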

  4. Nutch – How It Works
