The Semantix crawler - in progress.
Implements a Naive Bayes classifier using the NLTK library.
Takes in crawled HTML pages of a whole website and classifies the website based on a business type
such as restaurant
or medical
.
Based on the business type, further classify the website's content into relevant data such as hours of operation, location, and menu items for restaurants.
- Install Python 2.7.3.
- Clone the project and navigate into it.
- Install
virtualenv
and make sure it is activated. All Python libraries should be installed whilevirtualenv
is activated. - Install Flask.
- Install BeautifulSoup.
- Install NLTK.
- Obtain crawled websites data from someone on the team.
- Install Python 2.7.3.
git clone https://github.com/rhuang/semantix.git
sudo pip install virtualenv
virtualenv venv
. venv/bin/activate
pip install -U Flask beautifulsoup4 pyyaml nltk
./start
orpython semantix.py
- To run locally first start the environment by running
winStart.bat
- Then run
python semantix.py
- In your browser type
127.0.0.1:5001
- Run
./start
.
We activate a virtual environment to ensure our project runs on the enclosed Python version and is not affected by the other Python versions installed on the machine. Flask is also installed into the virtual environment, and not globally on our machine.
You can also run python app/main.py
to check out the main algorithms without starting flask.
OCR recognition is done using the Tesseract library.
brew install tesseract
Usage:
tesseract [image_name] [output_file]