Semantix

The Semantix crawler - in progress.

Implements a Naive Bayes classifier using the NLTK library.

Takes the crawled HTML pages of an entire website and classifies the site by business type, such as restaurant or medical.

Based on the business type, further classifies the website's content into relevant data, such as hours of operation, location, and menu items for restaurants.
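
To make the business-type step concrete, here is a minimal, dependency-free sketch of Naive Bayes text classification. It is not the project's code: the project uses NLTK's classifier, while this is a hand-rolled multinomial variant with add-one smoothing. The `NaiveBayes` class, the `tokenize` helper, and the toy training data are all assumptions made up for illustration, and the page text is assumed to have been extracted from the HTML already.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes(object):
    """Hypothetical multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()            # training documents per label
        self.word_counts = defaultdict(Counter)  # label -> word -> occurrences
        self.vocabulary = set()                  # every word seen in training

    def train(self, labeled_texts):
        for text, label in labeled_texts:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.label_counts.items():
            # Score = log P(label) + sum over words of log P(word | label).
            score = math.log(doc_count / float(total_docs))
            label_total = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1.0) / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for illustration only.
TRAINING = [
    ("menu dinner reservations chef appetizers", "restaurant"),
    ("lunch specials menu takeout delivery", "restaurant"),
    ("patients clinic appointments doctor insurance", "medical"),
    ("physician care clinic health patients", "medical"),
]

classifier = NaiveBayes()
classifier.train(TRAINING)
print(classifier.classify("view our dinner menu and book reservations"))
```

With real data, the training set would be the text of crawled sites whose business type is already known, and each new site would be classified from the combined text of its pages.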

Installation

Install Python 2.7.3.
Clone the project and navigate into it.
Install virtualenv, then create and activate a virtual environment. All Python libraries should be installed while the virtual environment is activated.
Install Flask.
Install BeautifulSoup.
Install NLTK.
Obtain the crawled website data from someone on the team.

Quick Installation Commands

Install Python 2.7.3.
git clone https://github.com/rhuang/semantix.git
cd semantix
sudo pip install virtualenv
virtualenv venv
. venv/bin/activate
pip install -U Flask beautifulsoup4 pyyaml nltk
./start or python semantix.py

Windows

To run locally, first start the environment by running winStart.bat.
Then run python semantix.py.
In your browser, go to 127.0.0.1:5001.
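
If the page does not load, it can help to check whether anything is accepting connections on that port. The following is a small sketch using only the standard library. The `is_listening` helper is a hypothetical name, not part of the project, and the default host and port are taken from the address above.

```python
import socket


def is_listening(host="127.0.0.1", port=5001, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        connection = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        # Connection refused or timed out: nothing is serving on that port.
        return False
    connection.close()
    return True


if __name__ == "__main__":
    if is_listening():
        print("Server is up: open http://127.0.0.1:5001 in your browser.")
    else:
        print("Nothing is listening on 127.0.0.1:5001 yet.")
```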

Mac

Run ./start.

Notes

We activate a virtual environment to ensure the project runs on the environment's own Python version and is not affected by other Python versions installed on the machine. Flask is likewise installed into the virtual environment rather than globally on the machine.
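
A quick way to confirm that the virtual environment is actually active is to ask the interpreter itself. This is a minimal sketch, not part of the project. The `in_virtualenv` helper is a hypothetical name, and the check covers both older virtualenv releases (which set `sys.real_prefix`) and newer ones (which set `sys.base_prefix`).

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix on the interpreter.
    if hasattr(sys, "real_prefix"):
        return True
    # Newer virtualenv releases and the stdlib venv module instead make
    # sys.base_prefix differ from sys.prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


if __name__ == "__main__":
    print("virtualenv active: %s" % in_virtualenv())
    print("interpreter prefix: %s" % sys.prefix)
```

If this reports False after running `. venv/bin/activate`, the `python` on your PATH is probably not the one inside `venv`.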

You can also run python app/main.py to check out the main algorithms without starting Flask.

OCR Recognition

OCR is done using the Tesseract library.

brew install tesseract

Usage:
tesseract [image_name] [output_file]
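
The same command can be driven from Python. The sketch below is an assumption about how one might wrap the command line shown above, not the project's actual OCR code. `build_tesseract_command` and `run_ocr` are hypothetical names. Note that Tesseract treats its second argument as an output base name and writes the recognized text to that name with `.txt` appended.

```python
import subprocess


def build_tesseract_command(image_path, output_base):
    """Build the argument list for: tesseract <image> <output base>."""
    return ["tesseract", image_path, output_base]


def run_ocr(image_path, output_base):
    """Run Tesseract on an image and return the recognized text."""
    # Raises CalledProcessError if Tesseract exits with an error.
    subprocess.check_call(build_tesseract_command(image_path, output_base))
    # Tesseract writes plain-text output to <output_base>.txt by default.
    with open(output_base + ".txt") as handle:
        return handle.read()


# Example call (requires Tesseract to be installed and a real image file;
# "scan.png" is a placeholder name):
#     text = run_ocr("scan.png", "scan_output")
```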
