GitHub - BowerCode/NLP-WebsiteClassifier: A tool to help classify web content or any text into topics.

Website Classifier

Classify web pages with NLP
Explore the docs »

View Demo · Report Bug · Request Feature

About the Project

This project provides a function that uses several Natural Language Processing (NLP) techniques to extract web content from a user-provided URL and classify that content against a list of known topics. For simplicity, it uses boilerpy3 to extract only the relevant text from a given URL, applies transfer learning with spaCy's English language model for word similarity comparison, and uses nltk for lemmatization.
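
The classification step described above can be pictured as scoring the page text against each topic and sorting by score. The following is a minimal sketch of that idea only, not the code in this repo: rank_topics and token_overlap are hypothetical names, and token_overlap is a toy word-overlap scorer standing in for spaCy's vector similarity so that the example runs without any model installed.

```python
from typing import Callable, Iterable, List, Tuple


def rank_topics(
    text: str,
    topics: Iterable[str],
    similarity: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score `text` against every topic; return (topic, score) pairs, best first."""
    scored = [(topic, similarity(text, topic)) for topic in topics]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def token_overlap(text: str, topic: str) -> float:
    """Toy stand-in scorer: the fraction of topic words that occur in the text."""
    text_words = set(text.lower().split())
    topic_words = topic.lower().split()
    if not topic_words:
        return 0.0
    return sum(word in text_words for word in topic_words) / len(topic_words)


# With spaCy and its English model installed, a vector-based scorer could be
# plugged in instead of the toy one (assumption: this mirrors the general
# approach, not necessarily the exact code in this repo):
#   import spacy
#   nlp = spacy.load("en_core_web_md")
#   def spacy_similarity(text, topic):
#       return nlp(text).similarity(nlp(topic))

if __name__ == "__main__":
    page_text = "the central bank raised interest rates to slow inflation"
    candidate_topics = ["football", "bank", "interest rates"]
    for topic, score in rank_topics(page_text, candidate_topics, token_overlap):
        print(topic, score)
```

Passing the scorer in as an argument keeps the ranking logic separate from the choice of similarity measure, so the toy scorer can be swapped for a model-based one without touching rank_topics.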

The idea is to use this function to support other work, such as feature generation for modelling and insight analytics.

Getting Started

To get started, follow the guidelines below on prerequisites and installation.

Prerequisites

Spacy==2.3.2
NLTK==3.4.5
Boilerpy3==1.0.2
TextBlob==0.15.3
Pandas==1.0.3
Numpy==1.18.2
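
To see whether your environment matches the pinned versions above, a small check like the one below may help. It is a sketch that relies only on the standard library (importlib.metadata, Python 3.8+); the names PINNED, installed_version and check_prerequisites are hypothetical and not part of this repo.

```python
from importlib import metadata
from typing import Dict, Optional

# Pinned versions copied from the prerequisites list above,
# keyed by lowercase distribution name.
PINNED = {
    "spacy": "2.3.2",
    "nltk": "3.4.5",
    "boilerpy3": "1.0.2",
    "textblob": "0.15.3",
    "pandas": "1.0.3",
    "numpy": "1.18.2",
}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_prerequisites(pinned: Dict[str, str]) -> Dict[str, str]:
    """Compare installed versions with the pinned ones and describe each result."""
    report = {}
    for name, wanted in pinned.items():
        found = installed_version(name)
        if found is None:
            report[name] = "missing"
        elif found == wanted:
            report[name] = "ok"
        else:
            report[name] = f"installed {found}, pinned {wanted}"
    return report


if __name__ == "__main__":
    for package, status in check_prerequisites(PINNED).items():
        print(f"{package}: {status}")
```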

Installation

Fork and star this repo ;)
Create a folder on your machine for your project
Inside the folder right-click and select Git Bash Here
Git clone this repo into the folder by running the below command

git clone https://github.com/hklchung/NLP-WebsiteClassifier.git

Usage

First, run everything inside main.py. This loads all the required packages and the functions needed.
If you have not used spaCy before, you may need to download the spaCy English language model first by running the following in Git Bash or Terminal:

python -m spacy download en_core_web_md

Run this in console: nltk.download('popular')
Update the list of topics if needed
Run this in console: classify_web(url, topics = topics)

Sentiment analysis has also been added to help users understand the sentiment of web pages. To retrieve sentiment analysis results, set the analyse_sentiment argument to True when calling the function.
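
TextBlob is listed among the prerequisites, so the sentiment score presumably comes from it; that is an assumption, as is the exact shape of what classify_web returns. TextBlob reports polarity as a float between -1.0 and 1.0, and the sketch below shows one way such a score could be turned into a readable label. The function name polarity_label and the 0.1 threshold are illustrative choices, not part of this repo.

```python
def polarity_label(polarity: float, threshold: float = 0.1) -> str:
    """Map a polarity score in [-1.0, 1.0] to 'positive', 'negative' or 'neutral'.

    Scores within `threshold` of zero are treated as neutral. The default
    threshold is an arbitrary illustrative value.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1.0, 1.0]")
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# With TextBlob installed, the score for a piece of page text could be
# obtained like this (shown as a comment so the sketch runs without it):
#   from textblob import TextBlob
#   polarity = TextBlob(page_text).sentiment.polarity
#   print(polarity_label(polarity))

if __name__ == "__main__":
    for score in (0.6, 0.0, -0.4):
        print(score, polarity_label(score))
```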

Contributing

I welcome anyone to contribute to this project, so if you are interested, feel free to add your code. Alternatively, if you are not a programmer but would still like to contribute, click the Request Feature button at the top of the page and share your feedback.

Contact

Leslie Chung

Known Issues

Websites that lack any text content will not return any results
