
A tool to help classify web content or any text into topics.


BowerCode/NLP-WebsiteClassifier

 
 


Python 3.6 spacy 2.3.2 nltk 3.4.5 boilerpy3 1.0.2 textblob 0.15.3 License MIT


Website Classifier

Classify web pages with NLP
Explore the docs »

View Demo · Report Bug · Request Feature


About the Project

In this project, we built a function that uses several Natural Language Processing (NLP) techniques to extract web content from a user-provided URL and classify that content against a list of known topics. For simplicity, we use boilerpy3 to extract only the relevant text from a given URL, apply transfer learning with spaCy's English language model for word similarity comparison, and use NLTK for lemmatization.
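To make the word similarity step more concrete, here is a minimal sketch of how extracted words could be scored against a list of topics. It is not the code in main.py: the `rank_topics` helper, the mean-of-cosine-similarities scoring rule and the two-dimensional `TOY_VECTORS` are all assumptions made for illustration. In practice the vectors would come from a language model such as en_core_web_md rather than a hand-written dictionary.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_topics(
    words: Sequence[str],
    topics: Sequence[str],
    vectors: Dict[str, Vector],
) -> List[Tuple[str, float]]:
    """Rank topics by the mean similarity between each topic and the page words.

    Hypothetical helper: words without a vector are skipped, and every topic
    is assumed to have an entry in `vectors`.
    """
    scores = {}
    for topic in topics:
        sims = [cosine(vectors[w], vectors[topic]) for w in words if w in vectors]
        scores[topic] = sum(sims) / len(sims) if sims else 0.0
    # Highest-scoring topic first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy 2-D vectors standing in for real word embeddings (illustrative values only)
TOY_VECTORS: Dict[str, Vector] = {
    "sport": (1.0, 0.0),
    "finance": (0.0, 1.0),
    "football": (0.9, 0.1),
    "goal": (0.8, 0.2),
    "bank": (0.1, 0.9),
}

ranking = rank_topics(["football", "goal", "bank"], ["sport", "finance"], TOY_VECTORS)
for topic, score in ranking:
    print(f"{topic}: {score:.3f}")
```

With these toy vectors, "sport" ranks above "finance" because two of the three page words point mostly in the "sport" direction.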

The function is intended to support downstream work such as feature generation for modelling and insight analytics.

Getting Started

To get started, please follow the below guidelines on prerequisites and installation.

Prerequisites

  • Spacy==2.3.2
  • NLTK==3.4.5
  • Boilerpy3==1.0.2
  • TextBlob==0.15.3
  • Pandas==1.0.3
  • Numpy==1.18.2
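Before running the project, it can help to confirm that the packages listed above are importable. The snippet below is a small sketch using only the standard library; the `missing_modules` helper and the `REQUIRED_MODULES` list are hypothetical names, not part of this repo, and the check covers importability only, not the pinned versions.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names assumed to correspond to the prerequisites listed above
REQUIRED_MODULES = ["spacy", "nltk", "boilerpy3", "textblob", "pandas", "numpy"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All prerequisite packages were found.")
```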

Installation

  1. Fork and star this repo ;)
  2. Create a folder on your machine for your project
  3. Inside the folder right-click and select Git Bash Here
  4. Git clone this repo into the folder by running the below command
git clone https://github.com/hklchung/NLP-WebsiteClassifier.git

Usage

  1. First, run everything inside main.py. This loads all the required packages and defines the functions you need.
  2. If you have not used spaCy before, you may have to download the spaCy English language model first by running the command below in Git Bash or a terminal:
python -m spacy download en_core_web_md
  3. Run this in the console: nltk.download('popular')
  4. Update the list of topics if needed
  5. Run this in the console: classify_web(url, topics = topics)

Sentiment analysis is also supported, to help users understand the sentiment of a web page. To retrieve sentiment analysis results, set the analyse_sentiment argument to True when calling the function.
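As a rough illustration of how a sentiment score might be interpreted, the sketch below maps a polarity value to a label. It assumes the score is a TextBlob-style polarity in the range [-1.0, 1.0]; the README does not state what classify_web actually returns, so the `label_polarity` helper and its `neutral_band` threshold are hypothetical.

```python
def label_polarity(polarity: float, neutral_band: float = 0.1) -> str:
    """Map a polarity score in [-1.0, 1.0] to a coarse sentiment label.

    Hypothetical helper: scores within `neutral_band` of zero count as neutral.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must be between -1.0 and 1.0")
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


# With TextBlob installed, a polarity could be obtained like this (assumption,
# not necessarily how this repo computes it):
#   from textblob import TextBlob
#   polarity = TextBlob("Some page text").sentiment.polarity
for score in (0.6, 0.0, -0.4):
    print(score, label_polarity(score))
```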

Contributing

I welcome contributions to this project, so if you are interested, feel free to add your code. If you are not a programmer but would still like to contribute, please click the Request Feature button at the top of the page and share your feedback.


Known Issues

  • Websites that lack any text content will not return any results
