Classify web pages with NLP
Explore the docs » · View Demo · Report Bug · Request Feature
In this project we built a function that uses Natural Language Processing (NLP) techniques to extract web content from a user-provided URL and classify that content against a list of known topics. For simplicity, we use boilerpy3 to extract only the relevant text from a given URL, apply transfer learning with spaCy's English language model for word similarity comparison, and use NLTK for lemmatization.
The function is intended to support other work, such as feature generation for modelling and insight analytics.
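The similarity step described above can be illustrated with a minimal sketch. It deliberately swaps in plain bag-of-words cosine similarity from the standard library as a stand-in for spaCy's word vectors, so it is not the project's implementation. The names `bag_of_words`, `cosine_similarity`, `rank_topics`, and the `topic_keywords` mapping are hypothetical and exist only for this illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page or empty topic matches nothing
    return dot / (norm_a * norm_b)


def rank_topics(page_text, topic_keywords):
    """Return (topic, score) pairs sorted from most to least similar.

    topic_keywords is a hypothetical mapping of topic name to a string
    of representative words; the project's own topic list may differ.
    """
    page_vector = bag_of_words(page_text)
    scores = {
        topic: cosine_similarity(page_vector, bag_of_words(keywords))
        for topic, keywords in topic_keywords.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Word vectors from a language model can match related words that never co-occur literally (for example "striker" and "football"), which count-based overlap like this cannot do. That is the main reason to prefer a pretrained model for the real comparison.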
To get started, follow the guidelines below on prerequisites and installation.
- Spacy==2.3.2
- NLTK==3.4.5
- Boilerpy3==1.0.2
- TextBlob==0.15.3
- Pandas==1.0.3
- Numpy==1.18.2
- Fork and star this repo ;)
- Create a folder on your machine for your project
- Inside the folder right-click and select Git Bash Here
- Clone this repo into the folder by running the command below:
git clone https://github.com/hklchung/NLP-WebsiteClassifier.git
- First, run everything inside main.py. This loads all the required packages and defines the functions needed.
- If you have not used spaCy before, you may need to download the spaCy English language model first by running the following in Git Bash or Terminal:
python -m spacy download en_core_web_md
- Run this in the console: nltk.download('popular')
- Update the list of topics if needed
- Run this in the console: classify_web(url, topics = topics)
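Because the prerequisites are pinned to exact versions, it can help to check the environment before running main.py. The following is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+). The helper names `installed_version` and `find_mismatches` are hypothetical and not part of this repo.

```python
from importlib import metadata

# Versions copied from the prerequisites list in this README.
PINNED = {
    "spacy": "2.3.2",
    "nltk": "3.4.5",
    "boilerpy3": "1.0.2",
    "textblob": "0.15.3",
    "pandas": "1.0.3",
    "numpy": "1.18.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to an (expected, found) tuple; found is None when not installed."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches
```

Calling `find_mismatches(PINNED)` returns an empty dict when every pinned package is installed at the listed version. Newer versions may still work, so a mismatch is a hint to investigate rather than a guaranteed failure.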
Sentiment analysis has also been added to help users understand the sentiment of web pages. To retrieve sentiment results, set the analyse_sentiment argument to True when calling the function.
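For readers curious what a sentiment result might look like, here is a minimal sketch. TextBlob appears in the prerequisites, so it is assumed here to be the library behind this feature; that is an assumption, as are the helper names `label_polarity` and `page_sentiment`, the 0.1 threshold, and the shape of the returned dict. The output of classify_web itself may look different.

```python
def label_polarity(polarity, threshold=0.1):
    """Turn a polarity score in [-1, 1] into a coarse label.

    The threshold is an illustrative choice, not a value from the project.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


def page_sentiment(text):
    """Score a block of page text with TextBlob's default analyzer."""
    # Imported lazily so label_polarity can be used without TextBlob installed.
    from textblob import TextBlob

    sentiment = TextBlob(text).sentiment
    return {
        "polarity": sentiment.polarity,          # -1.0 (negative) to 1.0 (positive)
        "subjectivity": sentiment.subjectivity,  # 0.0 (objective) to 1.0 (subjective)
        "label": label_polarity(sentiment.polarity),
    }
```

Keeping the labelling separate from the scoring makes the threshold easy to tune for a given set of pages.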
I welcome contributions from anyone, so if you are interested, feel free to add your code. Alternatively, if you are not a programmer but would still like to contribute, please click the Request Feature link at the top of the page and share your feedback.
- Websites that lack any text content will not return any results
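Callers who process many URLs may want to detect this case up front. The sketch below shows one way to guard against pages with little or no text. It assumes boilerpy3's `ArticleExtractor` for the fetch step, which may not be the extractor this project uses. The names `has_usable_text`, `fetch_page_text`, and `extract_or_none`, and the `min_words` cut-off, are hypothetical.

```python
def has_usable_text(text, min_words=1):
    """True when text is present and contains at least min_words words."""
    return text is not None and len(text.split()) >= min_words


def fetch_page_text(url):
    """Download a page and return its main text content."""
    # Imported lazily so the guard above works without boilerpy3 installed.
    from boilerpy3 import extractors

    extractor = extractors.ArticleExtractor()
    return extractor.get_content_from_url(url)


def extract_or_none(url, fetch=fetch_page_text, min_words=20):
    """Return the page text, or None when the page has too little text
    to be worth classifying. min_words=20 is an illustrative default."""
    text = fetch(url)
    return text if has_usable_text(text, min_words) else None
```

Returning None explicitly lets the calling code distinguish "no text found" from "no topic matched" and skip or log those URLs.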