Automatic tag generation of news articles using LDA and Clustering using different techniques

Prachal80/News-Article-Analysis-using-NLP

News-Article-Analysis-using-NLP

Project Description: News websites, blogs, and newspapers rely heavily on text mining and data mining techniques to improve their customer service and search performance. Data mining can extract important information that can then be used for specific needs. For instance, clustering and classification can be used to automate tag generation on blogs and websites. Tags matter for blogs because they improve search performance and make relevant topics easier to find. In this project, we aim to implement various clustering techniques, such as k-means, DBSCAN, CLARA, and a few others, and to automate tag generation based on their results. We analyze and compare the techniques to see which method works best for this Natural Language Processing task. In addition, depending on the errors, content, and nature of the raw data, we will apply different data cleaning techniques to produce suitable input data.

Dataset: The dataset used in this project is “All The News” [3], which contains articles from 15 US news publishers. The articles are in English and are dated from 2013 to 2018. Link: https://components.one/datasets/all-the-news-articles-dataset/

References

[1] Moe R.E. (2014) Clustering in a News Corpus. In: Sojka P., Horák A., Kopeček I., Pala K. (eds) Text, Speech, and Dialogue. TSD 2014. Lecture Notes in Computer Science, vol 8655. Springer, Cham

[2] Austin L.E Kraus, News Articles Clustering Using Unsupervised Learning, Medium article, August 2, 2019

[3] Andrew Thompson, “All the News”, https://www.kaggle.com/snapcrack/all-the-news/

[4] Evaluation Metrics, https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation

Download the dataset from https://www.kaggle.com/snapcrack/all-the-news and place the files in the data folder.

Details:

KMeansClustering notebook has the implementation of K-Means

TagGeneration notebook has the code for LDA

Processing contains basic preprocessing steps
