Project Description: News websites, blogs, and online newspapers rely heavily on text mining and data mining to improve customer service and search performance. Data mining extracts information from raw text that can then be used for specific needs. For instance, clustering and classification can automate tag generation for blogs and websites; tags matter because they improve search performance and make relevant topics easy to find. In this project, we implement several clustering techniques, including k-means, DBSCAN, CLARA, and a few others, and generate tags automatically from the resulting clusters. We analyze and compare the techniques to determine which works best for this Natural Language Processing task. Depending on the errors, content, and nature of the raw data, we also apply different data cleaning techniques to produce suitable input data.
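As an illustration of the idea, here is a minimal sketch of cluster-based tag generation: documents are turned into TF-IDF vectors, grouped with a small spherical k-means, and the highest-weighted terms of each centroid are taken as that cluster's tags. This is not the code from the notebooks. The function names, the tiny stop-word list, and the choice of the first k documents as initial centroids are assumptions made to keep the example short and deterministic; a library implementation would normally be used on the full dataset.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "for",
              "with", "after", "before", "as"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one unit-length sparse {term: weight} TF-IDF vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter(term for c in counts for term in c)
    n = len(docs)
    vectors = []
    for c in counts:
        # Smoothed inverse document frequency.
        vec = {t: tf * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
               for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def similarity(a, b):
    """Dot product of two sparse vectors (cosine similarity for unit vectors)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def kmeans(vectors, k, iterations=20):
    """Spherical k-means; the first k vectors serve as initial centroids."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = None
    for _ in range(iterations):
        new_labels = [max(range(k), key=lambda j: similarity(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == j:
                    total.update(v)  # adds the weights term by term
            if total:
                norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
                centroids[j] = {t: w / norm for t, w in total.items()}
    return labels, centroids

def generate_tags(docs, k, top_n=3):
    """Cluster the documents and return (labels, tags per cluster)."""
    labels, centroids = kmeans(tfidf_vectors(docs), k)
    tags = [[t for t, _ in sorted(c.items(), key=lambda item: (-item[1], item[0]))[:top_n]]
            for c in centroids]
    return labels, tags

if __name__ == "__main__":
    sample = [
        "The senate passed the budget bill after a long vote",
        "The team won the football match with a late goal",
        "Lawmakers debated the budget bill before the senate vote",
        "The striker scored a goal as the football team won",
    ]
    print(generate_tags(sample, k=2, top_n=4))
```

Terms shared by the members of a cluster accumulate weight in the centroid, which is why they surface as tags ahead of words that appear in only one article.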
Dataset: The dataset used in this project is “All The News” [3], which contains articles from 15 US news publishers. The articles are in English and are dated from 2013 to 2018. Link: https://components.one/datasets/all-the-news-articles-dataset/
References
[1] Moe R.E. (2014) Clustering in a News Corpus. In: Sojka P., Horák A., Kopeček I., Pala K. (eds) Text, Speech, and Dialogue. TSD 2014. Lecture Notes in Computer Science, vol 8655. Springer, Cham
[2] Austin L.E Kraus, News Articles Clustering Using Unsupervised Learning, Medium article, August 2, 2019
[3] Andrew Thompson, “All the News”, https://www.kaggle.com/snapcrack/all-the-news/
[4] Evaluation Metrics, https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
Download the dataset from https://www.kaggle.com/snapcrack/all-the-news and place the files in the data folder.
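Once the files are in place, they can be read with the standard library. The sketch below assumes the download consists of CSV files and that each row has `title` and `content` columns; both column names and the `load_articles` helper are assumptions for illustration, so adjust them to match the actual files.

```python
import csv
from pathlib import Path

# Article bodies can exceed the csv module's default field size limit.
csv.field_size_limit(2**31 - 1)

def load_articles(data_dir="data", text_column="content", title_column="title"):
    """Read every CSV file in data_dir and return a list of {title, text} dicts.

    The column names are assumptions; rows with an empty text field are skipped.
    """
    articles = []
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                text = (row.get(text_column) or "").strip()
                if text:
                    articles.append({"title": row.get(title_column) or "",
                                     "text": text})
    return articles

if __name__ == "__main__":
    print(len(load_articles("data")), "articles loaded")
```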
Details:
The KMeansClustering notebook has the implementation of K-Means.
The TagGeneration notebook has the code for LDA.
Processing contains the basic preprocessing steps.
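For readers who want a sense of what basic preprocessing can look like for scraped news text, here is a hedged sketch. It is not the contents of Processing; the `clean_text` function and the particular steps (decoding HTML entities, stripping tags and URLs, lowercasing, dropping non-letters) are assumptions chosen as common cleaning steps for English articles.

```python
import html
import re
import unicodedata

def clean_text(raw):
    """Return a lowercase, letters-only version of raw with single spaces."""
    text = html.unescape(raw)                      # "&amp;" -> "&", "&nbsp;" -> U+00A0
    text = unicodedata.normalize("NFKC", text)     # fold non-breaking spaces etc.
    text = re.sub(r"<[^>]+>", " ", text)           # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)      # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # keep ASCII letters only
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace

if __name__ == "__main__":
    print(clean_text("<p>Senate&nbsp;passes the BUDGET bill!</p>"))
```

Keeping only ASCII letters also discards digits and accented characters, which is acceptable for a rough English-only pipeline but should be relaxed if names or numbers matter for the tags.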