News-Article-Analysis-using-NLP

Project Description: News websites, blogs, and newspapers rely heavily on text mining and data mining to improve their customer service and search performance. Data mining extracts important information that can then be used for specific needs. For instance, clustering and classification can automate tag generation for blogs and websites; tags improve search performance and make relevant topics easier to find. In this project, we aim to implement several clustering techniques, including k-means, DBSCAN, CLARA, and a few others, and to automate tag generation based on their results. We analyze and compare the techniques to determine which works best for this Natural Language Processing task. Depending on the errors, content, and nature of the raw data, we apply different data cleaning techniques to produce suitable input data.

Dataset: The dataset used in this project is “All The News” [3], which contains articles from 15 US news publishers. The articles are written in English and are dated from 2013 to 2018. Link: https://components.one/datasets/all-the-news-articles-dataset/

References
[1] Moe R.E. (2014) Clustering in a News Corpus. In: Sojka P., Horák A., Kopeček I., Pala K. (eds) Text, Speech, and Dialogue. TSD 2014. Lecture Notes in Computer Science, vol 8655. Springer, Cham
[2] Austin L.E Kraus, News Articles Clustering Using Unsupervised Learning, Medium article, August 2, 2019
[3] Andrew Thompson, “All the News”, https://www.kaggle.com/snapcrack/all-the-news/
[4] Evaluation Metrics, https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation

Download the dataset from https://www.kaggle.com/snapcrack/all-the-news and place the files in the data folder.
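Once the files are in the data folder, they could be read along the following lines. This is only a sketch using the standard library: it assumes the download consists of CSV files and that the article text lives in a column named `content`; both are assumptions, so check the actual file headers. The helper name `load_articles` is hypothetical.

```python
import csv
from pathlib import Path


def load_articles(data_dir="data", text_column="content"):
    """Read every CSV in data_dir and return the non-empty article texts.

    The column name is an assumption; adjust it to the real header.
    """
    # Article bodies can exceed the csv module's default field size limit.
    csv.field_size_limit(10**9)

    texts = []
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Skip rows where the text column is missing or blank.
                text = (row.get(text_column) or "").strip()
                if text:
                    texts.append(text)
    return texts
```

Calling `load_articles("data")` returns a list of strings that can be passed on to the preprocessing and clustering steps.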

Details:
- KMeansClustering notebook has the implementation of K-Means
- TagGeneration notebook has the code for LDA
- Preprocessing notebook contains the basic preprocessing steps
