# News Article Clustering

In this track you're asked to cluster news articles into topics. You will use a subset of the [fetch_20newsgroups](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html) dataset.

## Marking criteria: 

- **8 points**: Implement TF-IDF for word vectorization.
- **2 points**: Reduce dimensionality with TruncatedSVD and answer the theoretical questions about it.
- **10 points**: Implement DBSCAN clustering algorithm.

## Baseline 

You can look at the baseline, i.e. the basic solution of this task. Your assignment is to implement the TF-IDF vectorizer and DBSCAN from scratch, and to add dimensionality reduction on the token vectors to improve the results.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import DBSCAN
import pandas as pd


# Data set
# Take only 3 categories from 20
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'),
                                     categories = [ 'alt.atheism', 'talk.religion.misc', 'comp.graphics'])
# Example of the sample
print(f"Sample:\n{newsgroups_train.data[0]}\n")

# Vectorize the articles with TF-IDF
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')
X = vectorizer.fit_transform(newsgroups_train.data)

# Convert TF-IDF to pd.DataFrame for simplicity
tfidf_tokens = vectorizer.get_feature_names_out() 
df_tfidfvect = pd.DataFrame(data = X.toarray(), columns = tfidf_tokens)
print(f"TF-IDF table:\n {df_tfidfvect}\n")

# Reduce the dimension
svdT = TruncatedSVD(n_components=50)
svdTFit = svdT.fit_transform(df_tfidfvect)

# Clustering
db = DBSCAN(eps=0.5, min_samples=5).fit(svdTFit)
skl_labels = db.labels_

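# See how many clusters we've got (ideally, should be 3 and some outliers marked as -1)
# Sort a copy so that db.labels_ stays aligned with the articles
sorted_labels = pd.Series(skl_labels).sort_values().to_numpy()
print(f"Article labels {sorted_labels}")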

### 1. Word vectorization & Dimensionality reduction by TruncatedSVD [10 points]

Since your data points are short articles (sets of words, or **documents**), you cannot apply clustering to them directly; you first need to encode the articles numerically. In this assignment you are asked to use TF-IDF as the vectorizer. TF-IDF (Term Frequency - Inverse Document Frequency) takes a corpus of documents and produces a matrix $T$ of shape $N \times M$, where $N$ is the number of documents and $M$ is the number of **all** unique words in the whole corpus. The entry $T(i,j)$ for the $i$-th word in the $j$-th document (stored in row $j$, column $i$ of $T$) is calculated as follows:

$$T(i,j) = tf(i,j) * idf(i) $$

where

$$tf(i, j) = {n_{i,j} \over \sum_k n_{k,j}}$$

$n_{i,j}$ is the number of occurrences of the $i$-th word in the $j$-th document, and $\sum_k n_{k,j}$ is the total number of words in that document. So it is simply the frequency of a particular word in a document; note that it is zero if the word does not appear in the document. $idf(i)$ is equal to:

$$idf(i) = \log {|D| \over |\{d_k \in D : n_{i,k} > 0\}|}$$

$|D|$ is the total number of documents, and the denominator is the number of documents that contain the $i$-th word.
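
As a quick sanity check of the formulas above, here is a minimal sketch on a made-up three-document corpus. The corpus and the names `docs`, `tf`, `idf` and `vocab` are illustrative assumptions, the natural logarithm is assumed for $log$, and the snippet is a toy illustration rather than a reference solution.

```python
import math

# Toy corpus, already tokenized (illustrative data, not from the dataset)
docs = [
    ["god", "exists", "god"],
    ["image", "pixel"],
    ["god", "image"],
]

def tf(word, doc):
    # n_{i,j} divided by the total number of words in document j
    return doc.count(word) / len(doc)

def idf(word, corpus):
    # log(|D| / number of documents that contain the word);
    # assumes the word occurs in at least one document
    n_containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / n_containing)

vocab = sorted({word for doc in docs for word in doc})
# Rows are documents, columns are words, as in the N x M matrix T
T = [[tf(word, doc) * idf(word, docs) for word in vocab] for doc in docs]

print(vocab)
for row in T:
    print([round(value, 3) for value in row])
```

In the first document, "god" occurs twice but also appears in another document, while "exists" occurs once and only there, so "exists" ends up with the larger weight.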

In the end, each article is encoded as a row of the TF-IDF matrix, so all articles have the same dimensionality and can be fed to clustering algorithms.

Although we do not perform any semantic analysis of the text for clustering, such vectors implicitly encode the content of the documents: we can assume that documents from one category (e.g. sports) use similar words. TF-IDF assigns higher values to words that appear often in a few documents but not across the whole corpus. Also note that the TF-IDF matrix will be very sparse, so you will need to reduce its dimensionality.

### 1.1 TF-IDF [8 points]

In this task you're asked to **implement the formulas above** from scratch to calculate a TF-IDF matrix. 

You also have a subtask that is important from a practical point of view: data cleaning. The articles contain a lot of punctuation, special characters and various forms of the same words that you need to handle. For example, without preprocessing the words *clusters*, *Clusters*, *clusters?* and *_clusters* would be considered different, which greatly increases the number of columns in the TF-IDF matrix and reduces its descriptive power. Therefore, **before** calculating the metrics, it is recommended to:
1) Remove all punctuation marks;
2) Remove special symbols, such as `\n` and `\t`;
3) Remove stop words, i.e. the most common and non-informative words, such as `a`, `the`, `I`, etc. You can use the `nltk` package to load them;
4) Convert every word to lower case.

Also, before the calculation you can remove the words that are too rare in the corpus, for example those that appear in fewer than 2 documents (see the `min_df` parameter in the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) documentation).
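
The cleaning steps above could be sketched roughly as follows. This is only one possible approach: the helper names `clean` and `drop_rare`, the tiny `STOP_WORDS` set (a stand-in for the `nltk` list) and the sample strings are all assumptions made for illustration, and "rare" is interpreted here as document frequency, mirroring `min_df`.

```python
import string
from collections import Counter

# A tiny stand-in stop-word list; the assignment suggests nltk's list instead
STOP_WORDS = {"a", "the", "i", "is", "are", "of", "and"}

def clean(text, stop_words=STOP_WORDS):
    # Replace every punctuation mark with a space, so "clusters?" -> "clusters "
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # split() with no argument also discards \n and \t
    tokens = text.lower().translate(table).split()
    return [token for token in tokens if token not in stop_words]

def drop_rare(tokenized_docs, min_df=2):
    # Keep only the words that occur in at least min_df documents
    df = Counter(word for doc in tokenized_docs for word in set(doc))
    return [[word for word in doc if df[word] >= min_df] for doc in tokenized_docs]

raw = ["The Clusters?\n_clusters are\tdense!", "I like clusters, and noise."]
tokens = [clean(text) for text in raw]
print(tokens)
print(drop_rare(tokens))
```

Note that replacing punctuation with spaces splits contractions such as "don't" into two tokens; whether that is acceptable is a design choice left to you.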

***Do not use any additional libraries***

In [None]:
import string
import math 
import numpy as np
import nltk
from nltk.corpus import stopwords


nltk.download('stopwords')
# The corpus file id is lower case; "English" fails on case-sensitive file systems
stops = set(stopwords.words("english"))
    
# Write your code here

### 1.2 TruncatedSVD [2 points]

Since the number of features in the resulting data frame is large, it would not be reasonable to cluster the documents directly. However, the PCA algorithm is not applicable here, because the data is too sparse (try it yourself, sklearn should warn you!). Instead, we are going to use **Truncated Singular Value Decomposition**.
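
For reference, truncated SVD can be written compactly as below. The symbols $X$, $U_k$, $\Sigma_k$, $V_k$ and $k$ are notation introduced here for illustration: $X$ is the $N \times M$ TF-IDF matrix and $k$ plays the role of `n_components`.

$$X \approx U_k \Sigma_k V_k^T$$

Here $\Sigma_k$ is a $k \times k$ diagonal matrix holding the $k$ largest singular values of $X$, and $U_k$ ($N \times k$) and $V_k$ ($M \times k$) hold the corresponding left and right singular vectors. The reduced representation of the documents is

$$X V_k = U_k \Sigma_k$$

which has shape $N \times k$, so each document is described by $k$ numbers instead of $M$.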

You can use the TruncatedSVD implementation from sklearn. You are asked to read the following articles and pages [1](https://en.wikipedia.org/wiki/Singular_value_decomposition), [2](https://langvillea.people.cofc.edu/DISSECTION-LAB/Emmie%27sLSI-SVDModule/p5module.html) and answer the questions:

1) What are the ways to choose the proper number of singular values (components) in truncated SVD?
2) In which cases is truncated SVD not appropriate?
3) Are there any specific preprocessing steps that should be taken before applying truncated singular value decomposition to a raw text corpus?

In [None]:
from sklearn.decomposition import TruncatedSVD

# You can change n_components to another number
svdT = TruncatedSVD(n_components=50)
# my_data_frame is expected to be the TF-IDF table you built in task 1.1
svdTFit = svdT.fit_transform(my_data_frame)
svdTFit

### 2. DBSCAN [10 points]

Now everything is ready for clustering. In this task you are asked to **implement** the DBSCAN algorithm. We do not provide any guidelines, but you can use the lecture slides.
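
DBSCAN rests on two notions controlled by the `eps` and `min_samples` parameters seen in the baseline: the `eps`-neighborhood of a point and core points. The sketch below illustrates only these two building blocks on made-up 2D points; it is not the full algorithm. The helper names `region_query` and `is_core` are hypothetical, Euclidean distance is assumed, and, as in sklearn, a point counts as its own neighbor.

```python
import math

def region_query(points, index, eps):
    # Indices of all points within distance eps of points[index],
    # including the point itself
    return [
        other
        for other, candidate in enumerate(points)
        if math.dist(points[index], candidate) <= eps
    ]

def is_core(points, index, eps, min_samples):
    # A core point has at least min_samples points in its eps-neighborhood
    return len(region_query(points, index, eps)) >= min_samples

# Illustrative data: a tight group of three points and one far-away point
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
core_flags = [is_core(points, i, eps=0.5, min_samples=3) for i in range(len(points))]
print(core_flags)
```

The full algorithm then grows clusters outward from core points and marks the points that are reachable from none of them as outliers.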

The data set is not the easiest one to cluster, even with such a powerful algorithm. Thus, your goal is to implement a DBSCAN that divides it into at least **two clusters**, but **not more than six**, not counting the outliers. Print the labels as proof. If you have done everything you could but still get only one cluster, choose 3 other topics from fetch_20newsgroups ([check](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset) for target names). More distinct topics may be easier to separate.
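
One way to check the requirement above is to count the distinct labels while ignoring outliers. The snippet is a small sketch that assumes your labels follow the sklearn convention where `-1` marks outliers; the `labels` list is made-up data.

```python
from collections import Counter

# Made-up labels in the sklearn convention: -1 marks outliers
labels = [0, 0, 1, -1, 1, 2, 0, -1]

sizes = Counter(labels)
n_clusters = len(set(labels) - {-1})
n_outliers = sizes[-1]

print(f"clusters: {n_clusters}, outliers: {n_outliers}, sizes: {dict(sizes)}")
print("requirement met:", 2 <= n_clusters <= 6)
```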

***Do not use any additional libraries***

In [None]:
from sklearn import metrics
import numpy as np
import math 

# Write your code here

### References

- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
- https://en.wikipedia.org/wiki/Tf–idf