KNN from scratch
Task:
- Given a dataset of documents with content from 5 different fields (namely business, entertainment, politics, sport, and tech), cluster them using any clustering algorithm of your choice.
- Do not use any libraries for this part. You are expected to code your clustering algorithm from scratch.
- For feature extraction, you can use the vectorizers provided by sklearn or the pre-trained embeddings. (A code snippet showing how to use these embeddings was provided in the previous question.)
- You might have to perform some pre-processing on the raw documents before you apply your algorithm.
- We have provided ground-truth document tags for the documents. Report the accuracy score on these documents.
- We will test your score on the documents for which the tags have not been provided.
- In the dataset, the number after the ’ ’ symbol in the file name denotes the cluster label.
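
As one possible starting point, the sketch below implements plain k-means using only the Python standard library, plus a scoring helper. Several things here are assumptions rather than part of the task statement: k-means is just one choice among the permitted clustering algorithms, the function names (`kmeans`, `cluster_accuracy`) are hypothetical, and "accuracy" is interpreted as mapping each cluster to its majority ground-truth label before counting matches. The toy points stand in for the feature vectors you would get from a vectorizer or embedding.

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(points, k, max_iters=100, seed=0):
    """Plain k-means; returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids as k distinct points drawn from the data.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so we have converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return assignments, centroids


def cluster_accuracy(assignments, true_labels):
    """Assumed metric: label each cluster by majority vote, then score."""
    labels_by_cluster = {}
    for cluster, label in zip(assignments, true_labels):
        labels_by_cluster.setdefault(cluster, []).append(label)
    correct = sum(
        Counter(labels).most_common(1)[0][1]
        for labels in labels_by_cluster.values()
    )
    return correct / len(true_labels)


# Toy data: two well-separated groups standing in for document vectors.
points = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
          [100.0, 0.0], [101.0, 0.0], [102.0, 0.0]]
true_labels = [0, 0, 0, 1, 1, 1]

assignments, centroids = kmeans(points, k=2)
accuracy = cluster_accuracy(assignments, true_labels)
print(assignments, accuracy)
```

For the actual task, `k` would be 5 and `points` would hold one feature vector per pre-processed document. Real document vectors are far less cleanly separated than this toy data, so the result can depend on the random initialisation; running with several seeds and keeping the best run is a common remedy.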