# Document Clustering & Analysis using Custom K-Means
This **Jupyter Notebook** looks at a subset of the Newsgroup dataset. The **subset** contains 2,500 documents, each belonging to one of **5 categories**: Windows (0), Crypt (1), Christian (2), Hockey (3), and Forsale (4). The documents are represented by 9,328 terms (stems).

The **vocabulary** for the dataset is given in the file "terms.txt" and the **term-by-document** matrix is given in "matrix.txt". The actual category labels for the documents are provided in the file "classes.txt". The **goal** of this lab is to cluster the documents and compare the resulting clusters to the actual categories.

### Importing Libraries
We start by importing the following libraries:
- **Pandas:** an open-source data analysis and manipulation library for Python. It provides data structures and functions for working efficiently with structured data
- **NumPy:** an open-source Python library that is widely used in science and engineering. It provides multidimensional array data structures, such as the homogeneous, N-dimensional `ndarray`
- **Sklearn:** scikit-learn, a free and open-source machine learning library for the Python programming language

In [25]:
import pandas as pd
import numpy as np
from kMeans import *
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer

## 1 |  Cosine Distance Function
We start by creating a cosine distance function. Instead of Euclidean distance, our custom distance measure is based on cosine similarity, and it is the function passed to the `kMeans` function in the Python script. It computes the cosine similarity between two n-dimensional vectors and returns one minus that similarity as the distance between them, so vectors pointing in the same direction have distance 0 and orthogonal vectors have distance 1.

### Changes to the kMeans.py File
The following cosine distance function was added to the `kMeans.py` file:
```
def distCosine(vecA, vecB):
    vector_A = np.array(vecA)
    vector_B = np.array(vecB)

    dot_product = np.dot(vector_A, vector_B)
    magnitude_A = np.linalg.norm(vector_A)
    magnitude_B = np.linalg.norm(vector_B)

    if magnitude_A == 0 or magnitude_B == 0:
        return 1.0

    cosine_similarity = dot_product / (magnitude_A * magnitude_B)
    cosine_distance = 1 - cosine_similarity

    return cosine_distance
```
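
As a quick sanity check, the behaviour described above can be confirmed on a few hand-picked vectors. This is a minimal sketch: the function is restated compactly so the snippet runs on its own, and the toy vectors are illustrative and not taken from the dataset.

```python
import numpy as np

def distCosine(vecA, vecB):
    # Compact restatement of the function added to kMeans.py
    a = np.asarray(vecA, dtype=float)
    b = np.asarray(vecB, dtype=float)
    norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        # Cosine similarity is undefined for a zero vector; fall back to distance 1
        return 1.0
    return 1 - np.dot(a, b) / (norm_a * norm_b)

same_direction = distCosine([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = distCosine([1, 0], [0, 1])            # no shared direction
opposite = distCosine([1, 0], [-1, 0])             # opposite directions
zero_vector = distCosine([0, 0], [1, 1])           # guarded special case

print(same_direction, orthogonal, opposite, zero_vector)
```

In general the distance ranges from 0 to 2. Term-count and TF-IDF vectors have no negative entries, so for document vectors it stays between 0 and 1.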

## 2 | Data Preprocessing

In [26]:
# loading the matrix data and transposing it
data_matrix = np.loadtxt("matrix.txt", delimiter=",")
data_matrix = data_matrix.T
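
The transpose matters because the file stores terms as rows and documents as columns, while the clustering code expects one row per document. The sketch below illustrates this on a tiny in-memory stand-in for "matrix.txt"; the toy values and the variable names `toy_file`, `term_by_doc`, and `doc_by_term` are assumptions for illustration only.

```python
import io
import numpy as np

# Hypothetical stand-in for "matrix.txt": 3 terms (rows) x 4 documents (columns)
toy_file = io.StringIO("1,0,2,0\n0,3,0,1\n4,0,0,5\n")

term_by_doc = np.loadtxt(toy_file, delimiter=",")  # shape (n_terms, n_docs)
doc_by_term = term_by_doc.T                        # shape (n_docs, n_terms)

print(term_by_doc.shape, doc_by_term.shape)
print(doc_by_term[0])  # term counts for the first document
```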

In [27]:
# splitting the dataset into training and testing dataset
train_data, test_data = train_test_split(data_matrix, test_size=0.2, random_state=99)
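
Since the goal is to compare clusters with the actual categories, it may help to split the labels in the same call so that each label stays aligned with its document. The sketch below shows the idea on toy arrays; `toy_docs` and `toy_labels` are hypothetical stand-ins, and loading "classes.txt" is left out because its exact layout is not shown here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 10 documents x 3 terms, plus one category label per document
toy_docs = np.arange(30).reshape(10, 3)
toy_labels = np.arange(10) % 5

# Passing both arrays applies the same shuffle and split to each of them
docs_train, docs_test, labels_train, labels_test = train_test_split(
    toy_docs, toy_labels, test_size=0.2, random_state=99
)

print(docs_train.shape, docs_test.shape)
print(labels_train, labels_test)
```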

In [28]:
# performing the TF-IDF transformation
tfidf_transformer = TfidfTransformer()
train_data_tfidf = tfidf_transformer.fit_transform(train_data).toarray()
test_data_tfidf = tfidf_transformer.transform(test_data).toarray()
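
The IDF weights are learned from the training data only and then reused for the test data, which is why `fit_transform` is called on one and `transform` on the other. A small sketch on toy counts (the `toy_*` names are illustrative assumptions) also shows that, with the default settings, each non-empty row comes out with unit L2 length:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrices: rows are documents, columns are terms
toy_train = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1]])
toy_test = np.array([[1, 1, 0]])

toy_tfidf = TfidfTransformer()  # defaults include norm="l2"
toy_train_tfidf = toy_tfidf.fit_transform(toy_train).toarray()  # learns IDF weights
toy_test_tfidf = toy_tfidf.transform(toy_test).toarray()        # reuses those weights

train_norms = np.linalg.norm(toy_train_tfidf, axis=1)
print(toy_tfidf.idf_)  # one IDF weight per term
print(train_norms)     # each row has unit length
```

For unit-length rows, the cosine distance between two documents reduces to one minus their dot product.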

## 3 | Performing KMeans Clustering

In [30]:
with open('terms.txt', 'r') as file:
    vocabulary = [line.strip() for line in file]
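
One plausible use of the vocabulary is to label each cluster by the terms with the largest weights in its centroid. The sketch below assumes that the line order in "terms.txt" matches the term order in the matrix; `top_terms`, `toy_vocabulary`, and `toy_centroid` are hypothetical names used only for illustration.

```python
import numpy as np

def top_terms(centroid, vocabulary, n=3):
    """Return the n vocabulary terms with the largest weights in a centroid."""
    centroid = np.asarray(centroid).ravel()
    top_idx = np.argsort(centroid)[::-1][:n]  # indices sorted by descending weight
    return [vocabulary[i] for i in top_idx]

toy_vocabulary = ["goal", "key", "price", "team"]
toy_centroid = [0.7, 0.0, 0.1, 0.5]

result = top_terms(toy_centroid, toy_vocabulary, n=2)
print(result)  # ['goal', 'team']
```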

In [None]:
k = 5

centroids, cluster_assment = kMeans(train_data_tfidf, k, distMeas=distCosine, createC