# Medius Health Data Science Challenge
By Arwa Siddiqui

#### Data Description: 
1. Dataset is provided in “data” folder. Each file in the data folder is considered as a document.
2. There are 300 documents.
3. Each document has some text information.

#### Task Description: 
Grouping documents with the same semantic description into clusters. 
1. Process the text data in each document/file using NLP, data processing, text mining.
3. Develop a model to partition the data into multiple clusters. It is required to develop the end-to-end model in python instead of using any data clustering libraries or pre-trained models.

The **outcome** of the model: 
1. Number of clusters and the data points in each cluster. 
2. Report the number of clusters found in the data.
3. Find out the topics of each cluster. (You can run any benchmark off-the-shelf topic modelling algorithm like Latent Dirichlet Allocation (LDA) or PLSA)
4. If possible, visualize the cluster. (bonus point)

## Load imports 

In [141]:
# Import libraries 
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

import re
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import genesis
#nltk.download('genesis')
#nltk.download('wordnet')
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
genesis_ic = wn.ic(genesis, False, 0.0)

from csv import reader,writer
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.corpus import stopwords
from sklearn.metrics import roc_auc_score

## Making KNN classifier from scratch

In [126]:
# Build KNN Classifier from scratch 

class KNN_NLC_Classifer():
    def __init__(self, k=1, distance_type = 'path'):
        self.k = k
        self.distance_type = distance_type

    # This function is used for training
    def fit(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train

    # This function runs the K(1) nearest neighbour algorithm and
    # returns the label with closest match. 
    def predict(self, x_test):
        self.x_test = x_test
        y_predict = []

        for i in range(len(x_test)):
            max_sim = 0
            max_index = 0
            for j in range(self.x_train.shape[0]):
                temp = self.document_similarity(x_test[i], self.x_train[j])
                if temp > max_sim:
                    max_sim = temp
                    max_index = j
            y_predict.append(self.y_train[max_index])
        return y_predict

# Process Text Documents

In [205]:
# Access documents from Dataset
import os

# Get a list of all files under the current directory
flist = []
for root, dirs, files in os.walk('data/'):
    flist += [os.path.join(root, f) for f in files]
flist.pop(0)
print("Number of documents:", len(flist))

Number of documents: 300


In [206]:
# Read documents into one document (i.e. list)
new_list = []
for file in flist: 
    f = open(file, "r")
    text = f.read()
    new_list.append(text)
    f.close()

In [207]:
raw = raw.replace('\n', ''); raw = raw.replace('\t', ''); raw = raw.replace('\\n', ''); raw = raw.replace('\\t', '')
raw = raw.replace('::', ''); raw = raw.replace('>>', '')
raw = str(new_list).strip('[]')
tokens = word_tokenize(raw)
sents = nltk.sent_tokenize(raw)

words = [w.lower() for w in tokens]
vocab = sorted(set(words))

In [204]:
print(words)



In [202]:
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
fdist.most_common(10)

[('e', 72734),
 ('n', 61864),
 ('t', 56244),
 ('a', 51345),
 ('o', 47969),
 ('i', 45961),
 ('s', 45075),
 ('r', 39190),
 ('c', 25865),
 ('l', 25714)]

In [140]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
[porter.stem(t) for t in tokens]

["'xref",
 ':',
 'cantaloupe.srv.cs.cmu.edu',
 'talk.politics.misc:176869',
 'alt.sex:110443',
 'soc.men:67802',
 'misc.legal:59959\\npath',
 ':',
 'cantaloupe.srv.cs.cmu.edu',
 '!',
 'das-news.harvard.edu',
 '!',
 'husc-news.harvard.edu',
 '!',
 'hsdndev',
 '!',
 'yale',
 '!',
 'gumbi',
 '!',
 'wupost',
 '!',
 'zaphod.mps.ohio-state.edu',
 '!',
 'ub',
 '!',
 'galileo.cc.rochester.edu',
 '!',
 'uhura.cc.rochester.edu',
 '!',
 'as010b\\nnewsgroup',
 ':',
 'talk.politics.misc',
 ',',
 'alt.sex',
 ',',
 'soc.men',
 ',',
 'misc.legal\\nsubject',
 ':',
 'Re',
 ':',
 'stop',
 'put',
 'down',
 'white',
 'het',
 'males.\\nmessage-id',
 ':',
 '<',
 '1993apr5.213327.23802',
 '@',
 'galileo.cc.rochester.edu',
 '>',
 '\\nfrom',
 ':',
 'as010b',
 '@',
 'uhura.cc.rochester.edu',
 '(',
 'tree',
 'of',
 'schnopia',
 ')',
 '\\ndate',
 ':',
 '5',
 'apr',
 '93',
 '21:33:27',
 'gmt\\nsender',
 ':',
 'news',
 '@',
 'galileo.cc.rochester.edu\\nrefer',
 ':',
 '<',
 'djng2b1w165w',
 '@',
 'cybernet.cse.fau.ed

916279

154451

761828