<a href="https://colab.research.google.com/github/Hamzakhan942/A-semantic-approach-for-text-clustering-using-WordNet-and-lexical-chains/blob/main/A_semantic_approach_for_text_clustering_using_WordNet_and_lexical_chains.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###A semantic approach for text clustering using WordNet and lexical chains
## **http://dx.doi.org/10.1016/j.eswa.2014.10.023**
Traditional clustering algorithms do not consider the semantic relationships among words so that cannot
accurately represent the meaning of documents. To overcome this problem, introducing semantic information from ontology such as WordNet has been widely used to improve the quality of text clustering.
However, there still exist several challenges, such as synonym and polysemy, high dimensionality,
extracting core semantics from texts, and assigning appropriate description for the generated clusters.
In this paper, we report our attempt towards integrating WordNet with lexical chains to alleviate these
problems. The proposed approach exploits ontology hierarchical structure and relations to provide a
more accurate assessment of the similarity between terms for word sense disambiguation. Furthermore,
we introduce lexical chains to extract a set of semantically related words from texts, which can represent
the semantic content of the texts. Although lexical chains have been extensively used in text summarization, their potential impact on text clustering problem has not been fully investigated. Our integrated
way can identify the theme of documents based on the disambiguated core features extracted, and in parallel downsize the dimensions of feature space. The experimental results using the proposed framework
on reuters-21578 show that clustering performance improves significantly compared to several classical
methods.

###Importing Libraries

In [None]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk import word_tokenize 
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer 
from nltk.stem import WordNetLemmatizer

from sklearn.metrics.cluster import homogeneity_score
from sklearn.metrics import f1_score
from sklearn.metrics.cluster import completeness_score
from sklearn.cluster import KMeans

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
stop_words = set(stopwords.words('english'))

###Sense Disambiguation Class

In [None]:
"""
Sift4
-----

As described in:

    <http://siderite.blogspot.com/2014/11/super-fast-and-accurate-string-distance.html>

Sift4 is an approximation of Levenshtein distance,
with O(n) complexity (whereas Levensthein is O(n*m)).
"""


def sift4(s1, s2, max_offset=5):
    """
    This is an implementation of general Sift4.
    """
    t1, t2 = list(s1), list(s2)
    l1, l2 = len(t1), len(t2)

    if not s1:
        return l2

    if not s2:
        return l1

    # Cursors for each string
    c1, c2 = 0, 0

    # Largest common subsequence
    lcss = 0

    # Local common substring
    local_cs = 0

    # Number of transpositions ('ab' vs 'ba')
    trans = 0

    # Offset pair array, for computing the transpositions
    offsets = []

    while c1 < l1 and c2 < l2:
        if t1[c1] == t2[c2]:
            local_cs += 1

            # Check if current match is a transposition
            is_trans = False
            i = 0
            while i < len(offsets):
                ofs = offsets[i]
                if c1 <= ofs['c1'] or c2 <= ofs['c2']:
                    is_trans = abs(c2-c1) >= abs(ofs['c2'] - ofs['c1'])
                    if is_trans:
                        trans += 1
                    elif not ofs['trans']:
                        ofs['trans'] = True
                        trans += 1
                    break
                elif c1 > ofs['c2'] and c2 > ofs['c1']:
                    del offsets[i]
                else:
                    i += 1
            offsets.append({
                'c1': c1,
                'c2': c2,
                'trans': is_trans
            })

        else:
            lcss += local_cs
            local_cs = 0
            if c1 != c2:
                c1 = c2 = min(c1, c2)

            for i in range(max_offset):
                if c1 + i >= l1 and c2 + i >= l2:
                    break
                elif c1 + i < l1 and s1[c1+i] == s2[c2]:
                    c1 += i - 1
                    c2 -= 1
                    break

                elif c2 + i < l2 and s1[c1] == s2[c2 + i]:
                    c2 += i - 1
                    c1 -= 1
                    break

        c1 += 1
        c2 += 1

        if c1 >= l1 or c2 >= l2:
            lcss += local_cs
            local_cs = 0
            c1 = c2 = min(c1, c2)

    lcss += local_cs
    return round(max(l1, l2) - lcss + trans)


def penn_to_wordnet(tag):
    """
    Convert a Penn Treebank PoS tag to WordNet PoS tag.
    """
    if tag in ['NN', 'NNS', 'NNP', 'NNPS']:
        return 'n' #wordnet.NOUN
    elif tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']:
        return 'v' #wordnet.VERB
    elif tag in ['RB', 'RBR', 'RBS']:
        return 'r' #wordnet.ADV
    elif tag in ['JJ', 'JJR', 'JJS']:
        return 'a' #wordnet.ADJ
    return None





In [None]:
!pip install spacy



In [None]:
import math
import numpy as np
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from scipy.sparse.csgraph import connected_components
import spacy
# from broca.distance.sift4 import sift4
# from broca.vectorize import Vectorizer
# from broca.common.util import penn_to_wordnet
# from broca.common.shared import spacy

# from spacy.lang.en import English
# spacy = English()

stops = stop_words
nlp = spacy.load("en_core_web_sm")


class DCSVectorizer():
    def __init__(self, alpha=1.5, relation_weights=[0.8, 0.5, 0.3], n_chains=100):
        self.alpha = 1.5
        self.relation_weights = relation_weights
        self.n_chains = n_chains

        # Cache concept => description
        # and (c1, c2) => similarity
        self.descriptions = {}
        self.concept_sims = {}


    def vectorize(self, docs):
        """
        Vectorizes a list of documents using their DCS representations.
        """
        doc_core_sems, all_concepts = self._extract_core_semantics(docs)

        shape = (len(docs), len(all_concepts))
        vecs = np.zeros(shape)
        for i, core_sems in enumerate(doc_core_sems):
            for con, weight in core_sems:
                j = all_concepts.index(con)
                vecs[i,j] = weight

        # Normalize
        return vecs/np.max(vecs)


    def _process_doc(self, doc):
        """
        Applies DCS to a document to extract its core concepts and their weights.
        """
        # Prep
        doc = doc.lower()
        # nlp = spacy.load("en_core_web_sm") (t, penn_to_wordnet(t.tag_)) spacy(doc, tag=True, parse=False, entity=False)
        # nlp = spacy.load("en_core_web_sm")
        doc = nlp(doc)
        tagged_tokens = [(t.text, penn_to_wordnet(t.tag_)) for t in doc]
        tokens = [t for t, tag in tagged_tokens]
        term_concept_map = self._disambiguate_doc(tagged_tokens)
        concept_weights = self._weight_concepts(tokens, term_concept_map)

        # Compute core semantics
        lexical_chains = self._lexical_chains(doc, term_concept_map)
        core_semantics = self._core_semantics(lexical_chains, concept_weights)
        core_concepts = [c for chain in core_semantics for c in chain]

        return [(con, concept_weights[con]) for con in core_concepts]


    def _disambiguate_doc(self, tagged_tokens):
        """
        Takes a list of tagged tokens, representing a document,
        in the form:
            [(token, tag), ...]
        And returns a mapping of terms to their disambiguated concepts (synsets).
        """

        # Group tokens by PoS
        pos_groups = {pos: [] for pos in [wn.NOUN, wn.VERB, wn.ADJ, wn.ADV]}
        for tok, tag in tagged_tokens:
            if tag in pos_groups:
                pos_groups[tag].append(tok)

        #print(pos_groups)

        # Map of final term -> concept mappings
        map = {}
        for tag, toks in pos_groups.items():
            map.update(self._disambiguate_pos(toks, tag))

        #nice_map = {k: map[k].lemma_names() for k in map.keys()}
        #print(json.dumps(nice_map, indent=4, sort_keys=True))

        return map


    def _disambiguate_pos(self, terms, pos):
        """
        Disambiguates a list of tokens of a given PoS.
        """
        # Map the terms to candidate concepts
        # Consider only the top 3 most common senses
        candidate_map = {term: wn.synsets(term, pos=pos)[:3] for term in terms}

        # Filter to unique concepts
        concepts = set(c for cons in candidate_map.values() for c in cons)

        # Back to list for consistent ordering
        concepts = list(concepts)
        sim_mat = self._similarity_matrix(concepts)

        # Final map of terms to their disambiguated concepts
        map = {}

        # This is terrible
        # For each term, select the candidate concept
        # which has the maximum aggregate similarity score against
        # all other candidate concepts of all other terms sharing the same PoS
        for term, cons in candidate_map.items():
            # Some words may not be in WordNet
            # and thus have no candidate concepts, so skip
            if not cons:
                continue
            scores = []
            for con in cons:
                i = concepts.index(con)
                scores_ = []
                for term_, cons_ in candidate_map.items():
                    # Some words may not be in WordNet
                    # and thus have no candidate concepts, so skip
                    if term == term_ or not cons_:
                        continue
                    cons_idx = [concepts.index(c) for c in cons_]
                    top_sim = max(sim_mat[i,cons_idx])
                    scores_.append(top_sim)
                scores.append(sum(scores_))
            best_idx = np.argmax(scores)
            map[term] = cons[best_idx]

        return map


    def _similarity_matrix(self, concepts):
        """
        Computes a semantic similarity matrix for a set of concepts.
        """
        n_cons = len(concepts)
        sim_mat = np.zeros((n_cons, n_cons))
        for i, c1 in enumerate(concepts):
            for j, c2 in enumerate(concepts):
                # Just build the lower triangle
                if i >= j:
                    sim_mat[i,j] = self._semsim(c1, c2) if i != j else 1.
        return sim_mat + sim_mat.T - np.diag(sim_mat.diagonal())


    def _semsim(self, c1, c2):
        """
        Computes the semantic similarity between two concepts.
        The semantic similarity is a combination of two sem sims:
            1. An "explicit" sem sim metric, that is, one which is directly
            encoded in the WordNet graph. Here it is just Wu-Palmer similarity.
            2. An "implicit" sem sim metric. See `_imp_semsim`.
        Note we can't use the NLTK Wu-Palmer similarity implementation because we need to
        incorporate the implicit sem sim, but it's fairly straightforward --
        leaning on <http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html#Synset.wup_similarity>,
        see that for more info. Though...the formula in the paper includes an extra term in the denominator,
        which is wrong, so we leave it out.
        """
        if c1 == c2:
            return 1.

        if (c1, c2) in self.concept_sims:
            return self.concept_sims[(c1, c2)]

        elif (c2, c1) in self.concept_sims:
            return self.concept_sims[(c2, c1)]

        else:
            need_root = c1._needs_root()
            subsumers = c1.lowest_common_hypernyms(c2, simulate_root=need_root)

            if not subsumers:
                # For relationships not in WordNet, fallback on just implicit semsim.
                return self._imp_semsim(c1, c2)

            subsumer = subsumers[0]
            depth = subsumer.max_depth() + 1
            len1 = c1.shortest_path_distance(subsumer, simulate_root=need_root)
            len2 = c2.shortest_path_distance(subsumer, simulate_root=need_root)

            if len1 is None or len2 is None:
                # See above
                return self._imp_semsim(c1, c2)

            len1 += depth
            len2 += depth

            imp_score = self._imp_semsim(c1, c2)

            sim = (2.*depth + imp_score)/(len1 + len2 + imp_score)
            self.concept_sims[(c1, c2)] = sim
            return sim


    def _imp_semsim(self, c1, c2):
        """
        The paper's implicit semantic similarity metric
        involves iteratively computing string overlaps;
        this is a modification where we instead use
        inverse Sift4 distance (a fast approximation of Levenshtein distance).
        Frankly ~ I don't know if this is an appropriate
        substitute, so I'll have to play around with this and see.
        """

        desc1 = self._description(c1)
        desc2 = self._description(c2)

        raw_sim = 1/(sift4(desc1, desc2) + 1)
        return math.log(raw_sim + 1)


    def _core_semantics(self, lex_chains, concept_weights):
        """
        Returns the n representative lexical chains for a document.
        """
        chain_scores = [self._score_chain(lex_chain, adj_submat, concept_weights) for lex_chain, adj_submat in lex_chains]
        scored_chains = zip(lex_chains, chain_scores)
        scored_chains = sorted(scored_chains, key=lambda x: x[1], reverse=True)

        thresh = (self.alpha/len(lex_chains)) * sum(chain_scores)
        return [chain for (chain, adj_mat), score in scored_chains if score >= thresh][:self.n_chains]


    def _extract_core_semantics(self, docs):
        """
        Extracts core semantics for a list of documents, returning them along with
        a list of all the concepts represented.
        """
        all_concepts = []
        doc_core_sems = []
        for doc in docs:
            core_sems = self._process_doc(doc)
            doc_core_sems.append(core_sems)
            all_concepts += [con for con, weight in core_sems]
        return doc_core_sems, list(set(all_concepts))


    def _lexical_chains(self, doc, term_concept_map):
        """
        Builds lexical chains, as an adjacency matrix,
        using a disambiguated term-concept map.
        """
        concepts = list({c for c in term_concept_map.values()})

        # Build an adjacency matrix for the graph
        # Using the encoding:
        # 1 = identity/synonymy, 2 = hypernymy/hyponymy, 3 = meronymy, 0 = no edge
        n_cons = len(concepts)
        adj_mat = np.zeros((n_cons, n_cons))

        for i, c in enumerate(concepts):
            # TO DO can only do i >= j since the graph is undirected
            for j, c_ in enumerate(concepts):
                edge = 0
                if c == c_:
                    edge = 1
                # TO DO when should simulate root be True?
                elif c_ in c._shortest_hypernym_paths(simulate_root=False).keys():
                    edge = 2
                elif c in c_._shortest_hypernym_paths(simulate_root=False).keys():
                    edge = 2
                elif c_ in c.member_meronyms() + c.part_meronyms() + c.substance_meronyms():
                    edge = 3
                elif c in c_.member_meronyms() + c_.part_meronyms() + c_.substance_meronyms():
                    edge = 3

                adj_mat[i,j] = edge

        # Group connected concepts by labels
        concept_labels = connected_components(adj_mat, directed=False)[1]
        lexical_chains = [([], []) for i in range(max(concept_labels) + 1)]
        for i, concept in enumerate(concepts):
            label = concept_labels[i]
            lexical_chains[label][0].append(concept)
            lexical_chains[label][1].append(i)

        # Return the lexical chains as (concept list, adjacency sub-matrix) tuples
        return [(chain, adj_mat[indices][:,indices]) for chain, indices in lexical_chains]


    def _score_chain(self, lexical_chain, adj_submat, concept_weights):
        """
        Computes the score for a lexical chain.
        """
        scores = []

        # Compute scores for concepts in the chain
        for i, c in enumerate(lexical_chain):
            score = concept_weights[c] * self.relation_weights[0]
            rel_scores = []
            for j, c_ in enumerate(lexical_chain):
                if adj_submat[i,j] == 2:
                    rel_scores.append(self.relation_weights[1] * concept_weights[c_])

                elif adj_submat[i,j] == 3:
                    rel_scores.append(self.relation_weights[2] * concept_weights[c_])

            scores.append(score + sum(rel_scores))

        # The chain's score is just the sum of its concepts' scores
        return sum(scores)


    def _weight_concepts(self, tokens, term_concept_map):
        """
        Calculates weights for concepts in a document.
        This is just the frequency of terms which map to a concept.
        """

        weights = {c: 0 for c in term_concept_map.values()}
        for t in tokens:
            # Skip terms that aren't one of the PoS we used
            if t not in term_concept_map:
                continue
            con = term_concept_map[t]
            weights[con] += 1

        # TO DO paper doesn't mention normalizing these weights...should we?
        return weights


    def _description(self, concept):
        """
        Returns a "description" of a concept,
        as defined in the paper.
        The paper describes the description as a string,
        so this is a slight modification where we instead represent
        the definition as a list of tokens.
        """
        if concept not in self.descriptions:
            lemmas = concept.lemma_names()
            gloss = self._gloss(concept)
            glosses = [self._gloss(rel) for rel in self._related(concept)]
            raw_desc = ' '.join(lemmas + [gloss] + glosses)
            desc = [w for w in raw_desc.split() if w not in stops]
            self.descriptions[concept] = desc
        return self.descriptions[concept]


    def _gloss(self, concept):
        """
        The concatenation of a concept's definition and its examples.
        """
        return  ' '.join([concept.definition()] + concept.examples())


    def _related(self, concept):
        """
        Returns related concepts for a concept.
        """
        return concept.hypernyms() + \
                concept.hyponyms() + \
                concept.member_meronyms() + \
                concept.substance_meronyms() + \
                concept.part_meronyms() + \
                concept.member_holonyms() + \
                concept.substance_holonyms() + \
                concept.part_holonyms() + \
                concept.attributes() + \
                concept.also_sees() + \
                concept.similar_tos()


###Loading Dataset

In [None]:
# import os
# DATA_SET_PATH = '/content/drive/MyDrive/FYP-1-DLS-Clustering /DataSets/Reuters-DataSets/'
# dataset_files = os.listdir(DATA_SET_PATH+'Reuters')
# categories = os.listdir(DATA_SET_PATH+'Reuters GT')

Mapping classes

In [None]:
# cat_map = {}

# for category in categories:
#   cat_map[category] = os.listdir(DATA_SET_PATH+'Reuters GT/'+category)

creating corpus

In [None]:
# corpus = []
# for file_name in dataset_files:
#   text = open(DATA_SET_PATH+'Reuters/'+file_name, 'r')
#   text = text.read()
#   corpus.append(text)

# corpus

custom encoder-decoder

In [None]:
# encoder = {}
# decoder = {}

# for code, cat in enumerate(cat_map.keys()):
#   encoder[cat] = code
#   decoder[code] = cat

# encoder

###Labelling Dataset

In [None]:
# labels = []
# decoded_cats = []
# seen = set()
# for file_name in dataset_files:
#   for key in cat_map.keys():
#     if file_name in cat_map[key] and file_name not in seen:
#       seen.add(file_name)
#       labels.append(encoder[key])
#       decoded_cats.append(key)

# len(labels)
# decoded_cats 

In [None]:
# id = []
# doc = []
# label = []
# for sentence in corpus:
#   id.append(sentence.split('ID: ')[1].split(' ')[0])
#   if len(sentence.split('TEXT: ')) > 1:
#     doc.append(sentence.split('TEXT: ')[1])
#   else:
#     doc.append(' ')

saving Dataset as compressed pickle

In [None]:
# import pandas as pd
# data_frame = pd.DataFrame()
# data_frame['id'] = id
# data_frame['file_name'] = dataset_files
# data_frame['category'] = decoded_cats
# data_frame['label'] = labels 
# data_frame['doc'] = doc
# data_frame.to_pickle('/content/drive/MyDrive/FYP-1-DLS-Clustering /DataSets/reuters_pickle.pk1')

In [None]:
# data_frame

Unnamed: 0,id,file_name,category,label,doc
0,9958,21531.txt,barley,11,
1,17379,8200.txt,ipi,0,China's total wage bill for state employees gr...
2,17424,8251.txt,carcass,3,Members of local 174 of the United Food and Co...
3,17380,8202.txt,barley,11,West German use of tapioca is likely to declin...
4,17303,8117.txt,ipi,0,Japan's preliminary industrial production inde...
...,...,...,...,...,...
781,17387,8209.txt,iron-steel,2,A Greek bulk carrier loaded with iron ore has ...
782,17384,8206.txt,jobs,5,Unemployment in the European Community fell in...
783,17312,8127.txt,iron-steel,2,South Korea unveiled a shopping list of 2.6 bi...
784,17315,8130.txt,iron-steel,2,Stelco Inc said contract negotiations with Uni...


###Loading saved pickle data frame

In [None]:
import pandas as pd
unpickled_df = pd.read_pickle("/content/drive/MyDrive/FYP-1-DLS-Clustering /DataSets/reuters_pickle.pk1") 

In [None]:
corpus_unpkled = unpickled_df['doc'].tolist()  

filteration step

In [None]:
corpus_filtered = []
labels_filterted = unpickled_df['label'].tolist()
for index, sentence in enumerate(corpus_unpkled):
  if len(sentence) > 1:
    corpus_filtered.append(sentence)
  else:
    labels_filterted.pop(index)


###Corpus Disambiguation - no preprocessing

In [None]:
disamb = DCSVectorizer()
results = disamb.vectorize(corpus_filtered)

In [None]:
results.shape 

(725, 2071)

In [None]:
np.save('/content/drive/MyDrive/FYP-1-DLS-Clustering /DataSets/reuters_results.npy', results)
# np.save('/content/drive/MyDrive/FYP-1-DLS-Clustering /DataSets/reuters_50_lex_results.npy', results)

###1. Saved Results 

In [None]:
X = np.load('/content/drive/MyDrive/FYP-1-DLS-Clustering /DataSets/reuters_50_lex_results.npy')

In [None]:
kmeans = KMeans(n_clusters=15, random_state=135).fit(X)

In [None]:
clusts = kmeans.labels_.tolist()
result_frame = pd.DataFrame()
result_frame['True Label'] = labels_filterted
result_frame['Predicted Labels'] = clusts

In [None]:
f_score = f1_score(labels_filterted, clusts, average='micro') 
purity = homogeneity_score(labels_filterted, clusts) 
completeness = completeness_score(labels_filterted, clusts)
print('=========================================')
print(f'| F Score\t| {f_score}\t|\n| Purity\t| {purity}\t|\n| Completeness\t| {completeness}\t|')
print('=========================================')

| F Score	| 0.06344827586206897	|
| Purity	| 0.04042845972406231	|
| Completeness	| 0.06398373825269524	|


###Corpus Disambiguation - Prepocessed

In [None]:
def pre_processor(docs):
  processed_text = []
  ps = PorterStemmer()
  lm = WordNetLemmatizer() 
  for sent in docs:
      sent = sent.lower()
      sent = word_tokenize(sent)
      sent = [word for word in sent if word not in stop_words]
      sent = [clean for clean in sent if clean.isalpha()]  # removal of special characters
      sent = [lm.lemmatize(lem) for lem in sent]
      sent = [ps.stem(stem) for stem in sent]
      sent = ' '.join(sent)
      processed_text.append(sent)

  return processed_text

In [None]:
unpickled_two = pd.read_pickle("/content/drive/MyDrive/FYP-1-DLS-Clustering /DataSets/reuters_pickle.pk1")
stem_lem_corpus =  unpickled_two['doc'].tolist()
stem_lem_corpus = pre_processor(stem_lem_corpus) 

In [None]:
new_corpus_filtered = []
new_labels_filterted = unpickled_two['label'].tolist()
for index, sentence in enumerate(stem_lem_corpus):
  if len(sentence) > 1:
    new_corpus_filtered.append(sentence)
  else:
    new_labels_filterted.pop(index)


In [None]:
disamb = DCSVectorizer()
new_results = disamb.vectorize(new_corpus_filtered) 

In [None]:
new_results.shape

(725, 1351)

In [None]:
# np.save('/content/drive/MyDrive/FYP-1-DLS-Clustering /DataSets/new_reuters_results.npy', new_results)
np.save('/content/drive/MyDrive/FYP-1-DLS-Clustering /DataSets/new_reuters_100_lex_results.npy', new_results)

###2. Saved Results

In [None]:
X2 = np.load('/content/drive/MyDrive/FYP-1-DLS-Clustering /DataSets/new_reuters_100_lex_results.npy')

In [None]:
new_kmeans = KMeans(n_clusters=15, random_state=135).fit(X)
new_clusts = new_kmeans.labels_.tolist()
purity = homogeneity_score(new_labels_filterted, new_clusts)
f_score = f1_score(new_labels_filterted, new_clusts, average='micro') 
completeness = completeness_score(new_labels_filterted, new_clusts)
print('=========================================')
print(f'| F Score\t| {f_score}\t|\n| Purity\t| {purity}\t|\n| Completeness\t| {completeness}\t|')
print('=========================================') 
analysis_results = [f_score, purity, completeness]

| F Score	| 0.06344827586206897	|
| Purity	| 0.04042845972406231	|
| Completeness	| 0.06398373825269524	|


###Tuned Results for max purity

In [None]:
new_kmeans = KMeans(n_clusters=97, random_state=34).fit(X)
new_clusts = new_kmeans.labels_.tolist()
purity = homogeneity_score(new_labels_filterted, new_clusts)
f_score = f1_score(new_labels_filterted, new_clusts, average='micro') 
completeness = completeness_score(new_labels_filterted, new_clusts)
print('=========================================')
print(f'| F Score\t| {f_score}\t|\n| Purity\t| {purity}\t|\n| Completeness\t| {completeness}\t|')
print('=========================================')
tuned_purity = [f_score, purity, completeness]

| F Score	| 0.009655172413793104	|
| Purity	| 0.20283730361969843	|
| Completeness	| 0.16146775772465602	|


###Comparison Study

In [None]:
dat = pd.read_pickle('/content/drive/MyDrive/FYP-1-DLS-Clustering /DataSets/graphing_data.pkl') 
dat['dcs'] = analysis_results
dat['tuned_purity'] = tuned_purity
dat

Unnamed: 0,tf-idf,n-gram,dcs,tuned_purity
F Score,0.06069,0.067586,0.111724,0.009655
Purity,0.068473,0.064117,0.04181,0.202837
Completeness,0.065251,0.066043,0.071828,0.161468


Graphing the data

In [None]:
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Bar(
  x = [['F Score', 'Purity', 'Completeness'],
       ["TF IDF", "TF IDF", "TF IDF"]],
  y = [0.06069, 0.068473, 0.065251],
  name = "TF IDF",
))

fig.add_trace(go.Bar(
  x = [['F Score', 'Purity', 'Completeness'],
       ["N GRAM", "N GRAM", "N GRAM"]],
  y = dat['n-gram'],
  name = "N GRAM",
))

fig.add_trace(go.Bar(
  x = [['F Score', 'Purity', 'Completeness'],
       ["Word Sense", "Word Sense", "Word Sense"]],
  y = dat['dcs'],
  name = "Word Sense",
))


fig.update_layout(title_text="Multi-category axis")

fig.show()

Purity Tuned Resulted

In [None]:
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Bar(
  x = [['F Score', 'Purity', 'Completeness'],
       ["TF IDF", "TF IDF", "TF IDF"]],
  y = [0.06069, 0.068473, 0.065251],
  name = "TF IDF",
))

fig.add_trace(go.Bar(
  x = [['F Score', 'Purity', 'Completeness'],
       ["N GRAM", "N GRAM", "N GRAM"]],
  y = dat['n-gram'],
  name = "N GRAM",
))

fig.add_trace(go.Bar(
  x = [['F Score', 'Purity', 'Completeness'],
       ["Word Sense", "Word Sense", "Word Sense"]],
  y = dat['dcs'],
  name = "Word Sense",
))

fig.add_trace(go.Bar(
  x = [['F Score', 'Purity', 'Completeness'],
       ["DCS Tuned", "DCS Tuned", "DCS Tuned"]],
  y = dat['tuned_purity'],
  name = "DCS Tuned",
))

fig.update_layout(title_text="Multi-category axis")

fig.show()

###Logs of results




seeed 32 - F Score: 0.08275862068965517 Purity: 0.074076717340167 - 29

seeed 56 - F Score: 0.08275862068965517 Purity: 0.11232005623026102 - 45

seeed 34 - F Score: 0.08275862068965517 Purity: 0.20283730361969843 - 97

seeed 126 - F Score: 0.08275862068965517 Purity: 0.03270145809601008 - 10

seeed 129 - F Score: 0.08275862068965517 Purity: 0.037093393568451406 - 11

seeed 77 - F Score: 0.08275862068965517 Purity: 0.03721067754401603 - 12

seeed 116 - F Score: 0.08275862068965517 Purity: 0.04066024768518942 - 13

seeed 116 - F Score: 0.08275862068965517 Purity: 0.04197344729337054 - 14

seeed 1 - F Score: 0.08275862068965517 Purity: 0.05088905270023905 - 15

In [None]:
# 29. -seeed 32- F Score: 0.08275862068965517 Purity: 0.074076717340167
# 45. -seeed 56- F Score: 0.08275862068965517 Purity: 0.11232005623026102
# 97. -seeed 34- F Score: 0.08275862068965517 Purity: 0.20283730361969843
# 43. -seeed 126- F Score: 0.08275862068965517 Purity: 0.03270145809601008 - 10
# 8. -seeed 129- F Score: 0.08275862068965517 Purity: 0.037093393568451406 - 11
# 25. -seeed 77- F Score: 0.08275862068965517 Purity: 0.03721067754401603 - 12
# 10. -seeed 116- F Score: 0.08275862068965517 Purity: 0.04066024768518942 - 13
# 12. -seeed 116- F Score: 0.08275862068965517 Purity: 0.04197344729337054 - 14
# 62. -seeed 1- F Score: 0.08275862068965517 Purity: 0.05088905270023905 - 15

Tuner 

In [None]:
import random
max_f = 0
max_p = 0
mf = ''
mp = ''
for k in range(0, 100):
  seed = random.randint(0, 150)
  new_kmeans = KMeans(n_clusters=15, random_state=seed).fit(X)
  new_clusts = new_kmeans.labels_.tolist()
  purity = homogeneity_score(new_labels_filterted, new_clusts)
  f_score = f1_score(new_labels_filterted, new_clusts, average='micro') 
  print(f'Seed: {seed}\tF-Score:{f_score}\tPurity: {purity}')
  if f_score > max_f:
    max_f = f_score
    mf = f'Seed: {seed}\tF-Score:{f_score}\tPurity: {purity}'
  if purity > max_p:
    max_p = purity
    mp = f'Seed: {seed}\tF-Score:{f_score}\tPurity: {purity}'

Seed: 25	F-Score:0.07724137931034483	Purity: 0.03474481452610758
Seed: 18	F-Score:0.07172413793103448	Purity: 0.04171561234620978
Seed: 11	F-Score:0.07172413793103448	Purity: 0.03881990021073411
Seed: 45	F-Score:0.08551724137931034	Purity: 0.03705000068618942
Seed: 101	F-Score:0.08	Purity: 0.043447184444153235
Seed: 99	F-Score:0.08689655172413793	Purity: 0.03901449397954723
Seed: 135	F-Score:0.11172413793103449	Purity: 0.04180956080250367
Seed: 57	F-Score:0.08827586206896551	Purity: 0.04047244318333482
Seed: 22	F-Score:0.05655172413793103	Purity: 0.039489503788570156
Seed: 91	F-Score:0.08689655172413793	Purity: 0.04069565653850308
Seed: 92	F-Score:0.06206896551724138	Purity: 0.03794281309835939
Seed: 68	F-Score:0.02620689655172414	Purity: 0.04122321376116574
Seed: 145	F-Score:0.07724137931034483	Purity: 0.03999303826776661
Seed: 3	F-Score:0.08137931034482758	Purity: 0.04168762594642023
Seed: 75	F-Score:0.07586206896551724	Purity: 0.04006531622613011
Seed: 84	F-Score:0.08275862068965517

In [None]:
print(mf,'\n', mp)

Seed: 135	F-Score:0.11172413793103449	Purity: 0.04180956080250367 
 Seed: 65	F-Score:0.07586206896551724	Purity: 0.0488155766749684
