### This notebook explores the QA dataset, with a focus on clustering similar questions.


### Clustering is similar to document categorization: you start with a whole corpus of documents and segregate them into groups based on distinctive properties, attributes, and features of the documents.

### We will try three different clustering algorithms in this notebook:

    * K-means clustering
    * Affinity propagation
    * Ward’s agglomerative hierarchical clustering

### Importing libraries

In [None]:
import json
import numpy as np 
import pandas as pd
import re
import os
import random

# For plotting
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set(style='white', context='notebook', rc={'figure.figsize':(14,10)})

# tqdm_notebook is deprecated in recent tqdm releases; tqdm.notebook provides the same progress bar
from tqdm.notebook import tqdm

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import cosine_similarity

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))        

### Reading the JSON files

In [None]:
ids = []
questions = []

# Each line of the JSONL file holds one training example; only the ID and question text are kept
with open('/kaggle/input/tensorflow2-question-answering/simplified-nq-train.jsonl', 'r') as json_file:
    for line in tqdm(json_file):
        json_data = json.loads(line)
        ids.append(str(json_data['example_id']))
        questions.append(json_data['question_text'])

In [None]:
tr_data = pd.DataFrame()

tr_data['example_id'] = ids
tr_data['question'] = questions

In [None]:
tr_data.head(10)

### Normalization and Feature extraction 

In [None]:
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# Requires the NLTK stopwords corpus and tokenizer data to be available
stopword_list = stopwords.words('english')
wnl = WordNetLemmatizer()
ps = PorterStemmer()

tokenizer = nltk.RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))

def preprocessing(data):
    # Join a pandas Series of strings into one lowercased text, then tokenize and drop stopwords
    txt = data.str.lower().str.cat(sep=' ')
    words = tokenizer.tokenize(txt)
    words = [w for w in words if w not in stop_words]
    return words

def tokenize_text(text):
    tokens = nltk.word_tokenize(text)
    tokens = [token.strip() for token in tokens]
    return tokens

## Helper functions

In [None]:
def remove_special_characters(text):
    tokens = tokenize_text(text)
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = filter(None, [pattern.sub('', token) for token in tokens])
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

In [None]:
def remove_stopwords(text):
    tokens = tokenize_text(text)
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

In [None]:
def keep_text_characters(text):
    filtered_tokens = []
    tokens = tokenize_text(text)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

In [None]:
def normalize_corpus(corpus, lemmatize=True, only_text_chars=False, tokenize=False):
    # Note: `lemmatize` is accepted but no lemmatization step is applied here
    normalized_corpus = []
    for text in corpus:
        text = text.lower()
        text = remove_special_characters(text)
        text = remove_stopwords(text)
        if only_text_chars:
            text = keep_text_characters(text)
        if tokenize:
            text = tokenize_text(text)
        normalized_corpus.append(text)
    return normalized_corpus

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def build_feature_matrix(documents, feature_type='frequency',  ngram_range=(1, 1), min_df=0.0, max_df=1.0):
    feature_type = feature_type.lower().strip()
    if feature_type == 'binary':
        vectorizer = CountVectorizer(binary=True, min_df=min_df, max_df=max_df, ngram_range=ngram_range)
    elif feature_type == 'frequency':
        vectorizer = CountVectorizer(binary=False, min_df=min_df, max_df=max_df, ngram_range=ngram_range)
    elif feature_type == 'tfidf':
        vectorizer = TfidfVectorizer(min_df=min_df, max_df=max_df, ngram_range=ngram_range)
    else:
        raise ValueError("Wrong feature type. Possible values are binary, frequency, or tfidf")
    feature_matrix = vectorizer.fit_transform(documents).astype(float)
    return vectorizer, feature_matrix
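
To get a feel for what the `'tfidf'` option produces, here is a minimal sketch on three toy documents. The documents and the `toy_` names are made up for illustration and are not part of the dataset; the sketch calls `TfidfVectorizer` directly with its default settings rather than going through `build_feature_matrix`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (hypothetical, not taken from the dataset)
toy_docs = [
    "capital france",
    "capital germany",
    "plays batman dark knight",
]

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# One row per document, one column per distinct term
print(toy_matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))

# "capital" occurs in two documents, so in the first document it is weighted
# lower than "france", which occurs only there
row = toy_matrix[0].toarray().ravel()
vocab = toy_vectorizer.vocabulary_
print(row[vocab["capital"]] < row[vocab["france"]])
```

Terms shared by many documents are down-weighted, which is why TF-IDF tends to be a better starting point for clustering than raw counts.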

## Normalize Corpus

In [None]:
## Taking only a subset of questions

qns = tr_data['question'][:500]

In [None]:
norm_docs = normalize_corpus(qns,  lemmatize=True, only_text_chars=True)

In [None]:
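# Take a copy so that adding the cluster column later does not write into a slice of tr_data
tr_data_500 = tr_data.iloc[:500].copy()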

## Extract features

In [None]:
vectorizer, feature_matrix = build_feature_matrix(norm_docs, feature_type='tfidf', min_df=0, max_df=0.8, ngram_range=(1, 1))

### Viewing the number of features & getting the feature names

In [None]:
print(feature_matrix.shape)

In [None]:
# get feature names
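# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
feature_names = vectorizer.get_feature_names_out().tolist()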

In [None]:
# print sample features
print(feature_names[:20])

### Above are some of the features extracted from the normalized documents

## K-means Clustering

The k-means clustering algorithm is a centroid-based clustering model that tries to separate the data into groups of equal variance.
The criterion it minimizes is inertia, also known as the within-cluster sum of squares. One main disadvantage of this algorithm is that the number of clusters k needs to be specified in advance, as is typical of centroid-based clustering models.
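
As a rough illustration of what inertia measures, the sketch below fits k-means on six made-up 2-D points and recomputes the within-cluster sum of squares by hand. The data and the `toy_km` name are assumptions for this example only; `random_state` and `n_init` are set just to keep the toy run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two obvious groups (hypothetical data)
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

toy_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Inertia: sum of squared distances from each point to its assigned centroid
assigned_centroids = toy_km.cluster_centers_[toy_km.labels_]
manual_inertia = ((points - assigned_centroids) ** 2).sum()

# The hand-computed value should match the one reported by scikit-learn
print(manual_inertia, toy_km.inertia_)
```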

In [None]:
from sklearn.cluster import KMeans

# define the k-means clustering function

def k_means(feature_matrix, num_clusters=5):
    km = KMeans(n_clusters=num_clusters, max_iter=10000)
    km.fit(feature_matrix)
    clusters = km.labels_
    return km, clusters

In [None]:
# Set k = 10 (chosen arbitrarily; the right approach would be the elbow method or silhouette score, which we will get to).
# Let's say we want 10 clusters from the list of questions we have

num_clusters = 10
km_obj, clusters = k_means(feature_matrix=feature_matrix,num_clusters=num_clusters)
tr_data_500['Clusters'] = clusters
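
One possible way to choose k less arbitrarily is to compare silhouette scores over a range of values. The sketch below is only an illustration on synthetic data: `make_blobs` stands in for the TF-IDF matrix, and the range of k values is an assumption, not something tuned for this dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix: three well-separated blobs
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette score lies in [-1, 1]; higher means tighter, better separated clusters
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("k with the highest silhouette score:", best_k)
```

For the questions themselves, the same loop could in principle be run with `feature_matrix` in place of `X`. On short, sparse TF-IDF vectors the scores are likely to be low and fairly flat, so treat the result as a hint rather than a definitive answer.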

In [None]:
from collections import Counter

## Getting the total questions per cluster 

c = Counter(clusters)
print(c.items())

### Since we have not used any word embedding techniques to extract features, this won't be the best clustering model. We will start by defining a function to extract the important information from our cluster analysis:

The function below extracts, from the centroids, the key features per cluster that were essential in defining it. It also retrieves the example IDs of the questions that belong to each cluster and stores everything in a dictionary.
We will then define a function that uses this data structure and prints the results in a clear format:

In [None]:
def get_cluster_data(clustering_obj, tr_data_500, feature_names, num_clusters,topn_features=10):
    cluster_details = {}
    # for each centroid, sort the feature indices by weight, highest first
    ordered_centroids = clustering_obj.cluster_centers_.argsort()[:, ::-1]
    # get key features & questions for each cluster
    
    for cluster_num in range(num_clusters):
        cluster_details[cluster_num] = {}
        cluster_details[cluster_num]['cluster_num'] = cluster_num
        key_features = [feature_names[index] for index in ordered_centroids[cluster_num, :topn_features]]
        cluster_details[cluster_num]['key_features'] = key_features
        qnss = tr_data_500[tr_data_500['Clusters'] == cluster_num]['example_id'].values.tolist()
        cluster_details[cluster_num]['Questions'] = qnss
    return cluster_details
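
The centroid ranking used in `get_cluster_data` can be seen in isolation on a small example. The centroid values and feature names below are invented for illustration; only the `argsort()[:, ::-1]` idiom is taken from the function above.

```python
import numpy as np

# Hypothetical centroid matrix: 2 clusters x 4 features
toy_feature_names = ["capital", "movie", "president", "song"]
toy_centroids = np.array([
    [0.70, 0.05, 0.20, 0.00],   # cluster 0
    [0.00, 0.40, 0.10, 0.60],   # cluster 1
])

# argsort sorts ascending, so reverse each row to put the largest weights first
ordered = toy_centroids.argsort()[:, ::-1]

# Keep the two highest-weighted features per cluster
top2 = [[toy_feature_names[i] for i in ordered[c, :2]] for c in range(2)]
print(top2)  # [['capital', 'president'], ['song', 'movie']]
```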

In [None]:
def print_cluster_data(cluster_data):
    # print cluster details
    for cluster_num, cluster_details in cluster_data.items():
        print('Cluster {} details:'.format(cluster_num))
        print('-'*20)
        print('Key features:', cluster_details['key_features'])
        print("Example ID's in this cluster:")
        print(', '.join(cluster_details['Questions']))
        print('='*80)

In [None]:
# Get clustering analysis data

cluster_data = get_cluster_data(clustering_obj=km_obj, tr_data_500=tr_data_500, feature_names=feature_names, num_clusters=num_clusters, topn_features=5)

# print clustering analysis results to see what are those features that come under the same cluster

print_cluster_data(cluster_data)

## Visualize the clusters


There are challenges associated with visualizing clusters, especially when dealing with multidimensional feature spaces and unstructured text data such as this dataset. Dimensionality reduction techniques can be applied to reduce the dimensionality so that we can visualize these clusters in 2- or 3-dimensional plots. We will use PCA here to visualize the clusters.

In [None]:
## Import and apply PCA

# Pairwise cosine distance between all questions (an n_questions x n_questions matrix)
cosine_distance = 1 - cosine_similarity(feature_matrix)

from sklearn.decomposition import PCA

pca = PCA(n_components=2) # project each row of the distance matrix (one value per question) down to 2 dimensions

principalComponents = pca.fit_transform(cosine_distance)

p_df = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2'])

p_df.shape
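
To make the `1 - cosine_similarity(...)` step concrete, here is a small sketch on three made-up term-weight vectors. The vectors are assumptions for illustration only; they show that cosine distance ignores vector length and depends only on direction.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical term-weight vectors for three documents
toy_vectors = np.array([
    [1.0, 1.0, 0.0],   # doc A
    [2.0, 2.0, 0.0],   # doc B: same direction as A, different length
    [0.0, 0.0, 3.0],   # doc C: shares no terms with A or B
])

# Distance is near 0 for A vs B (same direction) and 1 for A vs C (no overlap)
toy_distance = 1 - cosine_similarity(toy_vectors)
print(np.round(toy_distance, 3))
```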

In [None]:
# Explained variance ratio

print('Explained variance per principal component: {}'.format(pca.explained_variance_ratio_))

In [None]:
# Plot the first two principal components of each point to learn about the data:

# Set the figure size through pyplot's rcParams (pylab is not needed for this)
plt.rcParams['figure.figsize'] = (17, 9)

plt.scatter(principalComponents[:, 0], principalComponents[:, 1], s= 5, c=clusters, cmap='Spectral')

plt.gca().set_aspect('equal', 'datalim')

plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))

plt.title('Visualizing the clusters', fontsize=25);

plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')

### The plot above shows clusters of documents (questions) that lie close to each other (mainly the purple ones), as well as questions that fall in the same cluster but are far apart in terms of cosine distance.

### The other two clustering techniques and embeddings are to follow. Meanwhile, share your thoughts, and hey, upvote? :)