**Search questions similar to a query based on their cosine similarity value**

In [1]:
import pandas as pd
import numpy as np
import pickle
from sklearn.metrics.pairwise import cosine_similarity

In [6]:
question_data = pd.read_pickle("/content/drive/MyDrive/StackOverflow_CaseStudy/Preprocessed_data/Preprocessed_selected_data.pkl")

In [7]:
question_data.shape

(296099, 10)

In [2]:
#load the fitted tf-idf vectorizer model
with open("/content/drive/MyDrive/StackOverflow_CaseStudy/Saved_Model/tfidf_model_ques.pickle", "rb") as f:
  questions_tfidf = pickle.load(f)

# build a dictionary with each word as the key and its idf as the value
idf_dict = dict(zip(questions_tfidf.get_feature_names_out(), list(questions_tfidf.idf_)))
tfidf_vocab = set(questions_tfidf.get_feature_names_out())

Function to get k-most similar questions to the query string based on cosine similarity

In [3]:
word_embedding_size = 128
def GetSimilarQuestions(query, no_sim_ques, word_dict, tfidf_wtd_ques_vec_embeddings):
  ''' This function returns the row positions of the k (no_sim_ques) questions most similar to the searched query, ranked by cosine similarity'''

  tokens = query.split()
  #keep only the query words present in both the model vocabulary and the tf-idf vectorizer vocabulary
  valid_words = [w for w in tokens if w in word_dict and w in tfidf_vocab]
  #no usable word: return an empty result instead of failing on an undefined query vector
  if len(valid_words) == 0:
    return pd.Index([], dtype='int64')

  #convert the retained query words to vectors
  query_vector = np.array([word_dict[w] for w in valid_words])
  #get the tf-idf value of each retained word; tf counts whole tokens, not substrings of the query
  query_tf_idf = np.array([idf_dict[w]*(tokens.count(w)/len(tokens)) for w in valid_words])
  #get tf-idf weighted vector embedding for query string
  query_tf_idf_vec = np.sum(query_vector*query_tf_idf[:,None], axis=0)
  weight_sum = np.sum(query_tf_idf)
  #cosine similarity ignores scale, so the unnormalised vector is a safe fallback when the weights sum to zero
  query_tfidf_wtd_vec = query_tf_idf_vec/weight_sum if weight_sum != 0 else query_tf_idf_vec

  #get cosine similarity value of each question w.r.t. query string
  cos_sim = pd.Series(cosine_similarity(query_tfidf_wtd_vec.reshape(1, -1), tfidf_wtd_ques_vec_embeddings)[0])
  #get the n largest values, i.e. the k (no_sim_ques) questions most similar to the query string
  sim_questions = cos_sim.nlargest(no_sim_ques).index
  return sim_questions
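
To see what the function above computes, here is a minimal sketch on toy data using NumPy only. The 3-dimensional word vectors, the idf values, and the names `toy_word_vectors`, `toy_idf`, `toy_query_embedding`, `toy_top_k` and `toy_questions` are all hypothetical and not taken from the saved models. Term frequency is counted over whole tokens here, which is an assumption of this sketch.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors and idf values, for illustration only.
toy_word_vectors = {
    "regex": np.array([1.0, 0.0, 0.0]),
    "numbers": np.array([0.0, 1.0, 0.0]),
    "background": np.array([0.0, 0.0, 1.0]),
}
toy_idf = {"regex": 2.0, "numbers": 1.0, "background": 3.0}

def toy_query_embedding(query):
    """tf-idf weighted average of the vectors of the known query words."""
    tokens = query.split()
    known = [w for w in tokens if w in toy_word_vectors and w in toy_idf]
    if not known:
        return None  # nothing in the query is in the vocabulary
    vectors = np.array([toy_word_vectors[w] for w in known])
    weights = np.array([toy_idf[w] * tokens.count(w) / len(tokens) for w in known])
    return (vectors * weights[:, None]).sum(axis=0) / weights.sum()

def toy_top_k(query_vec, question_matrix, k):
    """Row positions of the k rows with the highest cosine similarity."""
    norms = np.linalg.norm(question_matrix, axis=1) * np.linalg.norm(query_vec)
    # Guard against division by zero for all-zero rows.
    sims = question_matrix @ query_vec / np.where(norms == 0, 1.0, norms)
    return np.argsort(-sims, kind="stable")[:k]

toy_questions = np.array([
    [0.9, 0.1, 0.0],   # mostly about "regex"
    [0.0, 0.0, 1.0],   # about "background"
    [0.5, 0.5, 0.0],   # "regex" and "numbers" in equal parts
])
query_vec = toy_query_embedding("regex to remove numbers")
top = toy_top_k(query_vec, toy_questions, k=2)
print(query_vec)  # weighted average of the "regex" and "numbers" vectors
print(top)        # row positions, best match first
```

Words outside the vocabulary ("to", "remove") contribute no vector, but they still count toward the query length used in the term frequency.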

**Word2Vec model based word-vector dictionary and Question Embeddings**

In [4]:
word2vec_vectors = np.array(pd.read_csv('/content/drive/MyDrive/StackOverflow_CaseStudy/Saved_Model/vectors.tsv',
                           sep = '\t', header=None))

word2vec_vocab = pd.read_csv('/content/drive/MyDrive/StackOverflow_CaseStudy/Saved_Model/metadata.tsv',
                           sep = '\t', header=None)

word2vec_vocab = word2vec_vocab[0].values

word2vec_dict = dict(zip(word2vec_vocab, word2vec_vectors))
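
One thing worth checking when a vocabulary file is read with `pd.read_csv`: by default pandas parses tokens such as `null` or `NA` as missing values, so they would not survive as dictionary keys. The sketch below shows the effect on a small in-memory file. The content of `toy_metadata` is hypothetical, and whether the real `metadata.tsv` contains such tokens is not known.

```python
import csv
import io
import pandas as pd

# Hypothetical vocabulary file content, one token per line, for illustration only.
toy_metadata = "string\nnull\nNA\narray\n"

# Default parsing: "null" and "NA" are converted to NaN.
default_read = pd.read_csv(io.StringIO(toy_metadata), sep="\t", header=None)

# Safer parsing for a token list: keep every token as a literal string
# and do not treat quote characters specially.
safe_read = pd.read_csv(io.StringIO(toy_metadata), sep="\t", header=None,
                        keep_default_na=False, dtype=str,
                        quoting=csv.QUOTE_NONE)

print(default_read[0].isna().sum())  # number of tokens lost to NaN
print(safe_read[0].tolist())         # all four tokens kept as strings
```

Comparing `len(word2vec_vocab)` with `word2vec_vectors.shape[0]` before zipping them is another cheap sanity check, since `zip` silently truncates to the shorter input.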

In [5]:
#get Word2Vec model based tf-idf weighted Question Embeddings
tfidf_weighted_word2vec_ques_embeddings = np.load('/content/drive/MyDrive/StackOverflow_CaseStudy/Saved_Model/tfIdf_Wtd_W2V_QuestionEmbeddings.npy')


**LSTM model based word-vector dictionary and Question Embeddings**

In [3]:
with open("/content/drive/MyDrive/StackOverflow_CaseStudy/Model2_data/Best_LSTM_Model_Vocab_Vector_dict.pkl", "rb") as f:
  lstm_vocab_vector_dict = pickle.load(f)

In [4]:
lstm_embedded_questions = pd.read_pickle("/content/drive/MyDrive/StackOverflow_CaseStudy/Questions_lstm_embeddings_dataset.pkl")

In [7]:
lstm_embedded_questions.columns

Index(['Id', 'Title', 'Ques_Text', 'tfidf_wtd_lstm_embed_questions'], dtype='object')

In [61]:
lstm_tfidf_wtd_ques_vec_embeddings = np.array(lstm_embedded_questions['tfidf_wtd_lstm_embed_questions'].tolist())

In [62]:
lstm_tfidf_wtd_ques_vec_embeddings.shape

(296099, 128)

**Experimental searches using both Word2Vec and LSTM model based embeddings**

In [66]:
#LSTM based
no_of_sim_ques=10
query = 'regex to remove numbers'
sim_ques_to_query = GetSimilarQuestions(query.lower(), no_of_sim_ques, lstm_vocab_vector_dict, lstm_tfidf_wtd_ques_vec_embeddings)
lstm_embedded_questions.iloc[sim_ques_to_query][['Id', 'Title']]

(128,)


Unnamed: 0,Id,Title
690480,25155970,Validating UK phone number (Regex C#)
190257,7432680,European Numbers RegEx
556404,20346740,Words regex in java
1019182,36991770,Create phone number regex
1081501,39330110,how to test regex in javascript in if statement
834367,30323610,Check if a textbox contains numbers only
686255,25004320,Trying to pull out certain numbers from a string
669910,24416660,Regex to match exactly numbers of chars
1002637,36384860,Capturing multiple instances of the same group...
467755,17160420,Splitting array within array


In [9]:
#Word2Vec based
no_of_sim_ques=10
query = 'regex to remove numbers'
sim_ques_to_query = GetSimilarQuestions(query.lower(), no_of_sim_ques, word2vec_dict, tfidf_weighted_word2vec_ques_embeddings)
question_data.iloc[sim_ques_to_query][['Id','Title']]

Unnamed: 0,Id,Title
690480,25155970,Validating UK phone number (Regex C#)
785281,28578280,Javascript regex to allow negative numbers
834367,30323610,Check if a textbox contains numbers only
1019182,36991770,Create phone number regex
556404,20346740,Words regex in java
701648,25553470,What's a regex that matches all numbers except...
1084577,39449720,javascript converts regex pattern
964078,34976310,Regex that accepts number less than a maxium size
295783,11088530,Using Java regex to validate date from a long ...
966763,35072940,Need help on Regex return values
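
One simple way to quantify how much the two models agree is the overlap between their result sets. The sketch below applies it to the row labels shown in the two result tables above for 'regex to remove numbers'. The helper `result_overlap` is hypothetical and not part of the notebook, and Jaccard similarity is only one possible choice of agreement measure.

```python
def result_overlap(first, second):
    """Return (items shared by both lists, in first-list order; Jaccard similarity)."""
    second_set = set(second)
    shared = [item for item in first if item in second_set]
    union = set(first) | second_set
    jaccard = len(set(shared)) / len(union) if union else 0.0
    return shared, jaccard

# Row labels copied from the LSTM and Word2Vec result tables for this query.
lstm_hits = [690480, 190257, 556404, 1019182, 1081501,
             834367, 686255, 669910, 1002637, 467755]
word2vec_hits = [690480, 785281, 834367, 1019182, 556404,
                 701648, 1084577, 964078, 295783, 966763]

shared, jaccard = result_overlap(lstm_hits, word2vec_hits)
print(shared)   # [690480, 556404, 1019182, 834367]
print(jaccard)  # 0.25
```

For this query the two models share 4 of their top 10 results. The overlap says nothing about which model's remaining results are more relevant.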


In [67]:
#LSTM based
no_of_sim_ques=10
query = 'run program in background'
sim_ques_to_query = GetSimilarQuestions(query.lower(), no_of_sim_ques, lstm_vocab_vector_dict, lstm_tfidf_wtd_ques_vec_embeddings)
lstm_embedded_questions.iloc[sim_ques_to_query][['Id', 'Title']]

(128,)


Unnamed: 0,Id,Title
350286,12987720,Running program in background
1012415,36739780,Chrome extension keydown listener in backgroun...
618981,22593980,change background image every 15 sec
576965,21100930,Navigate to URL in background.html
754478,27466500,Specific Background based on time
147154,5911600,ie7 jQuery.click only registers when target ha...
506522,18543810,How to Inherit grandparent CSS not parent?
646926,23589460,jQuery mobile dynamically changing background ...
920123,33384920,Directly calling background task in UWP app
561292,20524470,How can I insert background image in my java d...


In [12]:
#Word2Vec based
query = 'run program in background'
sim_ques_to_query = GetSimilarQuestions(query.lower(), no_of_sim_ques, word2vec_dict, tfidf_weighted_word2vec_ques_embeddings)
question_data.iloc[sim_ques_to_query][['Title','Body']]

Unnamed: 0,Title,Body
350286,Running program in background,"<p><br/> I've a mini project I have to do in ""..."
623076,Chrome Extension js: Sharing functions between...,<p>Suppose I have a JavaScript function <code>...
576965,Navigate to URL in background.html,<p>I'm attempting to program a chrome extensio...
611273,How to get background page from a window opene...,<p>The app opens a popup from the background p...
516219,"How to run javascript function in ""background""...",<p>I've done an HTML form which has a lot of q...
646926,jQuery mobile dynamically changing background ...,<p>I am trying to set the background image pag...
197388,I need to have an array of backgrounds?,"<p>In my side-scroller, I want to have 3 backg..."
471422,Not able to put a break point for a function b...,<p>I am not able to put a break point in a fun...
511450,Create a background task in IntelliJ plugin,<p>I'm developing an IntelliJ-idea plugin and ...
772910,Upload div background url,<p>Currently i am working on faking a social m...


In [73]:
#LSTM based
no_of_sim_ques=10
query = 'spell check function in java'
sim_ques_to_query = GetSimilarQuestions(query.lower(), no_of_sim_ques, lstm_vocab_vector_dict, lstm_tfidf_wtd_ques_vec_embeddings)
lstm_embedded_questions.iloc[sim_ques_to_query][['Id', 'Title']]

(128,)


Unnamed: 0,Id,Title
512844,18775170,Solr All checkers need to use the same Analyzer
485179,17780020,Google Spell check URI not working
256052,9727210,How can I use a timer to delay a specific task...
733812,26714420,user spell checking on certain text area even ...
126879,5195460,Spell checker for .NET / C#
220408,8483040,Is there a way to access spell checker of the ...
477972,17525490,Spell check merge fields in MS Word programmat...
239,23620,What Javascript rich text editor will not brea...
247944,9446160,How to disable javadoc spell checker in NetBeans
851310,30922890,How to enable spell check for words with less ...


In [13]:
#Word2Vec based
query = 'spell check function in java'
sim_ques_to_query = GetSimilarQuestions(query.lower(), no_of_sim_ques, word2vec_dict, tfidf_weighted_word2vec_ques_embeddings)
question_data.iloc[sim_ques_to_query][['Title','Body']]

Unnamed: 0,Title,Body
512844,Solr All checkers need to use the same Analyzer,<p>I'm trying to run the spellcheck on some in...
733812,user spell checking on certain text area even ...,<p>I am using jsf 2.1 I need a plugin or a way...
485179,Google Spell check URI not working,"<p><a href=""http://www.google.com/tbproxy/spel..."
256052,How can I use a timer to delay a specific task...,<p>I am making a text-based RPG for a personal...
220408,Is there a way to access spell checker of the ...,<p>There is a text area on a jsp page and if t...
126879,Spell checker for .NET / C#,<p>Does somebody know a good multilanguage spe...
8482,Looking for Java spell checker library,<p>I am looking for an open source Java spell ...
477972,Spell check merge fields in MS Word programmat...,<p>I have been trying to enable spell check fo...
851310,How to enable spell check for words with less ...,"<p>I have CKEditor 4, using SCAYT spell check...."
239,What Javascript rich text editor will not brea...,"<p>I'm using TinyMCE in an ASP.Net project, an..."


In [75]:
#LSTM based
no_of_sim_ques=15
query = 'local variable'
sim_ques_to_query = GetSimilarQuestions(query.lower(), no_of_sim_ques, lstm_vocab_vector_dict, lstm_tfidf_wtd_ques_vec_embeddings)
lstm_embedded_questions.iloc[sim_ques_to_query][['Id', 'Title']]

(128,)


Unnamed: 0,Id,Title
1063937,38663360,It is possible to get local variable names pro...
310576,11594120,"Unexpected behavior when ""using"" is used"
560668,20500550,Can someone help me understand recursion?
726763,26463480,Why is it not possible to get local variable n...
922899,33484210,Local hidden variable field and null pointer e...
759491,27653930,IFrame does not work for local sources? HTML
162724,6464720,Binding a Java Integer to JavaScriptEngine doe...
111871,4653920,How can a local function variable be accessibl...
852613,30970280,Bytecode instrumentation using ASM 5.0 . injec...
311109,11612280,Why does java treat class scope and method sco...


In [None]:
#Word2Vec based
query = 'local variable'
sim_ques_to_query = GetSimilarQuestions(query.lower(), no_of_sim_ques, word2vec_dict, tfidf_weighted_word2vec_ques_embeddings)
question_data.iloc[sim_ques_to_query][['Title','Body']]

Unnamed: 0,Title,Body
852613,Bytecode instrumentation using ASM 5.0 . injec...,<p>I am doing Java bytecode analyse. I want to...
905338,Android development - variable inside inner class,<p>My second question for android development ...
1063937,It is possible to get local variable names pro...,<p>It is possible to get local variable names ...
217575,Does a thread that Only modifies a local varia...,<p>I have a doubt on the java synchronization ...
310576,"Unexpected behavior when ""using"" is used","<p>Can I use user defined type, say a class in..."
726763,Why is it not possible to get local variable n...,<p>If I have a code like this:</p>\n\n<pre><co...
921806,Local Variable Must be Final Java Error,<p>I want to update a variable every time I cl...
931654,Java: Local variable mi defined in an enclosin...,"<p>I get the error, as in subject, and I kindl..."
922899,Local hidden variable field and null pointer e...,<p>I'm getting a Warning on IDE (JAVA NetBeans...
322688,Value of local variable is not used? Its not s...,<p>Okay so i'm making a simple user input kind...


In [91]:
#LSTM based
no_of_sim_ques=15
query = 'Key Value Pair Data Structure'
sim_ques_to_query = GetSimilarQuestions(query.lower(), no_of_sim_ques, lstm_vocab_vector_dict, lstm_tfidf_wtd_ques_vec_embeddings)
lstm_embedded_questions.iloc[sim_ques_to_query][['Id', 'Title']]

(128,)


Unnamed: 0,Id,Title
388440,14328710,How can i create pairs of objects in java or o...
462703,16979790,checking for matching string
55764,2606530,What happens to my PriorityQueue if my Compara...
871129,31633160,JavaScript - Looping and adding a key value pa...
746523,27178700,A for loop is stuck when trying to get a pair ...
294743,11052020,How to store Key-Value pairs using JDO (datanu...
922492,33469560,Can someone explain this generic method
704113,25643230,Awt Dimension class vs custom Pair class perfo...
642997,23448490,Groovy remove value from collection
532912,19495670,"ListView of Dictionary<string,string> Need to ..."


In [14]:
#Word2Vec based
query = 'Key Value Pair Data Structure'
sim_ques_to_query = GetSimilarQuestions(query.lower(), no_of_sim_ques, word2vec_dict, tfidf_weighted_word2vec_ques_embeddings)
question_data.iloc[sim_ques_to_query][['Title','Body']]

Unnamed: 0,Title,Body
87,Best implementation for Key Value Pair Data St...,<p>So I've been poking around with C# a bit la...
340147,Multi-dimensional data structure in java,<p>I need to create a data structure that is o...
278407,What data structure to use?,<p>I want to store information using data stru...
882328,"Return Dictionary key, based on value match?",<p>I have this <strong>huge</strong> item tabl...
608941,What data structure/algorithm could be used to...,<p>What data structure and/or algorithm could ...
283562,Sorting pairs of chars to make a path/chain of...,<p>i'm making a method for my friend's applica...
184713,Find string in array of objects- javascript or...,<p>I've got a JSON response that looks like th...
591238,c# nested arrays of structures initialization,<p>I am trying to build a structure of structu...
532912,"ListView of Dictionary<string,string> Need to ...",<p>So I have a Dictionarr and have bound this ...
590489,duplicate key value pairs,<p>Is there a native data structure in java th...
