<h4> Problem </h4>
<p>
When you search for a dataset with a query on the system/platform, you're shown a list of related dataset search queries. How would you build a system/platform that also generates a list of related searches for each query?
</p>
    
<h4> Objectives </h4>
<p>
The objective is to build a model that, for a given search query, returns a list of similar or semantically related queries.
</p>

<h5> Data Collection </h5>
<p>
I chose the Stack Overflow dataset for this project. Stack Overflow offers a very large, high-quality, crowd-sourced collection of linked questions, which makes it well suited to evaluating a model that learns to retrieve similar questions. The Stack Overflow archive contains all of the site's data; of that, only posts.xml is downloaded for this analysis, restricted to posts tagged Python or Java.

</p>

<p>The query below joins two tables (stackoverflow.posts_questions and stackoverflow.posts_answers) and collects the required fields. The result is limited to 1,000,000 rows.</p>


In [None]:
import pandas as pd
import os

# Join each question with its answers. Note that LIKE '%java%' is a substring
# match, so it also selects questions tagged only 'javascript'.
query = """
SELECT q.id, q.title, q.body, q.tags, a.body AS answers, a.score
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
  ON q.id = a.parent_id
WHERE q.tags LIKE '%python%' OR q.tags LIKE '%java%'
LIMIT 1000000
"""

df = pd.read_gbq(query, project_id='coherent-span-229615', dialect='standard')
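
One thing to watch in the query above: `LIKE '%java%'` is a substring match, so it also selects questions whose only matching tag is `javascript`, which is why JavaScript rows appear in the outputs further down. Here is a minimal sketch of an exact tag match, assuming the tags are stored as a pipe-delimited string as in the `tags` column. The helper name `has_tag` is hypothetical and not part of the notebook.

```python
def has_tag(tags, wanted):
    """Return True if any tag in the pipe-delimited string is in `wanted`."""
    return any(tag in wanted for tag in tags.split('|'))

wanted = {'python', 'java'}

# 'javascript' contains the substring 'java' but is a different tag.
print(has_tag('javascript|jquery', wanted))
print(has_tag('java|svn|maven-2', wanted))
```

A filter like this could be applied to the downloaded frame if only Python and Java questions are wanted.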

In [None]:
os.getcwd()

'/content'

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df.to_csv(os.getcwd()+"/drive/My Drive"+"/Posts_x.csv")

In [None]:
import pandas as pd
import os
df=pd.read_csv(os.getcwd()+"/drive/My Drive"+"/Posts_x.csv")

There are no missing values detected in any of the columns

In [None]:
df.isna().sum()

id         0
title      0
body       0
tags       0
answers    0
score      0
dtype: int64

In [None]:
# # with pd.option_context('display.max_colwidth', None):
# #     print(df1[df1.id==12671371])
# df1

#del df['Unnamed: 0']
df.head()

Unnamed: 0,id,title,body,tags,answers,score
0,41495471,Why did my ChildNodes return undefined on JS e...,<p>I am adding click handler into one of table...,javascript,<p><strong>UPDATE</strong></p>\n\n<p>Updated s...,1
1,41495471,Why did my ChildNodes return undefined on JS e...,<p>I am adding click handler into one of table...,javascript,<p>Duplicate values for the <code>id</code> at...,1
2,55776279,Why is data not received by Node.js through PO...,<p>So I have this functionality.\nThe client i...,javascript|jquery|node.js,<p>Because you are doing the request to the ur...,0
3,55776279,Why is data not received by Node.js through PO...,<p>So I have this functionality.\nThe client i...,javascript|jquery|node.js,<p>your routing specifies a localhost:8080/res...,0
4,55776279,Why is data not received by Node.js through PO...,<p>So I have this functionality.\nThe client i...,javascript|jquery|node.js,<p>You are missing leading slash in<br>\n<code...,0


In the dataset, each row pairs a question with one of its answers, so a question with several answers appears in several rows. I therefore merged these repeated rows into a single row per question, summing the answer scores and concatenating the answer texts.
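
To make the merge concrete, here is a minimal pure-Python sketch on made-up rows. It only illustrates the logic (join the answer texts, sum the scores) and is not the notebook's pandas code. The sample values are invented.

```python
from collections import defaultdict

# Toy (question id, answer body, answer score) rows; the values are made up.
rows = [
    (1, "use a list", 3),
    (1, "use a set", 5),
    (2, "call str()", 1),
]

answers = defaultdict(list)
scores = defaultdict(int)
for qid, body, score in rows:
    answers[qid].append(body)   # collect every answer for the question
    scores[qid] += score        # add up the answer scores

# One entry per question: (all answers joined, total score)
merged = {qid: (" \n ".join(answers[qid]), scores[qid]) for qid in answers}
print(merged)
```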



In [None]:
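# Text columns have dtype 'object', so a check against 'str' never matches;
# name the aggregation for each column explicitly instead.
df = df.groupby(['id', 'title', 'body', 'tags'], as_index=False).agg(
    answers=('answers', ' \n '.join),   # concatenate all answers of a question
    score=('score', 'sum'),             # sum the scores of its answers
)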

In [None]:
df.shape

(55139, 6)

Now that the raw data is prepared, the next step is to clean and process it. The basic text processing consists of the following steps:
1. Convert to lower case.
2. Remove programming code from the questions and answers.
3. Remove HTML tags.
4. Remove punctuation and escape characters.
5. Remove external links.
6. Expand contractions.
7. Remove stopwords.

Finally, I also normalized the numeric 'score' column. I did not apply stemming or lemmatization, because altering the domain-specific terms in our corpus risks losing valuable information.
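
The score scaling used later in the notebook is mean normalization, (x - mean) / (max - min). Below is a minimal pure-Python sketch of that formula and its inverse, using made-up scores. The function names are hypothetical, and the sketch assumes the values are not all equal (otherwise the range is zero).

```python
def mean_normalize(values):
    """Scale values to (x - mean) / (max - min); assumes max != min."""
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    return [(v - mean) / spread for v in values], mean, spread

def denormalize(scaled, mean, spread):
    """Invert mean normalization; needs the original mean and range."""
    return [s * spread + mean for s in scaled]

scores = [0, 9, 39, 2442]   # made-up raw scores
scaled, mean, spread = mean_normalize(scores)
restored = denormalize(scaled, mean, spread)
print(scaled)
print(restored)
```

Note that the inverse needs the mean and range of the original scores, so those have to be saved before the column is overwritten.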


In [None]:
import re
from bs4 import BeautifulSoup

def removeCode(text):
    soup = BeautifulSoup(text, 'lxml')
    for code in soup("code"):
        code.decompose()    
    
    return soup.get_text()

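def removeHtmlTags(text):
    # Lower-case the text, then strip any remaining HTML tags.
    text = str(text).lower()
    return re.sub(r'<[^<]+?>', ' ', text)

def removeLinks(text):
    # Replace http/https URLs with a space (raw string avoids invalid-escape warnings).
    return re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' ', str(text))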

def cleanEscseq(text):
  
    text=re.sub('\n', ' ', str(text))
    text=re.sub('\r', ' ', str(text))
    return text

def decontracted(text):
    # specific
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can\'t", "can not", text)

    # general
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'m", " am", text)
    return text

def cleanpunc(text):
    # Remove punctuation and special characters. Each character to strip is
    # listed once in a single character class: inside [...] a '|' is a literal
    # character, not an alternation, and '[', ']', '-' and '\' must be escaped.
    return re.sub(r'[?!"#:=+_{}\[\]\-$%^&.,()\\/|~`><*@;→]', '', text)

def normalize(text):
    text=removeCode(text)
    text=removeHtmlTags(text)
    text=removeLinks(text)
    text=cleanEscseq(text)
    text=decontracted(text)
    text=cleanpunc(text)
    return text



df['body_clean']=df.body.apply(normalize)
df['answers_clean']=df.answers.apply(normalize)
df['title']=df.title.apply(normalize)

# Collect the URLs found in the raw question and answer HTML.
URL_PATTERN = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
df['urls_body_text'] = df.body.apply(lambda x: URL_PATTERN.findall(str(x)))
df['urls_answers_text'] = df.answers.apply(lambda x: URL_PATTERN.findall(str(x)))
df

Unnamed: 0.1,id,title,body,tags,Unnamed: 0,answers,score,body_clean,answers_clean,urls_body_text,urls_answers_text
0,5966,best way to abstract seasonshowepisode data,"<p>Basically, I've written an API to www.thetv...",python|data-structures,424410,<p>I don't get this part here:</p>\n\n<blockqu...,9,basically i have written an api to wwwthetvdbc...,i do not get this part here this worked okay ...,[http://github.com/dbr/tvdb_api/tree/master/tv...,[http://docs.python.org/lib/module-sqlite3.html]
1,15139,building standalone applications in javascript,<p>With the increased power of JavaScript fram...,javascript|deployment|web-applications|browser,52661,"<p>I'm with ScottKoon here, Adobe AIR is great...",39,with the increased power of javascript framewo...,i am with scottkoon here adobe air is great i...,[],"[http://downloadify.info, http://github.com/dc..."
2,20003,repository layout for large maven projects,<p>I have a large application (~50 modules) us...,java|svn|maven-2,6457,<p>I think you're better off flattening your d...,18,i have a large application 50 modules using a ...,i think you are better off flattening your dir...,[],[https://stackoverflow.com/questions/16829/str...
3,37628,what is reflection and why is it useful,"<p>What is reflection, and why is it useful?</...",java|reflection|terminology,592914,<p><code>Reflection</code> has many <strong>us...,2442,what is reflection and why is it useful i am p...,has many uses the one i am more familiar with...,[],[http://en.wikipedia.org/wiki/Java_API_for_XML...
4,48239,getting the id of the element that fired an event,<p>Is there any way to get the ID of the eleme...,javascript|jquery,1109010,"<p>In the case of delegated event handlers, wh...",1805,is there any way to get the id of the element ...,in the case of delegated event handlers where ...,[],[http://api.jquery.com/category/events/event-o...
...,...,...,...,...,...,...,...,...,...,...,...
55134,62108196,search the menu and navigate to that searched ...,<p>I'm designing a website which has around 80...,javascript|java|jquery|bootstrap-4|searchbar,90034,<p>Assuming that you need help with a search f...,0,i am designing a website which has around 80 m...,assuming that you need help with a search func...,[],[]
55135,62109049,when add data into access database show the er...,"<p>Exception in thread ""AWT-EventQueue-0"" java...",java|access,89563,<p>This line seems to be an offender:</p>\n\n<...,0,exception in thread awteventqueue0 javalangcla...,this line seems to be an offender why are you...,[],[]
55136,62109497,how to add one to a cookie,"<p>My goal is to add 1 to a cookie, once a but...",javascript|jquery|cookies,24226,<p>Change the line</p>\n\n<pre><code> var coun...,0,my goal is to add 1 to a cookie once a button ...,change the line to the parseint function par...,[],[https://developer.mozilla.org/en-US/docs/Web/...
55137,62109637,how to pull response from a url and than put s...,"<p>I need to send a request to <a href=""https:...",javascript|authentication|minecraft,49504,<pre><code>const postData = async _ =&gt; {\n ...,0,i need to send a request to and get the resp...,this should get the response you need judging...,"[https://authserver.mojang.com/authenticate, h...","[https://wiki.vg/Authentication, https://wiki...."


In [None]:
df['Total_score']=df.score

In [None]:
df.to_csv('JusttoMerge.csv')

In [None]:
df.score = (df.score - df.score.mean()) / (df.score.max() - df.score.min())


In [None]:
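# Sanity check: invert the mean normalization using the raw scores kept in
# Total_score; the result should match the original values.
restored = df.score * (df.Total_score.max() - df.Total_score.min()) + df.Total_score.mean()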


In [None]:
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

def preprocess(text):
    # Tokenize, then drop stopwords and tokens of three characters or fewer.
    # Note: the length filter also removes short technical terms such as 'api'.
    return [token for token in simple_preprocess(text)
            if token not in STOPWORDS and len(token) > 3]

def preprocess_text(text):
    return ' '.join(preprocess(text))

# Remove stopwords and short tokens from the cleaned body, answers, and title
df['body_clean'] = df['body_clean'].apply(preprocess_text)
df['answers_clean'] = df['answers_clean'].apply(preprocess_text)
df['title'] = df['title'].apply(preprocess_text)

df

Unnamed: 0,id,title,body,tags,answers,score,body_clean,answers_clean,urls_body_text,urls_answers_text
0,5966,best abstract data,"<p>Basically, I've written an API to www.thetv...",python|data-structures,<p>I don't get this part here:</p>\n\n<blockqu...,0.000557,basically written wwwthetvdbcom python current...,worked okay easy checking supposed exist raise...,[http://github.com/dbr/tvdb_api/tree/master/tv...,[http://docs.python.org/lib/module-sqlite3.html]
1,15139,building standalone applications javascript,<p>With the increased power of JavaScript fram...,javascript|deployment|web-applications|browser,"<p>I'm with ScottKoon here, Adobe AIR is great...",0.005081,increased power javascript frameworks like jqu...,scottkoon adobe great nice imho widget jquery ...,[],"[http://downloadify.info, http://github.com/dc..."
2,20003,repository layout large maven projects,<p>I have a large application (~50 modules) us...,java|svn|maven-2,<p>I think you're better off flattening your d...,0.001914,large application modules structure similar fo...,think better flattening directory structure wa...,[],[https://stackoverflow.com/questions/16829/str...
3,37628,reflection useful,"<p>What is reflection, and why is it useful?</...",java|reflection|terminology,<p><code>Reflection</code> has many <strong>us...,0.367415,reflection useful particularly interested java...,uses familiar able create code dynamic classes...,[],[http://en.wikipedia.org/wiki/Java_API_for_XML...
4,48239,getting element fired event,<p>Is there any way to get the ID of the eleme...,javascript|jquery,"<p>In the case of delegated event handlers, wh...",0.271365,element fires event thinking like course conta...,case delegated event handlers like code like l...,[],[http://api.jquery.com/category/events/event-o...
...,...,...,...,...,...,...,...,...,...,...
55134,62108196,search menu navigate searched menu menus,<p>I'm designing a website which has around 80...,javascript|java|jquery|bootstrap-4|searchbar,<p>Assuming that you need help with a search f...,-0.000800,designing website menus live search search men...,assuming need help search function easiest thi...,[],[]
55135,62109049,data access database error,"<p>Exception in thread ""AWT-EventQueue-0"" java...",java|access,<p>This line seems to be an offender:</p>\n\n<...,-0.000800,exception thread awteventqueue class cast clas...,line offender casting need casting look import...,[],[]
55136,62109497,cookie,"<p>My goal is to add 1 to a cookie, once a but...",javascript|jquery|cookies,<p>Change the line</p>\n\n<pre><code> var coun...,-0.000800,goal cookie button clicked code looks like bas...,change line parseint function parses string ar...,[],[https://developer.mozilla.org/en-US/docs/Web/...
55137,62109637,pull response said response json,"<p>I need to send a request to <a href=""https:...",javascript|authentication|minecraft,<pre><code>const postData = async _ =&gt; {\n ...,-0.000800,need send request response dont know server ne...,response need judging payload link,"[https://authserver.mojang.com/authenticate, h...","[https://wiki.vg/Authentication, https://wiki...."


In [None]:
df.head()

Unnamed: 0,id,title,body,tags,answers,score,body_clean,answers_clean,urls_body_text,urls_answers_text
0,382,meaning type safety warning certain java gener...,<p>What is the meaning of the <em>Java warning...,java|generics|warnings|casting|type-safety,<p>This warning is there because Java is not a...,0.004066,meaning java warning type safety cast object l...,warning java actually storing type information...,[],[]
1,742,class views django,"<p><a href=""http://www.djangoproject.com/"" rel...",python|django|views|oop,<p>Sounds to me like you're trying to combine ...,0.006159,django view points function problem want chang...,sounds like trying combine things combined nee...,[http://www.djangoproject.com/],[http://code.djangoproject.com/browser/django/...
2,845,detect defined font page,<p>Suppose I have the following CSS rule in my...,javascript|html|css|fonts,<p>A simplified form is:</p>\n\n<pre><code>fun...,0.01353,suppose following rule page detect defined fon...,simplified form need complete check outcalibri...,[],[http://www.lalit.org/lab/javascript-css-font-...
3,1873,triple quotes delimit databound javascript str...,<p>How do I delimit a Javascript data-bound st...,asp.net|javascript|anchor|quotes,<p>Passing variable to function without single...,0.003338,delimit javascript databound string parameter ...,passing variable function single quote double ...,[],[]
4,2933,create crossplatform python,<p>Python works on multiple platforms and can ...,python|user-interface|deployment|tkinter|relea...,<p>Another system (not mentioned in the accept...,0.03464,python works multiple platforms desktop applic...,mentioned accepted answer pyinstaller worked p...,[],"[http://www.pyinstaller.org/, http://www.pyins..."


## Feature Extraction

I constructed a new feature column called 'Corpus', which combines the title, the question body, and all the answers. I also constructed two sentiment features, polarity and subjectivity, using the open-source TextBlob library.
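
As a minimal sketch of how the 'Corpus' field can be assembled for one question, assuming the three cleaned fields are plain strings: `build_corpus` is a hypothetical helper and the sample strings are shortened for illustration. Joining with an explicit separator keeps the last word of one field from fusing with the first word of the next.

```python
def build_corpus(title, question, answers):
    """Join the non-empty fields with single spaces."""
    parts = [title, question, answers]
    return ' '.join(p.strip() for p in parts if p and p.strip())

print(build_corpus('reflection useful', 'reflection java', 'uses familiar'))
print(build_corpus('cookie', '', 'change line'))
```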


In [None]:
from bs4 import BeautifulSoup
from textblob import TextBlob
id_list=[]
title_list = [] 
question_list = []
questionClean_list = []
answers_list = []
answersClean_list = []
urls_body_list=[]
urls_answers_list=[]
sentiment_polarity_list = []
sentiment_subjectivity_list = []
score_list =[]
tag_list = []
corpus = []

for i, row in df.iterrows():
    id_list.append(row.id)
    title_list.append(row.title) 
    question_list.append(row.body)
    answers_list.append(row.answers)
    questionClean_list.append(row.body_clean)
    answersClean_list.append(row.answers_clean)   # cleaned answers, not the question body
    tag_list.append(row.tags)  
    score_list.append(row.score)       # Append votes
    urls_body_list.append(row.urls_body_text)
    urls_answers_list.append(row.urls_answers_text)


    
    # Combine the title, cleaned question body, and cleaned answers into the corpus,
    # separating each part with a space so adjacent words do not fuse.
    corpus.append(title_list[-1] + ' ' + questionClean_list[-1] + ' ' + answersClean_list[-1])
    
    sentiment = TextBlob(row.answers_clean).sentiment
    sentiment_polarity_list.append(sentiment.polarity)
    sentiment_subjectivity_list.append(sentiment.subjectivity)

preprocessed_df = pd.DataFrame({'Id':id_list, 'Title': title_list, 'Corpus': corpus, 'Original_question': question_list,'Original_answer': answers_list, 'Questions_cleaned': questionClean_list,'Answers_cleaned': answersClean_list,  'Tags': tag_list, 'Total_scores':score_list,'URL_list_questions': urls_body_list,'URL_list_ans': urls_answers_list, 'sentiment_polarity': sentiment_polarity_list, 'sentiment_subjectivity':sentiment_subjectivity_list})
preprocessed_df.head()

Unnamed: 0,Id,Title,Corpus,Original_question,Original_answer,Questions_cleaned,Answers_cleaned,Tags,Total_scores,URL_list_questions,URL_list_ans,sentiment_polarity,sentiment_subjectivity
0,5966,best abstract data,best abstract data basically written wwwthetvd...,"<p>Basically, I've written an API to www.thetv...",<p>I don't get this part here:</p>\n\n<blockqu...,basically written wwwthetvdbcom python current...,basically written wwwthetvdbcom python current...,python|data-structures,0.000557,[http://github.com/dbr/tvdb_api/tree/master/tv...,[http://docs.python.org/lib/module-sqlite3.html],0.234314,0.598529
1,15139,building standalone applications javascript,building standalone applications javascript in...,<p>With the increased power of JavaScript fram...,"<p>I'm with ScottKoon here, Adobe AIR is great...",increased power javascript frameworks like jqu...,increased power javascript frameworks like jqu...,javascript|deployment|web-applications|browser,0.005081,[],"[http://downloadify.info, http://github.com/dc...",0.296179,0.614218
2,20003,repository layout large maven projects,repository layout large maven projects large a...,<p>I have a large application (~50 modules) us...,<p>I think you're better off flattening your d...,large application modules structure similar fo...,large application modules structure similar fo...,java|svn|maven-2,0.001914,[],[https://stackoverflow.com/questions/16829/str...,0.198214,0.525595
3,37628,reflection useful,reflection useful reflection useful particular...,"<p>What is reflection, and why is it useful?</...",<p><code>Reflection</code> has many <strong>us...,reflection useful particularly interested java...,reflection useful particularly interested java...,java|reflection|terminology,0.367415,[],[http://en.wikipedia.org/wiki/Java_API_for_XML...,0.109639,0.486383
4,48239,getting element fired event,getting element fired event element fires even...,<p>Is there any way to get the ID of the eleme...,"<p>In the case of delegated event handlers, wh...",element fires event thinking like course conta...,element fires event thinking like course conta...,javascript|jquery,0.271365,[],[http://api.jquery.com/category/events/event-o...,0.043864,0.487912


In [None]:
# with pd.option_context('display.max_colwidth', None):
#     print(content_token_df)
preprocessed_df.to_csv(os.getcwd()+"/drive/My Drive/Colab Notebooks/File/"+"preprocessed_df.csv")
preprocessed_df.head()

