**Download StackSample dataset from Kaggle:**

In [None]:
! pip install -q kaggle

from google.colab import files

files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [None]:
#@title
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
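The three shell commands above can also be written in plain Python, which is handy outside Colab. This is a minimal sketch: the helper name `install_kaggle_token` is hypothetical, and the demo writes a placeholder token into a temporary directory instead of touching the real `~/.kaggle`.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path

def install_kaggle_token(token_path, config_dir):
    """Copy an API token into config_dir and restrict it to the owner."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)   # like `mkdir -p`
    dest = config_dir / "kaggle.json"
    shutil.copyfile(token_path, dest)               # like `cp`
    os.chmod(dest, 0o600)                           # like `chmod 600`
    return dest

# Demo with a placeholder token in a throwaway directory (assumption:
# a POSIX file system, where permission bits are honoured).
with tempfile.TemporaryDirectory() as tmp:
    token = Path(tmp) / "kaggle.json"
    token.write_text('{"username": "example", "key": "placeholder"}')
    dest = install_kaggle_token(token, Path(tmp) / ".kaggle")
    mode = stat.S_IMODE(dest.stat().st_mode)
    print(dest.name, oct(mode))
```

In a real session, `config_dir` would be `Path.home() / ".kaggle"`.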

In [None]:
#@title
!kaggle datasets download -d stackoverflow/stacksample

Downloading stacksample.zip to /content
 99% 1.10G/1.11G [00:12<00:00, 101MB/s] 
100% 1.11G/1.11G [00:12<00:00, 94.2MB/s]


In [None]:
!unzip stacksample.zip

Archive:  stacksample.zip
  inflating: Answers.csv             
  inflating: Questions.csv           
  inflating: Tags.csv                


In [None]:
from collections import Counter
import pandas as pd
import numpy as np
from datetime import datetime
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from tqdm.notebook import tqdm
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import (TextVectorization, Dense, Input, Activation, Embedding,
                                     Conv1D, concatenate, MaxPool1D, Flatten, Dropout)
from tensorflow.keras.models import Model
import tensorflow.keras.initializers
from sklearn.metrics import f1_score
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

In [None]:
#@title Read Data
questions = pd.read_csv("Questions.csv", encoding='latin1')
answers = pd.read_csv("Answers.csv", encoding='latin1')
tags = pd.read_csv("Tags.csv", encoding='latin1')
print(questions.shape)
print(answers.shape)
print(tags.shape)


(1264216, 7)
(2014516, 6)
(3750994, 2)


In [None]:
questions.columns

Index(['Id', 'OwnerUserId', 'CreationDate', 'ClosedDate', 'Score', 'Title',
       'Body'],
      dtype='object')

In [None]:
answers.columns

Index(['Id', 'OwnerUserId', 'CreationDate', 'ParentId', 'Score', 'Body'], dtype='object')

In [None]:
tags.columns

Index(['Id', 'Tag'], dtype='object')

**Questions that do not have any answers:**

In [None]:
questions[~questions.Id.isin(answers.ParentId)]

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body
1453,122670,5056.0,2008-09-23T18:04:36Z,2013-12-31T16:35:28Z,46,What is the LINQ way to implode/join a string ...,<p>I have the following string array:</p>\n\n<...
3600,270460,34395.0,2008-11-06T21:53:46Z,2016-08-14T09:56:48Z,5,DTD Validation With Python?,<p>I was wondering which of the built in pytho...
5213,364300,7599.0,2008-12-12T22:01:32Z,,2,Has anyone had success getting PR_DEFAULT_STOR...,<p>The following piece of test code <em>runs</...
5410,375910,4893.0,2008-12-17T20:28:25Z,2011-08-30T13:13:05Z,10,Creating a Temp Dir in Java,<blockquote>\n <p><strong>Possible Duplicate:...
7851,525490,20955.0,2009-02-08T10:15:21Z,2013-12-30T03:04:01Z,1,open file select dialog with javascript,"<p>I have hidden input type=file field, and i ..."
...,...,...,...,...,...,...,...
1264211,40143210,5610777.0,2016-10-19T23:38:01Z,,0,URL routing in PHP (MVC),<p>I am building a custom MVC project and I ha...
1264212,40143300,3791161.0,2016-10-19T23:48:09Z,,0,Bigquery.Jobs.Insert - Resumable Upload?,<p>The API docs show that you should be able t...
1264213,40143340,7028647.0,2016-10-19T23:52:50Z,,1,Obfuscating code in android studio,<p>Under minifyEnabled I changed from false to...
1264214,40143360,871677.0,2016-10-19T23:55:24Z,,0,How to fire function after v-model change?,<p>I have input which I use to filter my array...


These questions are excluded from the rest of the case study.

In [None]:
questions = questions[questions.Id.isin(answers.ParentId)]
questions.shape

(1102568, 7)
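The filter used here is an anti-join (and its complement) built from a boolean mask. A minimal sketch on made-up toy frames, assuming the same `Id` and `ParentId` column names as the dataset:

```python
import pandas as pd

# Toy data: question 120 has no answer.
questions_toy = pd.DataFrame({"Id": [80, 90, 120], "Title": ["a", "b", "c"]})
answers_toy = pd.DataFrame({"Id": [124, 92], "ParentId": [80, 90]})

# True where the question Id appears as some answer's ParentId.
has_answer = questions_toy["Id"].isin(answers_toy["ParentId"])

unanswered = questions_toy[~has_answer]   # anti-join: no matching answer
answered = questions_toy[has_answer]      # semi-join: at least one answer

print(unanswered["Id"].tolist())
print(len(answered))
```

Computing the mask once and reusing it for both views avoids scanning `ParentId` twice on the full data.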

**Number of answers per Question**

In [None]:
ans_per_ques = answers.groupby("ParentId").agg(ans_count=('ParentId', 'count')).sort_values(['ans_count'], ascending=[False])

In [None]:
answers.groupby(['ParentId','Score']).agg(ans_count=('ParentId', 'count')).sort_values(['ans_count'], ascending=[False])

Unnamed: 0_level_0,Unnamed: 1_level_0,ans_count
ParentId,Score,Unnamed: 2_level_1
406760,3,52
38210,1,50
38210,0,44
406760,1,41
23930,1,37
...,...,...
12206630,0,1
12206610,2,1
12206600,261,1
12206600,5,1


We keep at most 4 answers per question, chosen by highest score.

In [None]:
answers_for_w2vmodel = answers.sort_values('Score',ascending = False).groupby('ParentId').head(4)
answers_for_w2vmodel.shape

(1904324, 6)
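The top-k selection works because `groupby(...).head(k)` keeps the first `k` rows of each group in the order of the already sorted frame. A small sketch on made-up data, with `k = 2` instead of 4 to keep it short:

```python
import pandas as pd

# Made-up answers: question 1 has three answers, question 2 has two.
answers_toy = pd.DataFrame({
    "ParentId": [1, 1, 1, 2, 2],
    "Score":    [5, 9, 1, 3, 7],
})

# Sort by score first, then take the first k rows within each question.
k = 2
top_k = (answers_toy.sort_values("Score", ascending=False)
                    .groupby("ParentId")
                    .head(k))

scores_q1 = top_k.loc[top_k["ParentId"] == 1, "Score"].tolist()
scores_q2 = top_k.loc[top_k["ParentId"] == 2, "Score"].tolist()
print(scores_q1, scores_q2)
```

Questions with fewer than `k` answers simply keep all of them, which is why the row count shrinks only slightly.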

In [None]:
answers_for_w2vmodel[answers_for_w2vmodel['ParentId']==406760]

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body
21972,408158,6044.0,2009-01-02T22:13:29Z,406760,877,<p><strong>Programmers who don't code in their...
21804,406775,26507.0,2009-01-02T13:21:32Z,406760,770,"<p><strong>The only ""best practice"" you should..."
21815,406812,4200.0,2009-01-02T13:44:11Z,406760,713,<p><strong>Most comments in code are in fact a...
21956,407985,47544.0,2009-01-02T20:51:20Z,406760,712,"<p><strong>""Googling it"" is okay!</strong></p>..."


In [None]:
#https://stackoverflow.com/questions/47600818/python-pandas-groupby-sum-and-concatenate-strings

answers_for_w2vmodel = answers_for_w2vmodel.groupby(['ParentId'],as_index=False).agg({'Body': ' '.join, 'Score': 'sum'})
answers_for_w2vmodel.rename(columns = {'Score':'SummedScore', 'Body':'AnsBody'}, inplace = True)
answers_for_w2vmodel.head()

Unnamed: 0,ParentId,AnsBody,SummedScore
0,80,<p>I wound up using this. It is a kind of a ha...,19
1,90,<p>My easy click-by-click instructions (<stron...,34
2,120,<p>The Jeff Prosise version from MSDN magazine...,9
3,180,"<p>My first thought on this is ""how generate N...",49
4,260,"<p><a href=""http://www.codeproject.com/Article...",44


**Questions without tags**

In [None]:
questions[~questions.Id.isin(tags.Id)]

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body


There are no questions without tags

**Tags Distribution**

In [None]:
tag_dist = tags.groupby("Tag").agg(tag_count=('Tag', 'count')).sort_values(by='tag_count', ascending=False)

In [None]:
tag_dist

Unnamed: 0_level_0,tag_count
Tag,Unnamed: 1_level_1
javascript,124155
java,115212
c#,101186
php,98808
android,90659
...,...
tbcd,1
evil-dicom,1
evo,1
tbar,1


In [None]:
tags_per_ques = tags.groupby("Id").agg(ques_count=('Id', 'count')).sort_values(by='ques_count', ascending=False)

In [None]:
tags_per_ques

Unnamed: 0_level_0,ques_count
Id,Unnamed: 1_level_1
11053790,5
5221980,5
25573000,5
25573010,5
16468060,5
...,...
8240020,1
11770790,1
21883080,1
3949000,1


The maximum number of tags per question is 5 and the minimum is 1.
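The sorted table only shows the extremes by inspection. A direct check can be sketched with `collections.Counter`; the `(Id, Tag)` pairs below are made up for illustration:

```python
from collections import Counter

# Made-up (question Id, tag) pairs in the same shape as Tags.csv rows.
tag_rows = [
    (1, "java"), (1, "swing"), (1, "awt"),
    (2, "c#"),
    (3, "javascript"), (3, "jquery"),
]

# Count how many tag rows each question Id has.
tags_per_question = Counter(question_id for question_id, _ in tag_rows)

most_tags = max(tags_per_question.values())
fewest_tags = min(tags_per_question.values())
print(most_tags, fewest_tags)
```

On the real frame, `tags_per_ques['ques_count'].agg(['min', 'max'])` would give the same two numbers.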

**Text Preprocessing**

In [None]:
#function to replace html tags with a space
def striphtml(data):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', str(data))
    return cleantext
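To see what this non-greedy pattern does, it helps to run it on a short string. A minimal sketch with made-up HTML; `strip_tags` is a hypothetical stand-in that uses the same pattern:

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')  # same non-greedy pattern as striphtml

def strip_tags(text):
    """Replace every HTML tag with a single space."""
    return TAG_PATTERN.sub(' ', text)

sample = "<p>Use <b>LINQ</b> here</p>"   # made-up input
words = strip_tags(sample).split()
print(words)

# '.' does not match a newline, so a tag that spans two lines is left alone.
multiline = strip_tags("<a\nhref='x'>link</a>")
print(repr(multiline))
```

Each tag becomes a space rather than an empty string, so words on either side of a tag are never glued together.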

In [None]:
#download stop words and keep the negations 'no', 'not' and 'nor' by removing them from the stop-word set
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stop_words.remove('no'); stop_words.remove('not'); stop_words.remove('nor')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
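Keeping negations matters because dropping "not" can flip the meaning of a sentence. The effect can be sketched without NLTK; the stop-word set below is a tiny illustrative subset, not the real NLTK list:

```python
# Tiny illustrative subset (assumption), standing in for
# stopwords.words('english').
base_stop_words = {"a", "is", "the", "this", "no", "not", "nor"}
negations = {"no", "not", "nor"}

# Set difference drops the negations and, unlike set.remove, never raises
# KeyError if a word is missing from the list.
stop_words_toy = base_stop_words - negations

sentence = "this is not a bug"
kept = [word for word in sentence.split() if word not in stop_words_toy]
print(kept)
```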


In [None]:
#function to Preprocess the Body of questions and answers
def ProcessBody(data):
    start = datetime.now()
    preprocessed_data_list=[]
    with_code=0
    len_pre=0
    len_post=0
    questions_proccesed = 0
    questions_with_code = 0
    lstBody = []
    lstCode = []
    for body in data:
        
        is_code = 0
        #count bodies that contain code; the code itself is extracted and removed below
        if '<code>' in body:
            questions_with_code+=1
            is_code = 1
        x = len(body)
        len_pre+=x
        
        code = str(re.findall(r'<code>(.*?)</code>', body, flags=re.DOTALL))
        lstCode.append(code)

        body=re.sub('<code>(.*?)</code>', ' ', body, flags=re.MULTILINE|re.DOTALL)

        #remove urls
        body = re.sub(r'https?://\S+', ' ', body)

        #remove html tags; striphtml expects a str, so pass the text directly
        body = striphtml(body)

        #decontractions
        body = body.replace("won't", "will not").replace("can\'t", "can not").replace("n\'t", " not").replace("\'re", " are").\
                                                replace("\'s", " is").replace("\'d", " would").replace("\'ll", " will").\
                                                replace("\'t", " not").replace("\'ve", " have").replace("\'m", " am")

        body=re.sub(r'[^A-Za-z0-9#+\-]+',' ',body)
        words=word_tokenize(str(body.lower()))
        
        #Remove stopwords and all single-letter tokens except the letter 'c'
        body = ' '.join(j for j in words if j not in stop_words and (len(j)!=1 or j=='c'))
        lstBody.append(body)

        len_post+=len(body)
        questions_proccesed += 1
        if (questions_proccesed%100000==0):
            print("number of questions completed=",questions_proccesed)

    no_dup_avg_len_pre=(len_pre*1.0)/questions_proccesed
    no_dup_avg_len_post=(len_post*1.0)/questions_proccesed

    print( "Avg. length of questions(Body) before processing: %d"%no_dup_avg_len_pre)
    print( "Avg. length of questions(Body) after processing: %d"%no_dup_avg_len_post)
    print ("Percent of questions containing code: %d"%((questions_with_code*100.0)/questions_proccesed))

    print("Time taken to run this cell :", datetime.now() - start)
    return lstBody, lstCode
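The order of the cleaning steps is easier to follow on a single string. The sketch below mirrors the steps of `ProcessBody` under stated assumptions: `clean_body` is a hypothetical helper, `str.split` stands in for `word_tokenize`, the stop-word set is a tiny made-up subset, and only three contractions are handled.

```python
import re

STOP_WORDS_TOY = {"i", "it", "in", "the", "a", "is", "to"}  # made-up subset

def clean_body(body, stop_words=STOP_WORDS_TOY):
    """Return (cleaned text, list of extracted code snippets)."""
    # 1. Pull out code blocks, then blank them in the text.
    code = re.findall(r'<code>(.*?)</code>', body, flags=re.DOTALL)
    text = re.sub(r'<code>.*?</code>', ' ', body, flags=re.DOTALL)
    # 2. Remove URLs, then remaining HTML tags.
    text = re.sub(r'https?://\S+', ' ', text)
    text = re.sub(r'<.*?>', ' ', text)
    # 3. Expand a few contractions (specific forms before the generic n't).
    text = (text.replace("won't", "will not")
                .replace("can't", "can not")
                .replace("n't", " not"))
    # 4. Keep letters, digits and the characters # + - (so "c#" survives).
    text = re.sub(r'[^A-Za-z0-9#+\-]+', ' ', text).lower()
    # 5. Drop stop words and single letters, except 'c'.
    tokens = [t for t in text.split()
              if t not in stop_words and (len(t) != 1 or t == 'c')]
    return ' '.join(tokens), code

sample = ("<p>I can't parse it in C# see http://example.com/x </p>"
          "<pre><code>int x = 1;</code></pre>")
cleaned, snippets = clean_body(sample)
print(cleaned)
print(snippets)
```

Code is removed before the tag stripping so that angle brackets inside snippets (generics, comparisons) are not mistaken for HTML.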

In [None]:
#function to Preprocess the question title
def ProcessTitle(data):
    start = datetime.now()
    lstTitle=[]
    questions_proccesed = 0
    for title in data:
        # decontractions
        title = title.replace("won't", "will not").replace("can\'t", "can not").replace("n\'t", " not").replace("\'re", " are").\
                                                replace("\'s", " is").replace("\'d", " would").replace("\'ll", " will").\
                                                replace("\'t", " not").replace("\'ve", " have").replace("\'m", " am")

        title=re.sub(r'[^A-Za-z0-9#+\-]+',' ',title)
        words=word_tokenize(str(title.lower()))
        
        #Remove stopwords and all single-letter tokens except the letter 'c'
        title = ' '.join(j for j in words if j not in stop_words and (len(j)!=1 or j=='c'))
        lstTitle.append(title)
        questions_proccesed += 1
        if (questions_proccesed%100000==0):
            print("number of questions completed=",questions_proccesed)
            
    print("Time taken to run this cell :", datetime.now() - start)
    return lstTitle
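The order of the chained `replace` calls in the decontraction step matters: the irregular forms have to be handled before the generic `n't` rule. A minimal sketch, with hypothetical helper names and only a subset of the rules:

```python
def decontract(text):
    """Expand contractions, specific forms first (subset of the rules)."""
    return (text.replace("won't", "will not")
                .replace("can't", "can not")
                .replace("n't", " not")
                .replace("'re", " are")
                .replace("'s", " is"))

def decontract_generic_first(text):
    """Same idea, but with the generic n't rule applied too early."""
    return text.replace("n't", " not").replace("won't", "will not")

good = decontract("I won't say it's done, they're sure")
bad = decontract_generic_first("I won't")
print(good)
print(bad)
```

Note that the `'s` rule also rewrites possessives ("user's" becomes "user is"), which is a known limitation of this kind of string replacement.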

**Preprocess the Question Title**

In [None]:
ques_title = list(questions['Title'])
ques_title_lst = ProcessTitle(ques_title)

number of questions completed= 100000
number of questions completed= 200000
number of questions completed= 300000
number of questions completed= 400000
number of questions completed= 500000
number of questions completed= 600000
number of questions completed= 700000
number of questions completed= 800000
number of questions completed= 900000
number of questions completed= 1000000
number of questions completed= 1100000
Time taken to run this cell : 0:01:24.576731


In [None]:
questions['Processed_Ques_Title'] = ques_title_lst
questions.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body,Processed_Ques_Title
0,80,26.0,2008-08-01T13:57:07Z,,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...,sqlstatement execute multiple queries one stat...
1,90,58.0,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...,good branching merging tutorials tortoisesvn
2,120,83.0,2008-08-01T15:50:08Z,,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...,asp net site maps
3,180,2089740.0,2008-08-01T18:42:19Z,,53,Function for creating color wheels,<p>This is something I've pseudo-solved many t...,function creating color wheels
4,260,91.0,2008-08-01T23:22:08Z,,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...,adding scripting functionality net applications


**Preprocess the Question Body**

In [None]:
#Preprocess the Question Body
ques_body = list(questions['Body'])
ques_body_lst, ques_code_lst = ProcessBody(ques_body)

number of questions completed= 100000
number of questions completed= 200000
number of questions completed= 300000
number of questions completed= 400000
number of questions completed= 500000
number of questions completed= 600000
number of questions completed= 700000
number of questions completed= 800000
number of questions completed= 900000
number of questions completed= 1000000
number of questions completed= 1100000
Avg. length of questions(Body) before processing: 1368
Avg. length of questions(Body) after processing: 319
Percent of questions containing code: 74
Time taken to run this cell : 0:06:12.144818


In [None]:
questions['Processed_Ques_Body'] = ques_body_lst
questions.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body,Processed_Ques_Title,Processed_Ques_Body
0,80,26.0,2008-08-01T13:57:07Z,,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...,sqlstatement execute multiple queries one stat...,written database generation script application...
1,90,58.0,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...,good branching merging tutorials tortoisesvn,really good tutorials explaining branching mer...
2,120,83.0,2008-08-01T15:50:08Z,,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...,asp net site maps,anyone got experience creating sql-based asp n...
3,180,2089740.0,2008-08-01T18:42:19Z,,53,Function for creating color wheels,<p>This is something I've pseudo-solved many t...,function creating color wheels,something pseudo-solved many times never quite...
4,260,91.0,2008-08-01T23:22:08Z,,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...,adding scripting functionality net applications,little game written c uses database back-end n...


**Preprocess the Answer Body**

In [None]:
ans_body = list(answers['Body'])
ans_body_lst, ans_code_lst = ProcessBody(ans_body)

number of questions completed= 100000
number of questions completed= 200000
number of questions completed= 300000
number of questions completed= 400000
number of questions completed= 500000
number of questions completed= 600000
number of questions completed= 700000
number of questions completed= 800000
number of questions completed= 900000
number of questions completed= 1000000
number of questions completed= 1100000
number of questions completed= 1200000
number of questions completed= 1300000
number of questions completed= 1400000
number of questions completed= 1500000
number of questions completed= 1600000
number of questions completed= 1700000
number of questions completed= 1800000
number of questions completed= 1900000
number of questions completed= 2000000
Avg. length of questions(Body) before processing: 741
Avg. length of questions(Body) after processing: 210
Percent of questions containing code: 70
Time taken to run this cell : 0:08:06.155384


In [None]:
answers['Processed_Ans_Body'] = ans_body_lst
answers.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body,Processed_Ans_Body
0,92,61.0,2008-08-01T14:45:37Z,90,13,"<p><a href=""http://svnbook.red-bean.com/"">Vers...",good resource source control general not reall...
1,124,26.0,2008-08-01T16:09:47Z,80,12,<p>I wound up using this. It is a kind of a ha...,wound using kind hack actually works pretty we...
2,199,50.0,2008-08-01T19:36:46Z,180,1,<p>I've read somewhere the human eye can't dis...,read somewhere human eye not distinguish less ...
3,269,91.0,2008-08-01T23:49:57Z,260,4,"<p>Yes, I thought about that, but I soon figur...",yes thought soon figured another domain-specif...
4,307,49.0,2008-08-02T01:49:46Z,260,28,"<p><a href=""http://www.codeproject.com/Article...",really great introduction providing script abi...


In [None]:
questions.to_pickle("/content/drive/MyDrive/StackOverflow_CaseStudy/Preprocessed_data/Preprocessed_Questions.pkl")
answers.to_pickle("/content/drive/MyDrive/StackOverflow_CaseStudy/Preprocessed_data/Preprocessed_Answers.pkl")
tags.to_pickle("/content/drive/MyDrive/StackOverflow_CaseStudy/Preprocessed_data/Tags.pkl")

**Select questions for LSTM based model**

Due to resource constraints, we consider only questions tagged JavaScript, Java or C#, about 300k in total.

In [None]:
questions_for_model2 = questions[questions.Id.isin(tags[(tags.Tag=='javascript')|(tags.Tag=='java')|(tags.Tag=='c#')]['Id'])]

In [None]:
questions_for_model2.shape

(296099, 9)

In [None]:
answers_for_model2 = answers[answers.ParentId.isin(questions_for_model2.Id)]

In [None]:
answers_for_model2.shape

(579778, 7)

In [None]:
tags_for_model2 = tags[tags.Id.isin(questions_for_model2.Id) & ((tags.Tag=='javascript')|(tags.Tag=='java')|(tags.Tag=='c#'))]

In [None]:
tags_for_model2.shape

(299453, 2)
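The chained `==` comparisons can be collapsed into a single `isin` call, and a toy example also shows why the tag table can end up with more rows than the question table: a question carrying two of the selected tags contributes two tag rows. The frames below are made up for illustration.

```python
import pandas as pd

selected_tags = ["javascript", "java", "c#"]

# Made-up data: question 2 carries two of the selected tags.
questions_toy = pd.DataFrame({"Id": [1, 2, 3]})
tags_toy = pd.DataFrame({
    "Id":  [1, 2, 2, 3],
    "Tag": ["java", "javascript", "java", "python"],
})

tag_mask = tags_toy["Tag"].isin(selected_tags)
picked_questions = questions_toy[questions_toy["Id"].isin(tags_toy.loc[tag_mask, "Id"])]
picked_tags = tags_toy[tags_toy["Id"].isin(picked_questions["Id"]) & tag_mask]

print(len(picked_questions), len(picked_tags))
```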

In [None]:
questions_for_model2.to_pickle("/content/drive/MyDrive/StackOverflow_CaseStudy/Preprocessed_data/Model2_Preprocessed_Questions.pkl")
answers_for_model2.to_pickle("/content/drive/MyDrive/StackOverflow_CaseStudy/Preprocessed_data/Model2_Preprocessed_Answers.pkl")
tags_for_model2.to_pickle("/content/drive/MyDrive/StackOverflow_CaseStudy/Preprocessed_data/Model2_Tags.pkl")

**Further processing for Word2Vec model**

In [None]:
# Convert 'CreationDate' to UNIX time (seconds since 1970-01-01 UTC) so it can be used for time-based splitting
lst = questions['CreationDate'].values
questions['clean_CreationDate'] = [string.replace('T', ' ').split('.')[0] for string in lst]
questions['UNIX_CreationDate'] = pd.to_datetime(questions['clean_CreationDate']).astype(int)/10**9
del questions['clean_CreationDate']
questions.head()


Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body,Processed_Ques_Title,Processed_Ques_Body,UNIX_CreationDate
0,80,26.0,2008-08-01T13:57:07Z,,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...,sqlstatement execute multiple queries one stat...,written database generation script application...,1217599000.0
1,90,58.0,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...,good branching merging tutorials tortoisesvn,really good tutorials explaining branching mer...,1217602000.0
2,120,83.0,2008-08-01T15:50:08Z,,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...,asp net site maps,anyone got experience creating sql-based asp n...,1217606000.0
3,180,2089740.0,2008-08-01T18:42:19Z,,53,Function for creating color wheels,<p>This is something I've pseudo-solved many t...,function creating color wheels,something pseudo-solved many times never quite...,1217616000.0
4,260,91.0,2008-08-01T23:22:08Z,,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...,adding scripting functionality net applications,little game written c uses database back-end n...,1217633000.0
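As a sanity check on the conversion, a single timestamp can be turned into UNIX seconds with the standard library alone. This is a sketch with a hypothetical helper name; it assumes every `CreationDate` has the `YYYY-MM-DDTHH:MM:SSZ` form seen in the table.

```python
from datetime import datetime, timezone

def to_unix_seconds(stamp):
    """Parse an ISO-8601 'Z' timestamp and return seconds since the epoch."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing 'Z' means UTC, so attach that zone before converting.
    return parsed.replace(tzinfo=timezone.utc).timestamp()

seconds = to_unix_seconds("2008-08-01T13:57:07Z")
print(seconds)  # 1217599027.0
```

This matches the first row of the table, which pandas displays rounded as `1217599000.0`.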


In [None]:
#merge Answers with Questions; a Question with multiple answers yields one row per answer
questions = pd.merge(questions, answers, left_on='Id', right_on='ParentId')

questions.head()

Unnamed: 0,Id_x,OwnerUserId_x,CreationDate_x,ClosedDate,Score_x,Title,Body_x,Processed_Ques_Title,Processed_Ques_Body,UNIX_CreationDate,Id_y,OwnerUserId_y,CreationDate_y,ParentId,Score_y,Body_y,Processed_Ans_Body
0,80,26.0,2008-08-01T13:57:07Z,,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...,sqlstatement execute multiple queries one stat...,written database generation script application...,1217599000.0,124,26.0,2008-08-01T16:09:47Z,80,12,<p>I wound up using this. It is a kind of a ha...,wound using kind hack actually works pretty we...
1,80,26.0,2008-08-01T13:57:07Z,,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...,sqlstatement execute multiple queries one stat...,written database generation script application...,1217599000.0,10008,1109.0,2008-08-13T16:09:09Z,80,6,"<p>The <a href=""http://en.wikipedia.org/wiki/S...",sqlite api function called something like take...
2,80,26.0,2008-08-01T13:57:07Z,,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...,sqlstatement execute multiple queries one stat...,written database generation script application...,1217599000.0,3770976,364174.0,2010-09-22T15:37:30Z,80,1,<p>What about making your delimiter something ...,making delimiter something little complex like...
3,90,58.0,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...,good branching merging tutorials tortoisesvn,really good tutorials explaining branching mer...,1217602000.0,92,61.0,2008-08-01T14:45:37Z,90,13,"<p><a href=""http://svnbook.red-bean.com/"">Vers...",good resource source control general not reall...
4,90,58.0,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...,good branching merging tutorials tortoisesvn,really good tutorials explaining branching mer...,1217602000.0,202317,20709.0,2008-10-14T18:41:45Z,90,2,"<p>You can also try <em><a href=""http://www.co...",also try version control standalone programmer...
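The `_x` and `_y` column names in the merged frame come from pandas' default suffixes for columns that exist on both sides. A small sketch on made-up frames shows the row multiplication and how the `suffixes` parameter gives more readable names; the suffix strings are an illustrative choice, not what the notebook uses.

```python
import pandas as pd

# Made-up frames: question 80 has two answers, question 90 has one.
questions_toy = pd.DataFrame({"Id": [80, 90], "Score": [26, 144]})
answers_toy = pd.DataFrame({
    "Id":       [124, 10008, 92],
    "ParentId": [80, 80, 90],
    "Score":    [12, 6, 13],
})

# Default suffixes: overlapping columns become Id_x/Id_y, Score_x/Score_y.
default_merge = pd.merge(questions_toy, answers_toy,
                         left_on="Id", right_on="ParentId")

# Explicit suffixes make it obvious which side each column came from.
named_merge = pd.merge(questions_toy, answers_toy,
                       left_on="Id", right_on="ParentId",
                       suffixes=("_ques", "_ans"))

print(default_merge.columns.tolist())
print(named_merge.columns.tolist())
print(len(default_merge))
```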


In [None]:
questions = questions.drop(['OwnerUserId', 'CreationDate', 'ClosedDate', 'ParentId', 'AnsBody'], axis=1)

In [None]:
questions = questions.sort_values(['UNIX_CreationDate'], ascending=[True])
questions.head()

Unnamed: 0,Id,Score,Title,Body,Processed_Ques_Title,Processed_Ques_Body,UNIX_CreationDate,SummedScore,Processed_Ans_Body
0,80,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...,sqlstatement execute multiple queries one stat...,written database generation script sql want ex...,1217599000.0,19,wound using kind hack actually works pretty we...
1,90,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...,good branching merging tutorials tortoisesvn,really good tutorials explaining branching mer...,1217602000.0,34,easy click-by-click instructions specific tort...
2,120,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...,asp net site maps,anyone got experience creating sql-based asp n...,1217606000.0,9,jeff prosise version msdn magazine works prett...
3,180,53,Function for creating color wheels,<p>This is something I've pseudo-solved many t...,function creating color wheels,something pseudo-solved many times never quite...,1217616000.0,49,first thought generate vectors space maximize ...
4,260,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...,adding scripting functionality net applications,little game written c uses database back-end n...,1217633000.0,44,oleg shilo c script solution code project real...
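Sorting by creation time is the precondition for a time-based split, where the model is trained on older questions and evaluated on newer ones. The notebook section shown here does not perform the split itself, so the helper below is only a sketch: the name `time_based_split`, the 75% fraction and the toy rows are all assumptions.

```python
def time_based_split(rows, timestamp_of, train_fraction=0.75):
    """Split rows chronologically: oldest part for training, rest for testing."""
    ordered = sorted(rows, key=timestamp_of)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Made-up (question Id, UNIX timestamp) pairs, deliberately out of order.
rows = [(260, 1217633000), (80, 1217599000), (180, 1217616000), (90, 1217602000)]

train_rows, test_rows = time_based_split(rows, timestamp_of=lambda row: row[1])
print([qid for qid, _ in train_rows])
print([qid for qid, _ in test_rows])
```

Unlike a random split, this never lets a question from the future leak into the training set.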


In [None]:
#concatenate the Question Title, Body and its answers' Body into one full text per Question; this text is used to create samples for the Word2Vec model
questions['Ques_Text'] = questions[['Processed_Ques_Title', 'Processed_Ques_Body', 'Processed_Ans_Body']].agg(' '.join, axis=1)
questions.head()

Unnamed: 0,Id,Score,Title,Body,Processed_Ques_Title,Processed_Ques_Body,UNIX_CreationDate,SummedScore,Processed_Ans_Body,Ques_Text
0,80,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...,sqlstatement execute multiple queries one stat...,written database generation script sql want ex...,1217599000.0,19,wound using kind hack actually works pretty we...,sqlstatement execute multiple queries one stat...
1,90,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...,good branching merging tutorials tortoisesvn,really good tutorials explaining branching mer...,1217602000.0,34,easy click-by-click instructions specific tort...,good branching merging tutorials tortoisesvn r...
2,120,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...,asp net site maps,anyone got experience creating sql-based asp n...,1217606000.0,9,jeff prosise version msdn magazine works prett...,asp net site maps anyone got experience creati...
3,180,53,Function for creating color wheels,<p>This is something I've pseudo-solved many t...,function creating color wheels,something pseudo-solved many times never quite...,1217616000.0,49,first thought generate vectors space maximize ...,function creating color wheels something pseud...
4,260,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...,adding scripting functionality net applications,little game written c uses database back-end n...,1217633000.0,44,oleg shilo c script solution code project real...,adding scripting functionality net application...
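Row-wise `' '.join` raises a `TypeError` if any of the joined cells is missing (NaN is a float, not a string). A defensive variant fills missing values with an empty string first. This is a sketch on a made-up frame; whether the real data contains missing text at this point is not shown in the notebook.

```python
import pandas as pd

text_columns = ["Processed_Ques_Title", "Processed_Ques_Body", "Processed_Ans_Body"]

# Made-up rows; the second one has no processed answer text.
frame = pd.DataFrame({
    "Processed_Ques_Title": ["asp net site maps", "color wheels"],
    "Processed_Ques_Body":  ["anyone got experience", "something pseudo-solved"],
    "Processed_Ans_Body":   ["jeff prosise version", None],
})

# Fill gaps, join the three columns per row, then trim stray spaces.
frame["Ques_Text"] = (frame[text_columns]
                      .fillna("")
                      .agg(" ".join, axis=1)
                      .str.strip())

print(frame["Ques_Text"].tolist())
```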


Due to resource constraints, we consider only questions tagged JavaScript, Java or C#, about 300k in total.

In [None]:
questions_for_model = questions[questions.Id.isin(tags[(tags.Tag=='javascript')|(tags.Tag=='java')|(tags.Tag=='c#')]['Id'])]

In [None]:
questions_for_model.head()

Unnamed: 0,Id,Score,Title,Body,Processed_Ques_Title,Processed_Ques_Body,UNIX_CreationDate,SummedScore,Processed_Ans_Body,Ques_Text
4,260,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...,adding scripting functionality net applications,little game written c uses database back-end n...,1217633000.0,44,oleg shilo c script solution code project real...,adding scripting functionality net application...
8,650,79,Automatically update version number,<p>I would like the version property of my app...,automatically update version number,would like version property application increm...,1217762000.0,94,built stuff not using replace revision build n...,automatically update version number would like...
10,930,28,How do I connect to a database and loop over a...,<p>What's the simplest way to connect and quer...,connect database loop recordset c,simplest way connect query database set records c,1217811000.0,56,goyuix -- excellent something written memory n...,connect database loop recordset c simplest way...
11,1010,14,"How to get the value of built, encoded ViewState?",<p>I need to grab the base64-encoded represent...,get value built encoded viewstate,need grab base64-encoded representation viewst...,1217822000.0,10,rex suspect good place start looking solutions...,get value built encoded viewstate need grab ba...
12,1040,42,How do I delete a file which is locked by anot...,<p>I'm looking for a way to delete a file whic...,delete file locked another process c,looking way delete file locked another process...,1217829000.0,50,killing processes not healthy thing scenario i...,delete file locked another process c looking w...


In [None]:
#save the selected questions data to a pickle file
questions_for_model.to_pickle("/content/drive/MyDrive/StackOverflow_CaseStudy/Preprocessed_selected_data.pkl")

In [None]:
questions.to_pickle("/content/drive/MyDrive/StackOverflow_CaseStudy/Preprocessed_data.pkl")

In [None]:
selected_ques_tags = tags[(tags.Tag=='javascript')|(tags.Tag=='java')|(tags.Tag=='c#')]