# Data Preparation

In this notebook, we use a subset of the [Stack Exchange network](https://archive.org/details/stackexchange) question data: original questions tagged 'JavaScript', their duplicate questions, and their answers. We walk through the steps that prepare this data for training a model that matches a new question with an existing original question.

In [1]:
import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")

import os
import pandas as pd
from utilities import read_csv_gz, clean_text, round_sample_strat, random_merge

Below, we define some parameters that will be used in the data cleaning as well as train and test set preparation.

In [2]:
# The fraction of duplicates held out for the test set
test_size = 0.21
# The minimum length of clean text, in characters
min_text = 150
# The minimum number of duplicates per question
min_dupes = 12
# The number of questions paired with each training duplicate (1 match plus match-1 non-matches)
match = 20

## Data cleaning

Next, we download the questions, duplicate questions and answers and load the datasets into pandas dataframes using the helper functions.

In [3]:
# URLs to original questions, duplicate questions, and answers.
data_url = 'https://bostondata.blob.core.windows.net/stackoverflow/{}'
questions_url = data_url.format('orig-q.tsv.gz')
dupes_url = data_url.format('dup-q.tsv.gz')
answers_url = data_url.format('ans.tsv.gz')

In [4]:
# Load datasets.
questions = read_csv_gz(questions_url, names=('Id', 'AnswerId', 'Text0', 'CreationDate'))
dupes = read_csv_gz(dupes_url, names=('Id', 'AnswerId', 'Text0', 'CreationDate'))
answers = read_csv_gz(answers_url, names=('Id', 'Text0'))
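
The helper `read_csv_gz` lives in `utilities` and is not shown in this notebook. As a rough idea of what such a loader might do, here is a minimal sketch. It assumes the files are headerless, tab-separated and gzip-compressed, and that `Id` becomes the index (which matches the outputs shown later). The name `read_tsv_gz_sketch` and the demo file are hypothetical.

```python
import gzip
import os
import tempfile

import pandas as pd


def read_tsv_gz_sketch(path, names):
    """Hypothetical stand-in for read_csv_gz: load a headerless gzipped TSV indexed by Id."""
    return pd.read_csv(path, sep='\t', compression='gzip',
                       header=None, names=list(names), index_col='Id')


# Write a tiny demo file so the sketch runs without network access.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.tsv.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as f:
    f.write('10\t20\tfirst question\n11\t21\tsecond question\n')

demo = read_tsv_gz_sketch(demo_path, names=('Id', 'AnswerId', 'Text0'))
print(demo)
```

The actual helper may handle encodings, quoting or date parsing differently.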

Let's now check the dataframes. Notice that questions and duplicates have an "AnswerId" column, which matches the index of the answers dataframe.

In [5]:
questions.head()

Id,AnswerId,Text0,CreationDate
220231,220233,Accessing the web page's HTTP Headers in JavaS...,2008-10-20 22:54:38.767
391979,810461,Get client IP using just JavaScript?. <p>I nee...,2008-12-24 18:22:30.780
109086,109091,Stop setInterval call in JavaScript. <p>I am u...,2008-09-20 19:29:55.377
46155,46181,Validate email address in JavaScript?. <p>How ...,2008-09-05 16:10:11.093
121499,121708,"When onblur occurs, how can I find out which e...",2008-09-23 14:48:43.483


In [6]:
dupes.head()

Id,AnswerId,Text0,CreationDate
665430,665404,"Disable ""Back"" & ""Refresh"" Button in Browser. ...",2009-03-20 09:13:31.800
114525,336868,"The difference between the two functions? (""fu...",2008-09-22 12:24:06.583
1347093,147765,ASP.NET Page_Unload to stop user from leaving ...,2009-08-28 13:46:51.217
1208252,26633883,See if a variable is an array using JavaScript...,2009-07-30 17:57:42.363
177867,122704,How do I copy the data of an element with jque...,2008-10-07 10:23:40.017


In [7]:
answers.head()

Id,Text0
119473,"<p>Try <a href=""http://johannburkard.de/blog/p..."
324533,"<p>Adapted from <a href=""http://www.javascript..."
108232,"<p>That is known as a textbox watermark, and i..."
194399,<p><strong>Obfuscation:</strong></p> <p>Try <a...
80127,"<p>In JavaScript, ""this"" always refers to the ..."


Let's check the first original question's text.

In [8]:
questions.iloc[0,1]

'Accessing the web page\'s HTTP Headers in JavaScript. <p>How do I access a page\'s HTTP response headers via JavaScript?</p> <p>Related to <a href="http://stackoverflow.com/questions/220149/how-do-i-access-the-http-request-header-fields-via-javascript"><strong>this question</strong></a>, which was modified to ask about accessing two specific HTTP headers.</p> <blockquote> <p><strong>Related:</strong><br> <a href="http://stackoverflow.com/questions/220149/how-do-i-access-the-http-request-header-fields-via-javascript">How do I access the HTTP request header fields via JavaScript?</a></p> </blockquote>'

Let's now check the duplicates for that question.

In [9]:
dupes[dupes.AnswerId == questions.iloc[0,0]]

Id,AnswerId,Text0,CreationDate
3177208,220233,Monitoring http request header on a page. <blo...,2010-07-05 04:20:19.663
12258705,220233,How can I read the current headers without mak...,2012-09-04 07:31:07.973
12256134,220233,How to know mime-type or content-type of curre...,2012-09-04 02:43:08.860
15135883,220233,How to access http response headers. <pre><cod...,2013-02-28 12:44:38.393
14673437,220233,Translate Prototype into jQuery?. <blockquote>...,2013-02-03 14:19:00.697
17466305,220233,How to read HTTP header values from JavaScript...,2013-07-04 09:08:32.240
26647511,220233,Is there a JS API to get information about hea...,2014-10-30 07:43:01.117
35604233,220233,How to read http request headers with javascri...,2016-02-24 14:00:49.247


Below is the answer to the original question.

In [10]:
answers.at[questions.iloc[0,0],'Text0']

'<p>Unfortunately, there isn\'t an API to give you the HTTP response headers for your initial page request. That was the original question posted here. It has been <a href="http://stackoverflow.com/questions/12258705/how-can-i-read-the-current-headers-without-making-a-new-request-with-js">repeatedly asked</a>, too, because some people would like to get the actual response headers of the original page request without issuing another one.</p> <h1><br/>For AJAX Requests:</h1> <p>If an HTTP request is made over AJAX, it is possible to get the response headers with the <strong><code>getAllResponseHeaders()</code></strong> method. It\'s part of the XMLHttpRequest API. To see how this can be applied, check out the <em><code>fetchSimilarHeaders()</code></em> function below. Note that this is a work-around to the problem that won\'t be reliable for some applications.</p> <pre><code>myXMLHttpRequest.getAllResponseHeaders(); </code></pre> <ul> <li><p>The API was specified in the following candida

Next, we use the helper functions to strip unwanted content such as code, HTML tags and links from the questions, duplicates and answers. Notice that we add a new column 'Text' to each dataframe, holding the cleaned text in lowercase.

In [11]:
# Clean up all text, and keep only data with some clean text.
for df in (questions, dupes, answers):
    df['Text'] = df.Text0.apply(clean_text).str.lower()

In [12]:
questions = questions[questions.Text.str.len() > 0]
answers = answers[answers.Text.str.len() > 0]
dupes = dupes[dupes.Text.str.len() > 0]
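
The implementation of `clean_text` is not shown here. A minimal sketch of this kind of cleaning, using only regular expressions from the standard library, could look like the following. The function name `clean_text_sketch`, the choice to drop only `<pre>` blocks as "code", and the whitespace collapsing are all assumptions; the real helper may behave differently.

```python
import html
import re


def clean_text_sketch(text):
    """Hypothetical cleaner: drop code blocks, HTML tags and bare URLs."""
    # Remove whole <pre>...</pre> blocks (assumed here to hold the code samples).
    text = re.sub(r'<pre>.*?</pre>', ' ', text, flags=re.DOTALL | re.IGNORECASE)
    # Remove any remaining HTML tags, including <a href="..."> links.
    text = re.sub(r'<[^>]+>', ' ', text)
    # Remove bare URLs left in the running text.
    text = re.sub(r'https?://\S+', ' ', text)
    # Decode entities such as &amp; and collapse runs of whitespace.
    return re.sub(r'\s+', ' ', html.unescape(text)).strip()


raw = '<p>Use <code>x</code> see http://a.b/c now</p><pre><code>var a;</code></pre>'
cleaned = clean_text_sketch(raw).lower()
print(cleaned)  # use x see now
```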

Let's compare the first original question with its cleaned version as an example.

In [13]:
# Original question.
questions.iloc[0,1]

'Accessing the web page\'s HTTP Headers in JavaScript. <p>How do I access a page\'s HTTP response headers via JavaScript?</p> <p>Related to <a href="http://stackoverflow.com/questions/220149/how-do-i-access-the-http-request-header-fields-via-javascript"><strong>this question</strong></a>, which was modified to ask about accessing two specific HTTP headers.</p> <blockquote> <p><strong>Related:</strong><br> <a href="http://stackoverflow.com/questions/220149/how-do-i-access-the-http-request-header-fields-via-javascript">How do I access the HTTP request header fields via JavaScript?</a></p> </blockquote>'

In [14]:
# After cleaning.
questions.iloc[0,3]

"accessing the web page's http headers in javascript. how do i access a page's http response headers via javascript? related to this question, which was modified to ask about accessing two specific http headers.  related: how do i access the http request header fields via javascript? "

It turns out that some duplicate questions also appear among the original questions, and that some original questions and some duplicate questions occur more than once in the datasets. In the following, we remove them from the dataframes.

In [15]:
# First, remove dupes that are questions, then remove duplicated questions and dupes.
dupes = dupes[~dupes.index.isin(questions.index)]
questions = questions[~questions.index.duplicated(keep='first')]
dupes = dupes[~dupes.index.duplicated(keep='first')]
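
To make the effect of these three filters concrete, here is the same logic applied to two tiny, made-up dataframes (the `toy_` names and values are illustrative only).

```python
import pandas as pd

# Question 1 is listed twice; dupe 2 is also an original question; dupe 8 is listed twice.
toy_questions = pd.DataFrame({'Text': ['q1', 'q1 again', 'q2']}, index=[1, 1, 2])
toy_dupes = pd.DataFrame({'Text': ['d-a', 'q2 copy', 'd-b', 'd-b again']},
                         index=[7, 2, 8, 8])

# Drop dupes whose Id is also a question Id, then drop repeated Ids in both frames.
toy_dupes = toy_dupes[~toy_dupes.index.isin(toy_questions.index)]
toy_questions = toy_questions[~toy_questions.index.duplicated(keep='first')]
toy_dupes = toy_dupes[~toy_dupes.index.duplicated(keep='first')]

print(toy_questions.index.tolist())  # [1, 2]
print(toy_dupes.index.tolist())      # [7, 8]
```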

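We also keep only the questions that have both an answer and at least one duplicate, and drop any answers and duplicates that no longer correspond to a kept question.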

In [16]:
# Keep only questions with answers and dupes, answers to questions, and dupes of questions.
questions = questions[questions.AnswerId.isin(answers.index) & questions.AnswerId.isin(dupes.AnswerId)]
answers = answers[answers.index.isin(questions.AnswerId)]
dupes = dupes[dupes.AnswerId.isin(questions.AnswerId)]

In [17]:
# Verify data integrity.
assert questions.AnswerId.isin(answers.index).all()
assert answers.index.isin(questions.AnswerId).all()
assert questions.AnswerId.isin(dupes.AnswerId).all()
assert dupes.AnswerId.isin(questions.AnswerId).all()

Below are some statistics on the data. Notice that some questions have very few duplicates while others have a large number: the median is 4 duplicates per question, while the maximum is 1369.

In [18]:
# Report on the data.
print('Text statistics:')
print(pd.DataFrame([questions.Text.str.len().describe()
                    .rename('questions'),
                    answers.Text.str.len().describe()
                    .rename('answers'),
                    dupes.Text.str.len().describe()
                    .rename('dupes')]))
print('\nDuplication statistics:')
print(pd.DataFrame([dupes.AnswerId.value_counts().describe()
                    .rename('duplications')]))
print('\nLargest class: {:.2%}'.format(
    dupes.AnswerId.value_counts().max()
    / dupes.shape[0]))


Text statistics:
             count        mean         std   min    25%    50%    75%     max
questions   1714.0  415.827305  319.857854  56.0  225.0  334.0  509.0  3982.0
answers     1714.0  616.274212  673.060199   1.0  178.0  375.0  757.0  3982.0
dupes      16139.0  441.303612  363.638297  25.0  247.0  357.0  519.0  3989.0

Duplication statistics:
               count      mean        std  min  25%  50%  75%     max
duplications  1714.0  9.415986  41.638847  1.0  3.0  4.0  7.0  1369.0

Largest class: 8.48%


Now, we reset all indexes so that the Id values are available as regular columns in the remaining steps.

In [19]:
# Reset each dataframe's index.
questions.reset_index(inplace=True)
answers.reset_index(inplace=True)
dupes.reset_index(inplace=True)

We keep only the questions and duplicates whose clean text is at least min_text characters long.

In [20]:
# Apply the minimum text length to questions and dupes.
questions = questions[questions.Text.str.len() >= min_text]
dupes = dupes[dupes.Text.str.len() >= min_text]

In [21]:
# Keep only questions with dupes, and dupes of questions.
label_column = 'AnswerId'
questions = questions[questions[label_column].isin(dupes[label_column])]
dupes = dupes[dupes[label_column].isin(questions[label_column])]

Here, we remove the questions that have fewer than min_dupes duplicates, together with their duplicates.

In [22]:
# Restrict the questions to those with a minimum number of dupes.
answerid_count = dupes.groupby(label_column)[label_column].count()
answerid_min = answerid_count.index[answerid_count >= min_dupes]
questions = questions[questions[label_column].isin(answerid_min)]
dupes = dupes[dupes[label_column].isin(answerid_min)]
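
The following toy example (made-up data and a threshold of 2 instead of `min_dupes`) shows how the group count selects which labels survive.

```python
import pandas as pd

toy_dupes = pd.DataFrame({'Id': range(6),
                          'AnswerId': [100, 100, 100, 200, 200, 300]})
toy_questions = pd.DataFrame({'Id': [1, 2, 3], 'AnswerId': [100, 200, 300]})
toy_min_dupes = 2  # illustrative threshold

# Count dupes per AnswerId and keep the labels that reach the threshold.
counts = toy_dupes.groupby('AnswerId')['AnswerId'].count()
keep = counts.index[counts >= toy_min_dupes]
toy_questions = toy_questions[toy_questions['AnswerId'].isin(keep)]
toy_dupes = toy_dupes[toy_dupes['AnswerId'].isin(keep)]

print(keep.tolist())                          # [100, 200]
print(len(toy_questions), len(toy_dupes))     # 2 5
```

The question with `AnswerId` 300 has a single duplicate, so both it and that duplicate are dropped.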

In [23]:
# Verify data integrity.
assert questions[label_column].isin(dupes[label_column]).all()
assert dupes[label_column].isin(questions[label_column]).all()

Here are some statistics on the resulting dataset.

In [24]:
# Report on the data.
print('Restrictions: min_text={}, min_dupes={}'.format(
    min_text, min_dupes))
print('Restricted text statistics:')
print(pd.DataFrame([questions.Text.str.len().describe()
                    .rename('questions'),
                    dupes.Text.str.len().describe()
                    .rename('dupes')]))
print('\nRestricted duplication statistics:')
print(pd.DataFrame([dupes[label_column].value_counts().describe()
                    .rename('duplications')]))
print('\nRestricted largest class: {:.2%}'.format(
    dupes[label_column].value_counts().max()
    / dupes.shape[0]))

Restrictions: min_text=150, min_dupes=12
Restricted text statistics:
            count        mean         std    min     25%    50%    75%     max
questions   182.0  413.450549  218.028193  153.0  264.25  338.5  510.5  1475.0
dupes      8260.0  479.882324  398.791447  150.0  270.00  380.0  553.0  3989.0

Restricted duplication statistics:
              count       mean         std   min   25%   50%   75%     max
duplications  182.0  45.384615  117.074823  12.0  15.0  20.0  33.0  1328.0

Restricted largest class: 16.08%


## Prepare train and test sets

In this part, we prepare train and test sets. For training a binary classification model, we will need to construct match and non-match pairs from duplicates and their questions. Finding matching pairs can be accomplished by joining each duplicate with its question. However, non-match examples need to be constructed randomly. 

As a first step, to make sure we can train and test the model on every question, we need match and non-match pairs for each question in both the train and test sets. To achieve that, we split the duplicates into train and test sets in a stratified manner, so that every question has at least one duplicate in the test set. The exact number per question depends on the test_size parameter and on how many duplicates that question has.

In [25]:
# Split dupes into train and test ensuring at least one of each label class is in test.
dupes_test = round_sample_strat(dupes, dupes[label_column], frac=test_size)
dupes_train = dupes[~dupes.Id.isin(dupes_test.Id)]
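
The helper `round_sample_strat` is defined in `utilities` and not shown here. One plausible way to implement such a split is to sample a rounded fraction from each label group while never taking fewer than one row. The sketch below does that; the function name, the rounding rule and the `random_state` argument are assumptions, not the helper's actual behavior.

```python
import pandas as pd


def round_sample_strat_sketch(df, labels, frac, random_state=0):
    """Hypothetical stratified sampler: at least one row from every label group."""
    parts = []
    for _, group in df.groupby(labels):
        # Rounded share of the group, but never fewer than one row.
        n = max(1, int(round(frac * len(group))))
        parts.append(group.sample(n=n, random_state=random_state))
    return pd.concat(parts)


toy = pd.DataFrame({'Id': range(13), 'AnswerId': ['a'] * 10 + ['b'] * 3})
toy_test = round_sample_strat_sketch(toy, toy['AnswerId'], frac=0.2)
toy_train = toy[~toy.Id.isin(toy_test.Id)]

print(toy_test['AnswerId'].value_counts().to_dict())
print(len(toy_train), len(toy_test))  # 10 3
```

With `frac=0.2`, group `a` contributes 2 of its 10 rows and group `b` contributes 1 of its 3, so both labels are present in the test set.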

In [26]:
assert (dupes_test[label_column].unique().shape[0] == dupes[label_column].unique().shape[0])

In [27]:
# The relevant columns for text pairs data.
balanced_pairs_columns = ['Id_x', 'AnswerId_x', 'Text_x', 'Id_y', 'Text_y', 'AnswerId_y', 'Label', 'n']

Next, we pair each duplicate in the train set with its matching question and N-1 random non-matching questions using the helper function.

In [28]:
%%time
# Use AnswerId to pair each training dupe with its matching question and also with N-1 questions not its match.
balanced_pairs_train = random_merge(dupes_train, questions, N=match)
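
Since `random_merge` is also defined in `utilities`, its internals are not visible here. The pairing it is described as doing can be sketched as follows. This is an illustrative, unoptimized version under stated assumptions: the name `random_merge_sketch`, the `seed` argument and the row-by-row loop are hypothetical, and the sketch does not reproduce the `n` column of the real output.

```python
import numpy as np
import pandas as pd


def random_merge_sketch(dupes, questions, N, label='AnswerId', seed=0):
    """Pair each dupe with its matching question(s) and up to N-1 random non-matches."""
    rng = np.random.default_rng(seed)
    pieces = []
    for _, dupe in dupes.iterrows():
        is_match = questions[label] == dupe[label]
        others = questions[~is_match]
        # Sample non-matching questions without replacement.
        k = min(N - 1, len(others))
        picked = others.iloc[rng.choice(len(others), size=k, replace=False)]
        paired = pd.concat([questions[is_match], picked]).add_suffix('_y')
        # Attach the dupe's own columns with an _x suffix.
        paired = paired.assign(**{col + '_x': dupe[col] for col in dupes.columns})
        pieces.append(paired)
    return pd.concat(pieces, ignore_index=True)


toy_questions = pd.DataFrame({'Id': [1, 2, 3], 'AnswerId': [100, 200, 300],
                              'Text': ['q1', 'q2', 'q3']})
toy_dupes = pd.DataFrame({'Id': [11, 12], 'AnswerId': [100, 300],
                          'Text': ['d1', 'd2']})

toy_pairs = random_merge_sketch(toy_dupes, toy_questions, N=2)
toy_pairs['Label'] = (toy_pairs.AnswerId_x == toy_pairs.AnswerId_y).astype(int)
print(toy_pairs[['Id_x', 'Id_y', 'Label']])
```

In this sketch, setting `N` to the number of questions pairs a duplicate with every question, because all non-matching questions are then selected.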

CPU times: user 53.7 s, sys: 158 ms, total: 53.8 s
Wall time: 53.8 s


Labeling is done such that matching pairs are labeled as 1 and non-match pairs are labeled as 0.

In [29]:
# Label records by matching AnswerIds.
balanced_pairs_train['Label'] = (balanced_pairs_train.AnswerId_x == balanced_pairs_train.AnswerId_y).astype(int)

In [30]:
# Keep only the relevant data.
balanced_pairs_train = balanced_pairs_train[balanced_pairs_columns]

In [31]:
balanced_pairs_train.head()

,Id_x,AnswerId_x,Text_x,Id_y,Text_y,AnswerId_y,Label,n
0,177867,122704,how do i copy the data of an element with jque...,122102,what is the most efficient way to clone an obj...,122704,1,0
1,565430,122704,(deep) copying an array using jquery. possibl...,122102,what is the most efficient way to clone an obj...,122704,1,0
2,3474697,122704,how to clone js object?. possible duplicate: ...,122102,what is the most efficient way to clone an obj...,122704,1,0
3,10801878,122704,how can i copy a variable without pointing to ...,122102,what is the most efficient way to clone an obj...,122704,1,0
4,9610918,122704,how do i get a new reference to an object. po...,122102,what is the most efficient way to clone an obj...,122704,1,0


In [32]:
# Sort the data by dupe ID and Label.
balanced_pairs_train.sort_values(by=['Id_x', 'Label'], ascending=[True, False], inplace=True)

In the test set, we pair each duplicate with all the original questions and label the pairs the same way as in the training set.

In [33]:
%%time
# Use AnswerId to pair each testing dupe with all questions.
balanced_pairs_test = random_merge(dupes_test, questions, N=questions.shape[0])

CPU times: user 19.3 s, sys: 40.1 ms, total: 19.3 s
Wall time: 19.3 s


In [34]:
# Label records by matching AnswerIds.
balanced_pairs_test['Label'] = (balanced_pairs_test.AnswerId_x == balanced_pairs_test.AnswerId_y).astype(int)

In [35]:
# Keep only the relevant data.
balanced_pairs_test = balanced_pairs_test[balanced_pairs_columns]

In [36]:
balanced_pairs_test.head()

,Id_x,AnswerId_x,Text_x,Id_y,Text_y,AnswerId_y,Label,n
0,13782698,6700,get total number of items on json object?. po...,5223,"length of a javascript object (that is, associ...",6700,1,0
1,6283466,6700,count key/values in json. possible duplicate:...,5223,"length of a javascript object (that is, associ...",6700,1,0
2,29730927,27943,php distance between two points latitude longi...,27928,calculate distance between two latitude-longit...,27943,1,0
3,25048673,27943,what is the equation of longitude and latitude...,27928,calculate distance between two latitude-longit...,27943,1,0
4,35494335,31047,how to check for null or undefined in javascri...,31044,"is there an ""exists"" function for jquery?. how...",31047,1,0


In [37]:
# Sort the data by dupe ID and Label.
balanced_pairs_test.sort_values(by=['Id_x', 'Label'], ascending=[True, False], inplace=True)

Finally, we report on the final train and test sets and save them as text files to be used in the modeling step.

In [38]:
# Report on the datasets.
print('balanced_pairs_train: {:,} rows with {:.2%} matches'.format(balanced_pairs_train.shape[0], 
                                                                   balanced_pairs_train.Label.mean()))
print('balanced_pairs_test: {:,} rows with {:.2%} matches'.format(balanced_pairs_test.shape[0], 
                                                                  balanced_pairs_test.Label.mean()))

balanced_pairs_train: 132,500 rows with 5.00% matches
balanced_pairs_test: 297,570 rows with 0.55% matches
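
These figures are consistent with the pairing scheme. Each training duplicate contributes `match` = 20 rows with exactly one match, and each test duplicate contributes one row per question (182) with exactly one match. A quick arithmetic check, using only numbers reported in this notebook:

```python
match = 20
n_questions = 182
train_rows = 132_500
test_rows = 297_570

# Each dupe contributes a fixed number of rows, so row counts reveal dupe counts.
train_dupes = train_rows // match
test_dupes = test_rows // n_questions
print(train_dupes, test_dupes, train_dupes + test_dupes)  # 6625 1635 8260

# One match per dupe gives the reported match rates.
print('{:.2%} {:.2%}'.format(1 / match, 1 / n_questions))  # 5.00% 0.55%
```

The two duplicate counts add up to the 8,260 duplicates reported in the restricted statistics above.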


In [39]:
# Save the data.
balanced_pairs_train_path = 'balanced_pairs_train.tsv'
print('Writing {:,} to {}'.format(balanced_pairs_train.shape[0], balanced_pairs_train_path))
balanced_pairs_train.to_csv(balanced_pairs_train_path, sep='\t', header=True, index=False)

balanced_pairs_test_path = 'balanced_pairs_test.tsv'
print('Writing {:,} to {}'.format(balanced_pairs_test.shape[0], balanced_pairs_test_path))
balanced_pairs_test.to_csv(balanced_pairs_test_path, sep='\t', header=True, index=False)

# Save original questions to be used for scoring later.
questions_path = 'questions.tsv'
print('Writing {:,} to {}'.format(questions.shape[0], questions_path))
questions.to_csv(questions_path, sep='\t', header=True, index=False)

# Save the test duplicate questions to be used with the scoring function.
dupes_test_path = 'dupes_test.tsv'
print('Writing {:,} to {}'.format(dupes_test.shape[0], dupes_test_path))
dupes_test.to_csv(dupes_test_path, sep='\t', header=True, index=False)

Writing 132,500 to balanced_pairs_train.tsv
Writing 297,570 to balanced_pairs_test.tsv
Writing 182 to questions.tsv
Writing 1,635 to dupes_test.tsv
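
A later notebook can load these files with `pd.read_csv` using the same tab separator. The round trip below uses a small made-up dataframe and a temporary path (both hypothetical) rather than the real output files.

```python
import os
import tempfile

import pandas as pd

toy_pairs = pd.DataFrame({'Id_x': [11, 11],
                          'Text_x': ['how to clone?', 'how to clone?'],
                          'Id_y': [1, 2],
                          'Text_y': ['clone an object', 'read headers'],
                          'Label': [1, 0]})

# Write with the same options as above, then read the file back.
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_pairs.tsv')
toy_pairs.to_csv(toy_path, sep='\t', header=True, index=False)
loaded = pd.read_csv(toy_path, sep='\t')
print(loaded)
```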


We can now move on to [building the model](01_Create_Model.ipynb).