In [0]:
import pandas as pd
import numpy as np
import spacy

**Load Dataset**

The dataset was collected from the **Google BigQuery** Stack Overflow dataset. It includes an archive of Stack Overflow content, including posts, votes, tags, and badges.
The dataset is available [here](https://www.kaggle.com/stackoverflow/stackoverflow).

In [0]:
EN = spacy.load('en_core_web_sm')
df = pd.read_csv('Original_data.csv')

Our dataset consists of the following columns:
* id
* title (question title)
* body (question)
* tags (associated tags)
* answers
* score


In [0]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,title,body,tags,answers,score
0,0,17869101,Unable to install Pygame using pip,<p>I'm trying to install Pygame. I am running ...,python|pygame|install|pip,<p>Try doing this:</p>\n\n<pre><code>sudo apt-...,19
1,1,27567208,How do I open sub window after I click on butt...,<p>HI I am trying to make a simple converter.\...,python|user-interface|pyqt4,<p>Make two programs: <b>main_win.py </b> and...,-2
2,2,31172719,pip install access denied on Windows,<p>I am trying to run <code>pip install mitmpr...,python|windows|pip|access-denied,<p>One additional thing that has not been cove...,12
3,3,1545606,Python k-means algorithm,<p>I am looking for Python implementation of k...,python|algorithm|cluster-analysis|k-means,"<p><a href=""http://docs.scipy.org/doc/scipy/re...",54
4,4,6707398,ValueError when using strptime to get a dateti...,<p>Im trying to convert a date string to a dat...,python|datetime,<pre><code>&gt;&gt;&gt; datetime.datetime.strp...,-5


The size of the dataset is:

In [0]:
print(df.shape)

(500000, 7)


The same question can appear in more than one row, so we group rows that share the same id, title, body, and tags, concatenate their answers, and sum their scores to get a combined score.

In [0]:
aggregations = {
    'answers':{
        'combined_answers': lambda x: "\n".join(x)
    },
    'score':{
        'combined_score': 'sum'
    }
}

grouped = df.groupby(['id','title', 'body','tags'],as_index=False).agg(aggregations)
deduped_df = pd.DataFrame(grouped)
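
The nested-dictionary form of `agg` used above was removed in pandas 1.0, where it raises a `SpecificationError`. On newer versions, named aggregation is one way to express the same grouping. The sketch below runs on a made-up three-row frame (`toy` and its values are hypothetical). It produces flat column names rather than the two-level header shown in the next output, so the later `row.title.values[0]` style accesses would become plain `row.title`.

```python
import pandas as pd

# Hypothetical rows: question 773 appears twice, question 469 once
toy = pd.DataFrame({
    'id': [773, 773, 469],
    'title': ['groupby?', 'groupby?', 'font path?'],
    'body': ['<p>q1</p>', '<p>q1</p>', '<p>q2</p>'],
    'tags': ['python|iteration', 'python|iteration', 'python|fonts'],
    'answers': ['<p>a1</p>', '<p>a2</p>', '<p>a3</p>'],
    'score': [800, 47, 39],
})

# Named aggregation: output_column=(input_column, aggregation)
combined = toy.groupby(['id', 'title', 'body', 'tags'], as_index=False).agg(
    combined_answers=('answers', lambda s: '\n'.join(s)),
    combined_score=('score', 'sum'),
)
print(combined[['id', 'combined_score']])
```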

**Updated Dataset**

Dataset after applying the above aggregation function

In [0]:
deduped_df.head()

Unnamed: 0_level_0,id,title,body,tags,answers,score
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,combined_answers,combined_score
0,469,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,python|macos|fonts|photoshop,<p>There must be a method in Cocoa to get a li...,39
1,773,How do I use itertools.groupby()?,<p>I haven't been able to find an understandab...,python|iteration,<p><strong>IMPORTANT NOTE:</strong> You have t...,847
2,1171,What is the most efficient graph data structur...,<p>I need to be able to manipulate a large (10...,python|performance|data-structures|graph-theory,"<p>Even though this question is now quite old,...",81
3,5136,Does anyone have experience creating a shared ...,<p>A researcher has created a small simulation...,python|c|matlab,<p>I won't help much but I remember that I was...,11
4,5966,Best way to abstract season/show/episode data,"<p>Basically, I've written an API to www.thetv...",python|data-structures,<p>I have done something similar in the past a...,8



The data includes HTML markup, e.g. p tags, h1-h6 tags, and code tags. We remove these during text processing.

We construct a new column 'post_corpus' by combining the title, the question body, and the answers.

We also prepend the title to the question body.

We also construct a URL for each question.

Finally, we add two sentiment features (polarity and subjectivity) using the open-source TextBlob library.

In [0]:
from bs4 import BeautifulSoup
from textblob import TextBlob

title_list = [] 
content_list = []
url_list = []
comment_list = []
sentiment_polarity_list = []
sentiment_subjectivity_list = []
vote_list =[]
tag_list = []
corpus_list = []

for i, row in deduped_df.iterrows():
    title_list.append(row.title.values[0])    # Get question title
    tag_list.append(row.tags.values[0])     # Get question tags
    
    # Questions
    content = row.body.values[0]
    soup = BeautifulSoup(content, 'lxml')
    if soup.code: soup.code.decompose()     # Remove the first <code> element (soup.code matches only the first one)
    tag_p = soup.p
    tag_pre = soup.pre
    text = ''
    if tag_p: text = text + tag_p.get_text()
    if tag_pre: text = text + tag_pre.get_text()
        
    content_list.append(str(row.title.values[0]) + ' ' + str(text))   # Append title and question body data to the updated question body
    
    url_list.append('https://stackoverflow.com/questions/' + str(row.id.values[0]))
    
    # Answers
    content = row.answers.values[0]
    soup = BeautifulSoup(content, 'lxml')
    if soup.code: soup.code.decompose()
    tag_p = soup.p
    tag_pre = soup.pre
    text = ''
    if tag_p: text = text + tag_p.get_text()
    if tag_pre: text = text + tag_pre.get_text()
    comment_list.append(text)
    
    vote_list.append(row.score.values[0])       # Append votes
    
    corpus_list.append(content_list[-1] + ' ' + comment_list[-1])     # Combine the updated body and answers to make the corpus
    
    sentiment = TextBlob(row.answers.values[0]).sentiment
    sentiment_polarity_list.append(sentiment.polarity)
    sentiment_subjectivity_list.append(sentiment.subjectivity)

content_token_df = pd.DataFrame({
    'original_title': title_list,
    'post_corpus': corpus_list,
    'question_content': content_list,
    'question_url': url_list,
    'tags': tag_list,
    'overall_scores': vote_list,
    'answers_content': comment_list,
    'sentiment_polarity': sentiment_polarity_list,
    'sentiment_subjectivity': sentiment_subjectivity_list,
})
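
The loop above keeps only the first `<p>` and first `<pre>` element of each post and removes only the first `<code>` element. If the full text is wanted instead, a variant could drop every `<code>` element and collect all remaining text. This is a minimal sketch of that alternative, not what the notebook does; `strip_code_and_markup` is a hypothetical helper, and it uses Python's built-in `html.parser` rather than `lxml`.

```python
from bs4 import BeautifulSoup

def strip_code_and_markup(html):
    """Return the visible text of an HTML fragment with all <code> elements removed."""
    soup = BeautifulSoup(html, 'html.parser')
    for code_tag in soup.find_all('code'):
        code_tag.decompose()            # Remove every code element, not just the first
    # Join the remaining text nodes with single spaces
    return soup.get_text(separator=' ', strip=True)

sample = '<p>Use <code>pip</code> to install.</p><p>Then run <code>python</code>.</p>'
print(strip_code_and_markup(sample))
```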

In [0]:
content_token_df.iloc[1]

original_title                            How do I use itertools.groupby()?
post_corpus               How do I use itertools.groupby()? I haven't be...
question_content          How do I use itertools.groupby()? I haven't be...
question_url                        https://stackoverflow.com/questions/773
tags                                                    [python, iteration]
overall_scores                                                          847
answers_content           IMPORTANT NOTE: You have to sort your data first.
sentiment_polarity                                                -0.137932
sentiment_subjectivity                                             0.737756
Name: 1, dtype: object

**Filter tags**

Many tags are associated with the questions. Here we consider only the 200 most frequent ones: we count how many questions carry each tag and then select the top 200 tags.

In [0]:
content_token_df.tags = content_token_df.tags.apply(lambda x: x.split('|'))   # Convert raw text data of tags into lists

# Make a dictionary to count the frequencies for all tags
tag_freq_dict = {}
for tags in content_token_df.tags:
    for tag in tags:
        # Start at 0 for unseen tags so the first occurrence counts as 1
        tag_freq_dict[tag] = tag_freq_dict.get(tag, 0) + 1

In [0]:
import heapq
most_common_tags = heapq.nlargest(200, tag_freq_dict, key=tag_freq_dict.get)
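
The counting loop and the `heapq.nlargest` call above can also be written with `collections.Counter`. This is a minimal sketch on made-up tag lists (`toy_tags` is hypothetical, not taken from the dataset). The order of equally frequent tags is not guaranteed to match `heapq.nlargest`.

```python
from collections import Counter

# Hypothetical tag lists standing in for content_token_df.tags
toy_tags = [
    ['python', 'pandas'],
    ['python', 'django'],
    ['python', 'pandas', 'numpy'],
]

# Count every occurrence of every tag across all questions
tag_counts = Counter(tag for tags in toy_tags for tag in tags)

# Keep the n most frequent tags (n = 200 in the notebook, 2 here)
top_tags = [tag for tag, _ in tag_counts.most_common(2)]
print(top_tags)
```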

In [0]:
most_common_tags[0:30]

['python',
 'python-3.x',
 'pandas',
 'django',
 'python-2.7',
 'numpy',
 'list',
 'matplotlib',
 'dataframe',
 'dictionary',
 'regex',
 'tkinter',
 'string',
 'flask',
 'csv',
 'arrays',
 'tensorflow',
 'json',
 'beautifulsoup',
 'selenium',
 'html',
 'web-scraping',
 'google-app-engine',
 'machine-learning',
 'mysql',
 'opencv',
 'scipy',
 'scikit-learn',
 'function',
 'linux']

In [0]:
final_indices = []
for i,tags in enumerate(content_token_df.tags.values.tolist()):
    if len(set(tags).intersection(set(most_common_tags)))>1:   # The minimum length for common tags should be 2 because 'python' is a common tag for all
        final_indices.append(i)


In [0]:
final_data = content_token_df.iloc[final_indices].copy()   # Copy so later column assignments do not act on a view

**Text Preprocessing**

1. Tokenisation
2. Lower Case
3. Remove stopwords, punctuation

In [0]:
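import nltk
import re
from nltk.corpus import stopwords    # Requires the corpus: nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer    # Used by tokenize_code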

def tokenize_text(text):
    tokens = EN.tokenizer(text)
    return [token.text.lower() for token in tokens if not token.is_space]

def to_lowercase(words):
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

# Build the stopword set once; calling stopwords.words() for every word is very slow
STOPWORDS = set(stopwords.words('english'))

def remove_stopwords(words):
    return [word for word in words if word not in STOPWORDS]

def normalize(words):
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)
    return words

def tokenize_code(text):
    return RegexpTokenizer(r'\w+').tokenize(text)

def preprocess_text(text):
    return ' '.join(normalize(tokenize_text(text)))
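
To see what the pipeline does to a single title, here is a small stand-alone sketch. It is an approximation under stated assumptions: a whitespace split stands in for the spaCy tokenizer, and `TOY_STOPWORDS` is a tiny hypothetical set standing in for NLTK's English list. Only the punctuation regex is the same as in `remove_punctuation`.

```python
import re

# Hypothetical, tiny stopword set; the notebook uses NLTK's English list
TOY_STOPWORDS = {'how', 'do', 'i', 'the', 'a', 'to'}

def toy_preprocess(text):
    # Lowercase, then split on whitespace (stand-in for the spaCy tokenizer)
    tokens = text.lower().split()
    # Same regex as remove_punctuation: keep only word characters and whitespace
    tokens = [re.sub(r'[^\w\s]', '', t) for t in tokens]
    # Drop tokens emptied by the previous step, then drop stopwords
    return ' '.join(t for t in tokens if t and t not in TOY_STOPWORDS)

print(toy_preprocess('How do I use itertools.groupby()?'))  # use itertoolsgroupby
```

Because punctuation is deleted inside tokens rather than used as a separator, `itertools.groupby()` collapses to `itertoolsgroupby`, which is also what appears in the processed titles.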

In [0]:
import spacy
EN = spacy.load('en_core_web_sm')

# Preprocess text for 'question_content', 'post_corpus' and a new column 'processed_title'
final_data['question_content'] = final_data.question_content.apply(preprocess_text)
final_data['post_corpus'] = final_data.post_corpus.apply(preprocess_text)
final_data['processed_title'] = final_data.original_title.apply(preprocess_text)

# Normalize numeric data for the scores
final_data.overall_scores = (final_data.overall_scores - final_data.overall_scores.mean()) / (final_data.overall_scores.max() - final_data.overall_scores.min())
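
The score scaling above is mean normalization: each value is centred on the mean and divided by the range (max minus min). The sketch below applies the same formula to a short list of example scores (the five values are purely illustrative) to show that the results sum to zero and span a range of exactly 1.

```python
# Illustrative raw scores
scores = [19, -2, 12, 54, -5]

mean = sum(scores) / len(scores)        # 15.6
spread = max(scores) - min(scores)      # 59

# Mean normalization: centre on the mean, scale by the range
normalized = [(s - mean) / spread for s in scores]
print(normalized)
```

A question with exactly the mean score maps to 0, and all values fall strictly between -1 and 1.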

In [0]:
final_data.tags = final_data.tags.apply(lambda x: '|'.join(x))    # Combine the lists back into text data
final_data = final_data.drop(columns=['answers_content'])     # drop() returns a new frame, so assign it back
final_data     # Display the result

Unnamed: 0,original_title,post_corpus,question_content,question_url,tags,overall_scores,sentiment_polarity,sentiment_subjectivity,processed_title
0,How can I find the full path to a font from it...,find full path font display name mac using pho...,find full path font display name mac using pho...,https://stackoverflow.com/questions/469,python|macos|fonts|photoshop,0.004447,0.116667,0.554167,find full path font display name mac
1,How do I use itertools.groupby()?,use itertoolsgroupby nt able find understandab...,use itertoolsgroupby nt able find understandab...,https://stackoverflow.com/questions/773,python|iteration,0.112598,-0.137932,0.737756,use itertoolsgroupby
2,What is the most efficient graph data structur...,efficient graph data structure python need abl...,efficient graph data structure python need abl...,https://stackoverflow.com/questions/1171,python|performance|data-structures|graph-theory,0.010068,0.179258,0.511521,efficient graph data structure python
3,Does anyone have experience creating a shared ...,anyone experience creating shared library matl...,anyone experience creating shared library matl...,https://stackoverflow.com/questions/5136,python|c|matlab,0.000699,0.268590,0.515934,anyone experience creating shared library matlab
5,How should I unit test a code-generator?,unit test code generator difficult open ended ...,unit test code generator difficult open ended ...,https://stackoverflow.com/questions/11060,c++|python|unit-testing|code-generation|swig,0.001502,0.122062,0.408964,unit test code generator
...,...,...,...,...,...,...,...,...,...
290887,How to reference a specific part of a dictiona...,reference specific part dictionary python alri...,reference specific part dictionary python alri...,https://stackoverflow.com/questions/59122330,python|dictionary|rotten-tomatoes,-0.000238,0.125000,0.516667,reference specific part dictionary python
290888,Permission Error: [Errno 1] Operation Not perm...,permission error errno 1 operation permitted t...,permission error errno 1 operation permitted t...,https://stackoverflow.com/questions/59122425,python|pyinstaller,-0.000773,0.050000,0.541111,permission error errno 1 operation permitted
290889,what is the Error in int object iteration?,error int object iteration int object subscrip...,error int object iteration int object subscrip...,https://stackoverflow.com/questions/59122649,python|object|int,-0.000773,0.083333,0.309630,error int object iteration
290890,Python creating objects in a loop append items...,python creating objects loop append items list...,python creating objects loop append items list...,https://stackoverflow.com/questions/59122774,python|list|loops|object,-0.000506,0.167929,0.478283,python creating objects loop append items list...


In [0]:
final_data.to_csv('Preprocessed_data.csv', index=False)