<h1 style="text-align:center"> Stack Overflow: Tag Prediction </h1>

<img src='../images/so_tag/pic1.jpg'/>

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import sqlite3
import csv
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from wordcloud import WordCloud
import re
import os
from sqlalchemy import create_engine # database connection
import datetime as dt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.metrics import f1_score,precision_score,recall_score
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from skmultilearn.adapt import mlknn
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import LabelPowerset
from sklearn.naive_bayes import GaussianNB
from datetime import datetime
from pathlib import Path

<p style='font-size:24px'><b> Description </b></p>
<p>
Stack Overflow is the largest, most trusted online community for developers to learn, share their programming knowledge, and build their careers.<br />
<br />
Nearly every programmer uses Stack Overflow in one way or another. Each month, over 50 million developers come to Stack Overflow to learn, share their knowledge, and build their careers. It features questions and answers on a wide range of topics in computer programming. The website serves as a platform for users to ask and answer questions and, through membership and active participation, to vote questions and answers up or down and edit them in a fashion similar to a wiki or Digg. As of April 2014, Stack Overflow had over 4,000,000 registered users, and it exceeded 10,000,000 questions in late August 2015. Based on the tags assigned to questions, the eight most discussed topics on the site are Java, JavaScript, C#, PHP, Android, jQuery, Python and HTML.<br />
<br />
</p>

<p style='font-size:18px'><b> Problem Statement </b></p>
Suggest tags based on the content of a question posted on Stack Overflow.

<p style='font-size:14px'><b> Source:  </b> https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/</p>


<h2> Business Objectives and Constraints </h2>

1. Predict as many tags as possible with high precision and recall.
2. Incorrect tags could impact customer experience on StackOverflow.
3. No strict latency constraints.

<h2> Data </h2>

Refer: https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data
<br>
All of the data is in 2 files: Train and Test.<br />
<pre>
<b>Train.csv</b> contains 4 columns: Id,Title,Body,Tags.<br />
<b>Test.csv</b> contains the same columns but without the Tags, which you are to predict.<br />
<b>Size of Train.csv</b> - 6.75GB<br />
<b>Size of Test.csv</b> - 2GB<br />
<b>Number of rows in Train.csv</b> = 6034195<br />
</pre>
The questions are randomized and contain a mix of verbose text sites as well as sites related to math and programming. The number of questions from each site may vary, and no filtering has been performed on the questions (for example, closed questions are included).<br />
<br />


__Data Field Explanation__

The dataset contains 6,034,195 rows. The columns in the table are:<br />
<pre>
<b>Id</b> - Unique identifier for each question<br />
<b>Title</b> - The question's title<br />
<b>Body</b> - The body of the question<br />
<b>Tags</b> - The tags associated with the question in a space-separated format (all lowercase, should not contain tabs '\t' or ampersands '&')<br />
</pre>

<br />

<h2> Mapping to a Machine Learning Problem </h2>



<p> This is a multi-label classification problem. <br>
<b>Multi-label Classification</b>: Multilabel classification assigns to each sample a set of target labels. This can be thought of as predicting properties of a data point that are not mutually exclusive, such as the topics that are relevant for a document. A question on Stack Overflow might be about any of C, Pointers, FileIO and/or memory-management at the same time, or none of these. <br>
__Credit__: http://scikit-learn.org/stable/modules/multiclass.html
</p>
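To make the multi-label framing concrete, here is a minimal sketch of how space-separated tag strings can be turned into a binary label matrix. The questions and tags are made up for illustration, and `MultiLabelBinarizer` is one common way to build such a matrix, not necessarily the encoding used later in this notebook.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag strings in the same space-separated format as Train.csv.
# The last question has no tags at all ("none of these").
tag_strings = ["c pointers", "c file-io memory-management", "python", ""]

mlb = MultiLabelBinarizer()
# One row per question, one column per distinct tag; 1 means the tag applies.
label_matrix = mlb.fit_transform([t.split() for t in tag_strings])

print(list(mlb.classes_))
print(label_matrix)
```

Each row can contain any number of ones, which is what distinguishes this from ordinary multi-class classification, where exactly one class applies per sample.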

<h3> Performance metric </h3>

<b>Micro-Averaged F1-Score (Mean F Score) </b>: 
The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. Precision and recall contribute equally to the F1 score. The formula for the F1 score is:

<i>F1 = 2 * (precision * recall) / (precision + recall)</i><br>

In the multi-class and multi-label case, the per-class results are combined into a single score, and the averaging scheme (micro or macro) determines how. <br>

<b>'Micro f1 score': </b><br>
Calculate metrics globally by counting the total true positives, false negatives and false positives. This is a better metric when we have class imbalance.
<br>

<b>'Macro f1 score': </b><br>
Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
<br>

https://www.kaggle.com/wiki/MeanFScore <br>
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html <br>
<br>
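The difference between the two averaging schemes is easiest to see on a small example. The sketch below uses a made-up label matrix with one frequent tag and two rare tags; the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical ground truth: column 0 is a frequent tag, columns 1 and 2 are rare.
y_true = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [1, 0, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
# A model that only ever predicts the frequent tag.
y_pred = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [1, 0, 0],
                   [1, 0, 0],
                   [0, 0, 0]])

# Micro: pool TP/FP/FN over all labels (TP=4, FP=0, FN=2), then compute F1.
micro = f1_score(y_true, y_pred, average='micro', zero_division=0)
# Macro: compute F1 per label (1, 0, 0), then take the unweighted mean.
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)

print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

Here the micro score is 0.8 while the macro score is about 0.333: the micro average is dominated by the frequent tag, whereas the macro average penalizes the two rare tags that are never predicted.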
<b> Hamming loss </b>: The Hamming loss is the fraction of labels that are incorrectly predicted. <br>
https://www.kaggle.com/wiki/HammingLoss <br>
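As a quick check of the definition, the following sketch computes the Hamming loss on a made-up 2x3 label matrix, both with scikit-learn and by hand.

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Hypothetical labels for two questions and three tags.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 1, 1],
                   [0, 1, 0]])

# Library value: fraction of individual label slots that are wrong.
loss = hamming_loss(y_true, y_pred)
# Same quantity by hand: 1 wrong slot out of 6.
manual = np.mean(y_true != y_pred)

print(loss, manual)
```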

## Read the data file

In [2]:
path = Path('../data/so_tag')
list(path.iterdir())

[PosixPath('../data/so_tag/._lr_with_more_title_weight.pkl'),
 PosixPath('../data/so_tag/OpenedFilesView.exe'),
 PosixPath('../data/so_tag/Test.csv'),
 PosixPath('../data/so_tag/tokenaizer.JPG'),
 PosixPath('../data/so_tag/._Titlemoreweight.db'),
 PosixPath('../data/so_tag/._tokenaizer.JPG'),
 PosixPath('../data/so_tag/train_no_dup-003.db'),
 PosixPath('../data/so_tag/._train_no_dup-003.db'),
 PosixPath('../data/so_tag/._SO_Tag_Predictor.ipynb'),
 PosixPath('../data/so_tag/._lr_with_equal_weight-002.pkl'),
 PosixPath('../data/so_tag/._Processed.db'),
 PosixPath('../data/so_tag/Processed.db'),
 PosixPath('../data/so_tag/lr_with_equal_weight-002.pkl'),
 PosixPath('../data/so_tag/SO_Tag_Predictor.ipynb'),
 PosixPath('../data/so_tag/SampleSubmission.csv'),
 PosixPath('../data/so_tag/lr_with_more_title_weight.pkl'),
 PosixPath('../data/so_tag/Titlemoreweight.db'),
 PosixPath('../data/so_tag/._tag_counts_dict_dtm.csv'),
 PosixPath('../data/so_tag/Train.csv'),
 PosixPath('../data/so_tag/tag_c

In [4]:
if not os.path.isfile('train.db'):
    start = datetime.now()
    disk_engine = create_engine('sqlite:///train.db')
    chunksize = 200000
    j = 0
    index = 1
    # Note: names= without header=0 makes pandas read any header line in Train.csv as a data row.
    for df in pd.read_csv(path/'Train.csv', names=['Id', 'Title', 'Body', 'Tags'], chunksize=chunksize, iterator=True, encoding='utf-8'):
        df.index += index
        j += 1
        df.to_sql('data', disk_engine, if_exists='append')
        index = df.index[-1] + 1
    end = datetime.now()
    print(f'Time taken to create db:: {end - start}')

Time taken to create db:: 0:03:38.329734


In [5]:
start = datetime.now()
con = sqlite3.connect('train.db')
rows = pd.read_sql_query("""SELECT count(*) FROM data""", con)
print(f"Number of rows in the db:: {(rows['count(*)'].values[0])}")
con.close()
end = datetime.now()
print(f'Time taken to count rows:: {end - start}')

Number of rows in the db:: 6034196
Time taken to count rows:: 0:00:00.051860
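The count above is one more than the 6,034,195 rows listed in the Data section. One plausible cause, assuming Train.csv begins with a header line, is that `names=` without `header=0` stores that header line as a data row. The sketch below reproduces the effect on a made-up three-question CSV loaded in chunks into an in-memory SQLite database; the helper name `load` and the sample data are illustrative only.

```python
import io
import sqlite3
import pandas as pd

# Hypothetical stand-in for Train.csv: one header line plus three questions.
csv_text = "Id,Title,Body,Tags\n1,t1,b1,c c++\n2,t2,b2,java\n3,t3,b3,python\n"
cols = ['Id', 'Title', 'Body', 'Tags']

def load(header):
    """Load the CSV in chunks into SQLite and return the resulting row count."""
    con = sqlite3.connect(':memory:')
    reader = pd.read_csv(io.StringIO(csv_text), names=cols, header=header, chunksize=2)
    for chunk in reader:
        chunk.to_sql('data', con, if_exists='append', index=False)
    n = con.execute('SELECT COUNT(*) FROM data').fetchone()[0]
    con.close()
    return n

rows_header_as_data = load(header=None)  # same as passing names= alone
rows_header_skipped = load(header=0)     # header line is replaced by names
print(rows_header_as_data, rows_header_skipped)
```

With `header=0` the header line is consumed and replaced by the supplied column names, so the stored row count matches the number of questions.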


## Check for duplicates

In [38]:
start = datetime.now()
con = sqlite3.connect('train.db')
df_duplicates = pd.read_sql_query('SELECT title, body, tags, COUNT(*) as cnt_dup FROM data GROUP BY title, body, tags', con)
con.close()
print("Time taken to run this cell :", datetime.now() - start)



Time taken to run this cell : 0:01:20.327533


In [39]:
df_duplicates.head()

Unnamed: 0,Title,Body,Tags,cnt_dup
0,Implementing Boundary Value Analysis of S...,<pre><code>#include&lt;iostream&gt;\n#include&...,c++ c,1
1,Dynamic Datagrid Binding in Silverlight?,<p>I should do binding for datagrid dynamicall...,c# silverlight data-binding,1
2,Dynamic Datagrid Binding in Silverlight?,<p>I should do binding for datagrid dynamicall...,c# silverlight data-binding columns,1
3,java.lang.NoClassDefFoundError: javax/serv...,"<p>I followed the guide in <a href=""http://sta...",jsp jstl,1
4,java.sql.SQLException:[Microsoft][ODBC Dri...,<p>I use the following code</p>\n\n<pre><code>...,java jdbc,2


In [40]:
df_duplicates.shape

(4206315, 4)

In [41]:
print(f"Percentage of duplicate rows: {((rows['count(*)'].values[0] - df_duplicates.shape[0]) / rows['count(*)'].values[0]) * 100}")



Percentage of duplicate rows: 30.292038906260256


In [42]:
df_duplicates.cnt_dup.value_counts()

1    2656284
2    1272336
3     277575
4         90
5         25
6          5
Name: cnt_dup, dtype: int64
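The same duplicate analysis can be sketched in pandas on a small made-up frame: grouping on all three columns plays the role of the SQL `GROUP BY`, and the duplicate percentage follows from the difference between the total and the number of distinct groups. The data below is illustrative only.

```python
import pandas as pd

# Hypothetical questions: q1 appears twice, q2 once, q3 three times.
df = pd.DataFrame({
    'Title': ['q1', 'q1', 'q2', 'q3', 'q3', 'q3'],
    'Body':  ['b1', 'b1', 'b2', 'b3', 'b3', 'b3'],
    'Tags':  ['c', 'c', 'java', 'python', 'python', 'python'],
})

# dropna=False keeps rows whose Tags are missing, as SQL GROUP BY does with NULL.
grouped = (df.groupby(['Title', 'Body', 'Tags'], dropna=False)
             .size()
             .reset_index(name='cnt_dup'))

# Share of rows that are redundant copies of another row.
duplicate_pct = (len(df) - len(grouped)) / len(df) * 100

print(grouped)
print(grouped['cnt_dup'].value_counts().sort_index())
print(f"{duplicate_pct:.1f}% duplicate rows")
```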

In [46]:
df_duplicates[df_duplicates['Tags'].isnull()]

Unnamed: 0,Title,Body,Tags,cnt_dup
777547,Do we really need NULL?,<blockquote>\n <p><strong>Possible Duplicate:...,,1
962680,Find all values that are not null and not in a...,<p>I am running into a problem which results i...,,1
1126558,Handle NullObjects,<p>I have done quite a bit of research on best...,,1
1256102,How do Germans call null,"<p>In german null means 0, so how do they call...",,1
2430668,Page cannot be null. Please ensure that this o...,<p>I get this error when i remove dynamically ...,,1
3329908,"What is the difference between NULL and ""0""?","<p>What is the difference from NULL and ""0""?</...",,1
3551595,a bit of difference between null and space,<p>I was just reading this quote</p>\n\n<block...,,2


In [52]:
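# Assign the result back; in-place fillna on a selected column is deprecated in recent pandas.
df_duplicates['Tags'] = df_duplicates['Tags'].fillna('')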

In [62]:
start = datetime.now()
# split() with no argument returns [] for an empty string, so untagged questions count as 0 tags.
df_duplicates['tags_count'] = df_duplicates['Tags'].apply(lambda x: len(x.split()))
end = datetime.now()
print(f'Time taken to count tags:: {end - start}')
df_duplicates.head()

Time taken to count tags:: 0:00:02.471747


Unnamed: 0,Title,Body,Tags,cnt_dup,tags_count
0,Implementing Boundary Value Analysis of S...,<pre><code>#include&lt;iostream&gt;\n#include&...,c++ c,1,2
1,Dynamic Datagrid Binding in Silverlight?,<p>I should do binding for datagrid dynamicall...,c# silverlight data-binding,1,3
2,Dynamic Datagrid Binding in Silverlight?,<p>I should do binding for datagrid dynamicall...,c# silverlight data-binding columns,1,4
3,java.lang.NoClassDefFoundError: javax/serv...,"<p>I followed the guide in <a href=""http://sta...",jsp jstl,1,2
4,java.sql.SQLException:[Microsoft][ODBC Dri...,<p>I use the following code</p>\n\n<pre><code>...,java jdbc,2,2


In [54]:
# if not os.path.isfile(path/'train_no_dup.db'):
#     disk_dup = create_engine("sqlite:///train_no_dup.db")
#     no_dup = pd.DataFrame(df_duplicates, columns=['Title', 'Body', 'Tags'])
#     no_dup.to_sql('no_dup_train',disk_dup)

In [60]:
start = datetime.now()
con = sqlite3.connect('train_no_dup.db')
tag_data = pd.read_sql_query("""SELECT Tags FROM no_dup_train""", con)
#Always remember to close the database
con.close()

# Drop the first row.
tag_data.drop(tag_data.index[0], inplace=True)
# Preview the first 5 rows of the data frame.
tag_data.head()
end = datetime.now()
print(f'Time taken to read the tags:: {end - start}')

Time taken to read the tags:: 0:00:06.373139


In [64]:
df_duplicates[df_duplicates.Tags.isnull()]

Unnamed: 0,Title,Body,Tags,cnt_dup,tags_count


## Analysis of Tags

In [65]:
vectorizer = CountVectorizer(tokenizer=lambda x: x.split())
tags_dtm = vectorizer.fit_transform(df_duplicates.Tags)

In [66]:
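# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
tags = list(vectorizer.get_feature_names_out())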
#Lets look at the tags we have.
print("Some of the tags we have :", tags[:10])

Some of the tags we have : ['.a', '.app', '.asp.net-mvc', '.aspxauth', '.bash-profile', '.class-file', '.cs-file', '.doc', '.drv', '.ds-store']


### Number of times a tag appeared

In [72]:
freqs = tags_dtm.sum(axis=0).A1
result = dict(zip(tags, freqs))

In [73]:
import gc; gc.collect()

172

In [74]:
if not os.path.isfile(path/'tag_counts_dict_dtm.csv'):
    # newline='' lets the csv module control line endings.
    with open(path/'tag_counts_dict_dtm.csv', 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        for key, value in result.items():
            writer.writerow([key, value])
tag_df = pd.read_csv(path/"tag_counts_dict_dtm.csv", names=['Tags', 'Counts'])
tag_df.head()

Unnamed: 0,Tags,Counts
0,.a,18
1,.app,37
2,.asp.net-mvc,1
3,.aspxauth,21
4,.bash-profile,138


In [None]:
tag_df_sorted = tag_df.sort_values(['Counts'], ascending=False)
tag_counts = tag_df_sorted.