<h1 style="text-align:center"> Stack Overflow: Tag Prediction </h1>

<img src='../images/so_tag/pic1.jpg'/>

In [4]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import sqlite3
import csv
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from wordcloud import WordCloud
import re
import os
from sqlalchemy import create_engine # database connection
import datetime as dt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.metrics import f1_score,precision_score,recall_score
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from skmultilearn.adapt import mlknn
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import LabelPowerset
from sklearn.naive_bayes import GaussianNB
from datetime import datetime
from pathlib import Path

<p style='font-size:24px'><b> Description </b></p>
<p>
Stack Overflow is the largest, most trusted online community for developers to learn, share their programming knowledge, and build their careers.<br />
<br />
Almost every programmer uses Stack Overflow in one way or another. Each month, over 50 million developers come to Stack Overflow to learn, share their knowledge, and build their careers. It features questions and answers on a wide range of topics in computer programming. The website serves as a platform for users to ask and answer questions and, through membership and active participation, to vote questions and answers up or down and to edit them in a fashion similar to a wiki or Digg. As of April 2014, Stack Overflow had over 4,000,000 registered users, and it exceeded 10,000,000 questions in late August 2015. Based on the tags assigned to questions, the eight most discussed topics on the site are Java, JavaScript, C#, PHP, Android, jQuery, Python and HTML.<br />
<br />
</p>

<p style='font-size:18px'><b> Problem Statement </b></p>
Suggest tags based on the content of a question posted on Stack Overflow.

<p style='font-size:18px'><b> Source:  </b> https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/</p>


<h2> Business Objectives and Constraints </h2>

1. Predict as many tags as possible with high precision and recall.
2. Incorrect tags could impact customer experience on StackOverflow.
3. No strict latency constraints.

<h2> Data </h2>

Refer: https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data
<br>
All of the data is in 2 files: Train and Test.<br />
<pre>
<b>Train.csv</b> contains 4 columns: Id,Title,Body,Tags.<br />
<b>Test.csv</b> contains the same columns but without the Tags, which you are to predict.<br />
<b>Size of Train.csv</b> - 6.75GB<br />
<b>Size of Test.csv</b> - 2GB<br />
<b>Number of rows in Train.csv</b> = 6034195<br />
</pre>
The questions are randomized and contain a mix of verbose text sites as well as sites related to math and programming. The number of questions from each site may vary, and no filtering has been performed on the questions (such as closed questions).<br />
<br />


__Data Field Explanation__

Dataset contains 6,034,195 rows. The columns in the table are:<br />
<pre>
<b>Id</b> - Unique identifier for each question<br />
<b>Title</b> - The question's title<br />
<b>Body</b> - The body of the question<br />
<b>Tags</b> - The tags associated with the question in a space-separated format (all lowercase, should not contain tabs '\t' or ampersands '&')<br />
</pre>

<br />

<h2> Mapping to a Machine Learning Problem </h2>



<p> It is a multi-label classification problem  <br>
<b>Multi-label Classification</b>: Multilabel classification assigns to each sample a set of target labels. This can be thought of as predicting properties of a data point that are not mutually exclusive, such as the topics that are relevant to a document. A question on Stack Overflow might be about any of C, Pointers, FileIO and/or memory-management at the same time, or about none of these. <br>
__Credit__: http://scikit-learn.org/stable/modules/multiclass.html
</p>
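To make the multi-label target concrete, the following is a minimal sketch of how space-separated tag strings could be turned into a binary indicator matrix (one column per tag, one row per question). The helper `binarize_tags` and the sample tag strings are hypothetical and only illustrate the idea; this is not necessarily how the targets are built later in this notebook.

```python
def binarize_tags(tag_strings):
    """Turn space-separated tag strings into a binary indicator matrix."""
    # Sorted vocabulary of every tag seen in the data.
    vocab = sorted({tag for tags in tag_strings for tag in tags.split()})
    column = {tag: i for i, tag in enumerate(vocab)}

    matrix = []
    for tags in tag_strings:
        row = [0] * len(vocab)
        for tag in tags.split():
            row[column[tag]] = 1  # a question can switch on several columns
        matrix.append(row)
    return vocab, matrix


# Hypothetical tag strings in the same format as the Tags column.
sample_tags = ["c pointers", "c file-io memory-management", "python"]
vocab, Y = binarize_tags(sample_tags)

print(vocab)
for row in Y:
    print(row)
```

Each row may contain any number of ones, which is what distinguishes this setup from ordinary multi-class classification, where exactly one column would be set.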

<h3> Performance metric </h3>

<b>Micro-Averaged F1-Score (Mean F Score) </b>: 
The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst value at 0. Precision and recall contribute equally to the F1 score. The formula for the F1 score is:

<i>F1 = 2 * (precision * recall) / (precision + recall)</i><br>

In the multi-class and multi-label case, the per-class F1 scores are combined into a single number, and the result depends on the averaging method used (micro or macro). <br>

<b>'Micro f1 score': </b><br>
Calculate metrics globally by counting the total true positives, false negatives and false positives. This is a better metric when we have class imbalance.
<br>

<b>'Macro f1 score': </b><br>
Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
<br>
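The difference between the two averages is easiest to see on a tiny example. The sketch below computes both from binary indicator matrices in plain Python; the function names and the toy labels are assumptions made for illustration, and in practice `sklearn.metrics.f1_score` with `average='micro'` or `average='macro'` would normally be used instead.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); taken as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for binary indicator matrices."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))

    # Macro: average the per-label F1 scores, every label weighted equally.
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1_from_counts(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    return micro, macro


# Toy data: label 0 is frequent and always predicted correctly,
# label 1 is rare and always missed.
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print("micro F1:", micro)
print("macro F1:", macro)
```

Here the micro average is 6/7 (about 0.857) because the frequent label dominates the pooled counts, while the macro average drops to 0.5 because the missed rare label counts as much as the frequent one.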

https://www.kaggle.com/wiki/MeanFScore <br>
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html <br>
<br>
<b> Hamming loss </b>: The Hamming loss is the fraction of labels that are incorrectly predicted. <br>
https://www.kaggle.com/wiki/HammingLoss <br>
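As a small illustration of the definition above, the Hamming loss can be computed by comparing the true and predicted indicator matrices cell by cell. The function name and the toy matrices below are hypothetical; `sklearn.metrics.hamming_loss` provides the same metric.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        1
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
        if t != p
    )
    return wrong / total


# Two questions, three possible tags each (toy data).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 0, 0]]
loss = hamming_loss(y_true, y_pred)
print("Hamming loss:", loss)
```

Two of the six label slots are wrong here (one spurious tag, one missed tag), so the loss is 2/6.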

## Read the data file

In [5]:
path = Path('../data/so_tag')
list(path.iterdir())

[PosixPath('../data/so_tag/._lr_with_more_title_weight.pkl'),
 PosixPath('../data/so_tag/OpenedFilesView.exe'),
 PosixPath('../data/so_tag/Test.csv'),
 PosixPath('../data/so_tag/tokenaizer.JPG'),
 PosixPath('../data/so_tag/._Titlemoreweight.db'),
 PosixPath('../data/so_tag/._tokenaizer.JPG'),
 PosixPath('../data/so_tag/train_no_dup-003.db'),
 PosixPath('../data/so_tag/._train_no_dup-003.db'),
 PosixPath('../data/so_tag/._SO_Tag_Predictor.ipynb'),
 PosixPath('../data/so_tag/._lr_with_equal_weight-002.pkl'),
 PosixPath('../data/so_tag/._Processed.db'),
 PosixPath('../data/so_tag/Processed.db'),
 PosixPath('../data/so_tag/lr_with_equal_weight-002.pkl'),
 PosixPath('../data/so_tag/SO_Tag_Predictor.ipynb'),
 PosixPath('../data/so_tag/SampleSubmission.csv'),
 PosixPath('../data/so_tag/lr_with_more_title_weight.pkl'),
 PosixPath('../data/so_tag/Titlemoreweight.db'),
 PosixPath('../data/so_tag/._tag_counts_dict_dtm.csv'),
 PosixPath('../data/so_tag/Train.csv'),
 PosixPath('../data/so_tag/tag_c

In [6]:
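start = datetime.now()
# Stream Train.csv into a local SQLite database chunk by chunk,
# so the 6.75GB file never has to fit in memory at once.
disk_engine = create_engine('sqlite:///train.db')
chunksize = 180000
j = 0
index_start = 1
# Note: passing names= without header=0 makes pandas read the CSV header line as a data row,
# which would explain why the table ends up with one row more than the 6,034,195 questions.
for df in pd.read_csv(path/'Train.csv', names=['Id', 'Title', 'Body', 'Tags'], chunksize=chunksize, iterator=True, encoding='utf-8'):
    df.index += index_start  # keep the index running across chunks
    j+=1
    # Upper bound on the rows processed so far; the final chunk is only partly full
    print('{} rows'.format(j*chunksize))
    df.to_sql('data', disk_engine, if_exists='append')
    index_start = df.index[-1] + 1
print("Time taken to run this cell :", datetime.now() - start)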

180000 rows
360000 rows
540000 rows
720000 rows
900000 rows
1080000 rows
1260000 rows
1440000 rows
1620000 rows
1800000 rows
1980000 rows
2160000 rows
2340000 rows
2520000 rows
2700000 rows
2880000 rows
3060000 rows
3240000 rows
3420000 rows
3600000 rows
3780000 rows
3960000 rows
4140000 rows
4320000 rows
4500000 rows
4680000 rows
4860000 rows
5040000 rows
5220000 rows
5400000 rows
5580000 rows
5760000 rows
5940000 rows
6120000 rows
Time taken to run this cell : 0:03:41.212066


In [7]:
start = datetime.now()
con = sqlite3.connect('train.db')
num_rows = pd.read_sql_query("""SELECT count(*) FROM data""", con)
print("Number of rows in the database :","\n",num_rows['count(*)'].values[0])
# Always remember to close the database connection
con.close()
print("Time taken to count the number of rows :", datetime.now() - start)


Number of rows in the database : 
 6034196
Time taken to count the number of rows : 0:00:00.104634


## Check for duplicates

In [10]:
start = datetime.now()
con = sqlite3.connect('train.db')
df_duplicates = pd.read_sql_query('SELECT title, body, tags, COUNT(*) as cnt_dup FROM data GROUP BY title, body, tags', con)
con.close()
print("Time taken to run this cell :", datetime.now() - start)



Time taken to run this cell : 0:01:32.577932


In [11]:
df_duplicates.head()

Unnamed: 0,Title,Body,Tags,cnt_dup
0,Implementing Boundary Value Analysis of S...,<pre><code>#include&lt;iostream&gt;\n#include&...,c++ c,1
1,Dynamic Datagrid Binding in Silverlight?,<p>I should do binding for datagrid dynamicall...,c# silverlight data-binding,1
2,Dynamic Datagrid Binding in Silverlight?,<p>I should do binding for datagrid dynamicall...,c# silverlight data-binding columns,1
3,java.lang.NoClassDefFoundError: javax/serv...,"<p>I followed the guide in <a href=""http://sta...",jsp jstl,1
4,java.sql.SQLException:[Microsoft][ODBC Dri...,<p>I use the following code</p>\n\n<pre><code>...,java jdbc,2


In [12]:
df_duplicates.shape

(4206315, 4)

In [28]:
# Share of rows (in percent) that are exact repeats of an earlier (title, body, tags) combination
print(f"Total duplicate rows {((num_rows['count(*)'].values[0] - df_duplicates.shape[0]) / num_rows['count(*)'].values[0]) * 100}")



Total duplicate rows 30.292038906260256
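The duplicate percentage above is the gap between the total row count and the number of distinct (title, body, tags) groups. The following self-contained sketch reproduces that arithmetic on a tiny in-memory SQLite table; the sample rows are made up for illustration, and only the table and column names mirror the ones used in this notebook.

```python
import sqlite3

# Hypothetical rows: the first two are exact duplicates, the third has the
# same title and body but different tags, so it forms its own group.
rows = [
    ("How to read a file?", "<p>body A</p>", "python file-io"),
    ("How to read a file?", "<p>body A</p>", "python file-io"),
    ("How to read a file?", "<p>body A</p>", "python"),
    ("Pointer arithmetic", "<p>body B</p>", "c pointers"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (Title TEXT, Body TEXT, Tags TEXT)")
con.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

total = con.execute("SELECT COUNT(*) FROM data").fetchone()[0]
# One output row per distinct (Title, Body, Tags) combination.
distinct = con.execute(
    "SELECT COUNT(*) FROM (SELECT 1 FROM data GROUP BY Title, Body, Tags)"
).fetchone()[0]
con.close()

duplicate_pct = (total - distinct) / total * 100
print(f"{total} rows, {distinct} distinct, {duplicate_pct}% duplicates")
```

Grouping on all three columns means that a question reposted with different tags is kept as a separate row, which matches the `GROUP BY title, body, tags` query used above.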


In [21]:
num_rows.shape[0]

1

In [24]:
num_rows['count(*)'].values[0]

6034196