<a href="https://colab.research.google.com/github/mrparamvir/Multiclass-Multilabel-prediction-For-stack-overflow-Questions/blob/main/Multiclass_Multilabel_prediction_For_stack_overflow_Questions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Multiclass Multilabel Prediction for Stack Overflow Questions 🔎📝

End-to-end **multiclass, multilabel prediction** of tags for **Stack Overflow questions**.

### 1. Problem

Given the text of a question, predict the **tags** associated with it.

### 2. Data
[Kaggle's StackSample](https://www.kaggle.com/stackoverflow/stacksample): 10% of Stack Overflow Q&A.

### 3. Features
**Questions:** for all non-deleted **Stack Overflow questions** whose ID is a multiple of 10:
- Title
- Body
- Creation Date
- Closed Date (if applicable)
- Score
- Owner ID

**Answers:** for each answer to these questions:
- Body
- Creation Date
- Score
- Owner ID

The `ParentId` column links back to the Questions table.

**Tags:** contains the **tags** on each of these questions.
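
A rough sketch of how the three tables relate, using tiny made-up frames rather than the real CSV files: `ParentId` in Answers and `Id` in Tags both point at `Id` in Questions. The toy rows and the `toy_*` names are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for Questions.csv, Answers.csv and Tags.csv (made-up rows)
toy_questions = pd.DataFrame({"Id": [80, 90], "Title": ["q80", "q90"]})
toy_answers = pd.DataFrame({"Id": [124, 92], "ParentId": [80, 90], "Score": [12, 13]})
toy_tags = pd.DataFrame({"Id": [80, 80, 90], "Tag": ["flex", "air", "svn"]})

# Answers link to their question through ParentId
qa = toy_questions.merge(
    toy_answers, left_on="Id", right_on="ParentId",
    suffixes=("_question", "_answer"),
)

# Tags link to their question through Id; one question can own several tags
tags_per_question = toy_tags.groupby("Id")["Tag"].apply(list)

print(qa[["Id_question", "Id_answer"]])
print(tags_per_question)
```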

## Import Necessary Tools

In [None]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from pylab import rcParams
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Default figure size for all plots
rcParams['figure.figsize'] = 10, 10

## Getting our data ready

In [None]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json



In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle

# change the permission
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d stackoverflow/stacksample

Downloading stacksample.zip to /content
 99% 1.09G/1.11G [00:09<00:00, 141MB/s]
100% 1.11G/1.11G [00:09<00:00, 120MB/s]


In [None]:
# Unzip the downloaded archive into the current working directory
!unzip "stacksample.zip"

Archive:  stacksample.zip
  inflating: Answers.csv             
  inflating: Questions.csv           
  inflating: Tags.csv                


In [None]:
answers1 = pd.read_csv('Answers.csv', encoding='ISO-8859-1')
questions1 = pd.read_csv('Questions.csv', encoding='ISO-8859-1')
tags1 = pd.read_csv('Tags.csv')

In [None]:
answers = answers1.copy()
questions = questions1.copy()
tags = tags1.copy()

In [None]:
answers.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body
0,92,61.0,2008-08-01T14:45:37Z,90,13,"<p><a href=""http://svnbook.red-bean.com/"">Vers..."
1,124,26.0,2008-08-01T16:09:47Z,80,12,<p>I wound up using this. It is a kind of a ha...
2,199,50.0,2008-08-01T19:36:46Z,180,1,<p>I've read somewhere the human eye can't dis...
3,269,91.0,2008-08-01T23:49:57Z,260,4,"<p>Yes, I thought about that, but I soon figur..."
4,307,49.0,2008-08-02T01:49:46Z,260,28,"<p><a href=""http://www.codeproject.com/Article..."


In [None]:
questions.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body
0,80,26.0,2008-08-01T13:57:07Z,,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...
1,90,58.0,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...
2,120,83.0,2008-08-01T15:50:08Z,,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...
3,180,2089740.0,2008-08-01T18:42:19Z,,53,Function for creating color wheels,<p>This is something I've pseudo-solved many t...
4,260,91.0,2008-08-01T23:22:08Z,,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...


In [None]:
tags.head()

Unnamed: 0,Id,Tag
0,80,flex
1,80,actionscript-3
2,80,air
3,90,svn
4,90,tortoisesvn


In [None]:
print('Shape of Answers dataset:', answers.shape)
print('Shape of Questions dataset:', questions.shape)
print('Shape of Tags dataset:', tags.shape)

Shape of Answers dataset: (2014516, 6)
Shape of Questions dataset: (1264216, 7)
Shape of Tags dataset: (3750994, 2)


In [None]:
print('Number of unique Scores :', questions['Score'].nunique())

questions[questions.Score >= 5].shape

Number of unique Scores : 532


(93153, 7)

In [None]:
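# Keep only questions with a score of at least 5; copy so later in-place edits act on an independent frame
questions = questions[questions.Score >= 5].copy()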
print(questions.shape)
questions.head()

(93153, 7)


Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body
0,80,26.0,2008-08-01T13:57:07Z,,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...
1,90,58.0,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...
2,120,83.0,2008-08-01T15:50:08Z,,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...
3,180,2089740.0,2008-08-01T18:42:19Z,,53,Function for creating color wheels,<p>This is something I've pseudo-solved many t...
4,260,91.0,2008-08-01T23:22:08Z,,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...


In [None]:
questions.drop(columns=['OwnerUserId', 'CreationDate', 'ClosedDate', 'Score'], inplace=True)

In [None]:
top10_tags = list(tags.Tag.value_counts()[:10].index)
print(top10_tags)

# Keep only the rows whose tag is one of the ten most frequent tags
new_tags = tags[tags.Tag.isin(top10_tags)].reset_index(drop=True)
print(new_tags.shape)
new_tags.head()

['javascript', 'java', 'c#', 'php', 'android', 'jquery', 'python', 'html', 'c++', 'ios']
(826739, 2)


Unnamed: 0,Id,Tag
0,260,c#
1,330,c++
2,650,c#
3,930,c#
4,1010,c#


In [None]:
questions.isnull().sum()

Id       0
Title    0
Body     0
dtype: int64

In [None]:
new_tags.isnull().sum()

Id     0
Tag    0
dtype: int64

In [None]:
df = questions.merge(new_tags, on='Id')
print(df.shape)
df.head()

(56008, 4)


Unnamed: 0,Id,Title,Body,Tag
0,260,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...,c#
1,330,Should I use nested classes in this case?,<p>I am working on a collection of classes use...,c++
2,650,Automatically update version number,<p>I would like the version property of my app...,c#
3,930,How do I connect to a database and loop over a...,<p>What's the simplest way to connect and quer...,c#
4,1010,"How to get the value of built, encoded ViewState?",<p>I need to grab the base64-encoded represent...,c#
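
The merge above yields one row per (question, tag) pair, so a question that carries two of the top-10 tags appears twice, each time with a different single label. For a multilabel target, one possible alternative is to collect each question's tags into a list and binarize them. The sketch below uses made-up rows and scikit-learn's `MultiLabelBinarizer`; the `toy_*` names are hypothetical and this is not what the notebook itself does.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy (question Id, tag) pairs; question 20 carries two tags
toy_tags = pd.DataFrame({
    "Id": [10, 20, 20, 30],
    "Tag": ["python", "javascript", "jquery", "java"],
})
toy_questions = pd.DataFrame({"Id": [10, 20, 30], "Title": ["t10", "t20", "t30"]})

# One row per question, with all of its tags collected into a list
tag_lists = toy_tags.groupby("Id")["Tag"].apply(list).reset_index(name="Tags")
toy_df = toy_questions.merge(tag_lists, on="Id")

# Binary indicator matrix: one row per question, one column per tag
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(toy_df["Tags"])

print(list(mlb.classes_))
print(Y)
```

Each row of `Y` can then hold several ones, which is what distinguishes a multilabel target from the single integer label used in this notebook.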


In [None]:
df.isnull().sum()

Id       0
Title    0
Body     0
Tag      0
dtype: int64

In [None]:
df.drop(columns=['Id'], inplace=True)

In [None]:
train_size = round(df.shape[0]*0.8)
test_size = df.shape[0] - train_size
print(train_size)
print(test_size)

44806
11202
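
The split sizes above are applied to the rows in file order, which appears to follow question ID and creation date, so the tag mix in the last 20% may differ from the first 80%. A shuffled, stratified split is one common alternative. The sketch below uses a made-up frame and scikit-learn's `train_test_split`; the `random_state` value is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `df` (made-up rows)
toy_df = pd.DataFrame({
    "Title": [f"title {i}" for i in range(10)],
    "Body": [f"body {i}" for i in range(10)],
    "Tag": ["java", "python"] * 5,
})

X_toy = toy_df.drop(columns=["Tag"])
y_toy = toy_df["Tag"]

# Shuffle before splitting and keep the tag proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().to_dict())
```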


In [None]:
X = df.drop(['Tag'],axis=1)
y = df['Tag']

In [None]:
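from sklearn.preprocessing import LabelEncoder

# Keep a copy of the raw tag labels and split it the same way as the features
y1 = df['Tag']
y_train1, y_test1 = y1[:train_size], y1[train_size:]

# Fit the encoder on all labels so the tag-to-integer mapping matches the one used for `y`
LE = LabelEncoder()
LE.fit(y1)
y_test1 = LE.transform(y_test1)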

In [None]:
from sklearn.preprocessing import LabelEncoder

LE = LabelEncoder()
y = LE.fit_transform(y)

In [None]:
y[y==7]

array([7, 7, 7, ..., 7, 7, 7])
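
To see which tag an integer code stands for, the encoder's `classes_` attribute can be inspected. A small sketch, assuming the encoder sees exactly the ten tags printed earlier (the name `encoder` is hypothetical); under that assumption code 7 would correspond to `jquery`, because `LabelEncoder` sorts its classes.

```python
from sklearn.preprocessing import LabelEncoder

# The ten tags printed earlier in the notebook
top10_tags = ['javascript', 'java', 'c#', 'php', 'android',
              'jquery', 'python', 'html', 'c++', 'ios']

encoder = LabelEncoder()
encoder.fit(top10_tags)

# classes_ is sorted, so position i holds the tag encoded as integer i
code_to_tag = dict(enumerate(encoder.classes_))
print(code_to_tag)

# Decode integer predictions back into tag names
print(encoder.inverse_transform([7, 8]))
```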

In [None]:
# Copy the slices so the preprocessing steps below modify independent frames
X_train, X_test = X[:train_size].copy(), X[train_size:].copy()
y_train, y_test = y[:train_size], y[train_size:]
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

(44806, 2) (11202, 2)
(44806,) (11202,)


In [None]:
X_train.head()

Unnamed: 0,Title,Body
0,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...
1,Should I use nested classes in this case?,<p>I am working on a collection of classes use...
2,Automatically update version number,<p>I would like the version property of my app...
3,How do I connect to a database and loop over a...,<p>What's the simplest way to connect and quer...
4,"How to get the value of built, encoded ViewState?",<p>I need to grab the base64-encoded represent...


In [None]:
X_test.head()

Unnamed: 0,Title,Body
44806,CoreData - Update model class instead of creat...,<p>I am using CoreData in my iOS application. ...
44807,How does this declaration invoke the Most Vexi...,<p>Consider the following program:</p>\n\n<pre...
44808,Eclipse Errors/Warnings ignore assert in unuse...,<p>I have had 'hidden' bugs due to Eclipse not...
44809,Does msvcrt uses a different heap for allocati...,<p>I've read about that some time ago but am u...
44810,"The method show(FragmentManager, String) in th...",<p>I have a problem with Fragments.</p>\n\n<p>...


### Text Preprocessing

In [None]:
!pip install beautifulsoup4



In [None]:
from bs4 import BeautifulSoup

# Strip HTML markup; the parser is named explicitly so the result does not depend on which parsers are installed
def strip_html(text):
    return BeautifulSoup(text, 'html.parser').get_text()

for col in ['Body', 'Title']:
    X_train[col] = X_train[col].apply(strip_html)
    X_test[col] = X_test[col].apply(strip_html)

In [None]:
X_train.replace('[^a-zA-Z]',' ', regex=True, inplace=True)
X_test.replace('[^a-zA-Z]',' ', regex=True, inplace=True)

In [None]:
# Lowercase every text column
for col in X_train.columns:
  X_train[col] = X_train[col].str.lower()

for col in X_test.columns:
  X_test[col] = X_test[col].str.lower()

In [None]:
# Collapse runs of whitespace into a single space (raw string avoids an invalid-escape warning)
X_train = X_train.replace(r'\s+', ' ', regex=True)
X_test = X_test.replace(r'\s+', ' ', regex=True)
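
The regex steps above can be tried on a single string to see their combined effect. This is a minimal sketch; `clean_text` is a hypothetical helper, the sample sentence is made up, and the final `strip()` is an extra step that the cells above do not perform.

```python
import re

def clean_text(text):
    """Apply the notebook's regex-level cleaning steps to one string."""
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # every non-letter becomes a space
    text = text.lower()                      # lowercase
    text = re.sub(r'\s+', ' ', text)         # collapse runs of whitespace
    return text.strip()                      # extra step, not in the cells above

cleaned = clean_text("Parsing JSON in C# vs. C++ (2 ways)")
print(cleaned)
```

One side effect worth knowing: because `#` and `+` are removed, `C#` and `C++` both reduce to the single token `c`, so the cleaned text alone no longer tells the two languages apart.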

In [None]:
X_train

Unnamed: 0,Title,Body
0,adding scripting functionality to net applicat...,i have a little game written in c it uses a da...
1,should i use nested classes in this case,i am working on a collection of classes used f...
2,automatically update version number,i would like the version property of my applic...
3,how do i connect to a database and loop over a...,what s the simplest way to connect and query a...
4,how to get the value of built encoded viewstate,i need to grab the base encoded representation...
...,...,...
44801,finally equivalent for if elif statements in ...,does python have a finally equivalent for its ...
44802,os listdir is removing character accent,in windows file explorer create a new txt file...
44803,does using the this keyword affect java perfor...,does using the this keyword affect java perfor...
44804,aligning on the same line,i m just trying to align some text on the same...


In [None]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
from nltk import sent_tokenize, word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
type(X_train)

pandas.core.frame.DataFrame

In [None]:
stop_words = set(stopwords.words('english')) 
len(stop_words)
X_train['Body'] = X_train['Body'].apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))
X_train['Title'] = X_train['Title'].apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))

X_test['Body'] = X_test['Body'].apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))
X_test['Title'] = X_test['Title'].apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))

In [None]:
stemmer = nltk.SnowballStemmer(language='english')

X_train['Body'] = X_train['Body'].apply(lambda x: ' '.join(stemmer.stem(term) for term in x.split()))
X_train['Title'] = X_train['Title'].apply(lambda x: ' '.join(stemmer.stem(term) for term in x.split()))

X_test['Body'] = X_test['Body'].apply(lambda x: ' '.join(stemmer.stem(term) for term in x.split()))
X_test['Title'] = X_test['Title'].apply(lambda x: ' '.join(stemmer.stem(term) for term in x.split()))

In [None]:
# Join title and body into one string per question
train_lines = (X_train['Title'].astype(str) + ' ' + X_train['Body'].astype(str)).tolist()
test_lines = (X_test['Title'].astype(str) + ' ' + X_test['Body'].astype(str)).tolist()

In [1]:
train_lines

In [2]:
test_lines

In [None]:
len(train_lines), len(test_lines)

(44806, 11202)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

countvector = CountVectorizer()
X_train_cv = countvector.fit_transform(train_lines)
X_test_cv = countvector.transform(test_lines)

In [None]:
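from sklearn.feature_extraction.text import TfidfTransformer

tfidfvector = TfidfTransformer()
# Learn the IDF weights from the training counts only
X_train_tf = tfidfvector.fit_transform(X_train_cv)
# Reuse the training IDF weights on the test counts; refitting here would weight the test set differently
X_test_tf = tfidfvector.transform(X_test_cv)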
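
The vocabulary and IDF weights are meant to be learned from the training text and then reused unchanged on the test text. A compact sketch of that pattern with `TfidfVectorizer`, which combines counting and TF-IDF weighting in one object; the toy documents are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for train_lines / test_lines
toy_train = ["java string builder", "python list comprehension", "java list iterator"]
toy_test = ["python string format"]

vectorizer = TfidfVectorizer()
toy_train_tf = vectorizer.fit_transform(toy_train)  # learns vocabulary and IDF weights
toy_test_tf = vectorizer.transform(toy_test)        # reuses them, learns nothing new

print(toy_train_tf.shape, toy_test_tf.shape)
print(sorted(vectorizer.vocabulary_))
```

The word `format` occurs only in the test document, so it is absent from the vocabulary and ignored at transform time.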

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier # Binary Relevance

lr = LogisticRegression()
clf = OneVsRestClassifier(lr)

In [None]:
clf.fit(X_train_tf, y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)
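
Because `y_train` here is a one-dimensional array of integer codes, the fitted `OneVsRestClassifier` predicts exactly one tag per question. For comparison, the sketch below fits the same kind of estimator on a binary indicator matrix so that a question can receive several tags. The texts, tag lists, variable names and `max_iter` value are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy questions, each with one or more tags
toy_texts = [
    "jquery selector in javascript",
    "javascript closure scope",
    "java generics wildcard",
    "java thread pool",
    "jquery ajax callback javascript",
    "javascript promise chain",
]
toy_tag_lists = [
    ["javascript", "jquery"], ["javascript"], ["java"],
    ["java"], ["javascript", "jquery"], ["javascript"],
]

# Indicator matrix: one column per tag
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(toy_tag_lists)

features = TfidfVectorizer().fit_transform(toy_texts)

# One independent binary classifier per tag column
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(features, Y)

pred = ovr.predict(features)
print(pred.shape)
print(mlb.inverse_transform(pred))
```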

In [None]:
y_pred = clf.predict(X_test_tf)
y_pred

array([4, 2, 5, ..., 0, 2, 0])

In [None]:
y_pred[y_pred==8]

array([8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

[[1341   71   17    2    9  166   32    2    6    2]
 [  14 1068   35    5    7   19   12    4    7    7]
 [  10   63 1018    1    4   21    2    1    6   14]
 [  10   15    2  274    5    7  134   69   22   10]
 [  35  107   37    5  933   18   35    1   10   16]
 [ 138  103   21    5    3 1247   14    3    6    8]
 [  24  100   14  111   11   25 1247  161   36   26]
 [   3    5    1   71    4    2  181  195   13    4]
 [   2   47    1    9    3   14   23   12  484    4]
 [   1   45   21    2    1   11   12    2    4 1011]]
              precision    recall  f1-score   support

           0       0.85      0.81      0.83      1648
           1       0.66      0.91      0.76      1178
           2       0.87      0.89      0.88      1140
           3       0.56      0.50      0.53       548
           4       0.95      0.78      0.86      1197
           5       0.82      0.81      0.81      1548
           6       0.74      0.71      0.72      1755
           7       0.43      0.41   

### Accuracy Obtained with Logistic Regression: about 78.7%
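
As a cross-check, the accuracy can be recomputed from the confusion matrix printed above: the diagonal holds the correctly classified questions, so accuracy is the trace divided by the total, here 8818 / 11202. The sketch copies the printed matrix into NumPy; the variable names are hypothetical.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = np.array([
    [1341,   71,   17,    2,    9,  166,   32,    2,    6,    2],
    [  14, 1068,   35,    5,    7,   19,   12,    4,    7,    7],
    [  10,   63, 1018,    1,    4,   21,    2,    1,    6,   14],
    [  10,   15,    2,  274,    5,    7,  134,   69,   22,   10],
    [  35,  107,   37,    5,  933,   18,   35,    1,   10,   16],
    [ 138,  103,   21,    5,    3, 1247,   14,    3,    6,    8],
    [  24,  100,   14,  111,   11,   25, 1247,  161,   36,   26],
    [   3,    5,    1,   71,    4,    2,  181,  195,   13,    4],
    [   2,   47,    1,    9,    3,   14,   23,   12,  484,    4],
    [   1,   45,   21,    2,    1,   11,   12,    2,    4, 1011],
])

correct = int(np.trace(cm))   # correctly classified questions
total = int(cm.sum())         # all test questions
accuracy = correct / total

# Per-class recall: diagonal divided by the row sums (true-class counts)
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(correct, total, round(accuracy, 4))
print(np.round(recall_per_class, 2))
```

The per-class recall makes the weak spots visible: classes 3 and 7 are recovered far less often than the others, and most of their errors land in class 6.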