![alt text](https://cdn.sstatic.net/Sites/stackoverflow/company/img/logos/se/se-logo.svg?v=d29f0785ebb7)

The objective of this notebook is to build a model that automatically predicts tags for a given StackExchange question from the text of the question, using Word2Vec embeddings (gensim) and an LSTM built with Keras (TensorFlow).

Dataset: over 85,000 questions and over 1,300 unique tags.

The question-answering site StackOverflow allows users to assign tags to questions to make them easier for other people to find. Furthermore, experts on a certain topic can subscribe to tags to receive digests of new questions that they might be able to answer. It is therefore in the interest of both the original poster and the people who could answer that a question is assigned appropriate tags.


## **Importing Libraries**

In [16]:
import zipfile
import ast
import pandas as pd
import re
pd.set_option('display.max_colwidth', 200)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from sklearn import metrics
from sklearn.preprocessing import MultiLabelBinarizer

In [43]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Embedding, LSTM, Dense, BatchNormalization, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [13]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from bs4 import BeautifulSoup

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Unzipping the CSV files🤐

In [None]:
# Paths of the ZIP files
question_zip_url = '/content/drive/MyDrive/Assignment 7 Auto Tag Prediction/Questions.csv.zip'
ans_zip_url = '/content/drive/MyDrive/Assignment 7 Auto Tag Prediction/Answers.csv.zip'
tags_zip_url = '/content/drive/MyDrive/Assignment 7 Auto Tag Prediction/Tags.csv.zip'

In [None]:
# Extracted directory
extracted_dir_path = '/content/drive/MyDrive/Assignment 7 Auto Tag Prediction/'

In [None]:
# One-time extraction of the three archives, kept disabled as comments
# Unzipping questions
# with zipfile.ZipFile(question_zip_url, 'r') as zip_ref:
#     zip_ref.extractall(extracted_dir_path)
# Unzipping answers
# with zipfile.ZipFile(ans_zip_url, 'r') as zip_ref:
#     zip_ref.extractall(extracted_dir_path)
# Unzipping tags
# with zipfile.ZipFile(tags_zip_url, 'r') as zip_ref:
#     zip_ref.extractall(extracted_dir_path)

### Loading the data🔃

In [18]:
# Question
question = pd.read_csv('/content/drive/MyDrive/Assignment 7 Auto Tag Prediction/Questions.csv', encoding='latin1')
# Answers
answers = pd.read_csv('/content/drive/MyDrive/Assignment 7 Auto Tag Prediction/Answers.csv', encoding='latin1')
# Tags
tags = pd.read_csv('/content/drive/MyDrive/Assignment 7 Auto Tag Prediction/Tags.csv', encoding='latin1')

In [None]:
question.head(2)

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body
0,6,5.0,2010-07-19T19:14:44Z,272,The Two Cultures: statistics vs. machine learning?,"<p>Last year, I read a blog post from <a href=""http://anyall.org/"">Brendan O'Connor</a> entitled <a href=""http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/"">""Statistics vs. Mach..."
1,21,59.0,2010-07-19T19:24:36Z,4,Forecasting demographic census,<p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census blocks vary in sizes as rural\n...


In [None]:
question['Body'][0]

'<p>Last year, I read a blog post from <a href="http://anyall.org/">Brendan O\'Connor</a> entitled <a href="http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/">"Statistics vs. Machine Learning, fight!"</a> that discussed some of the differences between the two fields.  <a href="http://andrewgelman.com/2008/12/machine_learnin/">Andrew Gelman responded favorably to this</a>:</p>\n\n<p>Simon Blomberg: </p>\n\n<blockquote>\n  <p>From R\'s fortunes\n  package: To paraphrase provocatively,\n  \'machine learning is statistics minus\n  any checking of models and\n  assumptions\'.\n  -- Brian D. Ripley (about the difference between machine learning\n  and statistics) useR! 2004, Vienna\n  (May 2004) :-) Season\'s Greetings!</p>\n</blockquote>\n\n<p>Andrew Gelman:</p>\n\n<blockquote>\n  <p>In that case, maybe we should get rid\n  of checking of models and assumptions\n  more often. Then maybe we\'d be able to\n  solve some of the problems that the\n  machine learning people can

In [None]:
answers.head(2)

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body
0,5,23.0,2010-07-19T19:14:43Z,3,85,"<p>The R-project</p>\n\n<p><a href=""http://www.r-project.org/"">http://www.r-project.org/</a></p>\n\n<p>R is valuable and significant because it was the first widely-accepted Open-Source alternativ..."
1,9,50.0,2010-07-19T19:16:27Z,3,13,"<p><a href=""http://incanter.org/"">Incanter</a> is a Clojure-based, R-like platform (environment + libraries) for statistical computing and graphics. </p>\n"


In [None]:
tags.head(2)

Unnamed: 0,Id,Tag
0,1,bayesian
1,1,prior


## Exploratory Data Analysis & Data Preparation

In [None]:
tags['Tag'].nunique()

1315

In [19]:
# replace "-" with a space in the tags
tags['Tag'] = tags['Tag'].apply(lambda x:re.sub("-"," ",x))

In [20]:
# collect all tags that share a question Id into one array per Id
tags = tags.groupby('Id').apply(lambda x:x['Tag'].values).reset_index(name='tags')
tags.head()

Unnamed: 0,Id,tags
0,1,"[bayesian, prior, elicitation]"
1,2,"[distributions, normality]"
2,3,"[software, open source]"
3,4,"[distributions, statistical significance]"
4,6,[machine learning]


In [21]:
# merge tags and questions
df = pd.merge(question,tags, how = 'inner', on = 'Id')
df = df[['Id','Body','tags']]
df.head(10)

Unnamed: 0,Id,Body,tags
0,6,"<p>Last year, I read a blog post from <a href=""http://anyall.org/"">Brendan O'Connor</a> entitled <a href=""http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/"">""Statistics vs. Mach...",[machine learning]
1,21,<p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census blocks vary in sizes as rural\n...,"[forecasting, population, census]"
2,22,<p>How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning?</p>\n,"[bayesian, frequentist]"
3,31,"<p>After taking a statistics course and then trying to help fellow students, I noticed one subject that inspires much head-desk banging is interpreting the results of statistical hypothesis tests....","[hypothesis testing, t test, p value, interpretation, intuition]"
4,36,"<p>There is an old saying: ""Correlation does not mean causation"". When I teach, I tend to use the following standard examples to illustrate this point:</p>\n\n<ol>\n<li>number of storks and birth ...","[correlation, teaching]"
5,93,"<p>We're trying to use a Gaussian process to model h(t) -- the hazard function -- for a very small initial population, and then fit that using the available data. While this gives us nice plots f...","[nonparametric, survival, hazard]"
6,95,<p>I have been using various GARCH-based models to forecast volatility for various North American equities using historical daily data as inputs.</p>\n\n<p>Asymmetric GARCH models are often cited ...,"[time series, garch, volatility forecasting, finance]"
7,103,<p>What is the best blog on data visualization?</p>\n\n<p>I'm making this question a community wiki since it is highly subjective. Please limit each answer to one link.</p>\n\n<hr>\n\n<p><strong>...,"[data visualization, references]"
8,113,"<p>I have been looking into theoretical frameworks for method selection (note: not model selection) and have found very little systematic, mathematically-motivated work. By 'method selection', I m...","[machine learning, methodology, theory]"
9,114,"<p>What statistical research blogs would you recommend, and why?</p>\n",[references]


In [None]:
df.shape

(85085, 3)

In [22]:
# Count how often each tag occurs

freq = {}
for i in df['tags']:
    for j in i:
        if j in freq.keys():
            freq[j] = freq[j] +1
        else:
            freq[j] =1

In [23]:
# we can sort the dictionary in descending order
freq = dict(sorted(freq.items(), key = lambda x:x[1], reverse= True))

In [24]:
# Top 10 most frequent tags
top_10_tags = list(freq.keys())[:10]
print(top_10_tags)

['r', 'regression', 'machine learning', 'time series', 'probability', 'hypothesis testing', 'self study', 'distributions', 'logistic', 'classification']


`We will use only those questions/queries that carry at least two of the top 10 tags.`
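The counting and filtering steps can also be sketched with `collections.Counter`. This is only an illustration: the `tag_lists` sample is made up rather than taken from the dataset, the helper names are hypothetical, and `k=2` stands in for the notebook's 10.

```python
from collections import Counter

# Hypothetical miniature of df['tags']: one list of tags per question.
tag_lists = [
    ["r", "regression"],
    ["regression", "time series"],
    ["r", "regression", "bayesian"],
    ["regression"],
]

def top_k_tags(tag_lists, k):
    """Return the k most frequent tags, most frequent first."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    return [tag for tag, _ in counts.most_common(k)]

def keep_top_tags(tag_lists, top_tags):
    """Drop every tag that is not in top_tags, preserving the original order."""
    allowed = set(top_tags)
    return [[t for t in tags if t in allowed] for tags in tag_lists]

top_tags = top_k_tags(tag_lists, k=2)
filtered = keep_top_tags(tag_lists, top_tags)
print(top_tags)
print(filtered)
```

Converting `top_tags` to a set makes each membership test constant time, which matters little for 10 tags but keeps the filter cheap if the cutoff is raised.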

In [25]:
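# Keep only the questions that carry at least two of the top 10 tags,
# and keep only those tags as labels

x = []  # question bodies
y = []  # matching lists of top-10 tags

for i in range(len(df['tags'])):
    temp = []
    for j in df['tags'][i]:
        if j in top_10_tags:
            temp.append(j)
    if len(temp) > 1:  # at least two top-10 tags
        x.append(df['Body'][i])
        y.append(temp)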

In [None]:
y[:5]

[['r', 'time series'],
 ['regression', 'distributions'],
 ['distributions', 'probability', 'hypothesis testing'],
 ['hypothesis testing', 'self study'],
 ['r', 'regression', 'time series']]

In [26]:
# Join the labels of each question into one comma-separated string
y = [",".join([str(j) for j in i]) for i in y]

In [None]:
y[:5]

['r,time series',
 'regression,distributions',
 'distributions,probability,hypothesis testing',
 'hypothesis testing,self study',
 'r,regression,time series']

In [27]:
dframe = pd.DataFrame({'query':x, 'tags':y})

In [28]:
dframe.tail()

Unnamed: 0,query,tags
11101,"<p>This is my first post at Cross Validated. I was having a doubt and hoped it can be cleared instead of stackoverflow.</p>\n\n<p>Currently I'm working on a Predictive Model, which takes in server...","machine learning,classification"
11102,"<p>I am working on predicting a time series of daily data for one month that looks like this:\n<a href=""http://i.stack.imgur.com/kxwc1.jpg"" rel=""nofollow""><img src=""http://i.stack.imgur.com/kxwc1....","r,time series"
11103,<p>I am conducting a multifactorial analyisis involving categorical variables by using R. The response is âyesâ or ânoâ (Iâm therefore using binary logistic regression) and the predictor...,"r,regression,logistic"
11104,"<p>In computer science literature, we always see different algorithms are trained with a lot of data (n=100,000), and then they are tested on a test set (n=10,000). Then, often,if one algorithm NU...","machine learning,hypothesis testing"
11105,"<p>Given the two continuous random variables $X$ and $Y$ and a random variable $Z=\{0,1\}$ that denotes groups, the following null-hypothesis is devised:</p>\n\n<p>$$H_0:X\perp Y\:\vert\:Z$$</p>\n...","hypothesis testing,self study"


In [None]:
dframe['query'][0]

"<p>I recently started working for a tuberculosis clinic.  We meet periodically to discuss the number of TB cases we're currently treating, the number of tests administered, etc.  I'd like to start modeling these counts so that we're not just guessing whether something is unusual or not.  Unfortunately, I've had very little training in time series, and most of my exposure has been to models for very continuous data (stock prices) or very large numbers of counts (influenza).  But we deal with 0-18 cases per month (mean 6.68, median 7, var 12.3), which are distributed like this:</p>\n\n<p>[image lost to the mists  of time]</p>\n\n<p>[image eaten by a grue]</p>\n\n<p>I've found a few articles that address models like this, but I'd greatly appreciate hearing suggestions from you - both for approaches and for R packages that I could use to implement those approaches.</p>\n\n<p><strong>EDIT:</strong>  mbq's answer has forced me to think more carefully about what I'm asking here; I got too hu

## Text Cleaning & Preprocessing

In [29]:
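def train_word2vec(tokenized_text):
    """Train a 100-dimensional Word2Vec model on the tokenized questions."""
    model = Word2Vec(sentences=tokenized_text, vector_size=100, window=5, min_count=1)
    return model

def word_to_vec(tokenized_text, model):
    """Return one flat list of floats per question.

    The embeddings of all known tokens are concatenated, so a question with
    n tokens yields a list of n * 100 values and token boundaries are lost.
    """
    vecs = []
    for tokens in tokenized_text:
        vec = []
        for token in tokens:
            if token in model.wv:
                vec.extend(model.wv[token])
        vecs.append(vec)
    return vecs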
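`word_to_vec` concatenates the per-token embeddings into a single flat list. One possible alternative, shown here only as a sketch, keeps one row per token and pads along the token axis, which yields a `(max_tokens, dim)` matrix per question. The `embeddings` dictionary is a hypothetical stand-in for `model.wv`, and `tokens_to_matrix` is not part of the notebook.

```python
import numpy as np

def tokens_to_matrix(tokens, embeddings, max_tokens, dim):
    """Stack known-token vectors into a (max_tokens, dim) array, zero-padded at the end."""
    matrix = np.zeros((max_tokens, dim), dtype=np.float32)
    # Skip unknown tokens and truncate to the first max_tokens known ones.
    known = [embeddings[t] for t in tokens if t in embeddings][:max_tokens]
    if known:
        matrix[: len(known)] = np.asarray(known, dtype=np.float32)
    return matrix

# Hypothetical 3-dimensional embeddings standing in for model.wv.
embeddings = {
    "time": np.array([1.0, 0.0, 0.0]),
    "seri": np.array([0.0, 1.0, 0.0]),
    "model": np.array([0.0, 0.0, 1.0]),
}

m = tokens_to_matrix(["time", "seri", "unknown", "model"], embeddings, max_tokens=5, dim=3)
print(m.shape)
print(m)
```

With this layout each LSTM timestep would see a whole token embedding instead of a single float, at the cost of a larger input tensor.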

In [30]:
def data_clean(df):
    # Strip HTML tags
    df['cleaned_text'] = df['query'].apply(lambda x: BeautifulSoup(x, "html.parser").get_text())
    # Lowercase first, so capitalised stop words ("The", "I") match NLTK's lowercase list
    df['cleaned_text'] = df['cleaned_text'].str.lower()
    # Remove stop words
    df['cleaned_text'] = df['cleaned_text'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))
    # Replace every non-alphabetic character with a space
    df['cleaned_text'] = df['cleaned_text'].apply(lambda x: re.sub("[^a-zA-Z]", " ", x))

    # Tokenize, trim whitespace and stem each token
    df['tokenized_text'] = df['cleaned_text'].apply(lambda x: word_tokenize(x))
    df['tokenized_text'] = df['tokenized_text'].apply(lambda x: [token.strip() for token in x])
    df['tokenized_text'] = df['tokenized_text'].apply(lambda x: [stemmer.stem(word) for word in x])

    # Train Word2Vec on the stemmed tokens and embed every question
    model = train_word2vec(df['tokenized_text'])
    df['vectors'] = word_to_vec(df['tokenized_text'], model)

    return df[['vectors', 'tags']]

In [31]:
df = data_clean(dframe)
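The order of the cleaning steps matters: NLTK's English stop words are lowercase, so a stop-word filter catches capitalised words only if the text has been lowercased beforehand. The sketch below applies the steps in that order using only the standard library. It is an illustration, not the notebook's pipeline: `html.parser` replaces BeautifulSoup, `STOP` is a deliberately tiny made-up list, and stemming is left out.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

# Hypothetical, deliberately tiny stop-word list (the notebook uses NLTK's).
STOP = {"the", "a", "is", "of", "i", "to"}

def clean(html_text):
    """Strip tags, lowercase, keep letters only, then drop stop words."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    text = " ".join(parser.parts).lower()
    text = re.sub(r"[^a-z]", " ", text)
    return [tok for tok in text.split() if tok not in STOP]

print(clean("<p>The mean of <b>X</b> is 6.68</p>"))
```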

**Converting the labels with MultiLabelBinarizer**

In [32]:
def multi_label_binarization(df):

    tags = df['tags'].str.split(',')

    mlb = MultiLabelBinarizer()

    binary_tags = mlb.fit_transform(tags)

    binary_tags_df = pd.DataFrame(binary_tags, columns=mlb.classes_)

    df = pd.concat([df, binary_tags_df], axis=1)

    df = df.drop(columns=['tags'])

    return df

In [33]:
final_df = multi_label_binarization(df)
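To make the encoding concrete, here is a small pure-Python sketch of a multi-hot transformation on three of the label strings shown earlier. The helpers `fit_classes` and `to_multi_hot` are hypothetical names used for illustration. They mimic the idea behind `MultiLabelBinarizer` (one column per class, in sorted order) but are not its implementation.

```python
def fit_classes(label_strings, sep=","):
    """Collect the sorted list of distinct labels."""
    return sorted({lab for s in label_strings for lab in s.split(sep)})

def to_multi_hot(label_strings, classes, sep=","):
    """Encode each label string as a 0/1 row with one column per class."""
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for s in label_strings:
        row = [0] * len(classes)
        for lab in s.split(sep):
            row[index[lab]] = 1
        rows.append(row)
    return rows

labels = ["r,time series", "regression,distributions", "r,regression,time series"]
classes = fit_classes(labels)
encoded = to_multi_hot(labels, classes)
print(classes)
print(encoded)
```

Every row contains more than one `1`, which is what distinguishes a multi-label target from a one-hot, single-label target.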

In [None]:
final_df.to_csv('Final_DataFrame.csv', index= False)

Because the vectors in the `vectors` column have different lengths, we must pad them before fitting an LSTM/RNN model, as these models require input sequences of a fixed length.

In [None]:
final_df['vectors'][0]

In [35]:
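# Each token contributes 100 floats (vector_size), so 1000 values cover only the first 10 tokens of a question
max_seq_length = 1000
X_padded = pad_sequences(final_df['vectors'], maxlen=max_seq_length, dtype='float32', padding='post', truncating='post', value=0.0)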

In [36]:
X_padded.shape

(11106, 1000)
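The effect of `padding='post'` and `truncating='post'` can be illustrated with a few lines of plain Python. The helper `pad_post` is a hypothetical stand-in written for this sketch, not the Keras function. The last lines work out how many tokens fit into the padded length when each token occupies `vector_size` values.

```python
def pad_post(seq, maxlen, value=0.0):
    """Truncate from the end, or right-pad with `value`, to exactly maxlen items."""
    trimmed = list(seq[:maxlen])
    return trimmed + [value] * (maxlen - len(trimmed))

short = pad_post([1.0, 2.0, 3.0], maxlen=5)
long = pad_post([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], maxlen=5)
print(short)
print(long)

# With flattened embeddings, each token takes up vector_size slots.
vector_size = 100
max_seq_length = 1000
tokens_kept = max_seq_length // vector_size
print(tokens_kept)
```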

In [37]:
y = final_df.drop(columns=['vectors']).values

In [44]:
X_train, X_test, y_train, y_test = train_test_split(X_padded, y, test_size=0.2, random_state=9)

In [45]:
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)

In [46]:
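# Stop when training accuracy has not improved for 2 epochs;
# monitoring 'val_loss' instead would track generalisation
early_stopping = EarlyStopping(monitor='accuracy', patience=2, restore_best_weights=True)

# Two stacked LSTM layers with dropout; each timestep carries a single float (see the reshape above)
model = Sequential()
model.add(LSTM(100, input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(50))
model.add(Dropout(0.2))
# softmax with categorical_crossentropy treats the 10 tags as mutually exclusive,
# although every row of y has at least two tags set; sigmoid outputs with
# binary_crossentropy are the usual choice for multi-label targets
model.add(Dense(10, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test), verbose=1, callbacks=[early_stopping])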

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


<keras.src.callbacks.History at 0x7c6d05b5a5c0>
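The model above ends in a softmax layer, while the targets are multi-hot. The NumPy sketch below, with made-up logits and a made-up target, illustrates why that combination is awkward: softmax probabilities sum to 1, so two correct tags have to share the probability mass, whereas independent sigmoid outputs can both be close to 1. The function names are illustrative and not part of the notebook.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-D array of logits (shifted for numerical stability)."""
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Mean binary cross-entropy over all labels."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Made-up example: the first two tags are correct, the third is not.
logits = np.array([4.0, 4.0, -4.0])
target = np.array([1.0, 1.0, 0.0])

p_soft = softmax(logits)
p_sig = sigmoid(logits)
print(p_soft, binary_cross_entropy(target, p_soft))
print(p_sig, binary_cross_entropy(target, p_sig))
```

In Keras, the corresponding multi-label setup would typically be `Dense(10, activation='sigmoid')` compiled with `loss='binary_crossentropy'`; whether it improves the results here would have to be verified by retraining.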

In [47]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)

# Reduce each row to a single class index. y_test is multi-hot, so argmax keeps
# only the first active tag of each question and ignores the others
y_pred = np.argmax(y_pred, axis=1)
y_true = np.argmax(y_test, axis=1)

# Generate classification report
class_report = classification_report(y_true, y_pred)
print(class_report)

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       301
           1       0.00      0.00      0.00       275
           2       0.00      0.00      0.00       187
           3       0.00      0.00      0.00       346
           4       0.00      0.00      0.00       213
           5       0.00      0.00      0.00       164
           6       0.26      1.00      0.41       569
           7       0.00      0.00      0.00       146
           8       0.00      0.00      0.00        21

    accuracy                           0.26      2222
   macro avg       0.03      0.11      0.05      2222
weighted avg       0.07      0.26      0.10      2222
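The report above is computed after `argmax`, which keeps a single tag per question. For multi-hot targets, a common alternative is to threshold each predicted probability separately and score all labels together. The sketch below computes a micro-averaged F1 this way with NumPy. The arrays are made up for illustration, the 0.5 threshold is an assumption, and `micro_f1` is a hypothetical helper rather than a scikit-learn function.

```python
import numpy as np

def micro_f1(y_true, y_prob, threshold=0.5):
    """Micro-averaged F1 over all (sample, label) pairs after thresholding."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up multi-hot targets and per-label probabilities for two questions.
y_true = np.array([[1, 1, 0], [0, 1, 1]])
y_prob = np.array([[0.9, 0.6, 0.2], [0.1, 0.4, 0.8]])

score = micro_f1(y_true, y_prob)
print(round(score, 3))
```

Here three of the four true tags are recovered with no false positives, so the score is 6/7.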



