1. Import libraries and dataset
2. Inspect data
3. Clean and pre-process data
4. Reshape target variable
5. Extract features from data
6. Build model for multilabel classification
7. Make predictions and evaluate model on validation set
8. Define inference function for new data

### Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
import re
import nltk
import spacy
from tqdm import tqdm
import matplotlib.pyplot as plt

%matplotlib inline
pd.set_option('display.max_colwidth', 200)

### Read Dataset

In [2]:
# read dataset
df_questions = pd.read_hdf('auto_tagging_data_v2.h5')

### Inspect Data

In [3]:
df_questions.sample(6, random_state=11)

Unnamed: 0,Id,Title,Body,Tags
41763,92185,Why is Sampling Importance Resampling (SIR) better than Importance Sampling (IS)?,"<p>From what I understand, SIR is a mechanism for sampling from a distribution $p$ that works as follows:</p>\n\n<ol>\n<li>Approximate a target distribution $p$ using an importance sample $S$ fro...","[sampling, mcmc]"
4245,179778,optimization approach in logistic regression,<p>In logistic regression we need to maximise the log likelihood which boils down to minimising a function which is sum of multiple log functions. We normally use gradient descent approach there. ...,"[machine-learning, logistic, classification, optimization]"
37183,168679,Consequences of violating proportional hazards assumption in Cox model,"<p>What are the consequences of violating the Proportional Hazards assumption in a Cox Model? I've got a Model where two factors are highly significative, but all the estimated betas associated to...","[regression, survival, cox-model]"
55932,144226,Moments and density tails,"<p>Assume that the first $n$ moments $m_1,\dots\,m_n$ of a random variable $X\in\mathbb{R}$ are known, but not its probability density function $p(x)$. </p>\n\n<p>Does there exist a methodology to...","[probability, pdf]"
47629,142745,What is the demonstration of the variance of the difference of two dependent variables?,"<p>I know that the variance of the difference of two independent variables is the sum of variances, and I can prove it. I want to know where the covariance goes in the other case.</p>\n","[variance, covariance]"
49639,195347,Rules for choosing how much training data one needs to learn a Radial Basis Function (RBF) model?,<p>I was trying to understand how much data I would need compared to the number of parameters (and to have good generalization) when I train a radial basis function (RBF) network on a regression t...,"[machine-learning, nonlinear-regression]"


In [4]:
# combine title and body
df_questions['Text'] = df_questions["Title"] + " " + df_questions["Body"]
df_questions.sample(6, random_state=11)

Unnamed: 0,Id,Title,Body,Tags,Text
41763,92185,Why is Sampling Importance Resampling (SIR) better than Importance Sampling (IS)?,"<p>From what I understand, SIR is a mechanism for sampling from a distribution $p$ that works as follows:</p>\n\n<ol>\n<li>Approximate a target distribution $p$ using an importance sample $S$ fro...","[sampling, mcmc]","Why is Sampling Importance Resampling (SIR) better than Importance Sampling (IS)? <p>From what I understand, SIR is a mechanism for sampling from a distribution $p$ that works as follows:</p>\n\n..."
4245,179778,optimization approach in logistic regression,<p>In logistic regression we need to maximise the log likelihood which boils down to minimising a function which is sum of multiple log functions. We normally use gradient descent approach there. ...,"[machine-learning, logistic, classification, optimization]",optimization approach in logistic regression <p>In logistic regression we need to maximise the log likelihood which boils down to minimising a function which is sum of multiple log functions. We n...
37183,168679,Consequences of violating proportional hazards assumption in Cox model,"<p>What are the consequences of violating the Proportional Hazards assumption in a Cox Model? I've got a Model where two factors are highly significative, but all the estimated betas associated to...","[regression, survival, cox-model]",Consequences of violating proportional hazards assumption in Cox model <p>What are the consequences of violating the Proportional Hazards assumption in a Cox Model? I've got a Model where two fact...
55932,144226,Moments and density tails,"<p>Assume that the first $n$ moments $m_1,\dots\,m_n$ of a random variable $X\in\mathbb{R}$ are known, but not its probability density function $p(x)$. </p>\n\n<p>Does there exist a methodology to...","[probability, pdf]","Moments and density tails <p>Assume that the first $n$ moments $m_1,\dots\,m_n$ of a random variable $X\in\mathbb{R}$ are known, but not its probability density function $p(x)$. </p>\n\n<p>Does th..."
47629,142745,What is the demonstration of the variance of the difference of two dependent variables?,"<p>I know that the variance of the difference of two independent variables is the sum of variances, and I can prove it. I want to know where the covariance goes in the other case.</p>\n","[variance, covariance]","What is the demonstration of the variance of the difference of two dependent variables? <p>I know that the variance of the difference of two independent variables is the sum of variances, and I ca..."
49639,195347,Rules for choosing how much training data one needs to learn a Radial Basis Function (RBF) model?,<p>I was trying to understand how much data I would need compared to the number of parameters (and to have good generalization) when I train a radial basis function (RBF) network on a regression t...,"[machine-learning, nonlinear-regression]",Rules for choosing how much training data one needs to learn a Radial Basis Function (RBF) model? <p>I was trying to understand how much data I would need compared to the number of parameters (and...


In [6]:
df_questions['Text'].head()

0    The Two Cultures: statistics vs. machine learning? <p>Last year, I read a blog post from <a href="http://anyall.org/">Brendan O'Connor</a> entitled <a href="http://anyall.org/blog/2008/12/statisti...
1    Forecasting demographic census <p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census ...
2                             Bayesian and frequentist reasoning in plain English <p>How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning?</p>\n
3    What is the meaning of p values and t values in statistical tests? <p>After taking a statistics course and then trying to help fellow students, I noticed one subject that inspires much head-desk b...
4    Examples for teaching: Correlation does not mean causation <p>There is an old saying: "Correlation does not mean causation". When I teach, I tend to use the following standard

### Clean and Pre-process Data

In [5]:
def clean_text(text):
    # remove html tags (URLs inside tag attributes are dropped with them)
    text = re.sub(r'<.*?>', '', text)
    # replace everything except letters with a space
    text = re.sub("[^a-zA-Z]"," ",text)
    # collapse repeated whitespace
    text = ' '.join(text.split())
    
    return text
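
To see what `clean_text` keeps and drops, the sketch below applies a copy of the function to a made-up string. The sample text and its URLs are illustrative and not taken from the dataset. One thing it shows: a URL written as plain text is not removed, only its punctuation is, so its pieces survive as separate tokens.

```python
import re

def clean_text(text):
    # copy of the function above, repeated so this block runs on its own
    text = re.sub(r'<.*?>', '', text)
    text = re.sub("[^a-zA-Z]", " ", text)
    return ' '.join(text.split())

# hypothetical sample, not taken from the dataset
sample = '<p>See <a href="http://example.com/a">this post</a> or http://example.com/b for 3 tips.</p>\n'

cleaned = clean_text(sample).lower()
print(cleaned)
```

The URL inside the `href` attribute disappears with its tag, while the bare URL leaves `http example com b` behind. Digits are dropped as well.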

In [6]:
df_questions['Text'] = df_questions['Text'].apply(clean_text)
df_questions['Text'] = df_questions['Text'].str.lower()

In [8]:
df_questions[['Id', 'Text', 'Tags']].sample(6, random_state=11)

Unnamed: 0,Id,Text,Tags
41763,92185,"Why is Sampling Importance Resampling (SIR) better than Importance Sampling (IS)? <p>From what I understand, SIR is a mechanism for sampling from a distribution $p$ that works as follows:</p>\n\n...","[sampling, mcmc]"
4245,179778,optimization approach in logistic regression <p>In logistic regression we need to maximise the log likelihood which boils down to minimising a function which is sum of multiple log functions. We n...,"[machine-learning, logistic, classification, optimization]"
37183,168679,Consequences of violating proportional hazards assumption in Cox model <p>What are the consequences of violating the Proportional Hazards assumption in a Cox Model? I've got a Model where two fact...,"[regression, survival, cox-model]"
55932,144226,"Moments and density tails <p>Assume that the first $n$ moments $m_1,\dots\,m_n$ of a random variable $X\in\mathbb{R}$ are known, but not its probability density function $p(x)$. </p>\n\n<p>Does th...","[probability, pdf]"
47629,142745,"What is the demonstration of the variance of the difference of two dependent variables? <p>I know that the variance of the difference of two independent variables is the sum of variances, and I ca...","[variance, covariance]"
49639,195347,Rules for choosing how much training data one needs to learn a Radial Basis Function (RBF) model? <p>I was trying to understand how much data I would need compared to the number of parameters (and...,"[machine-learning, nonlinear-regression]"


In [9]:
# requires the NLTK stopword corpus: run nltk.download('stopwords') once
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [10]:
def strip_stopwords(text):
    # keep only tokens that are not stopwords (avoid shadowing clean_text)
    kept = [w for w in text.split() if w not in stop_words]
    return ' '.join(kept)

In [11]:
df_questions['Text_clean'] = df_questions['Text'].apply(strip_stopwords)

### Reshape Target Variable

In [12]:
from sklearn.preprocessing import MultiLabelBinarizer

In [13]:
multilabel_binarizer = MultiLabelBinarizer()

multilabel_binarizer.fit(df_questions['Tags'])

# transform target variable ("Tags")
Y = multilabel_binarizer.transform(df_questions['Tags'])

In [14]:
Y.shape

(76365, 100)
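
The shape above means one row per question and one column per tag. As a small illustration of what `MultiLabelBinarizer` produces, here is a sketch on three invented tag lists (the toy tags are assumptions chosen for readability, not a sample of the real data):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# hypothetical tag lists, one list per question
toy_tags = [["regression", "r"], ["variance"], ["r"]]

toy_binarizer = MultiLabelBinarizer()
toy_Y = toy_binarizer.fit_transform(toy_tags)

# columns follow the sorted tag vocabulary
print(toy_binarizer.classes_)
print(toy_Y)

# inverse_transform maps indicator rows back to tuples of tags
print(toy_binarizer.inverse_transform(toy_Y))
```

Each row has a 1 in every column whose tag is attached to that question, which is the format the classifier is trained on later.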

### Feature Extraction

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [16]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000)

X_tfidf = tfidf_vectorizer.fit_transform(df_questions['Text_clean'])
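
The `max_df=0.8` setting drops terms that occur in more than 80% of documents, on the assumption that such terms carry little information for telling tags apart. A minimal sketch on an invented five-document corpus (the documents are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical corpus: "data" appears in every document
toy_docs = [
    "data regression model",
    "data variance",
    "data mean",
    "data sampling",
    "data prior",
]

toy_vectorizer = TfidfVectorizer(max_df=0.8)
toy_X = toy_vectorizer.fit_transform(toy_docs)

# "data" has document frequency 5/5 > 0.8, so it is excluded from the vocabulary
print(sorted(toy_vectorizer.vocabulary_))
print(toy_X.shape)
```

`max_features=10000` then caps the vocabulary at the 10,000 most frequent remaining terms; the toy corpus is far too small for that cap to matter.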

### Train-Test Split

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
# split dataset into training and validation set
x_train_tfidf, x_val_tfidf, y_train_tfidf, y_val_tfidf = train_test_split(X_tfidf, Y, test_size=0.2, random_state=9)

### Model Building

In [19]:
from sklearn.linear_model import LogisticRegression

# Binary Relevance
from sklearn.multiclass import OneVsRestClassifier

# Performance metric
from sklearn.metrics import f1_score

In [20]:
lr = LogisticRegression()
clf = OneVsRestClassifier(lr)

In [21]:
# fit model on train data
clf.fit(x_train_tfidf, y_train_tfidf)



OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
          n_jobs=None)

### Predictions and Performance Evaluation

In [22]:
# make predictions for validation set
y_pred = clf.predict(x_val_tfidf)

In [23]:
# print prediction
print(y_pred[:3])

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


In [24]:
multilabel_binarizer.inverse_transform(y_pred)[:3]

[('prediction',), ('distributions', 'mean', 'variance'), ()]

In [35]:
# evaluate performance (f1_score defaults to average='binary', which is invalid for multilabel targets)
f1_score(y_val_tfidf, y_pred)

ValueError: Target is multilabel-indicator but average='binary'. Please choose another average setting.

In [36]:
f1_score(y_val_tfidf, y_pred, average=None)

array([0.02040816, 0.6080402 , 0.60983607, 0.38918919, 0.5776699 ,
       0.13636364, 0.35433071, 0.66122449, 0.51111111, 0.2246696 ,
       0.51282051, 0.44018059, 0.71470588, 0.18461538, 0.58695652,
       0.5596222 , 0.30337079, 0.51572327, 0.65030675, 0.07960199,
       0.29577465, 0.4009324 , 0.0130719 , 0.40993789, 0.28172043,
       0.03296703, 0.18367347, 0.26900585, 0.23602484, 0.52849741,
       0.35379061, 0.51936219, 0.40559441, 0.30769231, 0.35631155,
       0.04225352, 0.5       , 0.04347826, 0.28571429, 0.04166667,
       0.63688213, 0.38063439, 0.01104972, 0.46096654, 0.53134328,
       0.37241379, 0.16494845, 0.46994536, 0.52903226, 0.        ,
       0.21333333, 0.        , 0.35483871, 0.23312883, 0.31578947,
       0.18875502, 0.15151515, 0.75644699, 0.05263158, 0.31336406,
       0.36333333, 0.19487179, 0.45962733, 0.49101796, 0.13492063,
       0.5060241 , 0.77272727, 0.26666667, 0.46153846, 0.05376344,
       0.05970149, 0.39434276, 0.3030303 , 0.41975309, 0.57242

In [38]:
np.mean(f1_score(y_val_tfidf, y_pred, average=None))

0.34361729591925294

In [39]:
# evaluate performance
f1_score(y_val_tfidf, y_pred, average="micro")

0.4312861087264206

In [40]:
f1_score(y_val_tfidf, y_pred, average="macro")

0.34361729591925294
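
The macro score matches the `np.mean` of the per-tag scores because macro averaging is exactly that unweighted mean, while micro averaging pools true positives, false positives and false negatives over all tags before computing F1. A sketch on invented labels (the toy arrays are assumptions, not notebook data):

```python
import numpy as np
from sklearn.metrics import f1_score

# hypothetical ground truth and predictions for 4 samples and 2 tags
y_true_toy = np.array([[1, 0], [1, 0], [1, 1], [1, 1]])
y_pred_toy = np.array([[1, 0], [1, 0], [1, 0], [1, 1]])

per_tag = f1_score(y_true_toy, y_pred_toy, average=None)
macro = f1_score(y_true_toy, y_pred_toy, average="macro")
micro = f1_score(y_true_toy, y_pred_toy, average="micro")

print(per_tag)   # one F1 value per tag
print(macro)     # unweighted mean of the per-tag values
print(micro)     # F1 from counts pooled over all tags
```

Here the frequent tag is predicted perfectly and the rare one is not, so micro F1 (dominated by the frequent tag) comes out higher than macro F1. The same effect is a plausible reason for the gap between 0.43 and 0.34 above.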

In [29]:
# predict probabilities
y_pred_prob = clf.predict_proba(x_val_tfidf)

In [51]:
# set threshold value
t = 0.45

# convert to integers
y = (y_pred_prob >= t).astype(int)
f1_score(y_val_tfidf, y, average="micro")

0.45988232147633057
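
The threshold of 0.45 is a single hand-picked value. One way to choose it more systematically is to score a grid of candidates and keep the best one. The sketch below does that on invented probabilities; `best_threshold` is a hypothetical helper, not part of the notebook or of scikit-learn.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, y_prob, candidates):
    """Return (threshold, micro-F1) for the best-scoring candidate threshold."""
    scores = [
        f1_score(y_true, (y_prob >= t).astype(int), average="micro")
        for t in candidates
    ]
    best = int(np.argmax(scores))
    return float(candidates[best]), float(scores[best])

# hypothetical labels and predicted probabilities for 4 samples and 2 tags
y_true_toy = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_prob_toy = np.array([[0.90, 0.20], [0.42, 0.46], [0.48, 0.70], [0.10, 0.30]])

candidates = np.round(np.linspace(0.30, 0.60, 7), 2)  # 0.30, 0.35, ..., 0.60
best_t, best_f1 = best_threshold(y_true_toy, y_prob_toy, candidates)
print(best_t, best_f1)
```

With the real data this would be called as `best_threshold(y_val_tfidf, y_pred_prob, candidates)`. A threshold tuned on the validation set gives an optimistic score on that same set, so a separate held-out split is the safer place to report the final number.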

### Inference

In [52]:
def infer_tags(q):
    q = clean_text(q)
    q = q.lower()
    q = strip_stopwords(q)
    q_vec = tfidf_vectorizer.transform([q])
    q_pred = clf.predict(q_vec)
    return multilabel_binarizer.inverse_transform(q_pred)

In [81]:
# give new question
new_q = "Regression line in ggplot doesn't match computed regression Im using R and created a chart using ggplot2. I then create a regression so I can make some predicitions I pass my data frame of to the predict function predict(regression, Measures) I'd expect the predictions to be the same as if I used the regression line on the chart, but they aren't the same. Why would this be the case? Is there a setting in ggplot or is my expectation incorrect?"

# get tags
infer_tags(new_q)

[('r', 'regression')]
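
`infer_tags` calls `clf.predict`, which for logistic regression corresponds to a 0.5 cut-off rather than the 0.45 threshold tried earlier. A variant that applies a custom threshold could look like the sketch below. It trains a tiny stand-in model on an invented corpus so that it runs on its own; every `toy_*` name and `infer_tags_with_threshold` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# hypothetical miniature training set standing in for the real questions
toy_texts = [
    "linear regression in r with lm",
    "plot regression line in r ggplot",
    "variance of a sample mean",
    "variance and covariance of two variables",
    "fit logistic regression model",
    "r function for sample variance",
]
toy_tags = [["r", "regression"], ["r", "regression"], ["variance"],
            ["variance"], ["regression"], ["r", "variance"]]

toy_mlb = MultiLabelBinarizer()
toy_Y = toy_mlb.fit_transform(toy_tags)
toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(toy_texts)
toy_clf = OneVsRestClassifier(LogisticRegression()).fit(toy_X, toy_Y)

def infer_tags_with_threshold(q, threshold=0.45):
    # vectorize, get per-tag probabilities, then apply the custom cut-off
    q_vec = toy_vec.transform([q.lower()])
    q_prob = toy_clf.predict_proba(q_vec)
    q_pred = (q_prob >= threshold).astype(int)
    return toy_mlb.inverse_transform(q_pred)

print(infer_tags_with_threshold("regression line in ggplot"))
```

In the notebook itself the same idea would reuse `clean_text`, `strip_stopwords`, `tfidf_vectorizer`, `clf` and `multilabel_binarizer` in place of the toy objects.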