# Predict tags on StackOverflow with linear models

In this project we will  predict tags for posts from [StackOverflow](https://stackoverflow.com) Using multilabel classification approach.



### Data

The data is retrieved from one of Coursera AI cources 

### Text preprocessing

For this and most of the following assignments you will need to use a list of stop words. It can be downloaded from *nltk*:

In [3]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to C:\Users\Noha
[nltk_data]     Magdy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
from ast import literal_eval
import pandas as pd
import numpy as np

In [5]:
def read_data(filename):
    data = pd.read_csv(filename, sep='\t')
    data['tags'] = data['tags'].apply(literal_eval)
    return data

In [6]:
train = read_data('data/train.tsv')
validation = read_data('data/validation.tsv')
test = pd.read_csv('data/test.tsv', sep='\t')

In [7]:
train.head()

Unnamed: 0,title,tags
0,How to draw a stacked dotplot in R?,[r]
1,mysql select all records where a datetime fiel...,"[php, mysql]"
2,How to terminate windows phone 8.1 app,[c#]
3,get current time in a specific country via jquery,"[javascript, jquery]"
4,Configuring Tomcat to Use SSL,[java]


It could be noticed that a number of tags for a post is not fixed and could be as many as necessary.

 initialize *X_train*, *X_val*, *X_test*, *y_train*, *y_val*.

In [8]:
X_train, y_train = train['title'].values, train['tags'].values
X_val, y_val = validation['title'].values, validation['tags'].values
X_test = test['title'].values

In [9]:
import re

In [10]:
REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def text_prepare(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = text.lower() # lowercase text 
    text = re.sub(REPLACE_BY_SPACE_RE, ' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = re.sub(BAD_SYMBOLS_RE, '', text) # delete symbols which are in BAD_SYMBOLS_RE from text 
    text = " ".join([x for x in text.split() if x not in STOPWORDS]) # delete stopwords from text
    return text

In [11]:
def test_text_prepare():
    examples = ["SQL Server - any equivalent of Excel's CHOOSE function?",
                "How to free c++ memory vector<int> * arr?"]
    answers = ["sql server equivalent excels choose function", 
               "free c++ memory vectorint arr"]
    for ex, ans in zip(examples, answers):
        if text_prepare(ex) != ans:
            return "Wrong answer for the case: '%s'" % ex
    return 'tests are passed.'

In [12]:
print(test_text_prepare())

tests are passed.


Run implementation for questions from file *text_prepare_tests.tsv*.

In [13]:
X_train = [text_prepare(x) for x in X_train]
X_val = [text_prepare(x) for x in X_val]
X_test = [text_prepare(x) for x in X_test]

In [14]:
X_train[:3]

['draw stacked dotplot r',
 'mysql select records datetime field less specified value',
 'terminate windows phone 81 app']


(WordsTagsCount). Find 3 most popular tags and 3 most popular words in the train data

In [15]:
# Dictionary of all tags from train corpus with their counts.
tags_counts = {}
# Dictionary of all words from train corpus with their counts.
words_counts = {}


def word_counter (text):
    text= text.split()
    for i in text:
        if words_counts.get(i):
            words_counts[i]+=1
        else: 
            words_counts[i]=1

def tag_counter (tags):
    for i in tags:
        if tags_counts.get(i):
            tags_counts[i]+=1
        else: 
            tags_counts[i]=1

            
for j in X_train:
    word_counter (j)
    
for n in y_train:
    tag_counter(n)
    

In [16]:
most_common_tags = sorted(tags_counts.items(), key=lambda x: x[1], reverse=True)[:3]
most_common_words = sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:3]

print (most_common_tags)
print (most_common_words)

[('javascript', 19078), ('c#', 19077), ('java', 18661)]
[('using', 8278), ('php', 5614), ('java', 5501)]


### Transforming text to a vector


#### Bag of words

One of the well-known approaches is a *bag-of-words* representation. To create this transformation, We will
1. Find *N* most popular words in train corpus and numerate them. Now we have a dictionary of the most popular words.
2. For each title in the corpora create a zero vector with the dimension equals to *N*.
3. For each text in the corpora iterate over words which are in the dictionary and increase by 1 the corresponding coordinate.



In [17]:
DICT_SIZE = 5000

most_common_words = sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:5000]
most_common_5000 = [word[0] for word in most_common_words]
WORDS_TO_INDEX = {}
INDEX_TO_WORDS = {}

for j in range (0,5000):
    WORDS_TO_INDEX[most_common_5000[j]]=j
    
for i in range (0,5000):
    INDEX_TO_WORDS[i]=most_common_5000[i]
    

ALL_WORDS = WORDS_TO_INDEX.keys()

'''
text= 'hi you me are'
text= text.split()
index ={}
inverse= {}
for i in range (0,len(text)):
    index[i]=text[i]
for j in range (0,len(text)):
    inverse[text[j]]=j

ALL_WORDS = inverse.keys()
#print (list(ALL_WORDS))

testcase='hi how are you'

'''
def my_bag_of_words(text, words_to_index, dict_size):
    """
        text: a string
        dict_size: size of the dictionary
        
        return a vector which is a bag-of-words representation of 'text'
    """
    text=text.split()
    result_vector = np.zeros(dict_size)
    for i in text:
        if i in list(words_to_index.keys()):
            result_vector[words_to_index[i]]=1

    return result_vector

In [18]:
def test_my_bag_of_words():
    words_to_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}
    examples = ['hi how are you']
    answers = [[1, 1, 0, 1]]
    for ex, ans in zip(examples, answers):
        if (my_bag_of_words(ex, words_to_index, 4) != ans).any():
            return "Wrong answer for the case: '%s'" % ex
    return 'tests are passed.'

In [19]:
print(test_my_bag_of_words())

tests are passed.


Applying the implemented function to all samples 

In [20]:
from scipy import sparse as sp_sparse

In [21]:
X_train_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_train])
X_val_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_val])
X_test_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_test])
print('X_train shape ', X_train_mybag.shape)
print('X_val shape ', X_val_mybag.shape)
print('X_test shape ', X_test_mybag.shape)

X_train shape  (100000, 5000)
X_val shape  (30000, 5000)
X_test shape  (20000, 5000)


In [22]:
row = X_train_mybag[10].toarray()[0]
non_zero_elements_count = 0

for i in row:
    if i != 0:
        non_zero_elements_count+=1  
        
print (non_zero_elements_count)

7


#### TF-IDF

The second approach extends the bag-of-words framework by taking into account total frequencies of words in the corpora. It helps to penalize too frequent words and provide better features space. 



In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [24]:
def tfidf_features(X_train, X_val, X_test):
    """
        X_train, X_val, X_test — samples        
        return TF-IDF vectorized representation of each sample and vocabulary
    """

    tfidf_vectorizer = TfidfVectorizer(min_df=2, max_df=0.7, ngram_range=(1, 2), token_pattern='(\S+)')
    
    X_train = tfidf_vectorizer.fit_transform(X_train)
    X_val = tfidf_vectorizer.transform(X_val)
    X_test = tfidf_vectorizer.transform(X_test)
    
    
    return X_train, X_val, X_test, tfidf_vectorizer.vocabulary_

In [25]:
X_train_tfidf, X_val_tfidf, X_test_tfidf, tfidf_vocab = tfidf_features(X_train, X_val, X_test)
tfidf_reversed_vocab = {i:word for word,i in tfidf_vocab.items()}

In [26]:
if 'c#' in tfidf_reversed_vocab.values():
    print ("ok")
else:
    print ("not ok")

ok


### MultiLabel classifier

 in this part each example can have multiple tags. To deal with such kind of prediction, we need to transform labels in a binary form and the prediction will be a mask of 0s and 1s.

In [27]:
from sklearn.preprocessing import MultiLabelBinarizer

In [28]:
mlb = MultiLabelBinarizer(classes=sorted(tags_counts.keys()))
y_train = mlb.fit_transform(y_train)
y_val = mlb.fit_transform(y_val)
print (np.sum(y_train[20]))

3


In [29]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

In [30]:
tags = len(tags_counts)

def train_classifier(X_train, y_train):
    """
      X_train, y_train — training data
      
      return: trained classifier
    """
    
    # Create and fit LogisticRegression wraped into OneVsRestClassifier.
    return  OneVsRestClassifier(LogisticRegression()).fit(X_train, y_train)

Train the classifiers for different data transformations: *bag-of-words* and *tf-idf*.

In [31]:
classifier_mybag = train_classifier(X_train_mybag, y_train)
classifier_tfidf = train_classifier(X_train_tfidf, y_train)



create predictions for the data.

In [32]:
y_val_predicted_labels_mybag = classifier_mybag.predict(X_val_mybag)
y_val_predicted_scores_mybag = classifier_mybag.decision_function(X_val_mybag)

y_val_predicted_labels_tfidf = classifier_tfidf.predict(X_val_tfidf)
y_val_predicted_scores_tfidf = classifier_tfidf.decision_function(X_val_tfidf)

In [37]:
y_val_pred_inversed = mlb.inverse_transform(y_val_predicted_labels_tfidf)
y_val_inversed = mlb.inverse_transform(y_val)
for i in range(3):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_val[i],
        ','.join(y_val_inversed[i]),
        ','.join(y_val_pred_inversed[i])
    ))

Title:	odbc_exec always fail
True labels:	php,sql
Predicted labels:	


Title:	access base classes variable within child class
True labels:	javascript
Predicted labels:	


Title:	contenttype application json required rails
True labels:	ruby,ruby-on-rails
Predicted labels:	json,ruby-on-rails




we would need to compare the results of different predictions

### Evaluation



In [34]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score 
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

Implementing the function *print_evaluation_scores* which calculates and prints:
 - *accuracy*
 - *F1-score macro/micro/weighted*
 - *Precision macro/micro/weighted*

In [35]:
def print_evaluation_scores(y_val, predicted):
    print ("accuracy: ",accuracy_score(y_val, predicted, normalize=True, sample_weight=None))
    #print ("Fi-score: ",f1_score(y_val, predicted),average = 'macro')
    #print ("recall_score", recall_score(y_val, predicted))

In [36]:
print('Bag-of-words')
print_evaluation_scores(y_val, y_val_predicted_labels_mybag)
print('Tfidf')
print_evaluation_scores(y_val, y_val_predicted_labels_tfidf)

Bag-of-words
accuracy:  0.3617
Tfidf
accuracy:  0.3307
