# This notebook is my solution to the 3rd stage problem for the Mesh Education Private Limited internship.

## Tagging System of Questions using Transfer Learning

### Problem Statement
In this challenge, we provide the titles, text, and tags of Stack Exchange questions from six different
sites. We then ask for tag predictions on unseen physics questions. Solving this problem via a
standard machine learning approach might involve training an algorithm on a corpus of related text.
Here, you are challenged to train on material from outside the field. Can an algorithm predict
appropriate physics tags after learning from biology, chemistry or mathematics data? Let's find out!

In [1]:
# Usual imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# spaCy based imports
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

import warnings
warnings.filterwarnings("ignore")
import re
import string
import operator
import random
from tqdm import tqdm
tqdm.pandas()

In [4]:
# Reading training data
train_list = ['travel.csv', 'biology.csv', 'robotics.csv', 'cooking.csv', 'crypto.csv', 'diy.csv']

list_ = []

for file_ in train_list:
    df = pd.read_csv(file_,index_col=None, header=0)
    list_.append(df)

frame = pd.concat(list_, axis = 0, ignore_index = True) 

In [5]:
frame.drop(['id'], inplace=True, axis=1)
frame.head()

Unnamed: 0,title,content,tags
0,What are some Caribbean cruises for October?,<p>My fiancée and I are looking for a good Car...,caribbean cruising vacations
1,How can I find a guide that will take me safel...,"<p>This was one of our definition questions, b...",guides extreme-tourism amazon-river amazon-jungle
2,Does Singapore Airlines offer any reward seats...,<p>Singapore Airlines has an all-business clas...,loyalty-programs routes ewr singapore-airlines...
3,What is the easiest transportation to use thro...,<p>Another definition question that interested...,romania transportation
4,How can I visit Antarctica?,"<p>A year ago I was reading some magazine, and...",extreme-tourism antarctica


In [6]:
#Evaluation metric
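def f1_score(tp, fp, fn):
    """Return F1 from true-positive, false-positive, and false-negative counts."""
    # With no true positives, precision and recall are 0 or undefined, so F1 is 0
    if tp == 0:
        return 0.0
    p = float(tp) / (tp + fp)  # precision
    r = float(tp) / (tp + fn)  # recall
    return (2 * p * r) / (p + r)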
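
The counts `tp`, `fp`, and `fn` are not computed anywhere in this notebook. Below is a minimal sketch of how they could be derived for a single question, assuming predicted and actual tags are space-separated strings as in the `tags` column. The helper names `tag_counts` and `f1_from_counts` are hypothetical and are not part of the original solution.

```python
def tag_counts(predicted, actual):
    """Count true positives, false positives, and false negatives for one question."""
    pred = set(predicted.split())
    true = set(actual.split())
    tp = len(pred & true)  # predicted tags that are correct
    fp = len(pred - true)  # predicted tags that are wrong
    fn = len(true - pred)  # correct tags that were missed
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 written as 2*tp / (2*tp + fp + fn); equal to 2*p*r / (p + r) when tp > 0."""
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom else 0.0


# Illustrative pair: two of the three true tags are predicted, with no wrong tags
counts = tag_counts("concrete subfloor", "concrete subfloor hardwood")
score = f1_from_counts(*counts)
print(counts, score)
```

Here precision is 2/2 and recall is 2/3, so the F1 for this one question is 0.8.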

http://www2.agroparistech.fr/ufr-info/membres/cornuejols/Teaching/Master-AIC/PROJETS-M2-AIC/PROJETS-2016-2017/challenge-kaggle-transfer%20KHOUFI_MATMATI_THIERRY.pdf

The paper linked above reports the following:
* Tags can be deduced from information in the 'title' and 'content' but not with good accuracy.
* Baseline models have the following scores:
    * TITLE 0.08271
    * TITLE + CONTENT 0.05719
    * CONTENT 0.05021
* CNN models have the following scores:
    * TITLE 0.07325
    * TITLE + CONTENT 0.05620
    * CONTENT 0.05018
* LDA models have the following scores:
    * BEST 20 WORDS IN TEXT 0.03861
    * BEST 5 WORDS IN TEXT 0.02866
    * RANDOM FROM BEST 15 0.00862
    * BEST 5 TAGS 0.00824
* Transfer learning is not very helpful: the score does not exceed 0.05719 for title + content or 0.08271 for title only.
* Much better results would have been achieved if a list of tags had been made available for the challenge.

In [7]:
#Cleaning Text
def clean_html(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

# SpaCy

punctuations = string.punctuation
stopwords = list(STOP_WORDS)

parser = English()
def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    # Lowercased lemma for each token; "-PRON-" is the placeholder lemma spaCy v2 assigns to pronouns
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    # Drop stop words and punctuation tokens
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    return " ".join(mytokens)

def find_str(s, char):
    # Index of the first occurrence of `char` in `s`, or -1 if it is absent
    return s.find(char)

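def theory_checker(phrase):
    # Join "theory" to the word before it ("string theory" -> "string-theory")
    idx = find_str(phrase, "theory")
    # idx > 0 skips phrases starting with "theory", where idx-1 would wrap to the last character
    if idx > 0:
        lp = list(phrase)
        lp[idx-1] = "-"
        phrase = ''.join(lp)
    return phrase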
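
To show what the cleaning helpers do without loading spaCy, here is a compact restatement of `clean_html` and `theory_checker` applied to two made-up inputs. It is a sketch for illustration only: the sample strings are invented, and `TAG_RE` is a hypothetical name for the precompiled pattern.

```python
import re

# Non-greedy match of anything that looks like an HTML tag
TAG_RE = re.compile(r"<.*?>")


def clean_html(raw_html):
    """Strip tags and keep only the text between them."""
    return TAG_RE.sub("", raw_html)


def theory_checker(phrase):
    """Replace the character before the first 'theory' with a hyphen."""
    idx = phrase.find("theory")
    if idx > 0:  # skip phrases where 'theory' is absent or comes first
        phrase = phrase[:idx - 1] + "-" + phrase[idx:]
    return phrase


cleaned = clean_html("<p>How would you explain <b>string theory</b>?</p>")
joined = theory_checker("simple explanation string theory")
print(cleaned)
print(joined)
```

Joining the two words with a hyphen gives a single token such as `string-theory`, the same hyphenated form used by multi-word tags in the training data (for example `extreme-tourism`).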

In [8]:
# Cleaning and preprocessing training data
train_data = frame
train_data["content"] = train_data["content"].progress_apply(clean_html)
train_data["content"] = train_data["content"].progress_apply(spacy_tokenizer)
train_data["title"] = train_data["title"].progress_apply(spacy_tokenizer)
train_data.head()

100%|██████████| 87000/87000 [00:00<00:00, 107005.61it/s]
100%|██████████| 87000/87000 [07:58<00:00, 181.77it/s]
100%|██████████| 87000/87000 [00:52<00:00, 1662.05it/s]


Unnamed: 0,title,content,tags
0,caribbean cruise october,fiancée look good caribbean cruise october won...,caribbean cruising vacations
1,find guide safely amazon jungle,definition question interest personally find g...,guides extreme-tourism amazon-river amazon-jungle
2,singapore airlines offer reward seat ewr sin r...,singapore airlines business class flight ewr s...,loyalty-programs routes ewr singapore-airlines...
3,easy transportation use romania foreigner,definition question interest easy transportati...,romania transportation
4,visit antarctica,year ago read magazine find availability trip ...,extreme-tourism antarctica


**I will use a frequency-based approach to improve on the title + content scores. Both title and content are used in order to build a more robust and believable model.**

**The function below finds the words that appear in both the title and the content. These words are used as the predicted tags in the submission.**

In [9]:
# Function to find out common words to be used as tags 
def top_word_finder(title,content):
    title = title.split()
    content = content.split()
    top = set(title)&set(content)
    top = sorted(top, key = lambda k : title.index(k))
    return ' '.join(top)
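
As a worked example of the overlap step, the snippet below repeats `top_word_finder` and runs it on two already-tokenized inputs. The first uses the visible part of one training row's content, and the second is an invented content string that shares no word with its title, so both should be read as illustrative.

```python
def top_word_finder(title, content):
    title = title.split()
    content = content.split()
    # Words present in both the title and the content
    top = set(title) & set(content)
    # Restore the order in which the words appear in the title
    top = sorted(top, key=lambda k: title.index(k))
    return ' '.join(top)


overlap = top_word_finder("measure power draw inverter",
                          "output power calculation measure ac current")
no_overlap = top_word_finder("lie-theory representations particle physic",
                             "question post different forum")
print(repr(overlap))
print(repr(no_overlap))
```

The first call returns `measure power`, in title order. The second returns an empty string, which is how a question can end up with no predicted tag at all.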

In [10]:
train_data["similar_words"] = train_data.progress_apply(lambda row: top_word_finder(row['title'], row['content']), axis=1)

100%|██████████| 87000/87000 [00:03<00:00, 23720.62it/s]


**After applying the function, we can see that the newly generated 'similar_words' column shares many entries with 'tags'. The same approach is now applied to the physics questions.**

In [11]:
train_data.tail(10)

Unnamed: 0,title,content,tags,similar_words
86990,concrete subfloor 2x4,need new concrete subfloor water damage 2x4s o...,concrete subfloor hardwood,concrete subfloor
86991,single 12 2 nm cable hole size drill,know code specify maximum hole size base frame...,electrical wiring,single 12 2 nm cable hole size
86992,c wire missing trane air handler variable 4tee3f,tell wire contact use c wire,electrical,c wire
86993,plug socket turn replace,problem plug socket socket turn replace screwf...,electrical wiring socket,plug socket turn replace
86994,safe wire light junction box plug plug switch ...,edit rephrase question original unsafe install...,electrical wiring lighting light-fixture safety,safe wire light junction box plug switch contr...
86995,prevent stand water collect base foundation,major problem rainfall water collect base home...,water foundation grading,water collect base foundation
86996,selectable thermostat,like add 2 remote thermostat exit hvac system ...,thermostat,thermostat
86997,measure power draw inverter,output power calculation measure ac current cl...,electrical,measure power
86998,old oil force air heat system r w wire add c p...,system 60 era furnace t connector r w thermost...,thermostat-c-wire,system r w wire c use
86999,light stay switch,problem come home morning find light switch tu...,electrical lighting,light stay switch
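
The overlap between `tags` and `similar_words` can also be measured rather than judged by eye. The sketch below averages a per-row F1 over four rows copied from the table above (only rows whose columns are not truncated). The helper `row_f1` is hypothetical, and averaging per-row F1 is one possible choice, not necessarily the exact metric used for scoring.

```python
# (tags, similar_words) pairs copied from rows 86990, 86996, 86997 and 86999
rows = [
    ("concrete subfloor hardwood", "concrete subfloor"),
    ("thermostat", "thermostat"),
    ("electrical", "measure power"),
    ("electrical lighting", "light stay switch"),
]


def row_f1(actual, predicted):
    """F1 between two space-separated tag strings; 0.0 when nothing matches."""
    true, pred = set(actual.split()), set(predicted.split())
    tp = len(true & pred)
    if tp == 0:
        return 0.0
    precision = tp / float(len(pred))
    recall = tp / float(len(true))
    return 2 * precision * recall / (precision + recall)


scores = [row_f1(actual, predicted) for actual, predicted in rows]
mean_f1 = sum(scores) / len(scores)
print(scores, mean_f1)
```

The comparison is an exact string match, so the lemmatized word `light` does not count as a hit for the tag `lighting`. This is one reason the overlap words score lower than the table might suggest.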


In [13]:
test_data = pd.read_csv('test.csv')
test_data.head()

Unnamed: 0,id,title,content
0,1,What is spin as it relates to subatomic partic...,<p>I often hear about subatomic particles havi...
1,2,What is your simplest explanation of the strin...,<p>How would you explain string theory to non ...
2,3,"Lie theory, Representations and particle physics",<p>This is a question that has been posted at ...
3,7,Will Determinism be ever possible?,<p>What are the main problems that we need to ...
4,9,Hamilton's Principle,<p>Hamilton's principle states that a dynamic ...


In [14]:
test_data["content"] = test_data["content"].progress_apply(clean_html)
test_data["content"] = test_data["content"].progress_apply(spacy_tokenizer)
test_data["title"] = test_data["title"].progress_apply(spacy_tokenizer)
test_data["content"] = test_data["content"].progress_apply(theory_checker)
test_data["title"] = test_data["title"].progress_apply(theory_checker)
test_data["similar_words"] = test_data.progress_apply(lambda row: top_word_finder(row['title'], row['content']), axis=1)
test_data.head()

100%|██████████| 81926/81926 [00:00<00:00, 102810.08it/s]
100%|██████████| 81926/81926 [10:29<00:00, 130.18it/s]
100%|██████████| 81926/81926 [00:42<00:00, 1931.35it/s]
100%|██████████| 81926/81926 [00:00<00:00, 185837.25it/s]
100%|██████████| 81926/81926 [00:00<00:00, 646779.96it/s]
100%|██████████| 81926/81926 [00:03<00:00, 21960.24it/s]


Unnamed: 0,id,title,content,similar_words
0,1,spin relate subatomic particle,hear subatomic particle property spin actually...,spin relate subatomic particle
1,2,simple explanation string-theory,explain string-theory non physicist specially ...,string-theory
2,3,lie-theory representations particle physic,question post different forum think maybe conc...,
3,7,determinism possible,main problem need solve prove laplace determin...,determinism
4,9,hamilton principle,hamilton principle state dynamic system follow...,hamilton principle


In [16]:
submission = pd.read_csv('sample_submission.csv')
submission["tags"] = test_data["similar_words"]
# index=False keeps the file to the id and tags columns only
submission.to_csv('submission.csv', index=False)

**This is the submission file with the newly generated tags. Most tags are meaningful and describe the question well, although a question gets no tag when its title and content share no words (see id 3).**

In [17]:
submission.head()

Unnamed: 0,id,tags
0,1,spin relate subatomic particle
1,2,string-theory
2,3,
3,7,determinism
4,9,hamilton principle


**This submission scored 0.07499 on Kaggle (where the competition was hosted), which is better than the 0.05719 (baseline) and 0.05620 (CNN) that the paper reports for title + content. The frequency-based approach therefore outperforms the paper's title + content models, although it remains below the paper's title-only baseline of 0.08271.**
<img src="capture.png">
**The final submission file is included with this notebook: kaggle_submission.csv**

In [19]:
pd.read_csv('kaggle_submission.csv').tail()

Unnamed: 0,id,tags
81921,278119,projectile
81922,278120,lift coanda effect
81923,278121,asymmetric
81924,278124,drop impact liquid
81925,278126,gravity manipulation
