Transfer Learning on Stack Exchange Tags
========================================

Step 1: Explore Data
--------------------

Before we can start training an ML model, we need to understand the data we will be using.

In [2]:
%matplotlib inline

# Imports
import numpy as np
import pandas as pd
from matplotlib import pylab
from bs4 import BeautifulSoup
from functools import reduce
from sklearn.manifold import TSNE
from IPython.display import display

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Convert csv files into dataframes
biology_pd = pd.read_csv('DATA/biology.csv')
cooking_pd = pd.read_csv('DATA/cooking.csv')
cryptology_pd = pd.read_csv('DATA/crypto.csv')
diy_pd = pd.read_csv('DATA/diy.csv')
robotics_pd = pd.read_csv('DATA/robotics.csv')
travel_pd = pd.read_csv('DATA/travel.csv')
test_pd = pd.read_csv('DATA/test.csv')

# Print dataframe heads
print('Biology: %i questions' % biology_pd.shape[0])
display(biology_pd.head())
print('Cooking: %i questions' % cooking_pd.shape[0])
display(cooking_pd.head())
print('Cryptology: %i questions' % cryptology_pd.shape[0])
display(cryptology_pd.head())
print('DIY: %i questions' % diy_pd.shape[0])
display(diy_pd.head())
print('Robotics: %i questions' % robotics_pd.shape[0])
display(robotics_pd.head())
print('Travel: %i questions' % travel_pd.shape[0])
display(travel_pd.head())
print('Test: %i questions' % test_pd.shape[0])
display(test_pd.head())

Biology: 13196 questions


Unnamed: 0,id,title,content,tags
0,1,What is the criticality of the ribosome bindin...,"<p>In prokaryotic translation, how critical fo...",ribosome binding-sites translation synthetic-b...
1,2,How is RNAse contamination in RNA based experi...,<p>Does anyone have any suggestions to prevent...,rna biochemistry
2,3,Are lymphocyte sizes clustered in two groups?,<p>Tortora writes in <em>Principles of Anatomy...,immunology cell-biology hematology
3,4,How long does antibiotic-dosed LB maintain goo...,<p>Various people in our lab will prepare a li...,cell-culture
4,5,Is exon order always preserved in splicing?,<p>Are there any cases in which the splicing m...,splicing mrna spliceosome introns exons


Cooking: 15404 questions


Unnamed: 0,id,title,content,tags
0,1,How can I get chewy chocolate chip cookies?,<p>My chocolate chips cookies are always too c...,baking cookies texture
1,2,How should I cook bacon in an oven?,<p>I've heard of people cooking bacon in an ov...,oven cooking-time bacon
2,3,What is the difference between white and brown...,"<p>I always use brown extra large eggs, but I ...",eggs
3,4,What is the difference between baking soda and...,<p>And can I use one in place of the other in ...,substitutions please-remove-this-tag baking-so...
4,5,"In a tomato sauce recipe, how can I cut the ac...",<p>It seems that every time I make a tomato sa...,sauce pasta tomatoes italian-cuisine


Cryptology: 10432 questions


Unnamed: 0,id,title,content,tags
0,3,What are the benefits of the two permutation t...,<p>Why do we use a permutation table in the fi...,block-cipher des permutation
1,7,Why use a 1-2 Oblivious Transfer instead of a ...,"<p>When initiating an <a href=""http://en.wikip...",oblivious-transfer multiparty-computation func...
2,8,Why do we append the length of the message in ...,"<p>As we know, <a href=""http://en.wikipedia.or...",sha-1 hash
3,9,What is the general justification for the hard...,<p>Since most cryptographic hash functions are...,hash cryptanalysis preimage-resistance
4,14,"How can I use asymmetric encryption, such as R...",<p>RSA is not designed to be used on long bloc...,encryption rsa public-key


DIY: 25918 questions


Unnamed: 0,id,title,content,tags
0,1,"How do I install a new, non load bearing wall ...",<p>I'm looking to finish my basement and simpl...,remodeling basement carpentry
1,2,What kind of caulk should I use around my bath...,<p>I would like to recaulk between the bathtub...,caulking bathroom
2,3,Is fiberglass mesh tape a good choice for dryw...,<p>I'm going to be doing some drywalling short...,drywall
3,4,Are there ways to determine if a wall is load ...,"<p>Other than looking up blue prints, which ma...",walls load-bearing structural
4,5,How do I safely replace a worn out electrical ...,<p>I have a number of outlets that are old and...,repair electrical


Robotics: 2771 questions


Unnamed: 0,id,title,content,tags
0,1,What is the right approach to write the spin c...,<p>Imagine programming a 3 wheel soccer robot....,soccer control
1,2,How can I modify a low cost hobby servo to run...,"<p>I've got some hobby servos (<a href=""http:/...",control rcservo
2,3,What useful gaits exist for a six legged robot...,"<p><a href=""http://www.oricomtech.com/projects...",gait walk
3,4,Good Microcontrollers/SOCs for a Robotics Project,<p>I am looking for a starting point for my pr...,microcontroller arduino raspberry-pi
4,5,Nearest-neighbor data structure for non-Euclid...,<p>I'm trying to implement a nearest-neighbor ...,motion-planning rrt


Travel: 19279 questions


Unnamed: 0,id,title,content,tags
0,1,What are some Caribbean cruises for October?,<p>My fiancée and I are looking for a good Car...,caribbean cruising vacations
1,2,How can I find a guide that will take me safel...,"<p>This was one of our definition questions, b...",guides extreme-tourism amazon-river amazon-jungle
2,4,Does Singapore Airlines offer any reward seats...,<p>Singapore Airlines has an all-business clas...,loyalty-programs routes ewr singapore-airlines...
3,5,What is the easiest transportation to use thro...,<p>Another definition question that interested...,romania transportation
4,6,How can I visit Antarctica?,"<p>A year ago I was reading some magazine, and...",extreme-tourism antarctica


Test: 81926 questions


Unnamed: 0,id,title,content
0,1,What is spin as it relates to subatomic partic...,<p>I often hear about subatomic particles havi...
1,2,What is your simplest explanation of the strin...,<p>How would you explain string theory to non ...
2,3,"Lie theory, Representations and particle physics",<p>This is a question that has been posted at ...
3,7,Will Determinism be ever possible?,<p>What are the main problems that we need to ...
4,9,Hamilton's Principle,<p>Hamilton's principle states that a dynamic ...


In [3]:
# Stop words from Stanford's NLP codebase: 
# github.com/stanfordnlp/CoreNLP/blob/master/data/edu/stanford/nlp/patterns/surface

stop_words = ["", " ", "a", "about", "above", "after", "again", "against", "all", "am", "an", 
              "and","any", "are", "aren't", "at", "be", "because", "been", "before", "being", 
              "below", "between", "both", "but", "by", "can", "can't",  "cannot", "could",
              "couldn't", "did", "didn't", "do", "does", "doesn't", "doing", "don't", "down", 
              "during", "each", "few", "for", "from", "further", "had",  "hadn't","has", 
              "hasn't", "have", "haven't", "having", "he", "he'd", "he'll", "he's", "her", 
              "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", 
              "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "isn't", "it", 
              "it's", "its", "itself", "let's", "me", "more", "most", "mustn't", "my", 
              "myself", "no", "nor", "not" , "of", "off", "on", "once", "only", "or", "other",
              "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "shan't", 
              "she", "she'd", "she'll", "she's", "should", "shouldn't", "so", "some", "such", 
              "than", "that", "that's", "the", "their", "theirs", "them", "themselves", 
              "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", 
              "they've", "this", "those", "to", "too", "under", "until", "up", "very", "was",
              "wasn't", "we", "we'd", "we'll", "we're", "we've", "were", "weren't", "what", 
              "what's", "when", "when's", "where", "where's", "which", "while", "who", 
              "who's", "whom", "why", "why's", "with", "won't", "would", "wouldn't","you", 
              "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", 
              "yourselves", "return", "arent", "cant", "couldnt", "didnt", "doesnt", "dont", 
              "hadnt", "hasnt", "havent", "hes", "heres", "hows", "im", "isnt", "its", "lets",
              "mustnt", "shant", "shes", "shouldnt", "thats", "theres", "theyll", "theyre", 
              "theyve", "wasnt", "were", "werent", "whats", "whens", "wheres", "whos", "whys",
              "wont", "wouldnt", "youd", "youll", "youre", "youve"]

Step 2: Preprocess Data
-----------------------

We need to convert the raw data into a format that our ML model can use during training and inference. We do this by parsing the HTML and then building two word lists for each topic: data and labels (the test set has data only). The data list consists of the words from the title and the cleaned content, and the labels are the words from the tags. Each element is a single word, not a character or a sentence.
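The steps described above can be illustrated on a single question. This is a minimal sketch that uses only the standard library as a stand-in for BeautifulSoup; `html_to_words` and `DEMO_STOP_WORDS` are hypothetical names introduced for the example, and unlike the notebook's `to_list` it matches stop words case-insensitively.

```python
import html
import re

# Hypothetical mini stop-word list, used only for this illustration.
DEMO_STOP_WORDS = {"", "a", "an", "the", "is", "in", "of", "how", "i", "my"}

def html_to_words(raw_html, stop_words=DEMO_STOP_WORDS):
    """Strip tags, keep alphanumeric characters, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", raw_html)  # replace tags with spaces
    text = html.unescape(text)                # decode entities such as &amp;
    words = []
    for token in text.split():
        word = "".join(ch for ch in token if ch.isalnum())
        if word.lower() not in stop_words:
            words.append(word)
    return words

title = "How should I cook bacon in an oven?"
content = "<p>I've heard of people cooking bacon in an <em>oven</em>.</p>"

# Data: words from the title plus the cleaned content.
data_words = html_to_words(title + " " + content)
# Labels: words from the tags; only empty strings are filtered out here.
label_words = html_to_words("oven cooking-time bacon", stop_words={""})

print(data_words)
print(label_words)
```

Note that stripping non-alphanumeric characters also removes the hyphens inside tags, which is why a tag such as `cooking-time` becomes `cookingtime` in the label list.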

In [4]:
# Convert dataframes to ndarrays
biology_np = biology_pd[['title', 'content', 'tags']].to_numpy()  
cooking_np = cooking_pd[['title', 'content', 'tags']].to_numpy()  
cryptology_np = cryptology_pd[['title', 'content', 'tags']].to_numpy()  
diy_np = diy_pd[['title', 'content', 'tags']].to_numpy()  
robotics_np = robotics_pd[['title', 'content', 'tags']].to_numpy()  
travel_np = travel_pd[['title', 'content', 'tags']].to_numpy()  
test_np = test_pd[['title', 'content']].to_numpy()  


# Parse html
def parse_html(data_np):    
    for i in range(data_np.shape[0]):
        soup = BeautifulSoup(data_np[i,1], 'html.parser')
        soup = soup.get_text()
        soup = BeautifulSoup(soup, 'html.parser')
        soup = soup.decode('utf8')
        soup = soup.replace('\n', ' ')
        data_np[i,1] = soup

parse_html(biology_np)
parse_html(cooking_np)
parse_html(cryptology_np)
parse_html(diy_np)
parse_html(robotics_np)
parse_html(travel_np)
parse_html(test_np)

# Create datasets and labels
def to_list(data):
    """Flatten an array of strings into one list of words.

    Non-alphanumeric characters are stripped from each word, then stop words
    (matched case-sensitively) and empty strings are dropped.
    """
    for i in range(len(data)):
        data[i] = [''.join([ch for ch in s if ch.isalnum()]) for s in data[i].split(' ')]
    return [x for xs in data for x in xs if x not in stop_words]

biology_x = to_list(biology_np[:,0] + ' ' + biology_np[:,1])
biology_y = to_list(biology_np[:,2])
cooking_x = to_list(cooking_np[:,0] + ' ' + cooking_np[:,1])
cooking_y = to_list(cooking_np[:,2])
cryptology_x = to_list(cryptology_np[:,0] + ' ' + cryptology_np[:,1])
cryptology_y = to_list(cryptology_np[:,2])
diy_x = to_list(diy_np[:,0] + ' ' + diy_np[:,1])
diy_y = to_list(diy_np[:,2])
robotics_x = to_list(robotics_np[:,0] + ' ' + robotics_np[:,1])
robotics_y = to_list(robotics_np[:,2])
travel_x = to_list(travel_np[:,0] + ' ' + travel_np[:,1])
travel_y = to_list(travel_np[:,2])
test_x = to_list(test_np[:,0] + ' ' + test_np[:,1])

# Print sample data and labels
print('Biology data: %i words' % len(biology_x))
print(biology_x[:50])
print('\nBiology labels: %i words' % len(biology_y))
print(biology_y[:10])
print('\nCooking data: %i words' % len(cooking_x))
print(cooking_x[:50])
print('\nCooking labels: %i words' % len(cooking_y))
print(cooking_y[:10])
print('\nCryptology data: %i words' % len(cryptology_x))
print(cryptology_x[:50])
print('\nCryptology labels: %i words' % len(cryptology_y))
print(cryptology_y[:10])
print('\nDiy data: %i words' % len(diy_x))
print(diy_x[:50])
print('\nDiy labels: %i words' % len(diy_y))
print(diy_y[:10])
print('\nRobotics data: %i words' % len(robotics_x))
print(robotics_x[:50])
print('\nRobotics labels: %i words' % len(robotics_y))
print(robotics_y[:10])
print('\nTravel data: %i words' % len(travel_x))
print(travel_x[:50])
print('\nTravel labels: %i words' % len(travel_y))
print(travel_y[:10])
print('\nTest data: %i words' % len(test_x))
print(test_x[:50])

Biology data: 796057 words
['What', 'criticality', 'ribosome', 'binding', 'site', 'relative', 'start', 'codon', 'prokaryotic', 'translation', 'In', 'prokaryotic', 'translation', 'critical', 'efficient', 'translation', 'location', 'ribosome', 'binding', 'site', 'relative', 'start', 'codon', 'Ideally', 'supposed', '7b', 'away', 'start', 'How', '9', 'bases', 'away', 'even', 'Will', 'observable', 'effect', 'translation', 'How', 'RNAse', 'contamination', 'RNA', 'based', 'experiments', 'prevented', 'Does', 'anyone', 'suggestions', 'prevent', 'RNAse', 'contamination']

Biology labels: 33129 words
['ribosome', 'bindingsites', 'translation', 'syntheticbiology', 'rna', 'biochemistry', 'immunology', 'cellbiology', 'hematology', 'cellculture']

Cooking data: 888022 words
['How', 'I', 'get', 'chewy', 'chocolate', 'chip', 'cookies', 'My', 'chocolate', 'chips', 'cookies', 'always', 'crisp', 'How', 'I', 'get', 'chewy', 'cookies', 'like', 'Starbucks', 'Thank', 'everyone', 'answered', 'So', 'far', 'tip'

Step 3: Embed Words
-------------------

Word embeddings have proven useful for a wide range of NLP tasks because they convert words into dense vectors that carry semantic meaning, in contrast to a sparse one-hot encoding.
Our implementation is based on the skip-gram model, explained in this TensorFlow [tutorial](https://www.tensorflow.org/versions/r0.12/tutorials/word2vec/index.html).
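To make the skip-gram idea concrete, here is a minimal sketch in plain Python. It assumes the usual setup in which each word is first mapped to an integer ID below the vocabulary size, and every word is then paired with its neighbors inside a small window. `build_vocab` and `skipgram_pairs` are hypothetical helpers written for this illustration and are not part of the notebook.

```python
from collections import Counter

def build_vocab(words, vocab_size):
    """Map the vocab_size - 1 most frequent words to IDs starting at 1.

    ID 0 is reserved for every other (out-of-vocabulary) word.
    """
    most_common = Counter(words).most_common(vocab_size - 1)
    return {word: i + 1 for i, (word, _) in enumerate(most_common)}

def skipgram_pairs(word_ids, window):
    """Return (target, context) ID pairs for each word and its neighbors
    within `window` positions on either side."""
    pairs = []
    for center, target in enumerate(word_ids):
        lo = max(0, center - window)
        hi = min(len(word_ids), center + window + 1)
        for pos in range(lo, hi):
            if pos != center:
                pairs.append((target, word_ids[pos]))
    return pairs

words = ["spin", "relates", "subatomic", "particles", "spin"]
vocab = build_vocab(words, vocab_size=4)
word_ids = [vocab.get(w, 0) for w in words]  # unknown words fall back to ID 0
pairs = skipgram_pairs(word_ids, window=1)

print(vocab)
print(word_ids)
print(pairs)
```

Each pair is one training example: the model sees the target ID and is trained to predict the context ID, so words that appear in similar contexts end up with similar embedding vectors.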

In [5]:
batch_size = 64
embedding_size = 64
vocab_size = 10000
num_sampled = 64
num_context = 4
data_index = 0

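def create_batch(data, data_index, num_context):
    """Build one skip-gram batch of (target, context) pairs.

    Both arrays hold positions in `data`, not vocabulary IDs. Each target
    position is repeated num_context times, once per neighboring position.
    """
    batch_targets = np.ndarray([batch_size], np.int32)
    batch_contexts = np.ndarray([batch_size, 1], np.int32)

    for i in range(0, batch_size, num_context):
        # Window of num_context + 1 positions; drop the middle one (the target).
        context_indexes = list(range(data_index, data_index + num_context + 1))
        del context_indexes[len(context_indexes) // 2]
        batch_targets[i:i + num_context] = data_index + num_context // 2
        batch_contexts[i:i + num_context, 0] = context_indexes
        data_index = (data_index + 1) % len(data)

    return batch_targets, batch_contexts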

test_batch_targets, test_batch_contexts = create_batch(test_x,
                                                       data_index,
                                                       num_context)

print('Original: ' + str(test_x[:batch_size // num_context + num_context]) + '\n')
print('Target: ' + str([test_x[x] for x in test_batch_targets]) + '\n')
print('Context: ' + str([test_x[x[0]] for x in test_batch_contexts]))

Original: ['What', 'spin', 'as', 'relates', 'subatomic', 'particles', 'I', 'often', 'hear', 'subatomic', 'particles', 'property', 'called', 'spin', 'also', 'actually', 'relate', 'spinning', 'axis', 'like']

Target: ['as', 'as', 'as', 'as', 'relates', 'relates', 'relates', 'relates', 'subatomic', 'subatomic', 'subatomic', 'subatomic', 'particles', 'particles', 'particles', 'particles', 'I', 'I', 'I', 'I', 'often', 'often', 'often', 'often', 'hear', 'hear', 'hear', 'hear', 'subatomic', 'subatomic', 'subatomic', 'subatomic', 'particles', 'particles', 'particles', 'particles', 'property', 'property', 'property', 'property', 'called', 'called', 'called', 'called', 'spin', 'spin', 'spin', 'spin', 'also', 'also', 'also', 'also', 'actually', 'actually', 'actually', 'actually', 'relate', 'relate', 'relate', 'relate', 'spinning', 'spinning', 'spinning', 'spinning']

Context: ['What', 'spin', 'relates', 'subatomic', 'spin', 'as', 'subatomic', 'particles', 'as', 'relates', 'particles', 'I', 'relat

In [6]:
graph = tf.Graph()
with graph.as_default():
    train_x = tf.placeholder(tf.int32, [batch_size])
    train_y = tf.placeholder(tf.int32, [batch_size, 1])
        
    embedding_space = tf.Variable(tf.random_uniform([vocab_size, embedding_size]))
    embedded_train_x = tf.nn.embedding_lookup(embedding_space, train_x)
    weights = tf.Variable(tf.truncated_normal([vocab_size, embedding_size]))
    biases = tf.Variable(tf.zeros([vocab_size]))
    
    # nce_loss takes labels before inputs; keyword arguments keep the order explicit.
    loss = tf.reduce_mean(tf.nn.nce_loss(weights=weights,
                                         biases=biases,
                                         labels=train_y,
                                         inputs=embedded_train_x,
                                         num_sampled=num_sampled,
                                         num_classes=vocab_size))

    optimizer = tf.train.AdamOptimizer().minimize(loss)
    # Scale every embedding row to unit length.
    norm = tf.sqrt(tf.reduce_sum(tf.square(embedding_space), 1, keepdims=True))
    normalized_embedding_space = embedding_space / norm

In [None]:
num_steps = 10001
data_index = 0

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()   

    for step in range(num_steps):
        batch_x, batch_y = create_batch(robotics_x, data_index, num_context) 
        _, l = session.run([optimizer, loss], {train_x:batch_x, train_y:batch_y})
        
        if step % 1000 == 0:            
            print('Loss at step %i: %.2f' % (step, l))
            
    final_embedding_space = normalized_embedding_space.eval()

In [None]:
tsne = TSNE()
tsne_embedding_space = tsne.fit_transform(final_embedding_space[:100])

pylab.figure(figsize=(9, 9))

for i in range(len(tsne_embedding_space)):
    x, y = tsne_embedding_space[i, :]
    pylab.scatter(x, y)
    pylab.annotate(robotics_x[i], xy=(x, y))

pylab.show()

In [None]:
import re
import csv
import operator
from collections import defaultdict

stop_words = set(['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across', 'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'another', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'appear', 'appreciate', 'appropriate', 'are', "aren't", 'around', 'as', 'aside', 'ask', 'asking', 'associated', 'at', 'available', 'away', 'awfully', 'b', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'best', 'better', 'between', 'beyond', 'both', 'brief', 'but', 'by', 'c', "c'mon", "c's", 'came', 'can', "can't", 'cannot', 'cant', 'cause', 'causes', 'certain', 'certainly', 'changes', 'clearly', 'co', 'com', 'come', 'comes', 'concerning', 'consequently', 'consider', 'considering', 'contain', 'containing', 'contains', 'corresponding', 'could', "couldn't", 'course', 'currently', 'd', 'definitely', 'described', 'despite', 'did', "didn't", 'different', 'do', 'does', "doesn't", 'doing', "don't", 'done', 'down', 'downwards', 'during', 'e', 'each', 'edu', 'eg', 'eight', 'either', 'else', 'elsewhere', 'enough', 'entirely', 'especially', 'et', 'etc', 'even', 'ever', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'ex', 'exactly', 'example', 'except', 'f', 'far', 'few', 'fifth', 'first', 'five', 'followed', 'following', 'follows', 'for', 'former', 'formerly', 'forth', 'four', 'from', 'further', 'furthermore', 'g', 'get', 'gets', 'getting', 'given', 'gives', 'go', 'goes', 'going', 'gone', 'got', 'gotten', 'greetings', 'h', 'had', "hadn't", 'happens', 'hardly', 'has', "hasn't", 'have', "haven't", 'having', 'he', "he's", 'hello', 'help', 'hence', 'her', 'here', "here's", 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'hi', 'him', 'himself', 'his', 'hither', 
'hopefully', 'how', 'howbeit', 'however', 'i', "i'd", "i'll", "i'm", "i've", 'ie', 'if', 'ignored', 'immediate', 'in', 'inasmuch', 'inc', 'indeed', 'indicate', 'indicated', 'indicates', 'inner', 'insofar', 'instead', 'into', 'inward', 'is', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', 'j', 'just', 'k', 'keep', 'keeps', 'kept', 'know', 'knows', 'known', 'l', 'last', 'lately', 'later', 'latter', 'latterly', 'least', 'less', 'lest', 'let', "let's", 'like', 'liked', 'likely', 'little', 'look', 'looking', 'looks', 'ltd', 'm', 'mainly', 'many', 'may', 'maybe', 'me', 'mean', 'meanwhile', 'merely', 'might', 'more', 'moreover', 'most', 'mostly', 'much', 'must', 'my', 'myself', 'n', 'name', 'namely', 'nd', 'near', 'nearly', 'necessary', 'need', 'needs', 'neither', 'never', 'nevertheless', 'new', 'next', 'nine', 'no', 'nobody', 'non', 'none', 'noone', 'nor', 'normally', 'not', 'nothing', 'novel', 'now', 'nowhere', 'o', 'obviously', 'of', 'off', 'often', 'oh', 'ok', 'okay', 'old', 'on', 'once', 'one', 'ones', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'ought', 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'own', 'p', 'particular', 'particularly', 'per', 'perhaps', 'placed', 'please', 'plus', 'possible', 'presumably', 'probably', 'provides', 'q', 'que', 'quite', 'qv', 'r', 'rather', 'rd', 're', 'really', 'reasonably', 'regarding', 'regardless', 'regards', 'relatively', 'respectively', 'right', 's', 'said', 'same', 'saw', 'say', 'saying', 'says', 'second', 'secondly', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sensible', 'sent', 'serious', 'seriously', 'seven', 'several', 'shall', 'she', 'should', "shouldn't", 'since', 'six', 'so', 'some', 'somebody', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'specified', 'specify', 'specifying', 'still', 'sub', 'such', 'sup', 'sure', 't', "t's", 'take', 'taken', 'tell', 'tends', 'th', 'than', 'thank', 
'thanks', 'thanx', 'that', "that's", 'thats', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', "there's", 'thereafter', 'thereby', 'therefore', 'therein', 'theres', 'thereupon', 'these', 'they', "they'd", "they'll", "they're", "they've", 'think', 'third', 'this', 'thorough', 'thoroughly', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'took', 'toward', 'towards', 'tried', 'tries', 'truly', 'try', 'trying', 'twice', 'two', 'u', 'un', 'under', 'unfortunately', 'unless', 'unlikely', 'until', 'unto', 'up', 'upon', 'us', 'use', 'used', 'useful', 'uses', 'using', 'usually', 'uucp', 'v', 'value', 'various', 'very', 'via', 'viz', 'vs', 'w', 'want', 'wants', 'was', "wasn't", 'way', 'we', "we'd", "we'll", "we're", "we've", 'welcome', 'well', 'went', 'were', "weren't", 'what', "what's", 'whatever', 'when', 'whence', 'whenever', 'where', "where's", 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', "who's", 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'willing', 'wish', 'with', 'within', 'without', "won't", 'wonder', 'would', 'would', "wouldn't", 'x', 'y', 'yes', 'yet', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves', 'z', 'zero', ''])
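
def f1_score(tp, fp, fn):
    # With no true positives, precision and recall are both zero (or undefined),
    # so return 0 instead of dividing by zero.
    if tp == 0:
        return 0.0
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return (2 * p * r) / (p + r)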

def clean_html(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

def get_words(text):
    word_split = re.compile('[^a-zA-Z0-9_\\+\\-/]')
    return [word.strip().lower() for word in word_split.split(text)]

data_path = "../input/"

# newline="" lets the csv module handle line endings itself.
with open(data_path + "test.csv", newline="") as in_file, \
        open("sub_freq.csv", "w", newline="") as out_file:
    reader = csv.DictReader(in_file)
    writer = csv.writer(out_file)
    writer.writerow(['id', 'tags'])
    for ind, row in enumerate(reader):
        # Relative frequency of each non-stop word in the title
        text = clean_html(row["title"])
        tfrequency_dict = defaultdict(int)
        word_count = 0.
        for word in get_words(text):
            if word not in stop_words and word.isalpha():
                tfrequency_dict[word] += 1
                word_count += 1.
        for word in tfrequency_dict:
            tfrequency_dict[word] = tfrequency_dict[word] / word_count
        pred_title_tags = sorted(tfrequency_dict, key=tfrequency_dict.get, reverse=True)[:10]

        # Relative frequency of each non-stop word in the content
        text = clean_html(row["content"])
        dfrequency_dict = defaultdict(int)
        word_count = 0.
        for word in get_words(text):
            if word not in stop_words and word.isalpha():
                dfrequency_dict[word] += 1
                word_count += 1.
        for word in dfrequency_dict:
            dfrequency_dict[word] = dfrequency_dict[word] / word_count
        pred_content_tags = sorted(dfrequency_dict, key=dfrequency_dict.get, reverse=True)[:10]

        # Score each candidate by title frequency plus content frequency; keep the top three
        pred_tags_dict = {}
        for word in set(pred_title_tags + pred_content_tags):
            pred_tags_dict[word] = tfrequency_dict.get(word, 0) + dfrequency_dict.get(word, 0)
        pred_tags = sorted(pred_tags_dict, key=pred_tags_dict.get, reverse=True)[:3]

        writer.writerow([row['id'], " ".join(pred_tags)])
        if ind % 50000 == 0:
            print("Processed : ", ind)