# Indeed / Hackerrank.com Machine Learning Contest

### By Stephen Fox
### April 2017
_____

## Overview
The objective is to accurately tag each job description with the applicable tags from a fixed set of 12 (e.g. '5-plus-years-experience-needed'). A description can carry several tags or none. More details are provided in the accompanying README and project report (.pdf).

In [2]:
# Import the pandas package, then use the "read_csv" function to read
# the labeled training data

import pandas as pd       
train = pd.read_csv("train.tsv", header=0, \
                    delimiter="\t", quoting=3)

test = pd.read_csv("test.tsv", header=0, \
                    delimiter="\t", quoting=3)

print "Data read successfully!"
print "Number of train points:", len(train)
print "Number of test points:", len(test)

# Print the column headers
print "\n"
print "Train file headers:"
print "\n"
print train.dtypes.index

print "\n"
print "Test file headers:"
print "\n"
print test.dtypes.index


Data read successfully!
Number of train points: 4375
Number of test points: 2921


Train file headers:


Index([u'tags', u'description'], dtype='object')


Test file headers:


Index([u'description'], dtype='object')
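
The `quoting=3` argument passed to `read_csv` above is the numeric value of `csv.QUOTE_NONE`, which tells the parser to treat quote characters as ordinary text. A minimal standard-library sketch (Python 3, unlike the Python 2 cells of this notebook) of why that matters for free-text job descriptions; the sample row is made up for illustration:

```python
import csv
import io

# quoting=3 in pandas.read_csv corresponds to csv.QUOTE_NONE
assert csv.QUOTE_NONE == 3

# Hypothetical two-column row whose description contains a stray double quote
sample = 'tags\tdescription\nsalary\tMust lift 50" panels daily\n'

# With QUOTE_NONE the quote is kept as a literal character
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)
print(rows[1])
```

Without this setting, an unbalanced quote inside a description could cause neighbouring fields or rows to be merged.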


In [3]:
# Examine a typical job description

example1 = train["description"][0]
print example1

THE COMPANY    Employer is a midstream service provider to the onshore Oil and Gas markets.  It is a a fast growing filtration technology company providing environmentally sound solutions to the E&P’s for water and drilling fluids management and recycling.    THE POSITION    The North Dakota Regional Technical Sales Representative reports directly to the VP of Sales and covers a territory that includes North Dakota and surrounding areas of South Dakota, Wyoming and Montana.  Specific duties for this position include but are not limited to:     Building sales volume within the established territory from existing and new accounts   Set up and maintain a strategic sales plan for the territory   Present technical presentations, product demonstrations & training   Maintain direct contact with customers, distributors and representatives   Prospect new customer contacts and referrals   Gather and record customer & competitor information   Provide accurate and updated forecasts for the 

In [4]:
# Remove punctuation and other non-alphanumeric characters (e.g. the curly apostrophe in "E&P’s")
import re

# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z0-9]",           # The pattern to search for
                      " ",                   # The pattern to replace it with
                      example1)  # The text to search
print letters_only

THE COMPANY    Employer is a midstream service provider to the onshore Oil and Gas markets   It is a a fast growing filtration technology company providing environmentally sound solutions to the E P   s for water and drilling fluids management and recycling     THE POSITION    The North Dakota Regional Technical Sales Representative reports directly to the VP of Sales and covers a territory that includes North Dakota and surrounding areas of South Dakota  Wyoming and Montana   Specific duties for this position include but are not limited to         Building sales volume within the established territory from existing and new accounts      Set up and maintain a strategic sales plan for the territory      Present technical presentations  product demonstrations   training      Maintain direct contact with customers  distributors and representatives      Prospect new customer contacts and referrals      Gather and record customer   competitor information      Provide accurate and updated fo
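
Every removed character is replaced by a space, which is why the output above contains runs of spaces. That is harmless here because the next step splits on whitespace. A small sketch (Python 3, made-up sample text) of the two steps together:

```python
import re

text = "solutions to the E&P's for water, drilling-fluids & recycling."

# Replace every character that is not an ASCII letter or digit with a space
letters_only = re.sub(r"[^a-zA-Z0-9]", " ", text)

# str.split() with no argument splits on any run of whitespace,
# so the extra spaces introduced above disappear when tokenizing
words = letters_only.lower().split()
print(words)
```

One side effect worth noting is that abbreviations such as "E&P's" break into single-letter tokens.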

In [5]:
# Convert all to lower case and split words

lower_case = letters_only.lower()        # Convert to lower case
words = lower_case.split()               # Split into words
print words

['the', 'company', 'employer', 'is', 'a', 'midstream', 'service', 'provider', 'to', 'the', 'onshore', 'oil', 'and', 'gas', 'markets', 'it', 'is', 'a', 'a', 'fast', 'growing', 'filtration', 'technology', 'company', 'providing', 'environmentally', 'sound', 'solutions', 'to', 'the', 'e', 'p', 's', 'for', 'water', 'and', 'drilling', 'fluids', 'management', 'and', 'recycling', 'the', 'position', 'the', 'north', 'dakota', 'regional', 'technical', 'sales', 'representative', 'reports', 'directly', 'to', 'the', 'vp', 'of', 'sales', 'and', 'covers', 'a', 'territory', 'that', 'includes', 'north', 'dakota', 'and', 'surrounding', 'areas', 'of', 'south', 'dakota', 'wyoming', 'and', 'montana', 'specific', 'duties', 'for', 'this', 'position', 'include', 'but', 'are', 'not', 'limited', 'to', 'building', 'sales', 'volume', 'within', 'the', 'established', 'territory', 'from', 'existing', 'and', 'new', 'accounts', 'set', 'up', 'and', 'maintain', 'a', 'strategic', 'sales', 'plan', 'for', 'the', 'territory'

In [6]:
# Remove english stop words

from nltk.corpus import stopwords # Import the stop word list
print stopwords.words("english") 

[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u

In [7]:
# Remove stop words from "words"
words = [w for w in words if not w in stopwords.words("english")]
print words

['company', 'employer', 'midstream', 'service', 'provider', 'onshore', 'oil', 'gas', 'markets', 'fast', 'growing', 'filtration', 'technology', 'company', 'providing', 'environmentally', 'sound', 'solutions', 'e', 'p', 'water', 'drilling', 'fluids', 'management', 'recycling', 'position', 'north', 'dakota', 'regional', 'technical', 'sales', 'representative', 'reports', 'directly', 'vp', 'sales', 'covers', 'territory', 'includes', 'north', 'dakota', 'surrounding', 'areas', 'south', 'dakota', 'wyoming', 'montana', 'specific', 'duties', 'position', 'include', 'limited', 'building', 'sales', 'volume', 'within', 'established', 'territory', 'existing', 'new', 'accounts', 'set', 'maintain', 'strategic', 'sales', 'plan', 'territory', 'present', 'technical', 'presentations', 'product', 'demonstrations', 'training', 'maintain', 'direct', 'contact', 'customers', 'distributors', 'representatives', 'prospect', 'new', 'customer', 'contacts', 'referrals', 'gather', 'record', 'customer', 'competitor', '

In [8]:
# Define a function to clean up the text input, combining the previously evaluated methods

from bs4 import BeautifulSoup

def job_to_words( raw_job ):
    # Function to convert a raw job posting to a string of words
    # The input is a single string (a raw job description), and 
    # the output is a single string (a preprocessed job description)
    #
    # 1. Remove HTML
    job_text = BeautifulSoup(raw_job, "lxml").get_text() 
    #
    # 2. Replace everything except letters and digits with a space
    letters_only = re.sub("[^a-zA-Z0-9]", " ", job_text) 
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))                  
    # 
    # 5. Remove stop words
    meaningful_words = [w for w in words if not w in stops]   
    #
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return( " ".join( meaningful_words ))

In [9]:
# Check the output on an example

clean_job = job_to_words( example1 )
print clean_job

company employer midstream service provider onshore oil gas markets fast growing filtration technology company providing environmentally sound solutions e p water drilling fluids management recycling position north dakota regional technical sales representative reports directly vp sales covers territory includes north dakota surrounding areas south dakota wyoming montana specific duties position include limited building sales volume within established territory existing new accounts set maintain strategic sales plan territory present technical presentations product demonstrations training maintain direct contact customers distributors representatives prospect new customer contacts referrals gather record customer competitor information provide accurate updated forecasts territory identify new product opportunities build long term relationships customers reps distributors candidate requirement ideal candidate possess technical degree preferably oil gas discipline 5 years experience pref
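
The same cleaning pipeline can be sketched without third-party packages. This version is an illustration rather than the notebook's function: it swaps BeautifulSoup for the standard library's `html.parser`, and it uses a tiny made-up stop-word set (`DEMO_STOPS`) in place of the NLTK list. It is written for Python 3.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word set (the notebook uses NLTK's list)
DEMO_STOPS = {"a", "an", "and", "the", "is", "to", "of", "for", "in"}


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def job_to_words_demo(raw_job, stops=DEMO_STOPS):
    """Strip markup, keep letters and digits, lower-case, drop stop words."""
    parser = _TextExtractor()
    parser.feed(raw_job)
    parser.close()
    text = " ".join(parser.chunks)
    alnum_only = re.sub(r"[^a-zA-Z0-9]", " ", text)
    words = alnum_only.lower().split()
    return " ".join(w for w in words if w not in stops)


cleaned = job_to_words_demo("<p>The <b>VP of Sales</b> is hiring: 5+ years required.</p>")
print(cleaned)
```

Holding the stop words in a set keeps each membership test at roughly constant cost, which matters when the function runs over thousands of descriptions.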

In [10]:
# Get the number of jobs based on the dataframe column size
num_jobs = train["description"].size

# Initialize an empty list to hold the clean jobs
clean_train_jobs = []

print "Cleaning and parsing the training set job descriptions...\n"

# Loop over each job; create an index i that goes from 0 to the length
# of the job list 
for i in xrange( 0, num_jobs ):
    # If the job number (i+1) is evenly divisible by 1000, print a message
    if( (i+1)%1000 == 0 ):
        print "Job %d of %d\n" % ( i+1, num_jobs )

    # Call our function for each one, and add the result to the list of
    # clean jobs
    clean_train_jobs.append( job_to_words( train["description"][i] ) )

Cleaning and parsing the training set job descriptions...

Job 1000 of 4375

Job 2000 of 4375

Job 3000 of 4375

Job 4000 of 4375



In [11]:
print clean_train_jobs[0]
print train["tags"][0]

company employer midstream service provider onshore oil gas markets fast growing filtration technology company providing environmentally sound solutions e p water drilling fluids management recycling position north dakota regional technical sales representative reports directly vp sales covers territory includes north dakota surrounding areas south dakota wyoming montana specific duties position include limited building sales volume within established territory existing new accounts set maintain strategic sales plan territory present technical presentations product demonstrations training maintain direct contact customers distributors representatives prospect new customer contacts referrals gather record customer competitor information provide accurate updated forecasts territory identify new product opportunities build long term relationships customers reps distributors candidate requirement ideal candidate possess technical degree preferably oil gas discipline 5 years experience pref

In [12]:
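# First attempt at looking for instances where a number is followed by 'year..'.
# Note: clean_train_jobs[0] is a single string, so this loop visits individual
# characters rather than words. It prints "found it" for each digit character
# and counts every other character in i; the two-digit entries can never match.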

number_list=['0','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15']
i = 0
for word in clean_train_jobs[0]:
    if word in number_list:
        print "found it"
    else:
        i+=1
print i

found it
1792
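
Because `clean_train_jobs[0]` is a single string, the loop above iterates over characters. One way to do what the opening comment describes is a regular expression over the cleaned text. The pattern and the helper name `find_year_counts` are illustrative choices, not part of the original notebook (Python 3):

```python
import re

# A one- or two-digit number followed by the token "year" or "years"
YEARS_PATTERN = re.compile(r"\b(\d{1,2})\s+years?\b")


def find_year_counts(clean_text):
    """Return the integers that directly precede 'year' or 'years'."""
    return [int(n) for n in YEARS_PATTERN.findall(clean_text)]


sample = "technical degree preferably oil gas discipline 5 years experience 401k 1 year sales"
found = find_year_counts(sample)
print(found)
```

The word boundaries keep tokens such as "401k" from being mistaken for a year count.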


In [13]:
print "Creating the bag of words for descriptions...\n"
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  
vectorizer1 = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 2000,
                              ngram_range = (1,4)) 

# fit_transform() does two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.
train_data_features = vectorizer1.fit_transform(clean_train_jobs)

# Numpy arrays are easy to work with, so convert the result to an 
# array
train_data_features = train_data_features.toarray()

Creating the bag of words for descriptions...
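
What `CountVectorizer` does with `ngram_range = (1,4)` and `max_features = 2000` can be sketched in plain Python: count every run of one to four consecutive tokens, keep the most frequent terms across the corpus as the vocabulary, then count each term per document. This is a simplified illustration, not scikit-learn's implementation; tokenization and tie-breaking differ (Python 3).

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=4):
    """Yield every run of n_min..n_max consecutive tokens as a space-joined string."""
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


def bag_of_words(docs, max_features=2000):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents."""
    per_doc = [Counter(ngrams(doc.split())) for doc in docs]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)
    # Keep the most frequent terms, then sort them alphabetically
    vocab = sorted(term for term, _ in totals.most_common(max_features))
    matrix = [[counts[term] for term in vocab] for counts in per_doc]
    return vocab, matrix


docs = ["ability work independently", "ability work"]
vocab_demo, matrix_demo = bag_of_words(docs)
print(vocab_demo)
print(matrix_demo)
```

Multi-word features such as "ability work independently" in the description vocabulary come from this n-gram expansion.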



In [14]:
# Use a similar approach to vectorize the training labels

# Get the number of jobs based on the dataframe column size
num_data = train["tags"].size

# Initialize an empty list to hold the clean job labels
train_labels = []

print "Cleaning and parsing the training set job labels...\n"

# Loop over each job; create an index i that goes from 0 to the length
# of the job list 
for i in xrange( 0, num_data ):
    # If the job number (i+1) is evenly divisible by 1000, print a message
    if( (i+1)%1000 == 0 ):
        print "Job %d of %d\n" % ( i+1, num_data )

    # Replace missing tags (NaN, which is a float rather than a string)
    # with an empty string
    if type(train["tags"][i]) != str:
        train.loc[i, "tags"] = ''
    train_labels.append(train["tags"][i]) 

print "Sample training labels: \n", train_labels[0]

Cleaning and parsing the training set job labels...

Job 1000 of 4375

Job 2000 of 4375

Job 3000 of 4375

Job 4000 of 4375

Sample training labels: 
licence-needed supervising-job 5-plus-years-experience-needed


In [15]:
print "Creating the bag of words for job labels...\n"
from sklearn.feature_extraction.text import CountVectorizer

# Keep words hyphenated
pattern = "(?u)\\b[\\w-]+\\b"

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  
vectorizer2 = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 5000,
                             token_pattern=pattern) 

# fit_transform() does two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.
train_data_labels = vectorizer2.fit_transform(train_labels)

# Numpy arrays are easy to work with, so convert the result to an 
# array
train_data_labels = train_data_labels.toarray()

print "Sample vectorized training labels: \n",train_data_labels[0]

Creating the bag of words for job labels...

Sample vectorized training labels: 
[0 0 1 0 0 0 0 1 0 0 0 1]
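
The custom token pattern is what keeps each hyphenated tag as one token. The sketch below (Python 3, standard library only) applies the same pattern with `re.findall`, contrasts it with `CountVectorizer`'s default pattern, and rebuilds the indicator vector by hand. `TAGS` is the sorted label vocabulary reported by `vectorizer2.get_feature_names()`; the helper `to_indicator` is a hypothetical name.

```python
import re

# Token pattern used for the labels: hyphens count as part of a token
HYPHEN_PATTERN = r"(?u)\b[\w-]+\b"
# CountVectorizer's default pattern: two or more word characters, no hyphens
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

TAGS = [
    "1-year-experience-needed", "2-4-years-experience-needed",
    "5-plus-years-experience-needed", "associate-needed", "bs-degree-needed",
    "full-time-job", "hourly-wage", "licence-needed", "ms-or-phd-needed",
    "part-time-job", "salary", "supervising-job",
]


def to_indicator(label_string, tags=TAGS):
    """Return a 0/1 list marking which tags occur in the label string."""
    present = set(re.findall(HYPHEN_PATTERN, label_string))
    return [1 if tag in present else 0 for tag in tags]


label = "licence-needed supervising-job 5-plus-years-experience-needed"
hyphen_tokens = re.findall(HYPHEN_PATTERN, label)
default_tokens = re.findall(DEFAULT_PATTERN, label)
print(hyphen_tokens)
print(default_tokens)
print(to_indicator(label))
```

With the default pattern the three tags would dissolve into fragments such as "needed" and "years", so the label vocabulary would no longer correspond to the 12 tags.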


In [16]:
print "Confirm training data description shape: \n", train_data_features.shape
print "Confirm training data labels shape: \n",train_data_labels.shape

Confirm training data description shape: 
(4375, 2000)
Confirm training data labels shape: 
(4375, 12)


In [17]:
# Take a look at the words in the job description vocabulary

vocab_description = vectorizer1.get_feature_names()
print "Job description vocabulary: \n", vocab_description

Job description vocabulary: 
[u'00', u'000', u'00pm', u'10', u'10 years', u'100', u'11', u'12', u'13', u'14', u'15', u'18', u'20', u'2010', u'2012', u'2013', u'2014', u'24', u'25', u'30', u'40', u'40 hours', u'401', u'401k', u'50', u'500', u'60', u'75', u'80', u'90', u'abilities', u'ability', u'ability communicate', u'ability effectively', u'ability manage', u'ability multi', u'ability multi task', u'ability read', u'ability work', u'ability work independently', u'able', u'able pass', u'able work', u'academic', u'access', u'accommodations', u'accommodations may', u'accommodations may made', u'accommodations may made enable', u'accordance', u'according', u'account', u'account executive', u'account manager', u'accountability', u'accountable', u'accounting', u'accounts', u'accredited', u'accuracy', u'accurate', u'accurately', u'achieve', u'achieving', u'acquisition', u'across', u'act', u'action', u'actions', u'active', u'actively', u'activities', u'activity', u'ad', u'adapt', u'add', u'addition

In [18]:
# Take a look at the words in the job label vocabulary (should be the 12 labels)

vocab = vectorizer2.get_feature_names()
print "Job label vocabulary: \n", vocab

Job label vocabulary: 
[u'1-year-experience-needed', u'2-4-years-experience-needed', u'5-plus-years-experience-needed', u'associate-needed', u'bs-degree-needed', u'full-time-job', u'hourly-wage', u'licence-needed', u'ms-or-phd-needed', u'part-time-job', u'salary', u'supervising-job']


In [19]:
import numpy as np

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_labels, axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print count, tag

331 1-year-experience-needed
1043 2-4-years-experience-needed
636 5-plus-years-experience-needed
209 associate-needed
970 bs-degree-needed
885 full-time-job
451 hourly-wage
524 licence-needed
83 ms-or-phd-needed
328 part-time-job
669 salary
751 supervising-job
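
These counts already show two things: jobs carry more than one tag on average, and the tags are unevenly represented. A short arithmetic sketch using only the numbers printed above and the 4,375 training rows (Python 3):

```python
# Tag counts copied from the output above
tag_counts = {
    "1-year-experience-needed": 331,
    "2-4-years-experience-needed": 1043,
    "5-plus-years-experience-needed": 636,
    "associate-needed": 209,
    "bs-degree-needed": 970,
    "full-time-job": 885,
    "hourly-wage": 451,
    "licence-needed": 524,
    "ms-or-phd-needed": 83,
    "part-time-job": 328,
    "salary": 669,
    "supervising-job": 751,
}
n_jobs = 4375

total_tags = sum(tag_counts.values())
avg_tags_per_job = total_tags / n_jobs
rarest = min(tag_counts, key=tag_counts.get)
most_common = max(tag_counts, key=tag_counts.get)

print("total tags:", total_tags)
print("average tags per job: %.4f" % avg_tags_per_job)
for tag, count in sorted(tag_counts.items(), key=lambda kv: kv[1], reverse=True):
    print("%-32s %5d  %5.1f%%" % (tag, count, 100.0 * count / n_jobs))
```

The most frequent tag appears roughly 12.6 times as often as the rarest one, which is worth keeping in mind when reading per-tag recall.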


In [27]:
# Evaluate a few classifiers, then choose one for optimizing next

# Import the desired classifiers, splitters, metrics etc.

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report

from time import time

# All classifiers are named clf for compatibility with tester.py
# Comment out ('#') all classifiers other than the desired one

#clf = DecisionTreeClassifier(random_state=42)
#clf = KNeighborsClassifier()
clf = RandomForestClassifier(random_state=42)

# Split data into training and testing sets, using 30% split

t0 = time()

features_train, features_test, labels_train, labels_test = \
    train_test_split(train_data_features, train_data_labels, test_size=0.3, random_state=42)
    
clf.fit(features_train,labels_train)
labels_train_est = clf.predict(features_train)
labels_pred = clf.predict(features_test)

print "Results for Training: \n", classification_report(labels_train, labels_train_est)
print "\n"
print "Results for Testing: \n", classification_report(labels_test, labels_pred)

print "total train/test/prediction time:", round(time()-t0, 3), "s"

Results for Training: 
             precision    recall  f1-score   support

          0       1.00      0.83      0.91       233
          1       1.00      0.93      0.96       719
          2       1.00      0.88      0.93       430
          3       1.00      0.81      0.90       153
          4       1.00      0.97      0.98       680
          5       1.00      0.90      0.95       610
          6       1.00      0.91      0.96       317
          7       1.00      0.87      0.93       382
          8       1.00      0.77      0.87        48
          9       1.00      0.91      0.95       225
         10       1.00      0.87      0.93       451
         11       1.00      0.88      0.93       529

avg / total       1.00      0.90      0.95      4777



Results for Testing: 
             precision    recall  f1-score   support

          0       0.00      0.00      0.00        98
          1       0.57      0.16      0.25       324
          2       0.62      0.05      0.09      

  'precision', 'predicted', average, warn_for)
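
The report lists precision, recall and F1 for each of the 12 label columns. The sketch below computes the same three quantities from 0/1 indicator matrices in plain Python (Python 3). It is an illustration with made-up data, not scikit-learn's code. When a label is never predicted, precision is 0/0; scikit-learn reports it as 0.00 and emits a warning, the tail of which is visible above.

```python
def per_label_scores(y_true, y_pred):
    """Return a list of (precision, recall, f1) tuples, one per label column."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        # Undefined ratios (zero denominator) are reported as 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        scores.append((precision, recall, f1))
    return scores


# Hypothetical data: 4 jobs, 2 labels; label 1 is never predicted
y_true = [[1, 0], [1, 1], [0, 1], [1, 0]]
y_pred = [[1, 0], [0, 0], [0, 0], [1, 0]]
scores = per_label_scores(y_true, y_pred)
for j, (p, r, f) in enumerate(scores):
    print("label %d  precision %.2f  recall %.2f  f1 %.2f" % (j, p, r, f))
```

High precision with low recall, as in the test results above, means the classifier rarely assigns a tag wrongly but misses many jobs that should have it.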


In [75]:
# This code block was used to optimize the model
# Various pipe components were commented / uncommented to test their effects
# The process and interim results are discussed in the accompanying report

# Import the necessary libraries

from sklearn.pipeline import make_pipeline, Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest,f_classif
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.grid_search import GridSearchCV

t0 = time()

# Build estimator from PCA and Univariate selection:

#combined_features = FeatureUnion([('pca', PCA()), ('select', SelectKBest())])

# Use combined features to transform dataset:

#X_features = combined_features.fit(features,labels).transform(features)

# Piping: combine scaling, feature selection, PCA and classification
# into a single pipeline

#algo = DecisionTreeClassifier(random_state=42)
algo = RandomForestClassifier(random_state=42)

pipe = Pipeline([
#        ('scaler',MinMaxScaler()),
#        ('select',SelectKBest()),
#                 ('reduce_dim', PCA()),
#        ('features',combined_features),
                 ('algo',algo)
    ])

# Cross Validation - choose parameters that maximize the F1 score

# Parameter grid

para = {
#    'select__k':[23],
#    'select__k':np.arange(22,24),
#    'reduce_dim__n_components':np.arange(1,15),
#    'features__pca__n_components':[1, 2, 3],
#    'features__select__k':np.arange(1,24),
#    'algo__criterion': ["gini"],
#    'algo__criterion': ["gini","entropy"],
#    'algo__min_samples_split': [10],
#    'algo__min_samples_split': [2, 10, 20],
#    'algo__min_samples_split': np.arange(2,6),
#    'algo__max_depth': [9],
#    'algo__max_depth': [None, 2, 5, 10],
#    'algo__max_depth': np.arange(8, 15),
#    'algo__n_estimators': np.arange(8,12)],
#    'algo__n_estimators': [9,15,25,50,75,100,125],
    'algo__n_estimators': [9],
#    'algo__max_features': ['auto',None,2,5,10],
    'algo__max_features': [None],
#    'algo__criterion': ["gini","entropy"],
#    'algo__max_depth': [None,2,5,10],
#    'algo__min_samples_split': np.arange(1,5),
#    'algo__min_samples_leaf': np.arange(1,5),
#    'algo__min_weight_fraction_leaf': [0,0.05,0.1,0.2,0.5],
#    'algo__max_leaf_nodes':[None,2,3,5],
    'algo__n_jobs': [-1],
#    'algo__oob_score': [True,False]
#    'algo__min_samples_leaf': [1],
#    'algo__min_samples_leaf': [1, 5, 10],
#    'algo__min_samples_leaf': np.arange(1,3),
#    'algo__class_weight':["balanced"],
#    'algo__class_weight':["balanced",None],
#    'algo__max_leaf_nodes':[None,2,5,10],
#    'algo__max_leaf_nodes':np.arange(7,11),
#    'algo__splitter': ["random"]
#    'algo__splitter': ["best","random"]
       }

# Because of the small size of the dataset, use stratified shuffle split cross validation
# I found that 50 splits provided scores that closely matched the tester.py results and also
# kept runtimes to relatively reasonable durations (the run recorded here uses 10 splits)

# This splitter is rebuilt below on the training subset only
sss = StratifiedShuffleSplit(train_data_labels, 10, random_state = 42)
#cv_clf = GridSearchCV(pipe,param_grid=para, cv = sss, scoring='f1_weighted')

# Hold out 5% of the labeled data (test_size=0.05) for post CV testing

features_train, features_test, labels_train, labels_test = \
    train_test_split(train_data_features, train_data_labels, test_size=0.05, random_state=42)

sss = StratifiedShuffleSplit(labels_train, 10, random_state = 42)
cv_clf = GridSearchCV(pipe,param_grid=para, cv = sss, scoring='f1_weighted')

# Run CV on the training subset only
cv_clf.fit(features_train,labels_train)
clf = cv_clf.best_estimator_

print "model build and validation time:", round(time()-t0, 3), "s"
print '\n'
print "Best F1 score: %0.3f" % cv_clf.best_score_
print '\n'
print "Best Parameters:"
print '\n'
print cv_clf.best_params_
print '\n'

# Evaluate the best estimator on the training subset, the full labeled set and the 5% holdout
# (the earlier 30% split and refit are left commented out below)

t0 = time()

#features_train, features_test, labels_train, labels_test = \
#    train_test_split(train_data_features, train_data_labels, test_size=0.3, random_state=42)
    
#clf.fit(features_train,labels_train)
labels_train_est = clf.predict(features_train)
labels_full_est = clf.predict(train_data_features)
labels_pred = clf.predict(features_test)

print "Results on Training set: \n", classification_report(labels_train, labels_train_est)
print "\n"
print "Results on Full set: \n", classification_report(train_data_labels, labels_full_est)
print "\n"
print "Results on Testing set: \n", classification_report(labels_test, labels_pred)

print "total train/test/prediction time:", round(time()-t0, 3), "s"

model build and validation time: 916.158 s


Best F1 score: 1.000


Best Parameters:


{'algo__max_features': None, 'algo__n_estimators': 9, 'algo__n_jobs': -1}


Results on Training set: 
             precision    recall  f1-score   support

          0       1.00      0.88      0.93       309
          1       0.99      0.98      0.99       995
          2       0.99      0.95      0.97       604
          3       1.00      0.93      0.97       199
          4       0.99      0.98      0.98       926
          5       0.98      0.95      0.96       841
          6       1.00      0.95      0.97       433
          7       1.00      0.92      0.96       492
          8       1.00      0.83      0.91        77
          9       0.98      0.92      0.95       316
         10       0.99      0.95      0.97       637
         11       1.00      0.94      0.97       708

avg / total       0.99      0.95      0.97      6537



Results on Full set: 
             precision    recall  f1-score
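
Parameter names in the grid follow scikit-learn's `step__parameter` convention: `algo__n_estimators` sets `n_estimators` on the pipeline step named `algo`. The grid search then tries every combination of the listed values. The sketch below (Python 3, standard library only) mimics that expansion to show how quickly the number of candidates grows; `expand_grid` and `split_param` are hypothetical helpers, and `wider_grid` combines two of the commented-out lists from the cell above.

```python
from itertools import product


def expand_grid(param_grid):
    """Return one dict per combination of the listed parameter values."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[name] for name in names))]


def split_param(name):
    """Split 'step__param' into (step, param)."""
    step, _, param = name.partition("__")
    return step, param


# The grid that was active in the cell above: a single candidate
active_grid = {
    "algo__n_estimators": [9],
    "algo__max_features": [None],
    "algo__n_jobs": [-1],
}
# Two of the commented-out alternatives, combined for illustration
wider_grid = {
    "algo__n_estimators": [9, 15, 25, 50, 75, 100, 125],
    "algo__max_depth": [None, 2, 5, 10],
}

n_active = len(expand_grid(active_grid))
n_wider = len(expand_grid(wider_grid))
print(n_active, n_wider)
print(split_param("algo__n_estimators"))
```

With the 10 shuffle splits used above, each candidate is fitted 10 times, so the wider grid would need 280 fits before the final refit.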

In [76]:
# Clean and vectorize the test data (already loaded into "test" above)

# Verify that there are 2,921 rows and 1 column
print test.shape

# Create an empty list and append the clean jobs one by one
num_test = len(test["description"])
clean_test = [] 

print "Cleaning and parsing the test set job descriptions...\n"
for i in xrange(0,num_test):
    if( (i+1) % 500 == 0 ):
        print "Job %d of %d\n" % (i+1, num_test)
    raw_test = job_to_words( test["description"][i] )
    clean_test.append( raw_test )

# Get a bag of words for the test set, and convert to a numpy array
test_data_features = vectorizer1.transform(clean_test)
test_data_features = test_data_features.toarray()

# Use the classifier to make job label predictions
result = clf.predict(test_data_features)

(2921, 1)
Cleaning and parsing the test set job descriptions...

Job 500 of 2921

Job 1000 of 2921

Job 1500 of 2921

Job 2000 of 2921

Job 2500 of 2921



In [77]:
# Compare average tags per job for train and test sets
# I would expect the average to be similar for each set

train_tagged = np.sum(train_data_labels)
test_tagged = np.sum(result)

train_size = len(train)
test_size = len(test)

print "tags per train set:", (float(train_tagged) / train_size)
print "tags per test set:", (float(test_tagged) / test_size)

print "Difference (tags per train - tags per test):",(float(train_tagged) 
                                                     / train_size) - (float(test_tagged) / test_size)

tags per train set: 1.57257142857
tags per test set: 1.0445053064
Difference (tags per train - tags per test): 0.52806612217


In [78]:
# Convert numerical labels back to actual descriptions

num_labels = len(vocab)
num_test = len(result)

test_tags = []
for i in range(0,num_test):
    row = []
    for j in range(0,num_labels):
        if result[i][j] >= 0.5:
            row.append(vocab[j])
    # Join the selected tags once per job, after the inner loop has finished
    test_tags.append(' '.join(row))
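
The conversion back to tag strings is the inverse of the label vectorization: each column index maps to one entry of the sorted tag vocabulary. A compact sketch of the same idea (Python 3); `TAGS` repeats the vocabulary printed earlier, and `row_to_tags` is a hypothetical helper name:

```python
TAGS = [
    "1-year-experience-needed", "2-4-years-experience-needed",
    "5-plus-years-experience-needed", "associate-needed", "bs-degree-needed",
    "full-time-job", "hourly-wage", "licence-needed", "ms-or-phd-needed",
    "part-time-job", "salary", "supervising-job",
]


def row_to_tags(row, tags=TAGS, threshold=0.5):
    """Join the names of all tags whose predicted value reaches the threshold."""
    return " ".join(tag for tag, value in zip(tags, row) if value >= threshold)


rows = [
    [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1],  # three tags predicted
    [0] * 12,                               # no tag predicted
]
decoded = [row_to_tags(row) for row in rows]
print(decoded)
```

Tags come out in vocabulary order rather than in the order used in the training file, and a job with no predicted tag yields an empty string.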

In [79]:
# Confirm that test_tags contains one entry for each of the 2,921 test points

print len(test_tags)

2921


In [80]:
# Copy the results to a pandas dataframe with a "tags" header

output = pd.DataFrame( data={"tags":test_tags})

# Use pandas to write the tab-separated output file

output.to_csv( "tags.tsv",sep='\t',index=False)
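
Before submitting, it can be worth checking that the file still has one row per job, since jobs with no predicted tag are written as empty strings. The sketch below does a write-and-read-back check with the standard library's `csv` module and an in-memory buffer instead of pandas and a real file; the sample tags are made up (Python 3).

```python
import csv
import io

# Hypothetical predictions, including one job with no tags
test_tags_demo = ["salary full-time-job", "", "licence-needed"]

# Write a single "tags" column, tab-separated, one row per job
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(["tags"])
for tags in test_tags_demo:
    writer.writerow([tags])

# Read the text back and confirm that no row was lost or altered
buf.seek(0)
rows = list(csv.reader(buf, delimiter="\t"))
header, body = rows[0], [r[0] for r in rows[1:]]
print(header, len(body))
```

The same check could be run on "tags.tsv" itself by comparing its row count with the 2,921 test descriptions.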