<h1 id="tocheading">Table of Contents</h1>
<div id="toc"></div>

# Text Analytics - Assignment 2
COMPETITION TASK: 

+ Learn a classification model from the training set, in which each document belongs to one of 5 categories: ['business', 'entertainment', 'politics', 'sport', 'tech'].

+ Apply the learned model to predict the labels for "testdata.csv".

## Team Members: 
Laura Brierton - 15317451, Clodagh Lalor - 13354426, Jeremy Schiff - student#, Peter Concannon - student#

============================================================================================================================

The following is a jupyter notebook extension to create a table of contents:


In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')


# Pre-Processing

## Step 1: Import packages

In [2]:
import pandas as pd
import numpy as np
import re
import nltk, json
from wordcloud import WordCloud
from nltk import ngrams
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.cluster import KMeans
import math
from sklearn.ensemble import VotingClassifier

We started by importing our training and test datasets and placing them into dataframes:

In [3]:
raw_trainset = pd.read_csv('trainingset.csv',sep='^',header=0)
raw_testdata = pd.read_csv('testdata.csv',sep='^',header=0)
raw_trainset.head()

Unnamed: 0,content,category
0,French boss to leave EADS The French co-head o...,business
1,"Gamers could drive high-definition TV, films, ...",tech
2,Stalemate in pension strike talks Talks aimed ...,politics
3,Johnny and Denise lose Passport Johnny Vaughan...,entertainment
4,Tautou 'to star in Da Vinci film' French actre...,entertainment


## Step 2: Extract Tokens

We used the below function to tokenise our datasets:

In [4]:
# Define the Function to convert raw text to tokens
def convert_tokens(rawtext, verbose=False):
    # First: Tokenization
    # start by removing hyphens to allow for better tokenization
    rawtext = rawtext.replace('-', ' ')
    pattern = r'\w+'
    tokenizer = RegexpTokenizer(pattern)
    token_words = tokenizer.tokenize(rawtext)
    if (verbose):
        print('Tokens:' + str(token_words[0:10]))
    
    # Second: Decapitalization 
    decap_token_words = [word.lower() for word in token_words]
    if (verbose):
        print('Decapitalized Tokens:' + str(decap_token_words[0:10]))
    
    # Third: Remove stop words
    json_data=open('stopwords.json', encoding="utf8").read()
    stopwords_json = json.loads(json_data)
    stopwords_json_en = set(stopwords_json['en'])
    stopwords_nltk_en = set(stopwords.words('english'))
    # Combine the stopwords
    stoplist_combined = set.union(stopwords_json_en, stopwords_nltk_en)

    
    rmsw_token_words = ([word for word in decap_token_words if word.lower() not in stoplist_combined])
    if (verbose):
        print('Stopwords removed:' + str(rmsw_token_words[0:20]))
    
    ## Fourth: remove CAP words
    # (tokens were lower-cased in the second step, so isupper() is never True here)
    rmcap_token_words =[]
    for word in rmsw_token_words:
        if word.isupper():
            rmcap_token_words.append(word.title())
        else:
            rmcap_token_words.append(word)
    if (verbose):
        print('CAPITALIZED removed:' + str(rmcap_token_words[0:20]))
        
     ## Fifth : Remove salutation
    salutation = ['mr','mrs','mss','dr','phd','prof','rev', 'professor']
    rmsalu_token_words = ([word for word in rmcap_token_words if word.lower() not in salutation])
    if (verbose):
        print('Salutation removed:' + str(rmsalu_token_words[0:20]))
        
     ## Sixth: Remove words containing numbers
    rmnb_token_words = ([word for word in rmsalu_token_words if not re.search(r"\d+", word)])
    if (verbose):
        print('Number removed: ' + str(rmnb_token_words[0:20]))
        
    ## define transfer tag function:
    def transfer_tag(treebank_tag):
        # startswith accepts a tuple of prefixes; ('j' or 'J') would evaluate to just 'j'
        if treebank_tag.startswith(('j', 'J')):
            return 'a'
        elif treebank_tag.startswith(('v', 'V')):
            return 'v'
        elif treebank_tag.startswith(('n', 'N')):
            return 'n'
        elif treebank_tag.startswith(('r', 'R')):
            return 'r'
        else:
            # As default pos in lemmatization is Noun
            return 'n'
    
    ## Seventh: Lemmatization
    wnl = WordNetLemmatizer()

    lemma_words = []
    for word, tag in nltk.pos_tag(rmnb_token_words):
        firstletter = tag[0].lower() # -> get the first letter of tag and put them decapitalized form
        wtag = transfer_tag(firstletter) # -> extract the word's tag (noun, verb, adverb, adjective)
        if not wtag:
            lemma_words.extend([word])
        ##please note we had to hardcode the following words in due to an error with word net
        elif word == "boss":
            lemma_words.extend([(word)])
        elif word == "gamers":
            lemma_words.extend([("gamer")])
        else:
            lemma_words.extend([wnl.lemmatize(word, wtag)]) # -> get lemma for word with tag
    if (verbose):
        print('Lemmas : ' + str(lemma_words[0:10]))
        
    
    ## RETURN
    return lemma_words
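
One Python detail is worth spelling out for tag-mapping helpers like `transfer_tag`: `str.startswith` accepts a tuple of prefixes, whereas an expression such as `'j' or 'J'` is evaluated first and collapses to `'j'`. The sketch below illustrates the difference; `to_wordnet_pos` is a hypothetical name used only for this illustration and is not part of our pipeline.

```python
# `'j' or 'J'` is an ordinary boolean expression: it evaluates to the first
# truthy operand, so only the lowercase prefix would ever be checked.
prefix = 'j' or 'J'

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    tag = treebank_tag.upper()
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # noun, and the fallback for every other tag

print(prefix)
print('JJ'.startswith('j' or 'J'))    # only 'j' is tested
print('JJ'.startswith(('j', 'J')))    # both prefixes are tested
print(to_wordnet_pos('VBD'))
```

In `convert_tokens` the tag is lower-cased before it is passed in, which is why the lowercase-only comparison still gives the expected result there.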

We tokenized our training and test sets separately.
<br>
<br>
Extract tokens for training set:

In [None]:
## next, we create a dataframe that contains the content, category and bag of words for each document
df_handle = raw_trainset.copy()
df_handle["Tokens"] = df_handle.apply(lambda row: convert_tokens(row["content"]), axis=1)
df_handle.head(10)

Extract tokens for test set:
<br>
* Note how we don't have the category column in this dataset, as this is what we want our model to ultimately predict!

In [None]:
df_handle_test = raw_testdata.copy()
df_handle_test_onhold = raw_testdata.copy() #on hold until the end of the document
df_handle_test["Tokens"] = df_handle_test.apply(lambda row: convert_tokens(row["content"]), axis=1)
df_handle_test.head(10)

We continued the tokenizing process and created entries that contain only the noun or only the adjective tokens.

In [None]:
#Generalization of the extraction
def extract_pos_tokens(tokens, pos):
    # helper for list comprehension
    def is_pos(treebank_tag):
        if treebank_tag.startswith(pos):
            return True
        else:
            return False
    return [word for (word, tag) in nltk.pos_tag(tokens) if is_pos(tag)]

#Specific noun instance
def extract_noun_tokens(tokens):
    # note that this does not include the "or 'n'" component which was both unnecessary and didn't work on my machine
    # furthermore, it does not take noun to be the default
    return extract_pos_tokens(tokens, 'N')
    
#Specific adjective instance
def extract_adj_tokens(tokens):
    # same idea as with nouns - the or 'j' is unneeded
    return extract_pos_tokens(tokens, 'J')

This step adds our noun_token and adjective_token columns to our training dataset:

In [None]:
df_handle["noun_tokens"] = df_handle.apply(lambda row: extract_noun_tokens(row["Tokens"]), axis=1)
df_handle["adjective_tokens"] = df_handle.apply(lambda row: extract_adj_tokens(row["Tokens"]), axis=1)

df_handle.head(10)

Next we do the same for our test set:

In [None]:
df_handle_test["noun_tokens"] = df_handle_test.apply(lambda row: extract_noun_tokens(row["Tokens"]), axis=1)
df_handle_test["adjective_tokens"] = df_handle_test.apply(lambda row: extract_adj_tokens(row["Tokens"]), axis=1)

df_handle_test.head(10)

## Step 3: Deconstruction - Wordclouds and Frequency 

### Wordcloud

Next we decided to create a wordcloud for the entire corpus, to get an idea of the most common words and if there was any common pattern. We thought that it might help us to decide if there were any more pre-processing steps we needed to take before moving onto our Analysis stage.

In [None]:
## Wordcloud function
def wordcloudplot(tokens, name):
    
    text2 = ' '.join(tokens)

    wordcloud = WordCloud(width=1600, height=800).generate(text2)
    plt.figure( figsize=(20,10), facecolor='k')
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    
    # save to file if filename given
    if name:
        wordcloud.to_file(name)
        
    plt.show()

In [None]:
#joined tokens refers to all the tokens from all the documents in the training data dataframe
joined_tokens = [token for document in df_handle["Tokens"] for token in document]

#saves wordcloud of all tokens as file. Please note, this word cloud is for all tokens not just noun tokens
wordcloudplot(joined_tokens, 'img_wordcloud1.png')

Looking at our word cloud, we can see immediately that a lot of verbs are present. These will not necessarily help us with our classification step, which influenced us to look at noun and adjective tokens instead going forward.

In [None]:
noun_tokens = [token for document in df_handle["noun_tokens"] for token in document]
adjective_tokens = [token for document in df_handle["adjective_tokens"] for token in document]
wordcloudplot(noun_tokens, 'img_wordcloud2.png')
wordcloudplot(adjective_tokens, 'img_wordcloud3.png')

Looking at these wordclouds separately gives us a nice overview of our training dataset. We can see a good reflection of the known classes, which suggests to us that the dataset is fairly balanced.

### Frequency

Building on this, we wanted to look at the frequency of certain nouns and adjectives overall in the data with hopes that we could glean some information to aid classification.

In [None]:
joined_noun_tokens = [token for document in df_handle["noun_tokens"] for token in document]
word_frequency = nltk.FreqDist(joined_noun_tokens)
word_frequency.plot(20, title='Twenty Most Common Nouns')

joined_adjective_tokens = [token for document in df_handle["adjective_tokens"] for token in document]
word_frequency = nltk.FreqDist(joined_adjective_tokens)
word_frequency.plot(20, title='Twenty Most Common Adjectives')

The nouns look pretty useful, with words like government, film, and game all likely being strong indicators of the category of a document. On the other hand, the adjectives seem much less useful, with generic words like high, big, and good being so prevalent.
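
To check whether frequent nouns really do point to a category, the counts can also be split by class. The following is a minimal sketch on made-up rows, not our real data; `top_tokens_by_category` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Toy stand-in for (category, noun_tokens) pairs; not taken from the real data.
rows = [
    ("politics", ["government", "minister", "election"]),
    ("politics", ["government", "party"]),
    ("sport", ["game", "player", "game"]),
    ("entertainment", ["film", "award", "film"]),
]

def top_tokens_by_category(rows, n=2):
    """Return the n most frequent tokens for each category."""
    counts = defaultdict(Counter)
    for category, tokens in rows:
        counts[category].update(tokens)
    return {category: counter.most_common(n) for category, counter in counts.items()}

top = top_tokens_by_category(rows)
for category, tokens in top.items():
    print(category, tokens)
```

With the notebook's dataframe, the rows could be built with something like `zip(df_handle["category"], df_handle["noun_tokens"])`.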

# Analysis

## Step 4 : Most Common Bigrams and Trigrams

Another part of our analysis involved looking at bigrams and trigrams, which we thought might aid us in classification.

### Bigrams

In [None]:
joined_tokens = [token for document in df_handle["Tokens"] for token in document]

bigram = ngrams(joined_tokens, 2)
bi_frequencies = nltk.FreqDist(bigram)
dict_items =list(dict(bi_frequencies).items())
#make a dataframe of the bigrams and their frequencies
df_bigramFreq = pd.DataFrame(dict_items, columns=['bigram','freq']).sort_values(by='freq', ascending=False)
df_bigramFreq = df_bigramFreq.reset_index(drop=True)
#show only top five
df_bigramFreq.head(5)
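
Because `joined_tokens` flattens every document into one list, n-grams built from it can pair the last token of one article with the first token of the next. The sketch below, on two invented documents, contrasts that with counting bigrams per document. The `bigrams` helper here is a plain-Python stand-in for illustration, not the `nltk.ngrams` function.

```python
from collections import Counter

def bigrams(tokens):
    """Adjacent token pairs within a single token list."""
    return list(zip(tokens, tokens[1:]))

# Two invented documents, already tokenised.
docs = [["prime", "minister", "speak"], ["film", "award", "win"]]

# Flattening first creates a bigram that spans the document boundary.
flattened = [token for doc in docs for token in doc]
flat_counts = Counter(bigrams(flattened))

# Counting inside each document avoids the cross-boundary pair.
per_doc_counts = Counter(bg for doc in docs for bg in bigrams(doc))

print(("speak", "film") in flat_counts)
print(("speak", "film") in per_doc_counts)
```

For a corpus of long articles these boundary pairs are rare relative to genuine bigrams, so the effect on the top of the frequency table is likely small.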

In [None]:
# Function to check the gram is noun gram or not
def IsNounGram(ngram):
    if ('-pron-' in ngram) or ('t' in ngram):
        return False
    
    first_type = ('JJ','JJR','JJS','NN','NNS','NNP','NNPS')
    second_type = ('NN','NNS','NNP','NNPS')
    tags = nltk.pos_tag(ngram,lang='eng')
    if (tags[0][1] in first_type) and (tags[1][1] in second_type):
        return True
    else:
        return False

In [None]:
## we decided to look only at the top 50 for efficiency's sake
df_bigramFreq_filter = df_bigramFreq.copy().head(50)
# create a new column which records, for each bigram, whether it is a noun gram (True or False)
df_bigramFreq_filter['noun_gram'] = df_bigramFreq_filter["bigram"].map(lambda x : IsNounGram(x))
#filter out those that are false
df_bigramFreq_filter = df_bigramFreq_filter[df_bigramFreq_filter.noun_gram != False]
df_bigramFreq_filter.head(5)

We can see from this that our top five have changed a little (as has the entire dataframe). We were concerned about the (tell, BBC) bigram, and so decided to add an additional step that checks whether each bigram actually appears as a phrase in the original text.

In [None]:
##check that these ngrams actually appear together in the text!
def CheckWordInText(word, Text):
    if word in Text.lower():
        return True
    else:
        return False
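
`CheckWordInText` uses a plain substring test, which can also match inside longer words. A possible stricter variant, sketched here with a hypothetical helper `phrase_in_text`, anchors the phrase on word boundaries with a regular expression.

```python
import re

def phrase_in_text(phrase, text):
    """True if the words of `phrase` occur consecutively as whole words in `text`."""
    words = [re.escape(word) for word in phrase.split()]
    pattern = r'\b' + r'\s+'.join(words) + r'\b'
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

sentence = "She finally wins the title."

# The substring test matches inside "finally wins" ...
print("ally win" in sentence.lower())
# ... while the word-boundary version does not.
print(phrase_in_text("ally win", sentence))
print(phrase_in_text("prime minister", "The Prime Minister spoke"))
```

Note that our tokens are lemmatised, so a genuine phrase may still fail either check when its surface form in the text differs from its lemma.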

In [None]:
# combine the raw text of all documents into one string
# (a single join is enough; appending every document again would only duplicate the corpus)
full_corpus = ' '.join(df_handle["content"])

In [None]:
## this step adds a column stating that the n-gram appears as is, in the text
df_real_bigram = df_bigramFreq_filter.copy()

#true if exists
exists_list = []
for index, row in df_bigramFreq_filter.iterrows():
    gram = row['bigram']
    word = (' '.join(gram))
    exists_list.append(CheckWordInText(word, full_corpus))

#create new column for these    
df_bigramFreq_filter['exist'] = exists_list    

#delete those that are false
df_real_bigram = df_bigramFreq_filter[df_bigramFreq_filter.exist != False]

#show top 5
df_real_bigram.head(5)

We shall now do the same for trigrams. 

### Trigrams

In [None]:
joined_tokens = [token for document in df_handle["Tokens"] for token in document]
trigram = ngrams(joined_tokens, 3)
tri_frequencies = nltk.FreqDist(trigram)
dict_items = list(dict(tri_frequencies).items())
#make a dataframe of the trigrams and their frequencies
df_trigramFreq = pd.DataFrame(dict_items, columns=['trigram','freq']).sort_values(by='freq', ascending=False)
df_trigramFreq = df_trigramFreq.reset_index(drop=True)

#check for noun grams
##look only at top 50 for efficiency sake
df_trigramFreq_filter = df_trigramFreq.copy().head(50)
#create a new column which records, for each trigram, whether it is a noun gram (True or False)
#note that IsNounGram only inspects the tags of the first two tokens
df_trigramFreq_filter['noun_gram'] = df_trigramFreq_filter["trigram"].map(lambda x : IsNounGram(x))
#filter out those that are false
df_trigramFreq_filter = df_trigramFreq_filter[df_trigramFreq_filter.noun_gram != False]
#printf(df_bigramFreq_filter.head(5))

#check in text
df_real_trigram = df_trigramFreq_filter.copy()

exits_list = []
for index, row in df_trigramFreq_filter.iterrows():
    gram = row['trigram']
    word = (' '.join(gram))
    exits_list.append(CheckWordInText(word, full_corpus))
    
df_real_trigram['exist'] = exits_list
df_real_trigram = df_real_trigram.loc[df_real_trigram.exist==True]


#df_trigramFreq_filter.head(5)
df_real_trigram.head(5)

After looking at the results from our bigrams and trigrams, we decided **against** using them for classification purposes. Looking at the top 5 of both, we thought that they were mostly relevant only to the politics class, so including them in our model could bias it on what looks to be a fairly balanced dataset.

# Vectorization

TF-IDF vectorisation is used for identifying significant words within the dataset, giving greater weight to words that are uncommon across the full corpus.

## Step 5: Vectorising the training set

TF-IDF vectorisation is used for identifying significant words within documents, giving greater weight to words that are especially common in an individual document while being uncommon across the full corpus. By doing so, we can get an idea of what words are especially important within individual documents, and use this information to better classify documents in the corpus.

The structure of the TF-IDF vectors must be constant in order to use them with most models. Because of this, words will be present in the TF-IDF vectors if and only if they appear in the training data (after normalisation). Therefore, certain words which would have had greater weights in documents in the test dataset will be discarded. This should have minimal impact on our model's ability to classify with TF-IDF, as the absence of those words from the training set means that the models will have no known values for the new words, and will therefore be unable to use them.

In [None]:
#the sample output given shows only zeros but it is working
merged_tokens = [" ".join(x) for x in df_handle["noun_tokens"]]
tfidf_vectorizer = TfidfVectorizer(norm=None)
tfidf_train = pd.DataFrame(tfidf_vectorizer.fit_transform(merged_tokens).todense(), columns = tfidf_vectorizer.get_feature_names())
tfidf_train.head(5)
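
To make the weighting described in this step concrete, here is a small pure-Python sketch of TF-IDF on invented documents. It assumes raw term counts, no vector normalisation (as with `norm=None`), and the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, which we understand to be scikit-learn's documented default. The helpers `fit_idf` and `transform` are hypothetical names, not the library's implementation.

```python
import math
from collections import Counter

# Invented, already-tokenised documents.
train_docs = [["film", "award", "film"], ["government", "election"], ["film", "government"]]
test_doc = ["film", "gamer", "film"]   # "gamer" never appears in the training docs

def fit_idf(docs):
    """Learn one idf weight per term seen in the training documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def transform(doc, idf):
    """Vectorise a document over the training vocabulary only."""
    counts = Counter(doc)
    return {term: counts[term] * weight for term, weight in idf.items()}

idf = fit_idf(train_docs)
test_vector = transform(test_doc, idf)

print(idf)
print(test_vector)
```

The test vector has exactly the training vocabulary as its keys: "gamer" is dropped because it was never seen during fitting, and "award", which occurs in fewer training documents than "film", receives the larger idf weight.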

## Step 6: Vectorising the test set

In [None]:
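# use the noun tokens so that the test features are built the same way as the training features
test_clean_data = [" ".join(x) for x in df_handle_test["noun_tokens"]]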

# Create a tfidf for the test data with the same dimensions (i.e. headers)
# as the set above. necessary for comparing models
tfidf_clean = tfidf_vectorizer.transform(test_clean_data)
## For printing that tf-idf matrix, we convert it into a dataframe
## (the feature names are passed as a flat list so the columns match tfidf_train)
tfidf_test = pd.DataFrame(tfidf_clean.toarray(), columns=tfidf_vectorizer.get_feature_names())
tfidf_test.head(5)

# Model Learning

## Step 7: Class Analysis

In [None]:
# this takes our known labels from the dataframe to use for training our model
labels_train = df_handle['category']

We decided to plot the frequency of the labels to ensure that we had a fairly balanced dataset.

In [None]:
import matplotlib.pyplot as plt

unique, counts = np.unique(labels_train, return_counts=True)

plt.bar(unique,counts)
plt.title('Class Frequency')
plt.xlabel('Class')
plt.ylabel('Frequency')
plt.show()

From this plot we can see that this is a balanced dataset. This reinforced our decision not to use bigrams and trigrams: politics is not the most frequent class, so features that mostly favour politics could bias our model.

## Step 8: Select Classifiers

### Hold-out Validation Split and K-Fold Setup

In [None]:
validation_size = 0.1
num_folds = 10
model_tfidf, validation_tfidf, model_labels, validation_labels = train_test_split(tfidf_train, labels_train, test_size=validation_size, shuffle=True, stratify=labels_train)

### BAYES classifier

In [None]:
modelbayes = MultinomialNB()
modelbayes.fit(model_tfidf, model_labels)

### Random Forest classifier

In [None]:
randomforest = RandomForestClassifier()
randomforest.fit(model_tfidf, model_labels)

### K-Means classifier

In [None]:
modelkmeans = KMeans(n_clusters=5, init='k-means++', max_iter=200, n_init=100)
modelkmeans.fit(model_tfidf)

### KNN classifier

Below, we attempt to use GridSearchCV to identify the ideal value for k in the kNN classifier. Calculating this hyperparameter is costly, as it requires the creation of a kNN model for each possible value in the given set. Although the set of possible values here is kept small in our submission notebook for the sake of running faster, we did run tests on the range \[1, 2, 3, 4, 5\], and still found that 1 was the optimal value for accuracy. 

The difference between documents of differing classification is small enough that comparing to more than the singular nearest neighbour results in more incorrect classifications. Using k=1 may be a case of overfitting the data: a lower value for k gives a more complex model with lower bias and higher variance. However, this is not likely to be an issue, as our use of ensemble classification will help to greatly reduce the variance of the final classifier.

In [None]:
#takes about 5-6 mins to complete
# we only used 1 to 2 here as this step takes so long. Please note that we had tried [1,..,5] but, after it ran for about 30 mins, determined that the optimum is actually within the range [1,2], so decided to go with this for efficiency's sake
k_values = [1,2]
print("the range of k is " + str(k_values))
grid_search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors':k_values}, scoring='accuracy', cv = num_folds)
grid_search.fit(model_tfidf, model_labels)

The following step will take the optimum k value determined in the previous step and use that for our k neighbours classifier:

In [None]:
modelknn = KNeighborsClassifier(**grid_search.best_params_)
modelknn.fit(model_tfidf, model_labels)

According to this subset of candidate values, the best value is k=1.
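
The effect of a small k can be seen on a toy example. The sketch below is a hand-rolled nearest-neighbour vote on invented one-dimensional points, not the scikit-learn classifier and not our TF-IDF data; `knn_predict` is a hypothetical helper.

```python
from collections import Counter

def knn_predict(train, query, k):
    """train: list of (x, label) pairs with numeric x; majority label of the k nearest points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented points: one "tech" point sits among the "sport" points.
train = [(1.0, "sport"), (1.2, "sport"), (1.4, "sport"),
         (1.5, "tech"), (3.0, "tech"), (3.2, "tech")]
query = 1.55

print(knn_predict(train, query, k=1))
print(knn_predict(train, query, k=3))
```

With k=1 the prediction follows the single closest point, so one unusual training document decides the label. With k=3 that point is outvoted by its neighbours. This is the sense in which a lower k gives lower bias but higher variance.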

### Logistic classifier

In [None]:
c_values = [1e-10, 1e-8, 1e-6, 1e-4, 1e-2, 1, 1e2]
grid_search = GridSearchCV(LogisticRegression(), {'C':c_values}, scoring='accuracy', cv = num_folds)
grid_search.fit(model_tfidf, model_labels)

modellr = LogisticRegression(**grid_search.best_params_)
modellr.fit(model_tfidf, model_labels)

## Step 9: Applying models to the validation set to estimate accuracy

Next we decided to apply our models to the held-out validation portion of our training set to see how they performed.

In [None]:
predicted_labels_bayes = modelbayes.predict(validation_tfidf)
predicted_probas_bayes = modelbayes.predict_proba(validation_tfidf)

In [None]:
predicted_labels_randomforest = randomforest.predict(validation_tfidf)
predicted_probas_randomforest = randomforest.predict_proba(validation_tfidf)

In [None]:
predicted_labels_knn = modelknn.predict(validation_tfidf)
predicted_probas_knn = modelknn.predict_proba(validation_tfidf)

In [None]:
predicted_labels_lr = modellr.predict(validation_tfidf)
predicted_probas_lr = modellr.predict_proba(validation_tfidf)

KMeans requires a little more work here as it is an unsupervised algorithm. The following cell predicts the clusters, but they are labelled 0-4, so we need to map these clusters to actual category labels before we can measure how well this method has worked.

In [None]:
predicted_labels_kmeans = modelkmeans.predict(validation_tfidf)
predicted_labels_kmeans

To find the corresponding names for the clusters, I applied k-means to the whole training set and looked at the most common category in each cluster. I quickly learned that the number of iterations I would need to run to obtain evenly sized clusters that I could confidently label as the five categories took far too much time, so this was not a practical model to use going forward.

In [None]:
modelkmeans = KMeans(n_clusters=5, init='k-means++')
modelkmeans.fit(tfidf_train)
all_predicted_label = modelkmeans.predict(tfidf_train)
newdf = df_handle.copy()
newdf["Cluster"] = all_predicted_label
newdf.head(3)

In [None]:
# cluster ids are integers, so compare against ints rather than strings
cluster0df = newdf.loc[newdf['Cluster'] == 0]
cluster1df = newdf.loc[newdf['Cluster'] == 1]
cluster2df = newdf.loc[newdf['Cluster'] == 2]
cluster3df = newdf.loc[newdf['Cluster'] == 3]
cluster4df = newdf.loc[newdf['Cluster'] == 4]
print("Cluster sizes: ")
# len() counts the documents in each cluster; .size would count rows * columns
print("Cluster 0: " + str(len(cluster0df)))
print("Cluster 1: " + str(len(cluster1df)))
print("Cluster 2: " + str(len(cluster2df)))
print("Cluster 3: " + str(len(cluster3df)))
print("Cluster 4: " + str(len(cluster4df)))
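
The idea of naming each cluster after its most common category can be sketched as follows. The data here is invented and `majority_label_per_cluster` is a hypothetical helper. In the notebook, the inputs would be the `Cluster` and `category` columns of `newdf`.

```python
from collections import Counter, defaultdict

def majority_label_per_cluster(cluster_ids, categories):
    """Map each cluster id to the category that occurs most often inside it."""
    by_cluster = defaultdict(Counter)
    for cluster, category in zip(cluster_ids, categories):
        by_cluster[cluster][category] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in by_cluster.items()}

# Invented cluster assignments and known categories for six documents.
cluster_ids = [0, 0, 1, 1, 1, 2]
categories = ["sport", "sport", "tech", "business", "tech", "politics"]

mapping = majority_label_per_cluster(cluster_ids, categories)
predicted = [mapping[cluster] for cluster in cluster_ids]

print(mapping)
print(predicted)
```

Nothing forces the mapping to be one-to-one: two clusters can share the same majority category, leaving another category with no cluster at all. That is one symptom of the uneven clusters described above.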

## Step 10: Evaluation

In this step we can see the accuracy, as a percentage, of each of the classifier methods used:

In [None]:
Acc_bayes = accuracy_score(validation_labels, predicted_labels_bayes)
Acc_ranfor = accuracy_score(validation_labels, predicted_labels_randomforest)
Acc_knn = accuracy_score(validation_labels, predicted_labels_knn)
Acc_lr = accuracy_score(validation_labels, predicted_labels_lr)
print('Accuracy rate for NB model: {:0.2f}%'.format(Acc_bayes*100))
print('Accuracy rate for RandomForest model: {:0.2f}%'.format(Acc_ranfor*100))
print('Accuracy rate for KNN model: {:0.2f}%'.format(Acc_knn*100))
print('Accuracy rate for Logistic model: {:0.2f}%'.format(Acc_lr*100))

We chose Log Loss as our evaluation metric because it scores the predicted class probabilities rather than just the predicted labels, penalising confident but incorrect predictions heavily. This makes it useful when dealing with confidence, which matters to us because we know that later on we will be using a voting model that takes this confidence into account.

In [None]:
log_loss_bayes = log_loss(validation_labels, predicted_probas_bayes)
log_loss_ranfor = log_loss(validation_labels, predicted_probas_randomforest)
log_loss_lr = log_loss(validation_labels, predicted_probas_lr)
# log loss is an unbounded score (lower is better), not a percentage
print('Log loss for Bayes model: {:0.4f}'.format(log_loss_bayes))
print('Log loss for Random Forest model: {:0.4f}'.format(log_loss_ranfor))
print('Log loss for Logistic Regression model: {:0.4f}'.format(log_loss_lr))
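
As an illustration of what this metric measures, here is a minimal pure-Python version of multiclass log loss on invented probabilities. `multiclass_log_loss` is a hypothetical helper written for this sketch, and the clipping constant `eps` is an assumption. It is not the scikit-learn function used in the cells.

```python
import math

def multiclass_log_loss(true_labels, probas, classes, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, row in zip(true_labels, probas):
        p = row[classes.index(label)]
        p = min(max(p, eps), 1 - eps)   # clip so that log(0) never occurs
        total += -math.log(p)
    return total / len(true_labels)

classes = ["business", "sport"]
truth = ["sport", "sport"]

# Both models get the first document right and the second one wrong.
cautious = [[0.4, 0.6], [0.6, 0.4]]
confident = [[0.01, 0.99], [0.99, 0.01]]

loss_cautious = multiclass_log_loss(truth, cautious, classes)
loss_confident = multiclass_log_loss(truth, confident, classes)
print(round(loss_cautious, 4), round(loss_confident, 4))
```

The two invented models have identical accuracy, yet the overconfident one is penalised far more. A classifier that only outputs probabilities of 0 or 1, as a nearest-neighbour model with k=1 does, is scored very harshly on each mistake for the same reason.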

In [None]:
# there's a problem here, something about the dimension of the probability matrix; tried normalising, didn't fix it
log_loss_knn = log_loss(validation_labels, predicted_probas_knn)
print('Log loss for KNN model: {:0.4f}'.format(log_loss_knn))

In both tests, Bayes seems to perform the best.

## Step 11: Applying Model to Test Set

Finally, after training our model, we can now apply it to our test set.
<br>
<br>
We applied each of the four models that we currently have and added them as columns to a dataframe for easy comparison:

In [None]:
predicted_test_labels_bayes = modelbayes.predict(tfidf_test)
predicted_test_labels_rf = randomforest.predict(tfidf_test)
predicted_test_labels_knn = modelknn.predict(tfidf_test)
predicted_test_labels_lr = modellr.predict(tfidf_test)

In [None]:
df_handle_test_onhold["Pred Labels Bayes"] = predicted_test_labels_bayes
df_handle_test_onhold["Pred Labels RF"] = predicted_test_labels_rf
df_handle_test_onhold["Pred Labels KNN"] = predicted_test_labels_knn
df_handle_test_onhold["Pred Labels LR"] = predicted_test_labels_lr
df_handle_test_onhold.to_csv('predicted labels.csv')
df_handle_test_onhold.head()

We first wanted to see if all of our models agreed on their class label for each document. To do this, we compared where the models agreed and where they disagreed. 

In [None]:
count_agree = 0
count_disagree = 0

for index, row in df_handle_test_onhold.iterrows():
    if row["Pred Labels Bayes"]==row["Pred Labels RF"]==row["Pred Labels KNN"]==row["Pred Labels LR"]:
        count_agree+=1
    else:
        count_disagree+=1
        
print("Percentage of articles agreed on by the models: {:0.2f}%".format((count_agree/df_handle_test_onhold.shape[0])*100) )
print("Percentage of articles disagreed on by the models: {:0.2f}%".format((count_disagree/df_handle_test_onhold.shape[0])*100))

As we can see, the models agree with each other 70% of the time. Based on this we decided that we wanted to use a voting model to find a consensus on cases where the models disagreed.

## Step 12: Classifier Voting

The following is an unweighted voting system. It uses the previous models to create a voting classifier, and we then fit this model to the training set.

In [None]:
eclf_unweighted = VotingClassifier(estimators=[('nb', modelbayes), ('rf', randomforest), ('knn', modelknn), ('lr', modellr)], voting='soft')
eclf_unweighted.fit(model_tfidf, model_labels)
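
Soft voting averages the class probabilities of the individual models and picks the class with the highest average. The sketch below shows that arithmetic in plain Python on invented probabilities. `soft_vote` is a hypothetical helper, not the `VotingClassifier` implementation.

```python
def soft_vote(probas_per_model, weights, classes):
    """Weighted average of per-model class probabilities for one document."""
    total_weight = sum(weights)
    combined = [
        sum(w * probs[i] for w, probs in zip(weights, probas_per_model)) / total_weight
        for i in range(len(classes))
    ]
    best = max(range(len(classes)), key=lambda i: combined[i])
    return classes[best], combined

classes = ["business", "tech"]
# One confident model prefers "business"; two hesitant models lean towards "tech".
probas = [[0.9, 0.1], [0.4, 0.6], [0.4, 0.6]]

label_equal, avg_equal = soft_vote(probas, [1, 1, 1], classes)
label_weighted, avg_weighted = soft_vote(probas, [1, 3, 3], classes)

print(label_equal, avg_equal)
print(label_weighted, avg_weighted)
```

With equal weights the single confident model wins even though two of the three models lean the other way, which a simple majority vote would not allow. Changing the weights changes the outcome, so the choice of weights matters.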

We then tested this model on the held-out validation split of the training set:

In [None]:
predicted_labels_eclf_unweighted = eclf_unweighted.predict(validation_tfidf)
predicted_probas_eclf_unweighted = eclf_unweighted.predict_proba(validation_tfidf)

Next we found the log loss between the predicted probabilities and the actual labels on this validation split:

In [None]:
log_loss_eclf_unweighted = log_loss(validation_labels, predicted_probas_eclf_unweighted)
print('Log loss for unweighted voting model: {:0.4f}'.format(log_loss_eclf_unweighted))

Generate a variety of weights

As Bayes and Logistic Regression performed the best, we weighted them more heavily to reflect that. After some trial and error we chose these values:

In [None]:
weight_values = [16, 1, 1, 6] # chosen by trial and error; a systematic method for finding the best values would be preferable
eclf_weighted = VotingClassifier(estimators=[('nb', modelbayes), ('rf', randomforest), ('knn', modelknn), ('lr', modellr)], voting='soft', weights=weight_values)
eclf_weighted.fit(model_tfidf, model_labels)

In [None]:
predicted_labels_eclf_weighted = eclf_weighted.predict(validation_tfidf)
predicted_probas_eclf_weighted = eclf_weighted.predict_proba(validation_tfidf)

In [None]:
log_loss_eclf_weighted = log_loss(validation_labels, predicted_probas_eclf_weighted)
print('Log loss for weighted voting model: {:0.4f}'.format(log_loss_eclf_weighted))

This log loss is much lower than that of our previous individual models.
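
The weights above were chosen by trial and error. One systematic alternative, sketched here under assumptions, is a small grid search: because soft voting is a weighted average of each model's predicted probabilities, candidate weights can be scored directly on probabilities that have already been computed, without refitting any model. The helper names, the candidate grid and the toy probabilities are all hypothetical.

```python
import itertools
import math

def mean_log_loss(true_idx, probas, eps=1e-15):
    """Mean negative log probability of the true class (given as a column index)."""
    losses = [-math.log(min(max(row[i], eps), 1 - eps)) for i, row in zip(true_idx, probas)]
    return sum(losses) / len(losses)

def blend(probas_per_model, weights):
    """Weighted average of the models' probability tables (samples x classes)."""
    total = sum(weights)
    n_samples = len(probas_per_model[0])
    n_classes = len(probas_per_model[0][0])
    return [
        [sum(w * model[s][c] for w, model in zip(weights, probas_per_model)) / total
         for c in range(n_classes)]
        for s in range(n_samples)
    ]

def search_weights(true_idx, probas_per_model, candidates=(1, 2, 4, 8, 16)):
    """Try every combination of candidate weights and keep the lowest log loss."""
    best_weights, best_loss = None, float("inf")
    for weights in itertools.product(candidates, repeat=len(probas_per_model)):
        loss = mean_log_loss(true_idx, blend(probas_per_model, weights))
        if loss < best_loss:
            best_weights, best_loss = weights, loss
    return best_weights, best_loss

# Invented validation data: three documents, two classes, two models.
true_idx = [0, 1, 0]
model_a = [[0.8, 0.2], [0.3, 0.7], [0.7, 0.3]]   # usually right
model_b = [[0.5, 0.5], [0.6, 0.4], [0.4, 0.6]]   # often wrong

best_weights, best_loss = search_weights(true_idx, [model_a, model_b])
equal_loss = mean_log_loss(true_idx, blend([model_a, model_b], (1, 1)))
print(best_weights, round(best_loss, 4), round(equal_loss, 4))
```

In the notebook, the inputs could be the `predict_proba` outputs on the validation split. Weights tuned and scored on the same split will look optimistic, so a separate split or cross-validation would give a fairer estimate.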

## Step 13: Final Model Application

Finally, we added our final predicted class labels as a new column to our dataframe:

In [None]:
predicted_test_labels_eclf_weighted = eclf_weighted.predict(tfidf_test)
df_handle_test_onhold["Pred Labels ECLF-W"] = predicted_test_labels_eclf_weighted
df_handle_test_onhold.to_csv('predicted labels final.csv')

In [None]:
df_handle_test_onhold.head()

In [None]:
df_handle_test_onhold[["content", "Pred Labels ECLF-W"]]