Useful links

1. For the architecture https://towardsdatascience.com/deep-learning-for-specific-information-extraction-from-unstructured-texts-12c5b9dceada
2. https://androidkt.com/multi-label-text-classification-in-tensorflow-keras/
3. https://keras.io/preprocessing/sequence/
4. https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/ ( Not really)
5. For deep learning using word embeddings https://stackabuse.com/python-for-nlp-multi-label-text-classification-with-keras/



In [1]:
import spacy
import pandas as pd
from tqdm import tqdm

In [2]:
DATA_DIR = "../../data/processed/"
INPUT_FILE_NAME = 'final_squash15_with_pos_ner_tm.parquet'

In [3]:
df = pd.read_parquet(DATA_DIR + INPUT_FILE_NAME)
df.head()

Unnamed: 0,speaker,headline,description,duration,tags,transcript,WC,clean_transcript,clean_transcript_string,sim_tags,squash15_tags,pos_sequence,ner_sequence,tm
0,Al Gore,Averting the climate crisis,With the same humor and humanity he exuded in ...,0:16:17,"cars,alternative energy,culture,politics,scien...","0:14\r\r\rThank you so much, Chris.\rAnd it's ...",2281.0,"b'[""thank"", ""chris"", ""truly"", ""great"", ""honor""...",thank chris truly great honor opportunity come...,"cars,solar system,energy,culture,politics,scie...","culture,politics,science,global issues,technology",VERB PROPN ADV ADJ NOUN NOUN VERB NOUN ADV ADV...,PERSON ORG ORG GPE LOC ORG PRODUCT GPE GPE PER...,"[0.04325945698517057, 0.0, 0.00142482934694180..."
1,Amy Smith,Simple designs to save a life,Fumes from indoor cooking fires kill more than...,0:15:06,"MacArthur grant,simplicity,industrial design,a...","0:11\r\r\rIn terms of invention,\rI'd like to ...",2687.0,"b'[""term"", ""invention"", ""like"", ""tell"", ""tale""...",term invention like tell tale favorite project...,"macarthur grant,simplicity,design,solar system...","design,global issues",NOUN NOUN SCONJ VERB PROPN ADJ NOUN VERB NOUN ...,GPE DATE CARDINAL DATE ORG PERSON LOC ORG GPE ...,"[0.013287880838036227, 0.0, 0.0, 0.00511725094..."
2,Ashraf Ghani,How to rebuild a broken state,Ashraf Ghani's passionate and powerful 10-minu...,0:18:45,"corruption,poverty,economics,investment,milita...","0:12\r\r\rA public, Dewey long ago observed,\r...",2506.0,"b'[""public"", ""dewey"", ""long"", ""ago"", ""observe""...",public dewey long ago observe constitute discu...,"corruption,inequality,science,investment,war,c...","science,culture,politics,global issues,business",ADJ PROPN ADV ADV VERB ADJ NOUN NOUN PROPN PRO...,DATE NORP ORDINAL DATE MONEY DATE DATE DATE EV...,"[0.0, 0.006699599134802422, 0.0, 0.00564851883..."
3,Burt Rutan,The real future of space exploration,"In this passionate talk, legendary spacecraft ...",0:19:37,"aircraft,flight,industrial design,NASA,rocket ...","0:11\r\r\rI want to start off by saying, Houst...",3092.0,"b'[""want"", ""start"", ""say"", ""houston"", ""problem...",want start say houston problem enter second ge...,"flight,design,nasa,science,invention,entrepren...","design,science,business",VERB NOUN VERB PROPN NOUN VERB ADJ NOUN NOUN N...,GPE ORDINAL ORG PERSON DATE DATE DATE TIME PER...,"[0.040282108339079505, 0.03732895646484358, 0...."
4,Chris Bangle,Great cars are great art,American designer Chris Bangle explains his ph...,0:20:04,"cars,industrial design,transportation,inventio...","0:12\r\r\rWhat I want to talk about is, as bac...",3781.0,"b'[""want"", ""talk"", ""background"", ""idea"", ""car""...",want talk background idea car art actually mea...,"cars,design,transportation,invention,technolog...","design,technology,business,science",VERB NOUN NOUN NOUN NOUN NOUN ADV ADJ NOUN NOU...,PERSON PRODUCT ORG ORG PERSON PERSON PERSON OR...,"[0.08049208168957463, 0.0, 0.0, 0.008031187136..."


In [4]:
df.iloc[:,:14].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2313 entries, 0 to 2312
Data columns (total 14 columns):
speaker                    2313 non-null object
headline                   2313 non-null object
description                2313 non-null object
duration                   2313 non-null object
tags                       2313 non-null object
transcript                 2313 non-null object
WC                         2313 non-null float64
clean_transcript           2313 non-null object
clean_transcript_string    2313 non-null object
sim_tags                   2313 non-null object
squash15_tags              2313 non-null object
pos_sequence               2313 non-null object
ner_sequence               2313 non-null object
tm                         2313 non-null object
dtypes: float64(1), object(13)
memory usage: 253.1+ KB


In [6]:
def print_full_dataframe(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')
    
def compute_tag_ratio(target_column, df=df):
    tags = df[target_column].str.replace(', ',',').str.lower().str.strip()
    split_tags = tags.str.split(',')
    tag_counts_per_talk = split_tags.apply(len)

    joined_tags = tags.str.cat(sep=',').split(',')
    all_tags = pd.Series(joined_tags)

    tag_counts = all_tags.value_counts().rename_axis(target_column).reset_index(name='counts')
    tag_counts['no_count'] = len(df)-tag_counts['counts']
    tag_counts['ratio'] = tag_counts['counts']/tag_counts['no_count']
    tag_counts['overall_ratio'] = tag_counts['counts']/(tag_counts['no_count'] + tag_counts['counts'])
    return tag_counts

#print(compute_tag_ratio('squash3_tags', df))
squashed_tag_counts = compute_tag_ratio('squash15_tags', df)
print_full_dataframe(squashed_tag_counts)


    squash15_tags  counts  no_count     ratio  overall_ratio
0         science    1467       846  1.734043       0.634241
1         culture    1155      1158  0.997409       0.499351
2      technology     787      1526  0.515727       0.340251
3   global issues     679      1634  0.415545       0.293558
4          design     477      1836  0.259804       0.206226
5         history     385      1928  0.199689       0.166450
6        business     349      1964  0.177699       0.150886
7   entertainment     285      2028  0.140533       0.123217
8           media     279      2034  0.137168       0.120623
9    biomechanics     220      2093  0.105112       0.095115
10         future     218      2095  0.104057       0.094250
11   biodiversity     218      2095  0.104057       0.094250
12       humanity     217      2096  0.103531       0.093818
13       politics     199      2114  0.094134       0.086035
14  communication     185      2128  0.086936       0.079983


In [7]:
print(squashed_tag_counts['overall_ratio'][0])

0.6342412451361867


# 3. Feature Extraction via Deep learning

## 3.1 Create one hot encoding

In [8]:
joined_tags = df['squash15_tags'].str.cat(sep=',').split(',')
all_tags = pd.Series(joined_tags).str.strip().str.lower()
all_tags = list(dict.fromkeys(all_tags))
try:
    all_tags.remove('')
except:
    pass
print(all_tags)
print(len(all_tags))

['culture', 'politics', 'science', 'global issues', 'technology', 'design', 'business', 'biomechanics', 'biodiversity', 'media', 'entertainment', 'history', 'future', 'communication', 'humanity']
15


In [9]:
def create_one_hot_encode(df=df):
    complete_transcripts_tags = []
    for rows, value in df.iterrows():
        one_hot_encoding = [0] * len(all_tags)
        headline = [value['headline']]
        transcript = [value['clean_transcript_string']]
        pos_sequence = [value['pos_sequence']]
        ner_sequence = [value['ner_sequence']]
        indiv_tags = value['squash15_tags'].split(',')
        for tags in indiv_tags:
            if tags == '':
                continue
            index = all_tags.index(tags.lower().lstrip(' '))
            one_hot_encoding[index] = 1
        indiv_transcript_tags = headline + transcript + pos_sequence + ner_sequence + one_hot_encoding
        complete_transcripts_tags.append(indiv_transcript_tags)
    return pd.DataFrame(complete_transcripts_tags, columns=['headline', 'transcript', 'pos_sequence', 'ner_sequence'] + all_tags)

In [10]:
df = create_one_hot_encode()
df

Unnamed: 0,headline,transcript,pos_sequence,ner_sequence,culture,politics,science,global issues,technology,design,business,biomechanics,biodiversity,media,entertainment,history,future,communication,humanity
0,Averting the climate crisis,thank chris truly great honor opportunity come...,VERB PROPN ADV ADJ NOUN NOUN VERB NOUN ADV ADV...,PERSON ORG ORG GPE LOC ORG PRODUCT GPE GPE PER...,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0
1,Simple designs to save a life,term invention like tell tale favorite project...,NOUN NOUN SCONJ VERB PROPN ADJ NOUN VERB NOUN ...,GPE DATE CARDINAL DATE ORG PERSON LOC ORG GPE ...,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0
2,How to rebuild a broken state,public dewey long ago observe constitute discu...,ADJ PROPN ADV ADV VERB ADJ NOUN NOUN PROPN PRO...,DATE NORP ORDINAL DATE MONEY DATE DATE DATE EV...,1,1,1,1,0,0,1,0,0,0,0,0,0,0,0
3,The real future of space exploration,want start say houston problem enter second ge...,VERB NOUN VERB PROPN NOUN VERB ADJ NOUN NOUN N...,GPE ORDINAL ORG PERSON DATE DATE DATE TIME PER...,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0
4,Great cars are great art,want talk background idea car art actually mea...,VERB NOUN NOUN NOUN NOUN NOUN ADV ADJ NOUN NOU...,PERSON PRODUCT ORG ORG PERSON PERSON PERSON OR...,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2308,Why glass towers are bad for city life -- and ...,imagine walk even discover everybody room look...,VERB NOUN ADV VERB PRON NOUN VERB ADV NOUN NOU...,ORG GPE ORG GPE GPE GPE GPE GPE GPE GPE GPE PE...,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0
2309,What happens in your brain when you pay attent...,pay close attention easy attention pull differ...,VERB ADJ NOUN ADJ NOUN VERB ADJ NOUN NOUN NOUN...,ORDINAL PERSON PRODUCT DATE DATE,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0
2310,Why you should define your fears instead of yo...,happy pic take senior college right dance prac...,ADJ PROPN VERB ADJ NOUN ADJ NOUN NOUN ADJ VERB...,DATE PERSON ORG PERSON PERSON GPE PERSON GPE O...,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2311,12 truths I learned from life and writing,sevenyearold grandson sleep hall wake lot morn...,PROPN PROPN PROPN PROPN VERB NOUN NOUN VERB VE...,PERSON PERSON PERSON PERSON PERSON DATE CARDIN...,1,0,1,0,0,0,0,0,0,0,0,1,0,1,1


In [55]:
def get_target_column(target_tag, df=df):
    '''Returns a dataframe of a single tag
    headline | transcript | pos_sequence | ner_sequence | <tag>
    '''
    return df[['headline', 'transcript','pos_sequence', 'ner_sequence', target_tag]]
# single_class = get_target_column('culture', df) # Retrieve a single data frame with teh foll

In [56]:
df_x = list(single_class['transcript']) # List of sentences
df_y = df[all_tags] # Dataframe of all the y columns

## 3.2 Perform train test split

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, train_y, valid_y = train_test_split(df_x, df_y, random_state=1000)

In [14]:
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers.core import Activation, Dropout, Dense
from keras.layers import Flatten, LSTM, Conv1D, MaxPooling1D
from keras.layers import GlobalMaxPooling1D
from keras.models import Model
from keras.layers.embeddings import Embedding
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.layers import Input
from keras.layers.merge import Concatenate

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

import numpy as np

Using TensorFlow backend.


## 3.3 Use word embeddings for the main transcript

In [15]:
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

vocab_size = len(tokenizer.word_index) + 1

maxlen = 800 # too many and the model cant tell the difference 

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

In [16]:
from numpy import array
from numpy import asarray
from numpy import zeros

embeddings_dictionary = dict()

glove_file = open('./glove.6B.100d.txt', encoding="utf8")

for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions
glove_file.close()

embedding_matrix = zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

In [17]:
#print(X_train[1])
print(X_train.shape)

[2844 1728  574 1072  826 4761 4500   61 1728 1750  332   98  151  332
 4672   42  332  208   16   42 1750   16   61  348 1270  620 4117  157
 4117   56 1943 1187  826 1773  630   35   56 1943   56  826   56  630
   56 4498   56  163  780  826 1773  630   75  926  525   17   13   75
  168  551 1225   65   27   56 1187  168  551  462  525  309 1072 2379
  376   20   85  714   28  376    3   20   56  826   56 2303   56  977
   44   56  464  404    1   28   27  561  322 4990  503   56 1091 1091
  270  102  561   32   12    1 1168 3210 1314 3210  172  518  430    1
 1477 3210  567   56  264   56 1364   47 4060 1051  172   15   10 1773
  264   82 1773  264  270  957   16 2344    1   37  714 1187 4116  567
   46    1 1477 2344    2   20  977 1187  630 1538 1281  157  630   20
 1684  105   30   12   20   57   10  630  249   57   18 1477 2964  561
  561 2227  290 3604 1052 1442  290    1  134   26    7  281   26  300
   10   14 1364    1 1477 3183   12 2343   14  410  630    7  561 2227
  561 

In [19]:
print(train_y['culture'])
print(train_y['culture'].value_counts())
print(train_y['culture'].value_counts()[0])
print(train_y['culture'].value_counts()[1])
print(type(train_y['culture']))

1573    1
236     1
2079    1
1830    0
584     1
       ..
769     1
1372    1
2119    1
599     1
1459    1
Name: culture, Length: 1734, dtype: int64


# Model

In [52]:
#from keras import backend as K
def compute_class_weight(tag):
    if train_y[tag].value_counts()[0] >train_y[tag].value_counts()[1]:
        class_weight = {0: 1,
                        1: 1/(train_y[tag].value_counts()[1]/train_y[tag].value_counts()[0]),
                       }
    else:
        class_weight = {0: 1/(train_y[tag].value_counts()[0]/train_y[tag].value_counts()[1]),
                    1: 1,
                   }
    return class_weight
print(compute_class_weight('humanity'))

{0: 1, 1: 10.187096774193549}


In [58]:
deep_inputs = Input(shape=(maxlen,))
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(deep_inputs)
LSTM_Layer_1 = LSTM(128)(embedding_layer)
dense_layer_1 = Dense(1, activation='sigmoid')(LSTM_Layer_1)
model = Model(inputs=deep_inputs, outputs=dense_layer_1)

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
#history = model.fit(X_train, train_y, batch_size=128, epochs=4, verbose=1, validation_split=0.2)

In [38]:
#predictions = model.predict(X_test)

In [60]:
def get_tag(threshold, predictions=predictions):
    ''' Flush predictions to either 1 or 0
    Returns
    [[1], [0], ..., [1]]
    '''
    return [[1 if j > threshold else 0 for j in i.tolist()] for i in predictions]

def get_tag_flat(threshold, predictions=predictions):
    ''' Flush predictions to either 1 or 0
    Returns
    [[1], [0], ..., [1]]
    '''
    return [1 if j > threshold else 0 for i in predictions for j in i]
#predictions_flushed = get_tag(0.4)


In [61]:
def compute_tp_tn_fp_fn(y_test, y_pred, classes):
    '''
    parameters
    y_test (list): [1,...,0]
    y_pred (list): [[1], [0], ..., [1]]
    classes: [tag1, tag2, ..., tagn]
    
    Return:
    pre_score = {
        'tag_1': {
            'index': ,
            'tp': ,
            'tn': ,
            'fp': ,
            'fn': 
        }
    }
    '''
    # Create dictionary of tags 
    pre_score = {}
    for index_tag, tag in enumerate(classes):
        pre_score[tag] = {
            'index':index_tag,
            'tp': 0,
            'tn': 0,
            'fp': 0,
            'fn': 0
        }
    for transcript_index, transcript_value in enumerate(y_test):
        if transcript_value == y_pred[transcript_index][0] and transcript_value == 1:
            pre_score[classes[0]]['tp'] += 1
        elif transcript_value == y_pred[transcript_index][0] and transcript_value == 0:
            pre_score[classes[0]]['tn'] += 1
        elif transcript_value != y_pred[transcript_index][0] and transcript_value == 1:
            pre_score[classes[0]]['fn'] += 1
        elif transcript_value != y_pred[transcript_index][0] and transcript_value == 0:
            pre_score[classes[0]]['fp'] += 1
    return pre_score
#scores_preprocess = compute_tp_tn_fp_fn(valid_y, predictions_flushed, ['culture'])

In [74]:
def compute_precision_recall_f1(preprocessed_scores):
    '''
    parameters
    preprocessed_scores = {
        'tag_1': {
            'index': ,
            'tp': ,
            'tn': ,
            'fp': ,
            'fn': 
        }
    }
    return
    preprocessed_scores = {
        'tag_1': {
            'index': ,
            'tp': ,
            'tn': ,
            'fp': ,
            'fn': ,
            'precision': ,
            'recall': ,
            'f1': ,
            'accuracy': ,
        }
    }
    '''
    for key, value in preprocessed_scores.items():
        try:
            precision = value['tp']/(value['tp']+value['fp'])
        except:
            #print('precision issue: {}'.format(key))
            precision = 0.0
        try:
            recall = value['tp']/(value['tp']+value['fn'])
        except:
            #print('recall issue: {}'.format(key))
            recall = 0.0
        try:
            f1 = (2 * precision * recall)/(precision + recall)
        except:
            #print('f1 issue: {}'.format(key))
            f1=0.0
        try:
            accuracy = (value['tp'] + value['tn'])/(value['tp']+value['fn'] + value['fp'] + value['tn'])
        except:
            accuracy = 0.0
        preprocessed_scores[key]['precision'] = round(precision,4)
        preprocessed_scores[key]['recall'] = round(recall,4)
        preprocessed_scores[key]['f1'] = round(f1,4)
        preprocessed_scores[key]['accuracy'] = round(accuracy,4)
    return preprocessed_scores
# final_scores = compute_precision_recall_f1(scores_preprocess)
# print(final_scores)

In [63]:
def print_full_dataframe(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')

In [69]:
def format_scores_df(tag_classes, final_scores):
    '''
    Given the list of classes and the dicitionary of all the classes and their scores, we write to a dataframe
    '''
    precision = []
    recall = []
    f1 = []
    accuracy = []
    for index, value in enumerate(tag_classes):
        precision.append(final_scores[value]['precision'])
        recall.append(final_scores[value]['recall'])
        f1.append(final_scores[value]['f1'])
        accuracy.append(final_scores[value]['accuracy'])
    df_result = pd.DataFrame(list(zip(tag_classes, precision, recall, f1, accuracy)), 
               columns =['class', 'precision', 'recall', 'f1', 'accuracy']) 
    return df_result
# df_results = format_scores_df(['culture'], final_scores)
# print_full_dataframe(df_results)

In [87]:
def get_threshold(tag,valid_y, predictions):
    '''
    tag (string): Specific tag
    valid_y (List of tags): [0,1,1, ... ,1]
    predictions (list of lists)
    '''
    # We want to find the threshold that gives the highest recall and accuracy
    highest_f1 = 0
    f1_i = []
    highest_accuracy_f1 = 0
    accuracy_f1_i = []
    highest_accuracy = 0
    accuracy_i = []
    f1_metrics = [0, 0, 0, 0] # tp, tn, fp, fn
    accuracy_metrics = [0, 0, 0, 0]
    for i in range(0, 100):
        i = i/100
        predictions_flushed = get_tag(i,predictions)
        scores_preprocess = compute_tp_tn_fp_fn(valid_y, predictions_flushed, [tag])
        final_scores = compute_precision_recall_f1(scores_preprocess)
    #     print(final_scores)
        df_results = format_scores_df([tag], final_scores)
        #print(df_results)
        f1 = final_scores[tag]['f1']
        accuracy = df_results.accuracy[0]

        if f1 > highest_f1:
            highest_f1 = f1
            f1_i = [i]
            if accuracy > highest_accuracy_f1:
                highest_accuracy_f1 = accuracy
                accuracy_f1_i = [i]
                f1_metrics[0] = scores_preprocess[tag]['tp']
                f1_metrics[1] = scores_preprocess[tag]['tn']
                f1_metrics[2] = scores_preprocess[tag]['fp']
                f1_metrics[3] = scores_preprocess[tag]['fn']
            elif accuracy == highest_accuracy_f1:
                accuracy_f1_i.append(i)
        elif f1 == highest_f1:
            f1_i.append(i)
            if accuracy > highest_accuracy_f1:
                highest_accuracy_f1 = accuracy
                accuracy_f1_i = [i]
                
            elif accuracy == highest_accuracy_f1:
                accuracy_f1_i.append(i)

        if accuracy > highest_accuracy:
            highest_accuracy = accuracy
            accuracy_i = [i]
            accuracy_metrics[0] = scores_preprocess[tag]['tp']
            accuracy_metrics[1] = scores_preprocess[tag]['tn']
            accuracy_metrics[2] = scores_preprocess[tag]['fp']
            accuracy_metrics[3] = scores_preprocess[tag]['fn']
        elif accuracy == highest_accuracy:
            accuracy_i.append(i)

    #     print('\n')

#     print(highest_f1,f1_i)
#     print(highest_accuracy_f1,accuracy_f1_i)
#     print(highest_accuracy,accuracy_i)
    return highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i, f1_metrics, accuracy_metrics

In [98]:
# tag_results = {}
# for i in range(len(all_tags)):
#     tag = all_tags[i]
#     print(tag)
#     single_class = get_target_column(tag, df)
#     train_y_tag = train_y[tag]
#     valid_y_tag = valid_y[tag]
#     history = model.fit(X_train, train_y_tag, batch_size=32, epochs=4, verbose=1, validation_split=0.2)
#     predictions = model.predict(X_test)
#     highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i = get_threshold(tag,valid_y_tag,predictions)
#     print(highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i)
#     tag_results[tag] = [highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i]
    
tag_results = {}
for i in range(len(all_tags)):
    tag = all_tags[i]
    print(tag)
    train_y_tag = train_y[tag]
    valid_y_tag = valid_y[tag]
    class_weight = compute_class_weight(tag)
    history = model.fit(X_train, train_y_tag, batch_size=32, epochs=4, verbose=1, validation_split=0.2, class_weight=class_weight)
    predictions = model.predict(X_test)
    model.save('{}_transcript_only.h5'.format(tag))
    highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i, f1_metrics, accuracy_metrics = get_threshold(tag,valid_y_tag,predictions)
    print(highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i, f1_metrics, accuracy_metrics)
    tag_results[tag] = [highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i, f1_metrics, accuracy_metrics]

culture
Train on 1387 samples, validate on 347 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
0.6826 [0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17] 0.5181 [0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17] 0.5389 [0.52, 0.54] [300, 0, 279, 0] [98, 214, 65, 202]
politics
Train on 1387 samples, validate on 347 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
0.2609 [0.4] 0.8826 [0.4] 0.9188 [0.51, 0.54, 0.55, 0.7] [12, 499, 27, 41] [8, 524, 2, 45]
science
Train on 1387 samples, validate on 347 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
0.7767 [0.4] 0.6356 [0.4] 0.6373 [0.44] [367, 1, 211, 0] [363, 6, 206, 4]
global issues
Train on 1387 samples, validate on 347 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
0.4539 [0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19] 0.2936 [0.0, 0.01, 0.02, 0.03, 0.04, 0

In [92]:
highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i, f1_metrics, accuracy_metrics = get_threshold(tag,valid_y_tag,predictions)
print(highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i, f1_metrics, accuracy_metrics )

0.6826 [0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2, 0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3, 0.31, 0.32, 0.33, 0.34, 0.35] 0.5181 [0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2, 0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3, 0.31, 0.32, 0.33, 0.34, 0.35] 0.5354 [0.43] [300, 0, 279, 0] [275, 35, 244, 25]


In [99]:
print(tag_results)

{'culture': [0.6826, [0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17], 0.5181, [0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17], 0.5389, [0.52, 0.54], [300, 0, 279, 0], [98, 214, 65, 202]], 'politics': [0.2609, [0.4], 0.8826, [0.4], 0.9188, [0.51, 0.54, 0.55, 0.7], [12, 499, 27, 41], [8, 524, 2, 45]], 'science': [0.7767, [0.4], 0.6356, [0.4], 0.6373, [0.44], [367, 1, 211, 0], [363, 6, 206, 4]], 'global issues': [0.4539, [0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19], 0.2936, [0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19], 0.7254, [0.53, 0.54, 0.55], [170, 0, 409, 0], [20, 400, 9, 150]], 'technology': [0.5102, [0.33], 0.3765, [0.33], 0.6753, [0.49, 0.52], [188, 30, 356, 5], [47, 344, 42, 146]], 'design': [0.3819, [0.43], 0.3627, [0

In [100]:
pd.DataFrame.from_dict(tag_results, orient='index', columns=['highest_f1', 'thresholds_for_highest_f1', 'highest_accuracy_at_highest_f1', 'thresholds_for_highest_accuracy_f1','highest_accuracy','threshold_for_highest_accuracy_i', 'highest_f1_confusion_metrics', 'highest_accuracy_confusion_metrics'])

Unnamed: 0,highest_f1,thresholds_for_highest_f1,highest_accuracy_at_highest_f1,thresholds_for_highest_accuracy_f1,highest_accuracy,threshold_for_highest_accuracy_i,highest_f1_confusion_metrics,highest_accuracy_confusion_metrics
culture,0.6826,"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...",0.5181,"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...",0.5389,"[0.52, 0.54]","[300, 0, 279, 0]","[98, 214, 65, 202]"
politics,0.2609,[0.4],0.8826,[0.4],0.9188,"[0.51, 0.54, 0.55, 0.7]","[12, 499, 27, 41]","[8, 524, 2, 45]"
science,0.7767,[0.4],0.6356,[0.4],0.6373,[0.44],"[367, 1, 211, 0]","[363, 6, 206, 4]"
global issues,0.4539,"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...",0.2936,"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...",0.7254,"[0.53, 0.54, 0.55]","[170, 0, 409, 0]","[20, 400, 9, 150]"
technology,0.5102,[0.33],0.3765,[0.33],0.6753,"[0.49, 0.52]","[188, 30, 356, 5]","[47, 344, 42, 146]"
design,0.3819,[0.43],0.3627,[0.43],0.7927,"[0.7, 0.71, 0.72, 0.73, 0.74]","[114, 96, 362, 7]","[1, 458, 0, 120]"
business,0.37,[0.52],0.7824,[0.52],0.8687,"[0.74, 0.75, 0.76]","[37, 416, 83, 43]","[10, 493, 6, 70]"
biomechanics,0.3822,[0.69],0.8325,[0.69],0.8998,"[0.78, 0.8, 0.81, 0.82, 0.85]","[30, 452, 68, 29]","[12, 509, 11, 47]"
biodiversity,0.5263,[0.56],0.8912,[0.56],0.9119,[0.88],"[35, 481, 46, 17]","[1, 527, 0, 51]"
media,0.2327,[0.49],0.3282,[0.49],0.8756,"[0.66, 0.67, 0.68, 0.69, 0.7, 0.71, 0.72, 0.73...","[59, 131, 376, 13]","[0, 507, 0, 72]"
