Useful links

1. For the architecture https://towardsdatascience.com/deep-learning-for-specific-information-extraction-from-unstructured-texts-12c5b9dceada
2. https://androidkt.com/multi-label-text-classification-in-tensorflow-keras/
3. https://keras.io/preprocessing/sequence/
4. https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/ ( Not really)
5. For deep learning using word embeddings https://stackabuse.com/python-for-nlp-multi-label-text-classification-with-keras/



In [1]:
import spacy
import pandas as pd
from tqdm import tqdm

In [2]:
DATA_DIR = "../../data/processed/"
INPUT_FILE_NAME = 'final_squash15_with_pos_ner_tm.parquet'

In [3]:
df = pd.read_parquet(DATA_DIR + INPUT_FILE_NAME)
df.head()

Unnamed: 0,speaker,headline,description,duration,tags,transcript,WC,clean_transcript,clean_transcript_string,sim_tags,squash15_tags,pos_sequence,ner_sequence,tm
0,Al Gore,Averting the climate crisis,With the same humor and humanity he exuded in ...,0:16:17,"cars,alternative energy,culture,politics,scien...","0:14\r\r\rThank you so much, Chris.\rAnd it's ...",2281.0,"b'[""thank"", ""chris"", ""truly"", ""great"", ""honor""...",thank chris truly great honor opportunity come...,"cars,solar system,energy,culture,politics,scie...","culture,politics,science,global issues,technology",VERB PROPN ADV ADJ NOUN NOUN VERB NOUN ADV ADV...,PERSON ORG ORG GPE LOC ORG PRODUCT GPE GPE PER...,"[0.04325945698517057, 0.0, 0.00142482934694180..."
1,Amy Smith,Simple designs to save a life,Fumes from indoor cooking fires kill more than...,0:15:06,"MacArthur grant,simplicity,industrial design,a...","0:11\r\r\rIn terms of invention,\rI'd like to ...",2687.0,"b'[""term"", ""invention"", ""like"", ""tell"", ""tale""...",term invention like tell tale favorite project...,"macarthur grant,simplicity,design,solar system...","design,global issues",NOUN NOUN SCONJ VERB PROPN ADJ NOUN VERB NOUN ...,GPE DATE CARDINAL DATE ORG PERSON LOC ORG GPE ...,"[0.013287880838036227, 0.0, 0.0, 0.00511725094..."
2,Ashraf Ghani,How to rebuild a broken state,Ashraf Ghani's passionate and powerful 10-minu...,0:18:45,"corruption,poverty,economics,investment,milita...","0:12\r\r\rA public, Dewey long ago observed,\r...",2506.0,"b'[""public"", ""dewey"", ""long"", ""ago"", ""observe""...",public dewey long ago observe constitute discu...,"corruption,inequality,science,investment,war,c...","science,culture,politics,global issues,business",ADJ PROPN ADV ADV VERB ADJ NOUN NOUN PROPN PRO...,DATE NORP ORDINAL DATE MONEY DATE DATE DATE EV...,"[0.0, 0.006699599134802422, 0.0, 0.00564851883..."
3,Burt Rutan,The real future of space exploration,"In this passionate talk, legendary spacecraft ...",0:19:37,"aircraft,flight,industrial design,NASA,rocket ...","0:11\r\r\rI want to start off by saying, Houst...",3092.0,"b'[""want"", ""start"", ""say"", ""houston"", ""problem...",want start say houston problem enter second ge...,"flight,design,nasa,science,invention,entrepren...","design,science,business",VERB NOUN VERB PROPN NOUN VERB ADJ NOUN NOUN N...,GPE ORDINAL ORG PERSON DATE DATE DATE TIME PER...,"[0.040282108339079505, 0.03732895646484358, 0...."
4,Chris Bangle,Great cars are great art,American designer Chris Bangle explains his ph...,0:20:04,"cars,industrial design,transportation,inventio...","0:12\r\r\rWhat I want to talk about is, as bac...",3781.0,"b'[""want"", ""talk"", ""background"", ""idea"", ""car""...",want talk background idea car art actually mea...,"cars,design,transportation,invention,technolog...","design,technology,business,science",VERB NOUN NOUN NOUN NOUN NOUN ADV ADJ NOUN NOU...,PERSON PRODUCT ORG ORG PERSON PERSON PERSON OR...,"[0.08049208168957463, 0.0, 0.0, 0.008031187136..."


In [4]:
df.iloc[:,:14].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2313 entries, 0 to 2312
Data columns (total 14 columns):
speaker                    2313 non-null object
headline                   2313 non-null object
description                2313 non-null object
duration                   2313 non-null object
tags                       2313 non-null object
transcript                 2313 non-null object
WC                         2313 non-null float64
clean_transcript           2313 non-null object
clean_transcript_string    2313 non-null object
sim_tags                   2313 non-null object
squash15_tags              2313 non-null object
pos_sequence               2313 non-null object
ner_sequence               2313 non-null object
tm                         2313 non-null object
dtypes: float64(1), object(13)
memory usage: 253.1+ KB


In [5]:
def print_full_dataframe(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')
    
def compute_tag_ratio(target_column, df=df):
    tags = df[target_column].str.replace(', ',',').str.lower().str.strip()
    split_tags = tags.str.split(',')
    tag_counts_per_talk = split_tags.apply(len)

    joined_tags = tags.str.cat(sep=',').split(',')
    all_tags = pd.Series(joined_tags)

    tag_counts = all_tags.value_counts().rename_axis(target_column).reset_index(name='counts')
    tag_counts['no_count'] = len(df)-tag_counts['counts']
    tag_counts['ratio'] = tag_counts['counts']/tag_counts['no_count']
    tag_counts['overall_ratio'] = tag_counts['counts']/(tag_counts['no_count'] + tag_counts['counts'])
    return tag_counts

#print(compute_tag_ratio('squash3_tags', df))
squashed_tag_counts = compute_tag_ratio('squash15_tags', df)
print_full_dataframe(squashed_tag_counts)

    squash15_tags  counts  no_count     ratio  overall_ratio
0         science    1467       846  1.734043       0.634241
1         culture    1155      1158  0.997409       0.499351
2      technology     787      1526  0.515727       0.340251
3   global issues     679      1634  0.415545       0.293558
4          design     477      1836  0.259804       0.206226
5         history     385      1928  0.199689       0.166450
6        business     349      1964  0.177699       0.150886
7   entertainment     285      2028  0.140533       0.123217
8           media     279      2034  0.137168       0.120623
9    biomechanics     220      2093  0.105112       0.095115
10   biodiversity     218      2095  0.104057       0.094250
11         future     218      2095  0.104057       0.094250
12       humanity     217      2096  0.103531       0.093818
13       politics     199      2114  0.094134       0.086035
14  communication     185      2128  0.086936       0.079983


# 3. Feature Extraction via Deep learning

## 3.1 Create one hot encoding

In [6]:
# from sklearn.preprocessing import MultiLabelBinarizer

# y = []
# for index, row in df.iterrows():
#     y.append(set(row['squash3_tags'].split(',')))
    
# mlb = MultiLabelBinarizer()
# encoded_y = mlb.fit_transform(y)

In [7]:
# print(encoded_y[0])
# print(len(encoded_y[0]))

In [8]:
joined_tags = df['squash15_tags'].str.cat(sep=',').split(',')
all_tags = pd.Series(joined_tags).str.strip().str.lower()
all_tags = list(dict.fromkeys(all_tags))
try:
    all_tags.remove('')
except:
    pass
print(all_tags)
print(len(all_tags))

['culture', 'politics', 'science', 'global issues', 'technology', 'design', 'business', 'biomechanics', 'biodiversity', 'media', 'entertainment', 'history', 'future', 'communication', 'humanity']
15


In [9]:
def create_one_hot_encode(df=df):
    complete_transcripts_tags = []
    for rows, value in df.iterrows():
        one_hot_encoding = [0] * len(all_tags)
        headline = [value['headline']]
        transcript = [value['clean_transcript_string']]
        pos_sequence = [value['pos_sequence']]
        ner_sequence = [value['ner_sequence']]
        tm = [value['tm']]
        indiv_tags = value['squash15_tags'].split(',')
        for tags in indiv_tags:
            if tags == '':
                continue
            index = all_tags.index(tags.lower().lstrip(' '))
            one_hot_encoding[index] = 1
        indiv_transcript_tags = headline + transcript + pos_sequence + ner_sequence + tm +one_hot_encoding
        complete_transcripts_tags.append(indiv_transcript_tags)
    return pd.DataFrame(complete_transcripts_tags, columns=['headline', 'transcript', 'pos_sequence', 'ner_sequence','tm'] + all_tags)

In [10]:
df = create_one_hot_encode()
df

Unnamed: 0,headline,transcript,pos_sequence,ner_sequence,tm,culture,politics,science,global issues,technology,design,business,biomechanics,biodiversity,media,entertainment,history,future,communication,humanity
0,Averting the climate crisis,thank chris truly great honor opportunity come...,VERB PROPN ADV ADJ NOUN NOUN VERB NOUN ADV ADV...,PERSON ORG ORG GPE LOC ORG PRODUCT GPE GPE PER...,"[0.04325945698517057, 0.0, 0.00142482934694180...",1,1,1,1,1,0,0,0,0,0,0,0,0,0,0
1,Simple designs to save a life,term invention like tell tale favorite project...,NOUN NOUN SCONJ VERB PROPN ADJ NOUN VERB NOUN ...,GPE DATE CARDINAL DATE ORG PERSON LOC ORG GPE ...,"[0.013287880838036227, 0.0, 0.0, 0.00511725094...",0,0,0,1,0,1,0,0,0,0,0,0,0,0,0
2,How to rebuild a broken state,public dewey long ago observe constitute discu...,ADJ PROPN ADV ADV VERB ADJ NOUN NOUN PROPN PRO...,DATE NORP ORDINAL DATE MONEY DATE DATE DATE EV...,"[0.0, 0.006699599134802422, 0.0, 0.00564851883...",1,1,1,1,0,0,1,0,0,0,0,0,0,0,0
3,The real future of space exploration,want start say houston problem enter second ge...,VERB NOUN VERB PROPN NOUN VERB ADJ NOUN NOUN N...,GPE ORDINAL ORG PERSON DATE DATE DATE TIME PER...,"[0.040282108339079505, 0.03732895646484358, 0....",0,0,1,0,0,1,1,0,0,0,0,0,0,0,0
4,Great cars are great art,want talk background idea car art actually mea...,VERB NOUN NOUN NOUN NOUN NOUN ADV ADJ NOUN NOU...,PERSON PRODUCT ORG ORG PERSON PERSON PERSON OR...,"[0.08049208168957463, 0.0, 0.0, 0.008031187136...",0,0,1,0,1,1,1,0,0,0,0,0,0,0,0
5,Sampling the ocean's DNA,break ask people comment age debate comment un...,VERB VERB NOUN VERB NOUN NOUN NOUN VERB NOUN A...,DATE DATE ORG DATE DATE PERSON ORG CARDINAL CA...,"[0.0, 0.01122282724927712, 0.0, 0.163765591818...",0,0,1,0,1,0,0,1,1,0,0,0,0,0,0
6,Simplicity sells,music sound silence simon garfunkel hello voic...,NOUN PROPN PROPN PROPN PROPN INTJ NOUN NOUN AD...,PERSON TIME TIME ORG PERSON FAC DATE DATE ORG ...,"[0.062272408748748564, 0.0, 0.0243049615007748...",0,0,1,0,1,1,0,0,0,1,1,0,0,0,0
7,A memorial at Ground Zero,kurt andersen like architect david hog limelig...,PROPN PROPN SCONJ PROPN PROPN PROPN PROPN ADV ...,PERSON PERSON ORG PERSON DATE GPE PERSON PERSO...,"[0.045631610155157765, 0.0, 0.0, 0.0, 0.004847...",1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
8,To invent is to give,point time come learn morning world expert gue...,NOUN NOUN VERB VERB NOUN NOUN NOUN VERB ADJ NO...,DATE DATE TIME PERSON DATE ORG CARDINAL GPE CA...,"[0.041662618917564086, 0.0, 0.0, 0.0, 0.004215...",1,0,1,1,1,1,1,0,0,1,0,0,0,0,0
9,The killer American diet that's sweeping the p...,legitimate concern aid avian flu hear brillian...,ADJ NOUN NOUN ADJ NOUN VERB ADJ PROPN ADJ ADV ...,NORP TIME LOC LOC CARDINAL DATE DATE DATE NORP...,"[0.003366184031329983, 0.0, 0.0007976442417315...",1,0,1,1,0,0,0,0,0,0,0,0,0,0,0


In [11]:
def get_target_column(target, df):
    return df[['headline', 'transcript','pos_sequence', 'ner_sequence','tm', target_tag]]

In [64]:
df_x = df[['headline', 'transcript','pos_sequence', 'ner_sequence','tm']]

In [57]:
def tm_columns(x,i):
    val = x[i]
    pyval = val.item()
#     print(type(pyval))
    return pyval

In [59]:
# for i in range(15):
#     print(i)
#     df_x[str(i)] = df_x['tm'].map(lambda x: tm_columns(x,i))
#     print(type(df_x[str(i)][0]))
#     df_x[str(i)] = df_x[str(i)].map(lambda x: x.item())
#     print(type(df_x[str(i)][0]))
# df_x = df_x.drop('tm', axis = 1)
# df_x

In [61]:
df_split = pd.DataFrame(df_x.tm.values.tolist(), index = df_x.index)
df_split

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,0.043259,0.000000,0.001425,0.000000,0.017688,0.003206,0.002644,0.017519,0.004639,0.000000,0.000000,0.000000,0.124716,0.000000,0.035293
1,0.013288,0.000000,0.000000,0.005117,0.034926,0.017484,0.020667,0.009246,0.005919,0.000000,0.001089,0.010383,0.084818,0.028670,0.000000
2,0.000000,0.006700,0.000000,0.005649,0.116568,0.000000,0.000000,0.026561,0.018581,0.001628,0.002034,0.000000,0.004737,0.005569,0.081975
3,0.040282,0.037329,0.000000,0.003867,0.029308,0.000000,0.008637,0.018175,0.000000,0.000000,0.023373,0.000000,0.050151,0.045343,0.003090
4,0.080492,0.000000,0.000000,0.008031,0.000000,0.000000,0.000000,0.048297,0.017136,0.000000,0.000560,0.000000,0.094348,0.034353,0.000000
5,0.000000,0.011223,0.000000,0.163766,0.003340,0.050081,0.000000,0.003768,0.000000,0.000000,0.002204,0.000000,0.025514,0.007171,0.009985
6,0.062272,0.000000,0.024305,0.003455,0.011596,0.000000,0.000000,0.000000,0.007107,0.000000,0.000000,0.000000,0.008445,0.120207,0.000000
7,0.045632,0.000000,0.000000,0.000000,0.004847,0.000000,0.000000,0.150516,0.000000,0.000000,0.000000,0.001513,0.000000,0.014483,0.012876
8,0.041663,0.000000,0.000000,0.000000,0.004215,0.000000,0.000000,0.149192,0.000000,0.000000,0.000000,0.000000,0.210725,0.000000,0.000000
9,0.003366,0.000000,0.000798,0.009314,0.062765,0.007362,0.004103,0.000000,0.000000,0.000161,0.000000,0.113221,0.000000,0.000000,0.000000


In [63]:
type(df_split[0][0])

numpy.float64

In [16]:
df_x.dtypes

headline         object
transcript       object
pos_sequence     object
ner_sequence     object
0               float64
1               float64
2               float64
3               float64
4               float64
5               float64
6               float64
7               float64
8               float64
9               float64
10              float64
11              float64
12              float64
13              float64
14              float64
dtype: object

In [25]:
df_y = df[all_tags]
# dff = list(get_target_column('culture', df))
# dff
# print(len(df_y))
print(type(df_y['culture'][0]))
df_y.dtypes

<class 'numpy.int64'>


culture          int64
politics         int64
science          int64
global issues    int64
technology       int64
design           int64
business         int64
biomechanics     int64
biodiversity     int64
media            int64
entertainment    int64
history          int64
future           int64
communication    int64
humanity         int64
dtype: object

## 3.2 Perform train test split

In [66]:
from sklearn.model_selection import train_test_split
X_train, X_test, train_y, valid_y = train_test_split(df_x, df_y, random_state = 42)

In [23]:
# from skmultilearn.model_selection import iterative_train_test_split

# X_train, train_y, X_test, valid_y = iterative_train_test_split(df_x, df_y, test_size = 0.2)
# # X_train = pd.DataFrame(X_train)[0]
# # X_test = pd.DataFrame(X_test)[0]

TypeError: unhashable type: 'numpy.ndarray'

In [67]:
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers.core import Activation, Dropout, Dense
from keras.layers import Flatten, LSTM, Conv1D, MaxPooling1D, concatenate
from keras.layers import GlobalMaxPooling1D
from keras.models import Model
from keras.layers.embeddings import Embedding
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.layers import Input
from keras.layers.merge import Concatenate
from keras import optimizers
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

import numpy as np

Using TensorFlow backend.


## 3.3 Use word embeddings for the main transcript

In [68]:
# Extract train and test transcripts to list 
X_train_transcripts = X_train['transcript'].tolist()
X_test_transcripts = X_test['transcript'].tolist()
# Extract headline - we will use tfidf because headlines are short 
X_train_headline = X_train['headline'].tolist()
X_test_headline = X_test['headline'].tolist()
# Extract POS tags
X_train_pos_seq= X_train['pos_sequence'].tolist()
X_test_pos_seq = X_test['pos_sequence'].tolist()
# Extract NER tags
X_train_ner_seq = X_train['ner_sequence'].tolist()
X_test_ner_seq = X_test['ner_sequence'].tolist()
# Extract topic modelling arrays
X_train_tm = X_train['tm']
X_test_tm = X_test['tm']

In [69]:
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train_transcripts)

X_train_transcripts = tokenizer.texts_to_sequences(X_train_transcripts)
X_test_transcripts = tokenizer.texts_to_sequences(X_test_transcripts)

vocab_size = len(tokenizer.word_index) + 1

maxlen = 500 # since the average length is about there. Too long and the predicions are bad. we assume the intro has the most info

X_train_transcripts = pad_sequences(X_train_transcripts, padding='post', maxlen=maxlen)
X_test_transcripts = pad_sequences(X_test_transcripts, padding='post', maxlen=maxlen)

In [70]:
from numpy import array
from numpy import asarray
from numpy import zeros

embeddings_dictionary = dict()
glove_path = "C:/Users/JSaw/Downloads/"
glove_file = open(glove_path+'glove.6B.100d.txt', encoding="utf8")

for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions
glove_file.close()

embedding_matrix = zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

In [71]:
print(X_train_transcripts[1])
print(X_train_transcripts.shape)
print(type(X_train_transcripts))

[4469 4298 2877  570  399  847 1837 1132 2904 1242  663 2904 1242 3117
 2877  848   67  449  583  301 2904 1242 1447 1132  261   21  570    8
   17   68 2420 1035  169   21  382 3410  399  598  101  444  101   12
 1783  715 2211  471 1378  368 2511  715  433 4299  570  153 2843  273
  438  169   76  368  438 1423 1483 1026 1285 1035  528    6   19 2453
 2683  598 1274   78  149  598 3460  598   11  813 1721 4157   11   38
   24  153 2843  678 1437 2878 1036 2033  469    5 1514 1688   17  802
  292   25 2878 1437  814 1026 1026 2454  197  387  141  814 1326  117
 1226   14  588 1233 1462  149  241 2878    8  500  102 1548  387  141
  964   30 4300   34  241 2878   51   34 3461   51 3709  448  387  141
  544  206  241 2878  625   30   14    7 1530 1186  387  141  402   69
  241 2878   21   34   47   56  136   47  241 2878   21   77  663   43
  663 1394 4552 1437 1676   10  241 2878    7  459 2810  389  888  176
 4931 4301  626  264  285  264  285   14  567  399  241  709  389  814
 2878 

## 3.4 tfidf the headline

In [72]:
# tfidf_vect_pos = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=50)
# tfidf_vect_pos.fit(X_train_headline)

# xtrain_tfidf_headline =  tfidf_vect_pos.transform(X_train_headline)
# xtest_tfidf_headline =  tfidf_vect_pos.transform(X_test_headline)


In [73]:
# print(xtrain_tfidf_headline.shape)
# print(xtest_tfidf_headline.shape)
# print(xtrain_tfidf_headline[0])
# print(type(xtrain_tfidf_headline))

In [74]:
# tfidf_vect_pos = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
# tfidf_vect_pos.fit(df['pos_sequence'])
# tfidf_vect_ner = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
# tfidf_vect_ner.fit(df['ner_sequence'])

# xtrain_tfidf_pos =  tfidf_vect_pos.transform(X_train['pos_sequence'])
# xtest_tfidf_pos =  tfidf_vect_pos.transform(X_test['pos_sequence'])

# xtrain_tfidf_ner =  tfidf_vect_ner.transform(X_train['ner_sequence'])
# xtest_tfidf_ner =  tfidf_vect_ner.transform(X_test['ner_sequence'])

In [140]:
ave_head = 0
for i in range(len(X_train_headline)):
    small = len(X_train_headline[i])
    ave_head += small
for j in range(len(X_test_headline)):
    small = len(X_test_headline[j])
    ave_head += small
ave_head = int(ave_head / (len(X_train_headline)+len(X_test_headline)))
print(ave_head)

100


In [142]:
unique_head = list(df['headline'].str.split(' ', expand=True).stack().unique())
print(len(unique_head))

4434


In [75]:
# Try word embeddings on the vector 
tokenizer2 = Tokenizer(num_words=100)
tokenizer2.fit_on_texts(X_train_headline)

X_train_headline = tokenizer2.texts_to_sequences(X_train_headline)
X_test_headline = tokenizer2.texts_to_sequences(X_test_headline)

vocab_size2 = len(tokenizer2.word_index) + 1

maxlen2 = 100 # since the average length is about there. Too long and the predicions are bad. we assume the intro has the most info

X_train_headline = pad_sequences(X_train_headline, padding='post', maxlen=maxlen2)
X_test_headline = pad_sequences(X_test_headline, padding='post', maxlen=maxlen2)

# 3.5 Place all tm vectors into big array

In [76]:
def compile_vectors(series,num):
    big = np.zeros((len(series),num))
    for i in range(len(series)):
        array = series.iloc[i]
        big[i] = array
        return big

In [77]:
print(type(X_train_tm))
print(type(X_test_tm))

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


In [78]:
X_train_tm = compile_vectors(X_train_tm,15)
X_test_tm = compile_vectors(X_test_tm,15)

In [79]:
print(type(X_train_tm))
print(type(X_test_tm))
print(X_train_tm.shape)
print(X_test_tm.shape)

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
(1734, 15)
(579, 15)


# 3.5 POS NER

In [80]:
print(type(X_train_pos_seq))
print(type(X_test_pos_seq))

print(type(X_train_ner_seq))
print(type(X_test_ner_seq))
print(len(X_train_pos_seq[0]),len(X_train_pos_seq[1]))

<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
5135 3703


In [81]:
ave_pos = 0
for i in range(len(X_train_pos_seq)):
    small = len(X_train_pos_seq[i])
    ave_pos += small
for j in range(len(X_test_pos_seq)):
    small = len(X_test_pos_seq[j])
    ave_pos += small
ave_pos = int(ave_pos / (len(X_train_pos_seq)+len(X_test_pos_seq)))
print(ave_pos)

4031


In [82]:
unique_pos = list(df['pos_sequence'].str.split(' ', expand=True).stack().unique())
print(unique_pos)

['VERB', 'PROPN', 'ADV', 'ADJ', 'NOUN', 'ADP', 'SCONJ', 'INTJ', 'AUX', 'PRON', 'X', 'NUM', 'DET', 'PART', 'PUNCT', 'SYM', 'CCONJ']


In [83]:
ave_ner = 0
for i in range(len(X_train_ner_seq)):
    small = len(X_train_ner_seq[i])
    ave_ner += small
for j in range(len(X_test_ner_seq)):
    small = len(X_test_ner_seq[j])
    ave_ner += small
ave_ner = int(ave_ner / (len(X_train_ner_seq)+len(X_test_ner_seq)))
print(ave_ner)

215


In [84]:
unique_ner = list(df['ner_sequence'].str.split(' ', expand=True).stack().unique())
print(unique_ner)

['PERSON', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'DATE', 'EVENT', 'NORP', 'TIME', 'CARDINAL', 'MONEY', 'ORDINAL', 'PERCENT', 'FAC', 'QUANTITY', 'LANGUAGE', 'LAW', 'WORK_OF_ART', '']


In [85]:
# Try word embeddings on the vector 
tokenizer3 = Tokenizer(num_words=len(unique_pos))
tokenizer3.fit_on_texts(X_train_pos_seq)

X_train_pos_seq = tokenizer3.texts_to_sequences(X_train_pos_seq)
X_test_pos_seq = tokenizer3.texts_to_sequences(X_test_pos_seq)

vocab_size3 = len(tokenizer3.word_index) + 1

maxlen3 = ave_pos # since the average length is about there. Too long and the predicions are bad. we assume the intro has the most info

X_train_pos_seq = pad_sequences(X_train_pos_seq, padding='post', maxlen=maxlen3)
X_test_pos_seq = pad_sequences(X_test_pos_seq, padding='post', maxlen=maxlen3)

In [86]:
# Try word embeddings on the vector 
tokenizer4 = Tokenizer(num_words=len(unique_ner))
tokenizer4.fit_on_texts(X_train_ner_seq)

X_train_ner_seq = tokenizer4.texts_to_sequences(X_train_ner_seq)
X_test_ner_seq = tokenizer4.texts_to_sequences(X_test_ner_seq)

vocab_size4 = len(tokenizer4.word_index) + 1

maxlen4 = ave_ner # since the average length is about there. Too long and the predicions are bad. we assume the intro has the most info

X_train_ner_seq = pad_sequences(X_train_ner_seq, padding='post', maxlen=maxlen4)
X_test_ner_seq = pad_sequences(X_test_ner_seq, padding='post', maxlen=maxlen4)

# Model

In [87]:
from keras.utils import plot_model
# define two sets of inputs
inputA = Input(shape=(maxlen2,))
inputB = Input(shape=(maxlen,))
inputTM = Input(shape=(15,))
inputNER = Input(shape=(maxlen4,))
inputPOS = Input(shape=(maxlen3,))
 
# the first branch operates on the first input which is the headline
embedding_layer_headline = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(inputA) 
#model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
#model.add(layers.Conv1D(128, 5, activation='relu'))
x = Conv1D(128, 5, activation='relu')(embedding_layer_headline)
# model.add(layers.GlobalMaxPooling1D())
x = GlobalMaxPooling1D()(x)
x = Dense(10, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(4, activation="relu")(x)
# model.add(layers.Dense(10, activation='relu'))
# model.add(layers.Dense(1, activation='sigmoid'))

# x = Dense(50, activation="relu")(inputA)
# x = Dense(4, activation="relu")(x)
x = Model(inputs=inputA, outputs=x)
 
# the second branch opreates on the second input
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(inputB)
y = LSTM(128)(embedding_layer)
y = Dropout(0.2)(y)
y = Dense(4, activation='relu')(y)
y = Model(inputs=inputB, outputs=y)

# third input
tm = Dense(64,activation='relu')(inputTM)
tm = Dense(4,activation='relu')(tm)
tm = Model(inputs=inputTM,outputs=tm)

# fourth input: NER
embedding_layer_ner = Embedding(vocab_size4, 20,trainable=True)(inputNER)
ner = Conv1D(128, 5, activation='relu')(embedding_layer_ner)
# model.add(layers.GlobalMaxPooling1D())
ner = GlobalMaxPooling1D()(ner)
ner = Dense(10, activation='relu')(ner)
ner = Dropout(0.2)(ner)
ner = Dense(4, activation="relu")(ner)
ner = Model(inputs=inputNER, outputs=ner)

# fifth input: POS
embedding_layer_pos = Embedding(vocab_size3, 20,trainable=True)(inputPOS)
pos = Conv1D(128, 5, activation='relu')(embedding_layer_pos)
# model.add(layers.GlobalMaxPooling1D())
pos = GlobalMaxPooling1D()(pos)
pos = Dense(10, activation='relu')(pos)
pos = Dropout(0.2)(pos)
pos = Dense(4, activation="relu")(pos)
pos = Model(inputs=inputPOS, outputs=pos)

# combine the output of the two branches
combined = concatenate([x.output, y.output, tm.output, ner.output, pos.output])
 
# apply a FC layer and then a regression prediction on the
# combined outputs
z = Dense(2, activation="relu")(combined)
z = Dense(1, activation="sigmoid")(z)
 
# our model will accept the inputs of the two branches and
# then output a single value
model = Model(inputs=[x.input, y.input, tm.input, ner.input, pos.input], outputs=z)
print(model.summary())
adam = optimizers.adam(lr=0.0001)
#model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['acc'])

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 100)          0                                            
__________________________________________________________________________________________________
input_4 (InputLayer)            (None, 215)          0                                            
__________________________________________________________________________________________________
input_5 (InputLayer)            (None, 4031)         0                                            
__________________________________________________________________________________________________
embedding_1 (Embeddin

In [88]:
print(X_train_transcripts.shape,maxlen,type(X_train_transcripts))
print(X_train_headline.shape,maxlen2,type(X_train_headline))
print(X_train_tm.shape,15,type(X_train_tm))
print(X_train_ner_seq.shape,maxlen4,type(X_train_ner_seq))
print(X_train_pos_seq.shape,maxlen3,type(X_train_pos_seq))

(1734, 500) 500 <class 'numpy.ndarray'>
(1734, 100) 100 <class 'numpy.ndarray'>
(1734, 15) 15 <class 'numpy.ndarray'>
(1734, 215) 215 <class 'numpy.ndarray'>
(1734, 4031) 4031 <class 'numpy.ndarray'>


In [89]:
plot_model(model, to_file='model_plot_cnn_tm_ner_pos.png', show_shapes=True, show_layer_names=True)

In [96]:

def f1(y_true, y_pred):
    def recall(y_true, y_pred):
        """Recall metric.

        Only computes a batch-wise average of recall.

        Computes the recall, a metric for multi-label classification of
        how many relevant items are selected.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

    def precision(y_true, y_pred):
        """Precision metric.

        Only computes a batch-wise average of precision.

        Computes the precision, a metric for multi-label classification of
        how many selected items are relevant.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision
    precision = precision(y_true, y_pred)
    recall = recall(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))

In [124]:
adam = optimizers.adam(lr=0.001)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
# model.compile(loss='binary_crossentropy', optimizer=adam, metrics=[f1])

In [98]:
def get_tag(threshold, predictions):
    return [[1 if j > threshold else 0 for j in i.tolist()] for i in predictions]

def get_tag_flat(threshold, predictions):
    return [1 if j > threshold else 0 for i in predictions for j in i]
# predictions_flushed = get_tag(0.4)


In [99]:
def compute_tp_tn_fp_fn(y_test, y_pred, classes):
    '''
    Return:
    pre_score = {
        'tag_1': {
            'index': ,
            'tp': ,
            'tn': ,
            'fp': ,
            'fn': 
        }
    }
    '''
    # Create dictionary of tags 
    pre_score = {}
    for index_tag, tag in enumerate(classes):
        pre_score[tag] = {
            'index':index_tag,
            'tp': 0,
            'tn': 0,
            'fp': 0,
            'fn': 0
        }
    for transcript_index, transcript_value in enumerate(y_test):
        if transcript_value == y_pred[transcript_index][0] and transcript_value == 1:
            pre_score[classes[0]]['tp'] += 1
        elif transcript_value == y_pred[transcript_index][0] and transcript_value == 0:
            pre_score[classes[0]]['tn'] += 1
        elif transcript_value != y_pred[transcript_index][0] and transcript_value == 1:
            pre_score[classes[0]]['fn'] += 1
        elif transcript_value != y_pred[transcript_index][0] and transcript_value == 0:
            pre_score[classes[0]]['fp'] += 1
    return pre_score
# scores_preprocess = compute_tp_tn_fp_fn(valid_y, predictions_flushed, ['culture'])

In [100]:
def compute_precision_recall_f1(preprocessed_scores):
    for key, value in preprocessed_scores.items():
        try:
            precision = value['tp']/(value['tp']+value['fp'])
        except:
#             print('precision issue: {}'.format(key))
            precision = 0.0
        try:
            recall = value['tp']/(value['tp']+value['fn'])
        except:
#             print('recall issue: {}'.format(key))
            recall = 0.0
        try:
            f1 = (2 * precision * recall)/(precision + recall)
        except:
#             print('f1 issue: {}'.format(key))
            f1=0.0
        preprocessed_scores[key]['precision'] = round(precision,2)
        preprocessed_scores[key]['recall'] = round(recall,2)
        preprocessed_scores[key]['f1'] = round(f1,2)
    return preprocessed_scores
# final_scores = compute_precision_recall_f1(scores_preprocess)
# print(final_scores)

In [101]:
def print_full_dataframe(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')

In [102]:
def format_scores_df(tag_classes, final_scores):
    precision = []
    recall = []
    f1 = []
    accuracy = []
    for index, value in enumerate(tag_classes):
        precision.append(final_scores[value]['precision'])
        recall.append(final_scores[value]['recall'])
        f1.append(final_scores[value]['f1'])
        accuracy.append((final_scores[value]['tp'] + final_scores[value]['tn'])/(final_scores[value]['tp'] + final_scores[value]['tn'] + final_scores[value]['fp'] + final_scores[value]['fn']))
    df_result = pd.DataFrame(list(zip(tag_classes, precision, recall, f1, accuracy)), 
               columns =['class', 'precision', 'recall', 'f1', 'accuracy']) 
    return df_result
# df_results = format_scores_df(['culture'], final_scores)
# print_full_dataframe(df_results)

In [127]:
def get_threshold(tag,valid_y, predictions):
    '''
    tag (string): Specific tag
    valid_y (List of tags): [0,1,1, ... ,1]
    predictions (list of lists)
    '''
    # We want to find the threshold that gives the highest recall and accuracy
    highest_f1 = 0
    f1_i = []
    highest_accuracy_f1 = 0
    accuracy_f1_i = []
    highest_accuracy = 0
    accuracy_i = []
    f1_metrics = [0, 0, 0, 0] # tp, tn, fp, fn
    accuracy_metrics = [0, 0, 0, 0]
    for i in range(0, 100):
        i = i/100
        predictions_flushed = get_tag(i,predictions)
        scores_preprocess = compute_tp_tn_fp_fn(valid_y, predictions_flushed, [tag])
        final_scores = compute_precision_recall_f1(scores_preprocess)
    #     print(final_scores)
        df_results = format_scores_df([tag], final_scores)
        #print(df_results)
        f1 = final_scores[tag]['f1']
        accuracy = df_results.accuracy[0]

        if f1 > highest_f1:
            highest_f1 = f1
            f1_i = [i]
            if accuracy > highest_accuracy_f1:
                highest_accuracy_f1 = accuracy
                accuracy_f1_i = [i]
                f1_metrics[0] = scores_preprocess[tag]['tp']
                f1_metrics[1] = scores_preprocess[tag]['tn']
                f1_metrics[2] = scores_preprocess[tag]['fp']
                f1_metrics[3] = scores_preprocess[tag]['fn']
            elif accuracy == highest_accuracy_f1:
                accuracy_f1_i.append(i)
        elif f1 == highest_f1:
            f1_i.append(i)
            if accuracy > highest_accuracy_f1:
                highest_accuracy_f1 = accuracy
                accuracy_f1_i = [i]
                
            elif accuracy == highest_accuracy_f1:
                accuracy_f1_i.append(i)

        if accuracy > highest_accuracy:
            highest_accuracy = accuracy
            accuracy_i = [i]
            accuracy_metrics[0] = scores_preprocess[tag]['tp']
            accuracy_metrics[1] = scores_preprocess[tag]['tn']
            accuracy_metrics[2] = scores_preprocess[tag]['fp']
            accuracy_metrics[3] = scores_preprocess[tag]['fn']
        elif accuracy == highest_accuracy:
            accuracy_i.append(i)

    #     print('\n')

#     print(highest_f1,f1_i)
#     print(highest_accuracy_f1,accuracy_f1_i)
#     print(highest_accuracy,accuracy_i)
    return highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i, f1_metrics, accuracy_metrics

In [104]:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
def evaluate_on_training_set(y_test, y_pred):
  # Calculate AUC
  print("AUC is: ", roc_auc_score(y_test, y_pred))
  # print out recall and precision
  print(classification_report(y_test, y_pred))
  # print out confusion matrix
  print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))
  # # calculate points for ROC curve
  fpr, tpr, thresholds = roc_curve(y_test, y_pred)
  # Plot ROC curve
  plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc_score(y_test, y_pred)) 
  plt.plot([0, 1], [0, 1], 'k--') # random predictions curve
  plt.xlim([0.0, 1.0])
  plt.ylim([0.0, 1.0])
  plt.xlabel('False Positive Rate or (1 - Specifity)')
  plt.ylabel('True Positive Rate or (Sensitivity)')
  plt.title('Receiver Operating Characteristic')

In [105]:
# evaluate_on_training_set(valid_y, get_tag_flat(0.35))
# #print(valid_y)

In [None]:
# tag_results = {}
# for i in range(len(all_tags)):
#     tag = all_tags[i]
#     print(tag)
#     train_y_tag = train_y[tag]
#     valid_y_tag = valid_y[tag]
#     class_weight = compute_class_weight(tag)
#     history = model.fit(X_train, train_y_tag, batch_size=32, epochs=4, verbose=1, validation_split=0.2, class_weight=class_weight)
#     predictions = model.predict(X_test)
#     model.save('{}_transcript_only.h5'.format(tag))
#     highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i, f1_metrics, accuracy_metrics = get_threshold(tag,valid_y_tag,predictions)
#     print(highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i, f1_metrics, accuracy_metrics)
#     tag_results[tag] = [highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i, f1_metrics, accuracy_metrics]

In [132]:
from keras import backend as K

def find_threshold(tag):
    train_y_tag = train_y[tag]
    valid_y_tag = valid_y[tag]
    
    zeroes = (train_y_tag==0).sum()
    ones = (train_y_tag==1).sum()

    if zeroes > ones:
        print('more zeroes')
        class_weight = {0: zeroes/zeroes,
                        1: ones/zeroes,
                       }
    else:
        print('more ones')
        class_weight = {0: zeroes/ones,
                        1: ones/ones,
                       }
    # print(class_weight)

    history = model.fit([X_train_headline, X_train_transcripts, X_train_tm, X_train_ner_seq, X_train_pos_seq], train_y_tag, batch_size=32, epochs=4, verbose=1, validation_split=0.2, class_weight = class_weight)
    predictions = model.predict([X_test_headline, X_test_transcripts, X_test_tm, X_test_ner_seq, X_train_pos_seq])
#     model.save('{}_transcript_headline_tm_pos_ner.h5'.format(tag))
    highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i,f1_metrics,accuracy_metrics = get_threshold(tag,valid_y_tag,predictions)
  
    return highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i,f1_metrics,accuracy_metrics

In [144]:
tag_results = {}
for i in range(10,15):
# for i in range(1):
    tag = all_tags[i]
    print(tag)
    highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i,f1_metrics,accuracy_metrics = find_threshold(tag)
    tag_results[tag] = [highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i,f1_metrics,accuracy_metrics]

entertainment
more zeroes
Train on 1387 samples, validate on 347 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
history
more zeroes
Train on 1387 samples, validate on 347 samples
Epoch 1/4


Epoch 2/4
Epoch 3/4
Epoch 4/4
future
more zeroes
Train on 1387 samples, validate on 347 samples
Epoch 1/4


Epoch 2/4
Epoch 3/4
Epoch 4/4
communication
more zeroes
Train on 1387 samples, validate on 347 samples
Epoch 1/4


Epoch 2/4
Epoch 3/4
Epoch 4/4
humanity
more zeroes
Train on 1387 samples, validate on 347 samples
Epoch 1/4


Epoch 2/4
Epoch 3/4
Epoch 4/4


In [143]:
for i in range(10,15):
    print(i)

10
11
12
13
14


In [145]:
results5 = pd.DataFrame.from_dict(tag_results, orient='index', columns=['highest_f1', 'thresholds_for_highest_f1', 'highest_accuracy_at_highest_f1', 'thresholds_for_highest_accuracy_f1','highest_accuracy','threshold_for_highest_accuracy_i', 'highest_f1_confusion_metrics', 'highest_accuracy_confusion_metrics'])
results5

Unnamed: 0,highest_f1,thresholds_for_highest_f1,highest_accuracy_at_highest_f1,thresholds_for_highest_accuracy_f1,highest_accuracy,threshold_for_highest_accuracy_i,highest_f1_confusion_metrics,highest_accuracy_confusion_metrics
entertainment,0.26,[0.01],0.309154,[0.01],0.861831,"[0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11...","[71, 108, 391, 9]","[0, 499, 0, 80]"
history,0.29,[0.03],0.172712,[0.03],0.834197,"[0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13...","[96, 4, 479, 0]","[0, 483, 0, 96]"
future,0.18,[0.01],0.214162,[0.01],0.906736,"[0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1...","[51, 73, 452, 3]","[0, 525, 0, 54]"
communication,0.17,[0.0],0.091537,[0.0],0.908463,"[0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.0...","[53, 0, 526, 0]","[0, 526, 0, 53]"
humanity,0.3,[0.01],0.692573,[0.01],0.875648,"[0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.0...","[38, 363, 144, 34]","[0, 507, 0, 72]"


In [138]:
# results10 = pd.DataFrame.from_dict(tag_results, orient='index', columns=['highest_f1', 'thresholds_for_highest_f1', 'highest_accuracy_at_highest_f1', 'thresholds_for_highest_accuracy_f1','highest_accuracy','threshold_for_highest_accuracy_i', 'highest_f1_confusion_metrics', 'highest_accuracy_confusion_metrics'])
results10

Unnamed: 0,highest_f1,thresholds_for_highest_f1,highest_accuracy_at_highest_f1,thresholds_for_highest_accuracy_f1,highest_accuracy,threshold_for_highest_accuracy_i,highest_f1_confusion_metrics,highest_accuracy_confusion_metrics
culture,0.68,"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...",0.533679,[0.41],0.576857,[0.48],"[300, 0, 279, 0]","[154, 180, 99, 146]"
politics,0.15,[0.01],0.504318,[0.01],0.939551,"[0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12...","[25, 267, 277, 10]","[0, 544, 0, 35]"
science,0.8,"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...",0.663212,"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...",0.663212,"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...","[384, 0, 195, 0]","[384, 0, 195, 0]"
global issues,0.47,"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...",0.360967,[0.13],0.696028,"[0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5...","[176, 0, 403, 0]","[0, 403, 0, 176]"
technology,0.49,"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...",0.328152,[0.19],0.677029,[0.35],"[188, 0, 391, 0]","[1, 391, 0, 187]"
design,0.36,[0.06],0.390328,[0.06],0.792746,"[0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.2...","[98, 128, 331, 22]","[0, 459, 0, 120]"
business,0.28,[0.02],0.174439,[0.02],0.841105,"[0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2...","[91, 10, 477, 1]","[0, 487, 0, 92]"
biomechanics,0.18,[0.0],0.098446,[0.0],0.901554,"[0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15...","[57, 0, 522, 0]","[0, 522, 0, 57]"
biodiversity,0.18,"[0.0, 0.01]",0.452504,[0.01],0.903282,"[0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13...","[56, 0, 523, 0]","[0, 523, 0, 56]"
media,0.21,"[0.0, 0.01]",0.120898,[0.01],0.882556,"[0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13...","[68, 0, 511, 0]","[0, 511, 0, 68]"


In [146]:
df_results = pd.concat([results10,results5])
df_results

Unnamed: 0,highest_f1,thresholds_for_highest_f1,highest_accuracy_at_highest_f1,thresholds_for_highest_accuracy_f1,highest_accuracy,threshold_for_highest_accuracy_i,highest_f1_confusion_metrics,highest_accuracy_confusion_metrics
culture,0.68,"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...",0.533679,[0.41],0.576857,[0.48],"[300, 0, 279, 0]","[154, 180, 99, 146]"
politics,0.15,[0.01],0.504318,[0.01],0.939551,"[0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12...","[25, 267, 277, 10]","[0, 544, 0, 35]"
science,0.8,"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...",0.663212,"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...",0.663212,"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...","[384, 0, 195, 0]","[384, 0, 195, 0]"
global issues,0.47,"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...",0.360967,[0.13],0.696028,"[0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5...","[176, 0, 403, 0]","[0, 403, 0, 176]"
technology,0.49,"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...",0.328152,[0.19],0.677029,[0.35],"[188, 0, 391, 0]","[1, 391, 0, 187]"
design,0.36,[0.06],0.390328,[0.06],0.792746,"[0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.2...","[98, 128, 331, 22]","[0, 459, 0, 120]"
business,0.28,[0.02],0.174439,[0.02],0.841105,"[0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2...","[91, 10, 477, 1]","[0, 487, 0, 92]"
biomechanics,0.18,[0.0],0.098446,[0.0],0.901554,"[0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15...","[57, 0, 522, 0]","[0, 522, 0, 57]"
biodiversity,0.18,"[0.0, 0.01]",0.452504,[0.01],0.903282,"[0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13...","[56, 0, 523, 0]","[0, 523, 0, 56]"
media,0.21,"[0.0, 0.01]",0.120898,[0.01],0.882556,"[0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13...","[68, 0, 511, 0]","[0, 511, 0, 68]"


In [147]:
df_results.to_csv(DATA_DIR+'multi_headline_transcript_tm_ner_pos_results.csv')

In [133]:
politics_results = {}
tag = 'politics'
print(tag)
highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i,f1_metrics,accuracy_metrics = find_threshold(tag)
politics_results[tag] = [highest_f1,f1_i,highest_accuracy_f1,accuracy_f1_i,highest_accuracy,accuracy_i,f1_metrics,accuracy_metrics]

politics
more zeroes
Train on 1387 samples, validate on 347 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


In [134]:
politics = pd.DataFrame.from_dict(politics_results)
politics

Unnamed: 0,politics
0,0.11
1,[0.0]
2,0.0604491
3,[0.0]
4,0.939551
5,"[0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11..."
6,"[35, 0, 544, 0]"
7,"[0, 544, 0, 35]"


In [131]:
results = results.rename(index = {0:'f1',1:'f1_index',2:'accuracy_f1',3:'accuracy_f1_index',4:'accuracy',5:'accuracy_index'})
results

Unnamed: 0,culture,politics,science,global issues,technology,design,business,biomechanics,biodiversity,media
f1,0.68,0.15,0.8,0.47,0.49,0.36,0.28,0.18,0.18,0.21
f1_index,"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...",[0.01],"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...","[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...","[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...",[0.06],[0.02],[0.0],"[0.0, 0.01]","[0.0, 0.01]"
accuracy_f1,0.533679,0.504318,0.663212,0.360967,0.328152,0.390328,0.174439,0.0984456,0.452504,0.120898
accuracy_f1_index,[0.41],[0.01],"[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...",[0.13],[0.19],[0.06],[0.02],[0.0],[0.01],[0.01]
accuracy,0.576857,0.939551,0.663212,0.696028,0.677029,0.792746,0.841105,0.901554,0.903282,0.882556
accuracy_index,[0.48],"[0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12...","[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07...","[0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5...",[0.35],"[0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.2...","[0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2...","[0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15...","[0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13...","[0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13..."
6,"[300, 0, 279, 0]","[25, 267, 277, 10]","[384, 0, 195, 0]","[176, 0, 403, 0]","[188, 0, 391, 0]","[98, 128, 331, 22]","[91, 10, 477, 1]","[57, 0, 522, 0]","[56, 0, 523, 0]","[68, 0, 511, 0]"
7,"[154, 180, 99, 146]","[0, 544, 0, 35]","[384, 0, 195, 0]","[0, 403, 0, 176]","[1, 391, 0, 187]","[0, 459, 0, 120]","[0, 487, 0, 92]","[0, 522, 0, 57]","[0, 523, 0, 56]","[0, 511, 0, 68]"


In [None]:
results.dtypes

In [139]:
results10.to_csv(DATA_DIR+'multi_headline_transcript_tm_ner_pos_results10.csv')

In [None]:
# results_csv = pd.read_csv(DATA_DIR+'multi_headline_transcript_tm_ner_pos_results.csv',index_col = 0)
# results_csv

In [None]:
print_full_dataframe(squashed_tag_counts)


In [None]:
print(results['culture']['f1_index'])
print(results['culture']['accuracy_f1_index'])

In [None]:
find_threshold('culture')

In [119]:
tag = 'politics'

train_y_tag = train_y[tag]
valid_y_tag = valid_y[tag]

zeroes = (train_y_tag==0).sum()
ones = (train_y_tag==1).sum()

if zeroes > ones:
    print('more zeroes')
    class_weight = {0: zeroes/zeroes,
                    1: ones/zeroes,
                   }
else:
    print('more ones')
    class_weight = {0: zeroes/ones,
                    1: ones/ones,
                   }
# print(class_weight)

more zeroes
{0: 1.0, 1: 0.9726962457337884}


In [113]:
train_y_tag.count()

1734

In [114]:
(train_y_tag==0).sum()

879

In [115]:
(train_y_tag==1).sum()

855

In [117]:
855/879

0.9726962457337884